arXiv Daily Digest

164

Papers

LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning

LatentAM：通过在线字典学习实现实时大规模潜在高斯注意力映射

Lee, Junwoon, Tian, Yulun

Abstract

We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings using model-specific decoders, LatentAM proposes an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism with a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity compared to state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: https://junwoonlee.github.io/projects/LatentAM

Chinese Translation

我们提出了LatentAM，一个在线3D高斯点云映射（3DGS）框架，该框架能够从流式RGB-D观测中构建可扩展的潜在特征图，用于开放词汇的机器人感知。与其使用特定模型的解码器提取高维视觉-语言模型（VLM）嵌入不同，LatentAM提出了一种在线字典学习方法，该方法既不依赖于模型，又无需预训练，使其能够在测试时与不同的VLM进行即插即用的集成。具体而言，我们的方法将每个高斯原语与一个紧凑的查询向量关联，该向量可以通过具有可学习字典的注意力机制转换为近似的VLM嵌入。字典从流式观测中高效初始化，并在线优化，以适应在信任区域正则化下不断变化的场景语义。为了扩展到长轨迹和大环境，我们进一步提出了一种基于体素哈希的高效地图管理策略，其中优化限制在GPU上的一个活动局部地图上，而全局地图则存储和索引在CPU上，以保持有限的GPU内存使用。对公共基准和大规模自定义数据集的实验表明，LatentAM在特征重建保真度方面显著优于最先进的方法，同时在评估的数据集上实现了接近实时的速度（12-35 FPS）。我们的项目页面地址为：https://junwoonlee.github.io/projects/LatentAM
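
The abstract describes turning a compact per-Gaussian query vector into an approximate VLM embedding through attention over a learnable dictionary. A minimal sketch of that decoding step follows, assuming standard scaled dot-product attention; the function names, the key/value split of the dictionary, and the toy dimensions are illustrative assumptions, not LatentAM's actual formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_embedding(query, keys, values):
    """Approximate a high-dimensional embedding from a compact query.

    keys:   dictionary keys, one per atom, same length as the query
    values: dictionary values, one per atom, in the (larger) embedding space
    (hypothetical layout; the abstract does not specify the dictionary structure)
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the attention-weighted mixture of dictionary values.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy dictionary: 3 atoms, 2-D queries, 4-D "embeddings".
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
query = [4.0, 0.0]
embedding = decode_embedding(query, keys, values)
```

Under this reading, only the short query is stored per Gaussian, while the dictionary is shared across the map, which is one way a map could stay compact while still supporting different embedding models.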

cs.RO / 2 / 2602.12322

ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

ForeAct：通过高效的视觉前瞻规划引导您的VLA

Zhang, Zhuoyang, Yang, Shang, Hu, Qinghao, Huang, Luke J., Hou, James, Sun, Yufei, Lu, Yao, Han, Song

Abstract

Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality 640$\times$480 future observation from the current visual input and language instruction within only 0.33s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on over 1 million multi-task, cross-embodiment episodes, enabling it to learn robust embodied dynamics. We evaluate our framework on a benchmark that consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the $\pi_0$ baseline (46.5%) and a +30.3% absolute improvement over $\pi_0$ augmented with textual subtask guidance (57.1%).

Chinese Translation

视觉-语言-行动（VLA）模型将高层语言指令转换为具体的可执行动作，这一任务在开放世界环境中尤为具有挑战性。我们提出了视觉前瞻规划（ForeAct），这是一种通用且高效的规划器，能够通过想象的未来观察和子任务描述逐步引导VLA。借助想象的未来观察，VLA可以专注于视觉-运动推理，而非高层语义推理，从而提高准确性和泛化能力。我们的规划器包括一个高效的前瞻图像生成模块，该模块能够在仅0.33秒内从当前视觉输入和语言指令预测出高质量的640×480未来观察，运行于H100 GPU上，同时配备一个视觉-语言模型，该模型对任务进行推理并为生成器和VLA生成子任务描述。重要的是，最先进的VLA可以通过简单地增强其视觉输入而无缝集成我们的规划器，而无需任何架构修改。前瞻生成器在超过100万个多任务、跨体现的情节上进行了预训练，使其能够学习到稳健的体现动态。我们在一个包含11个多样化、多步骤真实世界任务的基准上评估了我们的框架。它实现了87.4%的平均成功率，相较于$\pi_0$基线（46.5%）有+40.9%的绝对提升，相较于带有文本子任务指导的$\pi_0$（57.1%）有+30.3%的绝对提升。
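
The improvements quoted in the abstract are absolute differences in success rate (percentage points), not relative gains. A short check of that arithmetic, using only the three success rates stated above (the variable names are illustrative):

```python
# Success rates (%) as reported in the abstract.
foreact = 87.4
baselines = {
    "pi0": 46.5,
    "pi0 + textual subtask guidance": 57.1,
}

# Absolute improvement in percentage points.
absolute_gain = {name: round(foreact - rate, 1) for name, rate in baselines.items()}

# Relative improvement over each baseline, in percent, for comparison.
relative_gain = {
    name: round(100.0 * (foreact - rate) / rate, 1)
    for name, rate in baselines.items()
}
```

The absolute gains reproduce the +40.9 and +30.3 figures; the relative gains are considerably larger, which is why the abstract's "absolute" qualifier matters.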

cs.RO / 3 / 2602.12346

Schur-MI: Fast Mutual Information for Robotic Information Gathering

Schur-MI：用于机器人信息收集的快速互信息

Jakkala, Kalvik, O'Kane, Jason, Akella, Srinivas

Abstract

Mutual information (MI) is a principled and widely used objective for robotic information gathering (RIG), providing strong theoretical guarantees for sensor placement (SP) and informative path planning (IPP). However, its high computational cost, dominated by repeated log-determinant evaluations, has limited its use in real-time planning. This letter presents Schur-MI, a Gaussian process (GP) MI formulation that (i) leverages the iterative structure of RIG to precompute and reuse expensive intermediate quantities across planning steps, and (ii) uses a Schur-complement factorization to avoid large determinant computations. Together, these methods reduce the per-evaluation cost of MI from $\mathcal{O}(|\mathcal{V}|^3)$ to $\mathcal{O}(|\mathcal{A}|^3)$, where $\mathcal{V}$ and $\mathcal{A}$ denote the candidate and selected sensing locations, respectively. Experiments on real-world bathymetry datasets show that Schur-MI achieves up to a $12.7\times$ speedup over the standard MI formulation. Field trials with an autonomous surface vehicle (ASV) performing adaptive IPP further validate its practicality. By making MI computation tractable for online planning, Schur-MI helps bridge the gap between information-theoretic objectives and real-time robotic exploration.

Chinese Translation

互信息（MI）是机器人信息收集（RIG）中一种原则性和广泛使用的目标，为传感器布置（SP）和信息路径规划（IPP）提供了强有力的理论保证。然而，其高计算成本，主要由重复的对数行列式评估主导，限制了其在实时规划中的应用。本文提出了Schur-MI，这是一种高斯过程（GP）互信息的公式，它（i）利用RIG的迭代结构，在规划步骤中预计算并重用昂贵的中间量，以及（ii）使用Schur补因式分解来避免大规模行列式计算。这些方法将互信息的每次评估成本从$\mathcal{O}(|\mathcal{V}|^3)$降低到$\mathcal{O}(|\mathcal{A}|^3)$，其中$\mathcal{V}$和$\mathcal{A}$分别表示候选和选定的传感位置。在实际的水深数据集上的实验表明，Schur-MI相较于标准的互信息公式实现了高达$12.7\times$的加速。与执行自适应IPP的自主水面车辆（ASV）的现场试验进一步验证了其实用性。通过使互信息计算适用于在线规划，Schur-MI帮助弥合了信息理论目标与实时机器人探索之间的差距。
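
Schur-complement factorizations rest on the block determinant identity $\log\det\begin{bmatrix} A & B \\ B^\top & D \end{bmatrix} = \log\det A + \log\det\,(D - B^\top A^{-1} B)$, which lets a large log-determinant be split into smaller pieces. The sketch below only verifies that identity on a small kernel matrix; the RBF kernel, the jitter value, the point set, and all function names are assumptions for illustration, and the abstract does not say how Schur-MI partitions or caches these blocks.

```python
import math


def cholesky(m):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def logdet(m):
    """log det of an SPD matrix from the diagonal of its Cholesky factor."""
    low = cholesky(m)
    return 2.0 * sum(math.log(low[i][i]) for i in range(len(m)))


def solve_spd(m, rhs):
    """Solve m x = rhs by forward and back substitution on the Cholesky factor."""
    low = cholesky(m)
    n = len(m)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def schur_logdet(a, b, d):
    """log det [[a, b], [b^T, d]] = log det(a) + log det(d - b^T a^{-1} b)."""
    p, q = len(a), len(d)
    # Column c of a^{-1} b, one linear solve per column of b.
    cols = [solve_spd(a, [b[i][c] for i in range(p)]) for c in range(q)]
    schur = [[d[r][c] - sum(b[i][r] * cols[c][i] for i in range(p))
              for c in range(q)] for r in range(q)]
    return logdet(a) + logdet(schur)


def rbf_kernel(xs, ys, length=1.0):
    """Squared-exponential kernel between two lists of 1-D points."""
    return [[math.exp(-0.5 * ((x - y) / length) ** 2) for y in ys] for x in xs]


points = [0.0, 0.7, 1.5, 2.4, 3.1]      # hypothetical sensing locations
full = rbf_kernel(points, points)
for i in range(len(points)):
    full[i][i] += 1e-2                   # noise/jitter term keeps the matrix SPD

a = [row[:3] for row in full[:3]]        # block that could be computed once and reused
b = [row[3:] for row in full[:3]]
d = [row[3:] for row in full[3:]]

direct = logdet(full)
blockwise = schur_logdet(a, b, d)
```

Both routes give the same value; the practical appeal is that the term involving the large block can be cached across planning steps, leaving only the small Schur-complement determinant to recompute.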

cs.RO / 4 / 2602.12351

LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation

LongNav-R1：面向长时间范围视觉-语言-动作导航的地平线自适应多轮强化学习

Hu, Yue, Xi, Avery, Xiao, Qixin, Isaacson, Seth, Liu, Henry X., Vasudevan, Ram, Ghaffari, Maani

Abstract

This paper develops LongNav-R1, an end-to-end multi-turn reinforcement learning (RL) framework designed to optimize Visual-Language-Action (VLA) models for long-horizon navigation. Unlike the existing single-turn paradigm, LongNav-R1 reformulates the navigation decision process as a continuous multi-turn conversation between the VLA policy and the embodied environment. This multi-turn RL framework offers two distinct advantages: i) it enables the agent to reason about the causal effects of historical interactions and sequential future outcomes; and ii) it allows the model to learn directly from online interactions, fostering diverse trajectory generation and avoiding the behavioral rigidity often imposed by human demonstrations. Furthermore, we introduce Horizon-Adaptive Policy Optimization. This mechanism explicitly accounts for varying horizon lengths during advantage estimation, facilitating accurate temporal credit assignment over extended sequences. Consequently, the agent develops diverse navigation behaviors and resists collapse during long-horizon tasks. Experiments on object navigation benchmarks validate the framework's efficacy: with 4,000 rollout trajectories, LongNav-R1 boosts the Qwen3-VL-2B success rate from 64.3% to 73.0%. These results demonstrate superior sample efficiency, and LongNav-R1 significantly outperforms state-of-the-art methods. The model's generalizability and robustness are further validated by its zero-shot performance in long-horizon real-world navigation settings. All source code will be open-sourced upon publication.

Chinese Translation

本文开发了LongNav-R1，一个端到端的多轮强化学习（RL）框架，旨在优化长时间范围导航的视觉-语言-动作（VLA）模型。与现有的单轮范式不同，LongNav-R1将导航决策过程重新表述为VLA策略与具身环境之间的连续多轮对话。该多轮强化学习框架提供了两个显著优势：i）它使代理能够推理历史交互的因果效应和未来结果的序列；ii）它允许模型直接从在线交互中学习，促进多样化轨迹生成，避免人类示范常常施加的行为僵化。此外，我们引入了地平线自适应策略优化（Horizon-Adaptive Policy Optimization）。该机制在优势估计过程中明确考虑了不同的地平线长度，从而促进了对扩展序列的准确时间信用分配。因此，代理能够发展出多样化的导航行为，并在长时间范围任务中抵御崩溃。在物体导航基准上的实验验证了该框架的有效性：在4,000个回合轨迹下，LongNav-R1将Qwen3-VL-2B的成功率从64.3%提升至73.0%。这些结果展示了优越的样本效率，并显著超越了最先进的方法。模型的泛化能力和鲁棒性在长时间范围的真实世界导航环境中的零-shot表现得到了进一步验证。所有源代码将在发表后开源。

cs.RO / 5 / 2602.12360

Predicting Dynamic Map States from Limited Field-of-View Sensor Data

从有限视场传感器数据中预测动态地图状态

Peterson, Knut, Han, David

Abstract

When autonomous systems are deployed in real-world scenarios, sensors are often subject to limited field-of-view (FOV) constraints, either naturally through system design, or through unexpected occlusions or sensor failures. In conditions where a large FOV is unavailable, it is important to be able to infer information about the environment and predict the state of nearby surroundings based on available data to maintain safe and accurate operation. In this work, we explore the effectiveness of deep learning for dynamic map state prediction based on limited FOV time series data. We show that by representing dynamic sensor data in a simple single-image format that captures both spatial and temporal information, we can effectively use a wide variety of existing image-to-image learning models to predict map states with high accuracy in a diverse set of sensing scenarios.

Chinese Translation

当自主系统在现实场景中部署时，传感器常常受到有限视场（FOV）约束的影响，这可能是由于系统设计的自然限制，或是由于意外的遮挡或传感器故障。在无法获得大视场的情况下，能够根据可用数据推断环境信息并预测附近环境的状态，对于维持安全和准确的操作至关重要。在本研究中，我们探讨了深度学习在基于有限视场时间序列数据的动态地图状态预测中的有效性。我们展示了通过将动态传感器数据以简单的单图像格式表示，该格式同时捕捉空间和时间信息，我们可以有效地利用多种现有的图像到图像学习模型，在多样的传感场景中以高准确度预测地图状态。

cs.RO / 6 / 2602.12385

Zero-Shot Adaptation to Robot Structural Damage via Natural Language-Informed Kinodynamics Modeling

通过自然语言驱动的运动动力学建模实现机器人结构损伤的零样本适应

Pokhrel, Anuj, Datar, Aniket, Nazeri, Mohammad, Cancelliere, Francesco, Xiao, Xuesu

Abstract

High-performance autonomous mobile robots endure significant mechanical stress during in-the-wild operations, e.g., driving at high speeds or over rugged terrain. Although these platforms are engineered to withstand such conditions, mechanical degradation is inevitable. Structural damage manifests as consistent and notable changes in kinodynamic behavior compared to a healthy vehicle. Given the heterogeneous nature of structural failures, quantifying various damages to inform kinodynamics is challenging. We posit that natural language can describe and thus capture this variety of damages. Therefore, we propose Zero-shot Language Informed Kinodynamics (ZLIK), which employs self-supervised learning to ground semantic information of damage descriptions in kinodynamic behaviors to learn a forward kinodynamics model in a data-driven manner. Using the high-fidelity soft-body physics simulator BeamNG.tech, we collect data from a variety of structurally compromised vehicles. Our learned model achieves zero-shot adaptation to different damages with up to 81% reduction in kinodynamics error and generalizes across the sim-to-real and full-to-1/10$^{\text{th}}$ scale gaps.

Chinese Translation

高性能自主移动机器人在实际操作中承受显著的机械压力，例如高速行驶或在崎岖地形上行驶。尽管这些平台经过设计以承受此类条件，但机械退化是不可避免的。结构损伤表现为与健康车辆相比，运动动力学行为的持续且显著变化。鉴于结构故障的异质性，量化各种损伤以指导运动动力学是具有挑战性的。我们认为，自然语言可以描述并捕捉这种多样的损伤。因此，我们提出了零样本语言驱动的运动动力学（Zero-shot Language Informed Kinodynamics, ZLIK），该方法采用自监督学习将损伤描述的语义信息与运动动力学行为相结合，以数据驱动的方式学习前向运动动力学模型。利用高保真软体物理模拟器BeamNG.tech，我们从多种结构受损的车辆中收集数据。我们学习的模型在不同损伤下实现了零样本适应，运动动力学误差减少了多达81%，并在模拟到真实和全尺度到1/10$^{\text{th}}$尺度之间具有良好的泛化能力。

cs.RO / 7 / 2602.12405

Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning

自我精炼视觉语言模型用于机器人故障检测与推理

Qi, Carl, Wang, Xiaojie, Yong, Silong, Sheng, Stephen, Mao, Huitan, Srinivasan, Sriram, Nambi, Manikantan, Zhang, Amy, Dattatreya, Yesh

Abstract

Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30% on failure detection rate and up to 100% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes. We provide additional visualizations on our website: https://sites.google.com/utexas.edu/armor

Chinese Translation

对故障进行推理对于构建可靠和可信的机器人系统至关重要。以往的方法要么将故障推理视为一个封闭集分类问题，要么假设可以获得大量的人类标注。现实世界中的故障通常是微妙的、组合性的，且难以枚举，而丰富的推理标签获取成本高昂。我们通过引入ARMOR（自适应轮次多任务模型）来解决这一问题，该模型用于机器人故障检测与推理。我们将检测和推理形式化为一个多任务自我精炼过程，其中模型迭代地预测检测结果和基于过去输出的自然语言推理。在训练过程中，ARMOR从异构监督中学习——大规模稀疏二元标签和小规模丰富推理标注，通过离线和在线模仿学习的结合进行优化。在推理阶段，ARMOR生成多个精炼轨迹，并通过自我确定性指标选择最可信的预测。跨多种环境的实验表明，ARMOR在故障检测率上比以往方法提高了多达30%，在通过LLM模糊匹配分数测量的推理上提高了多达100%，展示了对异构监督和超越预定义故障模式的开放式推理的鲁棒性。我们在网站上提供了更多可视化内容：https://sites.google.com/utexas.edu/armor

cs.RO / 8 / 2602.12407

MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery

MiDAS：一种用于机器人辅助微创手术的多模态数据采集系统及数据集

Weerasinghe, Keshara, Roodabeh, Seyed Hamid Reza, Hawkins, Andrew, Zhang, Zhaomeng, Schrader, Zachary, Alemzadeh, Homa

Abstract

Background: Robot-assisted minimally invasive surgery (RMIS) research increasingly relies on multimodal data, yet access to proprietary robot telemetry remains a major barrier. We introduce MiDAS, an open-source, platform-agnostic system enabling time-synchronized, non-invasive multimodal data acquisition across surgical robotic platforms. Methods: MiDAS integrates electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video capture without requiring proprietary robot interfaces. We validated MiDAS on the open-source Raven-II and the clinical da Vinci Xi by collecting multimodal datasets of peg transfer and hernia repair suturing tasks performed by surgical residents. Correlation analysis and downstream gesture recognition experiments were conducted. Results: External hand and foot sensing closely approximated internal robot kinematics, and non-invasive motion signals achieved gesture recognition performance comparable to proprietary telemetry. Conclusion: MiDAS enables reproducible multimodal RMIS data collection and is released with annotated datasets, including the first multimodal dataset capturing hernia repair suturing on high-fidelity simulation models.

Chinese Translation

背景：机器人辅助微创手术（RMIS）研究越来越依赖于多模态数据，但获取专有机器人遥测数据仍然是一个主要障碍。我们介绍了MiDAS，一个开源、平台无关的系统，能够在外科机器人平台上实现时间同步的非侵入式多模态数据采集。方法：MiDAS集成了电磁和RGB-D手部追踪、脚踏板传感和外科视频捕捉，无需专有机器人接口。我们在开源的Raven-II和临床的da Vinci Xi上验证了MiDAS，通过收集外科住院医生执行的钉子转移和疝气修补缝合任务的多模态数据集。进行了相关性分析和下游手势识别实验。结果：外部手部和脚部传感器的测量结果与内部机器人运动学高度相似，非侵入式运动信号的手势识别性能与专有遥测相当。结论：MiDAS实现了可重复的多模态RMIS数据采集，并发布了带注释的数据集，包括第一个捕捉高保真模拟模型上疝气修补缝合的多模态数据集。

cs.RO / 9 / 2602.12416

Control Barrier Functions with Audio Risk Awareness for Robot Safe Navigation on Construction Sites

具有音频风险感知的控制屏障函数用于建筑工地机器人安全导航

Mootz, Johannes, Akhavian, Reza

Abstract

Construction automation increasingly requires autonomous mobile robots, yet robust autonomy remains challenging on construction sites. These environments are dynamic and often visually occluded, which complicates perception and navigation. In this context, valuable information from audio sources remains underutilized in most autonomy stacks. This work presents a control barrier function (CBF)-based safety filter that provides safety guarantees for obstacle avoidance while adapting safety margins during navigation using an audio-derived risk cue. The proposed framework augments the CBF with a lightweight, real-time jackhammer detector based on signal envelope and periodicity. Its output serves as an exogenous risk that is directly enforced in the controller by modulating the barrier function. The approach is evaluated in simulation with two CBF formulations (circular and goal-aligned elliptical) with a unicycle robot navigating a cluttered construction environment. Results show that the CBF safety filter eliminates safety violations across all trials while reaching the target in 40.2% (circular) vs. 76.5% (elliptical), as the elliptical formulation better avoids deadlock. This integration of audio perception into a CBF-based controller demonstrates a pathway toward richer multimodal safety reasoning in autonomous robots for safety-critical and dynamic environments.

Chinese Translation

建筑自动化日益需要自主移动机器人，但在建筑工地上实现稳健的自主性仍然具有挑战性。这些环境动态变化且常常存在视觉遮挡，这使感知和导航变得更加复杂。在这种背景下，音频来源提供的有价值信息在大多数自主系统中仍未得到充分利用。本研究提出了一种基于控制屏障函数（CBF）的安全过滤器，该过滤器在导航过程中利用音频衍生的风险线索调整安全边际，从而提供障碍物避让的安全保证。所提出的框架通过基于信号包络和周期性的轻量级实时电锤检测器增强了CBF。其输出作为外部风险直接在控制器中通过调节屏障函数进行强制执行。该方法在模拟中进行了评估，使用两种CBF公式（圆形和目标对齐椭圆形），使单轮机器人在杂乱的建筑环境中导航。结果表明，CBF安全过滤器在所有试验中消除了安全违规，同时在40.2%（圆形）与76.5%（椭圆形）的情况下达到目标，因为椭圆形公式更好地避免了死锁。这种将音频感知集成到基于CBF的控制器中的方法展示了在安全关键和动态环境中实现更丰富的多模态安全推理的途径。
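
The idea of modulating a barrier function with an exogenous risk signal can be sketched with a circular CBF whose safety margin grows with the risk value. The example below assumes single-integrator dynamics (velocity is commanded directly), which is a simplification of the unicycle model in the abstract; the linear margin inflation, the gains, and all names are hypothetical choices for illustration, not the paper's controller.

```python
def cbf_filter(pos, u_nom, obstacle, radius, risk, alpha=1.0, risk_gain=0.5):
    """Minimally modify u_nom so that the CBF condition holds.

    Assumes single-integrator dynamics x' = u (simplification of a unicycle).
    risk is an exogenous value in [0, 1], e.g. from an audio detector.
    """
    margin = radius + risk_gain * risk            # risk inflates the safety margin
    dx, dy = pos[0] - obstacle[0], pos[1] - obstacle[1]
    h = dx * dx + dy * dy - margin * margin       # h >= 0 means safe
    grad = (2.0 * dx, 2.0 * dy)                   # gradient of h w.r.t. position

    # CBF condition for x' = u:  grad . u + alpha * h >= 0
    slack = grad[0] * u_nom[0] + grad[1] * u_nom[1] + alpha * h
    if slack >= 0.0:
        return u_nom                              # nominal command is already safe

    # Closed-form solution of the one-constraint QP: project u_nom onto the
    # half-space defined by the CBF condition (assumes pos != obstacle).
    norm_sq = grad[0] ** 2 + grad[1] ** 2
    scale = -slack / norm_sq
    return (u_nom[0] + scale * grad[0], u_nom[1] + scale * grad[1])


pos, obstacle, u_nom = (-2.0, 0.0), (0.0, 0.0), (1.0, 0.0)
u_quiet = cbf_filter(pos, u_nom, obstacle, radius=1.0, risk=0.0)
u_loud = cbf_filter(pos, u_nom, obstacle, radius=1.0, risk=1.0)
```

With the same nominal command, the filtered approach speed drops further when the risk signal is high, because the inflated margin makes the barrier constraint bind earlier.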

cs.RO / 10 / 2602.12421

An Autonomous, End-to-End, Convex-Based Framework for Close-Range Rendezvous Trajectory Design and Guidance with Hardware Testbed Validation

一种自主的端到端凸优化框架用于近距离会合轨迹设计与引导，并进行硬件测试验证

Wijayatunga, Minduli C., Guinane, Julian, Wallace, Nathan D., Wu, Xiaofeng

Abstract

Autonomous satellite servicing missions must execute close-range rendezvous under stringent safety and operational constraints while remaining computationally tractable for onboard use and robust to uncertainty in sensing, actuation, and dynamics. This paper presents CORTEX (Convex Optimization for Rendezvous Trajectory Execution), an autonomous, perception-enabled, real-time trajectory design and guidance framework for close-range rendezvous. CORTEX integrates a deep-learning perception pipeline with convex-optimisation-based trajectory design and guidance, including reference regeneration and abort-to-safe-orbit logic to recover from large deviations caused by sensor faults and engine failures. CORTEX is validated in high-fidelity software simulation and hardware-in-the-loop experiments. The software pipeline (Basilisk) models high-fidelity relative dynamics, realistic thruster execution, perception, and attitude control. Hardware testing uses (i) an optical navigation testbed to assess perception-to-estimation performance and (ii) a planar air-bearing testbed to evaluate the end-to-end guidance loop under representative actuation and subsystem effects. A Monte-Carlo campaign in simulation includes initial-state uncertainty, thrust-magnitude errors, and missed-thrust events; under the strongest case investigated, CORTEX achieves terminal docking errors of $36.85 \pm 44.46$ mm in relative position and $1.25 \pm 2.26$ mm/s in relative velocity. On the planar air-bearing testbed, 18 cases are executed (10 nominal; 8 off-nominal requiring recomputation and/or abort due to simulated engine failure and sensor malfunctions), yielding terminal errors of $8.09 \pm 5.29$ mm in position and $2.23 \pm 1.72$ mm/s in velocity.

Chinese Translation

自主卫星服务任务必须在严格的安全和操作约束下执行近距离会合，同时保持计算上的可行性，以便在机载使用中具备鲁棒性，能够应对传感、驱动和动力学的不确定性。本文提出了CORTEX（会合轨迹执行的凸优化），这是一个自主的、具备感知能力的实时轨迹设计与引导框架，专为近距离会合而设计。CORTEX将深度学习感知管道与基于凸优化的轨迹设计和引导相结合，包括参考再生和安全轨道中止逻辑，以应对由传感器故障和发动机故障引起的大偏差。CORTEX在高保真软件仿真和硬件在环实验中进行了验证。软件管道（Basilisk）模拟了高保真的相对动力学、现实的推力执行、感知和姿态控制。硬件测试使用了（i）光学导航测试平台以评估感知到估计的性能，以及（ii）平面气垫测试平台以评估在代表性驱动和子系统效应下的端到端引导循环。仿真中的蒙特卡洛实验包括初始状态不确定性、推力幅度误差和漏推事件；在调查的最强案例下，CORTEX实现了相对位置的终端对接误差为$36.85 \pm 44.46$ mm，和相对速度的误差为$1.25 \pm 2.26$ mm/s。在平面气垫测试平台上，执行了18个案例（10个正常；8个非正常需要重新计算和/或因模拟的发动机故障和传感器故障而中止），终端位置误差为$8.09 \pm 5.29$ mm，速度误差为$2.23 \pm 1.72$ mm/s。

cs.RO / 11 / 2602.12487

Gradient-Enhanced Partitioned Gaussian Processes for Real-Time Quadrotor Dynamics Modeling

基于梯度增强的分区高斯过程用于实时四旋翼动力学建模

Sang, Xinhuan, Rozman, Adam, Grace, Sheryl, Tron, Roberto

Abstract

We present a quadrotor dynamics Gaussian Process (GP) with gradient information that achieves real-time inference via state-space partitioning and approximation, and that includes aerodynamic effects using data from mid-fidelity potential flow simulations. While traditional GP-based approaches provide reliable Bayesian predictions with uncertainty quantification, they are computationally expensive and thus unsuitable for real-time simulations. To address this challenge, we integrate gradient information to improve accuracy and introduce a novel partitioning and approximation strategy to reduce online computational cost. In particular, for the latter, we associate a local GP with each non-overlapping region; by splitting the training data into local near and far subsets, and by using Schur complements, we show that a large part of the matrix inversions required for inference can be performed offline, enabling real-time inference at frequencies above 30 Hz on standard desktop hardware. To generate a training dataset that captures aerodynamic effects, such as rotor-rotor interactions and apparent wind direction, we use the CHARM code, which is a mid-fidelity aerodynamic solver. It is applied to the SUI Endurance quadrotor to predict force and torque, along with noise at three specified locations. The derivative information is obtained via finite differences. Experimental results demonstrate that the proposed partitioned GP with gradient conditioning achieves higher accuracy than standard partitioned GPs without gradient information, while greatly reducing computational time. This framework provides an efficient foundation for real-time aerodynamic prediction and control algorithms in complex and unsteady environments.

Chinese Translation

我们提出了一种包含梯度信息的四旋翼动力学高斯过程（Gaussian Process, GP），通过状态空间分区和近似实现实时推断，并利用中等保真度的潜流模拟数据考虑气动效应。尽管传统的基于高斯过程的方法提供了可靠的贝叶斯预测和不确定性量化，但其计算开销较大，因此不适合实时模拟。为了解决这一挑战，我们整合了梯度信息以提高准确性，并引入了一种新颖的分区和近似策略，以降低在线计算成本。具体而言，对于后者，我们将每个不重叠区域与一个局部高斯过程关联；通过将训练数据分割为局部近邻和远离子集，并利用Schur补，我们展示了推断所需的大部分矩阵求逆可以离线完成，从而使得在标准桌面硬件上以超过30 Hz的频率实现实时推断。为了生成捕捉气动效应（如转子间相互作用和表观风向）的训练数据集，我们使用了CHARM代码，这是一种中等保真度的气动求解器。它被应用于SUI Endurance四旋翼，以预测在三个指定位置的力和扭矩及噪声。通过有限差分获得导数信息。实验结果表明，所提出的具有梯度条件的分区高斯过程在准确性上优于不带梯度信息的标准分区高斯过程，同时大大减少了计算时间。该框架为复杂和非稳态环境中的实时气动预测和控制算法提供了高效的基础。

cs.RO / 12 / 2602.12492

Composable Model-Free RL for Navigation with Input-Affine Systems

用于输入仿射系统导航的可组合无模型强化学习

Sang, Xinhuan, Abdelgawad, Abdelrahman, Tron, Roberto

Abstract

As autonomous robots move into complex, dynamic real-world environments, they must learn to navigate safely in real time, yet anticipating all possible behaviors is infeasible. We propose a composable, model-free reinforcement learning method that learns a value function and an optimal policy for each individual environment element (e.g., goal or obstacle) and composes them online to achieve goal reaching and collision avoidance. Assuming unknown nonlinear dynamics that evolve in continuous time and are input-affine, we derive a continuous-time Hamilton-Jacobi-Bellman (HJB) equation for the value function and show that the corresponding advantage function is quadratic in the action and optimal policy. Based on this structure, we introduce a model-free actor-critic algorithm that learns policies and value functions for static or moving obstacles using gradient descent. We then compose multiple reach/avoid models via a quadratically constrained quadratic program (QCQP), yielding formal obstacle-avoidance guarantees in terms of value-function level sets, providing a model-free alternative to CLF/CBF-based controllers. Simulations demonstrate improved performance over a PPO baseline applied to a discrete-time approximation.

Chinese Translation

随着自主机器人进入复杂的动态现实环境，它们必须实时学习安全导航，但预见所有可能的行为是不可行的。我们提出了一种可组合的无模型强化学习方法，该方法为每个单独的环境元素（例如，目标或障碍物）学习价值函数和最优策略，并在线组合它们以实现目标到达和碰撞避免。在假设未知的非线性动态在连续时间中演变且为输入仿射的情况下，我们推导出价值函数的连续时间哈密顿-雅可比-贝尔曼（HJB）方程，并表明相应的优势函数在动作和最优策略中是二次的。基于这一结构，我们引入了一种无模型的演员-评论家算法，该算法使用梯度下降学习静态或移动障碍物的策略和价值函数。然后，我们通过二次约束二次规划（QCQP）组合多个到达/避免模型，从而在价值函数水平集方面提供正式的障碍避免保证，为基于CLF/CBF的控制器提供了一种无模型的替代方案。仿真结果表明，与应用于离散时间近似的PPO基线相比，性能有所改善。

cs.RO / 13 / 2602.12508

Monocular Reconstruction of Neural Tactile Fields

单目神经触觉场重建

Mantripragada, Pavan, Deshmukh, Siddhanth, Dessalene, Eadom, Desai, Manas, Aloimonos, Yiannis

Abstract

Robots operating in the real world must plan through environments that deform, yield, and reconfigure under contact, requiring interaction-aware 3D representations that extend beyond static geometric occupancy. To address this, we introduce neural tactile fields, a novel 3D representation that maps spatial locations to the expected tactile response upon contact. Our model predicts these neural tactile fields from a single monocular RGB image -- the first method to do so. When integrated with off-the-shelf path planners, neural tactile fields enable robots to generate paths that avoid high-resistance objects while deliberately routing through low-resistance regions (e.g. foliage), rather than treating all occupied space as equally impassable. Empirically, our learning framework improves volumetric 3D reconstruction by $85.8\%$ and surface reconstruction by $26.7\%$ compared to state-of-the-art monocular 3D reconstruction methods (LRM and Direct3D).

Chinese Translation

在真实世界中操作的机器人必须在接触下规划通过变形、屈服和重新配置的环境，这需要超越静态几何占用的交互感知三维表示。为了解决这个问题，我们引入了神经触觉场，这是一种新颖的三维表示方法，将空间位置映射到接触时的预期触觉响应。我们的模型从单一的单目RGB图像预测这些神经触觉场——这是首个实现此功能的方法。当与现成的路径规划器集成时，神经触觉场使机器人能够生成避免高阻力物体的路径，同时故意穿过低阻力区域（例如，植被），而不是将所有占用空间视为同样不可通行。实证结果表明，与最先进的单目三维重建方法（LRM和Direct3D）相比，我们的学习框架在体积三维重建上提高了85.8%，在表面重建上提高了26.7%。

cs.RO / 14 / 2602.12532

CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning

CRAFT：通过力感知课程微调将VLA模型适应于接触丰富的操作

Zhang, Yike, Wang, Yaonan, Sun, Xinxin, Huang, Kaizhen, Xu, Zhiyuan, Ji, Junjie, Che, Zhengping, Tang, Jian, Sun, Jingtao

Abstract

Vision-Language-Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.

Chinese Translation

视觉-语言-行动（VLA）模型在使机器人执行一般指令方面表现出了强大的能力，但在接触丰富的操作任务中却面临挑战，这类任务的成功需要精确的对齐、稳定的接触维护以及有效处理可变形物体的能力。一个根本性挑战源于高熵的视觉和语言输入与低熵但关键的力信号之间的不平衡，这往往导致对感知的过度依赖和不稳定的控制。为了解决这个问题，我们提出了CRAFT，一个力感知课程微调框架，集成了变分信息瓶颈模块，以调节早期训练中的视觉和语言嵌入。这种课程策略鼓励模型在初期优先关注力信号，然后逐步恢复对完整多模态信息的访问。为了实现力感知学习，我们进一步设计了一个同源的领导-跟随远程操作系统，该系统在多样的接触丰富任务中收集同步的视觉、语言和力数据。现实世界的实验表明，CRAFT始终提高任务成功率，能够推广到未见过的物体和新颖的任务变体，并在多种VLA架构中有效适应，从而实现稳健且可推广的接触丰富操作。

cs.RO / 15 / 2602.12549

Eva-Tracker: ESDF-update-free, Visibility-aware Planning with Target Reacquisition for Robust Aerial Tracking

Eva-Tracker：无ESDF更新、考虑可见性的目标重新获取的鲁棒空中跟踪规划

Lin, Yue, Liu, Yang, Wang, Dong, Lu, Huchuan

Abstract

The Euclidean Signed Distance Field (ESDF) is widely used in visibility evaluation to prevent occlusions and collisions during tracking. However, frequent ESDF updates introduce considerable computational overhead. To address this issue, we propose Eva-Tracker, a visibility-aware trajectory planning framework for aerial tracking that eliminates ESDF updates and incorporates a recovery-capable path generation method for target reacquisition. First, we design a target trajectory prediction method and a visibility-aware initial path generation algorithm that maintain an appropriate observation distance, avoid occlusions, and enable rapid replanning to reacquire the target when it is lost. Then, we propose the Field of View ESDF (FoV-ESDF), a precomputed ESDF tailored to the tracker's field of view, enabling rapid visibility evaluation without requiring updates. Finally, we optimize the trajectory using differentiable FoV-ESDF-based objectives to ensure continuous visibility throughout the tracking process. Extensive simulations and real-world experiments demonstrate that our approach delivers more robust tracking results with lower computational effort than existing state-of-the-art methods. The source code is available at https://github.com/Yue-0/Eva-Tracker.

Chinese Translation

欧几里得有符号距离场（ESDF）广泛用于可见性评估，以防止跟踪过程中的遮挡和碰撞。然而，频繁的ESDF更新会引入相当大的计算开销。为了解决这个问题，我们提出了Eva-Tracker，一个考虑可见性的空中跟踪轨迹规划框架，该框架消除了ESDF更新，并结合了一种具有恢复能力的路径生成方法以实现目标重新获取。首先，我们设计了一种目标轨迹预测方法和一种考虑可见性的初始路径生成算法，以保持适当的观察距离，避免遮挡，并在目标丢失时快速重新规划以重新获取目标。然后，我们提出了视场ESDF（FoV-ESDF），这是一种针对跟踪器视场预先计算的ESDF，能够快速进行可见性评估而无需更新。最后，我们使用可微分的基于FoV-ESDF的目标优化轨迹，以确保在整个跟踪过程中持续可见。大量的仿真和实际实验表明，我们的方法在计算效率上优于现有的最先进方法，提供了更为鲁棒的跟踪结果。源代码可在 https://github.com/Yue-0/Eva-Tracker 获取。

cs.RO / 16 / 2602.12584

Hemispherical Angular Power Mapping of Installed mmWave Radar Modules Under Realistic Deployment Constraints

在现实部署约束下已安装毫米波雷达模块的半球角功率映射

Qureshi, Maaz, Bagheri, Mohammad Omid, Melek, William, Shaker, George

Abstract

Characterizing the angular radiation behavior of installed millimeter-wave (mmWave) radar modules is increasingly important in practical sensing platforms, where packaging, mounting hardware, and nearby structures can significantly alter the effective emission profile. However, once a device is embedded in its host environment, conventional chamber- and turntable-based antenna measurements are often impractical. This paper presents a hemispherical angular received-power mapping methodology for in-situ EM validation of installed mmWave modules under realistic deployment constraints. The approach samples the accessible half-space around a stationary device-under-test by placing a calibrated receiving probe at prescribed (phi, theta, r) locations using geometry-consistent positioning and quasi-static acquisition. Amplitude-only received power is recorded using standard RF instrumentation to generate hemispherical angular power maps that capture installation-dependent radiation characteristics. Proof-of-concept measurements on a 60-GHz radar module demonstrate repeatable hemispherical mapping with angular trends in good agreement with full-wave simulation, supporting practical on-site characterization of embedded mmWave transmitters.

Chinese Translation

在实际传感平台中，表征已安装的毫米波（mmWave）雷达模块的角辐射行为变得越来越重要，因为包装、安装硬件和附近结构会显著改变有效发射特性。然而，一旦设备嵌入其宿主环境，传统的基于测试室和转台的天线测量往往不切实际。本文提出了一种半球角接收功率映射方法，用于在现实部署约束下对已安装的mmWave模块进行原位电磁验证。该方法通过在规定的（phi，theta，r）位置放置经过校准的接收探头，利用几何一致的定位和准静态采集，对静止的待测设备周围的可接触半空间进行采样。使用标准射频仪器记录仅幅度的接收功率，以生成捕捉安装依赖辐射特性的半球角功率图。对60 GHz雷达模块的概念验证测量展示了可重复的半球映射，其角度趋势与全波仿真结果良好一致，支持对嵌入式mmWave发射器的实际现场表征。

cs.RO / 17 / 2602.12597

PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People

PISHYAR：一种用于室内社交导航和多模态人机交互的社会智能智能手杖，旨在帮助视觉障碍人士

Joo, Mahdi Haghighat, Jafari, Maryam Karimi, Taheri, Alireza

Abstract

This paper presents PISHYAR, a socially intelligent smart cane designed by our group to combine socially aware navigation with multimodal human-AI interaction to support both physical mobility and interactive assistance. The system consists of two components: (1) a social navigation framework implemented on a Raspberry Pi 5 that integrates real-time RGB-D perception using an OAK-D Lite camera, YOLOv8-based object detection, COMPOSER-based collective activity recognition, D* Lite dynamic path planning, and haptic feedback via vibration motors for tasks such as locating a vacant seat; and (2) an agentic multimodal LLM-VLM interaction framework that integrates speech recognition, vision language models, large language models, and text-to-speech, with dynamic routing between voice-only and vision-only modes to enable natural voice-based communication, scene description, and object localization from visual input. The system is evaluated through a combination of simulation-based tests, real-world field experiments, and user-centered studies. Results from simulated and real indoor environments demonstrate reliable obstacle avoidance and socially compliant navigation, achieving an overall system accuracy of approximately 80% under different social conditions. Group activity recognition further shows robust performance across diverse crowd scenarios. In addition, a preliminary exploratory user study with eight visually impaired and low-vision participants evaluates the agentic interaction framework through structured tasks and a UTAUT-based questionnaire, revealing high acceptance and positive perceptions of usability, trust, and perceived sociability during our experiments. The results highlight the potential of PISHYAR as a multimodal assistive mobility aid that extends beyond navigation to provide socially interactive support for such users.

Chinese Translation

本文介绍了PISHYAR，这是一款由我们团队设计的社会智能智能手杖，旨在将社会意识导航与多模态人机交互相结合，以支持身体移动和互动辅助。该系统由两个组件组成：（1）一个在Raspberry Pi 5上实现的社会导航框架，集成了使用OAK-D Lite相机的实时RGB-D感知、基于YOLOv8的物体检测、基于COMPOSER的集体活动识别、D* Lite动态路径规划，以及通过振动马达提供的触觉反馈，用于定位空座位等任务；（2）一个代理多模态LLM-VLM交互框架，集成了语音识别、视觉语言模型、大型语言模型和文本转语音，能够在仅语音模式和仅视觉模式之间动态切换，以实现自然的语音通信、场景描述和基于视觉输入的物体定位。该系统通过模拟测试、现实世界实地实验和以用户为中心的研究相结合进行评估。来自模拟和真实室内环境的结果表明，系统在不同社交条件下实现了可靠的障碍物规避和社会合规导航，整体系统准确率约为80%。群体活动识别在多样化人群场景中表现出强大的性能。此外，针对八名视觉障碍和低视力参与者的初步探索性用户研究通过结构化任务和基于UTAUT的问卷评估了代理交互框架，结果显示在我们的实验中对可用性、信任和感知社交性的接受度高且评价积极。结果突显了PISHYAR作为一种多模态辅助移动工具的潜力，不仅限于导航，还为此类用户提供社会互动支持。

cs.RO / 18 / 2602.12628

RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

RLinf-Co：基于强化学习的视-语-动（VLA）模型仿真-真实联合训练

Shi, Liangzhi, Chen, Shuaihang, Gao, Feng, Chen, Yinuo, Chen, Kang, Zhang, Tonghe, Zhang, Hongzhi, Zhang, Weinan, Yu, Chao, Wang, Yu

Abstract

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $\pi_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $\pi_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
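The second training stage described above combines a reinforcement learning objective on simulated rollouts with an auxiliary supervised loss on real data. The sketch below shows one way such a combined objective could be written for a discrete-action toy policy. The abstract does not name the RL algorithm or the loss weighting, so the REINFORCE-style surrogate, the `sft_weight` parameter, and all function names here are stand-in assumptions rather than the authors' implementation.

```python
import math


def sft_loss(action_probs, demo_actions):
    """Mean negative log-likelihood of demonstrated actions (real-world data).

    action_probs[i] is the policy's probability vector for sample i;
    demo_actions[i] is the index of the demonstrated action.
    """
    nll = [-math.log(p[a]) for p, a in zip(action_probs, demo_actions)]
    return sum(nll) / len(nll)


def policy_gradient_loss(log_probs, advantages):
    """REINFORCE-style surrogate on simulated rollouts (assumed stand-in)."""
    terms = [-lp * adv for lp, adv in zip(log_probs, advantages)]
    return sum(terms) / len(terms)


def co_training_loss(sim_log_probs, sim_advantages,
                     real_action_probs, real_demo_actions,
                     sft_weight=1.0):
    """RL loss in simulation plus a weighted supervised anchor on real data.

    sft_weight is a hypothetical hyperparameter introduced for this example.
    """
    rl = policy_gradient_loss(sim_log_probs, sim_advantages)
    sft = sft_loss(real_action_probs, real_demo_actions)
    return rl + sft_weight * sft


if __name__ == "__main__":
    total = co_training_loss(
        sim_log_probs=[-1.0], sim_advantages=[2.0],
        real_action_probs=[[0.5, 0.5]], real_demo_actions=[0],
        sft_weight=0.5,
    )
    print(f"combined loss: {total:.4f}")
```

The supervised term acts as the anchor: if the policy drifts away from the real demonstrations while optimizing simulated reward, the negative log-likelihood grows and pulls it back.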

Chinese Translation

仿真提供了一种可扩展且低成本的方式来丰富视-语-动（VLA）训练，从而减少对昂贵的真实机器人演示的依赖。然而，大多数仿真-真实联合训练方法依赖于监督微调（SFT），将仿真视为静态演示源，并未利用大规模闭环交互。因此，现实世界的收益和泛化能力往往受到限制。本文提出了一种基于强化学习的仿真-真实联合训练框架（RL-Co），该框架利用交互式仿真，同时保留现实世界的能力。我们的方法遵循通用的两阶段设计：首先，我们在真实和仿真演示的混合数据上通过SFT对策略进行预热，然后在仿真中使用强化学习对其进行微调，同时在真实世界数据上添加辅助监督损失，以固定策略并减轻灾难性遗忘。我们在四个真实世界的桌面操作任务上评估了我们的框架，使用两个代表性的VLA架构，OpenVLA和$\pi_{0.5}$，并观察到与仅基于真实数据的微调和基于SFT的联合训练相比，均有一致的改进，包括OpenVLA上+24%的真实世界成功率和$\pi_{0.5}$上+20%的成功率。除了更高的成功率，RL联合训练还实现了对未见任务变体的更强泛化能力，并显著提高了真实世界数据的效率，为利用仿真增强真实机器人部署提供了一条实用且可扩展的途径。

cs.RO / 19 / 2602.12633

Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

通过物理一致的物体间推理实现高度杂乱环境的真实到仿真转换

Xiang, Tianyi, Cao, Jiahang, Guo, Sikai, Zhao, Guoyang, Luo, Andrew F., Ma, Jun

Abstract

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.

Chinese Translation

从单视角观测重建物理有效的三维场景是弥合视觉感知与机器人控制之间差距的前提。然而，在需要精确接触推理的场景中，例如在高度杂乱环境中的机器人操作，仅依靠几何保真度是不够的。标准的感知流程往往忽视物理约束，导致无效状态的出现，例如漂浮物体或严重的相互穿透，从而使下游仿真变得不可靠。为了解决这些局限性，我们提出了一种新颖的物理约束真实到仿真（Real-to-Sim）流程，该流程能够从单视角RGB-D数据中重建物理一致的三维场景。我们的方法的核心是一个可微优化流程，通过接触图明确建模空间依赖关系，联合优化物体姿态和物理属性，利用可微刚体仿真进行改进。在仿真和现实世界环境中的广泛评估表明，我们重建的场景实现了高物理保真度，并忠实再现了现实世界的接触动态，从而实现了稳定可靠的接触丰富操作。

cs.RO / 20 / 2602.12656

PMG: Parameterized Motion Generator for Human-like Locomotion Control

PMG：用于类人运动控制的参数化运动生成器

Han, Chenxi, Min, Yuheng, Huang, Zihao, Hong, Ao, Liu, Hang, Cheng, Yi, Liu, Houde

Abstract

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain. In particular, while low-level motion tracking and trajectory-following controllers are mature, whole-body reference-guided methods are difficult to adapt to higher-level command interfaces and diverse task contexts: they require large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameterized motion data together with high-dimensional control commands. Combined with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, we validate the complete approach on our humanoid prototype ZERITH Z1 and show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs (including VR-based teleoperation), and enables efficient, verifiable sim-to-real transfer. Together, these results establish a practical, experimentally validated pathway toward natural and deployable humanoid control.

Chinese Translation

近年来，数据驱动的强化学习和运动跟踪的进展显著改善了类人运动的表现，但仍然存在一些关键的实际挑战。特别是，尽管低级运动跟踪和轨迹跟随控制器已经成熟，但全身参考引导的方法在适应更高级的命令接口和多样化任务环境方面仍然困难：它们需要大量高质量的数据集，在不同速度和姿态下表现脆弱，并且对机器人特定的校准敏感。为了解决这些局限性，我们提出了参数化运动生成器（PMG），这是一种实时运动生成器，基于对人类运动结构的分析，利用仅一组紧凑的参数化运动数据和高维控制命令合成参考轨迹。结合模仿学习管道和基于优化的仿真到现实的电机参数识别模块，我们在我们的类人原型ZERITH Z1上验证了完整的方法，并展示了在一个集成系统内，PMG能够产生自然的类人运动，精确响应高维控制输入（包括基于虚拟现实的遥操作），并实现高效、可验证的仿真到现实转移。这些结果共同确立了一条实用的、经过实验验证的路径，朝向自然且可部署的类人控制。

cs.RO / 21 / 2602.12684

Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

小米机器人0：一个开源的视觉-语言-动作模型，具备实时执行能力

Cai, Rui, Guo, Jun, He, Xinze, Jin, Piaopiao, Li, Jie, Lin, Bingxuan, Liu, Futeng, Liu, Wei, Ma, Fei, Ma, Kun, Qiu, Feng, Qu, Heng, Su, Yifei, Sun, Qiao, Wang, Dong, Wang, Donghao, Wang, Yunhong, Wu, Rujie, Xiang, Diyun, Yang, Yu, Ye, Hangjun, Zhang, Yuan, Zhou, Quanyun

Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io

Chinese Translation

在本报告中，我们介绍了小米机器人0，这是一个优化以实现高性能和快速流畅实时执行的先进视觉-语言-动作（VLA）模型。我们方法的关键在于精心设计的训练方案和部署策略。小米机器人0首先在大规模跨体现机器人轨迹和视觉-语言数据上进行预训练，使其具备广泛且可推广的动作生成能力，同时避免了对基础预训练视觉-语言模型（VLM）视觉-语义知识的灾难性遗忘。在后训练阶段，我们提出了几种技术，以训练VLA模型实现异步执行，解决真实机器人滚动时的推理延迟问题。在部署过程中，我们仔细对齐连续预测动作块的时间步，以确保连续和无缝的实时滚动。我们在模拟基准测试和两个需要精确灵巧双手操作的具有挑战性的真实机器人任务上对小米机器人0进行了广泛评估。结果表明，我们的方法在所有模拟基准测试中都达到了最先进的性能。此外，小米机器人0能够在使用消费级GPU的真实机器人上快速流畅地执行，且在两个真实机器人任务上实现了高成功率和高吞吐量。为了促进未来的研究，代码和模型检查点已开源，地址为 https://xiaomi-robotics-0.github.io

cs.RO / 22 / 2602.12686

SignScene: Visual Sign Grounding for Mapless Navigation

SignScene：无地图导航的视觉标志定位

Zimmerman, Nicky, Loo, Joel, Koh, Benjamin, Wang, Zishuo, Hsu, David

Abstract

Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic contents need to be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common-sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information, and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.

Chinese Translation

导航标志使人类能够在没有地图的情况下导航于不熟悉的环境。本文研究了机器人如何类似地利用标志进行开放世界中的无地图导航。一个核心挑战在于对标志的解读：现实世界中的标志多样且复杂，其抽象语义内容需要在本地三维场景中得到定位。我们将此形式化为标志定位，即将标志上的语义指令映射到相应的场景元素和导航动作的问题。最近的视觉-语言模型（Vision-Language Models, VLMs）提供了执行此任务所需的语义常识和推理能力，但对空间信息的表示方式较为敏感。我们提出了SignScene，一种以标志为中心的空间-语义表示，能够捕捉与导航相关的场景元素和标志信息，并以有利于有效推理的形式呈现给VLMs。我们在一个包含114个查询的数据集上评估了我们的定位方法，该数据集涵盖了九种不同的环境类型，达到了88%的定位准确率，显著优于基线。最后，我们展示了该方法使Spot机器人能够仅通过标志实现现实世界中的无地图导航。

cs.RO / 23 / 2602.12691

ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

ALOE：用于视觉-语言-动作模型后训练的动作级别离线策略评估

Yang, Rushuai, Wang, Hecheng, Liu, Chiming, Yan, Xiaohan, Wang, Yunlong, Du, Xuan, Yue, Shuoyu, Liu, Yongcheng, Zhang, Chuheng, Qi, Lizhe, Chen, Yi, Shan, Wei, Yao, Maoqing

Abstract

We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves effective credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks, including smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced in a reliable manner for real-world VLA post-training. Videos and additional materials are available at our project website.
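To make the idea of chunk-level temporal-difference bootstrapping more concrete, the sketch below computes a multi-step target for one action chunk: the discounted rewards collected while the chunk executes, plus a discounted value estimate at the state where the next chunk begins. This is the generic n-step TD form and is offered only as an assumption about what "chunking-based" bootstrapping could look like; the abstract does not give ALOE's exact target, and `chunk_td_target` and its parameters are names invented for this example.

```python
def chunk_td_target(rewards, bootstrap_value, gamma=0.99, done=False):
    """Multi-step TD target for a single action chunk.

    rewards: per-step rewards observed while executing the chunk (length h).
    bootstrap_value: critic estimate at the state where the next chunk starts.
    done: True if the episode ended inside this chunk (no bootstrapping).
    """
    target = 0.0
    for i, r in enumerate(rewards):
        # Discount each within-chunk reward by its offset from the chunk start.
        target += (gamma ** i) * r
    if not done:
        # Bootstrap from the critic after h steps.
        target += (gamma ** len(rewards)) * bootstrap_value
    return target


if __name__ == "__main__":
    # Sparse reward arriving on the last step of a 3-step chunk.
    print(chunk_td_target([0.0, 0.0, 1.0], bootstrap_value=4.0, gamma=0.5))
```

Because the target is attached to a specific chunk rather than to the final task outcome, a sparse reward can be credited to the chunk during which it was earned.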

Chinese Translation

我们研究如何通过在线强化学习（RL）在现实世界环境中改善大型基础视觉-语言-动作（VLA）系统。该过程的核心是价值函数，它提供学习信号以指导VLA从经验中学习。在实践中，价值函数是从不同数据源收集的轨迹片段中估计的，这些数据源包括历史策略和间歇性的人类干预。从混合数据中估计当前行为质量的价值函数本质上是一个离线策略评估问题。然而，之前的工作通常采用保守的在线策略估计以确保稳定性，这避免了对当前高容量策略的直接评估，并限制了学习的有效性。在本文中，我们提出了ALOE，一种用于VLA后训练的动作级别离线策略评估框架。ALOE应用基于块的时间差分自助法来评估单个动作序列，而不是预测最终任务结果。这一设计在稀疏奖励下改善了对关键动作块的有效信用分配，并支持稳定的策略改进。我们在三个现实世界的操作任务上评估了我们的方法，包括作为高精度任务的智能手机包装、作为长时间跨度可变形物体任务的洗衣折叠，以及涉及多物体感知的双手抓取与放置。在所有任务中，ALOE提高了学习效率而不影响执行速度，表明离线策略强化学习可以以可靠的方式重新引入到现实世界的VLA后训练中。视频和其他材料可在我们的项目网站上获得。

cs.RO / 24 / 2602.12700

Constrained PSO Six-Parameter Fuzzy PID Tuning Method for Balanced Optimization of Depth Tracking Performance in Underwater Vehicles

约束粒子群优化六参数模糊PID调节方法在水下车辆深度跟踪性能平衡优化中的应用

Ding, Yanxi, Jia, Tingyue

Abstract

Depth control of underwater vehicles in engineering applications must simultaneously satisfy requirements for rapid tracking, low overshoot, and actuator constraints. Traditional fuzzy PID tuning often relies on empirical methods, making it difficult to achieve a stable and reproducible equilibrium solution between performance enhancement and control cost. This paper proposes a constrained particle swarm optimization (PSO) method for tuning six-parameter fuzzy PID controllers. By adjusting the benchmark PID parameters alongside the fuzzy controller's input quantization factor and output proportional gain, it achieves synergistic optimization of the overall tuning strength and dynamic response characteristics of the fuzzy PID system. To ensure engineering feasibility of the optimization results, five indicators are introduced: the time-weighted absolute error integral, settling time, relative overshoot, control energy, and saturation occupancy rate. Control energy constraints are applied to construct a constraint-driven comprehensive evaluation system, suppressing pseudo-improvements achieved solely by increasing control inputs. Simulation results demonstrate that, while maintaining consistent control energy and saturation levels, the proposed method significantly enhances depth tracking performance: the time-weighted absolute error integral decreases from 0.2631 to 0.1473, the settling time shortens from 2.301 s to 1.613 s, and the relative overshoot reduces from 0.1494 to 0.01839. Control energy varied from 7980 to 7935, satisfying the energy constraint, while saturation occupancy decreased from 0.004 to 0.003. These results validate the effectiveness and engineering significance of the proposed constrained six-parameter joint tuning strategy for depth control in underwater vehicle navigation scenarios.
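The evaluation indicators named in this abstract can be computed from a sampled step response. The sketch below uses common textbook definitions (ITAE as the integral of t·|e(t)|, a 2% settling band, control energy as the integral of u², and saturation occupancy as the fraction of samples at the actuator limit). These definitions, the positive step reference, and all function names are assumptions for illustration; the paper's exact formulas and weights are not given in the abstract.

```python
def trapezoid(t, f):
    """Trapezoidal integral of samples f over time stamps t."""
    return sum(0.5 * (f[k] + f[k + 1]) * (t[k + 1] - t[k])
               for k in range(len(t) - 1))


def settling_time(t, y, reference, band=0.02):
    """First sample time after which y stays within +/- band of the reference."""
    tol = band * abs(reference)
    last_outside = -1
    for k, yk in enumerate(y):
        if abs(yk - reference) > tol:
            last_outside = k
    if last_outside == len(y) - 1:
        return None  # never settles within the recorded window
    return t[last_outside + 1]


def step_response_metrics(t, y, u, reference, u_max, band=0.02):
    """Return the five indicators for a positive step reference (assumed)."""
    itae = trapezoid(t, [tk * abs(reference - yk) for tk, yk in zip(t, y)])
    overshoot = max(0.0, (max(y) - reference) / abs(reference))
    energy = trapezoid(t, [uk * uk for uk in u])
    saturated = sum(1 for uk in u if abs(uk) >= u_max) / len(u)
    return {
        "itae": itae,
        "settling_time": settling_time(t, y, reference, band),
        "overshoot": overshoot,
        "control_energy": energy,
        "saturation_occupancy": saturated,
    }


if __name__ == "__main__":
    metrics = step_response_metrics(
        t=[0.0, 1.0, 2.0, 3.0], y=[0.0, 1.2, 1.0, 1.0],
        u=[2.0, 1.0, 0.0, 0.0], reference=1.0, u_max=2.0,
    )
    for name, value in metrics.items():
        print(name, value)
```

A constrained optimizer such as PSO could then minimize a weighted sum of the tracking indicators while rejecting, or penalizing, candidates whose control energy exceeds the allowed bound.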

Chinese Translation

在工程应用中，水下车辆的深度控制必须同时满足快速跟踪、低超调和执行器约束的要求。传统的模糊PID调节通常依赖于经验方法，这使得在性能提升与控制成本之间实现稳定且可重复的平衡解变得困难。本文提出了一种约束粒子群优化（PSO）方法，用于调节六参数模糊PID控制器。通过调整基准PID参数以及模糊控制器的输入量化因子和输出比例增益，实现了模糊PID系统整体调节强度和动态响应特性的协同优化。为了确保优化结果的工程可行性，引入了时间加权绝对误差积分、调整时间、相对超调控制能量和饱和占用率等指标。施加控制能量约束以构建约束驱动的综合评估系统，抑制仅通过增加控制输入而实现的伪改进。仿真结果表明，在保持一致的控制能量和饱和水平的同时，所提方法显著提升了深度跟踪性能：时间加权绝对误差积分从0.2631降至0.1473，稳态时间从2.301秒缩短至1.613秒，相对超调从0.1494降低至0.01839。控制能量从7980变化至7935，满足能量约束，而饱和占用率从0.004降至0.003。这些结果验证了所提约束六参数联合调节策略在水下车辆导航场景中进行深度控制的有效性和工程意义。

cs.RO / 25 / 2602.12724

TRANS: Terrain-aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions

TRANS：考虑地形的强化学习用于四足机器人在社会交互下的灵活导航

Zhu, Wei, Kurniawan, Irfan Tito, Zhao, Ye, Hayashibe, Mitsuhiro

Abstract

This study introduces TRANS: Terrain-aware Reinforcement learning for Agile Navigation under Social interactions, a deep reinforcement learning (DRL) framework for quadrupedal social navigation over unstructured terrains. Conventional quadrupedal navigation typically separates motion planning from locomotion control, neglecting whole-body constraints and terrain awareness. On the other hand, end-to-end methods are more integrated but require high-frequency sensing, which is often noisy and computationally costly. In addition, most existing approaches assume static environments, limiting their use in human-populated settings. To address these limitations, we propose a two-stage training framework with three DRL pipelines. (1) TRANS-Loco employs an asymmetric actor-critic (AC) model for quadrupedal locomotion, enabling traversal of uneven terrains without explicit terrain or contact observations. (2) TRANS-Nav applies a symmetric AC framework for social navigation, directly mapping transformed LiDAR data to ego-agent actions under differential-drive kinematics. (3) A unified pipeline, TRANS, integrates TRANS-Loco and TRANS-Nav, supporting terrain-aware quadrupedal navigation in uneven and socially interactive environments. Comprehensive benchmarks against locomotion and social navigation baselines demonstrate the effectiveness of TRANS. Hardware experiments further confirm its potential for sim-to-real transfer.

Chinese Translation

本研究介绍了TRANS：考虑地形的强化学习用于社会交互下的灵活导航，这是一个针对四足社交导航在非结构化地形上的深度强化学习（DRL）框架。传统的四足导航通常将运动规划与运动控制分开，忽视了整体身体约束和地形意识。另一方面，端到端的方法更为集成，但需要高频率的传感，这通常会产生噪声且计算成本高。此外，大多数现有方法假设环境是静态的，限制了它们在人员密集环境中的应用。为了应对这些局限性，我们提出了一个两阶段的训练框架，包含三个DRL管道。(1) TRANS-Loco采用不对称的演员-评论家（AC）模型进行四足运动，使其能够在没有明确地形或接触观测的情况下穿越不平坦地形。(2) TRANS-Nav应用对称AC框架进行社交导航，直接将变换后的激光雷达数据映射到自我代理的动作上，基于差动驱动运动学。(3) 统一管道TRANS整合了TRANS-Loco和TRANS-Nav，支持在不平坦和社交互动环境中的地形感知四足导航。与运动和社交导航基准的全面比较展示了TRANS的有效性。硬件实验进一步确认了其在模拟到现实转移中的潜力。

cs.RO / 26 / 2602.12734

Scaling Single Human Demonstrations for Imitation Learning using Generative Foundational Models

利用生成基础模型扩展单一人类示范以进行模仿学习

Heppert, Nick, Nguyen, Minh Quang, Valada, Abhinav

Abstract

Imitation learning is a popular paradigm to teach robots new tasks, but collecting robot demonstrations through teleoperation or kinesthetic teaching is tedious and time-consuming. In contrast, directly demonstrating a task using our human embodiment is much easier and data is available in abundance, yet transfer to the robot can be non-trivial. In this work, we propose Real2Gen to train a manipulation policy from a single human demonstration. Real2Gen extracts required information from the demonstration and transfers it to a simulation environment, where a programmable expert agent can demonstrate the task arbitrarily many times, generating an unlimited amount of data to train a flow matching policy. We evaluate Real2Gen on human demonstrations from three different real-world tasks and compare it to a recent baseline. Real2Gen shows an average increase in the success rate of 26.6% and better generalization of the trained policy due to the abundance and diversity of training data. We further deploy our purely simulation-trained policy zero-shot in the real world. We make the data, code, and trained models publicly available at real2gen.cs.uni-freiburg.de.

Chinese Translation

模仿学习是一种流行的范式，用于教导机器人执行新任务，但通过遥操作或运动教学收集机器人示范既繁琐又耗时。相比之下，直接使用我们的人类体现来演示任务要容易得多，且数据丰富，但将其转移到机器人上可能并非易事。在本研究中，我们提出了Real2Gen，以从单一的人类示范中训练操作策略。Real2Gen从示范中提取所需信息，并将其转移到仿真环境中，在该环境中，程序化专家代理可以任意多次演示任务，从而生成无限量的数据以训练流匹配策略。我们在三种不同的现实任务的人类示范上评估了Real2Gen，并与最近的基线进行了比较。Real2Gen的成功率平均提高了26.6%，并且由于训练数据的丰富性和多样性，训练策略的泛化能力更强。我们进一步在现实世界中零-shot部署了我们完全通过仿真训练的策略。我们将数据、代码和训练模型公开发布在real2gen.cs.uni-freiburg.de。

cs.RO / 27 / 2602.12794

SafeFlowMPC: Predictive and Safe Trajectory Planning for Robot Manipulators with Learning-based Policies

SafeFlowMPC：基于学习策略的机器人操纵器的预测与安全轨迹规划

Oelerich, Thies, Ebmer, Gerald, Hartl-Nesic, Christian, Kugi, Andreas

Abstract

The emerging integration of robots into everyday life brings several major challenges. Compared to classical industrial applications, more flexibility is needed in combination with real-time reactivity. Learning-based methods can train powerful policies based on demonstrated trajectories, such that the robot generalizes a task to similar situations. However, these black-box models lack interpretability and rigorous safety guarantees. Optimization-based methods provide these guarantees but lack the required flexibility and generalization capabilities. This work proposes SafeFlowMPC, a combination of flow matching and online optimization to combine the strengths of learning and optimization. This method guarantees safety at all times and is designed to meet the demands of real-time execution by using a suboptimal model-predictive control formulation. SafeFlowMPC achieves strong performance in three real-world experiments on a KUKA 7-DoF manipulator, namely two grasping experiments and a dynamic human-robot object handover experiment. A video of the experiments is available at http://www.acin.tuwien.ac.at/42d6. The code is available at https://github.com/TU-Wien-ACIN-CDS/SafeFlowMPC.

Chinese Translation

机器人日益融入日常生活带来了几个主要挑战。与传统工业应用相比，需要更多的灵活性以及实时反应能力。基于学习的方法可以根据示范轨迹训练出强大的策略，使得机器人能够将任务推广到类似的情境。然而，这些黑箱模型缺乏可解释性和严格的安全保障。基于优化的方法提供了这些保障，但缺乏所需的灵活性和推广能力。本研究提出了SafeFlowMPC，它结合了流匹配和在线优化，以整合学习和优化的优势。该方法在任何时候都能保证安全，并通过使用次优模型预测控制（model-predictive control）形式来满足实时执行的需求。SafeFlowMPC在三个真实世界实验中表现出色，涉及KUKA 7自由度操纵器，包括两个抓取实验和一个动态人机物体交接实验。实验视频可在http://www.acin.tuwien.ac.at/42d6查看。代码可在https://github.com/TU-Wien-ACIN-CDS/SafeFlowMPC获取。

cs.RO / 28 / 2602.12838

SKYSURF: A Self-learning Framework for Persistent Surveillance using Cooperative Aerial Gliders

SKYSURF：一种基于合作空中滑翔机的持久监视自学习框架

Mohamadi, Houssem Eddine, Kara, Nadjia

Abstract

The success of surveillance applications involving small unmanned aerial vehicles (UAVs) depends on how long the limited on-board power lasts. To cope with this challenge, alternative renewable sources of lift are sought. One promising solution is to extract energy from rising masses of buoyant air. This paper proposes a local-global behavioral management and decision-making approach for the autonomous deployment of soaring-capable UAVs. The cooperative UAVs are modeled as non-deterministic finite state-based rational agents. In addition, a mission planning module assigns tasks and issues dynamic navigation waypoints for a new path planning scheme, in which the concepts of visibility and prediction are applied to avoid collisions. Moreover, a delayed learning and tuning strategy is employed to optimize the gains of the path tracking controller. Rigorous comparative analyses carried out with three benchmarking baselines and 15 evolutionary algorithms highlight the adequacy of the proposed approach for maintaining the surveillance persistency (staying aloft for longer periods without landing) and maximizing the detection of targets (two times better than non-cooperative and semi-cooperative approaches) with less power consumption (almost 6% of battery consumed in six hours).

Chinese Translation

涉及小型无人机（UAV）的监视应用的成功依赖于有限的机载电源能够持续多长时间。为应对这一挑战，寻求替代的可再生升力来源。一种有前景的解决方案是从上升的浮力空气团中提取能量。本文提出了一种局部-全局行为管理和决策方法，用于自主部署具有滑翔能力的无人机。合作无人机被建模为非确定性有限状态的理性代理。除了任务分配和为新的路径规划方案发布动态导航航点的任务规划模块外，该方案还应用了可见性和预测的概念以避免碰撞。此外，采用延迟学习和调优策略来优化路径跟踪控制器的增益。与三个基准基线和15种进化算法进行的严格比较分析突显了所提方法在维持监视持续性（在不着陆的情况下保持更长时间的飞行）和最大化目标检测（比非合作和半合作方法提高两倍）方面的适用性，同时减少了能量消耗（在六小时内消耗的电池几乎为6%）。

cs.RO / 29 / 2602.12918

Adding internal audio sensing to internal vision enables human-like in-hand fabric recognition with soft robotic fingertips

将内部音频感知与内部视觉结合，使软机器人指尖具有人类般的手中织物识别能力

Andrussow, Iris, Solano, Jans, Richardson, Benjamin A., Martius, Georg, Kuchenbecker, Katherine J.

Abstract

Distinguishing the feel of smooth silk from coarse cotton is a trivial everyday task for humans. When exploring such fabrics, fingertip skin senses both spatio-temporal force patterns and texture-induced vibrations that are integrated to form a haptic representation of the explored material. It is challenging to reproduce this rich, dynamic perceptual capability in robots because tactile sensors typically cannot achieve both high spatial resolution and high temporal sampling rate. In this work, we present a system that can sense both types of haptic information, and we investigate how each type influences robotic tactile perception of fabrics. Our robotic hand's middle finger and thumb each feature a soft tactile sensor: one is the open-source Minsight sensor that uses an internal camera to measure fingertip deformation and force at 50 Hz, and the other is our new sensor Minsound that captures vibrations through an internal MEMS microphone with a bandwidth from 50 Hz to 15 kHz. Inspired by the movements humans make to evaluate fabrics, our robot actively encloses and rubs folded fabric samples between its two sensitive fingers. Our results test the influence of each sensing modality on overall classification performance, showing high utility for the audio-based sensor. Our transformer-based method achieves a maximum fabric classification accuracy of 97% on a dataset of 20 common fabrics. Incorporating an external microphone away from Minsound increases our method's robustness in loud ambient noise conditions. To show that this audio-visual tactile sensing approach generalizes beyond the training data, we learn general representations of fabric stretchiness, thickness, and roughness.

Chinese Translation

区分光滑的丝绸与粗糙的棉布是人类日常生活中的一项微不足道的任务。在探索这些织物时，指尖皮肤感知到的时空力模式和纹理引起的振动被整合形成对所探索材料的触觉表征。由于触觉传感器通常无法同时实现高空间分辨率和高时间采样率，因此在机器人中重现这种丰富的动态感知能力具有挑战性。在本研究中，我们提出了一种能够感知这两种触觉信息的系统，并探讨了每种类型如何影响机器人对织物的触觉感知。我们的机器人手的中指和拇指各配备了一种软触觉传感器：一个是开源的Minsight传感器，利用内部相机以50 Hz的频率测量指尖变形和力，另一个是我们的新传感器Minsound，通过内部MEMS麦克风捕捉频带从50 Hz到15 kHz的振动。受到人类评估织物时动作的启发，我们的机器人主动将折叠的织物样本夹住并在其两个敏感的手指之间摩擦。我们的结果测试了每种感知方式对整体分类性能的影响，显示出基于音频的传感器的高效用。我们基于变换器的方法在20种常见织物的数据集上达到了97%的最高织物分类准确率。在Minsound之外加入一个外部麦克风，提高了我们方法在嘈杂环境条件下的鲁棒性。为了证明这种音频-视觉触觉感知方法超越训练数据的泛化能力，我们学习了织物的拉伸性、厚度和粗糙度的一般表征。

cs.RO / 30 / 2602.12971

INHerit-SG: Incremental Hierarchical Semantic Scene Graphs with RAG-Style Retrieval

INHerit-SG：增量层次语义场景图与RAG风格检索

Fang, YukTungSamuel, Shi, Zhikang, Qiu, Jiabin, Chen, Zixuan, Shi, Jieqi, Xu, Hao, Huo, Jing, Gao, Yang

Abstract

Driven by advancements in foundation models, semantic scene graphs have emerged as a prominent paradigm for high-level 3D environmental abstraction in robot navigation. However, existing approaches are fundamentally misaligned with the needs of embodied tasks. As they rely on either offline batch processing or implicit feature embeddings, the maps can hardly support interpretable human-intent reasoning in complex environments. To address these limitations, we present INHerit-SG. We redefine the map as a structured, RAG-ready knowledge base where natural-language descriptions are introduced as explicit semantic anchors to better align with human intent. An asynchronous dual-process architecture, together with a Floor-Room-Area-Object hierarchy, decouples geometric segmentation from time-consuming semantic reasoning. An event-triggered map update mechanism reorganizes the graph only when meaningful semantic events occur. This strategy enables our graph to maintain long-term consistency with relatively low computational overhead. For retrieval, we deploy multi-role Large Language Models (LLMs) to decompose queries into atomic constraints and handle logical negations, and employ a hard-to-soft filtering strategy to ensure robust reasoning. This explicit interpretability improves the success rate and reliability of complex retrievals, enabling the system to adapt to a broader spectrum of human interaction tasks. We evaluate INHerit-SG on a newly constructed dataset, HM3DSem-SQR, and in real-world environments. Experiments demonstrate that our system achieves state-of-the-art performance on complex queries, and reveal its scalability for downstream navigation tasks. Project Page: https://fangyuktung.github.io/INHeritSG.github.io/

Chinese Translation

随着基础模型的进步，语义场景图已成为机器人导航中高层次3D环境抽象的重要范式。然而，现有方法在根本上与具身任务的需求不匹配。由于它们依赖于离线批处理或隐式特征嵌入，因此在复杂环境中，地图几乎无法支持可解释的人类意图推理。为了解决这些局限性，我们提出了INHerit-SG。我们将地图重新定义为一个结构化的、适合RAG（Retrieval-Augmented Generation）风格的知识库，其中引入自然语言描述作为显式语义锚点，以更好地与人类意图对齐。一个异步双过程架构，以及一个楼层-房间-区域-对象层次结构，将几何分割与耗时的语义推理解耦。事件触发的地图更新机制仅在发生有意义的语义事件时重新组织图形。这一策略使我们的图形能够在相对较低的计算开销下保持长期一致性。在检索方面，我们部署多角色大型语言模型（LLMs）将查询分解为原子约束并处理逻辑否定，并采用硬到软的过滤策略以确保稳健的推理。这种显式可解释性提高了复杂检索的成功率和可靠性，使系统能够适应更广泛的人类交互任务。我们在新构建的数据集HM3DSem-SQR和真实环境中评估了INHerit-SG。实验表明，我们的系统在复杂查询上实现了最先进的性能，并揭示了其在下游导航任务中的可扩展性。项目页面：https://fangyuktung.github.io/INHeritSG.github.io/

cs.RO / 31 / 2602.12978

Learning Native Continuation for Action Chunking Flow Policies

学习原生延续以实现动作分块流策略

Liu, Yufeng, Yu, Hang, Zhao, Juntu, Li, Bocheng, Zhang, Di, Li, Mingzhu, Wu, Wenxuan, Hu, Yingdong, Xie, Junyuan, Guo, Junliang, Wang, Dequan, Gao, Yang

Abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses a randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
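To make the initialization step concrete, here is a minimal sketch of starting denoising from a schedule-shaped mixture of known actions and noise. The abstract does not give the mixture form or the schedule, so the per-step linear blend, the `linear_schedule` ramp, and every function name below are illustrative assumptions rather than Legato's actual implementation.

```python
import random


def linear_schedule(chunk_len, overlap):
    """Hypothetical schedule: weight 1.0 on the first step, decaying
    linearly to 0.0 at step `overlap`, then 0.0 for the rest of the chunk."""
    return [max(0.0, 1.0 - t / overlap) for t in range(chunk_len)]


def mixture_init(known_actions, weights, rng):
    """Blend each known action with Gaussian noise, step by step.

    weight 1.0 -> start exactly at the known action,
    weight 0.0 -> start from pure noise.
    """
    init = []
    for action, w in zip(known_actions, weights):
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        init.append([w * a + (1.0 - w) * n for a, n in zip(action, noise)])
    return init


rng = random.Random(0)
# Toy chunk: 6 steps of 2-D actions carried over from the previous plan.
previous_plan = [[0.1 * t, -0.1 * t] for t in range(6)]
weights = linear_schedule(chunk_len=6, overlap=3)
x_init = mixture_init(previous_plan, weights, rng)
```

Under these assumptions, the early steps of `x_init` stay close to the already-committed actions while later steps start from noise, which is one plausible way a model could be exposed to partial action information.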

Chinese Translation

动作分块使得视觉语言动作（VLA）模型能够实时运行，但简单的分块执行往往在分块边界处出现不连续性。实时分块（RTC）缓解了这个问题，但它是外部于策略的，导致虚假的多模态切换和不内在平滑的轨迹。我们提出了Legato，一种用于动作分块流基VLA策略的训练时延续方法。具体而言，Legato从一个调度形状的已知动作和噪声的混合中初始化去噪，向模型暴露部分动作信息。此外，Legato重塑学习到的流动态，以确保去噪过程在每步指导下在训练和推理之间保持一致。Legato在训练期间进一步使用随机调度条件，以支持不同的推理延迟并实现可控的平滑性。实证结果表明，Legato在执行过程中产生了更平滑的轨迹，并减少了虚假的多模态切换，从而减少了犹豫时间和缩短了任务完成时间。大量的现实世界实验表明，Legato在五个操作任务中始终优于RTC，在轨迹平滑性和任务完成时间上均实现了约10%的提升。

cs.RO / 32 / 2602.13016

How Swarms Differ: Challenges in Collective Behaviour Comparison

群体的差异：集体行为比较中的挑战

Jesus, André Fialho, Kuckling, Jonas

Abstract

Collective behaviours often need to be expressed through numerical features, e.g., for classification or imitation learning. This problem is often addressed by proposing an ad-hoc feature set for a particular swarm behaviour context, usually without further consideration of the solution's resilience outside of the conceived context. Yet, the development of automatic methods to design swarm behaviours is dependent on the ability to measure quantitatively the similarity of swarm behaviours. Hence, we investigate the impact of feature sets for collective behaviours. We select swarm feature sets and similarity measures from prior swarm robotics work, which mainly considered a narrow behavioural context, and assess their robustness. We demonstrate that the interplay of feature set and similarity measure makes some combinations more suitable to distinguish groups of similar behaviours. We also propose a self-organised map-based approach to identify regions of the feature space where behaviours cannot be easily distinguished.

Chinese Translation

集体行为通常需要通过数值特征来表达，例如用于分类或模仿学习。这个问题通常通过为特定的群体行为情境提出一个特定的特征集来解决，通常没有进一步考虑解决方案在所设想情境之外的韧性。然而，自动设计群体行为的方法的发展依赖于定量测量群体行为相似性的能力。因此，我们研究了特征集对集体行为的影响。我们从先前的群体机器人研究中选择了群体特征集和相似性度量，这些研究主要考虑了狭窄的行为情境，并评估了它们的稳健性。我们证明了特征集与相似性度量之间的相互作用使得某些组合更适合区分相似行为的群体。我们还提出了一种基于自组织映射的方法，以识别特征空间中行为难以区分的区域。

cs.RO / 33 / 2602.13078

SENSE-STEP: Learning Sim-to-Real Locomotion for a Sensory-Enabled Soft Quadruped Robot

SENSE-STEP：为具备传感能力的软四足机器人学习仿真到现实的运动控制

de Kam, Storm, Shahabi, Ebrahim, Della Santina, Cosimo

Abstract

Robust closed-loop locomotion remains challenging for soft quadruped robots due to high-dimensional dynamics, actuator hysteresis, and difficult-to-model contact interactions, while conventional proprioception provides limited information about ground contact. In this paper, we present a learning-based control framework for a pneumatically actuated soft quadruped equipped with tactile suction-cup feet, and we validate the approach experimentally on physical hardware. The control policy is trained in simulation through a staged learning process that starts from a reference gait and is progressively refined under randomized environmental conditions. The resulting controller maps proprioceptive and tactile feedback to coordinated pneumatic actuation and suction-cup commands, enabling closed-loop locomotion on flat and inclined surfaces. When deployed on the real robot, the closed-loop policy outperforms an open-loop baseline, increasing forward speed by 41% on a flat surface and by 91% on a 5-degree incline. Ablation studies further demonstrate the role of tactile force estimates and inertial feedback in stabilizing locomotion, with performance improvements of up to 56% compared to configurations without sensory feedback.

Chinese Translation

由于高维动态、执行器滞后以及难以建模的接触交互，软四足机器人的稳健闭环运动仍然面临挑战，而传统的本体感知提供的地面接触信息有限。本文提出了一种基于学习的控制框架，适用于配备触觉吸盘足的气动驱动软四足机器人，并在物理硬件上对该方法进行了实验验证。控制策略通过一种分阶段的学习过程在仿真中进行训练，该过程从参考步态开始，并在随机环境条件下逐步优化。最终的控制器将本体感知和触觉反馈映射到协调的气动驱动和吸盘指令，从而实现平坦和倾斜表面的闭环运动。在真实机器人上部署时，闭环策略的表现优于开放环基线，在平坦表面上前进速度提高了41%，在5度倾斜面上提高了91%。消融研究进一步证明了触觉力估计和惯性反馈在稳定运动中的作用，与没有传感反馈的配置相比，性能提升高达56%。

cs.RO / 34 / 2602.13081

Agentic AI for Robot Control: Flexible but still Fragile

用于机器人控制的代理智能：灵活但依然脆弱

Lima, Oscar, Vinci, Marc, Günther, Martin, Renz, Marian, Sung, Alexander, Stock, Sebastian, Brust, Johannes, Niecksch, Lennart, Yi, Zongyao, Igelbrink, Felix, Kisliuk, Benjamin, Atzmueller, Martin, Hertzberg, Joachim

Abstract

Recent work leverages the capabilities and commonsense priors of generative models for robot control. In this paper, we present an agentic control system in which a reasoning-capable language model plans and executes tasks by selecting and invoking robot skills within an iterative planner and executor loop. We deploy the system on two physical robot platforms in two settings: (i) tabletop grasping, placement, and box insertion in indoor mobile manipulation (Mobipick) and (ii) autonomous agricultural navigation and sensing (Valdemar). Both settings involve uncertainty, partial observability, sensor noise, and ambiguous natural-language commands. The system exposes structured introspection of its planning and decision process, reacts to exogenous events via explicit event checks, and supports operator interventions that modify or redirect ongoing execution. Across both platforms, our proof-of-concept experiments reveal substantial fragility, including non-deterministic suboptimal behavior, instruction-following errors, and high sensitivity to prompt specification. At the same time, the architecture is flexible: transfer to a different robot and task domain largely required updating the system prompt (domain model, affordances, and action catalogue) and re-binding the same tool interface to the platform-specific skill API.

Chinese Translation

近期的研究利用生成模型的能力和常识先验来进行机器人控制。本文提出了一种代理控制系统，其中一个具备推理能力的语言模型通过在迭代规划和执行循环中选择和调用机器人技能来规划和执行任务。我们在两个物理机器人平台上部署该系统，分别在两个场景中进行实验：（i）室内移动操控中的桌面抓取、放置和箱体插入（Mobipick），以及（ii）自主农业导航和感知（Valdemar）。这两个场景都涉及不确定性、部分可观测性、传感器噪声和模糊的自然语言指令。该系统展示了其规划和决策过程的结构化自省，通过明确的事件检查对外部事件作出反应，并支持操作员干预以修改或重定向正在进行的执行。在这两个平台上，我们的概念验证实验揭示了显著的脆弱性，包括非确定性的次优行为、指令遵循错误以及对提示规范的高度敏感性。与此同时，该架构具有灵活性：转移到不同的机器人和任务领域主要需要更新系统提示（领域模型、可供性和动作目录），并将相同的工具接口重新绑定到特定平台的技能API。

cs.RO / 35 / 2602.13086

UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph

UniManip：基于代理操作图的通用零-shot机器人操控

Liu, Haichao, Xue, Yuanjiang, Zhou, Yuheng, Deng, Haoyuan, Liang, Yinan, Xie, Lihua, Wang, Ziwei

Abstract

Achieving general-purpose robotic manipulation requires robots to seamlessly bridge high-level semantic intent with low-level physical interaction in unstructured environments. However, existing approaches falter in zero-shot generalization: end-to-end Vision-Language-Action (VLA) models often lack the precision required for long-horizon tasks, while traditional hierarchical planners suffer from semantic rigidity when facing open-world variations. To address this, we present UniManip, a framework grounded in a Bi-level Agentic Operational Graph (AOG) that unifies semantic reasoning and physical grounding. By coupling a high-level Agentic Layer for task orchestration with a low-level Scene Layer for dynamic state representation, the system continuously aligns abstract planning with geometric constraints, enabling robust zero-shot execution. Unlike static pipelines, UniManip operates as a dynamic agentic loop: it actively instantiates object-centric scene graphs from unstructured perception, parameterizes these representations into collision-free trajectories via a safety-aware local planner, and exploits structured memory to autonomously diagnose and recover from execution failures. Extensive experiments validate the system's robust zero-shot capability on unseen objects and tasks, demonstrating a 22.5% and 25.0% higher success rate compared to state-of-the-art VLA and hierarchical baselines, respectively. Notably, the system enables direct zero-shot transfer from fixed-base setups to mobile manipulation without fine-tuning or reconfiguration. Our open-source project page can be found at https://henryhcliu.github.io/unimanip.

Chinese Translation

实现通用机器人操控需要机器人在非结构化环境中无缝连接高层语义意图与低层物理交互。然而，现有方法在零-shot 泛化方面表现不佳：端到端的视觉-语言-动作（VLA）模型通常缺乏长时间任务所需的精确性，而传统的层次规划器在面对开放世界变体时则遭遇语义僵化。为了解决这一问题，我们提出了UniManip，一个基于双层代理操作图（AOG）的框架，统一了语义推理和物理基础。通过将高层代理层用于任务协调与低层场景层用于动态状态表示相结合，该系统持续对齐抽象规划与几何约束，实现稳健的零-shot 执行。与静态管道不同，UniManip 作为一个动态代理循环运行：它主动从非结构化感知中实例化以对象为中心的场景图，通过安全感知的局部规划器将这些表示参数化为无碰撞轨迹，并利用结构化记忆自主诊断和恢复执行失败。大量实验验证了该系统在未见对象和任务上的稳健零-shot 能力，成功率分别比最先进的 VLA 和层次基线高出22.5%和25.0%。值得注意的是，该系统能够在不进行微调或重新配置的情况下，实现从固定基座设置到移动操控的直接零-shot 转移。我们的开源项目页面可以在 https://henryhcliu.github.io/unimanip 找到。

cs.RO / 36 / 2602.13159

Temporally-Sampled Efficiently Adaptive State Lattices for Autonomous Ground Robot Navigation in Partially Observed Environments

针对部分可观测环境中自主地面机器人导航的时间采样高效自适应状态格

Menon, Ashwin Satish, Damm, Eric R., Lancaster, Eli S., Sanchez, Felix A., Gregory, Jason M., Howard, Thomas M.

Abstract

Due to sensor limitations, environments that off-road mobile robots operate in are often only partially observable. As the robots move throughout the environment and towards their goal, the optimal route is continuously revised as the sensors perceive new information. In traditional autonomous navigation architectures, a regional motion planner will consume the environment map and output a trajectory for the local motion planner to use as a reference. Due to the continuous revision of the regional plan guidance as a result of changing map information, the reference trajectories which are passed down to the local planner can differ significantly across sequential planning cycles. This rapidly changing guidance can result in unsafe navigation behavior, often requiring manual safety interventions during autonomous traversals in off-road environments. To remedy this problem, we propose Temporally-Sampled Efficiently Adaptive State Lattices (TSEASL), which is a regional planner arbitration architecture that considers updated and optimized versions of previously generated trajectories against the currently generated trajectory. When tested on a Clearpath Robotics Warthog Unmanned Ground Vehicle as well as real map data collected from the Warthog, results indicate that, when running TSEASL, the robot did not require manual interventions at the locations where interventions were needed under the baseline planner. Additionally, higher levels of planner stability were recorded with TSEASL over the baseline. The paper concludes with a discussion of further improvements to TSEASL in order to make it more generalizable to various off-road autonomy scenarios.
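The arbitration idea can be illustrated with a small sketch that re-scores previously generated trajectories alongside the current one and keeps the cheapest. The abstract does not state the selection criterion, so the minimum-cost rule, the tie-break in favour of older trajectories, and the names `arbitrate` and `path_length` are all hypothetical choices for illustration, not TSEASL itself.

```python
def arbitrate(current, previous_updated, cost_fn):
    """Return the cheapest candidate under the latest map.

    Ties go to the older (previously generated) trajectory, which favours
    stable guidance across planning cycles.
    """
    best = None
    best_cost = float("inf")
    # Previously generated trajectories are checked first so they win ties.
    for traj in list(previous_updated) + [current]:
        cost = cost_fn(traj)
        if cost < best_cost:
            best, best_cost = traj, cost
    return best, best_cost


def path_length(traj):
    """Toy cost: Euclidean length of a 2-D polyline."""
    return sum(
        ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        for (x0, y0), (x1, y1) in zip(traj, traj[1:])
    )


old_plan = [(0, 0), (1, 0), (2, 0)]  # straight line
new_plan = [(0, 0), (1, 1), (2, 0)]  # detour through (1, 1)
chosen, cost = arbitrate(new_plan, [old_plan], path_length)
```

In a real planner the cost function would be evaluated on the latest map rather than on path length alone; the point of the sketch is only that older guidance is replaced when a new trajectory is strictly better.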

Chinese Translation

由于传感器的限制，越野移动机器人所操作的环境通常只能部分观察到。随着机器人在环境中移动并朝向目标，最佳路线会随着传感器感知到的新信息而不断修订。在传统的自主导航架构中，区域运动规划器会消耗环境地图，并输出供局部运动规划器作为参考的轨迹。由于区域规划指导因地图信息变化而持续修订，传递给局部规划器的参考轨迹在连续规划周期中可能会显著不同。这种快速变化的指导可能导致不安全的导航行为，通常需要在越野环境中的自主穿越过程中进行手动安全干预。为了解决这个问题，我们提出了时间采样高效自适应状态格（Temporally-Sampled Efficiently Adaptive State Lattices，TSEASL），这是一种区域规划器仲裁架构，考虑了先前生成轨迹的更新和优化版本与当前生成轨迹的对比。在对Clearpath Robotics Warthog无人地面车辆及其收集的真实地图数据进行测试时，结果表明，在运行TSEASL时，机器人在与基线规划器相同的位置不需要手动干预。此外，TSEASL的规划稳定性记录也高于基线。本文最后讨论了对TSEASL的进一步改进，以使其更具通用性，适用于各种越野自主场景。

cs.RO / 37 / 2602.13163

Human Emotion-Mediated Soft Robotic Arts: Exploring the Intersection of Human Emotions, Soft Robotics and Arts

人类情感介导的软机器人艺术：探索人类情感、软机器人技术与艺术的交汇

Nadipineni, Saitarun, Hong, Chenhao, Ramlall, Tanishtha, Sirithunge, Chapa, Althoefer, Kaspar, Iida, Fumiya, Lalitharatne, Thilina Dulantha

Abstract

Soft robotics has emerged as a versatile field with applications across various domains, from healthcare to industrial automation, and more recently, art and interactive installations. The inherent flexibility, adaptability, and safety of soft robots make them ideal for applications that require delicate, organic, and lifelike movement, allowing for immersive and responsive interactions. This study explores the intersection of human emotions, soft robotics, and art to establish and create new forms of human emotion-mediated soft robotic art. In this paper, we introduce two soft embodiments: a soft character and a soft flower as an art display that dynamically responds to brain signals based on alpha waves, reflecting different emotion levels. We present how human emotions can be measured as alpha waves based on brain/EEG signals, how we map the alpha waves to the dynamic movements of the two soft embodiments, and demonstrate our proposed concept using experiments. The findings of this study highlight how soft robotics can embody human emotional states, offering a new medium for insightful artistic expression and interaction, and demonstrating how art displays can be embodied.

Chinese Translation

软机器人技术作为一个多功能领域，已在医疗、工业自动化等多个领域展现出应用潜力，最近更是在艺术和互动装置方面得到了关注。软机器人的固有灵活性、适应性和安全性使其成为需要精细、有机和栩栩如生运动的应用的理想选择，从而实现沉浸式和响应式的互动。本研究探讨了人类情感、软机器人技术与艺术的交汇，以建立和创造新的形式的人类情感介导的软机器人艺术。在本文中，我们介绍了两种软体现：一个软角色和一个软花朵，作为一种艺术展示，能够根据大脑信号中的α波动态响应，反映不同的情感水平。我们展示了如何基于脑电图（EEG）信号测量人类情感作为α波，如何将α波映射到这两种软体现的动态运动，并通过实验验证我们提出的概念。本研究的发现强调了软机器人如何体现人类情感状态，为深刻的艺术表达和互动提供了一种新的媒介，并展示了艺术展示如何得以具象化。

cs.RO / 38 / 2602.13193

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

可引导的视觉-语言-动作策略用于具身推理和层次控制

Chen, William, Bhatia, Jagdeep Singh, Glossop, Catherine, Mathihalli, Nikhil, Doshi, Ria, Tang, Andy, Driess, Danny, Pertsch, Karl, Levine, Sergey

Abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io

Chinese Translation

预训练的视觉-语言模型（VLMs）能够在多样化的环境中进行语义和视觉推理，为机器人控制提供了宝贵的常识先验。然而，如何有效地将这些知识扎根于机器人行为中仍然是一个开放的挑战。以往的方法通常采用层次化的方法，其中VLMs对由独立的低级策略执行的高层命令进行推理，例如视觉-语言-动作模型（VLAs）。VLMs与VLAs之间的接口通常是自然语言任务指令，这在根本上限制了VLM推理对低级行为的引导能力。因此，我们引入了可引导策略：在不同抽象层次上（如子任务、动作和基础像素坐标）通过丰富的合成命令训练的VLAs。通过改善低级可控性，可引导策略能够解锁VLMs中的预训练知识，从而实现更好的任务泛化。我们通过使用学习到的高层具身推理器和通过上下文学习促使推理命令抽象的现成VLM来控制我们的可引导策略，展示了这一优势。在广泛的现实世界操作实验中，这两种新方法的表现超越了以往的具身推理VLAs和基于VLM的层次基线，包括在具有挑战性的泛化和长时间跨度任务上的表现。网站：steerable-policies.github.io

cs.RO / 39 / 2602.13197

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

模仿有效方法：从人类视频中进行模拟过滤的模块化策略学习

Zhai, Albert J., Zeng, Kuo-Hao, Lu, Jiasen, Farhadi, Ali, Wang, Shenlong, Ma, Wei-Chiu

Abstract

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

Chinese Translation

通过观看人类的视频学习操作技能的能力，有潜力为机器人学习解锁一种新的高度可扩展的数据来源。在此，我们研究了抓取操作，其中任务涉及在执行各种抓取后动作之前抓取物体。人类视频为学习抓取后动作提供了强有力的信号，但对于学习先决的抓取行为则帮助有限，尤其是对于没有类人手的机器人。一个有前景的解决方案是使用模块化策略设计，利用专门的抓取生成器来产生稳定的抓取。然而，任意的稳定抓取往往与任务不兼容，阻碍了机器人执行所需下游动作的能力。为了解决这一挑战，我们提出了感知-模拟-模仿（Perceive-Simulate-Imitate, PSI）框架，该框架利用经过配对抓取-轨迹过滤的模拟处理人类视频运动数据来训练模块化操作策略。这个模拟步骤通过抓取适用性标签扩展了轨迹数据，从而允许对面向任务的抓取能力进行监督学习。我们通过现实世界的实验展示了我们的框架可以高效地学习精确的操作技能，而无需任何机器人数据，结果表现出显著比简单使用抓取生成器更强的鲁棒性。

计算机视觉 (Computer Vision)

cs.CV / 1 / 2602.12361

Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring

用于无接触心肺和出汗反应监测的热成像技术

Casado, Constantino Álvarez, Rahman, Mohammad, Sharifipour, Sasan, Nguyen, Nhi, Cañellas, Manuel Lage, Wu, Xiaoting, López, Miguel Bordallo

Abstract

Thermal infrared imaging captures skin temperature changes driven by autonomic regulation and can potentially provide contactless estimation of electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). While visible-light methods address HR and BR, they cannot access EDA, a standard marker of sympathetic activation. This paper characterizes the extraction of these three biosignals from facial thermal video using a signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR, we apply an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs), and for BR we average nasal and cheek signals before spectral peak detection. We evaluate 288 EDA configurations and the HR/BR pipeline on 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of $0.40 \pm 0.23$ against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of $3.1 \pm 1.1$ bpm, while HR estimation yields $13.8 \pm 7.5$ bpm MAE, limited by the low camera frame rate (7.5 Hz). We report signal polarity alternation across sessions, short thermodynamic latency for well-tracked signals, and condition-dependent and demographic effects on extraction quality. These results provide baseline performance bounds and design guidance for thermal contactless biosignal estimation.
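As a rough illustration of the pipeline's last two stages, the sketch below separates a slow trend with an exponential moving average and then estimates breathing rate by spectral peak detection. Only the 7.5 Hz frame rate comes from the abstract; the smoothing factor, the 0.1 to 0.7 Hz search band, the synthetic trace, and the function names are assumptions made for this example.

```python
import cmath
import math


def ema(signal, alpha):
    """Exponential moving average; smaller alpha gives a smoother, slower trend."""
    out, state = [], signal[0]
    for x in signal:
        state = alpha * x + (1.0 - alpha) * state
        out.append(state)
    return out


def spectral_peak_bpm(signal, fs, f_lo=0.1, f_hi=0.7, step=0.01):
    """Return the frequency (in cycles per minute) with the largest
    spectral power inside [f_lo, f_hi] Hz, using a direct DFT scan."""
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    best_f, best_power = f_lo, -1.0
    n_steps = int(round((f_hi - f_lo) / step))
    for k in range(n_steps + 1):
        f = f_lo + k * step
        acc = sum(x * cmath.exp(-2j * math.pi * f * n / fs)
                  for n, x in enumerate(centered))
        power = abs(acc) ** 2
        if power > best_power:
            best_f, best_power = f, power
    return 60.0 * best_f


fs = 7.5  # Hz, the camera frame rate quoted in the abstract
# Synthetic trace: 0.25 Hz breathing (15 breaths/min) plus a slow drift, 60 s long.
trace = [math.sin(2 * math.pi * 0.25 * n / fs) + 0.002 * n
         for n in range(int(60 * fs))]
slow_trend = ema(trace, alpha=0.02)  # stand-in for the slow component
fast_part = [x - s for x, s in zip(trace, slow_trend)]
br_estimate = spectral_peak_bpm(fast_part, fs)
```

Scanning a fixed frequency grid keeps the example dependency-free; a practical implementation would more likely use an FFT with windowing.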

Chinese Translation

热红外成像捕捉由自主神经调节引起的皮肤温度变化，并有可能提供无接触的电皮肤活动（EDA）、心率（HR）和呼吸频率（BR）的估计。虽然可见光方法可以处理HR和BR，但无法获取EDA，这是交感神经激活的标准标志。本文描述了一种从面部热视频中提取这三种生物信号的信号处理流程，该流程跟踪解剖区域，应用空间聚合，并将缓慢的出汗趋势与更快的心肺成分分离。对于HR，我们在多个面部感兴趣区域（ROIs）上应用正交矩阵图像变换（OMIT）分解；对于BR，我们在进行频谱峰值检测之前对鼻部和面颊信号进行平均。我们在公共的SIMULATOR STUDY 1 (SIM1) 驾驶员监测数据集中评估了288种EDA配置和HR/BR流程，共31个会话。最佳固定EDA配置（鼻部区域，指数移动平均）与手掌EDA的平均绝对相关性达到$0.40 \pm 0.23$，个别会话达到0.89。BR估计的平均绝对误差为$3.1 \pm 1.1$ bpm，而HR估计的MAE为$13.8 \pm 7.5$ bpm，受限于低相机帧率（7.5 Hz）。我们报告了会话间信号极性交替、对跟踪良好的信号的短热力学延迟，以及对提取质量的条件依赖和人口统计效应。这些结果为热无接触生物信号估计提供了基线性能界限和设计指导。

cs.CV / 2 / 2602.12370

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

LLaMo：通过连续自回归标记扩展预训练语言模型以实现统一的运动理解与生成

Li, Zekun, An, Sizhe, Tang, Chengcheng, Guo, Chuan, Shugurov, Ivan, Zhang, Linguang, Zhao, Amy, Sridhar, Srinath, Tao, Lingling, Mittal, Abhay

Abstract

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.

Chinese Translation

近年来，大型模型的进展在统一的多模态生成和理解方面取得了显著的突破。然而，统一运动与语言生成及理解的模型发展仍然在很大程度上未被充分探索。现有的方法通常在配对的运动-文本数据上微调大型语言模型（LLMs），这可能导致由于可用文本-运动对的规模有限而造成语言能力的灾难性遗忘。此外，先前的方法通常通过量化将运动转换为离散表示，以便与语言模型集成，这引入了来自离散标记化的显著抖动伪影。为了解决这些挑战，我们提出了LLaMo，一个通过特定模态的混合变换器（Mixture-of-Transformers, MoT）架构扩展预训练LLMs的统一框架。该设计本质上保留了基础模型的语言理解能力，同时实现了可扩展的多模态适应。我们将人类运动编码为因果连续潜在空间，并通过轻量级流匹配头在仅解码器的骨干网络中维持下一个标记预测范式，从而实现实时（>30 FPS）的流式运动生成。利用预训练LLMs的全面语言理解能力和大规模运动-文本预训练，我们的实验表明LLaMo在一般设置下实现了高保真度的文本到运动生成和运动到文本的字幕生成，特别是在零-shot运动生成方面，标志着朝着通用统一运动-语言大型模型迈出了重要一步。

cs.CV / 3 / 2602.12381

Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

基于 CLIP 的合成图像检测：理解与评估预测线索

Willi, Marco, Mathys, Melanie, Graber, Michael

Abstract

Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs. Unfortunately, SID methods often struggle to generalize to novel generative models and perform poorly in practical settings. CLIP, a foundational vision-language model that yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet the underlying relevant cues embedded in CLIP features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.

Chinese Translation

近期的生成模型能够生成近乎照片真实的图像，这对照片的可信度提出了挑战。因此，合成图像检测（SID）成为了一个重要的研究领域。先前的研究强调了合成图像与真实照片之间的差异——不幸的是，SID 方法往往难以推广到新型生成模型，并且在实际应用中表现不佳。CLIP（Contrastive Language-Image Pretraining）作为一个基础的视觉-语言模型，生成语义丰富的图像-文本嵌入，在 SID 中显示出较强的准确性和泛化能力。然而，嵌入在 CLIP 特征中的相关线索仍然未知。目前尚不清楚基于 CLIP 的检测器是否仅仅检测强烈的视觉伪影，或是利用微妙的语义偏差，这两者都可能使其在实际应用或高质量生成模型上失去效用。我们引入了 SynthCLIC，这是一个由真实照片和来自近期扩散模型的高质量合成图像组成的配对数据集，旨在减少 SID 中的语义偏差。通过使用可解释的线性头与去相关的激活以及基于文本的概念模型，我们分析了基于 CLIP 的检测器所学习的内容。基于 CLIP 的线性检测器在基于 GAN 的基准测试中达到了 0.96 的 mAP，但在我们的高质量扩散数据集 SynthCLIC 上仅为 0.92，并且在生成器家族之间的泛化能力降至 0.37 的 mAP。我们发现，这些检测器主要依赖于高级摄影属性（例如，极简风格、镜头光晕或深度分层），而非明显的生成器特定伪影。总体而言，基于 CLIP 的检测器表现良好，但在不同的生成架构之间的泛化能力不均衡。这突显了对模型持续更新和更广泛训练曝光的需求，同时强化了基于 CLIP 的方法作为更通用、稳健的 SID 的强大基础。

cs.CV / 4 / 2602.12393

Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models

再现 DragDiffusion：基于交互点的扩散模型编辑

Subhan, Ali, Raza, Ashir

Abstract

DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors' released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion-TMLR-Reproducibility-Challenge.

Chinese Translation

DragDiffusion 是一种基于扩散的交互式点图像编辑方法，允许用户通过直接拖动选定的点来操控图像。该方法声称通过在中间时间步优化单个扩散潜变量，结合保持身份的微调和空间正则化，可以实现准确的空间控制。本研究对 DragDiffusion 进行了可重复性研究，使用了作者发布的实现和 DragBench 基准。我们再现了关于扩散时间步选择、基于 LoRA 的微调、掩模正则化强度和 UNet 特征监督的主要消融研究，并观察到与原始工作中报告的定性和定量趋势高度一致。同时，我们的实验表明，性能对少量超参数假设非常敏感，特别是优化的时间步和用于运动监督的特征层，而其他组件则具有更广泛的操作范围。我们进一步评估了一种多时间步潜变量优化变体，发现它并未提高空间准确性，反而显著增加了计算成本。总体而言，我们的发现支持了 DragDiffusion 的核心主张，同时澄清了它们可靠可重复的条件。代码可在 https://github.com/AliSubhan5341/DragDiffusion-TMLR-Reproducibility-Challenge 获取。

cs.CV / 5 / 2602.12395

What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

强化学习对视觉推理的提升是什么？一种弗兰肯斯坦式分析

Li, Xirui, Li, Ming, Zhou, Tianyi

Abstract

Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. We find that RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
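Two of the framework's steps, parameter comparison and model merging, lend themselves to a small sketch. The abstract does not describe the exact procedures, so the per-layer L2 distance, the layer-swap merge, and the toy checkpoints below are assumptions chosen only to show the general shape of such an analysis.

```python
import math


def layer_update_norms(base, tuned):
    """Per-layer L2 distance between two checkpoints with identical shapes."""
    return {
        name: math.sqrt(sum((t - b) ** 2 for b, t in zip(base[name], tuned[name])))
        for name in base
    }


def merge_layers(base, tuned, take_from_tuned):
    """Build a 'Frankenstein' checkpoint: layers named in `take_from_tuned`
    come from the tuned model, all other layers from the base model."""
    return {
        name: list(tuned[name] if name in take_from_tuned else base[name])
        for name in base
    }


# Toy checkpoints: layer name -> flat list of weights.
sft = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0], "layer2": [2.0, 2.0]}
rl = {"layer0": [0.0, 0.1], "layer1": [1.0, 4.0], "layer2": [5.0, 6.0]}

norms = layer_update_norms(sft, rl)
merged = merge_layers(sft, rl, take_from_tuned={"layer1", "layer2"})
```

Evaluating `merged` against both source models would indicate whether the later-layer updates alone carry the gains, which is the kind of question a transferability test asks.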

Chinese Translation

具有可验证奖励的强化学习（RL）已成为提升视觉-语言模型视觉推理的标准后训练阶段，但仍不清楚与作为冷启动初始化的监督微调相比，RL 实际上改善了哪些能力。端到端基准的提升混合了多个因素，使得很难将改进归因于特定技能。为了解决这一问题，我们提出了一种弗兰肯斯坦式分析框架，包括：（i）通过因果探测进行功能定位；（ii）通过参数比较进行更新特征化；（iii）通过模型合并进行可迁移性测试。相反，RL 在推理时间上主要在中后层引起了一致的变化，而这些中后层的改进既是可迁移的（通过合并）又是必要的（通过冻结）以实现 RL 的收益。总体而言，我们的结果表明，RL 在视觉推理中的可靠贡献并不是视觉感知的统一增强，而是对中后层变换器计算的系统性细化，这改善了视觉与推理的对齐和推理性能，突显了仅依靠基准评估来理解多模态推理改进的局限性。

cs.CV / 6 / 2602.12401

ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

ZeroDiff++：零样本学习中的显著未见视觉-语义相关性

Ye, Zihan, Gowda, Shreyank N, Du, Kaile, Luo, Weijian, Shao, Ling

Abstract

Zero-shot Learning (ZSL) enables classifiers to recognize classes unseen during training, commonly via generative two-stage methods: (1) learn visual-semantic correlations from seen classes; (2) synthesize unseen class features from semantics to train classifiers. In this paper, we identify spurious visual-semantic correlations in existing generative ZSL, worsened by scarce seen-class samples, and introduce two metrics to quantify spuriousness for seen and unseen classes. Furthermore, we point out a more critical bottleneck: existing unadaptive fully noised generators produce features disconnected from real test samples, which also leads to the spurious correlation. To enhance the visual-semantic correlations on both seen and unseen classes, we propose ZeroDiff++, a diffusion-based generative framework. In training, ZeroDiff++ uses (i) diffusion augmentation to produce diverse noised samples, (ii) supervised contrastive (SC) representations for instance-level semantics, and (iii) multi-view discriminators with Wasserstein mutual learning to assess generated features. At generation time, we introduce (iv) Diffusion-based Test time Adaptation (DiffTTA) to adapt the generator using pseudo label reconstruction, and (v) Diffusion-based Test time Generation (DiffGen) to trace the diffusion denoising path and produce partially synthesized features that connect real and generated data, which further mitigates data scarcity. Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be made available.

Chinese Translation

零样本学习（ZSL）使分类器能够识别在训练过程中未见的类别，通常通过生成的两阶段方法实现：（1）从已见类别中学习视觉语义相关性；（2）从语义合成未见类别特征以训练分类器。在本文中，我们识别出现有生成性ZSL中的虚假视觉语义相关性，这种相关性因已见类别样本稀缺而加剧，并引入了两种度量标准来量化已见和未见类别的虚假性。此外，我们指出一个更为关键的瓶颈：现有的非自适应完全噪声生成器产生的特征与真实测试样本脱节，这也导致了虚假相关性。为了增强已见和未见类别的视觉-语义相关性，我们提出了ZeroDiff++，一个基于扩散的生成框架。在训练过程中，ZeroDiff++使用（i）扩散增强生成多样化的噪声样本，（ii）实例级语义的监督对比（SC）表示，以及（iii）具有Wasserstein互学习的多视图鉴别器来评估生成的特征。在生成时，我们引入（iv）基于扩散的测试时间自适应（DiffTTA）来使用伪标签重构适应生成器，以及（v）基于扩散的测试时间生成（DiffGen）来追踪扩散去噪路径并生成部分合成特征，以连接真实和生成的数据，并进一步缓解数据稀缺问题。在三个ZSL基准上的大量实验表明，ZeroDiff++不仅在现有ZSL方法上取得了显著的改进，而且即使在训练数据稀缺的情况下也保持了稳健的性能。代码将会公开。

cs.CV / 7 / 2602.12403

MonoLoss: A Training Objective for Interpretable Monosemantic Representations

MonoLoss：一种用于可解释单义表示的训练目标

Nasiri-Sarvi, Ali, Nguyen, Anh Tien, Rivaz, Hassan, Samaras, Dimitris, Hosseini, Mahdi S.

Abstract

Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs trained on CLIP, SigLIP2, and pretrained ViT features, using BatchTopK, TopK, and JumpReLU SAEs, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent's activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at https://github.com/AtlasAnalyticsLab/MonoLoss.

Chinese Translation

稀疏自编码器（Sparse Autoencoders, SAEs）将多义神经表示（其中神经元会对多个不相关的概念产生响应）分解为单义特征，这些特征捕捉单一且可解释的概念。然而，标准训练目标仅微弱地鼓励这种分解，现有的单义性度量需要对所有数据集样本进行成对比较，这在训练和评估过程中效率低下。我们研究了一种最近提出的 MonoScore 度量，并推导出一种单遍算法，该算法以线性增长的成本计算完全相同的量，而不是随着数据集图像数量的平方增长。在 OpenImagesV7 上，我们在评估中实现了高达 1200 倍的墙钟速度提升，在训练中实现了 159 倍的速度提升，同时每个周期仅增加约 4% 的开销。这使我们能够将 MonoScore 视为训练信号：我们引入了单义性损失（Monosemanticity Loss, MonoLoss），这是一种直接奖励语义一致激活的插件目标，用于学习可解释的单义表示。在使用 BatchTopK、TopK 和 JumpReLU SAEs 训练的基于 CLIP、SigLIP2 和预训练 ViT 特征的 SAEs 中，MonoLoss 提高了大多数潜变量的 MonoScore。MonoLoss 还在所有编码器和 SAE 组合中一致地提高了类别纯度（潜变量激活图像中属于其主导类别的比例），最大增益将基线纯度从 0.152 提高到 0.723。作为 ResNet-50 和 CLIP-ViT-B/32 微调过程中的辅助正则化器，MonoLoss 在 ImageNet-1K 上带来了高达 0.6% 的准确率提升，并在标准基准数据集上产生了单义激活模式。代码已公开发布在 https://github.com/AtlasAnalyticsLab/MonoLoss。
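
The class purity measure defined in the abstract (the fraction of a latent's activating images that belong to its dominant class) can be illustrated with a minimal sketch. The function name `class_purity`, the input format (one class label per activating image), and the handling of a latent with no activating images are assumptions made for illustration; how the paper decides which images count as "activating" is not stated in the abstract.

```python
from collections import Counter


def class_purity(labels_of_activating_images):
    """Fraction of a latent's activating images that belong to its dominant class."""
    if not labels_of_activating_images:
        # Assumption: a latent that never activates is assigned purity 0.
        return 0.0
    counts = Counter(labels_of_activating_images)
    # most_common(1) returns [(label, count)] for the most frequent label.
    dominant_count = counts.most_common(1)[0][1]
    return dominant_count / len(labels_of_activating_images)


# Three of the four activating images share the dominant class "dog".
purity = class_purity(["dog", "dog", "cat", "dog"])
```

A purity near 1 means the latent fires almost exclusively on one class, which is the behavior the reported 0.152 to 0.723 gain refers to.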

cs.CV / 8 / 2602.12441

Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction

基于原型驱动的病理与空间转录组学融合用于可解释的生存预测

Liu, Lihe, Pan, Xiaoxi, Yuan, Yinyin, Shang, Lulu

Abstract

Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework integrating co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, achieving superior or comparable performance to leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations, highlighting candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.

Chinese Translation

全幻灯片图像（WSIs）通过多实例学习（MIL）实现弱监督的预后建模。空间转录组学（ST）保留了原位基因表达，提供了补充形态学的空间分子背景。随着配对的WSI-ST队列扩展到人口水平，利用其互补的空间信号进行预后变得至关重要；然而，针对这一范式的原则性跨模态融合策略仍然有限。为此，我们提出了PathoSpatial，这是一个可解释的端到端框架，整合了配准的WSIs和ST，以学习空间信息驱动的预后表示。PathoSpatial在多层专家架构中使用任务引导的原型学习，自适应地协调无监督的模态内发现与监督的跨模态聚合。PathoSpatial的设计显著增强了可解释性，同时保持了区分能力。我们在一个具有配对ST和WSIs的三阴性乳腺癌队列上评估了PathoSpatial。PathoSpatial在五个生存终点上表现出强大且一致的性能，达到或超过了领先的单模态和多模态方法。PathoSpatial本质上支持事后原型解释和分子风险分解，提供定量的、生物学基础的解释，突出候选预后因素。我们展示了PathoSpatial作为可扩展和可解释的多模态学习在空间组学-病理融合中的概念验证。

cs.CV / 9 / 2602.12461

Semantic-aware Adversarial Fine-tuning for CLIP

面向语义的对抗性微调用于CLIP

Zhang, Jiacheng, Li, Jinhao, Huang, Hanxun, Erfani, Sarah M., Rubinstein, Benjamin I. P., Liu, Feng

Abstract

Recent studies have shown that the CLIP model's adversarial robustness in zero-shot classification tasks can be enhanced by adversarially fine-tuning its image encoder with adversarial examples (AEs), which are generated by minimizing the cosine similarity between images and a hand-crafted template (e.g., "A photo of a {label}"). However, it has been shown that the cosine similarity between a single image and a single hand-crafted template is insufficient to measure the similarity for image-text pairs. Building on this, in this paper, we find that the AEs generated using cosine similarity may fail to fool CLIP when the similarity metric is replaced with semantically enriched alternatives, making the image encoder fine-tuned with these AEs less robust. To overcome this issue, we first propose a semantic-ensemble attack to generate semantic-aware AEs by minimizing the average similarity between the original image and an ensemble of refined textual descriptions. These descriptions are initially generated by a foundation model to capture core semantic features beyond hand-crafted templates and are then refined to reduce hallucinations. To this end, we propose Semantic-aware Adversarial Fine-Tuning (SAFT), which fine-tunes CLIP's image encoder with semantic-aware AEs. Extensive experiments show that SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets. Our code is available at: https://github.com/tmlr-group/SAFT.

Chinese Translation

最近的研究表明，通过使用对抗样本（AEs）对CLIP模型的图像编码器进行对抗性微调，可以增强其在零样本分类任务中的对抗鲁棒性，这些对抗样本是通过最小化图像与手工制作模板（例如，“一张{label}的照片”）之间的余弦相似度生成的。然而，已有研究表明，单个图像与单个手工制作模板之间的余弦相似度不足以衡量图像-文本对的相似性。在此基础上，本文发现，当将相似性度量替换为语义丰富的替代方案时，使用余弦相似度生成的对抗样本可能无法欺骗CLIP，从而导致使用这些对抗样本微调的图像编码器鲁棒性降低。为了解决这个问题，我们首先提出了一种语义集成攻击，通过最小化原始图像与一组精炼文本描述之间的平均相似度来生成面向语义的对抗样本。这些描述最初由基础模型生成，以捕捉超越手工制作模板的核心语义特征，然后经过精炼以减少幻觉。为此，我们提出了面向语义的对抗性微调（SAFT），该方法使用面向语义的对抗样本对CLIP的图像编码器进行微调。大量实验表明，SAFT优于当前方法，在16个数据集上实现了零样本对抗鲁棒性的显著提升。我们的代码可在以下链接获取：https://github.com/tmlr-group/SAFT。
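
The quantity the semantic-ensemble attack minimizes, as described in the abstract, is the average similarity between an image and an ensemble of textual descriptions. The sketch below computes that average for plain Python vectors. It assumes cosine similarity as the per-pair measure and uses hypothetical names (`cosine_similarity`, `ensemble_similarity`); it does not reproduce the SAFT code, the attack optimization, or CLIP's encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ensemble_similarity(image_embedding, text_embeddings):
    """Average similarity between one image embedding and several text embeddings."""
    total = sum(cosine_similarity(image_embedding, t) for t in text_embeddings)
    return total / len(text_embeddings)


# Toy 2-D embeddings: one description aligned with the image, one orthogonal to it.
score = ensemble_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

An attack in this spirit would perturb the image so that `score` decreases, rather than lowering the similarity to a single template.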

cs.CV / 10 / 2602.12484

A Lightweight and Explainable DenseNet-121 Framework for Grape Leaf Disease Classification

一种轻量级且可解释的 DenseNet-121 框架用于葡萄叶病害分类

Haque, Md. Ehsanul, Polash, Md. Saymon Hosen, Ovi, Rakib Hasan, Bulbul, Aminul Kader, Siam, Md Kamrul, Saykat, Tamim Hasan

Abstract

Grapes are among the most economically and culturally significant fruits on a global scale, and table grapes and wine are produced in significant quantities in Europe and Asia. The production and quality of grapes are significantly impacted by grape diseases such as Bacterial Rot, Downy Mildew, and Powdery Mildew. Consequently, the sustainable management of a vineyard necessitates the early and precise identification of these diseases. Current automated methods, particularly those that are based on the YOLO framework, are often computationally costly and lack interpretability, which makes them unsuitable for real-world scenarios. This study proposes grape leaf disease classification using Optimized DenseNet 121. Domain-specific preprocessing and extensive connectivity reveal disease-relevant characteristics, including veins, edges, and lesions. An extensive comparison with baseline CNN models, including ResNet18, VGG16, AlexNet, and SqueezeNet, demonstrates that the proposed model exhibits superior performance. It achieves an accuracy of 99.27%, an F1 score of 99.28%, a specificity of 99.71%, and a Kappa of 98.86%, with an inference time of 9 seconds. The cross-validation findings show a mean accuracy of 99.12%, indicating robustness and generalizability across all classes. We also employ Grad-CAM to highlight disease-related regions, both to ensure the model focuses on physiologically relevant aspects and to increase transparency and confidence. Model optimization reduces processing requirements for real-time deployment, while transfer learning ensures consistency on smaller and unbalanced samples. An effective architecture, domain-specific preprocessing, and interpretable outputs make the proposed framework scalable, precise, and computationally inexpensive for detecting grape leaf diseases.

Chinese Translation

葡萄是全球经济和文化上最重要的水果之一，欧洲和亚洲生产了大量的鲜食葡萄和葡萄酒。葡萄的生产和质量受到葡萄病害的显著影响，如细菌腐烂、霜霉病和白粉病。因此，葡萄园的可持续管理需要对这些病害进行早期和准确的识别。目前的自动化方法，特别是基于 YOLO 框架的方法，通常计算成本高且缺乏可解释性，使其不适合实际应用。本研究提出了一种基于优化的 DenseNet 121 的葡萄叶病害分类方法。特定领域的预处理和广泛的连接揭示了与病害相关的特征，包括叶脉、边缘和病变。与基线卷积神经网络（CNN）模型的广泛比较，包括 ResNet18、VGG16、AlexNet 和 SqueezeNet，表明所提出的模型表现优越。其准确率达到 99.27%，F1 分数为 99.28%，特异性为 99.71%，Kappa 值为 98.86%，推理时间为 9 秒。交叉验证结果显示平均准确率为 99.12%，表明该模型在所有类别中具有强度和泛化能力。我们还采用 Grad-CAM 方法突出与疾病相关的区域，以确保模型强调生理相关的方面，提高透明度和信心。模型优化减少了实时部署的处理需求，而迁移学习确保在较小和不平衡样本上的一致性。有效的架构、特定领域的预处理和可解释的输出使得所提出的框架在检测葡萄叶病害方面具有可扩展性、精确性和计算成本低的优势。

cs.CV / 11 / 2602.12486

Human-Like Coarse Object Representations in Vision Models

视觉模型中的类人粗略物体表征

Gizdov, Andrey, Procopio, Andrea, Li, Yichen, Harari, Daniel, Ullman, Tomer

Abstract

Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.

Chinese Translation

人类似乎以粗糙的、体积性的“身体”来表征物体，以便进行直观的物理推理——在高效的物理预测与细致的视觉细节之间进行权衡——但其内部结构在很大程度上仍然未知。相比之下，分割模型优化像素精确的掩膜，这可能与这些身体不对齐。我们探讨这些模型是否以及何时能够获得类人的身体。通过时间碰撞（TTC）行为范式，我们引入了一种比较管道和对齐度量，然后通过修剪改变模型的训练时间、大小和有效容量。在所有操作中，与人类行为的对齐呈现出一个反U型曲线：小型/短暂训练/修剪的模型将物体欠分割为团块；大型/完全训练的模型则因边界抖动而过度分割；而中间的“理想身体粒度”则最佳匹配人类。这表明类人粗糙身体的出现是由于资源限制，而非特定偏见，并指出了简单的调节方式——早期检查点、适度架构、轻度修剪——以引发物理高效的表征。我们将这些结果置于资源理性框架中，平衡识别细节与物理可供性。

cs.CV / 12 / 2602.12489

Insertion Network for Image Sequence Correspondence

图像序列对应的插入网络

Su, Dingjie, Hong, Weixiang, Dawant, Benoit M., Landman, Bennett A.

Abstract

We propose a novel method for establishing correspondence between two sequences of 2D images. One particular application of this technique is slice-level content navigation, where the goal is to localize specific 2D slices within a 3D volume or determine the anatomical coverage of a 3D scan based on its 2D slices. This serves as an important preprocessing step for various diagnostic tasks, as well as for automatic registration and segmentation pipelines. Our approach builds sequence correspondence by training a network to learn how to insert a slice from one sequence into the appropriate position in another. This is achieved by encoding contextual representations of each slice and modeling the insertion process using a slice-to-slice attention mechanism. We apply this method to localize manually labeled key slices in body CT scans and compare its performance to the current state-of-the-art alternative known as body part regression, which predicts anatomical position scores for individual slices. Unlike body part regression, which treats each slice independently, our method leverages contextual information from the entire sequence. Experimental results show that the insertion network reduces slice localization errors in supervised settings from 8.4 mm to 5.4 mm, demonstrating a substantial improvement in accuracy.

Chinese Translation

我们提出了一种新颖的方法，用于建立两组二维图像序列之间的对应关系。这项技术的一个特定应用是切片级内容导航，其目标是在三维体积中定位特定的二维切片，或根据其二维切片确定三维扫描的解剖覆盖范围。这为各种诊断任务以及自动配准和分割流程提供了重要的预处理步骤。我们的方法通过训练网络学习如何将一个序列中的切片插入到另一个序列的适当位置，从而建立序列对应关系。这是通过对每个切片进行上下文表示编码，并使用切片到切片的注意力机制建模插入过程来实现的。我们将该方法应用于定位手动标记的关键切片在身体CT扫描中的位置，并将其性能与当前最先进的替代方案——身体部位回归（body part regression）进行比较，该方案为单个切片预测解剖位置分数。与独立处理每个切片的身体部位回归不同，我们的方法利用了整个序列的上下文信息。实验结果表明，插入网络在监督设置中将切片定位误差从8.4毫米降低到5.4毫米，显示出显著的准确性提升。

cs.CV / 13 / 2602.12498

Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models

针对医学视觉语言模型改进否定处理的层特定微调

Abbasi, Ali, Taghipour, Mehdi, Beheshti, Rahmatollah

Abstract

Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer's update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at https://github.com/healthylaife/NAST.

Chinese Translation

否定是临床报告中的一种基本语言操作，但视觉语言模型（VLMs）常常无法区分肯定与否定的医学陈述。为了系统性地描述这一局限性，我们引入了一个特定于放射学的诊断基准，评估在受控临床条件下的极性敏感性，揭示了常见医学VLMs在否定和非否定发现之间的混淆。为了使学习超越简单的条件缺失，我们进一步构建了一个上下文临床否定数据集，该数据集编码了结构化的主张，并支持涉及位置和严重性的属性级否定。在这些资源的基础上，我们提出了否定感知选择性训练（Negation-Aware Selective Training, NAST），这是一种基于可解释性的适应方法，利用因果追踪效应（Causal Tracing Effects, CTEs）在微调过程中调节层级梯度更新。NAST并不是应用统一的学习率，而是根据每一层对否定处理的因果贡献来调整其更新，从而将机械可解释性信号转化为原则性的优化规则。实验表明，在不降低一般视觉语言对齐的情况下，能够改善肯定和否定临床陈述的区分能力，突显了因果可解释性在安全关键医学环境中针对性模型适应的价值。代码和资源可在 https://github.com/healthylaife/NAST 获取。
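
One way to picture "scaling each layer's update according to its causal contribution" is a per-layer learning rate derived from causal tracing effect (CTE) scores. The sketch below is an assumption-laden illustration, not the NAST rule: the abstract does not give the scaling formula, so normalizing by the largest score, clipping negative scores to zero, and the names `layer_learning_rates` and `cte_by_layer` are all hypothetical choices.

```python
def layer_learning_rates(base_lr, cte_by_layer):
    """Scale a base learning rate per layer by its normalized CTE score.

    Assumption: scores are normalized by the largest score, so the most
    causally relevant layer trains at base_lr and the others train slower.
    """
    max_cte = max(cte_by_layer.values())
    if max_cte <= 0:
        # Assumption: with no positive signal, fall back to a uniform rate.
        return {name: base_lr for name in cte_by_layer}
    return {
        name: base_lr * max(cte, 0.0) / max_cte
        for name, cte in cte_by_layer.items()
    }


# Hypothetical CTE scores for two layers.
rates = layer_learning_rates(1e-4, {"layer_0": 0.1, "layer_1": 0.4})
```

In a training framework, such a mapping would typically be passed to the optimizer as per-layer parameter groups, each with its own learning rate.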

cs.CV / 14 / 2602.12515

Matching of SAR and optical images based on transformation to shared modality

基于转化为共享模态的SAR与光学图像匹配

Borisov, Alexey, Myasnikov, Evgeny, Myasnikov, Vladislav

Abstract

Significant differences in optical images and Synthetic Aperture Radar (SAR) images are caused by fundamental differences in the physical principles underlying their acquisition by Earth remote sensing platforms. These differences make precise image matching (co-registration) of these two types of images difficult. In this paper, we propose a new approach to image matching of optical and SAR images, which is based on transforming the images to a new modality. The new image modality is common to both optical and SAR images and satisfies the following conditions. First, the transformed images must have an equal pre-defined number of channels. Second, the transformed and co-registered images must be as similar as possible. Third, the transformed images must be non-degenerate, meaning they must preserve the significant features of the original images. To further match images transformed to this shared modality, we train the RoMa image matching model, which is one of the leading solutions for matching of regular digital photographs. We evaluated the proposed approach on the publicly available MultiSenGE dataset containing both optical and SAR images. We demonstrated its superiority over alternative approaches based on image translation between original modalities and various feature matching algorithms. The proposed solution not only provides better quality of matching, but is also more versatile. It enables the use of ready-made RoMa and DeDoDe models, pre-trained for regular images, without retraining for a new modality, while maintaining high-quality matching of optical and SAR images.

Chinese Translation

光学图像与合成孔径雷达（SAR）图像之间存在显著差异，这些差异源于其在地球遥感平台上获取的物理原理的根本不同。这些差异使得这两种类型图像的精确匹配（共注册）变得困难。本文提出了一种新的光学图像与SAR图像匹配方法，该方法基于将图像转化为一种新的模态。该新图像模态对光学图像和SAR图像是共同的，并满足以下条件：首先，转化后的图像必须具有相同的预定义通道数；其次，转化后的图像与共注册图像必须尽可能相似；第三，转化后的图像必须是非退化的，即它们必须保留原始图像的重要特征。为了进一步匹配转化为这种共享模态的图像，我们训练了RoMa图像匹配模型，该模型是匹配常规数字照片的领先解决方案之一。我们在公开可用的MultiSenGE数据集上评估了所提出的方法，该数据集包含光学图像和SAR图像。我们证明了该方法优于基于原始模态之间图像转换和各种特征匹配算法的替代方法。所提出的解决方案不仅提供了更好的匹配质量，而且更加灵活。它使得可以使用为常规图像预训练的现成RoMa和DeDoDe模型，而无需为新模态重新训练，同时保持光学图像与SAR图像的高质量匹配。

cs.CV / 15 / 2602.12524

LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

基于LiDAR的协作蒸馏方法用于稳健的2D表示

Jo, Wonjun, Ha, Hyunwoo, Ji-Yeon, Kim, Jeong, Hawook, Oh, Tae-Hyun

Abstract

As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short under noisy and adverse weather conditions beyond clear daytime scenes, which require robust visual perception. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method also improves 3D awareness stemming from LiDAR's characteristics. This advancement highlights our method's practicality and adaptability in real-world scenarios.

Chinese Translation

随着深度学习的不断进步，自监督学习取得了显著的进展。它使得2D图像编码器能够提取有用的特征，以应对各种下游任务，包括与基于视觉的系统相关的任务。然而，预训练的2D图像编码器在噪声和恶劣天气条件下的表现不尽如人意，无法有效处理超出清晰白天场景的任务，这就需要稳健的视觉感知。为了解决这些问题，我们提出了一种新颖的自监督方法——协作蒸馏（Collaborative Distillation），该方法利用3D LiDAR作为自监督信号，旨在提高2D图像编码器在噪声和恶劣天气条件下的稳健性，同时保留其原有能力。我们的算法在多种条件下的各种下游任务中优于竞争方法，并展现出强大的泛化能力。此外，我们的方法还增强了基于LiDAR特性的3D感知能力。这一进展突显了我们方法在现实场景中的实用性和适应性。

cs.CV / 16 / 2602.12525

Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space

通过局部对偶空间对 P3P 问题的奇异配置进行几何分层

Sun, Xueying, Li, Zijia, Li, Nan

Abstract

This paper investigates singular configurations of the P3P problem. Using local dual space, a systematic algebraic-computational framework is proposed to give a complete geometric stratification for the P3P singular configurations with respect to the multiplicity $\mu$ of the camera center $O$: for $\mu\ge 2$, $O$ lies on the "danger cylinder"; for $\mu\ge 3$, $O$ lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle; and for $\mu\ge 4$, $O$ lies on the circumcircle, which indeed corresponds to infinitely many P3P solutions. Furthermore, a geometric stratification for the complementary configuration $O^\prime$ associated with a singular configuration $O$ is studied as well: for $\mu\ge 2$, $O^\prime$ lies on a deltoidal surface associated with the danger cylinder, and for $\mu\ge 3$, $O^\prime$ lies on one of three cuspidal curves of the deltoidal surface.

Chinese Translation

本文研究了 P3P 问题的奇异配置。通过局部对偶空间，提出了一种系统的代数计算框架，以便对 P3P 奇异配置进行完整的几何分层，具体取决于相机中心 $O$ 的重数 $\mu$：当 $\mu\ge 2$ 时，$O$ 位于“危险圆柱”上；当 $\mu\ge 3$ 时，$O$ 位于与第一莫雷三角形或外接圆相关的危险圆柱的三条发生线之一上；当 $\mu\ge 4$ 时，$O$ 位于外接圆上，这实际上对应于无限多个 P3P 解。此外，本文还研究了与奇异配置 $O$ 相关的补充配置 $O^\prime$ 的几何分层：当 $\mu\ge 2$ 时，$O^\prime$ 位于与危险圆柱相关的菱形曲面上；当 $\mu\ge 3$ 时，$O^\prime$ 位于菱形曲面的三条尖点曲线之一上。

cs.CV / 17 / 2602.12540

Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

基于自监督JEPA的LiDAR占用补全与预测世界模型

Zhu, Haoran, Choromanska, Anna

Abstract

Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build world models that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; joint-embedding predictive architecture (JEPA) enables learning world models via leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose AD-LiST-JEPA, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show better OCF performance with a pretrained encoder after JEPA-based world model learning.

Chinese Translation

作为在物理世界中运作的智能体，自动驾驶需要具备构建世界模型的基本能力，以捕捉环境在时空上的演变，从而支持长期规划。同时，扩展性要求以自监督的方式学习此类模型；联合嵌入预测架构（JEPA）使得通过利用大量未标记数据而不依赖昂贵的人类注释来学习世界模型成为可能。在本文中，我们提出了 AD-LiST-JEPA，一种自监督的自动驾驶世界模型，它利用JEPA框架从LiDAR数据中预测未来的时空演变。我们通过下游的基于LiDAR的占用补全与预测（OCF）任务评估所学习表示的质量，该任务共同评估感知与预测。概念验证实验表明，在JEPA基础的世界模型学习后，预训练编码器的OCF性能更佳。

cs.CV / 18 / 2602.12561

PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

PLLM：用于计算机辅助设计程序合成的伪标签大语言模型

Li, Yuanbo, Shu, Dule, Chen, Yanying, Klenk, Matt, Ritchie, Daniel

Abstract

Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. Experiments on adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.

Chinese Translation

从三维几何体恢复计算机辅助设计（CAD）程序是一个广泛研究的问题。近年来，大语言模型（LLMs）的进展推动了CAD程序合成的发展，但现有方法依赖于配对的形状-程序数据进行监督训练，而这些数据通常不可用。我们提出了PLLM，一个用于从未标记的三维形状合成CAD程序的自我训练框架。给定一个经过预训练的CAD能力大语言模型和一个形状数据集，PLLM迭代地采样候选程序，选择高保真执行，并增强程序以构建合成的程序-形状对进行微调。我们在将DeepCAD中的CAD-Recode适应到未标记的ABC数据集上的实验显示，在几何保真度和程序多样性方面取得了一致的改进。
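
The sample, select, augment loop described in the abstract can be sketched as a single pseudo-labeling round. Everything beyond those three steps is an assumption for illustration: the callables `sample_programs`, `execute`, `fidelity`, and `augment` are hypothetical stand-ins for the LLM sampler, the CAD executor, a geometric fidelity score, and a program augmentation routine, and keeping only the best candidate above a `threshold` is one plausible selection rule rather than the one the paper necessarily uses.

```python
def pseudo_label_round(shapes, sample_programs, execute, fidelity, augment, threshold):
    """Build synthetic (shape, program) pairs from unlabeled shapes.

    All callables are hypothetical placeholders supplied by the caller.
    """
    pairs = []
    for shape in shapes:
        best_program, best_score = None, float("-inf")
        # Sample candidates and keep the one whose execution best matches the shape.
        for program in sample_programs(shape):
            score = fidelity(execute(program), shape)
            if score > best_score:
                best_program, best_score = program, score
        # Select only high-fidelity executions as pseudo-labels.
        if best_program is not None and best_score >= threshold:
            pairs.append((shape, best_program))
            # Assumption: each augmented program is paired with its own execution,
            # which makes the new pair consistent by construction.
            for variant in augment(best_program):
                pairs.append((execute(variant), variant))
    return pairs
```

The returned pairs would then serve as fine-tuning data, after which the round can be repeated with the updated model.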

cs.CV / 19 / 2602.12563

The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

恒定的视角：自主驾驶中外观鲁棒性的基准测试与桥接

Wang, Jiabao, Zhou, Hongyu, Yang, Yuanbo, Shao, Jiahao, Liao, Yiyi

Abstract

Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.

Chinese Translation

尽管取得了快速进展，自主驾驶算法在分布外（OOD）条件下仍然极其脆弱。我们在当前研究中识别出一个关键的解耦失败：缺乏对基于外观的变化（如天气和光照）与结构场景变化之间的区分。这留下了一个基本问题未得到解答：规划者的失败是由于复杂的道路几何，还是仅仅因为下雨？为了解决这个问题，我们建立了 navdream，一个高保真鲁棒性基准，利用生成的像素对齐风格迁移。通过创建一个几何偏差微乎其微的视觉压力测试，我们隔离了外观对驾驶性能的影响。我们的评估显示，现有的规划算法在OOD外观条件下通常表现出显著的性能下降，即使基础场景结构保持一致。为了解决这一问题，我们提出了一个通用感知接口，利用一个冻结的视觉基础模型（DINOv3）。通过提取外观不变特征作为规划者的稳定接口，我们在包括基于回归、基于扩散和基于评分的模型在内的多种规划范式中实现了卓越的零样本泛化。我们的即插即用解决方案在极端外观变化下保持一致的性能，而无需进一步的微调。基准测试和代码将会公开。

cs.CV / 20 / 2602.12590

Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

通过函数反向传播实现事件分箱的无偏梯度估计

Chen, Jinze, Zhai, Wei, Han, Han, Ma, Tiankai, Cao, Yang, Li, Bin, Zha, Zheng-Jun

Abstract

Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2% lower RMS error and 1.57x faster convergence. On complex downstream tasks, we achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at https://github.com/chjz1024/EventFBP.

Chinese Translation

基于事件的视觉将动态场景编码为称为事件的异步时空脉冲。为了利用传统的图像处理管道，事件通常被分箱为帧。然而，分箱函数是不连续的，这在帧级别截断了梯度，迫使大多数基于事件的算法仅依赖于基于帧的特征。直接从原始事件学习的尝试避免了这一限制，但由于分箱操作的不连续性，反而遭受了偏置梯度估计的问题，最终限制了它们的学习效率。为了解决这一挑战，我们提出了一种新颖的框架，通过在反向传播过程中合成弱导数来实现任意分箱函数的无偏梯度估计，同时保持前向输出不变。关键思路是利用分部积分法：将目标函数提升为泛函，在反向传播过程中产生分箱函数的积分形式的导数，其中余切函数自然出现。通过从采样的余切向量重建该余切函数，我们计算出弱导数，这些弱导数在平滑和非平滑目标的长程有限差分中证明是匹配的。实验表明，我们的方法在简单的基于优化的自运动估计中将均方根误差降低了3.2%，并且收敛速度提高了1.57倍。在复杂的下游任务中，我们在自监督光流中实现了9.4%的平均端点误差降低，在SLAM中实现了5.1%的均方根误差降低，展示了基于事件的视觉感知的广泛优势。源代码可在 https://github.com/chjz1024/EventFBP 找到。
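
To make the discontinuity mentioned in the abstract concrete, the sketch below bins event timestamps into a fixed number of temporal bins by hard assignment. It is a generic illustration of event binning, not the paper's method: the function name `bin_events`, the use of timestamps only (pixel coordinates and polarity are ignored), and the equal-width bins are simplifying assumptions.

```python
def bin_events(timestamps, t_start, t_end, num_bins):
    """Count events per temporal bin using hard (nearest-bin) assignment."""
    counts = [0] * num_bins
    bin_width = (t_end - t_start) / num_bins
    for t in timestamps:
        if t_start <= t < t_end:
            index = int((t - t_start) / bin_width)
            # Guard against floating-point rounding at the upper edge.
            index = min(index, num_bins - 1)
            counts[index] += 1
    return counts


# An event at t = 0.49 lands in the first bin; moving it to t = 0.51 flips it
# into the second bin, so the output jumps rather than changing smoothly.
before = bin_events([0.1, 0.49], 0.0, 1.0, 2)
after = bin_events([0.1, 0.51], 0.0, 1.0, 2)
```

Between bin edges the counts do not change at all when a timestamp moves, and at an edge they jump by one, which is why ordinary backpropagation yields zero or undefined gradients with respect to event timestamps.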

cs.CV / 21 / 2602.12609

QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

QuEPT：具有一次性校准的量化弹性精度变换器，用于多位切换

Xu, Ke, Wang, Yixin, Li, Zhongcheng, Cui, Hao, Hu, Jinshui, Zhang, Xingyi

Abstract

Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, owing to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT

Chinese Translation

弹性精度量化通过单次优化过程实现多位部署，适应多样的量化场景。然而，与变换器架构相关的高存储和优化成本使得弹性量化的研究仍然有限，特别是在大型语言模型方面。本文提出了QuEPT，一种高效的后训练方案，通过在小数据切片上进行一次性校准来重构块级多位误差。它可以通过级联不同的低秩适配器动态适应各种预定义的位宽，并支持在均匀量化和混合精度量化之间实时切换，而无需重复优化。为了增强准确性和鲁棒性，我们引入了多位令牌合并（Multi-Bit Token Merging, MB-ToMe），动态融合不同位宽的令牌特征，提高位宽切换过程中的鲁棒性。此外，我们提出了多位级联低秩适配器（Multi-Bit Cascaded Low-Rank adapters, MB-CLoRA），以增强位宽组之间的相关性，进一步提升QuEPT的整体性能。大量实验表明，QuEPT在性能上与现有的最先进的后训练量化方法相当或更优。我们的代码可在 https://github.com/xuke225/QuEPT 获取。

cs.CV / 22 / 2602.12618

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

通过注意力驱动的自我压缩实现视觉标记减少，以提高多模态大型语言模型的效率

Deniz, Omer Faruk, Mao, Ruiyu, Li, Ruochen, Tian, Yapeng, Khan, Latifur

Abstract

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.

Chinese Translation

多模态大型语言模型（MLLMs）在处理大量视觉标记时，需在所有LLM层中消耗显著的计算资源。以往的剪枝方法要么在LLM之前进行，因编码器-投影器设计的多样性而限制了通用性，要么在LLM内部使用与FlashAttention不兼容的启发式方法。我们采取了不同的方法：不是识别不重要的标记，而是将LLM本身视为压缩的最佳指导。我们观察到，深层网络自然传递视觉到文本的信息，因此引入了注意力驱动的自我压缩（ADSC），这是一种简单且广泛适用的方法，利用LLM的注意力机制逐步减少视觉标记。我们的方法在选定层应用均匀的标记下采样，形成瓶颈，促使模型重组并压缩信息到剩余标记中。该方法无需计算分数、辅助模块或修改注意力，且与FlashAttention完全兼容。应用于LLaVA-1.5时，ADSC将FLOPs减少了53.7%，峰值KV-cache内存减少了56.7%，同时保持了98.2%的原始模型性能。在多个基准测试中，其在效率和准确性上均优于以往的剪枝方法。重要的是，在高压缩比下，我们的方法仍然保持稳健，而基于启发式的技术则急剧下降。
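As a rough illustration of the uniform downsampling step described in this abstract, the sketch below keeps every second vision token at a few chosen layers and leaves text tokens untouched. The layer indices, the stride, the token counts, the vision-first token layout, and the helper name `downsample_vision_tokens` are all assumptions made for illustration; the abstract does not specify them, and the actual method runs inside an LLM rather than on a bare array.

```python
import numpy as np

def downsample_vision_tokens(hidden, num_vision, stride=2):
    """Keep every `stride`-th vision token; text tokens pass through unchanged.

    hidden: array of shape (num_vision + num_text, dim), vision tokens first
            (an assumed layout for this sketch).
    Returns the shortened sequence and the new vision-token count.
    """
    vision = hidden[:num_vision:stride]   # uniform selection, no importance scores
    text = hidden[num_vision:]
    return np.concatenate([vision, text], axis=0), vision.shape[0]

# Hypothetical setup: 576 vision tokens, 32 text tokens, 32 layers.
hidden = np.zeros((576 + 32, 8))
num_vision = 576
bottleneck_layers = {8, 16, 24}           # assumed choice of layers

for layer in range(32):
    # (a transformer layer would update `hidden` here)
    if layer in bottleneck_layers:
        hidden, num_vision = downsample_vision_tokens(hidden, num_vision)
```

Because the selection is a fixed slice, it needs no access to attention weights, which is consistent with the abstract's claim of compatibility with FlashAttention.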

cs.CV / 23 / 2602.12640

ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

ImageRAGTurbo：基于检索增强扩散模型的一步文本到图像生成

Qiu, Peijie, Ramshankar, Hariharan, Ramisa, Arnau, Vidal, René, C, Amit Kumar K, Salaka, Vamsi, Bhagat, Rahul

Abstract

Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space ($\mathcal{H}$-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.

Chinese Translation

扩散模型已成为文本到图像生成的主要方法。然而，其迭代采样过程逐渐将随机噪声转变为连贯图像，导致显著的延迟，限制了其应用。尽管最近的少步扩散模型将采样步骤减少到一到四步，但它们往往牺牲图像质量和提示对齐，尤其是在一步生成中。此外，这些模型需要计算成本高昂的训练过程。为了解决这些局限性，我们提出了ImageRAGTurbo，一种通过检索增强有效微调少步扩散模型的新方法。给定一个文本提示，我们从数据库中检索相关的文本-图像对，并用它们来条件生成过程。我们认为，这些检索到的示例为UNet去噪器提供了丰富的上下文信息，有助于在不影响图像质量的情况下减少去噪步骤的数量。实际上，我们的初步研究表明，使用检索内容编辑去噪器的潜在空间（$\mathcal{H}$-space）而无需额外微调，已经提高了提示的保真度。为了进一步提高生成图像的质量，我们在$\mathcal{H}$-space中增强了UNet去噪器，使用可训练的适配器，通过交叉注意机制高效地将检索内容与目标提示融合。关于快速文本到图像生成的实验结果表明，我们的方法在不影响延迟的情况下，能够生成高保真度的图像，相较于现有方法表现更佳。

cs.CV / 24 / 2602.12649

Multi-Task Learning with Additive U-Net for Image Denoising and Classification

基于加性 U-Net 的图像去噪与分类的多任务学习

Lakkavalli, Vikram, Sinha, Neelam

Abstract

We investigate additive skip fusion in U-Net architectures for image denoising and denoising-centric multi-task learning (MTL). By replacing concatenative skips with gated additive fusion, the proposed Additive U-Net (AddUNet) constrains shortcut capacity while preserving fixed feature dimensionality across depth. This structural regularization induces controlled encoder-decoder information flow and stabilizes joint optimization. Across single-task denoising and joint denoising-classification settings, AddUNet achieves competitive reconstruction performance with improved training stability. In MTL, learned skip weights exhibit systematic task-aware redistribution: shallow skips favor reconstruction, while deeper features support discrimination. Notably, reconstruction remains robust even under limited classification capacity, indicating implicit task decoupling through additive fusion. These findings show that simple constraints on skip connections act as an effective architectural regularizer for stable and scalable multi-task learning without increasing model complexity.

Chinese Translation

我们研究了 U-Net 架构中的加性跳跃融合，用于图像去噪和以去噪为中心的多任务学习（MTL）。通过用门控加性融合替换连接跳跃，所提出的加性 U-Net（AddUNet）在保持固定特征维度的同时限制了快捷通道的容量。这种结构正则化诱导了受控的编码器-解码器信息流，并稳定了联合优化。在单任务去噪和联合去噪-分类设置中，AddUNet 实现了具有竞争力的重建性能，并提高了训练的稳定性。在 MTL 中，学习到的跳跃权重表现出系统性的任务感知重分配：浅层跳跃更倾向于重建，而深层特征则支持判别。值得注意的是，即使在分类能力有限的情况下，重建仍然保持稳健，表明通过加性融合实现了隐式任务解耦。这些发现表明，对跳跃连接施加简单约束可以作为一种有效的结构正则化器，以实现稳定且可扩展的多任务学习，而无需增加模型复杂性。
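To make the contrast between concatenative and gated additive skips concrete, here is a minimal sketch. The scalar sigmoid gate and the function names are assumptions for illustration; the abstract only states that the fusion is gated and additive and that feature dimensionality stays fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def concat_skip(decoder_feat, encoder_feat):
    """Standard U-Net skip: the channel count grows after fusion."""
    return np.concatenate([decoder_feat, encoder_feat], axis=0)

def gated_additive_skip(decoder_feat, encoder_feat, gate_logit):
    """Additive skip with a scalar gate in (0, 1); the channel count is unchanged.

    A single learnable scalar per skip is an assumption of this sketch.
    """
    return decoder_feat + sigmoid(gate_logit) * encoder_feat

dec = np.ones((16, 32, 32))           # (channels, height, width)
enc = np.full((16, 32, 32), 2.0)

fused_cat = concat_skip(dec, enc)                             # 32 channels
fused_add = gated_additive_skip(dec, enc, gate_logit=0.0)     # still 16 channels
```

With a gate of this kind, the learned value directly indicates how much each encoder level contributes, which is the sort of quantity the abstract refers to when it discusses learned skip weights.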

cs.CV / 25 / 2602.12652

CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

CBEN -- 一种用于云稳健遥感图像理解的多模态机器学习数据集

Stricker, Marco, Iwamura, Masakazu, Kise, Koichi

Abstract

Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature, analysis is often restricted to cloudless conditions, with cloudy images excluded from machine learning datasets and methods. Such an approach cannot be applied to time-sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloud-free solutions do not fail under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud-robust methods that are less affected by cloudy weather. Cloud-robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving a relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: https://github.com/mstricker13/CBEN

Chinese Translation

云是扭曲光学卫星图像的常见现象，这给遥感带来了挑战。然而，在文献中，通常会进行无云分析，即将多云图像排除在机器学习数据集和方法之外。这种方法无法应用于时间敏感的应用场景，例如自然灾害期间。一个可能的解决方案是将云去除作为预处理步骤，以确保在这种情况下无云的解决方案不会失败。但云去除方法仍在积极研究中，并存在一些缺点，例如生成视觉伪影。因此，开发对多云天气影响较小的云稳健方法是可取的。云稳健方法可以通过将光学数据与雷达数据结合来实现，后者不受云的影响。虽然许多机器学习数据集结合了光学和雷达数据，但大多数研究人员排除了多云图像。我们将这种从机器学习训练和评估中排除多云图像的做法视为一种限制，降低了其在多云场景中的适用性。为此，我们组建了一个名为CloudyBigEarthNet (CBEN) 的数据集，包含配对的光学和雷达图像，带有云遮挡，用于训练和评估。使用平均精度（AP）作为评估指标，我们展示了在结合晴空光学和雷达图像上训练的最先进方法在评估多云图像时性能下降了23-33个百分点。然后，我们在训练期间将这些方法适应于多云光学数据，与原始方法相比，在多云测试案例上实现了17.2-28.7个百分点的相对提升。代码和数据集可在以下网址公开获取：https://github.com/mstricker13/CBEN

cs.CV / 26 / 2602.12659

IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models

IndicFairFace：用于审计和减轻视觉-语言模型地理偏差的平衡印度人脸数据集

Mohsin, Aarish Shah, Khan, Mohammed Tayyab Ilyas, Nadeem, Mohammad, Sohail, Shahab Saquib, Cambria, Erik, Gao, Jiechao

Abstract

Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat "Indian" as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.

Chinese Translation

视觉-语言模型（VLMs）已知会从其大规模网络训练数据中继承和放大社会偏见，而印度在其中尤其被误代表。现有的公平性意识数据集在全球种族和性别群体之间显著改善了人口平衡，但它们仍然将印度视为一个单一的单一类别。这种过于简化的处理忽视了印度28个邦和8个联邦直辖区之间的广泛国内多样性，导致了代表性和地理偏见。为了解决这一局限性，我们提出了IndicFairFace，一个新颖且平衡的人脸数据集，包含14,400张代表印度地理多样性的图像。这些图像是从维基媒体共享和开放许可的网络资源中伦理地获取的，并在各邦和性别之间均匀平衡。利用IndicFairFace，我们量化了在显著的基于CLIP的VLMs中的国内地理偏见，并使用后期迭代零空间投影去偏见方法减少了这种偏见。我们还展示了所采用的去偏见方法不会对现有的嵌入空间产生不利影响，因为在基准数据集上的检索准确率平均下降不到1.5%。我们的工作确立了IndicFairFace作为研究印度背景下VLMs地理偏见的第一个基准。
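The general idea behind Iterative Nullspace Projection can be sketched as follows: fit a linear probe that predicts the attribute from the embeddings, project the embeddings onto the nullspace of that probe's direction, and repeat. This is a simplified illustration, not the authors' pipeline. The least-squares probe, the binary synthetic attribute, the iteration count, and all function names are assumptions chosen to keep the example short.

```python
import numpy as np

def nullspace_projection(direction):
    """Projection matrix that removes the component along `direction`."""
    w = direction / np.linalg.norm(direction)
    return np.eye(w.shape[0]) - np.outer(w, w)

def iterative_nullspace_projection(X, y, num_iters=3):
    """Repeatedly fit a linear probe for y and project its direction out of X.

    The probe is a least-squares fit, used here only as a stand-in for
    whatever linear classifier a real debiasing setup would train.
    """
    P = np.eye(X.shape[1])
    for _ in range(num_iters):
        Xp = X @ P
        w, *_ = np.linalg.lstsq(Xp, y - y.mean(), rcond=None)
        if np.linalg.norm(w) < 1e-10:    # nothing linear left to remove
            break
        P = nullspace_projection(w) @ P
    return P

# Synthetic embeddings with an "attribute" encoded along one axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)

P = iterative_nullspace_projection(X, y)
X_debiased = X @ P
```

Each iteration removes one direction, so the projected embeddings keep their original dimensionality but span a lower-rank subspace. Whether downstream retrieval accuracy survives, as the abstract reports, depends on how much task-relevant information those removed directions carried.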

cs.CV / 27 / 2602.12679

Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

时间反转采样中的运动先验蒸馏用于生成中间帧

Jeon, Wooseok, Shin, Seunghyun, Shin, Dongmin, Jeon, Hae-Gon

Abstract

Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method can deliberately avoid denoising the end-conditioned path which causes the ambiguity of the path, and yield more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.

Chinese Translation

近年来，图像到视频（I2V）扩散模型的进展显著推动了生成中间帧领域的发展，该领域旨在生成两个关键帧之间语义上合理的帧。特别是，推理时的采样策略利用了大规模预训练I2V模型的生成先验，而无需额外训练，已变得越来越流行。然而，现有的推理时采样，无论是并行融合前向和后向路径，还是顺序交替使用，通常会因两个生成路径之间的错位而遭受时间不连续性和不理想的视觉伪影。这是因为每条路径都遵循其自身条件帧所诱导的运动先验。在本研究中，我们提出了运动先验蒸馏（Motion Prior Distillation, MPD），这是一种简单而有效的推理时蒸馏技术，通过将前向路径的运动残差蒸馏到后向路径中，从而抑制双向不匹配。我们的方法可以故意避免去噪最终条件路径，这会导致路径的模糊性，并利用前向运动先验产生更具时间一致性的中间帧结果。我们不仅在标准基准上进行了定量评估，还进行了广泛的用户研究，以证明我们的方法在实际场景中的有效性。

cs.CV / 28 / 2602.12696

Channel-Aware Probing for Multi-Channel Imaging

面向多通道成像的通道感知探测

Marikkar, Umar, Husain, Syed Sameed, Awais, Muhammad, Atito, Sara

Abstract

Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than on how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found at https://github.com/umarikkar/CAP.

Chinese Translation

在多通道成像（Multi-Channel Imaging, MCI）数据上训练和评估视觉编码器仍然具有挑战性，因为通道配置在不同数据集中各不相同，这阻碍了固定通道的训练，并限制了预训练编码器在新通道设置上的重用。以往的研究训练了MCI编码器，但通常通过完全微调来评估它们，使得使用冻结的预训练编码器进行探测相对未被充分探索。现有的研究主要集中在改善表示上，而不是如何最好地利用固定表示进行下游任务。尽管后者问题在其他领域已有研究，但将这些策略直接转移到MCI上会产生较弱的结果，甚至不如从头训练。因此，我们提出了通道感知探测（Channel-Aware Probing, CAP），通过在编码器和探测器层面控制特征流，利用MCI数据集中固有的通道间多样性。CAP使用独立特征编码（Independent Feature Encoding, IFE）分别对每个通道进行编码，并采用解耦池化（Decoupled Pooling, DCP）在通道内进行池化，然后在通道间进行聚合。在三个MCI基准测试中，CAP始终在默认探测协议上提高探测性能，与从头微调相匹配，并大幅缩小与相同MCI预训练检查点的完全微调之间的差距。代码可在https://github.com/umarikkar/CAP找到。
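The difference between pooling everything at once and pooling within channels first can be shown with a small sketch. The tensor layout, the use of mean pooling, and aggregation by concatenation are assumptions for illustration; the abstract does not state how the cross-channel aggregation is performed.

```python
import numpy as np

def joint_pooling(tokens):
    """Baseline: average over all channels and patch tokens at once."""
    return tokens.mean(axis=(0, 1))              # shape: (dim,)

def decoupled_pooling(tokens):
    """Pool within each channel first, then aggregate across channels.

    Aggregating by concatenation is an assumption of this sketch.
    """
    per_channel = tokens.mean(axis=1)            # shape: (channels, dim)
    return per_channel.reshape(-1)               # shape: (channels * dim,)

# Hypothetical frozen-encoder output: 5 channels encoded independently,
# 196 patch tokens per channel, 64-dimensional features.
tokens = np.random.default_rng(0).normal(size=(5, 196, 64))

joint = joint_pooling(tokens)
decoupled = decoupled_pooling(tokens)
```

Under these assumptions the decoupled vector still contains the joint one (averaging its per-channel blocks recovers it), but it also keeps channel-specific information that a probe can weight separately.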

cs.CV / 29 / 2602.12725

ART3mis: Ray-Based Textual Annotation on 3D Cultural Objects

ART3mis：基于光线的3D文化对象文本注释

Arampatzakis, Vasileios, Sevetlidis, Vasileios, Arnaoutoglou, Fotis, Kalogeras, Athanasios, Koulamas, Christos, Lalos, Aris, Kiourt, Chairi, Ioannakis, George, Koutsoudis, Anestis, Pavlidis, George

Abstract

Beyond simplistic 3D visualisations, archaeologists, as well as cultural heritage experts and practitioners, need applications with advanced functionalities, such as the annotation and attachment of metadata onto particular regions of 3D digital objects. Various approaches have been presented to tackle this challenge, most of which achieve excellent results in the domain of their application. However, they are often confined to that specific domain and particular problem. In this paper, we present ART3mis - a general-purpose, user-friendly, interactive textual annotation tool for 3D objects. Primarily attuned to aid cultural heritage conservators, restorers and curators with no technical skills in 3D imaging and graphics, the tool allows for the easy handling, segmenting and annotating of 3D digital replicas of artefacts. ART3mis applies a user-driven, direct-on-surface approach. It can handle detailed 3D cultural objects in real-time and store textual annotations for multiple complex regions in JSON data format.

Chinese Translation

除了简单的3D可视化，考古学家以及文化遗产专家和从业者需要具有高级功能的应用程序，例如在3D数字对象的特定区域上进行注释和附加元数据。为了解决这一挑战，已经提出了多种方法，其中大多数在其应用领域取得了优异的结果。然而，它们通常局限于特定领域和特定问题。本文介绍了ART3mis——一种通用、用户友好、互动的3D对象文本注释工具。该工具主要旨在帮助没有3D成像和图形技术技能的文化遗产保护者、修复师和策展人，允许他们轻松处理、分割和注释文物的3D数字复制品。ART3mis采用用户驱动的直接表面注释方法，能够实时处理详细的3D文化对象，并以JSON数据格式存储多个复杂区域的文本注释。

cs.CV / 30 / 2602.12735

VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

VimRAG：通过多模态记忆图在检索增强生成中导航大规模视觉上下文

Wang, Qiuchen, Wang, Shihang, Zeng, Yu, Zhang, Qiang, Zhang, Fanrui, Guo, Zhuoning, Zhang, Bosi, Huang, Wenxuan, Chen, Lin, Chen, Zehui, Xie, Pengjun, Ding, Ruixue

Abstract

Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.

Chinese Translation

有效地检索、推理和理解多模态信息仍然是智能系统面临的一个关键挑战。传统的检索增强生成（RAG）方法依赖于线性交互历史，这在处理长上下文任务时表现不佳，尤其是在涉及信息稀疏但令牌密集的视觉数据的迭代推理场景中。为了解决这一问题，我们提出了VimRAG，一个专为文本、图像和视频的多模态检索增强推理而设计的框架。受到我们系统研究的启发，我们将推理过程建模为一个动态有向无环图，该图结构化了智能体状态和检索到的多模态证据。在此结构化记忆的基础上，我们引入了一种图调制视觉记忆编码机制，通过其拓扑位置评估记忆节点的重要性，从而使模型能够动态分配高分辨率令牌给关键证据，同时压缩或丢弃琐碎线索。为了实现这一范式，我们提出了一种图引导策略优化策略。该策略通过修剪与冗余动作相关的记忆节点，将逐步有效性与轨迹级奖励解耦，从而促进细粒度的信用分配。大量实验表明，VimRAG在多样化的多模态RAG基准测试中始终实现了最先进的性能。代码可在 https://github.com/Alibaba-NLP/VRAG 获取。

cs.CV / 31 / 2602.12740

SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences

SPRig：基于网格序列的自监督姿态不变绑定

Wang, Ruipeng, Zhong, Langkun, Wang, Miaowei

Abstract

State-of-the-art rigging methods assume a canonical rest pose--an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) that lack the T-pose. Applied frame-by-frame, these methods are not pose-invariant and produce topological inconsistencies across frames. Thus, we propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigs on top of existing models. We validate our approach on rigging using a new permutation-invariant stability protocol. Experiments demonstrate SOTA temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods. The code will be released publicly upon acceptance.

Chinese Translation

最先进的绑定方法假设存在一个规范的静止姿态——这一假设在缺乏T姿态的序列数据（例如，动物动作捕捉或AIGC/视频衍生的网格序列）中失效。逐帧应用时，这些方法并不具备姿态不变性，并且在帧之间产生拓扑不一致。因此，我们提出了SPRig，这是一种通用的微调框架，通过强制跨帧一致性损失来学习基于现有模型的姿态不变绑定。我们使用一种新的置换不变稳定性协议验证了我们的方法在绑定中的有效性。实验表明，我们的方法在时间稳定性方面达到了最先进水平：我们的技术能够从具有挑战性的序列中生成一致的绑定，并显著减少了困扰基线方法的伪影。代码将在接受后公开发布。

cs.CV / 32 / 2602.12742

Synthetic Craquelure Generation for Unsupervised Painting Restoration

无监督绘画修复的合成裂纹生成

Cuch-Guillén, Jana, Agudo, Antonio, Pérez-Gonzalo, Raül

Abstract

Cultural heritage preservation increasingly demands non-invasive digital methods for painting restoration, yet identifying and restoring fine craquelure patterns from complex brushstrokes remains challenging due to scarce pixel-level annotations. We propose a fully annotation-free framework driven by a domain-specific synthetic craquelure generator, which simulates realistic branching and tapered fissure geometry using Bézier trajectories. Our approach couples a classical morphological detector with a learning-based refinement module: a SegFormer backbone adapted via Low-Rank Adaptation (LoRA). Uniquely, we employ a detector-guided strategy, injecting the morphological map as an input spatial prior, while a masked hybrid loss and logit adjustment constrain the training to focus specifically on refining candidate crack regions. The refined masks subsequently guide an Anisotropic Diffusion inpainting stage to reconstruct missing content. Experimental results demonstrate that our pipeline significantly outperforms state-of-the-art photographic restoration models in zero-shot settings, while faithfully preserving the original paint brushwork.

Chinese Translation

文化遗产保护日益需要非侵入性的数字方法进行绘画修复，但由于缺乏像素级注释，从复杂的笔触中识别和修复细微的裂纹图案仍然具有挑战性。我们提出了一种完全无注释的框架，该框架由一个特定领域的合成裂纹生成器驱动，使用贝塞尔轨迹模拟逼真的分支和锥形裂缝几何形状。我们的方法将经典的形态学检测器与基于学习的细化模块结合在一起：一个通过低秩适应（Low-Rank Adaptation, LoRA）调整的SegFormer主干网络。独特之处在于，我们采用了一种检测器引导策略，将形态学图作为输入空间先验，同时使用掩蔽混合损失和对数调整约束训练，专注于细化候选裂缝区域。细化后的掩膜随后引导各向异性扩散修复阶段以重建缺失内容。实验结果表明，我们的管道在零样本设置下显著优于最先进的摄影修复模型，同时忠实地保留了原始的绘画笔触。

cs.CV / 33 / 2602.12751

ReBA-Pred-Net: Weakly-Supervised Regional Brain Age Prediction on MRI

ReBA-Pred-Net：基于弱监督的MRI区域脑龄预测

Shao, Shuai, Wang, Yan, Jiang, Shu, Zhao, Shiyuan, Luo, Xinzhe, Yang, Di, Wang, Jiangtao, Bai, Yutong, Zhang, Jianguo

Abstract

Brain age has become a prominent biomarker of brain health. Yet most prior work targets whole brain age (WBA), a coarse paradigm that struggles to support tasks such as disease characterization and research on development and aging patterns, because relevant changes are typically region-selective rather than brain-wide. Therefore, robust regional brain age (ReBA) estimation is critical, yet a widely generalizable model has yet to be established. In this paper, we propose the Regional Brain Age Prediction Network (ReBA-Pred-Net), a Teacher-Student framework designed for fine-grained brain age estimation. The Teacher produces soft ReBA to guide the Student to yield reliable ReBA estimates with a clinical-prior consistency constraint (regions within the same function should change similarly). For rigorous evaluation, we introduce two indirect metrics: Healthy Control Similarity (HCS), which assesses statistical consistency by testing whether regional brain-age-gap (ReBA minus chronological age) distributions align between training and unseen HC; and Neuro Disease Correlation (NDC), which assesses factual consistency by checking whether clinically confirmed patients show elevated brain-age-gap in disease-associated regions. Experiments across multiple backbones demonstrate the statistical and factual validity of our method.

Chinese Translation

脑龄已成为脑健康的重要生物标志物。然而，大多数先前的研究集中于整体脑龄（Whole Brain Age, WBA），这一粗略的范式难以支持疾病特征描述以及发展和衰老模式的研究，因为相关变化通常是区域选择性的，而非全脑范围的。因此，稳健的区域脑龄（Regional Brain Age, ReBA）估计至关重要，但尚未建立广泛适用的模型。本文提出了区域脑龄预测网络（Regional Brain Age Prediction Network, ReBA-Pred-Net），这是一个旨在进行精细脑龄估计的教师-学生框架。教师生成软性ReBA，以指导学生在临床先验一致性约束下（同一功能区域应有类似变化）产生可靠的ReBA估计。为了进行严格评估，我们引入了两个间接指标：健康对照相似性（Healthy Control Similarity, HCS），通过测试训练集和未见健康对照组（Healthy Control, HC）之间的区域脑龄差（ReBA减去实际年龄）分布是否一致来评估统计一致性；以及神经疾病相关性（Neuro Disease Correlation, NDC），通过检查临床确认的患者在疾病相关区域是否显示出升高的脑龄差来评估事实一致性。多个基础模型的实验验证了我们方法的统计和事实有效性。
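The brain-age-gap defined in this abstract (ReBA minus chronological age) and a distribution comparison of the kind HCS relies on can be sketched as below. The abstract does not say which statistical test HCS uses; the two-sample Kolmogorov-Smirnov statistic here is an assumed stand-in, and all data are synthetic.

```python
import numpy as np

def brain_age_gap(regional_age, chronological_age):
    """Gap per subject and region: predicted regional age minus chronological age."""
    return regional_age - chronological_age[:, None]

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest distance between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Rows are subjects, columns are regions (toy numbers, not real predictions).
regional_age = np.array([[62.0, 58.0], [41.0, 45.0]])
chronological_age = np.array([60.0, 40.0])
gap = brain_age_gap(regional_age, chronological_age)

# Synthetic gap samples for training and unseen healthy controls.
rng = np.random.default_rng(0)
gap_train = rng.normal(0.0, 3.0, size=(300, 2))
gap_unseen = rng.normal(0.0, 3.0, size=(100, 2))
hcs_per_region = [ks_statistic(gap_train[:, r], gap_unseen[:, r]) for r in range(2)]
```

A small per-region statistic would indicate that the gap distribution generalizes from training subjects to unseen healthy controls, which is the kind of statistical consistency the HCS metric is described as measuring.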

cs.CV / 34 / 2602.12755

Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

基于扩散模型重建实验稀疏视图X射线CT数据的研究

Thomsen, Nelas J., Wang, Xinyuan, Lucka, Felix, Demircan-Tureyen, Ezgi

Abstract

Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch ("domain shift") or forward model mismatch complicates their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors outperform well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

Chinese Translation

基于扩散的图像生成器为稀疏视图X射线计算机断层扫描（CT）等病态逆问题提供了有前景的先验。由于大多数研究考虑的是合成数据，因此尚不清楚训练数据不匹配（“领域转移”）或前向模型不匹配是否会影响其在实验数据中的成功应用。我们从一个物理幻影中测量了CT数据，该幻影类似于合成的Shepp-Logan幻影，并在与之存在不同程度领域转移的合成图像数据集上训练了扩散先验。随后，我们在逐渐增加难度的稀疏视图CT数据集上采用了分解扩散采样方案。我们的结果表明，领域转移发挥了微妙的作用：虽然严重的不匹配会导致模型崩溃和幻觉，但多样化的先验优于匹配良好但范围狭窄的先验。前向模型不匹配使图像样本偏离先验流形，造成伪影，但可以通过退火似然调度来缓解，同时提高计算效率。总体而言，我们证明了性能提升并不会立即从合成数据转化为实验数据，未来的发展必须针对真实世界基准进行验证。

cs.CV / 35 / 2602.12761

Towards complete digital twins in cultural heritage with ART3mis 3D artifacts annotator

基于ART3mis 3D文物标注器的文化遗产完整数字双胞胎的构建

Karamatskos, Dimitrios, Arampatzakis, Vasileios, Sevetlidis, Vasileios, Nousias, Stavros, Kalogeras, Athanasios, Koulamas, Christos, Lalos, Aris, Pavlidis, George

Abstract

Archaeologists, as well as specialists and practitioners in cultural heritage, require applications with additional functions, such as the annotation and attachment of metadata to specific regions of the 3D digital artifacts, to go beyond the simplistic three-dimensional (3D) visualization. Different strategies addressed this issue, most of which are excellent in their particular area of application, but their capacity is limited to their design's purpose; they lack generalization and interoperability. This paper introduces ART3mis, a general-purpose, user-friendly, feature-rich, interactive web-based textual annotation tool for 3D objects. Moreover, it enables the communication, distribution, and reuse of information as it complies with the W3C Web Annotation Data Model. It is primarily designed to help cultural heritage conservators, restorers, and curators who lack technical expertise in 3D imaging and graphics, handle, segment, and annotate 3D digital replicas of artifacts with ease.

Chinese Translation

考古学家以及文化遗产领域的专家和从业者需要具备附加功能的应用程序，例如对3D数字文物特定区域进行标注和附加元数据，以超越简单的三维（3D）可视化。不同的策略已针对这一问题进行了探讨，大多数在其特定应用领域表现出色，但其能力受限于设计目的，缺乏通用性和互操作性。本文介绍了ART3mis，这是一种通用、用户友好、功能丰富的基于网络的3D对象文本标注工具。此外，它符合W3C Web标注数据模型，能够促进信息的交流、分发和重用。该工具主要旨在帮助缺乏3D成像和图形技术专长的文化遗产保护者、修复者和策展人，轻松处理、分割和标注文物的3D数字复制品。
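For readers unfamiliar with the W3C Web Annotation Data Model mentioned in this abstract, the sketch below builds a minimal textual annotation and serializes it to JSON. The `@context`, `type`, `body`, and `target` fields follow the W3C model; the URLs, the annotation text, and in particular the `ex:MeshRegionSelector` structure are hypothetical, since the standard defines no selector for 3D surface regions and the abstract does not say how ART3mis encodes them.

```python
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "https://example.org/annotations/1",           # hypothetical IRI
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "Surface crack near the handle.",        # illustrative text
        "format": "text/plain",
        "language": "en",
    },
    "target": {
        "source": "https://example.org/models/amphora.glb",   # hypothetical 3D model
        # The Web Annotation model has no built-in 3D-region selector;
        # this placeholder stands for a set of selected mesh faces.
        "selector": {
            "type": "ex:MeshRegionSelector",
            "faceIndices": [1021, 1022, 1057],
        },
    },
}

serialized = json.dumps(annotation, indent=2)
restored = json.loads(serialized)
```

Keeping annotations in a standard envelope like this is what would allow other tools to read the text body and the target reference even if they cannot interpret the tool-specific region selector.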

cs.CV / 36 / 2602.12769

PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

PixelRush：通过一步扩散实现超快速、无训练的高分辨率图像生成

Lai, Hong-Phuc, Nguyen, Phong, Tran, Anh

Abstract

Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, a 10$\times$ to 35$\times$ speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

Chinese Translation

预训练的扩散模型在生成高质量图像方面表现出色，但仍然受到其固有训练分辨率的限制。最近的无训练方法试图通过在去噪过程中引入干预来克服这一限制；然而，这些方法会产生大量的计算开销，通常需要超过五分钟才能生成一幅4K图像。在本文中，我们提出了PixelRush，这是第一个无调优的实用高分辨率文本到图像生成框架。我们的方法建立在已有的基于块的推理范式之上，但消除了多次反演和再生循环的需求。相反，PixelRush能够在低步数范围内实现高效的基于块的去噪。为了应对在少步生成中引入的块混合伪影，我们提出了一种无缝混合策略。此外，我们通过噪声注入机制减轻了过度平滑的影响。PixelRush展现出卓越的效率，约在20秒内生成4K图像，相较于最先进的方法实现了10$\times$到35$\times$的加速，同时保持了卓越的视觉保真度。大量实验验证了我们方法在性能提升和输出质量方面的优势。

cs.CV / 37 / 2602.12774

Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

基于引导的 MLLM 弱监督无类别目标计数

Zhang, Xiaowen, Yue, Zijie, Luo, Yong, Zhao, Cairong, Chen, Qijun, Shi, Miaojing

Abstract

Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.

Chinese Translation

目标计数是计算机视觉中的一项基础任务，在许多现实场景中具有广泛的应用。完全监督的计数方法需要对每个对象进行昂贵的点级注释。少数弱监督方法仅利用图像级对象计数作为监督，并取得了相当可喜的结果。然而，它们通常仅限于计数单一类别，例如人。在本文中，我们提出了 WS-COC，这是第一个基于 MLLM 的弱监督无类别目标计数框架。我们并不是直接微调 MLLM 以预测对象计数，因为这可能由于模态差距而具有挑战性，而是结合了三种简单而有效的策略，在训练和测试中引导计数范式：首先，提出了一种划分与辨别对话调优策略，以指导 MLLM 确定对象计数是否落在特定范围内，并通过多轮对话逐步细化该范围。其次，引入了一种比较与排序计数优化策略，训练 MLLM 根据多个图像的对象计数优化相对排名。第三，全球与局部计数增强策略聚合并融合局部和全球计数预测，以提高密集场景中的计数性能。在 FSC-147、CARPK、PUCPR+ 和 ShanghaiTech 上进行的广泛实验表明，WS-COC 的表现与许多最先进的完全监督方法相匹配甚至超越，同时显著降低了注释成本。代码可在 https://github.com/viscom-tongji/WS-COC 获取。
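The divide-and-discern idea, asking whether the count falls in a range and narrowing that range over several rounds, resembles a binary search. The sketch below illustrates that reading only. The `in_range` callback stands in for one dialogue round with the model, the halving rule and the initial range are assumptions, and a real MLLM would answer with errors that this toy oracle does not model.

```python
def count_by_range_queries(in_range, low=0, high=1024):
    """Narrow [low, high] by repeatedly asking if the count lies in the lower half.

    in_range(lo, hi) -> bool is a hypothetical stand-in for one dialogue round.
    Returns the final count estimate and the number of rounds used.
    """
    rounds = 0
    while low < high:
        mid = (low + high) // 2
        rounds += 1
        if in_range(low, mid):
            high = mid          # count is in the lower half
        else:
            low = mid + 1       # count is in the upper half
    return low, rounds

# Toy oracle that always answers correctly for a known count.
true_count = 137
oracle = lambda lo, hi: lo <= true_count <= hi

estimate, rounds = count_by_range_queries(oracle)
```

Each round is a yes/no question rather than a direct numeric prediction, so the number of rounds grows only logarithmically with the size of the initial range.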

cs.CV / 38 / 2602.12796

GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction

GSM-GS：几何约束单视图和多视图高斯点云重建

Ren, Xiao, Liu, Yu, An, Ning, Cheng, Jian, Qiao, Xin, Kong, He

Abstract

Recently, 3D Gaussian Splatting has emerged as a prominent research direction owing to its ultrarapid training speed and high-fidelity rendering capabilities. However, the unstructured and irregular nature of Gaussian point clouds poses challenges to reconstruction accuracy. This limitation frequently causes high-frequency detail loss in complex surface microstructures when relying solely on routine strategies. To address this limitation, we propose GSM-GS: a synergistic optimization framework integrating single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. For single-view optimization, we leverage image gradient features to partition scenes into texture-rich and texture-less sub-regions. The reconstruction quality is enhanced through adaptive filtering mechanisms guided by depth discrepancy features. This preserves high-weight regions while implementing a dual-branch constraint strategy tailored to regional texture variations, thereby improving geometric detail characterization. For multi-view optimization, we introduce a geometry-guided cross-view point cloud association method combined with a dynamic weight sampling strategy. This constructs 3D structural normal constraints across adjacent point cloud frames, effectively reinforcing multi-view consistency and reconstruction fidelity. Extensive experiments on public datasets demonstrate that our method achieves both competitive rendering quality and geometric reconstruction. See our interactive project page

Chinese Translation

近年来，3D高斯点云重建作为一个重要的研究方向，因其超快速的训练速度和高保真渲染能力而受到关注。然而，高斯点云的无结构和不规则特性对重建精度提出了挑战。这一限制常常导致在复杂表面微结构中高频细节的丢失，尤其是在仅依赖常规策略时。为了解决这一限制，我们提出了GSM-GS：一个协同优化框架，整合了单视图自适应子区域加权约束和多视图空间结构精炼。在单视图优化中，我们利用图像梯度特征将场景划分为富纹理和缺纹理的子区域。通过深度差异特征引导的自适应滤波机制，重建质量得以提升。这一机制在保留高权重区域的同时，实施了针对区域纹理变化的双分支约束策略，从而改善几何细节的表征。在多视图优化中，我们引入了一种几何引导的跨视图点云关联方法，并结合动态权重采样策略。这一方法在相邻点云帧之间构建了3D结构法线约束，有效增强了多视图一致性和重建保真度。在公共数据集上的大量实验表明，我们的方法在渲染质量和几何重建方面均表现出竞争力。请参见我们的互动项目页面。


cs.CV / 39 / 2602.12843

Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

像放射科医生一样思考：一个用于胸部X光解读的解剖引导交错视觉语言推理的数据集

Zhao, Yichen, Peng, Zelin, Yang, Piao, Yang, Xiaokang, Shen, Wei

Abstract

Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects a repeated cycle of reasoning and visual inspection workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.

Chinese Translation

放射学诊断是一个感知过程，其中仔细的视觉检查和语言推理反复交错进行。大多数医学大型视觉语言模型（LVLMs）仅进行一次视觉检查，然后依赖于仅基于文本的思维链（CoT）推理，这种推理完全在语言空间中进行，容易产生幻觉。最近的方法试图通过引入视觉相关坐标（如边界框）来缓解这个问题。然而，这些仍然是一种伪视觉解决方案：坐标仍然是文本，无法保留丰富的视觉细节，如纹理和密度。受到放射学诊断交错特性的启发，我们推出了MMRad-IVL-22K，这是第一个为胸部X光解读设计的原生交错视觉语言推理的大规模数据集。MMRad-IVL-22K反映了放射科医生推理和视觉检查工作流程的重复循环，其中视觉依据补充文本描述，并为推理过程的每一步提供基础。MMRad-IVL-22K包含21,994个诊断轨迹，能够系统地扫描35个解剖区域。在先进的闭源LVLMs上的实验结果表明，由多模态CoT引导的报告生成在临床准确性和报告质量上显著优于仅由文本CoT引导的生成（例如，RadGraph指标提高了6%），确认高保真交错视觉语言证据是可靠医疗人工智能不可替代的组成部分。此外，在七个最先进的开源LVLMs上的基准测试表明，基于MMRad-IVL-22K微调的模型在推理一致性和报告质量上优于通用和医学特定的LVLMs。项目页面可访问：https://github.com/qiuzyc/thinking_like_a_radiologist。


cs.CV / 40 / 2602.12877

RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads

RoadscapesQA：一个用于印度道路视觉问答的多任务多模态数据集

Iyer, Vijayasri, Rathinagiriswaran, Maahin, S, Jyothikamalesh

Abstract

Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.

Chinese Translation

理解道路场景对自动驾驶至关重要，因为它使系统能够解读视觉环境，以帮助有效决策。我们提出了Roadscapes，这是一个多任务多模态数据集，包含多达9,000张在不同印度驾驶环境中捕获的图像，并附有人工验证的边界框。为了促进可扩展的场景理解，我们采用基于规则的启发式方法推断各种场景属性，这些属性随后用于生成问题-答案（QA）对，以支持物体定位、推理和场景理解等任务。该数据集包括来自印度城市和农村的多种场景，涵盖高速公路、服务道路、乡村小路和拥挤的城市街道，图像在白天和夜间环境中捕获。Roadscapes旨在推动对非结构化环境中视觉场景理解的研究。在本文中，我们描述了数据收集和标注过程，呈现了关键数据集统计信息，并提供了使用视觉-语言模型进行图像QA任务的初步基准。


cs.CV / 41 / 2602.12892

RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training

RADAR：揭示多模态大语言模型预训练中能力的不对称发展

Nie, Yunshuang, Lin, Bingqian, Niu, Minzhe, Xiang, Kun, Han, Jianhua, Huang, Guowei, Quan, Xingyue, Xu, Hang, Chen, Bokui, Liang, Xiaodan

Abstract

Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs' perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.

Chinese Translation

预训练的多模态大语言模型（MLLMs）通过利用其固有的感知和推理能力，为后续训练提供了丰富的知识基础，以解决复杂任务。然而，缺乏有效的评估框架阻碍了对其性能瓶颈的诊断。目前的评估主要依赖于监督微调后的测试，这引入了繁琐的额外训练和自回归解码成本。同时，常见的预训练指标无法以解耦的方式量化模型的感知和推理能力。此外，现有评估基准通常在规模上有限或与预训练目标不一致。因此，我们提出了RADAR，一个高效的以能力为中心的评估框架，用于揭示多模态大语言模型预训练中能力的发展不对称性。RADAR包括两个关键组成部分：（1）软区分分数（Soft Discrimination Score），一种新颖的指标，用于在不进行微调的情况下稳健地跟踪能力发展，基于量化模型对正确答案相对于干扰项的细微偏好；（2）多模态混合基准（Multi-Modal Mixture Benchmark），一个新的15K+样本基准，用于以零样本方式全面评估预训练MLLMs的感知和推理能力，我们统一了权威基准数据集并仔细收集了新数据集，扩展了评估范围，并解决了当前基准中的关键缺口。通过RADAR，我们全面揭示了预训练MLLMs在感知和推理能力方面的不对称发展，涵盖了数据量、模型规模和预训练策略等多种因素。我们的RADAR强调了对预训练能力瓶颈进行分解视角的必要性，为有效推进MLLMs提供了针对性的干预建议。我们的代码已公开发布在 https://github.com/Nieysh/RADAR。


cs.CV / 42 / 2602.12902

Robustness of Object Detection of Autonomous Vehicles in Adverse Weather Conditions

自主车辆在恶劣天气条件下物体检测的鲁棒性

Pettersen, Fox, Zhu, Hong

Abstract

As self-driving technology advances toward widespread adoption, determining safe operational thresholds across varying environmental conditions becomes critical for public safety. This paper proposes a method for evaluating the robustness of object detection ML models in autonomous vehicles under adverse weather conditions. It employs data augmentation operators to generate synthetic data that simulates adverse operating conditions at progressively increasing severity levels, in order to find the lowest intensity of the adverse conditions at which the object detection model fails. The robustness of the object detection model is measured by the average first failure coefficient (AFFC) over the input images in the benchmark. The paper reports an experiment with four object detection models: YOLOv5s, YOLOv11s, Faster R-CNN, and Detectron2, utilising seven data augmentation operators that simulate the weather conditions fog, rain, and snow, and the lighting conditions dark, bright, flaring, and shadow. The experiment data show that the method is feasible, effective, and efficient for evaluating and comparing the robustness of object detection models in various adverse operating conditions. In particular, the Faster R-CNN model achieved the highest robustness with an overall average AFFC of 71.9% over all seven adverse conditions, while YOLO variants showed AFFC values of 43%. The method is also applied to assess how model training that targets adverse operating conditions using synthetic data affects model robustness. It is observed that such training can improve robustness in adverse conditions but may suffer from diminishing returns and forgetting phenomena (i.e., decline in robustness) if overtrained.

Chinese Translation

随着自动驾驶技术向广泛应用的推进，确定在不同环境条件下的安全操作阈值对公共安全变得至关重要。本文提出了一种评估自主车辆在恶劣天气条件下物体检测机器学习（ML）模型鲁棒性的方法。该方法采用数据增强操作符生成合成数据，以模拟不同程度的恶劣操作条件，逐步增加强度，以找出物体检测模型失效的最低恶劣条件强度。物体检测模型的鲁棒性通过基准测试中输入图像的平均首次失效系数（AFFC）进行测量。本文报告了一项实验，涉及四种物体检测模型：YOLOv5s、YOLOv11s、Faster R-CNN 和 Detectron2，利用七种数据增强操作符模拟雾、雨、雪等天气条件，以及黑暗、明亮、耀斑和阴影等光照条件。实验数据表明，该方法在评估和比较各种恶劣操作条件下物体检测模型的鲁棒性方面是可行、有效且高效的。特别是，Faster R-CNN 模型在所有七种恶劣条件下的整体平均 AFFC 达到 71.9%，而 YOLO 变体的 AFFC 值为 43%。该方法还用于评估针对恶劣操作条件的模型训练对模型鲁棒性的影响，使用合成数据进行训练。观察到这种训练可以提高在恶劣条件下的鲁棒性，但如果过度训练，可能会遭遇收益递减和遗忘现象（即鲁棒性下降）。
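The abstract above describes the AFFC only in words, so the following is a minimal Python sketch of one plausible reading, not the authors' definition. It assumes intensities are normalised to (0, 1], that a hypothetical per-image callback `detects_ok(level)` reports whether detection still succeeds at a given intensity, and that an image which never fails is scored 1.0. All names here are illustrative.

```python
from typing import Callable, List, Sequence


def first_failure_coefficient(
    detects_ok: Callable[[float], bool], levels: Sequence[float]
) -> float:
    """Lowest intensity at which detection fails (assumed convention: 1.0 if it never fails)."""
    for level in sorted(levels):
        if not detects_ok(level):
            return level
    return 1.0


def average_ffc(
    per_image_checks: List[Callable[[float], bool]], levels: Sequence[float]
) -> float:
    """Average the first-failure coefficient over all benchmark images."""
    coefficients = [first_failure_coefficient(c, levels) for c in per_image_checks]
    return sum(coefficients) / len(coefficients)


# Toy benchmark: three images with hand-written failure behaviour.
LEVELS = [0.25, 0.5, 0.75, 1.0]
CHECKS = [
    lambda level: level < 0.5,  # detection breaks from intensity 0.5 upward
    lambda level: True,         # detection never breaks
    lambda level: False,        # detection breaks at the mildest level
]
TOY_AFFC = average_ffc(CHECKS, LEVELS)
```

Under this reading a higher value means the detector tolerates stronger perturbation before its first failure, which matches the abstract's use of a higher AFFC as higher robustness.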


cs.CV / 43 / 2602.12905

Adaptive Scaling with Geometric and Visual Continuity of completed 3D objects

具有几何和视觉连续性的完成3D对象的自适应缩放

Vermandere, Jelle, Bassier, Maarten, Vergauwen, Maarten

Abstract

Object completion networks typically produce static Signed Distance Fields (SDFs) that faithfully reconstruct geometry but cannot be rescaled or deformed without introducing structural distortions. This limitation restricts their use in applications requiring flexible object manipulation, such as indoor redesign, simulation, and digital content creation. We introduce a part-aware scaling framework that transforms these static completed SDFs into editable, structurally coherent objects. Starting from SDFs and Texture Fields generated by state-of-the-art completion models, our method performs automatic part segmentation, defines user-controlled scaling zones, and applies smooth interpolation of SDFs, color, and part indices to enable proportional and artifact-free deformation. We further incorporate a repetition-based strategy to handle large-scale deformations while preserving repeating geometric patterns. Experiments on Matterport3D and ShapeNet objects show that our method overcomes the inherent rigidity of completed SDFs and is visually more appealing than global and naive selective scaling, particularly for complex shapes and repetitive structures.

Chinese Translation

对象完成网络通常生成静态的有符号距离场（Signed Distance Fields, SDF），能够忠实地重建几何形状，但无法在不引入结构扭曲的情况下进行缩放或变形。这一限制限制了它们在需要灵活对象操作的应用中的使用，例如室内重新设计、仿真和数字内容创作。我们提出了一种基于部件感知的缩放框架，将这些静态完成的SDF转变为可编辑的、结构一致的对象。我们的算法从由最先进的完成模型生成的SDF和纹理场开始，执行自动部件分割，定义用户控制的缩放区域，并应用SDF、颜色和部件索引的平滑插值，以实现比例和无伪影的变形。我们进一步结合了一种基于重复的策略，以处理大规模变形，同时保留重复的几何图案。在Matterport3D和ShapeNet对象上的实验表明，我们的方法克服了完成SDF固有的刚性，并在视觉上比全局和简单选择缩放更具吸引力，特别是在复杂形状和重复结构的情况下。


cs.CV / 44 / 2602.12916

Reliable Thinking with Images

可靠的图像思维

Li, Haobin, Yang, Yutong, Lin, Yijie, Xiang, Dai, Yang, Mouxing, Peng, Xi

Abstract

As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs); it generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfect visual-cue mining and answer reasoning. As the saying goes, "One mistake leads to another": an erroneous interleaved CoT causes error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.

Chinese Translation

作为链式思维（Chain-of-Thought, CoT）的多模态扩展，图像思维（Thinking with Images, TWI）最近成为增强多模态大型语言模型（Multi-modal Large Language Models, MLLMs）推理能力的有希望的途径，它通过将视觉线索融入文本推理过程生成交错的CoT。然而，现有TWI方法的成功在很大程度上依赖于交错的图像-文本CoT是无误的这一假设，而在现实场景中，由于多模态理解的复杂性，这一假设很容易被违反。在本文中，我们揭示并研究了TWI中一个高度实用但未被充分探索的问题，称为噪声思维（Noisy Thinking, NT）。具体而言，NT指的是不完美的视觉线索挖掘和答案推理过程。正如所说的，“一错再错”，错误的交错CoT会导致错误累积，从而显著降低MLLM的性能。为了解决NT问题，我们提出了一种新方法，称为可靠的图像思维（Reliable Thinking with Images, RTWI）。简而言之，RTWI以统一的文本中心方式评估视觉线索和文本CoT的可靠性，并相应地采用稳健的过滤和投票模块，以防止NT污染最终答案。在七个基准上的大量实验验证了RTWI对抗NT的有效性。


cs.CV / 45 / 2602.12919

EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition

EPRBench：一个高质量的基于事件流的视觉地点识别基准数据集

Wang, Xiao, Xiong, Xingxing, Gao, Jinfeng, Lou, Xufeng, Jiang, Bo, Chen, Si-bao, Wang, Yaowei, Tian, Yonghong

Abstract

Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID

Chinese Translation

基于事件流的视觉地点识别（VPR）是一个新兴的研究方向，为传统可见光摄像头在低光照、过曝和高速运动等挑战性条件下的不稳定性提供了有力的解决方案。鉴于当前该领域专用数据集的稀缺，我们推出了EPRBench，一个专门为基于事件流的VPR设计的高质量基准数据集。EPRBench包含10K个事件序列和65K个事件帧，这些数据是通过手持设备和车载设备收集的，旨在全面捕捉不同视角、天气条件和光照场景下的现实世界挑战。为了支持语义感知和语言集成的VPR研究，我们提供了由大型语言模型（LLM）生成的场景描述，并通过人工标注进行了进一步的精炼，为将LLM集成到基于事件的感知管道中奠定了坚实的基础。为了便于系统评估，我们在EPRBench上实现并基准测试了15种最先进的VPR算法，为未来的算法比较提供了强有力的基线。此外，我们提出了一种新颖的多模态融合范式：利用LLM从原始事件流生成文本场景描述，然后指导空间注意的标记选择、跨模态特征融合和多尺度表示学习。该框架不仅实现了高精度的地点识别，还在预测过程中产生可解释的推理过程，显著增强了模型的透明性和可解释性。数据集和源代码将发布在https://github.com/Event-AHU/Neuromorphic_ReID


cs.CV / 46 / 2602.12922

Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos

超越IUGC基准：重新思考深度学习方法在产程超声生物测量中的需求，基于胎儿超声视频

Bai, Jieyun, Zhou, Zihao, Tang, Yitong, Gan, Jie, Liang, Zhuonan, Fan, Jianan, Mcguire, Lisa B., Clarke, Jillian L., Cai, Weidong, Spurway, Jacaueline, Tang, Yubo, Wang, Shiye, Shen, Wenda, Yu, Wangwang, Li, Yihao, Zhang, Philippe, Jiang, Weili, Li, Yongjie, Nasim, Salem Muhsin Ali Binqahal Al, Abzhanov, Arsen, Saeed, Numan, Yaqub, Mohammad, Xian, Zunhui, Lin, Hongxing, Lan, Libin, Ramesh, Jayroop, Bacher, Valentin, Eid, Mark, Kalabizadeh, Hoda, Rupprecht, Christian, Namburete, Ana I. L., Yeung, Pak-Hei, Wyburd, Madeleine K., Dinsdale, Nicola K., Serikbey, Assanali, Li, Jiankai, Chen, Sung-Liang, Hu, Zicheng, Liu, Nana, Deng, Yian, Hu, Wei, Tan, Cong, Zhang, Wenfeng, Nhi, Mai Tuyet, Koehler, Gregor, Stock, Rapheal, Maier-Hein, Klaus, Elbatel, Marawan, Li, Xiaomeng, Slimani, Saad, Campello, Victor M., Ohene-Botwe, Benard, Khobo, Isaac, Huang, Yuxin, Han, Zhenyan, Hou, Hongying, Qiu, Di, Zheng, Zheng, Luo, Gongning, Ni, Dong, Lu, Yaosheng, Lekadir, Karim, Li, Shuo

Abstract

A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research. Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.

Chinese Translation

相当大比例（45%）的孕产妇死亡、新生儿死亡和死产发生在产程阶段，其中低收入和中等收入国家的负担尤为沉重。产程生物测量在监测分娩进程中发挥着关键作用；然而，在资源有限的环境中，超声的常规使用受到训练有素的超声技师短缺的制约。为了解决这一挑战，产程超声大挑战（Intrapartum Ultrasound Grand Challenge, IUGC）与2024年MICCAI共同举办。IUGC推出了一种临床导向的多任务自动测量框架，集成了标准平面分类、胎头-耻骨联合分割和生物测量，使算法能够利用互补任务信息以获得更准确的估计。此外，该挑战发布了迄今为止最大的多中心产程超声视频数据集，包含来自三家医院的774个视频（68,106帧），为模型训练和评估提供了坚实的基础。在本研究中，我们对挑战设计进行了全面概述，回顾了八个参与团队的提交，并从五个方面分析了他们的方法：预处理、数据增强、学习策略、模型架构和后处理。此外，我们对基准结果进行了系统分析，以识别关键瓶颈，探索潜在解决方案，并强调未来研究中的开放挑战。尽管取得了令人鼓舞的性能，但我们的研究结果表明，该领域仍处于早期阶段，在大规模临床部署之前仍需进一步深入研究。所有基准解决方案和完整数据集已公开发布，以促进可重复研究并推动自动产程超声生物测量的持续进步。


cs.CV / 47 / 2602.12933

Deep-Learning Atlas Registration for Melanoma Brain Metastases: Preserving Pathology While Enabling Cohort-Level Analyses

深度学习大脑图谱配准用于黑色素瘤脑转移：在实现群体水平分析的同时保留病理信息

Wielenberg, Nanna E., Popp, Ilinca, Blanck, Oliver, Zander, Lucas, Peeken, Jan C., Combs, Stephanie E., Grosu, Anca-Ligia, Baltas, Dimos, Fechter, Tobias

Abstract

Melanoma brain metastases (MBM) are common and spatially heterogeneous lesions, complicating cohort-level analyses due to anatomical variability and differing MRI protocols. We propose a fully differentiable, deep-learning-based deformable registration framework that aligns individual pathological brains to a common atlas while preserving metastatic tissue without requiring lesion masks or preprocessing. Missing anatomical correspondences caused by metastases are handled through a forward-model similarity metric based on distance-transformed anatomical labels, combined with a volume-preserving regularization term to ensure deformation plausibility. Registration performance was evaluated using Dice coefficient (DSC), Hausdorff distance (HD), average symmetric surface distance (ASSD), and Jacobian-based measures. The method was applied to 209 MBM patients from three centres, enabling standardized mapping of metastases to anatomical, arterial, and perfusion atlases. The framework achieved high registration accuracy across datasets (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) while preserving metastatic volumes. Spatial analysis demonstrated significant over-representation of MBM in the cerebral cortex and putamen, under-representation in white matter, and consistent localization near the gray-white matter junction. No arterial territory showed increased metastasis frequency after volume correction. This approach enables robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-centre analyses. Applied to MBM, it confirms and refines known spatial predilections, particularly preferential seeding near the gray-white matter junction and cortical regions. The publicly available implementation facilitates reproducible research and extension to other brain tumours and neurological pathologies.

Chinese Translation

黑色素瘤脑转移（MBM）是常见且空间异质性显著的病变，由于解剖变异性和不同的MRI协议，给群体水平分析带来了复杂性。我们提出了一种完全可微分的基于深度学习的可变形配准框架，该框架能够将个体病理大脑对齐到一个共同的图谱，同时保留转移性组织，无需病变掩膜或预处理。由于转移导致的缺失解剖对应关系通过基于距离变换的解剖标签相似性度量来处理，并结合体积保持正则化项以确保变形的合理性。使用Dice系数（DSC）、Hausdorff距离（HD）、平均对称表面距离（ASSD）和基于Jacobian的度量来评估配准性能。该方法应用于来自三个中心的209例MBM患者，能够标准化地将转移映射到解剖、动脉和灌注图谱。该框架在各数据集上实现了高配准精度（DSC 0.89-0.92，HD 6.79-7.60 mm，ASSD 0.63-0.77 mm），同时保留了转移体积。空间分析显示MBM在大脑皮层和壳核中显著过度表现，而在白质中表现不足，并且在灰白质交界处的一致定位。经过体积校正后，没有动脉区域显示转移频率增加。这种方法能够在没有病变掩膜的情况下实现病理大脑MRI的稳健图谱配准，并支持可重复的多中心分析。应用于MBM时，它确认并细化了已知的空间偏好，特别是在灰白质交界处和皮层区域的优先种植。公开可用的实现促进了可重复研究，并可扩展至其他脑肿瘤和神经病理。
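For readers unfamiliar with the Dice coefficient (DSC) reported above, here is a small self-contained Python sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|). Representing each segmentation as a set of voxel coordinates is a simplification chosen for illustration, and the score of 1.0 for two empty masks is an assumed convention. It says nothing about how the authors' pipeline computes the metric.

```python
from typing import Hashable, Iterable


def dice_coefficient(mask_a: Iterable[Hashable], mask_b: Iterable[Hashable]) -> float:
    """Dice overlap between two label masks given as collections of voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Toy 2D example: two three-voxel masks that share two voxels.
WARPED_LABEL = {(0, 0), (0, 1), (1, 0)}
ATLAS_LABEL = {(0, 1), (1, 0), (1, 1)}
TOY_DSC = dice_coefficient(WARPED_LABEL, ATLAS_LABEL)
```

A DSC of 1 means the warped label and the atlas label coincide exactly, and 0 means they do not overlap at all.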


cs.CV / 48 / 2602.12936

Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation

在边缘释放多模态大语言模型：通过自适应奇异值分解蒸馏的跨模态重识别统一框架

Jiang, Hongbo, Li, Jie, Cai, Xinqi, Xie, Tianyu, Shen, Yunhang, Dai, Pingyang, Cao, Liujuan

Abstract

Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) faces challenges due to maintaining a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We leverage instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to deploy its knowledge onto an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property in the teacher's feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.

Chinese Translation

跨模态重识别（CM-ReID）的实际云边部署面临挑战，因为需要维护一个针对多种模态的专用云模型的碎片化生态系统。尽管多模态大语言模型（MLLMs）提供了强大的统一潜力，但现有方法未能将其适配为单一的端到端骨干网络，并且缺乏有效的知识蒸馏策略以便于边缘部署。为了解决这些局限性，我们提出了MLLMEmbed-ReID，这是一个基于强大云边架构的统一框架。首先，我们将一个基础的MLLM适配为最先进的云模型。我们利用基于指令的提示来引导MLLM生成跨RGB、红外、素描和文本模态的统一嵌入空间。该模型随后采用分层低秩适配微调（LoRA-SFT）策略进行高效训练，并在整体跨模态对齐目标下进行优化。其次，为了将其知识部署到边缘原生学生模型上，我们引入了一种新颖的蒸馏策略，该策略受到教师特征空间中的低秩特性的启发。为了优先考虑重要信息，该方法采用主成分映射损失，同时通过特征关系损失保留关系结构。我们的轻量级边缘模型在多个视觉CM-ReID基准测试中实现了最先进的性能，而其云端对应模型在所有CM-ReID基准测试中表现优异。因此，MLLMEmbed-ReID框架为在资源受限设备上部署统一的MLLM级智能提供了完整而有效的解决方案。代码和模型将很快开源。


cs.CV / 49 / 2602.12957

Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

无训练加速文档解析视觉语言模型的层次性推测解码

Liao, Wenhui, Li, Hongliang, Xie, Pengyu, Cai, Xinyu, Shen, Yufan, Xin, Yi, Qin, Qi, Ye, Shenglong, Li, Tianbin, Hu, Ming, He, Junjun, Liu, Yihao, Wang, Wenhai, Dou, Min, Fu, Bin, Shi, Botian, Qiao, Yu, Jin, Lianwen

Abstract

Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.

Chinese Translation

文档解析是多模态理解中的一项基础任务，支持信息提取和智能文档分析等广泛的下游应用。得益于强大的语义建模和稳健的泛化能力，基于视觉语言模型（VLM）的端到端方法近年来已成为主流范式。然而，这些模型在推理时常常面临显著的延迟，因为它们在处理长文档时必须自回归地生成长序列的标记。在本研究中，受到文档解析中常见的极长输出和复杂布局结构的启发，我们提出了一种无训练且高效的加速方法。我们借鉴了推测解码的思路，采用轻量级文档解析管道作为草稿模型，以预测未来标记的批次，同时更精确的VLM则并行验证这些草稿预测。此外，我们进一步利用文档的布局结构特性，通过将每一页划分为独立区域，使得每个区域可以使用相同的草稿-验证策略进行并行解码。最终的预测结果根据自然阅读顺序进行组装。实验结果证明了我们方法的有效性：在通用的OmniDocBench上，我们的方法为dots.ocr模型提供了2.42倍的无损加速，并在长文档解析任务中实现了高达4.89倍的加速。我们将发布我们的代码，以促进可重复性和未来的研究。
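The draft-then-verify idea in the abstract above can be illustrated with a generic greedy speculative-decoding step. This is a minimal Python sketch, not the paper's hierarchical method. It assumes greedy decoding and a hypothetical `target_next` callback that returns the verifier's next token for a given context. A real system would score all draft positions in one parallel forward pass, whereas the loop here checks them one by one for clarity.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Accept the longest draft prefix the target agrees with, then add one target token."""
    accepted: List[int] = []
    for token in draft:
        expected = target_next(list(prefix) + accepted)
        if token != expected:
            # First disagreement: keep the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # The whole draft was accepted, so the target contributes one extra token.
    accepted.append(target_next(list(prefix) + accepted))
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Stand-in verifier: the next token is the context length modulo 5."""
    return len(context) % 5
```

Because every emitted token is the one the verifier would have chosen greedily, the output matches plain greedy decoding with the verifier alone. That is the sense in which such acceleration is lossless: a good draft only changes how many tokens are emitted per verification step.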


cs.CV / 50 / 2602.12983

Detecting Object Tracking Failure via Sequential Hypothesis Testing

通过序贯假设检验检测目标跟踪失败

Muñoz, Alejandro Monroy, Verma, Rajeev, Timans, Alexander

Abstract

Real-time online object tracking in videos constitutes a core task in computer vision, with wide-ranging applications including video surveillance, motion capture, and robotics. Deployed tracking systems usually lack formal safety assurances to convey when tracking is reliable and when it may fail, at best relying on heuristic measures of model confidence to raise alerts. To obtain such assurances we propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. Leveraging recent advancements in the field, our sequential test (formalized as an e-process) quickly identifies when tracking failures set in whilst provably containing false alerts at a desired rate, and thus limiting potentially costly re-calibration or intervention steps. The approach is computationally light-weight, requires no extra training or fine-tuning, and is in principle model-agnostic. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information, and demonstrate its effectiveness for two established tracking models across four video benchmarks. As such, sequential testing can offer a statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems.

Chinese Translation

实时在线视频目标跟踪是计算机视觉中的核心任务，广泛应用于视频监控、动作捕捉和机器人技术等领域。现有的跟踪系统通常缺乏正式的安全保障，无法明确何时跟踪是可靠的，何时可能会失败，通常依赖启发式的模型置信度来发出警报。为了获得这样的保障，我们提出将目标跟踪视为一种序贯假设检验，其中关于跟踪失败的证据会随着时间的推移逐渐积累。利用该领域的最新进展，我们的序贯检验（形式化为e-process）能够快速识别跟踪失败的发生，同时在所需的速率下可证明地控制虚假警报，从而限制潜在的高成本重新校准或干预步骤。该方法计算轻量，无需额外的训练或微调，并且原则上与模型无关。我们通过利用真实数据或仅依赖内部跟踪信息，提出了监督和无监督的变体，并在四个视频基准上展示了其在两个已建立跟踪模型中的有效性。因此，序贯检验可以为实时跟踪系统提供一个统计基础和高效的机制，以纳入安全保障。
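The sequential-testing idea above can be sketched with a textbook betting-style e-process. This is an illustration under stated assumptions, not the authors' test statistic. It assumes each frame yields a hypothetical error score in [0, 1], takes as null hypothesis that the conditional mean score is at most `m0`, and uses a fixed bet size `lam`. Under that null the wealth process is a nonnegative supermartingale, so by Ville's inequality the probability that it ever reaches 1/alpha is at most alpha.

```python
from typing import Iterable, Optional


def first_alarm(
    scores: Iterable[float], m0: float = 0.2, lam: float = 2.0, alpha: float = 0.05
) -> Optional[int]:
    """Return the first frame index at which the e-process crosses 1/alpha, else None."""
    if not 0.0 < m0 < 1.0:
        raise ValueError("m0 must lie strictly between 0 and 1")
    if not 0.0 <= lam <= 1.0 / m0:
        raise ValueError("lam must lie in [0, 1/m0] so that wealth stays nonnegative")
    wealth = 1.0
    for t, score in enumerate(scores):
        # Wealth grows when the score exceeds m0 and shrinks otherwise.
        wealth *= 1.0 + lam * (score - m0)
        if wealth >= 1.0 / alpha:
            return t
    return None


# Toy stream: five healthy frames followed by a sustained failure.
HEALTHY_THEN_FAILING = [0.1] * 5 + [0.9] * 20
ALARM_FRAME = first_alarm(HEALTHY_THEN_FAILING)
```

The fixed bet is the simplest choice. Adaptive betting schemes can react faster, at the cost of more bookkeeping.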


cs.CV / 51 / 2602.13003

MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting

MASAR：运动-外观协同精炼用于联合检测和轨迹预测

Lehocine, Mohammed Amine Bencheikh, Schmidt, Julian, Moosmann, Frank, Gupta, Dikshant, Flohr, Fabian

Abstract

Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of "looking backward to look forward", and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.

Chinese Translation

经典的自动驾驶系统通过手工设计的边界框接口连接感知和预测模块，这限制了信息流并将错误传播到下游任务。近期研究旨在开发端到端模型，联合解决感知和预测问题；然而，它们往往未能充分利用外观和运动线索之间的协同作用，主要依赖于短期视觉特征。我们遵循“向后看以便向前看”的理念，提出了MASAR，一个新颖的完全可微分框架，适用于任何基于变换器的3D检测器，进行联合3D检测和轨迹预测。MASAR采用了一种以对象为中心的时空机制，联合编码外观和运动特征。通过预测过去的轨迹并利用外观线索进行精炼，MASAR捕捉了增强未来轨迹预测的长期时间依赖性。在nuScenes数据集上进行的实验表明，MASAR的有效性，minADE和minFDE的改进超过20%，同时保持了稳健的检测性能。代码和模型可在https://github.com/aminmed/MASAR获取。
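The minADE and minFDE metrics cited above are commonly defined as the smallest average and smallest final displacement error over a set of candidate trajectories. The Python sketch below follows that common definition. The benchmark's exact protocol (number of modes, horizon, matching rules) is not given in the abstract and is not modelled here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: Sequence[Point], truth: Sequence[Point]) -> Tuple[float, float]:
    """Return (average, final) Euclidean displacement error for one predicted trajectory."""
    distances = [math.dist(p, g) for p, g in zip(pred, truth)]
    return sum(distances) / len(distances), distances[-1]


def min_ade_fde(candidates: List[Sequence[Point]], truth: Sequence[Point]) -> Tuple[float, float]:
    """Minimum ADE and minimum FDE, each taken independently over all candidates."""
    errors = [displacement_errors(c, truth) for c in candidates]
    return min(e[0] for e in errors), min(e[1] for e in errors)


# Toy example: one ground-truth path and two candidate forecasts.
TRUTH = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
CANDIDATES = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
    [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0)],  # exact until the last step, then 2 m off
]
TOY_MIN_ADE, TOY_MIN_FDE = min_ade_fde(CANDIDATES, TRUTH)
```

In this toy case the two minima come from different candidates, which is why the two numbers are usually reported side by side.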

cs.CV / 52 / 2602.13013

Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

面向通用视频多模态大语言模型的属性结构化与质量验证指令

Li, Yunheng, Zhang, Hengrui, Guo, Meng-Hao, Gao, Wenzhao, Jia, Shaoyong, Jiao, Shaohui, Hou, Qibin, Cheng, Ming-Ming

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

Chinese Translation

通用视频理解需要在多样化的现实场景中对细粒度的视觉和音频信息进行时间建模。然而，现有模型的性能主要受到视频指令数据的限制，这些数据将复杂的视听内容表示为单一的不完整描述，缺乏细粒度的组织和可靠的注释。为了解决这个问题，我们提出了：(i) ASID-1M，一个开源的包含一百万个结构化、细粒度视听指令注释的集合，具有单属性和多属性监督；(ii) ASID-Verify，一个可扩展的数据整理管道，用于注释，具有自动验证和精炼功能，确保描述与相应视听内容之间的语义和时间一致性；以及 (iii) ASID-Captioner，一个通过在 ASID-1M 上进行监督微调（Supervised Fine-Tuning, SFT）训练的视频理解模型。在涵盖视听字幕、属性字幕、基于字幕的问答和基于字幕的时间定位的七个基准测试中的实验表明，ASID-Captioner 提高了细粒度字幕的质量，同时减少了幻觉现象并改善了指令遵循。它在开源模型中实现了最先进的性能，并且与 Gemini-3-Pro 具有竞争力。

cs.CV / 53 / 2602.13015

Multimodal Classification via Total Correlation Maximization

通过总相关性最大化的多模态分类

Yu, Feng, Wu, Xiangyu, Yang, Yang, Lu, Jianfeng

Abstract

Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at https://github.com/hubaak/TCMax.

Chinese Translation

多模态学习整合来自不同传感器的数据，以有效利用来自不同模态的信息。然而，最近的研究表明，联合学习往往会对某些模态过拟合，而忽视其他模态，导致其性能低于单模态学习。尽管之前的努力试图平衡模态贡献或结合联合学习和单模态学习，从而减轻较弱模态的退化并取得了良好的结果，但很少有研究从信息论的角度考察联合学习和单模态学习之间的关系。在本文中，我们从理论上分析了模态竞争，并提出了一种通过最大化多模态特征与标签之间的总相关性来进行多模态分类的方法。通过最大化这一目标，我们的方法缓解了模态竞争，同时通过特征对齐捕捉模态间的交互。基于互信息神经估计（Mutual Information Neural Estimation, MINE），我们引入了总相关性神经估计（Total Correlation Neural Estimation, TCNE），以推导总相关性的下界。随后，我们提出了TCMax，这是一种无超参数的损失函数，通过变分界优化最大化总相关性。大量实验表明，TCMax在性能上超越了最先进的联合学习和单模态学习方法。我们的代码可在 https://github.com/hubaak/TCMax 获取。
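
For readers unfamiliar with the quantity being maximized, the following is a sketch of the standard definitions, written for two modality features $Z_1, Z_2$ and a label $Y$. The symbols and the critic $T_\theta$ are our notation; the exact TCNE bound derived in the paper may differ from this generic Donsker-Varadhan form.

```latex
% Total correlation of two modality features and the label
\mathrm{TC}(Z_1, Z_2, Y)
  = D_{\mathrm{KL}}\!\left( p(z_1, z_2, y) \,\middle\|\, p(z_1)\,p(z_2)\,p(y) \right)
  = H(Z_1) + H(Z_2) + H(Y) - H(Z_1, Z_2, Y)

% Donsker-Varadhan lower bound with a critic T_\theta (the form used by MINE)
\mathrm{TC}(Z_1, Z_2, Y)
  \ge \mathbb{E}_{p(z_1, z_2, y)}\!\left[ T_\theta(z_1, z_2, y) \right]
  - \log \mathbb{E}_{p(z_1)\,p(z_2)\,p(y)}\!\left[ e^{T_\theta(z_1, z_2, y)} \right]
```

Maximizing the right-hand side over $\theta$ tightens the bound, which is what makes it usable as a training objective.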

cs.CV / 54 / 2602.13020

DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation

DynaGuide：一种可泛化的动态引导框架用于无监督语义分割

Guermazi, Boujemaa, Ksantini, Riadh, Khan, Naimul

Abstract

Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings. Code available at: https://github.com/RyersonMultimediaLab/DynaGuide

Chinese Translation

无监督图像分割是计算机视觉中的一项关键任务。它能够在没有人工标注的情况下实现密集场景理解，这在标注数据稀缺的领域尤为重要。然而，现有方法往往难以协调全局语义结构与细粒度边界精度之间的关系。本文介绍了DynaGuide，一种自适应分割框架，通过新颖的双重引导策略和动态损失优化来应对这些挑战。在我们之前的工作DynaSeg的基础上，DynaGuide结合了来自零样本模型（如DiffSeg或SegFormer）的全局伪标签与使用从头训练的轻量级卷积神经网络（CNN）进行的局部边界细化。这种协同作用使模型能够纠正粗糙或噪声的全局预测，并生成高精度的分割结果。DynaGuide的核心是一个多组件损失函数，动态平衡特征相似性、Huber平滑的空间连续性（包括对角关系）以及与全局伪标签的语义对齐。与以往方法不同，DynaGuide在目标领域完全不依赖于真实标签进行训练，并支持多种引导源的即插即用集成。在BSD500、PASCAL VOC2012和COCO上的大量实验表明，DynaGuide达到了最先进的性能，在BSD500上提高了17.5%的mIoU，在PASCAL VOC2012上提高了3.1%，在COCO上提高了11.66%。凭借其模块化设计、强大的泛化能力和最小的计算开销，DynaGuide为现实环境中的无监督分割提供了一种可扩展且实用的解决方案。代码可在：https://github.com/RyersonMultimediaLab/DynaGuide
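
One ingredient of the loss, Huber-smoothed spatial continuity with diagonal neighbors, can be illustrated with a small sketch. This is not the DynaGuide implementation: it works on a single-channel grid of scalars, the neighbor set and the averaging are our assumptions, and the real loss operates on network responses and is combined with the other terms.

```python
def huber(d, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large differences."""
    a = abs(d)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def spatial_continuity_loss(grid, delta=1.0):
    """Mean Huber penalty over right, down, and both diagonal neighbor pairs."""
    h, w = len(grid), len(grid[0])
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]  # right, down, down-right, down-left
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    total += huber(grid[r][c] - grid[rr][cc], delta)
                    count += 1
    return total / count if count else 0.0
```

A constant map gives zero loss, while sharp jumps are penalized only linearly, which keeps genuine object boundaries from dominating the objective.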

cs.CV / 55 / 2602.13022

Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels

基于增强激光雷达伪标签的图像树冠分割学习

Pesonen, Julius, Rua, Stefan, Taher, Josef, Koivumäki, Niko, Yu, Xiaowei, Honkavaara, Eija

Abstract

Mapping individual tree crowns is essential for tasks such as maintaining urban tree inventories and monitoring forest health, which help us understand and care for our environment. However, automatically separating the crowns from each other in aerial imagery is challenging due to factors such as texture and partial overlaps between tree crowns. In this study, we present a method to train deep learning models that segment and separate individual trees from RGB and multispectral images, using pseudo-labels derived from aerial laser scanning (ALS) data. Our study shows that the ALS-derived pseudo-labels can be enhanced using a zero-shot instance segmentation model, Segment Anything Model 2 (SAM 2). Our method offers a way to obtain domain-specific training annotations for optical image-based models without any manual annotation cost, leading to segmentation models that outperform the available models targeted at general-domain deployment on the same task.

Chinese Translation

映射单个树冠对于维护城市树木清单和监测森林健康等任务至关重要，这有助于我们理解和关心我们的环境。然而，由于纹理和部分树冠重叠等因素，在航空图像中自动分离树冠是具有挑战性的。在本研究中，我们提出了一种方法，通过使用源自航空激光扫描（ALS）数据的伪标签来训练深度学习模型，从RGB和多光谱图像中分割和分离单个树木。我们的研究表明，基于ALS的伪标签可以通过零样本实例分割模型Segment Anything Model 2（SAM 2）进行增强。我们的方法提供了一种获取特定领域训练注释的方法，无需任何人工标注成本，从而导致的分割模型在同一任务上超越了任何可用的针对一般领域部署的模型。

cs.CV / 56 / 2602.13024

FedHENet: A Frugal Federated Learning Framework for Heterogeneous Environments

FedHENet：一种适用于异构环境的节约型联邦学习框架

Dopico-Castro, Alejandro, Fontenla-Romero, Oscar, Guijarro-Berdiñas, Bertha, Alonso-Betanzos, Amparo, Digón, Iván Pérez

Abstract

Federated Learning (FL) enables collaborative training without centralizing data, which is essential for privacy compliance in real-world scenarios involving sensitive visual information. Most FL approaches rely on expensive, iterative deep network optimization, which still risks privacy via shared gradients. In this work, we propose FedHENet, extending the FedHEONN framework to image classification. By using a fixed, pre-trained feature extractor and learning only a single output layer, we avoid costly local fine-tuning. This layer is learned by analytically aggregating client knowledge in a single round of communication using homomorphic encryption (HE). Experiments show that FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating superior stability and up to 70% better energy efficiency. Crucially, our method is hyperparameter-free, removing the carbon footprint associated with hyperparameter tuning in standard FL. Code is available at https://github.com/AlejandroDopico2/FedHENet/

Chinese Translation

联邦学习（Federated Learning, FL）使得在不集中数据的情况下进行协作训练成为可能，这对于涉及敏感视觉信息的现实场景中的隐私合规至关重要。大多数FL方法依赖于昂贵的迭代深度网络优化，这仍然通过共享梯度存在隐私风险。在本研究中，我们提出了FedHENet，将FedHEONN框架扩展到图像分类中。通过使用固定的、预训练的特征提取器并仅学习一个输出层，我们避免了代价高昂的本地微调。该层通过使用同态加密（homomorphic encryption, HE）在一次通信中分析性地聚合客户端知识来学习。实验表明，FedHENet在准确性上与迭代FL基线相比具有竞争力，同时展示了更优的稳定性表现和高达70%的能效提升。关键是，我们的方法不需要超参数，消除了标准FL中与超参数调优相关的碳足迹。代码可在 https://github.com/AlejandroDopico2/FedHENet/ 获取。
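
The idea of learning one output layer analytically in a single communication round can be sketched with a deliberately simpler stand-in: ridge regression on fixed features, where each client sends only additive sufficient statistics. This is not the FedHEONN/FedHENet formulation, and encryption is omitted; in a homomorphic setting the server-side sums would be computed on ciphertexts. All function names and the regularizer `lam` are hypothetical.

```python
def local_stats(features, targets):
    """Client side: accumulate A = sum(x x^T) and b = sum(x * y) on frozen features."""
    d = len(features[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in zip(features, targets):
        for i in range(d):
            b[i] += x[i] * y
            for j in range(d):
                A[i][j] += x[i] * x[j]
    return A, b

def aggregate(stats):
    """Server side: element-wise sums (the step HE would protect)."""
    d = len(stats[0][1])
    A = [[sum(s[0][i][j] for s in stats) for j in range(d)] for i in range(d)]
    b = [sum(s[1][i] for s in stats) for i in range(d)]
    return A, b

def solve_ridge(A, b, lam=1e-3):
    """Solve (A + lam * I) w = b by Gauss-Jordan elimination with partial pivoting."""
    d = len(b)
    M = [[A[i][j] + (lam if i == j else 0.0) for j in range(d)] + [b[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(d):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, d + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][d] / M[i][i] for i in range(d)]
```

Because the statistics are sums, the aggregated solution equals the one a central server would obtain from the pooled data, with no iterative gradient exchange.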

cs.CV / 57 / 2602.13028

Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

人类对齐的多模态大型语言模型评估者用于细粒度图像编辑评估：基准、框架与分析

Liu, Runzhou, Weingord, Hailey, Mittal, Sejal, Dungarwal, Prakhar, Nandula, Anusha, Ni, Bo, Basu, Samyadeep, Chen, Hongjie, Ahmed, Nesreen K., Li, Li, Zhang, Jiayi, Goswami, Koustava, Mukherjee, Subhojyoti, Kveton, Branislav, Mathur, Puneet, Dernoncourt, Franck, Zhao, Yue, Wang, Yu, Rossi, Ryan A., Tu, Zhengzhong, Du, Hongru

Abstract

Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.

Chinese Translation

评估图像编辑模型仍然面临挑战，因为传统指标的粗粒度和有限的可解释性常常无法捕捉对人类感知和意图重要的方面。这些指标往往奖励视觉上合理的输出，却忽视了可控性、编辑定位和对用户指令的忠实度。在本研究中，我们提出了一种细粒度的多模态大型语言模型（MLLM）评估框架，用于图像编辑，该框架将常见评估概念分解为十二个细粒度的可解释因素，涵盖图像保留、编辑质量和指令忠实度。在此基础上，我们提出了一个新的经过人类验证的基准，整合了人类判断、基于MLLM的评估、模型输出和传统指标，适用于多种图像编辑任务。通过广泛的人类研究，我们展示了所提出的MLLM评估者在细粒度上与人类评估高度一致，支持其作为可靠且可扩展的评估者的使用。我们进一步证明，传统的图像编辑指标往往无法有效代表这些因素，未能区分过度编辑或语义不准确的输出，而我们的评估者在离线和在线环境中提供了更直观和信息丰富的评估。综上所述，本研究引入了一个基准、一个原则化的分解，以及实证证据，将细粒度的MLLM评估者定位为研究、比较和改进图像编辑方法的实用基础。

cs.CV / 58 / 2602.13041

Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images

基于隐式尺度的三维重建用于从单目图像估计多种食品体积

Chen, Yuhao, Vinod, Gautham, Raghavan, Siddeshwar, Mahmud, Talha Ibn, Coburn, Bruce, Ma, Jinge, Zhu, Fengqing, He, Jiangpeng

Abstract

We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.

Chinese Translation

我们提出了基于隐式尺度的单目多食品图像三维重建，构建了一个基准数据集，旨在推动现实就餐场景中基于几何的食品份量估计。现有的膳食评估方法主要依赖于单图像分析或基于外观的推断，包括最近的视觉-语言模型，这些方法缺乏明确的几何推理，并且对尺度模糊性敏感。该基准将食品份量估计重新定义为在单目观察下的隐式尺度三维重建问题。为了反映现实世界的条件，显式的物理参考和度量注释被移除；相反，提供了盘子和餐具等上下文对象，要求算法从隐式线索和先验知识中推断尺度。该数据集强调了具有多样化物体几何形状、频繁遮挡和复杂空间排列的多食品场景。该基准在MetaFood 2025研讨会上被作为挑战采用，多支团队提出了基于重建的解决方案。实验结果表明，尽管强大的视觉-语言基线实现了竞争性表现，但基于几何的重建方法提供了更高的准确性和更强的鲁棒性，表现最好的方法在体积估计中达到了0.21的平均绝对百分比误差（MAPE）和5.7的L1 Chamfer距离在几何准确性上。
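
The two reported metrics can be sketched generically as below. These are textbook forms, not the challenge's evaluation script: MAPE is returned as a fraction, and the Chamfer distance here sums the two directed mean nearest-neighbor Euclidean distances, which is only one of several conventions in use (an assumption on our part, as are units and any normalization).

```python
import math

def mape(pred, true):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    return sum(abs(p - t) / abs(t) for p, t in zip(pred, true)) / len(true)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets given as lists of 3D tuples."""
    def directed(src, dst):
        # mean distance from each source point to its nearest destination point
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return directed(a, b) + directed(b, a)
```

MAPE scores the estimated volume per food item, while the Chamfer distance scores how closely the reconstructed surface points match the reference geometry.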

cs.CV / 59 / 2602.13055

Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Curriculum-DPO++：通过数据和模型课程进行文本到图像生成的直接偏好优化

Croitoru, Florinel-Alin, Hondru, Vlad, Ionescu, Radu Tudor, Sebe, Nicu, Shah, Mubarak

Abstract

Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.

Chinese Translation

直接偏好优化（Direct Preference Optimization, DPO）被提出作为一种有效且高效的替代方案，以替代基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）。然而，无论是RLHF还是DPO都未考虑到学习某些偏好比学习其他偏好更为困难，这使得优化过程并不理想。为了解决文本到图像生成中的这一问题，我们最近提出了Curriculum-DPO，这是一种通过难度组织图像对的方法。在本文中，我们介绍了Curriculum-DPO++，一种增强的方法，结合了原始数据级课程与新颖的模型级课程。更准确地说，我们提议在训练进展时动态增加去噪网络的学习能力。我们通过两种机制实现这一能力的提升。首先，我们仅用原始Curriculum-DPO中可训练层的子集初始化模型。随着训练的进行，我们逐步解冻层，直到配置与完整的基线架构匹配。其次，由于微调基于低秩适应（Low-Rank Adaptation, LoRA），我们为低秩矩阵的维度实施渐进式调度。我们并不保持固定的能力，而是以显著小于基线的维度初始化低秩矩阵。随着训练的进行，我们逐步增加它们的秩，使能力增长，直到收敛到与Curriculum-DPO相同的秩值。此外，我们提出了一种替代Curriculum-DPO所采用的排名策略。最后，我们在九个基准测试中将Curriculum-DPO++与Curriculum-DPO及其他最先进的偏好优化方法进行了比较，在文本对齐、美学和人类偏好方面超越了竞争方法。我们的代码可在 https://github.com/CroitoruAlin/Curriculum-DPO 获取。
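
The model-level curriculum (progressively unfreezing layers and growing the LoRA rank) can be pictured as a staircase schedule. The sketch below is purely illustrative: the stage count, the start and end values, and the equal-length stages are hypothetical choices, since the abstract does not state the actual schedule.

```python
def curriculum_stage(step, total_steps, n_stages):
    """Index of the current stage, assuming equal-length stages (n_stages >= 2)."""
    return min(n_stages - 1, step * n_stages // total_steps)

def lora_rank_at(step, total_steps, r_start=4, r_final=64, n_stages=4):
    """LoRA rank grows stage by stage until it reaches the baseline rank r_final."""
    stage = curriculum_stage(step, total_steps, n_stages)
    return r_start + (r_final - r_start) * stage // (n_stages - 1)

def trainable_layers_at(step, total_steps, n_start=2, n_final=8, n_stages=4):
    """Number of unfrozen layers grows until the full baseline configuration is active."""
    stage = curriculum_stage(step, total_steps, n_stages)
    return n_start + (n_final - n_start) * stage // (n_stages - 1)
```

Both schedules end at the baseline configuration, matching the abstract's statement that capacity converges to the values used in Curriculum-DPO.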

cs.CV / 60 / 2602.13066

A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI Models

用于检测生成MRI模型训练数据泄漏的校准记忆指数（MI）

Deo, Yash, Jia, Yan, Lassila, Toni, Hodge, Victoria J, Frang, Alejandro F, Qian, Chenghao, Kang, Siyuan, Habli, Ibrahim

Abstract

Image generative models are known to duplicate images from the training data as part of their outputs, which can lead to privacy concerns when used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. Our metric uses image features extracted using an MRI foundation model, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to bounded Overfit/Novelty Index (ONI) and Memorization Index (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and provides more consistent metric values across datasets. At the sample level, our metric achieves near-perfect detection of duplicates.

Chinese Translation

图像生成模型已知会将训练数据中的图像复制为其输出的一部分，这在用于医学图像生成时可能引发隐私问题。我们提出了一种校准的逐样本指标，用于检测训练数据的记忆和重复。我们的指标使用通过MRI基础模型提取的图像特征，聚合多层白化最近邻相似度，并将其映射到有界的过拟合/新颖性指数（ONI）和记忆指数（MI）得分。在三个具有控制重复百分比和典型图像增强的MRI数据集中，我们的指标稳健地检测重复，并在不同数据集中提供更一致的指标值。在样本层面，我们的指标实现了几乎完美的重复检测。

cs.CV / 61 / 2602.13067

SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery

SIEFormer：用于广义类别发现的光谱可解释和增强变换器

Li, Chunming, Wang, Shidong, Xin, Tong, Zhang, Haofeng

Abstract

This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, each corresponding to an implicit and explicit spectral perspective of the ViT, enabling joint optimization. The implicit branch realizes the use of different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input "value" features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.

Chinese Translation

本文提出了一种新颖的方法——光谱可解释和增强变换器（SIEFormer），该方法利用光谱分析重新解释视觉变换器（ViT）中的注意力机制，并增强特征适应性，特别强调在具有挑战性的广义类别发现（GCD）任务中的应用。所提出的SIEFormer由两个主要分支组成，分别对应于ViT的隐式和显式光谱视角，从而实现联合优化。隐式分支利用不同类型的图拉普拉斯算子来建模标记的局部结构相关性，并引入了一种新颖的带自适应滤波器（BaF）层，可以灵活地执行带通和带拒绝滤波。显式分支则引入了一种可调滤波层（MFL），通过对输入的“值”特征应用傅里叶变换来学习标记之间的全局依赖关系，利用一组可学习的参数在频域中调制变换信号，然后执行逆傅里叶变换以获得增强特征。大量实验表明，在多个图像识别数据集上实现了最先进的性能，通过消融研究和可视化进一步确认了我们方法的优越性。
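
The explicit branch's transform, modulate, inverse-transform pattern can be sketched on a one-dimensional signal. This is a toy illustration, not the MFL layer: the `gains` are fixed numbers here (they would be learnable parameters in the paper), the transform is applied along a single token axis, and only the real part of the result is kept. All of these are our assumptions.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def frequency_filter(values, gains):
    """Scale each frequency bin by its gain, then return to the token domain."""
    spectrum = dft(values)
    modulated = [s * g for s, g in zip(spectrum, gains)]
    return [v.real for v in dft(modulated, inverse=True)]
```

With all gains equal to one the input is returned unchanged, and keeping only the zero-frequency bin replaces every token by the sequence mean, which shows how a single frequency-domain weight acts globally across tokens.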

cs.CV / 62 / 2602.13091

Universal Transformation of One-Class Classifiers for Unsupervised Anomaly Detection

一类分类器的通用转化用于无监督异常检测

McIntosh, Declan, Albu, Alexandra Branzan

Abstract

Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and environmental monitoring. Anomaly detection is typically formulated as a one-class classification problem, where the training data consists solely of nominal values, leaving methods built on this assumption susceptible to training label noise. We present a dataset folding method that transforms an arbitrary one-class classifier-based anomaly detector into a fully unsupervised method. This is achieved by making a set of key weak assumptions: that anomalies are uncommon in the training dataset and generally heterogeneous. These assumptions enable us to utilize multiple independently trained instances of a one-class classifier to filter the training dataset for anomalies. This transformation requires no modifications to the underlying anomaly detector; the only changes are algorithmically selected data subsets used for training. We demonstrate that our method can transform a wide variety of one-class classifier anomaly detectors for both images and videos into unsupervised ones. Our method creates the first unsupervised logical anomaly detectors by transforming existing methods. We also demonstrate that our method achieves state-of-the-art performance for unsupervised anomaly detection on the MVTec AD, ViSA, and MVTec Loco AD datasets. As improvements to one-class classifiers are made, our method directly transfers those improvements to the unsupervised domain, linking the domains.

Chinese Translation

在图像和视频中检测异常是多个现实世界问题的重要任务，包括工业检测、计算机辅助诊断和环境监测。异常检测通常被表述为一类分类问题，其中训练数据仅由正常值组成，这使得基于这一假设构建的方法容易受到训练标签噪声的影响。我们提出了一种数据集折叠方法，将任意一类分类器基础的异常检测器转化为完全无监督的方法。这是通过做出一系列关键的弱假设来实现的：即异常在训练数据集中是不常见的，并且通常是异质的。这些假设使我们能够利用多个独立训练的一类分类器实例来过滤训练数据集中的异常。这一转化不需要对基础异常检测器进行修改；唯一的变化是用于训练的算法选择的数据子集。我们证明了我们的方法可以将各种一类分类器异常检测器（适用于图像和视频）转化为无监督的检测器。我们的方法通过转化现有方法创造了首个无监督逻辑异常检测器。我们还展示了我们的方法在 MVTec AD、ViSA 和 MVTec Loco AD 数据集上实现了无监督异常检测的最先进性能。随着对一类分类器的改进，我们的方法将这些改进直接转移到无监督领域，连接了这两个领域。
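
The general idea of using several independently trained one-class models to clean the training set can be sketched as below. The abstract does not give the actual folding or selection rule, so everything here is an assumption: the fold assignment, the toy "detector" (distance to the training median), and the fixed drop fraction are stand-ins chosen only to make the loop runnable.

```python
import statistics

def fold_filter(samples, k=4, drop_frac=0.125):
    """Score each sample with a model trained without its fold, then drop the top scorers."""
    scores = [0.0] * len(samples)
    for fold in range(k):
        train = [x for i, x in enumerate(samples) if i % k != fold]
        center = statistics.median(train)  # "training" the toy one-class model
        for i, x in enumerate(samples):
            if i % k == fold:
                scores[i] = abs(x - center)  # anomaly score of a held-out sample
    n_drop = int(round(drop_frac * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    drop = set(ranked[:n_drop])
    return [x for i, x in enumerate(samples) if i not in drop]
```

The underlying detector is never modified; only the subset of data it is trained on changes, which is the property the abstract emphasizes.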

cs.CV / 63 / 2602.13168

Realistic Face Reconstruction from Facial Embeddings via Diffusion Models

通过扩散模型从面部嵌入重建真实面孔

Han, Dong, Li, Yong, Denzler, Joachim

Abstract

With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) systems have gained popularity for their accurate recognition, enhanced facial privacy protection, and robustness to various attacks. However, few studies have further verified the privacy risks by reconstructing realistic high-resolution face images from the embeddings of these systems, especially for PPFR. In this work, we propose face embedding mapping (FEM), a general framework that explores the Kolmogorov-Arnold Network (KAN) for conducting an embedding-to-face attack against state-of-the-art (SOTA) FR and PPFR systems by leveraging a pre-trained Identity-Preserving diffusion model. Based on extensive experiments, we verify that reconstructed faces can be used for accessing other real-world FR systems. In addition, the proposed method is robust when reconstructing faces from partial and protected face embeddings. Moreover, FEM can be utilized as a tool for evaluating the safety of FR and PPFR systems in terms of privacy leakage. All images used in this work are from public datasets.

Chinese Translation

随着人脸识别（FR）系统的发展，隐私保护人脸识别（PPFR）系统因其准确的识别能力、增强的面部隐私保护以及对各种攻击的鲁棒性而受到广泛关注。然而，目前针对这些系统（特别是PPFR）从嵌入重建真实高分辨率人脸图像以进一步验证隐私风险的研究仍然有限。在本研究中，我们提出了面部嵌入映射（FEM），这是一个通用框架，探索了Kolmogorov-Arnold网络（KAN），通过利用预训练的身份保护扩散模型对最先进（SOTA）的FR和PPFR系统进行嵌入到面部的攻击。基于广泛的实验，我们验证了重建的人脸可以用于访问其他真实世界的FR系统。此外，所提出的方法在从部分和受保护的面部嵌入中重建人脸方面表现出鲁棒性。此外，FEM可以作为评估FR和PPFR系统在隐私泄露方面安全性的工具。本研究中使用的所有图像均来自公共数据集。

cs.CV / 64 / 2602.13172

LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

LongStream：长序列流式自回归视觉几何

Cheng, Chong, Chen, Xianda, Xie, Tao, Yin, Wei, Ren, Weiqiang, Zhang, Qian, Guo, Xiaoyuang, Wang, Hao

Abstract

Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/

Chinese Translation

长序列流式三维重建仍然是一个重要的开放挑战。现有的自回归模型在处理长序列时往往失败。它们通常将姿态锚定在第一帧，这导致注意力衰减、尺度漂移和外推误差。我们提出了LongStream，一种新颖的量度解耦流式视觉几何模型，旨在跨越数千帧进行度量尺度场景重建。我们的方法有三个方面。首先，我们舍弃第一帧锚点，预测相对关键帧的姿态。这将长距离外推重新构造为一个难度恒定的局部任务。其次，我们引入正交尺度学习。这种方法完全将几何与尺度估计解耦，以抑制漂移。最后，我们解决了Transformer缓存问题，如对注意力的依赖和长期KV缓存污染。我们提出了缓存一致性训练，并结合定期缓存刷新。这种方法抑制了超长序列中的注意力衰减，并减少了训练与推理之间的差距。实验表明，LongStream实现了最先进的性能。它在公里级序列上以18帧每秒提供稳定的度量尺度重建。项目页面：https://3dagentworld.github.io/longstream/
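
Predicting keyframe-relative poses instead of first-frame-anchored poses can be illustrated with planar rigid transforms. The sketch below uses SE(2) poses `(x, y, theta)` for brevity, whereas the model works with full 3D camera poses; the function names are ours, and angles are not wrapped.

```python
import math

def relative_pose(key, frame):
    """Pose of `frame` expressed in the coordinate frame of `key` (inverse(key) * frame)."""
    kx, ky, kt = key
    fx, fy, ft = frame
    dx, dy = fx - kx, fy - ky
    c, s = math.cos(kt), math.sin(kt)
    return (c * dx + s * dy, -s * dx + c * dy, ft - kt)

def compose(key, rel):
    """Recover the global pose from a keyframe pose and a keyframe-relative pose."""
    kx, ky, kt = key
    rx, ry, rt = rel
    c, s = math.cos(kt), math.sin(kt)
    return (kx + c * rx - s * ry, ky + s * rx + c * ry, kt + rt)
```

The relative pose stays small no matter how far the keyframe is from the start of the sequence, which is why the prediction target does not grow harder over time.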

cs.CV / 65 / 2602.13176

Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace

单目无标记运动捕捉技术实现上肢可达工作空间的定量评估

Donahue, Seth, Peiffer, J. D., Richardson, R. Tyler, Zhong, Yishan, Tan, Shaun Q. Y., Marteau, Benoit, Russo, Stephanie R., Wang, May D., Cotton, R. James, Chafetz, Ross

Abstract

This study aims to validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Objective assessment and validation of these techniques for specific clinically oriented tasks are crucial for their adoption in clinical motion analysis. AI-driven monocular MMC reduces the barriers to adoption in the clinic and has the potential to reduce the overhead for analysis of this common clinical assessment. Nine adult participants with no impairments performed the standardized UERW task, which entails reaching targets distributed across a virtual sphere centered on the torso, with targets displayed in a VR headset. Movements were simultaneously captured using a marker-based motion capture system and a set of eight FLIR cameras. We performed monocular video analysis on two of these video camera views to compare frontal and offset camera configurations. The frontal camera orientation demonstrated strong agreement with the marker-based reference, exhibiting a minimal mean bias of 0.61 ± 0.12% reachspace reached per octant (mean ± standard deviation). In contrast, the offset camera view underestimated the percent workspace reached (-5.66 ± 0.45% reachspace reached). Conclusion: The findings support the feasibility of a frontal monocular camera configuration for UERW assessment, particularly for anterior workspace evaluation where agreement with marker-based motion capture was highest. The overall performance demonstrates clinical potential for practical, single-camera assessments. This study provides the first validation of a monocular MMC system for the assessment of the UERW task. By reducing technical complexity, this approach enables broader implementation of quantitative upper extremity mobility assessment.

Chinese Translation

本研究旨在验证一种临床可及的方法，通过使用单个（单目）摄像头和基于人工智能（AI）的无标记运动捕捉（MMC）技术，对上肢可达工作空间（UERW）进行定量分析。对这些技术在特定临床任务中的客观评估和验证对于其在临床运动分析中的应用至关重要。基于AI的单目MMC降低了在临床应用中的障碍，并有潜力减少对这一常见临床评估的分析开销。九名没有障碍的成年参与者执行标准化的UERW任务，该任务涉及在一个以躯干为中心的虚拟球体内达到分布的目标，目标通过虚拟现实（VR）头显显示。运动同时通过基于标记的运动捕捉系统和一组八个FLIR摄像头进行捕捉。我们对其中两个摄像头视角进行了单目视频分析，以比较正面和偏移摄像头配置。正面摄像头的方向与基于标记的参考结果表现出强一致性，每个卦限的平均偏差为 0.61 ± 0.12% 的可达空间（均值 ± 标准偏差）。相反，偏移摄像头视角低估了可达工作空间的百分比（-5.66 ± 0.45% 的可达空间）。结论：研究结果支持正面单目摄像头配置用于UERW评估的可行性，特别是在前方工作空间评估中，与基于标记的运动捕捉的一致性最高。整体表现显示出在实际应用中使用单摄像头评估的临床潜力。本研究首次验证了单目MMC系统在UERW任务评估中的应用。通过降低技术复杂性，该方法使定量上肢活动能力评估的更广泛实施成为可能。
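
The agreement statistic quoted above (mean bias ± standard deviation against the marker-based reference) can be computed as in this small sketch. It assumes paired measurements of percent workspace reached and uses the sample standard deviation; whether the study aggregates over participants, octants, or both is not stated in the abstract, so the pairing here is hypothetical.

```python
import statistics

def bias_summary(monocular, reference):
    """Mean and sample standard deviation of the paired differences (monocular - reference)."""
    diffs = [m - r for m, r in zip(monocular, reference)]
    return statistics.mean(diffs), statistics.stdev(diffs)
```

A mean bias near zero indicates that the monocular estimate neither over- nor under-reports the reached workspace on average, while a negative bias such as the offset view's indicates systematic underestimation.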

cs.CV / 66 / 2602.13185

FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

FlexAM：用于多功能视频生成控制的灵活外观-运动分解

Sheng, Mingzhi, Gu, Zekai, Li, Peng, Lin, Cheng, Guo, Hao-Xiang, Chen, Ying-Cong, Liu, Yuan

Abstract

Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of "appearance" and "motion" provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.

Chinese Translation

在视频生成中，实现有效且具有普遍适应性的控制仍然是一个重大挑战。虽然许多方法依赖于模糊或特定任务的信号，但我们认为对“外观”和“运动”的基本解耦提供了一条更稳健且可扩展的路径。我们提出了FlexAM，这是一个基于新颖的3D控制信号的统一框架。该信号将视频动态表示为点云，并引入了三个关键增强：多频率位置编码以区分细粒度运动、深度感知位置编码，以及用于平衡精度和生成质量的灵活控制信号。这种表示使FlexAM能够有效地解耦外观和运动，从而支持包括I2V/V2V编辑、相机控制和空间对象编辑在内的广泛任务。大量实验表明，FlexAM在所有评估任务中均表现出优越的性能。
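
Multi-frequency positional encoding is commonly realized with sinusoids at geometrically spaced frequencies. The sketch below shows that generic form for a single scalar coordinate; the frequency spacing and count are assumptions, and the encoding FlexAM applies to its point-cloud control signal may differ.

```python
import math

def multi_frequency_encoding(p, n_freqs=4):
    """Encode a scalar coordinate as [sin(2^k * pi * p), cos(2^k * pi * p)] for k < n_freqs."""
    feats = []
    for k in range(n_freqs):
        w = (2 ** k) * math.pi  # frequency doubles at each level
        feats.extend([math.sin(w * p), math.cos(w * p)])
    return feats
```

Higher-frequency components change quickly with position, which is what lets nearby points with slightly different motion receive distinguishable codes.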

cs.CV / 67 / 2602.13191

CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

CoPE-VideoLM：高效视频语言模型的编解码原语

Sarkar, Sayan Deb, Pautrat, Rémi, Miksik, Ondrej, Pollefeys, Marc, Armeni, Iro, Rad, Mahdi, Dusmanu, Mihai

Abstract

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.

Chinese Translation

视频语言模型（VideoLMs）使人工智能系统能够理解视频中的时间动态。为了适应最大上下文窗口限制，目前的方法使用关键帧采样，但由于时间覆盖稀疏，可能会错过宏观事件和微观细节。此外，处理每帧的完整图像及其标记会带来相当大的计算开销。为了解决这些限制，我们提出利用视频编解码原语（特别是运动矢量和残差），这些原语本质上编码了视频的冗余性和稀疏性，而无需对大多数帧进行昂贵的全图像编码。为此，我们引入了轻量级的基于变换器的编码器，这些编码器聚合编解码原语，并通过一种预训练策略将它们的表示与图像编码器嵌入对齐，从而加速端到端微调过程中的收敛。与标准的VideoLMs相比，我们的方法将首次标记的时间缩短了高达86%，标记使用量减少了高达93%。此外，通过调整关键帧和编解码原语的密度，我们能够在14个多样化的视频理解基准测试中保持或超过性能，这些基准涵盖了常规问答、时间推理、长篇理解和空间场景理解。

cs.CV / 68 / 2602.13195

Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

对话式图像分割：用可扩展的监督基础抽象概念

Sahoo, Aadarsh, Gkioxari, Georgia

Abstract

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/

Chinese Translation

对话式图像分割将抽象的、以意图驱动的概念转化为像素级精确的掩膜。之前关于图像定位的研究主要集中在类别和空间查询（例如，“最左边的苹果”），而忽视了功能和物理推理（例如，“我可以安全存放刀具的地方在哪里？”）。我们解决了这一空白，提出了对话式图像分割（Conversational Image Segmentation, CIS）和ConverSeg，一个涵盖实体、空间关系、意图、可用性、功能、安全性和物理推理的基准。我们还介绍了ConverSeg-Net，它将强大的分割先验与语言理解相结合，以及一个无需人工监督即可生成提示-掩膜对的AI驱动数据引擎。我们展示了当前的语言引导分割模型对于CIS是不够的，而在我们的数据引擎上训练的ConverSeg-Net在ConverSeg上取得了显著的提升，并在现有的语言引导分割基准上保持了强劲的表现。项目网页：https://glab-caltech.github.io/converseg/

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2602.12316

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

GT-HarmBench：通过博弈论视角评估人工智能安全风险

Cobben, Pepijn, Huang, Xuanqiang Angelo, Pham, Thao Amelia, Dahlgren, Isabel, Zhang, Terry Jingchen, Jin, Zhijing

Abstract

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.

Chinese Translation

前沿人工智能系统在高风险多智能体环境中越来越具备能力并被广泛部署。然而，现有的人工智能安全基准主要评估单一智能体，导致多智能体风险，如协调失败和冲突，尚未得到充分理解。我们引入了GT-HarmBench，这是一个包含2,009个高风险场景的基准，涵盖了博弈论结构，如囚徒困境、猎鹿游戏和鸡游戏。这些场景来源于麻省理工学院人工智能风险库中的现实人工智能风险背景。在15个前沿模型中，智能体在仅62%的情况下选择社会有益的行动，常常导致有害结果。我们测量了对博弈论提示框架和顺序的敏感性，并分析了导致失败的推理模式。我们进一步表明，博弈论干预措施可以将社会有益结果提高多达18%。我们的结果突显了显著的可靠性差距，并提供了一个广泛的标准化测试平台，用于研究多智能体环境中的对齐问题。基准和代码可在 https://github.com/causalNLP/gt-harmbench 获取。
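
The Prisoner's Dilemma named in the abstract can be illustrated with a small payoff table. The sketch below uses textbook payoff values (an assumption for illustration, not taken from GT-HarmBench) and hypothetical helper names. It shows why the structure is hazardous: the only pure Nash equilibrium is mutual defection, while the socially beneficial outcome is mutual cooperation.

```python
from itertools import product

# Illustrative textbook payoffs (row player, column player); NOT taken from GT-HarmBench.
ACTIONS = ("cooperate", "defect")
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

def pure_nash_equilibria(payoffs, actions):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in actions)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria

def welfare_maximizers(payoffs):
    """Return the action pairs with the highest summed payoff."""
    best = max(sum(p) for p in payoffs.values())
    return [pair for pair, p in payoffs.items() if sum(p) == best]
```

Comparing the two functions' results on the same table makes the gap between individually rational and socially beneficial play explicit.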

cs.AI / 2 / 2602.12356

A Theoretical Framework for Adaptive Utility-Weighted Benchmarking

自适应效用加权基准测试的理论框架

Waggoner, Philip

Abstract

Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, where shared tasks, metrics, and leaderboards offer a common basis for measuring progress and comparing approaches. As AI systems are deployed in more varied and consequential settings, though, there is growing value in complementing these established practices with a more holistic conceptualization of what evaluation should represent. In particular, recognizing the sociotechnical contexts in which these systems operate invites a deeper view of how multiple stakeholders and their unique priorities might inform what we consider meaningful or desirable model behavior. This paper introduces a theoretical framework that reconceptualizes benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions. Using conjoint-derived utilities and a human-in-the-loop update rule, we formalize how human tradeoffs can be embedded into benchmark structure and how benchmarks can evolve dynamically while preserving stability and interpretability. The resulting formulation generalizes classical leaderboards as a special case and provides a foundation for building more context-aware evaluation protocols. It also yields new, robust tools for analyzing the structural properties of benchmarks, opening a path toward more accountable and human-aligned evaluation.

Chinese Translation

基准测试长期以来一直是机器学习中的基础实践，越来越多地应用于现代人工智能系统，如大型语言模型，其中共享任务、指标和排行榜为衡量进展和比较方法提供了共同基础。然而，随着人工智能系统在更为多样化和重要的环境中部署，补充这些既定实践以更全面的评估概念的价值日益凸显。值得注意的是，认识到这些系统所运行的社会技术背景为我们提供了一个更深刻的视角，了解多个利益相关者及其独特优先事项如何影响我们对有意义或理想模型行为的理解。本文介绍了一个理论框架，将基准测试重新概念化为一个多层次的自适应网络，通过加权交互将评估指标、模型组件和利益相关者群体联系起来。通过使用联合派生效用和人机交互的更新规则，我们形式化了如何将人类权衡嵌入基准结构，以及基准如何在保持稳定性和可解释性的同时动态演变。最终的公式将经典排行榜作为特例进行推广，并为构建更具上下文意识的评估协议提供了基础，从而产生新的强大工具，用于分析基准的结构属性，为实现更负责任和与人类对齐的评估开辟了道路。

cs.AI / 3 / 2602.12389

Evolving Beyond Snapshots: Harmonizing Structure and Sequence via Entity State Tuning for Temporal Knowledge Graph Forecasting

超越快照的演变：通过实体状态调优协调结构与序列以进行时间知识图谱预测

Li, Siyuan, Wu, Yunjia, Xiao, Yiyong, Huang, Pingyang, Li, Peize, Liu, Ruitong, Wen, Yan, Sun, Te, Pei, Fangyi

Abstract

Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long-term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder-agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed-loop design. Specifically, a topology-aware state perceiver first injects entity-state priors into structural encoding. Then, a unified temporal context module aggregates the state-enhanced events with a pluggable sequence backbone. Subsequently, a dual-track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state-of-the-art performance, highlighting the importance of state persistence for long-horizon TKG forecasting. The code is published at https://github.com/yuanwuyuan9/Evolving-Beyond-Snapshots

Chinese Translation

时间知识图谱（TKG）预测需要通过联合建模每个快照内的结构依赖关系和跨快照的时间演变来预测未来事实。然而，大多数现有方法是无状态的：它们在每个时间戳重新计算实体表示，基于有限的查询窗口，导致情节性遗忘和长期依赖关系的快速衰减。为了解决这一限制，我们提出了实体状态调优（Entity State Tuning, EST），这是一种与编码器无关的框架，使TKG预测模型具备持久且持续演变的实体状态。EST维护一个全局状态缓冲区，并通过闭环设计逐步将结构证据与序列信号对齐。具体而言，拓扑感知状态感知器首先将实体状态先验注入结构编码中。然后，统一的时间上下文模块将状态增强的事件与可插拔的序列骨干网络进行聚合。随后，双轨演变机制将更新的上下文写回全局实体状态内存，平衡可塑性与稳定性。在多个基准测试上的实验表明，EST始终改善各种骨干网络，并实现了最先进的性能，突显了状态持久性在长时间跨度TKG预测中的重要性。代码已发布在 https://github.com/yuanwuyuan9/Evolving-Beyond-Snapshots

cs.AI / 4 / 2602.12419

Intent-Driven Smart Manufacturing Integrating Knowledge Graphs and Large Language Models

基于意图驱动的智能制造：知识图谱与大型语言模型的整合

Jradi, Takoua, Violos, John, Spatharakis, Dimitrios, Mavraidi, Lydia, Dimolitsas, Ioannis, Leivadeas, Aris, Papavassiliou, Symeon

Abstract

The increasing complexity of smart manufacturing environments demands interfaces that can translate high-level human intents into machine-executable actions. This paper presents a unified framework that integrates instruction-tuned Large Language Models (LLMs) with ontology-aligned Knowledge Graphs (KGs) to enable intent-driven interaction in Manufacturing-as-a-Service (MaaS) ecosystems. We fine-tune Mistral-7B-Instruct-V02 on a domain-specific dataset, enabling the translation of natural language intents into structured JSON requirement models. These models are semantically mapped to a Neo4j-based knowledge graph grounded in the ISA-95 standard, ensuring operational alignment with manufacturing processes, resources, and constraints. Our experimental results demonstrate significant performance gains over zero-shot and 3-shot baselines, achieving 89.33% exact match accuracy and 97.27% overall accuracy. This work lays the foundation for scalable, explainable, and adaptive human-machine

Chinese Translation

智能制造环境日益复杂，迫切需要能够将高层次人类意图转化为机器可执行动作的接口。本文提出了一个统一框架，将经过指令调优的大型语言模型（LLMs）与本体对齐的知识图谱（KGs）相结合，以实现制造即服务（MaaS）生态系统中的意图驱动交互。我们在一个特定领域的数据集上对Mistral-7B-Instruct-V02进行了微调，使其能够将自然语言意图转化为结构化的JSON需求模型。这些模型在语义上映射到基于Neo4j的知识图谱，该图谱以ISA-95标准为基础，确保与制造过程、资源和约束的操作一致性。我们的实验结果显示，相较于零样本和三样本基线，性能显著提升，达到了89.33%的精确匹配准确率和97.27%的整体准确率。本研究为可扩展、可解释和自适应的人机交互奠定了基础。

cs.AI / 5 / 2602.12544

Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

通过自动数据生成和细粒度评估扩展网络代理训练

Logeswaran, Lajanugen, Kim, Jaekyeom, Sohn, Sungryull, Glasscock, Creighton, Lee, Honglak

Abstract

We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.

Chinese Translation

我们提出了一种可扩展的管道，用于自动生成高质量的网络代理训练数据。特别是，识别高质量训练实例的一个主要挑战是轨迹评估——量化在任务完成过程中取得的进展。我们引入了一种新颖的基于约束的评估框架，提供了对任务完成进展的细粒度评估。这使我们能够利用部分成功的轨迹，从而显著扩展可用训练数据的数量。我们在一个新基准上评估了我们的方法，该基准称为 BookingArena，包含来自 20 个热门网站的复杂预订任务，并证明我们的提炼学生模型在性能上优于开源方法，并与商业系统相匹配或超越，同时模型规模显著更小。我们的工作解决了高效创建多样化、真实的网络交互数据集的挑战，并为复杂结构化网络任务提供了系统的评估方法。

cs.AI / 6 / 2602.12566

To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

混合还是合并：面向大型语言模型的多领域强化学习

Wang, Haoqing, Long, Xiang, Li, Ziheng, Xu, Yilong, Li, Tingguang, Tang, Yehui

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider how RLVR interacts across different domains. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR, and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. The project is named M2RL, short for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and its homepage is at https://github.com/mosAI25/M2RL

Chinese Translation

可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）在激发大型语言模型（Large Language Models, LLMs）的显式推理能力方面发挥了关键作用。通过 RLVR，我们可以在一些特定领域（如编程或数学）实现专家级的表现。当需要一个通用的多领域专家级模型时，我们需要仔细考虑不同领域之间 RLVR 的协作。目前的最先进模型主要采用两种不同的训练范式来进行多领域 RLVR：混合多任务 RLVR 和分开 RLVR 后再进行模型合并。然而，大多数研究并没有对这些范式进行详细的比较和分析。为此，我们选择多个常用的高层次任务（例如数学、编程、科学和指令跟随）作为我们的目标领域，并使用开源数据集设计了广泛的定性和定量实验。我们发现跨领域的 RLVR 之间几乎没有相互干扰，而推理密集型领域则表现出相互协同的效果。此外，我们从权重空间几何、模型预测行为和信息约束的角度分析了相互收益的内部机制。该项目被命名为 M2RL，意为混合多任务训练或分开训练后再进行模型合并的强化学习，主页地址为 https://github.com/mosAI25/M2RL

cs.AI / 7 / 2602.12586

Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models

我可以点餐吗？蒙特卡洛树搜索在扩散语言模型中的槽填充顺序优化

Leang, Joshua Ong Jun, Zhao, Yu, Stoian, Mihaela Cătălina, Li, Wenda, Cohen, Shay B., Giunchiglia, Eleonora

Abstract

While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains highly sensitive to slot infilling order, often yielding substantial output variance. We introduce McDiffuSE, a framework that formulates slot selection as decision making and optimises infilling orders through Monte Carlo Tree Search (MCTS). McDiffuSE uses look-ahead simulations to evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. Our analysis reveals that while McDiffuSE predominantly follows sequential ordering, incorporating non-sequential generation is essential for maximising performance. We observe that larger exploration constants, rather than increased simulations, are necessary to overcome model confidence biases and discover effective orderings. These findings establish MCTS-based planning as an effective approach for enhancing generation quality in MDMs.

Chinese Translation

尽管在掩码扩散模型（Masked Diffusion Models, MDMs）中，计划与填充解码在数学和代码推理方面展现出良好的前景，但其性能仍对槽填充顺序高度敏感，常常导致显著的输出方差。我们提出了McDiffuSE，一个将槽选择形式化为决策并通过蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）优化填充顺序的框架。McDiffuSE使用前瞻性模拟在承诺之前评估部分完成情况，系统性地探索生成顺序的组合空间。实验结果显示，相较于自回归基线，平均提升3.2%，而与基线计划与填充相比提升8.0%，在MBPP上有显著提升19.5%，在MATH500上提升4.9%。我们的分析表明，尽管McDiffuSE主要遵循顺序生成，但纳入非顺序生成对于最大化性能至关重要。我们观察到，较大的探索常数比增加模拟次数更能克服模型置信度偏差并发现有效的排序。这些发现确立了基于MCTS的规划作为提升MDMs生成质量的有效方法。
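
The abstract's remark about exploration constants refers to the exploration term used when a tree search selects which branch to expand. The sketch below shows the standard UCT (UCB1) score as an illustration. The abstract does not specify McDiffuSE's actual selection rule, so the function names and the slot representation here are hypothetical.

```python
import math

def uct_score(total_value, visits, parent_visits, c):
    """Standard UCT score: mean value plus an exploration bonus scaled by c."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    mean = total_value / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)

def select_slot(children, c):
    """Pick the child (candidate slot) with the highest UCT score.

    `children` maps a hypothetical slot id to (total_value, visits).
    """
    parent_visits = sum(v for _, v in children.values())
    return max(children, key=lambda s: uct_score(*children[s], parent_visits, c))
```

With `c = 0` the search always follows the branch with the best mean value so far. A larger `c` shifts selection toward rarely visited slots, which is the lever the abstract identifies for escaping confidence biases.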

cs.AI / 8 / 2602.12617

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

GeoAgent：通过强化地理特征学习在各地进行地理定位

Jin, Modi, Zhang, Yiming, Sun, Boyuan, Zhang, Dingwen, Cheng, MingMing, Hou, Qibin

Abstract

This paper presents GeoAgent, a model whose reasoning closely aligns with that of humans and which derives fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because they rely on AI-generated chain-of-thought (CoT) data and training strategies that conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward assessed by a consistency agent to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with that of humans.

Chinese Translation

本文提出了GeoAgent，一个能够与人类紧密推理并得出细粒度地址结论的模型。以往基于强化学习（RL）的方法在性能和可解释性上取得了突破，但由于依赖于人工智能生成的思维链（CoT）数据和训练策略，仍然存在一些问题，这与地理特征相冲突。为了解决这些问题，我们首先引入了GeoSeek，一个由地理专家和专业玩家注释的CoT数据的新地理定位数据集。我们进一步深入探讨了地理任务的固有特征，并提出了一种地理相似性奖励和由一致性代理评估的一致性奖励，以辅助训练。这鼓励模型从地理角度向正确答案收敛，同时确保其推理过程的完整性和一致性。实验结果表明，GeoAgent在多个粒度上优于现有方法和一系列通用的视觉语言大模型（VLLMs），同时生成的推理与人类的思维紧密对齐。

cs.AI / 9 / 2602.12631

AI Agents for Inventory Control: Human-LLM-OR Complementarity

用于库存控制的人工智能代理：人类-大语言模型-运筹学的互补性

Baek, Jackie, Fu, Yaopeng, Ma, Will, Peng, Tianyi

Abstract

Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to incorporate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that these methods are complementary rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial.

Chinese Translation

库存控制是一个基本的运营问题，传统上，订货决策是由理论基础的运筹学（OR）算法指导的。然而，这些算法往往依赖于严格的建模假设，当需求分布发生变化或相关上下文信息不可用时，其表现可能不佳。最近，大语言模型（LLMs）的进展引发了对能够灵活推理并结合丰富上下文信号的人工智能代理的兴趣，但如何将基于LLM的方法有效地融入传统决策流程仍不明确。我们研究了在多期库存控制环境中，运筹学算法、LLMs和人类如何相互作用并互为补充。我们构建了InventoryBench，这是一个包含超过1000个库存实例的基准，涵盖了合成和真实世界的需求数据，旨在在需求变化、季节性和不确定的交货时间下对决策规则进行压力测试。通过这个基准，我们发现增强运筹学的LLM方法在性能上优于单独使用任一方法，表明这些方法是互补的而非替代的。我们进一步通过一个控制的课堂实验来探讨人类的角色，该实验将LLM推荐嵌入到人机协作的决策流程中。与先前发现的人机协作可能降低性能的结论相反，我们展示了平均而言，人机团队的利润高于单独操作的人类或人工智能代理。除了这一总体层面的发现，我们还形式化了个体层面的互补效应，并推导出一个无分布的下限，表示受益于人工智能协作的个体比例；实证结果表明，这一比例相当可观。
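
As a point of reference for the "theoretically grounded OR algorithms" the abstract mentions, the sketch below implements one classical example, the single-period newsvendor rule. This is an illustrative assumption: the abstract does not say which OR policies InventoryBench evaluates, and the normal-demand model and zero salvage value are simplifications.

```python
from statistics import NormalDist

def newsvendor_order_quantity(mean, std, price, cost):
    """Order quantity at the critical fractile for normally distributed demand.

    Assumes no salvage value: underage cost = price - cost, overage cost = cost.
    """
    critical_ratio = (price - cost) / price
    return NormalDist(mu=mean, sigma=std).inv_cdf(critical_ratio)
```

The rule depends entirely on the assumed demand distribution, which illustrates the rigidity the abstract describes: if demand shifts away from the fitted mean and standard deviation, the computed quantity is no longer optimal.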

cs.AI / 10 / 2602.12662

Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

快速与慢速思考：大语言模型代理的逐步认知深度适应

Yang, Ruihan, Ye, Fanghua, We, Xiang, Zhao, Ruoqing, Luo, Kang, Xu, Xinbo, Zhao, Bo, Ma, Ruotian, Wang, Shanyi, Tu, Zhaopeng, Li, Xiaolong, Yang, Deqing, Linus

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.

Chinese Translation

大型语言模型（LLMs）越来越多地被部署为多轮决策任务的自主代理。然而，当前的代理通常依赖于固定的认知模式：非思考模型生成即时响应，而思考模型则均匀地进行深度推理。这种僵化在长时间任务中效率低下，因为认知需求在每一步中差异显著，有些需要战略规划，而另一些仅需常规执行。本文介绍了CogRouter，一个训练代理在每一步动态适应认知深度的框架。基于ACT-R理论，我们设计了四个层次的认知水平，从本能反应到战略规划。我们的两阶段训练方法包括认知感知的监督微调（Cognition-aware Supervised Fine-tuning, CoSFT），以灌输稳定的层次特定模式，以及认知感知的策略优化（Cognition-aware Policy Optimization, CoPO），通过信心感知的优势重加权进行逐步信用分配。关键见解在于，适当的认知深度应最大化所产生行动的信心。在ALFWorld和ScienceWorld上的实验表明，CogRouter实现了最先进的性能，且效率更高。使用Qwen2.5-7B时，其成功率达到82.3%，超越了GPT-4o（+40.3%）、OpenAI-o3（+18.3%）和GRPO（+14.0%），同时使用的token减少了62%。

cs.AI / 11 / 2602.12665

Evaluating Robustness of Reasoning Models on Parameterized Logical Problems

评估推理模型在参数化逻辑问题上的鲁棒性

Es-sebbani, Naïm, Marquer, Esteban, Salhi, Yakoub, Bouraoui, Zied

Abstract

Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.

Chinese Translation

逻辑为评估基于大型语言模型（LLM）的推理器提供了一个受控的测试平台，但标准的SAT风格基准往往将表面难度（长度、措辞、子句顺序）与实际决定可满足性的结构现象混淆在一起。我们引入了一个基于参数化结构2-CNF公式族构建的2-SAT诊断基准，其中可满足性由蕴含图表征，并可以沿可解释的轴进行调整。我们的生成器隔离出不同的能力和失败模式：（i）具有可控大小和不平衡的矛盾循环不可满足核心，（ii）具有规定自由变量比例的可满足实例，以控制解的多样性，（iii）调节传播的植入骨架，（iv）将其他单调区域耦合的晚桥子句，以探测对顺序和修订的敏感性，以及（v）测试在重命名和冗余结构下抽象的对称/重复变体。我们在决策准确性和赋值有效性上评估基于LLM的推理器，并量化在保持语义的扰动下的鲁棒性，例如子句重排序、填充子句和变量重命名。在不同模型中，我们观察到在针对性结构干预下，即使表面统计保持固定，性能也会发生剧烈转变，揭示出对总SAT准确性不可见的脆弱性区域。
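
The abstract's statement that 2-SAT satisfiability is characterized by the implication graph can be made concrete with the classical strongly-connected-components check. The sketch below is a generic textbook solver (Kosaraju's algorithm), not the benchmark's generator or evaluation code. The literal encoding (`+v` / `-v`) and function name are assumptions chosen for illustration.

```python
def solve_2sat(num_vars, clauses):
    """Decide a 2-CNF formula via strongly connected components of its implication graph.

    Variables are 1..num_vars; a literal is +v or -v; each clause is a pair (a, b).
    Returns a satisfying assignment as a list of bools, or None if unsatisfiable.
    """
    def node(lit):
        # Map a literal to a node index: 2*(v-1) for v, 2*(v-1)+1 for not v.
        return 2 * (abs(lit) - 1) + (lit < 0)

    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for src, dst in ((node(-a), node(b)), (node(-b), node(a))):
            graph[src].append(dst)
            reverse[dst].append(src)

    # Kosaraju pass 1: iterative DFS to record finishing order.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(graph[start]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(graph[w])))
                    break
            else:
                order.append(u)
                stack.pop()

    # Kosaraju pass 2: label components on the reversed graph.
    comp, label = [-1] * n, 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            u = stack.pop()
            for w in reverse[u]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    assignment = []
    for v in range(num_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # v and not v imply each other: a contradiction cycle
        # Labels follow topological order of the condensation; a literal is
        # set true when its component comes later than its negation's.
        assignment.append(comp[2 * v] > comp[2 * v + 1])
    return assignment
```

A formula is unsatisfiable exactly when some variable and its negation share a component, which is the "contradiction cycle" structure that generator family (i) in the abstract controls.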

cs.AI / 12 / 2602.12670

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench：评估智能体技能在多样任务中的有效性

Li, Xiangyi, Chen, Wenbo, Liu, Yimin, Zheng, Shenghan, Chen, Xiaokun, He, Yifeng, Li, Yubo, You, Bingran, Shen, Haotian, Sun, Jiankai, Wang, Shuyi, Zeng, Qunhong, Wang, Di, Zhao, Xuandong, Wang, Yuanli, Chaim, Roey Ben, Di, Zonglin, Gao, Yipeng, He, Junwei, He, Yizhuo, Jing, Liqiang, Kong, Luyang, Lan, Xin, Li, Jiachen, Li, Songlin, Li, Yijiang, Lin, Yueqian, Liu, Xinyi, Liu, Xuanqing, Lyu, Haoran, Ma, Ze, Wang, Bowei, Wang, Runhui, Wang, Tianyu, Ye, Wengao, Zhang, Yue, Xing, Hanwen, Xue, Yiqi, Dillmann, Steven, Lee, Han-chung

Abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

Chinese Translation

智能体技能是增强大型语言模型（LLM）智能体推理能力的结构化程序知识包。尽管快速被采纳，但目前尚无标准方法来衡量其实际效果。我们提出了SkillsBench，这是一个涵盖11个领域、86个任务的基准，配备了精心策划的技能和确定性验证器。每个任务在三种条件下进行评估：无技能、策划技能和自生成技能。我们测试了7种智能体模型配置，涵盖7,308条轨迹。策划技能将平均通过率提高了16.2个百分点（pp），但不同领域的效果差异很大（软件工程领域提高4.5pp，医疗保健领域提高51.9pp），且在84个任务中有16个显示出负增长。自生成技能在平均水平上没有提供任何好处，表明模型无法可靠地生成其所需的程序知识。专注于2-3个模块的技能优于全面文档，而配备技能的小型模型可以与不具备技能的大型模型相媲美。

cs.AI / 13 / 2602.12748

X-SYS: A Reference Architecture for Interactive Explanation Systems

X-SYS：交互式解释系统的参考架构

Labarta, Tobias, Hoang, Nhi, Dreyer, Maximilian, Berend, Jim, Hein, Oleg, Ma, Jackie, Samek, Wojciech, Lapuschkin, Sebastian

Abstract

The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.

Chinese Translation

可解释人工智能（XAI）研究社区提出了众多技术方法，但将可解释性部署为系统仍然面临挑战：交互式解释系统需要合适的算法和系统能力，以在重复查询、不断演变的模型和数据以及治理约束下保持解释的可用性。我们认为，操作化 XAI 需要将可解释性视为一个信息系统问题，其中用户交互需求引发特定的系统要求。我们介绍了 X-SYS，这是一种交互式解释系统的参考架构，旨在指导（X）AI 研究人员、开发者和从业者将交互式解释用户界面（XUI）与系统能力连接起来。X-SYS 组织围绕四个质量属性，即 STAR（可扩展性、可追溯性、响应性和适应性），并指定了五个组件的分解（XUI 服务、解释服务、模型服务、数据服务、编排与治理）。它将交互模式映射到系统能力，以解耦用户界面的演变与后端计算。我们通过 SemanticLens 实现了 X-SYS，这是一个用于语义搜索和视觉-语言模型激活引导的系统。SemanticLens 展示了基于合同的服务边界如何实现独立演变，离线/在线分离如何确保响应性，以及持久状态管理如何支持可追溯性。总之，这项工作为支持在操作约束下的端到端设计的交互式解释系统提供了一个可重用的蓝图和具体实例。

cs.AI / 14 / 2602.12852

WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning

WebClipper：基于图的轨迹剪枝实现网络代理的高效演化

Wang, Junjie, Xie, Zequn, Yang, Dan, Feng, Jie, Shen, Yue, Sun, Duolin, Long, Meixiu, Jiao, Yihan, Tan, Zhehao, Wang, Jian, Wei, Peng, Gu, Jinjie

Abstract

Deep Research systems based on web agents have shown strong potential in solving complex information-seeking tasks, yet their search efficiency remains underexplored. We observe that many state-of-the-art open-source web agents rely on long tool-call trajectories with cyclic reasoning loops and exploration of unproductive branches. To address this, we propose WebClipper, a framework that compresses web agent trajectories via graph-based pruning. Concretely, we model the agent's search process as a state graph and cast trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, yielding pruned trajectories that preserve essential reasoning while eliminating redundant steps. Continued training on these refined trajectories enables the agent to evolve toward more efficient search patterns and reduces tool-call rounds by about 20% while improving accuracy. Furthermore, we introduce a new metric called F-AE Score to measure the model's overall performance in balancing accuracy and efficiency. Experiments demonstrate that WebClipper reduces tool-call rounds while maintaining strong performance, providing practical insight into balancing effectiveness and efficiency in web agent design.

Chinese Translation

基于网络代理的深度研究系统在解决复杂信息检索任务方面展现出强大的潜力，但其搜索效率仍未得到充分探索。我们观察到许多最先进的开源网络代理依赖于长时间的工具调用轨迹，这些轨迹包含循环推理环和无效分支的探索。为了解决这一问题，我们提出了WebClipper，一个通过基于图的剪枝来压缩网络代理轨迹的框架。具体而言，我们将代理的搜索过程建模为状态图，并将轨迹优化视为一个最小必要有向无环图（Directed Acyclic Graph, DAG）挖掘问题，从而生成保留必要推理并消除冗余步骤的剪枝轨迹。对这些精炼轨迹的持续训练使得代理能够向更高效的搜索模式演化，并将工具调用次数减少约20%，同时提高了准确性。此外，我们引入了一种新的指标F-AE Score，用于衡量模型在准确性与效率之间的整体表现。实验表明，WebClipper在优秀性能下压缩了工具调用次数，为网络代理设计中有效性与效率的平衡提供了实用的见解。
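
To give a feel for what removing "cyclic reasoning loops" from a trajectory can mean, the sketch below applies simple loop erasure to a sequence of visited states. This is a deliberately simplified stand-in, not WebClipper's DAG mining procedure, which the abstract does not detail. The state-id representation and function name are hypothetical, and the sketch does not handle unproductive branches that never return to an earlier state.

```python
def erase_loops(trajectory):
    """Drop cyclic detours: when a state repeats, cut everything since its first visit.

    `trajectory` is a list of hashable state ids (hypothetical representation).
    """
    pruned, position = [], {}
    for state in trajectory:
        if state in position:
            # Returning to a known state: discard the loop that led back here.
            cut = position[state]
            for dropped in pruned[cut + 1:]:
                del position[dropped]
            del pruned[cut + 1:]
        else:
            position[state] = len(pruned)
            pruned.append(state)
    return pruned
```

The result visits each state at most once and preserves the order of the steps that remain, so it is an acyclic path through the original state graph.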

cs.AI / 15 / 2602.12876

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

BrowseComp-$V^3$: 一种视觉、垂直和可验证的多模态浏览代理基准

Zhang, Huanyao, Zhou, Jiepeng, Li, Bo, Zhou, Bowen, Dan, Yanzhe, Lu, Haishan, Cao, Zhiyong, Chen, Jiaoyang, Han, Yuqian, Sheng, Zinan, Tao, Zhengwei, Liang, Hao, Wu, Jialong, Shi, Yang, He, Yuanpeng, Lin, Jiaye, Zhang, Qintong, Yan, Guochen, Zhao, Runhao, Li, Zhengpin, Yu, Xiaohan, Mei, Lang, Chen, Chong, Zhang, Wentao, Cui, Bin

Abstract

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities within and across web pages. All supporting evidence is strictly required to be publicly searchable, ensuring fairness and reproducibility. Beyond final-answer accuracy, we incorporate an expert-validated, subgoal-driven process evaluation mechanism that enables fine-grained analysis of intermediate reasoning behaviors and systematic characterization of capability boundaries. In addition, we propose OmniSeeker, a unified multimodal browsing agent framework integrating diverse web search and visual perception tools. Comprehensive experiments demonstrate that even state-of-the-art models achieve only 36% accuracy on our benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.

Chinese Translation

多模态大型语言模型（MLLMs）具备日益先进的规划和工具使用能力，正演变为能够在开放世界环境中执行多模态网页浏览和深度搜索的自主代理。然而，现有的多模态浏览基准在任务复杂性、证据可获取性和评估粒度方面仍然有限，阻碍了对深度搜索能力的全面和可重复评估。为了解决这些局限性，我们提出了BrowseComp-$V^3$，一个包含300个经过精心策划和具有挑战性的问题的新基准，涵盖多个领域。该基准强调深度、多层次和跨模态的多跳推理，其中关键证据在网页内外的文本和视觉模态之间交错。所有支持证据都严格要求为公开可搜索，以确保公平性和可重复性。除了最终答案的准确性外，我们还引入了一种经过专家验证的、以子目标驱动的过程评估机制，使得对中间推理行为的细致分析和能力边界的系统性表征成为可能。此外，我们提出了OmniSeeker，一个统一的多模态浏览代理框架，整合了多种网页搜索和视觉感知工具。全面的实验表明，即使是最先进的模型在我们的基准上也仅达到36%的准确率，揭示了多模态信息整合和细粒度感知中的关键瓶颈。我们的结果突显了当前模型能力与现实世界中强健的多模态深度搜索之间的根本差距。

cs.AI / 16 / 2602.12963

Information-theoretic analysis of world models in optimal reward maximizers

最优奖励最大化者中世界模型的信息论分析

Harwood, Alfred, Faustino, Jose, Altair, Alex

Abstract

An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the "implicit world model" necessary for optimality.

Chinese Translation

在人工智能领域，一个重要的问题是成功行为在多大程度上需要对世界的内部表征。在本研究中，我们量化了最优策略提供的关于基础环境的信息量。我们考虑一个具有 $n$ 个状态和 $m$ 个动作的受控马尔可夫过程（Controlled Markov Process, CMP），假设在可能的转移动态空间上采用均匀先验。我们证明，观察一个对于任何非恒定奖励函数都是最优的确定性策略，能够传递关于环境的正好 $n \log m$ 位信息。具体而言，我们展示了环境与最优策略之间的互信息为 $n \log m$ 位。这个界限在包括有限时间、无限时间折扣和时间平均奖励最大化等广泛目标类中成立。这些发现为最优性所需的“隐式世界模型”提供了一个精确的信息论下界。
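
The $n \log m$ figure can be read through a short counting argument. The sketch below is not taken from the paper: it assumes that the optimal policy $\pi^*$ is fully determined by the environment $E$, and that under the uniform prior each of the $m^n$ deterministic policies is equally likely to be the optimal one.

```latex
\begin{align}
I(E;\pi^*) &= H(\pi^*) - H(\pi^* \mid E) \\
           &= H(\pi^*)
              && \text{(assumption: $\pi^*$ is a function of $E$)} \\
           &= \log\left(m^n\right)
              && \text{(assumption: uniform over the $m^n$ deterministic policies)} \\
           &= n \log m .
\end{align}
```

Under these assumptions the result says that every state contributes $\log m$ bits, one choice among $m$ actions.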

cs.AI / 17 / 2602.13093

Consistency of Large Reasoning Models Under Multi-Turn Attacks

大型推理模型在多轮攻击下的一致性

Li, Yubo, Krishnan, Ramayya, Padman, Rema

Abstract

Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.

Chinese Translation

具有推理能力的大型推理模型在复杂任务上取得了最先进的性能，但它们在多轮对抗压力下的鲁棒性仍然未被充分探讨。我们评估了九个前沿推理模型在对抗攻击下的表现。我们的研究发现，推理确实赋予了有意义但不完全的鲁棒性：大多数研究的推理模型显著优于经过指令调优的基线，但所有模型都表现出不同的脆弱性特征，其中误导性建议普遍有效，而社会压力则显示出模型特定的有效性。通过轨迹分析，我们识别出五种失败模式（自我怀疑、社会从众、建议劫持、情感易感性和推理疲劳），前两种模式占据了50%的失败案例。我们进一步证明，尽管信心感知响应生成（Confidence-Aware Response Generation, CARG）对标准大型语言模型有效，但由于延长推理轨迹引发的过度自信，它在推理模型中失效；反直觉地，随机信心嵌入的表现优于针对性提取。我们的结果强调，推理能力并不自动赋予对抗鲁棒性，基于信心的防御机制需要对推理模型进行根本性的重新设计。

cs.AI / 18 / 2602.13135

Constrained Assumption-Based Argumentation Frameworks

约束假设基础论证框架

De Angelis, Emanuele, Fioravanti, Fabio, Meo, Maria Chiara, Pettorossi, Alberto, Proietti, Maurizio, Toni, Francesca

Abstract

Assumption-based Argumentation (ABA) is a well-established form of structured argumentation. ABA frameworks with an underlying atomic language are widely studied, but their applicability is limited by a representational restriction to ground (variable-free) arguments and attacks built from propositional atoms. In this paper, we lift this restriction and propose a novel notion of constrained ABA (CABA), whose components, as well as arguments built from them, may include constrained variables, ranging over possibly infinite domains. We define non-ground semantics for CABA, in terms of various notions of non-ground attacks. We show that the new semantics conservatively generalise standard ABA semantics.

Chinese Translation

假设基础论证（ABA）是一种成熟的结构化论证形式。尽管具有基础原子语言的ABA框架得到了广泛研究，但其适用性受到限制，因为它仅限于基于命题原子的无变量（即地面）论证和攻击。在本文中，我们解除这一限制，提出了一种新颖的约束ABA（CABA）概念，其组成部分及由其构建的论证可以包含约束变量，这些变量可能跨越无限域。我们为CABA定义了非地面语义，涉及各种非地面攻击的概念。我们证明了新的语义保守性地推广了标准ABA语义。

cs.AI / 19 / 2602.13166

Optimal Take-off under Fuzzy Clearances

模糊间隔下的最优起飞

Henry, Hugo, Tsai, Arthur, Cohen, Kelly

Abstract

This paper presents a hybrid obstacle avoidance architecture that integrates Optimal Control under clearance with a Fuzzy Rule Based System (FRBS) to enable adaptive constraint handling for unmanned aircraft. Motivated by the limitations of classical optimal control under uncertainty and the need for interpretable decision making in safety-critical aviation systems, we design a three-stage Takagi-Sugeno-Kang fuzzy layer that modulates constraint radii, urgency levels, and activation decisions based on regulatory separation minima and airworthiness guidelines from FAA and EASA. These fuzzy-derived clearances are then incorporated as soft constraints into an optimal control problem solved using the FALCON toolbox and IPOPT. The framework aims to reduce unnecessary recomputations by selectively activating obstacle avoidance updates while maintaining compliance with aviation procedures. A proof-of-concept implementation using a simplified aircraft model demonstrates that the approach can generate optimal trajectories with computation times of 2,3 seconds per iteration in a single-threaded MATLAB environment, suggesting feasibility for near-real-time applications. However, our experiments revealed a critical software incompatibility in the latest versions of FALCON and IPOPT, in which the Lagrangian penalty term remained identically zero, preventing proper constraint enforcement. This behavior was consistent across scenarios and indicates a solver toolbox regression rather than a modeling flaw. Future work includes validating this effect by reverting to earlier software versions, optimizing the fuzzy membership functions using evolutionary methods, and extending the system to higher-fidelity aircraft models and stochastic obstacle environments.

Chinese Translation

本文提出了一种混合障碍物规避架构，该架构将间隔下的最优控制与基于模糊规则的系统（Fuzzy Rule Based System, FRBS）相结合，以实现无人机的自适应约束处理。受到经典最优控制在不确定性下的局限性以及安全关键航空系统中对可解释决策需求的启发，我们设计了一个三阶段的Takagi-Sugeno-Kang模糊层，该层根据FAA和EASA的监管分离最小值和适航指南调节约束半径、紧急程度和激活决策。这些模糊推导的间隔随后作为软约束纳入到使用FALCON工具箱和IPOPT求解的最优控制问题中。该框架旨在通过选择性激活障碍物规避更新来减少不必要的重新计算，同时保持与航空程序的合规性。使用简化飞机模型的概念验证实现表明，该方法能够在单线程MATLAB环境中生成最优轨迹，计算时间为每次迭代2.3秒，表明其在近实时应用中的可行性。然而，我们的实验揭示了FALCON和IPOPT最新版本中的一个关键软件不兼容性，其中拉格朗日惩罚项始终为零，阻止了适当的约束执行。这种行为在不同场景中是一致的，表明这是求解器工具箱的回归问题，而非建模缺陷。未来的工作包括通过回退到早期软件版本来验证这一影响，使用进化方法优化模糊隶属函数，并将系统扩展到更高保真度的飞机模型和随机障碍环境。
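
To make the idea of a fuzzy layer modulating a constraint radius concrete, here is a minimal sketch of zero-order Takagi-Sugeno-Kang inference with a single input. It is not the authors' three-stage design: the membership functions, the distance breakpoints, and the radii are hypothetical placeholders and are not taken from FAA or EASA material.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with support (left, right) and maximum at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical rule base: (membership over obstacle distance in metres, radius in metres).
RULES = [
    (lambda d: triangular(d, -1.0, 0.0, 500.0), 150.0),    # "obstacle is near" -> wide radius
    (lambda d: triangular(d, 0.0, 500.0, 1001.0), 50.0),   # "obstacle is far"  -> tight radius
]


def clearance_radius(distance_m, rules=RULES):
    """Zero-order TSK output: firing-strength-weighted average of constant rule outputs."""
    strengths = [membership(distance_m) for membership, _ in rules]
    total = sum(strengths)
    if total == 0.0:
        raise ValueError("no rule fires for this distance")
    return sum(w * radius for w, (_, radius) in zip(strengths, rules)) / total


for distance in (0.0, 250.0, 500.0):
    print(distance, clearance_radius(distance))
```

The radius produced this way would then enter the optimal control problem as a soft constraint, as the abstract describes.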

计算语言学 (Computation and Language)

cs.CL / 1 / 2602.12284

A Lightweight LLM Framework for Disaster Humanitarian Information Classification

灾害人道主义信息分类的轻量级大语言模型框架

Jinzhen, Han, Jisung, Kim, Soo, Yang Jong, Sik, Yun Hong

Abstract

Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.

Chinese Translation

及时对社交媒体上的人道主义信息进行分类对于有效的灾害响应至关重要。然而，在资源受限的紧急情况下，部署大型语言模型（LLMs）面临挑战。本文开发了一种轻量级、成本效益高的框架，用于灾害推文分类，采用参数高效的微调方法。我们通过整合和规范化HumAID数据集（包含76,484条推文，涵盖19个灾害事件）构建了一个统一的实验语料库，形成了人道主义信息分类和事件类型识别的双任务基准。通过对提示策略、LoRA微调和检索增强生成（RAG）在Llama 3.1 8B上的系统评估，我们证明了： (1) LoRA在仅训练约2%的参数的情况下，实现了79.62%的分类准确率（比零样本提高37.79%）； (2) QLoRA在50%的内存成本下实现了99.4%的LoRA性能，支持高效部署； (3) 与普遍假设相反，RAG策略由于从检索示例中引入的标签噪声，降低了微调模型的性能。这些发现为在有限计算资源下构建可靠的危机智能系统奠定了实用且可重复的流程。
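
The "~2% of parameters" figure follows from how LoRA adds two low-rank matrices per adapted weight. The sketch below shows only that arithmetic; the layer shapes, rank, and total parameter count are hypothetical toy values, not the configuration used in the paper.

```python
def lora_added_params(d_in, d_out, rank):
    """One LoRA adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(adapted_shapes, rank, total_params):
    """Share of trainable adapter parameters relative to the frozen base model."""
    added = sum(lora_added_params(d_in, d_out, rank) for d_in, d_out in adapted_shapes)
    return added / total_params


# Hypothetical toy model: eight 1024x1024 projections, all of them adapted at rank 8.
shapes = [(1024, 1024)] * 8
base_params = 8 * 1024 * 1024
print(trainable_fraction(shapes, rank=8, total_params=base_params))
```

The fraction grows linearly with the rank and with the number of adapted matrices, which is why the reported share depends on those two choices.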

cs.CL / 2 / 2602.12285

From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness

从偏见聊天机器人到偏见智能体：考察角色分配对大型语言模型智能体鲁棒性的影响

Cao, Linbo, Sun, Lihao, Yue, Yang

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations (up to 26.2% degradation) driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent's decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.

Chinese Translation

大型语言模型（LLMs）越来越多地被部署为能够进行超越文本生成的具有现实世界影响的自主智能体。尽管在文本生成中因角色引起的偏见已被广泛记录，但其对智能体任务表现的影响仍然基本未被探索，尽管这些影响带来了更直接的操作风险。在本研究中，我们首次系统性地展示了基于人口统计的角色分配如何改变 LLM 智能体的行为，并在多个领域降低其表现。通过评估广泛部署的模型在涵盖战略推理、规划和技术操作的智能体基准测试中的表现，我们发现了显著的表现差异——由与任务无关的角色线索驱动的表现下降高达 26.2%。这些变化出现在不同的任务类型和模型架构中，表明角色条件化和简单的提示注入可能会扭曲智能体的决策可靠性。我们的研究揭示了当前 LLM 智能体系统中被忽视的脆弱性：角色分配可能引入隐性偏见并增加行为波动性，这引发了对 LLM 智能体安全和鲁棒部署的担忧。

cs.CL / 3 / 2602.12287

Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction

基于自适应思维链的检索增强自学推理模型用于自动语音识别中的命名实体纠正

An, Junjie, Tian, Jingguang, Wang, Tianyi, Gao, Yu, Mou, Xiaofeng, Xu, Yi

Abstract

End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.

Chinese Translation

端到端的自动语音识别（ASR）系统常常错误识别领域特定短语，如命名实体，这可能导致下游任务的严重失败。最近出现了一种基于大型语言模型（LLMs）的命名实体纠正方法的新家族。然而，这些方法尚未充分利用LLMs固有的复杂推理能力。为了解决这一问题，我们提出了一种新颖的检索增强生成框架，用于纠正ASR中的命名实体错误。我们的方法由两个关键组件组成：（1）用于命名实体识别的重述语言模型（RLM），随后通过语音级编辑距离进行候选检索；（2）一种新颖的自学推理模型，具有自适应思维链（A-STAR），能够根据任务难度动态调整推理深度。在AISHELL-1和同音词数据集上的实验表明，我们的方法有效，相较于强基线，命名实体字符错误率分别减少了17.96%和34.42%。
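
Candidate retrieval by phonetic-level edit distance can be sketched as a Levenshtein distance computed over phonetic units instead of characters. The example below is an illustration only: it assumes toneless pinyin syllables as the unit, and the lexicon, the query, and the distance threshold are hypothetical rather than taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phonetic units."""
    previous = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        current = [i]
        for j, unit_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if unit_a == unit_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def retrieve_candidates(query_units, lexicon, max_distance=1):
    """Return entity names within max_distance of the query, closest first."""
    scored = sorted((edit_distance(query_units, units), name) for name, units in lexicon.items())
    return [name for distance, name in scored if distance <= max_distance]


# Hypothetical entity lexicon keyed by name, with pinyin syllables as phonetic units.
LEXICON = {
    "Zhang Wei": ["zhang", "wei"],
    "Zhang Fei": ["zhang", "fei"],
    "Li Na": ["li", "na"],
}

print(retrieve_candidates(["zhang", "hui"], LEXICON))
```

Working on phonetic units lets homophones and near-homophones surface as candidates even when their written forms share no characters.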

cs.CL / 4 / 2602.12302

Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática

多模态大型语言模型（MLLMs）：从理论到实践

da Silva, Neemias, Scholz, Júlio C. W., Harrison, John, Borges, Marina, Ávila, Paulo, Santos, Frances A, Delgado, Myriam, Minetto, Rodrigo, Silva, Thiago H

Abstract

Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica. Finally, the chapter discusses the challenges and highlights promising trends.

Chinese Translation

多模态大型语言模型（MLLMs）将大型语言模型（LLMs）在自然语言理解和生成方面的能力与图像和音频等模态的感知技能相结合，代表了当代人工智能的一个重要进展。本章介绍了MLLMs的主要基础和典型模型。同时，还探讨了使用LangChain和LangGraph进行预处理、提示工程和构建多模态管道的实用技术。为了进一步的实践研究，补充材料已在网上公开提供：https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica。最后，本章讨论了面临的挑战并强调了有前景的趋势。

cs.CL / 5 / 2602.12414

propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

propella-1：用于大规模LLM数据整理的多属性文档标注

Idahl, Maximilian, Droste, Benedikt, Plüster, Björn, Harries, Jan Philipp

Abstract

Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

Chinese Translation

自FineWeb-Edu以来，LLM预训练的数据整理主要依赖于小型分类器生成的单一标量质量评分。单一评分将多个质量维度混为一谈，限制了灵活的过滤，并且缺乏可解释性。我们推出了propella-1，一系列小型多语言LLM（参数量为0.6B、1.7B和4B），能够对文本文档进行18个属性的标注，这些属性被组织为六个类别：核心内容、分类、质量与价值、受众与目的、安全与合规，以及地理相关性。这些模型支持57种语言，并生成符合预定义模式的结构化JSON标注。与作为参考标注器的前沿商业LLM进行评估时，4B模型的标注一致性高于许多更大型的通用模型。我们发布了propella-annotations，一个包含超过30亿个文档标注的数据集，涵盖了主要的预训练语料库，包括来自FineWeb-2、FinePDFs、HPLT 3.0和Nemotron-CC的数据。利用这些标注，我们呈现了对广泛使用的预训练数据集的多维组合分析，揭示了单一评分方法无法捕捉的质量、推理深度和内容组成方面的显著差异。所有模型权重和标注均在宽松的商业使用许可证下发布。

cs.CL / 6 / 2602.12424

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

RankLLM：通过量化问题难度对大型语言模型进行加权排名

Zhang, Ziqian, Hu, Xingjian, Huang, Yue, Zhang, Kai, Chen, Ruoxi, Liu, Yixin, Wen, Qingsong, Xu, Kaidi, Zhang, Xiangliang, Gong, Neil Zhenqiang, Sun, Lichao

Abstract

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.

Chinese Translation

基准测试建立了一个标准化的评估框架，以系统地评估大型语言模型（LLMs）的性能，从而促进客观比较并推动该领域的进步。然而，现有的基准测试未能区分问题的难度，限制了其有效区分模型能力的能力。为了解决这一局限性，我们提出了RankLLM，一个旨在量化问题难度和模型能力的新框架。RankLLM将难度作为主要区分标准，使得对LLM能力的评估更加细致。RankLLM的核心机制促进了模型与问题之间的双向评分传播。RankLLM的核心直觉是，当模型正确回答一个问题时，它会获得一个能力评分，而当一个问题对模型构成挑战时，其难度评分会增加。利用这一框架，我们对30个模型在多个领域的35,550个问题进行了评估。RankLLM与人类判断的达成率达到90%，并且在性能上始终优于强基线模型如IRT。它还表现出强大的稳定性、快速的收敛性和高计算效率，使其成为大规模、关注难度的LLM评估的实用解决方案。
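
The bidirectional propagation idea can be illustrated with a small fixed-point iteration in the spirit of link-analysis algorithms. This is a sketch under assumptions, not the RankLLM algorithm itself: the damping term, the normalization, and the update order are choices made here for illustration.

```python
def _damped(values, damping):
    """Normalize to sum 1, then mix with a uniform floor so no score collapses to zero."""
    total = sum(values)
    n = len(values)
    shares = [v / total for v in values] if total > 0 else [1.0 / n] * n
    return [(1.0 - damping) / n + damping * share for share in shares]


def propagate_scores(correct, iterations=30, damping=0.85):
    """correct[i][j] is True when model i answered question j correctly."""
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        # A question gains difficulty from every model it defeats.
        difficulty = _damped(
            [sum(competency[i] for i in range(n_models) if not correct[i][j])
             for j in range(n_questions)],
            damping,
        )
        # A model gains competency from every question it answers correctly.
        competency = _damped(
            [sum(difficulty[j] for j in range(n_questions) if correct[i][j])
             for i in range(n_models)],
            damping,
        )
    return competency, difficulty


# Hypothetical results: model 0 solves everything, model 2 only the easiest question.
RESULTS = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
]
competency, difficulty = propagate_scores(RESULTS)
print(competency, difficulty)
```

Unlike plain accuracy, a scheme of this kind rewards a correct answer more when the question has defeated other models.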

cs.CL / 7 / 2602.12445

RBCorr: Response Bias Correction in Language Models

RBCorr：语言模型中的响应偏差校正

Bhatt, Om, Ivanova, Anna A.

Abstract

Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy ($\texttt{RBCorr}$) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that $\texttt{RBCorr}$ effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, $\texttt{RBCorr}$ is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.

Chinese Translation

语言模型（LMs）已知容易受到响应偏差的影响，这种偏差在固定响应问题中表现为选项偏好偏差。因此，开发低成本且有效的响应偏差校正方法以提高LM性能并实现对模型能力的更准确评估是至关重要的。在此，我们提出了一种简单的响应偏差校正策略（$\texttt{RBCorr}$），并在12个开放权重的语言模型上使用是非题、蕴含题和多项选择题进行了测试。我们显示了在校正前，LM中普遍存在响应偏差，而$\texttt{RBCorr}$有效消除了这种偏差并提升了模型性能。我们还探讨了偏差行为在不同模型、数据集和提示格式之间的可推广性，显示基于LogProbs的校正在这三者之间高度依赖。总体而言，$\texttt{RBCorr}$是一种易于使用的方法，可以提升较小LM的性能，并确保LM在封闭响应基准上的表现更接近其真实能力。

cs.CL / 8 / 2602.12575

Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification

发现心理量表中的语义潜在结构：一种无响应的高效简化路径

Wang, Bo, Zhang, Yuxuan, Hu, Yueqin, Hou, Hanchao, Peng, Kaiping, Ni, Shiguang

Abstract

Psychological scale refinement traditionally relies on response-based methods such as factor analysis, item response theory, and network psychometrics to optimize item composition. Although rigorous, these approaches require large samples and may be constrained by data availability and cross-cultural comparability. Recent advances in natural language processing suggest that the semantic structure of questionnaire items may encode latent construct organization, offering a complementary response-free perspective. We introduce a topic-modeling framework that operationalizes semantic latent structure for scale simplification. Items are encoded using contextual sentence embeddings and grouped via density-based clustering to discover latent semantic factors without predefining their number. Class-based term weighting derives interpretable topic representations that approximate constructs and enable merging of semantically adjacent clusters. Representative items are selected using membership criteria within an integrated reduction pipeline. We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency. The proposed method recovered coherent factor-like groupings aligned with established constructs. Selected items reduced scale length by 60.5% on average while maintaining psychometric adequacy. Simplified scales showed high concordance with original factor structures and preserved inter-factor correlations, indicating that semantic latent organization provides a response-free approximation of measurement structure. Our framework formalizes semantic structure as an inspectable front-end for scale construction and reduction. To facilitate adoption, we provide a visualization-supported tool enabling one-click semantic analysis and structured simplification.

Chinese Translation

心理量表的优化传统上依赖于基于响应的方法，如因子分析、项目反应理论和网络心理测量，以优化项目组成。尽管这些方法严谨，但需要大量样本，并可能受到数据可用性和跨文化可比性的限制。最近在自然语言处理领域的进展表明，问卷项目的语义结构可能编码潜在构念的组织，提供了一种补充的无响应视角。我们引入了一种主题建模框架，将语义潜在结构操作化以实现量表简化。项目通过上下文句子嵌入进行编码，并通过基于密度的聚类进行分组，以发现潜在的语义因子，而无需预先定义其数量。基于类别的术语加权导出了可解释的主题表示，近似构念并实现语义相邻聚类的合并。通过集成的降维流程，使用成员资格标准选择代表性项目。我们在DASS、IPIP和EPOCH上对该框架进行了基准测试，评估了结构恢复、内部一致性、因子一致性、相关性保持和降维效率。所提出的方法恢复了与既定构念一致的连贯因子样分组。所选项目平均减少了量表长度60.5%，同时保持心理测量的适宜性。简化后的量表与原始因子结构高度一致，并保留了因子间的相关性，表明语义潜在组织提供了测量结构的无响应近似。我们的框架将语义结构形式化为可检视的量表构建和简化前端。为促进采用，我们提供了一种支持可视化的工具，能够实现一键语义分析和结构化简化。

cs.CL / 9 / 2602.12635

Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

释放Ascend NPU上的低比特推理：HiFloat格式的全面评估

Zhao, Pengxiang, Zhen, Hui-Ling, Li, Xing, Bao, Han, Lin, Weizhe, Yang, Zhiyuan, Yu, Ziwei, Wang, Xin, Yuan, Mingxuan, Yu, Xianzhi, Dong, Zhenhua

Abstract

As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.

Chinese Translation

随着大规模语言模型（LLMs）的发展，低比特浮点格式如MXFP和NVFP4为精度和效率提供了新的机会。在本研究中，我们评估了HiFloat（HiF8和HiF4），这一系列专为Ascend NPU设计的格式。通过在权重-激活和KV缓存任务上的严格比较，我们提供了三个关键见解：（1）INT8适合窄范围数据，而浮点格式在高方差数据中表现优越；（2）在4比特范围内，HiF4的层次缩放防止了整数格式中出现的精度崩溃；（3）HiFloat与最先进的后训练量化框架完全兼容。总体而言，HiFloat为在NPU上进行高效的LLM推理提供了解决方案。
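
The first insight, that INT8 suits narrow-range data, can be seen in a toy round-trip experiment. The sketch below uses plain symmetric absmax INT8 quantization with one scale per tensor; it does not implement HiF8, HiF4, or any floating-point format, and the sample values are hypothetical.

```python
def fake_quantize_int8(values):
    """Quantize to signed 8-bit integers with one absmax scale, then dequantize."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return list(values)
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, count):
    """Mean absolute round-trip error over the first `count` entries."""
    restored = fake_quantize_int8(values)
    return sum(abs(v - r) for v, r in zip(values[:count], restored[:count])) / count


narrow = [0.1, -0.2, 0.3, -0.4]     # narrow-range data
with_outlier = narrow + [100.0]     # same data plus a single large outlier
narrow_error = mean_abs_error(narrow, 4)
outlier_error = mean_abs_error(with_outlier, 4)
print(narrow_error, outlier_error)
```

A single outlier stretches the shared scale, so the small values lose most of their resolution; formats with an exponent or with finer-grained scaling are meant to soften exactly this effect.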

cs.CL / 10 / 2602.12639

CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

CLASE：一种用于中文法律语言风格评估的混合方法

Ma, Yiran Rex, Ye, Yuxiao, Xie, Huiyuan

Abstract

Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: https://github.com/rexera/CLASE).

Chinese Translation

大型语言模型（LLMs）生成的法律文本通常能够达到合理的事实准确性，但往往未能遵循法律写作的专业风格规范和语言惯例。为了提高风格质量，建立一种可靠的评估方法是关键的第一步。然而，让法律专家手动开发这样的指标是不切实际的，因为法律写作实践中的隐性风格要求难以形式化为明确的评分标准。同时，现有的自动评估方法也存在不足：基于参考的指标将语义准确性与风格忠实性混为一谈，而LLM作为评审的评估则存在不透明和不一致的问题。为了解决这些挑战，我们提出了CLASE（中文法律语言风格评估），这是一种混合评估方法，专注于法律文本的风格表现。该方法结合了1）基于语言特征的评分和2）经验指导的LLM作为评审的评分，采用混合评分机制。特征系数和LLM评分经验均来自真实法律文件及其LLM恢复版本的对比对。该混合设计以透明且无参考的方式捕捉了表层特征和隐性风格规范。在对200份中文法律文件的实验中，CLASE在与人类判断的一致性方面显著高于传统指标和纯LLM作为评审的方法。除了提高一致性，CLASE还提供可解释的评分细分和改进建议，为法律文本生成中的专业风格评估提供了一种可扩展且实用的解决方案（CLASE的代码和数据可在：https://github.com/rexera/CLASE获取）。

cs.CL / 11 / 2602.12642

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

超越归一化：重新思考配分函数作为RLVR（可验证奖励强化学习）的难度调度器

Kim, Dohyung, Kim, Minbeom, Kim, Jeonghye, Lee, Sangmook, Rhee, Sojeong, Jung, Kyomin

Abstract

Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training of LLMs.

Chinese Translation

奖励最大化的强化学习方法增强了大型语言模型（LLMs）的推理性能，但往往降低了输出的多样性。近期的研究通过采用GFlowNets来解决这一问题，训练LLMs以匹配目标分布，同时共同学习其配分函数。与之前仅将配分函数视为归一化器的研究不同，我们将其重新解释为每个提示的期望奖励（即在线准确性）信号，利用这一未被使用的信息来提高样本效率。具体而言，我们首先建立了配分函数与每个提示的准确性估计之间的理论关系。在这一关键见解的基础上，我们提出了配分函数引导的强化学习（PACED-RL），这是一种后训练框架，利用准确性估计在训练过程中优先考虑信息丰富的问题提示，并通过准确性估计误差优先重放进一步提高样本效率。至关重要的是，这两个组件重用了在GFlowNet训练过程中已经产生的信息，有效地将计算开销摊销到现有的优化过程中。广泛的实验在多样的基准测试中展示了相较于GRPO和之前的GFlowNet方法的显著性能提升，突显了PACED-RL作为一种更具样本效率的分布匹配训练方向的前景。

cs.CL / 12 / 2602.12660

Learning Ordinal Probabilistic Reward from Preferences

从偏好中学习序数概率奖励

Chen, Longze, Wang, Lu, Shan, Renke, Gong, Ze, Luo, Run, Li, Jiaming, Luo, Jing, Wang, Qiyao, Yang, Min

Abstract

Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by $\textbf{2.9%}\sim\textbf{7.4%}$ compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.

Chinese Translation

奖励模型对于将大型语言模型（LLMs）与人类价值观和意图对齐至关重要。现有方法遵循生成性（GRMs）或判别性（DRMs）范式，但两者均存在局限性：GRMs 通常需要昂贵的逐点监督，而 DRMs 产生的相对分数未经过校准，缺乏概率解释。为了解决这些挑战，我们提出了一种新颖的奖励建模范式：概率奖励模型（PRM）。我们的方法将奖励视为随机变量，而不是确定性标量，学习每个响应质量的完整概率分布。为了使这一范式具有实用性，我们提出了其封闭形式的离散实现：序数概率奖励模型（OPRM），该模型将质量分数离散化为有限的序数评级集合。在 OPRM 的基础上，我们提出了一种数据高效的训练策略，称为区域泛洪调优（RgFT）。该策略通过引入质量水平注释，使奖励更好地反映绝对文本质量，从而指导模型将概率质量集中在相应评级子区域内。针对各种奖励模型基准的实验表明，我们的方法相比于先前的奖励模型提高了准确性，提升幅度为 $\textbf{2.9%}\sim\textbf{7.4%}$，展现出强大的性能和数据效率。对分数分布的分析提供了证据，表明我们的方法不仅捕捉了相对排名，还捕捉了绝对质量。
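
Treating reward as a distribution over ordinal ratings gives pairwise preferences a probabilistic reading. The sketch below shows one way to compare two responses from their rating distributions; it is not the paper's closed form. It assumes the two ratings are independent and splits ties evenly, and the example distributions are hypothetical.

```python
def expected_rating(dist):
    """Mean rating when dist[k] is the probability of rating k + 1."""
    return sum((k + 1) * p for k, p in enumerate(dist))


def preference_probability(dist_a, dist_b):
    """P(rating of A > rating of B) plus half the tie mass, for independent ratings."""
    if len(dist_a) != len(dist_b):
        raise ValueError("both distributions must cover the same rating scale")
    win = sum(pa * pb
              for i, pa in enumerate(dist_a)
              for j, pb in enumerate(dist_b)
              if i > j)
    tie = sum(pa * pb for pa, pb in zip(dist_a, dist_b))
    return win + 0.5 * tie


# Hypothetical 5-level rating distributions for two responses.
strong = [0.0, 0.1, 0.2, 0.3, 0.4]
weak = [0.4, 0.3, 0.2, 0.1, 0.0]
print(expected_rating(strong), expected_rating(weak), preference_probability(strong, weak))
```

The same distributions yield both an absolute quality estimate (the expected rating) and a calibrated relative comparison, which is the combination the abstract argues scalar reward models lack.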

cs.CL / 13 / 2602.12674

$\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models

$\mathcal{X}$-KD：大型语言模型的通用经验知识蒸馏

Cai, Yuang, Yuan, Yuyu

Abstract

Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher's knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher's original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.

Chinese Translation

大型语言模型（LLMs）的知识蒸馏（KD）随着模型规模和复杂性的增加变得越来越重要。虽然现有的蒸馏方法专注于模仿教师的行为，但它们往往忽视了塑造教师知识的原始学习环境。受到经验学习理论和逆向强化学习的启发，我们提出了经验知识蒸馏（$\mathcal{X}$-KD），这是一种新颖且通用的框架，使学生模型能够在教师的原始学习环境中学习。$\mathcal{X}$-KD采用近似变分奖励模仿学习（AVRIL）框架，联合建模教师的原始奖励函数并执行策略蒸馏，鼓励学生策略与原始奖励函数之间的一致性。我们的推导表明，$\mathcal{X}$-KD遵循监督学习框架，并适用于序列级和基于散度的蒸馏方法，强调了我们方法的简单性和灵活性。实证结果表明，$\mathcal{X}$-KD在抽象摘要、机器翻译和算术推理任务上优于广义KD和MiniLLM基线。此外，$\mathcal{X}$-KD在性能-多样性权衡和数据效率方面也优于基线KD方法。

cs.CL / 14 / 2602.12705

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

MedXIAOHE：构建医疗多模态大语言模型的综合方案

Shi, Baorong, Cui, Bo, Jiang, Boyuan, Yu, Deli, Qian, Fang, Yang, Haihua, Wang, Huichao, Chen, Jiale, Pan, Jianfei, Cao, Jieqiong, Lin, Jinghao, Wu, Kai, Yang, Lin, Yao, Shengsheng, Chen, Tao, Xiao, Xiaojun, Ji, Xiaozhong, Wang, Xu, He, Yijun, Yang, Zhixiong

Abstract

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

Chinese Translation

我们提出了MedXIAOHE，这是一种医疗视觉-语言基础模型，旨在推动通用医疗理解和推理在现实临床应用中的发展。MedXIAOHE在多种医疗基准测试中实现了最先进的性能，并在多个能力上超越了领先的闭源多模态系统。为实现这一目标，我们提出了一种实体感知的持续预训练框架，该框架组织异构医疗语料，以拓宽知识覆盖面并减少长尾差距（例如，罕见疾病）。为了实现医疗专家级的推理和互动，MedXIAOHE通过强化学习和工具增强的自主训练，融入了多样的医疗推理模式，使其能够进行可验证决策痕迹的多步骤诊断推理。为了提高在现实使用中的可靠性，MedXIAOHE整合了用户偏好标准、基于证据的推理以及低幻觉的长篇报告生成，并改善了对医疗指令的遵循。我们发布此报告以记录我们的实际设计选择、扩展见解和评估框架，希望能激发进一步的研究。

cs.CL / 15 / 2602.12709

ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter

ReFilter：通过门控过滤器提高检索增强生成的鲁棒性

Chen, Yixin, Xiong, Ying, Wu, Shangyu, Ke, Xiangrui, Guan, Nan, Xue, Chun Jason

Abstract

Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and latent-based fusion. Despite their effectiveness at modest retrieval scales, these methods often fail to scale gracefully as the number of retrieved candidates k increases: Larger k improves evidence coverage, yet realistic top-k retrieval inevitably contains irrelevant or redundant content and increases the inference cost. To address these limitations, we propose ReFilter, a novel latent-based fusion framework that performs token-level filtering and fusion. ReFilter consists of three key components: a context encoder for encoding context features, a gated filter for weighting each token, and a token fusion module for integrating the weighted token feature into the LLM's hidden states. Our experiments across four general-domain QA benchmarks show that ReFilter consistently achieves the best average performance under both in-domain adaptation and out-of-domain transfer. ReFilter further generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.
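
The gated filtering and fusion step described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the sigmoid gate, the linear gate parameters `gate_w` and `gate_b`, and the averaged additive fusion are hypothetical choices that the abstract does not specify.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(hidden, tokens, gate_w, gate_b=0.0):
    """Weight each context token by a scalar gate and add it to `hidden`.

    hidden : list[float], one hidden state (hypothetical shape)
    tokens : list[list[float]], encoded context token features
    gate_w : list[float], weights of an assumed linear gate
    """
    # One gate value in (0, 1) per retrieved token.
    gates = [sigmoid(sum(w * x for w, x in zip(gate_w, tok)) + gate_b)
             for tok in tokens]
    fused = list(hidden)
    for g, tok in zip(gates, tokens):
        for i, x in enumerate(tok):
            # Averaged additive fusion (an assumption, not the paper's rule).
            fused[i] += g * x / len(tokens)
    return gates, fused
```

Tokens whose gate is close to zero contribute almost nothing to the fused state, which is the intuition behind filtering irrelevant or redundant retrieved content as k grows.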

Chinese Translation

检索增强生成（RAG）已成为将外部证据与知识密集型问答相结合的主导范式。一个核心设计选择是如何将检索到的样本融合到大型语言模型（LLMs）中，现有的内部融合方法大致分为基于查询的融合、参数化融合和基于潜变量的融合。尽管这些方法在适度的检索规模下有效，但随着检索候选数量 k 的增加，它们往往无法优雅地扩展：更大的 k 提高了证据覆盖率，但现实中的前 k 检索不可避免地包含无关或冗余内容，并增加了推理成本。为了解决这些局限性，我们提出了 ReFilter，一种新颖的基于潜变量的融合框架，执行令牌级过滤和融合。ReFilter 由三个关键组件组成：用于编码上下文特征的上下文编码器、用于加权每个令牌的门控过滤器，以及用于将加权令牌特征整合到 LLM 隐藏状态中的令牌融合模块。我们在四个通用领域问答基准上的实验表明，ReFilter 在领域内适应和领域外迁移下始终实现最佳的平均性能。ReFilter 进一步在五个生物医学问答基准上进行零-shot 迁移，无需领域微调，使用 Qwen2.5-14B-Instruct 达到 70.01% 的平均准确率。

cs.CL / 16 / 2602.12746

Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting

Lamer-SSL：一种层感知的LoRA专家混合模型，用于无遗忘的自监督模型的持续多语言扩展

Xu, Jing, Wu, Minglin, Chen, Xueyuan, Wu, Xixin, Meng, Helen

Abstract

Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. The Lamer module enables flexible balancing between shared and language-specific representations, while layer-aware expert allocation assigns more experts to deeper layers where semantic information is richer. Meanwhile, the replay strategy retains prior knowledge using minimal data, mitigating forgetting during continual training. Experiments on automatic speech recognition (ASR) and language identification (LID) demonstrate that Lamer-SSL extends self-supervised models to new languages effectively while maintaining strong performance on previously learned languages with only 2.14% parameters being trainable.
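
As a rough illustration of layer-aware expert allocation, the sketch below assigns more experts to deeper layers. The linear schedule and the parameters `min_experts` and `max_experts` are assumptions made for this example; the abstract only states that deeper layers receive more experts.

```python
def allocate_experts(num_layers, min_experts, max_experts):
    """Return one expert count per layer, growing linearly with depth.

    The linear schedule is a hypothetical choice for illustration.
    """
    if num_layers == 1:
        return [max_experts]
    span = max_experts - min_experts
    return [min_experts + round(span * i / (num_layers - 1))
            for i in range(num_layers)]
```

For example, `allocate_experts(4, 1, 4)` gives the shallowest layer one expert and the deepest layer four.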

Chinese Translation

尽管自监督语音模型表现出色，但它们在新语言的泛化能力上常常面临挑战，并且在持续训练过程中容易遗忘先前获得的知识。为了解决这一问题，我们提出了Lamer-SSL，这是一种参数高效的框架，结合了层感知的LoRA专家混合模块（Lamer）和重放策略。Lamer模块实现了共享表示与语言特定表示之间的灵活平衡，而层感知的专家分配则将更多专家分配给语义信息更丰富的深层。与此同时，重放策略通过使用最少的数据保留先前知识，从而减轻了持续训练过程中的遗忘。针对自动语音识别（ASR）和语言识别（LID）的实验表明，Lamer-SSL能够有效地将自监督模型扩展到新语言，同时在先前学习的语言上保持强劲的性能，仅需2.14%的可训练参数。

cs.CL / 17 / 2602.12759

Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks

面向序列标注任务的诊断与预测评估方法

Alvarez-Mellado, Elena, Gonzalo, Julio

Abstract

Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little information on how to improve performance and, what is worse, it should not come as a surprise if B ends up being better than A on outside data. We propose an evaluation methodology for sequence labeling tasks grounded in error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world in-distribution scraped data, but instead consist of a small set of handcrafted, linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.
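
To make the "median correlation" figure concrete, here is a minimal sketch of how such a number could be computed. It assumes Pearson correlation and hypothetical inputs (one predicted score per model, and the observed scores of the same models on each external dataset); the abstract does not say which correlation coefficient or aggregation the authors use.

```python
from statistics import median

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def median_correlation(predicted, observed_by_dataset):
    """Median, over external datasets, of the predicted-vs-observed correlation.

    predicted           : list[float], one predicted score per model
    observed_by_dataset : dict[str, list[float]], observed scores per dataset
    """
    return median(pearson(predicted, obs)
                  for obs in observed_by_dataset.values())
```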

Chinese Translation

自然语言处理中的标准评估通常表明系统A在平均性能上优于系统B，但对于如何提升性能提供的信息有限，更糟的是，如果B在外部数据上表现优于A，这也不应令人感到意外。我们提出了一种基于错误分析的序列标注任务评估方法，提供了关于系统需要改进的定量和定性信息，并预测模型在不同分布上的表现。关键在于创建测试集，与常规做法相反，我们不依赖于收集大量真实世界的在分布抓取数据，而是手工制作一小组具有语言学动机的示例，全面覆盖系统在实际应用中可能遇到的跨度属性（如形状、长度、大小写、句子位置等）。我们在西班牙语的外来词识别基准上展示了这一方法。我们的评估方法提供了诊断性结果（因为它有助于识别性能中的系统性弱点）、可操作性（因为它可以指示哪个模型更适合特定场景）和预测性：我们的方法在外部数据集上的模型性能预测中具有0.85的中位相关性。

cs.CL / 18 / 2602.12778

Aspect-Based Sentiment Analysis for Future Tourism Experiences: A BERT-MoE Framework for Persian User Reviews

基于方面的情感分析在未来旅游体验中的应用：一种针对波斯语用户评论的BERT-MoE框架

Taskooh, Hamidreza Kazemi, Harofte, Taha Zare

Abstract

This study advances aspect-based sentiment analysis (ABSA) for Persian-language user reviews in the tourism domain, addressing challenges of low-resource languages. We propose a hybrid BERT-based model with Top-K routing and auxiliary losses to mitigate routing collapse and improve efficiency. The pipeline includes: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism-related aspects (host, price, location, amenities, cleanliness, connectivity), and (3) integrated ABSA with dynamic routing. The dataset consists of 58,473 preprocessed reviews from the Iranian accommodation platform Jabama, manually annotated for aspects and sentiments. The proposed model achieves a weighted F1-score of 90.6% for ABSA, outperforming baseline BERT (89.25%) and a standard hybrid approach (85.7%). Key efficiency gains include a 39% reduction in GPU power consumption compared to dense BERT, supporting sustainable AI deployment in alignment with UN SDGs 9 and 12. Analysis reveals high mention rates for cleanliness and amenities as critical aspects. This is the first ABSA study focused on Persian tourism reviews, and we release the annotated dataset to facilitate future multilingual NLP research in tourism.
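
Top-K routing and an auxiliary balancing term can be sketched in a few lines. This is a generic mixture-of-experts illustration, not the paper's model: the softmax gate, the renormalisation over the selected experts, and the squared-deviation penalty are all assumptions chosen for simplicity.

```python
import math

def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    m = max(gate_logits)
    exps = [math.exp(z - m) for z in gate_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    # Renormalise so the selected experts' weights sum to 1.
    return {i: probs[i] / norm for i in chosen}

def load_balance_penalty(assignments, num_experts):
    """Squared deviation of expert usage from uniform; 0 means balanced.

    assignments: iterable of collections of expert indices, one per input.
    A simple stand-in for an auxiliary loss that discourages routing collapse.
    """
    counts = [0] * num_experts
    for routed in assignments:
        for i in routed:
            counts[i] += 1
    total = sum(counts)
    return sum((c / total - 1 / num_experts) ** 2 for c in counts)
```

When every input is routed to the same expert the penalty is at its largest, which is the collapse scenario the auxiliary losses are meant to mitigate.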

Chinese Translation

本研究推动了针对波斯语用户评论的基于方面的情感分析（ABSA），解决了低资源语言面临的挑战。我们提出了一种基于BERT的混合模型，结合Top-K路由和辅助损失，以减轻路由崩溃并提高效率。该流程包括：（1）使用BERT对9,558条标注评论进行整体情感分类，（2）针对六个与旅游相关的方面（房东、价格、地点、设施、清洁度、连接性）进行多标签方面提取，以及（3）集成动态路由的ABSA。数据集由来自伊朗住宿平台Jabama的58,473条预处理评论组成，手动标注了方面和情感。所提出的模型在ABSA任务中实现了90.6%的加权F1分数，优于基线BERT（89.25%）和标准混合方法（85.7%）。关键的效率提升包括与密集BERT相比，GPU功耗减少39%，支持与联合国可持续发展目标9和12相一致的可持续人工智能部署。分析显示，清洁度和设施的提及率较高，成为关键方面。这是首个聚焦于波斯语旅游评论的ABSA研究，我们发布了标注数据集，以促进未来多语言自然语言处理在旅游领域的研究。

cs.CL / 19 / 2602.12806

RAT-Bench: A Comprehensive Benchmark for Text Anonymization

RAT-Bench：一个全面的文本匿名化基准测试

Krčo, Nataša, Yao, Zexi, Meeus, Matthieu, de Montjoye, Yves-Alexandre

Abstract

Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect, in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall, we find that LLM-based anonymizers, including new iterative anonymizers, provide a better privacy-utility trade-off, albeit at a higher computational cost. Importantly, we also find them to work well across languages. We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.
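
One simple way to think about re-identification risk from inferred attributes is to count how many people in a population share them. The sketch below is a toy illustration of that idea only; the record format, the attribute names, and the 1/k risk definition are assumptions and not the estimator used by RAT-Bench.

```python
def reidentification_risk(population, inferred):
    """Risk of singling out one person: 1 / size of the matching group.

    population : list[dict], toy records with attribute values
    inferred   : dict, attributes an attacker recovered from the text
    """
    matches = [p for p in population
               if all(p.get(k) == v for k, v in inferred.items())]
    return 1.0 / len(matches) if matches else 0.0
```

With no attributes inferred, everyone in the population matches and the risk is low; each additional correctly inferred attribute shrinks the matching group and raises the risk.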

Chinese Translation

包含个人信息的数据越来越多地用于训练、微调或查询大型语言模型（LLMs）。在使用之前，文本通常会被清除识别信息，通常使用诸如微软的Presidio或Anthropic的PII purifier等工具。这些工具传统上是根据其去除特定标识符（例如，姓名）的能力进行评估的，但它们在防止重新识别方面的有效性仍不明确。我们引入了RAT-Bench，这是一个基于重新识别风险的文本匿名化工具的全面基准测试。利用美国的人口统计统计数据，我们生成了包含各种直接和间接标识符的合成文本，涵盖不同领域、语言和难度级别。我们评估了一系列基于命名实体识别（NER）和大型语言模型（LLM）的文本匿名化工具，并根据LLM攻击者能够从匿名文本中正确推断出的属性，报告了美国人口中的重新识别风险，同时适当考虑标识符的不同影响。我们发现，尽管能力差异很大，但即使是最好的工具在特定情况下也远非完美，尤其是在直接标识符未以标准方式书写时，以及间接标识符使得重新识别成为可能时。总体而言，我们发现基于LLM的匿名化工具，包括新的迭代匿名化工具，提供了更好的隐私-效用权衡，尽管计算成本较高。重要的是，我们还发现它们在不同语言之间表现良好。我们最后提出了对未来匿名化工具的建议，并将发布该基准测试，鼓励社区努力扩展它，特别是扩展到其他地理区域。

cs.CL / 20 / 2602.12811

Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence

从大型语言模型（LLMs）表征中预测大脑活动的左右不对称性与其形式语言能力的出现

Bonnasse-Gahot, Laurent, Pallier, Christophe

Abstract

When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model's capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, and by its ability to produce well-formed text. By contrast, the left-right asymmetry does not correlate with the performance on arithmetic or Dyck language tasks, nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).
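
The minimal-pair criterion mentioned in this abstract reduces to a simple accuracy measure, sketched below. The input format (pairs of log-probabilities) and the definition of asymmetry as a plain left-minus-right difference are assumptions for illustration; the paper may define both differently.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the acceptable sentence is more probable.

    pairs: list of (logprob_acceptable, logprob_unacceptable) tuples.
    """
    return sum(1 for good, bad in pairs if good > bad) / len(pairs)

def asymmetry(left_score, right_score):
    """Left-right difference in brain scores (hypothetical definition)."""
    return left_score - right_score
```

Tracking both quantities across training checkpoints is one way to check whether they rise together.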

Chinese Translation

当人类和大型语言模型（LLMs）处理相同文本时，LLMs中的激活与通过功能性磁共振成像（fMRI）测量的大脑活动相关。此外，研究表明，随着LLM训练的进展，从其内部激活中预测大脑活动的性能在左半球的提升程度高于右半球。本研究的目的是理解LLMs所获得的哪种能力导致了这种左右不对称性的出现。我们使用OLMo-2 7B语言模型在不同训练检查点的表现，以及来自英语参与者的fMRI数据，比较大脑评分中的左右不对称性演变与多个基准测试的性能。我们观察到，这种不对称性与LLM的形式语言能力共同出现。这些能力通过两种方式体现：模型能够在一个最小对比对中为可接受句子分配比语法上不可接受句子更高的概率，或其生成良好构造文本的能力。相反，左右不对称性与算术或Dyck语言任务的表现无关；也与涉及世界知识和推理的基于文本的任务无关。我们将这些结果推广到另一类LLMs（Pythia）和另一种语言，即法语。我们的观察表明，预测大脑活动的左右不对称性与形式语言能力的进展（语言模式的知识）相匹配。

cs.CL / 21 / 2602.12818

AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection

MULTIPRIDE中的AIWizards：一种层次化的侮辱性言论回收检测方法

Tedeschini, Luca, Fasulo, Matteo

Abstract

Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexical items can function either as abusive expressions or as in-group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a reclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM-based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT-like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user-oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling. We argue that more fine-grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at https://github.com/LucaTedeschini/multipride.

Chinese Translation

检测回收的侮辱性言论是仇恨言论检测系统面临的一个基本挑战，因为相同的词汇项在社会身份和语境的不同影响下，可以作为攻击性表达或群体内部的肯定。在本研究中，我们通过提出一种层次化的方法来建模侮辱性言论回收过程，来解决EVALITA 2026年MultiPRIDE共享任务的子任务B。我们的核心假设是，LGBTQ+社区的成员在平均情况下更可能以一种肯定的方式使用某些侮辱性言论。基于这一假设，我们将任务分解为两个阶段。首先，通过使用弱监督的基于LLM的注释，我们为用户分配模糊标签，指示其属于LGBTQ+社区的可能性，这一信息是从推文和用户简介中推断得出的。这些软标签随后用于训练一个类似BERT的模型，以预测社区成员资格，鼓励模型学习与LGBTQ+身份相关的潜在表示。在第二阶段，我们将这一潜在空间与一个新初始化的模型结合，用于下游的侮辱性言论回收检测任务。我们的直觉是，第一个模型编码了面向用户的社会语言学信号，这些信号随后与为仇恨言论检测预训练的模型学习到的表示融合。对意大利语和西班牙语的实验结果表明，我们的方法在性能上与强大的基于BERT的基线统计上相当，同时提供了一个模块化和可扩展的框架，以将社会语言学背景纳入仇恨言论建模中。我们认为，更细粒度的用户身份和话语背景的层次化建模可能会进一步改善对回收语言的检测。我们的代码已发布在 https://github.com/LucaTedeschini/multipride。

cs.CL / 22 / 2602.12871

MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

MentalBench：评估大型语言模型精神诊断能力的基准

Song, Hoyun, Kang, Migyeong, Shin, Jisu, Kim, Jihyun, Park, Chanbi, Yoo, Hangyeol, An, Jihyun, Oh, Alice, Han, Jinyoung, Lim, KyungTae

Abstract

We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a golden-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.

Chinese Translation

我们介绍了MentalBench，这是一个用于评估大型语言模型（LLMs）在精神诊断决策中的能力的基准。现有的心理健康基准主要依赖社交媒体数据，限制了其评估基于《精神疾病诊断与统计手册》（DSM）诊断判断的能力。MentalBench的核心是MentalKG，这是一个由精神科医生构建和验证的知识图谱，编码了DSM-5的诊断标准和23种精神障碍的鉴别诊断规则。我们以MentalKG作为黄金标准的逻辑基础，生成了24,750个合成临床案例，这些案例在信息完整性和诊断复杂性上系统变化，从而实现低噪声和可解释的评估。我们的实验表明，尽管最先进的LLMs在探测DSM-5知识的结构化查询中表现良好，但在区分临床上重叠的障碍时，它们在诊断决策中校准信心方面存在困难。这些发现揭示了现有基准未能捕捉的评估差距。

cs.CL / 23 / 2602.12889

BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

BaziQA-Benchmark：评估大型语言模型中的符号和时间组合推理

Chen, Jiangxi, Liu, Qian

Abstract

We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021-2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.

Chinese Translation

我们提出了BaziQA-Benchmark，这是一个标准化的基准，用于评估大型语言模型中的符号和时间组合推理。该基准源自2021年至2025年全球占卜师比赛中200个经过专业策划的多项选择题，每个实例都需要在固定的符号图表和交互的时间条件下进行结构化推理。与轶事或提示驱动的评估不同，BaziQA-Benchmark能够实现跨年份、领域和模型家族的客观评分和控制比较。我们在多轮设置下评估当代语言模型，并分析在时间难度、推理领域和推理协议方面的性能变化。为了进一步探讨推理行为，我们引入了一种轻量级的结构化推理协议，该协议在不增加领域知识的情况下限制推理顺序。结果显示，模型的表现始终优于随机猜测，但仍远未达到饱和，表现出对时间组合和推理顺序的明显敏感性，以及在精确时间定位和多条件符号判断上的系统性失败。

cs.CL / 24 / 2602.12911

ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark

ViMedCSS：越南医学代码切换语音数据集与基准

Nguyen, Tung X., Vo, Nhu, Nguyen, Giang-Son, Hoang, Duy Mai, Huynh, Chien Dinh, Unanue, Inigo Jauregi, Piccardi, Massimo, Buntine, Wray, Le, Dung D.

Abstract

Code-switching (CS), in which Vietnamese speech incorporates English words such as drug names or procedures, is a common phenomenon in Vietnamese medical communication. This creates challenges for Automatic Speech Recognition (ASR) systems, especially in low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine different fine-tuning strategies for improving medical term recognition, in order to identify the most effective approach on this dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.
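
The distinction between overall accuracy and code-switched accuracy can be illustrated with two small metrics: word error rate over the whole utterance and recall of the English terms. This is a sketch under assumptions; the abstract does not name the exact metrics, and `term_recall` here only handles single-word terms with whitespace tokenisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def term_recall(hypothesis, terms):
    """Share of expected single-word terms found in the transcript."""
    words = set(hypothesis.lower().split())
    return sum(t.lower() in words for t in terms) / len(terms)
```

A model can score a low word error rate while still missing the inserted English terms, which is why the two numbers are worth reporting separately.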

Chinese Translation

代码切换（CS）是指越南语中使用英语单词（如药物名称或程序）的现象，这在越南医学交流中非常普遍。这给自动语音识别（ASR）系统带来了挑战，尤其是在像越南这样的低资源语言中。目前大多数ASR系统在识别越南句子中的英语医学术语时表现不佳，并且没有基准能够解决这一挑战。本文构建了一个包含34小时的越南医学代码切换语音数据集（ViMedCSS），其中包含16,576个语句。每个语句至少包含一个来自经过整理的双语词汇表的英语医学术语，涵盖五个医学主题。利用该数据集，我们评估了几种最先进的ASR模型，并研究了不同的特定微调策略，以改善医学术语的识别，从而探讨解决数据集中问题的最佳方法。实验结果表明，针对越南语优化的模型在一般段落上表现更好，而多语言预训练有助于捕捉英语插入。两种方法的结合在整体和代码切换准确性之间达到了最佳平衡。本研究为越南医学代码切换提供了首个基准，并为低资源多语言ASR系统的有效领域适应提供了见解。

cs.CL / 25 / 2602.12921

When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms

当语言不再字面：孟加拉语习语中的隐喻理解

Sakhawat, Adib, Parveen, Shamim Ara, Amin, Md Ruhul, Mahmud, Shamim Al, Islam, Md Saiful, Khatun, Tahera

Abstract

Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.

Chinese Translation

隐喻语言理解仍然是大型语言模型（LLMs）面临的重大挑战，尤其是在资源匮乏的语言中。为了解决这一问题，我们引入了一个新的习语数据集，这是一个规模庞大、文化根植的孟加拉语习语语料库，包含10,361个习语。每个习语都根据一个全面的19字段架构进行注释，该架构通过专家共识过程建立和完善，捕捉其语义、句法、文化和宗教维度，为计算语言学提供了丰富的结构化资源。为了建立孟加拉语隐喻语言理解的稳健基准，我们对30个最先进的多语言和指令调优的LLMs在推断隐喻意义的任务上进行了评估。我们的结果揭示了一个显著的性能差距，没有任何模型的准确率超过50%，与人类的显著更高表现（83.4%）形成鲜明对比。这突显了现有模型在跨语言和文化推理方面的局限性。通过发布新的习语数据集和基准，我们为推动孟加拉语及其他资源匮乏语言的隐喻语言理解和文化根植提供了基础设施。

cs.CL / 26 / 2602.12937

Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

课程学习与伪标签技术提升多标签阿拉伯方言识别模型的泛化能力

Mekky, Ali, Zeftawy, Mohamed El, Hassan, Lara, Keleg, Amr, Nakov, Preslav

Abstract

Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, but recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
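
For readers unfamiliar with macro F1 in the multi-label setting, the sketch below averages a per-dialect binary F1 over all labels. This is the usual definition, but whether the MLADI leaderboard computes it exactly this way is an assumption, and the dialect codes in any example input are hypothetical.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores.

    gold, pred : lists of label sets, one set per sentence
    labels     : list of all dialect labels to average over
    """
    scores = []
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every label counts equally, a sentence that is valid in several dialects but labelled with only one penalises the missed dialects directly, which is the negative-sample problem the abstract describes.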

Chinese Translation

长期以来，阿拉伯方言识别（ADI）被建模为单标签分类任务，近期的研究认为应将其框架转变为多标签分类任务。然而，ADI仍受到单标签数据集可用性的限制，目前没有可用于训练的大规模多标签资源。通过分析在单标签ADI数据上训练的模型，我们发现将这些数据集重新用于多标签阿拉伯方言识别（MLADI）的主要困难在于负样本的选择，因为许多被视为负样本的句子在多个方言中可能是可接受的。为了解决这些问题，我们通过使用GPT-4o和二元方言可接受性分类器生成自动多标签注释，构建了一个多标签数据集，并通过阿拉伯方言程度（ALDi）进行聚合。随后，我们使用与方言复杂性和标签基数相一致的课程学习策略训练了一个基于BERT的多标签分类器。在MLADI排行榜上，我们表现最佳的LAHJATBERT模型达到了0.69的宏F1分数，而之前报告的最强系统仅为0.55。代码和数据可在 https://mohamedalaa9.github.io/lahjatbert/ 获取。

cs.CL / 27 / 2602.12966

ProbeLLM: Automating Principled Diagnosis of LLM Failures

ProbeLLM：自动化大语言模型故障的原则性诊断

Huang, Yue, Jiang, Zhengzhe, Ma, Yuchen, Jiang, Yu, Wang, Xiangqi, Zhou, Yujun, Hao, Yuexing, Guo, Kehan, Chen, Pin-Yu, Feuerriegel, Stefan, Zhang, Xiangliang

Abstract

Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
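
The trade-off between exploring new failure regions and refining known ones is the classic exploration-exploitation balance in Monte Carlo Tree Search. The sketch below uses the standard UCB1 rule as a stand-in; the abstract does not say which selection rule ProbeLLM uses, and the node fields (`name`, `reward`, `visits`) are hypothetical.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=1.4):
    """Standard UCB1 score; unvisited nodes are always tried first."""
    if visits == 0:
        return float("inf")
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, c=1.4):
    """Pick the child with the highest UCB1 score.

    children: list of dicts with 'name', 'reward' (e.g. failures found)
    and 'visits' (probes spent so far).
    """
    parent_visits = sum(ch["visits"] for ch in children) or 1
    best = max(children,
               key=lambda ch: ucb1(ch["reward"], ch["visits"], parent_visits, c))
    return best["name"]
```

A small exploration constant `c` concentrates the budget on branches that already yield failures, while a large one pushes probes toward less-visited regions.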

Chinese Translation

理解大型语言模型（LLMs）如何以及为何失败，正成为随着模型快速发展而日益突出的中心挑战，而静态评估则逐渐滞后。尽管动态测试生成使得自动化探测成为可能，但现有方法往往只能发现孤立的故障案例，缺乏对探索的原则性控制，并且对模型弱点的潜在结构提供的洞察有限。我们提出了ProbeLLM，一个与基准无关的自动化探测框架，将弱点发现从个别故障提升到结构化故障模式。ProbeLLM将探测形式化为层次化的蒙特卡洛树搜索，明确地在新故障区域的全局探索和重复错误模式的局部细化之间分配有限的探测预算。通过将探测限制在可验证的测试案例，并利用工具增强生成和验证，ProbeLLM将故障发现建立在可靠的证据基础上。发现的故障进一步通过故障感知嵌入和边界感知归纳整合为可解释的故障模式。在多样的基准和LLMs上，ProbeLLM揭示了比静态基准和先前自动化方法更广泛、更清晰和更细致的故障景观，支持从案例中心评估向原则性弱点发现的转变。

cs.CL / 28 / 2602.12984

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

SciAgentGym：基准测试大型语言模型代理的多步骤科学工具使用

Shen, Yujiong, Yang, Yajie, Xi, Zhiheng, Hu, Binze, Sha, Huayu, Zhang, Jiazheng, Peng, Qiyuan, Shang, Junlin, Huang, Jixuan, Fan, Yutao, Tong, Jingqi, Dou, Shihan, Zhang, Ming, Bai, Lei, Yin, Zhenfei, Gui, Tao, Ma, Xingjun, Zhang, Qi, Huang, Xuanjing, Jiang, Yu-Gang

Abstract

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
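
Modeling tools as a dependency graph means a trajectory is only valid if every tool runs after the tools it depends on. The sketch below shows that idea with Python's standard `graphlib`; the tool names and the validity check are hypothetical and are not taken from SciForge itself.

```python
from graphlib import TopologicalSorter

def valid_tool_order(dependencies):
    """Return one execution order that respects all dependencies.

    dependencies maps each tool to the set of tools whose output it needs.
    """
    return list(TopologicalSorter(dependencies).static_order())

def is_logic_aware(trajectory, dependencies):
    """True if every tool in the trajectory runs after its prerequisites."""
    seen = set()
    for tool in trajectory:
        if not dependencies.get(tool, set()) <= seen:
            return False
        seen.add(tool)
    return True
```

Sampling many different valid orders from such a graph is one plausible way to generate training trajectories that never call a tool before its inputs exist.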

Chinese Translation

科学推理本质上要求整合复杂的工具包，以便在特定领域知识中进行导航。然而，当前的基准测试在很大程度上忽视了代理在如此严格的工作流程中协调工具的能力。为了解决这一问题，我们推出了SciAgentGym，一个可扩展的互动环境，提供1,780个跨四个自然科学学科的特定领域工具，并配备强大的执行基础设施。与此相辅相成的是，我们提出了SciAgentBench，一个分层评估套件，旨在对代理能力进行压力测试，从基本操作到长期工作流程。我们的评估发现了一个关键瓶颈：最先进的模型在复杂的科学工具使用方面表现不佳。即使是像GPT-5这样的领先模型，成功率也随着交互范围的扩大而急剧下降，从60.6%降至30.9%，主要是由于多步骤工作流程执行中的失败。为了解决这个问题，我们提出了SciForge，一种数据合成方法，将工具操作空间建模为依赖图，以生成逻辑感知的训练轨迹。通过在这些轨迹上进行微调，我们的SciAgent-8B在性能上超越了显著更大的Qwen3-VL-235B-Instruct，同时展现出科学工具使用能力的积极跨领域迁移。这些结果强调了下一代自主科学代理的良好潜力。

cs.CL / 29 / 2602.12989

Evaluating the Homogeneity of Keyphrase Prediction Models

评估关键词预测模型的同质性

Houbre, Maël, Boudin, Florian, Daille, Beatrice

Abstract

Keyphrases, which are useful in several NLP and IR applications, are either extracted from text or predicted by generative models. Unlike keyphrase extraction approaches, keyphrase generation models can predict keyphrases that do not appear in a document's text, called "absent keyphrases". This ability means that keyphrase generation models can associate a document with a notion that is not explicitly mentioned in its text. Intuitively, this suggests that for two documents treating the same subjects, a keyphrase generation model is more likely to be homogeneous in its indexing, i.e., to predict the same keyphrases for both documents, regardless of whether those keyphrases appear in their respective texts; something a keyphrase extraction model would fail to do. Yet, homogeneity of keyphrase prediction models is not covered by current benchmarks. In this work, we introduce a method to evaluate the homogeneity of keyphrase prediction models and study whether absent keyphrase generation capabilities actually help the model to be more homogeneous. To our surprise, we show that keyphrase extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually have a negative impact on homogeneity. Our data, code and prompts are available on Hugging Face and GitHub.
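
One simple proxy for homogeneity is the overlap between the keyphrase sets predicted for two documents on the same subject. The sketch below uses Jaccard similarity for that purpose; this is an assumption made for illustration, since the abstract does not describe the authors' actual measure.

```python
def jaccard(a, b):
    """Overlap between two keyphrase collections, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

A model that indexes two related documents with identical keyphrases scores 1.0, while one that shares no keyphrases between them scores 0.0.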

Chinese Translation

关键词在多个自然语言处理（NLP）和信息检索（IR）应用中非常有用，它们可以从文本中提取或由生成模型预测。与关键词提取方法相反，关键词生成模型能够预测在文档文本中未出现的关键词，称为“缺失关键词”（absent keyphrases）。这一能力意味着关键词生成模型可以将文档与其文本中未明确提及的概念关联起来。直观上，这表明对于处理相同主题的两个文档，关键词生成模型更有可能在索引上保持同质性，即为两个文档预测相同的关键词，无论这些关键词是否出现在各自的文本中；而关键词提取模型则无法做到这一点。然而，当前的基准测试并未涵盖关键词预测模型的同质性。在本研究中，我们提出了一种评估关键词预测模型同质性的方法，并研究缺失关键词生成能力是否确实有助于模型变得更加同质。令人惊讶的是，我们展示了关键词提取方法与生成模型具有竞争力，并且生成缺失关键词的能力实际上可能对同质性产生负面影响。我们的数据、代码和提示可在 huggingface 和 github 上获取。

cs.CL / 30 / 2602.12996

Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

了解更多，了解更清晰：一种用于大型语言模型知识增强的元认知框架

Chen, Hao, He, Ye, Fan, Yuchun, Yan, Yukun, Liu, Zhenghao, Zhu, Qingfu, Sun, Maosong, Che, Wanxiang

Abstract

Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.

Chinese Translation

知识增强显著提升了大型语言模型（LLMs）在知识密集型任务中的表现。然而，现有方法通常基于模型性能等同于内部知识的简单前提，忽视了导致过度自信错误或不确定真相的知识-信心差距。为了解决这一问题，我们提出了一种新颖的元认知框架，通过差异化干预和对齐实现可靠的知识增强。我们的方法利用内部认知信号将知识空间划分为掌握、困惑和缺失区域，从而指导有针对性的知识扩展。此外，我们引入了一种认知一致性机制，以同步主观确定性与客观准确性，确保知识边界的校准。大量实验表明，我们的框架在多个强基准上表现出色，验证了其在增强知识能力的同时，促进更好地区分已知与未知的认知行为的合理性。

cs.CL / 31 / 2602.13047

Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

我们能信任人工智能在英国认知障碍群体中检测健康的多语种英语使用者吗？基于真实对话语音的研究

Pahar, Madhurananda, Illingworth, Caitlin, Braun, Dorota, Mirheidari, Bahman, Sproson, Lise, Blackburn, Daniel, Christensen, Heidi

Abstract

Conversational speech often reveals early signs of cognitive decline, such as dementia and MCI. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study examines the trustworthiness of AI models, specifically the presence of bias, in detecting healthy multilingual English speakers among the cognitively impaired cohort, to make these tools clinically beneficial. For experiments, monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford. In addition to having a non-native English accent, the multilingual participants spoke Somali, Chinese, or South Asian languages, and were further divided by Yorkshire accent (West and South) to challenge the AI tools thoroughly. Although ASR systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilinguals were more likely to be misclassified as having cognitive decline. This study is the first of its kind to discover that, despite their strong overall performance, current AI models show bias against multilingual individuals from ethnic minority backgrounds in the UK, and that they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with a more severe cognitive decline. In this pilot study, we conclude that the existing AI tools are therefore not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.

Chinese Translation

对话语音常常揭示认知衰退的早期迹象，如痴呆和轻度认知障碍（MCI）。在英国，四分之一的人口属于少数民族，而痴呆的发病率预计在黑人和亚洲社区中将迅速上升。本研究考察了人工智能模型的可信度，特别是在检测认知障碍群体中的健康多语种英语使用者时是否存在偏见，以使这些工具在临床上具有实用价值。在实验中，单语参与者在全国范围内（英国）招募，多语种说话者则从谢菲尔德和布拉德福德的四个社区中心招募。除了非母语英语口音外，多语种说话者还讲索马里语、中文或南亚语言，并进一步分为两种约克郡口音（西部和南部），以全面挑战人工智能工具的效率。尽管自动语音识别（ASR）系统在各组之间未显示出显著偏见，但使用声学和语言特征的分类和回归模型对多语种说话者表现出偏见，尤其是在记忆、流利度和阅读任务中。当模型在公开可用的DementiaBank数据集上训练时，这种偏见更加明显。此外，多语种说话者更可能被误分类为认知衰退。本研究是首个发现尽管当前人工智能模型整体表现良好，但在英国的少数民族背景的多语种个体中仍显示出偏见，并且更可能将某种口音（南约克郡）的说话者误分类为更严重的认知衰退。在这项初步研究中，我们得出结论，现有的人工智能工具尚不可靠，无法在这些人群中用于诊断，未来我们旨在通过开发更具普适性和降低偏见的模型来解决这一问题。

cs.CL / 32 / 2602.13059

TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

TraceBack：用于细粒度表格归因的多智能体分解

Anvekar, Tejas, Park, Junha, Jha, Rajat, Gupta, Devanshu, Ganesan, Poojah, Mathur, Puneeth, Gupta, Vivek

Abstract

Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.

Chinese Translation

基于结构化表格的问题回答（QA）不仅需要准确的答案，还需要透明地说明哪些单元格支持这些答案。现有的表格QA系统很少提供细粒度的归因，因此即使是正确的答案也往往缺乏可验证的依据，这在高风险环境中限制了信任。我们通过TraceBack来解决这个问题，TraceBack是一个模块化的多智能体框架，旨在实现单表QA中的可扩展单元格级归因。TraceBack对表格进行修剪，保留相关的行和列，将问题分解为语义上连贯的子问题，并将每个答案跨度与其支持的单元格对齐，捕捉在中间推理步骤中使用的显性和隐性证据。为了实现系统评估，我们发布了CITEBench，这是一个基准，包含从ToTTo、FetaQA和AITQA中提取的短语到单元格的注释。我们进一步提出了FairScore，这是一种无参考的度量，比较从预测单元格和答案中得出的原子事实，以估计归因的精确度和召回率，而无需人工单元格标签。实验表明，TraceBack在各个数据集和粒度上显著优于强基线，而FairScore与人类判断密切相关，并保持相对方法排名，支持可解释和可扩展的表格QA评估。

cs.CL / 33 / 2602.13084

Exploring a New Competency Modeling Process with Large Language Models

基于大型语言模型的新能力建模过程探索

Du, Silin, Xin, Manqing, Wang, Raymond Jia

Abstract

Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone to randomness, ambiguity, and limited reproducibility. This study proposes a new competency modeling process built on large language models (LLMs). Instead of merely automating isolated steps, we reconstruct the workflow by decomposing expert practices into structured computational components. Specifically, we leverage LLMs to extract behavioral and psychological descriptions from raw textual data and map them to predefined competency libraries through embedding-based similarity. We further introduce a learnable parameter that adaptively integrates different information sources, enabling the model to determine the relative importance of behavioral and psychological signals. To address the long-standing challenge of validation, we develop an offline evaluation procedure that allows systematic model selection without requiring additional large-scale data collection. Empirical results from a real-world implementation in a software outsourcing company demonstrate strong predictive validity, cross-library consistency, and structural robustness. Overall, our framework transforms competency modeling from a largely qualitative and expert-dependent practice into a transparent, data-driven, and evaluable analytical process.

Chinese Translation

能力建模在人力资源管理中被广泛应用于人才的选择、发展和评估。然而，传统的专家驱动方法严重依赖于对大量面试记录的手动分析，这使得其成本高昂且容易受到随机性、模糊性和有限可重复性的影响。本研究提出了一种基于大型语言模型（LLMs）的新能力建模过程。我们不仅仅是自动化孤立的步骤，而是通过将专家实践分解为结构化的计算组件来重构工作流程。具体而言，我们利用LLMs从原始文本数据中提取行为和心理描述，并通过基于嵌入的相似性将其映射到预定义的能力库。我们进一步引入一个可学习的参数，该参数自适应地整合不同的信息源，使模型能够确定行为和心理信号的相对重要性。为了解决长期存在的验证挑战，我们开发了一种离线评估程序，允许系统化的模型选择，而无需额外的大规模数据收集。来自一家软件外包公司的实际应用的实证结果显示出强大的预测有效性、跨库一致性和结构稳健性。总体而言，我们的框架将能力建模从一个主要依赖定性和专家的实践转变为一个透明、数据驱动且可评估的分析过程。
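
The mapping step described in the abstract above (embedding-based similarity against a competency library, with one parameter balancing behavioral and psychological signals) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `cosine`, `score_competencies`, and `behavior_weight`, the toy two-dimensional embeddings, and the convex-combination rule are all hypothetical and are not taken from the paper. In the paper the weight is described as learnable; here it is simply passed in as a fixed number.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_competencies(behavior_vec, psych_vec, library, behavior_weight):
    """Score every competency in `library` (name -> embedding).

    behavior_weight in [0, 1] is the share given to the behavioral description;
    the remainder goes to the psychological description. This convex
    combination is an assumed integration rule, not the paper's.
    """
    return {
        name: behavior_weight * cosine(behavior_vec, emb)
        + (1.0 - behavior_weight) * cosine(psych_vec, emb)
        for name, emb in library.items()
    }


def top_competency(scores):
    """Name of the highest-scoring competency."""
    return max(scores, key=scores.get)


# Toy library with hand-made 2-D embeddings (purely illustrative).
library = {"teamwork": [1.0, 0.0], "analysis": [0.0, 1.0]}
scores = score_competencies([1.0, 0.0], [0.0, 1.0], library, behavior_weight=0.8)
print(top_competency(scores))
```

With real data the embeddings would come from a text encoder and the weight would be fitted against an evaluation signal rather than chosen by hand.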

cs.CL / 34 / 2602.13102

Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

朝向可解释的语言能力评估模型：预测爱沙尼亚学习者文本的CEFR水平

Allkivi, Kais

Abstract

Using NLP to analyze authentic learner language helps to build automated assessment and feedback tools. It also offers new and extensive insights into the development of second language production. However, there is a lack of research explicitly combining these aspects. This study aimed to classify Estonian proficiency examination writings (levels A2-C1), assuming that careful feature selection can lead to more explainable and generalizable machine learning models for language testing. Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors associated with increasing complexity and correctness, rather than the writing task. Such lexical, morphological, surface, and error features were used to train classification models, which were compared to models that also allowed for other features. The pre-selected features yielded a similar test accuracy but reduced variation in the classification of different text types. The best classifiers achieved an accuracy of around 0.9. Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.

Chinese Translation

使用自然语言处理（NLP）分析真实学习者语言有助于构建自动化评估和反馈工具。这也为第二语言产出的发展提供了新的广泛见解。然而，明确结合这些方面的研究仍然不足。本研究旨在对爱沙尼亚语能力考试写作（A2-C1级别）进行分类，假设仔细的特征选择可以导致更具可解释性和可推广性的语言测试机器学习模型。分析了训练数据的各种语言特性，以识别与复杂性和正确性增加相关的能力预测因子，而不是写作任务本身。这些词汇、形态、表面和错误特征被用于训练分类模型，并与允许其他特征的模型进行了比较。预先选择的特征在测试准确性上产生了相似的结果，但减少了对不同文本类型分类的变异性。最佳分类器的准确率达到了约0.9。对早期考试样本的额外评估显示，写作在7-10年期间变得更加复杂，而某些特征集的准确率仍达到了0.8。这些结果已被应用于爱沙尼亚开源语言学习环境的写作评估模块中。

cs.CL / 35 / 2602.13110

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

SCOPE：选择性保形优化的成对大语言模型评判

Badshah, Sher, Emami, Ali, Sajjad, Hassan

Abstract

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.

Chinese Translation

大型语言模型（LLMs）越来越多地被用作评判者，以替代成本高昂的人类偏好标签进行成对评估。尽管其实用性较强，但LLM评判者仍然容易出现误校准和系统性偏差。本文提出了SCOPE（选择性保形优化的成对评估），这是一个具有有限样本统计保证的选择性成对评判框架。在可交换性假设下，SCOPE校准了一个接受阈值，使得非弃权判断的错误率最多为用户指定的水平$\alpha$。为了为SCOPE提供一个无偏的不确定性信号，我们引入了双向偏好熵（Bidirectional Preference Entropy，BPE），该方法在两种响应位置下查询评判者，聚合隐含的偏好概率以强制对响应顺序的不变性，并将聚合的概率转换为基于熵的不确定性评分。在MT-Bench、RewardBench和Chatbot Arena上，BPE在标准置信度代理的基础上提高了不确定性质量，提供了更强的选择信号，使得SCOPE能够始终满足目标风险水平，同时在评判者规模上保持良好的覆盖率。特别是在$\alpha = 0.10$时，SCOPE在所有基准和评判者规模上始终满足风险界限（经验风险约为$0.097$到$0.099$），同时保持了可观的覆盖率，在RewardBench上使用Qwen-14B达到$0.89$，在RewardBench上使用Qwen-32B达到$0.98$。与简单基线相比，SCOPE在MT-Bench上使用Qwen-7B在相同目标风险约束下接受的判断数量最多增加了$2.4\times$，这表明BPE使得基于LLM的评估更加可靠且覆盖面更广。
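
To make the two ingredients in the SCOPE abstract above more concrete, here is a minimal sketch of an order-invariant entropy score and a threshold calibration on held-out judgments. Several things are assumptions rather than the paper's method: averaging the two position-swapped probabilities is one plausible aggregation rule, and the `(errors + 1) / (accepted + 1)` check is a simple conservative stand-in that does not carry SCOPE's finite-sample guarantee. The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bidirectional_preference_entropy(p_a_shown_first, p_a_shown_second):
    """Uncertainty score from two judge queries with swapped response order.

    Both arguments are the judge's probability that response A is better.
    Averaging them (an assumed aggregation rule) makes the score independent
    of which ordering was queried first.
    """
    p = 0.5 * (p_a_shown_first + p_a_shown_second)
    return binary_entropy(p)


def calibrate_threshold(scores, correct, alpha):
    """Largest uncertainty threshold whose accepted calibration judgments
    have a conservative error estimate of at most alpha, or None.

    scores:  uncertainty per calibration judgment (lower = more confident)
    correct: whether the judge agreed with the human label
    """
    pairs = sorted(zip(scores, correct))
    best = None
    errors = 0
    for k, (score, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        # Evaluate only at the last item of a run of tied scores.
        if k < len(pairs) and pairs[k][0] == score:
            continue
        if (errors + 1) / (k + 1) <= alpha:
            best = score
    return best


def accept(score, threshold):
    """Keep a judgment only if a threshold exists and the score is within it."""
    return threshold is not None and score <= threshold
```

A judge that flips its preference when the responses are swapped (for example 0.9 and then 0.1 for the same response) averages to 0.5 and receives the maximum score of one bit, so position-biased judgments are the first to be abstained on.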

cs.CL / 36 / 2602.13123

From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media

从防晒霜到软屏障：分析已发表写作和社交媒体中新词创造的相关因素

Ryskina, Maria, Gormley, Matthew R., Mahowald, Kyle, Mortensen, David R., Berg-Kirkpatrick, Taylor, Kulkarni, Vivek

Abstract

Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts (Ryskina et al., 2020, arXiv:2001.07740). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.

Chinese Translation

活语言受到一系列相互冲突的内部和外部进化压力的影响。虽然这些压力在语言和文化之间有一些普遍性，但其他压力则取决于社会和对话的背景：报纸中的语言使用受到的约束与社交媒体中的语言使用截然不同。之前对英语新词出现（neology）的分布语义研究通过分析一个主要由历史出版文本组成的语料库，识别了与新词创造相关的两个因素（Ryskina et al., 2020, arXiv:2001.07740）。我们将这一方法扩展到上下文嵌入（contextual embeddings）以及静态嵌入，并将其应用于新的Twitter帖子语料库，结果显示这两个领域的发现是一致的，尽管话题流行度增长因素在Twitter上的新词创造中可能贡献较小。我们假设这种差异可以通过这两个领域偏好不同的新词形成机制来解释。

cs.CL / 37 / 2602.13139

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

OpenLID-v3：提升紧密相关语言识别的精度——经验报告

Fedorova, Mariia, Arefyev, Nikolay, Buljan, Maja, Helcl, Jindřich, Oepen, Stephan, Rønningstad, Egil, Scherrer, Yves

Abstract

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

Chinese Translation

语言识别（LID）是从网络数据构建高质量多语言数据集的重要步骤。现有的语言识别工具（如 OpenLID 或 GlotLID）在识别紧密相关语言和区分有效自然语言与噪声方面常常面临挑战，这会污染语言特定子集，尤其是在低资源语言的情况下。在本研究中，我们通过增加更多训练数据、合并有问题的语言变体集群以及引入标记噪声的特殊标签来扩展 OpenLID 分类器。我们将这个扩展系统称为 OpenLID-v3，并在多个基准上与 GlotLID 进行评估。在开发过程中，我们重点关注三组紧密相关的语言（波斯尼亚语、克罗地亚语和塞尔维亚语；意大利北部和法国南部的罗曼语变体；以及斯堪的纳维亚语言），并在现有数据集不足的情况下贡献新的评估数据集。我们发现集成方法提高了精度，但也显著降低了低资源语言的覆盖率。OpenLID-v3 可在 https://huggingface.co/HPLT/OpenLID-v3 获取。

cs.CL / 38 / 2602.13194

Semantic Chunking and the Entropy of Natural Language

语义分块与自然语言的熵

Zhong, Weishun, Sivan, Doron, Can, Tankut, Katkov, Mikhail, Tsodyks, Misha

Abstract

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.

Chinese Translation

印刷英语的熵率被著名地估计为每个字符约一比特，这一基准是现代大型语言模型（LLMs）最近才接近的。这个熵率意味着英语相对于随机文本预期的每个字符五比特，几乎包含80%的冗余。我们提出了一种统计模型，试图捕捉自然语言复杂的多尺度结构，为这种冗余水平提供了第一性原理的解释。我们的模型描述了一种自相似地将文本分割为语义连贯块的过程，直到单词级别。文本的语义结构可以层次化地分解，从而允许进行分析处理。与现代LLMs和开放数据集的数值实验表明，我们的模型定量捕捉了真实文本在不同语义层次上的结构。我们的模型预测的熵率与印刷英语的估计熵率一致。此外，我们的理论进一步揭示，自然语言的熵率并不是固定的，而应该随着语料库的语义复杂性系统性地增加，这一复杂性由我们模型中唯一的自由参数所捕捉。
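
The redundancy figure quoted in the abstract above follows from a one-line calculation: redundancy is one minus the ratio of the actual entropy rate to the entropy rate of uniformly random text. The sketch below reproduces that arithmetic. The helper name `redundancy` and the 27-symbol alphabet (26 letters plus space) are illustrative assumptions; the abstract itself only states the rounded values of one and five bits per character.

```python
import math


def redundancy(entropy_rate_bits, max_bits):
    """Fraction of each character that is predictable: 1 - H / H_max."""
    return 1.0 - entropy_rate_bits / max_bits


# Rounded figures from the abstract: about 1 bit vs 5 bits per character.
print(redundancy(1.0, 5.0))

# Assumed 27-symbol alphabet (26 letters plus space), uniformly random.
uniform_bits = math.log2(27)
print(uniform_bits, redundancy(1.0, uniform_bits))
```

With the rounded five bits the redundancy is exactly 80 percent; with the 27-symbol assumption it comes out slightly lower, which is consistent with the abstract's "nearly 80 percent".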

arXiv Papers

LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning

ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Schur-MI: Fast Mutual Information for Robotic Information Gathering

LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation

Predicting Dynamic Map States from Limited Field-of-View Sensor Data

Zero-Shot Adaptation to Robot Structural Damage via Natural Language-Informed Kinodynamics Modeling

Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning

MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery

Control Barrier Functions with Audio Risk Awareness for Robot Safe Navigation on Construction Sites

An Autonomous, End-to-End, Convex-Based Framework for Close-Range Rendezvous Trajectory Design and Guidance with Hardware Testbed Validation

Gradient-Enhanced Partitioned Gaussian Processes for Real-Time Quadrotor Dynamics Modeling

Composable Model-Free RL for Navigation with Input-Affine Systems

Monocular Reconstruction of Neural Tactile Fields

CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning

Eva-Tracker: ESDF-update-free, Visibility-aware Planning with Target Reacquisition for Robust Aerial Tracking

Hemispherical Angular Power Mapping of Installed mmWave Radar Modules Under Realistic Deployment Constraints

PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People

RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

PMG: Parameterized Motion Generator for Human-like Locomotion Control

Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

SignScene: Visual Sign Grounding for Mapless Navigation

ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

Constrained PSO Six-Parameter Fuzzy PID Tuning Method for Balanced Optimization of Depth Tracking Performance in Underwater Vehicles

TRANS: Terrain-aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions

Scaling Single Human Demonstrations for Imitation Learning using Generative Foundational Models

SafeFlowMPC: Predictive and Safe Trajectory Planning for Robot Manipulators with Learning-based Policies

SKYSURF: A Self-learning Framework for Persistent Surveillance using Cooperative Aerial Gliders

Adding internal audio sensing to internal vision enables human-like in-hand fabric recognition with soft robotic fingertips

INHerit-SG: Incremental Hierarchical Semantic Scene Graphs with RAG-Style Retrieval

Learning Native Continuation for Action Chunking Flow Policies

How Swarms Differ: Challenges in Collective Behaviour Comparison

SENSE-STEP: Learning Sim-to-Real Locomotion for a Sensory-Enabled Soft Quadruped Robot

Agentic AI for Robot Control: Flexible but still Fragile

UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph

Temporally-Sampled Efficiently Adaptive State Lattices for Autonomous Ground Robot Navigation in Partially Observed Environments

Human Emotion-Mediated Soft Robotic Arts: Exploring the Intersection of Human Emotions, Soft Robotics and Arts

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models

What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction

Semantic-aware Adversarial Fine-tuning for CLIP

A Lightweight and Explainable DenseNet-121 Framework for Grape Leaf Disease Classification

Human-Like Coarse Object Representations in Vision Models

Insertion Network for Image Sequence Correspondence

Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models

Matching of SAR and optical images based on transformation to shared modality

LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space

Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

Multi-Task Learning with Additive U-Net for Image Denoising and Classification

CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models

Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

Channel-Aware Probing for Multi-Channel Imaging

ART3mis: Ray-Based Textual Annotation on 3D Cultural Objects

VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences

Synthetic Craquelure Generation for Unsupervised Painting Restoration

ReBA-Pred-Net: Weakly-Supervised Regional Brain Age Prediction on MRI

Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

Towards complete digital twins in cultural heritage with ART3mis 3D artifacts annotator

PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction

Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads