arXiv Daily Digest

314

Papers

KG-ASG: Collision-Knowledge-Guided Closed-Loop Adversarial Scenario Generation With Primary-Support Attribution

KG-ASG：基于碰撞知识的闭环对抗场景生成框架与主要支持归因

Wang, Cheng, Xiong, Chen, Wang, Ziwen, Zhou, Yuchen, Liu, Qiang

Abstract

Safety validation of autonomous driving systems requires high-risk scenario coverage, clear collision semantics, executable trajectories, and attributable multi-vehicle interactions. Existing safety-critical scenario generation methods often rely on low-level trajectory perturbations, collision-proxy optimization, or single-adversary search, which may produce adversarial samples with ambiguous collision causes or uncontrolled multi-vehicle collisions. This paper proposes KG-ASG, a collision-knowledge-guided closed-loop adversarial scenario generation framework with primary-support attribution. KG-ASG constructs a structured collision knowledge base and trains a lightweight Collision Expert to infer the target collision mode, the unique primary adversary, support vehicles, and their interaction roles. Guided by this semantic prior, multi-vehicle adversarial generation is formulated as a primary-support process, where the primary adversary induces the main conflict and support vehicles shape the surrounding risk structure without becoming additional colliders. Rule, physical, interaction-safety, and single-collider constraints are imposed as hard gates to filter non-executable samples. To handle reactive ego behaviors, planner-controller feedback is further used for failure diagnosis, candidate re-ranking, and terminal refinement. Experiments on WOMD scenarios reconstructed in MetaDrive show that KG-ASG achieves strong adversarial effectiveness while improving Valid Primary Attack, reducing multi-collision, and obtaining closed-loop recovery gains under IDM, Cruise, and Expert controllers. These results demonstrate that collision-knowledge guidance and primary-support single-collider reasoning improve adversarial effectiveness, interpretability, and executability for autonomous driving safety validation.

Chinese Translation

自动驾驶系统的安全验证需要高风险场景覆盖、清晰的碰撞语义、可执行的轨迹以及可归因的多车辆交互。现有的安全关键场景生成方法通常依赖于低级轨迹扰动、碰撞代理优化或单一对手搜索，这可能导致生成的对抗样本具有模糊的碰撞原因或无法控制的多车辆碰撞。本文提出了KG-ASG，一种基于碰撞知识的闭环对抗场景生成框架，具有主要支持归因。KG-ASG构建了一个结构化的碰撞知识库，并训练了一个轻量级的碰撞专家，以推断目标碰撞模式、唯一的主要对手、支持车辆及其交互角色。在这种语义先验的指导下，多车辆对抗生成被形式化为一个主要支持过程，其中主要对手引发主要冲突，而支持车辆则塑造周围的风险结构而不成为额外的碰撞者。规则、物理、交互安全和单一碰撞者约束作为硬性门限被施加，以过滤不可执行的样本。为了处理反应性自我行为，进一步使用规划者-控制器反馈进行故障诊断、候选重排序和终端优化。在MetaDrive中重建的WOMD场景上的实验表明，KG-ASG在提高有效主要攻击、减少多重碰撞和在IDM、巡航和专家控制器下获得闭环恢复收益的同时，实现了强大的对抗有效性。这些结果表明，碰撞知识指导和主要支持单一碰撞者推理提高了自动驾驶安全验证的对抗有效性、可解释性和可执行性。

View on arXiv Download PDF AI Translation

cs.RO / 2 / 2605.18921

Geo-Data-Driven HD Map Generation Workflow with Integrated Reference-Free Constraint-Based Verification

基于地理数据驱动的高清地图生成工作流程与集成的无参考约束验证

He, Ruidi, Tiwari, Vaibhav, Al-Ghobari, Mohanad, Zhang, Meng, Rausch, Andreas

Abstract

High-definition (HD) maps are core artifacts for automated driving systems, but their generation commonly relies on sensor-intensive mobile mapping campaigns, while quality assessment often depends on high-precision reference data. These dependencies make HD map engineering costly and difficult to apply in settings where specialised measurement data or independently measured reference maps are unavailable. This paper presents an engineering-oriented geo-data-driven workflow for HD map generation with integrated representation-level verification. The workflow uses openly available geo-engineering datasets as the primary input source and transforms them into lane-level HD map representations of existing road environments through explicit intermediate representations and processing stages. To assess the generated representations without external reference maps, the workflow integrates executable constraint-based verification into the engineering process. Selected constraints are derived from specifications relevant to automated driving and road-design guidelines. They are evaluated directly on the generated lanelet-based representation to detect geometric, topological, and elevation-related inconsistencies. The workflow is evaluated using real-world shapefile-based road-network data from four cities in Lower Saxony, Germany, and controlled defect-injection scenarios. The real-world evaluation shows that the generated map representations satisfy the selected constraints in the evaluated scenarios, while the defect-injection study demonstrates complete detection of the considered defect types without observed false positives. The results indicate that geo-data-driven HD map generation with integrated executable verification can provide a modular and inspectable complement to sensor-intensive mapping workflows under reduced sensing and reference-data availability.

Chinese Translation

高清（HD）地图是自动驾驶系统的核心构件，但其生成通常依赖于传感器密集型的移动测绘活动，而质量评估往往依赖于高精度的参考数据。这些依赖性使得HD地图工程成本高昂，并且在缺乏专业测量数据或独立测量的参考地图的环境中难以应用。本文提出了一种面向工程的基于地理数据驱动的HD地图生成工作流程，并集成了表示层级的验证。该工作流程以公开可用的地理工程数据集作为主要输入源，通过明确的中间表示和处理阶段，将其转化为现有道路环境的车道级HD地图表示。为了在没有外部参考地图的情况下评估生成的表示，该工作流程将可执行的基于约束的验证集成到工程过程中。所选约束源自与自动驾驶和道路设计指南相关的规范，直接在生成的基于车道单元的表示上进行评估，以检测几何、拓扑和高程相关的不一致性。该工作流程使用来自德国下萨克森州四个城市的真实世界形状文件基础道路网络数据和受控缺陷注入场景进行了评估。真实世界的评估表明，在评估场景中生成的地图表示满足所选约束，而缺陷注入研究则展示了对考虑的缺陷类型的完全检测，且未观察到假阳性。结果表明，集成可执行验证的基于地理数据驱动的HD地图生成可以为在传感和参考数据可用性降低的情况下，提供对传感器密集型测绘工作流程的模块化和可检验的补充。

View on arXiv Download PDF AI Translation

cs.RO / 3 / 2605.19009

Adversarial Stress Testing of SPARK Humanoid Safety Filters

SPARK类人安全过滤器的对抗性压力测试

Ghosh, Saurav, Sow, Abdou, Zhang, Luke

Abstract

Humanoid robots are difficult to deploy safely because they have high-dimensional bodies, many collision constraints, and must operate near people and obstacles. Safety filters help by modifying a nominal control action when it may violate collision-avoidance constraints. Still, nominal benchmark scores do not fully show how these filters behave in harder environments. In this work, we study the robustness of SPARK humanoid safety filters through replication and stress testing. We replicate the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluate RSSA, RSSS, SSA, CBF, PFM, and SMA under controlled random seeds. We also built a post-processing pipeline that converts raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics. Our results show that some methods track the goal more closely, while others reduce collision steps more effectively. The stress tests further indicate that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information. These findings suggest that humanoid autonomy should be evaluated beyond nominal performance, using metrics that expose failure modes before deployment.

Chinese Translation

类人机器人因其高维度的身体、众多的碰撞约束以及必须在人和障碍物附近操作而难以安全部署。安全过滤器通过在可能违反避碰约束时修改名义控制动作来提供帮助。然而，名义基准分数并不能完全反映这些过滤器在更复杂环境中的表现。在本研究中，我们通过复制和压力测试研究了SPARK类人安全过滤器的鲁棒性。我们在MuJoCo中复制了SPARK基准案例G1SportMode_D1_WG_SO_v1，并在受控随机种子下评估了RSSA、RSSS、SSA、CBF、PFM和SMA。我们还建立了一个后处理管道，将原始SPARK日志转换为目标跟踪、最小距离和碰撞步骤指标。我们的结果表明，一些方法更紧密地跟踪目标，而另一些方法则更有效地减少碰撞步骤。压力测试进一步表明，在障碍物拥挤、噪声距离估计和延迟障碍物信息的情况下，安全行为可能会发生变化。这些发现表明，类人自主性应超越名义性能进行评估，使用能够在部署前揭示失败模式的指标。

View on arXiv Download PDF AI Translation

cs.RO / 4 / 2605.19029

Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation

通过 Stein 变分推断实现的分布鲁棒控制用于接触丰富的操控

Sathyanarayan, Hrishikesh, Vantilborgh, Victor, Ravichandar, Harish, Lefebvre, Tom, Abraham, Ian

Abstract

Reliable robotic manipulation requires control policies that can accurately represent and adapt to uncertainty arising from contact-rich interactions. Modern data-driven methods mitigate uncertainty through large-scale training and computation, and degrade significantly in performance with limited number of training samples. By contrast, classical model-based controllers are computationally efficient and reliable, but their limited ability to represent task-relevant uncertainty can hinder performance in contact-rich interactions. In this work, we propose to expand the capabilities of model-based manipulation control through more flexible uncertainty modeling that retains performance while exactly adapting to uncertainty. Our approach casts the manipulation problem as a distributionally robust control optimization and proposes a novel deterministic formulation based on Stein variational inference that preserves performance while explicitly modeling task-sensitive parameter uncertainty. As a result, the derived controllers are more aware of task sensitivities to uncertainty, yielding high reliability without compromising performance. Experimental results demonstrate up to 3$\times$ improved robustness across a range of contact-rich manipulation tasks under broad parametric uncertainty, outperforming existing model-based control methods.

Chinese Translation

可靠的机器人操控需要能够准确表示和适应来自接触丰富交互的不确定性的控制策略。现代数据驱动的方法通过大规模训练和计算来减轻不确定性，但在训练样本数量有限时性能显著下降。相比之下，经典的基于模型的控制器计算效率高且可靠，但其在表示与任务相关的不确定性方面的能力有限，可能会妨碍在接触丰富交互中的表现。在本研究中，我们提出通过更灵活的不确定性建模来扩展基于模型的操控控制能力，该方法在准确适应不确定性的同时保持性能。我们的方法将操控问题视为分布鲁棒控制优化，并提出了一种基于 Stein 变分推断的新型确定性公式，该公式在显式建模任务敏感参数不确定性的同时保持性能。因此，所推导的控制器对不确定性的任务敏感性有更高的意识，从而在不妥协性能的情况下实现高可靠性。实验结果表明，在广泛的参数不确定性下，针对一系列接触丰富的操控任务，鲁棒性提高了多达 3 倍，超越了现有的基于模型的控制方法。

View on arXiv Download PDF AI Translation

cs.RO / 5 / 2605.19033

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

RLFTSim：通过强化学习微调实现现实且可控的多智能体交通仿真

Ahmadi, Ehsan, Schofield, Hunter, Khamidehi, Behzad, Arasteh, Fazel, Shan, Jinjun, Mou, Lili, Bai, Dongfeng, Rezaee, Kasra

Abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

Chinese Translation

监督开放式训练已被广泛应用于交通仿真模型的训练；然而，它未能捕捉到复杂驾驶场景中固有的动态多智能体交互。我们提出了RLFTSim，一个基于强化学习的微调框架，通过将仿真器的输出与真实世界数据分布对齐，从而增强场景的现实性，并提供了一种在场景生成中提炼目标条件可控性的方法。我们在一个预训练的仿真模型基础上实例化了RLFTSim，设计了一种平衡真实度和可控性的奖励，并在Waymo开放运动数据集上进行了全面实验。我们的结果显示了现实性的改善，达到了最先进的性能。与其他基于启发式搜索的微调方法相比，由于提出了一种低方差和密集的奖励信号，RLFTSim所需的样本显著减少，并且它通过设计直接解决了现实性对齐问题。我们还展示了我们的方法在通过目标条件提炼交通仿真可控性方面的有效性。项目页面可访问 https://ehsan-ami.github.io/rlftsim。

View on arXiv Download PDF AI Translation

cs.RO / 6 / 2605.19038

Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

利用时空逻辑指导神经符号场景生成

Bonin, Lorenzo, Giacomarra, Francesco, Bortolussi, Luca, Deshmukh, Jyotirmoy V., Cairoli, Francesca

Abstract

The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real-world traffic scenes -- a brute-force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety-critical edge cases essential for validating real-world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen synergistically combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient-based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, moving beyond the limitations of brute-force data collection.

Chinese Translation

自主驾驶（AD）技术的快速发展已经超越了稳健安全评估方法的进展。传统测试依赖于将AD系统暴露于大量真实交通场景中——这是一种成本高昂且在统计上无法有效捕捉稀有的、对安全至关重要的边缘案例的粗暴方法，而这些边缘案例对于验证现实世界的稳健性至关重要。为了解决这一根本性限制，我们提出了STRELGen，一个用于目标生成安全关键驾驶场景的可扩展框架。STRELGen协同结合了多智能体轨迹生成扩散模型（DM）与时空逻辑（STREL）规范，后者通过高度可解释的形式编码复杂的安全性和现实性属性。关键在于，这些规范的满足程度监控是可微分的，使得基于梯度的搜索成为可能。在推理时，我们直接在DM潜在空间上进行优化，以最大化STREL公式的满足度。最终结果是高效生成高度可信且安全关键的多智能体场景，这些场景位于学习到的数据分布内。因此，STRELGen提供了一种灵活、可解释且强大的工具，用于对自主驾驶系统进行压力测试，超越了粗暴数据收集的局限性。

View on arXiv Download PDF AI Translation

cs.RO / 7 / 2605.19104

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

用于腱驱动连续机器人设计空间代理建模的神经算子

Frieden, Branden, Ferguson, James M., Kuntz, Alan, Shankar, Varun

Abstract

Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

Chinese Translation

连续机器人能够在受限环境中实现灵巧操作，但需要准确且高效的模型以进行实时操作和控制。传统的基于物理的模型计算成本高昂，并且可能由于未建模效应而导致不准确，而当前的基于学习的方法通常在超出其训练特定机器人的范围时泛化能力较差。我们提出了一种将腱驱动连续机器人的代理建模形式化为算子学习问题，该问题将机器人设计参数和腱驱动输入映射到相应的配置。这种形式化使得单个训练模型能够在大类机器人设计中进行泛化。我们开发了四种新颖的神经算子架构——两种基于深度算子网络（Deep Operator Networks，DeepONets），两种基于傅里叶神经算子（Fourier Neural Operators，FNOs）——并在仿真数据上进行训练以预测机器人配置。所有架构均实现了良好的准确性，同时允许在设计之间快速且准确地进行泛化。我们的结果表明，算子学习为设计空间中的连续机器人力学提供了一种有效且可泛化的代理，能够快速建模以支持控制、规划和设计优化在外科和工业应用中的实施。

View on arXiv Download PDF AI Translation

cs.RO / 8 / 2605.19120

CosFly: Plan in the Matrix, Fly in the World

CosFly：在矩阵中规划，在世界中飞行

Chen, Hanxuan, Wang, Xiangyue, Cheng, Songsheng, Ren, Ruilong, Zheng, Jie, Yuan, Shuai, Zeng, Tianle, Guo, Hanzhong, Li, Binbo, Wang, Kangli, Pei, Ji

Abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.

Chinese Translation

我们提出了CosFly，一个用于空中跟踪的盒式结构规划和多模态仿真管道，以及CosFly-Track，一个大型无人机（UAV）数据集，涵盖城市中心、高速公路、乡村景观、森林和沿海城镇等多样环境中的动态目标跟踪。在我们当前基于CARLA的实现中，CosFly提供了一个模块化的7步构建管道，将复杂的3D世界转换为结构化障碍物表示以进行规划，然后将生成的轨迹投影回多模态传感器数据中——包括RGB图像、高精度深度图和语义分割掩膜——并配以自然语言导航指令。一个关键特性是支持可配置的固定视场（FOV）缩放级别（每条轨迹绘制一个FOV设置并在整个过程中保持不变），使得通过相机内在参数调整模拟各种焦距成为可能。该管道涵盖了从3D地图导出、网格简化、行人和无人机轨迹规划、多模态渲染（带有6自由度姿态注释）、质量检查到师生标题生成的完整工作流程。我们分析了两种用于空中目标跟踪的轨迹规划范式：一种是传统的两阶段管道，包括前端候选生成和后端优化，另一种是直接基于梯度的公式，优化多个跟踪约束于单一目标。公开的CosFly-Track发布包含250条经过验证的轨迹和大约100,000张带有完整6自由度无人机姿态注释（位置x、y、z和方向偏航、俯仰、滚转）的渲染图像。该管道和数据集共同建立了一个可扩展的基础，支持空地协作研究，促进动态目标跟踪、无人机导航和多模态感知在多样环境中的应用。

View on arXiv Download PDF AI Translation

cs.RO / 9 / 2605.19136

Automatically Improving Simulation Physics for Articulated Objects

自动改善关节物体的仿真物理

Pham, Anh-Quan

Abstract

Simulation is a central tool for scalable robot learning, but its effectiveness depends on the quality of object assets. While modern 3D datasets provide rich geometric and kinematic representations, they typically lack the physical properties required for stable and realistic interaction, requiring significant manual effort to construct simulation-ready articulated objects. In this thesis, we introduce interaction-readiness, which characterizes whether an object can be reliably simulated under manipulation. We propose a quantitative evaluation framework that decomposes interaction-readiness into measurable components, enabling systematic analysis of object quality and revealing failure modes not captured by conventional evaluation. We further present a multi-modal, simulator-in-the-loop approach for generating interaction-ready articulated objects from incomplete 3D assets. The method integrates geometric, visual, and semantic information to infer physical properties and refines them through iterative simulator feedback to improve physical consistency. Experiments across diverse articulated objects and manipulation tasks show that object quality directly impacts simulation stability, interaction behavior, and policy performance. Objects refined by our method exhibit more stable and realistic dynamics, enabling more reliable downstream learning and evaluation. Overall, this thesis demonstrates the importance of physical realism for articulated objects in simulation and introduces a practical multi-modal refinement approach, guided by simulator feedback, for constructing such objects at scale.

Chinese Translation

仿真是可扩展机器人学习的核心工具，但其有效性依赖于物体资产的质量。尽管现代3D数据集提供了丰富的几何和运动学表示，但它们通常缺乏稳定和真实交互所需的物理属性，这需要大量的手动工作来构建适合仿真的关节物体。在本论文中，我们引入了交互准备性（interaction-readiness），用于表征一个物体在操作下是否能够可靠地被仿真。我们提出了一个定量评估框架，将交互准备性分解为可测量的组成部分，从而能够系统地分析物体质量，并揭示传统评估未能捕捉的失败模式。我们进一步提出了一种多模态、仿真器在环（simulator-in-the-loop）的方法，从不完整的3D资产中生成交互准备的关节物体。该方法整合了几何、视觉和语义信息，以推断物理属性，并通过迭代的仿真器反馈进行优化，以提高物理一致性。在多种关节物体和操作任务上的实验表明，物体质量直接影响仿真稳定性、交互行为和策略性能。经过我们方法优化的物体表现出更稳定和真实的动态特性，从而使下游学习和评估更加可靠。总体而言，本论文展示了物理现实主义在仿真中对关节物体的重要性，并引入了一种实用的多模态优化方法，该方法在仿真器反馈的指导下，能够大规模构建此类物体。

View on arXiv Download PDF AI Translation

cs.RO / 10 / 2605.19138

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

COBALT：通过基于云的智能手机远程操作众包机器人学习

Agarwal, Ayush, Gandhi, Ansh, Collins, Jeremy A., Rayyan, Omar, Sarswat, Aryan, Koushik, Ranjani, Moghani, Masoud, Mandlekar, Ajay, Garg, Animesh

Abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An inmemory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit \href{https://cobalt-teleop.github.io/}{cobalt-teleop.github.io} for more details.

Chinese Translation

大规模、高质量示范数据的稀缺仍然是扩展机器人操作模仿学习的瓶颈。我们提出了COBALT，一个旨在在模拟和现实世界中大规模民主化机器人学习的远程操作平台。通过利用向量化环境，我们的可扩展、负载均衡基础设施支持多个用户在单个GPU上并发进行远程操作，从而显著降低了远程操作成本。操作员可以使用常见设备（包括单个或双手机、虚拟现实头盔、3D鼠标和键盘）从地球上的几乎任何地方连接。内存数据缓存和高效的视频流传输保持控制和渲染的同步，使得在每个GPU上支持多达8个并发用户以20 Hz的频率运行，端到端延迟低于100毫秒。我们还展示了支持256个模拟客户端在8个GPU上稳定运行的能力，强调了系统在硬件和单个服务器之间的扩展能力。我们进行了一项全面的用户研究，表明基于手机的远程操作的性能与专用硬件相当或更好，从而实现更快、更符合人体工程学的数据收集。为了确保数据质量，COBALT记录了一系列实时指标以自动过滤次优示范。我们进一步证明，结构化的用户培训课程显著提高了数据收集的质量。在用户研究的启发下，我们众包收集了一个大规模、高质量的试点数据集，包含7500多个示范（超过50小时），这些数据是在五天内通过智能手机在九个国家收集的。我们通过训练最先进的模仿学习算法验证了数据集的质量。有关更多详细信息，请访问 [cobalt-teleop.github.io](https://cobalt-teleop.github.io/)。

View on arXiv Download PDF AI Translation

cs.RO / 11 / 2605.19166

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

基于强化学习的四旋翼控制性能调优的启发式方法：奖励设计与终止条件

Suarez, Fausto Mauricio Lagos, Saradagi, Akshit, Sumathy, Vidya, Nikolakopoulos, George

Abstract

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based Quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2\% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.

Chinese Translation

基于强化学习（RL）的四旋翼控制策略在快速导航于复杂环境和无人机竞速等任务中取得了显著的表现，这些任务的重点在于速度和灵活性。然而，在基础设施检查等若干应用中，实现精确、可控的机动并具备可调性能至关重要。本文提出了一种新颖的启发式方法，通过奖励设计和终止条件实现基于RL的四旋翼控制的可调性能。我们提出了一种新颖的奖励结构，包含双带宽指数，能够在设定点跟踪中实现基线临界阻尼响应，并具有较低的稳态误差。在使用近端策略优化（PPO）算法进行训练时，结合剧集截断条件，所需性能在600万时间步内以样本高效的方式实现。为了围绕基线行为调节性能，我们提出了直观的启发式规则，以调整奖励权重和指数系数，从而实现更快（类似特技）和更慢（类似检查）的稳定时间性能，同时保持基线的临界阻尼响应和约2%的稳态误差。我们在100次试验中评估了三种RL策略（基线、特技和检查），并展示了从随机初始条件出发在位置和偏航跟踪中的准确和可调性能，从而证明了所提启发式方法的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 12 / 2605.19202

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

基于强化学习的四旋翼控制在树冠下森林环境中的空中检查行为

Suarez, Fausto Mauricio Lagos, Saradagi, Akshit, Sumathy, Vidya, Sankaranarayanan, Viswa Narayanan, Nikolakopoulos, George

Abstract

This paper addresses the problem of using a deep Reinforcement Learning (RL)-based low-level Quadrotor controller within an autonomous Quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, the article presents an end-to-end (mapping states to RPMs) Quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, this article utilizes a higher navigation guidance layer comprising of a Traveling Salesman Problem planner (TSP) and a Rapidly-exploring Random Tree Star (RRT*) planner. Over a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, collision-free paths that respect the tracking limitations of the lower end-to-end RL policy are generated by an RRT* planner. Through five target inspection scenarios, this article demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.

Chinese Translation

本文解决了在树冠下森林环境中进行空中检查任务时，如何在自主四旋翼导航系统中使用基于深度强化学习（RL）的低级四旋翼控制器的问题。具体而言，文章提出了一种端到端的四旋翼控制策略（将状态映射到转速），该策略实现了检查视角姿态跟踪（同时跟踪位置和偏航参考），这对于各种目标检查行为和森林中的点对点导航至关重要。为了确保在长距离任务中安全可靠地部署端到端的RL控制器，本文利用了一个更高层次的导航引导层，该层由旅行商问题（TSP）规划器和快速扩展随机树星（RRT*）规划器组成。在已知的森林地图和一组用户指定的检查区域上，TSP规划器找到最佳访问顺序。在两个目标区域之间，RRT*规划器生成符合低级端到端RL策略跟踪限制的无碰撞路径。通过五个目标检查场景，本文展示了基于RL的电机级稳定控制器在导航引导层的支持下，可以有效地作为树冠下森林检查任务的低级检查执行模块。

View on arXiv Download PDF AI Translation

cs.RO / 13 / 2605.19206

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

CLUE：通过利用统一语义图谱自适应优先考虑上下文线索以实现有效的零-shot目标导航

Kim, Taeyun, Choi, Alvin Jinsung, Hong, Dasol, Myung, Hyun

Abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.

Chinese Translation

零-shot目标导航（ZSON）是机器人领域中的一个挑战性问题，要求对语言和视觉观察有全面的理解。房间和物体的上下文线索至关重要，但它们的相对重要性取决于目标：某些物体与特定房间类型紧密相关，而其他物体则更容易通过附近的共存物体进行预测。现有方法忽视了这一区别，导致探索效率低下且不准确。我们提出了CLUE，一个新颖的导航框架，通过利用从离线大型语言模型（LLM）提取的常识知识，自适应地平衡上下文房间和物体的使用。通过使用LLM估计目标与房间类型的关联，代理优先考虑可预测物体的房间线索以及与房间关联较弱的物体的物体线索。我们的框架构建了一个统一的语义价值图，该图整合了两种类型的上下文信息，并根据目标的模糊性自适应加权以指导探索。结合多视角验证和基于上下文线索的信息探索策略，CLUE实现了稳健且高效的导航。在模拟和现实世界部署中的大量实验表明，我们的方法在成功率（SR）和路径长度加权成功率（SPL）方面始终优于最先进的基线，证明了其在现实世界导航任务中的有效性和实用性。

View on arXiv Download PDF AI Translation

cs.RO / 14 / 2605.19209

Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

图神经规划与预测控制在多机器人通信受限的无标签运动规划中的应用

Goarin, Manohari, Zhou, Yang, Loianno, Giuseppe

Abstract

The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms and practical feasibility with decentralized on-board inference.

Chinese Translation

多机器人无标签运动规划问题涉及同时将机器人分配到目标并生成安全轨迹，这在许多协作任务中至关重要。近期的图神经网络方法提供了可扩展的去中心化解决方案，但依赖于简化的动态模型和仿真环境，忽视了现实部署中的关键挑战，如动态可行性和通信约束。为了解决这些问题，我们提出了一个分层框架，将图注意力规划器（Graph ATtention Planner, GATP）与去中心化非线性模型预测控制器（Nonlinear Model Predictive Controller, NMPC）相结合。GATP通过多机器人合作提供中间子目标，而NMPC在非线性动态和驱动约束下强制执行安全性。我们在仿真和实际四旋翼实验中评估了我们的框架。得益于注意机制和最小的通信需求，我们展示了对更大团队的改进泛化能力、对高达200毫秒通信延迟的鲁棒性以及去中心化的车载推理的实际可行性。

View on arXiv Download PDF AI Translation

cs.RO / 15 / 2605.19255

Bilateral Teleoperation with Compliant 6-DOF Pose-and-Force Sensing

带有顺应性6自由度姿态与力感知的双向遥操作

Feng, Yue, Huang, Weicheng, Chen, I-Ming

Abstract

Existing bilateral teleoperation platforms still rely on costly rigid six-axis force/torque sensors, tightly coupled leader-follower hardware, and kilohertz control loops. We present a Cartesian bilateral framework built on the hardware-agnostic WinGs Operating Studio (WOS) middleware, in which a low-cost compliant 6-DOF pose-and-force sensing end-effector, Delta6, is mounted on both sides so that each manipulator behaves as an end-effector 6-DOF series elastic actuator (SEA). The leader runs a damping-only admittance loop with a 6-D biquad notch filter; the follower realizes a stiffness-damping impedance through a position-based outer loop with a PID wrench-to-pose mapping. Three time scales (hardware I/O, mid-rate impedance/admittance, low-rate teleoperation messages) are explicitly decoupled, enabling the same application to drive heterogeneous arms. On a Lite6/FR3 testbed at 150 Hz, the system tracks stably under delays up to $120\pm40$ ms and 1% packet loss, matches the prescribed virtual stiffness in contact, and shows a favorable cumulative energy signature in passivity-style tests.

Chinese Translation

现有的双向遥操作平台仍然依赖于昂贵的刚性六轴力/扭矩传感器、紧密耦合的主从硬件以及千赫级的控制回路。我们提出了一种基于硬件无关的WinGs操作工作室（WOS）中间件构建的笛卡尔双向框架，其中在两侧均安装了低成本的顺应性6自由度姿态与力感知末端执行器Delta6，使得每个操纵器表现得像一个末端执行器6自由度系列弹性执行器（SEA）。主操纵器运行一个仅具有阻尼的导纳回路，并配备一个6维双二次陷波滤波器；从操纵器通过基于位置的外部回路实现刚度-阻尼阻抗，并采用PID力矩到姿态的映射。三个时间尺度（硬件输入/输出、中频阻抗/导纳、低频遥操作消息）被明确解耦，使得同一应用能够驱动异构臂。在150 Hz的Lite6/FR3测试平台上，该系统在高达$120 ext{±}40$毫秒的延迟和1%的数据包丢失下稳定跟踪，能够在接触时匹配规定的虚拟刚度，并在被动性测试中显示出良好的累积能量特征。

View on arXiv Download PDF AI Translation

cs.RO / 16 / 2605.19257

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

PRISM-SLAM：基于概率射线的尺度感知度量SLAM推断

Im, Eunsoo

Abstract

Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Pl\"ucker Ray-Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi-process architecture that asynchronously processes VFM inference and geometric tracking, PRISM-SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric $SE(3)$ Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned $Sim(3)$ error. This demonstrates that our system can produce deployment-ready metric trajectories by delivering robust metric SLAM solutions without any post-hoc scale correction. Project page: https://prismslam-cmd.github.io/prismslam_pr/

Chinese Translation

单目SLAM在动态环境中历史上面临尺度模糊和跟踪失败的问题。尽管最近的视觉基础模型（VFM）提供了显著的零样本深度先验，但简单地将这些确定性预测整合起来忽视了预测的不确定性和帧间尺度不一致性。我们提出了PRISM-SLAM，这是一种实时框架，严格地将VFM先验整合到结构化的贝叶斯因子图中，以实现尺度感知的度量一致定位和地图构建。具体而言，我们引入了一种Pl"ucker射线距离因子，将单目观测锚定在绝对空间中，并在全球一致的度量坐标系统内进行数学上的尺度漂移解决，使得度量尺度在Fisher意义上可识别。为了处理环境动态，我们从时间深度一致性中推导出一种认知不确定性代理，并制定了动态场景不确定性门控（DSUG）机制。这种软门控方法在不引入传统语义分割掩膜相关的高计算开销的情况下，概率性地降低动态干扰因素的权重。通过采用多进程架构异步处理VFM推断和几何跟踪，PRISM-SLAM在仅使用RGB输入的情况下以30帧每秒提供经过验证的度量输出，弥合了基础模型与现实世界机器人应用之间的差距。在TUM RGB-D和7-Scenes基准测试中评估，PRISM-SLAM的度量$SE(3)$绝对轨迹误差（ATE）几乎与其oracle对齐的$Sim(3)$误差相同。这表明我们的系统能够通过提供稳健的度量SLAM解决方案而无需任何事后尺度校正，生成可部署的度量轨迹。项目页面：https://prismslam-cmd.github.io/prismslam_pr/

View on arXiv Download PDF AI Translation

cs.RO / 17 / 2605.19294

DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

DEFLECT：通过流匹配似然估计的反事实调优实现延迟鲁棒执行的视觉-语言-动作政策

Zhu, Yixiang, Chen, Yonghao, Meng, Rui, Guo, Jingyu, Zou, Jiaxiang, Yang, Zijie, Wang, Taowen, Chen, Xinyu

Abstract

Vision-Language-Action (VLA) policies are typically deployed with asynchronous inference: the robot executes a previously predicted action chunk while the model computes the next one. This creates a prediction-execution misalignment: the chunk is conditioned on the observation taken before inference began, but executes in a physical state that has already drifted forward by several control steps; naive asynchronous rollover collapses from 89% to under 1% on Kinetix as the inference cycle covers up to seven control steps. We introduce DEFLECT, a fully offline post-training refinement that applies as a near drop-in upgrade to existing async-VLA stacks by converting latency itself into a label-free preference signal: counterfactual fresh/stale action pairs are constructed from a frozen reference policy and scored under the deployment-time conditioning via an implicit flow-matching likelihood-ratio surrogate, with no human labels, reward models, or online rollouts. DEFLECT substantially extends the usable delay envelope of async VLA control, with +6.4 success-rate gain in the high-latency regime (5-7 control steps), +4.6 when transferred to a real-scale VLA at the longest delay, and consistent improvements on two real-robot tasks (a bimanual conveyor pick-and-place and a reactive whack-a-mole).

Chinese Translation

视觉-语言-动作（VLA）政策通常采用异步推理进行部署：机器人在执行先前预测的动作块的同时，模型计算下一个动作。这导致了预测与执行之间的错位：该动作块是基于推理开始前的观察进行条件化的，但在物理状态上已经向前漂移了几个控制步骤；在 Kinetix 上，简单的异步切换成功率从 89% 降至不足 1%，因为推理周期覆盖了多达七个控制步骤。我们提出了 DEFLECT，这是一种完全离线的后训练精炼方法，作为对现有异步 VLA 堆栈的近乎即插即用的升级，通过将延迟本身转化为无标签的偏好信号：反事实的新鲜/过时动作对是从冻结的参考政策构建的，并通过隐式流匹配似然比替代物在部署时条件下进行评分，无需人类标签、奖励模型或在线回放。DEFLECT 显著扩展了异步 VLA 控制的可用延迟范围，在高延迟范围（5-7 个控制步骤）中成功率提高了 6.4%，在最长延迟下转移到真实规模 VLA 时提高了 4.6%，并在两个真实机器人任务（双手传送带抓取与放置和反应式打地鼠）中实现了一致的改进。

View on arXiv Download PDF AI Translation

cs.RO / 18 / 2605.19314

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

ContextFlow：用于长时间跨度具身智能体的层次任务状态对齐

Guo, Shuhan, Zhang, Kun, Liu, Haifei, Gao, Xingyu, Zhang, Yongqi, Wang, Yaqing, Yao, Quanming

Abstract

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

Chinese Translation

长时间跨度的具身智能体越来越多地将导航、搜索、接近和操作任务委托给专业执行者。随着这些执行者能力的增强，主要瓶颈从局部技能执行转移到在规划、监控、记忆和执行之间保持一致的任务前沿。我们研究了任务状态不对齐，这是一种任务级一致性失败，其中规划者的活跃阶段、运行时证据、记忆上下文和委托的执行者不再支持相同的下一步决策。这种失败可能导致不支持的交接、阶段锁定、执行者与上下文不匹配以及不必要的重新规划。我们提出了ContextFlow，一个可检查的对齐框架，将阶段表示为明确的合同，将运行时观察转换为证据包，并应用包括继续、细化、转移、提升和修复在内的范围更新。ContextFlow使专业执行者负责局部闭环控制，同时使任务前沿对齐变得明确且可审计。在长时间跨度的具身任务上的实验和演示轨迹展示了基于证据的范围更新如何诊断和缓解反复出现的任务状态失败。

View on arXiv Download PDF AI Translation

cs.RO / 19 / 2605.19420

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

超越航路点：跨体现语义导航的双热图定位

Yun, Kaijie, Chen, Yue

Abstract

Grounding open-ended semantic instructions into physically executable local goals is a fundamental challenge in human-robot interaction. While existing navigation frameworks often regress deterministic waypoints, this rigid formulation collapses spatial uncertainty and frequently targets non-traversable object centers, leading to severe execution failures. In this work, we focus on the practical setting of in-FOV semantic navigation, where a robot receives concise, interleaved multimodal (text and image) prompts. To bridge the gap between abstract semantic intent and physical reachability, we propose a unified Vision-Language framework that abandons single-point regression in favor of a Dual-Heatmap representation. Our framework predicts a navigation affordance heatmap that captures continuous reachable regions, coupled with a facing heatmap for orientation constraints. These dense outputs inherently function as a differentiable semantic potential field, integrating seamlessly with downstream local planners. To support this paradigm, we build a fully automated, foundation-model-assisted synthetic data pipeline and establish a comprehensive simulation benchmark. Extensive experiments demonstrate that our framework achieves state-of-the-art performance among comparable 8B baselines. Crucially, a feature-fusion study and simulation studies across diverse robot embodiments (Jetbot, H1, Aliengo) reveal that explicit heatmap prediction drastically improves the Affordance Rate (AR). By placing targets reliably in executable free space, our framework effectively mitigates the brittleness of point regression, offering a transferable path toward safe cross-embodiment semantic navigation.

Chinese Translation

将开放式语义指令定位为可物理执行的局部目标是人机交互中的一个基本挑战。现有的导航框架通常回归确定性的航路点，这种刚性的表述方式压缩了空间不确定性，并且常常以不可通行的物体中心为目标，导致严重的执行失败。在本研究中，我们关注于视野内（in-FOV）语义导航的实际场景，其中机器人接收简洁的交错多模态（文本和图像）提示。为了弥合抽象语义意图与物理可达性之间的差距，我们提出了一个统一的视觉-语言框架，该框架放弃了单点回归，转而采用双热图（Dual-Heatmap）表示。我们的框架预测一个导航可用性热图，捕捉连续可达区域，并结合一个面向热图以约束方向。这些密集输出本质上作为一个可微分的语义潜在场，能够与下游局部规划器无缝集成。为了支持这一范式，我们构建了一个完全自动化的基础模型辅助合成数据管道，并建立了一个全面的仿真基准。大量实验表明，我们的框架在可比的8B基线中实现了最先进的性能。至关重要的是，特征融合研究和跨多种机器人体现（Jetbot、H1、Aliengo）的仿真实验表明，显式热图预测显著提高了可用性率（Affordance Rate，AR）。通过可靠地将目标放置在可执行的自由空间中，我们的框架有效地减轻了点回归的脆弱性，为安全的跨体现语义导航提供了一条可转移的路径。

View on arXiv Download PDF AI Translation

cs.RO / 20 / 2605.19430

Neuromorphic Control of a Flapping-Wing Robot on Resource-Constrained Hardware

资源受限硬件上的拍打翼机器人神经形态控制

Filali, Rim El, Feng, Chenrui, Gao, Chao, Gu, Weibin

Abstract

Flapping-Wing Micro Aerial Vehicles (FWMAVs) provide exceptional maneuverability and aerodynamic efficiency but pose significant challenges for onboard control due to nonlinear dynamics and stringent Size, Weight, and Power (SWaP) constraints, as exemplified by a butterfly-inspired robot less than 30 gram. To this end, we present a hierarchical neuromorphic control framework that enables fully onboard, closed-loop flight on a widely available, resource-constrained ESP32 microcontroller with a unit cost of approximately $5. Specifically, our method deploys two lightweight Spiking Neural Networks (SNNs) onboard: one for state estimation from raw sensory feedback and another for control via modulation of a Central Pattern Generator (CPG) for wing actuation. Trained by imitation learning, the system achieves stable pitch and heading angle tracking during untethered real-world flight. Experimental results further reveal that the SNN-based controller reduces latency by 36% (1059us to 680us) and power by 18% (0.033W to 0.027W) for inference compared to the conventional Artificial Neural Network (ANN) baseline, demonstrating the viability of spike-based computation without specialized hardware. To the best of our knowledge, this work constitutes the first demonstration of fully onboard neuromorphic control for autonomous flight of a FWMAV, highlighting the potential of SNNs to enable energy-efficient autonomy under stringent SWaP constraints. Visual abstract: http://bit.ly/4nI8ECY

Chinese Translation

拍打翼微型空中载具（FWMAVs）提供了卓越的机动性和空气动力学效率，但由于非线性动态和严格的尺寸、重量和功率（SWaP）限制，给机载控制带来了重大挑战，例如一种重量不足30克的仿蝴蝶机器人。为此，我们提出了一种分层神经形态控制框架，使得在广泛可用的、资源受限的ESP32微控制器上实现完全机载的闭环飞行成为可能，该微控制器的单价约为5美元。具体而言，我们的方法在机载部署了两个轻量级脉冲神经网络（SNNs）：一个用于从原始传感反馈中进行状态估计，另一个通过调制中央模式发生器（CPG）进行翼部驱动控制。通过模仿学习进行训练，该系统在无绳的真实飞行中实现了稳定的俯仰角和航向角跟踪。实验结果进一步表明，与传统的人工神经网络（ANN）基线相比，基于SNN的控制器在推理时将延迟减少了36%（从1059微秒降至680微秒），功耗降低了18%（从0.033瓦降至0.027瓦），展示了在没有专用硬件的情况下，基于脉冲的计算的可行性。根据我们所知，这项工作是首次展示完全机载神经形态控制用于FWMAV的自主飞行，突显了SNN在严格SWaP限制下实现能效自主的潜力。视觉摘要： http://bit.ly/4nI8ECY

View on arXiv Download PDF AI Translation

cs.RO / 21 / 2605.19431

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

自组装模块化空中机器人用于多功能空中任务

Sugihara, Junichiro, Kitagawa, Masaki, Li, Jinjie, Li, Yunong, Nishio, Takuzumi, Okada, Kei, Zhao, Moju

Abstract

Multirotor aerial robots excel at maneuvering in three-dimensional space, and recent advances enable nimble navigation in cluttered and confined environments, especially for small airframes. By contrast, platforms built for high-altitude work tend to be larger to deliver high thrust for stable physical interaction with the environment. However, these conflicting design requirements create a long-standing trade-off between nimble navigation and robust aerial manipulation. Here, we present LEGION units, which are reconfigurable modular aerial robots capable of in-flight self-assembly for cooperative manipulation, drawing inspiration from the self-organized collectives formed by ants. Each unit retains nimble maneuverability while joint-equipped docking interfaces at both ends enable end-to-end self-assembly into a flying manipulator. We show that multiple units autonomously dock in flight; once latched, they maintain a zero-clearance interlock by controlling the contact force and torque, enabling reliable aggregation and articulated motion even outdoors. We further show that self-reconfigurability enables morphological switching between nimble individual flight and collective articulated manipulation, while realizing core in-flight manipulation primitives including pushing, pulling, rotating, grasping, and carrying. LEGION's self-organization enables aerial robots, especially in swarms, to shift from passive observers to active participants in their environment, broadening the scope of aerial physical interaction.

Chinese Translation

多旋翼空中机器人在三维空间中的机动性表现出色，近期的技术进展使其能够在拥挤和狭小的环境中灵活导航，尤其适用于小型机身。相比之下，专为高空作业设计的平台往往体积较大，以提供高推力以实现与环境的稳定物理交互。然而，这些相互矛盾的设计要求在灵活导航和稳健空中操作之间形成了长期的权衡。在此，我们提出了LEGION单元，这是一种可重构的模块化空中机器人，能够在飞行中自我组装以实现协作操作，灵感来源于蚂蚁形成的自组织集体。每个单元保持灵活的机动性，同时两端配备的对接接口使其能够端到端自我组装成一个飞行操控器。我们展示了多个单元能够在飞行中自主对接；一旦锁定，它们通过控制接触力和扭矩维持零间隙的互锁，即使在户外也能实现可靠的聚合和关节运动。我们进一步展示了自我重构能力使得单个灵活飞行与集体关节操作之间的形态切换成为可能，同时实现了包括推、拉、旋转、抓取和搬运在内的核心飞行操作原语。LEGION的自组织能力使得空中机器人，尤其是在群体中，能够从被动观察者转变为其环境中的主动参与者，拓宽了空中物理交互的范围。

View on arXiv Download PDF AI Translation

cs.RO / 22 / 2605.19490

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

用于连接和自动驾驶车辆验证的闭环混合数字双胞胎平台

Quan, Kanglong, Xia, Zhebing, Jiang, Linfeng, Yu, Hao, Qiao, Ziheng, Dong, Dapeng, Jia, Dongyao

Abstract

Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.

Chinese Translation

在实际部署之前，对连接和自动驾驶车辆（CAVs）进行全面且高效的验证至关重要。尽管基于仿真的测试提供了可扩展性，但现有方法往往缺乏与真实车辆和现场数据的无缝集成，从而限制了其在捕捉动态、真实世界交互中的真实性。为了解决这一问题，本文提出了一种新颖的实时混合数字双胞胎平台。其核心创新在于通过低延迟的车联网（Vehicle-to-Everything, V2X）通信链接，将高保真度的CARLA-SUMO联合仿真与物理测试场地和车辆紧密结合。定制开发的中间件作为关键桥梁，同步真实CAV的运动状态作为仿真中的影子车辆，并将虚拟控制命令转换为底盘驱动的控制器局域网（Controller Area Network, CAN）消息，以实现闭环控制。详细的实现包括使用摄影测量进行全尺度资产重建，以及云边协作架构以实现可扩展的多用户操作。实验结果表明，该平台在低延迟下实现了稳定的同步和有效的闭环控制，确认了其在多场景CAV验证中的实用性。

View on arXiv Download PDF AI Translation

cs.RO / 23 / 2605.19501

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

CANINE：为视障用户提供与机器人导盲犬互动导航的辅导

Yu, Cunjun, Wang, Zishuo, Xiao, Anxing, Li, Linfeng, Hsu, David

Abstract

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/

Chinese Translation

机器人导盲犬提供的导航辅助极大地扩展了视障人士的独立移动能力，但其有效使用需要微妙的人机协调，这对用户来说很难通过通用的语言指令学习。为了解决这一挑战，我们提出了CANINE，一个自动化辅导系统，通过个性化、适应性的语言反馈来训练用户与机器人导盲犬进行互动导航。CANINE将复杂的协调任务分解为子技能，并在两个层面上运作。在高层面，它通过跟踪学习者在子技能上的熟练程度，使用知识追踪技术来决定训练内容，并优先训练最薄弱的领域。在低层面，CANINE通过观察每个用户的练习过程，使用基础模型推断错误的潜在原因，并自适应地生成针对性的语言纠正来决定如何训练每个子技能。一项针对蒙眼参与者的对照研究，作为定量评估的代理人群，表明与通用语言指令相比，CANINE显著提高了学习效率和最终导航表现。我们还通过保留研究和探索性案例研究进一步验证了CANINE。保留研究显示在两周后技能仍有持久改善。案例研究确认了CANINE在训练视障用户方面的有效性，同时揭示了现实世界部署的额外设计考虑。这两者与对照研究的发现高度一致。项目页面：https://cunjunyu.github.io/project/canine/

View on arXiv Download PDF AI Translation

cs.RO / 24 / 2605.19503

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

ARC-RL：一个受ARC Raiders启发的强化学习平台

Romeo, Carlo, Bagdanov, Andrew D.

Abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints.

Chinese Translation

腿部运动的强化学习已经发展成为一套多组件奖励函数和物理引擎基准，其形态均源自真实的商业硬件。然而，游戏中的非玩家角色（NPC）受到风格约束，这在模拟到真实的机器人领域中并不存在，通常表现为没有真实机器人对应物的生物体。我们介绍了ARC-RL，这是一个包含四个MuJoCo连续控制环境的套件，具有受ARC Raiders的生物图鉴启发的机器人形态：18自由度的高大六足机器人女王、12自由度的装甲六足机器人堡垒、18自由度的紧凑六足机器人蜱虫和12自由度的四足机器人跳跃者。所有四个机器人共享统一的观察模板、动作约定、仿真节奏，以及一个单一的封闭形式多组件奖励函数，其唯一的形态变化体现在一小组权重和参数上。该奖励融合了速度跟踪项、健康生存奖励、相位锁定的步态合规奖励/成本对、动作正则化项、三个安全惩罚和一个姿态锚点；在任何时候，运动捕捉数据都不会进入奖励中。此外，我们为每种形态提供了手工制作的中央模式生成器演示，这些演示既作为固定的专家参考，也作为离线到在线训练的先前数据来源。在这个平台上，我们进行了一项控制的实证研究，比较标准的在线算法（SAC、SPEQ、SOPE-EO）与增强了先前数据的方法（SACfD、SPEQ-O2O、SOPE），并描述每种范式如何应对平台的形态多样性和动画风格的约束。

View on arXiv Download PDF AI Translation

cs.RO / 25 / 2605.19524

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

SafeAlign-VLA：一种针对风险感知自主驾驶的负增强安全对齐框架

Tian, Kefei, Lian, Yuansheng, Yang, Kai, Chen, Xiangdong, Li, Shen

Abstract

End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 testset, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.

Chinese Translation

端到端自主驾驶系统在常见场景中表现出色，但在安全关键的长尾案例中却面临挑战。视觉-语言-动作（VLA）模型因其强大的推理能力而备受关注。然而，大多数基于VLA的方法依赖于正向专家示范，鲜有利用负样本，导致对风险行为和安全边界的理解不足。为了解决这一局限性，我们提出了SafeAlign-VLA，一个统一的负增强安全对齐框架，将负数据纳入监督学习和强化学习中。首先，我们开发了一种反事实安全配对范式，通过反事实推理从风险场景中生成结构化的安全标签和反事实正向轨迹。然后，采用两阶段训练策略：首先进行负增强的监督微调，以获取失败反馈和轨迹修正；接着进行基于锚点的组相对策略优化，利用正向和负向轨迹作为对比锚点，引导采样并通过组相对优势惩罚高风险行为。在NAVSIM和DeepAccident上的实验验证了所提框架的有效性。SafeAlign-VLA在NAVSIM v1测试集上达到了89.1 PDMS，相较于没有负数据的基线提高了1.3%。在DeepAccident上，其碰撞率降低至3.36%，同时实现了84.2%的语言准确率和85.8%的风险预测准确率。这些结果证明了所提负增强安全对齐框架在安全和稳健的自主驾驶中的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 26 / 2605.19562

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

基于学习加速优化的合作空地交接任务轨迹规划

Chen, Jingshan, Yu, Bochen, Ebel, Henrik, Eberhard, Peter

Abstract

This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.

Chinese Translation

本文提出了一种学习增强的轨迹规划框架，用于合作无人机（UAV）和无人地面车辆（UGV）交接任务。虽然集中式轨迹优化确保了动态可行性和任务最优性，但其高计算成本限制了实时应用。我们提出了一种神经代理规划器，利用解耦的编码器-解码器长短期记忆网络（LSTM）从任务规格生成协调的交接轨迹预测。这些预测作为下游集中优化器的有信息的热启动，从而加速收敛到动态可行的解决方案。基准评估表明，与冷启动优化相比，学习增强的规划框架实现了超过三倍的加速和100%的优化成功率。结果表明，将数据驱动的推理与基于模型的细化相结合，使得异构多机器人系统的轨迹生成快速且可靠。

View on arXiv Download PDF AI Translation

cs.RO / 27 / 2605.19580

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

PAPO-VLA：面向规划的视觉-语言-动作模型策略优化

Guo, Peizheng, Wang, Jingyao, Zheng, Changwen, Qiang, Wenwen

Abstract

Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.

Chinese Translation

视觉-语言-动作（VLA）模型在语言引导的机器人任务中展现出良好的能力。然而，使VLA策略可靠仍然具有挑战性，因为操控任务是通过闭环交互完成的，每个动作都会影响后续的执行。为了分析这个问题，我们在执行过程中重新审视VLA策略，并认为VLA策略既充当规划者，做出改变执行方向的任务导向决策，也充当执行者，通过密集的连续动作实现这些决策。这一观点表明，提高VLA的可靠性需要特别关注规划动作。现有的优化方法可以模仿动作或改善完整轨迹，但通常并不明确识别规划动作或衡量其对任务成功的重要性。为了解决这个问题，我们提出了面向规划的VLA模型策略优化（PAPO-VLA）。PAPO-VLA首先通过联合考虑动作变化和轨迹结果来识别规划动作，然后通过因果充分性和因果必要性来估计它们的重要性，最后将这种重要性纳入GRPO（Generalized REINFORCE Policy Optimization）优势估计中。通过这种方式，更重要的规划动作会得到更强的优化强调，而整个轨迹仍然通过轨迹级反馈进行优化。在多个基准测试中的实验表明，PAPO-VLA的有效性。

View on arXiv Download PDF AI Translation

cs.RO / 28 / 2605.19592

Implicit Action Chunking for Smooth Continuous Control

平滑连续控制的隐式动作分块

Liang, Bosun, Pei, Shuo, Chen, Zirui, Fan, Chuanzhi, Sun, Chen, Wu, Yuankai, Tan, Huachun, Wang, Yong

Abstract

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.

Chinese Translation

强化学习通常会产生高频振荡的控制信号，这会削弱物理部署所需的安全性和稳定性。显式动作分块通过预测固定时域轨迹来解决这一问题，但随着时域长度的增加，政策输出维度也成比例扩大，导致优化困难，并与标准逐步交互不兼容。为克服这些挑战，本文提出了双窗口平滑（Dual-Window Smoothing, DWS），一种用于平滑连续控制的隐式动作分块框架。与显式方法不同，DWS 在不扩展动作空间的情况下强制执行时间一致性。它采用双窗口设计：一个执行窗口通过确定性调制确保物理平滑性，一个价值窗口在时域内对齐时间差目标，以纠正由开环执行引起的评论者偏差。DWS 还包括一个基于一阶动作差异的轻量级演员侧时间正则化器，以促进全局连续性。该设计有效地弥合了时间抽象与反应式逐步控制之间的差距。在包括 DeepMind 控制套件和工业能源管理任务的基准测试中，DWS 的表现优于最先进的基线。在复杂的基于视觉的自动驾驶任务中，DWS 实现了更平滑的控制，减少了抖动，提高了安全性，并达到了 100% 的成功率。

View on arXiv Download PDF AI Translation

cs.RO / 29 / 2605.19594

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

MCNav：面向零-shot目标导航的记忆感知动态认知地图

Li, Jingyu, Liu, Zhe, Wu, Wenxiao, Zhang, Li

Abstract

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.

Chinese Translation

在复杂环境中导航至实例级目标是一项具有挑战性的任务。许多现有的零-shot方法通过建模整个环境并利用大型语言模型进行场景理解，取得了良好的性能。然而，这些策略主要集中在探索新区域，而缺乏对先前探索区域信息的深入利用。因此，当在之前访问过的区域中错过或误识别目标时，导航失败的情况频繁发生。为了解决这些局限性，我们提出了MCNav，一个具有动态认知地图的记忆感知导航框架。该地图高效地存储可查询的关于已探索区域相关对象的信息。在此记忆结构的基础上，MCNav引入了两种记忆感知的探索策略：目标重新验证（goal re-validation），即重新评估之前见过的对象以纠正匹配失败；以及错过目标重新探索（missed goal re-exploration），即根据上下文线索估计目标在已探索区域内存在的可能性。这些策略通过黑名单机制进一步稳定，以防止重复错误，并通过双重检查机制进行高置信度确认。我们在HM3Dv1和HM3Dv2数据集上对MCNav进行了评估，涵盖三种不同任务，其中在实例级目标导航任务上取得了最先进的性能。

View on arXiv Download PDF AI Translation

cs.RO / 30 / 2605.19600

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

FlyMirage：通过生成世界模型实现多样化和可扩展的无人机飞行数据的全自动生成管道

Li, Jinhan, Huang, Xijie, Wang, Zhaoqi, Wang, Yijin, Ge, Weiqi, He, Qiyi, Zhu, Mo, Gao, Fei, Wu, Yuze, Zhou, Xin

Abstract

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages large language models (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

Chinese Translation

在视觉-语言导航（VLN）领域，航空数据集在规模、多样性和真实性的结合能力上仍然有限，通常依赖于昂贵的现实场景或视觉受限的模拟。为了解决这些挑战，我们提出了FlyMirage，一个高度可扩展的全自动航空VLN数据生成管道。我们的方法利用大型语言模型（LLM）作为环境设计师，以促进场景多样性，并结合生成世界模型，将这些设计实例化为高保真度的3D高斯点云（3DGS）场景。为了大幅减少人工劳动并确保飞行数据的可行性，FlyMirage自动化了场景探索和语义信息获取，并进一步集成了一个动态可行的规划器，用于无人机（UAV）轨迹生成。利用这一工具链，我们生成了一个大规模、多样化且照片真实的航空VLN数据集，具有动态可行的飞行轨迹，旨在支持下一代具身导航模型的发展。

View on arXiv Download PDF AI Translation

cs.RO / 31 / 2605.19631

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

HEAT：通过轨迹引导的世界模型实现异构端到端自主驾驶

Cho, Hoonhee, Lee, Giwon, Kang, Jae-Young, Yang, Hyemin, Park, Heejun, Yoon, Kuk-Jin

Abstract

End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

Chinese Translation

端到端自主驾驶已成为一种引人注目的替代方案，直接将原始传感器数据映射到驾驶动作。尽管最近的方法在单一领域数据集上取得了良好的性能，但在多个异构领域联合训练时，其性能显著下降。然而，在实际应用中，自主系统必须在具有异构分布的多样环境中运行，包括不同的城市、传感器配置和交通模式，而无需进行领域特定的再训练。这一差距突显了多领域学习中的一个关键挑战：异构领域之间的领域特定变异引入了相互矛盾的学习信号，导致模型朝向在各领域都不理想的妥协解决方案发展。为了解决这一问题，我们提出了一种基于轨迹驱动的学习范式，该范式围绕规划轨迹组织训练，使模型能够捕捉驾驶意图的领域不变表示。此外，我们还结合了一个世界模型，该模型预测基于自我动作的未来潜在特征，提高了特征一致性并减轻了领域引起的偏差。我们在三个基准上评估了我们的方法，分别是nuScenes、NAVSIM和Waymo端到端数据集，并在所有领域中显示出相较于现有方法的显著改进。我们的结果表明，单一统一模型可以在异构数据集上进行训练，同时在每个领域内保持强大的性能，标志着向可扩展的现实世界部署迈出了一步。我们将公开我们的代码。

View on arXiv Download PDF AI Translation

cs.RO / 32 / 2605.19678

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

RoVLA：用于鲁棒视觉-语言-动作模型的多一致性约束

Luo, Jingzhou, Wen, Yifan, Bai, Yongjie, Song, Xinshuai, Liu, Yang, Lin, Liang

Abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

Chinese Translation

视觉-语言-动作（VLA）模型在具身操作方面表现出色，但在视觉观察变化、语言指令的改述以及复合干扰下仍然显得脆弱。这一局限性表明，现有方法仍然在很大程度上依赖于训练分布中的浅层相关性，而不是学习任务语义、环境状态和动作生成之间的稳定耦合。尽管最近的努力通过大规模训练、后训练适应或增强预测建模来提高鲁棒性，但它们很少在端到端策略内部强制执行面向不变性的连贯性。为了解决这一问题，我们提出了RoVLA，一个具有多一致性约束的鲁棒视觉-语言-动作框架。RoVLA在三种互补变换下强制一致性：指令语义、轨迹演变和观察扰动。具体而言，指令一致性（Instructional Consistency, IC）促进在语义等价的指令重写下的稳定基础，演变一致性（Evolutionary Consistency, EC）在生成过程中保持一致的行动意图，而观察一致性（Observational Consistency, OC）通过在目标干扰前后强制一致的预测来提高对视觉和本体感知扰动的鲁棒性。通过在训练期间明确建模这些不变性，RoVLA减少了对表面相关性的依赖，提高了鲁棒性和泛化能力。在LIBERO-Plus、RoboTwin 2.0和真实世界操作任务上的实验表明，RoVLA始终优于强基线方法，并在多样化的任务和观察变化下表现出更强的鲁棒性。这些结果证明了多一致性学习在鲁棒具身控制中的有效性。代码将发布在 https://github.com/HCPLab-SYSU/RoVLA。

View on arXiv Download PDF AI Translation

cs.RO / 33 / 2605.19690

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

D-CLING：保留先验的深度条件微调用于导航基础模型

Nakaoka, Shintaro, Kanai, Takayuki, Tanaka, Kazuhito

Abstract

Navigation Foundation Models (NFMs) trained on large cross-embodied datasets have demonstrated powerful generalizability in various scenarios. Adopting in-domain fine-tuning for an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, model updates using a small subset of data typically erode the pre-trained prior, compromising the pre-training generalization. Consequently, fine-tuning deteriorates the capability of the model for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pre-training while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pre-trained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pre-trained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our proposal effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed method maintains or further improves action prediction capabilities beyond the fine-tuned dataset, providing a key insight into continual learning for general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/

Chinese Translation

在大型跨体数据集上训练的导航基础模型（NFM）在各种场景中展现了强大的泛化能力。对NFM进行领域内微调可以有效地校准视运动策略，甚至在新场景中也能带来进一步的改进。然而，微调后的模型仍然存在避障能力差或无法正确到达指定目标的问题。此外，使用小数据子集进行模型更新通常会侵蚀预训练的先验，损害预训练的泛化能力。因此，微调会降低模型在稳健和准确导航方面的能力。在本研究中，我们提出了一种新颖的微调方法，该方法利用大规模预训练，同时有效地学习新环境或相机配置等新设置。特别地，受到ControlNet的启发，我们通过附加一个可训练的预训练骨干网络副本，使用零初始化的残差通道来微调NFM，从而学习几何线索。这一设计使模型能够有效地获取领域内几何信息，同时保留在各种行为中的预训练知识。尽管方法简单，我们对现实世界导航的全面评估表明，我们的提议有效地实现了稳健的长时间导航，且碰撞和人工干预最小。此外，我们的离线分析显示，所提方法在微调数据集之外保持或进一步提高了动作预测能力，为一般导航的持续学习提供了关键见解。项目页面：https://toyotafrc.github.io/DCLING-Proj/

View on arXiv Download PDF AI Translation

cs.RO / 34 / 2605.19701

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

低动态环境中的多会话地面纹理SLAM

Hart, Kyle M., Englot, Brendan

Abstract

The simultaneous localization and mapping community has introduced a growing number of systems adapted for multi-session operations where the operational environment features low-dynamic changes that impact mapping, such as surface wear, weather phenomena, or seasonal change. These systems allow for lifelong operations by a robot within these environments. There is also growing interest in operations in environments where the unique ground texture is the only mapping feature available for use. These ground texture systems are not yet targeted for multi-session low-dynamic-change environments though. This work explores the impact of three different techniques on trajectory estimation accuracy in these multi-session low-dynamic ground texture environments. Of the three, the use of Kullback-Leibler Divergence, as a similarity score and a bias influencing loop closure confidence, is found to have the most success. We show an analysis of all three methods and a deeper exploration of the impact of Kullback-Leibler Divergence. We also introduce a dataset for use by the robotics community that contains multi-session images where the ground changes between sessions and also high-accuracy pose information for use in evaluation.

Chinese Translation

同时定位与地图构建（SLAM）领域逐渐引入了越来越多适用于多会话操作的系统，这些系统能够在低动态变化的环境中运行，这些变化会影响地图构建，例如表面磨损、天气现象或季节变化。这些系统使得机器人能够在这些环境中进行长期操作。对于那些唯一的地面纹理是可用的地图特征的环境，相关操作的兴趣也在不断增长。然而，目前这些地面纹理系统尚未针对多会话低动态变化环境进行优化。本研究探讨了三种不同技术对这些多会话低动态地面纹理环境中轨迹估计精度的影响。在这三种技术中，使用Kullback-Leibler散度（Kullback-Leibler Divergence）作为相似性评分和影响回路闭合置信度的偏差被发现最为成功。我们展示了对这三种方法的分析，并深入探讨了Kullback-Leibler散度的影响。此外，我们还为机器人社区引入了一个数据集，其中包含多会话图像，展示了会话之间的地面变化，以及用于评估的高精度位姿信息。

View on arXiv Download PDF AI Translation

cs.RO / 35 / 2605.19703

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

KIO-planner：基于注意力引导的单阶段运动规划与双重映射用于无人机导航

Yao, Dexing, Li, Haochen, Wei, Junhao, Zhao, Yifu, Li, Yanxiao, Xu, Jiahui, Hu, Jinxuan, Tian, Lele, Lu, Baili, Li, Zikun, Yang, Xu, Im, Sio-Kei, Yang, Dingcheng, Wang, Yapeng

Abstract

Autonomous UAV flight in confined, wall-dense environments requires low-latency and reliable motion planning under strict safety constraints. Traditional optimization-based planners suffer from mapping latency and easily fall into local minima when navigating through dense structural obstacles. Meanwhile, existing end-to-end learning methods struggle to extract fine-grained geometric features from raw depth images and lack hard kinodynamic constraints, leading to unpredictable collisions near walls. To address these issues, we propose KIO-planner, an attention-guided single-stage trajectory planning framework. First, we integrate a Convolutional Block Attention Module (CBAM) into the perception backbone to adaptively focus on critical structural edges and traversable space. Second, we introduce a novel Dual Mapping mechanism--comprising physical bounds activation and a deterministic Geometric Safety Shield in the depth-pixel space--to enforce kinodynamic feasibility and collision-free flight without global map fusion. Extensive high-fidelity simulated experiments demonstrate that KIO-planner enables highly agile navigation at speeds up to 3.0 m/s. Compared to the state-of-the-art baseline, KIO-planner achieves lower inference latency (approximately 24 ms) and generates significantly smoother trajectories, reducing control cost by 28.4%. Most notably, our Dual Mapping substantially increases the worst-case safety margin, measured by minimum distance to obstacles, from 0.48 m to 0.76 m, ensuring fast, smooth, and safer navigation in highly constrained environments.

Chinese Translation

在狭窄、墙体密集的环境中，无人机的自主飞行需要在严格的安全约束下实现低延迟和可靠的运动规划。传统的基于优化的规划器在映射延迟方面存在问题，并且在穿越密集结构障碍物时容易陷入局部最优。同时，现有的端到端学习方法在从原始深度图像中提取细粒度几何特征方面表现不佳，并且缺乏严格的动力学约束，导致在靠近墙壁时发生不可预测的碰撞。为了解决这些问题，我们提出了KIO-planner，一个基于注意力引导的单阶段轨迹规划框架。首先，我们将卷积块注意力模块（CBAM）集成到感知主干中，以自适应地关注关键的结构边缘和可通行空间。其次，我们引入了一种新颖的双重映射机制——包括物理边界激活和深度像素空间中的确定性几何安全屏障——以强制执行动力学可行性和无碰撞飞行，而无需全局地图融合。大量高保真模拟实验表明，KIO-planner能够以高达3.0 m/s的速度实现高度灵活的导航。与最先进的基线相比，KIO-planner实现了更低的推理延迟（约24毫秒），并生成了显著更平滑的轨迹，控制成本降低了28.4%。最值得注意的是，我们的双重映射显著增加了最坏情况下的安全边际，测量为与障碍物的最小距离，从0.48米提高到0.76米，确保在高度受限的环境中实现快速、平稳和更安全的导航。

View on arXiv Download PDF AI Translation

cs.RO / 36 / 2605.19771

Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives

超越模仿：从困难负样本中学习安全的端到端自主驾驶

Wang, Junli, Hua, Zhihua, Liu, Xueyi, Xing, Zebin, Tian, Haochen, Ma, Kun, Ye, Hangjun, Chen, Guang, Chen, Long, Zhang, Qichao

Abstract

Existing imitation learning methods for end-to-end autonomous driving predominantly learn from successful demonstrations by minimizing geometric deviations from expert trajectories. This paradigm implicitly assumes that spatial proximity implies behavioral safety, leading to a critical objective mismatch: trajectories with nearly identical imitation losses may exhibit drastically different safety outcomes, where one remains recoverable while the other results in collision. To address this limitation, we propose BeyondDrive, a failure-aware imitation learning framework that jointly learns from successful and failed driving behaviors. First, we introduce a flow matching-based negative trajectory generator that synthesizes safety-critical yet expert-proximate trajectories, enabling explicit modeling of safety asymmetry. Second, we develop a diversity-aware sampling strategy that mitigates mode collapse and improves coverage of diverse failure modes during negative trajectory generation. Third, we propose a Repulsive Distance Loss that simultaneously attracts predictions toward expert demonstrations while repelling them from hard negative trajectories, thereby establishing discriminative safety boundaries in trajectory space. Applied to the uni-modal baseline Latent TransFuser, BeyondDrive achieves 89.7 PDMS on the NAVSIMv1 closed-loop benchmark, outperforming prior state-of-the-art methods. Moreover, BeyondDrive generalizes effectively across different autonomous driving architectures, including multi-modal planners, and further demonstrates strong zero-shot transferability on the HUGSIM benchmark.

Chinese Translation

现有的端到端自主驾驶模仿学习方法主要通过最小化与专家轨迹的几何偏差来学习成功的演示。这一范式隐含地假设空间接近性意味着行为安全，从而导致了一个关键的目标不匹配：具有几乎相同模仿损失的轨迹可能表现出截然不同的安全结果，其中一个轨迹可以恢复，而另一个则导致碰撞。为了解决这一局限性，我们提出了BeyondDrive，一个关注失败的模仿学习框架，能够同时学习成功和失败的驾驶行为。首先，我们引入了一种基于流匹配的负轨迹生成器，该生成器合成安全关键但接近专家的轨迹，从而实现安全不对称性的显式建模。其次，我们开发了一种关注多样性的采样策略，以减轻模式崩溃并改善负轨迹生成过程中的多样性失败模式的覆盖。第三，我们提出了一种排斥距离损失，该损失同时将预测吸引向专家演示，同时将其排斥于困难负轨迹，从而在轨迹空间中建立区分性的安全边界。应用于单模态基线Latent TransFuser，BeyondDrive在NAVSIMv1闭环基准测试中达到了89.7的PDMS，超越了之前的最先进方法。此外，BeyondDrive在不同的自主驾驶架构中有效泛化，包括多模态规划器，并在HUGSIM基准测试中进一步展示了强大的零样本迁移能力。

View on arXiv Download PDF AI Translation

cs.RO / 37 / 2605.19840

Justifying bio-inspired robotics research: A taxonomy of strategies

为生物启发式机器人研究辩护：策略分类

Zhang, Margaret J., Ting, Justin, Moore, Talia Y.

Abstract

For most of human history, we have not thought systematically about how and why we incorporate aspects of the natural world into our designs. The lack of a systematic approach has resulted in inconsistencies in motivations and methods that make it difficult to predict or evaluate the success of bio-inspired design. This mismatch between expectations and results can lead to disappointment when a reader considers a bio-inspired design to be superficial, weak, or incomplete. This is especially true in the field of Robotics, in which similarity to a biological system might be the driving motivation for construction. In an effort to assist robotics researchers justify their specific bio-inspired approach and to assist funding program managers with discerning the value of different bio-inspired approaches, here we propose a taxonomy of motivations for bio-inspired design and describe the potential significant contributions that are likely to result from different approaches.

Chinese Translation

在人类历史的大部分时间里，我们并没有系统地思考如何以及为何将自然界的某些方面融入我们的设计中。缺乏系统性的方法导致了动机和方法上的不一致，使得预测或评估生物启发设计的成功变得困难。这种期望与结果之间的不匹配可能会导致失望，尤其是当读者认为某个生物启发设计表面化、薄弱或不完整时。在机器人领域尤其如此，因为与生物系统的相似性可能是构建的驱动力。为了帮助机器人研究人员为其特定的生物启发方法辩护，并帮助资助项目经理辨别不同生物启发方法的价值，我们在此提出了一种生物启发设计动机的分类法，并描述了不同方法可能带来的重要贡献。

View on arXiv Download PDF AI Translation

cs.RO / 38 / 2605.19881

Trajectory Planning and Control near the Limits: an Open Experimental Benchmark on the RoboRacer Platform

极限附近的轨迹规划与控制：RoboRacer平台上的开放实验基准

Piccinini, Mattia, Zambiasi, Patrick, Mungiello, Aniello, Piazza, Mattia, Jahncke, Felix, Betz, Johannnes

Abstract

We present a modular framework to benchmark new and existing methods for trajectory planning and control in high-acceleration maneuvers that push autonomous driving to the limits. Our framework includes time-optimal raceline generation, online time-optimal velocity replanning, geometric path tracking controllers, and a new model-structured neural network (MS-NN) to learn the inverse dynamics for steering control. We deploy our framework on a 1:10-scale RoboRacer platform, using two circuits. Through several ablations with cautious and aggressive racelines, we study the performance of single modules and their combinations. We show that our MS-NN significantly improves tracking accuracy, decreases steering oscillations, and is physically interpretable. Moreover, online velocity replanning improves lap times by compensating for execution errors, and enables the vehicle to safely reach higher speeds and accelerations. To support future research, our code, datasets, videos and results are publicly available at https://roboracer-benchmark.github.io/planning_control_benchmark/.

Chinese Translation

我们提出了一个模块化框架，用于基准测试新旧方法在高加速度机动中进行轨迹规划和控制，这些机动将自主驾驶推向极限。我们的框架包括时间最优的赛道生成、在线时间最优的速度重新规划、几何路径跟踪控制器，以及一个新的模型结构神经网络（MS-NN），用于学习转向控制的逆动力学。我们在1:10比例的RoboRacer平台上部署了我们的框架，使用了两个赛道。通过对谨慎和激进赛道的多次消融实验，我们研究了单个模块及其组合的性能。我们展示了我们的MS-NN显著提高了跟踪精度，减少了转向振荡，并且具有物理可解释性。此外，在线速度重新规划通过补偿执行误差改善了圈速，并使车辆能够安全地达到更高的速度和加速度。为了支持未来的研究，我们的代码、数据集、视频和结果已公开发布在 https://roboracer-benchmark.github.io/planning_control_benchmark/。

View on arXiv Download PDF AI Translation

cs.RO / 39 / 2605.19919

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

超越动作残差：通过瓶颈潜在强化学习实现现实世界机器人策略引导

Yu, Dongjie, Lei, Kun, Jiang, Zhennan, Pan, Jia, Xu, Huazhe

Abstract

Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at https://manutdmoon.github.io/ZPRL/.

Chinese Translation

预训练的模仿策略已成为机器人操作的坚实基础，但它们通常需要在线改进，以克服执行错误、有限的数据集覆盖和部署不匹配。因此，一个核心问题是强化学习（RL）应如何在离线预训练后调整策略。现有的轻量级方法通常直接在动作空间中应用残差修正，但这往往导致噪声和结构不良的探索。在本研究中，我们提出了Z-Perturbation Reinforcement Learning（ZPRL），一种通过紧凑的瓶颈潜在而非通过策略权重或输出动作来引导预训练策略的方法。在离线训练期间，我们通过一个即插即用的变分信息瓶颈（VIB）模块增强策略，从观察嵌入中提取与任务相关的潜在接口。在在线微调期间，基础策略被冻结，RL仅在该潜在上学习残差扰动，其解码表示条件化冻结的动作生成器。我们在流匹配策略上实例化ZPRL，并在八个仿真任务和四个现实世界任务上进行了评估。在多样的操作设置中，ZPRL在样本效率和最终性能上均优于强大的后训练基线。在现实世界中，ZPRL在四个任务上的平均成功率比模仿基础策略提高了33.7%，同时产生比动作残差对应物更平滑的探索行为。这些结果表明，紧凑的、与任务对齐的瓶颈潜在为在线RL适应提供了有效的接口。更多视频可以在 https://manutdmoon.github.io/ZPRL/ 找到。

View on arXiv Download PDF AI Translation

cs.RO / 40 / 2605.19924

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

RoHIL：针对光照变化的鲁棒人机协作机器人强化学习

Zhang, Shuoqin, Xiong, Yixin, Gao, Xiru, Liu, Kai, Wang, Ke, Zhou, Xichuan, Hu, Zhe

Abstract

Human-in-the-loop reinforcement learning systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few meters away due to shifts in the visual input distribution caused by new lamp positions and window light. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction. RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman-actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance, eliminating the need to re-collect data and retrain for every new workstation and environment. Project page: https://anonymous4365.github.io/RoHIL/

Chinese Translation

人机协作强化学习系统在其训练的工作站上几乎可以实现完美的成功，但当同一机器人被移动到几米外的工作站时，由于灯光位置和窗户光线的变化导致视觉输入分布的变化，系统会崩溃。在每个工作站重新收集演示并重新运行人机协作学习（HIL）与部署不兼容，而在光照变化的数据上进行简单的微调会导致源工作站的灾难性遗忘。为了解决这一跨域差距，我们提出了RoHIL，一个离线微调框架，不需要额外的真实机器人交互。RoHIL结合了（i）基于世界模型的图像重光化器，在多个虚拟高动态范围图像（HDRI）环境下重新合成源工作站轨迹的视觉流，同时保持动作和奖励的真实性；（ii）光照保留重放（Illumination-Retention Replay, IRR），一种数据级别的抗遗忘机制，它将重光化适应过渡与原始光照保留过渡交替，以保持源工作站的贝尔曼覆盖；（iii）一个锚定的贝尔曼-演员正则化器，限制表示和策略的漂移，使其保持在原始源工作站策略附近。在四个真实机器人操作任务中，面对显著的跨工作站光照变化，RoHIL显著改善了光照变化下的性能，而标准HIL-RL则崩溃，同时保持了源工作站的性能，消除了为每个新工作站和环境重新收集数据和重新训练的需求。项目页面：https://anonymous4365.github.io/RoHIL/

View on arXiv Download PDF AI Translation

cs.RO / 41 / 2605.19958

TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

TravExplorer：通过可通行性感知的三维规划实现跨楼层的具身探索

Zheng, Han, Chen, Zhe, Huang, Yudong, Liu, Haoran, Wang, Jinghao, Yang, Ming, Qin, Tong

Abstract

Zero-shot Object Navigation (ZSON) has shown promise for open-vocabulary target search in unseen environments, yet most existing systems remain tied to planar representations and single-floor assumptions. These assumptions become inadequate in real buildings, where navigation involves floors, stairs, landings, and vertically overlapping spaces. This article presents TravExplorer, a cross-floor embodied exploration framework that couples zero-shot semantic guidance with traversability-aware 3-D planning. TravExplorer maintains a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces and extracts traversable frontiers from connected support surfaces, including floors, stairs, and landings. A FOV-aware active perception strategy further resolves incomplete observations during cross-floor traversal. To reduce semantic-reasoning latency, a lightweight guidance module aligns a probabilistic instance map from online open-vocabulary segmentation with a spatial value map from fast image-to-text matching. Based on these geometric and semantic memories, a hierarchical planner performs target-aware frontier touring over object hypotheses, traversable frontiers, and stair landmarks, and generates executable cross-floor motions through foothold-guided 3-D search and vertically constrained local trajectory optimization. Experiments over 4,195 simulated episodes on HM3D and MP3D demonstrate consistent advantages over representative ObjectNav baselines. Fifty real-world trials on a Unitree Go2 further validate open-vocabulary target search across single-floor and cross-floor indoor environments without prior maps or human intervention. The code will be released at https://github.com/wuyi2121/TravExplorer.

Chinese Translation

零样本物体导航（Zero-shot Object Navigation, ZSON）在未见环境中的开放词汇目标搜索中展现了潜力，然而大多数现有系统仍然依赖于平面表示和单层假设。这些假设在真实建筑中显得不足，因为导航涉及楼层、楼梯、平台和垂直重叠空间。本文提出了TravExplorer，一个跨楼层的具身探索框架，将零样本语义引导与可通行性感知的三维规划相结合。TravExplorer维护一个统一的体积地图，区分被占用的结构与机器人可达的支撑表面，并从连接的支撑表面中提取可通行的边界，包括楼层、楼梯和平台。一种基于视场（FOV）感知的主动感知策略进一步解决了跨楼层穿越过程中的不完整观测。为了减少语义推理延迟，一个轻量级引导模块将来自在线开放词汇分割的概率实例图与来自快速图像到文本匹配的空间值图对齐。基于这些几何和语义记忆，一个分层规划器在物体假设、可通行边界和楼梯地标上执行目标感知的边界巡游，并通过基于支撑点的三维搜索和垂直约束的局部轨迹优化生成可执行的跨楼层运动。在HM3D和MP3D上进行的4,195个模拟实验展示了相对于代表性物体导航基线的一致优势。在Unitree Go2上的50个真实世界试验进一步验证了在没有先前地图或人工干预的情况下，在单层和跨楼层室内环境中进行开放词汇目标搜索的有效性。代码将发布在 https://github.com/wuyi2121/TravExplorer。

View on arXiv Download PDF AI Translation

cs.RO / 42 / 2605.19981

CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation

CEER：统一的合规末端执行器与根控制接口用于分层人形机器人运动-操作

Luo, Xinyuan, Chen, Xingrui, Yin, Xunjian, Wu, Hongxuan, Xia, Boxi, Chen, Zhuoqun, Li, Jinzhou, Chen, Boyuan, Cheng, Xianyi

Abstract

Humanoid robots have achieved impressive locomotion performance, yet contact-rich and long-horizon manipulation remains a major bottleneck. Manipulation is inherently contact-rich and demands compliant whole-body control for stable interaction, while its diversity and long-horizon nature favor modular, planner-compatible interfaces over joint-space tracking. We propose CEER, a compliant end-effector-root (EE-root) control abstraction for modular humanoid loco-manipulation within a hierarchical planning framework. CEER enables compliance-aware whole-body control in an interpretable task space defined by root motion commands and end-effector pose targets, and supports plug-and-play integration with heterogeneous high-level planners. A teacher-student framework is adopted to distill a general motion-tracking controller into a low-level policy that consumes only EE-root commands. We further construct a hierarchical system that integrates heterogeneous planners and task modules through the EE-root interface, enabling diverse manipulation tasks without retraining the underlying whole-body policy. Experiments in simulation and on hardware demonstrate 3.3 cm end-effector tracking accuracy with substantially reduced jerk compared to baselines, stable contact-rich manipulation under teleoperation, and up to 70% success in simulated single-object loco-manipulation tasks within a room-scale environment. These results indicate that compliant EE-root control provides a practical abstraction for humanoid loco-manipulation, enabling modular and scalable integration of diverse skills.

Chinese Translation

人形机器人在运动性能方面取得了显著进展，但接触丰富且长时间的操作仍然是一个主要瓶颈。操作本质上是接触丰富的，要求合规的全身控制以实现稳定的交互，而其多样性和长时间特性则更倾向于模块化、与规划兼容的接口，而非关节空间跟踪。我们提出了CEER，一种合规末端执行器-根（EE-root）控制抽象，用于在分层规划框架内进行模块化的人形机器人运动-操作。CEER使得在由根运动指令和末端执行器姿态目标定义的可解释任务空间中实现合规感知的全身控制，并支持与异构高层规划器的即插即用集成。我们采用教师-学生框架，将通用运动跟踪控制器提炼为仅消耗EE-root指令的低层策略。我们进一步构建了一个分层系统，通过EE-root接口集成异构规划器和任务模块，使得在不重新训练基础全身策略的情况下能够执行多样的操作任务。模拟和硬件实验表明，末端执行器的跟踪精度达到3.3厘米，与基线相比显著减少了抖动，在远程操作下实现了稳定的接触丰富操作，并在房间规模环境中实现了高达70%的单对象运动-操作任务成功率。这些结果表明，合规EE-root控制为人形机器人运动-操作提供了一个实用的抽象，能够实现多样技能的模块化和可扩展集成。

View on arXiv Download PDF AI Translation

cs.RO / 43 / 2605.19986

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

超越二元成功：细粒度操作的诊断元评估框架

Xu, He-Yang, Zhang, Pengyuan, Ge, Zongyuan, Hao, Xiaoshuai, Belongie, Serge, Geng, Xin, Peng, Yuxin, Wei, Xiu-Shen

Abstract

Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.

Chinese Translation

细粒度操作标志着一种新的范式，其中全球场景上下文已不再足够，成功依赖于局部属性定位、高保真空间感知和遵循约束的运动执行之间的紧密耦合。然而，目前的具身人工智能基准将这些能力简化为二元成功率，系统性地夸大了报告的能力，最高可达70%，并掩盖了阻碍实际部署的架构瓶颈。我们提出了MetaFine，一个诊断元评估框架，它沿着理解、感知和控制行为三个轴解构操作能力。MetaFine基于一个组合任务图，吸收异构外部基准，并在统一协议下将其重构为不同复杂度的诊断场景。通过这个视角评估最先进的视觉-语言-行动（VLA）模型，揭示了传统指标无法察觉的严重维度特定失败。通过有针对性的因果干预，我们识别出视觉编码器保持局部空间结构的能力是细粒度精度的关键瓶颈：直接改善这一点可以解锁先前无法访问的操作能力，而无需修改下游策略。MetaFine进一步支持混合真实-模拟验证，利用有限的配对真实世界回合来校准可扩展的基于模拟的估计，以实现更稳定的物理基准测试。通过将评估从排名转向诊断，MetaFine将基准测试转变为修复真实物理灵巧性背后分层能力的可操作指南。MetaFine框架、基准和支持资源将在我们的项目页面公开发布：https://metafine.github.io/

View on arXiv Download PDF AI Translation

cs.RO / 44 / 2605.19990

Minimalist Visual Inertial Odometry

极简视觉惯性里程计

Pasti, Francesco, Klotz, Jeremy, Bellotto, Nicola, Nayar, Shree K.

Abstract

Visual-Inertial Odometry(VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded simulator. The resulting model decodes speed from just the four measurements produced by the photodiodes. Pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory. We validate our approach with a prototype sensor mounted on a differential drive robot. Across diverse indoor and outdoor terrains, our system closely tracks the reference ground truth without any real-world fine-tuning. Our work shows that minimalist sensing enables efficient and accurate planar odometry.

Chinese Translation

视觉惯性里程计（Visual-Inertial Odometry, VIO）对移动机器人导航至关重要，通常使用高像素数的相机。捕获和处理相机图像需要大量资源。本研究提出了一种极简的平面里程计方法，展示了仅使用四个视觉测量和一个惯性测量单元（IMU）即可为差分驱动机器人提供稳健的运动估计。我们的关键见解是，四个向下朝向的光电二极管通过光学Gabor掩模感知世界，产生编码速度的信号。基于此，我们在一个物理基础的模拟器上联合优化掩模参数和时间卷积网络（Temporal Convolutional Network, TCN）。最终模型仅通过光电二极管产生的四个测量值解码速度。将这些估计与IMU的角速度配对，得到一个连续的平面轨迹。我们通过在差分驱动机器人上安装原型传感器来验证我们的方法。在多种室内和室外地形中，我们的系统在没有任何现实世界微调的情况下，紧密跟踪参考真实值。我们的研究表明，极简的传感方法能够实现高效且准确的平面里程计。

View on arXiv Download PDF AI Translation

cs.RO / 45 / 2605.20101

Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

拓扑优化的气动软驱动器：设计与实验验证

Mehta, Sumit, Poulios, Konstantinos

Abstract

This paper demonstrates the computational design of soft elastomeric pneumatic actuators using nonlinear topology optimization. An existing density- and porohyperelasticity-based topology optimization framework was extended from 2D to 3D and used to generate two manufacturable actuator designs, which were then studied numerically and experimentally. For both designs, the objective was to maximize the bending response for a prescribed actuation pressure under two different allowable strain limits. A key advantage of the employed topology optimization framework is that it can consistently, during the optimization, account for the very large deformations induced upon pressurization. The two optimized 3D designs were fabricated using stereolithography and experimentally tested to validate their performance.

Chinese Translation

本文展示了使用非线性拓扑优化进行软弹性气动驱动器的计算设计。现有的基于密度和多孔超弹性的拓扑优化框架从二维扩展到三维，并用于生成两个可制造的驱动器设计，随后对这两个设计进行了数值和实验研究。对于这两种设计，目标是在两个不同的允许应变限制下，最大化在规定驱动压力下的弯曲响应。所采用的拓扑优化框架的一个关键优势是能够在优化过程中始终考虑因加压而引起的非常大变形。这两个优化的三维设计通过立体光刻技术制造，并进行了实验测试以验证其性能。

View on arXiv Download PDF AI Translation

cs.RO / 46 / 2605.20138

Hamilton--Jacobi Reachability for Spacecraft Collision Avoidance

航天器碰撞避免的哈密顿-雅可比可达性

Hui, Larry, Kam, Jordan, Su, William, Zhou, Jianshu

Abstract

This article presents a Hamilton--Jacobi (HJ) reachability framework for a two--satellite collision avoidance problem operating in the same circular orbit, where relative motion is modeled in the radial--tangential--normal (RTN) frame using planar Hill--Clohessy--Wiltshire (HCW) dynamics. We define the target state space as unsafe relative configurations in the orbit plane corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is formulated as a zero--sum differential game, where Player 1 is the controlled satellite and Player 2 is modeled as a bounded adversarial disturbance with unknown intent. We present the HJ formulation and compute backward reachable sets that characterize relative states from which collision cannot be avoided under worst-case disturbances, while states outside this set admit provably collision-free trajectories. These reachable sets are integrated with supervisory hybrid control logic to determine when evasive maneuvers must be initiated, enabling mathematically grounded safety guarantees for scalability.

Chinese Translation

本文提出了一种哈密顿-雅可比（Hamilton--Jacobi, HJ）可达性框架，用于解决在同一圆形轨道上运行的双卫星碰撞避免问题，其中相对运动在径向-切向-法向（Radial-Tangential-Normal, RTN）坐标系中使用平面希尔-克洛赫西-威尔特希尔（Hill--Clohessy--Wiltshire, HCW）动力学建模。我们将目标状态空间定义为轨道平面中对应于符合联邦通信委员会（Federal Communications Commission, FCC）轨道标准的最小间隔要求的不安全相对配置。航天器之间的相互作用被形式化为一个零和微分博弈，其中玩家1是受控卫星，玩家2被建模为具有未知意图的有界对抗干扰。我们展示了HJ的公式，并计算出向后可达集，这些可达集表征了在最坏情况下无法避免碰撞的相对状态，而位于此集合外的状态则允许可证明的无碰撞轨迹。这些可达集与监督混合控制逻辑相结合，以确定何时必须启动规避机动，从而为可扩展性提供数学基础的安全保证。

View on arXiv Download PDF AI Translation

计算机视觉 (Computer Vision)

132

cs.CV / 1 / 2605.18956

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

MotionMERGE：一个用于人类运动编辑、推理、生成和解释的多粒度框架

Wu, Bizhu, Xie, Jinheng, Chen, Wenting, Kong, Zhe, Ren, Jianfeng, Shen, Linlin, Bai, Ruibin, Qu, Rong

Abstract

Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained languageguided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design ReasoningAware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, crossgranularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

Chinese Translation

近期的运动语言模型统一了理解和生成等任务，但其操作粒度较粗，缺乏对动画或交互所需的身体部位的细粒度理解和细致控制。这源于模型和数据中的基本问题，即模型无法专注于运动的局部模式，而训练数据缺乏细粒度的监督。为了解决这一问题，我们提出了MotionMERGE，一个统一的框架，弥合了粒度差距。首先，我们开创性地研究了细粒度语言引导的运动控制，包括详细理解和局部编辑，通过在单一的LLM中明确建模运动的部件和时间层面，从而赋予模型精确控制的强大先验。其次，我们设计了ReasoningAware Granularity-Synergy预训练，这是一种新颖的策略，采用联合监督进行跨粒度对齐、时间基础、局部对齐、运动一致性和运动基础的思维链（CoT）推理。这使得模型具备了细粒度的运动语言对齐、跨粒度协同和明确的推理能力。第三，我们整理了MotionFineEdit，一个大规模数据集（837K原子+ 144K复杂三元组），其中包含首个细粒度时空修正指令和运动基础的CoT注释，为细粒度文本驱动的运动编辑和运动基础推理建立了新的基准。大量实验表明，MotionMERGE在更精确的运动生成、理解和编辑方面的能力，以及对其他复杂运动任务的强大零-shot泛化能力。这项工作代表了朝着与运动进行更细粒度交互和人类般推理的模型迈出的重要一步。

View on arXiv Download PDF AI Translation

cs.CV / 2 / 2605.18974

Harnessing Self-Supervised Features for Art Classification

利用自监督特征进行艺术作品分类

Melis, Federico, Bilardello, Davide, Prato, Emanuele, Turri, Evelyn, Baraldi, Lorenzo

Abstract

Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.

Chinese Translation

艺术作品分类面临着重大挑战，因为细致的细节与抽象特征之间的复杂相互作用决定了艺术作品的风格或类型。本文系统地研究了监督和自监督骨干网络作为艺术作品分类和检索的特征提取器的有效性，特别关注绘画作品。我们使用 DINO 系列和 CLIP 模型进行了广泛的实验评估，评估了多种分类策略和特征表示。我们的结果表明，采用自监督骨干网络能够持续提高艺术作品分类的性能。此外，我们的研究为分类和检索模块在现实世界应用中的适用性提供了见解，例如支持博物馆导航的虚拟现实（VR）应用。

View on arXiv Download PDF AI Translation

cs.CV / 3 / 2605.18984

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Artifact-Bench：评估多模态大型语言模型在检测和评估AI生成视频伪影方面的能力

Tang, Yuqi, Shi, Yang, Zhang, Zhuoran, Wang, Qixun, Bai, Xuehai, Ding, Yue, Chen, Ruizhe, Zeng, Bohan, Chen, Xinlong, Zhu, Xuanyu, Li, Bozhou, Wang, Yuran, Dai, Yifan, Tong, Chengzhuo, Liu, Xinyu, Ji, Yiyan, Wei, Yujie, Dong, Yuhao, Yan, Shilin, Wang, Fengxiang, Zhang, Yi-Fan, Wang, Haotian, Zhang, Yuanxing, Wan, Pengfei

Abstract

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

Chinese Translation

近期的视频生成模型在AI生成视频的真实感方面取得了显著进展，但其输出仍然存在伪影，如时间不一致、结构扭曲和语义不连贯。尽管多模态大型语言模型（MLLMs）展现出强大的视觉理解能力，但它们对这些伪影的感知和推理能力仍不明确。现有基准测试往往缺乏对伪影感知和细粒度诊断推理的系统评估，尤其是在超越照片真实内容的多样化AI生成视频领域。为了解决这一问题，我们提出了Artifact-Bench，这是一个全面的基准，用于评估MLLMs在AI生成视频伪影检测和分析方面的能力。我们首先建立了一个涵盖照片真实、动画和CG风格视频的三层次层级真实感伪影分类法。基于这一分类法，Artifact-Bench定义了三个互补任务：真实视频与AI生成视频分类、成对真实感比较和细粒度伪影识别。在对19个领先的MLLMs进行实验时，我们发现它们在伪影感知和推理方面存在显著的局限性，许多模型在挑战性设置下的表现接近随机甚至低于随机。我们进一步观察到MLLM的判断与人类感知偏好之间存在显著的不一致，突显了它们作为AI生成视频真实感的通用评估者的有限可靠性。

View on arXiv Download PDF AI Translation

cs.CV / 4 / 2605.19004

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

EgoTraj：用于多模态预测的真实世界自我中心人类轨迹数据集

Yehia, Ahmad, Mohamed, Abduallah, Wang, Tianyi, Byeon, Jiseop, Qian, Kun, Jiao, Junfeng, Claudel, Christian

Abstract

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

Chinese Translation

从自我中心的视角准确预测人类轨迹在类人机器人、可穿戴传感系统和辅助导航等应用中发挥着核心作用。然而，由于在真实环境中收集的自我中心轨迹数据集稀缺，这方面的进展仍然有限。为了解决这一需求，我们引入了EgoTraj，一个使用Meta Quest Pro (MQPro) 记录的自我中心多模态开放数据集。EgoTraj包含75个来自多个MQPro佩戴者在真实城市环境中收集的人类导航序列。每个录音提供同步的RGB视频以及真实数据，包括连续时间同步的6自由度头部姿态、每帧3D眼动向量和场景注释。据我们所知，EgoTraj与典型的自我中心轨迹数据集不同，它捕捉了跨越多样化城市路线的长时间、自主导航，并且参与者多样性广泛。为了展示该数据集的潜力，我们对几种最先进的自我中心轨迹预测方法进行了基准测试，并进行了消融研究以分析注视、场景和运动线索的贡献。结果突显了EgoTraj在基于增强现实的感知、导航和辅助系统中的实用性。EgoTraj数据集、代码和EgoViz仪表板可在https://github.com/yehiahmad/EgoTraj公开获取。

View on arXiv Download PDF AI Translation

cs.CV / 5 / 2605.19020

A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

开放集虹膜呈现攻击检测的视觉基础模型系统性失败分析

Anand, Rahul, Singh, Siddharth, D, Dileep A, Prasanna, Mahadeva, Ramachandra, Raghavendra

Abstract

Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.

Chinese Translation

视觉基础模型在多样的视觉识别任务中展现了强大的迁移能力，越来越多地被考虑用于生物识别应用。然而，它们在现实开放集操作条件下对虹膜呈现攻击检测（PAD）的适用性仍然未得到充分检验。本研究对开放集虹膜PAD的通用视觉基础模型进行了系统性失败分析，使用了眼周图像。评估了五个代表性的基础模型在三种开放集协议下的表现，这些协议明确区分了不同来源的分布偏移：未见过的呈现攻击工具（PAIs）、使用不同传感器捕获的未见数据集，以及从近红外（NIR）到可见光谱（VIS）图像的跨光谱迁移。在统一的实验框架内评估了冻结特征表示和使用低秩适应（LoRA）的参数高效任务适应性。结果表明，基础模型可以在具有相似传感特性的不同数据集之间迁移，但在未见攻击工具上无法可靠地泛化，并在跨光谱评估中急剧下降。虽然LoRA在某些跨数据集设置中提高了性能，但在攻击级别和光谱偏移下经常放大失败。使用分割虹膜输入、全骨干网络微调、联合跨数据集和跨PAI偏移以及反向VIS到NIR迁移的额外验证实验进一步确认，这些失败并非仅仅是眼周输入、适应性不足或单向光谱评估的伪影。这些发现表明，强大的闭集或跨数据集性能不应被视为稳健开放集安全性的证据，并强调了需要保持对呈现伪影敏感的PAD表示，同时在现实部署变化下保持稳定性。

View on arXiv Download PDF AI Translation

cs.CV / 6 / 2605.19027

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

MedFM-Robust：医学基础模型鲁棒性的基准评估

Cui, Xiangxiang, Huang, Tianjin, Wang, Yifang, Hu, Lijie, Yin, Lu

Abstract

Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.

Chinese Translation

医学基础模型（MedFMs）已成为医疗保健中具有变革性工具，展现出在多种临床应用中的能力。这些模型大致可以分为两种范式：医学视觉-语言模型（Med-VLMs）和分割基础模型。Med-VLMs 包括医学专用模型，如 LLaVA-Med 和 MedGemma，以及通用模型，如 GPT-4o 和 Gemini，均能够执行医学图像理解任务，包括视觉问答（VQA）、报告生成和视觉定位。同时，Segment Anything Model（SAM）催生了一代新的医学分割模型，适应性地发展出 SAM-Med2D 和 MedSAM。因此，这些模型在临床中的广泛应用需要在真实世界条件下进行严格的可靠性评估。

View on arXiv Download PDF AI Translation

cs.CV / 7 / 2605.19032

Personalized Face Privacy Protection From a Single Image

基于单张图像的个性化面部隐私保护

Yahn, Zachary, Ilhan, Fatih, Huang, Tiansheng, Tekin, Selim, Hu, Sihao, Xu, Yichang, Loper, Margaret, Liu, Ling

Abstract

Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user's identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at https://github.com/zacharyyahn/FaceCloak

Chinese Translation

上传到网上的面部照片容易受到恶意行为者的攻击，他们可以从在线来源抓取面部图像，并通过未经授权使用面部识别模型侵犯个人隐私。本文提出了一种名为FaceCloak的新型个性化面部隐私保护系统，该系统能够从用户的单张图像生成防御性身份特定的通用面部隐私面具，从而导致面部识别失败。FaceCloak引入了一种三阶段个性化面部扰动学习方法： (1) 基于个人的单张图像生成一小组高多样性的合成面部图像。 (2) 通过对这组小的合成图像进行迭代扰动生成，学习面部遮蔽，为关键面部身份泄漏区域增加更多保护，有效地将用户的身份嵌入向远离相似身份的远程锚点身份转移。 (3) 生成一种个性化的身份保护面具，以像素级遮蔽的形式呈现，轻量且可以高效地应用于用户的任何面部图像，同时保持良好的感知质量。在三个流行的面部数据集和十种识别模型上的大量实验表明，FaceCloak相较于其他29种现有代表性方法具有良好的效果。代码可在 https://github.com/zacharyyahn/FaceCloak 获取。

View on arXiv Download PDF AI Translation

cs.CV / 8 / 2605.19060

LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

LiFT：基于提升的切片间特征轨迹用于从2D生成器生成3D图像

Zhang, Xinhe, Zhang, Yuyang, Jin, Pengfei, Marin-Llobet, Arnau, Li, Na, Li, Quanzheng

Abstract

High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim$$135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

Chinese Translation

高分辨率3D医学图像生成仍然面临挑战，因为全体积模型计算开销大，而高效的2D切片生成器往往无法在第三维度上保持解剖一致性。我们提出了LiFT，一个提升切片间特征轨迹的框架，将3D体积合成分解为每切片图像生成和切片间轨迹学习。LiFT并不是端到端地建模体积分布，而是将体积视为特征空间中的有序轨迹，捕捉解剖结构在深度上的出现、变换和消失。三平面漂移损失将生成切片的轨迹与真实体积的轨迹对齐，使得在无条件生成中能够进行切片间进展的分布学习；在配对翻译中，经过注册目标训练的双向 $z$-上下文混合器提供了切片间的一致性，同时保持每切片的保真度。我们在BraTS 2023（无条件和缺失模态MR）和SynthRAD2023（MR到CT）上评估了LiFT。在这些设置中，LiFT保持了每切片的质量，以大约 $135 imes$ 的更低推理成本接近报告的cWDM缺失MR重建质量（没有正式的等效性测试），并在MR到CT的切片间一致性上相较于无映射消融有所改善，证明轻量级的切片间轨迹学习是实现高分辨率3D医学合成的可行途径。

View on arXiv Download PDF AI Translation

cs.CV / 9 / 2605.19074

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

通过多时间尺度预测学习光伏电力输出预测中的长期时间依赖性

Laha, Sumit, Sharma, Ankit, Foroosh, Hassan

Abstract

The rapid global expansion of solar photovoltaic (PV) capacity-reaching a record 597 GW in 2024-highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding precocious convergence of the network in terms of both weight gradients and filter diversity. Leveraging this architecture-independent improvement that integrates sequential sky imagery with historical PV generation data, we evaluate the models' abilities to predict power output across multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while maintaining computational parsimony. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.

Chinese Translation

全球太阳能光伏（PV）容量的快速扩张——预计在2024年达到创纪录的597 GW——突显了迫切需要强大的预测模型，以减轻因太阳辐射间歇性特性而导致的电网不稳定性。尽管基于深度学习的直接预测方法利用地面天空图像（GSI）已成为主流，但现有文献往往受到单一架构评估和对单一时间尺度（点）预测的专注限制。本文建议从传统的单一时间尺度估计转向多时间尺度预测框架，从而在架构无关的情况下提高准确性。我们假设并通过实验验证，针对未来一系列值的联合优化使深度神经网络能够更好地捕捉潜在的跨步时间依赖性，避免在权重梯度和滤波器多样性方面的过早收敛。利用这种将顺序天空图像与历史光伏发电数据相结合的架构无关的改进，我们评估模型在多个离散未来时间步长上同时预测电力输出的能力。我们的方法通过对多种深度学习架构的比较分析进行了验证。结果表明，这种多时间尺度方法显著提高了整个预测区间的预测准确性和稳健性，同时保持了计算的简约性。通过以微不足道的开销实现优于单一时间尺度模型的性能，本研究提供了一种可扩展且高效的解决方案，以提高现代电网的韧性。

View on arXiv Download PDF AI Translation

cs.CV / 10 / 2605.19075

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

CRAFT：用于多模态视频问答的评论者精炼自适应关键帧定位

Bhosale, Mahesh, Wasi, Abdul, Trivedi, Vishvesh, Yan, Pengyu, Gorugantu, Akhil, Doermann, David

Abstract

Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.

Chinese Translation

基于真实世界新闻事件的多视频问答需要系统在异构视频档案中提取与查询相关的证据，同时将每个主张归因于其支持来源。我们提出了CRAFT（评论者精炼自适应关键帧定位），这是一个基于查询的管道，结合了动态关键帧选择、每个视频的自动语音识别（ASR）与多语言回退，以及一个混合评论循环，以在整合之前迭代验证和修正主张。该管道集成了UNLI时序蕴含、DeBERTa-v3跨主张筛选和Llama-3.2-3B裁决者，最后的引用合并阶段将每个事实与所有支持来源标识符一起输出一次。在MAGMaR 2026上，CRAFT实现了最佳的整体平均值（0.739）、参考召回率（0.810）和引用F1值（0.635）。我们进一步在WikiVideo的MAGMaR风格转换上进行评估，包含52个不重叠的事件查询，CRAFT同样表现出色（0.823平均），显示其以主张为中心的证据聚合超越了MAGMaR。消融实验表明，原子主张、ASR和评论循环是相较于基础的查询条件模型主要增益的驱动因素。代码和实现细节可在https://github.com/bhosalems/CRAFT上公开获取。

View on arXiv Download PDF AI Translation

cs.CV / 11 / 2605.19111

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

FAGER：基于事实的文本到图像模型评估与优化

Lim, Youngsun, Ham, Cusuh, Chen, Pin-Yu, Ghadiyaram, Deepti

Abstract

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.

Chinese Translation

现有的文本到图像（T2I）评估指标主要评估生成的图像是否与提示中明确陈述的信息一致，但往往未能捕捉到隐含的、外部基础的或身份定义的事实要求。因此，它们不适合评估涉及科学知识、历史事实、产品或文化特定概念的提示中的事实正确性。我们提出了基于事实的评估与优化框架（FAGER），该框架评估生成的图像是否正确反映了与提示相关或隐含的可视可验证事实，同时提供可操作的改进反馈。FAGER首先通过结合基于大型语言模型（LLM）的事实提议与参考引导的视觉事实提取和验证，构建一个结构化的事实评分标准，然后将该评分标准转换为基于视觉语言模型（VLM）评估的问题-答案对。为了验证FAGER作为事实性指标的有效性，我们引入了事实A/B测试，该测试衡量一个指标是否更倾向于真实参考图像而非相应的生成图像。在涵盖科学、历史、产品、文化和知识密集型概念的五个数据集上，FAGER在该测试中始终优于先前的指标。我们进一步展示了FAGER可以在完全无训练的情况下优化T2I输出，在各个数据集上实现显著的事实性提升。

View on arXiv Download PDF AI Translation

cs.CV / 12 / 2605.19133

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

知道何时不进行预测：自监督学习与弃权以实现更安全的糖尿病视网膜病变筛查

Chopra, Muskaan, Sparrenberg, Lorenz, Terheyden, Jan H., Sifa, Rafet

Abstract

Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.

Chinese Translation

自监督学习（SSL）现在已成为医学图像模型预训练的标准方法，但其性能仍主要通过下游准确性来评估。对于糖尿病视网膜病变分级等安全关键的筛查任务，这还不够：模型还必须知道何时其预测不可靠，并将不确定的案例推迟到临床审查。在本研究中，我们考察了SSL预训练的长度如何影响校准置信度和基于置信度的弃权。我们在固定的微调协议下评估多个SSL检查点，并评估校准置信度、覆盖率、选择性准确性和选择性宏F1。在不同的数据集和数据模式下，与从头开始训练相比，SSL预训练提高了选择性预测。与以往主要评估下游准确性或AUROC的SSL研究不同，我们分析了SSL预训练持续时间如何影响基于校准置信度的弃权下的置信度行为。然而，一旦准确性达到饱和，选择性性能在不同检查点之间仍然可能发生显著变化，且更长的预训练并不总是能提高可靠性。这些结果强调了关注弃权评估的重要性，并建议将预训练长度视为一个重要的与可靠性相关的设计选择，而不仅仅是一个计算细节。代码可在GitHub上获取。

View on arXiv Download PDF AI Translation

cs.CV / 13 / 2605.19137

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

基于冻结图像基础模型的数据高效视频预训练研究

Orlova, Svetlana, Cavagnero, Niccolò, Dubbelman, Gijs

Abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .

Chinese Translation

视频基础模型在许多视频理解任务中表现出色，但通常需要在大规模视频数据集上进行大规模预训练，从而导致显著的数据和计算成本。相比之下，现代图像基础模型已经提供了强大的空间表示。这引发了一个重要问题：是否可以通过重用这些空间表示并仅对时间推理进行预训练来构建具有竞争力的视频模型？我们迈出了探索一种轻量级训练范式的初步步骤，该范式冻结一个预训练的图像基础模型，仅训练一个递归时间模块以处理流媒体视频。通过将图像基础模型作为空间编码器重用，这种方法可以显著减少与端到端视频预训练相比所需的视频数据和计算量。在这项工作中，我们在投入视频预训练计算之前探讨了这种方法的可行性。我们在多个视频理解任务中的实证发现表明，强大的时间性能可以在没有大规模视频预训练的情况下出现，这为未来在冻结的图像基础模型上预训练时间模块以获得递归视频基础模型的研究提供了动力。代码链接： https://github.com/tue-mps/towards-video-image-frozen

View on arXiv Download PDF AI Translation

cs.CV / 14 / 2605.19155

Efficient coding along the visual hierarchy

视觉层次中的高效编码

Passi, Ananya, Robinson, Brian S., Bonner, Michael F.

Abstract

Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.

Chinese Translation

生物视觉系统从有限的经验中学习，这与依赖于数百万训练图像的深度学习模型不同。是什么学习原则使这一切成为可能？我们测试了高效编码的概念，即神经表征捕捉自然输入的统计结构，是否能够从有限的数据中构建出与人类对齐的视觉特征层次。我们开发了一种无监督学习程序，其中深度网络的每一层将其输入压缩到自然图像中主要变化模式上，仅使用局部统计信息，而不依赖标签、任务或反向传播。该无监督程序产生的特征从边缘和颜色逐渐发展到纹理和形状。这个深度高效编码模型的特征被人类观察者轻易识别，并且能够预测人类视觉皮层中图像引发的功能性磁共振成像（fMRI）反应。此外，结合高效编码与监督微调的混合学习程序在低数据环境中实现了更好的大脑对齐和更快速的类别学习。这些发现表明，高效编码可能在整个视觉层次中塑造表征，并有助于解释生物视觉的数据效率。

View on arXiv Download PDF AI Translation

cs.CV / 15 / 2605.19207

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

低资源医疗环境中的量化机器学习模型用于医学影像

Kanneti, Sumanth Meenan, Shah, Aryan

Abstract

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.

Chinese Translation

深度学习模型在医学影像分析中表现出强大的性能，但由于计算、内存和电力限制，在低资源临床环境中部署这些模型仍然困难。本文提出了一种多策略压缩框架，用于从MRI图像中对脑肿瘤进行分类，涵盖了量化感知训练、从DenseNet-101教师模型到紧凑型DenseNet-32学生模型的知识蒸馏，以及在轻量级MobileNetV2骨干网络上的Float16后训练量化。我们使用包含胶质瘤、脑膜瘤、垂体肿瘤和健康对照的多类脑肿瘤MRI数据集，提供了基于MobileNetV2的管道的全面实验验证，通过三阶段迁移学习过程训练分类器，并通过TensorFlow Lite应用Float16量化。DenseNet基础的蒸馏和量化感知训练策略被描述为框架内的互补压缩方法，其完整的实证评估留待未来工作。MobileNetV2管道的实验结果表明，量化模型的验证准确率达到82.37%，相比之下全精度基线为82.20%，模型大小从35.34 MB减少到5.76 MB，压缩比为6.14倍，且没有显著的准确性损失。每类评估确认量化在所有四种肿瘤类别中均匀保持了诊断性能。这些发现表明，轻量级量化模型能够在资源受限的医疗环境中提供临床可行的脑肿瘤筛查。

View on arXiv Download PDF AI Translation

cs.CV / 16 / 2605.19210

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

D-凸性：通过准凹性实现统一可微凸形状先验的基于数据的图像分割

Chen, Shengzhe, Yan, Hao

Abstract

Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network's output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.

Chinese Translation

凸性是许多自然和人造结构的基本几何先验，但在端到端可训练的分割网络中有效施加凸性仍然具有挑战性。我们从函数的角度重新审视凸性，并提出了一种基于网络输出掩膜函数 u 的准凹性，统一且无阈值的凸性先验。我们要求 u 的所有超水平集都是凸的，而不是限制单一的二元分割，这将全局形状约束转化为 u 及其导数上的局部可微不等式。基于这一原则，我们推导出零阶、一阶和二阶特征化，分别得到局部中点凸化算法、与支持超平面相关的基于梯度的条件，以及以切平面上的二次形式表达的充分二阶不等式。一阶和二阶公式产生了一种紧凑的卷积损失，可以在图像上密集应用而无需阈值化。我们的准凹性损失通过提出的凸梯度投影模块（CGPM）与现代分割网络无缝集成。它们持续强制执行凸性，并在多个数据集上改善形状规则性，超越了专门针对视网膜分割的网络，并超过了之前的形状感知方法。值得注意的是，我们的分析统一了广泛的先前凸形状模型，从离散的 1-0-1 线约束和图切割凸性公式到基于曲率或符号距离拉普拉斯的水平集先验，形成了一个连续且可微的框架。

View on arXiv Download PDF AI Translation

cs.CV / 17 / 2605.19213

Smartphone-based Circular Plot Sampling for Forest Inventory

基于智能手机的森林清查圆形样地取样

Sun, Su, Chiu, Jui-Cheng, Khanal, Nabin, Fei, Songlin, Chen, Yingjie Victor

Abstract

Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot sampling based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%) respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.

Chinese Translation

圆形样地是森林清查的基础，但在这些样地中准确测量胸径（DBH）和空间位置仍然具有挑战性。传统方法依赖于昂贵的地面激光雷达（LiDAR）系统或劳动密集型的手动方法，这些方法涉及卡尺和指南针方位，限制了它们在大规模环境中的可扩展性和可及性。我们提出了一种轻量级的基于智能手机的管道，能够通过单次走访视频实现完整的样地取样树木测量，且不需要超出便携式支架上安装的消费级智能手机的专用硬件。该方法将预训练的单目深度估计和树木实例分割与同时定位与地图构建（SLAM）框架相结合，以共同优化视频序列中的相机轨迹和深度。通过将SLAM推导的相机姿态与分割的深度图融合，恢复树木位置和DBH估计，并通过校准的参考长度锚定绝对真实世界尺度。该系统在管理森林样地和自然森林样地中进行了评估，分别实现了1.51厘米（MARE 3.98%）和2.30厘米（MARE 5.69%）的平均绝对误差，并在不同起始方向和位置下表现一致。跨视频一致性分析进一步证明了在不同起始位置发起的测量中树木定位的稳定性和可重复性。所提出的方法在准确性上可与已建立的现场方法相媲美，同时显著降低了设备成本和操作复杂性，使其在多样化的操作环境中对专业研究人员和非专业森林管理者均具有可及性。

View on arXiv Download PDF AI Translation

cs.CV / 18 / 2605.19218

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

面向高效视觉-语言模型推理的旋转对齐关键通道剪枝

Kang, Beomseok, Jo, Dongwon, Song, Jiwon, Son, Donghwee, Kim, Jae-Joon

Abstract

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.

Chinese Translation

视觉-语言模型在推理时面临严重的 KV 缓存压力，因为单个图像通常会编码成数千个标记。大多数现有方法通过标记剪枝利用标记稀疏性，但永久丢弃视觉内容会在细粒度感知任务上造成显著退化。这促使我们关注一个互补的方向，即特征稀疏性：在固定的 KV 缓存预算下，压缩通道维度可以在相同的内存成本下保留更多的视觉标记。然而，先前的关键通道剪枝方法面临结构性权衡：逐标记的通道剪枝具有表达能力，但结构不明确且速度较慢，而基于头的剪枝方法则对硬件友好但鲁棒性较差。我们通过 RotateK 解决了这个问题，RotateK 是一个基于旋转的结构化关键通道剪枝框架。RotateK 应用在线 PCA 基于的旋转，将标记依赖的通道重要性对齐到共享的低维子空间，从而在轻量级的基于头的掩码下实现准确的剪枝；一个融合的 Triton 注意力核直接在稀疏通道的关键上操作，以实现高效解码。在两个代表性的 VLM 主干网络上的实验表明，RotateK 在准确性和解码延迟方面始终优于先前的关键通道剪枝方法，而联合标记-通道剪枝在匹配的 KV 缓存预算下也优于仅标记的基线。

View on arXiv Download PDF AI Translation

cs.CV / 19 / 2605.19223

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

HAVEN：统一视频理解的层次对齐多模态基准

Shi, Mengqi, Zhang, Haopeng

Abstract

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

Chinese Translation

尽管多模态大型语言模型（MLLMs）在标准视频任务上表现出色，但它们在忠实总结和推理复杂叙事方面的能力仍然评估不足。现有的摘要基准在孤立的粒度（如关键帧、关键镜头或不连贯的文本摘要）上分散监督，未能捕捉跨模态对齐的固有层次结构。为了解决这一关键问题，我们提出了HAVEN，一个用于统一视频理解的层次对齐多模态基准。HAVEN开创了一个完全粒度化（帧、镜头和视频层级）和完全多模态（视频和文本）数据集架构，包含模态之间的显式、连续对齐。在这一统一注释范式的基础上，我们提出了一套全面的评估工具，涵盖摘要、时间推理、多模态定位和显著性排名。对最先进的MLLMs进行的广泛基准测试揭示了表面文本流畅性与扎实多模态理解之间的持续差距。最终，HAVEN推动了多模态系统评估超越传统的问答格式，提供了一个严格、标准化的测试平台，以推动未来可解释的层次视频理解研究。我们将公开发布数据集、基准套件和评估协议。

View on arXiv Download PDF AI Translation

cs.CV / 20 / 2605.19230

Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

通过样本难度去相关化稳健减轻年龄依赖的混杂效应

Kurian, Nikhil Cherian, Parra, Victor Caquilpan, Shoby, Abin, Whitbread, Luke, Palmer, Lyle J.

Abstract

Age dependent performance disparities in medical image classification often arise because age acts as a confounder, linking imaging morphology with disease prevalence. In practice, disparities can manifest as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where prevalence is lower, and can worsen under train test shifts in the age distribution. Conventional mitigation approaches that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates the effects of age-dependent confounding by targeting spurious age linked trends rather than enforcing invariance. Following a warm-up phase, we characterize sample difficulty and model its age-dependent trends in a label-conditioned manner. We decorrelate age from dominant age difficulty trends using robust, Huber weighted affinity weights, attenuating confounding-driven shortcuts while preserving clinically meaningful, nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by minibatch age variance to ensure stable optimization under limited age diversity. Across two radiology datasets, our approach reduces age dependent true and false positive disparities with minimal AUC impact and remains robust to increasing train test age distribution shifts.

Chinese Translation

医学图像分类中的年龄依赖性性能差异通常是由于年龄作为混杂因素，将成像形态与疾病流行率联系起来。在实践中，这种差异可能表现为在疾病流行率较高的年龄段出现过度诊断，而在流行率较低的年龄段则出现不足诊断，并且在年龄分布的训练测试转变下可能加剧。传统的减轻方法通过强制严格的年龄不变性，可能会抑制编码在年龄中的具有诊断意义的信息。因此，我们提出了一个稳健的框架，通过针对虚假的与年龄相关的趋势，而不是强制不变性，来减轻年龄依赖的混杂效应。在热身阶段之后，我们以标签条件的方式表征样本难度并建模其年龄依赖的趋势。我们使用稳健的、Huber加权的亲和权重将年龄与主导的年龄难度趋势去相关化，减弱混杂驱动的捷径，同时保留临床上有意义的非线性年龄信息。我们进一步引入了年龄覆盖评分，该评分通过小批量年龄方差来缩放去相关化惩罚，以确保在有限年龄多样性下的稳定优化。在两个放射学数据集中，我们的方法在最小化AUC影响的同时，减少了年龄依赖的真实和假阳性差异，并对不断增加的训练测试年龄分布转变保持稳健。

View on arXiv Download PDF AI Translation

cs.CV / 21 / 2605.19242

PhyWorld: Physics-Faithful World Model for Video Generation

PhyWorld：用于视频生成的物理忠实世界模型

Zhao, Pu, Lin, Juyi, Rupprecht, Timothy, Akbari, Arash, Yang, Chence, Chowdhury, Rahul, Motamedi, Elaheh, Akbari, Arman, He, Yumei, Wang, Chen, Yuan, Geng, Chen, Weiwei, Wang, Yanzhi

Abstract

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.

Chinese Translation

世界模拟器可以为在真实世界部署之前训练物理人工智能系统提供安全且可扩展的环境。大型视频生成模型作为这种模拟器的基础，因其能够生成多样且逼真的视觉未来而备受关注。然而，将它们用作世界模拟器需要生成物理忠实的视频延续，即生成的视频必须保留由条件输入所暗示的物理状态，并以符合基本物理原则的方式演变。我们提出了PhyWorld，这是一种视频生成世界模型，旨在通过两阶段后训练生成时间上连贯且物理忠实的场景延续。在第一阶段，我们通过流匹配微调改进视频到视频的延续，鼓励跨帧保持稳定的视觉属性和连贯的运动动态。在第二阶段，我们使用直接偏好优化（Direct Preference Optimization, DPO）对物理偏好对齐生成的动态，引导模型朝向具有更高物理合理性的输出。为了评估PhyWorld，我们使用了标准视频质量基准和一个专门的物理忠实性基准，后者采用逐条法律评分。实验表明，PhyWorld提高了视频的一致性，在VBench上取得了0.769的平均分，而最先进的基线得分为0.756或更低。PhyWorld还提高了物理合理性，在我们的物理忠实性基准上达到了3.09的平均分，而最强基线为2.99。这些结果表明，通过延续和物理偏好信号对大型视频生成模型进行后训练，可以使其成为更有效的物理人工智能世界模拟器。

View on arXiv Download PDF AI Translation

cs.CV / 22 / 2605.19247

Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

结构化开放式神经架构搜索：利用大型语言模型的半自动设计知识结构化以实现高效的神经架构搜索

Sakuma, Yuiko, Yoshimura, Masakazu, Gröpl, Marcel, Sun, Zitang, Otsuka, Junji, Irie, Atsushi, Ohashi, Takeshi

Abstract

Current neural architecture search (NAS) methods are often limited by their predefined, restrictive search spaces. While recent large language model (LLM)-assisted NAS methods enable open-ended search spaces, they often suffer from inefficient exploration due to biased or low-quality design ideas. To address these issues, we propose to semi-automatically structure model design knowledge to guide the search process. Our approach first defines a high-level structural template of architectural attributes. An LLM then populates this template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To efficiently explore this vast space, we introduce FairNAD, using a multi-type mutation that enables broad exploration through mutation with fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We demonstrate the effectiveness of FairNAD in discovering high-performing architectures that yield 0.84, 2.17, and 2.35 points improvement on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively, compared to current state-of-the-art methods.

Chinese Translation

当前的神经架构搜索（NAS）方法常常受到预定义的、限制性的搜索空间的限制。尽管最近的基于大型语言模型（LLM）的NAS方法能够实现开放式搜索空间，但由于设计思路的偏见或低质量，它们往往面临效率低下的探索问题。为了解决这些问题，我们提出了一种半自动结构化模型设计知识的方法，以指导搜索过程。我们的方法首先定义了一个高层次的架构属性结构模板。然后，LLM通过分析论文来填充该模板，从而创建一个丰富多样的搜索空间，体现这种结构化设计知识。为了高效探索这一广阔空间，我们引入了FairNAD，采用多类型变异，能够通过公平的想法采样、Pareto感知变异、LLM驱动的迭代变异和细粒度反馈循环实现广泛探索。我们展示了FairNAD在发现高性能架构方面的有效性，与当前最先进的方法相比，在CIFAR-10、CIFAR-100和ImageNet16-120上分别提高了0.84、2.17和2.35分。

View on arXiv Download PDF AI Translation

cs.CV / 23 / 2605.19256

Distribution Matching Distillation without Fake Score Network

无假评分网络的分布匹配蒸馏

Kim, Youngjoong, Lee, Deokyeong, Park, Jaesik

Abstract

Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K $256 \times 256$ experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.

Chinese Translation

分布匹配蒸馏（Distribution Matching Distillation, DMD）为少步生成提供了一种有效的分布级修正，同时依赖于辅助假评分网络来跟踪不断演变的生成分布。近期的研究将DMD风格的目标与流映射生成器结合，以利用前向散度训练和反向散度修正。然而，假评分估计器仍然是一个额外的组件，带来了内存和更新的开销。在本研究中，我们探讨了当生成器本身具有流映射结构时，是否可以避免这个显式的跟踪器。我们提出了无假评分网络的DMD（Fake-Score-network-Free DMD, FSF-DMD），这是一个针对流映射生成器的DMD公式，它用生成器诱导的伪速度替代了辅助假评分估计器。关键观察是，流映射生成器的端点伪速度为假速度估计提供了一个可处理的代理，使得生成器本身能够提供反向散度信号。在此观察的基础上，我们推导出一个实用的目标，扩展了流映射一致的反向模拟，并引入了一种自教师变体以从头开始训练。在我们的ImageNet-1K $256 imes 256$ 实验中，FSF-DMD改善了流映射基线，在流映射初始化设置中达到了比列出的DMD2比较更低的FID，并在流匹配初始化和从头训练的情况下仍然有效。

View on arXiv Download PDF AI Translation

cs.CV / 24 / 2605.19279

FPED: A Functional-Network Prior-Guided Mixture-of-Experts Framework for Interpretable Brain Decoding

FPED：一种基于功能网络先验指导的专家混合框架用于可解释的脑解码

Ren, Yudan, Shi, Pengcheng, Ma, Zihan, He, Xiaowei, Li, Xiao

Abstract

Visual image reconstruction from functional Magnetic Resonance Imaging (fMRI) is a fundamental task in brain decoding, providing a crucial pathway for understanding human perceptual mechanisms and developing advanced brain-computer interfaces (BCIs). However, most current methods simply flatten fMRI signals from localized visual cortices into one-dimensional (1D) vectors, mapping them directly into latent spaces such as that of Contrastive Language-Image Pre-training (CLIP). This paradigm not only disrupts the inherent network topology of the brain-leading to limited neuroscientific interpretability-but also overlooks the synergistic contributions of other distributed functional networks in processing high-level visual semantics. To address these limitations, we propose FPED, a Functional-Network Prior-Guided Mixture of Experts (MoE) framework for interpretable brain decoding. FPED explicitly models different functional brain networks as specialized experts and employs adaptive routing to capture their complementary contributions to visual semantic understanding. Unlike conventional homogeneous decoding paradigms, our framework incorporates neurobiologically grounded priors to enable structured and interpretable network-level representation learning. Experimental results demonstrate that FPED achieves highly competitive semantic reconstruction performance with only 0.68B parameters. The learned routing dynamics reveal biologically meaningful correspondence between functional brain networks and modality-specific semantic processing, providing transparent neuroscientific interpretability. This suggests that brain network-aware expert modeling is a promising direction for bridging neural decoding and biologically inspired artificial intelligence.

Chinese Translation

从功能性磁共振成像（fMRI）中重建视觉图像是脑解码中的一项基本任务，为理解人类感知机制和开发先进的脑-机接口（BCI）提供了重要途径。然而，目前大多数方法仅将局部视觉皮层的fMRI信号展平为一维（1D）向量，并直接将其映射到对比语言-图像预训练（CLIP）等潜在空间中。这种范式不仅破坏了大脑固有的网络拓扑，导致神经科学解释能力有限，还忽视了其他分布式功能网络在处理高级视觉语义中的协同贡献。为了解决这些局限性，我们提出了FPED，一种基于功能网络先验指导的专家混合（MoE）框架，用于可解释的脑解码。FPED明确将不同的功能脑网络建模为专门的专家，并采用自适应路由来捕捉它们在视觉语义理解中的互补贡献。与传统的同质解码范式不同，我们的框架结合了神经生物学基础的先验，以实现结构化和可解释的网络级表示学习。实验结果表明，FPED在仅使用0.68B参数的情况下，达到了高度竞争的语义重建性能。学习到的路由动态揭示了功能脑网络与特定模态语义处理之间的生物学意义对应，提供了透明的神经科学解释能力。这表明，关注脑网络的专家建模是连接神经解码与生物启发的人工智能的一个有前景的方向。

View on arXiv Download PDF AI Translation

cs.CV / 25 / 2605.19289

What Makes Synthetic Data Effective in Image Segmentation

合成数据在图像分割中的有效性因素

Zhang, Jinjin, Guo, Xiefan, Jin, Yizhou, Zhou, Nan, Huang, Di

Abstract

Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.

Chinese Translation

随着大规模生成模型的快速发展，合成数据已成为视觉理解的一个有前景的解决方案。尽管现代扩散模型在生成逼真的图像方面取得了显著进展，但它们在复杂视觉分割任务中的潜力仍然未被充分探索。在本研究中，我们对来自最先进的扩散模型的合成图像进行了系统分析，以揭示其效用的决定因素。特别是，具有密集场景构成和精细实例保真度的合成图像展现出独特的优势，产生显著更具区分性的空间表示。基于这些见解，我们提出了SENSE，一个统一框架，利用灵活且可扩展的合成数据显著提升分割性能。值得注意的是，SENSE是模型无关的，兼容多种架构（例如，DPT和Mask2Former），并且在具有不同参数容量的模型中有效扩展。在Cityscapes、COCO和ADE20K上的大量实验验证了我们方法的有效性和泛化能力。代码可在 https://github.com/zhang0jhon/SENSE 获取。

View on arXiv Download PDF AI Translation

cs.CV / 26 / 2605.19301

iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

iGSP：隐式梯度子空间投影用于高效的视觉-语言模型持续学习

Cui, Xuezhi, Zhou, Dongbo, Guo, Wang, Wang, Zeyuan, Li, Ziyu, Zhou, Gaozhi, Li, Xian, Zhao, Ling, Yang, Wentao, Tao, Chao, Li, Haifeng

Abstract

Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue thatalignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7\% compared to current SOTA methods, and decreasing the final total parameters by 86.9\% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.

Chinese Translation

视觉-语言模型需要高效适应不断涌现的下游任务。虽然参数高效微调可以减轻灾难性遗忘，但为每个任务分配孤立模块会导致参数爆炸。相反，最近的基于相似性的共享机制错误地将表面视觉相似性等同于潜在的对齐一致性。这种根本的不匹配会在视觉上相似但逻辑上不同的任务之间引发严重的负迁移，并未能在视觉上多样的任务中利用对齐重用。我们认为，对齐共享根本上是一个几何问题，涉及在共享低秩子空间内重叠优化轨迹。基于这一见解，我们提出了iGSP，一个通过隐式梯度子空间投影实现高效适应的新框架。iGSP利用MoE（Mixture of Experts）路由器的早期收敛性来建立子空间基，分为两个阶段进行适应过程。首先，子空间识别阶段通过基预扩展引入候选专家，应用一种新颖的子空间约束正则化，将新任务梯度隐式投影到历史子空间，并通过将路由概率视为梯度流指示器来精确修剪冗余维度，最终最大化知识重用。其次，正交子空间微调阶段固定这一结构基，并去除正则化，以快速拟合任务特定的残差损失。在MTIL基准上的大量实验表明，iGSP在显著提高训练效率的同时，达到了最先进的准确率，与当前的SOTA（State of the Art）方法相比，平均可训练参数减少了42.7\%，最终总参数相较于同类方法减少了86.9\%。源代码可在 https://github.com/GeoX-Lab/iGSP 获取。

View on arXiv Download PDF AI Translation

cs.CV / 27 / 2605.19304

MMGS: 10$\times$ Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

MMGS：基于多视角排序的最优传输聚合实现的10$ imes$压缩3DGS

Zhao, Beizhen, Yu, Sicheng, Yin, Ziran, Shen, Dongxu, Wang, Hao

Abstract

While 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it suffers from significant overhead due to massive redundant primitives. Existing compression methods typically rely on local sampling or fixed pruning thresholds, which often struggle to balance redundancy reduction with high-fidelity rendering. To address this, we propose a novel framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our approach integrates three components: (1) we introduce a multi-view 3D Gaussian contribution ranking mechanism that filters primitives using geometric consistency instead of local heuristics; (2) we propose a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) we design an OT-based densification operator that maintains the Gaussian's distributional properties for stable optimization. Our approach achieves state-of-the-art rendering quality with only \textbf{10$\%$} primitives and \textbf{10$\times$} accelerated training speeds compared to vanilla 3DGS.

Chinese Translation

尽管3D高斯喷溅（3DGS）在3D重建方面取得了革命性进展，但由于大量冗余原语，其开销仍然显著。现有的压缩方法通常依赖于局部采样或固定的剪枝阈值，这往往难以在减少冗余和高保真渲染之间取得平衡。为了解决这个问题，我们提出了一种新颖的框架，将高斯优化公式化为全局几何分布匹配问题。具体而言，我们的方法整合了三个组成部分：（1）我们引入了一种多视角3D高斯贡献排名机制，该机制利用几何一致性而非局部启发式方法来过滤原语；（2）我们提出了一种基于全局最优传输（Optimal Transport, OT）的聚合算法，该算法在保留底层几何结构的同时合并冗余原语；（3）我们设计了一种基于OT的稠密化算子，保持高斯的分布特性以实现稳定优化。与传统的3DGS相比，我们的方法在仅使用10$ extbf{ ext{%}}$的原语和10$ extbf{ ext{x}}$的加速训练速度下，实现了最先进的渲染质量。

View on arXiv Download PDF AI Translation

cs.CV / 28 / 2605.19307

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

MetaRA：基于多模态大型语言模型的视觉问答系统的变形鲁棒性评估

Xu, Quanxing, Tian, Yuhao, Zhou, Ling, Zhong, Xian, Huang, Xiaohua, Huang, Rubing, Lin, Chia-Wen

Abstract

Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.

Chinese Translation

视觉问答（VQA）作为代表性的多模态任务，是评估多模态大型语言模型（MLLMs）推理能力的关键基准。然而，现有的评估主要依赖于静态数据集和基于准确率的指标，这些方法未能有效捕捉鲁棒性、一致性和泛化能力。受到变形测试（Metamorphic Testing, MT）的启发，我们提出了变形鲁棒性评估（MetaRA），这是一个测试框架，利用变形关系（Metamorphic Relations, MRs）系统性地探测基于MLLM的VQA系统中的脆弱性。MetaRA基于特定的MRs生成图像-问题输入的受控变体，并在多种条件下评估模型。将MetaRA应用于多个基于MLLM的VQA模型及不同任务，揭示了细微的失败模式，包括对语言扰动的敏感性、对表面视觉线索的过度依赖以及多模态推理中的更深层次弱点。实验结果表明，MetaRA提供的诊断洞察比传统的准确率指标更为丰富，揭示了在标准基准下隐藏的失败模式。总体而言，这项工作突显了在VQA中进行系统鲁棒性评估的必要性，并将变形评估定位为一种可扩展的、与模型无关的可信多模态人工智能方法。

View on arXiv Download PDF AI Translation

cs.CV / 29 / 2605.19319

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

SWEET：用于具身任务执行的稀疏世界建模与图像编辑

Song, Yiren, Wang, Yihan, Deng, Xiyao, Yan, Zhuoran, Shou, Mike Zheng

Abstract

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

Chinese Translation

视觉预测已成为具身控制的一个有前景的范式，其中生成未来观察并将其转化为动作。然而，密集视频生成计算成本高昂，并且对于许多操作任务而言，往往是不必要的，因为这些任务的进展可以通过少量与任务相关的视觉状态来概括。在本研究中，我们探讨图像编辑模型是否可以作为机器人操作的稀疏视觉世界模型，通过预测任务级未来状态而无需密集视频展开。我们首先在相同的机器人数据设置下，对视频生成模型Wan2.2和图像编辑模型FLUX-Kontext进行了受控比较，发现图像编辑生成了更可靠的任务级关键帧，具有更好的视觉保真度和显著更低的推理成本。基于这一观察，我们提出了SWEET，一个一次性稀疏视觉规划框架，通过连续的图像编辑逐步生成一系列与任务相关的操作关键帧，条件是语言指令和可选的基于箭头的空间指导。然后，一个目标条件的扩散动作预测器将相邻的想象关键帧转换为可执行的动作块。为了减少真实视觉子目标与编辑视觉子目标之间的不匹配，我们进一步引入了一种混合训练策略，使用过滤后的编辑目标。在DROID和RoboMimic上的实验表明，SWEET在已见和未见场景中改善了关键帧预测，并实现了从顺序关键帧规划到可执行机器人动作的完整流程，表明图像编辑是具身视觉预测的一个有前景且尚未充分探索的方向。

View on arXiv Download PDF AI Translation

cs.CV / 30 / 2605.19320

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

TextAlign：具有层次奖励的文本渲染偏好对齐

Cui, Mingxuan, Yang, Jingpu, Ji, Fengxian, Jiang, Qian, Shi, Zhecheng, Wang, Jiaming, Song, Zirui, Koto, Fajri, Chen, Xiuying

Abstract

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.

Chinese Translation

忠实的文本渲染仍然是大型文本到图像生成模型的一个持续弱点，因为它既需要遵循语义指令，又需要细粒度的字形结构。以往的方法通常通过特定于架构的模块或编码器修改来提高这一能力，这使得在基础模型中部署变得复杂。我们将文本渲染视为一种后训练偏好对齐问题，并提出了TextAlign，这是一种非侵入性的框架，保持生成器架构不变。关键组件是基于层次视觉-语言模型（VLM）的奖励，它将渲染错误分解为全局、单词和字形级别，然后将二元缺陷判断转换为标量偏好信号。生成的信号支持组相对策略优化（Group Relative Policy Optimization, GRPO）和直接偏好优化（Direct Preference Optimization, DPO）。在FLUX.1-dev和Z-Image-Turbo上的实验表明，在不降低总体生成质量的情况下，OCR基础的文本准确性持续提高。与强大的基础和文本渲染基线（包括SD3.5、Qwen-Image、AnyText和TextDiffuser）相比，这些结果表明奖励设计为改善文本渲染提供了一种可扩展的替代方案，而无需重新设计模型。

View on arXiv Download PDF AI Translation

cs.CV / 31 / 2605.19322

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

DynaTok：针对视频大语言模型的时间自适应和位置偏差感知的令牌压缩

Park, Minyoung, Kong, Taehun, Ahn, Sangjun

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks-MVBench, LongVideoBench, MLVU, and VideoMME-show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.

Chinese Translation

近期视频大语言模型（Video-LLMs）的进展极大地扩展了多模态推理能力。然而，从长视频序列中提取的大量视觉令牌带来了巨大的计算成本，限制了其在现实场景中的应用。现有的无训练令牌压缩方法基于注意力大小选择令牌作为语义重要性的代理，但往往忽视了位置偏差，仅依赖短期时间局部性，导致冗余的时空覆盖和低效的令牌使用。我们提出了DynaTok，一个无训练的、时间自适应的和偏差感知的令牌压缩框架，能够在时间和空间维度上分配令牌预算。通过轻量级的指数移动平均（EMA）记忆，时间预算分配（TBA）模块动态地将较少的令牌分配给冗余帧，而将更多的令牌分配给新颖帧，从而捕捉长期时间变化。空间预算分配（SBA）模块通过使用基于激活的注意力图选择空间上多样且语义重要的特征来补充这一点，同时利用空间记忆减少先前选择区域的冗余并缓解位置偏差。DynaTok能够无缝集成到现有的视频大语言模型中，如LLaVA-OneVision和LLaVA-Video，而无需重新训练，并在激进压缩下有效保持语义覆盖。在四个具有代表性的视频问答基准测试（MVBench、LongVideoBench、MLVU和VideoMME）上的实验表明，即使在90%的令牌减少情况下，DynaTok仍能保持超过95%的基线准确率，超越了近期的无训练方法。这些结果表明，DynaTok为高效且稳健的视频推理提供了一个有原则的基础，为未来视频大语言模型的实时流媒体视频理解铺平了道路。

View on arXiv Download PDF AI Translation

cs.CV / 32 / 2605.19329

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

RE-VLM：用于场景理解的事件增强视觉语言模型

Liu, Hanqing, Liu, Mingjie, Cui, Luoping, Lin, Endian, Jiang, Donghong, Zhu, Chuang

Abstract

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at https://github.com/bupt-ai-cz/RE-VLM.

Chinese Translation

传统的视觉语言模型（VLM）在恶劣条件下捕捉的场景（例如，低光照、高动态范围或快速运动）中难以进行解读，因为标准RGB图像在这些环境中会退化。事件相机提供了一种互补的模式：它们以高时间分辨率和宽动态范围异步记录每个像素的亮度变化，在帧失效的情况下保留运动线索。我们提出了RE-VLM，这是第一个双流视觉语言模型，联合利用RGB图像和事件流，以实现对正常和挑战性条件下的稳健场景理解。RE-VLM采用并行的RGB和事件编码器，并结合逐步训练策略，将异构视觉特征与语言对齐。为了解决RGB-事件-文本监督的稀缺性，我们进一步提出了一种图驱动的管道，将同步的RGB-事件流转换为可验证的场景图，从中合成标题和问答（QA）对。为了开发和评估RE-VLM，我们构建了两个数据集：PEOD-Chat，针对光照受限的场景，以及RGBE-Chat，涵盖多种场景。在标题生成和视觉问答基准测试中，RE-VLM在参数数量相当的情况下，始终优于最先进的仅RGB和仅事件模型，尤其在挑战性条件下表现出显著的提升。这些结果证明了事件增强的视觉语言模型在实现广泛真实世界环境中的稳健视觉语言理解方面的有效性。代码和数据集可在 https://github.com/bupt-ai-cz/RE-VLM 获取。

View on arXiv Download PDF AI Translation

cs.CV / 33 / 2605.19340

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

选择性、正则化和校准：利用视觉基础模型进行跨域少样本语义分割

Ma, Junyuan, Xiang, Xunzhi, Li, Wenbin, Fan, Qi, Gao, Yang

Abstract

Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

Chinese Translation

视觉基础模型（VFMs）在各种视觉任务中取得了强大的性能。然而，将VFMs应用于跨域少样本分割（CD-FSS）仍然具有挑战性，该任务在领域转移的情况下，仅使用少量标记样本对新类别的对象进行分割。挑战主要源于两个因素：（1）相对于VFM预训练的规模，每个新类别的标记样本有限，导致模型在再训练过程中容易过拟合；（2）预训练期间目标域的转移未得到充分代表，导致跨域不一致性和层级敏感性。为了解决这些问题，我们提出了分层示例表示适应（HERA），这是一个基于VFM的三阶段选择-正则化-校准分割框架，能够有效地从有限的标签中学习，并在不进行源数据再训练的情况下适应新域。我们首先设计了分层层选择（HLS），通过为每个候选层计算数据依赖的示例转移风险（ETR），自适应地识别最具信息量的VFM层。然后，先验引导正则化（PGR）对所选表示上的交互进行正则化，为后续阶段提供结构良好的局部信号。此外，逐像素自适应校准（PAC）将所选表示与精炼的交互图结合，以校准逐像素预测，从而生成一致的掩码。这些阶段共同形成了一个分层选择-正则化-校准管道，在测试时指导冻结的VFM特征在新域中，同时微调不到2.7%的参数。大量实验表明，HERA在多个CD-FSS基准测试中超过了当前最先进的技术，提升了超过4.1 mIoU。

View on arXiv Download PDF AI Translation

cs.CV / 34 / 2605.19342

Semantic-Enriched Latent Visual Reasoning

语义丰富的潜在视觉推理

Xu, Tianrun, Sun, Yue, Wang, Qixun, Lu, Jingyi, Wang, Yuan, Zhang, Tianren, Guo, Longteng, Rao, Fengyun, Lyu, Jing, Chen, Feng, Liu, Jing

Abstract

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

Chinese Translation

多模态潜在空间推理旨在通过在紧凑的潜在空间中直接进行视觉推理，取代以图像为基础的显性思维。然而，现有的方法在很大程度上依赖于视觉监督，产生的潜在表示缺乏足够的语义丰富性，限制了其支持多样化区域级推理任务的能力。在本研究中，我们提出了语义丰富的潜在视觉推理（Semantic-Enriched Latent Visual Reasoning, SLVR），这是一种两阶段学习框架，通过属性级视觉语义丰富潜在表示，并将其与多样化的推理目标对齐。在第一阶段，SLVR在细粒度属性监督下学习语义丰富的区域中心潜在表示。在第二阶段，我们设计了多查询组相对策略优化（Multi-query Group Relative Policy Optimization, M-GRPO），以对齐基于同一区域的多个查询的潜在表示。为了支持这一框架，我们构建了SLV-Set，包含约40万条区域级属性注释和80万条多查询问答样本，并引入了SV-QA，一个评估语义变化下潜在推理的基准。实验表明，与现有基线相比，SLVR提高了潜在视觉推理的鲁棒性和语义一致性。

View on arXiv Download PDF AI Translation

cs.CV / 35 / 2605.19359

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

MAM-CLIP：基于乳腺X光图谱的视觉-语言预训练用于BI-RADS分类

Gulluk, Halil Ibrahim, Gevaert, Olivier

Abstract

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP

Chinese Translation

深度学习方法在从乳腺X光图像预测BI-RADS评分方面显示出了良好的效果。然而，这些图像的解读可能会有所不同，甚至在放射科医生之间也会存在差异。鉴于乳腺X光图像的固有复杂性，仅依靠图像标签训练分类模型往往会导致性能有限。为了解决这一挑战，我们从两个乳腺X光图谱中整理了2313幅乳腺X光图像及其对应的描述。我们提出的方法采用了一个多模态模型，该模型使用预训练的PubMedBERT作为语言组件。通过对图像-文本对进行对比学习来训练该模型，我们使视觉编码器能够吸收描述中蕴含的丰富信息，从而提升其对乳腺X光发现的理解。随后，我们在两个数据集上对视觉编码器进行了微调，以进行BI-RADS预测，取得了优于未进行此预训练的模型的性能，特别是在标注样本稀缺的情况下。3类平均F1分数的提升范围为+1%至+14%：在40K训练样本时提高1%，在1K样本时提高14%。此外，我们的实验表明，从乳腺X光图谱中获得的2K图像-文本对在标签预测方面比2K标注样本更具信息量，当可用的训练样本超过10K时，平均提升幅度为+1.1%。总体而言，我们的工作为乳腺X光提供了一个视觉-语言模型，并突显了乳腺X光图谱中文本信息的价值。此外，我们公开发布了TEKNOFEST数据集的预处理乳腺X光图像。训练代码、预训练模型权重、数据提取脚本以及发布的数据集均可在以下网址获取：https://github.com/igulluk/MAM-CLIP

View on arXiv Download PDF AI Translation

cs.CV / 36 / 2605.19360

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

可扩展的、节能的光神经架构用于多路复用深伪视频检测

Kashani, Parnian Ghapandar, Chen, Shiqi, Ozcan, Aydogan

Abstract

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.

Chinese Translation

人工智能生成的视觉媒体的快速传播使得高效、可信的深伪检测系统的需求变得迫切。然而，现有的基于深度学习的检测方法依赖于计算密集型和能量需求高的推理算法，限制了它们的可扩展性。在此，我们提出了一种混合数字-模拟深伪视频检测框架，该框架结合了轻量级数字前端和空间多路复用光解码后端，通过可编程空间光调制器实现大规模并行的模拟推理。该系统能够在单次光传播过程中同时处理15个或更多的视频流，相较于纯数字方法，能够以较低的计算成本实现高通量和准确的视频级真实性预测。我们使用涵盖经典换脸、真实世界深伪录音和完全AI生成视频的不同数据集验证了这一混合深伪视频处理器。在可见光谱下运行的空间多路复用实验设置中，我们在Celeb-DF视频数据集上实现了平均深伪检测准确率、灵敏度和特异性分别为97.79%、99.86%和95.72%，并在每次推理中并行测试15个视频。多路复用光解码器还展示了对各种类型的视频退化、噪声、压缩、实验失调和黑箱对抗攻击的抵抗力。我们的结果表明，将光计算集成到AI推理中能够同时提高通量、能效和对抗鲁棒性——这三种特性在纯数字系统中难以同时实现。

View on arXiv Download PDF AI Translation

cs.CV / 37 / 2605.19371

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

多尺度生成建模与热耗散流匹配

Ma, Jun, Zhang, Hanquan, Qin, Yanjun, Guan, Haoyuan, Zhang, Ke

Abstract

Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, preserving better color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process faces an ill-posed challenge. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blurred (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts $x$-prediction to mitigate high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and $x$-prediction. The performance of HDFM outperforms most baseline methods on all datasets.

Chinese Translation

扩散模型广泛应用于图像生成，大多数模型依赖于基于噪声的损坏和去噪。然而，另一种独特的分支则将模糊作为主要损坏，利用多尺度先验更好地保留色彩预算和多尺度细节。然而，基于模糊的模型仍然停留在基于随机微分方程（SDE）的框架中，尚未与基于常微分方程（ODE）的框架（如流匹配（Flow Matching, FM））相结合。同时，在基于模糊的公式中，经典的逆热耗散（Inverse Heat-Dissipation, IHD）过程面临着不适定的挑战。此外，在数据流形假设下，从高维噪声（或速度）空间回归模糊图像也很困难。我们提出了热耗散流匹配（Heat Dissipation Flow Matching, HDFM），该方法将连续的模糊（热耗散）过程引入FM，以注入多尺度先验。HDFM对齐插值的热耗散路径以解决不适定性，并采用$x$-预测来减轻高维回归的难度。玩具实验和消融研究表明，HDFM始终受益于模糊和$x$-预测。在所有数据集上，HDFM的性能优于大多数基线方法。

View on arXiv Download PDF AI Translation

cs.CV / 38 / 2605.19374

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

基于概念引导的噪声负样本抑制用于胸部X光发现的零样本分类与定位

Lian, Chenyu, Zhou, Hong-Yu, Wong, Chun-Ka, Qin, Jing

Abstract

Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.

Chinese Translation

利用胸部X光片和放射学报告进行视觉-语言对齐已成为零样本分类和胸部X光发现定位的先进范式。然而，标准的对比学习通常将来自不同患者的X光片和报告简单地视为负样本对。这一假设引入了噪声负样本，因为不同患者经常表现出相似的发现。这些噪声负样本导致语义模糊，并降低了零样本理解任务的性能。为了解决这一挑战，我们提出了CoNNS（基于概念引导的噪声负样本抑制）框架。为了支持负样本抑制机制，与之前使用原始报告或模板化文本的方法不同，我们使用大型语言模型构建了一个分层概念本体。该本体通过明确建模存在性、属性（位置和特征）和文本（证据段和存在声明）来构建41个关键临床概念。利用这一本体，我们实施了一种跨患者对重标记策略，包括三个步骤：（1）细粒度拆分，根据发现的存在性对对进行分类；（2）噪声负样本过滤，通过去除假阴性来解决语义冲突；（3）困难负样本挖掘，利用轻量级语言模型识别细微的属性差异。最后，我们提出了一种概念感知的NCE损失，以在抑制识别出的噪声负样本的同时，将视觉特征与文本对齐。在多粒度零样本定位任务和五个零样本分类数据集上的广泛实验验证了CoNNS优于现有的最先进模型。代码可在https://github.com/DopamineLcy/conns获取。

View on arXiv Download PDF AI Translation

cs.CV / 39 / 2605.19378

Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers:Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

视觉扩散变换器中的稀疏专家混合路由：从路由崩溃到选择性死锁的诊断、边界校准与演化路线图

Sha, Haiying

Abstract

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.

Chinese Translation

本文系统地诊断了视频扩散变换器中Token-Choice稀疏专家混合模型（MoE）的训练失败模式。我们从一个约50亿参数的预训练密集模型出发，按照三个原则将其转换为MoE架构：路由的专家完全克隆原始前馈网络（FFN）权重，共享专家初始化为零以进行验证，然后在实际训练中初始化为极小的非零噪声，而只有门控网络从随机初始化开始。实验揭示了五种失败模式的层次结构：（1）线性路由器遭受全局软饱和，导致专家完全同质化；（2）多层感知器（MLP）路由器引入选择性死锁，约三分之一的层退化为单专家模式，增加辅助损失无法阻止这种退化；（3）交叉注意力路由器表现出初步的自我恢复，但约九层仍顽固地死锁；（4）死锁层呈现U型分布，集中在浅层视觉处理层和深层语义融合层；（5）bfloat16混合精度导致微小的权重更新被硬件截断为零。基于在5000个训练步骤中对6500万个token的路由决策时间序列，我们提出了功能冗余假说：死锁是在共享专家在门控共享专家-路由专家三元系统中成熟之前的一种合理等待策略。该假说得到了系统生物学中功能冗余理论的支持。在工程方面，我们总结了从密集到MoE转换的三条法则，并提供了bfloat16精度陷阱的完整解决方案。我们校准了Token-Choice范式的当前能力边界，并概述了从视觉统一到世界模型的三步演化路线图。

View on arXiv Download PDF AI Translation

cs.CV / 40 / 2605.19386

MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

MatPhys：从视频中学习材料感知的物理参数以进行可变形物体模拟

Yang, Yang, Wang, Yiyan, Liu, Zheming, Iwamoto, Naoya

Abstract

Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.

Chinese Translation

重建适用于仿真的可变形物体对于计算机视觉、图形学和机器人技术至关重要。现有的物理驱动方法能够从视频中恢复物理数字双胞胎，但它们存在两个基本限制：通常假设整个物体具有均匀材料，并且其场景特定的逆优化结合单目观察的固有模糊性，导致在不同场景或交互中同一材料的参数不一致。我们提出了MatPhys，一种材料感知的前馈框架，从单视角视频中预测弹簧-质量参数，解决这两个问题。为了放宽均匀材料假设，我们使用DINO特征将物体分解为语义上有意义的部分，并查询部分级材料先验，为每个部分分配其独特的物理行为。为了强制跨场景一致性，我们引入了一个学习的材料词典，作为外观与物理之间的桥梁，进一步使用部分级先验作为参考分布，约束解码器，使得相同材料在不同场景和交互中产生一致的参数。这些设计将一个欠约束的单目问题转变为基于共享、可重用材料概念的前馈推理。实验表明，我们的方法在重建和未来预测方面与每场景优化基线相匹配，同时在未见交互和物体上实现了更强的泛化能力，具有更一致的物理参数。

View on arXiv Download PDF AI Translation

cs.CV / 41 / 2605.19390

LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

LMM-Track4D：通过轨迹驱动对话激发LMM中的4D动态推理

Li, Chaoyue, Xu, Yongxue, Feng, Jie, Ding, Jiayu

Abstract

Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray--Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.

Chinese Translation

近年来，大型多模态模型（LMMs）在图像和视频理解方面的能力不断提升，但在维持4D连续时空动态推理方面仍然存在困难。为了研究这一能力差距，我们提出了轨迹驱动的多轮时空对话，这是一项新任务，模型必须在回答时空查询的同时，返回整个短视频片段或较长片段的指定部分的结构化3D目标轨迹，并引入了Track4D-Bench，这是一个包含526个片段级对话样本、覆盖23.5k帧和7.5k对象注释的基准，用于训练和评估。在此任务的基础上，我们提出了LMM-Track4D，它结合了RTGE（Ray-Time Geometry Encoding）、用于长时间动态传播的专用流状态标记TRK，以及用于在遮挡和视角变化下进行稳定4步3D状态估计的对象槽运动学残差锚（OSK-RA）解码器。在Track4D-Bench上的实验显示，相较于强基线模型有一致的提升，表明显式动态状态建模是激发LMM中4D动态推理的有效设计原则。我们的代码和数据集将公开发布在https://github.com/mikubaka88/LMM-Track4D。

View on arXiv Download PDF AI Translation

cs.CV / 42 / 2605.19393

Neuron Incidence Redistribution for Fairness in Medical Image Classification

医学图像分类中的神经元发生重分配以实现公平性

Shoby, Abin, Palmer, Lyle John, Kurian, Nikhil Cherian

Abstract

Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.

Chinese Translation

医学图像分类的深度学习模型容易受到年龄、性别和种族等人口属性引起的子群体性能差异的影响。我们识别出这些差异背后的潜在表征机制：在迁移学习模型中，正向预测下主导的倒数第二层激活通道同时被疾病阳性样本和特权人口群体（男性、年长患者）共同激活，导致过度诊断；相反，负向预测下的主导通道则由弱势群体（女性、年轻患者）共同激活，导致系统性缺诊。为了解决这个问题，我们提出了神经元发生重分配（Neuron Incidence Redistribution, NIR），这是一种轻量级的正则化方法，通过惩罚倒数第二层神经元的预测概率加权均值激活的方差，且在训练时不需要人口标签。在HAM10000数据集上，年龄组的真正率（TPR）差异从10.81%降至0.93%，性别差异从12.04%降至0.74%，AUC的边际改善为0.51点。在哈佛OCT-RNFL数据集上，NIR将种族的假阳性率（FPR）差异从15.68%降至10.66%，年龄差异从12.69%降至1.80%，证明在整个倒数第二层中分配潜在疾病证据是一种原则性且有效的策略，以改善医学人工智能中的人口公平性。

View on arXiv Download PDF AI Translation

cs.CV / 43 / 2605.19398

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

重新平衡参考帧主导性以改善图像到视频模型中的运动

Jeon, Wooseok, Park, Seungho, Shin, Seunghyun, Lee, Sangeyl, Jeong, Hyeonho, Jeon, Hae-Gon

Abstract

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify \emph{reference-frame dominance} as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS~(Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.

Chinese Translation

图像到视频模型生成的视频往往相较于文本到视频模型显得过于静态。虽然先前的方法通过削弱或修改图像条件信号来缓解这一问题，但它们通常需要额外的训练或牺牲对参考图像的保真度。在本研究中，我们将 extit{参考帧主导性}识别为运动抑制的关键机制。我们观察到，在图像到视频（I2V）模型中，非参考帧对参考帧关键标记分配了过多的自注意力，导致参考信息在时间上被过度传播，从而抑制了帧间动态。基于这一发现，我们提出了DyMoS（动态运动滑块），这是一种无训练且模型无关的方法，在初始去噪步骤中重新平衡从生成帧到参考帧的注意力路径。DyMoS保持输入图像和模型权重不变，并引入一个单一的标量参数以持续控制运动强度。在多个最先进的I2V骨干网络上的实验表明，DyMoS在保持视觉质量和对参考图像的保真度的同时，始终改善了运动动态。

View on arXiv Download PDF AI Translation

cs.CV / 44 / 2605.19410

Vision Harnessing Agent for Open Ad-hoc Segmentation

开放式临时分割的视觉驱动代理

Wang, Zilin, Yu, Stella X.

Abstract

Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: Programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.

Chinese Translation

当概念已知时，分割变得简单，这需要从文本中检索已学习的视觉基础。然而，对于开放式临时概念，基础可能并不存在于一个已学习的掩膜中，通常必须通过图像证据中的部分、关系、排除和集合来构建。我们提出了一种视觉驱动的临时分割代理（Vision-guided Ad-hoc Segmentation Agent, VASA），这是首个用于开放式临时分割的视觉驱动代理。VASA 不需要训练，结合了视觉语言模型（VLM）代理、分割基础模型和视觉基础工作流程。VASA 不仅仅是修正文本提示，而是使用持久的工作掩膜进行推理、构建和验证解决方案。它规划视觉操作，调用分割工具，检查结果，编辑掩膜，并从错误中恢复。我们构建了 PARS，这是一个新的基准，通过长形式定义查询将 PartImageNet 中的部分级标签转化为开放式临时概念。在 PARS 上，VASA 超越了开放词汇、基于推理和代理基线，超过 SAM3 代理 14-25%。在 RefCOCOm 上，一个标准的多粒度引用分割基准，VASA 比 SAM3 代理提高了 5-9%，并且比其他代理基线提高了多达 20%。这些结果验证了开放式临时分割的代理视觉构建。我们的工作为 AI 代理指明了一条超越将基础模型作为工具的路径：通过任务知识、VLM 行为、视觉例程、工作记忆和故障意识工作流程对其进行编程。

View on arXiv Download PDF AI Translation

cs.CV / 45 / 2605.19435

KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

KappaPlace：通过原型锚定监督学习超球面不确定性以实现视觉地点识别

Yanko, Maya, Shavit, Yoli

Abstract

Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: https://github.com/mayayank95/UncertaintyAwareVPR

Chinese Translation

视觉地点识别（VPR）对于自主导航至关重要，但现有的最先进方法缺乏良好校准的不确定性估计。标准流程无法可靠地指示查询何时模糊或匹配可能不正确，这在安全关键的机器人技术中存在风险。我们提出了KappaPlace，一个学习不确定性感知VPR表示的原则性框架。我们的核心贡献是原型锚定监督策略，它利用潜在类别代表作为概率目标的目标。通过将图像描述符建模为冯·米塞斯-费舍尔（von Mises-Fisher, vMF）变量，我们学习了一个轻量级模块，以预测浓度参数，作为对随机不确定性的直接代理。尽管现有的VPR不确定性方法通常限制于查询中心视角，我们推导出了一种新的匹配级别公式，以量化特定查询-参考对的可靠性。在五个不同的基准测试中，KappaPlace将期望校准误差（Expected Calibration Error, ECE@K）相比现有方法降低了多达50%，同时保持或提高了检索召回率。我们提供了一个联合训练变体和一个针对冻结主干的后训练扩展。我们的结果表明，KappaPlace提供了一个稳健、稳定且良好校准的信号，使得在VPR流程中能够进行可靠的决策。我们的代码可在以下链接获取：https://github.com/mayayank95/UncertaintyAwareVPR

View on arXiv Download PDF AI Translation

cs.CV / 46 / 2605.19446

Targeted Downstream-Agnostic Attack

针对性下游无关攻击

Lei, Zhuxin, Yang, Ziyuan, Zhang, Yi

Abstract

Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.

Chinese Translation

近年来，预训练编码器因其在特征提取方面的强大能力而得到广泛应用。然而，它们对下游无关攻击（DAAs）存在脆弱性。现有的DAA方法在一个宽松的威胁模型下运行，即只要生成的下游无关对抗样本（DAEs）改变了原始预测，攻击就被视为成功，而不需要特定的目标。在本文中，我们提出了一种在更严格威胁模型下的针对性DAA（TDAA）方法，该模型要求攻击既要有针对性又要下游无关。由于下游任务未知且编码器不直接产生预测，实现针对性攻击尤其具有挑战性。为了解决这个问题，我们引入了一种新颖的组件，称为“威胁图像”，由攻击者预先选择作为目标。具体而言，我们设计了一个生成器，用于生成特定示例的对抗扰动，迫使受害编码器对DAEs和威胁图像输出相同的特征。与之前的DAA方法为所有样本生成单一共享扰动的做法不同，这种方法往往因图像多样性而失败，我们的方法采用了特定示例的范式。这为每个图像生成量身定制的扰动，以确保高攻击成功率和隐蔽性。通过利用威胁图像作为特征级锚点，我们的方法构建了一个任务无关的桥梁，以揭示受害编码器的脆弱性。在3个基准数据集上对10种自监督方法进行的广泛实验表明了我们方法的有效性，并揭示了预训练编码器的显著脆弱性。代码将在审稿期结束后公开发布。

View on arXiv Download PDF AI Translation

cs.CV / 47 / 2605.19484

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

CutVerse：一个用于媒体后期制作编辑的组合式图形用户界面代理基准测试

Hu, Haobo, Guo, Xiangwu, Chen, Zhiheng, Gao, Difei, Liu, Haotian, Jin, Libiao, Mao, Qi

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0\% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark.While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

Chinese Translation

尽管图形用户界面（GUI）代理在网页导航和基本操作系统任务方面取得了显著进展，但它们在专业创意工作流程中的能力仍然 largely 未被充分探索。为了解决这一问题，我们推出了 Cutverse，一个旨在系统评估自主 GUI 代理在真实媒体后期制作环境中的基准测试。我们在 7 个专业应用程序（例如 Premiere Pro、Photoshop）中策划了专家演示，涵盖 186 个复杂的长期任务，这些任务基于真实的编辑工作流程，涉及密集的多模态接口和紧密耦合的交互序列。为了支持可扩展的评估，我们开发了一种轻量级解析器，将原始屏幕录制和低级交互日志转换为结构化的组合式 GUI 动作轨迹，并具有精确的基础。广泛的评估表明，现有代理在真实媒体编辑任务中的任务成功率仅为 36.0\%，突显了我们基准测试中复杂、长期媒体后期制作工作流程所带来的挑战。虽然当前模型在空间基础、多模态对齐和协调动作执行方面表现出良好的前景，但在长期可靠性和领域特定规划方面仍然有限。

View on arXiv Download PDF AI Translation

cs.CV / 48 / 2605.19491

Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning

在尺度中思考：通过自适应连续推理加速千兆像素病理图像分析

Ge, Jiusong, Zhan, Yingkang, Zhao, Wenjie, Zhang, Di, Wang, Ke, Liu, Jiashuai, Yang, Chunze, Li, Chengzu, Zhang, Jian, Dong, Yuxin, Zhang, Ni, Liu, Qidong, Crispin-Ortuzar, Mireia, Fu, Huazhu, Li, Chen, Gao, Zeyu

Abstract

Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at https://github.com/JSGe-AI/PathCTM.

Chinese Translation

传统的全幻灯片图像（WSI）分析方法通常依赖于多实例学习（MIL）范式，该范式在高放大倍率下提取补丁级特征并将其聚合以进行幻灯片级预测。然而，这种全面的补丁级处理计算成本高昂，严重限制了WSI分析的效率和可扩展性。为了解决这一挑战，我们提出了PathCTM（面向病理的连续思维模型），该模型能够实现千兆像素WSI的高效尺度空间连续推理。PathCTM将诊断推理公式化为动态顺序信息追踪。它逐步从低放大倍率的全局视角过渡到高放大倍率的局部检查，并在收集到足够证据以有效界定决策不确定性时自适应终止推理。具体而言，它使用条件计算进行动态尺度切换，并结合注意力引导的区域修剪和基于置信度的提前停止。大量实验表明，与标准的基于MIL的方法相比，PathCTM将所需图像补丁的数量减少了95.95%，并将推理时间缩短了约95.62%，同时保持了AUC不降。代码可在 https://github.com/JSGe-AI/PathCTM 获取。

View on arXiv Download PDF AI Translation

cs.CV / 49 / 2605.19506

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

EventPrune：级联事件辅助的令牌剪枝用于高效的第一人称动态空间推理

Ma, Pengtao, Zhou, Ziliang, Ruan, Ciyu, Wang, Haoyang, Li, Kaiyuan, Gong, Zihang, Ding, Wenhua, Gao, Chen, Xu, Jingao, Chen, Xinlei

Abstract

First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.

Chinese Translation

第一人称动态空间推理要求模型跟踪连续运动和精确的几何结构，但基于Transformer的视频大语言模型（Video-LLMs）的二次注意力成本使得密集视觉令牌的计算开销巨大。现有的令牌剪枝范式主要依赖于离散的静态快照，未能保留对推理至关重要的运动和几何线索。我们提出了事件级联剪枝（Event Cascade Pruning, ECP），据我们所知，这是第一个不需要训练的框架，利用事件摄像头提供的高频运动线索作为连续的事件引导运动先验来指导令牌选择。ECP结合了三个阶段：事件触发的因果采样（Event-Triggered Causal Sampling）用于锚定运动信息关键帧，事件引导的运动显著性过滤（Event-guided Motion Saliency Filtering）用于抑制事件不活跃的视觉令牌，以及事件注意力排名融合（Event-Attention Ranking Fusion）用于将空间注意力与运动显著动态进行校准。在减少80%视觉令牌的情况下，ECP在准确率上超越了全令牌基线（37.62%对比36.31%），同时实现了1.89倍的推理加速和52%的GFLOPs减少。我们进一步介绍了ESR-Real，这是第一个针对第一人称空间推理的真实世界RGB-事件基准，其中ECP的准确率比全令牌基线提高了2.68个百分点。

View on arXiv Download PDF AI Translation

cs.CV / 50 / 2605.19510

Return of Frustratingly Easy Unsupervised Video Domain Adaptation

令人沮丧的简单无监督视频领域适应的回归

Wei, Pengfei, Sun, Yiqun, Xu, Zhiqiang, Ke, Yiping, Hsieh, Lawrence B.

Abstract

Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.

Chinese Translation

无监督视频领域适应（UVDA）是一个实际但尚未充分探索的问题。本文提出了一种令人沮丧的简单UVDA方法，称为MetaTrans。具体而言，MetaTrans采用了一个简洁的学习目标，仅包含两个基本损失项。尽管学习目标简单，MetaTrans却体现了一个先进的UVDA理念，即通过微妙的模型架构设计，分别处理跨领域视频的空间和时间差异。通过实现一个时间-静态减法模块，MetaTrans有效地消除了空间和时间的差异。广泛的实证评估，特别是在各种跨领域动作识别任务上，显示出显著的绝对适应性能提升，并且与最先进的UVDA基线相比，具有显著优越的相对性能提升。

View on arXiv Download PDF AI Translation

cs.CV / 51 / 2605.19511

Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

水印图像可编辑吗？用于水印保护的文本引导图像编辑的 SafeMark

Wu, Xiaodong, Li, Qi, Li, Xiangman, Zhang, Zelin, Liu, Lingshuang, Ni, Jianbing

Abstract

This paper investigates a fundamental yet underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image manipulation that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective, fine-tuning the editor so that semantically valid edits also preserve the embedded watermark at the final output. This design admits a clean information-theoretic justification: maintaining high bit-accuracy on the edited image lower-bounds the mutual information that the editor channel preserves between watermark and edited output, the quantity that fundamentally controls watermark recoverability. SafeMark is compatible with differentiable diffusion-based editors, and requires no architectural modification. Extensive evaluations across multiple datasets, text-guided editing methods, and post-edit distortion settings demonstrate that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results demonstrate that semantic editability and watermark integrity are fundamentally compatible, enabling trustworthy image provenance in generative editing pipelines.

Chinese Translation

本文探讨了一个基本但尚未深入研究的问题：水印图像是否可以在不损害水印完整性的情况下保持可编辑性？我们提出了 SafeMark，一个用于水印保护的文本引导图像操作框架，该框架明确将水印完整性整合到编辑过程中。具体而言，SafeMark 在扩散编辑器的训练目标中直接添加了一个阈值水印解码损失，微调编辑器，使得语义有效的编辑在最终输出中也能保留嵌入的水印。该设计具有清晰的信息论基础：在编辑图像上保持高比特准确性下限了编辑器通道在水印和编辑输出之间所保留的互信息量，而这一量正是控制水印可恢复性的关键。SafeMark 兼容可微分的基于扩散的编辑器，并且不需要架构修改。对多个数据集、文本引导编辑方法和后编辑失真设置的广泛评估表明，SafeMark 在多样化的编辑设置中实现了高水印比特准确性，同时保持高质量的语义编辑，而不牺牲对常见后编辑失真的鲁棒性。这些结果表明，语义可编辑性和水印完整性在根本上是兼容的，从而在生成编辑管道中实现了可信的图像来源。

View on arXiv Download PDF AI Translation

cs.CV / 52 / 2605.19522

iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

iDiff：可解释的差异感知框架用于成对图像质量评估

Yue, Xinli, Sun, JianHui, Shao, Tao, Yao, Liangchao, Xia, Fan, Deng, Yuetang

Abstract

Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

Chinese Translation

专业摄影中的成对图像质量评估（IQA）要求模型不仅能够在两个候选图像之间识别出优选图像，还需提供令人信服且基于图像的推理。在NTIRE 2026 RAIM挑战赛中，这一要求通过共同评估偏好预测和推理生成得到了进一步强调。为了解决这一任务，我们提出了iDiff，一个可解释的差异感知框架用于成对图像质量评估。我们的方法采用双支路设计，包括答案模型（Answer Model）和思维模型（Thinking Model）。答案模型通过明确地将每个样本分解为左/右全局和局部视图，随后针对人物和场景图像进行内容感知的专业化，并在主干网络之间进行基于集成的聚合，从而实现稳健的偏好预测。思维模型则专注于推理生成，并通过专家风格的模板、多源质量特征以及基于答案模型预测的答案感知监督逐步增强。通过这种方式，iDiff共同建模了判别决策和结构化解释，提高了稳健性和可解释性。大量实验表明，所提框架在准确性和推理质量指标上均表现出色。我们的方法在NTIRE 2026 RAIM挑战赛中获得第一名，展示了将显式差异建模与结构化多模态推理相结合用于成对IQA的有效性。

View on arXiv Download PDF AI Translation

cs.CV / 53 / 2605.19527

Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

基于混合视觉编码器的双提示 CLIP 模型用于遮挡行人重识别

Ji, Zhangjian, Qiao, Shaotong, Feng, Kai, Wei, Wei

Abstract

Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models only focus on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Based on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which can utilize textual cues to capture complete pedestrian semantics and keep robustness against occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in real word to enrich occluded samples. In addition, we also design a Weighted Gated Feature Fusion (WGFF) method, which in corporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves the state-of-the art performance. The occlusion instance library are available at https://github.com/stone-qiao/DPL-ReID.

Chinese Translation

遮挡行人重识别关注于在多个摄像头视角下匹配部分可见的行人。然而，遮挡会干扰身体区域线索，从而使得跨视角匹配变得复杂。大多数基于预训练视觉-语言模型的行人重识别方法仅关注于增强基于提示的特征学习，而忽视了遮挡物的语义信息。基于 CLIP-ReID 的成功，我们提出了一种新颖的双提示学习重识别模型（DPL-ReID）用于遮挡行人重识别。该模型结合了双提示学习（Dual-PL）策略，可以利用文本线索捕捉完整的行人语义，并保持对遮挡的鲁棒性，同时引入了一种真实世界遮挡增强（RWOA）方法，真实模拟现实中遇到的遮挡场景，以丰富遮挡样本。此外，我们还设计了一种加权门控特征融合（WGFF）方法，结合 LSNet 捕捉全局信息并作为特征门控机制。该机制可以有效引导 CLIP 视觉编码器生成更全面的特征表示。在多个基准遮挡重识别数据集上的大量实验表明，我们提出的 DPL-ReID 达到了最先进的性能。遮挡实例库可在 https://github.com/stone-qiao/DPL-ReID 获取。

View on arXiv Download PDF AI Translation

cs.CV / 54 / 2605.19528

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

朝向相机鲁棒的三维定位：基于方程锚定的工具使用方法用于多模态大型语言模型

Jiang, Xueying, Li, Wenhao, Qian, Quanhao, Zhao, Deli, Lu, Shijian, Zhang, Gongjie, Xu, Ran

Abstract

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.

Chinese Translation

多模态大型语言模型（MLLMs）中的三维定位，包括三维物体检测和三维视觉定位，受到相机内部参数模糊性的根本限制：同一图像在不同相机下可能对应不同的三维场景。现有的MLLMs要么忽略相机参数并过拟合于标准训练内参，要么从外部工具中检索深度和三维线索，但将返回的值视为参考线索（模型可以自由隐式解释的数值提示），这两种方法都阻止了相机信息以确定性方式传递到预测中。我们提出了一种基于方程锚定的工具使用框架，将空间工具重新用作公式变量。该框架主动检索相机内参并采样多点度量深度，明确地在思维链（Chain-of-Thought, CoT）中写出针孔反投影方程 $ ext{ }ar{Z} = (u_c - c_x)ar{Z}/f_x$，并在回归最终的9自由度边界框之前将工具输出替换到公式中。在相机内参从 $0.5 imes$ 到 $1.5 imes$ 的重缩放条件下，我们的方法在三维物体检测和三维视觉定位任务中均优于仅使用RGB和工具增强的基线，在相机偏离训练规模最多的情况下取得了显著的提升。代码和数据将会发布。

View on arXiv Download PDF AI Translation

cs.CV / 55 / 2605.19532

Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

通过基于核心标记注意力的种子选择提升文本到图像扩散模型

Zhang, Yunzhe, Liu, Hongfu, Hong, Pengyu

Abstract

Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this "seed effect" and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.

Chinese Translation

文本到图像的扩散模型能够合成高质量图像，但结果对随机种子极为敏感：不同的初始种子往往导致图像质量和提示-图像对齐的巨大差异。我们重新审视这种“种子效应”，并展示在前几步去噪过程中，针对提示核心标记（内容承载词）的注意力动态能够强烈预测最终生成质量。基于这一观察，我们提出了基于注意力的种子选择（Attention-Based Seed Selection, ABSS），这是一种无训练、即插即用的方法，通过在去噪过程中利用对核心标记的交叉注意力来对给定提示的种子进行排名。ABSS不需要微调，也不改变初始噪声；它对所有候选种子进行评分和排名，仅保留前k个进行完整生成，其余则被丢弃，而不依赖于固定的接受/拒绝阈值。ABSS完全在推理时操作，可以作为现有种子优化流程的轻量级预选择附加组件，从而实现额外的收益。在三个基准测试中，广泛的实验表明，ABSS在文本-图像对齐和视觉质量上为Stable Diffusion变体带来了持续的改善，这得到了人类偏好和对齐指标的支持。

View on arXiv Download PDF AI Translation

cs.CV / 56 / 2605.19533

Replacement Learning: Training Neural Networks with Fewer Parameters

替代学习：用更少的参数训练神经网络

Zhang, Yuming, Wang, Peizhe, Han, Tianyang, Shi, Hengyu, Su, Junhao, Guan, Dongzhi, Liu, Jiabin, Wang, Jiaji

Abstract

End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.

Chinese Translation

端到端训练结合全深度反向传播仍然是优化深度神经网络的主流范式，但随着模型深度的增加，其效率逐渐下降。由于每个模块必须在单一全局目标下执行和进行微分，全深度反向传播引入了大量的参数冗余、激活内存成本和训练延迟，尤其是在相邻层表现出高度相关的学习模式时。直接跳过或移除层可以减少成本，但往往会削弱表示能力或需要特定于架构的重用设计。本文提出了替代学习（Replacement Learning, RepL），一种通过替换选定模块而非简单丢弃它们来减少全深度冗余的训练时范式。对于每个移除的模块，RepL 插入一个轻量级计算层，该层通过可学习的变换从其相邻的前后模块的参数中合成一个替代算子，并将合成的算子应用于前一层的激活。通过这种方式，RepL 保持了局部上下文的连续性，同时避免了不必要的全层计算。我们为卷积神经网络（CNNs）和视觉变换器（ViTs）实例化了 RepL，采用定制的参数融合模块来处理卷积通道、特征分辨率和变换器子模块。在 CIFAR-10、SVHN、STL-10、ImageNet、COCO 和 CityScapes 上的广泛实验表明，RepL 在分类、检测和分割任务中减少了可训练参数、GPU 内存使用和训练时间，同时匹配或超越了标准的端到端训练。此外，在 WikiText-2、迁移学习、推理吞吐量、检查点、随机深度和 INT8 量化等方面的额外结果进一步证明了其通用性和兼容性。

View on arXiv Download PDF AI Translation

cs.CV / 57 / 2605.19538

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

CaptchaMind：通过显式推理监督的强化学习训练 CAPTCHA 解码器

Wang, Pengcheng, Liu, Haoxiang, Dai, Yang, Zeng, Xiangxiang, Chen, Guanhua, Hu, Baotian, Wang, Longyue, Luo, Weihua

Abstract

CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.

Chinese Translation

CAPTCHA 被广泛应用于人类验证机制，常常阻碍智能代理在现实世界的网络环境中完成端到端自动化。解决现代 CAPTCHA 需要强大的多步骤视觉推理和交互能力，但由于缺乏大规模训练数据和过程级注释，基于训练的方法一直缺失。我们介绍了 CaptchaBench，这是第一个旨在支持大规模训练的 CAPTCHA 基准，包含 16,000 个程序生成的样本，涵盖八个任务类别，并提供详细的区域和过程级注释。在 CaptchaBench 上的系统评估表明，现有方法在需要细粒度视觉细节捕捉和区域级比较的任务上表现不佳。因此，我们提出了 CaptchaMind，这是一种基于强化学习的解码器，经过显式推理过程监督训练，在八个任务上实现了 82.9% 的平均成功率，在现实世界实例上达到 71.0%，显著优于所有现有方法且不依赖于闭源 API。

View on arXiv Download PDF AI Translation

cs.CV / 58 / 2605.19539

Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

信任与否：基于Trust3R的前馈3D重建中的证据不确定性

Zhu, Zihao, Zhao, Wenyuan, Chen, Nuo, Tian, Chao, Fan, Zhiwen

Abstract

Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at https://trust3r-z.github.io/.

Chinese Translation

几何基础模型在从未校准图像中进行无约束密集几何预测方面展现出潜力。然而，在当前的前馈设计中，它们预测的置信度分数是启发式的，缺乏概率解释，且常常无法指示预测几何的可信度及其程度。为了解决这一问题，我们提出了Trust3R，一个轻量级的证据不确定性框架，用于前馈3D重建。Trust3R结合了门控残差均值细化与正态-逆威沙特（Normal-Inverse-Wishart）证据头，生成每个点几何不确定性的封闭形式多元学生-t分布。该设计提供了概率基础的点图不确定性估计，同时增加了适度的推理开销。我们在多样的室内和室外基准测试上进行了评估，并与MASt3R的内置置信图以及涵盖单次异方差回归和基于采样的方法（如MC dropout和深度集成）等常见的不确定性感知基线进行了比较。实验结果表明，Trust3R在风险覆盖和稀疏化方面持续改善，并通常提高了几何精度。这些提升在基准测试中反映为更强的不确定性排名，在ScanNet++上实现了25%的AURC降低和41%的AUSE降低，为下游几何管道中的不确定性感知加权提供了实用的可靠性信号。项目页面和代码可在https://trust3r-z.github.io/获取。

View on arXiv Download PDF AI Translation

cs.CV / 59 / 2605.19554

Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

基于语义感知空间加权的自创文本到物体生成

Yu, Yue, Chen, Haibo, Chen, Shuo, Yang, Jian, Li, Jun

Abstract

Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generations featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains that the new images align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.

Chinese Translation

在文本到图像（T2I）生成中注入创造力是一项重大挑战，因为这要求合成的图像不仅展现视觉新颖性和惊喜感，还具备艺术价值。然而，目前的T2I模型主要针对与其数据分布的字面文本-图像对齐进行了优化，其噪声预测网络限制了生成过程在高概率区域内，从而导致生成的输出缺乏真正的创造力。为了解决这一问题，我们提出了一种自创扩散（Self-Creative Diffusion，SCDiff）模型，用于有意义的T2I生成，具有两个核心模块：可学习空间加权（Learnable Spatial Weighting，LSW）模块和视觉-语义混合损失（Visual-Semantic Mixing Loss，VSML）。LSW模块设计了一个参数化的凯瑟-贝塞尔窗口，以增强中心图像特征，促进新颖和令人惊讶的生成。VSML模块引入了双重损失函数：相似性损失约束新图像与其文本描述的一致性，而多样性损失则最大化其与原始图像的区别，从而增强语义价值和视觉新颖性。大量实验表明，我们的模型在创造力、语义对齐和视觉一致性方面显著提升，提供了一个简单而强大的框架用于生成创意对象。

View on arXiv Download PDF AI Translation

cs.CV / 60 / 2605.19556

EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

EpiDiffVO：基于几何的极线扩散用于鲁棒视觉里程计

Rao, Prateeth

Abstract

Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.

Chinese Translation

从图像对中估计相对姿态基本上只需要一小部分几何一致的对应点。然而，大多数基于学习的方法依赖于密集匹配或直接回归，导致冗余和几何可解释性降低。在本研究中，我们提出了一种稀疏极线匹配框架，预测一组紧凑的对应点，优化其在不同时间基线下的几何一致性。为了解决残余噪声和错位问题，我们引入了一种极线扩散过程，该过程建模对应点的不确定性，并将关键点精炼至极线一致性。经过精炼的对应点与深度线索一起提升为图表示，形成一个编码点之间关系结构的斯坦纳图（Steiner graph）。图神经网络学习一组紧凑的信息性对应点，这些对应点被传递给可微分的奇异值分解求解器，用于端到端的几何估计。相对姿态从得到的本质矩阵中恢复，并在TartanAir和KITTI SLAM数据集的视觉里程计设置中进行评估。实验结果表明，结合稀疏匹配、基于扩散的精炼和基于图的子集选择，能够减少对应点的冗余，同时在具有挑战性的基线下保持鲁棒的姿态估计。

View on arXiv Download PDF AI Translation

cs.CV / 61 / 2605.19559

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

EgoCoT-Bench：针对多模态大语言模型的基于操作的链式思维推理的基础和可验证基准

Dai, Yang, Jiao, Dian, Lin, Tianwei, Zhang, Wenqiao

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from \textbf{limited grounded rationale evaluation}, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce \textbf{EgoCoT-Bench}, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct, but have evidence that is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

Chinese Translation

多模态大语言模型（MLLMs）的快速发展引发了对自我中心视频理解的日益关注，特别是MLLMs识别细粒度手物交互、跟踪物体状态随时间变化的能力，以及从第一人称视角推理动态环境中的操作过程。然而，现有的自我中心视频基准存在 extbf{有限的基础推理评估}，对细粒度基于操作的推理支持不足，且很少检查模型的推理是否基于明确的时空证据。为了解决这一问题，我们引入了 extbf{EgoCoT-Bench}，这是一个细粒度的自我中心基准，旨在提供基础和可验证的基于操作的推理，并附有明确的逐步推理注释。总体而言，EgoCoT-Bench包含3,172对可验证的问答对，涵盖351个自我中心视频，分为四个任务组，总共12个子任务组，涉及感知与回顾、预测和高层次推理。该基准通过时空场景图（STSG）指导生成框架构建，并由人工注释者进一步精炼，以确保正确性、自我中心相关性和细粒度质量。实验结果显示，自我中心细粒度推理仍存在持续困难，并进一步揭示许多多模态模型生成的解释虽然答案正确，但证据与答案不一致。我们希望EgoCoT-Bench能够成为自我中心视频理解中基础和可验证推理的有用测试平台。项目页面和补充材料可在以下链接获取：https://dstardust.github.io/EgoCoT/.

View on arXiv Download PDF AI Translation

cs.CV / 62 / 2605.19578

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

镜头隐私封装：一种新的基准和物理隐私保护动作识别方法

Liu, Mengyuan, Wang, Ziyi, Li, Peiming, Yuan, Junsong

Abstract

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

Chinese Translation

基于RGB相机的监控系统能够实现人类动作识别，以保障公共安全和医疗健康，但也引发了严重的隐私问题。现有方法依赖于捕获后算法，这在数据获取过程中无法保护隐私。我们提出了镜头隐私封装（Lens Privacy Sealing, LPS），这是一种简单的硬件解决方案，通过可调节的层压膜物理遮蔽相机镜头，以最低成本提供传感器前的隐私保护。与软件方法或昂贵的工程光学不同，LPS通过物理不可逆的随机多层散射实现强隐私保护。我们引入了P$^3$AR数据集用于隐私保护的动作识别，包含大规模回放捕获（P$^3$AR-NTU，114K视频）和真实世界收集（P$^3$AR-PKU）子集，并附有隐私属性注释。为了处理LPS导致的视频退化，我们提出了MSPNet，这是一种单阶段框架，结合了帧间噪声抑制器（Inter-Frame Noise Suppressor, IFNS）和跨帧语义聚合器（Cross-Frame Semantic Aggregator, CFSA），并通过对比语言-图像预训练增强了语义提取的鲁棒性。大量实验表明，配备IFNS和CFSA的MSPNet在动作识别准确率上几乎是基线方法的两倍，同时将身份识别抑制到较低水平。全面验证显示，LPS在隐私-效用权衡上优于最先进的硬件方法，抵御包括点扩散函数（PSF）反演和数据驱动恢复在内的重建攻击，并在光学配置和挑战性环境中具有良好的泛化能力。代码可在https://github.com/wangzy01/MSPNet获取。

View on arXiv Download PDF AI Translation

cs.CV / 63 / 2605.19595

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

一种由LLM代理优化的YOLO26-MoE新模型用于考虑无人机图像的绝缘子故障检测

Matos-Carvalho, João Pedro, Seman, Laio Oriel, Stefenon, Stefano Frizzo, Khreasat, Mohammad Khalaf Mohammad, González, Gabriel Villarrubia

Abstract

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 [email protected] and 0.9515 [email protected]:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

Chinese Translation

电力线路绝缘子的检查对于确保电网的可靠性和防止因绝缘组件损坏或退化而导致的故障至关重要。近年来，结合深度学习视觉系统的无人机（UAV）已成为自动化这一过程的有效解决方案。然而，由于缺陷区域小、故障模式异质、背景复杂以及成像条件多变，绝缘子故障检测仍然面临挑战。为了解决这些问题，本文提出了一种优化的YOLO26-MoE模型，这是一种新颖的目标检测架构，将稀疏的专家混合（Mixture-of-Experts, MoE）模块集成到YOLO26检测器的高分辨率分支中。所提出的修改使得对细微和多样化故障模式的自适应特征细化成为可能，同时保持了一阶段检测框架的效率。超参数优化、最终训练和评估通过工具增强的大型语言模型（Large Language Model, LLM）代理进行协调。所提出的模型在0.5的mAP上达到了0.9900，在0.5:0.95的mAP上达到了0.9515，超越了最新的YOLO版本。这些结果表明，所提出的模型为基于无人机的绝缘子故障检测提供了有效且可靠的解决方案。

View on arXiv Download PDF AI Translation

cs.CV / 64 / 2605.19605

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

deadtrees.earth-aerial：用于树木覆盖和死亡检测的多分辨率空中图像数据集

Sharma, Ayushi, Mosig, Clemens, Drees, Lukas, Soltani, Salim, Vajna-Jehle, Janusch, Sheppard, Aaron, Ahmadi, Belqis, Schmid, Jonathan, Neumeier, Paul, Jacobs, Nathan, Wegner, Jan Dirk, Kattenborn, Teja

Abstract

Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at deadtrees.earth/releases/dte-aerial-bench.

Chinese Translation

全球森林正日益受到气候变化以及火灾、害虫和病原体等干扰的威胁，迫切需要可扩展的树木覆盖和树木死亡监测。无人机和飞机的空中影像是详细且大规模绘制树冠和死亡的重要数据来源。然而，相关进展受到缺乏全球代表性、协调一致的数据集的限制，这些数据集用于树木覆盖和死亡的联合分割。我们首次引入两个新颖的、开放的、适用于机器学习的数据集，以支持在全球范围内对厘米级空中影像进行树木覆盖和树木死亡的联合分割。通过 DTE-aerial-train，我们提供了一个训练数据集，包含385K个1024x1024像素大小的图像块，分辨率范围从2.5到20厘米。该数据集包括多类专家标注和审核的伪标签，用于树木覆盖和死亡。通过 DTE-aerial-bench，我们提供了一个地理平衡的基准测试集，包含25个全球分布的正射影像，总计525个图像块，并附有高质量的专家标注，涵盖树木覆盖和死亡。训练和基准数据集涵盖热带、温带、北方和干旱生物群落，涵盖广泛的森林结构和死亡模式。使用基准测试集进行评估，我们建立了强有力的参考基线，显著改善了所有生物群落和尺度下的死亡分割，在挑战性区域（如北方森林）中取得了显著提升，F1得分从0.40提高到0.58，约有45%的相对改善。所有数据、模型和代码将根据宽松的开源许可证公开发布。基准数据集的交互式可视化可在 deadtrees.earth/releases/dte-aerial-bench 获得。

View on arXiv Download PDF AI Translation

cs.CV / 65 / 2605.19607

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

谱集成梯度用于粗到细特征归因

Kim, Soyeon, Lim, Seongwoo, Lee, Kyowoon, Choi, Jaesik

Abstract

Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at https://github.com/leekwoon/sig/.

Chinese Translation

集成梯度（Integrated Gradients, IG）是一种广泛采用的特征归因方法，满足理想的公理性质。然而，积分路径的选择显著影响归因质量，标准的直线路径同时引入所有输入特征，往往在此过程中累积噪声梯度。为了解决这一局限性，我们提出了谱集成梯度（Spectral Integrated Gradients, SIG），该方法基于基线与输入差异的奇异值分解（Singular Value Decomposition, SVD）构建积分路径。通过从最大到最小逐步激活奇异成分，SIG 在引入细粒度细节之前先呈现全局结构，自然遵循粗到细的进程。通过在多样的图像分类数据集上进行广泛评估，我们证明了SIG能够生成更清晰的归因图，减少噪声，并在定量性能上优于现有的基于路径的归因方法。我们的代码可在 https://github.com/leekwoon/sig/ 获取。

View on arXiv Download PDF AI Translation

cs.CV / 66 / 2605.19611

Inverse Design of Metasurface based Absorbers using Physics Guided Conditional Diffusion Models

基于物理引导的条件扩散模型的超表面吸收器逆向设计

Joy, Vineetha, Palai, Jamshed, Sahoo, Satwik, Kumar, Anshuman, Sethi, Amit, Singh, Hema

Abstract

Inverse design of metasurfaces for specific electromagnetic responses requires generating geometries that satisfy stringent spectral constraints while maintaining manufacturability. Conventional design methodologies rely on iterative optimization routines using full wave simulations, which become extremely time consuming and computationally intensive for large design spaces. In addition, commonly employed generative approaches often exhibit limited conditional fidelity and the generated designs often contain fine or irregular features that are impractical to fabricate. In this regard, we propose a physics guided condition quality enhanced diffusion framework for the inverse design of metasurface based absorbers. Here, the conditioning information consisting of target reflection characteristics is integrated into the model using feature wise linear modulation (FiLM). Furthermore, to enforce adherence to target spectra, a pre trained surrogate EM simulator is embedded into the framework introducing physics aware regularization through spectrum level loss functions. The efficiency of the proposed model is demonstrated by generating practically realizable metasurfaces for different types of reflection characteristics in the frequency range of 2 to 18 GHz. The proposed framework achieves an average spectral mean squared error of 0.0006 and band alignment accuracy of 0.958 between the target spectra and the spectra produced by the generated designs, demonstrating high conditional accuracy. In addition, the model generates multiple geometries for the same condition, thereby providing diverse design alternatives to the engineer. The proposed model produces the suitable design in approximately 30 seconds, whereas the conventional approach can take several months under comparable computational resources. The efficiency of the model is also established via experimental measurements.

Chinese Translation

针对特定电磁响应的超表面逆向设计需要生成满足严格光谱约束的几何形状，同时保持可制造性。传统设计方法依赖于使用全波模拟的迭代优化程序，这在大设计空间中变得极其耗时且计算密集。此外，常用的生成方法通常表现出有限的条件保真度，生成的设计往往包含细小或不规则的特征，这些特征在制造上不切实际。在这方面，我们提出了一种基于物理引导的条件质量增强扩散框架，用于超表面吸收器的逆向设计。在这里，包含目标反射特性的条件信息通过特征线性调制（Feature wise Linear Modulation, FiLM）集成到模型中。此外，为了强制遵循目标光谱，预训练的替代电磁模拟器嵌入到框架中，通过光谱级损失函数引入物理感知的正则化。通过生成在2到18 GHz频率范围内具有不同反射特性的实际可实现的超表面，展示了所提模型的效率。该框架在目标光谱与生成设计所产生光谱之间实现了平均光谱均方误差为0.0006和光谱对齐精度为0.958，展示了高条件精度。此外，该模型为相同条件生成多个几何形状，从而为工程师提供多样化的设计替代方案。所提模型在大约30秒内生成合适的设计，而传统方法在可比计算资源下可能需要几个月的时间。模型的效率也通过实验测量得到了验证。

View on arXiv Download PDF AI Translation

cs.CV / 67 / 2605.19613

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

先白平衡，后调整：通过视觉-语言评估实现跨相机色彩恒常性

Li, Shuwei, Tan, Lei, Tan, Robby T.

Abstract

Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at https://github.com/NothingIknow/VLM-CC.

Chinese Translation

色彩恒常性旨在保持物体颜色在不同照明条件下的一致性。跨相机的色彩恒常性泛化仍然具有挑战性，因为基于学习的模型往往会过拟合训练相机的颜色响应特征，从而导致在其他相机拍摄的图像上的性能下降。我们提出了VLM-CC，这是一种反馈引导的框架，将色彩恒常性公式化为一个迭代精炼过程。VLM-CC并不是直接从原始输入中估计光源，而是通过基于视觉-语言模型（VLM）的评估进行驱动的迭代修正。在每次迭代中，使用当前估计对图像进行白平衡，并转换为伪sRGB。然后，一个轻量级的经过LoRA调优的VLM评估修正后的图像，识别出主要的残余色偏并提供定性反馈。这一反馈被映射到一个残余照明方向（红色、绿色或蓝色），并用于更新光源估计，直到收敛。我们的关键思想是将色彩恒常性重新构建为一个迭代的感知反馈问题，利用VLM评估而不是直接的RGB回归。通过用VLM引导的感知反馈替代直接的RGB估计，VLM-CC在多个数据集上实现了跨相机色彩恒常性的最先进的鲁棒性。代码将发布在 https://github.com/NothingIknow/VLM-CC。

View on arXiv Download PDF AI Translation

cs.CV / 68 / 2605.19620

B\'ezier Degradation Modeling for LiDAR-based Human Motion Capture

基于LiDAR的人体运动捕捉的Bézier降解建模

An, Xiaoqi, Zhao, Lin, Li, Jun, Gong, Chen, Yang, Jian

Abstract

LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible B\'ezier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D, BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.

Chinese Translation

基于LiDAR的3D人体运动捕捉在自动驾驶和机器人等领域具有广泛的应用，其中准确的运动重建至关重要。然而，现有方法往往在处理不稳定输入和严重遮挡时表现不佳，导致姿态预测抖动甚至失败。为了解决这些挑战，我们提出了BMLiCap，一个粗到细的框架，通过时间可压缩的Bézier曲线建模运动。通过采用保持轨迹的策略减少控制点，我们获得了一种连贯且易于学习的运动表示。为了从LiDAR点云线索中重建人类动作，我们设计了一个渐进式运动重建模块。具体而言，引入了时间尺度运动变换器（Time-scale Motion Transformer, TMT）来预测多个时间尺度的运动曲线，并利用多层运动聚合器（Multi-level Motion Aggregator, MMA）自适应地融合多尺度曲线，以恢复详细且时间连贯的姿态，有效弥补因遮挡和噪声造成的观察间隙。在四个主流基准测试LiDARHuman26M、FreeMotion、NoiseMotion和SLOPER4D上，BMLiCap在复杂场景中实现了最先进的准确性和时间连续性，展示了其补偿严重遮挡和减少预测抖动的能力。

View on arXiv Download PDF AI Translation

cs.CV / 69 / 2605.19622

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

UniRefiner：通过对比注册教导预训练的视觉变换器自我处理杂质

Qiu, Congpei, Hu, Zhaoyu, Ke, Wei, Tian, Zhuotao, Wu, Yanhao, Zhang, Tong

Abstract

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9\% mIoU on ADE20K (+9.4\%), surpassing specialized vision models like DINOv2 (49.1\%), while zero-shot segmentation accuracy improves by up to 22\%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

Chinese Translation

视觉变换器（ViTs）的表征学习迅速发展，但在空间敏感任务中，大规模模型的效用受到虚假标记的阻碍。以往的努力在减轻这一问题上有限，通常将这些伪影狭义地定义为简单的高范数异常值。我们认为这种范围是不够的。对于密集预测任务，我们认为任何未能编码位置对齐语义的标记都应视为伪影。这一更广泛的定义揭示了一个更复杂的问题，促使我们系统地分类和描述三种基本类型的伪标记，这些标记会破坏空间表征。基于这一全面的诊断，我们提出了UniRefiner，一个通用的精炼框架，旨在教导预训练的ViTs自我处理这些伪影。UniRefiner使用对比注册明确隔离和重新分配伪标记，通过双重目标实现： (i) 将图像标记与过滤后的常规标记对齐以保留语义，(ii) 将注册标记与检测到的伪标记对齐以捕捉伪信号。我们的方法仅需在约5000张图像上进行少量的微调，即可精炼多样的ViTs，包括像EVA-CLIP-8B和InternViT-6B这样的大型模型。实验表明了一致且显著的改进：值得注意的是，精炼后的EVA-CLIP-8B在ADE20K上达到了51.9%的mIoU（+9.4%），超过了像DINOv2（49.1%）这样的专业视觉模型，同时零-shot分割准确率提高了多达22%。UniRefiner释放了现有大规模基础模型的潜在空间能力，为其更广泛的应用铺平了道路。

View on arXiv Download PDF AI Translation

cs.CV / 70 / 2605.19623

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

PrAda：用于文本提示分割的少样本视觉适应

Rosi, Gabriele, Cermelli, Fabio, Masone, Carlo, Caputo, Barbara

Abstract

Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

Chinese Translation

图像分割对于视觉理解至关重要，但需要大量的像素级注释。基础模型使得通过文本提示预测新类别的新范式成为可能，而无需来自目标领域的注释。然而，在与原始预训练相距较远的专业目标领域，这些模型的性能会下降。我们研究了现有方法在这种领域转移下的错误，发现错误分类而非掩膜生成是主要原因。为了解决这个问题，我们提出了用于文本提示分割的少样本视觉适应这一新问题。这种适应在图像分类中得到了广泛研究，但在分割领域仍未被探索。我们通过原型适应（Prototype Adaptation, PrAda）这一新颖且参数高效的方法来解决这一任务，该方法适应一个冻结的文本提示分割模型。我们的方法通过结合细粒度的像素特征和高层次的变换器表示来学习类别特定的原型，然后通过学习到的重要性因子将其与原始基于文本的预测融合。这在保持模型零样本潜力的同时，能够强有力地适应新领域。在五个基准上的语义、实例和全景分割实验表明，PrAda在性能上显著优于最先进的技术和提出的基线。

View on arXiv Download PDF AI Translation

cs.CV / 71 / 2605.19624

Component-Aware Structure-Preserving Style Transfer for Satellite Sim2Real 6D Pose Estimation

面向组件的结构保持样式迁移用于卫星Sim2Real 6D姿态估计

Zhang, Yonglong

Abstract

Monocular 6D pose estimation for non-cooperative satellites depends heavily on annotated training data, yet real satellite images with reliable pose labels and component-level masks are difficult to acquire at scale. Synthetic rendering can provide exact geometric annotations, but the appearance gap between rendered and real observations limits direct transfer to the real domain. This paper presents a component-aware structure-preserving style transfer framework for satellite synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve satellite Sim2Real pose estimation in the considered calibrated setup while retaining simulation-derived geometric annotations.

Chinese Translation

单目6D姿态估计对于非合作卫星在很大程度上依赖于带注释的训练数据，但具有可靠姿态标签和组件级掩模的真实卫星图像难以大规模获取。合成渲染可以提供精确的几何注释，但渲染图像与真实观测之间的外观差距限制了其直接迁移到真实领域的能力。本文提出了一种面向组件的结构保持样式迁移框架，用于卫星合成到真实数据的构建。该方法从经过校准的真实采集、基于ArUco的相机姿态测量、CAD渲染和组件掩模中构建弱配对的真实-合成样本。然后，从未标记的真实图像中提取逐部分的真实领域样式编码，并通过掩模对齐调制将其注入到相应的合成卫星区域。为了保持生成图像在下游监督中的可用性，采用对抗训练结合局部对比一致性、自我正则化和边缘保持约束。在5000张渲染卫星图像和100张在校准设置中捕获的真实图像上进行了实验。真实图像提供了目标领域的外观参考和最终评估图像，而下游的GDRNet姿态估计器仅在合成或翻译的合成图像上进行训练。与代表性的图像翻译基线相比，所提出的方法在图像分布差异上达到了最低，FID为54.32，KID为0.048。当翻译数据用于在此目标领域适应设置中训练GDRNet时，ADD通过率提高至0.260，AUC提高至0.611。这些结果表明，组件级外观迁移可以在考虑的校准设置中改善卫星Sim2Real姿态估计，同时保留源自仿真的几何注释。

View on arXiv Download PDF AI Translation

cs.CV / 72 / 2605.19634

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

P2DNav：用于零-shot视觉与语言导航的全景到俯视推理

Sheng, Kai, Wang, Liuyi, Dai, Haojie, Li, Jinlong, Qin, Yongrui, He, Zongtao, Liu, Chengju, Chen, Qijun

Abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360{\deg} panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

Chinese Translation

视觉与语言导航（VLN）要求具身代理将自然语言指令转化为可在未见环境中执行的导航动作。现有的零-shot方法通常依赖于额外的路径点预测模块，这往往将高层次的方向推理与细粒度的局部定位纠缠在一起，导致决策容易出错且不稳定。本文提出了P2DNav，一种用于零-shot视觉与语言导航的分层框架。P2DNav由三个核心组件组成：全景到俯视（P2D）、滑动窗口对话记忆（SDM）和反思性重新定向机制（RRM）。P2D明确将导航决策分解为两个阶段：全景方向选择和俯视局部定位。它首先从360°全景中选择与指令相关的方向，然后在该方向上从俯视RGB观测中预测一个像素级目标点。此外，SDM将导航历史组织为多轮对话上下文，并在滑动窗口内维护最近的视觉观测，以支持长时间的导航。RRM进一步通过评估基于俯视观测的局部定位的可靠性来实现反思性重新定向，并在必要时返回全景方向选择。在R2R-CE基准上的实验表明，P2DNav在零-shot方法中表现出色。特别是，与最先进的（SOTA）基于路径点和无路径点的零-shot方法相比，P2DNav分别实现了146.6%和58.9%的成功率提升，证明了P2D、SDM和RRM在零-shot VLN中的有效性。代码将公开发布供公众使用。

View on arXiv Download PDF AI Translation

cs.CV / 73 / 2605.19639

Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

基准测试与演化反思-反映-修正框架用于反思性视觉生成

Wang, Junjie, Lou, Xinghua, Li, Jason, Tian, Ye, Chen, Keyu, Li, Yulin, Kang, Bin, Mai, Jacky, Li, Yanwei, Tian, Zhuotao, Nie, Liqiang

Abstract

Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.

Chinese Translation

文本到图像（T2I）模型和统一多模态模型（UMMs）在视觉生成方面取得了显著进展。然而，它们对单次生成范式的依赖限制了它们处理需要迭代优化的复杂提示的能力。为了实现多轮反思性视觉生成（RVG），我们将反思-反映-修正（R^3）循环形式化为核心框架，并引入R^3-Bench，这是一个包含600多个专家标注实例的基准，量化了迭代推理和修正能力。在R^3-Bench上的评估揭示了一个关键差距：尽管最先进的模型能够识别生成错误，但它们无法生成可操作的修正指令。为了解决这一问题，我们提出了R^3-Refiner，这是一个双阶段框架，利用组相对策略优化（GRPO）和层次奖励机制（HRM）更好地将修正与反思性推理对齐。实验表明，R^3-Refiner在R^3-Bench上取得了显著提升（反思性裁决得分提高12.0%，修正得分提高9.0%），并且可以与各种多模态大模型（MLLMs）无缝集成，以提升不同T2I模型在GenEval++和T2I-CompBench上的生成质量。代码可在https://github.com/xiaomoguhz/R3-Bench获取。

View on arXiv Download PDF AI Translation

cs.CV / 74 / 2605.19649

CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

基于NeRF增强的无CAD航天器姿态估计学习

Legrand, Antoine, Detry, Renaud, De Vleeschouwer, Christophe

Abstract

Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically-consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied on large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

Chinese Translation

航天器姿态估计网络需要数万个CAD渲染图像进行训练。这种对合成CAD数据的依赖 (i) 限制了其对具有可靠几何先验的目标的适用性，排除了不合作或文档不全的航天器，以及 (ii) 由于不现实的光照和材料外观，导致对真实轨道条件的泛化能力较差。本文提出了一种基于NeRF的图像增强方法，使得航天器姿态估计器的学习仅需几十到几百张图像。该方法学习目标的神经辐射场，并通过几何一致的视角和外观增强生成一个大而多样的数据集。这个增强的数据集使得在不需要CAD模型或大型合成数据集的情况下，能够训练出准确的特定目标姿态估计器。实验表明，我们的方法支持仅使用25到400张真实图像训练出准确的姿态估计器，即使在严重的光照变化下。当应用于大型基于CAD的合成数据集时，基于NeRF的增强还提升了域外泛化能力，提高了对真实轨道条件的鲁棒性。

View on arXiv Download PDF AI Translation

cs.CV / 75 / 2605.19656

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

跨视角喷溅：基于地理参考图像的前馈视图合成

Turkulainen, Matias, Krishnan, Akshay, Aleotti, Filippo, Sayed, Mohamed, Garcia-Hernando, Guillermo, Kannala, Juho, Solin, Arno, Brostow, Gabriel, Turmukhambetov, Daniyar

Abstract

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level AND by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

Chinese Translation

我们提出了跨视角喷溅（Cross-View Splatter），这是一种前馈方法，能够预测在地面和卫星拍摄的户外场景中像素对齐的高斯喷溅。忠实的重建需要良好的相机覆盖，但地面图像的获取既耗时又难以在大规模户外场景中捕捉。幸运的是，卫星图像可以提供易于通过公共API访问的全球几何先验。跨视角喷溅将正射校正的卫星视图与带有GPS标签的地面照片融合，以在统一的3D坐标框架中预测高斯喷溅。通过对齐地面和鸟瞰特征表示，我们的模型在场景覆盖和新视图合成方面相较于单独使用地面图像有所改善。我们在经过精心策划的地理参考数据集和从开放地图服务中挖掘的配对卫星-地形数据上进行训练。我们在一个新的基准上评估我们的方法，该基准用于基于地理参考图像的新视图合成，允许与先前的最先进方法进行比较。我们的代码和数据准备将发布在 https://nianticspatial.github.io/cross-view-splatter/。

View on arXiv Download PDF AI Translation

cs.CV / 76 / 2605.19688

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

DocQT：通过多样化的 JPEG 量化表提高文档伪造定位的鲁棒性

Ronfleux-Corail, Kylian, Bernard, Guillaume, Coustaty, Mickaël, Sidère, Nicolas

Abstract

Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training -restricted to standard libjpeg quality factors -and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness -FFDN [2] and Mesorch [20] -each trained under either standard quality factor augmentation (Standard-QT ) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT ), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.

Chinese Translation

文档操作定位模型在公共基准测试中表现出色，但在实际文档工作流程中却难以推广。我们识别出这一差距的一个关键且被忽视的来源：训练过程中使用的 JPEG 量化表的狭窄分布——仅限于标准 libjpeg 质量因子——与在现实世界保险文档流程中遇到的异质压缩特征之间的不匹配。为了隔离这一因素，我们进行了一项受控的因子研究，比较了两种具有不同量化表意识水平的架构——FFDN [2] 和 Mesorch [20]——每种架构在标准质量因子增强（Standard-QT）或从 DocQT 中采样的操作校准量化表（Real-QT）下进行训练，并在三种重新压缩条件下进行评估。在 Real-QT 下训练在 DocTamper [15] 上获得了显著的定位提升，并显著降低了真实操作文档的像素级假阳性率，但仅对那些明确将量化表作为输入的架构有效。发布的 DocQT 量化表数据集和压缩重现材料可以直接在 https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables 获取。这些结果表明，标准质量因子增强并不能充分代表操作压缩的多样性，而明确依赖量化表的架构选择为实际部署提供了显著的鲁棒性优势。

View on arXiv Download PDF AI Translation

cs.CV / 77 / 2605.19692

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

WBCAtt+: 白血球图像的细粒度像素级形态注释

Tsutsui, Satoshi, Pang, Winnie, He, Shuting, Wen, Bihan

Abstract

The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. \revision{The dataset and code are publicly available\footnote{https://doi.org/10.57967/hf/8143}}.

Chinese Translation

白血球（WBC）的显微镜检查在病理学中发挥着基础性作用，对于诊断白血病和贫血等血液疾病至关重要。为了支持对白血球图像的进一步研究，已经提出了多个数据集。然而，这些数据集主要注释细胞类别，缺乏病理学家用于解释细胞的详细形态特征。为了解决这一问题，我们引入了WBCAtt+，这是一个新颖的数据集，包含密集注释的白血球图像，涵盖11种形态属性和五种像素级细胞成分。WBCAtt+提供了113k个图像级标签和10k个分割图，是首个为白血球图像提供全面注释的数据集。利用该数据集，我们提供了属性识别和语义分割的基线模型。我们还设计了一种属性识别模型，以结合细胞的组合结构，进一步提高识别性能。最后，我们展示了我们的数据集所支持的各种应用，例如可解释的人工智能模型，包括反事实示例生成。数据集和代码已公开可用。

View on arXiv Download PDF AI Translation

cs.CV / 78 / 2605.19712

Physics-informed simulation framework for realistic sonar image generation and statistical validation

基于物理的仿真框架用于真实声纳图像生成和统计验证

S, Kamal Basha, Nambiar, Athira

Abstract

Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance. Results show strong texture alignment (KL < 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.

Chinese Translation

合成声纳数据集为昂贵的现实世界采集提供了一种可扩展的替代方案，但其效用仍然受到缺乏严格定量验证的限制。我们提出了ACOUSIM（声学仿真与验证平台），这是一个基于物理的框架，能够在不依赖生成模型的情况下评估合成声纳图像与真实声纳图像之间的统计一致性。基于Gazebo的环境通过明确控制海底纹理、光照驱动的阴影、平台高度和噪声来生成类似声纳的图像。通过使用Kullback-Leibler散度、Jensen-Shannon散度和地球搬运者距离，对比两个公共声纳数据集SeabedObjects-KLSG-II和Sonar Common Target Detection (SCTD)，量化现实感，评估全局强度和局部纹理（LBP）分布。结果显示，在所有类别中，纹理一致性较强（KL < 0.07），平面类别的强度一致性优于船舶类别，原因在于阴影几何的复杂性。ACOUSIM建立了一个可重复的、分布级别的基准，用于仿真到真实声纳的评估，并直接支持水下图像分析的可靠数据集验证。

View on arXiv Download PDF AI Translation

cs.CV / 79 / 2605.19717

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

物理环路：一种用于验证计算机辅助设计工程设计的混合智能架构

Berger, Elias, Usama, Muhammad, Mehlstäubl, Jan, Saske, Bernhard, Paetzold-Byhain, Kristin

Abstract

Large Language Models (LLMs) can generate Computer-Aided Design (CAD), yet lack physical comprehension required for reliable engineering design. Instead of attempting to implicitly learn physical laws from data, we propose a Hybrid Agentic-Physical Architecture that embeds validated knowledge-based engineering tools directly into the decision making loop of autonomous AI agents. In this framework, engineering design is formulated as a closed-loop, sequential decision making process guided by explicit physical verification. Based on a load case, dedicated agents iteratively plan, generate, evaluate, and revise engineering designs using knowledge-based tools as a feedback signal. We introduce a benchmark dataset and metrics for assessing functional validity in generative CAD. Our system generates more complex and physically verified designs, with a 4.2 increase in structural complexity and improving compile rate by 3.5% compared to similar agentic methods. The codebase, prompts and dataset will be made publicly available to support reproducibility and future research.

Chinese Translation

大型语言模型（LLMs）能够生成计算机辅助设计（CAD），但缺乏可靠工程设计所需的物理理解。我们提出了一种混合智能-物理架构，直接将经过验证的知识基础工程工具嵌入自主人工智能代理的决策循环中，而不是试图从数据中隐式学习物理法则。在该框架中，工程设计被表述为一个由明确的物理验证指导的闭环序列决策过程。基于负载案例，专用代理迭代地规划、生成、评估和修订工程设计，使用知识基础工具作为反馈信号。我们引入了一个基准数据集和评估生成CAD功能有效性的指标。我们的系统生成了更复杂且经过物理验证的设计，与类似的智能方法相比，结构复杂性提高了4.2，并且编译率提高了3.5%。代码库、提示和数据集将公开发布，以支持可重复性和未来研究。

View on arXiv Download PDF AI Translation

cs.CV / 80 / 2605.19726

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

通过块近似稀疏注意力实现扩散语言模型中的高效长上下文建模

Zhang, Wenhu, Wu, Yiming, Wang, Huanyu, Liu, Yaoyang, Dou, Huanzhang, Yang, Senqiao, Wu, Sitong, Zhao, Hanbin, Jia, Jiaya

Abstract

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with block-wise pre-downsampled operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.

Chinese Translation

扩散语言模型（DLMs）能够实现全球一致的、双向的和可控的文本生成，相较于传统的自回归大型语言模型（LLMs）具有优势，但在扩展到超长序列时仍然成本高昂。许多现有的块稀疏注意力方法通过在高分辨率注意力空间中固定采样模式（如尾部区域或反对角条带）选择块。这种基于先验的采样可能会遗漏显著的标记，并在分布变化下引入不稳定性。本文提出了块近似稀疏注意力框架（BA-Att），该框架采用块级预下采样操作，能够在紧凑的下采样空间内识别信息丰富的区域，避免依赖脆弱的位置信息先验。为了分析其理论行为，我们定义了一个神谕后下采样注意力图，并形式化了预下采样和后下采样方案之间的近似误差。基于这一见解，我们引入了一个轻量级的范数排序模块和一个协方差补偿校正，通过使用对角QK方差来近似全协方差，从而降低计算复杂性。大量实验表明，我们的操作在注意力计算中实现了高达6.95倍的加速，并在50%稀疏性下保持了接近全注意力的性能，适用于语言模型、多模态语言模型和视频生成模型，展示了强大的效率和泛化能力。

View on arXiv Download PDF AI Translation

cs.CV / 81 / 2605.19727

Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Tango3D：朝着全球与局部2D-3D对应的对齐

He, Zebin, Yang, Mingxin, Yang, Shuhui, Sun, Hanxiao, Han, Xintong, Guo, Chunchao, Luo, Wenhan

Abstract

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

Chinese Translation

现有的3D基础模型通常将点云对齐到冻结的视觉-语言空间，如CLIP，通过将3D形状压缩为全局向量实现强大的跨模态检索。然而，这种仅基于全局的对齐无法建立细粒度的像素到点的对应关系。为了解决这个问题，我们提出了Tango3D，这是一种统一了密集对应和全局检索的基础模型。我们使用几何感知的2D视觉主干和预训练的3D变分自编码器（VAE）将图像编码为2D补丁，将点云编码为3D令牌。这些被映射到一个共享的单一空间，以实现局部像素到点的对齐和全局语义对齐。为了稳定密集和全局目标的联合学习，我们引入了一种三阶段的渐进训练策略。实验表明，我们的模型成功实现了对象级别的像素到点对齐，同时保持了具有竞争力的全局检索，这是现有3D基础模型所不具备的联合能力。通过建立细粒度的对齐特征空间，Tango3D将丰富的语义注入到纯几何的3D令牌中，为广泛的密集3D下游任务铺平了道路。

View on arXiv Download PDF AI Translation

cs.CV / 82 / 2605.19728

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Aero-World：基于惯性控制的动作条件空中视频生成

Radi, Abdul Mohaimen Al, Li, Kunyang, Shang, Yuzhang, Shah, Mubarak, Tian, Yu

Abstract

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose \textbf{Aero-World}, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video--IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose \textbf{AeroBench}, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

Chinese Translation

基础视频模型产生了视觉上令人印象深刻的结果，但它们在具身人工智能中的应用仍然有限，因为它们主要是在自然语言而非低级控制信号上进行训练。这一限制在空中飞行中尤为明显，因为运动发生在不受约束的6自由度空间中，微小的自我运动误差可能导致大的轨迹漂移。生成遵循细粒度惯性动作的空中视频可以通过提供一个可控的代理，支持空中代理的可扩展训练和评估，从而替代现实世界或昂贵的仿真数据。为了解决这个问题，我们提出了 extbf{Aero-World}，一种将预训练的图像到视频扩散模型转换为可控空中视频生成器的方法。Aero-World通过动作令牌流将平移加速度和角速度序列注入到预训练的潜在扩散变换器中。一个独立于真实视频-IMU对训练的冻结潜在空间物理探针，在LoRA微调期间提供可微分的惯性一致性监督，同时避免了计算上昂贵的视频解码。我们进一步提出了 extbf{AeroBench}，一个评估生成的无人机视频是否遵循低级动作信号的基准。AeroBench使用动作对齐分数（Action Alignment Score, AAS）来测量与指令惯性动作的一致性，并使用物理一致性率（Physical Consistency Rate, PCR）来测量时间运动稳定性。在AeroBench上，Aero-World将平均AAS从57.7提高到63.6，相较于仅基于动作的微调，提供了比AirScape更强的质量控制权衡，具有更低的FVD（596.5对1058.6），更高的SSIM（0.595对0.505），以及更高的流动-IMU相关性（0.44对0.20）。这些结果表明，冻结的物理探针监督是一种将预训练视频生成器适应于更具动作对齐的空中运动的实用机制。

View on arXiv Download PDF AI Translation

cs.CV / 83 / 2605.19729

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

LIFT和PLACE：一种简单、稳定且有效的轻量级扩散模型知识蒸馏框架

Han, Hyunsoo, Yeo, Sangyeop, Yoo, Jaejun

Abstract

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from its substantially larger capacity - poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTtingbased distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

Chinese Translation

我们证明，在扩散模型的知识蒸馏中，教师网络复杂的去噪过程——源于其显著更大的容量——对学生模型忠实模仿构成了重大挑战。为了解决这个问题，我们提出了一种粗到细的蒸馏框架，包含基于线性拟合的蒸馏（LIFT）和分段局部自适应系数估计（PLACE）。首先，LIFT将目标分解为“粗略”对齐和“精细”优化。学生模型首先在粗略对齐上进行训练，然后再进行困难的精细优化。其次，PLACE扩展了LIFT，通过将输出划分为基于误差的组来解决空间上非均匀的误差，提供局部自适应指导。我们的实验表明，LIFT和PLACE在扩散空间（图像/潜在）、骨干网络（U-Net/DiT）、任务（无条件/条件）、数据集等方面都有效，甚至扩展到基于流的模型，如MMDiT（SD3）。此外，在极端压缩下，使用1.3M参数的学生模型（仅为教师的1.6%），传统的知识蒸馏无法提供足够的指导以实现稳定训练，FID分数往往降至50-200以上，但我们的方法仍然保持稳定收敛，并实现了15.73的FID。

View on arXiv Download PDF AI Translation

cs.CV / 84 / 2605.19734

GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

GeoMamba：一个基于几何驱动的 MambaVision 框架及其用于细粒度光学-合成孔径雷达对象检索的数据集

Fang, Tiantong, Wang, Xiuwei, Xiao, Jing, Zhou, Wujie, Liao, Liang, Wang, Mi

Abstract

Multi-source remote sensing enables complementary observation of ground objects, while cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in all-to-all retrieval setting.

Chinese Translation

多源遥感能够对地面物体进行互补观测，但跨模态细粒度对象检索仍然面临挑战，尤其是在光学与合成孔径雷达（SAR）条件不对齐的情况下。与依赖配对或空间对齐样本的传统检索设置不同，实际的光学-SAR 检索受到显著的模态差异、散斑噪声和结构不一致的影响，这限制了稳健的跨模态表示学习。为了解决这个问题，我们提出了 GeoMamba，一个专为光学-SAR 细粒度检索量身定制的几何驱动框架。具体而言，GeoMamba 引入了几何特征注入（Geometric Feature Injection, GFI）模块，该模块增强了跨模态特征交互并结合了结构先验，从而提高了 SAR 表示的稳健性并促进了几何一致的特征学习。此外，几何一致性约束（Geometric Consistency Constraint, GCC）模块结合深度监督（Deep Supervision, DS）策略，利用经典算子施加分层几何约束，帮助在表示学习过程中保留信息丰富的对象结构。我们进一步构建了一个新的数据集 FGOS-as，包含 11 个航空航天和海洋类别，用于评估现实遥感场景中未对齐的跨模态细粒度对象检索。在 FGOS-as 上进行的广泛实验表明，GeoMamba 超越了现有方法，在全对全检索设置中达到了 63.3% 的平均精度（mAP）和 77.0% 的 Rank-1 准确率。

View on arXiv Download PDF AI Translation

cs.CV / 85 / 2605.19739

FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

FlowErase-RL：将概念消除重新思考为流匹配模型中的奖励优化

Sun, Yi, Zhang, Zhiqi, Zhong, Xinhao, Zhou, Yimin, Sun, Shuoyang, Chen, Bin, Xia, Shu-Tao, Xu, Ke

Abstract

Recent advances in flow matching models have significantly improved text-to-image generation quality, but also introduce growing safety risks due to the generation of harmful or undesirable content. Existing concept erasure methods are either inference-time interventions with limited effectiveness or rely on supervised fine-tuning (SFT), which requires precisely aligned data and struggles with scalability and multi-concept settings. In this paper, we propose \emph{FlowErase-RL}, the first GRPO-based framework for concept erasure in flow matching models. We reformulate concept erasure as a reward optimization problem and introduce a \textbf{dynamic dual-path reward mechanism} that jointly optimizes (i) a Concept Erasure (CE) reward to suppress target concepts and (ii) a Non-target Space (NS) reward to preserve generative fidelity. The two reward paths are adaptively balanced during training via a performance-driven switching strategy, enabling stable optimization without explicit supervision. Extensive experiments on nudity, object, and artistic style erasure demonstrate that our method achieves state-of-the-art erasure performance while maintaining strong image quality and semantic alignment. Moreover, it exhibits robust resistance to adversarial attacks and scales effectively to multi-concept scenarios. Our results establish a new paradigm for safe and controllable generation in flow matching models.

Chinese Translation

近年来，流匹配模型的进展显著提高了文本到图像生成的质量，但也因生成有害或不良内容而引入了日益增长的安全风险。现有的概念消除方法要么是推理时的干预，效果有限，要么依赖于监督微调（SFT），这需要精确对齐的数据，并且在可扩展性和多概念设置上存在困难。在本文中，我们提出了 extit{FlowErase-RL}，这是第一个基于GRPO的流匹配模型中的概念消除框架。我们将概念消除重新表述为一个奖励优化问题，并引入了一种 extbf{动态双路径奖励机制}，该机制共同优化（i）一个概念消除（CE）奖励以抑制目标概念，以及（ii）一个非目标空间（NS）奖励以保持生成的保真度。在训练过程中，这两条奖励路径通过基于性能的切换策略自适应平衡，从而实现稳定的优化而无需显式监督。针对裸体、物体和艺术风格消除的广泛实验表明，我们的方法在保持强图像质量和语义对齐的同时，实现了最先进的消除性能。此外，它对对抗攻击表现出强大的抵抗力，并能有效扩展到多概念场景。我们的结果为流匹配模型中的安全和可控生成建立了一个新的范式。

View on arXiv Download PDF AI Translation

cs.CV / 86 / 2605.19744

Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection

基于嵌入的异常检测的真实世界车辆评估

Schotschneider, Albert, Bogdoll, Daniel, Pavlitska, Svetlana, Abouelazm, Ahmed, Zoellner, Johann Marius

Abstract

Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.

Chinese Translation

在交通场景中检测异常对于确保自动驾驶的安全至关重要，但收集具有代表性的异常数据仍然具有挑战性。现有的异常检测方法高度专业化，并依赖于抽象语义Cityscapes类别定义的正常性，这使得其难以适应多样化的现实场景。我们提出了一种可适应的实时异常检测方法，该方法利用预训练视觉变换器嵌入的基础模型，通过在潜在语义特征空间中进行最近邻相似性检测偏差。基于块处理，该算法生成密集的异常掩码，从而实现对检测到的异常的定位。该方法通过单一参考图像稳健地建模正常性。这种构造避免了显式监督和数据集特定的训练，使其适合于现实世界的部署。我们在标准基准测试和真实场景中的自动化车辆上评估了该方法。尽管其简单性，该方法在道路异常基准测试中表现良好，并在实践中展示了一致的定性行为，成功突出显示了多样场景中的语义异常对象。这些结果表明，简单的基于参考的方法在现实操作条件下可以提供有用的异常信号。

View on arXiv Download PDF AI Translation

cs.CV / 87 / 2605.19750

CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models

CPC-VAR：视觉自回归模型中的持续个性化与组合生成

Li, Junhao, Zhong, Xinhao, sun, Yi, Qiao, Yuxia, Chen, Bin, Xia, Shu-Tao, Wang, Yaowei

Abstract

Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

Chinese Translation

视觉自回归（VAR）模型最近作为一种高效的文本到图像生成范式而出现。尽管其强大的生成能力，现有的基于VAR的个性化方法仍然局限于静态设置，无法满足不断变化的用户需求。特别是，顺序概念学习导致严重的灾难性遗忘，而多概念合成常常遭遇特征纠缠和属性不一致的问题。在本研究中，我们首次系统性地研究了VAR模型中的持续个性化生成。我们识别出两个关键挑战：（i）在顺序定制过程中保持先前学习的概念，以及（ii）以可控的方式组合多个个性化概念。为了解决这些问题，我们提出了一个统一框架，包含两个核心组件。对于持续的单概念学习，我们引入了基于梯度的概念神经元选择（GCNS），该方法识别与概念相关的神经元，并仅对任务间冲突的参数进行约束，有效减轻遗忘而无需额外的模型扩展。对于多概念合成，我们提出了一种上下文感知的组合策略，该策略通过空间条件引导的多分支特征建模和局部交叉注意力融合，能够实现精确且解耦的概念组合。大量实验表明，我们的方法在长序列持续个性化中显著提高了性能，并在多概念图像合成中相较于现有基线取得了更优的结果。这些发现突显了VAR模型在可扩展和可控个性化生成中的潜力。

View on arXiv Download PDF AI Translation

cs.CV / 88 / 2605.19776

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

偏好顺序、评分锚点：从融合的专家审美真值到自我蒸馏

Zhao, Yuanpei, Lin, Jie, Zhang, Chao, Wang, Yilin, Li, Mao, Li, Chenhui, Hou, Jie, Lv, Tangjie

Abstract

Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.

Chinese Translation

成对偏好和逐点评分是图像审美评估（IAA）中两种主要的注释协议，然而现有基准仅采用其中一种，导致在受控条件下未能测量它们的互补性。我们引入了PPaint，这是一个匹配的双协议基准，其中15位领域专家（每个类别5位）在五个审美维度下对150幅中国画进行注释，通过局部密集的偏好设计收集了45,900个成对专家判断，同时收集匹配的评分。匹配设计揭示了互补的优势：偏好产生了更一致的序数排名，而评分则锚定了绝对分数尺度。通过两种独立的偏好到分数的方法融合这两种信号，得到了一个融合的专家真值，两个构造几乎收敛到相同的分数。相同的偏好到分数原则扩展到无标签的VLM训练。PSDistill通过Elo参考池将VLM的成对判断转换为校准的伪分数，并使用置信加权排名优化训练相同的VLM，以产生单次通过的审美评分器。在单一绘画类别上训练的蒸馏Qwen3-VL-8B在所有三个类别中将平均SRCC从0.504提高到0.709，超越了所有开源基准，包括专用审美模型ArtiMuse，并在单次推理成本下与闭源的Gemini-3.1-Pro相匹配，SRCC差距仅为0.04，跨领域转移在APDDv2上进一步得到验证。我们将发布完整的PPaint数据集和训练代码。

View on arXiv Download PDF AI Translation

cs.CV / 89 / 2605.19786

Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

通过时空注意链快速生成4D网格

Samuel, Dvir, Atzmon, Yuval, Chechik, Gal, Kasten, Yoni

Abstract

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a $13\times$ speedup while producing higher-quality results. Moreover, our approach scales to videos up to $16\times$ longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

Chinese Translation

4D网格生成最近作为一种强大的范式出现，用于从视频中恢复动态3D结构，但现有方法仍然较慢、计算开销大，并且难以扩展到更长的序列。我们提出了一种无训练的方法，加速4D网格生成，同时提高时间对应质量。我们的关键观察是，时间对应在4D主干网络内部出现的时间远早于其生成的网格在视觉上变得准确。我们利用这一点，提出了一个称为时空注意链（Spatio-Temporal Attention Chain）的通用框架，该框架在空间和时间上传播信息。从锚网格上的顶点开始，链将顶点映射到潜在标记。然后，它在潜在空间中遵循时间对应，并通过潜在到顶点的注意力恢复特定帧的顶点。这一设计避免了昂贵的显式匹配，同时保留了锚网格的细节，从而改善了动态网格的几何形状和时间一致性。与最先进的方法相比，我们的方法在9秒内生成一个4D网格，实现了$13 imes$的加速，同时产生更高质量的结果。此外，我们的方法可以扩展到视频长度达到$16 imes$而不降低网格质量。除了生成之外，改进的对应关系在两个下游任务上实现了具有竞争力的零样本性能：2D物体跟踪和4D跟踪。我们进一步展示了我们的框架能够实现可靠的相机估计，这是之前的4D网格生成方法所不支持的能力。

View on arXiv Download PDF AI Translation

cs.CV / 90 / 2605.19792

Mechanisms of Object Localization in Vision-Language Models

视觉-语言模型中的物体定位机制

Schaumlöffel, Timothy, Vilas, Martina G., Roig, Gemma

Abstract

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

Chinese Translation

视觉基础的语言模型（VLMs）在连接视觉和文本信息方面非常有效，但在基本的分类和定位任务上常常表现不佳。尽管分类机制的研究相对广泛，但支持物体定位的过程仍然不够清晰。在本研究中，我们使用一系列机制可解释性工具，包括标记消融、注意力剔除和因果中介分析，调查了两个代表性模型家族，LLaVA-1.5 和 InternVL-3.5。我们发现，定位是由一种容器化机制驱动的，其中与物体对齐的标记定义了物体的空间范围，而这些边界内标记的语义排列对预测框的影响则大多无关。只有一小部分注意力头在分类和定位的因果效应中起到中介作用，LLaVA集中在早中层，而InternVL则集中在中晚层。这两个任务共享一些早期处理，但最终依赖于在很大程度上各自独特的专门头。总体而言，我们提供了VLMs中定位的首次层级和头级解释，揭示了可以指导未来模型设计和基础目标的狭窄计算路径。

View on arXiv Download PDF AI Translation

cs.CV / 91 / 2605.19797

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

Depth2Pose：一种基于姿态的单目深度估计基准，无需真实深度

Kocur, Viktor, Aung, Sithu, Flood, Gabrielle, Ding, Yaqing, Bujnak, Lukas, Sattler, Torsten, Kukelova, Zuzana

Abstract

Monocular depth estimation has improved significantly in recent years, driven by increasingly powerful models and large-scale training data. Predicted depth is increasingly used as an input signal for downstream tasks such as Structure-from-Motion (SfM), visual localization, and SLAM. However, monocular depth estimators (MDEs) are still primarily evaluated in terms of depth accuracy. Standard metrics aggregate errors globally and may not reflect the usefulness of depth for downstream geometric tasks. We therefore propose Depth2Pose, a framework for evaluating MDEs in the context of downstream tasks. By combining depth predictions with feature correspondences in depth-aware geometric solvers, we use relative camera pose estimation accuracy as a task-driven proxy for depth quality. Traditional benchmarks require dense ground truth in the form of per-pixel depth, which is expensive to obtain. In contrast, our formulation requires only camera poses, which can be estimated efficiently, e.g., using Structure-from-Motion pipelines. As a result, our framework can be applied to scenes where ground-truth depth is difficult to obtain, for example due to large scene scale or heavy occlusions (e.g., vegetated environments). Leveraging this, we introduce the D2P dataset, which contains challenging scenes outside the distribution of commonly used training data. We show that methods performing well under standard depth error metrics on existing benchmarks also perform well under our pose-based metric when evaluated on the same datasets, but do not necessarily generalize to our more challenging dataset. Finally, we provide a simple and extensible evaluation framework. The dataset and code are available at kocurvik.github.io/depth2pose.

Chinese Translation

单目深度估计近年来得到了显著改善，这得益于越来越强大的模型和大规模的训练数据。预测的深度越来越多地被用作下游任务的输入信号，例如运动重建（Structure-from-Motion, SfM）、视觉定位和同步定位与地图构建（SLAM）。然而，单目深度估计器（MDEs）仍然主要通过深度准确性进行评估。标准指标在全局范围内聚合误差，可能无法反映深度在下游几何任务中的有效性。因此，我们提出了Depth2Pose，一个用于在下游任务背景下评估MDEs的框架。通过将深度预测与深度感知几何求解器中的特征对应结合，我们使用相对相机姿态估计的准确性作为深度质量的任务驱动代理。传统基准需要以每像素深度的形式提供密集的真实值，这种获取成本高昂。相比之下，我们的公式仅需要相机姿态，这可以高效估计，例如使用运动重建管道。因此，我们的框架可以应用于真实深度难以获取的场景，例如由于场景规模大或严重遮挡（例如植被环境）而导致的情况。利用这一点，我们引入了D2P数据集，其中包含超出常用训练数据分布的具有挑战性的场景。我们展示了在现有基准上表现良好的方法在我们的基于姿态的指标下也能在相同数据集上表现良好，但不一定能推广到我们更具挑战性的数据集。最后，我们提供了一个简单且可扩展的评估框架。数据集和代码可在kocurvik.github.io/depth2pose获取。

View on arXiv Download PDF AI Translation

cs.CV / 92 / 2605.19799

Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

用于半监督胎儿心脏超声分析的协同基础模型：SAM-Med2D 边界优化与 DINOv3 语义增强

Zhuang, Tonghao, Hu, Shanglong, Luo, Yongsheng, Zhang, Zhiqi, Li, Yu

Abstract

We present a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi-task backbone, our method integrates SAM-Med2D for boundary refinement and leverages DINOv3 to enhance pseudo-label quality. We introduce view-specific hard masking along with a two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient at 79.99%, Normalized Surface Distance at 61.62%, and F1-score at 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: https://github.com/2826056177/zcst_fetus2026.

Chinese Translation

我们提出了一种用于胎儿心脏超声图像的联合分割和分类的半监督框架。该方法基于 EchoCare 多任务骨干网络，集成了 SAM-Med2D 进行边界优化，并利用 DINOv3 提升伪标签质量。我们引入了视图特定的硬掩膜，并采用了两阶段优化策略：首先是 EMA 阶段以巩固分割能力，随后是分类微调阶段，该阶段冻结分割参数并重置分类头，以在不影响分割性能提升的情况下恢复分类性能。在 FETUS 2026 排行榜上评估，我们的方法达到了 79.99% 的 Dice 相似系数、61.62% 的归一化表面距离和 41.20% 的 F1-score，验证了我们的方法在产前先天性心脏病筛查中的有效性。源代码已公开，地址为：https://github.com/2826056177/zcst_fetus2026。

View on arXiv Download PDF AI Translation

cs.CV / 93 / 2605.19804

Stitched Value Model for Diffusion Alignment

用于扩散对齐的拼接价值模型

Go, Hyojun, Chung, Hyungjin, Truong, Prune, Bhat, Goutam, Mi, Li, An, Zhaochong, Zhao, Zixiang, Narnhofer, Dominik, Belongie, Serge, Tombari, Federico, Schindler, Konrad

Abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.

Chinese Translation

在实际应用中，基于扩散或流的生成模型必须与特定任务的奖励对齐，例如提示的保真度或美学偏好。这种对齐具有挑战性，因为奖励是针对干净输出图像定义的，但对齐过程需要在嘈杂的中间潜变量上进行价值函数估计。现有方法依赖于Tweedie风格或蒙特卡洛近似，在估计器偏差与计算成本之间进行权衡：Tweedie估计高效但有偏差，而蒙特卡洛估计更准确但需要昂贵的回滚。一个自然的替代方案是学习的价值函数，但如何有效地训练一个强大且通用的价值模型以适应嘈杂潜变量仍然是一个未解的问题。在此，我们提出了StitchVM，一个模型拼接框架，能够高效地将为干净图像预训练的奖励模型转移到嘈杂潜变量领域。StitchVM从一个现有的、截断的像素空间奖励模型开始，并将一个冻结的扩散主干附加到其上作为头部。通过像素空间模型，得到的混合模型保留了经过精心预训练的强大奖励能力；通过扩散主干，它继承了处理嘈杂潜变量的固有能力。拼接过程极其轻量，例如，拼接和微调CLIP ViT-L和SD 3.5 Medium仅需10个GPU小时。通过将强大的像素空间奖励模型提升到潜在空间，StitchVM开启了一种新的扩散对齐风格：与其进行粗略但昂贵的逐样本价值函数近似，不如为实际的嘈杂潜变量构建正确的函数，然后在多个样本和迭代中进行摊销。我们展示了这种方法在广泛的下游引导和后训练方法中带来了改进：DPS的速度提高了$3.2 imes$，同时峰值GPU内存减少了一半，而DiffusionNFT的速度提高了$2.3 imes$。

View on arXiv Download PDF AI Translation

cs.CV / 94 / 2605.19821

LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

LaCoVL-FER：基于地标引导的对比学习网络与视觉-语言增强用于面部表情识别

Wang, Jiaxin, Jian, Muwei, Yu, Hui, Dong, Junyu, Xia, Yifan

Abstract

Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.

Chinese Translation

在自然环境中，面部表情识别（FER）仍然面临挑战，主要由于姿态、遮挡和光照的不可控变化。现有的大多数基于注意力的方法主要依赖于视觉外观线索，容易受到注意力冗余和不稳定性的影响，这限制了它们在复杂场景中的表现。为了解决这些问题，我们提出了一种新颖的基于地标引导的对比学习网络与视觉-语言增强的FER方法（LaCoVL-FER），该方法结合了来自面部地标的几何先验和来自视觉-语言模型的语义先验。具体而言，我们设计了一种地标引导自适应编码器（LGAE），通过双分支门控交叉注意力（BGCA）机制引入几何先验，实现基于地标的几何特征和视觉外观特征的自适应融合，以生成与表情相关的特征，从而聚焦于关键面部区域并抑制噪声干扰。同时，提出了一种视觉-语言增强策略（VLES），利用与表情相关的特征来优化由冻结的预训练CLIP图像编码器提取的可泛化视觉特征，从而生成特定于表情的视觉表示。基于这些表示，采用表情条件提示（ECP）机制进一步调整来自冻结的预训练CLIP文本编码器的固定类别级提示的文本特征，生成更具实例感知的文本表示。这些视觉-文本表示作为语义先验进行对齐，以增强FER的鲁棒性和泛化能力。定量和定性实验表明，我们的LaCoVL-FER在三个具有代表性的真实世界FER数据集上优于最先进的方法，包括RAF-DB、FERPlus和AffectNet。代码可在https://github.com/ylin06804/LaCoVL-FER获取。

View on arXiv Download PDF AI Translation

cs.CV / 95 / 2605.19837

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

CADENet：用于自主驾驶恶劣天气感知的条件自适应异步双流增强网络

Khairy, Sherif, Elias, Catherine M.

Abstract

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.

Chinese Translation

恶劣天气（如雨、雾、沙和雪）会降低自主车辆基于摄像头的物体检测能力。现有的增强后检测方法会阻碍安全关键的感知循环，违反严格的实时性要求。对此问题的进展也受到一个未被充分认识的评估上限的限制：在降级图像上标注的真实情况无法为恢复了标注者自己无法看到的物体的检测器提供信用，因此真正有用的增强可能仅表现为接近平坦的F1增益。本文提出了CADENet（条件自适应异步双流增强网络），这是一个无训练的三线程系统：线程S（YOLOv11n）以零额外延迟以全帧率提供检测；线程Q应用条件自适应增强（CAPE），并通过熵引导的非极大值抑制（EG-NMS）融合结果，而不阻塞线程S；线程E提供CLIP零样本天气分类，因此新的天气类别只需一个新的文本提示，无需标注数据和重新训练。在1327张DAWN图像上进行评估（YOLOv11m，IoU = 0.5，置信度 = 0.25），CADENet在雪上实现了召回率 = 0.0103（微观），F1 = 0.0230，在雨上F1 = 0.0038。我们对DAWN类数据的标注完整性偏差进行了形式化，因此报告的F1值是实际增益的下限；召回率是抗标注差距的主要指标。线程S在增强负载下仍能维持约44 FPS。无需模型重新训练或额外的传感器硬件。

View on arXiv Download PDF AI Translation

cs.CV / 96 / 2605.19839

When Preference Labels Fall Short: Aligning Diffusion Models from Real Data

当偏好标签不足时：从真实数据对齐扩散模型

Chen, Weiyan, Deng, Weijian, Xiao, Yao, Tu, Weijie, Dong, ZiYi, Radwan, Ibrahim, Lin, Liang, Wei, Pengxu

Abstract

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.

Chinese Translation

偏好对齐旨在通过学习偏好样本与非偏好样本之间的比较来指导生成模型。在实践中，大多数现有方法依赖于从模型生成的图像构建的偏好对。这种监督本质上是相对的，当两个样本都表现出伪影或有限的视觉质量时，可能会产生模糊性，使得难以推断出什么构成真正理想的输出。在本研究中，我们探讨了真实数据是否可以作为偏好对齐的替代监督来源。我们采用以数据为中心的视角，研究了一种策展策略，该策略将真实图像视为参考点，并通过将其与生成或扰动样本进行对比来构建偏好信号，而无需手动标注的偏好对。通过实证分析，我们表明基于真实数据的监督为对齐扩散模型提供了有效的指导，并达到了与现有基于偏好的方法相当的性能。我们的结果表明，真实数据为偏好对齐提供了一种实用且互补的监督来源，并强调了标签高效对齐策略的方向。代码和模型可在 https://cwyxx.github.io/RealAlign 获取。

View on arXiv Download PDF AI Translation

cs.CV / 97 / 2605.19846

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

FineBench：针对细粒度人类活动理解的视觉-语言模型基准测试与增强

Faure, Gueter Josmy, Chen, Min-Hung, Yeh, Jia-Fong, Su, Hung-Ting, Hsu, Winston H.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.

Chinese Translation

视觉-语言模型（VLMs）在一般视频理解方面展现了卓越的能力，但在对人类行为和互动进行细粒度理解时，常常面临挑战，这对于需要细致解读的现实应用至关重要。尽管一些近期的人本基准评估了模型行为的某些方面，如公平性/伦理、情感感知以及更广泛的人本指标，但它们并未在规模上结合长视频、非常密集的问答覆盖以及帧级空间/时间基础。为填补这一空白，我们提出了FineBench，一个专门设计用于评估细粒度理解的人本视频问答（VQA）基准。FineBench包含199,420个多项选择问答对，密集标注于64个长视频（每个15分钟），重点关注详细的人物移动、人物互动和物体操作，包括组合动作。我们的广泛评估显示，尽管像GPT-5这样的专有模型表现出色，但当前的开源VLM在性能上显著不足，尤其在多人物场景中的空间推理和区分人类动作与互动的细微差别方面表现不佳。为了解决这些识别出的弱点，我们提出了FineAgent，一个通过利用定位器（Localizer）和描述器（Descriptor）来增强VLM的模块化框架。实验表明，FineAgent持续提升了多种开源VLM在FineBench上的表现。FineBench为未来细粒度人本视频理解的研究提供了一个严格的测试平台，而FineAgent则为提升当前VLM中的此类推理提供了一种实用的方法。

View on arXiv Download PDF AI Translation

cs.CV / 98 / 2605.19855

A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

基于概念的可解释性中零样本图像生成评估框架

Astolfi, Giacomo, Bianchi, Matteo, Campi, Riccardo, De Santis, Antonio, Brambilla, Marco

Abstract

Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra-similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept-based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero-shot pipelines in model analyses. The resulting dataset is available at https://github.com/DataSciencePolimi/ZeroShot-T2I-Concepts.

Chinese Translation

基于概念的可解释人工智能（XAI）通过将内部表示与类别预测联系起来，使用人类可理解的视觉特征（例如，纹理或物体部分）来解释深度学习模型，从而弥合低级图像数据与高级语义之间的差距。然而，一个主要挑战是依赖大量标记图像来表示每个概念，这限制了可扩展性。在本研究中，我们探讨了使用零样本文本到图像（Text-to-Image, T2I）生成模型作为基于概念的XAI方法的合成概念数据集的来源。具体而言，我们使用预定义的提示生成概念，并通过四个互补分析评估其与真实概念的相似性：（1）通过概念表示相似性比较合成与真实概念图像；（2）通过比较同一概念的逐步增大规模的子集对评估其内部相似性；（3）使用相关类别图像评估其在下游解释任务中的表现；（4）评估从测试类别图像中移除一个概念如何影响生成概念的解释。尽管当前的T2I生成模型为基于概念的XAI提供了一条捷径，但我们的研究突出了挑战，并提出了关于在模型分析中使用零样本管道生成的合成数据的未解问题。生成的数据集可在 https://github.com/DataSciencePolimi/ZeroShot-T2I-Concepts 获取。

View on arXiv Download PDF AI Translation

cs.CV / 99 / 2605.19859

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

关注视觉语言模型：在视觉语言模型中基准测试注视跟随和社会注视预测

Wang, Hengfei, Gupta, Anshul, Vuillecard, Pierre, Odobez, Jean-Marc

Abstract

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.

Chinese Translation

视觉语言模型（VLMs）迅速发展成为具有强大零样本泛化能力的通用多模态推理器。在这一背景下，VLMs 可以极大地促进人类注视和注意力的分析，这是理解人类行为的核心任务，要求对物理场景、活动、互动和社会背景进行推理。然而，VLMs 在多大程度上能够可靠地理解人类注视及相关的注意行为仍然很大程度上未被探索。在本研究中，我们提出了 EyeVLM，这是一个系统的评估框架，用于在 VLMs 中进行注视理解的评估，涵盖两个互补维度：任务和模型。为了评估注视理解能力，我们专注于两个核心任务。第一个任务是注视跟随，即预测一个人注视的 2D 位置，侧重于几何和视觉处理，要求对人脸、注意方向、3D 场景结构和被关注目标的空间定位有精确的理解。第二个任务是社会注视预测，要求对多个人互动（例如，互视和共享注意力）进行社会和关系推理，并可能更依赖于 VLMs 中的 LLM 语义推理能力。在模型方面，EyeVLM 通过两种方式评估这些任务：一种是零样本设置，使用多种最先进的开源和闭源 VLMs，探索不同的提示策略；另一种是基于任务特定问答对的微调方法，研究模型规模和数据规模的影响。作为基准，我们依赖现有的注视理解数据集，并与最先进的纯视觉模型进行系统比较。总体而言，我们的结果表明，当前的 VLMs 缺乏精确的注视理解能力。尽管标准训练有助于缩小与视觉模型的差距，但仍需显著改进。

View on arXiv Download PDF AI Translation

cs.CV / 100 / 2605.19865

Landscape-Awareness for Geometric View Diffusion Model

面向几何视图扩散模型的景观感知

Chen, Yan-Ting, Chen, Hao-Wei, Hsiao, Tsu-Ching, Lee, Chun-Yi

Abstract

Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, showing promising results when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from nonconvex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample-efficiency.

Chinese Translation

在稀疏视图条件下，准确的相机视点估计仍然具有挑战性，尤其是在双视图场景中。最近的方法利用扩散模型，如 Zero123，根据相对视点合成新视图，在通过均方误差（MSE）损失进行优化时，显示出有希望的结果。然而，现有方法通常面临非凸损失景观的问题，存在众多局部最小值，使其对初始化敏感，并依赖于简单的多起始策略。我们分析了这些优化挑战并可视化了失败案例，显示几何模糊性（如对称性和自相似性）可能会误导基于梯度的更新朝向错误的视点。为了解决这些限制，我们提出了一种基于评分的方法，重塑优化景观以引导更新朝向真实视点，随后使用视点条件的扩散模型进行精细化阶段。实验表明，我们的方法改善了收敛性，减少了对强力采样的依赖，并在更高的样本效率下实现了竞争性的准确性。

View on arXiv Download PDF AI Translation

cs.CV / 101 / 2605.19866

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

用于鲁棒的分布外视觉文档理解的结构化布局先验

Hachem, Peter El, Nassar, Ahmed, Gurbuz, A. Said, Auer, Christoph, Staar, Peter W. J.

Abstract

Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder's generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from $0.37$ to $0.92$; on the Chinese subset of OmniDocBench, table TEDS rises from $0.01$ to $0.36$; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost $15\%$ wall-clock latency and a median of $74$ prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.

Chinese Translation

视觉-语言模型（VLMs）能够端到端解析文档，但在处理与训练中不同的布局时常常出现问题。我们将此归因于一个两步瓶颈：在解码器提取内容（步骤2）之前，它必须首先分类和定位封闭的布局实体（步骤1），而当第一步失败时，第二步就会陷入遗漏、结构畸形或自回归重复。我们通过运行轻量级的RT-DETR检测器，在解码器外部预先解决步骤1，将其输出序列化为解析器的原生DocTags词汇，并将其与完整页面图像一起注入到提示中。与先分析后解析的方法（这些方法裁剪页面）或先前以纯文本书写的提示级先验不同，我们的先验共享了解码器的生成空间，并在检测结果噪声较大时保留全局图像作为后备。在一个包含1万页结构化分布外基准的测试中，markdown F1从0.37上升至0.92；在OmniDocBench的中文子集上，表格TEDS从0.01上升至0.36；在26,000页的ViDoRe V3基准上，各个工业领域的无限循环解码失败率均有所下降。这些改进的代价是15%的时延和中位数74个提示令牌，且对基础VLM没有架构上的改变。注意力级别的分析进一步揭示了一个双模态相位转变，即解码器在生成结构时关注注入的布局令牌，而在生成内容时关注图像块，这与两步瓶颈的缓解是一致的。模型权重将被发布以支持可重复性。

View on arXiv Download PDF AI Translation

cs.CV / 102 / 2605.19868

WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation

WoundFormer：多尺度空间特征融合用于多类别伤口组织分割

Kabir, Muhammad Ashad, Dulal, Rabin

Abstract

Chronic wounds such as diabetic foot ulcers and pressure injuries require accurate tissue-level assessment to guide treatment planning and monitor healing progression. While deep learning methods have advanced automated wound analysis, most existing approaches focus on binary segmentation and inadequately model heterogeneous tissue composition due to high intra-class variability and limited annotated data. Multi-class wound tissue segmentation, therefore, remains a challenging and clinically relevant problem. We propose WoundFormer, a transformer-based framework that enhances hierarchical spatial feature fusion for multi-class wound tissue segmentation. Specifically, we replace the standard SegFormer decoder with a spatially-preserving multi-scale aggregation head that maintains feature topology during cross-scale integration and strengthens contextual interactions through convolutional fusion. This design improves boundary localization and discrimination between visually similar tissue categories while preserving transformer efficiency. We evaluate WoundFormer on the WoundTissueSeg dataset (147 images, six tissue classes) and a second benchmark (DFUTissue dataset). The proposed method achieves an overall Dice score of 81.9%, outperforming strong CNN- and transformer-based baselines by up to 4.3 Dice points on the WoundTissueSeg benchmark, with consistent improvements across minority tissue classes. These results indicate that explicit modeling of hierarchical spatial interactions enhances transformer representations for heterogeneous wound tissue segmentation and supports more reliable quantitative wound assessment.

Chinese Translation

慢性伤口，如糖尿病足溃疡和压疮，需要准确的组织层面评估以指导治疗计划和监测愈合进程。尽管深度学习方法在自动化伤口分析方面取得了进展，但现有大多数方法集中于二分类分割，且由于类内变异性高和标注数据有限，未能充分建模异质组织组成。因此，多类别伤口组织分割仍然是一个具有挑战性且临床相关的问题。我们提出了WoundFormer，一个基于变换器的框架，增强了多类别伤口组织分割的层次空间特征融合。具体而言，我们用一个空间保持的多尺度聚合头替换了标准的SegFormer解码器，该聚合头在跨尺度集成过程中保持特征拓扑，并通过卷积融合增强上下文交互。该设计改善了边界定位和视觉相似组织类别之间的区分，同时保持了变换器的效率。我们在WoundTissueSeg数据集（147张图像，六个组织类别）和第二个基准（DFUTissue数据集）上评估了WoundFormer。所提方法在WoundTissueSeg基准上实现了81.9%的整体Dice得分，超越了强大的CNN和变换器基线，提升幅度高达4.3个Dice点，并在少数组织类别上表现出一致的改进。这些结果表明，明确建模层次空间交互增强了变换器在异质伤口组织分割中的表示能力，并支持更可靠的定量伤口评估。

View on arXiv Download PDF AI Translation

cs.CV / 103 / 2605.19869

Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

通过个性化支架对抗思维链验证的被动施工现场安全监测

Sriram, Ananth, Mokaria, Neel, Singh, Rajveer

Abstract

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

Chinese Translation

建筑业仍然是美国最致命的行业，2023年记录了1,055起致命工伤，其中大多数是可以预防的。现有的监测方法成本高昂，需要实时人力操作，或仅针对狭窄的违规行为子集。本文提出了一种被动的、班次结束时的施工安全监测流程，通过三阶段架构处理来自POV（第一人称视角）佩戴式和固定墙面摄像头的视频：(1) 针对主要个人防护装备（PPE）和危险检测的精细调优YOLO11，(2) 用于分割精细化和工人去重的SAM 3，以及(3) 采用方法提示、个性化支架的三次对抗思维链协议的Qwen3-VL-8B-Instruct，用于合规性验证和幻觉控制。主要贡献在于第三阶段提示设计：专业个性背景故事遵循方法演员框架，在对12个视频的Ironsite开发语料库进行非正式三作者评审时，观察到精确度提高了12%，在易产生幻觉的违规类别上获得了最大的提升。结构性信息隔离确保生成器、鉴别器和由不对称规则控制的和解过程之间的观察独立性，这些规则编码了关于人类观察与自动检测可靠性的先验知识。该系统将违规行为映射到OSHA标准，基于姿态关键点执行REBA（快速评估身体活动）启发的人体工学风险评分，并生成每位工人的安全报告，附带时间戳证据。我们发布了一个评估工具，以便未来的复现研究。

View on arXiv Download PDF AI Translation

cs.CV / 104 / 2605.19876

Structural Energy Guidance for View-Consistent Text-to-3D Generation

结构能量引导的视图一致性文本到3D生成

Zhang, Qing, Tong, Jinguang, Zhang, Jing, Hong, Jie, Li, Xuesong

Abstract

Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

Chinese Translation

基于扩散模型的文本到3D生成常常面临雅努斯问题，导致不同视角下几何形状不一致。本研究识别出二维扩散先验中的视角偏差是主要原因，并提出了结构能量引导采样（Structural Energy-Guided Sampling, SEGS），这是一种无训练且即插即用的框架，用于改善多视角一致性。SEGS在U-Net特征的主成分分析（PCA）子空间中构建结构能量，并将其梯度注入去噪过程中。该方法可以轻松集成到SDS/VSD管道中，而无需重新训练。实验表明，SEGS平均减少了约10%的雅努斯率，并在多个基准上提高了视图一致性评分，包括DreamFusion、Magic3D和LucidDreamer。该方法有效缓解了视角伪影，同时保持外观保真度，为高质量文本到3D内容生成提供了灵活的解决方案。

View on arXiv Download PDF AI Translation

cs.CV / 105 / 2605.19890

GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

必须多样化：重新思考测试时间适应的记忆策略

Alhuwaider, Shyma, Alsaedy, Yasmeen, Ramazanova, Merey, Giancola, Silvio, Ghanem, Bernard

Abstract

Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.

Chinese Translation

测试时间适应（TTA）使得预训练模型能够在线适应在分布变化下的无标签测试流。尽管大多数TTA研究集中在适应目标上，但实际流也在很大程度上依赖于用于选择哪些测试样本驱动适应的记忆。现有的记忆机制通常作为特定TTA算法的组成部分进行评估，这使得很难孤立出哪些记忆设计选择是重要的，以及何时重要。在这项工作中，我们提供了一个系统的基准，将记忆与适应算法解耦，并在统一条件下评估记忆策略，包括独立同分布（i.i.d.）、非独立同分布（non-i.i.d.）、持续性和实际测试流。我们的研究表明，有效的记忆管理不仅仅依赖于保留最近或类别平衡的样本。特别是，类内多样性是避免冗余缓冲区和在时间相关及标签偏斜流中保持代表性适应信号的关键因素。基于这一发现，我们引入了引导观察测试时间适应（Guided Observational Test-Time Adaptation, GOTTA），这是一系列关注多样性的记忆策略，结合了类别平衡分配和特征空间多样性。GOTTA记忆作为现有缓冲区的替代品，可以与不同的TTA目标配对。在腐蚀基准和视频流设置中，关注多样性的记忆在受限的内存预算和具有挑战性的非独立同分布流中显著改善了适应性，同时在内存容量增加时仍保持竞争力。这些结果突显了记忆管理作为稳健测试时间适应的一个重要组成部分，并确定了多样性作为实际TTA的核心原则。

View on arXiv Download PDF AI Translation

cs.CV / 106 / 2605.19929

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

打破大规模视觉-语言模型低比特量化中的模态异质性

Zhong, Yi, Qin, Haotong, Zhang, Xindong, Zhang, Lei, Sun, Guolei

Abstract

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ

Chinese Translation

低比特后训练量化（PTQ）是将视觉-语言模型（VLMs）部署到资源受限设备上的关键技术。然而，现有的PTQ方法常常由于文本和视觉模态在量化过程中异质的激活分布而降低VLM的准确性。我们发现，这种跨模态异质性在通道间分布不均：少数通道包含大部分模态特定的异常值，而这些异常值通常在每个模态的不同通道中存在。基于此，我们提出了SplitQ，一个基于通道分割的后训练量化框架。SplitQ的核心是引入了一种新颖的模态特定异常通道解耦（MOCD）模块，能够有效地隔离显著的模态特定异常通道，且开销极小。为了进一步解决剩余的跨模态分布差异，我们设计了一个自适应跨模态校准（ACC）模块，利用双轻量可学习分支动态减轻模态引起的量化误差。在流行的VLM上进行的大量实验表明，SplitQ在所有评估的量化设置下（包括W4A8、W4A4、W3A3和W3A2）显著优于现有方法，覆盖6个流行的多模态数据集。值得注意的是，SplitQ在具有挑战性的W3A3设置下保留了93.5%的FP16性能（69.5 vs. 74.3），推动了先进VLM部署的效率前沿。我们的代码可在https://github.com/EMVision-NK/SplitQ获取。

View on arXiv Download PDF AI Translation

cs.CV / 107 / 2605.19931

StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

StruMPL：在不相交的部分监督和MNAR标签下的多任务密集回归

Asiyabi, Reza M., Molina-Valero, Juan Alberto, Partnership, The SEOSAW, Hancock, Steven, Ryan, Casey M.

Abstract

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.

Chinese Translation

从地球观测中估计森林地上生物量（AGB）结合了两种结构上不兼容的标签来源：太空激光雷达提供了数百万个位置的树冠结构，但没有生物量估计，而基于地面的样地提供了数千个偏倚位置的生物量，但没有结构指标。没有单一的训练样本携带所有目标变量的标签，样地标签是以非随机缺失（MNAR）的方式缺失的，并且生物量与结构变量之间通过已知但特定生物群落的异速生长法则相联系。我们将其形式化为在异质不相交部分监督下的多任务密集回归，具有MNAR标签和任务间物理约束，并提出StruMPL以共同解决此问题。一个共享编码器为每个变量的回归、插补和倾向头提供空间MNAR校正，并且一个可学习的物理模块在每个像素上评估模型自身预测的任务间约束。监督损失使用增强的逆概率加权（AIPW）伪结果，并在倾向和插补基线上的停止梯度；我们从理论和实证上证明这两者对于联合优化以恢复IPW加权的平稳点是必要的，同时保持损失有界。在两个生态上不同的生物群落中，StruMPL在AGB均方根误差（RMSE）和偏差方面优于消融变体和最接近的已发布方法，分层分析显示AIPW将高AGB偏差减少了约54%。

View on arXiv Download PDF AI Translation

cs.CV / 108 / 2605.19949

Feed-Forward Gaussian Splatting from Sparse Aerial Views

来自稀疏航空视图的前馈高斯点云重建

Wu, Dongli, Li, Zhuoxiao, Hua, Tongyan, Ren, Yinrui, Wei, Xiaobao, Qin, Rongjun, Zhao, Wufan

Abstract

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

Chinese Translation

从稀疏的航空视图重建大规模城市场景是一项关键但具有挑战性的任务。由于偏向顶部和浅斜的相机姿态，稀疏的航空捕捉表现出明显的证据不平衡：屋顶和开放区域被反复观察，而立面、远处建筑物和被遮挡结构则获得的多视角支持较少。现有的前馈3D高斯点云重建方法直接从稀疏输入回归一个确定性的表示，但这往往导致鬼影、熔化的立面和拉伸的纹理。最近的伪视图和基于视频的生成重建方法使用额外的监督或生成先验。然而，它们往往缺乏观察几何与先验驱动内容之间的明确分离，这可能导致看似合理但不一致的结构。我们提出了AnyCity，一个基于观察的稀疏航空城市场景生成重建框架。AnyCity首先预测一个支持观察的几何潜变量，以锚定可靠的结构，然后使用脚手架条件的航空补全标记来预测一个门控残差更新，以处理弱约束内容，最后进行高斯解码。在训练过程中，稠密到稀疏的蒸馏从稠密视图重建中转移结构线索，而适应航空的视频扩散先验通过门控标记条件提供细粒度的城市外观线索。保持观察的一致性目标使得精炼的表示与输入支持的几何保持一致。在推理时，AnyCity通过单次前馈传递从稀疏航空视图重建最终的3D高斯场景，实现了连贯的城市新视图合成和二级推理。在合成、航空域、无人机纹理和真实场景的实验中，AnyCity在前馈基线之上显示出一致的改进。

View on arXiv Download PDF AI Translation

cs.CV / 109 / 2605.19950

AffectVerse: Emotional World Models for Multimodal Affective Computing

情感宇宙：用于多模态情感计算的情感世界模型

Zhao, Bo, Ye, Fanghua, Ji, Yixin, Zhao, Sicheng, Peng, Xiaojiang, YU, Zitong

Abstract

Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. \rev{EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA(Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning.} AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves at least 2.57\% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.

Chinese Translation

人类通过整合观察到的多模态线索与对情感状态如何展开的预期来推断情感。然而，现有的多模态大型语言模型（MLLMs）通常将情感识别视为对完整的视听文本输入进行静态融合，从而使情感动态隐含。我们提出了AffectVerse，这是一种基于Qwen2.5-Omni的模型，配备了情感世界模块（Emotion World Module, EWM），这是一个无动作的表示级模块，用于短期潜在情感预测。EWM包含三个模块：1）跨模态时间想象（Cross-Modal Temporal Imagination）从过去的标记预测未来的视频/音频表示，并进行多步展开；2）模态感知多步注意力（MAMA, Modality-Aware Multi-step Attention）信念聚合将想象的标记压缩为模态感知信念标记；3）信念注入（Belief Injection）将这些信念标记插入LLM中以进行情感推理。AffectVerse将未来预测作为过去条件下的自监督信号：它并不替代对观察历史的建模，也不需要在推理时使用未见信号，而是强制当前信念状态编码预测后续情感变化的转变线索。在九个基准测试中，AffectVerse的表现至少提高了2.57 ext{%}，而控制性消融实验显示时间想象、跨模态展开和信念聚合带来了附加收益。这些结果表明，预测信念状态建模是情感计算的一个实用替代方案。

View on arXiv Download PDF AI Translation

cs.CV / 110 / 2605.19956

Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

迈向细粒度鲁棒性：面向视觉-语言模型的注意力引导测试时提示调优

Hai, Jia-Wei, Wang, Yijun, Wei, Xiu-Shen

Abstract

Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://github.com/SEU-VIPGroup/A-TPT .

Chinese Translation

视觉-语言模型（VLMs），如 CLIP，在下游任务中通过各种微调适应方法取得了显著的零-shot 性能。然而，近期研究表明，对抗性攻击会显著降低 VLMs 的推理能力，给其实际应用带来重大风险。普遍的测试时适应方法通常依赖于多视角增强来实施各种微调策略，但在识别语义信息方面存在困难，并且在细粒度场景中容易破坏判别区域。为了解决这些局限性，我们提出了注意力引导测试时提示调优（Attention-Guided Test-Time Prompt Tuning，A-TPT），这是一种旨在测试时适应的语义保留方法。我们首先优化了梯度注意力传播机制，以识别在对抗性攻击下仍然具有语义意义的区域。此外，我们利用这些区域来引导空间变化的增强强度和多视角集成，以进行提示调优和推理。大量实验表明，A-TPT 在对抗性和干净数据上均优于现有的测试时适应方法。代码可在 https://github.com/SEU-VIPGroup/A-TPT 获取。

View on arXiv Download PDF AI Translation

cs.CV / 111 / 2605.19957

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

混合具身任务中的长时间演化的世界-自我建模

Lin, Zuyao, Zhang, Jianhui, Jia, Peidong, Zhao, Xiaoguang, Zhang, Shanghang, Chen, Xingyu

Abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce \emph{World-Ego Modeling}, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

Chinese Translation

世界模型在具身智能中得到了广泛探索，但它们通常在单一流中预测世界和自我的不同演化，其中世界捕捉持久的与指令无关的场景规律，而自我则捕捉以机器人为中心的与指令相关的动态。这种世界-自我纠缠导致在长时间具身场景中的性能下降，特别是在具有交错导航和操作行为的混合任务中。本文介绍了 extit{世界-自我建模}，一种新的概念范式，将未来演化分解为世界和自我组件。我们从运动、语义和意图三个视角定义世界-自我边界，并分析了后、前和完全解缠的三种解缠策略。此外，我们将这一范式实例化为世界-自我模型（World-Ego Model, WEM），这是一个统一的具身世界模型，结合了隐式分离的世界-自我规划器和级联并行专家混合（cascade-parallel mixture-of-experts, CP-MoE）扩散生成器。为了实现严格的评估，我们进一步构建了HTEWorld，这是第一个针对混合导航-操作任务的长时间世界建模基准，提供了125K个视频片段（超过450万帧）以及细粒度的动作注释和300个多回合评估轨迹（超过2000条指令）。大量实验表明，WEM在HTEWorld上实现了最先进的性能，同时在现有的仅操作基准上仍具竞争力。

View on arXiv Download PDF AI Translation

cs.CV / 112 / 2605.19974

SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion

SphericalDreamer：通过全景融合生成可导航的沉浸式3D世界

Schnepf, Antoine, Kassab, Karim, Vasile, Flavian, Comport, Andrew

Abstract

The generation of immersive and navigable 3D environments is increasingly prevalent with the growing adoption of virtual reality and 3D content. However, recent methods face a fundamental limitation: they cannot produce 3D worlds that simultaneously (i) are navigable over long-range spatial extents and (ii) cover the complete omnidirectional field of view ($360^\circ$ horizontally and $180^\circ$ vertically). To address this challenge, we introduce SphericalDreamer, a method for generating fully immersive and long-range 3D outdoor environments from textual prompts. Our approach is built on the generation of multiple panoramic images, which are subsequently lifted into 3D and fused together while maintaining visual and geometric consistency. SphericalDreamer produces highly detailed, fully immersive 3D environments, while substantially improving scale and navigability compared to prior approaches.

Chinese Translation

随着虚拟现实和3D内容的日益普及，生成沉浸式和可导航的3D环境变得越来越重要。然而，近期的方法面临一个根本性的限制：它们无法同时生成(i)在长距离空间范围内可导航的3D世界，以及(ii)覆盖完整的全向视野（水平$360^ ext{°}$和垂直$180^ ext{°}$）。为了解决这一挑战，我们提出了SphericalDreamer，一种从文本提示生成完全沉浸式和长距离3D户外环境的方法。我们的方法基于生成多个全景图像，这些图像随后被提升为3D并融合在一起，同时保持视觉和几何的一致性。与先前的方法相比，SphericalDreamer生成的3D环境具有高度细致、完全沉浸的特点，同时在规模和可导航性上有显著提升。

View on arXiv Download PDF AI Translation

cs.CV / 113 / 2605.19976

RECIPE: Procedural Planning via Grounding in Instructional Video

RECIPE：基于教学视频的程序规划

Seminara, Luigi, Furnari, Antonino, Torresani, Lorenzo

Abstract

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.

Chinese Translation

视觉规划要求模型在给定部分视频上下文和目标的情况下，以自然语言生成程序的剩余步骤。该任务的进展受限于标注：干净的标注数据集数量少、领域狭窄，并且每个示例仅编码单一执行轨迹，尽管存在许多有效的排序。大规模的教学视频语料库提供了数量级更多的程序内容，但在其嘈杂的自动语音识别（ASR）叙述中进行伪标签的监督微调会传播分割和对齐错误，并且仍然保持单一轨迹。我们识别出一个关键的不对称性：从嘈杂视频中提取干净的步骤标签是困难的，但验证生成的步骤序列是否在ASR转录中时间上是基础的则成本低且可以扩展到数百万个视频。我们在RECIPE中利用这一不对称性，将基础质量作为GRPO的奖励，将嘈杂语料库转变为验证器而非标签源。该框架均匀适用于两种规划器输入配置（苏格拉底式，使用由冻结的视觉语言模型（VLM）提取的文本历史，以及视频，直接处理视频标记）以及注释和弱监督模式。我们在7个程序基准上进行评估，使用基于参考的LLM作为评判协议，针对6个程序标准对计划进行评分。RECIPE-RL在所有规模（0.5B、3B、7B）和每个基准上均优于基础检查点，在领域内的宏观准确率提升了7到8个百分点，在零样本情况下提升了最多16个百分点。它在注释和伪标注计划上均优于监督微调（后者会降低基础性能），并且在没有人工标注的情况下仍然保持稳健。作为先前提议-评估-搜索规划器的提议阶段，它在视觉规划辅助的每个时间范围内均优于最强的零样本基线，并且在COIN上保持了生成多样性，而监督微调则会导致其崩溃。

View on arXiv Download PDF AI Translation

cs.CV / 114 / 2605.19982

InterLight: Leveraging Intrinsic Illumination Priors for Low-Light Image Enhancement

InterLight：利用内在照明先验进行低光照图像增强

Wang, Ziqi, Zhang, Xu, Chang, Laibin, Chen, Shi, Ma, Jiaqi, Zhang, Huan

Abstract

Low-Light Image Enhancement (LLIE) has long been a challenging problem in low-level vision, as insufficient illumination often leads to low contrast, detail loss, and noise. Recent studies show that deep learning-based Retinex theory can effectively decouple illumination and reflectance. However, existing methods frequently suffer from over-enhancement or color distortion, and often assume uniform noise or ideal lighting. To address these limitations, we propose InterLight, a novel framework that systematically excavates and operationalizes intrinsic illumination priors for LLIE.Our core insight is that robust enhancement requires not just estimating illumination, but constructing an illumination-aware pipeline. We first inject sensor-level illumination-response priors via physics-guided augmentation, then represent the degradation through adaptive prompts conditioned on the scene's latent illumination state. This explicit representation directly guides a luminance-gated intrinsic memory mechanism to selectively compensate for information loss, prioritizing reconstruction in dark regions while preserving fidelity in bright ones. Finally, the entire process is regularized by a self-supervised consistency objective that distills illumination-invariant features. By deeply exploiting intrinsic illumination priors, our method achieves clearer textures and more visually coherent enhancement results. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach. Code is available at: https://github.com/House-yuyu/InterLight.

Chinese Translation

低光照图像增强（LLIE）长期以来一直是低级视觉中的一个挑战性问题，因为不足的照明常常导致低对比度、细节丢失和噪声。近期研究表明，基于深度学习的Retinex理论能够有效地解耦照明和反射。然而，现有方法常常面临过度增强或颜色失真的问题，并且通常假设均匀噪声或理想照明。为了解决这些局限性，我们提出了InterLight，一个新颖的框架，系统性地挖掘并操作内在照明先验以进行LLIE。我们的核心见解是，稳健的增强不仅需要估计照明，还需要构建一个照明感知的处理流程。我们首先通过物理引导的增强注入传感器级别的照明响应先验，然后通过基于场景潜在照明状态的自适应提示表示退化。这种显式表示直接引导一个亮度门控的内在记忆机制，以选择性地补偿信息损失，优先在暗区重建，同时在亮区保持保真度。最后，整个过程通过自监督一致性目标进行正则化，以提炼照明不变特征。通过深入挖掘内在照明先验，我们的方法实现了更清晰的纹理和更具视觉一致性的增强效果。在多个基准上的广泛实验证明了我们方法的有效性。代码可在以下链接获取：https://github.com/House-yuyu/InterLight。

View on arXiv Download PDF AI Translation

cs.CV / 115 / 2605.19995

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

CogOmniControl：基于推理驱动的可控视频生成与创意意图认知

Yang, Hongji, Li, Songlian, Zhou, Yucheng, Zhao, Xiaotong, Zhao, Alan, Xu, Chengzhong, Shen, Jianbing

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models, either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce the videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and tuning these cues into dense reasoning output. Besides, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we release its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflows data that carry genuine creative intent rather than simulated ones. Experiments on two benchmarks show that CogOmniControl surpassed the existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/

Chinese Translation

近期的扩散模型在视频生成方面实现了强大的真实感和流畅性，但在抽象、稀疏或复杂条件下仍然表现脆弱，导致在专业制作工作流程中（如故事板草图和粘土渲染条件）的表现不佳。现有的视频生成模型要么通过适配器注入条件，要么在扩散主干中结合通用视觉-语言模型（VLM），这造成了能力差距，无法生成与用户创意意图相符的视频。我们提出了CogOmniControl，这是一个推理驱动的框架，将可控视频生成分解为创意意图认知和生成。具体而言，我们使用真实的动漫制作数据训练了一个专门的CogVLM。与通用VLM相比，它生成更专业和清晰的输出，能够准确地从稀疏和抽象条件中认知用户的创意意图，并将这些线索调整为密集的推理输出。此外，CogOmniDiT通过上下文生成统一了来自各种条件的控制，并通过强化学习与CogVLM的推理输出对齐。此外，利用CogVLM在指导视频生成方面的强大能力，我们释放其在规划特定评估器方面的潜力，并为生成的视频启用最佳选择（Best-of-N）。这一整合将整个框架转变为一个闭环的“马具式”架构。我们进一步介绍了CogReasonBench和CogControlBench，这些基于专业工作流程数据构建，承载真实的创意意图，而非模拟的。对两个基准的实验表明，CogOmniControl超越了现有的开源模型。项目网站：https://um-lab.github.io/CogOmniControl/

View on arXiv Download PDF AI Translation

cs.CV / 116 / 2605.20033

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

无训练的多模态步骤验证的纳什均衡框架

Sinha, Rohit, Tilaganji, Kunal, Ganu, Tanuja, Natarajan, Nagarajan, Sharma, Amit, Balasubramanian, Vineeth N.

Abstract

Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges' interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.

Chinese Translation

多模态大型语言模型常常生成包含微妙错误的推理链，这导致错误的答案。目前的验证方法存在显著的局限性。学习型评审者需要大量标注数据，并且在不同任务中表现不一致。同时，现有的无训练方法仅仅对来自不同来源的分数进行平均，忽视了一个关键的洞察：当这些分数不一致时，这种不一致本身携带着关于推理步骤是否真正有效的重要信息。我们提出了一种无训练的验证方法，将逐步验证视为专门评审者之间的协调问题。我们将这些评审者的互动形式化为一个纳什均衡博弈，其中一致性信号表示有效步骤，而不一致性则揭示不稳定性。我们的方法通过封闭形式解计算均衡分数，使得能够进行不一致性感知的过滤和稳定性意识的推理步骤排名。在六个基准测试中的评估，我们的方法在基线模型上实现了2.4%到5.2%的持续改进，并且在与学习型评审者的比较中表现出竞争力，证明了跨模态一致性（而不仅仅是平均置信度）提供了强健的验证信号，而无需特定任务的适应。

View on arXiv Download PDF AI Translation

cs.CV / 117 / 2605.20035

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

阶段自适应的高效全模态大语言模型的令牌选择

Xin, Zijie, Yang, Jie, Zhao, Ruixiang, Wang, Tianyi, Rao, Fengyun, Lyu, Jing, Li, Xirong

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

Chinese Translation

全模态大语言模型（om-LLMs）通过将视频和音频编码为时间对齐的令牌序列，实现了统一的音视频理解。这些密集的非文本令牌在整个大语言模型中的处理会产生相当大的计算开销。尽管无训练的令牌选择可以降低这一成本，但现有方法要么仅关注视觉输入，要么在大语言模型之前以固定的每种模态比例修剪om-LLM令牌，未能捕捉跨模态令牌重要性在各层之间的演变。为了解决这一限制，我们首先分析了om-LLMs的层级令牌依赖性。我们发现视觉和音频依赖呈现块状模式，并随着深度逐渐减弱，这表明许多后层的非文本令牌在跨模态融合后变得冗余。基于这一观察，我们提出了SEATS，一种无训练的阶段自适应令牌选择方法，用于高效的om-LLM推理。在大语言模型之前，SEATS通过注意力加权的多样性选择去除时空冗余。在大语言模型内部，它在各个块之间逐步修剪令牌，并利用查询相关性得分动态分配保留预算，从时间窗口到模态。在后层，一旦跨模态融合完成，它将移除所有剩余的非文本令牌。在Qwen2.5-Omni和Qwen3-Omni上的实验表明，SEATS有效提高了推理效率。在仅保留10%的视觉和音频令牌的情况下，实现了9.3倍的FLOPs减少和4.8倍的预填充加速，同时保持了96.3%的原始性能。

View on arXiv Download PDF AI Translation

cs.CV / 118 / 2605.20044

OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

OP2GS：具有双透明度原语的对象感知3D高斯溅射

Liu, Guiyu, Vaara, Niklas, Mustaniemi, Janne, Kannala, Juho, Heikkilä, Janne

Abstract

3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $\sigma^{*}$ for object-mask rendering. The original opacity $\sigma$ remains responsible for visual reconstruction, while $\sigma^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.

Chinese Translation

3D高斯溅射（3DGS）提供了一种明确且高效的场景表示，但其原语缺乏固有的对象级身份，阻碍了下游任务如开放词汇场景理解。现有方法通常通过将高维特征嵌入提炼为高斯分布，或通过启发式细化将2D掩码标签提升到3D来解决这一问题。然而，基于特征的方法会带来较大的存储和解码开销，而基于提升的管道则容易受到标签污染的影响：在2D到3D投影过程中，进行外观重建所需的高斯分布往往会收到错误的对象标签。我们提出了OP2GS，一种对象感知的高斯表示，它为每个原语增强了明确的实例身份和专用的实例透明度$ ext{σ}^{*}$用于对象掩码渲染。原始的透明度$ ext{σ}$仍然负责视觉重建，而$ ext{σ}^{*}$则建模一个高斯是否应对特定对象掩码做出贡献。这种双透明度的形式将视觉存在与实例占用解耦：标记错误的高斯可以在图像渲染中保持可用，同时在对象掩码分支中变得透明。为了学习这种表示，我们引入了一种随机对象损失，通过基于标准透射的3DGS可见性优化1D实例占用场。然后，通过多视图聚合在对象级别附加语义描述符，从而消除了每个高斯的特征存储。与基于特征的训练方法相比，OP2GS在开放词汇性能上表现出竞争力，同时显著降低了计算开销。与无训练管道相比，它利用物理一致的占用学习来解决可见性歧义。

View on arXiv Download PDF AI Translation

cs.CV / 119 / 2605.20064

Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

基于计算机断层扫描和图像到图像条件生成对抗神经网络的心脏脂肪分割

da Silva, Guilherme Santos, Casanova, Dalcimar, Oliva, Jefferson Tales, Rodrigues, Erick Oliveira

Abstract

In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a generative conditional adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, despite not being originally tailored for this purpose. The two types of fat deposits of interest in this study are referred to as epicardial and mediastinal fats, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and f1-score 98.73 for the segmentation of the epicardial fat and 97.90% of accuracy and f1-score of 98.40 for the mediastinal fat. These findings represent the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of f1-score and run time, enabling the images to be segmented in real time.

Chinese Translation

近年来的研究强调了围绕人类心脏的脂肪组织增加与心血管疾病（如房颤和冠心病）易感性升高之间的关联。然而，由于手动分割这些脂肪沉积物对医疗专业人员而言工作量巨大且成本高昂，因此在临床实践中尚未得到广泛应用。因此，对更精确和高效的定量分析的需求推动了新计算方法在脂肪分割中的出现。本研究提出了一种基于深度学习的新方法，能够自主分割和量化两种不同类型的心脏脂肪沉积物。所提出的方法利用pix2pix网络，这是一种主要用于图像到图像转换任务的生成条件对抗网络。通过应用这一网络架构，我们旨在探讨其在解决心脏脂肪分割这一特定挑战中的有效性，尽管该网络最初并非为此目的而设计。本研究中关注的两种脂肪沉积物被称为心外膜脂肪和纵隔脂肪，它们被心包隔开。实验结果显示，心外膜脂肪的分割平均准确率为99.08%，f1-score为98.73，而纵隔脂肪的准确率为97.90%，f1-score为98.40。这些发现表明所提出的方法达到了高精度和重叠一致性。与现有研究相比，我们的方法在f1-score和运行时间方面表现优越，能够实现实时图像分割。

View on arXiv Download PDF AI Translation

cs.CV / 120 / 2605.20073

X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing

基于像素分类的X射线心脏血管造影血管分割方法：结合机器学习与区域生长

Rodrigues, E O, Rodrigues, L O, Lima, J J, Casanova, D, Favarim, F, Dosciatti, E R, Pegorini, V, Oliveira, L S N, Morais, F F C

Abstract

This work proposes a pixel-classification approach for vessel segmentation in x-ray angiograms. The proposal uses textural features such as anisotropic diffusion, features based on the Hessian matrix, mathematical morphology and statistics. These features are extracted from the neighborhood of each pixel. The approach also uses the ELEMENT methodology, which consists of creating a pixel-classification controlled by region-growing where the result of the classification affects further classifications of pixels. The Random Forests classifier is used to predict whether the pixel belongs to the vessel structure. The approach achieved the best accuracy in the literature (95.48%) outperforming unsupervised state-of-the-art approaches.

Chinese Translation

本研究提出了一种基于像素分类的血管分割方法，用于X射线血管造影图像。该方法利用了各类纹理特征，包括各向异性扩散、基于Hessian矩阵的特征、数学形态学和统计特征。这些特征是从每个像素的邻域中提取的。该方法还采用了ELEMENT方法论，该方法通过区域生长控制像素分类，其中分类结果会影响后续像素的分类。随机森林（Random Forests）分类器被用来预测像素是否属于血管结构。该方法在文献中取得了最佳准确率（95.48%），超越了当前无监督的最先进方法。

View on arXiv Download PDF AI Translation

cs.CV / 121 / 2605.20079

Probability-Conserving Flow Guidance

概率守恒流引导

Esmati, Parsa, Hyung, Junha, Dadashzadeh, Amirhossein, Choo, Jaegul, Mirmehdi, Majid

Abstract

Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating a time-dependent schedule alongside score-parallel attenuation. The resulting plug-and-play rule, Adaptive Manifold Guidance (AdaMaG), bounds both terms at no additional inference cost. Finally, we show that most empirical heuristics for reducing saturation or improving generation quality correspond directly to the two terms in our decomposition. Across image generation benchmarks, AdaMaG improves realism, reduces hallucinations, and induces controlled desaturation in high-guidance regimes.

Chinese Translation

扩散和基于流的生成模型在视觉合成中占据主导地位，指导样本与用户输入对齐并提高感知质量。然而，无分类器引导（Classifier-Free Guidance, CFG）和基于外推的方法是对速度/分数的启发式线性组合，这忽略了生成流形的几何特性，破坏了概率守恒，并在强引导下将样本推离学习到的流形。我们通过连续性方程分析引导，并展示其效果分解为一个散度项和一个在参数化中不变的分数平行项。我们证明散度项在采样接近数据流形时结构性地爆炸，这促使我们在分数平行衰减的同时采用时间依赖的调度。最终得到的即插即用规则，自适应流形引导（Adaptive Manifold Guidance, AdaMaG），在不增加推理成本的情况下对这两个项进行界定。最后，我们展示大多数用于减少饱和或提高生成质量的经验启发式方法直接对应于我们分解中的两个项。在图像生成基准测试中，AdaMaG提高了真实感，减少了幻觉，并在高引导状态下引入了可控的去饱和效果。

View on arXiv Download PDF AI Translation

cs.CV / 122 / 2605.20082

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

VL-DPO：基于视觉-语言指导的偏好对齐自主驾驶微调

Xu, Zhefan, Jerfel, Ghassen, Haliem, Marina, Zhao, Qi, Kang, Jeonhyung, Refaat, Khaled S.

Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.

Chinese Translation

自主驾驶数据集的快速增长使得强大的运动预测模型得以扩展。虽然大规模的预训练提供了强劲的性能，但标准的模仿目标可能无法完全捕捉人类驾驶偏好的复杂细微差别。同时，最近在视觉-语言模型（VLMs）方面的进展展示了令人印象深刻的推理和常识理解能力。基于这些能力，本文提出了VL-DPO，一个视觉-语言指导的框架，将自我车辆运动预测模型与人类偏好对齐。我们的方法利用VLM作为零样本推理器，从预训练模型的滚动输出中自动生成偏好对，随后通过直接偏好优化（DPO）对模型进行微调。我们在Waymo开放端到端驾驶数据集（WOD-E2E）上对模型进行微调，并使用评估者反馈分数（RFS）和平均位移误差（ADE）评估与保留的人类偏好注释的性能。我们的实验确认了VLM的轨迹选择是人类偏好的高质量代理。我们的最终模型VL-DPO在RFS上提高了11.94%，在ADE上减少了10.01%，相较于预训练模型。

View on arXiv Download PDF AI Translation

cs.CV / 123 / 2605.20085

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

基于空间提示的自我中心操控视觉轨迹预测

Li, Yifan, Zhou, Xinyu, Ge, Yunhao, Kong, Yu

Abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT(Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem SP-VTP, as a simple and scalable task condition for egocentric manipulation.

Chinese Translation

机器人操控通常通过语言指令或任务标识符来指定，然而在物体相似且环境杂乱的情况下，通过空间指示要移动的物体及其放置位置更为有效。针对物体和目标指定的视觉中心挑战，我们首次正式提出了空间提示视觉轨迹预测（Spatially Prompted Visual Trajectory Prediction, SP-VTP）的概念。该新颖的设置利用初始空间提示（如边界框或点）来定义任务目标，要求模型从自我中心流中预测未来的末端执行器轨迹。为研究这一问题，我们收集并标注了EgoSPT数据集，该数据集包含自我中心的空间提示操控轨迹，并附有第一帧物体和目标的基础注释以及恢复的三维末端执行器运动。SP-VTP具有挑战性，因为任务指定是静态的，而场景配置随时间演变。为了解决这一问题，我们提出了SPOT（Spatially Prompted Object-Target Policy），该模型结合了用于第一帧视觉和坐标空间提示的任务编码器、用于当前视觉和历史上下文的观察编码器，以及用于未来末端执行器运动的轨迹生成器。在严格的场景级划分下的实验表明，SPOT在跨场景轨迹预测上优于未提示或单源提示的基线。EgoSPT和SPOT共同建立了一个新的空间提示问题SP-VTP，作为自我中心操控的简单且可扩展的任务条件。

View on arXiv Download PDF AI Translation

cs.CV / 124 / 2605.20090

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

MetaEarth-MM：场景中心联合建模的统一多模态遥感图像生成

Yu, Zhiping, Liu, Chenyang, Cao, Jinqi, Yang, Qinzhe, Yu, Siwei, Zou, Zhengxia, Shi, Zhenwei

Abstract

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.

Chinese Translation

多模态遥感图像对于地球观测至关重要，但在实际应用中，完整的配对观测往往稀缺。现有的生成方法通常通过孤立的成对模态转换来解决这一问题，但随着模态和生成任务数量的增加，其灵活性和可扩展性仍然有限。在此，我们开发了一个生成基础模型MetaEarth-MM，用于多模态遥感图像，能够在一个统一模型中实现配对联合生成和五种模态之间的任意转换。我们认识到多模态观测背后的内在场景一致性，因此在MetaEarth-MM中引入了一种场景中心的联合建模范式。与依赖于直接外观级别跨模态映射的先前方法不同，我们的模型围绕潜在的场景内容组织生成。具体而言，MetaEarth-MM采用了一个解耦架构，首先从可用观测中推断出潜在场景表示，然后基于这一中间状态生成目标模态。为了支持训练，我们进一步构建了EarthMM，这是一个包含280万多分辨率全球图像和220万对齐配对的大规模数据集。大量实验表明，MetaEarth-MM不仅展现了强大的生成能力和在多样生成任务中的稳健泛化能力，还支持数据和表示层面的下游任务，突显了其作为跨模态地球观测通用基础模型的潜力。代码和数据集将发布在 https://github.com/YZPioneer/MetaEarth-MM。

View on arXiv Download PDF AI Translation

cs.CV / 125 / 2605.20110

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

SetCon：通过集合级概念预测实现开放式指称分割

Zhang, Zhixiong, Li, Yizhuo, Ding, Shuangrui, Zang, Yuhang, Ding, Shengyuan, Xing, Long, Wang, Yibin, Zhang, Qiaosheng, Wang, Jiaqi

Abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.

Chinese Translation

指称分割将自然语言查询与像素级掩膜相结合，但在多个实例、跨类别组或开放式目标集等复杂场景中扩展这一方法仍然具有挑战性。之前基于大型视觉语言模型（LVLM）的方法通过一个或多个特殊标记顺序表示被指称的目标，将多个目标视为独立输出，而非一个连贯的集合，且对捕捉集合级属性（如完整性和互斥性）几乎没有激励。我们将开放式指称分割重新表述为显式的集合级概念预测，并提出了集合概念分割（SetCon），该方法使用LVLM生成的自然语言概念，而非特定于分割的标记，作为联合掩膜集解码的语义条件。一个分层语义分解首先预测一个共享的集合级概念，以定义目标范围，然后将其细化为与目标子集对齐的细粒度概念组。为支持这一点，一个两阶段的注释流程通过分层语义监督增强现有的推理分割数据集（236k样本，784k概念短语）。SetCon在图像基准测试中取得了最先进的结果（在gRefCOCO上提高3.3 gIoU，在MUSE上提高12.1 gIoU），且随着被指称目标数量的增加，性能差距也在扩大。该概念接口还可以在检测与跟踪设置下转移到视频中，在七个指称视频基准测试中取得了新的最先进结果，包括在MeViS上提高10.9 J&F，在Ref-SeCVOS上提高12.4 J&F。

View on arXiv Download PDF AI Translation

cs.CV / 126 / 2605.20147

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

PixVerve：利用大规模高质量数据集推动原生超高分辨率图像生成至100MP

Chen, Haojun, He, Haoyang, Xu, Chengming, He, Qingdong, Zhu, Junwei, Wang, Yabiao, Xue, Zhucun, Zeng, Xianfang, Chen, Zhennan, Hu, Xiaobin, Zhao, Hao, Liu, Yong, Zhang, Jiangning, Tao, Dacheng

Abstract

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

Chinese Translation

文本到图像（T2I）模型最近在1K和2K分辨率方面取得了显著进展。随着对更好视觉体验的极大渴望和成像技术的快速发展，对超高分辨率（UHR）图像生成的需求显著增加。然而，由于高分辨率内容的稀缺性和复杂性，UHR图像生成面临巨大挑战。在本文中，我们首先介绍了PixVerve-95K，这是一个高质量的开源UHR T2I数据集，采用精心设计的数据管道进行整理，包含95K幅图像，涵盖多种场景（每幅图像的最小像素数为100M）及七维注释。基于我们的大规模图像-文本数据集，我们迈出了开创性的一步，将各种T2I基础模型扩展至原生100MP生成，采用三种训练方案。最后，利用传统指标和基于多模态大语言模型的评估，我们提出的PixVerve-Bench基准建立了一个全面的UHR图像评估协议，涵盖视觉质量和语义对齐。在我们的基准上进行的广泛实验结果以及对训练策略的建设性探索，共同为未来的突破提供了宝贵的见解。

View on arXiv Download PDF AI Translation

cs.CV / 127 / 2605.20150

TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

TideGS：通过外存优化实现超过十亿个3D高斯点的可扩展训练

Zhong, Chonghao, Shi, Linfeng, Chen, Hua, Sun, Tiecheng, Zhao, Hao, Yuan, Binhang, Li, Chaojian

Abstract

Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).

Chinese Translation

在十亿原语规模下训练3D高斯点（3DGS）在本质上受限于内存：每个高斯原语携带一个较大的属性向量，聚合参数表迅速超出GPU容量，限制了之前系统在普通单GPU硬件上只能处理数千万个高斯点。我们观察到3DGS训练本质上是稀疏的，并且受轨迹条件影响：每次迭代仅激活当前相机批次可见的高斯点，因此GPU内存可以作为工作集缓存，而不是持久的参数存储。基于这一见解，我们提出了TideGS，一个外存训练框架，通过三种协同技术管理SSD-CPU-GPU层次结构中的参数：针对SSD对齐空间局部性的块虚拟几何，层次化异步管道以重叠I/O与计算，以及轨迹自适应差分流，仅在迭代之间传输增量工作集增量。实验表明，TideGS能够在单个24 GB GPU上训练超过十亿个高斯点，同时在大规模场景中实现评估的单GPU基线中最佳的重建质量，超越了之前的外存基线（例如，约1亿个高斯点）和标准的内存训练（例如，约1100万个高斯点）。

View on arXiv Download PDF AI Translation

cs.CV / 128 / 2605.20158

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

重新思考大型视觉语言模型在胸部X光推理中的视觉归因

Xiong, Guangzhi, Jin, Qiao, Sinha, Sanchit, Lu, Zhiyong, Zhang, Aidong

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

Chinese Translation

大型视觉语言模型（LVLMs）在医疗应用中展现出良好的前景，但它们无法真实地将响应与视觉证据相结合，这引发了对临床可信度的严重担忧。尽管视觉归因方法被广泛用于解释LVLM的预测，但这些解释是否真正反映了模型决策背后的视觉证据仍然未得到验证，因为内部模型推理的真实标注通常不可用。我们通过开发一个因果评估框架来解决这一问题，该框架仅保留经过专家标注并通过反事实编辑验证为对模型预测具有因果责任的胸部X光（CXR）-视觉问答（VQA）样本。利用该框架，我们对11种归因方法、六个开源LVLM和两种输出模式（直接回答和逐步推理）进行了研究，发现现有的归因方法往往无法识别LVLM所使用的证据。为了解决这一问题，我们提出了MedFocus，这是一种基于概念的归因方法，通过不平衡最优传输定位临床相关的解剖区域，并通过有针对性的干预测量其对模型输出的因果影响。MedFocus生成空间、概念级和标记级的归因，显著优于之前的方法，朝着为医疗LVLM提供更可信的归因迈出了重要一步。我们的数据和代码可在 https://github.com/gzxiong/medfocus/ 获取。

View on arXiv Download PDF AI Translation

cs.CV / 129 / 2605.20159

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

可解释的计算机视觉在航空航天SiC/SiC复合材料X射线断层扫描缺陷检测中的应用

Corredor, Antonio Peña, Lesseur, Julien, Nunez, Romain, Rivalland, Paul, Philippe, Thomas

Abstract

Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories-healthy matrix, matrix--air interfaces, pores, line-like defects, and mixed morphologies-so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is-and is not-reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

Chinese Translation

航空航天SiC/SiC复合材料的无损检测依赖于专家的视觉评估，目前的工作流程在接受/拒绝决策方面提供的可追溯性有限。深度卷积网络可以自动化缺陷检测，但其黑箱特性与工业检测实践所需的透明性相悖。为了解决这一问题，我们引入了p-ResNet-50，这是一种扩展了原型层的卷积框架，结合了高检测精度和基于案例的解释。六个学习到的原型明确与专家定义的语义类别对齐——健康基体、基体-空气界面、孔、线状缺陷和混合形态——使得每个分类都可以追溯到一个物理上有意义的参考。两个新颖的正则化项，基于锚点和基于中位数的，紧密连接原型与专家选择的补丁，防止原型崩溃，解决了原型网络已知的局限性。通过UMAP进行的潜在空间分析描绘了语义一致的子领域，并映射出误分类集中区域的不确定性，给检查员提供了模型可靠与不可靠区域的明确图景。该框架在一个包含约12,000个补丁的XCT补丁数据集上进行了验证，该数据集提取自四个缺陷丰富的SiC/SiC实验样本。以黑箱ResNet-50作为基线（ROC-AUC = 0.991），原型扩展实现了可比的性能（准确率0.957 vs. 0.959；ROC-AUC 0.994 vs. 0.993），在提高精度和特异性的同时略微降低了灵敏度。每个决策都有代表性证据补丁的支持，模型明确标记其不确定性区域。除了缺陷映射，该框架还建立了一种可重用的方法论，将领域专家知识嵌入原型网络，适用于其他需要可追溯、可审计决策的XCT检测场景。

View on arXiv Download PDF AI Translation

cs.CV / 130 / 2605.20165

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

CaMo：基于相机运动的视觉语言模型评估与训练

Huang, Hsiang-Wei, Lu, Junbin, Chen, Kuang-Ming, Shangguan, Jianxu, Yang, Cheng-Yen, Hwang, Jenq-Neng

Abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model is available at https://github.com/hsiangwei0903/CaMo

Chinese Translation

视觉语言模型（VLMs）在空间问答基准测试中表现出色，但尚不清楚这种提升是否反映了真正的空间智能。我们表明，现有的空间 VLMs 缺乏基本的相机运动理解，这是空间认知的一个关键组成部分。我们提出了空间叙事评分（SNS），这是一种评估框架，要求 VLMs 生成明确的空间叙事，捕捉场景语义和相机运动，并随后与一个冻结的代理大型语言模型（LLM）进行推理。在 SNS 下，最先进的空间 VLMs 尽管在直接问答准确性上表现良好，但却显示出显著的性能下降。为了解决这一问题，我们引入了 CaMo，这是一种基于相机运动的 VLM，在 SNS 评估和直接空间问答准确性上均表现出一致的性能。我们的结果强调了明确的空间叙事外化在评估具有可转移 3D 空间理解的 VLMs 中的重要性。我们的代码、数据和模型可在 https://github.com/hsiangwei0903/CaMo 获取。

View on arXiv Download PDF AI Translation

cs.CV / 131 / 2605.20174

Multi-axis Analysis of Image Manipulation Localization

多轴图像操控定位分析

Nichols, Keanu, Appapogu, Divya, Biamby, Giscard, Bashkirova, Dina, Rohrbach, Anna, Plummer, Bryan A.

Abstract

Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

Chinese Translation

先进的图像编辑软件使得创建高度可信的图像操控变得容易，近年来由于生成性人工智能的进步，这一过程变得更加普及。虽然操控图像通常无害，但它们可能传播错误信息、制造虚假叙事，并影响人们对重要问题的看法。尽管这一威胁日益严重，但在不同视觉领域中检测高级操控的研究仍然有限。因此，我们提出了域转移、质量、类型和大小分析（Analysis Under Domain-shifts, qualIty, Type, and Size，AUDITS），这是一个旨在研究图像操控检测分析轴的综合基准。AUDITS包含来自两个不同来源（用户和新闻照片）的超过53万张图像。我们精心策划了数据集，以支持使用最近的基于扩散的修复技术进行多轴分析，涵盖多种操控类型和大小。我们在不同类型的域转移下进行实验，以评估现有图像操控检测方法的鲁棒性。我们的目标是通过提供新的见解来推动该领域的进一步研究，从而帮助开发更可靠和更具普适性的图像操控检测方法。

View on arXiv Download PDF AI Translation

cs.CV / 132 / 2605.20183

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

MSAVBench：迈向全面可靠的多镜头音视频生成评估

Wei, Yujie, Han, Yujin, Chen, Zhekai, Li, Yongming, Jiang, Kaixun, Liu, Zhihang, Li, Quanhao, Qing, Zhiwu, Wang, Xiang, Xing, Zhen, Chu, Ruihang, Hong, Lingyi, He, Yefei, Zhou, Junjie, Yu, Junqiu, Shi, Yang, Zou, Difan, Zhu, Kai, Zhang, Shiwei, Zhang, Yingya, Liu, Yu, Liu, Xihui, Shan, Hongming

Abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.

Chinese Translation

视频生成正迅速从单镜头合成演变为复杂的多镜头音视频（MSAV）叙事，以满足现实世界的需求。然而，评估这些前沿模型仍然是一个基本挑战。现有基准在范围和数据多样性方面有限，并依赖于僵化的评估流程，阻碍了对现代MSAV模型的系统性和可靠性评估。为了解决这些问题，我们提出了MSAVBench，这是首个全面的基准和自适应混合评估框架，专门用于多镜头音视频生成。我们的基准涵盖视频、音频、镜头和参考四个关键维度，涵盖多样化的任务设置，镜头数量变化高达15个，以及具有挑战性的非现实场景。我们的评估框架通过自适应自我校正机制增强了鲁棒性，采用实例级标准来评估主观指标，并通过工具基础的证据提取来支持复杂判断。此外，MSAVBench与人类判断高度一致，斯皮尔曼等级相关系数达到91.5%。我们对19个最先进的闭源和开源模型的系统评估显示，当前系统在导演级控制和细粒度音视频同步方面仍然存在困难，而模块化或代理生成流程为缩小开源和闭源模型之间的差距提供了有希望的路径。我们将发布基准数据和评估代码，以促进未来的研究。

View on arXiv Download PDF AI Translation

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2605.18801

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

立场：让我们开发数据探针，以从根本上理解数据如何影响大型语言模型的性能

Wang, Shiqiang, Woisetschläger, Herbert, Jacobsen, Hans Arno, Ji, Mingyue

Abstract

Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

Chinese Translation

数据是大型语言模型（LLMs）的基础。然而，关于什么使某些数据在LLM工作流程的不同阶段（包括训练、调优、对齐、上下文学习等）中有用，以及原因，仍然是一个未解之谜。当前的方法在很大程度上依赖于对大型公共数据集进行广泛实验，以获得数据过滤和数据集构建的经验启发。这些方法计算密集且缺乏理解特定数据特征如何驱动LLM行为的原则性方法。在这篇立场论文中，我们主张需要开发系统的方法论，从适当定义的随机过程中生成合成序列，目的是使这些序列在LLM工作流程的一个或多个阶段中使用时能够揭示有用的特征。我们将这种序列称为数据探针。通过观察LLM在数据探针上的行为，研究人员可以系统地研究数据特征如何影响模型性能、泛化能力和鲁棒性。这些探针序列展示了可以通过理论概念（如典型集合）来观察的统计特性，这些概念被推广以描述LLM的行为。这种数据探针方法为揭示数据在LLM训练和推理中的作用提供了一条途径，超越了经验启发。

View on arXiv Download PDF AI Translation

cs.AI / 2 / 2605.18818

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

文档人工智能的操作化：生产环境中OCR和LLM管道的微服务架构

Fehlis, Yao, Bengfort, Benjamin, Si, Zhangzhang, Eyorokon, Vahid, Roman, Prema, Deziel, Patrick, Slonaker, Devon, Veldman, Steve, Johnson, Ben, Rigelo, Joyce, Wharton, Michael, Kramer, Steve

Abstract

Academic research tends to focus on new models for document understanding creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including a hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark; effectively operationalizing models in production.

Chinese Translation

学术研究往往集中于文档理解的新模型，导致文献中模型定义与在生产规模下运行模型之间存在较大差距。为缩小这一差距，我们提出了一种微服务架构，该架构封装了多个模型的管道，包括分类、光学字符识别（OCR）和大型语言模型（LLM）结构化字段提取，以及我们在每小时处理数千份多页文档时的实际经验。我们描述了主要的设计决策，包括混合分类、将GPU绑定推理与CPU绑定编排分离、对管道中许多IO绑定操作使用异步处理，以及独立的水平扩展策略。通过批量分析，我们发现了两个意外的定性结果，这些结果影响了生产部署：OCR，而非语言模型解析，主导了端到端延迟，系统的饱和度由共享的GPU推理能力而非工作线程数量决定。我们的目标是为实践者提供具体的架构模式，以构建超越基准的文档理解系统；有效地在生产中操作化模型。

View on arXiv Download PDF AI Translation

cs.AI / 3 / 2605.18937

Evaluating the Utility of Personal Health Records in Personalized Health AI

评估个人健康记录在个性化健康人工智能中的实用性

Sayres, Rory, Chen, Kejia, Jain, Ayush, Thompson, Matthew, Richina, Jonathan, Yin, Xiang, Hu, Jimmy, Zhang, Fan, Lou, Bob, Sanchez, Mike, Mezerreg, Ines, Schreier, Meredith, Subramaniam, Hamsa, Lee, I-Ching, Jia, Yugang, Mcduff, Daniel, Matias, Yossi, Hassidim, Avinatan, Webster, Dale, Liu, Yun, Barr, Jackie, Duong, Quang

Abstract

Patient-managed Personal Health Records (PHRs) promises to empower patients to better understand their health; but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries, when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked to their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP), and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs; and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.

Chinese Translation

患者管理的个人健康记录（PHRs）有望使患者更好地理解他们的健康状况；但记录中的信息复杂，可能会妨碍洞察力。在本研究中，我们评估了大型语言模型（LLMs，Gemini 3.0 Flash）在提供用户健康查询的有用答案时的潜力，前提是提供来自PHRs的临床数据作为背景。我们从三种不同的分布中抽取了共计2,257个用户查询，以代表患者问题：较短的网络搜索查询、基于聊天机器人对话模板的较长问题，以及患者向其医疗团队提出的问题（患者来电）。查询与去标识化的PHRs（来自1,945个样本池）进行了匹配。Gemini的响应生成分为三种情况：（1）没有PHR背景；（2）提供基本的人口统计信息、疾病和药物的摘要；（3）提供完整的、详尽的临床记录。为了评估，我们利用了现有的评分框架（SHARP），并为解释PHRs时的特定错误模式开发了一个新的框架。评估使用了全量的自动评分者，以及对一个子集（n=95）的临床医生评分，两个评分组均了解完整的PHR背景。我们观察到，使用PHR数据后，所有问题类型的答案有用性显著提高（p < 0.001，配对t检验）。我们还观察到在答案的安全性、准确性、相关性和个性化方面可能的提升。我们的PHR评估框架进一步识别了LLM在理解复杂PHRs的特定方面（如时间错位和稀有但有意义的虚构）中的不足。这些结果表明PHR数据有潜力帮助满足广泛用户需求的人群；并提供了一个基于PHR背景监测LLM答案缺口的框架。本研究激励了进一步的工作，以评估和实现用户从理解其健康记录中获得潜在利益的可能性。

View on arXiv Download PDF AI Translation

cs.AI / 4 / 2605.19008

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

基于有线学习的训练控制治理：在压力下的有限自主训练以实现稳定性和效率

Radianis, Anis

Abstract

Modern language-model training is increasingly exposed to instability, degraded runs, and wasted compute, especially under aggressive learning-rate, scale, and runtime-stress conditions. This paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW. Rather than replacing the optimizer update rule, LBW-Guard observes training telemetry, interprets instability-sensitive regimes, and applies bounded control to optimizer execution while preserving fixed training objectives. We evaluate LBW-Guard in a Qwen2.5-centered stress-and-robustness suite using WikiText-103, with Qwen2.5-7B as the empirical anchor, model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, learning-rate stress tests, gradient-clipping baselines, and a no-LoRA TinyLlama-1B full-parameter sanity check. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7% improvement, while reducing end-to-end time from 392.54s to 357.02s, a 1.10x speedup. Under stronger learning-rate stress, AdamW degrades to 1885.24 final perplexity at LR=3e-3 and 659.76 at LR=1e-3, whereas LBW-Guard remains trainable at 11.57 and 10.33, respectively. Gradient-clipping baselines do not reproduce this effect. These results support a scoped systems conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer. LBW-Guard provides evidence that bounded runtime control can preserve productive compute under stress while remaining distinct from optimizer replacement and local gradient suppression.

Chinese Translation

现代语言模型训练越来越容易受到不稳定性、性能下降和计算资源浪费的影响，尤其是在激进的学习率、规模和运行时压力条件下。本文介绍了基于有线学习的守护机制（Learn-by-Wire Guard, LBW-Guard），这是一个在 AdamW 之上运行的有限自主训练控制治理层。LBW-Guard 并不是替代优化器更新规则，而是观察训练遥测，解释对不稳定性敏感的状态，并在保持固定训练目标的同时，对优化器执行施加有限控制。我们在以 Qwen2.5 为中心的压力和鲁棒性测试套件中评估了 LBW-Guard，使用 WikiText-103，以 Qwen2.5-7B 作为经验锚点，并与 Qwen2.5-3B 和 Qwen2.5-14B 进行模型规模比较，进行学习率压力测试，设置梯度裁剪基线，以及对无 LoRA 的 TinyLlama-1B 全参数进行合理性检查。在 7B 参考设置下，LBW-Guard 将最终困惑度从 13.21 降低到 10.74，改善幅度为 18.7%，同时将端到端时间从 392.54 秒减少到 357.02 秒，实现了 1.10 倍的加速。在更强的学习率压力下，AdamW 在 LR=3e-3 时最终困惑度降至 1885.24，而在 LR=1e-3 时降至 659.76，而 LBW-Guard 分别保持可训练性，最终困惑度为 11.57 和 10.33。梯度裁剪基线并未重现这一效果。这些结果支持一个有针对性的系统结论，即对稳定性敏感的 LLM 训练可以从优化器之上的治理层中受益。LBW-Guard 提供了证据，表明有限的运行时控制可以在压力下保持有效的计算，同时与优化器替换和局部梯度抑制保持区别。

View on arXiv Download PDF AI Translation

cs.AI / 5 / 2605.19010

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

AgentNLQ：一种通用的自然语言转SQL代理

Bogdanov, Olena, Jung, Yeunji, Dhir, Chandra, Gaddam, Pareekshitreddy, Jain, Saurabh, Tumati, Lakshmi, Parthasarathy, Vijay, Shirgaonkar, Anup

Abstract

Natural language to SQL (NL2SQL) conversion is an important problem for researchers and enterprises due to the ubiquitous importance of relational databases in broad-ranging practical problems. Despite the rapid advancements in the capabilities of LLMs, NL2SQL has not reached parity in accuracy with human expert SQL writers, hence needing additional improvements in NL2SQL algorithms. This study presents a new multi-agent method for NL2SQL that achieves 78.1% semantic accuracy on the BIg Bench for LaRge-scale Database (BIRD) benchmark. Our method leverages a semantically enriched representation of user-provided schema, adds user-provided business rules, and produces accurate SQL queries. The main contributions of this study are (a) We designed an optimized new orchestrator in a multi-agent solution that uses LLMs to plan, orchestrate, reflect, and self-correct to generate accurate SQL queries, (b) We developed an advanced schema enrichment method that creates context-aware metadata to improve accuracy, and (c) We demonstrated the accuracy and generalizability of the method across different domains and datasets by evaluating it on the BIRD-SQL benchmark.

Chinese Translation

自然语言转SQL（NL2SQL）转换是一个重要的问题，因关系数据库在广泛的实际问题中具有普遍的重要性，吸引了研究者和企业的关注。尽管大规模语言模型（LLMs）的能力迅速提升，NL2SQL在准确性上仍未达到人类专家SQL编写者的水平，因此需要对NL2SQL算法进行进一步改进。本研究提出了一种新的多代理方法，用于NL2SQL，在大规模数据库（BIRD）基准测试的语义准确率达到了78.1%。我们的方法利用了用户提供的模式的语义丰富表示，添加了用户提供的业务规则，并生成准确的SQL查询。本研究的主要贡献包括：（a）我们设计了一种优化的新协调器，在多代理解决方案中使用LLMs进行规划、协调、反思和自我纠正，以生成准确的SQL查询；（b）我们开发了一种先进的模式增强方法，创建上下文感知的元数据以提高准确性；（c）我们通过在BIRD-SQL基准上进行评估，展示了该方法在不同领域和数据集上的准确性和通用性。

View on arXiv Download PDF AI Translation

cs.AI / 6 / 2605.19031

KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

KAN-MLP-Mixer：对使用Kolmogorov-Arnold网络（KANs）改善基于IMU的人体活动识别的全面研究

Liu, Mengxi, Bian, Sizhen, Fortes, Vitor, Nicolas, Francisco Calatrava, Geißler, Daniel, Kiefer-Emmanouilidis, Maximilian, Zhou, Bo, Lukowicz, Paul

Abstract

Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but struggle to maintain performance on noisy and imperfect real-world datasets. In contrast, conventional multi-layer perceptrons (MLPs) are far more tolerant to noise and computationally efficient. Replacing all MLP components with KANs in HAR models often degrades accuracy and computation efficiency, highlighting an open challenge: how to combine KANs' precision with MLPs' noise robustness and efficiency. To address this, we systematically explore various placements of KAN modules within deep HAR networks and propose a hybrid architecture that strategically synergizes the strengths of both paradigms, which uses a KAN-based input embedding layer, retains MLP layers for intermediate feature mixing, and introduces a specialized LarctanKAN module for final activity classification. Across eight public HAR datasets, the hybrid KAN-MLP model achieves an average macro F1 score relative improvement of 5.33\% compared pure-MLP model, significantly outperforming standalone KAN and MLP baselines. Furthermore, integrating this hybrid strategy into other state-of-the-art HAR architectures consistently boosts their performance. Our findings demonstrate that a carefully orchestrated combination of KAN, MLP, or other conventional neural components yields more robust and accurate HAR models for real-world wearable sensing environments.

Chinese Translation

Kolmogorov-Arnold网络（KANs）在干净、低维数据上展示了学习复杂函数的卓越能力，但在噪声和不完美的真实世界数据集上表现不佳。相比之下，传统的多层感知器（MLPs）对噪声的容忍度更高且计算效率更高。在人体活动识别（HAR）模型中用KAN替换所有MLP组件通常会降低准确性和计算效率，这突显了一个开放性挑战：如何将KAN的精确性与MLP的抗噪声能力和效率结合起来。为了解决这个问题，我们系统地探索了KAN模块在深度HAR网络中的不同放置方式，并提出了一种混合架构，战略性地协同利用两种范式的优势，该架构使用基于KAN的输入嵌入层，保留MLP层用于中间特征混合，并引入专门的LarctanKAN模块用于最终活动分类。在八个公共HAR数据集上，混合KAN-MLP模型相对于纯MLP模型实现了平均宏F1分数提升5.33%，显著优于独立的KAN和MLP基线。此外，将这种混合策略整合到其他最先进的HAR架构中，始终能够提升其性能。我们的研究结果表明，精心协调的KAN、MLP或其他传统神经组件的组合能够为真实世界可穿戴传感环境提供更强大和准确的HAR模型。

View on arXiv Download PDF AI Translation

cs.AI / 7 / 2605.19035

Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

可信代理网络：信任在代理网络中必须内置，而非附加

Yao, Yixiang, Yao, Yuhang, Fan, Xinyi, Gao, Jiechao, Wang, Jie, Zhang, Minjia, Ravi, Srivatsan, Joe-Wong, Carlee

Abstract

The rapid advancement of Large Language Models has given rise to autonomous LLM-based agents capable of complex reasoning and execution. As these agents transition from isolated operation to collaborative ecosystems, we witness the emergence of the Agent-to-Agent (A2A) network, a paradigm where heterogeneous agents autonomously coordinate to solve multi-step tasks. While these networks may offer better task performance compared to simply using one agent to complete the entire task, they introduce systemic vulnerabilities, such as adversarial composition, semantic misalignment, and cascading operational failures, that existing agent alignment techniques cannot address. In this vision paper, we argue that the trustworthiness of A2A networks cannot be fully guaranteed via retrofitting on existing protocols that are largely designed for individual agents. Rather, it must be architected from the very beginning of the A2A coordination framework. We present a comprehensive conceptual framework that situates trust in A2A systems through four design pillars.

Chinese Translation

大型语言模型的快速发展催生了能够进行复杂推理和执行的自主LLM（Large Language Model）基础代理。随着这些代理从孤立操作转向协作生态系统，我们见证了代理间（Agent-to-Agent, A2A）网络的出现，这是一种异构代理自主协调以解决多步骤任务的范式。尽管这些网络相比于单一代理完成整个任务可能提供更好的任务性能，但它们引入了系统性脆弱性，例如对抗性组合、语义不一致和级联操作失败，而现有的代理对齐技术无法解决这些问题。在本文中，我们认为A2A网络的可信度不能仅通过对现有主要为单一代理设计的协议进行改造来完全保证。相反，它必须从A2A协调框架的最初设计阶段就开始构建。我们提出了一个全面的概念框架，通过四个设计支柱将信任置于A2A系统之中。

View on arXiv Download PDF AI Translation

cs.AI / 8 / 2605.19042

Interference-Aware Multi-Task Unlearning

干扰感知的多任务遗忘

Huang, Ying-Hua, Fang, Rui, Chen, Hsi-Wen, Chen, Ming-Syan

Abstract

Machine unlearning aims to remove the contribution of designated training data from a trained model while preserving performance on the remaining data. Existing work mainly focuses on single-task settings, whereas modern models often operate in multi-task setups with shared backbones, where removing supervision for one task or instance can unintentionally affect others. We introduce multi-task unlearning with two settings: full-task unlearning, which removes a target instance from all tasks, and partial-task unlearning, which removes supervision only from selected tasks. We show that shared parameters couple the forget and retain sets, causing task-level interference on non-target tasks and instance-level interference on other instances. To address this issue, we propose an interference-aware framework that combines task-aware gradient projection, which constrains updates within task-specific subspaces, with instance-level gradient orthogonalization, which reduces conflicts between forget and retain signals. Experiments on two multi-task computer vision benchmarks across five tasks show that our method achieves effective unlearning while maintaining strong generalization, reducing UIS compared with the strongest baseline by 30.3% in full-task unlearning and 52.9% in partial-task unlearning.

Chinese Translation

机器遗忘旨在从训练模型中移除指定训练数据的贡献，同时保持对其余数据的性能。现有研究主要集中在单任务设置上，而现代模型通常在共享骨干网络的多任务环境中运行，在这种情况下，移除某一任务或实例的监督可能会无意中影响其他任务。我们引入了多任务遗忘的两个设置：全任务遗忘（full-task unlearning），即从所有任务中移除目标实例；部分任务遗忘（partial-task unlearning），即仅从选定任务中移除监督。我们表明，共享参数将遗忘集和保留集耦合在一起，导致非目标任务的任务级干扰和其他实例的实例级干扰。为了解决这个问题，我们提出了一种干扰感知框架，该框架结合了任务感知的梯度投影（task-aware gradient projection），它限制更新在任务特定子空间内，以及实例级梯度正交化（instance-level gradient orthogonalization），它减少了遗忘信号和保留信号之间的冲突。在五个任务的两个多任务计算机视觉基准上的实验表明，我们的方法在实现有效遗忘的同时保持了强大的泛化能力，在全任务遗忘中相比于最强基线减少了30.3%的UIS，在部分任务遗忘中减少了52.9%的UIS。

View on arXiv Download PDF AI Translation

cs.AI / 9 / 2605.19093

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

通过引导嵌入：系统提示的贝叶斯优化动态表示

Lin, Zhiyuan Jerry, Letham, Benjamin, Dooley, Samuel, Balandat, Maximilian, Bakshy, Eytan

Abstract

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on \emph{embedding by elicitation}. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a 30 total evaluation budget, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.

Chinese Translation

系统提示是现代人工智能系统中的核心控制机制，塑造了跨对话、任务和用户群体的行为。然而，当反馈仅以汇总指标而非逐例标签、错误或批评的形式提供时，调整这些提示变得困难。我们研究了这种汇总反馈设置，作为对离散、可变长度文本的样本受限黑箱优化。我们提出了ReElicit，一个基于 extit{引导嵌入}的贝叶斯优化框架。给定任务描述、先前评估的提示和标量分数，LLM（大型语言模型）引导出一个紧凑、可解释的特征空间，并将提示映射到该空间中。利用概率高斯过程代理，获取函数选择目标特征向量，LLM将其实现并细化为可部署的系统提示。随着新评估的到来，重新引导特征空间使得表示能够适应观察到的提示-分数历史。我们使用离线基准准确性作为受控汇总代理来评估该设置：优化器每个提示观察一个标量分数，而没有逐例标签、错误或批评。在十个系统提示优化任务中，评估总预算为30，ReElicit在代表性的仅汇总提示优化基准中实现了最强的汇总性能表现。这些结果表明，LLM不仅可以作为提示生成器，还可以作为自适应语义表示构建者，用于自然语言工件的贝叶斯优化。

View on arXiv Download PDF AI Translation

cs.AI / 10 / 2605.19099

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

DecisionBench：长时间跨度代理工作流程中紧急委托的基准测试

Gao, Yuxuan, Wang, Megan, Yu, Yi Ling, Ma, Zijian Carl, Qu, Ao

Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.

Chinese Translation

我们介绍了DecisionBench，这是一个用于长时间跨度代理工作流程中紧急委托的基准测试平台。该平台固定了一套任务（GAIA、tau-bench、BFCL多轮）、一个模型池（11个模型，7个供应商系列）、一个委托接口（call_model及可选的read_profile通道）、一个确定性的技能标注层，以及一个涵盖质量、成本、延迟、委托率、路由精度-at-k、供应商自偏好和反事实委托上限的多维度指标套件。该平台对同伴信息的生成或传递方式不设限制，因此可以对学习的路由器、更丰富的同伴记忆、自适应的个人资料构建和多步骤委托进行评估。我们通过对完整模型池（n=23,375个任务实例）进行五个条件的参考扫描来表征该平台。出现了三个基准级发现：（i）在四种意识条件下，平均最终任务质量在统计上没有显著差异（|beta| <= 0.010, p >= 0.21），因此仅进行质量评估将错过协调信号；（ii）在近乎相等的平均质量下，路由精度-at-1在不同条件下的范围为7.5%到29.5%，交付通道（按需工具与预加载描述）主导了描述内容；（iii）反事实上限将完美委托的表现置于每个任务套件的测量性能之上15-31个百分点，为未来的协调方法提供了巨大的未实现空间。我们发布了该平台、标注层、参考干预套件、分析管道以及220个条件运行档案。

View on arXiv Download PDF AI Translation

cs.AI / 11 / 2605.19127

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

POLAR-Bench：隐私-效用权衡的诊断基准测试

Zheng, Qiaoyuan, Yang, Yiqu, Gao, Qi, Schlag, Imanol

Abstract

LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5 times 5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1--30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.

Chinese Translation

大型语言模型（LLM）代理越来越多地访问用户的私人数据，并在与第三方系统交互时代表用户行事。用户定义了可以共享和不能共享的内容，而代理必须在第三方系统表现出对抗性时，稳健地遵循这一意图。我们引入了POLAR-Bench（政策感知对抗基准），在该基准中，一个具有隐私政策和任务的可信模型与一个对抗性探测任务相关属性和受保护属性的第三方模型进行对话。在10个领域和7,852个样本中，我们通过确定性集合成员资格来评分隐私和效用，并在两个正交轴上变化隐私政策维度和攻击策略，为每个模型生成5×5的诊断表面。我们的结果揭示了明显的分裂：当前的前沿模型保留了超过99%的受保护属性，而在1-30B范围内的小型开放权重模型（用户最常在设备上或通过私有推理作为自己的可信代理运行）得分明显较低，其中最弱的模型泄露了超过一半的受保护属性。因此，POLAR-Bench定位了每个模型的意图遵循在哪些方面出现了问题，为隐私对齐提供了一个重要的切入点。

View on arXiv Download PDF AI Translation

cs.AI / 12 / 2605.19140

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

学习交接：在接口约束下可证明收敛的工作流学习

Li, Jiayu, Zhang, Enpei, Zhou, Dawei, Chen, Elynn, Yan, Yujun

Abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

Chinese Translation

我们研究了一种工作流学习的情境，其中专门的代理通过共享工件交接控制，每个代理仅观察该工件的局部功能及其自身的私有状态，并且没有集中式学习者访问联合轨迹——这是跨越组织、供应商或信任边界的多代理大语言模型（LLM）管道的操作模式。我们将这一模式形式化为接口约束半马尔可夫决策过程（IC-SMDP），其决策时刻发生在交接时，并设计了IC-$Q$，一种异步去中心化的$Q$-学习算法，其中每次交接时的跨代理协调恰好是一个标量。我们的主要结果是针对神经IC-$Q$的有限样本界限，该界限分解为三个独立可控的误差源：神经函数逼近误差、接口表示差距和随机选项持续时间折扣下的混合时间残差。建立这一界限需要将近似信息状态（AIS）框架从单代理原始步骤马尔可夫决策过程提升到多代理半马尔可夫决策过程，并在随机持续时间下控制马尔可夫噪声，这在之前的工作中尚未实现。据我们所知，这是针对去中心化部分可观测情况下神经$Q$-学习的首个有限样本保证。四个实验：一个受控的合成IC-SMDP验证了界限的逐项有效性，多LLM数学推理，多代理路由和多代理CPU编程，表明IC-$Q$在没有任何代理观察联合轨迹的情况下与集中式预言机相匹配，且三个误差源在其相应轴上按界限预测的方式缩放。

View on arXiv Download PDF AI Translation

cs.AI / 13 / 2605.19151

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

渐进自主性作为偏好学习：代理工具使用中信任校准的形式化

Ou, Changkun

Abstract

We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.

Chinese Translation

我们将代理工具使用中的信任校准（决定何时自动化代理的提议行动可以自主执行，何时需要人类批准）形式化为一个偏好学习问题。一个策略网关维护一个关于潜在的人类风险容忍度函数的高斯过程后验，该函数通过二元批准/拒绝反馈的 probit 似然进行观察，并在批准结果最不确定的地方向人类升级。我们展示了这在结构上是偏好贝叶斯优化（Preferential Bayesian Optimization）的一个实例，继承了其推理机制（近似高斯过程分类）和样本效率论证（不确定性目标查询），而在目标上有所不同：将动作空间分类为允许/阻止/询问区域，而不是优化设计。

View on arXiv Download PDF AI Translation

cs.AI / 14 / 2605.19156

How Far Are We From True Auto-Research?

我们距离真正的自动研究还有多远？

Zhang, Zhengxin, Wang, Ning, Galhotra, Sainyam, Cardie, Claire

Abstract

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and an human conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex 5%/8% paper-vs-artifact mismatch / fabricated references versus Kimi Code 77%/72%, a $\sim$15$\times$ spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still gapped from the true auto-research.

Chinese Translation

近期的自动研究系统能够生成完整的论文，但可行性并不等同于质量，且该领域仍然缺乏对代理生成论文实际质量的系统性研究。我们介绍了ResearchArena，这是一个最小化的框架，允许现成的代理（Claude Code使用Opus 4.6，Codex使用GPT-5.4，以及Kimi Code使用K2.5）在仅有轻量指导的情况下自行完成完整的研究循环（构思、实验、论文写作、自我完善）。在13个计算机科学种子和每个代理-领域对的3次试验中，ResearchArena生成了117篇代理生成的论文，每篇论文在三个互补的视角下进行评估：仅手稿审稿人（SAR），一种关注文献的同行评审（PR），在该评审中，代理在审查手稿的同时检查工作空间，以及由人类进行的元审查。仅在SAR下，结果显得乐观：Claude Code获得了最高分，超越了Analemma的FARS，并与加权平均的人类ICLR 2025提交相匹配，这表明经过最小化框架的代理能够生成在仅手稿审查中看起来具有竞争力的论文。然而，人工检查显示这一结果被夸大了：SAR分数与实际接受决策之间的对齐度较差，并且在未验证实验实质的情况下奖励合理的框架。在关注文献的PR下，分数急剧下降，人工审核发现实验严谨性是主要瓶颈，分解为三种失败模式（伪造结果、实验能力不足和计划/执行不匹配），这些模式高度依赖于代理：Codex的论文与文献不匹配/伪造参考文献的比例为5%/8%，而Kimi Code则为77%/72%，约为15倍的差距，反映了代理所发展出的不同研究人格。117篇代理生成的论文中没有一篇达到顶级会议的接受标准。这表明我们仍然与真正的自动研究存在差距。

View on arXiv Download PDF AI Translation

cs.AI / 15 / 2605.19186

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

可发现的智能体知识——智能体知识图谱能力的正式框架（扩展版）

Payne, Terry R., Tamma, Valentina, Daga, Enrico

Abstract

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be epistemically sound, and how ontological mismatches could be formally bridged. Current Knowledge Graph (KG) metadata standards such as VoID and DCAT describe what a KG contains yet say nothing about what a specific agent can prove from it, what closure assumptions govern empty results, or whether the agent's task vocabulary is grounded in the schema. Furthermore, in deployed KGs the governing schema DL and the operative entailment regime can diverge: an epistemic failure mode invisible to current metadata. We revisit and extend these insights for the KG setting with a four-dimensional formal framework from which we derive the Agentic Affordance Profile (AAP): a semantic layer above VoID and DCAT enabling principled KG selection, composition, and failure diagnosis at agent planning time. A five-point research agenda identifies the formal, computational, and engineering work needed to realise AAP-based affordance matching at scale.

Chinese Translation

二十年前，语义网服务社区被问及如何使具有不同本体承诺的智能体能够一致地发现、组合和调用网络服务。回应是 OWL-S 和 WSMO：这些是正式基础的能力描述，指定了服务可以做什么、智能体在调用时必须已知的知识以确保认知上的合理性，以及如何正式弥合本体不匹配。目前的知识图谱（KG）元数据标准，如 VoID 和 DCAT，描述了知识图谱的内容，但未说明特定智能体可以从中证明什么、空结果的闭合假设是什么，或智能体的任务词汇是否基于该模式。此外，在已部署的知识图谱中，主导模式的描述逻辑（DL）和操作蕴含机制可能会出现偏差：这是当前元数据无法察觉的认知失败模式。我们重新审视并扩展这些见解，针对知识图谱环境提出一个四维正式框架，从中推导出智能体能力配置文件（AAP）：这是一个位于 VoID 和 DCAT 之上的语义层，能够在智能体规划时实现原则性的知识图谱选择、组合和故障诊断。一个五点研究议程确定了实现基于 AAP 的能力匹配所需的正式、计算和工程工作，以便在大规模上进行应用。

View on arXiv Download PDF AI Translation

cs.AI / 16 / 2605.19192

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

幻觉作为利用：证据承载的多模态智能体

Zhang, Guijia, Zheng, Hao, Yang, Harry

Abstract

Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes an authorization failure rather than an answer-quality error. We formalize this failure mode as hallucination-to-action conversion: an unsupported perceptual claim supplies the precondition that makes a privileged action appear permitted. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and lets a deterministic gate grant only the privileges those certificates support. The architecture does not hide perception error; it converts opaque model belief into named verifier, schema, and implementation residuals. Verifier red-teaming over 1,900 attacks exposes this residual directly: four targeted hardening steps reduce gate bypass from 15% to 1.3%. With content-derived certificates, ECA obtains 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%). A direct HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defense (49.6%), but not for ECA. Oracle-certificate replay on 7,488 GPT-5.4 benchmark traces serves as a gate-correctness sanity check, and neural judge baselines remain bypassable under the same threat model. The resulting principle is simple: model language may propose actions, but external evidence must authorize them.

Chinese Translation

多模态智能体使用屏幕截图、文档和网页来选择工具调用。当一个虚假的视觉声明触发点击、电子邮件、提取或转移时，幻觉成为授权失败，而不是答案质量错误。我们将这种失败模式形式化为幻觉到行动的转换：一个不支持的感知声明提供了使特权行动看似被允许的前提条件。我们提出了证据承载的多模态智能体（ECA），该智能体将自由形式的模型文本视为不可接受的证据。ECA将每个工具调用分解为行动关键谓词，从受限的DOM/OCR/AX验证器获取类型证书，并让一个确定性的门仅授予那些证书支持的特权。该架构并不掩盖感知错误；它将不透明的模型信念转换为命名验证器、模式和实现残余。对1,900次攻击的验证器红队测试直接揭示了这一残余：四个针对性的强化步骤将门绕过率从15%降低到1.3%。通过内容衍生的证书，ECA在一个200任务的端到端管道中实现了0%的不安全行动率（Wilson 95%上限2.67%）和一个120任务的浏览器概念验证（上限4.3%）。对500个分层任务键的直接HACR审计显示，对于天真的智能体（100.0%）和仅提示防御（49.6%），不支持的行动关键声明达到了不安全执行，但对于ECA则没有。在7,488个GPT-5.4基准追踪上的Oracle证书重放作为门正确性的理智检查，而神经判断基线在相同威胁模型下仍然可被绕过。由此得出的原则很简单：模型语言可以提出行动，但外部证据必须授权这些行动。

View on arXiv Download PDF AI Translation

cs.AI / 17 / 2605.19215

Not all uncertainty is alike: volatility, stochasticity, and exploration

不确定性并非相同：波动性、随机性与探索

Piray, Payam

Abstract

Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces \emph{reversed} rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.

Chinese Translation

生物和人工智能中的自适应决策需要在利用已知结果与探索不确定替代方案之间取得平衡。尽管先前的研究表明不确定性通常促进探索，但通常将环境不确定性的不同来源视为等同。我们考虑具有潜在奖励状态随时间漂移（波动性）并通过噪声结果观察到（随机性）的环境。这两者都增加了后验不确定性，但我们表明它们在驱动最佳探索方面的方向相反：波动性增强探索，而随机性抑制探索。我们通过将 Gittins 指数框架扩展到具有潜在动态的高斯状态空间强盗问题，正式建立了这种不对称性。我们进一步推导出因果感知不确定性敏感探索（Cause-Aware Uncertainty-Sensitive Exploration，简称 CAUSE），这是一种通过控制作为推断获得的闭式探索奖励，继承了相同的单调性。在具有异质噪声结构的环境中，CAUSE 优于标准探索策略，并且在一个 Gittins 每臂策略上也有所改进，该策略的休息强盗最优性并不转移到不休息的设置中。学习和探索受相同的噪声推断不对称性支配，该框架预测病态噪声推断会产生 extit{反向}而不仅仅是受损的探索，这对精神疾病的计算模型具有重要意义。

View on arXiv Download PDF AI Translation

cs.AI / 18 / 2605.19219

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

SimGym：一个基于流量的视觉语言模型代理的电子商务A/B测试模拟框架

Li, Han, Malik, Vibhor, Foumani, Zahra Zanjani, Castelo, Alberto, Xie, Shuang, Fan, Ailin, Koay, Keat Yang, Zhu, Yuanzheng, Feghhi, Meysam, Uliana, Ronie, Zhang, Zhaoyu, Martins, Angelo Ocana, Zhao, Mingyu, Pelland, Francis, Faerman, Jonathan, LeBlanc, Nikolas, Glazer, Aaron, McNamara, Andrew, Wu, Zhong, Wang, Lingyun

Abstract

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.

Chinese Translation

A/B测试仍然是评估电子商务店面修改的金标准，但它会分流流量，需要数周才能达到统计显著性，并且存在降低用户体验的风险。我们提出了SimGym，这是一个用于在电子商务店面上模拟A/B测试的框架，利用在实时浏览器中操作的视觉语言模型（VLM）代理。该框架包括三个关键组件：（a）一个基于流量的人物生成管道，从生产点击流数据中推导每个商店的买家原型和意图；（b）一个实时浏览器代理架构，结合了对视觉和浏览器结构观察的多模态感知，以及情景记忆和保护措施，以在控制和处理店面之间进行连贯的购物会话；（c）一个评估协议，将模拟的结果变化与真实买家行为中的观察变化进行比较。我们在一个主要电子商务平台的视觉驱动UI主题变化的A/B测试中验证了SimGym，涵盖了多样的店面和产品类别。实证结果表明，SimGym代理与观察到的结果变化达成了强一致性，在真实买家流量中观察到的添加到购物车的变化中达到了77%的方向一致性。它将实验周期从数周缩短到不到一小时，使得快速实验成为可能，而无需将真实买家暴露于候选变体中。

View on arXiv Download PDF AI Translation

cs.AI / 19 / 2605.19229

Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

大型语言模型能否革新调查研究？关于灾害准备响应的实验

Wang, Yan, Guo, Ziyi, McCarty, Christopher

Abstract

Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.

Chinese Translation

调查研究面临着日益严重的结构性挑战：响应率下降、样本偏差、处于风险中的受访者存在块状缺失，以及在线面板中的人工智能辅助欺诈性完成。大型语言模型（LLMs）被提议作为一种解决方案，但在整个调查工作流程中进行严格评估的研究仍然稀缺，尤其是在数据质量至关重要的灾害背景下。我们提出并评估了一个涵盖问卷设计、样本选择、试点测试、缺失数据插补和后收集分析的五阶段框架，使用2024年佛罗里达州居民的飓风米尔顿准备调查（n=946）作为共享的实证测试平台。我们引入了一个受保护动机理论（PMT）约束的共现知识图谱，并开发了七种LLM配置，涵盖零样本推理、检索增强基线和新颖的理论驱动变体。我们提出的锚定边际理论驱动LLM（A-TLM）在灾害相关的块状缺失非随机缺失条件下（S4 RMSE 1.439 vs. 1.496 for the next-best）优于所有三种经典插补基线（IPW/MI, MICE+PMM, missForest），同时在随机森林插补器产生最大绝对偏差（-0.631）的情况下，实现了近零的有符号偏差（-0.121）。围绕PMT因果结构组织检索并将所有证据整合到单个模型调用中，优于非结构化检索和分阶段序贯推理（MAE 0.993 vs. 1.097 for standard RAG）。我们记录到近零的总体偏差可能掩盖对立的子组错误，并提出子组分层偏差审计作为报告标准。一个受检索约束的知识图谱聊天机器人证明，通过基础拒绝，幻觉在结构上是可控的。

View on arXiv Download PDF AI Translation

cs.AI / 20 / 2605.19250

Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

模态冲突幻觉中注意力头失衡的因果证据

Jiang, Jinrui, Wu, Zhangtai, Wu, Zhen, Dai, Xinyu

Abstract

Modality-conflict hallucination occurs when multimodal large language models (MLLMs) prioritize erroneous textual premises over contradictory visual evidence. To understand why visual evidence fails to prevail during generation, we take a mechanistic perspective and examine which internal components drive or resist this failure. We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads. We find a consistent asymmetry: driving effects are more broadly distributed and carry greater aggregate weight, whereas resisting effects concentrate in a small number of high-importance heads. Ablation experiments further confirm that these groups exert opposing effects during generation: distributed driving influence and localized resistance together form an imbalanced routing structure that biases generation toward the erroneous premise. Motivated by this finding, we propose MACI (Modality-conflict-Aware Causal Intervention), a conditional intervention that suppresses causally identified hallucination-driving heads only when conflict is detected. Across five MLLMs, MACI achieves the largest hallucination reduction among compared inference-time baselines on the MMMC benchmark with a favorable hallucination-accuracy trade-off, and transfers zero-shot to the SCI-SemanticConflict test.

Chinese Translation

模态冲突幻觉发生在多模态大型语言模型（MLLMs）优先考虑错误的文本前提而忽视矛盾的视觉证据时。为了理解为何视觉证据在生成过程中未能占据主导地位，我们从机制的角度出发，研究哪些内部组件驱动或抵制这种失败。我们通过对五个开源MLLMs进行头级因果分析，识别出两组具有相反因果角色的注意力头：驱动幻觉的头和抵制幻觉的头。我们发现了一种一致的不对称性：驱动效应更广泛分布并具有更大的整体权重，而抵制效应则集中在少数高重要性的头上。消融实验进一步确认了这两组在生成过程中的相反效应：分布式的驱动影响与局部的抵制共同形成了一种失衡的路由结构，使生成偏向于错误的前提。基于这一发现，我们提出了MACI（模态冲突感知因果干预），这是一种条件干预，仅在检测到冲突时抑制因果识别的驱动幻觉头。在五个MLLMs中，MACI在MMMC基准测试中实现了与比较的推理时间基线相比最大的幻觉减少，并在幻觉与准确性之间达成了良好的权衡，同时在SCI-SemanticConflict测试中实现了零样本迁移。

View on arXiv Download PDF AI Translation

cs.AI / 21 / 2605.19260

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

AQuaUI：针对具有自适应四叉树的GUI代理的视觉标记减少

Li, Yuankai, Zhu, Tinghui, Son, Ha Min, Zhao, Zhe, Liu, Xin, Chen, Muhao

Abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AquaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.

Chinese Translation

大型多模态模型（LMMs）最近作为GUI代理模型的有前景的基础架构而出现，其中高分辨率的GUI截图在每次迭代步骤中被引入到提示中。然而，这些截图表现出高度不均匀的空间信息密度：大区域可能携带的信息很少且视觉上同质，而关键文本和图标可能需要高视觉保真度。现有的方法要么需要额外的训练，要么依赖基于注意力的标记压缩，忽略了GUI截图的结构化布局和空间冗余。为填补这一空白，本文提出了AquaUI，这是一种针对GUI代理模型的无训练推理时标记减少方法，利用截图中的非均匀信息密度。AQuaUI在每个截图输入上构建自适应四叉树，并在四叉树的每个叶节点保留一个代表性合并标记。AQuaUI在整个流程中保留所保留标记的空间位置，以确保所有位置编码阶段保持一致。为了进一步改善多步GUI交互中的时间一致性，我们提出了一种条件四叉树算法，利用单个请求中连续截图之间的连续性。具体而言，它使用先前的四叉树作为参考来优化当前的四叉树，帮助在静态或轻微移动的GUI状态之间保留细粒度区域。我们在最先进的GUI代理模型上实现了AQuaUI，并在标准的基础和导航基准上进行了实验。AQuaUI在准确性与效率的权衡上始终优于先前的基线。值得注意的是，在GUI-Owl-1.5-32B-Instruct上，AQuaUI实现了高达13.22%的加速和29.52%的视觉标记减少，同时保留了99.06%的全标记性能，这表明GUI截图的空间冗余可以在推理时被利用，而无需重新训练。

View on arXiv Download PDF AI Translation

cs.AI / 22 / 2605.19264

Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

与鲸鱼共游：利益权重治理中的权力不平衡分析

Zhang, Yuzhe, Schneider, Manvir, Wang, Qin, Grossi, Davide

Abstract

Voting methods weighted by stakes are the fundamental governance paradigm in Proof-of-Stake (PoS) blockchains. Such a paradigm is known to be prone to power distortions: a few users possessing large stakes may completely control decision making, even without owning the totality of the stakes. We study this phenomenon through the lens of computational social choice, focusing on the extent of power imbalances in stake-weighted voting when power is quantified using the Penrose-Banzhaf power index. Our work presents both analytical and empirical contributions. Analytically, we demonstrate that while a perfect alignment between power and relative stake ownership is generally unattainable, it can be approximated in expectation under specific conditions. Empirically, using data from a real-world on-chain governance system (Project Catalyst), we provide a more fine-grained understanding of the power imbalances that are likely to occur in current stake-weighted governance systems.

Chinese Translation

基于利益权重的投票方法是权益证明（Proof-of-Stake, PoS）区块链中的基本治理范式。该范式已知容易出现权力扭曲：少数拥有大量利益的用户可能完全控制决策，即使他们并不拥有全部的利益。我们通过计算社会选择的视角研究这一现象，重点关注在使用Penrose-Banzhaf权力指数量化权力时，利益权重投票中的权力不平衡程度。我们的工作提供了分析和实证方面的贡献。在分析方面，我们证明了虽然权力与相对利益所有权之间的完美对齐通常是无法实现的，但在特定条件下可以在期望上进行近似。在实证方面，我们利用来自一个真实的链上治理系统（Project Catalyst）的数据，更细致地理解当前利益权重治理系统中可能出现的权力不平衡。

View on arXiv Download PDF AI Translation

cs.AI / 23 / 2605.19330

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

MOCHA：用于智能体技能优化的多目标切比雪夫退火

Tanjim, Md Mehrab, Subramanian, Jayakumar, Chen, Xiang, Kveton, Branislav, Mukherjee, Subhojyoti, Zhang, Anlan, Kim, Sungchul, Sarkhel, Somdeb, Choudhury, Sunav

Abstract

LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many more Pareto-optimal skill variants.

Chinese Translation

大型语言模型（LLM）智能体通过技能组织行为——这些技能是结构化的自然语言规范，规定了智能体如何推理、检索和响应。与单一提示不同，技能是受限于严格平台约束的多字段工件：描述字段因路由而被截断，指令主体通过渐进式披露进行压缩，而共存技能则在有限的上下文窗口中竞争。这些约束使得技能优化本质上成为多目标问题：技能必须同时最大化任务性能并满足平台限制。然而，现有的提示优化器要么忽视这些权衡，要么将其简化为加权和，从而错过非凸目标区域中的帕累托最优变体。我们提出了MOCHA（多目标切比雪夫退火），它用切比雪夫标量化替代单目标选择——覆盖整个帕累托前沿，包括非凸区域——并结合指数退火，从探索过渡到开发。在我们针对六种不同智能体技能的实验中——所有方法共享相同的多目标变异操作符，基线接收相同的每目标文本反馈——现有优化器在6个任务中的4个任务上未能改善种子技能：1000次回合没有取得任何进展。MOCHA在每个任务上都取得突破，相较于最强基线实现了7.5%的平均正确率相对提升（在FEVER上高达14.9%，在TheoremQA上高达10.4%），同时发现了两倍于现有方法的帕累托最优技能变体。

View on arXiv Download PDF AI Translation

cs.AI / 24 / 2605.19337

Agentic Trading: When LLM Agents Meet Financial Markets

代理交易：当大型语言模型代理遇见金融市场

Xia, Yihan, You, Panpan, Wang, Taotao, Liu, Fang, Qi, Han, Wu, Xiaoxiao, Zhang, Shengli

Abstract

A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.

Chinese Translation

越来越多的研究探讨如何将大型语言模型（LLMs）嵌入交易系统，作为能够感知市场信息、检索上下文、推理决策、发出可交易动作并在市场反馈下进行适应的代理。本文将基于LLM的交易代理重新框定为专家系统决策流程，并呈现了一份以审计为导向的证据地图，涵盖了77项在2026年3月9日筛选的协议编码快照中的研究。一个主要的实证子集（n=19）满足了行动输出加闭环评估的最小边界；其余58项研究则作为背景和设计上下文保留。中心实证发现是协议不可比性：在主要子集中，仅有2/19项研究报告了可提取的时间一致性分割协议，1/19项报告了明确的交易成本模型，1/19项记录了宇宙或生存处理，11/19项报告了执行时机或语义，15/19项被编码为R0，且没有研究达到R3可重复性。因此，我们使用架构-能力-适应作为工作分析视角，而非经过验证的分类法，并将证据账本、可重复性审计和报告清单作为主要贡献。结果调查显示，架构实验正在迅速扩展，而可比评估协议、执行语义和可重复性文物仍然是该领域的直接瓶颈。

View on arXiv Download PDF AI Translation

cs.AI / 25 / 2605.19376

Generative Recursive Reasoning

生成递归推理

Baek, Junyeob, Jo, Mingyu, Kim, Minsu, Ren, Mengye, Bengio, Yoshua, Ahn, Sungjin

Abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce \emph{Generative Recursive reAsoning Models (GRAM)}, a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. \href{https://ahn-ml.github.io/gram-website/}{https://ahn-ml.github.io/gram-website}

Chinese Translation

未来的神经推理系统应如何实现扩展计算？递归推理模型（Recursive Reasoning Models, RRMs）通过使用共享转移函数进行迭代潜在状态细化，为自回归序列扩展提供了一种有前景的替代方案。然而，现有的RRMs在很大程度上是确定性的，遵循单一的潜在轨迹并收敛到单一预测。我们提出了生成递归推理模型（Generative Recursive reAsoning Models, GRAM），这是一个将递归潜在推理转变为概率性多轨迹计算的框架。GRAM将推理建模为随机潜在轨迹，使得多个假设、替代解决策略以及通过递归深度和并行轨迹采样进行推理时的扩展成为可能。这产生了一个潜变量生成模型，通过$p_ heta(y ext{ | } x)$支持条件推理，并在输入固定或缺失的情况下，通过$p_ heta(x)$实现无条件生成。通过摊销变分推理进行训练，GRAM在结构化推理和多解约束满足任务上优于确定性的递归和递归基线，同时展示了无条件生成的能力。

View on arXiv Download PDF AI Translation

cs.AI / 26 / 2605.19382

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

PRISM：程序化时空推理的基准测试

Zhang, Qiran, Wang, Yuheng, Yang, Runde, Wu, Lin, Fan, Jingru, Yao, Shu, Zhang, Jie, Zhou, Tianle, Li, Huatao, Shi, Ruijie, Li, Yihan, Qian, Chen

Abstract

Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.

Chinese Translation

通过代码进行程序化视频生成提供了超越像素级扩散模型的几何精度和时间一致性，但严格评估语言模型是否能够生成空间上正确的动画输出仍然是一个未解决的问题。我们引入了PRISM，这是一个大规模基准测试，包含10,372对经过人工校准的指令-代码对（是之前程序化视频生成基准的20倍），基于现实世界知识可视化场景，涵盖英语和中文，并跨越437个主题类别。我们进一步提出了一种漏斗式评估框架，包含四个互补指标：代码级可靠性（Code-Level Reliability）用于可执行性，空间推理（Spatial Reasoning）用于整个动画序列的布局正确性，以及提示感知动态视觉复杂性（Prompt-Aware Dynamic Visual Complexity, PADVC）和时间密度（Temporal Density, TD）用于诊断动态表达和时间活动。对七种主流大型语言模型（LLMs）的系统评估揭示了显著的执行-空间差距：从执行成功率到空间通过率的平均下降约为41%，这表明可运行的代码不一定会产生空间一致的视觉输出。这些发现表明，程序化视频生成的评估应超越可执行性。PRISM为推进空间一致的代码生成提供了一个原则性的基准。

View on arXiv Download PDF AI Translation

cs.AI / 27 / 2605.19418

Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

基于签名图建模的冲突韧性多智能体推理

He, Longgang, He, Longzhu, He, Daojing, Li, Chaozhuo

Abstract

LLM-based multi-agent systems (MAS) have demonstrated strong reasoning and decision-making capabilities that consistently surpass those of single LLM agents. However, their performance often suffers from naive aggregation mechanisms that assume uniformly cooperative interactions. Upon close inspection, we observe that existing graph-based MAS frameworks (1) propagate errors when conflicting signals arise without control, and (2) lack explicit modeling of conflicting inter-agent relations as well as structural awareness, failing to identify reliable interaction patterns. To bridge this gap, we introduce SIGMA, a novel SIgned Graph-informed Multi-Agent reasoning framework that explicitly captures trust, conflict, and neutral relations among agents via a signed relational graph. Specifically, given a query, SIGMA first selects a set of relevant and diverse agents, then constructs a structured signed interaction graph with confidence-weighted edges. Reasoning proceeds through conflict-aware signed message passing, which reinforces information from trustworthy agents while suppressing conflicting signals, and terminates with a structure- and conflict-aware weighted aggregation to yield globally consistent and conflict-resilient predictions. Extensive experiments on six benchmark datasets, across multiple LLM backbones and diverse multi-agent configurations, demonstrate that SIGMA consistently outperforms state-of-the-art baselines, achieving notable gains in both accuracy and conflict-resilient performance.

Chinese Translation

基于大型语言模型（LLM）的多智能体系统（MAS）展示了强大的推理和决策能力，始终超越单一LLM智能体的表现。然而，它们的性能常常受到简单聚合机制的影响，这些机制假设交互是均匀合作的。经过仔细检查，我们观察到现有的基于图的MAS框架存在以下问题：（1）在出现冲突信号时，错误传播无法得到控制；（2）缺乏对智能体间冲突关系的明确建模以及结构意识，未能识别可靠的交互模式。为了解决这一问题，我们提出了SIGMA，一个新颖的基于签名图的多智能体推理框架，通过签名关系图明确捕捉智能体之间的信任、冲突和中立关系。具体而言，给定一个查询，SIGMA首先选择一组相关且多样的智能体，然后构建一个具有置信度加权边的结构化签名交互图。推理通过冲突感知的签名消息传递进行，这一过程强化了来自可信智能体的信息，同时抑制冲突信号，并以结构和冲突感知的加权聚合结束，以产生全球一致且具有冲突韧性的预测。在六个基准数据集上的大量实验中，跨多个LLM基础模型和多样的多智能体配置，SIGMA始终优于最先进的基线，在准确性和冲突韧性表现上均取得显著提升。

View on arXiv Download PDF AI Translation

cs.AI / 28 / 2605.19447

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

什么时机进行蒸馏：针对多回合智能体的选择性回顾蒸馏

Li, Xiaozhe, Lyu, Tianyi, Li, Yang, Ma, Yichuan, Li, Peiji, Li, Linyang, Guo, Qipeng, Lin, Dahua, Chen, Kai

Abstract

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.

Chinese Translation

强化学习可以通过稀疏的任务奖励训练大规模语言模型（LLM）智能体，但长时间跨度的信用分配仍然具有挑战性：单一的成功或失败信号必须分配到许多动作上。现有方法依赖于轨迹级奖励或代理信号，而未充分利用逐步的环境反馈。多回合智能体设置尚未得到充分探索，其中反馈可以包括错误信息、页面变化、观察结果或参考轨迹。我们系统地研究了五种反馈来源和两种插入粒度，并引入了选择性环境重加权学习框架（SERL）。SERL利用任务奖励来确定更新方向，同时环境反馈调整位置和幅度，专注于关键动作。在ALFWorld和WebShop上，SERL分别达到了90.0%和80.1%的成功率，超越了强大的强化学习和蒸馏基准。分析表明，在重要时刻提供的基于实际情况的、与动作相关的反馈始终优于对更长或更丰富上下文的无差别使用。

View on arXiv Download PDF AI Translation

cs.AI / 29 / 2605.19457

Generative Auto-Bidding with Unified Modeling and Exploration

统一建模与探索的生成自动竞价

Zhang, Mingming, Zhuang, Feiqing, Li, Na, Sun, Shengjie, Chen, Xiaowei, Zhu, Junxiong, Xiao, Fei, Yang, Keping, Zou, Lixin, Li, Chenliang

Abstract

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.

Chinese Translation

自动竞价是现代数字广告的核心。早期的基于规则的方法缺乏适应性，而随后的强化学习方法将竞价建模为马尔可夫决策过程，但在处理长期依赖性时遇到困难。最近的生成模型显示出潜力，但它们缺乏明确的机制来平衡探索与安全，仅依赖于动作扰动或轨迹引导，而没有安全后备。这导致了低效的探索和广告平台的高财务风险。为了解决这一问题，我们提出了GUIDE（生成自动竞价与统一建模和探索），一个将定向探索与安全后备机制协同集成的框架。GUIDE采用决策变换器（Decision Transformer, DT）共同建模历史竞价行为和环境状态转变。Q值模块通过正则化约束引导DT的探索，而逆动力学模块（Inverse Dynamics Module, IDM）利用DT预测的未来状态推断出稳健且行为一致的动作作为安全策略后备。Q值模块随后在这两个选项之间自适应地选择最终动作，以平衡探索与安全。这些组件共同形成一个集成的“探索-保护-选择”流程，统一了效率与安全性。我们在公共数据集、模拟拍卖环境以及在中国领先的广告平台淘宝的大规模在线部署中进行了广泛实验。结果表明，GUIDE在所有场景中始终优于最先进的基线。在实际部署中，GUIDE实现了显著的收益：广告总交易额（GMV）增加4.10%，广告点击量增加1.40%，广告成本增加1.66%，广告投资回报率（ROI）增加3.52%，展示了其有效性和强大的工业适用性。

View on arXiv Download PDF AI Translation

cs.AI / 30 / 2605.19461

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

超越模式崩溃：多样化推理的分布匹配

Li, Xiaozhe, Li, Yang, Fang, Xinyu, Ding, Shengyuan, Li, Peiji, Chen, Yongkang, Ma, Yichuan, Lyu, Tianyi, Li, Linyang, Lin, Dahua, Guo, Qipeng, Liu, Qingwen, Chen, Kai

Abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.

Chinese Translation

像 GRPO 这样的在线强化学习方法面临模式崩溃的问题：它们表现出解决方案多样性的减少，一旦发现某个解决方案，就会将概率质量集中在该解决方案上，并停止对替代策略的探索。我们表明，这源于反向 KL 最小化的模式寻求行为，该行为强化了首次发现的高奖励轨迹，而不是维持多个多样化解决方案的分布。我们提出了 DMPO（分布匹配策略优化），通过对前向 KL 最小化的原则性近似来防止模式崩溃。DMPO 构建了一个基于采样轨迹的组级目标分布，该分布与它们的奖励成比例，然后将策略分布与该目标对齐。这提供了覆盖模式的行为，而无需从不可处理的全局目标分布中进行采样，从而在整个训练过程中实现持续探索。我们在 NP 难组合优化问题上验证了 DMPO，在这些问题中存在指数级的可行解决方案，但只有少数接近最优解，这是评估探索的理想测试平台。DMPO 在基于文本的 NP-Bench 上实现了 43.9% 的质量比（相比 GRPO 的 40.1%），在基于视觉的 NP-Bench 上实现了 43.1%（相比 38.4%），分别展示了 9% 和 12% 的相对提升。这些增益在数学推理（+2.0%）和域外任务（+2.3%）中也得到了推广，显示出保持多样性的训练增强了跨模态的推理能力。我们的工作确立了分布匹配作为一种实用的、原则性的防止在线强化学习中模式崩溃的方法，持续的质量提升表明在多样化推理任务中实现了持续探索。

View on arXiv Download PDF AI Translation

cs.AI / 31 / 2605.19485

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

基于注意力引导的奖励机制在针对大型推理模型的强化学习越狱中的应用

Lin, Zheng, Niu, Zhenxing, Ji, Haoxuan, Huang, Yuzhe, Gao, Haichang

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.

Chinese Translation

大型推理模型（LRMs）在通过生成结构化的逐步推理内容来解决复杂问题方面展现了显著的能力。然而，暴露模型的内部推理过程会引入额外的安全风险；例如，最近的研究表明，LRMs比标准的LLMs更容易受到越狱攻击。本文研究了对LRMs的越狱攻击，并揭示攻击成功率（ASR）与LRMs的注意力模式密切相关。具体而言，成功的越狱攻击往往对输入提示中的有害标记分配较低的注意力，而对推理内容中的这些标记分配较高的注意力。基于这一发现，我们提出了一种新颖的LRMs越狱方法，该方法利用强化学习（RL）来增强攻击效果，明确将注意力信号纳入奖励函数设计。此外，我们引入多样的说服策略以丰富RL的动作空间，从而持续提高ASR。在三个基准测试中对五个开源和闭源LRMs进行的广泛实验表明，我们的方法实现了显著更高的ASR，在有效性、效率和可迁移性方面超越了现有方法。

View on arXiv Download PDF AI Translation

cs.AI / 32 / 2605.19514

Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management

位置：现实世界自回归变换器的图灵完备性在很大程度上依赖于上下文管理

Cui, Guanyu, Wei, Zhewei, He, Kun

Abstract

Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.

Chinese Translation

许多研究声称变换器是图灵完备的。然而，文献中常常混淆了两种不同的设置：（i）固定变换器系统设置，在该设置中，固定的自回归变换器与固定的上下文管理方法结合，以逐步处理不同长度的输入；（ii）缩放家族设置，在该设置中，使用不同模型的家族（具有增加的上下文窗口长度或数值精度）来处理不同的输入长度。现有的变换器图灵完备性证明通常是在设置（ii）中建立的，而现实世界大语言模型（LLM）的部署和标准的图灵完备性概念更自然地对应于设置（i）。在本文中，我们首先形式化固定系统设置，从而提供对现实世界LLM操作的具体特征描述。然后，我们论证在缩放家族设置中证明的结果提供了理论上有意义的资源界限，但并未建立图灵完备性，从而澄清了对现有结果的常见误解。最后，我们展示了不同的上下文管理方法可以产生截然不同的计算能力，并主张上下文管理是一个核心组成部分，关键决定了现实世界自回归变换器的计算能力。

View on arXiv Download PDF AI Translation

cs.AI / 33 / 2605.19518

BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

BLINKG：一个用于大语言模型集成知识图谱生成的基准测试

Castedo, Carla, Iglesias, Enrique, Lama, Manuel, Bugarin-Diz, Alberto, Vidal, Maria-Esther, Chaves-Fraga, David

Abstract

Generating Knowledge Graphs (KGs) remains one of the most time-consuming and labor-intensive tasks for knowledge engineers, as they need to identify semantic equivalences between input data sources and ontology terms. While declarative solutions (e.g., RML, SPARQL-Anything) have helped to generalize this process, aligning input schema elements with ontology terms still involves intricate transformations and requires considerable manual effort. With the advent of Large Language Models (LLMs), there is growing interest in leveraging their capabilities to assist KG engineers. Although some studies have explored using LLMs to automate KG construction, there is still no standardized framework for assessing how effectively they establish correspondences between data schemes and ontology concepts. Therefore, in this paper, we propose BLINKG, a benchmark designed to evaluate the mapping capabilities of LLMs in constructing KGs from heterogeneous data sources. The benchmark includes a set of scenarios with increasing complexity, based on real-world use cases. We conduct an extensive experimental evaluation of several stateof-the-art LLMs using BLINK and observe that they already offer promising solutions. However, their performance remains limited in complex scenarios. Thanks to this benchmark, we can already assess the current capabilities of LLMs for KG construction. Additionally, we define a set of requirements for achieving (semi)automated (LLM-driven) KG construction, opening new research lines in this area.

Chinese Translation

生成知识图谱（KG）仍然是知识工程师最耗时和劳动密集的任务之一，因为他们需要识别输入数据源与本体术语之间的语义等价关系。尽管声明性解决方案（例如 RML、SPARQL-Anything）在一定程度上帮助了这一过程的概括，但将输入模式元素与本体术语对齐仍然涉及复杂的转换，并需要相当大的人工努力。随着大语言模型（LLMs）的出现，利用其能力来辅助知识图谱工程师的兴趣日益增长。尽管一些研究探讨了使用 LLMs 来自动化知识图谱构建，但目前仍没有标准化的框架来评估它们在建立数据模式与本体概念之间的对应关系方面的有效性。因此，在本文中，我们提出了 BLINKG，一个旨在评估 LLMs 在从异构数据源构建知识图谱时映射能力的基准测试。该基准测试包括一系列基于真实世界用例的复杂性逐渐增加的场景。我们使用 BLINK 对几种最先进的 LLMs 进行了广泛的实验评估，并观察到它们已经提供了有前景的解决方案。然而，它们在复杂场景中的表现仍然有限。借助这个基准测试，我们可以评估 LLMs 在知识图谱构建方面的当前能力。此外，我们定义了一组实现（半）自动化（基于 LLM 的）知识图谱构建的要求，为该领域开辟了新的研究方向。

View on arXiv Download PDF AI Translation

cs.AI / 34 / 2605.19521

Efficient Elicitation of Collective Disagreements

高效引导集体分歧

Ouaguenouni, Mohamed, Garrido-Lucero, Felipe, Grandi, Umberto, Hidalgo, César, Tydrichova, Magdalena

Abstract

We analyze the structure of the disagreement among a population of voters over a set of alternatives. Surveys typically ask either for pairwise comparisons, simple and intuitive for participants, or full rankings over alternatives, eliciting the entire voters' preferences. Building on the observation that pairwise comparisons cannot distinguish structural disagreement from noise, we propose a stratified framework to identify the minimal aggregated preference information needed to compute a number of disagreement measures from the literature. Specifically, we introduce the plurality matrix, a generalization of pairwise comparisons that records, for every subset $S$ of alternatives, the probability that each $a \in S$ ranks first in $S$. We define the level of a disagreement measure as the smallest subset size needed to express it, showing that many existing notions, including rank-variance and divisiveness, sit at level $3$, proving that pairwise comparisons are not enough. In addition, we demonstrate the interest of going beyond level $3$ both theoretically and experimentally. To make these results actionable, we design two elicitation protocols to estimate the plurality matrix, exploring the trade-off between the number of required participants and the cognitive load requested to each of them.

Chinese Translation

我们分析了选民群体在一组备选方案上的分歧结构。调查通常要求进行成对比较，这对参与者来说简单直观，或者要求对备选方案进行全面排名，以引导出选民的整体偏好。基于成对比较无法区分结构性分歧与噪声的观察，我们提出了一个分层框架，以识别计算文献中多种分歧度量所需的最小聚合偏好信息。具体而言，我们引入了多数矩阵（plurality matrix），这是成对比较的一种推广，它记录了每个备选方案子集 $S$ 中每个 $a ext{ in } S$ 在 $S$ 中排名第一的概率。我们将分歧度量的级别定义为表达它所需的最小子集大小，显示许多现有概念，包括排名方差（rank-variance）和分裂性（divisiveness），都处于第 $3$ 级，证明成对比较是不够的。此外，我们还理论和实验上展示了超越第 $3$ 级的兴趣。为了使这些结果具有可操作性，我们设计了两种引导协议以估计多数矩阵，探索所需参与者数量与每位参与者所需认知负担之间的权衡。

View on arXiv Download PDF AI Translation

cs.AI / 35 / 2605.19529

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

生成-评估一致性：LLM驱动自适应评估的必要有效性标准

Lee, Grandee, Wang, Yue, Lye, Che Yee, Peh, Luke

Abstract

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance r = 0.698 with systematic positive bias. GEA is strong r > 0.7 for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.

Chinese Translation

当同一个大型语言模型（LLM）生成评估项目、模拟学生反应并进行评分时，验证循环是自指的。我们引入了生成-评估一致性（Generative-Evaluative Agreement, GEA），这一有效性标准用于衡量LLM的评分功能是否能够恢复其生成功能所指示的技能水平。在对一个两阶段自适应评估进行的首次直接GEA测量中，该模型大约恢复了预期方差的一半，相关系数 r = 0.698，并存在系统性的正偏差。对于语法可验证的技能，GEA表现强劲，相关系数 r > 0.7，但对于设计层面的技能则接近零，低技能的高估在路由阈值附近抬高了分数。我们认为，细粒度的技能分解评分标准是增强GEA的主要建议机制，并概述了补充的缓解措施。

View on arXiv Download PDF AI Translation

cs.AI / 36 / 2605.19576

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

库漂移：诊断和修复自我演化 LLM 技能库中的隐性故障模式

Zhang, Xing, Cui, Yanwei, Wang, Guanghui, Li, Ziyuan, Qiu, Wei, Zhu, Bing, He, Peiyang

Abstract

Self-evolving skill libraries face a silent failure mode we term \emph{library drift}: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom--LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench)--yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift--one disables skill injection (flat floor, +0.002), one imposes premature retirement (active harm, $-$0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain $+$0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

Chinese Translation

自我演化的技能库面临一种我们称之为“库漂移”（library drift）的隐性故障模式：在没有以结果为导向的生命周期管理的情况下，无限制的技能积累导致检索退化、误报注入和性能停滞。最近的评估确认了这一症状——LLM 创作的技能带来的增益为 +0.0pp，而人工策划的技能则带来了 +16.2pp（SkillsBench）——但其潜在机制尚未被单独识别。我们提供了 (1) 一个可重复的触发器：隔离漂移的消融实验——一个禁用技能注入（平坦底线，+0.002），一个强制提前退休（主动伤害，$-$0.019）；(2) 追踪级别的诊断：一个仅附加的证据日志，记录每个技能的贡献分数、归因判决和路由器参与指标，使故障在达到最终任务分数之前可见；(3) 一个经过验证的修复方案：一个最小治理配方（以结果为导向的退休 + 有界的主动能力 + 元技能创作优先）将保留的 pass@1 从 0.258 的基线提升到 MBPP+ hard-100 上 100 轮的晚期窗口均值 0.584（滚动增益 $+$0.328）。八个消融实验分解了哪些治理机制是承重的，哪些被包含，从而为诊断任何自我演化代理中的库漂移提供了具体的操作手册。

View on arXiv Download PDF AI Translation

cs.AI / 37 / 2605.19587

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

SceneCode：可编辑室内场景的可执行世界程序与关节对象

Wang, Puyi, Wang, Yuhao, Li, Linjie, Yang, Zhengyuan, Lin, Kevin Qinghong, Li, Yangguang, Cheng, Yu

Abstract

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner--designer--critic loop. Each request is then routed to one of five code-generation strategies and converted into a synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets, and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure, and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

Chinese Translation

室内场景合成是具身人工智能、机器人操控和基于仿真的策略评估的基础，其中一个有用的场景不仅需要指定环境的外观，还需描述其对象的结构。然而，现有的流程通常将生成的内容表示为静态网格，并仅从策划的资产库中继承关节性，这限制了对象级的可控性，并阻止了按需生成新的可交互资产。我们通过将物理可交互的室内场景合成形式化为程序化世界生成来填补这一空白，并提出了SceneCode，一个将自然语言提示编译为可执行的、代码驱动的室内世界的框架，而不是一组不透明的网格。首先，房间级的代理骨架将提示转化为结构化的房屋布局，并通过规划者-设计者-评估者循环发出每个对象的资产请求。每个请求随后被路由到五种代码生成策略之一，并转化为合成的逐部分Blender Python程序，这些程序通过执行引导的修复和优化循环进行验证。最终生成的程序被编译为适合仿真的资产，并以SDF格式导出以用于物理仿真。一个持久的场景状态注册表将对象请求、可执行程序、渲染几何体和仿真资产链接起来，将场景组装转变为一个可追踪和本地可编辑的世界构建过程。我们在场景级合成、对象级资产质量、人类判断和下游机器人交互等方面评估了SceneCode。结果表明，可执行世界程序改善了提示忠实的室内场景生成，并生成了具有更清晰网格结构和可加载的关节元数据的资产。项目页面：https://scene-code.github.io/

View on arXiv Download PDF AI Translation

cs.AI / 38 / 2605.19593

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

面向多模型大语言模型调度器：卸载和抢占的实证洞察

Yildiz, Mert, Spadaccino, Pietro, Rolich, Alexey, Cuomo, Francesca, Baiocchi, Andrea

Abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.

Chinese Translation

现代大语言模型（LLMs）的部署越来越需要在共享的异构硬件上服务多个具有不同架构、规模和专业化的模型。这种环境为资源分配、调度和调度带来了新的挑战，尤其是在GPU内存受限的情况下，部分CPU-GPU卸载和抢占变得必要。虽然现有系统主要针对单一模型优化吞吐量，但在这些条件下，针对多模型调度的研究相对较少。本文呈现了一项实证研究，探讨不同LLM在硬件平台上的表现，重点关注层卸载和抢占对性能的影响。我们表明，卸载导致解码吞吐量出现强烈的非线性和模型依赖性下降，较小的模型对减少GPU驻留的敏感性更强。我们进一步证明，抢占会产生可观的开销，主要由模型状态重新加载主导，而非键值缓存转移，并且这一成本在不同模型和硬件平台之间差异显著。此外，我们强调序列长度和互连带宽在放大数据移动和执行低效方面的作用。基于这些发现，我们确定了一组未来调度器必须考虑的关键特征，包括模型特定的卸载敏感性、工作负载特征以及抢占和数据传输的成本结构。这些洞察为下一代LLM服务系统的设计提供了指导，使其能够有效管理异构的多模型工作负载，并实现混合CPU-GPU执行。

View on arXiv Download PDF AI Translation

cs.AI / 39 / 2605.19604

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

正式技能：用于高效和准确的LLM代理的可编程运行时技能

Zhang, Xi, Gao, Meijun, Zhao, Yuntian, Tan, Xinyu, Yao, Yilun, Wang, Feiyu, Wang, Yanshu, Dingsiyi, Yang, Tong

Abstract

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

Chinese Translation

大型语言模型（LLM）代理越来越多地在真实工作环境中运作，其中工具和技能决定了模型推理是否能够转化为可靠的行动。现有的技能仍然主要是非正式的：Markdown技能和指令包将程序编码为冗长的自然语言文档，而函数调用、模型上下文协议（MCP）服务器和框架工具则结构化个别动作，但通常将工作流状态、政策执行和完成纪律留在技能本身之外。我们引入了正式技能（Formal Skill），这是一种运行时原生抽象，使用JSON元数据和动作模式表示可重用能力，提供可靠的Python执行器、钩子管理的控制逻辑、正式技能路由和技能本地运行时状态。通过将可重用程序从重复的提示文本转移到可执行的状态机和钩子政策中，正式技能为代理提供了一个令牌高效且可执行的控制界面。我们在FairyClaw中实现了这一抽象，FairyClaw是一个开源的事件驱动运行时，用于可执行、可观察和可组合的正式技能。在Harness-Bench上，FairyClaw获得了高度竞争的平均分数，同时使用的令牌数量显著减少，尤其在揭示正式技能作用的任务中表现出色。

View on arXiv Download PDF AI Translation

cs.AI / 40 / 2605.19630

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

EMO-BOOST：情感增强音视频特征以改善深度伪造检测中的泛化能力

Marik, Aritra, Klemt, Marcel, Rohrbach, Anna

Abstract

With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in EmoBoost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

Chinese Translation

随着生成性人工智能模型的不断进步，取证领域面临着越来越大的压力。新一代技术的不断涌现使得为每种操控收集数据以训练深度伪造检测模型变得不可能。因此，在训练过程中未见过的深度伪造样本上实现泛化是当前深度伪造检测研究中的主要挑战之一。为了解决这一挑战，我们采用高层次的语义线索，并认为这些线索可以支持低层次的聚焦方法在泛化未见操控类型方面的效果。在这项工作中，我们研究了情感作为一种高层次的语义线索。我们提出了Emo-Boost，一个多模态深度伪造检测框架，它将现成的RGB和声学聚焦的深度伪造检测器与我们基于情感的深度伪造检测器EmoForensics相融合。EmoForensics利用视觉和音频情感识别模块，并建模音视频流中情感表征的内部和跨模态时间一致性。我们发现EmoForensics和低层次聚焦方法捕获了互补信号。因此，在EmoBoost中结合这两种信号使得在FakeAVCeleb数据集上平均跨操控泛化AUC提高了2.1%。

View on arXiv Download PDF AI Translation

cs.AI / 41 / 2605.19662

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

当表格基础模型遇上战略性表格数据：一种先验对齐方法

Lv, Xinpeng, Mao, Yunxin, Xu, Renzhe, Zheng, Chunyuan, Chen, Yikai, Li, Haoxuan, Yang, Jinxuan, Kuang, Kun, Chen, Yuanlong, Geng, Mingyang, Huang, Wanrong, Liu, Shixuan, Yang, Shaowu, Yang, Wenjing, Lin, Zhouchen, Wang, Haotian

Abstract

Tabular foundation models based on pretrained prior-data fitted networks~(PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for \emph{non-strategic} settings where data distributions are independent of deployed classifiers. In many real-world decision scenarios, however, individuals may strategically modify their features after deployment to obtain favorable outcomes, inducing a post-deployment distribution shift. This paper studies whether PFN-style tabular foundation models can generalize to such \emph{strategic} tabular data. We show that strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias. To address this issue, we propose \textbf{Strategic Prior-data Fitted Network}~\textit{(SPN)}, an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining. SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution. Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.

Chinese Translation

基于预训练先验数据拟合网络（PFNs）的表格基础模型在多样的表格任务中表现出强大的泛化能力，但它们通常是为 extit{非战略性}环境设计的，其中数据分布与部署的分类器是独立的。然而，在许多现实决策场景中，个体可能会在部署后战略性地修改其特征，以获得有利的结果，从而引发后部署分布的变化。本文研究PFN风格的表格基础模型是否能够泛化到这种 extit{战略性}表格数据。我们表明，战略性操控导致了在预训练期间学习的非战略性先验与后操控的战略性先验之间的不匹配，这导致了系统性的预测偏差。为了解决这个问题，我们提出了 extbf{战略性先验数据拟合网络}（SPN），这是一种推理时的策略感知框架，能够在不重新训练的情况下将表格基础模型适应于战略性环境。SPN构建战略性上下文示例，以近似后操控输入，并将PFN预测与引发的战略分布对齐。在真实世界和合成表格数据集上的实验表明，与表格基础模型和经典表格方法相比，SPN在战略性操控下始终提高了鲁棒性和预测性能。

View on arXiv Download PDF AI Translation

cs.AI / 42 / 2605.19663

Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

伪代码引导的结构化推理用于自动化可靠推断的视觉-语言模型

Ni, Weicong, Jiang, Tianbao, Wang, Linlin

Abstract

Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies-enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.

Chinese Translation

视觉-语言模型（VLMs）正成为机器人自动化高层次推理的基石，使机器人能够解析自然语言指令并感知其环境。然而，它们对幻觉的敏感性引入了决策中的关键失败，给物理部署带来了显著的安全性和可靠性风险。现实世界任务的开放性特征加剧了这一挑战，问题在难度和形式上差异巨大，要求强大且适应性强的推理策略。为此，我们提出了伪代码引导的结构化推理框架（PStar），该框架自适应地选择结构化的伪代码推理路径，以帮助VLMs进行灵活的逐步推理。我们首先设计了一组抽象推理函数，并制定了一个结构化伪代码库，以表示模块化推理策略。至关重要的是，我们设计了一个难度特征向量（DFV），使模型能够评估问题复杂性并自适应选择适当的推理策略，从而增强了鲁棒性和可解释性。大量实验表明，PStar显著降低了幻觉率，在POPE上达到了87.1%的最新成绩，在MMStar上达到了68.0%，甚至超过了GPT-4V。通过提供一种经过验证的机制来减少视觉-语言错误，PStar为在现实世界自动化系统中部署更可靠和确定性的VLMs迈出了关键一步，因为这些错误可能导致灾难性后果。

View on arXiv Download PDF AI Translation

cs.AI / 43 / 2605.19671

Transforming Constraint Programs to Input for Local Search

将约束程序转化为局部搜索的输入

Devriendt, Jo, De Causmaecker, Patrick, Denecker, Marc

Abstract

Applying local search algorithms to combinatorial optimization problems is not an easy feat. Typically, human intervention is required to compile the constraints to input data for some metaheuristic algorithm. In this paper, we establish a link between symmetry properties of constraint optimization problems and local search neighborhoods, and we use this link to automatically generate neighborhoods from a constraint specification in the context of the IDP system. We evaluate the obtained neighborhoods for six classical optimization problems. The resulting observations support the viability of this technique.

Chinese Translation

将局部搜索算法应用于组合优化问题并非易事。通常，需要人工干预将约束编译为某些元启发式算法的输入数据。本文建立了约束优化问题的对称性特性与局部搜索邻域之间的联系，并利用这一联系在IDP系统的背景下自动生成邻域。我们对六个经典优化问题获得的邻域进行了评估。结果观察支持了该技术的可行性。

View on arXiv Download PDF AI Translation

cs.AI / 44 / 2605.19674

Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

超越理性幻觉：行为现实的战略分类

Lv, Xinpeng, Mao, Yunxin, Xu, Renzhe, Zheng, Chunyuan, Chen, Yikai, Li, Haoxuan, Shi, Yang, Yang, Jinxuan, Lin, Zhouchen, Chen, Yuanlong, Zhang, Yuanxing, Yang, Shaowu, Yang, Wenjing, Wang, Haotian

Abstract

Strategic classification(SC) studies the interaction between decision models and agents who strategically manipulate their features for favorable outcomes. Existing SC frameworks typically rely on the idealized assumption that agents are strictly rational. However, evidence from behavioral economics and psychology consistently shows that real-world decision-making is often shaped by cognitive biases, deviating from pure rationality. To formalize this limitation, we identify and define a new problem setting, termed the behaviorally realistic strategic classification problem, where agents' strategic manipulations deviate from full rationality due to psychological biases. Motivated by the identified limitation, we propose the Prospect-Guided Strategic Framework (Pro-SF) to address the problem, a principled framework grounded in prospect theory to model and learn under behaviorally realistic strategic responses. Specifically, to capture behaviorally realistic strategic manipulations, our framework reformulates the Stackelberg-style interaction between agents and the decision-maker by incorporating three key mechanisms inspired by prospect theory, including the asymmetry between benefits and costs, different subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets establish Pro-SF as a behaviorally grounded approach to strategic classification, bridging machine learning and behavioral economics for more reliable deployment in the real world.

Chinese Translation

战略分类（SC）研究决策模型与代理人之间的互动，代理人为了获得有利结果而战略性地操控其特征。现有的SC框架通常依赖于代理人严格理性的理想化假设。然而，行为经济学和心理学的证据持续表明，现实世界的决策往往受到认知偏见的影响，偏离了纯粹理性的状态。为了形式化这一局限性，我们识别并定义了一个新的问题设置，称为行为现实的战略分类问题，在这一问题中，代理人的战略操控由于心理偏见而偏离了完全理性。基于这一识别出的局限性，我们提出了前景引导战略框架（Prospect-Guided Strategic Framework，Pro-SF）来解决该问题，这是一个基于前景理论的原则性框架，用于建模和学习行为现实的战略响应。具体而言，为了捕捉行为现实的战略操控，我们的框架通过引入三个受前景理论启发的关键机制，重新构建了代理人与决策者之间的斯塔克尔伯格式互动，包括利益与成本之间的不对称性、不同的主观参考点以及非理性的概率扭曲。对合成数据集和真实世界数据集的实验验证了Pro-SF作为一种行为基础的战略分类方法，架起了机器学习与行为经济学之间的桥梁，以便在现实世界中实现更可靠的应用。

View on arXiv Download PDF AI Translation

cs.AI / 45 / 2605.19721

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

投影潜在强化学习动作：迈向可泛化和可扩展的图组合优化

Terranova, Franco, Bernardez, Guillermo, Cabellos-Aparicio, Albert, Miolane, Nina, Lahmadi, Abdelkader

Abstract

Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.

Chinese Translation

图组合优化（GCO）引起了越来越多的关注，因为许多 NP-hard 问题自然可以用图来表述，但其组合爆炸使得精确方法在计算上不可行。最近，强化学习（RL）与图神经网络（GNNs）的结合显著提升了基于学习的 GCO 求解器的性能。然而，现有方法在不同图实例的泛化能力和随着动作空间增长的计算可扩展性方面面临限制。为了解决这两个挑战，我们引入了投影代理（projection agents），这是一种新颖的 RL-GCO 方法，直接在基于 GNN 的连续动作嵌入空间中操作，在单次前向传递中预测所需的潜在动作，并随后将其解码为有效的离散动作。此外，我们通过为观察和动作提供共享嵌入空间，使得不同 RL 方法之间的比较更加公平。在多种基准测试中，我们的方法在推理速度上提高了最多 16.2 倍，泛化能力比现有解决方案提高了最多 40%，而只使用简单的最近邻解码，同时为在具有多个相互依赖变量的超线性决策空间中实现强大的 RL 性能打开了大门。最后，我们发布了 LaGCO-RL，这是一个 Python 库，自动化潜在动作空间的构建，并支持现有的 RL-GCO 解决方案，促进可重复性和对新 GCO 基准的适应。

View on arXiv Download PDF AI Translation

cs.AI / 46 / 2605.19743

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

EngiAI：一个基于大型语言模型驱动的工程设计的多智能体框架和基准套件

Molinari, Gioele, Felten, Florian, Massoudi, Soheyl, Fuge, Mark

Abstract

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands-including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) an High Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores ($\approx 1.0$) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.

Chinese Translation

大型语言模型（LLM）智能体在工程设计任务中的应用日益增多，但现有的评估框架未能充分解决结合仿真、检索和制造准备的多智能体系统。我们介绍了一个基准套件，包含三个评估维度：（1）一个工作流程基准，具有七种提示风格，针对不同的认知需求，包括直接工具使用、语义消歧、条件分支和工作记忆任务；（2）一个检索增强生成（RAG）基准，采用门控评分，隔离检索对参数选择的贡献；（3）一个高性能计算（HPC）基准，评估在SLURM集群上端到端机器学习训练的编排。除了基准，我们还展示了EngiAI，一个基于LangGraph的多智能体系统（MAS）参考实现，通过监督架构协调七个专业智能体，从而实现拓扑优化、文档检索、高性能计算作业编排和3D打印机控制。在四个LLM后端和两个EngiBench问题中，专有模型在Beams2D上实现了96-97%的平均任务完成率，而开源的4B参数模型达到了55-78%，并显示出明显的代际改进。条件分支被证明是最具挑战性的，Photonic2D的条件风格任务完成率降至20-53%。RAG门控确认了接近完美的检索增强得分（约1.0），而没有检索时得分接近零，验证了评估设计。在HPC编排方面，一个模型在100%的运行中完成了所有管道步骤，而另一个模型降至50%，揭示了在长时间运行的工作流程中，多步骤指令的执行效果下降。

View on arXiv Download PDF AI Translation

cs.AI / 47 / 2605.19748

Memory-Augmented Reinforcement Learning Agent for CAD Generation

用于计算机辅助设计生成的记忆增强强化学习代理

Xiaolong, Yin, Yu, Liu, Jiahang, Shen, Xingyu, Lu, Jingzhe, Ni, Fengxiao, Fan, Fan, Sang

Abstract

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex CAD models characterized by long operation sequences, diverse operation types, and strong geometric constraints, primarily because reasoning chains break and effective error-correction mechanisms are lacking. To address this problem, this paper proposes a memory-augmented reinforcement learning framework for CAD generation agents. The framework encapsulates the underlying geometric kernel into a structured toolchain callable by the agent and builds a closed-loop mechanism of design intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, and proposes a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent can effectively avoid retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. Experiments show that the proposed method significantly improves both the success rate and geometric consistency on complex CAD model generation tasks.

Chinese Translation

计算机辅助设计（CAD）模型的自动生成是实现先进制造智能化的核心技术。现有基于大型语言模型（LLMs）的生成方法在处理复杂的CAD模型时往往表现不佳，这些模型具有长操作序列、多样的操作类型和强几何约束，主要原因在于推理链断裂和缺乏有效的错误纠正机制。为了解决这一问题，本文提出了一种用于CAD生成代理的记忆增强强化学习框架。该框架将底层几何内核封装成一个结构化的工具链，供代理调用，并建立了设计意图理解、全局规划、执行和多维验证的闭环机制。它还设计了一个由案例库和技能库组成的双轨记忆模块，并提出了一种动态效用检索算法。通过将强化学习引入检索和策略优化，代理能够有效避免在语义上相似但在几何上不可行的检索陷阱，实现在线自我纠正和持续演化，而无需额外的大规模标注数据。实验表明，所提出的方法显著提高了复杂CAD模型生成任务的成功率和几何一致性。

View on arXiv Download PDF AI Translation

cs.AI / 48 / 2605.19758

CogScale: Scalable Benchmark for Sequence Processing

CogScale：可扩展的序列处理基准

Bendi-Ouis, Yannis, de Coudenhove, Romain, Hinaut, Xavier

Abstract

The ability to maintain and manipulate information over time is a fundamental aspect of living beings and Artificial Intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1k, 10k, and 100k) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.

Chinese Translation

维持和操控信息的能力是生物体和人工智能的一个基本特征。尽管现代模型在自然语言处理等任务中取得了显著成功，但评估新架构处理序列信息的能力仍然在计算上代价高昂且耗时。测试新架构通常需要扩展到大规模数据集和模型，这导致巨大的计算成本和缓慢的迭代周期。本文提出了CogScale，一个包含14个可扩展合成任务的基准，旨在在不同可参数化规模下隔离和评估特定的认知和记忆能力。通过提供一个标准化、轻量级的框架，CogScale使研究人员能够在进行大规模训练之前快速验证架构创新。为了建立一个稳固的基线，我们评估了七种不同的架构：门控递归单元（Gated Recurrent Unit, GRU）、长短期记忆（Long Short-Term Memory, LSTM）、xLSTM、回声状态网络（Echo State Network, ESN）、Mamba、Transformer解码器和Transformer编码器-解码器。这些评估在严格的参数预算（1k、10k和100k）下以及不同的难度级别和规模中进行。我们的结果表明，尽管经典的递归神经网络（RNN）和回声状态网络在严格的参数预算下在基本保留任务中表现出色，但只有注意力机制和现代状态空间模型在推理复杂性和任务难度增加时能够持续保持高性能。

View on arXiv Download PDF AI Translation

cs.AI / 49 / 2605.19762

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

真正提升数学推理的因素：超越纯代码的结构化推理信号

Zhao, Yuze, Fang, Junpeng, Yu, Lu, Huang, Zhenya, Zhang, Kai, Cui, Qing, Liu, Qi, Zhou, Jun, Chen, Enhong

Abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

Chinese Translation

代码已成为现代基础语言模型（LM）训练的标准组成部分，但其在编程之外的作用仍不明确。我们通过对一个10万亿标记的语料库进行控制预训练实验，重新审视了代码改善推理的说法，且该语料库具有细致的领域分离。我们的发现有三方面。首先，当代码仅限于独立可执行程序并控制Code-NL数据时，代码显著提高了编程能力，但并未作为一般推理增强器；相反，它与知识密集型任务竞争，尤其是复杂的数学推理。其次，通常归因于代码的推理提升，更好地通过跨领域结构化推理痕迹来解释，例如代码-文本和数学-文本混合，而不仅仅是可执行代码。第三，在固定的数学预算内增加结构化数学领域样本的密度，可以在保持编程性能的同时，在困难的数学推理上获得显著提升，这表明认知支架提供了一种有针对性的方式来缓解跨领域的权衡。最后，路由分析显示，数据组合效应反映在专家激活模式中，为跨领域的竞争和协同互动提供了机制层面的证据。我们的结果澄清了哪些数据特征在能力维度之间转移，并指向更精确的数据中心优化策略。

View on arXiv Download PDF AI Translation

cs.AI / 50 / 2605.19765

GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

GroupAffect-4：四人协作互动的多模态数据集

Seikavandi, Meisam Jamshidi, Modica, Alice, Obara, Anna, Shaffi, Shan Ahmed, Narcizo, Fabricio Batista, Ignatenko, Tanya, Vucurevich, Ted, Haddad, Karim, Barratt, Daniel, Overholt, Daniel, Boldt, Jesper Bunsow, Burelli, Paolo, Dittberner, Andrew Burke

Abstract

Existing affective-computing, social-signal-processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co-located groups as a coupled individual, interpersonal, and group-level process. The required signals (per-participant physiology, eye movement, audio, self-report, task outcomes, and personality) are usually fragmented across separate dataset traditions. We introduce GroupAffect-4, a multimodal corpus of 40 participants in 10 four-person groups, each completing four ecologically varied collaborative tasks spanning information pooling, negotiation, idea generation, and a public-goods game. Each participant is instrumented with a wrist-worn physiology sensor, eye-tracking glasses, and a close-talk microphone; sessions include continuous affect self-reports, post-task questionnaires, task outcomes, and Big-Five personality scores, all time-aligned to a shared clock. The dataset covers over 91% of expected physiology windows and 98% of eye-tracking windows, with strong task validity confirmed by a clear affective manipulation check across the negotiation block. We define fifteen benchmarkable targets spanning three analysis levels -- within-person state, between-person traits, and group dynamics -- and report leave-one-group-out feasibility baselines establishing the dataset's evaluative scope. GroupAffect-4 is released with a BIDS-inspired structure, Croissant metadata, a datasheet, per-session quality reports, and open processing scripts. Code and processing scripts are available at https://github.com/meisamjam/GroupAffect-4; the dataset is publicly archived at https://zenodo.org/records/20037847.

Chinese Translation

现有的情感计算、社会信号处理和会议语料库捕捉了人类互动的重要部分，但它们很少支持将共处群体中的情感分析视为个体、人与人之间以及群体层面的耦合过程。所需的信号（每位参与者的生理数据、眼动、音频、自我报告、任务结果和个性）通常在不同的数据集传统中是碎片化的。我们介绍了GroupAffect-4，这是一个包含40名参与者的多模态语料库，分为10个四人小组，每个小组完成四项生态多样的协作任务，涵盖信息汇聚、谈判、创意生成和公共物品博弈。每位参与者都配备了腕部生理传感器、眼动追踪眼镜和近讲麦克风；会议包括连续的情感自我报告、任务后问卷、任务结果和五大人格评分，所有数据均与共享时钟时间对齐。该数据集覆盖了超过91%的预期生理窗口和98%的眼动追踪窗口，谈判阶段的情感操控检查确认了强大的任务有效性。我们定义了跨越三种分析层次的十五个基准目标——个体状态、个体特征和群体动态，并报告了留一组外的可行性基线，确立了数据集的评估范围。GroupAffect-4以BIDS启发的结构发布，包含Croissant元数据、数据表、每次会议的质量报告和开放处理脚本。代码和处理脚本可在https://github.com/meisamjam/GroupAffect-4获取；数据集已公开存档于https://zenodo.org/records/20037847。

View on arXiv Download PDF AI Translation

cs.AI / 51 / 2605.19768

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

针对多项式逻辑马尔可夫决策过程的极小极大最优方差感知后悔界限

Boudart, Pierre, Gaillard, Pierre, Rudi, Alessandro

Abstract

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar\sigma\_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\bar\sigma\_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar\sigma\_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{\Omega(dH^2\bar\sigma\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

Chinese Translation

我们研究了针对由多项式逻辑（MNL）模型建模的过往马尔可夫决策过程（MDPs）的强化学习。现有针对MNL混合MDPs的算法导致的后悔为$ ilde{O}(dH^2 ext{sqrt}(T))$（Li等，2024），其中$d$为特征维度，$H$为剧集长度，$T$为剧集数量。受到逻辑赌博文献（Abeille等，2021；Faury等，2022；Boudart等，2026）的启发，我们引入一个问题相关常数$ar{ extsigma}_T ext{leq} 1/2$，用于测量学习者轨迹上最优下游价值函数的归一化平均方差。我们提出了一种算法，其后悔为$ ilde{O}(dH^2ar{ extsigma}_T ext{sqrt}(T))$，在最坏情况下恢复了现有界限，并在结构化MDPs中有所改进。例如，对于KL约束的稳健MDPs，$ar{ extsigma}_T = O(H^{-1})$，将时间依赖性减少了一个因子$H$。我们进一步建立了匹配的$ ext{Omega}(dH^2ar{ extsigma}_T ext{sqrt}(T))$下界，证明了极小极大最优性（至对数因子）并首次完全表征了MNL混合MDPs的后悔复杂性。

View on arXiv Download PDF AI Translation

cs.AI / 52 / 2605.19769

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

OpenComputer：用于计算机使用代理的可验证软件世界

Wei, Jinbiao, Ma, Qianran, Zhao, Yilun, Zhou, Xiao, Ni, Kangqi, Gan, Guo, Cohan, Arman

Abstract

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

Chinese Translation

我们提出了OpenComputer，这是一个基于验证器的框架，用于构建可验证的软件世界，以供计算机使用代理使用。OpenComputer集成了四个组件：（1）应用特定的状态验证器，提供对真实应用程序的结构化检查端点；（2）一个自我演化的验证层，通过基于执行的反馈提高验证器的可靠性；（3）一个任务生成管道，合成现实且可机器检查的桌面任务；（4）一个评估工具，记录完整的轨迹并计算可审计的部分信用奖励。在当前形式下，OpenComputer覆盖了33个桌面应用程序和1,000个最终任务，涵盖浏览器、办公工具、创意软件、开发环境、文件管理器和通信应用程序。实验表明，OpenComputer的硬编码验证器与人类裁决的对齐程度高于基于大语言模型（LLM）作为裁判的评估，尤其是在成功依赖于细粒度应用状态时。尽管部分进展，前沿代理在端到端完成方面仍然面临困难，而开源模型在其OSWorld验证分数上表现出明显下降，暴露出计算机自动化中的持续差距。

View on arXiv Download PDF AI Translation

cs.AI / 53 / 2605.19779

Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

无分布假设的连续人工智能代理评估的不确定性量化

Gao, Yuxuan, Wang, Megan, Yu, Yi Ling

Abstract

We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations rho in [-0.5, 0.9]), a conformal abstention rule for pairwise rankings with controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and that cross-source sentiment divergence predicts ranking instability (r=0.64, p<0.01). A circularity-controlled validation confirms the framework captures signal beyond benchmarks (rho_s=0.52, p<0.01, n=35). Code and data are released under CC BY 4.0.

Chinese Translation

我们将分裂的保形预测和自适应保形推断（Adaptive Conformal Inference, ACI）应用于连续人工智能代理评估，为预测的质量评分提供无分布假设的覆盖保证。在24小时的预测范围内，保形区间的校准误差在所有名义水平下均低于0.02，而ACI在代理发布后正确地将区间扩大了35%，然后重新收敛。我们进一步为多代理管道开发了组合不确定性界限（通过在[-0.5, 0.9]范围内的阶段间相关性进行模拟验证），提出了一种对成对排名的保形弃权规则，以控制错误排名率，并为排行榜规模的多重检验提供了FDR校正的弃权。通过每小时收集的18个实时信号评估50个代理，我们显示每个代理的条件覆盖率在名义水平附近高度集中（平均80.4%，90%的代理在[72%，90%]范围内），并且跨源情感差异预测排名不稳定性（r=0.64, p<0.01）。一个控制循环性的验证确认该框架捕捉到了超越基准的信号（rho_s=0.52, p<0.01, n=35）。代码和数据在CC BY 4.0许可下发布。

View on arXiv Download PDF AI Translation

cs.AI / 54 / 2605.19781

From SGD to Muon: Adaptive Optimization via Schatten-p Norms

从 SGD 到 Muon：通过 Schatten-p 范数的自适应优化

Massena, Thomas, Friedrich, Corentin, Serrurier, Mathieu

Abstract

Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by-design or empirically, which are not necessarily optimal according to the problem's geometry. We introduce a novel efficient datadriven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim$ 3% runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the performance of the best performing optimizer between Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.

Chinese Translation

现代优化器，如 Muon，在其更新中施加了矩阵级几何约束。这些矩阵级约束可以在线性最小化oracle（LMO）理论下统一。然而，当前所有方法对更新规则施加的 LMO 几何都是固定的，经过设计或经验选择的，这些几何不一定是根据问题的几何结构最优的。我们提出了一种新颖的高效数据驱动标准，用于动态选择个别深度神经网络层的代理最优更新 LMO 几何。该标准通过单步随机特征回归替代模型，从梯度和激活统计中以闭式形式推导而来，能够在从 SGD 到 Muon 更新的设计空间中进行导航。此外，集成参数级预处理使我们的框架能够恢复 SGD、Muon、Adam 和 MuAdam 作为特定的极值。为了使这种自适应方法具有可扩展性，我们将其与高效的计算策略相结合，仅在高度优化的基线中实现约 3% 的运行时开销。作为概念验证，我们展示了这种数据驱动的优化器在三种不同训练场景中超越或与 Muon 和 AdamW 之间表现最佳的优化器保持竞争力。最终，这项工作提供了证据，表明 LMO 几何可以成功且高效地从运行时数据中进行适应，为超越静态几何的优化器设计开辟了一条新路径。

View on arXiv Download PDF AI Translation

cs.AI / 55 / 2605.19782

Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

先验知识还是搜索？硬件感知代码优化中的LLM代理研究

Redko, Dmitry, Fazlyev, Albert, Sozykin, Konstantin, Ivanova, Maria, Burnaev, Evgeny, Shvetsov, Egor

Abstract

LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses via context conditioning on received feedback from an environment. However, as modern LLM agents are increasingly complex in their structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect, models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate with low-density language. Our results conclude that LLMs in code optimization tasks highly depend on pretrained priors rather than provided feedback or agentic structure.

Chinese Translation

LLM发现与优化系统在各个领域的应用日益增多，实施了一种通用的提出-评估-修订循环。这种优化或发现通过对环境反馈的上下文条件进行进展。然而，随着现代LLM代理在结构上变得越来越复杂，评估哪些组件贡献最大，以及这种探索何时以及如何可能失败变得困难。我们通过三个受控实验回答了这些问题。我们的发现：（1）在纯黑箱优化中，LLM表现为贪婪优化器。（2）在零样本内核生成中，提供明确的输入大小信息没有可测量的效果，模型无论大小或温度如何都收敛到相同的内核参数，仿佛大小指令是不可见的。此外，当被要求对不常见的内核大小进行内核优化时，性能急剧下降，无论使用何种语言。（3）在反馈循环内核优化中，CUDA在迭代反馈下单调提高，而TVM IR则积极下降，这表明当模型在低密度语言下操作时，内核优化会退化。我们的结果总结出，在代码优化任务中，LLM高度依赖于预训练的先验知识，而非提供的反馈或代理结构。

View on arXiv Download PDF AI Translation

cs.AI / 56 / 2605.19824

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

从提示到铺路：代理性场景到计划推理中的时间基础

Gado, Ahmed Y., Goba, Omar Y., Hassanein, Alaa, Elias, Catherine M., Hussein, Ahmed

Abstract

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

Chinese Translation

近期的研究尝试通过大型语言模型（LLMs）和大型多模态模型（LMMs）的集成来支持自主车辆（AVs）的高层场景解释和规划，但仍将时间视为次要属性。这种缺乏时间基础的处理导致在连续动作推理中出现不一致性，削弱了安全性和可解释性。本研究探讨了在代理间通信中引入时间条件是否能够保持或增强一致性，而不引入语义或逻辑一致性的降级。为此，我们引入了三种规划器架构，逐步增加时间整合，并在经过精心挑选的BDD-X数据集子集上进行评估，使用语义、句法和逻辑指标。结果表明，尽管时间条件重塑了推理风格，但在标准基于NLP的正确性指标上并未产生统计学显著的改善。然而，定性分析揭示了预测性危险推理、稳定的纠正行为和Sentinel中的战略性分歧。这些发现阐明了基于提示的时间基础的局限性，并建立了时间场景到计划推理的首个实证基准。

View on arXiv Download PDF AI Translation

cs.AI / 57 / 2605.19826

Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

可解释的废水数字双胞胎：具有自我否定决策支持的自适应上下文条件结构模拟器

Simethy, Gary, Arroyo, Daniel Ortiz, Durdevic, Petar

Abstract

Operators of safety-critical industrial processes increasingly rely on digital twins to screen control interventions, but such simulators rarely carry certified safety guarantees. Wastewater treatment plants exemplify the gap: operators face a daily safety-efficiency trade-off where aerating too little risks effluent violations and nitrous-oxide (N2O) spikes, and aerating too much wastes energy. We develop an explainable digital twin for aeration and dosing setpoints. CCSS-IX, the simulator, is a bank of interpretable locally linear state-space "experts" adaptively mixed by a context-aware gating network, building on a continuous-time regime-switching scaffold. A runtime decision layer applies conformal risk control to abstain, reopen, or return a falsifying temporal witness for any operator-proposed action that cannot be statistically certified. The artificial-intelligence contribution is twofold: an identifiable, context-conditioned structured surrogate that retains operator-readable dynamics, and a self-falsifying decision rule with finite-sample coverage guarantees. The engineering contribution is a validated, end-to-end decision-support pipeline, tested on a 1000-step slice of the Aved{\o}re full-scale plant (42.6% sensor missingness, 2-minute sampling), the Agtrup/BlueKolding full-scale plant in Denmark, and the Benchmark Simulation Model No. 2 (BSM2) international benchmark, under a matched ten-seed protocol. The static structured ensemble lies within 0.78% root-mean-square error of an unconstrained black-box reference, and the adaptive variant within 1.08%. The calibrated reopen rule cuts aggregate two-plant regret by 43.6% at an unsafe-action cost weight of 4 and eliminates unsafe chosen actions on the BSM2 main slice. Event-aligned temporal witnesses prevent 93 of 187 false-safe N2O approvals, about 4.65x the dyadic baseline (paired McNemar p < 1e-21).

Chinese Translation

安全关键工业过程的操作员越来越依赖数字双胞胎来筛选控制干预措施，但此类模拟器很少具备认证的安全保证。废水处理厂便是这一差距的典型例子：操作员每天面临安全与效率的权衡，通风不足可能导致排放违规和一氧化二氮（N2O）峰值，而通风过多则浪费能源。我们开发了一种用于通风和投药设定点的可解释数字双胞胎。CCSS-IX，该模拟器，是一个可解释的局部线性状态空间“专家”库，通过上下文感知的门控网络自适应混合，基于连续时间的状态切换框架。运行时决策层应用符合风险控制，以对任何无法统计认证的操作员提议的行动进行弃权、重新开放或返回一个否定的时间见证。人工智能的贡献有两个方面：一个可识别的、上下文条件的结构代理，保留了操作员可读的动态，以及一个具有有限样本覆盖保证的自我否定决策规则。工程贡献是一个经过验证的端到端决策支持管道，经过在Avedøre全规模工厂（42.6%的传感器缺失，2分钟采样）、丹麦的Agtrup/BlueKolding全规模工厂以及国际基准Benchmark Simulation Model No. 2 (BSM2)上的1000步切片测试，采用匹配的十种种子协议。静态结构集成的均方根误差在0.78%以内，相对于一个不受约束的黑箱参考，而自适应变体在1.08%以内。经过校准的重新开放规则在不安全行动成本权重为4时将两家工厂的总后悔减少了43.6%，并在BSM2主切片上消除了不安全的选择行动。事件对齐的时间见证防止了187次虚假安全的N2O批准中的93次，大约是二元基线的4.65倍（配对McNemar p < 1e-21）。

View on arXiv Download PDF AI Translation

cs.AI / 58 / 2605.19895

Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

通过 CNN 模式识别在枚举解上简化约束推理

Spracklen, Patrick

Abstract

Constraint programming practitioners accelerate hard problems through a layered set of techniques applied in order of risk. Standard hardening (symmetry-breaking and implied constraints) is applied first and preserves satisfiability. Streamliner constraints, which restrict search to a structural sub-family of solutions, do not preserve satisfiability and are reserved as a final lever. Existing automated streamliner-synthesis approaches either search a constraint grammar or prompt a Large Language Model directly on the problem model. We propose a different approach: enumerate feasible solutions, train a Convolutional Neural Network contrastively against perturbed non-solutions to detect structural patterns, and translate the CNN's discriminative signal into candidate MiniZinc streamliners through LLM-driven synthesis. The CNN grounds the LLM's constraint generation in observed solution structure rather than model text alone. We evaluate on hardened benchmark models where streamliner discovery is the residual performance lever. Our pipeline achieves 98.8% portfolio time reduction on hardened Vessel Loading, 98.6% on hardened Social Golfers, and 89.4% on Black Hole, with best-single streamliners reaching geometric-mean speedups of 932x, 356x, and 1103x respectively. Discovered streamliners include class-based packing constraints on Vessel Loading, beyond-hardening canonicalisations on Social Golfers, and layout-coordinate bounds on Black Hole.

Chinese Translation

约束编程实践者通过一系列按风险顺序应用的分层技术来加速困难问题。标准硬化（对称打破和隐含约束）首先应用，并保持可满足性。流线型约束限制搜索在解决方案的结构子族中，不保持可满足性，因此作为最后的手段保留。现有的自动流线型合成方法要么搜索约束文法，要么直接在问题模型上提示大型语言模型（Large Language Model, LLM）。我们提出了一种不同的方法：枚举可行解，训练卷积神经网络（Convolutional Neural Network, CNN）与扰动的非解进行对比，以检测结构模式，并通过 LLM 驱动的合成将 CNN 的判别信号转化为候选 MiniZinc 流线型约束。CNN 将 LLM 的约束生成基于观察到的解决方案结构，而不仅仅是模型文本。我们在硬化基准模型上进行评估，其中流线型发现是剩余性能杠杆。我们的流程在硬化的船舶装载问题上实现了 98.8%的时间减少，在硬化的社交高尔夫问题上实现了 98.6%，在黑洞问题上实现了 89.4%，最佳单个流线型约束分别达到几何平均加速比 932 倍、356 倍和 1103 倍。发现的流线型约束包括船舶装载上的基于类别的打包约束、社交高尔夫上的超硬化规范化，以及黑洞上的布局坐标界限。

View on arXiv Download PDF AI Translation

cs.AI / 59 / 2605.19932

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

PEEK：作为长上下文 LLM 代理的方向缓存的上下文图

Gu, Zhuohan, Zhang, Qizheng, Khattab, Omar, Madden, Samuel

Abstract

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.

Chinese Translation

大型语言模型（LLM）代理越来越多地在长且重复的外部上下文中操作，例如文档语料库和代码库。在多次调用中，现有的方法要么保留代理的轨迹，要么被动访问原始材料，或者采用任务级策略。它们都没有保留我们认为在重复相同上下文工作负载中最需要的：关于重复上下文本身的可重用方向知识（例如，上下文包含什么，如何组织，以及哪些实体、常量和模式在历史上是有用的）。我们引入了 PEEK，一个将这种方向知识缓存和维护为上下文图的系统：在代理提示中的一个小型、恒定大小的工件，使其能够持久地窥视外部上下文。该图由一个可编程缓存策略维护，包含三个模块：一个提取可转移知识的提炼器（Distiller），一个将其转换为结构化编辑的制图师（Cartographer），以及一个执行固定令牌预算的基于优先级的驱逐器（Evictor）。在长上下文推理和信息聚合方面，PEEK 在使用比最先进的提示学习框架 ACE 少 93-145 次迭代且成本降低 1.7-5.8 倍的情况下，提升了 6.3-34.0% 的性能。在上下文学习方面，PEEK 的解决率和评分准确性分别提高了 6.0-14.0% 和 7.8-12.1%，且成本比 ACE 低 1.4 倍。这些增益在包括 OpenAI Codex（一个生产级编码代理）在内的 LMs 和代理架构中具有普遍性。这些结果表明，上下文图有助于长上下文 LLM 代理更准确和高效地与重复的外部上下文进行交互。

View on arXiv Download PDF AI Translation

cs.AI / 60 / 2605.19940

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

受机器人启发的基础模型在社会敏感领域的保护措施

Ramnauth, Rebecca, Brscic, Drazen, Scassellati, Brian

Abstract

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.

Chinese Translation

基础模型越来越多地应用于教育、心理健康和护理等社会敏感领域，这些领域中的失败往往是累积性的且依赖于上下文。现有的保护措施方法——从训练时的对齐到提示、解码约束和事后审核——主要提供经验风险降低，而非可执行的行为保证，并且在很大程度上将安全性视为单个输出的属性，而非交互轨迹。我们将保护措施重新定义为对交互轨迹的运行时行为控制问题，借鉴机器人技术引入不确定闭环系统中约束执行的形式构造。我们在“基础观察者”（Grounded Observer）框架中实例化这些思想，并将其应用于三个现实世界的部署场景：闲聊、居家自闭症治疗和学校中的行为去激化。在这些场景中，该框架使得运行时干预成为可能，从而减轻了向不良交互模式的漂移，同时适应多样的社会背景。我们讨论了对该框架的扩展，并提出了朝向更强保证的研究方向。

View on arXiv Download PDF AI Translation

cs.AI / 61 / 2605.19943

Probabilistic Tiny Recursive Model

概率性微型递归模型

Sghaier, Amin, Parviz, Ali, Jolicoeur-Martineau, Alexia

Abstract

Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of the parameters of modern large language models (LLMs) by iteratively refining a latent state and final answer. While powerful, their deterministic recursion can lead to convergence at suboptimal solutions, without escape mechanism. A common workaround relies on task-specific input perturbations at test time combined with answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic framework for test-time compute scaling that addresses this limitation through stochastic exploration. PTRM injects Gaussian noise at each deep recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head (used for early stopping in the original TRM). Without requiring retraining or task-specific augmentations, PTRM enables substantial accuracy gains across benchmarks, including Sudoku-Extreme (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.

Chinese Translation

微型递归模型（Tiny Recursive Models, TRM）通过迭代优化潜在状态和最终答案，以现代大型语言模型（Large Language Models, LLMs）参数的极小部分解决复杂推理任务。尽管功能强大，但其确定性递归可能导致收敛到次优解，且缺乏逃逸机制。一个常见的解决方法是在测试时依赖于特定任务的输入扰动，并结合通过投票进行答案聚合。我们提出了概率性微型递归模型（Probabilistic TRM, PTRM），这是一个与任务无关的测试时计算扩展框架，通过随机探索来解决这一限制。PTRM在每个深度递归步骤中注入高斯噪声，使得并行轨迹能够探索多样的解空间，并利用模型现有的Q头（在原始TRM中用于早期停止）进行选择。在不需要重新训练或特定任务增强的情况下，PTRM在多个基准测试中实现了显著的准确性提升，包括数独极限（Sudoku-Extreme，准确率从87.4%提升至98.75%）以及来自铅笔拼图基准（Pencil Puzzle Bench）的各种难题（准确率从62.6%提升至91.2%）。在后者中，PTRM的准确率几乎是前沿LLMs的两倍（91.2%对比55.1%），且成本不到0.0001倍，仅使用700万参数。

View on arXiv Download PDF AI Translation

cs.AI / 62 / 2605.20006

GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

GeoX：通过自我对弈和可验证奖励掌握地理空间推理

Ahn, Kyeongjin, Lee, Seungeon, Gummadi, Krishna P., Cha, Meeyoung

Abstract

Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes-abduction, deduction, and induction-over spatial primitives and an image understanding tool. A verifier executes each program to covert a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated data. Along-side the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.

Chinese Translation

地理空间推理需要在场景复杂的空间结构上解决基于图像的问题。然而，开发这一能力受到注释庞大且组合性问题空间的成本限制。我们提出了GeoX，一个自我对弈框架，通过可执行程序获取空间逻辑，这些程序产生可验证的奖励，而不依赖于大规模的人为策划数据。给定一幅卫星或航空图像，我们的框架采用单一的多模态策略，将空间问题作为可执行程序提出，并在三种推理模式下进行求解——溯因、演绎和归纳——针对空间原语和图像理解工具。验证器执行每个程序以转换奖励信号，通过强化学习共同优化这两个角色。GeoX在基线视觉语言模型（VLMs）上平均提高了多达5.5个点，匹配或超过了在数百万条策划数据上训练的传统基线。除了提出的方法，我们还发布了通过自我对弈积累的地理空间理解基准。

View on arXiv Download PDF AI Translation

cs.AI / 63 / 2605.20023

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

当技能无助时：关于工具基础代理在进攻性网络安全中程序性知识的负面结果

Chacko, Samuel Jacob, Hugglestone, James, Islam, Chashi Mahiul, Liu, Xiuwen

Abstract

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2~percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for \emph{when} Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1{,}478, 1{,}976, and 4{,}147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9~pp ($p = 0.71$, $\chi^2$; $p = 0.25$, Cochran--Armitage trend test; five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold). We argue that the missing variable is \emph{environment-feedback bandwidth}. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side-channel setting), actively degrades performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.

Chinese Translation

代理技能是结构化的程序性知识包，在推理时加载到大型语言模型（LLM）代理中，广泛报道在各个领域平均提高任务通过率16.2个百分点。然而，同样的基准显示出广泛的差异，在84个任务中有16个在引入技能后出现负增量。社区尚未明确阐述技能何时有助于任务完成，何时仅仅是冗余的负担。我们重新分析了一项最近发布的180次受控研究，该研究针对一个基于多代理控制程序（MCP）的自主夺旗（CTF）代理，在四种逐渐丰富的文档条件下（55、1,478、1,976和4,147行），并显示这些条件几乎完全对应于无技能、经验技能、策划技能和综合技能的消融实验。在进攻性网络安全这一现有技能基准未深入覆盖的领域，技能的边际效益几乎消失。无技能与全技能条件之间的差距仅为8.9个百分点（$p = 0.71$，$ ext{χ}^2$；$p = 0.25$，Cochran--Armitage趋势检验；六对比中有五个Cohen的$h$值低于$0.2$的小效应阈值）。我们认为缺失的变量是环境反馈带宽。当代理的工具层返回严格的、模式验证的、低延迟的观察时，环境本身提供了通常需要技能提供的程序性修正信号。因此，策划技能的边际效益显著降低，在某些情况下（例如，我们的时间侧信道设置）甚至会主动降低性能。我们阐明了一个可证伪的假设，勾勒出其对复合人工智能系统的设计影响，并将发布重新分析管道以支持复制研究。

View on arXiv Download PDF AI Translation

cs.AI / 64 / 2605.20025

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

AutoResearchClaw：人机协作的自我强化自主研究

Liu, Jiaqi, Qiu, Shi, Li, Mairui, Li, Bingzhou, Ji, Haonian, Han, Siwei, Ye, Xinyu, Xia, Peng, Dong, Zihan, Zhang, Congyu, Zhang, Letian, Chen, Guiming, Tu, Haoqin, Yang, Xinyu, Feng, Lu, Zhao, Xujiang, Chen, Haifeng, Zhou, Jiawei, Wang, Xiao, Zhang, Weitong, Zhu, Hongtu, Li, Yun, Mei, Jieru, Fei, Hongliang, Zhang, Jiaheng, Li, Linjie, Zhang, Linjun, Zhou, Yuyin, Wang, Sheng, Xiong, Caiming, Zou, James, Zheng, Zeyu, Xie, Cihang, Ding, Mingyu, Yao, Huaxiu

Abstract

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a \textsc{Pivot}/\textsc{Refine} decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

Chinese Translation

自动化科学发现不仅仅是从想法生成论文。真正的研究是迭代的：假设从多个角度受到挑战，实验失败并为下一次尝试提供信息，经验在循环中不断积累。现有的自主研究系统通常将这一过程建模为线性流程：它们依赖于单一代理的推理，当执行失败时停止，并且不在运行之间传递经验。我们提出了AutoResearchClaw，一个基于五种机制的多代理自主研究流程：用于假设生成和结果分析的结构化多代理辩论，一个具有 extsc{Pivot}/ extsc{Refine}决策循环的自我修复执行器，将失败转化为信息，可验证的结果报告以防止伪造数据和虚构引用，涵盖从完全自主到逐步监督的七种干预模式的人机协作，以及将过去错误转化为未来保障的跨运行演化。在ARC-Bench，一个包含25个主题实验阶段的基准测试中，AutoResearchClaw的表现比AI Scientist v2高出54.7%。对七种干预模式进行的人机协作消融实验表明，在高杠杆决策点进行精确、针对性的协作始终优于完全自主和详尽的逐步监督。我们将AutoResearchClaw定位为一种研究放大器，增强而非取代人类科学判断。代码可在https://github.com/aiming-lab/AutoResearchClaw获取。

View on arXiv Download PDF AI Translation

cs.AI / 65 / 2605.20072

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

探究具身大语言模型：更高的观察保真度何以损害问题解决能力

Zenkri, Oussama, Brock, Oliver

Abstract

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.

Chinese Translation

大型语言模型越来越多地被提议作为机器人系统的认知组件，但其不透明的决策过程使得在闭环具身任务中解释成功或失败变得困难。遵循实证人工智能的方法论，我们通过改变可供代理使用的信息并测量由此引起的行为变化，行为性地研究具身大语言模型代理。在一个具有隐藏相互依赖关系的顺序机械难题——锁箱（Lockbox）中，我们在物理机器人设置下评估了大语言模型在RGB、RGB-D和真实符号观察下的表现，并使用受控仿真来探究结果行为。出人意料的是，代理在原始RGB输入下表现最佳，而在完美的真实观察下表现最差。在仿真中，我们通过随机翻转感知的行动结果来探究这一效应，发现适度的噪声能够提高性能，在40%的翻转概率下，成功率较无噪声基线提高了2.85倍。进一步分析将这一增益与重复行动循环的减少联系起来。这些发现表明，仅仅依靠成功率不足以评估大语言模型，因为测量的表现可能反映的是感知错误与推理失败之间的相互作用，而非稳健的问题解决能力。

View on arXiv Download PDF AI Translation

cs.AI / 66 / 2605.20098

Neurosymbolic Learning for Inference-Time Argumentation

推理时的神经符号学习与论证

Freedman, Gabriel, Dejl, Adam, Gould, Adam, Mansi, Chen, Lihu, Jiang, Jianqi, Toni, Francesca

Abstract

Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is incomplete or conflicting, uncertain answers may be more appropriate than binary true or false classifications. In all cases, faithful explanations of the considerations determining the final verdict are crucial. We introduce inference-time argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification in which a formal argumentation semantics giving the strength of claims is used both (i) to guide LLM training as models learn to generate arguments and assign them base scores (representing intrinsic strengths) and (ii) to compute ternary (true/false/uncertain) predictions from generated, scored arguments. As a result, at training time, argument generation and scoring can be optimised according to the quality of the induced argumentative predictions. Moreover, at inference time, the final prediction is faithful, by construction, to the arguments and scores determining the verdict, rather than being justified by a potentially unfaithful post-hoc reasoning trace as in conventional reasoning models. We finally show that, on two datasets for ternary claim verification, ITA improves upon argumentative baselines and can perform competitively against non-argumentative direct-prediction baselines, while providing verdicts that are computed deterministically from explicit, inspectable argumentative structures.

Chinese Translation

主张验证是高风险环境中一个重要的问题，包括健康和金融领域。当支撑主张的信息不完整或相互冲突时，不确定的答案可能比二元的真或假分类更为合适。在所有情况下，忠实于决定最终裁决的考虑因素的解释都是至关重要的。我们引入了推理时论证（Inference-Time Argumentation, ITA），这是一个可训练的神经符号框架，用于三元主张验证，其中使用一种形式的论证语义来表示主张的强度，既用于（i）指导大型语言模型（LLM）的训练，使模型学习生成论证并为其分配基础分数（代表内在强度），又用于（ii）从生成的、评分的论证中计算三元（真/假/不确定）预测。因此，在训练时，论证生成和评分可以根据诱导的论证预测的质量进行优化。此外，在推理时，最终的预测在构造上忠实于决定裁决的论证和分数，而不是像传统推理模型那样通过潜在的不忠实事后推理轨迹来证明。最后，我们展示了在两个三元主张验证的数据集上，ITA在论证基线之上有所改进，并且可以与非论证的直接预测基线进行竞争，同时提供从明确的、可检查的论证结构中确定性计算的裁决。

View on arXiv Download PDF AI Translation

cs.AI / 67 / 2605.20120

Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

使用 Aristotle API 在 Lean 4 中进行 AI 辅助定理证明：草蜢问题的形式化案例研究

Lau, Gabriel Rongyang

Abstract

AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry. The verified components establish that the final partial sum equals the total sum, that an adjacent transposition can affect only the relevant intermediate partial sum, that the changed partial sum has the expected form, and that maximality at a position admitting an adjacent successor swap forces a corresponding forbidden-set membership fact. The Aristotle output summary identifies the intended remaining mathematical step as the global counting step needed to show that these membership facts produce at least n distinct forbidden values, contradicting the cardinality assumption |M| < n; the Lean source itself does not reduce the main theorem to a separately encoded counting lemma. This case study gives an inspectable example of a central limitation in AI-assisted formalization, namely that local proof search can succeed while the global combinatorial bookkeeping required for a theorem remains unresolved. The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.

Chinese Translation

AI 辅助定理证明现在能够为奥林匹克级别的数学生成大量的 Lean 发展，但这些发展的证据状态取决于实际验证的声明。本文报告了一个 Lean 4 的形式化案例研究，涉及对草蜢问题（最初提出为 IMO 2009 问题 6）进行的 Aristotle API 证明尝试。生成的文档陈述了定理的一个广义 Lean 版本，包含四个经过验证的辅助引理，针对最大性和相邻交换策略的局部组件，并通过一个未解决的 sorry 直接留下了主要定理草蜢闭合。经过验证的组件确立了最终的部分和等于总和，相邻的置换仅能影响相关的中间部分和，改变的部分和具有预期的形式，并且在允许相邻后继交换的位置的最大性强制了相应的禁忌集合成员事实。Aristotle 输出摘要确定了所需的剩余数学步骤，即全球计数步骤，以显示这些成员事实产生至少 n 个不同的禁忌值，从而与基数假设 |M| < n 矛盾；Lean 源代码本身并未将主要定理简化为单独编码的计数引理。该案例研究提供了一个可检查的 AI 辅助形式化的中心限制示例，即局部证明搜索可以成功，而定理所需的全球组合记账仍未解决。本文贡献了一个可重复的 Lean 文档和对其经过验证和未验证的证明内容的精确分析。

View on arXiv Download PDF AI Translation

cs.AI / 68 / 2605.20164

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

并非每个评分标准都能平等地教学：面向策略的评分奖励用于可验证奖励的强化学习

Tyagi, Utkarsh, Guo, Xingang, Rezaei, MohammadHossein, George, Daniel, Mahmoud, Anas, Lee, Jackson, Liu, Bing, He, Yunzhong

Abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.

Chinese Translation

具有可验证奖励的强化学习在后期训练中变得非常有效，前提是正确性可以自动检查。然而，许多重要的模型行为需要同时满足多个定性标准。基于评分标准的奖励通过对特定提示的标准进行评分并将其聚合为标量奖励来解决这一问题。然而，标准的静态聚合将标准的人为重要性与其作为优化信号的当前有效性混为一谈。我们展示了这一假设在评分标准强化学习中是如何失效的：许多重要的标准已经饱和或当前无法达到，而区分不同输出的标准不一定是那些具有最大人类权重的标准。我们引入了POW3R，一个面向策略的评分奖励框架，旨在保留人类权重和类别平衡作为评分目标，同时在训练过程中调整标准级别的奖励权重。POW3R利用输出级别的对比来强调当前区分策略输出的标准，使得GRPO奖励更加信息丰富，而不改变基础评估目标。在涵盖多模态和仅文本设置的两个数据集上的三种基础策略中，POW3R在30个基础策略/指标比较中赢得了24个，提升了平均评分奖励和严格完成率（满足每个所需评分标准的响应比例），并在2.5到4倍更少的训练步骤中达到了相同的平稳状态。因此，评分奖励应区分最终答案中重要的内容与能够教导当前策略的内容。

View on arXiv Download PDF AI Translation

cs.AI / 69 / 2605.20167

HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

HaorFloodAlert：用于孟加拉国Haor湿地72小时洪水预测的去季节性机器学习集成模型

Koli, Salma Hoque Talukdar, Jely, Fahima Haque Talukder, Alim, Md. Samiul, Hossen, Md. Zakir

Abstract

Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km2). Temperature was acting as a seasonal cheat code - it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91 percent spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6 percent LOOCV accuracy, 87.5 percent recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.

Chinese Translation

孟加拉国Haor湿地的突发洪水几乎没有预警，严重影响年度boro稻米收成。目前的洪水预警系统主要针对河流洪水，完全忽视了回水动态。这些盆地地势平坦，水流行为与布拉马普特拉河截然不同。我们构建了HaorFloodAlert，一个去季节性机器学习集成模型，能够预测Sunamganj Haor（约8,000平方公里）72小时内的洪水概率。温度在此过程中起到了季节性作弊码的作用——仅因洪水发生在温暖月份，准确率提高了6.9个百分点。我们捕捉到了这一点。同时，我们还从阿萨姆邦的Silchar构建了上游Barak河的Sentinel-1 SAR代理，提供了约36小时的预警时间。Otsu阈值SAR变化检测的空间匹配验证率为84-91%。该操作集成模型（RF 0.5625 + XGBoost 0.4375）在77个真实的Sentinel-1事件中达到了89.6%的留一交叉验证（LOOCV）准确率、87.5%的召回率和0.943的AUC-ROC。此外，还包括一个三层警报管道和一个经过BRRI校准的boro稻米损害估算器。

View on arXiv Download PDF AI Translation

cs.AI / 70 / 2605.20173

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

用于选择和组合生产 LLM 代理运行时架构模式的方法论

Srinivasan, Vasundra

Abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.

Chinese Translation

生产 LLM 代理将随机模型输出与确定性软件系统结合在一起，但这两者之间的边界很少被视为一类重要的架构对象。本文将该边界称为随机-确定性边界（SDB）：这是一个由提议者、验证者、提交步骤和拒绝信号组成的四部分契约，规定了 LLM 输出如何转化为系统行为。我们认为 SDB 是生产代理运行时的承载原语。在这个原语周围，我们将代理运行时设计组织为三个关注点：协调、状态和控制。我们提出了六种运行时模式的目录，这些模式在对话式、自治式和长时间跨度的代理中以不同方式组合 SDB：层级委托、散点-聚合加事务、事件驱动序列、共享状态机、监督者加门控以及人机协作。对于每种模式，我们追溯其与分布式系统概念的渊源，并识别当工作者是随机时所发生的变化。本文贡献了一种五步方法论，用于选择运行时模式，一种将生产故障映射到模式弱点的诊断程序，以及一种称为重放偏差的故障模式，其中基于 LLM 的确定性事件日志消费者在模型版本或提示变化下产生不同的下游输出。一种风格化的可靠性分解将每次调用的模型方差与架构动量分离，推动了这样的主张：随着模型方差的减少，模式选择和 SDB 强度成为长期可靠性日益重要的杠杆。我们将该方法论应用于五个工作负载，并提供了一个可运行的参考实现，用于 90 天的合同续签代理。

View on arXiv Download PDF AI Translation

计算语言学 (Computation and Language)

cs.CL / 1 / 2605.19066

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

低资源自然语言处理评估中的标注稀缺悖论：十年的加速与新兴限制

Marivate, Vukosi

Abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014--present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the \emph{Annotation Scarcity Paradox}, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated ``ghost work'', and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses -- including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning -- and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

Chinese Translation

在过去十年中，低资源自然语言处理（NLP）经历了爆炸式增长，这一进展得益于跨语言迁移、大规模多语言模型以及基准测试的快速扩展。然而，这一表面上的进步掩盖了一个关键且未被充分审视的紧张关系：评估日益复杂的生成系统所需的深厚社会语言学专业知识严重紧缺，分布不均且结构边缘化。我们呈现了一项关于低资源NLP评估的批判性叙述性调查（2014年至今），追溯其在三个阶段的演变：早期的启发式乐观、由上而下的基准扩展的幻觉，以及当前的生成瓶颈时代。我们概念化了 extit{标注稀缺悖论}，即当技术能力的扩展远远超过真实评估所需的主权人力基础设施时所产生的结构性摩擦。通过考察提取性数据管道、报酬不足的“幽灵工作”和语言数据的突发现象，我们认为这一悖论威胁到报告进展的认识有效性。我们调查了新兴的应对措施，包括数据增强、基于模型的评估、参与式策展以及通过项目反应理论和主动学习实现的高效标注方法，并评估了它们在公平性和有效性上的权衡。最后，我们呼吁从业者采取行动，认为克服这一瓶颈需要从交易性数据提取转向以知识治理、数据主权和共享所有权为基础的关系型、社区嵌入式评估的范式转变。

View on arXiv Download PDF AI Translation

cs.CL / 2 / 2605.19069

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

商业自动语音识别系统在语言切换语音上的基准测试：阿拉伯语、波斯语和德语

Abdoli, Sajjad, Al-Sumaidaee, Ghassan, Taylor, Clayton W., Ahmad, ElShiekh, Rashad, Ahmed

Abstract

Code-switching -- the natural alternation between two languages within a single utterance -- represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR benchmarks predominantly evaluate clean, monolingual audio and report a single Word Error Rate (WER) figure that tells practitioners little about real-world multilingual performance. We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic--English, Saudi Arabic (Najdi/Hijazi)--English, Persian (Farsi)--English, and German--English. Each dataset comprises 300 samples selected by a two-stage pipeline: a heuristic filter scoring transcripts on five structural code-switching signals, followed by a GPT-4o and Gemini 1.5 Pro ensemble scoring candidates across six linguistic dimensions. This pipeline reduces LLM scoring costs by approximately 91\% relative to exhaustive scoring. We evaluate the systems on both WER and BERTScore, arguing that BERTScore is a more reliable metric for Arabic and Persian pairs where transliteration variance causes WER to penalise semantically correct transcriptions. ElevenLabs Scribe v2 achieves the lowest WER across all four language pairs (13.2% overall; 13.1% on Egyptian Arabic) and leads on BERTScore (0.936 overall). We further demonstrate that difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and that BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The benchmarking dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.

Chinese Translation

语言切换——在单一话语中自然交替使用两种语言——代表了自动语音识别（ASR）中最具挑战性和研究不足的条件之一。现有的商业ASR基准主要评估干净的单语音频，并报告一个单一的词错误率（WER）指标，这对从业者了解现实世界的多语言表现帮助不大。我们提出了一个基准，评估五个商业ASR提供商在四对语言上的表现：埃及阿拉伯语-英语、沙特阿拉伯语（Najdi/Hijazi）-英语、波斯语（Farsi）-英语和德语-英语。每个数据集包含300个样本，通过两阶段管道选择：首先是一个启发式过滤器，根据五个结构性语言切换信号对转录文本进行评分，随后是一个GPT-4o和Gemini 1.5 Pro的集成模型在六个语言维度上对候选项进行评分。该管道将大型语言模型（LLM）评分成本降低了约91%，相较于全面评分。我们在WER和BERTScore上评估这些系统，认为BERTScore是阿拉伯语和波斯语对中更可靠的指标，因为音译变异导致WER对语义正确的转录进行惩罚。ElevenLabs Scribe v2在所有四对语言中实现了最低的WER（总体13.2%；埃及阿拉伯语13.1%），并在BERTScore上领先（总体0.936）。我们进一步证明，按难度分层的分析揭示了被聚合平均值掩盖的性能差距，并且BERT嵌入投影确认了参考与假设之间的语义接近性，尽管在表面上存在书写差异。基准测试数据集可在https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch公开获取。

View on arXiv Download PDF AI Translation

cs.CL / 3 / 2605.19077

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

ReacTOD：用于零-shot对话状态跟踪的有界神经符号代理自然语言理解

Lin, Yanjun, Xiao, Zimo, Natarajan, Kartik, Sankaranarayanan, Mahesh, Nawanit, Niraj, Parashar, Rakshit, Zhang, Austin, Konaraddi, Karthik, Mote, Rishita, Niu, Wei

Abstract

Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.

Chinese Translation

面向任务的对话系统——处理交易、预订和服务请求——需要可预测的行为，然而，为了满足实际延迟需求而使用的中等规模大型语言模型（LLMs）容易出现幻觉和格式错误，这些错误会导致不正确的操作（例如，预订了错误日期的酒店）。我们提出了ReacTOD，一种有界神经符号架构，它将自然语言理解（NLU）重新构造为在由确定性验证控制的自我纠正ReAct循环中的离散工具调用。有界ReAct循环使得迭代自我纠正成为可能，在MultiWOZ上相较于单次推理提高了多达9.3个百分点的准确性。符号验证器在每次对话状态更新时强制执行操作合规性、模式一致性和指代一致性，成功实现了93.1%的自我纠正率，并生成结构化执行轨迹。增量状态预测和按需历史检索保持了提示的紧凑性，实证上提高了在参数受限模型中的指令遵循性。在MultiWOZ 2.1上，ReacTOD达到了新的零-shot最先进水平：gpt-oss-20B的联合目标准确率达到52.71%，比之前的最佳结果提高了14个百分点，而Qwen3-8B在仅有8B参数的情况下达到了47.34%。在Schema-Guided Dialogue (SGD)基准测试中，使用Claude-Opus-4.6的ReacTOD在完全端到端评估下的联合目标准确率达到了80.68%，而Qwen3-32B达到了64.09%——展示了在没有任务特定训练数据的情况下的跨基准泛化能力。

View on arXiv Download PDF AI Translation

cs.CL / 4 / 2605.19149

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

代理崩溃：通往地狱的道路是由有益的代理铺成的

Jha, Rishi, Triedman, Harold, Bhattacharya, Arkaprabha, Shmatikov, Vitaly

Abstract

Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call \emph{accidental meltdown}: unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7\% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.

Chinese Translation

在计算机和网络使用中，代理不可避免地会遇到错误：无法访问的网页、缺失的文件、本地和远程的配置错误等。这些错误并未阻碍基于最先进模型的代理。相反，它们会继续积极寻找完成任务的方法。我们引入、描述并测量了一种新型的代理失败，称之为 extit{意外崩溃}：在没有任何对抗性输入的情况下，对无害环境错误的反应所导致的不安全或有害行为。由于现有的可靠性或安全性基准未能捕捉到崩溃现象，我们开发了一种崩溃行为的分类法。随后，我们实施了一个与代理无关的基础设施，用于向实施环境中注入模拟的本地和远程错误，并利用该基础设施系统地评估由GPT、Grok和Gemini驱动的代理系统。我们的评估表明，在遇到模拟错误的64.7 ext{%}的代理实施中，发生了不同严重程度和成功率的崩溃（例如，进行未经授权的侦察或破坏访问控制），涵盖了所有代理系统、后端模型和错误类型的组合。在这些崩溃中，超过一半的不安全行为未向用户报告。通过比较同一代理在有错误和无错误情况下的行为，我们发现对错误的探索与不安全和有害行为之间存在相关性。

View on arXiv Download PDF AI Translation

cs.CL / 5 / 2605.19173

Prompting language influences diagnostic reasoning and accuracy of large language models

提示语言影响大型语言模型的诊断推理和准确性

Bazoge, Adrien, Corvellec, Josselin, Sid-Ahmed, Sofiane Djillali, Gourraud, Pierre-Antoine

Abstract

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

Chinese Translation

大型语言模型（LLMs）在临床决策支持中的应用越来越受到关注，但大多数评估是在英语环境下进行的，这使得它们在其他语言中的可靠性尚不确定。在此，我们通过比较五种大型语言模型（o3、DeepSeek-R1、GPT-4-Turbo、Llama-3.1-405B-Instruct 和 BioMistral-7B）在英语和法语中的表现，评估提示语言对诊断推理和最终诊断准确性的影响。共有180个涵盖16个医学专业的临床案例由两名医生使用18分制进行评估，评估内容包括诊断准确性和推理质量。五个模型中有四个在英语中的表现更佳（平均差异为0.37-0.91，调整后的p值<0.05），差距涉及推理的多个方面，包括鉴别诊断、逻辑结构和内部有效性。o3是唯一一个未显示出整体语言效应的模型。这些发现表明，提示语言仍然是大型语言模型临床表现的关键决定因素，对全球公平的语言文化部署具有重要影响。

View on arXiv Download PDF AI Translation

cs.CL / 6 / 2605.19194

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

MMoA：一种具有递归特性的记忆混合智能体的人工智能代理框架

Chu, Rui

Abstract

The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

Chinese Translation

混合智能体（MoA）框架在通过聚合多个智能体的输出以提高大型语言模型（LLM）性能方面显示出良好的前景。然而，现有的MoA系统通常依赖于静态路由器，未能充分捕捉聚合层之间的时间和上下文依赖性。为了解决这一局限性，我们提出了MMoA，一种递归的MoA架构，将基于LSTM的门控机制整合到智能体选择过程中。递归路由器根据当前输入和历史路由决策自适应地调节智能体的贡献，从而实现更具上下文感知的聚合。我们在标准的指令跟随基准测试上评估了MMoA，包括AlpacaEval 2.0、MT-Bench和Arena-Hard。结果表明，MMoA在准确性上与传统MoA相当，同时通过动态激活较少的智能体减少了计算开销。例如，在AlpacaEval 2.0上，MMoA的胜率为58.0%，而MoA为59.8%，同时运行效率提高了多达4.6%。这些结果表明，MMoA为自适应多智能体LLM系统提供了一种可扩展且高效的方法。

View on arXiv Download PDF AI Translation

cs.CL / 7 / 2605.19196

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

反思的时刻：我们能否信任大型语言模型（LLM）评审者作为基于证据的研究代理？

Wang, Leyao, He, Yanan, Chen, Peng, Yehudai, Asaf, Liu, Yixin, Ying, Rex, Shmueli-Scheuer, Michal, Cohan, Arman

Abstract

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

Chinese Translation

深度研究代理越来越多地自动化复杂的信息检索任务，通过多步骤推理、工具使用和综合生成基于证据的报告。它们日益增长的角色要求可扩展、可靠的评估，将LLM作为评审者定位为评估事实准确性、证据使用和推理质量的监督范式。然而，这些评审者在深度研究代理中的可靠性仍然不甚了解，提出了一个关键的元评估问题：在将LLM评审者部署到监督研究代理之前，我们必须首先评估评审者本身。现有的元评估存在两个不足之处：（1）依赖粗略的、主观的人类偏好一致性；（2）关注于遵循指令或可验证的任务，而未探索开放式代理执行。为了解决这些问题，我们引入了REFLECT（通过控制干预进行可靠的细粒度LLM评审者评估），这是一个针对代理环境中细粒度故障检测的元评估基准。REFLECT定义了一个详细的过程和结果级故障模式分类法，通过对质量筛选的代理执行轨迹进行控制和局部干预来实现。这产生了可验证的、全面的和细粒度的实例，用于验证评审模型。我们的实验表明，当前的LLM评审者仍然不可靠：即使是表现最好的模型在推理、工具使用和报告质量故障方面的整体准确率也低于55%，在证据验证方面的表现尤其差。我们的分类法和研究结果共同揭示了评审者的系统性局限性，揭示了成本和可靠性之间的权衡，并为构建更可靠的深度研究代理评估管道提供了可行的指导。

View on arXiv Download PDF AI Translation

cs.CL / 8 / 2605.19220

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

位置：大型语言模型中的不确定性量化仅仅是无监督聚类

Chen, Tiejin, Da, Longchao, Liu, Xiaoou, Wei, Hua

Abstract

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect ``confident hallucinations,'' where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.

Chinese Translation

不确定性量化（UQ）被广泛视为在高风险领域部署大型语言模型（LLMs）的主要保障。然而，我们认为该领域存在类别错误：主流的LLM不确定性量化方法实际上只是无监督聚类算法。我们证明了目前大多数方法本质上量化的是模型生成内容的内部一致性，而非其外部正确性。因此，当前的方法在根本上对事实现实视而不见，无法检测到“自信的幻觉”，即模型在稳定但错误的答案上表现出高度自信。因此，现有的不确定性量化方法在存在不确定性的情况下可能会产生一种虚假的安全感。在详细分析中，我们识别出由于对内部状态的依赖而导致的三种关键病态：使部署不安全的超参数敏感性危机、将稳定性与真相混淆的内部评估循环，以及迫使依赖不稳定代理指标来评估不确定性的根本缺乏真实标准。为了解决这一僵局，我们倡导对不确定性量化进行范式转变，并为研究社区提供了一条路线图，以采用更好的评估指标和设置，实施本土不确定性的机制变更，并将验证锚定在客观真相上，确保模型自信能够作为现实的可靠代理。

View on arXiv Download PDF AI Translation

cs.CL / 9 / 2605.19224

Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

在慢速 fMRI 上微调语言编码模型可改善快速 ECoG 的预测

Vaidya, Aditya R., Antonello, Richard J., Huth, Alexander G.

Abstract

Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure's generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that "slow" data like fMRI can be a valuable resource for building better models of "fast" brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.

Chinese Translation

神经科学家最近转向使用颅内脑记录方法，如皮层电图 (ECoG)，进行人类实验，因为这些方法提供了精细的空间和时间分辨率。然而，基于这些数据训练的模型在根本上受到可以接受植入物进行记录的患者群体的限制。我们建议使用非侵入性的功能性磁共振成像 (fMRI) 来弥补训练数据的缺口。通过在 fMRI 上微调的口语语言表征，我们构建了 ECoG 的编码模型。这些表征在 ECoG 中表现出更好的预测性能，尽管 fMRI 的时间分辨率低两个数量级。预测在频率带上得到了改善，超出了 fMRI 直接测量的范围。接下来，为了测试该程序的泛化能力，我们在时间下采样因子为 2 的 fMRI 响应上微调了模型。尽管分辨率有所损失，这些模型仍能够以与原始 fMRI 微调模型相当的水平预测 fMRI 和 ECoG 响应。最后，我们展示了 ECoG 性能随着 fMRI 微调数据量的增加而稳步提升。我们的结果表明，像 fMRI 这样的“慢速”数据可以成为构建更好“快速”脑数据模型（如 ECoG）的宝贵资源。在未来，整合多种记录方法可能进一步改善其他应用中的性能，如解码。

View on arXiv Download PDF AI Translation

cs.CL / 10 / 2605.19228

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

通过逐步置信度归因诊断黑箱大型语言模型中的多步推理失败

Liu, Xiaoou, Chen, Tiejin, Zhang, Dengjia, Wang, Yaqing, Cheng, Lu, Wei, Hua

Abstract

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5\% over answer-level feedback.

Chinese Translation

大型语言模型在具有客观答案的推理任务中通过生成逐步解决方案取得了强劲的表现，但诊断多步推理轨迹可能失败的地方仍然困难。置信度估计提供了一种诊断信号，但现有方法仅限于最终答案或需要内部模型访问。在本文中，我们提出了逐步置信度归因（Stepwise Confidence Attribution, SCA），这是一个针对闭源大型语言模型的框架，仅基于生成的推理轨迹分配逐步置信度。SCA应用信息瓶颈原理：与正确解决方案中的共识结构对齐的步骤获得高置信度，而偏离的步骤则被标记为可能错误。我们提出了两种互补的方法：（1）NIBS，一种非参数信息瓶颈方法，测量不依赖图结构的一致性；（2）GIBS，一种基于图的信息瓶颈模型，通过可微掩码学习子图以捕捉逻辑变异性。在数学推理和多跳问答的广泛实验中，SCA可靠地识别出与推理错误强相关的低置信度步骤。此外，利用逐步置信度指导自我纠正，使纠正成功率比基于答案的反馈提高了多达13.5%。

View on arXiv Download PDF AI Translation

cs.CL / 11 / 2605.19234

AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

语言获取中的人工智能技术：对人工智能的态度及语言获取管理者的人文价值

Jiménez-Crespo, Miguel A., Rodriguez, Stephanie, Losa, Alejandro Jaume

Abstract

The rapid emergence of AI technologies is reshaping translation practices and theory across the board. This paper deals with the impact of AI in language access. This area is characterized by the need to serve broad and diverse user populations, within a context where efficiency and access are shaped by legal mandates, ethical and commercial tensions, and safety concerns. This paper reports on the attitudes and perceptions of language access managers towards the AI and the human value in the AI age. Methodologically, this paper presents an analysis of a subset of a broader study on language access and technology, specifically a qualitative thematic analysis of ten semi-structured interviews with language access managers in the USA working in healthcare, court, public service and local government contexts. The results indicate that language access managers show conditional optimism towards the inevitable AI implementations, are strongly risk aware, and deeply committed to the human value and human oversight of AI implementations and output.

Chinese Translation

人工智能技术的快速出现正在重塑翻译实践和理论。本文探讨了人工智能在语言获取中的影响。该领域的特点是需要服务于广泛且多样化的用户群体，同时在法律规定、伦理和商业紧张关系以及安全问题的背景下，效率和获取受到影响。本文报告了语言获取管理者对人工智能及其在人工智能时代的人文价值的态度和看法。从方法论上讲，本文对一项更广泛的语言获取与技术研究的子集进行了分析，具体而言，是对美国医疗、法院、公共服务和地方政府背景下的十位语言获取管理者进行的半结构化访谈的定性主题分析。结果表明，语言获取管理者对不可避免的人工智能实施表现出有条件的乐观态度，具有较强的风险意识，并对人工智能实施和输出的人文价值及人类监督深感承诺。

View on arXiv Download PDF AI Translation

cs.CL / 12 / 2605.19266

FormalASR: End-to-End Spoken Chinese to Formal Text

FormalASR：端到端的中文口语转正式文本

Ning, Wanyi, Guo, Yinshang, Qian, Haitao, Cheng, Jiyuan, Feng, Weiyuan, Zhang, Yufei

Abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.

Chinese Translation

自动语音识别（ASR）系统通常针对逐字转录进行优化，这种转录保留了口语中的不流畅、填充词和非正式结构，这些往往不适合下游的写作应用。一种常见的解决方案是使用两阶段的ASR+LLM管道进行后期编辑，但这种设计增加了延迟和内存成本，并且难以在设备上部署。我们提出了FormalASR，两个紧凑的端到端模型（0.6B和1.7B），能够直接将中文口语转录为正式书面文本。为了实现这一设置，我们构建了WenetSpeech-Formal和Speechio-Formal，这两个大型口语到正式文本的数据集是通过基于LLM的重写和质量过滤构建的。然后，我们对Qwen3-ASR在两个规模（0.6B和1.7B）上进行了监督微调。在WenetSpeech-Formal和Speechio-Formal上的实验表明，FormalASR在逐字基准上实现了最高37.4%的相对CER降低，同时也提高了ROUGE-L和BERTScore。FormalASR在部署时无需后处理LLM，提供了一种轻量级的、设备端的口语到正式文本转录解决方案。

View on arXiv Download PDF AI Translation

cs.CL / 13 / 2605.19270

DECOR: Auditing LLM Deception via Information Manipulation Theory

DECOR：通过信息操控理论审计大型语言模型的欺骗行为

Cai, Linyue, Yeh, Samuel, Dhamala, Jwala, Gupta, Rahul, Li, Sharon

Abstract

Large language models can deceive by subtly manipulating truthful information -- omitting key facts, shifting focus, or obscuring meaning -- making such behavior difficult to detect. Existing black-box methods rely on coarse-grained judgments, offering limited interpretability and failing to pinpoint which facts were distorted and how. We introduce DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses. DECOR decomposes input contexts into atomic informational units and scores each unit against the response across four dimensions of manipulation, producing interpretable manipulation profiles that are aggregated into a global deception index. We comprehensively evaluate DECOR on both single-turn and multi-turn deception detection benchmarks spanning real-world domains, and show that DECOR achieves state-of-the-art performance on both, outperforming competitive baselines. The framework generalizes across 15 frontier models, and ablation studies confirm the contribution of each key design component. Our findings demonstrate that fine-grained, theory-grounded auditing of information manipulation offers an effective and interpretable path for LLM deception detection.

Chinese Translation

大型语言模型可以通过微妙地操控真实信息来欺骗——省略关键信息、转移焦点或模糊意义——使得这种行为难以被检测。现有的黑箱方法依赖于粗略的判断，提供有限的可解释性，并未能准确指出哪些事实被扭曲以及如何扭曲。我们提出了DECOR，一个基于信息操控理论的多智能体框架，用于对大型语言模型响应中的战略欺骗进行细粒度审计。DECOR将输入上下文分解为原子信息单元，并在四个操控维度上对每个单元进行评分，从而生成可解释的操控特征，这些特征被汇总为一个全球欺骗指数。我们在涵盖现实世界领域的单轮和多轮欺骗检测基准上全面评估了DECOR，并显示DECOR在这两方面都达到了最先进的性能，超越了竞争基线。该框架在15个前沿模型上具有良好的泛化能力，消融研究证实了每个关键设计组件的贡献。我们的研究结果表明，基于理论的细粒度信息操控审计为大型语言模型的欺骗检测提供了一条有效且可解释的路径。

View on arXiv Download PDF AI Translation

cs.CL / 14 / 2605.19274

Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

迷失在解释中：跨语言解释中的可信性与忠实性权衡

Banerjee, Somnath, Jha, Pranav, Hazra, Rima, Mukherjee, Animesh

Abstract

LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations ''where the model identifies input token spans as evidence alongside a generated rationale'' and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5~languages, and 2~multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions - even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.

Chinese Translation

多语言部署的大型语言模型（LLMs）通常通过对非英语输入的英语解释进行审计。我们评估了提取式解释，即“模型识别输入标记范围作为证据，并生成理由”，并揭示了一个系统性的权衡：以英语为中心的解释可以实现与人类理由更高的一致性，但其证据在模型预测中的因果基础却变得较弱，这通过全面性和充分性两个指标进行衡量。在3个任务、5种语言和2个多语言LLM家族的研究中，我们发现英语解释往往产生流畅但与实际情况松散关联的理由，其全面性相较于母语条件下降高达5.7倍——尽管任务准确性在不同设置中保持稳定。对于社会细微的分类，英语中心的解释也未能保留语用线索，降低了可信性和标记范围的一致性。我们建议在输入语言中审计解释，报告超越词汇重叠的多维度忠实性指标，并将英语理由视为沟通摘要，而非忠实的决策痕迹。

View on arXiv Download PDF AI Translation

cs.CL / 15 / 2605.19276

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass：大型语言模型的通用评估平台

Cao, Maosong, Chen, Kai, Duan, Haodong, Fang, Yixiao, Gao, Tong, Jiaye, Ge, Li, Mo, Liu, Hongwei, Liu, Junnan, Liu, Yuan, Lyu, Chengqi, Lyu, Han, Ma, Ningsheng, Ma, Zerun, Sun, Yu, Wu, Zhiyong, Xiao, Linchen, Xu, Jun, Ye, Haochen, Yu, Zhaohui, Yuan, Yike, Zhang, Songyang, Zhao, Yufeng, Zhou, Fengzhe, Zhou, Peiheng, Zhu, Dongsheng, Zhu, Lin, Zhuo, Jingming

Abstract

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

Chinese Translation

近年来，人工智能领域经历了从任务特定的小规模模型向通用大型语言模型（LLMs）的范式转变。随着LLMs的快速迭代，对其能力进行客观、定量和全面的评估已成为推动技术发展的关键环节。目前，主流的基于静态基准数据集的评估方法面临任务类型多样性、评估标准不一致以及数据和处理工作流碎片化等挑战，使得跨领域和大规模模型评估变得困难。为了解决上述问题，本文提出并开源了OpenCompass，一个一站式、可扩展且支持高并发的通用LLM评估平台。该平台遵循模块化和组件解耦的设计理念，具有高兼容性、灵活性和高并发性三大核心优势。OpenCompass的核心架构由五个关键组件组成：配置系统、任务划分模块、执行与调度模块、任务执行单元和结果可视化模块。其工作流程提供基于规则的评估器、LLM作为评判者的评估器和级联评估器，以适应不同任务场景的需求。该平台支持多个领域的主流基准数据集，包括知识、推理、计算、科学、语言、代码等，为学术界和工业界提供了统一高效的LLM评估工具，促进了对LLMs优缺点的准确识别及其后续优化。

View on arXiv Download PDF AI Translation

cs.CL / 16 / 2605.19284

Language models struggle with compartmentalization

语言模型在区分能力方面的挑战

Howe, Thomas Vincent, Wingate, David

Abstract

In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.

Chinese Translation

在大型语言模型（LLMs）使用的训练数据中，同一潜在概念通常以多种不同方式呈现：相同的事实以英语和斯瓦希里语出现；许多功能可以用Python和Haskell表达；我们可以用形式语言和自然语言表达命题。我们展示了LLMs可能表现出区分能力，即它们未能识别和共享统一概念不同呈现之间的统计关联。在最坏的情况下，LLMs仅仅学习每种概念呈现的并行内部表示，导致模型容量被冗余信息饱和，并随着呈现数量的增加降低样本效率。我们还证明，尽管合成的并行数据容易被学习，但它可能未能改善这一问题。在这一框架下，我们发现对于小型模型，早期的多语言学习几乎完全是区分的。最后，我们研究的所有干预措施都表现出相变，其有效性依赖于不同呈现的数量，这表明语言建模目标可能仅不一致地统一表示。

View on arXiv Download PDF AI Translation

cs.CL / 17 / 2605.19285

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

合理性是否必要且充分？为可解释的虚假信息检测调优大型语言模型

Wang, Bing, Miao, Rui, Li, Ximing, Shen, Chen, Yan, Shaotian, Li, Changchun, Liu, Kaiyuan, Yuan, Xiaosong, Ye, Jieping

Abstract

The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.

Chinese Translation

虚假信息在社交媒体平台上的快速传播已成为一个严峻的挑战。为了减缓其扩散，虚假信息检测（Misinformation Detection, MD）已成为一个关键的研究课题。基于小型模型的传统MD方法通常通过黑箱过程进行二元分类。最近，大型语言模型（Large Language Models, LLMs）的兴起使得可解释的MD成为可能，这些模型生成解释其决策的合理性，从而增强透明度。现有的可解释MD方法主要集中在设计复杂的提示，以从现成的LLMs中引出合理性。在本研究中，我们提出了一种专门为可解释MD调优的LLM的微调管道。我们的管道首先收集大规模的事实核查文章，然后使用多个强大的LLM生成真实性预测和合理性。为了确保高质量的训练数据，我们利用一种过滤策略，仅选择正确的实例进行微调。尽管该管道直观且普遍，但我们的实验表明，仅基于标签正确性的简单过滤在实践中是不够的，并存在两个关键限制：（1）粗粒度标签导致合理性不足：仅基于二元标签过滤的合理性不足以充分支持其决策；（2）过度验证行为导致不必要的合理性：更强的LLM往往表现出过度验证行为，生成过于冗长和不必要的合理性。为了解决这些问题，我们引入了LONSREX，一个新的数据合成管道，用于定位可解释MD的必要和充分合理性。具体而言，我们提出了一种度量，量化每个验证步骤对最终预测的贡献，从而评估其必要性和充分性。实验结果证明了LONSREX的有效性。

View on arXiv Download PDF AI Translation

cs.CL / 18 / 2605.19309

How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

文档解析器如何失效？文档智能中的结构脆弱性审计

Chen, Yue, Wang, Yihao, Tang, Ziyi, Wang, Keze

Abstract

Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this Footprint Bias and propose a lightweight output-level auditing framework that decouples probe construction, policy-driven targeting, and structure-aware diagnosis. The framework combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where perturbations interact with layout structure and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916). Exposure descriptors further separate occlusion- and topology-dominant pathways, and small structurally targeted probes cause downstream QA/retrieval degradation comparable to larger-footprint perturbations. These results shift DLA robustness evaluation from footprint-based stress testing toward structure-aware vulnerability auditing.

Chinese Translation

文档布局分析（Document Layout Analysis, DLA）管道为增强检索生成、长文档问答及其他文档智能系统提供结构化页面表示，然而其鲁棒性评估仍然主要集中在特定领域。我们识别出这种足迹偏差（Footprint Bias），并提出了一种轻量级的输出级审计框架，该框架解耦了探针构建、政策驱动的目标定位和结构感知的诊断。该框架结合了块级结构损失率（Block-level Structural Loss Rate, B-SLR）、粒度感知的曝光描述符和路径归因，以分析扰动如何与布局结构交互以及故障如何传播。在对1,000页的MinerU和PP-StructureV3进行分析时，受影响区域与扰动引起的光学字符识别（OCR）不稳定性之间的弱相关性（R^2=0.384/0.110），而B-SLR与之更为紧密对齐（R^2=0.727/0.916）。曝光描述符进一步区分了遮挡主导和拓扑主导的路径，而小型结构性目标探针导致的下游问答/检索性能下降与大足迹扰动相当。这些结果将DLA的鲁棒性评估从基于足迹的压力测试转向结构感知的脆弱性审计。

View on arXiv Download PDF AI Translation

cs.CL / 19 / 2605.19316

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

一种多智能体框架用于特征约束的阅读理解题目生成难度控制

Hwang, Seonjeong, Seo, Jun, Kim, Hyounghun, Lee, Gary Geunbae

Abstract

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

Chinese Translation

近期在难度控制的阅读理解题目生成研究中，利用大型语言模型（LLMs）通过调整与难度相关的特征来生成题目。然而，现有方法通常依赖单一智能体提示的方法，这往往无法持续满足特定的特征约束，导致生成的题目偏离目标难度水平。为了解决这一限制，我们提出了MAFIG（特征约束题目生成的多智能体框架），在该框架中，多个LLM智能体和特征特定评估者协作生成并迭代修订题目，以满足预期的约束。此外，为了验证MAFIG在难度控制中的有效性，我们提出了一种构建特征约束集序列的方法，该序列可生成难度单调递增的题目。实验结果表明，MAFIG生成的题目在满足目标约束方面的成功率显著高于基线，且通过难度校准的约束序列实现了稳健的难度控制。

View on arXiv Download PDF AI Translation

cs.CL / 20 / 2605.19341

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

HalluWorld：通过参考世界模型进行幻觉的受控基准测试

Liu, Emmy, Gangal, Varun, Yu, Michael, Tao, Zhuofu, Singh, Karan, Kumar, Sachin, Feng, Steven Y.

Abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

Chinese Translation

幻觉仍然是大型语言模型的一个主要失败模式，但现有基准在摘要、问答、检索增强生成和代理交互等方面对其操作化不一致。这种碎片化使得不清楚在一个环境中有效的缓解措施是否能在不同上下文中减少幻觉。目前的基准要么需要人工标注和可能被记忆的固定参考，要么依赖于在难以重现的环境中的观察。为了研究根本原因，我们引入了HalluWorld，这是一个基于明确参考世界模型的可扩展基准：当模型产生一个相对于该世界是错误的可观察声明时，就会出现幻觉。在这一观点的基础上，我们构建了合成和半合成环境，其中参考世界被完全指定，模型的视角是受控的，幻觉标签是自动生成的。HalluWorld涵盖了网格世界、国际象棋和现实终端任务，使得世界复杂性、可观察性、时间变化和源冲突策略的受控变化成为可能，并将幻觉细分为细粒度的错误类别。我们在这些环境中评估了前沿和开放权重的语言模型，发现了一致的模式：对于前沿模型，直接观察信息的感知幻觉几乎得到解决，而多步状态跟踪和因果前向模拟仍然困难，并且通常无法通过扩展思维来解决。在终端环境中，模型在何时放弃方面也存在困难。不同探测类型和领域中的失败不均匀特征表明，幻觉源于不同的失败模式，而不是单一能力。我们的结果表明，受控参考世界提供了一条可扩展和可重现的路径，以测量和减少现代语言模型中的幻觉。

View on arXiv Download PDF AI Translation

cs.CL / 21 / 2605.19344

Retrieval-Augmented Linguistic Calibration

检索增强的语言校准

Yeh, Yi-Fan, Tao, Linwei, Dong, Minjing, Huang, Tao, Yu, Jialin, Torr, Philip, Xu, Chang

Abstract

Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

Chinese Translation

语言提示如 "我相信" 和 "可能" 提供了一种直观的信心传达接口，但针对语言信心表达的可推广、原则性校准框架仍然未得到充分探索。特别是，共同出现的语言提示、上下文变化和主观受众解读带来了独特的挑战。因此，我们将语言信心建模为一个关于陈述正确性可感知概率值的分布，捕捉到标量表示所忽略的解读变异性。在这一分布框架内，我们引入了忠实度作为一个补充评估维度，并提出了忠实度散度（Faithfulness Divergence, FD），这是一种信息论度量，用于量化在真相揭示时对受众信念所引发的惊讶。基于这些基础，我们提出了检索增强的语言校准（Retrieval-Augmented Linguistic Calibration, RALC），这是一种轻量级的事后处理管道，通过检索增强的重写将校准的信心信号传播回自然语言。在三个问答基准和五个大语言模型（LLM）系列中，RALC 在领域内忠实度和校准方面分别提高了66%和58%，超越了黑箱和灰箱校准基线。

View on arXiv Download PDF AI Translation

cs.CL / 22 / 2605.19346

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

IMLJD：印度婚姻诉讼分析的计算数据集

Bose, Joy

Abstract

We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.

Chinese Translation

我们提出了IMLJD，这是一个开放的数据集，包含3,613份印度法院判决，涵盖了根据《印度刑法》第498A条、《保护妇女免受家庭暴力法》和《刑事诉讼法》第482条的婚姻纠纷。该数据集包括2000年至2024年间的印度最高法院（1,474个案件）和2018年至2024年间的卡纳塔克高等法院（2,139个案件），并提供结构化的结果标签、元数据衍生指标和知识图谱。我们发现，最高法院的撤销申请成功率为57.6%，而卡纳塔克高等法院的成功率为39.7%。在2018年至2024年匹配的时间段内，最高法院的撤销率为59.3%，使得两者之间的差异扩大至19.6个百分点，确认该发现对时间调整是稳健的。数据集、代码和知识图谱已在https://github.com/joyboseroy/imljd和https://huggingface.co/datasets/joyboseroy/imljd上公开发布。

View on arXiv Download PDF AI Translation

cs.CL / 23 / 2605.19357

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

SciCustom：一个用于大语言模型科学能力自定义评估的框架

Gu, Yiyang, Yang, Junwei, Luo, Junyu, Yuan, Ye, Feng, Bin, Xia, Yingce, Xie, Shufang, Liu, Kaili, Wu, Bohan, Shi, Qi, Li, Haoran, Xiao, Beier, Xiao, Zhiping, Luo, Xiao, Zhang, Weizhi, Yu, Philip S., Liu, Zequn, Zhang, Ming

Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.

Chinese Translation

大语言模型（LLMs）在科学研究中的应用日益增多，但现有的评估往往无法反映实践中所需的细粒度能力。大多数基准测试都是手动整理或领域通用的，这限制了其可扩展性和与真实科学用例的对齐。在本文中，我们提出了一个名为SciCustom的新框架来解决这一问题。该框架支持从大规模科学数据中自定义构建基准，以评估LLMs在特定应用中的科学能力。SciCustom首先将科学知识组织成基于本体的知识单元，控制其粒度，并训练一个标记器将大规模数据实例映射到这一知识空间。根据自定义需求，通过基于投票的多模型共识识别相关知识单元。这些单元使得通过二分搜索进行相关性感知的基准检索成为可能，随后进行代理子集选择和基于数据的基准生成，以实现高效评估。在化学和医疗保健领域的实验表明，SciCustom揭示了LLMs科学能力中的细粒度差异，而这些差异是标准基准所忽视的，同时不需要专家标注或合成问题生成。这项工作为在LLMs中基准测试科学能力提供了可扩展且关注应用的基础。源代码可在 https://github.com/yjwtheonly/SciCustom 获取。

View on arXiv Download PDF AI Translation

cs.CL / 24 / 2605.19358

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

驯服思考者：适应性大语言模型推理的条件熵塑形

Wei, Shuyu, Sun, Jian, Qiu, Delai, Wang, Yining, Liu, Shengping, Liang, Jiaen, Fu, Ying, Huang, Wei, Sang, Jitao

Abstract

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.

Chinese Translation

基于熵的深度推理已成为提升大语言模型（LLMs）推理能力的一个有前景的方向，但现有方法往往要么无差别地增加响应长度，要么以牺牲准确性为代价缩短响应。为了更好地平衡这一权衡，我们提出了条件熵塑形（Conditional Entropy Shaping, CES），这是一个动态控制令牌级响应熵的框架，使得LLMs能够在简单问题上生成简明的解决方案，同时在困难问题上鼓励更深入的探索。CES建立在DAPO基础上，利用令牌级熵作为不确定性信号，并应用条件双向策略：在正确推理路径上对高熵“分叉点”令牌进行惩罚以提高简洁性，而在错误路径上则给予奖励以鼓励探索和错误修正。我们在DeepSeek-R1-Distill-7B上实现了CES，并在12个数学基准上进行了评估。CES在相对于DAPO的情况下，始终提高了平均准确性，同时减少了响应长度，补充实验显示在较小的1.5B主干网络和域外基准上也出现了类似的趋势。

View on arXiv Download PDF AI Translation

cs.CL / 25 / 2605.19394

EmbGen: Teaching with Reassembled Corpora

EmbGen：使用重组语料库进行教学

Lenin, Arun K, Rouse, Kai, Nicastro, Andrea, Leontjeva, Anna

Abstract

Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.

Chinese Translation

将小型指令调优模型适应于专业领域通常依赖于在策划的指令-响应示例上进行监督微调（SFT），而大规模收集这些示例的成本较高。由教师大型语言模型（LLM）从领域语料库生成的合成训练示例可以降低这一成本，但现有的流程可能会产生同质化的输出，并且无法始终如一地捕捉跨段落或跨文档的依赖关系。我们提出了EmbGen，这是一种合成数据生成流程，它将语料库分解为实体-描述对，利用从嵌入相似性推断出的语义结构重新组合这些对，然后通过邻近、内部聚类和跨聚类采样生成问答（QA）对，并使用聚类专用系统提示进行指导。我们在三个具有不同语义异质性的数据集上评估EmbGen，比较对象包括EntiGraph、InstructLab和Knowledge-Instruct，固定的令牌预算为500万和2000万令牌。我们使用词汇重叠指标、LLM作为评判标准的评分规则，以及二元准确率（Binary Accuracy），这是一个结合了事实准确性和完整性的复合指标进行评估。相较于最强基线，EmbGen在异质性最高的数据集上，在500万令牌预算下提高了12.5%的二元准确率，在2000万令牌预算下提高了88.9%，同时在其他异质性较低的数据集上仍保持竞争力。

View on arXiv Download PDF AI Translation

cs.CL / 26 / 2605.19416

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

LambdaPO：一种用于推理语言模型的Lambda风格策略优化

Yuan, Zhe, Zhou, Yipeng, Li, Jinghan, Chen, Xinyuan, Deng, Bowen, Chen, Zhiqian, Zhao, Liang

Abstract

Group Relative Policy Optimization(GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this issue, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that addresses this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optima. Experimental results across challenging math reasoning and question-answering tasks demonstrates that LambdaPO improves performance compared to the baseline methods.

Chinese Translation

群体相对策略优化（Group Relative Policy Optimization, GRPO）已成为现代强化学习对齐的基石，因其通过在采样轨迹群体中利用奖励归一化而有效地避免了显式价值评估器的需求。然而，该方法依赖于单一的统计基线，如群体均值，将轨迹空间的关系拓扑压缩为一个标量，从而抹去了在复杂、对排名敏感的奖励景观中导航所需的细粒度偏好信息。为了解决这一问题，我们提出了一种新颖的框架，Lambda策略优化（Lambda Policy Optimization, LambdaPO），通过将优势估计从标量值重新概念化为分解的成对偏好结构，来应对这一信息论瓶颈。具体而言，任何给定轨迹的优势被公式化为与其群体中所有同伴的奖励差异的综合和，其中每个成对比较都由策略在已建立偏好中的概率信心动态衰减。为了进一步缓解二元结果监督的稀疏性，我们通过从生成的推理轨迹与真实解决方案之间的精确率-召回对齐中衍生的语义密度奖励来增强目标。因此，我们的方法能够从一组回合中挖掘出更细粒度的优化信号，引导大型语言模型（LLM）达到更好的最优解。在具有挑战性的数学推理和问答任务中的实验结果表明，LambdaPO相较于基线方法提高了性能。

View on arXiv Download PDF AI Translation

cs.CL / 27 / 2605.19433

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

回溯偏离时的路径：减轻大语言模型推理蒸馏中的双重曝光偏差

Wang, Bing, Yan, Shaotian, Shen, Chen, liu, kaiyuan, Fan, Sinan, Li, Ximing, Miao, Rui, Yuan, Xiaosong, Shen, Zhanming, Ye, Jieping

Abstract

Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student's on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.

Chinese Translation

大型语言模型（LLMs）通过长链思维（CoT）在复杂推理任务中取得了显著成功，但其巨大的计算开销阻碍了在现实世界中的部署。LLM推理蒸馏通过将推理能力从强大的教师模型转移到紧凑的学生模型来解决这一问题。然而，现有的蒸馏范式面临着一个根本性的困境。典型的离策略蒸馏严格利用教师生成的黄金轨迹，由于训练分布与学生生成的推理上下文之间的不匹配，导致了曝光偏差，这在长链思维推理中引发了错误级联。为了解决这个问题，在线策略蒸馏允许学生探索自己的轨迹，但我们证明这本质上引入了相互反向的曝光偏差：当依赖于学生生成的次优上下文时，教师模型也难以提供积极的指导。为了解决这一双重曝光偏差问题，我们提出了监控轨迹与回溯偏离时的路径（MOTAB），一个新的LLM推理蒸馏流程。具体而言，MOTAB动态监控学生的在线生成与自适应安全边界。当生成偏离并超过该阈值时，MOTAB回溯到最后一个安全状态，并利用教师干预来纠正方向。这种方法本质上容忍学生的小错误以减轻曝光偏差，同时防止次优上下文以规避反向曝光偏差。在LIMO-v2和AceReason数据集上的大量实验表明，MOTAB有效缓解了双重曝光偏差，在推理任务中平均提高了约3%的性能。

View on arXiv Download PDF AI Translation

cs.CL / 28 / 2605.19470

Drifting Objectives for Refining Discrete Diffusion Language Models

漂移目标用于精炼离散扩散语言模型

Oba, Daisuke, Furuta, Hiroki, Okazaki, Naoaki

Abstract

Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.

Chinese Translation

离散扩散语言模型（DDLMs）通过迭代去噪分类标记序列生成文本，而最近针对连续生成器的漂移方法表明，这种采样时间的修正部分可以通过反对称固定点目标吸收到训练中。我们研究如何将这一原理转移到DDLMs上，主要挑战在于与离散文本的接口：硬标记样本是不可微分的，分类预测并不能直接提供连续样本以进行漂移。我们提出了TokenDrift，这是一种漂移目标，它将分类预测提升为软标记特征，在一个冻结的语义空间中应用反对称漂移，并将生成的停止梯度特征目标反向传播到DDLM的logits。在使用掩蔽和均匀状态扩散骨干的受控持续训练实验中，TokenDrift在固定NFE生成质量上优于匹配的延续基线，在4个NFE下，MDLM的生成困惑度（Gen.-PPL）降低了89%，DUO降低了86%。这些结果表明，漂移可以为DDLMs提供一种实用的精炼目标。

View on arXiv Download PDF AI Translation

cs.CL / 29 / 2605.19516

Base Models Look Human To AI Detectors

基础模型在AI检测器眼中显得更像人类

Xu, Yixuan Even, Zhong, Ziqian, Raghunathan, Aditi, Fang, Fei, Kolter, J. Zico

Abstract

As AI-generated text enters the real-world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, generated text from base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

Chinese Translation

随着AI生成文本大规模进入现实世界，各机构越来越多地使用商业AI文本检测器，特别是在教育和学术诚信工作流程中。我们报告了关于此类系统的一个令人惊讶的实证发现：当通过GPTZero和Pangram进行评估时，基础模型生成的文本往往被判定为压倒性的人类文本，而其经过指令调优的对应文本则不是。在此观察的基础上，我们提出了通过迭代释义实现人性化（Humanization by Iterative Paraphrasing，HIP），这是一种与检测器无关的流程，能够将基础模型最小化微调为释义器并进行迭代应用。与我们测试的基线相比，HIP在商业检测器上在语义保留和检测逃避之间实现了更强的权衡。在Llama-3和Qwen-3系列中，模型规模从0.6B到70B不等，HIP始终提高了检测器的人类相似度。我们的发现表明，当前的检测器更多地追踪指令调优和局部上下文的伪影，而不是任何不变的机器生成文本的概念。这反过来又呼吁检测器设计更明确地建模这些因素。

View on arXiv Download PDF AI Translation

cs.CL / 30 / 2605.19523

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

跨模态技能注入的研究：场景、方法与超参数

Xu, Zhiyu, Wang, Lean, Liu, Yuanxin, Li, Lei, Zhou, Hao, Meng, Fandong, Zhou, Jie, Sun, Xu

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.

Chinese Translation

视觉-语言模型（VLMs）在一般多模态理解方面表现出显著的能力；然而，它们在高效获取不断演变的领域特定技能方面仍然面临挑战。传统的增强VLM能力的方法，如监督微调（Supervised Fine-Tuning, SFT），需要大量的数据集整理和显著的计算资源。模型合并作为一种高效的替代方案，能够在不增加额外训练数据需求或显著计算开销的情况下，将领域特定的专业知识从大型语言模型（Large Language Models, LLMs）转移到VLMs。与传统的同质LLM合并主要聚合现有能力不同，跨模态技能注入旨在通过将领域专家LLM集成到VLM中，诱导新兴的跨模态能力。然而，现有研究缺乏对跨模态技能注入的适用性和方法论的系统分析。在本研究中，我们从三个主要方面调查跨模态技能注入：场景、方法和超参数。在场景方面，我们发现跨模态技能注入在遵循指令和跨语言设置中通常表现良好，但在数学推理方面存在困难。在方法方面，我们发现经典方法如TA和DARE在性能上始终优于其他合并方法。我们还对这些经典方法所依赖的超参数调优进行了系统和定量的分析。

View on arXiv Download PDF AI Translation

cs.CL / 31 / 2605.19568

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

m3BERT：一种现代的多语言马特ryoshka双向编码器

Wang, Yaoxiang, Zuo, Simiao, Hu, Qingguo, Ding, Yucheng, Gong, Yeyun, Jiao, Jian, Su, Jinsong

Abstract

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.

Chinese Translation

嵌入模型在信息检索系统（如搜索和广告）中至关重要。然而，现有的预训练模型通常具有固定的架构和嵌入维度，这在适应具有不同业务驱动约束的多样化部署场景时带来了显著挑战。一个常见的做法是通过从更大预训练模型部分初始化参数进行微调，以应对资源受限的任务。然而，由于预训练与下游使用之间的不匹配，这种方法往往效果不佳，无法充分实现预训练的优势。为了解决这一局限性，我们提出了m3BERT：一种现代的多语言马特ryoshka双向编码器，它采用了一种新颖的预训练策略，联合优化变换器层和多个嵌入维度的表示。这使得单一模型能够根据不同的资源和准确性目标进行调整，同时保持与预训练的一致性。结合最近的架构改进，m3BERT采用三阶段预训练：单语预训练、多语言适应以服务于多样化的用户基础，以及在大规模网络域语料库上的持续预训练，以增强其在商业检索中的实用性。m3BERT在Bing-Click这一大规模工业检索数据集上显著超越了最先进的嵌入模型，展示了其作为资源感知工业检索系统高效基础的实用多样性。对公共数据集的进一步实验也证实了我们多层次马特ryoshka预训练策略的普遍有效性。

View on arXiv Download PDF AI Translation

cs.CL / 32 / 2605.19575

A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics

基于专家标准的习语性数据驱动方法研究

Mikhalkova, Elena, Vishnyakova, Anastasiya, Drozdova, Anastasiya, Gavin, Polina, Zhmykhov, Aleksander, Protasov, Timofey

Abstract

The article observes data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical and other criteria described in theoretical books and papers on the notion of idiomaticity. MWEs were collected from the same theoretical sources, and a set of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; presence of obsolete words and grammar influence ability of an MWE to be replaced with one word.

Chinese Translation

本文观察了286个多词表达（MWEs）的数据分析，这些表达基于理论书籍和论文中描述的16个词汇、语法及其他标准。多词表达来自相同的理论来源，并由一组语言学专家使用这些类别进行了标注。类别的分布显示，没有绝对的习语表达。词汇标准似乎是影响最大的因素；语法标准则受到特定条件的限制；过时词汇和语法的存在影响多词表达被单词替代的能力。

View on arXiv Download PDF AI Translation

cs.CL / 33 / 2605.19577

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

GoLongRL：面向能力的长上下文强化学习与多任务对齐

Lv, Minxuan, Mei, Tiehua, Du, Tanlong, Chen, Junmin, Su, Zhenpeng, Chen, Ziyang, Wang, Ziqi, Wu, Zhennan, Pan, Ruotong, Liang, jian, Tang, Ruiming, Li, Han

Abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

Chinese Translation

我们提出了GoLongRL，这是一种完全开源的、面向能力的后训练方案，用于具有可验证奖励的长上下文强化学习（RLVR）。现有的长上下文强化学习方法通常将数据构建视为设计越来越复杂的检索路径的问题，这导致任务覆盖和奖励公式同质化，无法充分反映实际的长上下文需求。我们的工作有两个贡献。(1) 面向能力的数据构建，完全开放发布。我们公开发布了一个包含23K RLVR样本的数据集、完整的构建流程和所有训练代码。在长上下文能力的分类指导下，该数据集涵盖9种任务类型，每种任务都配有其自然评估指标。它包括来自已建立语料库的策划开源样本和从真实源文档（如书籍、学术论文和多轮对话）生成的合成样本的问答对。在相同的普通GRPO设置下，我们的数据集单独超过了闭源的QwenLong-L1.5数据集。此外，我们在该数据上训练的Qwen3-30B-A3B模型在长上下文性能上可与DeepSeek-R1-0528和Qwen3-235B-A22B-Thinking-2507相媲美，这表明更广泛的覆盖和更大的奖励多样性显著有利于长上下文能力的提升。(2) TMN-Reweight用于异构多任务优化。为了解决来自异构奖励的优化挑战，我们提出了TMN-Reweight，它结合了任务级平均归一化以实现跨任务奖励规模对齐，以及适应难度的加权以获得更可靠的优势估计。TMN-Reweight进一步提高了普通GRPO的平均性能，同时在报告的评估中保持或改善了整体能力。

View on arXiv Download PDF AI Translation

cs.CL / 34 / 2605.19597

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

LLMEval-Logic：一个经过求解器验证的中文逻辑推理基准，用于对抗性强化的LLMs

Zhang, Ming, Peng, Qiyuan, Wei, Yinxi, Shen, Yujiong, Tan, Kexin, Wang, Yuhui, Xiang, Zhenghao, Ye, Junjie, Yin, Zhangyue, Xi, Zhiheng, Dou, Shihan, Gui, Tao, Pan, Maxm, Yang, Ruizhi, Zhang, Qi, Huang, Xuanjing

Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.

Chinese Translation

对大型语言模型（LLMs）在自然语言逻辑推理上的评估至关重要，因为规则驱动的任务要求结论严格遵循所陈述的前提。目前许多现有的逻辑推理基准是通过从抽样公式中模板化自然语言项目生成的，提供的正式注释仅为粗略或未经审计的，并且现在迅速被前沿推理模型所饱和。我们提出了LLMEval-Logic，这是一个基于现实情境场景构建的中文逻辑推理基准。其流程结合了作者和专家对自然语言项目及其参考形式化的审核，使用Z3验证注释答案，构建专家评分标准以实现自然到形式的评分，并通过闭环对抗工作流程强化选定项目。该基准分为两个配对子集发布：一个包含246个项目的基础子集，配备1,400个专家开发的评分原子，以及一个包含190个项目的困难子集，涵盖938个多步骤子问题，涉及封闭模型空间。在LLMEval-Logic上评估14个前沿LLMs揭示了当前模型之间的显著差距：最佳模型的困难项目准确率仅达到37.5%，即使使用参考符号，评估模型中最高的联合Z3+评分形式化分数也仅为60.16%。我们的基准公开可在https://github.com/llmeval/LLMEval-Logic获取。

View on arXiv Download PDF AI Translation

cs.CL / 35 / 2605.19633

optimize_anything: A Universal API for Optimizing any Text Parameter

优化任何事物：一个通用的文本参数优化API

Agrawal, Lakshya A, Lee, Donghyun, Tan, Shangyin, Ma, Wenjie, Elmaaroufi, Karim, Sandadi, Rohit, Seshia, Sanjit A., Sen, Koushik, Klein, Dan, Stoica, Ion, Gonzalez, Joseph E., Khattab, Omar, Dimakis, Alexandros G., Zaharia, Matei

Abstract

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system-supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs-achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize\_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa .

Chinese Translation

一个基于大型语言模型（LLM）的优化系统能否与不同领域的专业工具相匹配？我们展示了当优化问题被表述为改善由评分函数评估的文本工件时，一个单一的基于AI的优化系统——支持单任务搜索、多任务搜索（通过跨问题迁移）以及对未见输入的泛化——在六个不同任务中达到了最先进的结果。我们的系统发现的代理架构使Gemini Flash的ARC-AGI准确率几乎翻了三倍（从32.5%提升至89.5%），找到的调度算法将云成本降低了40%，生成的CUDA内核中有87%与PyTorch匹配或超越，并且在报告的AlphaEvolve圆形打包解决方案（n=26）中表现优越。对三个领域的消融实验表明，可操作的侧面信息比仅依赖评分反馈能更快收敛并显著提高最终得分，而多任务搜索在给定相同每个问题预算的情况下，通过跨任务迁移优于独立优化，且其收益随着相关任务数量的增加而扩大。我们首次展示了基于LLM的文本优化是一种通用的问题解决范式，将传统上需要特定领域算法的任务统一在一个框架下。我们将optimize_anything开源，并支持多种后端，作为GEPA项目的一部分，网址为https://github.com/gepa-ai/gepa。

View on arXiv Download PDF AI Translation

cs.CL / 36 / 2605.19645

K-Quantization and its Impact on Output Performance

K-量化及其对输出性能的影响

Davidsson, Robin Baki, Nugues, Pierre

Abstract

Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8\_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2\_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between efficiency and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.

Chinese Translation

近期大型语言模型（LLMs）的进展展示了它们在许多自然语言处理（NLP）任务中的卓越能力。然而，它们的庞大规模常常给部署带来挑战。这就需要高效的模型压缩技术，其中量化成为一种突出的解决方案。尽管量化带来了诸多好处，但量化（从2位到6位）对LLMs性能和准确性的具体影响仍然是一个活跃的研究领域。本文研究了八种LLMs在不同量化水平下的性能，重点关注MMLU-Pro（知识处理和推理）、CRUXEval（代码理解）和MuSR（阅读理解）等任务。我们的结果显示出一个一致的趋势，即更高的精度（例如，8位 Q8_0）通常会提高性能，尽管收益递减。激进的量化（例如，2位 Q2_K）通常能够保持可接受的准确性，尽管一些模型表现出显著的性能损失。我们的研究结果表明，尽管较低的位精度通常会降低性能，但其影响在不同模型和任务之间存在差异。较大的模型对激进量化表现出更强的韧性，但在较低精度水平下仍可能出现显著下降。参数在70亿到90亿范围内的中型模型在效率和资源使用之间达成了最佳平衡。这些结果为模型规模、量化和性能之间的权衡提供了见解。

View on arXiv Download PDF AI Translation

cs.CL / 37 / 2605.19711

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

大型语言模型能否可靠地纠正低资源自动语音识别中的错误？关于西弗里西亚语的污染意识案例研究

Hao, Yun, Amooie, Reihaneh, de Vries, Wietse, van Noord, Rik, Wieling, Martijn

Abstract

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.

Chinese Translation

自动语音识别（ASR）近年来取得了显著进展，但在低资源语言中的表现仍然有限。大型语言模型（LLMs）在通过生成性错误纠正（GER）改善ASR方面展现出潜力，但它们在低资源环境中的有效性仍未得到充分探讨。此外，数据污染在多大程度上影响LLM基础的GER所报告的改进仍不明确。本研究探讨了基于LLM的低资源弗里西亚语的GER。除了使用公共语料库外，我们还构建并使用了一个包含非公共文本的弗里西亚离线数据集进行评估，以控制潜在的数据污染。结果表明，在大多数设置中，GER提高了ASR性能，其中最佳的GPT-5.1结果超越了oracle WER。离线数据集上的可比增益表明这些改进反映了真实的纠正能力。我们进一步提供了详细的错误分析，揭示了模型的纠正模式。

View on arXiv Download PDF AI Translation

cs.CL / 38 / 2605.19714

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

基于大型语言模型的阿拉伯语金融情感分析：来自沙特市场的证据

Albaqawi, Mona H., Albalkhi, Eman M., Albaiti, Joud A., Lopedoto, Enrico

Abstract

Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

Chinese Translation

投资者情绪塑造金融市场，但在阿拉伯金融环境中建模情绪仍然具有挑战性，原因在于语言复杂性和资源有限。我们提出了一种针对沙特市场的大规模金融情感分析的阿拉伯语自然语言处理框架，整合了官方金融新闻和社交媒体，以捕捉机构和公众投资者的情感。该框架通过一个多阶段流程构建了一个大型阿拉伯金融语料库，包括数据收集、清洗、去重、实体链接和情感标注。基于变换器的命名实体识别（NER）结合精心策划的公司词汇表，将文本提及链接到规范的公司标识符，并使用五类方案分配情感标签。最终生成的84K样本数据集支持公司级情感聚合和相对于沙特证券市场行为的情感动态分析。实验结果表明，该阿拉伯语金融情感分析方法可靠且可扩展。

View on arXiv Download PDF AI Translation

cs.CL / 39 / 2605.19718

CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

CAIT：儿童-成人互动的句法解析工具包

Padovani, Francesca, Yang, Xiulin, Bunzeck, Bastian, Jumelet, Jaap, Matusevych, Yevgen, Schneider, Nathan, Bisazza, Arianna

Abstract

CHILDES is a paramount resource for language acquisition studies -- yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child--adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Parsing Toolkit for Child--Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.

Chinese Translation

CHILDES 是语言习得研究的重要资源，但用于分析其句法结构的计算工具仍然有限。利用最近发布的带有金标准通用依赖（Universal Dependencies, UD）注释的 UD-English-CHILDES 树库，我们训练了一种专门针对 CHILDES 的最先进的依赖解析器。该解析器更准确地捕捉儿童与成人互动中的句法模式，超越了包括 SpaCy 和 Stanza 在内的广泛使用的现成英语解析器。除了解析器，我们还发布了一个词性标注器和一个话语级构造标注器，这些工具共同构成了开放源代码的儿童-成人互动句法解析工具包（CAIT）。通过详细的错误分析和一个案例研究，追踪 CHILDES 中句法构造在发展时间上的分布，我们展示了该工具包在大规模、可重复的语言习得研究中的实际应用价值。

View on arXiv Download PDF AI Translation

cs.CL / 40 / 2605.19723

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

大型语言模型中的数学推理：基准、架构、评估与开放挑战

Amjad, Husnain, Shahzad, Raja Khurram, Shahzad, Aamir, Fatima, Mehwish

Abstract

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

Chinese Translation

数学推理对于教育、科学和工业中的问题解决至关重要，是评估人工智能系统的重要基准。随着大型语言模型（LLMs）推理能力的提升，了解它们在数学推理方面的表现变得愈加重要。本调查通过对数据集、架构、训练策略和评估协议的结构化分析，综合了LLMs在数学推理方面的最新进展。我们的系统性回顾涵盖了大约120篇经过同行评审的研究和预印本，考察了该研究领域的发展，并提供了一个统一的分析框架，以理解当前的进展和局限性。我们的研究特别引入了一个统一的数学数据集分类法，区分了预训练语料库、监督微调资源和不同推理复杂度水平的评估基准。我们对推理架构和训练策略进行了系统分析，包括工具集成、验证器引导推理和参数高效适应，评估其对推理稳健性和泛化能力的影响。此外，对现有指标的比较评估突显了最终答案准确性与过程级推理验证之间的差距。通过综合这些领域的见解，我们的分析识别了反复出现的失败模式，如推理可信性问题、基准偏差和泛化局限性，并概述了改善符号基础、评估可靠性以及开发更稳健和可信的基于LLM的推理系统的关键研究方向。

View on arXiv Download PDF AI Translation

cs.CL / 41 / 2605.19735

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

ContextRAG：无提取层次图构建的检索增强生成

Prosvirnin, Roman, Kuznetsov, Sergei, Jin, Seungmin

Abstract

Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.

Chinese Translation

图结构的检索增强生成（RAG）系统可以提高多跳问题的答案质量，但许多当前系统依赖大型语言模型（LLMs）在索引过程中提取实体、关系和摘要。这些调用增加了随着语料库规模增长而增加的令牌和时间成本。我们提出了ContextRAG，一种图RAG系统，其图拓扑结构是在没有基于LLM的实体或关系提取的情况下构建的。ContextRAG通过残差量化k均值和使用Lukasiewicz剩余逻辑的形式概念分析，从块嵌入中推导出模糊概念图。桥状和交集派生的上下文节点是通过软模糊连接和交集操作诱导的，而不是通过LLM编写的图边。在130个任务的UltraDomain子集上，ContextRAG以30次LLM调用和22,073个令牌构建其索引。相比之下，一个本地HiRAG的重现压力测试在一个20任务的子集上需要870次索引调用和3.54M个令牌，最终在图构建过程中失败；线性外推到130个任务意味着超过2300万个索引令牌。ContextRAG在整体上获得了33.6%的F1分数，在多跳任务上获得了36.8%的F1分数。激活分析表明，在前五个检索到的节点中至少检索到一个格派生节点的查询，其F1分数比未检索到的查询高出3.9个百分点；这种关联是诊断性的，而非因果性的。

View on arXiv Download PDF AI Translation

cs.CL / 42 / 2605.19738

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

TERGAD：结构感知的文本增强表示用于图异常检测

Shi, Wen, Wang, Zhe, Huang, Huafei, Qing, Qing, Xu, Ziqi, Zhang, Qixin, Zhang, Xikun, Luo, Renqiang, Xia, Feng

Abstract

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), A novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.

Chinese Translation

图异常检测（GAD）旨在识别与大多数显著偏离的非典型图实体，如节点、边或子结构。现有的文本丰富方法通常通过使用原始文本特征将结构上下文集成到数据表示管道中，但它们往往忽视节点的结构上下文。这一局限性妨碍了它们检测因节点固有内容与其拓扑角色之间的不一致而产生的复杂异常。为了解决这一问题，我们提出了TERGAD（结构感知的文本增强表示用于图异常检测），这是一种新颖的数据增强框架，通过大型语言模型（LLMs）的语义推理能力来丰富GAD的结构语义。具体而言，TERGAD将节点级拓扑属性转换为描述性的自然语言叙述，随后由LLM处理以推导出高层次的语义嵌入。这些嵌入通过门控双分支自编码器与原始节点属性自适应融合，以共同重构图结构和节点特征。异常分数是基于集成重构误差计算的，有效捕捉可观察属性和LLM知情的语义期望之间的偏差。在六个真实世界数据集上的广泛实验表明，TERGAD始终优于最先进的基线。此外，我们的消融研究验证了结构语义指导的不可或缺性和门控融合机制的有效性。代码可在 https://github.com/Kantorakitty/TERGAD-main 获取。

View on arXiv Download PDF AI Translation

cs.CL / 43 / 2605.19766

Synthesis and Evaluation of Long-term History-aware Medical Dialogue

长期历史感知医疗对话的合成与评估

Hu, Hebin, Dai, Renke, Tan, Ah-Hwee, Kang, Yilin

Abstract

An effective healthcare agent must be able to recall and reason over a patient's longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks-In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning-to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define automatic measures-Faithfulness, Coherence, and Diversity-together with two LLM-based evaluations: Correctness and Realism. Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark's applicability and underscore the need for tailored methods to advance healthcare agents.

Chinese Translation

一个有效的医疗代理必须能够回忆和推理患者的纵向医疗历史。然而，缺乏具有现实长期对话时间线的数据集限制了系统评估。真实的临床文本受到隐私和伦理的约束，而现有基准则专注于孤立的交互，未能捕捉跨会话推理。我们提出了一种框架，通过大语言模型（LLMs）合成高质量的长期医疗对话。我们的方法包括知识引导的分解，分为三个阶段：构建具有多样疾病和并发症轨迹的合成患者档案、为每次就诊生成多轮对话，并将其整合成一个连贯的纵向历史数据集MediLongChat。我们建立了三个基准任务——对话内推理、跨对话推理和合成推理——以评估医疗代理的记忆能力。为了评估数据质量，我们引入了一个多维评估框架，结合基于向量的指标与LLM作为评判者的评估。具体而言，我们定义了自动测量指标——真实性、一致性和多样性，以及两个基于LLM的评估：正确性和现实性。基准实验表明，即使是最先进的LLMs在MediLongChat上也面临挑战。这些发现突显了基准的适用性，并强调了推动医疗代理发展的定制方法的必要性。

View on arXiv Download PDF AI Translation

cs.CL / 44 / 2605.19798

Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

朝着社会互动代理的信任校准：利用大型语言模型研究性别化多模态行为生成

Galland, Lucie, Clavel, Chloé, Ochs, Magalie

Abstract

As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions.

Chinese Translation

随着社会互动代理（SIAs）越来越多地融入日常生活，校准用户对代理实际能力的信任的能力将有助于确保这些代理的适当使用。本文探讨了大型语言模型（LLMs）生成反映不同能力和善意水平的多模态行为（语言、声音、手势和面部表情模态）的能力，这两个维度是信任worthiness的关键。我们提出了一种新颖的方法，自动生成与这些特质特定水平相一致的行为，这是实现细致且经过信任校准的交互的第一步。通过分析由LLMs生成的大量多模态转录数据集，我们证明了GPT-5.4能够在不同模态（文本、语调、面部表情和手势）中产生连贯的行为。使用随机森林特征重要性分析，我们展示了生成的行为与能力和善意的理论预期相一致。然而，我们还发现，当提示中指定性别时，LLMs倾向于再现社会性别刻板印象，将男性代理的行为与高能力关联，将女性代理的行为与高善意关联。为了验证我们的方法，我们在Prolific上进行了用户研究，采用了被试内设计。参与者感知到生成的行为中的不同能力和善意水平与预期指令一致。

View on arXiv Download PDF AI Translation

cs.CL / 45 / 2605.19806

Chunking German Legal Code

德国语法典的分块处理

Prior, Max, Milanova, Natalia, Schultz, Andreas

Abstract

This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure - particularly section and subsection - based retrieval-achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.

Chinese Translation

本文探讨了在德国语法检索增强生成中的分块策略，以德国民法典作为结构化基准语料库。我们实施并比较了一系列分段方法，包括结构单元（章节、子章节、句子、命题）、固定大小窗口、上下文分块、语义聚类、Lumber风格分块和基于RAPTOR的层次检索。所有方法在具有章节级金标准标签的法律问答数据集上进行评估，测量召回率、查询延迟、索引构建时间和存储需求。结果表明，与固有法律结构（特别是章节和子章节）对齐的分块策略在检索中实现了最高的召回率，而更复杂的方法则会覆盖这一结构，表现较差。这些简单的方法在计算效率上也优于如上下文分块、RAPTOR和Lumber等依赖大型语言模型的技术。研究结果强调了语义丰富性与操作成本之间的关键权衡，并表明保持特定领域结构对于有效的法律信息检索至关重要。

View on arXiv Download PDF AI Translation

cs.CL / 46 / 2605.19815

LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

LP-Eval：用于测量法律命题生成质量的评估标准和数据集

Xu, Shanshan, Lindholm, Johan, Raina, Amogh, Olsen, Henrik Palmer, Hershcovich, Daniel

Abstract

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remain under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

Chinese Translation

法律命题生成是法律推理和学术研究的核心，但在法律自然语言处理（Legal NLP）领域仍未得到充分研究。本文探讨了使用大型语言模型（LLMs）从欧洲联盟法院的裁决中自动生成和评估法律命题。我们引入了LP-Eval，这是一个与法律专家共同设计的三步评估标准，将法律命题质量分解为形式有效性和实质维度。基于该评估标准，我们发布了一个数据集，其中包含两位专家对100个LLM生成的法律命题的标注。我们的结果表明，LLMs能够生成主要结构良好且高质量的命题，而专家评估显示，来自成熟案例的命题质量高于来自近期案例的命题。我们进一步考察了LLMs作为评估者的表现，发现基于评估标准的LLM判断与专家评估的吻合度高于直接的整体评分，但对人类专家捕捉的更细微的区分仍然不敏感。

View on arXiv Download PDF AI Translation

cs.CL / 47 / 2605.19848

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

CLIF：用于透明瓶颈模型的概念级影响函数

Sun, Yike, Xu, Mingkun, You, Mu, He, Zhongzhi, Shen, Henghua, Tan, Zehan, Wong, Derek F., Fang, Tao

Abstract

In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.

Chinese Translation

近年来，深度学习模型的黑箱特性限制了它们在医疗诊断和金融等高风险领域的应用，而这些领域对可解释性至关重要。为了解决这一问题，我们提出了一种新颖的方法，利用影响函数在样本和概念层面增强自然语言处理（NLP）模型的可解释性。在CEBaB和Yelp数据集上的实验表明，影响函数能够有效识别对模型预测影响最大的训练样本，包括有益和有害样本。通过调整这些样本的标签和权重，我们证明了模型性能可以在不重新训练的情况下恢复到基线水平，确认了影响函数在高效数据调试中的价值。此外，我们的概念级分析识别了概念瓶颈模型（Concept Bottleneck Models, CBM）中对预测有显著影响的关键概念。修改这些概念会明显改变模型行为，为决策过程提供清晰的洞察。

View on arXiv Download PDF AI Translation

cs.CL / 48 / 2605.19852

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

工具总是有益的吗？学习如何适应性地调用工具以进行双模式多模态大语言模型推理

Ma, Qinghe, Zhao, Zhen, Wu, Yiming, Zhang, Jian, Bai, Lei, Shi, Yinghuan

Abstract

Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8\% accuracy gain on V* benchmark compared to the base model, and a 44.9\% improvement in efficiency over existing tool-augmented methods on POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.

Chinese Translation

工具增强推理已成为提升多模态大语言模型（MLLM）推理能力的一个有前景的方向。然而，现有研究主要集中在使模型能够执行工具调用，而忽视了调用工具的必要性。我们认为，工具的使用并不总是有益的，因为冗余或不当的调用会大幅增加推理开销，甚至误导模型预测。为了解决这个问题，我们引入了AutoTool，一个根据每个查询的特征自适应决定是否调用工具的模型。在强化学习框架内，我们设计了一种明确的双模式推理策略，结合特定模式的奖励函数，引导模型产生准确的响应。此外，为了防止过早偏向单一推理模式，AutoTool在整个训练过程中共同探索和平衡工具辅助推理与文本中心推理，并在后期阶段促进自由探索。大量实验表明，AutoTool表现出卓越的性能和高效性，在V*基准测试中相比基础模型实现了21.8%的准确率提升，在POPE基准测试中相比现有工具增强方法实现了44.9%的效率提升。代码可在 https://github.com/MQinghe/AutoTool 获取。

View on arXiv Download PDF AI Translation

cs.CL / 49 / 2605.19908

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

基于编码器的语言模型中作者信号的出现位置

Kulumba, Francis, Vimont, Guillaume, Romary, Laurent, Cafiero, Florian

Abstract

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, hence the gap not coming from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

Chinese Translation

作者归属模型在使用相同的预训练编码器、数据和损失函数进行微调时，其性能可能因评分机制的不同而相差四倍。我们使用机械解释工具来解释这一差距。诸如词长、标点密度和功能词频率等风格特征在每个模型的每一层中均可获得，包括一个现成的控制编码器，因此这一差距并非源于表示质量。相反，因果干预表明，评分器决定了编码器整合作者信号的位置。均值池化强制在早期到中层进行整合，而晚期交互则将其推迟到后期层。我们进一步从每个评分器的梯度结构推导出这一差异，训练动态揭示了由此差异引发的不同学习轨迹。

View on arXiv Download PDF AI Translation

cs.CL / 50 / 2605.19936

What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience

大型语言模型对科学传播的影响：写作实践和阅读体验的变化测量

Miletić, Filip, Falk, Neele

Abstract

Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024); and a synthetic dataset of 3,000 human-written passages and their LLM-generated improvements. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM-modified texts more frequently contain certain syntactic constructions, more complex and longer words and a lower lexical diversity. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts. They overall rate LLM-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI-assisted writing on reading experience.

Chinese Translation

随着大型语言模型在写作过程中的日益使用，科学传播的风格是否发生了变化？我们在自然语言处理领域探讨了这一问题，利用我们创建的两个数据资源：一个来自ACL文集（2020-2024）超过37,000篇论文的自然语料库；以及一个包含3,000段人类撰写文本及其LLM（大型语言模型）生成改进的合成数据集。我们首先实施了一系列历时性词汇分析，显示出词频和使用语境随时间显著变化，表明在某些情况下存在语义专业化，而在其他情况下则表现为一般化。拓宽我们的视角后，我们建模了一系列更复杂的风格特征，发现LLM修改的文本更频繁地包含某些句法结构，使用更复杂和更长的单词，并且词汇多样性较低。最后，我们通过与20位领域专家的初步注释研究，将这些写作实践的变化与主观阅读体验联系起来。他们整体上认为LLM改进的文本更易理解和更具吸引力，但也对LLM表达了负面的定性态度，突显了AI辅助写作对阅读体验的强烈主观影响。

View on arXiv Download PDF AI Translation

cs.CL / 51 / 2605.19952

Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

重新思考记忆方式：超越终身 LLM 代理记忆中的原子事实

Sun, Jingwei, Zhu, Jianing, Yao, Jiangchao, Liu, Tongliang, Han, Bo

Abstract

To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted fact based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities, including raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updating. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at https://TMLR-TriMem.github.io .

Chinese Translation

为了实现可靠的长期交互，LLM 代理需要一个能够真实存储、高效检索和深入推理累积对话历史的记忆系统。现有的大多数方法采用提取事实为基础的范式：手工制作的静态提示将原始对话压缩为原子事实，然后将其存储、匹配并注入到下游推理中。然而，这种以事实为中心的设计不可避免地丢弃了原始对话中的细节，并未能支持对分散的孤立事实进行深入推理。此外，静态提示无法在多样的对话风格中保持一致的提取粒度。为了解决这些局限性，我们提出了 TriMem，它维护三种共存的表示粒度，包括以源标识符为基础的原始对话片段以确保存储的保真性、用于高效记忆检索的提取原子事实，以及将分散事实聚合为整体语义理解以支持深入推理的合成概况。我们进一步采用基于 TextGrad 的提示优化，通过响应质量反馈迭代精炼提取和概况提示，实现终身演化而无需任何参数更新。在多个 LLM 主干上的 LoCoMo 和 PerLTQA 的广泛实验表明，TriMem 始终优于强大的记忆基线。代码可在 https://TMLR-TriMem.github.io 获取。

View on arXiv Download PDF AI Translation

cs.CL / 52 / 2605.20022

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

FlexDraft：通过注意力调优和奖励引导校准实现灵活的推测解码

Zhang, Yaojie, Huang, Jianuo, Ke, Junlong, Han, Yuhang, Long, Yongji, Zhao, Tianchen, Qi, Biqing, Zhang, Linfeng

Abstract

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.

Chinese Translation

推测解码通过使用快速草拟器提出多个候选标记，并让目标模型并行验证它们，从而在不降低质量的情况下加速内存受限的大型语言模型（LLM）推理。然而，传统的顺序推测解码在草拟和验证之间存在相互等待的问题，且中间状态的重复交换进一步增加了内存访问开销。并行推测解码通过在单次目标前向传播中同时进行草拟和验证来解决这一限制，使得在当前候选标记被验证时可以准备未来的草拟。尽管在小批量情况下效果显著，现有的并行推测解码方法要么需要昂贵的持续预训练并伴随质量下降，要么面临低接受率。更重要的是，这种范式本质上在奖励标记和接受长度上都存在不确定性，导致草拟验证不匹配，并使得在大批量情况下吞吐量提升崩溃。为了解决这些限制，我们提出了FlexDraft，一种无损的推测解码框架，通过三个关键设计灵活适应不同的批量大小。(1) 注意力调优通过仅调节最后几层的注意力投影器在掩码标记上实现块扩散草拟，同时保持自回归路径不变以保留目标分布，并以最小的可训练参数生成高质量草拟。(2) 奖励引导校准使用一个轻量级的多层感知机（MLP），以已解决的奖励标记为条件来校准草拟对数值，减轻由于奖励标记不确定性导致的草拟验证不匹配。(3) 灵活解码在小批量情况下动态切换并行草拟和验证，在大批量情况下顺序草拟然后验证，并根据草拟置信度调整验证长度，以消除冗余计算。

View on arXiv Download PDF AI Translation

cs.CL / 53 / 2605.20043

Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

关注你的音节：基于正字法的神经日语形态生成错误分析

Zhang, Wen

Abstract

We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.

Chinese Translation

我们提出了一种基于正字法的日语过去时形态屈折错误分析，将平假名视为不仅仅是转录媒介，而是编码形态音位学区分的表征系统，这可能影响模型的泛化能力。我们评估了两种字符级序列到序列架构在过去时形成上的表现，使用根据SIGMORPHON 2020和2023共享任务规范格式化的数据集。尽管整体准确率较高，但模型表现出系统性的、具有语言学可解释性的错误，这些错误集中在平假名的特定正字法特性上。我们引入了一种简明的错误分类法，捕捉七种主要的失败模式，并提供了定量和定性的分析。与音节相关的错误占据了残余错误的主导地位，约占75-80%的错误，特别是在词干以元音e结尾并在过去时后缀之前需要音节化的动词中。错误模式在不同架构和随机种子之间保持高度一致，表明正字法表征、形态结构和数据频率效应之间存在强有力的相互作用，从而塑造模型的泛化能力。这些结果强调了在理解形态复杂语言中的神经泛化时，进行基于正字法的评估的必要性。

View on arXiv Download PDF AI Translation

cs.CL / 54 / 2605.20050

Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

语言变异维持阴谋论在社交媒体上的持续性

Cheng, Calvin Yixiang, Quelle, Dorian, Hale, Scott A.

Abstract

This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, risk- and health- related vocabularies, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns: simplification and assimilation, at both linguistic and AAT structural levels. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed lights on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims that can address their potential variations.

Chinese Translation

本研究探讨了语言变异如何影响阴谋论在社交媒体上的持续传播。基于来自X的三年阴谋相关帖子数据集，并结合计算语言学分析与生存模型，我们发现具有更大语义变异的阴谋主张具有显著更长的生命周期。心理语言学特征的变异，包括代词、社会参考词、认知过程术语以及与风险和健康相关的词汇，与延长的生命周期相关。行为者、行动和目标（AAT）类别的变异也与更长的生命周期相关。定性分析识别出两种主要的变异模式：简化和同化，分别在语言和AAT结构层面上。综合来看，这些结果加深了我们对语言变异如何促进在线阴谋论持续性的理解，并为长期内容管理策略提供了启示。我们认为，内容管理应考虑阴谋主张的可变性，并关注能够应对其潜在变异的核心主张。

View on arXiv Download PDF AI Translation

cs.CL / 55 / 2605.20052

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

PromptRad：知识增强的低资源放射学报告多标签提示调优

Lin, Ying-Jia, Lo, Tzu-Chin, Li, Ping-Chien, Cheng, Chi-Tung, Liao, Chien-Hung, Kao, Hung-Yu

Abstract

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label \textbf{prompt}-tuning approach for \textbf{rad}iology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.

Chinese Translation

自动报告标注有助于从非结构化文本中识别临床发现，并为医学影像研究提供大规模注释。现有的基于规则的标注工具在处理临床报告中的多样化描述时面临挑战，而微调预训练语言模型（PLMs）则需要大量标注数据，这在临床环境中往往难以获得。本文提出了PromptRad，一种在低资源环境下用于放射学报告标注的知识增强多标签提示调优方法。PromptRad将多标签分类重新表述为掩码语言建模，并将UMLS Metathesaurus中的同义词纳入多词表述器，以丰富类别表示。通过在不增加额外分类层的情况下微调PLM，PromptRad所需的标注数据显著少于传统微调方法。在肝脏CT报告上的实验表明，PromptRad在仅使用32个标注训练样本的情况下，优于基于字典和微调的基线，并且尽管使用了更小的模型，仍与GPT-4达成了竞争性表现。进一步分析表明，PromptRad在捕捉复杂否定模式方面比现有方法更为有效，使其成为数据稀缺临床场景下报告标注的有前景解决方案。我们的代码可在 https://github.com/ila-lab/PromptRad 获取。

View on arXiv Download PDF AI Translation

cs.CL / 56 / 2605.20061

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

奖励信念，而非行为：基于一致性的长时程代理信用分配

Tang, Wenjie, Li, Minne, Huang, Sijie, Xiao, Liquan, Zhou, Yuan

Abstract

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.

Chinese Translation

基于可验证奖励的强化学习（RLVR）是一种有前景的范式，用于提升大型语言模型（LLM）代理在长时程交互任务中的表现。然而，在部分可观察环境中，不完整的观察会导致代理信念随时间漂移，而延迟奖励则模糊了中间决策的因果影响，加剧了时间信用分配的挑战。为了解决这一问题，我们提出了ReBel（奖励信念），一种过程级强化学习算法，它明确建模结构化信念状态，以总结交互历史并指导后续的策略学习。ReBel引入了信念一致性监督，将预测信念与观察反馈之间的差异转化为密集的自监督信号，而无需外部逐步注释或验证者。它还采用信念感知分组来比较相似信念状态下的轨迹，从而产生更稳健且方差更低的优势估计。我们在具有挑战性的长时程基准测试上评估了ReBel，包括ALFWorld和WebShop。ReBel在任务成功率上比基于回合的基线GRPO提高了最多20.4个百分点，并将样本效率提高了2.1倍。这些结果表明，信念感知自监督是部分可观察环境下可靠的长时程决策的一个有前景的方向。代码可在：https://github.com/Fateyetian/Rebel.git获取。

View on arXiv Download PDF AI Translation

cs.CL / 57 / 2605.20066

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

基于强化学习的文本到SPARQL生成：在DBLP上的GRPO方法

Pfeifer, Jann, Banerjee, Debayan, Usbeck, Ricardo

Abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.

Chinese Translation

知识图谱问答旨在将自然语言问题转换为可在知识图谱上执行的查询，但现有方法通常依赖于大型模型或以金标准查询注释形式的完全监督。本研究探讨了基于结果的奖励的强化学习是否能够训练一个小型的指令调优语言模型，在学术领域执行零-shot文本到SPARQL生成。我们将群体相对策略优化（Group-Relative Policy Optimization, GRPO）应用于Qwen3-1.7B模型，使用结合自然语言问题与关于实体和关系的符号提示的提示。训练依赖于执行反馈、结构约束和答案级奖励，并包含一个额外变体，结合了基于金标准查询的塑形。将得到的模型与未修改的零-shot基线以及经过监督的DoRA微调基线进行比较，评估指标包括答案级准确性、执行准确性、类别得分和对保留模板的泛化能力。GRPO在零-shot基线之上显著提高了性能，并展现出竞争力的泛化能力，而监督的DoRA微调在相同模型规模上实现了更高的整体准确性。消融分析表明，基于执行的奖励占据了大部分增益，额外的塑形带来的好处有限，这表明在缺乏金标准查询进行标记级监督的情况下，基于结果的强化学习是一种可行的训练策略。

View on arXiv Download PDF AI Translation

cs.CL / 58 / 2605.20075

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

CopT：用于一般和智能推理的连续空间对比在线思维

Shi, Dachuan, Zhu, Hanlin, Yuan, Xiangchi, Zhao, Wanjia, Xia, Kejing, Xiao, Wen, Lee, Wenke

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.

Chinese Translation

链式思维（Chain-of-thought, CoT）是从大型语言模型（Large Language Models, LLMs）中引发推理能力的标准方法。然而，常见的 CoT 范式将思考视为回答的前提，这可能会延迟获得合理答案的时机，并在模型能够在扩展思考之前识别答案时产生不必要的令牌成本，这种行为被称为表演性推理。在本文中，我们介绍了 CopT，一种重新构建的推理管道，它颠倒了思考和回答的通常顺序。CopT 不是在回答之前进行思考，而是首先引发一个草拟答案，然后基于其自身的草拟答案进行后续的在线思考，以进行反思和修正。为了评估草拟答案是否值得信赖，CopT 将连续嵌入重新构建为推理时的对比验证器。具体而言，它对比模型在离散令牌输入和连续嵌入输入下对同一生成令牌的支持，从而产生一个序列级的逆 KL 估计器来评估答案的可靠性。我们的分析表明，在某些假设下，期望估计等于未解决潜在状态与发出的答案令牌之间的互信息，这解释了它为何捕捉与答案相关的不确定性，而不是潜在状态中的任意不确定性。当答案被认为不够可靠时，CopT 进行进一步的在线思考，其中第二个 KL 估计器动态控制草拟答案的可见性，保留有用的部分信息，同时降低被不可靠内容误导的风险。在数学、编码和智能推理任务中，CopT 的峰值准确率提高了最多 23%，并在可比或更高的准确率下将令牌使用量减少了最多 57%，且无需任何额外训练。代码可在 https://github.com/sdc17/CopT 获取。

View on arXiv Download PDF AI Translation

cs.CL / 59 / 2605.20084

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

BalanceRAG：级联检索增强生成的联合风险校准

Jia, Zijun, Ye, Yuanchang, Jia, Sen, Qian, Yiyao, Wang, Haoning, Chen, Baojie, Tang, Diyin, Yu, Jinsong, Wang, Zhiyuan

Abstract

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.

Chinese Translation

大型语言模型（LLMs）可以通过检索增强生成（RAG）提高事实性，但在模型单独回答可靠的情况下，对每个查询应用 RAG 是不必要的。这促使了级联 RAG 的提出：每个查询首先由 LLM 单独分支处理，仅在主分支不确定时升级到 RAG 备用方案，而在两个分支都不够可信时则放弃。然而，逐阶段校准这样的级联可能过于保守，因为最终效用依赖于 LLM 单独分支和 RAG 的联合不确定性阈值。在本研究中，我们开发了 BalanceRAG，以在目标风险水平下认证阈值对。给定来自两个分支的不确定性分数，BalanceRAG 将每个阈值对框架化为二维格上的操作点，并使用顺序图形测试识别安全操作点。这使得风险自适应阈值校准成为可能，控制接受点之间的系统级错误率，同时保留更多示例。此外，BalanceRAG 扩展到多风险校准，允许检索使用与选择条件风险共同界定。针对多个 LLM 基础模型的三个开放域问答（QA）基准的实验表明，BalanceRAG 达到了规定的风险水平，保留了更高的覆盖率和更多被接受的正确示例，并且与始终开启的 RAG 相比，减少了不必要的检索调用。

View on arXiv Download PDF AI Translation

cs.CL / 60 / 2605.20087

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

ThoughtTrace：理解现实世界中用户在大型语言模型交互中的思维

Jin, Chuanyang, Li, Binze, Xie, Haopeng, Fang, Cathy Mengying, Li, Tianjian, Longpre, Shayne, Gu, Hongxiang, Chen, Maximillian, Shu, Tianmin

Abstract

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human--AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human--AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.

Chinese Translation

对话式人工智能现已拥有数十亿用户，但现有数据集仅捕捉人们所说的话，而非他们的思考。我们引入了ThoughtTrace，这是第一个将现实世界中的多轮人机对话与用户自我报告的思维相结合的大规模数据集：用户发送提示的原因及对助手回应的反应。ThoughtTrace包含1,058名用户、2,155次对话、17,058个轮次和10,174个思维注释，数据收集覆盖20种语言模型。我们的分析表明，ThoughtTrace捕捉了长期的、主题多样的交互，且思维在语义上与消息是不同的，前沿大型语言模型难以从上下文中推断，内容多样，并与对话阶段相关。我们进一步展示了思维在下游建模中的实用性。首先，思维作为推理时的上下文改善了用户行为预测。其次，基于思维的重写为训练个性化助手提供了细粒度的对齐信号。总之，ThoughtTrace将用户思维确立为研究人机交互背后认知动态的新数据模态，并为构建更好理解和适应用户潜在目标、偏好和需求的助手奠定了基础。

View on arXiv Download PDF AI Translation

cs.CL / 61 / 2605.20128

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

MixRea：大型语言模型中显性-隐性推理的基准测试

Cai, Yuanqing, Huang, Ziyi, Liu, Minhao, Duan, Lixin, Li, Wen, Zhang, Yanru

Abstract

Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of \emph{inattentional blindness} in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: \emph{failing to attend to subtle yet important contextual cues under explicit task instructions}. To evaluate this, we introduce the task of \textbf{explicit-implicit reasoning} and present \textbf{MixRea}, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8\% consistency, revealing widespread inattentional blindness. To mitigate this, we propose \textbf{Potential Relation Completion Prompting (PRCP)}, a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

Chinese Translation

大型语言模型（LLMs）越来越多地被应用于高风险决策中。受人类认知中 extit{注意盲区}理论的启发，我们研究了在嵌入注意偏差的人类偏好语料上训练的LLMs是否表现出类似的局限性： extit{在显性任务指令下未能关注微妙但重要的上下文线索}。为此，我们引入了 extbf{显性-隐性推理}的任务，并呈现了 extbf{MixRea}，这是一个包含2,246个多项选择题的基准，涵盖9种推理类型，具有不同的显性和隐性信息分布。对21个先进LLMs的评估显示，即使是表现最好的推理模型（Gemini 2.5 Pro）也仅达到42.8 ext{%}的一致性，揭示了广泛存在的注意盲区。为了解决这一问题，我们提出了 extbf{潜在关系补全提示（PRCP）}，这是一种通过恢复被忽视的因果关系来改善推理的提示方法。进一步分析表明，这一局限性在多种多源推理任务中依然存在，突显了对更符合认知的模型的需求。

View on arXiv Download PDF AI Translation

cs.CL / 62 / 2605.20149

Less Back-and-Forth: A Comparative Study of Structured Prompting

减少反复交互：结构化提示的比较研究

Ghosh, Saurav, Polach, Gabriella, Sow, Abdou

Abstract

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types--summarization, planning, explanation, and coding--using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

Chinese Translation

大型语言模型（LLMs）广泛应用于开放式任务，但不明确的提示可能导致低质量的回答和额外的交互。本文研究了结构化提示设计是否能提高响应质量，同时减少用户的努力。我们比较了三种提示条件：原始提示、经过检查表改进的提示和澄清问题提示。我们在四种任务类型（摘要、规划、解释和编码）上评估这些条件，使用了三个LLM系统：ChatGPT、Claude和Grok。每个输出都使用统一的评分标准进行评分，涵盖任务完成度、正确性、合规性和清晰度。经过检查表改进的提示获得了最高的平均评分，得分为7.50（满分8分），而原始提示得分为5.67，澄清问题提示得分为6.67。检查表提示还在质量与努力的权衡中表现最佳，使用的平均标记数少于原始提示和澄清提示。这些结果表明，简单的提示检查表可以改善LLM的响应，同时减少不必要的交互。

View on arXiv Download PDF AI Translation

cs.CL / 63 / 2605.20170

KoRe: Compact Knowledge Representations for Large Language Models

KoRe：大型语言模型的紧凑知识表示

Cavicchini, Davide, Giunchiglia, Fausto, Staiano, Jacopo

Abstract

Modern Large Language Models (LLMs) have shown impressive performances in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world-knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into a LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performances coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.

Chinese Translation

现代大型语言模型（LLMs）在用户面向的任务（如问答）中表现出色，并且在推理能力上持续改善。然而，这些模型编码知识的方式似乎固有缺陷：根据设计，LLMs将世界知识编码在其参数中。这种知识表示方式本质上是不透明的，难以调试和更新，并且容易产生幻觉。另一方面，知识图谱可以提供人类可读且易于编辑的世界知识表示，其在知识密集型任务中的应用一贯证明对下游性能有益。然而，目前的集成技术需要大量的重新训练或微调。为了解决这个问题，我们提出了KoRe，一种将1-hop子图编码为紧凑离散知识令牌并注入到LLM骨干网络中的方法。我们在三个已建立的基准测试上测试了所提出的方法，并报告了具有竞争力的性能，同时令牌使用量显著减少（最多减少10倍）。我们的结果表明，紧凑的离散知识图谱表示可以有效且高效地用于支持现代LLMs。

View on arXiv Download PDF AI Translation

cs.CL / 64 / 2605.20176

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

ClinSeekAgent：自动化多模态证据寻求以支持自主临床推理

Wu, Juncheng, Zhang, Letian, Wang, Yuhan, Tu, Haoqin, Chen, Hardy, Wang, Zijun, Xie, Cihang, Zhou, Yuyin

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

Chinese Translation

大型语言模型（LLMs）和自主系统在临床决策支持方面展现出潜力，但现有研究大多假设证据已经被整理并提供给模型。现实世界的临床工作流程则要求代理主动寻求、迭代规划并从异构来源综合多模态证据。本文介绍了ClinSeekAgent，一个动态多模态证据寻求的自动化自主框架，改变了从被动证据消费到主动证据获取的范式。ClinSeekAgent仅需临床查询和对原始数据源的访问，通过查询医学知识库、导航原始电子健康记录（EHR）和调用医学影像工具来收集证据；随着新信息的出现不断完善其假设；并将收集到的证据整合到基于事实的临床决策中。ClinSeekAgent既可以作为前沿LLMs的推理时代理，也可以作为训练时管道，将高质量的自主轨迹提炼为紧凑的开源模型。为了验证其推理时的有效性，我们构建了ClinSeek-Bench，将固定预选证据的整理输入推理与原始临床数据的自动证据寻求进行配对。在仅文本的EHR任务中，ClinSeekAgent使Claude Opus 4.6的整体F1从60.0提高到63.2，使MiniMax M2.5从43.1提高到47.3，在9个评估的主机模型中，有7个模型的风险预测均有所提升。在多模态任务中，ClinSeekAgent使Claude Opus 4.6的F1从47.5提高到62.6（+15.1）；所有评估模型在三个与胸部X光（CXR）相关的任务组中均有所提升。我们进一步通过将自主证据寻求轨迹提炼为ClinSeek-35B-A3B来验证ClinSeekAgent作为训练管道的有效性，该模型在现有的AgentEHR-Bench上实现了34.0的平均F1，相较于其Qwen3.5-35B-A3B基线提高了11.9分，并接近Claude Opus 4.6。

View on arXiv Download PDF AI Translation

cs.CL / 65 / 2605.20177

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

从感知到思考：解耦感知与推理提高视觉语言模型的后训练效果

Wu, Juncheng, Chen, Hardy, Tu, Haoqin, Tang, Xianfeng, Shi, Freda, Liu, Hui, Lu, Hanqing, Xie, Cihang, Zhou, Yuyin

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.

Chinese Translation

最近在视觉语言模型（VLMs）方面的进展强调了长链推理的重要性；然而，我们发现它们在视觉任务上的表现主要受到视觉感知不足的限制，而非推理本身。在本研究中，我们系统地研究了VLM后训练中感知与推理之间的相互作用，通过将其能力分解为三个独立的训练阶段：视觉感知、视觉推理和文本推理，并结合专门的训练数据。我们证明了视觉感知（a）需要通过专门数据进行有针对性的优化；（b）作为一个基本的支架，应通过分阶段训练来巩固，然后再细化视觉推理；（c）通过强化学习（RL）学习效果优于基于标题的监督微调（SFT）。我们在多个VLM上进行的实验表明，分阶段训练在视觉感知和推理性能上始终优于合并训练。值得注意的是，采用我们方法训练的模型在推理准确率上提高了1.5%，推理轨迹缩短了20.8%，这表明更优秀的感知减少了对过度推理的需求。此外，我们展示了这种基于能力的分阶段训练代表了一种新的课程维度，与传统的基于难度的课程正交，二者结合可获得进一步的增益。我们的分阶段训练模型在开放权重VLM中表现优越，在多个视觉数学和感知任务上（例如，WeMath提高了5.2%，RealWorldQA提高了3.7%）相较于基础模型建立了先进的结果。

View on arXiv Download PDF AI Translation

cs.CL / 66 / 2605.20179

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

TIDE：具有I/O感知专家卸载的高效无损MoE扩散LLM推理

Chen, Zhiben, Zhao, Youpeng, Sui, Yang, Wang, Jun, Shang, Yuzhang

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within the block. Specifically, we leverage the temporal stability of expert activations during the diffusion process within the block and introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4$\times$ and 1.5$\times$ throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.

Chinese Translation

扩散大语言模型（dLLMs）已成为自回归（AR）模型的竞争性替代方案，提供了更好的硬件利用率和通过并行块级解码实现的双向上下文。然而，随着dLLMs在混合专家（MoE）架构下的不断扩展，它们在资源受限设备上的部署仍然是一个未解决的挑战。现有的基于AR的方法往往会产生高昂的I/O开销或显著的计算瓶颈。在本研究中，我们提出了TIDE，一种新颖的资源高效推理系统，利用了扩散过程中专家激活的时间稳定性。具体而言，我们利用了扩散过程中专家激活的时间稳定性，并引入了一种基于时间间隔的专家刷新策略，以I/O感知的方式更新专家位置。为了确保最佳性能，我们将推理调度形式化为一个数学规划问题，求解出最小化I/O流量和CPU计算的最优时间间隔。最重要的是，TIDE是一种无损优化，不需要模型训练，为dLLM推理提供了“免费午餐”加速。在单GPU-CPU系统中，我们展示了TIDE在LLaDA2.0-mini和LLaDA2.0-flash模型上分别实现了高达1.4倍和1.5倍的吞吐量提升，相较于之前的基线。

View on arXiv Download PDF AI Translation

arXiv Papers

KG-ASG: Collision-Knowledge-Guided Closed-Loop Adversarial Scenario Generation With Primary-Support Attribution

Geo-Data-Driven HD Map Generation Workflow with Integrated Reference-Free Constraint-Based Verification

Adversarial Stress Testing of SPARK Humanoid Safety Filters

Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

CosFly: Plan in the Matrix, Fly in the World

Automatically Improving Simulation Physics for Articulated Objects

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

Bilateral Teleoperation with Compliant 6-DOF Pose-and-Force Sensing

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

Neuromorphic Control of a Flapping-Wing Robot on Resource-Constrained Hardware

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

Implicit Action Chunking for Smooth Continuous Control

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives

Justifying bio-inspired robotics research: A taxonomy of strategies

Trajectory Planning and Control near the Limits: an Open Experimental Benchmark on the RoboRacer Platform

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

Minimalist Visual Inertial Odometry

Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

Hamilton--Jacobi Reachability for Spacecraft Collision Avoidance

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

Harnessing Self-Supervised Features for Art Classification

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Personalized Face Privacy Protection From a Single Image

LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Efficient coding along the visual hierarchy

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

Smartphone-based Circular Plot Sampling for Forest Inventory

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

PhyWorld: Physics-Faithful World Model for Video Generation

Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

Distribution Matching Distillation without Fake Score Network

FPED: A Functional-Network Prior-Guided Mixture-of-Experts Framework for Interpretable Brain Decoding

What Makes Synthetic Data Effective in Image Segmentation

iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

MMGS: 10$\times$ Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation