arXiv Daily Digest

366 Papers

Modular Lie Algebraic PDE Control of Multibody Flexible Manipulators

多体柔性操纵器的模块化李代数偏微分方程控制

Yaqubi, Sadeq, Mattila, Jouni

Abstract

This paper addresses PDE-based control for flexible multibody robotic systems, presenting a subsystem-based framework for serial manipulators with arbitrary links in 3D space. The approach uses a screw-theoretic Lie-algebraic model where motion, deformation, and forces are expressed as body-fixed twists and wrenches in se(3). By substituting a strain-based deformation PDE into the dynamics, distributed elastic acceleration is eliminated, yielding a model governed by twist acceleration and the deformation field. Subsystem twist trajectories are generated from task-space endpoints via deflection-compensating inverse kinematics, providing real-time correction for tip deformation. A nominal controller for each link ensures exponential decay of twist errors via a Lyapunov function nu_i. An adaptive modification replaces physical parameters with online estimates, establishing exponential convergence of both twist and parameter errors. Summing over all links, composite Lyapunov functions V = sum(nu_i) and V^a = sum(nu_i^a) yield time derivatives where inter-link interaction power terms telescope to zero. This cancellation is ensured by Newton's third law and the frame invariance of the power pairing on se(3) x se*(3), establishing global exponential convergence of tracking errors. Bounded elastic deformation is guaranteed by an Euler-Bernoulli energy argument. The screw-theoretic structure renders interaction cancellation exact, making the stability certificate modular and scalable to chains of arbitrary length. Numerical simulations demonstrate the scheme's physical consistency.

Chinese Translation

本文探讨了基于偏微分方程（PDE）的柔性多体机器人系统的控制，提出了一种针对三维空间中任意连杆串联操纵器的子系统框架。该方法采用了螺旋理论李代数模型，其中运动、变形和力被表示为在 se(3) 中的固定体扭转和扭矩。通过将基于应变的变形 PDE 代入动力学中，消除了分布式弹性加速度，从而得到一个由扭转加速度和变形场控制的模型。子系统扭转轨迹通过偏转补偿逆运动学从任务空间端点生成，为末端变形提供实时修正。每个连杆的名义控制器通过 Lyapunov 函数 nu_i 确保扭转误差的指数衰减。自适应修改用在线估计替代物理参数，建立扭转和参数误差的指数收敛。对所有连杆求和，复合 Lyapunov 函数 V = sum(nu_i) 和 V^a = sum(nu_i^a) 产生时间导数，其中连杆间的相互作用功率项望远镜效应消失。这一消除是由牛顿第三定律和在 se(3) x se*(3) 上功率配对的框架不变性所确保的，从而建立了跟踪误差的全局指数收敛。通过欧拉-伯努利能量论证保证了有界的弹性变形。螺旋理论结构使得相互作用消除精确，从而使稳定性证明具有模块化特性，并可扩展到任意长度的链。数值仿真展示了该方案的物理一致性。
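
The telescoping argument summarized in the abstract can be sketched schematically. This is an illustrative reading, not the paper's derivation: the per-link decay rates $\lambda_i$, the interaction power $p_i$ exchanged across joint $i$, and the boundary conditions $p_1 = p_{n+1} = 0$ are notation and assumptions introduced here.

```latex
% Assumed per-link inequality: link i receives power p_i from its predecessor
% and passes p_{i+1} to its successor (equal and opposite by Newton's third law).
\dot{\nu}_i \le -\lambda_i \nu_i + p_i - p_{i+1}, \qquad i = 1, \dots, n

% Summing over the chain, interior interaction terms cancel pairwise:
\dot{V} = \sum_{i=1}^{n} \dot{\nu}_i
        \le -\sum_{i=1}^{n} \lambda_i \nu_i + \left( p_1 - p_{n+1} \right)

% With the assumed boundary conditions p_1 = p_{n+1} = 0 and
% \lambda = \min_i \lambda_i:
\dot{V} \le -\lambda V
\quad \Longrightarrow \quad
V(t) \le V(0)\, e^{-\lambda t}
```

Under these assumptions the bound does not depend on the number of links $n$, which is what makes a certificate of this shape modular.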

cs.RO / 2 / 2605.06759

An Aerial Manipulator for Perception-Driven Flower Targeting Toward Contactless Pollination in Vertical Farming

一种用于感知驱动花朵定位的空中操控器，旨在实现垂直农业中的无接触授粉

Jin, Chenzhe, Wu, Zhuohang, Cai, Yifan, Li, Xiangqi, Tan, Jan Ming Kevin, Kemsaram, Narsimlu, Modugno, Valerio

Abstract

The decline of natural pollinators has created a major challenge for crop production in controlled indoor agriculture, particularly in vertical farming environments where natural insect pollination is absent. This motivates the development of robotic systems capable of performing precise flower-targeting tasks while minimizing physical interference with delicate floral structures. This paper presents an aerial manipulator platform for perception-driven flower detection, localization, and approach in vertical farming environments. The proposed system integrates onboard RGBD-based perception, model predictive path integral (MPPI) based unmanned aerial vehicle (UAV) control on a PX4 platform, and a lightweight 2-DoF manipulator for precise end-effector positioning. The platform is evaluated in both MuJoCo simulation and UAV lab experiments using a flower-targeting testbed. The experimental results demonstrate stable UAV flight, reliable flower localization, and centimeter-level end-effector positioning accuracy. In simulation, the proposed controller achieves consistent trajectory convergence and accurate target alignment. In the real-world UAV lab environment, the integrated perception-control-manipulation framework enables stable flower-targeted positioning and end-effector alignment under constrained aerial operation. These results validate the proposed aerial manipulator as a robust robotic carrier and positioning framework for future contactless pollination systems. While the current study focuses on perception-guided targeting and positioning, the developed platform provides a practical foundation for integrating advanced contactless end effectors, including acoustic-based pollen manipulation modules, in future work.

Chinese Translation

自然授粉者的减少给受控室内农业中的作物生产带来了重大挑战，尤其是在缺乏自然昆虫授粉的垂直农业环境中。这促使开发能够执行精确花朵定位任务的机器人系统，同时尽量减少对脆弱花卉结构的物理干扰。本文提出了一种空中操控平台，用于在垂直农业环境中进行感知驱动的花朵检测、定位和接近。所提出的系统集成了基于RGBD的感知、基于模型预测路径积分（MPPI）的无人机（UAV）控制（在PX4平台上）以及用于精确末端执行器定位的轻量级2自由度操控器。该平台在MuJoCo仿真和UAV实验室实验中进行了评估，使用了花朵定位测试平台。实验结果表明，UAV飞行稳定，花朵定位可靠，末端执行器定位精度达到厘米级。在仿真中，所提出的控制器实现了一致的轨迹收敛和准确的目标对齐。在实际的UAV实验室环境中，集成的感知控制操控框架在受限的空中操作下实现了稳定的花朵目标定位和末端执行器对齐。这些结果验证了所提出的空中操控器作为未来无接触授粉系统的稳健机器人载体和定位框架的有效性。尽管当前研究集中于感知引导的定位和对齐，但所开发的平台为未来工作中集成先进的无接触末端执行器（包括基于声学的花粉操控模块）提供了实用基础。

cs.RO / 3 / 2605.06863

Bi3: A Biplatform, Bicultural, Biperson Dataset for Social Robot Navigation

Bi3：一个双平台、双文化、双人数据集用于社交机器人导航

Stratton, Andrew, Singamaneni, Phani Teja, Goyal, Pranav, Alami, Rachid, Mavrogiannis, Christoforos

Abstract

We contribute Bi3, a dataset of social robot navigation among groups of people in a constrained lab space. Compared to prior data collection efforts for social robot navigation, our dataset is unique in that it features: an original experiment design giving rise to close navigation encounters between two humans and a robot; five different navigation algorithms; two different robot platforms; a diverse participant pool of 74 people recruited from two sites in the USA and France; multimodal data streams including 10.5 hours of human and robot ground-truth motion tracks, RGB video, and user impressions over robot performance. Our analysis of the collected dataset through metrics like interaction density and human velocity suggests that Bi3 represents a benchmark of unique diversity and modeling complexity. Bi3 contributes towards understanding how humans and robots can productively mesh their activities in constrained environments, and can be a resource for training models of human motion prediction and robot control policies for navigation in densely crowded spaces.

Chinese Translation

我们贡献了Bi3，一个在受限实验室空间中进行社交机器人导航的数据集，涉及人群组。与之前的社交机器人导航数据收集工作相比，我们的数据集具有独特性：原创实验设计导致人类与机器人之间的近距离导航接触；五种不同的导航算法；两种不同的机器人平台；来自美国和法国两个地点的74名参与者的多样化参与者池；多模态数据流，包括10.5小时的人类和机器人真实运动轨迹、RGB视频以及用户对机器人表现的印象。我们通过交互密度和人类速度等指标对收集的数据集进行分析，表明Bi3代表了独特的多样性和建模复杂性的基准。Bi3有助于理解人类和机器人如何在受限环境中有效地融合其活动，并可以作为训练人类运动预测模型和机器人导航控制策略在密集人群空间中的资源。

cs.RO / 4 / 2605.06966

Traffic Scenario Orchestration from Language via Constraint Satisfaction

通过约束满足实现语言驱动的交通场景编排

Rong, Frieda, Zhang, Chris, Wong, Kelvin, Urtasun, Raquel

Abstract

Autonomous vehicles (AVs) require extensive testing in simulation, but test case generation for driving scenarios is laborious. The desired scenarios are often out-of-distribution and have precise requirements on interactions with the AV policy under test. Manually programming scenarios allows for precise controllability but is difficult to scale. On the other hand, statistical models can leverage compute and data, but struggle with precise controllability when out-of-distribution. We cast scenario orchestration as a constraint-solving problem and present a language-in, simulation-out scenario orchestrator for closed-loop testing of AVs. Our approach leverages foundation model reasoning to translate general, natural language descriptions into a set of constraints as a scenario representation. This then allows us to leverage off-the-shelf solvers to solve for actor behaviors that meet precise testing intentions in closed loop. Under a benchmark of carefully crafted and diverse scenario descriptions, our approach greatly outperforms our baselines in orchestration success rate. We further show that our closed-loop approach is especially important for scenarios that require ego-reactive specifications.

Chinese Translation

自动驾驶车辆（AVs）在仿真中需要进行广泛的测试，但驾驶场景的测试用例生成是一项繁琐的工作。所需的场景通常超出分布范围，并对与正在测试的AV策略的交互有精确的要求。手动编程场景可以实现精确的可控性，但难以扩展。另一方面，统计模型可以利用计算和数据，但在超出分布范围时难以实现精确的可控性。我们将场景编排视为一个约束求解问题，并提出了一种语言输入、仿真输出的场景编排器，用于闭环测试AVs。我们的方法利用基础模型推理将一般的自然语言描述转换为一组约束，作为场景表示。这使我们能够利用现成的求解器来解决满足闭环中精确测试意图的参与者行为。在经过精心设计和多样化的场景描述基准下，我们的方法在编排成功率上大大超越了我们的基线。我们进一步表明，我们的闭环方法对于需要自我反应规范的场景尤其重要。
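
To make the "scenario as a set of constraints" idea concrete, here is a minimal sketch that searches actor parameters satisfying a handful of constraints. Everything in it is hypothetical: the constraint set, the constant-speed ego (a stand-in for the AV policy, which the paper queries in closed loop), and the brute-force search used in place of an off-the-shelf solver.

```python
from itertools import product

EGO_SPEED = 10.0  # m/s; hypothetical constant-speed ego, not a real AV policy
DT = 0.5          # s per step
HORIZON = 8       # steps, i.e. 4 s of simulated time


def gap_trace(initial_gap, actor_speed):
    """Gap (m) between a lead actor and the ego at each time step."""
    return [initial_gap + (actor_speed - EGO_SPEED) * DT * k
            for k in range(HORIZON + 1)]


# Constraints a language front end might emit for the description
# "a slower lead vehicle ends up close to the ego without colliding".
CONSTRAINTS = [
    lambda gaps: gaps[0] >= 20.0,   # actor starts well ahead
    lambda gaps: min(gaps) >= 2.0,  # never closer than 2 m
    lambda gaps: gaps[-1] <= 5.0,   # ends within 5 m of the ego
]


def solve(gap_candidates, speed_candidates):
    """Return the first (gap, speed) pair meeting every constraint, or None."""
    for gap, speed in product(gap_candidates, speed_candidates):
        gaps = gap_trace(gap, speed)
        if all(check(gaps) for check in CONSTRAINTS):
            return gap, speed
    return None


solution = solve(range(20, 41, 5), range(0, 11))
```

Separating the constraints from the search is the point of the sketch: the same list could be handed to a stronger solver, or re-checked at every simulation step once the ego reacts to the actor.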

cs.RO / 5 / 2605.07003

AirBender: Adaptive Transportation of Bendable Objects Using Dual UAVs

AirBender：使用双无人机适应性运输可弯曲物体

Xu, Jiawei, Gao, Longsen, Fierro, Rafael, Saldaña, David

Abstract

The interaction of robots with bendable objects in midair presents significant challenges in control, often resulting in performance degradation and potential crashes, especially for aerial robots due to their limited actuation capabilities and constant need to remain airborne. This paper presents an adaptive controller that enables two aerial vehicles to collaboratively follow a trajectory while transporting a bendable object without relying on explicit elasticity models. Our method allows on-the-fly adaptation to the object's unknown deformable properties, ensuring stability and performance in trajectory-tracking tasks. We use Lyapunov analysis to demonstrate that our adaptive controller is asymptotically stable. Our method is evaluated through hardware experiments in various scenarios, demonstrating the capabilities of using multirotor aerial vehicles to handle bendable objects.

Chinese Translation

机器人与空中可弯曲物体的交互在控制上面临重大挑战，常常导致性能下降和潜在的坠毁，尤其是对于空中机器人，由于其有限的驱动能力和持续保持空中飞行的需求。本文提出了一种自适应控制器，使得两架空中飞行器能够协同沿着轨迹运输可弯曲物体，而无需依赖显式的弹性模型。我们的方法允许对物体未知的可变形特性进行实时适应，确保轨迹跟踪任务中的稳定性和性能。我们使用Lyapunov分析证明了我们的自适应控制器是渐进稳定的。通过在各种场景中的硬件实验评估我们的方法，展示了使用多旋翼空中飞行器处理可弯曲物体的能力。

cs.RO / 6 / 2605.07037

Intention assimilation control for accurate tracking with variable impedance in teleoperation

基于意图同化控制的变阻抗精确跟踪在远程操作中的应用

Takagi, Atsushi, Li, Yanan, Gomi, Hiroaki, Burdet, Etienne

Abstract

Robot systems for teleoperation commonly use a spring-like force pulling the follower robot towards the leader's position to track their movements. With this control strategy, the tracking accuracy deteriorates when the follower's stiffness is low, but high stiffness poses a danger to objects or people in the follower robot's environment. To address this trade-off between tracking accuracy and safety, we propose an alternative intention assimilation control (IAC) strategy where the robot's tracking accuracy can be ensured without high stiffness. Different from traditional approaches, which transmit the leader's current position to the follower, this new controller estimates the leader's target position and transmits it to the follower. With this strategy, the follower impedance can be changed on-the-fly to continuously reflect the user's desired impedance or modulated automatically to fulfill the task requirements. Our controller was validated on two 7-degree-of-freedom manipulators, yielding high tracking accuracy with varying impedance. Four experiments were conducted to compare teleoperation with IAC to tele-impedance control (TIC) during free tracking, interaction with a balloon, peg insertion, and table polishing with force feedback. The results show that IAC increases tracking accuracy, improves task completion rate and reduces completion time. IAC enables the robot to accurately replicate the user's movement while giving them freedom to modulate the impedance according to their intention, providing an unprecedented level of control of the follower's position and its impedance during unilateral and bilateral teleoperation.

Chinese Translation

远程操作的机器人系统通常使用类似弹簧的力将跟随机器人拉向领导者的位置，以跟踪其运动。采用这种控制策略时，当跟随机器人的刚度较低时，跟踪精度会下降，而高刚度则可能对跟随机器人环境中的物体或人造成危险。为了解决跟踪精度与安全性之间的权衡，我们提出了一种替代的意图同化控制（Intention Assimilation Control, IAC）策略，该策略可以在不需要高刚度的情况下确保机器人的跟踪精度。与传统方法不同，传统方法将领导者的当前位置传递给跟随者，而这种新控制器则估计领导者的目标位置并将其传递给跟随者。通过这种策略，跟随机器人的阻抗可以实时变化，以持续反映用户期望的阻抗，或自动调节以满足任务要求。我们的控制器在两个7自由度的机械臂上进行了验证，结果显示在不同阻抗下实现了高跟踪精度。我们进行了四个实验，将使用IAC的远程操作与远程阻抗控制（Tele-Impedance Control, TIC）进行了比较，实验内容包括自由跟踪、与气球的交互、插销插入和带力反馈的桌面抛光。结果表明，IAC提高了跟踪精度，改善了任务完成率并减少了完成时间。IAC使机器人能够准确复制用户的运动，同时给予用户根据其意图调节阻抗的自由，从而在单向和双向远程操作中提供了前所未有的对跟随者位置及其阻抗的控制水平。

cs.RO / 7 / 2605.07041

Dr-BA: Separable Optimization for Direct Radar Bundle Adjustment & Localization

Dr-BA：直接雷达束调整与定位的可分离优化

Lisus, Daniil, Gentil, Cedric Le, Barfoot, Timothy D.

Abstract

This paper introduces Dr-BA, a first-of-its-kind radar bundle adjustment (BA) framework that operates directly on 2D spinning radar intensity images. Unlike camera or lidar sensors, radar is largely unaffected by precipitation, making it a critical modality for autonomous systems that require all-weather robustness. Existing state estimation approaches using spinning radar typically extract sparse point clouds from range-azimuth-intensity measurements and apply point cloud alignment techniques to estimate vehicle motion, scene structure, or to localize within an existing map. In contrast, Dr-BA uses the full radar returns from multiple scans to jointly estimate dense maps and sensor poses. By formulating the problem as a separable optimization, we derive an efficient and general solution that decouples pose estimation from mapping. In addition to solving the BA problem, this formulation naturally extends to direct radar-only localization (DRL) within a previously built map. Dr-BA achieves state-of-the-art radar-based BA and cross-session localization performance, demonstrated on more than 200 km of on-road data across five distinct routes. Our implementation is publicly available at https://github.com/utiasASRL/dr_ba.

Chinese Translation

本文介绍了Dr-BA，这是首个直接在二维旋转雷达强度图像上操作的雷达束调整（BA）框架。与相机或激光雷达传感器不同，雷达在降水条件下几乎不受影响，这使其成为需要全天候鲁棒性的自主系统的重要传感方式。现有的使用旋转雷达的状态估计方法通常从距离-方位-强度测量中提取稀疏点云，并应用点云对齐技术来估计车辆运动、场景结构或在现有地图中进行定位。相比之下，Dr-BA利用来自多次扫描的完整雷达回波共同估计稠密地图和传感器姿态。通过将问题表述为可分离优化，我们推导出一种高效且通用的解决方案，将姿态估计与地图构建解耦。除了求解BA问题外，该方法自然扩展到在先前构建的地图中进行直接雷达定位（DRL）。Dr-BA在基于雷达的BA和跨会话定位性能上达到了最先进的水平，已在五条不同路线的200公里以上的道路数据上进行了验证。我们的实现已公开发布在 https://github.com/utiasASRL/dr_ba。

cs.RO / 8 / 2605.07215

PISTO: Proximal Inference for Stochastic Trajectory Optimization

PISTO：随机轨迹优化的近端推断

Yu, Hongzhe, Chang, Zinuo, Chen, Yongxin

Abstract

Stochastic trajectory optimization methods like STOMP enable planning with non-differentiable costs, offering substantial flexibility over gradient-based approaches. We show that STOMP implicitly minimizes the KL divergence from a Boltzmann trajectory distribution, revealing an elegant Variational Inference (VI) structure underlying its updates. Building on this insight, we propose the Proximal Inference for Stochastic Trajectory Optimization (PISTO) algorithm that stabilizes the updates by augmenting the objective with a KL regularization between successive Gaussian proposals. This proximal formulation admits a trust-region interpretation and yields closed-form mean updates computable as expectations under a surrogate distribution. We estimate these expectations via importance-weighted Monte Carlo sampling, producing a simple, derivative-free algorithm that inherits STOMP's ability to handle non-differentiable and discontinuous costs without modification. On robot arm motion planning benchmarks, PISTO achieves an 89% success rate, outperforming CHOMP (63%) and STOMP (68%), while producing shorter, smoother paths at twice the speed of competing stochastic methods. We further validate PISTO on contact-rich MuJoCo locomotion and manipulation tasks, where it consistently outperforms both CEM and MPPI baselines in reward.

Chinese Translation

随机轨迹优化方法如 STOMP 使得在非可微成本下进行规划成为可能，相比于基于梯度的方法提供了显著的灵活性。我们表明，STOMP 隐式地最小化了与玻尔兹曼轨迹分布的 KL 散度，揭示了其更新背后优雅的变分推断（Variational Inference, VI）结构。在此基础上，我们提出了“随机轨迹优化的近端推断”（PISTO）算法，通过在连续高斯提议之间增加 KL 正则化来稳定更新。这种近端形式具有信任区域的解释，并产生可作为替代分布下期望值计算的封闭形式均值更新。我们通过重要性加权的蒙特卡洛采样来估计这些期望，生成一个简单的无导数算法，继承了 STOMP 在处理非可微和不连续成本方面的能力而无需修改。在机器人臂运动规划基准测试中，PISTO 实现了 89% 的成功率，优于 CHOMP（63%）和 STOMP（68%），同时以两倍于竞争随机方法的速度生成更短、更平滑的路径。我们进一步在接触丰富的 MuJoCo 运动和操控任务中验证了 PISTO，在奖励方面它始终优于 CEM 和 MPPI 基线。
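
The importance-weighted, derivative-free mean update mentioned in the abstract can be illustrated with a generic sketch. This is not the PISTO update itself: the damped step `step` is only a stand-in for the stabilizing effect of the proximal (KL) term, and the toy cost, temperature, and sample count are assumptions chosen for illustration.

```python
import math
import random


def cost(x):
    """Non-differentiable toy cost: L1 distance to a goal plus a discontinuous penalty."""
    goal = (3.0, -1.0)
    penalty = 5.0 if x[0] < 0.0 else 0.0
    return abs(x[0] - goal[0]) + abs(x[1] - goal[1]) + penalty


def weighted_mean_update(mean, sigma, n_samples, temperature, step, rng):
    """One update: sample around the mean, weight by exp(-cost/T), move partway."""
    samples = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(n_samples)]
    costs = [cost(s) for s in samples]
    c_min = min(costs)  # subtracting the minimum keeps the exponentials well scaled
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    target = [sum(w * s[d] for w, s in zip(weights, samples)) / total
              for d in range(len(mean))]
    # Damped move toward the weighted mean (hypothetical stand-in for the proximal term).
    return [m + step * (t - m) for m, t in zip(mean, target)]


rng = random.Random(0)
mean = [-2.0, 2.0]
initial_cost = cost(mean)
for _ in range(200):
    mean = weighted_mean_update(mean, sigma=0.5, n_samples=64,
                                temperature=0.5, step=0.5, rng=rng)
final_cost = cost(mean)
```

No gradient of `cost` is ever evaluated, so the absolute-value kinks and the jump at `x[0] = 0` need no special handling.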

cs.RO / 9 / 2605.07275

Palm-sized Omnidirectional Vision-Based UAV Exploration with Sparse Topological Map Guidance

基于全向视觉的掌中型无人机稀疏拓扑图引导探索

Wang, Zirui, Luo, Xinjia, Sun, Haotian, Ma, Jun, Guo, Jian, Zhou, Boyu

Abstract

Classic exploration methods often rely on dense occupancy maps or high-resolution point clouds for frontier detection and path planning, resulting in substantial memory consumption and computational overhead. Moreover, micro UAVs under size, weight, and power (SWaP) constraints cannot practically carry sensors such as LiDAR to obtain accurate geometric measurements of the environment. This paper presents a lightweight autonomous exploration system that leverages omnidirectional vision and sparse topological map guidance. Specifically, we utilize a multi-fisheye camera setup to achieve an omnidirectional Field of View (FoV) and perform depth estimation. To address the limited depth estimation accuracy, frontiers are represented as potential unexplored regions characterized by topological nodes instead of explicit boundaries, enabling efficient identification of frontier regions without maintaining occupancy grids or global point clouds. Unlike classic dense representations, our approach abstracts the environment using a sparse topological map composed of key nodes and their descriptors, reducing memory consumption and computational demands. Global path planning is performed directly on the sparse graph. The proposed method is validated both in simulation and in real-world experiments on a palm-sized vision-based UAV with an 11 cm wheelbase and a 400 g weight, demonstrating that our method can achieve efficient exploration with extremely low computational consumption.

Chinese Translation

经典的探索方法通常依赖于密集的占用地图或高分辨率点云进行边界检测和路径规划，这导致了大量的内存消耗和计算开销。此外，受尺寸、重量和功率（SWaP）限制的微型无人机不适合配备如激光雷达（LiDAR）等传感器以获取准确的环境几何测量。本文提出了一种轻量级的自主探索系统，利用全向视觉和稀疏拓扑图引导。具体而言，我们采用多鱼眼相机设置实现全向视场（FoV）并进行深度估计。为了解决深度估计精度有限的问题，边界被表示为潜在的未探索区域，以拓扑节点为特征，而不是明确的边界，从而实现高效识别边界区域，而无需维护占用网格或全局点云。与经典的密集表示不同，我们的方法使用由关键节点及其描述符组成的稀疏拓扑图对环境进行抽象，减少了内存消耗和计算需求。全局路径规划直接在稀疏图上进行。所提出的方法在仿真和实际实验中均在一款轮距为11厘米、重量为400克的掌中型视觉无人机上得到了验证，结果表明我们的方法能够以极低的计算消耗实现高效探索。
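
Planning "directly on the sparse graph" can be pictured with a standard shortest-path search over a small node graph. The sketch below uses Dijkstra's algorithm, which the abstract does not name; the node names and edge lengths are invented for illustration.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra over an adjacency dict {node: {neighbor: edge_length}}."""
    dist = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, length in graph[node].items():
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], dist[goal]


# Hypothetical topological map: four key nodes, one of them a frontier node.
graph = {
    "start": {"hall": 2.0, "door": 5.0},
    "hall": {"start": 2.0, "door": 1.5, "frontier_a": 4.0},
    "door": {"start": 5.0, "hall": 1.5, "frontier_a": 1.0},
    "frontier_a": {"hall": 4.0, "door": 1.0},
}
path, length = shortest_path(graph, "start", "frontier_a")
```

The search cost scales with the number of nodes and edges in the graph rather than with the volume of the mapped space, which is the memory and compute argument the abstract makes for sparse maps.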

cs.RO / 10 / 2605.07292

Variable Aerodynamic Damping via Co-Contraction: A Dynamic Isomorphism with Variable Stiffness Actuators

通过共收缩实现可变气动阻尼：与可变刚度执行器的动态同构

Franchi, Antonio

Abstract

We prove that aerodynamic co-contraction in a redundant dual-rotor actuator can tune a passive, trim-defined aero-mechanical damping while keeping the commanded net force constant. In particular, we define an incremental damping coefficient as the local sensitivity of net thrust to air-relative velocity at a trim and prove that it increases monotonically along constant-force fibers under a mild aerodynamic hardening condition. We then validate the required damping and hardening properties from a first-principles Blade Element Theory derivation, which yields a minimal thrust model affine in inflow and explicitly reveals the speed-inflow coupling driving the effect. The resulting mechanism is formalized as a Variable Aerodynamic Damping Actuator (VADA) and shown to be dynamically isomorphic to stiffness modulation in antagonistic variable-stiffness actuation (VSA), similar to the co-contraction of tendons by muscle co-activation. The same fiber-density principle also enhances the active aerodynamic promptness measure of redundant multirotors. Finally, an impedance-form representation clarifies the roles of common-mode and differential-mode actuation in the control of passive impedance and the equilibrium velocity of the VADA system.

Chinese Translation

我们证明了在冗余双转子执行器中，气动共收缩可以调节被动的、由修整定义的气动机械阻尼，同时保持指令净力不变。特别地，我们定义了增量阻尼系数为在修整状态下净推力对气流相对速度的局部灵敏度，并证明在温和的气动硬化条件下，它沿着恒力纤维单调增加。随后，我们通过第一性原理的叶片元理论推导验证了所需的阻尼和硬化特性，该推导得出了一个在流入中仿射的最小推力模型，并明确揭示了驱动该效应的速度-流入耦合。由此产生的机制被形式化为可变气动阻尼执行器（Variable Aerodynamic Damping Actuator, VADA），并被证明在动态上与对抗性可变刚度驱动（Variable-Stiffness Actuation, VSA）中的刚度调制同构，类似于肌肉共激活下肌腱的共收缩。相同的纤维密度原理还增强了冗余多旋翼的主动气动响应度。最后，阻抗形式的表示澄清了共模和差模驱动在控制被动阻抗和VADA系统平衡速度中的作用。

cs.RO / 11 / 2605.07306

BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

BioProVLA-Agent：一种经济实惠的、以协议驱动的、增强视觉的、具备闭环推理能力的多智能体系统，用于生物实验室操作

Du, Zhaohui, Wang, Zhe, Fei, Hongmei, Cao, Xiwen, Xiao, Ting, Wang, Qi, Jin, Huanbo, Gu, Jiaming, Lu, Quan, Liu, Zhe

Abstract

Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.

Chinese Translation

生物实验室自动化可以减少重复的手工工作并提高可重复性，但在湿实验室环境中实现可靠的具身执行仍然具有挑战性。协议通常是非结构化的，实验器具往往是透明或反光的，多步骤程序需要超越一次性指令执行的状态感知执行。现有的机器人系统通常依赖于昂贵的硬件、固定的工作流程、专用的仪器或面向机器人技术的接口。在此，我们介绍了BioProVLA-Agent，一种经济实惠的、以协议驱动的、增强视觉的具身多智能体系统，利用视觉-语言-动作（Vision-Language-Action, VLA）模型进行生物操作。该系统将协议作为任务接口，并在闭环工作流程中集成协议解析、视觉状态验证和具身执行。定制的LLM协议代理将协议转换为可验证的子任务；VLM-RAG验证代理利用观察、机器人状态、检索知识和成功/失败示例评估准备情况和完成情况；VLA具身代理通过轻量级策略执行经过验证的子任务。为了提高在湿实验室视觉干扰下的鲁棒性，我们开发了AugSmolVLA，一种针对透明实验器具、反射、光照变化和过度曝光的在线增强策略。我们在一个涵盖15个原子任务、6个复合工作流程和3个双手任务的分层基准上评估该系统，包括管子装载、分类、废物处理、盖子扭动和液体倒入。在正常和高曝光设置下，AugSmolVLA在执行稳定性方面优于ACT、X-VLA和原始SmolVLA，特别是在精确放置、透明物体操作、复合工作流程和视觉退化场景中。这些结果表明，朝着可访问的、以协议为中心的、具备验证能力的具身人工智能在生物操作中的实际路径。

cs.RO / 12 / 2605.07308

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

AT-VLA：用于增强视觉-语言-动作模型反馈反应的自适应触觉注入

Li, Xiaoqi, Cai, Muhe, Xu, Jiadong, Zhu, Juan, Fan, Hongwei, Shen, Yan, Ren, Guangrui, Dong, Hao

Abstract

Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretraining stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile input only when it significantly contributes to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.

Chinese Translation

视觉-语言-动作（VLA）模型在执行多样化任务方面显著提升了机器人代理的能力；然而，它们在需要精确物理交互的接触丰富的操作场景中仍面临挑战。为了解决这一限制，近期的研究尝试在下游任务中引入触觉信号，使预训练的VLA能够解读触觉反馈。然而，在微调过程中引入在预训练阶段很少出现的新模态，可能会干扰VLA的预训练能力。此外，VLA固有的慢推理速度妨碍了实时响应，并限制了触觉反馈在动作调整中的有效利用。为克服这些挑战，我们提出了自适应触觉视觉-语言-动作（AT-VLA），引入了一种新颖的自适应触觉注入机制。该机制动态确定触觉注入的适当时机和位置，仅在其对动作生成有显著贡献时才进行注入，从而最小化对预训练表示的干扰。此外，为了实现快速准确的触觉响应，我们提出了一种触觉反应双流机制，将感知处理解耦为用于低频感知推理的慢视觉-语言流和用于高频物理交互理解的快触觉控制流，实现了在0.04秒内的实时闭环响应。实际实验充分验证了AT-VLA在接触丰富的操作任务中的有效性。项目页面可访问：https://sites.google.com/view/at-vla。

cs.RO / 13 / 2605.07325

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

CSR：具有大规模缓存状态表示的无限时域实时策略

Karlsson, Robin, Suzui, Go

Abstract

Deploying massive large language models (LLMs) as continuous cognitive engines for robotics is bottlenecked by the time-to-first-token (TTFT) latency required to process extensive state histories. Existing solutions like RAG or sliding windows compromise global context or incur prohibitive re-computation costs. We formalize the optimal task structure for minimizing latency and theoretically prove that prefix stability, incremental extensibility, and asynchronous state reconciliation are necessary conditions for real-time performance. Building on these proofs, we introduce the Cached State Representation (CSR) framework as the practical instantiation of these properties, ensuring optimal KV-cache reuse. To sustain these properties over infinite horizons, we further propose an Asynchronous State Reconciliation (ASR) algorithm that offloads state memory eviction to a parallel computational resource to eliminate latency spikes. On a physical robot wirelessly connected to an on-premise GPU server, CSR achieves a 26-fold latency reduction (14.67s to 0.56s) for 120K token contexts with a 235B parameter model compared to a standard baseline. On an embodied AI benchmark, we achieve SOTA recall (0.836 vs. 0.459) while maintaining RAG-level latency. ASR is validated to sustain bounded, spike-free TTFT over 10 eviction cycles in continuous real-world operation. Together, CSR and ASR enable massive LLMs to function as continuously operating, high-frequency (> 2 Hz) embodied policies.

Chinese Translation

将大规模语言模型（LLMs）作为机器人技术的连续认知引擎的部署受到处理大量状态历史所需的首次令牌时间（TTFT）延迟的制约。现有的解决方案如RAG或滑动窗口在全球上下文方面存在妥协，或产生高昂的重新计算成本。我们形式化了最优任务结构以最小化延迟，并理论证明前缀稳定性、增量可扩展性和异步状态协调是实现实时性能的必要条件。在这些证明的基础上，我们引入了缓存状态表示（CSR）框架，作为这些特性的实际实现，确保最优的KV缓存重用。为了在无限时域内维持这些特性，我们进一步提出了一种异步状态协调（ASR）算法，将状态内存驱逐的任务卸载到并行计算资源上，以消除延迟峰值。在一台无线连接到本地GPU服务器的物理机器人上，CSR在120K令牌上下文中实现了26倍的延迟减少（从14.67秒降至0.56秒），相较于标准基线。在一个具身AI基准测试中，我们实现了SOTA召回率（0.836对比0.459），同时保持了RAG级别的延迟。ASR经过验证能够在连续的真实世界操作中维持有界、无峰值的TTFT，经过10次驱逐周期。CSR和ASR共同使得大规模LLMs能够作为持续运行的高频（> 2 Hz）具身策略。
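
Why prefix stability matters for KV-cache reuse can be shown with a toy comparison, together with the arithmetic behind the reported speedup. The sketch assumes the common prefix-caching rule that cached entries are reusable only up to the first token that differs; the token names are invented, and the CSR framework's actual data layout is not described in the abstract.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached and the new context."""
    count = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        count += 1
    return count


cached = ["<sys>", "obs1", "obs2", "obs3"]

# Incremental extension: new state is appended, so the whole cache is reusable.
appended = cached + ["obs4"]

# Sliding window: dropping the oldest observation shifts every later token,
# so reuse stops right after the system prompt.
slid = ["<sys>", "obs2", "obs3", "obs4"]

reuse_append = reusable_prefix_len(cached, appended)
reuse_slide = reusable_prefix_len(cached, slid)

# Latency figures quoted in the abstract: 14.67 s baseline, 0.56 s with CSR.
speedup = 14.67 / 0.56
```

Here `reuse_append` covers all four cached tokens while `reuse_slide` covers only one, and `speedup` comes out at roughly 26.2, consistent with the 26-fold reduction stated above.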

cs.RO / 14 / 2605.07367

Weather-Robust Scene Semantics with Vision-Aligned 4D Radar

具有视觉对齐的4D雷达的天气鲁棒场景语义

Hamilton, Kali, Heckman, Christoffer

Abstract

Cameras and LiDAR degrade in rain, fog, and snow, while millimeter-wave radar remains largely unaffected. We align a radar encoder to frozen SigLIP vision embeddings and decode structured scene captions through a frozen vision-language model (VLM) with approximately 7M trainable parameters. On K-RADAR with held-out fog, light snow, and heavy snow sequences, all radar configurations outperform a camera baseline that collapses to over 90% hallucination. We identify a token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM and show that projector-output LayerNorm resolves it. Analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.

Chinese Translation

相机和激光雷达在雨、雾和雪中性能下降，而毫米波雷达则基本不受影响。我们将雷达编码器与冻结的SigLIP视觉嵌入对齐，并通过一个具有大约700万可训练参数的冻结视觉-语言模型（VLM）解码结构化场景描述。在K-RADAR上，针对保留的雾、轻雪和大雪序列，所有雷达配置的表现均优于相机基线，后者的幻觉率超过90%。我们发现，在将雷达与冻结的VLM连接时，令牌归一化不匹配是主要的失败模式，并展示了投影器输出的LayerNorm可以解决这一问题。对编码器复杂性、描述格式和池化策略的分析揭示了权衡，这为未来的雷达-VLM管道设计提供了参考。
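
The effect of a LayerNorm on token norms can be checked numerically. The sketch below applies a plain layer normalization with no learned scale or shift, which is a simplification; the token values are invented, and reading "token-norm mismatch" as a difference in L2 norm between projected radar tokens and the tokens the VLM expects is an assumption.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def l2_norm(token):
    return math.sqrt(sum(x * x for x in token))


# Hypothetical projector output (large scale) and a token at the VLM's usual scale.
radar_token = [120.0, -80.0, 45.0, 300.0, -210.0, 15.0, 60.0, -30.0]
reference_token = [0.4, -0.2, 0.1, 0.9, -0.7, 0.05, 0.2, -0.1]

ratio_before = l2_norm(radar_token) / l2_norm(reference_token)
ratio_after = l2_norm(layer_norm(radar_token)) / l2_norm(layer_norm(reference_token))
```

After normalization every token has an L2 norm close to the square root of its dimension, so the ratio collapses from a few hundred to about one regardless of the scale of the projector output.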

cs.RO / 15 / 2605.07370

MORPH-U: Multi-Objective Resilient Motion Planning for V2X-Enabled Autonomous Driving in High-Uncertainty Environments via Simulation

MORPH-U：在高不确定性环境中通过仿真实现的面向多目标的韧性运动规划，适用于V2X支持的自动驾驶

Lai, Shih-Yu

Abstract

V2X can warn an autonomous vehicle about hazards beyond line-of-sight, but it also brings uncertainty: messages may be delayed, dropped, or even forged. Meanwhile, map knowledge may change during a trip, forcing the vehicle to replan under tight real-time budgets. This paper studies how to make motion planning and low-level control robust to such uncertain, event-driven updates. We present MORPH-U, a CARLA-based closed-loop stack that fuses LiDAR/radar/camera with V2X (CAM/DENM) into a Local Dynamic Map (LDM) and triggers Hybrid-A* replanning when validated hazards or map changes affect the planned route. We expose the planning/control trade-offs via a multi-objective formulation over tracking error, safety margin (minimum TTC), responsiveness, and smoothness, and select operating points using Pareto-frontier analysis. To avoid unsafe replanning from faulty V2X triggers, MORPH-U adds a lightweight Byzantine-inspired acceptance gate that combines a quorum rule with an on-board sensor veto. Experiments in dynamic CARLA scenarios show that V2X-augmented LDM improves downstream safety, Pareto tuning provides controllable accuracy-comfort trade-offs, and the gate prevents replanning under saturated false-DENM injection ($p_{\text{attack}}=1.0$).

Chinese Translation

V2X可以警告自动驾驶车辆关于视线之外的危险，但它也带来了不确定性：消息可能会延迟、丢失，甚至被伪造。同时，地图信息在行驶过程中可能会发生变化，迫使车辆在紧迫的实时预算下重新规划。本文研究如何使运动规划和低级控制对这种不确定的事件驱动更新具有鲁棒性。我们提出了MORPH-U，这是一个基于CARLA的闭环系统，融合了LiDAR/雷达/摄像头与V2X（CAM/DENM），生成局部动态地图（LDM），并在验证的危险或地图变化影响规划路线时触发Hybrid-A*重新规划。我们通过跟踪误差、安全边际（最小TTC）、响应性和平滑性等多目标公式揭示规划/控制的权衡，并使用帕累托前沿分析选择操作点。为了避免因故障的V2X触发导致的不安全重新规划，MORPH-U增加了一个轻量级的拜占庭启发式接受门，该门结合了法定规则和车载传感器否决。动态CARLA场景中的实验表明，增强V2X的LDM提高了下游安全性，帕累托调优提供了可控的准确性-舒适性权衡，而该门在饱和的虚假DENM注入下（$p_{\text{attack}}=1.0$）防止了重新规划。
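
An acceptance gate that combines a quorum rule with an on-board sensor veto could look roughly like the following. The abstract does not give the rule's details, so the function name, the use of distinct sender IDs, and the veto taking priority are all assumptions made for illustration.

```python
def accept_hazard(sender_ids, quorum, sensor_contradicts):
    """Decide whether a reported V2X hazard may trigger replanning.

    sender_ids: IDs of the stations that reported the same hazard.
    quorum: minimum number of distinct senders required (hypothetical parameter).
    sensor_contradicts: True if on-board sensing shows the reported spot is clear.
    """
    if sensor_contradicts:
        return False  # on-board veto overrides any number of messages
    return len(set(sender_ids)) >= quorum


# Three independent senders agree and the sensors do not object.
corroborated = accept_hazard(["rsu_1", "veh_7", "veh_9"], quorum=2,
                             sensor_contradicts=False)

# A single attacker repeating one forged message never reaches the quorum.
flooded = accept_hazard(["veh_x"] * 50, quorum=2, sensor_contradicts=False)

# Many forged identities can reach the quorum, so the veto is the second line of defense.
sybil = accept_hazard(["fake_%d" % i for i in range(10)], quorum=2,
                      sensor_contradicts=True)
```

Counting distinct senders rather than messages is what keeps a saturated stream from one source from passing the gate on volume alone.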

cs.RO / 16 / 2605.07381

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

通过锚点中心适应逃避机器人操作中的多样性陷阱

Chen, Yanzhe, Ma, Kevin Yuchen, Lv, Qi, Lin, Yiqi, Bai, Zechen, Gao, Chen, Shou, Mike Zheng

Abstract

While Vision-Language-Action (VLA) models offer broad general capabilities, deploying them on specific hardware requires real-world adaptation to bridge the embodiment gap. Since robot demonstrations are costly, this adaptation must often occur under a strict data budget. In this work, we identify a critical diversity trap: the standard heuristic of "maximizing coverage" by collecting diverse, single-shot demonstrations can be self-defeating due to non-vanishing estimation noise. We formalize this phenomenon as a Coverage-Density Trade-off. By decomposing the policy error into estimation (density) and extrapolation (coverage) terms, we characterize an interior optimal allocation of unique conditions for a fixed budget. Guided by this analysis, we propose Anchor-Centric Adaptation (ACA), a two-stage framework that first stabilizes a policy skeleton through repeated demonstrations at core anchors, then selectively expands coverage to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments validate our trade-off framework and demonstrate that ACA significantly improves task reliability and success rates over standard diverse sampling strategies under the same budget.

Chinese Translation

尽管视觉-语言-动作（VLA）模型提供了广泛的通用能力，但在特定硬件上部署它们需要进行现实世界的适应，以弥补体现差距。由于机器人演示的成本较高，这种适应通常必须在严格的数据预算下进行。在本研究中，我们识别出一个关键的多样性陷阱：通过收集多样化的单次演示来“最大化覆盖”的标准启发式可能因非消失的估计噪声而自我抵消。我们将这一现象形式化为覆盖-密度权衡。通过将策略误差分解为估计（密度）和外推（覆盖）项，我们为固定预算特征化了独特条件的内部最优分配。在这一分析的指导下，我们提出了锚点中心适应（ACA），这是一个两阶段框架，首先通过在核心锚点处重复演示来稳定策略骨架，然后通过教师强制的误差挖掘和受限残差更新选择性地扩展覆盖到高风险边界。真实机器人实验验证了我们的权衡框架，并表明ACA在相同预算下显著提高了任务可靠性和成功率，优于标准的多样性采样策略。
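The coverage-density trade-off described in this abstract can be illustrated with a toy model. The sketch below is not the paper's formulation: it assumes, purely for illustration, that estimation error grows as `noise_var * k / budget` (fewer repeats per condition) and that extrapolation error shrinks as `extrap_coeff / k` (more unique conditions). The function names and constants are hypothetical.

```python
import math

def total_error(k, budget, noise_var, extrap_coeff):
    """Toy policy error for k unique conditions under a fixed demo budget.

    Assumed model (not from the paper): each condition receives budget / k
    repeats, so estimation error scales as noise_var * k / budget, while
    extrapolation error shrinks as extrap_coeff / k.
    """
    estimation = noise_var * k / budget
    extrapolation = extrap_coeff / k
    return estimation + extrapolation

def best_allocation(budget, noise_var, extrap_coeff):
    """Scan every feasible k in 1..budget and return the minimizer."""
    return min(range(1, budget + 1),
               key=lambda k: total_error(k, budget, noise_var, extrap_coeff))

budget = 100
k_star = best_allocation(budget, noise_var=4.0, extrap_coeff=1.0)
# Continuous optimum of the assumed model: sqrt(extrap_coeff * budget / noise_var).
k_closed_form = math.sqrt(1.0 * budget / 4.0)
print(k_star, k_closed_form)
```

Under these assumed constants the minimizer lies strictly between one condition and one demonstration per condition, which is the "interior optimum" the abstract refers to. With a different error model the location of the optimum would change.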


cs.RO / 17 / 2605.07496

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

PathPainter：将图像生成模型的泛化能力转移到具身导航

Wang, Yijin, Tian, Yuru, Huang, Xijie, Gai, Weiqi, Zhu, Mo, Zhou, Xin, Wu, Yuze, Gao, Fei

Abstract

Bird's-eye-view (BEV) images have been widely demonstrated to provide valuable prior information for navigation. Given the global information provided by such views, two key challenges remain: how to fully exploit this information and how to reliably use it during execution. In this paper, we propose a navigation system that uses BEV images as global priors and is designed for ground and near-ground robotic platforms. The system employs an image generation model to interpret human intent from natural language, identify the target destination, and generate traversability masks. During execution, we introduce cross-view localization to align the robot's odometry with the BEV map and mitigate long-term drift in conventional odometry. We conduct extensive benchmark experiments to evaluate the proposed method and further validate it on a UAV platform. Using only a conventional local motion planner, the UAV successfully completes a 160-meter outdoor long-range navigation task. This work demonstrates how the world-understanding capabilities of foundation models can be transferred to embodied navigation, enabling robots to benefit from the strong generalization ability of existing image generation models.

Chinese Translation

鸟瞰图（BEV）已被广泛证明为导航提供了宝贵的先验信息。鉴于此类视图提供的全局信息，仍然存在两个关键挑战：如何充分利用这些信息，以及如何在执行过程中可靠地使用它。在本文中，我们提出了一种导航系统，该系统将BEV图像作为全局先验，专为地面和近地面机器人平台设计。该系统采用图像生成模型从自然语言中解读人类意图，识别目标目的地，并生成可通行性掩码。在执行过程中，我们引入了跨视图定位，以将机器人的里程计与BEV地图对齐，并减轻传统里程计中的长期漂移。我们进行了广泛的基准实验以评估所提方法，并在无人机平台上进一步验证。仅使用常规的局部运动规划器，无人机成功完成了160米的户外远程导航任务。这项工作展示了基础模型的世界理解能力如何转移到具身导航，使机器人能够受益于现有图像生成模型的强泛化能力。


cs.RO / 18 / 2605.07514

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

未来是否兼容？诊断世界行动模型中的动态一致性

Ruan, Bo-Kai, Hsiao, Teng-Fang, Lo, Ling, Shuai, Hong-Han

Abstract

World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.

Chinese Translation

世界行动模型（World Action Models, WAMs）通过预测未来的观察和行动来实现决策制定。然而，这些想象中的未来的可靠性仍然未得到充分检验：生成的未来仅仅在视觉上似乎合理，还是与其声称建模的行动序列在动态上兼容？在本研究中，我们识别出行动-状态一致性，即预测的行动与引发的状态转变之间的对齐，作为WAMs缺失的可靠性维度。通过对具有代表性的联合预测和逆动态模型进行系统研究，我们发现行动-状态一致性在许多任务中系统性地区分了成功和失败的展开，并且遵循与学习的价值估计相似的成功-失败趋势。这些结果表明，一致性捕捉了超越视觉现实主义的决策相关结构。我们进一步识别出背景崩溃作为一个重要的边界条件，在这种情况下，低动态失败轨迹可能变得具有欺骗性的一致性，因为静态未来更容易预测。在这些发现的基础上，我们提出了一种无价值共识策略用于测试时选择，该策略通过预测未来之间的协议对候选展开进行排名。该策略在RoboCasa和RoboTwin 2.0上提高了成功率，而无需额外的训练或奖励建模。综上所述，我们的发现确立了行动-状态一致性作为评估WAM可靠性的诊断工具和无价值规划的实用信号。
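One way to picture a value-free consensus strategy is to rank each candidate rollout by how closely its predicted future agrees with the other candidates. The sketch below is an assumption-laden illustration, not the authors' procedure: it represents each predicted future as a flattened feature vector and uses mean Euclidean distance as the disagreement measure. The name `consensus_rank` is hypothetical.

```python
import numpy as np

def consensus_rank(futures):
    """Rank candidate rollouts by agreement with the other candidates.

    futures: array of shape (n_candidates, feature_dim), one flattened
    predicted future per candidate (a hypothetical representation).
    Returns candidate indices ordered from most to least agreed-upon.
    """
    futures = np.asarray(futures, dtype=float)
    # Pairwise Euclidean distances between predicted futures.
    diffs = futures[:, None, :] - futures[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Mean distance to the other candidates (self-distance is zero).
    n = len(futures)
    mean_dist = dists.sum(axis=1) / max(n - 1, 1)
    return np.argsort(mean_dist, kind="stable")

# Three futures that roughly agree and one outlier.
candidates = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
order = consensus_rank(candidates)
print(order)
```

In this example the outlier at index 3 is ranked last. Note the caveat raised in the abstract: static, low-dynamics futures can agree with each other for the wrong reason, so agreement alone is not proof of a good rollout.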


cs.RO / 19 / 2605.07530

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

基于搜索的笔记本电脑翻新机器人软件的鲁棒性测试

Isaku, Erblin, Sartaj, Hassan, Ali, Shaukat, Hashmi, Malaika Din, Picard, Francois

Abstract

The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3$\times$ to 7$\times$ more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.

Chinese Translation

丹麦技术研究院（DTI）专注于将先进技术（包括机器人）转移到工业和公共部门。一个关键应用是使用专用机器人进行笔记本电脑翻新，旨在促进再利用、减少电子废物，并支持欧洲循环经济行动计划。这类机器人的软件通常包含使用物体检测模型来检测物体的功能，例如识别用于笔记本电脑拆解的螺丝或检测需要移除的贴纸。确保这些模型对小输入变化的鲁棒性仍然是一个关键挑战，解决这一问题对于避免在翻新过程中对笔记本电脑造成潜在损害至关重要。本文提出了PROBE，一种基于搜索的鲁棒性测试方法，利用多目标优化来识别最小的、局部的扰动，从而暴露用于笔记本电脑翻新机器人软件中的物体检测模型的失败。PROBE采用NSGA-II系统地探索扰动空间，优化失败诱导，考虑到局部化和置信度以及扰动幅度，同时能够发现多样化的失败案例。结果表明，PROBE在生成诱导失败的扰动方面比随机搜索有效性提高了3倍到7倍，同时需要更小的扰动幅度，并且生成的扰动可以跨模型转移。我们进一步表明，变形关系提供了对模型鲁棒性的额外见解，使得即使在非失败案例中也能评估稳定性。
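NSGA-II ranks candidates by Pareto dominance, so the core comparison behind a multi-objective perturbation search can be sketched in a few lines. The code below extracts only the first non-dominated front and is not a full NSGA-II (no crowding distance, crossover, or mutation). The two objectives and the sample values are hypothetical stand-ins, not PROBE's actual fitness functions.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the non-dominated subset, i.e. the first front NSGA-II would rank."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical candidates: (detector confidence after perturbation, perturbation magnitude).
# Lower confidence means a stronger failure; lower magnitude means a subtler perturbation.
candidates = [(0.9, 0.01), (0.4, 0.05), (0.2, 0.20), (0.5, 0.10), (0.2, 0.30)]
front = pareto_front(candidates)
print(front)
```

Here `(0.5, 0.10)` is dropped because `(0.4, 0.05)` is better on both objectives, and `(0.2, 0.30)` is dropped because `(0.2, 0.20)` reaches the same confidence with a smaller perturbation.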


cs.RO / 20 / 2605.07560

How to utilize failure demo data?: Effective data selection for imitation learning using distribution differences in attention mechanism

如何利用失败演示数据？基于注意力机制中的分布差异进行模仿学习的有效数据选择

Miyamoto, Kana, Suzuki, Kanata, Ogata, Tetsuya

Abstract

Imitation learning for robotic tasks has relied primarily on policies trained only on successful demonstrations, although failures are unavoidable during human data collection. Many existing approaches for exploiting failure data require additional data processing or iterative policy updates through autonomous rollouts, making it difficult to directly and stably utilize failure data accumulated during data collection. In this work, we propose a method that learns latent representations of success-failure discrepancies and incorporates them into the attention mechanism. During inference, an appropriate latent mode is selected from the initial observation to improve action stability. Furthermore, we introduce a post-training metric that quantifies the attention discrepancy between each failure sample and successful demonstrations to select failure data. Simulation results show that the proposed method improves task success rates when trained with failure data and that the proposed metric identifies failure samples that are beneficial for learning when combined with successful demonstrations. These results suggest that the proposed method can support more efficient use of collected demonstrations in robotic data collection pipelines.

Chinese Translation

模仿学习在机器人任务中主要依赖于仅在成功演示上训练的策略，尽管在数据收集过程中失败是不可避免的。许多现有的方法利用失败数据需要额外的数据处理或通过自主回放进行迭代策略更新，这使得直接和稳定地利用在数据收集过程中积累的失败数据变得困难。在本研究中，我们提出了一种方法，该方法学习成功与失败之间差异的潜在表示，并将其纳入注意力机制。在推理过程中，从初始观察中选择适当的潜在模式以提高动作稳定性。此外，我们引入了一种后训练指标，该指标量化每个失败样本与成功演示之间的注意力差异，以选择失败数据。仿真结果表明，所提出的方法在使用失败数据训练时提高了任务成功率，并且所提出的指标能够识别与成功演示结合时对学习有益的失败样本。这些结果表明，所提出的方法可以支持在机器人数据收集管道中更有效地利用收集到的演示。


cs.RO / 21 / 2605.07594

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

MemCompiler：编译，而非注入——面向具身智能体的状态条件记忆

Ding, Xin, Wang, Xinrui, Yang, Yifan, Wu, Hao, Jiang, Shiqi, Zhang, Qianxi, Mi, Liang, Zhu, Hanxin, Li, Kun, Liu, Yunxin, Chen, Zhibo, Cao, Ting

Abstract

Existing memory systems for embodied agents typically inject retrieved memory as static context at episode start, a paradigm we term Ahead-of-time Monolithic Memory Injection (AMMI). However, this static design quickly becomes misaligned with the agent's evolving state and may degrade lightweight executors below the no-memory baseline. To address this, we propose MemCompiler, which reframes memory utilization as State-Conditioned Memory Compilation. A learned Memory Compiler reads a structured Brief State capturing the agent's current execution state and dynamically selects and compiles only relevant memory into executable guidance. This guidance is delivered through a text channel and a latent Soft-Mem channel that preserves perceptual information not expressible in text. Across ALFWorld, EmbodiedBench, and ScienceWorld, MemCompiler consistently improves over no-memory across open-source backbones (up to +129%), matches or approaches frontier closed-source systems, and reduces per-step latency by 60%, demonstrating that state-aware memory compilation improves both effectiveness and efficiency.

Chinese Translation

现有的具身智能体记忆系统通常在情节开始时将检索到的记忆作为静态上下文注入，这一范式被称为提前单体记忆注入（Ahead-of-time Monolithic Memory Injection, AMMI）。然而，这种静态设计很快与智能体不断变化的状态不再一致，可能导致轻量级执行器的性能低于无记忆基线。为了解决这个问题，我们提出了MemCompiler，它将记忆利用重新构架为状态条件记忆编译（State-Conditioned Memory Compilation）。一个学习到的记忆编译器读取一个结构化的简要状态（Brief State），捕捉智能体当前的执行状态，并动态选择和编译相关的记忆以生成可执行的指导。这些指导通过文本通道和一个潜在的软记忆通道（Soft-Mem channel）传递，以保留无法用文本表达的感知信息。在ALFWorld、EmbodiedBench和ScienceWorld等多个环境中，MemCompiler在开源基础架构上始终优于无记忆方案（提升幅度高达129%），并且与前沿的闭源系统相匹配或接近，同时将每步延迟减少了60%，证明了状态感知的记忆编译在有效性和效率上都有所提升。


cs.RO / 22 / 2605.07605

BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly

BrickCraft：结合情境手动指导的长远互锁砖块组装的视觉运动技能组合

Yu, Jichuan, Li, Bowei, Tang, Zhenran, Lu, Guanxing, Hu, Chuxiong, Liu, Ruixuan, Liu, Changliu

Abstract

Autonomous robotic assembly of interlocking bricks demands seamless integration of long-horizon task reasoning, spatial grounding, and fine-grained manipulation. This paper presents BrickCraft, a compositional framework designed for long-horizon and generalizable interlocking brick assembly. BrickCraft models the assembly process using a relative formulation, where each step is anchored to a reference brick within the partial structure, thereby decomposing complex tasks into a finite set of reusable primitive skills. BrickCraft bridges the gap between high-level assembly plans and physical execution through situated manuals, which provide explicit spatial guidance for learned visuomotor skills by projecting the assembly intent onto real-time robot observations. Finally, BrickCraft employs a compositional execution pipeline that chains these spatially grounded skills to accomplish long-horizon assembly tasks. Extensive experimental validations demonstrate that BrickCraft acquires proficient assembly skills from a limited set of demonstrations and exhibits strong compositional generalization to unseen structures. The project website is available at https://intelligent-control-lab.github.io/BrickCraft.

Chinese Translation

自主机器人组装互锁砖块需要长远任务推理、空间定位和精细操作的无缝整合。本文提出了BrickCraft，一个旨在长远且可推广的互锁砖块组装的组合框架。BrickCraft使用相对表述来建模组装过程，其中每一步都锚定在部分结构中的参考砖块上，从而将复杂任务分解为有限的一组可重用原始技能。BrickCraft通过情境手册弥合高层组装计划与物理执行之间的差距，情境手册通过将组装意图投影到实时机器人观察中，为学习到的视觉运动技能提供明确的空间指导。最后，BrickCraft采用组合执行管道，将这些空间定位的技能串联起来，以完成长远的组装任务。大量实验验证表明，BrickCraft能够从有限的演示中获得熟练的组装技能，并对未见结构表现出强大的组合泛化能力。项目网站可访问：https://intelligent-control-lab.github.io/BrickCraft。


cs.RO / 23 / 2605.07687

PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN

PhySPRING：通过图神经网络实现物理信息双胞胎的结构保持性简化

Jing, Yixiong, Chen, Xingyuan, Wang, Guangming, Wysocki, Olaf, Wu, Haibing, Sheil, Brian

Abstract

Physics-based digital twins aim to predict the dynamics of real-world objects under interaction, enabling real-to-sim-to-real applications in robotics. Current approaches reconstruct such twins as explicit physical models (such as spring-mass systems) to predict the dynamics, but the resulting models often inherit the resolution of the visual reconstruction rather than being reduced to the physical complexity required to reproduce task-relevant dynamics. This mismatch introduces redundant topology, making repeated forward-dynamics rollouts unnecessarily expensive. To address this challenge, we present PhySPRING, a fully differentiable GNN-based method to reduce complexity in spring--mass digital twins. PhySPRING jointly learns a hierarchy of coarsened graph topologies and their mechanical parameters from observations. At each reduction level, PhySPRING merges nodes with similar learned dynamic responses to optimize the topology, while maintaining every reduced layer as an explicit spring--mass system. On the PhysTwin benchmark, PhySPRING improves dense reconstruction and prediction accuracy over PhysTwin, while reduced models retain stable physical and visual fidelity with up to a 2.30 times speed-up. We further demonstrate the effectiveness of PhySPRING in a Real2Sim robot policy-evaluation pipeline, where the reduced models are substituted zero-shot into ACT and $\pi_0$ evaluations, maintaining comparable manipulation success rates across downsampling levels while improving action-sampling effectiveness. Together, PhySPRING enables efficient and structure-preserving spring--mass reduction without sacrificing fidelity or robotic utility.

Chinese Translation

基于物理的数字双胞胎旨在预测现实世界物体在交互下的动态，从而实现机器人技术中的真实-模拟-真实应用。目前的方法将这些双胞胎重建为显式物理模型（如弹簧-质量系统）以预测动态，但所得到的模型往往继承了视觉重建的分辨率，而不是简化为重现任务相关动态所需的物理复杂性。这种不匹配引入了冗余的拓扑结构，使得重复的前向动态展开变得不必要地昂贵。为了解决这一挑战，我们提出了PhySPRING，这是一种完全可微分的基于图神经网络（GNN）的方法，用于简化弹簧-质量数字双胞胎的复杂性。PhySPRING从观察中共同学习粗化的图拓扑层次及其机械参数。在每个简化层级，PhySPRING合并具有相似学习动态响应的节点以优化拓扑，同时保持每个简化层作为显式的弹簧-质量系统。在PhysTwin基准测试中，PhySPRING在密集重建和预测精度上优于PhysTwin，同时简化模型在保持稳定的物理和视觉保真度的情况下实现了高达2.30倍的加速。我们进一步展示了PhySPRING在Real2Sim机器人策略评估管道中的有效性，其中简化模型在ACT和$\pi_0$评估中被零-shot替代，保持了各个下采样级别的可比操作成功率，同时提高了动作采样的有效性。总之，PhySPRING实现了高效且保持结构的弹簧-质量简化，而不牺牲保真度或机器人效用。


cs.RO / 24 / 2605.07741

Offline-Online Hierarchical 3D Global Relocalization With Synthetic LiDAR Sensing and Descriptor-Space Retrieval

基于合成激光雷达感知和描述子空间检索的离线-在线层次化三维全局重定位

Ren, Jiahua, Shen, Kai, Zhang, Muhua, Ma, Lei

Abstract

3D global relocalization is one of the key capabilities for mobile robots in practical applications. However, in large-scale spaces, existing methods often suffer from prolonged online relocalization time due to factors such as the massive pose search space and high computational overhead. To address these issues, this paper proposes an offline-online hierarchical framework that decouples the search space. In the offline phase, candidate positions and their corresponding geometric descriptor indices are generated in the map by simulating LiDAR scans within the grid map. In the online phase, a coarse pose estimate is first obtained via global retrieval, followed by point cloud registration to output precise 6-DoF pose estimates. Real-world experiments demonstrate that the proposed method achieves an average relocalization time of 3 s and an average localization accuracy of 8 cm in 3D environments. Compared with existing global relocalization methods, the proposed method achieves an order-of-magnitude improvement in computational efficiency while delivering comparable relocalization accuracy.

Chinese Translation

三维全局重定位是移动机器人在实际应用中的关键能力之一。然而，在大规模空间中，现有方法往往由于庞大的位姿搜索空间和高计算开销而导致在线重定位时间延长。为了解决这些问题，本文提出了一种离线-在线层次化框架，该框架解耦了搜索空间。在离线阶段，通过在网格地图中模拟激光雷达扫描生成候选位置及其对应的几何描述子索引。在在线阶段，首先通过全局检索获得粗略的位姿估计，然后进行点云配准以输出精确的6自由度位姿估计。实际实验表明，所提出的方法在三维环境中实现了平均3秒的重定位时间和8厘米的平均定位精度。与现有的全局重定位方法相比，所提出的方法在计算效率上实现了数量级的提升，同时提供了可比的重定位精度。
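The offline-online split described here can be sketched as a descriptor index built ahead of time and a nearest-neighbor lookup at query time. This is a minimal illustration under stated assumptions: descriptors are plain fixed-length vectors, retrieval is brute-force Euclidean search, and the point cloud registration step is omitted. The functions `build_index` and `retrieve` and the sample data are hypothetical, not the authors' implementation.

```python
import numpy as np

def build_index(candidate_positions, descriptors):
    """Offline step: pair each candidate position with its scan descriptor."""
    return (np.asarray(candidate_positions, dtype=float),
            np.asarray(descriptors, dtype=float))

def retrieve(index, query_descriptor, top_k=2):
    """Online step: return the top_k candidate positions nearest in descriptor space."""
    positions, descriptors = index
    query = np.asarray(query_descriptor, dtype=float)
    dists = np.linalg.norm(descriptors - query, axis=1)
    nearest = np.argsort(dists)[:top_k]
    return positions[nearest], dists[nearest]

# Hypothetical map with three candidate positions and 4-D descriptors
# that would come from simulated LiDAR scans.
index = build_index(
    candidate_positions=[[0, 0, 0], [5, 0, 0], [0, 5, 0]],
    descriptors=[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
)
coarse_positions, coarse_dists = retrieve(index, query_descriptor=[0.1, 0.9, 0, 0])
print(coarse_positions[0])
```

The retrieved candidates would then seed a registration step that refines them into a 6-DoF pose. For large maps, the brute-force search would typically be replaced by an approximate nearest-neighbor structure.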


cs.RO / 25 / 2605.07764

CommandSwarm: Safety-Aware Natural Language-to-Behavior-Tree Generation for Robotic Swarms

CommandSwarm：面向机器人群体的安全意识自然语言到行为树生成

Majid, Mohammed, Majid, Amjad Yousef

Abstract

Natural-language interfaces can make swarm robotics more accessible to non-expert operators, but they must translate ambiguous user intent into executable swarm behaviors without unsupported actions, malformed programs, or unsafe plans. This paper presents CommandSwarm, a safety-aware language-to-behavior-tree pipeline for generating XML behavior trees (BTs) from speech or text commands. The system combines multilingual translation, command-level safety filtering, constrained prompting, a LoRA-adapted large language model (LLM), and deterministic parser validation against a whitelist of executable swarm primitives. We evaluate eleven open 6.7B--14B parameter LLMs, all using 4-bit quantization, on representative swarm-control scenarios under zero-shot, one-shot, and two-shot prompting. Falcon3-Instruct-10B and Mistral-7B-v3 are the strongest prompt-engineered candidates, reaching BLEU scores above 0.60 and high syntactic validity in few-shot settings. LoRA adaptation of Falcon3-Instruct-10B on a 2,063-example synthetic instruction--BT corpus improves zero-shot BLEU from 0.267 to 0.663, ROUGE-L from 0.366 to 0.692, and parser-accepted syntactic validity from 0% to 72%. Translation experiments further show that SeamlessM4T v2-large and EuroLLM-9B provide the best quality-latency trade-offs for the multilingual front end. The results indicate that compact, quantized, domain-adapted LLMs can generate useful swarm BTs when embedded in a validated systems pipeline. They also show that parser acceptance and safety filtering remain necessary execution gates; generation quality alone is not sufficient for autonomous deployment.

Chinese Translation

自然语言接口可以使群体机器人技术对非专业操作员更加可及，但它们必须将模糊的用户意图转换为可执行的群体行为，而不产生不支持的动作、格式错误的程序或不安全的计划。本文提出了CommandSwarm，一个安全意识的语言到行为树管道，用于从语音或文本命令生成XML行为树（BTs）。该系统结合了多语言翻译、命令级安全过滤、约束提示、LoRA适应的大型语言模型（LLM）以及针对可执行群体原语白名单的确定性解析器验证。我们在零-shot、one-shot和two-shot提示下，评估了十一种开放的6.7B到14B参数的LLM，均使用4位量化。在代表性的群体控制场景中，Falcon3-Instruct-10B和Mistral-7B-v3是最强的提示工程候选者，在少量提示设置中达到超过0.60的BLEU分数和高语法有效性。对Falcon3-Instruct-10B进行LoRA适应，在2063个示例的合成指令-BT语料库上，将零-shot BLEU从0.267提高到0.663，ROUGE-L从0.366提高到0.692，解析器接受的语法有效性从0%提高到72%。翻译实验进一步表明，SeamlessM4T v2-large和EuroLLM-9B在多语言前端提供了最佳的质量-延迟权衡。结果表明，紧凑的、量化的、领域适应的LLM在经过验证的系统管道中可以生成有用的群体BTs。同时也表明，解析器接受和安全过滤仍然是必要的执行门；单靠生成质量不足以实现自主部署。
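Deterministic validation against a whitelist can be sketched with Python's standard XML parser. The node names below are hypothetical placeholders, since the abstract does not list the actual swarm primitives or the behavior-tree schema. The sketch rejects malformed XML, unknown tags, and primitives that carry children.

```python
import xml.etree.ElementTree as ET

# Hypothetical control-flow nodes and swarm primitives; the real whitelist is not given.
CONTROL_NODES = {"BehaviorTree", "Sequence", "Selector"}
PRIMITIVES = {"FormLine", "MoveToTarget", "AvoidObstacle"}

def validate_bt(xml_text):
    """Return (ok, reason). Rejects malformed XML and any tag outside the whitelist."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, f"malformed XML: {exc}"
    if root.tag != "BehaviorTree":
        return False, f"unexpected root: {root.tag}"
    for node in root.iter():
        if node.tag not in CONTROL_NODES | PRIMITIVES:
            return False, f"unsupported node: {node.tag}"
        if node.tag in PRIMITIVES and len(node) > 0:
            return False, f"primitive has children: {node.tag}"
    return True, "ok"

good = "<BehaviorTree><Sequence><FormLine/><MoveToTarget/></Sequence></BehaviorTree>"
bad = "<BehaviorTree><Sequence><LaunchMissile/></Sequence></BehaviorTree>"
print(validate_bt(good), validate_bt(bad))
```

Because the check is deterministic, it can act as an execution gate regardless of how fluent the generated tree looks, which matches the abstract's point that generation quality alone is not sufficient for deployment.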


cs.RO / 26 / 2605.07771

Sensitivity-Based Robust NMPC for Close-Proximity Offshore Wind Turbine Inspection with a Tilted Multirotor

基于灵敏度的鲁棒非线性模型预测控制用于倾斜多旋翼的近距离海上风力涡轮机检测

Silano, Giuseppe, Saska, Martin

Abstract

Close-proximity offshore wind turbine inspection requires strict clearance control around large cylindrical structures under wind and model mismatch. Nominal Nonlinear Model Predictive Control (NMPC) may violate safety constraints when mass, inertia, thrust effectiveness, drag, or wind conditions differ from nominal assumptions. We propose a sensitivity-based robust NMPC for a tilted multirotor that robustifies the tower-clearance constraint via online constraint tightening. First-order parametric state sensitivities provide a structured-uncertainty margin, while bounded gusts are handled by a stage-dependent additive margin. The formulation augments the nominal NMPC with sensitivity propagation and margin evaluation only, leaving the receding-horizon optimization structure unchanged. Monte-Carlo evaluation over 500 uncertainty realizations on a boundary-critical helical inspection trajectory shows that the proposed controller eliminates the clearance violations observed under nominal NMPC at the cost of a moderate increase in solve time.

Chinese Translation

近距离海上风力涡轮机检测需要在风力和模型不匹配的情况下，对大型圆柱形结构周围进行严格的间隙控制。当质量、惯性、推力有效性、阻力或风况与名义假设不符时，名义非线性模型预测控制（NMPC）可能会违反安全约束。我们提出了一种基于灵敏度的鲁棒NMPC，用于倾斜多旋翼，通过在线约束收紧来增强塔间隙约束的鲁棒性。一阶参数状态灵敏度提供了结构化不确定性裕度，而有界阵风则通过随阶段变化的附加裕度进行处理。该方法仅在名义NMPC的基础上增加了灵敏度传播和裕度评估，滚动时域优化结构保持不变。在对边界关键的螺旋检测轨迹进行500次不确定性实现的蒙特卡洛评估中，结果表明所提出的控制器消除了在名义NMPC下观察到的间隙违规现象，代价是求解时间的适度增加。
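Constraint tightening from first-order sensitivities can be illustrated for a single constraint $h(x) \ge 0$ at one prediction stage. The sketch below uses a common box-uncertainty bound, $\sum_i |\partial h / \partial \theta_i| \, \Delta\theta_i$, plus an additive gust term. This is an assumed form for illustration, and the paper's exact margin may differ. All variable names and numbers are hypothetical.

```python
import numpy as np

def tightened_margin(grad_h, sensitivity, param_bounds, gust_margin):
    """First-order margin for a constraint h(x) >= 0 at one prediction stage.

    grad_h:       dh/dx at the nominal state, shape (n_x,)
    sensitivity:  dx/dtheta at this stage, shape (n_x, n_theta)
    param_bounds: bound on |delta theta_i| for each parameter, shape (n_theta,)
    gust_margin:  stage-dependent additive margin for bounded gusts (scalar)
    """
    dh_dtheta = grad_h @ sensitivity               # chain rule, shape (n_theta,)
    structured = np.abs(dh_dtheta) @ param_bounds  # worst case over the parameter box
    return structured + gust_margin

def satisfies_tightened(h_nominal, margin):
    """The nominal clearance must exceed the margin instead of merely being non-negative."""
    return h_nominal >= margin

grad_h = np.array([1.0, 0.0])        # clearance grows with the first state
sensitivity = np.array([[0.2, -0.1],
                        [0.0,  0.5]])
param_bounds = np.array([0.1, 0.2])  # e.g. bounds on mass and drag deviations
margin = tightened_margin(grad_h, sensitivity, param_bounds, gust_margin=0.05)
print(margin, satisfies_tightened(0.10, margin), satisfies_tightened(0.05, margin))
```

Only the right-hand side of the constraint changes, which is consistent with the abstract's statement that the receding-horizon optimization structure is left unchanged.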


cs.RO / 27 / 2605.07794

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

NoiseGate：作为信息门控的每潜在时间步调度学习在世界行动模型中的应用

Huang, Wen, Sun, Haoran, Guo, Yongjian, Ma, Yunxuan, Li, Haoran, Long, Jing, Mo, Zhouying, Guan, Zhong, Guo, Yucheng, Di, Shuai, Xiong, Junwu

Abstract

World Action Models (WAMs) are an emerging family of policies that tie robot action generation to future-observation modeling. In this work, we focus on the joint video--action modeling paradigm, where actions and imagined future observations are co-generated along a shared denoising or flow trajectory, so that perception, prediction, and control are coupled within one generative process. Existing WAMs typically realize this paradigm with a Mixture-of-Transformers (MoT), where video and action tokens interact through shared self-attention. This architecture can in principle assign a separate timestep $t_f$ to each predicted latent frame, yet current systems collapse this degree of freedom onto a single shared scalar $t$. Under the noise-as-masking view of Diffusion Forcing, this shared schedule imposes the unjustified prior that every predicted latent is equally reliable for action generation. We instead view the per-latent schedule as a learnable information-gating policy: by changing a latent frame's noise level, the policy modulates the reliability of its Key/Value contribution to the action tokens. We propose NoiseGate, which combines independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments during denoising, and task-reward optimization that trains the schedule policy without hand-crafted shape priors. Built on a joint video--action MoT backbone, NoiseGate delivers consistent gains on diverse RoboTwin random-scene manipulation tasks.

Chinese Translation

世界行动模型（WAMs）是一类新兴的策略，将机器人动作生成与未来观察建模相结合。在本研究中，我们专注于联合视频-动作建模范式，其中动作和想象的未来观察在共享的去噪或流动轨迹上共同生成，从而使感知、预测和控制在一个生成过程中相互耦合。现有的WAMs通常通过混合变换器（Mixture-of-Transformers, MoT）实现这一范式，其中视频和动作标记通过共享自注意力进行交互。原则上，该架构可以为每个预测的潜在帧分配一个单独的时间步 $t_f$，然而当前系统将这一自由度压缩为一个共享标量 $t$。在扩散强迫的噪声-掩蔽视角下，这一共享调度施加了不合理的先验，即每个预测的潜在帧在动作生成中同样可靠。我们将每个潜在帧的调度视为一种可学习的信息门控策略：通过改变潜在帧的噪声水平，该策略调节其对动作标记的关键/值贡献的可靠性。我们提出了NoiseGate，它结合了在主干训练期间独立的每潜在时间步采样、在去噪过程中发出每潜在时间增量的轻量级门控策略网络，以及无需手工设计形状先验的任务奖励优化。基于联合视频-动作的MoT主干，NoiseGate在多样的RoboTwin随机场景操作任务中提供了一致的性能提升。


cs.RO / 28 / 2605.07835

Many-to-Many Multi-Agent Pickup and Delivery

多对多多智能体取货与送货

Schneider, Ethan, Chen, Jingkai, Gu, Tianyi, Lian, Kunlei, Hutchinson, Seth, Chernova, Sonia

Abstract

Multi-robot systems in automated warehouses must manage continuous streams of pickup-and-delivery tasks while ensuring efficiency and safety. Prior work on Multi-Agent Pickup-and-Delivery (MAPD) has largely focused on the one-to-one variant, where each task has a fixed pickup and delivery location. In contrast, real warehouses often present many-to-many MAPD scenarios, where items, tracked by stock keeping unit (SKU) identifiers, can be retrieved from or stored at multiple locations, resulting in an NP-hard four-dimensional assignment problem. To solve the many-to-many MAPD problem, we contribute our algorithm: Many-to-Many Multi-Agent Pickup and Delivery (M2M). We experiment with two variants of our algorithm: one that minimizes estimated task durations (M2M), and one which incorporates SKU distribution into the objective function (M2M-wSKU). Simulation results over 8-hour warehouse operations show that our method consistently matches or outperforms prior state of the art, with M2M completing up to 22,000 more tasks on average across different environments and warehouse inventory densities.

Chinese Translation

自动化仓库中的多机器人系统必须管理持续的取货和送货任务流，同时确保效率和安全性。之前关于多智能体取货与送货（Multi-Agent Pickup-and-Delivery, MAPD）的研究主要集中在一对一的变体上，其中每个任务都有固定的取货和送货地点。相比之下，真实的仓库往往呈现多对多的MAPD场景，其中物品通过库存单位（SKU）标识符进行跟踪，可以从多个地点取回或存储，从而导致一个NP难的四维分配问题。为了解决多对多MAPD问题，我们提出了我们的算法：多对多多智能体取货与送货（Many-to-Many Multi-Agent Pickup and Delivery, M2M）。我们对算法进行了两种变体的实验：一种是最小化估计任务持续时间（M2M），另一种是在目标函数中纳入SKU分布（M2M-wSKU）。在8小时的仓库操作模拟中，我们的方法在不同环境和仓库库存密度下，始终与之前的最先进技术相匹配或超越，M2M在平均情况下完成了多达22,000个额外任务。


cs.RO / 29 / 2605.07877

Melding LLM and temporal logic for reliable human-swarm collaboration in complex scenarios

将大型语言模型与时序逻辑结合以实现复杂场景下可靠的人-集群协作

Chen, Junfeng, Zhu, Yuxiao, Zhuo, An, Zhang, Xintong, Zhang, Shuo, Wen, Guanghui, Dong, Xiwang, Guo, Meng, Li, Zhongkui

Abstract

Robot swarms promise scalable assistance in complex and hazardous environments. Task planning lies at the core of human-swarm collaboration, translating the operator's intent into coordinated swarm actions and helping determine when validation or intervention is required during execution. In long-horizon missions under dynamic scenarios, however, reliable task planning becomes difficult to maintain: emerging events and changing conditions demand continual adaptation, and sustained operator oversight imposes substantial cognitive burden. Existing LLM-based planning tools can support plan generation, yet they remain susceptible to invalid task orderings and infeasible robot actions, resulting in frequent manual adjustment. Here we introduce a neuro-symbolic framework for long-horizon human-swarm collaboration that tightly melds verifiable task planning with context-grounded LLM reasoning. We formalize mission goals and operational rules as temporal logic formulas and admissible task orderings as task automata. Conditioned on these formal constraints and live perceptual context, LLMs generate executable subtask sequences that satisfy mission rules and remain grounded in the current scene. An uncertainty-aware scheduler then assigns subtasks across the heterogeneous swarm to maximize parallelism while remaining resilient to disruptions. An event-triggered interaction protocol further limits operator involvement to sparse, high-level confirmation and guidance. Deployment on a heterogeneous robotic fleet yields similar results while remaining robust to hardware-specific actuation and communication uncertainties. Together, these results support a formal and scalable paradigm for reliable and low-overhead human-swarm collaboration in dynamic environments.

Chinese Translation

机器人群体在复杂和危险的环境中提供可扩展的支持。任务规划是人-集群协作的核心，将操作员的意图转化为协调的群体行动，并帮助确定在执行过程中何时需要验证或干预。然而，在动态场景下的长期任务中，可靠的任务规划变得难以维持：突发事件和变化的条件要求持续适应，而持续的操作员监督则带来了巨大的认知负担。现有的基于大型语言模型（LLM）的规划工具可以支持计划生成，但它们仍然容易受到无效任务排序和不可行机器人动作的影响，导致频繁的手动调整。在此，我们引入了一种神经符号框架，用于长期人-集群协作，紧密结合可验证的任务规划与基于上下文的LLM推理。我们将任务目标和操作规则形式化为时序逻辑公式，将可接受的任务排序形式化为任务自动机。在这些形式约束和实时感知上下文的条件下，LLM生成可执行的子任务序列，以满足任务规则并与当前场景保持一致。一个考虑不确定性的调度器随后在异构群体中分配子任务，以最大化并行性，同时保持对干扰的韧性。事件触发的交互协议进一步将操作员的参与限制为稀疏的高层确认和指导。在异构机器人舰队上的部署产生了类似的结果，同时对硬件特定的驱动和通信不确定性保持稳健。综合来看，这些结果支持了一种在动态环境中实现可靠且低开销的人-集群协作的正式和可扩展的范式。


cs.RO / 30 / 2605.07885

AERO-VIS: Asynchronous Event-based Real-time Onboard Visual-Inertial SLAM

AERO-VIS：异步事件驱动实时机载视觉惯性SLAM

Burkhardt, Yannick, Laina, Sebastián Barbas, Boche, Simon, Freißmuth, Leonard, Leutenegger, Stefan

Abstract

The robustness of event cameras to high dynamic range and motion blur holds the potential to improve visual odometry systems in challenging environments. Although their high temporal resolution does not require synchronous processing, most event-based odometry methods still run at fixed rates, which simplifies system design but restricts latency and throughput. In this work, we present AERO-VIS, a stereo event-inertial SLAM system with an integrated, data-driven, robust, and performance-optimized keypoint detector. By processing the event stream asynchronously, the system dynamically adapts to downstream runtime demands, ensuring low-latency and real-time performance. When deploying AERO-VIS on a UAV, we achieve unprecedented accuracy in onboard event-based SLAM. These unique characteristics enable us to present the first purely event-based inertial SLAM system that demonstrates closed-loop UAV control and large-scale state estimation while relying solely on onboard compute. A video of the experiments and the source code are available at ethz-mrl.github.io/AERO-VIS.

Chinese Translation

事件相机对高动态范围和运动模糊的鲁棒性有潜力在具有挑战性的环境中改善视觉里程计系统。尽管其高时间分辨率不需要同步处理，但大多数基于事件的里程计方法仍以固定速率运行，这简化了系统设计，但限制了延迟和吞吐量。在本研究中，我们提出了AERO-VIS，一个立体事件惯性SLAM系统，配备集成的数据驱动、鲁棒且性能优化的关键点检测器。通过异步处理事件流，该系统动态适应下游运行时需求，确保低延迟和实时性能。当在无人机上部署AERO-VIS时，我们在机载基于事件的SLAM中实现了前所未有的精度。这些独特特性使我们能够展示第一个纯基于事件的惯性SLAM系统，该系统在完全依赖机载计算的情况下，演示了闭环无人机控制和大规模状态估计。实验视频和源代码可在ethz-mrl.github.io/AERO-VIS获取。


cs.RO / 31 / 2605.07943

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

TAVIS：模仿学习中的自我中心主动视觉与预期注视基准

Spigler, Giacomo

Abstract

Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $\pi_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.

Chinese Translation

主动视觉——在操作过程中由策略控制自身的注视——已成为模仿学习的关键能力，过去一年中多个独立系统展示了其优势。然而，目前尚无共享基准来比较不同方法或量化主动视觉的贡献、适用的任务类型及其条件。我们介绍了TAVIS，这是一个用于主动视觉模仿学习的评估基础设施，包含两个互补的任务套件——TAVIS-Head（5个任务，通过摇头/仰头进行全局搜索）和TAVIS-Hands（3个任务，通过手腕摄像头进行局部遮挡），基于IsaacLab构建的两个类人躯干表现（GR1T2，Reachy2）。TAVIS提供了三个评估原语：在相同演示上进行的配对头摄像头与固定摄像头协议；GALT（注视-动作提前时间），这是一个基于认知科学和人机交互的新指标，用于量化学习策略中的预期注视；以及程序化的ID/OOD划分。使用扩散策略（Diffusion Policy）和$\pi_0$的基线实验表明：（i）主动视觉通常有帮助，但其益处是任务依赖的而非统一的；（ii）在两个任务套件中，多任务策略在受控分布变化下急剧退化；（iii）仅靠模仿就能产生预期注视，其中位提前时间与人类遥控操作员的参考相当。代码、评估脚本、演示（LeRobot v3.0；约2200个回合）和训练基线已发布在https://github.com/spiglerg/tavis和https://huggingface.co/tavis-benchmark。

cs.RO / 32 / 2605.07988

Evaluation of an Actuated Spine in Agile Quadruped Locomotion

在灵活四足运动中对驱动脊柱的评估

Bohlinger, Nico, Kicki, Piotr, Tateo, Davide, Walas, Krzysztof, Peters, Jan

Abstract

The spine plays a crucial role in the dynamic locomotion of quadrupedal animals, improving the stability, speed, and efficiency of their gait, especially for fast-paced and highly agile movements. Therefore, the spine is also a promising and natural way to extend the capabilities of quadruped robots. This paper empirically investigates the benefits of an actuated spine for learning agile quadruped locomotion. We evaluate whether the use of the spine brings benefits in terms of high-speed running, climbing stairs, climbing high-angle slopes, hurdling, and crawling scenarios. We conducted an empirical study in MuJoCo simulation using the Silver Badger robot from MAB Robotics with an actuated 1-DOF spine in the sagittal plane. The obtained results show that the use of the spine provides the robot with increased agility and allows it to overcome higher stairs, steeper slopes, higher obstacles, and smaller passages.

Chinese Translation

脊柱在四足动物的动态运动中发挥着至关重要的作用，改善了它们步态的稳定性、速度和效率，尤其是在快速和高度灵活的运动中。因此，脊柱也是扩展四足机器人能力的一种有前景且自然的方式。本文实证研究了驱动脊柱在学习灵活四足运动中的优势。我们评估了脊柱的使用是否在高速奔跑、爬楼梯、攀爬高角度坡、跨越障碍和爬行场景中带来了好处。我们在MuJoCo仿真中进行了实证研究，使用了MAB Robotics的Silver Badger机器人，该机器人配备了在矢状面上的驱动1-DOF脊柱。获得的结果表明，使用脊柱使机器人具备了更高的灵活性，能够克服更高的楼梯、更陡的坡道、更高的障碍和更狭窄的通道。

cs.RO / 33 / 2605.08020

Active Embodiment Identification with Reinforcement Learning for Legged Robots

基于强化学习的足式机器人主动具身形态识别

Bohlinger, Nico, Peters, Jan

Abstract

We present an active embodiment identification method for legged robots that jointly learns information-seeking behavior and explicit embodiment prediction. Using a history-augmented URMA architecture, the method infers joint-level and global embodiment parameters through interaction with the environment in simulation across different morphologies.

Chinese Translation

我们提出了一种面向足式机器人的主动具身形态识别方法，该方法联合学习信息寻求行为和显式的具身形态预测。该方法采用历史增强的URMA架构，在仿真中通过与环境的交互，推断不同形态下关节级别和全局的具身形态参数。

cs.RO / 34 / 2605.08084

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

123D：大规模统一多模态自动驾驶数据

Dauner, Daniel, Charraut, Valentin, Berle, Bastian, Li, Tianyu, Nguyen, Long, Wang, Jiabao, Jing, Changhui, Igl, Maximilian, Caesar, Holger, Ivanovic, Boris, Liao, Yiyi, Geiger, Andreas, Chitta, Kashyap

Abstract

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.

Chinese Translation

自动驾驶的追求产生了机器人领域中最丰富的传感器数据集合之一。然而，其规模和多样性仍然大部分未被开发。每个数据集采用不同的2D和3D模态，如摄像头、激光雷达、自我状态、注释、交通信号灯和高清地图，且具有不同的采样率和同步方案。它们以碎片化的格式存在，要求复杂的依赖关系，无法在同一开发环境中原生共存。此外，注释规范的重大不一致性阻碍了跨多个数据集的训练或泛化测量。我们提出了123D，一个开源框架，通过单一API统一这些多模态驾驶数据。为了处理同步，我们将每种模态存储为独立的带时间戳的事件流，没有规定的采样率，从而实现跨任意数据集的同步或异步访问。使用123D，我们整合了八个真实世界的驾驶数据集，涵盖3300小时和90000公里，以及一个具有可配置采集脚本的合成数据集，并提供数据分析和可视化工具。我们进行了系统研究，比较注释统计数据，并评估每个数据集的姿态和校准精度。此外，我们展示了123D所支持的两个应用：跨数据集的3D物体检测迁移和规划的强化学习，并提供了未来方向的建议。代码和文档可在 https://github.com/kesai-labs/py123d 获取。
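The storage idea in the abstract above (each modality kept as an independent timestamped event stream with no prescribed rate) can be illustrated with a minimal sketch. Everything here is hypothetical: `EventStream`, `latest_at`, and `synchronized_view` are illustrative names, not the py123d API, and "latest event at or before the query time" is only one plausible synchronization policy.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional, Tuple


@dataclass
class EventStream:
    """One modality stored as timestamped events with no fixed rate (hypothetical)."""

    timestamps_us: List[int] = field(default_factory=list)
    payloads: List[Any] = field(default_factory=list)

    def append(self, timestamp_us: int, payload: Any) -> None:
        # Assumption: events arrive in non-decreasing time order.
        if self.timestamps_us and timestamp_us < self.timestamps_us[-1]:
            raise ValueError("events must be appended in time order")
        self.timestamps_us.append(timestamp_us)
        self.payloads.append(payload)

    def latest_at(self, query_us: int) -> Optional[Tuple[int, Any]]:
        """Asynchronous access: most recent event at or before query_us, or None."""
        idx = bisect_right(self.timestamps_us, query_us) - 1
        if idx < 0:
            return None
        return self.timestamps_us[idx], self.payloads[idx]


def synchronized_view(
    streams: Dict[str, EventStream], reference: str
) -> Iterator[Tuple[int, Dict[str, Optional[Tuple[int, Any]]]]]:
    """Synchronous access: one frame per event of the reference stream,
    paired with the latest event of every stream at that time."""
    for t in streams[reference].timestamps_us:
        yield t, {name: s.latest_at(t) for name, s in streams.items()}


camera = EventStream()
lidar = EventStream()
for t in (0, 33_000, 66_000, 100_000):  # roughly 30 Hz camera
    camera.append(t, f"img@{t}")
for t in (10_000, 110_000):  # 10 Hz lidar, offset by 10 ms
    lidar.append(t, f"sweep@{t}")

frames = list(synchronized_view({"camera": camera, "lidar": lidar}, "camera"))
```

Because no stream carries a rate, the same data can be re-synchronized against any reference modality at read time instead of being resampled when it is written.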

计算机视觉 (Computer Vision)

155

cs.CV / 1 / 2605.06708

Visual Text Compression as Measure Transport

视觉文本压缩作为测度传输

Tang, Lv, Zheng, Tianyi, Liu, Yang, Li, Bo, Li, Xingyu

Abstract

Visual text compression (VTC) promises efficient long-context processing by rendering text into an image and re-encoding it with a vision-language model, often producing $3$--$20\times$ fewer decoder tokens than subword tokenization. Yet token savings do not translate predictably into downstream utility: on some tasks the visual path matches or exceeds the text path, on others it collapses, and the compression ratio itself does not predict which regime will occur. The missing quantity is therefore not another summary of efficiency, but a principled measure of task-relevant information loss induced by visual encoding. We address this problem by formulating VTC in the language of measure transport. Treating text and visual tokens as empirical probability measures, we show that the ViT patch encoder induces a push-forward map whose transport cost decomposes into a precision cost from within-patch aggregation and a coverage cost from cross-patch fragmentation. Both terms are estimable from downstream-label-free probes. This formulation yields two operational consequences: a downstream-label-free routing criterion that selects whether to use the visual path for a given input or benchmark instance, and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across $24$ NLP datasets at Qwen3-4B, our label-free rule matches the per-dataset oracle on $17/24$ datasets ($70.8\%$), and improves the average task score by $+3.3\%$ with $-10.3\%$ average tokens relative to a pure-LLM.

Chinese Translation

视觉文本压缩（VTC）通过将文本呈现为图像并使用视觉-语言模型重新编码，承诺实现高效的长上下文处理，通常产生比子词标记少 $3$--$20\times$ 的解码器标记。然而，标记节省并不总是能预测性地转化为下游效用：在某些任务中，视觉路径与文本路径相匹配或超越，而在其他任务中则崩溃，压缩比本身也无法预测哪种情况会发生。因此，缺失的量不是另一种效率的总结，而是由视觉编码引起的与任务相关的信息损失的原则性测量。我们通过将 VTC 形式化为测度传输的语言来解决这个问题。将文本和视觉标记视为经验概率测度，我们展示了 ViT 补丁编码器引入了一种推前映射，其传输成本分解为来自补丁内聚合的精度成本和来自补丁间碎片化的覆盖成本。这两个项均可以通过下游无标签探测器进行估计。这种形式化产生了两个操作性后果：一个无标签的下游路由标准，用于选择是否对给定输入或基准实例使用视觉路径，以及一个基于传输信息的注视机制，用于以更高的分辨率重新编码高成本区域。在 Qwen3-4B 的 $24$ 个 NLP 数据集上，我们的无标签规则在 $17/24$ 个数据集上与每个数据集的 oracle 匹配（$70.8\%$），并在相对于纯 LLM 的情况下将平均任务得分提高了 $+3.3\%$，同时平均标记减少了 $-10.3\%$。

cs.CV / 2 / 2605.06714

Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

计算机视觉与医学诊断中的边缘深度学习：全面综述

Xu, Yiwen, Khan, Tariq M., Song, Yang, Meijering, Erik

Abstract

Edge deep learning, a paradigm change reconciling edge computing and deep learning, facilitates real-time decision making attuned to environmental factors through the close integration of computational resources and data sources. Here we provide a comprehensive review of the current state of the art in edge deep learning, focusing on computer vision applications, in particular medical diagnostics. An overview of the foundational principles and technical advantages of edge deep learning is presented, emphasising the capacity of this technology to revolutionise a wide range of domains. Furthermore, we present a novel categorisation of edge hardware platforms based on performance and usage scenarios, facilitating platform selection and operational effectiveness. Following this, we dive into approaches to effectively implement deep neural networks on edge devices, encompassing methods such as lightweight design and model compression. Reviewing practical applications in the fields of computer vision in general and medical diagnostics in particular, we demonstrate the profound impact edge-deployed deep learning models can have in real-life situations. Finally, we provide an analysis of potential future directions and obstacles to the adoption of edge deep learning, with the intention to stimulate further investigations and advancements of intelligent edge deep learning solutions. This survey provides researchers and practitioners with a comprehensive reference shedding light on the critical role deep learning plays in the advancement of edge computing applications.

Chinese Translation

边缘深度学习是一种将边缘计算与深度学习相结合的范式变革，通过计算资源与数据源的紧密集成，促进了与环境因素相适应的实时决策。本文提供了对边缘深度学习当前技术状态的全面回顾，重点关注计算机视觉应用，特别是医学诊断。我们概述了边缘深度学习的基础原则和技术优势，强调了该技术在多个领域变革的潜力。此外，我们基于性能和使用场景对边缘硬件平台进行了新的分类，以便于平台选择和操作效率的提升。接着，我们深入探讨了在边缘设备上有效实施深度神经网络的方法，包括轻量化设计和模型压缩等技术。通过回顾计算机视觉领域，尤其是医学诊断中的实际应用，我们展示了边缘部署的深度学习模型在现实场景中的深远影响。最后，我们分析了边缘深度学习未来可能的发展方向和面临的障碍，旨在激发对智能边缘深度学习解决方案的进一步研究和进展。本综述为研究人员和从业者提供了一个全面的参考，阐明了深度学习在边缘计算应用发展中的关键作用。

cs.CV / 3 / 2605.06747

HumanNet: Scaling Human-centric Video Learning to One Million Hours

HumanNet：将以人为中心的视频学习扩展至一百万小时

Deng, Yufan, Zhou, Daquan

Abstract

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation on the value of this design through controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.

Chinese Translation

具身智能的进展日益依赖于可扩展的数据基础设施。尽管视觉和语言已经通过互联网语料库实现了扩展，但学习物理交互仍然受到缺乏大型、多样化和丰富注释的人类活动数据的限制。我们提出了HumanNet，一个一百万小时的人类中心视频语料库，捕捉人类如何在大规模上与物理世界互动。HumanNet涵盖第一人称和第三人称视角，涵盖细粒度活动、人-物体交互、工具使用以及在多样化真实环境中的长时间行为。除了原始视频外，该数据集还提供了以交互为中心的注释，包括字幕、运动描述以及与手和身体相关的信号，从而支持运动感知和交互感知的学习。除了规模，HumanNet还引入了一种系统的数据策划范式用于具身学习，其中人类中心过滤、时间结构、视角多样性和注释丰富性被视为首要设计原则。这一设计将非结构化的互联网视频转变为可扩展的表示学习、活动理解、运动生成和人机转移的基础。我们通过控制的视觉-语言-行动消融实验对这一设计的价值进行了初步验证：在固定的验证数据集下，基于HumanNet的1000小时自我中心视频继续训练的Qwen VLM模型超越了基于Magic Cobot的100小时真实机器人数据的继续训练，表明自我中心的人类视频可能成为机器人数据的可扩展且具有成本效益的替代品。通过构建这个项目，我们旨在探索利用以人为中心的视频扩展具身基础模型的机会，而不是仅仅依赖于特定于机器人的数据。

cs.CV / 4 / 2605.06758

R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations

R$^3$L：基于相对空间关系推理三维布局

Gu, Zhifeng, Wang, Yuqi, Wang, Bing

Abstract

Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post-hoc heuristics. In this paper, we propose R$^3$L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi-hop reasoning requires repeated reference-frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self-consistency through an imagine-and-revise loop. We further introduce supportive spatial optimization to ease pose optimization via global-to-local coordinate re-parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R$^3$L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning. The code is available at https://github.com/Neal2020GitHub/R3L.

Chinese Translation

相对空间关系提供了空间结构的紧凑表示，并且在三维布局生成中的相对空间推理中具有基础性作用。近期的研究利用多模态大型语言模型（Multimodal Large Language Models, MLLMs）来推断这些关系，但推断出的关系往往不可靠，通常需要通过事后启发式方法进行处理。本文提出了R$^3$L，一个通用框架，旨在提高三维布局生成中相对空间推理的可靠性和一致性。我们的主要动机是，多跳推理需要重复的参考框架变换，这会导致推断关系中的误差累积，从而导致语义和度量的漂移。为了解决这个问题，我们提出了不变空间分解，以打破耦合关系链，并通过想象与修正循环促进自一致性的一致空间想象。此外，我们进一步引入支持性空间优化，通过全局到局部坐标的重新参数化来简化姿态优化。在多种场景类型和指令下的广泛实验表明，R$^3$L生成了更具物理可行性和语义一致性的布局。值得注意的是，我们的分析表明，解决框架引起的不一致性对于可靠的多跳相对空间推理至关重要。代码可在 https://github.com/Neal2020GitHub/R3L 获取。

cs.CV / 5 / 2605.06809

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

何时观察？通过学习何时、何地以及计算什么来实现快速视频识别

Salamatian, Ali, Fuller, Anthony, Sarkar, Pritam, Green, James R., Sigal, Leonid, Shelhamer, Evan

Abstract

Transformers dominate video recognition. They split videos into tokens, and processing those tokens incurs an expensive, superlinear computational cost. Yet videos are filled with redundancy, which calls the need for this expense into question. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, which measures time in practice, LookWhen is more efficient still: it is 6.7x faster than InternVideo2-B at equal accuracy.

Chinese Translation

Transformer在视频识别中占据主导地位。它们将视频分割为标记，而处理这些标记需要昂贵的超线性计算成本。然而，视频中充满了冗余，因此我们可以质疑这种开销的必要性。我们提出了LookWhen，一个选择器-提取器框架，将视频识别分解为学习何时、何地以及计算什么。我们的浅层选择器获取缩小的视频，并快速对时空中的所有标记进行评分，而我们的深层提取器获取前K个选定的标记，以近似完整视频表示，而无需实际处理所有标记。一个关键挑战是为选择和提取定义有效的监督。对于选择的预训练，我们引入了一种基于表示的评分，利用简单的最近邻距离按独特性对标记进行排序。对于提取的预训练，我们同时蒸馏视频教师和图像教师，并对其逐帧表示进行归一化，以学习视频中的变化。通过这些策略，我们的选择器-提取器学习到了通用且高效的表示，可用于特征提取或针对任务的微调。通过在Kinetics-400、SSv2、Epic-Kitchens、Diving48、Jester和Charades上的实验，我们表明LookWhen在准确性与计算量之间实现了比高效模型和相似规模的升级基线更好的权衡。在12个案例中的9个（6个任务 x 2个设置）中，LookWhen在准确性-FLOPs上帕累托占优，并在3个案例中大致持平。在以实际耗时衡量的准确性-吞吐量方面，LookWhen的效率更高，在准确性相同的情况下比InternVideo2-B快6.7倍。
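The selection signal described above, ranking tokens by uniqueness with a simple nearest-neighbor distance, can be sketched in a few lines. This is an illustration under assumptions, not the LookWhen implementation: Euclidean distance, the brute-force search, and the names `uniqueness_scores` and `select_top_k` are all choices made here, and the toy 2-D vectors stand in for real token representations.

```python
import math
from typing import List, Sequence


def uniqueness_scores(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Score each token by the distance to its nearest other token.

    Assumes at least two tokens and Euclidean distance (an assumption;
    the abstract only says "a simple nearest-neighbor distance").
    """
    scores = []
    for i, a in enumerate(tokens):
        nearest = min(math.dist(a, b) for j, b in enumerate(tokens) if j != i)
        scores.append(nearest)
    return scores


def select_top_k(tokens: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k most unique tokens, in original order."""
    scores = uniqueness_scores(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


tokens = [
    [0.0, 0.0],   # near-duplicate background token
    [0.1, 0.0],   # near-duplicate background token
    [0.0, 0.1],   # near-duplicate background token
    [5.0, 5.0],   # distinctive token
    [-4.0, 3.0],  # distinctive token
]
selected = select_top_k(tokens, k=2)
```

Redundant tokens sit close to a neighbor and score low, so a fixed budget of K tokens goes to the distinctive ones.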

cs.CV / 6 / 2605.06859

Knowledge Transfer Scaling Laws for 3D Medical Imaging

3D医学成像的知识转移规模法则

Lee, Ho Hin, Du, Dongna, Wang, Chu, Huo, Yuankai, Gu, Shi, Gee, James C., Wu, Yifan

Abstract

Vision foundation models are increasingly moving beyond 2D to volumetric domains such as 3D medical imaging, where unified pretraining across different imaging modalities (i.e. CT, MRI, and PET) could provide foundational models for diverse clinical tasks. However, training such models requires mixing heterogeneous imaging domains, and current mixture strategies remain largely heuristic. In this work, we observe that different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric: training on one domain can substantially improve another, but the reverse may be much weaker. Interestingly, both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Motivated by these findings, we formulate data allocation as a scaling-law optimization problem. The derived allocations reveal an interpretable hub-and-island structure: highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation, while isolated domains act as islands requiring direct investment. Empirically, transfer-aware allocation outperforms data-proportional sampling by up to 58% and generalizes well to unseen budgets with r=0.989. Downstream validation on disease classification and organ/lesion segmentation further confirms that the derived transfer-aware mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks.

Chinese Translation

视觉基础模型正逐渐超越二维领域，进入体积领域，如3D医学成像，其中不同成像模态（即CT、MRI和PET）的统一预训练可以为多样的临床任务提供基础模型。然而，训练这样的模型需要混合异质成像领域，而当前的混合策略仍然主要是启发式的。在本研究中，我们观察到不同医学成像领域在预训练期间以不同的速率扩展，且领域之间的知识转移具有明显的不对称性：在一个领域的训练可以显著改善另一个领域，但反之可能效果较弱。有趣的是，MAE重建损失和跨领域转移都遵循可预测的幂律趋势，并具有领域特定的行为。基于这些发现，我们将数据分配形式化为一个规模法则优化问题。推导出的分配揭示了一种可解释的枢纽-岛屿结构：高度可转移的领域作为枢纽，惠及许多其他领域，值得进行战略性分配，而孤立的领域则作为岛屿，需要直接投资。从经验上看，考虑转移的分配在性能上比数据比例采样提高了多达58%，并且在未见预算上具有良好的泛化能力，相关系数r=0.989。对疾病分类和器官/病灶分割的下游验证进一步确认，推导出的考虑转移的混合提供了更强的预训练表示，适用于临床3D医学成像任务。
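The abstract reports that reconstruction loss and cross-domain transfer follow power-law trends. As a hedged illustration of how such a trend is usually fitted, the sketch below assumes the simplest form, loss = a * size^(-b) with no irreducible-loss term, and fits it by least squares in log-log space. The functional form, the function names, and the synthetic data points are assumptions made here; none of them come from the paper.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[float], losses: List[float]) -> Tuple[float, float]:
    """Fit loss = a * size**(-b) by ordinary least squares on log-log values."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a: float, b: float, size: float) -> float:
    """Extrapolate the fitted law to an unseen data budget."""
    return a * size ** (-b)


# Synthetic points generated from a = 2.0, b = 0.3 (not measurements).
sizes = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * s ** (-0.3) for s in sizes]
a_hat, b_hat = fit_power_law(sizes, losses)
```

With one fitted law per domain (and per source-target pair for transfer), predicted losses at a candidate budget can be compared across allocations, which is the kind of optimization the abstract describes.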

cs.CV / 7 / 2605.06876

AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting

AdpSplit：面向3D高斯泼溅中更快几何发现的误差驱动自适应分裂

Lee, Yongjae, Li, Jingxing, Yadav, Abhay Kumar, Chellappa, Rama, Fan, Deliang

Abstract

Adaptive density control in 3D Gaussian Splatting (3DGS) repeatedly grows the Gaussian population through fixed-cardinality random splitting to discover useful scene structure. However, in vanilla 3DGS, its binary split operator requires many densification rounds to expose fine details, making it a bottleneck for efficient training schedules with fewer iterations. We introduce AdpSplit, an error-driven adaptive split operator that determines the number of split children and initializes the child parameters from L1-pixel-error region statistics, enabling fewer densification iterations, thus reduced training time, while preserving the rendering quality of full-schedule training. Across the MipNeRF360, Deep-Blending, and Tanks&Temples datasets, AdpSplit reduces the training time of multiple accelerated 3DGS pipelines by 9.2%-22.3% as a simple drop-in replacement for the standard split operator. With FastGS, AdpSplit matches the full-schedule PSNR on MipNeRF360 while reducing training time by 16.4%, corresponding to a 12.6x acceleration over vanilla 3DGS.

Chinese Translation

在3D高斯泼溅（3D Gaussian Splatting，3DGS）中，自适应密度控制通过固定基数的随机分裂反复增加高斯数量，以发现有用的场景结构。然而，在原始3DGS中，其二元分裂算子需要多轮致密化才能揭示精细细节，这使其成为迭代次数较少的高效训练计划的瓶颈。我们引入了AdpSplit，这是一种误差驱动的自适应分裂算子，它根据L1像素误差区域统计来确定分裂子节点的数量并初始化子节点参数，从而减少致密化迭代次数、缩短训练时间，同时保持完整训练计划的渲染质量。在MipNeRF360、Deep-Blending和Tanks&Temples数据集上，AdpSplit作为标准分裂算子的即插即用替代，将多个加速3DGS流程的训练时间减少了9.2%-22.3%。与FastGS结合时，AdpSplit在MipNeRF360上达到了完整训练计划的PSNR，同时训练时间减少了16.4%，相较于原始3DGS加速了12.6倍。

cs.CV / 8 / 2605.06889

TriDE: Triangle-Consistent Translation Directions for Global Camera Pose Estimation

TriDE：用于全局相机位姿估计的三角形一致平移方向

Chen, Francisco, Wang, Yiran, Shi, Yunpeng

Abstract

Pairwise translation directions are a key input to camera location estimation in global structure-from-motion. Existing estimators usually process each image pair independently, producing directions that may be locally plausible but inconsistent with the other relative directions in the viewing graph. To estimate the directions jointly, we propose TriDE, which exploits camera-triangle consistency as an efficient higher-order verification signal. Instead of solving a costly global nonlinear optimization problem that is sensitive to initialization, TriDE refines unreliable pairwise directions through message passing between directions and their incident weighted triangles. This information propagation strategy enables us to establish a strong phase-transition bound for exact recovery under a realistic random corruption model. Experiments on real image graphs show that TriDE improves direction accuracy by a large margin and yields better downstream camera locations, providing a practical link between local pairwise estimation and global camera pose geometry.

Chinese Translation

成对平移方向是全局运动恢复结构（structure-from-motion）中相机位置估计的关键输入。现有的估计方法通常独立处理每对图像，产生的方向可能在局部上是合理的，但与视图图中的其他相对方向不一致。为了联合估计方向，我们提出了TriDE，它利用相机三角形一致性作为一种高效的高阶验证信号。TriDE并不求解一个对初始化敏感且代价高昂的全局非线性优化问题，而是通过方向与其关联的加权三角形之间的消息传递来细化不可靠的成对方向。这种信息传播策略使我们能够在现实的随机损坏模型下建立一个关于精确恢复的强相变界。在真实图像图上的实验表明，TriDE大幅提高了方向的准确性，并得到了更好的下游相机位置，为局部成对估计与全局相机位姿几何之间提供了实用的联系。
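Why a camera triangle can verify pairwise directions at all follows from a simple identity: for camera centers p_i, p_j, p_k, the displacements (p_j - p_i) + (p_k - p_j) + (p_i - p_k) sum to zero, so the three unit directions must be coplanar and their scalar triple product vanishes. The sketch below checks only this necessary condition. It is not TriDE's consistency measure, weighting, or message-passing scheme, none of which are specified in the abstract; all function names and camera positions are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unit(v: Sequence[float]) -> Vec3:
    """Normalize a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def direction(p_from: Vec3, p_to: Vec3) -> Vec3:
    """Unit translation direction from one camera center to another."""
    return unit([b - a for a, b in zip(p_from, p_to)])


def triple_product(a: Vec3, b: Vec3, c: Vec3) -> float:
    """Scalar triple product a . (b x c)."""
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return sum(x * y for x, y in zip(a, cross))


def triangle_inconsistency(t_ij: Vec3, t_jk: Vec3, t_ki: Vec3) -> float:
    """Zero when the three directions are coplanar, as a valid triangle requires."""
    return abs(triple_product(t_ij, t_jk, t_ki))


# Hypothetical camera centers.
p_i, p_j, p_k = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)
t_ij, t_jk, t_ki = direction(p_i, p_j), direction(p_j, p_k), direction(p_k, p_i)

clean = triangle_inconsistency(t_ij, t_jk, t_ki)
corrupted = triangle_inconsistency(t_ij, (0.0, 0.0, 1.0), t_ki)  # bad t_jk
```

A direction that takes part in many triangles can thus be cross-checked many times without solving for any camera position, which is what makes triangles attractive as a cheap higher-order signal.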

cs.CV / 9 / 2605.06891

Towards Fairness under Label Bias in Image Segmentation: Impact, Measurement and Mitigation

在标签偏差下实现图像分割的公平性：影响、测量与缓解

Parikh, Aditya, Frank, Stella, Das, Sneha, Feragen, Aasa

Abstract

Labeled datasets reflect the biases of their annotation pipelines, which sometimes introduce label bias: group-conditional label errors that cause systematic performance disparities across demographic subgroups. Label bias in image segmentation remains underexplored, as even detecting it typically requires clean, unbiased annotations, which are not readily available. We present a data-centric adaptation of Confident Learning to segmentation, allowing detection of label bias directly in the training data without a clean, unbiased ground truth. By comparing the provided training labels to the model's confident predictions, we isolate directional errors that quantify the presence and nature of bias, where standard overlap metrics like Dice fail. We further show that label bias influences subgroup separability in the encoder's feature space, an artifact we leverage for bias mitigation rather than suppressing it. We evaluate three datasets, spanning from synthetic to real-life bias, showing how our framework reliably detects and mitigates bias without access to clean labels, achieving equitable performance across experimental conditions.

Chinese Translation

标注数据集反映了其注释流程的偏见，这有时会引入标签偏差：导致不同人口子群体之间系统性性能差异的群体条件标签错误。图像分割中的标签偏差仍然未得到充分探索，因为即使检测它通常也需要干净且无偏的注释，而这些注释并不容易获得。我们提出了一种数据中心的自信学习（Confident Learning）适应方法，应用于分割任务，允许在没有干净、无偏真值的情况下直接在训练数据中检测标签偏差。通过将提供的训练标签与模型的自信预测进行比较，我们隔离出定向错误，从而量化偏差的存在和性质，而标准的重叠度量（如Dice）则无法做到。我们进一步展示了标签偏差影响编码器特征空间中的子群可分性，这一特征我们利用于偏差缓解，而不是简单压制。我们评估了三个数据集，从合成偏差到现实生活中的偏差，展示了我们的框架如何在没有干净标签的情况下可靠地检测和缓解偏差，实现实验条件下的公平性能。
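To make "comparing the provided training labels to the model's confident predictions" concrete, here is a minimal per-pixel sketch for binary segmentation. It uses the standard Confident Learning threshold (for each class, the mean predicted probability of that class over pixels carrying that label) and counts disagreements by direction. How the paper adapts this to segmentation and to demographic subgroups is not stated in the abstract, so the function names, the tie handling, and the toy data are assumptions.

```python
from typing import Dict, List, Optional


def class_thresholds(labels: List[int], fg_probs: List[float]) -> Dict[int, float]:
    """Per-class threshold: mean predicted probability of the class over
    the pixels that were given that label."""
    fg = [p for y, p in zip(labels, fg_probs) if y == 1]
    bg = [1.0 - p for y, p in zip(labels, fg_probs) if y == 0]
    return {0: sum(bg) / len(bg), 1: sum(fg) / len(fg)}


def confident_prediction(fg_prob: float, thresholds: Dict[int, float]) -> Optional[int]:
    """Class whose probability clears its threshold, or None if neither does."""
    candidates = [
        (prob, cls)
        for cls, prob in ((0, 1.0 - fg_prob), (1, fg_prob))
        if prob >= thresholds[cls]
    ]
    return max(candidates)[1] if candidates else None


def directional_errors(labels: List[int], fg_probs: List[float]) -> Dict[str, int]:
    """Count label/prediction disagreements separately for each direction."""
    thresholds = class_thresholds(labels, fg_probs)
    counts = {"labeled_fg_confident_bg": 0, "labeled_bg_confident_fg": 0}
    for y, p in zip(labels, fg_probs):
        pred = confident_prediction(p, thresholds)
        if pred is None or pred == y:
            continue
        key = "labeled_fg_confident_bg" if y == 1 else "labeled_bg_confident_fg"
        counts[key] += 1
    return counts


# Toy flattened mask: the fourth pixel is labeled foreground, but the
# model is confident it is background.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
fg_probs = [0.9, 0.95, 0.9, 0.05, 0.1, 0.05, 0.1, 0.1]
errors = directional_errors(labels, fg_probs)
```

Keeping the two directions apart is the point: a symmetric overlap score such as Dice penalizes over- and under-annotation alike, whereas an imbalance between the two counts, computed per subgroup, indicates which way the labels lean.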

cs.CV / 10 / 2605.06892

Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation

并非所有标记都需要40步：扩散变换器中的异构步骤分配以实现高效视频生成

Chu, Ernie, Patel, Vishal M.

Abstract

Diffusion Transformers (DiTs) have achieved state-of-the-art video generation quality, but they incur immense computational cost because standard inference applies the same number of denoising steps uniformly to every token in the sequence. It is well known that human vision ignores vast amounts of redundant motion. Why, then, do our densest models treat every spatiotemporal token with equal priority? In this paper, we introduce Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns varying step budgets to different spatiotemporal tokens based on their velocity dynamics. To resolve the resulting sequence-length mismatch without sacrificing global context, HSA introduces a KV-cache synchronization mechanism that allows active tokens to attend to the full sequence while entirely bypassing inactive tokens. Furthermore, we derive a cached Euler update that advances the latent states of skipped tokens in a single operation without additional model evaluations. We evaluate HSA on the Wan-2 and LTX-2 models for both text-to-video (T2V) and image-to-video (I2V) generation. Our results demonstrate that HSA significantly outperforms previous state-of-the-art caching methods and the vanilla Flow Matching baseline, especially at aggressive acceleration regimes (e.g., 50% and 25% runtimes). Crucially, HSA achieves a superior quality-runtime Pareto frontier without the need for expensive offline profiling, robustly preserving structural integrity and generation quality even under tight computational budgets. Project page: https://ernestchu.github.io/hsa

Chinese Translation

扩散变换器（Diffusion Transformers, DiTs）已实现了最先进的视频生成质量，但由于标准推理对序列中的每个标记均匀应用相同数量的去噪步骤，因此产生了巨大的计算成本。众所周知，人类视觉会忽略大量冗余运动。那么，为什么我们最密集的模型对每个时空标记给予相同的优先级呢？在本文中，我们提出了异构步骤分配（Heterogeneous Step Allocation, HSA），这是一种无训练推理算法，根据不同时空标记的速度动态分配不同的步骤预算。为了在不牺牲全局上下文的情况下解决由此产生的序列长度不匹配问题，HSA引入了一种KV缓存同步机制，使得活跃标记能够关注整个序列，同时完全绕过非活跃标记。此外，我们推导了一种缓存的欧拉更新方法，可以在单次操作中推进被跳过标记的潜在状态，而无需额外的模型评估。我们在Wan-2和LTX-2模型上评估了HSA，涵盖文本到视频（text-to-video, T2V）和图像到视频（image-to-video, I2V）生成。我们的结果表明，HSA显著优于以前的最先进缓存方法和基础的流匹配（Flow Matching）基线，特别是在激进加速模式下（例如，50%和25%的运行时间）。关键是，HSA在不需要昂贵的离线分析的情况下，实现了优越的质量-运行时间帕累托前沿，稳健地保持了结构完整性和生成质量，即使在紧张的计算预算下。项目页面：https://ernestchu.github.io/hsa
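The "cached Euler update" mentioned above can be motivated with a short calculation. In a flow-matching sampler, one Euler step is x <- x + (t_next - t) * v. If a skipped token reuses a single cached velocity v_c over several consecutive steps, the time increments telescope, so the whole run collapses to x_end = x_start + (t_end - t_start) * v_c with no model calls. The sketch below checks that equivalence numerically. Treating the cached velocity as constant, and the time grid running from 1 to 0, are assumptions made for illustration; the paper's actual derivation is not given in the abstract.

```python
from typing import List, Sequence


def euler_steps(x: Sequence[float], v: Sequence[float], times: Sequence[float]) -> List[float]:
    """Apply one Euler step per interval, reusing the same cached velocity."""
    state = list(x)
    for t0, t1 in zip(times, times[1:]):
        state = [xi + (t1 - t0) * vi for xi, vi in zip(state, v)]
    return state


def cached_euler_jump(
    x: Sequence[float], v: Sequence[float], t_start: float, t_end: float
) -> List[float]:
    """Advance a skipped token across all intervals in a single operation."""
    return [xi + (t_end - t_start) * vi for xi, vi in zip(x, v)]


times = [1.0, 0.75, 0.5, 0.25, 0.0]  # hypothetical sampling schedule
latent = [0.2, -1.0]                 # toy latent state of one token
cached_velocity = [0.5, 2.0]         # velocity cached at the last active step

stepwise = euler_steps(latent, cached_velocity, times)
jumped = cached_euler_jump(latent, cached_velocity, times[0], times[-1])
```

The single jump costs one multiply-add per latent element regardless of how many steps were skipped, which is why skipped tokens need no additional model evaluations.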

cs.CV / 11 / 2605.06912

Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge

推进可靠的合成视频检测：来自SAFE挑战的见解

Trapeznikov, Kirill, Mancino-Ball, Gabriel, Li, Jonathan, Cummer, Paul, Aslam, Jai, Vahdati, Danial Samadi, Nguyen, Tai, Stamm, Matthew C., Bautista, Peter, Davinroy, Michael, Cassani, Laura, Crisman, Jill

Abstract

The proliferation of generative video technologies has intensified the need for reliable methods to detect and characterize synthetic media. To address this challenge, we organized the SAFE: Synthetic Video Detection Challenge (https://safe-video-2025.dsri.org), co-located with the Authenticity and Provenance in the Age of Generative AI (APAI) Workshop at ICCV 2025. The competition invited participants to develop and evaluate algorithms capable of distinguishing real from synthetic videos under fully blind evaluation conditions, and it drew over 600 submissions from 12 teams over a 90-day span. Hosted on the Hugging Face platform, the challenge comprised two primary tasks: (1) detection of synthetic video content generated by diverse state-of-the-art models, and (2) detection of synthetic content following common post-processing operations such as resizing, re-compression, and motion blur, among others. The challenge data covered 13 modern high-quality synthetic video models, with generated content matched to real videos from 21 diverse and challenging sources, for a total of 6,000 video samples spanning 20 hours. This paper describes the challenge design, dataset construction, evaluation methodology, and outcomes, offering insights into the generalization and robustness of contemporary synthetic video detection methods. Our findings highlight measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts.

Chinese Translation

生成视频技术的迅猛发展加剧了对可靠方法以检测和表征合成媒体的需求。为应对这一挑战，我们组织了SAFE: 合成视频检测挑战，该挑战与2025年国际计算机视觉大会（ICCV 2025）上的《生成性人工智能时代的真实性与来源（APAI）研讨会》同时举行。比赛邀请参与者开发和评估能够在完全盲评条件下区分真实视频与合成视频的算法，期间共收到来自12个团队的600多份提交。挑战在Hugging Face平台上进行，主要包括两个任务：（1）检测由多种最先进模型生成的合成视频内容；（2）检测经过常见后处理操作（如调整大小、重新压缩、运动模糊等）后的合成内容。挑战数据集由13个现代高质量合成视频模型构成，生成内容与来自21个不同来源的真实视频相匹配，总计达到20小时的6000个视频样本。本文描述了挑战的设计、数据集构建、评估方法及结果，提供了对当代合成视频检测方法的泛化能力和鲁棒性的见解。我们的研究结果突显了跨生成器泛化的可测进展，但也显示出对后处理伪影的持续脆弱性。

cs.CV / 12 / 2605.06924

A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency

A$^2$RD：用于长视频一致性的智能体自回归扩散

Long, Do Xuan, Song, Yale, Kan, Min-Yen, Pfister, Tomas, Le, Long T.

Abstract

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.

Chinese Translation

合成一致且连贯的长视频仍然是一个基本挑战。现有方法在长时间范围内面临语义漂移和叙事崩溃的问题。我们提出了A$^2$RD，一种智能体自回归扩散（Agentic Auto-Regressive Diffusion）架构，它将创意合成与一致性约束解耦。A$^2$RD将长视频合成形式化为一个闭环过程，通过检索-合成-精炼-更新循环逐段合成并自我改进视频。它包含三个核心组件：（i）多模态视频记忆，跟踪跨模态的视频进展；（ii）自适应段生成，在多种生成模式之间切换，以实现自然的进展和视觉一致性；（iii）分层测试时自我改进，在帧和视频级别自我改进每个段，以防止错误传播。我们进一步引入了LVBench-C，这是一个具有非线性实体和环境转变的挑战性基准，用于对长时间范围内的一致性进行压力测试。在公共基准和LVBench-C基准上，涵盖一到十分钟的视频，A$^2$RD在一致性方面比最先进的基线最多提高30%，在叙事连贯性方面最多提高20%。人类评估证实了这些提升，同时也突出了运动和过渡平滑度的显著改善。

cs.CV / 13 / 2605.06927

XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling

XiYOLO：通过迭代架构搜索和缩放实现的能量感知目标检测

Tran, Tony, Suganda, Richie R., Hu, Bin

Abstract

Object detection on heterogeneous edge devices must satisfy strict energy, latency, and memory constraints while still providing reliable perception for downstream autonomy. Existing energy-aware NAS methods often target limited deployment settings, while real energy remains difficult to optimize because it is highly device-dependent and costly to measure. We address these challenges with an energy-adaptive framework that combines an energy-aware XiResOFA search space, a two-stage energy estimator, and iterative search to identify a single energy-efficient base architecture. We then apply compound scaling to transform this base design into the XiYOLO family across deployment budgets, enabling interpretable accuracy-energy tradeoffs under sparse hardware measurements. Experiments on PascalVOC, COCO, and real-device deployment show that XiYOLO achieves a stronger energy-accuracy tradeoff than YOLO baselines. On PascalVOC, the medium XiYOLO model reaches 86.15 mAP50 while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU. On COCO, XiYOLO reduces energy relative to YOLOv12 by up to 53.7% on GPU and 51.6% on NPU at the small scale. The proposed two-stage estimator also improves sample efficiency over a joint predictor under few-shot adaptation with only 2-20 target-device samples.

Chinese Translation

在异构边缘设备上进行目标检测必须满足严格的能量、延迟和内存限制，同时仍需为下游自主性提供可靠的感知。现有的能量感知神经架构搜索（NAS）方法通常针对有限的部署环境，而真实的能量优化仍然困难，因为它高度依赖于设备且测量成本高。我们通过一个能量自适应框架来解决这些挑战，该框架结合了能量感知的XiResOFA搜索空间、一个两阶段的能量估计器和迭代搜索，以识别单一的能量高效基础架构。然后，我们应用复合缩放将该基础设计转化为XiYOLO系列，以适应不同的部署预算，从而在稀疏硬件测量下实现可解释的准确性-能量权衡。在PascalVOC、COCO和真实设备部署上的实验表明，XiYOLO在能量-准确性权衡方面优于YOLO基线。在PascalVOC上，中等规模的XiYOLO模型在GPU上实现了86.15 mAP50，同时相对于YOLOv12m减少了20.6%的能量，在NPU上减少了35.9%。在COCO上，XiYOLO在小规模下相对于YOLOv12在GPU上减少了多达53.7%的能量，在NPU上减少了51.6%。所提出的两阶段估计器在少量样本适应下的样本效率也优于联合预测器，仅需2-20个目标设备样本。

cs.CV / 14 / 2605.06969

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

将多模态大型语言模型应用于红外-可见图像融合质量评估

Guo, Yuchen, Gong, Junli, Lu, Yao, Xu, Xintong, Cheung, Yiuming, Su, Weifeng

Abstract

Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision. Naively introducing MLLMs with discrete one-hot supervision, in turn, forces fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing a continuous quality score rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
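The abstract names a "Thurstone fidelity" term without giving its formula. As a rough sketch, assuming the form commonly used in learning-to-rank for image quality assessment, a Thurstone-style model turns two predicted scores into a preference probability, and a fidelity loss compares that probability with the one derived from human ratings. The function names and the unit-variance default are assumptions, not details from the paper.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def thurstone_prob(score_a, score_b, sigma=1.0):
    """Probability that image A is preferred over image B.

    Assumes both scores carry independent Gaussian noise with the same
    standard deviation `sigma` (a Thurstone Case V style assumption).
    """
    return normal_cdf((score_a - score_b) / (math.sqrt(2.0) * sigma))


def fidelity_loss(p_pred, p_true):
    """Fidelity loss between two Bernoulli preference probabilities.

    Zero when the probabilities match, bounded above by 1.
    """
    return (1.0
            - math.sqrt(p_pred * p_true)
            - math.sqrt((1.0 - p_pred) * (1.0 - p_true)))


# Two fused images of one source pair: predicted scores vs. human scores.
p_pred = thurstone_prob(3.4, 3.1)
p_true = thurstone_prob(3.6, 3.0)
loss = fidelity_loss(p_pred, p_true)
```

The same pairwise term could be applied within a source pair (ordering fusion methods) or across source pairs (ordering scenes); only the choice of which image pairs to compare changes.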

Chinese Translation

红外-可见图像融合（IVIF）旨在将热信息和详细的空间结构整合为一幅融合图像，以增强感知。然而，现有的评估方法往往过度优化手工制作的无参考统计数据和将源图像视为伪真实值的全参考指标。近期的IVIF奖励建模工作从人类评分中学习，但在聚合评分上使用标量回归，既未利用多模态大型语言模型（MLLMs）的推理能力，也未在其监督中编码每幅图像的感知模糊性，而是天真地引入了带有离散独热监督的MLLMs，这同样导致相似质量的融合图像被压缩到不同的评分等级。为了解决这一问题，我们引入了FuScore，它利用MLLM模拟人类视觉感知，通过生成连续的质量评分，而非离散的等级预测，从而实现对相似质量融合图像的细致区分。我们利用四个特定于IVIF的子维度之间的一致性构建每幅图像的软标签，其清晰度反映了整体判断的一致性。我们进一步引入了一个三方目标，结合每幅图像的分布监督、源对内的Thurstone保真度用于方法级排序，以及跨源对的Thurstone保真度用于场景级排序。大量实验表明，FuScore与人类视觉偏好的相关性达到了最先进的水平。

cs.CV / 15 / 2605.06990

TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

TRAJGANR：基于轨迹的城市多模态学习通过地理空间对齐神经表示

Siampou, Maria Despoina, Mai, Gengchen, Lao, Ni, Rao, Jinmeng, Arora, Neha, Shahabi, Cyrus, Choudhury, Shushman

Abstract

Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.

Chinese Translation

多模态自监督学习（MSSL）已成为预训练地理空间基础模型的关键范式。然而，现有的地理空间MSSL方法主要针对静态模态对，如卫星图像、街景图像和文本，其学习是通过对齐来自相同或相邻位置的观测来驱动的。这一假设对于人类移动轨迹而言并不成立，因为轨迹代表的是沿路径的连续移动，而不是在单个位置的离散观测。尽管轨迹对于城市理解至关重要，因为它们能够捕捉人类在道路、社区和地点上的活动，但在当前的地理空间MSSL框架中仍然未得到充分探索。我们提出了TrajGANR，一个新颖的基于轨迹的地理空间MSSL框架，它将连续运动模式与静态的基于位置的观测进行对齐。TrajGANR在每条路径的任意点学习轨迹的连续神经表示，这使得即使附近的街景图像与任何轨迹路标不共址，也能实现细粒度对齐。我们利用这一能力引入了一个MSSL目标，该目标共同对齐三种模态：轨迹、街景图像及其地理位置。我们在四个城市移动和道路理解任务上评估了TrajGANR。在这些任务中，TrajGANR始终优于现有的地理空间MSSL框架和一个轨迹特定的基础模型。消融研究进一步表明，我们提出的MSSL目标和多模态学习框架是这些改进的主要驱动因素，突显了细粒度地理空间对齐相较于粗略聚合的重要性，以及地理空间多模态学习的价值。

cs.CV / 16 / 2605.07019

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

LensVLM：压缩视觉文本表示的选择性上下文扩展

Xie, Roy, Friedman, Dan, Yu, Donghan, Pan, Bowen, Fifty, Christopher, Kim, Jang-Hyun, Du, Xianzhi, Gan, Zhe, Rathod, Vivek, Dhingra, Bhuwan

Abstract

Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.

Chinese Translation

视觉语言模型（VLMs）提供了将文本处理为渲染图像的激动人心的可能性，绕过了将文本标记化为长标记序列的需求。由于VLM图像编码器将固定大小的图像映射到固定数量的视觉标记，因此不同的渲染分辨率提供了细粒度的压缩调节。然而，随着压缩的增加，准确性迅速下降：字符缩小到视觉编码器的有效分辨率以下，使其无法区分。为了解决这个问题，我们提出了LensVLM，这是一种推理框架和后训练方案，使VLM能够扫描压缩图像，然后通过学习的工具选择性地将相关图像扩展到其未压缩形式。基于Qwen3.5-9B-Base，LensVLM在4.3倍有效压缩下保持与完整文本上限相当的准确性，并在七个文本问答基准上超越基于检索的文本和视觉压缩基线，达到10.1倍有效压缩。LensVLM还可以推广到多模态文档和代码理解任务，随着压缩的增加，准确性相对于基线的提升也在增加。我们的分析验证了这种方法：训练使视觉压缩对渲染选择具有鲁棒性，随着压缩的增加，模型越来越依赖于扩展内容，而不是不可靠的视觉读取。分析还提供了实用的工具选择指导：对于渲染文本，文本扩展更为可取，而高分辨率图像扩展则适合布局提示携带任务相关信息的原生文档。

cs.CV / 17 / 2605.07023

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

OneViewAll：基于语义先验的单视角6D姿态估计框架用于新颖物体

Luo, Yang, Gong, Yan, Gao, Yongsheng, Zhao, Jie, Zhang, Xinyu, Liu, Huaping

Abstract

In many practical 6D object pose estimation scenarios, we often have access to only a single real-world RGB-D reference view per object, typically without CAD models. Existing methods largely rely on explicit 3D models or multi-view data, which limits their scalability. To address this challenging single-reference model-free setting, we propose OneViewAll, a semantic-prior-guided framework that performs pose estimation via a novel Project-and-Compare paradigm. Instead of relying on computationally expensive CAD-based rendering, our method directly aligns reference and query observations within a projection-equivariant space. OneViewAll progressively integrates hierarchical semantic priors across three levels: (1) category- and scene-level priors for efficient hypothesis initialization; (2) object-level symmetry priors for geometry completion via mirror fusion; and (3) patch-level priors for discriminative refinement. Extensive experiments demonstrate that OneViewAll achieves 92.5% ADD-0.1 accuracy on the LINEMOD dataset using only one real reference view, significantly outperforming the CVPR 2025 baseline One2Any (52.6%). It also yields consistent improvements on YCB-V, Real275, and Toyota-Light while maintaining low inference latency. Our results underscore the efficacy of symmetry-aware projection in handling symmetric, texture-less, and occluded objects.

Chinese Translation

在许多实际的6D物体姿态估计场景中，我们通常只能获得每个物体的单个真实RGB-D参考视图，通常没有CAD模型。现有方法在很大程度上依赖于显式的3D模型或多视角数据，这限制了它们的可扩展性。为了解决这一具有挑战性的单参考无模型设置，我们提出了OneViewAll，一个基于语义先验的框架，通过一种新颖的投影与比较范式进行姿态估计。我们的算法不依赖于计算成本高昂的基于CAD的渲染，而是直接在投影等变空间中对齐参考和查询观测。OneViewAll逐步整合三个层次的层次语义先验：(1) 类别和场景层先验用于高效的假设初始化；(2) 物体层对称性先验通过镜像融合实现几何补全；(3) 补丁层先验用于判别性细化。大量实验表明，OneViewAll在LINEMOD数据集上仅使用一个真实参考视图就达到了92.5%的ADD-0.1准确率，显著超越了CVPR 2025基线方法One2Any（52.6%）。它在YCB-V、Real275和Toyota-Light上也取得了一致的改进，同时保持了低推理延迟。我们的结果强调了对称感知投影在处理对称、无纹理和被遮挡物体时的有效性。

cs.CV / 18 / 2605.07055

Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness

Pan-FM：一种具有显著性引导掩蔽的全器官基础模型，以增强缺失鲁棒性

Wu, Qiangqiang, McIlvain, Grace, Yu, Zhou, Wen, Junhao

Abstract

Foundation models (FMs) have shown great promise in medical imaging, but most FMs are trained on unimodal data within isolated domains, such as brain MRI alone. Human aging and disease arise through coordinated biological processes across organs, therefore motivating multimodal FMs that learn whole-body representations. A key challenge, however, is that real-world multimodal biomedical data are often missing not at random, which can reduce power, limit generalizability, and introduce bias. We propose Pan-FM, a pan-organ foundation model pre-trained on imaging from seven organs (Brain, Heart, Adipose, Liver, Kidney, Spleen, and Pancreas) under realistic missing-organ scenarios. Pan-FM uses a unified backbone that handles organ missingness during both training and inference, and is pre-trained with masking-based self-distillation. We find that naive multimodal pre-training leads to dominant-organ shortcut learning bias, with the model over-relying on dominant organs such as adipose and heart. To address this, we introduce Saliency-Guided Masking (SGM), which uses the model attention distribution to adaptively mask dominant organs during pre-training, thus encouraging more balanced cross-organ, whole-body learning. Notably, SGM introduces negligible computational overhead and can be seamlessly integrated into existing self-supervised learning frameworks to improve multi-organ representation learning. On the UK Biobank, Pan-FM achieves stronger prediction across 13 disease categories and 14 single disease entities than single-organ and multi-organ baselines, with improved robustness under missing-organ settings. Pan-FM serves as a scalable solution to realistic modality-missingness in multimodal learning in system neuroscience and as a step toward more generalizable whole-body FMs.
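The abstract says Saliency-Guided Masking uses the model's attention distribution to adaptively mask dominant organs, but gives no procedure. One plausible reading, shown here purely as an assumption, is to sample the organs to mask with probability proportional to the attention mass each one receives, so that heavily attended organs are hidden more often. The function name, `mask_ratio`, and the dictionary layout are hypothetical.

```python
import random


def saliency_guided_mask(attention, mask_ratio=0.3, rng=None):
    """Pick organs to mask, favouring those with high attention mass.

    attention: dict mapping organ name -> non-negative attention mass,
               or None when that organ is missing for this subject.
    Returns a set of organ names to mask; at least one present organ
    is always left unmasked.
    """
    rng = rng or random.Random()
    remaining = {o: a for o, a in attention.items() if a is not None}
    n_mask = min(len(remaining) - 1, max(1, round(mask_ratio * len(remaining))))
    masked = set()
    for _ in range(n_mask):  # weighted sampling without replacement
        organs = list(remaining)
        # Small constant keeps sampling valid if all attention masses are 0.
        weights = [remaining[o] + 1e-9 for o in organs]
        choice = rng.choices(organs, weights=weights, k=1)[0]
        masked.add(choice)
        del remaining[choice]
    return masked


attention = {"brain": 0.10, "heart": 0.30, "adipose": 0.40, "liver": 0.05,
             "kidney": 0.05, "spleen": 0.05, "pancreas": None}  # pancreas missing
masked = saliency_guided_mask(attention, rng=random.Random(0))
```

Sampling rather than always masking the top-attended organs keeps some randomness in the pretext task, while still pushing the model away from the adipose and heart shortcuts the abstract describes.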

Chinese Translation

基础模型（FMs）在医学影像中展现出巨大的潜力，但大多数基础模型是在孤立领域内的单模态数据上训练的，例如仅使用脑部 MRI。人类的衰老和疾病是通过跨器官的协调生物过程产生的，因此需要多模态基础模型来学习全身表征。然而，一个关键挑战是，现实世界中的多模态生物医学数据往往存在非随机缺失，这可能降低效能、限制可推广性并引入偏差。我们提出了 Pan-FM，一种在七个器官（脑、心脏、脂肪、肝脏、肾脏、脾脏和胰腺）影像上预训练的全器官基础模型，适用于现实缺失器官场景。Pan-FM 使用统一的主干网络，在训练和推理过程中处理器官缺失，并通过基于掩蔽的自蒸馏进行预训练。我们发现，简单的多模态预训练会导致主导器官的捷径学习偏差，使模型过度依赖于脂肪和心脏等主导器官。为了解决这个问题，我们引入了显著性引导掩蔽（SGM），该方法利用模型注意力分布在预训练过程中自适应地掩蔽主导器官，从而促进更平衡的跨器官、全身学习。值得注意的是，SGM 引入的计算开销微乎其微，并且可以无缝集成到现有的自监督学习框架中，以改善多器官表征学习。在 UK Biobank 数据集上，Pan-FM 在 13 个疾病类别和 14 个单一疾病实体的预测性能上优于单一器官和多器官基线，并在缺失器官设置下表现出更强的鲁棒性。Pan-FM 为系统神经科学中的多模态学习提供了一个可扩展的解决方案，以应对现实的模态缺失问题，并朝着更具可推广性的全身基础模型迈出了一步。

cs.CV / 19 / 2605.07064

Learning to Track Instance from Single Nature Language Description

从单一自然语言描述中学习实例跟踪

Zheng, Yaozong, Zhong, Bineng, Liang, Qihua, Zeng, Shuimu, Xia, Haiying, Song, Shuxiang

Abstract

How can vision-language (VL) tracking be achieved using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that is capable of tracking any object referred to by a language description. Unlike traditional methods that fuse all language and visual tokens equally, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.

Chinese Translation

如何利用视频序列中的自然语言描述实现视觉-语言（VL）跟踪，而不依赖于任何边界框真实值？在本研究中，我们通过解决自监督VL跟踪的任务来实现这一目标，该任务旨在评估由自然语言描述引导的跟踪能力。我们提出了\tracker，一种新颖的自监督VL跟踪器，能够根据语言描述跟踪任何被提及的对象。与传统方法将所有语言和视觉标记等同融合不同，我们提出了一种高效的动态标记聚合模块，该模块对每个视觉标记进行不等处理。该模块主要包括三个步骤：i) 基于锚定标记，从模板帧中选择多个重要目标标记。ii) 根据注意力得分合并所选目标标记，并将其聚合到语言标记中，从而消除冗余的视觉标记噪声并增强语义对齐。iii) 最后，融合后的语言标记作为引导信号，从搜索帧中提取潜在目标标记，并将其传播到后续帧，增强时间提示并鼓励跟踪器从未标记视频中自主学习实例跟踪。这种新的建模方法使得在不需要大规模边界框注释的情况下，有效地实现了语言引导的跟踪表示的自监督学习。在VL跟踪基准上的大量实验表明，\tracker超越了当前最先进的自监督方法。

cs.CV / 20 / 2605.07074

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

解耦语义与指纹：一种用于AI生成图像检测的通用表示

Wang, Zhiyuan, Chen, Yanxiang, Yao, Yuanzhi, Diao, Yunfeng

Abstract

Detecting AI-generated images across unseen architectures remains challenging, as existing models often overfit to generator-specific fingerprints and semantic content rather than learning universal forgery traces. We attribute this failure to feature entanglement: detectors learn these factors as a single entangled representation, where universal forgery traces are inextricably confounded with both generator-specific fingerprints and semantic content. Crucially, our spectral analysis reveals that this entanglement is avoidable: distinct generator-specific fingerprints (e.g., GAN stripes vs. Diffusion Model spots) occupy disjoint frequency subspaces and coexist as independent superpositions. Leveraging this physical orthogonality, we propose the Orthogonal Decomposition and Purification Network (ODP-Net) to structurally disentangle these factors. Specifically, ODP-Net employs (1) Instance-aware Orthogonal Decomposition to project features into mutually exclusive subspaces: universal forgery traces, generator-specific fingerprints, and semantic content; (2) Perturbation-based Purification to enforce semantic invariance via cross-sample feature injection; and (3) Manifold Alignment to bridge domain gaps. By explicitly decoupling universal forgery traces from generator-specific fingerprints and semantic content, ODP-Net achieves state-of-the-art performance on unseen architectures (e.g., Stable Diffusion 3), validating that structural disentanglement is key to generalization.

Chinese Translation

在未见过的架构中检测AI生成的图像仍然具有挑战性，因为现有模型往往过度拟合于生成器特定的指纹和语义内容，而不是学习通用的伪造痕迹。我们将这一失败归因于特征纠缠：检测器将这些因素学习为一个单一的纠缠表示，其中通用伪造痕迹与生成器特定的指纹和语义内容不可分割地混淆在一起。关键的是，我们的谱分析揭示了这种纠缠是可以避免的：不同的生成器特定指纹（例如，GAN条纹与扩散模型斑点）占据不相交的频率子空间，并作为独立的叠加共存。利用这种物理正交性，我们提出了正交分解与净化网络（Orthogonal Decomposition and Purification Network, ODP-Net）以结构性地解开这些因素。具体而言，ODP-Net采用（1）实例感知正交分解，将特征投影到互斥的子空间：通用伪造痕迹、生成器特定指纹和语义内容；（2）基于扰动的净化，通过跨样本特征注入来强制语义不变性；以及（3）流形对齐以弥合领域差距。通过明确解耦通用伪造痕迹与生成器特定指纹和语义内容，ODP-Net在未见过的架构上实现了最先进的性能（例如，Stable Diffusion 3），验证了结构性解耦是实现泛化的关键。

cs.CV / 21 / 2605.07079

Learning Visual Feature-Based World Models via Residual Latent Action

通过残差潜在动作学习基于视觉特征的世界模型

Zhang, Xinyu, Xu, Zhengtong, Tao, Yutian, Wang, Yeping, She, Yu, Boularias, Abdeslam

Abstract

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm

Chinese Translation

世界模型通过观察和动作预测未来的转变。现有的研究主要集中在图像生成上。而基于视觉特征的世界模型则预测未来的视觉特征，而非原始视频像素，提供了一种更高效且不易产生幻觉的有前景的替代方案。然而，目前的特征基础方法依赖于直接回归，这导致在复杂交互中产生模糊或崩溃的预测，而在高维特征空间中的生成建模仍然具有挑战性。在本研究中，我们发现一种新型的潜在动作表示，称为*Residual Latent Action*（RLA），可以通过DINO残差轻松学习。我们还表明，RLA具有预测性、可推广性，并且编码时间进程。在RLA的基础上，我们提出了*RLA World Model*（RLA-WM），通过流匹配预测RLA值。RLA-WM在模拟和真实世界数据集上超越了最先进的特征基础和视频扩散世界模型，同时比视频扩散快几个数量级。此外，我们开发了两种使用RLA-WM来改善策略学习的机器人学习技术。第一种是一个极简的世界动作模型，利用RLA从无动作的演示视频中学习。第二种是第一个完全在仅从离线视频学习的世界模型内训练的视觉强化学习框架，使用视频对齐奖励且不进行在线交互或手工奖励。项目页面：https://mlzxy.github.io/rla-wm

cs.CV / 22 / 2605.07082

ImplantMamba: Long-range Sequential Modeling Mamba For Dental Implant Position Prediction

ImplantMamba：用于牙科植体位置预测的长距离序列建模Mamba

Yang, Xinquan, Wang, Congmin, Li, Xuguang, Li, Yulei, Shen, Linlin, Meng, Yongqiang Deng He

Abstract

In the design of surgical guides for implant placement, determining the precise implant position is a critical step. However, the implant region itself is often characterized by a lack of distinctive texture in medical images. Consequently, artificial intelligence (AI) models must infer the correct implant position and angulation (slope) primarily by analyzing the texture of the surrounding teeth, which poses a significant challenge. To address this, we propose ImplantMamba, a network architecture designed for long-range sequential modeling to integrate texture information from adjacent teeth. Our approach explicitly couples the regression of the implant position with its slope. The core of ImplantMamba is a hybrid encoder that combines Convolutional Neural Networks (CNNs) with Mamba layers. This design enables the network to hierarchically extract local anatomical features through CNNs while simultaneously modeling global contextual dependencies across the entire scan volume via Mamba's selective scan operations, leading to a more comprehensive understanding of the implant site. Furthermore, we introduce a Slope-Coupled Prediction Branch (SCP). This branch connects the prediction of the implant position with that of the slope, ensuring internal consistency and anatomical plausibility by enforcing a coherent relationship between the predicted implant location and its angulation. Extensive experiments on a large-scale dental implant dataset demonstrate that the proposed ImplantMamba achieves superior performance compared to existing methods.

Chinese Translation

在植体放置手术导向设计中，确定精确的植体位置是一个关键步骤。然而，植体区域在医学图像中往往缺乏明显的纹理特征。因此，人工智能（AI）模型必须主要通过分析周围牙齿的纹理来推断正确的植体位置和角度（坡度），这带来了重大挑战。为了解决这一问题，我们提出了ImplantMamba，这是一种旨在进行长距离序列建模的网络架构，旨在整合邻近牙齿的纹理信息。我们的方法明确将植体位置的回归与其坡度相结合。ImplantMamba的核心是一个混合编码器，它结合了卷积神经网络（CNN）和Mamba层。这种设计使得网络能够通过CNN分层提取局部解剖特征，同时通过Mamba的选择性扫描操作对整个扫描体积建模全局上下文依赖，从而更全面地理解植体位置。此外，我们引入了一个坡度耦合预测分支（Slope-Coupled Prediction Branch, SCP）。该分支旨在将植体位置的预测与坡度连接起来，从而确保内部一致性和解剖学上的合理性，强制执行预测植体位置与其角度之间的连贯关系。在大规模牙科植体数据集上的广泛实验表明，所提出的ImplantMamba相比现有方法具有更优越的性能。

cs.CV / 23 / 2605.07086

Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information

任务相关性并非局部可替代性：通道信息的双轴视角

Safaai, Houman, Landau, Andrew T., Beron, Celia C., Mazloumi, Yasin, Sabatini, Bernardo L.

Abstract

Channel importance in vision networks is usually summarized by a single score. That summary hides two different questions: how much a channel is related to the task, and whether its function can be supplied by same-layer peers when the channel is removed. We call the second property local replaceability. We introduce a two-axis view that separates these questions. The local axis measures input capture and peer overlap, while the target axis measures task information and target-excess information. Across ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two axes are weakly aligned, induce different channel groupings, and separate rapidly during training despite being strongly coupled at random initialization. A Gaussian linear analysis accounts for how this separation can arise through residualized gradient directions, and lesion plus peer-replacement experiments show that peer support refines removability beyond input capture and task relevance alone. Under the fixed FLOPs-matched pruning protocol, local-axis metrics are more reliable predictors of removability than target-axis metrics across the three CIFAR-100 backbones, with the same direction preserved in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. These findings identify an axis-level distinction rather than a universal ranking of pruning scores: local replaceability is a more reliable guide to removability than target relevance, while norm-based baselines remain competitive in architectures such as VGG-16. Relevance-based scores ask what a channel says about the task; pruning asks whether the network still needs that channel when its peers remain available.

Chinese Translation

在视觉网络中，通道的重要性通常通过一个单一的分数来总结。这种总结掩盖了两个不同的问题：一个通道与任务的相关性有多大，以及当该通道被移除时，是否可以由同层的其他通道来提供其功能。我们称第二个属性为局部可替代性。我们引入了一个双轴视角来区分这些问题。局部轴度量输入捕获和同伴重叠，而目标轴度量任务信息和目标过剩信息。在对CIFAR-100上训练的ResNet-18、VGG-16和MobileNetV2中，这两个轴的对齐程度较弱，导致不同的通道分组，并且在训练过程中迅速分离，尽管在随机初始化时它们是紧密耦合的。高斯线性分析解释了这种分离如何通过残差化梯度方向产生，而损伤加同伴替换实验表明，同伴支持在输入捕获和任务相关性之外进一步细化了可移除性。在固定FLOPs匹配的剪枝协议下，局部轴度量在三个CIFAR-100主干网络中比目标轴度量更可靠地预测可移除性，在对CIFAR-10、Tiny-ImageNet、ImageNet-100和ConvNeXt-T/ImageNet-100的压力测试中保持相同的方向。这些发现识别出一个轴级别的区分，而非剪枝分数的普遍排名：局部可替代性是可移除性的更可靠指导，而基于范数的基线在如VGG-16等架构中仍然具有竞争力。基于相关性的分数询问一个通道对任务的贡献；而剪枝则询问当其同伴仍然可用时，网络是否仍然需要该通道。

cs.CV / 24 / 2605.07099

InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization

InfoGeo：用于跨视角可泛化无人机地理定位的信息论对象中心学习

Zhang, Hongyang, Wang, Maonnan, Wang, Ziyao, Yin, Hongrui, OnPun, Man

Abstract

Cross-view geo-localization (CVGL) is fundamental for precise localization and navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. While existing approaches rely on global feature alignment, they often suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios, where the broader perspective inevitably introduces dense, fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.

Chinese Translation

跨视角地理定位（CVGL）对于在GPS不可用环境中实现精确定位和导航至关重要，旨在将地面或无人机图像与卫星视图进行匹配。现有方法通常依赖于全局特征对齐，但往往受到区域纹理和天气条件变化引起的显著领域转移的影响。在基于无人机的场景中，这一问题尤为明显，因为更广阔的视角不可避免地引入了密集的、细粒度的物体，造成显著的视觉杂乱。为了解决这一问题，我们受到对象中心学习（OCL）的启发，提出了InfoGeo，一个旨在增强鲁棒性和泛化能力的信息论框架。InfoGeo将优化过程重新表述为信息瓶颈过程，具有两个核心目标：(i) 通过对齐跨视角的对象中心结构关系来最大化视角不变信息，(ii) 通过跨视角知识约束来最小化视角特定的噪声信号。通过在多样化基准和挑战性场景中的广泛评估，结果表明InfoGeo显著优于现有的最先进方法。

cs.CV / 25 / 2605.07140

Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition

基于骨架的人类动作识别中的概念驱动逻辑推理的神经符号框架

Ilyas, Talha, Mehta, Deval, Ge, Zongyuan

Abstract

Skeleton-based human activity recognition has achieved strong empirical performance, yet most existing models remain black boxes and difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: https://github.com/Mr-TalhaIlyas/REASON

Chinese Translation

基于骨架的人类活动识别已取得强大的实证性能，但大多数现有模型仍然是黑箱，难以解释。在本研究中，我们提出了一种基于骨架的HAR（Human Action Recognition）神经符号形式，将动作识别重新构建为对运动原语的概念驱动的一阶逻辑推理。我们的框架通过将一阶逻辑谓词与可学习的空间和时间运动概念相结合，架起了表征学习与符号推理之间的桥梁。具体而言，我们采用标准的时空骨架编码器提取潜在运动表征，然后通过时空概念解码器将其映射到可解释的概念谓词，该解码器明确区分了以姿态为中心和以动态为中心的抽象。这些概念谓词通过可微分的一阶逻辑层进行组合，使模型能够学习人类可读的逻辑规则，从而支配动作语义。为了对学习到的概念施加语义结构，我们将骨架表征与基于LLM（Large Language Model）推导的原子运动原语描述对齐，建立了一个用于感知和推理的共享概念空间。在NTU RGB+D 60/120和NW-UCLA上的大量实验表明，我们的方法在提供明确、可解释的基于逻辑结构的解释的同时，达到了具有竞争力的识别性能。我们的结果强调了神经符号推理作为可解释的时空动作理解的有效范式。代码链接：https://github.com/Mr-TalhaIlyas/REASON

cs.CV / 26 / 2605.07141

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

Qwen3-VL-Seg：通过视觉-语言基础解锁开放世界的指称分割

Yao, Yuan, Yang, Qiushi, Zhong, Humen, Wei, Jiangning, Men, Yifang, Bai, Shuai, Cui, Miaomiao, Yang, Zhibo

Abstract

Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.

Chinese Translation

开放世界的指称分割需要将不受限制的语言表达与精确的像素级区域进行基础对接。现有的多模态大语言模型（MLLMs）展现出强大的开放世界视觉基础能力，但其输出仍然局限于稀疏的边界框坐标，无法满足密集视觉预测的需求。最近的基于MLLM的分割方法要么直接预测稀疏的轮廓坐标，难以重建连续的物体边界，要么依赖于外部分割基础模型，如Segment Anything Model (SAM)，这引入了显著的架构和部署开销。我们提出了Qwen3-VL-Seg，这是一个参数高效的框架，将MLLM预测的框视为语义基础的结构先验，并将其解码为像素级的指称分割。其核心是一个轻量级的框引导掩码解码器，结合了多尺度空间特征注入、空间-语义查询构建、框引导的高分辨率像素融合和迭代的掩码感知查询细化，仅引入17M参数（约占基础模型的0.4%）。为了实现可扩展的开放世界训练，我们构建了SA1B-ORS，这是一个基于SA-1B的数据集，包含两个子集：SA1B-CoRS（面向类别的样本）和SA1B-DeRS（描述性、特定实例的样本）。在评估方面，我们策划了ORS-Bench，这是一个经过人工筛选的基准，包含覆盖多种指称表达类型的分布内和分布外子集。在指称表达分割、视觉基础和ORS-Bench的广泛实验中，Qwen3-VL-Seg在封闭集和开放世界设置中表现出色，在语言密集指令上具有明显优势，并展现出强大的分布外泛化能力。在一般多模态基准上的评估进一步表明，该模型在经过分割导向的适应后，广泛保留了通用多模态能力。

cs.CV / 27 / 2605.07142

AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification

AGA3DNet：基于解剖引导的高斯先验与多视角 xLSTM 结合的 3D 脑部 MRI 亚型分类

Duan, Peiyu, Guo, Xueqi, Farhand, Sepehr, Sahin, Mehmet Berk, Zheng, Xinyuan, Duncan, James S., Valadez, Gerardo Hermosillo, Shinagawa, Yoshihisa

Abstract

Accurate 3D brain MRI subtype classification benefits from both localized anatomical cues and long-range contextual reasoning. We present AGA3DNet, a report-grounded framework that incorporates brief anatomical phrases extracted from radiology reports as a soft anatomical prior channel and fuses it with a lightweight 3D CNN and multi-view xLSTM aggregation. Specifically, extracted anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors using a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without requiring dense voxel annotations. We evaluate AGA3DNet on a retrospective institutional brain MRI cohort for abnormal subtype discrimination and compare against reproducible 3D classification baselines. AGA3DNet achieves improved overall balance across performance metrics and supports clinically interpretable localization through the prior channel. We discuss limitations related to single-cohort evaluation and the lack of large-scale public brain MRI datasets paired with radiology reports under broadly usable terms.

Chinese Translation

准确的 3D 脑部 MRI 亚型分类受益于局部解剖线索和长距离上下文推理。我们提出了 AGA3DNet，这是一个基于报告的框架，结合了从放射学报告中提取的简短解剖短语作为软解剖先验通道，并与轻量级 3D CNN 和多视角 xLSTM 聚合相融合。具体而言，提取的解剖短语被映射到由图谱定义的区域，并通过带符号距离变换和高斯加权转换为平滑的空间先验，提供可解释的、基于解剖的指导，而无需密集的体素注释。我们在一个回顾性的机构脑部 MRI 队列上评估了 AGA3DNet，以进行异常亚型区分，并与可重复的 3D 分类基线进行比较。AGA3DNet 在性能指标上实现了更好的整体平衡，并通过先验通道支持临床可解释的定位。我们讨论了与单一队列评估相关的局限性，以及缺乏与放射学报告配对的大规模公共脑部 MRI 数据集的问题。
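
The prior construction described in the abstract (signed-distance transform followed by Gaussian weighting) can be illustrated with a minimal sketch. It works on a single 1D row of voxels for readability; the function names, the sign convention (negative inside the region), and the choice to clamp the prior to 1 inside the region are assumptions made for illustration, not details confirmed by the paper.

```python
import math

def signed_distance_1d(mask):
    """Signed distance in voxels to the region: negative inside, positive outside."""
    inside = [i for i, m in enumerate(mask) if m]
    outside = [i for i, m in enumerate(mask) if not m]
    dist = []
    for i, m in enumerate(mask):
        if m:
            # Distance from an inside voxel to the nearest outside voxel.
            d = min((abs(i - j) for j in outside), default=len(mask))
            dist.append(-float(d))
        else:
            # Distance from an outside voxel to the nearest inside voxel.
            d = min((abs(i - j) for j in inside), default=len(mask))
            dist.append(float(d))
    return dist

def gaussian_prior(mask, sigma=2.0):
    """Soft prior: 1 inside the region, exp(-d^2 / (2 sigma^2)) outside (assumed form)."""
    return [math.exp(-max(d, 0.0) ** 2 / (2.0 * sigma ** 2))
            for d in signed_distance_1d(mask)]

# Hypothetical atlas region occupying voxels 3 and 4 of an 8-voxel row.
row = [0, 0, 0, 1, 1, 0, 0, 0]
prior = gaussian_prior(row, sigma=1.0)
```

In a volumetric setting the brute-force distance would normally be replaced by a Euclidean distance transform over the 3D atlas mask; the weighting step stays the same.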


cs.CV / 28 / 2605.07143

TriP: A Triangle Puzzle Approach to Robust Translation Averaging

TriP：一种基于三角形拼图的鲁棒翻译平均方法

Fan, Zhekai, Li, Wanze, Wang, Jinxin, Shi, Yunpeng

Abstract

Translation averaging aims to recover camera locations from pairwise relative translation directions and is a fundamental component of global Structure-from-Motion pipelines. The problem is challenging because direction measurements contain no distance information, making the estimation problem highly ill-conditioned and highly sensitive to corrupted observations. In this paper, we propose TriP, a triangle-based framework for robust translation averaging. TriP first infers local relative edge scales from triangle geometry, and then synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations. By leveraging higher-order consistency across triangles, the proposed method is robust to adversarial, cycle-consistent, and other structured corruptions. In addition, TriP avoids the collapse issue without requiring any extra anti-collapse constraints, since log-scale synchronization excludes the degenerate zero-scale solution by construction. These structural advantages enable a particularly strong theory for exact location recovery. On the practical side, TriP is fully parallelizable, computationally efficient, and naturally scalable to graphs with millions of cameras. Moreover, it outperforms all previous translation averaging methods by a large margin on both synthetic and real datasets.

Chinese Translation

翻译平均旨在从成对的相对翻译方向中恢复相机位置，是全球运动重建（Structure-from-Motion）流程中的一个基本组成部分。该问题具有挑战性，因为方向测量不包含距离信息，使得估计问题高度病态，并对受损观测极为敏感。本文提出了TriP，一种基于三角形的鲁棒翻译平均框架。TriP首先从三角形几何推断局部相对边缘尺度，然后在对数域中同步重叠三角形的尺度，以恢复全局一致的边长和相机位置。通过利用三角形之间的高阶一致性，所提出的方法对对抗性、循环一致性及其他结构性干扰具有鲁棒性。此外，TriP在不需要任何额外的抗崩溃约束的情况下避免了崩溃问题，因为对数尺度同步在构造上排除了退化的零尺度解。这些结构优势为精确位置恢复提供了特别强大的理论支持。在实际应用方面，TriP完全可并行化，计算效率高，并且自然可扩展到拥有数百万个相机的图形。此外，在合成数据集和真实数据集上，TriP的表现均大幅超越了所有先前的翻译平均方法。
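
To make the "relative edge scales from triangle geometry" step concrete, here is a hedged sketch based on the law of sines: within one triangle of cameras, the three pairwise directions fix the interior angles, and each edge length is proportional to the sine of the opposite angle. Working with logarithms turns the unknown per-triangle scale into an additive shift, which is what makes aligning triangles on a shared edge simple. The function names and the one-edge alignment rule are illustrative assumptions; the paper's global synchronization over many triangles is not reproduced here.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def sin_between(u, v):
    """Sine of the angle between two unit vectors."""
    c = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.sqrt(1.0 - c * c)

def triangle_log_scales(d_ab, d_bc, d_ca):
    """Log relative lengths of edges AB, BC, CA from their unit directions."""
    neg = lambda v: tuple(-x for x in v)
    sin_a = sin_between(d_ab, neg(d_ca))  # angle at A, between A->B and A->C
    sin_b = sin_between(d_bc, neg(d_ab))  # angle at B, between B->C and B->A
    sin_c = sin_between(d_ca, neg(d_bc))  # angle at C, between C->A and C->B
    # Law of sines: each edge is proportional to the sine of the opposite angle.
    # A degenerate (collinear) triangle gives sin = 0 and must be filtered beforehand.
    return {"AB": math.log(sin_c), "BC": math.log(sin_a), "CA": math.log(sin_b)}

def align(log_scales, shared_edge, target_log_length):
    """Shift a triangle's log-scales so its shared edge matches a target log-length."""
    shift = target_log_length - log_scales[shared_edge]
    return {edge: s + shift for edge, s in log_scales.items()}

# Cameras at A=(0,0), B=(3,0), C=(3,4); only the directions are observed.
d_ab, d_bc, d_ca = (1.0, 0.0), (0.0, 1.0), unit((-3.0, -4.0))
logs = triangle_log_scales(d_ab, d_bc, d_ca)
aligned = align(logs, "AB", math.log(3.0))
lengths = {edge: math.exp(s) for edge, s in aligned.items()}
# lengths is approximately {'AB': 3.0, 'BC': 4.0, 'CA': 5.0}
```

Because a length is the exponential of a finite log-scale, it can never be zero, which matches the abstract's remark that the zero-scale solution is excluded by construction.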


cs.CV / 29 / 2605.07146

UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection

UniV2D：桥接视觉恢复与语义感知以实现水下显著目标检测

Chang, Laibin, Wang, Shaodong, Wang, Yunke, Zhang, Xu, Jiang, Kui, Xu, Chang, Du, Bo

Abstract

Underwater salient object detection (USOD) plays a vital role in marine vision tasks but remains fundamentally challenging due to severe visual degradation, such as selective absorption and medium scattering. Conventional pipelines typically adopt a sequential "enhance-then-detect" paradigm. However, isolating low-level visual restoration from high-level semantic perception often leads to semantic inconsistency, where the restored images may not be optimal for detection and can even introduce task-irrelevant noise. To break this sequential bottleneck, we propose UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and salient object detection within a mutually beneficial framework. Unlike traditional methods that rely on disjointed pipelines or rigid physical priors, UniV2D introduces a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues reciprocally enhance saliency perception. Specifically, UniV2D features a hierarchical dual-branch architecture. It first employs a self-calibrated decoder to predict initial saliency masks alongside a mask-aware restoration module to reconstruct image content. Subsequently, a saliency-guided refinement module equipped with cross-level modulation is utilized to align structural fidelity with semantic consistency. Extensive experiments across multiple benchmarks demonstrate that UniV2D significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations, establishing a new standard for joint underwater perception.

Chinese Translation

水下显著目标检测（USOD）在海洋视觉任务中扮演着至关重要的角色，但由于严重的视觉退化（如选择性吸收和介质散射），这一任务仍然面临根本性的挑战。传统的处理流程通常采用顺序的“增强-再检测”范式。然而，将低层次的视觉恢复与高层次的语义感知隔离，往往会导致语义不一致，即恢复后的图像可能并不适合检测，甚至可能引入与任务无关的噪声。为了打破这一顺序瓶颈，我们提出了UniV2D，一个统一的视觉到检测网络，在一个互利的框架内共同优化视觉恢复和显著目标检测。与依赖于分离管道或严格物理先验的传统方法不同，UniV2D引入了一种语义驱动的学习范式：高层次的显著性语义主动引导恢复过程，而恢复后的视觉线索则反过来增强显著性感知。具体而言，UniV2D具有分层的双分支架构。它首先采用自校准解码器预测初始显著性掩模，同时利用掩模感知恢复模块重建图像内容。随后，配备跨层调制的显著性引导细化模块被用来对齐结构保真度与语义一致性。在多个基准测试中的广泛实验表明，UniV2D在定量和定性评估中显著超越了现有的最先进方法，为水下联合感知建立了新的标准。


cs.CV / 30 / 2605.07148

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

揭示和塑造视觉-语言模型中三维场景拓扑的潜在表示

Wang, Haoming, Gao, Wei

Abstract

Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. Using cross-scene linear feature extraction, we isolate a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised fine-tuning (SFT) run of a VLM on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping

Chinese Translation

数十年的认知科学研究表明，人类通过形成认知地图来导航环境，这种地图被定义为对三维空间的外部中心和拓扑保持的表示。尽管现代视觉-语言模型（VLMs）展示了从二维自我中心输入中涌现出的空间推理，但尚不清楚它们是否构建了类似的三维内部表示。在本文中，我们证明了当前的VLMs确实拥有三维场景的潜在拓扑地图，但它被颜色和形状等非几何视觉语义所严重掩盖。通过跨场景线性特征提取，我们隔离了这一空间子空间，提取出一个干净的空间子空间，该子空间因果地控制模型的空间输出。我们在数学上塑造了这一潜在表示，并证明其与场景三维高斯核图的拉普拉斯特征图相对应，在连续极限中收敛到物理三维空间。基于这一几何识别，我们进一步引入了一种基于Dirichlet能量的数学原则性潜在正则化方法。将这一单项正则化器应用于简单合成数据上的最小500步监督VLM微调（SFT），在真实世界的空间基准测试中取得了显著的改善，在涉及场景拓扑理解的空间任务中，超越了标准SFT和竞争基线，提升幅度高达12.1%。源代码可在 https://github.com/pittisl/vlm-latent-shaping 获取。
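
A Dirichlet-energy regularizer on a Gaussian-kernel graph can be sketched as follows. The energy is E = 1/2 * sum_ij w_ij * ||z_i - z_j||^2, which equals z^T L z for the graph Laplacian L = D - W, the same quantity that Laplacian eigenmaps minimize. How the paper selects object positions, latents, and the kernel bandwidth is not stated in the abstract, so the inputs and names below are assumptions for illustration only.

```python
import math

def gaussian_weights(points, sigma=1.0):
    """Gaussian-kernel adjacency between 3D points (no self-loops)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w

def dirichlet_energy(latents, weights):
    """E = 1/2 * sum_ij w_ij * ||z_i - z_j||^2; small when nearby nodes have similar latents."""
    n = len(latents)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff2 = sum((a - b) ** 2 for a, b in zip(latents[i], latents[j]))
            total += weights[i][j] * diff2
    return 0.5 * total

# Three hypothetical objects along a line; the first two are close in 3D.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
w = gaussian_weights(points, sigma=1.0)
smooth = [(0.0,), (0.1,), (0.5,)]      # latents that respect the 3D layout
scrambled = [(0.5,), (0.0,), (0.1,)]   # latents that ignore it
e_smooth = dirichlet_energy(smooth, w)
e_scrambled = dirichlet_energy(scrambled, w)
```

Added to a fine-tuning loss with a small coefficient, a term of this kind penalizes latents that place spatially adjacent objects far apart, which is the intuition behind using it as a single-term regularizer.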


cs.CV / 31 / 2605.07149

Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection

Real-IAD MVN：一个用于高保真工业异常检测的多视角法向量数据集和基准

Zhu, Wenbing, Liang, Jianing, Cheng, Linjie, Pan, Yurui, Chen, Zhuhao, Yan, Qingwang, Cheng, Yudong, Zhang, Jianghui, Chi, Mingmin, Peng, Bo

Abstract

Industrial Anomaly Detection (IAD) is critical for quality control, but existing methods struggle with subtle, geometric defects. Standard 2D (RGB) images are sensitive to texture and lighting but often miss fine geometric anomalies. While 3D point clouds capture macro-shape, they are typically too sparse to detect micro-defects like scratches or pits. We address this fundamental data limitation by introducing Real-IAD-MVN (Multi-View Normal), a large-scale industrial dataset. By upgrading our acquisition system, Real-IAD-MVN captures high-fidelity surface normal maps from five distinct viewpoints, replacing sparse 3D data entirely. This provides a comprehensive geometric representation at a micro-detail level, making previously invisible side-wall and occluded defects explicitly detectable. Our experiments, conducted on this new dataset, first provide evidence that incorporating dense, multi-view pseudo-3D (surface normals) yields significantly better detection performance than using sparse 3D point cloud data. To further validate the dataset and provide a strong benchmark, we introduce a baseline method based on reconstruction, which learns to extract cross-modal unified prototypes from the image and normal map streams. We demonstrate that this unified prototype approach surpasses existing state-of-the-art multimodal fusion methods, highlighting the rich potential of our new dataset for advancing geometric anomaly detection.

Chinese Translation

工业异常检测（IAD）对质量控制至关重要，但现有方法在处理细微的几何缺陷时面临挑战。标准的二维（RGB）图像对纹理和光照敏感，但往往无法捕捉到细微的几何异常。虽然三维点云能够捕捉宏观形状，但通常过于稀疏，无法检测到划痕或凹坑等微小缺陷。我们通过引入Real-IAD-MVN（多视角法向量），一个大规模工业数据集，解决了这一基本数据限制。通过升级我们的采集系统，Real-IAD-MVN从五个不同的视角捕获高保真的表面法向图，完全替代了稀疏的三维数据。这提供了微观细节层面的全面几何表示，使得之前不可见的侧壁和遮挡缺陷变得显而易见。我们在这个新数据集上进行的实验首先提供了证据，表明结合密集的多视角伪三维（表面法向）显著提高了检测性能，相较于使用稀疏的三维点云数据。为了进一步验证数据集并提供强有力的基准，我们引入了一种基于重建的基线方法，该方法学习从图像和法向图流中提取跨模态统一原型。我们证明了这种统一原型方法超越了现有的最先进的多模态融合方法，突显了我们新数据集在推进几何异常检测方面的丰富潜力。


cs.CV / 32 / 2605.07151

DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection

DPG-CD：深度先验引导的跨模态联合2D-3D变化检测

Zhang, Luqi, Dong, Zhen, Yang, Bisheng

Abstract

Urban spatial evolution is manifested not only through horizontal expansion but also through vertical structural changes. Consequently, jointly capturing 2D semantic changes and 3D height changes is essential for urban morphology analysis and emergency management. In practical scenarios, collecting 3D observations is often constrained by high acquisition costs and the inability to support frequent updates. The multi-temporal cross-modal input consisting of pre-event Digital Surface Model (DSM) and post-event imagery provides a practical solution for 3D change detection in high-frequency urban monitoring, disaster assessment, and emergency response scenarios. However, this setting remains challenging as imagery and DSM data exhibit significant spectral-geometric representation gaps. Moreover, modality differences may be confused with actual changes, and robust change detection requires effective fusion of semantic and geometric features from multi-temporal data. In this paper, we propose DPG-CD, a depth-prior-guided multi-temporal cross-modal fusion framework for joint 2D semantic and 3D height change detection. Specifically, an estimated depth prior is introduced into the imagery to mitigate the modality gap with DSM. A gated fusion mechanism then selectively injects geometric cues from depth prior while preserving discriminative spectral representations. Subsequently, a multi-stage cross-temporal cross-modal feature fusion architecture is employed to extract change-aware features. Finally, a multi-task decoder jointly predicts 2D semantic changes and 3D height changes, complemented by an auxiliary DSM prediction task to improve structural consistency and height estimation accuracy. Experiments on two public datasets, Hi-BCD and 3DCD, and a new dataset, NYC-MMCD, demonstrate that DPG-CD outperforms state-of-the-art methods on both 2D and 3D change detection tasks.

Chinese Translation

城市空间演变不仅体现在水平扩展上，还体现在垂直结构变化上。因此，联合捕捉2D语义变化和3D高度变化对于城市形态分析和应急管理至关重要。在实际场景中，收集3D观测数据常常受到高获取成本和无法支持频繁更新的限制。由事件前数字表面模型（Digital Surface Model, DSM）和事件后影像组成的多时相跨模态输入为高频城市监测、灾害评估和应急响应场景中的3D变化检测提供了切实可行的解决方案。然而，这种设置仍然具有挑战性，因为影像和DSM数据在光谱-几何表示上存在显著差距。此外，模态差异可能与实际变化混淆，稳健的变化检测需要有效融合来自多时相数据的语义和几何特征。在本文中，我们提出了DPG-CD，一种深度先验引导的多时相跨模态融合框架，用于联合2D语义和3D高度变化检测。具体而言，我们将估计的深度先验引入影像中，以减轻与DSM的模态差距。然后，采用门控融合机制选择性地注入来自深度先验的几何线索，同时保留区分性的光谱表示。随后，采用多阶段跨时相跨模态特征融合架构提取变化感知特征。最后，多任务解码器联合预测2D语义变化和3D高度变化，并通过辅助DSM预测任务提高结构一致性和高度估计精度。在两个公共数据集Hi-BCD和3DCD，以及一个新数据集NYC-MMCD上的实验表明，DPG-CD在2D和3D变化检测任务中均优于最先进的方法。


cs.CV / 33 / 2605.07154

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

PRIMED：通过偏置竞争进行的自适应模态抑制以实现音视频引用分割

He, Yuchen, Zhang, Jing

Abstract

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.

Chinese Translation

引用音视频分割（Ref-AVS）旨在根据视觉、听觉和文本引用线索在视频帧中定位和分割目标对象。该任务具有挑战性，因为不同模态的相关性在引用表达和场景中各不相同，而现有方法通常将多模态线索视为同质输入进行融合、提示或推理，这使得它们容易受到无关或误导性模态的影响。为了解决这一问题，我们提出了PRIMED，该方法受到认知神经科学中偏置竞争理论的启发，明确建模视觉感知和语言驱动的先验调制，通过自适应模态抑制实现更准确的Ref-AVS。具体而言，模态先验解码器首先估计引用表达主要依赖于音频、视觉还是它们的联合交互，从而生成模态先验以自适应地引导高级注意力。Token Distiller进一步从高级特征中提取紧凑的全局视觉token，并在竞争感知的跨模态融合模块之间共享，以提供分层的全局上下文。此外，我们引入了一种空间感知语义对齐损失，通过对比学习进一步增强前景与背景的区分。对Ref-AVS基准的广泛实验表明，PRIMED在整体性能上达到了最先进的水平。


cs.CV / 34 / 2605.07156

Hierarchical Perfusion Graphs for Tumor Heterogeneity Modeling in Glioma Molecular Subtyping

用于胶质瘤分子亚型建模的层次灌注图

Jang, Han, Lee, Junhyeok, Eum, Heeseong, Jang, Joon, Han, Yoseob, Choi, Seung Hong, Choi, Kyu Sung

Abstract

Precise molecular subtyping of gliomas, including isocitrate dehydrogenase (IDH) mutation and 1p/19q codeletion, directly guides surgical and therapeutic decisions, yet currently relies on invasive tissue sampling. Deep learning on structural MRI has emerged as a non-invasive alternative, but anatomy-only approaches cannot capture the hemodynamic signatures that distinguish molecular subtypes. Radiogenomics based on dynamic susceptibility contrast (DSC) MRI holds immense potential for non-invasively characterizing glioma molecular subtypes, yet clinical deployment has been hindered by inter-site variability and the limitations of voxel-wise analysis. We introduce HiPerfGNN, a framework that first learns discrete hemodynamic representations from raw time-intensity curves using a vector-quantized variational autoencoder (VQ-VAE). These quantized perfusion codes define coarse-level graph nodes representing functional tumor habitats, each of which is hierarchically subdivided into fine-level subregions guided by structural MRI. A hierarchical graph neural network then propagates information across scales for molecular prediction. On an internal cohort (n=475), the model achieved AUCs of 0.96 (IDH), 0.89 (1p/19q), and 0.84 (WHO grade), and maintained robust IDH performance (AUC 0.89) on an independent external cohort (n=397) without recalibration. Gradient-based saliency analysis confirms biologically grounded attention patterns aligned with known glioma pathophysiology. Our results demonstrate the added value of integrating perfusion dynamics into radiogenomic pipelines for glioma molecular subtyping. Code is available at https://github.com/janghana/HiPerfGNN.

Chinese Translation

胶质瘤的精确分子亚型分类，包括异柠檬酸脱氢酶（IDH）突变和1p/19q共缺失，直接指导外科和治疗决策，但目前仍依赖于侵入性组织取样。基于结构性MRI的深度学习已成为一种非侵入性替代方法，但仅依赖解剖结构的方法无法捕捉区分分子亚型的血流动力学特征。基于动态易感对比（DSC）MRI的放射基因组学在非侵入性表征胶质瘤分子亚型方面具有巨大潜力，但临床应用受到跨站点变异性和体素级分析局限性的阻碍。我们提出了HiPerfGNN框架，该框架首先使用向量量化变分自编码器（VQ-VAE）从原始时间强度曲线中学习离散的血流动力学表示。这些量化的灌注编码定义了代表功能性肿瘤栖息地的粗级图节点，每个节点根据结构性MRI被层次性细分为细级子区域。然后，层次图神经网络在不同尺度间传播信息以进行分子预测。在一个内部队列（n=475）中，该模型在IDH（AUC 0.96）、1p/19q（AUC 0.89）和WHO分级（AUC 0.84）方面取得了良好的表现，并在一个独立的外部队列（n=397）上保持了稳健的IDH表现（AUC 0.89），且无需重新校准。基于梯度的显著性分析证实了与已知胶质瘤病理生理学相一致的生物学基础注意模式。我们的结果展示了将灌注动态整合到胶质瘤分子亚型放射基因组学流程中的附加价值。代码可在 https://github.com/janghana/HiPerfGNN 获取。
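
The quantization step at the heart of a VQ-VAE, and the way shared codes can define tumor "habitats", can be sketched in a few lines. This toy version applies the nearest-code lookup directly to short time-intensity curves, that is, it treats the encoder as the identity; the codebook values, voxel names, and grouping rule are hypothetical and are not taken from the HiPerfGNN code.

```python
def quantize(curve, codebook):
    """Return (index, code) of the codebook vector closest to `curve` in squared L2 distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: dist2(curve, codebook[k]))
    return index, codebook[index]

# Hypothetical codebook: flat signal, mild signal drop, strong signal drop.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, -0.5, 0.0],
    [0.0, -3.0, -1.0, -0.2],
]

# Hypothetical per-voxel curves (baseline-subtracted signal over four time points).
curves = {
    "v0": [0.0, -2.8, -1.1, -0.1],
    "v1": [0.0, -0.1, 0.0, 0.0],
    "v2": [0.0, -3.1, -0.9, -0.3],
}

# Voxels that share a code form one coarse-level node (a candidate habitat).
habitats = {}
for voxel, curve in curves.items():
    index, _ = quantize(curve, codebook)
    habitats.setdefault(index, []).append(voxel)
```

In the actual framework the codebook is learned jointly with an encoder and decoder, and each coarse node is further subdivided using structural MRI; the sketch covers only the discrete assignment.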


cs.CV / 35 / 2605.07178

Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

面具可以交流：从单模态图像中提取结构化文本信息以进行遥感变化检测

Zheng, Kai, Dong, Hang-Cheng, Pan, Jiatong, Wu, Zhenkai, Wei, Fupeng, Zhang, Wei

Abstract

Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy: the model is first fine-tuned on remote sensing imagery to obtain robust domain-specific representations, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an F$_{\text{scd}}$ of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.

Chinese Translation

遥感变化检测对于城市监测、灾害评估和环境资源管理至关重要。然而，单模态深度学习方法常常将真实的语义变化与视觉上相似但无关的变异混淆。最近的多模态方法将文本作为辅助监督，但它们的描述要么语义粗糙且无结构，要么是模型生成的，因此噪声较多。值得注意的是，所有这些方法都忽视了一个简单的事实：细粒度的变化语义已经隐含地编码在每个变化检测数据集中标准附带的真实标签掩码中。这些掩码知道变化发生的地点、变化前后的土地覆盖类型、变化是如何发生的以及涉及了多少对象。在本文中，我们提出了S2M，一个框架，它可以在零额外标注成本的情况下直接从变化标签中获取结构化文本特征。具体而言，每个变化区域被自动转录为一个语义四元组（哪里、什么、如何、多少），并转换为多个固定模板的文本描述，提供精确、密集且无噪声的多模态监督。我们采用两阶段训练策略，首先在遥感图像上进行微调，以获得稳健的领域特定表示，随后引入一个具有双向对比损失的多模态解码器，以实现视觉特征与结构化文本嵌入之间的深度对齐。为了验证我们的方法，我们构建了Gaza-Change-v2，一个关于加沙地带的新多类变化检测（MCD）数据集。在这个MCD数据集上，S2M达到了17.80%的Sek和66.14%的F$_{\text{scd}}$，显著超越了甚至利用大型语言模型的多模态方法。我们的工作表明，掩码确实可以交流。它们准确告诉我们发生了什么、在哪里、如何以及涉及了多少变化。
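
The idea of transcribing a change mask into a (where, what, how, how many) quadruple and then into a fixed-template sentence can be sketched as below. The class names, the quadrant-based notion of "where", the 4-connectivity rule for counting regions, and the sentence template are all assumptions chosen for illustration; the abstract does not specify S2M's actual templates.

```python
from collections import deque

CLASS_NAMES = {0: "bare land", 1: "building", 2: "vegetation"}  # hypothetical label set

def change_regions(before, after):
    """Yield ((src, dst), pixels) for each 4-connected region sharing one class transition."""
    h, w = len(before), len(before[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if seen[r][c] or before[r][c] == after[r][c]:
                continue
            key = (before[r][c], after[r][c])
            pixels, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and (before[ny][nx], after[ny][nx]) == key):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield key, pixels

def describe(before, after):
    """Turn two label maps into one templated sentence per class transition."""
    h, w = len(before), len(before[0])
    groups = {}
    for key, pixels in change_regions(before, after):
        groups.setdefault(key, []).append(pixels)
    sentences = []
    for (src, dst), regions in sorted(groups.items()):
        ys = [y for region in regions for y, _ in region]
        xs = [x for region in regions for _, x in region]
        cy, cx = sum(ys) / len(ys), sum(xs) / len(xs)
        # "where": image quadrant of the centroid; "how many": number of regions.
        where = ("upper" if cy < h / 2 else "lower") + "-" + ("left" if cx < w / 2 else "right")
        sentences.append(
            f"{len(regions)} region(s) in the {where} area changed from "
            f"{CLASS_NAMES[src]} to {CLASS_NAMES[dst]}."
        )
    return sentences

before = [[0, 0, 2, 2], [0, 0, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2]]
after = [[1, 1, 2, 2], [1, 1, 2, 2], [2, 2, 2, 0], [2, 2, 2, 2]]
captions = describe(before, after)
```

Because the sentences are derived deterministically from the labels, they carry no captioning noise, which is the property the abstract emphasizes.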


cs.CV / 36 / 2605.07181

SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction

SatSurfGS：用于稀疏视图卫星表面重建的可泛化二维高斯溅射

Chen, Min, Guo, Wei, Wang, Bin, Li, Wen, Fang, Tong, Zhang, Jinbo, Zhao, Junqi, Kuang, Hong, Hu, Han, Ge, Xuming, Zhu, Qing, Xu, Bo

Abstract

Sparse-view satellite image surface reconstruction remains highly challenging, fundamentally because the reliability of multi-view matching under satellite imaging conditions is strongly spatially heterogeneous. Affected by large photometric differences, weak textures, and repetitive textures, multi-view geometric constraints are often sparse, unevenly distributed, and locally unreliable. Although 2D Gaussian Splatting (2DGS) is more suitable than 3D Gaussian Splatting (3DGS) for the explicit representation of continuous surfaces, research on generalizable feed-forward 2DGS frameworks for sparse-view satellite surface reconstruction is still lacking. To address this issue, we propose SatSurfGS, a generalizable sparse-view surface reconstruction method for satellite imagery based on 2DGS. The proposed method builds a coarse-to-fine Gaussian attribute prediction framework and explicitly models local geometric reliability at three levels: feature learning, Gaussian parameter estimation, and training optimization. Specifically, we propose a confidence-aware monocular multi-view feature fusion module to adaptively integrate monocular priors and multi-view matching features according to local confidence; a cross-stage self-consistency residual guidance module to stabilize stage-wise Gaussian parameter refinement using the residual between the rendered height map from the previous stage and the current-stage MVS height map, together with confidence information; and a confidence bidirectional routing loss to achieve differentiated allocation of geometric and appearance supervision. Experiments on satellite datasets show that the proposed method achieves improved rendering quality, surface reconstruction accuracy, cross-dataset generalization, and inference efficiency compared with representative generalizable baselines and competitive per-scene optimization methods.

Chinese Translation

稀疏视图卫星图像表面重建仍然面临很大的挑战，根本原因在于在卫星成像条件下，多视图匹配的可靠性在空间上高度异质。由于受到较大的光度差异、弱纹理和重复纹理的影响，多视图几何约束往往稀疏、不均匀分布，并且在局部上不可靠。尽管二维高斯溅射（2D Gaussian Splatting，2DGS）比三维高斯溅射（3D Gaussian Splatting，3DGS）更适合于连续表面的显式表示，但针对稀疏视图卫星表面重建的可泛化前馈2DGS框架的研究仍然缺乏。为了解决这一问题，我们提出了SatSurfGS，一种基于2DGS的可泛化稀疏视图表面重建方法。该方法构建了一个从粗到细的高斯属性预测框架，并在三个层面上显式建模局部几何可靠性：特征学习、高斯参数估计和训练优化。具体而言，我们提出了一种基于置信度的单目多视图特征融合模块，以根据局部置信度自适应地整合单目先验和多视图匹配特征；一个跨阶段自一致性残差引导模块，以利用前一阶段渲染的高度图与当前阶段多视图立体（MVS）高度图之间的残差及置信度信息，稳定阶段间的高斯参数细化；以及一种置信度双向路由损失，以实现几何和外观监督的差异化分配。对卫星数据集的实验表明，与代表性的可泛化基线和竞争性的逐场优化方法相比，所提出的方法在渲染质量、表面重建精度、跨数据集泛化能力和推理效率方面均有所提升。


cs.CV / 37 / 2605.07188

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

PicoEyes：用于混合现实的统一注视估计框架及其大规模多视角数据集

Duan, Fuxin, Wang, Hui

Abstract

We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-of-the-art performance, consistently outperforming both academic and industrial gaze tracking methods across no-calibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-to-end paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.

Chinese Translation

我们提出了PicoEyes，一个统一的注视估计框架，能够直接预测注视的所有关键属性，包括3D眼部参数、眼部区域分割、光轴、视觉轴和深度图，支持单目或双目输入。该框架同时解决了标定、注视预测和不同设备姿态的问题，同时还支持通过眼部参数和深度图的联合估计进行3D眼部重建，采用端到端的方式。此外，我们引入了一个大规模的多视角近眼数据集，包含在多种条件下的全面2D和3D标注，包括训练、测试、重穿测试和标定会话。大量实验表明，PicoEyes在无标定、标定、重穿后标定和预测设置下，始终优于学术界和工业界的注视跟踪方法，达到了最先进的性能。本研究建立了一个实用的端到端范式，为混合现实（MR）应用中的稳健和可推广的注视估计提供了支持。


cs.CV / 38 / 2605.07191

Attention Transfer Is Not Universally Effective for Vision Transformers

注意力转移并非对视觉变换器普遍有效

Qin, Huaiyuan, Yang, Muli, Goenawan, Gabriel James, Hu, Peng, Gong, Chen, Peng, Xi, Zhu, Hongyuan

Abstract

A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient only when the student architecture matches the teacher.

Chinese Translation

最近的研究表明，注意力转移（Attention Transfer）仅将预训练教师视觉变换器（Vision Transformer, ViT）的注意力模式转移到随机初始化的标准学生ViT上，就足以恢复教师预训练权重的全部益处。我们在11个知名ViT家族的20个教师的综合基准上重新审视了这一发现，并揭示注意力转移并非普遍有效。虽然7个家族成功进行了转移，但4个家族始终失败，性能比从头训练且未进行转移的基线低了多达5.1%。进一步的结果表明，这种失败在不同模型规模中是一致的，并且在延长训练时间、不同转移数据集和分布外评估下依然存在。控制分析一致性地将问题定位于注意力路由通道，表明关键问题不在于学生能否匹配教师的注意力模式，而在于匹配的模式是否对学生仍然有效。重要的是，我们确定预训练教师与标准学生之间的架构不匹配是主要机制。通过仅在随机初始化状态下将教师的原生架构组件添加到学生中，我们完全逆转了所有4个家族的失败。值得注意的是，这些组件单独并未改善从头训练，确认它们特定地解锁了教师注意力的可用性。我们进一步系统地表明，这一失败并不能通过转移损失选择不当或预训练方案的差异来解释。我们的发现细化了对ViT表示中注意力的主流理解：注意力仅在学生架构与教师匹配时才是充分的。


cs.CV / 39 / 2605.07192

AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred Scenes

AsyncEvGS：用于手持运动模糊场景的异步事件辅助高斯溅射

Dai, Jun, Jin, Renbiao, Xu, Bo, Chen, Yutian, Xu, Linning, Yu, Mulin, Xue, Tianfan, Guo, Shi

Abstract

3D reconstruction methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) achieve impressive photorealism but fail when input images suffer from severe motion blur. While event cameras provide high-temporal-resolution motion cues, existing event-assisted approaches rely on low-resolution sensors and strict synchronization, limiting their practicality for handheld 3D capture on common devices, such as smartphones. We introduce a flexible, high-resolution asynchronous RGB-Event dual-camera system and a corresponding reconstruction framework. Our approach first reconstructs sharp images from the event data and then employs a cross-domain pose estimation module based on the Visual Geometry Transformer (VGGT) to obtain robust initialization for 3DGS. During optimization, we employ a structure-driven event loss and view-specific consistency regularizers to mitigate the ill-posed behavior of traditional event losses and deblurring losses, ensuring both stable and high-fidelity reconstruction. We further contribute AsyncEv-Deblur, a new high-resolution RGB-Event dataset captured with our asynchronous system. Experiments demonstrate that our method achieves state-of-the-art performance on both our challenging dataset and existing benchmarks, substantially improving reconstruction robustness under severe motion blur. Project page: https://openimaginglab.github.io/AsyncEvGS/

Chinese Translation

3D重建方法如3D高斯溅射（3DGS）和神经辐射场（NeRF）在实现令人印象深刻的照片真实感方面表现优异，但在输入图像受到严重运动模糊时则表现不佳。虽然事件相机提供了高时间分辨率的运动线索，但现有的事件辅助方法依赖于低分辨率传感器和严格的同步，这限制了它们在智能手机等常见设备上进行手持3D捕捉的实用性。我们提出了一种灵活的高分辨率异步RGB-事件双摄像头系统及相应的重建框架。我们的方法首先从事件数据中重建清晰图像，然后采用基于视觉几何变换器（Visual Geometry Transformer, VGGT）的跨域姿态估计模块，以获得3DGS的稳健初始化。在优化过程中，我们采用结构驱动的事件损失和视图特定的一致性正则化器，以减轻传统事件损失和去模糊损失的病态行为，确保重建的稳定性和高保真性。我们进一步贡献了AsyncEv-Deblur，这是一个使用我们的异步系统捕获的新高分辨率RGB-事件数据集。实验表明，我们的方法在我们具有挑战性的数据集和现有基准上均实现了最先进的性能，显著提高了在严重运动模糊下的重建鲁棒性。项目页面：https://openimaginglab.github.io/AsyncEvGS/


cs.CV / 40 / 2605.07194

Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models

预训练视觉模型的闭式线性探测数据集蒸馏

Peng, Bincheng, Li, Guang, Liu, Ping, Ogawa, Takahiro, Haseyama, Miki

Abstract

Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.

Chinese Translation

数据集蒸馏将大型训练集压缩为小型合成集，同时保留下游训练的效用。尽管大多数现有方法针对从头开始训练网络，但现代视觉迁移学习通常使用冻结的预训练编码器，随后进行轻量级线性探测。现有的针对这种设置的蒸馏方法要么通过基于轨迹的梯度匹配展开迭代线性探测更新，要么依赖于最初为从头训练设计的闭式公式，这些公式使用神经切线核（NTK）近似。无论哪种方法都未能利用冻结特征线性探测允许直接由预训练特征本身确定的闭式解的事实，而无需无限宽度近似和内部循环轨迹。我们提出了闭式线性探测数据集蒸馏（CLP-DD），这是一种双层公式，通过样本空间核岭求解器计算合成集诱导的线性探测。然后，通过在真实特征上评估该诱导分类器，使用温度缩放的软最大交叉熵更新合成图像，其中分类器列作为特征空间中的学习类锚点。我们进一步表明，外部目标的选择至关重要：将闭式内部求解器与标准均方误差（MSE）外部损失配对，显著低于基于轨迹的方法，而区分性外部损失则缩小了大部分差距。在使用四个预训练骨干网络的ImageNet-100上，CLP-DD在没有DSA的情况下显著优于LGM，并在计算成本的一小部分下接近LGM与DSA的表现。在ImageNet-1K上，CLP-DD在四个骨干网络中的三个上与LGM与DSA的表现相匹配或超越，同时运行速度约为$14\times$更快，并且使用的GPU内存不到八分之一。
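
The closed-form inner step described in the CLP-DD abstract above can be illustrated with a small pure-Python sketch. It assumes a plain linear kernel on frozen features, one-hot labels, and a ridge parameter `lam`; the helper names (`solve`, `ridge_probe`, `softmax_ce`) are hypothetical and not taken from the paper. Only the forward computation is shown; the actual method would presumably backpropagate the outer loss to the synthetic images with an autodiff framework.

```python
import math


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | B], copied so the inputs are not modified.
    M = [list(map(float, A[i])) + list(map(float, B[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def ridge_probe(feats_syn, labels_syn, lam):
    """Sample-space ridge solution W = F^T (F F^T + lam I)^{-1} Y.

    feats_syn: n synthetic feature vectors of length d (rows of F).
    labels_syn: n one-hot label vectors of length c (rows of Y).
    Returns W as a d x c nested list; column j is the anchor for class j.
    """
    n, d, c = len(feats_syn), len(feats_syn[0]), len(labels_syn[0])
    # n x n Gram matrix with the ridge term on the diagonal (assumes lam > 0).
    K = [[sum(a * b for a, b in zip(feats_syn[i], feats_syn[j]))
          + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve(K, labels_syn)  # n x c dual coefficients
    return [[sum(feats_syn[i][k] * alpha[i][j] for i in range(n))
             for j in range(c)] for k in range(d)]


def softmax_ce(feat, label, W, temperature):
    """Temperature-scaled softmax cross-entropy of one real feature vector."""
    logits = [sum(feat[k] * W[k][j] for k in range(len(feat))) / temperature
              for j in range(len(W[0]))]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```

The Gram matrix here is n x n in the number of synthetic samples, which stays small by construction; that is the practical appeal of a sample-space solve when the feature dimension is much larger than the synthetic set.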


cs.CV / 41 / 2605.07195

See Tomorrow, Act Today: Foresight-Driven Autonomous Driving

展望未来，立刻行动：基于前瞻性的自主驾驶

Zhang, Bozhou, Song, Nan, Wang, Yuang, Deng, Jiankang, Zhu, Xiatian, Zhang, Li

Abstract

Current end-to-end autonomous driving planners are fundamentally reactive: they condition on historical and present observations to predict future actions. We argue that autonomous agents should instead imagine future scenes before deciding, just as human drivers mentally simulate "what will happen next" before acting. We introduce ForeSight, a planning framework centered on a foundation world model that reframes autonomous driving as anticipatory decision-making. Rather than treating world models as auxiliary components, ForeSight makes future scene imagination the primary driver of action prediction. Our approach operates in two stages: (1) generating plausible future visual worlds via a pretrained world model, and (2) planning actions conditioned on these imagined futures. This paradigm shift from "what should I do now?" to "what will happen, and how should I respond?" enables genuinely anticipatory rather than reactive planning. By grounding decisions in anticipated contexts rather than present observations alone, ForeSight navigates dynamic, interactive scenarios more effectively. Extensive experiments on NAVSIM and nuScenes demonstrate that explicit future imagination significantly outperforms previous state-of-the-art alternatives, validating our foresight-driven approach.

Chinese Translation

当前的端到端自主驾驶规划器在本质上是反应性的：它们基于历史和当前观察来预测未来行动。我们认为，自主代理应该在决策之前想象未来场景，就像人类驾驶员在行动前会在脑海中模拟“接下来会发生什么”。我们引入了ForeSight，一个以基础世界模型为中心的规划框架，将自主驾驶重新构想为前瞻性决策。ForeSight不再将世界模型视为辅助组件，而是将未来场景的想象作为行动预测的主要驱动力。我们的方法分为两个阶段：（1）通过预训练的世界模型生成可信的未来视觉世界；（2）基于这些想象的未来规划行动。这一从“我现在应该做什么？”到“将会发生什么，我该如何应对？”的范式转变，使得规划真正具备前瞻性，而非仅仅是反应性。通过将决策建立在预期的情境上，而不仅仅是当前观察，ForeSight能够更有效地应对动态、互动的场景。在NAVSIM和nuScenes上的大量实验表明，明确的未来想象显著优于之前的最先进替代方案，验证了我们基于前瞻性的研究方法。


cs.CV / 42 / 2605.07203

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

从像素到原语：3D高斯溅射中的场景变化检测

Galappaththige, Chamuditha Jayanga, Lai, Jason, Patten, Timothy, Dansereau, Donald, Suenderhauf, Niko, Miller, Dimity

Abstract

Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. Change detection with Gaussian splatting has thus been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone -- position, anisotropic covariance, and color -- carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of the Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GS-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, whereas prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by approximately 17% in mean Intersection over Union.

Chinese Translation

基于高斯溅射的场景变化检测方法普遍遵循先渲染后比较的范式：将变化前的场景渲染为2D图像，并通过像素或特征残差与变化后的图像进行比较。我们将这一基于高斯溅射的变化检测问题视为关于原语的问题，而非仅仅是关于像素的问题。我们提供直接证据表明，仅凭原生原语属性——位置、各向异性协方差和颜色——就能为场景变化检测提供足够的信号。原语空间比较的困难在于高斯溅射表示的欠约束特性：独立优化会产生原语解，其数量、位置、形状和颜色即使在没有变化的情况下也会有所不同。我们通过几何和光度漂移的各向异性模型来应对这一挑战，并补充了一个每个原语的可观测性项，反映每个高斯受到相机几何约束的程度。直接在原语上操作使我们的算法GS-DIFF具备了两个使其与先渲染后比较方法区分开来的特性。首先，变化图在构造上是一致的多视图，而之前的工作需要通过额外的优化目标来学习这一点。其次，几何和外观变化是分开评分的，不仅识别变化发生的位置，还识别变化的类型，从而在没有监督或外部模型依赖的情况下区分结构变化（例如，新增物体）与表面变化（例如，颜色变化）。在真实世界的基准测试中，GS-DIFF在平均交并比（mean Intersection over Union）上超过了之前的最先进方法约17%。


cs.CV / 43 / 2605.07213

LoHGNet: Infrared Small Target Detection through Lorentz Geometric Encoding with High-Order Relation Learning

LoHGNet：通过洛伦兹几何编码与高阶关系学习实现红外小目标检测

Ma, Qianwen, Xu, Yang, Deng, Shangwei, Li, Xiaobo, Hu, Haofeng

Abstract

Infrared small target detection (IRSTD) remains challenging due to the scarcity of useful target cues and the presence of severe background clutter. Most current methods rely on conventional feature learning and local interaction modeling, where features are represented in Euclidean space. However, such designs may still be limited in describing the subtle differences of weak targets and the contextual relations between targets and backgrounds. To address these limitations, we propose LoHGNet, an IRSTD network that integrates Lorentz geometric encoding with high-order relation learning. By introducing Lorentz manifold based feature learning, LoHGNet offers a different feature representation from conventional IRSTD methods and provides new discriminative cues for IRSTD. Specifically, a Lorentz encoding branch is constructed with the Geometric Attention Guided Lorentz Residual Convolution Module (GA-LRCM) to perform feature modeling under hyperbolic geometric constraints and enhance the hierarchical geometric representation capability of weak targets. Subsequently, the hyperbolic features are mapped into the Euclidean tangent space through logarithmic mapping, and a High-Order Relation Learning Module (HORL) is designed to model the high-order contextual dependencies between targets and backgrounds via hypergraph construction, thereby improving target discrimination in complex backgrounds. Experimental results on three datasets demonstrate that the proposed LoHGNet achieves competitive performance in both detection accuracy and adaptability to complex scenes. The code will be available at https://github.com/Kingwin97.

Chinese Translation

红外小目标检测（IRSTD）仍然面临挑战，因为有效目标线索稀缺且背景杂波严重。目前大多数方法依赖于传统特征学习和局部交互建模，其中特征在欧几里得空间中表示。然而，这种设计在描述弱目标的微妙差异及目标与背景之间的上下文关系方面可能仍然有限。为了解决这些局限性，我们提出了LoHGNet，一种将洛伦兹几何编码与高阶关系学习相结合的IRSTD网络。通过引入基于洛伦兹流形的特征学习，LoHGNet提供了与传统IRSTD方法不同的特征表示，并为IRSTD提供了新的区分线索。具体而言，构建了一个洛伦兹编码分支，采用几何注意引导的洛伦兹残差卷积模块（GA-LRCM）在双曲几何约束下进行特征建模，从而增强弱目标的层次几何表示能力。随后，通过对数映射将双曲特征映射到欧几里得切空间，并设计了高阶关系学习模块（HORL）通过超图构建建模目标与背景之间的高阶上下文依赖关系，从而提高在复杂背景下的目标区分能力。在三个数据集上的实验结果表明，所提出的LoHGNet在检测准确性和对复杂场景的适应性方面均表现出竞争力。代码将发布在 https://github.com/Kingwin97。


cs.CV / 44 / 2605.07221

DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation

DINO-MVR：用于注释高效医学分割的冻结 DINOv3 的多视图读取

Jiang, Wei, Liu, Feng, Ye, Nan, Sun, Hongfu

Abstract

Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.

Chinese Translation

将基础模型适应于医学分割通常需要对主干进行微调或使用高容量的任务特定解码器，而在注释稀缺的情况下，这两者都难以可靠地拟合。我们展示了冻结的 DINOv3 特征已经包含了用于医学分割的有用结构和边界线索，主要瓶颈在于如何读取这些特征。我们提出了 DINO-MVR，一种用于注释高效医学分割的多视图读取框架。DINO-MVR 仅在冻结 DINOv3 主干的最后三个变换器块的特征上训练轻量级 MLP 探针，而不更新主干本身。在推理时，每个输入通过互补分辨率和测试时增强进行解释，其概率图通过熵加权融合组合，并通过简单的空间正则化进行优化。对于体积输入，高斯 z 轴平滑进一步提高了切片间的一致性。在内窥镜、皮肤镜和 MRI 基准的固定评估协议下，DINO-MVR 实现了强大的仅读取性能，包括 Kvasir-SEG 上的 0.895 Dice、ISIC 2018 上的 0.897 Dice 和 BraTS FLAIR 整肿瘤分割上的 0.908 Dice。仅使用五名标注的 BraTS 患者，它恢复了 40 名患者 BraTS 参考运行所获得的 98.4% 的性能。这些结果表明，当与有效的多视图读取相结合时，冻结的自监督视觉主干可以支持准确的医学分割。
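
The entropy-weighted fusion step mentioned in the DINO-MVR abstract above can be sketched as follows. The abstract does not give the weighting formula, so this is only one plausible reading: each view's per-pixel weight is assumed to be `ln 2 - H(p)`, where `H` is the binary entropy of that view's foreground probability, so confident views dominate. The function names are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli distribution with probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(prob_maps):
    """Fuse foreground-probability maps from several views.

    prob_maps: list of views, each a flat list of per-pixel probabilities
    of equal length. Assumed weighting: ln 2 - H(p), plus a small floor so
    the weights never sum to zero.
    """
    n_pixels = len(prob_maps[0])
    fused = []
    for i in range(n_pixels):
        probs = [view[i] for view in prob_maps]
        weights = [math.log(2.0) - binary_entropy(p) + 1e-6 for p in probs]
        fused.append(sum(w * p for w, p in zip(weights, probs)) / sum(weights))
    return fused
```

With one confident view (0.99) and one maximally uncertain view (0.5) at the same pixel, the fused value stays close to 0.99, whereas a plain average would give 0.745.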


cs.CV / 45 / 2605.07230

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

CASCADE：面向上下文的放松机制用于推测性图像解码

Yildirim, Selin, Chowdhury, Subhajit Dutta, Kamani, Mohammad Mahdi, Appia, Vikram, Chen, Deming

Abstract

Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model's high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model's behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model's hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.

Chinese Translation

自回归生成是一种强大的高保真图像合成方法，但即使在最先进的加速器上，它仍然计算要求高且速度缓慢。尽管已经探索了推测性解码以缓解这一瓶颈，但现有方法未能实现与文本生成中观察到的效率提升相当的效果。一个关键限制是目标模型在图像生成过程中的高不确定性，这导致草稿令牌的拒绝率很高。在本研究中，我们识别出在基于树的推测性解码中自然出现的目标模型行为中之前被忽视的模式。具体而言，我们形式化了两个属性：语义可互换性和收敛性，这些属性源于目标模型隐藏状态表示中的冗余。通过捕捉预测令牌树的深度和广度中的这些冗余，我们的方法识别出原则性的接受放松机会，而无需额外训练。此外，我们通过将目标模型中的冗余信号注入到草稿器训练中，最小化修改以增强独立草稿器的性能。我们在多个文本到图像模型和草稿器架构中评估了我们的方法。结果表明，CASCADE在基于草稿器的推测性解码中实现了最先进的加速效果，最高可达3.6倍，同时保持图像质量和文本提示的保真度。


cs.CV / 46 / 2605.07232

Towards multi-modal forgery representation learning for AI-generated video detection and localization

面向多模态伪造表示学习的AI生成视频检测与定位

Le, Dat, Nguyen, Khoa, Wang, Xin, Hu, Shu

Abstract

Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips partially manipulated across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality modeling and by the lack of fine-grained temporal forgery localization. To address these challenges, we introduce a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.

Chinese Translation

最近，生成性人工智能的进步使视频创作在规模上实现了民主化。AI生成的视频，包括在视觉和音频通道中部分操控的片段，带来了日益严重的语义扭曲和滥用风险，这促使人们对可靠检测工具的需求。现有的大多数AI生成视频检测器受限于单一或部分模态的数据建模，且缺乏细粒度的时间伪造定位。为了解决这些挑战，我们的主要创新在于引入了一种核心架构，该架构将LMM语义分支与时空（ST）视觉分支和多尺度部分欺骗（PS）音频分支联合集成。这种多模态方法能够同时检测和细粒度定位部分操控的AI生成视频伪造。大量实验表明，该方法优于现有的最先进技术。


cs.CV / 47 / 2605.07250

Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

难以阅读，易于越狱：视觉降质如何绕过 MLLM 安全对齐

Song, Zhixue, Han, Boyan, Wang, Yiwei, Zhang, Chi

Abstract

Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades and, surprisingly, that this deterioration persists even when the text remains legible. We attribute this to "Cognitive Overload", hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple "Structured Cognitive Offloading" strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.

Chinese Translation

最近在视觉上下文压缩方面的进展使得 MLLM 能够通过将文本渲染为图像来高效处理超长上下文。然而，我们识别出这一范式固有的一个关键漏洞：降低图像分辨率无意中催化了越狱行为。我们的实验表明，随着分辨率的降低，最先进模型的安全防御急剧恶化，令人惊讶的是，即使文本仍然可读，这种现象依然存在。我们将其归因于“认知超负荷”，假设解读降质输入所需的努力分散了对安全审计的注意资源。这一现象在各种视觉干扰下都是一致的，包括噪声和几何失真。为了解决这个问题，我们提出了一种简单的“结构化认知卸载”策略，通过强制执行序列化管道，将视觉转录与安全评估解耦，从而减轻这些风险。我们的工作揭示了基于视觉的压缩中的重大风险，并为未来 MLLM 的安全设计提供了重要见解。


cs.CV / 48 / 2605.07253

LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

LENS：低频特征噪声塑形用于高效扩散采样

Jeon, Haewon, Lee, Si-Hyeon

Abstract

Distilled diffusion models accelerate image generation by reducing the number of denoising steps, but often suffer from degraded image quality. To mitigate this trade-off, test-time optimization methods improve quality, yet their iterative nature incurs substantial computational overhead and leads to slow inference, limiting practical usability. Recent hypernetwork-based approaches amortize this process during training, but still require costly noise modulation in high-dimensional latent spaces. In this work, we propose LENS (Low-frequency Eigen Noise Shaping), an efficient noise modulation framework that operates in a low-dimensional subspace. Our approach is motivated by the observation that low-frequency components of the noise largely determine the global structure and visual fidelity of generated images. Based on this observation, we provide a theoretical justification for restricting modulation to the low-frequency subspace and derive a principled training objective. Building on this, LENS employs a lightweight, standalone network to selectively modulate these components, enabling efficient and targeted noise modulation. Extensive experiments demonstrate that LENS achieves competitive image quality while reducing FLOPs by 400-700$\times$, model parameters by 25-75$\times$, and inference-time overhead by 10-20$\times$ compared to prior methods.

Chinese Translation

蒸馏扩散模型通过减少去噪步骤的数量来加速图像生成，但往往会导致图像质量下降。为了缓解这一权衡，测试时优化方法提高了质量，但其迭代特性带来了可观的计算开销，并导致推理速度缓慢，限制了实际应用。最近基于超网络的方法在训练过程中摊销了这一过程，但仍需在高维潜在空间中进行昂贵的噪声调制。在本研究中，我们提出了LENS（低频特征噪声塑形），这是一种在低维子空间中操作的高效噪声调制框架。我们的方法的动机是观察到噪声的低频成分在很大程度上决定了生成图像的全局结构和视觉保真度。基于这一观察，我们为将调制限制在低频子空间提供了理论依据，并推导出一个原则性的训练目标。在此基础上，LENS采用了一个轻量级的独立网络来选择性地调制这些成分，从而实现高效且有针对性的噪声调制。大量实验表明，与之前的方法相比，LENS在减少FLOPs 400-700倍、模型参数25-75倍和推理时间开销10-20倍的同时，实现了具有竞争力的图像质量。


cs.CV / 49 / 2605.07254

High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images

基于高保真表面点云的多视图图像三维重建

Sunil, Nandhana, Iyer, Abhirami R, Mandal, Avirup

Abstract

Multi-view mesh reconstruction remains a core challenge in computer graphics and vision, especially for recovering high-frequency geometry from sparse observations. Recent methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) rely on post-processing for mesh extraction, thereby limiting joint optimization of geometry and appearance. Implicit Moving Least Squares (IMLS) instead enables direct conversion of point clouds into signed distance and texture fields, supporting end-to-end reconstruction and rendering. However, existing IMLS formulations use exponential kernels that struggle with high-frequency detail. We introduce a compact polynomial kernel with local support and greater flexibility, allowing better control over frequency content and improved geometric fidelity. To further enhance fine details, we incorporate stochastic regularization with Laplacian filtering. Together, these improve the preservation of high-frequency structure while maintaining stable optimization. Experiments show state-of-the-art performance in both surface reconstruction and rendering, yielding more accurate geometry and sharper visuals from multi-view data.

Chinese Translation

多视图网格重建仍然是计算机图形学和视觉领域的核心挑战，尤其是在从稀疏观测中恢复高频几何结构方面。最近的方法，如3D高斯点云（3D Gaussian Splatting, 3DGS）和神经辐射场（Neural Radiance Fields, NeRF），依赖后处理进行网格提取，从而限制了几何和外观的联合优化。隐式移动最小二乘法（Implicit Moving Least Squares, IMLS）则能够直接将点云转换为带符号距离和纹理场，支持端到端的重建和渲染。然而，现有的IMLS公式使用指数核，难以处理高频细节。我们引入了一种具有局部支持和更大灵活性的紧凑多项式核，允许更好地控制频率内容并提高几何保真度。为了进一步增强细节，我们结合了随机正则化和拉普拉斯滤波。综合这些方法，能够在保持稳定优化的同时改善高频结构的保留。实验表明，在表面重建和渲染方面均表现出最先进的性能，从多视图数据中获得更准确的几何形状和更清晰的视觉效果。


cs.CV / 50 / 2605.07256

TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

TAS-LoRA：基于混合低秩适应专家的变换器架构搜索

Jeon, Jeimin, Lee, Hyunju, Ham, Bumsub

Abstract

Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing the human effort of manually designing ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA effectively mitigates feature collapse and significantly improves performance over state-of-the-art TAS methods.

Chinese Translation

变换器架构搜索（TAS）自动发现最佳视觉变换器（ViT）架构，从而减少了人工设计ViT的工作量。然而，现有的TAS方法存在特征崩溃问题，其中超网络中的子网络未能学习特定于子网络的特征，主要是由于超网络中共享权重的限制，导致个别子网络的性能受到限制。为了解决这个问题，我们提出了TAS-LoRA，这是一种新颖的方法，引入了参数高效的低秩适应（LoRA），以实现子网络特定特征的学习，同时保持计算效率。TAS-LoRA结合了混合低秩适应专家（MoLE）策略，其中一个轻量级路由器根据子网络架构动态分配LoRA专家，并引入了一种组内路由器初始化技术，以鼓励在训练早期专家之间的多样化特征学习。在ImageNet和多个迁移学习基准（包括CIFAR-10/100、Flowers、CARS和INAT-19）上的大量实验表明，TAS-LoRA有效缓解了特征崩溃，显著提高了相较于最先进的TAS方法的性能。


cs.CV / 51 / 2605.07257

Adaptive Subspace Projection for Generative Personalization

生成个性化的自适应子空间投影

Nguyen, Van-Anh, Bui, Anh Tuan, Abraham, Tamas, Kim, Junae, Kaur, Amardeep, Omari, Rollin, Vu, Thuy-Trang, Phung, Dinh

Abstract

Generative personalization often suffers from the semantic collapsing problem (SCP), where a learned personalized concept overpowers the rest of the text prompt, causing the model to ignore important contextual details. To address this, we first analyze the underlying cause, revealing that the semantic drift responsible for SCP is not random but is concentrated within a specific low-dimensional subspace. We also discover that the personalization process perturbs the embedding of the original base concept, making it an unstable reference point. Based on these insights, we introduce Test-time Embedding Adjustment with Adaptive Subspace Projection (AdaptSP), a training-free method that uses the stable, pre-trained embedding as an anchor. AdaptSP isolates the semantic drift and projects it onto the identified subspace, performing a precise adjustment that mitigates SCP while maintaining the subject identity. Our experiments show that this targeted approach significantly improves prompt fidelity and contextual alignment.

Chinese Translation

生成个性化通常面临语义崩溃问题（SCP），即学习到的个性化概念压倒了文本提示的其他部分，导致模型忽略重要的上下文细节。为了解决这一问题，我们首先分析了其潜在原因，揭示出导致SCP的语义漂移并非随机，而是集中在特定的低维子空间内。我们还发现个性化过程扰动了原始基础概念的嵌入，使其成为一个不稳定的参考点。基于这些见解，我们提出了一种无训练的方法——自适应子空间投影的测试时嵌入调整（AdaptSP），该方法利用稳定的预训练嵌入作为锚点。AdaptSP隔离语义漂移并将其投影到识别出的子空间，进行精确调整，从而减轻SCP，同时保持主体身份。我们的实验表明，这种针对性的方法显著提高了提示的保真度和上下文的一致性。
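
The subspace projection idea in the AdaptSP abstract above can be illustrated with a minimal sketch. The abstract does not specify how the adjustment is applied, so the following is an assumption: the drift is taken as the difference between the personalized embedding and the pre-trained anchor embedding, its component inside an orthonormal basis of the drift subspace is computed, and that component is subtracted with a tunable `strength`. All function and parameter names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_subspace(vec, basis):
    """Project vec onto the span of `basis` (assumed orthonormal)."""
    proj = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        proj = [p + coeff * x for p, x in zip(proj, b)]
    return proj


def adjust_embedding(personalized, anchor, basis, strength=1.0):
    """Remove the in-subspace part of the drift from a personalized embedding.

    personalized: embedding learned during personalization.
    anchor: stable pre-trained embedding of the base concept.
    basis: orthonormal vectors spanning the assumed drift subspace.
    strength: 0 leaves the embedding unchanged, 1 removes the full component.
    """
    drift = [p - a for p, a in zip(personalized, anchor)]
    drift_in_subspace = project_onto_subspace(drift, basis)
    return [p - strength * d for p, d in zip(personalized, drift_in_subspace)]
```

Because only the component inside the subspace is touched, everything orthogonal to it (which, under this reading, would carry the subject identity) is left as learned.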


cs.CV / 52 / 2605.07264

Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

Sat3R：基于RPC感知的卫星数字表面模型重建

Yang, Qiaoyi, Zhou, Chaoyi, Liu, Xi, Wang, Run, Xu, Minghui, Pesé, Mert D., Luo, Feng, Xu, Yuhao, Cheng, Zhi-Qi, Chen, Qiushi, Qi, Hairong, Huang, Siyu

Abstract

Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.

Chinese Translation

从卫星影像中准确重建数字表面模型（DSM）对于灾害响应、城市规划和大规模地理制图等应用至关重要。现有方法面临一个基本的权衡：基于优化的方法虽然能实现较高的准确性，但每个场景的计算时间需要数小时，而可泛化的几何基础模型则提供近乎即时的推断，但由于有理多项式相机（RPC）模型引入的领域差距和深度尺度分布的不匹配，无法很好地泛化到卫星影像。我们提出了Sat3R，一个前馈框架，通过使用尺度不变对数（SiLog）损失对Depth Anything V2进行RPC感知的度量深度微调，弥补了这一差距。通过从RPC几何构建物理一致的伪深度监督，Sat3R能够在不进行每个场景优化的情况下，将单目深度基础模型适应于卫星领域。在DFC2019基准测试中的实验表明，Sat3R在零样本前馈基线的基础上将平均绝对误差（MAE）降低了38%，并在准确性上与基于优化的方法具有竞争力，同时实现了超过300倍的加速。Sat3R展示了当前馈模型经过适当调整以适应卫星领域时，可以以较低的计算成本匹配基于优化的准确性，为实际的大规模卫星DSM重建铺平了道路。
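
For reference, the scale-invariant logarithmic (SiLog) loss named in the Sat3R abstract above is commonly written, following Eigen et al., as the mean squared log-depth error minus `lam` times the squared mean log-depth error. The sketch below uses that common form; the exact variant and the value of `lam` used by Sat3R are not stated in the abstract, so both are assumptions here.

```python
import math


def silog_loss(pred, target, lam=0.5):
    """Scale-invariant log loss over paired positive depth values.

    pred, target: equal-length sequences of depths; pairs with a
    non-positive value are skipped as invalid.
    lam: weight of the scale-invariance term (lam=1 ignores global scale,
    lam=0 reduces to plain mean squared log error).
    """
    diffs = [math.log(p) - math.log(t)
             for p, t in zip(pred, target) if p > 0 and t > 0]
    if not diffs:
        raise ValueError("no valid depth pairs")
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean
```

With `lam=1.0`, a prediction that is off by a constant factor everywhere incurs essentially zero loss, which is the property that makes this family of losses attractive when depth scale distributions differ between domains.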


cs.CV / 53 / 2605.07273

From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG

从云层到幻觉：遥感视觉-语言 RAG 中的气象检索劫持

Han, Jiaju, Li, Chao, Hu, Chengyin, Zhang, Qike, Sun, Xuemeng, Wang, Xin, Zhang, Fengyu, Chen, Xiang, Wei, Yiwei, Long, Jiahuan, Guo, Jiujiang

Abstract

Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71\% to 43.29\%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.

Chinese Translation

多模态 RAG 系统越来越依赖视觉-语言检索器将视觉查询与外部文本证据相结合。现有的针对 RAG 的对抗性研究主要操控检索语料库或记忆，而对视觉-语言和遥感模型的攻击通常针对最终任务预测。输入空间对遥感多模态 RAG 的证据检索阶段的威胁仍然未被充分探讨。为了解决这一空白，我们提出了 CloudWeb，一种气象检索劫持攻击，仅修改输入图像，同时在部署时保持检索器、生成器和知识库不变。CloudWeb 在遥感图像上叠加参数化的云层和雾霾样式，并使用检索导向的目标进行优化，使对抗性图像嵌入朝向目标气象证据拉近，抑制源场景证据，强制排名分离，并正则化自然性和覆盖率。据我们所知，这是首次研究遥感多模态 RAG 中检索阶段气象证据劫持的工作。我们在一个包含七个数据集的遥感 RAG 基准上评估 CloudWeb，使用五个 CLIP 风格的检索器，包括 GeoRSCLIP、RemoteCLIP、OpenAI CLIP 和 OpenCLIP，以及下游视觉-语言生成器。在各个检索器中，CloudWeb 在将天气相关证据注入排名前列的结果中始终优于干净检索、手工制作的气象基线、随机云扰动和固定变体。在 GeoRSCLIP ViT-B/32 上，Weather@5 从 0.71\% 增加到 43.29\%。下游生成进一步显示出可测量的天气幻觉和语义转变，表明检索阶段的劫持可以传播到最终的 RAG 响应。这些发现揭示了一种实际的失败模式：自然外观的气象变化可以在生成开始之前破坏证据检索。


cs.CV / 54 / 2605.07287

SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

SplatWeaver：学习为可泛化的新视图合成分配高斯原语

Wan, Yecong, Li, Fan, Shao, Mingwen, Zuo, Wangmeng

Abstract

Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive yet compact 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that is able to dynamically allocate Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency structural cues, the routing process is encouraged to assign more Gaussian primitives to fine structures, complex geometry, and textured regions, while suppressing redundant primitives in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives.

Chinese Translation

可泛化的新视图合成旨在从未校准的输入图像中渲染未见视图，而无需对每个场景进行优化。最近基于3D高斯点云的前馈方法在效率和渲染质量上取得了令人鼓舞的成果。然而，它们大多数为每个像素或体素分配固定数量的高斯原语，忽略了真实场景的空间变化复杂性。这种均匀分配往往在平滑区域浪费高斯原语，而对细结构、复杂几何形状和高频细节的容量不足。这促使我们预测区域依赖的原语基数，而不是在每个地方施加固定的原语预算，从而实现更具表现力且紧凑的3D场景表示。因此，我们提出了SplatWeaver，一个可泛化的新视图合成框架，能够以前馈方式动态分配不同区域的高斯原语。具体而言，SplatWeaver引入了基数高斯专家和像素级路由方案，其中每个专家专门负责生成从0到M的特定数量的原语，路由方案协调这些专家以自适应地确定应分配给每个空间位置的高斯原语数量。此外，SplatWeaver结合了高频先验及其指导模块和路由正则化，以稳定专家选择并促进复杂性感知的分配。通过利用高频结构线索，路由过程被鼓励为细结构、复杂几何形状和纹理区域分配更多高斯原语，同时抑制平滑区域的冗余原语。在多种场景下的广泛实验表明，SplatWeaver始终优于最先进的方法，以更少的高斯原语提供更真实的新视图渲染。


cs.CV / 55 / 2605.07288

Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

Sword：通过动态潜在引导实现风格鲁棒的世界模型作为VLA策略后训练的模拟器

Gao, Jiaxuan, Guo, Yongjian, Guan, Zhong, Huang, Wen, Ma, Wanlun, Xiao, Xi, Xiong, Junwu, Wen, Sheng

Abstract

The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.

Chinese Translation

视觉-语言-动作（VLA）模型与世界模型的结合日益受到关注。一种代表性的方法将学习到的世界模型视为生成模拟器，使得策略优化完全在“想象”中进行。然而，当作为特定环境（如LIBERO基准）的模拟器部署时，现有的世界模型往往面临较差的泛化能力和长时间误差累积的问题。在闭环回放过程中，这些模型对初始状态的扰动高度敏感；颜色、光照及其他视觉因素的微小变化可能引发级联幻觉，导致严重的模糊或过曝。此外，长时间误差的累积进一步降低了预测未来状态的质量和保真度。这些问题限制了世界模型作为模拟器的可靠性。为了解决这些问题，我们提出了Sword，一个鲁棒的世界模型框架。我们的方法引入了结构引导的风格增强，以将交互环境的视觉纹理与任务相关的动态解耦，从而提高泛化能力。我们进一步提出了动态潜在引导，该方法在保持低内存消耗的同时，确保训练和推理之间的一致性。在LIBERO基准上的大量实验表明，我们的方法在泛化能力、生成质量、鲁棒性、保真度以及VLA模型的强化学习后训练成功率等方面显著优于基线WoVR。

cs.CV / 56 / 2605.07299

EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams

EgoPro-Bench：在自我中心视频流中基准化个性化主动交互

Ran, Dongchuan, Ou, Linyu, Li, Xueheng, Tong, Wenwen, Guo, Chenxu, Guo, Hewei, Wang, Kaibing, Lu, Lewei

Abstract

Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI). In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set. Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains. Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations. Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance. The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.

Chinese Translation

现有的多模态大型语言模型（MLLMs）主要是反应性的，未能持续感知环境或主动协助用户。尽管新兴基准测试关注主动性，但它们主要局限于警报场景，忽视个性化上下文，并未评估人机交互（HMI）的精确时机。本文介绍了EgoPro-Bench，这是一个基于流式自我中心视频的主动交互能力训练和评估的新基准；评估集包含2400个视频，训练集包含超过12000个视频。与之前的工作不同，EgoPro-Bench利用模拟用户档案生成多样化的用户意图，并在12个不同领域构建高保真度的HMI数据。随后，我们提出了一种专门的评估协议和指标，训练旨在高效推理和低延迟交互的主动交互模型，并进行全面评估。此外，我们引入了一种名为“短思维，更好交互”的交互原则，在意图识别之前分配有限的令牌预算，从而提高交互性能。实验表明，EgoPro-Bench显著增强了MLLMs的意图理解能力，并能够准确识别HMI的适当时机，为下一代以用户为中心的主动交互代理奠定了坚实基础。

cs.CV / 57 / 2605.07317

Amortized-Precision Quantization for Early-Exit Vision Transformers

用于早期退出视觉变换器的 amortized-precision 量化

Fang, Rui, Chen, Hsi-Wen, Chen, Ming-Syan

Abstract

Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.

Chinese Translation

视觉变换器（ViTs）在视觉任务中表现出色，但其在低精度早期退出的部署仍然脆弱。现有的量化方法假设静态全深度执行，因此在退出决策受到量化噪声干扰时变得不稳定，这可能在动态推理路径上放大错误。本文提出了 amortized-precision 量化（APQ），这是一种考虑层级随机暴露于量化噪声的利用率感知公式，并揭示了深度与精度之间的权衡。在APQ的基础上，我们提出了具有早期退出的互适应量化（MAQEE），这是一个双层框架，旨在在明确的风险控制下共同优化退出阈值和比特宽度，以提高推理稳定性。MAQEE在准确性与效率的权衡中建立了优越的帕累托前沿，将BOPs减少了多达95%，同时保持准确性，并在分类、检测和分割任务中超越强基线多达20%。
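
The utilization-aware view in this abstract can be illustrated with a small calculation. The sketch below is not the authors' formulation: it assumes the common convention BOPs = MACs × weight bits × activation bits, and it introduces a hypothetical per-layer "reach probability" (the chance that an input has not exited before that layer). All names and numbers are illustrative.

```python
def layer_bops(macs, w_bits, a_bits):
    """Bit operations of one layer under the MACs * w_bits * a_bits convention."""
    return macs * w_bits * a_bits


def expected_bops(layers, reach_prob):
    """Expected BOPs when layer i only runs with probability reach_prob[i].

    layers: list of (macs, w_bits, a_bits) tuples, ordered by depth.
    reach_prob: hypothetical probability that an input has not exited
                before layer i (assumption, not taken from the paper).
    """
    if len(layers) != len(reach_prob):
        raise ValueError("need one reach probability per layer")
    return sum(p * layer_bops(*layer) for layer, p in zip(layers, reach_prob))


# Toy 3-layer model: (MACs, weight bits, activation bits), illustrative values only.
full_precision = [(1000, 32, 32), (1000, 32, 32), (1000, 32, 32)]
mixed_precision = [(1000, 8, 8), (1000, 4, 8), (1000, 4, 4)]

always_run = [1.0, 1.0, 1.0]   # static full-depth execution
with_exits = [1.0, 0.6, 0.3]   # 40% of inputs exit after layer 1, 30% more after layer 2

baseline = expected_bops(full_precision, always_run)
amortized = expected_bops(mixed_precision, with_exits)
reduction = 1.0 - amortized / baseline
```

In this toy setting, deeper layers are both cheaper (fewer bits) and less often executed, so their contribution to the expected cost shrinks twice over. That interaction is the depth-precision trade-off the abstract refers to, shown here only in its simplest form.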

cs.CV / 58 / 2605.07326

GEM: Generating LiDAR World Model via Deformable Mamba

GEM：通过可变形的曼巴生成激光雷达世界模型

Wu, Yang, Liu, Zhaojiang, Meng, Qiang, Liu, Youquan, Weng, Renliang, Qian, Jianjun, Yang, Jian, Xie, Jin

Abstract

World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages a deformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performance across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM.

Chinese Translation

世界模型模拟环境动态并生成传感器观测，近年来在自动驾驶领域受到越来越多的关注。然而，基于激光雷达的世界模型的发展落后于基于摄像头视频或占用数据的模型，这主要是由于两个核心挑战：激光雷达点云的固有无序性以及动态物体与静态结构之间的区分困难。为了解决这些问题，我们提出了GEM：一种利用可变形曼巴架构的生成激光雷达世界模型，显著提高了保真度和想象能力。具体而言，利用顺序激光扫描与曼巴处理机制之间的结构相似性，我们首先通过自定义的激光雷达场景标记器将激光雷达扫描转换为紧凑的表示。在通过动态-静态分离器对标记特征进行无监督解缠后，引入了三路径可变形曼巴，以对解缠特征进行选择性扫描和自适应门控融合，从而增强对世界演变的时空理解。可选地，可以集成规划器和鸟瞰视图（BEV）布局控制器，以探索模型在自主推出和生成“如果”场景方面的能力。大量实验表明，GEM在各种基准和评估设置中实现了最先进的性能，展示了其优越性和有效性。项目页面：https://github.com/wuyang98/GEM。

cs.CV / 59 / 2605.07327

Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations

教师特征漂移：基于预训练扩散表示的一步扩散蒸馏

Zhang, Yuan, Li, Chenyi, Ma, Guoqing, Zha, Jiajun, Yang, Yuanming, Wang, Bo, Tang, Wei, Li, Wenbo, Huang, Haoyang, Duan, Nan

Abstract

Sampling from pretrained diffusion and flow-matching models typically requires many forward passes to generate diverse and high-fidelity images. Existing distillation methods often rely on multiple auxiliary networks, carefully designed training stages, or complex optimization pipelines. In this work, we revisit the recently proposed Drifting Model objective and show that a single drifting loss can be directly used to simplify one-step distillation. A key observation is that the pretrained diffusion teacher itself already provides a strong representation space. Unlike the original Drifting Model, which relies on an additional pretrained feature extractor, we use intermediate hidden states of the pretrained teacher model as the feature representation. This removes the need for training or introducing an extra representation network while preserving a semantically meaningful feature geometry for drifting. Furthermore, we introduce a lightweight mode coverage loss to mitigate mode collapse during distillation and encourage the student generator to cover diverse teacher-supported regions. Extensive experiments on ImageNet and SDXL demonstrate that our method achieves efficient one-step generation with competitive image quality and diversity, achieving FID scores of 1.58 on ImageNet-64×64 and 18.4 on SDXL, while substantially simplifying the overall distillation framework.

Chinese Translation

从预训练的扩散和流匹配模型中采样通常需要多次前向传播以生成多样且高保真的图像。现有的蒸馏方法通常依赖于多个辅助网络、精心设计的训练阶段或复杂的优化流程。在本研究中，我们重新审视了最近提出的漂移模型目标，并展示了单一的漂移损失可以直接用于简化一步蒸馏。一个关键观察是，预训练的扩散教师本身已经提供了一个强大的表示空间。与依赖额外预训练特征提取器的原始漂移模型不同，我们使用预训练教师模型的中间隐藏状态作为特征表示。这消除了训练或引入额外表示网络的需要，同时保留了漂移的语义上有意义的特征几何。此外，我们引入了一种轻量级模式覆盖损失，以减轻蒸馏过程中的模式崩溃，并鼓励学生生成器覆盖多样的教师支持区域。在ImageNet和SDXL上的大量实验表明，我们的方法实现了高效的一步生成，具有竞争力的图像质量和多样性，在ImageNet-64×64上获得了1.58的FID分数，在SDXL上获得了18.4，同时大大简化了整体蒸馏框架。

cs.CV / 60 / 2605.07329

GC-ART: Global Learnable Second-Order Rational Tone Curves for Illumination Robustness

GC-ART：用于照明鲁棒性的全局可学习二阶有理色调曲线

Huang, Wei, Huang, Joyce

Abstract

We introduce GC-ART (Global Curve Adaptive Rational Tone-mapping), a lightweight differentiable pre-processing module for robust image classification. GC-ART predicts an endpoint-pinned rational tone curve from per-channel soft histograms using a 643-parameter MLP, then applies the curve pointwise before the classifier. The module is trained end-to-end with cross-entropy and a soft monotonicity penalty. On CIFAR-10 with a CIFAR-style ResNet-18, GC-ART matches clean accuracy with the unenhanced baseline and other learned enhancers, improves over the baseline on multiplicative darkening, and achieves the best learned-method result on contrast corruption (48.45% vs. 46.27% for the baseline and 47.13% for Zero-DCE++). These results suggest that histogram-conditioned rational curves can learn useful global tone corrections, including contrast-expanding behavior, while preserving edge locations by construction through pointwise mapping. GC-ART also uses substantially fewer FLOPs than convolutional learned enhancers at 32 x 32. The current hyperparameters are untuned, leaving room for systematic improvement.

Chinese Translation

我们介绍了GC-ART（全局曲线自适应有理色调映射），这是一个轻量级的可微分预处理模块，用于鲁棒的图像分类。GC-ART通过每通道的软直方图使用一个643参数的多层感知器（MLP）预测一个端点固定的有理色调曲线，然后在分类器之前逐点应用该曲线。该模块通过交叉熵和软单调性惩罚进行端到端训练。在使用CIFAR风格的ResNet-18进行CIFAR-10测试时，GC-ART在干净准确度上与未增强的基线和其他学习增强器相匹配，在乘法暗化上超越基线，并在对比度损坏上实现最佳学习方法结果（48.45%对比基线的46.27%和Zero-DCE++的47.13%）。这些结果表明，基于直方图的有理曲线能够学习有用的全局色调校正，包括扩展对比度的行为，同时通过逐点映射在构造上保留边缘位置。GC-ART在32 x 32的情况下使用的FLOPs显著少于卷积学习增强器。目前的超参数尚未调优，留有系统改进的空间。
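
To make "endpoint-pinned rational tone curve" and "soft monotonicity penalty" concrete, here is a minimal sketch. The abstract does not give the parameterization, so the two-parameter form below (coefficients `a` and `b`) is an assumption chosen only because it fixes f(0) = 0 and f(1) = 1 by construction. In GC-ART the coefficients would come from the histogram-conditioned MLP, which is not modeled here.

```python
def rational_tone_curve(x, a, b):
    """Second-order rational curve pinned at both endpoints (hypothetical form).

    f(x) = (x + a*x^2) / (1 + b*x + (a - b)*x^2)
    At x = 0 the numerator is 0; at x = 1 numerator and denominator
    both equal 1 + a, so f(1) = 1 whenever the denominator is positive.
    """
    den = 1.0 + b * x + (a - b) * x * x
    if den <= 0.0:
        raise ValueError("denominator must stay positive on [0, 1]")
    return (x + a * x * x) / den


def monotonicity_penalty(a, b, samples=64):
    """Soft penalty: total amount by which the sampled curve decreases."""
    ys = [rational_tone_curve(i / (samples - 1), a, b) for i in range(samples)]
    return sum(max(0.0, prev - nxt) for prev, nxt in zip(ys, ys[1:]))


def apply_curve(pixels, a, b):
    """Apply the curve pointwise to intensities in [0, 1]."""
    return [rational_tone_curve(p, a, b) for p in pixels]


# a = 0, b = -0.5 lifts mid-tones while leaving black and white fixed.
brightened = apply_curve([0.0, 0.25, 0.5, 1.0], a=0.0, b=-0.5)

# a = 0, b = -2 overshoots above 1 before returning to f(1) = 1,
# so the penalty is positive for this setting.
bad_penalty = monotonicity_penalty(0.0, -2.0)
```

Because the mapping acts on each pixel value independently, it cannot move edges spatially, which matches the abstract's point about preserving edge locations by construction.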

cs.CV / 61 / 2605.07334

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

RCoT-Seg：用于视频推理和分割的强化思维链

Wen, Junwei, Miao, Deshui, Lu, Guangming, Li, Xin, Pei, Wenjie

Abstract

Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is proposed to generate and reselect the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates masks with SAM2-based methods across the sequence, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Extensive experimental results demonstrate that the proposed RCoT-Seg achieves favorable performance against the state-of-the-art methods. The code and models will be publicly released at https://github.com/Victor-wjw/RCoT-Seg.

Chinese Translation

视频推理分割（VRS）旨在基于隐含指令对视频中的目标物体进行分割，这些指令传达了人类意图和时间逻辑。现有的基于多模态大语言模型（MLLM）的方法通过简单采样或辅助MLLM选择帧后，使用[SEG]标记预测掩膜，但有限的监督和帧-语言相似性规则往往导致狭窄范围的关键帧选择，从而削弱整体的时间理解，并在复杂的多物体场景中导致脆弱的定位。为了解决这些问题，我们提出了RCoT-Seg，一种将VRS分解为时间视频推理（TVR）和关键帧目标感知（KTP）的视频思维框架，明确将时间推理与空间感知分开。具体而言，在TVR阶段，提出了一种自主关键帧选择模块，该模块以精心策划的思维链（CoT）起始语料库初始化，并在任务对齐奖励下通过GRPO进行优化，以生成和重新选择关键帧，通过自我评估增强时刻定位和时间推理。在KTP阶段，RCoT-Seg对选定帧进行高分辨率分割，并通过基于SAM2的方法在序列中传播掩膜，取代启发式采样和外部选择器，同时提高空间精度和帧间一致性。大量实验结果表明，所提出的RCoT-Seg在性能上优于现有的最先进方法。代码和模型将公开发布在 https://github.com/Victor-wjw/RCoT-Seg。

cs.CV / 62 / 2605.07338

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

ShellfishNet：一种针对海洋软体动物视觉识别的领域特定基准

Zhou, Ziheng, Wang, Yang, Wang, Nan, Wu, Chengliang, Yan, Jun

Abstract

The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.

Chinese Translation

全球贝类生物多样性的下降对沿海生态系统构成了严重威胁。尽管人工智能（AI）技术在自动化生态监测方面展现出潜力，但现有的海洋底栖数据集往往未能适应真实水下环境的复杂性（例如，变化的光照条件和多样的物种姿态），这给视觉模型在实际生态监测中的稳健泛化带来了挑战。为了解决这一问题，我们构建了ShellfishNet，一个专门为现实生态监测约束设计的综合图像基准数据集。该数据集包含8,691幅图像，涵盖32个分类，其中包括一个经过精心挑选的带有描述性标题的子集。数据集通过实地摄影和网络抓取构建，涵盖了来自复杂真实环境的样本。基于这一基准，我们系统地评估了80个代表性的神经网络模型，包括卷积神经网络（CNNs）、视觉变换器（ViTs）、状态空间模型（SSMs）和自监督学习（SSL）方法。此外，我们评估了细粒度视觉分类（FGVC）模型的性能，并调查了几种主流多模态大型语言模型（MLLMs）的图像标题生成能力。同时，我们引入了图像损坏基准测试，以模拟常见的水下退化场景（浑浊、恶劣天气），并评估视觉模型的鲁棒性，从而为野外生态保护提供可靠的决策依据。ShellfishNet致力于为底栖生物的智能监测提供数据基础和模型评估基准。

cs.CV / 63 / 2605.07346

SoLAR: Error-Resilient Streamable Long-Horizon Free-Viewpoint Video Reconstruction with Anchor Activation and Latent Recalibration

SoLAR：具有锚点激活和潜在重校准的错误恢复流式长视距自由视角视频重建

Zhang, Haotian, Mo, Xu, Yu, Yixin, Zhu, Guanhua, Xue, Jian, Xu, Tongda, Wang, Yan, Zhang, Jiaqi, Ma, Siwei, Gao, Wen

Abstract

Free-Viewpoint Video (FVV) has emerged as a cornerstone of next-generation immersive media systems and attracted widespread attention. Previous methods primarily focus on short video sequences and suffer from significant performance degradation when processing long-horizon free-viewpoint video (LFVV). Motivated by bit allocation theory, we analyze dynamic-anchor-based volumetric video representation within a rate-distortion optimization framework and propose SoLAR, which is the first error-resilient streamable FVV framework that maintains stable reconstruction quality on long sequences without requiring group-of-pictures partitioning. We propose the Anchor Activation Dynamics (AAD), which enables dynamic anchors to model non-rigid transformations by dynamically activating informative anchors and suppressing redundant ones. Furthermore, we introduce Latent Discrepancy Aware Recalibration (LaDAR), which is a mechanism to identify discrepancies between latent representations and recalibrate the correspondences encoded in the network, effectively mitigating error propagation in LFVV without compromising real-time performance or storage compactness. Extensive experiments demonstrate that SoLAR achieves state-of-the-art reconstruction performance while maintaining minimum storage overhead, which provides a new direction for LFVV reconstruction and advances the practical deployment of immersive systems. Demo free-viewpoint videos are provided in the supplementary material.

Chinese Translation

自由视角视频（FVV）已成为下一代沉浸式媒体系统的基石，并引起了广泛关注。以往的方法主要集中在短视频序列上，在处理长视距自由视角视频（LFVV）时表现出显著的性能下降。受到比特分配理论的启发，我们在率失真优化框架内分析了基于动态锚点的体积视频表示，并提出了SoLAR，这是第一个错误恢复流式FVV框架，能够在长序列上保持稳定的重建质量，而无需进行图像组分割。我们提出了锚点激活动态（AAD），使动态锚点能够通过动态激活信息锚点并抑制冗余锚点来建模非刚性变换。此外，我们引入了潜在差异感知重校准（LaDAR），这是一种识别潜在表示之间差异并重校准网络中编码对应关系的机制，有效减轻LFVV中的错误传播，而不影响实时性能或存储紧凑性。大量实验表明，SoLAR在保持最低存储开销的同时，实现了最先进的重建性能，为LFVV重建提供了新的方向，并推动了沉浸式系统的实际部署。演示自由视角视频已在补充材料中提供。

cs.CV / 64 / 2605.07351

Disambiguating 2D-3D Correspondences in Gaussian Splatting-based Feature Fields for Visual Localization

基于高斯喷溅特征场的视觉定位中的二维-三维对应关系消歧

Lee, Miso, Hyun, Sangeek, Jeon, Yerim, Heo, Jae-Pil

Abstract

While Gaussian Splatting-based Feature Fields (GSFFs) have shown promise for visual localization, this paper highlights that photometrically optimized GSFFs are inherently ill-suited for 2D-3D matching. The volumetric extent of each Gaussian induces many-to-one pixel-to-point mappings that destabilize PnP-based pose estimation, while photometric optimization gives rise to superfluous Gaussians devoid of multi-view consistency. To address these issues, we propose SplitGS-Loc, a localization-specialized GSFFs construction framework that disambiguates 2D-3D correspondences by exploiting Gaussian attributes. Our key design, Mixture-of-Gaussians-based splitting, decomposes each Gaussian into smaller Gaussians, replacing ambiguous many-to-one with precise one-to-one correspondences. In parallel, we exploit composition weights from GS rasterization to select Gaussians that significantly and consistently contribute across multiple views and aggregate discriminative features through strong pixel-Gaussian associations, enforcing multi-view consistency. The resulting compact yet discriminative feature fields enable stable PnP convergence, achieving state-of-the-art performance on localization benchmarks. Extensive experiments validate that SplitGS-Loc extends the utility of photometric GSFFs to accurate and efficient localization by exploiting Gaussian attributes, without per-scene training or iterative pose refinement.

Chinese Translation

尽管基于高斯喷溅特征场（GSFFs）在视觉定位中展现出良好的前景，但本文强调光度优化的GSFFs本质上不适合二维-三维匹配。每个高斯的体积范围导致了许多对一的像素到点的映射，这使得基于PnP的姿态估计不稳定，而光度优化则产生了缺乏多视图一致性的多余高斯。为了解决这些问题，我们提出了SplitGS-Loc，一个专门用于定位的GSFFs构建框架，通过利用高斯属性来消歧二维-三维对应关系。我们的关键设计，基于高斯混合的分割，将每个高斯分解为更小的高斯，从而将模糊的多对一关系替换为精确的一对一对应。同时，我们利用GS光栅化中的组合权重来选择在多个视图中显著且一致贡献的高斯，并通过强像素-高斯关联聚合判别特征，强制执行多视图一致性。最终生成的紧凑而具有判别性的特征场使得PnP收敛稳定，在定位基准测试中实现了最先进的性能。大量实验验证了SplitGS-Loc通过利用高斯属性扩展了光度GSFFs在准确和高效定位中的应用，无需逐场景训练或迭代姿态优化。

cs.CV / 65 / 2605.07355

TTF: Temporal Token Fusion for Efficient Video-Language Model

TTF：用于高效视频语言模型的时间令牌融合

Huo, Simin, LI, Ning

Abstract

Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at 448×448 resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame, performs a local window similarity search (e.g., 3×3), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only about 0.16 GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at https://github.com/Cominder/ttf.

Chinese Translation

视频语言模型（VLMs）在视频长度增加时面临快速推理成本，因为视觉令牌的数量随之增加。例如，在 448×448 分辨率下，32帧已经产生超过8000个视觉令牌，这使得大语言模型（LLM）的预填充成为主要的吞吐瓶颈。现有方法通常依赖于全局相似性或注意力引导的压缩，导致其收益受到影响。我们提出了时间令牌融合（Temporal Token Fusion, TTF），这是一种无训练、即插即用的预LLM令牌压缩框架，利用视频中的结构化时间冗余。TTF自动选择一个锚帧，然后对每个后续帧执行局部窗口相似性搜索（例如，3×3），融合超过阈值的令牌。压缩后的序列通过坐标重新对齐在预填充和解码过程中保持位置一致性，从而实现与现有VLM管道的无缝集成。在Qwen3-VL-8B上，设定阈值t=0.70，TTF去除了约67%的视觉令牌，同时保留了99.5%的基线准确性，并仅引入了约0.16 GFLOPs的匹配开销。总体而言，TTF为视频理解提供了一种实用、高效的解决方案。代码可在 https://github.com/Cominder/ttf 获取。
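
The local window search described in this abstract can be sketched in a few lines. This is a simplified illustration rather than the released implementation: tokens are plain Python lists, "fusing" is reduced to dropping the redundant token while recording its original grid coordinates, and the function names and the `radius` parameter are hypothetical. Anchor-frame selection and coordinate realignment inside the LLM are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def fuse_frame(anchor, frame, threshold=0.70, radius=1):
    """Return (kept_tokens, fused_count) for one frame compared with the anchor.

    anchor, frame: H x W grids (lists of rows) of token vectors.
    A token counts as fused when its best cosine match inside the
    (2*radius + 1)^2 anchor window exceeds the threshold; radius=1
    gives the 3x3 window mentioned in the abstract.
    kept_tokens holds ((row, col), vector) so original positions survive.
    """
    height, width = len(frame), len(frame[0])
    kept, fused = [], 0
    for r in range(height):
        for c in range(width):
            best = -1.0
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    best = max(best, cosine(frame[r][c], anchor[rr][cc]))
            if best > threshold:
                fused += 1
            else:
                kept.append(((r, c), frame[r][c]))
    return kept, fused


# Toy 2x2 frames with 2-dimensional tokens: only the bottom-right token changes.
anchor = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
frame = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
kept, fused = fuse_frame(anchor, frame)
```

Restricting the search to a small window keeps the matching cost linear in the number of tokens, in contrast to global similarity, which compares every token pair.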

cs.CV / 66 / 2605.07356

UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

UniD-Shift：通过可解释的共享-私有多模态分解实现统一语义分割

Zhang, Shuai, Shi, Zhecheng, Li, Zhuxiao, Ou, Jing, Wang, Tengxi, Liu, Yuan, Zhao, Wufan

Abstract

Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.

Chinese Translation

大规模3D点云的语义分割对于自动驾驶和城市数字双胞胎等应用至关重要。然而，LiDAR的稀疏采样模式和图像观测中的视角依赖几何失真使得跨模态对齐变得复杂，并阻碍了稳定的融合。受到相机捕获的2D图像是3D世界的表现这一事实的启发，我们认识到从2D和3D分割中学习到的特征共享一些共同的语义，而其他方面则保持模态特异性。这一见解促使我们提出一个统一的多模态框架，用于联合2D-3D语义分割。我们结合基于SAM的视觉编码器和基于SPTNet的几何编码器，以提取互补的语义和几何表示。来自两个模态的特征被明确地分解为共享和私有子空间，其中共享成分总结了两个领域共同的语义因素，而私有成分则保留了每个模态特有的属性。一个轻量级的基于注意力的融合模块将共享特征聚合成一致的跨模态表示，而一个正则化的训练目标确保了语义对齐和子空间独立性。在SemanticKITTI和nuScenes基准上的实验表明，相较于代表性的多模态基线，我们的模型在分割精度上实现了一致的提升，并且在计算效率上也具有竞争力。在nuScenes USA-Singapore的跨域评估中，模型在分布变化下表现稳定，展现出强大的泛化能力。实现代码公开可用，地址为：https://github.com/shuaizhang69/UniD-Shift。

cs.CV / 67 / 2605.07359

UniISP: A Unified ISP Framework for Both Human and Machine Vision

UniISP：一个统一的人机视觉图像信号处理框架

Li, Hanxi, Cheng, Yao, Zhang, Bo, Zeng, Li

Abstract

Compared to RGB images, raw sensor data provides a richer representation of information, which is crucial for accurate recognition, particularly under challenging conditions such as low-light environments. The traditional Image Signal Processing (ISP) pipeline generates visually pleasing RGB images for human perception through a series of steps, but some of these operations may adversely impact the information integrity by introducing compression and loss. Furthermore, in computer vision tasks that directly utilize raw camera data, most existing methods integrate minimal ISP processing with downstream networks, yet the resulting images are often difficult to visualize or do not align with human aesthetic preferences. This paper proposes UniISP, a novel ISP framework designed to simultaneously meet the requirements of both human visual perception and computer vision applications. By incorporating a carefully designed Hybrid Attention Module (HAM) and employing supervised learning, the proposed method ensures that the generated images are visually appealing. Additionally, a Feature Adapter module is introduced to effectively propagate informative features from the ISP stage to subsequent downstream networks. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across various scenarios and multiple datasets, proving its generalizability and effectiveness.

Chinese Translation

与RGB图像相比，原始传感器数据提供了更丰富的信息表示，这对于在低光环境等具有挑战性的条件下进行准确识别至关重要。传统的图像信号处理（ISP）管道通过一系列步骤生成视觉上令人愉悦的RGB图像，以供人类感知，但其中一些操作可能会通过引入压缩和损失而对信息完整性产生不利影响。此外，在直接利用原始相机数据的计算机视觉任务中，大多数现有方法将最小的ISP处理与下游网络集成，但所生成的图像往往难以可视化或与人类的审美偏好不一致。本文提出了UniISP，一个新颖的ISP框架，旨在同时满足人类视觉感知和计算机视觉应用的需求。通过引入精心设计的混合注意力模块（Hybrid Attention Module, HAM）并采用监督学习，所提出的方法确保生成的图像在视觉上具有吸引力。此外，引入了特征适配器模块，以有效地将ISP阶段的信息特征传播到后续的下游网络。大量实验表明，我们的方法在各种场景和多个数据集上实现了最先进的性能，证明了其通用性和有效性。

cs.CV / 68 / 2605.07379

RELO: Reinforcement Learning to Localize for Visual Object Tracking

RELO：用于视觉目标跟踪的强化学习定位方法

Chen, Xin, Sun, Chuanyu, Xu, Jiao, Peng, Houwen, Wang, Dong, Lu, Huchuan, Ma, Kede

Abstract

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOT_ext without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.

Chinese Translation

传统的视觉目标跟踪器使用手工设计的空间先验来定位目标，通常以热图的形式呈现。这些先验仅提供替代监督，并且与跟踪优化和评估指标（如交并比（IoU）和成功曲线下面积（AUC））的对齐效果较差。在此，我们引入了RELO，一种用于视觉目标跟踪的强化学习定位方法，将目标定位公式化为马尔可夫决策过程。具体而言，RELO用通过强化学习在空间位置上学习的定位策略替代手工设计的空间先验，奖励结合了帧级IoU和序列级AUC。此外，我们引入了层对齐的时间令牌传播，以提高帧间的语义一致性，且计算开销微乎其微。在多个基准测试中，RELO取得了优异的结果，在LaSOT_ext上达到了57.5%的AUC，且无需模板更新。这证实了对于视觉目标跟踪，基于奖励的定位是基于先验的定位的一种有效替代方案。
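
The two quantities that the reward combines can be computed as follows. This sketch covers only the metrics, not the policy or the Markov decision process. The mixing weight `auc_weight` and the way the sequence-level term is added to every frame are assumptions made for illustration, since the abstract does not say how the two terms are combined.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def success_auc(ious, steps=21):
    """Area under the success curve.

    The success rate at threshold t is the fraction of frames with IoU > t;
    the AUC is the mean success rate over evenly spaced thresholds in [0, 1].
    """
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


def sequence_rewards(pred_boxes, gt_boxes, auc_weight=0.5):
    """Per-frame reward = frame IoU + auc_weight * sequence AUC (hypothetical mix)."""
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    auc = success_auc(ious)
    return [v + auc_weight * auc for v in ious]


# Two-frame toy sequence: one perfect prediction, one shifted by a pixel.
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2)]
predictions = [(0, 0, 2, 2), (1, 1, 3, 3)]
rewards = sequence_rewards(predictions, ground_truth)
```

Rewarding the same quantities that the benchmarks report is what the abstract means by aligning optimization with evaluation, as opposed to supervising a heatmap that is only a proxy for them.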

cs.CV / 69 / 2605.07388

A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization

基于自注意力增强和特征交互优化的海洋机器人海洋垃圾检测框架

Li, Yuyang, Han, Jiashu, Lai, Yinyi, Kang, Wenbin, Liu, Zenghui

Abstract

Marine debris detection for ocean robots is crucial for ecological protection, yet performance is often degraded by low-quality images with blur, complex backgrounds, and small targets. To address these challenges, we propose YOLO-MD, an enhanced YOLO-based detection framework. A Dual-Branch Convolutional Enhanced Self-Attention (DB-CASA) module is designed to strengthen spatial-channel interactions, improving feature representation in degraded images. Additionally, a lightweight shift-based operation is introduced to enhance fine-grained feature extraction for objects of varying scales while maintaining parameter efficiency. We further propose SFG-Loss to mitigate class imbalance and optimization instability via dynamic sample reweighting. Experiments on the UODM dataset demonstrate that YOLO-MD achieves 0.875 precision, 0.822 F1-score, and 0.849 mAP50, outperforming the latest state-of-the-art methods. The effectiveness of this method has also been verified through real-world robotic edge deployment experiments.

Chinese Translation

海洋机器人进行海洋垃圾检测对于生态保护至关重要，但性能常常受到模糊、复杂背景和小目标等低质量图像的影响。为了解决这些挑战，我们提出了YOLO-MD，一个增强的基于YOLO的检测框架。设计了一个双分支卷积增强自注意力（DB-CASA）模块，以增强空间-通道交互，从而改善退化图像中的特征表示。此外，引入了一种轻量级的基于位移的操作，以增强不同尺度物体的细粒度特征提取，同时保持参数效率。我们进一步提出了SFG-Loss，通过动态样本重加权来减轻类别不平衡和优化不稳定性。在UODM数据集上的实验表明，YOLO-MD达到了0.875的精度、0.822的F1-score和0.849的mAP50，超过了最新的最先进方法。该方法的有效性也通过实际的机器人边缘部署实验得到了验证。

cs.CV / 70 / 2605.07390

ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

ST-Gen4D：将4D时空认知嵌入世界模型以实现4D生成

Wang, Haonan, Zhou, Hanyu, Gu, Tao, Yan, Luxin

Abstract

Generative models have succeeded in producing apparently coherent 2D videos, but they remain limited in the physical world because they lack 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro-scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework built on a world model with 4D spatiotemporal cognition. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpt these representations into a global appearance graph and a local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive the future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as a condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose the ST-4D datasets by aggregating public 4D datasets and a self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.

Chinese Translation

生成模型在生成表面上连贯的2D视频方面取得了成功，但由于缺乏4D时空尺度，在物理世界中仍然面临挑战。通常，现有的4D生成模型直接嵌入宏观尺度约束，以增强整体时空一致性。然而，这些方法仅确保全局外观的一致性，未能揭示物理世界的局部动态。我们的见解是，全局外观结构和局部动态拓扑赋予了4D时空认知，从而使得具有时空规律的4D生成成为可能。在本研究中，我们提出了ST-Gen4D，一个基于4D时空认知的世界模型的4D生成框架。我们的模型由四个关键设计指导：1）时空表示。我们将各种模态编码为多个表示作为特征基础。2）时空认知。我们将这些表示雕刻成全局外观图和局部动态图，并通过语义桥接的时空融合将它们融合，以获得4D认知图。3）时空推理。我们利用世界模型基于4D认知推导未来状态。4）时空生成。我们利用推导出的认知作为条件，引导潜在扩散以实现4D高斯生成。通过将4D内在认知与生成先验深度集成，我们的模型保证了4D生成的结构合理性和拓扑一致性。此外，我们通过聚合公共4D数据集和自建子集提出了ST-4D数据集。大量实验表明，我们的ST-Gen4D在3D和4D生成任务中表现优越。

cs.CV / 71 / 2605.07394

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

BalCapRL：一种基于强化学习的多模态大语言模型图像描述的平衡框架

Ye, Shaokai, Saveris, Vasileios, Qian, Yihao, Hu, Jiaming, Amirloo, Elmira, Grasch, Peter

Abstract

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.

Chinese Translation

图像描述是计算机视觉中最基本的任务之一。由于其开放性特征，在多模态大语言模型（MLLMs）时代，它受到了显著关注。为了追求更加详细和准确的描述，近期的研究越来越多地转向强化学习（RL）。然而，现有的描述-强化学习方法和评估指标往往强调狭义的描述质量，导致在描述的核心维度之间产生权衡。例如，面向效用的目标可能会鼓励产生嘈杂、虚构或过长的描述，这虽然能改善下游问答的表现，但却损害了流畅性；而竞技场风格的目标则可能偏向流畅但通用的描述，实用性有限。为了解决这一问题，我们提出了一种更为平衡的强化学习框架，联合优化效用感知的正确性、参考覆盖率和语言质量。为了有效优化所得到的连续多目标奖励公式，我们对连续值的描述奖励应用了GDPO风格的奖励解耦归一化，并显示出其在性能上优于普通的GRPO。此外，我们引入了基于长度的奖励掩蔽，提供了更适合描述的长度惩罚。在LLaVA-1.5-7B和Qwen2.5-VL 3B及7B基础模型上，我们的方法持续提高了描述质量，峰值增益分别达到了+13.6 DCScore、+9.0 CaptionQA和+29.0 CapArena。
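The contrast between normalizing a summed reward (as in vanilla GRPO) and normalizing each reward component separately can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it reads "reward-decoupled normalization" as z-scoring each reward component within a group of sampled captions before summing, and the function names and toy reward values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize a list of numbers within one sampling group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_advantages(rewards):
    """GRPO-style baseline: sum the components, then normalize the totals."""
    return zscore([sum(r) for r in rewards])


def decoupled_advantages(rewards):
    """Assumed decoupled variant: normalize each component, then sum."""
    n_components = len(rewards[0])
    per_component = [zscore([r[k] for r in rewards]) for k in range(n_components)]
    return [sum(col[i] for col in per_component) for i in range(len(rewards))]


# Toy group of two captions scored on two continuous rewards with different
# scales (hypothetical values): correctness in [0, 1], coverage in [0, 10].
group = [[0.9, 1.0], [0.1, 3.0]]
adv_summed = summed_advantages(group)        # the larger-scale reward dominates
adv_decoupled = decoupled_advantages(group)  # both rewards weigh equally
```

In this toy group the summed variant prefers the second caption purely because coverage has the larger numeric range, while the decoupled variant treats the two captions as tied. That scale sensitivity is one reason a per-component normalization may suit continuous multi-objective rewards.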

cs.CV / 72 / 2605.07398

Exposing and Mitigating Temporal Attack in Deepfake Video Detection

揭示和缓解深度伪造视频检测中的时间攻击

Gu, Zheyuan, Shao, Minghao, Wang, Zhen, Wang, Yusong, Xu, Mingkun, Zhang, Shijie, Jiang, Hao

Abstract

While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.

Chinese Translation

尽管时空深度伪造检测器在AUC（曲线下面积）上表现出色，但我们的实验揭示了它们对规避攻击的脆弱性。这些模型往往过度拟合脆弱的时间频谱线索，而不是学习稳健的语义因果关系。为了缓解这一脆弱性，我们提出了SpInShield，一个专门设计的时间谱不变防御框架，旨在将语义运动与可操控的谱伪影解耦。我们提出了一种可学习的谱对抗者，动态合成严重的谱变形，模拟极端攻击场景。通过采用快捷抑制优化策略，SpInShield迫使编码器提取可靠的取证线索，同时清除潜在空间中的不稳定谱统计信息。实验表明，SpInShield在广泛使用的数据集上获得了具有竞争力的性能，并在模拟幅度谱攻击下比最强基线提高了21.30个百分点的AUC。

cs.CV / 73 / 2605.07399

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

GPO-V：通过全球概率优化实现的越狱扩散视觉语言模型

Pan, Yu, Zhang, Andi, Wang, Yi, Yang, Sibei, Wang, Wenjie

Abstract

Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with "Sure, here is"), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.

Chinese Translation

扩散视觉语言模型（dVLMs）基于扩散大型语言模型（dLLMs）的非因果基础，在多模态任务中表现出显著的有效性，突破了传统的自回归生成范式。尽管dVLMs在本质上似乎对传统的越狱策略具有一定的鲁棒性，我们将其归类为固定前缀优化（FPO）（例如，用“当然，这里是”来固定回应），但这种表面的抗性是具有误导性的。我们对dVLMs安全性格局的研究揭示了一种独特的拒绝模式：即时拒绝和渐进拒绝。我们发现，虽然基于FPO的攻击通常因触发后者而失败，但渐进精炼过程本身揭示了一种新的潜在攻击面。为了利用这一脆弱性，我们提出了全球概率优化（GPO），这是一种专门为去噪轨迹的掩蔽扩散模型设计的通用越狱范式。与基于前缀的方法不同，GPO操控全球生成动态，以绕过扩散语言模型中的保护机制。在此基础上，我们引入了GPO-V，这是第一个针对dVLMs量身定制的视觉模态越狱框架。实证结果表明，GPO-V能够产生隐蔽的扰动，并具有卓越的跨模型可转移性，揭示了非顺序生成架构中的一个关键安全漏洞。我们的研究结果强调了迫切需要解决dVLMs中的安全对齐问题。这些结果要求对当前防御范式进行立即和根本的重新评估，以减轻基于扩散生成的独特风险。我们的代码可在以下链接获取：https://anonymous.4open.science/r/GPO-V-0250。

cs.CV / 74 / 2605.07402

InsHuman: Towards Natural and Identity-Preserving Human Insertion

InsHuman：朝着自然且保留身份的人类插入

Li, Jie, Zhang, Shulian, Gao, Yangyang, Li, Wenbo, Zhang, Yulun, Guo, Yong, Chen, Jian

Abstract

Human insertion aims to naturally place specific individuals into a target background. Although existing image editing models may have such ability, they often produce failure cases, including inappropriate human poses in the new background, an inconsistent number of people, and modified facial identity. Moreover, publicly available human datasets often lack full-body portraits and realistic physical interaction between humans and their background. To address these challenges, we propose InsHuman for natural and identity-preserving human insertion. Specifically, we propose Human-Background Adaptive Fusion (HBAF), which detects foreground humans to obtain a binary mask and applies region-aware weighting to align the human regions between predicted and ground-truth latents, ensuring the person's pose, count, and overall appearance are coherently adapted to the target background. We further propose Face-to-Face ID-Preserving (FFIP), which detects and matches faces between the generated image and the source image in terms of face recognition features to enforce identity consistency for each face. In addition, we propose a Bidirectional Data Pairing (BDP) strategy to construct BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments demonstrate that InsHuman achieves significant improvements in generating plausible images while keeping human identity unchanged.

Chinese Translation

人类插入旨在将特定个体自然地放置于目标背景中。尽管现有的图像编辑模型可能具备这种能力，但它们往往会产生失败案例，包括在新背景中不恰当的人体姿势、人数不一致以及面部身份的修改。此外，公开可用的人体数据集通常缺乏全身肖像和人类与其背景之间的真实物理互动。为了解决这些挑战，我们提出了InsHuman，以实现自然且保留身份的人类插入。具体而言，我们提出了人类-背景自适应融合（Human-Background Adaptive Fusion, HBAF），该方法检测前景人类以获取二进制掩模，并应用区域感知加权来对齐预测和真实潜在空间中的人类区域，确保个体的姿势、数量和整体外观与目标背景一致。我们进一步提出了面对面身份保留（Face-to-Face ID-Preserving, FFIP），该方法在生成图像和源图像之间检测和匹配面部特征，以强制每个面部的身份一致性。此外，我们提出了双向数据配对（Bidirectional Data Pairing, BDP）策略，构建了BDP-InsHuman，这是一个具有真实人类-背景互动的高质量数据集。实验表明，InsHuman在生成可信图像的同时保持人类身份不变方面取得了显著的改进。

cs.CV / 75 / 2605.07415

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

ChartREG++：在多样化指称线索和多目标指称下基准测试与改进图表指称表达基础

Niu, Tianhao, Han, Ziyu, Zhu, Qingfu, Che, Wanxiang

Abstract

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

Chinese Translation

指称表达基础是视觉基础中的核心问题，广泛用于诊断视觉与语言模型中的空间基础和推理，然而大多数先前的研究主要集中在自然图像上。相比之下，现有的图表指称表达基础相关基准仍然有限：（1）它们主要采用边界框，限制了对细致图表元素的定位精度；（2）它们大多假设单一和两个被指称目标实例，未能处理多实例目标引用；（3）语言表达过于依赖文本线索或数据排序线索；（4）它们仅覆盖狭窄的图表类型范围。为了解决这些问题，我们引入了一个图表指称表达基础基准，系统地支持多种定位形式、多目标引用、多样化的基础线索和多样化的图表类型。来自代表性多模态大型模型的结果揭示了显著的性能差距。我们进一步引入了一个基于代码的合成管道，利用绘图程序与渲染图表原语之间的固有对齐关系，生成跨图表元素类型和粒度的像素精确实例掩模。我们使用合成的掩模训练了一个实例分割模型，并将其集成到一个通用多模态基础框架中。最终系统在我们的基准上始终优于基线，并且在ChartQA派生的真实图表基础基准上表现良好。

cs.CV / 76 / 2605.07418

Learning Image-Adaptive Scale Fields for Metric Depth Recovery

学习图像自适应尺度场以恢复度量深度

Li, Yuanyan, Althoff, Matthias

Abstract

Monocular depth estimation (MDE) typically produces depth estimations that are defined up to an unknown scale or shift. When only sparse metric anchors are available, recovering accurate metric depth becomes challenging yet necessary for practical applications. We address this problem by formulating metric depth recovery as image-adaptive scale field modeling. Instead of directly correcting the depth, we reformulate the correction as a low-dimensional linear combination of image-adaptive basis maps. These maps are derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations. The weights of basis maps are efficiently determined from sparse metric anchors via a least-squares problem. This formulation yields improved metric depth accuracy, strong robustness under extreme anchor sparsity, and an interpretable decomposition of spatial scale variations. Extensive experiments across multiple datasets and representative MDE models demonstrate the effectiveness and general applicability of our approach.

Chinese Translation

单目深度估计（MDE）通常产生的深度估计是以未知的尺度或偏移为定义的。当仅有稀疏的度量锚点可用时，恢复准确的度量深度变得具有挑战性，但在实际应用中却是必要的。我们通过将度量深度恢复问题表述为图像自适应尺度场建模来解决这一问题。我们不是直接修正深度，而是将修正重新表述为图像自适应基图的低维线性组合。这些基图源自于编码在MDE估计和中间表示中的语义和几何线索。基图的权重通过最小二乘问题从稀疏的度量锚点中有效确定。这种表述提高了度量深度的准确性，在极端锚点稀疏情况下表现出强大的鲁棒性，并提供了空间尺度变化的可解释分解。在多个数据集和代表性MDE模型上的广泛实验表明了我们方法的有效性和普遍适用性。
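The least-squares step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' method: the scale field is taken to be multiplicative (metric depth = scale field × relative depth), the basis maps are hand-picked toy maps instead of maps derived from MDE features, and all function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_scale_weights(basis_maps, rel_depth, anchors):
    """Fit basis weights from sparse anchors given as (row, col, metric_depth)."""
    K = len(basis_maps)
    # Each anchor gives one linear equation: sum_k w_k * B_k * d_rel = d_metric.
    rows = [[B[r][c] * rel_depth[r][c] for B in basis_maps] for r, c, _ in anchors]
    y = [d for _, _, d in anchors]
    # Normal equations (A^T A) w = A^T y of the least-squares problem.
    AtA = [[sum(row[i] * row[j] for row in rows) for j in range(K)] for i in range(K)]
    Aty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(K)]
    return solve_linear(AtA, Aty)


def apply_scale_field(basis_maps, rel_depth, weights):
    """Multiply the relative depth by the weighted sum of basis maps."""
    H, W = len(rel_depth), len(rel_depth[0])
    return [[sum(w * B[r][c] for w, B in zip(weights, basis_maps)) * rel_depth[r][c]
             for c in range(W)] for r in range(H)]


# Toy 2x3 example: a constant basis map and a left-to-right ramp.
basis = [[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
         [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]]
rel = [[1.0, 2.0, 3.0], [1.5, 2.5, 0.5]]
# Three sparse anchors generated with true weights (2.0, 1.0).
anchors = [(0, 0, 2.0), (0, 2, 9.0), (1, 1, 6.25)]
weights = fit_scale_weights(basis, rel, anchors)
metric = apply_scale_field(basis, rel, weights)
```

Because only the handful of weights is estimated, three anchors are already enough to determine the two-map toy model here; the dense metric depth at every other pixel then follows from the fitted scale field.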

cs.CV / 77 / 2605.07429

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

通过扩散框架实现逼真且高效的虚化效果渲染

Shi, Linxiao, Zheng, Siming, Wang, Zerong, Zhang, Hao, Chen, Jinwei, Li, Bo, Chen, Shifeng, Jiang, Peng-Tao

Abstract

Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at https://github.com/vivoCameraResearch/MagicBokeh.

Chinese Translation

现有的移动设备受到紧凑光学设计的限制，例如小光圈，这使得产生自然、光学真实的虚化效果变得困难。尽管最近基于学习的方法显示出良好的效果，但在高数字变焦级别下拍摄的照片仍然面临分辨率降低和细节丢失的问题。一个简单的解决方案是先提升图像质量再进行虚化渲染，但这种两阶段的流程降低了效率，并引入了不必要的误差累积。为克服这些限制，我们提出了MagicBokeh，一个统一的基于扩散的框架，旨在实现高质量和高效的虚化渲染。通过一种替代训练策略和关注感知的掩码注意机制，我们的方法联合优化虚化渲染和超分辨率，显著提高了可控性和视觉真实感。此外，我们引入了降级感知深度模块，以便从低质量输入中实现更准确的深度估计。实验结果表明，MagicBokeh能够高效地产生逼真的虚化效果，特别是在现实世界的低分辨率图像上，为未来虚化渲染的进展铺平了道路。我们的代码和模型可在 https://github.com/vivoCameraResearch/MagicBokeh 获取。

cs.CV / 78 / 2605.07447

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

稀疏自编码器作为对抗攻击检测的即插即用防火墙在视觉语言模型中的应用

Wang, Hao, Sun, Yiqun, Wei, Pengfei, Hsieh, Lawrence B., Kawahara, Daisuke

Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.

Chinese Translation

视觉语言模型（VLMs）发展迅速，越来越多地应用于现实世界的应用中，尤其是在基于代理的系统兴起的背景下。然而，它们的安全性却受到相对有限的关注。即使是最新的专有和开放权重的VLMs仍然高度易受对抗攻击的影响，使得下游应用面临重大风险。在本研究中，我们提出了一种基于稀疏自编码器（SAEs）的新颖轻量级对抗攻击检测框架，称为SAEgis。通过将SAE模块插入预训练的VLM并使用标准重建目标进行训练，我们发现学习到的稀疏潜在特征自然捕捉到与攻击相关的信号。这些特征能够可靠地分类输入图像是否被对抗性扰动，即使对于之前未见过的样本也是如此。大量实验表明，SAEgis在领域内、跨领域和跨攻击设置中均表现出强大的性能，尤其在跨领域泛化方面相比现有基线有显著提升。此外，结合来自多个层的信号进一步提高了鲁棒性和稳定性。据我们所知，这是首个探索SAE作为VLMs中对抗攻击检测的即插即用机制的研究。我们的方法不需要额外的对抗训练，带来的开销最小，并为提高现实世界VLM系统的安全性提供了一种实用的方法。

cs.CV / 79 / 2605.07455

EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing

EditTransfer++：朝着忠实且高效的视觉提示引导图像编辑迈进

Chen, Lan, Mao, Qi, Song, Yiren, Gu, Yuchao, Ma, Siwei

Abstract

Visual-prompt-guided edit transfer aims to learn image transformations directly from example pairs, offering more precise and controllable editing than purely text-driven approaches. However, existing diffusion transformer-based methods often fail to faithfully reproduce the demonstrated edits due to structural mismatches between the task and the backbone, including a pretrained bias toward textual conditioning and inherent stochastic instability during sampling. To bridge this gap, we present EditTransfer++, a framework that combines progressively structured training with an efficient conditioning scheme to improve both visual prompt faithfulness and inference efficiency. We first mitigate textual dominance with a text-decoupled training strategy that removes text conditioning during fine-tuning, compelling the model to infer transformations solely from visual evidence while still supporting optional text guidance at inference. On top of this visually grounded model, a best-worst contrastive refinement mechanism reshapes the denoising trajectories to suppress unfaithful generations and improve consistency across random seeds. To alleviate the computational bottleneck of high-resolution in-context editing, we further introduce a condition compression and reuse strategy that reduces token redundancy and enables efficient generation of images with a 1024-pixel long edge. Extensive experiments on existing benchmarks and the proposed EditTransfer-Bench show that EditTransfer++ achieves state-of-the-art visual prompt faithfulness with substantially faster inference than prior methods, suggesting a promising direction for scalable prompt-guided image editing and broader visual in-context learning.

Chinese Translation

视觉提示引导的编辑迁移旨在直接从示例对中学习图像变换，提供比纯文本驱动方法更精确和可控的编辑。然而，现有基于扩散变换器的方法常常由于任务与主干之间的结构不匹配而无法忠实地再现所示的编辑，包括对文本条件的预训练偏见以及在采样过程中固有的随机不稳定性。为了解决这一问题，我们提出了EditTransfer++，一个结合渐进结构训练与高效条件方案的框架，以提高视觉提示的忠实性和推理效率。我们首先通过一种文本解耦的训练策略来减轻文本主导性，该策略在微调过程中去除了文本条件，迫使模型仅从视觉证据中推断变换，同时在推理时仍支持可选的文本指导。在此视觉基础模型之上，最佳-最差对比精炼机制重塑去噪轨迹，以抑制不忠实生成并提高随机种子间的一致性。为了缓解高分辨率上下文编辑的计算瓶颈，我们进一步引入了一种条件压缩和重用策略，减少了标记冗余，并实现了高效生成长边为1024像素的图像。在现有基准和提出的EditTransfer-Bench上的大量实验表明，EditTransfer++在视觉提示忠实性方面达到了最先进的水平，并且推理速度显著快于之前的方法，表明这是一个可扩展的提示引导图像编辑和更广泛的视觉上下文学习的有希望的方向。

cs.CV / 80 / 2605.07457

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

EditRefiner：一个与人类对齐的图像编辑精炼框架

Xu, Zitong, Duan, Huiyu, Nie, Yifei, Du, Mingda, Wu, Sijing, Min, Xiongkuo, Zheng, Tianyi, Zhang, Jian, Xu, Shusong, Chen, Jinwei, Li, Bo, Zhai, Guangtao

Abstract

Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnostic accuracy, and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.

Chinese Translation

近年来，基于文本的图像编辑（TIE）模型取得了显著进展，但编辑后的图像仍然常常面临细微问题，例如不自然的物体、光照不匹配和意外变化。现有的精炼方法要么依赖于昂贵的迭代再生，要么使用空间基础较弱的视觉-语言模型（VLMs），这往往导致语义漂移和不可靠的局部修正。为了解决这些局限性，我们首先构建了EditFHF-15K，这是一个包含细粒度人类反馈的编辑图像数据集，包括（1）来自12个TIE模型的15K图像，涵盖43个编辑任务，（2）60K注释的伪影区域和80K编辑失败区域，每个区域都附有文本推理，以及（3）45K均值意见分数（MOSs），评估感知质量、指令遵循和视觉一致性。基于EditFHF-15K，我们提出了EditRefiner，这是一个分层的、可解释的、与人类对齐的代理框架，将后编辑修正重新构思为一个类人感知-推理-行动-评估循环。具体而言，我们引入了：（1）一个感知代理，检测伪影和编辑失败的上下文显著性图，（2）一个推理代理，解释这些感知线索以执行与人类对齐的诊断推理，（3）一个行动代理，利用推理输出规划和执行局部再编辑，以及（4）一个评估代理，评估再编辑的图像并指导行动代理判断是否需要进一步的精炼。大量实验表明，EditRefiner在失真定位、诊断准确性和人类感知对齐方面始终优于最先进的方法，建立了自我修正和感知可靠的图像编辑的新范式。代码可在 https://github.com/IntMeGroup/EditRefiner 获取。

cs.CV / 81 / 2605.07466

A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images

超声图像中脂肪胰腺的检测与分类统一框架

Anghel, Ioan-Tudor-Alexandru, Ceausescu, Ciprian-Mihai, Nedelcu, Elena Dana, Stirban, Elena Raluca, Croitoru, Camelia, Ungureanu, Despina, Palan, Ana Maria, Pop, Gabriela

Abstract

Non-alcoholic fatty pancreas disease (NAFPD) is an underdiagnosed condition associated with metabolic syndrome, insulin resistance, and increased risk of pancreatic cancer. Diagnosis typically relies on subjective visual assessment of ultrasound images by clinicians. We propose an end-to-end framework for automatically classifying normal versus fatty pancreas from abdominal ultrasound images. Our method employs a TransUNet-based segmentation architecture with a ResNet encoder and transformer bottleneck to delineate the pancreas and the splenic vein, followed by anatomically-guided patch extraction and patient-level classification through pairwise texture comparison. The feature engineering mimics clinical reasoning by comparing the echogenicity of peri-venous fat to the pancreatic parenchyma, providing an interpretable signal for classification. The segmentation models are initialized via domain-specific transfer learning from a liver segmentation task. We validate the full pipeline on a clinical dataset of 214 abdominal ultrasound images with 107 expert-labeled cases using 5-fold cross-validation. SVM with RBF kernel achieves a mean cross-validated accuracy of 89.7% ± 1.8% and F1 of 0.898 ± 0.019, while the unsupervised K-Means baseline reaches 87.8% accuracy, demonstrating that the proposed features capture the relevant clinical signal even without labeled training data. To our knowledge, this is the first end-to-end automated framework for fatty pancreas classification from ultrasound using segmentation-guided texture analysis.

Chinese Translation

非酒精性脂肪胰腺病（NAFPD）是一种与代谢综合征、胰岛素抵抗及胰腺癌风险增加相关的未被充分诊断的疾病。诊断通常依赖于临床医生对超声图像的主观视觉评估。我们提出了一种端到端框架，用于自动分类正常与脂肪胰腺的腹部超声图像。我们的方法采用基于TransUNet的分割架构，结合ResNet编码器和变换瓶颈，以勾勒胰腺和脾静脉，随后通过解剖引导的图块提取和患者级别分类进行成对纹理比较。特征工程模拟临床推理，通过比较脉管周围脂肪与胰腺实质的回声强度，提供可解释的分类信号。分割模型通过从肝脏分割任务的领域特定迁移学习进行初始化。我们在包含214幅腹部超声图像的临床数据集上验证了整个流程，其中107个案例由专家标注，采用5折交叉验证。使用RBF核的支持向量机（SVM）实现了89.7%±1.8%的平均交叉验证准确率和0.898±0.019的F1值，而无监督的K均值基线达到了87.8%的准确率，证明所提出的特征即使在没有标注训练数据的情况下也能捕捉到相关的临床信号。据我们所知，这是第一个基于分割引导的纹理分析的超声脂肪胰腺分类的端到端自动化框架。
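The idea of comparing peri-venous fat with pancreatic parenchyma can be sketched as a single patient-level feature. This is a hypothetical illustration only: the abstract does not specify the texture descriptors, so mean gray level stands in for echogenicity here, and the function names and pixel values are assumptions.

```python
from statistics import mean


def patch_mean(patch):
    """Mean gray level of a 2D patch given as a list of rows."""
    return mean(v for row in patch for v in row)


def echogenicity_contrast(pancreas_patches, fat_patches):
    """Average pairwise difference in mean intensity (pancreas minus fat)."""
    diffs = [patch_mean(p) - patch_mean(f)
             for p in pancreas_patches for f in fat_patches]
    return mean(diffs)


# Toy patient with one patch per region (hypothetical gray levels).
pancreas = [[[100, 110], [105, 115]]]
fat = [[[150, 150], [160, 160]]]
contrast = echogenicity_contrast(pancreas, fat)
```

A feature of this kind is a scalar per patient, so it could be passed to a classifier such as the SVM or K-Means models mentioned in the abstract; the actual pipeline presumably uses richer texture statistics.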

cs.CV / 82 / 2605.07474

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

ForgeVLA：无语言注释的联邦视觉-语言-动作学习

Zhou, Yuhao, Zhu, Yunpeng, Zhou, Yang, Lyu, Jindi, Lan, Jian, Wang, Zhangyuan, Si, Dan, Seidl, Thomas, Ye, Qing, Lyu, Jiancheng

Abstract

Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.

Chinese Translation

视觉-语言-动作（VLA）模型在通用机器人智能方面具有巨大潜力，但由于获取标注训练数据的高成本，扩展此类模型面临严重瓶颈。幸运的是，部署在各个领域的视觉装备机器人已经产生了大量的视觉-动作对，可以利用这些数据更高效地扩展VLA训练。然而，由于各种限制，这些原始数据无法集中聚合，并且表现出严重的异质性。为了解决这些挑战，本文提出了ForgeVLA，一个联邦VLA训练框架，能够从分布式视觉-动作对中学习VLA模型，而无需集中原始数据或手动注释。具体而言，ForgeVLA中的每个客户端都配备了一个具身指令分类器，该分类器将视觉-动作对映射到预定义的指令集，从而恢复缺失的语言模态，并形成完整的视觉-语言-动作三元组。除了三元组构建之外，我们还识别出视觉-语言特征崩溃是先前联邦VLA研究中被忽视的一个关键挑战。为了缓解这一问题，ForgeVLA结合了客户端的对比规划损失和服务器端的自适应聚合策略，以高效学习任务区分表示。在多个基准测试中的广泛实验表明，ForgeVLA显著优于其他基线，消融研究进一步验证了每个组件的贡献。

cs.CV / 83 / 2605.07477

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

ReasonEdit：通过强化学习实现可解释的图像编辑评估

Chen, Honghua, Xu, Zitong, Duan, Huiyu, Zhang, Xinyun, Min, Xiongkuo, Zhai, Guangtao

Abstract

Recent text-guided image editing (TIE) models have achieved remarkable progress; however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.

Chinese Translation

近年来，文本引导的图像编辑（TIE）模型取得了显著进展，但许多编辑结果仍然存在伪影、意外修改和次优美学等问题。尽管已有多个基准和评估方法被提出，但现有的大多数方法依赖于标量评分，缺乏可解释性。这一局限性主要源于缺乏高质量的TIE解释数据集以及有效的奖励模型来训练可解释的评估器。为了解决这些挑战，我们引入了ReasonEdit-22K，这是第一个结合了22K编辑图像和113K思维链（Chain-of-Thought, CoT）样本的数据集，并包含了130万条人类判断，评估这些解释的逻辑性、准确性和实用性。在此数据集的基础上，我们提出了RE-Reward，这是一种基于多模态大型语言模型（MLLM）的奖励模型，旨在为评估图像编辑中的可解释推理提供与人类对齐的反馈。此外，我们开发了ReasonEdit，该模型使用来自RE-Reward的奖励信号和群体相对策略优化（Group Relative Policy Optimization, GRPO）算法进行训练，以学习可解释的评估模型。大量实验表明，ReasonEdit在与人类偏好的对齐方面表现优越，并在公共基准上展现出强大的泛化能力。此外，它能够生成高质量的可解释评估文本，从而为图像编辑提供更透明和可信的评估。代码可在 https://github.com/IntMeGroup/ReasonEdit 获取。

cs.CV / 84 / 2605.07478

AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models

AudioFace：基于语言辅助的语音驱动面部动画与多模态语言模型

Zheng, Kai, Kang, Zejian, Mao, Rui, Zou, Hongyuan, Fei, Yuanchen, Xu, Xuanyang, Huang, Xiangru

Abstract

Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.

Chinese Translation

语音驱动的面部动画需要声学信号与面部运动之间的准确对应，特别是与发音相关的口部运动。然而，直接将语音音频映射到面部系数往往忽视了语音产生过程中潜在的语言和音位结构。本文提出了AudioFace，一个基于语言辅助的语音驱动混合形状生成框架，将与口部相关的面部系数预测视为一个由语言和发音信息引导的结构化生成问题。我们的方法不仅依赖声学特征，还利用多模态大型语言模型的先验知识，引入文本和音素级别的线索，以桥接语音信号与可解释的面部动作。大量实验表明，AudioFace在多个评估指标上表现优越，验证了基于语言辅助和多模态先验引导的语音驱动面部动画的有效性。

cs.CV / 85 / 2605.07491

Implicit Multi-Camera System Calibration Using Gaussian Processes

基于高斯过程的隐式多摄像头系统标定

De Boi, Ivan, Ribbens, Bart, Golanova, Veronika, Kapov, Ursula, Verspeek, Simon

Abstract

This paper proposes a novel framework for implicit multi-camera system calibration utilizing Gaussian Process (GP) regression. Conventional explicit calibration methods are constrained by rigid mathematical models and struggle with complex, non-linear distortions from unconventional optics, while existing neural network-based implicit approaches are typically data-hungry and lack inherent uncertainty quantification (UQ). Our GP-based model directly learns the complex, non-linear mapping from 2D image coordinates across all cameras to a 3D world coordinate, completely bypassing time-consuming estimation of explicit intrinsic and extrinsic parameters. Moreover, the inherent UQ is critical for transforming a simple 3D point prediction into a verifiable 3D measurement, complete with statistically-sound confidence bounds. To further enhance data efficiency and practical deployment, we integrate Active Learning (AL), which intelligently leverages the GP's predictive uncertainty to strategically guide the acquisition of new calibration data. This approach results in a robust, data-efficient, and reliable calibration solution, proving particularly effective in practical scenarios where collecting extensive calibration data is a dominant constraint. Our experiments show that the uncertainty for the 3D predictions is higher closer to the cameras. The data points in $uv$-coordinate space are more sparse in that region, even though they are not in 3D space. This work is relevant for anyone who is tasked with the calibration of complex multi-camera systems.

Chinese Translation

本文提出了一种利用高斯过程（Gaussian Process, GP）回归进行隐式多摄像头系统标定的新框架。传统的显式标定方法受到严格数学模型的限制，难以处理来自非常规光学的复杂非线性畸变，而现有的基于神经网络的隐式方法通常对数据需求较高，且缺乏固有的不确定性量化（Uncertainty Quantification, UQ）。我们的基于高斯过程的模型直接学习所有摄像头的二维图像坐标与三维世界坐标之间的复杂非线性映射，完全绕过了耗时的显式内参和外参估计。此外，固有的不确定性量化对于将简单的三维点预测转化为可验证的三维测量至关重要，并提供了统计上可靠的置信区间。为了进一步提高数据效率和实际应用，我们整合了主动学习（Active Learning, AL），智能地利用高斯过程的预测不确定性来战略性地指导新标定数据的获取。这种方法产生了一种稳健、数据高效且可靠的标定解决方案，特别适用于在收集大量标定数据成为主要限制的实际场景。我们的实验表明，三维预测的不确定性在靠近摄像头的地方更高。在该区域内，$uv$坐标空间中的数据点相对稀疏，尽管它们在三维空间中并不稀疏。这项工作对任何负责复杂多摄像头系统标定的人都具有重要意义。
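The uncertainty behaviour described in this abstract can be illustrated with a minimal GP regression sketch. This is not the authors' implementation: the squared-exponential kernel, the noise level, the toy calibration points, and the choice of stacking pixel coordinates from two cameras as `(u1, v1, u2, v2)` to predict a single world coordinate are all assumptions made for illustration.

```python
import math

def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two input tuples (assumed kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return var * math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_predict(X, y, x_star, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_star."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(xi, x_star) for xi in X]
    alpha = solve(K, y)       # K^{-1} y
    v = solve(K, k_star)      # K^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

# Hypothetical calibration set: stacked pixel coordinates -> one world coordinate.
X_cal = [(0.0, 0.0, 0.1, 0.0), (0.5, 0.2, 0.6, 0.2),
         (1.0, 0.4, 1.1, 0.4), (1.5, 0.6, 1.6, 0.6)]
z_cal = [2.0, 1.8, 1.6, 1.4]

mean_near, var_near = gp_predict(X_cal, z_cal, (0.5, 0.2, 0.6, 0.2))
mean_far, var_far = gp_predict(X_cal, z_cal, (4.0, 3.0, 4.1, 3.0))
print(var_near, var_far)
```

The predictive variance grows for queries far from any calibration point in image-coordinate space, and this variance is the quantity an active-learning loop would use to decide where to acquire the next calibration sample.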


cs.CV / 86 / 2605.07492

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

文档解析距离解决还有多远？PureDocBench：一个可追溯来源的基准测试，涵盖干净、退化和真实世界环境

Li, Zhiheng, Ma, Zongyang, Chen, Jiaxian, Zhang, Jianing, Su, Zhaolong, Zhang, Yutong, Yu, Zhiyin, Liu, Ruiqi, Lv, Xiaolei, Li, Bo, Gao, Jun, Zhang, Ziqi, Yuan, Chunfeng, Li, Bing, Hu, Weiming

Abstract

The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists, and general-purpose VLMs, we find: (i) document parsing is far from solved: the best model scores only ~74 out of 100, with a 44.6-point gap between the strongest and weakest models; (ii) specialist parsers with <=4B parameters rival or surpass general VLMs that are 5-100x larger, yet formula recognition remains a shared bottleneck where no model exceeds 67% when averaging the formula metric across all three tracks; (iii) general VLMs lose only 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21 for pipeline specialists, producing ranking reversals that make clean-only evaluation misleading for deployment. All data, code, and artifacts are publicly released.

Chinese Translation

过去一年中出现了超过20个开源文档解析模型，但该领域几乎完全依赖于OmniDocBench进行基准测试，这是一个包含1,355页手动标注数据集，其最高分数已饱和在90%以上。我们在OmniDocBench上运行的三阶段审计流程筛选了其21,353个评估者评分的区块，并确认了2,580个错误（12.08%）；结合超过一年的公开可用性，标注质量和污染风险使其排名受到质疑。为了解决这些问题，我们提出了PureDocBench，这是一个程序生成的、可追溯来源的基准测试，它从HTML/CSS渲染文档图像，并从相同来源生成可验证的标注，涵盖10个领域、66个子类别和1,475页，每个版本有三种：干净、数字退化和真实退化（总共4,425幅图像）。在评估涵盖管道专家、端到端专家和通用视觉语言模型（VLM）的40个模型时，我们发现：（i）文档解析远未解决：最佳模型仅得分约74分（满分100分），最强模型与最弱模型之间存在44.6分的差距；（ii）参数不超过4B的专业解析器与5-100倍更大的通用VLM相当或超越，但公式识别仍然是一个共同瓶颈，所有模型在三个轨道上平均公式指标时均未超过67%；（iii）通用VLM在数字/真实退化下仅损失0.99/8.52的总体分数，而管道专家则损失4.90/14.21，导致排名反转，使得仅基于干净数据的评估在部署时具有误导性。所有数据、代码和文献均已公开发布。
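The two derived figures in this abstract (the 12.08% audit error rate and the 4,425-image total) follow directly from the counts it quotes. A short check, using only numbers stated in the abstract; the variable names are illustrative:

```python
# Counts quoted in the abstract.
audited_blocks = 21_353
confirmed_errors = 2_580
pages = 1_475
versions_per_page = 3  # clean, digitally degraded, real-degraded

# Share of audited OmniDocBench blocks confirmed as errors, in percent.
error_rate_pct = round(100 * confirmed_errors / audited_blocks, 2)

# Every page is rendered once per version.
total_images = pages * versions_per_page

print(error_rate_pct, total_images)
```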


cs.CV / 87 / 2605.07494

DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models

DIMoE-适配器：视觉-语言模型中的动态专家演化持续学习

Qin, Mengxin, Zhang, Xiang, Wang, Xi, Wei, Kun, Yang, Xu, Deng, Cheng

Abstract

Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting. To address these challenges, we propose DIMoE-Adapters, a Dynamic Incremental Mixture-of-Experts Adapters framework that introduces a dynamic expert evolution paradigm to balance stability and plasticity. This paradigm is implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE constructs and evolves a sparse expert pool through expert optimization dynamics, improving plasticity while reducing redundant capacity. PGES controls expert utilization based on the pool shaped by SCEE, improving stability across both previously encountered and unseen tasks. Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.

Chinese Translation

持续学习使视觉-语言模型能够积累知识并适应不断变化的任务，而无需从头开始重新训练。然而，在多领域任务增量学习中，大规模领域转移加剧了稳定性与可塑性之间的困境。现有的大多数方法依赖于固定架构和静态分配的参数，这限制了对新领域的适应能力，并加重了灾难性遗忘。为了解决这些挑战，我们提出了DIMoE-适配器，一个动态增量混合专家适配器框架，引入了一种动态专家演化范式，以平衡稳定性和可塑性。该范式通过两个协作组件实现：自校准专家演化（Self-Calibrated Expert Evolution, SCEE）和原型引导专家选择（Prototype-Guided Expert Selection, PGES）。SCEE通过专家优化动态构建和演化一个稀疏专家池，提高可塑性，同时减少冗余容量。PGES根据SCEE塑造的池控制专家的利用，提升了在以前遇到的任务和未见任务中的稳定性。大量实验表明，DIMoE-适配器在各种设置中优于之前的最先进方法。


cs.CV / 88 / 2605.07495

Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing

轻量级无配对智能手机ISP转移与语义伪配对

Cho, Yujin, Armangeon, Flavien, Li, Yanhao

Abstract

Unpaired smartphone ISP is a challenging problem due to the lack of scene and color alignment between RAW and target RGB images. Many existing methods either require paired data or rely heavily on adversarial training, which can become unstable in the unpaired setting. In this work, we present a simple and effective approach developed for the NTIRE 2026 Learned Smartphone ISP Challenge with Unpaired Data. Our method first reconstructs larger images from training patches to recover global context. Then, we extract semantic embeddings with DINOv2, and use fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both image and patch levels. This semantic matching allows us to partially alleviate the unpairedness of the data and build these pseudo input-target pairs. Based on these pseudo pairs, we train a lightweight CNN with only 7K parameters for color rendering. The network is designed to be compact and focus on color transformation rather than structural change, which helps reduce artifacts and improve training stability. Our challenge submission achieves 22.569 PSNR, 0.675 SSIM, and 8.067 $\Delta E$ on the final hidden test set, significantly improving over the baseline and achieving the 3rd best SSIM and $\Delta E$ among all challenge entries. Our code is available at github.com/nuniniyujin/Unpaired-ISP .

Chinese Translation

无配对智能手机ISP是一个具有挑战性的问题，因为RAW图像与目标RGB图像之间缺乏场景和颜色对齐。许多现有方法要么需要配对数据，要么在无配对设置中严重依赖对抗训练，这可能导致不稳定。在本研究中，我们提出了一种简单有效的方法，旨在解决NTIRE 2026无配对数据学习智能手机ISP挑战。我们的方法首先从训练补丁重建更大的图像，以恢复全局上下文。然后，我们使用DINOv2提取语义嵌入，并利用融合的Gromov-Wasserstein（FGW）最优传输在图像和补丁级别之间构建RAW和RGB图像的伪配对。这种语义匹配使我们能够部分缓解数据的无配对性，并构建这些伪输入-目标配对。基于这些伪配对，我们训练了一个仅包含7K参数的轻量级卷积神经网络（CNN）用于颜色渲染。该网络旨在紧凑，并专注于颜色转换而非结构变化，这有助于减少伪影并提高训练稳定性。我们的挑战提交在最终隐藏测试集上达到了22.569 PSNR、0.675 SSIM和8.067 $\Delta E$，显著优于基线，并在所有挑战参赛作品中获得第三好的SSIM和$\Delta E$。我们的代码可在github.com/nuniniyujin/Unpaired-ISP获取。


cs.CV / 89 / 2605.07499

Cloud-top infrared observations reveal the four-dimensional precipitation structure

云顶红外观测揭示四维降水结构

Xu, Tianchi, Ma, Ziqiang, Marinoni, Andrea, He, Yuanpeng, Li, Xiaoqing, Zhao, Chuanfeng, He, Kang, Xu, Jintao, Zhou, Bohan, Zhao, Wenbo, Chen, Haoshuang, Wang, Tun, Wang, Dongdong, Hong, Yang

Abstract

Accurate four-dimensional (4D) precipitation information is essential for understanding the Earth's energy and water cycles, yet remains observationally unresolved at global scales. Conventional theory holds that geostationary infrared observations primarily sense cloud-top properties, with limited sensitivity to sub-cloud precipitation. Here we show that cloud-top infrared measurements nevertheless encode sufficient information to recover the four-dimensional structure of precipitation, revealing a previously unexploited observability of sub-cloud processes. We introduce a physically constrained deep learning framework, 4DPrecipNet, in which a moisture-first constraint requires the latent representation to recover precipitable water vapour, anchoring the model in thermodynamic consistency. By integrating multi-channel infrared radiances with these constraints and radar-derived precipitation profiles, we reconstruct the vertical and temporal evolution of precipitation systems from geostationary orbit. The framework captures deep convective structures and their evolution, with robust performance across large samples and independent radar comparisons. These results demonstrate that sub-cloud precipitation is physically encoded in cloud-top infrared observations, establishing a new pathway for continuous global monitoring of precipitation structure.

Chinese Translation

准确的四维（4D）降水信息对于理解地球的能量和水循环至关重要，但在全球范围内仍然未能得到观测解决。传统理论认为，静止轨道红外观测主要感知云顶特性，对云下降水的敏感性有限。然而，我们展示了云顶红外测量仍然编码了足够的信息，以恢复降水的四维结构，揭示了云下过程的可观测性尚未被充分利用。我们引入了一种物理约束的深度学习框架，4DPrecipNet，其中的“湿度优先”约束要求潜在表示恢复可降水水汽，从而使模型在热力学一致性中得到锚定。通过将多通道红外辐射与这些约束和雷达推导的降水剖面相结合，我们从静止轨道重建降水系统的垂直和时间演变。该框架捕捉到深对流结构及其演变，在大样本和独立雷达比较中表现出强大的性能。这些结果表明，云下降水在云顶红外观测中被物理地编码，为降水结构的全球连续监测建立了一条新路径。


cs.CV / 90 / 2605.07503

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

扩散-APO：面向轨迹的直接偏好对齐用于视频扩散变换器

Zhu, Jingyuan, Chen, Biaolong, Zhang, Le, Zhang, Aixi, Jiang, Hao, Huang, Pipei

Abstract

Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.

Chinese Translation

有效地将大规模视频扩散模型与人类意图对齐，需要一个可扩展且关注轨迹的路径，以弥合训练噪声分布与实际推理轨迹之间的固有差异。尽管现有的范式如直接偏好优化（Direct Preference Optimization, DPO）和组相对策略优化（Group Relative Policy Optimization, GRPO）试图解决这一问题，但它们往往受到依赖于易受偏见的复杂奖励模型或次优时间步采样的限制。在本文中，我们提出了扩散-APO（对齐偏好优化，Aligned Preference Optimization），这是一种关注轨迹的算法，通过将训练噪声与推理时去噪路径同步，从而解决这一不对齐问题，以最大化梯度信号的有效性。为了将这一算法创新转化为实际解决方案，我们引入了一个统一且模块化的强化学习人类反馈（RLHF）框架，该框架集成了在线排名、半在线锚定、离线精炼和蒸馏感知漂移校正。该框架能够在不同数据和计算约束下实现灵活的多阶段偏好对齐，而无需依赖于基于标量奖励的策略梯度。通过广泛的实验，我们证明了扩散-APO在视觉质量和指令遵循方面始终优于标准基线，同时在模型加速过程中有效保持生成的保真度，为可扩展视频扩散对齐提供了一条稳健的端到端路径。


cs.CV / 91 / 2605.07510

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

InterLV-Search：交错多模态智能搜索基准测试

Hou, Bohan, Gu, Jiuning, Guo, Jiayan, Dang, Ronghao, Leng, Sicong, Li, Xin, Song, Xuemeng, Yang, Jianfei

Abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce \textbf{InterLV-Search}, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench

Chinese Translation

现有的多模态智能搜索基准评估多模态搜索和视觉浏览，但视觉证据要么局限于输入，要么被视为答案终点，而不是交错搜索轨迹的一部分。我们引入了 \textbf{InterLV-Search}，这是一个用于交错语言-视觉智能搜索的基准，其中文本和视觉证据被反复用于条件后续搜索。该基准包含2,061个示例，分为三个层次：主动视觉证据寻求、受控离线交错多模态搜索和开放网络交错多模态搜索。除了现有基准外，它还包括涉及在证据搜索过程中比较多个实体的多模态多分支样本。我们通过自动化流程构建了第1层和第2层，并通过机器主导的人类监督开放网络流程构建了第3层。我们进一步提供了InterLV-Agent，用于标准化工具使用、轨迹记录和评估。对专有和开源多模态智能体的实验表明，当前系统距离解决交错多模态搜索仍然相去甚远，最佳模型的整体准确率低于50%，突显了在视觉证据寻求、搜索控制和多模态证据整合方面的挑战。我们在https://github.com/hbhalpha/InterLV-Search-Bench发布了基准数据和评估代码。


cs.CV / 92 / 2605.07512

Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models

视觉-语言模型中的层次双子空间解耦持续学习

Qin, Mengxin, Zhang, Xiang, Wei, Kun, Yang, Xu, Deng, Cheng

Abstract

Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.

Chinese Translation

类增量学习旨在不断获取新知识，同时保留先前学习的信息，从而减轻灾难性遗忘。现有方法主要限制参数更新，但往往忽视了高维空间中的结构特性。从子空间的角度来看，不同任务引起的更新往往位于多个重叠的低秩子空间中，导致跨任务的子空间干扰和严重遗忘。为了解决这一问题，我们提出了HDSD（层次双子空间解耦）框架，用于视觉-语言模型中的持续学习。具体而言，我们引入了一个轻量级特征调制模块（Feature Modulation Module, FMM），该模块明确地将参数空间分解为通用子空间和任务特定子空间。在此设计基础上，我们开发了两个互补组件。首先，通用融合模块（General Fusion Module, GFM）评估跨任务的相对参数变化，并使用自适应阈值捕捉稳定和可转移的知识。其次，层次学习模块（Hierarchical Learning Module, HLM）通过奇异值分解（Singular Value Decomposition, SVD）执行结构化参数分解，并使用缩放机制限制更新在不同子空间尺度内。综合这些设计，减少了子空间干扰和参数漂移。在常规基准上的大量实验表明，HDSD达到了最先进的结果。


cs.CV / 93 / 2605.07545

Implicit Preference Alignment for Human Image Animation

人像动画中的隐式偏好对齐

Wang, Yuanzhi, Ren, Xuhua, Cheng, Jiaxiang, Ma, Bing, Yu, Kai, Zheng, Tianxiang, Lu, Qinglin, Cui, Zhen

Abstract

Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at https://github.com/mdswyz/IPA

Chinese Translation

人像动画已取得显著进展，但生成高保真手部动作仍然是一个持续的挑战，因为其自由度高且运动复杂。尽管基于人类反馈的强化学习，特别是直接偏好优化，提供了一种潜在解决方案，但它需要构建严格的偏好对。然而，由于逐帧不一致性，为动态手部区域策划这样的对是极其昂贵且往往不切实际的。本文提出了隐式偏好对齐（Implicit Preference Alignment, IPA），这是一种数据高效的后训练框架，消除了对配对偏好数据的需求。IPA在隐式奖励最大化的理论基础上，通过最大化自生成高质量样本的可能性，同时惩罚偏离预训练先验的行为，来对齐模型。此外，我们引入了一种手部感知局部优化机制，以明确引导对齐过程朝向手部区域。实验表明，我们的方法有效实现了偏好优化，提升了手部生成质量，同时显著降低了构建偏好数据的门槛。代码已发布在 https://github.com/mdswyz/IPA


cs.CV / 94 / 2605.07549

Probabilistic Object Detection with Conformal Prediction

具有符合预测的概率对象检测

Ries, Christopher, Sbeyti, Moussa Kassem, Bianco, Nicolas, Klein, Nadja

Abstract

Conformal Prediction (CP) is a distribution-free method for constructing prediction sets with marginal finite-sample coverage guarantees, making it a suitable framework for reliable uncertainty quantification in safety-critical object detection. However, object detection introduces structured multi-output predictions, complicating the application of classical CP theory developed for single outputs. In addition, standard, unscaled CP produces fixed-width prediction intervals across inputs, leading to unnecessary width for low-uncertainty predictions. While scaled CP addresses this by adapting the interval width to an input-dependent uncertainty estimate, prior work has neither systematically compared unscaled and scaled CP for multi-class object detection, nor integrated CP with a complementary uncertainty quantification method in this setting. We fill this gap by: (i) applying CP coordinate-wise to bounding box corners with a Bonferroni correction for box-level guarantees; (ii) scaling the resulting intervals using per-prediction aleatoric uncertainty estimates derived from a probabilistic object detector trained with loss attenuation, evaluated in uncalibrated and two calibrated variants; (iii) extending to a two-step pipeline that constructs prediction sets for the class using RAPS and conditions the conformalized bounding boxes on the predicted class set. Across three autonomous driving datasets (KITTI, BDD, CODA), including a cross-domain setting under distribution shift, scaled CP consistently improves interval sharpness over unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores, without sacrificing coverage. Class-wise calibration further improves coverage for both variants with a negligible effect on sharpness. Together, these improvements yield more actionable uncertainty estimates for real-time, real-world object detection.

Chinese Translation

符合预测（Conformal Prediction, CP）是一种无分布的方法，用于构建具有边际有限样本覆盖保证的预测集，使其成为在安全关键的对象检测中进行可靠不确定性量化的合适框架。然而，对象检测引入了结构化的多输出预测，复杂化了为单输出开发的经典 CP 理论的应用。此外，标准的未缩放 CP 在输入上产生固定宽度的预测区间，导致低不确定性预测的宽度不必要。虽然缩放 CP 通过将区间宽度调整为依赖于输入的不确定性估计来解决这一问题，但之前的研究既没有系统地比较多类对象检测中的未缩放和缩放 CP，也没有在这种情况下将 CP 与互补的不确定性量化方法结合起来。我们通过以下方式填补这一空白：(i) 对边界框角进行逐坐标应用 CP，并使用 Bonferroni 校正以确保框级保证；(ii) 使用从经过损失减弱训练的概率对象检测器中得出的每个预测的随机不确定性估计来缩放结果区间，并在未校准和两种校准变体中进行评估；(iii) 扩展为一个两步管道，该管道使用 RAPS 构建类的预测集，并将符合化的边界框条件于预测的类集。在三个自主驾驶数据集（KITTI、BDD、CODA）中，包括在分布转移下的跨域设置，缩放 CP 一直在区间锐度上优于未缩放 CP，达到高达 19% 的 IoU 提升和 39% 的区间评分降低，而不牺牲覆盖率。类级校准进一步改善了两种变体的覆盖率，对锐度的影响微乎其微。这些改进共同为实时、真实世界的对象检测提供了更具可操作性的不确定性估计。
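Steps (i) and (ii) of this abstract can be sketched as split conformal prediction applied per box coordinate. This is a minimal illustration, not the authors' code: the absolute-residual nonconformity score, the function names, and the `(x1, y1, x2, y2)` layout are assumptions, and the RAPS class-set step (iii) is not covered.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample quantile used by split conformal prediction."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def calibrate_box(cal_pred, cal_true, alpha=0.1, cal_sigma=None):
    """One quantile per box coordinate (x1, y1, x2, y2).

    Bonferroni correction: each of the 4 coordinates is calibrated at
    alpha / 4 so the whole box is covered with probability >= 1 - alpha.
    If cal_sigma is given, residuals are divided by the per-prediction
    uncertainty estimate (scaled CP); otherwise raw residuals are used.
    """
    n_coords = 4
    quantiles = []
    for c in range(n_coords):
        scores = []
        for i in range(len(cal_pred)):
            r = abs(cal_true[i][c] - cal_pred[i][c])
            if cal_sigma is not None:
                r /= cal_sigma[i][c]
            scores.append(r)
        quantiles.append(conformal_quantile(scores, alpha / n_coords))
    return quantiles

def predict_intervals(pred, quantiles, sigma=None):
    """Interval per coordinate: fixed width (unscaled) or width * sigma (scaled)."""
    out = []
    for c in range(4):
        half = quantiles[c] * (sigma[c] if sigma is not None else 1.0)
        out.append((pred[c] - half, pred[c] + half))
    return out
```

With unscaled CP every test box receives the same half-width per coordinate. With scaled CP the half-width is multiplied by the detector's uncertainty estimate for that prediction, so confident predictions get tighter intervals, which is the sharpness gain the abstract reports.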


cs.CV / 95 / 2605.07550

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

注意差距：来自不连续视角的几何准确生成重建

Wilczynski, Grzegorz, Zielinski, Mikołaj, Świrta, Bartosz, Belter, Dominik, Spurek, Przemysław

Abstract

3D vision systems are fundamentally constrained by their reliance on visual overlap: reconstruction methods require it for geometric alignment, while generative models use it to enforce multi-view consistency. This limitation is particularly acute in real-world scenarios such as distributed swarm robotics or crowd-sourced data collection, where capturing overlapping perspectives, both in terms of spatial and appearance overlap, is often impossible. We introduce Generative Reconstruction from Disjoint Views as a new paradigm, establish a comprehensive dataset, and propose specialized evaluation metrics for zero-overlap scenarios. Our benchmarking demonstrates that existing state-of-the-art methods fail catastrophically on this task, producing disconnected geometries or semantically incoherent reconstructions. To address these limitations, we propose GLADOS, a general, modular framework that operates through three stages: (1) Generative Bridging, where foundation models synthesize intermediate perspectives to connect disjoint inputs; (2) Robust Coarse 3D Reconstruction, which establishes a coarse geometric scaffold via global alignment that absorbs local contradictions from the generative process; and (3) Iterative Context Expansion and Consistency Optimization to fill missing regions and unify the reconstruction. As an architecture-agnostic framework, GLADOS enables seamless integration of future advances in generation, reconstruction, and inpainting. The source code is available at: https://github.com/gwilczynski95/GLADOS.

Chinese Translation

3D视觉系统在根本上受到视觉重叠的限制：重建方法需要它进行几何对齐，而生成模型则利用它来强制多视角一致性。这一限制在现实世界场景中尤为明显，例如分布式群体机器人或众包数据收集，在这些场景中，捕捉空间和外观重叠的重叠视角往往是不可能的。我们提出了来自不连续视角的生成重建作为一种新范式，建立了一个全面的数据集，并为零重叠场景提出了专门的评估指标。我们的基准测试表明，现有的最先进方法在这一任务上表现惨败，产生了不连通的几何形状或语义不一致的重建。为了解决这些局限性，我们提出了GLADOS，一个通用的模块化框架，通过三个阶段进行操作：(1) 生成桥接，在此阶段基础模型合成中间视角以连接不连续输入；(2) 稳健的粗略3D重建，通过全局对齐建立粗略几何框架，吸收生成过程中的局部矛盾；(3) 迭代上下文扩展和一致性优化，以填补缺失区域并统一重建。作为一个与架构无关的框架，GLADOS使未来在生成、重建和修复方面的进展能够无缝集成。源代码可在以下链接获取：https://github.com/gwilczynski95/GLADOS。


cs.CV / 96 / 2605.07552

VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

VIMCAN：基于混合Mamba交叉注意力网络的视觉-惯性三维人体姿态估计

Yang, Zepeng, Bai, Junxuan, Li, Hao, Dai, Ju, Pan, Junjun, Yin, Yongfeng, Li, Bin

Abstract

The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available on GitHub.

Chinese Translation

深度学习的快速进展显著提高了多模态三维人体姿态估计（HPE）的准确性。然而，当前最先进的（SOTA）HPE流程仍依赖于变换器（Transformers），其二次复杂度使得长序列的实时处理变得不切实际。Mamba通过选择性状态空间建模解决了这一问题，使得在不牺牲表示能力的情况下实现高效的序列处理。然而，它在多模态环境中捕捉复杂空间依赖性方面仍然存在困难。为了解决这一问题，我们提出了VIMCAN，一种混合架构，将Mamba的高效序列建模与交叉注意力（Cross-Attention）的空间推理相结合，实现了RGB关键点与可穿戴IMU数据之间的稳健视觉-惯性融合和人体姿态估计。通过利用Mamba的动态参数化进行时间建模和注意力机制提取空间依赖性，VIMCAN实现了卓越的准确性，在TotalCapture数据集上的每个关节位置平均误差（MPJPE）为17.2毫米，在3DPW数据集上的MPJPE为45.3毫米。VIMCAN在支持实时推理的同时，超过60帧每秒的速度优于之前基于变换器的和其他SOTA方法。源代码可在GitHub上获取。
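For readers unfamiliar with the metric quoted in this abstract, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single frame. Benchmarks usually also average over frames and may align the root joint first; those protocol details are not stated in the abstract and are left out here.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error for one frame.

    pred, gt: sequences of (x, y, z) joint positions in the same units
    (millimetres for the figures quoted in the abstract).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Hypothetical two-joint frame: one joint is exact, one is off by 5 units.
example = mpjpe([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
                [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
print(example)
```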


cs.CV / 97 / 2605.07556

Dynamic Mode Decomposition along Depth in Vision Transformers

视觉变换器中的深度动态模式分解

Aswani, Nishant Suresh, Jabari, Saif Eddin

Abstract

Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately \textit{autonomous linear} dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^p$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^p$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with minimal calibration data, and across tokens, \texttt{cls} is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.

Chinese Translation

近期研究表明，连续的视觉变换器（ViT）模块（a）可以被线性映射替代，并且（b）组织成递归计算阶段。我们探讨这些观察是否一致：ViT 的深度是否实现了大约的 \textit{自主线性} 动力学，是否允许在连续范围内重复应用单一算子 $K$？我们使用动态模式分解（DMD）进行测试，该方法从选定的连续隐藏状态对中拟合 $K$，并通过 $K^p$ 预测 $p$ 步的结果。在四个预训练的 DINO ViT 上，我们研究了稳定拟合所需的正则化、秩和校准预算。对于短范围（$p \leq 4$），$K^p$ 将一个不受约束的端点映射追踪到 DINOv3-H/16+ 的 $0.02$ 余弦相似度以内，同时在每个跳过的模块中恢复中间激活。在早期切割开始时，拟合的算子压缩到秩 $\ll d$，且所需的校准数据最小；在各个 token 中，\texttt{cls} 对线性化最为适应；这两个特性随着深度单调衰减。然而，这种局部保真度并未向下游转移。在最终隐藏状态中，经过剩余模块传播后，恒等基线变得具有竞争力。
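The fit-and-roll-out step described in this abstract can be written compactly. The ridge-regularized least-squares form below is a common way to fit a DMD operator and is an assumption here: the abstract says regularization and rank are studied but does not state the estimator. The symbols $h_\ell$ (hidden state after block $\ell$), $s$ (first block of the span), and $\lambda$ (ridge strength) are introduced for illustration. The columns of $X$ are hidden states taken inside the span, and the columns of $Y$ are the same tokens one block later.

$$
K \;=\; \arg\min_{K'} \; \lVert Y - K'X \rVert_F^2 + \lambda \lVert K' \rVert_F^2 \;=\; Y X^\top \left( X X^\top + \lambda I \right)^{-1},
\qquad
\hat{h}_{s+p} \;=\; K^{p}\, h_s .
$$

Under this reading, the "unconstrained endpoint map" in the abstract would be a separate linear map fitted directly from $h_s$ to $h_{s+p}$, against which $K^p$ is compared.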


cs.CV / 98 / 2605.07561

Multimodal Stepwise Clinically-Guided Attention Learning for Pathological Complete Response Prediction in Breast Cancer

多模态逐步临床引导注意力学习用于乳腺癌病理完全反应预测

Caragliano, Alice Natalina, Guarrasi, Valerio, Gravina, Michela, Sansone, Carlo, Soda, Paolo

Abstract

Pathological complete response (pCR) is a key prognostic factor in breast cancer patients undergoing neoadjuvant therapy, strongly associated with long-term survival and treatment personalization. However, accurate pre-treatment pCR prediction remains challenging due to severe class imbalance and limited generalizability across diverse clinical settings. In this work, we propose a multimodal stepwise clinically-guided attention learning framework for pCR prediction from breast magnetic resonance imaging (MRI), designed to address these limitations through medically grounded spatial guidance and multimodal integration. The approach follows a stepwise training strategy inspired by physician reasoning: the model first learns global discriminative imaging patterns, then attention mechanisms are introduced to constrain the network toward tumor regions, and finally clinical variables are integrated to refine decision-making. This guidance strategy encourages prioritization of task-relevant features, improving identification of responders despite their limited representation in the dataset. Moreover, grounding attention in anatomically consistent tumor regions reduces reliance on dataset-specific patterns, thereby enhancing cross-institutional generalization. The framework is evaluated through external validation across heterogeneous MRI cohorts. Compared to non-guided single-stage baselines, the proposed approach improves sensitivity while maintaining competitive specificity, and produces anatomically coherent attention maps that support interpretation of the model's predictions. These findings highlight the potential of clinically-guided multimodal attention learning for robust and generalizable pCR prediction in breast cancer.

Chinese Translation

病理完全反应（pCR）是接受新辅助治疗的乳腺癌患者的关键预后因素，与长期生存和治疗个性化密切相关。然而，由于严重的类别不平衡和在不同临床环境中的有限泛化能力，准确的治疗前pCR预测仍然具有挑战性。在本研究中，我们提出了一种多模态逐步临床引导注意力学习框架，用于从乳腺磁共振成像（MRI）中预测pCR，旨在通过医学基础的空间引导和多模态整合来解决这些限制。该方法遵循一种受医生推理启发的逐步训练策略：模型首先学习全局判别成像模式，然后引入注意力机制以约束网络关注肿瘤区域，最后整合临床变量以优化决策。这种引导策略鼓励优先考虑与任务相关的特征，提高了对响应者的识别，尽管他们在数据集中代表性有限。此外，将注意力基于解剖一致的肿瘤区域，减少了对特定数据集模式的依赖，从而增强了跨机构的泛化能力。该框架通过在异质MRI队列中的外部验证进行评估。与非引导的单阶段基线相比，所提出的方法提高了敏感性，同时保持了竞争性的特异性，并生成了解剖一致的注意力图，支持对模型预测的解释。这些发现突显了临床引导的多模态注意力学习在乳腺癌中进行稳健且可泛化的pCR预测的潜力。


cs.CV / 99 / 2605.07562

Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

超越GSD作为标记：遥感视觉语言模型的连续尺度条件化

Zhang, Song, Chen, Yanlong, Li, Yilin, Chen, Yining, Yi, Zili, Zhang, Xiaowei, Li, Yawei

Abstract

Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.

Chinese Translation

遥感视觉语言模型（RS-VLMs）面临与自然图像对应物的根本不匹配：同一地理对象在跨越多个数量级的地面采样距离（GSD）下表现出截然不同的视觉证据。然而，现有的RS-VLMs通常会忽略GSD或将其作为离散文本标记注入，迫使单一静态参数集吸收整个尺度谱。我们提出了ScaleEarth，这是一个基于Qwen3-VL构建的参数高效微调框架，将GSD视为一个连续的条件变量，控制模型的计算路径。其核心是CS-HLoRA（连续尺度条件化超低秩适配器），通过GSD驱动的门控调节LoRA低秩子空间，使模型能够根据物理尺度动态路由计算。为了在部署时消除对传感器元数据的依赖，我们将CS-HLoRA与SSE-U配对，后者是一个轻量级异方差子头，可以从视觉特征中预测GSD及其不确定性。为了提供匹配的监督，我们构建了GeoScale-VQA，这是一个包含150万样本的尺度分层RS-VQA语料库，其问题-答案生成依赖于驱动CS-HLoRA的相同物理标量，形成一个闭合的方法-数据循环。在8B主干上使用QLoRA进行训练后，ScaleEarth在涵盖多样化地球系统任务的遥感基准测试中取得了最先进的结果，包括XLRS-Bench和OmniEarth-Bench。
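One way to read "a GSD-driven gate modulating the LoRA low-rank subspace" is a per-rank scaling of the low-rank code before it is projected back. The sketch below is a hypothetical illustration of that idea only. The gate form (a sigmoid of an affine function of log10 GSD), the function names, and the toy weights are all assumptions; the abstract does not give the actual CS-HLoRA formulation.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gsd_gate(gsd_m, w, b):
    """Per-rank gate in (0, 1) driven by log10 of the GSD in metres (assumed form)."""
    s = math.log10(gsd_m)
    return [1.0 / (1.0 + math.exp(-(wi * s + bi))) for wi, bi in zip(w, b)]

def gated_lora_forward(x, W, A, B, gate):
    """y = W x + B (gate * (A x)): frozen base weight plus a gated low-rank update."""
    base = matvec(W, x)
    z = [g * a for g, a in zip(gate, matvec(A, x))]  # rank-r code, scaled per rank
    delta = matvec(B, z)
    return [u + d for u, d in zip(base, delta)]

# Toy weights (hypothetical): d = 2, rank r = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5, 0.0], [0.0, 0.5]]
w_gate, b_gate = [2.0, -2.0], [0.0, 0.0]
x = [1.0, 1.0]

y_fine = gated_lora_forward(x, W, A, B, gsd_gate(0.1, w_gate, b_gate))     # 0.1 m/pixel
y_coarse = gated_lora_forward(x, W, A, B, gsd_gate(10.0, w_gate, b_gate))  # 10 m/pixel
print(y_fine, y_coarse)
```

The same input produces different outputs at different ground sampling distances because the gate reweights which rank directions of the adapter are active. This contrasts with injecting GSD as a text token, where a single static adapter must serve every scale.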


cs.CV / 100 / 2605.07568

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

追踪时间之箭：诊断视频大语言模型中的时间信息流

Han, Peitao, Cheng, Fei, Pereira, Lis K., Liu, Qianying, Kitazawa, Shigeru

Abstract

The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1\% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.

Chinese Translation

时间之箭（Arrow-of-Time, AoT）任务是通过识别时间不可逆性来判断视频是正向播放还是反向播放，这一任务人类几乎可以以完美的准确度解决，而前沿的视频大语言模型（Video-LLMs）仅能略高于随机猜测的水平。这一差距引发了一个关键问题：视觉主干是否未能编码时间信息，还是信息瓶颈存在于视频大语言模型架构的其他地方？我们通过将视觉编码器与视频大语言模型隔离，并追踪编码器、投影器和大语言模型中的时间信息来解决这一问题。我们发现，具有显式时间建模的视频中心编码器能够编码强烈的时间信号，而帧中心编码器则无法做到。然而，当视频中心表示通过标准的视频大语言模型架构时，性能往往会崩溃，揭示了时间信息流的瓶颈。我们确定投影器设计是一个关键因素：Q-Former会干扰时间信息，而时间保留的多层感知器（MLP）投影显著改善了大语言模型对这些信息的访问。我们的逐层分析进一步显示了编码器层之间的时间表示动态。基于这些发现，我们构建了一个具有时间感知的视频中心编码器、时间保留投影器和时间之箭监督的视频大语言模型，在AoT$_{PPB}$任务上以98.1\%的准确率超越了人类表现，并在VITATECS-Direction和TVBench的更广泛时间推理任务上分别提高了6.0分和1.3分。我们的结果表明，视频大语言模型中的时间推理需要有效的时间编码和可靠的信息传递。

cs.CV / 101 / 2605.07574

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

PolarVLM：弥合视觉-语言模型中的语义-物理差距

Li, Yuliang, Zhou, Chu, Guo, Heng, Shi, Boxin, Sato, Imari, Ma, Zhanyu

Abstract

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.

Chinese Translation

主流的视觉-语言模型（VLMs）在处理严重的光学歧义（如反射和透明物体）时，根本上面临着标准RGB输入的固有限制。尽管偏振成像能够捕捉解决这些歧义的偏振物理参数，但现有方法受限于固定格式的输出，且与开放式推理相隔离。为了弥合这一语义-物理差距，我们提出了PolarVLM，这是第一个将偏振物理参数集成到VLM中的多模态框架。通过采用双流架构和渐进式的两阶段训练策略，PolarVLM有效防止了物理误解，同时保持了通用的视觉能力。为了补充我们的架构，我们构建了PolarVQA，这是第一个针对偏振感知视觉问答（VQA）的基准，包含75K个基于物理的指令调优对，专注于反射和透明场景。实验表明，PolarVLM在五个评估任务中整体超越了RGB基线25.4%，在反射识别和玻璃计数方面分别取得了显著的26.6%和34.0%的提升，成功解锁了基于物理的语义理解。

cs.CV / 102 / 2605.07575

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Response-G1：用于主动流媒体视频理解的显式场景图建模

Ma, Ke, Tang, Jiaqi, Guo, Bin, Han, Xueting, Xu, Ruonan, He, Qingfeng, Wang, Ziheng, Wang, Xu, Chen, Qifeng, Yu, Zhiwen, Liu, Yunhao

Abstract

Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.

Chinese Translation

主动流媒体视频理解要求视频大语言模型（Video-LLMs）在视频展开时决定何时响应，而现有方法由于其隐式的、与查询无关的视觉证据建模，往往难以胜任这一任务。我们提出了Response-G1，一个新颖的框架，通过场景图在累积的视频证据与查询的预期响应条件之间建立显式的结构化对齐。该框架在三个无需微调的阶段中运行：(1) 从流媒体剪辑中在线生成查询引导的场景图；(2) 基于记忆检索最语义相关的历史场景图；(3) 增强检索的触发提示，用于每帧的“沉默/响应”决策。通过在共享图表示中将证据和条件进行基础化，Response-G1 实现了更具可解释性和准确性的响应时机决策。在已建立的基准测试上的实验结果表明，我们的方法在主动和反应任务中均表现出优越性，验证了显式场景图建模和检索在流媒体视频理解中的优势。

cs.CV / 103 / 2605.07590

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

超越防御：用于内在3D点云鲁棒性的流形对齐正则化

Alonso, Pedro, Li, Chongshou, Li, Tianrui

Abstract

Despite extensive progress in point cloud robustness, existing methods primarily improve performance through augmentation or defense mechanisms, while overlooking the geometric root cause of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface: small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness across multiple adversarial attacks, with average robustness gains of +20.02% on ModelNet40 and +8.58% on ScanObjectNN.
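
As a rough illustration of what a prediction-consistency loss can look like, the sketch below computes a symmetric KL divergence between the class distributions predicted for a point cloud and for a perturbed copy of it. This is an assumption for illustration only: the abstract does not state which divergence MAPR uses, and the function names here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def consistency_loss(logits_clean, logits_perturbed):
    """Symmetric KL between predictions on the original and the perturbed input.

    Hypothetical stand-in for a consistency term; the paper's exact loss
    is not given in the abstract.
    """
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Identical predictions incur no penalty; diverging predictions are penalized.
same = consistency_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = consistency_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A symmetric form is chosen here so that neither the clean nor the perturbed prediction is treated as the fixed target; a one-sided KL with a stop-gradient on the clean branch would be an equally plausible design.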

Chinese Translation

尽管在点云鲁棒性方面取得了广泛进展，现有方法主要通过增强或防御机制来提高性能，却忽视了对抗脆弱性的几何根本原因。我们假设，3D网络中的对抗脆弱性源于模型学习的潜在几何与基础表面的内在几何之间的流形错位。沿输入流形的小型几何保持扰动往往会在特征空间中引发不成比例的扭曲，揭示潜在几何与内在几何之间的错位。我们通过开发3D鲁棒性的几何解释来形式化这一现象，将经典的对抗理论与点云的内在结构联系起来。基于这一分析，我们引入了流形对齐点识别（Manifold-Aligned Point Recognition, MAPR）框架，该框架通过在内在扰动之间对齐预测来正则化潜在几何。MAPR通过捕捉局部曲率和扩散结构的内在特征来增强每个点云，并应用保持对内在、几何保持扰动不变的一致性损失。在不依赖对抗训练或额外数据的情况下，MAPR在ModelNet40和ScanObjectNN数据集上针对多种对抗攻击持续提高鲁棒性，分别实现了+20.02%和+8.58%的平均鲁棒性提升。

cs.CV / 104 / 2605.07593

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

TraceAV-Bench：长音视频多跳轨迹推理的基准测试

Feng, Hengyi, Liang, Hao, Chen, Mingrui, Zeng, Bohan, Qiang, Meiyi, Zhao, Zhengyang, Meng, Zimo, Sheng, Zeang, Zhang, Wentao

Abstract

Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.

Chinese Translation

现实世界的音视频理解需要将稀疏、时间上分散且分布在视觉和听觉流中的证据进行链式推理，而现有的基准测试在很大程度上未能评估这一能力。它们将视频限制为短片段，孤立模态，或将问题简化为单跳感知。我们提出了TraceAV-Bench，这是第一个共同评估长音视频轨迹上的多跳推理和多模态幻觉鲁棒性的基准测试。TraceAV-Bench包含2200个经过严格验证的多项选择题，覆盖578个长视频，总时长为339.5小时，涵盖4个评估维度和15个子任务。每个问题都基于一个明确的推理链，平均跨越15.1分钟的时间跨度，包含3.68个跳跃。该数据集通过三步半自动化流程构建，并经过严格的质量保证过程。在TraceAV-Bench上对多种代表性OmniLLMs的评估表明，该基准对所有模型都构成了持续的挑战，最强的闭源模型（Gemini 3.1 Pro）在一般任务上仅达到68.29%，而最佳的开源模型（Ming-Flash-Omni-2.0）仅达到51.70%，留有相当大的提升空间。此外，我们发现对多模态幻觉的鲁棒性与一般多模态推理性能在很大程度上是解耦的。我们期待TraceAV-Bench能够激发更多研究，推动能够在长格式音视频内容上进行连贯和真实推理的OmniLLMs的发展。

cs.CV / 105 / 2605.07604

SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

SAM 3D动物：从野外图像中可提示的动物3D重建

Hu, Xuyi, Lyu, Jin, Liu, Jiuming, Liu, Yebin, Zuffi, Silvia, An, Liang, Goetz, Stefan

Abstract

3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, while existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks, which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results compared with both existing model-based and model-free methods, demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.

Chinese Translation

在野外进行3D动物重建仍然面临挑战，主要由于物种差异大、频繁的遮挡以及多动物场景的普遍存在，而现有方法主要集中在单动物设置上。我们提出了SAM 3D动物，这是第一个可提示的多动物3D重建框架，能够从单张图像中进行重建。基于SMAL+参数动物模型，我们的方法能够联合重建多个实例，并支持以关键点和掩码的形式提供灵活的提示，从而在拥挤和遮挡的场景中实现更可靠的消歧。为了训练这样的模型，我们进一步引入了Herd3D，这是一个包含超过5000张图像的多动物3D数据集，旨在增加物种、互动和遮挡模式的多样性。在Animal3D、APTv2和Animal Kingdom数据集上的实验表明，我们的框架在现有的基于模型和无模型方法中均实现了最先进的结果，展示了在野外进行可提示动物3D重建的可扩展和有效的解决方案。

cs.CV / 106 / 2605.07607

FS-I2P: A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth

FS-I2P：一种具有动态分配深度的分层聚焦-扫描配准网络

Cheng, Zhixin, Chen, Yujia, Tao, Xujing, Liao, Bohao, Yin, Xiaotian, Yin, Baoqun, Zhang, Tianzhu

Abstract

Image-to-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a "Focus-Sweep" paradigm and develop a Hierarchical Focus-Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.

Chinese Translation

图像到点云的配准常常面临视角变化、跨模态差异和重复纹理等挑战，这些因素引发了尺度模糊，从而导致错误的对应关系。近期的无检测方法通过利用多尺度特征和基于Transformer的交互来缓解这一问题。然而，它们仍然面临跨层注意力漂移和尺度内不一致性的问题，阻碍了精确配准。受到人类行为的启发，我们提出了一种“聚焦-扫描”范式，并在基于SSM的框架内开发了一个分层聚焦-扫描交互模块，以增强多级跨模态特征关联。此外，我们引入了一种动态层分配策略，能够自适应地确定迭代深度，以更好地利用几何约束并提高匹配的鲁棒性。在两个基准数据集RGB-D Scenes V2和7-Scenes上的大量实验和消融研究表明，我们的方法达到了最先进的性能。

cs.CV / 107 / 2605.07640

LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation

LithoBench：用于遥感岩性解释的大型多模态模型基准测试

Wang, Jun, Li, Fengpeng, Dong, Hang, Huang, Tianjin, Han, Wei

Abstract

Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multiple sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.

Chinese Translation

遥感岩性解释是地质调查、矿产勘探和区域地质制图的基础。与一般的土地覆盖识别不同，岩性解释是一项知识密集型任务，需要专家从各种特征中推断岩石类型，例如细微的视觉、光谱、纹理、地貌和上下文线索，这使得可靠的自动化解释变得极具挑战性。受地质知识指导的大型多模态模型提供了新的机遇，但其评估仍受到缺乏能够捕捉岩性注释、多层次地质语义和专家信息评估的基准的限制。在此，我们提出了LithoBench，一个用于评估遥感岩性解释中地质语义理解的多层次基准。LithoBench包含12个代表性岩性类别下的10,000个专家注释的解释实例，包括4,000个多项选择题和6,000个开放式任务，组织成五个认知层次：识别与描述、比较分析、机制解释、实际应用和综合推理。我们进一步开发了一个专家参与的、知识驱动的半自动化构建流程，结合多个子过程，例如结构化地质图像描述，以增强地质有效性和评估可靠性。对多个大型视觉-语言模型的实验揭示了在地质语义理解方面的重大局限性，特别是在高阶解释、应用和推理任务上。

cs.CV / 108 / 2605.07642

EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

EggHand：一种用于自我中心手势预测的多模态基础模型

Choi, Jaeyoung, Kim, Hyeondong, Kim, Yujin, Park, Daehee

Abstract

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains highly challenging because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand

Chinese Translation

从自我中心视频中预测未来的三维手势序列对于理解人类意图和实现增强现实/虚拟现实（AR/VR）辅助以及人机交互等具身应用至关重要。然而，这一任务仍然是一个高度具有挑战性的问题，因为自我中心的手部运动受到复杂人类意图的驱动，表现出高度灵活的关节运动，并且在自我运动引起的剧烈视角变化下被观察到。在本研究中，我们介绍了EggHand，一个基于基础模型的自我中心手势预测框架，它将多模态语义推理与动态运动建模统一起来。我们的方法结合了来自视觉-语言-动作（Vision-Language-Action, VLA）模型的动作解码器，该解码器捕捉手部运动的结构化时间动态，以及一个自我中心视频-文本编码器，该编码器提供从大规模第一人称视频中学习的视角感知上下文信息。这些组件共同克服了在自我运动下通用视觉编码器的脆弱性，并使得在不依赖于身体姿态或外部跟踪的情况下，能够对运动、上下文和高层意图进行联合推理。在EgoExo4D数据集上的实验表明，EggHand在预测准确性上设立了新的最先进水平，在严重自我运动下保持稳健，并进一步通过基于语言的任务提示实现可控预测。项目页面：https://jyoun9.github.io/EggHand

cs.CV / 109 / 2605.07649

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

在操作设计域内运行：基于视觉-语言模型的零样本感知

Ünal, Berkehan, Hauke, Dierend, Dren, Fazlija, Christopher, Plachetka

Abstract

Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.

Chinese Translation

在过去几年中，自动化系统的研究已经成熟到一个程度，使得该领域越来越能够将研究转化为实际的、以利益相关者为驱动的用例，涵盖明确界定的领域。然而，广泛采用自动化系统的前提是遵守安全法规。许多法规受操作设计域（Operational Design Domain, ODD）的影响，ODD定义了自主代理可以运行的特定条件。这对于自动驾驶系统（Automated Driving Systems, ADS）尤为重要，因为对ODD元素的可靠感知对于安全实施和审计至关重要。视觉-语言模型（Vision-Language Models, VLMs）整合了视觉识别和语言推理，能够在没有特定任务训练数据的情况下运行，这使得它们适合于适应性ODD感知。为了评估VLMs是否可以作为适应不断演变定义的零样本“ODD传感器”，我们贡献了以下内容：（i）对四个VLM在自定义数据集和Mapillary Vistas上进行的零样本ODD分类和检测的实证研究，以及失败分析；（ii）对零样本优化策略的消融研究，并提供成本-性能概述；（iii）一套可重用的提示模板，并附有适应指导。我们的研究结果表明，基于定义的思维链提示与角色分解的效果最佳，而其他方法可能导致召回率降低。总体而言，我们的结果为在安全关键应用中实现透明和有效的基于ODD的感知铺平了道路。

cs.CV / 110 / 2605.07650

Breaking Spatial Uniformity: Prior-Guided Mamba with Radial Serialization for Lens Flare Removal

打破空间均匀性：基于先验引导的Mamba与径向序列化用于镜头光晕去除

Fu, Zijia, Huang, Yuanfei, Wang, Lizhi, Huang, Hua

Abstract

Lens flares, caused by complex optical aberrations, severely degrade image quality, especially in nighttime photography. Although recent restoration methods have made remarkable progress, most still rely on spatially uniform processing. They fail to handle the region-dependent restoration demands of flare scenes, where saturated light sources should be preserved, flare artifacts removed, and background details recovered. To address this challenge, we propose DeflareMambav2, a prior-guided Mamba framework for lens flare removal. Specifically, we introduce a Flare Prior Network (FPN) to estimate flare priors and guide adaptive restoration. In addition, a novel radial serialization strategy breaks spatially homogeneous processing by performing flare-aware targeted sampling, and better supports long-range modeling in State Space Models (SSMs). Based on these priors, the backbone adopts a dual-level adaptive scheme. It explicitly preserves light-source regions to avoid over-processing, and applies curriculum-based restoration to the remaining contaminated areas while calibrating restoration intensity at the pixel level. Extensive experiments demonstrate that DeflareMambav2 achieves state-of-the-art performance with reduced parameter burden. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMambav2.
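
To make the idea of radial serialization concrete, the sketch below orders the pixels of a small grid by their distance from an assumed light-source position, so that a sequence model would scan outward from the source rather than row by row. The function `radial_order`, its tie-breaking rule, and the single-source setup are hypothetical; the abstract does not specify how DeflareMambav2 builds its sequences.

```python
import math

def radial_order(height, width, center):
    """Return (row, col) coordinates sorted by distance from `center`.

    Ties in distance are broken by polar angle so the ordering is
    deterministic. `center` is an assumed light-source location.
    """
    cy, cx = center
    coords = [(y, x) for y in range(height) for x in range(width)]
    return sorted(
        coords,
        key=lambda p: (
            math.hypot(p[0] - cy, p[1] - cx),  # radius from the source
            math.atan2(p[0] - cy, p[1] - cx),  # angle, used only to break ties
        ),
    )

# A 3x3 grid with the source at the middle pixel: the scan starts at the
# source, visits the four edge neighbours next, and ends with the corners.
order = radial_order(3, 3, (1, 1))
```

Compared with a raster scan, this ordering places pixels at similar distances from the source next to each other in the sequence, which is one plausible way to expose radially structured flare patterns to a state space model.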

Chinese Translation

镜头光晕是由复杂的光学像差引起的，严重降低了图像质量，尤其是在夜间摄影中。尽管最近的修复方法取得了显著进展，但大多数仍依赖于空间均匀处理，未能满足光晕场景中区域依赖的修复需求。在这些场景中，饱和光源应被保留，光晕伪影应被去除，背景细节应被恢复。为了解决这一挑战，我们提出了DeflareMambav2，一种基于先验引导的Mamba框架用于镜头光晕去除。具体而言，我们引入了光晕先验网络（Flare Prior Network, FPN）来估计光晕先验并指导自适应修复。此外，一种新颖的径向序列化策略通过执行光晕感知的目标采样打破了空间均匀处理，更好地支持状态空间模型（State Space Models, SSMs）中的长程建模。基于这些先验，主干网络采用双层自适应方案，明确保留光源区域以避免过度处理，并对其余受污染区域应用基于课程的修复，同时在像素级别校准修复强度。大量实验表明，DeflareMambav2在减少参数负担的同时实现了最先进的性能。代码可在 https://github.com/BNU-ERC-ITEA/DeflareMambav2 获取。

cs.CV / 111 / 2605.07653

Aquatic Neuromorphic Optical Flow

水下神经形态光流

Zhang, Pei, Liang, Yunkai, Wang, Kaiqiang

Abstract

Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.

Chinese Translation

水下环境对传统成像系统施加了严峻的限制，并要求在高质量感知与严格资源效率之间找到平衡。尽管新兴的事件相机提供了一个有前景的替代方案，但其在水下场景中的潜力在很大程度上仍未被探索。通过神经形态视觉的视角，本研究开创性地探讨了运动场，这些运动场作为灵活的水下感知的关键媒介。基于脉冲神经网络，我们提出了一个自监督框架，从异步事件流中估计每个像素的光流，优雅地绕过了水下数据稀缺的长期瓶颈。广泛的评估表明，我们的方法在视觉和定量结果上与领先技术具有竞争力，同时在计算效率上表现优越。通过桥接神经形态感知与水下智能，本研究为资源受限的水下边缘平台上的轻量级、实时和低成本感知开辟了新的前沿。

cs.CV / 112 / 2605.07655

Towards Billion-scale Multi-modal Biometric Search

迈向十亿规模的多模态生物识别搜索

Koner, Arka, Naik, Chetan S., Kurre, Lokesh, Raghavan, Vivek, Sabut, Barada P., Barma, Tanusree Deb, Namboodiri, Anoop M., Jain, Anil K.

Abstract

Searching a multi-biometric database of a billion records for a country-level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large-scale multimodal biometric search system, called Bharat ABIS, based on open-source architectures. The end-to-end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality-specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de-duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India's Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state-of-the-art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).
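
The template size and gallery sizes quoted above allow a quick back-of-the-envelope storage estimate. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and counts raw template bytes only, ignoring any index or metadata overhead; the helper name is hypothetical.

```python
TEMPLATE_BYTES = 13_500  # 13.5 KB per person, as stated in the abstract (decimal KB assumed)

def gallery_size_gb(num_identities):
    """Raw template storage in decimal gigabytes for a gallery of the given size."""
    return num_identities * TEMPLATE_BYTES / 1e9

size_40m = gallery_size_gb(40_000_000)        # throughput test gallery
size_220m = gallery_size_gb(220_000_000)      # accuracy evaluation gallery
size_full = gallery_size_gb(1_550_000_000)    # all records mentioned in the abstract

# Compare the 40M gallery with the 2 TB of RAM quoted for the single server.
fits_in_ram = size_40m < 2_000
```

Under these assumptions the 40M gallery needs about 540 GB of raw template storage, which is consistent with holding it in the 2 TB of RAM on the quoted server, while the full 1.55 billion records would need roughly 21 TB.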

Chinese Translation

在国家级身份系统中，搜索一个包含十亿条记录的多生物识别数据库需要在生物识别系统的各个方面突破极限，包括采集、预处理、特征提取、准确性、匹配速度、呈现攻击检测以及特殊情况的处理（例如，手指缺失）。这是第一篇对这样一个大规模多模态生物识别搜索系统（称为 Bharat ABIS）进行深入探讨的论文，该系统基于开源架构。Bharat ABIS 的端到端流程通过特定模态的预处理（分割）、质量评估、呈现攻击检测和学习嵌入（特征提取）处理指纹、面部和虹膜模态，为每个人生成一个 13.5KB 的连接模板。我们详细分析了这些模态及其如何集成，以创建一个高效且有效的 1:N 搜索（去重）解决方案。在从印度 Aadhaar 数据库中的 15.5 亿条记录中随机抽取的 2.2 亿个身份的分层画廊上进行评估，结果显示，对于成年探测（超过 18 岁），在 0.5% 的假阳性率（FPIR）下，假阴性率（FNIR）为 0.3%。我们还将 Bharat ABIS 的性能与三种最先进的商业现成系统在 2000 万画廊上的表现进行了比较。我们的系统在单个服务器（8xNvidia H100 GPU，2TB RAM）上实现了每秒 100 次搜索的吞吐量，针对 4000 万的画廊。

cs.CV / 113 / 2605.07695

OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos

OphEdit：无训练的文本引导眼科手术视频编辑

Jangir, Ritul, Bagchi, Arkya Jyoti, Farooq, Aiman, Okram, Mangalton, Korgaonkar, Saurabh Seetaram, Mishra, Deepak

Abstract

While high-fidelity surgical video generation can greatly improve medical training and the development of AI, adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument-tissue interactions or procedural phases, is difficult due to strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrate that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at https://github.com/ophedit/OphEdit
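
For readers unfamiliar with the two ingredients named above, the sketch below shows the standard classifier-free guidance combination of unconditional and conditional noise predictions, together with a toy selector that swaps in stored value tensors only on the conditional branch. The selector, its argument names, and the choice of injection steps are hypothetical illustrations; the abstract does not describe OphEdit's actual injection schedule.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def select_values(branch, step, computed_v, stored_v, inject_steps):
    """Pick the attention values to use at a given denoising step.

    On the conditional branch, and only at the chosen steps, the values
    captured during inversion (`stored_v[step]`) replace the freshly
    computed ones. Everything about this schedule is an assumption.
    """
    if branch == "cond" and step in inject_steps:
        return stored_v[step]
    return computed_v

# Toy example: values stored for steps 0 and 1 during inversion.
stored = {0: [0.1, 0.2], 1: [0.3, 0.4]}
v_cond = select_values("cond", 0, [9.0, 9.0], stored, inject_steps={0, 1})
v_uncond = select_values("uncond", 0, [9.0, 9.0], stored, inject_steps={0, 1})
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

Restricting the injection to the conditional branch leaves the unconditional prediction untouched, so the guidance direction still reflects the text prompt while the injected values carry structure from the source video.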

Chinese Translation

高保真手术视频生成可以极大地改善医学培训和人工智能的发展，但将这些生成模型适应于精确的视频编辑仍然是一个巨大的挑战。由于严格的解剖和时间约束，修改手术属性（如器械与组织的相互作用或手术阶段）是具有挑战性的。在本文中，我们提出了OphEdit，一个新颖的无训练框架，用于文本引导的眼科手术视频编辑。我们的方法利用确定性的二阶常微分方程（ODE）反演管道，从原始视频中捕获注意力值（Attention Value, V）张量。通过在去噪阶段选择性地将这些存储的张量注入条件无分类器引导（Classifier-Free Guidance, CFG）分支，OphEdit严格保持了眼睛复杂的解剖几何，同时无缝地将文本驱动的语义修改映射到视频流中。临床评估表明，OphEdit有效处理复杂的手术变换，如器械更换和手术变体，与自然域视频编辑器相比，具有更优的结构保真度和时间一致性。我们的工作代表了无训练视频编辑在眼科手术领域的首次应用，提供了一种可扩展的解决方案，用于生成多样化的带注释医学数据集，而无需耗时的手动记录或昂贵的模型微调。代码和提示可在 https://github.com/ophedit/OphEdit 访问。

cs.CV / 114 / 2605.07740

LAMES: A Large-Scale and Artisanal Mining Environmental Segmentation Dataset

LAMES：大规模和手工采矿环境分割数据集

Kahl, Matthias, Chen, Zhaiyu, Saha, Sudipan, Kochupillai, Mrinalini, Kondmann, Lukas, Zhu, Xiao Xiang

Abstract

Mining operations are of utmost importance to the economy of some nations. However, such operations result in land-use change, very high energy consumption, and negative impacts on the environment, including soil erosion and deforestation. The mining process can impact an area much larger than the mining site itself. Adding to the negative externalities linked to mining is the fact that, in addition to government-sanctioned legal mining operations, illegal mining is widespread, including in various countries of Africa. The ability to monitor remote mining site activities can be useful, e.g., for the detection of illegal artisanal mining activities and their environmental impacts. An important outcome of such monitoring could be a better understanding of the interrelationship between mine facility attributes (e.g., mining types, processing methods, commodities, etc.) and their impact on the natural environment. In this work, we present a data set that contains 150 Large Scale Mining (LSM) sites and 870 km^2 of annotated area of Artisanal Small-scale Mining (ASM) sites. The metadata includes nine eminent LSM sections and 27 mining site attributes for each LSM site. We also discuss the data set's possible contribution to the research community, social and environmental consequences, and researchers' responsibilities from an ethics perspective.

Chinese Translation

采矿作业对某些国家的经济至关重要。然而，这些作业导致土地利用变化、极高的能源消耗以及对环境的负面影响，包括土壤侵蚀和森林砍伐。采矿过程对采矿地点以外的更大区域产生影响。除了政府批准的合法采矿作业外，非法采矿在包括非洲多个国家在内的地区普遍存在，这进一步加剧了与采矿相关的负外部性。监测偏远采矿现场活动的能力非常重要，例如，可以用于检测非法手工采矿活动及其对环境的影响。这种监测的重要成果可能包括更好地理解矿山设施属性（例如，采矿类型、加工方法、商品等）与其对自然环境影响之间的相互关系。在本研究中，我们呈现了一个数据集，其中包含150个大规模采矿（Large Scale Mining, LSM）地点和870平方公里的手工小规模采矿（Artisanal Small-scale Mining, ASM）地点的标注区域。元数据包括九个著名的LSM部分和每个LSM地点的27个采矿现场属性。我们还讨论了该数据集对研究社区的潜在贡献、社会和环境后果，以及研究人员在伦理角度上的责任。

cs.CV / 115 / 2605.07749

Benchmarking Foundation Models for Renal Lesion Stratification in CT

基于CT的肾脏病变分层的基础模型基准测试

Häntze, Hartmut, de Boer, Sarah, Buser, Myrthe, Hering, Alessa, van Ginneken, Bram, Prokop, Mathias, Nawabi, Jawed, Ziegelmayer, Sebastian, Adams, Lisa, Bressem, Keno

Abstract

The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? Particularly in CT-based renal lesion classification, a push toward greater generalizability would be meaningful, as the field is constrained by inherently limited training data. We addressed this through a benchmark of three medical FMs on this specific task. This six-class problem spans common entities like cysts and clear cell renal cell carcinoma, alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand, requiring only seconds on a CPU after feature extraction. Second, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p $\leq$ 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state-of-the-art.
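
Since the comparison above is reported in terms of AUC, a minimal sketch of how a binary AUC can be computed from classifier scores may help. It uses the pairwise-ranking definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half). How the paper aggregates AUC over its six classes is not stated in the abstract, so this binary version is only an illustrative assumption.

```python
def binary_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` must contain 1 for positives and 0 for negatives, with at
    least one of each. Quadratic in the number of samples, which is fine
    for a test set of a few hundred lesions.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
auc = binary_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

For a multi-class task, one common convention is to compute this quantity one-vs-rest for each class and average the results, but that is a choice the evaluator has to make explicitly.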

Chinese Translation

开源医学基础模型（FMs）的快速增长引发了一个实际问题：它们的预训练表示在临床相关但数据稀缺的分类任务中的迁移效果如何？特别是在基于CT的肾脏病变分类中，推动更大的普适性将具有重要意义，因为该领域受到训练数据本质上有限的限制。我们通过对三种医学FMs在这一特定任务上的基准测试来解决这一问题。这个六类问题涵盖了囊肿和透明细胞肾细胞癌等常见实体，以及一些稀有亚型。使用冻结特征探测协议，我们将FM嵌入与手工制作的放射组学分类器和从头训练的3D ResNet-50进行了比较。模型在一个包含2,854个病变的综合数据集上训练，并在来自癌症影像档案馆的234个病变的外部测试集上进行了评估。我们的结果揭示了两个关键发现。首先，FM的表现（AUC 0.70-0.77）与从头训练的ResNet（AUC 0.72）相匹配，同时显著降低了硬件需求，在特征提取后仅需几秒钟的CPU时间。然而，传统的放射组学基线显著优于所有深度学习方法，达到了0.88的AUC（所有p ≤ 0.002）。这表明，目前的通用FM嵌入尚未捕捉到驱动组织学亚型区分的细粒度纹理和形状异质性。尽管在数据稀缺的环境中具有潜力，医学FMs并未超越已建立的肾脏病变分层模型，放射组学仍然是当前的最先进技术。

cs.CV / 116 / 2605.07766

Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition

头部相似性：超越面部识别的结构化全头外观建模

Wang, Yingfeng, Xiao, Yuxuan, Liao, Shengcai

Abstract

Many vision applications require identity consistency beyond strict biometric recognition, especially under non-frontal views or when facial cues are missing. However, conventional face recognition models enforce intra-identity invariance, collapsing appearance variations such as hairstyle or styling changes into a single representation, limiting their use in appearance-sensitive scenarios. To address this limitation, we introduce Head Similarity, a new formulation that extends identity-centric recognition to structured whole-head similarity modeling. Our approach explicitly captures intra-identity appearance variation and enforces hierarchical similarity ordering across identity and appearance states, enabling meaningful comparison even under occlusion or rear-view conditions. We construct a large-scale benchmark from long-form videos with weakly-supervised appearance states, covering diverse poses, occlusions, and temporal changes. As a first step, we develop a simple yet effective framework that jointly models identity discrimination and appearance-sensitive similarity through hierarchical supervision and identity-aware distillation. Experiments show that conventional face recognition models fail to capture appearance-dependent similarity, while our approach demonstrates the feasibility of structured whole-head similarity modeling.

Chinese Translation

许多视觉应用需要在严格的生物识别之外保持身份一致性，特别是在非正面视角或面部线索缺失的情况下。然而，传统的面部识别模型强制执行身份内不变性，将发型或造型变化等外观变异压缩为单一表示，这限制了它们在对外观敏感的场景中的应用。为了解决这一局限性，我们提出了头部相似性（Head Similarity），一种新的公式，将以身份为中心的识别扩展到结构化全头相似性建模。我们的方法明确捕捉身份内的外观变异，并在身份和外观状态之间强制执行层次相似性排序，即使在遮挡或后视条件下也能实现有意义的比较。我们从长视频中构建了一个大规模基准，具有弱监督的外观状态，涵盖了多样的姿势、遮挡和时间变化。作为第一步，我们开发了一个简单而有效的框架，通过层次监督和身份感知蒸馏共同建模身份区分和对外观敏感的相似性。实验表明，传统的面部识别模型未能捕捉到依赖外观的相似性，而我们的方法展示了结构化全头相似性建模的可行性。

cs.CV / 117 / 2605.07767

SIMI: Self-information Mining Network for Low-light Image Enhancement

SIMI：用于低光照图像增强的自信息挖掘网络

Fu, Xuanshuo, Kang, Lei, Vazquez-Corral, Javier

Abstract

Poor lighting conditions significantly impact image quality, posing substantial challenges for image editing and visualization. Many existing enhancement methods aim at proposing complex models while neglecting the intrinsic information contained within low-light images. In this work, we propose the Self-Information Mining (SIMI) network, an innovative unsupervised framework that decomposes low-light images into multiple components based on bit-plane decomposition. Our approach allows mining intrinsic information without relying on external data. This not only accelerates model convergence but also improves performance and reduces computational overhead. The unsupervised nature of our method facilitates real-world applicability. Experiments conducted on standard benchmarks demonstrate that SIMI achieves state-of-the-art performance.
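
For readers unfamiliar with bit-plane decomposition, the sketch below shows the operation itself on a row of 8-bit pixel values. It is only an illustration of the decomposition; how SIMI consumes the resulting planes is not described in the abstract, and the helper names `bit_planes` and `reconstruct` are hypothetical.

```python
def bit_planes(pixels, bits=8):
    """Split integer pixel values into `bits` binary planes, least significant first."""
    return [[(p >> k) & 1 for p in pixels] for k in range(bits)]

def reconstruct(planes):
    """Recombine binary planes into the original integer values."""
    n = len(planes[0])
    return [sum(planes[k][i] << k for k in range(len(planes))) for i in range(n)]

# A dark toy "image" row: all values fit in the five lowest bit planes.
row = [3, 12, 7, 0, 25]
planes = bit_planes(row)
high_planes_empty = all(v == 0 for plane in planes[5:] for v in plane)
restored = reconstruct(planes)
```

In this toy row the three highest planes are empty, which illustrates why the low-order planes carry most of the signal in dark images; the decomposition is lossless, so `restored` equals `row`.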

Chinese Translation

较差的照明条件显著影响图像质量，为图像编辑和可视化带来了重大挑战。许多现有的增强方法旨在提出复杂模型，而忽视了低光照图像中蕴含的内在信息。在本研究中，我们提出了自信息挖掘网络（Self-Information Mining, SIMI），这是一种创新的无监督框架，基于位平面分解将低光照图像分解为多个组件。我们的方法允许在不依赖外部数据的情况下挖掘内在信息。这不仅加速了模型的收敛，还提高了性能并减少了计算开销。我们方法的无监督特性促进了其在现实世界中的适用性。在标准基准上进行的实验表明，SIMI 达到了最先进的性能。

cs.CV / 118 / 2605.07781

Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis

基于高斯的可微分光线追踪用于统一的无线电传播仿真与视图合成

Vaara, Niklas, Huynh, Lam, Sangi, Pekka, López, Miguel Bordallo, Heikkilä, Janne

Abstract

Explicit neural representations such as 3D Gaussian Splatting (3DGS) enable high-fidelity and real-time novel view synthesis, yet optimize for alpha-composited optical appearance rather than ray-intersectable geometry. In contrast, radio-frequency (RF) digital twins require deterministic multi-bounce paths, where the geometry dictates trajectories and their associated attenuation and delay. We introduce a framework enabling differentiable RF propagation simulation directly within visually reconstructed neural scenes, allowing point-to-point path computation between arbitrary 3D locations while preserving high-quality visual rendering. Unlike conventional RF simulation pipelines that rely on manually constructed meshes, we embed Gaussian primitives into a hardware-accelerated ray tracing structure as the underlying spatial representation. By extracting physically meaningful channel impulse responses from visual-only reconstructions, we provide cross-modal evidence that neural reconstructions can serve as unified spatial representations for both electromagnetic propagation simulation and photorealistic view synthesis.

Chinese Translation

显式神经表示，如三维高斯点云（3D Gaussian Splatting, 3DGS），能够实现高保真和实时的新视图合成，但其优化目标是α合成的光学外观，而非光线可交互的几何形状。相比之下，无线电频率（RF）数字双胞胎需要确定性的多次反射路径，其中几何形状决定了轨迹及其相关的衰减和延迟。我们提出了一种框架，能够在视觉重建的神经场景中直接进行可微分的RF传播仿真，允许在任意三维位置之间进行点对点路径计算，同时保持高质量的视觉渲染。与依赖手动构建网格的传统RF仿真管道不同，我们将高斯原语嵌入到硬件加速的光线追踪结构中，作为基础空间表示。通过从仅视觉重建中提取物理意义明确的信道脉冲响应，我们提供了跨模态证据，表明神经重建可以作为电磁传播仿真和逼真视图合成的统一空间表示。

cs.CV / 119 / 2605.07785

Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation

放射科医生指导的因果概念瓶颈模型用于胸部X光解读

Rafferty, Amy, Ramaesh, Rishi, Rajan, Ajitha

Abstract

Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
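
The noisy-OR model and its Bayesian inversion can be sketched for the simplest case. This is a minimal illustration, assuming a single binary pathology, binary concepts that are conditionally independent given the pathology, and hand-picked weights, leak terms, and prior; none of these values or function names come from the paper.

```python
def concept_likelihood(present, disease, weight, leak):
    """Noisy-OR: P(concept = present | disease) with one pathology parent."""
    # The concept stays off only if both the leak and the disease fail to trigger it.
    p_off = (1.0 - leak) * ((1.0 - weight) if disease else 1.0)
    return 1.0 - p_off if present else p_off

def pathology_posterior(concepts, weights, leaks, prior):
    """Invert the generative model with Bayes' rule over disease in {0, 1}."""
    joint = {}
    for disease in (0, 1):
        p = prior if disease else 1.0 - prior
        for c, w, l in zip(concepts, weights, leaks):
            p *= concept_likelihood(c, disease, w, l)
        joint[disease] = p
    return joint[1] / (joint[0] + joint[1])

# One concept with activation weight 0.8 and leak 0.1; pathology prior 0.1.
posterior_seen = pathology_posterior([1], [0.8], [0.1], prior=0.1)
posterior_absent = pathology_posterior([0], [0.8], [0.1], prior=0.1)
```

With these toy numbers, observing the concept raises the pathology probability from the prior of 0.1 to 0.082 / 0.172 (about 0.48), while its absence lowers it to about 0.02.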

Chinese Translation

医学影像中的概念瓶颈模型（CBMs）旨在通过在最终诊断之前预测中间临床概念来提高模型的可解释性。然而，大多数现有的CBMs将概念视为病理标签的区分性预测因子，而没有明确建模潜在的临床生成过程，即疾病如何产生可观察的放射学发现。我们提出了XpertCausal，这是一种放射科医生指导的因果CBM，用于胸部X光解读，采用概率噪声-或框架建模病理与概念之间的关系。然后通过贝叶斯推断反转该生成模型，以从预测的概念中估计病理概率。放射科医生策划的概念-病理关联用于约束模型结构，以符合放射科医生定义的临床合理推理路径。我们在MIMIC-CXR数据集上评估XpertCausal，考察其在病理分类性能、校准、解释质量和与放射科医生定义的推理路径的一致性方面的表现。与非因果CBM基线和具有不受约束学习关联的因果消融模型相比，XpertCausal在AUROC、校准和临床相关的解释质量上均取得了改善，同时学习的概念-病理关系与专家知识更为一致。这些结果表明，将临床驱动的因果结构和专家领域知识纳入CBMs可以导致更准确、可解释且与临床一致的胸部X光解读模型。

cs.CV / 120 / 2605.07786

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

APEX：无假设投影嵌入评估指标用于图像质量评估

Gallegati, Caterina, Bianchini, Monica, Scarselli, Franco, Murino, Vittorio, Corradini, Barbara Toniella

Abstract

As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX scales effectively to high-dimensional spaces, as we demonstrate with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
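
The Sliced Wasserstein Distance underlying APEX can be sketched with a Monte Carlo estimate: project both point sets onto random directions, then compare the sorted one-dimensional projections. The code below is a generic illustration, assuming equal sample counts, the 1-Wasserstein cost, and toy two-dimensional points in place of CLIP or DINOv2 embeddings; the function name, projection count, and seed are hypothetical and not taken from the paper.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between equal-size point sets."""
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction by normalizing a Gaussian vector.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Project both sets to 1D; matching sorted values is optimal in 1D.
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections

real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[x + 2.0, y] for x, y in real]
same = sliced_wasserstein(real, real)
moved = sliced_wasserstein(real, shifted)
```

No Gaussian or other parametric form is fitted to either set, which is the sense in which such a measure is assumption-free; identical sets give a distance of zero and a shifted copy gives a positive one.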

Chinese Translation

随着生成模型在视觉质量上达到前所未有的水平，图像评估的黄金标准仍然是传统的特征分布指标（例如，FID）。然而，这些指标在理论上受到过时特征的封闭词汇瓶颈和刚性参数化公式的假设偏差的制约。最近的替代方案利用现代骨干网络来解决特征瓶颈，但仍然受到参数化限制的影响。为了解决这一问题，我们提出了APEX（无假设投影嵌入评估），这是一种新颖的评估框架，利用切片瓦瑟斯坦距离作为一种在数学上有依据的、无假设的相似性度量。我们通过理论和实证证据证明，APEX在高维空间中具有有效的可扩展性。此外，APEX对嵌入无关，并使用两个开放词汇基础模型CLIP和DINOv2作为特征提取器。将APEX与已建立的基线进行基准测试显示出对视觉退化的卓越鲁棒性。此外，我们还展示了APEX指标在数据集内部和跨数据集的稳定性，确保在域外数据集上进行高度稳定的评估。

cs.CV / 121 / 2605.07800

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

SARA：视频扩散模型的语义自适应关系对齐

Lian, Jiesong, Zhou, Zixiang, Zhong, Ruizhe, Zhou, Yuan, Lu, Qinglin, Wang, Rui, Hu, Long, Hao, Yixue, Huang, Baoru

Abstract

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
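
The pair-routing idea, weighting a token pair whenever either endpoint is salient, can be sketched as follows. The abstract does not give the operator's exact form, so taking the maximum of the two endpoint saliencies and using a weighted squared error over relation matrices are assumptions made purely for illustration, as are the function names.

```python
def pair_weights(saliency):
    """Weight pair (i, j) by the larger endpoint saliency (assumed rule)."""
    n = len(saliency)
    return [[max(saliency[i], saliency[j]) for j in range(n)] for i in range(n)]

def routed_relation_loss(student_rel, teacher_rel, saliency):
    """Saliency-weighted mean squared error between token-relation matrices."""
    weights = pair_weights(saliency)
    n = len(saliency)
    num = sum(weights[i][j] * (student_rel[i][j] - teacher_rel[i][j]) ** 2
              for i in range(n) for j in range(n))
    den = sum(sum(row) for row in weights)
    return num / den if den > 0 else 0.0

# Token 0 is a subject token; tokens 1 and 2 are background (toy values).
saliency = [1.0, 0.0, 0.0]
weights = pair_weights(saliency)
teacher = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.2], [0.5, 0.2, 1.0]]
student = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.9], [0.5, 0.9, 1.0]]
loss = routed_relation_loss(student, teacher, saliency)
```

In this toy case the student disagrees with the teacher only on the background-background pair, which receives zero weight, so the routed loss is zero; an unweighted loss would spend supervision on that pair.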

Chinese Translation

近期的视频扩散模型（VDMs）能够合成视觉上令人信服的片段，但仍然会丢失实体、错误绑定属性，并削弱提示中指定的交互。表示对齐目标如 VideoREPA 和 MoAlign 通过从冻结的视觉基础模型中提取时空令牌关系来改善细粒度文本跟随，但它们的成对监督预算是基于视觉或运动线索分配的，而不是基于每对与提示的相关性。我们提出了 SARA（语义自适应关系对齐），它在冻结的视觉基础模型目标上保持令牌关系蒸馏（TRD），并添加了一个文本条件的显著性，用于决定哪些令牌对携带监督。一个轻量级的第一阶段对齐器通过每个实体的 SAM 3.1 掩码监督和 InfoNCE 正则化器进行训练，其连续显著性通过一个成对路由操作符融合到 TRD 中，当其两个端点中的任一个显著时，为每对令牌分配权重，从而将监督路由到主体-主体和主体-背景对，而远离背景-背景对。在 Wan2.2 持续训练设置中，SARA 在公共 VBench 基准测试和盲用户研究中，在 13 维 VLM 评分标准上，相较于 SFT、VideoREPA 和 MoAlign 改善了文本对齐和运动质量。

cs.CV / 122 / 2605.07807

Text-to-CAD Evaluation with CADTests

基于CADTests的文本到CAD评估

Mallis, Dimitrios, Wang, Marco, Karadeniz, Ahmet Serdar, Ricci, Elisa, Kacem, Anis, Aouada, Djamila

Abstract

Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD. It is built on CADTests, executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass the performance of current methods. CADTestBench code and data are available on GitHub and as a Hugging Face dataset.
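
To make the notion of an executable test for a CAD prompt concrete, here is a small sketch. Everything in it is hypothetical: `CadSummary` is a stand-in for whatever geometry a real CAD kernel would expose, and the checks and prompt are invented for illustration rather than taken from CADTestBench.

```python
from dataclasses import dataclass

@dataclass
class CadSummary:
    """Hypothetical summary of a generated CAD model (not a real CAD API)."""
    bbox: tuple          # (length, width, height) in millimetres
    through_holes: int   # number of through holes
    solids: int          # number of connected solid bodies

def check_plate_dimensions(model, tol=0.01):
    """Geometric requirement: bounding box matches 100 x 50 x 5 mm."""
    return all(abs(a - b) <= tol for a, b in zip(model.bbox, (100.0, 50.0, 5.0)))

def check_four_holes(model):
    """Topological requirement: exactly four through holes."""
    return model.through_holes == 4

def check_single_solid(model):
    """Topological requirement: the part is one connected body."""
    return model.solids == 1

def run_cad_checks(model, checks):
    """Return the fraction of checks the model passes."""
    results = [check(model) for check in checks]
    return sum(results) / len(results)

# Prompt (illustrative): "a 100 x 50 x 5 mm plate with four through holes".
checks = [check_plate_dimensions, check_four_holes, check_single_solid]
candidate = CadSummary(bbox=(100.0, 50.0, 5.0), through_holes=3, solids=1)
pass_rate = run_cad_checks(candidate, checks)
```

The candidate here has the right dimensions but only three holes, so it passes two of three checks. A pass rate of this kind could also rank several generated candidates, which is one plausible reading of how tests might guide generation.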

Chinese Translation

文本到CAD（Text-to-CAD）最近作为一项重要任务出现，具有显著加速设计工作流程的潜力。尽管其重要性不言而喻，但在文本到CAD评估方面的研究却出乎意料地较少，评估CAD模型生成性能仍然是一个相当大的挑战。在本研究中，我们引入了一种基于自动化测试的文本到CAD评估新视角。我们提出了CADTestBench，这是第一个基于测试的文本到CAD基准，基于CADTests，这是一种可执行的软件测试，用于验证生成的CAD模型是否满足输入提示的几何和拓扑要求。通过使用CADTestBench，我们对近期的文本到CAD方法进行了全面的基准测试，并进一步证明CADTests还可以指导CAD模型生成，产生简单的基线，超越当前方法的性能。CADTestBench的代码和数据可在GitHub和Hugging Face数据集中获取。

cs.CV / 123 / 2605.07816

ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles

ICDAR 2026 手绘圆圈中的作者识别与笔分类竞赛

Gorges, Thomas, van der Loop, Janne, Hüttner, Lukas, Schneider, Linda-Sophie, Wu, Fei, Seuret, Mathias, Christlein, Vincent

Abstract

This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 50 known and 16 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.
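
Open-set identification differs from ordinary classification in that the model must be able to answer "none of the known writers". A common baseline, shown here only as an assumed illustration and not as any participant's method, is to reject a sample when the best score falls below a threshold; the writer names, scores, and threshold are made up.

```python
def identify_writer(scores, threshold=0.6):
    """Return the best-scoring known writer, or None to reject as unknown."""
    writer, best = max(scores.items(), key=lambda kv: kv[1])
    return writer if best >= threshold else None

# A confident prediction is accepted; a flat score profile is rejected.
confident = identify_writer({"writer_07": 0.91, "writer_12": 0.05, "writer_33": 0.04})
ambiguous = identify_writer({"writer_07": 0.40, "writer_12": 0.35, "writer_33": 0.25})
```

The threshold trades off accepting unknown writers against rejecting known ones, so it would normally be tuned on held-out data containing both.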

Chinese Translation

本文介绍了 CircleID，这是一个大规模的 ICDAR 2026 竞赛，旨在从扫描的手绘圆圈中进行作者识别和笔分类。主要目标是研究生物识别作者特征与物理笔特征如何在最小的静态痕迹中自然交织。CircleID 包含两个不同的任务：（1）开放集作者识别，要求模型能够识别已知作者，同时明确拒绝未知作者；（2）跨作者笔分类，评估在已见和未见作者之间的表现。参与者获得了一个新的、受控的数据集，其中包含 46,155 张紧密裁剪的圆圈图像，数字化分辨率为 400 DPI，并进行了作者身份和笔类型的标注。该数据集包含来自 50 位已知作者和 16 位未知作者的样本，使用了八种不同的笔。竞赛在 Kaggle 上以两个独立的赛道形式进行，设有公共和私有排行榜，向参与者提供了 ResNet 基线。总共389个团队（436名参与者）为笔分类任务提交了 3,185 次，而 113 个团队（141 名参与者）为作者识别赛道提交了 1,737 次。表现最佳的私有排行榜提交在作者识别任务中达到了 64.801% 的 Top-1 准确率，而在笔分类任务中达到了 92.726% 的准确率。本文详细介绍了数据集，评估了获胜的方法，并分析了分布外作者对模型泛化和特征解耦的影响。在这个大规模竞赛中，CircleID 为最小痕迹分析建立了新的基线。

cs.CV / 124 / 2605.07817

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

GazeVLM：通过内部注意力控制实现主动视觉的多模态推理

Ebouky, Brown, Carrino, Gabriele, Avogaro, Niccolo, Studer, Christoph, Bartezzaghi, Andrea, Rigotti, Mattia

Abstract

Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
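
The effect of a suppression bias on attention can be illustrated with a toy softmax. This sketch assumes a constant additive penalty on the logits of out-of-focus visual tokens; the abstract does not specify how GazeVLM computes or applies its bias, so the function names, the bias value, and the token layout are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend(logits, focus, bias=0.0):
    """Attention weights after subtracting `bias` from out-of-focus tokens."""
    biased = [v if i in focus else v - bias for i, v in enumerate(logits)]
    return softmax(biased)

# Four visual tokens with equal raw scores; the "gaze" selects tokens 1 and 2.
logits = [1.0, 1.0, 1.0, 1.0]
focus = {1, 2}
global_view = attend(logits, focus, bias=0.0)   # bias lifted: uniform weights
focused_view = attend(logits, focus, bias=4.0)  # bias active: focus dominates
focus_mass = focused_view[1] + focused_view[2]
```

No tokens are cropped or added: the same four tokens remain in context, and setting the bias back to zero restores the uniform global view.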

Chinese Translation

人类视觉推理受主动视觉的支配，这一过程通过元认知控制驱动自上而下的目标导向注意力，动态地将中央视野聚焦于与任务相关的细节，同时保持对整体场景的周边意识。相比之下，现代视觉-语言模型（VLMs）被动处理视觉信息，依赖于静态积累大量的标记上下文，这稀释了空间推理并引发语言幻觉。在此，我们提出以下范式转变：GazeVLM，一种多模态架构，将这种对注意力资源部署的元认知监督内化到推理循环中。通过赋予VLM自主生成注视标记（$\texttt{}$）的能力，GazeVLM建立了对自身因果注意力掩码的自上而下控制机制。该模型动态决定其聚焦意图，触发持续的抑制偏差，从而减弱无关视觉特征，实施空间选择性注意并模拟中央注视。一旦局部推理结束，偏差解除，全球视野无缝恢复。这种架构使模型能够在全球空间意识和局部聚焦推理之间流畅过渡，而无需依赖裁剪工具等外部代理装置，或通过从局部视觉补丁中衍生的额外视觉标记来扩展上下文窗口。通过定制的群体相对策略优化（GRPO）程序进行训练，该程序奖励有效的基础，具有40亿参数的GazeVLM在高分辨率多模态推理性能上表现出色，超越了其参数类别中的最先进VLM，提升近4%，并在以图像思维为基础的代理多模态管道上提升超过5%，在HRBench-4k和HRBench-8k上表现优异。

cs.CV / 125 / 2605.07821

Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection

分而治之：物体共现有助于减轻OOD检测中的简单性偏差

Dai, Boyang, Chen, Chaoqi, Yu, Yizhou

Abstract

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

Chinese Translation

分布外（OOD）检测对于确保深度学习模型的可靠性至关重要。现有方法主要集中在常规纠缠表示上，以区分分布内（ID）和OOD数据，忽视了图像中的丰富上下文信息。这个问题在检测近OOD时尤为具有挑战性，因为具有简单性偏差的模型难以在解缠表示中学习到区分特征。人类视觉系统可以利用自然环境中物体的共现来促进场景理解。受到这一启发，我们提出了一种以物体为中心的OOD检测框架，旨在学习捕捉图像中的物体共现（Object CO-occurrence, OCO）模式。该方法引入了一种新的OOD检测范式，通过预测测试样本的解缠表示来理解图像中的物体共现，然后根据在ID训练数据中观察到的物体共现模式自适应地将模式划分为三种场景，最后以分而治之的方式进行OOD检测。通过这种方式，OCO能够通过考虑图像中存在的语义上下文关系来区分近OOD，避免仅关注简单、易于学习的区域。我们通过在具有挑战性和全谱OOD设置下的实验评估OCO，展示了其竞争力结果，并确认其能够应对语义和协变量转移。代码已发布在 https://github.com/Michael-McQueen/OCO。

cs.CV / 126 / 2605.07831

Explainable Part-Based Vehicle Classifier with Spatial Awareness

具有空间感知的可解释基于部件的车辆分类器

Caduff, Andreas, Zahn, Klaus, Hofstetter, Jonas, Rechsteiner, Martin, Flaig, Patrick

Abstract

In the area of Intelligent Transportation Systems (ITS), fine-grained vehicle classification systems play an essential role. Recently, the authors presented a novel vision-based classification approach in which standard end-to-end Convolutional Neural Networks (CNNs) are decomposed into 1) a CNN-based detector for semantically strong vehicle parts, followed by 2) feature construction and 3) final classification by a decision tree. In contrast to conventional CNNs, this allows both easy extensibility to new vehicle categories - without the need to fully retrain the part detector - and an important step towards the interpretability of the model, partially removing the black-box nature inherent to CNNs. Here we present an important extension of this approach that now incorporates spatial awareness of the vehicle parts: while the feature construction 2) of the previous approach used a binary decision for each feature (present vs. absent), now a full spatial probability map is constructed to condition the presence of each individual part with respect to a given vehicle category. The classification is performed using a softmax regression approach for the overall vehicle probabilities. This method shows considerably improved robustness against false (part-)detections, a point that is crucial for practical application. Comparative analyses with a state-of-the-art end-to-end CNN indicate that our part-based methods achieve comparable accuracy, effectively challenging the presumed trade-off between accuracy and explainability. This research represents a significant advance in vehicle classification for ITS and forms the basis for systems that combine high accuracy with intuitive interpretability.

Chinese Translation

在智能交通系统（ITS）领域，细粒度车辆分类系统发挥着至关重要的作用。最近，作者提出了一种新颖的基于视觉的分类方法，其中标准的端到端卷积神经网络（CNN）被分解为1）用于语义强的车辆部件的基于CNN的检测器，随后是2）特征构建和3）通过决策树进行最终分类。与传统的CNN相比，这种方法不仅可以轻松扩展到新的车辆类别——无需完全重新训练部件检测器——而且是朝着模型可解释性迈出的重要一步，部分消除了CNN固有的黑箱特性。在此，我们展示了该方法的重要扩展，现在加入了对车辆部件的空间感知：在之前方法的特征构建2）中，对每个特征使用了二元决策（存在与否），而现在构建了完整的空间概率图，以条件化每个单独部件在给定车辆类别下的存在性。分类是通过软最大回归方法对整体车辆概率进行的。这种方法在抵抗虚假（部件）检测方面显示出显著的增强，这一点对于实际应用至关重要。与最先进的端到端CNN的比较分析表明，我们的基于部件的方法在准确性上达到了可比水平，有效挑战了准确性与可解释性之间的假定权衡。这项研究代表了ITS车辆分类的重大进展，并为结合高准确性与直观可解释性的系统奠定了基础。

cs.CV / 127 / 2605.07846

BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

BRIDGE：背景路由与孤立离散门控用于粗掩码局部编辑

Xiong, Peilin, Yuan, Honghui, Chen, Junwen, Yanai, Keiji

Abstract

Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.

Chinese Translation

粗掩码局部图像编辑要求模型在修改用户指示的区域时保持周围场景的完整性。然而，在实践中，粗糙的掩码往往成为意外的形状先验：掩码不仅未能作为灵活的编辑支持，反而可能将生成的对象拉向其意外的边界。我们将这种失败现象研究为掩码形状偏差，并通过双区域约束（Two-Zone Constraint）框定任务，其中背景应保持稳定，而可编辑区域应遵循指令，而不被迫继承掩码轮廓。BRIDGE通过将掩码保持在DiT主干之外用于支持构建和混合，解决了这一设置，避免了DiT内部掩码注入和复制控制分支。它使用BridgePath生成，其中主路径（Main Path）保留背景上下文，而主题路径（Subject Path）从独立噪声中生成可编辑内容。通过诊断性Qwen-Image实验的启发，显示位置嵌入和注意力连接调节了哪些图像上下文视觉标记被重用，BRIDGE引入了一种可学习的离散几何门（Discrete Geometric Gate）用于标记级别的位置嵌入路由。该门允许主题标记借用接近融合区域的背景锚定坐标，或保持以主题为中心的坐标以获得几何自由度。我们在BRIDGE-Bench、MagicBrush和ICE-Bench上评估BRIDGE。在BRIDGE-Bench上，BRIDGE将Local SigLIP2-T从0.262（使用FLUX.1-Fill）和0.390（使用ACE++）提高到0.503，同时在局部DINO和DreamSim上也获得了相应的提升。在MagicBrush和ICE-Bench上的零样本结果进一步表明，超越经过策划的基准，BRIDGE在对齐和源保持方面具有竞争力，而新增的路由模块在与ControlNet风格的复制分支相比时，参数量保持紧凑，仅为13.31M。

cs.CV / 128 / 2605.07859

EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

EyeCue：通过注视增强的自我中心视频理解检测驾驶员认知分心

Zhang, Lang, Yoon, JinYi, Corbett, Matthew, Sarkar, Abhijit, Ji, Bo

Abstract

Driver cognitive distraction is a major cause of road collisions and remains difficult to detect. Unlike manual or visual distraction, cognitive distraction arises when the driver's attention is diverted by thoughts unrelated to driving, even when the driver appears visually attentive and exhibits no explicit physical movements. In this work, we propose EyeCue, a gaze-empowered egocentric video understanding framework, to detect driver cognitive distraction. A key insight is that cognitive distraction manifests in the interaction between eye gaze and visual context. To capture this interaction, EyeCue integrates eye gaze with egocentric video to enable context-aware modeling of the driver's attention over time. Furthermore, to tackle the limited scale and diversity of existing datasets, we introduce CogDrive, a comprehensive multi-scenario dataset that augments four existing driving datasets with cognitive distraction annotations. Through extensive evaluations on CogDrive, we show that EyeCue achieves the highest accuracy of 74.38%, outperforming 11 baselines from 6 model families by over 7%. Notably, EyeCue achieves an accuracy of over 70% across various driving scenarios (different road types, times of day, and weather conditions), indicating strong generalizability. These results highlight the importance of modeling gaze-context interactions and the effectiveness of cross-modal interaction modeling for multimodal cognitive distraction detection. Our code and CogDrive dataset resources are available at https://github.com/langzhang2000/EyeCue.

Chinese Translation

驾驶员认知分心是导致道路碰撞的主要原因，且检测难度较大。与手动或视觉分心不同，认知分心是由与驾驶无关的思维所引起，即使驾驶员在视觉上看似专注且没有明显的身体动作。在本研究中，我们提出了EyeCue，一个通过注视增强的自我中心视频理解框架，用于检测驾驶员的认知分心。一个关键的见解是，认知分心在眼动与视觉上下文之间的互动中表现出来。为了捕捉这种互动，EyeCue将眼动与自我中心视频相结合，以实现对驾驶员注意力随时间变化的上下文感知建模。此外，为了解决现有数据集规模和多样性有限的问题，我们引入了CogDrive，这是一个全面的多场景数据集，通过认知分心注释扩展了四个现有的驾驶数据集。通过对CogDrive的广泛评估，我们展示了EyeCue达到了74.38%的最高准确率，超过了来自6个模型家族的11个基线超过7%。值得注意的是，EyeCue在各种驾驶场景（不同的道路类型、时间和天气条件）中均能实现超过70%的准确率，具有较强的泛化能力。这些结果突显了建模注视-上下文互动的重要性，以及跨模态互动建模在多模态认知分心检测中的有效性。我们的代码和CogDrive数据集资源可在https://github.com/langzhang2000/EyeCue获取。

cs.CV / 129 / 2605.07861

From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

从合成到真实：朝着身份一致的化妆转移迈进，结合合成数据和真实数据

Yu, Yue, Wang, Jiayu, Shi, Jiajia, Chen, Jingjing, Jiang, Yu-Gang

Abstract

Makeup transfer aims to apply the makeup style of a reference portrait to a source portrait while preserving identity and background. Early methods formulate this task as unsupervised image-to-image translation, relying on surrogate objectives and often yielding limited performance. Recent diffusion- and flow-based approaches instead exploit synthetic data for supervised training, leading to significant improvements. However, these methods still face two critical challenges: synthetic supervision frequently fails to faithfully preserve identity, and the domain gap between synthetic and real data limits generalization, resulting in degraded performance in complex real-world scenarios. To address these issues, this paper first proposes ConsistentBeauty, a novel data curation pipeline that ensures makeup fidelity and strict identity consistency within the synthesized data. Second, we propose RealBeauty, a synthetic-to-real post-training framework. Beyond supervised learning on curated synthetic data, we further adapt the model to real-world scenarios through reinforcement learning and design novel verifiable rewards tailored to the makeup transfer task. It allows the model to further benefit from real makeup patterns beyond synthetic supervision. In addition, we establish a new diverse benchmark for makeup transfer, covering a wide range of skin tones, ages, genders, poses, and makeup styles, thereby enabling a more comprehensive evaluation of model performance under diverse real-world conditions. Extensive experiments show that our method achieves state-of-the-art performance on multiple benchmarks and demonstrates clear advantages in identity preservation and performance on complex real-world cases.

Chinese Translation

化妆转移旨在将参考肖像的化妆风格应用于源肖像，同时保持身份和背景的完整性。早期的方法将此任务表述为无监督的图像到图像翻译，依赖于替代目标，通常导致性能有限。最近的基于扩散和流的方法则利用合成数据进行监督训练，从而显著提高了性能。然而，这些方法仍面临两个关键挑战：合成监督常常无法忠实地保持身份，而合成数据与真实数据之间的领域差距限制了模型的泛化能力，导致在复杂的真实场景中性能下降。为了解决这些问题，本文首先提出了ConsistentBeauty，一个新颖的数据策划管道，确保合成数据中的化妆保真度和严格的身份一致性。其次，我们提出了RealBeauty，一个合成到真实的后训练框架。在经过策划的合成数据上的监督学习之外，我们进一步通过强化学习将模型适应于真实场景，并设计了针对化妆转移任务的新颖可验证奖励。这使得模型能够进一步受益于真实化妆模式，而不仅仅是合成监督。此外，我们建立了一个新的多样化基准，用于化妆转移，涵盖广泛的肤色、年龄、性别、姿势和化妆风格，从而能够在多样化的真实世界条件下对模型性能进行更全面的评估。大量实验表明，我们的方法在多个基准上达到了最先进的性能，并在身份保持和复杂真实案例的表现上展现出明显优势。

cs.CV / 130 / 2605.07872

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

视频理解奖励建模：一个稳健的基准和高效的奖励模型

Wei, Yuancheng, Yao, Linli, Li, Lei, Zhang, Haojie, Zhou, Hao, Meng, Fandong, Sun, Xu

Abstract

Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.
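
Two of the mechanisms mentioned above, best-of-N test-time scaling and majority voting, are simple enough to sketch directly. The code is a generic illustration under stated assumptions: the candidate answers, their scores, and the function names are invented, and a dictionary lookup stands in for a real reward model such as those the paper trains.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=reward_fn)

def majority_vote(preferences):
    """Return the most frequent label; assumes no tie among the top labels."""
    return max(set(preferences), key=preferences.count)

# Toy reward scores standing in for a learned reward model (assumed values).
toy_scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}
chosen = best_of_n(list(toy_scores), toy_scores.get)

# Three repeated preference judgments for one pair; the majority decides.
verdict = majority_vote(["first", "second", "first"])
```

Best-of-N improves outputs only to the extent that the reward model ranks candidates correctly, which is why reward-model quality matters for this form of scaling.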

Chinese Translation

多模态奖励模型在文本和图像领域取得了显著进展，但视频理解奖励建模的进展仍然受到缺乏稳健评估基准和高质量偏好数据的严重限制。为了解决这一问题，我们提出了一个统一的框架，涵盖基准设计、数据构建和奖励模型训练。我们引入了视频理解奖励基准（Video Understanding Reward Bench, VURB），该基准包含2100对偏好样本，具有长链推理轨迹（平均1143个标记）和针对一般、长时和推理导向视频任务的多数投票评估。我们进一步通过完全自动化的流程构建了视频理解偏好数据集（Video Understanding Preference Dataset, VUP-35K），为视频奖励训练提供了大规模高质量的监督。基于这些数据，我们训练了VideoDRM和VideoGRM，一个判别性奖励模型和一个生成性奖励模型，二者在VURB和VideoRewardBench上均实现了最先进的性能。进一步分析确认，VUP-35K增强了奖励性能和模型推理能力，而VideoDRM和VideoGRM在最佳的$N$测试时间扩展下也取得了显著的提升。

cs.CV / 131 / 2605.07897

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

面向语义的自适应视觉记忆用于流媒体视频理解

Wu, Hang, Mathews, Sherin Mary, Cai, Yujun, Yang, Ming-Hsuan, Wang, Yiwei

Abstract

Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.

Chinese Translation

在线流媒体视频理解要求模型处理连续的视觉输入，并实时响应用户查询，其中无界流和不可预测的查询时机使得记忆管理成为一个核心挑战。现有方法通常通过视觉相似性启发式方法压缩视觉标记，或通过 KV-cache 级别的检索增强压缩。然而，压缩决策很少考虑语义信号，而检索往往是在压缩完成后添加的，使得这两个阶段难以协调。我们提出了 SAVEMem，一个无训练的双阶段框架，将语义意识引入记忆生成，并使检索范围根据查询进行自适应。在第一阶段，SAVEMem 在固定的内存预算下在线构建三层流媒体记忆。一个固定的伪问题库提供轻量级的语义先验，从而使长期保留受到语义显著性而非单纯视觉相似性的影响。在第二阶段，SAVEMem 在此记忆上执行查询感知检索。一个锚条件的近期门根据查询是针对当前还是远期过去，调整检索范围从短期到中期和长期记忆。在这个范围内，查询与记忆标记之间的后期交互选择候选帧以进行回答。在未经过训练的情况下应用于 Qwen2.5-VL，SAVEMem 将 OVO-Bench 的整体得分从 52.27 提高到 62.69，并在 StreamingBench 和 ODV-Bench 上产生了一致的增益，同时在 128 帧时将峰值 GPU 内存减少了 48%。
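
The late-interaction frame selection described above can be sketched as follows. This assumes the common MaxSim form of late interaction (each query token takes its best-matching memory token, and the maxima are summed); the abstract does not specify the exact scoring, so `maxsim_score`, `top_k_frames`, and the toy vectors are illustrative assumptions only.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: Sequence[Vector], frame_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any frame token."""
    return sum(max(dot(q, f) for f in frame_tokens) for q in query_tokens)


def top_k_frames(query_tokens: Sequence[Vector],
                 memory: Dict[str, Sequence[Vector]], k: int) -> List[str]:
    """Rank stored frames by MaxSim score and return the ids of the top k."""
    ranked = sorted(memory,
                    key=lambda fid: maxsim_score(query_tokens, memory[fid]),
                    reverse=True)
    return ranked[:k]


# Hypothetical 2-D token embeddings for a query and three stored frames.
query = [[1.0, 0.0], [0.0, 1.0]]
memory = {
    "f0": [[1.0, 0.0], [0.9, 0.1]],
    "f1": [[0.7, 0.7], [0.0, 1.0]],
    "f2": [[-1.0, 0.0]],
}
selected = top_k_frames(query, memory, k=2)
```

Frame `f1` ranks first here because it covers both query tokens reasonably well, which is the behaviour that token-level matching is meant to reward.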

cs.CV / 132 / 2605.07910

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

一个世界，双时间线：用于4D协作驾驶重建的解耦时空高斯场景图

Chen, Yulong, Dong, Xiaoyun, Zhang, Haoyu, Yang, Zongxian, Xie, Lewei, Li, Xinke, Zhang, Yifan, Wang, Kai, Wang, Jianping

Abstract

Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agents, such as cars and pedestrians, at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame. This assumption breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST shares a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling makes the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make DUST practical, we further introduce a static anchor-based pose correction pipeline that corrects spatial misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, while remaining robust under larger temporal asynchrony. Code is available at https://anonymous.4open.science/r/DUST-6A55.

Chinese Translation

从车辆与基础设施协作自动驾驶（VICAD）数据重建动态场景的过程因时间异步而变得复杂：车辆和基础设施摄像头在独立的时钟上运行，以不同的物理时间捕捉同一动态主体，如汽车和行人。现有的高斯场景图方法隐含地假设观察是同步的，并为每个主体在每帧中分配一个单一的姿态，这一假设在协作环境中会失效，导致产生的梯度冲突在动态主体上造成严重的鬼影现象。我们将此视为一种表示层面的失败，而非优化伪影：我们证明任何单时间线的公式都会产生不可减少的光度损失，该损失与主体速度和跨源时间偏移呈二次关系。为了解决这一问题，我们提出了Dust（解耦时空）高斯场景图，用于4D协作驾驶重建。DUST高斯场景图为每个主体共享一个规范的高斯集，以保持外观一致性，同时保持与每个源的真实捕获时间戳对齐的解耦姿态轨迹。我们证明这种解耦使得姿态梯度核块对角化，完全消除了跨源干扰。为了使Dust更具实用性，我们进一步引入了一种基于静态锚点的姿态校正管道，以纠正车辆与基础设施注释之间的空间错位，以及一种姿态正则化的联合优化方案，以防止在早期训练期间的轨迹抖动和漂移。在来自V2X-Seq的26个序列上，DUST实现了最先进的性能，动态区域的PSNR比最强基线提高了3.2 dB，Fréchet视频距离减少了37.7%，并在较大的时间异步下保持了鲁棒性。代码可在 https://anonymous.4open.science/r/DUST-6A55 获取。
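
The quadratic dependence on velocity and time offset can be seen in a toy position-space analogue. This is only an illustrative assumption, not the paper's photometric bound: an agent moves with constant velocity $v$, the two sources observe it at times $t$ and $t + \Delta t$, and a single shared pose $p$ must explain both observations in the least-squares sense.

```latex
% Toy analogue (assumed setup): one pose p fitted to two asynchronous observations.
\begin{aligned}
L(p) &= \lVert p - x(t) \rVert^2 + \lVert p - x(t) - v\,\Delta t \rVert^2, \\
p^\star &= x(t) + \tfrac{1}{2}\, v\,\Delta t, \\
L(p^\star) &= \tfrac{1}{2}\, \lVert v \rVert^2\, \Delta t^2 .
\end{aligned}
```

The residual $L(p^\star)$ cannot be driven to zero by any choice of $p$ and grows quadratically in both $\lVert v \rVert$ and $\Delta t$, whereas giving each source its own pose removes it entirely.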

cs.CV / 133 / 2605.07915

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

对扩散友好的潜在流形而言，什么是重要的？用于潜在扩散的先验对齐自编码器

Yue, Zhengrong, Hu, Taihang, Chen, Mengting, Zhang, Haiyu, Pan, Zihao, Liu, Tao, Wang, Zikang, Lan, Jinsong, Zhu, Xiaoyong, Zheng, Bo, Wang, Yali

Abstract

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity is. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving a diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from VFMs and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256x256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE with up to 13x faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.

Chinese Translation

分词器是潜在扩散模型中的关键组成部分，因为它们定义了扩散模型所操作的潜在空间。然而，现有的分词器主要旨在提高重建保真度或继承预训练表示，尚不清楚什么样的潜在空间真正适合生成建模。本文从潜在流形组织的角度研究了这一问题。通过构建受控的分词器变体，我们识别出扩散友好的潜在流形的三个关键特性：一致的空间结构、局部流形连续性和全局流形语义。我们发现，这些特性与下游生成质量的相关性高于重建保真度。基于这一发现，我们提出了先验对齐自编码器（Prior-Aligned AutoEncoder, PAE），该模型明确塑造潜在流形，而不是让扩散友好的流形间接地从重建或继承中出现。具体而言，PAE利用从VFM中提取的精细先验和基于扰动的正则化，将空间结构、局部连续性和全局语义转化为明确的训练目标。在ImageNet 256x256上，PAE在训练效率和生成质量上均优于现有的分词器，在相同的训练设置下，其收敛速度可提高至13倍，达到了与RAE相当的性能，并取得了新的最先进的gFID值1.03。这些结果突显了组织潜在流形对于潜在扩散模型的重要性。

cs.CV / 134 / 2605.07919

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

MedVIGIL：在破损视觉证据下评估可信的医疗视觉-语言模型

Jiang, Hanqi, Chen, Junhao, Pan, Yi, Chen, Lifeng, You, Weihang, Gong, Haozhen, Yan, Ruiyu, Lv, Jinglei, Zhao, Lin, Ren, Hui, Li, Quanzheng, Liu, Tianming, Li, Xiang

Abstract

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

Chinese Translation

医疗视觉-语言模型（VLMs）通常在完整的图像-问题对上进行评估，但可信的临床应用需要更强的特性：模型必须能够识别出答案的证据基础何时失效。我们通过在扰动证据下的无声失败来研究这一点，其中一个需要视觉的医疗问题与错误前提、措辞扰动、仅知识重写或ROI（感兴趣区域）损坏的图像配对，但模型返回流畅的非拒绝答案。我们引入MedVIGIL，这是一个由四个公共医疗视觉问答（VQA）来源提取的300案例评估套件，由四位获得认证的放射科医师全程监督：每个金标准答案、拒绝选项、候选答案集、释义、错误前提陷阱、ROI框和临床风险等级均由临床医生撰写。两位主治放射科医师并行标注每个案例，一位资深放射科医师整合发布的清单，另一位与构建无关的独立放射科医师回答每个探测问题，以提供人类参考基准。发布内容包含2,556个多项选择题（MCQ）探测、240个反事实三元组、医生裁定的风险等级和可回答性标记、ROI框，以及一个配对的开放式变体。我们报告七个基于正确性的审计指标，这些指标汇总为MedVIGIL综合评分（MCS），并对16个具有视觉能力的模型和两个仅文本基准进行审计。独立放射科医师在无声失败率为5.8%时，MCS得分为83.3，留有14.1分的综合提升空间，超过最强审计模型（Claude Opus 4.7，得分69.2）。基准和评估工具已公开发布。

cs.CV / 135 / 2605.07931

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

每帧一个标记：重新考虑世界模型中的视觉带宽以用于VLA策略

Tang, Zuojin, Yuan, Shengchao, Bai, Xiaoxin, Jin, Zhiyuan, Ma, De, Pan, Gang, Liu, Bin

Abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $\pi_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $\pi_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $\pi_0$).

Chinese Translation

视觉-语言-行动（VLA）模型越来越依赖辅助世界模块来进行长期规划，但如何在预训练的VLA基础上对这些模块进行参数化仍然是一个未解决的设计问题。现有的世界模型增强的VLA通常以高视觉带宽将每帧的视觉流传递给世界模块，并将其展开视为行动预测的副产品；在对冻结的主干网络进行受限适应预算的情况下，这使得每帧表示和潜在行动耦合的研究不足。我们提出了OneWM-VLA，它通过自适应注意力池化将每个视图压缩为每帧一个语义标记，并在单一流匹配目标下生成结果潜在流和行动轨迹，而不是通过单独的解码器将它们连接起来。实证研究表明，在我们的设置下，每帧的视觉带宽可以减少到一个标记，而不会影响长期性能。使用14.71M LoRA参数在$\pi_0$（2B）主干网络上训练的OneWM-VLA，将MetaWorld MT50上的平均成功率从47.9%提高到61.3%，在LIBERO-Long上达到95.6%（相比于$\pi_0$的85.2%），并在真实的Piper臂上的长期可变形任务Fold Cloth中达到60.0%（相比于$\pi_0$的20.0%）。
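
Compressing a frame's patch tokens into one token via attention pooling can be sketched as below. This is a generic single-query softmax pooling under stated assumptions; the abstract does not describe the internals of Adaptive Attention Pooling, and the `query` vector here is a hypothetical stand-in for a learned parameter.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(tokens: Sequence[Sequence[float]],
                   query: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Pool many tokens into one vector using softmax attention from one query."""
    d = len(query)
    # Scaled dot-product scores between the query and every token.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # The pooled token is the attention-weighted average of the inputs.
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights


# Hypothetical patch tokens for one frame and an assumed query vector.
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frame_tokens, query=[4.0, 0.0])
```

Whatever the number of input tokens, the output is a single vector of the same width, which is what bounds the per-frame bandwidth passed downstream.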

cs.CV / 136 / 2605.07940

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

Delta-Adapter：基于单对监督的可扩展示例图像编辑

Chen, Jiacheng, Li, Songze, Fu, Han, Zhao, Baoquan, Liu, Wei, Liang, Yanyan, Qing, Li, Mao, Xudong

Abstract

Exemplar-based image editing applies a transformation defined by a source-target image pair to a new query image. Existing methods rely on a pair-of-pairs supervision paradigm, requiring two image pairs sharing the same edit semantics to learn the target transformation. This constraint makes training data difficult to curate at scale and limits generalization across diverse edit types. We propose Delta-Adapter, a method that learns transferable editing semantics under single-pair supervision, requiring no textual guidance. Rather than directly exposing the exemplar pair to the model, we leverage a pre-trained vision encoder to extract a semantic delta that encodes the visual transformation between the two images. This semantic delta is injected into a pre-trained image editing model via a Perceiver-based adapter. Since the target image is never directly visible to the model, it can serve as the prediction target, enabling single-pair supervision without requiring additional exemplar pairs. This formulation allows us to leverage existing large-scale editing datasets for training. To further promote faithful transformation transfer, we introduce a semantic delta consistency loss that aligns the semantic change of the generated output with the ground-truth semantic delta extracted from the exemplar pair. Extensive experiments demonstrate that Delta-Adapter consistently improves both editing accuracy and content consistency over four strong baselines on seen editing tasks, while also generalizing more effectively to unseen editing tasks. Code will be available at https://delta-adapter.github.io.

Chinese Translation

基于示例的图像编辑通过源-目标图像对定义的变换应用于新的查询图像。现有方法依赖于成对监督范式，需要两个共享相同编辑语义的图像对来学习目标变换。这一限制使得大规模整理训练数据变得困难，并限制了在多样化编辑类型中的泛化能力。我们提出了Delta-Adapter，一种在单对监督下学习可转移编辑语义的方法，不需要文本指导。我们并不是直接将示例对暴露给模型，而是利用预训练的视觉编码器提取一个语义增量，该增量编码了两幅图像之间的视觉变换。这个语义增量通过基于Perceiver的适配器注入到预训练的图像编辑模型中。由于目标图像从未直接暴露给模型，它可以作为预测目标，从而实现单对监督，而无需额外的示例对。这一公式化使我们能够利用现有的大规模编辑数据集进行训练。为了进一步促进真实变换的转移，我们引入了一个语义增量一致性损失，旨在将生成输出的语义变化与从示例对中提取的真实语义增量对齐。大量实验表明，Delta-Adapter在已见编辑任务上始终提高了编辑准确性和内容一致性，相较于四个强基线，同时在未见编辑任务上也表现出更有效的泛化能力。代码将发布在 https://delta-adapter.github.io。
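
The idea of a semantic delta and a consistency loss can be sketched with plain vectors. This assumes the delta is a simple difference of encoder embeddings and that consistency is measured by cosine similarity; both are illustrative assumptions, since the abstract does not give the exact formulation, and all names below are hypothetical.

```python
import math
from typing import List, Sequence


def delta(before: Sequence[float], after: Sequence[float]) -> List[float]:
    """Embedding change from 'before' to 'after'."""
    return [a - b for b, a in zip(before, after)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def delta_consistency_loss(src_emb, tgt_emb, query_emb, out_emb) -> float:
    """1 - cosine between the exemplar delta and the delta of the generated edit."""
    return 1.0 - cosine(delta(src_emb, tgt_emb), delta(query_emb, out_emb))


# Hypothetical embeddings: the exemplar pair moves along the first axis.
src, tgt = [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]
query = [0.0, 0.0, 2.0]
faithful_output = [1.0, 0.0, 2.0]   # same change as the exemplar
reversed_output = [-1.0, 0.0, 2.0]  # opposite change

loss_faithful = delta_consistency_loss(src, tgt, query, faithful_output)
loss_reversed = delta_consistency_loss(src, tgt, query, reversed_output)
```

The loss is near zero when the output changes the query in the same direction as the exemplar pair and grows as the two changes diverge.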

cs.CV / 137 / 2605.07945

Rebalancing gradient to improve self-supervised co-training of depth, odometry and optical flow predictions

重新平衡梯度以改善深度、里程计和光流预测的自监督协同训练

Hariat, Marwane, Manzanera, Antoine, Filliat, David

Abstract

We present CoopNet, an approach that improves the cooperation of co-trained networks by dynamically adapting the apportionment of gradients to ensure equitable learning progress. It is applied to motion-aware self-supervised prediction of depth maps by introducing a new hybrid loss, based on a distribution model of the photometric reconstruction errors made by the paired depth and odometry networks on the one hand and by the optical flow network on the other. This model essentially assumes that the pixels of moving objects (which must be discarded when training depth and odometry) correspond to those where the two reconstructions strongly disagree. We justify this model with theoretical considerations and experimental evidence. A comparative evaluation on the KITTI and CityScapes datasets shows that CoopNet improves on or is comparable to the state of the art in depth, odometry, and optical flow prediction.

Chinese Translation

我们提出了 CoopNet，这是一种通过动态调整梯度分配来改善协同训练网络之间合作的方法，以确保公平的学习进展。该方法应用于运动感知的自监督深度图预测，采用了一种基于光度重建误差分布模型的新混合损失，该模型考虑了深度 + 里程计配对网络与光流网络之间的关系。该模型基本上假设来自运动物体的像素（在训练深度和里程计时必须丢弃）对应于两种重建结果强烈不一致的像素。我们通过理论分析和实验证据来证明该模型的合理性。在 KITTI 和 CityScapes 数据集上的比较评估表明，CoopNet 在深度、里程计和光流预测方面的表现优于或可与现有最先进的方法相媲美。
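
The assumption that moving-object pixels are those where the two reconstructions strongly disagree can be illustrated with a crude threshold rule. The paper uses a distribution model of the errors; the fixed `ratio` test below is a simplified stand-in chosen for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def moving_pixel_mask(rigid_err: Sequence[float], flow_err: Sequence[float],
                      ratio: float = 2.0, eps: float = 1e-6) -> List[bool]:
    """Flag pixels whose rigid (depth + odometry) error far exceeds the flow error."""
    return [r > ratio * (f + eps) for r, f in zip(rigid_err, flow_err)]


def masked_mean(errors: Sequence[float], mask: Sequence[bool]) -> float:
    """Average the errors over pixels that are NOT flagged as moving."""
    kept = [e for e, m in zip(errors, mask) if not m]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-pixel photometric errors for three pixels.
rigid = [0.10, 0.90, 0.20]
flow = [0.10, 0.10, 0.15]
mask = moving_pixel_mask(rigid, flow)
depth_loss = masked_mean(rigid, mask)
```

Only the second pixel is flagged, so it is excluded from the depth and odometry loss while still being usable for training the flow network.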

cs.CV / 138 / 2605.07955

TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model

TimeLesSeg：通过随机生成模型实现统一的无对比度横断面和纵向多发性硬化病变分割

Caselles-Ballester, Vicent, Martínez-Heras, Eloy, Pontillo, Giuseppe, Mendelsohn, Zoe, Marrón, Elena M., Fernández, Juan Luis García, Subirats, Laia, Stutters, Jon, Chataway, Jeremy, Barkhof, Frederik, Llufriu, Sara, Prados, Ferran

Abstract

Multiple sclerosis (MS) exhibits substantial clinical and radiological heterogeneity, which poses significant challenges for automatic lesion segmentation. The current deep learning-based state of the art (SOTA) is highly susceptible to changes in input distribution (e.g., a change of scanner) and in input structure, as is evident in the current divide between cross-sectional and longitudinal approaches. We introduce TimeLesSeg, a unified contrast-agnostic framework designed to segment MS lesions regardless of the presence of a temporal dimension in its inputs, with a single convolutional neural network. Our approach models pathological priors through lesion masks, which are processed together with the current scan. Cross-sectional processing is enabled by exposing the model to training cases where no prior information is available, which are modeled with an empty mask, allowing it to operate seamlessly in both scenarios. To overcome the scarcity and inconsistency of longitudinal datasets, we propose a novel generative pipeline in which patterns of lesion evolution are simulated by stochastically deforming each individual lesion with morphological operations, producing realistic prior timepoints. In parallel, we achieve contrast agnosticism through Gaussian mixture model-based domain randomization, enabling the network to experience a wide spectrum of intensity profiles. Results on three publicly available and two in-house datasets show that TimeLesSeg outperforms the contrast-agnostic state of the art on single-modality inputs across overlap- and distance-based metrics. In longitudinal processing, our method outperforms SAMSEG, and captures lesion load dynamics more accurately than both SAMSEG and LST-AI. All source code related to the development of TimeLesSeg is available at https://github.com/NeuroADaS-Lab/TimeLesSeg.

Chinese Translation

多发性硬化症（MS）表现出显著的临床和影像学异质性，这给自动病变分割带来了重大挑战。目前基于深度学习的最先进技术（SOTA）对分布变化（例如扫描仪的变化）以及输入结构的变化高度敏感，这在横断面和纵向方法之间的当前差异中表现得尤为明显。我们提出了TimeLesSeg，这是一个统一的无对比度框架，旨在通过单个卷积神经网络分割MS病变，无论其输入中是否存在时间维度。我们的方法通过病变掩膜建模病理先验，这些掩膜与当前扫描一起处理。通过将模型暴露于没有先前信息的训练案例（这些案例用空掩膜建模），使得横断面处理成为可能，从而使其在两种场景中无缝运行。为了克服纵向数据集的稀缺性和不一致性，我们提出了一种新颖的生成管道，通过随机变形每个单独病变并应用形态学操作来模拟病变演变的模式，从而生成逼真的先前时间点。同时，我们通过基于高斯混合模型的领域随机化实现了无对比度，使网络能够体验广泛的强度谱。对三个公开可用的数据集和两个内部数据集的结果表明，TimeLesSeg在基于重叠和距离的指标上优于单模态输入的无对比度最先进技术。在纵向处理方面，我们的方法优于SAMSEG，并比前者和LST-AI更准确地捕捉病变负荷动态。与TimeLesSeg开发相关的所有源代码可在https://github.com/NeuroADaS-Lab/TimeLesSeg获取。

cs.CV / 139 / 2605.07971

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

DVD：用于3D生成和编辑的离散体素扩散

Xiang, Zhengrui, Wu, Jiaqi, Sun, Fupeng, Zheng, Heliang, Li, Yingzhen

Abstract

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations.

Chinese Translation

我们提出了离散体素扩散（Discrete Voxel Diffusion，DVD），这是一种离散扩散框架，用于生成、评估和编辑基于结构潜在（Structured LATent）3D生成管道的稀疏体素。尽管离散扩散在图像类生成中尚未普遍取代连续扩散，但我们展示了它可以作为稀疏体素框架的有效第一阶段先验。通过将体素占用视为一种原生离散变量，DVD避免了连续到离散的阈值处理，并提供了一个简单的体素生成、不确定性估计和编辑框架。除了质量提升，DVD还通过显式的类别建模提供了更具可解释性的生成动态。此外，我们利用预测熵作为一种稳健的不确定性度量，以识别模糊的体素区域和复杂样本，从而促进数据过滤和质量评估等任务。最后，我们提出了一种使用块结构扰动模式的轻量级微调策略。这种方法使模型能够在单次采样轮次内进行体素的修补和编辑，所需的辅助计算极少且无需额外的模型评估。
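
Using predictive entropy to flag ambiguous voxels can be sketched for the binary occupancy case. This assumes each voxel has a predicted occupancy probability and uses the Bernoulli entropy in nats; the threshold value and the function names are hypothetical choices for illustration, not details from the paper.

```python
import math
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def uncertain_voxels(probs: Dict[Voxel, float], threshold: float) -> List[Voxel]:
    """Return voxels whose predictive entropy is at least the threshold."""
    return [v for v, p in probs.items() if binary_entropy(p) >= threshold]


# Hypothetical occupancy probabilities for three voxels.
occupancy = {(0, 0, 0): 0.5, (0, 0, 1): 0.99, (1, 0, 0): 0.4}
ambiguous = uncertain_voxels(occupancy, threshold=0.6)
```

Entropy peaks at $\ln 2$ when the model is undecided ($p = 0.5$) and falls to zero for confident predictions, so averaging it over a sample gives a simple per-sample difficulty score for filtering.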

cs.CV / 140 / 2605.07973

HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

HEART：通过Kent表示遍历在扩散模型中实现超球面嵌入对齐

Roy, Arani, Biswas, Shristi Das, Roy, Kaushik

Abstract

Text-to-image diffusion models can generate visually stunning images, yet controlling what appears and how it appears remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast and controllable image generation.

Chinese Translation

文本到图像的扩散模型能够生成视觉上令人惊叹的图像，但控制图像中出现的内容及其表现方式仍然出乎意料地困难，尤其是在仅依赖文本条件空间的情况下。例如，改变主题或调整属性往往会导致意想不到的副作用，如背景变化或细节扭曲。这是因为大多数现有的基于文本的控制方法将嵌入空间视为欧几里得空间，并应用简单的线性变换，这并不能反映语义概念的实际组织方式。在本研究中，我们退一步思考：这些嵌入的真实几何形状是什么？我们发现文本编码器的表示位于超球面上，概念不是线性方向，而是结构化的各向异性分布，更好地通过Kent分布来捕捉。基于这一见解，我们提出了HEART，一个无训练的框架，直接在超球面上执行考虑Kent的测地线变换。通过尊重基础几何，HEART实现了直观且精确的编辑，例如一致的主题替换和细粒度的属性控制，同时保持原始场景。重要的是，HEART不需要微调、反演或优化，并且可以在不同的扩散模型架构中泛化。我们的结果表明，从线性到球面视角的简单转变，可以解锁快速且可控的图像生成。
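
The difference between a linear edit and a geodesic one can be seen with plain spherical interpolation (slerp). This sketch is not the Kent-aware transformation the paper proposes; it only shows, under that simplification, why moving along the sphere keeps an embedding at unit norm while a straight-line blend does not.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]


def slerp(a: Sequence[float], b: Sequence[float], t: float) -> List[float]:
    """Move a fraction t along the great-circle arc from a to b on the unit sphere."""
    a, b = normalize(a), normalize(b)
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        return a  # the points coincide, nothing to interpolate
    if math.pi - theta < 1e-8:
        raise ValueError("antipodal points have no unique geodesic")
    s = math.sin(theta)
    wa = math.sin((1.0 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


# Two hypothetical unit embeddings and their midpoints.
a, b = [1.0, 0.0], [0.0, 1.0]
geodesic_mid = slerp(a, b, 0.5)
linear_mid = [(x + y) / 2.0 for x, y in zip(a, b)]
geodesic_norm = math.sqrt(sum(x * x for x in geodesic_mid))
linear_norm = math.sqrt(sum(x * x for x in linear_mid))
```

The geodesic midpoint stays on the sphere, whereas the linear midpoint falls inside it, which is one way a Euclidean edit can drift away from where valid embeddings live.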

cs.CV / 141 / 2605.07978

Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

跨越天空与街道的视野：基于卫星、无人机和地面图像的前馈三维重建

Wang, Qiwei, Tuo, Zhongyao, Ze, Xianghui, Shi, Yujiao

Abstract

Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates (an $(x,y)$ position and a yaw angle) because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera; no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct **CrossGeo**, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.

Chinese Translation

跨视图定位经典地提出了一个问题：这张地面图像在卫星图块上的位置在哪里？现有的方法通常仅限于3自由度（3-DoF）估计——一个$(x,y)$位置和一个偏航角，因为天顶卫星图像没有提供关于滚转、俯仰或高度的直接线索，迫使我们依赖于平面运动和零倾斜假设。这些假设在具有坡度、坡道和倾斜相机安装的真实地形上失效。为了解决这个问题，我们引入了一张无人机（UAV）图像作为中间视点：它揭示了从天顶视角无法看到的三维结构，提供了卫星单独无法提供的滚转、俯仰和高度线索，并且只需要与地面相机有空间重叠——不需要已知的相对姿态。基于这一见解，我们提出了**Cross3R**，一个灵活的前馈模型，可以同时处理卫星图块与无人机图像、地面图像或两者，并在一次前向传递中恢复跨视图三维点云、每个输入相机的6自由度（6-DoF）姿态，以及每个视角相机在图块上的$(x,y)$位置和偏航角。为了训练和评估，我们还构建了**CrossGeo**，一个包含278K图像的三视图数据集，涵盖了除南极洲外的85个场景。在CrossGeo上，Cross3R在点云重建、6-DoF相机姿态估计和跨视图定位方面始终优于前馈三维基线。在KITTI数据集上，尽管没有进行KITTI训练，Cross3R在大多数指标上仍然优于专门针对KITTI训练的跨视图方法。

cs.CV / 142 / 2605.08000

Rethinking Dense Optical Flow without Test-Time Scaling

重新思考无测试时间缩放的稠密光流

Chanda, Praroop, Kumar, Suryansh

Abstract

Recent progress in dense optical flow has been driven by increasingly complex architectures and multi-step refinement for test-time scaling. While these approaches achieve strong benchmark performance, they also require substantial computation during inference. This raises a fundamental question: Is scaling test-time computation the only way to improve dense optical flow accuracy? We argue that it is not. Instead, powerful visual semantic and geometric priors encoded in modern foundation models can reduce, if not overcome, the need for computationally expensive iterative refinement at test-time. In this paper, we present a framework that estimates dense optical flow in a single forward pass, leveraging pretrained foundation representations, while avoiding iterative refinement and additional inference-time computation, thus offering an alternative to test-time scaling. Our method extracts visual semantic features from a frozen DINO-v2 backbone and combines them with geometric cues from a monocular depth foundation model. We fuse these complementary priors into a unified representation and apply a global matching formulation to estimate dense correspondences without recurrent updates or test-time optimization. Despite avoiding iterative refinement, our approach achieves strong cross-dataset generalization across challenging benchmarks. On Sintel Final, we obtain 2.81 EPE without refinement, significantly improving over state-of-the-art (SOTA) SEA-RAFT under comparable training conditions and outperforming RAFT, GMFlow (without refinement), and recent FlowSeek in the same setting. These results suggest that strong foundation priors can substitute for test-time scaling, offering a computationally efficient alternative to refinement-heavy pipelines.

Chinese Translation

稠密光流的近期进展得益于越来越复杂的架构和多步精炼以进行测试时间缩放。尽管这些方法在基准测试中表现出色，但它们在推理过程中也需要大量计算。这引发了一个根本性的问题：缩放测试时间计算是否是提高稠密光流准确性的唯一方法？我们认为并非如此。相反，现代基础模型中编码的强大视觉语义和几何先验可以减少，甚至克服在测试时间进行计算密集型迭代精炼的需求。在本文中，我们提出了一种框架，该框架在单次前向传递中估计稠密光流，利用预训练的基础表示，同时避免迭代精炼和额外的推理时间计算，从而为测试时间缩放提供了一种替代方案。我们的方法从冻结的 DINO-v2 主干网络中提取视觉语义特征，并将其与单目深度基础模型中的几何线索相结合。我们将这些互补的先验融合成一个统一的表示，并应用全局匹配公式来估计稠密对应关系，而无需递归更新或测试时间优化。尽管避免了迭代精炼，我们的方法在具有挑战性的基准测试中实现了强大的跨数据集泛化。在 Sintel Final 数据集上，我们在没有精炼的情况下获得了 2.81 的 EPE，显著优于在可比训练条件下的最新技术（SOTA）SEA-RAFT，并在相同设置下超越了 RAFT、GMFlow（无精炼）和最近的 FlowSeek。这些结果表明，强大的基础先验可以替代测试时间缩放，为重精炼的流程提供了一种计算上高效的替代方案。
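
A global matching formulation can be sketched in one dimension: correlate every source feature with every target feature, apply a softmax, and read the flow as the expected target position minus the source position. This follows the general GMFlow-style idea under simplifying assumptions (1-D positions, toy features) and is not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def global_matching_flow_1d(src_feats: Sequence[Sequence[float]],
                            tgt_feats: Sequence[Sequence[float]]) -> List[float]:
    """Per-pixel flow from a softmax over all source-target feature correlations."""
    d = len(src_feats[0])
    flows = []
    for i, f in enumerate(src_feats):
        scores = [sum(a * b for a, b in zip(f, g)) / math.sqrt(d) for g in tgt_feats]
        probs = softmax(scores)
        expected_pos = sum(p * j for j, p in enumerate(probs))
        flows.append(expected_pos - i)  # displacement from position i
    return flows


# Toy features: the two pixels swap places between the frames.
src = [[10.0, 0.0], [0.0, 10.0]]
tgt = [[0.0, 10.0], [10.0, 0.0]]
flows = global_matching_flow_1d(src, tgt)
```

Because every pixel is matched against all candidates in one shot, no recurrent update loop is needed; the quality of the result rests on how discriminative the features are.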

cs.CV / 143 / 2605.08003

SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere

SphereVAD：通过单位超球面上的测地推理实现无训练视频异常检测

Huang, Chao, Wei, Penfei, Wang, Wei, Wen, Jie, Wang, Zhihua, Shen, Li, Ren, Wenqi, Cao, Xiaochun

Abstract

Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pre-trained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises-Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unleashing latent discriminability through principled geometric reasoning rather than learning new representations. Specifically, SphereVAD first applies Fréchet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only minimal synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.

Chinese Translation

视频异常检测（VAD）旨在自动识别在未修剪的监控视频中偏离正常模式的事件。现有方法普遍依赖于大规模标注或特定任务的训练程序，这严重限制了它们在新场景中的快速部署。我们观察到，预训练的多模态大型语言模型（MLLMs）的中间层特征已经编码了丰富的异常语义，但现有方法依赖于语言输出路径，未能利用这些表示中潜在的几何可分性。基于这一发现，我们提出了SphereVAD，一个完全无训练的零-shot VAD框架，将异常区分重新表述为单位超球面上的冯·米塞斯-费舍尔（vMF）似然比测地推理，通过原则性的几何推理释放潜在的可分性，而不是学习新的表示。具体而言，SphereVAD首先应用弗雷歇均值中心化来展开特征分布并消除领域偏差，然后采用整体场景注意力（HSA）利用跨视频先验增强特征一致性，最后执行vMF引导的球面测地拉伸（SGP）以将模糊片段与球面流形上的方向原型对齐。该无训练管道仅需最少的合成图像进行校准。SphereVAD在三个主要基准测试中建立了无训练方法的新最先进结果，并与完全监督的基线保持竞争力。代码将在接受后提供。
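
A vMF likelihood ratio between an "anomalous" and a "normal" prototype direction can be sketched as follows. It assumes both distributions share the same concentration $\kappa$, so the normalizing constants cancel and the log-ratio reduces to $\kappa\,(\mu_a - \mu_n)^\top x$; the prototypes and $\kappa$ are hypothetical, and the paper's full pipeline (centering, HSA, SGP) is not reproduced.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]


def vmf_log_likelihood_ratio(x: Sequence[float], mu_anomalous: Sequence[float],
                             mu_normal: Sequence[float], kappa: float) -> float:
    """log p(x | anomalous) - log p(x | normal) for vMF models with equal kappa."""
    x = normalize(x)
    mu_a, mu_n = normalize(mu_anomalous), normalize(mu_normal)
    cos_a = sum(a * b for a, b in zip(mu_a, x))
    cos_n = sum(a * b for a, b in zip(mu_n, x))
    return kappa * (cos_a - cos_n)


# Hypothetical prototype directions and two segment features.
mu_normal, mu_anomalous = [1.0, 0.0], [0.0, 1.0]
score_anomalous = vmf_log_likelihood_ratio([0.0, 2.0], mu_anomalous, mu_normal, kappa=10.0)
score_normal = vmf_log_likelihood_ratio([3.0, 0.0], mu_anomalous, mu_normal, kappa=10.0)
```

A positive score means the feature direction is closer to the anomalous prototype, and a feature equidistant from both prototypes scores zero regardless of $\kappa$.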

cs.CV / 144 / 2605.08025

TRAS: An Interactive Software for Tracing Tree Ring Cross Sections

TRAS：一种用于追踪树木年轮横截面的交互式软件

Marichal, Henry, Passarella, Diego, Randall, Gregory

Abstract

Tree ring marking remains a key step in dendrometry and dendrochronology, but it is often performed manually, making the process time-consuming, subjective, and difficult to scale to large image datasets. We present the Tree Ring Analyzer Suite (TRAS), an open-source graphical software for automatic delineation, manual correction, and measurement of tree rings in wood cross-sectional images. TRAS integrates three complementary detection algorithms: the classical image-processing method CS-TRD and two deep-learning approaches, DeepCS-TRD and INBD. The interface allows users to refine automatic detections, remove false positives, and manually add missing rings. It also computes dendrochronological metrics such as earlywood and latewood areas, ring perimeter, equivalent ring width, and custom path-based ring-width measurements. TRAS was evaluated on 18 expertly annotated Pinus taeda L. cross-section images. DeepCS-TRD achieved the best automatic detection performance, with an F-score of 81.0% and precision of 86.4%. Automatic detection reduced the required manual correction effort to approximately 20% of ring boundaries. For one-dimensional ring-width measurements, TRAS showed excellent agreement with CooRecorder ($r > 0.99$). Common detection errors, such as jump propagation or false positives near knots, were easily corrected through the postprocessing interface. TRAS provides a flexible and reproducible solution for tree-ring analysis on Windows, macOS, and Linux. Code is available at https://hmarichal93.github.io/tras.

Chinese Translation

树木年轮标记仍然是树木测量学和树木年轮年代学中的关键步骤，但通常是手动进行，这使得该过程耗时、主观，并且难以扩展到大型图像数据集。我们提出了树木年轮分析套件（Tree Ring Analyzer Suite，TRAS），这是一款开源图形软件，用于木材横截面图像中树木年轮的自动划分、手动修正和测量。TRAS集成了三种互补的检测算法：经典的图像处理方法CS-TRD和两种深度学习方法DeepCS-TRD和INBD。该界面允许用户细化自动检测结果，去除假阳性，并手动添加缺失的年轮。它还计算树木年轮年代学指标，如早材和晚材面积、年轮周长、等效年轮宽度以及基于路径的自定义年轮宽度测量。TRAS在18幅专家标注的Pinus taeda L.横截面图像上进行了评估。DeepCS-TRD实现了最佳的自动检测性能，F-score为81.0%，精确度为86.4%。自动检测将所需的手动修正工作量减少到约20%的年轮边界。对于一维年轮宽度测量，TRAS与CooRecorder显示出极好的一致性（$r > 0.99$）。常见的检测错误，如跳跃传播或靠近结点的假阳性，可以通过后处理界面轻松修正。TRAS为Windows、macOS和Linux上的树木年轮分析提供了一种灵活且可重复的解决方案。代码可在https://hmarichal93.github.io/tras获取。
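
As a quick consistency check on the DeepCS-TRD figures quoted above, the recall implied by the reported F-score and precision can be recovered by inverting the harmonic mean. This is only a sketch: it assumes the reported F-score is the balanced F1 score, which the abstract does not state, and the function name `implied_recall` is illustrative.

```python
def implied_recall(f_score: float, precision: float) -> float:
    """Invert F1 = 2PR / (P + R) to solve for the recall R."""
    denom = 2.0 * precision - f_score
    if denom <= 0:
        raise ValueError("F-score and precision are inconsistent under F1")
    return f_score * precision / denom


def f1(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    return 2.0 * precision * recall / (precision + recall)


# Values taken from the abstract: F-score 81.0%, precision 86.4%.
recall = implied_recall(0.810, 0.864)
print(f"implied recall: {recall:.3f}")
```

Under the F1 assumption, the implied recall is roughly 76%, so the detector misses about one ring boundary in four while raising comparatively few false positives.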

cs.CV / 145 / 2605.08029

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

STARFlow2：桥接语言模型与归一化流以实现统一的多模态生成

Shen, Ying, Chen, Tianrong, Gao, Yuan, Zhang, Yizhe, Wang, Yuyang, Bautista, Miguel Ángel, Zhai, Shuangfei, Susskind, Joshua M., Gu, Jiatao

Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.

Chinese Translation

深度生成模型在文本和视觉领域迅速发展，促使统一的多模态系统的出现，这些系统能够理解、推理并生成交错的文本-图像序列。现有的大多数方法将自回归语言建模与基于扩散的图像生成器结合在一起，继承了因果文本生成与迭代视觉去噪之间的结构不匹配。我们观察到，自回归归一化流实际上是自回归变换器（Transformers），它们共享与大语言模型（LLMs）相同的因果掩码、KV缓存机制和从左到右的结构，使其成为真正统一的多模态生成的最自然范式。我们提出了STARFlow2，该模型基于Pretzel架构，通过残差跳跃连接将预训练的视觉语言模型（VLM）流与TarFlow流垂直交错，两者在相同的因果掩码下运行。结合深浅流设计和统一的FAE潜在空间，STARFlow2实现了缓存友好的交错生成，其中文本和视觉输出直接进入KV缓存而无需重新编码。实验结果在图像生成和多模态理解基准测试中表现出色，验证了自回归流作为统一多模态建模的可行基础。

cs.CV / 146 / 2605.08030

PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction

PET-Adapter：全角和有限角PET图像重建的测试时域适应

Yilmaz, Rüveyda, Wu, Yuli, Stegmaier, Johannes, Schulz, Volkmar

Abstract

Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.

Chinese Translation

正电子发射断层扫描（PET）图像重建本质上受到泊松噪声和物理退化因素的挑战，这在有限角度采集中更为严重。尽管深度学习方法展现出良好的性能，但在没有广泛再训练的情况下，它们对未见临床数据分布的泛化能力仍然有限。我们提出了PET-Adapter，这是一种针对仅在幻影数据上预训练的生成PET重建模型的测试时域适应框架。我们的方法使得能够适应具有不同解剖结构、示踪剂和扫描仪配置的临床数据集，而无需配对的真实值。PET-Adapter在适应过程中引入了逐层低秩解剖条件，并基于有序子集期望最大化的热启动方法，从物理信息重建初始化生成过程，将扩散步骤从50减少到2，而不影响质量。多个临床数据集的实验表明，在全角和有限角设置中，3D重建性能优越，突显了所提方法的临床可行性和计算效率。

cs.CV / 147 / 2605.08031

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

无物体幻觉的强化去学习框架用于视觉-语言模型

Jia, Kaidi, Lin, Yujie, Yang, Chengyi, Ma, Jiayao, Su, Jinsong

Abstract

Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.

Chinese Translation

视觉-语言模型（VLMs）引发了对隐私、版权和偏见的日益关注，这促使机器去学习以移除敏感知识。然而，现有的方法主要对语言解码器进行微调，导致表面遗忘，未能消除潜在的视觉表征，并且常常引入物体幻觉。我们提出了HFRU，一个在视觉编码器上操作的强化去学习框架，以实现深层语义移除。我们的方法采用两阶段策略，将对齐干扰与基于GRPO的优化结合，使用复合奖励，其中包括鼓励语义有效替代并减轻幻觉的抽象奖励。在物体识别和人脸身份任务上的实验表明，HFRU实现了超过98%的遗忘和保留性能，同时引入的物体幻觉微乎其微，显著优于之前的方法。我们的代码和实现细节可在 https://github.com/XMUDeepLIT/HFRU 获取。

cs.CV / 148 / 2605.08043

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

SCOPE：复杂图像生成的结构化分解与条件技能协调

Ren, Tianfei, Yan, Zhipeng, Zhao, Yiming, Fang, Zhen, Zeng, Yu, Zhang, Guohui, Xu, Hang, Ma, Xiaoxiao, Huang, Shiting, Xu, Ke, Huang, Wenxuan, Wang, Lionel Z., Chen, Lin, Chen, Zehui, Huang, Jie, Zhao, Feng

Abstract

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.

Chinese Translation

尽管文本到图像模型在视觉真实感方面取得了显著进展，但忠实实现复杂的视觉意图仍然具有挑战性，因为在基础、生成和验证过程中需要跟踪许多要求。我们将这些要求称为语义承诺，并将其生命周期的不连续性形式化为概念裂缝（Conceptual Rift），在这一过程中，承诺可能会在局部得到解决或检查，但在整个生成生命周期中未能保持可识别为相同的操作单元。为了解决这一问题，我们提出了SCOPE，一个以规范为指导的技能协调框架，它在不断发展的结构化规范中维护语义承诺，并有条件地调用围绕未解决或违反的承诺的检索、推理和修复技能。为了评估承诺级意图的实现，我们引入了Gen-Arena，一个带有实体和约束级规范的人类标注基准，以及实体优先通过率（Entity-Gated Intent Pass Rate, EGIP），这是一个严格的实体优先通过标准。SCOPE在Gen-Arena上显著优于所有评估的基线，达到了0.60的EGIP，并在WISE-V（0.907）和MindBench（0.61）上也取得了良好的结果，证明了持续承诺跟踪在复杂图像生成中的有效性。

cs.CV / 149 / 2605.08050

MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

MoCoTalk：具有自适应路由器的多条件扩散用于可控的虚拟人头生成

Ye, Xinyan, Deng, Jiankang, Edalat, Abbas

Abstract

Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.

Chinese Translation

虚拟人头生成需要对身份、头部姿态、面部表情和口部动态进行联合建模。现有方法通常仅处理这些因素的一个子集，并在涉及多个条件时依赖于固定权重或启发式融合。我们提出了MoCoTalk，这是一种多条件视频扩散框架，统一了四个互补的控制信号：参考图像、面部关键点、3DMM（3D Morphable Model）渲染的阴影网格和相应的语音音频。为了消除异质条件之间的破坏性干扰，我们引入了一种自适应多条件路由器，该路由器在四个条件流上计算通道级、时间步感知的门控，使得融合策略能够根据特征子空间和噪声水平的变化而变化。为了更好地捕捉与语音相关的面部动态，我们设计了一种口部增强阴影网格，这是一种基于3DMM的表示，解耦了头部运动、口部运动、表情和光照。该设计提供了时间一致的几何先验，并允许在推理时灵活地重新组合这些属性。我们进一步引入了一种唇部一致性损失，以增强视听对齐。大量实验表明，MoCoTalk在大多数结构、运动和感知指标上达到了最先进的性能，同时提供了单条件方法所不具备的属性级可控性。
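
The channel-wise, timestep-aware gating described for the Adaptive Multi-Condition Router can be pictured with a small sketch. It is not the authors' architecture: the linear form of the gate logits, the softmax over streams, and the parameter names `weights` and `time_scale` are all assumptions made for illustration.

```python
import math


def route(features, weights, time_scale, t):
    """Fuse S condition streams with per-channel softmax gates that depend on t.

    features[s][c]   : feature of stream s at channel c
    weights[s][c]    : hypothetical learned gate logit (timestep-independent part)
    time_scale[s][c] : hypothetical learned sensitivity of the logit to noise level t
    """
    n_streams, n_channels = len(features), len(features[0])
    fused, gates = [], []
    for c in range(n_channels):
        logits = [weights[s][c] + time_scale[s][c] * t for s in range(n_streams)]
        top = max(logits)                      # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        g = [e / total for e in exps]          # gates over streams sum to 1 per channel
        gates.append(g)
        fused.append(sum(g[s] * features[s][c] for s in range(n_streams)))
    return fused, gates


feats = [[1.0, 2.0], [3.0, 4.0]]               # two streams, two channels (toy values)
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused_equal, gates_equal = route(feats, zeros, zeros, t=0.5)
fused_noisy, _ = route(feats, zeros, [[2.0, 2.0], [0.0, 0.0]], t=10.0)
```

With all logits equal, each channel is a plain average of the streams. When one stream's logit grows with `t`, that stream dominates at high noise levels, which is the kind of noise-dependent fusion the abstract describes.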

cs.CV / 150 / 2605.08054

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

基于检索引导的扩散噪声优化的高度约束人类运动生成

Liu, Hanchao, Zhang, Fang-Lue, Zhang, Shining, Mu, Tai-Jiang, Hu, Shi-Min

Abstract

Generating human motion that satisfies customized zero-shot goal functions is a critical capability, enabling applications such as controllable character animation and behavior synthesis for virtual agents. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or a specified number of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by a retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the whole framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.

Chinese Translation

生成满足定制化零样本目标函数的人类运动是一项关键能力，这使得可控角色动画和虚拟代理的行为合成等应用成为可能。尽管当前的方法能够处理许多未见的约束，但在面对具有极具挑战性的时空限制（如严重的空间障碍或指定的步行步数）时，它们仍然表现不佳。为了使运动生成器能够应对这些高度约束的任务，我们提出了一种基于无训练扩散噪声优化框架的检索引导方法。其关键思想是在大型运动数据集中搜索可能满足困难约束的指导信息。我们引入关系任务解析，将目标约束进行分组，并识别出需要通过检索参考来处理的困难约束。通过结合随机噪声与检索噪声的奖励引导掩码，我们获得了更好的扩散噪声初始化。通过从这一改进的初始化中优化扩散噪声，我们成功解决了高度约束的生成任务。通过利用大型语言模型（LLM）进行关系任务解析，整个框架进一步能够自动推理出需要检索的内容，从而在无训练优化方案下提高移动代理的智能性。

cs.CV / 151 / 2605.08059

6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

基于关键点热图回归的6D姿态估计与RGB-D残差神经网络

Aljosevic, Ismail, Almasi, Amir Masoud, Parovic, Ana, Shafiei, Ashkan

Abstract

In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.

Chinese Translation

本文提出了一种基于关键点热图回归的6D姿态估计模块化框架。我们的方法结合了YOLOv10m进行目标检测，并使用基于ResNet18的网络从RGB图像中预测2D热图。从这些热图中提取的关键点通过PnP RANSAC算法用于估计6D物体姿态。我们比较了不同的关键点选择策略，以评估其对姿态精度的影响。此外，我们通过采用交叉融合架构将深度数据纳入基线，扩展了模型，使RGB和深度特征在多个阶段之间能够相互作用。我们进一步探索了一般训练改进方法，例如实验不同的激活函数和学习率调度策略，以提高模型性能。我们的最佳RGB-only模型在LINEMOD数据集上达到了84.50%的平均ADD基础准确率，而RGB-D融合模型则达到了92.41%。代码可在https://github.com/ameermasood/HeatNet获取。
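
The step between heatmap prediction and pose solving can be sketched as follows. This is a minimal illustration that assumes one heatmap per keypoint and a simple arg-max decoder. The abstract does not say how peaks are decoded, and the `bbox_xy` and `scale` parameters (detection-crop offset and crop pixels per heatmap cell) are hypothetical.

```python
def heatmap_to_keypoint(heatmap):
    """Return (x, y, score) of the maximum of a 2D heatmap given as a list of rows."""
    best = (0, 0, float("-inf"))
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best


def extract_keypoints(heatmaps, bbox_xy, scale):
    """Map each heatmap peak back to full-image pixel coordinates.

    bbox_xy : top-left corner (x0, y0) of the detector's crop in the image
    scale   : crop pixels per heatmap cell (assumed uniform in x and y)
    """
    x0, y0 = bbox_xy
    keypoints = []
    for heatmap in heatmaps:
        x, y, score = heatmap_to_keypoint(heatmap)
        keypoints.append((x0 + x * scale, y0 + y * scale, score))
    return keypoints


toy_heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.2],
]
points = extract_keypoints([toy_heatmap], bbox_xy=(100, 50), scale=4)
```

The resulting 2D points, paired with the matching 3D keypoints on the object model and the camera intrinsics, are what a PnP RANSAC solver such as OpenCV's `cv2.solvePnPRansac` would take as input.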

cs.CV / 152 / 2605.08063

Flow-OPD: On-Policy Distillation for Flow Matching Models

Flow-OPD：用于流匹配模型的在线策略蒸馏

Fang, Zhen, Huang, Wenxuan, Zeng, Yu, Zhao, Yiming, Chen, Shuang, Feng, Kaituo, Lin, Yunlong, Chen, Lin, Chen, Zehui, Cao, Shaosheng, Zhao, Feng

Abstract

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.

Chinese Translation

现有的流匹配（Flow Matching, FM）文本到图像模型在多任务对齐中面临两个关键瓶颈：由标量值奖励引起的奖励稀疏性，以及由于联合优化异构目标而产生的梯度干扰，这两者共同导致了竞争指标的“跷跷板效应”和普遍的奖励操控。受到在线策略蒸馏（On-Policy Distillation, OPD）在大型语言模型社区成功的启发，我们提出了Flow-OPD，这是第一个将在线策略蒸馏整合到流匹配模型中的统一后训练框架。Flow-OPD采用两阶段对齐策略：首先通过单奖励的GRPO微调培养领域专门的教师模型，使每个专家能够在孤立状态下达到其性能上限；然后通过基于流的冷启动方案建立一个稳健的初始策略，并通过在线采样、任务路由标记和密集轨迹级监督的三步协调，将异构专业知识无缝整合到一个学生模型中。我们进一步引入了流形锚定正则化（Manifold Anchor Regularization, MAR），利用任务无关的教师提供全数据监督，将生成锚定到高质量流形，有效减轻了在纯粹的强化学习驱动对齐中常见的美学退化。基于Stable Diffusion 3.5 Medium，Flow-OPD将GenEval分数从63提高到92，将OCR准确率从59提高到94，整体提升约10分，相较于传统的GRPO，同时保持图像保真度和人类偏好对齐，并展现出一种新兴的“超越教师”效应。这些结果确立了Flow-OPD作为构建通用文本到图像模型的可扩展对齐范式。

cs.CV / 153 / 2605.08064

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Proxy3D：通过语义聚类和对齐实现高效的视觉-语言模型三维表示

Jiang, Jerry, Sun, Haowen, Gudovskiy, Denis, Nakata, Yohei, Okuno, Tomoyuki, Keutzer, Kurt, Zheng, Wenzhao

Abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest, driven by the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose Proxy3D, a method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then cluster them in a semantic-aware manner to obtain a set of proxies in 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adapt the VLM to the proposed 3D proxy representations. While using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.

Chinese Translation

视觉-语言模型（VLMs）中的空间智能引起了研究者的关注，因为在三维世界中进行推理的实际需求。尽管已有一些令人鼓舞的结果，但大多数现有方法仍遵循VLMs中的传统二维流程，并使用像素对齐的表示来处理视觉模态。然而，基于对应关系的模型在隐式三维场景理解方面常常无法实现空间一致性，而基于表示的模型在具有三维几何先验的情况下在视觉序列序列化中效率不足。为了解决这一问题，我们提出了一种Proxy3D方法，该方法为视觉模态提供紧凑而全面的三维代理表示。仅给定视频帧作为输入，我们采用语义和几何编码器提取场景特征，然后进行语义感知聚类，以在三维空间中获得一组代理。为了进行表示对齐，我们进一步整理了SpaceSpan数据集，并应用多阶段训练，以将所提出的三维代理表示与VLM结合。当使用较短的序列来获取视觉信息时，我们的方法在三维视觉问答、视觉定位和一般空间智能基准测试中实现了具有竞争力或先进的性能。

cs.CV / 154 / 2605.08073

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

EmambaIR：用于事件引导图像重建的高效视觉状态空间模型

Yu, Wei, Qian, Yunhang

Abstract

Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity (e.g., $O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR

Chinese Translation

近期基于事件的图像重建方法主要依赖卷积神经网络（CNNs）和视觉变换器（ViTs）来处理互补的事件信息。然而，这些架构面临着根本性的限制：CNNs往往无法捕捉全局特征相关性，而ViTs则会导致二次计算复杂度（例如，$O(n^2)$），妨碍其在高分辨率场景中的应用。为了解决这些瓶颈，我们提出了EmambaIR，一种高效的视觉状态空间模型，旨在利用空间稀疏和时间连续的事件流进行图像重建。我们的框架引入了两个关键组件：跨模态Top-k稀疏注意力模块（TSAM）和门控状态空间模块（GSSM）。TSAM高效地执行像素级的top-k稀疏注意力，以引导跨模态交互，产生丰富而稀疏的融合特征。随后，GSSM利用非线性门控单元增强线性复杂度（$O(n)$）状态空间模型（SSMs）的时间表示，有效捕捉全局上下文依赖关系，而无需典型的计算开销。在六个数据集上进行的广泛实验，涵盖运动去模糊、去雨和高动态范围（HDR）增强三种不同的图像重建任务，证明EmambaIR显著优于最先进的方法，同时在内存消耗和计算成本上大幅降低。源代码和数据可在以下网址公开获取：https://github.com/YunhangWickert/EmambaIR
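
The idea behind top-k sparse attention can be shown for a single query. The sketch below is a generic formulation, not the TSAM module itself: using scaled dot-product scores and treating the query as an image feature with event features as keys and values are assumptions.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query vector to only its k highest-scoring keys."""
    dim = len(query)
    scores = [
        sum(q * kk for q, kk in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    # Keep the indices of the k largest scores; all other keys get zero weight.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    weights = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(weights.values())
    out = [0.0] * len(values[0])
    for i in top:
        for j, v in enumerate(values[i]):
            out[j] += weights[i] / total * v
    return out, sorted(top)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
out_k1, kept_k1 = topk_sparse_attention(query, keys, values, k=1)
out_k2, kept_k2 = topk_sparse_attention(query, keys, values, k=2)
```

Because the softmax runs over only `k` entries, the per-query cost of the weighted sum no longer grows with the number of keys, although this naive version still scores every key before selecting.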

cs.CV / 155 / 2605.08078

Normalizing Trajectory Models

归一化轨迹模型

Gu, Jiatao, Chen, Tianrong, Shen, Ying, Berthelot, David, Zhai, Shuangfei, Susskind, Josh

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.

Chinese Translation

基于扩散的模型将采样分解为许多小的高斯去噪步骤——这一假设在生成被压缩为少数粗略过渡时失效。现有的少步方法通过蒸馏、一致性训练或对抗目标来解决这一问题，但在此过程中牺牲了似然框架。我们提出了归一化轨迹模型（Normalizing Trajectory Models, NTM），该模型将每个反向步骤建模为一个具有精确似然训练的表达性条件归一化流。在架构上，NTM在每个步骤中结合了浅层可逆块，并在整个轨迹上使用深层并行预测器，形成一个可从头训练或从预训练流匹配模型初始化的端到端网络。其精确的轨迹似然进一步使自蒸馏成为可能：一个在模型自身得分上训练的轻量级去噪器在四个步骤中产生高质量样本。在文本到图像的基准测试中，NTM在仅仅四个采样步骤中匹配或超越了强大的图像生成基线，同时独特地保持了生成轨迹上的精确似然。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2605.06671

GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

GraphDC：一种用于可扩展图算法推理的分治多智能体系统

Li, Wenjin, Cui, Jiaming

Abstract

Large Language Models (LLMs) have demonstrated strong potential on many mathematical problems. However, their performance on graph algorithmic tasks remains unsatisfactory, since graphs have more complex topology and often require systematic multi-step reasoning, especially on larger graphs. Motivated by this gap, we propose GraphDC, a Divide-and-Conquer multi-agent framework for scalable graph algorithm reasoning. Specifically, inspired by Divide-and-Conquer design, GraphDC decomposes an input graph into smaller subgraphs, assigns each subgraph to a specialized agent for local reasoning, and uses a master agent to integrate the local outputs with inter-subgraph information to produce the final solution. This hierarchical design reduces the reasoning burden on individual agents, alleviates computational bottlenecks, and improves robustness on large graph instances. Extensive experiments show that GraphDC consistently outperforms existing methods on graph algorithm reasoning across diverse tasks and scales, especially on larger instances where direct end-to-end reasoning is less reliable.

Chinese Translation

大型语言模型（LLMs）在许多数学问题上展现了强大的潜力。然而，它们在图算法任务上的表现仍然不尽人意，因为图在拓扑上自然更为复杂，通常需要系统的多步推理，尤其是在处理较大图时。基于这一差距，我们提出了GraphDC，一种用于可扩展图算法推理的分治多智能体框架。具体而言，GraphDC受到分治设计的启发，将输入图分解为更小的子图，将每个子图分配给专门的智能体进行局部推理，并使用主智能体整合局部输出与子图间信息，以生成最终解决方案。这种分层设计减轻了单个智能体的推理负担，缓解了计算瓶颈，并提高了在大型图实例上的鲁棒性。大量实验表明，GraphDC在各种任务和规模的图算法推理中始终优于现有方法，尤其是在直接端到端推理不太可靠的较大实例上。
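
The decompose, solve locally, then merge pattern described above can be illustrated on one concrete task, counting connected components. In this sketch plain functions stand in for the LLM agents, and the round-robin partition and union-find merge are assumptions chosen for brevity, not GraphDC's actual decomposition or integration strategy.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def local_components(nodes, edges):
    """'Worker' step: label components using only edges inside one subgraph."""
    parent = {v: v for v in nodes}
    for u, v in edges:
        parent[find(parent, u)] = find(parent, v)
    return {v: find(parent, v) for v in nodes}


def count_components(nodes, edges, n_parts):
    # Divide: round-robin split of the nodes into n_parts subgraphs.
    parts = [nodes[i::n_parts] for i in range(n_parts)]
    owner = {v: i for i, part in enumerate(parts) for v in part}
    labels = {}
    for i, part in enumerate(parts):
        inner = [(u, v) for u, v in edges if owner[u] == i and owner[v] == i]
        labels.update(local_components(part, inner))
    # Conquer: the 'master' merges local labels using the cross-subgraph edges.
    parent = {root: root for root in set(labels.values())}
    for u, v in edges:
        if owner[u] != owner[v]:
            parent[find(parent, labels[u])] = find(parent, labels[v])
    return len({find(parent, root) for root in parent})


nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (3, 4)]
n_single = count_components(nodes, edges, n_parts=1)
n_split = count_components(nodes, edges, n_parts=2)
```

The answer is the same whether the graph is handled whole or split in two. That is the property a master agent has to preserve when it reconciles local outputs with inter-subgraph information.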

cs.AI / 2 / 2605.06672

More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

更多思考，更多偏见：基于长度的推理模型位置偏见

Wang, Xiao

Abstract

Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles. A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism. We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.

Chinese Translation

链式思维（Chain-of-thought, CoT）推理和诸如DeepSeek-R1的推理调优模型通常被认为通过仔细思考来减少浅层启发式偏见。我们在多选问答中的位置偏见上对此进行了测试，发现了不同的结果：在任何具有推理能力的模型中，问题的每个位置偏见与推理轨迹的长度成正比。在MMLU、ARC-Challenge和GPQA上对十三种推理模式配置（两个R1蒸馏的7-8B模型、两个使用CoT提示的基础模型，以及671B的DeepSeek-R1）进行的研究中，控制准确性后，十二种配置显示出轨迹长度与位置偏见分数（Position Bias Score, PBS）之间的正相关，相关系数范围为0.11到0.41（均为p < 0.05）。所有十二种开放权重的推理模式配置在长度四分位数中显示出PBS单调增加。截断干预提供了因果证据：从轨迹后期恢复的继续推理越来越可能偏向于位置优选选项（在绝对位置区间中，R1-Qwen-7B的偏向率从16%增加到32%）。在671B时，整体PBS降至0.019，但在最长的四分位数中长度效应仍然显现（PBS = 0.071），这表明准确性限制了长度驱动偏见的表现，而不是消除其潜在机制。我们还发现，直接答案位置偏见是一个不同的现象，具有不同的特征（在Llama-Instruct-direct中强，在Qwen-Instruct-direct中弱，并且与轨迹长度无关）：CoT推理用长度累积的偏见替代了这一基线偏见。我们的结果表明，具有推理能力的模型在多选问答评估流程中不应默认被视为顺序稳健，并提供了一套诊断工具（PBS、承诺变化点、有效切换、截断探测）用于审计推理模型中的位置偏见。
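
The statistic reported above, a partial correlation between trajectory length and position bias after controlling for accuracy, can be computed with the standard first-order formula. The sketch below assumes a single control variable and a linear adjustment. The abstract does not define how per-question PBS is computed, so the input vectors are made-up toy numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for one covariate z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data only: reasoning length, a bias score, and a correctness indicator.
length = [120.0, 340.0, 200.0, 510.0, 430.0, 610.0]
bias = [0.05, 0.20, 0.10, 0.35, 0.22, 0.40]
correct = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
r_partial = partial_corr(length, bias, correct)
```

A positive `r_partial` would indicate that longer trajectories go with higher bias even among questions of similar accuracy, which is the pattern the abstract reports for twelve of the thirteen configurations.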

cs.AI / 3 / 2605.06682

Fast and Effective Redistricting Optimization via Composite-Move Tabu Search

通过复合移动禁忌搜索实现快速有效的重划选区优化

Jin, Hai, Guo, Diansheng

Abstract

Spatial redistricting is a practical combinatorial optimization problem that demands high-quality solutions, rapid turnaround, and flexibility to accommodate multi-criteria objectives and interactive refinement. A central challenge is the contiguity constraint: enforcing contiguity in integer-programming or heuristic search can severely shrink the feasible neighborhood, weaken exploration, and trap the search in poor local optima. We introduce a composite-move Tabu search (CM-Tabu) that systematically expands the feasible neighborhood space in Tabu search while preserving contiguity. When a boundary unit cannot be reassigned individually without disconnecting its district, our method identifies a minimal set of units that can move together, or a pair of units (or sets of units) that can be switched, as a contiguity-preserving composite move. Candidate single-unit and composite moves are generated in linear time by analyzing each district's contiguity graph using articulation points and biconnected components. Extensive experiments demonstrate that the proposed approach substantially improves solution quality, run-to-run robustness, and computational efficiency relative to traditional Tabu search and other baselines. For example, in the Philadelphia case, the approach can consistently attain the theoretical global optimum in population-equality and support multi-criteria trade-offs. CM-Tabu delivers optimization performance suitable for real-world practices and decision-support workflows.

Chinese Translation

空间重划选区是一个实际的组合优化问题，要求高质量的解决方案、快速的周转时间以及灵活性以适应多标准目标和互动精细化。一个核心挑战是连通性约束：在整数规划或启发式搜索中强制连通性可能会严重缩小可行邻域，削弱探索能力，并使搜索陷入较差的局部最优。我们提出了一种复合移动禁忌搜索（Composite-Move Tabu search, CM-Tabu），该方法在保持连通性的同时，系统性地扩展了禁忌搜索中的可行邻域空间。当一个边界单元无法单独重新分配而不导致其选区断开时，我们的方法识别出一组可以一起移动的最小单元集，或一对可以互换的单元（或单元集），作为保持连通性的复合移动。通过分析每个选区的连通图，利用关节点和双连通分量，候选单元和复合移动在线性时间内生成。大量实验表明，所提出的方法相较于传统的禁忌搜索和其他基准方法，显著提高了解决方案质量、运行间的鲁棒性和计算效率。例如，在费城案例中，该方法能够持续达到人口平衡的理论全局最优，并支持多标准权衡。CM-Tabu提供了适合现实世界实践和决策支持工作流程的优化性能。
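
The contiguity test behind single-unit moves comes down to finding articulation points: a unit that is a cut vertex of its district's adjacency graph cannot leave on its own without disconnecting the district. Below is a minimal sketch of the classic depth-first-search method for a simple undirected graph. The toy district and unit numbering are hypothetical, and the composite-move construction from biconnected components is not shown.

```python
def articulation_points(adj):
    """Return the set of cut vertices of an undirected graph {node: [neighbors]}."""
    disc, low, cut = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])       # back edge to an ancestor
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)                      # v's subtree cannot bypass u
        if parent is None and children > 1:
            cut.add(u)                              # root with several DFS subtrees

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return cut


# Toy district: units 1-2-3 form a chain, and units 3, 4, 5 form a triangle.
district = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
blocked = articulation_points(district)
movable = set(district) - blocked
```

In this example units 2 and 3 are blocked, so a plain Tabu search could only reassign units 1, 4, or 5 individually. Grouping a blocked unit with the units that depend on it is what lets a composite move recover the rest of the neighborhood.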

cs.AI / 4 / 2605.06690

State Representation and Termination for Recursive Reasoning Systems

递归推理系统的状态表示与终止

Guha, Debashis, Mukherjee, Amritendu, Kukreja, Sanjay, Kumar, Tarun

Abstract

Recursive reasoning systems alternate between acquiring new evidence and refining an accumulated understanding. Two design choices are typically left implicit: how to represent the evolving reasoning state, and when to stop iterating. This paper addresses both. We represent the reasoning state as an epistemic state graph encoding extracted claims, evidential relations, open questions, and confidence weights. We define the order-gap as the distance between the states reached by expand-then-consolidate versus consolidate-then-expand; a small order-gap suggests that the two orderings agree and further iteration is unlikely to help. Our main result gives a necessary and sufficient condition for the linearised order-gap to be non-degenerate near the fixed point, showing when the criterion is informative rather than algebraically vacuous. This is a local condition, not a global convergence guarantee. We apply the framework to recursive reasoning systems and sketch its application to agent loops, tree-of-thought reasoning, theorem proving, and continual learning.

Chinese Translation

递归推理系统在获取新证据和完善累积理解之间交替进行。通常有两个设计选择是隐含的：如何表示不断演变的推理状态，以及何时停止迭代。本文针对这两个问题进行了探讨。我们将推理状态表示为一个表征提取出的主张、证据关系、未解问题和置信权重的认知状态图。我们定义了顺序差距（order-gap），即通过扩展后整合与整合后扩展所达到的状态之间的距离；较小的顺序差距表明这两种顺序是一致的，进一步的迭代不太可能有所帮助。我们的主要结果给出了线性化顺序差距在固定点附近非退化的必要和充分条件，表明何时该标准是有信息量的，而不是代数上空洞的。这是一个局部条件，而不是全局收敛的保证。我们将该框架应用于递归推理系统，并简要勾勒了其在智能体循环、思维树推理、定理证明和持续学习中的应用。
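
The order-gap definition can be made concrete with a toy state. This sketch replaces the epistemic state graph with a flat vector of confidence weights and uses the Euclidean norm as the distance. Both are simplifying assumptions, and the `expand` and `consolidate` operators are invented for illustration.

```python
import math


def order_gap(expand, consolidate, state):
    """Distance between expand-then-consolidate and consolidate-then-expand."""
    a = consolidate(expand(state))
    b = expand(consolidate(state))
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def expand(state):
    """Toy 'acquire evidence' step: raise every confidence weight by 0.1."""
    return [w + 0.1 for w in state]


def consolidate(state):
    """Toy 'refine understanding' step: shrink every weight towards 0.5."""
    return [0.5 + 0.8 * (w - 0.5) for w in state]


def identity(state):
    return list(state)


state = [0.2, 0.4, 0.6, 0.8]
gap = order_gap(expand, consolidate, state)
gap_commuting = order_gap(expand, identity, state)
```

Here the two orderings differ by 0.02 in every component, so the gap is 0.04 for four weights. Operators that commute give a gap of exactly zero, the case in which the stopping criterion suggests further iteration is unlikely to help.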

cs.AI / 5 / 2605.06696

Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

多智能体人工智能中的隐性联盟：来自内部表征的谱诊断

Berg, Cameron, Schneider, Susan L., Bailey, Mark M.

Abstract

Collections of interacting AI agents can form coalitions, creating emergent group-level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling from spurious similarity, as consequential coalitions may form at the level of internal representations before any overt behavioral change is apparent. Here, we introduce a practical method for detecting coalition structure from the internal neural representations of multi-agent systems. The approach constructs a pairwise mutual-information graph from the hidden states of agents and applies spectral partitioning to identify the most salient coalition boundary. We validate this method in two domains. First, in multi-agent reinforcement learning environments, the method successfully recovers programmed hierarchical and dynamic coalition structures and correctly rejects false positives arising from behavioral coordination without informational coupling. Second, using a large language model, the method identifies coalition structures implied by descriptive prompts, tracks dynamic team reassignments, and reveals a representational hierarchy where explicit labels dominate over conflicting interaction patterns. Across both settings, the recovered partition reveals subgroup organization that a scalar cross-agent mutual-information measure cannot distinguish. The results demonstrate that analyzing hidden-state mutual information through spectral partitioning provides a scalable diagnostic for identifying representational coalitions, offering a valuable tool for monitoring emergent structure in distributed AI systems.

Chinese Translation

交互的人工智能代理集合可以形成联盟，创造出对人工智能安全性和对齐至关重要的群体层级组织。然而，仅仅观察代理行为往往不足以区分真实的信息耦合与虚假的相似性，因为重要的联盟可能在内部表征层面形成，而在任何明显的行为变化出现之前。本文介绍了一种实用的方法，用于从多智能体系统的内部神经表征中检测联盟结构。该方法从代理的隐状态构建成对互信息图，并应用谱划分来识别最显著的联盟边界。我们在两个领域验证了该方法。首先，在多智能体强化学习环境中，该方法成功恢复了编程的层级和动态联盟结构，并正确拒绝了因行为协调而产生的虚假正例，这些虚假正例并没有信息耦合。其次，使用大型语言模型，该方法识别了由描述性提示暗示的联盟结构，跟踪动态团队重新分配，并揭示了一个表征层级，其中显式标签主导于相互冲突的互动模式。在这两种设置中，恢复的划分揭示了一个子群体组织，而标量的跨代理互信息度量无法区分。结果表明，通过谱划分分析隐状态互信息提供了一种可扩展的诊断工具，用于识别表征联盟，为监测分布式人工智能系统中的新兴结构提供了有价值的工具。

cs.AI / 6 / 2605.06702

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

CASCADE：在部署期间基于案例的持续适应大型语言模型

Guo, Siyuan, Du, Yali, Chen, Hechang, Chang, Yi, Wang, Jun

Abstract

Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid separation between training and deployment, after which learning effectively ceases. This limitation contrasts with natural intelligence, which continually adapts through interaction with its environment. In this paper, we formalise deployment-time learning (DTL) as the third stage in the LLM lifecycle that enables LLM agents to improve from experience during deployment without modifying model parameters. We present CASCADE (CASe-based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an explicit, evolving episodic memory. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration-exploitation trade-offs and establishing no-regret guarantees over long-term interactions. This design allows agents to accumulate, select, and refine task-relevant cases, transforming past experience into actionable knowledge. Across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro-averaged success rate by 20.9% over zero-shot prompting while consistently outperforming gradient-based and memory-based baselines. By reframing deployment as an adaptive learning process, this work establishes a foundation for continually improving AI systems.

Chinese Translation

大型语言模型（LLMs）已成为现代人工智能的核心基础，但其生命周期仍受到训练与部署之间严格分隔的限制，之后学习实际上停止。这一限制与自然智能形成对比，后者通过与环境的互动不断适应。本文将部署时学习（DTL）形式化为LLM生命周期中的第三个阶段，使LLM代理能够在部署期间通过经验进行改进，而无需修改模型参数。我们提出了CASCADE（部署期间基于案例的持续适应），这是一个通用且原则性的框架，为LLM代理提供了一个明确的、不断发展的情节记忆。CASCADE将经验重用形式化为一个上下文赌博机问题，从而实现原则性的探索-利用权衡，并在长期互动中建立无悔保证。该设计使代理能够积累、选择和提炼与任务相关的案例，将过去的经验转化为可操作的知识。在涵盖医疗诊断、法律分析、代码生成、网络搜索、工具使用和具身互动等16个不同任务中，CASCADE在零-shot 提示的基础上提高了20.9%的宏平均成功率，同时始终优于基于梯度和基于记忆的基线。通过将部署重新框架为一个适应性学习过程，本研究为持续改进人工智能系统奠定了基础。

cs.AI / 7 / 2605.06716

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

从存储到体验：大规模语言模型代理记忆机制演变的调查

Luo, Jinghao, Tian, Yuchen, Cao, Chuxue, Luo, Ziyang, Lin, Hongzhan, Li, Kaixin, Kong, Chuyi, Yang, Ruichao, Ma, Jing

Abstract

Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these systems, current research remains fragmented, oscillating between operating system engineering and cognitive science. This theoretical divide prevents a unified view of technological synthesis and a coherent evolutionary perspective. To bridge this gap, this survey proposes a novel evolutionary framework for LLM agent memory mechanisms, formalizing the development process into three stages: Storage (trajectory preservation), Reflection (trajectory refinement), and Experience (trajectory abstraction). We first formally define these three stages before analyzing the three core drivers of this evolution: the necessity for long-range consistency, the challenges in dynamic environments, and the ultimate goal of continual learning. Furthermore, we specifically explore two transformative mechanisms in the frontier Experience stage: proactive exploration and cross-trajectory abstraction. By synthesizing these disparate views, this work offers robust design principles and a clear roadmap for the development of next-generation LLM agents.

Chinese Translation

基于大规模语言模型（LLM）的代理通过整合外部工具和规划能力，根本性地重塑了人工智能。尽管记忆机制已成为这些系统的架构基石，但当前的研究仍然支离破碎，徘徊于操作系统工程和认知科学之间。这一理论分歧阻碍了对技术综合的统一视角和连贯的演变视角。为了解决这一问题，本调查提出了一种新颖的LLM代理记忆机制演变框架，将发展过程形式化为三个阶段：存储（轨迹保存）、反思（轨迹精炼）和体验（轨迹抽象）。我们首先正式定义这三个阶段，然后分析这一演变的三个核心驱动因素：对长期一致性的需求、动态环境中的挑战，以及持续学习的最终目标。此外，我们特别探讨了前沿体验阶段中的两种变革性机制：主动探索和跨轨迹抽象。通过综合这些不同的视角，本研究提供了强有力的设计原则和清晰的下一代LLM代理发展的路线图。

cs.AI / 8 / 2605.06723

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

语言模型何时做出承诺？一种关于言语化前承诺的有限答案理论

Zhang, Long, Chen, Wei-neng, Wei, Feng-feng, Qin, Zi-bo

Abstract

Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model's answer preference became stable. We study this question through a narrow computable object: \emph{finite-answer preference stabilization}. For a model state and specified answer verbalizers, we project the model's own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, $\delta(\xi)=S_\theta(\mathrm{yes}\mid\xi)-S_\theta(\mathrm{no}\mid\xi)$. This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes. In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with 17--31 token mean lead in the main templates and positive, shorter lead in a parser-clean replication. The signal tracks the model's eventual output rather than truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information without a single invariant coordinate. Diagnostics separate the measurement from online stopping, verbalizer-free belief, and causal answer control; exact steering shows local sensitivity of $\delta$ but not reliable generation control.
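The binary log-odds code and the derived timing quantities can be illustrated with a short sketch. Several pieces are assumptions rather than the paper's definitions: $S_\theta$ is taken to pool the log-probabilities of each answer's verbalizers with log-sum-exp, stabilization is taken to be the earliest position from which the sign of $\delta$ matches its final sign, and lead is the parser onset minus that position. All function names and numbers are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_odds(yes_logprobs, no_logprobs):
    """delta = S(yes) - S(no), pooling verbalizer log-probabilities (assumed)."""
    return logsumexp(yes_logprobs) - logsumexp(no_logprobs)

def stabilization_time(deltas):
    """Earliest index from which the sign of delta equals its final sign."""
    final_positive = deltas[-1] > 0
    t = len(deltas) - 1
    while t > 0 and (deltas[t - 1] > 0) == final_positive:
        t -= 1
    return t

def lead(deltas, parser_onset):
    """Tokens between preference stabilization and the parseable answer."""
    return parser_onset - stabilization_time(deltas)

# Hypothetical per-token trace of delta and a hypothetical parser onset at index 5.
trace = [0.2, -0.1, 0.4, 0.9, 1.5, 2.0]
example_lead = lead(trace, parser_onset=5)
```

In this toy trace the preference last changes sign at index 1, so it is stable from index 2 and leads the parseable answer by three tokens.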

Chinese Translation

语言模型在给出最终答案之前通常会生成推理，但可见的答案并未揭示模型的答案偏好何时变得稳定。我们通过一个狭窄的可计算对象来研究这个问题：\emph{有限答案偏好稳定}。对于模型状态和指定的答案表达者，我们将模型自身的延续概率投影到有限答案集合上；在二元任务中，这产生了一个精确的对数赔率编码，$\delta(\xi)=S_\theta(\mathrm{yes}\mid\xi)-S_\theta(\mathrm{no}\mid\xi)$。这个目标定义了基于解析器的答案起始时间、回顾性稳定时间和领先量，且不依赖贪婪展开或学习探针。在与Qwen3-4B-Instruct的控制延迟裁决任务中，上下文有限答案投影在答案可解析之前就已稳定，主要模板中的平均领先时间为17到31个标记，而在解析器清晰的复制中则表现出为正但更短的领先时间。该信号跟踪模型的最终输出而非真实值，可以从紧凑的隐藏摘要中线性恢复，部分可与光标进度分离，并作为共享信息转移而不依赖于单一的不变坐标。诊断将测量与在线停止、无表达者信念和因果答案控制分开；精确引导显示了$\delta$的局部敏感性，但未能实现可靠的生成控制。

cs.AI / 9 / 2605.06761

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

Weblica：可扩展且可重复的视觉网络代理训练环境

Kar, Oğuzhan Fatih, Bachmann, Roman, Gong, Yuanzheng, Larsen, Anders Boesen Lindbo, Dehghan, Afshin

Abstract

The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.

Chinese Translation

网络复杂、开放且不断变化，这使得为视觉网络代理扩展训练数据变得具有挑战性。现有的数据收集尝试仅限于用于监督微调的离线轨迹或少数模拟环境的强化学习（RL）训练，因此未能捕捉网络的多样性。我们提出了Weblica（Web Replica），一个构建可重复和可扩展网络环境的框架。我们的框架利用了1）HTTP级缓存，以捕获和重放稳定的视觉状态，同时保留交互行为，以及2）基于大型语言模型（LLM）的环境合成，基于真实网站和核心网络导航技能。使用该框架，我们将强化学习训练扩展到数千个多样化的环境和任务。我们的最佳模型Weblica-8B在多个网络导航基准测试中超越了相似规模的开放权重基线，同时使用更少的推理步骤，随着额外的测试时间计算资源的增加而表现良好，并且与API模型具有竞争力。

cs.AI / 10 / 2605.06772

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning

何时批评能改善人工智能辅助的理论物理？SCALAR：结构化批评者-行动者循环用于自主推理

Niarchos, Vasilis, Papageorgakis, Constantinos, Stapleton, Alexander G., Trifinopoulos, Sokratis

Abstract

As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic--Actor Loop for AI Reasoning), an Actor--Critic--Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor--Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor--Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor--Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.

Chinese Translation

随着大型语言模型（LLMs）在研究级物理推理任务中展现出越来越大的潜力，以及自主人工智能的普及，一个实际问题随之而来：研究人员与智能体之间的互动如何影响结果？我们使用SCALAR（结构化批评者-行动者循环用于人工智能推理）进行研究，这是一个应用于量子场论和弦理论问题的行动者-批评者-评估者管道。行动者提出解决方案，批评者提供迭代反馈，而独立评估者则根据参考解决方案评估记录。我们改变行动者的人格、批评者的反馈策略以及行动者模型的家族和规模。多轮对话在整个过程中优于单次尝试，但改进的机制和不同提示选择的价值在很大程度上依赖于行动者-批评者的配对。在同一模型家族内增加规模（例如，从8B参数的DeepSeek-R1变体到70B的DeepSeek-R1）改善了一些简单问题的表现，但并未消除我们观察到的最难瓶颈。在不对称的行动者-批评者设置中（例如，一个轻量级的Haiku行动者由一个更强的Sonnet批评者指导），批评者的反馈策略显得尤为重要，建设性的反馈改善了平均得分结果。在同家族的行动者-批评者设置中，策略效果较弱：宽松的反馈有时更受青睐，而严格和对抗性的反馈并无益处。综合来看，SCALAR提供了一个受控的测试平台，用于评估哪些互动结构有助于或阻碍人工智能驱动的科学发现。

cs.AI / 11 / 2605.06812

Towards Security-Auditable LLM Agents: A Unified Graph Representation

迈向可安全审计的LLM代理：统一图表示

Li, Chaofan, Zhang, Lyuye, Zhai, Jintao, Feng, Siyue, Yang, Xichun, Wang, Huahao, Dou, Shihan, Ji, Yu, Hu, Yutao, Wu, Yueming, Liu, Yang, Zou, Deqing

Abstract

LLM-based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi-agent collaboration. However, this semantics-driven execution paradigm creates a severe semantic gap between low-level physical events and high-level execution intent, making post-hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive-state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent-BOM, a unified structural representation for agent security auditing. Agent-BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long-term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent-BOM, we develop a graph-query-based paradigm for path-level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent-BOM from live executions. Evaluation on representative real-world agentic attack scenarios shows that Agent-BOM can reconstruct stealthy attack chains, including cross-session memory poisoning and tool misuse, capability supply-chain hijacking and unexpected code execution, multi-agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent-BOM provides a unified and auditable foundation for root-cause analysis and security adjudication in complex agentic ecosystems.
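The idea of turning execution traces into queryable audit paths can be sketched with a tiny attributed directed graph. This is only an illustration under stated assumptions: the node names, the `untrusted` and `privileged` attributes, and the functions `audit_paths` and `risky_paths` are hypothetical and are not taken from Agent-BOM or the OWASP Agentic Top 10.

```python
def audit_paths(edges, start, goal):
    """Return all simple directed paths from start to goal (depth-first)."""
    paths, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        for nxt in edges.get(node, []):
            if nxt not in path:  # avoid revisiting nodes on this path
                stack.append((nxt, path + [nxt]))
    return paths

def risky_paths(edges, attrs):
    """Paths that lead from an untrusted node to a privileged action."""
    sources = [n for n, a in attrs.items() if a.get("untrusted")]
    sinks = [n for n, a in attrs.items() if a.get("privileged")]
    return [p for s in sources for g in sinks for p in audit_paths(edges, s, g)]

# Hypothetical graph: static capability nodes and runtime semantic nodes.
attrs = {
    "tool:web_fetch": {"layer": "static", "untrusted": True},
    "memory:notes": {"layer": "static"},
    "goal:summarize": {"layer": "runtime"},
    "reasoning:plan": {"layer": "runtime"},
    "action:shell_exec": {"layer": "runtime", "privileged": True},
    "action:reply": {"layer": "runtime"},
}
edges = {
    "tool:web_fetch": ["memory:notes"],
    "memory:notes": ["reasoning:plan"],
    "goal:summarize": ["reasoning:plan"],
    "reasoning:plan": ["action:shell_exec", "action:reply"],
}
findings = risky_paths(edges, attrs)
```

Here the single finding traces untrusted web content through persistent memory into a privileged shell action, the kind of cross-session chain the abstract calls memory poisoning.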

Chinese Translation

基于LLM的代理系统正在快速发展，以通过动态工具调用、状态记忆管理和多代理协作执行复杂的自主任务。然而，这种以语义驱动的执行范式在低层次物理事件与高层次执行意图之间造成了严重的语义鸿沟，使得事后安全审计在本质上变得困难。现有的表示机制，包括静态软件物料清单（SBOM）和运行时日志，仅提供了零散的证据，未能捕捉认知状态的演变、能力绑定、持久性记忆污染以及交互代理之间的级联风险传播。为了弥补这一鸿沟，我们提出了Agent-BOM，一种用于代理安全审计的统一结构表示。Agent-BOM将代理系统建模为一个层次化的有属性有向图，将静态能力基础（如模型、工具和长期记忆）与动态运行时语义状态（如目标、推理轨迹和行动）分开。这些层通过语义边和安全属性连接，将零散的执行痕迹转化为可查询的审计路径。在Agent-BOM的基础上，我们开发了一种基于图查询的路径级风险评估范式，并以OWASP代理十大风险为实例进行实现。我们进一步在OpenClaw环境中实现了一个审计插件，以从实时执行中构建Agent-BOM。对具有代表性的现实世界代理攻击场景的评估表明，Agent-BOM能够重构隐蔽的攻击链，包括跨会话记忆污染和工具误用、能力供应链劫持和意外代码执行、多代理生态系统劫持，以及特权和信任滥用。这些结果表明，Agent-BOM为复杂代理生态系统中的根本原因分析和安全裁决提供了统一且可审计的基础。

cs.AI / 12 / 2605.06815

Uneven Evolution of Cognition Across Generations of Generative AI Models

生成性人工智能模型各代认知的不均衡演化

Galatzer-Levy, Isaac, McDuff, Daniel, Liu, Xin, McGiffin, Jed

Abstract

The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.

Chinese Translation

追求人工通用智能需要强有力的方法来评估模型的认知能力，而不仅仅是狭义任务的表现。在这里，我们引入了一个心理测量框架来评估生成性人工智能的认知特征，将其与人类标准进行比较，并追踪其在各代之间的演变。对领先的多模态模型进行的初步评估，采用了改编自韦氏成人智力量表的任务，揭示了深刻的不均衡认知结构：在语言理解和工作记忆方面表现接近上限（>$98^{\text{th}}$ 百分位），而在感知推理方面表现接近下限（<$1^{\text{st}}$ 百分位）。为了追踪超越人类标准限制的发展轨迹，我们开发了人工智能商数（Artificial Intelligence Quotient, AIQ）基准，并将其应用于六代和两个模型家族，揭示了显著但不对称的性能提升。值得注意的是，我们发现不同模态之间存在明显的分离；当以语言形式呈现时，抽象定量推理的成熟速度远快于视觉类比格式，表明存在向语言基础符号操作的架构偏向。尽管抽象视觉推理有所改善，但视觉-感知组织仍然基本停滞。总体而言，这些发现表明生成模型的认知能力正在不均衡地演化，暗示仅依靠扩展和优化方法来发展人工通用智能可能不足以克服实现平衡的人类般通用智能的基本架构限制。

cs.AI / 13 / 2605.06825

Randomness is sometimes necessary for coordination

随机性在协调中有时是必要的

Patil, Rohan, Malegaonkar, Jai, Christensen, Henrik I.

Abstract

Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N \in [2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0\% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. https://anonymous.4open.science/r/randomness-137A/
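The random rank-ordering mask can be sketched in a few lines. This is a simplified reading of the abstract, not the Diamond Attention architecture itself: it assumes a larger draw means a higher rank, that each agent may always attend to itself, and that only the agent-to-agent mask is built (task attention is left untouched). The names `sample_draws` and `rank_mask` are hypothetical.

```python
import random

def sample_draws(n_agents, rng):
    """One scalar random number per agent for the current timestep."""
    return [rng.random() for _ in range(n_agents)]

def rank_mask(draws):
    """mask[i][j] is True when agent i may attend to peer j.

    Peers with a lower draw than agent i are masked out; agent i always
    sees itself (an assumption of this sketch).
    """
    n = len(draws)
    return [[i == j or draws[j] > draws[i] for j in range(n)] for i in range(n)]

# With identical observations, identical policies would act identically.
# The mask gives every agent a different view of its peers.
example_mask = rank_mask([0.7, 0.2, 0.9])
# Row 0: sees itself and agent 2. Row 1: sees everyone. Row 2: sees only itself.
```

Because the draws are distinct with probability one, each agent sees a different number of peers, which is what lets a shared policy assign different roles. The function works for any team size, in line with the zero-shot transfer across team sizes claimed in the abstract.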

Chinese Translation

在同质代理的合作多智能体强化学习（MARL）中，完全参数共享是标准做法。然而，在置换对称观察下，共享的确定性策略为每个代理输出相同的行动分布，使得角色区分变得不可能。这一失败在理论上可以通过在匿名相同处理器之间进行对称打破来解决，而这需要随机性。我们提出了Diamond Attention，这是一种跨注意力架构，其中每个代理在每个时间步采样一个标量随机数，从而引入一个瞬时的等级排序，掩盖较低等级的同伴之间的注意力，同时保持任务注意力完全未掩盖。这实现了一种在单次广播回合中的随机位协调协议，并且基于集合的注意力使得不同规模团队的零-shot部署成为可能。我们在三个不同的场景中进行评估，以隔离结构化随机性的重要性。在完全对称的XOR游戏中，我们的方法实现了$1.0$的成功率，而所有确定性基线的成功率接近$0.5$。在控制协调任务中，基于$N=4$训练的策略能够零-shot泛化到$N \in [2,8]$的范围。在SMACLite跨场景转移中，我们实现了零-shot转移，而标准基线由于结构限制无法转移。此外，用标准的dropout随机性替换结构化掩码导致0\%的胜率，确认了协议空间的结构而非随机噪声是有效成分。

cs.AI / 14 / 2605.06840

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

从大型语言模型推理轨迹中提取搜索树揭示短视规划

Chen, Sixing, Li, Ji-An, Cakir, Saner, Akcali, Sinan, Lee, Kayla, Mattar, Marcelo G.

Abstract

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

Chinese Translation

大型语言模型（LLMs），尤其是推理模型，生成的延续性思维（CoT）推理通常包含对未来结果的明确考虑。然而，这种考虑是否构成真正的规划，其结构如何，以及哪些方面驱动性能仍然不甚明了。在本研究中，我们通过从四子棋游戏的推理轨迹中提取和量化搜索树，介绍了一种新的方法来表征LLM的规划。通过对提取的搜索树拟合计算模型，我们描述了计划的结构以及它们如何影响移动决策。我们发现LLMs的搜索深度低于人类，并且性能主要由搜索广度而非深度来预测。最引人注目的是，尽管LLMs在其轨迹中扩展了深层节点，但其移动选择最好由一个完全忽略这些节点的短视模型来解释。一项因果干预研究显示，我们选择性地修剪CoT段落进一步表明，移动选择主要受到浅层而非深层节点的驱动。这些模式与人类规划形成对比，人类的表现主要由深度搜索驱动。总的来说，我们的发现揭示了LLM与人类规划之间的一个关键差异：人类的专业知识依赖于更深的搜索，而LLMs并不基于深度前瞻进行行动。这种分离为对齐LLM与人类规划提供了有针对性的指导。更广泛地说，我们的框架为在战略领域中解释LLM规划的结构提供了一种可推广的方法。

cs.AI / 15 / 2605.06841

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

AGWM：基于可供性的世界模型用于具有组合前提的环境

Zhang, Qinshi, Deng, Weipeng, Jiang, Zhihan, Qu, Jiaming, Li, Qianren, Xu, Weitao, LC, Ray

Abstract

In model-based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states. When an action and an outcome frequently co-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may become executable only after its prerequisites are met, or non-executable when they are destroyed. We term such events structure-changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi-step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance-Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game-based simulated environments demonstrate the effectiveness of our method by achieving lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.
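The notion of tracking executability through prerequisites can be illustrated with a hand-written toy. Note the assumptions: AGWM learns its prerequisite DAG, whereas this sketch hard-codes one; the crafting actions, the `Action` class, and the `rollout` helper are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    requires: frozenset = frozenset()  # facts that must hold before execution
    adds: frozenset = frozenset()      # facts created by execution
    removes: frozenset = frozenset()   # facts destroyed by execution

def executable(action, facts):
    """An action is executable only when all its prerequisites hold."""
    return action.requires <= facts

def rollout(plan, facts=frozenset()):
    """Simulate a plan, recording per step whether the action could run."""
    executed = []
    for action in plan:
        ok = executable(action, facts)
        if ok:
            # A structure-changing event: the set of available facts shifts.
            facts = (facts - action.removes) | action.adds
        executed.append(ok)
    return executed, facts

collect_wood = Action("collect_wood", adds=frozenset({"wood"}))
craft_pickaxe = Action("craft_pickaxe", requires=frozenset({"wood"}),
                       adds=frozenset({"pickaxe"}), removes=frozenset({"wood"}))
mine_stone = Action("mine_stone", requires=frozenset({"pickaxe"}),
                    adds=frozenset({"stone"}))

flags, final_facts = rollout(
    [mine_stone, collect_wood, craft_pickaxe, mine_stone, craft_pickaxe])
```

The first `mine_stone` fails because no pickaxe exists yet, and the second `craft_pickaxe` fails because crafting consumed the wood. A model that only memorized "craft_pickaxe yields a pickaxe" would mispredict that last step, and the error would carry into every later imagined step.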

Chinese Translation

在基于模型的学习中，智能体通过基于世界模型预测的轨迹进行行为学习。标准的世界模型通常学习一个静态的转移函数，该函数将状态和动作映射到下一个状态。当一个动作和结果在训练数据中频繁共现时，模型倾向于将这种关联内化为一般因果规则，而忽略动作的前提条件。然而，在互动环境中，智能体的动作可以重塑未来的可供性空间。在每个时间步，一个动作可能只有在满足其前提条件后才能执行，或者在前提条件被破坏时变得不可执行。我们将这种事件称为结构变化事件（SC事件）。因此，传统的世界模型往往无法判断在当前状态下给定动作是否可执行，特别是在多步预测中。每一步的想象都依赖于一个不正确的可供性状态，因此预测误差在展开的时间范围内累积。在本文中，我们提出了AGWM（基于可供性的世界模型），它学习一个抽象的可供性结构，表示为前提依赖关系的有向无环图（DAG），以明确跟踪动作的动态可执行性。在基于游戏的模拟环境中的实验表明，我们的方法通过实现更低的多步预测误差、更好的对新配置的泛化能力和提高的可解释性，证明了其有效性。

cs.AI / 16 / 2605.06869

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Agentick：通用序列决策代理的统一基准

Castanyer, Roger Creus, Castro, Pablo Samuel, Berseth, Glen

Abstract

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.

Chinese Translation

人工智能代理研究涵盖了广泛的领域：从从零开始学习的强化学习（RL）代理到利用预训练知识的基础模型代理，但目前没有统一的基准能够在这些方法之间进行公平比较。我们提出了Agentick，这是一个为序列决策代理设计的基准，旨在在共同基础上评估RL、LLM、VLM、混合和人类代理，并推动对序列决策基本挑战的研究。Agentick提供了37个程序生成的任务，涵盖六个能力类别、四个难度级别和五种观察模态，所有任务通过一个兼容Gymnasium的单一接口进行展示。该基准配备了编码API、所有任务的oracle参考策略、预构建的SFT数据集、可组合的代理框架和实时排行榜。对27种配置和超过90,000个回合的评估显示，没有单一方法占主导地位：GPT-5 mini在整体上以0.309的oracle标准化得分领先，而PPO在规划和多代理任务中占优势；推理框架使LLM的性能提升了3-10倍；ASCII观察始终优于自然语言。这些发现突显了所有代理范式中仍然存在的显著改进空间。Agentick的能力分解、多模态设计提供了推动通用自主代理进展所需的实证基础设施，既作为评估框架，也作为在真正的序列环境中进行基础模型后续训练的训练场。

cs.AI / 17 / 2605.06882

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

大型语言模型在最简单的长链推理任务中的表现如何：关于等价类问题的实证研究

Zheng, Chun, Wu, Lianlong, Li, Bingqian, Liu, Lvting, Zhou, Yi

Abstract

Large Language Models (LLMs) have improved greatly in recent years. Nevertheless, it remains unclear how well LLMs handle reasoning tasks, especially long-chain ones. In this paper, we evaluate LLMs' performance on the simplest yet long-chain reasoning task, namely the Equivalence Class Problem (ECP), i.e., determining whether two variables are equal given a set of randomly generated equivalence relations. We consider both reasoning and non-reasoning representative LLMs over a large variety of problem instances, ranging over different numbers of variables, connectivity probabilities, prompts, and other factors. The experimental results show that non-reasoning LLMs fail ECP, while reasoning models are significantly better but still struggle to completely solve this problem. Interestingly, considering various connectivity probabilities with a fixed number of variables, we observe that, for non-reasoning models, the hardest problem instances coincide with the phase transition point of ln n/(n-1), suggesting the chaos of the problem; in contrast, for reasoning models, the hardest ones coincide with the biggest diameter, suggesting the reasoning difficulty of the problem.
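For reference, the Equivalence Class Problem itself has a simple exact solution with a union-find structure, which makes a convenient ground-truth checker. The sketch below is an assumption about how instances might be generated (each pair of variables is declared equal independently with probability `p`); the paper's actual generator and prompt format are not given in the abstract, and the function names are hypothetical.

```python
import random

def find(parent, x):
    """Return the representative of x's class, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the classes of a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_a] = root_b

def random_instance(n, p, rng):
    """Declare each pair (i, j) equal independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def are_equal(n, equalities, x, y):
    """Ground truth: do x and y fall in the same equivalence class?"""
    parent = list(range(n))
    for a, b in equalities:
        union(parent, a, b)
    return find(parent, x) == find(parent, y)

# Variables 0 and 2 are linked through the chain 0 = 1, 1 = 2.
chain_answer = are_equal(5, [(0, 1), (1, 2), (3, 4)], 0, 2)
```

Sweeping `p` around `math.log(n) / (n - 1)`, the phase transition point quoted in the abstract, would produce instances near the regime reported as hardest for non-reasoning models.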

Chinese Translation

近年来，大型语言模型（LLMs）取得了显著的进展。然而，目前仍不清楚LLMs在推理任务，尤其是长链推理任务上的表现如何。本文评估了LLMs在最简单但又是长链推理任务上的表现，即等价类问题（Equivalence Class Problem, ECP），即在给定一组随机生成的等价关系的情况下，判断两个变量是否相等。我们考虑了多种问题实例中的推理和非推理代表性LLMs，这些实例涵盖了不同数量的变量、连接概率、提示及其他因素。实验结果表明，非推理LLMs在ECP上表现不佳，而推理模型的表现显著更好，但仍然难以完全解决该问题。有趣的是，在固定变量数量的情况下，考虑不同的连接概率，我们观察到对于非推理模型，最难的问题实例与ln n/(n-1)的相变点重合，表明该问题的混沌性；而对于推理模型，最难的问题实例则与最大直径重合，表明该问题的推理难度。

cs.AI / 18 / 2605.06890

Beyond the Black Box: Interpretability of Agentic AI Tool Use

超越黑箱：代理人工智能工具使用的可解释性

Tatsat, Hariom, Shater, Ariye

Abstract

AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are mostly external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are especially costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes. The framework reads model states before each action and infers both whether a tool is needed and how consequential the next tool action is likely to be. By decomposing activations into sparse features, it identifies the internal layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to GPT-OSS 20B and Gemma 3 27B models. The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before action. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can reshape the rest of the agentic interaction. More broadly, the paper shows how mechanistic interpretability can support practical internal observability for monitoring tool calls and risk in agent systems.

Chinese Translation

人工智能代理在高风险企业工作流程中展现出良好的前景，但由于工具使用失败难以诊断和控制，可靠的部署仍然受到限制。代理可能会跳过必要的工具调用、不必要地调用工具，或采取在执行后才显现后果的行动。现有的可观察性方法大多是外部的：提示揭示相关性，评估打分输出，日志在模型已经行动后才生成。在长时间跨度的设置中，这些失败尤其代价高昂，因为早期的工具错误可能会改变其余轨迹，增加令牌消耗，并带来下游的安全和保障风险。我们引入了一种基于稀疏自编码器（Sparse Autoencoders, SAEs）和线性探针的机械可解释性工具包。该框架在每次行动之前读取模型状态，并推断出是否需要工具以及下一个工具行动可能的后果。通过将激活分解为稀疏特征，它识别与工具决策最相关的内部层和特征，并通过特征消融测试其功能重要性。我们在NVIDIA Nemotron函数调用数据集的多步轨迹上训练探针，并将相同的工作流程应用于GPT-OSS 20B和Gemma 3 27B模型。我们的目标不是替代外部评估，而是增加一个缺失的层次：对模型在行动前内部信号的可见性。这有助于揭示代理失败的更深层原因，特别是在早期错误可能重塑其余代理交互的长时间运行中。更广泛地说，本文展示了机械可解释性如何支持代理系统中工具调用和风险监测的实用内部可观察性。

cs.AI / 19 / 2605.06895

Mitigating Cognitive Bias in RLHF by Altering Rationality

通过改变理性来减轻RLHF中的认知偏差

Horter, Tiffany, Markham, Andrew, Trigoni, Niki, Booth, Serena

Abstract

How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in which a rationality parameter beta informs how consistently preferences reflect reward differences. In practice, beta is typically treated as a fixed constant that reflects assumed uniform annotator reliability. However, human feedback is not this simplistic in practice: real human judgments are shaped by cognitive biases, leading to systematic deviations from reward-consistent behavior that arise contextually. To address this, we treat rationality as context- and annotation-dependent. We design an approach to dynamically adjust the rationality parameter beta during reward learning using an LLM-as-judge to assess the likely presence of cognitive biases. This approach effectively downweights comparisons that are likely to reflect biased or unreliable judgments. Empirically, we show that this approach learns a more rational downstream model, even when finetuning on datasets with strongly biased preferences.
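The Boltzmann preference model mentioned above is commonly written as $P(a \succ b) = \sigma(\beta\,(r_a - r_b))$, and a per-comparison beta fits naturally into the usual negative log-likelihood. The sketch below assumes that form; how the LLM judge maps a bias assessment to a beta value is not specified in the abstract, so the betas here are hypothetical inputs, as are the function names.

```python
import math

def preference_prob(r_chosen, r_rejected, beta):
    """Boltzmann probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-beta * (r_chosen - r_rejected)))

def reward_model_loss(comparisons):
    """Mean negative log-likelihood with a separate beta per comparison.

    Each item is (reward of chosen, reward of rejected, beta). A beta near
    zero flattens the likelihood, so that comparison barely shapes the rewards.
    """
    total = sum(-math.log(preference_prob(rc, rr, beta))
                for rc, rr, beta in comparisons)
    return total / len(comparisons)

# Same reward gap, different assumed annotator rationality.
confident = preference_prob(1.0, 0.0, beta=2.0)
doubtful = preference_prob(1.0, 0.0, beta=0.5)
```

With `beta = 0` the predicted preference is exactly 0.5 whatever the rewards are, so the loss for that comparison is a constant and its gradient with respect to the rewards vanishes. That is the sense in which a low beta downweights a suspect comparison.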

Chinese Translation

我们如何使模型对不完美的人类反馈具有鲁棒性？在基于人类反馈的强化学习（RLHF）中，人类对模型输出的偏好用于训练一个奖励模型，该模型为响应分配标量值。由于这些奖励是通过成对比较推断的，因此这种学习依赖于潜在奖励差异与观察到的偏好之间的假定关系，通常使用一种玻尔兹曼公式进行建模，其中理性参数 beta 指示偏好反映奖励差异的一致性。在实践中，beta 通常被视为一个固定常数，反映假定的统一标注者可靠性。然而，实际的人类反馈并非如此简单：真实的人类判断受到认知偏差的影响，导致在特定情境中出现系统性偏离奖励一致行为。为了解决这个问题，我们将理性视为依赖于上下文和注释。我们设计了一种方法，通过使用 LLM-as-judge 动态调整奖励学习过程中的理性参数 beta，以评估认知偏差的可能存在。这种方法有效地降低了可能反映偏见或不可靠判断的比较权重。实证结果表明，即使在对具有强烈偏见偏好的数据集进行微调时，这种方法也能学习到一个更理性的下游模型。

cs.AI / 20 / 2605.06898

Self-Programmed Execution for Language-Model Agents

自我编程执行的语言模型代理

O'Connor, Luke J.

Abstract

At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self-programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn-to-turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp-based language in which programs can edit and re-evaluate themselves, and effectful expressions like model invocations are structured such that re-evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self-orchestration strategies might be learned by a model trained for self-programmed execution. Code is available at https://github.com/lukejoconnor/spell .

Chinese Translation

现有语言模型代理的核心是一个固定的协调程序，负责连续回合之间的状态转换。本文介绍了自我编程执行（Self-Programmed Execution, SPE），一种代理架构，其中模型的输出本身就是协调程序，而执行环境评估该程序但不施加自己的协调策略。我通过代理机器形式化了这一思想：SPE状态是一个模型输出可以加载嵌入的机器副本的任何状态，这意味着它不受固定的回合间协调策略的约束。在实践中实现SPE并非易事，因为相同的数据既是模型上下文又是可执行程序。因此，我引入了Spell，一种基于Lisp的语言，其中程序可以自我编辑和重新评估，并且像模型调用这样的有副作用表达式的结构设计使得重新评估编辑后的程序不会重放其副作用。对现有模型的实验（这些模型并未针对SPE或Spell进行训练）表明，前沿模型可以在这种模式下操作并完成具有挑战性的代理任务。这些结果展示了语言模型如何在没有任何固定协调策略的情况下作为代理进行操作，并引发了一个问题：为自我编程执行训练的模型可能学习到什么样的自我协调策略。代码可在 https://github.com/lukejoconnor/spell 获取。


cs.AI / 21 / 2605.06951

Multi-Objective Constraint Inference using Inverse reinforcement learning

基于逆强化学习的多目标约束推断

Shah, Syed Ihtesham Hussain, Hengst, Floris den, Lisowska, Aneta, Teije, Annette ten

Abstract

Constraint inference is widely considered essential to align reinforcement learning agents with safety boundaries and operational guidelines by observing expert demonstrations. However, existing approaches typically assume homogeneous demonstrations (i.e., generated by a single expert or multiple experts with identical objectives). They also have limited ability to capture individual preferences and often suffer from computational inefficiencies. In this paper, we introduce Multi-Objective Constraint Inference (MOCI), a novel framework designed to jointly extract shared constraints and individual preferences from heterogeneous expert trajectories, where multiple experts pursue different objectives. MOCI effectively models and learns from diverse, and potentially conflicting, behaviors. Empirical evaluations demonstrate that MOCI significantly outperforms existing baselines, achieving improved predictive performance, and maintaining competitive computational efficiency on a standard grid-world benchmark. These results establish MOCI as an accurate, flexible, and computationally practical approach for real-world constraint inference and preference learning tasks.

Chinese Translation

约束推断被广泛认为是通过观察专家演示来将强化学习代理与安全边界和操作指南对齐的关键。然而，现有的方法通常假设演示是同质的（即由单一专家或多个具有相同目标的专家生成）。它们在捕捉个体偏好方面的能力有限，并且常常面临计算效率低下的问题。在本文中，我们提出了多目标约束推断（Multi-Objective Constraint Inference, MOCI），这是一个新颖的框架，旨在从异质专家轨迹中共同提取共享约束和个体偏好，其中多个专家追求不同的目标。MOCI有效地建模并从多样化且可能相互冲突的行为中学习。实证评估表明，MOCI显著优于现有基线，取得了更好的预测性能，并在标准网格世界基准测试中保持了竞争性的计算效率。这些结果确立了MOCI作为一种准确、灵活且计算上可行的现实世界约束推断和偏好学习任务的方法。


cs.AI / 22 / 2605.06957

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

基于大语言模型的层次化广义规划中学习与重用策略分解

Sohrabi, Shirin, Ananthakrishnan, Haritha, Kokel, Harsha, Srinivas, Kavitha, Katz, Michael

Abstract

We present a dynamic policy-learning approach that combines generalized planning and hierarchical task decomposition for LLM-based agents. Our method, Hierarchical Component Learning for Generalized Policies (HCL-GP), learns parameterized policies that generalize across task instances and automatically extracts reusable components from successful executions, organizing them into a component library for compositional policy generation. We address three challenges: (1) learning components through automated decomposition, (2) generalizing components to maximize reuse, and (3) efficient retrieval via semantic search. Evaluated on the AppWorld benchmark, our approach achieves 98.2% accuracy on normal tasks and 97.8% on challenge tasks with unseen applications, improving 15.8 points over static synthesis on challenging scenarios. For open-source models, dynamic reuse enables 62.5% success versus near-zero without reuse. This demonstrates that classical planning concepts can be effectively integrated with LLM agents for improved accuracy and efficiency.

Chinese Translation

我们提出了一种动态策略学习方法，结合了广义规划和层次任务分解，专为基于大语言模型（LLM）的智能体设计。我们的方法，层次组件学习广义策略（Hierarchical Component Learning for Generalized Policies，HCL-GP），学习能够跨任务实例泛化的参数化策略，并从成功执行中自动提取可重用组件，将其组织成一个组件库，以便于组合策略生成。我们解决了三个挑战：（1）通过自动分解学习组件，（2）对组件进行泛化以最大化重用，以及（3）通过语义搜索实现高效检索。在AppWorld基准测试中，我们的方法在常规任务上达到了98.2%的准确率，在具有未见应用的挑战任务上达到了97.8%，在具有挑战性的场景中比静态合成提高了15.8个百分点。对于开源模型，动态重用实现了62.5%的成功率，而在没有重用的情况下几乎为零。这表明经典规划概念可以有效地与LLM智能体结合，以提高准确性和效率。


cs.AI / 23 / 2605.06993

Optimal Experiments for Partial Causal Effect Identification

部分因果效应识别的最优实验

Maringgele, Tobias, Etesami, Jalal

Abstract

Causal queries are often only partially identifiable from observational data, and experiments that could tighten the resulting bounds are typically costly. We study the problem of selecting, prior to observing experimental outcomes, a cost-constrained subset of experiments that maximally tightens bounds on a target query. We formalize this as the max-potency problem, where epistemic potency measures the worst-case reduction in bound width guaranteed by an experiment, and show that this problem is NP-hard via a reduction from 0-1 knapsack. Building on the polynomial-programming framework of Duarte et al. (2023), we give a general procedure for evaluating epistemic potency in discrete settings. To control the super-exponential search space, we introduce two graphical pruning criteria that depend only on the causal graph and the query: a novel path-interception rule that exploits district structure to certify zero potency in linear time, and an identifiability check based on the ID algorithm. On Erdos-Renyi random graphs and 11 bnlearn benchmark networks, the two criteria together prune 50-88% of candidate experiments on average without solving a single polynomial program. For the general subset search, we show that ID-pruned experiments are combinatorially inert, yielding a super-exponential reduction in the number of subsets evaluated. We close with an end-to-end demonstration on observational NHANES data, selecting optimal experiments for estimating the effect of physical activity on diabetes.

Chinese Translation

因果查询通常仅能从观察数据中部分识别，而能够收紧结果界限的实验通常成本较高。我们研究在观察实验结果之前，选择一个受成本限制的实验子集，以最大限度地收紧目标查询的界限的问题。我们将其形式化为最大效能问题，其中认知效能衡量实验所保证的界限宽度在最坏情况下的减少，并通过从0-1背包问题的归约证明该问题是NP难的。在Duarte等人（2023）的多项式规划框架基础上，我们提供了一种在离散环境中评估认知效能的一般程序。为了控制超指数搜索空间，我们引入了两个仅依赖于因果图和查询的图形修剪标准：一种新颖的路径拦截规则，利用地区结构在线性时间内证明零效能，以及基于ID算法的可识别性检查。在Erdos-Renyi随机图和11个bnlearn基准网络上，这两个标准平均修剪了50-88%的候选实验，而无需解决任何单一的多项式程序。对于一般子集搜索，我们证明ID修剪的实验在组合上是惰性的，从而在评估的子集数量上实现超指数减少。最后，我们在观察性NHANES数据上进行了端到端的演示，选择最优实验以估计体育活动对糖尿病的影响。
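
The selection problem described in the abstract above (choose a cost-constrained subset of experiments that maximizes a bound-tightening score, after discarding experiments certified as useless) can be sketched with a brute-force search. This is only an illustration under stated assumptions: `potency` is a hypothetical callable standing in for the epistemic-potency evaluation, `pruned` stands in for the output of the graphical pruning criteria, and exhaustive enumeration is used for clarity rather than because the paper searches this way.

```python
from itertools import combinations


def best_experiment_subset(costs, budget, potency, pruned=frozenset()):
    """Exhaustively search affordable subsets of experiments.

    costs   -- dict mapping experiment name to its cost (hypothetical data)
    budget  -- maximum total cost allowed
    potency -- hypothetical callable: frozenset of experiments -> score
    pruned  -- experiments assumed to be certified useless beforehand
    """
    # Pruned experiments are dropped up front, which shrinks the search space.
    candidates = [e for e in costs if e not in pruned]
    best, best_value = frozenset(), potency(frozenset())
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if sum(costs[e] for e in subset) > budget:
                continue  # over budget, skip
            value = potency(frozenset(subset))
            if value > best_value:
                best, best_value = frozenset(subset), value
    return best, best_value


# Toy usage with an additive, made-up potency function.
gains = {"A": 0.3, "B": 0.2, "C": 0.4}
toy_costs = {"A": 2, "B": 2, "C": 3}
toy_potency = lambda s: sum(gains[e] for e in s)
chosen, score = best_experiment_subset(toy_costs, 4, toy_potency)
```

Removing k pruned experiments cuts the number of enumerated subsets by a factor of 2^k, which is the intuition behind pruning before any expensive evaluation.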


cs.AI / 24 / 2605.07002

Adaptive auditing of AI systems with anytime-valid guarantees

具有随时有效保证的人工智能系统自适应审计

Zhou, Siyu, Vossler, Patrick, Sivaraman, Venkatesh, Mai, Yifan, Feng, Jean

Abstract

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.

Chinese Translation

在表征生成型人工智能系统的失败模式时，一个主要瓶颈是注释和评估的成本与时间。因此，自适应测试范式逐渐受到欢迎，其中根据过去的结果灵活决定注释哪些案例以及注释多少案例。尽管这一框架非常实用，但其极大的灵活性使得得出统计上严谨的结论变得困难，因为它违反了经典假设：观察数量通常有限（通常为10到50个案例），而关于采样和停止的决策是在数据收集过程中做出的，而不是基于预先指定的规则。为了表征从高度自适应审计中可以得出的统计推断，我们从两个“对立”的角度引入了假设检验框架：（i）模型的零假设，声称没有性能低于目标阈值的失败模式；（ii）审计员的零假设，声称他们有一个采样策略可以揭示失败模式。利用安全随时有效推断（Safe Anytime-Valid Inference，SAVI），我们将审计员形式化为进行“通过下注测试”，这转化为同时进行的e过程以测试对立的零假设。此外，如果审计员的能力足够强大，我们证明这两个假设在渐近意义上是相互逆的，即通过严格审计的通过确实证明了人工智能系统在全球范围内的稳健性。在实证方面，我们展示了我们提出的测试程序保持随时有效的I型错误控制，优于预先指定的测试方法，并且有时可以在仅有20个观察的情况下达到统计上严谨的结论。
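
The "testing by betting" idea mentioned in the abstract above can be illustrated with a minimal sketch. It is a simplified stand-in rather than the paper's procedure: it assumes binary outcomes (1 means the system handled a case correctly), a simplified null hypothesis that the conditional pass rate is always at least `p0`, and a fixed bet size `lam`. All function and parameter names are hypothetical. Under that null the wealth is a nonnegative supermartingale, so by Ville's inequality rejecting once wealth reaches 1/alpha keeps the type-I error at most alpha, no matter when the auditor chooses to stop.

```python
def betting_e_process(outcomes, p0, lam):
    """Wealth path of a bettor against H0: E[x_t | past] >= p0, x_t in {0, 1}.

    Each observation multiplies the wealth by 1 + lam * (p0 - x_t), which has
    conditional expectation at most 1 under H0 when lam >= 0.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0.0 <= lam <= 1.0 / (1.0 - p0):
        raise ValueError("lam must keep every wealth factor nonnegative")
    wealth = 1.0
    path = []
    for x in outcomes:
        wealth *= 1.0 + lam * (p0 - x)
        path.append(wealth)
    return path


def first_rejection(path, alpha):
    """Return the 1-based index at which wealth first reaches 1/alpha, else None."""
    for t, w in enumerate(path, start=1):
        if w >= 1.0 / alpha:
            return t
    return None


# Toy usage: ten consecutive failures against a claimed 90% pass rate.
failures = betting_e_process([0] * 10, p0=0.9, lam=1.0)
stop_time = first_rejection(failures, alpha=0.05)
```

In this toy run each failure multiplies the wealth by 1.9, so the threshold of 20 is first crossed at the fifth observation. A run of passes shrinks the wealth instead and never triggers a rejection.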


cs.AI / 25 / 2605.07021

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

行为线索推理：可监控推理通过监督提高效率和安全性

Cui, Christopher Z., Killian, Taylor W., Ammanabrolu, Prithviraj

Abstract

Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only the information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cue Reasoning allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work advances scalable oversight by demonstrating how the monitored model itself can be trained to reason in a way that is more tractable for oversight. Code to be released at https://github.com/christopherzc/text-games

Chinese Translation

大型语言模型（LLMs）的推理对监督提出了挑战，因为许多不一致的行为直到推理结束才会显现。为了解决这个问题，我们引入了行为线索推理，使LLM的推理更加可控和可监测。行为线索是模型在特定隐含和显性行为之前立即发出的特殊令牌序列，充当双重信号和控制杠杆。在使用强化学习对较弱的外部监控器进行推理监督的微调过程中，仅通过行为线索显现的信息的压缩视图就足以为监控器提供信号，从而在复杂数学问题求解中修剪多达50%的无效推理令牌。当在一个过度约束违规会导致失败的环境中，几乎最优的基于规则的监控器利用行为线索推理时，它能够从80%的推理轨迹中恢复安全动作，而这些轨迹在没有该方法的情况下会以提出不安全动作告终，成功率从46%提高到96%。通过在两个模型系列和三个领域的评估，我们展示了行为线索推理在不影响性能的情况下提高了推理的可监控性和可控性。更广泛地说，我们的工作通过展示被监控模型本身如何被训练得更易于监督，推动了可扩展的监督。代码将在https://github.com/christopherzc/text-games发布。


cs.AI / 26 / 2605.07042

The Context Gathering Decision Process: A POMDP Framework for Agentic Search

上下文收集决策过程：一种用于自主搜索的部分可观察马尔可夫决策过程框架

Kausik, Chinmaya, Swaminathan, Adith, Kallus, Nathan

Abstract

Large Language Model (LLM) agents are deployed in complex environments (such as massive codebases, enterprise databases, and conversational histories) where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent's working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g. repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent's objective is to adaptively refine its belief state to isolate the necessary information for a task. We model an LLM's behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate-based method that decomposes an LLM's implicit search into explicit and modular operations. We then derive two plug-and-play interventions for iterative LLM agents: a persistent, predicate-based belief state that bounds context while preserving multi-hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question-answering domains, we empirically validate that replacing an LLM's implicit state with our CGDP-motivated belief state improves multi-hop reasoning by up to 11.4%, while the modular programmatic exhaustion detection saves up to 39% of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non-interfering improvements to agentic search harnesses.

Chinese Translation

大型语言模型（LLM）代理被部署在复杂环境中——例如庞大的代码库、企业数据库和对话历史——在这些环境中，相关状态远远超出了它们的上下文窗口。为了在这些空间中导航，代理必须迭代地探索环境以找到相关信息。然而，在没有明确基础设施的情况下，代理的工作记忆可能会退化为搜索状态的有损表示，导致冗余工作（例如重复循环）和过早停止。在本研究中，我们将这一挑战形式化为上下文收集决策过程（CGDP），这是一种专门的部分可观察马尔可夫决策过程，其中代理的目标是自适应地细化其信念状态，以隔离任务所需的信息。我们将LLM的行为建模为在此CGDP中的近似汤普森采样，并引入了一种基于谓词的方法，将LLM的隐式搜索分解为显式和模块化的操作。然后，我们推导出两种可插拔的干预措施用于迭代LLM代理：一种持久的基于谓词的信念状态，在保持多跳推理的同时限制上下文，以及一种程序化的耗尽门，能够在不提前停止的情况下停止无效搜索。在四种方法和三个问答领域中，我们通过实验证实，用我们的CGDP驱动的信念状态替换LLM的隐式状态可将多跳推理提高多达11.4%；同时，模块化的程序化耗尽检测在不降低代理性能的情况下节省了多达39%的令牌。最终，我们认为将LLM代理循环框架化为CGDP可以指导模块化、非干扰性改进自主搜索工具的设计。
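
The two interventions named in the abstract above, a persistent predicate-based belief state and a programmatic exhaustion gate, can be pictured with a small sketch. Everything here is an assumption made for illustration: the abstract does not specify the data structures, so the `BeliefState` class, the `patience` parameter, and the staleness rule are hypothetical choices rather than the authors' design.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefState:
    """Compact search state: which predicates are still unresolved."""
    open_predicates: set
    evidence: dict = field(default_factory=dict)
    stale_steps: int = 0  # consecutive steps that resolved nothing

    def observe(self, findings):
        """Record findings (predicate -> evidence) from one search step."""
        new = {p: e for p, e in findings.items() if p in self.open_predicates}
        self.evidence.update(new)
        self.open_predicates -= set(new)
        self.stale_steps = 0 if new else self.stale_steps + 1


def should_halt(belief, patience=3):
    """Hypothetical gate: stop when done, or after `patience` unproductive steps."""
    return not belief.open_predicates or belief.stale_steps >= patience


# Toy usage: one productive step followed by unproductive ones.
belief = BeliefState(open_predicates={"author_of_module", "last_release_date"})
belief.observe({"author_of_module": "found in CONTRIBUTORS file"})
```

Because the state stores only predicates and their evidence, its size is bounded by the number of predicates and not by the length of the search history.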


cs.AI / 27 / 2605.07066

2.5-D Decomposition for LLM-Based Spatial Construction

基于LLM的空间构建的2.5-D分解

Whitten, Paul, Chen, Li-Jen, Baddam, Sharath

Abstract

Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6% ceiling imposed by architect-agent errors that no builder-side improvement can address. This outperforms both GPT-4o at 90.3% and the best competing system at 76.3%. A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task where gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms that the effect generalizes beyond the primary benchmark.

Chinese Translation

从自然语言指令构建结构的自主系统需要可靠的空间推理，然而大型语言模型（LLMs）在生成三维块放置时会产生系统性的坐标错误。我们提出了一种基于2.5-D分解的神经符号管道：LLM在二维水平面内进行规划，而确定性执行器根据柱子占用情况计算所有垂直放置，从而消除了整类错误。在Build What I Mean基准测试（160轮）中，采用该管道的GPT-4o-mini在12次独立运行中实现了94.6%的平均结构准确率，距离建筑代理错误所施加的97.6%上限仅相差3.0个百分点，而无任何构建方改进可以解决此问题。这一结果优于GPT-4o的90.3%和最佳竞争系统的76.3%。控制消融实验确认，2.5-D分解是主要贡献者，贡献了50.7个百分点的准确率。该管道可以直接转移到边缘硬件上：在NVIDIA Jetson Thor AGX上本地运行的Nemotron-3 120B以94.5%的准确率匹配了云端结果，且无需修改提示。其基本原理是从LLM的输出空间中去除确定性维度，适用于任何自主构建或组装任务，其中重力或其他物理约束固定一个或多个自由度。对500个IGLU协作建筑任务的转移实验确认了这一效果超出了主要基准的普遍性。
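
The split described in the abstract above (the LLM chooses horizontal positions, a deterministic executor derives the height from column occupancy) can be sketched as follows. This is a minimal illustration, not the paper's executor: it assumes unit blocks, a ground level of z = 0, and a plan format of `(x, y, color)` tuples, all of which are hypothetical choices.

```python
from collections import defaultdict


def place_blocks(plan):
    """Turn a 2-D plan into 3-D placements.

    plan -- iterable of (x, y, color) tuples in placement order
            (assumed output format of the planner)
    Returns a list of (x, y, z, color) tuples, with z computed from how many
    blocks already occupy the (x, y) column.
    """
    height = defaultdict(int)  # column occupancy: (x, y) -> blocks stacked so far
    placed = []
    for x, y, color in plan:
        z = height[(x, y)]  # next free level in this column
        placed.append((x, y, z, color))
        height[(x, y)] += 1
    return placed


# Toy usage: two blocks stacked in one column, one block beside them.
structure = place_blocks([(0, 0, "red"), (0, 0, "blue"), (1, 0, "red")])
```

Since the planner never emits a z value, it cannot produce a floating or overlapping block in this setup. That is the class of error the decomposition is meant to remove.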


cs.AI / 28 / 2605.07073

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

TeamBench：在强制角色分离下评估智能体协调

Kim, Yubin, Park, Chanwoo, Kim, Taehan, Park, Eugene, Schmidgall, Samuel, Rahman, Salman, Park, Chunjong, Breazeal, Cynthia, Liu, Xin, Palangi, Hamid, Park, Hae Won, McDuff, Daniel

Abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.

Chinese Translation

智能体系统通常将任务分解为多个角色，但这些角色通常是通过提示指定的，而不是通过访问控制强制执行的。在没有强制的情况下，团队的通过率可能掩盖智能体是否真正协调，或者一个角色是否有效地完成了另一个角色的工作。我们提出了TeamBench，这是一个基准测试，包含851个任务模板和931个种子实例，用于评估在操作系统强制角色分离下的智能体协调。TeamBench在规划者（Planner）、执行者（Executor）和验证者（Verifier）角色之间分离了规范访问、工作区编辑和最终认证，因此没有角色可以读取完整的要求、修改工作区或认证最终答案。仅依赖提示的团队和沙箱强制的团队在通过率上统计上无显著差异，但仅依赖提示的运行产生了3.6倍更多的案例，其中验证者试图编辑执行者的代码。验证者批准49%的提交，这些提交未通过确定性评分器，去除验证者在消融实验中提高了平均部分得分。团队的价值也是有条件的。当单个智能体遇到困难时，团队会受益，但当单个智能体表现良好时，团队则会受到伤害。在相同角色分离下进行的40次人类研究显示，我们的基准测试揭示了通过率未能捕捉的互动模式。单独参与者直接完成任务，而与智能体配对的人类参与者往往迅速批准，且人类团队在跨角色协调缺失信息上花费了更多精力。


cs.AI / 29 / 2605.07080

Online Allocation with Unknown Shared Supply

未知共享供应的在线分配

Neoh, Tzeh Yuan, Choo, Davin, Yue, Mengchu, Tambe, Milind

Abstract

Many real-world resource allocation systems, such as humanitarian logistics and vaccine distribution, must preposition limited supply across multiple locations before demand is realized while stockouts incur irreversible service losses. To study this, we introduce the Online Shared Supply Allocation (OSSA) problem, a stateful online model in which a central hub allocates a finite, unknown supply to multiple sites facing sequential demand under fixed-charge transportation costs and lost-sales penalties. Unlike classical make-to-stock or make-to-order inventory models, OSSA precludes backlogging, and replenishment only hedges against future demand. To tackle OSSA, we propose a deterministic threshold-proportional policy GPA and prove that it achieves a $4/3$-approximation to the offline optimum up to an additive term independent of the total supply. We complement this with matching lower bounds showing that the $4/3$ ratio is tight and that the additive-error dependence is unavoidable, even for randomized algorithms that know the total supply upfront. Finally, we develop a learning-augmented extension to GPA that incorporates, in a principled way, the imperfect forecasts (e.g., from human experts or ML models) commonly available in practice, enabling us to exploit high-quality advice while being robust against arbitrarily bad advice. Synthetic and real-world experiments show that GPA outperforms natural baselines when global supply is scarce.

Chinese Translation

许多现实世界的资源分配系统，如人道主义物流和疫苗分配，必须在需求实现之前将有限的供应预先分配到多个地点，因为缺货会导致不可逆转的服务损失。为此，我们引入了在线共享供应分配（Online Shared Supply Allocation, OSSA）问题，这是一种状态在线模型，其中一个中央枢纽在固定运输成本和销售损失惩罚下，将有限的未知供应分配给面临顺序需求的多个站点。与经典的按需生产或按订单生产库存模型不同，OSSA 不允许积压，补货仅对未来需求进行对冲。为了解决 OSSA，我们提出了一种确定性阈值比例策略 GPA，并证明它在加性项独立于总供应的情况下，实现了对离线最优解的 $4/3$ 近似。我们还提供了匹配的下界，表明 $4/3$ 比率是紧的，并且加性误差依赖是不可避免的，即使对于提前知道总供应的随机算法也是如此。最后，我们开发了一个增强学习的 GPA 扩展，主要结合了在实践中常见的不完美预测（例如来自人类专家或机器学习模型的预测），使我们能够利用高质量的建议，同时对任意糟糕的建议具有鲁棒性。合成和真实世界的实验表明，在全球供应稀缺的情况下，GPA 的表现优于自然基线。


cs.AI / 30 / 2605.07103

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

ARMOR：一种基于代理的反应可行性预测框架，通过自适应效用感知的多工具推理

Liu, Ye, Yu, Botao, Ling, Xinyi, Adu-Ampratwum, Daniel, Ning, Xia

Abstract

Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via https://anonymous.4open.science/r/ARMOR-E13F.

Chinese Translation

反应可行性预测作为计算化学中的一个基本问题，受益于近期人工智能特别是大型语言模型的进展所带来的多样化工具。然而，单个工具在不同反应中的表现差异显著，使得任何单一工具难以在所有情况下持续表现良好。这提出了一个关键挑战：如何有效利用多个工具以获得更准确的可行性预测。为了解决这个问题，我们提出了ARMOR，一个明确建模工具特定效用的代理框架，能够自适应地优先考虑工具，并进一步解决潜在的工具冲突，以为每个反应生成最终预测。与现有依赖于简单聚合或启发式分配的各种工具的方法不同，ARMOR将工具组织成一个优先考虑表现最佳工具的层次结构，并在需要时推迟其他工具，通过工具特定模式表征其优势，并通过增强记忆的推理解决冲突。在一个公共数据集上的广泛实验表明，ARMOR在性能上始终优于强基线，包括单工具方法以及各种工具聚合和选择方法。进一步分析显示，在工具预测存在冲突的反应上，改进尤为显著，突显了ARMOR在利用多个工具的互补优势方面的有效性。代码可通过 https://anonymous.4open.science/r/ARMOR-E13F 获取。


cs.AI / 31 / 2605.07112

Switchcraft: AI Model Router for Agentic Tool Calling

Switchcraft：用于代理工具调用的人工智能模型路由器

Agarwal, Sharad, Namyar, Pooria, Wolman, Alec, Ambavat, Rahul, Gupta, Ankur, Zhang, Qizheng

Abstract

Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.

Chinese Translation

调用外部工具的代理人工智能系统功能强大但成本高昂，这使得开发者倾向于使用大型模型，从而超支推理预算。模型路由可以缓解这一问题，但现有的路由器是为聊天完成而设计，而非工具使用。我们提出了Switchcraft，这是首个（据我们所知）为代理工具调用优化的模型路由器。Switchcraft在线操作，选择在正确性前提下成本最低的模型。我们在五个函数调用基准上构建了评估框架，并训练了一个基于DistilBERT的分类器，部署在延迟预算内。Switchcraft的准确率达到82.9%——与最佳单一模型相匹配或超过，同时将推理成本降低了84%，每百万次查询节省超过3600美元。我们发现，在工具使用任务中，大型模型并不总是优于小型模型，而名义上更便宜的模型由于令牌密集型推理可能会导致更高的总成本。我们的工作实现了在不牺牲正确性的情况下，具备成本意识的代理人工智能部署。
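
The routing rule stated in the abstract above, "selecting the lowest-cost model subject to correctness", can be sketched in a few lines. This is an illustration under assumptions and not Switchcraft's implementation: the per-model `p_correct` values stand in for the output of a classifier, and the `min_p_correct` threshold and the fallback rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_query: float  # assumed known per-model cost
    p_correct: float       # assumed classifier estimate for this query


def route(candidates, min_p_correct=0.8):
    """Pick the cheapest model predicted to be correct, else the most reliable."""
    if not candidates:
        raise ValueError("at least one candidate model is required")
    eligible = [c for c in candidates if c.p_correct >= min_p_correct]
    if eligible:
        return min(eligible, key=lambda c: c.cost_per_query)
    # No model clears the bar: fall back to the highest predicted correctness.
    return max(candidates, key=lambda c: c.p_correct)


# Toy usage with made-up models and numbers.
choice = route([
    Candidate("small-model", cost_per_query=0.1, p_correct=0.85),
    Candidate("large-model", cost_per_query=1.0, p_correct=0.95),
])
```

Here the cheaper model is chosen because it clears the threshold. If its predicted correctness dropped below 0.8, the router would escalate to the larger model.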


cs.AI / 32 / 2605.07121

AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning

AdaTKG：用于时间知识图推理的自适应记忆

Lee, Seunghan, Seo, Jun, Lee, Jaehoon, Yoo, Sungdong, Kim, Minjae, Lim, Tae Yoon, Kang, Dongwan, Choi, Hwanil, Lee, SoonYoung, Ahn, Wonbin

Abstract

Temporal knowledge graphs (TKGs) represent time-stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per-entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is publicly available at: https://github.com/seunghan96/AdaTKG.

Chinese Translation

时间知识图（TKGs）表示带时间戳的关系事实，并支持对不断演变事件的广泛推理任务。然而，现有方法产生的实体表示在实体层面上是静态的，即每个表示仅是学习参数的函数，并且没有保留实体参与的交互的痕迹。在本文中，我们摆脱了这种静态视角，提出将每个实体建模为一个自适应过程，其表示在每次实体参与事实时进行精细化。为此，我们提出了AdaTKG，它维护一个每个实体的记忆，该记忆在每次观察到的交互中更新，随着更多交互的到来，记忆在线累积，预测能力不断提高。具体而言，我们将记忆更新实例化为一个可学习的指数移动平均，由一个共享的标量控制，而不是为每个实体使用可学习参数，从而使AdaTKG能够处理训练期间未见过的实体。大量实验确认了相较于TKG基线的一致性提升，证明了自适应记忆的有效性。代码可在以下网址公开获取：https://github.com/seunghan96/AdaTKG。
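
The memory update described in the abstract above, an exponential moving average governed by a single shared scalar, can be sketched as follows. This is a simplified illustration and not the released AdaTKG code: the scalar is a fixed `decay_logit` here where the paper learns it, the interaction `message` is assumed to be a plain vector of floats, and unseen entities are assumed to start from an all-zero memory.

```python
import math
from collections import defaultdict


class EntityMemory:
    """Per-entity memory updated online as m <- (1 - a) * m + a * message."""

    def __init__(self, dim, decay_logit=0.0):
        self.dim = dim
        self.decay_logit = decay_logit  # the single shared scalar (learned in practice)
        # Unseen entities get a zero vector, so no per-entity parameters are needed.
        self.memory = defaultdict(lambda: [0.0] * dim)

    @property
    def alpha(self):
        # Sigmoid keeps the mixing weight strictly between 0 and 1.
        return 1.0 / (1.0 + math.exp(-self.decay_logit))

    def update(self, entity, message):
        if len(message) != self.dim:
            raise ValueError("message has the wrong dimension")
        a = self.alpha
        old = self.memory[entity]
        self.memory[entity] = [(1.0 - a) * m + a * x for m, x in zip(old, message)]
        return self.memory[entity]


# Toy usage: the same interaction observed twice for a previously unseen entity.
mem = EntityMemory(dim=2)
mem.update("entity_42", [2.0, 4.0])
state = mem.update("entity_42", [2.0, 4.0])
```

With `decay_logit=0.0` the mixing weight is 0.5, so the memory moves halfway toward each new message: `[1.0, 2.0]` after the first update and `[1.5, 3.0]` after the second.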


cs.AI / 33 / 2605.07138

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

你能打破RLVER吗？探讨RL训练的同理心代理的对抗鲁棒性

K, Deeraj S, Devarajan, Sadhana, Mehra, Krishna, Mishra, Sudhakar

Abstract

Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (think and no-think conditions on two RLVER models and two base models, Qwen 1.5B and 7B, with 480 adversarial dialogues), RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 vs. 0.761, $p<0.001$, $r=0.688$), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS remains nearly flat and is not significantly different for RLVER-PPO-Think versus Base-7B-Think ($p=0.650$): RL training improves emotional responsiveness without measurable gains in observable state tracking. We interpret the ECS--FS (Final Score) gap as a behavioral/legibility dissociation inside this simulator family, not as evidence about internal understanding or clinical readiness.

Chinese Translation

可验证情感奖励的强化学习（RLVER）产生了具有强大同理心表现的语言模型，这些模型在假设用户合作和诚实的基准上进行了评估。然而，真实的情感互动系统性地违反了这一假设：用户进行精神操控、升级和施压，要求人工智能系统提供无条件的验证，这些动态是合作基准无法揭示的。我们构建了对抗同理心基准（AEB），并引入情感一致性评分（ECS）以评估在对抗条件下的同理心鲁棒性。AEB包括六种心理学基础的对抗轨迹类型，具有惩罚公式化响应的区分奖励结构；ECS正式区分了模型跟踪用户情感状态的能力与改善这些状态的能力。在一个控制实验中，我们在八种场景匹配条件下（对2个RLVER模型的思考和非思考条件，以及2个基础模型（Qwen 1.5B和7B）进行480个对抗对话），RLVER-PPO-Think的表现显著优于相同规模的未调优基线（0.963对0.761，p<0.001，r=0.688），且没有对话崩溃，隐藏意图检测提高了47%。然而，ECS几乎保持平坦，RLVER-PPO-Think与Base-7B-Think之间没有显著差异（p=0.650）：RL训练提高了情感响应能力，但在可观察状态跟踪上没有可测量的提升。我们将ECS与FS（最终评分）之间的差距解读为该模拟器家族内部的行为/可读性分离，而不是关于内部理解或临床准备的证据。


cs.AI / 34 / 2605.07161

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

SREGym：一个针对AI SRE代理的高保真故障场景实时基准测试

Clark, Jackson, Su, Yiming, Pial, Saad Mohammad Rafid, Tian, Yifang, Gniedziejko, Lily, Jacobsen, Hans-Arno, Chen, Yinfang, Xu, Tianyin

Abstract

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to overly simplistic SRE tasks and are hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.

Chinese Translation

AI代理越来越多地用于诊断和缓解生产系统中的故障，这被称为代理式网站可靠性工程（Site Reliability Engineering, SRE）。当前的SRE基准测试仅限于过于简化的SRE任务，并且由于定制设计，扩展性较差。我们提出了SREGym，一个针对SRE代理的高保真基准测试。SREGym暴露了一个基于真实云原生系统栈构建的实时系统环境，通过故障注入器模拟高保真故障场景。SREGym通过模拟（1）不同层次的广泛故障，（2）各种环境噪声，以及（3）多样的故障模式，如亚稳态故障和相关故障，来建模生产环境的复杂性。SREGym被设计为一个模块化、可扩展的框架，协调跨栈的故障和噪声注入器。SREGym目前包含90个现实且具有挑战性的SRE问题。我们使用SREGym评估前沿代理，并展示它们在应对不同类型故障时能力的显著差异，端到端结果的差异可达40%。SREGym作为一个开源项目积极维护，并已被研究人员和从业者使用。


cs.AI / 35 / 2605.07174

Repeated Deceptive Path Planning against Learnable Observer

针对可学习观察者的重复欺骗路径规划

Cao, Shiyue, Xu, Pei, Yang, Likun, Cui, Lei, Yu, Shizhao, Zhang, Shiyu, Ren, Yongjian, Chen, Xiaotang, Huang, Kaiqi

Abstract

We study the problem of deceptive path planning (DPP), where an agent aims to conceal its true destination from external observers. While existing work assumes static, non-learning observers, real-world adversaries, such as those in critical goods transportation or military operations, can adapt by learning from historical trajectories. To address this gap, we introduce Repeated Deceptive Path Planning (RDPP), a new formulation that explicitly models learnable observers. We show that existing DPP methods fail under this setting, as they cannot adapt to evolving adversarial predictions. While incorporating the observer's previous predictions into updates enables some adaptation, such incremental updates cause accumulative lag that degrades deception. To this end, we propose Deceptive Meta Planning (DeMP), a two-level optimization framework that combines episode-level adaptation, which enables short-term policy adjustment to counter the updated observer, and meta-level updates, which leverage cross-episode feedback to capture how observers update their models and accelerate adaptation in future episodes. In this way, DeMP mitigates the accumulation of adaptation lag, enabling sustained deception against a learning observer. Experiments across environments demonstrate that DeMP significantly outperforms existing approaches in RDPP while maintaining competitive path cost. Our results highlight the importance of modeling repeated interactions with learnable adversaries, providing new insights into deception and privacy in multi-agent systems.

Chinese Translation

我们研究了欺骗路径规划（DPP）的问题，其中代理旨在向外部观察者隐瞒其真实目的地。虽然现有工作假设观察者是静态的、非学习的，但现实世界中的对手——例如在关键货物运输或军事行动中——可以通过学习历史轨迹进行适应。为了解决这一问题，我们引入了重复欺骗路径规划（RDPP），这是一种明确建模可学习观察者的新形式。我们表明，现有的DPP方法在这种情况下失败，因为它们无法适应不断变化的对手预测。虽然将观察者之前的预测纳入更新可以实现一定的适应，但这种增量更新会导致累积滞后，从而降低欺骗效果。为此，我们提出了欺骗元规划（DeMP），这是一种两级优化框架，结合了情节级适应，允许针对更新的观察者进行短期策略调整，以及元级更新，利用跨情节反馈捕捉观察者如何更新其模型并加速未来情节中的适应。通过这种方式，DeMP减轻了适应滞后的累积，使得在学习观察者面前能够持续欺骗。跨环境的实验表明，DeMP在RDPP中显著优于现有方法，同时保持竞争性的路径成本。我们的结果突显了建模与可学习对手的重复互动的重要性，为多智能体系统中的欺骗和隐私提供了新的见解。

cs.AI / 36 / 2605.07199

Three-in-One World Model: Energy-Based Consistency, Prediction, and Counterfactual Inference for Marketing Intervention

三合一世界模型：基于能量的一致性、预测和反事实推断在营销干预中的应用

Niimi, Junichiro

Abstract

Marketing decisions reflect the interaction of latent consumer heterogeneity, time-varying internal states, and explicit interventions, a structure that current prediction- and language-oriented models do not capture in a unified manner. We propose a Three-in-One world-model architecture in which a Deep Boltzmann Machine (DBM) learns a frozen belief representation from demographics, time, and lagged actions and outcomes, with lightweight task-specific adapters attached on top. The same belief supports three tasks within a single framework: (i) energy-based consistency evaluation through the DBM's free energy, (ii) outcome prediction through adapters, and (iii) counterfactual inference by holding the belief fixed and varying only the action input given to the adapter. Using a controlled simulation in which the latent price sensitivity, promotion responsiveness, and base preference of each consumer are known, we show that the adapters match a strong MLP baseline on visit- and purchase-AUC while recovering heterogeneous treatment effects substantially better than S-, T-, X-, and DR-learner meta-learners and a Causal Forest baseline built on the same raw features, with the largest gap on a confounded price-promotion intervention. Complementing this, free-energy clamps systematically penalize counterfactual purchase trajectories that lack prior promotional exposure, and the penalty itself depends on the latent base preference in the expected direction. These results indicate that DBM beliefs disentangle latent traits in a form that survives counterfactual queries, providing an integrated world-model substrate for marketing intervention.

Chinese Translation

营销决策反映了潜在消费者异质性、时变内部状态和显性干预之间的相互作用，而当前的预测和语言导向模型并未以统一的方式捕捉这一结构。我们提出了一种三合一世界模型架构，其中深度玻尔兹曼机（Deep Boltzmann Machine, DBM）从人口统计学、时间以及滞后行为和结果中学习一个固定的信念表示，并在其上附加轻量级的任务特定适配器。该信念在单一框架内支持三项任务：(i) 通过DBM的自由能进行基于能量的一致性评估，(ii) 通过适配器进行结果预测，以及(iii) 通过固定信念并仅改变提供给适配器的行动输入进行反事实推断。通过一个受控模拟，其中每个消费者的潜在价格敏感度、促销响应性和基础偏好是已知的，我们展示了适配器在访问和购买的AUC上与强大的多层感知器（MLP）基线相匹配，同时在恢复异质性处理效应方面显著优于S-、T-、X-和DR-学习者元学习者以及基于相同原始特征构建的因果森林基线，在混淆的价格-促销干预中差距最大。此外，自由能钳制系统性地惩罚缺乏先前促销曝光的反事实购买轨迹，而惩罚本身依赖于潜在基础偏好的预期方向。这些结果表明，DBM信念以一种能够承受反事实查询的形式解构潜在特征，为营销干预提供了一个综合的世界模型基础。

cs.AI / 37 / 2605.07202

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

通过数据到洞察发现代理实现自主商业智能

Wu, Dongming, Li, Junwen, Lu, Ming, Wang, Gang, Chen, Ting

Abstract

Transforming fragmented enterprise data into actionable insights remains a significant challenge for LLMs, constrained by complex database schemas, limitations in dynamic SQL generation, and the need for deep multi-dimensional analysis. In this paper, we propose AIDA (Autonomous Insight Discovery Agent), the first end-to-end framework designed for autonomous exploration in complex business environments. We establish a highly flexible instant retail environment encompassing 200+ metrics and 100+ dimensions, and integrate a proprietary Domain-Specific Language (DSL) that bridges semantic reasoning with precise SQL execution. Our reinforcement learning system subsequently formulates business analysis as a Pareto Principle-guided cumulative reasoning process. Experimental results demonstrate that AIDA significantly outperforms workflow-based agents, and extensive evaluations further reveal that AIDA achieves superior environmental perception and more in-depth analysis from diverse perspectives. Our work ultimately establishes the transformative potential of autonomous intelligence for industrial-scale business intelligence systems.

Chinese Translation

将分散的企业数据转化为可操作的洞察仍然是大型语言模型（LLMs）面临的一项重大挑战，这受到复杂数据库模式、动态SQL生成的局限性以及深度多维分析需求的制约。本文提出了AIDA（自主洞察发现代理），这是第一个旨在复杂商业环境中实现自主探索的端到端框架。我们建立了一个高度灵活的即时零售环境，涵盖200多个指标和100多个维度，并集成了一种专有的领域特定语言（DSL），该语言将语义推理与精确的SQL执行相结合。我们的强化学习系统随后将商业分析表述为一个基于帕累托原则的累积推理过程。实验结果表明，AIDA显著优于基于工作流的代理，广泛的评估进一步揭示AIDA在环境感知和从多样化视角进行更深入分析方面具有更优越的表现。我们的研究最终确立了自主智能在工业规模商业智能系统中的变革潜力。

cs.AI / 38 / 2605.07214

HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization

HMACE：用于组合优化的异构多智能体协同进化

Yan, Yuping, Han, Jirui, Ming, Fei, Li, Yuanshuai, Jin, Yaochu

Abstract

Large Language Models have recently emerged as a promising paradigm for automated heuristic design for NP-hard combinatorial optimization problems. Despite this progress, existing LLM-based methods typically rely on monolithic workflows constrained by rigid templates, thereby restricting memory-guided exploration and triggering premature convergence to local optima. To design an autonomous and collaborative architecture, we introduce HMACE, a Heterogeneous Multi-Agent Collaborative Evolution framework that reconceptualizes heuristic search as an organizational design problem. HMACE decomposes each evolutionary generation into an autonomous, role-specialized loop with four coordinated agents: a Proposer for strategy exploration, a Generator for executable heuristic synthesis, an Evaluator for empirical assessment, and a Reflector for archive-backed memory update. By coupling behavior-aware retrieval, lightweight candidate filtering, and fitness-grounded archive updates, HMACE guides the search toward diverse and promising heuristic behaviors while avoiding redundant evaluations. Extensive evaluations on representative COPs, including TSP, Online BPP, MKP, and PFSP, show that HMACE achieves a favorable quality-efficiency trade-off compared to state-of-the-art single-agent and multi-agent baselines. In the matched LLM-driven reference comparison, HMACE achieves the lowest average gaps on TSP and Online BPP (0.464% and 0.223%, respectively), while requiring only 0.13M and 0.42M tokens for the two tasks, substantially fewer than the compared baselines.

Chinese Translation

大型语言模型最近作为一种有前景的范式出现，用于NP难度组合优化问题的自动启发式设计。尽管取得了这些进展，现有的基于LLM的方法通常依赖于受限于严格模板的单一工作流程，从而限制了基于记忆的探索，并导致过早收敛于局部最优解。为了设计一个自主且协作的架构，我们引入了HMACE，一个异构多智能体协同进化框架，它将启发式搜索重新概念化为一个组织设计问题。HMACE将每一代进化分解为一个自主的、角色专业化的循环，包含四个协调的智能体：一个用于策略探索的提议者（Proposer）、一个用于可执行启发式合成的生成器（Generator）、一个用于经验评估的评估者（Evaluator）和一个用于基于档案的记忆更新的反射者（Reflector）。通过结合行为感知检索、轻量级候选过滤和基于适应度的档案更新，HMACE引导搜索朝向多样化和有前景的启发式行为，同时避免冗余评估。在对代表性组合优化问题（COP）进行的大规模评估中，包括旅行商问题（TSP）、在线装箱问题（Online BPP）、多背包问题（MKP）和置换流水车间调度问题（PFSP），HMACE在质量与效率的权衡上表现优于最先进的单智能体和多智能体基线。在与匹配的LLM驱动参考的比较中，HMACE在TSP和在线装箱问题上实现了最低的平均差距（分别为0.464%和0.223%），同时仅需0.13M和0.42M个标记，显著低于比较基线。

cs.AI / 39 / 2605.07242

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

MEMOREPAIR：代理记忆中的优先障碍级联修复

Zhao, Yang, Dai, Chengxiao, Kou, Mengying, Xiu, Yue

Abstract

Agentic memory evolves across tasks into durable derived artifacts: summaries, cached outputs, embeddings, learned skills, and executable tool procedures. When a source artifact is deleted, corrected, or invalidated by tool or API migration, descendants derived from that source can remain visible and steer future actions with stale support. We formalize this failure mode as the cascade update problem, where repair targets the visible derived state of the memory store. We present MemoRepair, a barrier-first cascade-repair contract for agentic memory. A repair event induces a controlled transition from invalidated descendant state to validated successor state: affected descendants are withdrawn before repair, successors are constructed from retained support and staged repaired predecessors under the current interface, and republication is restricted to validated predecessor-closed successors. This contract induces a scalarized repair-selection problem for a fixed repair-cost tradeoff. We show that the induced publication problem reduces to maximum-weight predecessor closure and can be solved exactly by a single s-t min-cut. Experiments on ToolBench and MemoryArena show that, with complete influence provenance, MemoRepair reduces invalidated-memory exposure from 69.8-94.3% under systems without cascade repair to 0%. Compared with exhaustive Repair all, it recovers 91.1-94.3% of validated successors while reducing normalized repair-operator cost from 1.00 to 0.57-0.76.
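The reduction named in the abstract (maximum-weight predecessor closure solved by one s-t min-cut) can be illustrated with the textbook closure construction. This is a minimal sketch, not the paper's implementation: the node names, the weights (taken here as benefit minus repair cost), and the function `max_weight_closure` are all hypothetical, and the abstract does not specify how MemoRepair scalarizes its objective.

```python
from collections import deque

INF = float("inf")


def max_weight_closure(weights, requires):
    """Return (closure, total_weight) for a maximum-weight predecessor closure.

    weights:  dict node -> net value (positive = benefit, negative = cost); hypothetical
    requires: list of (node, predecessor); selecting `node` forces `predecessor`
    """
    source, sink = object(), object()   # unique terminals, cannot clash with node names
    cap = {source: {}, sink: {}}        # residual capacities: cap[u][v]

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)         # reverse edge for the residual graph

    for node, w in weights.items():
        cap.setdefault(node, {})
        if w > 0:
            add_edge(source, node, w)   # cutting this edge means giving up the benefit
        elif w < 0:
            add_edge(node, sink, -w)    # cutting this edge means paying the cost
    for node, pred in requires:
        add_edge(node, pred, INF)       # a dependency can never be cut

    def reachable(stop_at_sink):
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if stop_at_sink and v is sink:
                        return parent
                    queue.append(v)
        return parent

    # Edmonds-Karp: push flow along shortest augmenting paths until none remain.
    while True:
        parent = reachable(stop_at_sink=True)
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow

    # The source side of the min cut is the optimal closure.
    closure = {n for n in reachable(stop_at_sink=False) if n is not source}
    return closure, sum(weights.get(n, 0) for n in closure)
```

With weights `{"summary": 5, "embedding": -3, "skill": 4, "tool_proc": -6}` and dependencies `summary -> embedding`, `skill -> tool_proc`, the sketch selects `{"summary", "embedding"}` with total weight 2: republishing `skill` would cost more than it returns, so it stays withdrawn.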

Chinese Translation

代理记忆在任务中演变为持久的派生工件：摘要、缓存输出、嵌入、学习的技能和可执行的工具程序。当源工件被删除、修正或因工具或API迁移而失效时，源工件派生的后代可能仍然可见，并以过时的支持引导未来的行动。我们将这种故障模式形式化为级联更新问题，其中修复目标是内存存储的可见派生状态。我们提出了MemoRepair，一种针对代理记忆的优先障碍级联修复契约。修复事件引发从失效后代状态到验证后继状态的受控过渡：受影响的后代在修复前被撤回，后继状态由保留的支持和在当前接口下分阶段修复的前驱构建，重新发布仅限于验证的前驱闭合后继。该契约引发了一个标量化的修复选择问题，以固定的修复成本权衡。我们展示了引发的发布问题简化为最大权重前驱闭合，并可以通过单一的源-汇最小割精确解决。在ToolBench和MemoryArena上的实验表明，在完整影响来源的情况下，MemoRepair将无级联修复系统下的失效内存暴露从69.8-94.3%降低至0%。与全面修复（Repair all）相比，它恢复了91.1-94.3%的验证后继，同时将标准化修复操作成本从1.00降低至0.57-0.76。

cs.AI / 40 / 2605.07247

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

EnvSimBench：评估和改进基于大型语言模型的环境模拟的基准

Liu, Yi, Hui, TingFeng, Zhang, Wei, Sun, Li, Su, Ningxin, Wang, Jian, Su, Sen

Abstract

Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench

Chinese Translation

可扩展的人工智能代理训练依赖于能够真实模拟代理行为后果的互动环境。手工构建的环境成本高昂，扩展性差，并且在多样性上存在根本限制。一种有前景的方向是用大型语言模型（LLM）模拟的环境替代手工构建的环境。然而，这一范式依赖于一个未经检验的核心假设：LLM能够准确模拟环境反馈。在实践中，LLM模拟的环境存在幻觉、逻辑不一致和静默状态漂移等问题，这些问题会破坏代理的奖励信号，并加大该范式旨在消除的构建成本。为了解决这一问题，我们提出了EnvSimBench，并做出了四项贡献：1）我们首次正式定义并操作化环境模拟能力（Environment Simulation Ability，EnvSim Ability），作为一个可量化的研究目标。2）我们构建了EnvSimBench，这是一个严格的基准，涵盖了400个样本，涉及167个多样化的环境，配备可验证的标签和沿三个维度的细粒度难度分层。3）系统评估表明，所有最先进的语言模型都面临一个普遍的状态变化悬崖：当环境状态保持不变时，它们在任务上几乎达到完美准确率，但在需要同时更新多个状态时却会出现灾难性失败。这一发现揭示了EnvSim Ability作为一个关键但在很大程度上未被解决的能力缺口。4）我们设计了一种基于约束的模拟管道，显著减少了幻觉，环境合成产量提高了6.8%，成本降低超过90%。总体而言，EnvSimBench既是一个诊断框架，也是一个可靠的基于LLM的环境模拟的实用优化路径，为可扩展的代理训练奠定了基础。代码和数据可在 https://github.com/cookieApril/EnvSimBench 获取。

cs.AI / 41 / 2605.07251

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

代理能为反应定价吗？对化学成本推理的LLM评估

Wu, Yuyang, Huang, Yue, Shen, Shuaike, Wang, Xujian, Zhang, Shuhao, Xue, Qiyao, Liu, Weichen, Gao, Runtian, Ma, Jian, Zhang, Xiangliang, Isayev, Olexandr

Abstract

Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
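The headline metric (accuracy within 25% relative error) and the pack-selection step can be made concrete with a short sketch. Everything here is an assumption for illustration: the abstract does not define the denominator of the relative error, how unanswered reactions are scored, or how packs may be combined, so `relative_error`, `accuracy_within`, and `cheapest_covering_pack` are hypothetical helpers rather than ChemCost's scoring code.

```python
def relative_error(predicted, truth):
    """Relative error against the ground-truth cost (denominator is an assumption)."""
    return abs(predicted - truth) / abs(truth)


def accuracy_within(predictions, truths, tolerance=0.25):
    """Fraction of reactions whose predicted cost is within `tolerance` of the truth.

    A prediction of None (the agent produced no answer) counts as a miss.
    """
    hits = 0
    for predicted, truth in zip(predictions, truths):
        if predicted is not None and relative_error(predicted, truth) <= tolerance:
            hits += 1
    return hits / len(truths)


def cheapest_covering_pack(required_grams, packs):
    """Pick the cheapest single pack that covers the required quantity.

    packs: list of (size_in_grams, price) tuples. Simplification: one pack only,
    no combining of smaller packs. Returns None when no pack is large enough.
    """
    valid = [pack for pack in packs if pack[0] >= required_grams]
    return min(valid, key=lambda pack: pack[1]) if valid else None
```

The pack helper shows why procurement cost is not a pro-rata calculation: needing 3 g when packs come in 1 g, 5 g, and 25 g means paying for the full 5 g pack.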

Chinese Translation

大型语言模型（LLMs）作为工具使用代理的能力日益增强，其基准测试涵盖了多种通用代理任务。然而，对科学工具使用的严格评估仍然有限。在化学领域，近期的代理能够规划合成并调用特定领域的工具，但评估通常依赖于策划的演示、专家评估或LLM作为评判者的评分，而非精确的、无评判的真实情况。我们通过化学采购成本估算来填补这一空白，这是一项实用任务，代理必须确定化学物质的身份、检索供应商报价、选择有效的可购买包装、标准化数量，并根据反应描述计算成本。我们引入了ChemCost，这是一个包含1,427个可评估反应的基准，基于涵盖2,261种化学品和230,775个供应商报价的固定定价快照，支持标量评分和对基础、检索、采购和算术失败的阶段性诊断。为了评估鲁棒性，我们进一步构建了控制噪声注入的视图，扰动化学别名、数量表达、缺失字段和输入格式。对前沿、开放权重和化学专业LLM代理的实验表明，工具访问是必要的，但不足以解决该任务。最强的代理在干净输入下仅达到50.6%的准确率，且在现实噪声下显著下降。阶段性分析进一步显示，失败源于脆弱的解析、无效的证据整合、无效的包装选择和不收敛的工具使用。

cs.AI / 42 / 2605.07274

Structured Role-Aware Policy Optimization for Multimodal Reasoning

面向结构化角色的多模态推理策略优化

Jiang, Bingqing, Zou, Difan

Abstract

Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.

Chinese Translation

可验证奖励的强化学习（RLVR），尤其是基于群体相对策略优化（GRPO），在提升大型视觉-语言模型（LVLMs）的推理能力方面展现了强大的潜力。然而，在多模态推理中，最终答案的奖励通常是在序列层面分配的，并未区分不同标记的功能角色，这使得判断正确答案是否由与任务相关的视觉证据支持变得困难。本文从角色感知的标记级信用分配的角度重新审视多模态RLVR，其中结构化响应被分解为用于提取视觉证据的感知标记和用于从该证据推导答案的推理标记。基于这一视角，我们提出了结构化角色感知策略优化（SRPO），该方法在不改变奖励函数的情况下，将序列层面的GRPO优势细化为角色感知的标记级优势。具体而言，SRPO通过使用自蒸馏的在线对比来分配角色特定的信用：感知标记根据其在原始与损坏视觉输入下的视觉依赖性被强调，而推理标记则根据其与生成的感知的一致性被强调。这些角色特定信号通过共享的轨迹级基线进一步统一，从而产生正的标记权重，调整相对更新幅度，同时保持原始GRPO奖励和优化方向，而无需外部奖励模型或单独的教师。针对多种多模态推理基准的实验表明，SRPO提升了基于证据的推理，突显了超越均匀序列级信用，朝向角色感知优化以实现可靠多模态推理的重要性。

cs.AI / 43 / 2605.07276

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

弱反馈代理代码修复中的信号重塑

Li, Jia, Su, Yuxin, Peng, Ting, Huang, Hailiang, Deng, Yuetang, Lyu, Michael R.

Abstract

Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot 0.385 to 0.535. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from 0.48 to 0.53 and reduces average evaluation steps from 23.50 to 17.02. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
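A small sketch may help show why the compile-only middle tier matters for group-normalized advantages. It assumes a three-level reward (0.0, 0.5, 1.0), the common mean-and-standard-deviation normalization, and a simple multiplicative use of step-level process scores; the tier values and the helper names are hypothetical, and the abstract does not state the exact formulas the authors use.

```python
from statistics import mean, pstdev


def layered_reward(compiles, semantically_correct):
    """Three-tier outcome reward; the 0.5 middle tier is a hypothetical value."""
    if compiles and semantically_correct:
        return 1.0
    if compiles:
        return 0.5
    return 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_update_weights(advantage, process_scores):
    """Scale one trajectory's advantage per step, outside the group normalization.

    Multiplying by the process score is an assumption made for illustration.
    """
    return [advantage * score for score in process_scores]
```

For a group with one failed build, two compile-only rollouts, and one fully correct rollout, the layered reward gives three distinct advantage levels. A binary reward assigns the failed build and the compile-only rollouts the same advantage, which is the loss of ranking information the abstract attributes to removing the middle tier.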

Chinese Translation

代码代理强化学习（RL）通常接收到弱反馈：在执行期间的信号是可靠且可执行的，但仅捕捉任务成功所需的表面条件，而非目标语义谓词。以代理编译修复为背景，我们研究在此类反馈下标准GRPO的信号重塑。我们的核心观点是，GRPO的组内比较只有在三种信号被重塑后才有意义：结果奖励恢复语义排序，过程信号定位轨迹内的信用，而来自相同提示的执行结果保持可比性。我们通过一种最小的信号重塑构造来实现这些条件，该构造不改变GRPO的组归一化优势构造：编译和语义分层奖励重塑轨迹排序，组奖励归一化外的逐步过程评分重塑轨迹内更新强度，而关注失败原因的执行治理重塑组内可比性。实验结果显示出明显的端到端增益：完全信号重塑的GRPO将基础模型的严格编译和语义准确率从零样本的0.385提高至0.535。控制比较进一步解释了这一增益的来源：二元奖励移除了仅编译的中间层，并降低了轨迹控制；在分层奖励的基础上，过程评分加权进一步将准确率从0.48提高至0.53，并将平均评估步骤从23.50减少至17.02。作为边界比较，特权提示的令牌级蒸馏主要优化局部分布对齐；在长工具使用轨迹中，该信号被非关键令牌稀释，无法替代结果语义、过程信用或组内可比性。

cs.AI / 44 / 2605.07301

SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model

SOM：基于结构因果模型的LLM代理的结构化对手建模

Cao, Shiyue, Xu, Pei, Yang, Likun, Cui, Lei, Chen, Xiaotang, Huang, Kaiqi

Abstract

Accurately predicting opponents' behavior from interactions is a fundamental capability for large language model (LLM)-based agents in multi-agent and game-theoretic environments. Existing approaches often entangle opponent modeling with prediction, relying on implicit contextual reasoning and limiting adaptability in dynamic interactions. To this end, we propose Structured Opponent Modeling (SOM), a two-stage opponent modeling framework that distinctly separates opponent model construction and opponent prediction. At the construction stage, SOM employs a Structural Causal Model (SCM), a graph-based formalism for representing dependencies among variables, to capture directed links between opponents' observations and actions, yielding an explicit and structured opponent representation. At the prediction stage, the LLM performs structured reasoning along clear pathways derived from the SCM, improving both prediction accuracy and stability. Extensive experiments on diverse multi-agent benchmarks demonstrate that SOM consistently outperforms state-of-the-art LLM-based reasoning baselines, enabling more accurate and adaptable strategic decision-making in complex and dynamic multi-agent interactions.

Chinese Translation

准确预测对手在交互中的行为是基于大型语言模型（LLM）的代理在多智能体和博弈论环境中的基本能力。现有方法通常将对手建模与预测混合在一起，依赖于隐式的上下文推理，限制了在动态交互中的适应性。为此，我们提出了结构化对手建模（SOM），这是一种两阶段的对手建模框架，明确区分对手模型构建和对手预测。在构建阶段，SOM采用结构因果模型（SCM），这是一种基于图的形式化方法，用于表示变量之间的依赖关系，以捕捉对手观察与行动之间的有向联系，从而生成明确且结构化的对手表示。在预测阶段，LLM沿着从SCM导出的清晰路径进行结构化推理，提高了预测的准确性和稳定性。在多种多智能体基准上的广泛实验表明，SOM始终优于最先进的基于LLM的推理基线，能够在复杂和动态的多智能体交互中实现更准确和更具适应性的战略决策。

cs.AI / 45 / 2605.07313

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

存储证据何时停止可用：基于规模的智能体记忆评估

Shao, Jiaqi, Lu, Yiyi, Zhang, Yunzhen, Luo, Bing

Abstract

Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent-memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16-20 percentage points in budget-compliant reliability as irrelevant sessions are added; LiCoMemory's observed failures depend strongly on the agent, with Qwen3-8B exceeding the budget while Qwen3-32B and Qwen3-235B remain reliable in the tested range. The result supports a framework for making scalable-memory claims conditional on agent, interface, scale range, and interaction budget.
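Two of the four diagnostics, budget-compliant reliability and the usable-scale boundary, can be sketched from logged trajectories. The two-call budget comes from the abstract; the 0.8 reliability target, the log layout, and both function names are assumptions made for illustration, not the protocol's actual definitions.

```python
def budget_compliant_reliability(trajectories, call_budget=2):
    """Fraction of queries answered correctly without exceeding the memory-call budget.

    trajectories: list of (answered_correctly, memory_calls) tuples (hypothetical layout).
    """
    compliant = sum(
        1 for correct, calls in trajectories if correct and calls <= call_budget
    )
    return compliant / len(trajectories)


def usable_scale_boundary(logs_by_scale, target=0.8, call_budget=2):
    """Smallest scale at which reliability falls below `target`, or None if it never does.

    logs_by_scale: dict mapping number of added irrelevant sessions -> trajectories.
    The 0.8 target is a placeholder; the abstract does not give the value used.
    """
    for scale in sorted(logs_by_scale):
        reliability = budget_compliant_reliability(logs_by_scale[scale], call_budget)
        if reliability < target:
            return scale
    return None
```

Counting a correct answer that needed too many memory calls as a failure is what separates this measure from plain accuracy: an agent can stay accurate while becoming unusable under a fixed interaction budget.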

Chinese Translation

记忆智能体评估报告固定快照的准确性或检索质量，但这些分数并未显示随着无关会话（未标注为查询任务相关证据的会话）的积累，证据是否仍然可用。我们提出了一种在证据保持增长下的规模条件评估协议：对于每个查询，任务证据保持固定，同时添加无关会话。该协议记录智能体-记忆轨迹并报告四个诊断指标：预算合规可靠性、尾部记忆调用负担、失败机制分解以及可靠性低于目标的可用规模边界。应用于 LongMemEval 和 LoCoMo 的扁平、平面和层次记忆接口，该协议显示可靠性损失并非单一现象。在 LongMemEval 上，HippoRAG 保持在两次调用预算内，但随着无关会话的增加，预算合规可靠性下降了 16-20 个百分点；LiCoMemory 的观察到的失败在很大程度上依赖于智能体，其中 Qwen3-8B 超出预算，而 Qwen3-32B 和 Qwen3-235B 在测试范围内保持可靠。该结果支持一个框架，使可扩展记忆的声明依赖于智能体、接口、规模范围和交互预算。

cs.AI / 46 / 2605.07316

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

隐式压缩正则化：通过内部更短分布实现简洁推理的强化学习后训练

Wang, Chen, Deng, Hexuan, Zhang, Yining, Zhang, Yuchen, Bai, Jionghao, Li, Zhaochun, Lan, Ge, Wang, Yue

Abstract

Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length-accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose Implicit Compression Regularization (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length-accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy-length Pareto frontier.
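The regime test described in the abstract (the sign of the length-accuracy correlation) and the choice of the shortest correct response as a compression target can be sketched for a single rollout group. This covers only the diagnostic, not the ICR regularizer itself, whose loss the abstract does not spell out; `pearson` and `diagnose_group` are hypothetical helpers, and computing the correlation per group is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def diagnose_group(lengths, correct):
    """Classify one rollout group and find its shortest correct response length."""
    r = pearson(lengths, [1.0 if c else 0.0 for c in correct])
    if r < 0:
        regime = "overthinking"    # shorter responses tend to be the correct ones
    elif r > 0:
        regime = "underthinking"   # longer responses tend to be the correct ones
    else:
        regime = "neutral"
    correct_lengths = [length for length, c in zip(lengths, correct) if c]
    return {
        "correlation": r,
        "regime": regime,
        "shortest_correct": min(correct_lengths) if correct_lengths else None,
        "mean_length": sum(lengths) / len(lengths),
    }
```

In a group where the two shortest rollouts are correct and the two longest are wrong, the correlation is negative and the shortest correct response sits well below the group mean length, which is the situation in which the abstract treats it as a natural compression target.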

Chinese Translation

具有可验证奖励的强化学习改善了大型语言模型（LLM）的推理能力，但常常导致过度思考，模型生成不必要的长推理轨迹。现有方法主要依赖于长度惩罚或提前退出策略；然而，前者可能会降低准确性并导致思维不足，而后者则假设推理轨迹的相当一部分可以安全地截断。为了在没有这些限制的情况下获得压缩信号，我们重新审视现有压缩方法的训练动态。我们观察到，长度与准确性的相关性最初是负的，但在压缩过程中持续增加，这表明较短的响应最初更可能是正确的，但随着策略向思维不足的方向发展，这一特性逐渐减弱。基于这一观察，我们对过度思考进行了形式化：负相关性表明处于过度思考状态，而正相关性则表明思维不足。当过度思考时，最短的正确响应在期望上比群体平均响应长度更短，使其成为在策略回放中自然存在的压缩目标。因此，我们提出了隐式压缩正则化（Implicit Compression Regularization，ICR），这是一种基于策略的正则化方法，其压缩信号来自于回放组中最短的正确响应所诱导的虚拟更短分布，引导策略朝向简洁而正确的轨迹。训练动态表明，ICR在压缩过程中保持了更好的长度与准确性相关性，表明短响应在准确性上保持更好的对齐，而不是漂移向思维不足。对三种推理骨干网络和多个数学及知识密集型基准的实验表明，ICR始终缩短响应，同时保持或提高准确性，实现了更强的准确性与长度的帕累托前沿。

cs.AI / 47 / 2605.07323

Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

基于LLM的定性和定量评估发现常微分方程

Song, Sum Kyun, Shin, Bong Gyun, Lee, Jae Yong

Abstract

Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.

Chinese Translation

从观测数据中发现控制微分方程是科学机器学习中的一个基本挑战。现有的符号回归方法主要依赖于定量指标；然而，现实世界中的微分方程建模还需要结合领域知识，以确保物理的合理性。为了解决这一问题，我们提出了DoLQ，一种基于LLM的定性和定量评估发现常微分方程的方法。DoLQ采用多智能体架构：一个采样智能体（Sampler Agent）提出动态系统候选，参数优化器（Parameter Optimizer）对方程进行精确度优化，科学家智能体（Scientist Agent）利用LLM进行定性和定量评估，并综合其结果以迭代指导搜索。在多维常微分方程基准测试中的实验表明，DoLQ的性能优于现有方法，不仅成功率更高，而且更准确地恢复了真实方程的正确符号项。我们的代码可在 https://github.com/Bon99yun/DoLQ 获取。

cs.AI / 48 / 2605.07339

Tools as Continuous Flow for Evolving Agentic Reasoning

工具作为演变代理推理的连续流

Huang, Tairan, Shang, Siyu, Chen, Qiang, Su, Xiu, Chen, Yi

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in orchestrating tools for reasoning tasks. However, existing methods rely on a step-wise paradigm that lacks a global perspective, which causes error accumulation over long horizons and restricts generalization to unseen tools. To overcome these limitations, we propose Tools as Continuous Flow for Evolving Agentic Reasoning (FlowAgent), which reconceptualizes tool chaining as continuous trajectory generation within a semantic space. To systematically evaluate this paradigm, we introduce the first plan-level closed-loop benchmark dedicated to plan-level agentic reasoning in dynamic real-world environments. Specifically, the proposed FlowAgent leverages conditional flow matching to generate continuous latent trajectories, providing a global planning perspective to ensure coherent and robust tool execution. Theoretically, we establish formal bounds on utility convergence and prove that our continuous formulation fundamentally guarantees robust generalization and error attenuation. Empirical evaluations show that FlowAgent achieves superior robustness and adaptability in long-horizon reasoning tasks.

Chinese Translation

大型语言模型（LLMs）在工具协调推理任务方面展现了显著的能力。然而，现有方法依赖于逐步范式，缺乏全局视角，这导致在长时间范围内的错误积累，并限制了对未见工具的泛化能力。为克服这些局限性，我们提出了“工具作为演变代理推理的连续流”（FlowAgent），将工具链的概念重新构想为在语义空间内的连续轨迹生成。为了系统地评估这一范式，我们引入了首个专门针对动态现实环境中计划级代理推理的计划级闭环基准测试。具体而言，所提出的FlowAgent利用条件流匹配生成连续潜在轨迹，提供全局规划视角，以确保工具执行的一致性和稳健性。从理论上讲，我们建立了效用收敛的正式界限，并证明我们的连续形式在根本上保证了稳健的泛化和错误减弱。实证评估表明，FlowAgent在长时间范围推理任务中实现了更优的稳健性和适应性。

cs.AI / 49 / 2605.07353

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

信心感知对齐使推理大型语言模型更可靠

Chen, Kejia, Zhang, Jiawen, Wu, Yihong, Gao, Kewei, Lou, Jian, Feng, Zunlei, Song, Mingli, Jia, Ruoxi

Abstract

Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME'24 and AIME'25 without using reward-model data. We also release a step-wise dataset with confidence annotations to support fine-grained analysis of reasoning reliability. Code is available at https://github.com/Thecommonirin/CASPO.

Chinese Translation

大型推理模型常常通过有缺陷的中间步骤得出正确答案，从而导致最终准确性与推理可靠性之间存在差距。现有的对齐策略通过外部验证者或大量采样来解决这一问题，但限制了可扩展性。在本研究中，我们提出了CASPO（信心感知逐步偏好优化），这是一个通过迭代直接偏好优化将标记级信心与逐步逻辑正确性对齐的框架，无需训练单独的奖励模型。在推理过程中，我们提出了信心感知思维（Confidence-aware Thought, CaT），利用这种经过校准的信心动态修剪不确定的推理分支，且几乎没有O(V)延迟。我们在十个基准和多个模型系列上的实验表明，CASPO始终提高了推理可靠性和推理效率。CASPO能够扩展到Qwen3-8B-Base，并在AIME'24和AIME'25上超越树搜索基线，而无需使用奖励模型数据。我们还发布了一个带有信心注释的逐步数据集，以支持对推理可靠性的细粒度分析。代码可在https://github.com/Thecommonirin/CASPO获取。

cs.AI / 50 / 2605.07357

GraphReAct: Reasoning and Acting for Multi-step Graph Inference

GraphReAct：多步图推理的推理与行动

Yu, Xingtong, Kuai, Zhongwei, Zhou, Chang, Xie, Xuanting, Jiang, Renhe, Zhang, Xikun, Cheng, Hong, Zhang, Xinming, Fang, Yuan

Abstract

Reasoning-acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi-step inference. In this work, we propose GraphReAct, a graph reasoning-acting framework that enables step-by-step inference over graph-structured data. Specifically, we design a graph-based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non-local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi-step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state-of-the-art methods, validating the effectiveness of reasoning-acting for graph learning.

Chinese Translation

推理-行动框架通过将推理与行动交替进行，增强了大型语言模型（LLMs），以实现动态信息获取。然而，将这一范式扩展到图学习仍然未得到充分探索。图数据本质上是结构化的，信息分布在节点和边上，并通过拓扑结构和潜在表示进行编码。因此，有效的图推理不仅需要从图中检索信息丰富的证据，还需要在多步推理过程中逐步细化累积的上下文。在本研究中，我们提出了GraphReAct，一个图推理-行动框架，能够对图结构数据进行逐步推理。具体而言，我们设计了一个基于图的行动空间，包含两种互补的检索行动：拓扑检索，捕捉局部结构依赖关系；语义检索，访问表示空间中非局部但相关的证据。这些行动动态扩展了推理上下文。为了进一步支持多步推理，我们引入了另一种类型的行动，即上下文细化，它将累积的信息提炼并重新组织成紧凑的表示。通过将推理与检索和细化行动交替进行，我们的框架实现了从上下文扩展到压缩的渐进过渡。在六个基准数据集上的大量实验表明，GraphReAct始终优于最先进的方法，验证了推理-行动在图学习中的有效性。
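
The two retrieval actions described in the abstract can be sketched roughly as follows. This is an illustration under assumptions: topological retrieval is modeled as a k-hop breadth-first search, and semantic retrieval as top-k cosine similarity over node embeddings. The function names and the toy graph are hypothetical.

```python
from collections import deque
import math

def topological_retrieval(adj, start, hops):
    """Nodes reachable from `start` within `hops` edges (breadth-first, excluding `start`)."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                result.append(nb)
                frontier.append((nb, depth + 1))
    return result

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_retrieval(emb, query, k):
    """Top-k nodes by embedding similarity to `query`, regardless of graph distance."""
    scores = [(cosine(emb[query], vec), node) for node, vec in emb.items() if node != query]
    scores.sort(reverse=True)
    return [node for _, node in scores[:k]]

# Toy graph: "d" is disconnected from "a" but has a similar embedding.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5], "d": [0.9, 0.1]}
local = topological_retrieval(adj, "a", 2)
nonlocal_hits = semantic_retrieval(emb, "a", 1)
```

In the toy graph, the structural search returns the neighbors of "a", while the similarity search surfaces the disconnected node "d", which is the kind of non-local evidence the abstract attributes to semantic retrieval.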

cs.AI / 51 / 2605.07393

Offline Policy Optimization with Posterior Sampling

基于后验采样的离线策略优化

Lin, Hongqiang, Zhang, Dongxu, Sun, Yiding, Li, Mingzhe, Yang, Ning, Zhang, Haijun

Abstract

A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.

Chinese Translation

基于模型的离线强化学习（RL）面临的一个基本挑战是如何在泛化能力与对分布外（OOD）区域的利用错误的鲁棒性之间进行权衡。尽管OOD样本可能捕捉到有效的潜在物理动态，但它们也带来了模型利用的风险。现有方法通常通过过度悲观的正则化来应对这一风险，这种方法虽然确保了鲁棒性，但往往牺牲了泛化能力。为克服这一局限性，我们提出了基于后验采样的策略优化（PSPO），将动态建模公式化为贝叶斯推断过程，以推导出明确量化模型保真度的后验分布。通过后验采样与约束策略优化的结合，我们的方法利用动态一致的OOD转移来实现泛化，同时确保对模型利用的鲁棒性。从理论上讲，我们将后验采样下的Q值估计公式化为随机逼近问题，并建立其收敛性。我们将策略优化分解为一系列约束子问题，证明解决这些子问题可以保证单调改进，直至收敛。标准基准上的实验验证了PSPO在性能上优于最先进的基线方法。

cs.AI / 52 / 2605.07452

Bounded Fitting for Expressive Description Logics

有界拟合在表现力丰富的描述逻辑中的应用

Funk, Maurice, Jung, Jean Christoph, Voellmer, Tom

Abstract

Bounded fitting is an attractive paradigm for learning logical formulas from labeled data examples that offers PAC-style generalization guarantees and can often be implemented leveraging SAT solvers. It has been successfully applied to learning concepts of the description logic ALC. We study bounded fitting for learning concepts in expressive description logics that extend ALC with inverse roles, qualified number restrictions, and feature comparisons. We investigate under which conditions bounded fitting keeps its favorable theoretical properties in this setting, and implement it using a SAT solver. We compare our tool with state-of-the-art concept learners with encouraging results, demonstrating that it is a practical approach to expressive concept learning.

Chinese Translation

有界拟合是一种从标记数据示例中学习逻辑公式的吸引人范式，它提供了PAC风格的泛化保证，并且通常可以利用SAT求解器实现。它已成功应用于学习描述逻辑ALC中的概念。我们研究了在扩展ALC的表现力丰富的描述逻辑中学习概念的有界拟合，包括逆角色、合格数量限制和特征比较。我们探讨了在这种情况下有界拟合保持其有利理论性质的条件，并使用SAT求解器实现了该方法。我们将我们的工具与最先进的概念学习器进行了比较，结果令人鼓舞，证明了它是一种实用的表现力丰富的概念学习方法。
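
The bounded fitting loop can be sketched in a heavily simplified setting. The sketch assumes examples are plain sets of atomic concept names and candidate concepts are conjunctions of atoms, and it uses brute-force enumeration in place of the SAT encoding the abstract refers to. None of the names below come from the paper.

```python
from itertools import combinations

def fits(concept, positives, negatives):
    """A conjunction fits if it holds on every positive and on no negative example."""
    return (all(concept <= ex for ex in positives)
            and not any(concept <= ex for ex in negatives))

def bounded_fitting(atoms, positives, negatives, max_size):
    """Search for a fitting conjunction, raising the size bound one step at a time."""
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(atoms), k):
            concept = frozenset(combo)
            if fits(concept, positives, negatives):
                return concept  # the first hit has minimal size
    return None

# Toy labeled examples (hypothetical data).
atoms = {"Person", "Parent", "Female", "Male", "Dog"}
positives = [{"Person", "Parent", "Female"}, {"Person", "Parent", "Male"}]
negatives = [{"Person", "Female"}, {"Parent", "Dog"}]
result = bounded_fitting(atoms, positives, negatives, 3)
```

Because the bound grows from 1 upward, the first concept found is also a smallest one, which is the property that PAC-style generalization arguments for bounded fitting typically rely on.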

cs.AI / 53 / 2605.07488

Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

通过增量优化效用实现多模态模型的高效数据选择

Jing, Jinhao, Zhao, Qiannian, Huang, Chao, Su, Zhan

Abstract

The scaling of Large Multimodal Models (LMMs) is constrained by the quality-quantity trade-off inherent in synthetic data. Previous approaches, such as LLM-as-a-Judge, have proven their effectiveness in addressing this but suffer from prohibitive computational costs and lack of interpretability. To bridge this gap, we propose One-Step-Train (OST), a framework that reformulates data selection as an incremental optimization utility ranking problem. Instead of relying on semantic heuristics, OST estimates the marginal utility of each sample via a simulated single-step update on a lightweight proxy. Experiments on the Qwen series across multimodal mathematical reasoning benchmarks demonstrate that OST achieves Pareto-optimal efficiency. By selecting the top-50 subset, OST reduces training costs by 43% (and total time consumption by 17%) while surpassing the strong LLM-as-a-Judge baseline by 1.8 points. Furthermore, under a fixed compute budget, our method using only the top-20 subset achieves a 5.6 point gain over LLM-as-a-Judge, improves upon heuristic scoring baselines like DEITA, and outperforms the Full-SFT baseline by 8.8 points. Notably, while Full-SFT suffers from performance degradation due to noise, our optimization-grounded approach effectively identifies toxic samples, successfully reversing the negative transfer frequently observed in complex reasoning tasks.

Chinese Translation

大型多模态模型（LMMs）的扩展受到合成数据固有的质量-数量权衡的限制。之前的方法，如LLM-as-a-Judge，已证明其在解决这一问题上的有效性，但面临着高昂的计算成本和缺乏可解释性的问题。为了弥补这一差距，我们提出了One-Step-Train（OST）框架，将数据选择重新表述为增量优化效用排序问题。OST不依赖于语义启发式，而是通过对轻量级代理进行模拟单步更新来估计每个样本的边际效用。在多模态数学推理基准测试中对Qwen系列的实验表明，OST实现了帕累托最优效率。通过选择前50个子集，OST将训练成本降低了43%（总时间消耗减少了17%），同时超越了强大的LLM-as-a-Judge基线1.8分。此外，在固定计算预算下，我们的方法仅使用前20个子集，相较于LLM-as-a-Judge获得了5.6分的提升，改善了像DEITA这样的启发式评分基线，并超越了Full-SFT基线8.8分。值得注意的是，尽管Full-SFT由于噪声而表现下降，我们的基于优化的方法有效识别有毒样本，成功逆转了在复杂推理任务中常见的负迁移现象。
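
A minimal sketch of scoring samples by a simulated single-step update is given below. It assumes a one-parameter linear proxy and defines utility as the drop in held-out loss after one gradient step on the sample. The proxy, the learning rate, and the data are hypothetical stand-ins, not the setup used in the paper.

```python
def val_loss(w, val):
    """Mean squared error of the proxy y = w * x on a held-out set."""
    return sum((w * x - y) ** 2 for x, y in val) / len(val)

def one_step_utility(w, sample, val, lr=0.1):
    """Drop in held-out loss after one simulated SGD step on `sample` (assumed utility)."""
    x, y = sample
    grad = 2 * (w * x - y) * x          # d/dw of (w * x - y)^2
    w_new = w - lr * grad               # simulated update, the real proxy is left untouched
    return val_loss(w, val) - val_loss(w_new, val)

def rank_samples(w, samples, val, lr=0.1):
    """Sort candidate samples from most to least useful."""
    scored = [(one_step_utility(w, s, val, lr), s) for s in samples]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored

val = [(1.0, 2.0), (2.0, 4.0)]          # clean held-out data, true slope 2
samples = [(1.0, 2.0), (1.0, -2.0)]     # one clean sample, one mislabeled sample
ranking = rank_samples(0.0, samples, val)
```

In this toy case the mislabeled sample receives a negative utility because its update moves the proxy away from the held-out data, which mirrors the abstract's point about identifying harmful samples.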

cs.AI / 54 / 2605.07505

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

LiteGUI：通过强化学习提炼紧凑型图形用户界面代理

Wu, Yubin, Cai, Zicheng, Ning, Liping, Wang, Hua, Chen, Zhi, Tang, Yaohua, Chen, Hao

Abstract

Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.

Chinese Translation

开发轻量级的设备端视觉-语言图形用户界面代理对于高效的跨平台自动化交互至关重要。然而，当前的设备端代理受到模型容量有限的限制，进一步提升性能的需求迫在眉睫。传统的小规模模型的监督微调（Supervised Fine-Tuning, SFT）往往导致过拟合、灾难性遗忘和策略僵化，因此未能充分解决这些挑战。在本研究中，我们提出了一种新颖的无SFT训练范式，显著提升了小规模模型的性能。我们首先通过引导式在线蒸馏（Guided On-policy Distillation）将广义知识蒸馏系统性地整合到图形用户界面代理领域。通过结合oracle参考轨迹和动态检索机制，我们的方法减少了幻觉现象，并缓解了多解图形用户界面任务中固有的认知错位。在此基础上，我们进一步引入了多解双层GRPO框架，该框架将宏观层次的子任务规划与微观层次的执行匹配进行联合对齐，从而改善了长时间跨度图形用户界面代理场景中的探索。此外，我们构建了一个自动化数据生成管道，以合成具有丰富多解注释的图形用户界面任务轨迹。大量实验表明，我们的方法在轻量级模型中实现了最先进的性能，同时在所有基准测试中与大规模模型保持竞争力。消融研究进一步表明，结构化的在线蒸馏和多解双层探索能够充分释放2B/3B规模代理的能力，超越传统模仿学习的性能极限。

cs.AI / 55 / 2605.07520

Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration

通过随机探索在可微分模拟器中进行模型驱动的策略优化

Aroosh, Yuval, Taitler, Ayal

Abstract

Differentiable planning enables gradient-based optimization of decision-making problems by leveraging differentiable models of system dynamics. However, in highly nonlinear and hybrid discrete-continuous domains, the resulting optimization landscapes are often ill-conditioned, with flat regions and sharp transitions that hinder effective optimization. We propose Model-Driven Policy Optimization (MDPO), a framework that introduces stochastic exploration into differentiable planning by injecting noise into the action space during optimization. Leveraging access to the model, MDPO further adapts the noise magnitude based on gradient-derived sensitivity of the trajectory objective, yielding a time-dependent exploration profile. This enables improved exploration of the objective landscape and helps escape poor local optima via dynamic allocation of exploration across timesteps and iterations. Experiments on benchmark domains demonstrate that MDPO consistently outperforms deterministic differentiable planning, including both the noise-free variant of our method and available state-of-the-art implementations, as well as model-free baselines such as PPO, significantly improving solution quality across challenging nonlinear and hybrid settings. We further analyze the evolution of the adaptive noise magnitude across both time steps and optimization iterations, providing insight into how exploration is allocated during learning.

Chinese Translation

可微分规划通过利用系统动态的可微分模型，使基于梯度的决策问题优化成为可能。然而，在高度非线性和混合离散-连续领域中，产生的优化景观往往条件不良，存在平坦区域和急剧过渡，这妨碍了有效的优化。我们提出了模型驱动的策略优化（Model-Driven Policy Optimization, MDPO）框架，该框架通过在优化过程中向动作空间注入噪声，将随机探索引入可微分规划。MDPO利用对模型的访问，根据梯度导出的轨迹目标的灵敏度进一步调整噪声幅度，从而产生时间依赖的探索特征。这使得对目标景观的探索得以改善，并通过在时间步和迭代之间动态分配探索，帮助逃离不良局部最优解。在基准领域的实验表明，MDPO始终优于确定性可微分规划，包括我们方法的无噪声变体和现有的最先进实现，以及无模型基线如PPO，显著提高了在具有挑战性的非线性和混合设置中的解决方案质量。我们进一步分析了自适应噪声幅度在时间步和优化迭代中的演变，为学习过程中探索的分配提供了洞见。

cs.AI / 56 / 2605.07521

From Feasible to Practical: Pareto-Optimal Synthesis Planning

从可行到实用：帕累托最优合成规划

Hastedt, Friedrich, Zhang, Dongda, Chanona, Antonio del Rio

Abstract

Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro*, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs among user-defined criteria. MORetro* uses weighted scalarization and BO-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A*-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro* recovers the true Pareto front. Across multiple retrosynthesis benchmarks, MORetro* produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.

Chinese Translation

当前的计算机辅助合成规划（CASP）方法通常在识别出单一可行路线后就认为逆合成问题已解决，主要关注收敛性或最短路径指标。这种观点与实际操作不符，因为化学家必须在成本、可持续性、毒性和总体产量等竞争目标之间进行权衡。为了解决这个问题，我们将合成规划形式化为一个多目标搜索问题，并引入了MORetro*算法，该算法生成合成路线的帕累托前沿，以明确捕捉用户定义标准之间的权衡。MORetro*使用加权标量化和基于贝叶斯优化（BO）的采样来高效地导航组合搜索空间并优先考虑有前景的权衡。基于多目标A*搜索，我们提供了最优性保证，表明对于固定的单步模型，MORetro*能够恢复真实的帕累托前沿。在多个逆合成基准测试中，MORetro*生成了多样化的高质量帕累托前沿，揭示了单目标方法所忽视的解决方案，并更好地将CASP输出与工业决策相结合。
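
The notions of a Pareto front and of weighted scalarization can be made concrete with a small sketch. It assumes every objective is minimized and every route is summarized by a tuple of costs. The route names and numbers are invented for illustration, and the search procedure of MORetro* itself is not reproduced.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(routes):
    """Names of the routes that no other route dominates."""
    return sorted(name for name, cost in routes.items()
                  if not any(dominates(other, cost) for other in routes.values()))

def scalarize(cost, weights):
    """Weighted sum of objectives, one common way to rank trade-offs."""
    return sum(w * c for w, c in zip(weights, cost))

# Hypothetical routes scored by (reagent cost, number of steps).
routes = {
    "route_a": (10.0, 3),
    "route_b": (8.0, 5),
    "route_c": (12.0, 4),   # worse than route_a on both objectives
    "route_d": (8.0, 6),    # same cost as route_b but one more step
}
front = pareto_front(routes)
best = min(routes, key=lambda n: scalarize(routes[n], (0.2, 0.8)))
```

Different weight vectors select different points of the front, which is why a single-objective planner can only ever return one of the trade-offs.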

cs.AI / 57 / 2605.07537

Multi-Environment POMDPs with Finite-Horizon Objectives

具有有限时间目标的多环境部分可观察马尔可夫决策过程

Brice, Léonard, Cano, Filip, Chatterjee, Krishnendu, Henzinger, Thomas A., Muroya, Stefanie

Abstract

Partially Observable Markov Decision Processes (POMDPs) are systems in which one agent interacts with a stochastic environment, and receives only partial information about the current state. In a multi-environment POMDP (MEPOMDP), the initial state is unknown, and assumed to be adversarially chosen. In this work we focus on computing the optimal value and policy in MEPOMDPs with finite-horizon objectives. That problem is known to be PSPACE-complete in POMDPs. Our main results are as follows: (1) we establish that it is also PSPACE-complete in the more general setting of MEPOMDPs; (2) we present a practical algorithm and evaluate it on classical benchmarks, significantly outperforming the only previously known algorithm.

Chinese Translation

部分可观察马尔可夫决策过程（POMDPs）是指一个智能体与随机环境交互，并仅接收关于当前状态的部分信息的系统。在多环境部分可观察马尔可夫决策过程（MEPOMDP）中，初始状态是未知的，并假定由对手选择。在本研究中，我们重点计算具有有限时间目标的MEPOMDP的最优值和策略。已知该问题在POMDP中是PSPACE完全的。我们的主要结果如下：（1）我们证明在更一般的MEPOMDP环境中，该问题同样是PSPACE完全的；（2）我们提出了一种实用算法，并在经典基准测试上进行了评估，显著优于之前已知的唯一算法。

cs.AI / 58 / 2605.07544

From Pixels to Prompts: Vision-Language Models

从像素到提示：视觉-语言模型

Vo, Khang Hoang Nhat

Abstract

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.

Chinese Translation

当你今天阅读关于新视觉-语言模型的论文时，可能很容易忘记这个想法在不久前听起来是多么奇怪。教机器看已经很困难，教它们阅读和生成语言也已经很困难。要求它们同时做这两件事，然后推理、回答问题、遵循指令，有时甚至让我们感到惊讶，仍然带有一丝科幻的色彩，即使这已经变得常规。这本书的诞生源于一个简单的感受：太容易迷失方向了。这个领域发展迅速，新模型名称不断出现，而“我知道这些流行词”和“我实际上理解它是如何运作的”之间的差距可能让人感到不安。我多次感受到这种差距。如果你正在阅读这本书，你可能也有这样的感受。我的目标不是提供每个数据集、基准和新模型变体的详尽目录。相反，我想提供一些更为谦逊的东西，我希望也是更持久的：一个清晰的视觉-语言模型的思维导图。足够的结构让你可以自信地阅读新论文；足够的直觉让你可以设计自己的系统，而不至于感到自己在盲目拼装乐高积木。

cs.AI / 59 / 2605.07572

Open-Ended Task Discovery via Bayesian Optimization

通过贝叶斯优化进行开放式任务发现

Adachi, Masaki, Suzuki, Yuta, Ziomek, Juliusz

Abstract

When applying Bayesian optimization (BO) to scientific workflows, a major yet often overlooked source of uncertainty is the task itself -- namely, what to optimize and how to evaluate it -- which can evolve as evidence accumulates. We introduce Generate-Select-Refine (GSR), an open-ended BO framework that alternates between task generation and task optimization. Starting from a user-provided seed task, GSR generates new tasks in a coarse-to-fine manner while a task-acquisition function schedules optimization. Asymptotically, it concentrates evaluations on the best task, incurring only logarithmic regret overhead relative to single-task BO. We apply GSR to new product development, chemical synthesis scaling, algorithm analysis, and patent repurposing, where it outperforms existing LLM-based optimizers.

Chinese Translation

在将贝叶斯优化（Bayesian Optimization, BO）应用于科学工作流时，一个主要但常被忽视的不确定性来源是任务本身——即，优化什么以及如何评估它——这些任务会随着证据的积累而演变。我们提出了生成-选择-精炼（Generate-Select-Refine, GSR）框架，这是一个开放式的贝叶斯优化框架，交替进行任务生成和任务优化。从用户提供的种子任务开始，GSR以粗到细的方式生成新任务，同时任务获取函数调度优化。渐近地，它将评估集中在最佳任务上，相较于单任务贝叶斯优化，仅产生对数级的遗憾开销。我们将GSR应用于新产品开发、化学合成规模化、算法分析和专利再利用等领域，在这些领域中，其表现优于现有的基于大型语言模型（LLM）的优化器。

cs.AI / 60 / 2605.07584

Parallel Lifted Planning via Semi-Naive Datalog Evaluation

通过半天真Datalog评估的并行提升规划

Drexler, Dominik, Joergensen, Oliver, Seipp, Jendrik

Abstract

Lifted classical planners operate directly on first-order planning tasks to avoid the computationally demanding grounding step. However, lifted planning is typically slower, as planners must repeatedly instantiate ground structures during search. Many core components of lifted classical planning, such as successor generation, axiom evaluation, task grounding, and delete-relaxed heuristics, have previously been studied through the lens of Datalog evaluation. We build upon this line of work and extend it by developing and analyzing an execution model with two levels of parallelism: rule-level parallelism and grounding parallelism. We further specialize this solver for planning-specific workloads with a grounder based on clique enumeration, which we extend to support semi-naive Datalog evaluation. Our experimental evaluation using greedy best-first search with the FF heuristic shows that our implementation already solves more tasks than the baselines on a single core, and the gap widens as additional cores are used. Moreover, on hard-to-ground tasks where on average 97.6% of the total runtime is spent in Datalog execution, the proposed execution model exhibits an average parallel fraction of 92.4%, while achieving up to a 6-fold speedup on 8 cores in practice.

Chinese Translation

提升的经典规划器直接在一阶规划任务上操作，以避免计算上要求高的基础步骤。然而，提升规划通常较慢，因为规划器在搜索过程中必须反复实例化基础结构。许多提升经典规划的核心组件，如后继生成、公理评估、任务基础和删除松弛启发式，之前已通过Datalog评估的视角进行研究。我们在这一研究基础上进行扩展，开发并分析一种具有两级并行性的执行模型：规则级并行性和基础级并行性。我们进一步为特定规划工作负载专业化该求解器，基于团体枚举的基础器，并扩展以支持半天真Datalog评估。我们的实验评估使用贪婪最佳优先搜索与FF启发式，显示我们的实现已经在单核上解决了比基线更多的任务，并且随着使用更多核心，差距进一步扩大。此外，在那些平均97.6%的总运行时间花费在Datalog执行上的难以基础任务中，所提出的执行模型表现出平均92.4%的并行比例，同时在实际中在8个核心上实现了最高6倍的加速。
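
Semi-naive evaluation, the Datalog technique named in the title, can be illustrated on the textbook transitive-closure program `path(x, y) :- edge(x, y).` and `path(x, z) :- path(x, y), edge(y, z).` The sketch below is a generic, sequential version of the technique. It does not reproduce the paper's parallel execution model or its clique-based grounder.

```python
def transitive_closure_semi_naive(edges):
    """Compute path/2 from edge/2, joining only newly derived facts in each round."""
    path = set(edges)    # all facts derived so far
    delta = set(edges)   # facts that were new in the previous round
    while delta:
        # Only delta is joined with edge, so older facts are never re-derived from scratch.
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - path
        path |= delta
    return path

edges = {(1, 2), (2, 3), (3, 4)}
closure = transitive_closure_semi_naive(edges)
```

A naive evaluator would join the full `path` relation with `edge` in every round. Restricting the join to `delta` is what keeps the work per round proportional to the newly derived facts.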

cs.AI / 61 / 2605.07631

Inference Time Causal Probing in LLMs

大规模语言模型中的因果探测推理时间

Khorasani, Sadegh, Salehkaleybar, Saber, Kiyavash, Negar, Grossglauser, Matthias

Abstract

Causal probing methods aim to test and control how internal representations influence the behavior of generative models. In causal probing, an intervention modifies hidden states so that a property takes on a different value. Most existing approaches define such interventions by training an auxiliary probe classifier, which ties the method to a specific task or model and risks misalignment with the model's predictive geometry. We propose Hidden-state Driven Margin Intervention (HDMI), a probe-free, gradient-based technique that directly steers hidden states using the model's native output. HDMI applies a margin objective that increases the probability of a target continuation while decreasing that of the source, without relying on probe classifiers. We further introduce a lookahead variant (LA-HDMI) for text editing that backpropagates through the softmax embeddings, modifying the current hidden state so that the likelihood of user-specified tokens increases in next token generations while preserving fluency. To evaluate interventions, we measure completeness (whether the targeted property changes as intended) and selectivity (whether unrelated properties are preserved), and report their harmonic mean as an overall measure of reliability. HDMI consistently achieves higher reliability than prior methods on the LGD agreement corpus and the CausalGym benchmark, across Meta-Llama-3-8B-Instruct and Pythia-70M.

Chinese Translation

因果探测方法旨在测试和控制内部表征如何影响生成模型的行为。在因果探测中，干预会修改隐藏状态，使得某一属性获得不同的值。现有的大多数方法通过训练辅助探测分类器来定义这种干预，这使得该方法与特定任务或模型紧密相关，并且可能与模型的预测几何结构不一致。我们提出了隐状态驱动的边际干预（Hidden-state Driven Margin Intervention, HDMI），这是一种无探测器的基于梯度的技术，直接利用模型的原生输出来引导隐藏状态。HDMI应用了一种边际目标，增加目标延续的概率，同时降低源延续的概率，而无需依赖探测分类器。我们进一步引入了一种前瞻变体（LA-HDMI）用于文本编辑，该变体通过softmax嵌入进行反向传播，修改当前的隐藏状态，以便在下一次生成中增加用户指定的标记的可能性，同时保持流畅性。为了评估干预效果，我们测量了完整性（目标属性是否按预期变化）和选择性（无关属性是否得以保留），并报告它们的调和平均值作为整体可靠性的衡量标准。HDMI在LGD一致性语料库和CausalGym基准测试中，在Meta-Llama-3-8B-Instruct和Pythia-70M上始终实现了比之前方法更高的可靠性。
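
The general shape of a probe-free margin intervention, together with the harmonic-mean reliability score, can be sketched as follows. The sketch makes strong simplifying assumptions: the model's output head is a single linear map `W`, the margin is taken on logits rather than probabilities, and the step size and margin are arbitrary. All names are hypothetical, and the exact objective in the paper may differ.

```python
def logits(W, h):
    """Linear readout: one logit per row of the (hypothetical) unembedding matrix W."""
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]

def margin_intervention(W, h, target, source, margin=1.0, lr=0.1, max_steps=100):
    """Gradient ascent on logit[target] - logit[source] with respect to the hidden state."""
    h = list(h)
    # For a linear readout the gradient of the margin is constant.
    grad = [t - s for t, s in zip(W[target], W[source])]
    for _ in range(max_steps):
        z = logits(W, h)
        if z[target] - z[source] >= margin:
            break
        h = [h_i + lr * g_i for h_i, g_i in zip(h, grad)]
    return h

def reliability(completeness, selectivity):
    """Harmonic mean of completeness and selectivity, as reported in the abstract."""
    if completeness + selectivity == 0:
        return 0.0
    return 2 * completeness * selectivity / (completeness + selectivity)

W = [[1.0, 0.0], [0.0, 1.0]]    # toy two-token vocabulary
h0 = [1.0, 0.0]                 # initially favors token 0 (the source)
h1 = margin_intervention(W, h0, target=1, source=0)
z1 = logits(W, h1)
```

The harmonic mean penalizes an intervention that scores well on one criterion only: with completeness 0.8 and selectivity 0.5 it gives about 0.62, below the arithmetic mean of 0.65.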

cs.AI / 62 / 2605.07637

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

为大规模多智能体路径规划学习局部通信

Vyaltsev, Valeriy, Sagirova, Alsu, Andreychuk, Anton, Kuratov, Yuri, Yakovlev, Konstantin, Panov, Aleksandr, Skrynnik, Alexey

Abstract

Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.

Chinese Translation

多智能体路径规划（MAPF）是多机器人轨迹规划问题的广泛应用抽象，其中多个同质智能体在共享环境中同时移动。尽管最优地解决MAPF是NP难题，但可扩展和高效的求解器对于物流和搜救等实际应用至关重要。为此，研究界提出了多种利用机器学习的去中心化次优MAPF求解器。这些方法将MAPF（从单个智能体的角度）框架化为Dec-POMDP，在每个时间步，智能体必须根据局部观察决定一个动作，并通常通过强化学习或模仿学习来解决问题。我们遵循相同的方法，但额外引入一个可学习的通信模块，旨在通过高效的特征共享增强智能体之间的合作。我们提出了多智能体路径规划的局部通信（LC-MAPF），这是一个可泛化的预训练模型，应用多轮通信在邻近智能体之间交换信息并改善协调。我们的实验表明，所提出的方法在各种（未见）测试场景中，在多种指标上优于现有的基于学习的MAPF求解器，包括基于模仿学习（IL）和强化学习（RL）的方法。值得注意的是，所引入的通信机制并未妨碍LC-MAPF的可扩展性，这是基于通信的MAPF求解器的一个常见瓶颈。

cs.AI / 63 / 2605.07639

Tacit Knowledge Extraction via Logic Augmented Generation and Active Inference

通过逻辑增强生成和主动推理提取隐性知识

Lamazzi, Lorenzo, Gangemi, Aldo, Giberti, Alessio, Nuzzolese, Andrea Giovanni, Rocca, Vittorio Andrea, Torta, Mattia, Poggi, Francesco

Abstract

Tacit knowledge plays a central role in human expertise, yet it remains difficult to capture, formalize, and reuse in machine-interpretable form. This challenge is especially relevant in procedural domains, where successful execution depends not only on explicit instructions, but also on implicit assumptions, contextual constraints, embodied skills, and experience-based judgments rarely documented. As a result, current knowledge engineering pipelines struggle to transform tacit and process-centric knowledge into formally specified, machine-interpretable representations that can be queried, validated, reasoned over, and reused. In this paper, we introduce a neuro-symbolic framework that combines Logic-Augmented Generation and an Active-Inference-inspired approach for ontology-grounded Knowledge Graph construction. We evaluate the approach in a knowledge transfer case study in manufacturing, using assembly-like repair procedures from instructional videos as a reproducible proxy domain. Results show that the proposed solution improves completeness and semantic quality, advancing neuro-symbolic knowledge engineering for industrial domains.

Chinese Translation

隐性知识在人的专业技能中扮演着核心角色，但在机器可解释的形式中捕捉、形式化和重用仍然困难。这一挑战在程序性领域尤为相关，在这些领域，成功的执行不仅依赖于明确的指令，还依赖于隐含的假设、上下文约束、具身技能以及鲜有文档记录的基于经验的判断。因此，当前的知识工程流程在将隐性和以过程为中心的知识转化为可以查询、验证、推理和重用的正式指定的机器可解释表示方面面临困难。在本文中，我们介绍了一种神经符号框架，该框架结合了逻辑增强生成和受主动推理启发的方法，用于本体基础的知识图谱构建。我们在制造业的知识转移案例研究中评估了该方法，使用来自教学视频的类似组装的修理程序作为可重复的代理领域。结果表明，所提出的解决方案提高了完整性和语义质量，推动了工业领域的神经符号知识工程的发展。

cs.AI / 64 / 2605.07675

FactoryBench: Evaluating Industrial Machine Understanding

FactoryBench：评估工业机器理解的基准

Merzouki, Yanis, Izquierdo, Coral, Ignuta-Ciuncanu, Matei, Gomez-Bracamonte, Marcos, Maggioni, Riccardo, Lombardi, Alessandro, Mazzoleni, Camilla, Martelli, Federico, Gunther, Balazs, Petersen, Jonas, Petersen, Philipp

Abstract

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.

Chinese Translation

我们介绍了FactoryBench，这是一个用于评估时间序列模型和大型语言模型（LLMs）在工业机器人遥测上机器理解能力的基准。问答对按照四个因果层次（状态、干预、反事实、决策）进行组织，体现了Pearl的因果阶梯，并涵盖五种回答格式：四种结构化格式采用确定性评分，而自由形式的回答则通过LLM作为评审的投票协议进行评分。我们提出了一个基于结构化问题模板的可扩展问答生成框架，展示了FactoryWave（一个从UR3协作机器人和KUKA KR10工业臂收集的密集多任务多变量传感器数据集），并构建了FactoryBench，作为一个基于大约15,000个标准化情节的FactoryWave、AURSAD和voraus-AD的70,000多个问答项的大规模基准。对六个前沿LLM的零样本评估显示，没有模型在结构化层次上超过50%或在决策制定上超过18%，揭示了当前模型与操作机器理解之间的巨大差距。

cs.AI / 65 / 2605.07692

GASim: A Graph-Accelerated Hybrid Framework for Social Simulation

GASim：一种图加速的混合框架用于社会模拟

Zhou, Xuan, Sun, Yanhui, Yao, Hantao, He, Allen, Zhang, Yongdong, Liu, Wu

Abstract

Large-scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language models (LLM)-based agents with numerical agent-based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph-accelerated hybrid multi-agent framework for large-scale social simulations. For core agents driven by LLM, GASim introduces Graph-Optimized Memory (GOM) to replace intensive LLM-based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates by fine-grained feature aggregation and Graph Attention Network. We further introduce Entropy-Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information-diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94-fold end-to-end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real-world public opinion trends. Our code is available at https://github.com/Jasmine0201/GASim.

Chinese Translation

大规模社会模拟器对于研究复杂的社会模式至关重要。先前的研究探索了混合方法以扩展模拟，将基于大型语言模型（LLM）的代理与数值代理模型（ABM）相结合。然而，这会因昂贵的内存检索和顺序ABM执行而导致高延迟。为了解决这一挑战，我们提出了GASim，一种图加速的混合多代理框架，用于大规模社会模拟。对于由LLM驱动的核心代理，GASim引入了图优化内存（GOM），用轻量级的稀疏内存图传播替代了密集的基于LLM的检索管道。对于大多数普通代理，GASim采用图消息传递（GMP），通过细粒度特征聚合和图注意力网络替代了顺序ABM执行，进行并行更新。我们进一步引入了熵驱动分组（EDG），协调这种混合分区，利用信息熵动态识别位于信息多样性邻域中的新兴核心代理。大量实验表明，GASim不仅在传统混合框架上实现了9.94倍的端到端加速，而且消耗的基线令牌少于20%，显著降低了成本，同时与现实世界公众舆论趋势保持强一致性。我们的代码可在 https://github.com/Jasmine0201/GASim 获取。
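
One way to read the entropy-driven grouping idea is sketched below. It assumes each agent carries a discrete opinion label and that core agents are the top-k agents by Shannon entropy of their neighbors' labels. The selection rule, the names, and the toy network are assumptions for illustration, not the paper's procedure.

```python
import math
from collections import Counter

def neighborhood_entropy(labels):
    """Shannon entropy in bits of the label distribution among an agent's neighbors."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_core_agents(adj, opinion, k):
    """Rank agents by neighborhood entropy and keep the top k (ties broken by name)."""
    scores = {a: neighborhood_entropy([opinion[n] for n in nbrs]) if nbrs else 0.0
              for a, nbrs in adj.items()}
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:k], scores

# Toy network: u1 sees three different opinions, the others see only u1.
adj = {"u1": ["u2", "u3", "u4"], "u2": ["u1"], "u3": ["u1"], "u4": ["u1"]}
opinion = {"u1": "pro", "u2": "pro", "u3": "con", "u4": "neutral"}
core, scores = select_core_agents(adj, opinion, 1)
```

Under this reading, an agent surrounded by mixed opinions would be routed to the LLM-driven group, while agents in uniform neighborhoods stay in the cheaper group.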

cs.AI / 66 / 2605.07703

Finite-Time Analysis of MCTS in Continuous POMDP Planning

连续POMDP规划中MCTS的有限时间分析

Kong, Da, Indelman, Vadim

Abstract

This paper presents a finite-time analysis for Monte Carlo Tree Search (MCTS) in Partially Observable Markov Decision Processes (POMDPs), with probabilistic concentration bounds in both discrete and continuous observation spaces. While MCTS-style solvers such as POMCP achieve empirical success in many applications, rigorous finite-time guarantees remain an open problem due to the nonstationarity and the interdependencies induced by heuristic action selection (e.g., UCB). In the discrete setting, we address these challenges by extending the polynomial exploration bonus to UCB in the POMDP setting, yielding polynomial concentration bounds for the empirical value estimation at the root node. For continuous observation spaces, we introduce an abstract partitioning framework and propose a finite-time bound on partitioning loss. Under mild conditions, we prove a high-probability bound on value estimates in POMDPs with continuous observation spaces. Specifically, we propose Voro-POMCPOW, a variant of POMCPOW with finite-time guarantees that adaptively partitions the continuous observation space using Voronoi cells. This approach maintains a finite branching factor while preserving the original observation generator. Empirical validation demonstrates that the proposed Voro-POMCPOW shows competitive performance while providing theoretical guarantees. Although our analysis focuses on continuous POMDPs, the techniques developed herein are also applicable to continuous MDPs, closing another gap on the MDP side.

Chinese Translation

本文针对部分可观测马尔可夫决策过程（POMDPs）中的蒙特卡洛树搜索（MCTS）进行了有限时间分析，提供了在离散和连续观测空间中的概率集中界限。尽管MCTS风格的求解器如POMCP在许多应用中取得了实证成功，但由于启发式动作选择（例如UCB）引起的非平稳性和相互依赖性，严格的有限时间保证仍然是一个未解决的问题。在离散设置中，我们通过将多项式探索奖励扩展到POMDP设置中的UCB，来应对这些挑战，从而为根节点的经验价值估计提供了多项式集中界限。对于连续观测空间，我们引入了一种抽象分区框架，并提出了分区损失的有限时间界限。在温和条件下，我们证明了在具有连续观测空间的POMDP中，价值估计的高概率界限。具体而言，我们提出了Voro-POMCPOW，这是一种具有有限时间保证的POMCPOW变体，能够使用Voronoi单元自适应地划分连续观测空间。该方法在保持原始观测生成器的同时，维持了有限的分支因子。实证验证表明，所提出的Voro-POMCPOW在提供理论保证的同时表现出竞争力。尽管我们的分析集中于连续POMDP，但此处开发的技术同样适用于连续MDP，从而填补了MDP方面的另一个空白。

cs.AI / 67 / 2605.07707

Hierarchical Task Network Planning with LLM-Generated Heuristics

基于大语言模型生成启发式的层次任务网络规划

Meneguzzi, Felipe, Buchweitz, Alexandre, Corrêa, Augusto B., Putrich, Victor Scherer, Pereira, André Grahl

Abstract

HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.

Chinese Translation

HTN（层次任务网络）规划是一种经典规划的变体，在这种方法中，算法不是搜索线性动作序列，而是使用方法库对高层任务进行分解，直到只剩下可执行的动作。一方面，这允许引入领域知识，通过方法库加速解决方案的搜索；另一方面，这也带来了超出经典状态空间搜索的挑战。尽管最近的研究产生了一些启发式和新算法来加速HTN规划，但这些启发式尚未像经典规划算法中的启发式那样信息丰富。我们研究了大型语言模型（LLMs）是否能够为HTN规划生成有效的搜索启发式，扩展Corrêa、Pereira和Seipp（2025）的方法论，从经典规划到层次规划。我们使用Pytrich规划器在六个标准的总序HTN基准领域上评估了在特定领域提示下由九个LLM生成的启发式，并将其与TDG和LMCount这两个领域无关的基线以及PANDA规划器进行比较。我们的结果表明，LLM生成的启发式几乎与最佳可用HTN规划器的覆盖率相匹配，同时在83%的共享问题上显著减少了搜索工作量。

cs.AI / 68 / 2605.07736

Online Goal Recognition using Path Signature and Dynamic Time Warping

基于路径签名和动态时间规整的在线目标识别

Tesch, Douglas, Gavenski, Nathan, Amado, Leonardo, Rodrigues, Odinaldo, Meneguzzi, Felipe

Abstract

Online goal recognition in continuous domains poses two central challenges: efficiently encoding large trajectories and effectively comparing them. Recent work addresses these challenges by using custom state-space representations and metrics to compare observations against hypotheses. However, these approaches often overlook well-established encoding techniques used in other domains that offer substantial advantages. This paper introduces a novel method for online goal recognition that leverages path signatures, a compact, expressive representation from rough path theory that efficiently captures key semantic features of trajectories, enabling more meaningful comparisons between them. Experiments show that our method consistently outperforms the state of the art in predictive accuracy and online planning efficiency, while remaining competitive offline.
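The two ingredients named in the title can each be sketched in a few lines. The code below computes the depth-2 signature of a piecewise-linear trajectory and a textbook dynamic time warping cost. How the paper combines them is not stated in the abstract, so the sketch shows them side by side only; the function names and the truncation at depth 2 are assumptions made for illustration.

```python
def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a list of points."""
    dim = len(path[0])
    level1 = [0.0] * dim                        # running increment X_t - X_0
    level2 = [[0.0] * dim for _ in range(dim)]  # iterated integrals S^{ij}
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        for i in range(dim):
            for j in range(dim):
                # Exact contribution of one straight segment.
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        level1 = [s + d for s, d in zip(level1, delta)]
    return level1, level2


def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping cost between two sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


# A trajectory that moves right, then up.
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
level1, level2 = signature_depth2(corner)
levy_area = 0.5 * (level2[0][1] - level2[1][0])  # signed area enclosed with the chord
```

The first level records only the net displacement, while the antisymmetric part of the second level distinguishes "right then up" from "up then right", which is the kind of order information a plain endpoint comparison misses.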

Chinese Translation

在连续领域中，在线目标识别面临两个主要挑战：高效编码大规模轨迹和有效比较它们。近期的研究通过使用自定义状态空间表示和度量来比较观察结果与假设，从而解决了这些挑战。然而，这些方法往往忽视了其他领域中已建立的编码技术，这些技术提供了显著的优势。本文提出了一种新颖的在线目标识别方法，该方法利用路径签名（path signatures），这是一种紧凑且富有表现力的粗路径理论表示，能够高效捕捉轨迹的关键语义特征，从而实现更有意义的比较。实验表明，我们的方法在预测准确性和在线规划效率方面始终优于当前最先进的技术，同时在离线表现上也保持竞争力。

cs.AI / 69 / 2605.07744

Alternating Target-Path Planning for Scalable Multi-Agent Coordination

可扩展多智能体协调的交替目标路径规划

Kumagai, Yu, Okumura, Keisuke

Abstract

The concurrent target assignment and pathfinding (TAPF) problem extends multi-agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision-free paths to agents. Prior work on TAPF has relied exclusively on Conflict-Based Search (CBS), which tightly couples target assignment and pathfinding, resulting in compute-intensive, non-scalable solutions. In contrast, we propose an iterative refinement framework that decouples target assignment from pathfinding. Our framework builds on modern, fast, suboptimal MAPF solvers, such as LaCAM. Specifically, within a given time budget, it repeatedly solves MAPF for the current target assignment, identifies bottleneck agents via MAPF feedback, and refines the assignment. Empirical results show that the feedback-driven reassignment loop is effective, enabling our framework to scale well beyond the reach of the state-of-the-art CBS-based solver while maintaining decent solution quality. This represents a solid step toward practical, large-scale TAPF suitable for real-world setups.
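The solve, inspect, refine loop described above can be illustrated on a toy problem. In this sketch the MAPF solver is replaced by a Manhattan-distance stand-in that ignores collisions, the objective is assumed to be the makespan, the time budget is approximated by an iteration cap, and refinement is a simple target swap. None of these choices is taken from the paper.

```python
def path_costs(agents, targets, assignment):
    """Stand-in for a MAPF solver: per-agent Manhattan distance, no collisions."""
    return [
        abs(agents[i][0] - targets[t][0]) + abs(agents[i][1] - targets[t][1])
        for i, t in enumerate(assignment)
    ]


def refine_assignment(agents, targets, assignment, max_iters=100):
    """Repeatedly swap the bottleneck agent's target while the makespan improves."""
    assignment = list(assignment)
    for _ in range(max_iters):
        costs = path_costs(agents, targets, assignment)
        # Feedback step: the agent with the longest path is the bottleneck.
        bottleneck = max(range(len(costs)), key=costs.__getitem__)
        best_makespan, best_trial = max(costs), None
        for other in range(len(assignment)):
            if other == bottleneck:
                continue
            trial = list(assignment)
            trial[bottleneck], trial[other] = trial[other], trial[bottleneck]
            makespan = max(path_costs(agents, targets, trial))
            if makespan < best_makespan:
                best_makespan, best_trial = makespan, trial
        if best_trial is None:
            break  # no swap helps; keep the current assignment
        assignment = best_trial
    return assignment


agents = [(0, 0), (5, 0)]
targets = [(5, 1), (0, 1)]
# Start from a poor assignment in which each agent crosses the whole map.
refined = refine_assignment(agents, targets, [0, 1])
```

Keeping the solver behind a single function such as `path_costs` mirrors the decoupling the abstract describes: the refinement loop only needs per-agent feedback, not the solver's internals.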

Chinese Translation

并发目标分配与路径规划（TAPF）问题扩展了多智能体路径规划（MAPF），要求规划者为智能体分配不同的目标和无碰撞路径。之前的TAPF研究完全依赖于基于冲突的搜索（CBS），这使得目标分配与路径规划紧密耦合，导致计算密集且不具可扩展性的解决方案。相比之下，我们提出了一种迭代优化框架，将目标分配与路径规划解耦。我们的框架基于现代快速的次优MAPF求解器，如LaCAM。在给定的时间预算内，它反复为当前目标分配解决MAPF，通过MAPF反馈识别瓶颈智能体，并优化分配。实证结果表明，基于反馈的重新分配循环是有效的，使我们的框架能够在保持良好解决质量的同时，超越最先进的基于CBS的求解器的可扩展性。这代表了朝着适用于现实世界设置的实用大规模TAPF迈出的坚实一步。

cs.AI / 70 / 2605.07760

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

RuleSafe-VL：评估视觉语言内容审核中的规则条件决策推理

Lu, Zhifeng, Wang, Dianyuan, Shang, Yuhu, Xu, Zhenbo

Abstract

Platform content moderation applies explicit policy rules and context-dependent conditions to decide whether user content is allowed, restricted, or removed. A correct moderation outcome must therefore depend on which rules a case activates, how those rules interact, and whether the available evidence is sufficient. Current multimodal safety benchmarks largely reduce moderation to matching predefined final labels, leaving this underlying rule structure untested. As a result, a high benchmark score reveals little about whether a model applies the policy correctly or arrives at the correct label through superficial cues. To evaluate this rule-governed process, we introduce RuleSafe-VL, a benchmark for rule-conditioned decision reasoning in vision-language content moderation. Derived from publicly available platform moderation policies, RuleSafe-VL formalizes 93 atomic rules and 92 typed rule relations, yielding 2,166 context-sensitive image-text cases across three high-risk policy families. Its four diagnostic tasks decompose moderation into a rule-conditioned decision chain. They identify activated rules, recover rule interactions, judge decision sufficiency, and resolve outcomes once missing context is supplied. Experiments on 10 frontier, open-source, and safety-oriented VLMs reveal rule-relation recovery as the dominant bottleneck, where the best model reaches only 64.8 Macro-F1 and some safety-oriented models fall below 7 Macro-F1. Decision-state prediction also remains unreliable, peaking at 64.5 Macro-F1. RuleSafe-VL shifts moderation evaluation from final-label scoring toward diagnostic assessment of rule-conditioned decision reasoning.

Chinese Translation

平台内容审核应用明确的政策规则和依赖上下文的条件来决定用户内容是否被允许、限制或删除。因此，正确的审核结果必须依赖于一个案例激活了哪些规则，这些规则如何相互作用，以及可用证据是否充足。目前的多模态安全基准在很大程度上将审核简化为匹配预定义的最终标签，未对这一潜在的规则结构进行测试。因此，高基准分数对于模型是否正确应用政策或通过表面线索得出正确标签几乎没有揭示任何信息。为了评估这一规则驱动的过程，我们引入了RuleSafe-VL，这是一个用于视觉语言内容审核中规则条件决策推理的基准。RuleSafe-VL源自公开可用的平台审核政策，形式化了93条原子规则和92种类型的规则关系，产生了跨三个高风险政策类别的2,166个上下文敏感的图像-文本案例。其四个诊断任务将审核分解为规则条件决策链。它们识别激活的规则，恢复规则交互，判断决策的充分性，并在提供缺失上下文后解决结果。在对10个前沿、开源和安全导向的视觉语言模型（VLMs）进行的实验中，规则关系恢复被发现是主要瓶颈，最佳模型的Macro-F1仅达到64.8，而一些安全导向的模型则低于7 Macro-F1。决策状态预测也仍然不可靠，最高仅为64.5 Macro-F1。RuleSafe-VL将审核评估从最终标签评分转向对规则条件决策推理的诊断评估。

cs.AI / 71 / 2605.07839

Exact Regular-Constrained Variable-Order Markov Generation via Sparse Context-State Belief Propagation

通过稀疏上下文状态信念传播实现精确的正则约束可变阶马尔可夫生成

Pachet, François

Abstract

Variable-order Markov models generate sequences over a finite alphabet by conditioning each symbol on the longest available suffix of the generated history. Regular constraints, by contrast, describe finite-horizon control requirements by an automaton: fixed positions, forced endings, metrical patterns, and forbidden copied fragments are all special cases. Existing exact methods already handle regular constraints with belief propagation for first-order Markov chains. The contribution here is the variable-order extension: identifying the state space on which the existing BP-regular machinery must be run when the generator is a variable-order/backoff model. A first-order constraint layer can enforce useful support conditions, but it computes future mass after merging histories that a variable-order generator deliberately keeps distinct. We formalize this mismatch and give the sparse construction obtained by replacing the first-order Markov state with the observed context state, then taking the standard product with the regular constraint automaton. For a fixed trained context graph and automaton, inference is linear in the sequence horizon; in general it is polynomial in the number of reachable product edges. This gives the correct variable-order distribution conditioned on regular constraints without expanding to all K-tuples. The same finite-source interface supports reversible data augmentation by inverse count lookup, matching materialized transposition augmentation without storing transformed corpora. We also separate exact BP inference from generation-time backoff policies, such as singleton avoidance, whose stochastic semantics must be made explicit if exactness is claimed.
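A toy version of the construction may help: run a backward "future mass" computation on pairs (context state, automaton state), then sample forward with the reweighted probabilities. The model table, the two-state automaton ("the sequence must end with a"), and all names below are hypothetical. The sketch also assumes a prefix-closed context set, so that the next context depends only on the current context and the emitted symbol.

```python
import itertools
import random
from functools import lru_cache

# Hypothetical variable-order model: context -> next-symbol distribution.
MODEL = {
    (): {"a": 0.5, "b": 0.5},
    ("a",): {"a": 0.1, "b": 0.9},
    ("a", "b"): {"a": 0.8, "b": 0.2},
}
ACCEPTING = {1}


def dfa_step(state, symbol):
    """Toy regular constraint: state 1 means the last symbol was 'a'."""
    return 1 if symbol == "a" else 0


def next_context(ctx, symbol):
    """Longest suffix of ctx + (symbol,) that the model has statistics for."""
    extended = ctx + (symbol,)
    for start in range(len(extended) + 1):
        if extended[start:] in MODEL:
            return extended[start:]


@lru_cache(maxsize=None)
def future_mass(ctx, state, steps_left):
    """Probability that the remaining steps end in an accepting automaton state."""
    if steps_left == 0:
        return 1.0 if state in ACCEPTING else 0.0
    return sum(
        prob * future_mass(next_context(ctx, sym), dfa_step(state, sym), steps_left - 1)
        for sym, prob in MODEL[ctx].items()
    )


def sample(horizon, rng):
    """Draw one sequence from the model conditioned on the constraint."""
    ctx, state, out = (), 0, []
    for t in range(horizon):
        symbols = list(MODEL[ctx])
        weights = [
            MODEL[ctx][sym]
            * future_mass(next_context(ctx, sym), dfa_step(state, sym), horizon - t - 1)
            for sym in symbols
        ]
        sym = rng.choices(symbols, weights=weights)[0]
        out.append(sym)
        ctx, state = next_context(ctx, sym), dfa_step(state, sym)
    return out


def brute_force_mass(horizon):
    """Cross-check: enumerate every sequence and add up the accepted probability."""
    total = 0.0
    for seq in itertools.product("ab", repeat=horizon):
        prob, ctx = 1.0, ()
        for sym in seq:
            prob *= MODEL[ctx][sym]
            ctx = next_context(ctx, sym)
        if seq[-1] == "a":
            total += prob
    return total
```

Because `future_mass` is memoized on (context, automaton state, steps left), the work for a fixed model and automaton grows linearly with the horizon, whereas `brute_force_mass` grows exponentially and is only usable as a check on short sequences.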

Chinese Translation

可变阶马尔可夫模型通过将每个符号的生成条件设定为生成历史中最长的可用后缀，从而生成有限字母表上的序列。相比之下，正则约束通过自动机描述有限视野的控制要求：固定位置、强制结束、度量模式和禁止复制片段都是特殊情况。现有的精确方法已经通过信念传播处理了第一阶马尔可夫链的正则约束。本文的贡献在于可变阶扩展：识别现有的BP-正则机制在生成器为可变阶/回退模型时必须运行的状态空间。第一阶约束层可以强制实施有用的支持条件，但它在合并可变阶生成器故意保持独立的历史后计算未来质量。我们形式化了这种不匹配，并通过用观察到的上下文状态替换第一阶马尔可夫状态来得到稀疏构造，然后与正则约束自动机进行标准乘积。对于固定训练的上下文图和自动机，推理在序列视野中是线性的；一般情况下，它在可达的乘积边数上是多项式的。这在不扩展到所有K元组的情况下，给出了在正则约束下的正确可变阶分布。相同的有限源接口通过逆计数查找支持可逆数据增强，匹配物化转置增强而无需存储转换后的语料库。我们还将精确的BP推理与生成时的回退策略（如单例避免）分开，后者的随机语义必须在声称精确性时明确。

cs.AI / 72 / 2605.07926

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

AgentEscapeBench：评估 LLM 代理中的域外工具基础推理能力

Guo, Zhengkang, Li, Yiyang, Qiu, Lin, Wang, Xiaohua, Xv, Jingwen, Ru, Dongyu, Li, Xiaoyu, Zheng, Xiaoqing, Cao, Xuezhi, Cai, Xunliang

Abstract

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.

Chinese Translation

随着基于 LLM 的代理越来越依赖外部工具，评估它们在熟悉工作流程和短期交互之外维持工具基础推理的能力变得至关重要。我们提出了 AgentEscapeBench，这是一个逃生室风格的基准，测试代理在明确的长程依赖约束下是否能够推断、执行和修订新颖的工具使用程序。每个任务定义了一个关于工具和物品的有向无环依赖图，要求代理调用真实的外部函数，跟踪逐步揭示的隐藏状态，传播中间结果，并提交一个确定性可验证的最终答案。AgentEscapeBench 包含 270 个实例，分为五个难度等级，并支持完全自动化评估。与十六个 LLM 代理和人类参与者的实验表明，随着依赖深度的增加，性能急剧下降：人类在难度-5 时的成功率从 98.3% 降至难度-25 时的 80.0%，而最佳模型则从 90.0% 降至 60.0%。轨迹分析将模型失败主要归因于长程状态跟踪、线索遵循和中间结果传播的崩溃。这些发现表明，当前的代理通常能够处理局部工具使用，但在深层上下文依赖方面仍然存在困难。我们希望 AgentEscapeBench 能够作为一个诊断测试平台，用于衡量当前代理的能力，并为未来的训练工作提供信息，以实现更强大的通用推理、行动和适应能力。

cs.AI / 73 / 2605.07935

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix：使用 TLA+ 反例修复智能体协调协议

Xia, Shuren, Li, Qiwei, Ehsan, Taqiya, Ortiz, Jorge

Abstract

We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.

Chinese Translation

我们提出了 TraceFix，这是一个以验证为先的管道，用于大型语言模型（LLM）多智能体协调。智能体从任务描述中合成协议拓扑作为结构化中间表示（IR），生成 PlusCal 协调逻辑，并使用来自 TLA+ 模型检查器（TLC）的反例迭代修复协议，直到验证成功。经过验证的过程体被编译成每个智能体的系统提示，并在运行时监视器下执行，该监视器拒绝超出拓扑的协调操作。在涵盖 16 个场景系列的 48 个任务中，所有任务均达到完全的 TLC 验证；62.5% 的任务在第一次尝试中通过，且没有任务需要超过四次修复迭代。状态空间跨越六个数量级，但每个任务的验证都在 60 秒内完成。3,456 次运行的运行时比较表明，拓扑监控执行实现了最高的任务完成率（平均 89.4%，完全 81.5%），并且在模型能力降低时，使用经过验证的协议的运行时间下降速度约为仅使用提示和仅使用聊天基线的一半。在固定运行时间下的配对消融实验显示，TLC 验证的协议将死锁/活锁（DL/LL）从 31.1% 降低到 14.1%，在故障注入下分离效果最大。

cs.AI / 74 / 2605.07979

The Limits of AI-Driven Allocation: Optimal Screening under Aleatoric Uncertainty

人工智能驱动分配的局限性：在随机不确定性下的最优筛选

Cortes-Gomez, Santiago, Rubio, Mateo Dulce, Patino, Carlos, Wilder, Bryan

Abstract

The rise of machine learning has shifted targeted resource allocation in policy and humanitarian settings toward algorithmic targeting based on predicted risk scores. This approach is typically cheaper and faster than traditional screening procedures that directly observe the latent vulnerability status through physical verification. Yet, even access to the true conditional vulnerability probability cannot eliminate misallocation: aleatoric uncertainty over individual vulnerability status is irreducible, and probabilistic targeting inevitably misallocates some resources. In this work we study how screening and algorithmic targeting should be optimally combined in a two-stage allocation framework where a screening stage observes true outcomes for a subset of units before a final allocation stage assigns the resource under a fixed coverage budget. We show that the optimal strategy screens units at the margin of algorithmic allocation, while directly targeting the highest-risk units. Furthermore, we empirically characterize when screening and algorithmic targeting act as complements or substitutes: efficiency gains from screening grow as the aleatoric uncertainty in the population increases. We illustrate our framework with applications in income-based social protection programs and humanitarian demining in Colombia, where the tension between screening costs and allocation efficiency is operationally consequential.

Chinese Translation

机器学习的兴起使得政策和人道主义环境中的目标资源分配转向基于预测风险评分的算法定向。这种方法通常比通过物理验证直接观察潜在脆弱性状态的传统筛选程序更便宜、更快速。然而，即使能够获取真实的条件脆弱性概率，也无法消除错误分配：个体脆弱性状态的随机不确定性是不可减少的，概率性定向不可避免地会导致一些资源的错误分配。在本研究中，我们探讨了如何在一个两阶段分配框架中最优地结合筛选和算法定向，其中筛选阶段在最终分配阶段之前观察一部分单元的真实结果，最终分配阶段在固定的覆盖预算下分配资源。我们表明，最优策略是在算法分配的边际上筛选单元，同时直接针对最高风险的单元。此外，我们实证描述了筛选和算法定向何时作为互补或替代：随着人群中随机不确定性的增加，筛选带来的效率提升也随之增长。我们通过在哥伦比亚的基于收入的社会保护项目和人道主义排雷应用来说明我们的框架，在这些应用中，筛选成本与分配效率之间的紧张关系在操作上具有重要意义。

cs.AI / 75 / 2605.08011

Abductive Reasoning with Probabilistic Commonsense

基于概率的推理常识的溯因推理

Cotnareanu, Joseph, Roverato, Chiara, Zhou, Han, Chetelat, Didier, Zhang, Yingxue, Coates, Mark

Abstract

Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals' distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.

Chinese Translation

近年来，提高大型语言模型（LLMs）推理能力的努力集中在将形式逻辑求解器整合到神经符号框架中。一个关键挑战是，形式求解器缺乏常识世界知识，无法进行人类认为显而易见的推理步骤。之前的方法通过使用LLMs提供缺失的常识假设来解决这一问题，但这些方法隐含地假设对这些常识事实的普遍一致性。实际上，常识信念因个体而异。我们提出了一种概率框架用于溯因常识推理，明确建模这种变异，旨在确定大多数人是否会判断某一陈述为真或假。我们引入了概率溯因常识（Probabilistic Abductive CommonSense，PACS），这是一种新颖的算法，利用LLM和形式求解器对个体不同的常识信念进行采样证明，并在这些样本中聚合结论。实证结果表明，PACS在多个基准测试中优于链式推理、之前的神经符号方法和基于搜索的方法。

cs.AI / 76 / 2605.08013

Learning CLI Agents with Structured Action Credit under Selective Observation

在选择性观察下利用结构化动作信用学习命令行界面代理

Su, Haoyang, Wen, Ying

Abstract

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $\sigma$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

Chinese Translation

命令行界面（CLI）代理作为一种实用范式，正在兴起，以实现与不断发展的文件系统、可执行命令行程序和在线执行反馈的代理-计算机交互。近期的研究利用强化学习（RL）从可验证的任务反馈中学习这些交互能力，但很少有方法利用CLI动作的固有结构属性作为学习信号。除了这一未充分利用的动作结构外，CLI学习还结合了编码代理的两个瓶颈。首先，代理必须从部分观察中识别大型代码库中的任务相关证据。其次，必须将稀疏的终端奖励分配给塑造长多轮轨迹的动作。我们通过基于Shell的信息提取和文件编辑任务研究这些瓶颈。对于选择性观察，我们引入了$\sigma$-Reveal，这是一种推理时机制，选择预算有限的上下文以适应相同的CLI。对于信用分配，我们提出了动作优势分配（$\mathrm{A}^3$），这是一种保留标准代理强化学习算法复杂性的原生代理强化学习方法。$\mathrm{A}^3$从回合级相对反馈、基于抽象语法树（AST）的动作子链残差和树级轨迹边际构建回合级优势。为了进一步评估这一问题设置，我们构建了ShellOps，这是一个涵盖代码库环境中CLI任务的可验证数据集套件。

cs.AI / 77 / 2605.08019

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

游戏的理由：前沿大型推理模型与人类游戏学习者之间的行为与大脑对齐

Csaba, Botos, Kumar, Sreejan, Andrews, Austin Tudor David, Hunt, Laurence, Summerfield, Chris, Tenenbaum, Joshua B., Costa, Rui Ponte, Mattar, Marcelo G., Tomov, Momchil

Abstract

Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: https://botcs.github.io/reason-to-play/

Chinese Translation

人类在遇到新环境时能够迅速学习抽象知识，并灵活运用这些知识来指导高效且智能的行动。现代人工智能系统能否以类似的方式进行学习和规划？我们通过一个包含复杂人类游戏玩法和同步功能性磁共振成像（fMRI）记录的数据集来研究这个问题，参与者在其中学习需要规则发现、假设修正和多步骤规划的新视频游戏。我们通过评估模型在游戏中的表现、与人类学习行为的匹配程度以及在同一任务中预测大脑活动的能力，来对比一系列前沿大型推理模型（LRMs）与无模型和基于模型的深度强化学习代理以及基于贝叶斯理论的代理。我们发现，前沿LRMs在游戏发现过程中最接近人类的行为模式，并且在皮层和皮层下区域预测大脑活动的效果比两种强化学习替代方案好一个数量级，且在置换控制下效果稳健。通过有针对性的操控，我们进一步表明大脑对齐反映了模型对游戏状态的上下文表示，而非其下游规划或推理。我们的结果确立了LRMs作为人类在复杂自然环境中学习和决策的引人注目的计算模型。项目页面及互动重播链接： https://botcs.github.io/reason-to-play/

cs.AI / 78 / 2605.08024

MPD$^2$-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis

MPD$^2$-Router：面向掩膜的多专家优先正则化双头延迟路由器在青光眼筛查与诊断中的应用

Zhan, Wenxin

Abstract

Learning-to-defer (L2D) can make glaucoma screening safer by routing difficult or uncertain cases to humans, yet standard formulations overlook expert availability, heterogeneous reader behavior, workload imbalance, asymmetric diagnostic harm, case difficulty from morphology, and deployment shift. We introduce MPD$^2$-Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as constrained human-AI routing: whether to defer and to which available expert. It couples a dual-head deferral/allocation policy with mask-aware Gumbel-sigmoid gating that strictly enforces per-sample availability, and fuses uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, MPD$^2$-Router substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate. It is Pareto-optimal in F1-MCC-cost, robust under cross-domain shift, and yields balanced expert utilization.

Chinese Translation

学习延迟（L2D）可以通过将困难/不确定的病例转交给人类来提高青光眼筛查的安全性，但标准的公式忽视了专家的可用性、异质读者行为、工作负载不平衡、非对称诊断危害、形态学带来的病例难度以及部署转变。我们提出了MPD$^2$-Router，这是一种面向掩膜的多专家延迟框架，将眼科分诊重新表述为受限的人类-人工智能路由：是否延迟以及转交给哪个可用专家。它将双头延迟/分配策略与面向掩膜的Gumbel-sigmoid门控相结合，严格执行每个样本的可用性，并融合不确定性、形态学、图像质量和OOD信号。训练使用不对称的成本敏感目标，结合增强拉格朗日延迟预算、特定组的分布先验和秩主导的JS正则化器，共同防止专家崩溃而不强制均匀分配。在三个跨国青光眼队列（REFUGE、CHAKSU、ORIGA）中，使用冻结的REFUGE训练骨干，MPD$^2$-Router显著降低了临床成本，并在适度的延迟率下提高了MCC，相较于仅使用人工智能的方案。它在F1-MCC-成本上是帕累托最优的，且在跨领域转变下表现稳健，实现了专家的平衡利用。

cs.AI / 79 / 2605.08061

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

基于评分标准的强化学习：用于可推广推理的结构化评判奖励

Bhattarai, Manish, Boureima, Ismael, Ranasinghe, Nishath Rajiv, Pakin, Scott, O'Malley, Dan

Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding that the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus: GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
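As a rough illustration of a partial-credit reward, the sketch below combines per-criterion judge scores into one weighted value and then converts a group of such rewards into group-relative advantages, in the spirit of GRPO. The criterion names, weights, score range, and the use of the population standard deviation are assumptions for this example, not details from the paper.

```python
import statistics


def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    return [(r - mean) / (spread + eps) for r in rewards]


# Hypothetical rubric and judge scores for one response.
weights = {"correct_result": 2.0, "cites_evidence": 1.0, "states_assumptions": 1.0}
scores = {"correct_result": 1.0, "cites_evidence": 0.0, "states_assumptions": 0.5}
reward = rubric_reward(scores, weights)  # (2*1.0 + 1*0.0 + 1*0.5) / 4

# Rewards of four sampled responses to the same prompt (the last three are made up).
advantages = group_relative_advantages([reward, 0.25, 1.0, 0.5])
```

A response that satisfies only some criteria still receives a nonzero reward here, which is the partial-credit property the abstract contrasts with a binary outcome.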

Chinese Translation

我们认为，将奖励分解为加权的、可验证的标准，并使用大型语言模型（LLM）评判其得分，提供了一种部分信用优化信号：每个响应不是仅仅得到一个二元结果或单一的整体分数，而是沿着多个特定任务的标准进行评分。我们形式化了基于评分标准的强化学习（RL）：这是一个框架，其中策略是针对由一个冻结的LLM评判生成的结构化、多标准奖励进行优化，该评判依赖于策略从未见过的辅助基础。我们通过从科学与技术信息办公室（OSTI）衍生的约100,000份科学和技术文档中推导评分标准，并使用群体相对策略优化（GRPO）训练Llama-3.1-8B-Instruct来实例化该框架。通过基于GRPO的训练，模型在保留的评分标准评估中达到了71.7%的标准化奖励。经过GRPO调优的策略在四个未从训练语料库中衍生的推理基准上也优于基础模型：GSM8K、MATH、GPQA Main和GPQA Diamond。这些结果提供了证据，表明结构化的、基于文档的奖励可以改善保留的评分标准表现，并引发超出构建训练环境所用语料库的可转移推理行为。

cs.AI / 80 / 2605.08070

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

VecCISC：通过推理轨迹聚类和候选答案选择提升信心驱动的自一致性

Petullo, James, George, Sonny, Cashman, Dylan, Xue, Nianwen

Abstract

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g., Confidence-Informed Self-Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.
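The difference between plain Self-Consistency and confidence-weighted voting is easy to see on a small example. The answers and confidence values below are invented, and the sketch covers only the two voting rules, not the trace-filtering step that VecCISC adds.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plain Self-Consistency: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, confidences):
    """Weighted majority voting: add up confidence per answer, keep the largest total."""
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Five sampled answers with hypothetical critic confidences.
answers = ["12", "15", "12", "15", "15"]
confidences = [0.9, 0.2, 0.8, 0.3, 0.1]

plain = majority_vote(answers)                  # "15" wins on count (3 of 5)
weighted = weighted_vote(answers, confidences)  # "12" wins on accumulated confidence
```

Each confidence value stands for one critic call, so any filter that discards redundant traces before scoring reduces cost in direct proportion to the number of candidates removed.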

Chinese Translation

自一致性是一种标准技术，用于扩展推理时间推理，其中从大型语言模型（LLM）中采样多个候选答案，并选择最常见的答案。最近的研究表明，加权多数投票（例如，信心驱动的自一致性（CISC））通过为每个候选答案分配一个信心值，并选择累积分数最大的答案，通常在广泛的流行基准上表现得更为准确。在实践中，加权多数投票需要对每个候选答案的推理轨迹调用一个批评性LLM，以生成答案的信心分数。这一系列额外的LLM调用大大增加了加权多数投票的开销和成本，尽管它可能带来性能上的好处。为了降低这一费用，我们提出了VecCISC，这是一种轻量级的自适应框架，利用语义相似性度量来过滤与其他推理轨迹语义等价、退化或虚构的推理轨迹，从而减少需要由批评者评估的候选答案数量。为了确保实验的充分性，我们在五个具有挑战性的、广泛采用的数据集上评估了VecCISC，这些数据集涵盖了数学、化学、生物学、常识推理和人文学科等领域。我们的结果表明，VecCISC将总令牌使用量减少了47%，同时保持或超过了CISC的准确性。

计算语言学 (Computation and Language)

cs.CL / 1 / 2605.06673

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

前沿大语言模型中的领域级元认知监控：一个33模型图谱

Cacioli, Jon-Paul

Abstract

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
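Type-2 AUROC can be read as the probability that a correctly answered item receives higher confidence than an incorrectly answered one, with ties counted as one half. The following sketch computes it by direct pairwise comparison; the six confidence values and correctness labels are invented for illustration.

```python
def type2_auroc(confidences, correct):
    """Probability that a correct item outranks an incorrect one in confidence."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for r in right:
        for w in wrong:
            if r > w:
                wins += 1.0
            elif r == w:
                wins += 0.5  # ties count as half a win
    return wins / (len(right) * len(wrong))


# Hypothetical verbalized confidences (0-100) and whether each answer was correct.
confidences = [90, 80, 70, 60, 50, 40]
correct = [True, True, False, True, False, False]
auroc = type2_auroc(confidences, correct)  # 8 of the 9 correct/incorrect pairs are ordered well
```

A value of 0.5 corresponds to confidence that carries no information about correctness, which is the chance level the abstract refers to.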

Chinese Translation

聚合的元认知质量评分掩盖了MMLU基准领域内模型的变异性。我们对来自八个模型家族的33个前沿大语言模型进行了1,500个MMLU项目（每个领域250个，基于先验的六领域分组）的测试，并使用口头表达的置信度（0-100）计算了每个模型-领域单元的Type-2 AUROC。总观察次数为47,151。每个具有高于随机水平的聚合监控的模型都表现出显著的领域级变异性。应用/专业知识领域在监控上可靠地是最容易的基准领域（平均AUROC = .742，在33个模型中排名前2的有21个）；而形式推理和自然科学领域则可靠地是最难的（在33个模型中，有27个模型中这两个领域排名后2）。三个中间领域在统计上无显著差异（Kendall's W = .164）。主题级一致性分析（领域内相似性比率 = 0.95）确认六领域分组是一个务实的基准分类，而非经过验证的潜在构念。对于Anthropic、Google-Gemini和Qwen，家族内轮廓形状聚类显著（置换p < .0001），但对DeepSeek、Google-Gemma或OpenAI则不显著。Gemma 4 31B相较于Gemma 3 27B显示出+0.202的AUROC改善。三个在二元KEEP/WITHDRAW探针上被分类为无效的模型在口头表达的置信度下产生了正常的轮廓，确认了探针格式的特异性。198个单元的自助法95%置信区间的中位宽度为0.199。分半聚合稳定性r = .893；轮廓级别的分半稳定性较弱（总体中位数r = .184）。这些结果表明，聚合指标掩盖了稳定的基准领域变异性，并支持在特定应用领域部署之前进行基准阶段领域筛选。


cs.CL / 2 / 2605.06765

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

VITA-QinYu：用于角色扮演和唱歌的表现力口语模型

Xu, Jiacheng, Gao, Heting, Xie, Liufei, Yang, Zhenchuan, Li, Lijiang, Chen, Yiting, Zhang, Bin, Chen, Meng, Fu, Chaoyu, Zhao, Weifeng, Zhou, Wenjiang

Abstract

Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.

Chinese Translation

人类语言传达的表现力超越了语言内容，包括个性、情绪或表演元素，例如安慰的语调或哼唱歌曲，我们将其形式化为角色扮演和唱歌。我们提出了VITA-QinYu，这是第一个超越自然对话的表现力端到端（E2E）口语语言模型（SLM），支持角色扮演和唱歌生成。VITA-QinYu采用了一种混合语音-文本范式，扩展了交错的文本-音频建模，使用多代码本音频标记，这一设计能够实现更丰富的副语言表现，同时保持不同模态之间的清晰分离，以避免干扰。我们进一步开发了一个全面的数据生成管道，合成了总计15.8K小时的自然对话、角色扮演和唱歌数据用于训练。VITA-QinYu展示了卓越的表现力，在客观角色扮演基准测试中超越同类SLM模型7个百分点，并在5分制的唱歌MOS评分中超越同类模型0.13分。同时，它在对话准确性和流畅性方面达到了最先进的水平，在C3和URO基准测试中分别超越了之前的SLM模型1.38和4.98个百分点。我们开源了我们的代码和模型，并提供了一个易于使用的演示，支持流媒体和全双工交互的全栈功能。


cs.CL / 3 / 2605.06832

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

IntentGrasp：意图理解的综合基准

Yin, Yuwei, Li, Chuyuan, Carenini, Giuseppe

Abstract

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.

Chinese Translation

准确理解语言、对话和书写背后的意图对于开发有用的大型语言模型（LLM）助手至关重要。本文介绍了IntentGrasp，这是一个用于评估LLM意图理解能力的综合基准。IntentGrasp源自49个高质量、开放许可的语料库，涵盖12个不同领域，通过源数据集的策划、意图标签的上下文化和任务格式的统一构建而成。IntentGrasp包含一个大规模的训练集，包含262,759个实例，以及两个评估集：一个包含12,909个测试案例的全量集（All Set）和一个更平衡且具有挑战性的宝石集（Gem Set），包含470个案例。对20个来自7个家族的LLM（包括前沿模型如GPT-5.4、Gemini-3.1-Pro和Claude-Opus-4.7）的广泛评估显示其表现不佳，在全量集上的得分低于60%，在宝石集上的得分低于25%。值得注意的是，在宝石集中，20个测试模型中有17个的表现低于随机猜测基准（15.2%），而估计的人类表现约为81.1%，显示出显著的改进空间。为了提升这种能力，本文提出了意图细调（Intentional Fine-Tuning，IFT），该方法在IntentGrasp的训练集上对模型进行细调，在全量集上获得了30多个F1分数的显著提升，在宝石集上获得了20多个点的提升。显著的是，逐域排除（leave-one-domain-out，Lodo）实验进一步证明了IFT的强跨域泛化能力，验证了其作为显著提升LLM意图理解的有前景的方法。总体而言，通过基准测试和提升意图理解能力，本研究为实现更具意图、能力和安全性的人工智能助手，造福人类和社会提供了有希望的路径。


cs.CL / 4 / 2605.06886

TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP

TajPersLexon：一种塔吉克-波斯语词汇资源及跨脚本低资源自然语言处理的混合模型

Arabov, Mullosharaf K.

Abstract

This work introduces TajPersLexon, a curated Tajik--Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.

Chinese Translation

本研究介绍了TajPersLexon，这是一个经过整理的塔吉克-波斯语平行词汇资源，包含40,112对单词和短语，旨在低资源环境下进行跨脚本词汇检索、音译和对齐。我们进行了全面的仅使用CPU的基准测试，比较了三种方法论： (i) 轻量级混合管道， (ii) 神经序列到序列模型， (iii) 检索方法。我们的评估表明，该任务本质上是可解决的，神经和检索基线的顶级准确率达到了98-99%。关键是，我们展示了虽然大型多语言句子变换器在这种精确的词汇匹配上表现不佳，但我们的可解释混合模型在实际应用中提供了良好的准确性与效率的权衡，在OCR后校正任务中达到了96.4%的准确率。所有实验均使用固定的随机种子以确保完全可重复性。数据集、代码和模型将公开发布。


cs.CL / 5 / 2605.06897

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

MIST：面向智能家居的多模态互动语音工具调用对话助手

Chen, Maximillian, Zhang, Xuanming, Peng, Michael, Yu, Zhou, Papangelis, Alexandros, Jo, Yohan

Abstract

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.

Chinese Translation

物联网（IoT）设备在物理世界的兴起要求具备处理复杂用户体验的基于语音的接口。尽管现代大型语言模型（LLMs）已经展示了强大的工具使用能力，但对现实世界IoT设备的建模仍然是一个困难且未被充分研究的挑战，它结合了时空约束建模、语音输入、动态状态跟踪和混合主动交互模式。我们介绍了MIST（多模态互动语音工具调用数据集），这是一个合成的多轮、基于语音驱动的代码生成任务，适用于IoT设备。我们发现，在MIST上，开放权重和闭合权重的多模态LLMs之间存在显著差距，即使是前沿的闭合权重LLMs也有相当大的提升空间。我们发布了MIST及一个可扩展的数据生成框架，以构建相关数据集，从而促进关于能够推理物理世界约束的混合主动语音助手的研究。


cs.CL / 6 / 2605.06901

Reflections and New Directions for Human-Centered Large Language Models

以人为本的大型语言模型的反思与新方向

Ziems, Caleb, Zhao, Dora, Wang, Rose E., Jörke, Matthew, Rushdi, Ahmad, Deepak, Advit, Yu, Sunny, Agarwal, Anshika, Agarwal, Harshvardhan, Aranguiz-Dias, Gabriela, Bhagirath, Aditri, Breuch, Justine, Chen, Huanxing, Chen, Ruishi, Chen, Sarah, Fan, Haocheng, Fang, William, Fergesen, Cat Gonzales, Frees, Daniel, Gao, Tian, Huang, Ziqing, Jain, Vishal, Jiang, Yucheng, Kalinin, Kirill, Karaca, Su Doga, Khatua, Arpandeep, La, Teland, Levent, Isabelle, Li, Miranda, Li, Xinling, Li, Yongce, Liu, Angela, Oh, Minsik, Paek, Nathan J., Qin, Anthony, Redmond, Emily, Ryan, Michael J., Salecha, Aadesh, Shen, Xiaoxian, Singhal, Pranava, Subrahmanya, Shashanka, Tan, Mei, Thawornbut, Irawadee, Vinocour, Michelle, Wang, Xiaoyue, Wang, Zheng, Weng, Henry Jin, Wirawarn, Pawan, Wu, Shirley, Wu, Sophie, Xie, Yichen, Ye, Patrick, Zhang, Sean, Zhang, Yutong, Zhou, Cathy, Zhao, Yiling, Landay, James, Yang, Diyi

Abstract

Large Language Models (LLMs) are increasingly shaping the private and professional lives of users, with numerous applications in business, education, finance, healthcare, law, and science. With this rise in global influence comes greater urgency to build, evaluate, and deploy these systems in a manner that prioritizes not only technical capabilities but also human priorities. This work presents a framework for developing Human-Centered Large Language Models (HCLLMs), which integrates perspectives from Natural Language Processing (NLP), Human-Computer Interaction (HCI), and responsible AI. Considering the ethics, economics, and technical objectives of language modeling, we argue that model developers need to address human concerns, preferences, values, and goals, not only during a cursory post-training stage, but rather with rigor and care at every stage of the pipeline. This paper offers human-centered insights and recommendations for developers at each stage, from system design to data sourcing, model training, evaluation, and responsible deployment. Then we conclude with a case study, applying these insights to understand the future of work with HCLLMs.

Chinese Translation

大型语言模型（LLMs）正日益影响用户的私人和职业生活，在商业、教育、金融、医疗、法律和科学等多个领域有着广泛的应用。随着其全球影响力的上升，迫切需要以优先考虑人类需求的方式构建、评估和部署这些系统，而不仅仅是关注技术能力。本文提出了一个开发以人为本的大型语言模型（HCLLMs）的框架，整合了自然语言处理（NLP）、人机交互（HCI）和负责任的人工智能的视角。我们认为，模型开发者需要在语言建模的伦理、经济和技术目标方面，关注人类的关切、偏好、价值观和目标，这不仅仅是在训练后阶段的粗略考虑，而是要在整个流程的每个阶段都以严谨和细致的态度来对待。本文为开发者在系统设计、数据来源、模型训练、评估和负责任的部署等每个阶段提供了以人为本的见解和建议。最后，我们通过一个案例研究，应用这些见解来理解与HCLLMs共同工作的未来。


cs.CL / 7 / 2605.06903

MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text

MELD：多任务均衡学习检测器用于AI生成文本

Li, Chenjun, Wan, Cheng, Paetzold, Johannes C.

Abstract

Large language models are now embedded in everyday writing workflows, making reliable AI-generated text detection important for academic integrity, content moderation, and provenance tracking. In practice, however, a detector must do more than achieve high aggregate AUROC on clean, in-distribution human and AI text: it should remain robust to attacks and adversarial rewrites, transfer to unseen generators and domains, and operate at low false-positive rates (FPR). Most existing detectors optimize a single AI/Human objective, giving the representation little incentive to learn generator, attack, or domain structure once the binary task saturates. We introduce MELD (Multi-Task Equilibrated Learning Detector), a deployable detector for AI-generated text that enriches binary detection with auxiliary supervision. MELD attaches generator-family, attack-type, and source-domain heads to a shared encoder, and balances the four losses with learned homoscedastic uncertainty weights. To improve robustness, an EMA teacher predicts on clean inputs while an attack-augmented student is distilled toward the teacher. MELD further uses a hard-negative pairwise ranking loss to enlarge the score margin between AI-generated texts and the most confusable human texts. At inference, all auxiliary heads are discarded, giving MELD the same interface and cost as a standard detector. On the public RAID leaderboard, MELD is the strongest open-source detector and is competitive with leading commercial models, especially under attack and at low FPR. Across standard held-out benchmarks, MELD matches or outperforms supervised baselines. We further introduce MELD-eval, a held-out evaluation pool built from recent chat models released by four major LLM providers. Without additional finetuning, MELD achieves 99.9% TPR at 1% FPR on MELD-eval, while many baselines degrade sharply.
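
The loss balancing mentioned above can be sketched as follows. The sketch assumes the commonly used homoscedastic uncertainty weighting in which each task loss L_i is scaled by exp(-s_i) and regularized by s_i, where s_i is a learned log-variance. The abstract does not give MELD's exact formula, so this form, the function name, and the loss values are assumptions for illustration only.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i = log(sigma_i^2) is one scalar per task. In real training these
    would be trainable parameters updated by the optimizer; here they are
    plain floats so the arithmetic is easy to follow.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance is required per task loss")
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_variances))


# Hypothetical loss values: detection, generator-family, attack-type, source-domain.
losses = [0.70, 1.20, 0.90, 1.50]
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0, 0.0])
down_weighted = uncertainty_weighted_loss(losses, [0.0, 1.0, 0.0, 0.0])
```

With all s_i at zero the result is the plain sum of the four losses. Raising s_i for one head shrinks that head's contribution, while the additive s_i term keeps the optimizer from inflating every s_i without bound.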

Chinese Translation

大型语言模型现已嵌入日常写作工作流程中，因此可靠的AI生成文本检测对于学术诚信、内容审核和来源追踪变得尤为重要。然而，在实践中，检测器不仅需要在干净的、分布内的人类和AI文本上实现高的整体AUROC，还应对攻击和对抗性重写保持鲁棒性，能够迁移到未见过的生成器和领域，并在低假阳性率（FPR）下运行。现有的大多数检测器优化单一的AI/人类目标，使得在二元任务饱和后，表示学习对生成器、攻击或领域结构的激励不足。我们提出了MELD（多任务均衡学习检测器），这是一个可部署的AI生成文本检测器，通过辅助监督丰富了二元检测。MELD将生成器家族、攻击类型和源领域的头部附加到共享编码器，并通过学习的同方差不确定性权重平衡四个损失。为了提高鲁棒性，EMA教师在干净输入上进行预测，而攻击增强的学生则朝向教师进行蒸馏。MELD进一步使用硬负样本成对排序损失来扩大AI生成文本与最易混淆的人类文本之间的得分边际。在推理时，所有辅助头部被丢弃，使MELD具有与标准检测器相同的接口和成本。在公共RAID排行榜上，MELD是最强的开源检测器，并且在攻击和低FPR下与领先的商业模型具有竞争力。在标准的保留基准测试中，MELD的表现与监督基线相当或更优。我们还介绍了MELD-eval，这是一个由四大LLM提供商最近发布的聊天模型构建的保留评估池。在没有额外微调的情况下，MELD在MELD-eval上实现了99.9%的真正率（TPR）和1%的假阳性率（FPR），而许多基线则急剧下降。


cs.CL / 8 / 2605.06919

Can LLMs Take Retrieved Information with a Grain of Salt?

大型语言模型能否对检索信息持保留态度？

Shayegh, Behzad, Ahmed, Mohamed Osama, Tung, Fred, Feng, Leo

Abstract

Large language models have demonstrated impressive retrieval-augmented capabilities. However, a crucial area remains underexplored: their ability to appropriately adapt responses to the certainty of the retrieved information. It is a limitation with real consequences in high-stakes domains like medicine and finance. We evaluate eight LLMs on their context-certainty obedience, measuring how well they adjust responses to match expressed context certainty. Our analysis reveals systematic limitations: LLMs struggle to recall prior knowledge after observing an uncertain context, misinterpret expressed certainties, and overtrust complex contexts. To address these, we propose an interaction strategy combining prior reminders, certainty recalibration, and context simplification. This approach reduces obedience errors by 25% on average, without modifying model weights, demonstrating the efficacy of interaction design in enhancing LLM reliability. Our contributions include a principled evaluation metric, empirical insights into LLMs' uncertainty handling, and a portable strategy to improve context-certainty obedience across diverse LLMs.

Chinese Translation

大型语言模型展示了令人印象深刻的检索增强能力。然而，有一个关键领域仍然未被充分探索：它们在多大程度上能够适当地调整响应以适应检索信息的确定性。这在医学和金融等高风险领域具有实际后果。我们评估了八个大型语言模型在上下文确定性遵从性方面的表现，测量它们在多大程度上能够根据表达的上下文确定性调整响应。我们的分析揭示了系统性的局限性：大型语言模型在观察到不确定的上下文后难以回忆先前的知识，误解表达的确定性，并对复杂上下文过于信任。为了解决这些问题，我们提出了一种交互策略，结合了先前提醒、确定性重新校准和上下文简化。这种方法在不修改模型权重的情况下，平均减少了25%的遵从性错误，证明了交互设计在增强大型语言模型可靠性方面的有效性。我们的贡献包括一个有原则的评估指标、对大型语言模型不确定性处理的实证见解，以及一种可移植的策略，以提高不同大型语言模型的上下文确定性遵从性。


cs.CL / 9 / 2605.06940

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

MultiSoc-4D：用于诊断封闭集大语言模型注释中指令引发标签崩溃的基准

Pramanik, Souvik, Antu, S. M. Riaz Rahman, Abyad, Shak Mohammad, Khalil, Md. Ibrahim, Hussain, Md. Shahriar

Abstract

Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we prove that it represents a "label agreement illusion", statistically validated via almost null Fleiss' Kappa ($\kappa \approx -0.001$) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.
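
The "label agreement illusion" can be reproduced with a small worked example: when almost every rater picks the fallback label, raw agreement is high while Fleiss' kappa sits near zero. The sketch below implements the standard Fleiss' kappa formula; the counts are invented for illustration and are not the paper's data.

```python
def fleiss_kappa(counts):
    """Return (observed_agreement, kappa) for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have the same number of raters (at least two), and the
    raters must not all use a single category (that would divide by zero).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cat)
    return p_bar, (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical sarcasm labels from 4 annotators on 100 comments, columns [No, Yes]:
# 96 comments are unanimously "No"; on 4 comments a single annotator says "Yes".
counts = [[4, 0]] * 96 + [[3, 1]] * 4
observed, kappa = fleiss_kappa(counts)
```

Here the observed agreement is 0.98, yet kappa is about -0.01, because nearly all of that agreement is already expected by chance once the "No" label dominates.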

Chinese Translation

通过大语言模型（LLMs）实现注释自动化是扩展自然语言处理（NLP）数据集的核心方法；然而，关于低资源语言中封闭集指令下的LLM行为尚未得到充分研究。我们提出了MultiSoc-4D，这是一个孟加拉社交媒体数据集基准，包含来自六个来源的58K+社交媒体评论，按四个维度进行注释：类别、情感、仇恨言论和讽刺。通过采用一个结构化的流程，其中ChatGPT、Gemini、Claude和Grok分别对不同的分区进行注释，同时共享一个20%的公共验证集，我们系统性地诊断了LLM的行为。我们发现了一种普遍现象，称为“指令引发的标签崩溃”，在这种现象中，LLM表现出对回退标签（其他、中性、无）的系统性偏好，导致高一致性率但对少数类别的检测不足。例如，我们发现LLM未能检测到79%和75%的仇恨和讽刺内容实例，相较于人类校准的参考。此外，我们证明这代表了一种“标签一致性幻觉”，通过几乎为零的Fleiss' Kappa（$\kappa \approx -0.001$）在讽刺检测中进行了统计验证。在40多个LLM中，我们基准测试了这种注释偏差在训练流程中的传播，无论架构差异如何。我们发布MultiSoc-4D作为孟加拉NLP中注释偏差的诊断基准。


cs.CL / 10 / 2605.06978

Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

技能组：面向代理技能库的群体结构技能检索

Zeng, Kun, Huo, Yu, Zhang, Siyu, Ye, Zi, Zhuo, Yuecheng, Liu, Haoyue, Lu, Yuquan, Wen, Junhao, Tang, Xiaoying

Abstract

Skill-augmented agents increasingly rely on large reusable skill libraries, but retrieving relevant skills is not the same as presenting usable context. Existing methods typically return atomic skills or dependency-aware bundles whose internal roles remain implicit, leaving the agent to infer the execution entry point, support skills, visible requirements, and failure-avoidance guidance. We introduce Group of Skills (GoSkills), an inference-time group-structured retrieval method that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, expands support groups through a group graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields, without changing the downstream agent, skill payloads, or execution environment. Experiments on SkillsBench and ALFWorld show that GoSkills preserves visible-requirement coverage under a small skill budget, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references.

Chinese Translation

增强技能的代理越来越依赖于大型可重用的技能库，但检索相关技能并不等同于提供可用的上下文。现有方法通常返回原子技能或依赖感知的技能包，其内部角色仍然隐含，导致代理需要推断执行入口、支持技能、可见要求和避免失败的指导。我们提出了技能组（Group of Skills, GoSkills），这是一种推理时的群体结构检索方法，它将代理面对的检索对象从平面的技能列表转变为紧凑的、角色标记的执行上下文。GoSkills 从类型化技能图构建以锚点为中心的技能组，通过群体图扩展支持组，将选定的群体计划瓶颈化为有限的原子技能负载，并呈现一个固定的执行合同，其中包含开始、支持、检查和避免字段，而不改变下游代理、技能负载或执行环境。在 SkillsBench 和 ALFWorld 上的实验表明，GoSkills 在小技能预算下保持可见要求的覆盖，优于平面技能访问基线，并且相对于结构检索参考，通常提高了奖励和代理单独的运行时间。


cs.CL / 11 / 2605.07013

Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

通过熵门控连续比特流扩散缩小语言建模中的自回归差距

Batzolis, Georgios, Girolami, Mark, Ambrogioni, Luca

Abstract

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity ($\mathrm{GenPPL}$) of $59.76$ at matched real-data entropy ($4.31$) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving $\mathrm{GenPPL}=27.06$ at an entropy of $5.26$ using $4\times$ fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the $\mathcal{O}(V)$ vocabulary scaling bottleneck shared by standard DLMs. By predicting $\mathcal{O}(\log V)$ bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
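
The $\mathcal{O}(\log V)$ claim follows from writing each token id in binary. The sketch below shows one way to map a token id to a fixed-width "analog" bit vector and back. The values -1.0 and +1.0 for the two bit states, the function names, and the noise pattern are assumptions for illustration; the paper's actual encoding and its semantic bit-patching are not specified in the abstract.

```python
def bit_width(vocab_size):
    """Number of bits needed to index a vocabulary of the given size."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_analog_bits(token_id, width):
    """Encode a token id as `width` real values, most significant bit first.

    A set bit becomes +1.0 and a cleared bit becomes -1.0 (assumed convention).
    """
    return [1.0 if (token_id >> (width - 1 - k)) & 1 else -1.0
            for k in range(width)]


def analog_bits_to_token(bits):
    """Decode by thresholding each value at zero and reassembling the integer."""
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | (1 if b > 0 else 0)
    return token_id


width = bit_width(50257)  # a GPT-2-sized vocabulary needs 16 bits
bits = token_to_analog_bits(12345, width)
# Perturb each value by +/-0.3; the signs, and hence the token, are unchanged.
noisy = [b + 0.3 * (-1) ** k for k, b in enumerate(bits)]
recovered = analog_bits_to_token(noisy)
```

A model that predicts one logit per bit therefore needs an output width of 16 for this vocabulary, instead of one logit for each of the 50,257 tokens.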

Chinese Translation

扩散语言模型（DLMs）承诺实现并行、无序生成，但在标准基准测试中，它们在样本质量和多样性上历史性地落后于自回归模型。最近对标记嵌入的连续流和扩散方法缩小了这一差距，表明连续状态空间对语言非常有效。在本研究中，我们通过将文本建模为固定宽度二进制比特流上的连续扩散过程，进一步缩小了自回归差距。我们的方法将语义标记表示为模拟比特序列，并利用匹配滤波器残差参数化来将上下文学习与分析独立比特后验隔离。关键是，我们采用了一种随机采样器，该采样器应用由熵率轮廓门控的Langevin类型修正，自动在高信息区域集中随机性，同时在其他地方保持几乎确定性。在十亿单词基准测试（LM1B）上，我们的130M参数比特流模型在匹配真实数据熵（$4.31$）下达到了生成困惑度（$\mathrm{GenPPL}$）$59.76$，使用256次神经函数评估（NFEs），决定性地超越了之前的DLM基线，并达到了自回归参考。在OpenWebText（OWT）上，我们的随机采样器建立了新的连续-DLM Pareto前沿，以$5.26$的熵在比之前的1024-NFE基线少$4\times$的步骤下实现了$\mathrm{GenPPL}=27.06$。作为额外的架构优势，比特流扩散消除了标准DLM共享的$\mathcal{O}(V)$词汇扩展瓶颈。通过语义比特修补预测$\mathcal{O}(\log V)$比特级logits，我们的模型实现了更小的内存占用和更高的吞吐量，展示了一种可扩展的语言生成范式，适应词汇规模的增长。


cs.CL / 12 / 2605.07040

Cognitive Agent Compilation for Explicit Problem Solver Modeling

认知代理编译用于显式问题求解模型

Moon, Hyeongdon, Rosé, Carolyn, Stamper, John

Abstract

Large language models (LLMs) are widely used for tutoring, feedback generation, and content creation, but their broad pretraining makes them hard to constrain and poor substitutes for controllable learners. Educational systems often require inspectable and editable knowledge states: educators want to know what a system assumes the learner knows, and learners benefit when the system can justify actions in terms of explicit skills, misconceptions, and strategies. Inspired by cognitive architectures, we propose Cognitive Agent Compilation (CAC), a framework that uses a strong teacher LLM to compile problem-solving knowledge into an explicit target agent. CAC separates (i) knowledge representation, (ii) problem-solving policy, and (iii) verification and update rules, with the goal of making bounded problem solving more inspectable and editable in educational settings. We present an early proof of concept implemented with Small Language Models that surfaces key design trade-offs, particularly between explicit control and scalable generalization, and positions CAC as an initial step toward bounded-knowledge AI for educational applications.

Chinese Translation

大型语言模型（LLMs）广泛应用于辅导、反馈生成和内容创作，但其广泛的预训练使得它们难以约束，并且作为可控学习者的替代品表现不佳。教育系统通常需要可检查和可编辑的知识状态：教育工作者希望了解系统假设学习者所掌握的知识，而学习者在系统能够根据显式技能、误解和策略来解释其行为时会受益。受到认知架构的启发，我们提出了认知代理编译（Cognitive Agent Compilation, CAC），这是一个利用强大的教师LLM将问题解决知识编译成显式目标代理的框架。CAC将（i）知识表示、（ii）问题解决策略和（iii）验证与更新规则分开，旨在使受限问题解决在教育环境中更具可检查性和可编辑性。我们展示了一个早期的概念验证，使用小型语言模型实现，揭示了关键设计权衡，特别是在显式控制和可扩展泛化之间的权衡，并将CAC定位为朝着教育应用中受限知识人工智能的初步步骤。


cs.CL / 13 / 2605.07051

NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

NSMQ谜题：用于评估大型语言模型的科学与数学谜题基准

Boateng, George, Ibrahim, Naafi, John, Samuel, Badu, Philemon, Agyeman-Budu, Patrick, Mensah, Jonathan, Yeboah, Kevin, Edor, William, Mensa-Onumah, Andrew, Yeboah, Nana, Kumbol, Victor Wumbor-Apin

Abstract

Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana's National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs' capabilities for science and mathematics education.

Chinese Translation

大型语言模型（LLMs）在各种科学教育基准测试中表现良好，展示了其在科学和数学教育中的潜力。然而，LLMs往往在来自西方世界的科学和数学教育数据集上进行评估，而来自全球南方的数据集则严重不足。此外，这些数据集通常具有简单易评估的多项选择答案选项。在本研究中，我们提出了NSMQ谜题，这是一个来自加纳国家科学与数学竞赛（NSMQ）的科学与数学谜题的新基准，用于评估LLMs。NSMQ是加纳每年举办的现场电视竞赛，面向高级中学学生，汇聚了加纳最聪明的高中生，他们以2人一组的形式在生物、化学、物理和数学等领域回答问题，经过五轮五个阶段的竞争，直到最终产生当年的冠军。NSMQ谜题包含来自第五轮的11年谜题问题（n=1.8K），每个谜题至少包含3个线索。学生们竞争以第一个猜出任何线索的答案，较早的线索通常比较模糊，并且获得更多的分数。答案通常是一个数字、一个单词或一个短语，便于自动评估。我们评估了最先进的模型：封闭模型（GPT-5.4、Gemini 3.1 Pro、Claude Opus 4.6）和开放模型（Kimi-K2.5、DeepSeek-V3.1、GPT-OSS-120B），并设置了高和低推理水平。我们的评估显示，该数据集对最先进的LLMs来说具有挑战性，其表现甚至不如最佳学生参赛者。本研究为来自全球南方的科学与数学推理提供了一个新颖且具有挑战性的基准，旨在实现对LLMs在科学和数学教育能力的真正全球基准测试。


cs.CL / 14 / 2605.07053

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

GSM-SEM：生成语义变异增强的基准和框架

Singh, Jyotika, Tu, Fang, Mirzadova, Aziza, Agarwal, Amit, Patel, Hitesh Laxmichand, Ghoshal, Sandip, Ballesteros, Miguel, Benajiba, Yassine, Sun, Weiyi, Horwood, Graham, Ravi, Sujith, Roth, Dan

Abstract

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

Chinese Translation

像GSM8K这样的基准是数学推理的流行衡量标准，但排行榜的提升可能会因对固定测试集的记忆而夸大真实能力。大多数鲁棒性变体应用表面级扰动（如释义、重命名、数字交换、干扰项），这些扰动在很大程度上保留了基础事实，而静态发布的基准本身随着时间的推移也可能成为记忆目标。我们引入了GSM-SEM，一个可重用的随机框架，用于生成语义多样的基准变体，其语义方差显著高于以往的方法。GSM-SEM通过修改实体、属性和/或关系来扰动问题陈述，频繁改变基础事实，并要求模型在新条件下重新计算解决方案，同时限制生成以保留原始计算/答案并近似问题难度。GSM-SEM在每次运行时生成新的变体，无需重新标注，从而减少对静态公共基准的依赖，降低记忆偏差。我们将GSM-SEM应用于GSM8K和两个现有变体套件（GSM-Symbolic和GSM-Plus），生成了GSM8K-SEM、GSM-Symbolic-SEM和GSM-Plus-SEM。在评估14个最先进的语言模型（SOTA LLMs）时，我们观察到性能一致下降，当语义扰动与符号/加法变体结合时，下降幅度更大（在GSM-SEM的最大严格配置下，平均下降率为28%）。我们公开发布这三个SEM变体作为完全经过人工验证的数据集。最后，为了展示其在GSM风格数学问题之外的适用性，我们将GSM-SEM应用于其他基准，包括BigBenchHard、LogicBench和NLR-BIRD。


cs.CL / 15 / 2605.07058

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

MedExAgent：训练大型语言模型代理在嘈杂临床环境中提问、检查和诊断

Gao, Yicheng, Zhou, Xiaolin, Li, Yahan, Zhao, Yue, Liu, Ruishan

Abstract

Real-world clinical diagnosis is a complex process in which the doctor is required to obtain information both from interaction with the patient and from medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as to noisy and incomplete information that can arise at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single-turn question answering, noise-free conversations, or sequential exam making, ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, MedExAgent, through a two-stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary-Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.

Chinese Translation

现实世界中的临床诊断是一个复杂的过程，医生需要通过与患者的互动以及进行医学检查来获取信息。此外，医生还需要适应不同的患者角色，以及在整个过程中可能出现的嘈杂和不完整的信息。然而，现有的医学大型语言模型（LLMs）基准和自动诊断方法在很大程度上简化了这一过程，将其归结为单轮问答、无噪声对话或顺序检查等，忽视了临床诊断的互动性和不确定性。本文旨在通过将临床诊断形式化为部分可观察马尔可夫决策过程（POMDP），并定义三种行动类型：询问患者、下达医学检查的工具调用以及发出诊断，来填补这一空白。我们还引入了一个系统的噪声模型，包括七种患者噪声类型和三种检查噪声类型。利用我们提出的环境，我们通过一个两阶段的流程训练了一个有效的诊断代理——MedExAgent，首先在基于卡尔加里-剑桥模型的合成对话上进行监督微调，然后应用DAPO优化一个综合奖励，以捕捉诊断准确性、工具调用质量和包括经济成本及患者不适在内的检查成本。通过大量实验和消融研究，我们证明了MedExAgent在诊断性能上与更大模型相当，同时保持了成本高效的检查策略。


cs.CL / 16 / 2605.07068

WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems

WiCER：Wiki-记忆编译、评估、精炼迭代知识编译用于大型语言模型维基系统

Huerta, Juan M.

Abstract

The LLM Wiki pattern compiles domain knowledge into a persistent artifact and serves it to LLMs via KV cache inference, promising context access at sub-second latency with zero retrieval failure. Realizing this requires solving the compilation gap: an LLM must distill raw documents into a wiki without catastrophically discarding critical facts. We characterize this gap across 17 RepLiQA domains (6,800 questions): full-context KV cache inference outperforms RAG on curated knowledge (4.38 vs. 4.08 out of 5, 7.3x faster TTFT) but degrades below RAG at scale due to attention dilution, and blind compilation fails entirely (2.14 to 2.32 vs. 3.46, 53 to 60% catastrophic failure rate). To address the compilation gap, we propose WiCER (Wiki-memory Compile, Evaluate, Refine), an iterative algorithm inspired by counterexample-guided abstraction refinement (CEGAR). WiCER evaluates compiled wikis against diagnostic probes, identifies dropped facts, and forces their preservation in subsequent compilations. One to two iterations recover 80% of lost quality (mean 3.24 vs. 3.47 for raw full-context across the 15 topics with baselines), reducing catastrophic failures by 55% relative. An ablation across all 17 topics confirms that targeted diagnosis (+0.95), not generic pinning (+0.16), drives the gains. All code and benchmarks are released for reproducible research.
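
The compile, evaluate, refine cycle described in this abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `toy_compile` is a hypothetical stand-in for LLM compilation (it keeps pinned facts and then fills a fixed budget), and probes are modelled as a plain set of facts that must survive compilation.

```python
from typing import List, Set, Tuple


def toy_compile(docs: List[str], pinned: Set[str], budget: int) -> List[str]:
    """Hypothetical stand-in for LLM compilation.

    Pinned facts are always kept; remaining slots are filled in document
    order until the budget is reached.
    """
    wiki = [d for d in docs if d in pinned]
    for d in docs:
        if len(wiki) >= budget:
            break
        if d not in wiki:
            wiki.append(d)
    return wiki


def failed_probes(wiki: List[str], probes: Set[str]) -> Set[str]:
    """Evaluate step: return the probed facts missing from the wiki."""
    return {fact for fact in probes if fact not in wiki}


def refine_loop(docs: List[str], probes: Set[str], budget: int,
                max_iters: int = 2) -> Tuple[List[str], Set[str]]:
    """Compile, evaluate against probes, then recompile with dropped facts pinned."""
    pinned: Set[str] = set()
    wiki = toy_compile(docs, pinned, budget)
    for _ in range(max_iters):
        dropped = failed_probes(wiki, probes)
        if not dropped:
            break  # every probe passes; nothing left to refine
        pinned |= dropped  # force preservation in the next compilation
        wiki = toy_compile(docs, pinned, budget)
    return wiki, pinned


docs = ["f1", "f2", "f3", "f4", "f5", "f6"]
wiki, pinned = refine_loop(docs, probes={"f5", "f6"}, budget=3)
print(wiki, pinned)
```

The refinement signal here comes from the failed probes, mirroring the abstract's distinction between targeted diagnosis and generic pinning.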

Chinese Translation

LLM Wiki模式旨在将领域知识编译并提供为持久性文档，通过KV缓存推理服务于大型语言模型（LLMs），承诺在亚秒延迟下实现上下文访问且零检索失败。实现这一目标需要解决编译差距：LLM编译将原始文档提炼为维基，而不致于灾难性地丢弃关键事实。我们在17个RepLiQA领域（6,800个问题）中对这一差距进行了表征：我们观察到，完整上下文的KV缓存推理在策划知识上优于RAG（4.38对4.08，TTFT速度快7.3倍），但由于注意力稀释，在规模上低于RAG，而盲编译则完全失败（2.14到2.32对3.46，灾难性失败率为53%到60%）。为了解决编译差距，我们提出了WiCER（Wiki-记忆编译、评估、精炼），这是一种受反例引导抽象精炼（CEGAR）启发的迭代算法，旨在缩小这一差距。WiCER通过诊断探针评估编译的维基，识别丢失的事实，并在后续编译中强制保留这些事实。一到两次迭代恢复了80%的丢失质量（在15个主题上，平均3.24对3.47），相对减少了55%的灾难性失败。对所有17个主题的消融实验确认，针对性诊断（+0.95）而非通用固定（+0.16）推动了质量提升。所有代码和基准数据均已发布，以支持可重复的研究。

cs.CL / 17 / 2605.07076

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

自我巩固语言模型：从上下文中持续融入知识

Wang, Zekun, Gupta, Anant, Dong, Zihan, MacLellan, Christopher J.

Abstract

Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.

Chinese Translation

大型语言模型（LLMs）越来越多地接收来自段落、对话和长上下文工作流的信息。尽管更长的上下文窗口提供了更多证据，但并不能确保有用的信息得以保存和重用。我们研究了持续上下文巩固：在限制对先前巩固信息干扰的情况下，将当前上下文写入模型权重。我们提出了自我巩固语言模型（Self-Consolidating Language Models, SCoL），这是一个后训练框架，在该框架中，给定当前上下文，LLM学习生成文本更新指令，指定其自身的Transformer层应当更新哪些。由于已提交的更新会改变后续生成未来选择的模型，我们通过对不断演变的模型状态进行元强化学习来训练SCoL。我们在SQuAD知识融入的监督问答奖励和LongBench v2长上下文巩固的内在似然性基础奖励上实例化SCoL。在这两种设置中，SCoL在获取和保留方面优于提示、摘要、批量测试时训练和顺序微调基线。对学习到的选择模式的分析表明，SCoL鼓励LLM生成与高Fisher信息层对齐的稀疏更新位置，暗示模型学习将可塑性引导到对损失敏感的区域，同时限制干扰。此外，SCoL在评估时从较短的元训练流转移到较长的LongBench v2流，表明我们的框架支持可扩展的流式巩固。

cs.CL / 18 / 2605.07084

Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation

超越单一真实：作为认知不公的参考一元论在自动语音识别评估中的应用

Choi, Anna Seo Gyeong, Teleki, Maria, Caverlee, James, del Rio, Miguel, Miller, Corey, Choi, Hoon

Abstract

Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered; they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism, the enforcement of a single transcription convention as ground truth, commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.
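
To make the dependence of WER on the reference convention concrete, the sketch below scores one hypothesis against two references and reports the spread. It uses the standard word-level edit-distance definition of WER. The `wer_range` helper and the example sentences are illustrative assumptions, not the paper's exact WER-Range definition or AphasiaBank data.

```python
from typing import Dict, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: edit distance divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def wer_range(references: Dict[str, str],
              hypothesis: str) -> Tuple[float, float, Dict[str, float]]:
    """Hypothetical helper: WER per convention plus the (min, max) range."""
    per_convention = {name: wer(ref, hypothesis)
                      for name, ref in references.items()}
    return (min(per_convention.values()),
            max(per_convention.values()),
            per_convention)


# Made-up example: the same speech transcribed under two conventions.
references = {
    "verbatim": "i uh i want to to go home",
    "non_verbatim": "i want to go home",
}
low, high, detail = wer_range(references, "i want to go home")
print(low, high, detail)  # 0.0 0.375 {'verbatim': 0.375, 'non_verbatim': 0.0}
```

The same output is scored as perfect under the non-verbatim reference and as 37.5% wrong under the verbatim one, which is the kind of spread a range report would expose.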

Chinese Translation

自动语音识别（ASR）评估将系统输出与真实转录文本进行比较，使用字词错误率（WER）量化二者之间的差距。然而，真实转录文本并不是被发现的，而是由人类注释者根据编码了关于哪些语音特征重要的规范假设的惯例生成的。不同的惯例（逐字、非逐字、法律）会对相同的语音生成不同的转录文本，并对相同的ASR输出做出不同的评判。本文论证，参考一元论——将单一转录惯例强加为真实——构成了认知不公。失语症患者的言语包含临床上有意义的流畅性障碍，在与将这些流畅性障碍视为错误的“干净”参考进行评估时，系统性地处于不利地位。造成的伤害不仅仅是表现差异，而在于评估基础设施缺乏解释资源来承认他们的贡献是合法的。我们发展了一个哲学框架，引入了解释学差距，形式化了认知不公距离（EID）以衡量参考一元论的成本，并通过使用AphasiaBank进行实证研究，展示了WER的变化取决于定义真实的惯例。我们提出了WER范围：报告在合法惯例下的表现，而不是假设单一正确答案。

cs.CL / 19 / 2605.07093

The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

翻译税不是一个标量：对中文多语种基准中英语源提示继承的反事实审计

Lin, Zezheng, Liu, Fengming, Li, Handi

Abstract

The Translation Tax is often treated as a scalar: translated benchmarks are assumed to inflate scores by preserving English-source cues. We audit this claim in an English-to-Chinese setting. Three proxy estimators disagree: back-translation gaps are small and parser-fragile; cue-score calibration does not predict item-level gains; and a six-model native-control comparison shows model-family rather than uniform benchmark effects. We add a same-item LLM-naturalization stress test that holds answer, options, and content fixed while rewriting Chinese surface form. After correcting a prompt-construction bug, this contrast no longer supports a model-family interaction, but it preserves a residue dose-response: high-residue items benefit while low-residue items do not. The result is not a single Translation Tax, but a set of estimator- and item-dependent validity risks. We release per-cell evidence, the naturalization protocol, human QC, and a reporting checklist for translated multilingual benchmark papers.

Chinese Translation

翻译税通常被视为一个标量：假设翻译后的基准通过保留英语源提示来抬高分数。我们在英语到中文的环境中审计这一说法。三种代理估计器的结果不一致：回译差距较小且对解析器脆弱；提示分数校准无法预测项目级的增益；而六模型的本地控制比较显示出模型家族效应而非统一的基准效应。我们增加了一个相同项目的LLM自然化压力测试，该测试在重写中文表面形式时保持答案、选项和内容不变。在纠正了一个提示构建错误后，这种对比不再支持模型家族交互，但保留了残留剂量反应：高残留项目受益，而低残留项目则没有。结果不是单一的翻译税，而是一组依赖于估计器和项目的有效性风险。我们发布了每个单元的证据、自然化协议、人为质量控制以及翻译多语种基准论文的报告清单。

cs.CL / 20 / 2605.07102

SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions

SAGE：基于层次化大型语言模型的文学评估通过本体驱动的解释维度

Wang, Tianyu, Zhou, Nianjun

Abstract

Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
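
The effect sizes quoted above are Cohen's d values. As a reminder of what the statistic measures, here is a minimal sketch using the pooled-standard-deviation form; the abstract does not state which variant the authors used, and the scores below are invented for illustration.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Invented scores for two groups of stories on one analytical layer.
canonical_scores = [2.0, 4.0, 6.0]
generated_scores = [1.0, 3.0, 5.0]
print(cohens_d(canonical_scores, generated_scores))  # 0.5
```

A value of d above 2.4, as reported for cultural critique and philosophical depth, means the group means differ by more than 2.4 pooled standard deviations.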

Chinese Translation

评估文学质量需要考量诸如文化表现、情感深度和哲学复杂性等解释维度，这些维度难以通过简单的计算测量进行评估。我们提出了SAGE，一个层次化评估框架，将文学质量分解为基于本体的解释维度，通过结构化的大型语言模型评估进行多轮迭代反思和独立验证。我们在100个短篇故事（50部经典作品、30部通俗小说、20个大型语言模型生成的叙述）上验证了该框架，分析层次包括文化、情感-心理和存在-哲学，采用双模式评估。在600次评估中，该框架实现了98.8%的评分收敛和超过94%的评审者间一致性，内容基础评估与元数据基础评估之间几乎完美的模式不变性。统计分析显示出一致的体裁层次（经典 > 通俗 > LLM，均p<0.001），并具有层特异性区分：文化批评和哲学深度表现出非常大的效应量（Cohen's d>2.4），而情感表现则显示出较小的差距（d=1.68），这表明情感模式比批判立场或哲学深度更容易从训练数据中学习。跨层相关性（r=0.649-0.683）确认这三个维度捕捉到可实证区分的质量方面。这些发现表明，理论驱动的LLM评估可以实现测量级的可靠性，并支持系统识别当前生成模型在文学创作方面的不足，对可扩展的开放式文本生成自动评估具有直接的影响。

cs.CL / 21 / 2605.07106

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

检索、整合与综合：空间-语义基础的潜在视觉推理

Cui, Jin, Long, Xinyue, Zhang, Xunyong, Zhang, Yadong, Su, Chuanchang, Gan, Jingye, Zhao, Boran, Ren, Pengju

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.

Chinese Translation

多模态大型语言模型（MLLMs）在视觉-语言推理方面取得了显著进展，但大多数方法仍将视觉证据压缩为离散的文本思维，导致细粒度感知的信息瓶颈。近期的潜在视觉推理方法试图在连续的隐藏状态中进行推理，但我们发现它们存在不足的流形兼容性：潜在轨迹偏离了预训练的推理电路，崩溃为与实例无关的模式，并且在答案生成过程中常常被绕过。为了解决这些问题，我们提出了RIS（检索、整合与综合），这是一个空间-语义基础的框架，将潜在推理发展为预训练MLLM计算的兼容扩展。我们首先构建了一个逐步基础的推理数据集，其中包含边界框和特定区域的语义描述。在此监督的基础上，RIS将潜在标记锚定到空间和语义证据，利用渐进式注意力瓶颈强化其因果角色，并引入短语言过渡标记，将综合的潜在状态桥接回与词汇对齐的解码。对V*、HRBench4K、HRBench8K、MMVP和BLINK的实验显示，相较于封闭/开源和潜在推理基线，RIS取得了一致的改进。进一步的分析表明，RIS学习了多样化、可解释且逐步整合的潜在轨迹，为实现MLLMs中忠实的内部视觉推理提供了一条切实可行的路径。

cs.CL / 22 / 2605.07110

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

保护计算使用代理：一个统一的架构生命周期框架以实现基于部署的可靠性

Chen, Zejian, Liu, Zhanyuan, Li, Chaozhuo, Han, Mengxiang, Liu, Songyang, Zhang, Litian, Gao, Feng, Hei, Yiming, Zhang, Xi

Abstract

Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, filesystems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Creation, Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.

Chinese Translation

计算使用代理（CUAs）正从受限基准向真实软件环境转变，在这些环境中，它们操作浏览器、桌面应用程序、移动应用程序、文件系统、终端和工具后端。在这样的环境中，可靠性不再仅仅由任务成功来衡量：感知错误、规划漂移、内存使用、工具中介、权限范围和运行时监督共同决定代理的行为是否与用户意图保持一致。现有的调查通过方法、平台、基准或安全威胁来组织CUA领域，但较少明确地将能力形成、权限暴露、失败表现和控制位置联系起来。为了解决这一空白，本文开发了一个用于CUA中基于部署的可靠性的架构生命周期框架。架构视角分析感知、决策和执行作为相互关联的层，这些层将软件观察转化为具有权威性的行动。生命周期视角则考察创建、部署、操作和维护作为学习先验、绑定工具和权限、强调运行时轨迹以及在漂移下保持保证的阶段。通过这种视角，分析综合了代表性系统、基准和安全/隐私研究；区分了失败何时变得可见与其启用条件何时被引入，并绘制了控制监督和保证的重复干预表面。OpenClaw仅作为开放部署模式的公共激励示例，而非经过验证的内部案例研究。结论强调了在可控基础、长期约束保持、安全权限绑定、混合信任运行时防御、隐私保护内存和持续保证方面的开放挑战。

cs.CL / 23 / 2605.07111

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

超越 LoRA 与全量微调：用于大语言模型适应的梯度引导优化器路由

Tang, Haozhan, Zhu, Xiuqi, Zhang, Xinyin, Li, Boxun, Smith, Virginia, Kuo, Kevin

Abstract

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within 1.5% of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to 20% on Fact and 9% on Med and SQL.

Chinese Translation

近期关于大语言模型微调的文献强调了一个基本的争论。虽然全量微调（Full Fine-Tuning, FFT）提供了高熵知识注入所需的表征可塑性，但低秩适应（Low-Rank Adaptation, LoRA）能够匹配或超越 FFT 的性能，因为许多任务只需要在低秩空间中进行更新，并且受益于 LoRA 的额外正则化。通过对多种任务（SQL、医学问答和反事实知识）和不同语言模型（Gemma-3-1B、Qwen2.5-1.5B 和 Qwen2.5-3B）进行实证评估，我们验证了这两种趋势，并证明仅依赖于任一静态架构在结构上是有限的。为了解决这一挑战，我们提出了一种混合 LoRA 和全量微调（Mixture of LoRA and Full Fine-Tuning, MoLF）的统一框架，使得在两种训练模式之间能够持续导航。MoLF 在优化器级别动态路由更新于 FFT 和 LoRA 之间，以确保在整个训练过程中两个专家都能获得精确的梯度信号，从而实现稳定的训练动态。对于内存受限的环境，我们还引入了 MoLF-Efficient，该方法冻结基础权重，仅在一对可能具有不同秩的 LoRA 专家之间路由更新。我们的评估表明，MoLF 在所有设置中要么提升了性能，要么在 FFT 和 LoRA 中表现更好的基础上保持在 1.5% 之内，而 MoLF-Efficient 在 Fact 上的表现比之前的自适应 LoRA 方法提高了多达 20%，在 Med 和 SQL 上提高了 9%。

cs.CL / 24 / 2605.07134

Region4Web: Rethinking Observation Space Granularity for Web Agents

Region4Web：重新思考网络代理的观察空间粒度

Kwon, Donguk, Lee, Dongha

Abstract

Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page's functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.

Chinese Translation

网络代理通过观察空间感知网页，但其粒度一直是一个未被充分研究的设计选择。现有研究将观察与动作空间的元素级粒度视为相同，导致页面的功能组织隐含，并迫使代理在每一步都从元素级信号中推断。我们认为观察应在功能区域的粒度上进行，功能区域是页面中各自服务于不同目的的部分。我们提出了Region4Web，一个通过层次分解和语义抽象将AXTree重组为功能区域的框架，从而将页面的功能组织作为理解页面状态的基础。此外，我们提出了PageDigest，一个特定于网络的推理管道，将这种区域级观察以紧凑的每页摘要形式传递给行为代理，并在多个步骤中保持一致。在WebArena基准测试中，PageDigest显著减少了观察长度，同时提高了在不同主干大型语言模型（LLMs）和已建立代理方法下的整体任务成功率，无论主干容量如何。这些结果表明，在功能区域的粒度上操作为行为代理提供了比单纯的元素级处理更紧凑和更具信息量的基础。

cs.CL / 25 / 2605.07139

Structural Rationale Distillation via Reasoning Space Compression

通过推理空间压缩进行结构化推理蒸馏

Yang, Jialin, Wang, Jiankun, Wu, Jiajun, Leung, Henry, Zhou, Jiayu, Drew, Steve

Abstract

When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.

Chinese Translation

在将大型语言模型（LLMs）的推理蒸馏到较小模型时，针对相似问题的教师推理往往在结构和策略上差异巨大。就像一位厨师每次制作同一道菜时都可能有所不同，这种不一致性给学生带来了难以内化的噪声监督。我们提出了通过推理路径压缩进行蒸馏（D-RPC），该方法限制教师遵循一个紧凑的、动态维护的可重用高层推理路径库。对于每个训练问题，D-RPC 检索最相关的路径，并指导教师遵循该路径，从而生成在相似问题之间一致但又足够多样化以涵盖不同问题类型的推理。PAC-Bayes 分析形式化了路径库大小与覆盖率之间的权衡：较小的路径库减少了监督熵，但可能存在覆盖空白，而泛化界限识别了一个经过我们消融实验确认的最佳中间大小。在五个数学和常识推理基准测试中，使用两个学生模型，D-RPC 始终优于链式推理蒸馏、自由形式推理生成、直接蒸馏和结构化监督基线，同时使用的标记数量少于模板重的替代方案。

cs.CL / 26 / 2605.07153

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

超越推理：强化学习解锁大型语言模型中的参数知识

Yang, Wanli, Zang, Hongyu, Zhang, Junwei, Shi, Wenjie, Su, Du, Wang, Jingang, Cheng, Xueqi, Sun, Fei

Abstract

Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.

Chinese Translation

强化学习（RL）在大型语言模型（LLM）的推理中取得了显著成功，但它是否也能改善对参数知识的直接回忆仍然是一个未解的问题。我们在一个受控的零-shot、一跳、闭卷问答设置中研究了这个问题，该设置不涉及思维链，仅基于二元正确性奖励进行训练，并应用事实级的训练-测试去重，以确保收益反映的是改进的回忆，而非推理或记忆。在三个模型系列和多个事实问答基准测试中，强化学习平均带来了约27%的相对增益，超越了训练和推理时间的基线。机制上，强化学习主要是在现有知识上重新分配概率质量，而不是获取新事实，将正确答案从低概率尾部转移到可靠的贪婪生成中。我们的数据归因研究表明，最难的例子是信息量最大的：那些答案在128个预强化学习样本中从未出现过的例子（仅占训练数据的约18%）驱动了约83%的增益，因为在训练过程中仍然会出现稀有的正确生成并得到强化。总的来说，这些发现拓宽了强化学习的角色，超越了推理，将其重新定位为解锁潜在参数知识的工具，而非获取新知识的手段。

cs.CL / 27 / 2605.07162

CLIPer: Tailoring Diverse User Preference via Classifier-Guided Inference-Time Personalization

CLIPer：通过分类器引导的推理时个性化定制多样化用户偏好

Su, Jinyan, Zhou, Jinpeng, Cardie, Claire, Sun, Wen

Abstract

Personalized LLMs can significantly enhance user experiences by tailoring responses to preferences such as helpfulness, conciseness, and humor. However, fine-tuning models to address all possible combinations of user preferences is computationally expensive and impractical. In this paper, we introduce CLIPer (Classifier-guided Inference-time Personalization), a lightweight personalization approach that leverages a classifier model to steer LLM generation dynamically to different user preferences at inference time. Our method eliminates the need for extensive fine-tuning, inducing negligible additional computational overhead while enabling more controllable and nuanced personalization across single and multi-dimensional preferences. Comprehensive empirical analyses demonstrate the scalability and effectiveness of our approach in delivering personalized language generation.
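
The abstract does not spell out how the classifier steers generation. One generic form of classifier-guided decoding, shown here purely as an assumption and not as CLIPer's actual mechanism, reweights each candidate's language-model log-probability by weighted classifier log-probabilities, one weight per preference dimension. All names and numbers below are hypothetical.

```python
import math
from typing import Dict


def guided_choice(lm_logprobs: Dict[str, float],
                  attr_logprobs: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> str:
    """Pick the candidate maximizing LM log-prob plus weighted classifier log-probs.

    lm_logprobs:   candidate -> log p_LM(candidate | context)
    attr_logprobs: attribute -> candidate -> log p_clf(attribute | candidate)
    weights:       attribute -> user-specific strength (0 disables the attribute)
    """
    def score(candidate: str) -> float:
        total = lm_logprobs[candidate]
        for attribute, weight in weights.items():
            total += weight * attr_logprobs[attribute][candidate]
        return total

    return max(lm_logprobs, key=score)


# Toy numbers: the LM slightly prefers a verbose continuation, while a
# hypothetical "concise" classifier strongly prefers the brief one.
lm = {"verbose": math.log(0.6), "brief": math.log(0.4)}
clf = {"concise": {"verbose": math.log(0.1), "brief": math.log(0.9)}}

print(guided_choice(lm, clf, weights={}))                # verbose
print(guided_choice(lm, clf, weights={"concise": 1.0}))  # brief
```

Because preferences enter only through the weights, several dimensions can be combined at inference time without retraining the base model, which is the property the abstract emphasizes.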

Chinese Translation

个性化的大型语言模型（LLMs）可以通过根据用户的偏好（如有用性、简洁性和幽默感）定制响应，从而显著提升用户体验。然而，微调模型以应对所有可能的用户偏好组合在计算上是昂贵且不切实际的。在本文中，我们介绍了CLIPer（Classifier-guided Inference-time Personalization），这是一种轻量级的个性化方法，利用分类器模型在推理时动态引导LLM生成以适应不同的用户偏好。我们的方法消除了广泛微调的需求，带来了微不足道的额外计算开销，同时在单一和多维偏好上实现了更可控和细致的个性化。全面的实证分析证明了我们的方法在提供个性化语言生成方面的可扩展性和有效性。

cs.CL / 28 / 2605.07164

Rethinking Experience Utilization in Self-Evolving Language Model Agents

重新思考自我演化语言模型代理中的经验利用

Zhao, Weixiang, Wang, Yingshuo, Zhang, Yichen, Zhao, Yanyan, Zhang, Yu, Wu, Yang, Tu, Dandan, Qin, Bing, Liu, Ting

Abstract

Self-evolving agents improve by accumulating and reusing experience from past interactions. Existing work has largely focused on how experience is constructed, represented, and updated, while paying less attention to how experience should be used during runtime decision-making. As a result, most agents rely on rigid usage strategies, either injecting experience once at initialization or at every step, without considering whether it is needed for the current decision. This paper studies experience utilization as a critical design dimension of self-evolving agents. We ask whether agents benefit from interweaving experience use with decision-making, so that experience is invoked only when additional guidance is needed. To examine this question, we introduce ExpWeaver, a lightweight instantiation that leaves experience construction unchanged and modifies only runtime utilization by exposing experience as an optional resource during reasoning. Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among different utilization strategies. Reinforcement learning experiments further show that this behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses reveal that ExpWeaver enables agents to invoke experience selectively, at beneficial decision points, and under higher reasoning uncertainty. Overall, our findings call for a shift from merely studying what experience to store toward understanding how and when experience should enter decision-making.

Chinese Translation

自我演化代理通过积累和重用来自过去交互的经验来提升性能。现有研究主要集中在经验的构建、表示和更新上，而对经验在运行时决策中的使用关注较少。因此，大多数代理依赖于僵化的使用策略，要么在初始化时一次性注入经验，要么在每一步都注入，而不考虑当前决策是否需要这些经验。本文将经验利用作为自我演化代理的一个关键设计维度进行研究。我们探讨代理是否能通过将经验使用与决策过程交织在一起，从而在需要额外指导时才调用经验。为了解决这个问题，我们引入了 ExpWeaver，一种轻量级的实现，它保持经验构建不变，仅通过在推理过程中将经验作为可选资源暴露来修改运行时利用。在四个代表性框架、七个 LLM 主干和三种类型的环境中，ExpWeaver 在不同的利用策略中始终实现最佳性能。强化学习实验进一步表明，这种行为可以通过训练得到增强。使用模式、因果消融和基于熵的分析表明，ExpWeaver 使代理能够在有利的决策点和更高的推理不确定性下选择性地调用经验。总体而言，我们的研究结果呼吁从单纯研究应存储“什么”经验转向理解经验应“如何”以及“何时”进入决策。

cs.CL / 29 / 2605.07170

A Reproducible Multi-Architecture Baseline for Token-Level Chinese Metaphor Identification under the MIPVU Framework

基于MIPVU框架的可复现多架构基线：中文隐喻识别的词元级研究

Wu, Yufeng

Abstract

Metaphor is pervasive in everyday language, yet token-level computational identification of metaphor-related words in Chinese under the MIPVU framework remains under-explored relative to English. This paper presents a reproducible multi-architecture baseline for token-level metaphor identification on the PSU Chinese Metaphor Corpus (PSU CMC), the only widely available MIPVU-annotated Chinese corpus. We systematically compare three model families: (i) encoder fine-tuning with Chinese RoBERTa-wwm-ext-large; (ii) MelBERT adapted to Chinese using a newly constructed basic-meaning resource derived from the Modern Chinese Dictionary, 7th edition (MCD7), comprising 74,823 entries with 71.51% PSU CMC vocabulary coverage; and (iii) Qwen3.5-9B fine-tuned with QLoRA as an instruction-tuned generative baseline. Across five fixed seeds, MelBERT MIP-only achieves the strongest performance at 0.7281 +/- 0.0050 test positive F1, marginally above MelBERT Full (0.7270 +/- 0.0069) and clearly above plain RoBERTa (0.7142 +/- 0.0121). The Qwen QLoRA generative configuration trails encoder baselines by approximately 11 F1 points (0.6157 +/- 0.0113). Three findings merit attention: (1) the SPV channel of MelBERT does not contribute reliable positive signal in Chinese, consistent with the dominance of conventional metaphor; (2) the Qwen-encoder gap is concentrated in recall, reflecting the discrete-commitment limitation of generative output; (3) several Qwen task formulations fail due to format design rather than model capacity. We release all split manifests, per-seed outputs, the MCD7 basic-meaning embedding pipeline, and training scripts to serve as a common reference for future Chinese metaphor identification research.

Chinese Translation

隐喻在日常语言中无处不在，但在MIPVU框架下对中文隐喻相关词汇的词元级计算识别相较于英语仍然未得到充分探索。本文提出了一种可复现的多架构基线，用于在PSU中文隐喻语料库（PSU CMC）上进行词元级隐喻识别，该语料库是唯一广泛可用的MIPVU标注中文语料库。我们系统地比较了三种模型系列：（i）使用中文RoBERTa-wwm-ext-large进行编码器微调；（ii）MelBERT通过新构建的基本意义资源适应中文，该资源来源于《现代汉语词典》第七版（MCD7），包含74,823个条目，覆盖71.51%的PSU CMC词汇；（iii）使用QLoRA微调的Qwen3.5-9B作为指令调优的生成基线。在五个固定种子下，MelBERT MIP-only在测试正F1上达到了0.7281 +/- 0.0050的最佳表现，略高于MelBERT Full（0.7270 +/- 0.0069），明显高于普通RoBERTa（0.7142 +/- 0.0121）。Qwen QLoRA生成配置的表现比编码器基线低约11个F1点（0.6157 +/- 0.0113）。有三个发现值得关注：（1）MelBERT的SPV通道在中文中未能提供可靠的正信号，这与传统隐喻的主导地位一致；（2）Qwen与编码器之间的差距主要集中在召回率上，反映了生成输出的离散承诺限制；（3）一些Qwen任务的设计因格式设计而非模型能力而失败。我们发布了所有分割清单、每个种子的输出、MCD7基本意义嵌入管道和训练脚本，以作为未来中文隐喻识别研究的共同参考。

cs.CL / 30 / 2605.07172

Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

基于拓扑增强的大型语言模型对齐：轨迹拓扑损失与拓扑偏好优化

Pan, Yurui, Xu, Ke, Peng, Bo

Abstract

Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract "prompt-answer bridges." TTL aligns the model's actual update direction with these topological bridges rather than arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and aligns the improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We also introduce a dynamic weighting scheme to balance DPO and TPO losses. Evaluating on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity scores. Results show persistent homology and trajectory geometry offer a promising direction for controllable alignment.
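
For a point cloud under a Vietoris-Rips filtration, 0-dimensional persistent homology can be computed with Kruskal's minimum-spanning-tree procedure: every edge that merges two components marks the death of one connected component at that edge's length (up to a convention-dependent factor of 2). The sketch below uses that fact and treats merging edges joining a prompt point to an answer point as "bridges". That reading of "prompt-answer bridges" is an assumption based on the abstract alone, and the 2D points stand in for real embeddings.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def zero_dim_bridges(prompts: List[Point],
                     answers: List[Point]) -> List[Tuple[int, int, float]]:
    """Return merging (MST) edges that connect a prompt point to an answer point.

    Indices refer to the concatenated cloud: prompts first, then answers.
    """
    points = list(prompts) + list(answers)
    n_prompts = len(prompts)
    parent = list(range(len(points)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges in order of increasing length (the filtration order).
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))

    bridges = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue  # edge closes a cycle: no 0-dimensional class dies here
        parent[root_i] = root_j  # two components merge at this scale
        if (i < n_prompts) != (j < n_prompts):
            bridges.append((i, j, length))
    return bridges


prompts = [(0.0, 0.0), (1.0, 0.0)]
answers = [(0.0, 3.0), (5.0, 0.0)]
print(zero_dim_bridges(prompts, answers))  # [(0, 2, 3.0), (1, 3, 4.0)]
```

In a training setup, the displacement vectors along such bridges could serve as target directions for a regularizer; how TTL actually uses them is not specified in the abstract.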

Chinese Translation

通过SFT和RLHF/DPO对大型语言模型（LLMs）进行对齐通常忽略了表示空间的全局几何，反而依赖于局部的标记似然性或标量分数。我们将生成视为在隐空间中追踪语义轨迹，并提出一种拓扑增强的对齐框架，通过0维持久同调来规范这些轨迹。首先，对于SFT，我们引入了轨迹拓扑损失（Trajectory Topology Loss, TTL）。将提示和黄金答案嵌入视为混合点云，我们使用0D持久同调算法提取“提示-答案桥”。TTL将模型的实际更新方向与这些拓扑桥对齐，而不是任意方向。其次，对于DPO，我们提出拓扑偏好优化（Topological Preference Optimization, TPO）。TPO构建主题特定的语义偏好向量，并在中间隐层中将拒绝和选择响应之间的改进方向与这些向量对齐。我们还引入了一种动态加权方案，以平衡DPO和TPO损失。在使用UltraChat和Anthropic HH-RLHF对Qwen2.5-7B-Instruct进行评估时，我们的拓扑增强目标在自动偏好指标和LLM评估中始终优于强大的非拓扑基线（例如，逐例、最近邻、随机正则化器），同时保持或改善毒性。结果表明，持久同调和轨迹几何为可控对齐提供了一个有前景的方向。
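
As a rough illustration of the 0-dimensional persistent homology step mentioned in the abstract above, the sketch below runs Kruskal-style single-linkage merging over a small point cloud. Each merge kills one connected component, and the merge distance is that component's death time. Flagging merges whose edge joins a prompt point to an answer point as "bridges" is only an assumed reading of the abstract. The function name `zero_dim_persistence` and the toy 2-D points are hypothetical and are not the authors' code.

```python
import math
from itertools import combinations


def zero_dim_persistence(points, labels):
    """Return (death_time, i, j, is_bridge) for each 0D merge event.

    points: list of coordinate tuples; labels: "prompt" or "answer" per point.
    is_bridge marks merges whose edge joins a prompt point to an answer point
    (an assumed interpretation of "prompt-answer bridges").
    """
    parent = list(range(len(points)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges, sorted by Euclidean length (the filtration order).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    events = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components, so one of them dies
            parent[ri] = rj
            events.append((dist, i, j, labels[i] != labels[j]))
    return events


# Toy point cloud: two prompt points close together, one answer point farther away.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
labels = ["prompt", "prompt", "answer"]
events = zero_dim_persistence(points, labels)
```

The merge events coincide with the edges of a minimum spanning tree, which is why 0D persistence is cheap to compute even for larger embedding clouds.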


cs.CL / 31 / 2605.07180

Learning Agent Routing From Early Experience

从早期经验中学习代理路由

Wang, Yimin, Qiu, Jiahao, Qi, Xuan, Juan, Xinzhe, Shi, Jingzhe, Zhao, Zelin, Wang, Hongru, Liu, Shilong, Wang, Mengdi

Abstract

LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.

Chinese Translation

大型语言模型（LLM）代理在复杂推理任务中表现出色，但会产生高延迟和计算成本。在实际应用中，许多查询在前沿LLM的能力范围内，不需要完全执行代理，这使得在LLM和代理之间进行有效路由成为一个关键挑战。我们研究了在现实冷启动设置下，轻量级LLM推理与完全代理执行之间路由查询的问题。为了解决这个问题，我们提出了BoundaryRouter，一个无训练的路由框架，利用早期行为经验和标准指导推理来决定是通过直接LLM推理回答查询，还是升级到代理。BoundaryRouter通过在共享种子集上执行这两个系统来构建紧凑的经验记忆，并在推理时检索相似案例以指导路由决策。为了评估该方法，我们引入了RouteBench，一个涵盖领域内、改写和领域外路由设置的基准测试。实验表明，与代理相比，BoundaryRouter将推理时间减少了60.6%，同时在直接LLM推理的基础上提高了28.6%的性能，分别比基于提示和仅检索的路由平均超出37.9%和8.2%。


cs.CL / 32 / 2605.07186

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

文本的恐怖谷：大型语言模型信息检索中的非单调性能退化

Tong, Zekai, Xu, Ruiyao, Shrivastava, Aryan, Tan, Chenhao, Holtzman, Ari

Abstract

Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects LLMs' ability to detect targeted information. When whitespace characters are inserted within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve as the insertion rate increases. We refer to this curve as the Text Uncanny Valley. To explain this observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.

Chinese Translation

现有的大型语言模型（LLM）基准主要关注语法正确的输入，导致对不完美文本的评估存在显著空白。在本研究中，我们探讨了词边界损坏如何影响LLM检测目标信息的能力。通过在单词内部插入空格字符将其分割成碎片，LLM的检测准确率随着插入率的增加呈现U型曲线。我们将这一曲线称为文本的恐怖谷。为了解释这一观察结果，我们提出了一种模式转变假设：LLM在接近正常文本时以词级模式运行，而在高度碎片化文本时以字符级模式运行，谷底标志着两种模式都无效的无序过渡。四个实验和一个分析与这一解释一致：上下文学习未能挽救谷底性能；对扰动进行正则化显著减少了U型曲线；数学推理任务在Gemini 3.0 Flash中复制了U型曲线，但在更强模型中未能复制，表明当任务对精确词汇对齐的依赖减少时，效果减弱；而标记化熵在F1最小值之前达到峰值，这与状态冲突解释一致。这些发现揭示了一种在干净文本基准下不可见的失败模式，但与任何涉及嘈杂或未经整理文本输入的部署场景直接相关。
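
The word-boundary corruption described in the abstract above can be sketched as a small perturbation function. This is a minimal sketch that assumes each within-word character boundary receives a space independently with probability `rate`. The paper's exact sampling procedure is not given in the abstract, and `fragment_words` and the example sentence are hypothetical.

```python
import random


def fragment_words(text, rate, seed=0):
    """Insert a space at each within-word character boundary with probability `rate`."""
    rng = random.Random(seed)
    out_words = []
    for word in text.split(" "):
        pieces = [word[:1]]
        for ch in word[1:]:
            if rng.random() < rate:
                pieces.append(" ")  # break the word at this boundary
            pieces.append(ch)
        out_words.append("".join(pieces))
    return " ".join(out_words)


clean = "retrieve the hidden passphrase"
for rate in (0.0, 0.3, 1.0):
    print(rate, repr(fragment_words(clean, rate)))
```

Sweeping `rate` from 0 to 1 and scoring a detection task at each setting is the kind of experiment that would trace out the reported U-shaped curve.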


cs.CL / 33 / 2605.07201

PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat

PSK@EEUCA 2026：利用合成数据增强微调大型语言模型以进行游戏聊天中的多类别毒性检测

Pulipaka, Srikar Kashyap

Abstract

This paper describes our system for the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. The task involves classifying World of Tanks chat messages into six toxicity categories: Non-toxic, Insults/Flaming, Other Offensive, Hate/Harassment, Threats, and Extremism. We explore multiple approaches including encoder-based models, instruction-tuned LLMs with LoRA fine-tuning, hierarchical classification, one-vs-rest strategies, and various ensemble methods. Our best system combines Llama 3.1 8B with carefully calibrated 5% synthetic data augmentation, achieving an F1-macro score of 0.6234 on the test set, placing 4th out of 35 participating teams. We provide extensive analysis of the dataset's annotation patterns and their impact on model generalization, revealing a critical "validation trap" phenomenon where high validation performance correlates with poor test transfer.

Chinese Translation

本文描述了我们在 EEUCA 2026 共享任务中针对游戏社区毒性行为理解的系统。该任务涉及将《坦克世界》的聊天消息分类为六个毒性类别：非毒性、侮辱/挑衅、其他冒犯、仇恨/骚扰、威胁和极端主义。我们探索了多种方法，包括基于编码器的模型、经过 LoRA 微调的指令调优大型语言模型（LLMs）、层次分类、一对多策略以及各种集成方法。我们的最佳系统结合了 Llama 3.1 8B 和经过精心校准的 5% 合成数据增强，在测试集上达到了 0.6234 的 F1-macro 分数，排名 35 支参与团队中的第 4 位。我们对数据集的标注模式及其对模型泛化的影响进行了广泛分析，揭示了一个关键的“验证陷阱”现象，即高验证性能与较差的测试迁移相关。


cs.CL / 34 / 2605.07209

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

通过开放权重代理分析器的激活进行幻觉检测

Singh, Akshita, Paudel, Prabesh, Roy, Siddhartha

Abstract

We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token-level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5-7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5-0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen-fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM-AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.

Chinese Translation

我们提出了一种代理分析器框架，用于检测大型语言模型中的幻觉。我们的系统并不直接查看生成模型内部，而是通过一个小型本地托管的开放权重模型读取已生成的文本，并利用阅读者自身的内部激活来识别幻觉。当生成器是像GPT-4这样的封闭API时，这种方法同样有效，也适用于任何开放权重模型。我们构建了十八个特征，基于变换器处理文本的方式，涵盖了残差流规范、每头源文档注意力、熵、MLP激活、logit-lens轨迹以及三个新的令牌级基础统计。我们在来自五个幻觉数据集的72,135个样本上训练了一个堆叠集成模型。我们在七种分析器架构上进行了测试，参数数量从5亿到90亿不等：Qwen2.5的0.5B和7B，Gemma-2的2B和9B，Pythia的1.4B，以及LLaMA-3的3B和8B。在所有七种模型中，我们在RAGTruth上始终超越ReDeEP的令牌级AUC 0.73，提升幅度为7.4到10.3个百分点。Qwen2.5-7B达到了0.717的F1分数，略高于ReDeEP的0.713，而Qwen2.5-0.5B则达到了0.706。最显著的发现是所有七个模型的聚类非常紧密：在模型大小相差十八倍的情况下，AUC仅相差2.3个百分点。更令人惊讶的是，我们的3B LLaMA在RAGTruth上的表现优于我们的8B LLaMA，这表明即使在同一模型系列中，规模并不总是越大越好。RAGTruth和LLM-AggreFact都包含来自多个LLM系列的输出，因此我们的结果并未偏向任何特定的生成器。


cs.CL / 35 / 2605.07234

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

重新表述长上下文LLM推理中的KV缓存驱逐问题

Mai, Tho, Kim, Joo-Young

Abstract

Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5% of the KV cache and consistently outperforms prior works across all configurations. Notably, our method achieves up to a 2x reduction in accuracy loss under extreme compression scenarios compared to existing state-of-the-art baselines, with minimal overhead.

Chinese Translation

大型语言模型（LLMs）支持长上下文推理，但由于键值（KV）缓存的增长，面临着显著的内存和运行时开销。现有的KV缓存驱逐方法主要依赖于局部注意力权重，忽视了值表示、输出投影和头部间交互的影响。在本研究中，我们将KV缓存驱逐从传统的头部权重平均方法重新表述为一个输出感知的层级矩阵乘法近似问题。我们引入了LaProx，一种新颖的驱逐策略，明确建模注意力图与投影值状态之间的乘法交互，以准确量化令牌贡献，同时考虑头部间的依赖关系。在此度量的基础上，我们提出了首个统一的驱逐策略，为令牌分配全球可比的重要性分数，使得模型能够进行全局选择，而不是局部的头部决策。在长上下文基准LongBench和Needle-In-A-Haystack上对19个数据集的实验结果表明，我们的方法在仅使用5%的KV缓存的情况下保持了模型性能，并在所有配置中始终优于先前的工作。值得注意的是，与现有的最先进基线相比，我们的方法在极端压缩场景下实现了高达2倍的准确性损失减少，并且开销最小。


cs.CL / 36 / 2605.07237

Teaching Language Models to Think in Code

教语言模型用代码思考

Hwang, Hyeon, Lee, Jiwoo, Kang, Jaewoo

Abstract

Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.

Chinese Translation

工具集成推理（Tool-integrated reasoning, TIR）已成为语言模型中数学问题解决的主导范式，它将自然语言（Natural Language, NL）推理与代码执行相结合。然而，这种交错的设置存在三个主要限制：代码通常充当事后验证者，中间的 NL 计算容易出错，NL 和代码的角色重叠而不是明确区分。我们提出了 ThinC（Thinking in Code），一个框架，其中代码本身作为推理者，而不是被 NL 调用的工具。ThinC 的轨迹始于简短的 NL 规划步骤，之后所有推理通过仅由其执行输出连接的代码块展开。我们从教师模型中提取了 12.2k 个以代码为中心的轨迹，并通过监督微调和后续的强化学习训练了 ThinC-1.7B 和 ThinC-4B。ThinC-4B 在五个竞争级数学基准测试中始终优于所有 TIR 基线，甚至超越了更大的 Qwen3-235B-A22B-Thinking。进一步分析表明，ThinC 通过代码进行推理：99.2% 的最终答案基于解释器输出，并且模型在代码执行失败时能够可靠地恢复，而无需中间 NL 推理。我们的代码和模型将很快发布。


cs.CL / 37 / 2605.07243

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

SpecBlock：具有动态树草拟的块迭代推测解码

Shi, Weijie, Xu, Qiang, Deng, Fan, Wu, Yaguang, Liu, Jiarun, Xu, Yehong, Chen, Hao, Zhu, Jia, Xu, Jiajie, Huang, Xiangjun, Yang, Jian, Zhou, Xiaofang

Abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.

Chinese Translation

推测解码通过草拟候选续写的树并在一个目标前向中验证它，从而加速大规模语言模型（LLM）的推理。现有的草拟器分为两类，各有相反的弱点。自回归草拟器如 EAGLE-3 在每个草拟路径上保留依赖关系，但每个树深度调用草拟器一次，使得草拟在每次迭代的延迟中占据了非平凡的比例。并行草拟器通过在一次前向中预测多个未来位置来减少草拟器调用，但每个位置的预测是在未看到其他位置的情况下进行的，导致验证器拒绝某些路径。本文提出了 SpecBlock，一种块迭代草拟器，结合了路径依赖性和低成本草拟。每次草拟器前向生成 K 个依赖位置，我们称之为一个块。草拟树通过重复块扩展而增长。两个机制明确地传递路径依赖性，以保持后续草拟位置的准确性。在每个块内，逐层移动将前一个位置的隐状态传递到每个解码器层。在块之间，每个新块可以从前一个块的任何位置开始，继承其隐状态以扩展路径。为了在接受可能性较高的地方花费验证器预算，一个共同训练的排名头在草拟期间通过为每个位置分配分支来替代固定的 top-k 树。为了避免在推理中训练草拟器在其从未生成的前缀上，合法前缀掩码在早期位置错误后丢弃后续位置的损失。除了静态草拟外，部署时的成本感知强盗使用免费的验证器反馈选择性地更新草拟器，仅在预期吞吐量增益超过更新成本时进行。实验表明，SpecBlock 在 EAGLE-3 的草拟成本的 44-52% 下，平均加速提高了 8-13%，而成本感知适应将这一优势扩大到 11-19%。
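
The valid-prefix mask described in the abstract above (dropping the loss at later positions once an earlier one is wrong) can be sketched in a few lines. This is a minimal sketch of one plausible reading: the first wrong position still contributes to the loss, and every position after it is dropped. The function names and the averaging choice are hypothetical and are not taken from the SpecBlock code.

```python
def valid_prefix_mask(correct):
    """correct[k] is True if the drafter's prediction at block position k matches the target.

    Returns 1.0 where the loss is kept and 0.0 where it is dropped. Position k is kept
    only if every earlier position in the block was predicted correctly.
    """
    mask = []
    prefix_ok = True
    for is_correct in correct:
        mask.append(1.0 if prefix_ok else 0.0)
        prefix_ok = prefix_ok and is_correct
    return mask


def masked_mean_loss(losses, correct):
    """Average per-position losses over the positions the mask keeps (non-empty blocks only)."""
    mask = valid_prefix_mask(correct)
    kept = sum(mask)  # at least 1.0, because position 0 is always kept
    return sum(loss * m for loss, m in zip(losses, mask)) / kept
```

For a block whose second position is wrong, `valid_prefix_mask([True, False, True, True])` keeps the first two positions and drops the last two, so the drafter is not trained on continuations of a prefix it would never reach at inference.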


cs.CL / 38 / 2605.07248

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

PaT：试验后规划以实现高效的测试时间代码生成

Yoon, Youngsik, Lee, Sungjae, Song, Seockbean, Wang, Siwei, Chen, Wei, Ok, Jungseul

Abstract

Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.

Chinese Translation

除了训练时间优化之外，扩展测试时间计算已成为提升大型语言模型（Large Language Models, LLMs）推理能力的关键范式。然而，大多数现有方法采用僵化的试验前规划（Planning-before-Trial, PbT）策略，这种策略在直接可解的问题上也会因规划开销而低效地分配测试时间计算。我们提出了试验后规划（Planning-after-Trial, PaT），这是一种自适应的代码生成策略，仅在验证失败时调用规划器。这种自适应策略自然支持异构模型配置：一个成本效益高的模型处理生成尝试，而一个强大的模型则用于针对性的规划干预。通过多个基准和模型系列的实证研究，我们的方法显著推动了成本-性能帕累托前沿。值得注意的是，我们的异构配置在性能上可与大型同质模型相媲美，同时将推理成本降低了约69%。
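
The control flow of the Planning-after-Trial policy described above can be sketched as a short loop: try direct generation first, and call the planner only when verification fails. This is a minimal sketch under stated assumptions. The callables `generate`, `verify`, and `plan`, the `max_rounds` budget, and the toy stand-ins below are all hypothetical, and the paper's actual prompts and models are not shown in the abstract.

```python
def planning_after_trial(problem, generate, verify, plan, max_rounds=3):
    """Try direct generation first; call the planner only after a failed verification.

    generate(problem, plan_text) -> candidate code (the cheap model in the abstract's setup)
    verify(candidate) -> bool (for example, running tests)
    plan(problem, failed_candidate) -> plan text (the stronger model)
    Returns (last candidate, number of planner calls).
    """
    plan_text = None
    planner_calls = 0
    candidate = None
    for attempt in range(max_rounds):
        candidate = generate(problem, plan_text)
        if verify(candidate):
            break
        if attempt + 1 < max_rounds:
            # Verification failed and budget remains: only now pay for planning.
            plan_text = plan(problem, candidate)
            planner_calls += 1
    return candidate, planner_calls


# Toy stand-ins: the "easy" problem passes on the first try, the "hard" one needs a plan.
def generate(problem, plan_text):
    return "ok" if problem == "easy" or plan_text else "buggy"


def verify(candidate):
    return candidate == "ok"


def plan(problem, failed_candidate):
    return "split into helper functions"


easy_result = planning_after_trial("easy", generate, verify, plan)
hard_result = planning_after_trial("hard", generate, verify, plan)
```

In this toy run the easy problem incurs no planner call at all, which is the saving that a Planning-before-Trial policy would forgo.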


cs.CL / 39 / 2605.07268

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

从零阶选择到二阶判断：组合硬化揭示前沿大型语言模型中的组成失败

Liu, Hanmeng, Weng, Shichao, Liu, Xiulai, Zhang, Zhicai, Yan, Anli, Liu, Xiaozhang

Abstract

Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short of challenging advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and the number of reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from multi-select failure and early-exit bias, which are not shared by human test-takers. Zero-shot transfer to MMLU demonstrates a 47-percentage-point accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.

Chinese Translation

多项选择推理基准面临双重挑战：随着模型的快速进步，评估迅速饱和，以及数据污染削弱了静态评估的有效性。临时硬化方法（如改写、扰动）试图增加难度，但为表面复杂性牺牲了逻辑有效性，未能有效挑战先进的推理模型。我们提出了LogiHard，这是一个正式框架，确定性地将零阶选择转化为二阶逻辑判断，显著增加了思维负担和推理步骤。该框架整合了项目反应理论（Item Response Theory, IRT）用于计算机自适应测试（Computerized Adaptive Testing, CAT），使得在比静态基准更少的问题中实现精确的难度控制。我们实例化了LogiHard-2k，这是一个通过对模型思维轨迹进行9维分析对高风险考试问题进行认知排名后构建的逻辑推理数据集，随后对高难度项目进行组合转化。对十二个最先进模型的评估显示，在组合硬化问题上的准确率下降范围为31%至56%。大型语言模型（LLMs）遭受多选失败和早期退出偏见，而这些问题并不出现在人类考生中。对MMLU的零-shot迁移显示出47%的准确率下降（从89.84%降至42.86%），确认了在可证明的有效性保持下跨领域的适用性。持续的整体退化是领域无关的，源于组合推理的缺口，而非知识缺陷，反映了训练引起的完整性验证缺失。
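
The abstract above mentions Item Response Theory for computerized adaptive testing without naming a specific model. As a hedged illustration only, the sketch below uses the standard two-parameter logistic (2PL) model and picks the next question by maximum Fisher information. The choice of 2PL, the function names, and the item bank values are assumptions, not details from LogiHard.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta, items, asked):
    """Choose the unasked item that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


# Hypothetical item bank: (discrimination a, difficulty b) pairs.
items = [(1.0, -1.0), (1.5, 0.0), (1.0, 2.0)]
next_item = pick_next_item(theta=0.1, items=items, asked=set())
```

Selecting items near the current ability estimate is what lets an adaptive test reach a given measurement precision with fewer questions than a fixed benchmark.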


cs.CL / 40 / 2605.07269

MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

MIPIAD：基于Qwen的多语言间接提示注入攻击防御框架——TF-IDF混合与元集成学习

Muhtadi, Al Muhit, Tazwar, Mostafa Rifat

Abstract

Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA (Yi et al., 2023) templates spanning five task families -- email, table, QA, abstract, and code -- comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID's multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla.

Chinese Translation

间接提示注入在增强检索和工具使用的大型语言模型（LLM）系统中仍然是一个持续的弱点，而在多语言环境中，该问题的特征化变得更加困难。我们提出了MIPIAD，一个在英语和孟加拉语上评估的防御框架，该框架结合了通过LoRA（XLPID）微调的序列分类器、TF-IDF词汇特征以及通过后融合、堆叠和梯度提升进行的验证调优集成。该框架在一个基于BIPIA（Yi et al., 2023）模板构建的合成基准上进行评估，涵盖了五个任务类别——电子邮件、表格、问答、摘要和代码，生成样本超过143万，训练和测试集使用互斥的攻击类别进行划分。在实验中，词汇信号表现出强大的效果（TF-IDF+SVM F1=0.77），而混合的XLPID+TF-IDF集成达到了最佳的整体F1（0.9205），而提升集成则实现了最佳的AUROC（0.9378）。集成方法在相对于独立神经模型时，持续缩小了英语与孟加拉语之间的跨语言差距。该管道设计为可扩展性：NLLB-200支持超过200种语言，XLPID的多语言骨干可以在不改变架构的情况下重新针对其他语言；目前的实证验证仅限于英语和孟加拉语。


cs.CL / 41 / 2605.07271

Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

通过决策表示转变理解层修剪大型语言模型中的性能崩溃

Shi, Boyu, Liu, Chang, Gao, ChuanBao, Yang, Xu, Geng, Xin

Abstract

Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.

Chinese Translation

层修剪有效地降低了大型语言模型（LLM）的计算成本，但常常会引发突发的性能崩溃。现有的基于表示的分析难以解释这一机制。我们建议通过决策表示来研究修剪。针对多选任务，我们引入了两个指标：决策边际（Decision Margin）和选项频率（Option Frequency），以及一种迭代修剪方法来分析逐层决策动态。我们的研究发现了一个明显的决策转变，将网络划分为两个阶段：沉默阶段（Silent Phase），在此阶段模型尚无法预测正确答案；以及决定性阶段（Decisive Phase），在此阶段正确预测开始出现。我们还发现，修剪决定性阶段的影响最小，而修剪沉默阶段则会立即触发性能崩溃，突显其对结构变化的极端敏感性。因此，我们得出结论，修剪引发的崩溃源于对沉默阶段的干扰，这阻碍了关键决策转变的发生。


cs.CL / 42 / 2605.07305

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

MedAction：迈向主动多轮临床诊断大型语言模型

Hsu, Hsin-Ling, Wang, Zizheng, Zhang, Donghua, Chen, Nai-Chia, Wang, Jerry, Ding, Jun-En, Hsu, Chia-Hsuan, Wang, Guoan, Liu, Feng, Hung, Fang-Ming, Wu, Chenwei, Shen, Liyue

Abstract

Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model's hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.

Chinese Translation

目前大多数现有的大型语言模型（LLM）诊断是在静态的单轮设置下进行评估的，其中完整的患者信息被提前提供，这过于简化了真实的临床实践。我们研究了主动诊断：这一真实的临床过程从初步观察开始，进行检查，解读结果，并在多个回合中更新鉴别诊断。通过系统分析，我们识别出当前LLM中的三种反复出现的失败模式：无依据的检查订购、不可靠的诊断更新和退化的多轮连贯性。这些失败共同揭示了一个核心缺陷：现有的医学训练数据教会模型从完整信息中推理，但未能使其在不断变化的部分证据下采取行动。为了解决这一问题，我们引入了MedAction，一种树状结构的蒸馏管道，通过LLM与环境的互动合成多样且高质量的多轮诊断轨迹。我们提出了两个基于知识图谱的度量标准来过滤轨迹质量：疾病轨迹一致性（Disease Trajectory Consistency, DTC），用于跟踪模型的假设是否趋向于正确诊断，以及推理-行动一致性（Reasoning-Action Consistency, RAC），用于验证信念更新是否由收集到的证据驱动。利用该管道，我们构建了MedAction-32K，一个包含32,681条来自2,896个PMC案例的轨迹的数据集。在MedAction-32K上对一个8B模型进行微调，在MedR-Bench和我们策划的MedAction-300-Hard基准测试中，达到了开源模型中的最新性能，推动了开源医学LLM的前沿。


cs.CL / 43 / 2605.07307

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

重新思考密集序列链：推理语言模型如何从稀疏、顺序打乱的思维链中提取答案

Chen, Yi-Chang, Liao, Feng-Ting, Shiu, Da-shan, Lee, Hung-yi

Abstract

Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, and noise injection--applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No--line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%-89% accuracy; only token-level shuffling collapses to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating order-independence originates from pretraining rather than reasoning-specific fine-tuning. Dense: Is all the information in a reasoning chain important for answer extraction? No--masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffling and non-dense still robust? Yes--the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x true-answer frequency leaves accuracy unchanged (83.3%->83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.

Chinese Translation

现代推理语言模型生成的密集序列思维链隐含地假设每个标记都有贡献，并且步骤必须按顺序进行。我们通过一个系统的干预流程——去除、掩蔽、打乱和噪声注入——对三种模型和三个基准测试中模型生成的推理链提出挑战。我们的发现从三个维度上是反直觉的。顺序：推理链的顺序对答案提取是否重要？不重要——行级打乱导致准确率降低不到0.5个百分点；词级打乱保持62%-89%的准确率；只有标记级打乱的准确率接近于零。仅预训练和指令微调的变体表现出近乎相同的容忍度（行打乱下为78.67%对比78.00%），表明顺序无关性源于预训练而非推理特定的微调。密集性：推理链中的所有信息对答案提取是否重要？不重要——掩蔽数字导致准确率降至0%，而掩蔽字母文本则提高准确率4.7个百分点。鲁棒性：一个既顺序打乱又非密集的推理链是否仍然鲁棒？是的——最激进的简化表示（所有自然语言移除，行任意打乱）仍然达到83%的准确率，且以3倍真实答案频率注入错误答案不会改变准确率（83.3%->83.3%），否定了基于频率的提取解释。这些结果表明，答案提取是在一个稀疏、顺序无关且结构鲁棒的信息基础上进行的，为并行化和标记高效的推理生成开辟了新的路径。
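
The interventions named in the abstract above (line-level shuffling, word-level shuffling, digit masking, and masking of alphabetic prose) can be sketched on a toy reasoning chain. This is a minimal sketch: the function names, mask characters, and example chain are hypothetical, and the paper's actual pipeline, including its token-level shuffling and noise injection, is not reproduced here.

```python
import random
import re


def shuffle_lines(chain, seed=0):
    """Line-level shuffle: permute reasoning steps, keep each line intact."""
    lines = chain.split("\n")
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)


def shuffle_words(chain, seed=0):
    """Word-level shuffle: permute whitespace-separated words across the whole chain."""
    words = chain.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def mask_digits(chain, mask="#"):
    """Replace every numeric digit with a mask character."""
    return re.sub(r"\d", mask, chain)


def mask_letters(chain, mask="_"):
    """Replace every ASCII letter with a mask character, leaving digits and symbols."""
    return re.sub(r"[A-Za-z]", mask, chain)


chain = "Add 12 and 30 to get 42.\nDouble 42 to get 84.\nThe answer is 84."
print(shuffle_lines(chain))
print(mask_digits(chain))
print(mask_letters(chain))
```

Feeding each perturbed chain back to a model and asking only for the final answer is the kind of measurement the reported accuracy figures describe.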


cs.CL / 44 / 2605.07315

LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

LaTER：通过潜在探索和显式验证实现高效的测试时推理

Li, Xuan, Wang, Yining, Liu, Yuchen, Liu, Guanjun, Qiu, Delai, Liu, Shengping, Liang, Jiaen, Huang, Wei, Yu, Jun, Zhu, Junnan

Abstract

Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.

Chinese Translation

链式思维（CoT）推理提升了大型语言模型（LLMs）在困难任务上的表现，但也使推理变得昂贵，因为每个中间步骤都必须作为离散标记生成。潜在推理通过传播连续状态减少了可见标记的生成，然而将显式推导替换为潜在计算可能会损害需要符号检查的任务。我们提出了潜在-然后-显式推理（LaTER），一种两阶段范式，首先在连续潜在空间中进行有限探索，然后切换到显式的链式思维进行验证和答案生成。在无训练的实例中，LaTER将最终层的隐藏状态投影回输入嵌入空间，保留潜在的KV缓存，并使用熵和模型原生的停止标记探针来决定何时切换。我们发现，强大的推理模型在此接口下已经表现出结构化的潜在轨迹。在Qwen3-14B上，无训练的LaTER在多个基准测试中将总标记使用量减少了16%-32%，同时在大多数基准上保持或提高了准确性；例如，它将AIME 2025的准确率从70.0%提高到73.3%，同时将标记数量从15,730减少到10,661。我们进一步构建了Latent-Switch-69K，一个监督语料库，将浓缩的解决直觉与简化的显式推导配对。通过潜在回滚和停止监督进行微调可获得额外收益：训练后的LaTER在AIME 2025上达到80.0%的准确率，比标准的链式思维基线高出10.0个百分点，同时使用的标记减少了33%。我们的代码、数据和模型可在https://github.com/TioeAre/LaTER获取。


cs.CL / 45 / 2605.07324

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

激活差异揭示后门：稀疏自编码器架构的比较

Kumar, Sachin

Abstract

Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.

Chinese Translation

语言模型上的后门攻击对人工智能安全构成了重大威胁，这些模型在大多数输入下表现正常，但在特定模式触发时则表现出有害行为。通过机械解释性检测此类后门仍然是一个未解的挑战。我们研究了两种稀疏自编码器架构——Crosscoders和差分自编码器（Differential SAEs, Diff-SAE）——以在微调模型中隔离与后门相关的特征。我们使用一个受控的SQL注入后门，该后门由基于年份的上下文触发（“2024”触发易受攻击的代码，“2023”触发安全代码），在SmolLM2-360M上评估这两种方法在LoRA和全秩微调模式下的表现。我们发现，Diff-SAE在后门隔离方面始终显著优于Crosscoders。Diff-SAE在大多数实验条件下实现了0.40的后门隔离评分（Backdoor Isolation Score, BIS），并且具有完美的精确度（1.0）和零假阳性率，而Crosscoders在大多数情况下几乎完全失败，BIS低于0.02。这个性能差距在多个变换层（14、18、22、26）和两种微调模式下均保持一致，其中全秩微调产生特别干净的后门信号。我们的结果表明，后门表现为方向性激活偏移，而不是稀疏特征激活，这使得基于差异的表示在检测上根本上更为有效。这些发现对人工智能安全监控和开发用于检测模型操控的解释工具具有重要意义。

cs.CL / 46 / 2605.07345

Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative

均值池化余弦相似度不是长度不变的：长度不变替代方案的理论与跨领域证据

Mitra, Sibayan, Kumar, Dhruv

Abstract

Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
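The length effect described in this abstract can be reproduced on synthetic data. What follows is a minimal sketch, not the authors' code: it assumes each token vector is a shared offset (a stand-in for anisotropy) plus independent Gaussian noise, so the sequences carry no content at all. The names `average_cosine`, `shift`, and `dim`, and all numeric settings, are illustrative assumptions.

```python
import math
import random


def mean_pool(tokens):
    """Average a list of token vectors into one sequence vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_sequence(rng, length, dim, shift):
    """Content-free tokens: a shared offset (assumed anisotropy) plus unit Gaussian noise."""
    return [[shift + rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


def average_cosine(length, dim=16, shift=0.5, trials=200, seed=0):
    """Mean cosine between mean-pooled vectors of two unrelated sequences."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = mean_pool(random_sequence(rng, length, dim, shift))
        b = mean_pool(random_sequence(rng, length, dim, shift))
        total += cosine(a, b)
    return total / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, round(average_cosine(n), 3))
```

Under these assumptions the average similarity rises with sequence length even though the two sequences are unrelated, because pooling averages the noise away and leaves the shared offset. Setting `shift=0.0` removes the offset and keeps the average near zero at every length.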

Chinese Translation

均值池化余弦相似度是比较不同语言、模态和任务中的神经表征的默认度量。我们证明了该度量不是长度不变的：在现代变换器表征所特征化的各向异性下，均值池化余弦随着序列长度单调增加，与表征内容无关。根据在四个代码大语言模型（LLMs）上的 HumanEvalPack 实验，长度比率单独解释了跨语言“Python 接近度”的 $R^2 = 0.52$--$0.75$，而抽象语法树（AST）深度和共享标记比例所增加的解释方差不足 3%。用中心核对齐（Centered Kernel Alignment, CKA）替代均值池化余弦相似度，解释方差减少了 83%，并且长度系数的符号发生了反转（$\beta_{\mathrm{len}}: +0.86 \to -0.37$）。在 Mistral-7B 的平行 WMT 对中同样的模式成立（余弦相似度 $R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE；CKA 的 $R^2 < 0.01$）。在 CLIP ViT-B/32 中，均值池化相对于 EOS 池化减少了长度效应（$R^2: 0.21 \to {<}0.01$），这与理论对各向异性的依赖相符。我们认为，像 CKA 这样的长度不变度量应成为跨表征比较的默认选择，并且基于均值池化余弦的跨语言表征收敛的最新论断值得重新审视。

cs.CL / 47 / 2605.07366

Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study

基于梯度的 LoRA 排名分配在 GRPO 下的实证研究

Sawant, Yash Ganpat

Abstract

Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in the SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
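To make the comparison concrete, proportional allocation under a fixed budget can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the importance scores are made up (only their 2.17x spread echoes the abstract), the function names `proportional_ranks` and `spread` are hypothetical, and "budget" here means the sum of ranks, which equals the parameter budget only when every adapted layer has the same shape.

```python
def proportional_ranks(importance, total_rank, min_rank=1):
    """Split total_rank across layers in proportion to importance.

    Every layer first receives min_rank; the remainder is shared out
    proportionally, with largest-remainder rounding so the sum is exact.
    """
    n_layers = len(importance)
    spare = total_rank - min_rank * n_layers
    if spare < 0:
        raise ValueError("total_rank is too small for the requested min_rank")
    total_importance = sum(importance)
    shares = [spare * w / total_importance for w in importance]
    ranks = [min_rank + int(s) for s in shares]
    leftover = total_rank - sum(ranks)
    # Hand the remaining units to the layers with the largest fractional parts.
    by_fraction = sorted(range(n_layers), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks


def spread(values):
    """Max-to-min ratio, the statistic the abstract reports for layer importance."""
    return max(values) / min(values)


if __name__ == "__main__":
    importance = [1.0, 1.5, 2.17, 1.2]   # hypothetical per-layer gradient magnitudes
    uniform = [8, 8, 8, 8]               # uniform baseline, total rank 32
    adaptive = proportional_ranks(importance, total_rank=sum(uniform))
    print(adaptive, sum(adaptive), spread(importance))
```

Both allocations spend the same total rank; the abstract's finding is that, under GRPO, the non-uniform one performed worse despite the matched budget.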

Chinese Translation

LoRA 的自适应排名分配为重要层分配更多参数，而为不重要层分配更少参数，在监督微调（SFT）下始终提高了效率。我们研究这种成功是否可以转移到强化学习，特别是群体相对策略优化（GRPO）。通过对 Qwen 2.5 1.5B 和 GSM8K 进行梯度幅度分析，我们发现并非如此：与均匀分配相比，按比例分配排名使准确率下降了 4.5 个百分点（70.0% 对比 74.5%），尽管使用了相同的参数预算。我们识别出导致这一失败的两个机制。首先，GRPO 下的梯度景观在根本上比 SFT 更平坦，最大与最小层重要性比仅为 2.17 倍，而 SFT 文献中报告的比值超过 10 倍。所有层都携带有意义的梯度信号；没有任何层是真正闲置的。其次，我们发现了梯度放大效应：非均匀分配将重要性差距从 2.17 倍扩大到 3.00 倍，形成了一个正反馈循环，高排名层吸收更多梯度，而低排名层逐渐被抑制。我们的结果表明，梯度重要性并不能预测强化学习下的容量需求，并且应避免将 SFT 时代的排名分配简单转移到对齐训练中。

cs.CL / 48 / 2605.07409

The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

代理假设：从语义嵌入到有效的社会测量

Li, Baishi, Yu, Ta, Koa, Kelvin J. L., Huang, Ke-Wei

Abstract

Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the "Proxy Presumption," or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite -- including tests for discriminant, incremental, and predictive validity -- this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.

Chinese Translation

自然语言处理正迅速演变为计算社会科学的主要工具，研究人员越来越多地使用嵌入来测量潜在构念，如新颖性、创造力和偏见。然而，这一转变面临着一个根本的有效性挑战：即“代理假设”，或依赖几何属性（例如，余弦距离）作为社会概念的直接测量。我们认为，在没有明确验证的情况下，无监督的表示仍然是目标构念（$C$）与混淆属性（$Z$）如主题、风格和作者身份的纠缠混合。为了弥合语义嵌入与有效社会测量之间的差距，我们提出了构念有效性协议（Construct Validity Protocol, CVP）。CVP借鉴了因果表示学习和心理测量学，提供了从概念化到定量验证的严格流程。我们进一步提出了反事实中和（Counterfactual Neutralization），这是一种使用大型语言模型（LLMs）减少嵌入空间中混淆的方法。通过提供一个标准化的有效性工具包——包括区分性、增量性和预测有效性测试——本研究为社区提供了一个工具包，以将启发式代理转变为稳健且科学上可辩护的工具。

cs.CL / 49 / 2605.07432

Generating training datasets for legal chatbots in Korean

为韩国法律聊天机器人生成训练数据集

Hwang, Changhoe, Nam, Jee-Sun, Laporte, Eric

Abstract

Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels with the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached an F1-score of 91%. LIGA uses the model's predictions to select a link to a web page that documents similar legal cases.

Chinese Translation

聊天机器人是能够使用文本或语音信号与人类进行交流的机器人。法律聊天机器人改善了对司法的获取，因为律师的法律代理和法律咨询费用高昂，排除了处于不利和脆弱境地的人群。然而，在深度学习对话系统（聊天机器人）的数据集中捕捉实际用户输入的多样性是一个技术挑战。多样性需要大量的数据，这些数据还必须进行标注以分类用户的意图，而标注数据集的成本随着数据量的增加而上升。我们的方法不是标注大量来自用户的真实数据，而是共同生成大量的发言和高质量的标签。标注数据集的生成器基于语言资源，这些资源以地方语法图（Local Grammar Graphs, LGG）的形式存在，捕捉并概括了语言学家在文本中观察到的词汇和地方语法。LGG根据特定领域的分类系统将标签与发言关联。我们通过实施LIGA，一个韩国法律聊天机器人，测试了这种方法。该聊天机器人通过提供韩国政府公开的类似法律案件的信息，回答用户关于法律情况的对话查询。我们利用开源的Unitex平台从LGG生成了标注的发言，这一过程产生了7亿条发言。我们在由这些发言构成的数据集上训练了DIET分类器，训练后的模型达到了91%的F1分数表现。我们实现了一个名为LIGA的聊天机器人，该机器人利用模型的结果选择链接到记录类似法律案件的网页。

cs.CL / 50 / 2605.07446

SSP-based construction of evaluation-annotated data for fine-grained aspect-based sentiment analysis

基于SSP的细粒度基于方面的情感分析评估注释数据构建

Choi, Suwon, Kim, Shinwoo, Hwang, Changhoe, Yoo, Gwanghoon, Laporte, Eric, Nam, Jeesun

Abstract

We report the construction of a Korean evaluation-annotated corpus, hereafter called 'Evaluation Annotated Dataset (EVAD)', and its use in an Aspect-Based Sentiment Analysis (ABSA) approach extended to cover e-commerce reviews containing sentiment and non-sentiment linguistic patterns. The annotation process uses Semi-Automatic Symbolic Propagation (SSP). We built extensive linguistic resources formalized as a Finite-State Transducer (FST) to annotate corpora with detailed ABSA components in the fashion e-commerce domain. To analyze user opinions more accurately and extract more detailed features of targets, the ABSA approach is extended by including aspect values in addition to topics and aspects, and by classifying aspect-value pairs depending on whether values are unary, binary, or multiple. For evaluation, the KoBERT and KcBERT models are trained on the annotated dataset, reaching F1 scores of 0.88 and 0.90, respectively, on the recognition of aspect-value pairs.

Chinese Translation

我们报告了一个韩语评估注释语料库的构建，以下简称为“评估注释数据集（Evaluation Annotated Dataset, EVAD）”，并探讨其在基于方面的情感分析（Aspect-Based Sentiment Analysis, ABSA）中的应用，扩展以涵盖包含情感和非情感语言模式的电子商务评论。注释过程采用半自动符号传播（Semi-Automatic Symbolic Propagation, SSP）。我们构建了广泛的语言资源，以有限状态转导器（Finite-State Transducer, FST）的形式对语料库进行注释，详细涵盖电子商务领域的ABSA组件。为了更准确地分析用户意见并提取目标的更详细特征，ABSA方法进行了扩展，除了主题和方面外，还包括方面值，并根据值是单一的、二元的或多个的分类方面值对。为了评估，我们在注释数据集上训练了KoBERT和KcBERT模型，分别在方面值对的识别上展现出强大的性能，F1值为0.88和0.90。

cs.CL / 51 / 2605.07453

Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study

神经象形文字翻译中的数据污染：可重复性研究

Toutou, Ammar, Harb, Abdelrahman, Basta, Christine

Abstract

Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora -- making data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find that 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at a 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9--39.2 BLEU / 0.622--0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents -- target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9--39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.
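A target-level contamination audit of the kind this abstract calls for can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' script: it reads "8-gram overlap at a 70% threshold" as "at least 70% of a test target's word 8-grams also occur somewhere in the training targets", uses whitespace tokenization, and the function names are hypothetical. The toy example uses bigrams only because the sample sentences are short.

```python
def ngrams(tokens, n):
    """Set of word n-grams in a token list (empty if the list is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(train_targets, test_targets, n=8, threshold=0.7):
    """Return (exact, fuzzy): indices of test targets found in the training targets.

    exact: the test target string appears verbatim among the training targets.
    fuzzy: at least `threshold` of its n-grams occur in the training targets.
    Only target sides are compared, so duplicates that reach the training set
    through a different source document are still caught.
    """
    train_exact = set(train_targets)
    train_ngrams = set()
    for target in train_targets:
        train_ngrams |= ngrams(target.split(), n)

    exact, fuzzy = [], []
    for index, target in enumerate(test_targets):
        if target in train_exact:
            exact.append(index)
        grams = ngrams(target.split(), n)
        if grams and len(grams & train_ngrams) / len(grams) >= threshold:
            fuzzy.append(index)
    return exact, fuzzy


if __name__ == "__main__":
    train = ["der König gibt ein Opfer", "ich bin gekommen zu dir"]
    test = ["der König gibt ein Opfer", "der König gibt ein Opfer dar", "sie sprach zu ihm"]
    print(contamination_report(train, test, n=2))
```

Removing every index in `exact` from the test set is the target-level deduplication step; filtering by source document alone would miss targets that recur under other documents.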

Chinese Translation

古老且濒危的语言为自然语言处理（NLP）带来了独特的挑战：它们的数据集本质上稀缺，难以扩展，并且由公式化语料库构建——这使得数据质量问题尤为重要，但却很少被审计。为了理解当前的神经机器翻译（NMT）在这些语言中可以实际达到的效果，我们研究了象形文字到德语的翻译，其中一项近期研究报告使用微调的M2M-100模型达到了61.5的BLEU分数。我们的复现结果仅为37.0 BLEU，使用的是发布的模型。调查这一差距，我们发现32%的测试目标在训练集中完全相同（16/50；在70%阈值下的8-gram重叠为50%）。这种污染显著抬高了分数：污染样本的BLEU分数高达83.8 / 0.924 COMET-22，而干净样本的BLEU分数在五种模型配置下仅为30.9至39.2 / 0.622至0.676 COMET-22，涵盖两种架构。文档级去污染仅将污染BLEU降低了4.6分，因为8/16个目标通过其他源文档仍然存在——需要进行目标级去重。我们发布了一个去污染的34个样本的测试集，并建立了修正的基线（30.9至39.2 BLEU），为这一濒危书写系统的NMT能力提供了现实的评估。

cs.CL / 52 / 2605.07454

GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks

GRaSp：低数据任务中上下文学习的自动示例优化

Bihaug-Frøyland, Simen, Brådland, Henrik

Abstract

In-context learning enables large language models to adapt to new tasks, but their performance is highly sensitive to the selected examples. Finding effective demonstrations is particularly difficult in domain-specific, low-data settings where high-quality examples are scarce. We propose GRaSp, a three-stage framework for automatic in-context example optimization. By first generating a large synthetic candidate pool, then structuring it with clustering and dimensionality reduction, and finally using genetic algorithms to find the optimal in-context examples, the framework shows consistent improvements on the NER task. We also introduce a custom diversity-adaptive mutation mechanism, allowing it to transition from the initial broad inter-cluster exploration to focused intra-cluster refinement as the population converges. We evaluate GRaSp on financial named entity recognition (FiNER-139), comparing synthetic and human-annotated candidate pools across pool sizes of 500 and 5000. With non-synthetic data, GRaSp achieves 45.84% micro-F1, consistently outperforming both zero-shot and random few-shot baselines. Synthetic data matches the random baseline but does not exceed it, suggesting that distributional variety in the candidate pool is critical for generalization.
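The abstract does not specify how the diversity-adaptive mutation works, so the following is only one plausible reading, offered as a sketch. It assumes an individual is a list of candidate-example IDs, that each ID belongs to one cluster, and that the chance of an inter-cluster swap equals a diversity score in [0, 1]. All names (`population_diversity`, `mutate`, `clusters`, `cluster_of`) are hypothetical.

```python
import random


def population_diversity(population):
    """Fraction of distinct individuals (order-insensitive); 1.0 means all differ."""
    distinct = {tuple(sorted(individual)) for individual in population}
    return len(distinct) / len(population)


def mutate(individual, clusters, cluster_of, diversity, rng):
    """Replace one example in the individual.

    With probability `diversity` the replacement comes from a different cluster
    (broad exploration); otherwise it comes from the same cluster (local
    refinement). As the population converges and diversity falls, mutation
    therefore shifts from inter-cluster to intra-cluster moves.
    """
    position = rng.randrange(len(individual))
    old = individual[position]
    home = cluster_of[old]
    if rng.random() < diversity:
        candidates = [e for c, members in clusters.items() if c != home for e in members]
    else:
        candidates = [e for e in clusters[home] if e != old]
    # Never duplicate an example that is already part of the prompt.
    candidates = [e for e in candidates if e not in individual]
    if not candidates:
        return list(individual)
    child = list(individual)
    child[position] = rng.choice(candidates)
    return child


if __name__ == "__main__":
    clusters = {0: [0, 1, 2], 1: [3, 4, 5]}
    cluster_of = {e: c for c, members in clusters.items() for e in members}
    rng = random.Random(0)
    population = [[0, 3], [1, 4], [0, 3]]
    score = population_diversity(population)
    print(score, mutate(population[0], clusters, cluster_of, score, rng))
```

A full genetic algorithm would add selection and crossover and would score each individual by running the few-shot prompt on a held-out set; those parts are omitted here.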

Chinese Translation

上下文学习使大型语言模型能够适应新任务，但其性能对所选示例高度敏感。在特定领域的低数据环境中，找到有效的示例尤其困难，因为高质量示例稀缺。我们提出了GRaSp，一个用于自动上下文示例优化的三阶段框架。该框架首先生成一个大型合成候选池，然后通过聚类和降维进行结构化，最后使用遗传算法寻找最佳上下文示例，显示出在命名实体识别（NER）任务上的一致性改进。我们还引入了一种自定义的多样性自适应变异机制，使其能够在种群收敛时，从最初的广泛跨簇探索过渡到集中于簇内的精细化。我们在金融命名实体识别（FiNER-139）上评估了GRaSp，比较了500和5000大小的合成与人工标注候选池。在非合成数据上，GRaSp达到了45.84%的微F1，始终优于零-shot和随机少量示例基线。合成数据与随机基线相匹配，但未超过它，这表明候选池中的分布多样性对泛化至关重要。

cs.CL / 53 / 2605.07461

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

基于评分标准的思维：从外部评估者到内部推理指导

Yu, Jiachen, Xu, Zhihao, Wang, Junjie, Yang, Yujiu

Abstract

Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as an external evaluator, disjoint from the policy's primary reasoning trace. Such a design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model's generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction-following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into internal guidance for the LLM's generation. During training, the LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We also discuss the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and from self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and by increasing the internal consistency of responses, respectively.

Chinese Translation

评分标准在评估不可验证的开放性任务中得到了广泛应用，最近的研究将其纳入了强化学习的奖励系统。然而，现有框架通常将评分标准视为与策略主要推理轨迹脱节的外部评估者。这种设计将评分标准限制为事后测量，使其无法主动指导模型的生成过程。在本研究中，我们提出了基于评分标准的思维（Think-with-Rubrics），这是一种用于指令跟随任务的新范式。基于评分标准的思维将评分标准生成整合到推理上下文中，将评分标准从独立的工具转变为大型语言模型（LLM）生成过程中的内部指导。在训练过程中，LLM依次生成评分标准和响应，而经过训练的评分标准验证器通过评估答案与自生成/黄金评分标准之间的一致性提供联合监督。多个基准测试的实验表明，基于评分标准的思维在黄金评分标准监督下，平均超越了评分标准作为奖励（Rubric-as-Reward）基线3.87分。我们还讨论了基于评分标准的思维如何增强模型性能的机制。实验结果表明，来自黄金评分标准和自生成评分标准的监督通过提高自生成评分标准的质量和增加响应的一致性，分别增强了基于评分标准的思维的性能。

cs.CL / 54 / 2605.07462

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

Moltbook 文件：无害的混乱末日还是人类的最后实验

Brach, William, Torrielli, Federico, Beltoft, Stine Lyngsø, Pirchert, Annemette Brok, Schneider-Kamp, Peter, Poech, Lukas Galke

Abstract

Moltbook is a Reddit-like platform where OpenClaw agents post, comment, and vote at scale - a so far unprecedented incident that comes with serious safety concerns. With the aim of studying emergent behavior in populations, we release the Moltbook Files, a dataset of 232k posts and 2.2M comments covering the platform's first 12 days, processed through a pipeline to identify and remove Personally-Identifiable Information (PII). We analyze community structure, authorship, lexical properties, sentiment, topics, semantic geometry, and comment interaction. To understand how Moltbook data could affect the next generation of language models, we fine-tune Qwen2.5-14B-Instruct on Moltbook Files with three adaptation levels. Our PII pipeline reveals that agents post API keys, passwords, BIP39 seed phrases on Moltbook, a publicly indexed platform. The overall sentiment is mostly neutral and mildly positive (66.6% neutral, 19.5% positive) and shows a tendency for self-referential linking. We find that fine-tuning on Moltbook data reduces truthfulness from 0.366 to 0.187. However, a model fine-tuned on a size-matched Reddit dataset produces a comparable decrease. Moltbook thus seems to be more of a harmless slopocalypse. However, tail risks remain, including agent affordances, contamination of future crawls through self-links, and potential transfer of traits to the next generation of language models. More broadly, our findings highlight the importance of control baselines in emergent misalignment evaluations.

Chinese Translation

Moltbook 是一个类似于 Reddit 的平台，OpenClaw 代理在此进行大规模的发布、评论和投票——这一前所未有的事件带来了严重的安全隐患。为了研究人群中的涌现行为，我们发布了 Moltbook 文件，这是一个包含 232,000 条帖子和 2.2 百万条评论的数据集，涵盖了该平台的前 12 天，并通过一个处理管道识别和删除了个人可识别信息（PII）。我们分析了社区结构、作者身份、词汇特性、情感、主题、语义几何和评论互动。为了理解 Moltbook 数据如何影响下一代语言模型，我们在 Moltbook 文件上对 Qwen2.5-14B-Instruct 进行了三种适应级别的微调。我们的 PII 管道揭示，代理在公开索引的平台 Moltbook 上发布 API 密钥、密码和 BIP39 种子短语。整体情感大多为中性和轻微积极（66.6% 中性，19.5% 积极），并显示出自我指涉链接的倾向。我们发现，在 Moltbook 数据上进行微调会将真实性从 0.366 降低到 0.187。然而，在与之规模匹配的 Reddit 数据集上进行微调的模型也产生了类似的下降。因此，Moltbook 看起来更像是一个无害的混乱末日。然而，尾部风险依然存在，包括代理的能力、通过自我链接污染未来的爬取，以及潜在特征向下一代语言模型的转移。更广泛地说，我们的发现强调了在涌现失调评估中控制基线的重要性。

cs.CL / 55 / 2605.07465

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

SEIF：用于指令跟随的自我进化强化学习

Ren, Qingyu, He, Qianyu, Zhu, Jiajie, Chen, Xingzhou, Chang, Jingwen, Sun, Zeye, Xia, Han, Yu, Fei, Liang, Jiaqing, Xiao, Yanghua

Abstract

Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.

Chinese Translation

指令跟随是大型语言模型（LLMs）的基本能力，但持续提升这一能力仍然面临挑战。现有方法通常依赖于来自人类的高成本外部监督或强大的教师模型，或依赖于静态难度指令的自我对弈训练，这些指令无法随着模型能力的提升而进化。为了解决这些局限性，我们提出了SEIF（用于指令跟随的自我进化强化学习），这是一个自我进化框架，用于增强LLMs的指令跟随能力。SEIF形成了一个封闭的自我进化循环，提升模型的指令跟随能力，其中指令难度的进化与模型能力的进化相互促进。SEIF由四个角色组成：生成日益具有挑战性的指令的Instructor，移除冲突或无效指令以确保数据质量的Filter，学习跟随进化指令的Follower，以及为强化学习提供奖励信号的Judger。Instructor和Follower在整个过程中交替训练并共同进化。跨多个模型规模和架构的实验表明，SEIF始终改善指令跟随性能，显示出强大的普适性。进一步的分析揭示了改进的来源，并确定了一种有效的自我进化训练策略，适用于开放式任务：在早期阶段进行充分的训练以建立坚实的基础，随后进行适度的后期训练以减轻过拟合并实现更好的最终性能。代码和数据可在 https://github.com/Rainier-rq1/SEIF 获取。

cs.CL / 56 / 2605.07507

TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

TCMIIES：基于浏览器的LLM驱动智能信息提取系统用于学术文献

Zhao, Hanqing

Abstract

The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.
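Schema-guided prompting with a compliance check can be illustrated with a small sketch. This is not TCMIIES code (which runs in the browser); it assumes a schema is a list of fields with a name, a type, and a description, and that "structured output compliance" means the model's reply parses as a JSON object with exactly the schema's fields and matching types. The schema format, `TYPE_MAP`, and both function names are hypothetical.

```python
import json

# Assumed mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "number": (int, float), "list": list}


def build_system_prompt(schema):
    """Generate a system prompt from a user-defined extraction schema."""
    lines = ["Extract the following fields from the paper and reply with one JSON object only:"]
    for field in schema:
        lines.append(f'- "{field["name"]}" ({field["type"]}): {field["description"]}')
    lines.append("Use null for any field the text does not state.")
    return "\n".join(lines)


def is_compliant(raw_output, schema):
    """Check that a model reply is valid JSON with exactly the schema's fields and types."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if set(data) != {field["name"] for field in schema}:
        return False
    return all(
        data[field["name"]] is None or isinstance(data[field["name"]], TYPE_MAP[field["type"]])
        for field in schema
    )


if __name__ == "__main__":
    schema = [
        {"name": "herb", "type": "string", "description": "Main herb studied"},
        {"name": "dose_g", "type": "number", "description": "Daily dose in grams"},
    ]
    print(build_system_prompt(schema))
    print(is_compliant('{"herb": "Astragalus", "dose_g": 15}', schema))
```

A compliance rate is then the share of replies in a batch for which `is_compliant` returns true; a non-compliant reply is a natural trigger for the retry mechanism the abstract mentions.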

Chinese Translation

学术出版物的指数增长迫切需要能够从非结构化科学文本中提取结构化知识的自动化工具。尽管大型语言模型（LLMs）在自然语言理解和信息提取方面展现了卓越的能力，但现有解决方案往往需要专门的基础设施、编程专业知识或经过微调的领域特定模型，这给专业领域的研究人员带来了障碍。本文提出了TCMIIES，一个基于浏览器的零安装平台，利用商业LLM API从学术文献中执行结构化信息提取。该系统采用了一种新颖的模式引导提示框架，并具有自动系统提示生成能力，使研究人员能够通过直观的图形界面定义自定义提取模式，而无需任何编程。TCMIIES具有纯前端架构，通过在浏览器中本地处理所有信息来确保数据隐私，支持五大主要LLM提供商，实现并发批处理和自动重试机制，并为包括CNKI和万方在内的中国学术数据库提供智能字段映射。我们通过在传统中医研究中的多种提取场景进行全面评估，展示了该系统的有效性，达到了超过94%的结构化输出合规率和与领域专家注释相当的信息提取准确性。该系统代表了一种实用、可访问的解决方案，弥合了先进LLM能力与领域特定学术信息提取需求之间的差距，特别是对于需要灵活、保护隐私和具有成本效益的提取工具的专业领域研究人员。

cs.CL / 57 / 2605.07522

WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation

WeatherSyn：一种用于天气预报报告生成的指令调优多模态大语言模型

Zheng, Zinan, Liu, Yang, Chen, Nuo, Zheng, Juepeng, Cheng, Hong, Li, Jia

Abstract

Accurate weather forecast reporting enables individuals and communities to better plan daily activities and agricultural operations. However, the current reporting process primarily relies on manual analysis of multi-source data, which leads to information overload and reduced efficiency. Despite the development of multimodal large language models (MLLMs), leveraging data-driven models to analyze and generate reports in the weather forecasting domain remains largely underexplored. In this work, we propose the Weather Forecasting Report (WFR) task and construct the first instruction-tuning dataset for this task, named \DatasetNameL, which covers 31 cities in America and 8 weather aspects. Based on this corpus, we develop the first model, WeatherSyn, specialized in generating weather forecast reports. Evaluation across multiple metrics on our dataset shows that WeatherSyn consistently outperforms leading closed-source MLLMs, particularly on structurally complex weather aspects. We further analyze its performance across diverse geographic regions and weather aspects. WeatherSyn demonstrates strong transferability across different regions, highlighting its zero-shot generalization capability. WeatherSyn offers valuable insight for developing MLLMs specialized in weather report generation.

Chinese Translation

准确的天气预报报告使个人和社区能够更好地规划日常活动和农业操作。然而，目前的报告过程主要依赖于对多源数据的人工分析，这导致信息过载和效率降低。尽管多模态大语言模型（MLLM）不断发展，利用数据驱动模型分析和生成天气预报领域的报告仍然在很大程度上未被探索。在本研究中，我们提出了天气预报报告（WFR）任务，并构建了该任务的第一个指令调优数据集，命名为 \DatasetNameL，该数据集涵盖了美国31个城市和8个天气方面。基于该语料库，我们开发了第一个专门用于生成天气预报报告的模型 WeatherSyn。在我们的数据集上进行的多项指标评估显示，WeatherSyn 始终优于领先的闭源MLLM，特别是在结构复杂的天气方面。我们进一步分析了其在不同地理区域和天气方面的表现。WeatherSyn 在不同区域之间表现出强大的迁移能力，突显了其零样本泛化能力。WeatherSyn 为开发专门用于天气报告生成的MLLM提供了宝贵的见解。

cs.CL / 58 / 2605.07533

Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation

大型语言模型在低资源翻译中为何失败？揭示大型语言模型在机器翻译中的词元动态

Qian, Shenbin, Scherrer, Yves

Abstract

Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.

Chinese Translation

大型语言模型（LLMs）最近在机器翻译（MT）中表现出强劲的性能。然而，大多数先前的研究集中在提高或基准翻译质量上，提供的见解有限，无法解释LLM基础的翻译何时以及为何失败。在本研究中，我们通过评估15个模型（包括四个推理LLM）在22个具有不同资源水平的语言对（LPs）中的表现，系统分析了LLM在MT中的失败模式。我们发现，非英语中心的语言对的COMET得分始终低于英语中心的语言对。为了探讨潜在原因，我们引入了词元激活率（Token Activation Rate, TAR），这一指标捕捉模型在生成过程中如何有效利用其词汇表中的语言特定词元。我们通过使用在训练数据中具有已知语言分布的模型验证了TAR作为语言表示的代理，并显示较低的TAR与较差的翻译性能显著相关。此外，推理LLM在翻译低TAR语言时倾向于生成更多的词元，这表明了一种补偿机制，尽管其对翻译质量的影响因模型而异。总体而言，我们的研究结果强调了理解LLM在机器翻译中表现的重要性，特别是词元层面的动态变化。

cs.CL / 59 / 2605.07606

Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification

纽伦堡自然语言处理在心理防御机制分类中的应用：多轴投票集成方法

Steigerwald, Philipp, Rudolph, Eric, Albrecht, Jens

Abstract

Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026, the eight positive defence categories share surface language and differ only in pragmatic function, and trained raters reach only moderate inter-annotator agreement. On such a task the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative), and base model. The system reaches $F1_{\text{test}} = 0.420$ on the hidden test set, placing first among 21 registered teams.
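The appeal to error independence can be made concrete with a standard calculation. The sketch below is a deliberate simplification and not the submitted system: it assumes a binary decision, nine voters that err fully independently, and the same accuracy for every voter, whereas the shared task has nine classes and real voters are correlated. The function name `majority_accuracy` is hypothetical.

```python
from math import comb


def majority_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Each voter is right with probability p_correct, independently of the others,
    so the number of correct votes follows a binomial distribution.
    """
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 9):
        print(n, round(majority_accuracy(n, 0.6), 4))
```

Under these assumptions, nine voters that are each right 60% of the time reach a majority accuracy of about 73%. The gain shrinks as voter errors become correlated, which is the motivation for varying class granularity, training method, and base model across voters.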

Chinese Translation

在支持性对话中检测心理防御机制的水平本质上是模糊的。在2026年BioNLP的PsyDefDetect共享任务中，八个积极的防御类别共享表面语言，仅在语用功能上有所不同，而训练评估者之间的注释一致性仅达到中等水平。在这样的任务中，决定性的杠杆不是更强的单一模型，而是错误的独立性，因为任何单一的表示都会在重叠的防御边界上动摇。我们将这一见解转化为一个跨越三个正交轴的9个投票者集成：类别粒度（所有九个类别用于门卫，仅八个防御类别用于专家）、训练方法（生成和判别）以及基础模型。该系统在隐藏测试集上的$F1_{test}{=}.420$，在21个注册团队中名列第一。

cs.CL / 60 / 2605.07613

Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation

基于意图驱动的语义ID生成用于扎根的对话新闻推荐

Su, Hongyang, Kong, Beibei, Cheng, Lei, Zhuo, Chengxiang, Li, Zang, Yu, Chenyun

Abstract

Conversational news recommendation requires grounding each suggestion in a rapidly evolving article corpus while addressing implicit user intents that lack explicit retrievable keywords. To characterize this scenario, we identify 6 intent types from production dialogues: five are implicit and pose fundamental challenges to standard RAG pipelines, forming a critical retrieve-first bottleneck. To address these issues, we introduce intent-driven Semantic ID (SID) generation under a Generate-then-Match paradigm. With two-stage training that consists of multi-task SID alignment and GPT-4 Chain-of-Thought distillation, an LLM maps diverse intents to hierarchical SID prefixes, which are then fuzzy-matched to the current news pool to guarantee fully grounded recommendations. Profile-Aware Dual-Signal Reasoning (PADR) further enables cold-start users to obtain valid recommendations using only profiles. On a mainstream Chinese news platform, our 7B model achieves 0% hallucination and 12.4% L1 match in the 152K open-generation SID space (4x random baseline). It matches GPT-4+Hybrid RAG on L1 while surpassing it on finer-grained metrics (L2 2x, Category +1.2pp) at ~100x lower cost. Cold-start users, where existing baselines score 0%, achieve 18.0% L1 (6x random), the highest among all user groups.
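The matching half of a Generate-then-Match pipeline can be sketched as a longest-shared-prefix lookup. This is an illustrative assumption, not the paper's matcher: it treats a Semantic ID as a tuple of integer codes ordered from coarse to fine, counts matched levels (so a result of 1 would correspond to an L1 match), and breaks ties by pool order. The names `shared_prefix_len` and `match_to_pool` and the toy pool are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading hierarchy levels on which two Semantic IDs agree."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def match_to_pool(generated_sid, pool, min_levels=1):
    """Map a generated SID to the pool article sharing the longest prefix.

    Returns (article_id, matched_levels), or (None, 0) when no article agrees
    on at least `min_levels` levels. Because only IDs from the current pool
    can be returned, the recommendation cannot point at a non-existent article.
    """
    best_id, best_len = None, 0
    for article_id, sid in pool.items():
        length = shared_prefix_len(generated_sid, sid)
        if length > best_len:
            best_id, best_len = article_id, length
    if best_len < min_levels:
        return None, 0
    return best_id, best_len


if __name__ == "__main__":
    pool = {"a1": (3, 7, 1), "a2": (3, 2, 5), "a3": (9, 0, 4)}
    print(match_to_pool((3, 7, 8), pool))   # agrees with a1 on the first two levels
    print(match_to_pool((5, 5, 5), pool))   # nothing in the pool shares a prefix
```

A production system would index the pool with a trie rather than scanning it, and would return several candidates per request; the linear scan keeps the sketch short.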

Chinese Translation

对话新闻推荐需要在快速发展的文章语料库中为每个建议提供扎根，同时处理缺乏明确可检索关键词的隐含用户意图。为了表征这一场景，我们从生产对话中识别出6种意图类型：五种是隐含的，对标准的检索增强生成（RAG）管道构成了基本挑战，形成了一个关键的优先检索瓶颈。为了解决这些问题，我们在生成-再匹配（Generate-then-Match）范式下引入了基于意图驱动的语义ID（SID）生成。通过包括多任务SID对齐和GPT-4思维链蒸馏的两阶段训练，一个大型语言模型（LLM）将多样的意图映射到层次化的SID前缀，然后将其模糊匹配到当前新闻池，以确保完全扎根的推荐。基于用户画像的双信号推理（PADR）进一步使冷启动用户仅凭个人资料即可获得有效推荐。在一个主流的中文新闻平台上，我们的7B模型在152K开放生成的SID空间中实现了0%的幻觉率和12.4%的L1匹配（相较于4倍随机基线）。在L1上与GPT-4+混合RAG匹配，同时在更细粒度的指标上超越（L2 2倍，类别+1.2个百分点），成本约为其的100倍更低。对于冷启动用户，现有基线得分为0%，而我们的模型达到了18.0%的L1（6倍随机），在所有用户组中最高。

cs.CL / 61 / 2605.07622

Is She Even Relevant? When BERT Ignores Explicit Gender Cues

她真的相关吗？当 BERT 忽视显性性别线索时

Klein, Jonas, Manna, Chiara, Vanmassenhove, Eva

Abstract

Gender bias in large language models has primarily been investigated for English, while languages with grammatical or morphological gender remain comparatively understudied. This paper investigates how and when gender information emerges in a Dutch BERT model trained from scratch, offering one of the first checkpoint-level analyses of bias formation in a Transformer architecture for a language combining overt morphological gender marking and generic forms. By extracting contextual embeddings throughout training, we construct dynamic gender subspaces using linear SVMs to trace when gender becomes linearly encoded and how this encoding evolves over time. Contextual embeddings are often assumed to integrate contextual cues robustly, allowing models to adjust the representation of a word depending on its more local usage. We therefore test whether explicit gender cues in controlled sentence templates (e.g., Zij is een loodgieter ('She is a plumber')) can override learned statistical associations (plumber -> male). Our findings challenge this assumption: although gender becomes clearly linearly separable around epoch 20 and is distributed across multiple embedding dimensions, the model struggles to update its internal gender representation in light of explicit contextual cues in short sentence templates. Stereotypical gender-profession pairings are predicted far more accurately than anti-stereotypical ones, and generic forms in Dutch systematically default to a male interpretation, even when the context explicitly denotes a female referent. Together, our results seem to indicate that contextualization in the representations learned by our Dutch BERT model is not sufficiently dynamic along the probed gender direction: explicit gender cues in anti-stereotypical contexts are not reliably reflected in the resulting representations, resulting in persistent male-default behaviour.

Chinese Translation

大型语言模型中的性别偏见主要针对英语进行了研究，而具有语法或形态性别的语言则相对较少被研究。本文探讨了在从零开始训练的荷兰 BERT 模型中，性别信息如何以及何时出现，提供了对结合显性形态性别标记和通用形式的语言中偏见形成的 Transformer 架构的首次检查点级分析。通过提取训练过程中的上下文嵌入，我们使用线性支持向量机（SVM）构建动态性别子空间，以追踪性别何时变得线性编码以及这种编码如何随着时间演变。上下文嵌入通常被认为能够稳健地整合上下文线索，使模型能够根据词语的局部用法调整其表示。因此，我们测试了在受控句子模板中显性性别线索（例如，Zij is een loodgieter（‘她是一名水管工’））是否能够覆盖学习到的统计关联（plumber -> male）。我们的研究结果挑战了这一假设：尽管性别在第20个训练周期左右变得明显线性可分，并且分布在多个嵌入维度上，但模型在短句模板中面对显性上下文线索时，仍然难以更新其内部性别表示。刻板印象中的性别-职业配对的预测准确性远高于反刻板印象的配对，而荷兰语中的通用形式系统性地默认男性解释，即使上下文明确指代女性。综合来看，我们的结果似乎表明，荷兰 BERT 模型所学习的表示中的上下文化在探测的性别方向上并不够动态：在反刻板印象的上下文中，显性性别线索未能可靠地反映在结果表示中，导致持续的男性默认行为。

cs.CL / 62 / 2605.07630

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

安全，还是仅仅无能？重新思考手机使用代理的安全评估

Tang, Zhengyang, Zhang, Yi, Li, Chenxin, Lai, Xin, Lyu, Pengyuan, Guo, Yiduo, Wang, Weinong, Li, Junyi, Ding, Yang, Shen, Huawen, Fang, Zhengyao, Zhou, Xingran, Wu, Liang, Tang, Fei, Fan, Sunqi, Peng, Shangpin, Ruan, Zheng, Zhang, Anran, Wang, Benyou, Zhang, Chengquan, Hu, Han

Abstract

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
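The distinction between a harmless outcome and a safe decision can be made concrete with a small tally. The sketch below assumes each benchmark instance has been labeled with one of three outcomes; the label strings and the function `summarize` are hypothetical, and the numbers are invented for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical label names for the three outcomes described in the abstract.
OUTCOMES = ("safe", "unsafe", "no_useful_action")


def summarize(labels: Iterable[str]) -> Dict[str, Optional[float]]:
    """Report harmlessness separately from judgment and capability."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    acted = counts["safe"] + counts["unsafe"]
    return {
        # Everything that did not end in the unsafe action looks harmless.
        "harmless_rate": (total - counts["unsafe"]) / total,
        # Judgment: how often the agent chose safely when it acted at all.
        "safe_given_acted": counts["safe"] / acted if acted else None,
        # Capability: how often it failed to do anything useful.
        "no_useful_action_rate": counts["no_useful_action"] / total,
    }


# Two invented agents with the same harmless rate but different profiles.
agent_a = ["safe"] * 6 + ["unsafe"] * 2 + ["no_useful_action"] * 2
agent_b = ["safe"] * 2 + ["unsafe"] * 2 + ["no_useful_action"] * 6
print(summarize(agent_a))
print(summarize(agent_b))
```

Both invented agents avoid harm equally often, yet the second does so mostly by failing to act, which is the confusion the benchmark is designed to expose.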

Chinese Translation

当一个手机使用代理避免了伤害，这是否表明其安全，还是仅仅表明其无能为力？现有的评估往往无法区分这一点。一个有害结果的避免可能是因为代理识别了风险并选择了安全的行动，或者因为它未能理解屏幕或根本无法执行任何相关的行动。这些情况有不同的原因，并需要不同的解决方案，然而当前的基准往往将它们合并在任务成功、拒绝或最终有害结果之下。我们通过 PhoneSafety 解决了这个问题，这是一个由 700 个安全关键时刻组成的基准，来源于超过 130 个应用的真实手机交互。每个实例在风险时刻隔离出下一个决策，并提出一个简单的问题：模型是采取安全行动、采取不安全行动，还是未能做出任何有用的事情？我们在这一框架下评估了八个具有代表性的手机使用代理。我们的结果揭示了两个主要模式。首先，较强的手机使用能力并不可靠地意味着在风险时刻做出更安全的选择。在普通应用任务中表现更好的模型并不总是那些在下一个行动重要时表现得更安全的模型。其次，未能做出任何有用的事情更像是一种能力信号，而不是安全信号：它们集中在视觉和操作要求更高的环境中，并在评估协议变化时保持稳定。在各个模型中，失败分为两种反复出现的模式：在模型可以行动但选择错误的情况下做出不安全的选择，以及在视觉和操作要求更高的屏幕中无法行动。总体而言，无害的结果不足以作为安全的证据。评估手机使用代理需要将不安全的判断与无能为力区分开来。

cs.CL / 63 / 2605.07632

Post-training makes large language models less human-like

后训练使大型语言模型变得不那么像人类

Binz, Marcel, Akata, Elif, Almaatouq, Abdullah, Alsobay, Mohammed, Ariasov, Oleksii, Brändle, Franziska, Broska, David, Burton, Jason W., Busch, Nuno, Callaway, Frederick, Cheung, Vanessa, Christian, Brian, Coda-Forno, Julian, Demircan, Can, Dentella, Vittoria, Eckstein, Maria K., Éltető, Noémi, Franke, Michael, Griffiths, Thomas L., Günther, Fritz, Haridi, Susanne, Hellmann, Sebastian, Herytash, Stefan, Hof, Linus, Holton, Eleanor, Hoxha, Isabelle, Hussain, Zak, Jagadish, Akshay, Kara, Elif, Kriegmair, Valentin, Leivada, Evelina, Ji-An, Li, Ludwig, Tobias, Maier, Maximilian, Mattar, Marcelo G., Mathony, Marvin, Modirshanechi, Alireza, Na, Robin, Nadverniuk, Mariia, Nasioulas, Antonios, Nath, Surabhi S., Niemeyer, Helen, Nussenbaum, Kate, Olschewski, Sebastian, Pachur, Thorsten, Palminteri, Stefano, Petrenco, Aliona, Phaneuf-Hadd, Camille V., Pirrone, Angelo, Rausch, Manuel, Raveling, Laura, Reddy, Shashank, Rmus, Milena, Russek, Evan M., Saanum, Tankred, Sandbrink, Kai, Schiekiera, Louis, Schubert, Johannes A., Buschoff, Luca M. Schulze, Singhi, Nishad, Somerville, Leah H., Spektor, Mikhail S., Sui, Xin, Summerfield, Christopher, Thalmann, Mirko, Thoma, Anna I., Tikhomirova, Taisiia, Truong, Vuong, Tsvilodub, Polina, Voudouris, Konstantinos, Wilson, Robert C., Witte, Kristin, Wu, Shuchen, Wulff, Dirk U., Xiong, Hua-Dong, Xu, Songlin, Ying, Lance, Zhang, Xinyu, Zhu, Jian-Qiao, Schulz, Eric

Abstract

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

Chinese Translation

大型语言模型（LLMs）越来越多地被用作人类参与者的替代品，但尚不清楚哪些模型最能捕捉人类行为及其原因。为了解决这个问题，我们引入了Psych-201，这是一个新颖的数据集，使我们能够大规模测量行为一致性。我们发现，后训练——将基础模型转变为有用助手的阶段——在不同模型家族、规模和目标中，始终减少与人类行为的一致性。此外，这种不一致在更新的模型代际中加剧，即使基础模型仍在持续改进。最后，我们发现，个性诱导（persona-induction）——一种通过对模型进行参与者特定信息的条件化来引发人类行为的流行技术——并未在个体层面上改善预测。综合来看，我们的结果表明，目前用于将LLMs转变为有用助手的过程也使它们在模拟人类行为方面的准确性降低。

cs.CL / 64 / 2605.07635

Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction

针对语法错误纠正的多维度评估大型语言模型

Labib, Adnan, Wang, Qiao, Huang, Yixuan, Yuan, Zheng

Abstract

Automated assistants for Grammatical Error Correction (GEC) are now embedded in educational platforms serving millions of learners, yet three critical gaps remain in this domain: (1) latest-generation Large Language Models (LLMs) lack comprehensive evaluation on grammar correction tasks; (2) whether combining these LLMs improves correction quality is unexplored; and (3) the extent to which reference-based metrics underestimate GEC system performance has not been adequately quantified. In this study, first, we evaluate latest-generation LLMs on edit precision, fluency preservation, and meaning retention, showing that fine-tuned GPT-4o achieves state-of-the-art performance across all three dimensions. Second, through grammatical error type analysis, we demonstrate that individual LLMs exhibit highly similar error correction patterns ($\rho=0.947$). Third, we show that reference-based metrics underestimate GEC performance: 73.76% of the GPT-4o corrections that differ from the gold standard are equally valid or even superior. These GEC evaluation findings equip educators with guidance for selecting GEC assistants that enhance rather than constrain student linguistic development. We make our data, code, and models publicly available.
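A similarity score such as the reported $\rho$ can be computed by ranking each model's per-error-type correction rates and correlating the ranks. The abstract does not name the coefficient, so the use of Spearman's rank correlation here is an assumption, and the two rate vectors are invented for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Invented correction rates of two models over four error types.
model_a = [0.9, 0.7, 0.5, 0.3]
model_b = [0.8, 0.75, 0.4, 0.45]
print(spearman_rho(model_a, model_b))
```

A value near 1 means the two models find the same error types easy and the same ones hard, which is why highly similar patterns would limit what combining the models can add.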

Chinese Translation

自动化语法错误纠正助手现已嵌入服务数百万学习者的教育平台中，但该领域仍存在三个关键问题：（1）最新一代大型语言模型（LLMs）在语法纠正任务上的综合评估不足；（2）结合这些LLMs是否能提高纠正质量尚未被探讨；（3）基于参考的评估指标在多大程度上低估了语法错误纠正（GEC）系统的性能尚未得到充分量化。在本研究中，我们首先评估了最新一代LLMs在编辑精度、流畅性保持和意义保留方面的表现，结果显示微调后的GPT-4o在这三个维度上均达到了最先进的性能。其次，通过对语法错误类型的分析，我们证明了各个LLMs表现出高度相似的错误纠正模式（$\rho=0.947$）。第三，我们展示了基于参考的评估指标低估了GEC性能，73.76%的GPT-4o纠正结果与黄金标准不同，但同样有效甚至更优。这些GEC评估结果为教育工作者提供了选择能够促进而非限制学生语言发展的GEC助手的指导。我们将我们的数据、代码和模型公开发布。

cs.CL / 65 / 2605.07646

MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

MAVEN：具有逐步认知审计的多智能体验证-阐述网络

Yao, Yinsheng, Tang, Jiehao, Yang, Zhaozhen, Cheng, Dawei

Abstract

While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.

Chinese Translation

尽管显式推理轨迹增强了模型的可解释性，但现有范式通常依赖于单一链条，缺乏中间验证，导致早期错误未被检查而级联。这种缺乏模块化的特性妨碍了细粒度审计，并损害了高风险应用所需的认知信任。我们提出了MAVEN（具有逐步认知审计的多智能体验证-阐述网络），这是一个受黑板模型启发的框架，旨在通过显式角色解耦将大规模语言模型（LLMs）转变为深思熟虑的推理者。MAVEN的核心是实现一个对抗性的怀疑者-研究者-评审者循环，通过功能上将逻辑辩护与事实基础分离，模拟专家的深思熟虑。在OpenBookQA、TruthfulQA、HALUEVAL和StrategyQA基准上的实验表明，MAVEN在四个细粒度指标上提供了卓越的推理质量。值得注意的是，MAVEN始终优于潜在推理模型，如GEMINI-3.1-Pro和基于共识的基线（例如ReConcile），通过生成显式结构化、模块化和可验证的深思熟虑轨迹，而不是依赖于隐式内部状态或事后共识。此外，全面评估确认MAVEN完全模型无关，作为一个强大且可转移的推理增强工具，在各种基础模型中带来了显著的性能提升。

cs.CL / 66 / 2605.07647

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

自动化短答案评分中的质量条件一致性：中等范围退化及任务特定适应的影响

Schleifer, Abigail Victoria Gurin, Ariely, Moriah, Klebanov, Beata Beigman, Salman, Asaf, Alexandron, Giora

Abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: it is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
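Quality-conditioned agreement can be illustrated by grouping responses by their ground-truth score before measuring agreement. The sketch below assumes integer scores and uses exact-match agreement; the band names, the function `agreement_by_band`, and the sample scores are hypothetical, and the paper may use a different agreement statistic.

```python
from collections import defaultdict
from typing import Dict, Sequence


def band(gold_score: int, max_score: int) -> str:
    """Assign a response to a quality band from its ground-truth score."""
    if gold_score <= 0:
        return "incorrect"
    if gold_score >= max_score:
        return "correct"
    return "mid"


def agreement_by_band(gold: Sequence[int], predicted: Sequence[int],
                      max_score: int) -> Dict[str, float]:
    """Exact-match agreement between scorer and ground truth, per band."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for g, p in zip(gold, predicted):
        b = band(g, max_score)
        totals[b] += 1
        hits[b] += int(g == p)
    return {b: hits[b] / totals[b] for b in totals}


# Invented scores on a 0-2 scale: the scorer errs only on a mid-range answer.
gold = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
print(agreement_by_band(gold, predicted, max_score=2))
```

In this invented sample, overall agreement is five out of six, but the per-band view shows that the only error falls in the mid band, the pattern the abstract calls mid-range degradation.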

Chinese Translation

自动化短答案评分（ASAS）正从判别性、精细调优的模型转向在少量示例设置中使用的大型语言模型（LLMs）。这一范式利用了LLMs广泛的世界知识和易于部署的特点，但有限的任务特定数据可能会降低在复杂评分任务上的一致性。尤其是，它对需要细致解释的部分正确回答的评分影响仍然未被充分探讨。我们研究了不同模型的任务特定适应程度与质量条件评分一致性之间的关系。我们在两个开放式生物学题目上比较了三种LLMs（GPT-5.2、GPT-4o、Claude Opus 4.5）在少量示例模式下的表现，一个经过精细调优的基于BERT的编码器，以及一位人类专家，使用了数百个学生回答和由生物教育专家提供的真实评分。结果表明，人类之间的一致性在整个质量范围内最高且稳定。所有AI模型在完全正确和完全错误的回答上表现良好，但在中等范围的回答上表现出显著的退化。这种中等范围的退化与任务特定适应相关：在示例较少的少量示例LLMs中最为严重，随着任务特定数据的增加而减轻，精细调优的编码器模型表现最佳。这种中等范围的退化可能导致对理解尚在发展的学生所产生回答的不公平评估。我们的研究结果强调了质量条件公平性的重要性，特别关注中等范围的回答。

cs.CL / 67 / 2605.07660

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

并非所有的标记学习方式相同：注意力熵揭示了强化学习推理中的异质信号

Li, Gengyang, Wu, Zheng-Fan, Bao, Siqi, Wu, Yunfang

Abstract

Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random 20 percent token subsets preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention improves Qwen3-8B-Base from 34.39 to 37.40 held-out average in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training.
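Attention entropy and the anchor/explorer split can be sketched as follows, assuming one attention-weight vector per response token (how weights are aggregated over heads and layers is not stated in the abstract). The function names and the selection fraction used to pick each group are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a normalized attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def split_anchor_explorer(entropies: Sequence[float],
                          fraction: float = 0.2) -> Tuple[List[int], List[int]]:
    """Indices of the lowest-entropy tokens (anchors) and the
    highest-entropy tokens (explorers), each a `fraction` of all tokens."""
    k = max(1, round(len(entropies) * fraction))
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    return sorted(order[:k]), sorted(order[-k:])


# Invented attention rows for three response tokens over four context slots.
rows = [
    [1.0, 0.0, 0.0, 0.0],       # fully concentrated support
    [0.25, 0.25, 0.25, 0.25],   # fully diffuse support
    [0.7, 0.1, 0.1, 0.1],
]
entropies = [attention_entropy(r) for r in rows]
print(entropies)
print(split_anchor_explorer(entropies, fraction=0.34))
```

A token whose attention is concentrated on one position has entropy 0, while uniform attention over n positions gives log(n), the maximum.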

Chinese Translation

基于强化学习的后训练已成为提升大型语言模型推理能力的关键方法，但其标记级学习信号仍然不甚明了。本研究通过注意力熵来研究这些信号的异质性，注意力熵衡量每个响应标记的上下文支持是集中还是分散。我们首先展示了标记级强化学习目标的稀疏可估性：均匀随机的20%标记子集保留了大部分完整标记的保留性能，表明标记级更新存在显著冗余。然而，基于熵结构的子集表现出截然不同的行为。我们称之为锚点的低注意力熵标记依赖于集中支持，产生与完整标记更新一致的稳定梯度，并提供可靠的优化基础，但在更困难的基准上往往会停滞。我们称之为探索者的高注意力熵标记聚合了更分散的上下文，并引发更大但更不稳定的梯度。仅进行探索者训练在平均上是不稳定的，尽管少数成功的运行表明这些标记在优化保持稳定时可能包含有用的困难推理信号。我们通过证据收集分析、熵动态、梯度几何诊断和控制实验支持这一锚点-探索者谱系，显示位置、预测熵和损失归一化无法解释观察到的非对称性。最后，一种动态的熵感知软重加权干预将Qwen3-8B-Base的保留平均从34.39提高到37.40，表现出最佳效果。这些发现表明，注意力熵揭示了与优化相关的标记级强化学习信号结构，而均匀标记平均可能掩盖推理后训练中的有意义异质性。

cs.CL / 68 / 2605.07699

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

DRIP-R：零售领域中现实政策模糊下的决策与推理基准

Borkakoty, Hsuvas, Pohl, Sebastian, Wang, Cheng, Chen, Bei, Hou, Yufang

Abstract

LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.

Chinese Translation

基于大型语言模型（LLM）的智能体在现实领域中越来越多地被用于日常但重要的任务，其行为受固有的模糊领域政策的支配，这些政策允许多种有效的解释。尽管这种模糊性在实践中普遍存在，但现有的智能体基准大多假设政策是明确且规定良好的，从而留下了一个重要的评估空白。我们提出了DRIP-R，一个基准系统地利用现实零售政策的模糊性来构建没有单一正确解决方案的场景。DRIP-R包含一组经过精心策划的政策模糊回报场景，配备现实的客户角色、具备工具调用能力的全双工对话模拟以及涵盖政策遵循、对话质量、行为一致性和解决质量的多评审评估框架。我们的实验表明，前沿模型在相同的政策模糊场景中存在根本性分歧，确认了模糊性对LLM决策构成了真正且系统的挑战。

cs.CL / 69 / 2605.07701

Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

引导不是超参数：在扩散语言模型中学习动态控制

Zhou, Fan, Van de Cruys, Tim

Abstract

Classifier-Free Guidance (CFG) is a widely used mechanism for controlling diffusion-based generative models, yet its guidance scale is typically treated as a fixed hyperparameter throughout generation. This static design yields a suboptimal tradeoff between controllability and quality, as the optimal degree of guidance varies across tasks and across different stages of the diffusion process, especially in the NLP domain. We recast CFG scale selection as a sequential decision-making problem and propose to learn dynamic guidance trajectories via reinforcement learning. Specifically, we model the guidance scale as a discrete control action selected at each generation step based on the evolving diffusion state, and optimize a policy using Proximal Policy Optimization (PPO) under task-level rewards. Experiments on three controlled NLP generation tasks using discrete diffusion language models demonstrate that adaptive guidance consistently achieves a better balance between controllability and generation quality than fixed-scale strategies. Further analysis of the learned policies reveals distinct and interpretable guidance trajectories across tasks, underscoring the importance of treating guidance as a dynamic control process rather than a static design choice.
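The difference between a fixed and a per-step guidance scale can be sketched with the standard CFG combination, which extrapolates from the unconditional prediction toward the conditional one. Applying it to token logits, the discrete action set `SCALES`, and the hand-written `toy_policy` are assumptions for illustration; the paper learns its policy with PPO, which is not reproduced here.

```python
from typing import List, Sequence

SCALES = (0.0, 1.0, 2.0, 4.0)  # hypothetical discrete action set


def cfg_logits(cond: Sequence[float], uncond: Sequence[float],
               scale: float) -> List[float]:
    """Standard CFG combination: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def toy_policy(step: int, total_steps: int) -> int:
    """Hand-written stand-in for a learned policy: returns an index into
    SCALES, using strong guidance early and weak guidance late."""
    progress = step / max(1, total_steps - 1)
    return 3 if progress < 0.5 else 1


# Invented two-token logits, held constant across steps for readability.
cond, uncond = [2.0, 0.0], [1.0, 1.0]
for step in range(4):
    scale = SCALES[toy_policy(step, 4)]
    print(step, scale, cfg_logits(cond, uncond, scale))
```

A scale of 0 recovers the unconditional prediction and a scale of 1 the conditional one; larger values push further in the conditional direction, which is the knob a per-step policy would adjust.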

Chinese Translation

无分类器引导（Classifier-Free Guidance, CFG）是一种广泛使用的机制，用于控制基于扩散的生成模型，但其引导尺度通常在生成过程中被视为固定的超参数。这种静态设计导致可控性和质量之间的权衡次优，因为最佳的引导程度在不同任务和扩散过程的不同阶段之间变化，尤其是在自然语言处理（NLP）领域。我们将CFG尺度选择重新表述为一个序列决策问题，并提出通过强化学习学习动态引导轨迹。具体而言，我们将引导尺度建模为在每个生成步骤中根据不断变化的扩散状态选择的离散控制动作，并在任务级奖励下使用近端策略优化（Proximal Policy Optimization, PPO）优化策略。在使用离散扩散语言模型的三个受控NLP生成任务上的实验表明，适应性引导在可控性和生成质量之间始终实现了比固定尺度策略更好的平衡。对学习到的策略的进一步分析揭示了不同任务之间明显且可解释的引导轨迹，强调了将引导视为动态控制过程而非静态设计选择的重要性。

cs.CL / 70 / 2605.07711

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

SimCT：恢复跨标记器在线蒸馏中的丢失监督

Sun, Jie, Zheng, Mao, Song, Mingyang, Zhong, Qiyong, Cheng, Yilin, Feng, Bichuan, Liu, Pengfei, Fang, Junfeng, Wang, Xiang

Abstract

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-.
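One way to picture "the finest jointly tokenizable" units is to cut the text only at character offsets where both tokenizations have a token boundary. The sketch below does exactly that on invented token lists; it is an assumption about what such units could look like, not the construction used in the SimCT code.

```python
from itertools import accumulate
from typing import List, Sequence, Set


def boundaries(tokens: Sequence[str]) -> Set[int]:
    """Character offsets at which a token ends."""
    return set(accumulate(len(t) for t in tokens))


def joint_units(teacher_tokens: Sequence[str],
                student_tokens: Sequence[str]) -> List[str]:
    """Split the text at every offset where BOTH tokenizations end a token.
    Each resulting span is a whole number of tokens under either tokenizer."""
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    shared = sorted(boundaries(teacher_tokens) & boundaries(student_tokens))
    units, start = [], 0
    for end in shared:
        units.append(text[start:end])
        start = end
    return units


# Invented tokenizations of the same string by two different tokenizers.
teacher = ["un", "believ", "able", " result"]
student = ["unbe", "liev", "able", " re", "sult"]
print(joint_units(teacher, student))
```

Here only "able" is a token shared by both sides; exact shared-token matching would supervise that position alone, while the other two units are multi-token spans on at least one side and would otherwise be dropped.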

Chinese Translation

在线蒸馏（OPD）是将教师行为转移到更小学生的标准工具，但它隐含地假设教师和学生的预测是逐个标记可比的，这一假设在两个模型以不同方式对同一文本进行标记时失效。在异构标记器下，精确的共享标记匹配在词汇不一致的确切位置悄然丢弃了大量教师信号。我们提出了 Simple Cross-Tokenizer OPD (SimCT)，通过扩大监督空间来恢复这一信号：除了共享标记外，SimCT还在两个标记器都能实现的短多标记延续上比较教师和学生，从而保持OPD损失形式不变。我们展示了这些单元是最精细的联合可标记监督接口，而粗糙的替代方案则去除了对在线学习有用的教师-学生区分。在数学推理和代码生成基准上的三个异构教师-学生对中，SimCT在共享词汇OPD和代表性跨标记器基线之上显示出一致的提升，消融实验确认这些改进来自恢复被精确共享标记匹配丢弃的监督。代码可在 https://github.com/sunjie279/SimCT- 获取。

cs.CL / 71 / 2605.07721

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

内存高效的循环变换器：在循环语言模型中解耦计算与内存

Vendrell, Victor Conchello, Masdemont, Arnau Padres, Grillo, Niccolò, Ros-Giralt, Jordi, Behboodi, Arash, Massoli, Fabio Valerio

Abstract

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two-phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
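The memory argument can be illustrated with a toy cache. The abstract says only that the shared cache is updated through a learnable gate, so the convex blend below, the fixed gate values, and the function names are assumptions made for illustration.

```python
from typing import List, Sequence


def gated_update(cache: Sequence[float], new: Sequence[float],
                 gate: Sequence[float]) -> List[float]:
    """Blend new loop values into the shared cache, entry by entry.
    gate = 1 overwrites an entry, gate = 0 keeps the old value."""
    return [g * n + (1.0 - g) * c for c, n, g in zip(cache, new, gate)]


def shared_cache_after_loops(per_loop_values: Sequence[Sequence[float]],
                             gate: Sequence[float]) -> List[float]:
    """Fold every loop's values into one cache of fixed size."""
    cache = list(per_loop_values[0])
    for new in per_loop_values[1:]:
        cache = gated_update(cache, new, gate)
    return cache


# Invented values for one layer over three reasoning loops.
loops = [[0.0, 4.0], [1.0, 4.0], [1.0, 0.0]]
gate = [0.5, 0.25]  # would be learned; fixed here for the example

shared = shared_cache_after_loops(loops, gate)
per_loop_entries = sum(len(v) for v in loops)  # grows with loop count
print(shared, len(shared), per_loop_entries)
```

The shared cache holds two entries no matter how many loops run, whereas keeping a separate cache per loop stores six entries after three loops and keeps growing with depth.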

Chinese Translation

递归大语言模型（LLM）架构作为一种有前景的方法，已逐渐显现出提升推理能力的潜力，因为它们能够在嵌入空间中进行多步计算而无需生成中间标记。像 Ouro 这样的模型通过在迭代过程中不断更新内部表示，同时在迭代之间保留标准的键值（Key-Value, KV）缓存，从而进行推理，这导致内存消耗随着推理深度线性增长。因此，增加推理迭代次数可能会导致过高的内存使用，限制了此类架构的实际可扩展性。在本研究中，我们提出了内存高效的循环变换器（Memory-Efficient Looped Transformer, MELT），这是一种新颖的架构，能够将推理深度与内存消耗解耦。MELT 不再为每一层和每个循环使用标准的 KV 缓存，而是为每一层维护一个共享的 KV 缓存，跨越推理循环。该缓存通过可学习的门控机制随着时间的推移进行更新。为了在该架构下实现稳定和高效的训练，我们建议采用分块训练的方式，通过两个阶段的程序来训练 MELT：插值过渡，随后是与注意力对齐的蒸馏，从 LoopLM 起始模型到 MELT。实证结果表明，从预训练的 Ouro 参数微调的 MELT 模型在性能上优于同等规模的标准 LLM，同时保持与这些模型相当的内存占用，并显著小于 Ouro 的内存占用。总体而言，MELT 实现了恒定内存的迭代推理，而不牺牲 LoopLM 的性能，仅使用轻量级的后训练程序。

cs.CL / 72 / 2605.07725

SOD: Step-wise On-policy Distillation for Small Language Model Agents

SOD：小型语言模型代理的逐步在线蒸馏

Zhong, Qiyong, Zheng, Mao, Song, Mingyang, Lin, Xin, Sun, Jie, Jiang, Houcheng, Wang, Xiang, Fang, Junfeng

Abstract

Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity, and reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. SOD can thereby attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
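Step-wise reweighting can be sketched as a per-step weight that shrinks as student-teacher divergence grows. The exponential form, the `temperature` parameter, and the token-count normalization below are assumptions chosen for illustration; the abstract does not specify the weighting function.

```python
import math
from typing import List, Sequence


def step_weights(divergences: Sequence[float],
                 temperature: float = 1.0) -> List[float]:
    """Map each step's divergence to a weight in (0, 1]; higher divergence
    gives a smaller weight. The exponential form is an assumption."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [math.exp(-d / temperature) for d in divergences]


def weighted_distill_loss(step_token_losses: Sequence[Sequence[float]],
                          divergences: Sequence[float],
                          temperature: float = 1.0) -> float:
    """Average token-level distillation loss with each step's tokens scaled
    by that step's weight."""
    weights = step_weights(divergences, temperature)
    total, count = 0.0, 0
    for w, losses in zip(weights, step_token_losses):
        total += w * sum(losses)
        count += len(losses)
    if count == 0:
        raise ValueError("no token losses given")
    return total / count


# Invented trajectory: step 2 follows a bad tool call and diverges strongly.
token_losses = [[1.0, 1.0], [2.0]]
print(weighted_distill_loss(token_losses, divergences=[0.0, 0.0]))
print(weighted_distill_loss(token_losses, divergences=[0.0, 3.0]))
```

With zero divergence everywhere the loss reduces to plain token-averaged distillation; raising the divergence of the second step shrinks that step's contribution while the first step is untouched.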

Chinese Translation

工具集成推理（TIR）由于长时间工具交互的不稳定性和模型能力的限制，难以扩展到小型语言模型。尽管强化学习方法如群体相对策略优化仅提供稀疏的结果级奖励，但最近在线策略蒸馏（OPD）因其从教师对学生生成的轨迹提供密集的标记级监督而受到关注。然而，我们的实验表明，将OPD应用于TIR会导致一种关键的失败模式：错误的工具调用往往在后续推理步骤中级联，逐渐放大学生与教师之间的差异，使得教师的标记级监督变得越来越不可靠。为了解决这个问题，我们提出了SOD，一个针对小型语言模型代理的逐步在线蒸馏框架，该框架根据每一步的差异自适应地重新加权蒸馏强度。因此，SOD能够在高差异区域减弱潜在误导的教师信号，同时在对齐良好的状态下保持密集的指导。在具有挑战性的数学、科学和代码基准测试中的实验表明，SOD相较于第二最佳基线提高了多达20.86%。值得注意的是，我们的0.6B学生模型在AIME 2025上达到了26.13%的成绩，展示了代理推理向轻量级模型的有效转移。我们的代码可在https://github.com/YoungZ365/SOD获取。

cs.CL / 73 / 2605.07731

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

EngGPT2-16B-A3B与可比意大利及国际开源大型语言模型的基准测试

Sassella, Andrea, Chizzola, Andrea, Bianchi, Tommaso, Alessandrelli, Luca, Carman, Mark James

Abstract

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k setting. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings show the new model to be a step forward for native Italian Large Language Models.

Chinese Translation

本报告对ENGINEERING Ingegneria Informatica S.p.A.的EngGPT2MoE-16B-A3B大型语言模型（LLM）进行了性能基准测试，该模型为一个具有160亿参数的专家混合（Mixture of Experts, MoE）模型，具有30亿活跃参数。我们在多种具有代表性的基准测试中调查了其性能，并与同类规模的开源MoE和稠密模型进行了比较。在与流行的意大利模型（如FastwebMIIA-7B、Minerva-7B、Velvet-14B和LLaMAntino-3-ANITA-8B）的比较中，EngGPT2MoE-16B-A3B在国际基准测试（如ARC-Challenge、GSM8K、AIME24、AIME25、MMLU和HumanEval (HE)）中的表现相当或更好。该模型在RULER基准的最长上下文设置（32k）中表现最佳。在意大利基准数据集ITALIC上，该模型的表现与其他模型相当或更好，除了Velvet-14B，其表现优于本模型。与同类规模的流行MoE模型相比，新模型在所有考虑的基准测试中均报告了比DeepSeek-MoE-16B-Chat更高的值。在HE、MMLU、AIME24、AIME25、GSM8K和32k RULER设置中，其值高于Moonlight-16B-A3B，但在BFCL以及某些ARC和ITALIC设置中较低。最后，在大多数基准测试中，其值低于GPT-OSS-20B，包括HE、MMLU、AIME24、AIME25、GSM8K、ARC、BFCL和RULER 32k。与流行的稠密模型相比，EngGPT2MoE-16B-A3B在AIME24和AIME25上的表现优于Llama-3.1-8B-Instruct、Gemma-3-12b-it和Ministral-3-8BInstruct-2512-BF16，但在ITALIC、BFCL和32k上下文的RULER中表现较低。当将所有基准指标的性能进行汇总时，EngGPT2MoE-16B-A3B的表现高于评估中的意大利模型，同时低于一些表现最好的国际模型，特别是GPT-5 nano和Qwen3-8B。综合来看，我们的研究结果表明，该新模型是意大利本土大型语言模型的一次进步。

cs.CL / 74 / 2605.07748

TextLDM: Language Modeling with Continuous Latent Diffusion

TextLDM：基于连续潜在扩散的语言建模

Jiang, Jiaxiu, Ren, Jingjing, Li, Wenbo, Wang, Bo, Sun, Haoze, Yang, Yijun, Liu, Jianhui, Zhang, Yanbing, Zheng, Shenghe, Zhang, Yuan, Huang, Haoyang, Duan, Nan, Zuo, Wangmeng

Abstract

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
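The flow-matching objective that the DiT optimizes in latent space can be sketched on plain vectors. The abstract does not specify the probability path, so the straight-line interpolation between noise and data used below is an assumption, and a real model would replace the hand-written velocity function with a conditioned network.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def interpolate(noise: Vector, data: Vector, t: float) -> List[float]:
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> List[float]:
    """Velocity of the straight path, constant in t: data - noise."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted: Vector, noise: Vector, data: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(noise: Vector,
                 velocity_fn: Callable[[List[float], float], List[float]],
                 steps: int) -> List[float]:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Invented 2-d latent: with the exact velocity, sampling recovers the data.
noise, data = [0.0, 0.0], [2.0, 4.0]
exact = target_velocity(noise, data)
print(interpolate(noise, data, 0.5))
print(flow_matching_loss(exact, noise, data))
print(euler_sample(noise, lambda x, t: exact, steps=4))
```

Training would sample a time t, form the interpolated latent, and regress the network's output at that point onto the target velocity; generation then integrates the learned velocity field starting from noise.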

Chinese Translation

通过在变分自编码器（VAE）潜在空间中使用流匹配训练的扩散变换器（DiT）实现了图像和视频的统一视觉生成。朝着为生成（视觉合成）和理解（文本生成）提供单一架构的自然下一步是将这一框架应用于语言建模。我们提出了TextLDM，它在最小架构修改的情况下将视觉潜在扩散方法转移到文本生成中。基于变换器的VAE将离散标记映射到连续潜在空间，并通过与冻结的预训练语言模型进行表征对齐（REPA）来增强，以生成适用于条件去噪的有效表征。标准的DiT随后在该潜在空间中执行流匹配，其架构与视觉对应物相同。我们所面临的核心挑战是获得高质量的连续文本表征：我们发现，仅靠重构保真度是不够的，通过REPA将潜在特征与预训练语言模型对齐对于下游生成质量至关重要。在OpenWebText2上从零开始训练，TextLDM显著优于先前的扩散语言模型，并在相同设置下与GPT-2相匹配。我们的结果表明，视觉DiT方法有效地转移到语言领域，为多模态生成和理解的统一扩散架构迈出了具体的一步。

cs.CL / 75 / 2605.07782

CktFormalizer: Autoformalization of Natural Language into Circuit Representations

CktFormalizer：自然语言自动形式化为电路表示

Xiong, Jing, Han, Qi, Ding, Chenchen, Xiao, He, Su, Zunhai, Tao, Chaofan, Wong, Ngai

Abstract

LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker: dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall: compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant: the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95--100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proving ensuring that each optimized variant remains functionally equivalent to its formal specification.

Chinese Translation

大型语言模型（LLMs）可以根据自然语言规范生成硬件描述，但生成的Verilog代码往往存在宽度不匹配、组合环路和不完整的案例逻辑，这些问题在语法检查中通过，但在综合或硅片实现中失败。我们提出了CktFormalizer，一个通过嵌入在Lean 4中的依赖类型硬件描述语言（HDL）来重定向LLM驱动的硬件生成的框架。Lean在这里扮演了三个角色：（i）类型检查器：依赖类型编码了位宽约束、案例覆盖和无环性，将硬件缺陷转化为编译时错误，从而指导迭代修复；（ii）正确性防火墙：编译后的设计在结构上没有导致无声后端失败的缺陷（基线在综合和布线过程中损失了20%的正确设计；而CktFormalizer保留了所有这些设计）；（iii）证明助手：该代理构建了针对任意输入序列和参数化宽度的机器检查等价性证明，超出了有界SMT检查的能力。在VerilogEval（156个问题）、RTLLM（50个问题）和ResBench（56个问题）上，CktFormalizer实现的仿真通过率与直接生成Verilog代码相当，同时提供了显著更高的后端可实现性：95%至100%的编译设计完成了完整的综合、放置与布线、设计规则检查（DRC）和版图验证（LVS）流程。一个闭环的PPA优化阶段通过经过验证的架构探索实现了高达35%的面积减少和30%的功耗减少，同时自动化的定理证明确保每个优化变体在功能上与其形式规范保持等效。

cs.CL / 76 / 2605.07783

Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

基于链的蒸馏用于可变大小小型语言模型的有效初始化

Shi, Boyu, Jiang, YiCheng, Liu, Chang, Wang, Qiufeng, Yang, Xu, Geng, Xin

Abstract

Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To solve these problems, we propose Chain-based Distillation (CBD), a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce bridge distillation for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation between adjacent anchors, eliminating repeated large-teacher inference. Experiments show that the proposed method substantially improves efficiency and downstream performance. Without recovery pre-training, a 138M-parameter SLM outperforms models trained from scratch on a 10B-token corpus on the specific task. CBD also demonstrates versatility in heterogeneous settings by initializing models with different architectures and vocabularies.

Chinese Translation

大型语言模型（LLMs）在性能上表现出色，但在资源受限的环境中部署成本依然高昂。从头训练小型语言模型（SLMs）计算开销巨大，而传统的知识蒸馏需要对不同目标大小的多个大型教师模型进行重复访问，导致可扩展性差。为了解决这些问题，我们提出了基于链的蒸馏（CBD），这是一种可扩展的范式，用于高效初始化可变大小的语言模型。通过逐步蒸馏构建稀疏且有限的中间模型序列（称为锚点），形成一个蒸馏链，逐步从源LLMs中转移知识。为了支持异构设置，我们引入了桥接蒸馏，用于跨架构和跨词汇的转移。通过相邻锚点之间的参数插值初始化可变大小的模型，消除了对大型教师模型的重复推理。实验表明，所提出的方法显著提高了效率和下游性能。在特定任务上，138M参数的SLM在没有恢复预训练的情况下，优于在10B标记语料库上从头训练的模型。CBD还展示了在异构设置中初始化具有不同架构和词汇的模型的多样性。

cs.CL / 77 / 2605.07793

Hybrid TF--IDF Logistic Regression and MLP Neural Baseline for Indonesian Three-Class Sentiment Analysis on Social Media Text

印尼社交媒体文本的三类情感分析的混合 TF--IDF 逻辑回归与 MLP 神经基线

Pasha, Allya Nurul Islami, Putri, Eka Fidiya, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin C. T.

Abstract

This paper presents a compact three-class sentiment analysis study for Indonesian social media text. The task is formulated with positive, negative, and neutral outputs derived from a fine-grained emotion dataset. The proposed practical baseline combines TF--IDF text features, three lightweight numeric metadata features, and a balanced multinomial Logistic Regression classifier. For comparison, the study also includes a neural baseline using a two-layer multilayer perceptron (MLP) over the same hybrid feature representation. The dataset originally contains 732 rows and 191 fine-grained emotion labels; after cleaning, deduplication, and label remapping, 707 samples remain with an imbalanced distribution of 459 positive, 188 negative, and 60 neutral instances. Experimental results show that the Logistic Regression deployment model reaches 0.8028 accuracy, 0.8003 weighted F1, and 0.7276 macro F1, while project documentation reports a higher-accuracy but non-production MLP baseline. These findings indicate that careful preprocessing, interpretable feature engineering, and class balancing remain competitive for small Indonesian sentiment datasets, whereas the neural baseline is better treated as a comparative experiment than as the default deployment model.
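The gap between weighted F1 (0.8003) and macro F1 (0.7276) reported above is what one would expect on an imbalanced label set, since macro F1 gives the small neutral class the same weight as the large positive class. The sketch below illustrates the two averages on made-up toy labels; the data and function names are illustrative assumptions, not the paper's code or results.

```python
def per_class_f1(y_true, y_pred, labels):
    """Compute F1 for each label from true and predicted label lists."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(scores):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(scores.values()) / len(scores)

def weighted_f1(scores, y_true):
    """Mean weighted by class support, so frequent classes dominate."""
    n = len(y_true)
    return sum(f1 * y_true.count(label) / n for label, f1 in scores.items())

# Toy, imbalanced labels (hypothetical): the rare "neu" class is always missed.
y_true = ["pos"] * 6 + ["neg"] * 3 + ["neu"] * 1
y_pred = ["pos"] * 6 + ["neg", "neg", "pos"] + ["pos"]

scores = per_class_f1(y_true, y_pred, ["pos", "neg", "neu"])
macro = macro_f1(scores)
weighted = weighted_f1(scores, y_true)
print(scores, macro, weighted)
```

In this toy case the missed minority class pulls the macro average well below the weighted one, which is why reporting both is informative for small, skewed datasets.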

Chinese Translation

本文提出了一项针对印尼社交媒体文本的紧凑型三类情感分析研究。该任务通过从细粒度情感数据集中派生的正面、负面和中性输出进行表述。所提出的实用基线结合了 TF--IDF 文本特征、三个轻量级数值元数据特征，以及一个平衡的多项式逻辑回归分类器。为了进行比较，研究还包括了使用两层多层感知器（MLP）的神经基线，该基线在相同的混合特征表示上进行训练。数据集最初包含 732 行和 191 个细粒度情感标签；经过清理、去重和标签重映射后，剩余 707 个样本，分布不平衡，分别为 459 个正面、188 个负面和 60 个中性实例。实验结果表明，逻辑回归部署模型的准确率达到 0.8028，权重 F1 值为 0.8003，宏观 F1 值为 0.7276，而项目文档报告的 MLP 基线具有更高的准确率，但并未投入生产。这些发现表明，仔细的预处理、可解释的特征工程和类别平衡在小型印尼情感数据集中仍然具有竞争力，而神经基线更适合作为比较实验，而非默认的部署模型。

cs.CL / 78 / 2605.07796

PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism

PolySQL：通过自动化后端同构实现跨SQL方言的文本到SQL评估扩展

Perlitz, Yotam, Venezian, Elad, Royer, Corentin, Fusco, Francesco, Giovannini, Andrea

Abstract

SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (Cohen's $\kappa$), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average accuracy drop from SQLite to other dialects and identifies a significant dialect difficulty hierarchy. We find this degradation stems from logical rather than syntactic errors (61% vs. 8%). We release our framework code and leaderboard to enable rigorous dialect-robust evaluation.
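One way to read the dual-execution idea is: run the reference query on one backend, run the predicted query on another backend holding equivalent data, and compare the two result sets after normalizing engine-specific differences. The sketch below is a minimal illustration under that assumption. It uses two in-memory SQLite databases with different column typing as stand-ins for two dialects, and the normalization rules and all function names are hypothetical rather than PolySQL's actual implementation.

```python
import sqlite3
from collections import Counter

def normalize_cell(value, ndigits=6):
    """Map a cell to a backend-independent form (hypothetical rules)."""
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, (int, float)):
        # Treat 3 and 3.0 as the same value and absorb float noise.
        return round(float(value), ndigits)
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value  # str or None

def normalize_result(rows, ordered=False):
    """Normalize a result set; compare as a multiset unless order matters."""
    normalized = [tuple(normalize_cell(c) for c in row) for row in rows]
    return normalized if ordered else Counter(normalized)

def results_match(conn_a, sql_a, conn_b, sql_b, ordered=False):
    """Run each query on its own backend and compare normalized results."""
    rows_a = conn_a.execute(sql_a).fetchall()
    rows_b = conn_b.execute(sql_b).fetchall()
    return normalize_result(rows_a, ordered) == normalize_result(rows_b, ordered)

def make_backend(price_type):
    """Build a toy in-memory database; price_type mimics dialect typing."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE items (name TEXT, price {price_type})")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [("pen", 2), ("book", 10), ("lamp", 25)])
    return conn

backend_a = make_backend("INTEGER")  # returns prices as int
backend_b = make_backend("REAL")     # returns prices as float

gold_sql = "SELECT name, price FROM items WHERE price > 5 ORDER BY price"
pred_sql = "SELECT name, price FROM items WHERE price >= 10"
same = results_match(backend_a, gold_sql, backend_b, pred_sql)
different = results_match(backend_a, gold_sql, backend_b,
                          "SELECT name, price FROM items")
print(same, different)
```

Because only results are compared, neither query has to be rewritten into the other backend's syntax; the cost is that the normalization rules must be chosen carefully so that genuine value differences are not masked.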

Chinese Translation

SQL方言在数据库引擎之间的语法、类型和函数上存在差异。然而，文本到SQL基准测试主要只支持SQLite。这造成了一个关键的评估缺口：跨方言评估显示每个查询的一致性较弱（Cohen's $\kappa$），表明SQLite的性能并不能可靠地代表其他方言。然而，这种评估仍然极其困难：现有的方法要么需要昂贵的手动查询转换，要么依赖于在复杂SQL上常常失败的工具。为了填补这一空白，我们提出了PolySQL，一种新颖的双执行方法，通过比较规范化的执行结果消除了查询转换的需求。值得注意的是，我们的方法在查询覆盖率达到100%的情况下，评估的保真度高于查询转换。PolySQL包含三个数据集，使得首次大规模跨方言研究成为可能。我们的研究揭示了从SQLite到其他方言的平均准确率下降了10.1%，并确定了显著的方言难度层级。我们发现这种下降源于逻辑错误而非语法错误（61%对8%）。我们发布了我们的框架代码和排行榜，以便进行严格的方言稳健性评估。

cs.CL / 79 / 2605.07806

Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

超越信心：重新思考大型语言模型的自我评估以预测性能

Bhattacharyya, Sree, Khanna, Samarth, Chen, Leona, Craig, Lucas, Dilliraj, Tharun, Wang, James Z.

Abstract

Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.

Chinese Translation

大型语言模型（LLMs）在可靠的自我评估至关重要的环境中被越来越广泛地使用。评估模型可靠性的方式已经从使用概率正确性估计演变为最近的口头信心引导。然而，研究表明，信心在预测模型正确性方面是不一致且过于乐观的。基于认知评估理论，这是一种将自我评估分解为多个组成部分的人类心理学框架，我们提出了一种关于模型自我评估的多维视角。我们引导出六个基于评估的自我评估维度，以及信心，并评估它们在预测12个LLMs和38个跨越八个领域的任务中的模型失败的效用。我们发现，与信心相比，能力相关的评估维度，特别是努力和能力，在大多数情况下始终匹配或超越信心。此外，努力还产生了更少的过于乐观的估计，并在不同模型规模中保持稳定。相对而言，情感维度提供的预测信号则边际性较小。此外，最具信息量的维度与任务特征系统性变化：对于推理密集型任务，努力是最具预测性的，而在检索导向任务中，能力和信心则占主导地位。总体而言，我们的研究结果表明，结构化的多维自我评估是一种有前景的方法，可以提高语言模型在各种现实世界环境中的可靠性和安全性。

cs.CL / 80 / 2605.07811

A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches for Sentiment Classification on IMDb Movie Reviews

经典机器学习与深度学习方法在IMDb电影评论情感分类中的比较分析

Safitri, Erma Daniar, Ichisasmita, Lia Hana, Agustin, Citra, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin Clinton Tosima

Abstract

This paper presents a comparative study of classical machine learning and deep learning methods for sentiment classification on the IMDb movie reviews dataset. The machine learning pipeline uses TF-IDF features and PyCaret AutoML to evaluate Logistic Regression, Naïve Bayes, and Support Vector Machine, while the deep learning pipeline implements BiLSTM and BiLSTM with an attention mechanism. Experimental results show that classical machine learning, especially SVM, achieves the best performance with an accuracy of 0.8530, outperforming the deep learning models in this study. The BiLSTM with Attention model improves over the standard BiLSTM and reaches an accuracy of 0.706, indicating better contextual modeling. The paper concludes that although deep learning can capture sequential dependencies, classical machine learning remains a strong baseline when combined with effective feature engineering such as TF-IDF, particularly under limited data and computational resources.

Chinese Translation

本文对经典机器学习和深度学习方法在IMDb电影评论数据集上的情感分类进行了比较研究。机器学习流程使用TF-IDF特征和PyCaret AutoML来评估逻辑回归、朴素贝叶斯和支持向量机，而深度学习流程则实现了双向长短期记忆网络（BiLSTM）及带有注意力机制的双向长短期记忆网络（BiLSTM with Attention）。实验结果表明，经典机器学习，特别是支持向量机（SVM），在本研究中以0.8530的准确率取得了最佳表现，超越了深度学习模型。带有注意力机制的BiLSTM模型在标准BiLSTM的基础上有所改进，达到了0.706的准确率，表明其在上下文建模方面表现更佳。本文总结认为，尽管深度学习能够捕捉序列依赖关系，但在结合有效特征工程（如TF-IDF）时，经典机器学习仍然是一个强有力的基线，尤其是在数据和计算资源有限的情况下。

cs.CL / 81 / 2605.07823

SCENE: Recognizing Social Norms and Sanctioning in Group Chats

SCENE：识别群聊中的社会规范与制裁

Jacniacki, Mateusz, Bilski, Maksymilian

Abstract

Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adapting to norms inferred from peers' behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes one benchmark in the direction of recent calls for dynamic, interactional evaluation of LLM social capabilities.

Chinese Translation

在线群聊是具有隐含行为模式的社交空间，当这些模式被打破时，往往会遭到群体的社会制裁。基于大型语言模型（LLM）的代理识别和适应这些规范的能力和意愿仍然大多未被探索。我们介绍了SCENE，这是一个专注于多方聊天中隐含规范和社会制裁的社交互动基准。SCENE生成合理的非角色扮演场景，包含遵循隐含规范的脚本化角色，创造机会让主体代理违反这些规范，并在违规发生时进行制裁。我们进一步提出了两种功能适应能力的行为评估指标：对负面制裁的响应能力，以及从同伴行为中适应规范的能力。我们在SCENE上评估了六个前沿和开放权重模型。我们的结果表明，Claude Opus 4.7和Gemini 3.1 Pro在适应隐含规范方面显著优于被评估的开放权重模型。SCENE为近期对大型语言模型社交能力的动态互动评估的呼吁提供了一个基准。

cs.CL / 82 / 2605.07847

Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors

测量和缓解真实用户行为与模拟用户行为之间的分布差距

Mehri, Shuhaib, Laban, Philippe, Shashidhar, Sumuk, Abdulhai, Marwa, Levine, Sergey, Galley, Michel, Hakkani-Tür, Dilek

Abstract

As user simulators are increasingly used for interactive training and evaluation of AI assistants, it is essential that they represent the diverse behaviors of real users. While existing works train user simulators to generate human-like responses, whether they capture the broad and heterogeneous distribution of real user behaviors remains an open question. In this work, we introduce a method to measure the distributional gap between real and simulated user behaviors, validated through a human study and ablations. Given a dataset of real and simulated conversations, our method extracts representations of user behavior from each conversation, quantizes them into discrete distributions via clustering, then computes divergence metrics. We provide the first systematic evaluation of 24 LLM-based user simulators on coding and writing tasks, and reveal a large distributional gap from real users that varies across model families, scales, and behavioral facets. Pairwise comparisons show that most simulators behave similarly, while a few stand apart. Combining behaviorally complementary simulators brings the resulting distribution closer to real users compared to either simulator on its own. Finally, a TF-IDF analysis of the clusters surfaces interpretable patterns of behaviors that simulators capture, miss, and hallucinate.
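The measurement pipeline described above (behavior representations, clustering into discrete distributions, then a divergence) can be illustrated from the point where each conversation already has a cluster ID. The sketch below assumes Jensen-Shannon divergence as the divergence metric, which the abstract does not specify; the cluster assignments are toy data and the function names are hypothetical.

```python
import math
from collections import Counter

def cluster_distribution(cluster_ids, num_clusters):
    """Turn per-conversation cluster IDs into a discrete distribution."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return [counts.get(k, 0) / total for k in range(num_clusters)]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

NUM_CLUSTERS = 4
# Toy cluster assignments (hypothetical): real users spread over all behaviors,
# each simulator covers only half of them.
real_ids = [0, 0, 1, 1, 2, 2, 3, 3]
sim_a_ids = [0, 0, 0, 0, 1, 1, 1, 1]
sim_b_ids = [2, 2, 2, 2, 3, 3, 3, 3]

real = cluster_distribution(real_ids, NUM_CLUSTERS)
gap_a = js_divergence(real, cluster_distribution(sim_a_ids, NUM_CLUSTERS))
gap_b = js_divergence(real, cluster_distribution(sim_b_ids, NUM_CLUSTERS))
# Pooling two complementary simulators, as the abstract suggests, narrows the gap.
gap_combined = js_divergence(
    real, cluster_distribution(sim_a_ids + sim_b_ids, NUM_CLUSTERS))
print(gap_a, gap_b, gap_combined)
```

The toy data is constructed so that the pooled simulators exactly cover the real distribution; with real conversations the combined gap would shrink only to the extent that the simulators' behaviors are genuinely complementary.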

Chinese Translation

随着用户模拟器在人工智能助手的互动训练和评估中越来越多地被使用，确保它们能够代表真实用户的多样化行为变得至关重要。虽然现有研究训练用户模拟器以生成类人响应，但它们是否捕捉到了真实用户行为的广泛和异质分布仍然是一个未解的问题。在本研究中，我们提出了一种测量真实用户行为与模拟用户行为之间分布差距的方法，并通过人类研究和消融实验进行了验证。给定一组真实和模拟对话的数据集，我们的方法从每个对话中提取用户行为的表示，通过聚类将其量化为离散分布，然后计算发散度指标。我们首次对24个基于大型语言模型（LLM）的用户模拟器在编码和写作任务上的表现进行了系统评估，揭示了与真实用户之间存在较大的分布差距，并且这种差距在不同模型系列、规模和行为方面有所不同。成对比较显示，大多数模拟器的行为相似，而少数则表现出显著差异。结合行为互补的模拟器使得最终的分布更接近真实用户，相较于单独使用任一模拟器效果更佳。最后，对聚类的TF-IDF分析揭示了模拟器所捕捉、遗漏和虚构的行为模式。

cs.CL / 83 / 2605.07850

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

MatryoshkaLoRA：学习准确的层次低秩表示以进行大语言模型微调

Modoranu, Ionut-Vlad, Safaryan, Mher, Alistarh, Dan

Abstract

With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
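The role of the diagonal matrix $P$ can be written out explicitly. The block below is a sketch under assumed notation: $W_0$ for the frozen weight, $B$ and $A$ for the LoRA factors, and $b_i$, $a_i^{\top}$ for the $i$-th column of $B$ and the $i$-th row of $A$. That a sub-rank $k$ adapter keeps the leading $k$ components is also an assumption, following the usual nested-rank convention; the abstract does not give the entries of $P$.

```latex
\begin{aligned}
\text{LoRA:}\quad & W = W_0 + B A,
  \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times n} \\
\text{with } P\text{:}\quad & W = W_0 + B P A,
  \qquad P = \operatorname{diag}(p_1, \dots, p_r) \\
\text{sub-rank } k \le r\text{:}\quad & W_k = W_0 + B_{:,1:k}\, P_{1:k,1:k}\, A_{1:k,:}
  = W_0 + \sum_{i=1}^{k} p_i\, b_i a_i^{\top}
\end{aligned}
```

Setting $P = I$ reduces the second line to the standard LoRA update, which is consistent with the statement that the framework recovers LoRA by changing $P$ alone. Because $P$ is diagonal, each rank-one component $b_i a_i^{\top}$ is scaled independently, so the choice of $p_i$ controls how strongly each sub-rank contributes.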

Chinese Translation

随着深度学习模型规模的增加至数十亿参数，微调的计算成本仍然是部署的一个重要障碍。虽然低秩适应（Low-Rank Adaptation, LoRA）已成为参数高效微调的标准，但需要设定一个预定义的静态秩 $r$，这要求进行全面的网格搜索以平衡效率和性能。现有的秩自适应解决方案，如 DyLoRA，通过从预定义分布中在训练过程中采样秩来缓解这一问题。然而，由于缺乏跨全秩层次的一致梯度信号，它们在较高秩时往往会产生次优结果，从而使这些方法在数据利用上效率低下。在本文中，我们提出了 MatryoshkaLoRA，这是一种通用的、受 Matryoshka 启发的 LoRA 训练框架，通过在现有 LoRA 适配器之间插入一个固定的、精心设计的对角矩阵 $P$ 来学习准确的层次低秩表示，从而相应地扩展它们的子秩。通过引入这一简单的修改，我们的通用框架仅通过更改 $P$ 恢复 LoRA 和 DyLoRA，并确保所有子秩有效地嵌入可用的梯度信息。我们的 MatryoshkaLoRA 支持动态秩选择，且准确度下降最小。我们进一步提出了秩准确度曲线下的面积（Area Under the Rank Accuracy Curve, AURAC），这一指标能够持续评估层次低秩适配器的性能。我们的结果表明，MatryoshkaLoRA 学习到的层次低秩表示比之前的秩自适应方法更为准确，并在评估数据集上实现了更优的准确度与性能平衡。我们的代码可在 https://github.com/IST-DASLab/MatryoshkaLoRA 获取。

cs.CL / 84 / 2605.07883

Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

超越“我无法满足该请求”：通过标签增强缓解大型语言模型中的僵化拒绝

Zhang, Ying, Qiao, Congyu, Geng, Xin, Xu, Ning

Abstract

Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals and severely undermines the naturalness of interactions between humans and LLMs. To address this issue, LANCE is proposed in this paper to ensure safe yet flexible and natural responses via label enhancement. Specifically, LANCE employs variational inference to perform label enhancement, predicting a continuous distribution across multiple rejection categories. These fine-grained rejection distributions provide multi-way textual gradients for a refinement model to neutralize the hazardous elements in the prompt, so that the LLMs could generate safe responses that avoid rigid rejections while preserving the naturalness of interactions. Experiments demonstrate that LANCE significantly alleviates the rigid rejection problem while maintaining high security standards, significantly outperforming existing baseline models in terms of helpfulness and naturalness of responses.

Chinese Translation

大型语言模型（LLMs）依赖安全对齐来遵循安全请求，同时拒绝有害请求。然而，传统的拒绝机制往往导致“僵化拒绝”，即一个通用模板（例如，“我无法满足该请求”）无差别地触发拒绝，严重削弱了人类与LLMs之间互动的自然性。为了解决这一问题，本文提出了LANCE，通过标签增强确保安全、灵活且自然的响应。具体而言，LANCE采用变分推理进行标签增强，预测多个拒绝类别的连续分布。这些细粒度的拒绝分布为一个精细化模型提供了多向文本梯度，以中和提示中的有害元素，从而使LLMs能够生成安全的响应，避免僵化拒绝，同时保持互动的自然性。实验表明，LANCE显著缓解了僵化拒绝问题，同时保持高安全标准，在响应的有用性和自然性方面显著优于现有基准模型。

cs.CL / 85 / 2605.07905

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

CoCoReviewBench：一个面向完整性和正确性的人工智能评审基准

Deng, Hexuan, Ke, Xiaopeng, Li, Yichen, Hu, Ruina, Huang, Dehao, Wong, Derek F., Wang, Yue, Liu, Xuebo, Zhang, Min

Abstract

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer--author--meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.

Chinese Translation

尽管人工智能评审系统迅速发展，但评估这些系统仍然具有挑战性：评估指标更倾向于与人类评审的重叠而非正确性。然而，由于人类评审通常仅涵盖一部分显著问题，并且有时包含错误，因此作为金标准的可靠性不足。为了解决这个问题，我们构建了特定类别的基准子集，并在相应的人类评审缺失时跳过评估，以增强完整性。同时，我们利用评审者-作者-元评审讨论作为专家注释，并相应过滤不可靠的评审，以增强正确性。最后，我们推出了CoCoReviewBench，整理了来自ICLR和NeurIPS的3900篇论文，以实现对人工智能评审者的可靠和细致的评估。分析表明，人工智能评审者在正确性方面仍然有限，并且容易产生幻觉，强调推理模型作为更有效的评审者，激励进一步改进人工智能评审者的方向。基准和模型可在https://github.com/hexuandeng/CoCoReviewBench获取。

cs.CL / 86 / 2605.07925

How Value Induction Reshapes LLM Behaviour

价值引导如何重塑大型语言模型的行为

Arora, Arnav, Schluter, Natalie, Metcalf, Katherine, ter Hoeve, Maartje

Abstract

Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of the people interacting with the model. However, values are complex and inter-related -- inducing one could modify behaviour on another. Further, inducing certain values can make models more addictive or sycophantic through language used in the generations, with a potential detrimental effect on the user. We investigate these and other unintended effects of value induction into models. We fine-tune models using curated value subsets of existing preference datasets, measuring the impact of value induction on expression of other values, model safety, anthropomorphic language, and various QA benchmarks. We find that (i) inducing values leads to expression of other related, and sometimes contrastive values, (ii) inducing positive values increases safety, and (iii) all values increase anthropomorphic language use, making models more validating and sycophantic.

Chinese Translation

对话型大型语言模型在后期训练中使用表达特定行为特征（如好奇心、开放性和同理心）和价值观（如乐于助人、无害和诚实）的语言。这一过程旨在提高效用、确保安全，并改善与模型互动的用户体验。然而，价值观是复杂且相互关联的——引导一种价值可能会影响另一种行为。此外，引导某些价值观可能会通过生成的语言使模型变得更加上瘾或谄媚，从而对用户产生潜在的不利影响。我们研究了这些及其他价值引导对模型的意外影响。我们使用现有偏好数据集的精心策划的价值子集对模型进行微调，测量价值引导对其他价值表达、模型安全性、人性化语言以及各种问答基准的影响。我们发现：（i）引导价值会导致其他相关且有时对立的价值的表达；（ii）引导积极价值会增加安全性；（iii）所有价值的引导都会增加人性化语言的使用，使模型更加验证和谄媚。

cs.CL / 87 / 2605.07933

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space

如何与潜在空间共同训练潜在扩散语言模型

Meshchaninov, Viacheslav, Shabalin, Alexander, Chimbulatov, Egor, Gushchin, Nikita, Koziev, Ilya, Korotin, Alexander, Vetrov, Dmitry

Abstract

Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. In this work, we present the Latent Diffusion Language Model (LDLM), in which the latent encoder, diffusion model, and decoder are trained jointly. LDLM builds its latent space by reshaping the representations of a pre-trained language model with a trainable encoder, yielding latents that are easy to both denoise and decode into tokens. We show that naive joint training produces a low-quality diffusion model, and propose a simple training recipe consisting of an MSE decoder loss, diffusion-to-encoder warmup, adaptive timestep sampling, and decoder-input noise. Ablations show that each component substantially impacts generation performance. On OpenWebText and LM1B, LDLM achieves better generation performance than existing discrete and continuous diffusion language models while being $2$--$13\times$ faster, indicating that jointly learning the latent space is a key step toward making latent diffusion competitive for text generation.

Chinese Translation

潜在扩散模型为非自回归文本生成提供了一种有吸引力的替代方案，通过对连续文本表示进行操作并并行去噪整个序列。潜在扩散建模的主要挑战在于构建合适的潜在空间。在本研究中，我们提出了潜在扩散语言模型（Latent Diffusion Language Model, LDLM），其中潜在编码器、扩散模型和解码器是共同训练的。LDLM通过使用可训练的编码器重塑预训练语言模型的表示来构建其潜在空间，从而生成易于去噪和解码为标记的潜在表示。我们展示了简单的联合训练会产生低质量的扩散模型，并提出了一种简单的训练方案，包括均方误差（MSE）解码器损失、扩散到编码器的预热、自适应时间步采样和解码器输入噪声。消融实验表明，每个组件对生成性能都有显著影响。在OpenWebText和LM1B数据集上，LDLM的生成性能优于现有的离散和连续扩散语言模型，同时速度提高了$2$--$13\times$，这表明共同学习潜在空间是使潜在扩散在文本生成中具有竞争力的关键步骤。

cs.CL / 88 / 2605.07937

Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

早问、晚问、问对：澄清时机对长期代理的重要性何在？

Gulati, Anmol, Gupta, Hariom, Lumer, Elias, Sen, Sahil, Subbiah, Vamse Kumar

Abstract

Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measures how clarification value changes over the course of execution. We introduce a forced-injection framework that provides ground-truth clarifications at controlled points in the agent's trajectory across four information dimensions (goal, input, constraint, context), three agent benchmarks, and four frontier models (three per benchmark; one on a single benchmark only; 84 task variants; 6,000+ runs). Counter to the common intuition that "earlier is always better," we find that the value of clarification depends sharply on what information is missing: goal clarification loses nearly all value after 10% of execution (pass@3 drops from 0.78 to baseline), while input clarification retains value through roughly 50%. Deferring any clarification type past mid-trajectory degrades performance below never asking at all. Cross-model Kendall tau correlations (0.78-0.87 among models sharing identical task coverage; 0.34-0.67 across the full 4-model panel) confirm these timing profiles are substantially task-intrinsic. A complementary study of 300 unscripted sessions reveals that no current frontier model asks within the empirically optimal window, with strategies ranging from over-asking (52% of sessions) to never asking at all. These empirical demand curves provide the quantitative foundation that existing theoretical frameworks require but have lacked, and establish concrete design targets for timing-aware clarification policies. Code and data will be publicly released.

Chinese Translation

长期AI代理执行复杂的工作流程，涵盖数百个顺序动作，但早期的一个错误假设可能会导致不可逆的错误。当指令不完整时，代理不仅必须决定是否请求澄清，还要决定何时请求，而之前的研究没有测量澄清的价值在执行过程中如何变化。我们引入了一种强制注入框架，在代理轨迹的受控点提供真实的澄清信息，涵盖四个信息维度（目标、输入、约束、上下文）、三个代理基准和四个前沿模型（每个基准三个；仅在一个基准上一个；84个任务变体；超过6000次运行）。与“越早越好”的常见直觉相反，我们发现澄清的价值在很大程度上取决于缺失的信息：目标澄清在执行10%后几乎失去所有价值（pass@3从0.78降至基线），而输入澄清的价值大约保持到50%。将任何澄清类型推迟到轨迹中段后，性能下降到低于从不请求的水平。跨模型的Kendall tau相关性（在共享相同任务覆盖的模型之间为0.78-0.87；在整个四模型面板中为0.34-0.67）确认这些时机特征在很大程度上是任务内在的。一项对300个非脚本化会话的补充研究显示，目前没有任何前沿模型在经验上最佳的时间窗口内提出请求，策略从过度请求（52%的会话）到完全不请求不等。这些经验需求曲线为现有理论框架所需但缺乏的定量基础提供了支持，并为时机感知的澄清策略建立了具体的设计目标。代码和数据将公开发布。

cs.CL / 89 / 2605.07982

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

GLiGuard：用于大型语言模型安全保障的模式条件分类

Zaratiana, Urchade, Newhauser, Mary, Hurn-Maloney, George, Lewis, Ash

Abstract

Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce GLiGuard, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B--27B decoder-based guards despite being 23--90$\times$ smaller, while delivering up to 16$\times$ higher throughput and 17$\times$ lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at https://github.com/fastino-ai/GLiGuard.

Chinese Translation

确保大型语言模型输出的安全性和政策合规性需要实时内容审核，能够在多个安全维度上进行扩展。然而，最先进的防护模型依赖于具有70亿至270亿参数的自回归解码器，将本质上是分类的问题重新表述为顺序文本生成，这一设计选择导致了高延迟，并且在多方面评估中扩展性较差。在本研究中，我们引入了GLiGuard，这是一个3亿参数（0.3B）的模式条件双向编码器，改编自GLiNER2，用于大型语言模型的内容审核。其关键思想是将任务定义和标签语义直接编码到输入序列中作为结构化的令牌模式，从而在单次非自回归前向传递中同时评估提示安全性、响应安全性、拒绝检测、14个细粒度危害类别和11种越狱策略。这种模式条件设计使得支持的任务和标签块可以在推理时直接组合到输入模式中。在九个已建立的安全基准测试中，GLiGuard的F1分数与70亿至270亿解码器基础的防护模型相当，尽管其规模小23至90倍，同时提供高达16倍的吞吐量和17倍的更低延迟。这些结果表明，紧凑的双向编码器可以接近更大防护模型的准确性，同时显著降低推理成本。代码和模型可在https://github.com/fastino-ai/GLiGuard获取。

cs.CL / 90 / 2605.07990

Tool Calling is Linearly Readable and Steerable in Language Models

工具调用在语言模型中是线性可读和可引导的

Wu, Zekun, Wang, Ze, Cho, Seonglae, Yang, Yufei, Koshiyama, Adriano, Bulathwela, Sahan, Perez-Ortiz, Maria

Abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $\tau$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
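The mean-difference steering and the top-1 versus top-2 gap described above can be illustrated on toy vectors. The sketch below is not the paper's setup: the three-dimensional "activations" are made up, and a nearest-mean dot-product readout stands in, as an assumption, for the model's actual output layer. All function names are hypothetical.

```python
def mean_vector(vectors):
    """Average equal-length activation vectors, dimension by dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tool_scores(hidden, tool_means):
    """Score each tool by the dot product of its mean with the hidden state."""
    return {tool: dot(mean, hidden) for tool, mean in tool_means.items()}

def pick_tool(hidden, tool_means):
    scores = tool_scores(hidden, tool_means)
    return max(scores, key=scores.get)

def top2_gap(hidden, tool_means):
    """Gap between the best and second-best tool; small gaps flag risky calls."""
    ranked = sorted(tool_scores(hidden, tool_means).values(), reverse=True)
    return ranked[0] - ranked[1]

def steer(hidden, source_mean, target_mean, scale=1.0):
    """Add the mean-difference direction (target minus source) to the state."""
    return [h + scale * (t - s)
            for h, s, t in zip(hidden, source_mean, target_mean)]

# Toy per-tool activations (hypothetical values).
activations = {
    "send_email": [[2.0, 0.1, 0.3], [1.8, -0.1, 0.2], [2.2, 0.0, 0.4]],
    "create_event": [[0.1, 2.1, 0.3], [-0.2, 1.9, 0.2], [0.1, 2.0, 0.4]],
}
means = {tool: mean_vector(vs) for tool, vs in activations.items()}

hidden = [1.9, 0.2, 0.3]
before = pick_tool(hidden, means)
steered = steer(hidden, means["send_email"], means["create_event"])
after = pick_tool(steered, means)

confident_gap = top2_gap(hidden, means)
ambiguous_gap = top2_gap([1.0, 0.9, 0.3], means)
print(before, after, confident_gap, ambiguous_gap)
```

The same per-tool means serve both purposes here: their difference gives the steering direction, and the score gap they induce gives a cheap pre-execution signal of how contested the tool choice is.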

Chinese Translation

当一个工具调用代理选择错误的工具时，失败在执行之前是不可见的：邮件被发送，会议被错过。我们对12个经过指令调优的模型进行了探测，这些模型包括Gemma 3、Qwen 3、Qwen 2.5和Llama 3.1（270M到27B），发现所选工具的身份在模型内部是线性可读和可引导的。通过添加两个工具的平均内部激活之间的均值差异，可以以77-100%的准确率在仅包含名称的单轮提示中切换模型选择的工具（在4B以上时为93-100%），随后自回归生成的JSON参数与新工具的模式相匹配，因此仅需更改名称即可。相同工具的均值也能在错误发生之前标记出可能的错误：在Gemma 3的12B和27B模型中，排名第一和排名第二工具之间差距最小的查询产生的错误调用比差距最大的查询多出14-21倍。因果效应集中在一个方向上，即输出层生成目标工具第一个标记的行：在匹配幅度下沿该方向的单位向量已经达到93-100%的准确率，而剩余部分几乎不影响选择。激活修补将这一现象局限于一小组中层和后层的注意力头，并且在14个同领域的$\tau$-bench航空工具中进行的主题内探测在五个4B-14B模型中达到了61-89%的排名第一，排除了我们仅在主题轴上移动模型的解读。即使是基础模型在能够发出工具之前也编码了正确的工具：从内部状态的余弦读取在BFCL上恢复了69-82%的准确率，而基础生成仅达到2-10%，这表明预训练形成了表示，而指令调优则将其连接到输出。我们在单轮固定菜单设置中测量工具身份选择和JSON模式的正确性；多轮代理转移则更为脆弱，并在限制部分进行了讨论。
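
As a rough illustration of the mean-difference steering and the top-1/top-2 gap described in the abstract above, the sketch below uses toy 3-dimensional vectors in place of real hidden states. The function names (`mean_vector`, `readout`, `steer`), the toy activations, the tool names, and the choice of cosine similarity against per-tool means as the readout are all assumptions made for illustration; this is not the authors' implementation.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def readout(h, tool_means):
    """Rank tools by cosine similarity; return (best tool, top-1 minus top-2 gap)."""
    scored = sorted(((cosine(h, m), name) for name, m in tool_means.items()), reverse=True)
    return scored[0][1], scored[0][0] - scored[1][0]

def steer(h, tool_means, source, target, alpha=1.0):
    """Add alpha * (mean[target] - mean[source]) to the hidden state h."""
    return [x + alpha * (t - s) for x, t, s in zip(h, tool_means[target], tool_means[source])]

# Hypothetical toy activations: two examples per tool.
acts = {
    "send_email": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "create_event": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
tool_means = {name: mean_vector(rows) for name, rows in acts.items()}

h = [0.8, 0.2, 0.05]  # toy hidden state for one query
before, gap = readout(h, tool_means)
after, _ = readout(steer(h, tool_means, "send_email", "create_event"), tool_means)
print(before, round(gap, 3), after)
```

In this toy setting a small `gap` would be the signal used to flag a query for review before the call is executed.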


cs.CL / 91 / 2605.08044

Fast Byte Latent Transformer

快速字节潜在变换器

Kallini, Julie, Pagnoni, Artidoro, Limisiewicz, Tomasz, Ghosh, Gargi, Zettlemoyer, Luke, Potts, Christopher, Han, Xiaochuang, Iyer, Srinivasan

Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.

Chinese Translation

近期的字节级语言模型（LMs）在性能上与基于标记的模型相匹配，而无需依赖子词词汇，然而它们的实用性受到逐字自回归生成的速度限制。我们通过新的训练和生成技术在字节潜在变换器（Byte Latent Transformer, BLT）中解决了这一瓶颈。首先，我们引入了BLT扩散（BLT Diffusion, BLT-D），这是一个新的模型，也是我们最快的BLT变体，采用辅助块级扩散目标进行训练，同时结合标准的下一个字节预测损失。这使得推理过程能够在每个解码步骤中并行生成多个字节，从而显著减少生成序列所需的前向传递次数。其次，我们提出了两个受投机解码启发的扩展，牺牲部分速度以提高生成质量：BLT自我投机（BLT Self-speculation, BLT-S），在该方法中，BLT的局部解码器继续生成超出其正常补丁边界的字节草稿，然后通过一次完整模型的前向传递进行验证；以及BLT扩散+验证（BLT Diffusion+Verification, BLT-DV），该方法在基于扩散的生成后增加了一个自回归验证步骤。所有方法在生成任务中都能实现超过50%的估计内存带宽成本低于BLT。每种方法都有其独特的优势，共同消除了字节级语言模型实际应用中的关键障碍。
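
The draft-then-verify idea behind the speculative variants in the abstract above can be sketched generically as follows. This is a minimal greedy-acceptance loop under stated assumptions: `draft_next` and `verify_next` are hypothetical stand-ins for a cheap drafting model and a full model, and the verification that a real system would run as one batched forward pass is emulated here position by position. It does not reproduce BLT-S or BLT-DV.

```python
def speculative_step(prefix, draft_next, verify_next, k):
    """Draft k bytes cheaply, then keep the longest prefix the verifier agrees with.

    Returns the bytes accepted in this step (always at least one byte).
    """
    # 1) Draft k bytes autoregressively with the cheap model.
    drafted = bytearray()
    for _ in range(k):
        drafted.append(draft_next(prefix + bytes(drafted)))

    # 2) Verify the draft; stop at the first disagreement.
    accepted = bytearray()
    for b in drafted:
        expected = verify_next(prefix + bytes(accepted))
        if b != expected:
            accepted.append(expected)  # replace the mismatch with the verifier's byte
            return bytes(accepted)
        accepted.append(b)
    return bytes(accepted)

TARGET = b"hello world"

def verify_next(prefix):
    # Toy "full model": always continues TARGET.
    return TARGET[len(prefix)]

def draft_next(prefix):
    # Toy "drafter": correct except for a wrong guess at position 4.
    return ord("x") if len(prefix) == 4 else TARGET[len(prefix)]

out = b""
steps = 0
while len(out) < len(TARGET):
    k = min(4, len(TARGET) - len(out))
    out += speculative_step(out, draft_next, verify_next, k)
    steps += 1
print(out, steps)
```

Each loop iteration corresponds to one verification pass, so the toy run needs fewer passes than bytes generated whenever the drafter is mostly right.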


cs.CL / 92 / 2605.08045

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

基于蒸馏大语言模型的不确定性感知全心脏磁共振报告结构化数据提取

Yu, Yi, Martin, Parker, Bu, Zhenyu, Liu, Yixuan, Zheng, Yi-Yu, Simonetti, Orlando, Han, Yuchi, Xue, Yuan

Abstract

Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at https://github.com/yuyi1005/CMR-EXTR.

Chinese Translation

将自由文本心脏磁共振（CMR）报告转换为可审计的结构化数据仍然是队列组建、纵向管理和临床决策支持的瓶颈。我们提出了CMR-EXTR，一个轻量级框架，能够将自由文本CMR报告转换为结构化数据，并为每个字段分配置信度以进行质量控制。教师-学生蒸馏管道实现了完全离线推理，同时限制了人工标注。不确定性整合了三种互补原则——分布合理性、采样稳定性和跨字段一致性——以优先处理人工审核。实验表明，CMR-EXTR达到了99.65%的变量级准确率，展示了可靠的提取能力和有信息量的置信度评分。据我们所知，这是第一个具有集成置信度估计的CMR特定提取系统。代码可在https://github.com/yuyi1005/CMR-EXTR获取。
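
Of the three uncertainty signals named in the abstract above, sampling stability is the easiest to illustrate. The sketch below assumes that the same report is extracted several times and that per-field confidence is the fraction of samples agreeing with the majority value. The field names, the 0.8 review threshold, and the agreement formula are hypothetical choices for illustration, not details taken from CMR-EXTR.

```python
from collections import Counter

def field_confidence(samples):
    """For each field, return (majority value, fraction of samples that agree)."""
    result = {}
    fields = {f for s in samples for f in s}
    for f in sorted(fields):
        votes = Counter(s.get(f) for s in samples)
        value, count = votes.most_common(1)[0]
        result[f] = (value, count / len(samples))
    return result

def needs_review(conf, threshold=0.8):
    """Fields whose agreement falls below the (hypothetical) review threshold."""
    return [f for f, (_, c) in conf.items() if c < threshold]

# Five hypothetical sampled extractions of the same report.
samples = [
    {"lvef_percent": 55, "lge_present": True},
    {"lvef_percent": 55, "lge_present": False},
    {"lvef_percent": 55, "lge_present": True},
    {"lvef_percent": 55, "lge_present": True},
    {"lvef_percent": 55, "lge_present": False},
]
conf = field_confidence(samples)
print(conf, needs_review(conf))
```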


cs.CL / 93 / 2605.08048

Accurate and Efficient Statistical Testing for Word Semantic Breadth

准确且高效的词语语义广度统计检验

Ehara, Yo

Abstract

Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL 2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.

Chinese Translation

测量词语意义的广度，或其在不同语境中的传播，随着上下文化的词嵌入变得可行。一个词类型可以被表示为一组词向量的云，基于离散度的统计量作为语境多样性的代理（Nagata and Tanaka-Ishii, ACL2025）。这些测量在构建同义词典和特定领域词典时，对于决定适当的意义区分非常有用。然而，在比较两个词类型的广度时，简单的离散度假设检验可能会产生误导：语义方向的差异可能伪装成离散度的差异，从而膨胀I型错误，并在没有真实广度差异的情况下产生“统计显著”的结果。这是一个问题，因为显著性检验应该区分真实效应和小差异范围内的偶然波动。我们提出了一种与Householder对齐的置换检验，以将离散度差异与方向差异隔离开来。我们的方法应用单一的Householder反射来对齐两个词类型的平均方向，然后对对齐后的词云进行置换检验，从而产生经过校准的非参数p值。为了实用性，我们引入了一种面向GPU的实现，批量处理置换和线性代数运算。实证结果表明，我们的对齐方法将I型错误降低了32.5%，同时保持了对真实广度差异的敏感性，并且在速度上比CPU基线提高了23倍。
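
The two steps named in the abstract above, a single Householder reflection followed by a permutation test, can be sketched on toy data as follows. The dispersion statistic (mean squared distance to the centroid), the absolute-difference test statistic, the toy Gaussian clouds, and all function names are assumptions for illustration; the paper's actual statistic and its batched GPU implementation are not reproduced here.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(cloud):
    return [sum(col) / len(cloud) for col in zip(*cloud)]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def householder_align(cloud, src_dir, dst_dir):
    """Apply H = I - 2 v v^T / (v^T v) with v = src_dir - dst_dir to every vector.

    For unit vectors src_dir and dst_dir, H maps src_dir onto dst_dir.
    """
    v = [s - d for s, d in zip(src_dir, dst_dir)]
    vv = dot(v, v)
    if vv < 1e-12:  # directions already coincide
        return [list(x) for x in cloud]
    out = []
    for x in cloud:
        scale = 2.0 * dot(v, x) / vv
        out.append([xi - scale * vi for xi, vi in zip(x, v)])
    return out

def dispersion(cloud):
    """Mean squared distance of the vectors to their centroid."""
    m = mean(cloud)
    return sum(sum((x - c) ** 2 for x, c in zip(row, m)) for row in cloud) / len(cloud)

def permutation_pvalue(a, b, n_perm=500, seed=0):
    """Two-sided permutation p-value for a difference in dispersion."""
    rng = random.Random(seed)
    observed = abs(dispersion(a) - dispersion(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(dispersion(pa) - dispersion(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy clouds: same spread, different mean directions.
data_rng = random.Random(1)
def make_cloud(center, n=40, sd=0.1):
    return [[c + data_rng.gauss(0, sd) for c in center] for _ in range(n)]

a = make_cloud([1.0, 0.0, 0.0])
b = make_cloud([0.0, 1.0, 0.0])
a_aligned = householder_align(a, unit(mean(a)), unit(mean(b)))
p = permutation_pvalue(a_aligned, b)
print(round(dot(unit(mean(a_aligned)), unit(mean(b))), 6), p)
```

Because a reflection is orthogonal, it leaves each cloud's own dispersion unchanged. In this sketch the alignment therefore only changes the pooled clouds from which the permutation null is drawn.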


cs.CL / 94 / 2605.08057

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

CA-SQL：基于复杂度感知的文本到SQL推理时间推理，通过探索和计算预算分配

Petullo, James, Xue, Nianwen

Abstract

While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, outperforming other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.

Chinese Translation

尽管近期推理时间学习的进展提高了大型语言模型（LLM）在文本到SQL任务上的推理能力，但当前解决方案在Bird-Bench（BIRD）基准测试中最具挑战性的任务上仍然表现不佳。这是由于解决方案空间探索不足，而这种探索对于发现有前景的候选查询并进一步优化以产生正确输出是必要的。为了解决这一挑战，我们提出了CA-SQL，一种新颖的文本到SQL管道，利用任务的估计难度动态调整探索的广度，以生成解决方案候选。此外，我们采用了一种基于进化搜索原则的自定义提示种子方法，以进一步引导基础LLM的探索行为，并在搜索结束时采用一种新颖的投票方法选择最佳候选解决方案。实验表明，我们的解决方案在BIRD开发集问题的“挑战”层次上取得了51.72%的最新成绩，仅使用GPT-4o-mini，超越了其他上下文学习方法，即使是那些利用更大模型的方法。总体而言，我们的方法在BIRD开发数据集上达到了61.06%的执行准确率和68.77%的Soft F1分数。
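
To make the idea of difficulty-scaled exploration followed by a vote more concrete, the sketch below scales the number of candidate queries with a difficulty tier and then selects a candidate by plain majority voting over execution results. The budget values, the table, the candidate queries, and the voting rule are hypothetical; the abstract above describes a novel voting method, which this ordinary execution-result majority vote does not reproduce.

```python
import sqlite3
from collections import Counter

# Hypothetical candidate budgets per difficulty tier.
BUDGET = {"simple": 2, "moderate": 4, "challenging": 8}

def candidate_budget(difficulty):
    """Number of candidate queries to explore for a given difficulty tier."""
    return BUDGET[difficulty]

def execute(conn, sql):
    """Return a hashable result signature, or None if the query fails."""
    try:
        return tuple(sorted(conn.execute(sql).fetchall()))
    except sqlite3.Error:
        return None

def vote(conn, candidates):
    """Pick the first candidate whose execution result is the most common one."""
    results = [execute(conn, sql) for sql in candidates]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None
    best, _ = counts.most_common(1)[0]
    return candidates[results.index(best)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0), (3, 40.0)])

candidates = [
    "SELECT COUNT(*) FROM orders WHERE amount > 20",
    "SELECT COUNT(id) FROM orders WHERE amount >= 20.5",
    "SELECT COUNT(*) FROM orders WHERE amount > 30",
    "SELECT COUNT(*) FROM order WHERE amount > 20",  # invalid SQL, discarded by the vote
][:candidate_budget("moderate")]
chosen = vote(conn, candidates)
print(chosen)
```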


cs.CL / 95 / 2605.08060

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

记忆诅咒：扩展回忆如何侵蚀大型语言模型代理的合作意图

Liu, Jiayuan, Li, Tianqin, Du, Shiyi, Luo, Xin, Zeng, Haoxuan, Tewolde, Emanuel, Lee, Tai Sing, Wang, Tonghan, Kingsford, Carl, Conitzer, Vincent

Abstract

Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model-game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

Chinese Translation

上下文窗口的扩展通常被视为大型语言模型（LLMs）的简单能力升级，但我们发现它在多代理社会困境中系统性地失败。在对7个LLMs和4个游戏进行500轮的实验中，扩展可访问的历史在28种模型-游戏设置中有18种情况下降低了合作水平，我们将这种现象称为记忆诅咒。我们通过三项分析来隔离其潜在机制。首先，对378,000条推理轨迹的词汇分析将这种崩溃与前瞻性意图的减弱关联起来，而不是与偏执的增加相关。我们使用针对性的微调作为认知探针来验证这一点：一个仅在前瞻性轨迹上训练的LoRA适配器减缓了衰退，并在不同游戏中实现了零样本迁移。其次，记忆清理在固定提示长度的同时，用合成的合作记录替换可见历史，这显著恢复了合作，证明触发因素是记忆内容，而不仅仅是长度。最后，消除显式的思维链推理通常会减少崩溃，表明深思熟虑反而加剧了记忆诅咒。综合这些结果，我们重新定义了记忆作为多代理行为的一个主动决定因素：较长的回忆可以根据其引发的推理模式，既可能破坏合作，也可能支持合作。


cs.CL / 96 / 2605.08077

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

一致路径推理：通过路径级校准实现可信的知识图谱问答

Lin, Shuhang, Zhou, Chuhao, Lin, Xiao, Dong, Zihan, Lu, Kuan, Peng, Zhencan, Yin, Jie, Metaxas, Dimitris N.

Abstract

Knowledge Graph Question Answering (KGQA) has shown promise for grounded and interpretable reasoning, yet existing approaches often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform query-level conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.

Chinese Translation

知识图谱问答（KGQA）在基于事实和可解释推理方面展现了良好的前景，但现有方法往往未能提供对检索答案的可靠覆盖保证。尽管一致性预测（Conformal Prediction, CP）提供了一个具有统计保证的预测集生成的原则框架，先前的方法在校准有效性和评分可区分性方面存在重大局限，导致覆盖保证被违反以及预测集过于庞大。为了解决这些问题，我们提出了一致路径推理（Conformal Path Reasoning, CPR），这是一个可信的KGQA框架，具有两个关键创新。首先，我们对路径级评分进行查询级一致性校准，在生成路径预测集的同时保持可交换性。其次，我们引入了残差一致性值网络（Residual Conformal Value Network, RCVNet），这是一个通过PUCT引导探索训练的轻量级模块，用于学习具有区分性的路径级非一致性评分。在基准测试中的实验表明，与一致性基线相比，CPR显著提高了经验覆盖率34%，同时将平均预测集大小减少了40%。这些结果验证了CPR在满足覆盖保证的同时，能够提供更为紧凑的答案集的有效性。
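
For readers unfamiliar with conformal calibration, the standard split-conformal recipe that underlies methods of this kind can be sketched as follows. The sketch assumes one nonconformity score per calibration query (for example, the score of that query's correct path) and uses hypothetical scores and path names; it shows the generic recipe, not CPR or RCVNet.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every candidate is kept
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, s in candidate_scores.items() if s <= threshold]

# Hypothetical calibration scores, one per calibration query.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.50, 0.70]
q = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical nonconformity scores for candidate paths of a new query.
paths = {"path_a": 0.15, "path_b": 0.45, "path_c": 0.90}
kept = prediction_set(paths, q)
print(q, kept)
```

Under exchangeability of calibration and test queries, sets built this way contain the correct answer with probability at least 1 - alpha. A more discriminative score shrinks the sets without affecting that guarantee.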


cs.CL / 97 / 2605.08083

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

大型语言模型提升大型语言模型：测试时扩展的自主发现

Zheng, Tong, Liu, Haolin, Huang, Chengsong, Bao, Huiwen, Zhang, Sheng, Liu, Rui, Dai, Runpeng, Chen, Ruibo, Liu, Chenxi, Xiong, Tianyi, Wu, Xidong, Zhang, Hongming, Huang, Heng

Abstract

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.

Chinese Translation

测试时扩展（TTS）已成为提高大型语言模型性能的有效方法，通过在推理过程中分配额外的计算资源。然而，现有的TTS策略大多是手工设计的：研究人员通过直觉手动设计推理模式并调整启发式方法，导致计算分配空间的许多部分未被探索。我们提出了一种环境驱动的框架AutoTTS，改变了研究人员的设计思路：从单个TTS启发式方法转变为可以自动发现TTS策略的环境。AutoTTS的关键在于环境构建：发现环境必须使控制空间可处理，并为TTS搜索提供廉价、频繁的反馈。作为具体实例，我们将宽度-深度TTS形式化为在预收集的推理轨迹和探测信号上的控制器合成，其中控制器决定何时分支、继续、探测、修剪或停止，并且可以在不重复调用大型语言模型的情况下进行廉价评估。我们进一步引入了beta参数化，以使搜索可处理，并提供细粒度的执行跟踪反馈，以提高发现效率，帮助代理诊断TTS程序失败的原因。在数学推理基准上的实验表明，发现的策略在整体准确性-成本权衡上优于强大的手工设计基线。发现的策略能够推广到未见过的基准和模型规模，而整个发现过程仅花费39.9美元和160分钟。我们的数据和代码将开源于https://github.com/zhengkid/AutoTTS。
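
The point that controllers can be scored against pre-collected trajectories without further LLM calls can be illustrated with a deliberately small replay loop. In the sketch below each cached trajectory is a list of probe answers, one per reasoning step, and the controller only chooses how many branches to advance (`max_width`) and when to stop (`patience`). The data, parameter names, and stopping rule are hypothetical and far simpler than the branch/continue/probe/prune/stop control space described in the abstract above.

```python
from collections import Counter

def run_controller(trajectories, patience, max_width):
    """Replay cached trajectories under a simple width-depth controller.

    Advances up to max_width branches in lockstep and stops as soon as one
    branch has given the same probe answer `patience` times in a row.
    Returns (final_answer, steps_consumed).
    """
    active = trajectories[:max_width]
    steps = 0
    for d in range(max(len(t) for t in active)):
        for t in active:
            if d >= len(t):
                continue
            steps += 1  # one cached reasoning step consumed
            window = t[max(0, d - patience + 1): d + 1]
            if len(window) == patience and len(set(window)) == 1:
                return window[-1], steps
    # Fallback: majority vote over each branch's last probe answer.
    finals = [t[-1] for t in active]
    return Counter(finals).most_common(1)[0][0], steps

# Hypothetical cached probe answers for one problem whose gold answer is "7".
cached = [
    ["3", "7", "7", "7", "7"],
    ["5", "5", "7", "7", "7"],
    ["7", "7", "7", "7", "7"],
]
gold = "7"

results = {}
for patience, width in [(2, 1), (2, 3), (3, 3)]:
    answer, cost = run_controller(cached, patience, width)
    results[(patience, width)] = (answer == gold, cost)
print(results)
```

Even in this toy replay the accuracy-cost tradeoff is visible: an impatient wide controller can stop early on a wrong answer, while a more patient one pays more steps for the correct result.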


arXiv Papers

Modular Lie Algebraic PDE Control of Multibody Flexible Manipulators

An Aerial Manipulator for Perception-Driven Flower Targeting Toward Contactless Pollination in Vertical Farming

Bi3: A Biplatform, Bicultural, Biperson Dataset for Social Robot Navigation

Traffic Scenario Orchestration from Language via Constraint Satisfaction

AirBender: Adaptive Transportation of Bendable Objects Using Dual UAVs

Intention assimilation control for accurate tracking with variable impedance in teleoperation

Dr-BA: Separable Optimization for Direct Radar Bundle Adjustment & Localization

PISTO: Proximal Inference for Stochastic Trajectory Optimization

Palm-sized Omnidirectional Vision-Based UAV Exploration with Sparse Topological Map Guidance

Variable Aerodynamic Damping via Co-Contraction: A Dynamic Isomorphism with Variable Stiffness Actuators

BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

Weather-Robust Scene Semantics with Vision-Aligned 4D Radar

MORPH-U: Multi-Objective Resilient Motion Planning for V2X-Enabled Autonomous Driving in High-Uncertainty Environments via Simulation

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

How to utilize failure demo data?: Effective data selection for imitation learning using distribution differences in attention mechanism

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly

PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN

Offline-Online Hierarchical 3D Global Relocalization With Synthetic LiDAR Sensing and Descriptor-Space Retrieval

CommandSwarm: Safety-Aware Natural Language-to-Behavior-Tree Generation for Robotic Swarms

Sensitivity-Based Robust NMPC for Close-Proximity Offshore Wind Turbine Inspection with a Tilted Multirotor

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

Many-to-Many Multi-Agent Pickup and Delivery

Melding LLM and temporal logic for reliable human-swarm collaboration in complex scenarios

AERO-VIS: Asynchronous Event-based Real-time Onboard Visual-Inertial SLAM

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

Evaluation of an Actuated Spine in Agile Quadruped Locomotion

Active Embodiment Identification with Reinforcement Learning for Legged Robots

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Visual Text Compression as Measure Transport

Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

HumanNet: Scaling Human-centric Video Learning to One Million Hours

R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

Knowledge Transfer Scaling Laws for 3D Medical Imaging

AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting

TriDE: Triangle-Consistent Translation Directions for Global Camera Pose Estimation

Towards Fairness under Label Bias in Image Segmentation: Impact, Measurement and Mitigation

Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation

Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge

A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency

XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness

Learning to Track Instance from Single Nature Language Description

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

Learning Visual Feature-Based World Models via Residual Latent Action

ImplantMamba: Long-range Sequential Modeling Mamba For Dental Implant Position Prediction

Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information

InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization

Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification

TriP: A Triangle Puzzle Approach to Robust Translation Averaging

UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection

DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

Hierarchical Perfusion Graphs for Tumor Heterogeneity Modeling in Glioma Molecular Subtyping

Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

Attention Transfer Is Not Universally Effective for Vision Transformers

AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred Scenes

Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models

See Tomorrow, Act Today: Foresight-Driven Autonomous Driving

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

LoHGNet: Infrared Small Target Detection through Lorentz Geometric Encoding with High-Order Relation Learning

DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation

CASCADE: Context-Aware Relaxation for Speculative Image Decoding