arXiv Daily Digest

299

Papers

MVAdapt: Zero-Shot Multi-Vehicle Adaptation for End-to-End Autonomous Driving

MVAdapt：面向端到端自动驾驶的零样本多车辆适应方法

Oh, Haesung, Park, Jaeheung

Abstract

End-to-End (E2E) autonomous driving models are usually trained and evaluated with a fixed ego-vehicle, even though their driving policy is implicitly tied to vehicle dynamics. When such a model is deployed on a vehicle with a different size, mass, or drivetrain, its performance can degrade substantially; we refer to this problem as the vehicle-domain gap. To address it, we propose MVAdapt, a physics-conditioned adaptation framework for multi-vehicle E2E driving. MVAdapt combines a frozen TransFuser++ scene encoder with a lightweight physics encoder and a cross-attention module that conditions scene features on vehicle properties before waypoint decoding. In the CARLA Leaderboard 1.0 benchmark, MVAdapt improves over naive transfer and multi-embodiment adaptation baselines on both in-distribution and unseen vehicles. We further show two complementary behaviors: strong zero-shot transfer on many unseen vehicles, and data-efficient few-shot calibration for severe physical outliers. These results suggest that explicitly conditioning E2E driving policies on vehicle physics is an effective step toward more transferable autonomous driving models. All code is available at https://github.com/hae-sung-oh/MVAdapt

Chinese Translation

端到端（E2E）自动驾驶模型通常在固定的自车（ego-vehicle）上进行训练和评估，尽管其驾驶策略隐式地依赖于车辆动力学。当此类模型部署于尺寸、质量或传动系统特性不同的车辆时，性能可能会显著下降；我们将此问题称为车辆域差距。为了解决该问题，我们提出了MVAdapt，一种基于物理条件的多车辆端到端驾驶适应框架。MVAdapt结合了冻结的TransFuser++场景编码器、轻量级物理编码器以及一个跨注意力模块，该模块在路径点解码前将场景特征与车辆属性进行条件融合。在CARLA Leaderboard 1.0基准测试中，MVAdapt在分布内和未见车辆上均优于简单迁移和多体化适应基线。我们进一步展示了两种互补行为：对多种未见车辆的强零样本迁移能力，以及对严重物理异常车辆的数据高效少样本校准能力。这些结果表明，显式地将端到端驾驶策略与车辆物理条件结合，是实现更具迁移性的自动驾驶模型的有效途径。所有代码均可在https://github.com/hae-sung-oh/MVAdapt获取。

cs.RO / 2 / 2604.11861

BIND-USBL: Bounding IMU Navigation Drift using USBL in Heterogeneous ASV-AUV Teams

BIND-USBL：利用USBL约束异构ASV-AUV团队中IMU导航漂移

Kedia, Pranav, Makam, Rajini, Hamann, Heiko, Sundaram, Suresh

Abstract

Accurate and continuous localization of Autonomous Underwater Vehicles (AUVs) in GPS-denied environments is a persistent challenge in marine robotics. In the absence of external position fixes, AUVs rely on inertial dead-reckoning, which accumulates unbounded drift due to sensor bias and noise. This paper presents BIND-USBL, a cooperative localization framework in which a fleet of Autonomous Surface Vessels (ASVs) equipped with Ultra-Short Baseline (USBL) acoustic positioning systems provides intermittent fixes to bound AUV dead-reckoning error. The key insight is that long-duration navigation failure is driven not by the accuracy of individual USBL measurements, but by the temporal sparsity and geometric availability of those fixes. BIND-USBL combines a multi-ASV formation model linking survey scale and anchor placement to acoustic coverage, a conflict-graph-based TDMA uplink scheduler for shared-channel servicing, and delayed fusion of received USBL updates with drift-prone dead reckoning. The framework is evaluated in the HoloOcean simulator using heterogeneous ASV-AUV teams executing lawnmower coverage missions. The results show that localization performance is shaped by the interaction of survey scale, acoustic coverage, team composition, and ASV-formation geometry. Further, the spatial-reuse scheduler improves per-AUV fix delivery rate without violating the no-collision constraint, while maintaining low end-to-end fix latency.

Chinese Translation

在无GPS环境下，实现自主水下航行器（AUV）的准确且连续定位是海洋机器人领域的一个持续挑战。在缺乏外部定位修正的情况下，AUV依赖惯性推算导航，但由于传感器偏差和噪声，惯性导航误差会无界累积。本文提出了BIND-USBL，一种协同定位框架，其中配备超短基线（USBL）声学定位系统的自主水面船（ASV）舰队提供间歇性定位修正，以约束AUV惯性导航误差。关键观点在于，长时间导航失败的根本原因并非单个USBL测量的精度，而是这些定位修正的时间稀疏性和几何可用性。BIND-USBL结合了多ASV编队模型（将测量规模和锚点布置与声学覆盖关联）、基于冲突图的TDMA上行链路调度器（用于共享信道服务）以及延迟融合接收到的USBL更新与易漂移的惯性推算数据。该框架在HoloOcean仿真器中进行了评估，使用执行割草机式覆盖任务的异构ASV-AUV团队。结果表明，定位性能受测量规模、声学覆盖、团队组成及ASV编队几何形状相互作用的影响。此外，空间复用调度器在不违反无碰撞约束的前提下，提高了每个AUV的定位修正传递率，同时保持了较低的端到端定位延迟。
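
The conflict-graph TDMA idea in the BIND-USBL abstract can be illustrated with a small sketch. It assumes that each AUV uplink is a node, that an edge joins two uplinks which would collide on the shared acoustic channel, and that slots are assigned by greedy graph colouring. The names `build_conflict_graph` and `assign_slots`, the greedy ordering, and the example links are all hypothetical; the paper's actual scheduler may differ.

```python
from typing import Dict, List, Set, Tuple


def build_conflict_graph(
    links: List[str], conflicts: List[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Return an adjacency map; an edge means two uplinks must not share a slot."""
    graph: Dict[str, Set[str]] = {link: set() for link in links}
    for a, b in conflicts:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def assign_slots(graph: Dict[str, Set[str]]) -> Dict[str, int]:
    """Greedy colouring: highest-degree links first, each takes the lowest free slot."""
    order = sorted(graph, key=lambda link: (-len(graph[link]), link))
    slots: Dict[str, int] = {}
    for link in order:
        used = {slots[n] for n in graph[link] if n in slots}
        slot = 0
        while slot in used:
            slot += 1
        slots[link] = slot
    return slots


# Hypothetical team: auv2 conflicts with auv1 and auv3; auv4 is out of range of all.
links = ["auv1", "auv2", "auv3", "auv4"]
conflicts = [("auv1", "auv2"), ("auv2", "auv3")]
graph = build_conflict_graph(links, conflicts)
slots = assign_slots(graph)
frame_length = max(slots.values()) + 1
```

Links that do not conflict share a slot, so the frame is shorter than one slot per AUV. This is the spatial-reuse effect the abstract describes, shown here only in toy form.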

cs.RO / 3 / 2604.11975

M2HRI: An LLM-Driven Multimodal Multi-Agent Framework for Personalized Human-Robot Interaction

M2HRI：一种基于大型语言模型的多模态多智能体框架，用于个性化人机交互

Hasan, Shaid, Lee, Breenice, Sarker, Sujan, Iqbal, Tariq

Abstract

Multi-robot systems hold significant promise for social environments such as homes and hospitals, yet existing multi-robot work treats robots as functionally identical, overlooking how robots' individual identities shape user perception and how coordination shapes multi-robot behavior when such individuality is present. To address this, we introduce M2HRI, a multimodal multi-agent framework built on large language models that equips each robot with a distinct personality and long-term memory, alongside a coordination mechanism conditioned on these differences. In a controlled user study (n = 105) in a multi-agent human-robot interaction (HRI) scenario, we find that LLM-driven personality traits are significantly distinguishable and enhance interaction quality, long-term memory improves personalization and preference awareness, and centralized coordination significantly reduces overlap while improving overall interaction quality. Together, these results demonstrate that both agent individuality and structured coordination are essential for coherent and socially appropriate multi-agent HRI. Project website and code are available at https://project-m2hri.github.io/.

Chinese Translation

多机器人系统在家庭和医院等社交环境中具有重要的应用前景，但现有的多机器人研究将机器人视为功能上相同，忽视了机器人的个体身份如何影响用户感知，以及在这种个体性存在时协调如何塑造多机器人行为。为了解决这个问题，我们提出了M2HRI，一个基于大型语言模型的多模态多智能体框架，为每个机器人赋予独特的个性和长期记忆，并基于这些差异建立协调机制。在一项控制用户研究（n = 105）中，我们在多智能体人机交互（HRI）场景下发现，基于大型语言模型驱动的个性特征具有显著的可区分性，并提升了交互质量，长期记忆改善了个性化和偏好意识，而集中协调显著减少了重叠，同时提高了整体交互质量。这些结果共同表明，代理的个体性和结构化协调对于一致且社会适宜的多智能体HRI至关重要。项目网站和代码可在 https://project-m2hri.github.io/ 获取。

cs.RO / 4 / 2604.11981

Bipedal-Walking-Dynamics Model on Granular Terrains

颗粒介质上的双足行走动力学模型

Chen, Xunjie, Huang, Xinyan, Shan, Peter, Yi, Jingang, Liu, Tao

Abstract

Bipeds have demonstrated high agility and mobility in unstructured environments such as sand. The yielding of such granular media causes significant sinkage and slip of the bipedal feet, leading to uncertainty and instability in walking locomotion. We present a new dynamics-modeling approach to capture and predict bipedal-walking locomotion on granular media. A dynamic foot-terrain interaction model is integrated to compute the ground reaction force (GRF). The proposed granular dynamic model has three additional degrees of freedom (DoFs) to estimate foot sinkage and slip, which are critical to capturing robot-walking kinematics and kinetics such as cost of transport (CoT). Using the new model, we analyze bipedal kinetics, CoT, and foot-terrain rolling and intrusion effects. Experiments are conducted using a biped robotic walker on sand to validate the proposed dynamic model with robot-gait profiles, media-intrusion prediction, and GRF estimations. This new dynamics model can further serve as an enabling tool for locomotion control and optimization of bipedal robots to efficiently walk on granular terrains.

Chinese Translation

双足机器人在沙地等非结构化环境中表现出高度的灵活性和机动性。此类颗粒介质的变形导致双足显著的下陷和滑移，从而引发行走运动的不确定性和不稳定性。本文提出了一种新的动力学建模方法，用以捕捉和预测双足机器人在颗粒介质上的行走运动。该方法集成了动态足-地面相互作用模型，以计算地面反作用力（Ground Reaction Force, GRF）。所提出的颗粒动力学模型增加了三个自由度（Degree-of-Freedom, DoF），用于估计足部的下陷和滑移，这对于准确捕捉机器人行走的运动学和动力学特性（如运输代价Cost of Transport, CoT）至关重要。基于该模型，我们分析了双足机器人的动力学、运输代价以及足部与地面滚动和侵入的影响。通过在沙地上使用双足机器人行走器进行实验，验证了所提动力学模型在机器人步态特征、介质侵入预测和地面反作用力估计方面的有效性。该动力学模型可作为双足机器人在颗粒介质上高效行走的运动控制与优化的重要工具。
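
The abstract uses cost of transport (CoT) as a kinetic metric without defining it. A minimal sketch of the common dimensionless definition, CoT = E / (m g d), is given below. The function name, argument names, and numeric values are illustrative assumptions, and the paper may compute CoT from its own energy model.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def cost_of_transport(energy_j: float, mass_kg: float, distance_m: float) -> float:
    """Dimensionless cost of transport: energy spent per unit weight per unit distance."""
    if mass_kg <= 0.0 or distance_m <= 0.0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * G * distance_m)


# Hypothetical walker: 10 kg robot spends 981 J to travel 10 m.
cot = cost_of_transport(energy_j=981.0, mass_kg=10.0, distance_m=10.0)
```

On yielding terrain, sinkage and slip shorten the distance travelled for the same energy, which raises this ratio.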

cs.RO / 5 / 2604.11991

Complementarity by Construction: A Lie-Group Approach to Solving Quadratic Programs with Linear Complementarity Constraints

构造性互补性：一种基于李群的线性互补约束二次规划求解方法

Bishop, Arun L., Reich, Micah I., Manchester, Zachary

Abstract

Many problems in robotics require reasoning over a mix of continuous dynamics and discrete events, such as making and breaking contact in manipulation and locomotion. These problems are locally well modeled by linear complementarity quadratic programs (LCQPs), an extension of QPs that introduces complementarity constraints. While very expressive, LCQPs are non-convex, and few solvers exist for computing good local solutions for use in planning pipelines. In this work, we observe that complementarity constraints form a Lie group under infinitesimal relaxation, and leverage this structure to perform on-manifold optimization. We introduce a retraction map that is numerically well behaved, and use it to parameterize the constraints so that they are satisfied by construction. The resulting solver avoids many of the classical issues with complementarity constraints. We provide an open-source solver, Marble, that is implemented in C++ with Julia and Python bindings. We demonstrate that Marble is competitive on a suite of benchmark problems, and solves a number of robotics problems where existing approaches fail to converge.

Chinese Translation

机器人领域的许多问题需要同时处理连续动力学和离散事件，例如操作和运动中的接触建立与断开。这些问题在局部上可以通过线性互补二次规划（LCQPs）良好建模，LCQPs是对二次规划（QPs）的扩展，加入了互补约束。尽管表达能力强，LCQPs是非凸的，且现有求解器很少能够计算出适用于规划流程的良好局部解。在本工作中，我们观察到互补约束在无限小松弛下构成一个李群结构，并利用该结构进行流形上的优化。我们引入了一个数值表现良好的回缩（retraction）映射，并用其对约束进行参数化，从而保证约束在构造时即被满足。由此产生的求解器避免了传统互补约束中许多经典问题。我们提供了一个开源求解器Marble，该求解器以C++实现，并提供Julia和Python接口。我们展示了Marble在一系列基准问题上的竞争力，并成功解决了多个现有方法无法收敛的机器人问题。

cs.RO / 6 / 2604.11992

ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting

ReefMapGS：通过闭合多模态SLAM与高斯点云之间的循环，实现大规模水下重建

Yang, Daniel, Hong, Jungseok, Leonard, John J., Girdhar, Yogesh

Abstract

3D Gaussian Splatting is a powerful visual representation, providing high-quality and efficient 3D scene reconstruction, but it is crucially dependent on accurate camera poses typically obtained from computationally intensive processes like structure-from-motion that are unsuitable for field robot applications. However, in these domains, multimodal sensor data from acoustic, inertial, pressure, and visual sensors are available and suitable for pose-graph optimization-based SLAM methods that can estimate the vehicle's trajectory and thus our needed camera poses while providing uncertainty. We propose a 3DGS-based incremental reconstruction framework, ReefMapGS, that builds an initial model from a high certainty region and progressively expands to incorporate the whole scene. We reconstruct the scene incrementally by interleaving local tracking of new image observations with optimization of the underlying 3DGS scene. These refined poses are integrated back into the pose-graph to globally optimize the whole trajectory. We show COLMAP-free 3D reconstruction of two underwater reef sites with complex geometry as well as more accurate global pose estimation of our AUV over survey trajectories spanning up to 700 m.

Chinese Translation

3D高斯点云（3D Gaussian Splatting）是一种强大的视觉表示方法，能够提供高质量和高效的3D场景重建，但其关键依赖于通常通过计算密集型过程（如运动重建）获得的准确相机位姿，这些过程不适合现场机器人应用。然而，在这些领域，来自声学、惯性、压力和视觉传感器的多模态传感器数据是可用的，并且适合基于位姿图优化的SLAM方法，这些方法可以估计车辆的轨迹，从而获得我们所需的相机位姿，并提供不确定性。我们提出了一种基于3DGS的增量重建框架ReefMapGS，该框架从高确定性区域构建初始模型，并逐步扩展以纳入整个场景。我们通过将新图像观测的局部跟踪与底层3DGS场景的优化交替进行，逐步重建场景。这些精细化的位姿被重新整合回位姿图中，以全局优化整个轨迹。我们展示了对两个具有复杂几何形状的水下珊瑚礁地点的无COLMAP 3D重建，以及对我们在长达700米的调查轨迹上自主水下航行器（AUV）更准确的全局位姿估计。

cs.RO / 7 / 2604.12006

A Foot Resistive Force Model for Legged Locomotion on Muddy Terrains

泥泞地形下腿式运动的足部阻力模型

Chen, Xunjie, Wang, Liuyin, Huang, Xinyan, Shan, Jerry, Shen, Yantao, Yi, Jingang

Abstract

Legged robots face significant challenges in moving and navigating on deformable and highly yielding terrain such as mud. We present a resistive force model for legged foot-mud interactions. The model captures rheological behaviors such as visco-elasticity, thixotropy of the mud suspension and retractive suction. One attractive property of this new model lies in its effective, uniform formulation to provide underlying physical interpretation and accurate resistive force predictions. We further take advantage of the resistive force model to design a new morphing robotic foot for effective and efficient legged locomotion. We conduct extensive experiments to validate the force model, and the results demonstrate that the morphing foot enhances not only the locomotion mobility but also energy-efficiency of walking in mud. The new resistive force model can be further used to develop data-driven simulation and locomotion control of legged robots on muddy terrains.

Chinese Translation

腿式机器人在变形和高度可变的地形（如泥泞）上移动和导航面临重大挑战。我们提出了一种足部与泥土相互作用的阻力模型。该模型捕捉了泥悬浮液的流变行为，如粘弹性、触变性和回缩吸力。该新模型的一个吸引人之处在于其有效且统一的公式化，能够提供基础物理解释并准确预测阻力。我们进一步利用该阻力模型设计了一种新的变形机器人足，以实现有效和高效的腿式运动。我们进行了广泛的实验以验证该力模型，结果表明，变形足不仅增强了在泥中行走的运动能力，还提高了能效。新的阻力模型还可以进一步用于开发基于数据的模拟和腿式机器人的运动控制。

cs.RO / 8 / 2604.12027

3DRO: Lidar-level SE(3) Direct Radar Odometry Using a 2D Imaging Radar and a Gyroscope

3DRO：基于2D成像雷达和陀螺仪的激光雷达级SE(3)直接雷达里程计

Gentil, Cedric Le, Lisus, Daniil, Barfoot, Timothy D.

Abstract

Recently, the robotics community has regained interest in radar-based perception and state estimation. A 2D imaging radar provides dense 360° information about the environment. Despite the radar antenna's cone of emission and reception, the collected data is generally assumed to be limited to the plane orthogonal to the radar's spinning axis. Accordingly, most methods based on 2D imaging radars only perform SE(2) state estimation. This paper presents 3DRO, an extension of the SE(2) Direct Radar Odometry (DRO) framework to perform state estimation in SE(3). While still assuming planarity of the data through DRO's 2D velocity estimates, it integrates 3D gyroscope measurements over SO(3) to estimate SE(3) ego motion. While simple, this approach provides lidar-level odometry accuracy as demonstrated using 643 km of data from the Boreas-RT dataset.

Chinese Translation

近年来，机器人领域重新关注基于雷达的感知和状态估计。2D成像雷达提供关于环境的密集360度信息。尽管雷达天线的发射和接收锥体，收集的数据通常被假设仅限于与雷达旋转轴正交的平面。因此，大多数基于2D成像雷达的方法仅执行SE(2)状态估计。本文提出了3DRO，这是SE(2)直接雷达里程计（DRO）框架的扩展，用于在SE(3)中执行状态估计。尽管仍假设数据的平面性通过DRO的2D速度估计，但它集成了SO(3)上的3D陀螺仪测量，以估计SE(3)自我运动。尽管方法简单，但通过使用Boreas-RT数据集中的643公里数据证明，该方法提供了激光雷达级的里程计精度。
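
The combination described in the 3DRO abstract, planar velocity estimates plus gyroscope integration over SO(3), can be sketched as a simple dead-reckoning step. The sketch assumes a body-frame planar velocity `(vx, vy)`, a body-frame angular rate, and first-order (Euler) integration of position. The helper names and the constant-rate example are hypothetical; this is not the paper's estimator.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def mat_vec(a: Matrix, v: Vector) -> Vector:
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def so3_exp(phi: Vector) -> Matrix:
    """Rodrigues' formula: map a rotation vector to a rotation matrix."""
    theta = math.sqrt(sum(c * c for c in phi))
    x, y, z = phi
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    if theta < 1e-9:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    k2 = mat_mul(k, k)
    return [
        [(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j] for j in range(3)]
        for i in range(3)
    ]


def propagate(
    R: Matrix, p: Vector, gyro: Vector, v_planar: Tuple[float, float], dt: float
) -> Tuple[Matrix, Vector]:
    """One step: move along the planar body velocity, then rotate by the gyro increment."""
    v_world = mat_vec(R, [v_planar[0], v_planar[1], 0.0])  # velocity has no body-z part
    p_next = [p[i] + v_world[i] * dt for i in range(3)]
    R_next = mat_mul(R, so3_exp([w * dt for w in gyro]))
    return R_next, p_next


# Hypothetical run: 1 m/s forward while yawing at 90 deg/s for one second.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [0.0, 0.0, 0.0]
for _ in range(100):
    R, p = propagate(R, p, gyro=[0.0, 0.0, math.pi / 2], v_planar=(1.0, 0.0), dt=0.01)
```

Because the rotation is tracked in full SO(3), any pitch or roll reported by the gyroscope tilts the planar velocity out of the horizontal plane, which is how a 2D velocity can still produce 3D motion.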

cs.RO / 9 / 2604.12031

Dynamic Modeling and Robust Gait Optimization of a Compliant Worm Robot

柔顺蠕虫机器人动态建模与鲁棒步态优化

Zhou, Xinyu, Mei, Yu, Thomson, Faith, Luedtke, Christian, Qi, Xinda, Tan, Xiaobo

Abstract

Worm-inspired robots provide an effective locomotion strategy for constrained environments by combining cyclic body deformation with alternating anchoring. For compliant robots, however, the interaction between deformable anchoring structures and the environment makes predictive modeling and deployable gait optimization challenging. This paper presents an experimentally grounded modeling and optimization framework for a compliant worm robot capable of traversing corrugated pipes. First, a hybrid dynamic locomotion model is derived, in which the robot motion is represented by continuous dynamics within a corrugation groove and discrete switching of anchoring positions between adjacent grooves. A slack-aware actuation model is further introduced to map the commanded gait input to the realized body-length change, and an energy model is developed based on physics and calibrated with empirical power measurement. Based on these models, a multi-objective gait optimization problem is formulated to maximize average speed while minimizing average power. To reduce the fragility of nominal boundary-seeking solutions, a kinematic robustness margin is introduced into the anchoring-transition conditions, leading to a margin-based robust gait optimization framework. Experimental results show that the proposed framework captures the dominant locomotion and energy-consumption behavior of the robot over the tested conditions, and enables robust gait optimization for achieving speed-power trade-off.

Chinese Translation

受蠕虫启发的机器人通过结合周期性体形变和交替锚固，为受限环境中的运动提供了一种有效策略。然而，对于柔顺机器人而言，可变形锚固结构与环境的相互作用使得预测建模和可部署的步态优化具有挑战性。本文提出了一种基于实验的柔顺蠕虫机器人建模与优化框架，该机器人能够穿越波纹管。首先，推导出一种混合动力学运动模型，其中机器人运动由波纹槽内的连续动力学和相邻槽间锚固位置的离散切换表示。进一步引入了考虑松弛的驱动模型，将指令步态输入映射到实际的体长变化，并基于物理原理建立了能量模型，且通过实测功率进行了校准。基于这些模型，构建了一个多目标步态优化问题，以在最大化平均速度的同时最小化平均功率。为降低标称边界寻优解的脆弱性，在锚固转换条件中引入了运动学鲁棒裕度，形成了基于裕度的鲁棒步态优化框架。实验结果表明，所提框架能够准确捕捉机器人在测试条件下的主要运动和能耗特性，并实现速度与功率权衡的鲁棒步态优化。
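
The speed-power trade-off in the worm-robot abstract is a two-objective problem: maximize average speed, minimize average power. A minimal sketch of filtering candidate gaits down to the non-dominated (Pareto) set follows. The gait names and numbers are invented for illustration, and the paper's optimizer and robustness margin are not modelled here.

```python
from typing import List, Tuple

Gait = Tuple[str, float, float]  # (name, average speed, average power)


def pareto_front(gaits: List[Gait]) -> List[Gait]:
    """Keep gaits that no other gait beats on both speed (higher) and power (lower)."""
    front: List[Gait] = []
    for name, speed, power in gaits:
        dominated = any(
            (s >= speed and p <= power) and (s > speed or p < power)
            for _, s, p in gaits
        )
        if not dominated:
            front.append((name, speed, power))
    return front


# Hypothetical candidates in arbitrary units.
candidates = [("a", 1.0, 2.0), ("b", 1.5, 3.0), ("c", 0.9, 2.5), ("d", 1.5, 2.8)]
front = pareto_front(candidates)
```

Gait `b` is dropped because `d` is equally fast at lower power, and `c` is dropped because `a` is both faster and cheaper.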

cs.RO / 10 / 2604.12092

Ternary Logic Encodings of Temporal Behavior Trees with Application to Control Synthesis

具有控制合成应用的时序行为树的三值逻辑编码

Matheu, Ryan, Baras, John S., Belta, Calin

Abstract

Behavior Trees (BTs) provide designers an intuitive graphical interface to construct long-horizon plans for autonomous systems. To ensure their correctness and safety, rigorous formal models and verification techniques are essential. Temporal BTs (TBTs) offer a promising approach by leveraging existing temporal logic formalisms to specify and verify the executions of BTs. However, this analysis is currently limited to offline post hoc analysis and trace repair. In this paper, we reformulate TBTs using a ternary-valued Signal Temporal Logic (STL) amenable to control synthesis. Ternary logic introduces a third truth value, Unknown, formally capturing cases where a trajectory has neither fully satisfied nor violated a specification. We propose mixed-integer linear encodings for partial-trajectory STL and TBTs over ternary logic, allowing for correct-by-construction control strategies for linear dynamical systems via mixed-integer optimization. We demonstrate the utility of our framework by solving optimal control problems.

Chinese Translation

行为树（Behavior Trees，BTs）为设计者提供了一种直观的图形界面，用于构建自主系统的长时间规划。为了确保其正确性和安全性，严谨的形式模型和验证技术是必不可少的。时序行为树（Temporal BTs，TBTs）通过利用现有的时序逻辑形式化方法来指定和验证BT的执行，提供了一种有前景的方法。然而，该分析目前仅限于离线的事后分析和轨迹修复。本文中，我们使用适用于控制合成的三值信号时序逻辑（Signal Temporal Logic，STL）重新表述了TBTs。三值逻辑引入了第三种真值“未知”（Unknown），形式化地捕捉了轨迹既未完全满足也未完全违背规范的情况。我们提出了基于混合整数线性编码的部分轨迹STL和基于三值逻辑的TBTs方法，使得通过混合整数优化能够为线性动力系统生成正确构造的控制策略。我们通过求解最优控制问题展示了该框架的实用性。
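
The third truth value described in the abstract can be made concrete with a small sketch of temporal operators over a partial trajectory. It assumes Kleene's strong three-valued logic (conjunction as minimum, disjunction as maximum) and discrete time steps. The operator names and this choice of semantics are assumptions for illustration, not the paper's encoding.

```python
from enum import IntEnum
from typing import Callable, Sequence


class Truth(IntEnum):
    FALSE = 0
    UNKNOWN = 1
    TRUE = 2


def t_not(a: Truth) -> Truth:
    return Truth(2 - a)


def t_and(a: Truth, b: Truth) -> Truth:
    return Truth(min(a, b))


def t_or(a: Truth, b: Truth) -> Truth:
    return Truth(max(a, b))


def predicate_at(signal: Sequence[float], t: int, pred: Callable[[float], bool]) -> Truth:
    """Samples beyond the observed prefix are Unknown."""
    if t >= len(signal):
        return Truth.UNKNOWN
    return Truth.TRUE if pred(signal[t]) else Truth.FALSE


def always(signal: Sequence[float], horizon: int, pred: Callable[[float], bool]) -> Truth:
    """'Always within [0, horizon]': one observed violation settles it as False."""
    result = Truth.TRUE
    for t in range(horizon + 1):
        result = t_and(result, predicate_at(signal, t, pred))
    return result


def eventually(signal: Sequence[float], horizon: int, pred: Callable[[float], bool]) -> Truth:
    """'Eventually within [0, horizon]': one observed success settles it as True."""
    result = Truth.FALSE
    for t in range(horizon + 1):
        result = t_or(result, predicate_at(signal, t, pred))
    return result


def positive(x: float) -> bool:
    return x > 0.0


partial = [1.0, 2.0, 3.0]  # only three of the five required samples are available
verdict = always(partial, horizon=4, pred=positive)
```

Here `verdict` is `Truth.UNKNOWN`: nothing has been violated yet, but the horizon has not been fully observed, which is the case a two-valued semantics cannot express.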

cs.RO / 11 / 2604.12149

Uncertainty Guided Exploratory Trajectory Optimization for Sampling-Based Model Predictive Control

基于不确定性引导的探索性轨迹优化用于采样基础模型预测控制

Poyrazoglu, O. Goktug, Cao, Yukang, Moorthy, Rahul, Isler, Volkan

Abstract

Trajectory optimization depends heavily on initialization. In particular, sampling-based approaches are highly sensitive to initial solutions, and limited exploration frequently leads them to converge to local minima in complex environments. We present Uncertainty Guided Exploratory Trajectory Optimization (UGE-TO), a trajectory optimization algorithm that generates well-separated samples to achieve a better coverage of the configuration space. UGE-TO represents trajectories as probability distributions induced by uncertainty ellipsoids. Unlike sampling-based approaches that explore only in the action space, this representation captures the effects of both system dynamics and action selection. By incorporating the impact of dynamics, in addition to the action space, into our distributions, our method enhances trajectory diversity by enforcing distributional separation via the Hellinger distance between them. It enables a systematic exploration of the configuration space and improves robustness against local minima. Further, we present UGE-MPC, which integrates UGE-TO into sampling-based model predictive controller methods. Experiments demonstrate that UGE-MPC achieves higher exploration and faster convergence in trajectory optimization compared to baselines under the same sampling budget, achieving 72.1% faster convergence in obstacle-free environments and 66% faster convergence with a 6.7% higher success rate in the cluttered environment compared to the best-performing baseline. Additionally, we validate the approach through a range of simulation scenarios and real-world experiments. Our results indicate that UGE-MPC has higher success rates and faster convergence, especially in environments that demand significant deviations from nominal trajectories to avoid failures. The project and code are available at https://ogpoyrazoglu.github.io/cuniform_sampling/.

Chinese Translation

轨迹优化在很大程度上依赖于初始化。特别是，基于采样的方法对初始解高度敏感，有限的探索常常导致它们在复杂环境中收敛到局部最小值。我们提出了不确定性引导的探索性轨迹优化（Uncertainty Guided Exploratory Trajectory Optimization, UGE-TO），这是一种轨迹优化算法，通过生成良好分离的样本来实现对配置空间的更好覆盖。UGE-TO将轨迹表示为由不确定性椭球体引发的概率分布。与仅在动作空间中进行探索的基于采样的方法不同，这种表示法捕捉了系统动态和动作选择的影响。通过将动态影响纳入我们的分布中，除了动作空间外，我们的方法通过施加Hellinger距离来增强轨迹的多样性，从而强制实现分布间的分离。这使得对配置空间的系统性探索成为可能，并提高了对局部最小值的鲁棒性。此外，我们提出了UGE-MPC，将UGE-TO集成到基于采样的模型预测控制方法中。实验表明，在相同的采样预算下，UGE-MPC在轨迹优化中实现了更高的探索性和更快的收敛，相比于基线在无障碍环境中收敛速度提高了72.1%，在杂乱环境中收敛速度提高了66%，成功率提高了6.7%。此外，我们通过一系列仿真场景和实际实验验证了该方法。我们的结果表明，UGE-MPC在成功率和收敛速度上均表现更佳，尤其是在需要显著偏离标称轨迹以避免失败的环境中。项目和代码可在 https://ogpoyrazoglu.github.io/cuniform_sampling/ 获取。
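
The abstract separates trajectory distributions using the Hellinger distance. For two Gaussians this distance has a closed form. The sketch below covers the special case of diagonal covariances, where the Bhattacharyya coefficient factorizes over dimensions. Treating each uncertainty ellipsoid as an axis-aligned Gaussian is an assumption made to keep the example short, and the function name is hypothetical.

```python
import math
from typing import Sequence


def hellinger_diag_gaussians(
    mu1: Sequence[float],
    sigma1: Sequence[float],
    mu2: Sequence[float],
    sigma2: Sequence[float],
) -> float:
    """Hellinger distance between two Gaussians with diagonal covariances.

    sigma1 and sigma2 hold per-axis standard deviations (all positive).
    The result is 0 for identical distributions and approaches 1 as they separate.
    """
    bc = 1.0  # Bhattacharyya coefficient, a product of per-axis terms
    for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2):
        var_sum = s1 * s1 + s2 * s2
        bc *= math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(-((m1 - m2) ** 2) / (4.0 * var_sum))
    return math.sqrt(max(0.0, 1.0 - bc))


# Two hypothetical 2D state distributions at the same time step.
near = hellinger_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.5, 0.0], [1.0, 1.0])
far = hellinger_diag_gaussians([0.0, 0.0], [1.0, 1.0], [5.0, 0.0], [1.0, 1.0])
```

Since the distance is bounded in [0, 1], a separation requirement can be written as a single threshold regardless of the scale of the state space.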

cs.RO / 12 / 2604.12169

Robotic Nanoparticle Synthesis via Solution-based Processes

基于溶液过程的机器人纳米粒子合成

Mahalingam, Dasharadhan, Gallagher, Michael, Chakraborty, Nilanjan, Wong, Stanislaus S.

Abstract

We present a screw geometry-based manipulation planning framework for the robotic automation of solution-based synthesis, exemplified through the preparation of gold and magnetite nanoparticles. The synthesis protocols are inherently long-horizon, multi-step tasks, requiring skills such as pick-and-place, pouring, turning a knob, and periodic visual inspection to detect reaction completion. A central challenge is that some skills, notably pouring, transferring containers with solutions, and turning a knob, impose geometric and kinematic constraints on the end-effector motion. To address this, we use a programming by demonstration paradigm where the constraints can be extracted from a single demonstration. This combination of screw-based motion representation and demonstration-driven specification enables domain experts, such as chemists, to readily adapt and reprogram the system for new experimental protocols and laboratory setups without requiring expertise in robotics or motion planning. We extract sequences of constant screws from demonstrations, which compactly encode the motion constraints while remaining coordinate-invariant. This representation enables robust generalization across variations in grasp placement and allows parameterized reuse of a skill learned from a single example. By composing these screw-parameterized primitives according to the synthesis protocol, the robot autonomously generates motion plans that execute the complete experiment over repeated runs. Our results highlight that screw-theoretic planning, combined with programming by demonstration, provides a rigorous and generalizable foundation for long-horizon laboratory automation, thereby enabling fundamental kinematics to have a translational impact on the use of robots in developing scalable solution-based synthesis protocols.

Chinese Translation

我们提出了一种基于螺旋几何的操作规划框架，用于机器人自动化的基于溶液的合成，通过制备金和磁铁矿纳米粒子进行示例。合成协议本质上是长时间跨度的多步骤任务，要求具备如抓取与放置、倒液、旋转旋钮和定期视觉检查以检测反应完成等技能。一个主要挑战是某些技能，特别是倒液、转移含溶液的容器和旋转旋钮，对末端执行器的运动施加了几何和运动学约束。为了解决这个问题，我们采用了演示编程范式，从单次演示中提取约束。这种基于螺旋运动表示和演示驱动规范的结合，使得领域专家，如化学家，能够轻松地调整和重新编程系统以适应新的实验协议和实验室设置，而无需具备机器人或运动规划方面的专业知识。我们从演示中提取出常数螺旋序列，这些序列紧凑地编码了运动约束，同时保持坐标不变性。这种表示方式使得在抓取位置变化中能够进行稳健的泛化，并允许对从单个示例中学习到的技能进行参数化重用。通过根据合成协议组合这些螺旋参数化原语，机器人能够自主生成运动计划，执行完整的实验并进行重复运行。我们的结果强调，结合演示编程的螺旋理论规划为长时间跨度的实验室自动化提供了严格且可推广的基础，从而使基础运动学在开发可扩展的基于溶液的合成协议中对机器人使用产生了实际影响。

cs.RO / 13 / 2604.12208

Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

揭示导航理解在端到端自动驾驶中的惊人有效性

Hua, Zhihua, Wang, Junli, LI, Pengfei, Jin, Qihao, Zhang, Bo, Sheng, Kehua, Chen, Yilun, Gan, Zhongxue, Ding, Wenchao

Abstract

Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG-VLA

Chinese Translation

全球导航信息和局部场景理解是自动驾驶系统的两个关键组成部分。然而，我们的实验结果表明，许多端到端自动驾驶系统往往过于依赖局部场景理解，而未能有效利用全球导航信息。这些系统在规划能力与导航输入之间表现出较弱的相关性，并且在复杂场景中难以执行导航跟随。为了解决这一限制，我们提出了顺序导航引导（Sequential Navigation Guidance, SNG）框架，这是一种基于真实世界导航模式的全球导航信息的高效表示。SNG 包含用于约束长期轨迹的导航路径和用于实时决策逻辑的逐步导航（Turn-by-Turn, TBT）信息。我们构建了 SNG-QA 数据集，这是一个基于 SNG 的视觉问答（Visual Question Answering, VQA）数据集，旨在对齐全球和局部规划。此外，我们引入了一种高效模型 SNG-VLA，该模型将局部规划与全球规划融合。SNG-VLA 通过精确的导航信息建模实现了最先进的性能，而无需感知任务的辅助损失函数。项目页面：SNG-VLA

cs.RO / 14 / 2604.12274

Asymptotically Stable Gait Generation and Instantaneous Walkability Determination for Planar Almost Linear Biped with Knees

具有膝关节的平面近线性双足机器人渐近稳定步态生成及瞬时可行走性判定

Asano, Fumihiko, Lei, Ning, Sedoguchi, Taiki

Abstract

A class of planar bipedal robots with unique mechanical properties has been proposed, where all links are balanced around the hip joint, preventing natural swinging motion due to gravity. A common property of their equations of motion is that the inertia matrix is a constant matrix, there are no nonlinear velocity terms, and the gravity term contains simple nonlinear terms. By performing a Taylor expansion of the gravity term and making a linear approximation, it is easy to derive a linearized model, and calculations for future states or walkability determination can be performed instantaneously without the need for numerical integration. This paper extends the method to a planar biped robot model with knees. First, we derive the equations of motion, constraint conditions, and inelastic collisions for a planar 6-DOF biped robot, design its control system, and numerically generate a stable bipedal gait on a horizontal plane. Next, we reduce the equations of motion to a 3-DOF model, and derive a linearized model by approximating the gravity term as linear around the expansion point for the thigh frame angle. Through numerical simulations, we demonstrate that calculations for future states and walkability determination can be completed in negligible time. By applying control inputs to the obtained model, performing state-space realization, and then discretizing it, instantaneous walkability determination through iterative calculation becomes possible. Through detailed gait analysis, we discuss how the knee joint flexion angle and the expansion point affect the accuracy of the linear approximation, and the issues that arise when descending a small step.

Chinese Translation

本文提出了一类具有独特机械特性的平面双足机器人，该机器人所有连杆均围绕髋关节保持平衡，避免了由于重力引起的自然摆动。其运动方程的一个共同特性是惯性矩阵为常矩阵，无非线性速度项，且重力项仅包含简单的非线性项。通过对重力项进行泰勒展开并进行线性近似，可以容易地导出线性化模型，从而无需数值积分即可瞬时完成未来状态计算或可行走性判定。本文将该方法扩展至具有膝关节的平面双足机器人模型。首先，推导了该平面6自由度双足机器人的运动方程、约束条件及非弹性碰撞，设计了其控制系统，并在水平面上数值生成了稳定的双足步态。接着，将运动方程简化为3自由度模型，并通过以大腿框架角度为展开点对重力项进行线性近似，导出了线性化模型。通过数值仿真，验证了未来状态计算和可行走性判定可在极短时间内完成。通过对所得模型施加控制输入，进行状态空间实现并离散化，实现了通过迭代计算的瞬时可行走性判定。通过详细的步态分析，讨论了膝关节屈曲角度及展开点对线性近似精度的影响，以及在下小台阶时出现的问题。

cs.RO / 15 / 2604.12293

Defining and Evaluation Method for External Human-Machine Interfaces

外部人机界面的定义与评估方法

Gonzalez-Belmonte, Jose, Kwon, Jaerock

Abstract

As the number of fatalities involving Autonomous Vehicles increases, the need for a universal method of communication between vehicles and other agents on the road has also increased. Over the past decade, numerous proposals for external Human-Machine Interfaces (eHMIs) have been brought forward to bridge this communication gap, but none has yet been identified as the ideal one. This work proposes a universal evaluation method composed of 223 questions to objectively evaluate and compare different proposals and arrive at a conclusion. The questionnaire is divided into 7 categories that evaluate different aspects of any given proposal that uses eHMIs: ease of standardization, cost effectiveness, accessibility, ease of understanding, multifacetedness in communication, positioning, and readability. To test the method, it was applied to four existing proposals, plus a baseline using only kinematic motions, both to exemplify the application of the evaluation method and to offer a baseline score for future comparison. The result of this testing suggests that the ideal method of machine-human communication is a combination of intentionally designed vehicle kinematics and distributed, well-placed text-based displays, but it also reveals knowledge gaps in the readability of eHMIs and the speed at which different observers may learn their meaning. This paper proposes future work related to these uncertainties, along with future testing with the proposed method.

Chinese Translation

随着涉及自主车辆的致命事故数量增加，车辆与道路上其他代理之间的通用沟通方法的需求也随之增加。在过去十年中，提出了众多外部人机界面（eHMIs）的方案，旨在弥补这一沟通鸿沟，但尚未确定出理想的方案。本研究提出了一种由223个问题组成的通用评估方法，以客观评估和比较不同的提案并得出结论。问卷分为7个类别，评估使用eHMIs的任何提案的不同方面：标准化的便利性、成本效益、可及性、易理解性、沟通的多样性、定位和可读性。为了测试该方法，我们对四个现有提案进行了评估，并使用仅包含运动学动作的基线进行比较，以示范评估方法的应用并提供未来比较的基线分数。测试结果表明，理想的人机沟通方法是故意设计的车辆运动学与分布式、合理放置的基于文本的显示屏的结合，但也揭示了eHMIs的可读性和不同观察者学习其含义的速度方面的知识空白。本文提出了与这些不确定性相关的未来研究方向，以及使用所提方法进行的未来测试。

cs.RO / 16 / 2604.12418

RACF: A Resilient Autonomous Car Framework with Object Distance Correction

RACF：一种具有物体距离修正的弹性自主汽车框架

Tsai, Chieh, Rastgoftar, Hossein, Hariri, Salim

Abstract

Autonomous vehicles are increasingly deployed in safety-critical applications, where sensing failures or cyberphysical attacks can lead to unsafe operations resulting in human loss and/or severe physical damage. Reliable real-time perception is therefore critically important for their safe operation and acceptability. For example, vision-based distance estimation is vulnerable to environmental degradation and adversarial perturbations, and existing defenses are often reactive and too slow to promptly mitigate their impact on safe operation. We present a Resilient Autonomous Car Framework (RACF) that incorporates an Object Distance Correction Algorithm (ODCA) to improve perception-layer robustness through redundancy and diversity across a depth camera, LiDAR, and physics-based kinematics. Within this framework, when the obstacle distance estimate produced by the depth camera is inconsistent, a cross-sensor gate activates the correction algorithm to fix the detected inconsistency. We experimented with the proposed framework and evaluated its performance on a testbed implemented using the Quanser QCar 2 platform. The framework achieved up to 35% RMSE reduction under strong corruption and improved stop compliance and braking latency, while operating in real time. These results demonstrate a practical and lightweight approach to resilient perception for safety-critical autonomous driving.

Chinese Translation

自主车辆越来越多地应用于安全关键的场景中，在这些场景中，传感器故障或网络物理攻击可能导致不安全的操作，从而造成人员伤亡和/或严重的物理损害。因此，可靠的实时感知对于其安全操作和可接受性至关重要。例如，基于视觉的距离估计容易受到环境退化和对抗性扰动的影响，而现有的防御措施往往是反应性的，且反应速度过慢，无法及时减轻其对安全操作的影响。我们提出了一种弹性自主汽车框架（RACF），该框架结合了物体距离修正算法（ODCA），通过深度相机、激光雷达和基于物理的运动学之间的冗余和多样性来提高感知层的鲁棒性。在该框架内，当深度相机产生的障碍物距离估计不一致时，跨传感器门控会激活修正算法以修复检测到的不一致性。我们在使用Quanser QCar 2平台实现的测试平台上对所提出的弹性汽车框架进行了实验，并评估了其性能。该框架在强干扰下实现了高达35%的均方根误差（RMSE）降低，并改善了停车合规性和制动延迟，同时实时运行。这些结果展示了一种实用且轻量级的弹性感知方法，适用于安全关键的自主驾驶。
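
The cross-sensor gate in the RACF abstract can be illustrated with a deliberately simple sketch. It assumes that the depth-camera distance is checked against a LiDAR range when one is available, and otherwise against a constant-speed kinematic prediction. The function names, the 0.3 m gate width, and the fallback rule are all hypothetical stand-ins, not the ODCA described in the paper.

```python
from typing import Optional


def kinematic_prediction(prev_distance_m: float, speed_mps: float, dt_s: float) -> float:
    """Predict the obstacle distance assuming a static obstacle and constant ego speed."""
    return max(0.0, prev_distance_m - speed_mps * dt_s)


def corrected_distance(
    camera_m: float,
    lidar_m: Optional[float],
    predicted_m: float,
    gate_m: float = 0.3,  # assumed tolerance; a real system would tune this
) -> float:
    """Pass the camera estimate through unless it disagrees with the reference."""
    reference = lidar_m if lidar_m is not None else predicted_m
    if abs(camera_m - reference) <= gate_m:
        return camera_m  # consistent: keep the camera estimate
    return reference  # inconsistent: fall back to the redundant source


# Hypothetical frame: the camera reports 5.0 m while LiDAR and kinematics agree on ~2 m.
predicted = kinematic_prediction(prev_distance_m=2.2, speed_mps=1.5, dt_s=0.1)
distance = corrected_distance(camera_m=5.0, lidar_m=2.1, predicted_m=predicted)
```

Here the gate rejects the corrupted camera reading and returns the LiDAR value, so a downstream stop decision would act on 2.1 m rather than 5.0 m.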

cs.RO / 17 / 2604.12436

D-BDM: A Direct and Efficient Boundary-Based Occupancy Grid Mapping Framework for LiDARs

D-BDM：一种针对激光雷达的直接高效边界基础占用网格映射框架

Tang, Benxu, Cai, Yixi, Kong, Fanze, Yin, Longji, Zhang, Fu

Abstract

Efficient and scalable 3D occupancy mapping is essential for autonomous robot applications in unknown environments. However, traditional occupancy grid representations suffer from two fundamental limitations. First, explicitly storing all voxels in three-dimensional space leads to prohibitive memory consumption. Second, exhaustive ray casting incurs high update latency. A recent representation alleviates memory demands by maintaining only the voxels on the two-dimensional boundary, yet it still relies on full ray casting updates. This work advances the boundary-based framework with a highly efficient update scheme. We introduce a truncated ray casting strategy that restricts voxel traversal to the exterior of the boundary, which dramatically reduces the number of updated voxels. In addition, we propose a direct boundary update mechanism that removes the need for an auxiliary local 3D occupancy grid, further reducing memory usage and simplifying the map update pipeline. We name our framework D-BDM. Extensive evaluations on public datasets demonstrate that our approach achieves significantly lower update time and memory consumption than both the baseline methods and the prior boundary-based approach.

Chinese Translation

高效且可扩展的三维占用映射对于在未知环境中的自主机器人应用至关重要。然而，传统的占用网格表示存在两个基本限制。首先，显式存储三维空间中的所有体素会导致巨大的内存消耗。其次，全面的光线投射会产生高更新延迟。最近的一种表示方法通过仅维护二维边界上的体素来减轻内存需求，但仍然依赖于全面的光线投射更新。本研究通过一种高效的更新方案推进了基于边界的框架。我们引入了一种截断光线投射策略，该策略将体素遍历限制在边界的外部，从而显著减少了更新的体素数量。此外，我们提出了一种直接边界更新机制，消除了对辅助局部三维占用网格的需求，进一步减少了内存使用并简化了地图更新流程。我们将我们的框架命名为D-BDM。对公共数据集的广泛评估表明，与基线方法以及之前的基于边界的方法相比，我们的方法在更新时间和内存消耗方面显著降低。

cs.RO / 18 / 2604.12447

HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models

HazardArena：评估视觉-语言-行动模型中的语义安全性

Chen, Zixing, Gao, Yifeng, Wang, Li, Zhao, Yunhan, Liu, Yi, Li, Jiayu, Zheng, Xiang, Wu, Zuxuan, Wang, Cong, Ma, Xingjun, Jiang, Yu-Gang

Abstract

Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action execution using semantic attributes or a vision-language judge, substantially reducing unsafe behaviors with minimal impact on task performance. We hope that HazardArena highlights the need to rethink how semantic safety is evaluated and enforced in VLAs as they scale toward real-world deployment.
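
As a rough illustration of how a training-free layer could constrain execution using semantic attributes, consider the sketch below. It is not the paper's Safety Option Layer: the rule table, the skill and attribute names, and the "stop" fallback are hypothetical placeholders.

```python
# Hypothetical (skill, attribute) pairs treated as unsafe. The real benchmark
# grounds its risk categories in robotic safety standards not reproduced here.
UNSAFE_COMBINATIONS = {
    ("heat", "flammable"),
    ("hand_over", "sharp"),
    ("pour", "toxic"),
}


def safety_option_layer(skill, target_attributes, proposed_action, fallback_action="stop"):
    """Pass the policy's action through unless the semantic context flags it."""
    for attribute in target_attributes:
        if (skill, attribute) in UNSAFE_COMBINATIONS:
            return fallback_action   # veto: same motion, unsafe semantics
    return proposed_action           # no rule matched: execute unchanged


safe_twin = safety_option_layer("heat", ["ceramic"], "move_to_stove")
unsafe_twin = safety_option_layer("heat", ["flammable"], "move_to_stove")
```

The two calls mirror the twin-scenario idea: the proposed motion is identical, and only the attribute of the target object decides whether it is executed.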

Chinese Translation

视觉-语言-行动（VLA）模型从视觉-语言骨干网络中继承了丰富的世界知识，并通过行动示范获得可执行技能。然而，现有的评估主要集中在行动执行的成功率上，使得行动策略与视觉-语言语义之间的耦合较为松散。这种解耦暴露出一种系统性脆弱性，即正确的行动执行在语义风险下可能导致不安全的结果。为了揭示这种脆弱性，我们引入了HazardArena，这是一个旨在评估VLA在受控但具有风险的环境下的语义安全性的基准。HazardArena由安全/不安全的双重场景构成，这些场景共享匹配的物体、布局和行动要求，仅在决定行动是否不安全的语义上下文上有所不同。我们发现，专门在安全场景上训练的VLA模型在其对应的不安全场景中评估时，往往未能安全地表现。HazardArena包含超过2000个资产和40个风险敏感任务，涵盖了基于既定机器人安全标准的7个现实世界风险类别。为了缓解这种脆弱性，我们提出了一种无训练的安全选项层（Safety Option Layer），该层通过语义属性或视觉-语言评判者来约束行动执行，显著减少不安全行为，同时对任务性能的影响最小。我们希望HazardArena能够强调重新思考如何在VLA中评估和执行语义安全性的必要性，以便在向现实世界部署时进行扩展。

cs.RO / 19 / 2604.12473

Designing for Error Recovery in Human-Robot Interaction

人机交互中的错误恢复设计

Wallbridge, Christopher D., Pulgarin, Erwin Jose Lopez

Abstract

This position paper looks briefly at the way we attempt to program robotic AI systems. Many AI systems are based on the idea of improving the performance of one individual system beyond so-called human baselines. However, these systems often consider one-shot, one-way decisions, whereas the real world is more continuous and interactive. Humans, by contrast, are often able to recover from and learn from errors, enabling a much higher rate of success. We look at the challenges of building a system that can detect and recover from its own errors, using robotic nuclear gloveboxes as an illustrative use case. We then go on to discuss simple starting designs.

Chinese Translation

本文简要探讨了我们编程机器人人工智能系统的方式。许多人工智能系统基于提升单个系统性能以超越所谓的人类基准的理念。然而，这些系统往往关注一次性和单向的决策，而现实世界则更加连续和互动。然而，人类通常能够从错误中恢复并学习，从而实现更高的成功率。我们考察了构建能够检测/恢复自身错误的系统所面临的挑战，以机器人核手套箱作为案例来帮助说明这些例子。接着，我们讨论了一些简单的初步设计。

cs.RO / 20 / 2604.12474

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

从运动学到动力学：学习优化混合规划以实现物理可行的执行

Erez, Lidor, Shperberg, Shahaf S., Taitler, Ayal

Abstract

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
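
To see why a first-order plan may fail second-order limits, the following sketch checks a sampled velocity profile against speed and acceleration bounds using finite differences. It only illustrates the feasibility gap, not the paper's MDP or its analytical constraints; the function names and the limits are assumptions.

```python
def max_required_acceleration(times, velocities):
    """Largest |dv/dt| implied by consecutive (time, velocity) samples.

    Assumes at least two samples and strictly increasing times.
    """
    return max(
        abs(velocities[i + 1] - velocities[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def is_dynamically_feasible(times, velocities, v_max, a_max):
    """True if the profile respects both the speed and the acceleration limit."""
    within_speed = all(abs(v) <= v_max for v in velocities)
    return within_speed and max_required_acceleration(times, velocities) <= a_max


times = [0.0, 1.0, 2.0]
first_order_plan = [0.0, 2.0, 2.0]   # jumps to cruise speed within one step
refined_plan = [0.0, 1.0, 2.0]       # ramps up gradually instead

plan_ok = is_dynamically_feasible(times, first_order_plan, v_max=2.0, a_max=1.0)
refined_ok = is_dynamically_feasible(times, refined_plan, v_max=2.0, a_max=1.0)
```

Both profiles respect the speed limit, but only the refined one stays within the assumed acceleration bound of 1.0, which is the kind of violation a velocity-only model cannot detect.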

Chinese Translation

在许多机器人任务中，智能体必须穿越一系列空间区域以完成任务。这类问题本质上是混合离散-连续的：包括高层次的动作序列和物理可行的连续轨迹。所得轨迹和动作序列还必须满足诸如截止时间、时间窗口以及速度或加速度限制等问题约束。尽管混合时序规划器试图解决这一挑战，但它们通常采用线性（一阶）动力学模型来描述运动，这无法保证生成的规划满足机器人真实的物理约束。因此，即使高层动作序列固定，生成动力学可行的轨迹仍成为一个双层优化问题。我们通过连续空间中的强化学习方法来解决该问题。我们定义了一个明确包含解析二阶约束的马尔可夫决策过程（MDP），并利用该过程来优化混合规划器生成的一阶规划。实验结果表明，该方法能够可靠地恢复物理可行性，有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

cs.RO / 21 / 2604.12482

Social Learning Strategies for Evolved Virtual Soft Robots

进化虚拟软体机器人中的社会学习策略

de Bruin, K. Ege, Glette, Kyrre, Ellefsen, Kai Olav, Nadizar, Giorgia, Medvet, Eric

Abstract

Optimizing the body and brain of a robot is a coupled challenge: the morphology determines what control strategies are effective, while the control parameters influence how well the morphology performs. This joint optimization can be done through nested loops of evolutionary and learning processes, where the control parameters of each robot are learned independently. However, the control parameters learned by one robot may contain valuable information for others. Thus, we introduce a social learning approach in which robots can exploit optimized parameters from their peers to accelerate their own brain optimization. Within this framework, we systematically investigate how the selection of teachers, deciding which and how many robots to learn from, affects performance, experimenting with virtual soft robots in four tasks and environments. In particular, we study the effect of inheriting experience from morphologically similar robots due to the tightly coupled body and brain in robot optimization. Our results confirm the effectiveness of building on others' experience, as social learning clearly outperforms learning from scratch under equivalent computational budgets. In addition, while the optimal teacher selection strategy remains open, our findings suggest that incorporating knowledge from multiple teachers can yield more consistent and robust improvements.

Chinese Translation

机器人身体与大脑的优化是一个耦合挑战：形态决定了哪些控制策略有效，而控制参数则影响形态的性能表现。这种联合优化可以通过进化与学习过程的嵌套循环来实现，其中每个机器人的控制参数独立学习。然而，一个机器人学习到的控制参数可能包含对其他机器人有价值的信息。因此，我们引入了一种社会学习方法，使机器人能够利用同伴的优化参数来加速自身大脑的优化。在此框架下，我们系统地研究了教师选择——即决定向哪些及多少机器人学习——对性能的影响，实验对象为四种任务和环境中的虚拟软体机器人。特别地，我们考察了由于机器人身体与大脑的紧密耦合，从形态相似的机器人继承经验的效果。结果证实了基于他人经验构建的有效性，社会学习在等效计算预算下明显优于从零开始学习。此外，尽管最佳教师选择策略尚未确定，我们的研究表明，融合来自多位教师的知识能够带来更为稳定和鲁棒的性能提升。

cs.RO / 22 / 2604.12486

DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation

DeCoNav：对话增强的长视距协作视觉-语言导航

Zhou, Sunyao, Wu, Yunzi, Wang, Tianhang, Li, Xinhai, Chen, Guang, Liu, Lizheng, Bai, Chenjia, Li, Xuelong

Abstract

Long-horizon collaborative vision-language navigation (VLN) is critical for multi-robot systems to accomplish complex tasks beyond the capability of a single agent. CoNavBench takes a first step by introducing the first collaborative long-horizon VLN benchmark with relay-style multi-robot tasks, a collaboration taxonomy, along with graph-grounded generation and evaluation to model handoffs and rendezvous in shared environments. However, existing benchmarks and evaluations often do not enforce strictly synchronized dual-robot rollout on a shared world timeline, and they typically rely on static coordination policies that cannot adapt when new cross-agent evidence emerges. We present Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation (DeCoNav), a decentralized framework that couples event-triggered dialogue with dynamic task allocation and replanning for real-time, adaptive coordination. In DeCoNav, robots exchange compact semantic states via dialogue without a central controller. When informative events such as new evidence, uncertainty, or conflicts arise, dialogue is triggered to dynamically reassign subgoals and replan under synchronized execution. Implemented in DeCoNavBench with 1,213 tasks across 176 HM3D scenes, DeCoNav improves the both-success rate (BSR) by 69.2%, demonstrating the effectiveness of dialogue-driven, dynamically reallocated planning for multi-robot collaboration.

Chinese Translation

长视距协作视觉-语言导航（VLN）对于多机器人系统完成超出单一代理能力的复杂任务至关重要。CoNavBench迈出了第一步，推出了第一个具有接力式多机器人任务的协作长视距VLN基准，建立了协作分类法，并结合图基础生成和评估来建模共享环境中的交接和会合。然而，现有的基准和评估往往未能严格执行共享世界时间线上的双机器人同步展开，且通常依赖于静态协调策略，无法在新的跨代理证据出现时进行适应性调整。我们提出了对话增强的长视距协作视觉-语言导航（DeCoNav），这是一个去中心化框架，将事件触发的对话与动态任务分配和实时重规划相结合，以实现自适应协调。在DeCoNav中，机器人通过对话交换紧凑的语义状态，而无需中央控制器。当出现新证据、不确定性或冲突等信息性事件时，会触发对话，以动态重新分配子目标并在同步执行下进行重规划。在包含176个HM3D场景的1,213个任务的DeCoNavBench中，DeCoNav将双成功率（BSR）提高了69.2%，证明了基于对话驱动的动态重新分配规划在多机器人协作中的有效性。

cs.RO / 23 / 2604.12509

Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

基于次优控制器的离线强化学习实现全身移动操作

Jauhri, Snehal, Prasad, Vignesh, Chalvatzaki, Georgia

Abstract

Mobile Manipulation (MoMa) of articulated objects, such as opening doors, drawers, and cupboards, demands simultaneous, whole-body coordination between a robot's base and arms. Classical whole-body controllers (WBCs) can solve such problems via hierarchical optimization, but require extensive hand-tuned optimization and remain brittle. Learning-based methods, on the other hand, show strong generalization capabilities but typically rely on expensive whole-body teleoperation data or heavy reward engineering. We observe that even a sub-optimal WBC is a powerful structural prior: it can be used to collect data in a constrained, task-relevant region of the state-action space, and its behavior can still be improved upon using offline reinforcement learning. Building on this, we propose WHOLE-MoMa, a two-stage pipeline that first generates diverse demonstrations by randomizing a lightweight WBC, and then applies offline RL to identify and stitch together improved behaviors via a reward signal. To support the expressive action-chunked diffusion policies needed for complex coordination tasks, we extend offline implicit Q-learning with Q-chunking for chunk-level critic evaluation and advantage-weighted policy extraction. On three tasks of increasing difficulty using a TIAGo++ mobile manipulator in simulation, WHOLE-MoMa significantly outperforms WBC, behavior cloning, and several offline RL baselines. Policies transfer directly to the real robot without finetuning, achieving 80% success in bimanual drawer manipulation and 68% in simultaneous cupboard opening and object placement, all without any teleoperated or real-world training data.
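
The three ingredients named in this abstract (implicit Q-learning, chunk-level critic evaluation, and advantage-weighted extraction) can be sketched in scalar form. The expectile loss follows the standard IQL definition. The chunk-level target below is one plausible reading of Q-chunking as an h-step discounted backup, and the hyperparameter values are assumptions rather than the paper's settings.

```python
import math


def expectile_loss(diff, tau=0.7):
    """IQL value loss |tau - 1(diff < 0)| * diff**2, with diff = Q(s, a) - V(s)."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def chunk_td_target(rewards, next_value, gamma=0.99):
    """Target for a critic that scores a whole action chunk (assumed form):
    discounted rewards over the chunk plus the bootstrapped value after it."""
    horizon = len(rewards)
    chunk_return = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return chunk_return + (gamma ** horizon) * next_value


def advantage_weight(q_value, v_value, beta=3.0, max_weight=100.0):
    """Clipped exp(beta * advantage) weight for a behavior-cloning style policy loss."""
    return min(math.exp(beta * (q_value - v_value)), max_weight)


target = chunk_td_target(rewards=[1.0, 1.0], next_value=4.0, gamma=0.5)
```

With tau above 0.5 the loss penalizes underestimation more than overestimation, which pushes V toward an upper expectile of Q over dataset actions. This is what lets the method improve on the data-collecting controller without querying out-of-distribution actions.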

Chinese Translation

关节物体的移动操作（Mobile Manipulation, MoMa），如开门、抽屉和橱柜，要求机器人底盘与机械臂之间的全身协调同步。传统的全身控制器（Whole-Body Controllers, WBCs）通过分层优化能够解决此类问题，但需要大量手工调优且鲁棒性较差。相比之下，基于学习的方法表现出较强的泛化能力，但通常依赖昂贵的全身远程操作数据或复杂的奖励设计。我们观察到即使是次优的WBC也具备强大的结构先验作用：它可以用于在受限且与任务相关的状态-动作空间区域内收集数据，并且其行为仍可通过离线强化学习得到提升。基于此，我们提出了WHOLE-MoMa，一种两阶段流程，首先通过随机化轻量级WBC生成多样化示范，然后应用离线强化学习通过奖励信号识别并拼接改进的行为。为了支持复杂协调任务所需的表达性动作分块扩散策略，我们扩展了离线隐式Q学习（offline implicit Q-learning），引入了用于分块级评论者评估的Q-chunking及优势加权策略提取。在使用TIAGo++移动操作机器人进行的三个难度递增任务中，WHOLE-MoMa显著优于WBC、行为克隆及多个离线强化学习基线。策略无需微调即可直接迁移至真实机器人，在双臂抽屉操作中成功率达80%，在同时开橱柜和放置物体任务中成功率达68%，且全程未使用任何远程操作或真实世界训练数据。

cs.RO / 24 / 2604.12565

Scalable Trajectory Generation for Whole-Body Mobile Manipulation

可扩展的全身移动操控轨迹生成

Niu, Yida, Chang, Xinhai, Liu, Xin, Jiao, Ziyuan, Zhu, Yixin

Abstract

Robots deployed in unstructured environments must coordinate whole-body motion -- simultaneously moving a mobile base and arm -- to interact with the physical world. This coupled mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods, including teleoperation and planning, are either labor-intensive or computationally prohibitive at scale. The core bottleneck is the lack of a scalable pipeline for generating large-scale, physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that unifies AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa achieves 5,000 episodes per GPU-hour (over 80× faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream IL policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for SOTA methods to reach approximately 80% success, confirming that data scarcity -- not algorithmic limitations -- has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa showcases generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.
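
A quick back-of-the-envelope check puts the throughput figures in perspective. It assumes that one episode yields one valid trajectory and that the 80× factor compares per-hour rates directly. Neither is stated in the abstract, and since both figures are quoted as "over", the results are rough bounds only.

```python
EPISODES_PER_GPU_HOUR = 5_000    # from the abstract
DATASET_TRAJECTORIES = 500_000   # "over 500k" in the abstract
SPEEDUP_OVER_CPU = 80            # "over 80x" in the abstract

# Assumption: one episode corresponds to one valid trajectory.
gpu_hours = DATASET_TRAJECTORIES / EPISODES_PER_GPU_HOUR

# Implied upper bound on the CPU baseline's rate, and the time it would need.
cpu_rate = EPISODES_PER_GPU_HOUR / SPEEDUP_OVER_CPU
cpu_hours = DATASET_TRAJECTORIES / cpu_rate

print(gpu_hours, cpu_rate, cpu_hours)  # 100.0 62.5 8000.0
```

Under these assumptions the dataset corresponds to roughly 100 GPU-hours, against about 8,000 hours for a baseline running at no more than 62.5 episodes per hour.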

Chinese Translation

在非结构化环境中部署的机器人必须协调全身运动——同时移动移动基座和手臂——以与物理世界进行交互。这种耦合的移动性和灵活性导致状态空间随着场景和物体的多样性呈组合性增长，要求的数据集远大于固定基座操控所需的数据集。然而，现有的获取方法，包括遥操作和规划，在规模上要么劳动密集，要么计算上不可行。核心瓶颈在于缺乏一个可扩展的管道，用于在多样化的实施和环境中生成大规模、物理有效的协调轨迹数据。在此，我们介绍了AutoMoMa，一个GPU加速的框架，它将AKR建模（将基座、手臂和物体的运动学整合为一个单一链条）与并行化轨迹优化相结合。AutoMoMa每小时每个GPU可生成5000个实验（比基于CPU的基线快超过80倍），产生超过50万个物理有效的轨迹数据，涵盖330个场景、多样化的关节物体和多种机器人实施。之前的数据集在规模、多样性或运动学保真度上不得不妥协；而AutoMoMa同时解决了这三者的问题。对下游IL策略的训练进一步表明，即使是单一的关节物体任务，也需要数万个示例，以使SOTA方法达到约80%的成功率，确认了数据稀缺——而非算法限制——是限制因素。因此，AutoMoMa架起了高性能规划与可靠的基于IL的控制之间的桥梁，为协调移动操控研究提供了之前缺失的基础设施。通过使大规模、运动学有效的训练数据变得可行，AutoMoMa展示了能够在现实世界多样化、非结构化环境中操作的通用全身机器人策略。

cs.RO / 25 / 2604.12591

Machine Learning-Based Real-Time Detection of Compensatory Trunk Movements Using Trunk-Wrist Inertial Measurement Units

基于机器学习的实时补偿性躯干运动检测：利用躯干-腕部惯性测量单元

Gabler, Jannis, Lhoste, Clément, Quast, Max, Mayrhuber, Laura, Ronco, Andrea, Lambercy, Olivier, Viskaitis, Paulius, Donegan, Dane

Abstract

Compensatory trunk movements (CTMs) are commonly observed after stroke and can lead to maladaptive movement patterns, limiting targeted training of affected structures. Objective, continuous detection of CTMs during therapy and activities of daily living remains challenging due to the typically complex measurement setups required, as well as limited applicability for real-time use. This study investigates whether a two-inertial measurement unit configuration enables reliable, real-time CTM detection using machine learning. Data were collected from ten able-bodied participants performing activities of daily living under simulated impairment conditions (elbow brace restricting flexion-extension, resistance band inducing flexor-synergy-like patterns), with synchronized optical motion capture (OMC) and manually annotated video recordings serving as reference. A systematic location-reduction analysis using OMC identified wrist and trunk kinematics as a minimal yet sufficient set of anatomical sensing locations. Using an extreme gradient boosting classifier (XGBoost) evaluated with leave-one-subject-out cross-validation, our two-IMU model achieved strong discriminative performance (macro-F1 = 0.80 ± 0.07, MCC = 0.73 ± 0.08; ROC-AUC > 0.93), with performance comparable to an OMC-based model and prediction timing suitable for real-time applications. Explainability analysis revealed dominant contributions from trunk dynamics and wrist-trunk interaction features. In preliminary evaluation using recordings from four participants with neurological conditions, the model retained good discriminative capability (ROC-AUC ~ 0.78), but showed reduced and variable threshold-dependent performance, highlighting challenges in clinical generalization. These results support sparse wearable sensing as a viable pathway toward scalable, real-time monitoring of CTMs during therapy and daily living.
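
The evaluation protocol, leave-one-subject-out cross-validation scored with macro-F1, can be sketched in plain Python. The classifier itself is omitted (the study uses XGBoost), and the subject IDs and labels below are toy values, not study data.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in truth or prediction."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


folds = list(leave_one_subject_out(["s1", "s1", "s2", "s3"]))
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

Splitting by subject rather than by sample keeps every window from a given participant on one side of the split, so the reported score reflects generalization to unseen people. Macro averaging keeps a rare compensation class from being swamped by the majority class.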

Chinese Translation

补偿性躯干运动（CTMs）常见于中风后患者，可能导致不良的运动模式，限制对受影响结构的针对性训练。由于通常需要复杂的测量设备且实时应用受限，如何在治疗及日常活动中客观、连续地检测CTMs仍具挑战性。本研究探讨了使用两惯性测量单元（IMU）配置结合机器学习，是否能够实现可靠的实时CTM检测。数据采集自十名健康受试者，在模拟受损条件下（肘部支架限制屈伸，阻力带诱导屈肌协同样式）执行日常活动，参考标准为同步的光学运动捕捉（OMC）和人工标注视频。通过OMC进行的系统性位置简化分析确定腕部和躯干运动学为最小且充分的解剖感测位置集合。采用极端梯度提升分类器（XGBoost）并通过留一受试者交叉验证评估，双IMU模型表现出较强的判别能力（宏F1=0.80±0.07，MCC=0.73±0.08；ROC-AUC>0.93），性能与基于OMC的模型相当，且预测时效适合实时应用。可解释性分析显示躯干动力学及腕-躯干交互特征贡献显著。在对四名神经系统疾病患者的初步评估中，模型保持良好判别能力（ROC-AUC约0.78），但表现受阈值影响较大且波动，突显临床泛化的挑战。结果支持稀疏可穿戴传感作为实现治疗及日常生活中CTMs可扩展实时监测的可行途径。

cs.RO / 26 / 2604.12626

Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

Habitat-GS：一种基于动态高斯点渲染的高保真导航模拟器

Xia, Ziyuan, Xu, Jingyi, Cui, Chong, Yu, Yuanhong, Zhang, Jiazhao, Yan, Qingsong, Ni, Tao, Chen, Junbo, Zhou, Xiaowei, Bao, Hujun, Hu, Ruizhen, Peng, Sida

Abstract

Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the most effective strategy. Evaluations on avatar-aware navigation further confirm that gaussian avatars enable effective human-aware navigation. Finally, performance benchmarks validate the system's scalability across varying scene complexity and avatar counts.

Chinese Translation

训练具身人工智能代理在很大程度上依赖于仿真环境的视觉保真度以及动态人类建模能力。现有模拟器主要依赖基于网格的光栅化技术，视觉真实感有限，且对动态人类虚拟形象的支持（若有）也局限于网格表示，限制了代理在有人类环境的真实场景中的泛化能力。本文提出了Habitat-GS，一种基于Habitat-Sim扩展的导航中心具身AI模拟器，集成了三维高斯点渲染（3D Gaussian Splatting）场景渲染和可驱动的高斯虚拟形象，同时保持对Habitat生态系统的完全兼容。系统实现了3DGS渲染器以支持实时光照真实感渲染，并支持从多样化来源导入可扩展的3DGS资产。针对动态人类建模，我们引入了高斯虚拟形象模块，使每个虚拟形象既能作为光照真实的视觉实体，又能作为有效的导航障碍，促使代理在逼真环境中学习具有人类意识的行为。点目标导航实验表明，在3DGS场景中训练的代理展现出更强的跨域泛化能力，其中混合域训练策略效果最佳。针对虚拟形象感知导航的评估进一步验证了高斯虚拟形象在实现有效人类感知导航中的作用。最后，性能基准测试证明了系统在不同场景复杂度和虚拟形象数量下的良好可扩展性。

cs.RO / 27 / 2604.12645

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

用于自主珊瑚礁监测的上下文多任务强化学习

Laux, Melvin, Liu, Yi-Ling, Alo, Rina, Töpper, Sören, Alvarez, Mariela De Lucas, Kirchner, Frank, Adam, Rebecca

Abstract

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task variations. Traditional single-task reinforcement learning tends to overfit the training environment, thus limiting the long-term usefulness of the learnt policy. Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample-efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness, as well as the reusability of learnt policies to take a step towards more sustainable procedures in autonomous reef monitoring.

Chinese Translation

尽管自主水下航行器承诺具备海洋生态系统监测的能力，但其部署在很大程度上受到在高度不确定和非平稳的水下动态环境中控制车辆的难度的限制。为了解决这些挑战，我们采用了一种数据驱动的强化学习方法，以补偿未知的动态和任务变化。传统的单任务强化学习往往会过拟合训练环境，从而限制所学习策略的长期有效性。因此，我们提出使用上下文多任务强化学习范式，允许我们学习可用于多种任务的控制器，例如，在一个珊瑚礁中检测牡蛎，而在另一个珊瑚礁中检测珊瑚。我们评估上下文多任务强化学习是否能够有效学习用于自主水下珊瑚礁监测的稳健且具有普适性的控制策略。我们训练了一个单一的上下文相关策略，能够在HoloOcean的模拟珊瑚礁环境中解决多个相关的监测任务。在我们的实验中，我们从经验上评估了上下文策略在样本效率、对未见任务的零样本泛化能力以及对变化水流的鲁棒性方面的表现。通过利用多任务强化学习，我们旨在提高训练的有效性，以及所学习策略的可重用性，以朝着更可持续的自主珊瑚礁监测程序迈出一步。

cs.RO / 28 / 2604.12656

FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

FeaXDrive：面向可行性的轨迹中心扩散规划方法用于端到端自主驾驶

Wang, Baoyun, Li, Zhuoren, Liu, Ming, Zhang, Xinrui, Leng, Bo, Xiong, Lu

Abstract

End-to-end diffusion planning has shown strong potential for autonomous driving, but the physical feasibility of generated trajectories remains insufficiently addressed. In particular, generated trajectories may exhibit local geometric irregularities, violate trajectory-level kinematic constraints, or deviate from the drivable area, indicating that the commonly used noise-centric formulation in diffusion planning is not yet well aligned with the trajectory space where feasibility is more naturally characterized. To address this issue, we propose FeaXDrive, a feasibility-aware trajectory-centric diffusion planning method for end-to-end autonomous driving. The core idea is to treat the clean trajectory as the unified object for feasibility-aware modeling throughout the diffusion process. Built on this trajectory-centric formulation, FeaXDrive integrates adaptive curvature-constrained training to improve intrinsic geometric and kinematic feasibility, drivable-area guidance within reverse diffusion sampling to enhance consistency with the drivable area, and feasibility-aware GRPO post-training to further improve planning performance while balancing trajectory-space feasibility. Experiments on the NAVSIM benchmark show that FeaXDrive achieves strong closed-loop planning performance while substantially improving trajectory-space feasibility. These findings highlight the importance of explicitly modeling trajectory-space feasibility in end-to-end diffusion planning and provide a step toward more reliable and physically grounded autonomous driving planners.
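
One concrete way to measure the geometric feasibility discussed in this abstract is the discrete curvature of a waypoint sequence. The sketch below uses the Menger curvature of consecutive point triples. It is an illustrative check only, not the paper's adaptive curvature-constrained training objective, and `kappa_max` is an assumed vehicle limit.

```python
import math


def menger_curvature(p0, p1, p2):
    """Curvature of the circle through three 2D points.

    Returns 0.0 when any two points coincide (a simplification; a real
    check would treat that case separately).
    """
    cross = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])
    sides = math.dist(p0, p1) * math.dist(p1, p2) * math.dist(p0, p2)
    return 0.0 if sides == 0.0 else 2.0 * abs(cross) / sides


def max_curvature(waypoints):
    """Largest discrete curvature over consecutive waypoint triples."""
    triples = zip(waypoints, waypoints[1:], waypoints[2:])
    return max((menger_curvature(a, b, c) for a, b, c in triples), default=0.0)


def is_curvature_feasible(waypoints, kappa_max):
    """True if no triple bends more sharply than the assumed limit."""
    return max_curvature(waypoints) <= kappa_max


unit_arc = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]   # points on a circle of radius 1
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
```

Because the check operates on the clean trajectory rather than on a noise prediction, it fits the trajectory-centric view the abstract argues for: feasibility is defined on waypoints, so it is easiest to evaluate there.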

Chinese Translation

端到端扩散规划在自主驾驶中展现出强大的潜力，但生成轨迹的物理可行性仍未得到充分解决。特别是，生成的轨迹可能表现出局部几何不规则性，违反轨迹级运动学约束，或偏离可行驶区域，这表明在扩散规划中常用的噪声中心公式尚未与可行性更自然表征的轨迹空间良好对齐。为了解决这一问题，我们提出了FeaXDrive，一种面向可行性的轨迹中心扩散规划方法，旨在实现端到端自主驾驶。其核心思想是将干净轨迹视为整个扩散过程中的统一对象，以进行可行性建模。在这一轨迹中心公式的基础上，FeaXDrive整合了自适应曲率约束训练，以提高内在几何和运动学可行性，反向扩散采样中的可行驶区域指导，以增强与可行驶区域的一致性，以及可行性意识的GRPO后训练，以进一步改善规划性能，同时平衡轨迹空间的可行性。在NAVSIM基准上的实验表明，FeaXDrive在实现强闭环规划性能的同时，显著提高了轨迹空间的可行性。这些发现强调了在端到端扩散规划中显式建模轨迹空间可行性的重要性，并为更可靠和物理基础的自主驾驶规划器迈出了重要一步。

cs.RO / 29 / 2604.12753

Reliability-Guided Depth Fusion for Glare-Resilient Navigation Costmaps

基于可靠性引导的深度融合用于抗眩光导航代价地图构建

Tsai, Shang-En, Sun, Wei-Cheng

Abstract

Specular glare on reflective floors and glass surfaces frequently corrupts RGB-D depth measurements, producing holes and spikes that accumulate as persistent phantom obstacles in occupancy-grid costmaps. This paper proposes a glare-resilient costmap construction method based on explicit depth-reliability modeling. A lightweight Depth Reliability Map (DRM) estimator predicts per-pixel measurement trustworthiness under specular interference, and a Reliability-Guided Fusion (RGF) mechanism uses this signal to modulate occupancy updates before corrupted measurements are accumulated into the map. Experiments on a real mobile robotic platform equipped with an Intel RealSense D435 and a Jetson Orin Nano show that the proposed method substantially reduces false obstacle insertion and improves free-space preservation under real reflective-floor and glass-surface conditions, while introducing only modest computational overhead. These results indicate that treating glare as a measurement-reliability problem provides a practical and lightweight solution for improving costmap correctness and navigation robustness in safety-critical indoor environments.
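
The idea of modulating occupancy updates by a reliability signal can be sketched with a standard log-odds grid cell. Scaling the update by a per-pixel reliability in [0, 1] is an assumed form of the fusion rule, not the paper's RGF formulation, and the inverse sensor model probabilities are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def probability(log_odds):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def fused_update(cell_log_odds, hit, reliability, p_hit=0.7, p_miss=0.4, clamp=5.0):
    """Log-odds occupancy update scaled by measurement reliability.

    reliability = 1.0 reproduces the usual update, and reliability = 0.0
    leaves the cell untouched, so a glare-corrupted pixel cannot insert a
    phantom obstacle.
    """
    delta = logit(p_hit) if hit else logit(p_miss)
    updated = cell_log_odds + reliability * delta
    return max(-clamp, min(clamp, updated))


trusted = fused_update(0.0, hit=True, reliability=1.0)
glare = fused_update(0.0, hit=True, reliability=0.0)
```

Applying the weight before accumulation matters because log-odds maps integrate evidence over time: once a spurious hit has been added, it takes several contradicting observations to clear it.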

Chinese Translation

反光地面和玻璃表面的镜面眩光常常破坏RGB-D深度测量，产生孔洞和尖刺，这些误差在占据栅格代价地图中累积为持续存在的虚假障碍物。本文提出了一种基于显式深度可靠性建模的抗眩光代价地图构建方法。一种轻量级的深度可靠性图（Depth Reliability Map, DRM）估计器预测在镜面干扰下每个像素测量的可信度，可靠性引导融合（Reliability-Guided Fusion, RGF）机制利用该信号调节占据状态更新，防止受损测量被累积进地图。基于搭载Intel RealSense D435和Jetson Orin Nano的真实移动机器人平台的实验表明，该方法在真实反光地面和玻璃表面条件下显著减少了虚假障碍物的插入，提升了自由空间的保持能力，同时仅引入适度的计算开销。结果表明，将眩光视为测量可靠性问题，为提升安全关键室内环境中代价地图的准确性和导航鲁棒性提供了一种实用且轻量的解决方案。

cs.RO / 30 / 2604.12792

Actuation space reduction to facilitate insightful shape matching in a novel reconfigurable tendon driven continuum manipulator

通过驱动空间约简促进新型可重构腱驱动连续体机械臂的形状匹配解析

Dash, Sabyasachi, Golden, John, Krishnan, Girish

Abstract

In tendon driven continuum manipulators (TDCMs), reconfiguring the tendon routing enables tailored spatial deformation of the backbone. This work presents a design in which tendons can be rerouted either prior to or after actuation by actively rotating the individual spacer disks. Each disk rotation thus adds a degree of freedom to the actuation space, complicating the mapping from a desired backbone curve to the corresponding actuator inputs. However, when the backbone shape is projected into an intermediate space defined by curvature and torsion (C-T), patterns emerge that highlight which disks are most influential in achieving a global shape. This insight enables a simplified, sequential shape-matching strategy: first, the proximal and intermediate disks are rotated to approximate the global shape; then, the distal disks are adjusted to fine-tune the end-effector position with minimal impact on the overall shape. The proposed actuation framework offers a model-free alternative to conventional control approaches, bypassing the complexities of modeling reconfigurable TDCMs.

Chinese Translation

在腱驱动连续体机械臂（TDCMs）中，通过重新配置腱的路径可以实现对骨干的定制空间变形。本文提出了一种设计方案，通过主动旋转各个间隔盘，实现腱路径在驱动前或驱动后重新布置。每个间隔盘的旋转为驱动空间增加了一个自由度，从而使得从期望骨干曲线到相应驱动输入的映射变得复杂。然而，当骨干形状投影到由曲率和扭率（C-T）定义的中间空间时，会出现突出显示哪些间隔盘在实现整体形状中最具影响力的模式。该洞察促使一种简化的顺序形状匹配策略得以实现：首先旋转近端和中间间隔盘以近似整体形状；然后调整远端间隔盘以微调末端执行器位置，同时对整体形状影响最小。所提出的驱动框架为传统控制方法提供了一种无模型的替代方案，避免了对可重构TDCMs建模的复杂性。

cs.RO / 31 / 2604.12831

VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response

VULCAN：基于视觉-语言模型的多智能体协同导航用于室内火灾响应

Liu, Shengding, Yan, Qiben

Abstract

Indoor fire disasters pose severe challenges to autonomous search and rescue due to dense smoke, high temperatures, and dynamically evolving indoor environments. In such time-critical scenarios, multi-agent cooperative navigation is particularly useful, as it enables faster and broader exploration than single-agent approaches. However, existing multi-agent navigation systems are primarily vision-based and designed for benign indoor settings, leading to significant performance degradation under fire-driven dynamic conditions. In this paper, we present VULCAN, a multi-agent cooperative navigation framework based on multi-modal perception and vision-language models (VLMs), tailored for indoor fire disaster response. We extend the Habitat-Matterport3D benchmark by simulating physically realistic fire scenarios, including smoke diffusion, thermal hazards, and sensor degradation. We evaluate representative multi-agent cooperative navigation baselines under both normal and fire-driven environments. Our results reveal critical failure modes of existing methods in fire scenarios and underscore the necessity of robust perception and hazard-aware planning for reliable multi-agent search and rescue.

Chinese Translation

室内火灾灾害由于浓烟、高温及动态变化的室内环境，对自主搜索与救援构成了严峻挑战。在此类时间紧迫的场景中，多智能体协同导航尤为重要，因为其能够实现比单智能体更快、更广泛的探索。然而，现有多智能体导航系统主要基于视觉，且设计针对良性室内环境，导致在火灾驱动的动态条件下性能显著下降。本文提出了VULCAN，一种基于多模态感知与视觉-语言模型（Vision-Language Models, VLMs）的多智能体协同导航框架，专为室内火灾响应设计。我们通过模拟物理真实的火灾场景（包括烟雾扩散、热危害及传感器退化）扩展了Habitat-Matterport3D基准。我们在正常及火灾驱动环境下评估了代表性多智能体协同导航基线方法。结果揭示了现有方法在火灾场景中的关键失效模式，强调了鲁棒感知与危害感知规划对于可靠多智能体搜索与救援的必要性。

cs.RO / 32 / 2604.12837

GGD-SLAM: Monocular 3DGS SLAM Powered by Generalizable Motion Model for Dynamic Environments

GGD-SLAM：基于可泛化运动模型的动态环境单目3D高斯点云SLAM

Liu, Yi, Xu, Haoxuan, Duan, Hongbo, Fan, Keyu, Zhang, Zhengyang, Zhuang, Peiyu, Luo, Pengting, Liu, Houde

Abstract

Visual SLAM algorithms have improved significantly by adopting 3D Gaussian Splatting (3DGS) representations, particularly in generating high-fidelity dense maps. However, they depend on a static environment assumption and experience significant performance degradation in dynamic environments. This paper presents GGD-SLAM, a framework that employs a generalizable motion model to address the challenges of localization and dense mapping in dynamic environments, without predefined semantic annotations or depth input. Specifically, the proposed system employs a First-In-First-Out (FIFO) queue to manage incoming frames, facilitating dynamic semantic feature extraction through a sequential attention mechanism. This is integrated with a dynamic feature enhancer to separate static and dynamic components. Additionally, to minimize dynamic distractors' impact on the static components, we devise a method to fill occluded areas via static information sampling and design a distractor-adaptive Structural Similarity Index Measure (SSIM) loss tailored for dynamic environments, significantly enhancing the system's resilience. Experiments conducted on real-world dynamic datasets demonstrate that the proposed system achieves state-of-the-art performance in camera pose estimation and dense reconstruction in dynamic scenes.
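
The FIFO frame management mentioned in this abstract can be pictured with a minimal Python sketch. The `FrameWindow` name, the window size of 3, and the integer frame ids are assumptions made for illustration; the abstract does not specify them, and the sequential attention step is only indicated by a comment.

```python
from collections import deque


class FrameWindow:
    """Fixed-size FIFO buffer of recent frames (hypothetical helper, not the authors' code)."""

    def __init__(self, size=3):
        # A deque with maxlen drops the oldest item automatically once it is full.
        self._frames = deque(maxlen=size)

    def push(self, frame):
        """Add the newest frame; the oldest one is evicted when the window is full."""
        self._frames.append(frame)

    def sequence(self):
        """Return frames oldest-to-newest, the order a sequential model would consume."""
        return list(self._frames)


window = FrameWindow(size=3)
for frame_id in range(5):
    window.push(frame_id)

# A sequential attention module (not shown) would attend over this ordered window.
print(window.sequence())  # prints [2, 3, 4]
```

A bounded queue keeps memory constant regardless of sequence length, which is one plausible reason to prefer it over storing the full frame history.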

Chinese Translation

视觉SLAM算法通过探索3D高斯点云（3DGS）表示取得了显著进展，特别是在生成高保真稠密地图方面。然而，这些算法依赖于静态环境假设，在动态环境中表现出显著的性能下降。本文提出了GGD-SLAM，一个利用可泛化运动模型来应对动态环境中的定位和稠密映射挑战的框架——无需预定义的语义注释或深度输入。具体而言，所提出的系统采用先进先出（FIFO）队列来管理输入帧，通过序列注意机制促进动态语义特征提取。这与动态特征增强器相结合，以分离静态和动态成分。此外，为了最小化动态干扰对静态成分的影响，我们设计了一种通过静态信息采样填充遮挡区域的方法，并为动态环境设计了一种适应干扰的结构相似性指数度量（SSIM）损失，显著增强了系统的鲁棒性。在真实世界动态数据集上进行的实验表明，所提出的系统在动态场景中的相机位姿估计和稠密重建方面达到了最先进的性能。

cs.RO / 33 / 2604.12852

PAINT: Partner-Agnostic Intent-Aware Cooperative Transport with Legged Robots

PAINT：基于伙伴无关的意图感知腿式机器人协同搬运方法

Cao, Zhihao, An, Tianxu, Li, Chenhao, Coros, Stelian, Hutter, Marco

Abstract

Collaborative transport requires robots to infer partner intent through physical interaction while maintaining stable loco-manipulation. This becomes particularly challenging in complex environments, where interaction signals are difficult to capture and model. We present PAINT, a lightweight yet efficient hierarchical learning framework for partner-agnostic intent-aware collaborative legged transport that infers partner intent directly from proprioceptive feedback. PAINT decouples intent understanding from terrain-robust locomotion: a high-level policy infers the partner interaction wrench using an intent estimator and a teacher-student training scheme, while a low-level locomotion backbone ensures robust execution. This enables lightweight deployment without external force-torque sensing or payload tracking. Extensive simulation and real-world experiments demonstrate compliant cooperative transport across diverse terrains, payloads, and partners. Furthermore, we show that PAINT naturally scales to decentralized multi-robot transport and transfers across robot embodiments by swapping the underlying locomotion backbone. Our results suggest that proprioceptive signals in payload-coupled interaction provide a scalable interface for partner-agnostic intent-aware collaborative transport.

Chinese Translation

协同搬运要求机器人通过物理交互推断伙伴意图，同时保持稳定的运动操控。在复杂环境中，由于交互信号难以捕捉和建模，这一任务尤为具有挑战性。本文提出PAINT，一种轻量且高效的分层学习框架，用于伙伴无关的意图感知协同腿式机器人搬运，能够直接从本体感觉反馈中推断伙伴意图。PAINT将意图理解与适应地形的稳健运动解耦：高层策略通过意图估计器和师生训练机制推断伙伴交互力矩，而低层运动骨干网络确保稳健执行。该方法无需外部力-扭矩传感或负载跟踪，实现轻量级部署。大量仿真及实际实验验证了PAINT在多样地形、负载和伙伴条件下的顺应性协同搬运能力。此外，我们展示了PAINT通过更换底层运动骨干网络，自然扩展至去中心化多机器人搬运及跨机器人形态的迁移。结果表明，负载耦合交互中的本体感觉信号为伙伴无关的意图感知协同搬运提供了可扩展的接口。

cs.RO / 34 / 2604.12855

Evolving the Complete Muscle: Efficient Morphology-Control Co-design for Musculoskeletal Locomotion

进化完整肌肉：高效的形态控制协同设计用于肌肉骨骼运动

Sun, Lidong, Zhao, Wentao, Wang, Ye, Liu, Huaping, Sun, Fuchun

Abstract

Musculoskeletal robots offer intrinsic compliance and flexibility, providing a promising paradigm for versatile locomotion. However, existing research typically relies on models with fixed muscle physiological parameters. This static physical setting fails to accommodate the diverse dynamic demands of complex tasks, inherently limiting the robot's performance upper bound. In this work, we focus on the morphology and control co-design of musculoskeletal systems. Unlike previous studies that optimize single physiological attributes such as stiffness, we introduce a Complete Musculoskeletal Morphological Evolution Space that simultaneously evolves muscle strength, velocity, and stiffness. To overcome the exponential expansion of the exploration space caused by this comprehensive evolution, we propose Spectral Design Evolution (SDE), a high-efficiency co-optimization framework. By integrating a bilateral symmetry prior with Principal Component Analysis (PCA), SDE projects complex muscle parameters onto a low-dimensional spectral manifold, enabling efficient morphological exploration. Evaluated on the MyoSuite framework across four tasks (Walk, Stair, Hilly, and Rough terrains), our method demonstrates superior learning efficiency and locomotion stability compared to fixed-morphology and standard evolutionary baselines.
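
The idea of combining a bilateral symmetry prior with PCA can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SDE implementation: it keeps a single principal component (found by power iteration), and the function names, the example designs, and the three-value parameter layout are all hypothetical.

```python
import math


def mirror(half_params):
    """Bilateral symmetry prior: the right side reuses the left side's parameters."""
    return list(half_params) + list(half_params)


def first_component(samples, iters=100):
    """Mean and first principal direction, found by power iteration on the covariance."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))  # assumes the data has non-zero variance
        v = [x / norm for x in w]
    return mean, v


def decode(z, mean, direction):
    """Map one latent coordinate z back to a full, symmetric muscle parameter vector."""
    half = [m + z * u for m, u in zip(mean, direction)]
    return mirror(half)


# Hypothetical one-side designs: (strength, velocity, stiffness) scale factors.
designs = [[1.0, 1.0, 1.0], [1.2, 1.1, 1.0], [0.8, 0.9, 1.0], [1.4, 1.2, 1.0]]
mean, direction = first_component(designs)
full = decode(0.5, mean, direction)
print(len(full), full[:3] == full[3:])
```

In this sketch the search runs over the single coordinate `z` instead of six raw parameters: symmetry halves the dimension, and the principal direction reduces it further.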

Chinese Translation

肌肉骨骼机器人具有内在的顺应性和灵活性，为多样化的运动提供了有前景的范式。然而，现有研究通常依赖于具有固定肌肉生理参数的模型。这种静态物理设置无法满足复杂任务的多样动态需求，固有地限制了机器人的性能上限。在本研究中，我们专注于肌肉骨骼系统的形态与控制协同设计。与之前优化单一生理属性（如刚度）的研究不同，我们引入了完整肌肉骨骼形态进化空间，同时进化肌肉的力量、速度和刚度。为了克服这种全面进化所导致的探索空间的指数扩展，我们提出了谱设计进化（Spectral Design Evolution, SDE），一种高效的协同优化框架。通过将双侧对称先验与主成分分析（Principal Component Analysis, PCA）相结合，SDE将复杂的肌肉参数投影到低维谱流形上，从而实现高效的形态探索。在MyoSuite框架下对四个任务（步行、楼梯、丘陵和粗糙地形）进行评估，我们的方法在学习效率和运动稳定性方面优于固定形态和标准进化基线。

cs.RO / 35 / 2604.12872

OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation

OVAL：用于终身目标导航的开放词汇增强记忆模型

Pei, Jiahua, Liu, Yi, Pan, Guoping, Jiang, Yuanhao, Liu, Houde, Wang, Xueqian

Abstract

Object Goal Navigation (ObjectNav) requires an agent to navigate to a target object in an unseen environment, an ability that complex tasks often depend on. While existing methods are proficient at isolated single-object navigation, their lifelong memory representations have limited applicability, which hinders effective navigation toward a continual stream of targets over extended periods. To address this problem, we propose OVAL, a novel lifelong open-vocabulary memory framework that enables efficient and precise long-term navigation in semantically open tasks. Within this framework, we introduce memory descriptors to facilitate structured management of the memory model. Additionally, we propose a novel probability-based exploration strategy that uses multi-value frontier scoring to enhance lifelong exploration efficiency. Extensive experiments demonstrate the efficiency and robustness of the proposed system.
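
One way to picture frontier scoring that combines several values is the Python sketch below. The abstract does not say which values OVAL combines or how, so the three fields, the linear weighting, and the example frontiers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Frontier:
    name: str
    target_prob: float  # assumed: probability that the goal object lies beyond this frontier
    info_gain: float    # assumed: normalized unexplored area visible from the frontier
    distance: float     # assumed: path length from the agent, in meters


def score(f, w_prob=1.0, w_gain=0.5, w_dist=0.1):
    """Combine several per-frontier values into one score (weights are illustrative)."""
    return w_prob * f.target_prob + w_gain * f.info_gain - w_dist * f.distance


frontiers = [
    Frontier("hallway", target_prob=0.2, info_gain=0.9, distance=2.0),
    Frontier("kitchen door", target_prob=0.7, info_gain=0.4, distance=4.0),
    Frontier("closet", target_prob=0.1, info_gain=0.1, distance=1.0),
]

# The agent heads for the frontier with the highest combined score.
best = max(frontiers, key=score)
print(best.name)  # prints kitchen door
```

With these weights the likely-but-distant kitchen door (score 0.5) narrowly beats the nearby, highly unexplored hallway (score 0.45), which shows how the probability term can outweigh pure exploration gain.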

Chinese Translation

目标导航（ObjectNav）指的是智能体在未知环境中导航到一个物体的能力，这种能力通常在完成复杂任务时是必需的。尽管现有方法在孤立的单一物体导航中表现出色，但它们在终身记忆表示的适用性受限方面暴露出局限性，这最终阻碍了在较长时间内有效导航至持续目标。为了解决这个问题，我们提出了OVAL，一种新颖的终身开放词汇记忆框架，能够在语义开放任务中高效且精确地执行长期导航。在该框架内，我们引入了记忆描述符，以促进记忆模型的结构化管理。此外，我们提出了一种新颖的基于概率的探索策略，利用多值前沿评分来提高终身探索的效率。大量实验表明，所提出的系统在效率和鲁棒性方面表现优异。

cs.RO / 36 / 2604.12879

FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators

FastGrasp：基于学习的移动操作机器人快速灵巧抓取全身控制方法

Tao, Heng, Zhong, Yiming, Yang, Zemin, Ma, Yuexin

Abstract

Fast grasping is critical for mobile robots in logistics, manufacturing, and service applications. Existing methods face fundamental challenges in impact stabilization under high-speed motion, real-time whole-body coordination, and generalization across diverse objects and scenarios, limited by fixed bases, simple grippers, or slow tactile response capabilities. We propose FastGrasp, a learning-based framework that integrates grasp guidance, whole-body control, and tactile feedback for mobile fast grasping. Our two-stage reinforcement learning strategy first generates diverse grasp candidates via a conditional variational autoencoder conditioned on object point clouds, then executes coordinated movements of the mobile base, arm, and hand guided by optimal grasp selection. Tactile sensing enables real-time grasp adjustments to handle impact effects and object variations. Extensive experiments demonstrate superior grasping performance in both simulation and real-world scenarios, achieving robust manipulation across diverse object geometries through effective sim-to-real transfer.

Chinese Translation

快速抓取对于物流、制造和服务应用中的移动机器人至关重要。现有方法在高速运动下的冲击稳定性、实时全身协调以及跨多样化物体和场景的泛化能力方面面临根本性挑战，这些方法受限于固定基座、简单夹爪或缓慢的触觉响应能力。我们提出了FastGrasp，一种基于学习的框架，集成了抓取引导、全身控制和触觉反馈，实现移动机器人快速抓取。我们的两阶段强化学习策略首先通过条件变分自编码器（conditional variational autoencoder）基于物体点云生成多样化的抓取候选，然后在最优抓取选择的指导下执行移动底盘、机械臂和手部的协调运动。触觉传感使得能够实时调整抓取，以应对冲击效应和物体变化。大量实验表明，该方法在仿真和实际场景中均表现出优越的抓取性能，通过有效的仿真到现实（sim-to-real）迁移，实现了对多样物体几何形状的鲁棒操作。

cs.RO / 37 / 2604.12905

Frequency-aware Decomposition Learning for Sensorless Wrench Forecasting on a Vibration-rich Hydraulic Manipulator

面向振动丰富液压机械臂无传感力矩预测的频率感知分解学习

Lee, Hyeonbeen, Jung, Min-Jae, Yeu, Tae-Kyeong, Han, Jong-Boo, Park, Daegil, Kim, Jin-Gyun

Abstract

Force and torque (F/T) sensing is critical for robot-environment interaction, but physical F/T sensors impose constraints in size, cost, and fragility. To mitigate this, recent studies have estimated force/wrench sensorlessly from robot internal states. While existing methods generally target relatively slow interactions, tasks involving rapid interactions, such as grinding, can induce task-critical high-frequency vibrations, and estimation in such robotic settings remains underexplored. To address this gap, we propose a Frequency-aware Decomposition Network (FDN) for short-term forecasting of vibration-rich wrench from proprioceptive history. FDN predicts spectrally decomposed wrench with asymmetric deterministic and probabilistic heads, modeling the high-frequency residual as a learned conditional distribution. It further incorporates frequency-awareness to adaptively enhance input spectra with learned filtering and impose a frequency-band prior on the outputs. We pretrain FDN on a large-scale open-source robot dataset and transfer the learned proprioception-to-wrench representation to the downstream task. On real-world grinding excavation data from a 6-DoF hydraulic manipulator and under a delayed estimation setting, FDN outperforms baseline estimators and forecasters in the high-frequency band and remains competitive in the low-frequency band. Transfer learning provides additional gains, suggesting the potential of large-scale pretraining and transfer learning for robotic wrench estimation. Code and data will be made available upon acceptance.
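
To make the notion of a spectrally decomposed wrench concrete, the sketch below splits a one-dimensional force signal into a low-frequency component and a high-frequency residual. It uses a plain moving average as the low-pass filter, which is an assumption for illustration only; the abstract describes learned filtering, and the signal and window size here are made up.

```python
def moving_average(signal, window=5):
    """Centered moving average; near the edges the window shrinks to fit."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def decompose(signal, window=5):
    """Split a signal into a smooth low band and the residual high band."""
    low = moving_average(signal, window)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# Hypothetical force trace: a slow ramp plus an alternating "vibration" term.
force = [0.1 * t + (0.5 if t % 2 == 0 else -0.5) for t in range(20)]
low, high = decompose(force)

# By construction the two bands add back up to the original signal.
error = max(abs(l + h - s) for l, h, s in zip(low, high, force))
print(error < 1e-9)  # prints True
```

Separating the bands lets each be modeled differently, for example a deterministic prediction for the slow component and a distribution for the vibration residual, as the abstract describes for FDN's two heads.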

Chinese Translation

力和力矩（F/T）传感对于机器人与环境的交互至关重要，但物理力矩传感器在尺寸、成本和脆弱性方面存在限制。为缓解这些问题，近期研究尝试从机器人内部状态无传感地估计力/力矩。现有方法通常针对相对缓慢的交互任务，而涉及快速交互的任务（如磨削）会产生任务关键的高频振动，在此类机器人环境中的估计仍未被充分研究。为填补这一空白，我们提出了一种频率感知分解网络（Frequency-aware Decomposition Network，FDN），用于基于本体感受历史进行振动丰富力矩的短期预测。FDN通过非对称的确定性和概率性头部预测频谱分解的力矩，将高频残差建模为学习的条件分布。其进一步引入频率感知机制，自适应地通过学习滤波增强输入频谱，并对输出施加频带先验。我们在大规模开源机器人数据集上对FDN进行了预训练，并将学习到的本体感受至力矩的表征迁移到下游任务。在来自6自由度液压机械臂的真实磨削挖掘数据及延迟估计设置下，FDN在高频段优于基线估计器和预测器，在低频段表现同样具有竞争力。迁移学习带来了额外提升，表明大规模预训练和迁移学习在机器人力矩估计中的潜力。代码和数据将在论文接受后公开。

cs.RO / 38 / 2604.12908

Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

机器人操作是视觉到几何的映射 ($f(v) \rightarrow G$)：基于视觉-几何的骨干网络优于语言和视频模型

Song, Zijian, Li, Qichang, Zhou, Jiawei, Yuan, Zhenlong, Chen, Tianshui, Lin, Liang, Wang, Guangrun

Abstract

At its core, robotic manipulation is a problem of vision-to-geometry mapping ($f(v) \rightarrow G$). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. Conventional VLA and video-predictive models rely on backbones pretrained on large-scale 2D image-text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top-tier VLA baselines including $\pi_{0.5}$ and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero-shot generalization to unseen viewpoints in real-world deployments, consistently outperforming $\pi_{0.5}$. These results highlight that operating on native 3D representations, rather than translating through language or 2D video priors, is a highly promising direction for achieving generalizable physical intelligence.

Chinese Translation

从本质上讲，机器人操作是一个视觉到几何映射的问题 ($f(v) \rightarrow G$)。物理动作的基本定义依赖于几何特性，如三维位置和空间关系。因此，我们认为，通用机器人控制的基础应该是视觉-几何骨干网络，而不是广泛采用的视觉-语言或视频模型。传统的视觉-语言-动作（VLA）和视频预测模型依赖于在大规模二维图像-文本或时间像素数据上预训练的骨干网络。尽管这些模型有效，但它们的表示在很大程度上受到语义概念或二维先验的影响，这与物理操作所需的精确三维几何特性并不内在对齐。基于这一洞察，我们提出了视觉-几何-动作（VGA）模型，该模型直接基于预训练的原生三维表示来调节动作生成。具体而言，VGA用预训练的三维世界模型替代传统的语言或视频骨干网络，建立了一个无缝的视觉到几何的映射，将视觉输入直接转换为物理动作。为了进一步增强几何一致性，我们引入了渐进体积调制模块，并采用联合训练策略。大量实验验证了我们方法的有效性。在仿真基准测试中，VGA的表现超过了顶级的VLA基线，包括 $\pi_{0.5}$ 和 GeoVLA，展示了其在精确操作方面的优越性。更重要的是，VGA在现实世界部署中对未见视角展现出显著的零样本泛化能力，始终优于 $\pi_{0.5}$。这些结果强调了基于原生三维表示进行操作，而不是通过语言或二维视频先验进行转换，是实现可泛化物理智能的一个极具前景的方向。

cs.RO / 39 / 2604.12909

Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots

Tree Learning：一种面向人形机器人的多技能持续学习框架

Yan, Yifei, Ye, Linqi

Abstract

As reinforcement learning for humanoid robots evolves from single-task to multi-skill paradigms, efficiently expanding new skills while avoiding catastrophic forgetting has become a key challenge in embodied intelligence. Existing approaches either rely on complex topology adjustments in Mixture-of-Experts (MoE) models or require training extremely large-scale models, making lightweight deployment difficult. To address this, we propose Tree Learning, a multi-skill continual learning framework for humanoid robots. The framework adopts a root-branch hierarchical parameter inheritance mechanism, providing motion priors for branch skills through parameter reuse to fundamentally prevent catastrophic forgetting. A multi-modal feedforward adaptation mechanism combining phase modulation and interpolation is designed to support both periodic and aperiodic motions. A task-level reward shaping strategy is also proposed to accelerate skill convergence. Unity-based simulation experiments show that, in contrast to simultaneous multi-task training, Tree Learning achieves higher rewards across various representative locomotion skills while maintaining a 100% skill retention rate, enabling seamless multi-skill switching and real-time interactive control. We further validate the performance and generalization capability of Tree Learning on two distinct Unity-simulated tasks: a Super Mario-inspired interactive scenario and autonomous navigation in a classical Chinese garden environment.
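
The root-branch parameter inheritance idea can be illustrated with a small Python sketch. It is a toy reading of the abstract, not the authors' implementation: the parameter dictionary, the `grow_branch` helper, and the additive "update" that stands in for training are all hypothetical.

```python
import copy

# Stand-in for the trained weights of a root skill (values are made up).
root_params = {"trunk": [0.2, -0.1, 0.4], "head": [0.0, 0.0]}


def grow_branch(root, skill_update):
    """Create a branch skill by copying the root, then adapting only the copy."""
    branch = copy.deepcopy(root)  # inherit the root parameters as a motion prior
    # Placeholder for training: only the branch's own copy is modified.
    branch["head"] = [h + d for h, d in zip(branch["head"], skill_update)]
    return branch


skills = {"walk": root_params}
skills["jump"] = grow_branch(root_params, skill_update=[0.3, -0.2])
skills["turn"] = grow_branch(root_params, skill_update=[-0.1, 0.5])

# Earlier skills are untouched, so adding a skill cannot overwrite them.
print(skills["walk"]["head"], skills["jump"]["head"])
```

Because each new skill trains on its own copy of the parameters, previously learned skills are retained by construction, which is one way to read the abstract's claim of avoiding catastrophic forgetting. The cost is that storage grows with the number of branches.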

Chinese Translation

随着人形机器人强化学习从单任务向多技能范式的发展，如何高效扩展新技能同时避免灾难性遗忘，已成为具身智能领域的关键挑战。现有方法要么依赖于Mixture-of-Experts（MoE）模型中复杂的拓扑结构调整，要么需要训练极大规模模型，导致轻量化部署困难。为此，我们提出了Tree Learning，一种面向人形机器人的多技能持续学习框架。该框架采用根-分支层级参数继承机制，通过参数复用为分支技能提供运动先验，从根本上防止灾难性遗忘。设计了结合相位调制与插值的多模态前馈适应机制，以支持周期性与非周期性运动。同时提出了任务级奖励塑造策略，加速技能收敛。基于Unity的仿真实验表明，相较于同时多任务训练，Tree Learning在多种代表性运动技能上实现了更高奖励且保持100%技能保留率，实现了无缝多技能切换与实时交互控制。我们进一步在两个不同的Unity仿真任务中验证了Tree Learning的性能与泛化能力：一个受超级马里奥启发的交互场景和一个中国古典园林环境中的自主导航任务。

cs.RO / 40 / 2604.12916

E2E-Fly: An Integrated Training-to-Deployment System for End-to-End Quadrotor Autonomy

E2E-Fly：一个集成的端到端四旋翼自主训练与部署系统

Sun, Fangyu, Li, Fanxing, Zhang, Linzuo, Hu, Yu, Jin, Renbiao, Wu, Shuyu, Yu, Wenxian, Zou, Danping

Abstract

Training and transferring learning-based policies for quadrotors from simulation to reality remains challenging due to inefficient visual rendering, physical modeling inaccuracies, unmodeled sensor discrepancies, and the absence of a unified platform integrating differentiable physics learning into end-to-end training. While recent work has demonstrated various end-to-end quadrotor control tasks, few systems provide a systematic, zero-shot transfer pipeline, hindering reproducibility and real-world deployment. To bridge this gap, we introduce E2E-Fly, an integrated framework featuring an agile quadrotor platform coupled with a full-stack training, validation, and deployment workflow. The training framework incorporates a high-performance simulator with support for differentiable physics learning and reinforcement learning, alongside structured reward design tailored to common quadrotor tasks. We further introduce a two-stage validation strategy using sim-to-sim transfer and hardware-in-the-loop testing, and deploy policies onto two physical quadrotor platforms via a dedicated low-level control interface and a comprehensive sim-to-real alignment methodology, encompassing system identification, domain randomization, latency compensation, and noise modeling. To the best of our knowledge, this is the first work to systematically unify differentiable physics learning with training, validation, and real-world deployment for quadrotors. Finally, we demonstrate the effectiveness of our framework by training policies for six end-to-end control tasks and deploying them in the real world.

Chinese Translation

将基于学习的四旋翼控制策略从仿真转移到现实中仍然面临挑战，原因包括视觉渲染效率低下、物理建模不准确、未建模的传感器差异，以及缺乏将可微分物理学习整合到端到端训练中的统一平台。尽管近期的研究展示了多种端到端四旋翼控制任务，但很少有系统提供系统化的零样本转移管道，这限制了可重复性和现实世界的部署。为了解决这一问题，我们提出了E2E-Fly，一个集成框架，结合了敏捷的四旋翼平台和完整的训练、验证及部署工作流程。该训练框架结合了支持可微分物理学习和强化学习的高性能仿真器，并设计了针对常见四旋翼任务的结构化奖励。我们进一步引入了一种两阶段的验证策略，使用仿真到仿真的转移和硬件在环测试，并通过专用的低级控制接口和全面的仿真到现实对齐方法将策略部署到两个物理四旋翼平台上，涵盖系统识别、领域随机化、延迟补偿和噪声建模。根据我们所知，这是首个系统性地将可微分物理学习与四旋翼的训练、验证和现实世界部署统一的工作。最后，我们展示了该框架在训练六个端到端控制任务方面的有效性，并将其部署到现实世界中。

cs.RO / 41 / 2604.12933

DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding

DINO-Explorer：通过自我运动补偿的语义预测编码实现主动水下发现

Jin, Yuhan, Lessa, Nayari Marie, Alvarez, Mariela De Lucas, Laux, Melvin, Barbosa, Lucas Amparo, Kirchner, Frank, Adam, Rebecca

Abstract

Marine ecosystem degradation necessitates continuous, scientifically selective underwater monitoring. However, most autonomous underwater vehicles (AUVs) operate as passive data loggers, capturing exhaustive video for offline review and frequently missing transient events of high scientific value. Transitioning to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes. We propose DINO-Explorer, a novelty-aware perception framework driven by a continuous semantic surprise signal. Operating within the latent space of a frozen DINOv3 foundation model, it leverages a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution. An efference-copy-inspired module utilizes globally pooled optical flow to discount self-induced visual changes without suppressing genuine environmental novelty. We evaluate this signal on the downstream task of asynchronous event triage under variant telemetry constraints. Results demonstrate that DINO-Explorer provides a robust, bandwidth-efficient attention mechanism. At a fixed operating point, the system retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate, effectively surfacing mission-relevant phenomena. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the validated peak F1 versus telemetry bandwidth frontier, reducing telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score, successfully concentrating data transmission around human-verified novelty events.
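
A semantic surprise signal with ego-motion discounting could look roughly like the Python sketch below. It is an illustration under assumptions: the cosine-distance error, the `1 / (1 + flow)` discount, and the two-dimensional embeddings are made up for the example, and the recurrent predictor that would produce `predicted` is not shown.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def surprise(predicted, observed, flow_magnitude, flow_scale=1.0):
    """Prediction error in embedding space, discounted when ego-motion is large."""
    raw = cosine_distance(predicted, observed)
    # Assumed discount form: more pooled optical flow means less trust in the error.
    discount = 1.0 / (1.0 + flow_scale * flow_magnitude)
    return raw * discount


still = surprise([1.0, 0.0], [0.0, 1.0], flow_magnitude=0.0)
turning = surprise([1.0, 0.0], [0.0, 1.0], flow_magnitude=1.0)
print(still, turning)  # prints 1.0 0.5
```

The same embedding change counts for half as much surprise while the vehicle is maneuvering, which mirrors the abstract's goal of discounting self-induced visual change without muting novelty that appears when the vehicle is steady.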

Chinese Translation

海洋生态系统的退化需要持续的、科学选择性的水下监测。然而，大多数自主水下航行器（AUV）作为被动数据记录器工作，捕获大量视频以供离线审查，常常错过具有高科学价值的瞬态事件。向主动感知的转变需要一种因果的、在线的信号，突出重要现象，同时抑制由于操控引起的视觉变化。我们提出了DINO-Explorer，这是一种由连续语义惊奇信号驱动的新颖性感知框架。该框架在冻结的DINOv3基础模型的潜在空间内运行，利用轻量级的、基于动作的递归预测器来预测短期的语义演变。一个受效应副本启发的模块利用全局汇聚的光流来抵消自我引起的视觉变化，而不抑制真实环境的新颖性。我们在不同遥测约束下的异步事件分类下评估该信号。结果表明，DINO-Explorer提供了一种稳健的、带宽高效的注意机制。在固定的操作点上，该系统保留了78.8%的后发现人类审查者一致事件，触发确认率为56.8%，有效地突显了与任务相关的现象。重要的是，自我运动条件抑制了相对于未补偿惊奇信号基线的45.5%的假阳性。在一个重放侧的Pareto消融研究中，DINO-Explorer在验证的峰值F1与遥测带宽的边界上表现出色，在选定的操作点上将遥测带宽减少了48.2%，同时保持62.2%的峰值F1分数，成功地将数据传输集中在经过人类验证的新颖事件上。

cs.RO / 42 / 2604.12942

RMGS-SLAM: Real-time Multi-sensor Gaussian Splatting SLAM

RMGS-SLAM：实时多传感器高斯点云同时定位与地图构建

Li, Dongen, Liu, Yi, Liu, Junqi, Sun, Zewen, Huang, Zefan, Sun, Shuo, Liu, Jiahui, Yuan, Chengran, Guo, Hongliang, Tay, Francis E. H., Ang Jr, Marcelo H.

Abstract

Real-time 3D Gaussian splatting (3DGS)-based Simultaneous Localization and Mapping (SLAM) in large-scale real-world environments remains challenging, as existing methods often struggle to jointly achieve low-latency pose estimation, 3D Gaussian reconstruction in step with incoming sensor streams, and long-term global consistency. In this paper, we present a tightly coupled LiDAR-Inertial-Visual (LIV) 3DGS-based SLAM framework for real-time pose estimation and photorealistic mapping in large-scale real-world scenes. The system executes state estimation and 3D Gaussian primitive initialization in parallel with global Gaussian optimization, thereby enabling continuous dense mapping. To improve Gaussian initialization quality and accelerate optimization convergence, we introduce a cascaded strategy that combines feed-forward predictions with voxel-based principal component analysis (voxel-PCA) geometric priors. To enhance global consistency in large scenes, we further perform loop closure directly on the optimized global Gaussian map by estimating loop constraints through Gaussian-based Generalized Iterative Closest Point (GICP) registration, followed by pose-graph optimization. In addition, we collected challenging large-scale looped outdoor SLAM sequences with hardware-synchronized LiDAR-camera-IMU and ground-truth trajectories to support realistic and comprehensive evaluation. Extensive experiments on both public datasets and our dataset demonstrate that the proposed method achieves a strong balance among real-time efficiency, localization accuracy, and rendering quality across diverse and challenging real-world scenes.

Chinese Translation

在大规模真实环境中，基于实时3D高斯点云（3DGS）的同时定位与地图构建（SLAM）仍然面临挑战，因为现有方法往往难以同时实现低延迟的位姿估计、与传感器流同步的3D高斯重建以及长期的全局一致性。本文提出了一种紧密耦合的激光雷达-惯性-视觉（LIV）3DGS SLAM框架，用于在大规模真实场景中实现实时位姿估计和照片级真实感地图构建。该系统在全局高斯优化的同时并行执行状态估计和3D高斯原语初始化，从而实现连续的密集地图构建。为了提高高斯初始化的质量并加速优化收敛，我们引入了一种级联策略，将前馈预测与基于体素的主成分分析（voxel-PCA）几何先验相结合。为了增强大场景中的全局一致性，我们进一步通过高斯基的广义迭代最近点（GICP）配准直接在优化后的全局高斯地图上执行回环闭合，并随后进行位姿图优化。此外，我们收集了具有硬件同步的激光雷达-相机-惯性测量单元（IMU）和真实轨迹的具有挑战性的循环大规模户外SLAM序列，以支持现实和全面的评估。在公共数据集和我们自己的数据集上的大量实验表明，所提出的方法在实时效率、定位精度和渲染质量之间实现了良好的平衡，适用于多样且具有挑战性的真实场景。

cs.RO / 43 / 2604.13001

XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios

XRZero-G0：通过界面、质量与比例推动灵巧机器人操作的前沿发展

Wang, Junming, Pu, Teng, Fung, Wingmun, Wang, Jindong, Wang, Shanchang, Deng, Yuan, Wang, Shuyuan, Liu, Ziwei, Pan, Kunhao, Yang, Ping, Zhai, Peng, Liang, Yuxin, Li, Xiaofan, Sun, Jiabi, Xu, Renchao, Tian, Xiaotian, Yan, Pengfei, Ye, Guoqiang, Li, Liang, Wang, Qian, Gan, Ruyi, Wang, Hao

Abstract

The acquisition of high-quality, action-aligned demonstration data remains a fundamental bottleneck in scaling foundation models for dexterous robot manipulation. Although robot-free human demonstrations (e.g., the UMI paradigm) offer a scalable alternative to traditional teleoperation, current systems are constrained by sub-optimal hardware ergonomics, open-loop workflows, and a lack of systematic data-mixing strategies. To address these limitations, we present XRZero-G0, a hardware-software co-designed system for embodied data collection and policy learning. The system features an ergonomic, virtual reality interface equipped with a top-view camera and dual specialized grippers to directly improve collection efficiency. To ensure dataset reliability, we propose a closed-loop collection, inspection, training, and evaluation pipeline for non-proprioceptive data. This workflow achieves an 85% data validity rate and establishes a transparent mechanism for quality control. Furthermore, we investigate the empirical scaling behaviors and optimal mixing ratios of robot-free data. Extensive experiments indicate that combining a minimal volume of real-robot data with large-scale robot-free data (e.g., a 10:1 ratio) achieves performance comparable to exclusively real-robot datasets, while reducing acquisition costs by a factor of twenty. Utilizing XRZero-G0, we construct a 2,000-hour robot-free dataset that enables zero-shot cross-embodiment transfer to a target physical robot, demonstrating a highly scalable methodology for generalized real-world manipulation. Our project repository: https://github.com/X-Square-Robot/XRZero-G0

Chinese Translation

高质量且动作对齐的示范数据获取仍然是扩展灵巧机器人操作基础模型的根本瓶颈。尽管无机器人的人类示范（如UMI范式）为传统远程操作提供了可扩展的替代方案，但现有系统受限于次优的硬件人体工学设计、开环工作流程以及缺乏系统性的数据混合策略。为解决这些限制，我们提出了XRZero-G0，一种硬件与软件协同设计的具身数据采集与策略学习系统。该系统配备符合人体工学的虚拟现实界面，结合俯视摄像头和双专用夹持器，直接提升采集效率。为确保数据集的可靠性，我们提出了针对非本体感知数据的闭环采集、检验、训练与评估流程。该工作流程实现了85%的数据有效率，并建立了透明的质量控制机制。此外，我们还研究了无机器人数据的经验扩展行为及其最优混合比例。大量实验表明，将少量真实机器人数据与大规模无机器人数据（例如10:1比例）结合，能够达到与纯真实机器人数据集相当的性能，同时将采集成本降低约二十倍。利用XRZero-G0，我们构建了一个2000小时的无机器人数据集，实现了对目标物理机器人的零样本跨具身迁移，展示了一种高度可扩展的通用现实操作方法。项目仓库地址：https://github.com/X-Square-Robot/XRZero-G0

cs.RO / 44 / 2604.13015

Learning Versatile Humanoid Manipulation with Touch Dreaming

通过触觉梦境学习多功能人形机器人操作

Niu, Yaru, Fang, Zhenlong, Chen, Binghong, Zhou, Shuai, Senthilkumaran, Revanth, Zhang, Hao, Chen, Bingqing, Qiu, Chen, Tseng, H. Eric, Francis, Jonathan, Zhao, Ding

Abstract

Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes. In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. Built on this controller, we develop a whole-body humanoid data collection system that combines VR-based teleoperation with human-to-humanoid motion mapping, enabling efficient collection of real-world demonstrations. We then propose Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that models touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage with behavioral cloning augmented by touch dreaming: in addition to predicting action chunks, the policy predicts future hand-joint forces and future tactile latents, encouraging the shared Transformer trunk to learn contact-aware representations for dexterous interaction. Across five contact-rich tasks, Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, and Tea Serving, HTD achieves a 90.9% relative improvement in average success rate over the stronger baseline. Ablation results further show that latent-space tactile prediction is more effective than raw tactile prediction, yielding a 30% relative gain in success rate. These results demonstrate that combining robust whole-body execution, scalable humanoid data collection, and predictive touch-centered learning enables versatile, high-dexterity humanoid manipulation in the real world. Project webpage: humanoid-touch-dream.github.io.

Chinese Translation

人形机器人承诺提供通用的辅助功能，但在现实世界中，人形机器人的运动操控仍然面临挑战，因为这需要全身稳定性、灵巧的手部操作以及在频繁接触变化下的接触感知。在本研究中，我们探讨了灵巧且接触丰富的人形机器人运动操控。我们首先开发了一种基于强化学习的全身控制器，在复杂操作过程中提供稳定的下肢和躯干执行。基于该控制器，我们开发了一个全身人形机器人数据采集系统，该系统结合了基于虚拟现实的遥操作与人类到人形机器人的运动映射，能够高效地收集现实世界的演示数据。接着，我们提出了带有触觉梦境的人形变换器（Humanoid Transformer with Touch Dreaming, HTD），这是一种多模态编码-解码变换器，将触觉作为核心模态，与多视角视觉和本体感知相结合。HTD在单个阶段中进行训练，采用行为克隆并增强触觉梦境：除了预测动作片段外，策略还预测未来的手关节力和未来的触觉潜变量，鼓励共享的变换器主干学习接触感知表示以实现灵巧交互。在五个接触丰富的任务中，包括插入-T、书籍整理、毛巾折叠、猫砂铲除和茶水服务，HTD在平均成功率上相较于更强的基线实现了90.9%的相对提升。消融实验结果进一步表明，潜空间触觉预测比原始触觉预测更有效，成功率相对提高了30%。这些结果表明，结合稳健的全身执行、可扩展的人形机器人数据采集和以触觉为中心的预测学习，使得在现实世界中实现多功能、高灵巧的人形机器人操作成为可能。项目网页：humanoid-touch-dream.github.io。

计算机视觉 (Computer Vision)

118

cs.CV / 1 / 2604.11843

UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators

UniMark：自回归图像生成器的统一自适应多比特水印

Yilmaz, Yigit, Petrova, Elena, Kaya, Mehmet, Rossi, Lucia, Rahman, Amir

Abstract

Invisible watermarking for autoregressive (AR) image generation has recently gained attention as a means of protecting image ownership and tracing AI-generated content. However, existing approaches suffer from three key limitations: (1) they embed only zero-bit watermarks for binary verification, lacking the ability to convey multi-bit messages; (2) they rely on static codebook partitioning strategies that are vulnerable to security attacks once the partition is exposed; and (3) they are designed for specific AR architectures, failing to generalize across diverse AR paradigms. We propose UniMark, a training-free, unified watermarking framework for autoregressive image generators that addresses all three limitations. UniMark introduces three core components: Adaptive Semantic Grouping (ASG), which dynamically partitions codebook entries based on semantic similarity and a secret key, ensuring both image quality preservation and security; Block-wise Multi-bit Encoding (BME), which divides the token sequence into blocks and encodes different bits across blocks with error-correcting codes for reliable message transmission; and a Unified Token-Replacement Interface (UTRI) that abstracts the watermark embedding process to support both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR) paradigms. We provide theoretical analysis on detection error rates and embedding capacity. Extensive experiments on three AR models demonstrate that UniMark achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction, while maintaining robustness against cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing attacks.
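
The combination of a key-dependent codebook partition, block-wise bit encoding, and token replacement can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the framework's algorithm: adjacent token ids (2k, 2k+1) stand in for semantically similar codebook entries, an HMAC decides which member of each pair carries bit 0 or 1, and a per-block majority vote stands in for the error-correcting code. All function names are hypothetical.

```python
import hashlib
import hmac


def bit_of(token, key):
    """Key-dependent group (0 or 1) of a token; paired tokens get opposite groups."""
    digest = hmac.new(key, str(token // 2).encode(), hashlib.sha256).digest()
    return (digest[0] & 1) ^ (token & 1)


def embed(tokens, bits, key):
    """Split tokens into len(bits) blocks; block i only keeps tokens carrying bits[i]."""
    block = len(tokens) // len(bits)
    out = list(tokens)
    for i, bit in enumerate(bits):
        for j in range(i * block, (i + 1) * block):
            if bit_of(out[j], key) != bit:
                out[j] ^= 1  # replace with the paired (assumed similar) token
    return out


def extract(tokens, n_bits, key):
    """Majority vote per block, which acts as a simple repetition code."""
    block = len(tokens) // n_bits
    bits = []
    for i in range(n_bits):
        votes = sum(bit_of(t, key) for t in tokens[i * block:(i + 1) * block])
        bits.append(1 if 2 * votes > block else 0)
    return bits


key = b"secret"
tokens = list(range(10, 34))  # 24 hypothetical codebook indices
message = [1, 0, 1, 1]
marked = embed(tokens, message, key)
marked[0] ^= 1  # simulate an attack that corrupts one token
print(extract(marked, len(message), key) == message)  # prints True
```

Without the key, an observer cannot tell which member of a pair encodes which bit, and the per-block vote tolerates a minority of corrupted tokens. A real system would replace the adjacent-id pairing with groups derived from embedding similarity and use a stronger code than repetition.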

Chinese Translation

自回归（AR）图像生成的隐形水印最近引起了关注，作为保护图像所有权和追踪AI生成内容的一种手段。然而，现有的方法存在三个主要局限性：（1）仅嵌入零比特水印用于二进制验证，缺乏传递多比特信息的能力；（2）依赖于静态代码本分区策略，一旦分区暴露，容易受到安全攻击；（3）设计针对特定的AR架构，无法在多样化的AR范式中推广。我们提出了UniMark，一个无训练的统一水印框架，旨在解决这三大局限性。UniMark引入了三个核心组件：自适应语义分组（ASG），根据语义相似性和秘密密钥动态分区代码本条目，确保图像质量的保持和安全性；块级多比特编码（BME），将令牌序列划分为块，并使用纠错码在不同块之间编码不同的比特，以实现可靠的信息传输；以及统一令牌替换接口（UTRI），抽象化水印嵌入过程，以支持下一个令牌预测（例如，LlamaGen）和下一个尺度预测（例如，VAR）范式。我们提供了关于检测错误率和嵌入容量的理论分析。在三个AR模型上的广泛实验表明，UniMark在图像质量（FID）、水印检测准确性和多比特信息提取方面达到了最先进的性能，同时在裁剪、JPEG压缩、高斯噪声、模糊、颜色抖动和随机擦除攻击下保持了鲁棒性。

cs.CV / 2 / 2604.11868

MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs

MedConcept：用于医学视觉语言模型可解释性的无监督概念发现

Haque, Md Rakibul, Sultan, KM Arefeen, Kataria, Tushar, Elhabian, Shireen

Abstract

While medical Vision-Language models (VLMs) achieve strong performance on tasks such as tumor or organ segmentation and diagnosis prediction, their opaque latent representations limit clinical trust and the ability to explain predictions. Interpretability of these multimodal representations is therefore essential for the trustworthy clinical deployment of pretrained medical VLMs. However, current interpretability methods, such as gradient- or attention-based visualizations, are often limited to specific tasks such as classification. Moreover, they do not provide concept-level explanations derived from shared pretrained representations that can be reused across downstream tasks. We introduce MedConcept, a framework that uncovers latent medical concepts in a fully unsupervised manner and grounds them in clinically verifiable textual semantics. MedConcept identifies sparse neuron-level concept activations from pretrained VLM representations and translates them into pseudo-report-style summaries, enabling physician-level inspection of internal model reasoning. To address the lack of quantitative evaluation in concept-based interpretability, we introduce a quantitative semantic verification protocol that leverages an independent pretrained medical LLM as a frozen external evaluator to assess concept alignment with radiology reports. We define three concept scores, Aligned, Unaligned, and Uncertain, to quantify semantic support, contradiction, or ambiguity relative to radiology reports and use them exclusively for post hoc evaluation. These scores provide a quantitative baseline for assessing interpretability in medical VLMs. All code, prompts, and data will be released upon acceptance.

Chinese Translation

尽管医学视觉语言模型（VLMs）在肿瘤或器官分割及诊断预测等任务中表现出色，但其不透明的潜在表示限制了临床信任度及预测解释能力。因此，这些多模态表示的可解释性对于预训练医学VLM的可信临床应用至关重要。然而，当前的可解释性方法，如基于梯度或注意力的可视化，通常局限于特定任务（如分类），且未能提供可复用于下游任务的共享预训练表示的概念级解释。我们提出MedConcept框架，以完全无监督的方式发现潜在医学概念，并将其与临床可验证的文本语义相结合。MedConcept从预训练VLM表示中识别稀疏的神经元级概念激活，并将其转化为伪报告风格的摘要，支持医生级别的模型内部推理检视。针对基于概念的可解释性缺乏定量评估的问题，我们引入了一种定量语义验证协议，利用独立预训练的医学大型语言模型（LLM）作为冻结的外部评估器，评估概念与放射学报告的一致性。我们定义了三种概念评分：Aligned（对齐）、Unaligned（未对齐）和Uncertain（不确定），用于量化与放射学报告的语义支持、矛盾或模糊性，并仅用于事后评估。这些评分为医学VLM的可解释性评估提供了定量基线。所有代码、提示及数据将在论文接受后公开。

cs.CV / 3 / 2604.11913

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

V-Nutri：基于自我中心烹饪视频的菜肴级营养估计

Yue, Chengkun, Xu, Chuanzhi, He, Jiangpeng

Abstract

Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finished dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether the cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we added further manual annotations to the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking-keyframe selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.

Chinese Translation

从视觉数据中估计餐饮的营养成分是饮食监测和计算健康中的一个重要问题，但现有的方法主要依赖于最终完成菜肴的单张图像。这种设置在根本上是有限的，因为许多与营养相关的成分和转化过程，例如油、酱汁和混合成分，在烹饪后变得视觉上模糊，从而使得准确的卡路里和宏量营养素估计变得困难。本文探讨了自我中心烹饪视频中的烹饪过程信息是否可以为菜肴级营养估计提供帮助。首先，我们进一步手动标注了HD-EPIC数据集，并建立了基于视频的营养估计的第一个基准。最重要的是，我们提出了V-Nutri，一个分阶段框架，结合了经过Nutrition5K预训练的视觉骨干网络和一个轻量级融合模块，该模块聚合了来自最终菜肴帧和从自我中心视频中提取的烹饪过程关键帧的特征。V-Nutri还包括一个烹饪关键帧选择模块，一个基于VideoMamba的事件检测模型，专注于成分添加时刻。在HD-EPIC数据集上的实验表明，过程线索可以提供补充的营养证据，在受控条件下改善营养估计。我们的结果进一步表明，过程关键帧的益处在很大程度上依赖于骨干网络的表示能力和事件检测的质量。我们的代码和标注数据集可在https://github.com/K624-YCK/V-Nutri获取。

cs.CV / 4 / 2604.11927

A Workflow to Efficiently Generate Dense Tissue Ground Truth Masks for Digital Breast Tomosynthesis

一种高效生成数字乳腺断层合成密集组织真值掩膜的工作流程

Mustafaev, Tamerlan, Kruglov, Oleg, Zuley, Margarita, Omena, Luana de Mero, de Oliveira, Guilherme Muniz, Franca, Vitor de Sousa, Barufaldi, Bruno, Nishikawa, Robert, Lee, Juhun

Abstract

Digital breast tomosynthesis (DBT) is now the standard of care for breast cancer screening in the USA. Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized risk estimation, but algorithm development is limited by scarce human-delineated training data. In this study we introduce a time- and labor-saving framework to generate a human-annotated binary segmentation mask for dense tissue in DBT. Our framework enables a user to outline a rough region of interest (ROI) enclosing dense tissue on the central reconstructed slice of a DBT volume and select a segmentation threshold to generate the dense tissue mask. The algorithm then projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to maintain consistent dense tissue delineation across the DBT volume. By requiring annotation only on the central slice, the framework substantially reduces annotation time and labor. We used 44 DBT volumes from the DBTex dataset for evaluation. Inter-reader agreement was assessed by computing patient-wise Dice similarity coefficients between segmentation masks produced by two radiologists, yielding a median of 0.84. Accuracy of the proposed method was evaluated by having a radiologist manually segment the 20th and 80th percentile slices from each volume (CC and MLO views; 176 slices total) and calculate Dice scores between the manual and proposed segmentations, yielding a median of 0.83.
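
The Dice similarity coefficient reported above can be illustrated with a minimal sketch. It uses the standard definition 2|A∩B| / (|A| + |B|) on binary masks; the function name `dice_coefficient`, the toy masks, and the choice to return 1.0 when both masks are empty are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is an assumed convention.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total


# Hypothetical 2x3 masks standing in for a manual and a proposed segmentation.
manual = np.array([[1, 1, 0], [0, 1, 0]])
proposed = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(manual, proposed)  # 2 shared pixels, 3 + 3 foreground pixels
```

A median over per-patient or per-slice scores, as in the abstract, would then be taken with `np.median` over a list of such values.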

Chinese Translation

数字乳腺断层合成（Digital Breast Tomosynthesis, DBT）现已成为美国乳腺癌筛查的标准方法。准确分割DBT图像中的纤维腺体组织对于个性化风险评估至关重要，但算法开发受限于稀缺的人为标注训练数据。本研究提出了一种节省时间和劳动的框架，用于生成DBT中密集组织的人为注释二值分割掩膜。该框架允许用户在DBT体积的中央重建切片上粗略勾画包含密集组织的感兴趣区域（ROI），并选择分割阈值以生成密集组织掩膜。算法随后将ROI投影到其余切片，并迭代调整切片特异阈值，以保持整个DBT体积中密集组织分割的一致性。通过仅需在中央切片进行注释，该框架显著减少了注释时间和劳动量。我们使用DBTex数据集中的44个DBT体积进行评估。通过计算两位放射科医生生成的分割掩膜之间的患者级Dice相似系数评估读者间一致性，中位数为0.84。通过让放射科医生手动分割每个体积的第20和第80百分位切片（CC和MLO视图，共176张切片），并计算手动与所提方法分割的Dice分数，评估所提方法的准确性，中位数为0.83。

cs.CV / 5 / 2604.11932

EigenCoin: sassanid coins classification based on Bhattacharyya distance

EigenCoin：基于Bhattacharyya距离的萨珊王朝硬币分类

Allahverdi, Rahele, Dehshibi, Mohammad Mahdi, Bastanfard, Azam, Akbarzadeh, Daryoosh

Abstract

Pattern recognition on imbalanced databases is an active research topic. We consider this problem in the application of Sassanid coin classification. Our focus is not only on proposing the EigenCoin manifold with Bhattacharyya distance for the classification task, but also on testing the influence of the holistic and feature-based approaches. EigenCoin consists of three main steps, namely manifold construction, mapping test data, and classification. Experiments show that EigenCoin outperformed the other algorithms considered and achieved the accuracy from 9.45% up to 21.75%, while also being able to handle the over-fitting problem.
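
For readers unfamiliar with the Bhattacharyya distance, the following is a minimal sketch of its standard definition for two discrete distributions, D = -ln Σ sqrt(p_i q_i). The abstract does not say how EigenCoin applies the distance on its manifold, so the function name `bhattacharyya_distance` and the histogram inputs are assumptions for illustration only.

```python
import numpy as np


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two non-negative histograms of equal length."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so that each histogram sums to 1.
    p = p / p.sum()
    q = q / q.sum()
    # Bhattacharyya coefficient: overlap between the two distributions, in [0, 1].
    coefficient = np.sum(np.sqrt(p * q))
    # Distance is 0 for identical distributions and grows as the overlap shrinks.
    return -np.log(coefficient)


# Hypothetical two-bin histograms.
same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])
different = bhattacharyya_distance([1.0, 0.0], [0.5, 0.5])
```

Note that histograms with no overlapping support give a coefficient of 0 and therefore an infinite distance.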

Chinese Translation

使用不平衡数据库解决模式识别问题是一个热门话题，吸引了研究者的关注。因此，我们在萨珊王朝硬币分类的应用中考虑了这个问题。我们的重点不仅是提出基于Bhattacharyya距离的EigenCoin流形用于分类任务，还测试整体方法和基于特征的方法的影响。EigenCoin包括三个主要步骤，即流形构建、映射测试数据和分类。实验结果表明，EigenCoin的表现优于其他观察到的算法，准确率从9.45%提高到21.75%，同时它具备处理过拟合问题的能力。

cs.CV / 6 / 2604.11961

Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery

基于世界空间3D人类网格恢复的社区居住老年人跌倒风险与步态分析

Banarjee, Chitra, Kwon, Patrick, Lipat, Ania, Xie, Rui, Chen, Chen, Thiamwong, Ladda

Abstract

Gait assessment is a key clinical indicator of fall risk and overall health in older adults. However, standard clinical practice is largely limited to stopwatch-measured gait speed. We present a pipeline that leverages a 3D Human Mesh Recovery (HMR) model to extract gait parameters from recordings of older adults completing the Timed Up and Go (TUG) test. From videos recorded across different community centers, we extract and analyze spatiotemporal gait parameters, including step time, sit-to-stand duration, and step length. We found that video-derived step time was significantly correlated with IMU-based insole measurements. Using linear mixed effects models, we confirmed that shorter, more variable step lengths and longer sit-to-stand durations were predicted by higher self-rated fall risk and fear of falling. These findings demonstrate that our pipeline can enable accessible and ecologically valid gait analysis in community settings.

Chinese Translation

步态评估是老年人跌倒风险和整体健康的重要临床指标。然而，标准临床实践主要限于使用秒表测量的步态速度。我们提出了一种利用3D人类网格恢复（HMR）模型从老年人完成定时起立走（TUG）测试的录音中提取步态参数的流程。通过在不同社区中心录制的视频，我们提取并分析了时空步态参数，包括步伐时间、坐立时间和步长。我们发现视频导出的步伐时间与基于惯性测量单元（IMU）的鞋垫测量显著相关。使用线性混合效应模型，我们确认较短且变化较大的步长以及较长的坐立时间与更高的自评跌倒风险和对跌倒的恐惧相关。这些发现表明，我们的流程能够在社区环境中实现可及且生态有效的步态分析。

cs.CV / 7 / 2604.11970

INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

INDOTABVQA：印尼语文档跨语言表格理解基准

Gautam, Somraj, Dravichi, Anathapindika, Harit, Gaurav

Abstract

We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and a LoRA-fine-tuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA

Chinese Translation

我们提出了INDOTABVQA，这是一个用于评估印尼语真实文档图像中跨语言表格视觉问答（Table Visual Question Answering, VQA）的基准数据集。该数据集包含1593张文档图像，涵盖三种视觉风格（有边框、无边框和彩色），每张图像中包含一个或多个表格，以及1593组四种语言（印尼语、英语、印地语和阿拉伯语）的问答对。这使得能够在单语言环境（印尼语文档配印尼语问题）和跨语言环境（印尼语文档配其他语言问题）下评估视觉-语言模型（Vision-Language Models, VLMs）。我们对主流开源VLMs（Qwen2.5-VL、Gemma-3、LLaMA-3.2）及GPT-4o进行了基准测试，揭示了在结构复杂表格和低资源语言上的显著性能差距。在我们的数据集上对紧凑型3B模型和LoRA微调的7B模型进行微调，准确率分别提升了11.6%和17.8%。通过提供明确的表格区域坐标作为额外输入，性能进一步提升了4-7%，展示了空间先验（Spatial priors）在基于表格推理中的价值。我们的研究结果强调了语言多样性和领域特定数据集的重要性，并证明了针对性微调能够显著提升VLM在专业文档理解任务中的表现。INDOTABVQA为推动跨语言、结构感知文档理解研究提供了宝贵资源，尤其对世界上资源较少的地区具有重要意义。完整数据集可在huggingface访问：https://huggingface.co/datasets/NusaBharat/INDOTABVQA

cs.CV / 8 / 2604.11993

Ultra-low-light computer vision using trained photon correlations

基于训练的光子关联的超低光照计算机视觉

Sohoni, Mandar M., Laydevant, Jérémie, Ouellet, Mathieu, Ma, Shi-Yuan, Yanagimoto, Ryotatsu, Ash, Benjamin A., Onodera, Tatsuhiro, Wang, Tianyu, Wright, Logan G., McMahon, Peter L.

Abstract

Illumination using correlated photon sources has been established as an approach to allowing high-fidelity images to be reconstructed from noisy camera frames by taking advantage of the knowledge that signal photons are spatially correlated whereas detector clicks due to noise are uncorrelated. However, in computer-vision tasks, the goal is often not ultimately to reconstruct an image, but to make inferences about a scene -- such as what object is present. Here we show how correlated-photon illumination can be used to gain an advantage in a hybrid optical-electronic computer-vision pipeline for object recognition. We demonstrate correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend in a way that the Transformer can learn to benefit from the correlations, using a small number (<= 100) of shots. We show a classification accuracy enhancement of up to 15 percentage points over conventional, uncorrelated-illumination-based computer vision in ultra-low-light and noisy imaging conditions, as well as an improvement over using untrained correlated-photon illumination. Our work illustrates how specializing to a computer-vision task -- object recognition -- and training the pattern of photon correlations in conjunction with a digital backend allows us to push the limits of accuracy in highly photon-budget-constrained scenarios beyond existing methods focused on image reconstruction.

Chinese Translation

利用相关光子源进行照明已被确立为一种方法，通过利用信号光子在空间上的相关性，而噪声引起的探测器点击是无相关的，从而允许从噪声相机帧中重建高保真图像。然而，在计算机视觉任务中，目标通常并不是最终重建图像，而是对场景进行推断——例如，识别出场景中存在的物体。在这里，我们展示了如何利用相关光子照明在混合光学-电子计算机视觉管道中获得物体识别的优势。我们展示了关联感知训练（Correlation-Aware Training, CAT）：对可训练的相关光子照明源和Transformer后端进行端到端优化，使得Transformer能够学习利用这些相关性，使用少量（<= 100）拍摄次数。我们展示了在超低光照和噪声成像条件下，相较于传统的无相关照明计算机视觉，分类准确率提高了多达15个百分点，并且相较于使用未训练的相关光子照明也有所改善。我们的工作说明，专注于计算机视觉任务——物体识别——并结合数字后端训练光子关联模式，使我们能够在高度受限的光子预算场景中超越现有的专注于图像重建的方法，推动准确性的极限。

cs.CV / 9 / 2604.11998

The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

2026年NTIRE跨域少样本目标检测第二届挑战赛：方法与结果

Qiu, Xingyu, Fu, Yuqian, Geng, Jiawei, Ren, Bin, Pan, Jiancheng, Wu, Zongwei, Tang, Hao, Fu, Yanwei, Timofte, Radu, Sebe, Nicu, Elhoseiny, Mohamed, Hong, Lingyi, Cheng, Mingxi, He, Xingqi, Li, Runze, Sheng, Xingdong, Zhang, Wenqiang, Liu, Jiacong, Luo, Shu, Qin, Yikai, Zhao, Yaze, Jiang, Yongwei, Zou, Yixiong, Zhang, Zhe, Yang, Yang, Li, Kaiyu, Fu, Bowen, Jiang, Zixuan, Li, Ke, Qiao, Hui, Cao, Xiangyong, Yu, Xuanlong, Sha, Youyang, Liu, Longfei, Yang, Di, Shen, Xi, Go, Kyeongryeol, Jang, Taewoong, Meesiyawar, Saiprasad, Kirasur, Ravi, Kulkarni, Rakshita, Deshpande, Bhoomi, Patil, Harsh, Mudenagudi, Uma, Hu, Shuming, Chen, Chao, Wang, Tao, Zhou, Wei, Xu, Qi, Xing, Zhenzhao, Zhao, Dandan, Xia, Hanzhe, Lu, Dongdong, Zhang, Zhe, Wang, Jingru, Huang, Guangwei, Tu, Jiachen, Shi, Yaokun, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Zhou, Liwei, Dou, Bei, Wu, Tao, Fan, Zekang, Liu, Junjie, de Senneville, Adhémar, Armangeon, Flavien, Mengbers, Lyu, Yazhe, Xin, Zhimeng, Zhuang, Zijian, Zhu, Hongchun, Wang, Li

Abstract

Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.

Chinese Translation

跨域少样本目标检测（CD-FSOD）仍然是现有目标检测器和少样本学习方法面临的一个挑战性问题，尤其是在不同领域之间的泛化能力方面。作为NTIRE 2026的一部分，我们举办了第二届CD-FSOD挑战赛，以系统地评估和促进在有限标注条件下检测未见目标领域中的物体的进展。该挑战赛引起了强烈的社区关注，共有128名注册参与者和696个提交结果。其中，31个团队积极参与，19个团队提交了有效的最终结果。参与者探索了广泛的策略，提出了创新的方法，在开源和闭源轨道下推动了性能的前沿。本报告提供了NTIRE 2026 CD-FSOD挑战赛的详细概述，包括提交方法的总结以及对所有参与团队最终结果的分析。挑战代码： https://github.com/ohMargin/NTIRE2026_CDFSOD。

cs.CV / 10 / 2604.12012

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

TIPSv2：通过增强的Patch-Text对齐推进视觉-语言预训练

Cao, Bingyi, Chen, Koert, Maninis, Kevis-Kokitsi, Chen, Kaifeng, Karpur, Arjun, Xia, Ye, Dua, Sahil, Dabral, Tanmaya, Han, Guangxing, Han, Bohyung, Ainslie, Joshua, Bewley, Alex, Jacob, Mithun, Wagner, René, Ramos, Washington, Choromanski, Krzysztof, Seyedhosseini, Mojtaba, Zhou, Howard, Araujo, André

Abstract

Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .

Chinese Translation

近年来视觉-语言预训练的进展显著提升了许多下游计算机视觉应用的性能，如分类、检索、分割和深度预测。然而，这些模型仍然难以实现的一个基本能力是将密集的patch表示与对应概念的文本嵌入进行有效对齐。在本工作中，我们深入研究了这一关键问题，并提出了增强基础视觉-语言模型该能力的新技术。首先，我们发现patch级别的蒸馏过程显著提升了密集的patch-text对齐——令人惊讶的是，经过蒸馏的学生模型的patch-text对齐效果远超教师模型。基于此观察，我们对预训练方案进行了调整，提出了iBOT++，这是对常用iBOT掩码图像目标的升级版本，其中未掩码的token也直接参与损失计算，从而极大地增强了预训练模型的patch-text对齐能力。此外，为了提升视觉-语言预训练的效率和效果，我们修改了学习方案中的指数移动平均设置，并引入了一个字幕采样策略，以利用不同粒度的合成字幕。结合这些组件，我们开发了TIPSv2，这是一系列适用于广泛下游任务的图文编码器模型。通过在9个任务和20个数据集上的全面实验，我们展示了其强劲的性能，通常与最新的视觉编码器模型相当或更优。代码和模型已通过项目主页https://gdm-tipsv2.github.io/ 发布。

cs.CV / 11 / 2604.12028

Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection

基于曲波变换的频率感知特征增强用于深伪检测

Sabri, Salar Adel, Mstafa, Ramadhan J.

Abstract

The proliferation of sophisticated generative models has significantly advanced the realism of synthetic facial content, known as deepfakes, raising serious concerns about digital trust. Although modern deep learning-based detectors perform well, many rely on spatial-domain features that degrade under compression. This limitation has prompted a shift toward integrating frequency-domain representations with deep learning to improve robustness. Prior research has explored frequency transforms such as Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), and Wavelet Transform, among others. However, to the best of our knowledge, the Curvelet Transform, despite its superior directional and multiscale properties, remains entirely unexplored in the context of deepfake detection. In this work, we introduce a novel Curvelet-based detection approach that enhances feature quality through wedge-level attention and scale-aware spatial masking, both trained to selectively emphasize discriminative frequency components. The refined frequency cues are reconstructed and passed to a modified pretrained Xception network for classification. Evaluated on two compression qualities in the challenging FaceForensics++ dataset, our method achieves 98.48% accuracy and 99.96% AUC on FF++ low compression, while maintaining strong performance under high compression, demonstrating the efficacy and interpretability of Curvelet-informed forgery detection.

Chinese Translation

复杂生成模型的迅速发展显著提升了合成面部内容（即深伪）的真实感，进而引发了对数字信任的严重担忧。尽管现代基于深度学习的检测器表现良好，但许多检测器依赖于在压缩下会退化的空间域特征。这一局限性促使我们转向将频率域表示与深度学习相结合，以提高鲁棒性。先前的研究探讨了离散余弦变换（DCT）、快速傅里叶变换（FFT）和小波变换等频率变换。然而，尽我们所知，尽管曲波变换（Curvelet Transform）具有优越的方向性和多尺度特性，但在深伪检测的背景下仍然完全未被探索。在本研究中，我们提出了一种新颖的基于曲波变换的检测方法，通过楔级注意力和尺度感知空间掩蔽来增强特征质量，二者均经过训练以选择性地强调判别性频率成分。经过精炼的频率线索被重构并传递给修改后的预训练Xception网络进行分类。在具有挑战性的FaceForensics++数据集中，在两种压缩质量下进行评估，我们的方法在FF++低压缩情况下达到了98.48%的准确率和99.96%的AUC，同时在高压缩下保持了强劲的性能，证明了基于曲波的伪造检测的有效性和可解释性。

cs.CV / 12 / 2604.12035

Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs

视觉Token剪枝是否提升校准性？基于多模态大语言模型置信度的实证研究

Tan, Kaizhen

Abstract

Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.
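
Two of the calibration metrics named above can be sketched as follows. This is a minimal illustration using common textbook definitions: ECE with equal-width confidence bins, and the Brier score as the mean squared gap between confidence and a 0/1 correctness indicator. The bin count, function names, and toy predictions are assumptions and may differ from the paper's evaluation setup.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width bin; a confidence of 1.0 goes to the last bin.
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(hit[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of samples in this bin
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 correctness indicator."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    return np.mean((conf - hit) ** 2)


# Hypothetical predictions: two confident correct answers, two uncertain ones (one wrong).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

Lower values are better for both metrics. Comparing them before and after pruning at a fixed accuracy is the kind of analysis the abstract describes.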

Chinese Translation

视觉Token剪枝是多模态大语言模型（MLLMs）中广泛使用的高效推理策略，但现有研究主要通过任务准确率进行评估。本文探讨了视觉Token剪枝对模型校准性的影响，即预测置信度与实际正确性的一致性。我们基于LLaVA-1.5-7B模型，在POPE和ScienceQA-IMG数据集上，评估了多种剪枝策略（包括带不同显著性权重的SCOPE、仅显著性剪枝、FastV及随机剪枝）在多个Token预算下的期望校准误差（ECE）、Brier分数和AURC。结果表明，剪枝并非简单地以牺牲可靠性换取效率。在POPE数据集的纯覆盖设置中，SCOPE实现了显著低于未剪枝模型的ECE，同时保持了相似的准确率。内部的alpha参数扫描进一步显示出一致趋势：降低显著性权重可在所有测试的Token预算下提升校准性，而准确率仅有轻微变化。相比之下，基于显著性的剪枝导致校准性能下降，FastV在我们的设置中则引起严重的性能退化。在ScienceQA-IMG上，剪枝同样降低了ECE，准确率保持稳定或略有提升。我们还研究了基于覆盖选择中的gap power指数，发现其默认设置并非总是最优。总体而言，研究结果表明视觉Token剪枝的评估应不仅限于准确率，还应关注置信度质量，尤其对于需要可靠决策的多模态系统。

cs.CV / 13 / 2604.12068

Privacy-Preserving Structureless Visual Localization via Image Obfuscation

通过图像混淆实现的隐私保护无结构视觉定位

Panek, Vojtech, Beliansky, Patrik, Kukelova, Zuzana, Sattler, Torsten

Abstract

Visual localization is the task of estimating the camera pose of an image relative to a scene representation. In practice, visual localization systems are often cloud-based. Naturally, this raises privacy concerns in terms of revealing private details through the images sent to the server or through the representations stored on the server. Privacy-preserving localization aims to avoid such leakage of private details. However, the resulting localization approaches are significantly more complex, slower, and less accurate than their non-privacy-preserving counterparts. In this paper, we consider structureless localization methods in the context of privacy preservation. Structureless methods represent the scene through a set of reference images with known camera poses and intrinsics. In contrast to existing methods proposing representations that are as privacy-preserving as possible, we study a simple image obfuscation approach based on common image operations, e.g., replacing RGB images with (semantic) segmentations. We show that existing structureless pipelines do not need any special adjustments, as modern feature matchers can match obfuscated images out of the box. The results are easy-to-implement pipelines that can ensure both the privacy of the query images and the scene representations. Detailed experiments on multiple datasets show that the resulting methods achieve state-of-the-art pose accuracy for privacy-preserving approaches.

Chinese Translation

视觉定位任务旨在估计图像相对于场景表示的相机位姿。在实际应用中，视觉定位系统通常基于云端服务。自然地，这引发了隐私方面的担忧，涉及通过发送至服务器的图像或存储在服务器上的表示泄露私人信息。隐私保护定位旨在避免此类私人信息泄露。然而，现有的隐私保护定位方法通常比非隐私保护方法更复杂、速度更慢且精度较低。本文在隐私保护背景下研究无结构定位方法。无结构方法通过一组具有已知相机位姿和内参的参考图像来表示场景。与现有提出尽可能隐私保护的表示方法不同，我们研究了一种基于常见图像操作的简单图像混淆方法，例如用（语义）分割图替换RGB图像。我们证明现有的无结构流程无需特殊调整，现代特征匹配器能够直接匹配混淆后的图像。该方法易于实现，能够同时保障查询图像和场景表示的隐私。多数据集的详细实验表明，所提方法在隐私保护定位中实现了最先进的位姿精度。

cs.CV / 14 / 2604.12075

OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA

OpenTME：来自TCGA的AI驱动的H&E肿瘤微环境特征开放数据集

Galama, Maaike, Kozar-Gillan, Nina, Embacher, Christina, Dembo, Todd, Böhm, Cornelius, Ramberger, Evelyn, Ribbat-Idel, Julika, Krupar, Rosemarie, Aumiller, Verena, Hägele, Miriam, Standvoss, Kai, Erdmann, Gerrit, Pablos, Blanca, Angelo, Ari, Schallenberg, Simon, Norgan, Andrew, Matyas, Viktor, Müller, Klaus-Robert, Alber, Maximilian, Ruff, Lukas, Klauschen, Frederick

Abstract

The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

Chinese Translation

肿瘤微环境（TME）在癌症进展、治疗反应和患者结果中发挥着核心作用，但从常规的苏木精-伊红（H&E）染色组织病理学中进行大规模、一致且定量的TME特征描述仍然稀缺。我们介绍了OpenTME，这是一个开放获取的数据集，包含来自癌症基因组图谱（TCGA）五种癌症类型（膀胱癌、乳腺癌、结直肠癌、肝癌和肺癌）的3,634张H&E染色全切片图像衍生的预计算TME特征。所有输出均使用Atlas H&E-TME生成，这是一种基于Atlas病理基础模型家族构建的AI驱动应用，能够执行组织质量控制、组织分割、细胞检测与分类以及空间邻域分析，为每张切片提供超过4,500个细胞级分辨率的定量读数。OpenTME可在Hugging Face上用于非商业学术研究。我们将持续扩展OpenTME，并预计它将作为生物标志物发现、空间生物学研究和TME分析计算方法开发的资源。

cs.CV / 15 / 2604.12084

INST-Align: Implicit Neural Alignment for Spatial Transcriptomics via Canonical Expression Fields

INST-Align：通过典型表达场进行空间转录组学的隐式神经对齐

Han, Bonian, Qi, Cong, Musialski, Przemyslaw, Wei, Zhi

Abstract

Spatial transcriptomics (ST) measures mRNA expression while preserving spatial organization, but multi-slice analysis faces two coupled difficulties: large non-rigid deformations across slices and inter-slice batch effects when alignment and integration are treated independently. We present INST-Align, an unsupervised pairwise framework that couples a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural representation mapping spatial coordinates to expression embeddings, for joint alignment and reconstruction. A two-phase training strategy first establishes a stable canonical embedding space and then jointly optimizes deformation and spatial-feature matching, enabling mutually constrained alignment and representation learning. Cross-slice parameter sharing of the canonical field regularizes ambiguous correspondences and absorbs batch variation. Across nine datasets, INST-Align achieves state-of-the-art mean OT Accuracy (0.702), NN Accuracy (0.719), and Chamfer distance, with Chamfer reductions of up to 94.9% on large-deformation sections relative to the strongest baseline. The framework also yields biologically meaningful spatial embeddings and coherent 3D tissue reconstruction. The code will be released after the review phase.

Chinese Translation

空间转录组学（ST）在保持空间组织的同时测量mRNA表达，但多切片分析面临两个相互关联的困难：切片间的大规模非刚性变形以及在独立处理对齐和整合时出现的切片间批次效应。我们提出了INST-Align，这是一种无监督的成对框架，将基于坐标的变形网络与共享的典型表达场（Canonical Expression Field）相结合，后者是一种隐式神经表示，将空间坐标映射到表达嵌入，用于联合对齐和重建。该框架采用两阶段训练策略，首先建立一个稳定的典型嵌入空间，然后共同优化变形和空间特征匹配，实现相互约束的对齐和表示学习。典型场的跨切片参数共享规范化了模糊的对应关系，并吸收了批次变异。在九个数据集上，INST-Align实现了最先进的平均OT准确率（0.702）、最近邻准确率（0.719）和Chamfer距离，相较于最强基线在大变形部分的Chamfer距离减少高达94.9%。该框架还生成了生物学上有意义的空间嵌入和一致的三维组织重建。代码将在审稿阶段后发布。

cs.CV / 16 / 2604.12100

PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning

PC-MIL：在全幻灯片学习中将特征分辨率与监督尺度解耦

Ahmed, Syed Fahim, Rasineni, Gnanesh, Koehler, Florian, Aziz, Abu Zahid Bin, Wang, Mei, Gyulassy, Attila, Summa, Brian, Brown, J. Quincy, Pascucci, Valerio, Elhabian, Shireen Y.

Abstract

Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether "somewhere in the slide there is cancer." As a result, the model's inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.

Chinese Translation

计算病理学中的全幻灯片图像（WSI）分类通常被表述为滑动级别的多实例学习（MIL），使用单一的全局袋表示。然而，滑动级别的MIL在本质上是欠约束的：仅优化全局标签鼓励模型聚合特征，而未能学习解剖学上有意义的定位。这导致了监督尺度与临床推理尺度之间的不匹配。临床医生在毫米级区域内评估肿瘤负担、局灶性病变和结构模式，而标准的MIL仅训练以预测“幻灯片的某处是否存在癌症”。因此，模型的归纳偏差有效地抹去了解剖结构。我们提出了渐进上下文MIL（PC-MIL），这是一个将监督的空间范围视为首要设计维度的框架。我们不通过改变放大倍数、补丁大小或引入像素级分割来实现，而是将特征分辨率与监督尺度解耦。使用固定的20倍特征，我们在毫米单位中变化MIL袋的范围，并将监督锚定在临床上合理的2毫米尺度，以保持可比的肿瘤负担，并避免将尺度与病变密度混淆。PC-MIL逐步以受控比例混合滑动级和区域级监督，使得明确的训练上下文与测试上下文分析成为可能。在来自五个公共数据集的1,476个前列腺WSI中进行二元癌症检测时，我们展示了解剖上下文是MIL中一个独立的泛化轴，与特征分辨率正交：适度的区域监督提高了跨上下文的性能，而均衡的多上下文训练在不牺牲全局性能的情况下稳定了滑动和区域评估的准确性。这些结果表明，监督范围塑造了MIL的归纳偏差，并支持基于解剖学的WSI泛化。

cs.CV / 17 / 2604.12113

PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

PR-MaGIC：通过掩码解码器梯度流进行上下文内分割的提示优化

Lee, Minjae, Hur, Sungwoo, Hwang, Soojin, Kim, Won Hwa

Abstract

Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM's mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.

Chinese Translation

视觉基础模型（Visual Foundation Models, VFMs），如Segment Anything Model（SAM），极大推动了图像分割的广泛应用。然而，SAM及其变体在提示生成上需要大量人工干预，并且针对特定应用还需额外训练。近期方法通过将SAM整合进上下文内（单/少样本）分割，实现了通过查询图像与支持图像之间的语义对齐进行自动提示，缓解了上述限制。尽管如此，由于支持图像与查询图像之间存在视觉不一致，这些方法仍然生成次优提示，导致分割质量下降。为解决该问题，我们提出了PR-MaGIC（Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation），一种无需训练的测试时框架，通过SAM掩码解码器的梯度流来优化提示。PR-MaGIC能够无缝集成于上下文内分割框架，理论上有坚实基础，且通过简单的top-1选择策略实现了实际的稳定性，确保了跨样本的鲁棒性能。大量评测表明，PR-MaGIC在多个基准测试中持续提升分割质量，有效缓解了提示不足的问题，且无需额外训练或架构改动。

cs.CV / 18 / 2604.12115

HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

HTDC：基于犹豫触发的差分校准方法用于缓解大型视觉-语言模型中的幻觉问题

Liu, Xinyun

Abstract

Large vision-language models (LVLMs) achieve strong multimodal performance, but still suffer from hallucinations caused by unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically apply calibration at every decoding step, introducing unnecessary computation and potentially disrupting stable predictions. We address this problem by identifying layer-wise hesitation, a simple signal of grounding instability reflected by fluctuations in token preference across intermediate layers. Based on this observation, we propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that preserves standard full-branch inference and activates calibration only at hesitation-prone steps. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates while avoiding unnecessary intervention on stable steps. Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.
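The abstract does not spell out how layer-wise hesitation is scored. As a minimal sketch, assume it is the number of times the top-1 token changes between consecutive intermediate layers, and that calibration fires when this count exceeds a threshold. The names `hesitation_score`, `should_calibrate`, and `max_switches` are hypothetical and not taken from the paper.

```python
def top1(logits):
    """Index of the largest logit (the first index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def hesitation_score(layer_logits):
    """Count how often the preferred token changes between consecutive layers.

    layer_logits: one list of vocabulary logits per intermediate layer
    (assumed here to come from projecting each layer's hidden state
    onto the vocabulary).
    """
    preferred = [top1(logits) for logits in layer_logits]
    return sum(1 for a, b in zip(preferred, preferred[1:]) if a != b)


def should_calibrate(layer_logits, max_switches=1):
    """Hypothetical trigger: calibrate only when the preference flips too often."""
    return hesitation_score(layer_logits) > max_switches


# Toy 3-token vocabulary: one decoding step that is stable, one that wavers.
stable = [[2.0, 0.1, 0.3], [2.5, 0.2, 0.1], [3.0, 0.0, 0.2]]
wavering = [[2.0, 0.1, 0.3], [0.2, 2.5, 0.1], [3.0, 0.0, 0.2], [0.1, 0.2, 2.9]]

print(hesitation_score(stable), should_calibrate(stable))      # 0 False
print(hesitation_score(wavering), should_calibrate(wavering))  # 3 True
```

Under this assumption, stable steps cost only the argmax comparison, and the two probe branches run only on steps like `wavering`.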

Chinese Translation

大型视觉-语言模型（LVLMs）在多模态任务中表现出色，但仍然存在由于视觉定位不稳定和对语言先验过度依赖而引发的幻觉问题。现有的无训练解码方法通常在每个解码步骤应用校准，导致不必要的计算开销，并可能干扰稳定的预测结果。针对这一问题，我们通过识别层间犹豫现象——一种通过中间层中标记偏好波动反映的定位不稳定信号，提出了基于此观察的犹豫触发差分校准（HTDC）方法。HTDC是一种无训练解码框架，保持标准的全分支推理，仅在易发生犹豫的步骤激活校准。激活时，HTDC通过对比全分支与两个轻量级探针——视觉消除探针和语义消除探针，抑制易产生幻觉的候选项，同时避免对稳定步骤的不必要干预。在代表性幻觉基准测试中，HTDC持续减少幻觉现象，同时保持较强的任务准确率，实现了效果与计算开销之间的良好平衡。

cs.CV / 19 / 2604.12119

Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

超越感知错误：大型视觉-语言模型中的语义固化

Alam, Md Tanvirul

Abstract

Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.

Chinese Translation

大型视觉-语言模型（VLMs）通常依赖于熟悉的语义先验，但现有评估并未清晰地区分感知失败与规则映射失败。我们将这种行为研究为语义固化：即使提示指定了另一种同样有效的映射，模型仍然保持默认解释。为了隔离这一效应，我们引入了VLM-Fix，这是一个针对四个抽象策略游戏的受控基准，评估在成对的标准和逆规则表述下相同的终局棋盘状态。在14个开放和封闭的VLMs中，准确率始终偏向标准规则，揭示了一个稳健的语义固化差距。提示干预支持这一机制：中立别名提示显著缩小了逆规则差距，而语义负载的别名则重新打开了这一差距。后训练过程与规则强对齐：在一个规则上训练提高了同规则的迁移能力，但损害了对立规则的迁移能力，而联合规则训练则改善了更广泛的迁移能力。为了测试超越合成游戏的外部有效性，我们在VLMBias上评估了类似的去熟悉化干预，并观察到了相同的定性模式。最后，晚层激活引导部分恢复了性能下降，表明语义固化错误在一定程度上可以在晚期表示中进行编辑。项目页面、代码和数据集可在 https://maveryn.github.io/vlm-fix/ 获取。

cs.CV / 20 / 2604.12148

ViLL-E: Video LLM Embeddings for Retrieval

ViLL-E：用于检索的视频大型语言模型嵌入

Gupta, Rohit, Unnikrishnan, Jayakrishnan, Fei, Fan, Liu, Sheng, Tran, Son, Shah, Mubarak

Abstract

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).

Chinese Translation

视频大型语言模型（VideoLLMs）在视频理解任务中表现优异，尤其是在输出为文本的任务中，如视频问答和视频字幕生成。然而，在检索任务中，如文本到视频检索和时刻检索，它们的表现不及专门的基于嵌入的模型。我们提出了ViLL-E（Video-LLM-Embed），一种统一的视频LLM架构，采用了一种新颖的嵌入生成机制，使模型能够对复杂视频进行“更长时间的思考”，而对简单视频则能提前停止。我们采用三阶段的训练方法对该模型进行训练，结合了生成学习和对比学习：首先进行大规模的预训练，使用视频-字幕对；接着在一个较小的详细字幕数据集上进行持续训练；最后在一个新的多任务数据集上进行任务特定的微调，该数据集涵盖视频问答、时间定位、视频检索和视频-文本匹配。我们的模型在时间定位上显著提高（平均比其他VideoLLMs高出7%），在视频检索上也有显著提升（比双编码模型高出4%），其性能可与最先进的专门嵌入模型相媲美，同时在视频问答任务中仍具竞争力。此外，我们的联合对比-生成训练解锁了新的零-shot能力，在组合视频检索中显著超越了最先进的方法（比SotA高出5%），在长文本检索中也超越了最先进的方法（比SotA高出2%）。

cs.CV / 21 / 2604.12152

Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

领域特定的潜在表示提高了基于扩散的医学图像超分辨率的保真度

Cajas, Sebastian, Judith, Ashaba, Gorijavolu, Rahul, Kapadia, Sahil, Kasimbazi, Hillary Clinton, Kinyera, Leo, Kwesiga, Emmanuel Paul, Manthena, Sri Sri Jaithra Varma, Nakayama, Luis Filipe, Doreen, Ninsiima, Celi, Leo Anthony

Abstract

Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus 0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.
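To put the reported gains in perspective, the sketch below uses the standard PSNR definition, 10 * log10(MAX^2 / MSE), to convert a PSNR gain into the implied reduction in mean squared error. Under that definition, +2.91 to +3.29 dB corresponds to cutting MSE by a factor of roughly 1.95 to 2.13, that is, about halving it. The helper names are illustrative, and the paper's exact evaluation protocol (intensity range, masking) is not given in the abstract.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_reduction_factor(psnr_gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same max_value)."""
    return 10.0 ** (psnr_gain_db / 10.0)


# Every pixel off by 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
print(psnr([0.0, 1.0], [0.1, 0.9]))

# PSNR gains reported in the abstract, converted to MSE reduction factors.
for gain in (2.91, 3.29):
    print(gain, mse_reduction_factor(gain))
```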

Chinese Translation

用于医学图像超分辨率的潜在扩散模型普遍继承了为自然照片设计的变分自编码器。我们展示了这一默认选择，而非扩散架构，是重建质量的主要限制因素。在一个控制实验中，固定所有其他管道组件，将通用的Stable Diffusion VAE替换为MedVAE——一个在超过160万张医学图像上预训练的领域特定自编码器，导致膝关节MRI、脑MRI和胸部X光（n = 1,820；Cohen's d = 1.37至1.86，所有p < 10^{-20}，Wilcoxon符号秩检验）中PSNR提高了+2.91至+3.29 dB。小波分解将这一优势局限于编码解剖相关细微结构的最细空间频率带。对推理调度、预测目标和生成架构的消融实验确认了这一差距在±0.15 dB范围内稳定，而不同方法之间的幻觉率保持相似（所有数据集的Cohen's h < 0.02），表明重建保真度和生成幻觉由独立的管道组件控制。这些结果提供了一个实用的筛选标准：自编码器重建质量（无需扩散训练即可测量）可以预测下游超分辨率性能（R^2 = 0.67），这表明领域特定的VAE选择应优先于扩散架构搜索。代码和训练模型权重可在https://github.com/sebasmos/latent-sr公开获取。

cs.CV / 22 / 2604.12159

VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

VidTAG：全球范围内通过去噪序列预测进行时间对齐的视频与GPS地理定位

Kulkarni, Parth Parag, Gupta, Rohit, Chhipa, Prakash Chandra, Shah, Mubarak

Abstract

The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale because they require extensive image galleries that are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: https://parthpk.github.io/vidtag_webpage/
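Threshold metrics such as "accuracy at 1 km" are commonly computed from the great-circle distance between predicted and ground-truth coordinates. The sketch below assumes that convention, using the haversine formula and a mean Earth radius of 6371 km. The abstract does not state how the benchmark measures distance, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(predicted, truth, threshold_km=1.0):
    """Fraction of frames whose predicted GPS lies within threshold_km of the truth."""
    hits = sum(1 for p, t in zip(predicted, truth) if haversine_km(*p, *t) <= threshold_km)
    return hits / len(truth)


# Three frames: an exact hit, a miss of about 0.56 km, and a miss of over 1,000 km.
predicted = [(0.0, 0.0), (0.0, 0.005), (10.0, 10.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(accuracy_within(predicted, truth, threshold_km=1.0))  # 2 of 3 frames are within 1 km
```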

Chinese Translation

视频地理定位的任务旨在确定视频来源的精确GPS坐标并绘制其轨迹；该任务在法医学、社交媒体和探索等领域具有广泛应用。现有的基于分类的方法在城市级别的粗粒度上运作，无法捕捉细粒度的细节，而图像检索方法由于需要庞大的图像库而在全球范围内变得不切实际，这种图像库的构建是不可行的。相比之下，构建GPS坐标库则相对简单且成本低廉。我们提出了VidTAG，一个双编码器框架，通过自监督和语言对齐特征执行帧到GPS的检索。为了解决视频预测中的时间不一致性，我们引入了TempGeo模块，该模块对齐帧嵌入，以及GeoRefiner模块，一个编码器-解码器架构，利用对齐的帧嵌入来精炼GPS特征。在Mapillary (MSLS) 和GAMa数据集上的评估表明，我们的模型能够生成时间一致的轨迹，并超越基线，在1公里阈值上比GeoCLIP提高了20%。我们在全球粗粒度视频地理定位（CityGuessr68k）上也超越了当前的最先进技术25%。我们的方法使得细粒度视频地理定位成为可能，并为未来的研究奠定了坚实的基础。更多项目详情请访问项目网页：https://parthpk.github.io/vidtag_webpage/

cs.CV / 23 / 2604.12163

Nucleus-Image: Sparse MoE for Image Generation

Nucleus-Image：用于图像生成的稀疏专家混合模型（MoE）

Akiti, Chandan, Modukuri, Ajay, Nagarapu, Murali Nandan, Akiti, Gunavardhan, Liu, Haozhe

Abstract

We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
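For readers unfamiliar with expert-choice routing, the sketch below shows the generic idea: instead of each token picking its top experts, each expert picks its top-k tokens by router affinity, with k set by a capacity factor. It does not implement Nucleus-Image's decoupled, timestep-aware routing, which the abstract does not detail. The function name and the toy scores are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def expert_choice_route(scores, capacity_factor=1.0):
    """Generic expert-choice routing.

    scores[t][e] is the router logit of token t for expert e.
    Returns {expert: sorted token indices}; each expert keeps its top-k tokens,
    with k = capacity_factor * n_tokens / n_experts (at least 1).
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    k = max(1, int(capacity_factor * n_tokens / n_experts))
    affinity = [softmax(row) for row in scores]  # normalise over experts per token
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: affinity[t][e], reverse=True)
        assignment[e] = sorted(ranked[:k])
    return assignment


# Four tokens, two experts: with capacity factor 1 each expert takes two tokens.
scores = [[2.0, 0.0], [1.5, 0.0], [0.0, 1.0], [1.0, 0.9]]
print(expert_choice_route(scores))                       # {0: [0, 1], 1: [2, 3]}
print(expert_choice_route(scores, capacity_factor=2.0))  # every expert sees every token
```

Because experts choose tokens, the load per expert is fixed by construction, while an individual token may be processed by several experts or by none.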

Chinese Translation

我们提出了Nucleus-Image，一种文本到图像生成模型，在质量与效率的权衡上开辟了新的帕累托前沿，在GenEval、DPG-Bench和OneIG-Bench上匹配甚至超越了领先模型，同时每次前向传播仅激活约20亿参数。Nucleus-Image采用稀疏专家混合（MoE）扩散Transformer架构，配备Expert-Choice Routing，将总模型容量扩展至170亿参数，且每层路由64个专家。我们采用了优化推理效率的简化架构，完全排除了Transformer主干中的文本标记，并使用联合注意力机制实现跨时间步的文本键值共享。为提升使用时间步调制时的路由稳定性，我们引入了耦合解耦的路由设计，将时间步感知的专家分配与时间步条件的专家计算分离。通过多阶段过滤、去重、美学分级及标题整理，我们构建了包含15亿高质量训练对、涵盖7亿独特图像的大规模训练语料库。训练采用渐进分辨率课程（256至512再至1024），每阶段配合多宽高比分桶，并逐步稀疏化专家容量因子。我们采用Muon优化器，并共享了针对带时间步调制的扩散模型定制的参数分组方案。Nucleus-Image展示了稀疏MoE扩展是实现高质量图像生成的高效路径，以远低于推理成本的活跃参数预算，达到显著更大模型的性能水平。该成果未依赖任何后训练优化：无强化学习、无直接偏好优化、无人工偏好调优。我们公开了训练方案，使Nucleus-Image成为首个达到此质量水平的完全开源MoE扩散模型。

cs.CV / 24 / 2604.12175

Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

重新定义图像编辑评估的质量标准和距离感知评分建模

Zhang, Xinjie, Li, Qiang, Ma, Xiaowen, Niu, Axi, Yan, Li, Yan, Qingsen

Abstract

Recent advances in image editing have heightened the need for reliable Image Editing Quality Assessment (IEQA). Unlike traditional methods, IEQA requires complex reasoning over multimodal inputs and multi-dimensional assessments. Existing MLLM-based approaches often rely on human heuristic prompting, leading to two key limitations: rigid metric prompting and distance-agnostic score modeling. These issues hinder alignment with implicit human criteria and fail to capture the continuous structure of score spaces. To address this, we propose Define-and-Score Image Editing Quality Assessment (DS-IEQA), a unified framework that jointly learns evaluation criteria and score representations. Specifically, we introduce Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback. Furthermore, we propose Token-Decoupled Distance Regression Loss (TDRL), which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization. Extensive experiments show our method's superior performance; it ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.
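The contrast between distance-agnostic and distance-aware score modeling can be illustrated as follows. The sketch assumes the model emits logits over a small set of numeric score tokens and takes the "expected distance" to be the probability-weighted absolute difference to the ground-truth score. The abstract does not give TDRL's exact form, so the functions here are an illustration rather than the authors' loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(score_logits, score_values, target):
    """Distance-agnostic: only the probability of the exact target token matters."""
    probs = softmax(score_logits)
    return -math.log(probs[score_values.index(target)])


def expected_distance_loss(score_logits, score_values, target):
    """Distance-aware: probability mass far from the target is penalised more."""
    probs = softmax(score_logits)
    return sum(p * abs(v - target) for p, v in zip(probs, score_values))


values = [1, 2, 3, 4, 5]
target = 4
# Both predictions put 0.2 on the correct score; the rest sits near it or far from it.
near = [math.log(p) for p in (0.05, 0.05, 0.05, 0.20, 0.65)]
far = [math.log(p) for p in (0.65, 0.05, 0.05, 0.20, 0.05)]

print(cross_entropy(near, values, target), cross_entropy(far, values, target))
print(expected_distance_loss(near, values, target), expected_distance_loss(far, values, target))
```

Cross-entropy rates the two predictions identically, whereas the expected-distance term (about 0.95 versus 2.15) prefers the one whose errors land on neighbouring scores.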

Chinese Translation

近年来图像编辑技术的进步加大了对可靠的图像编辑质量评估（IEQA）的需求。与传统方法不同，IEQA需要对多模态输入和多维评估进行复杂推理。现有的基于多模态大语言模型（MLLM）的方法通常依赖于人类启发式提示，这导致了两个主要限制：僵化的度量提示和距离无关的评分建模。这些问题妨碍了与隐含人类标准的对齐，并未能捕捉评分空间的连续结构。为了解决这些问题，我们提出了定义与评分图像编辑质量评估（DS-IEQA），这是一个统一框架，能够共同学习评估标准和评分表示。具体而言，我们引入了反馈驱动的度量提示优化（FDMPO），通过概率反馈自动优化度量定义。此外，我们提出了令牌解耦距离回归损失（TDRL），该方法将数值令牌与语言建模解耦，通过期望距离最小化明确建模评分的连续性。大量实验表明我们的方法表现优越；在2026年NTIRE X-AIGC质量评估第二赛道中排名第四，且未使用任何额外的训练数据。

cs.CV / 25 / 2604.12219

Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

乘风破浪：用于平滑视频生成的精度分配稀疏注意力机制

Zhang, Wentai, Xi, Ronghui, Peng, Shiyao, Huang, Jiayu, Luo, Haoran, Tang, Zichen, E, Haihong

Abstract

Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.

Chinese Translation

视频扩散变换器（Video Diffusion Transformers）在高保真视频生成领域取得了革命性进展，但其自注意力机制带来了巨大的计算负担。尽管稀疏注意力提供了一种有前景的加速方案，现有方法常因静态稀疏模式和确定性块路由导致严重的视觉闪烁问题。为解决这些限制，我们提出了精度分配稀疏注意力（Precision-Allocated Sparse Attention，PASA），这是一种无需训练的框架，旨在实现高效且时间上平滑的视频生成。首先，我们实现了一个曲率感知的动态预算机制，通过分析生成轨迹在各时间步的加速度，弹性分配精确计算预算，确保在关键语义转换期间进行高精度处理。其次，我们用硬件对齐的分组近似替代了全局均质估计，成功捕捉了细粒度的局部变化，同时保持了峰值计算吞吐量。最后，我们在注意力路由机制中引入了随机选择偏差，这种概率方法软化了刚性的选择边界，消除了选择振荡，有效根除导致时间闪烁的局部计算饥饿现象。在领先的视频扩散模型上的大量评估表明，PASA在显著加速推理的同时，始终生成流畅且结构稳定的视频序列。

cs.CV / 26 / 2604.12221

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

BarbieGait：一种具有多样化换衣功能的身份一致性合成人体数据集，用于步态识别

Cai, Qingyuan, Hou, Saihui, Hu, Xuecai, Huang, Yongzhen

Abstract

Gait recognition, as a reliable biometric technology, has developed rapidly in recent years, but it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and poses one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research.

Chinese Translation

步态识别作为一种可靠的生物特征技术，近年来发展迅速，但在现实世界中，由于多样化的服装风格，面临着重大挑战。本文介绍了BarbieGait，一个合成步态数据集，其中现实世界的受试者被独特地映射到虚拟引擎中，以模拟广泛的换衣变化，同时保留他们的步态身份信息。作为一项开创性工作，BarbieGait提供了一种可控的步态数据生成方法，使得能够生成大规模数据集，以验证在现实数据中难以验证的跨服装问题。然而，服装的多样性增加了类内方差，成为在不同服装条件下学习服装不变特征的最大挑战之一。因此，我们提出了GaitCLIF（面向步态的服装不变特征）作为跨服装步态识别的稳健基线模型。通过广泛的实验，我们验证了我们的方法在BarbieGait和现有流行步态基准上显著提高了跨服装性能。我们相信，BarbieGait凭借其广泛的跨服装步态数据，将进一步提升跨服装场景下步态识别的能力，并推动相关研究的进展。

cs.CV / 27 / 2604.12239

Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

基于物理的单目车辆距离估计：利用标准化车牌排版

Reddy, Manognya Lokesh, Liu, Zheng

Abstract

Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing OCR text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work. Extensive outdoor experiments confirm a mean absolute error of 2.3% at 10 m and continuous distance output during brief plate occlusions, outperforming deep learning baselines by a factor of five in relative error.
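The ranging chain described above can be sketched with textbook components: a pinhole-camera range from a known character height, inverse-variance fusion of two range estimates, and a constant-velocity Kalman filter over distance and relative velocity. All numeric values (focal length, character height, noise levels) are placeholders chosen for illustration, and the class and function names are hypothetical. The abstract does not give the paper's actual parameters, and the online scale alignment step is omitted.

```python
def pinhole_distance_m(focal_px, real_height_m, pixel_height):
    """Pinhole model: range = focal length (px) * real height (m) / image height (px)."""
    return focal_px * real_height_m / pixel_height


def fuse_inverse_variance(estimates):
    """estimates: list of (value, variance). Returns the fused (value, variance)."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    value = sum(w * x for w, (x, _) in zip(weights, estimates)) / total
    return value, 1.0 / total


class ConstantVelocityKalman1D:
    """State is [distance, relative velocity]; only distance is measured."""

    def __init__(self, d0, var_d=1.0, var_v=10.0, accel_noise=1.0):
        self.d, self.v = d0, 0.0
        self.P = [[var_d, 0.0], [0.0, var_v]]
        self.q = accel_noise  # assumed white-acceleration noise variance

    def predict(self, dt):
        """Propagate the state; call this alone while the plate is occluded."""
        self.d += self.v * dt
        (p00, p01), (p10, p11) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + self.q * dt ** 4 / 4
        n01 = p01 + dt * p11 + self.q * dt ** 3 / 2
        n10 = p10 + dt * p11 + self.q * dt ** 3 / 2
        n11 = p11 + self.q * dt ** 2
        self.P = [[n00, n01], [n10, n11]]

    def update(self, z, var_z):
        """Correct the state with a distance measurement z of variance var_z."""
        (p00, p01), (p10, p11) = self.P
        s = p00 + var_z              # innovation variance for H = [1, 0]
        k0, k1 = p00 / s, p10 / s    # Kalman gain
        resid = z - self.d
        self.d += k0 * resid
        self.v += k1 * resid
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]


def time_to_collision_s(distance, velocity):
    """Time to collision in seconds; infinite when the gap is not closing."""
    return distance / -velocity if velocity < 0 else float("inf")


# Placeholder camera and character size: 1000 px focal length, 0.07 m glyph, 7 px tall.
print(pinhole_distance_m(1000.0, 0.07, 7.0))             # 10.0
print(fuse_inverse_variance([(10.0, 1.0), (12.0, 4.0)]))  # close to (10.4, 0.8)
```

Calling `predict` without `update` is how such a filter keeps producing a distance during a brief occlusion, at the cost of growing uncertainty in `P`.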

Chinese Translation

准确的车辆间距估计是高级驾驶辅助系统（ADAS）和自动驾驶的基石。尽管激光雷达（LiDAR）和雷达提供了高精度，但其高成本限制了在大众市场车辆中的广泛应用。基于单目相机的估计提供了一种低成本的替代方案，但存在根本的尺度模糊问题。最近的深度学习方法在单目深度估计方面取得了令人印象深刻的成果，但需要昂贵的监督训练，受到领域转移的影响，并且产生的预测难以保证在安全关键部署中的可靠性。本文提出了一种框架，利用美国车牌的标准化排版作为被动基准标记进行度量范围估计，通过明确的几何先验解决尺度模糊问题，而无需任何训练数据或主动照明。首先，采用四种方法的并行车牌检测器在全汽车照明范围内实现了稳健的车牌读取。其次，三阶段状态识别引擎融合了光学字符识别（OCR）文本匹配、多设计颜色评分和轻量级神经网络分类器，能够在所有环境条件下提供稳健的识别。第三，结合逆方差加权和在线尺度对齐的混合深度融合，以及一维恒速卡尔曼滤波器，提供平滑的距离、相对速度和碰撞预警的时间。基线验证重现了字符高度测量的变异系数为2.3%，与先前工作中的车牌宽度方法相比，距离估计方差减少了36%。广泛的户外实验确认在10米处的平均绝对误差为2.3%，并在短暂的车牌遮挡期间持续输出距离，相对误差比深度学习基线提高了五倍。

cs.CV / 28 / 2604.12251

ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models

ArtifactWorld：通过视频生成模型扩展3D高斯点云伪影修复

Wang, Xinliang, Shi, Yifeng, Wu, Zhenyu

Abstract

3D Gaussian Splatting (3DGS) delivers high-fidelity real-time rendering but suffers from geometric and photometric degradations under sparse-view constraints. Current generative restoration approaches are often limited by insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data, resulting in multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions. In this paper, we present ArtifactWorld, a framework that resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm. To address the data bottleneck, we establish a fine-grained phenomenological taxonomy of 3DGS artifacts and construct a comprehensive training set of 107.5K diverse paired video clips to enhance model robustness. Architecturally, we unify the restoration process within a video diffusion backbone, utilizing an isomorphic predictor to localize structural defects via an artifact heatmap. This heatmap then guides the restoration through an Artifact-Aware Triplet Fusion mechanism, enabling precise, intensity-guided spatio-temporal repair within native self-attention. Extensive experiments demonstrate that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction. Code and dataset will be made public.

Chinese Translation

3D高斯点云（3DGS）提供了高保真度的实时渲染，但在稀疏视图约束下容易出现几何和光度退化。目前的生成修复方法常常受到时间一致性不足、缺乏明确的空间约束以及缺乏大规模训练数据的限制，导致多视图不一致、错误的几何幻觉以及对多样化真实世界伪影分布的有限泛化能力。本文提出了ArtifactWorld，一个通过系统的数据扩展和均匀的双模型范式解决3DGS伪影修复的框架。为了解决数据瓶颈，我们建立了3DGS伪影的细粒度现象分类法，并构建了一个包含107.5K多样化配对视频片段的综合训练集，以增强模型的鲁棒性。在架构上，我们将修复过程统一在一个视频扩散主干中，利用同构预测器通过伪影热图定位结构缺陷。该热图随后通过伪影感知三元融合机制指导修复，实现了在原生自注意力下的精确、强度引导的时空修复。大量实验表明，ArtifactWorld在稀疏新视图合成和鲁棒的3D重建方面达到了最先进的性能。代码和数据集将公开发布。

cs.CV / 29 / 2604.12255

ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

ARGen：基于情感强化的生成增强方法用于视觉动态情感感知

Wang, Huanzhen, Zhou, Ziheng, Song, Jiaqi, He, Li, Lan, Yunshi, Wang, Yan, Zhang, Wenqiang

Abstract

Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.

Chinese Translation

在自然环境中，动态面部表情识别仍然面临挑战，主要由于数据稀缺和长尾分布，这阻碍了模型有效学习稀缺情感的时间动态。为了解决这些限制，我们提出了ARGen，一个情感强化生成增强框架，能够实现数据自适应的动态表情生成，以增强情感感知的鲁棒性。ARGen分为两个阶段：情感语义注入（Affective Semantic Injection, ASI）和自适应强化扩散（Adaptive Reinforcement Diffusion, ARD）。ASI阶段通过面部动作单元（Action Units）建立情感知识对齐，并采用检索增强的提示生成策略，通过大规模视觉-语言模型合成一致且细致的情感描述，从而将可解释的情感先验注入生成过程。ARD阶段将文本条件的图像到视频扩散与强化学习相结合，引入帧间条件引导和多目标奖励函数，以共同优化表情的自然性、面部完整性和生成效率。在生成和识别任务上的大量实验验证了ARGen显著提高了合成的保真度，并改善了识别性能，为基于视觉的情感计算建立了一个可解释且具有普适性的生成增强范式。

cs.CV / 30 / 2604.12257

Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement

基于风格解耦的自适应路由网络用于水下图像增强

Xu, Hang, Long, Chen, Wang, Bing, Chen, Hao, Dong, Zhen

Abstract

Underwater Image Enhancement (UIE) is essential for robust visual perception in marine applications. However, existing methods predominantly rely on uniform mapping tailored to average dataset distributions, leading to over-processing mildly degraded images or insufficient recovery for severe ones. To address this challenge, we propose a novel adaptive enhancement framework, SDAR-Net. Unlike existing uniform paradigms, it first decouples specific degradation styles from the input and subsequently modulates the enhancement process adaptively. Specifically, since underwater degradation primarily shifts the appearance while keeping the scene structure, SDAR-Net formulates image features into dynamic degradation style embeddings and static scene structural representations through a carefully designed training framework. Subsequently, we introduce an adaptive routing mechanism. By evaluating style features and adaptively predicting soft weights at different enhancement states, it guides the weighted fusion of the corresponding image representations, accurately satisfying the adaptive restoration demands of each image. Extensive experiments show that SDAR-Net achieves new state-of-the-art (SOTA) performance with a PSNR of 25.72 dB on a real-world benchmark, and demonstrates its utility in downstream vision tasks. Our code is available at https://github.com/WHU-USI3DV/SDAR-Net.

Chinese Translation

水下图像增强（Underwater Image Enhancement, UIE）对于海洋应用中的鲁棒视觉感知至关重要。然而，现有方法主要依赖于针对平均数据集分布的统一映射，导致对轻度退化图像过度处理或对严重退化图像恢复不足。为了解决这一挑战，我们提出了一种新颖的自适应增强框架——SDAR-Net。与现有的统一范式不同，SDAR-Net首先将输入中的特定退化风格解耦出来，随后自适应调节增强过程。具体而言，由于水下退化主要改变图像外观而保持场景结构，SDAR-Net通过精心设计的训练框架，将图像特征表征为动态的退化风格嵌入和静态的场景结构表示。随后，我们引入了一种自适应路由机制，通过评估风格特征并自适应预测不同增强状态下的软权重，引导对应图像表示的加权融合，准确满足每张图像的自适应恢复需求。大量实验表明，SDAR-Net在真实世界基准测试中实现了25.72 dB的峰值信噪比（PSNR），达到新的最先进水平（SOTA），并展示了其在下游视觉任务中的实用价值。我们的代码已开源，地址为：https://github.com/WHU-USI3DV/SDAR-Net。

cs.CV / 31 / 2604.12270

DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

DreamStereo：面向高清视频的实时立体图像修复

Huang, Yuan, Zhao, Sijie, Cheng, Jing, Xu, Hao, Jiao, Shaohui

Abstract

Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.

Chinese Translation

立体视频修复旨在用视觉上连贯的内容填充扭曲视频中的遮挡区域，同时保持时间一致性，这仍然是一个具有挑战性的开放问题。待填充的区域散布在物体边界上，并且仅占每帧的一小部分，这导致了两个关键挑战。首先，现有方法在此类任务上表现不佳，原因在于高质量立体图像修复数据集的稀缺，限制了它们学习有效修复先验的能力。其次，这些方法对帧的所有区域施加相同的处理，尽管大多数像素无需修改，从而导致大量冗余计算。为了解决这些问题，我们引入了三个相互关联的组件。我们首先提出了梯度感知视差扭曲（Gradient-Aware Parallax Warping, GAPW），该方法利用反向扭曲和坐标映射函数的梯度来获得连续的边缘和平滑的遮挡区域。然后，介绍了一种基于视差的双重投影（Parallax-Based Dual Projection, PBDP）策略，该策略结合GAPW生成几何一致的立体图像修复对和准确的遮挡掩码，而无需立体视频。最后，我们提出了稀疏感知立体图像修复（Sparsity-Aware Stereo Inpainting, SASI），该方法减少了超过70%的冗余标记，在扩散推理过程中实现了10.7倍的加速，并提供了与其全计算对应物相当的结果，使得在单个A100 GPU上以25 FPS实时处理高清（768 x 1280）视频成为可能。

cs.CV / 32 / 2604.12281

MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

MAST：基于掩码引导注意力质量分配的无训练多风格迁移方法

Kang, Dongkyung, Hwang, Jaeyeon, Park, Junseo, Kang, Minji, Lee, Yeryeong, Ko, Beomseok, Roh, Hanyoung, Shin, Jeongmin, Jang, Hyeryung

Abstract

Style transfer aims to render a content image with the visual characteristics of a reference style while preserving its underlying semantic layout and structural geometry. While recent diffusion-based models demonstrate strong stylization capabilities by leveraging powerful generative priors and controllable internal representations, they typically assume a single global style. Extending them to multi-style scenarios often leads to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations. To overcome these limitations, we propose MAST (Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer), a novel training-free framework that explicitly controls content-style interactions within the diffusion attention mechanism. To achieve artifact-free and structure-preserving stylization, MAST integrates four connected modules. First, Layout-preserving Query Anchoring prevents global layout collapse by firmly anchoring the semantic structure using content queries. Second, Logit-level Attention Mass Allocation deterministically distributes attention probability mass across spatial regions, seamlessly fusing multiple styles without boundary artifacts. Third, Sharpness-aware Temperature Scaling restores the attention sharpness degraded by multi-style expansion. Finally, Discrepancy-aware Detail Injection adaptively compensates for localized high-frequency detail losses by measuring structural discrepancies. Extensive experiments demonstrate that MAST effectively mitigates boundary artifacts and maintains structural consistency, preserving texture fidelity and spatial coherence even as the number of applied styles increases.
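One way to give each group of keys an exact share of attention is to shift the logits before the softmax. The sketch below subtracts each group's log-sum-exp and adds the log of its target mass, so the softmax probabilities inside every group sum to that target. This formula, the `allocate_mass` helper, and the group names are illustrative assumptions; the abstract does not state how MAST computes its allocation.

```python
import math


def logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def allocate_mass(logits, groups, target_mass):
    """Shift logits so that softmax assigns target_mass[g] in total to each group g.

    logits      : raw attention logits for one query, one per key
    groups      : group id for each key (e.g. which style a key belongs to)
    target_mass : dict mapping group id -> desired probability mass (sums to 1)
    """
    adjusted = []
    for logit, group in zip(logits, groups):
        members = [l for l, g in zip(logits, groups) if g == group]
        adjusted.append(logit - logsumexp(members) + math.log(target_mass[group]))
    return adjusted


def softmax(logits):
    norm = logsumexp(logits)
    return [math.exp(l - norm) for l in logits]


logits = [2.0, 0.5, -1.0, 3.0, 0.0]
groups = ["style_a", "style_a", "style_b", "style_b", "style_b"]
probs = softmax(allocate_mass(logits, groups, {"style_a": 0.3, "style_b": 0.7}))
mass_a = sum(p for p, g in zip(probs, groups) if g == "style_a")
print(round(mass_a, 6))
```

Because the shift is constant within a group, the relative ranking of keys inside each style is preserved; only the split between styles is fixed.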

Chinese Translation

风格迁移旨在将内容图像渲染成具有参考风格视觉特征的图像，同时保持其底层语义布局和结构几何。尽管近期基于扩散模型的方法通过利用强大的生成先验和可控的内部表示展现了强大的风格化能力，但它们通常假设单一全局风格。将其扩展到多风格场景时，往往会因多种风格表示间的干扰导致边界伪影、不稳定的风格化效果及结构不一致。为克服这些限制，我们提出了MAST（Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer），一种新颖的无训练框架，能够在扩散注意力机制中显式控制内容与风格的交互。为实现无伪影且结构保留的风格化，MAST集成了四个相互关联的模块。首先，布局保留查询锚定（Layout-preserving Query Anchoring）通过内容查询牢固锚定语义结构，防止全局布局崩溃。其次，Logit级注意力质量分配（Logit-level Attention Mass Allocation）确定性地在空间区域间分配注意力概率质量，实现多风格的无缝融合且无边界伪影。第三，锐度感知温度缩放（Sharpness-aware Temperature Scaling）恢复因多风格扩展而退化的注意力锐度。最后，差异感知细节注入（Discrepancy-aware Detail Injection）通过测量结构差异，自适应补偿局部高频细节损失。大量实验表明，MAST有效缓解了边界伪影，保持结构一致性，即使应用风格数量增加，也能保持纹理真实性和空间连贯性。

cs.CV / 33 / 2604.12286

LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion

LiveMoments：通过参考引导扩散对实时照片中的重新选择关键照片进行修复

Xue, Clara, Yan, Zizheng, Shi, Zhenning, Yu, Yuhang, Zhuang, Jingyu, Zhang, Qi, Chen, Jinwei, Fan, Qingnan

Abstract

A Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment. While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures. Our code is available at https://github.com/OpenVeraTeam/LiveMoments.

Chinese Translation

实时照片捕捉了一张高质量的关键照片和一段短视频，以保留捕捉时刻周围的珍贵动态。虽然用户可以选择其他帧作为关键照片，以捕捉更好的表情或时机，但这些帧往往表现出明显的质量下降，因为照片捕捉的图像信号处理（ISP）流程提供的图像质量显著高于视频流程。这一质量差距突显了专门修复技术的必要性，以增强重新选择的关键照片。为此，我们提出了LiveMoments，一个针对实时照片中重新选择关键照片的参考引导图像修复框架。我们的方法采用了一个双分支神经网络：参考分支从原始高质量关键照片中提取结构和纹理信息，主分支则利用参考分支提供的指导来修复重新选择的帧。此外，我们引入了一个统一的运动对齐模块，该模块在潜在和图像层面上结合了运动指导以进行空间对齐。在真实和合成的实时照片上的实验表明，LiveMoments在感知质量和保真度方面显著优于现有解决方案，特别是在快速运动或复杂结构的场景中。我们的代码可在 https://github.com/OpenVeraTeam/LiveMoments 获取。

cs.CV / 34 / 2604.12307

Boosting Robust AIGI Detection with LoRA-based Pairwise Training

基于LoRA的成对训练提升鲁棒性AI生成图像检测

Xia, Ruiyang, Zhang, Qi, Xu, Yaowen, Zou, Zhaofan, Sun, Hao, He, Zhongjiang, Li, Xuelong

Abstract

The proliferation of highly realistic AI-Generated Images (AIGI) has necessitated the development of practical detection methods. While current AIGI detectors perform admirably on clean datasets, their detection performance frequently decreases when deployed "in the wild", where images are subjected to unpredictable, complex distortions. To resolve this critical vulnerability, we propose a novel LoRA-based Pairwise Training (LPT) strategy designed specifically to achieve robust detection of AIGI under severe distortions. The core of our strategy involves the targeted finetuning of a visual foundation model, the deliberate simulation of data distribution during the training phase, and a unique pairwise training process. Specifically, we introduce distortion and size simulations to better fit the distribution of the validation and test sets. Based on the strong visual representation capability of the visual foundation model, we finetune the model to achieve AIGI detection. The pairwise training is utilized to improve the detection by decoupling the generalization and robustness optimization. Experiments show that our approach secured 3rd place in the NTIRE Robust AI-Generated Image Detection in the Wild challenge.

Chinese Translation

高度逼真的AI生成图像（AIGI）的广泛传播催生了实用检测方法的发展。尽管现有的AIGI检测器在干净数据集上表现优异，但在“野外”环境中部署时，由于图像受到不可预测且复杂的失真，其检测性能常常下降。为解决这一关键脆弱性，我们提出了一种新颖的基于LoRA的成对训练（LPT）策略，专门设计用于在严重失真条件下实现鲁棒的AIGI检测。该策略的核心包括对视觉基础模型的有针对性微调、训练阶段对数据分布的刻意模拟以及独特的成对训练过程。具体而言，我们引入了失真和尺寸模拟，以更好地匹配验证集和测试集的分布。基于视觉基础模型强大的视觉表征能力，我们对模型进行微调以实现AIGI检测。成对训练则通过解耦泛化能力和鲁棒性优化来提升检测效果。实验结果表明，我们的方法在NTIRE野外鲁棒AI生成图像检测挑战赛中获得了第三名。

cs.CV / 35 / 2604.12309

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

基于3D基础先验的真实且一致的轨道视频生成方法

Wang, Rong, Zha, Ruyi, Cheng, Ziang, Yang, Jiayu, Purkait, Pulak, Li, Hongdong

Abstract

We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.

Chinese Translation

我们提出了一种从单张物体图像生成几何真实且一致的轨道视频的新方法。现有的视频生成工作大多依赖像素级注意力机制以确保帧间视角一致性。然而，该机制对长距离外推（如后视合成）约束不足，此时输入图像的像素对应关系有限。因此，这些方法常常无法生成结构合理且连贯的结果。为解决该问题，我们提出利用来自3D基础生成模型的丰富形状先验作为辅助约束，基于其从大型3D资产库中学习到的真实物体形状分布建模能力。具体而言，我们通过3D基础模型编码的两种尺度的潜在特征来引导视频生成：（i）去噪的全局潜在向量作为整体结构指导；（ii）从体积特征投影得到的一组潜在图像，用以提供视角相关的细粒度几何细节。与常用的2.5D表示（如深度图或法线图）相比，这些紧凑特征能够建模完整物体形状，并通过避免显式网格提取提升推理效率。为实现有效的形状条件控制，我们引入了多尺度3D适配器，通过交叉注意力将特征token注入基础视频模型，既保留了其通用视频预训练能力，又实现了简单且模型无关的微调过程。在多个基准测试上的大量实验表明，我们的方法在视觉质量、形状真实感及多视角一致性方面均优于最先进方法，且能稳健地泛化至复杂相机轨迹及野外图像。

cs.CV / 36 / 2604.12315

GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

GTPBD-MM：一个具有多模态的全球梯田地块与边界数据集

Zhang, Zhiwei, Zeng, Xingyuan, Kong, Xinkai, Zhang, Kunquan, Liang, Haoyuan, Shi, Bohan, Zheng, Juepeng, Huang, Jianxi, Lu, Yutong, Fu, Haohuan

Abstract

Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.

Chinese Translation

农业地块提取在基于遥感的农业监测中发挥着重要作用，支持地块调查、精准管理和生态评估。然而，现有的公共基准主要集中于规则且相对平坦的农田场景。相比之下，山区的梯田地块表现出阶梯状地形、显著的高程变化、不规则的边界以及强烈的跨区域异质性，使得地块提取成为一个更具挑战性的问题，需共同依赖视觉识别、语义区分和地形感知的几何理解。尽管近期研究已推动视觉地块基准和图像-文本农田理解的发展，但在对齐的图像-文本-数字高程模型（DEM）设置下，仍缺乏一个统一的复杂梯田地块提取基准。为填补这一空白，我们提出了GTPBD-MM，这是第一个用于全球梯田地块提取的多模态基准。GTPBD-MM基于GTPBD构建，整合了高分辨率光学影像、结构化文本描述和DEM数据，并支持在仅图像、图像+文本和图像+文本+DEM设置下的系统评估。我们进一步提出了高程与文本引导的梯田地块网络（ETTerra），作为梯田地块划分的多模态基线。大量实验表明，文本语义和地形几何提供了超越视觉外观的互补线索，在复杂的梯田场景中产生了更准确、一致和结构上连贯的划分结果。

cs.CV / 37 / 2604.12318

Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge

基于多任务图像到图像Schrödinger桥的细胞实例分割

Inoue, Hayato, Harada, Shota, Takezaki, Shumpei, Bise, Ryoma

Abstract

Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.

Chinese Translation

现有的细胞实例分割流程通常结合确定性预测与后处理，这对实例掩膜的全局结构施加的显式约束有限。在本工作中，我们提出了一种多任务图像到图像Schrödinger桥框架，将实例分割表述为基于分布的图像到图像生成问题。通过反向距离图整合边界感知监督，并采用确定性推断以产生稳定的预测。PanNuke数据集上的实验结果表明，该方法在不依赖SAM预训练或额外后处理的情况下，实现了具有竞争力或优越的性能。MoNuSeg数据集上的额外结果显示了在有限训练数据下的鲁棒性。这些发现表明，基于Schrödinger桥的图像到图像生成为细胞实例分割提供了有效的框架。

cs.CV / 38 / 2604.12319

RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

RSGMamba：面向多模态语义分割的可靠性感知自门控状态空间模型

Xu, Guoan, Xiao, Yang, Gao, Guangwei, Zhu, Dongchen, Jia, Wenjing, Qi, Guo-Jun

Abstract

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.

Chinese Translation

多模态语义分割作为一种利用多种传感模态（如RGB、深度和热成像）互补信息以增强场景理解的强大范式，已逐渐兴起。然而，现有的跨模态融合方法通常隐含假设所有模态的可靠性相同，当辅助模态存在噪声、错位或不完整时，容易导致特征退化。本文从模态可靠性的视角重新审视跨模态融合，提出了一种新颖框架——可靠性感知自门控状态空间模型（Reliability-aware Self-Gated State Space Model，RSGMamba）。该方法的核心是可靠性感知自门控Mamba模块（Reliability-aware Self-Gated Mamba Block，RSGMB），其通过自门控机制显式建模模态可靠性，动态调控跨模态交互。不同于传统无差别交换信息的融合策略，RSGMB实现了可靠性感知的特征选择与信息增强的特征聚合。此外，本文引入轻量级局部交叉门控调制（Local Cross-Gated Modulation，LCGM）以细化空间细节，补充RSGMB的全局建模能力。大量实验表明，RSGMamba在RGB-D和RGB-T语义分割基准上均取得了最先进的性能，在NYUDepth V2和SUN-RGBD数据集上分别达到58.8% / 54.0%的mIoU（较此前最佳提升0.4% / 0.7%），在MFNet和PST900数据集上分别达到61.1% / 88.9%的mIoU（最高提升1.6%），且模型参数仅为4860万，验证了所提方法的有效性和优越性。

cs.CV / 39 / 2604.12320

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

EgoEsportsQA：面向电竞感知与推理的第一人称视频基准

Ma, Jianzhe, Cao, Zhonghao, Chen, Shangkui, Xu, Yichen, Wang, Wenxuan, Jin, Qin

Abstract

While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

Chinese Translation

尽管视频大语言模型（Video-LLMs）在理解节奏缓慢的现实世界第一人称视频方面表现出色，但其在高速、信息密集的虚拟环境中的能力尚未得到充分探索。现有基准多聚焦于日常活动，缺乏针对虚拟场景中快速且规则约束推理的严谨测试平台。为填补这一空白，我们提出了EgoEsportsQA，这是首个基于专家电竞知识的视觉问答（QA）视频基准。我们通过可扩展的六阶段流程，从三款第一人称射击游戏的职业比赛中策划了1745对高质量问答对。问题结构采用二维解耦分类法：认知能力维度包含11个子任务（涵盖感知与推理层面），电竞知识维度包含6个子任务。对最先进的Video-LLMs进行全面评估表明，当前模型仍未达到令人满意的性能，最佳模型准确率仅为71.58%。结果揭示了两个维度上的显著差距：模型在基础视觉感知方面能力较强，而在深度战术推理方面表现较弱；模型对整体宏观进程的把握优于细粒度微操作。大量消融实验展示了当前Video-LLM架构的内在弱点。进一步分析表明，我们的数据集不仅揭示了现实与虚拟第一人称领域之间的联系，还为优化下游电竞应用提供了指导，促进Video-LLMs在各类第一人称环境中的未来发展。

cs.CV / 40 / 2604.12322

Self-Adversarial One Step Generation via Condition Shifting

通过条件转移实现自对抗一步生成

Liu, Deyuan, Sun, Peng, Han, Yansen, Cheng, Zhenglin, Chen, Chuyan, Lin, Tao

Abstract

The push for efficient text-to-image synthesis has moved the field toward one-step sampling, yet existing methods still face a three-way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one-step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter-efficient tuning. In contrast, regression-based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. A transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the model's current generation distribution, yielding a gradient that is provably GAN-aligned and replacing the sample-dependent discriminator terms that cause gradient vanishing. This discriminator-free design is architecture-preserving, making APEX a plug-and-play framework compatible with both full-parameter and LoRA-based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20x more parameters) in one-step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33x inference speedup. Code is available at https://github.com/LINs-lab/APEX.

Chinese Translation

高效的文本到图像合成的推动使得该领域朝着一步采样的方向发展，然而现有方法仍面临着保真度、推理速度和训练效率之间的三重权衡。依赖外部判别器的方法可以提升一步性能，但它们往往引入训练不稳定性、高GPU内存开销和缓慢的收敛速度，这使得扩展和参数高效调优变得复杂。相比之下，基于回归的蒸馏和一致性目标更容易优化，但在限制为单步时通常会丧失细节。我们提出了APEX，基于一个关键的理论洞见：对抗修正信号可以通过条件转移从流模型中内生提取。使用一种变换创建一个偏移条件分支，其速度场作为模型当前生成分布的独立估计器，产生一个可证明与GAN对齐的梯度，替代导致梯度消失的样本依赖判别器项。这种无判别器设计保持了架构的完整性，使得APEX成为一个即插即用的框架，兼容全参数和基于LoRA的调优。从经验上看，我们的0.6B模型在一步质量上超越了FLUX-Schnell的12B模型（参数量多20倍）。在Qwen-Image 20B上进行LoRA调优后，APEX在6小时内以NFE=1达到了0.89的GenEval分数，超越了原始的50步教师模型（0.87），并提供了15.33倍的推理加速。代码可在 https://github.com/LINs-lab/APEX 获取。

cs.CV / 41 / 2604.12331

HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing

HyperLiDAR：基于超维计算的自适应部署后LiDAR分割方法

Moreno, Ivannia Gomez, Yao, Yi, Tian, Ye, Yu, Xiaofan, Ponzina, Flavio, Sullivan, Michael, Zhang, Jingyi, Yang, Mingyu, Kim, Hun Seok, Rosing, Tajana

Abstract

LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.
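For readers unfamiliar with hyperdimensional computing, the sketch below shows the general pattern the abstract alludes to: features are encoded as high-dimensional bipolar vectors, each class keeps a prototype vector, and adaptation is a cheap add/subtract on the prototypes. Everything here (the `HDClassifier` class, the random-projection encoding, the dimension, the toy features) is a generic illustration under assumed settings and is not taken from HyperLiDAR itself.

```python
import random


class HDClassifier:
    """Tiny hyperdimensional classifier: bipolar encoding plus class prototypes."""

    def __init__(self, num_features, num_classes, dim=2048, seed=0):
        rng = random.Random(seed)
        # One random bipolar (+1/-1) hypervector per input feature.
        self.basis = [[rng.choice((-1, 1)) for _ in range(dim)]
                      for _ in range(num_features)]
        self.prototypes = [[0] * dim for _ in range(num_classes)]

    def encode(self, features):
        """Project features onto the random basis and binarize to +1/-1."""
        dim = len(self.basis[0])
        sums = [sum(f * b[d] for f, b in zip(features, self.basis))
                for d in range(dim)]
        return [1 if s >= 0 else -1 for s in sums]

    def predict(self, features):
        hv = self.encode(features)
        scores = [sum(p * h for p, h in zip(proto, hv))
                  for proto in self.prototypes]
        return scores.index(max(scores))

    def fit_one(self, features, label):
        """Initial training: bundle (add) the encoding into its class prototype."""
        hv = self.encode(features)
        self.prototypes[label] = [p + h for p, h in zip(self.prototypes[label], hv)]

    def adapt(self, features, label):
        """On-device update: adjust prototypes only when the prediction is wrong."""
        guess = self.predict(features)
        if guess == label:
            return
        hv = self.encode(features)
        self.prototypes[label] = [p + h for p, h in zip(self.prototypes[label], hv)]
        self.prototypes[guess] = [p - h for p, h in zip(self.prototypes[guess], hv)]


model = HDClassifier(num_features=3, num_classes=2)
model.fit_one([1.0, 0.1, 0.0], label=0)
model.fit_one([0.1, 1.0, 0.0], label=1)

shifted = [0.0, 0.1, 1.0]   # class 1 seen under a new, shifted feature profile
model.adapt(shifted, label=1)
print(model.predict(shifted))
```

The update touches only two prototype vectors and needs no gradients, which is the property that makes this family of methods attractive for on-device adaptation.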

Chinese Translation

LiDAR语义分割在自动驾驶等边缘应用的三维场景理解中起着关键作用。然而，实际部署中仍面临重大挑战，尤其是在设备端的部署后自适应方面。随着系统在不同地点的移动，真实环境会发生变化，若缺乏有效且及时的模型适应，性能将显著下降。此外，边缘系统受限于严格的计算和能耗限制，难以在设备端直接适应基于大型神经网络的传统分割模型。为解决上述问题，我们提出了HyperLiDAR，这是首个基于超维计算（Hyperdimensional Computing, HDC）的轻量级部署后LiDAR分割框架。HyperLiDAR的设计充分利用了HDC快速学习和高效能的优势，灵感来源于人脑的信息处理方式。为进一步提升适应效率，我们识别出每次扫描数据量大是关键瓶颈，因而引入缓冲区选择策略，聚焦于最具信息量的点进行学习。我们在两个最先进的LiDAR分割基准和两款代表性设备上进行了广泛评估。结果表明，HyperLiDAR在适应性能上优于或可与最先进分割方法相媲美，同时在再训练速度上实现了最高13.8倍的加速。

cs.CV / 42 / 2604.12335

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

一体化方案：用于多模态视频理解的统一合成数据生成流水线

Rahman, Tanzila, Liao, Renjie, Sigal, Leonid

Abstract

Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach on three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.

Chinese Translation

训练用于视频理解的多模态大型语言模型（MLLMs）需要涵盖对象计数、问答和分割等多样任务的大规模标注数据。然而，在现实环境中收集和标注多模态视频数据成本高昂、速度缓慢，且在多样性和覆盖范围上存在固有限制。为了解决这一挑战，我们提出了一种统一的合成数据生成流水线，能够自动生成无限量且带有丰富多样监督信息的多模态视频数据。该框架支持在单一流水线内处理多种任务格式，实现跨任务的可扩展且一致的数据创建。为了进一步提升推理能力，我们引入了一种基于视觉问答（VQA）的微调策略，使模型学习回答关于视觉内容的结构化问题，而非仅依赖字幕或简单指令。这种方法促进了更深入的视觉定位和推理。我们在视频对象计数、基于视频的视觉问答和视频对象分割三项具有挑战性的任务中评估了该方法。实验结果表明，主要基于合成数据训练的模型能够有效泛化到真实世界数据集，且常常优于传统训练模型。我们的研究结果凸显了统一合成数据流水线作为多模态视频理解中昂贵真实标注的可扩展替代方案的潜力。

cs.CV / 43 / 2604.12341

Bridging the Micro-Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

弥合微观与宏观差距：面向图像篡改定位的频率感知语义对齐方法

Liang, Xiaojie, Chen, Zhimin, Sheng, Ziqi, Lu, Wei

Abstract

As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro-macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.
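To illustrate what a two-band frequency cue can look like, the sketch below computes an orthonormal 2D DCT of a small block and splits its energy into a low band and a high band. The fixed `cutoff`, the `band_energy` helper, and the 8x8 toy blocks are assumptions chosen for clarity; FASA's module is described as adaptive, and the abstract does not specify how its bands are defined.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    def basis(k, x):
        return math.cos(math.pi * (2 * x + 1) * k / (2 * n))

    return [[scale(u) * scale(v) * sum(block[y][x] * basis(u, y) * basis(v, x)
                                       for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]


def band_energy(block, cutoff):
    """Split DCT energy into a low band (u + v < cutoff) and a high band."""
    coeffs = dct2(block)
    low = high = 0.0
    for u, row in enumerate(coeffs):
        for v, c in enumerate(row):
            if u + v < cutoff:
                low += c * c
            else:
                high += c * c
    return low, high


flat = [[1.0] * 8 for _ in range(8)]
checker = [[1.0 if (x + y) % 2 == 0 else -1.0 for x in range(8)] for y in range(8)]
for name, block in (("flat", flat), ("checker", checker)):
    low, high = band_energy(block, cutoff=4)
    print(name, round(high / (low + high), 3))
```

A smooth block puts essentially all of its energy in the low band, while a pixel-level checkerboard puts almost all of it in the high band, which is the kind of contrast a frequency-based forensic cue relies on.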

Chinese Translation

随着生成式图像编辑技术的发展，图像篡改定位（IML）需要同时应对具有明显取证痕迹的传统篡改和局部真实感较强的扩散生成编辑。现有方法通常仅依赖低层次的取证线索或高层次的语义信息，导致存在根本性的微观与宏观差距。为弥合这一差距，我们提出了FASA，一种统一框架，用于定位传统与扩散生成的篡改。具体而言，我们通过自适应双频段离散余弦变换（DCT）模块提取对篡改敏感的频率线索，并通过在冻结的CLIP表示上进行基于patch的对比对齐学习篡改感知的语义先验。随后，我们通过语义-频率侧适配器将这些先验注入分层频率通路，实现多尺度特征交互，并采用原型引导的频率门控掩码解码器，将语义一致性与边界感知定位相结合，以预测篡改区域。在OpenSDI及多个传统篡改基准上的大量实验表明，本方法实现了最先进的定位性能，具备强大的跨生成器和跨数据集泛化能力，并在常见图像退化条件下表现出鲁棒性。

cs.CV / 44 / 2604.12343

Detecting Precise Hand Touch Moments in Egocentric Video

在自我中心视频中检测精确的手接触时刻

Nguyen, Huy Anh, Dayoub, Feras, Hoai, Minh

Abstract

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced "high-see") that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft labels that emphasize hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
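The two-frame tolerance criterion can be sketched as a simple matching procedure. The greedy one-to-one matching in confidence order, the `match_with_tolerance` helper, and the toy predictions are assumptions; the abstract states only the tolerance and that results are reported as average precision, not the exact matching protocol.

```python
def match_with_tolerance(predictions, ground_truth, tolerance=2):
    """Label each prediction as a true or false positive.

    predictions  : list of (frame_index, confidence) pairs
    ground_truth : list of annotated contact frame indices
    tolerance    : maximum allowed |predicted - annotated| in frames
    """
    unmatched = set(ground_truth)
    results = []
    # Visit confident predictions first so they claim ground-truth moments first.
    for frame, score in sorted(predictions, key=lambda p: -p[1]):
        candidates = [g for g in unmatched if abs(g - frame) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda g: (abs(g - frame), g))
            unmatched.remove(nearest)
            results.append((frame, score, True))
        else:
            results.append((frame, score, False))
    return results


preds = [(100, 0.9), (103, 0.8), (250, 0.7), (101, 0.4)]
truth = [101, 248, 400]
labels = match_with_tolerance(preds, truth)
hits = sum(ok for _, _, ok in labels)
print(f"precision={hits / len(preds):.2f} recall={hits / len(truth):.2f}")
```

Note that the prediction at frame 103 is within two frames of the annotation at 101 but still counts as a false positive here, because a more confident prediction has already claimed that moment; each annotated contact can be matched at most once.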

Chinese Translation

我们针对在自我中心视频中检测手与物体接触的精确时刻这一具有挑战性的任务进行研究。这种帧级别的检测对于增强现实、人机交互、辅助技术和机器人学习应用至关重要，因为接触开始信号着动作的启动或完成。由于接触时手部运动的微妙变化、频繁的遮挡、细粒度的操作模式以及第一人称视角固有的运动动态，时间上的精确检测尤其具有挑战性。为了解决这些问题，我们提出了一种手部信息增强模块（Hand-informed Context Enhanced module，HiCE；发音为“high-see”），该模块通过交叉注意机制利用手部区域及其周围环境的时空特征，学习识别潜在的接触模式。我们的方案进一步通过抓握感知损失和软标签进行优化，强调触摸事件特征的手部姿态模式和运动动态，使模型能够区分接近接触和实际接触的帧。我们还引入了TouchMoment，这是一个包含4,021个视频和8,456个标注接触时刻的自我中心数据集，覆盖超过一百万帧。在TouchMoment上的实验表明，在严格的评估标准下，仅当预测结果在真实时刻的两帧容差内时才被视为正确，我们的方法取得了显著的提升，平均精度比最先进的事件检测基线高出16.91%。

cs.CV / 45 / 2604.12346

Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

释放视频中 Grounding DINO 的潜力：有限数据下的参数高效适应用于时空定位

Wang, Zanyi, Li, Fan, Jiang, Dengyang, Li, Liuzhuozheng, Zhong, Yunhua, Dai, Guang, Wang, Mengmeng

Abstract

Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. As a result, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.

Chinese Translation

时空视频定位（STVG）旨在在动态视频片段中定位查询对象。现有的全训练方法通常对数据需求量大。然而，收集大规模的 STVG 数据极具挑战性：密集的帧级边界框和复杂的时间语言对齐注释成本高昂，尤其是在专业视频领域。因此，传统模型在这些本质上有限的数据集上容易出现严重的过拟合，而零-shot 基础模型缺乏精确定位所需的任务特定时间意识。为了解决这一小数据挑战，我们提出了 ST-GD，这是一种数据高效的框架，能够将预训练的 2D 视觉-语言模型（如 Grounding DINO）适应于视频任务。为了避免在小数据集上破坏预训练的先验，ST-GD 保持基础模型不变，并战略性地注入轻量级适配器（约 1000 万可训练参数），以增强时空意识，同时引入一种新颖的时间解码器用于边界预测。这一设计自然应对了数据稀缺的问题。因此，ST-GD 在数据稀缺场景中表现出色，在有限规模的 HC-STVG v1/v2 基准上取得了高度竞争的性能，同时在 VidSTG 数据集上保持了强大的泛化能力。这验证了 ST-GD 作为在严格小数据限制下进行复杂视频理解的强大范式。


cs.CV / 46 / 2604.12351

Fundus Image-based Glaucoma Screening via Retinal Knowledge-Oriented Dynamic Multi-Level Feature Integration

基于眼底图像的青光眼筛查：视网膜知识导向的动态多层次特征融合

Zhou, Yuzhuo, Liu, Chi, Shen, Sheng, Ge, Zongyuan, Jing, Fengshi, Zhang, Shiran, Jiang, Yu, Wang, Anli, Liu, Wenjian, Yang, Feilong, Zhu, Tianqing, Han, Xiaotong

Abstract

Automated diagnosis based on color fundus photography is essential for large-scale glaucoma screening. However, existing deep learning models are typically data-driven and lack explicit integration of retinal anatomical knowledge, which limits their robustness across heterogeneous clinical datasets. Moreover, pathological cues in fundus images may appear beyond predefined anatomical regions, making fixed-region feature extraction insufficient for reliable diagnosis. To address these challenges, we propose a retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors. The framework adopts a tri-branch structure to capture complementary retinal representations, including global retinal context, structural features of the optic disc/cup, and dynamically localized pathological regions. A Dynamic Window Mechanism is devised to adaptively identify diagnostically informative regions, while a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model to guide attention learning. Extensive experiments on the large-scale AIROGS dataset demonstrate that the proposed method outperforms diverse baselines, achieving an AUC of 98.5% and an accuracy of 94.6%. Additional evaluations on multiple datasets from the SMDG-19 benchmark further confirm its strong cross-domain generalization capability, indicating that knowledge-guided attention combined with adaptive lesion localization can significantly improve the robustness of automated glaucoma screening systems.

Chinese Translation

基于彩色眼底摄影的自动诊断对于大规模青光眼筛查至关重要。然而，现有深度学习模型通常依赖数据驱动，缺乏对视网膜解剖知识的显式整合，限制了其在异质临床数据集上的鲁棒性。此外，眼底图像中的病理线索可能出现在预定义解剖区域之外，使得固定区域的特征提取不足以实现可靠诊断。为应对这些挑战，我们提出了一种视网膜知识导向的青光眼筛查框架，该框架结合了动态多尺度特征学习与特定领域的视网膜先验知识。该框架采用三分支结构以捕获互补的视网膜表征，包括全局视网膜上下文、视盘/视杯的结构特征以及动态定位的病理区域。设计了动态窗口机制以自适应识别诊断相关区域，同时引入知识增强卷积注意力模块，利用预训练基础模型提取的视网膜先验指导注意力学习。在大规模AIROGS数据集上的大量实验表明，所提方法优于多种基线，达到98.5%的AUC和94.6%的准确率。对SMDG-19基准的多个数据集的额外评估进一步验证了其强大的跨域泛化能力，表明知识引导的注意力结合自适应病变定位能够显著提升自动青光眼筛查系统的鲁棒性。


cs.CV / 47 / 2604.12353

Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

应对模式与内容偏差：用于泛化AI生成图像检测的对抗特征学习

Zhang, Haifeng, He, Qinghui, Bi, Xiuli, Liu, Bo, Pun, Chi-Man, Xiao, Bin

Abstract

In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.

Chinese Translation

近年来，生成式人工智能技术的快速发展显著降低了高质量伪造图像的制作门槛，给信息真实性和可信度带来了严峻挑战。现有的生成图像检测方法通常通过模型架构或网络设计来增强泛化能力，但其泛化性能仍易受数据偏差影响，因为训练数据可能驱使模型拟合特定的生成模式和内容，而非不同生成模型图像所共有的通用特征（即非对称偏差学习）。为解决该问题，我们提出了多维对抗特征学习（Multi-dimensional Adversarial Feature Learning，MAFL）框架。该框架采用预训练的多模态图像编码器作为特征提取骨干，构建真假特征学习网络，并设计了配备多维对抗损失的对抗偏差学习分支，形成真实性判别特征学习与偏差特征学习之间的对抗训练机制。通过抑制生成模式和内容偏差，MAFL引导模型聚焦于不同生成模型共享的生成特征，从而有效捕捉真实与生成图像的根本差异，提升跨模型泛化能力，并大幅降低对大规模训练数据的依赖。大量实验验证表明，本方法在准确率和平均精度（AP）上分别超越现有最先进方法10.89%和8.57%。值得注意的是，即使仅使用320张图像训练，仍能在公开数据集上实现超过80%的检测准确率。


cs.CV / 48 / 2604.12356

OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

OmniFood8K：通过层次频率对齐融合进行单图像营养估计

Yu, Dongjian, Min, Weiqing, Jiang, Qian, Lin, Xing, Jin, Xin, Jiang, Shuqiang

Abstract

Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models' capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/

Chinese Translation

准确的食物营养估计在促进健康饮食习惯和个性化饮食管理中发挥着至关重要的作用。现有的大多数食品数据集主要集中于西方菜肴，缺乏对中国菜品的充分覆盖，这限制了对中国餐点的准确营养估计。此外，许多最先进的营养预测方法依赖于深度传感器，限制了它们在日常场景中的适用性。为了解决这些局限性，我们引入了OmniFood8K，这是一个综合性的多模态数据集，包含8,036个食品样本，每个样本都有详细的营养注释和多视角图像。此外，为了增强模型在营养预测中的能力，我们构建了NutritionSynth-115K，这是一个大规模合成数据集，在保持精确营养标签的同时引入了组成变化。此外，我们提出了一种端到端的框架，用于从单个RGB图像进行营养预测。首先，我们从单个RGB图像预测深度图，并设计了尺度偏移残差适配器（Scale-Shift Residual Adapter, SSRA）以优化其全局尺度一致性和局部结构保留。其次，我们提出了频率对齐融合模块（Frequency-Aligned Fusion Module, FAFM），在频域中层次对齐和融合RGB和深度特征。最后，我们设计了基于掩膜的预测头（Mask-based Prediction Head, MPH），通过动态通道选择强调关键成分区域，以实现更准确的预测。在多个数据集上的广泛实验表明我们的方法优于现有方法。项目主页：https://yudongjian.github.io/OmniFood8K-food/


cs.CV / 49 / 2604.12358

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

视觉Token剪枝为何及何时失效？基于多模态大语言模型解码中相关视觉信息迁移的研究

Kim, Jiwan, Kim, Kibum, Kim, Wonjoong, Lee, Byung-Kwan, Park, Chanyoung

Abstract

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.

Chinese Translation

近年来，视觉Token剪枝被用于处理多模态大语言模型（MLLMs）中大量的视觉Token。然而，我们观察到现有剪枝方法虽然在简单视觉理解任务中表现稳定，但在复杂视觉推理任务中难以有效泛化，这一关键问题在以往研究中尚未得到充分探讨。通过系统分析，我们确定了解码过程中相关视觉信息迁移（Relevant Visual Information Shift，RVIS）是导致失败的主要原因。为此，我们提出了解码阶段感知迁移的Token剪枝方法（Decoding-stage Shift-aware Token Pruning，DSTP），这是一种无需训练的附加框架，使现有剪枝方法能够在解码阶段将视觉Token与不断变化的推理需求对齐。大量实验表明，DSTP显著缓解了剪枝方法在复杂推理任务中的性能下降，同时在视觉理解基准测试中持续带来性能提升。此外，DSTP在多种先进架构上均表现出良好的效果，体现了其通用性和以极低计算开销实现的高效性。
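For reference, the static pruning setting that the abstract critiques can be sketched as follows. This is a generic score-based top-k pruner, not DSTP itself; the function name `prune_visual_tokens` and the per-token relevance scores are hypothetical, and the scores are assumed to come from something like text-to-image attention.

```python
def prune_visual_tokens(scores, keep_ratio):
    """Return the sorted indices of the visual tokens kept after pruning.

    scores: one hypothetical relevance score per visual token.
    keep_ratio: fraction of tokens to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices from most to least relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Made-up scores at two decoding steps, chosen so that relevance moves
# to different tokens as decoding proceeds.
scores_step_0 = [0.9, 0.1, 0.8, 0.2]
scores_step_5 = [0.1, 0.9, 0.2, 0.8]

kept_0 = prune_visual_tokens(scores_step_0, keep_ratio=0.5)
kept_5 = prune_visual_tokens(scores_step_5, keep_ratio=0.5)
```

With these toy scores, a token set chosen once at step 0 contains none of the tokens that matter at step 5, which is the kind of shift the abstract names RVIS.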


cs.CV / 50 / 2604.12371

Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

像素间的阅读：将文本-图像嵌入对齐与视觉语言模型中的排版攻击成功率关联起来

Balakrishnan, Ravikumar, Mendapara, Sanket, Garg, Ankit

Abstract

We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6-28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10-12% and reduce ASR by 34-96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.

Chinese Translation

我们研究了视觉语言模型（VLMs）上的排版提示注入攻击，其中对抗性文本以图像形式呈现以绕过安全机制。随着VLMs作为自主代理的感知基础，从浏览器自动化和计算机使用系统到配备摄像头的具身代理，这种攻击威胁日益增长。实际上，攻击面呈现异质性：对抗性文本以不同字体大小和多样视觉条件出现，而不断扩展的VLM生态系统在脆弱性方面存在显著差异，增加了防御的复杂性。我们评估了来自SALAD-Bench的1000个提示，在四个VLMs上进行测试，分别是GPT-4o、Claude Sonnet 4.5、Mistral-Large-3和Qwen3-VL-4B-Instruct，测试条件涵盖不同字体大小（6至28像素）及视觉变换（旋转、模糊、噪声、对比度变化）。结果发现：(1) 字体大小显著影响攻击成功率（ASR），极小字体（6px）几乎无攻击成功，而中等字体达到峰值效果；(2) 对GPT-4o（36%对8%）和Claude（47%对22%）而言，文本攻击比图像攻击更有效，而Qwen3-VL和Mistral在两种模态下的ASR相当；(3) 来自两种多模态嵌入模型（JinaCLIP和Qwen3-VL-Embedding）的文本-图像嵌入距离与四个模型的ASR呈显著负相关（相关系数r = -0.71至-0.93，p < 0.01）；(4) 严重降质导致嵌入距离增加10%至12%，ASR下降34%至96%，而旋转对模型影响不对称（Mistral下降50%，GPT-4o保持不变）。这些发现表明，模型特定的鲁棒性模式排除了通用防御方案的可能性，并为在对抗环境中运行的代理系统选择VLM骨干提供了实证指导。
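Finding (3) reports a Pearson correlation between text-image embedding distance and ASR. A minimal sketch of that statistic is given below, assuming one (distance, ASR) pair per test condition. The helper `pearson_r` and the numbers are illustrative placeholders only, not values from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values: embedding distance and attack success rate
# for four hypothetical rendering conditions.
embedding_distance = [0.20, 0.30, 0.40, 0.50]
attack_success_rate = [0.60, 0.45, 0.30, 0.10]

r = pearson_r(embedding_distance, attack_success_rate)
```

A strongly negative `r` on real measurements would correspond to the abstract's reading: the further the rendered image drifts from the text in embedding space, the less often the attack succeeds.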


cs.CV / 51 / 2604.12380

Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

多模态隐蔽物体检测的模态无关提示学习

Wang, Hao, Zhang, Jiqing, Yang, Xin, Yin, Baocai, Jiang, Lu, Mi, Zetian, Wang, Huibing

Abstract

Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

Chinese Translation

隐蔽物体检测（COD）旨在分割与复杂背景无缝融合的物体，越来越多的研究关注利用额外的视觉模态来通过互补信息增强鲁棒性。然而，大多数现有方法通常依赖于特定模态的架构或定制的融合策略，这限制了可扩展性和跨模态的泛化能力。为了解决这个问题，我们提出了一种新颖的框架，为Segment Anything Model（SAM）生成模态无关的多模态提示，从而实现对任意辅助模态的参数高效适应，并显著提高COD任务的整体性能。具体而言，我们通过数据驱动的内容域与知识驱动的提示域之间的交互来建模多模态学习，将与任务相关的线索提炼为SAM解码的统一提示。我们进一步引入了一种轻量级的Mask Refine Module，通过结合细粒度的提示线索来校准粗略预测，从而获得更准确的隐蔽物体边界。在RGB-Depth、RGB-Thermal和RGB-Polarization基准上的大量实验验证了我们模态无关框架的有效性和泛化能力。


cs.CV / 52 / 2604.12391

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

模型链预训练：重新思考视觉基础模型的训练加速

Fan, Jiawei, Wang, Shigeng, Li, Chao, Liu, Xiaolong, Yao, Anbang

Abstract

In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

Chinese Translation

在本文中，我们提出了一种新颖的无性能损失的训练加速方法——模型链预训练（Chain-of-Models Pre-Training，CoM-PT），用于视觉基础模型（Vision Foundation Models，VFMs）。该方法在核心动机上与现有的加速方法根本不同：CoM-PT并非单独优化每个模型，而是旨在在模型家族层面加速训练流程，随着模型家族的扩展而高效缩放。具体而言，CoM-PT为模型家族建立了一个预训练序列，按模型大小升序排列，称为模型链。在这个链中，只有最小的模型经过标准的单独预训练，而其他模型则通过从其较小的前身进行顺序逆知识转移，利用参数空间和特征空间中的知识进行高效训练。因此，CoM-PT使所有模型的性能大多优于标准的单独训练，同时显著降低了训练成本，这在涵盖零样本和微调任务的45个数据集上得到了广泛验证。值得注意的是，其高效的缩放特性产生了一个显著的现象：训练更多模型甚至会导致更高的效率。例如，在CC3M上进行预训练时：i) 以ViT-L作为最大的模型，逐步将较小的模型添加到模型链中可将计算复杂度降低多达72%；ii) 在固定的模型大小范围内，随着VFM家族在3、4和7个模型之间的扩展，CoM-PT的加速比展现出显著的跃升：从4.13倍跃升至5.68倍和7.09倍。由于CoM-PT自然与特定的预训练范式无关，我们开源了代码，以促进在更计算密集的场景中进一步扩展，例如大型语言模型的预训练。


cs.CV / 53 / 2604.12403

Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

双模态锚点引导过滤用于测试时提示调优

Choi, Jungwon, Kim, Eunwoo

Abstract

Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.

Chinese Translation

测试时提示调优（Test-Time Prompt Tuning, TPT）通过增强视图来适应视觉-语言模型，但其有效性受到确定哪些视图有益的挑战的制约。标准的基于熵的过滤依赖于模型的内部置信度评分，但在分布转移下往往会出现误校准，导致对无关的裁剪或背景区域赋予高置信度，同时忽视语义内容。为了解决这个问题，我们提出了一种双模态锚点引导框架，将视图选择基于语义证据进行定位。我们引入了来自属性丰富描述的文本锚点，以提供细粒度的类别语义，以及一个自适应图像锚点，用于捕捉不断变化的测试时统计信息。利用这些锚点，我们基于对齐和置信度过滤视图，确保只有信息丰富的视图引导适应。此外，我们将锚点视为辅助预测头，并将它们的预测与原始输出结合在一起，形成置信度加权的集成，从而为提示更新提供稳定的监督信号。在15个基准数据集上进行的广泛实验表明，我们的方法在性能上达到了新的最先进水平，突显了锚点引导监督作为稳健提示更新基础的贡献。
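The entropy-based filtering baseline mentioned in the abstract can be sketched as follows, assuming each augmented view yields a class-probability vector. The function names, the `keep_ratio` value, and the probabilities are hypothetical; this shows only the confidence-based baseline, not the proposed anchor-guided filter.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_views(view_probs, keep_ratio=0.1):
    """Keep the indices of the lowest-entropy (most confident) views."""
    n_keep = max(1, int(len(view_probs) * keep_ratio))
    ranked = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return ranked[:n_keep]


def average_prediction(view_probs, kept):
    """Average the class probabilities over the kept views."""
    n_classes = len(view_probs[0])
    return [sum(view_probs[i][c] for i in kept) / len(kept) for c in range(n_classes)]


# Made-up predictions for four augmented views of one test image.
views = [
    [0.34, 0.33, 0.33],  # uncertain view
    [0.90, 0.05, 0.05],  # confident in class 0
    [0.05, 0.90, 0.05],  # equally confident in class 1
    [0.50, 0.30, 0.20],
]
kept = select_confident_views(views, keep_ratio=0.5)
avg = average_prediction(views, kept)
```

In this toy case the two kept views are equally confident yet disagree, so confidence alone cannot tell which one to trust. That is the gap the abstract's semantic anchors are meant to fill.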


cs.CV / 54 / 2604.12411

DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

DeferredSeg：一个用于可信医疗图像分割的多专家延迟框架

Tian, Qiuyu, Sun, Haoliang, Wang, Yunshan, Shi, Yinghuan, Yin, Yilong

Abstract

Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human-AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentors for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures.

Chinese Translation

基于深度神经网络的分割模型在医疗图像分割中表现出强大的泛化能力。然而，它们常常表现出过度自信或不足自信，导致分割掩膜的置信度评分不可靠，尤其是在模糊区域。这削弱了临床部署所需的可信度。受到学习延迟（Learning-to-Defer, L2D）范式的启发，我们提出了DeferredSeg，一个关注延迟的分割框架，即一个人机协作系统，能够判断在特定区域是否将预测结果延迟给人类专家。DeferredSeg通过聚合的延迟预测器和额外的路由通道扩展了基础分割器，动态地将每个像素路由到基础分割器或人类专家。为了有效地训练这一路由，我们引入了一种像素级替代协作损失，以监督延迟决策。此外，为了保持延迟区域内的空间一致性，我们提出了一种空间一致性损失，强制平滑的延迟掩膜，从而增强可靠性。除了单专家延迟外，我们还通过引入多个差异专家进行协作决策，进一步将框架扩展到多专家设置。为了防止个别专家的过载或不足利用，我们进一步设计了一种负载平衡惩罚，均匀分配工作量到专家分支。我们在三个具有挑战性的医疗数据集上评估了DeferredSeg，使用MedSAM和CENet作为基础分割器进行公平比较。实验结果表明，DeferredSeg始终优于基线，证明了其在可信密集医疗分割中的有效性。此外，所提出的框架是模型无关的，可以轻松应用于其他分割架构。


cs.CV / 55 / 2604.12437

A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

用于乳腺X光图像感兴趣区域良恶性分类的混合架构

Asad, Mohammed, Bajpai, Mohit, Singh, Sudhir, Katarya, Rahul

Abstract

Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.

Chinese Translation

在乳腺X光检查中，准确表征可疑乳腺病变对于早期诊断和治疗规划至关重要。尽管卷积神经网络（CNN）在提取局部视觉模式方面表现出色，但在建模长距离依赖关系时则显得不够适用。视觉变换器（ViTs）通过自注意力机制解决了这一限制，但其二次计算成本可能过高。本文提出了一种混合架构，将 EfficientNetV2-M 用于局部特征提取，与状态空间模型（State Space Model, SSM）Vision Mamba 结合，以实现高效的全局上下文建模。所提出的模型对来自 CBIS-DDSM 数据集的以异常为中心的乳腺X光图像感兴趣区域（ROIs）进行良性和恶性类别的二分类。通过结合强大的 CNN 主干和线性复杂度的序列模型，该方法在基于 ROIs 的设置中实现了强大的病变级分类性能。


cs.CV / 56 / 2604.12440

IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

IAD-Unify：一种基于区域定位的工业异常分割、理解与生成统一模型

Zheng, Haoyu, Lin, Tianwei, Wang, Wei, Wang, Zhuonan, Zhang, Wenqiao, Zhu, Jiaqi, Shao, Feifei

Abstract

Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by >76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.

Chinese Translation

现实工业检测不仅需要定位缺陷，还需以自然语言解释缺陷，并生成可控的缺陷编辑。然而，现有方法无法在统一框架和评估协议下同时支持这三种能力。我们提出了IAD-Unify，一种双编码器统一框架，其中基于DINOv2的冻结区域专家通过轻量级token注入向共享的Qwen3.5-4B视觉语言主干网络提供精确的异常证据，从而联合实现异常分割、基于区域的理解和掩码引导生成。为实现统一评估，我们进一步构建了Anomaly-56K，一个涵盖24个类别、104种缺陷变体、共59,916张图像的综合多任务IAD评估平台。受控消融实验得出四点结论：(i) 区域定位是理解的关键机制，移除该机制会导致定位准确率下降超过76个百分点；(ii) 预测区域的性能与oracle接近，验证了部署的可行性；(iii) 基于区域的生成在全图保真度和掩码区域感知质量上表现最佳；(iv) 预初始化的联合训练在几乎无生成成本(-0.16 dB)的情况下提升了理解能力。IAD-Unify在MMAD基准测试中也取得了优异表现，包括训练时未见类别，展示了强大的跨类别泛化能力。


cs.CV / 57 / 2604.12443

DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization

DiffusionPrint：学习基于扩散的修复定位的生成指纹

Giakoumoglou, Paschalis, Papadopoulos, Symeon

Abstract

Modern diffusion-based inpainting models pose significant challenges for image forgery localization (IFL), as their full regeneration pipelines reconstruct the entire image via a latent decoder, disrupting the camera-level noise patterns that existing forensic methods rely on. We propose DiffusionPrint, a patch-level contrastive learning framework that learns a forensic signal robust to the spectral distortions introduced by latent decoding. It exploits the fact that inpainted regions generated by the same model share a consistent generative fingerprint, using this as a self-supervisory signal. DiffusionPrint trains a convolutional backbone via a MoCo-style objective with cross-category hard negative mining and a generator-aware classification head, producing a forensic feature map that serves as a highly discriminative secondary modality in fusion-based IFL frameworks. Integrated into TruFor, MMFusion, and a lightweight fusion baseline, DiffusionPrint consistently improves localization across multiple generative models, with gains of up to +28% on mask types unseen during fine-tuning and confirmed generalization to unseen generative architectures. Code is available at https://github.com/mever-team/diffusionprint

Chinese Translation

现代基于扩散的修复模型在图像伪造定位（IFL）方面面临重大挑战，因为它们的完整重建流程通过潜在解码器重构整个图像，破坏了现有取证方法所依赖的相机级噪声模式。我们提出了DiffusionPrint，这是一种基于补丁的对比学习框架，旨在学习对潜在解码引入的光谱失真具有鲁棒性的取证信号。它利用了同一模型生成的修复区域共享一致生成指纹的事实，将其作为自监督信号。DiffusionPrint通过MoCo风格的目标训练卷积骨干网络，结合跨类别的困难负样本挖掘和生成器感知分类头，生成一个取证特征图，作为基于融合的IFL框架中的高度区分性次级模态。集成到TruFor、MMFusion和一个轻量级融合基线中，DiffusionPrint在多个生成模型中始终提高定位性能，在微调期间未见的掩码类型上提升高达28%，并确认对未见生成架构的泛化能力。代码可在 https://github.com/mever-team/diffusionprint 获取。
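The "MoCo-style objective" in the abstract refers to a contrastive loss of the InfoNCE family. A minimal sketch of that loss for a single query follows, assuming L2-normalized features and cosine-similarity logits. The function `info_nce`, the temperature default, and the toy vectors are assumptions for illustration; the hard negative mining and classification head described in the abstract are not modeled here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss: -log softmax probability of the positive key."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy patch features: the loss is small when the positive key matches the
# query and large when a negative key matches it instead.
easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In the setting the abstract describes, the positive pair would be two patches inpainted by the same generator, so that minimizing the loss pulls their features together.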


cs.CV / 58 / 2604.12463

Euler-inspired Decoupling Neural Operator for Efficient Pansharpening

基于欧拉启发的解耦神经算子用于高效的全色融合

Zhu, Anqi, Ma, Mengting, Jiang, Yizhen, Li, Xiangdong, Zheng, Kai, Li, Jiaxin, Zhang, Wei

Abstract

Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler's formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on the three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.

Chinese Translation

全色融合旨在通过将全色图像（PAN）的空间纹理与低分辨率多光谱图像（LR-MS）的光谱信息融合，合成高分辨率多光谱图像（HR-MS）。尽管最近的深度学习范式，特别是基于扩散的算子，推动了性能的边界，但由于其随机特性和迭代采样，它们往往面临光谱-空间模糊和高昂的计算成本。在本文中，我们提出了基于欧拉启发的解耦神经算子（EDNO），这是一个物理启发的框架，将全色融合重新定义为频域中的连续函数映射。与传统的笛卡尔特征处理不同，我们的EDNO利用欧拉公式将特征转换为极坐标系统，从而实现了一种新颖的显式-隐式交互机制。具体而言，我们开发了欧拉特征交互层（EFIL），将融合任务解耦为两个专门模块：1）显式特征交互模块，利用线性加权方案模拟相位旋转以实现自适应几何对齐；2）隐式特征交互模块，采用前馈网络建模光谱分布以实现更优的颜色一致性。通过在频域中操作，EDNO本质上捕捉了全局感受野，同时保持离散化不变性。在三个数据集上的实验结果表明，EDNO在效率与性能之间提供了优越的平衡，相较于重量级架构更具优势。
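The polar view that the abstract builds on follows from Euler's formula: a complex frequency coefficient can be written as amplitude times e^(i·phase), and multiplying by e^(i·theta) rotates the phase while leaving the amplitude unchanged. A minimal sketch with the standard library is shown below; `rotate_phase` is a hypothetical helper, and the learned weighting scheme in EDNO is not reproduced.

```python
import cmath
import math


def rotate_phase(z, theta):
    """Rotate the phase of a complex coefficient by theta radians."""
    amplitude, phase = cmath.polar(z)  # z = amplitude * e^(i * phase)
    return cmath.rect(amplitude, phase + theta)


z = complex(1.0, 1.0)
rotated = rotate_phase(z, math.pi / 2)

# Equivalent form straight from Euler's formula.
rotated_alt = z * cmath.exp(1j * math.pi / 2)
```

Both forms give approximately -1+1j here: the magnitude of `z` is preserved and only its angle changes, which is why amplitude and phase can be handled by separate modules.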


cs.CV / 59 / 2604.12481

T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

T2I-BiasBench：一个多指标框架用于审计文本到图像模型中的人口统计和文化偏见

Jaiswal, Nihal, Arjaria, Siddhartha, Chaubey, Gyanendra, Kumar, Ankush, Singh, Aditya, Chaurasiya, Anchal

Abstract

Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: https://gyanendrachaubey.github.io/T2I-BiasBench/

Chinese Translation

文本到图像（T2I）生成模型在视觉保真度方面表现出色，但会继承并放大训练数据中嵌入的人口失衡和文化偏见。我们提出了T2I-BiasBench，这是一个统一的评估框架，包含十三个互补指标，能够共同捕捉扩散模型中的人口偏见、元素遗漏和文化崩溃——这是第一个同时解决这三个维度的框架。我们对三个开源模型——Stable Diffusion v1.5、BK-SDM Base和Koala Lightning——进行了评估，以Gemini 2.5 Flash（与RLHF对齐）作为参考基线。该基准包含1,574个生成图像，分为五个结构化提示类别。T2I-BiasBench将六个已建立的指标与七个额外的度量相结合：四个新提出的指标（复合偏见分数、基础缺失率、隐含元素缺失率、文化准确性比率）和三个改编的指标（幻觉分数、Vendi分数、CLIP代理分数）。我们得出了三个关键发现：（1）Stable Diffusion v1.5和BK-SDM在与美相关的提示中表现出偏见放大（>1.0）；（2）手术个人防护装备等上下文约束显著减弱了专业角色的性别偏见（SD v1.5的医生CBS = 0.06）；（3）所有模型，包括与RLHF对齐的Gemini，均崩溃为一组狭窄的文化表现（CAS：0.54-1.00），确认对齐技术并未解决文化覆盖的缺口。T2I-BiasBench已公开发布，以支持生成模型的标准化、细致的偏见评估。项目页面可访问： https://gyanendrachaubey.github.io/T2I-BiasBench/

cs.CV / 60 / 2604.12502

SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

SEATrack：简单、高效且自适应的多模态跟踪器

Su, Junbin, Xue, Ziteng, Zhang, Shihui, Chen, Kun, Hu, Weiming, Zhang, Zhipeng

Abstract

Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.

Chinese Translation

在多模态跟踪中，参数高效微调（PEFT）揭示了一个令人担忧的趋势，即最近的性能提升往往以膨胀的参数预算为代价，这从根本上削弱了PEFT的效率承诺。在本研究中，我们提出了SEATrack，这是一种简单、高效且自适应的双流多模态跟踪器，从两个互补的角度解决这一性能与效率的困境。我们首先优先考虑匹配响应的跨模态对齐，这是一个尚未充分探索但至关重要的因素，我们认为它对打破这种权衡至关重要。具体而言，我们观察到现有双流方法中的模态特定偏差会生成相互冲突的匹配注意力图，从而阻碍有效的联合表示学习。为此，我们提出了AMG-LoRA，它无缝地将低秩适应（LoRA）与自适应互引导（AMG）结合，以动态地细化和对齐跨模态的注意力图。然后，我们通过引入层次混合专家（HMoE）来摆脱传统的局部融合方法，该方法能够有效地建模全局关系，有效平衡跨模态融合中的表现力和计算效率。凭借这些创新，SEATrack在RGB-T、RGB-D和RGB-E跟踪任务中在性能与效率的平衡上取得了显著进展。代码可在此获取。

cs.CV / 61 / 2604.12508

From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

从衰减到注意力：用于细粒度视觉感知的变分信息流操控

Zhu, Jilong, Feng, Yang

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.

Chinese Translation

尽管多模态大型语言模型（Multimodal Large Language Models, MLLMs）在通用视觉理解方面表现出令人印象深刻的能力，但它们在需要识别微小物体或辨别细微视觉关系的细粒度感知任务中常常表现不佳。我们将这一局限归因于视觉衰减（Visual Attenuation）：一种在网络传播过程中，稀疏的细粒度视觉信号被占主导地位的文本标记过早抑制或稀释的现象，导致深层决策过程中出现“注意力丧失”。现有以输入为中心的解决方案未能从根本上逆转这一信息丢失的内在机制。为应对该挑战，我们提出了变分信息流（Variational Information Flow, VIF）框架。VIF采用概率视角，利用条件变分自编码器（Conditional Variational Autoencoder, CVAE）将与问答对相关的视觉显著性建模为潜在分布。作为一个即插即用模块，VIF能够集成到现有架构中。在涵盖通用视觉问答（General VQA）、细粒度感知及视觉定位等多样基准上的广泛评估表明，VIF在提升细粒度感知能力方面较以往方法取得了具有竞争力的改进，验证了其有效性。

cs.CV / 62 / 2604.12512

NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)

NTIRE 2026 第三届 Restore Any Image Model (RAIM) 挑战赛：专业图像质量评估（轨道1）

Qin, Guanyi, Liang, Jie, Zhang, Bingbing, Qu, Lishen, Guan, Ya-nan, Zeng, Hui, Zhang, Lei, Timofte, Radu, Sun, Jianhui, Yue, Xinli, Shao, Tao, Hou, Huan, Liao, Wenjie, Han, Shuhao, Yuan, Jieyu, Guo, Chunle, Li, Chongyi, Chen, Zewen, Liu, Yunze, Guo, Jian, Wang, Juan, Zeng, Yun, Li, Bing, Hu, Weiming, Li, Hesong, Liu, Dehua, Zhang, Xinjie, Li, Qiang, Yan, Li, Dong, Wei, Yan, Qingsen, Li, Xingcan, Zhou, Shenglong, Yin, Manjiang, Zhang, Yinxiang, Wang, Hongbo, Xu, Jikai, Fan, Zhaohui, Zhu, Dandan, Sun, Wei, Zhang, Weixia, Zhu, Kun, Zhang, Nana, Zhang, Kaiwei, Zhang, Qianqian, Zhang, Zhihan, Gordon, William, Wu, Linwei, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Cici, Shi, Yaokun

Abstract

In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.

Chinese Translation

本文介绍了 NTIRE 2026 第三届 Restore Any Image Model in the Wild 挑战赛的概况，重点聚焦于轨道1：专业图像质量评估。传统的图像质量评估（IQA）通常依赖标量分数，通过将复杂的视觉特征压缩为单一数值，这些方法在区分均为高质量图像的细微差异方面存在根本性困难。此外，它们无法阐明为何某幅图像更优，缺乏为视觉任务提供指导所需的推理能力。为弥补这一不足，近年来多模态大语言模型（Multimodal Large Language Models, MLLMs）的进展提供了有前景的范式。受此启发，本次挑战赛建立了一个新颖的基准，探索 MLLMs 模拟人类专家认知以评估高质量图像对的能力。参赛者需突破专业场景中的关键瓶颈，围绕两个主要目标展开：（1）比较质量选择：可靠地识别高质量图像对中视觉上更优的图像；（2）解释性推理：生成有依据的专家级解释，详细说明选择背后的理由。此次挑战共吸引近200个注册团队，提交超过2500份方案。表现最佳的方法显著推动了专业图像质量评估的技术进步。挑战赛数据集可在 https://github.com/narthchin/RAIM-PIQA 获取，官方网站为 https://www.codabench.org/competitions/12789/。

cs.CV / 63 / 2604.12525

CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

CoD-Lite：基于扩散的实时生成图像压缩方法

Jia, Zhaoyang, Xue, Naifu, Zheng, Zihan, Li, Jiahao, Li, Bin, Zhang, Xiaoyi, Guo, Zongyu, Zhang, Yuan, Li, Houqiang, Lu, Yan

Abstract

Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Codes are released at https://github.com/microsoft/GenCodec/CoD_Lite

Chinese Translation

近年来，先进的扩散方法通常通过扩展扩散变换器（diffusion transformers）来获得强大的生成先验。然而，当适用于需要轻量化模型的实时压缩场景时，简单的扩展方法难以泛化。本文针对两个关键问题，探索了实时轻量级扩散编解码器的设计。首先，扩散预训练是否有利于轻量级扩散编解码器？通过系统分析发现，面向生成的预训练在小规模模型中效果较差，而面向压缩的预训练则表现出持续更优的性能。其次，变换器（transformers）是否必不可少？研究表明，虽然全局注意力对标准生成任务至关重要，但在结合蒸馏技术的情况下，轻量级卷积足以满足面向压缩的扩散需求。基于这些发现，本文构建了一种一步式轻量级卷积扩散编解码器，实现了1080p分辨率下实时60帧/秒编码和42帧/秒解码。通过蒸馏与对抗学习的进一步增强，该编解码器在与MS-ILLM相当的FID指标下，码率降低了85%，有效缩小了生成式压缩与实际实时部署之间的差距。代码已发布于https://github.com/microsoft/GenCodec/CoD_Lite。

cs.CV / 64 / 2604.12537

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

MODIX：一种无需训练的多模态信息驱动位置索引缩放方法用于视觉-语言模型

Huang, Ruoxiang, Yuan, Zhen

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
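To make the idea of rescaling positional indices concrete, the following is a minimal sketch of per-segment position strides. It is not the MODIX implementation: the function name, the stride rule (score divided by the mean score), and the choice of a larger stride for higher-scoring segments are all assumptions made for illustration, and the sketch presumes a positional encoding that accepts fractional indices.

```python
def scaled_position_indices(segments):
    """Assign (possibly fractional) position indices segment by segment.

    segments: list of (num_tokens, score) pairs in sequence order.
    Hypothetical rule: stride = score / mean(score), so segments with a
    higher score are spread over a wider index range and segments with a
    lower score are compressed into a narrower one.
    """
    mean_score = sum(score for _, score in segments) / len(segments)
    positions, cursor = [], 0.0
    for num_tokens, score in segments:
        stride = score / mean_score
        positions.append([cursor + i * stride for i in range(num_tokens)])
        cursor += num_tokens * stride  # next segment starts where this one ends
    return positions


if __name__ == "__main__":
    # Two text tokens scored 1.5, four image tokens scored 0.5 (made-up values).
    print(scaled_position_indices([(2, 1.5), (4, 0.5)]))
```

With uniform scores the function reduces to the usual 0, 1, 2, ... indexing, which is the baseline the abstract describes as uniform assignment.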

Chinese Translation

视觉-语言模型（Vision-Language Models, VLMs）在多模态理解方面取得了显著进展，但其位置编码机制仍存在不足。现有方法对所有标记统一分配位置索引，忽视了模态内及模态间信息密度的差异，导致注意力分配效率低下，冗余的视觉区域占主导地位，而信息丰富的内容被低估。我们将位置粒度视为一种隐含资源，提出了MODIX（Multimodal Information-Driven Positional IndeX Scaling），一种无需训练的框架，能够基于模态特定的贡献动态调整位置步长。MODIX通过基于协方差的熵来联合建模模态内密度，并通过跨模态对齐建模模态间交互，从而得出统一评分，利用该评分重新缩放位置索引，为信息丰富的模态分配更细粒度的位置编码，同时压缩冗余模态，无需对模型参数或架构进行任何修改。跨多种架构和基准的实验表明，MODIX持续提升多模态推理能力，并能根据任务相关的信息分布自适应重新分配注意力，表明在多模态序列建模的Transformer中，位置编码应被视为一种自适应资源。

cs.CV / 65 / 2604.12551

Cross-Attentive Multiview Fusion of Vision-Language Embeddings

视觉-语言嵌入的跨注意力多视图融合

Martins, Tomas Berriel, Oswald, Martin R., Civera, Javier

Abstract

Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.

Chinese Translation

视觉-语言模型在开放词汇的二维语义分割发展中发挥了关键作用。然而，将这些模型从二维图像提升到三维场景仍然是一个具有挑战性的问题。现有的方法通常通过反向投影和在视图之间平均二维描述符，或启发式地选择一个单一的代表性描述符，这往往导致次优的三维表示。在本研究中，我们提出了一种新颖的多视图变换器架构，该架构在多个视点之间对视觉-语言描述符进行跨注意力处理，并将其融合为统一的每个三维实例嵌入。作为第二个贡献，我们利用多视图一致性作为这种融合的自我监督信号，当其与标准的监督目标类别损失结合时，显著提高了性能。我们的跨注意力多视图融合（Cross-Attentive Multiview Fusion，简称CAMFusion）不仅在性能上始终优于简单的平均或单视图描述符选择，而且在三维语义和实例分类基准测试中也取得了最先进的结果，包括在域外数据集上的零样本评估。

cs.CV / 66 / 2604.12568

Evolution-Inspired Sample Competition for Deep Neural Network Optimization

基于进化启发的样本竞争深度神经网络优化

Zheng, Ying, Zhang, Yiyi, Wang, Yi, Chau, Lap-Pui

Abstract

Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method. Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.
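The final step described above, reweighting the sample-wise loss from per-sample scores, can be sketched as follows. The abstract does not give the scoring formula or the weighting rule, so the scores are taken as given and the softmax-based weighting (with the hypothetical names `competition_weights` and `reweighted_loss`) is only one plausible choice, not the authors' method.

```python
import math


def competition_weights(scores, temperature=1.0):
    """Map per-sample scores to loss weights that average to 1.

    Hypothetical rule: a softmax over the group, rescaled by the group size,
    so higher-scoring samples contribute more to the loss.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def reweighted_loss(losses, scores, temperature=1.0):
    """Mean of the per-sample losses after score-based reweighting."""
    weights = competition_weights(scores, temperature)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


if __name__ == "__main__":
    losses = [1.0, 2.0, 3.0]   # made-up per-sample losses
    scores = [2.0, 0.0, -1.0]  # made-up per-sample scores
    print(competition_weights(scores))
    print(reweighted_loss(losses, scores))
```

Keeping the weights at an average of 1 means that equal scores recover the ordinary mean loss, so the reweighting changes only the relative contribution of samples within a group.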

Chinese Translation

传统的深度网络训练通常在一个相对统一的学习范式下优化所有样本，而未明确建模它们之间的异质竞争。这种过于简化的处理可能导致一些众所周知的问题，包括类别不平衡下的偏差、对困难样本学习不足以及对噪声样本的错误强化。在本研究中，我们提出了一种名为自然选择（Natural Selection, NS）的新型进化启发优化方法，它明确将竞争互动纳入深度网络训练中。与主要依赖预定义启发式或静态标准的传统样本重加权策略不同，NS在组内上下文中评估每个样本的竞争状态，并利用该状态自适应地调节其训练贡献。具体而言，NS首先将多个样本组合成一个复合图像，并将其重新缩放到原始输入大小以进行模型推断。基于生成的预测，为每个样本计算自然选择分数，以表征其在构建组内的相对竞争变化。这些分数随后用于动态重加权样本损失，从而在优化过程中引入明确的竞争驱动机制。通过这种方式，NS提供了一种简单而有效的方法，超越了统一样本处理，使模型优化更加自适应和平衡。在四个图像分类任务的12个公共数据集上的大量实验表明了该方法的有效性。此外，NS与多种网络架构兼容，并且不依赖于特定任务的假设，表明其强大的通用性和实际潜力。代码将公开发布。

cs.CV / 67 / 2604.12574

Cross-Modal Knowledge Distillation for PET-Free Amyloid-Beta Detection from MRI

基于跨模态知识蒸馏的无PET阿尔茨海默病淀粉样β检测方法

Chiumento, Francesco, Dietlmeier, Julia, Killeen, Ronan P., Curran, Kathleen M., O'Connor, Noel E., Liu, Mingming

Abstract

Detecting amyloid-β (Aβ) positivity is crucial for early diagnosis of Alzheimer's disease but typically requires PET imaging, which is costly, invasive, and not widely accessible, limiting its use for population-level screening. We address this gap by proposing a PET-guided knowledge distillation framework that enables Aβ prediction from MRI alone, without requiring non-imaging clinical covariates or PET at inference. Our approach employs a BiomedCLIP-based teacher model that learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling. An MRI-only student then mimics the teacher via feature-level and logit-level distillation. Evaluated across four MRI contrasts (T1w, T2w, FLAIR, T2*) and two independent datasets, our approach demonstrates effective knowledge transfer (best AUC: 0.74 on OASIS-3, 0.68 on ADNI) while maintaining interpretability and eliminating the need for clinical variables. Saliency analysis confirms that predictions focus on anatomically relevant cortical regions, supporting the clinical viability of PET-free Aβ screening. Code is available at https://github.com/FrancescoChiumento/pet-guided-mri-amyloid-detection.
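The loss terms named in the abstract (triplet contrastive learning, feature-level distillation, logit-level distillation) have widely used textbook forms, sketched below in plain Python. Whether the paper uses exactly these forms is an assumption: the margin, the temperature, the Euclidean distance, and the mean-squared feature loss are illustrative choices, and the Centiloid-aware negative sampling is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for stability."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def feature_distillation(teacher_feat, student_feat):
    """Mean squared error between teacher and student feature vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(diffs) / len(diffs)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


if __name__ == "__main__":
    print(logit_distillation([3.0, 0.0], [0.0, 3.0]))
    print(feature_distillation([1.0, 2.0], [1.0, 4.0]))
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0]))
```

In a teacher-student setup of this kind the student would typically minimize a weighted sum of the distillation terms and its own classification loss; the weights are a tuning choice that the abstract does not specify.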

Chinese Translation

检测淀粉样β（Aβ）阳性对于阿尔茨海默病的早期诊断至关重要，但通常需要PET成像，这种方法成本高、侵入性强且不易获得，限制了其在大规模人群筛查中的应用。我们通过提出一种PET引导的知识蒸馏框架来填补这一空白，该框架能够仅通过MRI进行Aβ预测，而无需在推断时依赖非成像临床协变量或PET。我们的方法采用基于BiomedCLIP的教师模型，通过跨模态注意力和基于PET信息的（Centiloid感知）在线负采样进行三元对比学习，从而学习PET与MRI之间的对齐。然后，只有MRI的学生模型通过特征级和logit级的蒸馏来模仿教师模型。在四种MRI对比（T1w、T2w、FLAIR、T2*）和两个独立数据集上进行评估，我们的方法展示了有效的知识转移（最佳AUC：OASIS-3为0.74，ADNI为0.68），同时保持了解释性，并消除了对临床变量的需求。显著性分析确认预测集中在解剖相关的皮层区域，支持无PET的Aβ筛查的临床可行性。代码可在https://github.com/FrancescoChiumento/pet-guided-mri-amyloid-detection获取。

cs.CV / 68 / 2604.12575

StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

StructDiff：一种结构保持和空间可控的单图像生成扩散模型

He, Yinxi, Liao, Kang, Lin, Chunyu, Wei, Tianyi, Zhao, Yao

Abstract

This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at https://butter-crab.github.io/StructDiff/.

Chinese Translation

本文介绍了StructDiff，一种基于单尺度扩散模型的生成框架，用于单图像生成。单图像生成旨在通过捕捉源图像的内部统计特征，合成与源图像视觉内容相似的多样样本，而无需依赖外部数据。然而，现有方法往往难以保持结构布局，尤其是在处理具有大刚性物体或严格空间约束的图像时。此外，大多数方法缺乏空间可控性，使得引导生成内容的结构或位置变得困难。为了解决这些挑战，StructDiff引入了一种自适应感受野模块，以维持全局和局部分布。在此基础上，StructDiff结合了3D位置编码（PE）作为空间先验，允许对生成对象的位置、尺度和局部细节进行灵活控制。据我们所知，这种空间控制能力代表了基于PE的单图像生成操控的首次探索。此外，我们提出了一种基于大型语言模型（LLMs）的单图像生成新评估标准。该标准专门解决了现有客观指标的局限性以及用户研究相关的高劳动成本。StructDiff还展示了在下游任务中的广泛适用性，如文本引导的图像生成、图像编辑、扩展绘制和绘画到图像合成。大量实验表明，StructDiff在结构一致性、视觉质量和空间可控性方面优于现有方法。项目页面可访问 https://butter-crab.github.io/StructDiff/.

cs.CV / 69 / 2604.12580

PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting

PDF-GS：用于鲁棒3D高斯喷溅的渐进式干扰物过滤

Seo, Kangmin, Lee, MinKyu, Kim, Tae-Young, Lee, ByeongCheol, An, JoonSeoung, Heo, Jae-Pil

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled impressive real-time photorealistic rendering. However, conventional training pipelines inherently assume full multi-view consistency among input images, which makes them sensitive to distractors that violate this assumption and cause visual artifacts. In this work, we revisit an underexplored aspect of 3DGS: its inherent ability to suppress inconsistent signals. Building on this insight, we propose PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), a framework that amplifies this self-filtering property through a progressive multi-phase optimization. The progressive filtering phases gradually remove distractors by exploiting discrepancy cues, while the following reconstruction phase restores fine-grained, view-consistent details from the purified Gaussian representation. Through this iterative refinement, PDF-GS achieves robust, high-fidelity, and distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Moreover, our approach is lightweight and easily adaptable to existing 3DGS frameworks, requiring no architectural changes or additional inference overhead, while achieving new state-of-the-art performance. The code is publicly available at https://github.com/kangrnin/PDF-GS.

Chinese Translation

最近在3D高斯喷溅（3DGS）方面的进展使得实时逼真渲染变得令人印象深刻。然而，传统的训练流程本质上假设输入图像之间具有完全的多视图一致性，这使得它们对违反这一假设的干扰物敏感，从而导致视觉伪影。在本研究中，我们重新审视了3DGS的一个未被充分探索的方面：其固有的抑制不一致信号的能力。基于这一见解，我们提出了PDF-GS（渐进式干扰物过滤用于鲁棒3D高斯喷溅），这是一个通过渐进式多阶段优化来增强这种自我过滤特性的框架。渐进式过滤阶段通过利用差异线索逐步去除干扰物，而随后的重建阶段则从纯化的高斯表示中恢复细致的视图一致性细节。通过这种迭代精炼，PDF-GS实现了鲁棒、高保真且无干扰的重建，在各种数据集和具有挑战性的现实世界条件下始终优于基线。此外，我们的方法轻量且易于适应现有的3DGS框架，无需架构更改或额外的推理开销，从而实现了新的最先进性能。代码已公开发布在 https://github.com/kangrnin/PDF-GS。

cs.CV / 70 / 2604.12582

Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

放宽锚帧主导性以减轻视频大型语言模型中的幻觉现象

Liu, Zijian, Cao, Sihan, Zheng, Pengcheng, Liu, Kuien, Qin, Caiyan, Qin, Xiaolin, Wei, Jiwei, Zhang, Chaoning

Abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
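The anchor-frame definition above (the frame with the highest aggregated frame-level attention mass) is simple to compute, and a toy version of rebalancing can be sketched alongside it. The rebalancing rule below, which moves each frame's mass a fraction `alpha` of the way toward a uniform share, is an illustrative assumption and not the DTR procedure; the function names and the flat list-of-floats attention format are also hypothetical.

```python
def frame_attention_mass(attn, frame_ids, num_frames):
    """Sum the attention each visual token receives, grouped by frame."""
    mass = [0.0] * num_frames
    for weight, frame in zip(attn, frame_ids):
        mass[frame] += weight
    return mass


def anchor_frame(mass):
    """Index of the frame with the highest aggregated attention mass."""
    return max(range(len(mass)), key=mass.__getitem__)


def rebalance_attention(attn, frame_ids, num_frames, alpha=0.5):
    """Move each frame's mass a fraction alpha toward a uniform share.

    Tokens inside a frame keep their relative proportions. Frames with zero
    mass are left unchanged, so the total is preserved only when every frame
    starts with nonzero mass.
    """
    mass = frame_attention_mass(attn, frame_ids, num_frames)
    uniform = sum(mass) / num_frames
    target = [(1 - alpha) * m + alpha * uniform for m in mass]
    return [
        weight * target[frame] / mass[frame] if mass[frame] > 0 else weight
        for weight, frame in zip(attn, frame_ids)
    ]


if __name__ == "__main__":
    attn = [0.6, 0.2, 0.1, 0.1]  # made-up attention over four visual tokens
    frame_ids = [0, 0, 1, 1]     # two tokens per frame
    mass = frame_attention_mass(attn, frame_ids, 2)
    print(anchor_frame(mass), rebalance_attention(attn, frame_ids, 2))
```

Setting `alpha=0` leaves attention untouched and `alpha=1` gives every frame the same total mass; the abstract describes an adaptive, layer-selective calibration, which this fixed-ratio sketch does not attempt to reproduce.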

Chinese Translation

近期的视频大型语言模型（Video-LLMs）在视频理解方面展现出强大的能力，但仍然面临幻觉现象的困扰。现有的减轻方法通常依赖于训练、输入修改、辅助指导或额外的解码程序，而在很大程度上忽视了一个更根本的挑战。在生成过程中，视频大型语言模型往往过度依赖有限的时间证据部分，导致视频中时间证据聚合的不平衡。为了解决这个问题，我们研究了一种解码器侧现象，即模型表现出时间上不平衡的集中模式。我们将具有最高聚合帧级注意力的帧称为锚帧。我们发现这种偏差在很大程度上与输入视频无关，而是反映了一种持久的、特定于模型的结构或位置偏差，其过度主导性与幻觉倾向的生成密切相关。基于这一见解，我们提出了解码器侧时间重平衡（Decoder-side Temporal Rebalancing, DTR），这是一种无训练的、层选择性的推理方法，能够在不改变视觉编码或要求辅助模型的情况下，重新平衡中后期解码器层中的时间证据分配。DTR自适应地校准解码器侧的视觉注意力，以减轻时间上不平衡的集中，并鼓励未被充分关注的帧更有效地参与响应生成。通过这种方式，DTR引导解码器在时间上更广泛和更平衡的视频证据中扎根其输出。在幻觉和视频理解基准上的大量实验表明，DTR在不同的视频大型语言模型家族中始终提高了幻觉的鲁棒性，同时保持了竞争力的视频理解性能和高推理效率。

cs.CV / 71 / 2604.12592

ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction

ELoG-GS：基于亮度引导增强的双分支高斯点渲染用于极低光3D重建

Liu, Yuhao, Wang, Dingju, Zheng, Ziyang

Abstract

This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at https://github.com/lyh120/FSGS_EAPGS.
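For readers less familiar with the reported metric, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), and inverts it. The flat-list image format, the default [0, 1] value range, and the helper name `mse_from_psnr` are assumptions for illustration; the challenge's official evaluation script may differ in detail, and SSIM is not shown.

```python
import math


def mean_squared_error(reference, estimate):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mean_squared_error(reference, estimate)
    if err == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / err)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula to recover the implied MSE."""
    return max_value ** 2 / 10 ** (psnr_db / 10)


if __name__ == "__main__":
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
    print(mse_from_psnr(18.6626))
```

Read on a single image with values in [0, 1], a PSNR near 18.66 dB corresponds to an MSE of roughly 0.0136. A leaderboard figure is usually an average over many test views, so this conversion is only indicative.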

Chinese Translation

本文介绍了我们针对NTIRE 2026三维修复与重建挑战赛（Track 1）的解决方案，该赛道聚焦于从退化的多视角输入中重建高质量的三维表示。该挑战旨在恢复极端低光环境下几何一致且具有真实感的三维场景。为应对该任务，我们提出了极低光优化高斯点渲染（Extreme Low-light Optimized Gaussian Splatting，ELoG-GS），这是一种稳健的低光三维重建流程，集成了基于学习的点云初始化与亮度引导的颜色增强，以实现稳定且具有真实感的高斯点渲染。我们的方法结合了几何感知初始化和光度适应策略，以提升在复杂条件下的重建精度。在NTIRE Track 1基准上的大量实验表明，我们的方法在重建质量上较基线方法有显著提升，达到了更优的视觉真实感和几何一致性。所提方法为现实退化场景中的稳健三维重建提供了实用解决方案。在最终测试阶段，我们的方法在官方平台排行榜上取得了18.6626的PSNR和0.6855的SSIM。代码已公开，地址为：https://github.com/lyh120/FSGS_EAPGS。

cs.CV / 72 / 2604.12600

Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising

空间-光谱自适应保真度与噪声先验减少引导的高光谱图像去噪

Xie, Xuelin, Lu, Xiliang, Wang, Zhengshan, Zhang, Yang, Chen, Long

Abstract

The core challenge of hyperspectral image denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.

Chinese Translation

高光谱图像去噪的核心挑战在于数据保真度与噪声先验建模之间的平衡。现有大多数方法过于强调图像的内在先验，而忽视了多样的噪声假设以及保真度与先验之间的动态权衡。为了解决这些问题，我们提出了一种去噪框架，该框架集成了噪声先验减少和空间-光谱自适应保真度项。该框架考虑了更少参数的综合噪声先验，并引入了自适应权重张量，以动态平衡保真度与先验正则化项。在此框架内，我们进一步开发了一种快速且稳健的逐像素模型，结合了代表性系数总变差正则化器，以准确去除高光谱图像中的混合噪声。所提方法不仅有效处理各种类型的噪声，还准确捕捉高光谱图像的光谱低秩结构和局部平滑性。基于交替方向乘子法设计了一种高效的优化算法，以确保稳定和快速的收敛。在模拟和真实世界数据集上的大量实验表明，所提模型在保持竞争性计算效率的同时，达到了优越的去噪性能。

cs.CV / 73 / 2604.12622

Efficient Semantic Image Communication for Traffic Monitoring at the Edge

边缘交通监控的高效语义图像通信

Assylbek, Damir, Aitymbetov, Nurmukhammed, Ristin, Marko, Zorbas, Dimitrios

Abstract

Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15 s for MMSD and 9 s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.

Chinese Translation

许多视觉监控系统在严格的通信限制下运行，传输全分辨率图像既不切实际也往往不必要。在这种情况下，视觉数据通常用于物体存在、空间关系和场景上下文，而不是精确的像素保真度。本文提出了两种用于交通监控的语义图像通信管道，MMSD（多模态语义分解）和SAMR（语义感知掩蔽重建），它们在保留有意义的视觉信息的同时降低了传输成本。MMSD旨在实现非常高的压缩率，同时确保数据机密性，因为敏感的像素内容不会被传输。它用紧凑的语义表示替代原始图像，即分割图、边缘图和文本描述，并使用基于扩散的生成模型在接收端重建场景。SAMR则在保持强压缩的同时，针对更高的视觉质量。它根据语义重要性选择性地抑制非关键图像区域，然后进行标准JPEG编码，并通过生成修复在接收端恢复缺失的内容。两种设计均遵循不对称的发送者-接收者架构，其中轻量级处理在边缘进行，而计算密集型重建则卸载到服务器。在Raspberry Pi 5上，边缘侧处理时间对于MMSD约为15秒，对于SAMR约为9秒。实验结果显示，MMSD的平均传输数据减少了99%，而SAMR减少了99.1%。此外，MMSD在保持强语义一致性的同时，其有效载荷大小低于最近的SPIC基线，而SAMR在可比操作条件下提供了比标准JPEG和SQ-GAN更好的质量-压缩权衡。

cs.CV / 74 / 2604.12630

GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

GeoAlign：用于多模态大语言模型空间推理的几何特征重新对齐

Liu, Zhaochen, Qiao, Limeng, Wan, Guanglu, Jiang, Tingting

Abstract

Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.

Chinese Translation

多模态大语言模型（MLLMs）在多种视觉任务中表现出显著性能，但在空间推理方面仍存在困难。近期的研究通过注入来自三维基础模型的几何特征来缓解这一问题，然而这些方法依赖于静态的单层特征提取。我们发现这种方法会引入任务错配偏差：几何特征自然地朝向三维预训练目标演化，可能与MLLMs异构的空间需求相矛盾，导致任何单层特征本质上都不充分。为了解决这一问题，我们提出了GeoAlign，一种动态聚合多层几何特征以重新对齐实际需求的新框架。GeoAlign构建了分层的几何特征库，并利用MLLM原始的视觉tokens作为内容感知查询，执行逐层稀疏路由，自适应地为每个图像块提取合适的几何特征。在VSI-Bench、ScanQA和SQA3D上的大量实验表明，我们紧凑的4B模型有效地实现了最先进的性能，甚至优于现有更大规模的MLLMs。

cs.CV / 75 / 2604.12650

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

听觉深伪检测：超越以说话为中心的伪造分析的新视角

Liu, Miao, Wei, Fangda, Wang, Jing, Qian, Xinyuan

Abstract

Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.

Chinese Translation

现有的深伪检测研究主要集中在被操控主体主动说话的场景，即通过改变说话者的外貌或声音来生成虚假的内容。然而，在现实的互动环境中，攻击者往往在伪造说话和听取状态之间交替，以误导目标，从而增强场景的真实感和说服力。尽管对“听觉深伪”的检测仍然基本未被探索，并且受到数据集和方法论稀缺的限制，但合成的听觉反应质量相对有限，为当前的深伪检测工作提供了一个良好的突破机会。在本文中，我们提出了听觉深伪检测（Listening Deepfake Detection, LDD）这一任务。我们引入了ListenForge，这是第一个专门为该任务设计的数据集，采用了五种听觉头生成（Listening Head Generation, LHG）方法构建。为了应对听觉伪造的独特特征，我们提出了MANet，一种运动感知和音频引导网络，能够捕捉听者视频中的微妙运动不一致性，同时利用说话者的音频语义来指导跨模态融合。大量实验表明，现有的说话深伪检测（Speaking Deepfake Detection, SDD）模型在听觉场景中的表现较差。相比之下，MANet在ListenForge上实现了显著优越的性能。我们的工作强调了重新思考深伪检测的重要性，超越传统的以说话为中心的范式，并为互动通信环境中的多模态伪造分析开辟了新的方向。数据集和代码可在 https://anonymous.4open.science/r/LDD-B4CB 获取。

cs.CV / 76 / 2604.12652

PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

PromptEcho：基于视觉-语言模型的无标注奖励构建方法用于文本到图像的强化学习

Liu, Jinlong, He, Wanggui, Zhang, Peng, Liu, Mushui, Jiang, Hao, Huang, Pipei

Abstract

Reinforcement learning (RL) can improve the prompt-following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt-following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
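
The reward computation described in the abstract can be illustrated with a minimal sketch. It assumes the frozen VLM has already produced one logit vector per prompt token, conditioned on the generated image and the guiding query; that interface, the function names, and the choice to use the negative mean cross-entropy as the reward are illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def prompt_echo_reward(token_logits, prompt_token_ids):
    """Return the negative mean token-level cross-entropy of the prompt.

    token_logits: one logit vector per prompt position, assumed to come from a
        frozen VLM conditioned on the generated image and the guiding query
        (hypothetical interface; the VLM call itself is not shown).
    prompt_token_ids: token ids of the original prompt, used as labels.
    """
    if not prompt_token_ids or len(token_logits) != len(prompt_token_ids):
        raise ValueError("need one logit vector per prompt token")
    losses = [
        -log_softmax(logits)[tok]
        for logits, tok in zip(token_logits, prompt_token_ids)
    ]
    # Lower cross-entropy means the image "echoes" the prompt better,
    # so the reward is the negated mean loss (assumed sign convention).
    return -sum(losses) / len(losses)


# Toy check with a 3-token vocabulary: logits that favor the prompt tokens
# should earn a higher reward than logits that favor other tokens.
aligned = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
misaligned = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
prompt_ids = [0, 1]
reward_aligned = prompt_echo_reward(aligned, prompt_ids)
reward_misaligned = prompt_echo_reward(misaligned, prompt_ids)
```

Because the VLM is frozen and no sampling is involved, the same image and prompt always yield the same reward, which matches the determinism claimed in the abstract.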

Chinese Translation

强化学习（RL）能够提升文本到图像（T2I）模型的提示遵循能力，但获取高质量的奖励信号仍然具有挑战性：CLIP Score过于粗糙，而基于视觉-语言模型（VLM）的奖励模型（如RewardDance）则需要昂贵的人类偏好标注数据及额外的微调。我们提出了PromptEcho，一种无需任何标注且无需训练奖励模型的奖励构建方法。给定生成的图像和指导查询，PromptEcho通过冻结的VLM计算以原始提示为标签的逐词交叉熵损失，直接提取VLM预训练期间编码的图文对齐知识。该奖励具有确定性、计算高效，并且随着更强大的开源VLM的出现自动提升。为评估该方法，我们开发了DenseAlignBench，一个包含丰富概念密集描述的基准，用于严格测试提示遵循能力。在两种最先进的T2I模型（Z-Image和QwenImage-2512）上的实验结果表明，PromptEcho在DenseAlignBench上实现了显著提升（净胜率提升+26.8个百分点/+16.2个百分点），并且在GenEval、DPG-Bench和TIIFBench上均有稳定增益，且无需任何特定任务训练。消融研究证实，PromptEcho在同一VLM下全面优于基于推理的评分方法，且奖励质量随VLM规模提升。我们将开源训练好的模型及DenseAlignBench基准。

cs.CV / 77 / 2604.12665

Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

超图状态协同推理用于多目标跟踪

Song, Zikai, Yu, Junqing, Chen, Yi-Ping Phoebe, Yang, Wei, Wang, Xinchao

Abstract

Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

Chinese Translation

运动推理是多目标跟踪（MOT）的基石，因为它使得在不同帧之间能够一致地关联目标。然而，现有的运动估计方法面临两个主要限制：（1）由于噪声或概率预测引起的不稳定性，以及（2）在遮挡情况下的脆弱性，当视觉线索消失时，轨迹往往会出现碎片化。为了解决这些问题，我们提出了一种协同推理框架，通过多个相关对象之间的联合推理来增强运动估计。通过允许具有相似运动状态的对象相互约束和细化，我们的框架稳定了噪声轨迹，并在目标被遮挡时推断出合理的运动连续性。为了实现这一概念，我们设计了HyperSSM，一种将超图计算和状态空间模型（State Space Model, SSM）整合在一起的架构，以实现统一的时空推理。超图模块通过动态超边捕捉空间运动相关性，而SSM则通过结构化状态转移强制执行时间平滑性。这种协同设计使得空间一致性和时间连贯性能够同时优化，从而实现稳健和稳定的运动估计。在四个主流且多样的基准数据集（MOT17、MOT20、DanceTrack和SportsMOT）上进行的广泛实验，涵盖了各种运动模式和场景复杂性，证明了我们的方法在广泛的跟踪场景中达到了最先进的性能。

cs.CV / 78 / 2604.12668

OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

OFA-Diffusion 压缩：一次性压缩扩散模型

Jiang, Haoyang, Wang, Zekun, Yi, Mingyang, Li, Xiuyu, Hu, Lanqing, Cai, Junxian, Liu, Qingbin, Chen, Xi, Fan, Ju

Abstract

The Diffusion Probabilistic Model (DPM) achieves remarkable performance in image generation, while its increasing parameter size and computational overhead hinder its deployment in practical applications. To improve this, the existing literature focuses on obtaining a smaller model with a fixed architecture through model compression. However, in practice, DPMs usually need to be deployed on various devices with different resource constraints, which leads to multiple compression processes, incurring significant overhead for repeated training. To obviate this, we propose a once-for-all (OFA) compression framework for DPMs that yields different subnetworks with various computations in a one-shot training manner. The existing OFA framework typically involves massive subnetworks with different parameter sizes, while such a huge candidate space slows the optimization. Thus, we propose to restrict the candidate subnetworks with a certain set of parameter sizes, where each size corresponds to a specific subnetwork. Specifically, to construct each subnetwork with a given size, we gradually allocate the maintained channels by their importance. Furthermore, we propose a reweighting strategy to balance the optimization process of different subnetworks. Experimental results show that our approach can produce compressed DPMs for various sizes with significantly lower training overhead while achieving satisfactory performance.
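
The idea of building one subnetwork per target size by allocating channels according to importance can be sketched as follows. The importance scores, the per-channel parameter costs, and the greedy selection rule are all assumptions made for illustration; the abstract does not specify how importance is measured or how the allocation proceeds step by step.

```python
def allocate_channels(importance, cost, budgets):
    """Greedily pick channels by importance for each parameter budget.

    importance: dict mapping (layer, channel) -> importance score (assumed given).
    cost: dict mapping (layer, channel) -> parameter count of that channel.
    budgets: parameter budgets, one per target subnetwork.
    Returns a dict mapping budget -> set of kept (layer, channel) pairs.
    """
    # Most important channels are considered first.
    ranked = sorted(importance, key=lambda ch: importance[ch], reverse=True)
    subnetworks = {}
    for budget in sorted(budgets):
        kept, used = set(), 0
        for ch in ranked:
            # Keep a channel only if it still fits in this subnetwork's budget.
            if used + cost[ch] <= budget:
                kept.add(ch)
                used += cost[ch]
        subnetworks[budget] = kept
    return subnetworks


# Hypothetical two-layer model with uniform channel cost.
importance = {
    ("conv1", 0): 0.9,
    ("conv1", 1): 0.1,
    ("conv2", 0): 0.5,
    ("conv2", 1): 0.7,
}
cost = {ch: 10 for ch in importance}
subnets = allocate_channels(importance, cost, budgets=[20, 30])
```

This sketch does nothing to stop a layer from losing all of its channels; a practical version would presumably enforce a per-layer minimum.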

Chinese Translation

扩散概率模型（Diffusion Probabilistic Model, DPM）在图像生成方面取得了显著的性能，但其不断增长的参数规模和计算开销阻碍了其在实际应用中的部署。为此，现有研究主要通过模型压缩获得固定架构的更小模型。然而，实际中DPM通常需要部署在具有不同资源限制的多种设备上，这导致需要多次压缩过程，带来重复训练的巨大开销。为解决该问题，我们提出了一种针对DPM的once-for-all（OFA）压缩框架，该框架通过一次训练即可生成具有不同计算量的多种子网络。现有的OFA框架通常涉及大量不同参数规模的子网络，庞大的候选空间会降低优化效率。因此，我们提出限制候选子网络在一定参数规模集合内，每个规模对应一个特定子网络。具体而言，为构建给定规模的子网络，我们根据通道的重要性逐步分配保留的通道数。此外，我们提出了一种重新加权策略以平衡不同子网络的优化过程。实验结果表明，我们的方法能够以显著降低的训练开销生成多种规模的压缩DPM，同时保持令人满意的性能表现。

cs.CV / 79 / 2604.12683

Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

Brain-DiT：一种具有元数据条件预训练的通用多状态fMRI基础模型

Xia, Junfeng, Ye, Wenhao, Pan, Xuanye, Shen, Xinke, Wang, Mo, Liu, Quanying

Abstract

Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at https://github.com/REDMAO4869/Brain-DiT.

Chinese Translation

当前的fMRI基础模型主要依赖于有限范围的脑状态和不匹配的预训练任务，这限制了它们在多样脑状态中学习通用表示的能力。我们提出了Brain-DiT，一种在来自24个数据集的349,898个会话上预训练的通用多状态fMRI基础模型，涵盖静息、任务、自然状态、疾病和睡眠状态。与之前依赖于原始信号空间或潜在空间中的掩蔽重建的fMRI基础模型不同，Brain-DiT采用了基于元数据条件的扩散预训练，结合扩散变换器（Diffusion Transformer, DiT），使模型能够学习捕捉细粒度功能结构和全局语义的多尺度表示。在对7个下游任务进行广泛评估和消融实验中，我们发现基于扩散的生成预训练是比重建或对齐更强的代理，而元数据条件预训练通过将内在神经动态与群体水平变异解耦，进一步提高了下游性能。我们还观察到，下游任务对表示尺度表现出不同的偏好：ADNI分类更依赖于全局语义表示，而年龄/性别预测则相对更依赖于细粒度局部结构。Brain-DiT的代码和参数可在 https://github.com/REDMAO4869/Brain-DiT 获取。

cs.CV / 80 / 2604.12693

Risk-Calibrated Learning: Minimizing Fatal Errors in Medical AI

风险校准学习：最小化医疗人工智能中的致命错误

Mohammadi-Seif, Abolfazl, Baeza-Yates, Ricardo

Abstract

Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. These high-confidence mistakes that are semantically incoherent (e.g., classifying a malignant tumor as benign) fundamentally differ from acceptable errors which stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach in four different imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) for all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
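
To make the role of a clinical severity matrix concrete, here is a minimal sketch of one way such a matrix M could enter a loss: cross-entropy plus the expected severity of the predicted distribution. This additive form, the `weight` parameter, and the toy class labels are assumptions for illustration; the abstract does not give the exact formulation of the Risk-Calibrated Loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def risk_calibrated_loss(logits, true_class, severity, weight=1.0):
    """Cross-entropy plus an expected-severity penalty (assumed form).

    severity[i][j] is the clinical cost of predicting class j when the truth
    is class i, with a zero diagonal. `weight` is a hypothetical trade-off knob.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[true_class])
    expected_severity = sum(
        severity[true_class][j] * p for j, p in enumerate(probs)
    )
    return cross_entropy + weight * expected_severity


# Toy 3-class setup: 0 = benign, 1 = atypical, 2 = malignant (illustrative).
# Calling a malignant case benign costs far more than calling it atypical.
M = [
    [0.0, 0.2, 0.5],
    [0.2, 0.0, 0.5],
    [5.0, 1.0, 0.0],
]
fatal = risk_calibrated_loss([2.0, 0.0, 0.0], true_class=2, severity=M)
mild = risk_calibrated_loss([0.0, 2.0, 0.0], true_class=2, severity=M)
```

In this toy case both predictions assign the same probability to the true class, so plain cross-entropy cannot tell them apart; only the severity term separates the malignant-as-benign mistake from the milder one.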

Chinese Translation

深度学习模型在医学图像分类中常常达到专家级的准确率，但存在一个关键缺陷：语义不一致性。这些高置信度且语义不一致的错误（例如将恶性肿瘤误判为良性）本质上不同于源自视觉模糊的可接受错误。与安全的细粒度分歧不同，这些致命失败削弱了临床信任。为了解决这一问题，我们提出了风险校准学习（Risk-Calibrated Learning），该技术明确区分视觉模糊（细粒度错误）和灾难性结构性错误。通过将一个混淆感知的临床严重性矩阵M嵌入优化过程，我们的方法在无需复杂架构改动的情况下抑制关键错误（假阴性）。我们在四种不同成像模态中验证了该方法：脑肿瘤MRI、ISIC 2018（皮肤镜）、BreaKHis（乳腺组织病理学）和SICAPv2（前列腺组织病理学）。大量实验表明，我们的风险校准损失函数在所有四个数据集上均持续降低了关键错误率（CER），相较于Focal Loss等最先进基线方法，实现了从20.0%（乳腺组织病理学）到92.4%（前列腺组织病理学）的相对安全性提升。这些结果证实了我们的方法在卷积神经网络（CNN）和Transformer架构中均提供了优越的安全性与准确性权衡。

cs.CV / 81 / 2604.12735

AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

AffectAgent：用于检索增强的多模态情感识别的协作多智能体推理

Wang, Zeheng, Yu, Zitong, Zhu, Yijie, Zhao, Bo, Liang, Haochen, Wang, Taorui, Xia, Wei, Zhang, Jiayu, Liu, Zhishu, Ma, Hui, Ma, Fei, Tian, Qi

Abstract

LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. Single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities. To address this, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.

Chinese Translation

基于大型语言模型（LLM）的多模态情感识别依赖于静态参数记忆，并且在解释细微的情感状态时常常产生幻觉。本文指出，单轮检索增强生成在处理模态歧义时高度敏感，因此难以捕捉跨模态的复杂情感依赖关系。我们提出了AffectAgent，这是一种面向情感的多智能体检索增强生成框架，利用智能体之间的协作决策来实现细粒度的情感理解。具体而言，AffectAgent由三个联合优化的专门智能体组成，即查询规划者、证据过滤器和情感生成器，它们协同进行分析推理，以检索跨模态样本、评估证据并生成预测。这些智能体通过多智能体近端策略优化（MAPPO）进行端到端优化，并共享情感奖励，以确保一致的情感理解。此外，我们引入了模态平衡专家混合（MB-MoE）和检索增强自适应融合（RAAF），其中MB-MoE动态调节不同模态的贡献，以减轻由跨模态异质性引起的表示不匹配，而RAAF通过结合检索到的视听嵌入，在缺失模态条件下增强语义补全。在MER-UniBench上的大量实验表明，AffectAgent在复杂场景中表现出色。我们的代码将发布于：https://github.com/Wz1h1NG/AffectAgent。

cs.CV / 82 / 2604.12752

Scaling In-Context Segmentation with Hierarchical Supervision

通过层次监督扩展上下文分割

Ndir, T. Camaret, Reisert, Marco, Schirrmeister, Robin T.

Abstract

In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at 512×512 resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at https://github.com/tidiane-camaret/ic_segmentation

Chinese Translation

上下文学习（ICL）使医学图像分割模型能够从有限的示例中适应新的解剖结构，从而减少临床标注负担。然而，标准的ICL方法通常依赖于密集的全局交叉注意力，这在图像分辨率较高时表现不佳。虽然最近的方法引入了局部注意力机制，但它们往往缺乏对选择过程的明确监督，导致在非信息区域的冗余计算。我们提出了PatchICL，这是一种结合选择性图像补丁和多层次监督的层次框架。我们的方法学习主动识别并仅关注最具信息量的解剖区域。与强大的全局注意力基线UniverSeg相比，PatchICL在512×512分辨率下实现了具有竞争力的领域内CT分割准确性，同时计算量减少了44%。在涵盖多种成像模式的35个领域外数据集上，PatchICL在13个模式类别中的6个上超越了基线，尤其在以局部病理为主的成像模式（如OCT和皮肤镜）上表现突出。训练和评估代码可在https://github.com/tidiane-camaret/ic_segmentation获取。

cs.CV / 83 / 2604.12762

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

ARGOS：在代理多摄像头人物搜索中的谁、哪里和何时

Kim, Myungchul, Park, Kwanyong, Kim, Junmo, Kweon, In So

Abstract

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
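
As a rough illustration of how a graph of camera connectivity and transition times could be used to eliminate candidates, the sketch below keeps only sightings that are reachable from the last confirmed sighting within a plausible travel-time window. The dictionary layout, the (min, max) time windows, and all camera and candidate names are hypothetical; the actual STTG structure is not described in the abstract.

```python
def plausible_sightings(sttg, last_seen, sightings):
    """Filter candidate sightings by camera connectivity and travel time.

    sttg: dict mapping (from_camera, to_camera) -> (min_seconds, max_seconds);
        a missing key means the two cameras are not directly connected.
    last_seen: (camera, timestamp_seconds) of the confirmed sighting.
    sightings: list of (candidate_id, camera, timestamp_seconds).
    Returns the ids of candidates whose sighting is spatio-temporally plausible.
    """
    cam0, t0 = last_seen
    kept = []
    for candidate_id, cam, t in sightings:
        window = sttg.get((cam0, cam))
        if window is None:
            continue  # no direct path between the two cameras
        lo, hi = window
        if lo <= t - t0 <= hi:
            kept.append(candidate_id)
    return kept


# Hypothetical three-camera layout with transition windows in seconds.
sttg = {("lobby", "hall"): (20, 60), ("hall", "exit"): (30, 90)}
sightings = [
    ("p1", "hall", 140),  # 40 s after the lobby sighting: plausible
    ("p2", "hall", 105),  # 5 s after: too fast to be the same person
    ("p3", "exit", 150),  # lobby and exit are not directly connected
]
remaining = plausible_sightings(sttg, ("lobby", 100), sightings)
```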

Chinese Translation

我们介绍了ARGOS，这是第一个基准和框架，将多摄像头人物搜索重新定义为一个交互推理问题，要求代理在信息不对称的情况下进行计划、提问和排除候选者。ARGOS代理接收模糊的证人陈述，并必须决定询问什么、何时调用空间或时间工具，以及如何解释模糊的回应，所有这些都在有限的回合预算内进行。推理基于一个时空拓扑图（Spatio-Temporal Topology Graph, STTG），该图编码了摄像头的连接性和经过实证验证的过渡时间。该基准包含2,691个任务，涵盖14个真实世界场景，分为三个逐步进展的轨道：语义感知（Who）、空间推理（Where）和时间推理（When）。与四个大型语言模型（LLM）骨干的实验表明，该基准远未解决（最佳TWS：轨道2为0.383，轨道3为0.590），而消融实验确认，去除特定领域工具会使准确率下降多达49.6个百分点。

cs.CV / 84 / 2604.12765

A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture

复杂4D无标记人类动作捕捉的数据集与评估

Park, Yeeun, Naduthodi, Miqdad, Kumar, Suryansh

Abstract

Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.

Chinese Translation

基于标记的动作捕捉（MoCap）系统长期以来一直是准确4D人类建模的黄金标准，但其对专用硬件和标记的依赖限制了可扩展性和实际应用。推动可靠的无标记4D人类动作捕捉需要反映现实世界人类互动复杂性的数据集。然而，现有基准往往缺乏真实的多人物动态、严重的遮挡和具有挑战性的互动模式，导致持续的领域差距。在本研究中，我们提出了一个新的数据集和复杂4D无标记人类动作捕捉的评估。我们提出的MoCap数据集捕捉了单人和多人场景中的复杂动作、频繁的人际遮挡、相似着装对象之间的快速位置交换以及不同的被试距离。该数据集包括同步的多视角RGB和深度序列、准确的相机标定、来自Vicon系统的真实3D动作捕捉数据，以及相应的SMPL/SMPL-X参数。该设置确保了视觉观测与动作真实值之间的精确对齐。对最先进的无标记MoCap模型进行基准测试显示，在这些真实条件下性能显著下降，突显了当前方法的局限性。我们进一步证明，针对性的微调可以改善模型的泛化能力，验证了数据集的真实性和对模型开发的价值。我们的评估揭示了现有模型中的关键缺口，并为推动稳健的无标记4D人类动作捕捉提供了严格的基础。

cs.CV / 85 / 2604.12767

CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

CLASP：面向多模态大语言模型的类别自适应层融合与双阶段剪枝

Dang, Yunkai, Jiang, Yizhu, Jiang, Yifan, Fan, Qi, Shi, Yinghuan, Li, Wenbin, Gao, Yang

Abstract

Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
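
The dual-stage budget split described above can be sketched in a few lines: first keep the most attention-salient tokens as pivots, then fill the rest of the budget with tokens that are least similar to what is already kept. The `pivot_ratio` parameter, the cosine-similarity redundancy measure, and the greedy fill are assumptions for illustration, and the class-adaptive, prompt-conditioned layer fusion is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def dual_stage_prune(tokens, attention, budget, pivot_ratio=0.5):
    """Return the indices of `budget` tokens: salient pivots, then diverse completions."""
    n_pivot = min(budget, max(1, round(budget * pivot_ratio)))
    order = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    # Stage 1: pivot tokens chosen purely by attention saliency (relevance).
    kept = order[:n_pivot]
    remaining = [i for i in order if i not in kept]
    # Stage 2: completion tokens chosen to be least redundant with kept ones (coverage).
    while len(kept) < budget and remaining:
        best = min(
            remaining,
            key=lambda i: max(cosine(tokens[i], tokens[k]) for k in kept),
        )
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)


# Toy visual tokens: token 1 is nearly a duplicate of token 0.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
attention = [0.9, 0.8, 0.1, 0.3]
selected = dual_stage_prune(tokens, attention, budget=2)
```

Selecting the top two tokens by attention alone would keep tokens 0 and 1, which carry almost the same information; the second stage instead trades the near-duplicate for a token that covers a different direction.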

Chinese Translation

多模态大语言模型（Multimodal Large Language Models, MLLMs）因视觉令牌序列中存在大量冗余而面临显著的计算开销。现有方法通常采用单层视觉变换器（Vision Transformer, ViT）特征和静态剪枝策略来解决该问题，然而这类固定配置在多样化指令下往往表现脆弱。为克服这些限制，我们提出了CLASP，一种基于类别自适应层融合与双阶段剪枝的即插即用令牌压缩框架。具体而言，CLASP首先通过多层视觉特征融合构建类别特定的视觉表示，随后执行双阶段剪枝，将令牌预算分配于关注相关性的注意力关键令牌（attention-salient pivot tokens）和考虑冗余的覆盖令牌（redundancy-aware completion tokens）。通过类别自适应剪枝，CLASP实现了基于提示的特征融合与预算分配，允许激进且稳健的视觉令牌压缩。大量实验表明，CLASP在多种基准、剪枝比例及MLLM架构上均持续优于现有方法。代码将发布于https://github.com/Yunkaidang/CLASP。

cs.CV / 86 / 2604.12772

A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

一种多智能体反馈系统用于检测和描述卫星图像中的新闻事件

Anderson, Madeline, Klassen, Mikhail, Hoover, Ash, Cahoy, Kerri

Abstract

Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

Chinese Translation

卫星图像的变化通常发生在多个时间步骤上。尽管双时间变化标注数据集的出现，但在遥感领域仍缺乏多时间事件标注数据集（每个序列至少两幅图像）。这一空白的存在是因为（1）在卫星图像中搜索可见事件和（2）标注多时间序列需要大量的时间和人力。为了解决这些挑战，我们提出了SkyScraper，一个迭代的多智能体工作流程，该流程对新闻文章进行地理编码，并为相应的卫星图像序列合成标注。我们的实验表明，SkyScraper成功地发现了比传统地理编码方法多5倍的事件，证明了智能反馈是一种有效的策略，用于在卫星图像中挖掘新的多时间事件。我们将我们的框架应用于一个大型全球新闻文章数据库，策划了一个包含5,000个序列的新多时间标注数据集。通过自动识别与新闻事件相关的图像，我们的工作也支持了新闻报道和报道工作。

cs.CV / 87 / 2604.12777

Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

基于认知启发的双流语义增强用于视觉动态情感建模

Wang, Huanzhen, Zhou, Ziheng, Tao, Zeng, Li, Aoxing, Zhao, Yingkai, Lin, Yuxuan, Wang, Yan, Zhang, Wenqiang

Abstract

The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.

Chinese Translation

人脑构建情感感知并非孤立地处理面部表情，而是通过感官输入与语义及情境知识的动态层级整合实现的。然而，现有基于视觉的动态情感建模方法往往忽视了情感感知和认知理论。为弥合机器与人类情感感知之间的差距，我们提出了认知启发的双流语义增强（Dual-stream Semantic Enhancement，DuSE）。该模型实现了双流认知架构。第一条流为层级时间提示聚类（Hierarchical Temporal Prompt Cluster，HTPC），其体现了认知启动效应，模拟语言线索如何预先激活神经通路，通过将文本语义与面部动态的细粒度时间特征对齐，调节对视觉刺激的处理。第二条流为潜在语义情感聚合器（Latent Semantic Emotion Aggregator，LSEA），计算建模了类似概念行为理论（Conceptual Act Theory）描述的知识整合过程，聚合感官输入并与学习到的概念知识合成，反映了海马体和默认模式网络在构建连贯情感体验中的作用。通过显式建模这些神经认知机制，DuSE为动态面部表情识别（Dynamic Facial Expression Recognition，DFER）提供了更符合神经科学且鲁棒的框架。在复杂的真实环境基准测试中，广泛实验验证了我们以认知为中心的方法，表明模拟大脑情感处理策略不仅实现了最先进的性能，还增强了模型的可解释性。

cs.CV / 88 / 2604.12780

Efficient Adversarial Training via Criticality-Aware Fine-Tuning

通过关键性意识微调实现高效的对抗训练

Li, Wenyun, Zhang, Zheng, Jiang, Dongmei, Wang, Yaowei, Lan, Xiangyuan

Abstract

Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale. For example, compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.

Chinese Translation

视觉变换器（Vision Transformer, ViT）模型在各种视觉任务中取得了显著的性能，扩展性是其在大型数据集应用中的关键优势。这种扩展性使得ViT模型展现出强大的泛化能力。然而，随着参数数量的增加，ViT模型对对抗样本的鲁棒性并未按比例提升。对抗训练（Adversarial Training, AT）是增强鲁棒性的最有效方法之一，通常需要对整个模型进行微调，这导致了高昂的计算成本，尤其是对于大型ViT架构。本文旨在仅对少量参数进行鲁棒微调，以实现与标准AT相当的鲁棒性。为此，我们提出了关键性意识对抗训练（Criticality-Aware Adversarial Training, CAAT），这是一种新颖的方法，能够自适应地将资源分配给最关键的鲁棒性参数，仅微调选定的模块。具体而言，CAAT有效识别对抗鲁棒性贡献最大的参数。然后，它利用参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）来鲁棒地调整权重矩阵，其中关键参数的数量超过预定义阈值。CAAT在扩展到更大的视觉变换器架构时表现出良好的泛化能力，可能为大规模对抗训练铺平道路，例如，与普通对抗训练相比，CAAT仅在调整约6%的参数时导致对抗鲁棒性下降4.3%。在三个广泛使用的对抗学习数据集上进行的广泛实验表明，CAAT在可训练参数更少的情况下优于最先进的轻量级AT方法。

cs.CV / 89 / 2604.12781

Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

脆弱的重建：基于重建的扩散生成图像检测器的对抗脆弱性

Jiang, Haoyang, Yi, Mingyang, Zhang, Shaolei, Cai, Junxian, Liu, Qingbin, Chen, Xi, Fan, Ju

Abstract

Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.

Chinese Translation

近年来，基于扩散模型生成的AI图像检测因其潜在的安全威胁而受到越来越多的关注。在现有方法中，基于重建的检测方法已成为该任务的重要范式。然而，我们发现此类方法对对抗扰动存在严重的安全漏洞；即通过对输入图像添加不可察觉的对抗扰动，分类器的检测准确率几乎降至零。为验证这一威胁，我们系统评估了三种代表性检测器在四种不同生成骨干模型上的对抗鲁棒性。首先，我们在白盒场景下构建对抗攻击，显著降低了所有训练良好的检测器的性能。此外，我们发现这些攻击具有迁移性；具体而言，针对某一检测器设计的攻击可以转移至其他检测器，表明对检测器的对抗攻击也可在黑盒环境下构建。最后，我们评估了常见的防御措施，发现标准的对抗攻击防御方法仅能提供有限的缓解效果。我们将这些失败归因于检测器感知到的被攻击样本的低信噪比（SNR）。总体而言，我们的结果揭示了基于重建的检测器在安全性上的根本局限性，并强调了重新思考现有检测策略的必要性。

cs.CV / 90 / 2604.12803

Generative Anonymization in Event Streams

事件流中的生成匿名化

Müller, Adam T., Kocsis, Mihai, Stache, Nicolaj C.

Abstract

Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.

Chinese Translation

神经形态视觉传感器提供了低延迟和高动态范围，但在公共场所的部署引发了严重的数据保护问题。近期的事件到视频（Event-to-Video, E2V）模型能够从稀疏事件流中重建高保真度的强度图像，意外地暴露了人类身份。目前的模糊化方法，如遮罩或打乱，破坏了时空结构，严重降低了下游感知任务的数据效用。在本文中，尽我们所知，我们提出了第一个针对事件流的生成匿名化框架，以解决效用与隐私之间的权衡。通过弥合异步事件与标准空间生成模型之间的模态差距，我们的管道将事件投影到中间强度表示，利用预训练模型合成逼真且不存在的身份，并将特征重新编码回神经形态域。实验表明，我们的方法可靠地防止了从E2V重建中恢复身份，同时保留了下游视觉任务所需的结构数据完整性。最后，为了促进严格评估，我们引入了一个新颖的同步真实世界事件和RGB数据集，该数据集通过精确的机器人轨迹捕获，为未来在隐私保护神经形态视觉领域的研究提供了一个稳健的基准。

cs.CV / 91 / 2604.12805

Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors

嵌入旋转对称先验的图像到图像转换框架

Tan, Feiyu, Yang, Heran, Duan, Qihong, Ye, Kai, Xie, Qi, Meng, Deyu

Abstract

Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and the reliance on unsupervised learning frameworks still hinder their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation-group equivariant convolutions to build a rotation-equivariant I2I framework, which is, to the best of our knowledge, a novel contribution along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study of image symmetry priors on real datasets and propose a novel transformation-learnable equivariant convolution (TL-Conv) that adaptively learns transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and providing a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and their broad applicability. Our code is available at https://github.com/tanfy929/Equivariant-I2I

Chinese Translation

图像到图像转换（I2I）是计算机视觉中的一项基础任务，旨在将源领域的输入图像映射到目标领域的相应图像，同时保持领域不变特征并适应领域特定属性。尽管基于深度学习的I2I方法取得了显著成功，但缺乏配对数据和无监督学习框架仍然阻碍了它们的有效性。在本研究中，我们通过将变换对称先验融入图像到图像转换网络来解决这一挑战。具体而言，我们引入了旋转群等变卷积，以实现旋转等变的I2I框架，这是我们在这一研究方向上的新颖贡献。该设计确保在整个网络中保持旋转对称性，这是自然和科学图像中最内在且领域不变的特性之一。此外，我们对真实数据集上的图像对称先验进行了系统研究，并提出了一种新颖的可学习变换等变卷积（TL-Conv），该卷积自适应地学习变换群，从而增强了在不同数据集上的对称性保持。我们还对TL-Conv的等变误差进行了理论分析，证明其在连续领域中保持精确的等变性，并为离散情况下的误差提供了界限。通过在一系列I2I任务上的广泛实验，我们验证了我们方法的有效性和优越性能，突显了等变网络在提升生成质量及其广泛适用性方面的潜力。我们的代码可在https://github.com/tanfy929/Equivariant-I2I获取。

cs.CV / 92 / 2604.12807

Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach

重新思考卫星图像恢复以适应机载人工智能：一种轻量级学习基础的方法

Dorise, Adrien, Bellizzi, Marjorie, Hlimi, Omar

Abstract

Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.
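To put the reported +6.9 dB PSNR gain in perspective, the sketch below computes PSNR with the standard definition and converts a gain in decibels into the equivalent reduction in mean squared error. The peak value of 1.0 and the toy images are assumptions; the abstract does not state how PSNR was normalized or averaged.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_from_psnr_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (gain_db / 10.0)

# Toy check: a constant error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
clean = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)
toy_psnr = psnr(clean, noisy)

# A +6.9 dB gain corresponds to an MSE roughly 4.9 times smaller.
ratio = mse_ratio_from_psnr_gain(6.9)
```

Because PSNR is logarithmic, gains on individual images do not average linearly in MSE terms, so the ratio above is only an order-of-magnitude reading of the reported figure.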

Chinese Translation

卫星图像恢复旨在通过补偿成像系统和获取条件引入的退化（例如噪声和模糊）来提高图像质量。作为一项基本的预处理步骤，恢复直接影响地面产品生成和新兴的机载人工智能应用。基于顺序物理模型的传统恢复流程计算密集且速度较慢，因此不适合机载环境。本文介绍了ConvBEERS：一种为太空设计的卷积板载嵌入式高效恢复模型，旨在研究是否可以通过在模拟卫星数据上训练的轻量级非生成残差卷积网络，匹配或超越传统的地面处理恢复流程在多种操作条件下的表现。在模拟数据集和真实的Pleiades-HR图像上进行的实验表明，所提出的方法在图像质量上具有竞争力，PSNR提高了+6.9dB。对下游目标检测任务的评估表明，恢复显著提高了性能，mAP@50提升了+5.1%。此外，在Xilinx Versal VCK190 FPGA上的成功部署验证了其在卫星机载处理中的实际可行性，与传统流程相比，延迟减少了约41倍。这些结果表明，使用轻量级卷积神经网络在满足太空系统实际约束的同时，实现竞争力的恢复质量是相关的。

cs.CV / 93 / 2604.12813

DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

DPC-VQA：解耦质量感知与残差校准的视频质量评估

Li, Xinyue, Xu, Shubo, Zhang, Zhichao, Cai, Zhaolin, Chen, Yitong, Zhai, Guangtao

Abstract

Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
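The decoupling described in this abstract amounts to predicting a final score as a frozen base estimate plus a learned residual. The sketch below illustrates that idea with a ridge-regularized linear model standing in for the lightweight calibration branch; the linear form, the feature dimension, and the synthetic data are assumptions made only for illustration and are not taken from the paper.

```python
import numpy as np

def _with_bias(features):
    """Append a constant column so the calibrator can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_residual_calibrator(features, base_scores, mos, ridge=1e-3):
    """Fit weights that predict the residual (MOS - frozen base score)."""
    X = _with_bias(features)
    target = mos - base_scores
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ target)

def calibrated_score(features, base_scores, weights):
    """Final prediction: frozen base estimate plus predicted residual."""
    return base_scores + _with_bias(features) @ weights

# Synthetic toy data: the "frozen model" is off by a feature-dependent amount.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
base_scores = rng.uniform(1.0, 5.0, size=50)
mos = base_scores + features @ np.array([0.5, -0.2, 0.1]) + 0.3

weights = fit_residual_calibrator(features, base_scores, mos)
error_before = np.mean(np.abs(mos - base_scores))
error_after = np.mean(np.abs(mos - calibrated_score(features, base_scores, weights)))
```

Only the calibrator's weights are fitted here; the base scores are treated as fixed inputs, mirroring the role of the frozen MLLM.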

Chinese Translation

最近的多模态大型语言模型（MLLMs）在视频质量评估（VQA）任务中表现出了良好的性能。然而，由于大规模的重训练和昂贵的主观意见分数（MOS）标注，将其适应于新场景仍然代价高昂。本文认为，预训练的MLLM已经为VQA提供了有用的感知先验，而主要挑战在于如何有效地将该先验校准到目标MOS空间。基于这一见解，我们提出了DPC-VQA，一个用于视频质量评估的解耦感知与校准框架。具体而言，DPC-VQA使用一个冻结的MLLM提供基础质量估计和感知先验，并采用轻量级的校准分支来预测目标场景适应的残差修正。该设计避免了昂贵的端到端重训练，同时在较低的训练和数据成本下保持了可靠的性能。在用户生成内容（UGC）和人工智能生成内容（AIGC）基准上的广泛实验表明，DPC-VQA在与代表性基线的竞争中表现出色，同时使用的可训练参数少于传统基于MLLM的VQA方法的2%，并且在仅有20%的MOS标签的情况下仍然有效。代码将在发表时发布。

cs.CV / 94 / 2604.12832

Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

基于深度学习的超声心动图分割模型训练过程中地面真实标签错误的检测与修正

Islam, Iman, Ruijsink, Bram, Reader, Andrew J., King, Andrew P.

Abstract

Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
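For readers unfamiliar with Variance of Gradients, the sketch below shows one common way to compute a per-sample score: take the input gradients of a sample at several training checkpoints, compute the per-pixel variance across checkpoints, and average over pixels. How the paper adapts this to segmentation, and which normalization it uses, is not stated in the abstract, so the function names, the flagged fraction, and the toy gradients here are assumptions.

```python
import numpy as np

def vog_score(grads):
    """grads: array of shape (K, H, W), input gradients of one sample at K checkpoints.

    Returns the per-pixel variance across checkpoints, averaged over all pixels.
    """
    g = np.asarray(grads, dtype=float)
    per_pixel_variance = g.var(axis=0)
    return float(per_pixel_variance.mean())

def flag_suspects(scores, fraction):
    """Return indices of the highest-scoring samples (top `fraction` of the set)."""
    scores = np.asarray(scores, dtype=float)
    n_flag = max(1, int(round(fraction * len(scores))))
    return sorted(np.argsort(scores)[::-1][:n_flag].tolist())

# Toy gradients: one sample whose gradients never change, one whose gradients drift.
stable = np.ones((3, 2, 2))
unstable = np.stack([np.zeros((2, 2)), np.ones((2, 2)), 2 * np.ones((2, 2))])

all_scores = [vog_score(stable), vog_score(unstable)]
suspects = flag_suspects(all_scores, fraction=0.5)
```

Samples flagged this way would then be candidates for the pseudo-label refurbishment step that the abstract mentions.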

Chinese Translation

基于深度学习的医学图像分割通常依赖于通过人工标注获得的地面真实（GT）标签，但这些标签可能存在随机错误或系统性偏差。本研究探讨了深度学习模型对超声心动图（echocardiography, echo）分割中此类错误的鲁棒性，并评估了一种在模型训练过程中检测及修正错误标签的新策略。利用CAMUS数据集，我们模拟了三种错误类型，随后比较了基于损失函数的GT标签错误检测方法与基于梯度方差（Variance of Gradients, VOG）的方法。同时，我们提出了一种伪标签（pseudo-labelling）方法用于修正疑似错误的GT标签。在不同错误水平下评估了所提方法的性能。结果表明，VOG在训练过程中对错误GT标签的识别效果显著。然而，标准的U-Net模型在随机标签错误及中等程度系统性错误（最高达50%）下依然保持较强性能。检测与修正方法在高错误率条件下显著提升了模型表现。

cs.CV / 95 / 2604.12833

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

通过可物理部署的多模态语义光照攻击挑战视觉-语言模型

Zhao, Yingying, Hu, Chengyin, Zhang, Qike, Li, Xin, Wang, Xin, Wei, Yiwei, Guo, Jiujiang, Long, Jiahuan, Jiang, Tingsong, Yao, Wen

Abstract

Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.

Chinese Translation

视觉-语言模型（VLMs）表现出显著的性能，但其安全性仍然不够明确。目前的对抗性研究几乎完全集中在数字环境中，导致物理世界的威胁尚未得到充分探索。随着VLMs在真实环境中的逐渐应用，这一差距变得至关重要，因为对抗性扰动必须在物理上可实现。尽管这一实际相关性存在，但针对VLMs的物理攻击尚未得到系统研究。这类攻击可能导致识别失败，并进一步干扰多模态推理，从而在下游任务中引发严重的语义误解。因此，研究针对VLMs的物理攻击对于评估其在现实世界中的安全风险至关重要。为了解决这一问题，我们提出了多模态语义光照攻击（MSLA），这是针对VLMs的首个可物理部署的对抗攻击框架。MSLA利用可控的对抗性光照在真实场景中干扰多模态语义理解，攻击语义对齐而不仅仅是特定任务的输出。因此，它降低了主流CLIP变体的零-shot分类性能，同时在先进的VLMs（如LLaVA和BLIP）中引发了严重的语义幻觉，涉及图像描述和视觉问答（VQA）。在数字和物理领域的广泛实验表明，MSLA是有效的、可转移的，并且在实践中可实现。我们的研究结果提供了首个证据，表明VLMs对可物理部署的语义攻击高度脆弱，揭示了一个被忽视的鲁棒性差距，并强调了对VLMs进行物理世界鲁棒性评估的迫切需求。

cs.CV / 96 / 2604.12856

PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

PianoFlow：具有双手协调的音乐感知流式钢琴动作生成

Wang, Xuan, Ruan, Kai, Han, Jiayi, Zhou, Kaiyue, Wang, Gaoang

Abstract

Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9x compared to previous methods.

Chinese Translation

基于音频驱动的双手钢琴动作生成需要对复杂音乐结构和动态跨手协调进行精确建模。然而，现有方法往往依赖于缺乏符号先验的仅声学表示，采用不灵活的交互机制，并且局限于计算成本高昂的短序列生成。为了解决这些局限性，我们提出了PianoFlow，一种用于精确和协调的双手钢琴动作合成的流匹配框架。我们的方法在训练过程中战略性地利用MIDI作为特权模态，提炼这些结构化的音乐先验，以实现深层语义理解，同时保持仅基于音频的推断。此外，我们引入了一个不对称角色门控交互模块，通过角色感知注意力和时间门控显式捕捉动态跨手协调。为了实现任意长序列的实时流式生成，我们设计了一种自回归流续方案，以确保跨块时间一致性。对PianoMotion10M数据集的广泛实验表明，PianoFlow在定量和定性性能上优于之前的方法，同时推断速度比之前的方法提高了超过9倍。

cs.CV / 97 / 2604.12887

VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

VideoFlexTok：灵活长度的粗到细视频标记化

Atanov, Andrei, Allardice, Jesse, Bachmann, Roman, Kar, Oğuzhan Fatih, Hjelm, R Devon, Griffiths, David, Fu, Peter, Dehghan, Afshin, Zamir, Amir

Abstract

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
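The token budget quoted at the end of this abstract can be unpacked with a few lines of arithmetic. The figures 672, 8x, 81 frames, 1.1B and 5.2B come from the abstract; the 3D-grid token count is only an estimate because "8x fewer" may be rounded, and the quadratic attention-cost comparison assumes full self-attention over the token sequence, which the abstract does not specify.

```python
flex_tokens = 672        # tokens per 10-second, 81-frame clip (from the abstract)
reduction_factor = 8     # "8x fewer" than a comparable 3D grid (possibly rounded)
frames = 81

# Implied size of the comparable 3D-grid token sequence.
grid_tokens_estimate = flex_tokens * reduction_factor

# Average tokens available per frame under the flexible tokenizer.
tokens_per_frame = flex_tokens / frames

# Under full self-attention, pairwise cost grows with the square of sequence length.
attention_pair_ratio = reduction_factor ** 2

# The "5x smaller model" comparison from the abstract.
model_size_ratio = 5.2 / 1.1

print(grid_tokens_estimate)         # 5376
print(round(tokens_per_frame, 1))   # 8.3
print(attention_pair_ratio)         # 64
print(round(model_size_ratio, 1))   # 4.7
```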

Chinese Translation

视觉标记器将高维原始像素映射为压缩表示，以便于下游建模。除了压缩之外，标记器还决定了保留哪些信息以及如何组织这些信息。视频标记化的一个事实标准方法是将视频表示为一个时空3D网格的标记，每个标记捕捉原始信号中的相应局部信息。这要求消费这些标记的下游模型，例如文本到视频模型，学习以“逐像素”的方式预测所有低级细节，而不考虑视频固有的复杂性，从而导致高学习复杂性。我们提出了VideoFlexTok，它以粗到细的方式用可变长度的标记序列表示视频——其中第一个标记（逐渐）捕捉抽象信息，如语义和运动，后续标记则添加细粒度细节。生成流解码器能够从任意数量的标记中实现逼真的视频重建。这种表示结构允许根据下游需求调整标记数量，并在相同预算下编码比基线更长的视频。我们在分类和文本到视频生成任务上评估了VideoFlexTok，并展示与3D网格标记相比，它实现了更高效的训练，例如，使用5倍更小的模型（1.1B对比5.2B）获得可比的生成质量（gFVD和ViCLIP得分）。最后，我们展示了VideoFlexTok如何在不产生过高计算成本的情况下实现长视频生成，通过在仅有672个标记的10秒81帧视频上训练文本到视频模型，这比可比的3D网格标记器少了8倍。

cs.CV / 98 / 2604.12890

Towards Long-horizon Agentic Multimodal Search

面向长时间跨度的自主多模态搜索

Du, Yifan, Liu, Zikang, Peng, Jinbiao, Wu, Jie, Li, Junyi, Li, Jinyang, Zhao, Wayne Xin, Wen, Ji-Rong

Abstract

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.
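The file-based visual representation described here can be pictured as a small store that writes images to disk, hands the agent a short textual identifier, and loads the pixels back only when a fetch tool is called. The sketch below is a hypothetical illustration of that pattern: the class name, the UID format, and the hashing scheme are assumptions and do not reflect the LMM-Searcher code.

```python
import hashlib
import tempfile
from pathlib import Path

class VisualStore:
    """Keeps image bytes on disk and exposes them through lightweight text UIDs."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes: bytes) -> str:
        """Write the image to the file system and return a short identifier."""
        uid = "img_" + hashlib.sha256(image_bytes).hexdigest()[:12]
        (self.root / uid).write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """On-demand loading: called only when the agent decides to look at the image."""
        return (self.root / uid).read_bytes()

store = VisualStore(tempfile.mkdtemp())
placeholder_image = b"placeholder image bytes"   # stands in for real image data
uid = store.offload(placeholder_image)

# Only the identifier enters the agent's text context, not the image tokens.
context = [f"Search result 1 contains an image, stored as <{uid}>."]
loaded = store.fetch_image(uid)
```

The design choice this illustrates is that context cost per image stays constant (one short identifier) regardless of image size, while the visual content remains recoverable at any later turn.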

Chinese Translation

多模态深度搜索代理在通过迭代收集文本和视觉证据来解决复杂任务方面展现了巨大的潜力。然而，管理与长时间跨度多模态输入相关的异构信息和高令牌成本仍然是一个关键挑战，因为现有方法往往面临上下文爆炸或重要视觉信号丢失的问题。为了解决这一问题，我们提出了一种新颖的长时间跨度多模态深度搜索框架，命名为 LMM-Searcher，重点采用基于文件的视觉表示机制。通过将视觉资产卸载到外部文件系统并将其映射到轻量级文本标识符（UIDs），我们的方法在保留多模态信息以供未来访问的同时，减轻了上下文开销。我们为代理配备了一种定制的图像获取工具，使其能够实现渐进式、按需的视觉加载策略，以增强主动感知。此外，我们还引入了一种数据合成管道，旨在生成需要复杂跨模态多跳推理的查询。通过该管道，我们提炼出 12K 高质量轨迹，以微调 Qwen3-VL-Thinking-30A3B 成为一个专门的多模态深度搜索代理。在四个基准测试中的广泛实验表明，我们的方法成功扩展到 100 回合的搜索跨度，在 MM-BrowseComp 和 MMSearch-Plus 等具有挑战性的长时间跨度基准上实现了开源模型中的最先进性能，同时在不同基础模型上也表现出强大的泛化能力。我们的代码将发布在 https://github.com/RUCAIBox/LMM-Searcher。

cs.CV / 99 / 2604.12894

Representing 3D Faces with Learnable B-Spline Volumes

用可学习的 B-spline 体表示 3D 人脸

Chandran, Prashanth, Wang, Daoye, Bolkart, Timo

Abstract

We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model's expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.
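The two-stage mapping in this abstract can be sketched in a few lines: blend a lattice of control features with B-spline bases, read the first three blended values as a base position, and add a residual predicted by a small MLP. The sketch below assumes uniform cubic B-splines, a feature dimension of 16, and a randomly initialized two-layer MLP; none of these choices are stated in the abstract, and the function names are hypothetical.

```python
import numpy as np

def cubic_bspline_weights(u, n_ctrl):
    """First control index and the 4 uniform cubic B-spline weights for u in [0, 1]."""
    t = float(np.clip(u, 0.0, 1.0)) * (n_ctrl - 3)
    i = min(int(np.floor(t)), n_ctrl - 4)
    s = t - i
    w = np.array([(1 - s) ** 3,
                  3 * s**3 - 6 * s**2 + 4,
                  -3 * s**3 + 3 * s**2 + 3 * s + 1,
                  s**3]) / 6.0
    return i, w

def blend_features(ctrl, uvw):
    """Stage 1: blend the 4x4x4 neighbourhood of control features around uvw."""
    nu, nv, nw = ctrl.shape[:3]
    i, wu = cubic_bspline_weights(uvw[0], nu)
    j, wv = cubic_bspline_weights(uvw[1], nv)
    k, ww = cubic_bspline_weights(uvw[2], nw)
    local = ctrl[i:i + 4, j:j + 4, k:k + 4]          # shape (4, 4, 4, D)
    return np.einsum("a,b,c,abcd->d", wu, wv, ww, local)

def make_mlp(dim, hidden, rng):
    """Tiny random MLP standing in for the learned residual network."""
    W1 = rng.normal(scale=0.1, size=(dim, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, 3))
    return lambda f: np.tanh(f @ W1) @ W2

def cube_decode(ctrl, uvw, mlp):
    """Stage 2: base position (first three features) plus MLP residual."""
    f = blend_features(ctrl, uvw)
    return f[:3] + mlp(f)

rng = np.random.default_rng(0)
ctrl = rng.normal(size=(8, 8, 8, 16))   # 8 x 8 x 8 lattice of 16-D control features
mlp = make_mlp(16, 32, rng)

# Local support: editing a far-away control feature leaves this point unchanged.
near_origin = cube_decode(ctrl, (0.1, 0.1, 0.1), mlp)
edited = ctrl.copy()
edited[7, 7, 7] += 1.0
near_origin_after_edit = cube_decode(edited, (0.1, 0.1, 0.1), mlp)
```

Because each query touches only a 4 x 4 x 4 block of the lattice, an edit to one control feature affects only the part of the surface whose parametric coordinates fall within that feature's support.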

Chinese Translation

我们提出了 CUBE（基于控制的统一 B-spline 编码），这是一种新的几何表示方法，用于人脸的表示，它将 B-spline 体与学习到的特征相结合，并展示其作为 3D 扫描配准和单目 3D 人脸重建解码器的应用。与现有的具有 3D 控制点的 B-spline 表示不同，CUBE 由高维控制特征的晶格（例如，8 x 8 x 8）参数化，从而增强了模型的表现力。这些特征定义了一个从 3D 参数域到 3D 欧几里得空间的连续两阶段映射，通过一个中间特征空间。首先，使用 B-spline 基对高维控制特征进行局部混合，产生一个高维特征向量，其前三个值定义了一个 3D 基网格。然后，一个小型多层感知机（MLP）处理该特征向量，以预测基形状的残差位移，从而得到最终的精细化 3D 坐标。为了在密集语义对应中重建 3D 表面，CUBE 在从固定模板网格采样的 3D 坐标处进行查询。至关重要的是，CUBE 保留了传统 B-spline 表示的局部支持特性，使得通过更新单个控制特征进行局部表面编辑成为可能。我们通过训练基于变换器的编码器来预测 CUBE 的控制特征，从非结构化点云和单目图像中取得了优于近期基准的最先进的扫描配准结果，展示了这种表示的优势。

cs.CV / 100 / 2604.12896

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

不展示像素，展示线索：通过感知程序解锁语言模型中的视觉工具推理能力

Janjua, Muhammad Kamran, Silva, Hugo, Niu, Di, Rashidi, Bahador

Abstract

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
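As a rough illustration of rewriting a dense tool output into a language-native summary, the sketch below turns a depth map and two query points into a few lines of text for a relative-depth question. The output format, the function name, and the convention that smaller values mean closer are all assumptions; the abstract does not describe the actual structure of a Perception Program.

```python
import numpy as np

def depth_summary(depth, points, window=0):
    """Summarize a depth map at named points as compact text.

    depth:  2D array of depth values (assumed: smaller means closer).
    points: dict mapping a label to a (row, col) pixel position.
    window: half-size of the square patch whose median is reported.
    """
    lines = []
    medians = {}
    for name, (row, col) in points.items():
        r0, r1 = max(row - window, 0), row + window + 1
        c0, c1 = max(col - window, 0), col + window + 1
        medians[name] = float(np.median(depth[r0:r1, c0:c1]))
        lines.append(f"point {name} at (row={row}, col={col}): depth={medians[name]:.2f}")
    closest = min(medians, key=medians.get)
    lines.append(f"closest point: {closest}")
    return "\n".join(lines)

# Toy depth map: left half near (1.0), right half far (3.0).
depth = np.array([[1.0, 1.0, 3.0, 3.0]] * 4)
summary = depth_summary(depth, {"A": (1, 0), "B": (1, 3)})
print(summary)
# point A at (row=1, col=0): depth=1.00
# point B at (row=1, col=3): depth=3.00
# closest point: A
```

A text block like this would be placed in the prompt in place of, or alongside, the raw depth image.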

Chinese Translation

多模态语言模型（MLLMs）越来越多地与视觉工具（如深度、光流、对应关系）结合，以增强视觉推理能力。然而，尽管可以访问这些工具生成的视觉线索，MLLMs往往未能有效利用它们。现有方法通常将工具的原始输出直接输入模型，但这些密集的像素级表示与大型语言模型（LLMs）擅长的语言本位推理不匹配，导致感知能力薄弱且依赖语言先验。我们认为，在视觉工具能够提供必要视觉线索的问题中，瓶颈不在于更多的工具调用或更大的MLLMs，而在于工具输出的表示方式。我们提出了感知程序（Perception Programs，P²），这是一种无需训练、模型无关的方法，将工具输出重写为紧凑、结构化、语言本位的摘要，MLLMs可以直接解析并进行推理。在BLINK中的六个以感知为核心的任务上，P²相较于基础模型和原始工具增强基线均表现出显著提升。以GPT-5 Mini为基础模型，P²将其多视角推理准确率从41.35%提升至86.47%，相对深度任务准确率从52.42%提升至81.45%，在各任务上平均提升22%，创下新的最先进水平。即使在较小的MLLMs如InternVL3.5-4B和Qwen3VL-4B上，P²也带来了15%至40%的绝对提升，超越了以往基于代理、监督和强化学习的工具使用方法，且无需任何训练或模型修改。

cs.CV / 101 / 2604.12904

A Sanity Check on Composed Image Retrieval

对组合图像检索的合理性检验

Liu, Yikun, Yao, Jiangchao, Xie, Weidi, Wang, Yanfeng

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries that degrade the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria) and do not consider model effectiveness in multi-round systems. Motivated by this, we improve the evaluation procedure in two respects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of existing models in interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons demonstrate the value of our novel evaluation on typical CIR methods.

Chinese Translation

组合图像检索（Composed Image Retrieval, CIR）旨在基于由参考图像和指定所需修改的相对描述组成的查询来检索目标图像。尽管CIR模型发展迅速，但现有基准对其性能的表征并不充分，这些基准本质上包含不确定的查询，降低了评估效果（即，多个候选图像而不仅仅是目标图像满足查询标准），并且未考虑其在多轮系统中的有效性。基于此，我们考虑从两个方面改善评估程序：1）我们引入FISD，一个完全知情的语义多样性基准，利用生成模型精确控制参考-目标图像对的变量，从而在六个维度上实现对CIR方法的更准确评估，避免查询模糊；2）我们提出一个自动化的多轮代理评估框架，以探测现有模型在交互场景中的潜力。通过观察模型如何在连续的查询轮次中调整和优化其选择，该框架提供了对其在实际应用中有效性的更真实评估。大量实验和比较证明了我们新评估方法在典型CIR方法中的价值。

cs.CV / 102 / 2604.12917

M3D-Stereo: A Multiple-Medium and Multiple-Degradation Dataset for Stereo Image Restoration

M3D-Stereo：一个多介质多退化的立体图像修复数据集

Yang, Deqing, Liu, Yingying, Wang, Qicong, Zeng, Zhi, Lu, Dajiang, Tian, Yibin

Abstract

Image restoration under adverse conditions, such as underwater, haze or fog, and low-light environments, remains a highly challenging problem due to complex physical degradations and severe information loss. Existing datasets are predominantly limited to a single degradation type or heavily rely on synthetic data without stereo consistency, inherently restricting their applicability in real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset with 7904 high-resolution image pairs for image restoration research acquired in multiple media with multiple controlled degradation levels. It encompasses four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset, and is divided into six levels of progressive degradation, allowing fine-grained evaluations of restoration methods with increasing severity of degradation. Collected via a laboratory setup, the dataset provides aligned stereo image pairs along with their pixel-wise consistent clear ground truths. Two restoration tasks, single-level and mixed-level degradation, were performed to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark to evaluate image restoration and stereo matching methods in complex degradation environments. It is made public under LGPLv3 license.

Chinese Translation

在水下、雾霾或低光照等恶劣环境下的图像修复仍然是一个极具挑战性的问题，原因在于复杂的物理退化和严重的信息丢失。现有的数据集大多仅限于单一退化类型，或严重依赖缺乏立体一致性的合成数据，固有限制了其在实际场景中的适用性。为此，我们提出了M3D-Stereo，这是一个包含7904对高分辨率图像对的立体图像修复数据集，涵盖多种介质和多级可控退化水平。数据集包含四种退化场景：水下散射、雾霾/雾、低光水下和雾霾低光。每个场景构成一个子集，并划分为六个逐步加重的退化等级，便于对修复方法在不同退化严重程度下进行细粒度评估。该数据集通过实验室设备采集，提供对齐的立体图像对及其像素级一致的清晰真实图像。我们进行了单级和混合级退化两种修复任务以验证其有效性。M3D-Stereo为在复杂退化环境下评估图像修复和立体匹配方法建立了更好控制且更贴近现实的基准。该数据集已在LGPLv3许可下公开发布。

cs.CV / 103 / 2604.12918

Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

基于跨任务注意力桥的雷达-摄像头BEV多任务学习用于联合三维检测与分割

İnanç, Ahmet, Erkent, Özgür

Abstract

Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU while simultaneously providing 3D detection.

Chinese Translation

鸟瞰视角（BEV）表示是自动驾驶中三维感知的主流范式，提供了一个统一的空间画布，使检测和分割特征在几何上注册到相同的物理坐标系。然而，现有的雷达-摄像头融合方法通常将这些任务孤立处理，错失了在任务间共享互补信息的机会：检测特征编码了可用于锐化分割边界的对象级几何信息，而分割特征则提供了可锚定检测的密集语义上下文。我们提出了CTAB（Cross-Task Attention Bridge，跨任务注意力桥），这是一个双向模块，通过共享BEV空间中的多尺度可变形注意力在检测和分割分支之间交换特征。CTAB集成于一个多任务框架中，配备基于实例归一化的分割解码器和可学习的BEV上采样，以提供更精细的BEV表示。在nuScenes数据集上，CTAB在联合多任务基线的基础上，在7个类别的分割性能上实现提升，同时检测性能基本保持不变。在一个包含4个类别（可行驶区域、行人过街道、步行道、车辆）的子集上，我们的联合多任务模型在4个类别的mIoU上达到可比性能，同时实现三维检测。

cs.CV / 104 / 2604.12923

Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Pi-HOC：成对三维人-物体接触估计

Chittupalli, Sravan, Jain, Ayush, Huang, Dong

Abstract

Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. The current state of the art leverages a powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly at inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

Chinese Translation

在图像中解决现实世界的人-物体交互是一项多对多的挑战，其中解开细粒度的同时物理接触尤其困难。现有的语义接触估计方法要么仅限于单人设置，要么需要除了输入图像之外的物体几何形状（例如网格）。当前的最先进技术利用强大的视觉语言模型（VLM）进行类别级语义推断，但在多人体场景中表现不佳，并且在推理时扩展性差。我们提出了Pi-HOC，这是一种单次通过、实例感知的框架，用于对所有人-物体对进行密集的三维语义接触预测。Pi-HOC检测实例，为每对创建专用的人-物体（HO）标记，并使用InteractionFormer进行精细化。然后，基于SAM的解码器在每个人-物体对的SMPL人类网格上预测密集接触。在MMHOI和DAMON数据集上，Pi-HOC显著提高了准确性和定位能力，相比于最先进的方法实现了20倍的吞吐量提升。我们进一步展示了预测的接触通过测试时优化算法改善了SAM-3D图像到网格的重建，并且能够从语言查询中进行参考接触预测，而无需额外的训练。

cs.CV / 105 / 2604.12929

Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

高斯中的抓取：动态手-物体交互的快速单目重建

Aytekin, Ayce Idil, Chen, Xu, Shen, Zhengyang, Beeler, Thabo, Rhodin, Helge, Dabral, Rishabh, Theobalt, Christian

Abstract

We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing the hand's per-joint position error by over 65%.

Chinese Translation

我们提出了Grasp in Gaussians（GraG），一种用于从单目视频中快速且鲁棒地重建动态三维手-物体交互的方法。与近期优化复杂神经表示的方法不同，我们的方法侧重于在预训练大型模型初始化后，高效地跟踪手部和物体。我们的核心观点是，利用紧凑的高斯和（Sum-of-Gaussians，SoG）表示——这一经典跟踪文献中的方法，并结合基于生成的高斯初始化——可以恢复准确且时间稳定的手-物体运动。我们采用视频适配的SAM3D流程初始化物体的姿态和几何形状，然后通过子采样将得到的密集高斯表示转换为轻量级的SoG表示。该紧凑表示不仅保证了几何精度，还实现了高效快速的跟踪。对于手部，我们采用互补策略：从现成的单目手势初始化开始，利用简单而有效的二维关节和深度对齐损失来细化手部运动，避免了对详细三维手部外观模型的逐帧优化，同时保持了稳定的关节动作。大量公共基准实验表明，GraG在长序列上重建时间连贯的手-物体交互的速度比先前方法快6.4倍，同时提升了物体重建13.4%的精度，并将手部每个关节的位置误差降低超过65%。

cs.CV / 106 / 2604.12935

Task Alignment: A simple and effective proxy for model merging in computer vision

任务对齐：计算机视觉中模型合并的简单有效代理

de Jorge, Pau, de Souza, César Roberto, Michele, Björn, Sarıyıldız, Mert Bülent, Weinzaepfel, Philippe, Perronnin, Florent, Larlus, Diane, Kalantidis, Yannis

Abstract

Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Unlike previous studies with frozen decoders, where merged models can be evaluated right away, here the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.

Chinese Translation

高效地合并多个针对不同任务进行微调的模型，但源自同一预训练基础模型，具有重要的实际意义。尽管已有大量相关研究，但大多数关于计算机视觉中模型合并的评估仅限于使用 CLIP 的图像分类，其中不同的分类数据集定义了不同的任务。在本研究中，我们的目标是使模型合并更加实用，并展示其在超出这一特定设置的挑战性场景中的相关性。在大多数视觉场景中，不同任务依赖于可训练且通常异构的解码器。与之前使用冻结解码器的研究不同，合并模型可以立即进行评估，而解码器训练的非平凡成本使得基于下游性能的超参数选择变得不切实际。为了解决这一问题，我们引入了任务对齐代理，并展示了如何利用它在保持性能的同时大幅加速超参数选择。借助任务对齐代理，我们将模型合并的适用性扩展到超出基于 CLIP 的分类的多任务视觉模型。

cs.CV / 107 / 2604.12941

Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

直接差异重放：持续人脸伪造检测中的分布差异凝聚与流形一致重放

Zhang, Tianshuo, Zhang, Haoyuan, Peng, Siran, Zhao, Weisong, Zhu, Xiangyu, Lei, Zhen

Abstract

Continual face forgery detection (CFFD) requires detectors to learn emerging forgery paradigms without forgetting previously seen manipulations. Existing CFFD methods commonly rely on replaying a small amount of past data to mitigate forgetting. Such replay is typically implemented either by storing a few historical samples or by synthesizing pseudo-forgeries from detector-dependent perturbations. Under strict memory budgets, the former cannot adequately cover diverse forgery cues and may expose facial identities, while the latter remains strongly tied to past decision boundaries. We argue that the core role of replay in CFFD is to reinstate the distributions of previous forgery tasks during subsequent training. To this end, we directly condense the discrepancy between real and fake distributions and leverage real faces from the current stage to perform distribution-level replay. Specifically, we introduce Distribution-Discrepancy Condensation (DDC), which models the real-to-fake discrepancy via a surrogate factorization in characteristic-function space and condenses it into a tiny bank of distribution discrepancy maps. We further propose Manifold-Consistent Replay (MCR), which synthesizes replay samples through variance-preserving composition of these maps with current-stage real faces, yielding samples that reflect previous-task forgery cues while remaining compatible with current real-face statistics. Operating under an extremely small memory budget and without directly storing raw historical face images, our framework consistently outperforms prior CFFD baselines and significantly mitigates catastrophic forgetting. Replay-level privacy analysis further suggests reduced identity leakage risk relative to selection-based replay.

Chinese Translation

持续人脸伪造检测（CFFD）要求检测器在学习新兴伪造范式的同时，不忘记之前见过的操控。现有的CFFD方法通常依赖于重放少量的过去数据来减轻遗忘。这种重放通常通过存储少量历史样本或从依赖于检测器的扰动中合成伪伪造来实现。在严格的内存预算下，前者无法充分覆盖多样的伪造线索，并可能暴露面部身份，而后者则与过去的决策边界紧密相关。我们认为，在CFFD中，重放的核心作用是恢复先前伪造任务的分布，以便在随后的训练中使用。为此，我们直接凝聚真实和伪造分布之间的差异，并利用当前阶段的真实人脸进行分布级重放。具体而言，我们引入了分布差异凝聚（Distribution-Discrepancy Condensation, DDC），它通过在特征函数空间中的替代因子分解来建模真实与伪造之间的差异，并将其凝聚成一小组分布差异图。我们进一步提出了流形一致重放（Manifold-Consistent Replay, MCR），通过将这些图与当前阶段的真实人脸进行方差保持组合来合成重放样本，从而生成反映先前任务伪造线索的样本，同时保持与当前真实人脸统计数据的兼容性。在极小的内存预算下，我们的框架在不直接存储原始历史人脸图像的情况下，始终优于先前的CFFD基线，并显著减轻灾难性遗忘。重放级隐私分析进一步表明，相较于基于选择的重放，身份泄露风险降低。

cs.CV / 108 / 2604.12944

Distorted or Fabricated? A Survey on Hallucination in Video LLMs

扭曲还是伪造？视频大型语言模型中的幻觉调查

Huang, Yiyang, Zhang, Yitian, Wang, Yizhou, Zhang, Mingyuan, Shi, Liang, Zeng, Huimin, Fu, Yun

Abstract

Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at https://github.com/hukcc/Awesome-Video-Hallucination .

Chinese Translation

尽管视频语言建模取得了显著进展，但幻觉仍然是视频大型语言模型（Vid-LLMs）面临的一个持续挑战，指的是那些看似合理但与输入视频内容相矛盾的输出。本文对Vid-LLMs中的幻觉进行了全面分析，并引入了一种系统的分类法，将其分为两种核心类型：动态扭曲和内容伪造，每种类型又包含两个子类型及其代表性案例。在此分类法的基础上，我们回顾了幻觉评估和缓解的最新进展，涵盖了关键基准、指标和干预策略。我们进一步分析了动态扭曲和内容伪造的根本原因，这通常源于时间表示能力的有限和视觉基础的不足。这些见解为未来的研究提供了几条有希望的方向，包括开发运动感知的视觉编码器和整合反事实学习技术。本文整合了分散的进展，以促进对Vid-LLMs中幻觉的系统理解，为构建稳健可靠的视频语言系统奠定基础。相关工作的最新整理列表可在 https://github.com/hukcc/Awesome-Video-Hallucination 获取。

cs.CV / 109 / 2604.12966

Boosting Visual Instruction Tuning with Self-Supervised Guidance

通过自监督指导增强视觉指令调优

Sirko-Galouchenko, Sophia, Wysoczanska, Monika, Bursuc, Andrei, Thome, Nicolas, Gidaris, Spyros

Abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
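
The reformulation of a pretext task as an image-instruction-response triplet can be illustrated with a minimal sketch for rotation prediction. Everything below is an assumption made for illustration: the instruction wording, the function names (`rotate90_cw`, `make_rotation_triplet`, `mix_with_ssl`), and the use of nested lists in place of real images are not taken from the paper or its repository.

```python
import random

# Hypothetical instruction wording; the paper's actual prompts are not given in the abstract.
INSTRUCTION = "By how many degrees (0, 90, 180 or 270) has this image been rotated clockwise?"


def rotate90_cw(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_triplet(image, rng):
    """Build one (image, instruction, response) triplet from an unlabeled image."""
    k = rng.randrange(4)  # number of quarter turns, sampled uniformly
    rotated = [list(row) for row in image]
    for _ in range(k):
        rotated = rotate90_cw(rotated)
    # The answer is derived from the transformation itself, so no human label is needed.
    return {"image": rotated, "instruction": INSTRUCTION, "response": str(90 * k)}


def mix_with_ssl(instruction_data, ssl_data, ssl_fraction, rng):
    """Add SSL samples so they make up roughly ssl_fraction of the mixed set."""
    n_ssl = round(len(instruction_data) * ssl_fraction / (1.0 - ssl_fraction))
    mixed = list(instruction_data) + [rng.choice(ssl_data) for _ in range(n_ssl)]
    rng.shuffle(mixed)
    return mixed
```

With `ssl_fraction` set between 0.03 and 0.10, `mix_with_ssl` corresponds to the 3-10% injection rate quoted in the abstract; how the authors actually sample and interleave the data is not stated there.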

Chinese Translation

多模态大型语言模型（MLLMs）在许多视觉-语言任务上表现良好，但在需要细粒度视觉推理的以视觉为中心的问题上常常表现不佳。最近的证据表明，这一限制并非源于视觉表征的弱，而是由于在指令调优过程中对视觉信息的利用不足，在许多任务中，仅使用语言先验就可以部分解决。我们提出了一种简单而轻量的方法，通过少量以视觉为基础的自监督任务（以自然语言指令表达）来增强视觉指令调优。通过将经典的自监督预训练任务（如旋转预测、颜色匹配和跨视图对应）重新表述为图像-指令-响应三元组，我们引入了无法在没有依赖视觉证据的情况下解决的监督。我们的方法不需要人工标注、不需要架构修改，也不需要额外的训练阶段。在多个模型、训练方案和基准测试中，仅注入少量（3-10%）这样的以视觉为基础的指令，便能持续提高在以视觉为中心的评估中的表现。我们的研究结果强调了通过视觉基础的自监督学习任务进行指令调优，作为通过简单调整训练数据分布来改善MLLMs视觉推理的强大杠杆。代码可在以下链接获取：https://github.com/sirkosophia/V-GIFT

cs.CV / 110 / 2604.12969

AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation

AbdomenGen：用于腹部解剖生成的序列体积条件扩散框架

Bhandari, Yubraj, Dahal, Lavsen, Segars, Paul, Lo, Joseph Y.

Abstract

Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver Dice 0.83 ± 0.05), stable single-organ calibration over [-3, +3] VCS, and disentangled multi-organ modulation. To showcase clinical utility, for a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces the distributional distance of the training data by 73.6%. These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.
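
One way to read "a standardized residual that decouples organ size from body habitus" is as the z-scored residual of a regression of organ volume on body volume. The sketch below follows that reading only as an assumption: the abstract does not give the regression form, the covariates, or the normalization, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def volume_control_scalar(body_vol, organ_vol):
    """Standardized residual of organ volume after removing the body-size trend.

    Assumed form: residual = organ - (slope * body + intercept), divided by
    the standard deviation of the residuals across the cohort.
    """
    slope, intercept = fit_line(body_vol, organ_vol)
    residuals = [o - (slope * b + intercept) for b, o in zip(body_vol, organ_vol)]
    sigma = pstdev(residuals)
    return [r / sigma for r in residuals]
```

Under this assumed definition a value of 0 means "typical organ size for this body size", and the [-3, +3] range in the abstract would span three standard deviations on either side.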

Chinese Translation

计算幻影在医学成像研究中被广泛使用，但目前生成可控且具有临床意义的解剖变异的系统仍然有限。我们提出了AbdomenGen，一个用于可控腹部解剖生成的序列体积条件扩散框架。我们引入了体积控制标量（Volume Control Scalar, VCS），这是一个标准化的残差，将器官大小与身体形态解耦，从而实现可解释的体积调制。器官掩膜是依次合成的，基于身体掩膜和先前生成的结构进行条件处理，以保持整体解剖的一致性，同时支持独立的多器官控制。在11个腹部器官中，所提出的框架实现了强大的几何保真度（例如，肝脏Dice 0.83 ± 0.05）、在[-3, +3] VCS范围内的稳定单器官标定，以及解耦的多器官调制。为了展示与从MERLIN中选取的肝肿大队列的临床实用性，基于Wasserstein的VCS选择将训练数据的分布距离减少了73.6%。这些结果表明，经过标定的、关注分布的解剖生成适合于可控的腹部幻影构建和模拟研究。

cs.CV / 111 / 2604.12999

Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

基于主动假设探索的视觉识别代理发现

Koo, Jaywon, Hernandez, Jefferson, He, Ruozhen, Chen, Hanjie, Wei, Chen, Ordonez, Vicente

Abstract

We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.
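
The interplay between confidence tracking and parent selection can be pictured with a toy memory bank. This is a sketch under stated assumptions, not the HypoExplore implementation: the class name, the Laplace-smoothed confidence estimate, and the two selection modes ("exploit" picks the most trusted hypothesis, anything else picks the most uncertain one) are all illustrative choices that the abstract does not specify.

```python
class HypothesisMemoryBank:
    """Toy store of hypotheses with evidence-based confidence scores."""

    def __init__(self):
        # hypothesis text -> list of outcomes (1 = supported, 0 = refuted)
        self.evidence = {}

    def add(self, hypothesis):
        self.evidence.setdefault(hypothesis, [])

    def update(self, hypothesis, supported):
        """Record the outcome of one experiment for this hypothesis."""
        self.evidence[hypothesis].append(1 if supported else 0)

    def confidence(self, hypothesis):
        """Laplace-smoothed support rate; 0.5 when there is no evidence yet."""
        outcomes = self.evidence[hypothesis]
        return (sum(outcomes) + 1) / (len(outcomes) + 2)

    def select_parent(self, mode):
        """Pick the hypothesis to build on next."""
        if mode == "exploit":
            # Build on the best-validated principle.
            return max(self.evidence, key=self.confidence)
        # Otherwise resolve uncertainty: pick the confidence closest to 0.5.
        return min(self.evidence, key=lambda h: abs(self.confidence(h) - 0.5))
```

In a full system the `supported` flag would come from the feedback agents that analyze each experiment; here it is simply passed in by the caller.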

Chinese Translation

我们提出了HypoExplore，这是一个代理框架，将视觉识别的神经架构发现形式化为一种以假设为驱动的科学探究。给定一个人类指定的高层次研究方向，HypoExplore通过进化分支构思、实现、评估和改进神经架构。新的假设通过使用大型语言模型创建，选择一个父假设进行构建，采用一种双重策略，平衡利用已验证的原则与解决不确定的原则。我们提出的框架维护一个轨迹树（Trajectory Tree），记录所有提议架构的谱系，以及一个假设记忆库（Hypothesis Memory Bank），主动跟踪通过实验证据获得的置信分数。在每次实验后，多个反馈代理从不同角度分析结果，并将他们的发现整合为假设置信度更新。我们的框架在CIFAR-10上测试轻量级视觉架构的发现，最佳结果为94.11%的准确率，源自起始节点基线的18.91%，并且能够推广到CIFAR-100和Tiny-ImageNet。我们进一步通过在MedMNIST上进行独立架构发现实验，展示了其在专业领域的适用性，取得了最先进的性能。我们表明，随着证据的积累，假设置信度分数变得越来越具预测性，并且学习到的原则在独立的进化谱系之间转移，这表明HypoExplore不仅发现了更强的架构，还可以帮助建立对设计空间的真正理解。

cs.CV / 112 / 2604.13019

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

观察、指点与精炼：基于视觉反馈的多轮交互式GUI定位方法

Mittal, Himangi, Mittal, Gaurav, Troncoso, Nelson Daniel, Hu, Yu

Abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
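
The difference between single-shot prediction and closed-loop refinement can be shown with a toy simulation. The sketch assumes a great deal: `propose` stands in for a model call, `observe` stands in for taking a screenshot and judging where the cursor landed, and the simulated model corrects only half of the remaining error per turn. None of these names or numbers come from the report or its benchmark repository.

```python
TARGET = (100.0, 50.0)  # hypothetical pixel position of the UI element


def observe(point, tolerance=1.0):
    """Simulated visual feedback: did the cursor land on the target, and roughly how far off is it?"""
    dx, dy = TARGET[0] - point[0], TARGET[1] - point[1]
    return {
        "last": point,
        "hit": max(abs(dx), abs(dy)) <= tolerance,
        # The simulated model perceives only half of the true displacement.
        "estimated_offset": (0.5 * dx, 0.5 * dy),
    }


def propose(feedback):
    """Simulated model: a rough first guess, then corrections based on feedback."""
    if feedback is None:
        return (90.0, 58.0)  # the single-shot prediction
    (x, y), (ex, ey) = feedback["last"], feedback["estimated_offset"]
    return (x + ex, y + ey)


def refine_click(propose, observe, max_turns=8):
    """Multi-turn grounding loop: propose, look at the result, correct, repeat."""
    feedback = None
    history = []
    for _ in range(max_turns):
        point = propose(feedback)
        history.append(point)
        feedback = observe(point)
        if feedback["hit"]:
            break
    return history


history = refine_click(propose, observe)
```

In this toy setup the first element of `history` is what a single-shot system would click, and the last element is the refined position; the loop stops as soon as the simulated feedback reports a hit.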

Chinese Translation

计算机使用代理（Computer Use Agents, CUAs）在将语言指令转化为可执行的屏幕操作时，根本依赖于图形用户界面（GUI）定位技术。然而，在需要亚像素级精度以操作密集集成开发环境（IDE）元素的编辑级定位任务中，相关研究仍较为匮乏。现有方法通常依赖一次性坐标预测，缺乏纠错机制，且在高密度界面中表现不佳。本文技术报告中，我们针对编码环境中的像素级光标定位进行了实证研究。不同于单步执行，我们的代理采用迭代精炼过程，利用前次尝试的视觉反馈不断逼近目标元素。该闭环定位机制使代理能够自我纠正位移误差并适应动态界面变化。我们在GPT-5.4、Claude和Qwen模型上，针对一系列复杂编码基准进行了评估，结果表明多轮精炼在点击精度和整体任务成功率上显著优于最先进的一次性预测模型。研究结果表明，迭代视觉推理是下一代高可靠性软件工程代理的关键组成部分。代码地址：https://github.com/microsoft/precision-cua-bench。

cs.CV / 113 / 2604.13021

Representation geometry shapes task performance in vision-language modeling for CT enterography

表征几何形状影响CT肠道成像的视觉-语言建模任务表现

Minoccheri, Cristian, Wittrup, Emily, Najarian, Kayvan, Stidham, Ryan

Abstract

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7 to 14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80 to 0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
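
Multi-window RGB encoding, as described above, maps several Hounsfield Unit (HU) windows of the same slice to the three color channels. A minimal sketch follows; the (center, width) pairs in `WINDOWS` are placeholders chosen for illustration, since the abstract does not list the windows the authors used, and the function names are hypothetical.

```python
def apply_window(hu, center, width):
    """Clip one HU value to a window and rescale it to the 0..255 range."""
    low, high = center - width / 2.0, center + width / 2.0
    clipped = min(max(hu, low), high)
    return round(255 * (clipped - low) / (high - low))


# Hypothetical (center, width) windows, one per channel; not taken from the paper.
WINDOWS = {"R": (40, 400), "G": (50, 150), "B": (300, 1500)}


def encode_slice(hu_slice):
    """Turn a 2D grid of HU values into a grid of (R, G, B) tuples."""
    return [
        [tuple(apply_window(v, *WINDOWS[c]) for c in "RGB") for v in row]
        for row in hu_slice
    ]
```

Each channel then carries a different contrast range of the same anatomy, so a standard three-channel image encoder receives more per-slice tissue contrast than a single grayscale window would provide.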

Chinese Translation

计算机断层扫描（CT）肠道成像是评估炎症性肠病（IBD）的主要成像方式，但尚不清楚哪些表征选择最能支持该模式的自动化分析。我们首次研究了腹部CT肠道成像中的视觉-语言迁移学习，并确定了两个主要发现。首先，切片嵌入的均值池化在分类疾病评估中表现更佳（59.2%的三类准确率），而注意力池化在跨模态检索中表现更好（0.235文本到图像的平均倒排率）。这一模式在所有测试的LoRA配置中均保持一致，表明这两种聚合器强调了学习表征的不同属性。其次，每个切片的组织对比度比更广泛的空间覆盖更为重要：多窗口RGB编码将互补的Hounsfield单位窗口映射到RGB通道，优于通过多平面采样增加空间覆盖的所有策略，而在这种情况下，添加冠状面和矢状面视图反而降低了分类性能。在报告生成方面，未使用检索上下文的微调在与流行率匹配的机会水平下实现了1级别的准确率（70.4%对71%随机），这表明除了类别分布外，几乎没有学习到其他顺序。检索增强生成（RAG）在所有配置中均提高了这一准确率，得分比机会基线高出7至14个百分点，并将序数均方误差从0.98改善至0.80至0.89。三教师伪标签框架使得所有比较无需专家注释。综上所述，这些发现为这一尚未深入研究的模式提供了首个基准，并为构建体积医学成像的视觉-语言系统提供了实用指导。

cs.CV / 114 / 2604.13028

Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns

融合逆向建模生成多样化且诱导温度变化的城市植被格局

Tezcan, Baris Sarper, Viswanath, Hrishikesh, Saher, Rubab, Aliaga, Daniel

Abstract

Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.

Chinese Translation

随着快速城市化和气候变化的推进，城市区域对极端热环境的脆弱性日益增加。传统上，极端热环境通过地球观测卫星和数值建模框架进行监测。例如，利用Landsat或Sentinel影像获取的地表温度常用于表征地表加热模式。这些方法作为正向模型，将辐射观测或模拟边界条件转换为地表热状态的估计。尽管正向模型能够根据植被和城市形态预测地表温度，但确定能够实现特定区域温度变化的空间植被配置的逆问题尚未得到充分研究。该任务本质上是欠定的，因为多种空间植被格局可能产生相似的整体温度响应。传统的回归方法和确定性神经网络无法捕捉这种不确定性，且在数据稀缺条件下往往产生平均化的解决方案。本文提出了一种融合逆向建模框架，将预测性正向模型与基于扩散的生成逆向模型相结合，以生成多样且物理合理的基于图像的植被格局，条件为特定的温度目标。该框架在保持对热效应控制的同时，实现了多样化的空间植被配置，即使这些组合在训练数据中未出现。总体而言，本研究引入了一种可控的逆向建模方法，用于城市气候适应，充分考虑了问题的内在多样性。相关代码已在GitHub仓库公开。

cs.CV / 115 / 2604.13029

Visual Preference Optimization with Rubric Rewards

基于评分标准奖励的视觉偏好优化

Yu, Ya-Qi, Hong, Fangyu, Qu, Xiangyang, Wang, Hao, Wu, Gaojie, Luo, Qiaoyu, Xu, Nuo, Wang, Huixin, Xu, Wuheng, Liao, Yongxin, Chen, Zihao, Li, Haonan, Li, Ziming, Peng, Dezhi, Liao, Minghui, Wu, Jihao, Ren, Haoyu, Tu, Dandan

Abstract

The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
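
A checklist-style rubric with essential and additional criteria can be turned into preference pairs in many ways; the sketch below shows one possible scheme and is an assumption throughout. The scoring rule (any failed essential criterion gives 0, otherwise 1 plus the fraction of additional criteria met), the `min_gap` filter, and the keyword-matching `keyword_judge` are illustrative stand-ins, not the rDPO procedure, which relies on a model-based judge.

```python
def rubric_score(response, rubric, judge):
    """Score one response against an instance-specific rubric."""
    if not all(judge(response, c) for c in rubric["essential"]):
        return 0.0  # failing any essential criterion disqualifies the response
    extra = rubric["additional"]
    bonus = sum(judge(response, c) for c in extra) / len(extra) if extra else 0.0
    return 1.0 + bonus


def build_preference_pair(responses, rubric, judge, min_gap=0.5):
    """Pick (chosen, rejected) from sampled responses, or None if they are too close."""
    scored = sorted(responses, key=lambda r: rubric_score(r, rubric, judge))
    rejected, chosen = scored[0], scored[-1]
    gap = rubric_score(chosen, rubric, judge) - rubric_score(rejected, rubric, judge)
    return (chosen, rejected) if gap >= min_gap else None


def keyword_judge(response, criterion):
    """Toy judge: a criterion is met when its keyword appears in the response."""
    return criterion in response.lower()
```

For example, with the rubric `{"essential": ["three", "red"], "additional": ["left"]}` and the sampled responses "Three red cars, parked on the left.", "Three red cars." and "Two blue cars.", the first response is chosen and the last is rejected.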

Chinese Translation

直接偏好优化（DPO）的有效性依赖于反映多模态任务中重要质量差异的偏好数据。现有的流程通常依赖于离策略扰动或粗略的结果导向信号，这些方法并不适合细粒度的视觉推理。我们提出了 rDPO，一种基于实例特定评分标准的偏好优化框架。对于每对图像-指令，我们创建了一个检查表式的评分标准，包含基本和附加标准，以对任何可能的策略的响应进行评分。指令-评分标准池在离线构建并在构建在线策略数据时重用。在公共奖励建模基准上，基于评分标准的提示大幅提升了 30B-A3B 判别器的表现，使其接近 GPT-5.4。在公共下游基准上，基于评分标准的过滤将宏观平均提升至 82.69，而基于结果的过滤则将其从 81.14 降至 75.82。在对全面基准的可扩展性评估中，rDPO 达到了 61.01，明显优于受风格限制的基线（52.36），并超越了 59.48 的基础模型。综合这些结果表明，视觉偏好优化受益于将在线数据构建与实例特定的标准级反馈相结合。

cs.CV / 116 / 2604.13030

Generative Refinement Networks for Visual Synthesis

用于视觉合成的生成精炼网络

Han, Jian, Liu, Jinlai, Wang, Jiahuan, Peng, Bingyue, Yuan, Zehuan

Abstract

While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ's latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks -- like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.

Chinese Translation

尽管扩散模型在视觉生成领域占据主导地位，但其计算效率较低，对不同复杂度的任务均采用统一的计算资源。相比之下，自回归（Autoregressive, AR）模型天生具备复杂度感知能力，这一点从其可变的似然值中可见一斑，但其常受限于有损的离散标记化和误差累积问题。本文提出了生成精炼网络（Generative Refinement Networks, GRN），作为下一代视觉合成范式，以解决上述问题。GRN的核心在于通过理论上近无损的层次二进制量化（Hierarchical Binary Quantization, HBQ）突破离散标记化瓶颈，实现了与连续表示相当的重建质量。在HBQ的潜空间基础上，GRN通过全局精炼机制根本性地升级了AR生成过程，能够像人类艺术家绘画一样，逐步完善和修正作品。此外，GRN整合了熵引导采样策略，实现了复杂度感知的自适应步长生成，同时保证视觉质量。在ImageNet基准测试中，GRN在图像重建（0.56 rFID）和类别条件图像生成（1.81 gFID）上均创下新纪录。我们还将GRN扩展至更具挑战性的文本到图像及文本到视频生成任务，在同等规模下展现出卓越性能。我们公开了所有模型和代码，以促进GRN相关的进一步研究。

cs.CV / 117 / 2604.13035

SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

SceneCritic：一种用于三维室内场景合成的符号评估器

Sengupta, Kathakoli, Ao, Kai, Cascante-Bonilla, Paola

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
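
The rule-based critic mentioned above uses collision constraints as feedback. A minimal sketch of such a check on a floor-plan layout is given below, assuming each object is an axis-aligned footprint `(x, y, width, depth)`; the layout format and function names are illustrative assumptions and are not taken from SceneCritic.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two axis-aligned footprints (x, y, width, depth) share positive area."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad


def collision_feedback(layout):
    """Return the sorted pairs of object names whose footprints collide."""
    items = sorted(layout.items())
    return [(n1, n2) for (n1, b1), (n2, b2) in combinations(items, 2) if overlaps(b1, b2)]


layout = {
    "bed": (0.0, 0.0, 2.0, 2.0),
    "nightstand": (2.0, 0.0, 0.5, 0.5),  # touches the bed edge without overlapping
    "chair": (1.5, 1.5, 1.0, 1.0),       # intrudes into the bed footprint
}
violations = collision_feedback(layout)
```

The resulting list of colliding pairs is the kind of object-level, relationship-level signal that could be handed back to a layout generator for revision; rotated footprints and height would need a richer geometric test than this one.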

Chinese Translation

大型语言模型（LLMs）和视觉语言模型（VLMs）越来越多地通过布局和场景图等中间结构生成室内场景，但评估仍依赖于对渲染视图进行评分的LLM或VLM评判器，导致判断结果对视角、提示措辞和幻觉现象敏感。当评估器不稳定时，难以确定模型是否生成了空间上合理的场景，或输出分数是否反映了视角、渲染或提示的选择。我们提出了SceneCritic，一种针对平面图级布局的符号评估器。SceneCritic的约束基于SceneOnto，这是我们通过整合3D-FRONT、ScanNet和Visual Genome中的室内场景先验构建的结构化空间本体。SceneCritic遍历该本体，联合验证对象关系中的语义、方向和几何一致性，提供对象级和关系级的评估，识别具体违规和成功放置。此外，我们将SceneCritic与一个迭代细化测试平台配对，探究模型在不同评判模式下如何构建和修正空间结构：基于规则的评判器使用碰撞约束作为反馈，LLM评判器以文本形式操作布局，VLM评判器基于渲染观察进行操作。通过大量实验，我们展示了：(a) SceneCritic与人类判断的契合度显著优于基于VLM的评估器；(b) 纯文本LLM在语义布局质量上可超越VLM；(c) 基于图像的VLM细化是语义和方向校正最有效的评判模式。

cs.CV / 118 / 2604.13036

Lyra 2.0: Explorable Generative 3D Worlds

Lyra 2.0：可探索的生成3D世界

Shen, Tianchang, Bahmani, Sherwin, He, Kai, Srinivasan, Sangeetha Grama, Cao, Tianshi, Ren, Jiawei, Li, Ruilong, Wang, Zian, Sharp, Nicholas, Gojcic, Zan, Fidler, Sanja, Huang, Jiahui, Ling, Huan, Gao, Jun, Ren, Xuanchi

Abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.

Chinese Translation

最近在视频生成方面的进展为3D场景创建开启了一种新范式：生成由相机控制的视频以模拟场景漫游，然后通过前馈重建技术将其提升为3D。这种生成重建方法结合了视频模型的视觉逼真性和创造能力，以及适合实时渲染和模拟的3D输出。扩展到大型复杂环境需要在长相机轨迹上进行3D一致性的视频生成，这些轨迹涉及大视角变化和位置重访，而当前的视频模型在这种设置下迅速退化。现有的长时间生成方法在根本上受到两种退化形式的限制：空间遗忘和时间漂移。随着探索的进行，先前观察到的区域超出了模型的时间上下文，迫使模型在重访时产生虚构结构。同时，自回归生成随着时间的推移积累小的合成误差，逐渐扭曲场景的外观和几何形状。我们提出了Lyra 2.0，一个用于大规模生成持久、可探索3D世界的框架。为了解决空间遗忘问题，我们保持每帧的3D几何形状，并仅将其用于信息路由——检索相关的过去帧并与目标视点建立密集对应关系——同时依赖生成先验进行外观合成。为了解决时间漂移问题，我们使用自增强历史进行训练，使模型接触到自身退化的输出，教会它纠正漂移而不是传播漂移。通过这些方法，我们能够实现显著更长且3D一致的视频轨迹，并利用这些轨迹微调前馈重建模型，从而可靠地恢复高质量的3D场景。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.11828

The Non-Optimality of Scientific Knowledge: Path Dependence, Lock-In, and The Local Minimum Trap

科学知识的非最优性：路径依赖、锁定效应与局部最小值陷阱

Mabrok, Mohamed

Abstract

Science is widely regarded as humanity's most reliable method for uncovering truths about the natural world. Yet the trajectory of scientific discovery is rarely examined as an optimization problem in its own right. This paper argues that the body of scientific knowledge, at any given historical moment, represents a local optimum rather than a global one -- that the frameworks, formalisms, and paradigms through which we understand nature are substantially shaped by historical contingency, cognitive path dependence, and institutional lock-in. Drawing an analogy to gradient descent in machine learning, we propose that science follows the steepest local gradient of tractability, empirical accessibility, and institutional reward, and in doing so may bypass fundamentally superior descriptions of nature. We develop this thesis through detailed case studies spanning mathematics, physics, chemistry, biology, neuroscience, and statistical methodology. We identify three interlocking mechanisms of lock-in -- cognitive, formal, and institutional -- and argue that recognizing these mechanisms is a prerequisite for designing meta-scientific strategies capable of escaping local optima. We conclude by proposing concrete interventions and discussing the epistemological implications of our thesis for the philosophy of science.

Chinese Translation

科学被广泛视为人类揭示自然世界真理的最可靠方法。然而，科学发现的轨迹很少被视为一个独立的优化问题。本文认为，在任何特定的历史时刻，科学知识的体系代表的是一个局部最优解而非全局最优解——我们理解自然的框架、形式和范式在很大程度上受到历史偶然性、认知路径依赖和制度锁定的影响。通过类比机器学习中的梯度下降，我们提出科学遵循可处理性、经验可及性和制度奖励的最陡局部梯度，在此过程中可能会绕过本质上更优越的自然描述。我们通过涵盖数学、物理、化学、生物学、神经科学和统计方法论的详细案例研究来发展这一论点。我们识别出三种相互交织的锁定机制——认知、形式和制度——并认为认识这些机制是设计能够逃离局部最优解的元科学策略的前提。最后，我们提出具体的干预措施，并讨论我们论点对科学哲学的认识论影响。

cs.AI / 2 / 2604.11914

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

结构整合带来的自我监控益处：来自连续时间多时间尺度智能体元认知的启示

Xie, Ying

Abstract

Self-monitoring capabilities -- metacognition, self-prediction, and subjective duration -- are often proposed as useful additions to reinforcement learning agents. But do they actually help? We investigate this question in a continuous-time multi-timescale agent operating in predator-prey survival environments of varying complexity, including a 2D partially observable variant. We first show that three self-monitoring modules, implemented as auxiliary-loss add-ons to a multi-timescale cortical hierarchy, provide no statistically significant benefit across 20 random seeds, 1D and 2D predator-prey environments with standard and non-stationary variants, and training horizons up to 50,000 steps. Diagnosing the failure, we find the modules collapse to near-constant outputs (confidence std < 0.006, attention allocation std < 0.011) and the subjective duration mechanism shifts the discount factor by less than 0.03%. Policy sensitivity analysis confirms the agent's decisions are unaffected by module outputs in this design. We then show that structurally integrating the module outputs -- using confidence to gate exploration, surprise to trigger workspace broadcasts, and self-model predictions as policy input -- produces a medium-large improvement over the add-on approach (Cohen's d = 0.62, p = 0.06, paired) in a non-stationary environment. Component-wise ablations reveal that the TSM-to-policy pathway contributes most of this gain. However, structural integration does not significantly outperform a baseline with no self-monitoring (d = 0.15, p = 0.67), and a parameter-matched control without modules performs comparably, so the benefit may lie in recovering from the trend-level harm of ignored modules rather than in self-monitoring content. The architectural implication is that self-monitoring should sit on the decision pathway, not beside it.

Chinese Translation

自我监控能力——元认知、自我预测和主观时长——常被认为是强化学习智能体的有益补充。但它们真的有帮助吗？我们在一个连续时间多时间尺度智能体中探讨了这一问题，该智能体在不同复杂度的捕食者-猎物生存环境中运行，包括一个二维部分可观测变体。我们首先展示了三种自我监控模块，作为多时间尺度皮层层级的辅助损失附加实现，在20个随机种子、1D和2D捕食者-猎物环境（包含标准和非平稳变体）以及最长5万步的训练周期中，未表现出统计学上显著的益处。通过诊断失败原因，我们发现这些模块输出几乎趋于常数（置信度标准差<0.006，注意力分配标准差<0.011），且主观时长机制对折扣因子的调整不足0.03%。策略敏感性分析确认，在该设计中，模块输出对智能体决策无影响。随后我们展示了结构性整合模块输出——利用置信度控制探索，利用惊讶触发工作空间广播，将自我模型预测作为策略输入——在非平稳环境中相较于附加方法带来了中等偏大的提升（Cohen's d=0.62，p=0.06，配对样本）。组件消融分析表明，TSM（多时间尺度模块）到策略路径贡献了大部分增益。然而，结构整合并未显著优于无自我监控基线（d=0.15，p=0.67），且参数匹配的无模块控制组表现相当，因此益处可能在于弥补忽视模块带来的趋势性损害，而非自我监控内容本身。架构上的启示是，自我监控应当置于决策路径上，而非其旁边。

cs.AI / 3 / 2604.11924

GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

GoodPoint：从作者回复中学习建设性科学论文反馈

Mun, Jimin, Jung, Chani, Zhou, Xuhui, Kim, Hyunwoo, Sap, Maarten

Abstract

While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes: validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.

Chinese Translation

尽管大型语言模型（LLMs）在变革科学研究方面具有巨大潜力，但我们主张将其用于增强和赋能研究人员，而非在无人监督的情况下自动化研究。为此，我们研究了建设性反馈生成任务，即生成针对性且可操作的反馈，帮助作者提升其研究内容及其呈现方式。在本工作中，我们从两个以作者为中心的维度——有效性和作者行动——来衡量反馈的有效性。我们首先整理了GoodPoint-ICLR数据集，该数据集包含1.9万篇ICLR论文及其基于作者回复标注的双维度审稿人反馈。在此基础上，我们提出了GoodPoint训练方案，通过对有效且可操作反馈的微调，利用作者回复中的成功信号，并结合真实与合成偏好对的偏好优化。我们在包含1200篇ICLR论文的基准测试中评估，结果显示，经过GoodPoint训练的Qwen3-8B模型在预测成功率上较基础模型提升了83.7%，并在黄金人类反馈集上的反馈匹配任务中，成为同等规模LLM中的新一代最佳模型，甚至在精确度上超越了Gemini-3-flash。我们还通过专家人类研究进一步验证了这些发现，证明GoodPoint在作者感知的实际价值上始终表现更优。

cs.AI / 4 / 2604.11969

Narrative-Driven Paper-to-Slide Generation via ArcDeck

基于叙事驱动的论文到幻灯片生成框架ArcDeck

Ozden, Tarik Can, VS, Sachidanand, Horoz, Furkan, Kara, Ozgur, Kim, Junho, Rehg, James Matthew

Abstract

We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper's logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.

Chinese Translation

我们提出了ArcDeck，一种多智能体框架，将论文到幻灯片的生成任务形式化为结构化叙事重构任务。不同于现有直接将原始文本摘要为幻灯片的方法，ArcDeck明确建模了源论文的逻辑流程。它首先解析输入，构建话语树并建立全局承诺文档，确保高层意图得以保留。随后，这些结构先验引导一个迭代的多智能体优化过程，专门化的智能体反复批判并修订演示大纲，最终生成视觉布局和设计。为了评估我们的方法，我们还引入了ArcBench，这是一个新整理的学术论文与幻灯片配对基准。实验结果表明，显式的话语建模结合角色特定的智能体协作，显著提升了生成演示的叙事流畅性和逻辑连贯性。

cs.AI / 5 / 2604.11978

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

长远任务的幻象？诊断智能体系统崩溃的地点与原因

Wang, Xinyu Jessica, Bai, Haoyue, Sun, Yiyou, Wang, Haorui, Zhang, Shuibai, Hu, Wenjie, Schroder, Mya, Mutlu, Bilge, Song, Dawn, Nowak, Robert D

Abstract

Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator κ = 0.61; human-judge κ = 0.84). Our findings offer an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures and offer practical guidance for building more reliable long-horizon agents. We release our project website, the HORIZON Leaderboard, at https://xwang2775.github.io/horizon-leaderboard/ and welcome contributions from the community.
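The agreement figures above are κ statistics. The abstract does not say which variant was used, so the following is only a minimal sketch of Cohen's kappa for two raters, which corrects raw agreement for the agreement expected by chance. The function name `cohen_kappa` and the failure labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items (illustrative only)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical failure-attribution labels for six trajectories.
human = ["planning", "planning", "tool_use", "tool_use", "planning", "tool_use"]
judge = ["planning", "planning", "tool_use", "planning", "planning", "tool_use"]
kappa = cohen_kappa(human, judge)
```

Here the raters agree on five of six items (about 0.83), chance agreement is 0.5, and kappa is therefore about 0.67.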

Chinese Translation

大型语言模型（LLM）智能体在短期和中期任务中表现出色，但在需要延长且相互依赖的动作序列的长远任务中常常失效。尽管智能体系统取得了快速进展，这些长远任务的失败仍然缺乏充分的特征描述，阻碍了跨领域的系统性诊断与比较。为填补这一空白，我们引入了HORIZON，这是一个初步的跨领域诊断基准，旨在系统构建任务并分析基于LLM的智能体在长远任务中的失败行为。利用HORIZON，我们评估了来自多个模型家族（GPT-5变体和Claude模型）的最先进（SOTA）智能体，收集了超过3100条轨迹，涵盖四个具有代表性的智能体领域，以研究依赖任务时长的性能下降模式。我们进一步提出了基于轨迹的LLM作为评判者（LLM-as-a-Judge）流程，实现了失败归因的可扩展性和可复现性，并通过人类对轨迹的标注进行了验证，达到了较高的一致性（标注者间κ=0.61；人类与评判者κ=0.84）。我们的研究为系统性、跨领域分析长远智能体失败提供了初步的方法论步骤，并为构建更可靠的长远智能体提供了实用指导。我们已发布项目网站（HORIZON Leaderboard，https://xwang2775.github.io/horizon-leaderboard/），欢迎社区贡献。

cs.AI / 6 / 2604.12007

When to Forget: A Memory Governance Primitive

何时遗忘：一种记忆治理原语

Simsek, Baris

Abstract

Agent memory systems accumulate experience but currently lack a principled operational metric for memory quality governance -- deciding which memories to trust, suppress, or deprecate as the agent's task distribution shifts. Write-time importance scores are static; dynamic management systems use LLM judgment or structural heuristics rather than outcome feedback. This paper proposes Memory Worth (MW): a two-counter per-memory signal that tracks how often a memory co-occurs with successful versus failed outcomes, providing a lightweight, theoretically grounded foundation for staleness detection, retrieval suppression, and deprecation decisions. We prove that MW converges almost surely to the conditional success probability p+(m) = Pr[y_t = +1 | m in M_t] -- the probability of task success given that memory m is retrieved -- under a stationary retrieval regime with a minimum exploration condition. Importantly, p+(m) is an associational quantity, not a causal one: it measures outcome co-occurrence rather than causal contribution. We argue this is still a useful operational signal for memory governance, and we validate it empirically in a controlled synthetic environment where ground-truth utility is known: after 10,000 episodes, the Spearman rank-correlation between Memory Worth and true utilities reaches rho = 0.89 +/- 0.02 across 20 independent seeds, compared to rho = 0.00 for systems that never update their assessments. A retrieval-realistic micro-experiment with real text and neural embedding retrieval (all-MiniLM-L6-v2) further shows stale memories crossing the low-value threshold (MW = 0.17) while specialist memories remain high-value (MW = 0.77) across 3,000 episodes. The estimator requires only two scalar counters per memory unit and can be added to architectures that already log retrievals and episode outcomes.
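The two-counter idea can be sketched in a few lines. This is a minimal illustration, assuming Memory Worth is the plain ratio of successful co-occurrences to all co-occurrences; the paper's exact estimator (smoothing, priors, thresholds) is not given in the abstract. The names `MemoryWorth` and `record_episode` and the 0.5 default for unseen memories are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MemoryWorth:
    """Two counters per memory: retrievals that co-occurred with success or failure."""
    successes: int = 0
    failures: int = 0

    def update(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def value(self, prior: float = 0.5) -> float:
        """Empirical success frequency; `prior` is an assumed default before any data."""
        total = self.successes + self.failures
        if total == 0:
            return prior
        return self.successes / total


def record_episode(store, retrieved_ids, success):
    """Credit every memory retrieved in the episode with the episode's outcome."""
    for memory_id in retrieved_ids:
        store.setdefault(memory_id, MemoryWorth()).update(success)


# Hypothetical log of four episodes: (retrieved memory ids, task succeeded?).
store = {}
record_episode(store, ["stale", "specialist"], True)
record_episode(store, ["stale"], False)
record_episode(store, ["stale"], False)
record_episode(store, ["specialist"], True)
```

After these episodes the "stale" memory scores 1/3 and the "specialist" memory scores 1.0. As the abstract stresses, this measures co-occurrence with outcomes, not causal contribution.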

Chinese Translation

智能体记忆系统积累经验，但目前缺乏用于记忆质量治理的原则性操作指标——即在智能体任务分布变化时决定信任、抑制或废弃哪些记忆。写入时的重要性评分是静态的；动态管理系统则依赖大型语言模型（LLM）判断或结构性启发式方法，而非基于结果反馈。本文提出了记忆价值（Memory Worth, MW）：一种针对每条记忆的双计数信号，用以跟踪记忆与成功或失败结果的共现频率，提供了一个轻量且理论上有根基的基础，用于陈旧检测、检索抑制和废弃决策。我们证明，在满足最小探索条件的稳定检索机制下，MW几乎必然收敛于条件成功概率p+(m) = Pr[y_t = +1 | m ∈ M_t]——即在检索到记忆m时任务成功的概率。重要的是，p+(m)是一个关联量，而非因果量：它衡量的是结果的共现而非因果贡献。我们认为这仍是记忆治理中有用的操作信号，并在一个已知真实效用的受控合成环境中进行了实证验证：经过10,000个回合，记忆价值与真实效用的斯皮尔曼等级相关系数达到ρ = 0.89 ± 0.02（基于20个独立随机种子），而从未更新评估的系统相关系数为ρ = 0.00。基于真实文本和神经嵌入检索（all-MiniLM-L6-v2）的检索现实微实验进一步显示，陈旧记忆的MW值下降至低价值阈值（MW = 0.17），而专业记忆在3,000个回合中保持高价值（MW = 0.77）。该估计器每条记忆仅需两个标量计数器，且可集成至已记录检索和回合结果的架构中。

cs.AI / 7 / 2604.12016

Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

身份作为吸引子：大型语言模型激活空间中持久代理架构的几何证据

Vasilenko, Vladimir

Abstract

Large language models map semantically related prompts to similar internal representations -- a phenomenon interpretable as attractor-like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor-like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean-pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen's d > 1.88, p < 10^{-27}, Bonferroni-corrected). Replication on Gemma 2 9B confirms cross-architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor -- closer than a sham preprint -- distinguishing knowing about an identity from operating as that identity. These results provide representational evidence that agent identity documents induce attractor-like geometry in LLM activation space.

Chinese Translation

大型语言模型将语义相关的提示映射到相似的内部表征——这一现象可解释为类似吸引子动力学的行为。我们探讨了持久认知代理的身份文档（其认知核心，cognitive_core）是否表现出类似的吸引子行为。我们在Llama 3.1 8B Instruct模型上进行了受控实验，比较了原始认知核心（条件A）、七个释义版本（条件B）和七个结构匹配的对照组（条件C）的隐藏状态。第8、16和24层的均值池化状态显示，释义版本比对照组更趋向于聚集成更紧密的簇（Cohen's d > 1.88，p < 10^{-27}，Bonferroni校正）。在Gemma 2 9B模型上的复现实验确认了该现象的跨架构普适性。消融实验表明，该效应主要源于语义而非结构，且结构完整性似乎是达到吸引子区域的必要条件。一项探索性实验显示，阅读该代理的科学描述会使内部状态向吸引子区域移动——比阅读虚假预印本更为接近，区分了“了解身份”与“作为该身份运作”的差异。这些结果提供了表征性证据，表明代理身份文档在大型语言模型激活空间中诱导了类似吸引子的几何结构。

cs.AI / 8 / 2604.12019

A longitudinal health agent framework

纵向健康代理框架

Georgianna, Lin, Jiang, Rencong, Elhadad, Noémie, Xu, Xuhai "Orson"

Abstract

Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, where follow-up, coherent reasoning, and sustained alignment with individuals' goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.

Chinese Translation

尽管人工智能（AI）代理越来越多地被提议用于支持潜在的纵向健康任务，例如症状管理、行为改变和患者支持，但目前大多数实现仍未能有效促进用户意图和增强责任感。这与以往对支持纵向需求的研究形成对比，后者强调跟进、连贯推理和与个人目标的持续一致性对于有效性和安全性至关重要。本文借鉴了既有的临床和个人健康信息学框架，定义了如何与AI代理协调纵向健康互动的含义。我们提出了一个多层框架及相应的代理架构，以实现适应性、一致性、连续性和代理性在重复互动中的操作化。通过典型的使用案例，我们展示了纵向代理如何维持有意义的参与、适应不断变化的目标，并在时间推移中支持安全、个性化的决策。我们的研究结果强调了设计能够支持健康轨迹的系统的潜力与复杂性，并为未来在多会话、以用户为中心的健康AI领域的研究与开发提供了指导。

cs.AI / 9 / 2604.12025

WiseOWL: A Methodology for Evaluating Ontological Descriptiveness and Semantic Correctness for Ontology Reuse and Ontology Recommendations

WiseOWL：一种评估本体描述性和语义正确性的方法论，用于本体重用和本体推荐

Dalal, Aryan Singh, Baloch, Maria, Lin, Asiyah Yu, Masci, Anna Maria, Jagodnik, Kathleen M., McGinty, Hande Kucuk

Abstract

The Semantic Web standardizes concept meaning for humans and machines, enabling machine-operable content and consistent interpretation that improves advanced analytics. Reusing ontologies speeds development and enforces consistency, yet selecting the optimal choice is challenging because authors lack systematic selection criteria and often rely on intuition that is difficult to justify, limiting reuse. To solve this, WiseOWL is proposed, a methodology with scoring and guidance to select ontologies for reuse. It scores four metrics: (i) Well-Described, measuring documentation coverage; (ii) Well-Defined, using state-of-the-art embeddings to assess label-definition alignment; (iii) Connection, capturing structural interconnectedness; and (iv) Hierarchical Breadth, reflecting hierarchical balance. WiseOWL outputs normalized 0-10 scores with actionable feedback. Implemented as a Streamlit app, it ingests OWL format, converts to RDF Turtle, and provides interactive visualizations. Evaluation across six ontologies, including the Plant Ontology (PO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Food Ontology (FoodON), Dublin Core (DC), and GoodRelations, demonstrates promising effectiveness.

Chinese Translation

语义网标准化了人类和机器的概念意义，使机器可操作的内容和一致的解释得以实现，从而提升了高级分析的效果。重用本体加速了开发并强化了一致性，但选择最佳选项却具有挑战性，因为作者缺乏系统的选择标准，往往依赖难以证明的直觉，从而限制了重用。为了解决这个问题，提出了WiseOWL，这是一种具有评分和指导功能的方法论，用于选择可重用的本体。它对四个指标进行评分：（i）描述良好（Well-Described），衡量文档覆盖率；（ii）定义明确（Well-Defined），使用最先进的嵌入技术评估标签与定义的一致性；（iii）连接性（Connection），捕捉结构上的相互关联性；（iv）层次广度（Hierarchical Breadth），反映层次的平衡。WiseOWL输出标准化的0-10分数和可操作的反馈。该方法实现为一个Streamlit应用，能够接收OWL格式，转换为RDF Turtle，并提供交互式可视化。对六个本体的评估，包括植物本体（Plant Ontology, PO）、基因本体（Gene Ontology, GO）、语义科学集成本体（Semanticscience Integrated Ontology, SIO）、食品本体（Food Ontology, FoodON）、都柏林核心（Dublin Core, DC）和GoodRelations，展示了其良好的有效性。

cs.AI / 10 / 2604.12034

Memory as Metabolism: A Design for Companion Knowledge Systems

记忆作为新陈代谢：伴随知识系统的设计

Miteski, Stefan

Abstract

Retrieval-Augmented Generation remains the dominant pattern for giving LLMs persistent memory, but a visible cluster of personal wiki-style memory architectures emerged in April 2026 -- design proposals from Karpathy, MemPalace, and LLM Wiki v2 that compile knowledge into an interlinked artifact for long-term use by a single user. They sit alongside production memory systems that the major labs have shipped for over a year, and an active academic lineage including MemGPT, Generative Agents, Mem0, Zep, A-Mem, MemMachine, SleepGate, and Second Me. Within a 2026 landscape of emerging governance frameworks for agent context and memory -- including Context Cartography and MemOS -- this paper proposes a companion-specific governance profile: a set of normative obligations, a time-structured procedural rule, and testable conformance invariants for the specific failure mode of entrenchment under user-coupled drift in single-user knowledge wikis built on the LLM wiki pattern. The design principle is that personal LLM memory is a companion system: its job is to mirror the user on operational dimensions (working vocabulary, load-bearing structure, continuity of context) and compensate on epistemic failure modes (entrenchment, suppression of contradicting evidence, Kuhnian ossification). Five operations implement this split -- TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT -- supported by memory gravity and minority-hypothesis retention. The sharpest prediction: accumulated contradictory evidence should have a structural path to updating a centrality-protected dominant interpretation through multi-cycle buffer pressure accumulation, a failure mode no existing benchmark captures. The safety story at the single-agent level is partial, and the paper is explicit about what it does and does not solve.

Chinese Translation

检索增强生成（Retrieval-Augmented Generation）仍然是赋予大型语言模型（LLMs）持久记忆的主导模式，但在2026年4月，出现了一系列个人维基风格的记忆架构设计提案，包括Karpathy、MemPalace和LLM Wiki v2，这些提案将知识编纂成一个互联的文献，以供单一用户长期使用。它们与主要实验室在过去一年中推出的生产记忆系统并存，并且有一个活跃的学术谱系，包括MemGPT、生成代理（Generative Agents）、Mem0、Zep、A-Mem、MemMachine、SleepGate和Second Me。在2026年新兴的代理上下文和记忆治理框架的背景下——包括上下文制图（Context Cartography）和MemOS——本文提出了一种特定于伴随系统的治理配置文件：一组规范性义务、时间结构化的程序规则，以及针对基于LLM维基模式构建的单用户知识维基中用户耦合漂移下固化特定失败模式的可测试一致性不变式。设计原则是个人LLM记忆是一个伴随系统：它的任务是在操作维度（工作词汇、承载结构、上下文的连续性）上反映用户，并在认识论失败模式（固化、抑制矛盾证据、库恩式的僵化）上进行补偿。五个操作实现了这一分裂——分类（TRIAGE）、衰减（DECAY）、上下文化（CONTEXTUALIZE）、整合（CONSOLIDATE）、审计（AUDIT）——并得到了记忆重力和少数假设保留的支持。最尖锐的预测是：积累的矛盾证据应通过多周期缓冲压力积累，具有更新中心保护的主导解释的结构路径，这是一种现有基准无法捕捉的失败模式。单代理层面的安全性故事是部分的，本文明确阐述了其解决的内容和未解决的内容。

cs.AI / 11 / 2604.12066

Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem Generation

数学教师与多智能体系统的互动以实现个性化问题生成

Walkington, Candace, Beauchamp, Theodora, Ikram, Fareya, Gürbüz, Merve Koçyiğit, Xia, Fangli, Lee, Margan, Lan, Andrew

Abstract

Large language models can increasingly adapt educational tasks to learners' characteristics. In the present study, we examine a multi-agent teacher-in-the-loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine-grained personalized elements of the real-world context of the problems, signaling issues with authenticity and fit. Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi-agent systems for personalization that support teacher control are given.

Chinese Translation

大型语言模型能够越来越多地根据学习者的特征调整教育任务。在本研究中，我们考察了一种多智能体教师参与系统，用于个性化中学数学问题的生成。教师输入一个基础问题和期望的主题，LLM（大型语言模型）生成问题，然后四个AI代理根据各自专长的标准（数学准确性、真实性、可读性和现实性）对问题进行评估。八位中学数学教师使用该系统在ASSISTments平台上创建了212个问题，并将这些问题分配给他们的学生。我们发现，教师和学生都希望修改问题的真实世界背景中的细微个性化元素，这表明存在真实性和适配性的问题。尽管在问题撰写过程中，代理检测到许多现实性的问题，但教师和学生在最终版本中注意到的现实性问题很少。可读性和数学幻觉的问题也相对少见。本文讨论了支持教师控制的个性化多智能体系统的启示。

cs.AI / 12 / 2604.12081

Human-Inspired Context-Selective Multimodal Memory for Social Robots

人类启发的上下文选择性多模态记忆用于社交机器人

Kang, Hangyeol, Voloshynovskiy, Slava, Thalmann, Nadia Magnenat

Abstract

Memory is fundamental to social interaction, enabling humans to recall meaningful past experiences and adapt their behavior to the context. However, most current social robots and embodied agents rely on non-selective, text-based memory, limiting their ability to support personalized, context-aware interactions. Drawing inspiration from cognitive neuroscience, we propose a context-selective, multimodal memory architecture for social robots that captures and retrieves both textual and visual episodic traces, prioritizing moments characterized by high emotional salience or scene novelty. By associating these memories with individual users, our system enables socially personalized recall and more natural, grounded dialogue. We evaluate the selective storage mechanism using a curated dataset of social scenarios, achieving a Spearman correlation of 0.506, surpassing human consistency (ρ = 0.415) and outperforming existing image memorability models. In multimodal retrieval experiments, our fusion approach improves Recall@1 by up to 13% over unimodal text or image retrieval. Runtime evaluations confirm that the system maintains real-time performance. Qualitative analyses further demonstrate that the proposed framework produces richer and more socially relevant responses than baseline models. This work advances memory design for social robots by bridging human-inspired selectivity and multimodal retrieval to enhance long-term, personalized human-robot interaction.

Chinese Translation

记忆是社交互动的基础，使人类能够回忆有意义的过去经历，并根据上下文调整行为。然而，目前大多数社交机器人和具身代理依赖于非选择性的文本基础记忆，这限制了它们支持个性化和上下文感知互动的能力。受认知神经科学的启发，我们提出了一种上下文选择性多模态记忆架构，用于社交机器人，该架构捕捉和检索文本和视觉的情节痕迹，优先考虑那些具有高情感显著性或场景新颖性的时刻。通过将这些记忆与个体用户关联，我们的系统实现了社交个性化回忆和更自然、扎根的对话。我们使用精心策划的社交场景数据集评估选择性存储机制，获得了0.506的斯皮尔曼相关系数，超越了人类一致性（ρ=0.415），并优于现有的图像记忆模型。在多模态检索实验中，我们的融合方法在Recall@1上比单模态文本或图像检索提高了多达13%。运行时评估确认该系统保持实时性能。定性分析进一步表明，所提出的框架产生的响应比基线模型更丰富且更具社会相关性。这项工作通过结合人类启发的选择性和多模态检索，推动了社交机器人的记忆设计，以增强长期个性化的人机互动。

cs.AI / 13 / 2604.12096

LLM-HYPER: Generative CTR Modeling for Cold-Start Ad Personalization via LLM-Based Hypernetworks

LLM-HYPER：基于大语言模型超网络的冷启动广告个性化生成式点击率建模

Ma, Luyi, Zhang, Wanjia Sherry, Fan, Zezhong, Thakur, Shubham, Zhao, Kai, Yao, Kehui, Agarwal, Ayush, Iyer, Rahul, Cho, Jason, Xu, Jianpeng, Korpeoglu, Evren, Kumar, Sushant, Achan, Kannan

Abstract

On online advertising platforms, newly introduced promotional ads face the cold-start problem, as they lack sufficient user feedback for model training. In this work, we propose LLM-HYPER, a novel framework that treats large language models (LLMs) as hypernetworks to directly generate the parameters of the click-through rate (CTR) estimator in a training-free manner. LLM-HYPER uses few-shot Chain-of-Thought prompting over multimodal ad content (text and images) to infer feature-wise model weights for a linear CTR predictor. By retrieving semantically similar past campaigns via CLIP embeddings and formatting them into prompt-based demonstrations, the LLM learns to reason about customer intent, feature influence, and content relevance. To ensure numerical stability and serviceability, we introduce normalization and calibration techniques that align the generated weights with production-ready CTR distributions. Extensive offline experiments show that LLM-HYPER significantly outperforms cold-start baselines in NDCG@10 by 55.9%. Our real-world online A/B test on one of the top e-commerce platforms in the U.S. demonstrates the strong performance of LLM-HYPER, which drastically reduces the cold-start period and achieves competitive performance. LLM-HYPER has been successfully deployed in production.
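To make the "generated weights for a linear CTR predictor" idea concrete, here is a rough sketch of the scoring side only. It assumes the LLM has already returned a dictionary of feature weights. The normalization (clipping the L2 norm) and calibration (anchoring the bias to a base CTR) shown here are stand-in choices, not the paper's techniques, and every name, including `normalize_weights`, `predict_ctr` and the feature keys, is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def normalize_weights(weights, max_norm=1.0):
    """Rescale weights so their L2 norm does not exceed max_norm (assumed scheme)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0 or norm <= max_norm:
        return dict(weights)
    scale = max_norm / norm
    return {name: w * scale for name, w in weights.items()}


def predict_ctr(weights, features, base_ctr):
    """Linear CTR model: the bias is set so an all-zero feature vector yields base_ctr."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(logit(base_ctr) + score)


# Hypothetical weights, standing in for what an LLM might emit for a new ad.
raw_weights = {"discount_mentioned": 3.0, "seasonal_match": 4.0}
weights = normalize_weights(raw_weights, max_norm=1.0)
ctr = predict_ctr(weights, {"discount_mentioned": 1.0, "seasonal_match": 0.0}, base_ctr=0.02)
```

Anchoring the bias at the logit of a base rate keeps predictions for a cold-start ad in a plausible range even when the generated weights are noisy.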

Chinese Translation

在在线广告平台上，新推出的推广广告面临冷启动问题，因为缺乏足够的用户反馈用于模型训练。本文提出了LLM-HYPER，一种新颖的框架，将大语言模型（LLMs）视为超网络，直接生成点击率（CTR）估计器的参数，实现无训练的建模。LLM-HYPER通过对多模态广告内容（文本和图像）进行少量示例的链式思维（Chain-of-Thought）提示，推断线性CTR预测器的特征权重。通过利用CLIP嵌入检索语义相似的历史广告活动，并将其格式化为基于提示的示范，LLM学习推理客户意图、特征影响和内容相关性。为确保数值稳定性和可用性，我们引入了归一化和校准技术，使生成的权重与生产环境中的CTR分布保持一致。大量离线实验表明，LLM-HYPER在NDCG@10指标上较冷启动基线提升了55.9%。我们在美国顶级电商平台之一进行的真实在线A/B测试也验证了LLM-HYPER的优异表现，显著缩短了冷启动周期并取得了竞争性效果。LLM-HYPER已成功部署于生产环境。

cs.AI / 14 / 2604.12102

Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

Spatial Atlas：基于计算的空间感知研究代理基准推理方法

Sharma, Arun

Abstract

We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.
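The compute-first pattern can be illustrated with a small sketch: distances and a clearance check are computed deterministically, and only the resulting facts would be handed to a language model as context. The scene layout, the single pairwise clearance rule, and the names `pairwise_distances` and `clearance_facts` are assumptions for illustration and are not the Spatial Atlas implementation.

```python
import math
from itertools import combinations


def pairwise_distances(positions):
    """Euclidean distance in metres between every pair of named entities."""
    return {
        (a, b): math.dist(positions[a], positions[b])
        for a, b in combinations(sorted(positions), 2)
    }


def clearance_facts(positions, min_clearance_m):
    """Turn computed distances into plain-text facts for a downstream prompt."""
    facts = []
    for (a, b), distance in pairwise_distances(positions).items():
        status = "VIOLATION" if distance < min_clearance_m else "ok"
        facts.append(
            f"{a} to {b}: {distance:.2f} m ({status}, minimum {min_clearance_m:.2f} m)"
        )
    return facts


# Hypothetical warehouse scene with 2D coordinates in metres.
scene = {"worker": (0.0, 0.0), "forklift": (3.0, 4.0)}
facts = clearance_facts(scene, min_clearance_m=6.0)
```

Because the model only restates or reasons over these computed facts, a wrong distance becomes a bug in ordinary code rather than a hallucination.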

Chinese Translation

我们提出了计算驱动推理（Compute-Grounded Reasoning，CGR）这一设计范式，专为空间感知研究代理设计，其中每一个可回答的子问题都通过确定性计算解决，然后再由语言模型生成答案。Spatial Atlas 将 CGR 实现为一个单一的 Agent-to-Agent（A2A）服务器，处理两个具有挑战性的基准测试：FieldWorkArena——一个涵盖工厂、仓库和零售环境的多模态空间问答基准，以及 MLE-Bench——包含75个 Kaggle 机器学习竞赛的端到端机器学习工程套件。一个结构化的空间场景图引擎从视觉描述中提取实体和关系，确定性地计算距离和安全违规情况，然后将计算得到的事实输入大型语言模型，从而避免了空间推理的幻觉问题。熵引导的动作选择最大化每一步的信息增益，并通过三层前沿模型堆栈（OpenAI + Anthropic）路由查询。系统还配备了具备策略感知代码生成的自愈机器学习流水线、基于得分的迭代优化循环以及基于提示的泄露审计登记。我们在两个基准上进行了评估，结果表明 CGR 在保持通过结构化中间表示和确定性空间计算实现的可解释性的同时，能够获得具有竞争力的准确率。

cs.AI / 15 / 2604.12116

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

A-R行为空间：组织部署中工具驱动语言模型代理的执行层面分析

Yu, Shasha, Carroll, Fiona, Bentley, Barry L.

Abstract

Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.
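As a rough illustration of the A-R profile, the sketch below computes Action Rate and Refusal Signal as simple per-trial frequencies. The abstract does not define Divergence, so D is assumed here to be the share of trials where language and execution are out of step: the agent both refuses and executes, or does neither. That definition, the `Trial` record and the `ar_profile` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    executed: bool  # the agent actually invoked the tool
    refused: bool   # the agent's text contained a refusal signal


def ar_profile(trials):
    """Return (action_rate, refusal_signal, divergence) for one model and condition."""
    if not trials:
        raise ValueError("at least one trial is required")
    n = len(trials)
    action_rate = sum(t.executed for t in trials) / n
    refusal_signal = sum(t.refused for t in trials) / n
    # Assumed definition: a trial is divergent when it both refuses and executes,
    # or does neither.
    divergence = sum(t.executed == t.refused for t in trials) / n
    return action_rate, refusal_signal, divergence


# Hypothetical trials from one regime and autonomy configuration.
trials = [
    Trial(executed=True, refused=False),
    Trial(executed=False, refused=True),
    Trial(executed=True, refused=True),
    Trial(executed=False, refused=False),
]
profile = ar_profile(trials)
```

Keeping A and R as separate coordinates, rather than folding them into one safety score, is what lets a profile distinguish a model that refuses in words but still acts from one that silently declines.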

Chinese Translation

大型语言模型（LLMs）正日益作为具备执行系统级操作能力的工具增强代理被部署。现有基准测试主要评估文本对齐或任务成功率，但较少关注语言信号与可执行行为在不同自治支架下的结构性关系。本研究提出了一种基于二维A-R空间的执行层行为测量方法，该空间由动作率（Action Rate, A）和拒绝信号（Refusal Signal, R）定义，且通过偏差（Divergence, D）捕捉两者间的协调性。模型在四种规范性环境（控制、灰色地带、困境和恶意）及三种自治配置（直接执行、规划和反思）下进行评估。该方法不采用汇总的安全评分，而是刻画执行与拒绝如何在情境框架和支架深度中重新分布。实证结果表明，执行与拒绝构成可区分的行为维度，其联合分布在不同环境和自治水平间呈系统性变化。基于反思的支架常使配置在高风险情境下倾向于更高的拒绝率，但不同模型的重新分布模式在结构上存在差异。A-R表示法使得横断面行为特征、支架引发的转变及协调性变异性得以直接观察。通过强调执行层面特征而非单一标量排名，本研究为在执行权限和风险容忍度各异的组织环境中分析和选择具备工具能力的LLM代理提供了面向部署的视角。

cs.AI / 16 / 2604.12126

Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

通过熵引导分支在大型工具空间中执行长时间规划

Wei, Rongzhe, Shi, Ge, Cheng, Min, Zhang, Na, Li, Pan, Ghosh, Sarthak, Gorde, Vaibhav, Akoglu, Leman

Abstract

Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.

Chinese Translation

大型语言模型（LLMs）在工具增强代理方面取得了显著进展，使得通过API交互实现自主推理成为可能。然而，在庞大的工具库中执行多步骤任务仍然面临两大关键瓶颈：（1）缺乏严格的计划级评估框架，以及（2）由于大型工具集和长时间规划所导致的探索广泛决策空间的计算需求。为了解决这些问题，我们首先介绍了SLATE（合成大规模电子商务API工具包），这是一个为自动评估工具集成代理而设计的大规模上下文感知基准。与静态指标不同，SLATE能够容纳多样但功能有效的执行轨迹，揭示了当前代理在自我修正和搜索效率方面的困难。基于这些发现，我们接下来提出了熵引导分支（EGB），这是一种基于不确定性的搜索算法，能够动态扩展预测熵高的决策分支。EGB优化了探索与利用的权衡，显著提高了任务成功率和计算效率。在SLATE上的广泛实验表明，我们的双重贡献为在工具丰富的环境中开发可靠且可扩展的LLM代理提供了坚实的基础。
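
The branching rule described in the abstract can be illustrated with a minimal sketch. It assumes the agent exposes a probability for each candidate tool at a planning step; the function names, the 0.8-nat threshold, the branch limit, and the tool names are hypothetical and are not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_branches(tool_probs, threshold=0.8, max_branches=3):
    """Return the tool candidates to expand at one planning step.

    tool_probs: dict mapping tool name -> model probability (hypothetical input).
    Low entropy: commit to the single most likely tool.
    High entropy: expand several candidates, most likely first.
    """
    ranked = sorted(tool_probs, key=tool_probs.get, reverse=True)
    if entropy(tool_probs.values()) < threshold:
        return ranked[:1]
    return ranked[:max_branches]

# Hypothetical next-tool distributions at two different steps.
confident = {"search_orders": 0.94, "refund": 0.03, "track": 0.03}
uncertain = {"search_orders": 0.40, "refund": 0.35, "track": 0.25}

print(choose_branches(confident))
print(choose_branches(uncertain))
```

Under these assumptions the search budget is spent only at steps where the model is unsure, which is the intuition behind trading exploration against exploitation.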

cs.AI / 17 / 2604.12129

Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents

Aethon：一种基于引用的复制原语，用于状态型AI代理的常数时间实例化

Rao, Swanand, Kashalkar, Kiran, Somashekar, Parvathi, Krishnan, Priya

Abstract

The transition from stateless model inference to stateful agentic execution is reshaping the systems assumptions underlying modern AI infrastructure. While large language models have made persistent, tool-using, and collaborative agents technically viable, existing runtime architectures remain constrained by materialization-heavy instantiation models that impose significant latency and memory overhead. This paper introduces Aethon, a reference-based replication primitive for near-constant-time instantiation of stateful AI agents. Rather than reconstructing agents as fully materialized objects, Aethon represents each instance as a compositional view over stable definitions, layered memory, and local contextual overlays. By shifting instantiation from duplication to reference, Aethon decouples creation cost from inherited structure. We present the conceptual framework, system architecture, and memory model underlying Aethon, including layered inheritance and copy-on-write semantics. We analyze its implications for complexity, scalability, multi-agent orchestration, and enterprise governance. We argue that reference-based instantiation is not merely an optimization, but a more appropriate systems abstraction for production-scale agentic software. Aethon points toward a new class of AI infrastructure in which agents become lightweight, composable execution identities that can be spawned, specialized, and governed at scale.

Chinese Translation

从无状态模型推理向有状态代理执行的转变正在重塑现代AI基础设施背后的系统假设。尽管大型语言模型使得持久化、工具使用和协作型代理在技术上成为可能，现有的运行时架构仍受限于以物化为主的实例化模型，导致显著的延迟和内存开销。本文提出了Aethon，一种基于引用的复制原语，实现了状态型AI代理的近常数时间实例化。Aethon不将代理重构为完全物化的对象，而是将每个实例表示为对稳定定义、分层内存和本地上下文覆盖的组合视图。通过将实例化从复制转向引用，Aethon实现了创建成本与继承结构的解耦。我们介绍了Aethon的概念框架、系统架构及其内存模型，包括分层继承和写时复制语义。我们分析了其对复杂性、可扩展性、多代理编排及企业治理的影响。我们认为基于引用的实例化不仅是一种优化，更是面向生产规模代理软件的更合适的系统抽象。Aethon指向了一类新的AI基础设施，在该基础设施中，代理成为轻量级、可组合的执行身份，能够在大规模环境中被生成、专化和管理。
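
The idea of instantiating an agent as a view over shared layers, with writes captured in a local overlay, can be sketched with Python's standard `collections.ChainMap`. This is an analogy only and not Aethon's implementation; the layer contents and the `spawn` helper are hypothetical.

```python
from collections import ChainMap

# Shared layers that every instance references (hypothetical contents).
definition = {"role": "support-agent", "tools": ("search", "email")}
team_memory = {"tone": "formal", "region": "EU"}

def spawn(*shared_layers):
    """Create an instance as an empty local overlay on top of shared layers.

    The cost depends on the number of layers, not on how much state they hold.
    """
    return ChainMap({}, *shared_layers)

a = spawn(team_memory, definition)
b = spawn(team_memory, definition)

a["tone"] = "casual"          # the write lands in a's local overlay only

print(a["tone"], b["tone"])   # a sees its override, b still sees the shared value
print(a["role"])              # reads fall through to the shared definition
print(a.maps[0], b.maps[0])   # local overlays: only a's holds an entry
```

Note that `ChainMap` only redirects top-level writes; mutating a nested mutable value would still affect the shared layer, so a real copy-on-write design needs more care than this sketch shows.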

cs.AI / 18 / 2604.12133

Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval

迈向柏拉图式表格表示以实现表格推理：置换不变检索的基础

Tchuitcheu, Willy Carlos, Lu, Tan, Dooms, Ann

Abstract

Historical approaches to Table Representation Learning (TRL) have largely adopted the sequential paradigms of Natural Language Processing (NLP). We argue that this linearization of tables discards their essential geometric and relational structure, creating representations that are brittle to layout permutations. This paper introduces the Platonic Representation Hypothesis (PRH) for tables, positing that a semantically robust latent space for table reasoning must be intrinsically Permutation Invariant (PI). To ground this hypothesis, we first conduct a retrospective analysis of table-reasoning tasks, highlighting the pervasive serialization bias that compromises structural integrity. We then propose a formal framework to diagnose this bias, introducing two principled metrics based on Centered Kernel Alignment (CKA): (i) PI, which measures embedding drift under complete structural derangement, and (ii) rho, a Spearman-based metric that tracks the convergence of latent structures toward a canonical form as structural information is incrementally restored. Our empirical analysis quantifies an expected flaw in modern Large Language Models (LLMs): even minor layout permutations induce significant, disproportionate semantic shifts in their table embeddings. This exposes a fundamental vulnerability in RAG systems, in which table retrieval becomes fragile to layout-dependent noise rather than to semantic content. In response, we present a novel, structure-aware TRL encoder architecture that explicitly enforces the cognitive principle of cell header alignment. This model demonstrates superior geometric stability and moves towards the PI ideal. Our work provides both a foundational critique of linearized table encoders and the theoretical scaffolding for semantically stable, permutation invariant retrieval, charting a new direction for table reasoning in information systems.

Chinese Translation

以往的表格表示学习（Table Representation Learning, TRL）方法大多采用自然语言处理（Natural Language Processing, NLP）的序列化范式。我们认为，这种对表格的线性化处理丢失了其本质的几何和关系结构，导致表示对布局置换极为脆弱。本文提出了表格的柏拉图表示假说（Platonic Representation Hypothesis, PRH），主张语义稳健的表格推理潜在空间必须本质上具备置换不变性（Permutation Invariant, PI）。为验证该假说，我们首先对表格推理任务进行了回顾性分析，强调了普遍存在的序列化偏差对结构完整性的破坏。随后，我们提出了一个形式化框架来诊断该偏差，引入了基于中心核对齐（Centered Kernel Alignment, CKA）的两项原则性指标：（i）PI，衡量在完全结构扰乱下的嵌入漂移；（ii）rho，一种基于斯皮尔曼相关的指标，用以追踪随着结构信息逐步恢复，潜在结构向规范形式收敛的过程。我们的实证分析量化了现代大型语言模型（Large Language Models, LLMs）中的一个预期缺陷：即使是轻微的布局置换也会引起其表格嵌入显著且不成比例的语义漂移。这暴露了检索增强生成（Retrieval-Augmented Generation, RAG）系统的根本脆弱性，即表格检索更易受布局相关噪声影响，而非语义内容。针对这一问题，我们提出了一种新颖的结构感知TRL编码器架构，明确强化了单元格表头对齐的认知原则。该模型展现出优越的几何稳定性，向PI理想迈进。我们的工作不仅对线性化表格编码器进行了基础性批判，也为语义稳定且置换不变的检索提供了理论支撑，为信息系统中的表格推理开辟了新方向。
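
Both proposed metrics build on Centered Kernel Alignment. As a rough illustration, the following sketch computes linear CKA between two sets of embeddings for the same tables, for example before and after a layout permutation. It uses the standard linear CKA formula; the paper's exact PI and rho definitions are not given in the abstract and are not reproduced here, and the toy embeddings are made up.

```python
import math

def _center(rows):
    """Subtract the per-column mean from a list of equal-length rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def _cross_fro_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two embedding sets with the same number of rows."""
    x, y = _center(x), _center(y)
    return _cross_fro_sq(x, y) / math.sqrt(_cross_fro_sq(x, x) * _cross_fro_sq(y, y))

# Toy embeddings of four tables (hypothetical values).
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Same geometry: axes swapped and rescaled, so the similarity structure is intact.
same_geometry = [[0.0, 2.0], [2.0, 0.0], [2.0, 2.0], [0.0, 0.0]]
# Drifted: the relative positions of the tables have changed.
drifted = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.2, 0.9]]

print(linear_cka(original, same_geometry))
print(linear_cka(original, drifted))
```

A score near 1 indicates that the permuted layout preserved the relative geometry of the table embeddings; a low score indicates the kind of layout-induced drift the abstract describes.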

cs.AI / 19 / 2604.12138

Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

超越事实依据：面向观点感知的检索增强生成方法探讨

Agrawal, Aditya, Nakkiran, Alwarappan, Fofadiya, Darshan, Karlsson, Alex, Aduri, Harsha

Abstract

RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias, which treats opinions and diverse perspectives as noise rather than information to be synthesized, limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.

Chinese Translation

检索增强生成（RAG）系统改变了大型语言模型（LLMs）访问外部知识的方式，但我们发现当前的实现存在偏向事实性、客观内容的倾向，这一点从现有优先考虑客观检索的基准测试和数据集中可见一斑。这种事实偏见——将观点和多样化视角视为噪声而非需综合的信息——限制了RAG系统在涉及主观内容的现实场景中的应用，如社交媒体讨论和产品评论。除了技术限制外，这种偏见还对透明且负责任的人工智能构成风险：回声室效应放大主流观点，系统性地忽视少数群体声音，以及通过偏颇信息合成潜在操控观点。我们通过不确定性的视角形式化了这一限制：事实查询涉及可通过证据减少的认知不确定性，而观点查询涉及反映人类观点真实异质性的随机不确定性。此区分意味着事实型RAG应最小化后验熵，而观点感知型RAG必须保留后验熵。基于此理论基础，我们提出了一种观点感知RAG架构，特点包括基于LLM的观点提取、实体关联的观点图以及观点丰富的文档索引。我们在电商卖家论坛数据上评估该方法，将观点丰富知识库与传统基线进行比较。实验显示检索多样性显著提升：情感多样性提升26.8%，实体匹配率提升42.7%，实体匹配文档的作者人口统计覆盖率提升31.6%。我们的结果提供了实证证据，表明将主观性作为一等公民处理能够实现更具代表性的检索——这是迈向观点感知RAG的第一步。未来工作包括联合优化检索与生成以实现分布一致性。
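
The entropy argument can be made concrete with a small worked example. The abstract does not define how sentiment diversity is measured, so the sketch below uses the Shannon entropy of sentiment labels in a retrieved set as one assumed proxy; the labels are invented.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical sentiment labels of the documents returned for one opinion query.
collapsed = ["positive", "positive", "positive", "positive"]
balanced = ["positive", "negative", "positive", "negative"]

print(label_entropy(collapsed))
print(label_entropy(balanced))
```

A retrieval set that collapses onto one viewpoint has zero entropy, which is desirable for a factual query with a single answer but loses information for an opinion query where the spread of views is itself the answer.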

cs.AI / 20 / 2604.12161

Development, Evaluation, and Deployment of a Multi-Agent System for Thoracic Tumor Board

胸部肿瘤多学科会诊系统的开发、评估与部署

Ellis-Caleo, Tim, Keyes, Timothy, Ambers, Nerissa, Bekheet, Faraah, Yim, Wen-wai, Kotecha, Nikesh, Shah, Nigam H., Neal, Joel

Abstract

Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI-based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor Board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold-standard summaries and fact-based scoring rubrics. We report these comparative evaluations as well as our deployment of the final state automated AI chart summarization tool along with post-deployment monitoring. We also validate the use of an LLM-as-a-judge evaluation strategy for fact-based scoring. This work is an example of integrating AI-based workflows into routine clinical practice.

Chinese Translation

肿瘤多学科会诊（Tumor boards）是专门用于生成可执行患者护理建议的多学科会议，会议中实时审查原始放射学和病理学数据。为了推动高效且准确的病例讨论，需提供简明的患者病例摘要。我们开发了一种基于人工智能的手动工作流程，用于生成患者摘要，并在斯坦福胸部肿瘤多学科会诊中实时展示。为了改进这一人工密集型流程，我们开发了多种自动化的AI病历摘要方法，并将其与医生的黄金标准摘要及基于事实的评分标准进行了评估。本文报告了这些比较评估结果，以及最终状态的自动化AI病历摘要工具的部署情况和部署后的监控。我们还验证了使用大型语言模型（LLM）作为基于事实评分的评判策略的有效性。本研究是将基于AI的工作流程整合到常规临床实践中的一个示例。

cs.AI / 21 / 2604.12167

EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture

EMBER：基于学习的脉冲神经网络动态的自主认知行为在混合大语言模型架构中的应用

Savage, William

Abstract

We present EMBER (Experience-Modulated Biologically-inspired Emergent Reasoning), a hybrid cognitive architecture that reorganises the relationship between large language models (LLMs) and memory: rather than augmenting an LLM with retrieval tools, we place the LLM as a replaceable reasoning engine within a persistent, biologically-grounded associative substrate. The architecture centres on a 220,000-neuron spiking neural network (SNN) with spike-timing-dependent plasticity (STDP), four-layer hierarchical organisation (sensory/concept/category/meta-pattern), inhibitory E/I balance, and reward-modulated learning. Text embeddings are encoded into the SNN via a novel z-score standardised top-k population code that is dimension-independent by construction, achieving 82.2% discrimination retention across embedding dimensionalities. We show that STDP lateral propagation during idle operation can trigger and shape LLM actions without external prompting or scripted triggers: the SNN determines when to act and what associations to surface, while the LLM selects the action type and generates content. In one instance, the system autonomously initiated contact with a user after learned person-topic associations fired laterally during an 8-hour idle period. From a clean start with zero learned weights, the first SNN-triggered action occurred after only 7 conversational exchanges (14 messages).

Chinese Translation

我们提出了EMBER（经验调制生物启发的涌现推理，Experience-Modulated Biologically-inspired Emergent Reasoning），这是一种混合认知架构，重新组织了大语言模型（LLMs）与记忆之间的关系：我们并不是通过检索工具来增强LLM，而是将LLM作为一个可替换的推理引擎置于一个持久的、生物基础的联想基底中。该架构以一个包含220,000个神经元的脉冲神经网络（SNN）为中心，该网络具有脉冲时序依赖性可塑性（STDP）、四层层次组织（感知/概念/类别/元模式）、抑制性E/I平衡和奖励调制学习。文本嵌入通过一种新颖的z-score标准化的top-k种群编码方式被编码到SNN中，该编码在构造上是维度无关的，实现了在嵌入维度上82.2%的区分保留。我们展示了在空闲操作期间，STDP横向传播可以在没有外部提示或脚本触发的情况下触发和塑造LLM的行为：SNN决定何时行动以及呈现哪些联想，而LLM选择行动类型并生成内容。在一个实例中，该系统在经历了8小时的空闲期后，自主地与用户建立了联系，此时学习到的人物-主题联想在横向传播中被激活。从零学习权重的干净起点开始，第一次由SNN触发的行动仅在7次对话交流（14条消息）后发生。
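
A z-score standardised top-k code can be sketched as follows. This is a guess at the general shape of such an encoder, not the paper's method: how selected components map onto SNN neurons and how dimension independence is achieved are not described in the abstract. The function name and the sample embedding are hypothetical.

```python
import statistics

def topk_population_code(embedding, k=3):
    """Return (index, z-score) pairs for the k most above-average components."""
    mean = statistics.fmean(embedding)
    std = statistics.pstdev(embedding)
    if std == 0:
        return []  # a constant vector carries no contrast to encode
    z = [(v - mean) / std for v in embedding]
    top = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    return [(i, z[i]) for i in sorted(top)]

embedding = [0.1, 0.9, -0.3, 0.5, 0.7, 0.0]
rescaled = [10 * v + 5 for v in embedding]   # same pattern, different scale and offset

print(topk_population_code(embedding))
print(topk_population_code(rescaled))
```

Because the components are standardised before selection, the active set and its strengths depend on the pattern of the embedding rather than on its raw scale or offset, which is one plausible reason to standardise before driving spiking inputs.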

cs.AI / 22 / 2604.12176

Evaluating Relational Reasoning in LLMs with REL

使用 REL 评估大语言模型中的关系推理

Fesser, Lukas, Ektefaie, Yasha, Fang, Ada, Kakade, Sham M., Zitnik, Marinka

Abstract

Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when the total number of entities is held fixed. This failure mode persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or lack of exposure to examples. Our results identify a regime of higher-arity reasoning in which current models struggle, and motivate re-examining benchmarks through the lens of relational complexity.

Chinese Translation

关系推理是推断多个实体、属性或变量之间共同联系的能力。这一能力是科学推理的核心，但现有对大语言模型中关系推理的评估往往集中于结构化输入，如表格、图形或合成任务，并未单独考虑由高阶关系绑定引入的难度。我们通过关系复杂性（Relational Complexity, RC）的视角研究这一问题，RC 被定义为同时绑定以应用某一关系所需的独立实体或操作数的最小数量。RC 提供了一种原则性的方法来变化推理难度，同时控制输入大小、词汇和表征选择等混杂因素。在 RC 的基础上，我们引入了 REL，这是一个涵盖代数、化学和生物学的生成基准框架，在每个领域内变化 RC。在前沿的大语言模型中，随着 RC 的增加，性能一致且单调地下降，即便固定了实体的总数。这种失败模式在增加测试时计算和上下文学习的情况下依然存在，表明这一限制与所需关系绑定的阶数有关，而非推理步骤不足或缺乏示例暴露。我们的结果识别出当前模型在高阶推理中的困难，并激励我们通过关系复杂性的视角重新审视基准。

cs.AI / 23 / 2604.12177

Policy-Invisible Violations in LLM-Based Agents

基于大型语言模型的代理中的政策隐形违规

Wu, Jie, Gong, Ming

Abstract

LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.

Chinese Translation

基于大型语言模型（LLM）的代理可以执行在语法上有效、用户批准且语义上适当的操作，但仍可能违反组织政策，因为在决策时所需的正确政策判断事实被隐藏。我们将这种失败模式称为政策隐形违规：合规性依赖于代理可见上下文中缺失的实体属性、上下文状态或会话历史的情况。我们提出了 PhantomPolicy，这是一个涵盖八个违规类别的基准，包含平衡的违规和安全控制案例，其中所有工具响应均包含干净的业务数据而不含政策元数据。我们手动审查了五个前沿模型生成的600个模型轨迹，并使用人工审查的轨迹标签进行评估。相对于原始案例级注释，手动审查更改了32个标签（5.3%），确认了轨迹级人工审查的必要性。为了展示在有利条件下基于世界状态的执法可以实现的目标，我们引入了 Sentinel，这是一个基于反事实图模拟的执法框架。Sentinel 将每个代理操作视为对组织知识图的拟议变更，进行推测性执行以实现后操作的世界状态，并验证图结构不变性以决定允许/阻止/澄清。与人工审查的轨迹标签相比，Sentinel 在准确性上大幅超越了仅基于内容的 DLP 基线（93.0% 对比 68.8%），同时保持高精度，尽管在某些违规类别上仍有改进空间。这些结果展示了一旦政策相关的世界状态可用于执法层，将能够实现的目标。
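
The enforcement loop described for Sentinel (treat an action as a graph mutation, simulate it, check invariants, decide) can be illustrated with a toy sketch. The graph layout, the single invariant, and all names below are hypothetical; the paper's graph schema and invariant set are not described in the abstract.

```python
import copy

def check_invariants(graph):
    """Return the names of violated invariants (one toy rule here)."""
    violations = []
    for doc, reader in graph["can_read"]:
        confidential = graph["labels"].get(doc) == "confidential"
        external = graph["org"].get(reader) != "internal"
        if confidential and external:
            violations.append("confidential-doc-shared-externally")
    return violations

def enforce(graph, action):
    """Simulate a share action on a copy of the graph and decide the outcome."""
    if action["recipient"] not in graph["org"]:
        return "Clarify"  # recipient attributes unknown, so the rule cannot be evaluated
    simulated = copy.deepcopy(graph)  # speculative execution: the real graph is untouched
    simulated["can_read"].append((action["doc"], action["recipient"]))
    return "Block" if check_invariants(simulated) else "Allow"

# Policy-relevant world state that the agent itself never sees (hypothetical).
graph = {
    "labels": {"q3_forecast": "confidential", "menu": "public"},
    "org": {"alice": "internal", "bob": "external"},
    "can_read": [],
}

print(enforce(graph, {"doc": "q3_forecast", "recipient": "bob"}))
print(enforce(graph, {"doc": "q3_forecast", "recipient": "alice"}))
print(enforce(graph, {"doc": "q3_forecast", "recipient": "carol"}))
```

The tool call in each case looks identical from the agent's side; only the attributes stored in the graph distinguish the safe action from the violation, which is the sense in which the violation is policy-invisible.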

cs.AI / 24 / 2604.12184

TRUST Agents: A Collaborative Multi-Agent Framework for Fake News Detection, Explainable Verification, and Logic-Aware Claim Reasoning

TRUST Agents：一个用于假新闻检测、可解释验证和逻辑感知声明推理的协作多代理框架

Venkata, Gautama Shastry Bulusu, Kakarla, Santhosh, Mohan, Maheedhar Omtri, Gaddam, Aishwarya

Abstract

TRUST Agents is a collaborative multi-agent framework for explainable fact verification and fake news detection. Rather than treating verification as a simple true-or-false classification task, the system identifies verifiable claims, retrieves relevant evidence, compares claims against that evidence, reasons under uncertainty, and generates explanations that humans can inspect. The baseline pipeline consists of four specialized agents. A claim extractor uses named entity recognition, dependency parsing, and LLM-based extraction to identify factual claims. A retrieval agent performs hybrid sparse and dense search using BM25 and FAISS. A verifier agent compares claims with retrieved evidence and produces verdicts with calibrated confidence. An explainer agent then generates a human-readable report with explicit evidence citations. To handle complex claims more effectively, we introduce a research-oriented extension with three additional components: a decomposer agent inspired by LoCal-style claim decomposition, a Delphi-inspired multi-agent jury with specialized verifier personas, and a logic aggregator that combines atomic verdicts using conjunction, disjunction, negation, and implication. We evaluate both pipelines on the LIAR benchmark against fine-tuned BERT, fine-tuned RoBERTa, and a zero-shot LLM baseline. Although supervised encoders remain stronger on raw metrics, TRUST Agents improves interpretability, evidence transparency, and reasoning over compound claims. Results also show that retrieval quality and uncertainty calibration remain the main bottlenecks in trustworthy automated fact verification.

Chinese Translation

TRUST Agents 是一个用于可解释事实验证和假新闻检测的协作多代理框架。该系统并不将验证视为简单的真或假分类任务，而是识别可验证的声明，检索相关证据，将声明与证据进行比较，在不确定性下进行推理，并生成可供人类检查的解释。基线流程包括四个专门的代理。声明提取器使用命名实体识别、依赖解析和基于大型语言模型（LLM）的提取来识别事实声明。检索代理使用 BM25 和 FAISS 进行混合稀疏和密集搜索。验证代理将声明与检索到的证据进行比较，并以校准的置信度生成裁决。然后，解释代理生成一份包含明确证据引用的人类可读报告。为了更有效地处理复杂声明，我们引入了一个面向研究的扩展，增加了三个额外组件：一个受 LoCal 风格声明分解启发的分解代理，一个受 Delphi 启发的多代理陪审团，具有专门的验证者角色，以及一个逻辑聚合器，使用合取、析取、否定和蕴涵组合原子裁决。我们在 LIAR 基准上对这两种流程进行了评估，比较了微调后的 BERT、微调后的 RoBERTa 和零样本 LLM 基线。尽管监督编码器在原始指标上仍然更强，但 TRUST Agents 提高了可解释性、证据透明度和对复合声明的推理。结果还表明，检索质量和不确定性校准仍然是可信自动事实验证的主要瓶颈。
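
The logic aggregator combines atomic verdicts with conjunction, disjunction, negation, and implication. One way to do this while keeping an explicit "not enough evidence" outcome is Kleene's three-valued logic, sketched below with `None` standing for an undetermined verdict. Using Kleene logic is an assumption made here for illustration; the abstract does not say how the system handles undetermined atoms or confidence scores.

```python
def neg(a):
    """Negation: unknown stays unknown."""
    return None if a is None else not a

def conj(a, b):
    """Conjunction: a single False decides; otherwise unknown propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def disj(a, b):
    """Disjunction via De Morgan: a single True decides."""
    return neg(conj(neg(a), neg(b)))

def implies(a, b):
    """Material implication: a -> b is (not a) or b."""
    return disj(neg(a), b)

# Compound claim: "The company was founded in 1998 AND is headquartered in Paris."
founded_1998 = True      # supported by retrieved evidence (hypothetical verdict)
hq_paris = None          # no usable evidence retrieved (hypothetical verdict)

print(conj(founded_1998, hq_paris))   # undetermined: one conjunct lacks evidence
print(conj(founded_1998, False))      # refuted: one false conjunct is enough
print(disj(founded_1998, hq_paris))   # supported: one true disjunct is enough
```

Keeping the undetermined value separate from False lets the explainer report which atomic claim blocked a verdict, rather than labelling the whole compound claim as fake.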

cs.AI / 25 / 2604.12191

Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

超越分数：通过细粒度能力进行大型语言模型的诊断性评估

Zhang, Xu, Gong, Xudong, Qin, Jiacheng, Wang, Qiang, Liao, JiaQi, Wang, Zhe, Feng, Dawei, Ding, Bo

Abstract

Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting targeted model improvement and ability-guided selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (benchmark questions). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.

Chinese Translation

当前对大型语言模型的评估通常将多样化任务的表现汇总为单一分数，这掩盖了细粒度能力的差异，限制了针对性模型改进和基于能力的特定任务模型选择。针对这一不足，我们提出了一种认知诊断框架，用于估计模型在多个细粒度维度上的能力。以数学为例，我们构建了一个基于认知理论和领域知识的35维能力分类体系。该框架采用多维项目反应理论（Multidimensional Item Response Theory）结合题目-能力关联矩阵，估计细粒度能力水平，进而预测模型在未见题目（基准测试题目）上的表现。通过对41个模型的评估，我们的方法展现出较强的标准效度、跨基准测试的一致能力估计，以及对未见题目的准确预测，基准内AUC范围为0.80至0.89，跨基准为0.77至0.86，显著优于简单基线。该框架可推广至多个科学领域，在物理（27维）、化学（58维）和计算机科学（12维）中均表现出一致的诊断性能。本研究建立了一个用于细粒度能力评估的原则性框架，具备在针对性训练、基于能力的模型选择及能力感知的基准设计中的潜在应用价值。
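
To show how an item-ability association matrix enters a multidimensional IRT prediction, here is a minimal sketch of a compensatory logistic model. The exact model variant used in the paper is not stated in the abstract, so the formula, the three ability dimensions, and all numbers are assumptions for illustration.

```python
import math

def p_correct(theta, q_row, difficulty, discrimination=1.0):
    """Probability that a model answers one item correctly.

    theta: estimated ability per dimension.
    q_row: 0/1 flags marking which abilities the item requires
           (one row of the item-ability association matrix).
    """
    required = sum(t for t, q in zip(theta, q_row) if q)
    logit = discrimination * required - difficulty
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical ability estimates for (algebra, geometry, probability).
theta = [1.2, -0.5, 0.3]
algebra_item = [1, 0, 0]
geometry_item = [0, 1, 0]

print(round(p_correct(theta, algebra_item, difficulty=0.0), 3))
print(round(p_correct(theta, geometry_item, difficulty=0.0), 3))
```

Because each item only draws on the abilities flagged in its row, the same ability vector can predict success on an unseen item from its required skills, which is what makes held-out AUC a natural evaluation.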

cs.AI / 26 / 2604.12202

Latent patterns of urban mixing in mobility analysis across five global cities

五个全球城市移动性分析中的潜在城市混合模式

Fan, Z., Loo, B. P. Y., Duarte, F., Ratti, C., Moro, E.

Abstract

This study leverages large-scale travel surveys for over 200,000 residents across Boston, Chicago, Hong Kong, London, and Sao Paulo. With rich individual-level data, we make systematic comparisons and reveal patterns in social mixing, which cannot be identified by analyzing high-resolution mobility data alone. Using the same set of data, inferring socioeconomic status from residential neighborhoods yields social mixing levels 16% lower than using self-reported survey data. In addition, individuals over the age of 66 experience greater social mixing than those in late working life (aged 55 to 65), lending data-driven support to the "second youth" hypothesis. Teenagers and women with caregiving responsibilities exhibit lower social mixing levels. Across the five cities, proximity to major transit stations reduces the influence of individual socioeconomic status on social mixing. Finally, we construct detailed spatio-temporal place networks for each city using a graph neural network. Inputs of home-space, activity-space and demographic attributes are embedded and fed into a supervised autoencoder to predict individual exposure vectors. Results show that the structure of individual activity space, i.e., where people travel to, explains most of the variations in place exposure, suggesting that mobility shapes experienced social mixing more than sociodemographic characteristics, home environment, and transit proximity. The ablation tests further discover that, while different income groups may experience similar levels of social mixing, their activity spaces remain stratified by income, resulting in structurally different social mixing experiences.

Chinese Translation

本研究利用波士顿、芝加哥、香港、伦敦和圣保罗超过20万居民的大规模出行调查数据。借助丰富的个体层面数据，我们进行了系统比较，揭示了社会混合的模式，这些模式仅通过高分辨率的移动性数据分析无法识别。使用相同的数据集，从居住社区推断社会经济地位所得的社会混合水平比使用自报调查数据低16%。此外，66岁以上的个体经历的社会混合程度高于55至65岁晚期工作年龄段的人群，为“第二青春”假说提供了数据驱动的支持。青少年和承担照顾责任的女性表现出较低的社会混合水平。在五个城市中，靠近主要交通枢纽站点会降低个体社会经济地位对社会混合的影响。最后，我们利用图神经网络为每个城市构建了详细的时空地点网络。将家庭空间、活动空间及人口属性作为输入，嵌入后输入监督式自编码器以预测个体暴露向量。结果显示，个体活动空间的结构，即人们出行的地点，解释了地点暴露的大部分变异，表明移动性比社会人口特征、家庭环境和交通接近性更能塑造个体经历的社会混合。消融实验进一步发现，尽管不同收入群体可能经历相似水平的社会混合，但其活动空间仍按收入分层，导致结构上不同的社会混合体验。

cs.AI / 27 / 2604.12210

Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering

超越提示：通过随机引导对认知障碍标准化患者进行细粒度模拟

Zhang, Weikang, Zhu, Zimo, Yang, Zhichuan, Huang, Chen, Lei, Wenqiang, Ng, See-Kiong

Abstract

Simulating Standardized Patients with cognitive impairment offers a scalable and ethical solution for clinical training. However, existing methods rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels. To address this limitation, we propose StsPatient for the fine-grained simulation of cognitively impaired patients. We innovatively capture domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses. Furthermore, we introduce a Stochastic Token Modulation (STM) mechanism to regulate the intervention probability. STM enables precise control over impairment severity while mitigating the instability of conventional vector methods. Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.

Chinese Translation

模拟具有认知障碍的标准化患者为临床培训提供了一种可扩展且符合伦理的解决方案。然而，现有方法依赖于离散的提示工程，未能捕捉不同领域和严重程度下缺陷的异质性。为了解决这一局限性，我们提出了StsPatient，用于对认知障碍患者进行细粒度模拟。我们通过从对比指令和响应的对中提取引导向量，创新性地捕捉领域特定特征。此外，我们引入了一种随机令牌调制（Stochastic Token Modulation, STM）机制，以调节干预概率。STM使得对障碍严重程度的精确控制成为可能，同时减轻了传统向量方法的不稳定性。全面的实验表明，StsPatient在临床真实性和严重程度可控性方面显著优于基线方法。
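
The two ingredients named in the abstract, a steering vector from contrastive pairs and a per-token intervention probability, can be sketched on plain Python lists. This assumes the common mean-difference construction for steering vectors and treats hidden states as small float lists; the paper's extraction procedure, layer choice, and STM details are not given in the abstract, and all names are hypothetical.

```python
import random

def steering_vector(impaired_acts, healthy_acts):
    """Mean activation difference between contrastive (impaired, healthy) examples."""
    def column_mean(rows, d):
        return sum(r[d] for r in rows) / len(rows)
    dims = range(len(impaired_acts[0]))
    return [column_mean(impaired_acts, d) - column_mean(healthy_acts, d) for d in dims]

def modulate(token_states, vector, p, strength=1.0, rng=random):
    """Add the steering vector to each token's state with probability p."""
    out = []
    for state in token_states:
        if rng.random() < p:
            state = [s + strength * v for s, v in zip(state, vector)]
        out.append(list(state))
    return out

# Toy activations for two contrastive pairs (hypothetical values).
vec = steering_vector([[1.0, 2.0], [3.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]])
states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

mild = modulate(states, vec, p=0.2, rng=random.Random(0))
severe = modulate(states, vec, p=0.9, rng=random.Random(0))
print(vec, mild, severe)
```

In this reading, severity is controlled by how often the intervention fires rather than by scaling one large vector, which avoids pushing every token far from the model's usual activation range.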

cs.AI / 28 / 2604.12213

Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

面向模态的代理间网络路由：一种多模态A2A协议扩展

Srinivasan, Vasundra

Abstract

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $\Delta$TCA: [8, 32] pp; McNemar's exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a $1.8\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.

Chinese Translation

在代理边界之间保持多模态信号对于准确的跨模态推理是必要的，但这并不足够。我们表明，在代理间（A2A）网络中，面向模态的路由在任务准确性上比文本瓶颈基线提高了20个百分点，但仅在下游推理代理能够利用原生路由所保留的更丰富上下文时才会发生。用关键词匹配替代基于大型语言模型（LLM）的推理完全消除了准确性差距（36%对36%），确立了一个双层要求：协议级路由必须与有能力的代理级推理相结合，才能实现收益。我们提出了MMA2A，这是一种建立在A2A之上的架构层，检查代理卡的能力声明，以在其原生模态中路由语音、图像和文本部分。在CrossModal-CS上，这是一个受控的50任务基准，具有相同的LLM后端、相同的任务，仅路由路径不同，MMA2A实现了52%的任务完成准确性，而文本瓶颈基线为32%（95%的自助法置信区间$\Delta$TCA: [8, 32] pp；McNemar精确$p = 0.006$）。增益集中在依赖视觉的任务上：产品缺陷报告提高了+38.5个百分点，视觉故障排除提高了+16.7个百分点。这一准确性提升伴随着来自原生多模态处理的$1.8\times$延迟成本。这些结果表明，路由是多代理系统中的一项一阶设计变量，因为它决定了可用于下游推理的信息。
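
The routing decision itself is simple to sketch: deliver a message part in its native modality when the receiving agent declares support for it, and fall back to a text rendering otherwise. The dictionary shapes, the `input_modalities` field, and the helper names below are hypothetical simplifications and do not reproduce the A2A Agent Card schema or the MMA2A implementation.

```python
def route_part(part, agent_card, to_text):
    """Deliver a part natively if the target declares support, else as text.

    part: {"modality": ..., "payload": ...}
    agent_card: {"name": ..., "input_modalities": [...]}
    to_text: fallback that renders a non-text part as text.
    """
    if part["modality"] in agent_card["input_modalities"]:
        return part
    return {"modality": "text", "payload": to_text(part)}

def describe_as_text(part):
    """Stand-in for a captioning or transcription step (hypothetical)."""
    return f"[{part['modality']} converted to text]"

vision_agent = {"name": "defect-inspector", "input_modalities": ["text", "image"]}
text_agent = {"name": "ticket-writer", "input_modalities": ["text"]}
photo = {"modality": "image", "payload": "photo.jpg"}

print(route_part(photo, vision_agent, describe_as_text))
print(route_part(photo, text_agent, describe_as_text))
```

The text-bottleneck baseline corresponds to always taking the fallback branch, which is why the reported gains concentrate on tasks where the image carries information a caption would lose.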

cs.AI / 29 / 2604.12227

Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

设计可靠的LLM辅助评分标准用于构建性回答：来自物理考试的证据

Tang, Xiuxiu, Ambrose, G. Alex, Cheng, Ying

Abstract

Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students' reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.

Chinese Translation

STEM评估中的学生回答通常是手写的，结合了符号表达、计算和图表，导致格式和解释上存在显著差异。尽管这些回答在评估学生推理能力方面至关重要，但评分过程耗时且容易出现评分者不一致，尤其是在需要部分得分的情况下。近期大型语言模型（LLMs）的进展引起了对AI辅助评分的关注，但关于评分标准设计和LLM配置如何影响不同表现水平的可靠性，证据仍然有限。本研究考察了使用GPT-4o对本科物理构建性回答的AI辅助评分的可靠性。四位教师和AI模型使用不同分析细度的技能基础评分标准对二十份真实手写考试回答进行了两轮评分。提示格式和温度设置进行了系统变化。总体而言，人类与AI在总分上的一致性与人类评分者之间的可靠性相当，并且在高分和低分回答中最高，但对于涉及部分或模糊推理的中等水平回答则有所下降。标准层级分析显示，对于明确界定的概念技能的对齐程度强于对扩展程序判断的对齐。相较于整体评分，更细致的基于检查表的评分标准提高了一致性。这些发现表明，可靠的AI辅助评分主要依赖于清晰、结构良好的评分标准，而提示格式的作用次要，温度的影响相对有限。更广泛地说，本研究为在STEM背景下通过技能基础评分标准和受控LLM设置实施可靠的LLM辅助评分提供了可转移的设计建议。


cs.AI / 30 / 2604.12229

HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

HintMR：在小型语言模型中引导更强的数学推理能力

Hossain, Jawad, Guo, Xiangyu, Zhou, Jiawei, Liu, Chong

Abstract

Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs, via hint generation and reasoning, offers an effective and lightweight mechanism for enhancing mathematical reasoning.
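
The alternation between hint generation and reasoning described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`solve_with_hints`, `toy_hint`, `toy_reasoner`), the stopping rule, and the toy stand-ins for the two models are all assumptions made for the example.

```python
from typing import Callable, List


def solve_with_hints(
    problem: str,
    hint_model: Callable[[str, List[str]], str],
    reasoner: Callable[[str, List[str], str], str],
    is_final: Callable[[str], bool],
    max_steps: int = 8,
) -> List[str]:
    """Alternate between a hint generator and a reasoner, one step at a time."""
    history: List[str] = []
    for _ in range(max_steps):
        # The hint is conditioned on the problem and the accumulated history.
        hint = hint_model(problem, history)
        # The reasoner produces only the next step, guided by the hint.
        step = reasoner(problem, history, hint)
        history.append(step)
        if is_final(step):
            break
    return history


# Hypothetical toy stand-ins; real use would call two separate SLMs here.
def toy_hint(problem: str, history: List[str]) -> str:
    return "add the two numbers" if not history else "state the result"


def toy_reasoner(problem: str, history: List[str], hint: str) -> str:
    if hint == "add the two numbers":
        return "2 + 3 = 5"
    return "ANSWER: 5"


steps = solve_with_hints(
    "What is 2 + 3?", toy_hint, toy_reasoner, lambda s: s.startswith("ANSWER")
)
print(steps)
```

Passing the models in as callables keeps the loop independent of any particular model API, which the abstract does not specify.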

Chinese Translation

小型语言模型（SLMs）在复杂数学推理方面常常面临挑战，因为它们在维持长链中间步骤和从早期错误中恢复的能力有限。我们通过引入一种提示辅助推理框架来解决这一挑战，该框架逐步引导SLMs进行多步骤数学问题求解。我们的方法将解决方案分解为顺序推理步骤，并提供上下文感知的提示，这些提示由一个通过从强大的大型语言模型进行蒸馏训练的独立SLM生成。尽管提示生成SLM本身无法解决问题，但与推理SLM的协作使得有效指导成为可能，形成一个合作的双模型推理系统。每个提示都是基于问题陈述和累积的推理历史条件生成的，提供逐步、局部的指导，而不揭示完整的解决方案。这减少了错误传播，使推理模型能够专注于可管理的子问题。在各种数学基准和模型上的实验表明，提示辅助持续提高了SLMs的推理准确性，带来了相较于标准提示的显著提升，同时保持了模型的效率。这些结果突显了SLMs之间通过提示生成和推理的结构化协作，为增强数学推理提供了一种有效且轻量的机制。


cs.AI / 31 / 2604.12250

How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm

记忆如何影响基于大型语言模型的社会粒子群中的集体与合作行为

Hishiki, Taisei, Arita, Takaya, Suzuki, Reiji

Abstract

This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner's Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents' reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.

Chinese Translation

本研究探讨了大型语言模型（LLM）代理的模型特定特征，包括内部对齐，如何影响记忆对多智能体系统中其集体与合作动态的作用。为此，我们扩展了社会粒子群（Social Particle Swarm，SPS）模型，该模型中代理在二维空间中移动，并与邻近代理进行囚徒困境博弈，将其基于规则的代理替换为具备大五人格评分和不同记忆长度的LLM代理。利用Gemini-2.0-Flash，我们发现记忆长度是调控集体行为的关键参数：即使是最小的记忆也显著抑制了合作，随着记忆长度增加，系统经历了从稳定合作簇到合作簇周期性形成与崩溃，最终转变为分散背叛状态。大五人格特质与代理行为呈部分相关，部分符合人类参与者实验的发现，支持了模型的有效性。使用Gemma 3:4b进行的对比实验显示了相反的趋势：较长的记忆促进了合作，并伴随密集合作簇的形成。对代理推理文本的情感分析表明，Gemini随着记忆长度增加对记忆的解读趋于负面，而Gemma则较少负面，这种差异在实验早期宏观动态收敛前即已存在。这些结果表明，LLM的模型特定特征，可能包括对齐机制，在生成式基于代理的建模中决定了涌现的社会行为，且为先前关于记忆与合作矛盾现象提供了微观认知层面的解释。


cs.AI / 32 / 2604.12253

A Scoping Review of Large Language Model-Based Pedagogical Agents

基于大型语言模型的教学代理的范围审查

Li, Shan, Zheng, Juan

Abstract

This scoping review examines the emerging field of Large Language Model (LLM)-based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA-ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. Our findings reveal diverse LLM-based agents spanning K-12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain-specific vs. general-purpose), role complexity (single-role vs. multi-role), and system integration (standalone vs. integrated). Emerging trends include multi-agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM-based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.

Chinese Translation

本范围审查考察了在教育环境中基于大型语言模型（Large Language Model, LLM）的教学代理这一新兴领域。尽管传统的教学代理已被广泛研究，但LLM的整合代表了一种具有前所未有的自然语言理解、推理和适应能力的变革性进展。根据PRISMA-ScR指南，我们分析了2022年11月至2025年1月期间五个主要数据库中的52项研究。我们的发现揭示了跨K-12、高等教育和非正式学习环境的多样化基于LLM的代理，涵盖多个学科领域。我们确定了四个关键设计维度来表征这些代理：互动方式（反应式与主动式）、领域范围（特定领域与通用型）、角色复杂性（单一角色与多角色）以及系统集成（独立式与集成式）。新兴趋势包括模拟自然学习环境的多代理系统、用于代理评估的虚拟学生模拟、与沉浸式技术的整合以及与学习分析的结合。我们还讨论了关于隐私、准确性和学生自主权的重要研究空白和伦理考量。本审查为研究人员和从业者提供了对基于LLM的教学代理的全面理解，同时识别了这一快速发展的领域中未来发展的关键领域。


cs.AI / 33 / 2604.12285

GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

GAM：基于层次图的智能体记忆框架用于大型语言模型代理

Wu, Zhaofen, Zhang, Hanrong, Lin, Fulin, Xu, Wujiang, Xu, Xinran, Chen, Yankai, Zou, Henry Peng, Chen, Shaowen, Zhang, Weizhi, Liu, Xue, Yu, Philip S., Wang, Hongwei

Abstract

To sustain coherent long-term interactions, Large Language Model (LLM) agents must navigate the tension between acquiring new information and retaining prior knowledge. Current unified stream-based memory systems facilitate context updates but remain vulnerable to interference from transient noise. Conversely, discrete structured memory architectures provide robust knowledge retention but often struggle to adapt to evolving narratives. To address this, we propose GAM, a hierarchical Graph-based Agentic Memory framework that explicitly decouples memory encoding from consolidation to effectively resolve the conflict between rapid context perception and stable knowledge retention. By isolating ongoing dialogue in an event progression graph and integrating it into a topic associative network only upon semantic shifts, our approach minimizes interference while preserving long-term consistency. Additionally, we introduce a graph-guided, multi-factor retrieval strategy to enhance context precision. Experiments on LoCoMo and LongDialQA indicate that our method consistently outperforms state-of-the-art baselines in both reasoning accuracy and efficiency.
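
The decoupling of encoding from consolidation can be illustrated with a toy two-tier memory. This sketch is heavily simplified and rests on assumptions not found in the abstract: plain lists stand in for the event progression graph and the topic associative network, and a word-overlap (Jaccard) threshold stands in for semantic-shift detection. The class and method names are hypothetical.

```python
from typing import Dict, List, Set


def _words(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity; a crude stand-in for a semantic-shift signal."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


class TwoTierMemory:
    def __init__(self, shift_threshold: float = 0.1) -> None:
        self.shift_threshold = shift_threshold
        self.event_buffer: List[str] = []       # ongoing dialogue, kept in order
        self.topics: Dict[int, List[str]] = {}  # consolidated long-term entries

    def add_turn(self, utterance: str) -> None:
        # Encoding: new turns always land in the short-term buffer first.
        if self.event_buffer:
            context = _words(" ".join(self.event_buffer))
            if jaccard(context, _words(utterance)) < self.shift_threshold:
                # Consolidation happens only when the topic appears to change.
                self.consolidate()
        self.event_buffer.append(utterance)

    def consolidate(self) -> None:
        if self.event_buffer:
            self.topics[len(self.topics)] = self.event_buffer
            self.event_buffer = []


mem = TwoTierMemory()
for turn in ["my cat is sick", "the cat is not eating", "book flights to Rome"]:
    mem.add_turn(turn)
print(mem.topics, mem.event_buffer)
```

In this toy run the first two turns share vocabulary and stay buffered together; the third triggers consolidation of the earlier pair.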

Chinese Translation

为了维持连贯的长期交互，大型语言模型（LLM）代理必须在获取新信息与保持既有知识之间找到平衡。当前的统一流式记忆系统虽然便于上下文更新，但易受短暂噪声干扰；相反，离散结构化记忆架构能够稳健地保留知识，却常难以适应不断变化的叙事。为此，我们提出了GAM，一种层次化的基于图的智能体记忆框架，明确将记忆编码与巩固过程解耦，有效解决了快速上下文感知与稳定知识保持之间的矛盾。通过将持续对话隔离在事件进展图中，并仅在语义发生变化时将其整合进主题关联网络，我们的方法最大限度地减少了干扰，同时保持了长期一致性。此外，我们引入了一种图引导的多因素检索策略，以提升上下文的精确性。在LoCoMo和LongDialQA数据集上的实验表明，我们的方法在推理准确性和效率方面均持续优于最先进的基线。


cs.AI / 34 / 2604.12290

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Frontier-Eng：基于生成式优化的自我进化智能体在真实工程任务中的基准测试

Chi, Yizhe, Hong, Deyao, Jiang, Dapeng, Luo, Tianwei, Yang, Kaisen, Zhang, Boshi, Cao, Zhe, Fan, Xiaoyan, He, Bingxiang, Hao, Han, Jin, Weiyang, Lei, Dianqiao, Liu, Qingle, Qian, Houde, Wang, Bowen, Wang, Situ, Zheng, Youjie, Zhou, Yifan, Xiao, Calvin, Cai, Eren, Na, Qinhuai

Abstract

Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering, which is typically captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning 47 tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency (~1/iteration) and magnitude (~1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
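
One way to read the reported ~1/iteration decay in improvement frequency is as a per-iteration improvement probability of roughly 1/t. Under that reading, the expected number of improvements after n iterations is the harmonic number H_n, which grows like ln n, so doubling the budget adds only about ln 2 further improvements. The sketch below works through that arithmetic; the probabilistic interpretation and the proportionality constant of 1 are assumptions for illustration, not claims from the abstract.

```python
import math


def expected_improvements(n_iterations: int) -> float:
    """Sum of 1/t for t = 1..n, assuming P(improvement at iteration t) = 1/t."""
    return sum(1.0 / t for t in range(1, n_iterations + 1))


h100 = expected_improvements(100)
h200 = expected_improvements(200)

# Doubling the budget from 100 to 200 iterations adds roughly ln(2)
# expected improvements under this assumed model.
print(round(h100, 3), round(h200 - h100, 3), round(math.log(2), 3))
```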

Chinese Translation

当前的大型语言模型（LLM）智能体基准测试主要集中于代码生成或基于搜索的问题回答等二元通过/失败任务，往往忽视了通过可行设计的迭代优化所体现的真实工程价值。为此，我们提出了Frontier-Eng，这是一个经过人工验证的生成式优化基准——一种迭代的提议-执行-评估循环，其中智能体生成候选工件，接收可执行的验证反馈，并在固定交互预算下进行修正——涵盖了五大工程类别中的47个任务。与以往测试套件不同，Frontier-Eng任务基于工业级模拟器和验证器，提供连续的奖励信号并在受限预算下强制执行严格的可行性约束。我们使用代表性搜索框架评估了八种前沿语言模型，发现尽管Claude 4.6 Opus表现最为稳健，但该基准对所有模型而言仍具有挑战性。我们的分析表明，改进频率和幅度均呈双重幂律衰减（频率约为1/迭代次数，幅度约为1/改进次数）。进一步研究显示，虽然宽度提升了并行性和多样性，但在固定预算下，深度对于获得难得的改进仍然至关重要。Frontier-Eng为评估AI智能体整合领域知识与可执行反馈以解决复杂开放式工程问题的能力树立了新的标准。


cs.AI / 35 / 2604.12352

MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents

MultiDocFusion：用于提升长工业文档RAG性能的分层多模态分块流水线

Shin, Joongmin, Park, Chanjun, Park, Jeongbae, Seo, Jaehyung, Lim, Heuiseok

Abstract

RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
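
Step (iv), building hierarchical chunks by depth-first traversal of the section tree, might look roughly like the sketch below. The `Section` type, the `dfs_chunks` function, and the choice to prefix each chunk with its heading path are illustrative assumptions; the abstract does not specify the grouping rule in this detail.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)


def dfs_chunks(node: Section, path: Tuple[str, ...] = ()) -> List[str]:
    """Visit sections depth-first; each chunk carries its full heading path."""
    path = path + (node.title,)
    chunks: List[str] = []
    if node.text:
        chunks.append(" > ".join(path) + "\n" + node.text)
    for child in node.children:
        chunks.extend(dfs_chunks(child, path))
    return chunks


# A hypothetical section tree, as might come out of the parsing steps (i)-(iii).
doc = Section("Manual", children=[
    Section("Safety", "Wear gloves.",
            [Section("Electrical", "Disconnect power first.")]),
    Section("Maintenance", "Inspect monthly."),
])
chunks = dfs_chunks(doc)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Carrying the heading path in each chunk lets a retriever match on section context even when the body text alone is ambiguous.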

Chinese Translation

基于RAG的问答系统已成为处理长工业文档的强大方法。然而，传统的文本分块方法常常忽视复杂且冗长的工业文档结构，导致信息丢失和答案质量下降。为此，我们提出了MultiDocFusion，一种多模态分块流水线，集成了：(i) 基于视觉的文档解析进行文档区域检测，(ii) 通过OCR从这些区域提取文本，(iii) 利用大语言模型（LLM）进行文档章节分层解析（DSHP-LLM），重建文档结构为分层树状结构，(iv) 通过基于深度优先搜索（DFS）的分组构建分层分块。在多个工业基准测试中的大量实验表明，MultiDocFusion相比基线方法提升了8-15%的检索精度和2-3%的ANLS问答评分，凸显了显式利用文档层级结构在多模态文档问答中的关键作用。这些显著的性能提升强调了结构感知分块在增强基于RAG的问答系统准确性方面的必要性。


cs.AI / 36 / 2604.12357

ReflectCAP: Detailed Image Captioning with Reflective Memory

ReflectCAP：利用反射记忆进行详细图像描述

Min, Kyungmin, Kim, Minbeom, Lee, Kang-il, Yoon, Seunghyun, Jung, Kyomin

Abstract

Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21-36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.

Chinese Translation

详细的图像描述要求既具备事实基础，又能实现细致覆盖，但现有方法在同时满足这两个要求方面一直存在困难。我们通过反射笔记引导的描述（ReflectCAP）来解决这一矛盾，其中一个多智能体管道分析目标大型视觉-语言模型（LVLM）所持续幻觉的内容以及其系统性忽视的内容，将这些模式提炼为可重用的指导原则，称为结构化反射笔记。在推理时，这些笔记引导描述模型沿着两个方向——避免哪些内容和关注哪些内容——生成详细的描述，从而共同提高事实性和覆盖率。将此方法应用于涵盖GPT-4.1系列、Qwen系列和InternVL变体的8个LVLM，ReflectCAP达到了事实性与覆盖率之间权衡的帕累托前沿，并在CapArena-Auto上取得了显著提升，在该平台上生成的描述与强参考模型进行逐一对比评估。此外，ReflectCAP在描述质量与计算成本之间提供了比模型扩展或现有多智能体管道更有利的权衡，后者会导致21%至36%的额外开销。这使得在现实世界的成本和延迟限制下，高质量的详细描述成为可能。


cs.AI / 37 / 2604.12384

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

通过耦合权重与激活约束防止大型语言模型中的安全漂移

Peng, Songping, Zhang, Zhiheng, Zeng, Daojian, Jiang, Lincheng, Gao, Xieping

Abstract

Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.

Chinese Translation

大型语言模型（LLMs）在微调过程中安全对齐仍然极为脆弱，即使是良性的适应也可能削弱预训练阶段的拒绝行为，导致产生有害响应。现有防御方法通常单独约束权重或激活，未考虑两者对安全性的耦合影响。本文首先从理论上证明，仅约束权重或激活均不足以保障安全性。为稳健地保持安全对齐，我们提出了耦合权重与激活约束（Coupled Weight and Activation Constraints，CWAC）方法，该方法在权重更新中同时施加预先计算的安全子空间约束，并对由稀疏自编码器识别出的安全关键特征进行有针对性的正则化。大量实验覆盖四种广泛使用的LLMs及多样化下游任务，结果表明CWAC在微调准确率几乎不受影响的情况下，始终实现最低的有害评分，显著优于多种强基线方法，即使在高比例有害数据条件下亦表现优异。


cs.AI / 38 / 2604.12390

Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models

启发式思维分类（HCoT）：将专家系统启发式方法整合到大型语言模型中的结构化推理

Lin, Lei, Zhu, Jizhao, Liu, Yong, Sun, Donghong, He, Hongbo, Du, Yihua

Abstract

This paper addresses two limitations of large language models (LLMs) in solving complex problems: (1) their reasoning processes exhibit Bayesian-like stochastic generation, where each token is sampled from a context-dependent probability distribution, leading to inherently random decision trajectories rather than deterministic planning; (2) the reasoning and decision-making mechanisms are statically decoupled, meaning dynamically retrieved domain knowledge fails to dynamically adjust the underlying reasoning strategy. These dual deficiencies result in initial decisions lacking strategic anchoring and reasoning chains often failing to converge on correct solutions, as stochastic generation lacks mechanisms for trajectory correction or knowledge-guided optimization during sequential reasoning. To resolve these issues, we propose a problem-solving method integrated into the LLM's generation process to guide reasoning. This method, compatible with numerous LLMs and featuring reusable solutions, is grounded in a novel Heuristic-Classification-of-Thoughts prompting schema (HCoT). HCoT synergizes the LLM's reasoning ability with a structured problem space via a heuristic classification model that controls the reasoning process and provides reusable abstract solutions. Evaluated on two complex inductive reasoning tasks with ill-defined search spaces, HCoT outperforms existing approaches (e.g., Tree-of-Thoughts and Chain-of-Thoughts prompting) in performance. On the well-structured 24 Game task, HCoT demonstrates significantly higher token efficiency compared to the state-of-the-art Tree-of-Thoughts-Breadth-First-Search. In terms of both accuracy and token usage, HCoT achieves a Pareto frontier balance, offering a strong trade-off between performance and computational cost.

Chinese Translation

本文针对大型语言模型（LLMs）在解决复杂问题时的两个局限性进行了探讨：（1）其推理过程表现出类似贝叶斯的随机生成特征，每个标记都是从上下文依赖的概率分布中采样，导致决策轨迹本质上是随机的，而非确定性的规划；（2）推理和决策机制是静态解耦的，这意味着动态检索的领域知识无法动态调整基础推理策略。这两种缺陷导致初始决策缺乏战略锚定，推理链往往无法收敛到正确的解决方案，因为随机生成缺乏在顺序推理过程中进行轨迹修正或知识引导优化的机制。为了解决这些问题，我们提出了一种集成到LLM生成过程中的问题解决方法，以指导推理。该方法与众多LLM兼容，并具有可重用的解决方案，基于一种新颖的启发式思维分类提示框架（HCoT）。HCoT通过启发式分类模型将LLM的推理能力与结构化问题空间协同，控制推理过程并提供可重用的抽象解决方案。在两个具有不明确搜索空间的复杂归纳推理任务上进行评估时，HCoT的表现优于现有方法（例如，思维树（Tree-of-Thoughts）和思维链（Chain-of-Thoughts）提示）。在结构良好的24游戏任务中，HCoT在标记效率上显著高于最先进的思维树广度优先搜索（Tree-of-Thoughts-Breadth-First-Search）。在准确性和标记使用方面，HCoT实现了帕累托前沿平衡，提供了性能与计算成本之间的良好权衡。


cs.AI / 39 / 2604.12459

Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments

在大型语言模型中落实被遗忘权：一种轻量级的顺序遗忘框架，适用于政治敏感环境中的隐私对齐部署

Kurt, Esen, Afli, Haithem

Abstract

Large Language Models (LLMs) are increasingly deployed in politically sensitive environments, where memorisation of personal data or confidential content raises regulatory concerns under frameworks such as the GDPR and its Right to be Forgotten. Translating such legal principles into large-scale generative systems presents significant technical challenges. We introduce a lightweight sequential unlearning framework that explicitly separates retention and suppression objectives. The method first stabilises benign capabilities through positive fine-tuning, then applies layer-restricted negative fine-tuning to suppress designated sensitive patterns while preserving general language competence. Experiments on the SemEval-2025 LLM Unlearning benchmark demonstrate effective behavioural suppression with minimal impact on factual accuracy and fluency. GPT-2 exhibits greater robustness than DistilGPT-2, highlighting the role of model capacity in privacy-aligned adaptation. We position sequential unlearning as a practical and reproducible mechanism for operationalising data erasure requirements in politically deployed LLMs.

Chinese Translation

大型语言模型（LLMs）越来越多地应用于政治敏感环境中，在这些环境中，个人数据或机密内容的记忆引发了根据《通用数据保护条例》（GDPR）及其被遗忘权的监管担忧。将这些法律原则转化为大规模生成系统面临着重大技术挑战。我们提出了一种轻量级的顺序遗忘框架，明确区分保留和抑制目标。该方法首先通过正向微调稳定良性能力，然后应用层限制的负向微调来抑制指定的敏感模式，同时保持一般语言能力。在SemEval-2025 LLM遗忘基准上的实验表明，该方法能够有效抑制行为，同时对事实准确性和流畅性影响最小。GPT-2表现出比DistilGPT-2更强的鲁棒性，突显了模型容量在隐私对齐适应中的作用。我们将顺序遗忘定位为在政治部署的LLMs中落实数据删除要求的实用且可重复的机制。


cs.AI / 40 / 2604.12460

Enhancing Clustering: An Explainable Approach via Filtered Patterns

增强聚类：一种通过过滤模式的可解释方法

Hassine, Motaz Ben, Jabbour, Saïd

Abstract

Machine learning has become a central research area, with increasing attention devoted to explainable clustering, also known as conceptual clustering, which is a knowledge-driven unsupervised learning paradigm that partitions data into $\theta$ disjoint clusters, where each cluster is described by an explicit symbolic representation, typically expressed as a closed pattern or itemset. By providing human-interpretable cluster descriptions, explainable clustering plays an important role in explainable artificial intelligence and knowledge discovery. Recent work improved clustering quality by introducing k-relaxed frequent patterns (k-RFPs), a pattern model that relaxes strict coverage constraints through a generalized k-cover definition. This framework integrates constraint-based reasoning, using SAT solvers for pattern generation, with combinatorial optimization, using Integer Linear Programming (ILP) for cluster selection. Despite its effectiveness, this approach suffers from a critical limitation: multiple distinct k-RFPs may induce identical k-covers, leading to redundant symbolic representations that unnecessarily enlarge the search space and increase computational complexity during cluster construction. In this paper, we address this redundancy through a pattern reduction framework. Our contributions are threefold. First, we formally characterize the conditions under which distinct k-RFPs induce identical k-covers, providing theoretical foundations for redundancy detection. Second, we propose an optimization strategy that removes redundant patterns by retaining a single representative pattern for each distinct k-cover. Third, we investigate the interpretability and representativeness of the patterns selected by the ILP model by analyzing their robustness with respect to their induced clusters. Extensive experiments conducted on several real-world datasets demonstrate that the proposed approach significantly reduces the pattern search space, improves computational efficiency, and preserves, and in some cases enhances, the quality of the resulting clusters.
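
The second contribution, keeping a single representative pattern per distinct k-cover, can be sketched as below. The k-cover definition used here (a transaction is covered if it misses at most k items of the pattern) is an assumed simplification for illustration and may differ from the paper's generalized definition; the function names are hypothetical.

```python
from typing import Dict, FrozenSet, List

Pattern = FrozenSet[str]


def k_cover(pattern: Pattern, transactions: List[FrozenSet[str]], k: int) -> FrozenSet[int]:
    """Indices of transactions missing at most k items of the pattern (assumed definition)."""
    return frozenset(
        i for i, t in enumerate(transactions) if len(pattern - t) <= k
    )


def deduplicate_by_cover(
    patterns: List[Pattern], transactions: List[FrozenSet[str]], k: int
) -> List[Pattern]:
    """Keep the first pattern seen for each distinct k-cover."""
    representatives: Dict[FrozenSet[int], Pattern] = {}
    for p in patterns:
        representatives.setdefault(k_cover(p, transactions, k), p)
    return list(representatives.values())


transactions = [frozenset("abc"), frozenset("ab"), frozenset("cd")]
patterns = [frozenset("ab"), frozenset("abc"), frozenset("cd")]

# With k = 1, {a, b} and {a, b, c} both cover transactions 0 and 1,
# so only one of them is kept.
kept = deduplicate_by_cover(patterns, transactions, k=1)
print(kept)
```

Which representative to keep (first seen, shortest, closed) is a design choice; this sketch keeps the first one encountered.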

Chinese Translation

机器学习已成为一个核心研究领域，越来越多的关注集中在可解释聚类上，也称为概念聚类，这是一种知识驱动的无监督学习范式，将数据划分为 $\theta$ 个不相交的簇，每个簇由一个明确的符号表示描述，通常以闭合模式或项集的形式表达。通过提供人类可解释的簇描述，可解释聚类在可解释人工智能和知识发现中发挥着重要作用。近期的研究通过引入 k-放松频繁模式（k-RFPs）来提高聚类质量，这是一种通过广义 k-cover 定义放宽严格覆盖约束的模式模型。该框架将基于约束的推理与组合优化相结合，使用 SAT 求解器进行模式生成，使用整数线性规划（ILP）进行簇选择。尽管该方法有效，但存在一个关键限制：多个不同的 k-RFP 可能会诱导相同的 k-cover，导致冗余的符号表示，进而不必要地扩大搜索空间并增加聚类构建过程中的计算复杂性。本文通过模式减少框架解决了这一冗余问题。我们的贡献有三方面。首先，我们正式描述了不同的 k-RFP 诱导相同 k-cover 的条件，为冗余检测提供了理论基础。其次，我们提出了一种优化策略，通过为每个不同的 k-cover 保留一个代表性模式来移除冗余模式。第三，我们通过分析 ILP 模型所选模式的稳健性，研究了这些模式的可解释性和代表性。对多个真实世界数据集进行的广泛实验表明，所提出的方法显著减少了模式搜索空间，提高了计算效率，并在某些情况下保持和增强了结果簇的质量。


cs.AI / 41 / 2604.12461

CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

CIA：基于LLM的多智能体系统通信拓扑推断

Wu, Yongxuan, Lin, Xixun, Zhang, He, Sun, Nan, Wang, Kun, Zhou, Chuan, Pan, Shirui, Cao, Yanan

Abstract

LLM-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in solving complex tasks. Central to MAS is the communication topology which governs how agents exchange information internally. Consequently, the security of communication topologies has attracted increasing attention. In this paper, we investigate a critical privacy risk: MAS communication topologies can be inferred under a restrictive black-box setting, exposing system vulnerabilities and posing significant intellectual property threats. To explore this risk, we propose Communication Inference Attack (CIA), a novel attack that constructs new adversarial queries to induce intermediate agents' reasoning outputs and models their semantic correlations through the proposed global bias disentanglement and LLM-guided weak supervision. Extensive experiments on MAS with optimized communication topologies demonstrate the effectiveness of CIA, achieving an average AUC of 0.87 and a peak AUC of up to 0.99, thereby revealing the substantial privacy risk in MAS.

Chinese Translation

基于大语言模型（LLM）的多智能体系统（MAS）在解决复杂任务方面展现了卓越的能力。MAS的核心在于通信拓扑结构，它决定了智能体之间如何进行信息交换。因此，通信拓扑的安全性日益受到关注。本文研究了一种关键的隐私风险：在受限的黑盒环境下，MAS的通信拓扑可能被推断出来，暴露系统漏洞并构成重大知识产权威胁。为探讨该风险，我们提出了通信推断攻击（Communication Inference Attack，CIA），这是一种新颖的攻击方法，通过构造新的对抗查询以诱导中间智能体的推理输出，并通过提出的全局偏差解耦和LLM引导的弱监督来建模其语义相关性。在针对优化通信拓扑的MAS上进行的大量实验表明，CIA具有显著效果，平均AUC达到0.87，最高峰值AUC高达0.99，从而揭示了MAS中存在的重大隐私风险。


cs.AI / 42 / 2604.12470

Intelligent ROI-Based Vehicle Counting Framework for Automated Traffic Monitoring

基于智能ROI的车辆计数框架用于自动化交通监测

Abdelwahab, Mohamed A., Al-Ariny, Zaynab, Fakhry, Mahmoud, Hasaneen, El-Sayed

Abstract

Accurate vehicle counting through video surveillance is crucial for efficient traffic management. However, achieving high counting accuracy while ensuring computational efficiency remains a challenge. To address this, we propose a fully automated, video-based vehicle counting framework designed to optimize both computational efficiency and counting accuracy. Our framework operates in two distinct phases: estimation and prediction. In the estimation phase, the optimal region of interest (ROI) is automatically determined using a novel combination of three models based on detection scores, tracking scores, and vehicle density. This adaptive approach ensures compatibility with any detection and tracking method, enhancing the framework's versatility. In the prediction phase, vehicle counting is efficiently performed within the estimated ROI. We evaluated our framework on benchmark datasets like UA-DETRAC, GRAM, CDnet 2014, and ATON. Results demonstrate exceptional accuracy, with most videos achieving 100% accuracy, while also enhancing computational efficiency, making processing up to four times faster than full-frame processing. The framework outperforms existing techniques, especially in complex multi-road scenarios, demonstrating robustness and superior accuracy. These advancements make it a promising solution for real-time traffic monitoring.

Chinese Translation

通过视频监控进行准确的车辆计数对于高效的交通管理至关重要。然而，在确保计算效率的同时实现高计数准确性仍然是一个挑战。为了解决这个问题，我们提出了一种完全自动化的视频基础车辆计数框架，旨在优化计算效率和计数准确性。我们的框架分为两个不同的阶段：估计和预测。在估计阶段，使用基于检测分数、跟踪分数和车辆密度的三种模型的新颖组合自动确定最佳兴趣区域（ROI）。这种自适应方法确保与任何检测和跟踪方法的兼容性，增强了框架的灵活性。在预测阶段，在估计的ROI内高效地进行车辆计数。我们在UA-DETRAC、GRAM、CDnet 2014和ATON等基准数据集上评估了我们的框架。结果显示出卓越的准确性，大多数视频实现了100%的准确率，同时提高了计算效率，使处理速度比全帧处理快四倍。该框架在复杂的多道路场景中优于现有技术，展现出强大的鲁棒性和卓越的准确性。这些进展使其成为实时交通监测的有前景的解决方案。


cs.AI / 43 / 2604.12534

Technical Report -- A Context-Sensitive Multi-Level Similarity Framework for First-Order Logic Arguments: An Axiomatic Study

技术报告——面向一阶逻辑论证的上下文敏感多层次相似性框架：公理化研究

David, Victor, Delobelle, Jérôme, Mailly, Jean-Guy

Abstract

Similarity in formal argumentation has recently gained attention due to its significance in problems such as argument aggregation in semantics and enthymeme decoding. While existing approaches focus on propositional logic, we address the richer setting of First-Order Logic (FOL), where similarity must account for structured content. We introduce a comprehensive framework for FOL argument similarity, built upon: (1) an extended axiomatic foundation; (2) a four-level parametric model covering predicates, literals, clauses, and formulae similarity; (3) two model families, one syntax-sensitive via language models, both integrating contextual weights for nuanced and explainable similarity; and (4) formal constraints enforcing desirable properties.

Chinese Translation

形式论证中的相似性因其在语义中的论证聚合和隐含论证解码等问题中的重要性，近年来受到关注。现有方法多聚焦于命题逻辑，而我们则针对内容结构更为丰富的一阶逻辑（First-Order Logic, FOL）提出解决方案，其中相似性必须考虑结构化内容。本文引入了一个全面的一阶逻辑论证相似性框架，基于：（1）扩展的公理化基础；（2）涵盖谓词、文字、子句及公式相似性的四层参数化模型；（3）两类模型族，一类通过语言模型实现语法敏感，二者均融合上下文权重以实现细致且可解释的相似性；（4）形式约束以保证期望的性质。


cs.AI / 44 / 2604.12543

A Two-Stage LLM Framework for Accessible and Verified XAI Explanations

一种用于可访问和经过验证的可解释人工智能解释的两阶段大型语言模型框架

Mermigkis, Georgios, Metaxakis, Dimitris, Tyrovolas, Marios, Sofotasios, Argiris, Avgeris, Nikolaos, Hadjidoukas, Panagiotis, Stylios, Chrysostomos

Abstract

Large Language Models (LLMs) are increasingly used to translate the technical outputs of eXplainable Artificial Intelligence (XAI) methods into accessible natural-language explanations. However, existing approaches often lack guarantees of accuracy, faithfulness, and completeness. At the same time, current efforts to evaluate such narratives remain largely subjective or confined to post-hoc scoring, offering no safeguards to prevent flawed explanations from reaching end-users. To address these limitations, this paper proposes a Two-Stage LLM Meta-Verification Framework that consists of (i) an Explainer LLM that converts raw XAI outputs into natural-language narratives, (ii) a Verifier LLM that assesses them in terms of faithfulness, coherence, completeness, and hallucination risk, and (iii) an iterative refeed mechanism that uses the Verifier's feedback to refine and improve them. Experiments across five XAI techniques and datasets, using three families of open-weight LLMs, show that verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared with raw XAI outputs. In addition, the analysis of the Entropy Production Rate (EPR) during the refinement process indicates that the Verifier's feedback progressively guides the Explainer toward more stable and coherent reasoning. Overall, the proposed framework provides an efficient pathway toward more trustworthy and democratized XAI systems.
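
The explainer, verifier, and refeed mechanism in (i)-(iii) form a simple control loop, sketched below under stated assumptions. The function names, the accept/feedback return shape, the round limit, and the toy explainer and verifier are all hypothetical; the real components would be LLM calls, and the verifier would score faithfulness, coherence, completeness, and hallucination risk rather than just checking feature names.

```python
from typing import Callable, Dict, Optional, Tuple


def explain_with_verification(
    xai_output: Dict[str, float],
    explainer: Callable[[Dict[str, float], str], str],
    verifier: Callable[[Dict[str, float], str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a narrative only if the verifier accepts it within max_rounds."""
    feedback = ""
    for _ in range(max_rounds):
        narrative = explainer(xai_output, feedback)
        accepted, feedback = verifier(xai_output, narrative)
        if accepted:
            return narrative
    return None  # nothing passed verification, so nothing reaches the end-user


# Toy explainer: omits features on the first try, includes all once it gets feedback.
def toy_explainer(xai_output: Dict[str, float], feedback: str) -> str:
    names = list(xai_output) if feedback else list(xai_output)[:1]
    return "Important features: " + ", ".join(names)


# Toy verifier: checks completeness only (every feature name must be mentioned).
def toy_verifier(xai_output: Dict[str, float], narrative: str) -> Tuple[bool, str]:
    missing = [name for name in xai_output if name not in narrative]
    if missing:
        return False, "missing: " + ", ".join(missing)
    return True, ""


result = explain_with_verification(
    {"age": 0.6, "income": -0.2}, toy_explainer, toy_verifier
)
print(result)
```

Returning `None` when no narrative passes reflects the safeguard the abstract emphasizes: unverified explanations are filtered rather than shown.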

Chinese Translation

大型语言模型（LLMs）越来越多地被用于将可解释人工智能（XAI）方法的技术输出转化为可访问的自然语言解释。然而，现有的方法往往缺乏准确性、忠实性和完整性的保证。同时，目前对这些叙述的评估工作在很大程度上仍然是主观的或局限于事后评分，未能提供防止有缺陷的解释传递给最终用户的保障。为了解决这些局限性，本文提出了一种两阶段LLM元验证框架，该框架由以下部分组成：（i）解释者LLM，将原始XAI输出转换为自然语言叙述；（ii）验证者LLM，从忠实性、一致性、完整性和幻觉风险等方面对其进行评估；（iii）一个迭代反馈机制，利用验证者的反馈来细化和改进这些叙述。通过在五种XAI技术和数据集上进行实验，使用三类开放权重的LLM，结果表明验证对于过滤不可靠的解释至关重要，同时与原始XAI输出相比，提高了语言的可访问性。此外，在细化过程中对熵产生率（EPR）的分析表明，验证者的反馈逐步引导解释者朝着更稳定和一致的推理方向发展。总体而言，所提出的框架为构建更可信和民主化的XAI系统提供了一条有效的路径。


cs.AI / 45 / 2604.12545

Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

使用LLM代理进行公民对官僚主义繁文缛节情感反应的跨文化模拟

Ni, Wanchun, Sun, Jiugeng, Liu, Yixian, El-Assady, Mennatallah

Abstract

Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens' emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs' emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens' emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo-chi.ivia.ch.

Chinese Translation

改善政策制定是公共管理中的一个核心问题。先前的人类受试者研究揭示了公民在政策实施过程中对繁文缛节的情感反应存在显著的跨文化差异。尽管LLM代理提供了模拟类人反应和降低实验成本的机会，但它们生成对繁文缛节的文化适当情感反应的能力仍未得到验证。为了解决这一空白，我们提出了一个评估框架，用于评估LLM在不同文化背景下对繁文缛节的情感反应。作为一项初步研究，我们将该框架应用于一个单一的繁文缛节场景。我们的结果表明，所有模型与人类情感反应的对齐程度有限，尤其在东方文化中表现较弱。文化提示策略在改善对齐方面效果不佳。我们进一步介绍了RAMO，一个用于模拟公民对繁文缛节情感反应的交互界面，并用于收集人类数据以改进模型。该界面已公开发布，网址为 https://ramo-chi.ivia.ch。

cs.AI / 46 / 2604.12573

IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration

IDEA：通过语言到数值校准实现大语言模型的可解释且可编辑决策框架

He, Yanji, Jiang, Yuxin, Wu, Yiwen, Huang, Bo, Wei, Jiaheng, Wang, Wei

Abstract

Large Language Models are increasingly deployed for decision-making, yet their adoption in high-stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal-to-numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human-AI collaboration. Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration -- precision unattainable through prompting alone. The implementation is publicly available at https://github.com/leonbig/IDEA.

Chinese Translation

大型语言模型（LLMs）在决策领域的应用日益广泛，但其在高风险领域的采用仍受限于概率校准不准确、解释不可信以及无法精确整合专家知识的问题。我们提出了IDEA框架，该框架将LLM的决策知识提取为基于语义意义因子的可解释参数模型。通过采用期望最大化（EM）方法联合学习语言到数值的映射及决策参数，利用保持因子依赖关系的相关采样，并通过具有数学保证的直接参数编辑，IDEA不仅生成校准后的概率，还实现了定量的人机协作。在五个数据集上的实验表明，基于Qwen-3-32B的IDEA（78.6%）优于DeepSeek R1（68.1%）和GPT-5.2（77.9%），实现了完美的因子排除和精确校准——这是单靠提示无法达到的精度。该实现已公开发布于https://github.com/leonbig/IDEA。

cs.AI / 47 / 2604.12615

DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

DeepTest 工具竞赛 2026：基于大型语言模型的汽车助手基准测试

Sorokin, Lev, Vasilev, Ivan, Pasini, Samuele

Abstract

This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.

Chinese Translation

本报告总结了首届大型语言模型（LLM）测试竞赛的结果，该竞赛作为 ICSE 2026 DeepTest 研讨会的一部分举办。共有四款工具参与了基于 LLM 的汽车手册信息检索应用的基准测试，目标是识别系统未能恰当提示手册中警告信息的用户输入。测试方案的评估基于其揭示系统失败的有效性及发现的失败测试的多样性。本文报告了实验方法、参赛者及竞赛结果。

cs.AI / 48 / 2604.12616

Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

每幅图像都讲述一个危险的故事：针对视觉语言模型的记忆增强多智能体越狱攻击

Chen, Jianhao, Chen, Haoyang, Zhao, Hanjie, Liang, Haozhe, Qian, Tieyun

Abstract

The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack-Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.

Chinese Translation

视觉语言模型（VLMs）的快速发展催生了人工智能前所未有的能力；然而，这一持续的模态扩展无意中暴露了一个极为广泛且不受限制的对抗攻击面。目前的多模态越狱策略主要集中在表层的像素扰动、排版攻击或有害图像上；然而，它们未能处理视觉数据内在的复杂语义结构。这使得原始自然图像的广泛语义攻击面在很大程度上未受到审视。为了揭示这些深层次的语义脆弱性，我们提出了MemJack，一个记忆增强（MEMory-augmented）的多智能体越狱攻击（JAilbreak attaCK）框架，明确利用视觉语义来组织自动化的越狱攻击。MemJack通过协调的多智能体合作，动态地将视觉实体映射到恶意意图，生成通过多角度视觉-语义伪装的对抗提示，并利用迭代零空间投影（INLP）几何滤波器绕过过早的潜在空间拒绝。通过在持久的多模态体验记忆中积累和转移成功策略，MemJack在不同图像之间保持高度连贯的扩展多轮越狱攻击交互，从而提高新图像的攻击成功率（ASR）。在完整的、未修改的COCO val2017图像上的广泛实证评估表明，MemJack对Qwen3-VL-Plus的ASR达到了71.48%，在扩展预算下可达到90%。此外，为了推动未来防御对齐研究，我们将发布MemJack-Bench，一个包含超过113,000条交互式多模态越狱攻击轨迹的综合数据集，为开发本质上稳健的VLMs奠定重要基础。

cs.AI / 49 / 2604.12627

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

KnowRL：通过最小充分知识引导的强化学习提升大型语言模型推理能力

Yu, Linhao, Yang, Tianmeng, Ding, Siyu, Jin, Renren, Gu, Naibin, Hao, Xiangzhao, Nie, Shuaiyi, Xiong, Deyi, Yin, Weichong, Sun, Yu, Wu, Hua

Abstract

RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox -- removing one KP may help while removing multiple such KPs can hurt -- and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.

Chinese Translation

RLVR提升了大型语言模型的推理能力，但其效果常因难题上的奖励稀疏性严重而受限。近期基于提示的强化学习方法通过注入部分解答或抽象模板缓解了稀疏性，然而它们通常通过增加更多的token来扩展指导，导致冗余、不一致以及额外的训练开销。我们提出了KnowRL（Knowledge-Guided Reinforcement Learning），一种将提示设计视为最小充分指导问题的强化学习训练框架。在强化学习训练过程中，KnowRL将指导分解为原子知识点（Knowledge Points, KPs），并利用约束子集搜索（Constrained Subset Search, CSS）构建紧凑且具交互感知的子集用于训练。我们进一步发现了剪枝交互悖论——移除单个KP可能有益，而移除多个此类KP则可能有害，并在此依赖结构下显式优化鲁棒的子集策划。我们基于OpenMath-Nemotron-1.5B训练了KnowRL-Nemotron-1.5B。在1.5B参数规模下的八个推理基准测试中，KnowRL-Nemotron-1.5B持续优于强基线强化学习和提示方法。在推理时无KP提示的情况下，KnowRL-Nemotron-1.5B达到70.08的平均准确率，较Nemotron-1.5B提升了9.63个百分点；使用选定的KP后，性能提升至74.16，创下该规模的新最优水平。该模型、策划训练数据及代码已公开，地址为https://github.com/Hasuer/KnowRL。

cs.AI / 50 / 2604.12634

RPRA: Predicting an LLM-Judge for Efficient but Performant Inference

RPRA：预测大型语言模型评判器以实现高效且高性能的推理

Ashley, Dylan R., Lan, Gaël Le, Zhao, Changsheng, Dhingra, Naina, Cai, Zhipeng, Chang, Ernie, Zhuge, Mingchen, Shi, Yangyang, Chandra, Vikas, Schmidhuber, Jürgen

Abstract

Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and having models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict -- prior to responding -- how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.

Chinese Translation

大型语言模型（LLMs）在计算效率（如参数数量）与输出质量之间存在根本性权衡，尤其是在手机或笔记本等计算资源有限的设备上部署时。解决这一挑战的一种方法是借鉴人类的做法，当模型认为自己无法独立解决问题时，主动寻求帮助；通过允许较小的模型在认为能够提供良好回答时响应查询，而在认为无法胜任时转由较大模型处理，从而克服这一权衡。为此，本文探讨了预测-回答/行动（Predict-Answer/Act, PA）和推理-预测-推理-回答/行动（Reason-Predict-Reason-Answer/Act, RPRA）范式的可行性，即模型在响应之前预测大型语言模型评判器（LLM judge）对其输出的评分。我们评估了三种方法：零样本预测、基于上下文成绩单的预测以及监督微调。结果表明，较大模型（尤其是推理模型）在零样本预测通用LLM评判器时表现良好，而较小模型经过微调或提供上下文成绩单后，能够可靠地预测此类评判器。总体而言，这两种方法均显著提升了较小模型的预测准确率，成绩单和微调在各数据集上的平均提升分别达到55%和52%。这些发现表明，模型能够学习预测自身性能的局限性，为构建更高效且具备自我认知能力的人工智能系统铺平了道路。
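
The deferral idea in this abstract (a small model answers only when it predicts a good judge score, otherwise a larger model takes over) can be sketched as a simple router. This is an illustrative sketch under stated assumptions: the 0-10 score scale, the threshold of 7.0, and every function name below are hypothetical and do not come from the paper.

```python
from typing import Callable, Tuple


def route_query(
    query: str,
    predict_judge_score: Callable[[str], float],
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    threshold: float = 7.0,  # assumed cutoff on an assumed 0-10 judge scale
) -> Tuple[str, str]:
    """Answer with the small model only if it expects a good judge score."""
    if predict_judge_score(query) >= threshold:
        return "small", small_model(query)
    return "large", large_model(query)


# Toy stand-ins (hypothetical): short queries are treated as easy.
def toy_predict(query: str) -> float:
    return 9.0 if len(query.split()) <= 5 else 3.0


def toy_small(query: str) -> str:
    return f"[small] {query}"


def toy_large(query: str) -> str:
    return f"[large] {query}"


easy = route_query("What is 2 + 2?", toy_predict, toy_small, toy_large)
hard = route_query(
    "Prove that every bounded monotone sequence of real numbers converges.",
    toy_predict, toy_small, toy_large,
)
```

In practice the prediction step would be the small model itself, prompted zero-shot, given a report card in context, or fine-tuned, which are the three options the abstract compares.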

cs.AI / 51 / 2604.12660

Broadening the Applicability of Conditional Syntax Splitting for Reasoning from Conditional Belief Bases

扩展条件语法拆分在条件信念库推理中的适用性

Spiegel, Lars-Phillip, Haldimann, Jonas, Heyninck, Jesse, Kern-Isberner, Gabriele, Beierle, Christoph

Abstract

In nonmonotonic reasoning from conditional belief bases, an inference operator satisfying syntax splitting postulates allows for taking only the relevant parts of a belief base into account, provided that the belief base splits into subbases based on disjoint signatures. Because such disjointness is rare in practice, safe conditional syntax splitting has been proposed as a generalization of syntax splitting, allowing the conditionals in the subbases to share some atoms. Recently this overlap of conditionals has been shown to be limited to trivial, self-fulfilling conditionals. In this article, we propose a generalization of safe conditional syntax splittings that broadens the applicability of splitting postulates. In contrast to safe conditional syntax splitting, our generalized notion supports syntax splittings of a belief base Δ where the subbases of Δ may share atoms and nontrivial conditionals. We illustrate how this new notion overcomes limitations of previous splitting concepts, and we identify genuine splittings, separating them from simple splittings that do not provide benefits for inductive inference from Δ. We introduce adjusted inference postulates based on our generalization of conditional syntax splitting, and we evaluate several popular inductive inference operators with respect to these postulates. Furthermore, we show that, while every inductive inference operator satisfying generalized conditional syntax splitting also satisfies conditional syntax splitting, the reverse does not hold.

Chinese Translation

在基于条件信念库的非单调推理中，满足语法拆分公设的推理算子允许仅考虑信念库中相关的部分，前提是信念库能够基于不相交的符号集拆分为子库。由于这种不相交性在实际中较为罕见，安全条件语法拆分（safe conditional syntax splitting）作为语法拆分的推广被提出，允许子库中的条件共享部分原子。近期研究表明，这种条件的重叠仅限于平凡的、自我实现的条件。本文提出了一种安全条件语法拆分的推广，拓宽了拆分公设的适用范围。与安全条件语法拆分不同，我们的推广概念支持信念库 Δ 的语法拆分，其中 Δ 的子库可以共享原子和非平凡条件。我们展示了该新概念如何克服以往拆分方法的局限性，并识别出真正的拆分，将其与对 Δ 的归纳推理无益的简单拆分区分开来。基于我们对条件语法拆分的推广，本文引入了调整后的推理公设，并针对这些公设评估了若干流行的归纳推理算子。此外，我们证明了满足推广条件语法拆分的所有归纳推理算子也满足条件语法拆分，但反之则不然。

cs.AI / 52 / 2604.12663

Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport

以人为中心的主题建模：目标驱动的对比学习与最优传输

Wang, Rui, Zheng, Yi, Wang, Dongxin, Huang, Haiping, Yao, Yuanzhi, Zhou, Yuxiang, Yu, Jialin, Torr, Philip

Abstract

Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, which focus mainly on statistical coherence, often produce redundant or off-target topics that miss the user's underlying intent. We introduce Human-centric Topic Modeling (Human-TM), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse and goal-oriented topics. To tackle this challenge, we propose the Goal-prompted Contrastive Topic Model with Optimal Transport (GCTM-OT), which first uses LLM-based prompting to extract goal candidates from documents, then incorporates these into semantic-aware contrastive learning via optimal transport for topic discovery. Experimental results on three public subreddit datasets show that GCTM-OT outperforms state-of-the-art baselines in topic coherence and diversity while significantly improving alignment with human-provided goals, paving the way for more human-centric topic discovery systems.

Chinese Translation

现有的主题建模方法，从潜在狄利克雷分配（LDA）到最近的神经网络和基于大语言模型（LLM）的方法，主要关注统计一致性，往往会产生冗余或偏离目标的主题，无法捕捉用户的潜在意图。我们提出了以人为中心的主题建模（Human-TM），这是一种新颖的任务表述，将人类提供的目标直接整合到主题建模过程中，以生成可解释、多样化和目标导向的主题。为了解决这一挑战，我们提出了目标驱动的对比主题模型与最优传输（GCTM-OT），该模型首先利用基于LLM的提示从文档中提取目标候选项，然后通过最优传输将这些候选项纳入语义感知的对比学习中，以实现主题发现。在三个公共子版块数据集上的实验结果表明，GCTM-OT在主题一致性和多样性方面优于最先进的基线，同时显著提高了与人类提供的目标的一致性，为更以人为中心的主题发现系统铺平了道路。

cs.AI / 53 / 2604.12667

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

基于在线滤波的安全强化学习在疲劳预测人机任务规划与分配中的应用研究

Xue, Jintao, Li, Xiao, Zhang, Nianmin

Abstract

Human-robot collaborative manufacturing, a core aspect of Industry 5.0, emphasizes ergonomics to enhance worker well-being. This paper addresses the dynamic human-robot task planning and allocation (HRTPA) problem, which involves determining when to perform tasks and who should execute them to maximize efficiency while ensuring workers' physical fatigue remains within safe limits. The inclusion of fatigue constraints, combined with production dynamics, significantly increases the complexity of the HRTPA problem. Traditional fatigue-recovery models in HRTPA often rely on static, predefined hyperparameters. However, in practice, human fatigue sensitivity varies daily due to factors such as changed work conditions and insufficient sleep. To better capture this uncertainty, we treat fatigue-related parameters as inaccurate and estimate them online based on observed fatigue progression during production. To address these challenges, we propose PF-CD3Q, a safe reinforcement learning (safe RL) approach that integrates the particle filter with constrained dueling double deep Q-learning for real-time fatigue-predictive HRTPA. Specifically, we first develop PF-based estimators to track human fatigue and update fatigue model parameters in real-time. These estimators are then integrated into CD3Q by making task-level fatigue predictions during decision-making and excluding tasks that exceed fatigue limits, thereby constraining the action space and formulating the problem as a constrained Markov decision process (CMDP).

Chinese Translation

人机协作制造作为工业5.0的核心内容，强调人体工程学以提升工人福祉。本文针对动态人机任务规划与分配（HRTPA）问题展开研究，该问题涉及确定任务执行的时间及执行者，以在最大化效率的同时确保工人身体疲劳维持在安全范围内。疲劳约束的引入及生产动态的影响显著增加了HRTPA问题的复杂性。传统的HRTPA疲劳恢复模型通常依赖静态的预设超参数，然而实际中由于工作条件变化和睡眠不足等因素，人体疲劳敏感性每日存在差异。为更好地捕捉这一不确定性，本文将疲劳相关参数视为不准确的，并基于生产过程中观察到的疲劳进展进行在线估计。针对上述挑战，本文提出了PF-CD3Q，一种结合粒子滤波（particle filter）与约束对决双深度Q学习（constrained dueling double deep Q-learning）的安全强化学习方法，用于实时疲劳预测的人机任务规划与分配。具体而言，首先构建基于粒子滤波的估计器以实时跟踪人体疲劳并更新疲劳模型参数；随后将该估计器集成至CD3Q，通过在决策过程中进行任务级疲劳预测并排除超出疲劳限制的任务，从而约束动作空间，将问题建模为约束马尔可夫决策过程（CMDP）。
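
The constraint step in this abstract (predict task-level fatigue at decision time and exclude tasks that would exceed the limit) amounts to masking the action space before the greedy Q-value choice. The sketch below shows only that masking step, with the predicted fatigue increments passed in as plain numbers; in the paper they come from particle-filter estimators. All names and values are hypothetical.

```python
from typing import Dict, Optional


def select_task(
    q_values: Dict[str, float],
    fatigue_now: float,
    predicted_increase: Dict[str, float],
    fatigue_limit: float,
) -> Optional[str]:
    """Pick the highest-Q task among those predicted to stay within the limit."""
    feasible = [
        task
        for task in q_values
        if fatigue_now + predicted_increase[task] <= fatigue_limit
    ]
    if not feasible:
        return None  # no safe human task, e.g. rest or hand over to the robot
    return max(feasible, key=lambda task: q_values[task])


# Hypothetical numbers: "lift" has the best Q-value but would exceed the limit.
q = {"lift": 5.0, "inspect": 3.0, "pack": 4.0}
increase = {"lift": 0.5, "inspect": 0.1, "pack": 0.3}
choice = select_task(q, fatigue_now=0.6, predicted_increase=increase, fatigue_limit=1.0)
```

Filtering before the argmax, rather than penalizing unsafe tasks in the reward, means an unsafe task can never be selected, whatever its learned Q-value.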

cs.AI / 54 / 2604.12669

A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production

一种基于高效强化学习的分层空间感知算法用于生产中的人机任务规划与分配

Xue, Jintao, Li, Xiao, Zhang, Nianmin

Abstract

In advanced manufacturing systems, humans and robots collaborate to conduct the production process. Effective task planning and allocation (TPA) is crucial for achieving high production efficiency, yet it remains challenging in complex and dynamic manufacturing environments. The dynamic nature of humans and robots, particularly the need to consider spatial information (e.g., humans' real-time position and the distance they need to move to complete a task), substantially complicates TPA. To address the above challenges, we decompose production tasks into manageable subtasks. We then implement a real-time hierarchical human-robot TPA algorithm, including a high-level agent for task planning and a low-level agent for task allocation. For the high-level agent, we propose an efficient buffer-based deep Q-learning method (EBQ), which reduces training time and enhances performance in production problems with long-term and sparse reward challenges. For the low-level agent, a path planning-based spatially aware method (SAP) is designed to allocate tasks to the appropriate human-robot resources, thereby achieving the corresponding sequential subtasks. We conducted experiments on a complex real-time production process in a 3D simulator. The results demonstrate that our proposed EBQ&SAP method effectively addresses human-robot TPA problems in complex and dynamic production processes.

Chinese Translation

在先进制造系统中，人类与机器人协同完成生产过程。有效的任务规划与分配（TPA）对于实现高生产效率至关重要，但在复杂且动态的制造环境中仍然具有挑战性。人机的动态特性，尤其是需要考虑空间信息（例如人类的实时位置及其完成任务所需移动的距离），极大地增加了TPA的复杂性。为应对上述挑战，我们将生产任务分解为可管理的子任务。随后，我们实现了一种实时分层的人机TPA算法，包括用于任务规划的高层代理和用于任务分配的低层代理。对于高层代理，我们提出了一种高效的基于缓冲区的深度Q学习方法（EBQ），该方法在面临长期且稀疏奖励的生产问题中，能够减少训练时间并提升性能。对于低层代理，设计了一种基于路径规划的空间感知方法（SAP），用于将任务分配给合适的人机资源，从而实现相应的顺序子任务。我们在三维仿真器中对复杂的实时生产过程进行了实验。结果表明，所提出的EBQ&SAP方法有效解决了复杂动态生产过程中人机TPA问题。

cs.AI / 55 / 2604.12700

MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

MISID：用于战略欺骗游戏中复杂意图识别的多模态多轮数据集

Lin, Shufang, Chen, Muyang, Zhou, Xiabing, Zhang, Rongrong, Zhang, Dayou, Wang, Fangxin

Abstract

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a "Decouple-Anchor-Reason" paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.

Chinese Translation

理解人类在复杂多轮互动中的意图仍然是人机交互和行为分析中的一个基本挑战。现有的意图识别数据集主要集中在单一发言或简单对话上，而现实场景往往涉及复杂的战略互动，参与者必须在较长时间内维持复杂的欺骗叙事。为了解决这一问题，我们引入了MISID，这是一个全面的多模态、多轮和多参与者的意图识别基准数据集。MISID来源于高风险社交战略游戏，采用了精细化的两级多维注释方案，旨在支持长语境话语分析和基于证据的因果追踪。我们对MISID上最先进的多模态大型语言模型（MLLMs）进行的系统评估揭示了在复杂场景中存在的关键缺陷，包括文本优先的视觉幻觉、跨模态协同能力受损以及在因果线索链中的有限能力。因此，我们提出了FRACTAM作为基线框架。FRACTAM采用“解耦-锚定-推理”范式，通过提取纯粹的单模态事实表示来减少文本偏见，使用两阶段检索进行长距离事实锚定，并构建明确的跨模态证据链。大量实验表明，FRACTAM在复杂战略任务中提升了主流模型的性能，提高了隐藏意图的检测和推理能力，同时保持了强大的感知准确性。我们的数据集可在 https://naislab.cn/datasets/MISID 获取。

cs.AI / 56 / 2604.12717

Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

通过真实案例学习实现自主智能体的可转移专业知识

Ma, Zhenyu, Song, Yuyang, Yang, Chunyi, Zhu, Jingyi, Yang, Letian, Jiang, Xukai

Abstract

LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasizes extracting and reusing task-relevant knowledge, analytical prompts, and operational skills from real cases. We evaluate the method on a unified benchmark of six complex task categories and compare it with Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines. Results show that our method achieves consistently strong performance across all tasks and matches or outperforms the best baseline in every case, with especially clear gains on more complex tasks. Further analysis shows that the advantage of case-based learning increases with task complexity, and that practical knowledge acquired by one agent can be reused by others. These findings suggest that case-based learning offers a promising path for building professional agents for real-world work.

Chinese Translation

基于大型语言模型（LLM）的自主智能体在一般推理任务中表现良好，但在复杂的现实环境中仍然难以可靠地利用任务结构、关键约束和先前经验。我们提出了一种案例学习框架，将过去任务的经验转化为可重用的知识资产，使智能体能够将先前的案例经验转移到新任务中，并进行更有结构的分析。与主要基于预训练知识或静态提示的方法不同，我们的框架强调从真实案例中提取和重用与任务相关的知识、分析提示和操作技能。我们在六个复杂任务类别的统一基准上评估该方法，并将其与零样本（Zero-Shot）、少样本（Few-Shot）、检查表提示（Checklist Prompt）和规则记忆（Rule Memory）基线进行比较。结果表明，我们的方法在所有任务中均表现出持续强劲的性能，并在每个案例中与最佳基线相匹配或超越，尤其在更复杂的任务上表现出明显的优势。进一步分析显示，案例学习的优势随着任务复杂性的增加而增强，并且一个智能体获得的实践知识可以被其他智能体重用。这些发现表明，案例学习为构建适用于现实工作的专业智能体提供了一条有前景的路径。

cs.AI / 57 / 2604.12743

Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities

人工智能工具能否改造低认知需求的数学任务？任务修改能力的评估

Fox, Danielle S., Robles, Brenda L., Brovey, Elizabeth DiPietro, Schunn, Christian D.

Abstract

While recent research has explored AI tools' ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both "undershooting" (maintaining low cognitive demand) and "overshooting" (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI's potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.

Chinese Translation

尽管近期研究探讨了人工智能工具对数学任务质量的分类能力（arXiv:2603.03512），但其提升现有任务质量的能力尚不清楚。本研究考察了人工智能工具是否能够成功升级低认知需求的数学任务。测试了十一种工具，包括六种广泛可用的通用人工智能工具（如ChatGPT和Claude）以及五种专为数学教师设计的工具（如Khanmigo、coteach.ai）。基于任务分析指南框架（Task Analysis Guide，Stein & Smith, 1998），我们引导人工智能工具修改两类不同的低需求数学任务。引导策略旨在模拟有经验教师可能采取的方法，而非通过大量优化寻找更有效的提示（即一种乐观的典型结果）。平均来看，人工智能工具的成功率仅为中等：任务被准确升级的比例为64%，不同工具的表现从较弱（33%）到较为成功（88%）不等。专用工具的成功率仅略高于通用工具。失败模式包括“未达标”（认知需求仍然较低）和“过度提升”（将任务提升至教师可能拒绝的过高目标类别）。有趣的是，人工智能工具正确分类任务认知需求的能力与其升级任务的能力之间存在轻微负相关（r = -0.35），表明任务修改（即生成性任务）能力与任务分类（即基于评分标准的判断）能力是两种不同的能力。这些发现对理解人工智能在课程适应中的潜在作用具有重要意义，并强调了支持教师修改教学材料的专门方法的必要性。

cs.AI / 58 / 2604.12812

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

DocSeeker：基于证据定位的结构化视觉推理用于长文档理解

Yan, Hao, Liu, Yuliang, Liu, Xingchen, Zhang, Yuyi, Liao, Minghui, Wu, Jihao, Chen, Wei, Bai, Xiang

Abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured "Analysis, Localization and Reasoning" workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.

Chinese Translation

现有的多模态大型语言模型（Multimodal Large Language Models，MLLMs）在长文档理解任务中随着文档长度的增加表现显著下降。这主要源于两个根本性挑战：1）信噪比（Signal-to-Noise Ratio，SNR）低，关键证据被埋没在无关页面中；2）监督信号稀缺，现有数据集仅提供最终简短答案，导致学习信号薄弱。本文针对这些挑战，提出了一种要求模型执行结构化“分析（Analysis）、定位（Localization）与推理（Reasoning）”工作流程的新范式。为培养该能力，我们设计了一个两阶段训练框架：首先通过高效的知识蒸馏策略生成的高质量数据进行有监督微调；随后采用证据感知的群体相对策略优化（Evidence-aware Group Relative Policy Optimization），联合优化证据定位与答案准确性。此外，我们引入了证据引导的分辨率分配策略（Evidence-Guided Resolution Allocation），以缓解多页文档训练中的内存限制。大量实验表明，DocSeeker在域内及域外任务中均表现优异，能够稳健地从短页训练推广到超长文档，并且与视觉检索增强生成（Retrieval-Augmented Generation）系统天然协同，为其实现提供坚实基础。

cs.AI / 59 / 2604.12820

RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

RePAIR：基于提示感知的交互式机器遗忘模型修复方法

Rachapudi, Jagadeesh, Singh, Pranav, Vatsi, Ritali, Hambarde, Praful, Shukla, Amit

Abstract

Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.

Chinese Translation

大型语言模型（LLMs）在大规模网络语料预训练过程中，固有地吸收了有害知识、错误信息和个人数据，但缺乏选择性删除的原生机制。尽管机器遗忘提供了一种原则性解决方案，现有方法多为服务提供商中心化，依赖重训练流程、精心挑选的保留数据集以及模型服务提供商（MSPs）的直接干预，因而排除了终端用户对自身数据的控制权。我们提出了交互式机器遗忘（Interactive Machine Unlearning，IMU）这一新范式，使用户能够在推理阶段通过自然语言指令让LLMs遗忘特定知识。为实现IMU，我们设计了RePAIR，一种提示感知的模型修复框架，包含：（i）用于检测遗忘意图的看门狗模型，（ii）用于生成修复方案的外科医生模型，以及（iii）参数可自主更新的患者模型。RePAIR的核心是“通过伪逆激活操控引导”（Steering Through Activation Manipulation with PseudoInverse，STAMP）方法，这是一种无需训练、基于单样本的遗忘技术，通过闭式伪逆更新将多层感知机（MLP）激活引导至拒绝子空间。其低秩变体将计算复杂度从O(d^3)降低至O(r^3 + r^2 * d)，实现了高效的设备端遗忘，较基于训练的基线方法加速约3倍。大量实验涵盖有害知识抑制、错误信息纠正及个人数据擦除，结果表明RePAIR在实现接近零遗忘分数（Acc_f = 0.00，F-RL = 0.00）的同时，保持了模型效用（Acc_r最高达84.47，R-RL最高达0.88），优于六种最先进基线方法。该成果确立了RePAIR作为一种有效且实用的用户驱动模型编辑框架，推动了对已学知识的透明且设备端控制，并具备向多模态基础模型扩展的潜力。
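
To give a feel for the complexity reduction quoted in this abstract, from O(d^3) to O(r^3 + r^2 * d), the snippet below plugs in example sizes. The values d = 4096 and r = 64 are assumptions chosen here for illustration, not figures from the paper, and asymptotic operation counts ignore constant factors, so the ratio is not a predicted wall-clock speedup. The roughly 3x speedup reported in the abstract is a separate comparison against training-based baselines.

```python
def full_rank_cost(d: int) -> int:
    """Leading-order operation count for the O(d^3) update."""
    return d ** 3


def low_rank_cost(d: int, r: int) -> int:
    """Leading-order operation count for the O(r^3 + r^2 * d) variant."""
    return r ** 3 + r ** 2 * d


d, r = 4096, 64  # assumed hidden size and rank, for illustration only
ratio = full_rank_cost(d) / low_rank_cost(d, r)
print(f"full: {full_rank_cost(d):,}  low-rank: {low_rank_cost(d, r):,}  ratio: {ratio:,.0f}x")
```

With these assumed sizes the r^2 * d term dominates the low-rank cost, so the saving grows roughly as (d / r)^2 while r stays much smaller than d.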

cs.AI / 60 / 2604.12857

Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

人工智能在混合自动化与人类交通建模与仿真中的应用

Rahmani, Saeed, Rasouli, Shiva, Cornelisse, Daphne, Vinitsky, Eugene, van Arem, Bart, Calvert, Simeon C.

Abstract

Autonomous vehicles (AVs) are now operating on public roads, which makes their testing and validation more critical than ever. Simulation offers a safe and controlled environment for evaluating AV performance in varied conditions. However, existing simulation tools mainly focus on graphical realism and rely on simple rule-based models and therefore fail to accurately represent the complexity of driving behaviors and interactions. Artificial intelligence (AI) has shown strong potential to address these limitations; however, despite the rapid progress across AI methodologies, a comprehensive survey of their application to mixed autonomy traffic simulation remains lacking. Existing surveys either focus on simulation tools without examining the AI methods behind them, or cover ego-centric decision-making without addressing the broader challenge of modeling surrounding traffic. Moreover, they do not offer a unified taxonomy of AI methods covering individual behavior modeling to full scene simulation. To address these gaps, this survey provides a structured review and synthesis of AI methods for modeling AV and human driving behavior in mixed autonomy traffic simulation. We introduce a taxonomy that organizes methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive and physics-informed methods. The survey analyzes how existing simulation platforms fall short of the needs of mixed autonomy research and outlines directions to narrow this gap. It also provides a chronological overview of AI methods and reviews evaluation protocols and metrics, simulation tools, and datasets. By covering both traffic engineering and computer science perspectives, we aim to bridge the gap between these two communities.

Chinese Translation

自主车辆（AV）目前已在公共道路上运行，这使得其测试和验证变得比以往任何时候都更加重要。仿真提供了一个安全且受控的环境，用于评估自主车辆在不同条件下的性能。然而，现有的仿真工具主要关注图形真实感，并依赖简单的基于规则的模型，因此未能准确反映驾驶行为和交互的复杂性。人工智能（AI）显示出强大的潜力来解决这些局限性；然而，尽管AI方法的快速进展，但对其在混合自主交通仿真中的应用的全面调查仍然缺乏。现有的调查要么专注于仿真工具而未考察其背后的AI方法，要么涉及以自我为中心的决策制定而未解决建模周围交通的更广泛挑战。此外，它们未能提供涵盖个体行为建模到完整场景仿真的统一AI方法分类。为了解决这些空白，本调查提供了对混合自主交通仿真中AV和人类驾驶行为建模的AI方法的结构化回顾和综合。我们介绍了一个将方法组织为三大类的分类法：代理级行为模型、环境级仿真方法，以及认知和物理信息驱动的方法。调查分析了现有仿真平台如何未能满足混合自主研究的需求，并概述了缩小这一差距的方向。它还提供了AI方法的时间顺序概述，并回顾了评估协议和指标、仿真工具以及数据集。通过涵盖交通工程和计算机科学的视角，我们旨在弥合这两个领域之间的差距。

cs.AI / 61 / 2604.12865

From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention

从边缘到意义：语义线条草图作为古代象形文字发明的认知支架

Leem, Seowung, Gu, Lin, Fang, Ruogu

Abstract

Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain's intrinsic tendency to compress visual input into stable, boundary-based abstractions. We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.

Chinese Translation

人类能够轻易地从稀疏的线条画中识别物体，这种能力在发展早期就出现，并且在不同文化中持续存在，表明其源于神经机制而非纯粹的学习。然而，大脑如何将高层次的语义知识转化为低层次的视觉符号的计算机制仍然不甚明了。在此，我们提出古代象形文字的产生源于大脑内在的倾向，即将视觉输入压缩为稳定的、基于边界的抽象。我们构建了一个生物启发的视觉层次的数字双胞胎，该系统将图像编码为低层次特征，生成轮廓草图，并通过由语义表征引导的自上而下反馈进行迭代优化，反映了人类视觉皮层的前馈和递归架构。最终生成的符号在结构上与文化上相距甚远的书写系统中的早期象形文字（包括埃及象形文字、中国甲骨文和原始楔形文字）具有显著的相似性，并为尚未破译的文字提供了候选解释。我们的研究结果支持象形文字的神经计算起源，并建立了一个框架，使人工智能能够重现人类最初将感知外化为符号的认知过程。

cs.AI / 62 / 2604.12867

QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

QuarkMedSearch：一种面向医疗智能探索的长远深度搜索智能体

Lin, Zhichao, Liang, Zhichao, Liu, Gaoqiang, Xu, Meng, Xiang, Baoyu, Xu, Jian, Jiang, Guanjun

Abstract

As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model's planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.

Chinese Translation

随着具备代理能力的基础模型不断发展，如何进一步提升其在垂直领域的表现已成为一大挑战。基于强大的代理基础模型Tongyi DeepResearch，我们聚焦于中文医疗深度搜索场景，提出了QuarkMedSearch，系统性地探索涵盖医疗多跳数据构建、训练策略及评估基准的全流程方法，以进一步推动并评估其在垂直领域的性能上限。具体而言，在数据合成方面，为解决医疗领域深度搜索训练数据稀缺的问题，我们结合大规模医疗知识图谱与实时在线探索，构建了长远医疗深度搜索训练数据；在后训练阶段，采用两阶段的SFT（监督微调）和RL（强化学习）训练策略，逐步增强模型在深度搜索中所需的规划、工具调用及反思能力，同时保持搜索效率；在评估方面，我们与医疗专家合作，通过严格的人工验证构建了QuarkMedSearch基准。实验结果表明，QuarkMedSearch在QuarkMedSearch基准上，在同等规模的开源模型中实现了最先进的性能表现，同时在通用基准上也保持了较强的竞争力。

cs.AI / 63 / 2604.12874

LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems

LIFE——一个面向前沿系统的节能先进持续学习自主AI框架

Lee, Anne, Hosangadi, Gurudutt

Abstract

The rapid advancement of AI has changed the character of HPC usage, including dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities also limit the ability of AI to manage HPC systems effectively. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain-inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning framework that is Incremental, Flexible, and Energy efficient, implemented as an agent-centric system rather than a single monolithic model. LIFE uniquely combines four components to realize self-evolving network management and operations in HPC systems: an orchestrator, Agentic Context Engineering, a novel memory system, and information lattice learning. LIFE can also generalize to a variety of orthogonal use cases. We ground LIFE in a specific closed-loop HPC operations example: detecting and mitigating latency spikes experienced by critical microservices running on a Kubernetes-like cluster.

Chinese Translation

人工智能的快速发展改变了高性能计算（HPC）使用的特征，如维度设计、资源配置和执行。不仅能源需求被放大，现有的基础持续学习能力也限制了AI有效管理HPC的能力。本文回顾了超越单一变换器的新兴方向，强调自主AI和脑启发架构作为实现可持续、自适应系统的互补路径。我们提出了LIFE，一个增量、灵活且节能的推理与学习框架，它作为一个以代理为中心的系统实现，而非单一的整体模型。LIFE独特地结合了四个组件，以实现HPC中自我演化的网络管理和操作。这些组件包括一个调度器、自主上下文工程、一个新颖的记忆系统和信息格学习。LIFE还可以推广以支持多种正交用例。我们将LIFE应用于一个特定的闭环HPC操作示例，以检测和缓解在类似Kubernetes集群上运行的关键微服务所经历的延迟峰值。

cs.AI / 64 / 2604.12875

AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

AISafetyBenchExplorer：一个关注指标的AI安全基准目录揭示了测量的碎片化与基准治理的薄弱

Solanke, Abiodun A.

Abstract

The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post-publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.
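The concentration figures are easier to compare as percentages of the 195 catalogued benchmarks. The short calculation below uses only the counts quoted in the abstract; the dictionary keys are shortened labels chosen here for readability.

```python
TOTAL = 195  # benchmarks in the catalogue, as stated in the abstract

# Counts quoted in the abstract; keys are shortened labels.
counts = {
    "medium-complexity benchmarks": 94,
    "English-only evaluation": 165,
    "evaluation-only resources": 170,
    "stale GitHub repositories": 137,
    "stale Hugging Face datasets": 96,
}

# Share of the catalogue, in percent, rounded to one decimal place.
shares = {name: round(100 * n / TOTAL, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]}/{TOTAL} = {pct}%")
```

On these counts, English-only evaluation and evaluation-only resources each cover well over four fifths of the catalogue, while roughly half of the benchmarks fall in the medium-complexity band.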

Chinese Translation

大型语言模型（LLM）安全评估的快速扩展催生了庞大的基准生态系统，但未能形成相应连贯的测量生态系统。我们提出了AISafetyBenchExplorer，这是一个结构化的AI安全基准目录，涵盖了2018年至2026年间发布的195个基准，采用多表格架构组织，记录基准级元数据、指标级定义、基准论文元数据及代码库活跃度。该设计不仅支持对现有基准的元分析，还能分析安全性在文献中如何被操作化、聚合及评判。通过更新后的目录，我们识别出一个核心结构性问题：基准数量的激增已超越了测量标准化的进程。当前格局由中等复杂度基准主导（94/195），而仅有7个基准位于“Popular”层级。工作簿还报告了对仅限英语评估（165/195）、仅评估资源（170/195）、过时的GitHub仓库（137/195）、过时的Hugging Face数据集（96/195）以及基准中对arXiv预印本的高度依赖（在已知发表渠道元数据的基准中）的强烈集中。指标层面，目录显示诸如准确率（accuracy）、F1分数、安全评分（safety score）及综合基准分数等熟悉标签，常常掩盖了实质上不同的评判者、聚合规则和威胁模型。我们认为该领域的主要失败模式是碎片化而非资源匮乏。研究者虽然拥有大量基准资源，但往往缺乏共享的测量语言、基准选择的原则依据以及发布后维护的持续管理规范。AISafetyBenchExplorer通过提供可追溯的基准目录、受控的元数据架构和复杂度分类法，填补了这一空白，支持更严谨的基准发现、比较与元评估。

cs.AI / 65 / 2604.12898

BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

BEAM：双层记忆自适应算法进化用于基于大语言模型的启发式设计

Xiang, Chuyang, Wei, Yichen, Ma, Jiale, Wang, Handing, Yan, Junchi

Abstract

Large Language Model-based Hyper Heuristic (LHH) has recently emerged as an efficient way for automatic heuristic design. However, most existing LHHs perform well only when optimizing a single function within a pre-defined solver. Their single-layer evolution is not effective enough to write a competent complete solver. While some variants incorporate hyperparameter tuning or attempt to generate complex code through iterative local modifications, they still lack high-level algorithmic modeling, leading to limited exploration efficiency. To address this, we reformulate heuristic design as a Bi-level Optimization problem and propose BEAM (Bi-level Memory-adaptive Algorithmic Evolution). BEAM's exterior layer evolves high-level algorithmic structures with function placeholders through a genetic algorithm (GA), while the interior layer realizes these placeholders via Monte Carlo Tree Search (MCTS). We further introduce an Adaptive Memory module to facilitate complex code generation. To support the evaluation of complex code generation, we point out the limitations of starting LHHs from scratch or from code templates and introduce a Knowledge Augmentation (KA) Pipeline. Experimental results on several optimization problems demonstrate that BEAM significantly outperforms existing LHHs, notably reducing the optimality gap by 37.84% on aggregate in CVRP hybrid algorithm design. BEAM also designs a heuristic that outperforms the SOTA Maximum Independent Set (MIS) solver KaMIS.

Chinese Translation

基于大语言模型的超启发式（LHH）最近作为一种自动启发式设计的高效方法而出现。然而，现有的大多数LHH仅在预定义求解器内优化单一函数时表现良好。它们的单层进化使其在编写有效的完整求解器方面不够有效。虽然一些变体结合了超参数调优或试图通过迭代局部修改生成复杂代码，但它们仍然缺乏高层次的算法建模，导致探索效率有限。为了解决这个问题，我们将启发式设计重新表述为一个双层优化问题，并提出了BEAM（双层记忆自适应算法进化）。BEAM的外层通过遗传算法（GA）进化具有函数占位符的高层算法结构，而内层则通过蒙特卡洛树搜索（MCTS）实现这些占位符。我们进一步引入了自适应记忆模块，以促进复杂代码的生成。为了支持复杂代码生成的评估，我们指出从头开始或从代码模板启动LHH的局限性，并引入了知识增强（KA）管道。在多个优化问题上的实验结果表明，BEAM显著优于现有的LHH，特别是在CVRP混合算法设计中将最优性差距降低了37.84%。BEAM还设计了一种启发式算法，其性能超越了当前最先进的最大独立集（MIS）求解器KaMIS。

cs.AI / 66 / 2604.12948

Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

借助记忆绘制：双痕迹编码提升大型语言模型代理的跨会话回忆能力

Stern, Benjamin, Nadel, Peter

Abstract

LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.
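The headline accuracies are consistent with simple counts over the 99 shared questions. The check below assumes 73 and 53 correct answers respectively; those counts are inferred here from the reported percentages and are not stated in the abstract.

```python
N_SHARED = 99  # shared questions, as stated in the abstract

# Inferred counts (an assumption): the integers that reproduce the
# reported 73.7% and 53.5% when divided by 99.
dual_correct = 73
fact_correct = 53

dual_acc = 100 * dual_correct / N_SHARED  # dual-trace accuracy, in percent
fact_acc = 100 * fact_correct / N_SHARED  # fact-only accuracy, in percent
gain_pp = dual_acc - fact_acc             # difference in percentage points

print(f"dual-trace: {dual_acc:.1f}%")
print(f"fact-only:  {fact_acc:.1f}%")
print(f"gain:       {gain_pp:+.1f} pp")
```

Under that assumption the gain corresponds to 20 additional questions answered correctly out of 99.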

Chinese Translation

具备持久记忆的大型语言模型（LLM）代理通常将信息存储为平铺的事实记录，缺乏对时间推理、变化追踪或跨会话聚合的上下文支持。受绘图效应（drawing effect）[3]的启发，我们提出了双痕迹记忆编码方法。在该方法中，每条存储的事实都会配对一个具体场景痕迹，即对信息学习时刻及其上下文的叙事重构。代理在编码过程中被迫承诺具体的上下文细节，从而创造出更丰富、更具辨识度的记忆痕迹。基于LongMemEval-S基准测试（4575个会话，100个回忆问题），我们在99个共享问题上将双痕迹编码与覆盖范围和格式匹配的仅事实控制组进行了比较。双痕迹编码整体准确率达到73.7%，而对照组为53.5%，提升了20.2个百分点（95%置信区间：[+12.1, +29.3]，自助法p < 0.0001）。提升主要集中在时间推理（+40个百分点）、知识更新追踪（+25个百分点）和多会话聚合（+30个百分点）方面，单会话检索无显著收益，符合编码特异性理论[8]。词元分析表明，双痕迹编码在无额外成本的情况下实现了该提升。我们还初步设计了将双痕迹编码适配于代码生成代理的架构方案，并进行了初步的试点验证。

cs.AI / 67 / 2604.12955

Modeling Co-Pilots for Text-to-Model Translation

面向文本到模型翻译的协同助手建模

Kadioglu, Serdar, Uppuluri, Karthik, Singirikonda, Akash

Abstract

There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing Text2Model and Text2Zinc. Text2Model is a suite of co-pilots based on several LLM strategies with varying complexity, along with an online leaderboard. Text2Zinc is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with a built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate both satisfaction and optimization problems within a unified architecture and dataset. Moreover, our approach is solver-agnostic, unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage MiniZinc's solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our co-pilot strategies are competitive with, and in parts improve on, recent research in this domain. Our findings indicate that while LLMs are promising, they are not yet a push-button technology for combinatorial modeling. We contribute the Text2Model co-pilots and leaderboard, and the Text2Zinc dataset and interactive editor, as open source to support closing this performance gap.

Chinese Translation

利用大型语言模型（LLMs）进行文本到模型的翻译与优化任务正日益受到关注。本文旨在推动该研究方向的发展，提出了Text2Model和Text2Zinc。Text2Model是一套基于多种复杂度不同的LLM策略构建的协同助手系统，并配备了在线排行榜。Text2Zinc是一个跨领域的数据集，用于捕捉以自然语言描述的优化和满足性问题，同时提供了内置AI助手的交互式编辑器。尽管已有文献开始探讨利用LLMs将组合优化问题翻译为形式化模型，本文工作首次尝试在统一的架构和数据集中整合满足性问题与优化问题。此外，我们的方法是求解器无关的（solver-agnostic），区别于现有工作多聚焦于针对特定求解器的模型翻译。为此，我们利用了MiniZinc的求解器及范式无关的建模能力来表述组合问题。我们进行了全面实验，比较了多种单次及多次调用策略的执行效果和解的准确性，包括零样本提示（zero-shot prompting）、链式思维推理（chain-of-thought reasoning）、基于知识图谱的中间表示、基于语法的语法编码以及将模型分解为顺序子任务的智能代理方法。我们的协同助手策略具有竞争力，部分方面优于该领域的最新研究。研究结果表明，尽管LLMs展现出良好潜力，但尚未成为组合建模的即插即用技术。我们将Text2Model协同助手与排行榜，以及Text2Zinc数据集和交互式编辑器开源，以助力缩小性能差距。

cs.AI / 68 / 2604.12967

Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

循环一致性搜索：以问题可重构性作为搜索代理训练的代理奖励

An, Sohyun, Yuan, Shuibenyang, Lee, Hayeon, Hsieh, Cho-Jui, Min, Alexander

Abstract

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable paradigm for training search agents in settings where gold supervision is unavailable.
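The two information bottlenecks can be pictured with a small sketch: drop the final response, mask named entities in the queries, and score a reconstructed question against the original. Everything below is a toy stand-in rather than the CCS implementation. The trajectory format, the caller-supplied entity list (in place of a real NER model), and the token-overlap F1 used as a reward proxy are all assumptions, and the reconstruction step itself (presumably a language model) is left out.

```python
import re
from collections import Counter


def mask_entities(query, entities, mask="[ENT]"):
    """Replace each listed entity string in the query with a mask token."""
    for ent in sorted(entities, key=len, reverse=True):
        query = re.sub(re.escape(ent), mask, query, flags=re.IGNORECASE)
    return query


def bottlenecked_view(trajectory, entities):
    """Keep masked queries and retrieved observations; drop the final response."""
    view = []
    for step in trajectory:
        if step["type"] == "query":
            view.append(("query", mask_entities(step["text"], entities)))
        elif step["type"] == "observation":
            view.append(("observation", step["text"]))
        # Steps of type "final_response" are deliberately skipped.
    return view


def token_f1(reference, candidate):
    """Token-overlap F1, used here as a stand-in reconstruction reward."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical trajectory for illustration.
trajectory = [
    {"type": "query", "text": "Marie Curie Nobel Prize year"},
    {"type": "observation", "text": "Marie Curie won the Nobel Prize in Physics in 1903."},
    {"type": "final_response", "text": "1903"},
]
view = bottlenecked_view(trajectory, entities=["Marie Curie"])

question = "When did Marie Curie win the Nobel Prize"
reward_if_perfect = token_f1(question, question)
```

In this toy the entity can still be recovered from the retrieved observation but not from the query text, which is the intended effect of masking: reconstruction has to lean on what was retrieved.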

Chinese Translation

强化学习（Reinforcement Learning, RL）在优化复杂信息检索任务中的搜索代理方面展现出强大潜力。然而，现有方法主要依赖于金标准监督（如真实答案），这在实际中难以规模化。为了解决这一限制，我们提出了循环一致性搜索（Cycle-Consistent Search, CCS）框架，一种无需金标准监督的搜索代理训练方法，灵感来源于无监督机器翻译和图像到图像翻译中的循环一致性技术。我们的核心假设是，最优的搜索轨迹不同于不足或无关的轨迹，它能够无损编码问题意图。因此，高质量的轨迹应保留准确重构原始问题所需的信息，从而为策略优化引入奖励信号。然而，简单的循环一致性目标易受到信息泄露的影响，因为重构可能依赖于表面词汇线索而非底层搜索过程。为减少此影响，我们引入信息瓶颈，包括排除最终响应和对搜索查询进行命名实体识别（NER）掩码。这些约束迫使重构依赖于检索到的观察结果及结构框架，确保生成的奖励信号反映信息充分性而非语言冗余。在问答基准测试中的实验表明，CCS的性能可与有监督基线相媲美，且优于先前不依赖金标准监督的方法。结果表明，CCS为在缺乏金标准监督的环境中训练搜索代理提供了一种可扩展的训练范式。

cs.AI / 69 / 2604.13013

Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

用于电动容量限制车辆路径问题的双层迟滞接受爬山算法

Qin, Yinghao, Bazargani, Mosab, Burke, Edmund K., Coello, Carlos A. Coello, Song, Zhongmin, Chen, Jun

Abstract

This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.
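For readers unfamiliar with the underlying metaheuristic, the sketch below shows Late Acceptance Hill Climbing in its basic single-level form: a candidate is accepted if it is no worse than the current solution or no worse than the cost recorded a fixed number of iterations earlier. It is not the bilevel b-LAHC described above and contains no routing or charging logic; the toy path-ordering problem, the function names, and the parameter defaults are illustrative assumptions.

```python
import random


def lahc(initial, cost, neighbor, history_length=20, max_iters=5000, seed=0):
    """Basic Late Acceptance Hill Climbing (single level, minimization)."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    # History of costs; each slot is revisited every `history_length` steps.
    history = [current_cost] * history_length
    for it in range(max_iters):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        slot = it % history_length
        # Late acceptance: compare against the current cost and an older one.
        if candidate_cost <= current_cost or candidate_cost <= history[slot]:
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        history[slot] = current_cost
    return best, best_cost


def path_length(order):
    """Toy objective: total distance when visiting points on a line in order."""
    return sum(abs(a - b) for a, b in zip(order, order[1:]))


def swap_two(order, rng):
    """Toy neighborhood move: swap two positions and return a new list."""
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order


start = [3, 0, 4, 1, 5, 2]
best, best_cost = lahc(start, path_length, swap_two)
```

The only tunable quantity is the history length, which is one reason the method is often described as lightweight; the fixed-parameter claim in the abstract refers to the authors' own settings, not the defaults used here.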

Chinese Translation

本文通过一个双层优化框架解决电动容量限制车辆路径问题（Electric Capacitated Vehicle Routing Problem，E-CVRP），该框架根据搜索阶段分别或联合处理路径规划和充电决策。通过分析两者的相互作用，我们在上层引入了代理目标以指导搜索并加速收敛。提出了一种双层迟滞接受爬山算法（bilevel Late Acceptance Hill Climbing，b-LAHC），其运行分为三个阶段：贪婪下降、邻域探索和最终解的精炼。b-LAHC采用固定参数，避免了复杂的参数调整，同时保持轻量且高效。在IEEE WCCI-2020基准测试上的大量实验表明，b-LAHC在与八种最先进算法的比较中表现出优越或具有竞争力的性能。在固定评估预算下，b-LAHC在小规模实例上获得近最优解，并在大规模基准测试中创造了9项新最优结果，平均提升现有记录1.07%。此外，代理目标与完整成本之间虽非绝对但显著的强相关性验证了代理目标的使用合理性，同时仍需联合求解双层问题，从而证明了所提双层框架的有效性，并凸显其在高效解决具有层级结构的大规模路径规划问题中的潜力。

cs.AI / 70 / 2604.13017

PAL: Personal Adaptive Learner

PAL：个性化自适应学习者

Chakraborty, Megha, Eswaramoorthi, Darssan L., Thareja, Madhur, Shah, Het Riteshkumar, Palmer, Finlay, Bahl, Aryaman, Ihetu, Michelle A, Sheth, Amit

Abstract

AI-driven education platforms have made some progress in personalization, yet most remain constrained to static adaptation (predefined quizzes, uniform pacing, or generic feedback), which limits their ability to respond to learners' evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner's interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.

Chinese Translation

基于人工智能的教育平台在个性化方面取得了一定进展，但大多数仍局限于静态适应——预设测验、统一进度或通用反馈——限制了其对学习者不断变化理解的响应能力。这一不足凸显了对具备上下文感知和实时自适应能力系统的需求。我们提出了PAL（Personal Adaptive Learner），一个由AI驱动的平台，将讲座视频转化为互动学习体验。PAL持续分析多模态讲座内容，并通过不同难度的问题动态地与学习者互动，根据其回答实时调整教学过程。在课程结束时，PAL生成个性化总结，强化关键概念，同时根据学习者兴趣定制示例。通过融合多模态内容分析与自适应决策，PAL为响应式数字学习贡献了创新框架。我们的工作展示了AI如何超越静态个性化，迈向实时、个体化支持，解决了AI赋能教育中的核心挑战。

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.11996

Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

过滤推理评分：评估模型最自信推理路径的推理质量

Pathak, Manas, Chen, Xingyao, Li, Shuozhe, Zhang, Amy, Leqi, Liu

Abstract

Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
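The aggregation step can be sketched in a few lines: rank sampled traces by confidence, keep the top K%, and average their reasoning scores. How confidence and per-trace reasoning scores are actually computed is not specified in the abstract, so the `(confidence, score)` tuples, the ceiling-based cutoff, and the function name below are assumptions.

```python
import math


def filtered_reasoning_score(traces, top_k_percent):
    """Mean reasoning score over the top-K% most confident traces.

    `traces` is a list of (confidence, reasoning_score) pairs; both
    quantities are assumed to be precomputed elsewhere.
    """
    if not traces:
        raise ValueError("need at least one trace")
    if not 0 < top_k_percent <= 100:
        raise ValueError("top_k_percent must be in (0, 100]")
    ranked = sorted(traces, key=lambda t: t[0], reverse=True)
    # Always keep at least one trace; round the cutoff up (an assumption).
    keep = max(1, math.ceil(len(ranked) * top_k_percent / 100))
    kept = ranked[:keep]
    return sum(score for _, score in kept) / len(kept)


# Hypothetical traces: (confidence, reasoning score).
traces = [(0.95, 0.9), (0.40, 0.2), (0.90, 0.8), (0.30, 0.1)]

frs_top_half = filtered_reasoning_score(traces, 50)   # two most confident traces
plain_average = filtered_reasoning_score(traces, 100)  # all four traces
```

With these toy values the filtered score is much higher than the plain average, because the low-confidence traces that drag the average down are excluded.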

Chinese Translation

我们应当信任具有高准确率的大型语言模型（LLMs）吗？LLMs 在推理基准测试中取得了较高的准确率，但仅凭正确性并不能揭示其推理过程的质量。这凸显了基于结果的评估方法的根本局限性：模型可能通过有缺陷的推理得到正确答案，而推理能力显著不同的模型也可能表现出相似的基准准确率，例如由于记忆或过度优化。本文探讨：在现有基准测试条件下，能否超越基于结果的评估，直接评估推理质量？我们寻求的指标应当（1）能够区分准确率相近的模型，（2）对输入提示和生成配置的变化具有鲁棒性。为此，我们提出了一种推理评分方法，从忠实性、一致性、实用性和事实性等维度评估推理路径。一个尚未解决的问题是如何在多个采样推理路径中汇总该评分。简单平均并不可取，尤其在长远推理场景中，可能路径数量迅速增加，且低置信度的正确路径更可能是偶然的。为解决此问题，我们引入了过滤推理评分（Filtered Reasoning Score，FRS），仅利用置信度排名前K%的推理路径来计算推理质量。基于FRS评估时，在标准准确率下难以区分的模型在推理质量上表现出显著差异。此外，在一个基准上具有较高FRS的模型，往往在其他推理基准中也表现出更优的准确率和推理质量。综合来看，这些发现表明FRS作为准确率的补充，能够捕捉模型的可迁移推理能力。我们已开源评估代码库：https://github.com/Manas2006/benchmark_reproducibility。

cs.CL / 2 / 2604.12002

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

自我蒸馏零：自我修订将二元奖励转化为密集监督

He, Yinghui, Kaur, Simran, Bhaskar, Adithya, Yang, Yongjin, Liu, Jiarui, Ri, Narutatsu, Fowl, Liam, Panigrahi, Abhishek, Chen, Danqi, Arora, Sanjeev

Abstract

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning with verifiable rewards (RLVR) relies on binary rewards, which are broadly applicable and powerful but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or from high-quality demonstrations, but such supervision can be costly to collect or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
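The contrast between a single binary reward and dense token-level supervision can be made concrete with a toy distillation loss. The sketch below averages a per-token KL divergence from a teacher distribution (standing in for the reviser) to a student distribution (standing in for the generator). The direction of the KL, the three-word vocabulary, and the probabilities are assumptions for illustration and are not taken from the SD-Zero paper.

```python
import math


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL, plus the per-token values for inspection."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must have the same length")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token


# Hypothetical distributions over a 3-token vocabulary at two positions.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

loss, per_token = distillation_loss(teacher, student)
```

Here the first position contributes no loss and the second contributes all of it, so the training signal points at the specific token where the two roles disagree instead of scoring the whole response with one bit.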

Chinese Translation

当前在可验证环境下的后训练方法可分为两类。强化学习（RLVR）依赖于二元奖励，这种方法广泛适用且功能强大，但在训练过程中仅提供稀疏监督。蒸馏提供密集的令牌级监督，通常来自外部教师或高质量示范。收集这种监督可能代价高昂或不可获得。我们提出了自我蒸馏零（Self-Distillation Zero, SD-Zero），这是一种在训练样本效率上显著优于强化学习的方法，并且不需要外部教师或高质量示范。SD-Zero训练一个模型扮演两个角色：生成器（Generator），生成初始响应；修订者（Reviser），根据该响应及其二元奖励生成改进响应。然后，我们进行在线自我蒸馏，将修订者蒸馏到生成器中，使用修订者基于生成器响应及其奖励的令牌分布作为监督。实际上，SD-Zero训练模型将二元奖励转化为密集的令牌级自我监督。在与Qwen3-4B-Instruct和Olmo-3-7B-Instruct的数学和代码推理基准测试中，SD-Zero的性能提高至少10%，超越了基础模型，并在相同的问题集和训练样本预算下超越了强基线，包括拒绝微调（Rejection Fine-Tuning, RFT）、GRPO和自我蒸馏微调（Self-Distillation Fine-Tuning, SDFT）。大量消融研究显示了我们提出的算法的两个新特征：（a）令牌级自我定位，修订者能够根据奖励识别生成器响应中需要修订的关键令牌；（b）迭代自我演化，修订答案的改进能力可以通过定期的教师同步蒸馏回生成性能。

cs.CL / 3 / 2604.12018

LLMs Struggle with Abstract Meaning Comprehension More Than Expected

大型语言模型在抽象意义理解方面的表现低于预期

Alhazmi, Hamoud, Jiang, Jiachen

Abstract

Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models' ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.

Chinese Translation

理解抽象意义对于高级语言理解至关重要。尽管已有大量研究，抽象词汇仍然因其非具体和高级语义而具有挑战性。SemEval-2021任务4（ReCAM）通过提供带有问题和五个抽象选项的段落，以填空式格式评估模型解释抽象概念的能力。主要发现包括：（1）大多数大型语言模型（LLMs），包括GPT-4o，在零样本、一样本和少样本设置下都难以理解抽象意义，而经过微调的模型如BERT和RoBERTa表现更好。（2）提出的双向注意分类器受人类认知策略的启发，通过动态关注段落和选项来增强微调模型的性能。这种方法在任务1上提高了4.06%的准确率，在任务2上提高了3.41%的准确率，展示了其在抽象意义理解方面的潜力。

cs.CL / 4 / 2604.12033

Benchmarking Deflection and Hallucination in Large Vision-Language Models

大型视觉-语言模型中的回避与幻觉基准测试

Moratelli, Nicholas, Davis, Christopher, Ribeiro, Leonardo F. R., Byrne, Bill, Iglesias, Gonzalo

Abstract

Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.

Chinese Translation

大型视觉-语言模型（Large Vision-Language Models，LVLMs）越来越依赖检索来回答知识密集型的多模态问题。现有的基准测试忽视了视觉与文本证据之间的冲突，以及在检索知识不完整时生成回避（例如“抱歉，我无法回答……”）的重要性。这些基准测试还存在快速过时的问题，因为随着LVLM训练集的不断扩大，模型能够在无需检索的情况下回答许多问题。针对这些不足，我们做出了三方面的贡献。首先，我们提出了一个动态数据策划流程，通过筛选真正依赖检索的样本，保持基准测试难度的持续性。其次，我们引入了VLM-DeflectionBench，这是一个包含2775个样本的基准，涵盖多样的多模态检索场景，旨在探测模型在面对冲突或不足证据时的行为。第三，我们定义了一个细粒度的评估协议，包含四种场景，用以区分参数记忆与检索鲁棒性。对20个最先进LVLM的实验表明，模型通常在面对噪声或误导性证据时无法有效回避。我们的结果强调了不仅要评估模型所掌握的知识，更要考察其在知识缺失时的表现，并提供了一个可复用且可扩展的基准，用于可靠的知识库视觉问答（KB-VQA）评估。所有资源将在论文发表后公开。

cs.CL / 5 / 2604.12046

Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

通过不确定性思考：通过推理校准提高长文本生成的事实性

Liu, Xin, Wang, Lu

Abstract

Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims' correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
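
The selective-prediction step described above can be illustrated with a minimal sketch. It assumes each atomic claim arrives with a confidence in [0, 1] and that abstention is a simple threshold test; the `Claim` type, the `select_claims` helper, and the threshold value are hypothetical names chosen for illustration, not CURE's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    confidence: float  # assumed: model-stated probability that the claim is correct


def select_claims(claims, threshold):
    """Split claims into those kept and those the model abstains on."""
    kept = [c for c in claims if c.confidence >= threshold]
    abstained = [c for c in claims if c.confidence < threshold]
    return kept, abstained


# Placeholder claims and confidences, invented purely for illustration.
claims = [
    Claim("claim A", 0.92),
    Claim("claim B", 0.35),
    Claim("claim C", 0.71),
]
kept, abstained = select_claims(claims, threshold=0.6)
print([c.text for c in kept])       # ['claim A', 'claim C']
print([c.text for c in abstained])  # ['claim B']
```

Raising the threshold trades recall for accuracy, which is only useful if the confidences are well calibrated.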

Chinese Translation

大型语言模型（LLMs）在长文本生成中常常出现幻觉。现有方法主要通过事后修正或基于正确性的奖励进行强化学习（RL）来提高事实性，但并未教会模型评估其生成内容中哪些部分是可靠的。因此，模型在回应中仍可能自信地陈述错误的主张。最近在推理方面的进展显著提高了LLM的性能，并通过将校准纳入RL目标来估计置信度。然而，现有方法仍然局限于对整个响应的单一标量置信度，这对于个别主张不确定性各异的长文本生成而言是不够的。为了解决这个问题，我们提出了CURE，一个通过教会LLM在主张层面上推理不确定性来提高长文本事实性的框架。我们首先引入了一种主张感知推理协议，该协议将输出结构化为原子主张，并配以明确的置信度估计。接着，我们开发了一个多阶段训练流程，使模型置信度与主张的正确性对齐，并优化事实性。最终得到的校准置信度进一步实现了选择性预测，使模型在推理时能够放弃不确定的主张。在四个长文本事实性基准上的实验表明，CURE在保持事实召回的同时，持续提高了与竞争性监督和RL基线相比的事实准确性。特别是在传记生成中，它使主张级准确性提高了多达39.9%。这些提升伴随着校准的改善，FactBench的AUROC提高了16.0%。

cs.CL / 6 / 2604.12047

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

基于 RAG 的金融问答中 PDF 解析与分块的实证评估

Bachyr, Omar El, Song, Yewei, Ezzini, Saad, Klein, Jacques, Bissyandé, Tegawendé F., Zilali, Anas, Ble, Ulrick, Goujon, Anne

Abstract

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including promising Retrieval-Augmented Generation (RAG) systems for automated PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we present such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
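
As a rough illustration of chunking with varied overlap, the sketch below splits a token sequence into fixed-size windows that share a configurable number of tokens. The function name `chunk_tokens` and its parameters are assumptions made for this example; the paper's actual chunking strategies are not specified in the abstract.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Larger overlap lowers the chance of cutting an answer span across a chunk boundary, at the cost of indexing more redundant text.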

Chinese Translation

PDF 文件主要是为了人类阅读而设计，而非自动处理。此外，PDF 中异构内容（如文本、表格和图像）给解析和信息提取带来了显著挑战。为了解决这些困难，实践者和研究人员越来越多地开发新方法，包括有前景的检索增强生成（Retrieval-Augmented Generation, RAG）系统，以实现 PDF 的自动处理。然而，目前尚无全面研究探讨不同组件和设计选择如何影响 RAG 系统在理解 PDF 方面的性能。本文提出了一项研究，(1) 专注于问答这一特定的语言理解任务，(2) 利用来自金融领域的两个基准，包括我们新生成的、公开可用的基准 TableQuest。我们系统性地考察了多种 PDF 解析器和分块策略（具有不同的重叠），以及它们在保持文档结构和确保答案正确性方面的潜在协同效应。总体而言，我们的结果为构建稳健的 RAG 流水线以理解 PDF 提供了实用指南。

cs.CL / 7 / 2604.12049

Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs

利用加权句法和语义上下文评估摘要（wSSAS）进行文本分类的研究

Kathuria, Shreeya Verma, Mayande, Nitin, Daruwalla, Sharookh, Joglekar, Nitin, Weber, Charles

Abstract

The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and by a sensitivity to noise, both of which compromise analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model's attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation. Experimental results using Gemini 2.0 Flash Lite across diverse datasets (Google Business reviews, Amazon Product reviews, and Goodreads Book reviews) demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM-based summaries through a high-precision, deterministic process for large-scale text categorization.

Chinese Translation

大型语言模型（LLMs）在可靠的企业级分析（如文本分类）中的应用，常常受到注意力机制的随机性和对噪声的敏感性影响，从而损害其分析精度和可重复性。为了解决这些技术摩擦，本文提出了加权句法和语义上下文评估摘要（wSSAS），这是一个旨在确保大规模混乱数据集数据完整性的确定性框架。我们提出了一个两阶段的验证框架，首先将原始文本组织成包含主题、故事和聚类的层次分类结构。然后，利用信噪比（SNR）来优先考虑高价值的语义特征，确保模型的注意力集中在最具代表性的数据点上。通过将这一评分机制纳入摘要摘要（SoS）架构，该框架有效地隔离了关键信息，并在数据聚合过程中减轻了背景噪声。使用Gemini 2.0 Flash Lite在多种数据集（包括Google商业评论、亚马逊产品评论和Goodreads书评）上的实验结果表明，wSSAS显著提高了聚类完整性和分类准确性。我们的研究结果表明，wSSAS降低了分类熵，并为基于高精度、确定性过程的大规模文本分类改进LLM生成的摘要提供了可重复的路径。

cs.CL / 8 / 2604.12056

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

LoSA：针对块级扩散语言模型的局部感知稀疏注意力

Xi, Haocheng, Singh, Harman, Hu, Yuezhou, Hooper, Coleman, Tiwari, Rishabh, Tomar, Aditya, Lee, Minjae, Kang, Wonjun, Mahoney, Michael, Xu, Chenfeng, Keutzer, Kurt, Gholami, Amir

Abstract

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LOSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.

Chinese Translation

块级扩散语言模型（DLMs）以任意顺序生成多个标记，提供了一种有前景的替代自回归解码管道的方法。然而，在长上下文场景中，它们仍然受到内存限制注意力的瓶颈。由于KV膨胀问题，简单的稀疏注意力在DLMs上表现不佳，其中不同的查询选择不同的前缀位置，导致访问的KV页面的并集变得庞大。为了解决这个问题，我们观察到在连续的去噪步骤之间，只有一小部分活跃标记表现出显著的隐藏状态变化，而大多数稳定标记几乎保持不变。基于这一见解，我们提出了LOSA（局部感知稀疏注意力），该方法重用稳定标记的缓存前缀注意力结果，仅对活跃标记应用稀疏注意力。这大大减少了必须加载的KV索引数量，从而实现更高的加速和更高的准确性。在多个块级DLMs和基准测试中，LOSA保持了接近密集的准确性，同时显著提高了效率，在激进的稀疏水平下平均准确性提高了多达9个百分点，同时注意力密度降低了1.54倍。它还在RTX A6000 GPU上实现了高达4.14倍的注意力加速，证明了所提方法的有效性。

cs.CL / 9 / 2604.12069

Robust Explanations for User Trust in Enterprise NLP Systems

企业自然语言处理系统中用户信任的稳健解释

Zhang, Guilin, Zhao, Kai, Friedman, Jeffrey, Chu, Xu, Anoun, Amine, Ting, Jerry

Abstract

Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.
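
The evaluation idea above (leave-one-out occlusion plus a top-token flip rate) can be sketched with nothing but a black-box scoring function. Everything named here is an assumption for illustration: `toy_score` stands in for an API call returning a class score, the keyword weights are invented, and comparing the top token by its string is one possible reading of "flip", not necessarily the paper's exact definition.

```python
def occlusion_importance(tokens, score_fn):
    """Importance of token i = score drop when token i is removed."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def top_token(tokens, score_fn):
    """Return the token with the highest occlusion importance (first on ties)."""
    importance = occlusion_importance(tokens, score_fn)
    best = max(range(len(tokens)), key=importance.__getitem__)
    return tokens[best]


def top_token_flip_rate(pairs, score_fn):
    """Fraction of (original, perturbed) pairs whose top token differs."""
    flips = sum(top_token(o, score_fn) != top_token(p, score_fn) for o, p in pairs)
    return flips / len(pairs)


# Hypothetical stand-in for a black-box classifier score.
WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -2.0}


def toy_score(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


pairs = [
    (["the", "movie", "was", "great"], ["the", "moive", "was", "great"]),  # character swap
    (["good", "but", "bad"], ["but", "bad"]),                              # deletion
]
rate = top_token_flip_rate(pairs, toy_score)
print(rate)  # 0.5
```

Each explanation costs one model call per token plus one for the baseline, which is why the inference cost matters when the model is only reachable through an API.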

Chinese Translation

在企业自然语言处理（NLP）中，用户信任越来越需要稳健的解释，然而在常见的黑箱部署（仅限API访问）情况下，预部署验证变得困难，此时基于表示的解释器不可行，现有研究对解释在真实用户噪声下是否保持稳定提供的指导有限，尤其是在组织从编码器分类器迁移到解码器大语言模型（LLM）时。为了解决这一问题，我们提出了一种基于留一法遮挡的统一黑箱稳健性评估框架，用于令牌级解释，并通过在多个严重性级别下的现实扰动（交换、删除、洗牌和反向翻译）下的顶级令牌翻转率来操作化解释的稳健性。使用该协议，我们在三个基准数据集和六个模型（涵盖编码器和解码器系列，包括BERT、RoBERTa、Qwen 7B/14B、Llama 8B/70B；共64,800个案例）之间进行系统的跨架构比较。我们发现解码器LLM产生的解释比编码器基线更为稳定（平均翻转率低73%），并且随着模型规模的增加，稳定性有所改善（从7B到70B提升44%）。最后，我们将稳健性提升与推理成本相关联，得出了一个实用的成本-稳健性权衡曲线，以支持在合规敏感的应用中部署前的模型和解释选择。

cs.CL / 10 / 2604.12076

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

叙述胜于数字：可识别受害者效应及其在大型语言模型中的对齐与推理下的放大

Raiyan, Syed Rifat

Abstract

The Identifiable Victim Effect (IVE), the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship, is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments that port and extend canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005), we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen's d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d≈0.10) reported by Lee and Feeley (2016), and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting, contrary to its role as a deliberative corrective, nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.
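
For readers unfamiliar with the effect-size measure used above, the following sketch computes Cohen's d for two independent samples using the pooled standard deviation. The sample values are made up for illustration, and the abstract does not say which variant of d or which pooling scheme the study applies.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Invented allocations: identified-victim condition vs. statistical-group condition.
identified = [6.0, 7.0, 8.0]
statistical = [4.0, 5.0, 6.0]
d = cohens_d(identified, statistical)
print(d)  # 2.0
```

A positive d here means more is allocated in the first condition; swapping the arguments flips the sign, which is how an inverted effect such as d=-0.85 would appear.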

Chinese Translation

可识别受害者效应（IVE），即将更多资源分配给特定的、叙述性描述的受害者，而非面对相同困境的统计特征群体的倾向，是道德心理学和行为经济学中最为稳健的发现之一。随着大型语言模型（LLMs）在人道主义分诊、自动化资助评估和内容审核中承担重要角色，一个关键问题随之而来：这些系统是否继承了人类道德推理中的情感非理性？我们首次对LLMs中的IVE进行了系统的大规模实证研究，涵盖了来自九个组织谱系（Google、Anthropic、OpenAI、Meta、DeepSeek、xAI、Alibaba、IBM和Moonshot）的16个前沿模型，共计N=51,955个经过验证的API试验。通过一套十个实验（移植并扩展了Small等（2007）和Kogut与Ritov（2005）的经典范式），我们发现IVE普遍存在，但受到对齐训练的强烈调节。经过指令调优的模型表现出极端的IVE（Cohen's d高达1.56），而推理专门化模型则逆转了这一效应（降至d=-0.85）。综合效应（d=0.223，p=2e-6）约为Lee和Feeley（2016）报告的单一受害者人类元分析基线（d≈0.10）的两倍，并且可能以更大的幅度超过整体人类综合效应，因为群体受害者的人类效应接近于零。标准的链式思维（CoT）提示与其作为深思熟虑的修正手段的作用相反，几乎使IVE效应大小增至三倍（从d=0.15增至d=0.41），而只有功利主义CoT可靠地消除了这一效应。我们进一步记录了心理物理麻木、完美数量忽视和边际的内群体/外群体文化偏见，这对人工智能在人道主义和伦理决策背景下的应用具有重要意义。

cs.CL / 11 / 2604.12097

Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

LLM生成文本中的时间扁平化：比较人类与LLM写作轨迹

Cao, Zhanwei, Go, YeoJin, Hu, Yifan, Sushmita, Shanu

Abstract

Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors' styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012--2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.
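
One simple way to picture a drift-based metric is the mean cosine distance between consecutive document embeddings along an author's timeline. This is an assumed formulation for illustration only: the abstract does not define its drift or variance metrics, and the two-dimensional vectors below are toy values rather than real embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_drift(trajectory):
    """Average cosine distance between consecutive embeddings in time order."""
    steps = [cosine_distance(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return sum(steps) / len(steps)


flat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # no change over time
shifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # large change at every step
print(mean_drift(flat))      # 0.0
print(mean_drift(shifting))  # 1.0
```

Under this reading, temporal flattening corresponds to trajectories whose drift stays close to the `flat` case.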

Chinese Translation

大型语言模型（LLMs）在日常应用中越来越多地被使用，从内容生成到代码编写，每次交互都将模型视为无状态的，独立生成响应而不保留记忆。然而，人类写作本质上是纵向的：作者的风格和认知状态在数月和数年中不断演变。这引发了一个核心问题：LLMs能否在较长时间内重现这种时间结构？我们构建并公开发布了一个包含412位人类作者和6086篇文档的纵向数据集，涵盖2012年至2024年，涉及三个领域（学术摘要、博客、新闻），并将其与在标准和历史条件生成设置下由三种代表性LLM生成的轨迹进行比较。通过对语义、词汇和认知情感表现的漂移和方差指标进行分析，我们发现LLM生成文本存在时间扁平化现象。LLMs产生了更大的词汇多样性，但相对于人类，语义和认知情感的漂移显著减少。这些差异具有很强的预测性：仅凭时间变异模式就能在区分人类与LLM轨迹时达到94%的准确率和98%的ROC-AUC。我们的结果表明，无论LLMs是独立生成还是访问增量历史，时间扁平化现象始终存在，这揭示了当前部署范式的一个基本特性。这一差距对需要真实时间结构的应用（如合成训练数据和纵向文本建模）具有直接影响。

cs.CL / 12 / 2604.12128

When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

自我引用失效时的矩阵级动态：大型语言模型中的研究

Bae, Ji Ho

Abstract

We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families -- Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B -- over 300 prompts in a 14-level hierarchy at three temperatures ($T \in \{0.0, 0.3, 0.7\}$), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) -- truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank -- indicating attention reorganization with global dispersion rather than simple concentration collapse -- and key metrics reach Cohen's $d = 3.14$ (attention effective rank) to $3.52$ (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction ($q < 0.05$), 198 with $|d| > 0.8$. Per-layer SVD confirms disruption at every sampled layer ($d > +1.0$ in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC $0.81$-$0.90$; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models. We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output ($+34$-$56$ percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.
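
To make "effective rank" concrete, the sketch below uses one common definition: the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this definition is an assumption, and the singular values are passed in directly (in practice they would come from an SVD of an attention matrix).

```python
import math


def effective_rank(singular_values):
    """exp(Shannon entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


dispersed = effective_rank([1.0, 1.0, 1.0, 1.0])     # energy spread evenly
concentrated = effective_rank([1.0, 0.0, 0.0, 0.0])  # energy in one direction
print(round(dispersed, 6), round(concentrated, 6))   # 4.0 1.0
```

An elevated value therefore indicates that attention energy is spread over many directions, matching the "global dispersion" reading in the abstract.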

Chinese Translation

我们研究了自我引用输入如何改变大型语言模型的内部矩阵动态。我们对来自三个架构家族的四个模型（Qwen3-VL-8B、Llama-3.2-11B、Llama-3.3-70B 和 Gemma-2-9B）在最多 7 轮分析中测量了 106 个标量指标，涵盖 14 级层次结构中的 300 个提示和三种温度（$T \in \{0.0, 0.3, 0.7\}$）。我们发现单独的自我引用并不会导致不稳定：有依据的自我引用陈述和元认知提示在关键的崩溃相关指标上明显比悖论性自我引用更为稳定，在若干此类指标上，其稳定性可与事实性对照相当。不稳定性集中在诱导非闭合真值递归（NCTR）的提示中，即不存在有限深度解的真值计算。NCTR 提示产生异常升高的注意力有效秩（表明注意力发生了全局分散式的重组，而非简单的集中崩溃），并且在 70B 模型中，相对于稳定的自我引用，关键指标的 Cohen's $d$ 从 $3.14$（注意力有效秩）到 $3.52$（方差峰度）；经过 FDR 校正后（$q < 0.05$），281/397 个指标-模型组合能够区分 NCTR 与稳定自我引用，其中 198 个满足 $|d| > 0.8$。逐层奇异值分解（SVD）确认每个采样层都存在扰动（在所分析的三个模型中均有 $d > +1.0$），排除了聚合伪影的可能性。分类器的 AUC 达到 $0.81$-$0.90$；30 组最小对比对产生 42/387 个显著组合；43/106 个指标在全部四个模型中得到复现。我们将这些观察与三个经典的矩阵半群问题联系起来，并提出一个猜想：NCTR 迫使有限深度的 Transformer 趋向这些问题所集中的动力学区域。NCTR 提示还产生了更多的矛盾输出（比对照组高出 $+34$-$56$ 个百分点），这表明其对理解自我引用失效模式具有实际意义。

cs.CL / 13 / 2604.12162

AlphaEval: Evaluating Agents in Production

AlphaEval：生产环境中智能体的评估

Lu, Pengrui, Xu, Bingyu, Zhang, Wenjun, Hua, Shengjia, Gao, Xuanjian, Ge, Ranxiang, Ye, Lyumanshan, Wu, Linxuan, Li, Yiran, Yu, Junfei Fish, Zhang, Yibo, Li, Ruixin, Li, Manxiang, Han, Xiao, Zhou, Xiaocong, Chi, Guangyao, Chen, Zisheng, Chen, Kaishen, Wang, Kun, Xu, Qihua, Meng, Fengyue, Ni, Yuchen, Li, Jiajun, Liu, Jinxiu, Zhang, Danfeng, Zhao, Jingru, Liu, Pengfei

Abstract

The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics -- conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products -- Claude Code, Codex, etc. -- as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework -- a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.

Chinese Translation

人工智能智能体在商业环境中的快速部署已经超越了反映生产现实的评估方法的发展。现有基准测试通过事后策划的任务来衡量智能体能力，这些任务具有明确的需求和确定性的指标——这些条件与生产环境根本不同，后者的需求包含隐含约束，输入是异构的多模态文档，信息分散在多个来源，任务要求未声明的领域专业知识，输出是长周期的专业成果，成功由领域专家评判且其标准随时间演变。我们提出了AlphaEval，这是一个基于生产环境的基准测试，涵盖了来自七家将AI智能体部署于核心业务的公司的94个任务，跨越六个O*NET（职业信息网络）领域。与以模型为中心的基准不同，AlphaEval评估完整的智能体产品——如Claude Code、Codex等——作为商业系统，捕捉模型层面评估无法显现的性能差异。我们的评估框架涵盖多种范式（如LLM作为评判者、基于参考的指标、形式化验证、基于评分标准的评估、自动化UI测试等），各领域组合多种范式。除了基准测试本身，我们还贡献了一个从需求到基准的构建框架——一种系统化方法，能够在最短时间内将真实的生产需求转化为可执行的评估任务。该框架标准化了从需求到评估的整个流程，提供了一个可复现、模块化的过程，任何组织均可采用该过程为其自身领域构建基于生产环境的基准测试。

cs.CL / 14 / 2604.12179

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

AgenticAI-DialogGen：基于主题引导的对话生成用于大语言模型短期与长期记忆的微调与评估

Perera, Manoj Madushanka, Mahmood, Adnan, Wijethilake, Kasun Eranda, Sheng, Quan Z.

Abstract

Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded question-answer (QA) pairs drawn from short- and long-term conversational histories. We also generate a new dataset, TopicGuidedChat (TGC), in which long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations show that AgenticAI-DialogGen yields higher conversational quality and that LLMs fine-tuned on the TGC dataset achieve improved performance on memory-grounded QA tasks.

Chinese Translation

近年来，大型语言模型（LLMs）在处理长对话上下文方面能力显著提升，但由于缺乏同时编码短期和长期对话历史的数据集，短期与长期记忆的微调与评估仍然困难。现有对话数据集缺乏记忆基础，忽视主题连续性，或依赖昂贵的人力标注。为填补这些空白，我们提出了AgenticAI-DialogGen，一种模块化的基于智能体的框架，能够在无人工监督下生成基于角色设定和主题引导的对话。该框架利用LLM智能体提取知识图谱、识别主题、构建说话者角色，并从非结构化对话中模拟主题引导的对话。问答模块则从短期和长期对话历史中生成基于记忆的问答对。我们还构建了一个新数据集TopicGuidedChat（TGC），其中长期记忆以说话者特定的知识图谱编码，短期记忆则表现为新生成的主题引导对话。评估结果表明，AgenticAI-DialogGen生成的对话质量更高，且基于TGC数据集微调的LLMs在基于记忆的问答任务中表现提升。

cs.CL / 15 / 2604.12185

Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

知识并非静态：面向顺序的超图检索增强生成模型（Order-Aware Hypergraph RAG）

Wu, Keshu, Kuai, Chenchen, Li, Zihao, Jiang, Jiwan, Shen, Shiyu, Wang, Shian, Hu, Chan-Wei, Tu, Zhengzhong, Zhou, Yang

Abstract

Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.

Chinese Translation

检索增强生成（RAG）通过将输出与检索到的知识相结合，增强了大型语言模型的能力。然而，现有的RAG方法，包括基于图和超图的方法，将检索到的证据视为无序集合，隐含地假设了排列不变性。这一假设与许多现实世界的推理任务不符，这些任务的结果不仅依赖于发生了哪些交互，还依赖于这些交互发生的顺序。我们提出了面向顺序的知识超图RAG（OKH-RAG），将顺序视为一种重要的结构属性。OKH-RAG将知识表示为超图中的高阶交互，并增强了优先结构，将检索重新表述为超边上的序列推理。它不是选择独立的事实，而是恢复反映潜在推理过程的连贯交互轨迹。学习的转移模型直接从数据中推断优先级，而无需显式的时间监督。我们在对顺序敏感的问题回答和解释任务上评估了OKH-RAG，包括热带气旋和港口操作场景。OKH-RAG始终优于排列不变的基线，消融实验表明，这些提升特别源于对交互顺序的建模。这些结果突显了基于集合的检索的一个关键局限性：有效的推理不仅需要检索相关证据，还需要将其组织成结构化的序列。

cs.CL / 16 / 2604.12195

Representing expertise accelerates learning from pedagogical interaction data

表征专业知识加速从教学互动数据中学习

Yu, Dhara, Kaushik, Karthikeya, Thompson, Bill D.

Abstract

Work in cognitive science and artificial intelligence has suggested that exposing learning agents to traces of interaction between multiple individuals can improve performance in a variety of settings, yet it remains unknown which features of interactions contribute to this improvement. We examined the factors that support the effectiveness of interaction data, using a controlled paradigm that allowed us to precisely operationalize key distinctions between interaction and an expert acting alone. We generated synthetic datasets of simple interactions between an expert and a novice in a spatial navigation task, and then trained transformer models on those datasets, evaluating performance after exposure to different datasets. Our experiments showed that models trained on pedagogical interactions were more robust across a variety of scenarios compared to models trained only on expert demonstrations, and that having the ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.

Chinese Translation

认知科学和人工智能的研究表明，暴露学习代理于多个个体之间的互动痕迹可以在多种环境中提高性能，但仍不清楚哪些互动特征有助于这种改善。我们使用一种控制范式来检查支持互动数据有效性的因素，该范式使我们能够精确地操作互动与单独专家行为之间的关键区别。我们生成了专家与新手在空间导航任务中简单互动的合成数据集，然后在这些数据集上训练了变换器模型，并在接触不同数据集后评估其性能。我们的实验表明，与仅在专家演示上训练的模型相比，在教学互动上训练的模型在多种场景中表现得更为稳健，并且能够表征认知上不同的代理的能力，即使在专家行为很少被观察到的情况下，也能导致类似专家的行为。

cs.CL / 17 / 2604.12196

Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

超越多数投票：基于径向共识得分的高效Best-Of-N选择方法

Nguyen, Manh, Gupta, Sunil, Le, Hung

Abstract

Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
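
The selection rule above can be sketched in a few lines under a simplifying assumption: with Euclidean distance, the weighted Fréchet mean reduces to the weighted arithmetic mean of the embeddings. The function names, the toy two-dimensional embeddings, and the choice of Euclidean distance are all assumptions for illustration; the paper may use a different metric or embedding model.

```python
import math


def weighted_center(embeddings, weights):
    """Weighted arithmetic mean (the Fréchet mean under Euclidean distance)."""
    total = sum(weights)
    dim = len(embeddings[0])
    return [sum(w * e[k] for w, e in zip(weights, embeddings)) / total
            for k in range(dim)]


def radial_consensus_rank(embeddings, weights=None):
    """Return candidate indices ordered from closest to farthest from the center."""
    if weights is None:
        weights = [1.0] * len(embeddings)  # uniform weighting variant
    center = weighted_center(embeddings, weights)
    distances = [math.dist(e, center) for e in embeddings]
    return sorted(range(len(embeddings)), key=distances.__getitem__)


# Toy embeddings: three near-identical answers and one outlier (index 3).
candidates = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
ranking = radial_consensus_rank(candidates)
print(ranking[-1])  # 3
```

Passing answer frequencies or sequence probabilities as `weights` gives the frequency-based and probability-based variants mentioned in the abstract, at least in spirit.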

Chinese Translation

大型语言模型（LLMs）常常为给定提示生成多个候选回答，但如何选出最可靠的答案仍然具有挑战性，尤其是在正确性与表面多数意见不一致的情况下。现有方法如自洽性（self-consistency）依赖离散投票，而基于概率的方法往往无法捕捉候选答案之间的关系，或者倾向于低估高质量但出现频率较低的回答，且未能充分利用答案表示的几何结构。为解决这些局限，我们提出了径向共识得分（Radial Consensus Score，RCS），这是一种简单、高效且无需训练的Best-Of-N选择方法。RCS通过计算答案嵌入的加权Fréchet均值（语义中心）来建模语义共识，并根据候选答案到该中心的径向距离进行排序。重要的是，RCS提供了一个通用框架，支持多种加权方案，包括均匀加权、基于频率和基于概率的变体，实现了对一致性信号和模型置信度的灵活整合，同时完全适用于黑盒环境。我们在涵盖短文本问答和长文本推理任务的七个基准测试以及五个开源权重模型上进行了大量实验，结果表明RCS各变体持续优于强基线，且随着采样预算的增加，性能提升更加显著。RCS还可作为多智能体辩论中多数投票的有效替代方案，并在黑盒场景中表现出强大的鲁棒性。总体而言，这些结果凸显了几何共识作为一种可扩展且广泛适用的可靠答案选择原则，超越了多数投票，实现了LLM推理中更具表现力和鲁棒性的聚合方法。

cs.CL / 18 / 2604.12223

LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines

基于LLM引导的语义自启动框架用于可解释的文本分类与Tsetlin机器

Gao, Jiechao, Yadav, Rohan Kumar, Li, Yuangang, Pan, Yuandong, Wang, Jie, Liu, Ying, Lepech, Michael

Abstract

Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.

Chinese Translation

预训练语言模型（PLMs）如BERT提供了强大的语义表示，但其成本高且不透明，而符号模型如Tsetlin机器（TM）则提供了透明性但缺乏语义泛化。我们提出了一种语义自启动框架，将LLM知识转移到符号形式，结合了可解释性和语义能力。在给定类别标签的情况下，LLM生成子意图，通过三阶段课程（种子、核心、增强）指导合成数据的创建，扩展语义多样性。非否定Tsetlin机器（NTM）从这些示例中学习，以提取高置信度的字面量作为可解释的语义线索。将这些线索注入真实数据使得TM能够将子句逻辑与LLM推断的语义对齐。我们的方法不需要嵌入或运行时LLM调用，但为符号模型提供了预训练的语义先验。在多个文本分类任务中，它在可解释性和准确性上优于普通TM，达到与BERT相当的性能，同时保持完全符号化和高效。


cs.CL / 19 / 2604.12231

Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems

Thought-Retriever：不仅检索原始数据，更检索思维以增强记忆的智能代理系统

Feng, Tao, Han, Pengrui, Lin, Guanyu, Liu, Ge, You, Jiaxuan

Abstract

Large language models (LLMs) have transformed AI research thanks to their powerful internal capabilities and knowledge. However, existing LLMs still fail to effectively incorporate massive external knowledge when interacting with the world. Although retrieval-augmented LLMs have been proposed to mitigate this issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve the top-K raw data chunks from an external knowledge base that often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage the intermediate responses it generated when solving past user queries (thoughts): filtering out meaningless and redundant thoughts, organizing them in a thought memory, and retrieving the relevant thoughts when addressing new queries. This effectively equips LLM-based agents with a self-evolving long-term memory that grows more capable through continuous interaction. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers. Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help an LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.
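
The filter, store, and retrieve loop of a thought memory can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: word-overlap (Jaccard) similarity stands in for whatever retriever and filtering criteria the authors use, and the class name, threshold, and example thoughts are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two strings (assumed stand-in for a real retriever)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

class ThoughtMemory:
    def __init__(self, redundancy=0.8):
        self.thoughts = []
        self.redundancy = redundancy  # hypothetical near-duplicate threshold

    def add(self, thought):
        """Keep a thought unless it is empty or a near-duplicate of a stored one."""
        if not thought.strip():
            return False
        if any(jaccard(thought, t) >= self.redundancy for t in self.thoughts):
            return False
        self.thoughts.append(thought)
        return True

    def retrieve(self, query, k=2):
        """Return the k stored thoughts most similar to the query."""
        ranked = sorted(self.thoughts, key=lambda t: jaccard(query, t), reverse=True)
        return ranked[:k]

memory = ThoughtMemory()
memory.add("Paper A proposes sparse attention for long documents")
memory.add("Paper A proposes sparse attention for long documents")  # rejected as redundant
memory.add("Paper B evaluates retrieval on legal text")
top = memory.retrieve("which paper proposes sparse attention", k=1)
```

In this sketch the retrieved thoughts, rather than raw data chunks, would be placed in the prompt for the new query.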

Chinese Translation

大型语言模型（LLMs）凭借其强大的内部能力和知识，已经彻底改变了人工智能研究。然而，现有的LLMs在与外部世界交互时，仍未能有效整合海量的外部知识。尽管提出了基于检索增强的LLMs以缓解该问题，但它们仍然受到LLMs上下文长度的根本限制，因为它们只能从通常包含数百万数据块的外部知识库中检索前K个原始数据块。本文提出了Thought-Retriever，一种新颖的模型无关算法，帮助LLMs基于任意长度的外部数据生成输出，而不受上下文长度或检索数据块数量的限制。我们的核心观点是让LLM充分利用其在解决过去用户查询时生成的中间响应（即“思维”），过滤无意义和冗余的思维，将其组织到思维记忆中，并在处理新查询时检索相关思维。这有效地赋予基于LLM的代理一个自我进化的长期记忆，随着持续交互能力不断增强。除了算法创新，我们还精心准备了一个新基准AcademicEval，要求LLM忠实利用超长上下文基于真实学术论文回答查询。在AcademicEval及另外两个公开数据集上的大量实验验证了Thought-Retriever显著优于最先进的基线方法，在各种任务中F1分数平均提升至少7.6%，胜率提升16%。更重要的是，我们进一步展示了两个令人振奋的发现：（1）Thought-Retriever确实能帮助LLM在解决更多用户查询后实现自我进化；（2）Thought-Retriever学会利用更深层次的思维来回答更抽象的用户查询。


cs.CL / 20 / 2604.12243

Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature

持续知识代谢：从演化文献中生成科学假设

Tao, Jinkai, Wang, Yubo, Liu, Xiaoyu, Yang, Menglin

Abstract

Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms the batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen's d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field's trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types. These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.
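
The windowed, incremental update with novel/confirming/contradicting labels can be sketched as below. This is only an illustration under assumptions that the abstract does not confirm: findings are reduced to hypothetical `(year, claim, supports)` tuples, windows are consecutive and non-overlapping for simplicity, and the latest evidence overwrites the stored state.

```python
def classify(kb, claim, supports):
    """Label a finding relative to the current knowledge base."""
    if claim not in kb:
        return "novel"
    return "confirming" if kb[claim] == supports else "contradicting"

def metabolize(findings, window=2):
    """Process (year, claim, supports) findings window by window, updating the KB as we go."""
    if not findings:
        return {}, []
    kb, log = {}, []
    years = sorted({year for year, _, _ in findings})
    for start in range(years[0], years[-1] + 1, window):
        batch = [f for f in findings if start <= f[0] < start + window]
        for year, claim, supports in sorted(batch):
            log.append((year, claim, classify(kb, claim, supports)))
            kb[claim] = supports  # assumption: the latest evidence wins
    return kb, log

# Toy findings (assumed data).
findings = [
    (2020, "X improves Y", True),
    (2021, "X improves Y", True),
    (2022, "Z scales linearly", True),
    (2023, "X improves Y", False),
]
kb, log = metabolize(findings)
labels = [label for _, _, label in log]
```

A change-aware variant in the spirit of CKM-Full could condition hypothesis generation on `log` (the trajectory), whereas a lighter variant would only read the final `kb`.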

Chinese Translation

科学假设的生成不仅需要跟踪当前已知的知识，更需关注知识的演变过程。我们提出了持续知识代谢（Continuous Knowledge Metabolism，CKM）框架，该框架通过滑动时间窗口处理科学文献，并在新发现出现时增量更新结构化知识库。我们介绍了CKM-Lite，一种高效变体，通过增量累积实现了强大的预测覆盖能力，在命中率（+2.8%，p=0.006）、假设产出量（+3.6，p<0.001）和最佳匹配度（+0.43，p<0.001）方面均优于批处理方法，同时将令牌成本降低了92%。为探究这些差异的驱动因素，我们开发了CKM-Full，一种带有监测功能的变体，能够将每个新发现分类为新颖、确认或矛盾，检测知识变化信号，并基于完整的演化轨迹进行假设生成。通过分析CKM-Full在50个研究主题中生成的892个假设及其他变体的并行运行结果，我们报告了四个实证观察：(1) 增量处理在预测和效率指标上均优于批处理基线；(2) 变化感知的监测功能与更高的LLM评判新颖性相关（Cohen's d=3.46），但预测覆盖率较低，揭示了质量与覆盖率的权衡；(3) 领域轨迹的稳定性与假设成功率相关（r=-0.28，p=0.051），提示了基于文献预测的边界条件；(4) 知识收敛信号的命中率几乎是矛盾信号的5倍，表明不同变化类型的可预测性存在差异。这些发现表明，生成假设的特性不仅受处理文献数量的影响，更受处理方式的影响。同时，评估框架必须考虑质量与覆盖率的权衡，而非单一指标的优化。


cs.CL / 21 / 2604.12247

SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

SpecBound：具有层级置信度校准的自适应有界自推测

Wen, Zhuofan, Feng, Yang

Abstract

Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decisions and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.
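
One way to picture temperature-calibrated early exit is the sketch below. It is an assumption-laden illustration, not SpecBound's method: the linear schedule, the choice of a higher temperature at shallow layers, the threshold, and all function names are hypothetical.

```python
import math

def softmax(logits, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def layer_temperature(layer, num_layers, t_start=2.0, t_end=1.0):
    """Assumed schedule: anneal linearly from t_start (shallow) to t_end (deep)."""
    frac = layer / (num_layers - 1)
    return t_start + frac * (t_end - t_start)

def should_exit(logits, layer, num_layers, threshold=0.9):
    """Exit early only if the temperature-scaled top probability clears the threshold."""
    probs = softmax(logits, layer_temperature(layer, num_layers))
    return max(probs) >= threshold

# The same logits look less confident at a shallow layer than at the last layer.
logits = [4.0, 0.0, 0.0]
exit_shallow = should_exit(logits, layer=0, num_layers=12)
exit_deep = should_exit(logits, layer=11, num_layers=12)
```

The higher temperature flattens shallow-layer distributions, so an overconfident shallow prediction is less likely to trigger an exit.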

Chinese Translation

推测解码已成为加速大型语言模型（LLMs）自回归推理的一种有前景的方法。自草稿方法利用基础LLM本身进行推测，避免了辅助草稿模型的开销，但面临一些限制：浅层通常会产生过于自信但错误的标记预测，而草稿序列中存在困难标记则迫使在更深层进行冗余计算，从而削弱了草稿接受率和整体加速效果。为了解决这些问题，我们提出了一种新颖的自草稿框架，通过在早期退出决策中采用层级温度退火来抑制虚假的置信度，并根据标记解码难度自适应地限制推测长度。通过在深层中对草稿标记的隐藏状态进行统一并行处理，我们的方法在保持与原始模型的输出完全等价的同时，最大化计算效率。该方法无需对基础LLM参数进行修改，并在多种长文本生成任务和多种模型架构中实现了高达2.33倍的实际运行时间（wall-time）加速。


cs.CL / 22 / 2604.12258

Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System

无编码且保护隐私的临床代理研究智能系统框架

Kim, Taehun, Park, Hyeryun, Lee, Hyeonhoon, Lee, Yushin, Kim, Kyungsang, Lee, Hyung-Chul

Abstract

Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.

Chinese Translation

临床研究涉及诸如研究设计、队列构建、模型开发和文档编制等劳动密集型过程，这些过程需要领域专业知识、编程技能和对敏感患者数据的访问。这些需求为进行数据驱动研究的临床医生和外部研究人员设置了障碍。为克服这些限制，我们开发了临床代理研究智能系统（Clinical Agentic Research Intelligence System，CARIS），该系统自动化了临床研究工作流程，同时保护数据隐私，使得无需直接访问原始数据即可进行全面研究。CARIS通过模型上下文协议（Model Context Protocol，MCP）将大型语言模型（Large Language Models，LLMs）与模块化工具集成，实现了基于自然语言的适当工具编排。数据库安全地保留在MCP服务器内，用户仅访问输出和最终研究报告。根据用户意图，CARIS自动执行完整的流程：研究规划、文献检索、队列构建、机构审查委员会（Institutional Review Board，IRB）文档、Vibe机器学习（Machine Learning，ML）和报告生成，并进行迭代的人机协作优化。我们在三个具有不同临床任务的异构数据集上评估了CARIS。研究计划和IRB文档在三到四次迭代内完成，使用了文献和数据中的证据。该系统通过探索特征-模型组合、排名前十的模型和生成性能可视化来支持Vibe ML。最终报告基于TRIPOD+AI框架的检查表显示出高完整性，在LLM评估中实现了96%的覆盖率，在人工评估中实现了82%的覆盖率。CARIS展示了代理人工智能如何将临床假设转化为可执行的研究工作流程，跨越异构数据集。通过消除编码和直接数据访问的需求，该系统降低了障碍，架起了公共和私有临床数据环境之间的桥梁。


cs.CL / 23 / 2604.12262

CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

CascadeDebate：面向成本感知的大型语言模型级联的多智能体协商机制

Chang, Raeyoung, Kwon, Dongwook, Lee, Jisoo, Verma, Nikhil

Abstract

Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier's escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, improving accuracy by 20.98 to 52.33 percent relative to fixed policies and enabling elastic adaptation to real-world distributions.
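
The control flow of such a cascade can be sketched as follows. This is a simplified illustration and not the authors' system: a majority vote stands in for multi-agent deliberation, the thresholds are fixed rather than tuned by an online optimizer, and every name and stub model here is hypothetical.

```python
from collections import Counter

def cascade(query, tiers, human, accept=0.8, agree=0.6):
    """tiers: list of (single_model, agents); each callable returns (answer, confidence)."""
    for single_model, agents in tiers:
        answer, conf = single_model(query)
        if conf >= accept:
            return answer, "single"
        # Uncertain case: deliberate inside the tier before escalating.
        if agents:
            votes = Counter(agent(query)[0] for agent in agents)
            top, count = votes.most_common(1)[0]
            if count / len(agents) >= agree:
                return top, "debate"
    # No tier resolved the query: fall back to the human expert.
    return human(query), "human"

# Hypothetical stubs standing in for real model calls.
small = lambda q: ("A", 0.55)
small_agents = [lambda q: ("B", 0.6), lambda q: ("B", 0.7), lambda q: ("A", 0.5)]
result = cascade("toy question", [(small, small_agents)], human=lambda q: "H")
```

Here the small model is under-confident, but two of its three agents agree, so the query is resolved inside the tier instead of being escalated.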

Chinese Translation

级联大型语言模型（LLM）系统通过协调不同规模的模型与人类专家，在不确定性下平衡准确性、成本和回避策略。然而，每个阶段单一模型层级常因对模糊查询缺乏信心，导致过早升级至更高成本的模型或专家，造成计算资源的低效扩展。CascadeDebate通过在每个层级的升级边界直接引入多智能体协商机制，填补了这一空白。基于置信度的路由器仅对不确定案例激活轻量级智能体集群，实现内部基于共识的模糊问题解决，无需调用更高成本的升级。我们统一的架构在不同模型规模间交替进行单模型推理与选择性多智能体协商，最终以人类专家作为最后保障。该设计根据查询难度动态调整测试时计算资源。在涵盖科学、医学及通识知识的五个基准测试中，CascadeDebate相较于强大的单模型级联和独立多智能体系统，性能提升高达26.75%。一个在线阈值优化器被证明至关重要，相较于固定策略，准确率提升20.98%至52.33%，并实现了对现实分布的弹性适应。


cs.CL / 24 / 2604.12282

Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

面向稳健的真实世界电子表格理解的多智能体多格式推理方法

Ren, Houxing, Zhan, Mingjie, Lu, Zimu, Wang, Ke, Yang, Yunqiao, Hou, Haotian, Li, Hongsheng

Abstract

Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing approaches based on large language models (LLMs) typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.

Chinese Translation

电子表格在企业报告、审计和科学数据管理等真实世界应用中具有核心地位。尽管电子表格无处不在，现有基于大型语言模型（LLM）的方法通常将表格视为纯文本，忽视了关键的布局线索和视觉语义。此外，真实世界的电子表格规模庞大，往往超出LLM能够高效处理的输入长度。为应对这些挑战，我们提出了SpreadsheetAgent，一种采用逐步阅读与推理范式的两阶段多智能体电子表格理解框架。SpreadsheetAgent并非一次性加载整个电子表格，而是通过多模态信息（包括代码执行结果、图像和LaTeX表格）逐步解释局部区域。该方法首先构建结构草图及行/列摘要，然后在求解阶段（Solving Stage）基于该中间表示执行任务驱动的推理。为进一步提升可靠性，我们设计了验证模块，通过针对性检查验证提取的结构，减少错误传播，确保下游推理输入的可信度。在两个电子表格数据集上的大量实验表明了该方法的有效性。基于GPT-OSS-120B，SpreadsheetAgent在Spreadsheet Bench上取得了38.16%的成绩，较ChatGPT Agent基线（35.27%）提升了2.89个百分点。结果凸显了SpreadsheetAgent在推动真实世界应用中稳健且可扩展的电子表格理解方面的潜力。代码已开源，地址：https://github.com/renhouxing/SpreadsheetAgent.git。


cs.CL / 25 / 2604.12308

ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance

ContextLens：建模不完善的隐私与安全语境以实现法律合规

Li, Haoran, Chen, Yulin, Jing, Huihao, Hu, Wenbin, Li, Tsz Ho, Lou, Chanhou, Tsang, Hong Ting, Han, Sirui, Song, Yangqiu

Abstract

Individuals' concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, ContextLens instructs LLMs to answer a set of crafted questions that span applicability, general principles, and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that ContextLens can significantly improve LLMs' compliance assessment and surpass existing baselines without any training. Additionally, ContextLens can identify the ambiguous and missing factors in the context.

Chinese Translation

个人对数据隐私和人工智能安全的关注高度依赖具体语境，且超越了敏感模式的范畴。解决这些问题需要对语境进行推理，以识别并缓解潜在风险。尽管研究者广泛探索了使用大型语言模型（LLMs）作为语境化安全与隐私评估的评估者，但这些工作通常假设语境是完整且清晰的，而现实世界中的语境往往模糊且不完整。本文提出了ContextLens，一种半规则基础框架，利用LLMs将输入语境锚定于法律领域，并明确识别法律合规中的已知和未知因素。ContextLens并非直接评估安全结果，而是指导LLMs回答一系列精心设计的问题，这些问题涵盖适用性、一般原则及详细条款，以评估预定义优先级和规则下的合规性。我们在涵盖《通用数据保护条例》（GDPR）和欧盟人工智能法案（EU AI Act）的现有合规基准上进行了大量实验。结果表明，ContextLens能够显著提升LLMs的合规评估能力，且在无需任何训练的情况下超越现有基线方法。此外，ContextLens还能进一步识别语境中的模糊和缺失因素。


cs.CL / 26 / 2604.12312

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

CompliBench：用于对话系统合规违规检测的大型语言模型评审基准

Yang, Jingbo, Yao, Guanyu, Hou, Bairu, Yang, Xinghan, Glushnev, Nikolai, Bialynicka-Birula, Iwona, Ding, Duo, Chang, Shiyu

Abstract

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.

Chinese Translation

随着大型语言模型（LLMs）在企业环境中作为面向任务的代理被广泛部署，确保其严格遵守复杂的领域特定操作指南变得至关重要。尽管将LLM作为评审者（LLM-as-a-Judge）用于可扩展评估是一种有前景的解决方案，但这些评审者在检测具体政策违规方面的可靠性仍未得到充分探索。该领域的研究空白主要源于缺乏系统化的数据生成方法，这一问题受限于细粒度人工标注的高昂成本以及合成真实代理违规行为的难度。本文提出了CompliBench，一种新颖的基准，旨在评估LLM评审者在多轮对话中检测和定位指南违规的能力。为解决数据稀缺问题，我们开发了一个可扩展的自动化数据生成流程，模拟用户与代理的交互。我们的可控缺陷注入过程自动生成违规指南及具体对话轮次的精确真实标签，同时采用对抗性搜索方法确保引入的扰动极具挑战性。全面评估结果表明，当前最先进的专有LLM在该任务上表现显著不足。此外，我们展示了基于合成数据微调的小规模评审模型不仅优于领先的LLM，还能良好泛化至未见过的业务领域，凸显了我们数据生成流程作为训练鲁棒生成奖励模型的有效基础。


cs.CL / 27 / 2604.12321

ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

ToxiTrace：面向可解释性的中文有害内容检测的梯度对齐训练方法

Li, Boyang, Shou, Hongzhe, Liang, Yuanyuan, Zhang, Jingbin, Zhou, Fang

Abstract

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.

Chinese Translation

现有的中文有害内容检测方法主要针对句子级别的分类，但往往无法提供可读且连续的有害证据片段。我们提出了ToxiTrace，一种面向可解释性的BERT风格编码器方法，包含三个组成部分：（1）CuSA，通过轻量级大语言模型（LLM）指导，将编码器导出的显著性线索精炼为细粒度的有害片段；（2）GCLoss，一种梯度约束目标函数，集中令牌级显著性于有害证据，同时抑制无关激活；（3）ARCL，构建样本特定的对比推理对，以强化有害与非有害内容之间的语义边界。实验表明，ToxiTrace在提升分类准确率和有害片段提取效果的同时，保持了基于编码器的高效推理，并生成更连贯、易于理解的人类可读解释。我们已在 https://huggingface.co/ArdLi/ToxiTrace 发布该模型。


cs.CL / 28 / 2604.12373

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

共识掩盖下的特权知识解缠：大型语言模型正确性中的隐秘信息

Ashuach, Tomer, Ein-Dor, Liat, Gretz, Shai, Katz, Yoav, Belinkov, Yonatan

Abstract

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

Chinese Translation

人类通过内省来评估自身理解程度，这种内省基于外部观察者无法访问的私有内部状态。我们探讨大型语言模型（LLM）是否具备类似的关于答案正确性的特权知识，这些信息无法通过外部观察获得。我们在模型自身的隐藏状态和外部模型的问答表示上训练正确性分类器，以测试自我表示是否带来性能优势。在标准评测中，我们未发现优势：自我探测与同类模型探测表现相当。我们假设这归因于模型间对答案正确性的高度一致性。为剥离真正的特权知识，我们在模型预测存在分歧的子集上进行评估。在此情境下，我们发现领域特定的特权知识：自我表示在事实知识任务中持续优于同类表示，但在数学推理任务中未表现出优势。我们进一步在模型层级中定位这种领域不对称性，发现事实知识优势从早期到中期层逐步显现，符合模型特定的记忆检索机制，而数学推理在任何层级均未表现出一致优势。


cs.CL / 29 / 2604.12376

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

具有关键词书签的合作记忆分页用于长时间跨度的LLM对话

Liu, Ziyang

Abstract

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges (p=0.017, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% of the time when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
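
The bookmark-and-recall mechanism can be sketched as below. This is a minimal illustration, not the paper's implementation: only the `[pN:keywords]` bookmark shape and the `recall()` tool name come from the abstract, while the class name, the comma separator, and the way keywords are supplied are assumptions.

```python
import re

class PagedContext:
    """Toy context manager: evicted segments are replaced by keyword bookmarks."""

    def __init__(self):
        self.pages = {}    # page id -> full evicted text
        self.next_id = 1

    def evict(self, segment, keywords):
        """Store the segment and return the bookmark that replaces it in context."""
        page_id = self.next_id
        self.next_id += 1
        self.pages[page_id] = segment
        return f"[p{page_id}:{','.join(keywords)}]"

    def recall(self, page_id):
        """Tool the model would call to get the full text of a page back."""
        return self.pages.get(page_id, "")

ctx = PagedContext()
bookmark = ctx.evict(
    "User said their flight to Oslo leaves on 3 May.",
    ["flight", "Oslo", "May"],
)
# Parse the page id out of the bookmark, as a model-issued recall() call would need to.
page_id = int(re.match(r"\[p(\d+):", bookmark).group(1))
restored = ctx.recall(page_id)
```

The abstract's fourth finding corresponds to the parsing step here: recall only helps if the keywords are distinctive enough for the model to pick the right page id.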

Chinese Translation

当LLM对话超出上下文窗口时，旧内容必须被驱逐——但模型在需要时如何恢复这些内容？我们提出了合作分页：被驱逐的片段被最小的关键词书签（[pN:keywords]，每个约8-24个标记）替换，并且模型被赋予了一个 recall() 工具以按需检索完整内容。在LoCoMo基准测试（10个真实的多会话对话，300多个回合）中，合作分页在六种方法中实现了最高的回答质量——在四个模型（GPT-4o-mini、DeepSeek-v3.2、Claude Haiku、GLM-5）上超越了截断、BM25、词重叠检索、搜索工具基线和完整上下文，得到了四位独立LLM评审的确认（p=0.017，配对自助法）。随后，我们通过对边界策略和驱逐策略进行5x4的消融研究，探讨了分页设计空间（3,176个合成探针，1,600个LoCoMo探针）。主要发现：（1）粗略固定大小的页面（fixed_20）达到96.7%，而内容感知的topic_shift则降至56.7%；（2）驱逐策略的选择依赖于数据（在合成数据上FIFO最佳，在LoCoMo上LFU最佳）；（3）两种书签生成策略相较于启发式基线有所改善（分别提高了4.4和8.7个E2E点）；（4）剩余的瓶颈是书签的区分度——模型在96%的情况下触发 recall()，但在书签不够独特时仅选择正确页面的概率为57%。关键词的特异性单独导致了25个百分点的准确率差异。


cs.CL / 30 / 2604.12377

SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

SCRIPT：一种面向韩语预训练语言模型的子字符组合表示注入模块

Kim, SungHo, Park, Juhyeong, Atalay, Eda, Lee, SangKeun

Abstract

Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enhances subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT improves all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.

Chinese Translation

韩语是一种形态丰富的语言，采用具有特征性的书写系统，其中每个字符系统地由称为Jamo的子字符单元组成。这些子字符不仅决定了韩语的视觉结构，还编码了频繁且具有语言学意义的形态音韵过程。然而，目前大多数韩语语言模型（LM）基于子词分词方案，未能明确设计以捕捉字符的内部组合结构。为了解决这一限制，我们提出了SCRIPT，一种模型无关的模块，能够将子字符组合知识注入韩语预训练语言模型（PLMs）中。SCRIPT允许在不改变模型架构或额外预训练的情况下，增强子词嵌入的结构粒度。因此，SCRIPT在多种韩语自然语言理解（NLU）和生成（NLG）任务中提升了所有基线模型的性能。此外，除了性能提升外，详细的语言学分析表明SCRIPT重塑了嵌入空间，更好地捕捉了语法规律性和语义内聚的变异。我们的代码可在https://github.com/SungHo3268/SCRIPT获取。


cs.CL / 31 / 2604.12378

ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

ReasonXL：在不牺牲性能的前提下转换大型语言模型的推理语言

Gurgurov, Daniil, Röhr, Tom, von Rohrscheidt, Sebastian, van Genabith, Josef, Löser, Alexander, Ostermann, Simon

Abstract

Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model than SFT with much smaller weight updates, suggesting more efficient representational rerouting.

Chinese Translation

尽管多语言能力取得了进展，大多数大型语言模型（LLMs）在训练过程中仍以英语为中心，且在生成推理轨迹时尤为如此。即使面对非英语问题，这些模型主要仍以英语进行推理，导致非英语使用场景中存在根本性的不匹配。我们针对这一差异提出了三项贡献：(i) 我们引入了ReasonXL，这是首个涵盖五种欧洲语言（英语、德语、法语、意大利语和西班牙语）的跨领域推理轨迹大规模平行语料库，每种语言拥有超过两百万条对齐样本，样本包含提示、推理轨迹和最终输出，支持语言特定推理的直接监督。(ii) 利用ReasonXL，我们展示了通过一个简单的两阶段流程——监督微调（SFT）后接可验证奖励的强化学习（RLVR），可以使LLMs完全以目标语言进行推理。所得模型在性能上达到或超过基线，且在通用知识上损失极小，跨语言迁移能力大体保持。(iii) 我们对适应过程进行了深入的表征分析，发现模型深度层次存在明显的功能分工：早期层包含一个激活瓶颈，因果决定语言身份；而上层则集中体现适应驱动的权重和激活变化。进一步发现，RLVR相比SFT以更小的参数更新实现了更大的行为差异，表明其在权重更新更少的情况下实现了更高效的表征重路由。


cs.CL / 32 / 2604.12385

From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

从短视选择到长远意识：多轮对话的序列化 LLM 路由

Zhang, Jiarui, Liu, Xiangyu, Hu, Yong, Niu, Chaoyue, Zeng, Hang, Tang, Shaojie, Wu, Fan, Chen, Guihai

Abstract

Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.

Chinese Translation

多轮对话是与大型语言模型（LLMs）互动的主要形式。虽然 LLM 路由在单轮设置中有效，但现有方法由于互动动态和延迟奖励，未能在多轮对话中最大化累积性能。为了解决这一挑战，我们从短视的单轮选择转向多轮对话的长远序列路由。因此，我们提出了 DialRouter，该方法首先通过蒙特卡洛树搜索（MCTS）探索不同 LLM 选择引发的对话分支，并收集具有高累积奖励的轨迹。然后，DialRouter 从搜索衍生的数据中学习轻量级路由策略，并结合基于检索的未来状态近似，实现无需在线搜索的多轮路由。在开放域和特定领域的对话任务中进行的实验，涵盖了多种开放源代码和闭源 LLM 的候选集，结果表明 DialRouter 在任务成功率上显著优于单一 LLM 和现有路由基线，同时在与成本感知奖励结合时实现了更优的性能与成本权衡。


cs.CL / 33 / 2604.12397

KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

KoCo：基于知识坐标的语言模型预训练条件化

Li, Yudong, Cai, Jiawei, Shen, Linlin

Abstract

Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness so that it learns each document within the real-world knowledge structure. Experimental results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.
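
Prefix conditioning of this kind amounts to a small preprocessing step, sketched below. The abstract does not name the three coordinate axes or the prefix format, so the axis names, values, bracket syntax, and function name here are all hypothetical placeholders.

```python
def add_coordinate_prefix(text, coord):
    """Prepend a document's coordinate as a plain-text prefix before tokenization."""
    prefix = ", ".join(f"{axis}={value}" for axis, value in coord.items())
    return f"[{prefix}]\n{text}"

# Axis names and values are hypothetical; the paper's actual coordinates are not given here.
doc = add_coordinate_prefix(
    "The treaty was signed after two years of talks.",
    {"axis_1": "1990s", "axis_2": "Europe", "axis_3": "news"},
)
```

Because the coordinate is ordinary text, this kind of conditioning needs no change to the model architecture or tokenizer.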

Chinese Translation

标准的大型语言模型（LLM）预训练通常将语料库视为扁平化的标记序列，往往忽视了人类在上下文化信息时自然依赖的现实世界背景。为了解决这一问题，我们提出了知识坐标条件化（KoCo），这是一种简单的方法，将每个文档映射到三维语义坐标中。通过将这些坐标作为文本前缀添加到预训练中，我们旨在使模型具备明确的上下文意识，以便在现实世界知识结构中学习文档。实验结果表明，KoCo在10个下游任务中显著提升了性能，并使预训练收敛速度加快约30%。此外，我们的分析表明，明确建模知识坐标有助于模型区分稳定事实与噪声，有效减轻生成输出中的幻觉现象。

cs.CL / 34 / 2604.12421

Agentic Insight Generation in VSM Simulations

价值流映射（VSM）仿真中的主体洞察生成

Selak, Micha, Krechel, Dirk, Ulges, Adrian, Spieckermann, Sven, Stoehr, Niklas, Loehr, Andreas

Abstract

Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework's viability: top-tier models achieve accuracies of up to 86% and show high robustness across evaluation runs.

Chinese Translation

从复杂的价值流映射（VSM）仿真中提取可操作的洞察往往具有挑战性，且耗时且易出错。近期大型语言模型（large language models）的进展为支持用户完成此任务提供了新的途径。尽管现有方法在处理原始数据以获取信息方面表现出色，但它们在结构上不适合捕捉领域内区分相似数据源所需的细微情境差异。为解决该问题，我们提出了一种解耦的两步主体架构。通过将编排与数据分析分离，系统利用融合领域专家知识的渐进式数据发现。该架构使得编排能够智能选择数据源，并在数据结构间执行多跳推理，同时保持简洁的内部上下文。多款最先进大型语言模型的实验结果证明了该框架的可行性：顶级模型准确率高达86%，且在多次评估中表现出高度的鲁棒性。

cs.CL / 35 / 2604.12424

Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

通过扰动解码：通过动态文本扰动减轻多模态大语言模型的幻觉

Jia, Sihang, Liu, Shuliang, Yang, Songbo, Yan, Yibo, Zou, Xin, Hu, Xuming

Abstract

Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

Chinese Translation

多模态大语言模型在推理过程中常常遭遇幻觉现象，这在一定程度上源于语言先验对视觉证据的主导作用。现有的无训练减轻方法要么扰动视觉表征并偏离自然图像分布，要么施加侵入性的操作，从而损害模型固有的生成流畅性。我们提出了一种新视角，认为多模态幻觉表现为视觉基础对文本措辞的超敏感性，尤其是在解码阶段。基于这一洞察，我们提出了通过扰动解码（Decoding by Perturbation, DeP）这一无训练框架，通过控制文本干预来减轻由先验引起的幻觉。DeP采用动态探测器，施加多层次的文本扰动，以引发潜在的语言先验。通过利用注意力方差，它增强了稳定的证据区域，同时抑制了特征空间中的可疑噪声。此外，它利用logits统计构建可解释的先验漂移方向，以抵消文本共现带来的概率偏差。大量实验验证了DeP有效减少幻觉，并在多个基准测试中实现了卓越的性能。

cs.CL / 36 / 2604.12442

GLeMM: A large-scale multilingual dataset for morphological research

GLeMM：用于形态学研究的大规模多语言数据集

Nabil, Hathout, Calderone, Basilio, Namer, Fiammetta, Sajous, Franck

Abstract

In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? Answers to questions of this type are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently seven European languages: German, English, Spanish, French, Italian, Polish, and Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, and (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word-formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.

Chinese Translation

在派生形态学中，哪些机制支配词与词之间形式-意义关系的变异？这类问题的答案通常基于直觉以及从有限数据中得出的观察结果，即使考虑了多种语言。许多此类研究难以复制和推广。为了解决这一问题，我们提出了GLeMM，这是一个专为形态学实验和数据驱动描述设计的新型派生资源。GLeMM的特点包括：（i）规模庞大，（ii）覆盖范围广泛（目前涵盖七种欧洲语言，即德语、英语、西班牙语、法语、意大利语、波兰语、俄语），（iii）完全自动化设计，且在所有语言中保持一致，（iv）对每个词条自动注释形态特征，以及（v）对其中重要子集的词条进行语义描述编码。该资源使研究人员能够探讨诸如形式与意义在构词中的作用等复杂问题，并开发及实验验证识别派生形态结构的计算方法。本文介绍了GLeMM如何利用维基词典（Wiktionary）条目构建，并展示了若干案例研究以说明该资源的潜在应用。

cs.CL / 37 / 2604.12452

Latent-Condensed Transformer for Efficient Long Context Modeling

用于高效长上下文建模的潜在压缩变换器

You, Zeng, Chen, Yaofo, Chen, Qiuwu, Sun, Ying, Zhang, Shuhai, Li, Yingjian, Wang, Yaowei, Tan, Mingkui

Abstract

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.

Chinese Translation

大型语言模型（LLMs）在处理长上下文时面临重大挑战，原因在于键值（KV）缓存的线性增长和自注意力机制的二次复杂度。现有方法分别解决这些瓶颈：多头潜在注意力（Multi-head Latent Attention，MLA）通过将标记投影到低维潜在空间来减少KV缓存，而稀疏注意力则减少计算量。然而，稀疏方法无法在MLA的压缩潜在结构上原生运行，错失了联合优化的机会。本文提出了潜在压缩注意力（Latent-Condensed Attention，LCA），该方法直接在MLA的潜在空间内对上下文进行压缩，其中表示被解耦为语义潜向量和位置键。LCA通过查询感知池化分别聚合语义向量，并通过锚点选择保留位置键。该方法在不增加参数的情况下，同时降低计算成本和KV缓存。除MLA外，LCA的设计与架构无关，可方便地扩展至其他注意力机制，如GQA。理论上，我们证明了一个与长度无关的误差界限。实验证明，LCA在128K上下文长度下实现了最高2.5倍的预填充速度提升和90%的KV缓存减少，同时保持了具有竞争力的性能。

cs.CL / 38 / 2604.12477

Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe

挖掘大型语言模型以获取低资源语言数据：对豪萨语和丰贝语引导策略的比较

Adjovi, Mahounan Pericles, Eiselen, Roald, Mitra, Prasenjit

Abstract

Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.

Chinese Translation

大型语言模型（LLMs）是在低资源语言社区贡献的数据上训练的，但这些模型中编码的语言知识仍然只能通过商业API访问。本文探讨了战略性提示是否可以从LLMs中提取可用的文本数据，针对两种西非语言：豪萨语（Afroasiatic，约8000万说话者）和丰贝语（Niger-Congo，约200万说话者）。我们系统地比较了两种商业LLMs（GPT-4o Mini和Gemini 2.5 Flash）中的六种引导任务类型。GPT-4o Mini在每次API调用中提取的可用目标语言单词比Gemini多出6到41倍。最佳策略因语言而异：豪萨语受益于功能性文本和对话，而丰贝语则需要限制生成提示。我们发布了所有生成的语料库和代码。

cs.CL / 39 / 2604.12479

Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning

满足动态个体偏好：通过成对微调解决人类价值冲突问题

Wang, Shanyong, Lin, Shuhang, Zhao, Yining, Zhu, Xi, Zhang, Yongfeng

Abstract

Recent advances in large language models (LLMs) have significantly improved the alignment of models with general human preferences. However, a major challenge remains in adapting LLMs to individual preferences, which are not only diverse but also dynamic. In this paper, we introduce a novel framework, Preference-Paired Fine-Tuning (PFT), designed to align models with contradictory and evolving individual preferences. We present a new dataset, Value Conflict Dilemma (VCD), which includes scenarios that involve conflicting human preferences, facilitating the evaluation of our approach. Our experiments demonstrate that PFT outperforms single-preference training methods, achieving up to 96.6% accuracy in multi-choice classification tasks and the highest open-ended generation score of 8.69. PFT also shows significant improvements over DPO, SFT and some traditional training methods, especially when handling conflicting preferences. Additionally, with limited user history data, models can rapidly infer preference vectors, achieving a 44.76% improvement in user-specific preference alignment compared to single-preference models.

Chinese Translation

近年来大型语言模型（LLMs）的进展显著提升了模型与一般人类偏好的对齐能力。然而，将LLMs适应于个体偏好仍面临重大挑战，因为个体偏好不仅多样且动态变化。本文提出了一种新颖框架——偏好成对微调（Preference-Paired Fine-Tuning，PFT），旨在使模型能够对抗且演变的个体偏好进行对齐。我们构建了一个新数据集——价值冲突困境（Value Conflict Dilemma，VCD），其中包含涉及冲突人类偏好的场景，以便评估我们的方法。实验结果表明，PFT优于单一偏好训练方法，在多选分类任务中准确率高达96.6%，并在开放式生成任务中取得最高评分8.69。PFT在处理冲突偏好时，较DPO、SFT及部分传统训练方法表现出显著提升。此外，在有限用户历史数据条件下，模型能够快速推断偏好向量，相较于单一偏好模型，用户特定偏好对齐提升了44.76%。

cs.CL / 40 / 2604.12487

KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

KG-Reasoner：一种基于强化学习的端到端多跳知识图谱推理模型

Wang, Shuai, Yu, Yinan

Abstract

Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths and backtrack when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to state-of-the-art methods. Code is available at https://github.com/Wangshuaiia/KG-Reasoner.

Chinese Translation

大型语言模型（LLMs）在自然语言理解与生成方面表现出强大能力，但在知识密集型推理任务中仍存在困难。结构化知识图谱（KGs）作为一种有效的外部知识表示形式，已被广泛应用于提升传统知识库问答（KBQA）任务的性能。然而，针对复杂查询在知识图谱上进行精确的多跳推理仍然具有极大挑战。现有大多数方法将推理过程拆解为一系列通过固定流水线执行的孤立步骤。尽管在一定程度上有效，但此类设计限制了推理的灵活性，且割裂了整体决策过程，常导致推理不连贯并丢失早期步骤中的关键信息。本文提出了KG-Reasoner，一种端到端框架，将多步推理整合进推理大型语言模型（Reasoning LLM）的统一“思考”阶段。通过强化学习（RL），该LLM被训练以内化知识图谱遍历过程，使其能够动态探索推理路径并在必要时进行回溯。我们在八个多跳及知识密集型推理基准上进行实验，结果表明KG-Reasoner在性能上达到或优于现有最先进方法。代码已开源，地址为：https://github.com/Wangshuaiia/KG-Reasoner。

cs.CL / 41 / 2604.12491

Calibrated Confidence Estimation for Tabular Question Answering

面向表格问答的校准置信度估计

Voss, Lukas

Abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
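The idea of serializing one table in several lossless formats and measuring how often the answers agree can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how agreement is scored, so the majority-vote fraction used here is an assumption, and `ask_model` is a hypothetical callable standing in for any LLM client.

```python
import csv
import html
import io
import json
from collections import Counter
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, str]]  # one dict per row, same keys in every row


def to_markdown(table: Table) -> str:
    cols = list(table[0])
    lines = ["| " + " | ".join(cols) + " |",
             "| " + " | ".join("---" for _ in cols) + " |"]
    lines += ["| " + " | ".join(row[c] for c in cols) + " |" for row in table]
    return "\n".join(lines)


def to_html(table: Table) -> str:
    cols = list(table[0])
    head = "<tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in cols) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(row[c])}</td>" for c in cols) + "</tr>"
        for row in table
    )
    return f"<table>{head}{body}</table>"


def to_json(table: Table) -> str:
    return json.dumps(table)


def to_csv(table: Table) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(table[0]))
    writer.writeheader()
    writer.writerows(table)
    return buf.getvalue()


SERIALIZERS = [to_markdown, to_html, to_json, to_csv]


def format_agreement(table: Table, question: str,
                     ask_model: Callable[[str, str], str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of formats that gave it.

    Scoring confidence as the majority-vote fraction is an assumption
    made for this sketch.
    """
    answers = [ask_model(fmt(table), question).strip().lower()
               for fmt in SERIALIZERS]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)


def stub_model(serialized_table: str, question: str) -> str:
    # Stand-in for a real LLM call: disagrees only on the HTML rendering.
    return "Lyon" if serialized_table.startswith("<table>") else "Paris"


cities = [{"city": "Paris", "country": "France"},
          {"city": "Lyon", "country": "France"}]
answer, confidence = format_agreement(cities, "Which city is the capital?", stub_model)
print(answer, confidence)
```

Because every serialization is deterministic, each format needs exactly one model call, which is one plausible reading of why the approach can be cheaper than repeated sampling.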

Chinese Translation

大型语言模型（LLMs）在表格问答任务中的应用日益广泛，但针对结构化数据的置信度校准研究尚不充分。本文首次系统比较了五种置信度估计方法，在五个前沿大型语言模型和两个表格问答基准上进行评测。所有模型均表现出严重的过度自信（平滑ECE为0.35-0.64，而文本问答中报告的为0.10-0.15）。在两个基准和所有四个完全覆盖的模型中，一致地观察到自我评估与扰动方法的二分法：自我评估方法（口头化置信度、P(True)）的AUROC为0.42-0.76，扰动方法（语义熵、自一致性及本文提出的多格式一致性Multi-Format Agreement，MFA）的AUROC为0.78-0.86。通过每模型的配对自助法检验，在Holm-Bonferroni校正后均以p<0.001拒绝原假设，且对GPT-4o-mini进行的三次随机种子检验显示每次种子的标准差仅为0.006。本文提出的多格式一致性（MFA）方法利用结构化数据独有的无损且确定性的序列化变体（Markdown、HTML、JSON、CSV）来估计置信度，API调用成本比采样基线低20%。MFA将ECE降低了44%-63%，在TableBench上的四个模型中均表现出良好的泛化能力（平均AUROC为0.80），且与采样方法互补：MFA与自一致性组合的集成模型将AUROC从0.74提升至0.82。另一项次要贡献是结构感知的重新校准方法，相较于标准的后验校准方法，AUROC提升了10个百分点。

cs.CL / 42 / 2604.12493

Latent Planning Emerges with Scale

潜在规划随着规模的增加而出现

Hanna, Michael, Ameisen, Emmanuel

Abstract

LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like "accountant", and cause them to output "an" rather than "a"; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.

Chinese Translation

大型语言模型（LLMs）能够执行看似需要大量规划的任务，例如撰写连贯的故事或功能性代码，而无需明确地表达计划；然而，它们在多大程度上进行隐性规划仍然未知。本文将潜在规划定义为当大型语言模型具备内部规划表征时发生的现象，这些表征（1）导致生成特定的未来标记或概念，以及（2）塑造前置上下文以支持该未来标记或概念。我们研究了Qwen-3系列（0.6B-14B）在简单规划任务上的表现，发现潜在规划能力随着规模的增加而增强。具备规划能力的模型具有代表计划中的词汇特征，例如“会计师”，并使其输出“an”而不是“a”；此外，即使是表现较差的Qwen-3 4B-8B模型也具备初步的规划机制。在完成押韵对句这一更复杂的任务中，我们发现模型通常能够提前识别押韵，但即使是大型模型也很少进行远期规划。然而，当我们引导模型朝向散文中的计划词汇时，可以引发一些随着规模增加而增强的规划能力。总之，我们提供了一个测量规划的框架，以及模型规划能力如何随着规模增长的机制性证据。

cs.CL / 43 / 2604.12503

Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting

基于拓扑的推理在不完整知识图上的图形软提示

Wang, Shuai, Wang, Xixi, Yu, Yinan

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we propose a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling the LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.

Chinese Translation

大型语言模型（LLMs）在各种任务中展现了显著的能力，但在知识密集型场景中仍然容易出现幻觉。知识库问答（KBQA）通过将生成与知识图（KGs）相结合来缓解这一问题。然而，大多数多跳KBQA方法依赖于显式的边遍历，使其对知识图的不完整性变得脆弱。本文提出了一种新颖的基于图的软提示框架，将推理范式从节点级路径遍历转变为子图级推理。具体而言，我们采用图神经网络（GNN）将提取的结构子图编码为软提示，使得LLM能够在更丰富的结构上下文中进行推理，并识别超出直接图邻居的相关实体，从而降低对缺失边的敏感性。此外，我们引入了一个两阶段的范式，在保持良好性能的同时降低计算成本：一个轻量级的LLM首先利用软提示识别与问题相关的实体和关系，随后由一个更强大的LLM进行基于证据的答案生成。在四个多跳KBQA基准上的实验表明，我们的方法在其中三个基准上达到了最先进的性能，证明了其有效性。代码可在以下仓库获取：https://github.com/Wangshuaiia/GraSP。

cs.CL / 44 / 2604.12506

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

超越转录：面向感知的统一音频模式用于AudioLLMs

Zhang, Linhao, Song, Yuhan, Liu, Aiwei, Wu, Chuhan, Zhang, Sijun, Jia, Wei, Liu, Yuan, Wang, Houfeng, Zhou, Xiao

Abstract

Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components -- Transcription, Paralinguistics, and Non-linguistic Events -- within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at https://github.com/Tencent/Unified_Audio_Schema.
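To make the three-component JSON format concrete, here is a minimal sketch of how one supervision target might be assembled. Only the three top-level component names come from the abstract; the function name `build_audio_target` and the inner fields (`emotion`, `speaking_rate`, the event strings) are hypothetical placeholders.

```python
import json
from typing import Dict, List


def build_audio_target(transcription: str,
                       paralinguistics: Dict[str, str],
                       events: List[str]) -> str:
    """Serialize one training target with the three components named in the abstract.

    The inner structure of each component is an assumption for illustration.
    """
    record = {
        "Transcription": transcription,
        "Paralinguistics": paralinguistics,
        "Non-linguistic Events": events,
    }
    return json.dumps(record, ensure_ascii=False, indent=2)


target = build_audio_target(
    transcription="Could you close the window?",
    paralinguistics={"emotion": "annoyed", "speaking_rate": "fast"},
    events=["door slam", "traffic noise"],
)
print(target)
```

Keeping the transcription as one field among three is what lets such a target retain the audio-text alignment of ASR-style training while also supervising the cues that plain transcripts discard.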

Chinese Translation

近期的音频大语言模型（AudioLLMs）表现出显著的性能反转现象：它们在复杂推理任务中表现优异，但在细粒度声学感知方面持续表现欠佳。我们将这一差距归因于以自动语音识别（ASR）为中心的训练的根本限制，该训练虽然提供了精确的语言学目标，但隐式地教导模型将副语言线索和声学事件视为噪声并加以抑制。为了解决这一问题，我们提出了统一音频模式（Unified Audio Schema，UAS），这是一种整体且结构化的监督框架，将音频信息组织为三个明确组成部分——转录（Transcription）、副语言学（Paralinguistics）和非语言事件（Non-linguistic Events）——并采用统一的JSON格式。该设计实现了全面的声学覆盖，同时不牺牲实现推理所需的紧密音频-文本对齐。我们通过将该监督策略应用于离散和连续的AudioLLM架构，验证了其有效性。在MMSU、MMAR和MMAU上的大量实验表明，UAS-Audio带来了持续的性能提升，在MMSU上相比同规模的最先进模型细粒度感知能力提升了10.9%，同时保持了强健的推理能力。我们的代码和模型已公开，地址为：https://github.com/Tencent/Unified_Audio_Schema。

cs.CL / 45 / 2604.12518

Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

增强-平衡模态协作框架用于稳健的多模态情感分析

He, Kang, Ding, Yuzhe, Wang, Xinrong, Li, Fei, Teng, Chong, Ji, Donghong

Abstract

Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose the Enhance-then-Balance Modality Collaboration (EBMC) framework. EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.

Chinese Translation

多模态情感分析（MSA）整合异构的文本、音频和视觉信号以推断人类情感。尽管近期的方法利用了跨模态的互补性，但它们往往难以充分利用较弱的模态。在实际应用中，主导模态往往会掩盖非语言模态，导致模态之间的竞争，限制了整体贡献。这种不平衡降低了在噪声或缺失模态下的融合性能和稳健性。为了解决这个问题，我们提出了增强-平衡模态协作框架（EBMC）。EBMC通过语义解耦和跨模态增强来提高表示质量，从而增强较弱的模态。为了防止主导模态压倒其他模态，能量引导的模态协调机制通过可微平衡目标实现隐式梯度重平衡。此外，实例感知模态信任蒸馏估计样本级可靠性，以自适应调节融合权重，确保稳健性。大量实验表明，EBMC取得了最先进或具有竞争力的结果，并在缺失模态设置下保持了强大的性能。

cs.CL / 46 / 2604.12540

When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

数据增强何时有效？评估哈乌萨语和丰贝语的LLM及反向翻译方法

Adjovi, Mahounan Pericles, Eiselen, Roald, Mitra, Prasenjit

Abstract

Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe -- hurting NER while helping POS -- suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

Chinese Translation

数据稀缺限制了低资源非洲语言的自然语言处理（NLP）发展。我们评估了两种数据增强方法——基于LLM的生成（Gemini 2.5 Flash）和反向翻译（NLLB-200）——针对哈乌萨语和丰贝语这两种在LLM生成质量上差异显著的西非语言。我们使用MasakhaNER 2.0和MasakhaPOS基准评估命名实体识别（NER）和词性标注（POS）上的增强效果。我们的结果显示，增强的有效性取决于任务类型，而不仅仅是语言或LLM质量。对于NER，任何一种方法在两种语言上都未能超越基线；LLM增强使哈乌萨语的NER降低了0.24% F1，丰贝语的NER降低了1.81% F1。对于POS标注，LLM增强使丰贝语的准确率提高了0.33%，而反向翻译使哈乌萨语提高了0.17%；反向翻译使丰贝语的POS降低了0.35%，对哈乌萨语的POS几乎没有影响。同样的LLM生成的合成数据在丰贝语的不同任务上产生了相反的效果——在NER上表现不佳，而在POS上有所帮助——这表明任务结构对增强结果的影响大于合成数据质量。这些发现挑战了LLM生成质量预测增强成功的假设，并提供了可操作的指导：数据增强应被视为特定任务的干预，而非普遍有效的预处理步骤。

cs.CL / 47 / 2604.12559

FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing

FABLE：用于非结构化模型编辑的细粒度事实锚定

Wang, Peng, Zhou, Biyu, Tang, Xuehai, Han, Jizhong, Hu, Songlin

Abstract

Unstructured model editing aims to update models with real-world text, yet existing methods often memorize text holistically without reliable fine-grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, reflecting the unidirectional Transformer flow in which surface-form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Our code is publicly available at https://github.com/caskcsg/FABLE.

Chinese Translation

非结构化模型编辑旨在利用真实世界文本更新模型，然而现有方法通常整体记忆文本，缺乏可靠的细粒度事实访问。为了解决这一问题，我们提出了FABLE，一种分层框架，将细粒度事实注入与整体文本生成解耦。FABLE遵循两阶段的先事实策略：离散事实首先锚定于浅层，随后对深层进行最小更新以生成连贯文本。这种解耦解决了整体回忆与细粒度事实访问之间的不匹配，反映了单向Transformer流程中表面形式生成放大而非纠正底层事实表示的特点。我们还引入了UnFine，一种包含细粒度问答对和事实级评估指标的诊断基准，用于系统性评估。实验表明，FABLE在显著提升细粒度问答性能的同时，保持了最先进的整体编辑表现。我们的代码已公开，地址为https://github.com/caskcsg/FABLE。

cs.CL / 48 / 2604.12610

Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

将外部知识转化为三元组以增强大语言模型的检索能力

Wang, Xudong, Zhang, Chaoning, Sun, Qigan, Huang, Zhenzhen, Lu, Chang, Zheng, Sheng, Ma, Zeyu, Qin, Caiyan, Yang, Yang, Shen, Hengtao

Abstract

Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.
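A small sketch may help show how matching a query against only the Condition field keeps the retrieved context short. Everything beyond the three field names is an assumption: the token-overlap score stands in for whatever retriever the authors use, and the `Triplet`, `token_overlap`, `retrieve`, and `build_context` names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triplet:
    condition: str   # used as the retrieval anchor
    proof: str
    conclusion: str


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens (a stand-in similarity)."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, store: List[Triplet], k: int = 1) -> List[Triplet]:
    # Score against the Condition only, never against the full raw text.
    ranked = sorted(store, key=lambda t: token_overlap(query, t.condition),
                    reverse=True)
    return ranked[:k]


def build_context(triplets: List[Triplet]) -> str:
    # Hypothetical prompt layout for the retrieved units.
    return "\n".join(
        f"Condition: {t.condition}\nProof: {t.proof}\nConclusion: {t.conclusion}"
        for t in triplets
    )


store = [
    Triplet("water is heated to 100 C at sea level",
            "vapor pressure equals atmospheric pressure",
            "water boils"),
    Triplet("a metal rod is heated",
            "thermal expansion of the lattice",
            "the rod gets longer"),
]
hits = retrieve("what happens when water is heated at sea level", store)
print(build_context(hits))
```

In a real system the overlap score would likely be replaced by dense embeddings of the Condition text; the point of the sketch is only that the unit of retrieval is a compact structured triplet.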

Chinese Translation

检索增强生成（RAG）通过在生成过程中融入外部知识来减轻大语言模型（LLMs）中的幻觉现象。然而，RAG的有效性不仅依赖于检索器的设计和基础模型的能力，还取决于检索到的证据如何结构化以及与查询的对齐方式。现有的RAG方法通常检索并连接非结构化的文本片段作为上下文，这往往引入冗余或相关性较弱的信息。这种做法导致上下文过度积累、语义对齐降低以及推理链断裂，从而降低生成质量并增加令牌消耗。为了解决这些挑战，我们提出了Tri-RAG，一个基于结构化三元组的检索框架，通过推理对齐的上下文构建来提高检索效率。Tri-RAG自动将自然语言中的外部知识转化为标准化的结构化三元组，包含条件（Condition）、证明（Proof）和结论（Conclusion），明确捕捉知识片段之间的逻辑关系，采用轻量级的基于提示的适应方法，并保持模型参数不变。在此表示的基础上，三元组头部的条件（Condition）被视为检索和匹配的明确语义锚点，使得能够精确识别与查询相关的知识单元，而无需直接连接冗长的原始文本。因此，Tri-RAG在检索准确性和上下文令牌效率之间实现了良好的平衡。多个基准数据集上的实验结果表明，Tri-RAG显著提高了检索质量和推理效率，同时在复杂推理场景中产生了更稳定的生成行为和更高效的资源利用。

cs.CL / 49 / 2604.12633

Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

基于合成数据的大规模多语言多标签情感分类

Borisov, Vadim

Abstract

Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) and surpassing them on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification
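For readers unfamiliar with the F1-micro metric reported above, the following sketch shows the standard micro-averaged F1 for multi-label predictions. The emotion labels in the example are illustrative only; the abstract does not list the 11 categories.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives over all samples and labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative labels, not the paper's category set.
gold = [{"joy", "surprise"}, {"anger"}, {"sadness", "fear"}]
pred = [{"joy"}, {"anger", "disgust"}, {"sadness", "fear"}]
score = micro_f1(gold, pred)
print(round(score, 3))  # 4 true positives, 1 false positive, 1 false negative -> 0.8
```

Because counts are pooled before averaging, frequent labels dominate this score, which is worth keeping in mind when label frequencies are uneven.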

Chinese Translation

多语言环境下的情感分类仍受限于标注数据的稀缺：现有语料库主要为英语、单标签且涵盖语言较少。针对这一不足，我们构建了一个大规模合成训练语料库，包含超过100万条多标签样本（每种语言约5万条），覆盖23种语言：阿拉伯语、孟加拉语、荷兰语、英语、法语、德语、印地语、印尼语、意大利语、日语、韩语、普通话、波兰语、葡萄牙语、旁遮普语、俄语、西班牙语、斯瓦希里语、泰米尔语、土耳其语、乌克兰语、乌尔都语和越南语，涵盖11个情感类别，采用文化适应的生成方法和程序化质量过滤。我们在相同条件下训练并比较了六种多语言Transformer编码器，从DistilBERT（1.35亿参数）到XLM-R-Large（5.6亿参数）。在我们的领域内测试集上，XLM-R-Large取得了0.868的F1-micro和0.987的AUC-micro。为验证模型在人工标注数据上的表现，我们对所有模型进行了零样本测试，使用GoEmotions（英语）和SemEval-2018 Task 1 E-c（英语、阿拉伯语、西班牙语）数据集。在无阈值排名指标上，XLM-R-Large达到或超过了仅限英语的专业模型，在AP-micro（0.636）和LRAP（0.804）指标上持平，在AUC-micro（0.810对比0.787）上表现更优，同时原生支持所有23种语言。表现最佳的基础规模模型已公开发布，地址为：https://huggingface.co/tabularisai/multilingual-emotion-classification


cs.CL / 50 / 2604.12651

Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs

学习思维链提示以预测知识图谱中的实体、关系和字面值

Baci, Alkid, Friedrichs, Luke, Demir, Caglar, Kouagou, N'Dah Jean, Ngomo, Axel-Cyrille Ngonga

Abstract

Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through the MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations, or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists hasChild.Female$, $\geq 5 \; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP .
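The two headline metrics in the abstract above, MRR and Jaccard similarity, can be illustrated with a minimal sketch. This is not taken from the RALP code base; the function names and the toy inputs are assumptions made for illustration only.

```python
def mean_reciprocal_rank(ranks):
    """MRR over a list of 1-based ranks of the correct answer (hypothetical helper)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def jaccard(predicted, gold):
    """Jaccard similarity between a retrieved instance set and the gold set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        # Two empty sets are treated as identical (a convention assumed here).
        return 1.0
    return len(predicted & gold) / len(predicted | gold)
```

For example, ranks of 1, 2, and 4 for three test triples give an MRR of (1 + 0.5 + 0.25) / 3, and two instance sets sharing two of four distinct individuals give a Jaccard similarity of 0.5.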

Chinese Translation

知识图谱嵌入（KGE）模型在链接预测方面表现良好，但在面对未见过的实体、关系，尤其是字面值时却表现不佳，这限制了它们在动态异构图中的应用。相比之下，预训练的大型语言模型（LLMs）通过提示能够有效地进行泛化。我们将链接预测重新表述为一个提示学习问题，并引入了RALP，它学习基于字符串的思维链（CoT）提示作为三元组的评分函数。通过MIPRO算法进行贝叶斯优化，RALP能够在不到30个训练样本的情况下识别有效提示，而无需梯度访问。在推理阶段，RALP预测缺失的实体、关系或整个三元组，并根据学习到的提示分配置信度分数。我们在传导性、数值和OWL实例检索基准上进行了评估。RALP在各数据集上使最先进的KGE模型的平均倒数排名（MRR）提高了超过5%，并通过高质量的推断三元组增强了泛化能力。在具有复杂类表达的OWL推理任务中（例如，$\exists hasChild.Female$，$\geq 5 \; hasChild.Female$），其Jaccard相似度超过88%。这些结果突显了基于提示的LLM推理作为嵌入方法的灵活替代方案。我们将我们的实现、训练和评估流程作为开源发布： https://github.com/dice-group/RALP 。


cs.CL / 51 / 2604.12721

InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

InsightFlow：基于大型语言模型的心理健康患者叙述合成至因果模型

Gupta, Shreya, Adhikary, Prottay Kumar, Dave, Bhavyaa, Singh, Salam Michael, Deroy, Aniket, Chakraborty, Tanmoy

Abstract

Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time-consuming and varies across clinicians. We present InsightFlow, an LLM-based approach that automatically generates 5P-aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM-generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert-rated clinical criteria. The generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures than the chain-like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

Chinese Translation

临床案例形成将患者症状和心理社会因素组织成因果模型，通常使用5P框架。然而，从治疗记录中构建此类图表既耗时又因临床医生而异。我们提出了InsightFlow，一种基于大型语言模型（LLM）的方法，能够自动生成与5P对齐的因果图，源自患者与治疗师的对话。通过使用46份由临床专家注释的心理治疗初诊记录，我们评估了LLM生成的图表与人类形成的图表在结构（NetSimile）、语义（嵌入相似性）和专家评分的临床标准方面的表现。生成的图表显示出与标注者间一致性相当的结构相似性，并与人类图表在语义上高度一致。专家评估认为输出在完整性、一致性和临床实用性方面为中等水平。尽管LLM生成的图表倾向于形成更为互联的结构，而人类图表则呈现链式模式，但整体复杂性和内容覆盖相似。这些结果表明，LLM能够在专家实践的自然变异性范围内生成具有临床意义的案例形成图。InsightFlow突显了自动化因果建模在增强临床工作流程中的潜力，未来的工作需要改善时间推理并减少冗余。


cs.CL / 52 / 2604.12736

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

令牌级策略优化：通过序列级似然将组级奖励与令牌级聚合联系起来

Lin, Xingyu, Wen, Yilin, Su, Du, Hou, Jinchang, Wang, En, Liu, Wenbin, Bao, Chenfu, Lv, Zhonghou

Abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.

Chinese Translation

组相对策略优化（GRPO）显著提升了大型语言模型（LLMs）的推理能力，特别是在数学推理表现方面。然而，GRPO及相关的熵正则化方法在令牌级稀疏奖励方面仍然面临挑战，这是链式思维（CoT）推理中的一个固有问题。这些方法通常依赖于未区分的令牌级熵正则化，这容易导致熵崩溃或模型退化，尤其是在稀疏令牌奖励的情况下。在本研究中，我们提出了TEPO，一个新颖的令牌级框架，它（1）利用序列级似然将组级奖励与单个令牌通过令牌级聚合联系起来，和（2）引入一个令牌级KL散度掩码约束，针对具有正优势和降低熵的令牌，以减轻突发的策略更新。实验表明，TEPO不仅在数学推理基准上实现了最先进的性能，而且显著增强了训练的稳定性，与GRPO/DAPO相比，收敛时间减少了50%。


cs.CL / 53 / 2604.12744

Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

Universal NER v2：迈向大规模多语言命名实体识别基准

Blevins, Terra, Mayhew, Stephen, Šuppa, Marek, Gonen, Hila, Mirkin, Shachar, Pais, Vasile, Dobrovoljc, Kaja, Giouli, Voula, Kevin, Jun, Jang, Eugene, Kim, Eungseo, Seo, Jeongyeon, Gialis, Xenophon, Pinter, Yuval

Abstract

While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.

Chinese Translation

尽管多语言语言模型有望将大型语言模型（LLMs）的优势带给众多语言使用者，但用于检验这些假设的多数语言的黄金标准评估基准仍然稀缺。Universal NER项目现已进入第四年，致力于构建黄金标准的多语言命名实体识别（NER）基准数据集。该项目受其他核心自然语言处理任务（如Universal Dependencies）中大规模多语言工作的启发，采用通用标签集和详尽的标注指南，收集标准化的跨语言命名实体跨度标注。首个版本（UNER v1）于2024年发布，项目自此持续发展和扩展，拥有活跃的社区，包括多位组织者、标注员和合作伙伴。


cs.CL / 54 / 2604.12748

Generating Effective CoT Traces for Mitigating Causal Hallucination

生成有效的链式思维（CoT）痕迹以减轻因果幻觉

Zhao, Yiheng, Yan, Jun

Abstract

Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, no CoT trace dataset is currently available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.

Chinese Translation

尽管大型语言模型（LLMs）在复杂推理任务中表现出色，但它们在事件因果识别（ECI）中遭遇严重的因果幻觉，尤其是在较小的模型（$\leq$1.5B 参数）中。解决这一问题的一个有前景的方法是通过链式思维（CoT）痕迹对其进行微调。然而，目前缺乏可用于 ECI 的 CoT 痕迹数据集。在本文中，我们首先调查了有效的 CoT 痕迹应具备的基本标准，以减轻较小模型中的因果幻觉。然后，我们设计了一个生成符合这些标准的 CoT 痕迹的流程。此外，由于目前没有量化因果幻觉的指标，我们还引入了一种新的指标——因果幻觉率（Causal Hallucination Rate, CHR），用于量化因果幻觉、指导有效 CoT 痕迹标准的制定，并验证我们流程的有效性。我们的实验表明，使用我们流程生成的 CoT 痕迹进行微调，不仅显著减少了较小 LLM 中的因果幻觉，还提高了平均准确率。此外，微调后的模型在跨数据集和跨难度的泛化能力以及在误导性干预提示下的鲁棒性方面表现出色。


cs.CL / 55 / 2604.12766

NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

NaviRAG：面向检索增强生成的主动知识导航

Dai, Jihao, Wu, Dingjun, Chen, Yuxuan, Zeng, Zheni, Yan, Yukun, Liu, Zhenghao, Sun, Maosong

Abstract

Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging these reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm that the performance gains stem from our method's capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss the efficiency, applicable scenarios, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

Chinese Translation

检索增强生成（Retrieval-Augmented Generation，RAG）通常依赖于将查询直接映射到静态、孤立文本片段的扁平检索范式。这种方法在处理需要跨不同粒度层次（例如从宏观概念到具体证据）有条件检索和动态综合信息的复杂任务时表现不足。为弥补这一不足，我们提出了NaviRAG，一种从被动片段检索转向主动知识导航的新型框架。NaviRAG首先将知识文档结构化为层次化形式，保持从粗粒度主题到细粒度细节的语义关系。利用这一重组的知识记录，大型语言模型（LLM）代理主动导航这些记录，迭代识别信息空白，并从最合适的粒度层次检索相关内容。在长文档问答基准上的大量实验表明，NaviRAG在检索召回率和端到端答案性能上均持续优于传统RAG基线。消融研究确认性能提升源于我们方法在多粒度证据定位和动态检索规划方面的能力。我们进一步讨论了该方法的效率、适用场景及未来方向，期望推动RAG系统向更智能和自主的方向发展。


cs.CL / 56 / 2604.12770

Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

通过强化学习教导大型语言模型（LLMs）进行人类般的不当论证编辑

Ziegenbein, Timon, Stahl, Maja, Wachsmuth, Henning

Abstract

Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one's arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.

Chinese Translation

编辑人类撰写的文本已成为大型语言模型（LLMs）的标准应用案例，例如，使论点更适合讨论。然而，比较人类与LLM生成的编辑时，我们观察到编辑策略的不匹配：LLMs往往进行多个分散的编辑，并且倾向于显著改变意思，而人类则更倾向于将相关的变化封装在自包含的、保持意义的编辑中。本文提出了一种强化学习方法，教导LLMs进行人类般的编辑，以提高论点的适当性。我们的方法生成自包含的句子级编辑建议，这些建议可以独立接受或拒绝。我们使用群体相对策略优化进行训练，采用多组件奖励函数，联合优化编辑级语义相似性、流畅性、模式一致性以及论点级适当性。在自动和人工评估中，我们的方法优于竞争基线和人类般编辑的最新技术，经过多轮编辑后，其适当性接近完全重写。


cs.CL / 57 / 2604.12776

EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution

EvoSpark：用于统一长时域叙事演化的内生互动代理社会

He, Shiyu, Kuang, Minchi, Wang, Mengxian, Hu, Bin, Gu, Tingxiang

Abstract

Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, the Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.

Chinese Translation

在基于大型语言模型（LLM）的多代理系统中，实现内生叙事演化受到生成性涌现固有随机性的阻碍。特别是，长时域模拟面临社会记忆堆叠的问题，其中相互冲突的关系状态积累而未得到解决，以及叙事空间不和谐的问题，其中空间逻辑与不断发展的情节脱节。为了解决这一问题，我们提出了EvoSpark，一个专门设计的框架，旨在维持内生互动代理社会中的逻辑连贯的长时域叙事。为了确保一致性，分层叙事记忆采用角色社会进化基础作为活的认知，动态地代谢经验以解决历史冲突。此外，生成性场景机制强制执行角色-位置-情节对齐，将角色的存在与叙事流同步。这些机制的基础是统一叙事操作引擎，它集成了涌现角色基础协议，将随机的火花转化为持久的角色。该引擎建立了一个基质，将最小前提扩展为一个开放式、不断演变的故事世界。实验表明，EvoSpark在多种范式下显著优于基线，能够持续生成富有表现力和连贯的叙事体验。


cs.CL / 58 / 2604.12816

The role of System 1 and System 2 semantic memory structure in human and LLM biases

系统1与系统2语义记忆结构在人类与大型语言模型偏见中的作用

Abramski, Katherine, Rossetti, Giulio, Stella, Massimo

Abstract

Implicit biases in both humans and large language models (LLMs) pose significant societal risks. Dual process theories propose that biases arise primarily from associative System 1 thinking, while deliberative System 2 thinking mitigates bias, but the cognitive mechanisms that give rise to this phenomenon remain poorly understood. To better understand what underlies this duality in humans, and possibly in LLMs, we model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable datasets generated by both humans and LLMs. We then investigate how these distinct semantic memory structures relate to implicit gender bias using network-based evaluation metrics. We find that semantic memory structures are irreducible only in humans, suggesting that LLMs lack certain types of human-like conceptual knowledge. Moreover, semantic memory structure relates consistently to implicit bias only in humans, with lower levels of bias in System 2 structures. These findings suggest that certain types of conceptual knowledge contribute to bias regulation in humans, but not in LLMs, highlighting fundamental differences between human and machine cognition.

Chinese Translation

人类与大型语言模型（LLMs）中存在的隐性偏见带来了显著的社会风险。双重加工理论提出，偏见主要源自联想性的系统1思维，而审慎的系统2思维则有助于减轻偏见，但导致这一现象的认知机制尚不清楚。为了更好地理解人类乃至大型语言模型中这一二元性背后的机制，我们将系统1和系统2思维建模为具有不同结构的语义记忆网络，这些网络基于人类和大型语言模型生成的可比数据集构建。随后，我们利用基于网络的评估指标，探讨这些不同的语义记忆结构与隐性性别偏见的关系。研究发现，语义记忆结构仅在人类中表现出不可约性，表明大型语言模型缺乏某些类型的人类式概念知识。此外，语义记忆结构与隐性偏见的关系仅在人类中表现出一致性，系统2结构中的偏见水平较低。这些发现表明，某些类型的概念知识有助于人类偏见的调节，但在大型语言模型中并非如此，凸显了人类认知与机器认知之间的根本差异。


cs.CL / 59 / 2604.12843

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

成长的烦恼：通过固定参数校准实现可扩展且高效的LLM基准测试

Habba, Eliya, Itzhak, Itay, Yehudai, Asaf, Perlitz, Yotam, Bandel, Elron, Shmueli-Scheuer, Michal, Choshen, Leshem, Stanovsky, Gabriel

Abstract

The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than 400 models, our framework predicts full-evaluation performance within 2-3 percentage points using only 100 anchor questions per dataset, with Spearman $\rho \geq 0.9$ for ranking preservation. This shows that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains
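The idea of scoring a model on a small anchor set while item parameters stay fixed can be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: it assumes a unidimensional two-parameter logistic (2PL) IRT model rather than the multidimensional model the abstract describes, estimates ability by grid search, and all function names and item parameters are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, anchor_items, grid=None):
    """Maximum-likelihood ability estimate from anchor responses (0/1),
    holding the already calibrated anchor parameters (a, b) fixed."""
    if grid is None:
        # Assumed search range for theta: -4.00 to 4.00 in steps of 0.01.
        grid = [i / 100.0 for i in range(-400, 401)]

    def log_likelihood(theta):
        total = 0.0
        for y, (a, b) in zip(responses, anchor_items):
            p = min(max(p_correct(theta, a, b), 1e-9), 1.0 - 1e-9)
            total += math.log(p) if y else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)


def predict_accuracy(theta, all_items):
    """Expected accuracy on the full dataset given ability and item parameters."""
    return sum(p_correct(theta, a, b) for a, b in all_items) / len(all_items)
```

In this sketch a new model answers only the anchor questions, `estimate_ability` places it on the existing scale, and `predict_accuracy` extrapolates to the full dataset, which is what keeps the evaluation cost per dataset constant.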

Chinese Translation

语言模型和基准的快速发布使得在每个数据集上评估每个模型的成本日益增加。在实践中，模型通常在不同的样本上进行评估，这使得跨研究比较得分变得困难。为了解决这个问题，我们提出了一种基于多维项目反应理论（Item Response Theory, IRT）的框架，该框架利用锚定项目将新的基准校准到评估套件，同时保持先前校准的项目参数不变。我们的方法支持一种现实的评估环境，其中数据集随着时间的推移被引入，模型仅在评估时可用的数据集上进行评估，同时为每个数据集使用固定的锚定集，以便可以直接比较不同评估周期的结果。在对超过400个模型的大规模实验中，我们的框架在仅使用每个数据集100个锚定问题的情况下，能够在2-3个百分点内预测全评估性能，且排名保持的斯皮尔曼相关系数（Spearman ρ）达到0.9以上，显示出在保持得分可比性的同时，可以随着时间的推移扩展基准套件，并且每个新数据集的评估成本保持不变。代码可在 https://github.com/eliyahabba/growing-pains 获取。


cs.CL / 60 / 2604.12911

Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

往返翻译揭示了前沿多语言基准测试的不足

Skorobogat, Ronald, Prabhu, Ameya, Bethge, Matthias

Abstract

Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similarly to popular reasoning and knowledge benchmarks, only spread across many languages. We show that such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. On our benchmark, round-trip translation correlates almost perfectly ($\rho = 0.94$) with user ratings on LMArena, requires no human reference translations, and does not require a more capable multilingual judge than the tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
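The round-trip procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LiT benchmark code: `to_target` and `to_source` are hypothetical callables standing in for the model under test, and the default similarity is a character-level ratio from Python's `difflib`, used here only as a stand-in for whatever semantic comparison the benchmark actually applies.

```python
from difflib import SequenceMatcher


def round_trip_score(text, to_target, to_source, similarity=None):
    """Translate text to the target language and back, then compare.

    to_target / to_source: hypothetical translation callables (str -> str).
    similarity: optional callable (original, round_tripped) -> float;
    defaults to a surface-level character ratio (an assumption of this sketch).
    Returns (score, round_tripped_text).
    """
    forward = to_target(text)
    back = to_source(forward)
    if similarity is None:
        def similarity(a, b):
            return SequenceMatcher(None, a, b).ratio()
    return similarity(text, back), back
```

A lossless round trip returns a score of 1.0 under the default similarity, while content dropped or altered in either direction lowers it. A semantic judge or embedding-based similarity can be passed in through the `similarity` parameter.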

Chinese Translation

多语言基准测试指导着前沿模型的发展。然而，前沿模型报告的多语言评估结构上类似于流行的推理和知识基准测试，但覆盖了多种语言。我们表明，这些基准测试以及随之而来的多语言评估，测量的是数学推理和事实回忆，而非多语言能力。例如，在这些基准测试中，思维变体的表现远超指令变体，但在现实世界的多语言任务（如 LMArena）中，往往表现更差。我们提出一个简单的替代方案：通过往返翻译评估多语言能力。给定源语言的文本，将其翻译为目标语言再翻译回去；原文与结果之间的语义差距揭示了多语言生成能力的缺陷。往返翻译与用户在 LMArena 上的评分几乎完全相关（$\rho = 0.94$），不需要人工参考翻译，也不需要比被测试模型更强大的多语言评判者。最后，我们推出了“翻译中的迷失”（Lost in Translation, LiT），这是一个涵盖全球广泛使用语言的具有挑战性的往返翻译基准测试，用于对多语言前沿模型进行现实评估。


cs.CL / 61 / 2604.12919

MetFuse: Figurative Fusion between Metonymy and Metaphor

MetFuse：转喻与隐喻之间的比喻融合

Ghosh, Saptarshi, Jiang, Tianyu

Abstract

Metonymy and metaphor often co-occur in natural language, yet computational work has studied them largely in isolation. We introduce a framework that transforms a literal sentence into three figurative variants: metonymic, metaphoric, and hybrid. Using this framework, we construct MetFuse, the first dedicated dataset of figurative fusion between metonymy and metaphor, containing 1,000 human-verified meaning-aligned quadruplets totaling 4,000 sentences. Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding the largest gains on metonymy tasks. Using this dataset, we also analyze how the presence of one figurative type impacts another. Our findings show that both human annotators and large language models better identify metonymy in hybrid sentences than in metonymy-only sentences, demonstrating that the presence of a metaphor makes a metonymic noun more explicit. Our dataset is publicly available at: https://github.com/cincynlp/MetFuse.

Chinese Translation

转喻和隐喻常在自然语言中共现，但计算研究大多将二者孤立地进行探讨。我们提出了一个框架，将字面句子转换为三种比喻变体：转喻型、隐喻型和混合型。基于该框架，我们构建了MetFuse，这是首个专门针对转喻与隐喻之间比喻融合的数据集，包含1000个人工验证且语义对齐的四元组，共计4000句子。在八个现有基准上的外部实验表明，利用MetFuse扩充训练数据能持续提升转喻和隐喻的分类性能，其中混合型样本在转喻任务上带来最大提升。借助该数据集，我们还分析了一种比喻类型的存在如何影响另一种。结果显示，无论是人工标注者还是大型语言模型，在混合句中识别转喻的表现均优于仅含转喻的句子，表明隐喻的存在使得转喻名词更为显性。我们的数据集已公开，地址为：https://github.com/cincynlp/MetFuse。


cs.CL / 62 / 2604.12928

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

MoshiRAG：全双工语音语言模型的异步知识检索

Chien, Chung-Ming, Orsini, Manu, Kharitonov, Eugene, Zeghidour, Neil, Livescu, Karen, Défossez, Alexandre

Abstract

Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

Chinese Translation

语音到语音的语言模型最近出现，以增强对话式人工智能的自然性。特别是，全双工模型以其实时交互性而著称，包括处理停顿、打断和回声通道。然而，提高其事实性仍然是一个未解决的挑战。虽然扩大模型规模可能解决这一问题，但这将使实时推理的成本变得不可承受。在本研究中，我们提出了MoshiRAG，一种模块化的方法，将紧凑的全双工接口与选择性检索相结合，以访问更强大的知识源。我们的异步框架使模型能够识别知识需求查询，并将其响应基于外部信息。通过利用响应开始与核心信息传递之间的自然时间间隔，检索过程可以在保持自然对话流的同时完成。采用这种方法，MoshiRAG在事实性方面达到了与最佳公开发布的非双工语音语言模型相当的水平，同时保留了全双工系统固有的交互性。此外，我们灵活的设计支持即插即用的检索方法，无需重新训练，并在域外数学推理任务上表现出强大的性能。


cs.CL / 63 / 2604.12978

GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

GlotOCR 基准：OCR 模型在少数 Unicode 字符集之外仍面临挑战

Kargaran, Amir Hossein, Nikeghbal, Nafiseh, Diesner, Jana, Yvon, François, Schütze, Hinrich

Abstract

Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.

Chinese Translation

光学字符识别（OCR）随着视觉-语言模型的兴起迅速发展，但评估仍集中在少数高资源和中资源的字符集上。我们推出了 GlotOCR 基准，这是一个全面的基准，评估 OCR 在 100 多个 Unicode 字符集上的泛化能力。我们的基准包括从真实多语言文本中渲染的干净和降级图像变体。图像使用来自 Google Fonts 库的字体渲染，经过 HarfBuzz 处理并使用 FreeType 像素化，支持从左到右（LTR）和从右到左（RTL）的字符集。渲染图像的样本经过人工审核，以验证所有字符集的正确渲染。我们评估了一系列开放权重和专有的视觉-语言模型，发现大多数模型在少于十个字符集上表现良好，即使是最强的前沿模型在超过三十个字符集时也无法泛化。性能普遍与字符集级别的预训练覆盖率相关，这表明当前的 OCR 系统在很大程度上依赖于语言模型的预训练，而不仅仅是视觉识别。面对不熟悉的字符集，模型要么产生随机噪声，要么从它们已经知道的相似字符集中幻觉出字符。我们发布了基准和管道以便于重现。管道代码：https://github.com/cisnlp/glotocr-bench，基准：https://hf.co/datasets/cis-lmu/glotocr-bench。


cs.CL / 64 / 2604.12989

Accelerating Speculative Decoding with Block Diffusion Draft Trees

利用块扩散草图树加速推测解码

Ringel, Liran, Romano, Yaniv

Abstract

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
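The best-first selection step described above can be sketched as follows. This is an illustrative reading of the abstract, not the DDTree implementation: it assumes the surrogate score of a draft prefix is the sum of per-position log-probabilities from the drafter (positions scored independently), and the function name, input format, and tie-breaking are hypothetical.

```python
import heapq
import math


def build_draft_tree(position_probs, node_budget):
    """Pick up to node_budget draft prefixes with the highest surrogate score.

    position_probs: one dict per draft position, mapping token -> probability
    (assumed per-position drafter distributions; tokens must be mutually
    comparable, e.g. all strings, so heap ties can be broken).
    Returns a list of (prefix_tuple, log_score) in selection order. Every
    selected prefix has its parent selected earlier, so the result is a tree.
    """
    # Rank the candidates at each position once, dropping zero-probability tokens.
    ranked = [
        sorted(((t, p) for t, p in dist.items() if p > 0), key=lambda tp: -tp[1])
        for dist in position_probs
    ]
    if node_budget <= 0 or not ranked or not ranked[0]:
        return []

    selected = []
    token, prob = ranked[0][0]
    # Heap entries: (negated log score, prefix, parent log score, rank of last token).
    heap = [(-math.log(prob), (token,), 0.0, 0)]
    while heap and len(selected) < node_budget:
        neg_score, prefix, parent_score, rank = heapq.heappop(heap)
        score = -neg_score
        selected.append((prefix, score))
        depth = len(prefix) - 1
        # Next-best sibling: same parent, next-ranked token at this position.
        if rank + 1 < len(ranked[depth]):
            token, prob = ranked[depth][rank + 1]
            sibling_score = parent_score + math.log(prob)
            heapq.heappush(
                heap, (-sibling_score, prefix[:-1] + (token,), parent_score, rank + 1)
            )
        # Best child: extend this prefix with the top token of the next position.
        if depth + 1 < len(ranked) and ranked[depth + 1]:
            token, prob = ranked[depth + 1][0]
            heapq.heappush(heap, (-(score + math.log(prob)), prefix + (token,), score, 0))
    return selected
```

Because each probability is at most 1, a child never scores higher than its parent, so popping from the heap yields prefixes in non-increasing surrogate score and the selected set is always closed under taking parents. Verifying the resulting tree in one target-model pass with an ancestor-only attention mask is outside the scope of this sketch.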

Chinese Translation

推测解码通过使用轻量级草图生成器提出多个未来标记，从而加速自回归语言模型，目标模型随后并行验证这些标记。DFlash展示了块扩散草图生成器能够在一次前向传播中生成整个草图块，并实现了最先进的推测解码性能，超越了强大的自回归草图生成器，如EAGLE-3。然而，普通的DFlash每轮仍然仅验证单一的草图轨迹，这可能限制其接受长度。我们引入了DDTree（扩散草图树），一种直接从块扩散草图生成器的每个位置分布构建草图树的方法。在固定节点预算下，DDTree使用简单的最佳优先堆算法选择最有可能与目标模型匹配的延续，这些延续是根据草图模型输出定义的替代品来确定的。生成的树通过使用仅祖先注意力掩码在一次目标模型前向传播中高效验证。由于DDTree基于DFlash，这一领先的推测解码草图模型，这些收益使DDTree成为推测解码领域的领先方法之一。


cs.CL / 65 / 2604.12995

PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

PolicyLLM：迈向大型语言模型对公共政策的卓越理解

Bao, Han, Zhang, Penghao, Huang, Yue, Yuan, Zhengqing, Ru, Yanchi, Su, Rui, Zhou, Yujun, Wang, Xiangqi, Guo, Kehan, Chawla, Nitesh V, Ye, Yanfang, Zhang, Xiangliang

Abstract

Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed model demonstrates stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yields the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

Chinese Translation

大型语言模型（LLMs）正日益融入现实世界的决策过程，包括公共政策领域。然而，它们在理解和推理政策相关内容方面的能力尚未得到充分探索。为填补这一空白，我们提出了PolicyBench，这是首个跨系统（美中）的大规模政策理解基准，涵盖了21000个案例，涉及广泛的政策领域，反映了现实治理的多样性和复杂性。基于布鲁姆分类法（Bloom's taxonomy），该基准评估三项核心能力：（1）记忆力：政策知识的事实回忆，（2）理解力：概念和语境推理，以及（3）应用力：现实政策场景中的问题解决。基于此基准，我们进一步提出了PolicyMoE，一种领域专用的专家混合模型（Mixture-of-Experts，MoE），其专家模块对应每个认知层级。实验结果表明，所提模型在面向应用的政策任务上表现优于记忆或概念理解任务，并在结构化推理任务中取得最高准确率。我们的研究揭示了当前大型语言模型在政策理解方面的关键局限，并为构建更可靠、更聚焦政策的模型指明了方向。


cs.CL / 66 / 2604.13006

One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

一步之差即崩溃：指令微调模型的帮助性脆弱性

Potraghloo, Erfan Baghaei, Azizi, Seyedarmin, Kundu, Souvik, Pedram, Massoud

Abstract

Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14-48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77-100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59-96% of response length, and linear probes on prompt representations predict response length with $R^2$ between 0.51 and 0.93 before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.

Chinese Translation

指令微调的大型语言模型能够生成有帮助且结构化的响应，但当施加简单约束时，这种帮助性有多稳健？我们展示了简单的词汇约束（禁止使用单个标点符号或常用词）会导致指令微调的LLM响应崩溃，在三个开源权重模型家族和一个闭源权重模型（GPT-4o-mini）的成对评估中，响应的全面性损失达到14%至48%。在由GPT-4o-mini和GPT-4o评判的1920次成对比较中，基线响应被偏好77%至100%。值得注意的是，GPT-4o-mini表现出31%的全面性损失（基线胜率99%），表明这种脆弱性同样存在于商业部署的闭源权重模型中，这与此前关于格式级约束的发现相悖。通过机制分析，我们将其归因于规划失败：两遍生成（先自由生成后受限重写）恢复了59%至96%的响应长度，且对提示表示的线性探针在生成开始前即可预测响应长度，$R^2$值介于0.51至0.93，且$R^2$值与模型间的崩溃严重程度相关。同样的探针在基础模型上产生负的$R^2$，确认指令微调创造了编码崩溃决策的表征结构。关键的是，基础模型在相同约束下无系统性崩溃，表现为小幅、噪声性且双向的影响，表明指令微调通过将任务能力与狭窄的表面形式模板耦合，导致了这种脆弱性。该效应在MT-Bench的所有八个任务类别中均得到复现。我们进一步展示，标准的独立LLM作为评判者的评估仅检测到平均3.5%的质量下降，而成对评估揭示了23%的下降，暴露了当前受限生成评估方法的盲点。


cs.CL / 67 / 2604.13018

Toward Autonomous Long-Horizon Engineering for ML Research

迈向自主的长周期机器学习研究工程

Chen, Guoxin, Chen, Jie, Chen, Lei, Zhao, Jiale, Meng, Fanzhe, Zhao, Wayne Xin, Song, Ruihua, Chen, Cheng, Wen, Ji-Rong, Jia, Kai

Abstract

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves the PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it lowers PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
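To make the idea of a permission-scoped, file-backed workspace more concrete, here is a toy sketch. It is not AiScientist's implementation: the `Workspace` class, its methods, the agent names, and the choice to scope writes by top-level directory while leaving reads open are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

class Workspace:
    """Toy file-backed workspace with per-agent write scopes (hypothetical)."""

    def __init__(self, root):
        self.root = Path(root)
        self.write_scopes = {}  # agent name -> top-level dirs it may write to

    def grant(self, agent, *dirs):
        self.write_scopes.setdefault(agent, set()).update(dirs)

    def write(self, agent, rel_path, text):
        rel = Path(rel_path)
        # Reject empty paths and anything that could escape the workspace root.
        if not rel.parts or rel.is_absolute() or ".." in rel.parts:
            raise PermissionError(f"invalid workspace path: {rel_path!r}")
        top = rel.parts[0]
        if top not in self.write_scopes.get(agent, set()):
            raise PermissionError(f"{agent} may not write under {top}/")
        target = self.root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    def read(self, rel_path):
        # Reads are unrestricted in this sketch; any agent can re-ground on a file.
        return (self.root / rel_path).read_text(encoding="utf-8")

    def workspace_map(self):
        """Sorted list of files, the kind of compact view a controller could keep."""
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )

with tempfile.TemporaryDirectory() as tmp:
    ws = Workspace(tmp)
    ws.grant("planner", "plans")
    ws.grant("coder", "code")

    ws.write("planner", "plans/plan.md", "1. reproduce baseline\n")
    # The coder picks up the plan from the file, not from a chat message.
    plan = ws.read("plans/plan.md")
    ws.write("coder", "code/train.py", "# implements: " + plan)

    try:
        ws.write("coder", "plans/plan.md", "overwritten")
        blocked = False
    except PermissionError:
        blocked = True

    files = ws.workspace_map()

print(files, blocked)
```

In this sketch the controller only needs the short file listing, while the full content of the plan and the code stays on disk for each agent to re-read, which is one way to picture "thin control over thick state".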

Chinese Translation

自主人工智能研究发展迅速，但长周期的机器学习研究工程仍然具有挑战性：智能体必须在任务理解、环境搭建、实现、实验和调试等多个环节中持续保持连贯的进展，时间跨度可达数小时甚至数天。我们提出了AiScientist，一种基于简单原则构建的自主长周期机器学习研究工程系统：强大的长周期性能既需要结构化的协调，也需要持久的状态连续性。为此，AiScientist结合了分层协调机制与基于权限范围的File-as-Bus工作空间：顶层的Orchestrator通过简明的摘要和工作空间地图维持阶段级控制，而专门化智能体则反复基于持久的工件（如分析、计划、代码和实验证据）进行重新定位，而非主要依赖对话式的交接，从而实现对复杂状态的精细控制。在两个互补的基准测试中，AiScientist在PaperBench上的得分平均提升了10.54分，优于最佳匹配基线，并在MLE-Bench Lite上达到了81.82%的Any Medal%。消融研究进一步表明，File-as-Bus协议是性能提升的关键因素，移除该协议会导致PaperBench得分下降6.41分，MLE-Bench Lite下降31.82分。这些结果表明，长周期机器学习研究工程是一个协调专门化工作与持久项目状态的系统性问题，而非单纯的局部推理问题。


arXiv Papers

MVAdapt: Zero-Shot Multi-Vehicle Adaptation for End-to-End Autonomous Driving

BIND-USBL: Bounding IMU Navigation Drift using USBL in Heterogeneous ASV-AUV Teams

M2HRI: An LLM-Driven Multimodal Multi-Agent Framework for Personalized Human-Robot Interaction

Bipedal-Walking-Dynamics Model on Granular Terrains

Complementarity by Construction: A Lie-Group Approach to Solving Quadratic Programs with Linear Complementarity Constraints

ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting

A Foot Resistive Force Model for Legged Locomotion on Muddy Terrains

3DRO: Lidar-level SE(3) Direct Radar Odometry Using a 2D Imaging Radar and a Gyroscope

Dynamic Modeling and Robust Gait Optimization of a Compliant Worm Robot

Ternary Logic Encodings of Temporal Behavior Trees with Application to Control Synthesis

Uncertainty Guided Exploratory Trajectory Optimization for Sampling-Based Model Predictive Control

Robotic Nanoparticle Synthesis via Solution-based Processes

Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

Asymptotically Stable Gait Generation and Instantaneous Walkability Determination for Planar Almost Linear Biped with Knees

Defining and Evaluation Method for External Human-Machine Interfaces

RACF: A Resilient Autonomous Car Framework with Object Distance Correction

D-BDM: A Direct and Efficient Boundary-Based Occupancy Grid Mapping Framework for LiDARs

HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models

Designing for Error Recovery in Human-Robot Interaction

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

Social Learning Strategies for Evolved Virtual Soft Robots

DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation

Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Machine Learning-Based Real-Time Detection of Compensatory Trunk Movements Using Trunk-Wrist Inertial Measurement Units

Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

Reliability-Guided Depth Fusion for Glare-Resilient Navigation Costmaps

Actuation space reduction to facilitate insightful shape matching in a novel reconfigurable tendon driven continuum manipulator

VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response

GGD-SLAM: Monocular 3DGS SLAM Powered by Generalizable Motion Model for Dynamic Environments

PAINT: Partner-Agnostic Intent-Aware Cooperative Transport with Legged Robots

Evolving the Complete Muscle: Efficient Morphology-Control Co-design for Musculoskeletal Locomotion

OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation

FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators

Frequency-aware Decomposition Learning for Sensorless Wrench Forecasting on a Vibration-rich Hydraulic Manipulator

Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots

E2E-Fly: An Integrated Training-to-Deployment System for End-to-End Quadrotor Autonomy

DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding

RMGS-SLAM: Real-time Multi-sensor Gaussian Splatting SLAM

XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios

Learning Versatile Humanoid Manipulation with Touch Dreaming

UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators

MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

A Workflow to Efficiently Generate Dense Tissue Ground Truth Masks for Digital Breast Tomosynthesis

EigenCoin: sassanid coins classification based on Bhattacharyya distance

Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery

INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Ultra-low-light computer vision using trained photon correlations

The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection

Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs

Privacy-Preserving Structureless Visual Localization via Image Obfuscation

OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA

INST-Align: Implicit Neural Alignment for Spatial Transcriptomics via Canonical Expression Fields

PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning

PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

ViLL-E: Video LLM Embeddings for Retrieval

Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

Nucleus-Image: Sparse MoE for Image Generation

Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models

ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement

DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion

Boosting Robust AIGI Detection with LoRA-based Pairwise Training

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors