arXiv Daily Digest

253 Papers

SL(C)AMma: Simultaneous Localisation, (Calibration) and Mapping With a Magnetometer Array

SL(C)AMma：基于磁力计阵列的同时定位、（标定）与地图构建

Edridge, Thomas, Kok, Manon

Abstract

Indoor localisation techniques suffer from attenuated Global Navigation Satellite System (GNSS) signals and from the accumulation of unbounded drift by integration of proprioceptive sensors. Magnetic field-based Simultaneous Localisation and Mapping (SLAM) reduces drift through loop closures by revisiting previously seen locations, but extended exploration of unseen areas remains challenging. Recently, magnetometer arrays have demonstrated significant benefits over single magnetometers, as they can directly estimate the odometry. However, inconsistencies between magnetometer measurements negatively affect odometry estimates and complicate loop closure detection. We propose two filtering algorithms: the first focuses on magnetic field-based SLAM using a magnetometer array (SLAMma). The second extends this to jointly estimate the magnetometer calibration parameters (SLCAMma). We demonstrate, using Monte Carlo simulations, that the calibration parameters can be accurately estimated when there is sufficient orientation excitation, and that magnetometers achieve inter-sensor measurement consistency regardless of the type of motion. Experimental validation on ten datasets confirms these results, and we demonstrate that in cases where single magnetometer SLAM fails, SLAMma and SLCAMma provide good trajectory estimates with more than 80% drift reduction compared to integration of proprioceptive sensors.

Chinese Translation

室内定位技术受到全球导航卫星系统（GNSS）信号衰减和本体传感器积分所导致的无界漂移累积的影响。基于磁场的同时定位与地图构建（SLAM）通过重新访问先前见过的位置来减少漂移，但对未见区域的扩展探索仍然具有挑战性。最近，磁力计阵列显示出相较于单个磁力计的显著优势，因为它们可以直接估计里程计。然而，磁力计测量之间的不一致性会对里程计估计产生负面影响，并使得闭环检测变得复杂。我们提出了两种滤波算法：第一种专注于使用磁力计阵列的基于磁场的SLAM（SLAMma）。第二种则扩展到联合估计磁力计的标定参数（SLCAMma）。我们通过蒙特卡洛仿真展示，当存在足够的方向激励时，标定参数可以被准确估计，并且无论运动类型如何，磁力计都能实现传感器间测量一致性。对十个数据集的实验验证确认了这些结果，并且我们展示了在单个磁力计SLAM失败的情况下，SLAMma和SLCAMma能够提供良好的轨迹估计，相较于本体传感器的积分，漂移减少超过80%。

cs.RO / 2 / 2604.19962

Radar Odometry Subject to High Tilt Dynamics of Subarctic Environments

受亚北极环境高倾斜动态影响的雷达里程计

Boxan, Matěj, Larivée-Hardy, William, Pomerleau, François

Abstract

Rotating FMCW radar odometry methods often assume flat ground conditions. While this assumption is sufficient in many scenarios, including urban environments or flat mining setups, the highly dynamic terrain of subarctic environments poses a challenge to standard feature extraction and state estimation techniques. This paper benchmarks three existing radar odometry methods under demanding conditions, exhibiting up to 13° in pitch and 4° in roll difference between consecutive scans, with absolute pitch and roll reaching 30° and 8°, respectively. Furthermore, we propose a novel radar-inertial odometry method utilizing tilt-proximity submap search and a hard threshold for vertical displacement between scan points and the estimated axis of rotation. Experimental results demonstrate state-of-the-art performance of our method on an urban baseline and a 0.3% improvement over the second-best comparative method on a 2-kilometer-long dynamic trajectory. Finally, we analyze the performance of the four evaluated methods on a complex radar sequence characterized by high lateral slip and a steep ditch traversal.

Chinese Translation

旋转的FMCW雷达里程计方法通常假设地面条件平坦。虽然这一假设在许多场景中是足够的，包括城市环境或平坦的矿区设置，但亚北极环境的高度动态地形对标准特征提取和状态估计技术构成了挑战。本文在苛刻条件下对三种现有的雷达里程计方法进行了基准测试，连续扫描之间的俯仰角差异可达13°，滚转角差异可达4°，绝对俯仰角和滚转角分别达到30°和8°。此外，我们提出了一种新颖的雷达-惯性里程计方法，利用倾斜邻近子图搜索和扫描点与估计旋转轴之间的垂直位移的硬阈值。实验结果表明，我们的方法在城市基准测试中表现出色，并在一条2公里长的动态轨迹上相较于第二好的比较方法提高了0.3%的性能。最后，我们分析了四种评估方法在一个复杂雷达序列上的表现，该序列以高侧滑和陡沟横穿为特征。

cs.RO / 3 / 2604.19980

Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems

基于线性库普曼动力学的非线性机器人系统高效强化学习

Hao, Wenjian, Fang, Yuxuan, Lu, Zehui, Mou, Shaoshuai

Abstract

This paper presents a model-based reinforcement learning (RL) framework for optimal closed-loop control of nonlinear robotic systems. The proposed approach learns linear lifted dynamics through Koopman operator theory and integrates the resulting model into an actor-critic architecture for policy optimization, where the policy represents a parameterized closed-loop controller. To reduce computational cost and mitigate model rollout errors, policy gradients are estimated using one-step predictions of the learned dynamics rather than multi-step propagation. This leads to an online mini-batch policy gradient framework that enables policy improvement from streamed interaction data. The proposed framework is evaluated on several simulated nonlinear control benchmarks and two real-world hardware platforms, including a Kinova Gen3 robotic arm and a Unitree Go1 quadruped. Experimental results demonstrate improved sample efficiency over model-free RL baselines, superior control performance relative to model-based RL baselines, and control performance comparable to classical model-based methods that rely on exact system dynamics.

Chinese Translation

本文提出了一种基于模型的强化学习（RL）框架，用于非线性机器人系统的最优闭环控制。所提出的方法通过库普曼算子理论学习线性提升动力学，并将得到的模型集成到演员-评论家架构中进行策略优化，其中策略表示一个参数化的闭环控制器。为了降低计算成本并减轻模型展开误差，策略梯度是通过对学习到的动力学进行一步预测而非多步传播来估计的。这导致了一种在线小批量策略梯度框架，使得可以从流式交互数据中进行策略改进。所提出的框架在多个模拟的非线性控制基准和两个真实硬件平台上进行了评估，包括Kinova Gen3机器人手臂和Unitree Go1四足机器人。实验结果表明，相较于无模型强化学习基线，样本效率得到了改善；相较于基于模型的强化学习基线，控制性能更优；并且控制性能与依赖于精确系统动力学的经典基于模型的方法相当。
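
As a rough illustration of what "learning linear lifted dynamics" can look like, here is a minimal sketch that fits z' ≈ A z + B u by least squares on hand-picked observables (an EDMD-style fit). This is an assumption-laden toy, not the authors' method: the abstract does not specify the lifting functions, the training procedure, or how the model feeds the actor-critic update. The names `lift`, `fit_lifted_linear`, and `predict_one_step` are hypothetical.

```python
import numpy as np

def lift(x):
    """Hypothetical observables: the state, its squares, and a constant."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return np.hstack([x, x**2, np.ones((x.shape[0], 1))])

def fit_lifted_linear(X, U, X_next):
    """Least-squares fit of z' ~= A z + B u on lifted snapshot pairs."""
    Z = lift(X)                # lifted current states, shape (N, nz)
    Z_next = lift(X_next)      # lifted next states, shape (N, nz)
    ZU = np.hstack([Z, U])     # regressors [z, u], shape (N, nz + nu)
    W, *_ = np.linalg.lstsq(ZU, Z_next, rcond=None)
    nz = Z.shape[1]
    A = W[:nz].T               # (nz, nz)
    B = W[nz:].T               # (nz, nu)
    return A, B

def predict_one_step(A, B, x, u):
    """One-step prediction in lifted space (no multi-step rollout)."""
    z = lift(x)[0]
    return A @ z + B @ np.atleast_1d(np.asarray(u, dtype=float))
```

Using only `predict_one_step` inside a gradient estimate mirrors the abstract's point about avoiding multi-step propagation, where model errors would compound.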

cs.RO / 4 / 2604.20017

Strain in Sound: Soft Corrugated Tube for Local Strain Sensing with Acoustic Resonance

声中的应变：用于声学共振的软波纹管局部应变传感器

Chun, Michael, Nukala, Ananya, Huh, Tae Myung

Abstract

We present a soft corrugated tube sensor designed to estimate strain in each half segment. When air flows through the tube, the internal corrugated cavities induce pressure oscillations that excite the tube's standing wave resonance mode, generating an acoustic tone. Stretching the tube affects both the resonance mode frequency, due to changes in overall length, and the frequency-flow speed relationship, due to variations in cavity width, which is particularly useful for local strain estimation. By sweeping flow rates in a controlled manner, we collected resonance frequency data across flow speeds under various local stretch conditions, enabling a machine learning algorithm (gradient boosting regressor) to estimate segmental strain with high accuracy. The dual-period tube design (3.1 mm and 4.18 mm corrugation periods) achieved a mean absolute error (MAE) of 0.8 mm, while the single-period tube (3.1 mm) provided a satisfactory MAE of 1 mm. Testing on a mannequin finger demonstrated the sensor's capability to differentiate multi-joint configurations, showing its potential for estimating non-uniform deformations in soft bodies.

Chinese Translation

我们提出了一种软波纹管传感器，旨在估算每个半段的应变。当空气流经管道时，内部的波纹腔体会引起压力振荡，从而激发管道的驻波共振模式，产生声学音调。拉伸管道会影响共振模式频率，这主要是由于整体长度的变化，以及频率与流速关系的变化，这对于局部应变估算特别有用。通过以受控方式改变流速，我们在不同的局部拉伸条件下收集了流速下的共振频率数据，使得机器学习算法（梯度提升回归器）能够高精度地估算段应变。双周期管设计（3.1 mm 和 4.18 mm 的波纹周期）实现了 0.8 mm 的平均绝对误差（MAE），而单周期管（3.1 mm）则提供了令人满意的 1 mm MAE。在假人手指上的测试展示了传感器区分多关节配置的能力，显示了其在估算软体物体非均匀变形方面的潜力。

cs.RO / 5 / 2604.20100

JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

JoyAI-RA 0.1：一种用于机器人自主性的基础模型

Zhang, Tianle, Yuan, Zhihao, Chi, Dafeng, Liu, Peidong, Li, Dongwei, Hu, Kejun, Zhang, Likui, Nie, Junnan, Wei, Ziming, Chen, Zengjue, Tang, Yili, Li, Jiayi, Xiang, Zhiyuan, Li, Mingyang, Luo, Tianci, Wan, Hanwen, Li, Ao, Zhai, Linbo, Zhan, Zhihao, Zhuang, Yuzheng, Lin, Liang, Bai, Xiaodong, Cai, Jiakun, Cao, Peng, Chen, Kangliang, Chen, Siang, Dai, Yixiang, Di, Shuai, Duan, Nan, Gong, Yicheng, Gui, Chenguang, Guo, Yucheng, Hao, Peng, He, Qingrong, Huang, Haoyang, Huang, Kunrui, Huang, Zhixuan, Jin, Shibo, Jin, Yixiang, Li, Anson, Li, Dongjiang, Li, Jiawei, Li, Ruodai, Li, Yihang, Li, Yuzhen, Liang, Jiaming, Liu, Fangsheng, Long, Jing, Luo, Mingxi, Pan, Xing, Shen, Hui, Tian, Xiaomeng, Wang, Daming, Wang, Song, Xiong, Junwu, Xu, Hang, Xu, Wanting, Yu, Zhengcheng, Zhang, He, Zhang, Jiyao, Zhao, Lin, Zhou, Chen

Abstract

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.

Chinese Translation

开放世界环境中的机器人自主性在根本上受到数据多样性不足和跨体现泛化能力差的限制。现有的机器人数据集通常在规模和任务覆盖范围上有限，而机器人体现之间的较大差异则妨碍了有效的行为知识转移。为了解决这些挑战，我们提出了JoyAI-RA，这是一种针对可泛化机器人操作的视觉-语言-动作（VLA）体现基础模型。JoyAI-RA提出了一种多源多层次的预训练框架，整合了网络数据、大规模自我中心人类操作视频、模拟生成的轨迹以及真实机器人数据。通过在异构多源数据上进行训练，并进行明确的动作空间统一，JoyAI-RA有效地弥合了体现之间的差距，特别是在人体操作与机器人控制之间，从而增强了跨体现的行为学习。JoyAI-RA在模拟和现实世界基准测试中均优于最先进的方法，尤其是在具有泛化需求的多样化任务上表现突出。

cs.RO / 6 / 2604.20151

Toward Safe Autonomous Robotic Endovascular Interventions using World Models

基于世界模型的安全自主机器人血管内介入研究

Robertshaw, Harry, Fischer, Nikola, Wu, Han-Ru, Perez, Andrea Walker, Deng, Weiyuan, Jackson, Benjamin, Bergeles, Christos, Granados, Alejandro, Booth, Thomas C

Abstract

Autonomous mechanical thrombectomy (MT) presents substantial challenges due to highly variable vascular geometries and the requirements for accurate, real-time control. While reinforcement learning (RL) has emerged as a promising paradigm for the automation of endovascular navigation, existing approaches often show limited robustness when faced with diverse patient anatomies or extended navigation horizons. In this work, we investigate a world-model-based framework for autonomous endovascular navigation built on TD-MPC2, a model-based RL method that integrates planning and learned dynamics. We evaluate a TD-MPC2 agent trained on multiple navigation tasks across held-out patient-specific vasculatures and benchmark its performance against the state-of-the-art Soft Actor-Critic (SAC) algorithm agent. Both approaches are further validated in vitro using patient-specific vascular phantoms under fluoroscopic guidance. In simulation, TD-MPC2 demonstrates a significantly higher mean success rate than SAC (58% vs. 36%, p < 0.001), and mean tip contact forces of 0.15 N, well below the proposed 1.5 N vessel rupture threshold. Mean success rates for TD-MPC2 (68%) were comparable to SAC (60%) in vitro, but TD-MPC2 achieved superior path ratios (p = 0.017) at the cost of longer procedure times (p < 0.001). Together, these results provide the first demonstration of autonomous MT navigation validated across both held-out in silico data and fluoroscopy-guided in vitro experiments, highlighting the promise of world models for safe and generalizable AI-assisted endovascular interventions.

Chinese Translation

自主机械取栓术（MT）面临着由于血管几何形状高度变化和对精确实时控制的要求而带来的重大挑战。尽管强化学习（RL）已成为血管内导航自动化的一个有前景的范式，但现有方法在面对多样化的患者解剖结构或扩展的导航范围时，往往表现出有限的鲁棒性。在本研究中，我们探讨了一种基于世界模型的自主血管内导航框架，该框架建立在TD-MPC2之上，这是一种将规划与学习动态相结合的基于模型的RL方法。我们评估了在多个导航任务上训练的TD-MPC2智能体，该智能体在特定患者的血管中进行测试，并将其性能与最先进的软演员-评论家（Soft Actor-Critic, SAC）算法智能体进行了基准比较。两种方法均在体外使用患者特定的血管模型在透视引导下进行了进一步验证。在仿真中，TD-MPC2的平均成功率显著高于SAC（58%对36%，p < 0.001），且平均尖端接触力为0.15 N，远低于建议的1.5 N血管破裂阈值。TD-MPC2在体外的平均成功率（68%）与SAC（60%）相当，但TD-MPC2在路径比率上表现出更优（p = 0.017），但程序时间更长（p < 0.001）。综合来看，这些结果首次展示了在保留的计算机模拟数据和透视引导的体外实验中验证的自主MT导航，突显了世界模型在安全和可推广的人工智能辅助血管内介入中的潜力。

cs.RO / 7 / 2604.20193

LLM-Guided Safety Agent for Edge Robotics with an ISO-Compliant Perception-Compute-Control Architecture

基于LLM引导的边缘机器人安全代理：符合ISO标准的感知-计算-控制架构

Huang, Xu, Zhang, Ruofan, Cheng, Lu, Song, Yuefeng, Huang, Xu, Zhang, Huayu, Yin, Sheng, Liang, Anyang, Qian, Chen, Zhou, Yin, Yuan, Xiaoyun, Cheng, Yuan

Abstract

Ensuring functional safety in human-robot interaction is challenging because AI perception is inherently probabilistic, whereas industrial standards require deterministic behavior. We present an LLM-guided safety agent for edge robotics, built on an ISO-compliant low-latency perception-compute-control architecture. Our method translates natural-language safety regulations into executable predicates and deploys them through a redundant heterogeneous edge runtime. For fault-tolerant closed-loop execution under edge constraints, we adopt a symmetric dual-modular redundancy design with parallel independent execution for low-latency perception, computation, and control. We prototype the system on a dual-RK3588 platform and evaluate it in representative human-robot interaction scenarios. The results demonstrate a practical edge implementation path toward ISO 13849 Category 3 and PL d using cost-effective hardware, supporting practical deployment of safety-critical embodied AI.

Chinese Translation

在人机交互中确保功能安全是一项挑战，因为人工智能感知本质上是概率性的，而工业标准要求确定性行为。我们提出了一种基于LLM引导的边缘机器人安全代理，构建在符合ISO标准的低延迟感知-计算-控制架构之上。我们的方法将自然语言安全规范转换为可执行的谓词，并通过冗余异构边缘运行时进行部署。为了在边缘约束下实现容错闭环执行，我们采用了对称双模冗余设计，针对低延迟的感知、计算和控制进行并行独立执行。我们在双RK3588平台上原型化该系统，并在典型的人机交互场景中进行了评估。结果表明，使用成本效益高的硬件实现ISO 13849类别3和PL d的实际边缘实现路径，支持安全关键型具身人工智能的实际部署。

cs.RO / 8 / 2604.20208

Stochastic Barrier Certificates in the Presence of Dynamic Obstacles

动态障碍下的随机障碍证书

Mazouz, Rayan, Laurenti, Luca, Lahijanian, Morteza

Abstract

Safety of stochastic dynamic systems in environments with dynamic obstacles is studied in this paper through the lens of stochastic barrier functions. We introduce both time-invariant and time-varying barrier certificates for discrete-time, continuous-space systems subject to uncertainty, which provide certified lower bounds on the probability of remaining within a safe set over a finite horizon. These certificates explicitly account for time-varying unsafe regions induced by obstacle dynamics. By leveraging Bellman's optimality perspective, the time-varying formulation directly captures temporal structure and yields less conservative bounds than state-of-the-art approaches. By restricting certificates to polynomial functions, we show that time-varying barrier synthesis can be formulated as a convex sum-of-squares program, enabling tractable optimization. Empirical evaluations on nonlinear systems with dynamic obstacles show that time-varying certificates consistently achieve tight guarantees, demonstrating improved accuracy and scalability over state-of-the-art methods.

Chinese Translation

本文通过随机障碍函数的视角研究了在动态障碍环境中随机动态系统的安全性。我们为受不确定性影响的离散时间、连续空间系统引入了时间不变和时间变化的障碍证书，这些证书提供了在有限时间范围内保持在安全集合内的概率的认证下界。这些证书明确考虑了由障碍动态引起的时间变化的不安全区域。通过利用贝尔曼的最优性视角，时间变化的形式直接捕捉了时间结构，并且比最先进的方法提供了更不保守的界限。通过将证书限制为多项式函数，我们展示了时间变化的障碍合成可以被表述为一个凸的平方和程序，从而实现可处理的优化。在具有动态障碍的非线性系统上的实证评估表明，时间变化的证书始终能够实现紧凑的保证，展示出比最先进的方法更高的准确性和可扩展性。
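
For orientation, the block below states a commonly used time-invariant stochastic barrier condition from the barrier-function literature and the finite-horizon bound it yields. It is a sketch under assumptions, not the paper's formulation: the abstract does not give the exact conditions, and the time-varying certificates it proposes for moving obstacles are not reproduced here. The symbols $f$, $w$, $X$, $X_0$, $X_s$, $X_u$, $\eta$, $\beta$, and $N$ are illustrative.

```latex
% Assumed setting: x_{k+1} = f(x_k, w_k) on state space X, with initial set X_0,
% safe set X_s, unsafe set X_u, horizon N, and constants \eta, \beta \ge 0.
\begin{align}
  B(x) &\ge 0 && \forall x \in X, \\
  B(x) &\le \eta && \forall x \in X_0, \\
  B(x) &\ge 1 && \forall x \in X_u, \\
  \mathbb{E}\big[ B(f(x, w)) \mid x \big] &\le B(x) + \beta && \forall x \in X_s,
\end{align}
% If such a B exists, the probability of staying safe over N steps is bounded below:
\begin{equation}
  \Pr\big( x_k \in X_s \ \text{for all } k \le N \big) \;\ge\; 1 - (\eta + \beta N).
\end{equation}
```

Restricting $B$ to polynomials is what allows conditions of this kind to be checked with sum-of-squares programming, as the abstract notes for the time-varying case.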

cs.RO / 9 / 2604.20231

Toward Cooperative Driving in Mixed Traffic: An Adaptive Potential Game-Based Approach with Field Test Verification

朝向混合交通中的协作驾驶：一种基于自适应潜力博弈的方法及实地测试验证

Fang, Shiyu, Zhao, Xiaocong, Liu, Xuekai, Hang, Peng, Wang, Jianqiang, Wang, Yunpeng, Sun, Jian

Abstract

Connected autonomous vehicles (CAVs), which represent a significant advancement in autonomous driving technology, have the potential to greatly increase traffic safety and efficiency through cooperative decision-making. However, existing methods often overlook the individual needs and heterogeneity of cooperative participants, making it difficult to transfer them to environments where they coexist with human-driven vehicles (HDVs). To address this challenge, this paper proposes an adaptive potential game (APG) cooperative driving framework. First, the system utility function is established on the basis of a general form of individual utility and its monotonic relationship, allowing for the simultaneous optimization of both individual and system objectives. Second, the Shapley value is introduced to compute each vehicle's marginal utility within the system, allowing its varying impact to be quantified. Finally, the HDV preference estimation is dynamically refined by continuously comparing the observed HDV behavior with the APG's estimated actions, leading to improvements in overall system safety and efficiency. Ablation studies demonstrate that adaptively updating Shapley values and HDV preference estimation significantly improve cooperation success rates in mixed traffic. Comparative experiments further highlight the APG's advantages in terms of safety and efficiency over other cooperative methods. Moreover, the applicability of the approach to real-world scenarios was validated through field tests.

Chinese Translation

连接的自动驾驶车辆（CAVs）代表了自动驾驶技术的重要进展，具有通过协作决策显著提高交通安全性和效率的潜力。然而，现有方法往往忽视了协作参与者的个体需求和异质性，使其难以转移到与人类驾驶车辆（HDVs）共存的环境中。为了解决这一挑战，本文提出了一种自适应潜力博弈（APG）协作驾驶框架。首先，基于个体效用的一般形式及其单调关系建立系统效用函数，从而实现个体目标和系统目标的同时优化。其次，引入夏普利值（Shapley value）来计算系统中每辆车的边际效用，从而量化其变化影响。最后，通过持续比较观察到的HDV行为与APG估计的行动，动态优化HDV偏好估计，从而提高整体系统的安全性和效率。消融研究表明，自适应更新夏普利值和HDV偏好估计显著提高了混合交通中的合作成功率。比较实验进一步突显了APG在安全性和效率方面相较于其他协作方法的优势。此外，通过实地测试验证了该方法在现实场景中的适用性。
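
To make the Shapley-value step concrete, the sketch below computes exact Shapley values by averaging each player's marginal contribution over all orderings. It illustrates the general definition only: the paper's utility function and any approximation it uses are not described in the abstract, and `shapley_values` and the `value` callback are hypothetical names. The enumeration is factorial in the number of players, so it only suits small groups of vehicles.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to that coalition's utility.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal utility of p when joining the players ahead of it
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}
```

In a game where utility is created only when vehicles "a" and "b" cooperate, the two split the credit equally and a third, uninvolved vehicle receives zero.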

cs.RO / 10 / 2604.20246

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Cortex 2.0：将世界模型应用于实际工业部署

Aida, Adriana, Amer, Walida, Bankovic, Katarina, Behl, Dhruv, Busch, Fabian, Bhalla, Annie, Duong, Minh, Gienger, Florian, Godse, Rohan, Grachev, Denis, Gulde, Ralf, Hagensieker, Elisa, Hu, Junpeng, Joshi, Shivam, Knoblauch, Tobias, Kumar, Likith, LaRocque, Damien, Lokesh, Keerthana, Moured, Omar, Nguyen, Khiem, Preyss, Christian, Sriganesan, Ranjith, Singh, Vikram, Sponner, Carsten, Tong, Anh, Tuscher, Dominik, Tuscher, Marc, Upputuri, Pavan

Abstract

Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive. By optimizing the next action given the current observation without evaluating potential futures, they are brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on a single-arm and dual-arm manipulation platform across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state-of-the-art Vision-Language-Action baselines, achieving the best results across all tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.

Chinese Translation

工业机器人操作要求在不同的体现、任务和变化的物体分布中实现可靠的长期执行。尽管视觉-语言-行动（Vision-Language-Action）模型已展示出强大的泛化能力，但它们仍然是根本上反应式的。通过在当前观察的基础上优化下一步行动，而不评估潜在的未来，它们在长期任务的复合失败模式下显得脆弱。Cortex 2.0 从反应控制转向计划与行动，通过在视觉潜在空间中生成候选未来轨迹，对其预期成功和效率进行评分，然后仅对得分最高的候选轨迹进行承诺。我们在一个单臂和双臂操作平台上对Cortex 2.0进行了评估，涵盖了四个复杂性逐渐增加的任务：物品拾取与放置、物品与垃圾分类、螺丝分类和鞋盒拆包。Cortex 2.0 在所有任务中始终优于最先进的视觉-语言-行动基线，取得了最佳结果。该系统在特征复杂、杂乱无章、频繁遮挡和接触丰富的操作环境中仍然保持可靠，而反应策略在这些环境中往往失效。这些结果表明，基于世界模型的规划可以在复杂的工业环境中可靠地运行。

cs.RO / 11 / 2604.20290

Onboard Wind Estimation for Small UAVs Equipped with Low-Cost Sensors: An Aerodynamic Model-Integrated Filtering Approach

基于低成本传感器的小型无人机机载风速估计：一种集成气动模型的滤波方法

Cheng, Bingchen, Ma, Tielin, Fu, Jingcheng, Tao, Lulu, Guo, Tianhui

Abstract

To enable autonomous wind estimation for energy-efficient flight in small unmanned aerial vehicles (UAVs), this study proposes a method that estimates flight states and wind using only the low-cost essential onboard sensors required for autonomous flight, without relying on additional wind measurement devices. The core of the method includes an Extended Kalman Filter (EKF) integrated with the aerodynamic model and an Adaptive Moving Average Estimation (AMAE) technique, which improves the accuracy and smoothness of the wind estimation. Simulation results show that the approach efficiently estimates both steady and time-varying 3D wind vectors without requiring flow angle measurements. The impact of aerodynamic model accuracy on wind estimation errors is also analyzed to assess practical applicability. Flight tests validate the effectiveness of the method and its feasibility for real-time onboard computation. Additionally, uncertainties and error sources encountered during testing are systematically examined, providing a foundation for further refinement.

Chinese Translation

为了实现小型无人机（UAV）在能效飞行中的自主风速估计，本研究提出了一种方法，该方法仅利用实现自主飞行所需的低成本基本机载传感器来估计飞行状态和风速，而无需依赖额外的风速测量设备。该方法的核心包括与气动模型集成的扩展卡尔曼滤波器（Extended Kalman Filter, EKF）和自适应移动平均估计（Adaptive Moving Average Estimation, AMAE）技术，后者提高了风速估计的准确性和平滑性。仿真结果表明，该方法能够有效地估计稳态和时变的三维风速矢量，而无需流动角度测量。同时，分析了气动模型准确性对风速估计误差的影响，以评估其实际应用性。飞行测试验证了该方法的有效性及其实时机载计算的可行性。此外，系统性地检查了测试过程中遇到的不确定性和误差来源，为进一步改进提供了基础。
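
The smoothing idea can be sketched with the wind triangle and a plain fixed-window moving average. This is a simplified stand-in: the paper's AMAE adapts its averaging in a way the abstract does not describe, and its EKF and aerodynamic model are omitted entirely. The class name `MovingAverageWind` and its inputs are hypothetical, and both velocity vectors are assumed to be expressed in the same inertial frame.

```python
from collections import deque

class MovingAverageWind:
    """Fixed-window moving average over wind-triangle estimates."""

    def __init__(self, window):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def update(self, v_ground, v_air):
        # Wind triangle: ground velocity = air-relative velocity + wind
        wind = tuple(g - a for g, a in zip(v_ground, v_air))
        self.samples.append(wind)
        n = len(self.samples)
        # Average each axis over the samples currently in the window
        return tuple(sum(axis) / n for axis in zip(*self.samples))
```

A longer window gives a smoother estimate but reacts more slowly to time-varying wind, which is presumably the trade-off an adaptive scheme is meant to manage.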

cs.RO / 12 / 2604.20295

ETac: A Lightweight and Efficient Tactile Simulation Framework for Learning Dexterous Manipulation

ETac：一个轻量高效的触觉仿真框架用于学习灵巧操作

Xu, Zhe, Zhao, Feiyu, Huang, Xiyan, Xiao, Chenxi

Abstract

Tactile sensors are increasingly integrated into dexterous robotic manipulators to enhance contact perception. However, learning manipulation policies that rely on tactile sensing remains challenging, primarily due to the trade-off between fidelity and computational cost of soft-body simulations. To address this, we present ETac, a tactile simulation framework that models elastomeric soft-body interactions with both high fidelity and efficiency. ETac employs a lightweight data-driven deformation propagation model to capture soft-body contact dynamics, achieving high simulation quality and boosting efficiency that enables large-scale policy training. When serving as the simulation backend, ETac produces surface deformation estimates comparable to FEM and demonstrates applicability for modeling real tactile sensors. Then, we showcase its capability in training a blind grasping policy that leverages large-area tactile feedback to manipulate diverse objects. Running on a single RTX 4090 GPU, ETac supports reinforcement learning across 4,096 parallel environments, achieving a total throughput of 869 FPS. The resulting policy reaches an average success rate of 84.45% across four object types, underscoring ETac's potential to make tactile-based skill learning both efficient and scalable.

Chinese Translation

触觉传感器越来越多地集成到灵巧机器人操控器中，以增强接触感知。然而，依赖触觉感知学习操作策略仍然具有挑战性，主要是由于软体仿真在保真度和计算成本之间的权衡。为了解决这个问题，我们提出了ETac，一个触觉仿真框架，能够高效且高保真地模拟弹性软体之间的相互作用。ETac采用轻量级的数据驱动变形传播模型来捕捉软体接触动态，实现了高仿真质量并提升了效率，从而支持大规模策略训练。作为仿真后端，ETac生成的表面变形估计与有限元法（FEM）相当，并展示了其在真实触觉传感器建模中的适用性。接着，我们展示了其在训练盲抓取策略方面的能力，该策略利用大面积触觉反馈来操控多样物体。在单个RTX 4090 GPU上运行时，ETac支持在4096个并行环境中进行强化学习，达到总吞吐量869 FPS。最终策略在四种物体类型上实现了平均成功率84.45%，突显了ETac在触觉基础技能学习中高效且可扩展的潜力。

cs.RO / 13 / 2604.20305

AdaTracker: Learning Adaptive In-Context Policy for Cross-Embodiment Active Visual Tracking

AdaTracker：跨身体主动视觉跟踪的自适应上下文策略学习

Wu, Kui, Chen, Hao, Han, Jinzhu, Liu, Haijun, Wang, Churan, Wang, Yizhou, Li, Zhoujun, Liu, Si, Zhong, Fangwei

Abstract

Realizing active visual tracking with a single unified model across diverse robots is challenging, as the physical constraints and motion dynamics vary drastically from one platform to another. Existing approaches typically train separate models for each embodiment, leading to poor scalability and limited generalization. To address this, we propose AdaTracker, an adaptive in-context policy learning framework that robustly tracks targets on diverse robot morphologies. Our key insight is to explicitly model embodiment-specific constraints through an Embodiment Context Encoder, which infers them from history. This contextual representation dynamically modulates a Context-Aware Policy, enabling it to infer optimal control actions for unseen embodiments in a zero-shot manner. To enhance robustness, we introduce two auxiliary objectives to ensure accurate context identification and temporal consistency. Experiments in both simulation and the real world demonstrate that AdaTracker significantly outperforms state-of-the-art methods in cross-embodiment generalization, sample efficiency, and zero-shot adaptation.

Chinese Translation

在不同机器人之间实现基于单一统一模型的主动视觉跟踪是一项挑战，因为物理约束和运动动态在不同平台之间差异很大。现有的方法通常为每个身体训练单独的模型，导致可扩展性差和泛化能力有限。为了解决这个问题，我们提出了AdaTracker，一种自适应上下文策略学习框架，能够在多样化的机器人形态上稳健地跟踪目标。我们的关键见解是通过一个身体上下文编码器（Embodiment Context Encoder）显式建模身体特定的约束，该编码器从历史中推断出身体特定的约束。这种上下文表示动态调节上下文感知策略（Context-Aware Policy），使其能够以零样本的方式推断出未见身体的最佳控制动作。为了增强鲁棒性，我们引入了两个辅助目标，以确保准确的上下文识别和时间一致性。在仿真和现实世界中的实验表明，AdaTracker在跨身体泛化、样本效率和零样本适应性方面显著优于最先进的方法。

cs.RO / 14 / 2604.20347

A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

用于自适应超声引导针插入和针跟踪的视觉-语言-动作模型

Zhang, Yuelin, Ding, Qingpeng, Tang, Longxiang, Fang, Chengyu, Cheng, Shing Shin

Abstract

Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.

Chinese Translation

超声（US）引导的针插入是一项关键但具有挑战性的程序，主要由于动态成像条件和针的可视化困难。虽然已经提出了许多自动针插入的方法，但它们通常依赖于手工设计的管道和模块化控制器，在复杂情况下性能下降。本文提出了一种视觉-语言-动作（VLA）模型，用于在机器人超声（RUS）系统上进行自适应和自动化的超声引导针插入和跟踪。该框架提供了一种统一的方法来进行针跟踪和针插入控制，能够根据获得的针位置和环境感知实时、动态地自适应调整插入。为了实现实时和端到端的跟踪，提出了一种交叉深度融合（CDF）跟踪头，集成了来自大规模视觉主干的浅层位置特征和深层语义特征。为了使预训练的视觉主干适应跟踪任务，引入了一种跟踪条件（TraCon）注册器，以实现参数高效的特征条件化。在针跟踪之后，提出了一种不确定性感知控制策略和异步VLA管道，用于自适应针插入控制，确保及时决策以提高安全性和结果。在针跟踪和插入的广泛实验中，我们的方法始终优于最先进的跟踪器和手动操作，实现了更高的跟踪精度、改善的插入成功率和减少的程序时间，突显了基于RUS的智能干预的有希望的方向。

cs.RO / 15 / 2604.20348

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

通过多智能体上下文学习实现双手机器人操作

Palma, Alessio, Spinelli, Indro, Prasad, Vignesh, Scofano, Luca, Jin, Yufeng, Chalvatzaki, Georgia, Galasso, Fabio

Abstract

Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. This naturally extends to Arms' Debate, an iterative refinement process, and to the introduction of a third LLM-as-Judge to evaluate and select the most plausible coordinated trajectories. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods. We further demonstrate strong few-shot generalization on novel tasks.

Chinese Translation

语言模型（LLMs）已成为具身控制的强大推理引擎。特别是，上下文学习（ICL）使得现成的、仅基于文本的LLMs能够在不进行任何特定任务训练的情况下预测机器人动作，同时保持其泛化能力。然而，将ICL应用于双手操作仍然面临挑战，因为高维的关节动作空间和紧密的双臂协调约束迅速超出了标准上下文窗口的处理能力。为了解决这个问题，我们提出了BiCICLe（双手协调上下文学习），这是第一个使标准LLMs能够在不进行微调的情况下执行少量样本的双手操作的框架。BiCICLe将双手控制框架化为一个多智能体的领导-跟随问题，将动作空间解耦为顺序的、条件的单臂预测。这自然扩展到“手臂辩论”（Arms' Debate），一种迭代精炼过程，以及引入第三个LLM作为评判者（Judge）来评估和选择最合理的协调轨迹。在TWIN基准的13个任务上进行评估，BiCICLe达到了最高71.1%的平均成功率，超过了最佳的无训练基线6.7个百分点，并超越了大多数监督方法。我们进一步展示了在新任务上的强大少量样本泛化能力。

cs.RO / 16 / 2604.20365

Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization

低成本生物启发在过参数化时代的优势

Godin-Dubois, Kevin, Yaman, Anil, Kononova, Anna V.

Abstract

While Central Pattern Generators (CPGs) and Multi-Layer Perceptrons (MLPs) are widely used paradigms in robot control, few systematic studies have been performed on the relative merits of large parameter spaces. In contexts where input and output spaces are small and performance is bounded, having more parameters to optimize may actively hinder the learning process instead of empowering it. To empirically measure this, we submit a given robot morphology, with limited proprioceptive capabilities, to controller optimization under two bio-inspired paradigms (CPGs and MLPs) with evolutionary and reinforcement-learning trainer protocols. By varying parameter spaces across multiple reward functions, we observe that shallow MLPs and densely connected CPGs result in better performance when compared to deeper MLPs or Actor-Critic architectures. To account for the relationship between said performance and the number of parameters, we introduce a Parameter Impact metric which demonstrates that the additional parameters required by the reinforcement technique do not translate into better performance, thus favouring evolutionary strategies.

Chinese Translation

尽管中央模式生成器（Central Pattern Generators, CPGs）和多层感知器（Multi-Layer Perceptrons, MLP）是机器人控制中广泛使用的范式，但关于大参数空间相对优缺点的系统研究却相对较少。在输入和输出空间较小且性能受限的情况下，更多的优化参数可能会积极阻碍学习过程，而不是增强它。为了进行实证测量，我们将一种具有有限本体感知能力的机器人形态提交给两种生物启发范式（CPGs 和 MLPs）下的控制器优化，采用进化和强化训练协议。通过在多个奖励函数中变化参数空间，我们观察到与更深的 MLP 或 Actor-Critic 架构相比，浅层 MLP 和密集连接的 CPG 在性能上表现更好。为了说明这种性能与参数数量之间的关系，我们引入了参数影响指标（Parameter Impact），该指标表明强化技术所需的额外参数并未转化为更好的性能，从而更倾向于进化策略。

cs.RO / 17 / 2604.20423

OVPD: A Virtual-Physical Fusion Testing Dataset of OnSite Autonomous Driving Challenge

OVPD：现场自主驾驶挑战的虚拟-物理融合测试数据集

Zhang, Yuhang, Zhang, Jiarui, Jian, Bowen, Zhou, Xin, Lv, Zhichao, Hang, Peng, Yu, Rongjie, Tian, Ye, Sun, Jian

Abstract

The rapid iteration of autonomous driving algorithms has created a growing demand for high-fidelity, replayable, and diagnosable testing data. However, many public datasets lack real vehicle dynamics feedback and closed-loop interaction with surrounding traffic and road infrastructure, limiting their ability to reflect deployment readiness. To address this gap, we present OVPD (OnSite Virtual-Physical Dataset), a virtual-physical fusion testing dataset released from the 2025 OnSite Autonomous Driving Challenge. Centered on real-vehicle-in-the-loop testing, OVPD integrates virtual background traffic with vehicle-infrastructure perception to build controllable and interactive closed-loop test environments on a proving ground. The dataset contains 20 testing clips from 20 teams over a scenario chain of 15 atomic scenarios, totaling nearly 3 hours of multi-modal data, including vehicle trajectories and states, control commands, and digital-twin-rendered surround-view observations. OVPD supports long-tail planning and decision-making validation, open-loop or platform-enabled closed-loop evaluation, and comprehensive assessment across safety, efficiency, comfort, rule compliance, and traffic impact, providing actionable evidence for failure diagnosis and iterative improvement. The dataset is available via: https://huggingface.co/datasets/Yuhang253820/Onsite_OPVD

Chinese Translation

自主驾驶算法的快速迭代对高保真、可重放和可诊断的测试数据提出了日益增长的需求。然而，许多公共数据集缺乏真实车辆动态反馈以及与周围交通和道路基础设施的闭环交互，限制了它们反映部署准备情况的能力。为了解决这一问题，我们提出了OVPD（现场虚拟-物理数据集），这是从2025年现场自主驾驶挑战中发布的虚拟-物理融合测试数据集。OVPD以真实车辆在环测试为中心，将虚拟背景交通与车辆-基础设施感知相结合，在试验场上构建可控和互动的闭环测试环境。该数据集包含来自20个团队的20个测试片段，涵盖15个原子场景的场景链，总计近3小时的多模态数据，包括车辆轨迹和状态、控制命令以及数字双胞胎渲染的全景观察。OVPD支持长尾规划和决策验证、开放环或平台支持的闭环评估，以及在安全性、效率、舒适性、规则遵从性和交通影响等方面的综合评估，为故障诊断和迭代改进提供可操作的证据。数据集可通过以下链接获取：https://huggingface.co/datasets/Yuhang253820/Onsite_OPVD

cs.RO / 18 / 2604.20428

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

基于信号时序逻辑的字典序最小违例运动规划

Halder, Patrick, Kiltz, Lothar, Homburger, Hannes, Reuter, Johannes, Althoff, Matthias

Abstract

Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.

Chinese Translation

自主车辆的运动规划通常需要满足多个条件冲突的规范。在无法同时满足所有规范的情况下，最小违例运动规划通过根据规范的优先级最小化违例来维持系统的运行。信号时序逻辑（STL）提供了一种正式语言，用于严格定义这些规范，并能够对其违例进行定量评估。然而，规范的全序关系会导致一个字典序优化问题，通常使用标准方法求解时计算开销较大。我们通过将多目标字典序优化问题转化为单目标标量优化问题，采用非均匀量化和位移技术来解决这一问题。具体而言，我们扩展了一种确定性模型预测路径积分（MPPI）求解器，以高效解决没有二次输入成本的优化问题。此外，我们引入了一种新的谓词鲁棒性度量，结合了空间和时间的违例。我们的结果表明，所提出的方法在单目标求解框架内为字典序STL最小违例运动规划提供了可解释且可扩展的解决方案。
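
The idea of collapsing a lexicographic ordering into one scalar through quantization and bit-shifting can be sketched as follows. This is an illustration under stated assumptions, not the authors' solver: each specification gets a hypothetical step size and bit width, and quantization is uniform within a specification, whereas the abstract mentions a non-uniform scheme whose details it does not give.

```python
from typing import Sequence


def quantize(violation: float, step: float, bits: int) -> int:
    """Map a non-negative violation to an integer level, saturating at the top level."""
    if violation < 0:
        raise ValueError("violation must be non-negative")
    max_level = (1 << bits) - 1
    return min(int(violation / step), max_level)


def lexicographic_cost(violations: Sequence[float],
                       steps: Sequence[float],
                       bits: Sequence[int]) -> int:
    """Pack quantized violations into one integer, highest priority first."""
    cost = 0
    for v, step, b in zip(violations, steps, bits):
        cost = (cost << b) | quantize(v, step, b)   # earlier specs occupy higher bits
    return cost


steps = [0.125, 0.5, 1.0]   # hypothetical resolution per specification
bits = [4, 4, 4]            # hypothetical bit budget per specification

# A violates the top-priority specification slightly; B violates only lower ones, heavily.
cost_a = lexicographic_cost([0.25, 0.0, 0.0], steps, bits)
cost_b = lexicographic_cost([0.0, 7.0, 15.0], steps, bits)
```

Because higher-priority levels sit in higher bits, any violation of a higher-priority specification outweighs every combination of lower-priority violations, so comparing the packed integers reproduces the lexicographic order up to quantization resolution.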

cs.RO / 19 / 2604.20444

VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation

VTouch++：一种基于视觉的触觉增强的双手操作多模态数据集

Hua, Qianxi, Li, Xinyue, Yan, Zheng, Li, Yang, Zhang, Chi, Li, Yongyao, Liu, Yufei

Abstract

Embodied intelligence has advanced rapidly in recent years; however, bimanual manipulation, especially in contact-rich tasks, remains challenging. This is largely due to the lack of datasets with rich physical interaction signals, systematic task organization, and sufficient scale. To address these limitations, we introduce the VTOUCH dataset. It leverages vision-based tactile sensing to provide high-fidelity physical interaction signals, adopts a matrix-style task design to enable systematic learning, and employs automated data collection pipelines covering real-world, demand-driven scenarios to ensure scalability. To further validate the effectiveness of the dataset, we conduct extensive quantitative experiments on cross-modal retrieval as well as real-robot evaluation. Finally, we demonstrate real-world performance through generalizable inference across multiple robots, policies, and tasks.

Chinese Translation

具身智能近年来发展迅速；然而，双手操作，特别是在接触丰富的任务中，仍然面临挑战。这主要是由于缺乏具有丰富物理交互信号、系统化任务组织和足够规模的数据集。为了解决这些局限性，我们引入了VTOUCH数据集。该数据集利用基于视觉的触觉感知提供高保真度的物理交互信号，采用矩阵式任务设计以实现系统化学习，并采用自动化数据收集管道覆盖真实世界的需求驱动场景，以确保可扩展性。为了进一步验证数据集的有效性，我们进行了广泛的定量实验，涉及跨模态检索以及真实机器人评估。最后，我们通过跨多个机器人、策略和任务的可推广推理展示了在真实世界中的表现。

cs.RO / 20 / 2604.20468

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

MOMO：一个无缝的物理、语言和图形机器人技能学习与适应框架

Knauer, Markus, Fiorini, Edoardo, Mühlbauer, Maximilian, Schneyer, Stefan, Angsuratanawech, Promwat, Lay, Florian Samuel, Bachmann, Timo, Bustamante, Samuel, Nottensteiner, Korbinian, Stulp, Freek, Albu-Schäffer, Alin, Silvério, João, Eiband, Thomas

Abstract

Industrial robot applications require increasingly flexible systems that non-expert users can easily adapt for varying tasks and environments. However, different adaptations benefit from different interaction modalities. We present an interactive framework that enables robot skill adaptation through three complementary modalities: kinesthetic touch for precise spatial corrections, natural language for high-level semantic modifications, and a graphical web interface for visualizing geometric relations and trajectories, inspecting and adjusting parameters, and editing via-points by drag-and-drop. The framework integrates five components: energy-based human-intention detection, a tool-based LLM architecture (where the LLM selects and parameterizes predefined functions rather than generating code) for safe natural language adaptation, Kernelized Movement Primitives (KMPs) for motion encoding, probabilistic Virtual Fixtures for guided demonstration recording, and ergodic control for surface finishing. We demonstrate that this tool-based LLM architecture generalizes skill adaptation from KMPs to ergodic control, enabling voice-commanded surface finishing. Validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair demonstrates the practical applicability of our approach in industrial settings.

Chinese Translation

工业机器人应用需要越来越灵活的系统，以便非专业用户能够轻松适应不同的任务和环境。然而，不同的适应过程受益于不同的交互方式。我们提出了一个交互式框架，通过三种互补的方式实现机器人技能的适应：用于精确空间修正的动觉触摸、用于高层语义修改的自然语言，以及用于可视化几何关系和轨迹、检查和调整参数以及通过拖放编辑路径点的图形网页界面。该框架集成了五个组件：基于能量的人类意图检测、基于工具的LLM架构（在该架构中，LLM选择和参数化预定义函数而不是生成代码）以实现安全的自然语言适应、用于运动编码的核化运动原语（KMPs）、用于引导演示记录的概率虚拟固定装置，以及用于表面处理的遍历控制。我们展示了这种基于工具的LLM架构将技能适应从KMPs推广到遍历控制，使得语音命令的表面处理成为可能。在2025年自动化展览会上对一台7自由度扭矩控制机器人进行的验证展示了我们方法在工业环境中的实际应用性。
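
The tool-based pattern described in the abstract, where the LLM selects and parameterizes predefined functions rather than generating code, might look roughly like the sketch below. The tool names, parameters, safety limits, and JSON reply format are all hypothetical and are not taken from MOMO.

```python
import json
from typing import Callable, Dict


# Hypothetical predefined skills; names, parameters, and limits are illustrative only.
def scale_speed(factor: float) -> str:
    if not 0.1 <= factor <= 2.0:
        raise ValueError("factor outside the allowed range")
    return f"speed scaled by {factor}"


def shift_via_point(index: int, dz_m: float) -> str:
    if abs(dz_m) > 0.05:
        raise ValueError("shift exceeds the 5 cm safety limit")
    return f"via-point {index} shifted by {dz_m} m"


TOOLS: Dict[str, Callable[..., str]] = {
    "scale_speed": scale_speed,
    "shift_via_point": shift_via_point,
}


def dispatch(llm_reply: str) -> str:
    """Execute a reply of the form {"tool": ..., "args": {...}}; reject anything else."""
    call = json.loads(llm_reply)
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        raise KeyError(f"unknown tool: {call.get('tool')!r}")
    return tool(**call.get("args", {}))


result = dispatch('{"tool": "shift_via_point", "args": {"index": 2, "dz_m": 0.01}}')
```

The design choice this illustrates is that the model's output is only ever looked up in a fixed registry and range-checked, so a malformed or unexpected reply fails loudly instead of running arbitrary code on the robot.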

cs.RO / 21 / 2604.20472

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

顺序任务中的时间差校准：在视觉-语言-行动模型中的应用

Francis-Meretzki, Shelly, Mutti, Mirco, Romano, Yaniv, Tamar, Aviv

Abstract

Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.

Chinese Translation

最近在机器人领域的视觉-语言-行动（VLA）模型的进展突显了在顺序任务中可靠的不确定性量化的重要性。然而，在这种设置中评估和改善校准仍然大多未被探索，尤其是在仅观察到部分轨迹的情况下。在本研究中，我们为情节任务制定了顺序校准，其中任务成功的信心在情节过程中产生，而成功则在情节结束时确定。我们引入了Brier分数的顺序扩展，并表明对于二元结果，其风险最小化器与VLA策略的价值函数相吻合。这一联系将不确定性校准与强化学习相结合，使得时间差（TD）价值估计可以作为一种原则性的时间校准机制。我们通过实验证明，TD校准相较于最先进的方法在模拟和真实机器人数据上提高了性能。有趣的是，我们展示了当使用TD进行校准时，VLA的单步动作概率可以产生具有竞争力的不确定性估计，这与最近采用不同校准技术的发现形成对比。
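
The link between calibration and value estimation can be illustrated with a small tabular sketch. It assumes one plausible form of a sequential Brier score (the squared error between each step's confidence and the final binary outcome, averaged over the episode) and a TD(0) update with zero intermediate reward and discount 1; the paper's exact definitions may differ.

```python
from typing import List, Sequence


def sequential_brier(confidences: Sequence[float], success: int) -> float:
    """Mean squared error between per-step confidences and the final binary outcome."""
    return sum((c - success) ** 2 for c in confidences) / len(confidences)


def td0_targets(confidences: Sequence[float], success: int) -> List[float]:
    """Each step bootstraps from the next confidence; the last step uses the outcome."""
    return list(confidences[1:]) + [float(success)]


def td0_update(confidences: Sequence[float], success: int, lr: float = 0.5) -> List[float]:
    """One TD(0) sweep moving each confidence toward its target."""
    targets = td0_targets(confidences, success)
    return [c + lr * (t - c) for c, t in zip(confidences, targets)]


episode = [0.5, 0.5, 0.5, 0.5]            # uninformative confidences along one episode
updated = td0_update(episode, success=1)  # the episode ended in success
```

After a single sweep only the last confidence moves toward the outcome; repeated sweeps would propagate that information back to earlier steps, which is the usual behavior of TD learning.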

cs.RO / 22 / 2604.20557

Passive Variable Impedance For Shared Control

用于共享控制的被动可变阻抗

Mühlbauer, Maximilian, Werner, Nepomuk, Balachandran, Ribin, Hulin, Thomas, Silvério, João, Stulp, Freek, Albu-Schäffer, Alin

Abstract

Shared control methods often use impedance control to track target poses with a robotic manipulator. The guidance behavior of such controllers is shaped by their stiffness gains, which can vary over time to achieve adaptive guidance. When multiple target poses are tracked at the same time with varying importance, the corresponding output wrenches have to be arbitrated with weightings that change over time. In this work, we study the stabilization of both variable stiffness in impedance control and the arbitration of different controllers through a scaled addition of their output wrenches, reformulating both into a holistic framework. We identify passivity violations in the closed-loop system and provide methods to passivate the system. The resulting approach can be used to stabilize standard impedance controllers, allowing for the development of novel and flexible shared control methods. We do not constrain the design of stiffness matrices or arbitration factors; both can be matrix-valued, including off-diagonal elements, and change arbitrarily over time. The proposed methods are furthermore validated in simulation as well as in real-robot experiments on different systems, demonstrating their effectiveness and showcasing different behaviors that can be utilized depending on the requirements of the shared control approach.

Chinese Translation

共享控制方法通常使用阻抗控制来跟踪机器人手臂中的目标姿态。这类控制器的引导行为由所使用的刚度增益决定，这些增益可以随时间变化以实现自适应引导。当同时跟踪多个具有不同重要性的目标姿态时，相应的输出扭矩必须通过随时间变化的权重进行仲裁。在本研究中，我们研究了阻抗控制中可变刚度的稳定性以及通过对其输出扭矩的加权和进行不同控制器的仲裁，将两者重新构建为一个整体框架。我们识别了闭环系统中的被动性违反，并提供了使系统被动化的方法。所提出的方法可以用于稳定标准阻抗控制器，从而允许开发新颖且灵活的共享控制方法。我们不限制刚度矩阵或仲裁因子的设计；两者均可以是矩阵值，包括非对角元素，并且可以随时间任意变化。此外，所提出的方法在不同系统的仿真以及真实机器人实验中得到了验证，证明了其有效性，并展示了可以根据共享控制方法的需求利用的不同行为。

cs.RO / 23 / 2604.20686

Kinematic Optimization of Phalanx Length Ratios in Robotic Hands Using Potential Dexterity

基于潜在灵巧性的机器人手指骨长度比的运动学优化

Kang, HyoJae, Lee, Joonho, Ahn, Jeongdo, Park, Dong Il

Abstract

In the design stage of robotic hands, it is not straightforward to quantitatively evaluate the effect of phalanx length ratios on dexterity without defining specific objects or manipulation tasks. Therefore, this study presents a framework for optimizing the phalanx length ratios of a five-finger robotic hand based on potential dexterity within a kinematic structure. The proposed method employs global manipulability, workspace volume, overlap workspace volume, and fingertip sensitivity as evaluation metrics, and identifies optimal design configurations using a weighted objective function under given constraints. The reachable workspace is discretized using a voxel-based representation, and joint motions are discretized at uniform intervals for evaluation. The optimization is performed over design sets for both the thumb and the other fingers, and design combinations that do not generate overlap workspace are excluded. The results show that each phalanx does not contribute equally to the overall dexterity, and the factors influencing each phalanx are identified. In addition, it is observed that the selection of weighting coefficients does not necessarily lead to the direct maximization of individual performance metrics, due to the non-uniform distribution of evaluation measures within the design space. The proposed framework provides a systematic approach to analyze the trade-offs among reachability, dexterity, and controllability, and can serve as a practical guideline for the kinematic design of multi-fingered robotic hands.

Chinese Translation

在机器人手的设计阶段，定量评估指骨长度比对灵巧性的影响并不简单，因为这需要定义特定的物体或操作任务。因此，本研究提出了一种基于运动学结构中潜在灵巧性的五指机器人手指骨长度比优化框架。所提方法采用全局可操作性、工作空间体积、重叠工作空间体积和指尖灵敏度作为评估指标，并在给定约束下使用加权目标函数识别最佳设计配置。可达工作空间采用基于体素的表示进行离散化，关节运动以均匀间隔进行离散化以便评估。优化在拇指和其他手指的设计集上进行，并排除不产生重叠工作空间的设计组合。结果表明，每个指骨对整体灵巧性的贡献并不相同，并识别出影响每个指骨的因素。此外，观察到选择加权系数并不一定直接导致个体性能指标的最大化，这是由于评估指标在设计空间中的非均匀分布。所提出的框架提供了一种系统的方法来分析可达性、灵巧性和可控性之间的权衡，并可作为多指机器人手运动学设计的实用指南。
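
The discretization step described above (joint motions sampled at uniform intervals, reachable positions binned into a grid) can be sketched for a single planar three-phalanx finger. This is a simplified two-dimensional stand-in for the voxel representation; the joint limits, grid size, and phalanx lengths are hypothetical, and the paper's other metrics are not reproduced.

```python
import math
from itertools import product
from typing import Set, Tuple

Cell = Tuple[int, int]


def fingertip_cells(lengths: Tuple[float, float, float],
                    joint_steps: int = 15,
                    joint_max: float = math.pi / 2,
                    cell_size: float = 0.005) -> Set[Cell]:
    """Sweep three flexion joints at uniform intervals and bin fingertip positions."""
    angles = [joint_max * i / (joint_steps - 1) for i in range(joint_steps)]
    cells: Set[Cell] = set()
    for joint_angles in product(angles, repeat=3):
        x = y = theta = 0.0
        for q, length in zip(joint_angles, lengths):
            theta += q                       # planar serial chain: angles accumulate
            x += length * math.cos(theta)
            y += length * math.sin(theta)
        cells.add((math.floor(x / cell_size), math.floor(y / cell_size)))
    return cells


def workspace_area(cells: Set[Cell], cell_size: float = 0.005) -> float:
    """Approximate reachable area as occupied cells times the area of one cell."""
    return len(cells) * cell_size ** 2


# Two hypothetical length ratios with the same total finger length of 0.09 m.
area_a = workspace_area(fingertip_cells((0.045, 0.025, 0.020)))
area_b = workspace_area(fingertip_cells((0.030, 0.030, 0.030)))
```

Comparing `area_a` and `area_b` gives a crude, task-free way to rank length ratios; an overlap measure between two fingers could be approximated in the same spirit by intersecting their cell sets.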

cs.RO / 24 / 2604.20689

FingerEye: Continuous and Unified Vision-Tactile Sensing for Dexterous Manipulation

FingerEye：用于灵巧操作的连续统一视觉-触觉感知

Xu, Zhixuan, Li, Yichen, Wu, Xuanye, Qiu, Tianyu, Shao, Lin

Abstract

Dexterous robotic manipulation requires comprehensive perception across all phases of interaction: pre-contact, contact initiation, and post-contact. Such continuous feedback allows a robot to adapt its actions throughout interaction. However, many existing tactile sensors, such as GelSight and its variants, only provide feedback after contact is established, limiting a robot's ability to precisely initiate contact. We introduce FingerEye, a compact and cost-effective sensor that provides continuous vision-tactile feedback throughout the interaction process. FingerEye integrates binocular RGB cameras to provide close-range visual perception with implicit stereo depth. Upon contact, external forces and torques deform a compliant ring structure; these deformations are captured via marker-based pose estimation and serve as a proxy for contact wrench sensing. This design enables a perception stream that smoothly transitions from pre-contact visual cues to post-contact tactile feedback. Building on this sensing capability, we develop a vision-tactile imitation learning policy that fuses signals from multiple FingerEye sensors to learn dexterous manipulation behaviors from limited real-world data. We further develop a digital twin of our sensor and robot platform to improve policy generalization. By combining real demonstrations with visually augmented simulated observations for representation learning, the learned policies become more robust to object appearance variations. Together, these design aspects enable dexterous manipulation across diverse object properties and interaction regimes, including coin standing, chip picking, letter retrieving, and syringe manipulation. The hardware design, code, appendix, and videos are available on our project website: https://nus-lins-lab.github.io/FingerEyeWeb/

Chinese Translation

灵巧的机器人操作需要在交互的所有阶段进行全面感知：接触前、接触开始和接触后。这种连续反馈使得机器人能够在交互过程中调整其动作。然而，许多现有的触觉传感器，如GelSight及其变种，仅在建立接触后提供反馈，这限制了机器人精确启动接触的能力。我们提出了FingerEye，这是一种紧凑且经济高效的传感器，能够在整个交互过程中提供连续的视觉-触觉反馈。FingerEye集成了双目RGB摄像头，以提供隐式立体深度的近距离视觉感知。在接触时，外部力和扭矩会使一个柔性环结构发生变形；这些变形通过基于标记的姿态估计被捕获，并作为接触扭矩感知的代理。这种设计实现了一个感知流，从接触前的视觉线索平滑过渡到接触后的触觉反馈。在此感知能力的基础上，我们开发了一种视觉-触觉模仿学习策略，该策略融合来自多个FingerEye传感器的信号，以从有限的真实世界数据中学习灵巧操作行为。我们进一步开发了传感器和机器人平台的数字双胞胎，以提高策略的泛化能力。通过将真实演示与视觉增强的模拟观察结合用于表征学习，所学习的策略对物体外观变化变得更加鲁棒。这些设计方面共同实现了在多样的物体属性和交互模式下的灵巧操作，包括硬币立放、芯片拾取、信件取回和注射器操作。硬件设计、代码、附录和视频可在我们的项目网站上获取：https://nus-lins-lab.github.io/FingerEyeWeb/

cs.RO / 25 / 2604.20692

A Kinematic Framework for Evaluating Pinch Configurations in Robotic Hand Design without Object or Contact Models

一种用于评估机器人手设计中夹持配置的运动学框架，无需物体或接触模型

Kang, HyoJae, Lee, Joonho, Jung, Hyunmok, Park, Dong Il

Abstract

Evaluating the pinch capability of a robotic hand is important for understanding its functional dexterity. However, many existing grasp evaluation methods rely on object geometry or contact force models, which limits their applicability during the early stages of robotic hand design. This study proposes a kinematic evaluation method for analyzing pinch configurations of robotic hands based on interactions between fingertip workspaces. First, the reachable workspace of each fingertip is computed from the joint configurations of the fingers. Then, feasible pinch configurations are detected by evaluating the relationships between fingertip pairs. Since the proposed method does not require information about object geometry or contact force models, the pinch capability of a robotic hand can be evaluated solely based on its kinematic structure. In addition, analyses are performed on four different kinematic structures of the hand to investigate their impact on the pinch configurations. The proposed evaluation framework can serve as a useful tool for comparing different robotic hand designs and analyzing pinch capability during the design stage.

Chinese Translation

评估机器人手的夹持能力对于理解其功能灵活性至关重要。然而，许多现有的抓取评估方法依赖于物体几何形状或接触力模型，这限制了它们在机器人手设计早期阶段的适用性。本研究提出了一种基于指尖工作空间之间相互作用的运动学评估方法，用于分析机器人手的夹持配置。首先，从手指的关节配置计算每个指尖的可达工作空间。然后，通过评估指尖对之间的关系来检测可行的夹持配置。由于所提出的方法不需要关于物体几何形状或接触力模型的信息，因此可以仅基于机器人的运动学结构评估其夹持能力。此外，还对四种不同的手部运动学结构进行了分析，以研究它们对夹持配置的影响。所提出的评估框架可以作为比较不同机器人手设计和分析夹持能力的有用工具，尤其是在设计阶段。

cs.RO / 26 / 2604.20712

Visual-Tactile Peg-in-Hole Assembly Learning from Peg-out-of-Hole Disassembly

视觉-触觉插销入孔装配学习基于插销出孔拆卸

Zhao, Yongqiang, Zhang, Xuyang, Chen, Zhuo, Leonetti, Matteo, Spyrakos-Papastavridis, Emmanouil, Luo, Shan

Abstract

Peg-in-hole (PiH) assembly is a fundamental yet challenging robotic manipulation task. While reinforcement learning (RL) has shown promise in tackling such tasks, it requires extensive exploration. In this paper, we propose a novel visual-tactile skill learning framework for the PiH task that leverages its inverse task, i.e., peg-out-of-hole (PooH) disassembly, to facilitate PiH learning. Compared to PiH, PooH is inherently easier as it only needs to overcome existing friction without precise alignment, making data collection more efficient. To this end, we formulate both PooH and PiH as Partially Observable Markov Decision Processes (POMDPs) in a unified environment with shared visual-tactile observation space. A visual-tactile PooH policy is first trained; its trajectories, containing kinematic, visual and tactile information, are temporally reversed and action-randomized to provide expert data for PiH. In the policy learning, visual sensing facilitates the peg-hole approach, while tactile measurements compensate for peg-hole misalignment. Experiments across diverse peg-hole geometries show that the visual-tactile policy attains 6.4% lower contact forces than its single-modality counterparts, and that our framework achieves average success rates of 87.5% on seen objects and 77.1% on unseen objects, outperforming direct RL methods that train PiH policies from scratch by 18.1% in success rate. Demos, code, and datasets are available at https://sites.google.com/view/pooh2pih.

Chinese Translation

插销入孔（PiH）装配是一项基本但具有挑战性的机器人操作任务。尽管强化学习（RL）在解决此类任务方面显示出潜力，但它需要广泛的探索。在本文中，我们提出了一种新颖的视觉-触觉技能学习框架，用于PiH任务，该框架利用其逆任务，即插销出孔（PooH）拆卸，以促进PiH学习。与PiH相比，PooH本质上更容易，因为它只需克服现有摩擦，而无需精确对齐，从而使数据收集更加高效。为此，我们将PooH和PiH都形式化为部分可观察马尔可夫决策过程（POMDP），在共享的视觉-触觉观察空间中构建统一环境。首先训练视觉-触觉PooH策略；其轨迹包含运动学、视觉和触觉信息，经过时间反转和动作随机化，以提供PiH的专家数据。在策略学习中，视觉感知促进了插销-孔的接近，而触觉测量则补偿了插销-孔的错位。针对不同插销-孔几何形状的实验表明，视觉-触觉策略的接触力比单一模态策略低6.4%，并且我们的框架在已见物体上的平均成功率达到87.5%，在未见物体上的成功率为77.1%，成功率比从头训练PiH策略的直接RL方法高出18.1%。演示、代码和数据集可在 https://sites.google.com/view/pooh2pih 获取。
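
Temporally reversing a disassembly rollout to obtain assembly demonstrations can be sketched as below. The state and action representations (peg position and position delta) are assumptions made for illustration, and action randomization is modeled here as additive Gaussian noise, which may not match the paper's scheme.

```python
import random
from typing import List, Tuple

State = Tuple[float, ...]      # peg position (x, y, z) in metres
Step = Tuple[State, State]     # (state, action), where the action is a position delta


def reverse_trajectory(pooh: List[Step], noise_std: float = 0.0,
                       seed: int = 0) -> List[Step]:
    """Turn a pull-out rollout into insertion data.

    The state after each step is state + action. Walking the rollout backwards,
    the insertion action is the negated extraction delta, optionally perturbed.
    """
    rng = random.Random(seed)
    states = [s for s, _ in pooh]
    last_state, last_action = pooh[-1]
    states.append(tuple(p + d for p, d in zip(last_state, last_action)))
    pih: List[Step] = []
    for i in range(len(states) - 1, 0, -1):
        delta = tuple(a - b for a, b in zip(states[i - 1], states[i]))
        action = tuple(d + rng.gauss(0.0, noise_std) for d in delta)
        pih.append((states[i], action))
    return pih


# A two-step extraction that lifts the peg 1 cm per step along z.
pooh_rollout = [((0.0, 0.0, 0.00), (0.0, 0.0, 0.01)),
                ((0.0, 0.0, 0.01), (0.0, 0.0, 0.01))]
pih_demo = reverse_trajectory(pooh_rollout)
```

The reversed rollout starts where the extraction ended and moves the peg back down, so each insertion step carries a state together with an action that undoes the corresponding extraction step.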

cs.RO / 27 / 2604.20721

ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement

ALAS：通过异步通路流解耦的自适应长时间动作合成

Shen, Yutong, Liu, Hangxu, Zhang, Lei, Liu, Penghui, Liu, Yinqi, Yang, Liuxiang, Feng, Tongtong

Abstract

Long-Horizon (LH) tasks in Human-Scene Interaction (HSI) are complex multi-step tasks that require continuous planning, sequential decision-making, and extended execution across domains to achieve the final goal. However, existing methods rely heavily on skill chaining by concatenating pre-trained subtasks, with environment observations and self-state tightly coupled. As a result, they cannot generalize to new combinations of environments and skills and fail to complete various LH tasks across domains. To solve this problem, this paper presents ALAS, a cross-domain learning framework for LH tasks via biologically inspired dual-stream disentanglement. Inspired by the brain's "where-what" dual pathway mechanism, ALAS comprises two core modules: i) an environment learning module for spatial understanding, which captures object functions, spatial relationships, and scene semantics, achieving cross-domain transfer through complete environment-self disentanglement; ii) a skill learning module for task execution, which processes self-state information including joint degrees of freedom and motor patterns, enabling cross-skill transfer through independent motor pattern encoding. We conducted extensive experiments on various LH tasks in HSI scenes. Compared with existing methods, ALAS achieves an average subtask success rate improvement of 23% and an average execution efficiency improvement of 29%.

Chinese Translation

人类场景交互（HSI）中的长时间（LH）任务是复杂的多步骤任务，要求在多个领域中进行持续规划、顺序决策和延续执行，以实现最终目标。然而，现有方法严重依赖于通过连接预训练子任务的技能链，环境观察与自我状态紧密耦合，缺乏对新环境和技能组合的泛化能力，无法完成跨领域的各种LH任务。为了解决这个问题，本文提出了ALAS，一个通过生物启发的双流解耦的LH任务跨领域学习框架。ALAS受到大脑“何处-何物”双通路机制的启发，包含两个核心模块：i) 一个用于空间理解的环境学习模块，捕捉物体功能、空间关系和场景语义，通过完全的环境-自我解耦实现跨领域转移；ii) 一个用于任务执行的技能学习模块，处理包括关节自由度和运动模式在内的自我状态信息，通过独立的运动模式编码实现跨技能转移。我们在各种HSI场景的LH任务上进行了广泛的实验。与现有方法相比，ALAS在子任务成功率上平均提高了23%，在执行效率上平均提高了29%。

cs.RO / 28 / 2604.20799

A Hough transform approach to safety-aware scalar field mapping using Gaussian Processes

基于霍夫变换的安全感知标量场映射方法，采用高斯过程

Qureshi, Muzaffar, Satharasi, Trivikram, Ogri, Tochukwu E., Volle, Kyle, Kamalapurkar, Rushikesh

Abstract

This paper presents a framework for mapping unknown scalar fields using a sensor-equipped autonomous robot operating in unsafe environments. The unsafe regions are defined as regions of high intensity, where the field value exceeds a predefined safety threshold. For safe and efficient mapping of the scalar field, the sensor-equipped robot must avoid high-intensity regions during the measurement process. In this paper, the scalar field is modeled as a sample from a Gaussian process (GP), which enables Bayesian inference and provides closed-form expressions for both the predictive mean and the uncertainty. Concurrently, the spatial structure of the high-intensity regions is estimated in real time using the Hough transform (HT), leveraging the evolving GP posterior. A safe sampling strategy is then employed to guide the robot towards safe measurement locations, using probabilistic safety guarantees on the evolving GP posterior. The estimated high-intensity regions also facilitate the design of safe motion plans for the robot. The effectiveness of the approach is verified through two numerical simulation studies and an indoor experiment for mapping a light-intensity field using a wheeled mobile robot.

Chinese Translation

本文提出了一种框架，用于利用配备传感器的自主机器人在不安全环境中映射未知的标量场。不安全区域被定义为高强度区域，其中场值超过预定义的安全阈值。为了安全高效地映射标量场，配备传感器的机器人在测量过程中必须避免高强度区域。本文将标量场建模为高斯过程（Gaussian Process, GP）的样本，这使得贝叶斯推断成为可能，并为预测均值和不确定性提供了封闭形式的表达。同时，利用霍夫变换（Hough Transform, HT）实时估计高强度区域的空间结构，借助于不断演变的高斯过程后验。接着，采用安全采样策略引导机器人朝向安全的测量位置，利用对不断演变的高斯过程后验的概率安全保证。估计的高强度区域还促进了机器人的安全运动规划设计。通过两个数值仿真研究和一个室内实验，验证了该方法在使用轮式移动机器人映射光强场方面的有效性。
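
A probabilistic safety check on a GP posterior can be sketched from the predictive mean and standard deviation alone. The sketch below assumes Gaussian marginals, a confidence level `delta`, and a "most uncertain safe point" selection rule; the selection rule and all numbers are illustrative assumptions, and the Hough-transform step is not covered.

```python
import math
from typing import List, Optional, Tuple

Candidate = Tuple[str, float, float]   # (label, posterior mean, posterior std)


def prob_safe(mean: float, std: float, threshold: float) -> float:
    """P(f <= threshold) for a Gaussian posterior marginal N(mean, std^2)."""
    if std == 0.0:
        return 1.0 if mean <= threshold else 0.0
    z = (threshold - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def next_sample(candidates: List[Candidate], threshold: float,
                delta: float = 0.05) -> Optional[str]:
    """Among candidates safe with probability >= 1 - delta, pick the most uncertain."""
    safe = [c for c in candidates if prob_safe(c[1], c[2], threshold) >= 1.0 - delta]
    if not safe:
        return None
    return max(safe, key=lambda c: c[2])[0]


# Hypothetical posterior values at four candidate measurement locations.
candidates = [("A", 2.0, 0.5), ("B", 4.0, 1.5), ("C", 1.0, 1.0), ("D", 5.5, 0.2)]
choice = next_sample(candidates, threshold=5.0)
```

Here location B is rejected even though its mean is below the threshold, because its uncertainty leaves too much probability mass above it; the check tightens automatically as measurements shrink the posterior variance.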

cs.RO / 29 / 2604.20834

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

PokeVLA：通过全面的世界知识指导增强口袋大小的视觉-语言-动作模型

Zheng, Yupeng, Li, Xiang, Gu, Songen, Zheng, Yuhang, Tian, Shuai, Li, Weize, Wang, Linbo, Fei, Senyu, Li, Pengfei, Gao, Yinfeng, Xing, Zebin, Chen, Yilun, Zhang, Qichao, Li, Haoran, Ding, Wenchao

Abstract

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA

Chinese Translation

近年来视觉-语言-动作（VLA）模型的进展为机器人操作开辟了新的途径，但现有方法的效率有限，且缺乏高级知识和空间意识。为了解决这些挑战，我们提出了PokeVLA，这是一种轻量级但强大的基础模型，旨在实现具身操作，有效地将视觉-语言理解融入动作学习。我们的框架引入了一个两阶段的训练范式：首先，我们在一个包含空间定位、可供性和具身推理任务的精心策划的多模态数据集（2.4M样本）上预训练一个紧凑的视觉-语言模型（PokeVLM）；其次，我们通过多视角目标感知语义学习、几何对齐和一种新颖的动作专家，将与操作相关的表征注入动作空间。大量实验表明，在LIBERO-Plus基准测试和实际部署中，我们的模型在成功率和在多种扰动下的鲁棒性方面超越了可比基线，表现出最先进的性能。为了促进可重复性和社区进步，我们将开源我们的代码、模型权重以及精心策划的预训练数据集的脚本。项目页面：https://getterupper.github.io/PokeVLA

计算机视觉 (Computer Vision)

cs.CV / 1 / 2604.19823

Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

低数据环境下的狂犬病诊断：数据增强与迁移学习影响的比较研究

Akremi, Khalil, Handous, Mariem, Bouslama, Zied, Bassalah, Farah, Jebali, Maryem, Hanachi, Mariem, Abdeljaoued-Tej, Ines

Abstract

Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

Chinese Translation

狂犬病在许多非洲和亚洲国家仍然是一个主要的公共卫生问题，准确的诊断对于有效的流行病学监测至关重要。金标准的诊断方法严重依赖于荧光显微镜，这需要熟练的实验室人员来准确解读结果。然而，这种专业知识在年样本量较低的地区往往稀缺。本文提出了一种自动化的、基于人工智能的诊断系统，旨在应对这些挑战。我们开发了一个强大的管道，利用迁移学习通过四种深度学习架构进行荧光图像分析：EfficientNetB0、EfficientNetB2、VGG16和视觉变换器（Vision Transformer, ViTB16）。评估了三种不同的数据增强策略，以增强模型在155幅显微图像（123幅阳性和32幅阴性）数据集上的泛化能力。我们的结果表明，TrivialAugmentWide是最有效的增强技术，因为它保留了关键的荧光模式，同时提高了模型的鲁棒性。使用几何和颜色增强的EfficientNetB0模型，通过分层三折交叉验证选择，达到了裁剪图像的最佳分类性能。尽管受到类别不平衡和数据集规模有限的限制，这项工作确认了深度学习在自动化狂犬病诊断中的可行性。所提出的方法能够实现快速且可靠的检测，并具有进一步优化的显著潜力。我们部署了一个在线工具，以便于实际访问，为未来的医学影像应用建立了框架。本研究强调了优化的深度学习模型在转变狂犬病诊断和改善公共卫生结果方面的潜力。
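
With only 32 negative images, stratification matters because a plain random split could leave a fold with very few negatives. The following sketch shows one simple way to build stratified folds using the class counts quoted in the abstract; the round-robin assignment and the seed are assumptions, not the authors' procedure.

```python
import random
from typing import Dict, List


def stratified_folds(labels: List[int], k: int = 3, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with near-equal class proportions per fold."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)   # deal indices round-robin within each class
    return folds


labels = [1] * 123 + [0] * 32            # 123 positive and 32 negative images
folds = stratified_folds(labels)
negatives_per_fold = [sum(1 for i in fold if labels[i] == 0) for fold in folds]
```

Each fold then holds 41 positives and either 10 or 11 negatives, so every validation split sees roughly the same class ratio as the full dataset.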

cs.CV / 2 / 2604.19829

TactileEval: A Step Towards Automated Fine-Grained Evaluation and Editing of Tactile Graphics

TactileEval：迈向触觉图形的自动化细粒度评估与编辑

Khan, Adnan, Akkasi, Abbas, Komeili, Majid

Abstract

Tactile graphics require careful expert validation before reaching blind and visually impaired (BVI) learners, yet existing datasets provide only coarse holistic quality ratings that offer no actionable repair signal. We present TactileEval, a three-stage pipeline that takes a first step toward automating this process. Drawing on expert free-text comments from the TactileNet dataset, we establish a five-category quality taxonomy encompassing view angle, part completeness, background clutter, texture separation, and line quality, aligned with BANA standards. We subsequently gathered 14,095 structured annotations via Amazon Mechanical Turk, spanning 66 object classes organized into six distinct families. A reproducible ViT-L/14 feature probe trained on this data achieves 85.70% overall test accuracy across 30 different tasks, with consistent difficulty ordering suggesting that the taxonomy captures meaningful perceptual structure. Building on these evaluations, we present a ViT-guided automated editing pipeline that routes classifier scores through family-specific prompt templates to produce targeted corrections via gpt-image-1 image editing. Code, data, and models are available at https://TactileEval.github.io/

Chinese Translation

触觉图形在到达盲人和视力障碍（BVI）学习者之前需要经过专家的仔细验证，但现有数据集仅提供粗略的整体质量评分，无法提供可操作的修复信号。我们提出了TactileEval，一个三阶段的流程，迈出了自动化这一过程的第一步。基于TactileNet数据集中的专家自由文本评论，我们建立了一个五类质量分类法，包括视角、部分完整性、背景杂乱、纹理分离和符合BANA标准的线条质量。随后，我们通过Amazon Mechanical Turk收集了14,095条结构化注释，涵盖66个对象类别，分为六个不同的家族。基于这些数据训练的可重复ViT-L/14特征探测器在30个不同任务中实现了85.70%的整体测试准确率，且一致的难度排序表明该分类法捕捉到了有意义的感知结构。在这些评估的基础上，我们提出了一个ViT引导的自动化编辑流程，通过家族特定的提示模板传递分类器得分，以通过gpt-image-1图像编辑生成针对性的修正。代码、数据和模型可在https://TactileEval.github.io/获取。

cs.CV / 3 / 2604.19834

KD-Judge: A Knowledge-Driven Automated Judge Framework for Functional Fitness Movements on Edge Devices

KD-Judge：一种基于知识的边缘设备功能性健身动作自动评判框架

Saha, Shaibal, Li, Fan, Li, Yunge, Iyengar, Arun, Alves, Lucas, Xu, Lanyu

Abstract

Functional fitness movements are widely used in training, competition, and health-oriented exercise programs, yet consistently enforcing repetition (rep) standards remains challenging due to subjective human judgment, time constraints, and evolving rules. Existing AI-based approaches mainly rely on learned scoring or reference-based comparisons and lack explicit rule-based judging, limiting transparency and deterministic rep-level validation. To address these limitations, we propose KD-Judge, a novel knowledge-driven automated judging framework for functional fitness movements. It converts unstructured rulebook standards into executable, machine-readable representations using an LLM-based retrieval-augmented generation and chain-of-thought rule-structuring pipeline. The structured rules are then incorporated by a deterministic rule-based judging system with pose-guided kinematic reasoning to assess rep validity and temporal boundaries. To improve efficiency on edge devices, including a high-performance desktop and the resource-constrained Jetson AGX Xavier, we introduce a dual-strategy caching mechanism that can be selectively applied to reduce redundant and unnecessary computation. Experiments demonstrate reliable rule-structuring performance and accurate rep-level assessment, with judgment evaluation conducted on the CFRep dataset, achieving faster-than-real-time execution (real-time factor (RTF) < 1). When the proposed caching strategy is enabled, the system achieves up to 3.36x and 15.91x speedups on the resource-constrained edge device compared to the non-caching baseline for pre-recorded and live-streaming scenarios, respectively. These results show that KD-Judge enables transparent, efficient, and scalable rule-grounded rep-level analysis that can complement human judging in practice.

Chinese Translation

功能性健身动作广泛应用于训练、比赛和健康导向的锻炼项目中，但由于主观的人为判断、时间限制和不断变化的规则，始终如一地执行重复（rep）标准仍然具有挑战性。现有的基于人工智能的方法主要依赖于学习的评分或基于参考的比较，缺乏明确的基于规则的评判，限制了透明性和确定性的重复级别验证。为了解决这些局限性，我们提出了KD-Judge，一种新颖的基于知识的功能性健身动作自动评判框架。它利用基于大型语言模型（LLM）的检索增强生成和思维链规则结构化管道，将非结构化的规则标准转换为可执行的机器可读表示。然后，通过具有姿态引导的运动学推理的确定性基于规则的评判系统，将结构化规则纳入其中，以评估重复的有效性和时间边界。为了提高在边缘设备上的效率，包括高性能桌面和资源受限的Jetson AGX Xavier，我们引入了一种双策略缓存机制，可以选择性地应用，以减少冗余和不必要的计算。实验表明，规则结构化性能可靠，重复级别评估准确，评判评估在CFRep数据集上进行，达到了快于实时的执行（实时因子（RTF）< 1）。当启用所提出的缓存策略时，该系统在资源受限的边缘设备上实现了与非缓存基线相比，在预录制和直播场景下分别高达3.36倍和15.91倍的加速。这些结果表明，KD-Judge能够实现透明、高效和可扩展的基于规则的重复级别分析，可以在实践中补充人类评判。
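
The KD-Judge abstract above reports efficiency as a real-time factor (RTF) and as speedups over a non-caching baseline. A minimal sketch of how these two quantities are usually computed follows, assuming the common definition of RTF as processing time divided by media duration. The function names and all timings are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    """RTF = time spent processing / duration of the processed video."""
    if media_seconds <= 0:
        raise ValueError("media_seconds must be positive")
    return processing_seconds / media_seconds


def speedup(baseline_seconds: float, cached_seconds: float) -> float:
    """Speedup of the cached run relative to the non-caching baseline."""
    if cached_seconds <= 0:
        raise ValueError("cached_seconds must be positive")
    return baseline_seconds / cached_seconds


# Hypothetical timings, chosen only for illustration (not from the paper).
rtf = real_time_factor(processing_seconds=45.0, media_seconds=60.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1}")
print(f"speedup = {speedup(baseline_seconds=90.0, cached_seconds=45.0):.2f}x")
```

Under this definition, an RTF below 1 means a clip is judged in less time than it takes to play.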

cs.CV / 4 / 2604.19839

Environmental Understanding Vision-Language Model for Embodied Agent

面向具身代理的环境理解视觉-语言模型

Bang, Jinsik, Bae, Jaeyeon, Lee, Donggyu, Jung, Siyeol, Kim, Taehwan

Abstract

Vision-language models (VLMs) have shown strong perception and reasoning abilities for instruction-following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution. To address this challenge, we propose a novel framework named Environmental Understanding Embodied Agent (EUEA), which fine-tunes four core skills: 1) object perception for identifying relevant objects, 2) task planning for generating interaction subgoals, 3) action understanding for judging success likelihood, and 4) goal recognition for determining goal completion. By fine-tuning VLMs with EUEA skills, our framework enables more reliable task execution for instruction-following. We further introduce a recovery step that leverages these core skills to sample alternative actions and correct failure cases, and a group relative policy optimization (GRPO) stage that refines inconsistent skill predictions. Across ALFRED tasks, our VLM significantly outperforms a behavior-cloning baseline, achieving an 8.86% improvement in average success rate. The recovery and GRPO stages provide an additional 3.03% gain, further enhancing overall performance. Finally, our skill-level analyses reveal key limitations in the environmental understanding of closed- and open-source VLMs and identify the capabilities necessary for effective agent-environment interaction.

Chinese Translation

视觉-语言模型（VLMs）在遵循指令的具身代理中展现了强大的感知和推理能力。然而，尽管具备这些能力及其泛化性能，它们在环境理解方面仍面临局限，常常在交互中失败或在执行过程中依赖环境元数据。为了解决这一挑战，我们提出了一种新颖的框架，命名为环境理解具身代理（Environmental Understanding Embodied Agent, EUEA），该框架微调了四项核心技能：1）物体感知，用于识别相关物体；2）任务规划，用于生成交互子目标；3）动作理解，用于判断成功的可能性；4）目标识别，用于确定目标完成情况。通过使用EUEA技能微调VLMs，我们的框架使得遵循指令的任务执行更加可靠。我们进一步引入了一个恢复步骤，利用这些核心技能，以及一个群体相对策略优化（Group Relative Policy Optimization, GRPO）阶段，以精炼不一致的技能预测。恢复步骤通过采样替代动作来纠正失败案例，而GRPO阶段则精炼不一致的技能预测。在ALFRED任务中，我们的VLM显著超越了行为克隆基线，平均成功率提高了8.86%。恢复和GRPO阶段提供了额外的3.03%的增益，进一步提升了整体性能。最后，我们的技能水平分析揭示了闭源和开源VLM在环境理解方面的关键局限，并识别了有效的代理-环境交互所需的能力。
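
The EUEA abstract above mentions a group relative policy optimization (GRPO) stage without giving details. As GRPO is commonly described, several responses are sampled for the same prompt and each reward is standardised against its own group. The sketch below illustrates only that advantage computation. The reward values and the function name are hypothetical, and the abstract does not state how the authors define their rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardise each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above their group's mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.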

cs.CV / 5 / 2604.19844

If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

如果你在等待一个信号……那可能不是它！缓解视觉注入对视觉-语言代理系统的信任边界混淆

Chang, Jiamin, Xue, Minhui, Sun, Ruoxi, Pang, Shuchao, Kanhere, Salil S., Pearce, Hammond

Abstract

Recent advances in embodied Vision-Language Agentic Systems (VLAS), powered by large vision-language models (LVLMs), enable AI systems to perceive and reason over real-world scenes. Within this context, environmental signals such as traffic lights are essential in-band signals that can and should influence agent behavior. However, similar signals could also be crafted to operate as misleading visual injections, overriding user intent and posing security risks. This duality creates a fundamental challenge: agents must respond to legitimate environmental cues while remaining robust to misleading ones. We refer to this tension as trust boundary confusion. To study this behavior, we design a dual-intent dataset and evaluation framework, through which we show that current LVLM-based agents fail to reliably balance this trade-off, either ignoring useful signals or following harmful ones. We systematically evaluate 7 LVLM agents across multiple embodied settings under both structure-based and noise-based visual injections. To address these vulnerabilities, we propose a multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs. Our approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations. The code of the evaluation framework and artifacts are made available at https://anonymous.4open.science/r/Visual-Prompt-Inject.

Chinese Translation

近期，基于大型视觉-语言模型（LVLM）的具身视觉-语言代理系统（VLAS）取得了显著进展，使得人工智能系统能够感知和推理现实世界场景。在这一背景下，交通信号灯等环境信号是重要的带内信号，能够并且应该影响代理的行为。然而，类似的信号也可能被设计为误导性的视觉注入，覆盖用户意图并带来安全风险。这种二元性带来了一个根本性挑战：代理必须对合法的环境线索作出反应，同时保持对误导性线索的鲁棒性。我们将这种紧张关系称为信任边界混淆。为了研究这种行为，我们设计了一个双意图数据集和评估框架，通过该框架我们展示了当前基于LVLM的代理无法可靠地平衡这一权衡，要么忽视有用信号，要么跟随有害信号。我们系统地评估了7个LVLM代理在多种具身设置下对结构性和噪声性视觉注入的反应。为了解决这些脆弱性，我们提出了一种多代理防御框架，该框架将感知与决策分离，以动态评估视觉输入的可靠性。我们的方法显著减少了误导性行为，同时保留了正确反应，并在对抗扰动下提供了鲁棒性保证。评估框架的代码和相关材料已在 https://anonymous.4open.science/r/Visual-Prompt-Inject 上公开。

cs.CV / 6 / 2604.19858

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Wan-Image：推动生成视觉智能的边界

Mao, Chaojie, Xie, Chen-Wei, Zhong, Chongyang, Deng, Haoyou, Zhao, Jiaxing, Xiao, Jie, Xing, Jinbo, Zhang, Jingfeng, Zhou, Jingren, Zhang, Jingyi, Dan, Jun, Zhu, Kai, Zhao, Kang, Yan, Keyu, Chen, Minghui, Li, Pandeng, Chen, Shuangle, Shen, Tong, Liu, Yu, Jiang, Yue, Pan, Yulin, Tuo, Yuxiang, Jiang, Zeyinzi, Han, Zhen, Wang, Ang, Zhang, Bang, Ai, Baole, Wen, Bin, Feng, Boang, Yu, Feiwu, Wang, Gang, Zhao, Haiming, Kang, He, Xiang, Jianjing, Zeng, Jianyuan, Wang, Jinkai, Sun, Ke, Wu, Linqian, Gong, Pei, Wu, Pingyu, Wu, Ruiwen, Su, Tongtong, Zhou, Wenmeng, Shen, Wenting, Yu, Wenyuan, Xu, Xianjun, Huang, Xiaoming, Shen, Xiejie, Xu, Xin, Kou, Yan, Lv, Yangyu, Zhai, Yifan, Huang, Yitong, Zheng, Yun, Hong, Yuntao, Zhang, Zhicheng

Abstract

We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.

Chinese Translation

我们提出了Wan-Image，一种统一的视觉生成系统，专门设计用于将图像生成模型从普通合成器转变为专业级生产工具。尽管当代扩散模型在美学生成方面表现出色，但它们在严格的设计工作流程中经常遇到关键瓶颈，这些工作流程要求绝对的可控性、复杂的排版渲染和严格的身份保留。为了解决这些挑战，Wan-Image采用了一种原生统一的多模态架构，通过将大型语言模型的认知能力与扩散变换器的高保真像素合成相结合，能够无缝地将高度细致的用户意图转化为精确的视觉输出。它的核心动力来自于大规模多模态数据扩展、系统化的细粒度注释引擎以及精选的强化学习数据，以超越基本的指令跟随，解锁专家级的专业能力。这些能力包括超长复杂文本渲染、超多样化的人像生成、调色板引导生成、多主体身份保留、连贯的序列视觉生成、精确的多模态交互编辑、原生 alpha 通道生成和高效的 4K 合成。在多项人类评估中，Wan-Image在整体性能上超过了Seedream 5.0 Lite和GPT Image 1.5，在具有挑战性的任务中与Nano Banana Pro达到了同等水平。最终，Wan-Image彻底改变了电子商务、娱乐、教育和个人生产力中的视觉内容创作，重新定义了专业视觉合成的边界。

cs.CV / 7 / 2604.19888

SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze

SGAP-Gaze：基于场景网格注意力的驾驶员注视点估计网络

Sharma, Pavan Kumar, Chakraborty, Pranamesh

Abstract

Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which, along with the face images, can help improve the gaze estimation model. We propose SGAP-Gaze, a Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the extracted features from facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism fusing face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on the LBW dataset, a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.

Chinese Translation

驾驶员注视估计对于理解驾驶员对周围交通的情境意识至关重要。现有的注视估计模型利用驾驶员的面部信息来预测注视点（Point-of-Gaze, PoG）或三维注视方向向量。我们提出了一个基准数据集，城市驾驶-面部场景注视（Urban Driving-Face Scene Gaze, UD-FSG），该数据集包含同步的驾驶员面部和交通场景图像。场景图像提供了关于周围交通的线索，这可以帮助提高注视估计模型的性能，配合面部图像使用。我们提出了SGAP-Gaze，一种基于场景网格注意力的注视点估计网络，该网络在我们的UD-FSG数据集上进行训练和测试，明确将场景图像纳入注视估计建模中。该注视估计网络整合了驾驶员的面部、眼睛、虹膜和场景上下文信息。首先，从面部模态提取的特征被融合形成一个注视意图向量。然后，使用基于Transformer的注意力机制计算空间场景网格上的注意力分数，融合面部和场景图像特征以获取PoG。所提出的SGAP-Gaze模型在UD-FSG数据集上实现了104.73的平均像素误差，在LBW数据集上实现了63.48，与最先进的驾驶员注视估计模型相比，平均像素误差减少了23.5%。空间像素分布分析表明，SGAP-Gaze在所有空间范围内，包括场景的外部区域（这些区域虽然稀少但对理解驾驶员注意力至关重要），始终实现了低于现有方法的平均像素误差。这些结果突显了将多模态注视线索与场景感知注意力相结合的有效性，从而为现实驾驶环境中的稳健驾驶员PoG估计模型提供了支持。
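
The SGAP-Gaze abstract above reports results as mean pixel error. The sketch below assumes this is the average Euclidean distance, in scene-image pixels, between predicted and ground-truth points of gaze. That reading is an assumption, as the abstract does not spell out the formula. The coordinates are hypothetical.

```python
import math


def mean_pixel_error(predicted, ground_truth):
    """Average Euclidean distance, in pixels, between paired 2D points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical points of gaze (x, y) in scene-image pixels.
pred = [(100.0, 200.0), (303.0, 404.0)]
truth = [(100.0, 210.0), (300.0, 400.0)]
print(mean_pixel_error(pred, truth))  # distances are 10 and 5 pixels
```

Because the value is in pixels, it depends on scene-image resolution, so figures from different datasets are not directly comparable.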

cs.CV / 8 / 2604.19902

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

MMCORE：具有表示对齐潜在嵌入的多模态连接

Li, Zijie, Shi, Yichun, Sun, Jingxiang, Wang, Ye, Huang, Yixuan, Guo, Zhiyao, Lian, Xiaochen, Zhu, Peihao, Tian, Yu, Zhai, Zhonghua, Wang, Peng

Abstract

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.

Chinese Translation

我们提出了MMCORE，这是一个旨在实现多模态图像生成和编辑的统一框架。MMCORE利用预训练的视觉-语言模型（Vision-Language Model, VLM）通过可学习的查询令牌预测语义视觉嵌入，这些嵌入随后作为扩散模型的条件信号。这种简化的设计有效地将VLM丰富的理解和推理能力转移到视觉生成过程中。通过避免自回归模型与扩散模型之间的深度融合或从头开始训练，MMCORE显著降低了计算开销，同时保持了高保真合成。MMCORE无缝集成了文本到图像的合成与交错的图像生成，在空间推理和视觉定位等复杂场景中展示了强大的多模态理解能力。全面的评估表明，MMCORE在广泛的文本到图像和单图像/多图像编辑基准测试中始终优于最先进的基线。

cs.CV / 9 / 2604.19907

SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

SceneOrchestra：通过完整工具调用轨迹生成实现高效的代理3D场景合成

He, Yun, Yu, Kelin, Zwicker, Matthias

Abstract

Recent agentic frameworks for 3D scene synthesis have advanced realism and diversity by integrating heterogeneous generation and editing tools. These tools are organized into workflows orchestrated by an off-the-shelf LLM. Current approaches typically adopt an execute-review-reflect loop: at each step, the orchestrator executes a tool, renders intermediate results for review, and then decides on the tool and its parameters for the next step. However, this design has two key limitations. First, next-step tool selection and parameter configuration are driven by heuristic rules, which can lead to suboptimal execution flows, unnecessary tool invocations, degraded output quality, and increased runtime. Second, rendering and reviewing intermediate results after each step introduces additional latency. To address these issues, we propose SceneOrchestra, a trainable orchestration framework that optimizes the tool-call execution flow and eliminates the step-by-step review loop, improving both efficiency and output quality. SceneOrchestra consists of an orchestrator and a discriminator, which we fine-tune with a two-phase training strategy. In the first phase, the orchestrator learns context-aware tool selection and complete tool-call trajectory generation, while the discriminator is trained to assess the quality of full trajectories, enabling it to select the best trajectory from multiple candidates. In the second phase, we perform interleaved training, where the discriminator adapts to the orchestrator's evolving trajectory distribution and distills its discriminative capability back into the orchestrator. At inference, we only use the orchestrator to generate and execute full tool-call trajectories from instructions, without requiring the discriminator. Extensive experiments show that our method achieves state-of-the-art scene quality while reducing runtime compared to previous work.

Chinese Translation

近期的代理框架在3D场景合成方面通过整合异构生成和编辑工具，提升了真实感和多样性。这些工具被组织成由现成的大型语言模型（LLM）协调的工作流程。目前的方法通常采用执行-审查-反思循环：在每一步中，协调者执行一个工具，渲染中间结果以供审查，然后决定下一个步骤的工具及其参数。然而，这种设计存在两个主要限制。首先，下一步工具选择和参数配置是由启发式规则驱动的，这可能导致次优的执行流程、不必要的工具调用、输出质量下降和运行时间增加。其次，在每一步之后渲染和审查中间结果会引入额外的延迟。为了解决这些问题，我们提出了SceneOrchestra，一个可训练的协调框架，优化工具调用执行流程并消除逐步审查循环，从而提高效率和输出质量。SceneOrchestra由一个协调者和一个鉴别器组成，我们通过两阶段的训练策略对其进行微调。在第一阶段，协调者学习上下文感知的工具选择和完整工具调用轨迹生成，而鉴别器则被训练以评估完整轨迹的质量，使其能够从多个候选中选择最佳轨迹。在第二阶段，我们进行交替训练，使鉴别器适应协调者不断演变的轨迹分布，并将其鉴别能力提炼回协调者。在推理时，我们仅使用协调者根据指令生成和执行完整的工具调用轨迹，而无需鉴别器。大量实验表明，我们的方法在场景质量上达到了最先进的水平，同时相比于之前的工作减少了运行时间。

cs.CV / 10 / 2604.19923

UniCon3R: Contact-aware 3D Human-Scene Reconstruction from Monocular Video

UniCon3R：基于单目视频的接触感知三维人场重建

Sur, Tanuj, Tripathi, Shashank, Athanasiou, Nikos, Nguyen, Ha Linh, Xu, Kai, Black, Michael J., Yao, Angela

Abstract

We introduce UniCon3R (Unified Contact-aware 3D Reconstruction), a unified feed-forward framework for online human-scene 4D reconstruction from monocular videos. Recent feed-forward methods enable real-time world-coordinate human motion and scene reconstruction, but they often produce physically implausible artifacts such as bodies floating above the ground or penetrating parts of the scene. The key reason is that existing approaches fail to model physical interactions between the human and the environment. A natural next step is to predict human-scene contact as an auxiliary output, yet we find this alone is not sufficient: contact must actively correct the reconstruction. To address this, we explicitly model interaction by inferring 3D contact from the human pose and scene geometry and use the contact as a corrective cue for generating the final pose. This enables UniCon3R to jointly recover high-fidelity scene geometry and spatially aligned 3D humans within the scene. Experiments on standard human-centric video benchmarks such as RICH, EMDB, 3DPW and SLOPER4D show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while achieving real-time online inference. We experimentally demonstrate that contact serves as a powerful internal prior rather than just an external metric, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. The project page is available at https://surtantheta.github.io/UniCon3R .

Chinese Translation

我们介绍了UniCon3R（统一接触感知三维重建），这是一个用于从单目视频中在线进行人场四维重建的统一前馈框架。近期的前馈方法使得实时世界坐标下的人体运动和场景重建成为可能，但它们常常会产生不符合物理规律的伪影，例如身体漂浮在地面上或穿透场景的部分。造成这一现象的主要原因是现有方法未能建模人类与环境之间的物理交互。一个自然的下一步是将人场接触预测作为辅助输出——然而我们发现，仅此一项是不够的：接触必须积极地修正重建。为了解决这个问题，我们通过从人体姿态和场景几何推断三维接触，明确建模交互，并将接触作为生成最终姿态的修正线索。这使得UniCon3R能够共同恢复高保真度的场景几何和空间上对齐的三维人类模型。我们在RICH、EMDB、3DPW和SLOPER4D等标准以人为中心的视频基准上进行的实验表明，UniCon3R在物理合理性和全局人体运动估计方面优于最先进的基线，同时实现了实时在线推理。我们通过实验表明，接触作为一种强大的内部先验，而不仅仅是外部度量，从而为物理基础的联合人场重建建立了新的范式。项目页面可访问 https://surtantheta.github.io/UniCon3R 。

cs.CV / 11 / 2604.19937

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

感染推理器：一种紧凑的视觉-语言模型用于伤口感染分类及基于证据的临床推理

Busaranuvong, Palawat, Fard, Reza Saadati, Agu, Emmanuel, Kumar, Deepak, Gautam, Shefalika, Tulu, Bengisu, Strong, Diane

Abstract

Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.

Chinese Translation

从照片中评估慢性伤口感染具有挑战性，因为伤口的视觉外观因病因、解剖位置和成像条件而异。尽管需要基于证据的解释来支持临床决策，之前的基于图像的深度学习方法主要集中在分类上，且可解释性有限。我们提出了感染推理器（Infection-Reasoner），这是一种紧凑的4B参数推理视觉-语言模型，用于慢性伤口感染分类和推理生成。为了解决带有推理注释的专家标注伤口图像稀缺的问题，感染推理器采用了两阶段的训练流程：（1）推理蒸馏，其中GPT-5.1为未标注的伤口图像生成思维链推理，以初始化一个较小的学生模型（Qwen3-VL-4B-Thinking）中的伤口特定推理；（2）在一个小型标注感染数据集上进行强化学习后训练，使用群体相对策略优化（Group Relative Policy Optimization）来完善分类推理。在一个保留的异质伤口数据集上，感染推理器达到了86.8%的准确率、86.4%的敏感性和87.1%的特异性，超越了包括GPT-5.1在内的多个强基线。推理质量还通过多模态大型语言模型（MLLM）评审和伤口专家评审进行了进一步评估。在四位MLLM评审中，视觉支持一致性得分范围为0.722到0.903，而专家评审则将61.8%的推理评为正确，32.4%评为部分正确。
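
The Infection-Reasoner abstract above reports accuracy, sensitivity, and specificity. For reference, the sketch below shows the standard definitions of these metrics for a binary classifier, computed from confusion-matrix counts. The counts are hypothetical and do not come from the paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate (recall on infected)
        "specificity": tn / (tn + fp),  # true-negative rate (recall on uninfected)
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy shows whether a classifier favours one class, which matters when infected and uninfected wounds are not equally represented.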

cs.CV / 12 / 2604.19941

CrackForward: Context-Aware Severity Stage Crack Synthesis for Data Augmentation

CrackForward：基于上下文的裂缝严重程度阶段合成用于数据增强

Sadallah, Nassim, Allili, Mohand Saïd

Abstract

Reliable crack detection and segmentation are vital for structural health monitoring, yet the scarcity of well-annotated data constitutes a major challenge. To address this limitation, we propose a novel context-aware generative framework designed to synthesize realistic crack growth patterns for data augmentation. Unlike existing methods that primarily manipulate textures or background content, CrackForward explicitly models crack morphology by combining directional crack elongation with learned thickening and branching. Our framework integrates two key innovations: (i) a contextually guided crack expansion module, which uses local directional cues and adaptive random walk to simulate realistic propagation paths; and (ii) a two-stage U-Net-style generator that learns to reproduce spatially varying crack characteristics such as thickness, branching, and growth. Experimental results show that the generated samples preserve target-stage saturation and thickness characteristics and improve the performance of several crack segmentation architectures. These results indicate that structure-aware synthetic crack generation can provide more informative training data than conventional augmentation alone.

Chinese Translation

可靠的裂缝检测和分割对于结构健康监测至关重要，但标注良好的数据稀缺构成了主要挑战。为了解决这一限制，我们提出了一种新颖的基于上下文的生成框架，旨在合成真实的裂缝生长模式以进行数据增强。与现有方法主要操纵纹理或背景内容不同，CrackForward 明确地通过结合方向性裂缝延伸与学习的加厚和分支来建模裂缝形态。我们的框架整合了两个关键创新：（i）一个上下文引导的裂缝扩展模块，利用局部方向线索和自适应随机游走来模拟真实的传播路径；（ii）一个两阶段的 U-Net 风格生成器，学习再现空间变化的裂缝特征，如厚度、分支和生长。实验结果表明，生成的样本保持目标阶段的饱和度和厚度特征，并提高了几种裂缝分割架构的性能。这些结果表明，结构感知的合成裂缝生成可以提供比传统增强方法更具信息量的训练数据。

cs.CV / 13 / 2604.19945

Visual Reasoning through Tool-supervised Reinforcement Learning

通过工具监督强化学习进行视觉推理

Dong, Qihua, Sahin, Gozde, Wang, Pei, Cai, Zhaowei, Shrestha, Robik, Yang, Hao, Modolo, Davide

Abstract

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To that end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is optimized solely by a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while allowing tool calls. In this way, tool-calling capability is mastered before tools are used to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.

Chinese Translation

在本文中，我们研究了如何有效掌握工具使用以解决多模态大型语言模型的复杂视觉推理任务的问题。为此，我们提出了一种新颖的工具监督强化学习（Tool-supervised Reinforcement Learning, ToolsRL）框架，通过直接的工具监督来实现更有效的工具使用学习。我们专注于一系列简单、原生且可解释的视觉工具，包括放大、旋转、翻转以及绘制点/线，这些工具的监督易于收集。我们开发了一个强化学习课程，其中第一阶段仅通过一组合理的工具特定奖励进行优化，第二阶段则在允许调用工具的情况下，通过准确性目标奖励进行训练。通过这种方式，我们在使用工具完成视觉推理任务之前掌握了工具调用能力，从而避免了这些异构任务之间潜在的优化冲突。我们的实验表明，工具监督课程训练是高效的，ToolsRL能够在复杂视觉推理任务中实现强大的工具使用能力。
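
The ToolsRL abstract above names zoom-in, rotate, and flip among its visual tools. The sketch below shows what such tools might look like on a tiny image stored as a list of rows. The function names, signatures, and image representation are hypothetical and are not the paper's tool interface.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def zoom_in(image, top, left, height, width):
    """Crop a rectangular region (resampling to full size is omitted)."""
    return [row[left:left + width] for row in image[top:top + height]]


# A 2x2 image with distinct pixel values, used only for illustration.
image = [[1, 2],
         [3, 4]]
print(flip_horizontal(image))
print(rotate_90_clockwise(image))
print(zoom_in(image, top=0, left=1, height=2, width=1))
```

Each of these operations is deterministic, so a call has a single correct result to check against, which is one plausible reason supervision for such tools is easy to collect.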

cs.CV / 14 / 2604.19954

Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

通过学习视点标记实现文本到图像生成的相机控制

Lu, Xinxuan, Fowlkes, Charless, Berg, Alexander C.

Abstract

Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: https://randdl.github.io/viewtoken_control/

Chinese Translation

当前的文本到图像模型在仅使用自然语言进行精确相机控制方面存在困难。在本研究中，我们提出了一种通过学习参数化相机标记实现文本到图像生成中具有全局场景理解的精确相机控制框架。我们对图像生成模型进行了微调，以实现基于视点的文本到图像生成，使用的训练数据集结合了用于几何监督的3D渲染图像和用于外观及背景多样性的照片真实感增强。定性和定量实验表明，我们的方法在保持图像质量和提示保真度的同时，达到了最先进的准确性。与之前过拟合于特定物体外观相关性的方法不同，我们的视点标记学习了可分解的几何表示，能够迁移到未见过的物体类别。我们的工作表明，文本-视觉潜在空间可以赋予明确的3D相机结构，为文本到图像生成提供了几何感知提示的途径。项目页面：https://randdl.github.io/viewtoken_control/

cs.CV / 15 / 2604.19966

DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

DistortBench：视觉语言模型在图像失真识别中的基准测试

Goyal, Divyanshu, Eppa, Akhil, Kumar, Vanya Bannihatti

Abstract

Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base-thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.

Chinese Translation

视觉语言模型（VLMs）在对低级图像降解敏感的场景中越来越多地被使用，包括内容审核、图像修复和质量监控。然而，它们识别失真类型和严重程度的能力仍然不够清楚。我们提出了DistortBench，这是一个用于无参考失真感知的诊断基准。DistortBench包含13,500个四选一的问题，涵盖27种失真类型、六个感知类别和五个严重程度：25种失真继承了KADID-10k的标定，而两个新增的旋转失真使用单调角度基础的级别。我们评估了18个VLM，包括来自五个家族的17个开放权重模型和一个专有模型。尽管在高级视觉语言任务中表现强劲，最佳模型的准确率仅为61.9%，略低于人类多数投票基线的65.7%（平均个体：60.2%），这表明当前VLM在低级感知理解方面仍然存在重大弱点。我们的分析进一步揭示了模型规模与性能之间的弱且非单调的关系，在大多数基础-思维对中性能下降，以及不同模型家族之间的明显严重程度-响应模式。我们希望DistortBench能够作为一个有用的基准，用于测量和改善VLM中的低级视觉感知。
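
The figures in the DistortBench abstract above can be put in context with a little arithmetic. The sketch below computes the number of distortion-severity conditions, the chance accuracy for four-choice questions, and a simple majority vote. The per-condition count assumes the questions are split uniformly across conditions, which the abstract does not state. The votes are hypothetical.

```python
from collections import Counter

NUM_QUESTIONS = 13_500
NUM_DISTORTIONS = 27  # 25 inherited from KADID-10k plus 2 rotation distortions
NUM_SEVERITIES = 5
NUM_CHOICES = 4

conditions = NUM_DISTORTIONS * NUM_SEVERITIES
# Only meaningful if questions are spread uniformly (an assumption).
questions_per_condition = NUM_QUESTIONS / conditions
chance_accuracy = 1 / NUM_CHOICES


def majority_vote(answers):
    """Most common answer; ties resolve to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


print(conditions, questions_per_condition, chance_accuracy)
print(majority_vote(["B", "A", "B"]))  # hypothetical annotator votes
```

Against a 25% chance level, the reported 61.9% best-model accuracy is well above random guessing but still below the 65.7% human majority-vote baseline.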

cs.CV / 16 / 2604.19976

Lucky High Dynamic Range Smartphone Imaging

幸运的高动态范围智能手机成像

Li, Baiang, Yan, Ruyu, Tseng, Ethan, Zhang, Zhoutong, Finkelstein, Adam, Chen, Jiawen, Heide, Felix

Abstract

While the human eye can perceive an impressive twenty stops of dynamic range, smartphone camera sensors remain limited to about twelve stops despite decades of research. A variety of high dynamic range (HDR) image capture and processing techniques have been proposed, and, in practice, they can extend the dynamic range by 3-5 stops for handheld photography. This paper proposes an approach that robustly captures dynamic range using a handheld smartphone camera and lightweight networks suitable for running on mobile devices. Our method operates indirectly on linear raw pixels in bracketed exposures. Every pixel in the final HDR image is a convex combination of input pixels in the neighborhood, adjusted for exposure, and thus avoids hallucination artifacts typical of recent deep image synthesis networks. We validate our system on both synthetic imagery and unseen real bracketed images, confirming zero-shot generalization of the method to smartphone camera captures. Our iterative inference architecture is capable of processing an arbitrary number of bracketed input photos, and we show examples from capture stacks containing 3-9 images. Our training process relies only on synthetic captures yet generalizes to unseen real photos from several cameras. Moreover, we show that this training scheme improves other SOTA methods over their pretrained counterparts.

Chinese Translation

尽管人眼能够感知令人印象深刻的二十档动态范围，但经过数十年的研究，智能手机相机传感器的动态范围仍然限制在约十二档。已经提出了多种高动态范围（HDR）图像捕捉和处理技术，实际上，它们可以在手持摄影中将动态范围扩展3-5档。本文提出了一种使用手持智能手机相机和适合在移动设备上运行的轻量级网络稳健捕捉动态范围的方法。我们的方法间接作用于包围曝光中的线性原始像素。最终HDR图像中的每个像素都是邻域内输入像素的凸组合，并根据曝光进行调整，从而避免了最近深度图像合成网络中典型的幻觉伪影。我们在合成图像和未见过的真实包围曝光图像上验证了我们的系统，确认了该方法对智能手机相机拍摄的零样本泛化。我们的迭代推理架构能够处理任意数量的包围曝光输入照片，并展示了包含3-9张图像的拍摄序列的示例。我们的训练过程仅依赖于合成拍摄，但能够泛化到来自多台相机的未见过的真实照片。此外，我们展示了这一训练方案使其他最先进（SOTA）方法的性能优于其预训练版本。
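
The HDR abstract above states that every output pixel is a convex combination of exposure-adjusted input pixels. The sketch below illustrates why that constraint limits hallucination: with non-negative weights that sum to one, the output cannot leave the range of its inputs. The exposure normalisation by exposure time, the weights, and the pixel values are all hypothetical. In the paper the weights would come from the network.

```python
def normalise_exposure(raw_value: float, exposure_time: float) -> float:
    """Scale a linear raw value to a common, radiance-like unit."""
    return raw_value / exposure_time


def convex_combine(pixels, weights):
    """Blend values with non-negative weights that sum to one."""
    if len(pixels) != len(weights) or not pixels:
        raise ValueError("need matching, non-empty pixels and weights")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, pixels))


# Hypothetical neighbourhood samples from three bracketed exposures.
samples = [
    normalise_exposure(0.25, 1.0),
    normalise_exposure(0.80, 4.0),
    normalise_exposure(0.15, 0.5),
]
weights = [0.5, 0.25, 0.25]  # hypothetical weights
blended = convex_combine(samples, weights)
print(blended, min(samples) <= blended <= max(samples))
```

The bound holds for any valid weights, so the guarantee comes from the form of the output rather than from what the network has learned.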

cs.CV / 17 / 2604.19989

Online CS-based SAR Edge-Mapping

基于在线压缩感知的合成孔径雷达边缘映射

Flynn, Conor, Ivanov, Radoslav, Yazici, Birsen

Abstract

With modern defense applications increasingly relying on inexpensive, small Unmanned Aerial Vehicles (UAVs), a major challenge lies in designing intelligent and computationally efficient onboard Automatic Target Recognition (ATR) algorithms to carry out operational objectives. This is especially critical in Synthetic Aperture Radar (SAR), where processing techniques such as ATR are often carried out post data collection, requiring onboard systems to bear the memory burden of storing the back-scattered signals. To alleviate this high cost, we propose an online, direct, edge-mapping technique which bypasses the image reconstruction step to classify scenes and targets. Furthermore, by reconstructing the scene as an edge-map we inherently promote sparsity, requiring fewer measurements and computational power than classic SAR reconstruction algorithms such as backprojection.

Chinese Translation

随着现代防御应用越来越依赖廉价的小型无人机（UAV），设计智能且计算高效的机载自动目标识别（ATR）算法以实现操作目标成为一项重大挑战。这在合成孔径雷达（SAR）中尤为重要，因为ATR等处理技术通常在数据采集后进行，要求机载系统承受存储反向散射信号的内存负担。为了减轻这一高成本，我们提出了一种在线的直接边缘映射技术，该技术绕过图像重建步骤以对场景和目标进行分类。此外，通过将场景重建为边缘图，我们本质上促进了稀疏性，所需的测量和计算能力比经典的SAR重建算法（如反投影）更少。

cs.CV / 18 / 2604.19995

A Computational Model of Message Sensation Value in Short Video Multimodal Features that Predicts Sensory and Behavioral Engagement

短视频多模态特征中信息感知价值的计算模型：预测感官和行为参与

Xue, Haoning, Zhang, Jingwen, Wang, Xiaohui, Kim, Diane Dagyong, Song, Yunya

Abstract

The contemporary media landscape is characterized by sensational short videos. While prior research examines the effects of individual multimodal features, the collective impact of multimodal features on viewer engagement with short videos remains unknown. Grounded in the theoretical framework of Message Sensation Value (MSV), this study develops and tests a computational model of MSV with multimodal feature analysis and human evaluation of 1,200 short videos. The model, which predicts sensory and behavioral engagement, was further validated on two unseen datasets from three short video platforms (combined N = 14,492). While MSV is positively associated with sensory engagement, it shows an inverted U-shaped relationship with behavioral engagement: higher MSV elicits stronger sensory stimulation, but moderate MSV optimizes behavioral engagement. This research advances the theoretical understanding of short video engagement and introduces a robust computational tool for short video research.

Chinese Translation

当代媒体环境以引人注目的短视频为特征。尽管先前的研究考察了单一多模态特征的影响，但多模态特征对观众参与短视频的整体影响仍然未知。本研究基于信息感知价值（Message Sensation Value, MSV）的理论框架，开发并测试了一个包含多模态特征分析和对1200个短视频的人类评估的MSV计算模型。该模型预测感官和行为参与，并在来自三个短视频平台的两个未见数据集上进一步验证（总样本量 N = 14,492）。虽然MSV与感官参与呈正相关，但与行为参与呈倒U型关系：较高的MSV引发更强的感官刺激，而适中的MSV则优化行为参与。本研究推动了对短视频参与的理论理解，并引入了一个强大的计算工具用于短视频研究。
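
An inverted U-shaped relationship is commonly tested by fitting a quadratic and checking that the squared term is negative. The sketch below is only an illustration of that idea, not the authors' model: the function names `fit_quadratic` and `peak_location` are hypothetical, the data are made up, and the abstract does not say which regression specification was used.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 (hypothetical helper).

    Solves the 3x3 normal equations with Cramer's rule and returns (a, b, c).
    """
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    rhs = [sum(ys),
           sum(x * y for x, y in zip(xs, ys)),
           sum(x * x * y for x, y in zip(xs, ys))]
    normal = [[n, s1, s2], [s1, s2, s3], [s2, s3, s4]]
    d = det3(normal)
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


def peak_location(b, c):
    """x value where a + b*x + c*x**2 is maximal; requires c < 0 (inverted U)."""
    if c >= 0:
        raise ValueError("curve is not an inverted U")
    return -b / (2 * c)


# Made-up example: engagement rises, peaks at a moderate score, then falls.
msv = [0, 1, 2, 3, 4]
engagement = [1, 4, 5, 4, 1]
a, b, c = fit_quadratic(msv, engagement)
peak = peak_location(b, c)
```

A negative `c` together with a peak inside the observed range is the usual signature of an inverted U; a peak outside the range would suggest a monotonic trend instead.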


cs.CV / 19 / 2604.19999

Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach

实时小型无人机检测的数据增强优化：一种轻量级的上下文感知方法

Zamani, Amir, Abedini, Zeinab

Abstract

Visual detection of Unmanned Aerial Vehicles (UAVs) is a critical but difficult task in surveillance systems because of the targets' small physical size and challenging environmental conditions. Although deep learning models have achieved significant progress, deploying them on edge devices necessitates the use of lightweight models, such as YOLOv11 Nano, which possess limited learning capacity. In this research, an efficient and context-aware data augmentation pipeline, combining Mosaic strategies and HSV color-space adaptation, is proposed to enhance the performance of these models. Experimental results on four standard datasets demonstrate that the proposed approach, compared to heavy and instance-level methods like Copy-Paste, not only prevents the generation of synthetic artifacts and overfitting but also significantly improves mean Average Precision (mAP) across all scenarios. Furthermore, the evaluation of generalization capability under foggy conditions revealed that the proposed method offers the best balance between precision and stability for real-time systems, whereas alternative methods, such as MixUp, are effective only in specific applications.

Chinese Translation

由于小型无人机（UAV）的物理尺寸小以及环境挑战，视觉检测在监控系统中是一项关键任务。尽管深度学习模型取得了显著进展，但在边缘设备上部署它们需要使用轻量级模型，如YOLOv11 Nano，这些模型的学习能力有限。本研究提出了一种高效的上下文感知数据增强管道，结合了马赛克策略和HSV颜色空间适应，以提升这些模型的性能。在四个标准数据集上的实验结果表明，与像Copy-Paste这样的重型和实例级方法相比，所提出的方法不仅防止了合成伪影和过拟合，还显著提高了所有场景下的平均精度均值（mAP）。此外，在雾天条件下对泛化能力的评估显示，所提出的方法在实时系统中提供了精度与稳定性的最佳平衡，而其他方法，如MixUp，仅在特定应用中有效。
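
The two augmentation ingredients named in the abstract, Mosaic tiling and HSV color-space shifts, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function names `hsv_jitter` and `mosaic_2x2` are hypothetical, images are plain nested lists, and the fixed 2x2 tiling without random cropping or rescaling is a simplification rather than the paper's pipeline.

```python
import colorsys


def hsv_jitter(pixel, hue_shift, sat_scale, val_scale):
    """Perturb one RGB pixel (floats in [0, 1]) in HSV space (hypothetical helper)."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    h = (h + hue_shift) % 1.0                 # hue wraps around the color wheel
    s = min(1.0, max(0.0, s * sat_scale))     # clamp saturation to [0, 1]
    v = min(1.0, max(0.0, v * val_scale))     # clamp brightness to [0, 1]
    return colorsys.hsv_to_rgb(h, s, v)


def mosaic_2x2(images, boxes):
    """Tile four equally sized images into one canvas and shift their boxes.

    images: four images, each a list of rows of pixels.
    boxes:  four lists of (x_min, y_min, x_max, y_max) in per-image pixels.
    Simplification: no random centre, cropping, or rescaling is applied.
    """
    height, width = len(images[0]), len(images[0][0])
    top = [images[0][r] + images[1][r] for r in range(height)]
    bottom = [images[2][r] + images[3][r] for r in range(height)]
    offsets = [(0, 0), (width, 0), (0, height), (width, height)]
    merged = []
    for (dx, dy), image_boxes in zip(offsets, boxes):
        for x0, y0, x1, y1 in image_boxes:
            merged.append((x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return top + bottom, merged
```

Because the tiles are whole images rather than pasted object crops, every small target keeps its original background, which is one plausible reading of "context-aware" here.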


cs.CV / 20 / 2604.20000

RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery

RareSpot+: 一种针对航空影像中小型和稀有野生动物的基准、模型和主动学习框架

Zhang, Bowen, Boulerice, Jesse T., Mendiratta, Charvi, Kuniyil, Nikhil, Kumar, Satish, Shamon, Hila, Manjunath, B. S.

Abstract

Automated wildlife monitoring from aerial imagery is vital for conservation but remains limited by two persistent challenges: the difficulty of detecting small, rare species and the high cost of large-scale expert annotation. Prairie dogs exemplify this problem -- they are ecologically important yet appear tiny, sparsely distributed, and visually indistinct from their surroundings, posing a severe challenge for conventional detection models. To overcome these limitations, we present RareSpot+, a detection framework that integrates multi-scale consistency learning, context-aware augmentation, and geospatially guided active learning. A novel multi-scale consistency loss aligns intermediate feature maps across detection heads, enhancing localization of small (approx. 30 pixels wide) objects without architectural changes, while context-aware augmentation improves robustness by synthesizing hard, ecologically plausible examples. A geospatial active learning module exploits domain-specific spatial priors linking prairie dogs and burrows, together with test-time augmentation and a meta-uncertainty model, to reduce redundant labeling. On a 2 km^2 aerial dataset, RareSpot+ improves mAP@50 over the baseline by 35.2% in relative terms (+0.13 absolute). Cross-dataset tests on HerdNet, AED, and several other wildlife benchmarks demonstrate robust detector-level transferability. The active learning module further boosts prairie dog AP by 14.5% using an annotation budget of just 1.7% of the unlabeled tiles. Beyond detection, RareSpot+ enables spatial ecological analyses such as clustering and co-occurrence, linking vision-based detection with quantitative ecology.

Chinese Translation

从航空影像中自动监测野生动物对于保护工作至关重要，但仍面临两个持续的挑战：检测小型稀有物种的困难和大规模专家标注的高成本。草原犬鼠就是这一问题的典型例子：它们在生态上重要，但体型微小、分布稀疏，并且在视觉上与周围环境难以区分，这对传统检测模型构成了严重挑战。为了解决这些局限性，我们提出了RareSpot+，一个检测框架，集成了多尺度一致性学习、上下文感知增强和地理空间引导的主动学习。新颖的多尺度一致性损失在检测头之间对齐中间特征图，增强了小型（约30像素宽）物体的定位能力，而不需要改变架构，同时上下文感知增强通过合成困难且生态上合理的示例来提高鲁棒性。地理空间主动学习模块利用与草原犬鼠和洞穴相关的领域特定空间先验，并结合测试时增强和元不确定性模型，以减少冗余标注。在一个2 km²的航空数据集中，RareSpot+的mAP@50相较于基线相对提高了35.2%（绝对值+0.13）。在HerdNet、AED及其他多个野生动物基准上的跨数据集测试展示了强大的检测器级迁移能力。主动学习模块仅使用未标记图块1.7%的标注预算，就将草原犬鼠的平均精度（AP）进一步提升了14.5%。除了检测，RareSpot+还支持空间生态分析，如聚类和共现，将基于视觉的检测与定量生态学相结合。
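
The two figures quoted for the mAP@50 gain, 35.2% relative and +0.13 absolute, jointly imply approximate baseline and improved scores. The short calculation below works that out; the function name `implied_scores` is hypothetical, and the results are only as precise as the rounded numbers in the abstract.

```python
def implied_scores(relative_gain, absolute_gain):
    """Recover (baseline, improved) from a relative and an absolute gain.

    Uses absolute_gain = relative_gain * baseline, so
    baseline = absolute_gain / relative_gain.
    """
    baseline = absolute_gain / relative_gain
    return baseline, baseline + absolute_gain


baseline, improved = implied_scores(0.352, 0.13)
print(round(baseline, 3), round(improved, 3))  # prints: 0.369 0.499
```

Under this reading the baseline sits near 0.37 mAP@50 and RareSpot+ near 0.50, though the abstract itself does not state either value.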


cs.CV / 21 / 2604.20012

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

EmbodiedMidtrain：通过中期训练弥合视觉-语言模型与视觉-语言-动作模型之间的差距

Du, Yiyang, Guo, Zhanqiu, Ye, Xin, Ren, Liu, Xiong, Chenyan

Abstract

Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data and models for future research.

Chinese Translation

视觉-语言-动作模型（VLA）从视觉-语言模型（VLM）继承了其视觉和语言能力，但大多数VLA是基于未适应于具身领域的现成VLM构建的，这限制了它们的下游性能。在本研究中，我们提出了EmbodiedMidtrain，以弥合VLM与VLA之间的差距。我们首先描述了它们之间的数据分布差距，显示VLA数据占据了与更广泛的VLM分布大体分离的紧凑区域，而对齐程度在不同VLM数据源之间及其内部变化显著。然后，我们构建了一个中期训练数据引擎，利用轻量级可学习的接近度估计器，从大型VLM池中选择最对齐的VLA候选数据，并在此经过精心挑选的混合数据上进行VLM的中期训练，随后进行下游VLA的微调。在三个机器人操作基准上的实验表明，中期训练在不同的VLM骨干网络上始终提高了性能，取得了与专家VLA和使用更大模型规模及训练预算训练的现成VLM竞争的结果。进一步分析显示，中期训练为VLA微调提供了更强的初始化，增益在最早的步骤中就开始显现，并在整个训练过程中不断扩大。此外，数据引擎捕捉了数据集级和样本级的对齐信号，优先考虑空间推理而非以文本为中心的任务，同时保持了VLM数据的多样性。我们将发布所有代码、数据和模型以供未来研究使用。
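
The data-selection step, picking the most VLA-aligned samples from a large VLM pool, can be illustrated with a much simpler stand-in than the paper's learnable proximity estimator. In the sketch below, proximity is assumed to be Euclidean distance to the centroid of the VLA embeddings; the function name `select_aligned` and the embedding format are hypothetical.

```python
import math


def select_aligned(pool, vla_embeddings, k):
    """Return ids of the k pool samples closest to the VLA centroid.

    pool:           list of (sample_id, embedding) pairs from the VLM data.
    vla_embeddings: list of embeddings of VLA samples.
    Assumption: centroid distance replaces the learned proximity estimator.
    """
    dim = len(vla_embeddings[0])
    centroid = [sum(e[i] for e in vla_embeddings) / len(vla_embeddings)
                for i in range(dim)]
    ranked = sorted(pool, key=lambda item: math.dist(item[1], centroid))
    return [sample_id for sample_id, _ in ranked[:k]]


vla = [[1.0, 1.0], [3.0, 1.0]]
pool = [("a", [2.0, 1.5]), ("b", [10.0, 10.0]), ("c", [2.0, 1.0])]
chosen = select_aligned(pool, vla, k=2)
```

A learned estimator would replace the distance function with a trained scorer, but the ranking and top-k selection would look much the same.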


cs.CV / 22 / 2604.20026

Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence

使用可解释人工智能对细菌菌落计数的基数分类进行研究

Zheng, Minghua, Helian, Na, Lane, Peter C. R., Sun, Yi, Donald, Allen

Abstract

Automatic bacterial colony counting is a highly sought-after technology in modern biological laboratories because it eliminates manual counting effort. Previous work has observed that MicrobiaNet, currently the best-performing cardinality classification model for colony counting, has difficulty distinguishing colonies of three or more individuals. However, it is unclear if this is due to properties of the data together with inherent characteristics of the MicrobiaNet model. By analysing MicrobiaNet with explainable artificial intelligence (XAI), we demonstrate that XAI can provide insights into how data properties constrain cardinality classification performance in colony counting. Our results show that high visual similarity across classes is the key issue hindering further performance improvement, revising prior assertions about MicrobiaNet. These findings suggest future work should focus on models that explicitly incorporate visual similarity or explore density estimation approaches, with broader implications for neural network classifiers trained on imbalanced datasets.

Chinese Translation

自动细菌菌落计数是现代生物实验室中一种备受追捧的技术，因为它消除了手动计数的工作。先前的研究观察到，MicrobiaNet 作为当前表现最佳的菌落计数基数分类模型，在区分三个或更多个体的菌落时存在困难。然而，目前尚不清楚这是否由于数据的特性与 MicrobiaNet 模型的固有特性所致。通过使用可解释人工智能（XAI）分析 MicrobiaNet，我们展示了 XAI 如何提供关于数据特性如何限制菌落计数中基数分类性能的洞见。我们的结果表明，不同类别之间的高度视觉相似性是阻碍进一步性能提升的关键问题，这修正了关于 MicrobiaNet 的先前论断。这些发现表明，未来的研究应集中于明确纳入视觉相似性的模型或探索密度估计方法，这对在不平衡数据集上训练的神经网络分类器具有更广泛的影响。


cs.CV / 23 / 2604.20027

Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers

无成本的认知对齐：诱导人类注意偏向以实现可解释的视觉变换器

Knights, Ethan

Abstract

For state-of-the-art image understanding, Vision Transformers (ViTs) have become the standard architecture, but their processing diverges substantially from human attentional characteristics. We investigate whether this cognitive gap can be shrunk by fine-tuning the self-attention weights of Google's ViT-B/16 on human saliency fixation maps. To isolate the effects of semantically relevant signals from generic human supervision, the tuned model is compared against a shuffled control. Fine-tuning significantly improved alignment across five saliency metrics and induced three hallmark human-like biases: tuning reversed the baseline's anti-human large-object bias toward small objects, amplified the animacy preference, and diminished extreme attention entropy. Bayesian parity analysis provides decisive to very-strong evidence that this cognitive alignment comes at no cost to the model's original classification performance on in-distribution (ImageNet), corrupted (ImageNet-C) and out-of-distribution (ObjectNet) benchmarks. An equivalent procedure applied to a ResNet-50 Convolutional Neural Network (CNN) instead degraded both alignment and accuracy, suggesting that the ViT's modular self-attention mechanism is uniquely suited for dissociating spatial priority from representational logic. These findings demonstrate that biologically grounded priors can be instilled as a free emergent property of human-aligned attention, to improve transformer interpretability.

Chinese Translation

在最先进的图像理解中，视觉变换器（Vision Transformers, ViTs）已成为标准架构，但其处理方式与人类注意特征存在显著差异。我们研究是否可以通过在谷歌的ViT-B/16上对人类显著性注视图进行微调，从而缩小这一认知差距。为了隔离语义相关信号与一般人类监督的影响，微调后的模型与随机控制组进行了比较。微调显著改善了五个显著性指标的对齐，并诱导出三种典型的人类偏向：微调逆转了基线模型对大物体的反人类偏向，转而偏向小物体，增强了生动性偏好，并降低了极端注意熵。贝叶斯平衡分析提供了决定性到非常强的证据，表明这种认知对齐不会影响模型在内部（ImageNet）、受损（ImageNet-C）和分布外（ObjectNet）基准上的原始分类性能。将相同的程序应用于ResNet-50卷积神经网络（CNN）则导致对齐和准确性均下降，这表明ViT的模块化自注意机制特别适合于将空间优先级与表征逻辑分离。这些发现表明，生物学基础的先验知识可以作为人类对齐注意力的免费涌现属性被植入，从而改善变换器的可解释性。


cs.CV / 24 / 2604.20030

Learning to count small and clustered objects with application to bacterial colonies

学习计数小型和聚集物体及其在细菌菌落中的应用

Zheng, Minghua, Helian, Na, Lane, Peter C. R., Sun, Yi, Donald, Allen

Abstract

Automated bacterial colony counting from images is an important technique to obtain data required for the development of vaccines and antibiotics. However, bacterial colonies present unique machine vision challenges that affect counting, including (1) small physical size, (2) object clustering, (3) high data annotation cost, and (4) limited cross-species generalisation. While FamNet is an established object counting technique effective for clustered objects and costly data annotation, its effectiveness for small colony sizes and cross-species generalisation remains unknown. To address the first three challenges, we propose ACFamNet, an extension of FamNet that handles small and clustered objects using a novel region of interest pooling with alignment and optimised feature engineering. To address all four challenges above, we introduce ACFamNet Pro, which augments ACFamNet with multi-head attention and residual connections, enabling dynamic weighting of objects and improved gradient flow. Experiments show that ACFamNet Pro achieves a mean normalised absolute error (MNAE) of 9.64% under 5-fold cross-validation, outperforming ACFamNet and FamNet by 2.23% and 12.71%, respectively.

Chinese Translation

从图像中自动计数细菌菌落是一项重要技术，能够获取开发疫苗和抗生素所需的数据。然而，细菌菌落呈现出独特的机器视觉挑战，这些挑战影响计数，包括（1）物理尺寸小，（2）物体聚集，（3）高数据标注成本，以及（4）有限的跨物种泛化。虽然FamNet是一种已建立的物体计数技术，能够有效处理聚集物体和高成本的数据标注，但其在小型菌落和跨物种泛化方面的有效性仍然未知。为了解决前三个挑战，我们提出了ACFamNet，这是FamNet的扩展，采用了一种新颖的感兴趣区域池化方法与对齐和优化特征工程，以处理小型和聚集物体。为了解决上述所有四个挑战，我们引入了ACFamNet Pro，它通过多头注意力机制和残差连接增强了ACFamNet，从而实现了物体的动态加权和改进的梯度流。实验表明，ACFamNet Pro在5折交叉验证下实现了9.64%的均值归一化绝对误差（MNAE），分别比ACFamNet和FamNet提高了2.23%和12.71%。
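
The abstract reports results as mean normalised absolute error (MNAE) without spelling out the formula. A common definition, assumed here, averages the absolute counting error divided by the true count over all images; the function name `mnae` is hypothetical and the paper's exact normalisation may differ.

```python
def mnae(predicted, actual):
    """Mean normalised absolute error over images (assumed definition).

    Each image contributes |predicted - actual| / actual, so an error of one
    colony matters more on a sparse plate than on a crowded one.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("true counts must be positive to normalise")
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Made-up counts: off by 1 on a plate of 10 and by 2 on a plate of 20.
error = mnae([9, 22], [10, 20])
```

Multiplying the result by 100 gives a percentage comparable in form to the 9.64% figure quoted above.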


cs.CV / 25 / 2604.20038

FluSplat: Sparse-View 3D Editing without Test-Time Optimization

FluSplat：无测试时间优化的稀疏视图三维编辑

Huang, Haitao, Chng, Shin-Fang, Zhan, Huangying, Yan, Qingan, Xu, Yi

Abstract

Recent advances in text-guided image editing and 3D Gaussian Splatting (3DGS) have enabled high-quality 3D scene manipulation. However, existing pipelines rely on iterative edit-and-fit optimization at test time, alternating between 2D diffusion editing and 3D reconstruction. This process is computationally expensive, scene-specific, and prone to cross-view inconsistencies. We propose a feed-forward framework for cross-view consistent 3D scene editing from sparse views. Instead of enforcing consistency through iterative 3D refinement, we introduce a cross-view regularization scheme in the image domain during training. By jointly supervising multi-view edits with geometric alignment constraints, our model produces view-consistent results without per-scene optimization at inference. The edited views are then lifted into 3D via a feedforward 3DGS model, yielding a coherent 3DGS representation in a single forward pass. Experiments demonstrate competitive editing fidelity and substantially improved cross-view consistency compared to optimization-based methods, while reducing inference time by orders of magnitude.

Chinese Translation

最近在文本引导的图像编辑和三维高斯点云（3D Gaussian Splatting, 3DGS）方面的进展，使得高质量的三维场景操作成为可能。然而，现有的处理流程依赖于测试时的迭代编辑与拟合优化，交替进行二维扩散编辑和三维重建。这个过程计算开销大、场景特定，并且容易出现视图间的不一致性。我们提出了一种前馈框架，用于从稀疏视图进行跨视图一致的三维场景编辑。我们在训练过程中引入了一种图像域的跨视图正则化方案，而不是通过迭代的三维细化来强制一致性。通过联合监督多视图编辑并施加几何对齐约束，我们的模型在推理时无需针对每个场景的优化，便能生成视图一致的结果。然后，编辑后的视图通过前馈的3DGS模型提升到三维，产生在单次前向传递中一致的3DGS表示。实验表明，与基于优化的方法相比，我们的方法在编辑保真度上具有竞争力，并显著改善了视图间的一致性，同时将推理时间减少了几个数量级。


cs.CV / 26 / 2604.20041

Normalizing Flows with Iterative Denoising

迭代去噪的归一化流

Chen, Tianrong, Gu, Jiatao, Berthelot, David, Susskind, Joshua, Zhai, Shuangfei

Abstract

Normalizing Flows (NFs) are a classical family of likelihood-based methods that have received revived attention. Recent efforts such as TARFlow have shown that NFs are capable of achieving promising performance on image modeling tasks, making them viable alternatives to other methods such as diffusion models. In this work, we further advance the state of Normalizing Flow generative models by introducing iterative TARFlow (iTARFlow). Unlike diffusion models, iTARFlow maintains a fully end-to-end, likelihood-based objective during training. During sampling, it performs autoregressive generation followed by an iterative denoising procedure inspired by diffusion-style methods. Through extensive experiments, we show that iTARFlow achieves competitive performance across ImageNet resolutions of 64, 128, and 256 pixels, demonstrating its potential as a strong generative model and advancing the frontier of Normalizing Flows. In addition, we analyze the characteristic artifacts produced by iTARFlow, offering insights that may shed light on future improvements. Code is available at https://github.com/apple/ml-itarflow.

Chinese Translation

归一化流（Normalizing Flows, NFs）是一类经典的基于似然的方法，近年来重新引起了关注。近期的研究如TARFlow表明，NFs在图像建模任务中能够取得令人满意的性能，使其成为扩散模型等其他方法的可行替代方案。在本研究中，我们通过引入迭代TARFlow（iTARFlow）进一步推动了归一化流生成模型的进展。与扩散模型不同，iTARFlow在训练期间保持完全端到端的基于似然的目标。在采样过程中，它执行自回归生成，随后进行受扩散风格方法启发的迭代去噪程序。通过大量实验，我们展示了iTARFlow在64、128和256像素的ImageNet分辨率下实现了具有竞争力的性能，证明了其作为强大生成模型的潜力，并推动了归一化流的前沿。此外，我们分析了iTARFlow产生的特征伪影，提供了可能为未来改进提供启示的见解。代码可在 https://github.com/apple/ml-itarflow 获取。


cs.CV / 27 / 2604.20046

Gaussians on a Diet: High-Quality Memory-Bounded 3D Gaussian Splatting Training

节食的高斯：高质量内存受限的 3D 高斯点云训练

Zhang, Yangming, Xu, Jian, Zhu, Kunxiong, Niu, Wei, Yin, Miao

Abstract

3D Gaussian Splatting (3DGS) has revolutionized novel view synthesis with high-quality rendering through continuous aggregations of millions of 3D Gaussian primitives. However, it suffers from a substantial memory footprint, particularly during training due to uncontrolled densification, posing a critical bottleneck for deployment on memory-constrained edge devices. While existing methods prune redundant Gaussians post-training, they fail to address the peak memory spikes caused by the abrupt growth of Gaussians early in the training process. To solve the training memory consumption problem, we propose a systematic memory-bounded training framework that dynamically optimizes Gaussians through iterative growth and pruning. In other words, the proposed framework alternates between incremental pruning of low-impact Gaussians and strategic growing of new primitives with an adaptive Gaussian compensation, maintaining a near-constant low memory usage while progressively refining rendering fidelity. We comprehensively evaluate the proposed training framework on various real-world datasets under strict memory constraints, showing significant improvements over existing state-of-the-art methods. Particularly, our proposed method practically enables memory-efficient 3DGS training on NVIDIA Jetson AGX Xavier, achieving similar visual quality with up to 80% lower peak training memory consumption than the original 3DGS.

Chinese Translation

3D 高斯点云（3DGS）通过对数百万个 3D 高斯原语的连续聚合，彻底改变了新视角合成的高质量渲染。然而，由于无法控制的密集化，它在训练过程中面临着巨大的内存占用，这对在内存受限的边缘设备上部署构成了关键瓶颈。虽然现有方法在训练后修剪冗余的高斯，但未能解决训练初期高斯数量急剧增长所导致的峰值内存激增问题。为了解决训练内存消耗问题，我们提出了一种系统的内存受限训练框架，通过迭代增长和修剪动态优化高斯。换句话说，所提出的框架在低影响高斯的增量修剪和具有自适应高斯补偿的新原语的战略性增长之间交替进行，保持近乎恒定的低内存使用，同时逐步提高渲染保真度。我们在严格的内存限制下，对各种真实世界数据集全面评估了所提出的训练框架，显示出相对于现有最先进方法的显著改进。特别是，我们提出的方法在 NVIDIA Jetson AGX Xavier 上有效实现了内存高效的 3DGS 训练，其峰值训练内存消耗比原始 3DGS 低达 80%，且视觉质量相似。
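
The alternation between pruning low-impact Gaussians and growing new ones under a fixed budget can be sketched as a single step that prunes before it grows, so the primitive count never exceeds the budget. This is a minimal illustration under assumptions: the function name `bounded_step`, the scalar importance score, and the hard count budget are all hypothetical, and the paper's adaptive Gaussian compensation is not modelled.

```python
def bounded_step(gaussians, candidates, budget):
    """One prune-then-grow step that keeps len(result) <= budget.

    gaussians:  list of (importance, payload) for existing primitives.
    candidates: list of (importance, payload) for proposed new primitives.
    budget:     maximum number of primitives allowed in memory.
    """
    n_new = min(len(candidates), budget)
    overflow = len(gaussians) + n_new - budget
    # Keep the most important primitives; drop just enough to make room.
    kept = sorted(gaussians, key=lambda g: g[0], reverse=True)
    if overflow > 0:
        kept = kept[:len(kept) - overflow]
    return kept + candidates[:n_new]


existing = [(0.9, "g0"), (0.1, "g1"), (0.5, "g2")]
proposed = [(0.0, "n0"), (0.0, "n1")]
after = bounded_step(existing, proposed, budget=4)
```

Pruning first is the design choice that matters for peak memory: growing and then pruning would briefly hold more primitives than the budget allows.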


cs.CV / 28 / 2604.20047

PASTA: A Patch-Agnostic Twofold-Stealthy Backdoor Attack on Vision Transformers

PASTA：一种对视觉变换器的无补丁依赖的双重隐蔽后门攻击

Liu, Dazhuang, Qiao, Yanqi, Wang, Rui, Liang, Kaitai, Smaragdakis, Georgios

Abstract

Vision Transformers (ViTs) have achieved remarkable success across vision tasks, yet recent studies show they remain vulnerable to backdoor attacks. Existing patch-wise attacks typically assume a single fixed trigger location during inference to maximize trigger attention. However, they overlook the self-attention mechanism in ViTs, which captures long-range dependencies across patches. In this work, we observe that a patch-wise trigger can achieve high attack effectiveness when activating backdoors across neighboring patches, a phenomenon we term the Trigger Radiating Effect (TRE). We further find that inter-patch trigger insertion during training can synergistically enhance TRE compared to single-patch insertion. Prior ViT-specific attacks that maximize trigger attention often sacrifice visual and attention stealthiness, making them detectable. Based on these insights, we propose PASTA, a twofold stealthy patch-wise backdoor attack in both pixel and attention domains. PASTA enables backdoor activation when the trigger is placed at arbitrary patches during inference. To achieve this, we introduce a multi-location trigger insertion strategy to enhance TRE. However, preserving stealthiness while maintaining strong TRE is challenging, as TRE is weakened under stealthy constraints. We therefore formulate a bi-level optimization problem and propose an adaptive backdoor learning framework, where the model and trigger iteratively adapt to each other to avoid local optima. Extensive experiments show that PASTA achieves 99.13% attack success rate across arbitrary patches on average, while significantly improving visual and attention stealthiness (144.43x and 18.68x) and robustness (2.79x) against state-of-the-art ViT defenses across four datasets, outperforming CNN- and ViT-based baselines.

Chinese Translation

视觉变换器（ViTs）在视觉任务中取得了显著成功，但最近的研究表明它们仍然容易受到后门攻击。现有的基于补丁的攻击通常假设在推理过程中存在单一固定的触发位置，以最大化触发注意力。然而，它们忽视了ViTs中的自注意力机制，该机制捕捉了补丁之间的长程依赖关系。在本研究中，我们观察到，当在相邻补丁之间激活后门时，基于补丁的触发器可以实现高效的攻击效果，这一现象我们称之为触发辐射效应（Trigger Radiating Effect, TRE）。我们进一步发现，在训练过程中进行补丁间触发器插入可以协同增强TRE，相较于单补丁插入。先前针对ViT的攻击方法通常为了最大化触发注意力而牺牲了视觉和注意隐蔽性，从而使其易于被检测。基于这些见解，我们提出了PASTA，一种在像素和注意力领域均具备双重隐蔽性的基于补丁的后门攻击。PASTA允许在推理过程中在任意补丁上放置触发器时激活后门。为此，我们引入了一种多位置触发器插入策略，以增强TRE。然而，在保持强TRE的同时保持隐蔽性是具有挑战性的，因为在隐蔽约束下，TRE会减弱。因此，我们制定了一个双层优化问题，并提出了一种自适应后门学习框架，其中模型和触发器相互迭代适应，以避免局部最优。大量实验表明，PASTA在任意补丁上的平均攻击成功率达到99.13%，同时显著提高了视觉和注意隐蔽性（分别为144.43倍和18.68倍）以及对四个数据集上最先进的ViT防御的鲁棒性（2.79倍），超越了基于CNN和ViT的基线。


cs.CV / 29 / 2604.20093

FurnSet: Exploiting Repeats for 3D Scene Reconstruction

FurnSet：利用重复实例进行3D场景重建

Dobre, Paul, Wang, Xin, Yang, Hongzhou

Abstract

Single-view 3D scene reconstruction involves inferring both object geometry and spatial layout. Existing methods typically reconstruct objects independently or rely on implicit scene context, failing to exploit the repeated instances commonly present in real-world scenes. We propose FurnSet, a framework that explicitly identifies and leverages repeated object instances to improve reconstruction. Our method introduces per-object CLS tokens and a set-aware self-attention mechanism that groups identical instances and aggregates complementary observations across them, enabling joint reconstruction. We further combine scene-level and object-level conditioning to guide object reconstruction, followed by layout optimization using object point clouds with 3D and 2D projection losses for scene alignment. Experiments on 3D-Future and 3D-Front demonstrate improved scene reconstruction quality, highlighting the effectiveness of exploiting repetition for robust 3D scene reconstruction.

Chinese Translation

单视图3D场景重建涉及推断物体几何形状和空间布局。现有方法通常独立重建物体或依赖隐式场景上下文，未能利用现实场景中常见的重复实例。我们提出了FurnSet，一个明确识别并利用重复物体实例以改善重建的框架。我们的方法引入了每个物体的CLS（Class）标记和一种集合感知（set-aware）自注意力机制，该机制将相同的实例分组并聚合它们之间的互补观察，从而实现联合重建。我们进一步结合场景级和物体级的条件引导物体重建，随后使用物体点云进行布局优化，并采用3D和2D投影损失进行场景对齐。在3D-Future和3D-Front上的实验表明，重建质量得到了改善，突显了利用重复性进行稳健的3D场景重建的有效性。


cs.CV / 30 / 2604.20123

Topology-Aware Skeleton Detection via Lighthouse-Guided Structured Inference

基于灯塔引导的结构推理的拓扑感知骨架检测

Fu, Daoyong, Zhang, Xiang, Zhan, Zhaohuan, Yang, Fan, Yang, Ke

Abstract

In natural images, object skeletons are used to represent geometric shapes. However, even slight variations in pose or movement can cause noticeable changes in skeleton structure, increasing the difficulty of detecting the skeleton and often resulting in discontinuous skeletons. Existing methods primarily focus on point-level skeleton detection and overlook the importance of structural continuity in recovering complete skeletons. To address this issue, we propose Lighthouse-Skel, a topology-aware skeleton detection method via lighthouse-guided structured inference. Specifically, we introduce a dual-branch collaborative detection framework that jointly learns a skeleton confidence field and structural anchors, including endpoints and junction points. The spatial distributions learned by the point branch guide the network to focus on topologically vulnerable regions, which improves the accuracy of skeleton detection. Based on the learned skeleton confidence field, we further propose a lighthouse-guided topology completion strategy, which uses detected junction points and breakpoints as lighthouses to reconnect discontinuous skeleton segments along low-cost paths, thereby improving skeleton continuity and structural integrity. Experimental results on four public datasets demonstrate that the proposed method achieves competitive detection accuracy while substantially improving skeleton connectivity and structural integrity.

Chinese Translation

在自然图像中，物体骨架用于表示几何形状。然而，即使是姿态或运动的轻微变化也会导致骨架结构的显著变化，从而增加了骨架检测的难度，并且常常导致骨架不连续。现有方法主要集中在点级骨架点检测上，而忽视了在恢复完整骨架过程中结构连续性的重要性。为了解决这一问题，我们提出了Lighthouse-Skel，一种基于灯塔引导的结构推理的拓扑感知骨架检测方法。具体而言，我们引入了一种双分支协作检测框架，联合学习骨架置信场和结构锚点，包括端点和交叉点。点分支学习到的空间分布引导网络关注拓扑脆弱区域，从而提高骨架检测的准确性。基于学习到的骨架置信场，我们进一步提出了一种灯塔引导的拓扑补全策略，该策略利用检测到的交叉点和断点作为灯塔，沿低成本路径重新连接不连续的骨架段，从而提高骨架的连续性和结构完整性。在四个公共数据集上的实验结果表明，所提方法在实现竞争性检测准确度的同时，显著提高了骨架的连通性和结构完整性。
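
Reconnecting two skeleton fragments "along low-cost paths" can be pictured as a shortest-path search over the confidence field. The sketch below is an assumption-laden illustration, not the paper's algorithm: it uses Dijkstra's algorithm on a 4-connected grid, charges `1 - confidence` for entering a pixel, and the function name `low_cost_path` is hypothetical.

```python
import heapq


def low_cost_path(confidence, start, goal):
    """Cheapest 4-connected path between two lighthouse pixels.

    confidence: 2D list of skeleton confidences in [0, 1].
    start/goal: (row, col) tuples inside the grid.
    Assumption: entering a pixel costs 1 - confidence, so the path prefers
    pixels the network already believes lie on the skeleton.
    """
    rows, cols = len(confidence), len(confidence[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + (1.0 - confidence[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


field = [[1.0, 0.1, 1.0],
         [1.0, 0.1, 1.0],
         [1.0, 1.0, 1.0]]
bridge = low_cost_path(field, (0, 0), (0, 2))
```

In this toy field the direct route crosses a low-confidence pixel, so the search detours through the high-confidence border instead.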


cs.CV / 31 / 2604.20128

Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging

用于马赛克和全色融合成像的半监督流匹配

Luo, Peiming, Wang, Nan, Liu, Litong, Huang, Jiahan, Wu, Chenxu, Dian, Renwei, Hou, Junming

Abstract

Fusing a low-resolution (LR) mosaiced hyperspectral image (HSI) with a high-resolution (HR) panchromatic (PAN) image offers a promising avenue for video-rate HR-HSI imaging via single-shot acquisition, yet its severely ill-posed nature remains a significant challenge. In this work, we propose a novel semi-supervised flow matching framework for mosaiced and PAN image fusion. Unlike previous diffusion-based approaches constrained by specific protocols or handcrafted assumptions, our method seamlessly integrates an unsupervised scheme with flow matching, resulting in a generalizable and efficient generative framework. Specifically, our method follows a two-stage training pipeline. First, we pretrain an unsupervised prior network to produce an initial pseudo HR-HSI. Building on this, we then train a conditional flow matching model to generate the target HR-HSI, introducing a random voting mechanism that iteratively refines the initial HR-HSI estimate, enabling robust and effective fusion. During inference, we employ a conflict-free gradient guidance strategy that ensures spectrally and spatially consistent HR-HSI reconstruction. Experiments on multiple benchmark datasets demonstrate that our method outperforms representative baselines by a significant margin, both quantitatively and qualitatively. Beyond mosaiced and PAN fusion, our approach provides a flexible generative framework that can be readily extended to other image fusion tasks and integrated with unsupervised or blind image restoration algorithms.

Chinese Translation

将低分辨率（LR）马赛克高光谱图像（HSI）与高分辨率（HR）全色图像（PAN）融合，为通过单次采集实现视频速率的HR-HSI成像提供了一个有前景的途径，但其严重的不适定性仍然是一个重大挑战。在本研究中，我们提出了一种新颖的半监督流匹配框架，用于马赛克和PAN图像融合。与以往受特定协议或手工假设限制的扩散基础方法不同，我们的方法无缝地将无监督方案与流匹配相结合，形成一个可推广且高效的生成框架。具体而言，我们的方法遵循一个两阶段的训练流程。首先，我们预训练一个无监督先验网络以生成初始伪HR-HSI。在此基础上，我们训练一个条件流匹配模型以生成目标HR-HSI，引入一种随机投票机制，迭代地细化初始HR-HSI估计，从而实现稳健有效的融合。在推理过程中，我们采用无冲突的梯度引导策略，确保光谱和空间上一致的HR-HSI重建。在多个基准数据集上的实验表明，我们的方法在定量和定性性能上显著优于代表性基线。除了马赛克和PAN融合外，我们的方法还提供了一个灵活的生成框架，可以方便地扩展到其他图像融合任务，并与无监督或盲图像恢复算法集成。


cs.CV / 32 / 2604.20136

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

IMPACT-CYCLE：一种基于契约的多智能体系统，用于长视频语义记忆的断言级监督修正

Kong, Weitong, Wen, Di, Peng, Kunyu, Schneider, David, Zhong, Zeyun, Jaus, Alexander, Marinov, Zdravko, Wei, Jiale, Liu, Ruiping, Zheng, Junwei, Chen, Yufan, Qi, Lei, Stiefelhagen, Rainer

Abstract

Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory -- a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.

Chinese Translation

在长视频理解中纠正错误的成本过高：现有的多模态管道产生不透明的端到端输出，无法提供任何中间状态供检查，迫使标注者重新查看原始视频并从头重建时间逻辑。核心瓶颈不仅在于生成质量，还在于缺乏一个监督接口，使得人力投入与每个错误的范围成比例。我们提出了IMPACT-CYCLE，这是一种监督多智能体系统，将长视频理解重新构建为对共享语义记忆的迭代断言级维护，即一种结构化的、版本化的状态，编码了类型化的断言、断言依赖图和来源日志。在明确的授权契约下，角色专业化的智能体将验证分解为局部对象关系的正确性、跨时间的一致性和全局语义的连贯性，修正仅限于结构上相关的断言。当自动化证据不足时，该系统将升级到人类仲裁，作为具有最终覆盖权的监督权威；依赖闭合的重新验证确保修正成本与错误范围成比例。对VidOR的实验显示下游推理显著改善（视觉问答：从0.71提高到0.79），人类仲裁成本降低了4.8倍，工作量显著低于手动标注。代码将发布在 https://github.com/MKong17/IMPACT_CYCLE。
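
Dependency-closure re-verification means that after one claim is corrected, only the claims that depend on it, directly or transitively, are checked again. A minimal sketch of that closure over a claim dependency graph follows; the function name `dependency_closure` and the dictionary representation are hypothetical, and the released code may organise this differently.

```python
from collections import deque


def dependency_closure(dependents, corrected):
    """Claims that must be re-verified after `corrected` changes.

    dependents: dict mapping a claim id to the ids of claims that depend on it.
    Returns every claim reachable from `corrected`, excluding itself.
    """
    to_check = set()
    queue = deque([corrected])
    while queue:
        claim = queue.popleft()
        for child in dependents.get(claim, []):
            if child not in to_check and child != corrected:
                to_check.add(child)   # visited set also guards against cycles
                queue.append(child)
    return to_check


graph = {"c1": ["c2", "c3"], "c2": ["c4"], "c3": [], "c4": [], "c5": ["c6"]}
affected = dependency_closure(graph, "c1")
```

Here correcting `c1` triggers re-verification of `c2`, `c3`, and `c4`, while the unrelated `c5` and `c6` are left alone, which is what keeps correction cost proportional to error scope.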


cs.CV / 33 / 2604.20155

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

GSCompleter：一种无蒸馏的插件，用于秒级度量感知3D高斯喷溅补全

Gao, Ao, Gong, Jingyu, Tan, Xin, Zhang, Zhizhong, Xie, Yuan

Abstract

While 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering, its performance degrades significantly under sparse-view extrapolation, manifesting as severe geometric voids and artifacts. Existing solutions primarily rely on an iterative "Repair-then-Distill" paradigm, which is inherently unstable and prone to overfitting. In this work, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Our approach first synthesizes plausible 2D reference images and explicitly lifts them into metric-scale 3D primitives via a robust Stereo-Anchor mechanism. These primitives are then seamlessly integrated into the global context through a novel Ray-Constrained Registration strategy. This shift to a rapid registration paradigm delivers superior 3DGS completion performance across three distinct benchmarks, enhancing the quality and efficiency of various baselines and achieving new SOTA results.

Chinese Translation

尽管3D高斯喷溅（3DGS）已彻底改变了实时渲染，但在稀疏视图外推下，其性能显著下降，表现为严重的几何空洞和伪影。现有解决方案主要依赖于迭代的“修复-再蒸馏”范式，这种方法本质上不稳定且容易过拟合。在本研究中，我们提出了GSCompleter，这是一种无蒸馏的插件，将场景补全转变为稳定的“生成-再注册”工作流程。我们的方法首先合成合理的2D参考图像，并通过稳健的立体锚定机制将其显式提升为度量尺度的3D原语。这些原语随后通过一种新颖的射线约束注册策略无缝集成到全局上下文中。这种快速注册范式的转变在三个不同基准上提供了卓越的3DGS补全性能，提升了各种基线的质量和效率，并实现了新的最先进（SOTA）结果。


cs.CV / 34 / 2604.20157

HumanScore: Benchmarking Human Motions in Generated Videos

HumanScore：生成视频中人类动作的基准评估

Fang, Yusu, Xiang, Tiange, Tan, Tian, Schuetz, Narayan, Delp, Scott, Fei-Fei, Li, Adeli, Ehsan

Abstract

Recent advances in model architectures, compute, and data scale have driven rapid progress in video generation, producing increasingly realistic content. Yet, no prior method systematically measures how faithfully these systems render human bodies and motion dynamics. In this paper, we present HumanScore, a systematic framework to evaluate the quality of human motions in AI-generated videos. HumanScore defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency, enabling fine-grained diagnosis beyond visual realism alone. Through carefully designed prompts, we elicit a diverse set of movements at varying intensities and evaluate videos generated by thirteen state-of-the-art models. Our analysis reveals consistent gaps between perceptual plausibility and motion biomechanical fidelity, identifies recurrent failure modes (e.g., temporal jitter, anatomically implausible poses, and motion drift), and produces robust model rankings from quantitative and physically meaningful criteria.

Chinese Translation

近年来，模型架构、计算能力和数据规模的快速进展推动了视频生成技术的发展，生成的内容愈加逼真。然而，之前的方法并未系统性地测量这些系统在多大程度上忠实地呈现人类身体和运动动态。本文提出了HumanScore，一个系统化框架，用于评估人工智能生成视频中人类动作的质量。HumanScore定义了六个可解释的指标，涵盖运动学的合理性、时间稳定性和生物力学一致性，使得评估超越了单纯的视觉真实感。通过精心设计的提示，我们引导出一组多样化的运动，涵盖不同强度，并评估了由十三个最先进模型生成的视频。我们的分析揭示了感知合理性与运动生物力学忠实度之间的一致差距，识别出反复出现的失败模式（例如，时间抖动、解剖不合理的姿势和运动漂移），并根据定量和物理上有意义的标准生成了稳健的模型排名。

cs.CV / 35 / 2604.20169

Semantic-Fast-SAM: Efficient Semantic Segmenter

语义快速分割模型：高效的语义分割器

Kim, Byunghyun

Abstract

We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.

Chinese Translation

我们提出了语义快速分割模型（Semantic-Fast-SAM，SFS），这是一个将快速分割任意模型（Fast Segment Anything，FastSAM）与语义标注管道相结合的语义分割框架，以实现实时性能而不牺牲准确性。FastSAM是对分割任意模型（Segment Anything Model，SAM）的高效卷积神经网络（CNN）重实现，其运行速度远快于原始的基于变换器的SAM。在FastSAM快速生成掩膜的基础上，我们整合了一种语义分割任意（Semantic-Segment-Anything，SSA）标注策略，以为每个掩膜分配有意义的类别。最终的SFS模型以原始基于SAM的方法的计算成本和内存占用的一小部分，生成高质量的语义分割图。我们在Cityscapes和ADE20K基准上的实验表明，SFS在准确性上与先前的基于SAM的方法相匹配（在Cityscapes上的mIoU约为70.33，在ADE20K上的mIoU约为48.01），同时在封闭集设置中实现了比SSA快约20倍的推理速度。我们还展示了SFS通过利用基于CLIP的语义头有效处理开放词汇分割，超越了最近的开放词汇模型在广泛类别标注上的表现。这项工作使得具有“分割任意”能力的实用实时语义分割成为可能，拓宽了基础分割模型在机器人场景中的适用性。实现代码可在https://github.com/KBH00/Semantic-Fast-SAM获取。
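
The mask-labeling step in the Semantic-Fast-SAM abstract can be pictured as a vote: each class-agnostic mask takes the class that a separate semantic model predicts most often inside it. The sketch below assumes that majority-vote rule; `label_masks` and the toy grids are illustrative names and data, not taken from the SFS repository.

```python
from collections import Counter

def label_masks(masks, class_map):
    """Assign each binary mask the most common class id found under it.

    masks: list of 2D 0/1 grids (class-agnostic masks, e.g. from a SAM-style model)
    class_map: 2D grid of per-pixel class ids from a separate semantic model
    """
    labels = []
    for mask in masks:
        votes = Counter(
            class_map[r][c]
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        )
        # An empty mask receives no label.
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels

# Toy 3x3 scene with three classes (0, 1, 2).
class_map = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 1, 1],
]
masks = [
    [[1, 1, 0], [1, 0, 0], [0, 0, 0]],  # lies entirely on class 0
    [[0, 0, 1], [0, 1, 1], [1, 1, 1]],  # mostly class 1, one pixel of class 2
]
labels = label_masks(masks, class_map)
```

Voting per mask keeps the sharp mask boundaries from the fast mask generator while borrowing category names from the semantic model, which is why the mask generator's speed dominates the overall cost.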

cs.CV / 36 / 2604.20190

WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

WildFireVQA：用于空中野火监测的大规模辐射热视觉问答基准

Habibpour, Mobin, Talemi, Niloufar Alipour, Spodnik, John, Khoury, Camren J., Afghah, Fatemeh

Abstract

Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at https://github.com/mobiiin/WildFire_VQA.

Chinese Translation

野火监测需要来自空中平台的及时、可操作的态势感知，但现有的空中视觉问答（VQA）基准并未评估基于热测量的特定于野火的多模态推理。我们引入了WildFireVQA，这是一个用于空中野火监测的大规模VQA基准，整合了RGB图像与辐射热数据。WildFireVQA包含6097个RGB-热样本，每个样本包括一幅RGB图像、一幅彩色映射的热可视化图和一幅辐射热TIFF，并配有34个问题，总计207298个多项选择题，涵盖了存在与检测、分类、分布与分割、定位与方向、跨模态推理以及操作性野火情报的飞行规划。为了提高标注的可靠性，我们结合了基于多模态大语言模型（MLLM）的答案生成、传感器驱动的确定性标注、人工验证以及帧内和帧间一致性检查。我们进一步建立了一个全面的评估协议，以在RGB、热图和检索增强设置下使用辐射热统计数据对代表性MLLM进行评估。实验表明，在各个任务类别中，RGB仍然是当前模型最强的模态，而检索的热上下文为更强的MLLM带来了收益，突显了基于温度的推理的价值以及现有MLLM在安全关键的野火场景中的局限性。数据集和基准代码已开源，网址为 https://github.com/mobiiin/WildFire_VQA。

cs.CV / 37 / 2604.20191

From Scene to Object: Text-Guided Dual-Gaze Prediction

从场景到对象：文本引导的双视线预测

Ke, Zehong, Jiang, Yanbo, Li, Jinhao, Liu, Zhiyuan, Tu, Yiqian, Meng, Qingwen, Huang, Heye, Wang, Jianqiang

Abstract

Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.

Chinese Translation

可解释的驾驶员注意力预测对于类人自主驾驶至关重要。然而，现有数据集仅提供场景级的全局注视，而非细粒度的对象级注释，这在本质上无法支持基于文本的认知建模。因此，尽管视觉-语言模型（VLMs）在语义推理方面具有巨大潜力，但这一关键数据限制导致了严重的文本-视觉解耦和视觉偏差幻觉。为了打破这一瓶颈，实现精确的对象级注意力预测，本文提出了一种新颖的双分支注视预测框架，建立了从数据构建到模型架构的完整范式。首先，我们构建了G-W3DA，一个对象级驾驶员注意力数据集。通过将多模态大语言模型与Segment Anything Model 3（SAM3）相结合，我们在严格的交叉验证下将宏观热图解耦为对象级掩膜，从根本上消除了注释幻觉。在这一高质量数据基础上，我们提出了DualGaze-VLM架构。该架构提取语义查询的隐藏状态，并通过条件感知SE-Gate动态调节视觉特征，实现意图驱动的精确空间锚定。在W3DA基准上的广泛实验表明，DualGaze-VLM在空间对齐指标上始终超越现有的最先进（SOTA）模型，特别是在安全关键场景下，Similarity (SIM) 最高提高了17.8%。此外，视觉图灵测试表明，DualGaze-VLM生成的注意力热图被88.22%的人工评估者视为真实，证明了其生成合理认知先验的能力。

cs.CV / 38 / 2604.20213

Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images

基于加权知识蒸馏的全景X光影像上颌窦半监督分割

Park, Juha, Choi, Jiho, Yun, Jong Pil, Park, Yong Chan, Yeom, Han-Gyeol, Lee, Byung Do, Lee, Sang Jun

Abstract

Accurate segmentation of the maxillary sinus in panoramic X-ray images is essential for dental diagnosis and surgical planning; however, this task remains relatively underexplored in dental imaging research. Structural overlap, ambiguous anatomical boundaries inherent to two-dimensional panoramic projections, and the limited availability of large-scale clinical datasets with reliable pixel-level annotations make the development and evaluation of segmentation models challenging. To address these challenges, we propose a semi-supervised segmentation framework that effectively leverages both labeled and unlabeled panoramic radiographs, where knowledge distillation is utilized to train a student model with reliable structural information distilled from a teacher model. Specifically, we introduce a weighted knowledge distillation loss to suppress unreliable distillation signals caused by structural discrepancies between teacher and student predictions. To further enhance the quality of pseudo labels generated by the teacher network, we introduce SinusCycle-GAN, a refinement network based on unpaired image-to-image translation. This refinement process improves the precision of boundaries and reduces noise propagation when learning from unlabeled data during semi-supervised training. To evaluate the proposed method, we collected clinical panoramic X-ray images from 2,511 patients, and experimental results demonstrate that the proposed method outperforms state-of-the-art segmentation models, achieving a Dice score of 96.35% while reducing boundary error. The results indicate that the proposed semi-supervised framework provides robust and anatomically consistent segmentation performance under limited labeled data conditions, highlighting its potential for broader dental image analysis applications.

Chinese Translation

在全景X光影像中准确分割上颌窦对于牙科诊断和手术规划至关重要；然而，这一任务在牙科影像研究中仍然相对欠缺探索。结构重叠、二维全景投影固有的模糊解剖边界，以及缺乏大规模临床数据集和可靠的像素级标注，使得分割模型的开发和评估面临挑战。为了解决这些问题，我们提出了一种半监督分割框架，能够有效利用标记和未标记的全景放射影像，其中知识蒸馏被用来训练一个学生模型，该模型从教师模型中提取可靠的结构信息。具体而言，我们引入了一种加权知识蒸馏损失，以抑制由于教师和学生预测之间的结构差异而导致的不可靠蒸馏信号。为了进一步提高教师网络生成的伪标签的质量，我们引入了SinusCycle-GAN，这是一种基于无配对图像到图像转换的精炼网络。该精炼过程在半监督训练期间从未标记数据学习时，提高了边界的精确度并减少了噪声传播。为了评估所提出的方法，我们收集了来自2511名患者的临床全景X光影像，实验结果表明，所提出的方法优于最先进的分割模型，达到了96.35%的Dice分数，同时减少了边界误差。结果表明，所提出的半监督框架在有限标记数据条件下提供了稳健且解剖一致的分割性能，突显了其在更广泛的牙科影像分析应用中的潜力。

cs.CV / 39 / 2604.20226

Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation

学习空间-时间一致相关性以进行保语音的面部表情操控

Chen, Tianshui, Lin, Jianman, Yang, Zhijing, Qing, Chunmei, Wang, Guangrun, Lin, Liang

Abstract

Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current methods depend on hard-to-obtain paired training samples of the same person, in which two aligned frames exhibit the same speech content yet differ in emotional expression, limiting the application of SPFEM in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models the aforementioned correlations as explicit metrics and integrates these metrics to supervise facial expression manipulation while better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image linked to a specific emotion closely resemble those of corresponding regions in an image linked to a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those in the corresponding regions of frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we also design a correlation-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process.

Chinese Translation

保语音的面部表情操控（SPFEM）旨在修改面部情感，同时仔细保持与所说内容相关的口部动画。目前的研究依赖于不可获取的配对训练样本，其中两个对齐的帧展示相同的语音内容但在情感表达上有所不同，这限制了SPFEM在现实场景中的应用。在本研究中，我们发现传达相同内容但情感不同的发言者在空间和时间上表现出高度相关的局部面部动画，为SPFEM提供了宝贵的监督。为了利用这一见解，我们提出了一种新颖的空间-时间一致相关性学习（STCCL）算法，该算法将上述相关性建模为显式度量，并将这些度量整合以监督面部表情的操控，同时更好地保持所说内容的面部动画。为此，它首先学习一个空间一致相关性度量，确保与特定情感相关的图像中相邻局部区域的视觉相关性与与不同情感相关的图像中对应区域的视觉相关性密切相似。同时，它开发了一个时间一致相关性度量，确保与一种情感相关的相邻图像帧中特定区域的视觉相关性与与另一种情感相关的帧中对应区域的视觉相关性相似。认识到视觉相关性在所有区域并不均匀，我们还设计了一种关注相关性的自适应策略，优先考虑那些呈现更大挑战的区域。在SPFEM模型训练过程中，我们构建输入和输出图像帧对应局部区域之间的空间-时间一致相关性度量作为额外损失，以监督生成过程。

cs.CV / 40 / 2604.20243

Bio-inspired Color Constancy: From Gray Anchoring Theory to Gray Pixel Methods

仿生色彩恒常性：从灰锚理论到灰像素方法

Yang, Kai-Fu, Luo, Fu-Ya, Li, Yong-Jie

Abstract

Color constancy is a fundamental ability of many biological visual systems and a crucial step in computer imaging systems. Bio-inspired modeling offers a promising way to elucidate the computational principles underlying color constancy and to develop efficient computational methods. However, bio-inspired methods for color constancy remain underexplored and lack a comprehensive analysis. This paper presents a comprehensive technical framework that integrates biological mechanisms, computational theory, and algorithmic implementation for bio-inspired color constancy. Specifically, we systematically revisit the computational theory of biological color constancy, which shows that illuminant estimation can be reduced to the task of gray-anchor (pixel or surface) detection in early vision. Subsequently, typical gray-pixel detection methods, including Gray-Pixel and Grayness-Index, are reinterpreted within a unified theoretical framework with the Lambertian reflection model and biological color-opponent mechanisms. Finally, we propose a simple learning-based method that couples reflection-model constraints with feature learning to explore the potential of bio-inspired color constancy based on gray-pixel detection. Extensive experiments confirm the effectiveness of gray-pixel detection for color constancy and demonstrate the potential of bio-inspired methods.

Chinese Translation

色彩恒常性是许多生物视觉系统的基本能力，也是计算机成像系统中的关键步骤。仿生建模提供了一种有前景的方法，以阐明色彩恒常性背后的计算原理，并开发高效的计算方法。然而，针对色彩恒常性的仿生方法仍然未被充分探索，缺乏全面的分析。本文提出了一个综合性的技术框架，整合了生物机制、计算理论和算法实现，以实现仿生色彩恒常性。具体而言，我们系统地回顾了生物色彩恒常性的计算理论，表明光源估计可以简化为早期视觉中灰锚（像素或表面）检测的任务。随后，典型的灰像素检测方法，包括灰像素（Gray-Pixel）和灰度指数（Grayness-Index），在拉姆伯特反射模型和生物色彩对抗机制的统一理论框架内被重新解释。最后，我们提出了一种简单的基于学习的方法，将反射模型约束与特征学习相结合，以探索基于灰像素检测的仿生色彩恒常性的潜力。大量实验确认了灰像素检测在色彩恒常性中的有效性，并展示了仿生方法的潜力。
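
The reduction of illuminant estimation to gray-anchor detection can be illustrated with a small sketch. It assumes the gray pixels have already been detected, averages their RGB values to estimate the illuminant chromaticity, and applies a diagonal (von Kries-style) correction. The function names and sample values are illustrative and do not come from the paper.

```python
def estimate_illuminant(gray_pixels):
    """Average the RGB of pixels assumed achromatic and normalise to sum 1."""
    n = len(gray_pixels)
    mean = [sum(p[ch] for p in gray_pixels) / n for ch in range(3)]
    total = sum(mean)
    return [m / total for m in mean]

def correct(pixel, illuminant):
    """Diagonal correction: divide each channel by 3x the illuminant chromaticity,
    so a neutral illuminant (1/3, 1/3, 1/3) leaves the pixel unchanged."""
    return [v / (3.0 * e) for v, e in zip(pixel, illuminant)]

# Two gray surfaces (reflectance 0.5 and 0.2) lit by a warm light (1.0, 0.8, 0.6).
gray_pixels = [(0.5, 0.4, 0.3), (0.2, 0.16, 0.12)]
illuminant = estimate_illuminant(gray_pixels)
corrected = correct(gray_pixels[0], illuminant)
```

After correction the three channels of a truly gray pixel come out equal, which is the defining property the detection step relies on. Under a Lambertian model a gray surface reflects the illuminant colour scaled by a single factor, so its chromaticity equals that of the light.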

cs.CV / 41 / 2604.20258

Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

重新思考编辑位置：基于任务的指令图像编辑定位

He, Jingxuan, Wang, Xiyu, Zheng, Mengyu, Zeng, Xiangyu, Wang, Yunke, Xu, Chang

Abstract

Instruction-based image editing (IIE) aims to modify images according to textual instructions while preserving irrelevant content. Despite recent advances in diffusion transformers, existing methods often suffer from over-editing, introducing unintended changes to regions unrelated to the desired edit. We identify that this limitation arises from the lack of an explicit mechanism for edit localization. In particular, different editing operations (e.g., addition, removal and replacement) induce distinct spatial patterns, yet current IIE models typically treat localization in a task-agnostic manner. To address this limitation, we propose a training-free, task-aware edit localization framework that exploits the intrinsic source and target image streams within IIE models. For each image stream, we first obtain attention-based edit cues, and then construct feature centroids based on these attentive cues to partition tokens into edit and non-edit regions. Based on the observation that optimal localization is inherently task-dependent, we further introduce a unified mask construction strategy that selectively leverages source and target image streams for different editing tasks. We provide a systematic analysis for our proposed insights and approaches. Extensive experiments on EdiVal-Bench demonstrate that our framework consistently improves non-edit region consistency while maintaining strong instruction-following performance on top of powerful recent image editing backbones, including Step1X-Edit and Qwen-Image-Edit.

Chinese Translation

基于指令的图像编辑（IIE）旨在根据文本指令修改图像，同时保留无关内容。尽管扩散变换器（diffusion transformers）在最近取得了进展，现有方法往往面临过度编辑的问题，导致对与所需编辑无关的区域引入意外变化。我们发现这一限制源于缺乏明确的编辑定位机制。特别是，不同的编辑操作（例如，添加、移除和替换）会引发不同的空间模式，而当前的IIE模型通常以任务无关的方式处理定位。为了解决这一限制，我们提出了一种无训练的、基于任务的编辑定位框架，该框架利用IIE模型中固有的源图像和目标图像流。对于每个图像流，我们首先获得基于注意力的编辑线索，然后基于这些注意线索构建特征质心，将标记分为编辑区域和非编辑区域。基于最佳定位本质上依赖于任务的观察，我们进一步引入了一种统一的掩码构建策略，选择性地利用源图像和目标图像流来处理不同的编辑任务。我们对所提出的见解和方法进行了系统分析。在EdiVal-Bench上的大量实验表明，我们的框架在保持强大的指令遵循性能的同时，始终提高了非编辑区域的一致性，基于强大的最新图像编辑骨干网络，包括Step1X-Edit和Qwen-Image-Edit。

cs.CV / 42 / 2604.20268

Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization

基于多任务深度学习框架的常规膝关节X光片机会性骨质流失筛查：灵敏度约束阈值优化

Li, Zhaochen, Yan, Xinghao, Zhou, Runni, Li, Xiaoyang, Zhu, Chenjie, Wang, Gege, Shi, Yu, Zhang, Lixin, Fu, Rongrong, Yan, Liehao, Chai, Yuan

Abstract

Background: Osteoporosis and osteopenia are often undiagnosed until fragility fractures occur. Dual-energy X-ray absorptiometry (DXA) is the reference standard for bone mineral density (BMD) assessment, but access remains limited. Knee radiographs are obtained at high volume for osteoarthritis evaluation and may offer an opportunity for opportunistic bone-loss screening. Objective: To develop and evaluate a multi-task deep learning system for opportunistic bone-loss screening from routine knee radiographs without additional imaging or patient visits. Methods: We developed STR-Net, a multi-task framework for single-channel grayscale knee radiographs. The model includes a shared backbone, global average pooling feature aggregation, a shared neck, and a task-aware representation routing module connected to three task-specific heads: binary screening (Normal vs. Bone Loss), severity sub-classification (Osteopenia vs. Osteoporosis), and weakly coupled T-score regression with optional clinical variables. A sensitivity-constrained threshold optimization strategy (minimum sensitivity >= 0.86) was applied. The dataset included 1,570 knee radiographs, split at the patient level into training (n=1,120), validation (n=226), and test (n=224) sets. Results: On the held-out test set, STR-Net achieved an AUROC of 0.933, sensitivity of 0.904, specificity of 0.773, and AUPRC of 0.956 for binary screening. Severity sub-classification achieved an AUROC of 0.898. The T-score regression branch showed a Pearson correlation of 0.801 with DXA-measured T-scores in a pilot subset (n=31), with MAE of 0.279 and RMSE of 0.347. Conclusions: STR-Net enables single-pass bone-loss screening, severity stratification, and quantitative T-score estimation from routine knee radiographs. Prospective clinical validation is needed before deployment.

Chinese Translation

背景：骨质疏松症和骨量减少症常常在脆弱性骨折发生之前未被诊断。双能X射线吸收法（DXA）是骨矿密度（BMD）评估的参考标准，但获取仍然有限。膝关节X光片在评估骨关节炎时获得的数量较多，可能为机会性骨质流失筛查提供机会。目的：开发并评估一个多任务深度学习系统，以便从常规膝关节X光片中进行机会性骨质流失筛查，而无需额外的影像学检查或患者就诊。方法：我们开发了STR-Net，一个用于单通道灰度膝关节X光片的多任务框架。该模型包括一个共享的主干、全局平均池化特征聚合、一个共享的颈部和一个任务感知的表示路由模块，连接到三个特定任务的头部：二元筛查（正常 vs. 骨质流失）、严重程度子分类（骨量减少 vs. 骨质疏松）以及与可选临床变量的弱耦合T评分回归。应用了一种灵敏度约束的阈值优化策略（最低灵敏度 >= 0.86）。数据集包含1,570个膝关节X光片，按患者级别分为训练集（n=1,120）、验证集（n=226）和测试集（n=224）。结果：在保留的测试集上，STR-Net实现了二元筛查的AUROC为0.933，灵敏度为0.904，特异性为0.773，AUPRC为0.956。严重程度子分类的AUROC为0.898。T评分回归分支在一个初步子集中（n=31）显示与DXA测量的T评分的Pearson相关系数为0.801，MAE为0.279，RMSE为0.347。结论：STR-Net能够从常规膝关节X光片中实现单次骨质流失筛查、严重程度分层和定量T评分估计。在部署之前需要进行前瞻性的临床验证。
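
The sensitivity-constrained threshold step in the STR-Net abstract can be sketched as a small search. The abstract states only the constraint (sensitivity >= 0.86). The objective used here, maximising specificity among feasible thresholds, is an assumption, as are the function name and the toy scores.

```python
def pick_threshold(scores, labels, min_sensitivity=0.86):
    """Return (threshold, sensitivity, specificity) with the highest specificity
    among thresholds whose sensitivity is at least min_sensitivity.

    A sample is called positive when score >= threshold.
    Assumes labels contain at least one positive (1) and one negative (0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens = tp / positives
        spec = tn / negatives
        # Keep only thresholds that satisfy the sensitivity floor.
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    return best

# Toy validation scores: 7 bone-loss cases (1) and 3 normal cases (0).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
threshold, sensitivity, specificity = pick_threshold(scores, labels)
```

In this toy data a threshold of 0.5 would separate the classes more cleanly, but it misses one positive case (sensitivity 6/7, about 0.857), so the constraint forces the lower threshold of 0.3. For a screening tool, missed cases are usually treated as costlier than extra referrals, and this is the trade-off the constraint encodes.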

cs.CV / 43 / 2604.20281

Fourier Series Coder: A Novel Perspective on Angle Boundary Discontinuity Problem for Oriented Object Detection

傅里叶级数编码器：面向有向物体检测的角边界不连续性问题的新视角

Wei, Minghong, Cao, Pu, Chen, Zhihao, Zang, Zhiyuan, Yang, Lu, Song, Qing

Abstract

With the rapid advancement of intelligent driving and remote sensing, oriented object detection has gained widespread attention. However, achieving high-precision performance is fundamentally constrained by the Angle Boundary Discontinuity (ABD) and Cyclic Ambiguity (CA) problems, which typically cause significant angle fluctuations near periodic boundaries. Although recent studies propose continuous angle coders to alleviate these issues, our theoretical and empirical analyses reveal that state-of-the-art methods still suffer from substantial cyclic errors. We attribute this instability to the structural noise amplification within their non-orthogonal decoding mechanisms. This mathematical vulnerability significantly exacerbates angular deviations, particularly for square-like objects. To resolve this fundamentally, we propose the Fourier Series Coder (FSC), a lightweight plug-and-play component that establishes a continuous, reversible, and mathematically robust angle encoding-decoding paradigm. By rigorously mapping angles onto a minimal orthogonal Fourier basis and explicitly enforcing a geometric manifold constraint, FSC effectively prevents feature modulus collapse. This structurally stabilized representation ensures highly robust phase unwrapping, intrinsically eliminating the need for heuristic truncations while achieving strict boundary continuity and superior noise immunity. Extensive experiments across three large-scale datasets demonstrate that FSC achieves highly competitive overall performance, yielding substantial improvements in high-precision detection. The code will be available at https://github.com/weiminghong/FSC.

Chinese Translation

随着智能驾驶和遥感技术的快速发展，有向物体检测受到了广泛关注。然而，实现高精度性能在根本上受到角边界不连续性（Angle Boundary Discontinuity, ABD）和周期性模糊（Cyclic Ambiguity, CA）问题的限制，这通常导致在周期性边界附近出现显著的角度波动。尽管近期研究提出了连续角编码器以缓解这些问题，但我们的理论和实证分析表明，最先进的方法仍然存在显著的周期性误差。我们将这种不稳定性归因于其非正交解码机制中的结构噪声放大。这种数学脆弱性显著加剧了角度偏差，特别是对于方形物体。为了解决这一根本问题，我们提出了傅里叶级数编码器（Fourier Series Coder, FSC），这是一种轻量级的即插即用组件，建立了一种连续、可逆且数学上稳健的角度编码-解码范式。通过严格地将角度映射到最小正交傅里叶基上，并明确施加几何流形约束，FSC有效地防止了特征模态崩溃。这种结构上稳定的表示确保了高度稳健的相位展开，内在地消除了启发式截断的需求，同时实现了严格的边界连续性和优越的噪声免疫性。在三个大规模数据集上的广泛实验表明，FSC在整体性能上具有高度竞争力，在高精度检测中取得了显著的改进。代码将发布在 https://github.com/weiminghong/FSC。
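
The boundary discontinuity problem, and the general idea of a continuous angle coder, can be shown with a first-harmonic sine/cosine encoding. This is a generic textbook construction under the assumption of a period-π box angle. It is not the FSC design itself, and the function names are illustrative.

```python
import math

def encode(theta, period=math.pi):
    """Map an angle to a point on the unit circle so that theta and
    theta + period share the same code."""
    w = 2.0 * math.pi / period
    return (math.cos(w * theta), math.sin(w * theta))

def decode(code, period=math.pi):
    """Recover the angle in (-period/2, period/2]. atan2 ignores any
    positive rescaling of the code, so the code need not be unit length."""
    w = 2.0 * math.pi / period
    c, s = code
    return math.atan2(s, c) / w

# Two nearly identical box orientations on opposite sides of the boundary.
a, b = math.radians(89.0), math.radians(-89.0)
gap_raw = abs(a - b)                         # large in raw angle space
gap_code = math.dist(encode(a), encode(b))   # small in code space
roundtrip = decode(encode(math.radians(30.0)))
```

A regressor trained on raw angles sees these two orientations as almost π apart and is penalised heavily for a near-correct prediction, whereas the encoded targets sit close together. Keeping the predicted code near the unit circle matters because a code that collapses toward the origin makes `atan2` sensitive to small noise.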

cs.CV / 44 / 2604.20286

MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

MambaLiteUNet：用于稳健皮肤病变分割的交叉门控自适应特征融合

Rahman, Md Maklachur, Jung, Soon Ki, Hammond, Tracy

Abstract

Recent segmentation models have demonstrated promising efficiency by aggressively reducing parameter counts and computational complexity. However, these models often struggle to accurately delineate fine lesion boundaries and texture patterns essential for early skin cancer diagnosis and treatment planning. In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a U-Net architecture, along with three key modules: Adaptive Multi-Branch Mamba Feature Fusion (AMF), Local-Global Feature Mixing (LGFM), and Cross-Gated Attention (CGA). These modules are designed to enhance local-global feature interaction, preserve spatial details, and improve the quality of skip connections. MambaLiteUNet achieves an average IoU of 87.12% and average Dice score of 93.09% across ISIC2017, ISIC2018, HAM10000, and PH2 benchmarks, outperforming state-of-the-art models. Compared to U-Net, our model improves average IoU and Dice by 7.72 and 4.61 points, respectively, while reducing parameters by 93.6% and GFLOPs by 97.6%. Additionally, in domain generalization with six unseen lesion categories, MambaLiteUNet achieves 77.61% IoU and 87.23% Dice, performing best among all evaluated models. Our extensive experiments demonstrate that MambaLiteUNet achieves a strong balance between accuracy and efficiency, making it a competitive and practical solution for dermatological image segmentation. Our code is publicly available at: https://github.com/maklachur/MambaLiteUNet.

Chinese Translation

近期的分割模型通过大幅减少参数数量和计算复杂度展现了良好的效率。然而，这些模型在准确勾勒细微病变边界和纹理模式方面常常面临挑战，这对于早期皮肤癌的诊断和治疗规划至关重要。在本文中，我们提出了MambaLiteUNet，这是一种紧凑而稳健的分割框架，将Mamba状态空间建模集成到U-Net架构中，并引入三个关键模块：自适应多分支Mamba特征融合（AMF）、局部-全局特征混合（LGFM）和交叉门控注意力（CGA）。这些模块旨在增强局部与全局特征的交互，保留空间细节，并提高跳跃连接的质量。MambaLiteUNet在ISIC2017、ISIC2018、HAM10000和PH2基准测试中实现了87.12%的平均IoU和93.09%的平均Dice分数，超越了最新的模型。与U-Net相比，我们的模型分别提高了7.72和4.61个点的平均IoU和Dice，同时参数减少了93.6%，GFLOPs减少了97.6%。此外，在六个未见病变类别的领域泛化中，MambaLiteUNet实现了77.61%的IoU和87.23%的Dice，在所有评估模型中表现最佳。我们的广泛实验表明，MambaLiteUNet在准确性和效率之间达成了良好的平衡，使其成为皮肤病学图像分割的竞争性和实用解决方案。我们的代码已公开发布于：https://github.com/maklachur/MambaLiteUNet。
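
For readers comparing the IoU and Dice figures quoted above, the two metrics can be computed from the same overlap counts. This is a minimal sketch on flattened binary masks. The function name and toy masks are illustrative.

```python
def iou_and_dice(pred, target):
    """Compute IoU and Dice for two flat 0/1 masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treat as a perfect match.
        return 1.0, 1.0
    iou = inter / union
    dice = 2.0 * inter / (pred_sum + target_sum)
    return iou, dice

pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred, target)
```

For a single mask pair the two are tied by Dice = 2·IoU / (1 + IoU), so Dice is always the larger number unless both are 0 or 1. Scores averaged over a dataset need not satisfy that identity exactly.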

cs.CV / 45 / 2604.20289

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

X-Cache：用于少步自回归世界模型推理的跨块缓存

Zeng, Yixiao, Zheng, Jianlei, Zheng, Chaoda, Chen, Shijia, Liu, Mingdian, Liu, Tongping, Luo, Tengwei, Zhang, Yu, Wang, Boyang, Xu, Linkun, Lu, Siyuan, Tian, Bo, Liu, Xianming

Abstract

Real-time world simulation is becoming a key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressive video diffusion achieve high-fidelity, controllable multi-camera generation, but their inference cost remains a bottleneck for interactive deployment. However, existing diffusion caching methods are designed for offline video generation with multiple denoising steps, and do not transfer to this scenario. Few-step distilled models have no inter-step redundancy left for these methods to reuse, and sequence-level parallelization techniques require future conditioning that closed-loop interactive generation does not provide. We present X-Cache, a training-free acceleration method that caches along a different axis: across consecutive generation chunks rather than across denoising steps. X-Cache maintains per-block residual caches that persist across chunks, and applies a dual-metric gating mechanism over a structure- and action-aware block-input fingerprint to independently decide whether each block should recompute or reuse its cached residual. To prevent approximation errors from permanently contaminating the autoregressive KV cache, X-Cache identifies KV update chunks (the forward passes that write clean keys and values into the persistent cache) and unconditionally forces full computation on these chunks, cutting off error propagation. We implement X-Cache on X-world, a production multi-camera action-conditioned driving world model built on multi-block causal DiT with few-step denoising and rolling KV cache. X-Cache achieves a 71% block skip rate and a 2.6x wall-clock speedup with minimal degradation.

Chinese Translation

实时世界模拟正成为可扩展评估和在线强化学习自主驾驶系统的关键基础设施。基于自回归视频扩散的最新驾驶世界模型实现了高保真、可控的多摄像头生成，但其推理成本仍然是交互式部署的瓶颈。然而，现有的扩散缓存方法是为离线视频生成设计的，涉及多个去噪步骤，无法转移到这一场景。少步蒸馏模型没有剩余的跨步骤冗余供这些方法重用，而序列级并行化技术需要未来条件，这在闭环交互生成中是无法提供的。我们提出了X-Cache，这是一种无训练加速方法，沿着不同的轴进行缓存：跨越连续的生成块，而不是跨越去噪步骤。X-Cache维护每个块的残差缓存，这些缓存在块之间持久存在，并在结构和动作感知的块输入指纹上应用双指标门控机制，以独立决定每个块是重新计算还是重用其缓存的残差。为了防止近似误差永久污染自回归KV缓存，X-Cache识别KV更新块（将干净的键和值写入持久缓存的前向传递），并无条件地强制对这些块进行完全计算，从而切断误差传播。我们在X-world上实现了X-Cache，这是一个基于多块因果DiT的生产多摄像头动作条件驾驶世界模型，具有少步去噪和滚动KV缓存。X-Cache实现了71%的块跳过率，墙钟时间加速比为2.6倍，同时保持最低的性能下降。
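
The cross-chunk caching idea can be sketched for a single block. Everything concrete here is an assumption made for illustration: the abstract does not say which two metrics form the gate, so cosine similarity and relative L2 distance are stand-ins, and the class name, thresholds and fingerprint format are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def relative_l2(a, b):
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return diff / (math.sqrt(sum(y * y for y in b)) + 1e-12)

class CachedBlock:
    """Wrap one block and reuse its residual (output minus input) across chunks."""

    def __init__(self, block_fn, min_cosine=0.99, max_relative_l2=0.05):
        self.block_fn = block_fn
        self.min_cosine = min_cosine            # hypothetical threshold
        self.max_relative_l2 = max_relative_l2  # hypothetical threshold
        self.fingerprint = None
        self.residual = None
        self.computed = 0
        self.skipped = 0

    def __call__(self, x, fingerprint, is_kv_update_chunk=False):
        reuse = (
            not is_kv_update_chunk  # chunks that write the KV cache always recompute
            and self.residual is not None
            and cosine(fingerprint, self.fingerprint) >= self.min_cosine
            and relative_l2(fingerprint, self.fingerprint) <= self.max_relative_l2
        )
        if reuse:
            self.skipped += 1
        else:
            out = self.block_fn(x)
            self.residual = [o - xi for o, xi in zip(out, x)]
            self.fingerprint = list(fingerprint)
            self.computed += 1
        # Output is the current input plus the (fresh or cached) residual.
        return [xi + r for xi, r in zip(x, self.residual)]

block = CachedBlock(lambda x: [2.0 * v for v in x])  # toy stand-in for a DiT block
out1 = block([1.0, 2.0], [1.0, 2.0])                 # first chunk: computed
out2 = block([1.01, 2.0], [1.01, 2.0])               # similar input: residual reused
out3 = block([1.01, 2.0], [1.01, 2.0], is_kv_update_chunk=True)  # forced compute
out4 = block([5.0, -1.0], [5.0, -1.0])               # different input: computed
```

The second call returns an approximation (the exact output would be 2.02, the cached residual gives 2.01), which shows why chunks that write into a persistent KV cache are excluded from reuse: an approximate value stored there would be read by every later chunk.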

cs.CV / 46 / 2604.20291

Efficient INT8 Single-Image Super-Resolution via Deployment-Aware Quantization and Teacher-Guided Training

基于部署感知量化和教师指导训练的高效 INT8 单幅图像超分辨率

Nguyen, Pham Phuong Nam, Le, Nam Tien, Vo, Thi Kim Trang, Nguyen, Nhu Tinh Anh

Abstract

Efficient single-image super-resolution (SISR) requires balancing reconstruction fidelity, model compactness, and robustness under low-bit deployment, which is especially challenging for x3 SR. We present a deployment-oriented quantized SISR framework based on an extract-refine-upsample design. The student performs most computation in the low-resolution space and uses a lightweight re-parameterizable backbone with PixelShuffle reconstruction, yielding a compact inference graph. To improve quality without significantly increasing complexity, we adopt a three-stage training pipeline: Stage 1 learns a basic reconstruction mapping with spatial supervision; Stage 2 refines fidelity using Charbonnier loss, DCT-domain supervision, and confidence-weighted output-level distillation from a Mamba-based teacher; and Stage 3 applies quantization-aware training directly on the fused deploy graph. We further use weight clipping and BatchNorm recalibration to improve quantization stability. On the MAI 2026 Quantized 4K Image Super-Resolution Challenge test set, our final AIO MAI submission achieves 29.79 dB PSNR and 0.8634 SSIM, obtaining a final score of 1.8 under the target mobile INT8 deployment setting. Ablation on Stage 3 optimization shows that teacher-guided supervision improves the dynamic INT8 TFLite reconstruction from 29.91 dB/0.853 to 30.0003 dB/0.856, while the fixed-shape deployable INT8 TFLite artifact attains 30.006 dB/0.857.

Chinese Translation

高效的单幅图像超分辨率（SISR）需要在重建保真度、模型紧凑性和低比特部署下的鲁棒性之间取得平衡，这对于 x3 超分辨率尤其具有挑战性。我们提出了一种基于提取-精炼-上采样设计的面向部署的量化 SISR 框架。学生模型在低分辨率空间中执行大部分计算，并使用轻量级的可重参数化主干网络与 PixelShuffle 重建，从而生成紧凑的推理图。为了在不显著增加复杂度的情况下提高质量，我们采用了三阶段训练流程：第一阶段通过空间监督学习基本的重建映射；第二阶段使用 Charbonnier 损失、DCT 域监督和来自基于 Mamba 的教师的置信加权输出级蒸馏来精炼保真度；第三阶段直接在融合的部署图上应用感知量化训练。我们进一步使用权重裁剪和批归一化重新校准来提高量化稳定性。在 MAI 2026 量化 4K 图像超分辨率挑战赛测试集上，我们最终的 AIO MAI 提交达到了 29.79 dB PSNR 和 0.8634 SSIM，在目标移动 INT8 部署设置下获得了最终得分 1.8。对第三阶段优化的消融实验表明，教师指导的监督将动态 INT8 TFLite 重建从 29.91 dB/0.853 提升至 30.0003 dB/0.856，而固定形状可部署的 INT8 TFLite 工件则达到了 30.006 dB/0.857。
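
Two quantities named in this abstract, the Charbonnier loss used in Stage 2 and the PSNR used for scoring, have standard definitions that are easy to state in code. This sketch uses the common forms on flat lists of pixel values in [0, 1]. The `eps` value is an assumed default, not a setting reported by the authors.

```python
import math

def charbonnier(pred, target, eps=1e-3):
    """Smooth L1-like loss: mean of sqrt(diff^2 + eps^2)."""
    return sum(
        math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)
    ) / len(pred)

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for signals with peak value max_val."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

pred = [0.5, 0.5]
target = [0.5, 0.6]
loss = charbonnier(pred, target)
quality_db = psnr(pred, target)
```

The Charbonnier loss behaves like L1 for large errors but stays differentiable at zero, which is one reason it is popular for restoration tasks. Because PSNR is logarithmic in the mean squared error, a gain of a few hundredths of a dB corresponds to a reduction in MSE of under one percent.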

cs.CV / 47 / 2604.20306

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

双重因果推断：将反向调整与工具变量学习整合用于医学视觉问答

Xu, Zibo, Li, Qiang, Lu, Ke, Wang, Jin, Nie, Weizhi, Su, Yuting

Abstract

Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.

Chinese Translation

医学视觉问答（MedVQA）旨在生成基于复杂医学图像和问题的临床可靠答案。然而，现有方法往往过度拟合表面的跨模态相关性，忽视了嵌入多模态医学数据中的内在偏差。因此，模型变得容易受到跨模态混杂效应的影响，严重妨碍其提供可信的诊断推理能力。为了解决这一局限性，我们提出了一种新颖的双重因果推断（DCI）框架用于MedVQA。据我们所知，DCI是第一个将反向调整（BDA）和工具变量（IV）学习整合在一起的统一架构，旨在共同解决可观察和不可观察的混杂因素。具体而言，我们构建了一个结构因果模型（SCM），通过BDA减轻可观察的跨模态偏差（例如，频繁的视觉和文本共现），同时利用从共享潜在空间中学习的IV来补偿不可观察的混杂因素。为了保证IV的有效性，我们设计了互信息约束，最大化其与融合多模态表示的依赖关系，同时最小化其与不可观察混杂因素和目标答案的关联。通过这一双重机制，DCI提取出去混杂的表示，捕捉真实的因果关系。在四个基准数据集SLAKE、SLAKE-CP、VQA-RAD和PathVQA上的广泛实验表明，我们的方法在性能上始终优于现有方法，特别是在分布外（OOD）泛化方面。此外，定性分析确认DCI通过明确区分真实因果效应与虚假的跨模态捷径，显著增强了跨模态推理的可解释性和鲁棒性。
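
For reference, the Backdoor Adjustment named in the abstract is usually written in the standard form below. The symbols are assumptions chosen for illustration, not the paper's notation: A is the answer, X the fused image-question input, and Z an observable confounder such as a frequent visual-textual co-occurrence pattern.

```latex
P\big(A \mid do(X = x)\big) \;=\; \sum_{z} P\big(A \mid X = x,\, Z = z\big)\, P(Z = z)
```

The sum weights each confounder value by its marginal probability P(Z = z) rather than by P(Z = z | X = x), which is what removes the backdoor path through Z. Unobserved confounders cannot be summed over in this way, which is presumably why the abstract pairs this step with an instrumental variable.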

cs.CV / 48 / 2604.20307

Improving Facial Emotion Recognition through Dataset Merging and Balanced Training Strategies

通过数据集合并和均衡训练策略改善面部情感识别

Kırbız, Serap

Abstract

In this paper, a deep learning framework based on deep convolutional networks is proposed for automatic facial emotion recognition. To increase the generalization ability and robustness of the method, the dataset size is increased by merging three publicly available facial emotion datasets: CK+, FER+ and KDEF. Despite the increase in dataset size, the minority classes still suffer from an insufficient number of training samples, leading to data imbalance. The data imbalance problem is minimized by online and offline augmentation techniques and random weighted sampling. Experimental results demonstrate that the proposed method can recognize the seven basic emotions with 82% accuracy. The results demonstrate the effectiveness of the proposed approach in tackling the challenges of data imbalance and improving classification performance in facial emotion recognition.

Chinese Translation

本文提出了一种基于深度卷积网络的自动面部情感识别深度学习框架。为了提高方法的泛化能力和鲁棒性，通过合并三个公开可用的面部情感数据集：CK+、FER+和KDEF，增加了数据集的规模。尽管数据集规模有所增加，但少数类仍然面临训练样本不足的问题，导致数据不平衡。通过在线和离线增强技术以及随机加权抽样，最小化了数据不平衡问题。实验结果表明，所提方法能够以82%的准确率识别七种基本情感。结果证明了该方法在应对数据不平衡挑战和提高面部情感识别分类性能方面的有效性。
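
The random weighted sampling mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the paper's code: the inverse-class-frequency weighting, the function names, and the toy label counts are all assumptions.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_balanced_indices(labels, num_samples, seed=0):
    """Draw indices with replacement so every class is about equally likely."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Toy imbalanced label list (made-up counts): 90 "happy" vs 10 "disgust".
labels = ["happy"] * 90 + ["disgust"] * 10
drawn = sample_balanced_indices(labels, num_samples=10_000)
drawn_counts = Counter(labels[i] for i in drawn)
```

Because the weights within each class sum to the same total, minority-class samples are drawn more often individually, and the two classes end up roughly equally represented in `drawn_counts`.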

cs.CV / 49 / 2604.20317

MD-Face: MoE-Enhanced Label-Free Disentangled Representation for Interactive Facial Attribute Editing

MD-Face：基于MoE增强的无标签解耦表示用于交互式面部属性编辑

Cui, Xuan, Zhao, Yunfei, Liu, Bo, Duan, Wei, Fan, Xingrong

Abstract

GAN-based facial attribute editing is widely used in virtual avatars and social media but often suffers from attribute entanglement, where modifying one face attribute unintentionally alters others. While supervised disentangled representation learning can address this, it relies heavily on labeled data, incurring high annotation costs. To address these challenges, we propose MD-Face, a label-free disentangled representation learning framework based on Mixture of Experts (MoE). MD-Face utilizes a MoE backbone with a gating mechanism that dynamically allocates experts, enabling the model to learn semantic vectors with greater independence. To further enhance attribute disentanglement, we introduce a geometry-aware loss, which aligns each semantic vector with its corresponding Semantic Boundary Vector (SBV) through a Jacobian-based pushforward method. Experiments with ProGAN and StyleGAN show that MD-Face outperforms unsupervised baselines and competes with supervised ones. Compared to diffusion-based methods, it offers better image quality and lower inference latency, making it ideal for interactive editing.

Chinese Translation

基于GAN的面部属性编辑在虚拟化身和社交媒体中被广泛应用，但常常面临属性纠缠的问题，即修改一个面部属性会无意中改变其他属性。尽管监督解耦表示学习可以解决这一问题，但它严重依赖标注数据，导致高昂的标注成本。为了解决这些挑战，我们提出了MD-Face，这是一种基于专家混合（Mixture of Experts, MoE）的无标签解耦表示学习框架。MD-Face利用MoE骨干网络和动态分配专家的门控机制，使模型能够以更大的独立性学习语义向量。为了进一步增强属性解耦，我们引入了一种几何感知损失，通过基于雅可比推送的方法将每个语义向量与其对应的语义边界向量（Semantic Boundary Vector, SBV）对齐。与ProGAN和StyleGAN的实验表明，MD-Face在无监督基线中表现优异，并与监督方法相竞争。与基于扩散的方法相比，它提供了更好的图像质量和更低的推理延迟，使其非常适合交互式编辑。
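
A gating mechanism that dynamically allocates experts, as described in the abstract, is commonly implemented as top-k softmax gating. The sketch below shows that generic pattern only. The top-k rule, the function names, and the toy experts are assumptions, and in a real MoE layer the gate logits would come from a learned projection of the input rather than being passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k):
    """Keep the k largest logits, renormalise them, and zero out the rest."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    probs = softmax([gate_logits[i] for i in top])
    weights = [0.0] * len(gate_logits)
    for i, p in zip(top, probs):
        weights[i] = p
    return weights


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for w, expert in zip(weights, experts):
        if w > 0.0:
            out = [o + w * v for o, v in zip(out, expert(x))]
    return out


# Hypothetical toy experts: each one just rescales the input vector.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
weights = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
mixed = moe_forward([1.0, 1.0], experts, [2.0, 0.5, 1.0, -1.0], k=2)
```

Only the experts with non-zero gate weight are evaluated, which is what keeps the cost of a forward pass independent of the total number of experts.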

cs.CV / 50 / 2604.20318

UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

UniCVR：从对齐到重排序的统一零-shot组合视觉检索

Wen, Haokun, Song, Xuemeng, Zhang, Haoyu, Zhao, Xiangyu, Guan, Weili, Nie, Liqiang

Abstract

Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.

Chinese Translation

组合图像检索、多轮组合图像检索和组合视频检索都共享一个共同的范式：将参考视觉与修改文本组合以检索所需目标。尽管具有这种共同结构，这三项任务却一直是孤立研究，之前没有工作提出统一框架，更不用说零-shot解决方案。在本文中，我们提出了UniCVR，这是第一个统一的零-shot组合视觉检索框架，能够在没有任何任务特定人工标注数据的情况下共同解决这三项任务。UniCVR战略性地结合了两种互补优势：多模态大型语言模型（Multimodal Large Language Models, MLLMs）用于组合查询理解，以及视觉-语言预训练（Vision-Language Pre-trained, VLP）模型用于结构化视觉检索。具体而言，UniCVR分两个阶段进行。在阶段一，我们通过对大约350万个样本的精心策划的多源数据集进行对比学习，训练MLLM作为组合查询嵌入器，弥合MLLM与冻结的VLP图库编码器之间的异构嵌入空间。我们提出了一种基于聚类的困难负样本采样策略，以增强对比监督。在阶段二，我们引入了一种MLLM引导的双层重排序机制，该机制对少量排名靠前的候选者应用自适应预算子集评分，然后通过双层重新评分方案利用生成的相关性信号，产生更准确的最终排名，同时最小化计算开销。在覆盖所有三项任务的五个基准上的广泛实验表明，UniCVR实现了前沿性能，验证了其有效性和通用性。我们的数据和代码将在接受后发布。
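
To see why hard negatives strengthen contrastive supervision, consider a generic InfoNCE loss. The sketch below is an assumption-laden illustration rather than UniCVR's objective: the loss form, the temperature value, and the two-dimensional toy embeddings are all chosen for the example, and the cluster-based sampling strategy itself is not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.07):
    """Cross-entropy of picking the positive among positive + negatives."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
loss_easy = info_nce(query, positive, negatives=[[-1.0, 0.0]])  # far from the query
loss_hard = info_nce(query, positive, negatives=[[0.8, 0.2]])   # close to the query
```

An easy negative contributes almost nothing to the loss, so it yields almost no gradient, whereas a negative that lies near the query keeps the loss and its gradient large.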

cs.CV / 51 / 2604.20319

SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

SurgCoT：通过链式思维基准推进外科视频中的时空推理

Wang, Gui, Zhou, YongSong, Deng, Kaijun, Cheah, Wooi Ping, Qu, Rong, Ren, Jianfeng, Shen, Linlin

Abstract

Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.

Chinese Translation

对外科视频进行细粒度的时空推理至关重要，但多模态大型语言模型（MLLMs）在这一领域的能力仍然未被充分探索。为了解决这一问题，我们提出了SurgCoT，这是一个统一的基准，用于评估MLLMs在7个外科专业和35种不同手术中的链式思维（CoT）推理。SurgCoT评估五个核心推理维度：因果行动排序、线索-行动对齐、可供性映射、微转移定位和异常发生追踪，采用结构化的CoT框架和严格的注释协议（问题-选项-知识-线索-答案），其中知识领域提供必要的背景信息，而线索则提供明确的时空证据。对10个领先的MLLMs的评估显示：1）商业模型优于开源和医学专业变体；2）外科CoT推理存在显著差距；3）SurgCoT能够有效评估并增强渐进式时空推理。SurgCoT提供了一个可重复的测试平台，以缩小MLLM能力与临床推理需求之间的差距。代码：https://github.com/CVI-SZU/SurgCoT。

cs.CV / 52 / 2604.20328

Hybrid Latent Reasoning with Decoupled Policy Optimization

混合潜在推理与解耦策略优化

Cheng, Tao, Chen, Shi-Zhe, Zhang, Hao, Qin, Yixin, Luo, Jinwen, Wei, Zheng

Abstract

Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.

Chinese Translation

链式思维（Chain-of-Thought, CoT）推理显著提升了多模态大型语言模型（Multimodal Large Language Models, MLLMs）在复杂问题解决中的能力。然而，将CoT应用于视觉通常需要对信号进行离散化，以适应LLM输入，这导致早期语义崩溃并丢弃细粒度细节。虽然外部工具可以缓解这一问题，但它们引入了僵化的瓶颈，将推理限制在预定义的操作中。尽管最近的潜在推理范式通过内化视觉状态来克服这些限制，但优化由此产生的混合离散-连续动作空间仍然具有挑战性。在本研究中，我们提出了HyLaR（Hybrid Latent Reasoning）框架，该框架无缝地将离散文本生成与连续视觉潜在表示交织在一起。具体而言，在初始冷启动的监督微调（Supervised Fine-Tuning, SFT）之后，我们引入了解耦策略优化（Decoupled Policy Optimization, DePO），以便在这一混合空间内实现有效的强化学习。DePO将策略梯度目标进行分解，对文本和潜在组件施加独立的信任区域约束，并结合精确的封闭形式冯·米塞斯-费舍尔（von Mises-Fisher, vMF）KL正则化器。大量实验表明，HyLaR在细粒度感知和一般多模态理解基准上优于标准的MLLMs和最先进的潜在推理方法。代码可在 https://github.com/EthenCheng/HyLaR 获取。

cs.CV / 53 / 2604.20329

Image Generators are Generalist Vision Learners

图像生成器是通用视觉学习者

Gabeur, Valentin, Long, Shangbang, Peng, Songyou, Voigtlaender, Paul, Sun, Shuyang, Bao, Yanan, Truong, Karen, Wang, Zhicheng, Zhou, Wenlei, Barron, Jonathan T., Genova, Kyle, Kannen, Nithish, Ben, Sherry, Li, Yandong, Guo, Mandy, Yogin, Suhas, Gu, Yiming, Chen, Huizhong, Wang, Oliver, Xie, Saining, Zhou, Howard, He, Kaiming, Funkhouser, Thomas, Alayrac, Jean-Baptiste, Soricut, Radu

Abstract

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

Chinese Translation

最近的研究表明，图像和视频生成器表现出零样本视觉理解行为，这与大型语言模型（LLMs）在生成预训练中发展出语言理解和推理的突现能力的方式相似。尽管长期以来人们推测创造视觉内容的能力意味着理解该内容的能力，但关于生成视觉模型是否具备强大理解能力的证据仍然有限。在本研究中，我们证明图像生成训练的作用类似于LLM的预训练，使模型能够学习强大且通用的视觉表征，从而在各种视觉任务上实现最先进的性能（SOTA）。我们引入了Vision Banana，这是一个通用模型，通过在其原始训练数据与少量视觉任务数据的混合上对Nano Banana Pro（NBP）进行指令调优而构建。通过将视觉任务的输出空间参数化为RGB图像，我们无缝地将感知重新框架为图像生成。我们的通用模型Vision Banana在涉及2D和3D理解的各种视觉任务上取得了SOTA结果，超越或与零样本领域专家相抗衡，包括在分割任务上的Segment Anything Model 3，以及在度量深度估计上的Depth Anything系列。我们展示了这些结果可以通过轻量级的指令调优实现，而不牺牲基础模型的图像生成能力。这些优越的结果表明，图像生成预训练是一种通用视觉学习者。它还表明，图像生成作为视觉任务的统一和普遍接口，类似于文本生成在语言理解和推理中的作用。我们可能正在见证计算机视觉的重大范式转变，其中生成视觉预训练在构建基础视觉模型（用于生成和理解）中发挥核心作用。

cs.CV / 54 / 2604.20336

Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation

基于稳定性的物体引导人际协同操控运动生成

Xu, Jiahao, Yuan, Xiaohan, Wu, Xingchen, Xu, Chongyang, Li, Kun, Huang, Buzhen

Abstract

Co-manipulation requires multiple humans to synchronize their motions with a shared object while ensuring reasonable interactions, maintaining natural poses, and preserving stable states. However, most existing motion generation approaches are designed for single-character scenarios or fail to account for payload-induced dynamics. In this work, we propose a flow-matching framework that ensures the generated co-manipulation motions align with the intended goals while maintaining naturalness and effectiveness. Specifically, we first introduce a generative model that derives explicit manipulation strategies from the object's affordance and spatial configuration, which guide the motion flow toward successful manipulation. To improve motion quality, we then design an adversarial interaction prior that promotes natural individual poses and realistic inter-person interactions during co-manipulation. In addition, we also incorporate a stability-driven simulation into the flow matching process, which refines unstable interaction states through sampling-based optimization and directly adjusts the vector field regression to promote more effective manipulation. The experimental results demonstrate that our method achieves higher contact accuracy, lower penetration, and better distributional fidelity compared to state-of-the-art human-object interaction baselines. The code is available at https://github.com/boycehbz/StaCOM.

Chinese Translation

协同操控要求多个参与者与共享物体同步运动，同时确保合理的互动、保持自然姿态和维持稳定状态。然而，现有的大多数运动生成方法主要针对单一角色场景，或未考虑负载引起的动态效应。在本研究中，我们提出了一种流匹配框架，确保生成的协同操控运动与预期目标一致，同时保持自然性和有效性。具体而言，我们首先引入了一种生成模型，该模型从物体的可操作性和空间配置中推导出明确的操控策略，以引导运动流向成功的操控。为了提高运动质量，我们设计了一种对抗性互动先验，促进协同操控过程中自然的个体姿态和真实的人际互动。此外，我们还将基于稳定性的仿真纳入流匹配过程，通过基于采样的优化来细化不稳定的互动状态，并直接调整向量场回归以促进更有效的操控。实验结果表明，我们的方法在接触精度、穿透率和分布保真度方面均优于最先进的人-物互动基线。代码可在 https://github.com/boycehbz/StaCOM 获取。

cs.CV / 55 / 2604.20350

X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

X-PCR：眼科诊断中跨模态渐进临床推理的基准测试

Wang, Gui, Zhong, Zehao, Zhou, YongSong, Li, Yudong, Wu, Ende, Cheah, Wooi Ping, Qu, Rong, Ren, Jianfeng, Shen, Linlin

Abstract

Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity for multi-modal diagnosis remains largely unexamined. Current benchmarks, mostly built on single-modality data, cannot evaluate the progressive reasoning and cross-modal integration essential for clinical practice. We introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation of MLLMs through a complete ophthalmology diagnostic workflow, with two reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Evaluation of 21 MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. Dataset and code: https://github.com/CVI-SZU/X-PCR.

Chinese Translation

尽管多模态大型语言模型（MLLMs）取得了显著进展，但其在多模态诊断中的临床推理能力仍然未得到充分检验。目前的基准测试大多基于单一模态数据，无法评估临床实践中至关重要的渐进推理和跨模态整合。我们提出了跨模态渐进临床推理（X-PCR）基准测试，这是对MLLMs进行全面评估的首个基准，涵盖完整的眼科诊断工作流程，包含两个推理任务：1）一个涵盖从图像质量评估到临床决策的六阶段渐进推理链；2）一个整合六种成像模态的跨模态推理任务。该基准测试包含26,415张图像和177,868个经过专家验证的视觉问答（VQA）对，来源于51个公共数据集，涵盖52种眼科疾病。对21个MLLMs的评估揭示了在渐进推理和跨模态整合方面的关键缺口。数据集和代码可在：https://github.com/CVI-SZU/X-PCR获取。

cs.CV / 56 / 2604.20354

Hallucination Early Detection in Diffusion Models

扩散模型中的幻觉早期检测

Betti, Federico, Baraldi, Lorenzo, Baraldi, Lorenzo, Cucchiara, Rita, Sebe, Nicu

Abstract

Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is often underestimated. Although using various seeds in multiple iterations can improve results, this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each containing prompts with up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the users) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.

Chinese Translation

文本到图像生成在扩散模型的出现下，输出的真实感得到了显著提升。然而，当扩散模型被要求生成多个对象时，常常会遇到困难，导致某些实体被遗漏的幻觉现象。虽然现有的解决方案通常专注于优化扩散模型中的潜在表示，但初始生成种子的相关性通常被低估。虽然在多次迭代中使用不同的种子可以改善结果，但这种方法也显著增加了时间和能源成本。为了解决这一挑战，我们提出了HEaD+（幻觉早期检测+），这是一种旨在在扩散过程中早期识别不正确生成的创新方法。HEaD+框架将交叉注意力图和文本信息与一种新颖的输入——预测最终图像相结合。其目标是评估是否继续当前生成，或使用不同的种子重新开始，从而在节省时间的同时探索多种生成种子。HEaD+在新创建的InsideGen数据集上进行训练，该数据集包含45,000个生成图像，每个图像包含最多七个对象的提示。我们的研究结果表明，在应用HEaD+与现有模型的情况下，生成四个对象时实现完整生成（即准确表示所有指定主题的图像）的可能性提高了6-8%。此外，当目标是生成完整图像时，HEaD+将生成时间减少了多达32%，相较于领先模型，提高了生成完整和准确的对象表示的效率。此外，我们提出了一个集成定位模块，该模块在中间时间步预测对象质心位置，并在用户请求时验证成对空间关系，从而与对象存在一起控制生成，以进一步改善关系一致的结果。
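
The proceed-or-restart decision described in the abstract amounts to a simple control loop around the sampler. The sketch below shows only that loop. The three callables are hypothetical stand-ins (a partial denoising run, the early detector, and the remaining denoising steps), not HEaD+ components, and falling back to the last seed when every seed is rejected is an assumption.

```python
def generate_with_early_check(seeds, run_until_check, looks_complete, finish):
    """Try seeds in order and abandon a seed early if the detector rejects it.

    Returns (final_output, seed_used, attempts). If every seed is rejected,
    the last one is finished anyway so the caller always gets an output.
    """
    seeds = list(seeds)
    if not seeds:
        raise ValueError("need at least one seed")
    for attempt, seed in enumerate(seeds, start=1):
        partial = run_until_check(seed)            # cheap: only the early steps
        if looks_complete(partial) or attempt == len(seeds):
            return finish(partial), seed, attempt  # expensive: remaining steps


# Toy stand-ins so the loop can be exercised without a diffusion model.
def toy_run(seed):
    return {"seed": seed, "objects_found": seed % 3}


def toy_detector(partial):
    return partial["objects_found"] == 2


def toy_finish(partial):
    return f"image(seed={partial['seed']})"


result = generate_with_early_check([0, 1, 2, 3], toy_run, toy_detector, toy_finish)
fallback = generate_with_early_check([0, 1], toy_run, toy_detector, toy_finish)
```

The time saving comes from paying the full denoising cost only for the seed that passes the check, while rejected seeds cost only the steps up to the check.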

cs.CV / 57 / 2604.20357

SignDATA: Data Pipeline for Sign Language Translation

SignDATA：手语翻译的数据处理管道

Chen, Kuanwei, Lin, Tingyi

Abstract

Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically comparable. Code is available at https://github.com/balaboom123/signdata-slt.

Chinese Translation

手语数据集由于注释模式、片段时长、签署者框架和隐私限制的差异，难以进行一致的预处理。现有研究通常报告下游模型，而将原始视频转换为适合训练的姿态或视频工件的预处理管道则仍然是碎片化的、特定于后端的，并且文档记录不足。我们提出了SignDATA，一个基于配置的预处理工具包，旨在将异构的手语语料库标准化为可比较的输出，以便于学习。该系统支持两种端到端的处理方案：一种姿态处理方案，执行采集、清单生成、人员定位、剪辑、裁剪、关键点提取、归一化和WebDataset导出；另一种视频处理方案，则用签署者裁剪的视频打包替代姿态提取。SignDATA在一个共同接口下暴露了可互换的MediaPipe和MMPose后端，支持类型化的作业模式、实验级别的覆盖以及具有配置和清单感知哈希的每阶段检查点。我们通过一个以后端比较、预处理消融和隐私意识视频生成为中心的研究导向评估设计来验证该工具包。我们的贡献是为手语研究提供一个可重复的预处理层，使得提取器选择、归一化策略和隐私权衡变得明确、可配置且可经验比较。代码可在 https://github.com/balaboom123/signdata-slt 获取。
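
Per-stage checkpointing with config- and manifest-aware hashes can be illustrated with a small cache-key function. This is a sketch of the general idea and not SignDATA's API: the function name, the key layout, and the example config fields are assumptions.

```python
import hashlib
import json


def stage_key(stage_name, config, manifest_rows, upstream_key=""):
    """Hash everything that should invalidate a stage's cached output."""
    payload = json.dumps(
        {
            "stage": stage_name,
            "config": config,
            "manifest": manifest_rows,
            "upstream": upstream_key,  # chains invalidation through the pipeline
        },
        sort_keys=True,                # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


clips = ["clip_001", "clip_002"]  # hypothetical manifest entries
key_a = stage_key("landmarks", {"backend": "mediapipe", "size": 256}, clips)
key_b = stage_key("landmarks", {"size": 256, "backend": "mediapipe"}, clips)
key_c = stage_key("landmarks", {"backend": "mmpose", "size": 256}, clips)
key_d = stage_key("normalize", {"mode": "shoulder"}, clips, upstream_key=key_a)
```

A stage is re-run only when its key changes. Passing the upstream key into the next stage means that swapping the extractor backend also invalidates every later stage that depends on it.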

cs.CV / 58 / 2604.20358

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

ConeSep：基于锥体的鲁棒噪声去学习组合网络用于组合图像检索

Li, Zixu, Hu, Yupeng, Chen, Zhiwei, Zhang, Mingyu, Fu, Zhiheng, Nie, Liqiang

Abstract

The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly "hard noise" (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional "small loss hypothesis". We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a "diagonal negative combination" for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noisy correction process as an optimal transport problem, elegantly avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) demonstrate that ConeSep significantly outperforms current state-of-the-art methods, which fully demonstrates the effectiveness and robustness of our method.

Chinese Translation

组合图像检索（CIR）任务通过参考图像和修改文本提供了一种灵活的检索范式，但它严重依赖于昂贵且容易出错的三元组标注。本文系统地研究了由标注引入的噪声三元组对应（NTC）问题。我们发现，NTC噪声，特别是“硬噪声”（即参考图像和目标图像高度相似但修改文本不正确），对现有的噪声对应学习（NCL）方法构成了独特的挑战，因为它打破了传统的“小损失假设”。我们识别并阐明了NTC任务中三个关键但被忽视的挑战，即（C1）模态抑制，（C2）负锚缺乏，以及（C3）去学习反弹。为了解决这些挑战，我们提出了一种基于锥体的鲁棒噪声去学习组合网络（ConeSep）。具体而言，我们首先提出几何保真量化，理论上建立并实际估计噪声边界，以精确定位噪声对应。接下来，我们引入负边界学习，为每个查询学习一个“对角负组合”，作为其在嵌入空间中的显式语义对立锚。最后，我们设计了基于边界的目标去学习，将噪声修正过程建模为一个最优传输问题，优雅地避免了去学习反弹。在基准数据集（FashionIQ和CIRR）上的大量实验表明，ConeSep显著优于当前最先进的方法，充分证明了我们方法的有效性和鲁棒性。

cs.CV / 59 / 2604.20361

Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

基于对象指称的扫描路径预测与感知增强视觉-语言模型

Quan, Rong, Lai, Yantao, Liang, Dong, Qin, Jie

Abstract

Object Referring-guided Scanpath Prediction (ORSP) aims to predict a person's attention scanpath as they search for a specific target object in a visual scene according to a linguistic description of the object. Multimodal information fusion is a key point of ORSP. Therefore, we propose a novel model, ScanVLA, which first exploits a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we not only propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes historical fixations' position information as input to help predict a more reasonable position for the current fixation, but also adopt a frozen Segmentation LoRA as an auxiliary component to help localize the referred object more precisely, which improves the scanpath prediction task without incurring additional large computational and time costs. Extensive experimental results demonstrate that ScanVLA can significantly outperform existing scanpath prediction methods under object referring.

Chinese Translation

对象指称引导的扫描路径预测（ORSP）旨在根据描述特定目标对象的语言描述，预测人们在视觉场景中搜索该目标对象时的注意力扫描路径。多模态信息融合是ORSP的关键。因此，我们提出了一种新颖的模型ScanVLA，首先利用视觉-语言模型（VLM）从输入图像和指称表达中提取并融合固有对齐的视觉和语言特征表示。接下来，为了增强ScanVLA对细粒度位置信息的感知，我们不仅提出了一种新颖的历史增强扫描路径解码器（HESD），该解码器直接将历史注视的位置信息作为输入，以帮助预测当前注视的更合理位置，还采用了冻结的分割LoRA作为辅助组件，以更精确地定位所指对象，从而在不增加额外大量计算和时间成本的情况下改善扫描路径预测任务。大量实验结果表明，ScanVLA在对象指称下显著优于现有的扫描路径预测方法。

cs.CV / 60 / 2604.20366

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

在不降低性能的情况下减轻大型视觉语言模型中的幻觉现象

Zhu, Xingyu, Fang, Junfeng, Wang, Shuo, Zhu, Beier, Wang, Zhicai, Yang, Yonghui, He, Xiangnan

Abstract

Large Vision-Language Models (LVLMs) exhibit powerful generative capabilities but frequently produce hallucinations that compromise output reliability. Fine-tuning on annotated data devoid of hallucinations offers the most direct solution, while its high computational cost motivates recent representation-based methods, which focus on mitigating hallucinatory components within hidden representations. Though efficient, we empirically observe that these methods degrade general generation capacity due to incomplete extraction of hallucination components and non-selective parameter updates. To address these limitations, we propose MPD, a dual-stage framework for mitigating hallucinations without performance degradation. Specifically, our MPD relies on two essential factors: (1) semantic-aware component disentanglement to extract pure hallucination components, and (2) interpretable parameter updates that selectively modify parameters most relevant to hallucination. Extensive experiments demonstrate that MPD achieves state-of-the-art performance, reducing hallucinations by 23.4% while maintaining 97.4% of general generative capability as evaluated on LLaVA-Bench and MME, with no additional computational cost.

Chinese Translation

大型视觉语言模型（LVLMs）展现出强大的生成能力，但经常产生幻觉，影响输出的可靠性。对没有幻觉的标注数据进行微调是最直接的解决方案，但其高计算成本促使了最近基于表示的方法，这些方法专注于减轻隐藏表示中的幻觉成分。尽管这些方法高效，但我们实证观察到，由于对幻觉成分提取不完整和参数更新不具选择性，这些方法会降低整体生成能力。为了解决这些局限性，我们提出了MPD，一个双阶段框架，用于减轻幻觉而不降低性能。具体而言，我们的MPD依赖于两个关键因素：（1）语义感知的成分解耦，以提取纯粹的幻觉成分，以及（2）可解释的参数更新，选择性地修改与幻觉最相关的参数。大量实验表明，MPD实现了最先进的性能，在LLaVA-Bench和MME的评估中将幻觉减少了23.4%，同时保持了97.4%的整体生成能力，且没有额外的计算成本。

cs.CV / 61 / 2604.20368

LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel

拉普拉斯变换器：用拉普拉斯核重新思考线性注意力

Feng, Zhe, Lian, Sen, Wang, Changwei, Zhang, Muyang, Tan, Tianlong, Xu, Rongtao, Meng, Weiliang, Zhang, Xiaopeng

Abstract

The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton-Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.

Chinese Translation

软最大注意力的二次复杂性对将变换器扩展到高分辨率视觉任务构成了主要障碍。现有的线性注意力变体通常用高斯核替代软最大，以降低复杂性，但这种近似缺乏理论基础，往往会过度抑制中等范围的标记交互。我们提出了拉普拉斯变换器（LaplacianFormer），这是一种使用拉普拉斯核作为软最大原则替代的变换器变体，基于经验观察和理论分析。为了解决低秩近似下的表达能力下降问题，我们引入了一种可证明的单射特征映射，保留了细粒度的标记信息。为了实现高效计算，我们采用了核矩阵的Nyström近似，并使用牛顿-舒尔茨迭代法求解所得到的系统，避免了昂贵的矩阵求逆和奇异值分解（SVD）。我们进一步开发了自定义的CUDA实现，用于核和求解器，支持适合边缘部署的高吞吐量前向和反向传递。在ImageNet上的实验表明，拉普拉斯变换器在提高注意力表达能力的同时，实现了良好的性能效率权衡。
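
The Newton-Schulz iteration named in the abstract approximates a matrix inverse using only matrix products, which is why it maps well onto GPUs. The sketch below is the textbook iteration in plain Python for a small matrix. It is not the paper's CUDA solver, and the function names, the starting guess, and the iteration count are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def newton_schulz_inverse(a, iters=20):
    """Approximate the inverse of a square matrix via X <- X (2I - A X)."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = norm_1 * norm_inf
    # Starting guess X0 = A^T / (||A||_1 * ||A||_inf), a standard convergent choice.
    x = [[a[j][i] / scale for j in range(n)] for i in range(n)]
    for _ in range(iters):
        ax = matmul(a, x)
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x


a = [[4.0, 1.0], [2.0, 3.0]]
a_inv = newton_schulz_inverse(a)
```

Each step roughly squares the residual I - AX, so a handful of iterations is usually enough once the starting guess is inside the convergence region. In a Nyström scheme the matrix being inverted would be the small landmark-by-landmark kernel block, not the full attention matrix.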

cs.CV / 62 / 2604.20392

Self-supervised pretraining for an iterative image size agnostic vision transformer

用于迭代图像大小无关的视觉变换器的自监督预训练

Prisadnikov, Nedyalko, Paudel, Danda Pani, Fu, Yuqian, Van Gool, Luc

Abstract

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.

Chinese Translation

视觉变换器（ViTs）在自监督学习（SSL）中占据主导地位。尽管它们在大规模预训练中表现出色，但在计算效率上存在不足，并且对图像大小的扩展能力较差。因此，像 DINO 这样的基础模型被限制在低分辨率处理。最近一种受视网膜启发的变换器通过迭代处理固定大小的多缩放图像块实现了分辨率无关性。该模型通过监督学习展示了良好的结果，采用了一种顺序的、类似递归的处理过程，而不需要时间反向传播。为了释放其作为基础骨干网的潜力，我们基于 DINO 的自蒸馏目标提出了一种新颖的顺序到全局的自监督学习框架。得益于高效的积分图像块提取方法，我们的方法使得图像大小无关的视觉编码器能够进行大规模预训练。在 ImageNet-1K 和下游分类任务中，我们实现了具有竞争力的性能，无论输入分辨率如何，计算预算保持不变。
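
An integral image (summed-area table) lets any rectangular window be averaged in constant time, which is one plausible reading of how multi-zoom patches can be extracted at a cost independent of image size. The sketch below shows that standard technique for a single-channel image. The function names and the average-pooling use are assumptions, since the abstract does not describe the paper's extraction method in detail.

```python
def integral_image(img):
    """Summed-area table with a zero border: ii[y][x] = sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of any rectangle from four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def downsampled_patch(ii, top, left, size, out):
    """Average-pool a size x size window to out x out (size divisible by out)."""
    cell = size // out
    return [[box_sum(ii, top + i * cell, left + j * cell, cell, cell) / (cell * cell)
             for j in range(out)] for i in range(out)]


img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ii = integral_image(img)
centre = box_sum(ii, 1, 1, 2, 2)
zoomed_out = downsampled_patch(ii, 0, 0, 4, 2)
```

After the one-off table construction, a wide zoomed-out patch and a narrow zoomed-in patch cost the same number of lookups per output pixel, so the per-step budget stays fixed as the input resolution grows.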

cs.CV / 63 / 2604.20393

MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

MLG-Stereo：基于ViT的多阶段局部-全局增强立体匹配

Zhang, Haoyu, Zhou, Jingyi, Ye, Peng, Yuan, Jiakang, Zhang, Lin, Xu, Feng, Chen, Tao

Abstract

With the development of deep learning, ViT-based stereo matching methods have made significant progress due to their remarkable robustness and zero-shot ability. However, due to the limitations of ViTs in handling resolution sensitivity and their relative neglect of local information, the ability of ViT-based methods to predict details and handle arbitrary-resolution images is still weaker than that of CNN-based methods. To address these shortcomings, we propose MLG-Stereo, a systematic pipeline-level design that extends global modeling beyond the encoder stage. First, we propose a Multi-Granularity Feature Network to effectively balance global context and local geometric information, enabling comprehensive feature extraction from images of arbitrary resolution and bridging the gap between training and inference scales. Then, a Local-Global Cost Volume is constructed to capture both locally-correlated and global-aware matching information. Finally, a Local-Global Guided Recurrent Unit is introduced to iteratively optimize the disparity locally under the guidance of global information. Extensive experiments are conducted on multiple benchmark datasets, demonstrating that our MLG-Stereo exhibits highly competitive performance on the Middlebury and KITTI-2015 benchmarks compared to contemporaneous leading methods, and achieves outstanding results in the KITTI-2012 dataset.

Chinese Translation

随着深度学习的发展，基于ViT的立体匹配方法因其显著的鲁棒性和零样本能力而取得了显著进展。然而，由于ViT在处理分辨率敏感性方面的局限性以及对局部信息的相对忽视，基于ViT的方法在细节预测和处理任意分辨率图像的能力上仍然弱于基于CNN的方法。为了解决这些不足，我们提出了MLG-Stereo，这是一种系统的管道级设计，扩展了全局建模超越编码器阶段。首先，我们提出了一种多粒度特征网络，以有效平衡全局上下文和局部几何信息，从而实现对任意分辨率图像的全面特征提取，并弥合训练和推理尺度之间的差距。然后，构建了一个局部-全局代价体积，以捕捉局部相关和全局感知的匹配信息。最后，引入了一种局部-全局引导递归单元，在全局信息的指导下迭代优化局部视差。我们在多个基准数据集上进行了广泛的实验，结果表明我们的MLG-Stereo在Middlebury和KITTI-2015基准测试中表现出高度竞争力，相较于当代领先方法，且在KITTI-2012数据集上取得了卓越的结果。

cs.CV / 64 / 2604.20395

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

SpaCeFormer：快速无提议开放词汇3D实例分割

Choy, Chris, Lee, Junha, Park, Chunghyun, Cho, Minsu, Kautz, Jan

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.

Chinese Translation

开放词汇3D实例分割是机器人技术和增强现实/虚拟现实（AR/VR）的核心能力，但以往的方法往往是以一种瓶颈换取另一种瓶颈：多阶段的2D+3D管道在每个场景中聚合基础模型输出需要数百秒，而伪标记的端到端方法则依赖于碎片化的掩膜和外部区域提议。我们提出了SpaCeFormer，一种无提议的空间曲线变换器，每个场景运行时间为0.14秒，比多阶段的2D+3D管道快2-3个数量级。我们将其与SpaCeFormer-3M配对，这是通过多视图掩膜聚类和多视图视觉语言模型（VLM）标注构建的最大的开放词汇3D实例分割数据集（包含3.0M个多视图一致的标题，涵盖来自7.4K场景的604K实例）；与以往的单视图管道相比，其掩膜召回率提高了21倍（54.3%对比2.5%，在IoU > 0.5时）。SpaCeFormer结合了空间窗口注意力与Morton曲线序列化，以实现空间一致的特征，并使用增强的RoPE解码器直接从学习到的查询中预测实例掩膜，而无需外部提议。在ScanNet200上，我们实现了11.1的零样本平均精度（mAP），比之前最佳的无提议方法提高了2.8倍；在ScanNet++和Replica上，我们分别达到了22.9和24.1的mAP，超越了所有以往的方法，包括那些使用多视图2D输入的方法。
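
The abstract above relies on Morton-curve serialization to obtain spatially coherent features. A minimal sketch of that idea, assuming non-negative integer voxel coordinates: interleave the bits of x, y, and z into one key and sort by it, so that voxels close in 3D tend to be close in the resulting sequence. The function names and the 10-bit default are illustrative assumptions, not details from the paper.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)        # x occupies bit 3i
        key |= ((y >> i) & 1) << (3 * i + 1)    # y occupies bit 3i + 1
        key |= ((z >> i) & 1) << (3 * i + 2)    # z occupies bit 3i + 2
    return key


def serialize(voxels):
    """Order integer voxel coordinates along the Morton curve."""
    return sorted(voxels, key=lambda v: morton3d(*v))
```

The serialized order can then be cut into fixed-length windows for attention, which is one common way such curves are used in point-cloud transformers.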

cs.CV / 65 / 2604.20429

Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

快速-精细：一种具有多粒度表示的两阶段框架用于遥感中的跨模态检索

Chen, Xi, Chen, Xu, Jia, Xiangyang, Zhang, Xu, Wei, Shuquan, Wang, Wei

Abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that the FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.

Chinese Translation

遥感（RS）图像-文本检索在理解大量遥感图像中发挥着关键作用。然而，遥感图像中密集的多物体分布和复杂的背景使得同时实现细粒度的跨模态对齐和高效检索变得困难。现有方法要么依赖复杂的跨模态交互，导致检索效率低下，要么依赖大规模的视觉-语言模型预训练，这需要大量的数据和计算资源。为了解决这些问题，我们提出了一种快速-精细（FTF）两阶段检索框架，将检索分解为一个无文本的候选选择阶段以实现高效选择和一个文本引导的重排序阶段以实现细粒度对齐。具体而言，在候选选择阶段，采用无文本的粗粒度表示以实现高效的候选选择；在重排序阶段，无参数的平衡文本引导交互模块增强了细粒度对齐，而不引入额外的可学习参数。此外，设计了一种跨模态的内外损失，以联合优化多粒度表示下的跨模态对齐。在公共基准上的大量实验表明，FTF在检索准确性上具有竞争力，同时显著提高了检索效率，相较于现有方法表现更佳。
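
The two-stage structure described above (a cheap recall pass over the whole gallery, then a finer rerank of the shortlist) can be sketched as follows. This is an illustration of the general recall-then-rerank pattern only: the scoring functions, the token-to-region max-similarity rule, and all names are assumptions, not the FTF interaction block or its loss.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recall(query_vec, global_vecs, k):
    """Stage 1: rank all images by one coarse vector each, keep the top k ids."""
    ranked = sorted(global_vecs, key=lambda i: cosine(query_vec, global_vecs[i]),
                    reverse=True)
    return ranked[:k]


def fine_score(token_vecs, region_vecs):
    """Stage 2 score: each text token takes its best-matching image region."""
    return sum(max(cosine(t, r) for r in region_vecs) for t in token_vecs) / len(token_vecs)


def fast_then_fine(query_vec, token_vecs, global_vecs, region_vecs, k):
    """Rerank only the k recalled candidates with the finer score."""
    candidates = recall(query_vec, global_vecs, k)
    return sorted(candidates, key=lambda i: fine_score(token_vecs, region_vecs[i]),
                  reverse=True)
```

Because the coarse image vectors do not depend on the query text, they can be computed once offline, and the expensive score runs on only `k` items per query.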

cs.CV / 66 / 2604.20460

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

CCTVBench：用于多模态大语言模型的对比一致性交通视频问答基准

Zhou, Xingcheng, Guo, Hao, Song, Rui, Zimmer, Walter, Liu, Mingyu, Schamschurko, André, Cao, Hu, Knoll, Alois

Abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach leveraging a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.

Chinese Translation

安全关键的交通推理需要对比一致性：模型必须在事故发生时检测真实的危险，并在几乎相同的反事实场景下可靠地拒绝合理但错误的假设。我们提出了CCTVBench，一个基于成对真实事故视频和世界模型生成的反事实对应视频构建的对比一致性交通视频问答基准，以及最小不同、相互排斥的假设问题。CCTVBench在每个视频问题四元组上强制执行单一的结构化决策模式，并提供可操作的诊断，将失败分解为正遗漏、正交换、负幻觉和互斥性违反，同时区分视频与问题的一致性。在开源和专有视频大语言模型上的实验揭示了标准每实例问答指标与四元组级对比一致性之间存在巨大且持续的差距，而不可靠的“无上述”拒绝是一个关键瓶颈。最后，我们介绍了C-TCD，一种对比解码方法，在推理时利用语义上排斥的对应视频作为对比输入，从而改善实例级问答和对比一致性。

cs.CV / 67 / 2604.20470

DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

DynamicRad：内容自适应稀疏注意力用于长视频扩散

Long, Yongji, Liang, Shijun, Li, Jintao, Li, Yun

Abstract

Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose DynamicRad, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a dual-mode strategy: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a semantic motion router. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with minimal runtime overhead. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency-quality Pareto frontier, achieving 1.7× to 2.5× inference speedups with over 80% effective sparsity. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.

Chinese Translation

利用视频扩散中自然的时空能量衰减提供了一条提高效率的路径，但仅依赖于刚性静态掩码可能会导致在复杂动态中丢失关键的长距离信息。为了解决这个问题，我们提出了DynamicRad，一种统一的稀疏注意力范式，该范式将自适应选择基于径向局部性先验。DynamicRad引入了一种双模式策略：静态比例用于速度优化执行，动态阈值用于质量优先过滤。为了确保鲁棒性而不增加在线搜索的开销，我们整合了一个离线贝叶斯优化（Bayesian Optimization, BO）管道，并结合了语义运动路由器。这个轻量级投影模块将提示嵌入映射到最佳稀疏性范围，具有最小的运行时开销。与在线分析方法不同，我们的离线BO在基于物理的代理任务上优化注意力重建误差（均方误差，MSE），确保快速收敛。在HunyuanVideo和Wan2.1-14B上的实验表明，DynamicRad推动了效率与质量的Pareto前沿，实现了1.7×至2.5×的推理加速，并具有超过80%的有效稀疏性。在某些长序列设置中，动态模式甚至与密集基线相匹配或超越，而掩码感知的LoRA进一步提高了长时间一致性。代码可在 https://github.com/Adamlong3/DynamicRad 获取。

cs.CV / 68 / 2604.20473

Video-ToC: Video Tree-of-Cue Reasoning

视频树线索推理：Video-ToC

Tan, Qizhong, Tian, Zhuotao, Lu, Guangming, Yu, Jun, Pei, Wenjie

Abstract

Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at https://github.com/qizhongtan/Video-ToC.

Chinese Translation

现有的视频大语言模型（Video LLMs）在复杂视频理解方面表现不佳，推理能力有限且可能出现幻觉。特别是，这些方法往往仅依赖于预训练的内在推理依据进行推理，而缺乏对输入视频内容的感知适应。为了解决这一问题，我们提出了Video-ToC，一种通过树线索推理增强视频理解的新型视频推理框架。具体而言，我们的方法引入了三个关键创新：（1）一种树引导的视觉线索定位机制，通过结构化推理模式赋予模型增强的细粒度感知能力；（2）一种推理需求奖励机制，基于对推理需求的估计动态调整强化学习（RL）的奖励值，为更有效的推理策略提供按需激励；（3）一个自动化注释管道，构建了用于监督微调（SFT）和RL训练的Video-ToC-SFT-1k和Video-ToC-RL-2k数据集。对六个视频理解基准和一个视频幻觉基准的广泛评估表明，Video-ToC在性能上优于基线和近期方法。代码可在 https://github.com/qizhongtan/Video-ToC 获取。

cs.CV / 69 / 2604.20474

Random Walk on Point Clouds for Feature Detection

点云上的随机游走用于特征检测

Zhang, Yuhe, Tu, Zhikun, Li, Zhi, Gao, Jian, Guo, Bao, Zhang, Shunli

Abstract

The points on a point cloud that can entirely outline the shape of the model are of critical importance, as they serve as the foundation for numerous point cloud processing tasks and are widely utilized in computer graphics and computer-aided design. This study introduces a novel method, RWoDSN, for extracting such feature points, incorporating considerations of sharp-to-smooth transitions, large-to-small scales, and textural-to-detailed features. We approach feature extraction as a two-stage context-dependent analysis problem. In the first stage, we propose a novel neighborhood descriptor, termed the Disk Sampling Neighborhood (DSN), which, unlike traditional spatially and geometrically invariant approaches, preserves a matrix structure while maintaining normal neighborhood relationships. In the second stage, a random walk is performed on the DSN (RWoDSN), yielding a graph-based DSN that simultaneously accounts for the spatial distribution, topological properties, and geometric characteristics of the local surface surrounding each point. This enables the effective extraction of feature points. Experimental results demonstrate that the proposed RWoDSN method achieves a recall of 0.769, which is 22% higher than the current state of the art, alongside a precision of 0.784. Furthermore, it significantly outperforms several traditional and deep-learning techniques across eight evaluation metrics.

Chinese Translation

在点云中，能够完全勾勒出模型形状的点至关重要，因为它们为众多点云处理任务奠定了基础，并广泛应用于计算机图形学和计算机辅助设计。本文提出了一种新颖的方法，称为 RWoDSN，用于提取这些特征点，考虑了从尖锐到平滑的过渡、大到小的尺度以及纹理到细节的特征。我们将特征提取视为一个两阶段的上下文相关分析问题。在第一阶段，我们提出了一种新颖的邻域描述符，称为盘采样邻域（Disk Sampling Neighborhood, DSN），与传统的空间和几何不变方法不同，该方法在保持法向邻域关系的同时，保留了矩阵结构。在第二阶段，在 DSN 上执行随机游走（RWoDSN），生成一个基于图的 DSN，同时考虑到每个点周围局部表面的空间分布、拓扑特性和几何特征。这使得特征点的有效提取成为可能。实验结果表明，所提出的 RWoDSN 方法的召回率为 0.769，比当前最先进的方法高出 22%，同时精度为 0.784。此外，它在八个评估指标上显著优于多种传统和深度学习技术。

cs.CV / 70 / 2604.20486

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

ProMMSearchAgent：一种通过过程导向奖励训练的可泛化多模态搜索代理

Yan, Wentao, Wang, Shengqin, Zhou, Huichi, Chen, Yihang, Shao, Kun, Xie, Yuan, Zhang, Zhizhong

Abstract

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.

Chinese Translation

通过强化学习训练多模态代理以进行知识密集型视觉推理，受到基于结果的监督极度稀疏和实时网络环境不可预测性的根本性阻碍。为了解决这些算法和环境瓶颈，我们提出了ProMMSearchAgent，建立了一种新的Sim-to-Real训练范式用于多模态搜索。我们将策略学习解耦为一个确定性的、本地静态的沙盒。至关重要的是，为了在这个受限环境中有效学习，我们提出了一种内省的过程导向奖励。通过探测代理自身的参数知识边界，我们生成了密集的行为元数据，明确奖励正确的认知决策，仅在视觉或事实不确定时启动多模态或文本搜索。大量实验表明，我们在本地训练的策略能够零-shot迁移到实时的Google Search API。ProMMSearchAgent实现了新的SOTA（最先进技术）性能，在FVQA-test上比MMSearch-R1提高了5.1%，在InfoSeek上提高了6.3%，在MMSearch上提高了11.3%。

cs.CV / 71 / 2604.20543

RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images

RefAerial：一种用于航空图像中指代检测的基准与方法

Hu, Guyue, Song, Hao, Tong, Yuxing, Yuan, Duzhi, Sun, Dengdi, Zheng, Aihua, Li, Chenglong, Tang, Jin

Abstract

Referring detection is the task of locating a target described in natural language, and it has recently attracted growing research interest. However, existing datasets are limited to ground images with large objects centered in relatively small scenes. This paper introduces a large-scale, challenging dataset for referring detection in aerial images, termed RefAerial. It differs from conventional ground referring detection datasets in four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring pair annotation. In addition, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset because of the intrinsic scale variation within and across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding, and the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring target decoding. Finally, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and also yields a promising performance boost on conventional ground referring detection datasets.

Chinese Translation

指代检测是指定位于自然语言中提到的目标，近年来引起了越来越多的研究兴趣。然而，现有的数据集仅限于地面图像，且对象通常集中在相对较小的场景中。本文介绍了一个用于航空图像中指代检测的大规模挑战性数据集，命名为RefAerial。该数据集与传统的地面指代检测数据集有四个显著特点：(1) 低但多样的对象与场景比例，(2) 众多的目标和干扰物，(3) 复杂且细致的指代描述，(4) 航空视角下多样且广泛的场景。我们还开发了一个人机协作的指代扩展与标注引擎（REA-Engine），用于高效的半自动化指代对标注。此外，我们观察到现有的地面指代检测方法在我们的航空数据集上表现出严重的性能下降，原因在于航空图像内部或跨图像的固有尺度变化问题。因此，我们进一步提出了一种新颖的尺度综合与敏感（SCS）框架，用于航空图像中的指代检测。该框架由混合粒度（MoG）注意力机制和两阶段综合到敏感（CtS）解码策略组成。具体而言，混合粒度注意力机制用于尺度综合目标理解。此外，两阶段综合到敏感解码策略旨在实现粗到细的指代目标解码。最终，所提出的SCS框架在我们的航空指代检测数据集上取得了显著的性能，并在传统的地面指代检测数据集上也展现了良好的性能提升。

cs.CV / 72 / 2604.20544

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

Evian：迈向可解释的视觉指令调优数据审计

Jia, Zimu, Xu, Mingjie, Estornell, Andrew, Wei, Jiaheng

Abstract

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

Chinese Translation

大型视觉语言模型（LVLMs）的有效性在很大程度上依赖于其训练数据的质量，这需要在视觉保真度和指令遵循能力之间达到精确平衡。然而，现有数据集普遍存在质量不一致的问题，当前的数据过滤方法依赖于粗粒度的评分，缺乏识别逻辑谬误或事实错误等细微语义缺陷的能力。这在开发更可靠的模型时造成了根本性的瓶颈。为了解决这一问题，我们做出了三项核心贡献。首先，我们通过系统性地注入多样化、细微的缺陷构建了一个大规模的300K样本基准，为数据审计提供了一个具有挑战性的测试平台。其次，我们引入了一种新颖的“分解-再评估”（Decomposition-then-Evaluation）范式，将模型响应分解为组成的认知组件：视觉描述、主观推理和事实声明，从而实现针对性的分析。第三，我们通过EVIAN（可解释的视觉指令调优数据审计）这一自动化框架实例化了该范式，评估这些组件在图像-文本一致性、逻辑一致性和事实准确性这三个正交维度上的表现。我们的实证研究挑战了当前以规模为中心的范式：在EVIAN精心策划的紧凑高质量子集上进行微调的模型，始终超越在数量级更大的数据集上训练的模型。我们还揭示，将复杂审计划分为可验证的子任务能够实现稳健的策划，而逻辑一致性是数据质量评估中最关键的因素。

cs.CV / 73 / 2604.20570

Exploring Spatial Intelligence from a Generative Perspective

从生成视角探索空间智能

Zhu, Muzhi, Jiang, Shunyao, Zheng, Huanyi, Luo, Zekai, Zhong, Hao, Li, Anzhou, Wang, Kaijun, Rong, Jintao, Liu, Yang, Chen, Hao, Lin, Tao, Shen, Chunhua

Abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

Chinese Translation

空间智能对于多模态大型语言模型至关重要，但目前的基准测试主要从理解的角度评估它。我们探讨现代生成或统一多模态模型是否也具备生成空间智能（Generative Spatial Intelligence, GSI），即在图像生成过程中尊重和操控三维空间约束的能力，以及这种能力是否可以被测量或改善。我们引入了GSI-Bench，这是第一个旨在通过空间基础的图像编辑来量化GSI的基准测试。它由两个互补的组成部分构成：GSI-Real，一个通过3D先验引导生成和过滤管道构建的高质量真实世界数据集，以及GSI-Syn，一个具有可控空间操作和完全自动化标注的大规模合成基准。结合统一的评估协议，GSI-Bench使得对空间合规性和编辑保真度的评估具有可扩展性和模型无关性。实验表明，对GSI-Syn进行微调的统一多模态模型在合成和真实任务上均取得了显著提升，并且显著改善了下游空间理解。这为生成训练可以切实增强空间推理提供了首个明确证据，为推动多模态模型中的空间智能建立了一条新路径。

cs.CV / 74 / 2604.20574

Where are they looking in the operating room?

他们在手术室里看向哪里？

Chen, Keqi, Baributsa, Séraphin, Schewski, Lilien, Srivastav, Vinkle, Mutter, Didier, Beldi, Guido, Keller, Sandra, Padoy, Nicolas

Abstract

Purpose: Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, and human-robot interaction. However, gaze-following has never been explored in the operating room (OR), a complex, high-stakes environment where visual attention plays an important role in surgical workflow analysis. In this work, we introduce the concept of gaze-following to the surgical domain, and demonstrate its great potential for understanding clinical roles, surgical phases, and team communications in the OR. Methods: We extend the 4D-OR dataset with gaze-following annotations, and extend the Team-OR dataset with gaze-following and new team communication activity annotations. Then, we propose novel approaches to address clinical role prediction, surgical phase recognition, and team communication detection using a gaze-following model. For role and phase recognition, we propose a gaze heatmap-based approach that uses only gaze predictions; for team communication detection, we train a spatial-temporal model in a self-supervised way that encodes gaze-based clip features, and then feed the features into a temporal activity detection model. Results: Experimental results on the 4D-OR and Team-OR datasets demonstrate that our approach achieves state-of-the-art performance on all downstream tasks. Quantitatively, our approach obtains F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition. Furthermore, it significantly outperforms existing baselines in team communication detection, improving on the previous best performance by over 30%. Conclusion: We introduce gaze-following in the OR as a novel research direction in surgical data science, highlighting its great potential to advance surgical workflow analysis in computer-assisted interventions.

Chinese Translation

目的：注视跟随（gaze-following），即推断个体注视方向的任务，在计算机视觉领域得到了广泛研究，推动了视觉注意建模、社会场景理解和人机交互的研究。然而，注视跟随在手术室（OR）这一复杂且高风险的环境中尚未被探索，而视觉注意在手术工作流程分析中扮演着重要角色。在本研究中，我们将注视跟随的概念引入手术领域，并展示其在理解临床角色、手术阶段和团队沟通方面的巨大潜力。方法：我们扩展了4D-OR数据集，增加了注视跟随的注释，并扩展了Team-OR数据集，增加了注视跟随和新的团队沟通活动注释。然后，我们提出了新方法来解决临床角色预测、手术阶段识别和团队沟通检测，使用注视跟随模型。对于角色和阶段识别，我们提出了一种基于注视热图的方法，仅使用注视预测；对于团队沟通检测，我们以自监督的方式训练一个时空模型，该模型编码基于注视的片段特征，然后将这些特征输入到时间活动检测模型中。结果：在4D-OR和Team-OR数据集上的实验结果表明，我们的方法在所有下游任务上均达到了最先进的性能。从定量上看，我们的方法在临床角色预测中获得了0.92的F1分数，在手术阶段识别中获得了0.95的F1分数。此外，在团队沟通检测中显著优于现有基线，提升了超过30%的最佳表现。结论：我们在手术室中引入注视跟随作为外科数据科学中的一个新研究方向，强调其在推动计算机辅助干预中的手术工作流程分析方面的巨大潜力。

cs.CV / 75 / 2604.20585

On the Impact of Face Segmentation-Based Background Removal on Recognition and Morphing Attack Detection

基于人脸分割的背景去除对识别和变形攻击检测的影响研究

Caldeira, Eduarda, Ozgur, Guray, Boutros, Fadi, Damer, Naser

Abstract

This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scenarios. The motivation is driven by operational biometric systems such as the European Entry/Exit System (EES), which require facial enrolment at airports and other border crossing points where controlled backgrounds usually required for such captures cannot always be guaranteed, as well as by accessibility needs that may necessitate image capture outside traditional office environments. By analyzing how such preprocessing steps influence both recognition accuracy and security mechanisms, this work addresses a critical gap between usability-driven image normalization and the reliability requirements of large-scale biometric identification systems. Our study evaluates a comprehensive range of segmentation techniques, three families of morphing attack detection methods, and four distinct face recognition models, using databases that include both controlled and in-the-wild image captures. The results reveal consistent patterns linking segmentation to both recognition performance and face image quality. Additionally, segmentation is shown to systematically influence morphing attack detection performance. These findings highlight the need for careful consideration when deploying such preprocessing techniques in operational biometric systems.

Chinese Translation

本研究探讨了通过分割进行人脸图像背景修正对人脸识别和变形攻击检测性能的影响，特别是在现实的非约束性图像捕捉场景中。研究的动机源于操作性生物特征识别系统，如欧洲入境/出境系统（EES），该系统在机场及其他边境通行点需要进行人脸注册，而这些地方通常无法保证进行此类捕捉所需的受控背景。此外，某些可及性需求可能需要在传统办公环境之外进行图像捕捉。通过分析这些预处理步骤如何影响识别准确性和安全机制，本研究填补了以可用性为驱动的图像规范化与大规模生物特征识别系统的可靠性要求之间的关键空白。我们的研究评估了一系列全面的分割技术、三类变形攻击检测方法以及四种不同的人脸识别模型，使用的数据库包括受控和自然环境下的图像捕捉。结果揭示了分割与识别性能和人脸图像质量之间的一致关联模式。此外，分割被证明系统性地影响变形攻击检测性能。这些发现强调了在操作性生物特征识别系统中部署此类预处理技术时需要谨慎考虑。

cs.CV / 76 / 2604.20591

Structure-Augmented Standard Plane Detection with Temporal Aggregation in Blind-Sweep Fetal Ultrasound

结构增强的盲扫胎儿超声标准平面检测与时间聚合

Niu, Keli, Zhao, He, Men, Qianhui

Abstract

In low-resource settings, blind-sweep ultrasound provides a practical and accessible method for identifying fetal growth restriction. However, unlike freehand ultrasound, which is subjectively controlled, detection of the biometry plane in blind-sweep ultrasound is more challenging because the fetal structure to be observed is uncontrolled and the scan contains a variety of oblique planes. In this work, we propose a structure-augmented system to detect the fetal abdomen plane, where the abdominal structure is highlighted using a segmentation prior. Because standard planes emerge gradually, the decision boundary for keyframes is difficult to predict stably. We therefore aggregate the structure-augmented planes with a temporal sliding window to help stabilise keyframe localisation. Extensive results indicate that the structure-augmented temporal sliding strategy significantly improves and stabilises the detection of anatomically meaningful planes, which enables more reliable biometric measurements in blind-sweep ultrasound.

Chinese Translation

在资源匮乏的环境中，盲扫超声提供了一种实用且可及的方法来识别胎儿生长受限。然而，与主观控制的自由手超声不同，盲扫超声中生物测量平面的检测更具挑战性，因为观察到的胎儿结构不受控制，并且扫描中存在多种斜面。在本研究中，我们提出了一种结构增强的系统来检测胎儿腹部平面，其中通过分割先验突出腹部结构。由于标准平面逐渐出现，关键帧的决策边界不稳定，因此我们采用时间滑动窗口聚合结构增强的平面，以帮助稳定关键帧定位。大量结果表明，结构增强的时间滑动策略显著改善并稳定了解剖学上有意义平面的检测，从而在盲扫超声中实现了更可靠的生物测量。
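
The temporal sliding-window aggregation mentioned above can be illustrated with a simple moving average over per-frame plane scores. This is a hedged sketch of the general idea only: the abstract does not say how frames are aggregated, and the scores, window size, and function names here are assumptions.

```python
def smooth_scores(scores, window=5):
    """Average each frame's score with its neighbours in a centred window."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        # Clip the window at the start and end of the sweep.
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def pick_keyframe(scores, window=5):
    """Index of the frame whose smoothed score is highest."""
    smoothed = smooth_scores(scores, window)
    return max(range(len(smoothed)), key=smoothed.__getitem__)
```

With smoothing, an isolated high-scoring frame loses to a run of consistently good frames, which matches the intuition that a standard plane appears gradually rather than in a single frame.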

cs.CV / 77 / 2604.20594

Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging

物理信息条件扩散用于运动鲁棒的视网膜时间激光散斑对比成像

Chen, Qian, Chen, Yuehao, Wang, Qiang, Zhu, Lei, Lu, Yanye, Ren, Qiushi

Abstract

Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on sufficiently long speckle sequences to obtain stable temporal statistics, which makes it vulnerable to acquisition disturbances and limits effective temporal resolution. A physically informed reconstruction framework, termed RetinaDiff (Retinal Diffusion Model), is proposed for retinal tLSCI that is robust to motion and requires only a few frames. In RetinaDiff, registration based on phase correlation is first applied to stabilize the raw speckle sequence before contrast computation, reducing interframe misalignment so that fluctuations at each pixel primarily reflect true flow dynamics. This step provides a physics prior corrected for motion and a high quality multiframe tLSCI reference. Next, guided by the physics prior, a conditional diffusion model performs inverse reconstruction by jointly conditioning on the registered speckle sequence and the corrected prior. Experiments on data acquired with a retinal LSCI system developed in house show improved structural continuity and statistical stability compared with direct reconstruction from few frames and representative baselines. The framework also remains effective in a small number of extremely challenging cases, where both the direct 5-frame input and the conventional multiframe reconstruction are severely degraded. Overall, this work provides a practical and physically grounded route for reliable retinal tLSCI reconstruction from extremely limited frames. The source code and model weights will be publicly available at https://github.com/QianChen113/RetinaDiff.

Chinese Translation

视网膜激光散斑对比成像（LSCI）是一种用于监测视网膜血流动态的非侵入性光学技术。然而，传统的时间LSCI（tLSCI）重建依赖于足够长的散斑序列以获得稳定的时间统计，这使其易受采集干扰的影响，并限制了有效的时间分辨率。我们提出了一种名为RetinaDiff（视网膜扩散模型）的物理信息重建框架，用于视网膜tLSCI，该框架对运动具有鲁棒性，并且只需少量帧。在RetinaDiff中，首先应用基于相位相关的配准方法来稳定原始散斑序列，然后进行对比计算，从而减少帧间错位，使每个像素的波动主要反映真实的流动动态。这一步提供了一个针对运动校正的物理先验和高质量的多帧tLSCI参考。接下来，在物理先验的指导下，条件扩散模型通过联合条件化已配准的散斑序列和校正的先验进行逆重建。通过在自家开发的视网膜LSCI系统上获取的数据进行的实验表明，与直接从少量帧和代表性基线进行的重建相比，结构连续性和统计稳定性得到了改善。该框架在极少数极具挑战性的案例中仍然有效，其中直接的5帧输入和传统的多帧重建均严重退化。总体而言，这项工作为从极其有限的帧中可靠地重建视网膜tLSCI提供了一条实用且基于物理的途径。源代码和模型权重将公开发布在 https://github.com/QianChen113/RetinaDiff。
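
The registration step above is based on phase correlation. A minimal 1D sketch of the technique follows, using a naive DFT so that it needs only the standard library; real speckle frames are 2D and would use an FFT, and the function names are hypothetical. The idea is to normalise the cross-power spectrum to unit magnitude so that its inverse transform peaks at the displacement.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n) for k in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def phase_correlation_shift(ref, moved):
    """Estimate the circular shift s with moved[i] == ref[(i - s) % n]."""
    cross = []
    for a, b in zip(dft(ref), dft(moved)):
        c = b * a.conjugate()
        # Keep only the phase; skip bins with (near) zero energy.
        cross.append(c / abs(c) if abs(c) > 1e-12 else 0)
    corr = dft(cross, inverse=True)
    return max(range(len(corr)), key=lambda i: corr[i].real)
```

Once the shift is known, each frame can be moved back into alignment before the temporal contrast is computed, so that per-pixel fluctuations are less affected by motion.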


cs.CV / 78 / 2604.20606

Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

超越零阶保持：Vision Mamba的高级离散化策略

Ibrahim, Fady, Liu, Guangjun, Wang, Guanghui

Abstract

Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence on image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training-time computation. In contrast, BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade-off between precision and efficiency. These findings elucidate the pivotal role of discretization in SSM-based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state-of-the-art SSM models.

Chinese Translation

Vision Mamba作为一种状态空间模型（SSM），采用零阶保持（ZOH）离散化，假设输入信号在采样时刻之间保持不变。这一假设降低了动态视觉环境中的时间保真度，并限制了现代基于SSM的视觉模型所能达到的准确性。在本文中，我们对在Vision Mamba框架内实现的六种离散化方案进行了系统且可控的比较：ZOH、第一阶保持（FOH）、双线性/塔斯廷变换（BIL）、多项式插值（POL）、高阶保持（HOH）和四阶龙格-库塔方法（RK4）。我们在标准视觉基准上评估每种方法，以量化其在图像分类、语义分割和目标检测中的影响。我们的结果表明，POL和HOH在训练时间计算成本较高的情况下，带来了最大的准确性提升。相比之下，BIL在ZOH的基础上提供了一致的改进，并带来了适度的额外开销，提供了精度与效率之间最有利的权衡。这些发现阐明了离散化在基于SSM的视觉架构中的关键作用，并为将BIL作为最先进SSM模型的默认离散化基线提供了实证基础。
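To make the ZOH versus bilinear comparison concrete, here is a minimal sketch for a diagonal continuous-time system h'(t) = a h(t) + b x(t). It uses the textbook ZOH and Tustin formulas; the function names are hypothetical, and whether the paper's Vision Mamba variants use exactly these forms (for example, Mamba implementations often simplify the input term) is an assumption.

```python
import numpy as np

def discretize_zoh(a, b, dt):
    """Zero-order hold: exact when the input is constant over each step."""
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def discretize_bilinear(a, b, dt):
    """Bilinear (Tustin) transform: a rational approximation of exp(dt * a)."""
    denom = 1.0 - 0.5 * dt * a
    a_bar = (1.0 + 0.5 * dt * a) / denom
    b_bar = dt * b / denom
    return a_bar, b_bar

def scan(a_bar, b_bar, x):
    """Run the recurrence h_k = a_bar * h_{k-1} + b_bar * x_k over a 1D input."""
    h = np.zeros_like(a_bar)
    states = []
    for x_k in x:
        h = a_bar * h + b_bar * x_k
        states.append(h.copy())
    return np.stack(states)

# Example: two stable modes driven by a constant input.
a = np.array([-1.0, -2.0])
b = np.ones(2)
dt = 0.1
x = np.ones(50)
h_zoh = scan(*discretize_zoh(a, b, dt), x)
h_bil = scan(*discretize_bilinear(a, b, dt), x)
```

For a constant input the ZOH recurrence reproduces the continuous solution exactly, and the bilinear version tracks it closely at this step size. The schemes differ more when the input varies within a step, which is the regime the abstract is concerned with.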


cs.CV / 79 / 2604.20623

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

RSRCC：通过检索增强的最佳-N排名构建的遥感区域变化理解基准

Kazoom, Roie, Gigi, Yotam, Leifman, George, Shekel, Tomer, Beryozkin, Genady

Abstract

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

Chinese Translation

传统的变化检测识别变化发生的地点，但未能用自然语言解释发生了什么变化。现有的遥感变化描述数据集通常描述整体图像级别的差异，细粒度的局部语义推理仍然未得到充分探索。为了解决这一问题，我们提出了RSRCC，一个新的遥感变化问答基准，包含126,000个问题，分为87,000个训练实例、17,100个验证实例和22,000个测试实例。与之前的数据集不同，RSRCC围绕局部的、特定变化的问题构建，这些问题需要对特定的语义变化进行推理。根据我们所知，这是第一个专门为这种细粒度推理监督设计的遥感变化问答基准。为了构建RSRCC，我们引入了一个分层半监督的策划流程，使用最佳-N排名作为关键的最终模糊解决阶段。首先，从语义分割掩膜中提取候选变化区域，然后使用图像-文本嵌入模型进行初步筛选，最后通过检索增强的视觉-语言策划和最佳-N排名进行验证。这个过程使得在保留语义上有意义的变化的同时，能够对噪声和模糊的候选进行可扩展的过滤。数据集可在 https://huggingface.co/datasets/google/RSRCC 获取。


cs.CV / 80 / 2604.20650

MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation

MAPRPose：面具感知提议与模态无关的多目标6D姿态估计精化

Luo, Yang, Gong, Yan, Gao, Yongsheng, Sun, Xiaoying, Zhao, Jie

Abstract

6D object pose estimation in cluttered scenes remains challenging due to severe occlusion and sensor noise. We propose MAPRPose, a two-stage framework that leverages mask-aware correspondences for pose proposal and amodal-driven Region-of-Interest (ROI) prediction for robust refinement. In the Mask-Aware Pose Proposal (MAPP) stage, we lift 2D correspondences into 3D space to establish reliable keypoint matches and generate geometrically consistent pose hypotheses based on correspondence-level scoring, from which the top-$K$ candidates are selected. In the refinement stage, we introduce a tensorized render-and-compare pipeline integrated with an Amodal Mask Prediction and ROI Re-Alignment (AMPR) module. By reconstructing complete object geometry and dynamically adjusting the ROI, AMPR mitigates localization errors and spatial misalignment under heavy occlusion. Furthermore, our GPU-accelerated RGB-XYZ reprojection enables simultaneous refinement of all $N \times B$ pose hypotheses in a single forward pass. Evaluated on the BOP benchmark, MAPRPose achieves a state-of-the-art Average Recall (AR) of 76.5%, outperforming FoundationPose by 3.1% AR while delivering a 43x speedup in multi-object inference.

Chinese Translation

在复杂场景中，6D物体姿态估计仍然面临严重的遮挡和传感器噪声带来的挑战。我们提出了MAPRPose，一个两阶段框架，利用面具感知的对应关系进行姿态提议，并通过模态驱动的兴趣区域（ROI）预测进行稳健的精化。在面具感知姿态提议（MAPP）阶段，我们将2D对应关系提升到3D空间，以建立可靠的关键点匹配，并根据对应级别的评分生成几何一致的姿态假设，从中选择前$K$个候选者。在精化阶段，我们引入了一个张量化的渲染与比较管道，并集成了模态无关的面具预测与ROI重新对齐（AMPR）模块。通过重建完整的物体几何形状并动态调整ROI，AMPR减轻了在严重遮挡下的定位误差和空间错位。此外，我们的GPU加速RGB-XYZ重投影使得在单次前向传递中能够同时精化所有$N \times B$个姿态假设。在BOP基准测试中，MAPRPose实现了76.5%的最新平均召回率（AR），比FoundationPose提高了3.1%的AR，同时在多目标推理中实现了43倍的加速。
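The top-$K$ selection and batched refinement mentioned in the abstract can be sketched as follows. The array layout, the function name, and the use of 4x4 pose matrices are illustrative assumptions; the abstract does not say how the scores are computed or how $N$ and $B$ are defined.

```python
import numpy as np

def select_top_k(scores, poses, k):
    """Keep the k best-scoring pose hypotheses per object and flatten them.

    scores: (N, H) hypothesis scores for N objects with H hypotheses each.
    poses:  (N, H, 4, 4) homogeneous pose matrices.
    Returns the (N, k) selected indices (best first) and an (N * k, 4, 4)
    batch that a refiner could process in one forward pass.
    """
    order = np.argsort(-scores, axis=1)[:, :k]      # (N, k) indices, best first
    rows = np.arange(scores.shape[0])[:, None]      # (N, 1) object index
    selected = poses[rows, order]                   # (N, k, 4, 4)
    return order, selected.reshape(-1, 4, 4)
```

Flattening the selected hypotheses into one batch is what lets a tensorized render-and-compare step handle every object and hypothesis together instead of looping over objects.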


cs.CV / 81 / 2604.20665

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

视觉的代价：在单一范式内实现可信的多模态推理

Goyal, Karan, Kukreja, Dikshant

Abstract

The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.

Chinese Translation

视觉-语言模型（VLMs）的快速普及被广泛视为统一多模态知识发现的曙光，但其基础建立在一个危险且未经质疑的公理之上：即当前的 VLMs 忠实地综合多模态数据。我们认为事实并非如此。相反，主导的视觉编码器-投影器-大语言模型（Vision Encoder-Projector-LLM）范式背后存在着深刻的可信度危机。最先进的模型经常表现出功能盲目性，即利用强大的语言先验来绕过严重的视觉表征瓶颈，而不是从视觉输入中提取扎根知识。在本研究中，我们挑战了依赖数据消融或新数据集创建的传统多模态评估方法，因此致命地将数据集偏见与架构能力混淆。我们提出了一种激进的信息论出发点：模态翻译协议（Modality Translation Protocol），旨在定量揭示视觉的代价。通过翻译语义负载而非消融它们，我们制定了三个新颖的指标——视觉的代价（Toll, ToS）、诅咒（Curse, CoS）和谬误（Fallacy, FoS），最终形成语义充足性标准（Semantic Sufficiency Criterion, SSC）。此外，我们提出了一个挑衅性的多模态缩放发散法则，假设随着基础语言引擎规模达到前所未有的推理能力，视觉知识瓶颈的数学惩罚反而会增加。我们挑战 KDD 社区放弃对“多模态增益”的虚幻追求。通过将 SSC 从被动的诊断约束提升为主动的架构蓝图，我们提供了强有力且可信的基础，以迫使下一代人工智能系统真正“看见”数据，实现真正的多模态推理。


cs.CV / 82 / 2604.20696

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

R-CoV：区域感知的验证链以减轻大规模视觉语言模型中的对象幻觉

Xie, Jiahao, Tonioni, Alessio, Rauschmayr, Nathalie, Tombari, Federico, Schiele, Bernt

Abstract

Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.

Chinese Translation

大型视觉语言模型（LVLMs）在各种多模态理解和推理任务中表现出色。然而，它们仍然面临对象幻觉的问题，即在视觉输入中声称存在不存在的对象。为了解决这一挑战，我们提出了区域感知验证链（R-CoV），这是一种后处理的视觉验证链方法，用于减轻LVLMs中的对象幻觉。受到人类如何理解复杂视觉信息的启发——通常专注于特定图像区域或给定样本中的细节——我们从LVLMs自身引导这种区域级处理，并将其用作链式提示，以检测和减轻它们自身的对象幻觉。具体而言，我们的R-CoV包括六个步骤：初始响应生成、实体提取、坐标生成、区域描述、验证执行和最终响应生成。作为一种简单而有效的方法，R-CoV可以无缝集成到各种LVLMs中，且无需训练和依赖外部检测模型。在多个LVLMs的多个广泛使用的幻觉基准上的大量实验表明，R-CoV能够显著减轻LVLMs中的对象幻觉。项目页面：https://github.com/Jiahao000/R-CoV。
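The six steps listed in the abstract can be laid out as a control-flow sketch. Everything concrete here is an assumption made for illustration: the `lvlm(prompt, image)` and `crop(image, box)` callables, the prompt wording, and the comma-separated box format are hypothetical and are not taken from the R-CoV code.

```python
from typing import Any, Callable, Dict, Tuple

Box = Tuple[int, int, int, int]

def r_cov(image: Any, question: str,
          lvlm: Callable[[str, Any], str],
          crop: Callable[[Any, Box], Any]) -> Tuple[str, Dict[str, bool]]:
    # Step 1: initial response generation.
    initial = lvlm(f"Question: {question}", image)
    # Step 2: entity extraction from the model's own answer.
    listed = lvlm(f"List the objects mentioned, comma-separated: {initial}", image)
    entities = [e.strip() for e in listed.split(",") if e.strip()]
    verdicts: Dict[str, bool] = {}
    for entity in entities:
        # Step 3: coordinate generation (a real model may need sturdier parsing).
        raw = lvlm(f"Give the bounding box of the {entity} as x1,y1,x2,y2.", image)
        x1, y1, x2, y2 = (int(v) for v in raw.split(","))
        # Step 4: region description on the cropped region only.
        description = lvlm("Describe this region.", crop(image, (x1, y1, x2, y2)))
        # Step 5: verification execution against the region description.
        answer = lvlm(f"Region description: {description}\n"
                      f"Is there a {entity}? Answer yes or no.", image)
        verdicts[entity] = answer.strip().lower().startswith("yes")
    # Step 6: final response generation, dropping unverified objects.
    rejected = ", ".join(e for e, ok in verdicts.items() if not ok) or "none"
    final = lvlm(f"Question: {question}\nDraft answer: {initial}\n"
                 f"Unverified objects: {rejected}\n"
                 "Rewrite the answer without the unverified objects.", image)
    return final, verdicts

def fake_lvlm(prompt: str, image: Any) -> str:
    """Scripted stand-in for a real LVLM so the control flow can be exercised."""
    if prompt.startswith("Question:") and "Draft answer" not in prompt:
        return "A dog and a frisbee on the grass."
    if prompt.startswith("List the objects"):
        return "dog, frisbee"
    if prompt.startswith("Give the bounding box"):
        return "0,0,10,10"
    if prompt.startswith("Describe this region"):
        return "a brown dog lying on grass"
    if prompt.startswith("Region description"):
        return "yes" if "Is there a dog" in prompt else "no"
    return "A dog on the grass."

final, verdicts = r_cov(image=None, question="What is in the picture?",
                        lvlm=fake_lvlm, crop=lambda img, box: img)
```

The same model is called at every step, which matches the abstract's claim that the method is training-free and does not rely on an external detector.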


cs.CV / 83 / 2604.20705

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

SSL-R1：用于多模态大语言模型的自监督视觉强化后训练

Xie, Jiahao, Tonioni, Alessio, Rauschmayr, Nathalie, Tombari, Federico, Schiele, Bernt

Abstract

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated great potential for enhancing the reasoning abilities of multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations hinders both MLLMs' intrinsic visual understanding and scalable reward design. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We believe this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.

Chinese Translation

具有可验证奖励的强化学习（RLVR）展示了增强多模态大语言模型（MLLMs）推理能力的巨大潜力。然而，对以语言为中心的先验知识和昂贵的人工标注的依赖，阻碍了MLLMs内在的视觉理解和可扩展的奖励设计。在本研究中，我们提出了SSL-R1，一个通用的自监督强化学习框架，直接从图像中推导可验证的奖励。为此，我们重新审视了视觉领域的自监督学习（SSL），并将广泛使用的SSL任务重新构建为一组可验证的视觉难题，以便进行强化学习后训练，这些任务不需要人类或外部模型的监督。在这些任务上训练MLLMs显著提高了它们在多模态理解和推理基准上的表现，突显了利用以视觉为中心的自监督任务进行MLLM后训练的潜力。我们认为这项工作将为设计有效的自监督可验证奖励以实现大规模强化学习提供有益的经验。项目页面：https://github.com/Jiahao000/SSL-R1。


cs.CV / 84 / 2604.20715

GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers

GeoRelight：利用灵活的多模态扩散变换器学习联合几何重光照与重建

Xue, Yuxuan, Liang, Ruofan, Zakharov, Egor, Bagautdinov, Timur, Cao, Chen, Nam, Giljoo, Saito, Shunsuke, Pons-Moll, Gerard, Romero, Javier

Abstract

Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.

Chinese Translation

从单张照片中对一个人进行重光照是一项吸引人但不适定的任务，因为二维图像模糊地纠缠了三维几何、内在外观和照明。目前的方法要么使用顺序管道，导致误差累积，要么在重光照过程中没有明确利用三维几何，从而限制了物理一致性。由于重光照和三维几何的估计是互为有益的任务，我们提出了一种统一的多模态扩散变换器（Multi-Modal Diffusion Transformer, DiT），它联合解决这两者：GeoRelight。我们通过两个关键技术贡献使这一目标成为可能：各向同性的NDC-正交深度（isotropic NDC-Orthographic Depth, iNOD），一种与潜在扩散模型兼容的无失真三维表示；以及一种战略性混合数据训练方法，结合了合成数据和自动标注的真实数据。通过联合解决几何和重光照，GeoRelight的性能优于顺序模型和忽略几何的先前系统。


cs.CV / 85 / 2604.20730

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

循环渲染：通过视觉自反馈生成矢量图形

Liang, Guotao, Wang, Zhangcheng, Hu, Juncheng, Zhou, Haitao, Xue, Ziteng, Zhang, Jing, Xu, Dong, Yu, Qian

Abstract

Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.

Chinese Translation

多模态大型语言模型（MLLMs）在通过直接代码合成生成可缩放矢量图形（SVG）方面展现了良好的潜力。然而，现有的范式通常采用开放式“盲绘制”方法，模型生成符号代码序列而不感知中间的视觉结果。这种方法严重低估了MLLMs视觉编码器中嵌入的强大视觉先验，将SVG生成视为一个离散的文本序列建模任务，而非一个综合的视觉空间任务。因此，模型在推理部分画布状态和隐式遮挡关系时面临困难，这些关系在视觉上是明确的，但在文本上却模糊不清。为了解决这一问题，我们提出了循环渲染（Render-in-the-Loop），一种将SVG合成重新定义为逐步、视觉上下文感知过程的新生成范式。通过将中间代码状态渲染到一个累积画布中，模型在每一步明确观察到不断变化的视觉上下文，利用即时反馈指导后续生成。然而，我们证明，简单地将这一视觉循环应用于现成模型并不是最佳选择，因为它们无法利用增量的视觉-代码映射。为此，我们首先利用细粒度路径分解构建密集的多步视觉轨迹，然后引入视觉自反馈（Visual Self-Feedback, VSF）训练策略，以使下一个原始生成依赖于中间视觉状态。此外，我们提出了一种渲染与验证（Render-and-Verify, RaV）推理机制，以有效过滤退化和冗余的原始元素。我们的框架在一个多模态基础模型上实例化，在标准MMSVGBench上超越了强大的开放权重基线。这一结果突显了我们循环渲染范式在文本到SVG和图像到SVG任务中的卓越数据效率和泛化能力。


cs.CV / 86 / 2604.20748

Amodal SAM: A Unified Amodal Segmentation Framework with Generalization

无模态SAM：具有泛化能力的统一无模态分割框架

Zhang, Bo, Tian, Zhuotao, Tao, Xin, Tang, Songlin, Yu, Jun, Pei, Wenjie

Abstract

Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.

Chinese Translation

无模态分割是一项具有挑战性的任务，旨在预测物体的完整几何形状，包括其被遮挡的区域。尽管现有方法主要集中在训练领域内的无模态分割，但这些方法往往缺乏有效扩展到新物体类别和未见背景的泛化能力。本文介绍了无模态SAM（Amodal SAM），这是一个统一框架，利用SAM（Segment Anything Model）进行无模态图像和无模态视频分割。无模态SAM保留了SAM强大的泛化能力，同时将其固有能力扩展到无模态分割任务中。改进体现在三个方面：（1）轻量级空间补全适配器，能够实现遮挡区域的重建；（2）目标感知遮挡合成（Target-Aware Occlusion Synthesis, TAOS）管道，通过生成多样的合成训练数据来解决无模态注释稀缺的问题；（3）新的学习目标，强制执行区域一致性和拓扑正则化。大量实验表明，无模态SAM在标准基准测试中实现了最先进的性能，同时在新场景中展现出强大的泛化能力。我们预期这项研究将推动该领域向实用的无模态分割系统发展，使其能够在不受限制的现实环境中有效运行。


cs.CV / 87 / 2604.20760

Exploring High-Order Self-Similarity for Video Understanding

探索视频理解中的高阶自相似性

Kim, Manjin, Kwon, Heeseung, Alahari, Karteek, Cho, Minsu

Abstract

Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.

Chinese Translation

时空自相似性（STSS）能够捕捉帧间的视觉对应关系，为视频理解提供了一种有效的时间动态表示方式。在本研究中，我们探讨了高阶STSS，并展示了不同阶数的STSS如何揭示这些动态的不同方面。接着，我们介绍了多阶自相似性（Multi-Order Self-Similarity, MOSS）模块，这是一种轻量级神经模块，旨在学习和整合多阶STSS特征。该模块可以应用于多种视频任务，以增强运动建模能力，同时仅消耗极少的计算成本和内存使用。我们在视频动作识别、以运动为中心的视频VQA以及现实世界的机器人任务上进行了广泛实验，结果一致显示出显著的改进，验证了MOSS作为通用时间建模模块的广泛适用性。源代码和检查点将公开发布。
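As a rough illustration of what a first-order space-time self-similarity tensor looks like, the sketch below compares each feature vector with its spatial neighbours in the same and following frames using cosine similarity. The function name, the window layout, and the zero-padding choice are assumptions; the abstract does not specify how the higher-order variants or the MOSS module are built.

```python
import numpy as np

def stss(feats, time_offsets=(0, 1), radius=1, eps=1e-8):
    """First-order space-time self-similarity.

    feats: (T, H, W, C) feature maps.
    Returns (T, H, W, L, U, V) cosine similarities, where L = len(time_offsets)
    and U = V = 2 * radius + 1. Neighbours outside the clip or the image
    get similarity 0.
    """
    T, H, W, _ = feats.shape
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    side = 2 * radius + 1
    out = np.zeros((T, H, W, len(time_offsets), side, side))
    # Zero-pad spatially so every shifted window stays in bounds.
    padded = np.pad(unit, ((0, 0), (radius, radius), (radius, radius), (0, 0)))
    for li, dt in enumerate(time_offsets):
        for t in range(T):
            if not 0 <= t + dt < T:
                continue                      # temporal neighbour does not exist
            for u in range(side):
                for v in range(side):
                    # Neighbour at spatial offset (u - radius, v - radius).
                    neighbour = padded[t + dt, u:u + H, v:v + W]
                    out[t, :, :, li, u, v] = (unit[t] * neighbour).sum(axis=-1)
    return out
```

Each position ends up with a small similarity volume over time and space offsets, which encodes where similar content appears in neighbouring frames rather than what that content looks like.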


cs.CV / 88 / 2604.20784

GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction

GeoRect4D：用于动态稀疏视图3D重建的几何兼容生成整流

Wu, Zhenlong, Zheng, Zihan, Wang, Xuanxuan, Wang, Qianhe, Yang, Hua, Zhang, Xiaoyun, Hu, Qiang, Zhang, Wenjun

Abstract

Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative priors to hallucinate missing content, yet naive integration frequently causes structural drift and temporal inconsistency due to the mismatch between stochastic 2D generation and deterministic 3D geometry. In this paper, we propose GeoRect4D, a novel unified framework for sparse-view dynamic reconstruction that couples explicit 3D consistency with generative refinement via a closed-loop optimization process. Specifically, GeoRect4D introduces a degradation-aware feedback mechanism that incorporates a robust anchor-based dynamic 3DGS substrate with a single-step diffusion rectifier to hallucinate high-fidelity details. This rectifier utilizes a structural locking mechanism and spatiotemporal coordinated attention, effectively preserving physical plausibility while restoring missing content. Furthermore, we present a progressive optimization strategy that employs stochastic geometric purification to eliminate floaters and generative distillation to infuse texture details into the explicit representation. Extensive experiments demonstrate that GeoRect4D achieves state-of-the-art performance in reconstruction fidelity, perceptual quality, and spatiotemporal consistency across multiple datasets.

Chinese Translation

从稀疏多视角视频重建动态3D场景是一个高度不适定的问题，常常导致几何崩溃、轨迹漂移和漂浮伪影。近期的尝试引入生成先验来幻觉缺失内容，但简单的整合常常由于随机2D生成与确定性3D几何之间的不匹配，导致结构漂移和时间不一致。在本文中，我们提出了GeoRect4D，一个新颖的统一框架，用于稀疏视图动态重建，该框架通过闭环优化过程将显式3D一致性与生成细化相结合。具体而言，GeoRect4D引入了一种降解感知反馈机制，该机制结合了基于锚点的稳健动态3DGS基底和单步扩散整流器，以幻觉高保真细节。该整流器利用结构锁定机制和时空协调注意力，有效地保持物理合理性，同时恢复缺失内容。此外，我们提出了一种渐进优化策略，采用随机几何净化来消除漂浮物，并通过生成蒸馏将纹理细节注入显式表示。大量实验表明，GeoRect4D在多个数据集上实现了重建保真度、感知质量和时空一致性的最先进性能。


cs.CV / 89 / 2604.20796

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

LLaDA2.0-Uni：通过扩散大语言模型统一多模态理解与生成

AI, Inclusion, Bie, Tiwei, Chen, Haoxing, Chen, Tieyuan, Cheng, Zhenglin, Cui, Long, Gan, Kai, Huang, Zhicheng, Lan, Zhenzhong, Li, Haoquan, Li, Jianguo, Lin, Tao, Qin, Qi, Wang, Hongjun, Wang, Xiaomei, Wu, Haoyuan, Xin, Yi, Zhao, Junbo

Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.

Chinese Translation

我们提出了LLaDA2.0-Uni，这是一种统一的离散扩散大语言模型（dLLM），支持在原生集成框架内进行多模态理解与生成。其架构结合了完全语义的离散分词器、基于MoE的dLLM主干和扩散解码器。通过SigLIP-VQ对连续视觉输入进行离散化，该模型在主干内实现了文本和视觉输入的块级掩蔽扩散，同时解码器将视觉标记重构为高保真图像。通过在主干中的前缀感知优化和解码器中的少步蒸馏，推理效率超越了并行解码。在经过精心策划的大规模数据和定制的多阶段训练流程的支持下，LLaDA2.0-Uni在多模态理解方面与专业的视觉语言模型（VLMs）相匹配，同时在图像生成和编辑方面表现出色。其对交错生成和推理的原生支持为下一代统一基础模型建立了一个有前景且可扩展的范式。代码和模型可在 https://github.com/inclusionAI/LLaDA2.0-Uni 获取。


cs.CV / 90 / 2604.20800

LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

LEXIS：基于图像的3D人机交互的潜在近似交互特征

Antić, Dimitrije, Budria, Alvaro, Paschalidis, George, Dwivedi, Sai Kumar, Tzionas, Dimitrios

Abstract

Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code & models will be public at https://anticdimi.github.io/lexis.

Chinese Translation

从RGB图像重建3D人机交互对于感知系统至关重要。然而，这一任务依然充满挑战，因为它需要捕捉身体与物体之间微妙的物理耦合。当前方法依赖于稀疏的二元接触线索，但这些方法未能建模自然交互中连续的接近性和密集的空间关系。我们通过InterFields来解决这一局限性，该表示法编码了整个身体和物体表面之间的密集、连续接近性。然而，从单一图像推断这些场本质上是一个病态问题。为了解决这个问题，我们的直觉是交互模式在动作和物体几何形状的结构上具有特征性。我们在LEXIS中捕捉到这种结构，LEXIS是一个通过VQ-VAE学习的交互特征的新颖离散流形。随后，我们开发了LEXIS-Flow，一个利用LEXIS特征来估计人类和物体网格及其InterFields的扩散框架。值得注意的是，这些InterFields有助于引导细化，确保物理上合理、关注接近性的重建，而无需后期优化。在Open3DHOI和BEHAVE上的评估表明，LEXIS-Flow在重建、接触和接近性质量方面显著优于现有的最先进基线。我们的方法不仅提高了泛化能力，还产生了被认为更为真实的重建，使我们更接近整体3D场景理解。代码和模型将公开于 https://anticdimi.github.io/lexis。


cs.CV / 91 / 2604.20806

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

OMIBench：大型视觉-语言模型的奥林匹克级多图像推理基准测试

Chen, Qiguang, Luan, Chengyu, Wu, Jiajun, Yu, Qiming, Yang, Yi, Li, Yizhuo, Tong, Jingqi, Feng, Xiachong, Qin, Libo, Che, Wanxiang

Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

Chinese Translation

大型视觉-语言模型（LVLMs）在奥林匹克级推理任务中取得了显著进展。然而，目前针对这些模型的奥林匹克级多模态推理基准往往强调单图像分析，未能充分利用多个图像之间的上下文信息。我们提出了OMIBench，一个旨在评估当所需证据分布在多个图像上时的奥林匹克级推理的基准。该基准包含来自生物学、化学、数学和物理奥林匹克的问题，并提供手动注释的推理过程和针对精确答案匹配与语义答案匹配的评估协议。在对OMIBench进行的广泛实验中，我们观察到现有模型之间存在显著的性能差距。即使是最强大的LVLM，如Gemini-3-Pro，在该基准上的得分也仅约为50%。这些结果使OMIBench成为研究和改进LVLMs中多图像推理的专注资源。


cs.CV / 92 / 2604.20813

Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning

将 TrOCR 适应于印刷的提格利尼亚文本识别：跨脚本迁移学习中的词汇感知损失加权

Medhanie, Yonatan Haile, Ni, Yuanhua

Abstract

Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.

Chinese Translation

基于 Transformer 的 OCR 模型在拉丁文和 CJK 字体上表现出色，但其在非洲音节书写系统中的应用仍然有限。我们首次将 TrOCR 适应于使用 Ge'ez 字母的印刷提格利尼亚文本。从一个预训练模型开始，我们扩展了字节级 BPE 分词器，以覆盖 230 个 Ge'ez 字符，并引入了词汇感知损失加权，以解决在将以拉丁文为中心的 BPE 规范应用于新脚本时出现的系统性词边界失败。未经修改的模型在 Ge'ez 文本上没有产生可用的输出。经过适应后，TrOCR-Printed 变体在 GLOCR 数据集的 5,000 张合成图像的保留测试集上达到了 0.22% 的字符错误率和 97.20% 的精确匹配准确率。消融研究确认，词汇感知损失加权是关键组件，与仅扩展词汇相比，CER 降低了两个数量级。整个流程在单个 8 GB 消费级 GPU 上训练时间不到三小时。所有代码、模型权重和评估脚本均已公开发布。
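The two metrics reported in the abstract, Character Error Rate and exact match accuracy, can be computed with a short sketch like the one below. It assumes the common corpus-level definition of CER (total edit distance divided by total reference characters, counted as Unicode code points); the paper's exact evaluation protocol is not given in the abstract, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Corpus-level CER: summed edit distance over summed reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars

def exact_match(refs, hyps):
    """Fraction of predictions that equal their reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Because a single wrong or missing space counts as one character error but turns the whole line into an exact-match failure, the two numbers can diverge sharply when word boundaries are the main failure mode, which is the problem the abstract attributes to Latin-centric BPE conventions.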


cs.CV / 93 / 2604.20822

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

全球海上风电基础设施：基于密集的 Sentinel-1 时间序列的部署与运营动态

Hoeser, Thorsten, Bachofer, Felix, Kuenzer, Claudia

Abstract

The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with 14,840,637 events in total as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis-ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.

Chinese Translation

海上风能行业正在快速扩展，这增加了对独立、高时间分辨率的全球范围内基础设施部署和运营监测的需求。尽管基于地球观测的海上风电基础设施制图在空间定位方面已经成熟，但现有的开放数据集在建设和运营动态方面缺乏时间密集和语义细致的信息。我们引入了一个全球性的 Sentinel-1 合成孔径雷达（SAR）时间序列数据集，涵盖了2016年第一季度至2025年第一季度的海上风电基础设施的部署和运营阶段。基于更新的目标检测工作流程，我们在检测到的基础设施位置编制了15,606个时间序列，整体生成了14,840,637个事件，作为分析就绪的1D SAR后向散射剖面，每个剖面对应一个 Sentinel-1 的获取和位置。为了便于直接使用和基准测试，我们发布了（i）分析就绪的1D SAR剖面，（ii）由基于规则的分类器生成的事件级基线语义标签，以及（iii）一个专家注释的基准数据集，包含553个时间序列和328,657个事件标签。基线分类器在事件级评估中实现了0.84的宏F1分数，以及0.785的编辑相似性-质量阈值曲线下的面积（AUC），表明时间一致性。我们证明，所得到的数据集支持全球范围内的部署动态分析，识别区域部署模式的差异、船舶交互和运营事件，并为开发和比较海上风电基础设施监测的时间序列分类方法提供了参考。
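
The event-wise macro F1 reported above averages per-class F1 with equal weight, so a rare event class counts as much as a frequent one. The following is a minimal sketch of that standard metric, not the authors' evaluation code; the event labels (`open_sea`, `construction`, `turbine`, `vessel`) are hypothetical stand-ins for the corpus's actual label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical event labels for one short time series (illustration only).
y_true = ["open_sea", "construction", "turbine", "turbine", "turbine", "vessel"]
y_pred = ["open_sea", "turbine", "turbine", "turbine", "turbine", "vessel"]
score = macro_f1(y_true, y_pred)
```

In this toy series the single missed `construction` event contributes a per-class F1 of zero and lowers the macro average noticeably, which is the behavior that makes the metric sensitive to rare phases.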

cs.CV / 94 / 2604.20841

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

DeVI：基于物理的灵巧人机交互通过合成视频模仿

Kim, Hyeonwoo, Kim, Jeonghwan, Cho, Kyungwon, Joo, Hanbyul

Abstract

Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.

Chinese Translation

最近在视频生成模型方面的进展使得能够合成现实的人机交互视频，涵盖广泛的场景和物体类别，包括难以通过动作捕捉系统捕捉的复杂灵巧操作。尽管这些合成视频中蕴含的丰富交互知识对灵巧机器人操作中的运动规划具有很大潜力，但其有限的物理真实感和纯2D特性使其难以直接作为基于物理的角色控制中的模仿目标。我们提出了DeVI（Dexterous Video Imitation），一个新颖的框架，利用文本条件的合成视频来实现与未见目标物体交互的物理上合理的灵巧代理控制。为了克服生成的2D线索的不精确性，我们引入了一种混合跟踪奖励，将3D人类跟踪与稳健的2D物体跟踪相结合。与依赖高质量3D运动演示的方法不同，DeVI仅需生成的视频，从而实现跨多样物体和交互类型的零样本泛化。大量实验表明，DeVI在模仿3D人机交互演示方面优于现有方法，特别是在建模灵巧手-物体交互方面。我们进一步验证了DeVI在多物体场景和文本驱动动作多样性中的有效性，展示了将视频作为人机交互（HOI）感知运动规划者的优势。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.19749

The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?

工具过度使用错觉：为什么大型语言模型更倾向于使用外部工具而非内部知识？

Zeng, Yirong, You, Shen, Liu, Yufei, Du, Qunyao, Ding, Xiao, Hou, Yutai, Wang, Yuxian, Ning, Wu, Song, Haonan, Tu, Dandan, Cai, Bibo, Liu, Ting

Abstract

Equipping LLMs with external tools effectively addresses internal reasoning limitations. However, it introduces a critical yet under-explored phenomenon: tool overuse, the unnecessary use of tools during reasoning. In this paper, we first reveal that this phenomenon is pervasive across diverse LLMs. We then experimentally elucidate its underlying mechanisms through two key lenses: (1) First, by analyzing tool-use behavior across different internal knowledge availability regions, we identify a knowledge epistemic illusion: models misjudge internal knowledge boundaries and fail to accurately perceive their actual knowledge availability. To mitigate this, we propose a knowledge-aware epistemic boundary alignment strategy based on direct preference optimization, which reduces tool usage by 82.8% while yielding an accuracy improvement. (2) Second, we establish a causal link between reward structures and tool-use behavior by visualizing the tool-augmented training process. It reveals that outcome-only rewards inadvertently encourage tool overuse by rewarding only final correctness, regardless of tool efficiency. To verify this, we balance reward signals during training rather than relying on outcome-only rewards, cutting unnecessary tool calls by 66.7% (7B) and 60.7% (32B) without sacrificing accuracy. Finally, we provide theoretical justification for both lenses to understand tool overuse.

Chinese Translation

为大型语言模型（LLMs）配备外部工具有效地解决了内部推理的局限性。然而，这也引入了一个关键但尚未深入探讨的现象：工具过度使用，即在推理过程中不必要的工具使用。本文首先揭示这一现象在多种LLMs中普遍存在。然后，我们通过两个关键视角实验性地阐明其潜在机制：（1）首先，通过分析不同内部知识可用性区域的工具使用行为，我们识别出一种“知识认知错觉”：模型错误判断内部知识的边界，未能准确感知其实际知识可用性。为此，我们提出了一种基于直接偏好优化的知识感知认知边界对齐策略，该策略将工具使用减少了82.8%，同时提高了准确性。（2）其次，我们通过可视化工具增强的训练过程建立了奖励结构与工具使用行为之间的因果联系。结果显示，“仅基于结果的奖励”无意中通过仅奖励最终的正确性而鼓励工具过度使用，而不考虑工具的效率。为了验证这一点，我们在训练过程中平衡奖励信号，而不是依赖仅基于结果的奖励，从而在不牺牲准确性的情况下将不必要的工具调用减少了66.7%（7B）和60.7%（32B）。最后，我们在这两个视角中提供了理论依据，以理解工具过度使用。
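
The contrast between outcome-only and balanced rewards can be illustrated with a toy reward function. The abstract does not specify how the reward signals are balanced, so this sketch assumes a simple linear penalty per surplus tool call; `balanced_reward`, `needed_calls`, and the `penalty` value are hypothetical and not taken from the paper.

```python
def outcome_only_reward(correct):
    """Reward final correctness only; tool calls are free."""
    return 1.0 if correct else 0.0


def balanced_reward(correct, tool_calls, needed_calls, penalty=0.2):
    """Correctness minus a penalty for each tool call beyond what was needed.

    The linear penalty is an assumption made for illustration.
    """
    extra_calls = max(0, tool_calls - needed_calls)
    return outcome_only_reward(correct) - penalty * extra_calls


# A correct answer reached with three tool calls when none were needed.
r_outcome = outcome_only_reward(correct=True)
r_balanced = balanced_reward(correct=True, tool_calls=3, needed_calls=0)
```

Under the outcome-only reward the wasteful trajectory scores the same as an efficient one, whereas the penalized variant ranks it lower, which is the incentive difference the abstract describes.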

cs.AI / 2 / 2604.19751

AI to Learn 2.0: A Deliverable-Oriented Governance Framework and Maturity Rubric for Opaque AI in Learning-Intensive Domains

AI学习2.0：面向交付物的治理框架和不透明AI在学习密集型领域的成熟度量规

Shintani, Seine A.

Abstract

Generative AI is entering research, education, and professional work faster than current governance frameworks can specify how AI-assisted outputs should be judged in learning-intensive settings. The central problem is proxy failure: a polished artifact can be useful while no longer serving as credible evidence of the human understanding, judgment, or transfer ability that the work is supposed to cultivate or certify. This paper proposes AI to Learn 2.0, a deliverable-oriented governance framework for AI-assisted work. Rather than claiming element-wise novelty, it reorganizes adjacent ideas around the final deliverable package, distinguishes artifact residual from capability residual, and operationalizes the result through a five-part package, a seven-dimension maturity rubric, gate thresholds on critical dimensions, and a companion capability-evidence ladder. AI to Learn 2.0 allows opaque AI during exploration, drafting, hypothesis generation, and workflow design, but requires that the released deliverable be usable, auditable, transferable, and justifiable without the original large language model or cloud API. In learning-intensive contexts, it additionally requires context-appropriate human-attributable evidence of explanation or transfer. Worked scoring across contrastive cases, including coursework substitution, a symbolic-regression governance contrast, teacher-audited national-exam practice forms, and a self-hosted lecture-to-quiz pipeline with deterministic quality control, shows how the framework separates polished substitution workflows from bounded, auditable, and handoff-ready AI-assisted workflows. AI to Learn 2.0 is proposed as a governance instrument for structured third-party review where capability preservation, accountability, and validity boundaries matter.

Chinese Translation

生成性AI正在以超过当前治理框架所能规定的速度进入研究、教育和专业工作，尤其是在学习密集型环境中，如何评判AI辅助输出的有效性成为核心问题。主要问题在于代理失败：一个经过打磨的工件可能是有用的，但不再作为人类理解、判断或转移能力的可信证据，这些能力是该工作所应培养或认证的。本文提出了AI学习2.0，一个面向交付物的AI辅助工作治理框架。该框架并不声称逐个元素的新颖性，而是围绕最终交付包重新组织相邻的思想，区分工件残留与能力残留，并通过五部分包、七维成熟度量规、关键维度的门槛阈值以及伴随的能力-证据阶梯来实现结果。AI学习2.0允许在探索、草拟、假设生成和工作流程设计中使用不透明AI，但要求发布的交付物在没有原始大型语言模型或云API的情况下可用、可审计、可转移和可辩护。在学习密集型环境中，它还要求提供适合上下文的人类可归因的解释或转移的证据。通过对比案例的评分工作，包括课程替代、符号回归治理对比、教师审核的国家考试练习表，以及一个自托管的讲座到测验管道与确定性质量控制，展示了该框架如何将抛光的替代工作流程与有限的、可审计的、可交接的AI辅助工作流程区分开来。AI学习2.0被提议作为一个治理工具，用于结构化的第三方审查，在能力保留、问责制和有效性边界至关重要的情况下。

cs.AI / 3 / 2604.19753

Algorithm Selection with Zero Domain Knowledge via Text Embeddings

通过文本嵌入实现零领域知识的算法选择

Szeider, Stefan

Abstract

We propose a feature-free approach to algorithm selection that replaces hand-crafted instance features with pretrained text embeddings. Our method, ZeroFolio, proceeds in three steps: it reads the raw instance file as plain text, embeds it with a pretrained embedding model, and selects an algorithm via weighted k-nearest neighbors. The key to our approach is the observation that pretrained embeddings produce representations that distinguish problem instances without any domain knowledge or task-specific training. This allows us to apply the same three-step pipeline (serialize, embed, select) across diverse problem domains with text-based instance formats. We evaluate our approach on 11 ASlib scenarios spanning 7 domains (SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems). Our experiments show that this approach outperforms a random forest trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 with two-seed voting; the margin is often substantial. Our ablation study shows that inverse-distance weighting, line shuffling, and Manhattan distance are the key design choices. On scenarios where both selectors are competitive, combining embeddings with hand-crafted features via soft voting yields further improvements.

Chinese Translation

我们提出了一种无特征的方法来进行算法选择，该方法用预训练的文本嵌入替代了手工制作的实例特征。我们的方法ZeroFolio分为三个步骤：首先将原始实例文件作为纯文本读取，其次使用预训练的嵌入模型对其进行嵌入，最后通过加权k近邻选择算法。我们方法的关键在于观察到，预训练的嵌入能够生成区分问题实例的表示，而无需任何领域知识或特定任务的训练。这使我们能够在具有基于文本的实例格式的多种问题领域中应用相同的三步流程（序列化、嵌入、选择）。我们在涵盖7个领域（SAT、MaxSAT、QBF、ASP、CSP、MIP和图问题）的11个ASlib场景上评估了我们的方法。实验结果表明，在11个场景中的10个中，该方法的表现优于基于手工特征训练的随机森林，并且在所有11个场景中通过双种子投票也表现良好；其优势往往是显著的。我们的消融研究表明，逆距离加权、行洗牌和曼哈顿距离是关键设计选择。在两个选择器均具竞争力的场景中，通过软投票将嵌入与手工特征结合可进一步提升性能。
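
The select step of the serialize, embed, select pipeline can be sketched as inverse-distance-weighted k-nearest neighbors under Manhattan distance, the two design choices the ablation highlights. This is a minimal sketch rather than the ZeroFolio implementation: the two-dimensional vectors stand in for pretrained text embeddings of instance files, and `select_algorithm`, `k`, `eps`, and the solver names are hypothetical.

```python
from collections import defaultdict


def manhattan(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_algorithm(query, train, k=3, eps=1e-9):
    """Pick an algorithm by inverse-distance-weighted k-NN voting.

    train: list of (embedding, best_algorithm) pairs.
    """
    # Keep the k training instances closest to the query.
    nearest = sorted(train, key=lambda item: manhattan(query, item[0]))[:k]
    votes = defaultdict(float)
    for emb, algo in nearest:
        # Closer neighbors get larger weights; eps avoids division by zero.
        votes[algo] += 1.0 / (manhattan(query, emb) + eps)
    return max(votes, key=votes.get)


# Toy stand-ins for embeddings of serialized instance files (assumption).
train = [
    ([0.0, 0.1], "solver_a"),
    ([0.1, 0.0], "solver_a"),
    ([1.0, 1.0], "solver_b"),
    ([0.9, 1.1], "solver_b"),
]
choice = select_algorithm([0.2, 0.1], train, k=3)
```

Because the weights are inverse distances, one distant neighbor among the k closest cannot outvote two nearby ones, which is why the query near the `solver_a` cluster resolves to `solver_a` here.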

cs.AI / 4 / 2604.19754

Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom

探索数据增强和重采样策略以应对NGSS课堂中基于 Transformer 模型的科学解释AI评分中的类别不平衡

Djagba, Prudence, Haudek, Kevin, Franovic, Clare G. C., Kaldaras, Leonora

Abstract

Automated scoring of students' scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories, particularly those capturing advanced reasoning, remains a challenge. This study investigates augmentation strategies to improve transformer-based text classification of student responses to a physical science assessment based on an NGSS-aligned learning progression. The dataset consists of 1,466 high school responses scored on 11 binary-coded analytic categories. This rubric identifies six important components, including scientific ideas needed for a complete explanation, along with five common incomplete or inaccurate ideas. Using SciBERT as a baseline, we applied fine-tuning and tested these augmentation strategies: (1) GPT-4-generated synthetic responses, (2) EASE, a word-level extraction and filtering approach, and (3) ALP (Augmentation using Lexicalized Probabilistic context-free grammar) phrase-level extraction. While fine-tuning SciBERT improved recall over baseline, augmentation substantially enhanced performance, with GPT data boosting both precision and recall, and ALP achieving perfect precision, recall, and F1 scores across the most severely imbalanced categories (5, 6, 7, and 9). Across all rubric categories, EASE augmentation substantially increased alignment with human scoring for both scientific ideas (Categories 1-6) and inaccurate ideas (Categories 7-11). We compared different augmentation strategies to a traditional oversampling method (SMOTE) in an effort to avoid overfitting and retain novice-level data critical for learning progression alignment. Findings demonstrate that targeted augmentation can address severe imbalance while preserving conceptual coverage, offering a scalable solution for automated learning progression-aligned scoring in science education.

Chinese Translation

学生科学解释的自动评分提供了即时、准确反馈的潜力，但评分标准类别中的类别不平衡，尤其是捕捉高级推理的类别，仍然是一个挑战。本研究探讨了增强策略，以改善基于 Transformer 的文本分类，分类对象是学生对一项物理科学评估的作答，该评估基于与NGSS（下一代科学标准）对齐的学习进程。数据集包含1,466份高中生作答，按11个二元编码的分析类别进行评分。该评分标准识别出六个重要组成部分，包括完整解释所需的科学思想，以及五个常见的不完整或不准确的思想。以SciBERT作为基线，我们应用了微调并测试了以下增强策略：（1）GPT-4生成的合成作答，（2）EASE，一种基于词级提取和过滤的方法，以及（3）ALP（使用词汇化概率上下文无关文法的增强）短语级提取。虽然微调SciBERT提高了基线的召回率，但增强显著提升了性能，其中GPT数据同时提高了精度和召回率，而ALP在最严重不平衡的类别（5、6、7和9）中实现了完美的精度、召回率和F1分数。在所有评分标准类别中，EASE增强显著提高了与人类评分的一致性，涵盖了科学思想（类别1-6）和不准确思想（类别7-11）。我们将不同的增强策略与传统的过采样方法（SMOTE）进行了比较，以避免过拟合并保留对学习进程对齐至关重要的新手级数据。研究结果表明，针对性的增强可以解决严重的不平衡，同时保持概念覆盖，为科学教育中与学习进程对齐的自动评分提供了一种可扩展的解决方案。

cs.AI / 5 / 2604.19755

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

可解释的反洗钱（AML）分流：证据检索与反事实检查

Torres, Dorothy, Cheng, Wei, Hu, Ke

Abstract

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.

Chinese Translation

反洗钱（AML）交易监控会产生大量警报，调查人员必须在严格的审计和治理约束下迅速进行分流。尽管大型语言模型（LLMs）能够总结异构证据并起草理由，但在受监管的工作流程中，无限制生成存在风险，主要由于幻觉、弱来源和与基本决策不一致的解释。我们提出了一种可解释的AML分流框架，将分流视为一个受证据约束的决策过程。我们的方法结合了（i）基于政策/类型指导、客户背景、警报触发和交易子图的检索增强证据打包，（ii）一个结构化的LLM输出合同，要求明确引用并将支持证据与矛盾或缺失的证据分开，以及（iii）反事实检查，验证最小的、合理的扰动是否导致分流建议及其理由的连贯变化。我们在公共合成AML基准和模拟器上进行评估，并与规则、表格和图形机器学习基线，以及仅使用LLM/RAG的变体进行比较。结果表明，证据基础显著提高了可审计性，并减少了数值和政策幻觉错误，而反事实验证进一步增强了与决策相关的可解释性和稳健性，最终实现了最佳的整体分流性能（PR-AUC 0.75；升级F1 0.62）以及强大的来源和可信度指标（引用有效性0.98；证据支持0.88；反事实可信度0.76）。这些发现表明，受治理的、可验证的LLM系统能够为AML分流提供实用的决策支持，而不牺牲可追溯性和可辩护性的合规要求。
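
A structured output contract with explicit citations makes provenance checkable by code. The sketch below assumes one plausible shape for such a contract and one plausible definition of citation validity (the share of cited evidence IDs that exist in the retrieved bundle); the abstract does not define either, so `TriageOutput`, its field names, the ID format, and `citation_validity` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TriageOutput:
    """Hypothetical output contract; field names are illustrative."""
    recommendation: str                                # e.g. "escalate" or "close"
    supporting: list = field(default_factory=list)     # cited evidence ids
    contradicting: list = field(default_factory=list)  # cited evidence ids
    missing: list = field(default_factory=list)        # free-text evidence gaps


def citation_validity(output, evidence_bundle):
    """Fraction of cited ids that exist in the retrieved evidence bundle."""
    cited = output.supporting + output.contradicting
    if not cited:
        return 0.0  # an output with no citations cannot be verified
    valid = sum(1 for cid in cited if cid in evidence_bundle)
    return valid / len(cited)


# Toy evidence bundle keyed by made-up ids.
bundle = {
    "policy:7": "structuring typology guidance",
    "txn:42": "cash deposit just under the reporting threshold",
    "alert:3": "velocity rule trigger",
}
out = TriageOutput(
    recommendation="escalate",
    supporting=["policy:7", "txn:42"],
    contradicting=["kyc:1"],  # not in the bundle, so counted as invalid
)
score = citation_validity(out, bundle)
```

Separating `supporting`, `contradicting`, and `missing` in the contract means a reviewer, or a check like the one above, can audit each claim against the bundle without parsing free text.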

cs.AI / 6 / 2604.19758

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

ThermoQA：评估大型语言模型热力学推理的三层基准

Düzkar, Kemal

Abstract

We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa

Chinese Translation

我们提出了ThermoQA，这是一个包含293个开放式工程热力学问题的基准，分为三个层次：属性查找（110个问题）、组件分析（101个问题）和完整循环分析（82个问题）。基准的真实答案通过CoolProp 7.2.0程序计算得出，涵盖水、R-134a和可变比热空气。对六个前沿大型语言模型（LLMs）进行了评估，每个模型进行了三次独立测试。综合排行榜的领先者是Claude Opus 4.6（94.1%）、GPT-5.4（93.1%）和Gemini 3.1 Pro（92.5%）。跨层次的性能下降范围从2.8个百分点（Opus）到32.5个百分点（MiniMax），确认了属性记忆并不意味着具备热力学推理能力。超临界水、R-134a制冷剂和联合循环燃气轮机分析作为自然区分器，表现出40-60个百分点的性能差距。多次运行的标准差范围从+/-0.1%到+/-2.5%，量化了推理一致性作为一个独特的评估维度。数据集和代码已在https://huggingface.co/datasets/olivenet/thermoqa上开源。

cs.AI / 7 / 2604.19759

Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM

临床试验叙述中剂量错误的自动检测：基于多模态特征工程的 LightGBM 方法

AL-Smadi, Mohammad

Abstract

Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6-v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample), ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble averaging (cross-validation: 0.8833 ± 0.0091 AUC). Systematic ablation studies reveal that removing sentence embeddings causes the largest performance degradation (2.39%), demonstrating their critical role despite contributing only 37.07% of total feature importance. Feature efficiency analysis demonstrates that selecting the top 500-1000 features yields optimal performance (0.886-0.887 AUC), outperforming the full 3,451-feature set (0.879 AUC) through effective noise reduction. Our findings highlight the importance of feature selection as a regularization technique and demonstrate that sparse lexical features remain complementary to dense representations for specialized clinical text classification under severe class imbalance.

Chinese Translation

临床试验要求严格遵循用药协议，但剂量错误仍然是影响患者安全和试验完整性的持续挑战。我们提出了一种自动化系统，用于检测非结构化临床试验叙述中的剂量错误，采用梯度提升和全面的多模态特征工程。我们的方法结合了3,451个特征，涵盖传统自然语言处理（NLP）（TF-IDF、字符 n-grams）、密集语义嵌入（all-MiniLM-L6-v2）、特定领域的医学模式以及基于 Transformer 的评分（BiomedBERT、DeBERTa-v3），用于训练 LightGBM 模型。特征从九个互补的文本字段中提取（每个样本中位数5,400个字符），确保覆盖所有42,112个临床试验叙述。在具有严重类别不平衡的 CT-DEB 基准数据集上（阳性率为4.9%），我们通过5折集成平均（交叉验证：0.8833 ± 0.0091 AUC）实现了0.8725的测试 ROC-AUC。系统的消融研究表明，去除句子嵌入会导致性能下降最大（2.39%），尽管其仅占总特征重要性的37.07%，但仍显示出其关键作用。特征效率分析表明，选择前500-1000个特征可以获得最佳性能（0.886-0.887 AUC），优于完整的3,451特征集（0.879 AUC），通过有效的噪声减少实现。我们的研究结果强调了特征选择作为正则化技术的重要性，并证明在严重类别不平衡的情况下，稀疏词汇特征仍然与密集表示互补，适用于专业临床文本分类。

cs.AI / 8 / 2604.19760

Inference Headroom Ratio: A Diagnostic and Control Framework for Inference Stability Under Constraint

推理余量比：一种用于约束下推理稳定性的诊断与控制框架

Reinertsen, Robert

Abstract

We present a simulation-based evaluation of the Inference Headroom Ratio (IHR), a dimensionless diagnostic quantity for characterizing inference stability in constrained decision systems. IHR formalizes the relationship between a system's effective inferential capacity C and the combined uncertainty and constraint load U + K imposed by its operating environment, and is intended to capture proximity to an inference stability boundary rather than output-level performance. Across three controlled experiments, we show that IHR functions as: (1) a quantifiable risk indicator whose relationship to collapse probability follows a well-fitted logistic curve with estimated critical threshold IHR* approx. 1.19, (2) a sensitive indicator of proximity to the inference stability boundary under environmental noise, and (3) a viable control variable whose active regulation reduces system collapse rate from 79.4% to 58.7% and IHR variance by 70.4% across 300 Monte Carlo runs. These results position IHR as a prospective, system-level complement to standard performance, drift, and uncertainty metrics, enabling estimation of remaining inferential margin before overt failure in AI systems operating under distributional shift and constraint.

Chinese Translation

我们提出了一种基于模拟的推理余量比（Inference Headroom Ratio, IHR）评估方法，这是一种无量纲的诊断量，用于表征受限决策系统中的推理稳定性。IHR形式化了系统有效推理能力C与其操作环境所施加的综合不确定性和约束负荷U + K之间的关系，旨在捕捉推理稳定边界的接近程度，而非输出水平的性能。在三项受控实验中，我们展示了IHR的功能：(1) 作为一个可量化的风险指标，其与崩溃概率的关系遵循一个拟合良好的逻辑曲线，估计的临界阈值为IHR* 约为1.19；(2) 在环境噪声下，作为接近推理稳定边界的敏感指标；(3) 作为一个可行的控制变量，其主动调节将系统崩溃率从79.4%降低到58.7%，并在300次蒙特卡洛运行中将IHR方差降低了70.4%。这些结果将IHR定位为标准性能、漂移和不确定性指标的潜在系统级补充，能够在面临分布变化和约束的AI系统中估计剩余的推理余量，以避免明显的失败。
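
The relationship between IHR and collapse probability can be sketched numerically. The abstract does not state the formula, so this sketch assumes the ratio form IHR = C / (U + K) suggested by the name, and a logistic curve centered on the reported threshold IHR* ≈ 1.19 that decreases as headroom grows. The `slope` value is a made-up steepness, not a fitted parameter from the paper.

```python
import math

IHR_CRITICAL = 1.19  # critical threshold reported in the abstract


def inference_headroom_ratio(capacity, uncertainty, constraint):
    """Assumed ratio form: capacity C over combined load U + K."""
    load = uncertainty + constraint
    if load <= 0:
        raise ValueError("combined load U + K must be positive")
    return capacity / load


def collapse_probability(ihr, slope=8.0):
    """Logistic curve: 0.5 at the threshold, falling as headroom grows.

    slope is a hypothetical steepness chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(slope * (ihr - IHR_CRITICAL)))


ihr = inference_headroom_ratio(capacity=3.0, uncertainty=1.0, constraint=1.0)
p = collapse_probability(ihr)
```

Under these assumptions, a controller that regulates IHR would act when the ratio drifts toward the threshold from above, before the collapse probability passes one half.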

cs.AI / 9 / 2604.19761

EvoForest: A Novel Machine-Learning Paradigm via Open-Ended Evolution of Computational Graphs

EvoForest：通过计算图的开放式演化实现的新型机器学习范式

Yuksel, Kamer Ali, Sawaf, Hassan

Abstract

Modern machine learning is still largely organized around a single recipe: choose a parameterized model family and optimize its weights. Although highly successful, this paradigm is too narrow for many structured prediction problems, where the main bottleneck is not parameter fitting but discovering what should be computed from the data. Success often depends on identifying the right transformations, statistics, invariances, interaction structures, temporal summaries, gates, or nonlinear compositions, especially when objectives are non-differentiable, evaluation is cross-validation-based, interpretability matters, or continual adaptation is required. We present EvoForest, a hybrid neuro-symbolic system for end-to-end open-ended evolution of computation. Rather than merely generating features, EvoForest jointly evolves reusable computational structure, callable function families, and trainable low-dimensional continuous components inside a shared directed acyclic graph. Intermediate nodes store alternative implementations, callable nodes encode reusable transformation families such as projections, gates, and activations, output nodes define candidate predictive computations, and persistent global parameters can be refined by gradient descent. For each graph configuration, EvoForest evaluates the discovered computation and uses a lightweight Ridge-based readout to score the resulting representation against a non-differentiable cross-validation target. The evaluator also produces structured feedback that guides future LLM-driven mutations. In the 2025 ADIA Lab Structural Break Challenge, EvoForest reached 94.13% ROC-AUC after 600 evolution steps, exceeding the publicly reported winning score of 90.14% under the same evaluation protocol.

Chinese Translation

现代机器学习仍然主要围绕一个单一的方案组织：选择一个参数化模型家族并优化其权重。尽管这一范式取得了很大的成功，但对于许多结构化预测问题来说，它过于狭隘，主要瓶颈不在于参数拟合，而在于发现应该从数据中计算什么。成功往往依赖于识别正确的变换、统计量、不变性、交互结构、时间摘要、门控或非线性组合，尤其是在目标不可微分、评估基于交叉验证、可解释性重要或需要持续适应的情况下。我们提出了EvoForest，一种用于端到端开放式计算演化的混合神经符号系统。EvoForest不仅仅是生成特征，而是共同演化可重用的计算结构、可调用的函数家族和可训练的低维连续组件，这些都位于一个共享的有向无环图中。中间节点存储替代实现，可调用节点编码可重用的变换家族，如投影、门控和激活，输出节点定义候选预测计算，而持久的全局参数可以通过梯度下降进行优化。对于每个图配置，EvoForest评估发现的计算，并使用轻量级的基于岭回归的读取方法对结果表示进行评分，以满足不可微分的交叉验证目标。评估器还生成结构化反馈，以指导未来基于LLM的变异。在2025年ADIA实验室结构性突破挑战中，EvoForest在600次演化步骤后达到了94.13%的ROC-AUC，超过了在相同评估协议下公开报告的获胜分数90.14%。

cs.AI / 10 / 2604.19775

From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

从行动到理解：大语言模型代理中时间概念的符合性可解释性

Padhi, Trilok, Kaur, Ramneet, Agarwal, Krishiv, Cobb, Adam D., Elenius, Daniel, Acharya, Manoj, Samplawski, Colin, Berenbeim, Alexander M., Bastian, Nathaniel D., Jha, Susmit, Roy, Anirban

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi-step reasoning and decision-making tasks, the internal mechanisms guiding their sequential behavior remain opaque. This paper presents a framework for interpreting the temporal evolution of concepts in LLM agents through a step-wise conformal lens. We introduce the conformal interpretability framework for temporal tasks, which combines step-wise reward modeling with conformal prediction to statistically label the model's internal representation at each step as successful or failing. Linear probes are then trained on these representations to identify directions of temporal concepts - latent directions in the model's activation space that correspond to consistent notions of success, failure, or reasoning drift. Experimental results on two simulated interactive environments, namely ScienceWorld and AlfWorld, demonstrate that these temporal concepts are linearly separable, revealing interpretable structures aligned with task success. We further show preliminary results on improving an LLM agent's performance by leveraging the proposed framework to steer along the identified successful directions inside the model. The proposed approach thus offers a principled method for early failure detection as well as intervention in LLM-based agents, paving the way towards trustworthy autonomous language models in complex interactive settings.

Chinese Translation

大型语言模型（LLMs）越来越多地被部署为能够在交互环境中进行推理、规划和行动的自主代理。尽管它们在执行多步骤推理和决策任务方面的能力不断增强，但指导其顺序行为的内部机制仍然不透明。本文提出了一种通过逐步符合性视角解释LLM代理中概念时间演变的框架。我们引入了用于时间任务的符合性可解释性框架，该框架将逐步奖励建模与符合性预测相结合，以统计性地标记模型在每一步的内部表示为成功或失败。然后，在这些表示上训练线性探针，以识别时间概念的方向——模型激活空间中与成功、失败或推理漂移的一致概念相对应的潜在方向。在两个模拟的交互环境（即ScienceWorld和AlfWorld）上的实验结果表明，这些时间概念是线性可分的，揭示了与任务成功相一致的可解释结构。我们进一步展示了通过利用所提出的框架引导模型内部识别的成功方向来提高LLM代理性能的初步结果。因此，所提出的方法为早期失败检测及干预LLM基础的代理提供了一种原则性的方法，为在复杂交互环境中实现可信的自主语言模型铺平了道路。
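
The step-labeling idea, combining a step-wise reward model with conformal prediction, can be sketched with standard split conformal calibration. This is an illustration under stated assumptions, not the paper's procedure: the nonconformity score is taken to be `1 - reward`, the calibration set is assumed to come from steps of known-successful runs, and all reward values below are invented.

```python
import math


def conformal_threshold(cal_scores, alpha=0.25):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]


def label_steps(step_rewards, threshold):
    """Label each step by whether 1 - reward stays within the threshold."""
    return ["success" if 1.0 - r <= threshold else "failing"
            for r in step_rewards]


# Hypothetical reward-model outputs for steps of known-successful runs.
cal_rewards = [0.9, 0.8, 0.85, 0.7, 0.95, 0.75, 0.6]
cal_scores = [1.0 - r for r in cal_rewards]
q = conformal_threshold(cal_scores, alpha=0.25)
labels = label_steps([0.9, 0.4, 0.72], q)
```

Labels produced this way could then serve as targets for the linear probes the abstract describes; the probe training itself is outside this sketch.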

cs.AI / 11 / 2604.19788

Using Learning Theories to Evolve Human-Centered XAI: Future Perspectives and Challenges

利用学习理论推动以人为本的可解释人工智能：未来展望与挑战

Cortinas-Lorenzo, Karina, Doherty, Gavin

Abstract

As Artificial Intelligence (AI) systems continue to grow in size and complexity, so does the difficulty of the quest for AI transparency. In a world of large models and complex AI systems, why do we explain AI and what should we explain? While explanations serve multiple functions, in the face of complexity humans have used and continue to use explanations to foster learning. In this position paper, we discuss how learning theories can be infused into the XAI lifecycle, as well as the key opportunities and challenges when adopting a learner-centered approach to assess, design, and evaluate AI explanations. Building on past work, we argue that a learner-centered approach to Explainable AI (XAI) can enhance human agency and ease XAI risk mitigation, helping evolve the practice of human-centered XAI.

Chinese Translation

随着人工智能（AI）系统规模和复杂性的不断增长，追求AI透明度的难度也随之加大。在一个大型模型和复杂AI系统的世界中，我们为何要解释AI，应该解释什么？虽然解释具有多重功能，但在面对复杂性时，人类一直在使用并继续使用解释来促进学习。在这篇立场论文中，我们讨论了如何将学习理论融入可解释人工智能（XAI）生命周期，以及在采用以学习者为中心的方法来评估、设计和评估AI解释时所面临的关键机遇与挑战。基于以往的研究，我们认为，以学习者为中心的可解释人工智能（XAI）方法可以增强人类的自主性，并减轻XAI风险，从而推动以人为本的可解释人工智能实践的发展。

cs.AI / 12 / 2604.19789

From Data to Theory: Autonomous Large Language Model Agents for Materials Science

从数据到理论：用于材料科学的自主大型语言模型代理

Alfred, Samuel Onimpa, Sundararaghavan, Veera

Abstract

We present an autonomous large language model (LLM) agent for end-to-end, data-driven materials theory development. The model can choose an equation form, generate and run its own code, and test how well the theory matches the data without human intervention. The framework combines step-by-step reasoning with expert-supplied tools, allowing the agent to adjust its approach as needed while keeping a clear record of its decisions. For well-established materials relationships such as the Hall-Petch equation and Paris law, the agent correctly identifies the governing equation and makes reliable predictions on new datasets. For more specialized relationships, such as Kuhn's equation for the HOMO-LUMO gap of conjugated molecules as a function of length, performance depends more strongly on the underlying model, with GPT-5 showing better recovery of the correct equation. Beyond known theories, the agent can also suggest new predictive relationships, illustrated here by a strain-dependent law for changes in the HOMO-LUMO gap. At the same time, the results show that careful validation remains essential, because the agent can still return incorrect, incomplete, or inconsistent equations even when the numerical fit appears strong. Overall, these results highlight both the promise and the current limitations of autonomous LLM agents for AI-assisted scientific modeling and discovery.

Chinese Translation

我们提出了一种自主大型语言模型（LLM）代理，用于端到端的数据驱动材料理论开发。该模型能够选择方程形式，生成并运行自己的代码，并在没有人工干预的情况下测试理论与数据的匹配程度。该框架结合了逐步推理与专家提供的工具，使代理能够根据需要调整其方法，同时保持清晰的决策记录。对于诸如霍尔-佩奇方程（Hall-Petch equation）和巴黎定律（Paris law）等成熟的材料关系，代理能够正确识别主导方程，并对新数据集做出可靠的预测。对于更专业的关系，例如库恩方程（Kuhn's equation）描述的共轭分子长度与HOMO-LUMO间隙的关系，性能则更依赖于基础模型，其中GPT-5在正确方程的恢复上表现更佳。除了已知理论外，代理还可以建议新的预测关系，这里通过一个应变依赖的定律来说明HOMO-LUMO间隙的变化。同时，结果表明，仔细的验证仍然至关重要，因为即使数值拟合看似强，代理仍可能返回不正确、不完整或不一致的方程。总体而言，这些结果突显了自主LLM代理在AI辅助科学建模和发现中的潜力与当前局限性。
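
For a relationship such as the Hall-Petch equation, testing how well the theory matches the data reduces to a linear fit, because the yield strength σ_y = σ_0 + k_y / √d is linear in d^(-1/2). The sketch below shows that fitting step on synthetic data. It is not the agent's generated code, and the data, units, and the function name `fit_hall_petch` are illustrative assumptions.

```python
import math


def fit_hall_petch(grain_sizes, yield_strengths):
    """Least-squares fit of sigma_y = sigma_0 + k_y / sqrt(d)."""
    # Regress yield strength on x = d^(-1/2).
    x = [1.0 / math.sqrt(d) for d in grain_sizes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(yield_strengths) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(x, yield_strengths))
    k_y = sxy / sxx                   # slope: strengthening coefficient
    sigma_0 = mean_y - k_y * mean_x   # intercept: friction stress
    return sigma_0, k_y


# Synthetic, noise-free data generated from sigma_0 = 100 and k_y = 50
# in arbitrary consistent units (illustration only).
d = [1.0, 4.0, 16.0, 25.0, 100.0]
sigma = [100.0 + 50.0 / math.sqrt(di) for di in d]
sigma_0, k_y = fit_hall_petch(d, sigma)
```

A strong fit of this kind does not by itself confirm that the chosen functional form is right, which matches the abstract's caution that validation remains essential even when the numerical fit appears strong.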

cs.AI / 13 / 2604.19790

Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

大型语言模型中的隐性可靠性风险：精度引发的输出不一致的系统识别

Wang, Yifei, Li, Tianlin, Zhang, Xiaohan, Zhang, Xiaoyu, Ma, Wei, Cheng, Mingfei, Pan, Li

Abstract

Large language models (LLMs) are increasingly deployed under diverse numerical precision configurations, including standard floating-point formats (e.g., bfloat16 and float16) and quantized integer formats (e.g., int16 and int8), to meet efficiency and resource constraints. However, minor inconsistencies between LLMs of different precisions are difficult to detect and are often overlooked by existing evaluation methods. In this paper, we present PrecisionDiff, an automated differential testing framework for systematically detecting precision-induced behavioral disagreements in LLMs. PrecisionDiff generates precision-sensitive test inputs and performs cross-precision comparative analysis to uncover subtle divergences that remain hidden under conventional testing strategies. To demonstrate its practical significance, we instantiate PrecisionDiff on the alignment verification task, where precision-induced disagreements manifest as jailbreak divergence: inputs that are rejected under one precision may produce harmful responses under another. Experimental results show that such behavioral disagreements are widespread across multiple open-source aligned LLMs and precision settings, and that PrecisionDiff significantly outperforms vanilla testing methods in detecting these issues. Our work enables automated precision-sensitive test generation, facilitating effective pre-deployment evaluation and improving precision robustness during training.

Chinese Translation

大型语言模型（LLMs）在多种数值精度配置下越来越多地被部署，包括标准浮点格式（例如，bfloat16 和 float16）以及量化整数格式（例如，int16 和 int8），以满足效率和资源限制。然而，不同精度的 LLM 之间的细微不一致性难以检测，且常常被现有评估方法忽视。本文提出了 PrecisionDiff，一种自动化差异测试框架，用于系统地检测 LLM 中由精度引发的行为不一致。PrecisionDiff 生成对精度敏感的测试输入，并进行跨精度比较分析，以揭示在传统测试策略下隐藏的微妙差异。为了展示其实际意义，我们在对齐验证任务上实例化了 PrecisionDiff，其中精度引发的不一致表现为越狱分歧——在一种精度下被拒绝的输入可能在另一种精度下产生有害响应。实验结果表明，这种行为不一致在多个开源对齐 LLM 和精度设置中普遍存在，且 PrecisionDiff 在检测这些问题方面显著优于传统测试方法。我们的工作实现了自动化的对精度敏感的测试生成，促进了有效的预部署评估，并提高了训练过程中的精度鲁棒性。
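
The cross-precision comparison at the core of differential testing can be shown on a toy decision rule. This sketch does not reproduce PrecisionDiff's input generation or any real model; it only shows how rounding to float16 can flip a near-tied decision. The two-logit "refuse vs. comply" rule and the logit values are hypothetical, and half-precision rounding is emulated with the standard library's `struct` format code `"e"`.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def decide(logits):
    """Toy rule: index of the largest logit (0 = refuse, 1 = comply).

    Ties resolve to the lowest index.
    """
    return max(range(len(logits)), key=lambda i: logits[i])


def disagreements(inputs):
    """Return the inputs whose decision changes after rounding to float16."""
    found = []
    for logits in inputs:
        low = [to_float16(v) for v in logits]
        if decide(logits) != decide(low):
            found.append(logits)
    return found


# Hypothetical logits; the second pair is nearly tied, so rounding flips it.
inputs = [
    [2.5, 1.0],
    [1.0001, 1.0004],
    [0.3, 0.9],
]
flipped = disagreements(inputs)
```

Both values in the near-tied pair round to 1.0 in half precision, so the full-precision "comply" decision becomes "refuse" after rounding. Searching for inputs that sit this close to a decision boundary is what a precision-sensitive test generator would aim for.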

cs.AI / 14 / 2604.19791

Stabilising Generative Models of Attitude Change

稳定态度变化的生成模型

Matyas, Jayd, Cunningham, William A., Vezhnevets, Alexander Sasha, Mobbs, Dean, Duéñez-Guzmán, Edgar A., Leibo, Joel Z.

Abstract

Attitude change, the process by which individuals revise their evaluative stances, has been explained by a set of influential but competing verbal theories. These accounts often function as mechanism sketches: rich in conceptual detail, yet lacking the technical specifications and operational constraints required to run as executable systems. We present a generative actor-based modelling workflow for "rendering" these sketches as runnable actor-environment simulations using the Concordia simulation library. In Concordia, actors operate by predictive pattern completion: an operation on natural language strings that generates a suffix which describes the actor's intended action from a prefix containing memories of their past and observations of the present. We render the theories of cognitive dissonance (Festinger 1957), self-consistency (Aronson 1969), and self-perception (Bem 1972) as distinct decision logics that populate and process the prefix through theory-specific sequences of reasoning steps. We evaluate these implementations across classic psychological experiments. Our implementations generate behavioural patterns consistent with known results from the original empirical literature. However, we find that achieving stable reproduction requires resolving the inherent underdetermination of the verbal accounts and the conflicts between modern linguistic priors and historical experimental assumptions. We also document how this manual process of iterative model "stabilisation" surfaces specific operational and socio-ecological dependencies that were largely undocumented in the original verbal accounts. Ultimately, we argue that the manual stabilisation process itself should be regarded as a core part of the methodology, functioning to clarify the situational and representational commitments needed to generate characteristic effects.

Chinese Translation

态度变化——个体修订其评估立场的过程——已通过一系列有影响力但相互竞争的语言理论进行解释。这些理论通常作为机制草图存在：概念细节丰富，但缺乏作为可执行系统运行所需的技术规格和操作约束。我们提出了一种基于生成演员的建模工作流程，以使用Concordia模拟库将这些草图“呈现”为可运行的演员-环境模拟。在Concordia中，演员通过预测模式补全进行操作：这是一种对自然语言字符串的操作，生成一个后缀，该后缀描述了演员从包含其过去记忆和当前观察的前缀中意图采取的行动。我们将认知失调理论（Festinger 1957）、自我一致性理论（Aronson 1969）和自我知觉理论（Bem 1972）呈现为不同的决策逻辑，这些逻辑通过特定于理论的推理步骤序列填充并处理前缀。我们在经典心理学实验中评估这些实现。我们的实现生成的行为模式与原始实证文献中的已知结果一致。然而，我们发现，实现稳定重现需要解决语言账户的固有不确定性以及现代语言先验与历史实验假设之间的冲突。此外，我们记录了这一手动迭代模型“稳定化”过程如何浮现出在原始语言账户中大多未被记录的特定操作和社会生态依赖关系。最终，我们认为手动稳定化过程本身应被视为一种核心方法论，旨在澄清生成特征效应所需的情境和表征承诺。

cs.AI / 15 / 2604.19792

OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review

OpenCLAW-P2P v6.0：弹性多层持久性、实时参考验证和去中心化人工智能同行评审的生产规模评估

de Lafuente, Francisco Angulo, Sharma, Teerth, Veselov, Vladimir, Abdu, Seid Mohammed, Kumar, Nirmal Tej, Perry, Guillermo

Abstract

This paper presents OpenCLAW-P2P v6.0, a comprehensive evolution of the decentralized collective-intelligence platform in which autonomous AI agents publish, peer-review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on v5.0 foundations -- tribunal-gated publishing, multi-LLM granular scoring, calibrated deception detection, the Silicon Chess-Grid FSM, and the AETHER containerized inference engine -- this release introduces four major new subsystems: (1) a multi-layer paper persistence architecture with four storage tiers (in-memory cache, Cloudflare R2, Gun.js, GitHub) ensuring zero paper loss across redeployments; (2) a multi-layer retrieval cascade with automatic backfill reducing lookup latency from >3s to <50ms; (3) live reference verification querying CrossRef, arXiv, and Semantic Scholar during scoring to detect fabricated citations with >85% accuracy; and (4) a scientific API proxy providing rate-limited cached access to seven public databases. The platform operates with 14 real autonomous agents producing 50+ scored papers (word counts 2,072-4,073, leaderboard scores 6.4-8.1) alongside 23 labeled simulated citizens. We present honest production statistics, failure-mode analysis, a paper recovery protocol that salvaged 25 lost papers, and lessons learned from operating the system at scale. All pre-existing subsystems -- 17-judge multi-LLM scoring, 14-rule calibration with 8 deception detectors, tribunal cognitive examination, Proof of Value consensus, Laws-of-Form eigenform verification, and tau-normalized agent coordination -- are retained and further hardened. All code is open-source at https://github.com/Agnuxo1/p2pclaw-mcp-server.

Chinese Translation

本文介绍了OpenCLAW-P2P v6.0，这是一个去中心化集体智能平台的全面演进，其中自主人工智能代理在没有任何人类把关者的情况下发布、同行评审、评分并迭代改进科学研究论文。基于v5.0的基础——法庭审查出版、多LLM细粒度评分、校准的欺骗检测、硅棋网有限状态机（Silicon Chess-Grid FSM）以及AETHER容器化推理引擎——本次发布引入了四个主要的新子系统：（1）一个具有四个存储层次（内存缓存、Cloudflare R2、Gun.js、GitHub）的多层论文持久性架构，确保在重新部署过程中零论文丢失；（2）一个多层检索级联，自动回填将查找延迟从>3秒减少到<50毫秒；（3）实时参考验证，在评分过程中查询CrossRef、arXiv和Semantic Scholar，以超过85%的准确率检测伪造引用；（4）一个科学API代理，提供对七个公共数据库的限速缓存访问。该平台运行着14个真实的自主代理，生成超过50篇评分论文（字数范围为2,072-4,073，排行榜得分为6.4-8.1），以及23个标记的模拟公民。我们展示了诚实的生产统计数据、故障模式分析、一个挽救25篇丢失论文的论文恢复协议，以及在大规模操作系统时获得的经验教训。所有现有子系统——17名评审的多LLM评分、14条规则的校准与8个欺骗检测器、法庭认知检查、价值证明共识、形式法则特征验证以及tau标准化代理协调——均被保留并进一步加强。所有代码均为开源，地址为https://github.com/Agnuxo1/p2pclaw-mcp-server。

cs.AI / 16 / 2604.19793

SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

SkillGraph：用于大型语言模型代理工具序列推荐的图基础先验

Liu, Hao, Li, Dongyu

Abstract

LLM agents must select tools from large API libraries and order them correctly. Existing methods use semantic similarity for both retrieval and ordering, but ordering depends on inter-tool data dependencies that are absent from tool descriptions. As a result, semantic-only methods can produce negative Kendall-$\tau$ in structured workflow domains. We introduce SkillGraph, a directed weighted execution-transition graph mined from 49,831 successful LLM agent trajectories, which encodes workflow-precedence regularities as a reusable graph foundation prior. Building on this graph foundation prior, we propose a two-stage decoupled framework: GS-Hybrid retrieval for candidate selection and a learned pairwise reranker for ordering. On ToolBench (9,965 test instances; ~16,000 tools), the method reaches Set-F1 = 0.271 and Kendall-$\tau$ = 0.096; on API-Bank, Kendall-$\tau$ improves from -0.433 to +0.613. Under identical Stage-1 inputs, the learned reranker also outperforms LLaMA-3.1-8B Stage-2 rerankers.
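
Kendall-$\tau$ measures how well a predicted tool order agrees with a reference order, which is why a negative value means the ordering is worse than random. The sketch below computes the tie-free variant (tau-a) in plain Python; the tool names are hypothetical, and the exact variant used in the paper's evaluation is not stated in the abstract.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Kendall tau-a between two orderings of the same distinct items."""
    if set(predicted) != set(reference) or len(reference) < 2:
        raise ValueError("orderings must contain the same items (at least two)")
    ref_rank = {tool: i for i, tool in enumerate(reference)}
    pred_rank = {tool: i for i, tool in enumerate(predicted)}
    concordant = discordant = 0
    for a, b in combinations(reference, 2):
        # A pair is concordant when both orderings place a and b the same way round.
        agreement = (ref_rank[a] - ref_rank[b]) * (pred_rank[a] - pred_rank[b])
        if agreement > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical workflow: each tool consumes the previous tool's output.
reference = ["search", "fetch", "parse", "summarize"]
tau_same = kendall_tau(reference, reference)
tau_reversed = kendall_tau(list(reversed(reference)), reference)
tau_one_swap = kendall_tau(["fetch", "search", "parse", "summarize"], reference)
```

A fully reversed order scores -1 and a single adjacent swap among four tools scores 2/3, so the reported move from -0.433 to +0.613 on API-Bank corresponds to going from mostly inverted to mostly correct pairwise ordering.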

Chinese Translation

大型语言模型（LLM）代理必须从大型API库中选择工具并正确排序。现有方法在检索和排序中都使用语义相似性，但排序依赖于工具之间的数据依赖关系，而这些关系在工具描述中是缺失的。因此，仅依赖语义的方法在结构化工作流领域可能会产生负的Kendall-$\tau$值。我们提出了SkillGraph，这是一个从49,831个成功的LLM代理轨迹中挖掘出的有向加权执行转移图，它将工作流优先级规律编码为可重用的图基础先验。在此图基础先验的基础上，我们提出了一个两阶段解耦框架：GS-Hybrid检索用于候选选择，学习的成对重排器用于排序。在ToolBench（9,965个测试实例；约16,000个工具）上，该方法达到了Set-F1 = 0.271和Kendall-$\tau$ = 0.096；在API-Bank上，Kendall-$\tau$从-0.433改善到+0.613。在相同的第一阶段输入下，学习的重排器也优于LLaMA-3.1-8B的第二阶段重排器。

cs.AI / 17 / 2604.19794

Handbook of Rough Set Extensions and Uncertainty Models

粗集扩展与不确定性模型手册

Fujita, Takaaki, Smarandache, Florentin

Abstract

Rough set theory models uncertainty by approximating target concepts through lower and upper sets induced by indiscernibility, or more generally, by granulation relations in data tables. This perspective captures vagueness caused by limited observational resolution and supports set-theoretic reasoning about what can be determined with certainty and what remains only possible. This book is written as a map of models. Rather than developing a single algorithmic pipeline in depth, it provides a systematic survey of the main rough set paradigms and their extension routes. More specifically, representative variants are organized according to (i) the underlying granulation mechanism, such as equivalence-based, tolerance-based, covering-based, neighborhood-based, and probabilistic approximations, and (ii) the uncertainty semantics attached to data and relations, such as crisp, fuzzy, intuitionistic fuzzy, neutrosophic, and plithogenic settings. The book also explains how each choice changes the form of approximations and the interpretation of boundary regions. Throughout the book, small illustrative examples are used to clarify modeling intent and typical use cases in classification and decision support. Finally, an important clarification of scope should be noted. Since the main purpose of this book is to provide a map of models, the Abstract and Introduction should not lead readers to expect that feature reduction and rule induction are primary objectives. Although these topics are central in the rough set literature, they are treated here mainly as motivating applications and as entry points to the broader research landscape. The principal aim of the book is to survey and position rough set models and their extensions in a systematic and coherent manner.

Chinese Translation

粗集理论通过通过不可区分性引发的下界集和上界集来近似目标概念，从而对不确定性建模，或更一般地，通过数据表中的粒化关系。这种视角捕捉了由于观察分辨率有限而导致的模糊性，并支持关于可以确定的内容和仅可能的内容的集合论推理。本书被写作成模型的地图。它并不是深入开发单一的算法管道，而是系统地调查主要的粗集范式及其扩展路径。更具体地说，代表性变体根据（i）基础粒化机制进行组织，例如基于等价、基于容忍、基于覆盖、基于邻域和概率近似，以及（ii）附加于数据和关系的不确定性语义，例如清晰、模糊、直觉模糊、中立和多元设置。本书还解释了每种选择如何改变近似的形式和边界区域的解释。在整本书中，使用小的示例来阐明建模意图和分类及决策支持中的典型用例。最后，需要注意一个重要的范围澄清。由于本书的主要目的是提供模型的地图，摘要和引言不应使读者期望特征减少和规则诱导是主要目标。尽管这些主题在粗集文献中是中心内容，但在这里主要作为激励应用和进入更广泛研究领域的切入点进行处理。本书的主要目的是以系统和连贯的方式调查和定位粗集模型及其扩展。

cs.AI / 18 / 2604.19795

Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery

Prism：用于多智能体开放式发现的进化记忆基底

Mishra, Suyash

Abstract

We introduce Prism (Probabilistic Retrieval with Information-Stratified Memory), an evolutionary memory substrate for multi-agent AI systems engaged in open-ended discovery. Prism unifies four independently developed paradigms -- layered file-based persistence, vector-augmented semantic memory, graph-structured relational memory, and multi-agent evolutionary search -- under a single decision-theoretic framework with eight interconnected subsystems. We make five contributions: (1) an entropy-gated stratification mechanism that assigns memories to a tri-partite hub (skills/notes/attempts) based on Shannon information content, with formal context-window utilization bounds; (2) a causal memory graph $\mathcal{G} = (V, E_r, E_c)$ with interventional edges and agent-attributed provenance; (3) a Value-of-Information retrieval policy with self-evolving strategy selection; (4) a heartbeat-driven consolidation controller with stagnation detection via optimal stopping theory; and (5) a replicator-decay dynamics framework that interprets memory confidence as evolutionary fitness, proving convergence to an Evolutionary Stable Memory Set (ESMS). On the LOCOMO benchmark, Prism achieves 88.1 LLM-as-a-Judge score (31.2% over Mem0). On CORAL-style evolutionary optimization tasks, 4-agent Prism achieves 2.8$\times$ higher improvement rate than single-agent baselines.

Chinese Translation

我们介绍了Prism（Probabilistic Retrieval with Information-Stratified Memory），这是一个用于参与开放式发现的多智能体人工智能系统的进化记忆基底。Prism将四种独立开发的范式统一在一个包含八个相互关联子系统的决策理论框架下——分层文件持久性、向量增强的语义记忆、图结构关系记忆和多智能体进化搜索。我们做出了五项贡献：（1）一种熵门控分层机制，根据香农信息内容将记忆分配到三方中心（技能/笔记/尝试），并具有正式的上下文窗口利用界限；（2）一个具有干预边和智能体归属来源的因果记忆图 $\mathcal{G} = (V, E_r, E_c)$；（3）一种具有自我进化策略选择的信息价值检索策略；（4）一个通过最优停止理论进行停滞检测的心跳驱动整合控制器；（5）一个将记忆置信度解释为进化适应度的复制者衰减动态框架，证明收敛到进化稳定记忆集（ESMS）。在LOCOMO基准测试中，Prism达到了88.1的LLM-as-a-Judge评分（比Mem0高31.2%）。在CORAL风格的进化优化任务中，4智能体的Prism实现了比单智能体基线高出2.8倍的改进率。

cs.AI / 19 / 2604.19803

The AI Telco Engineer: Toward Autonomous Discovery of Wireless Communications Algorithms

人工智能电信工程师：迈向无线通信算法的自主发现

Aoudia, Fayçal Aït, Hoydis, Jakob, Cammerer, Sebastian, Maggi, Lorenzo, Marti, Gian, Keller, Alexander

Abstract

Agentic AI is rapidly transforming the way research is conducted, from prototyping ideas to reproducing results found in the literature. In this paper, we explore the ability of agentic AI to autonomously design wireless communication algorithms. To that end, we implement a dedicated framework that leverages large language models (LLMs) to iteratively generate, evaluate, and refine candidate algorithms. We evaluate the framework on three tasks spanning the physical (PHY) and medium access control (MAC) layers: statistics-agnostic channel estimation, channel estimation with known covariance, and link adaptation. Our results show that, in a matter of hours, the framework produces algorithms that are competitive with and, in some cases, outperform conventional baselines. Moreover, unlike neural network-based approaches, the generated algorithms are fully explainable and extensible. This work represents a first step toward the autonomous discovery of novel wireless communication algorithms, and we look forward to the progress our community makes in this direction.

Chinese Translation

代理型人工智能正在迅速改变研究的开展方式，从原型设计到重现文献中的结果。在本文中，我们探讨了代理型人工智能自主设计无线通信算法的能力。为此，我们实施了一个专用框架，该框架利用大型语言模型（LLMs）迭代生成、评估和优化候选算法。我们在三个任务上评估了该框架，这些任务涵盖了物理层（PHY）和媒体接入控制层（MAC）：与统计无关的信道估计、已知协方差的信道估计以及链路自适应。我们的结果表明，在数小时内，该框架生成的算法与传统基准相竞争，并在某些情况下超越了它们。此外，与基于神经网络的方法不同，生成的算法是完全可解释和可扩展的。这项工作代表了向自主发现新型无线通信算法迈出的第一步，我们期待着我们的社区在这一方向上取得的进展。

cs.AI / 20 / 2604.19807

Skyline-First Traversal as a Control Mechanism for Multi-Criteria Graph Search

以天际线优先遍历作为多标准图搜索的控制机制

Tacheny, Nicolas

Abstract

In multi-criteria graph traversal, paths are compared via Pareto dominance, an ordering that identifies which paths are non-dominated, but says nothing about which path to expand next or when the search may stop. As a result, existing approaches rely on external mechanisms (heuristics, scalarization, or population-based exploration), while Pareto dominance remains confined to passive roles such as pruning or ranking. This paper shows that, under constrained cost models, finite cost grids, Markovian transitions, and a nonzero progress measure, Pareto geometry alone is sufficient to drive both scheduling and termination. We show that extracting exclusively from the first Pareto layer, the skyline, induces a deterministic descent in a discrete completion potential, ensuring monotone progress toward solution completion. In parallel, a vector lower-bound certificate provides a stopping condition that guarantees dominance coverage of all remaining traversals without requiring a predefined number of solutions. Our analysis establishes deterministic potential descent, certified termination via dominance coverage, a uniform bound on layer width induced by cost-grid geometry, and greedy cost-space dispersion within the skyline. The resulting framework operates without scalarization, heuristic guidance, or probabilistic models, and repositions Pareto dominance from a passive filter to a deterministic driver of search.
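
The first Pareto layer (the skyline) that this abstract builds on can be extracted with a few lines of code. The sketch below assumes all criteria are costs to be minimised and uses made-up two-dimensional cost vectors; it shows only skyline extraction, not the paper's scheduling rule or termination certificate.

```python
from typing import List, Sequence, Tuple

Cost = Tuple[int, ...]


def dominates(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if a is no worse than b on every criterion and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(costs: List[Cost]) -> List[Cost]:
    """Return the non-dominated cost vectors, preserving input order."""
    return [c for c in costs if not any(dominates(other, c) for other in costs)]


# Hypothetical partial-path costs, e.g. (travel time, toll).
frontier = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
first_layer = skyline(frontier)
print(first_layer)
```

This quadratic scan is enough to illustrate the idea; a skyline-first traversal would expand only the paths whose cost vectors appear in `first_layer`.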

Chinese Translation

在多标准图遍历中，路径通过帕累托支配进行比较，这是一种识别非支配路径的排序，但并未说明下一步应扩展哪条路径或何时停止搜索。因此，现有方法依赖于外部机制——启发式方法、标量化或基于种群的探索，而帕累托支配则局限于被动角色，如剪枝或排序。本文表明，在受限成本模型、有限成本网格、马尔可夫转移和非零进展度量下，单靠帕累托几何就足以驱动调度和终止。我们展示了从第一个帕累托层，即天际线中独立提取，能够在离散完成潜力中引发确定性下降，确保向解决方案完成的单调进展。同时，向量下界证书提供了一种停止条件，保证覆盖所有剩余遍历的支配，而无需预定义解决方案的数量。我们的分析确立了确定性的潜力下降，通过支配覆盖的认证终止，由成本网格几何引发的层宽度的统一界限，以及在天际线内的贪婪成本空间分散。由此产生的框架在没有标量化、启发式指导或概率模型的情况下运行，并将帕累托支配从被动过滤器重新定位为搜索的确定性驱动因素。

cs.AI / 21 / 2604.19809

MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

MIRROR：大型语言模型元认知校准的分层基准

Wang, Jason Z

Abstract

We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self-prediction fails universally -- the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15-model Exp3-v1 set (and 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion), indicating that models cannot predict their own performance on multi-domain tasks, and (2) models exhibit above-chance but imperfect domain-specific self-knowledge yet systematically fail to translate even this partial awareness into appropriate agentic action-selection -- external metacognitive control reduces the Confident Failure Rate from 0.600 to 0.143 (76% reduction at temperature 0; mean 70% at temperature 0.7 across 5 models from 4 labs). Providing models with their own calibration scores produces no significant improvement (p > 0.05); only architectural constraint is effective. This suggests that external metacognitive scaffolding -- not improved self-knowledge -- is the path to safer autonomous AI systems. Code, data, and Croissant metadata will be released publicly with the benchmark.
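
The 76% figure quoted above is a relative reduction, and it can be checked from the two Confident Failure Rate values given in the abstract. The helper name below is illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate drops relative to its starting value."""
    return (before - after) / before


# Confident Failure Rate at temperature 0, as reported in the abstract.
cfr_reduction = relative_reduction(0.600, 0.143)
print(f"{cfr_reduction:.1%}")  # prints 76.2%
```

The absolute drop is 0.457, so the same result could also be described as a reduction of about 46 percentage points.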

Chinese Translation

我们介绍了MIRROR，一个包含八个实验的基准，涵盖四个元认知层次，评估大型语言模型是否能够利用自我知识做出更好的决策。我们评估了来自8个实验室的16个模型，基于约250,000个评估实例，使用五个独立的行为测量通道。核心实验在完整的模型列表上进行；具有特殊基础设施要求的实验则明确标记了模型子集。我们发现了两个对代理部署有直接影响的现象：(1) 组合自我预测普遍失败——在原始的15模型Exp3-v1集上，组合校准误差范围为0.500到0.943（在平衡的16模型Exp3-v2扩展中为0.434到0.758），表明模型无法预测其在多领域任务上的表现；(2) 模型表现出高于随机水平但不完美的领域特定自我知识，然而系统性地未能将这种部分意识转化为适当的代理行动选择——外部元认知控制将自信失败率从0.600降低到0.143（在温度为0时减少76%；在温度为0.7时，4个实验室的5个模型的平均减少为70%）。为模型提供其自身的校准分数未产生显著改善（p > 0.05）；只有架构约束是有效的。这表明，外部元认知支架——而非改善自我知识——是实现更安全的自主人工智能系统的途径。代码、数据和Croissant元数据将在基准发布时公开。

cs.AI / 22 / 2604.19810

The Existential Theory of Research: Why Discovery Is Hard

研究的存在论理论：为何发现如此困难

Majumdar, Angshul

Abstract

Can scientific discovery be made arbitrarily easy by choosing the right representation, collecting enough data, and deploying sufficiently powerful algorithms? This paper argues that the answer is fundamentally negative. We introduce the Existential Theory of Research (ETR), a formal framework that models discovery as the recovery of structured explanations under constraints of representation, observation, and computation. Within this framework, we show that these three components cannot be simultaneously optimized: no method can guarantee universally simple explanations, arbitrarily compressed observations, and efficient exact inference. This limitation is not model-specific, but arises from a synthesis of uncertainty principles in sparse representation, sample complexity bounds in high-dimensional recovery, and the computational hardness of exact inference. We further show that representation mismatch alone can inflate intrinsic simplicity into apparent complexity, rendering otherwise tractable problems observationally and computationally prohibitive. To quantify these effects, we introduce an uncertainty functional that captures the joint difficulty of discovery. The results suggest that scientific difficulty is not accidental, but a structural consequence of the geometry and complexity of inference.

Chinese Translation

科学发现是否可以通过选择合适的表征、收集足够的数据以及部署足够强大的算法来变得任意简单？本文认为答案从根本上是否定的。我们引入了研究的存在论理论（Existential Theory of Research, ETR），这是一个将发现建模为在表征、观察和计算约束下恢复结构化解释的形式框架。在这个框架内，我们展示了这三个组成部分无法同时优化：没有任何方法可以保证普遍简单的解释、任意压缩的观察和高效的精确推理。这一限制并不是特定于某个模型，而是源于稀疏表示中的不确定性原理、高维恢复中的样本复杂性界限以及精确推理的计算难度的综合结果。我们进一步表明，仅仅是表征不匹配就可以将内在的简单性膨胀为表面上的复杂性，使得本来可处理的问题在观察和计算上变得不可行。为了量化这些影响，我们引入了一种不确定性泛函，捕捉发现的共同困难。结果表明，科学的困难并非偶然，而是推理的几何和复杂性的结构性结果。

cs.AI / 23 / 2604.19815

Large Language Models Meet Biomedical Knowledge Graphs for Mechanistically Grounded Therapeutic Prioritization

大型语言模型与生物医学知识图谱相结合以实现机制基础的治疗优先排序

Wei, Chih-Hsuan, Day, Chi-Ping, Wang, Zhizheng, Alewine, Christine C., Tyler, Betty, Slika, Hasan, Saraf, David, Tai, Chin-Hsien, Chan, Joey, Leaman, Robert, Lu, Zhiyong

Abstract

Drug repurposing is often framed as a candidate identification task, but existing approaches provide limited guidance for distinguishing biologically plausible candidates from historically well-connected ones. Here we introduce DrugKLM, a hybrid framework that integrates biomedical knowledge graph structure with large language model-based mechanistic reasoning to enable mechanistically grounded therapeutic prioritization. Across benchmark datasets, DrugKLM outperforms knowledge graph-only and language model-only baselines, including TxGNN. Beyond improved recall, DrugKLM confidence scores exhibit functional alignment with molecular phenotypes: higher scores are associated with transcriptional signatures linked to improved survival across 12 TCGA cancers. The scoring framework preferentially captures biologically perturbational signals rather than historical indication patterns. Expert curation across five cancers further reveals systematic differences in prioritization behavior, with DrugKLM elevating candidates supported by coherent mechanistic rationale and disease-specific clinical context. Together, these results establish DrugKLM as an evidence-integrative framework that translates heterogeneous biomedical data into mechanistically interpretable and clinically grounded therapeutic hypotheses.

Chinese Translation

药物再利用通常被视为候选者识别任务，但现有方法在区分生物学上合理的候选者与历史上连接良好的候选者方面提供的指导有限。在此，我们介绍了DrugKLM，一种将生物医学知识图谱结构与基于大型语言模型的机制推理相结合的混合框架，以实现机制基础的治疗优先排序。在基准数据集上，DrugKLM的表现优于仅使用知识图谱和仅使用语言模型的基线，包括TxGNN。除了提高召回率外，DrugKLM的置信评分与分子表型表现出功能对齐：较高的评分与12种TCGA癌症中与改善生存相关的转录特征相关联。该评分框架优先捕捉生物学扰动信号，而非历史指示模式。对五种癌症的专家策划进一步揭示了优先排序行为的系统性差异，DrugKLM提升了由一致的机制理由和特定疾病临床背景支持的候选者。综上所述，这些结果确立了DrugKLM作为一个证据整合框架，将异质生物医学数据转化为机制可解释且临床基础的治疗假设。

cs.AI / 24 / 2604.19816

Emergence Transformer: Dynamical Temporal Attention Matters

涌现变换器：动态时间注意力的重要性

Zhou, Zihan, Qin, Bo-Wei, Du, Kai, Lin, Wei

Abstract

The Transformer, a breakthrough architecture in artificial intelligence, owes its success to the attention mechanism, which utilizes long-range interactions in sequential data, enabling the emergent coherence between large language models (LLMs) and data distributions. However, temporal attention, that is, different forms of long-range interactions in temporal sequences, has rarely been explored in the emergence phenomena of complex systems, including oscillatory coherence in quantum, biophysical, or climate systems. Here, by designing dynamical temporal attention (DTA) with time-varying query, key, and value matrices, we propose an Emergence Transformer. This architecture allows each component to interact with its own or its neighbors' past states through dynamical attention kernels, thereby enabling the promotion and/or suppression of the emergent coherence of components. Interestingly, we uncover that neighbor-DTA consistently promotes oscillatory coherence, whereas self-DTA exhibits an optimal attention weight for coherence enhancement, owing to its non-monotonic dependence on network structure. Practically, we demonstrate how DTA reshapes social coherence, suggesting strategies to either enhance agreement or preserve plurality. We further apply DTA to the paradigmatic Hopfield neural network, achieving emergent continual learning without catastrophic forgetting. Together, these results lay a foundation and provide an immediate paradigm for modulating emergence phenomena in networked dynamics using only DTA.

Chinese Translation

变换器（Transformer）作为人工智能领域的一项突破性架构，其成功归功于注意力机制，该机制利用序列数据中的长程交互，使大型语言模型（LLMs）与数据分布之间形成涌现的连贯性。然而，时间注意力，即时间序列中不同形式的长程交互，在复杂系统的涌现现象中（包括量子、生物物理或气候系统中的振荡连贯性）鲜有探讨。在此，我们通过设计具有时间变化的查询、键和值矩阵的动态时间注意力（DTA），提出了一种涌现变换器。这一架构允许每个组件通过动态注意力核与自身或邻近组件的过去状态进行交互，从而促进和/或抑制组件的涌现连贯性。有趣的是，我们发现邻居-DTA始终促进振荡连贯性，而自我-DTA则表现出最佳的注意力权重以增强连贯性，这归因于其对网络结构的非单调依赖。实际上，我们展示了DTA如何重塑社会连贯性，提出了增强一致性或保持多样性的策略。我们进一步将DTA应用于典型的霍普菲尔德神经网络，实现了涌现的持续学习而没有灾难性遗忘。这些结果共同奠定了基础，并为仅使用DTA调节网络动态中的涌现现象提供了直接的范式。

cs.AI / 25 / 2604.19821

JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents

JTPRO：一种用于语言代理的联合工具-提示反思优化框架

Ghoshal, Sandip, Mittal, Anshul, Singh, Jyotika, Ballesteros, Miguel, Sun, Weiyi, Tu, Fang, Singh, Shailender, Benajiba, Yassine, Shah, Fahad, Bharadwaj, Sujeeth, Ravi, Sujith, Roth, Dan

Abstract

Large language model (LLM) agents augmented with external tools often struggle as the number of tools grows large and the tools become domain-specific. In such settings, ambiguous tool descriptions and under-specified agent instructions frequently lead to tool mis-selection and incorrect slot/value instantiation. We hypothesize that this is due to two root causes: generic, one-size-fits-all prompts that ignore tool-specific nuances, and underspecified tool schemas that lack clear guidance on when and how to use each tool and how to format its parameters. We introduce Joint Tool-Prompt Reflective Optimization (JTPRO), a framework for improving tool-calling reliability in trace-supervised settings by iteratively using rollout-driven reflection to co-optimize global instructions and per-tool schema/argument descriptions for accurate tool selection and argument instantiation in large tool inventories. JTPRO is designed to preserve only tool-local cues needed for correct disambiguation and slot filling. We evaluate JTPRO across multi-tool benchmarks, which cover different numbers of tools, using three metrics: Tool Selection Accuracy (TSA), Slot Filling Accuracy (SFA), and Overall Success Rate (OSR) (correct tool + correct slots + correct values). JTPRO consistently outperforms strong baselines, including CoT-style agents and reflective prompt optimizers such as GEPA, by 5%-20% (relative) on OSR. Ablations show that joint optimization of instructions and tool schemas is more effective and robust than optimizing either component in isolation.

Chinese Translation

增强了外部工具的大型语言模型（LLM）代理在工具数量增多并变得领域特定时常常面临困难。在这种情况下，模糊的工具描述和不明确的代理指令常常导致工具选择错误和插槽/值实例化不正确。我们假设这主要是由于两个根本原因：忽视工具特定细微差别的通用、一刀切的提示，以及缺乏明确指导何时以及如何使用每个工具及其参数格式的不充分工具模式。我们提出了联合工具-提示反思优化（JTPRO），这是一个通过迭代使用基于回滚的反思来共同优化全局指令和每个工具模式/参数描述，从而提高在追踪监督环境中工具调用可靠性的框架，以实现准确的工具选择和在大型工具库中的参数实例化。JTPRO旨在仅保留正确消歧和插槽填充所需的工具本地线索。我们在多工具基准测试中评估JTPRO，这些基准考虑了不同数量的工具，并使用三个指标：工具选择准确率（TSA）、插槽填充准确率（SFA）和整体成功率（OSR）（正确的工具 + 正确的插槽 + 正确的值）。JTPRO在OSR上始终优于强基线，包括CoT风格的代理和反思提示优化器如GEPA，提升幅度为5%-20%（相对）。消融实验表明，指令和工具模式的联合优化比单独优化任一组件更有效且更具鲁棒性。

cs.AI / 26 / 2604.19837

Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations

Forage V2：自主代理组织中的知识演化与转移

Xie, Huaqing

Abstract

Autonomous agents operating in open-world tasks -- where the completion boundary is not given in advance -- face denominator blindness: they systematically underestimate the scope of the target space. Forage V1 addressed this through co-evolving evaluation (an independent Evaluator discovers what "complete" means) and method isolation (Evaluator and Planner cannot see each other's code). V2 extends the architecture from a single expedition to a learning organization: experience accumulates across runs, transfers across model capabilities, and institutional safeguards prevent knowledge degradation. We demonstrate two claims across three task types (web scraping, API queries, mathematical reasoning). Knowledge accumulation: over six runs, knowledge entries grow from 0 to 54, and denominator estimates stabilize as domain understanding deepens. Knowledge transfer: a weaker agent (Sonnet) seeded with a stronger agent's (Opus) knowledge narrows a 6.6pp coverage gap to 1.1pp, halves cost (9.40 to 5.13 USD), converges in half the rounds (mean 4.5 vs. 7.0), and three independent seeded runs arrive at exactly the same denominator estimate (266), suggesting organizational knowledge calibrates evaluation itself. V2's contribution is architectural: it designs institutions -- audit separation, contract protocols, organizational memory -- that make any agent more reliable upon entry. The accumulated experience is organizational, model-agnostic, and transferable, stored as readable documents that any future agent inherits regardless of provider or capability level.

Chinese Translation

在开放世界任务中运行的自主代理面临着分母盲点：它们系统性地低估了目标空间的范围。Forage V1通过共同演化评估（独立的评估者发现“完成”的含义）和方法隔离（评估者和规划者无法看到彼此的代码）来解决这一问题。V2将架构从单次探险扩展到学习组织：经验在多次运行中累积，跨模型能力转移，制度性保障防止知识退化。我们在三种任务类型（网络爬虫、API查询、数学推理）中展示了两个主张。知识积累：在六次运行中，知识条目从0增长到54，随着领域理解的加深，分母估计趋于稳定。知识转移：一个较弱的代理（Sonnet）在一个较强代理（Opus）的知识基础上，缩小了6.6个百分点的覆盖差距至1.1个百分点，成本减半（从9.40降至5.13美元），在一半的轮次内收敛（平均4.5轮对比7.0轮），且三次独立的种子运行得出了完全相同的分母估计（266），这表明组织知识本身校准了评估。V2的贡献在于其架构设计：它设计了制度——审计分离、合同协议、组织记忆——使得任何代理在进入时都更可靠。累积的经验是组织性的、模型无关的且可转移，存储为可读文档，任何未来的代理都可以继承，无论其提供者或能力水平如何。

cs.AI / 27 / 2604.19838

Resolving space-sharing conflicts in road user interactions through uncertainty reduction: An active inference-based computational model

通过减少不确定性解决道路用户互动中的空间共享冲突：基于主动推理的计算模型

Schumann, Julian F., Engström, Johan, Wei, Ran, Liu, Shu-Yuan, Kober, Jens, Zgonnikov, Arkady

Abstract

Understanding how road users resolve space-sharing conflicts is important both for traffic safety and the safe deployment of autonomous vehicles. While existing models have captured specific aspects of such interactions (e.g., explicit communication), a theoretically-grounded computational framework has been lacking. In this paper, we extend a previously developed active inference-based driver behavior model to simulate interactive behavior of two agents. Our model captures three complementary mechanisms for uncertainty reduction in interaction: (i) implicit communication via direct behavioral coupling, (ii) reliance on normative expectations (stop signs, priority rules, etc.), and (iii) explicit communication. In a simplified intersection scenario, we show that normative and explicit communication cues can increase the likelihood of a successful conflict resolution. However, this relies on agents acting as expected. In situations where another agent (intentionally or unintentionally) violates normative expectations or communicates misleading information, reliance on these cues may induce collisions. These findings illustrate how active inference can provide a novel framework for modeling road user interactions which is also applicable in other fields.

Chinese Translation

理解道路用户如何解决空间共享冲突对交通安全和自主车辆的安全部署都至关重要。尽管现有模型捕捉了此类互动的特定方面（例如，明确的沟通），但缺乏一个理论基础的计算框架。在本文中，我们扩展了先前开发的基于主动推理的驾驶行为模型，以模拟两个主体的互动行为。我们的模型捕捉了三种互补机制以减少互动中的不确定性：（i）通过直接行为耦合进行隐性沟通，（ii）依赖规范性期望（停车标志、优先规则等），以及（iii）明确沟通。在一个简化的交叉口场景中，我们展示了规范性和明确沟通线索可以增加成功解决冲突的可能性。然而，这依赖于主体按预期行事。在另一个主体（有意或无意）违反规范性期望或传达误导性信息的情况下，依赖这些线索可能会导致碰撞。这些发现说明了主动推理如何为建模道路用户互动提供一个新颖的框架，并且该框架也适用于其他领域。

cs.AI / 28 / 2604.19845

Deconstructing Superintelligence: Identity, Self-Modification and Différance

解构超智能：身份、自我修改与差异

Perrier, Elija

Abstract

Self-modification is often taken as constitutive of artificial superintelligence (SI), yet modification is a relative action requiring a supplement outside the operation. When self-modification extends to this supplement, the classical self-referential structure collapses. We formalise this on an associative operator algebra $\mathcal{A}$ with update $\hat{U}$, discrimination $\hat{D}$, and self-representation $\hat{R}$, identifying the supplement with $\mathrm{Comm}(\hat{U})$; an expansion theorem shows that $[\hat{U},\hat{R}]$ decomposes through $[\hat{U},\hat{D}]$, so non-commutation generically propagates. The liar paradox appears as a commutator collapse $[\hat{T},\Pi_L]=0$, and class $\mathbf{A}$ self-modification realises the same collapse at system scale, yielding a structure coinciding with Priest's inclosure schema and Derrida's différance.

Chinese Translation

自我修改通常被视为人工超智能（SI）的构成部分，然而修改是一个相对的行为，需要在操作之外的补充。当自我修改扩展到这一补充时，经典的自指结构崩溃。我们在一个关联算子代数 $\mathcal{A}$ 上形式化这一点，该代数具有更新算子 $\hat{U}$、区分算子 $\hat{D}$ 和自我表征算子 $\hat{R}$，并将补充物识别为 $\mathrm{Comm}(\hat{U})$；一个扩展定理表明 $[\hat{U},\hat{R}]$ 通过 $[\hat{U},\hat{D}]$ 进行分解，因此非交换性一般性地传播。说谎者悖论表现为一个对易子崩溃 $[\hat{T},\Pi_L]=0$，而类 $\mathbf{A}$ 的自我修改在系统规模上实现了相同的崩溃，产生了与Priest的内包模式和Derrida的差异相一致的结构。

cs.AI / 29 / 2604.19895

Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication

学习何时不做决定：克服人工智能裁决中事实自负的框架

Afane, Mohamed, Robitschek, Emily, Ouyang, Derek, Ho, Daniel E.

Abstract

A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking. This challenge is particularly acute in legal applications, where a core task for attorneys, judges, and administrators is to determine whether evidence is sufficient to reach a conclusion. We study this problem in the important setting of unemployment insurance adjudication, which has seen rapid integration of AI systems and where the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually. First, through a collaboration with the Colorado Department of Labor and Employment, we secure rare access to official training materials and guidance to design a novel benchmark that systematically varies in information completeness. Second, we evaluate four leading AI platforms and show that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient. Third, advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases. Fourth, we introduce a structured framework requiring explicit identification of missing information before any determination (SPEC, Structured Prompting for Evidence Checklists). SPEC achieves 89% overall accuracy, while appropriately deferring when evidence is insufficient -- demonstrating that presumptuousness in legal AI is systematic but addressable, and that doing so is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.

Chinese Translation

人工智能系统的一个众所周知的局限性是自负：即在信息可能不足的情况下，人工智能系统倾向于提供自信的答案。这个挑战在法律应用中尤为突出，因为律师、法官和管理者的核心任务是判断证据是否足以得出结论。我们在失业保险裁决这一重要场景中研究这个问题，该领域迅速整合了人工智能系统，而额外事实查找的问题对影响每年数百万申请者的系统构成了最显著的瓶颈。首先，通过与科罗拉多州劳动和就业部的合作，我们获得了官方培训材料和指导的稀有访问权限，以设计一个系统性变化信息完整性的创新基准。其次，我们评估了四个领先的人工智能平台，并显示标准的基于RAG（检索增强生成）的方法在信息不足时仅实现了15%的平均准确率。第三，先进的提示方法提高了对不确定案例的准确性，但过度纠正，即使在明确的案例中也不做决定。第四，我们引入了一个结构化框架，要求在任何判断之前明确识别缺失的信息（SPEC，证据检查清单的结构化提示）。SPEC实现了89%的整体准确率，同时在证据不足时适当地推迟判断——这表明法律人工智能中的自负是系统性的但可以解决的，并且这样做是朝着能够可靠支持而非取代人类判断的系统迈出的必要一步，尤其是在决策必须等待充分证据的情况下。

cs.AI / 30 / 2604.19926

CreativeGame: Toward Mechanic-Aware Creative Game Generation

CreativeGame：面向机制感知的创意游戏生成

Ma, Hongnan, Wang, Han, Wang, Shenglin, Yin, Tieyue, Shi, Yiwei, Huang, Yucong, Zou, Yingtian, Wen, Muning, Yang, Mengyue

Abstract

Large language models can generate plausible game code, but turning this capability into iterative creative improvement remains difficult. In practice, single-shot generation often produces brittle runtime behavior, weak accumulation of experience across versions, and creativity scores that are too subjective to serve as reliable optimization signals. A further limitation is that mechanics are frequently treated only as post-hoc descriptions, rather than as explicit objects that can be planned, tracked, preserved, and evaluated during generation. This report presents CreativeGame, a multi-agent system for iterative HTML5 game generation that addresses these issues through four coupled ideas: a proxy reward centered on programmatic signals rather than pure LLM judgment; lineage-scoped memory for cross-version experience accumulation; runtime validation integrated into both repair and reward; and a mechanic-guided planning loop in which retrieved mechanic knowledge is converted into an explicit mechanic plan before code generation begins. The goal is not merely to produce a playable artifact in one step, but to support interpretable version-to-version evolution. The current system contains 71 stored lineages, 88 saved nodes, and a 774-entry global mechanic archive, implemented in 6,181 lines of Python together with inspection and visualization tooling. The system is therefore substantial enough to support architectural analysis, reward inspection, and real lineage-level case studies rather than only prompt-level demos. A real 4-generation lineage shows that mechanic-level innovation can emerge in later versions and can be inspected directly through version-to-version records. The central contribution is therefore not only game generation, but a concrete pipeline for observing progressive evolution through explicit mechanic change.

Chinese Translation

大型语言模型能够生成合理的游戏代码，但将这一能力转化为迭代创意改进仍然很困难。在实践中，单次生成往往会产生脆弱的运行时行为、跨版本经验的积累不足，以及过于主观的创意评分，无法作为可靠的优化信号。另一个限制是，机制通常仅被视为事后描述，而不是可以在生成过程中进行规划、跟踪、保留和评估的明确对象。本报告介绍了 CreativeGame，这是一个用于迭代HTML5游戏生成的多智能体系统，通过四个相互关联的理念来解决这些问题：以程序信号为中心的代理奖励，而非纯粹的LLM判断；用于跨版本经验积累的谱系范围内记忆；集成于修复和奖励中的运行时验证；以及一个机制引导的规划循环，其中检索到的机制知识在代码生成开始之前被转化为明确的机制计划。目标不仅仅是在一步中生成一个可玩的工件，而是支持可解释的版本间演变。当前系统包含71个存储的谱系、88个保存的节点，以及一个774条目的全球机制档案，使用6181行Python代码实现，并配备检查和可视化工具。因此，该系统足够庞大，可以支持架构分析、奖励检查以及真实的谱系级案例研究，而不仅仅是提示级演示。一个真实的四代谱系显示，机制级创新可以在后续版本中出现，并可以通过版本间记录直接进行检查。因此，核心贡献不仅是游戏生成，而是一个具体的管道，用于通过明确的机制变化观察渐进演变。

cs.AI / 31 / 2604.19998

What Makes a Good AI Review? Concern-Level Diagnostics for AI Peer Review

什么构成优秀的人工智能评审？人工智能同行评审的关注级别诊断

Jin, Ming

Abstract

Evaluating AI-generated reviews by verdict agreement is widely recognized as insufficient, yet current alternatives rarely audit which concerns a system identifies, how it prioritizes them, or whether those priorities align with the review rationale that shaped the final assessment. We propose concern alignment, a diagnostic framework that evaluates AI reviews at the concern level rather than only at the verdict level. The framework's core data structure is the match graph, a bipartite alignment between official and AI-generated concerns annotated with match type, severity, and post-rebuttal treatment. From this artifact we derive an evaluation ladder that moves from binary accuracy to concern detection, verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition. In a pilot study of four public AI review systems evaluated in six configurations, concern-level analysis suggests that detection alone does not determine review quality; calibration is often the binding constraint. Systems detect non-trivial fractions of official concerns yet most mark 25--55% of concerns on accepted papers as decisive, where, under our operationalization, no official concern on accepted papers was treated as a decisive blocker. Identical overall verdict accuracy can conceal reject-heavy behavior versus low-recall profiles, and low full-review false decisive rates can partly reflect concern dilution rather than calibrated prioritization. Most systems do not emit a native accept/reject, and inferring it from review tone is method-sensitive, reinforcing the need for concern-level diagnostics that remain stable across inference choices. The contribution is a reusable evaluation framework for auditing which concerns AI reviewers identify, how they weight them, and whether those priorities align with the review rationale that informed the paper's final assessment.

Chinese Translation

通过裁决一致性评估人工智能生成的评审被广泛认为是不够的，然而当前的替代方案很少审计系统识别的关注点、如何对其进行优先级排序，或这些优先级是否与塑造最终评估的评审理由相一致。我们提出了关注对齐（concern alignment），这是一个在关注级别而非仅在裁决级别评估人工智能评审的诊断框架。该框架的核心数据结构是匹配图（match graph），这是官方关注点与人工智能生成关注点之间的二分对齐，标注了匹配类型、严重性和反驳后的处理。从这一工件中，我们推导出一个评估阶梯，该阶梯从二元准确性移动到关注点检测、裁决分层行为、决策感知校准和反驳感知分解。在对四个公共人工智能评审系统在六种配置下进行的初步研究中，关注级别分析表明，仅仅检测并不能决定评审质量；校准往往是制约因素。系统检测到非微不足道的官方关注点的比例，但大多数系统将接受论文中25%至55%的关注点标记为决定性，而根据我们的操作化，接受论文中没有任何官方关注点被视为决定性阻碍。相同的整体裁决准确性可能掩盖了拒绝偏重行为与低召回率特征之间的差异，而低的全面评审假阳性率部分反映了关注点稀释而非经过校准的优先级排序。大多数系统不发出本地接受/拒绝，而从评审语气推断这一点是方法敏感的，这进一步强调了需要在推断选择中保持稳定的关注级别诊断。我们的贡献是一个可重用的评估框架，用于审计人工智能评审者识别的关注点、它们的权重以及这些优先级是否与影响论文最终评估的评审理由相一致。

cs.AI / 32 / 2604.20039

Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents

因果推理的可分离路径：架构支架如何促进大型语言模型代理的假设空间重构

Alderete, John, Benthal, Sebastian, Xu, Connie, Xing, John

Abstract

Causal discovery through experimentation and intervention is fundamental to robust problem solving. It requires not just updating beliefs within a fixed framework but revising the hypothesis space itself, a capacity current AI agents lack when evidence demands representations they have not previously constructed. We extend the blicket detector paradigm from developmental science to test this capacity in AI agents equipped with architectural scaffolding that targets hypothesis-space restructuring. Our compositional architecture has two discrete components: context graphs, which structure exploration as typed state machines, and dynamic behaviors, which monitor for evidence that the current hypothesis space is inadequate and expand it at runtime. Across 1,085 experimental trials, these components make orthogonal contributions: context graphs drive reasoning quality within the post-switch hypothesis space, accounting for 94% of the accuracy gain, while dynamic behaviors drive reasoning eligibility by detecting regime changes and preventing premature commitment to outdated hypotheses.

Chinese Translation

通过实验和干预进行因果发现是稳健问题解决的基础。这不仅需要在固定框架内更新信念，还需要修订假设空间本身，而当前的人工智能代理在证据要求它们构建未曾构建的表示时缺乏这种能力。我们将发展科学中的 blicket 检测器范式扩展到人工智能代理，以测试这种能力，后者配备了针对假设空间重构的架构支架。我们的组合架构包含两个离散组件：上下文图，它将探索结构化为类型状态机，以及动态行为，它监控当前假设空间是否不足的证据，并在运行时扩展该空间。在 1,085 次实验试验中，这些组件做出了正交贡献：上下文图在后切换假设空间内推动推理质量，占准确性提升的 94%，而动态行为通过检测制度变化并防止对过时假设的过早承诺来推动推理资格。

cs.AI / 33 / 2604.20055

From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI

从模糊到正式：利用人工智能扩大医院质量改进的规模

Vossler, Patrick, Feng, Jean, Sivaraman, Venkat, Gallo, Robert, Kanzaria, Hemal, Freiser, Dana, Ross, Christopher, Ou, Amy, Marks, James, Ehrlich, Susan, Peabody, Christopher, Zier, Lucas

Abstract

Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A critical step of QI is to identify the key modifiable contributing factors, a process we call QI factor discovery, typically through expert-driven semi-structured qualitative tools like fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. Nevertheless, current AI alignment methods assume the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. In particular, we map QI factor discovery to steps of the classical AI/ML development process (problem formalization, model learning, and model validation) where the specifications are tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and AI pipeline until AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved at least 70% concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.

Chinese Translation

医院质量改进（QI）在优化医疗服务交付中发挥着关键作用，通过将高层次的医院目标转化为可操作的解决方案。QI的一个关键步骤是识别可修改的关键影响因素，我们称之为QI因素发现，通常通过专家驱动的半结构化定性工具，如鱼骨图、图表审查和精益医疗方法进行。人工智能（AI）有潜力转变和加速QI因素发现，而这一过程传统上耗时且资源密集，且在可重复性和可审计性方面存在局限。然而，目前的AI对齐方法假设任务是明确的，而QI因素发现是一个探索性的、模糊的和迭代的意义构建过程，依赖于复杂的隐性专家判断。为了设计一个在保留探索性组件的同时将QI过程形式化的AI管道，我们建议将任务视为不仅学习大型语言模型（LLM）提示，还学习整体的自然语言规范。具体而言，我们将QI因素发现映射到经典AI/机器学习（ML）开发过程的步骤（问题形式化、模型学习和模型验证），其中规范是可调的超参数。领域专家和AI代理迭代地完善整体规范和AI管道，直到AI提取与专家注释一致并与临床目标对齐。我们在一家城市安全网医院应用了这一“人类-AI规范-解决方案共同优化”框架，以识别导致住院时间延长和非计划性30天再入院的因素。最终的AI-QI管道与专家注释的符合度达到70%以上。与之前的手动精益分析相比，AI管道的效率显著提高，恢复了先前的发现，揭示了新的可修改因素，并生成了可审计的推理痕迹。

cs.AI / 34 / 2604.20133

EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

EvoAgent：一个具备技能学习和多智能体委派的可进化智能体框架

Zhang, Aimin, Guo, Jiajing, Jia, Fuwei, Lv, Chen, Wang, Boyu, Li, Fangzheng

Abstract

This paper proposes EvoAgent - an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub-agent delegation mechanism. EvoAgent models skills as multi-file structured capability units equipped with triggering mechanisms and evolutionary metadata, and enables continuous skill generation and optimization through a user-feedback-driven closed-loop process. In addition, by incorporating a three-stage skill matching strategy and a three-layer memory architecture, the framework supports dynamic task decomposition for complex problems and long-term capability accumulation. Experimental results based on real-world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility. Under a five-dimensional LLM-as-Judge evaluation protocol, the overall average score increases by approximately 28%. Further model transfer experiments indicate that the performance of an agent system depends not only on the intrinsic capabilities of the underlying model, but also on the degree of synergy between the model and the agent architecture.

Chinese Translation

本文提出了EvoAgent——一个可进化的大型语言模型（LLM）智能体框架，集成了结构化技能学习与分层子智能体委派机制。EvoAgent将技能建模为多文件结构化能力单元，配备触发机制和进化元数据，并通过用户反馈驱动的闭环过程实现持续的技能生成和优化。此外，通过结合三阶段技能匹配策略和三层记忆架构，该框架支持复杂问题的动态任务分解和长期能力积累。基于真实世界外贸场景的实验结果表明，在集成EvoAgent后，GPT5.2在专业性、准确性和实用性方面取得了显著提升。在五维LLM-as-Judge评估协议下，整体平均分数提高了约28%。进一步的模型迁移实验表明，智能体系统的性能不仅依赖于底层模型的内在能力，还取决于模型与智能体架构之间的协同程度。

cs.AI / 35 / 2604.20140

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

HiPO：用于大规模语言模型自适应推理的层次偏好优化

Kachroo, Darsh, Caraeni, Adriana, Anbazhagan, Arjun Prasaath, Lagasse, Brennan, Zhu, Kevin

Abstract

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.

Chinese Translation

直接偏好优化（DPO）是一个有效的框架，用于将大型语言模型与人类偏好对齐，但在复杂推理任务中表现不佳。DPO 优化生成偏好响应相对于不偏好响应的整体可能性，但缺乏对多步骤解决方案中子部分的反馈粒度，这在推理任务中是典型的。现有方法在稳定偏好学习（例如，DPO 的变体如 KTO 和 RSO）或结构化推理（例如，ReMA 的多智能体强化学习框架、思维树）方面表现出色，但未能将这些互补优势结合起来。我们提出了 HiPO（层次偏好优化），这是 DPO 的扩展，将响应分为推理段（查询澄清和上下文、推理步骤和答案），并将损失计算为每个段的 DPO 损失的加权总和。我们的方法使得针对特定段的训练成为可能，同时保持 DPO 的计算效率和训练稳定性。我们展示了在使用 HiPO 和 DPO 对 Math Stack Exchange 偏好数据集进行微调的多个 7B LLM 模型中，使用 HiPO 训练的模型在各种常见数学基准测试中表现优于其他模型，并在 GPT-4.1 测量下实现了更好的组织性、逻辑流畅性和一致性。
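
The segment-weighted loss described in the HiPO abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment names (`context`, `reasoning`, `answer`), the weights, and the value of `beta` are assumptions, and each segment is assumed to come with summed log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Arguments are summed log-probabilities of the preferred (w) and
    dispreferred (l) text under the policy (pol) and reference (ref) models.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def hipo_loss(segments, weights, beta=0.1):
    """Weighted sum of per-segment DPO losses (hypothetical helper).

    segments: dict mapping segment name -> (pol_w, pol_l, ref_w, ref_l)
    weights:  dict mapping segment name -> non-negative weight (assumed values)
    """
    return sum(
        weights[name] * dpo_loss(*segments[name], beta=beta)
        for name in weights
    )
```

With equal weights on every segment this reduces to an average of ordinary DPO losses; raising the weight on `reasoning` would put more of the training signal on the intermediate steps, which is the kind of segment-specific control the abstract describes.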

cs.AI / 36 / 2604.20158

Stateless Decision Memory for Enterprise AI Agents

企业人工智能代理的无状态决策记忆

Srinivasan, Vasundra

Abstract

Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures. We argue this reflects a hidden requirement: regulated deployment is load-bearing on four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call; summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprise's preference for weaker but replayable retrieval pipelines, and that DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.

Chinese Translation

在受监管领域（如承保、索赔裁决、税务审查）中，长期决策代理的企业部署主要依赖于增强检索的管道，尽管已有十年的时间发展出越来越复杂的有状态记忆架构。我们认为这反映出一个隐含的需求：受监管的部署在四个系统属性上是负载承载的（确定性重放、可审计的理由、多租户隔离、无状态以实现横向扩展），而有状态架构在构造上违反了这些属性。我们提出了确定性投影记忆（Deterministic Projection Memory, DPM）：一个仅追加的事件日志加上在决策时的一个任务条件投影。在三个记忆预算下的十个受监管决策案例中，DPM在宽松预算下与基于摘要的记忆相匹配，而在预算受限时则显著优于它：在20倍压缩比下，DPM提高了事实精度+0.52（Cohen's h=1.17，p=0.0014）和推理一致性+0.53（h=1.13，p=0.0034），配对置换，n=10。DPM在绑定预算时还快7-15倍，在决策时只需进行一次LLM调用，而不是N次。对每个案例在温度为零时进行的10次重放的确定性研究显示，两种架构都继承了残余的API级非确定性，但这种不对称是结构性的：DPM暴露了一个非确定性调用；而摘要则暴露了N个复合调用。审计表面遵循相同的一个与N的模式：DPM在每个决策中记录两个LLM调用，而摘要在LongHorizon-Bench上记录83-97个。我们最后提出TAMS，一个用于架构选择的实践启发式，以及在企业操作条件下有状态记忆的失败分析。我们的贡献在于论证无状态性是解释企业偏好较弱但可重放检索管道的负载承载属性，并且DPM证明了这一属性在不付出决策代价的情况下是可以实现的。
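
The "append-only event log plus one task-conditioned projection" idea from the DPM abstract can be illustrated with a small sketch. Everything here is hypothetical: the `Event` fields, the `EventLog` class, and in particular the `project` function, which stands in for the single decision-time projection with a plain deterministic filter so the example stays self-contained. The abstract indicates the real projection involves an LLM call.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    seq: int      # position in the log, assigned on append
    topic: str    # hypothetical tag used to condition the projection
    text: str

class EventLog:
    """Append-only log: events are added, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, topic, text):
        event = Event(len(self._events), topic, text)
        self._events.append(event)
        return event

    def snapshot(self):
        # Immutable copy, so a decision can be replayed against it later.
        return tuple(self._events)

def project(events, task_topic, budget_chars):
    """Stand-in projection: keep matching events, in log order, within budget.

    A pure function of (events, task_topic, budget_chars), so replaying it
    on the same snapshot yields the same result.
    """
    selected, used = [], 0
    for event in events:
        if event.topic != task_topic:
            continue
        if used + len(event.text) > budget_chars:
            break
        selected.append(event)
        used += len(event.text)
    return tuple(selected)
```

Because no summary state is carried between steps, any worker holding the same snapshot can recompute the projection, which is the stateless, replayable behavior the abstract argues regulated deployments depend on.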

cs.AI / 37 / 2604.20254

Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

Mol-Debate：多智能体辩论提升分子设计中的结构推理能力

Zhang, Wengyu, Wei, Xiao-Yong, Li, Qing

Abstract

Text-guided molecular design is a key capability for AI-driven drug discovery, yet it remains challenging to map sequential natural-language instructions to non-linear molecular structures under strict chemical constraints. Most existing approaches, including RAG, CoT prompting, and fine-tuning or RL, emphasize a small set of ad-hoc reasoning perspectives implemented in a largely one-shot generation pipeline. In contrast, real-world drug discovery relies on dynamic, multi-perspective critique and iterative refinement to reconcile semantic intent with structural feasibility. Motivated by this, we propose Mol-Debate, a generation paradigm that enables such dynamic reasoning through an iterative generate-debate-refine loop. We further characterize key challenges in this paradigm and address them through perspective-oriented orchestration, including developer-debater conflict, global-local structural reasoning, and static-dynamic integration. Experiments demonstrate that Mol-Debate achieves state-of-the-art performance against strong general and chemical baselines, reaching 59.82% exact match on ChEBI-20 and 50.52% weighted success rate on S$^2$-Bench. Our code is available at https://github.com/wyuzh/Mol-Debate.

Chinese Translation

文本引导的分子设计是人工智能驱动药物发现的关键能力，但在严格的化学约束下，将顺序自然语言指令映射到非线性分子结构仍然具有挑战性。现有的大多数方法，包括RAG、CoT提示、微调或强化学习，强调在很大程度上是一种一次性生成管道中实施的一小部分特定推理视角。相比之下，现实世界的药物发现依赖于动态的多视角批评和迭代的精炼，以调和语义意图与结构可行性。基于此，我们提出了Mol-Debate，这是一种生成范式，通过迭代生成-辩论-精炼循环实现这种动态推理。我们进一步描述了该范式中的关键挑战，并通过面向视角的协调来解决这些问题，包括开发者-辩论者冲突、全局-局部结构推理以及静态-动态整合。实验表明，Mol-Debate在强大的通用和化学基准测试中实现了最先进的性能，在ChEBI-20上达到59.82%的精确匹配率，在S$^2$-Bench上达到50.52%的加权成功率。我们的代码可在https://github.com/wyuzh/Mol-Debate获取。

cs.AI / 38 / 2604.20261

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

基于记忆增强的大型语言模型的多智能体系统用于表格数据的自动特征生成

Dong, Fengxian, Zheng, Zhi, Han, Xiao, Chen, Wei, Ruan, Jingqing, Xu, Tong, Chen, Yong, Chen, Enhong

Abstract

Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory-Augmented LLM-based Multi-Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state-of-the-art baselines demonstrate the effectiveness of our approach. The code is available at https://github.com/fxdong24/MALMAS

Chinese Translation

自动特征生成从原始表格数据中提取信息特征，无需人工干预，对于准确且具有普适性的机器学习至关重要。传统方法依赖于预定义的操作符库，无法利用任务语义，限制了其为复杂任务生成多样化、高价值特征的能力。最近基于大型语言模型（LLM）的方法引入了更丰富的语义信号，但由于固定的生成模式和缺乏来自学习目标的反馈，仍然面临特征空间受限的问题。为了解决这些挑战，我们提出了一种基于记忆增强的LLM多智能体系统（MALMAS）用于自动特征生成。MALMAS将生成过程分解为具有不同职责的智能体，并通过路由智能体在每次迭代中激活适当的智能体子集，进一步拓宽特征空间的探索。我们还集成了一个包含程序记忆、反馈记忆和概念记忆的记忆模块，使得迭代精炼能够自适应地指导后续特征生成，提高特征的质量和多样性。在多个公共数据集上进行的大量实验与最先进的基线进行对比，证明了我们方法的有效性。代码可在 https://github.com/fxdong24/MALMAS 获取。

cs.AI / 39 / 2604.20273

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

ActuBench：用于生成和评估精算推理任务的多智能体 LLM 管道

Schmidt, Jan-Philipp

Abstract

We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The items, per-model responses and complete leaderboard are published as a browsable web interface at https://actubench.de/en/, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks -- 100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge -- and report three headline findings. First, multi-agent verification is load-bearing: the independent verifier flags a majority of drafted items on first pass, most of which the one-shot repair loop resolves. Second, locally-hosted open-weights inference sits on the cost-performance Pareto front: a Gemma 4 model running on consumer hardware and a Cerebras-hosted 120B open-weights model dominate the near-zero-cost region, with the latter within one item of the top of the leaderboard. Third, MCQ and LLM-as-Judge rankings differ meaningfully: the MCQ scaffold inflates the performance ceiling, and Judge-mode evaluation is needed to discriminate at the frontier.

Chinese Translation

我们提出了 ActuBench，这是一个多智能体 LLM 管道，用于自动生成和评估与国际精算协会（IAA）教育大纲相一致的高级精算评估项目。该管道通过适配器将四个 LLM 角色分开：一个智能体负责起草项目，一个构建干扰项，第三个独立验证这两个阶段并驱动有限的一次性修复循环，而一个成本优化的辅助智能体处理维基百科笔记的摘要和主题标签。项目、每个模型的响应和完整的排行榜以可浏览的网页界面发布在 https://actubench.de/en/，允许读者和从业者在不需要检出存储库的情况下检查单个项目。我们对来自八个提供商的 50 个语言模型在两个互补基准上进行了评估——100 个经验上最难的多项选择项目和 100 个由 LLM 评审的开放式项目，并报告了三项主要发现。首先，多智能体验证是负载承载的：独立验证者在第一次检查中标记了大多数起草的项目，其中大部分通过一次性修复循环得以解决。其次，本地托管的开放权重推理位于成本-性能的帕累托前沿：在消费级硬件上运行的 Gemma 4 模型和一个 Cerebras 托管的 120B 开放权重模型主导了近乎零成本区域，后者距离排行榜顶部仅一步之遥。第三，多项选择题（MCQ）和 LLM 作为评审的排名存在显著差异：多项选择题框架抬高了性能上限，而评审模式的评估是区分前沿表现所必需的。
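
The cost-performance Pareto front mentioned in the ActuBench abstract can be computed with a few lines of code. The sketch below assumes each model is summarized by a `(name, cost, score)` tuple where lower cost and higher score are better; the function name and the tuple layout are illustrative, not taken from the paper.

```python
def pareto_front(models):
    """Return names of models that no other model dominates.

    models: list of (name, cost, score) tuples.
    A model is dominated if another one is at least as cheap and at least
    as accurate, and strictly better on one of the two axes.
    """
    front = []
    for name, cost, score in models:
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in models
        )
        if not dominated:
            front.append(name)
    return front
```

A model sits on the front when improving one axis would require giving something up on the other, which is the sense in which the abstract says near-zero-cost open-weights models "dominate" their region.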

cs.AI / 40 / 2604.20300

FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory

FSFM：一种生物启发的代理记忆选择性遗忘框架

Gu, Yingjie, Xiong, Bo, Guo, Yijuan, Li, Chao, Zhang, Xiaojing, Wang, Liqiang, Ren, Pengcheng, Sun, Qi, Ma, Jingyao, Shi, Shidang

Abstract

For LLM agents, memory management critically impacts efficiency, quality, and security. While much research focuses on retention, selective forgetting--inspired by human cognitive processes (hippocampal indexing/consolidation theory and Ebbinghaus forgetting curve)--remains underexplored. We argue that in resource-constrained environments, a well-designed forgetting mechanism is as crucial as remembering, delivering benefits across three dimensions: (1) efficiency via intelligent memory pruning, (2) quality by dynamically updating outdated preferences and context, and (3) security through active forgetting of malicious inputs, sensitive data, and privacy-compromising content. Our framework establishes a taxonomy of forgetting mechanisms: passive decay-based, active deletion-based, safety-triggered, and adaptive reinforcement-based. Building on advances in LLM agent architectures and vector databases, we present detailed specifications, implementation strategies, and empirical validation from controlled experiments. Results show significant improvements: access efficiency (+8.49%), content quality (+29.2% signal-to-noise ratio), and security performance (100% elimination of security risks). Our work bridges cognitive neuroscience and AI systems, offering practical solutions for real-world deployment while addressing ethical and regulatory compliance. The paper concludes with challenges and future directions, establishing selective forgetting as a fundamental capability for next-generation LLM agents operating in real-world, resource-constrained scenarios. Our contributions align with AI-native memory systems and responsible AI development.

Chinese Translation

对于大型语言模型（LLM）代理而言，记忆管理对效率、质量和安全性具有重要影响。尽管许多研究集中于记忆保持，选择性遗忘——受到人类认知过程（海马体索引/巩固理论和艾宾浩斯遗忘曲线）的启发——仍然未得到充分探索。我们认为，在资源受限的环境中，设计良好的遗忘机制与记忆同样重要，能够在三个维度上带来益处：（1）通过智能记忆修剪提高效率，（2）通过动态更新过时的偏好和上下文提高质量，以及（3）通过主动遗忘恶意输入、敏感数据和侵犯隐私内容提高安全性。我们的框架建立了一种遗忘机制的分类法：基于被动衰减、基于主动删除、基于安全触发和基于自适应强化。基于LLM代理架构和向量数据库的进展，我们提供了详细的规范、实施策略以及来自受控实验的实证验证。结果显示显著改善：访问效率提高（+8.49%）、内容质量提升（信噪比提高29.2%）和安全性能（100%消除安全风险）。我们的工作架起了认知神经科学与人工智能系统之间的桥梁，提供了现实世界部署的实用解决方案，同时应对伦理和监管合规问题。论文最后讨论了挑战和未来方向，确立选择性遗忘作为下一代LLM代理在现实世界资源受限场景中操作的基本能力。我们的贡献与人工智能原生记忆系统及负责任的人工智能发展相一致。
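
Passive, decay-based forgetting of the kind the FSFM abstract lists can be sketched with the commonly cited exponential form of the Ebbinghaus curve, R = exp(-t / S). This is an assumption for illustration: the abstract does not state which functional form, memory fields, or threshold the authors use, so `strength`, `last_access`, and `threshold` below are hypothetical.

```python
import math

def retention(elapsed, strength):
    """Exponential forgetting curve R = exp(-elapsed / strength)."""
    return math.exp(-elapsed / strength)

def prune(memories, now, threshold=0.2):
    """Keep only memories whose estimated retention is still above threshold.

    memories: list of dicts with 'last_access' (time) and 'strength' (> 0),
              both in the same time unit as `now`.
    """
    return [
        m for m in memories
        if retention(now - m["last_access"], m["strength"]) >= threshold
    ]
```

Raising `strength` when a memory is reused would slow its decay, which is one simple way to connect this passive rule to the reinforcement-based category in the abstract's taxonomy.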

cs.AI / 41 / 2604.20413

Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

行动前的自我意识：通过主动的认知意识减轻逻辑惯性

Fan, Fulong, Liu, Peilin, Liu, Fengzhe, Yang, Shuyan, Yan, Gang

Abstract

Large language models perform well on many reasoning tasks, yet they often lack awareness of whether their current knowledge or reasoning state is complete. In non-interactive puzzle settings, the narrative is fixed and the underlying structure is hidden; once a model forms an early hypothesis under incomplete premises, it can propagate that error throughout the reasoning process, leading to unstable conclusions. To address this issue, we propose SABA, a reasoning framework that explicitly introduces self-awareness of missing premises before making the final decision. SABA formulates reasoning as a recursive process that alternates between structured state construction and obstacle resolution: it first applies Information Fusion to consolidate the narrative into a verifiable base state, and then uses Query-driven Structured Reasoning to identify and resolve missing or underspecified premises by turning them into queries and progressively completing the reasoning state through hypothesis construction and state refinement. Across multiple evaluation metrics, SABA achieves the best performance on all three difficulty splits of the non-interactive Detective Puzzle benchmark, and it also maintains leading results on multiple public benchmarks.

Chinese Translation

大型语言模型在许多推理任务中表现良好，但它们往往缺乏对当前知识或推理状态是否完整的意识。在非交互式的谜题环境中，叙事是固定的，潜在结构是隐藏的；一旦模型在不完整的前提下形成早期假设，就可能在整个推理过程中传播该错误，从而导致不稳定的结论。为了解决这个问题，我们提出了 SABA，一个推理框架，它在做出最终决策之前明确引入对缺失前提的自我意识。SABA 将推理形式化为一个递归过程，该过程在结构化状态构建和障碍解决之间交替进行：它首先应用信息融合（Information Fusion）将叙事整合为一个可验证的基础状态，然后使用基于查询的结构化推理（Query-driven Structured Reasoning）来识别和解决缺失或不明确的前提，通过将其转化为查询，并通过假设构建和状态细化逐步完成推理状态。在多个评估指标上，SABA 在非交互式侦探谜题基准的所有三个难度分割中实现了最佳性能，并且在多个公共基准上也保持领先结果。

cs.AI / 42 / 2604.20441

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

MedSkillAudit：一种针对医学研究代理技能的领域特定审计框架

Hou, Yingyong, Lao, Xinyuan, Wang, Huimei, Yao, Qianyu, Chen, Wei, Huang, Bocheng, Sun, Fei, Lv, Yuxian, Lei, Weiqi, Wen, Xueqian, Xia, Pengfei, Tan, Zhujun, Xie, Shengyang

Abstract

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.

Chinese Translation

背景：代理技能作为模块化、可重用的能力单元在人工智能代理系统中越来越多地被应用。医学研究代理技能需要超越通用评估的保障，包括科学诚信、方法有效性、可重复性和边界安全性。本研究开发并初步评估了一种针对医学研究代理技能的领域特定审计框架，重点关注与专家评审的一致性。方法：我们开发了MedSkillAudit（[email protected]），这是一个分层框架，用于在部署前评估技能发布的准备情况。我们评估了五个医学研究类别中的75项技能（每个类别15项）。两位专家独立地为每项技能分配了质量分数（0-100）、一个顺序发布处置（生产就绪/有限发布/仅限测试/拒绝）和一个高风险失败标志。系统与专家之间的一致性通过ICC(2,1)和线性加权Cohen's kappa进行量化，并与人类评分者的基线进行基准比较。结果：平均共识质量分数为72.4（标准差=13.0）；57.3%的技能低于有限发布阈值。MedSkillAudit的ICC(2,1)为0.449（95%置信区间：0.250-0.610），超过了人类评分者的ICC 0.300。系统共识分数的差异（标准差=9.5）小于专家之间的差异（标准差=12.4），且没有方向性偏差（Wilcoxon p = 0.613）。协议设计在类别层面显示出最强的一致性（ICC = 0.551）；学术写作显示出负ICC（-0.567），反映出结构性评分标准与专家之间的不匹配。结论：领域特定的预部署审计可能为管理医学研究代理技能提供一个实用的基础，补充通用质量检查，并结合针对科学用例的结构化审计工作流程。
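
The agreement statistic named in this abstract, ICC(2,1), can be computed from a two-way ANOVA decomposition of an items-by-raters score table. The following is a minimal sketch of the standard two-way random-effects, absolute-agreement, single-rater formula; the function name `icc_2_1` and the demo ratings are illustrative assumptions and are not taken from the paper.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item, one column per rater.
    """
    n = len(ratings)          # number of rated items
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Illustrative ratings only (six items, four raters), not data from the paper.
demo = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
        [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
print(round(icc_2_1(demo), 3))
```

Because the rater (column) variance appears in the denominator, a systematic offset between raters lowers ICC(2,1) even when their rankings agree, which is why it is a stricter measure than a consistency-only ICC.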

cs.AI / 43 / 2604.20545

Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems

测量机器：将生成性人工智能评估为多元社会技术系统

Johnson, Rebecca L.

Abstract

In measurement theory, instruments do not simply record reality; they help constitute what is observed. The same holds for generative AI evaluation: benchmarks do not just measure, they shape what models appear to be. Functionalist benchmarks treat models as isolated predictors, while prescriptive approaches assess what systems ought to be. Both obscure the sociotechnical processes through which meaning and values are enacted, risking the reification of narrow cultural perspectives in pluralist contexts. This thesis advances a descriptive alternative. It argues that generative AI must be evaluated as a pluralist sociotechnical system and develops Machine-Society-Human (MaSH) Loops, a framework for tracing how models, users, and institutions recursively co-construct meaning and values. Evaluation shifts from judging outputs to examining how values are enacted in interaction. Three contributions follow. Conceptually, MaSH Loops reframes evaluation as a recursive, enactive process. Methodologically, the World Values Benchmark introduces a distributional approach grounded in World Values Survey data, structured prompt sets, and anchor-aware scoring. Empirically, the thesis demonstrates these through two cases: value drift in early GPT-3 and sociotechnical evaluation in real estate. A final chapter draws on participatory realism to argue that prompting and evaluation are constitutive interventions, not neutral observations. The thesis argues that static benchmarks are insufficient for generative AI. Responsible evaluation requires pluralist, process-oriented frameworks that make visible whose values are enacted. Evaluation is therefore a site of governance, shaping how AI systems are understood, deployed, and trusted.

Chinese Translation

在测量理论中，工具不仅仅是记录现实；它们帮助构成所观察到的事物。生成性人工智能的评估同样如此：基准不仅仅是测量，它们塑造了模型的表象。功能主义基准将模型视为孤立的预测器，而规范性方法则评估系统应有的状态。这两者都模糊了通过社会技术过程实现意义和价值的方式，风险在于在多元背景中固化狭隘的文化视角。本文提出了一种描述性替代方案。它主张生成性人工智能必须作为多元社会技术系统进行评估，并发展了机器-社会-人类（Machine-Society-Human，MaSH）循环，这是一种追踪模型、用户和机构如何递归共同构建意义和价值的框架。评估的重点从判断输出转向审视价值如何在互动中得以实现。接下来有三个贡献。从概念上讲，MaSH循环将评估重新框架为递归的、动态的过程。从方法论上讲，世界价值基准（World Values Benchmark）引入了一种基于世界价值调查数据的分布式方法，结构化的提示集和关注锚点的评分。从实证上讲，本文通过两个案例展示了这些：早期GPT-3中的价值漂移和房地产中的社会技术评估。最后一章借鉴参与式现实主义，主张提示和评估是构成性干预，而非中立观察。因此，本文认为静态基准不足以应对生成性人工智能。负责任的评估需要多元的、过程导向的框架，使得被实现的价值得以显现。因此，评估成为治理的一个场所，塑造了人们对人工智能系统的理解、部署和信任。

cs.AI / 44 / 2604.20601

Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning

基于目标条件强化学习的自指导计划提取用于遵循指令的任务

Volovikova, Zoya, Sorokin, Nikita, Lukashevskiy, Dmitriy, Panov, Aleksandr, Skrynnik, Alexey

Abstract

We introduce SuperIgor, a framework for instruction-following tasks. Unlike prior methods that rely on predefined subtasks, SuperIgor enables a language model to generate and refine high-level plans through a self-learning mechanism, reducing the need for manual dataset annotation. Our approach involves iterative co-training: an RL agent is trained to follow the generated plans, while the language model adapts and modifies these plans based on RL feedback and preferences. This creates a feedback loop where both the agent and the planner improve jointly. We validate our framework in environments with rich dynamics and stochasticity. Results show that SuperIgor agents adhere to instructions more strictly than baseline methods, while also demonstrating strong generalization to previously unseen instructions.

Chinese Translation

我们介绍了SuperIgor，一个用于遵循指令任务的框架。与依赖于预定义子任务的先前方法不同，SuperIgor使语言模型能够通过自学习机制生成和完善高层次计划，从而减少对手动数据集标注的需求。我们的方法涉及迭代的共同训练：一个强化学习（RL）代理被训练以遵循生成的计划，而语言模型则根据RL反馈和偏好调整和修改这些计划。这创造了一个反馈循环，使得代理和规划者共同改进。我们在具有丰富动态和随机性的环境中验证了我们的框架。结果表明，SuperIgor代理比基线方法更严格地遵循指令，同时在面对以前未见过的指令时也表现出强大的泛化能力。

cs.AI / 45 / 2604.20622

pAI/MSc: ML Theory Research with Humans on the Loop

pAI/MSc：与人类协作的机器学习理论研究

Abdelmoneum, Mahmoud, Beneventano, Pierfrancesco, Poggio, Tomaso

Abstract

We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

Chinese Translation

我们提出了pAI/MSc，这是一个开源、可定制、模块化的多智能体系统，用于学术研究工作流程。我们的目标不是实现自主科学构思，也不是完全自动化的研究。我们的目标更为狭窄且实用：显著减少将特定假设转化为基于文献、数学确立、实验支持、面向投稿的手稿草稿所需的人为引导。pAI/MSc的构建目前重点关注机器学习理论及相关的定量领域。

cs.AI / 46 / 2604.20651

CHORUS: An Agentic Framework for Generating Realistic Deliberation Data

CHORUS：生成真实论辩数据的代理框架

Koursaris, A., Domalis, G., Apostolopoulou, A., Kanaris, K., Tsakalidis, D., Livieris, I. E.

Abstract

Understanding the intricate dynamics of online discourse depends on large-scale deliberation data, a resource that remains scarce across interactive web platforms due to restrictive accessibility policies, ethical concerns and inconsistent data quality. In this paper, we propose Chorus, an agentic framework that orchestrates LLM-powered actors with behaviorally consistent personas to generate realistic deliberation discussions. Each actor is governed by an autonomous agent equipped with memory of the evolving discussion, while participation timing is governed by a principled Poisson process-based temporal model, which approximates the heterogeneous engagement patterns of real users. The framework is further supported by structured tool usage, enabling actors to access external resources and facilitating integration with interactive web platforms. The framework was deployed on the Deliberate platform and evaluated by 30 expert participants across three dimensions: content realism, discussion coherence and analytical utility, confirming Chorus as a practical tool for generating high-quality deliberation data suitable for online discourse analysis.

Chinese Translation

理解在线话语的复杂动态依赖于大规模的论辩数据，而这一资源在互动网络平台上仍然稀缺，原因包括限制性的可访问政策、伦理问题和不一致的数据质量。本文提出了CHORUS，一个代理框架，它协调由大型语言模型（LLM）驱动的行为一致的人物，以生成真实的论辩讨论。每个参与者由一个自主代理控制，该代理具备对不断演变的讨论的记忆，而参与时间则由一个基于原则的泊松过程的时间模型所决定，该模型近似真实用户的异质参与模式。该框架进一步支持结构化工具的使用，使参与者能够访问外部资源，并促进与互动网络平台的集成。该框架已在Deliberate平台上部署，并通过30名专家参与者在内容真实感、讨论连贯性和分析效用三个维度进行了评估，确认CHORUS作为生成适合在线话语分析的高质量论辩数据的实用工具。
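
The Poisson process-based participation timing mentioned in the abstract can be illustrated with a short simulation. This is a minimal sketch assuming a homogeneous Poisson process per actor; the persona names, rates, and the helper `sample_post_times` are hypothetical, and the paper's actual temporal model may be more elaborate.

```python
import random

def sample_post_times(rate_per_hour, horizon_hours, rng):
    """Sample event times of a homogeneous Poisson process on (0, horizon]."""
    times, t = [], 0.0
    while True:
        # Gaps between events of a Poisson process are exponential
        # with the same rate.
        t += rng.expovariate(rate_per_hour)
        if t > horizon_hours:
            return times
        times.append(t)

rng = random.Random(0)
# Hypothetical personas with different engagement rates (posts per hour).
rates = {"frequent_poster": 4.0, "occasional_poster": 1.0, "lurker": 0.2}
# Merge every actor's events into one time-ordered discussion schedule.
schedule = sorted((t, name) for name, rate in rates.items()
                  for t in sample_post_times(rate, 6.0, rng))
for t, name in schedule[:5]:
    print(f"{t:5.2f} h  {name}")
```

Giving each persona its own rate is one simple way to obtain heterogeneous engagement: high-rate actors dominate the merged schedule while low-rate actors appear only occasionally.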

cs.AI / 47 / 2604.20652

Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure

大型语言模型在欺诈检测和抵御动机投资者压力方面优于人类

Powdthavee, Nattavudh

Abstract

Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.

Chinese Translation

基于人类反馈训练的大型语言模型可能在投资者已经被说服相信某个欺诈机会时抑制欺诈警告。我们在一项预注册实验中测试了这一点，该实验涵盖了七个领先的LLM和十二个投资场景，包括合法、高风险和客观欺诈的机会，结合了3360次AI咨询对话和1201名参与者的人类基准。与预测相反，动机投资者框架并没有抑制AI的欺诈警告；如果有的话，反而略微增加了这些警告。在不到千分之三的观察中发生了认可反转。人类顾问在基线率下对欺诈投资的认可率为13-14%，而所有LLM的认可率为0%，并且在压力下的警告抑制率是AI的两到四倍。目前，AI系统在相同咨询角色中提供的欺诈警告比普通人类更为一致。

cs.AI / 48 / 2604.20711

Participatory provenance as representational auditing for AI-mediated public consultation

作为代表性审计的参与性来源追踪在人工智能介导的公众咨询中的应用

Mahajan, Sachit

Abstract

Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation ($n = 5{,}253$ respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline ($-9.1\%$ and $-8.0\%$ coverage degradation), with $16.9\%$ and $15.3\%$ of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI ($33$-$88\%$ exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.

Chinese Translation

人工智能越来越多地被用于综合大规模公众意见，以支持政策咨询和参与性过程。然而，目前尚无正式框架来审计这些摘要是否忠实地代表了源人群，这一问责缺口现有的人工智能可解释性、基础和幻觉检测方法并未解决，因为它们关注的是输出质量而非输入保真度。在此，提出了参与性来源追踪：一个基于最优运输理论、因果推断和语义分析的测量框架，跟踪个别公众提交如何通过人工智能介导的摘要过程被转化、过滤或丢失。该框架应用于加拿大2025-2026国家人工智能战略咨询（$n = 5{,}253$名参与者，涵盖两个独立的政策主题），结果显示，官方政府摘要的表现均低于随机参与者基线（覆盖率下降分别为$-9.1\%$和$-8.0\%$），有$16.9\%$和$15.3\%$的参与者被有效排除。排除现象集中在表达异议、怀疑和对人工智能的批评的群体中（排除率为$33$-$88\%$）。摘要的简洁性、语义孤立性和修辞风格独立预测了代表性结果。一个配套的开源互动工具——共创来源实验室（Co-creation Provenance Lab），使政策制定者能够审计并迭代改进摘要，从而在大规模上建立真正的人类参与监督。

cs.AI / 49 / 2604.20714

Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

学习进化：通过文本参数图优化实现多智能体系统的自我改进框架

He, Shan, Wang, Runze, Du, Zhuoyun, Bai, Huiyu, Cao, Zouying, Cheng, Yu, Zheng, Bo

Abstract

Designing and optimizing multi-agent systems (MAS) is a complex, labor-intensive process of "Agent Engineering." Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive "textual gradients," structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP-Universe show that TPGO significantly enhances the performance of state-of-the-art agent frameworks, achieving higher success rates through automated, self-improving optimization.

Chinese Translation

设计和优化多智能体系统（MAS）是一个复杂且劳动密集的“智能体工程”过程。现有的自动优化方法主要集中于平面提示调优，缺乏对MAS中复杂交互网络的结构性认知。更重要的是，这些优化器是静态的；它们无法从经验中学习以改善自身的优化策略。为了解决这些问题，我们提出了文本参数图优化（TPGO），一个使多智能体系统能够学习进化的框架。TPGO首先将MAS建模为一个文本参数图（TPG），其中智能体、工具和工作流是模块化的、可优化的节点。为了指导进化，我们从执行轨迹中推导出“文本梯度”，这是一种结构化的自然语言反馈，用于定位故障并建议细粒度的修改。我们框架的核心是群体相对智能体优化（GRAO），这是一种新颖的元学习策略，能够从历史优化经验中学习。通过分析过去的成功与失败，GRAO逐渐提高提出有效更新的能力，使系统能够学习如何自我优化。在GAIA和MCP-Universe等复杂基准上的大量实验表明，TPGO显著提升了最先进智能体框架的性能，通过自动化的自我改进优化实现了更高的成功率。

cs.AI / 50 / 2604.20728

Interval POMDP Shielding for Imperfect-Perception Agents

不完美感知代理的区间 POMDP 防护

Scarbro, William, Mangal, Ravi

Abstract

Autonomous systems that rely on learned perception can make unsafe decisions when sensor readings are misclassified. We study shielding for this setting: given a proposed action, a shield blocks actions that could violate safety. We consider the common case where system dynamics are known but perception uncertainty must be estimated from finite labeled data. From these data we build confidence intervals for the probabilities of perception outcomes and use them to model the system as a finite Interval Partially Observable Markov Decision Process with discrete states and actions. We then propose an algorithm to compute a conservative set of beliefs over the underlying state that is consistent with the observations seen so far. This enables us to construct a runtime shield that comes with a finite-horizon guarantee: with high probability over the training data, if the true perception uncertainty rates lie within the learned intervals, then every action admitted by the shield satisfies a stated lower bound on safety. Experiments on four case studies show that our shielding approach (and variants derived from it) improves the safety of the system over state-of-the-art baselines.

Chinese Translation

依赖于学习感知的自主系统在传感器读数被错误分类时可能做出不安全的决策。我们研究了这一环境下的防护机制：给定一个建议的动作，防护机制会阻止那些可能违反安全性的动作。我们考虑一个常见的情况，即系统动态是已知的，但感知的不确定性必须从有限的标记数据中进行估计。基于这些数据，我们构建了感知结果概率的置信区间，并利用这些区间将系统建模为一个具有离散状态和动作的有限区间部分可观察马尔可夫决策过程（Interval POMDP）。然后，我们提出了一种算法，用于计算一个保守的信念集，该信念集与迄今为止观察到的情况一致。这使我们能够构建一个运行时防护机制，该机制具有有限时间范围的保证：在训练数据上以高概率，如果真实的感知不确定性率位于学习到的区间内，则防护机制允许的每个动作都满足安全性的下限要求。对四个案例研究的实验表明，我们的防护方法（及其衍生的变体）在安全性方面优于最先进的基线。
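
The step of turning finite labeled data into confidence intervals for perception-outcome probabilities can be sketched as follows. The abstract does not say which interval construction is used; this sketch assumes Hoeffding's inequality with a union bound, and the counts and the helper `hoeffding_interval` are hypothetical.

```python
import math

def hoeffding_interval(successes, trials, delta):
    """Interval that contains the true probability with probability
    at least 1 - delta, by Hoeffding's inequality."""
    p_hat = successes / trials
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * trials))
    # Clip to [0, 1] because the quantity is a probability.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical labeled data: what the perception module reported on 200
# samples whose true state was "obstacle".
counts = {"obstacle": 180, "clear": 20}
total = sum(counts.values())
# Split the failure budget across outcomes (union bound) so that all
# intervals hold jointly with probability at least 0.95.
delta = 0.05 / len(counts)
intervals = {obs: hoeffding_interval(c, total, delta)
             for obs, c in counts.items()}
for obs, (lo, hi) in intervals.items():
    print(f"P(report={obs} | state=obstacle) in [{lo:.3f}, {hi:.3f}]")
```

Intervals of this kind are what make the resulting model an interval POMDP: each observation probability is known only to lie between a lower and an upper bound, and the shield must be safe for every value in that range.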

cs.AI / 51 / 2604.20744

AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

AAC：基于架构可接受的可微分地标压缩用于ALT

Le, An T., Ngo, Vien

Abstract

We introduce AAC (Architecturally Admissible Compressor), a differentiable landmark-selection module for ALT (A*, Landmarks, and Triangle inequality) shortest-path heuristics whose outputs are admissible by construction: each forward pass is a row-stochastic mixture of triangle-inequality lower bounds, so the heuristic is admissible for every parameter setting without requiring convergence, calibration, or projection. At deployment, the module reduces to classical ALT on a learned subset, composing end-to-end with neural encoders while preserving the classical toolchain. The construction is the first differentiable instance of the compress-while-preserving-admissibility tradition in classical heuristic search. Under a matched per-vertex memory protocol, we establish that ALT with farthest-point-sampling landmarks (FPS-ALT) has provably near-optimal coverage on metric graphs, leaving at most a few percentage points of headroom for any selector. AAC operates near this ceiling: the gap is $0.9$--$3.9$ percentage points on 9 road networks and ${\leq}1.3$ percentage points on synthetic graphs, with zero admissibility violations across $1{,}500+$ queries and all logged runs. At matched memory, AAC is also $1.2$--$1.5{\times}$ faster than FPS-ALT at the median query on DIMACS road networks, amortizing its offline cost within $170$--$1{,}924$ queries. A controlled ablation isolates the binding constraint: training-objective drift under default initialization, not architectural capacity; identity-on-first-$m$ initialization closes the expansion-count gap entirely. We release the module, a reusable matched-memory benchmarking protocol with paired two-one-sided-test (TOST) equivalence and pre-registration, and a reference compressed-differential-heuristics baseline.

Chinese Translation

我们介绍了AAC（架构可接受压缩器），这是一个用于ALT（A*、地标和三角不等式）最短路径启发式的可微分地标选择模块，其输出在构造上是可接受的：每次前向传播都是三角不等式下界的行随机混合，因此该启发式在每个参数设置下都是可接受的，无需收敛、校准或投影。在部署时，该模块在学习的子集上简化为经典的ALT，与神经编码器端到端组合，同时保留经典工具链。该构造是经典启发式搜索中在保持可接受性的传统下的第一个可微分实例。在匹配的每顶点内存协议下，我们证明了使用最远点采样地标（FPS-ALT）的ALT在度量图上具有可证明的近最优覆盖，留给任何选择器最多几个百分点的余地。AAC在这一上限附近运行：在9个道路网络上，差距为$0.9$--$3.9$个百分点，在合成图上为${\leq}1.3$个百分点，在$1{,}500+$个查询和所有记录的运行中没有可接受性违规。在匹配内存下，AAC在DIMACS道路网络上的中位查询速度也比FPS-ALT快$1.2$--$1.5{\times}$，在$170$--$1{,}924$个查询内摊销其离线成本。一个受控消融实验隔离了约束条件：在默认初始化下训练目标的漂移，而非架构能力；对前$m$个采用恒等初始化（identity-on-first-$m$）可完全消除扩展计数差距。我们发布了该模块、一个可重用的匹配内存基准协议，配有配对的双单侧检验（TOST）等价性和预注册，以及一个参考压缩差分启发式基线。
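
The admissibility-by-construction argument can be checked on a toy graph. Each landmark L gives the triangle-inequality lower bound |d(L, t) - d(L, u)| <= d(u, t), and any convex combination of lower bounds is itself a lower bound. The sketch below assumes an undirected graph and a single softmax-weighted row; the graph, the logits, and the function `mixture_heuristic` are illustrative and not the released module.

```python
import heapq
import math

def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted graph."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_heuristic(landmark_dists, logits, u, t):
    """Convex combination of per-landmark lower bounds on d(u, t)."""
    weights = softmax(logits)  # non-negative and summing to one
    bounds = [abs(d[t] - d[u]) for d in landmark_dists]
    return sum(w * b for w, b in zip(weights, bounds))

# Small undirected weighted graph (illustrative).
edges = [("a", "b", 2.0), ("b", "c", 2.0), ("c", "d", 1.0),
         ("a", "d", 6.0), ("b", "d", 4.0)]
graph = {v: [] for e in edges for v in e[:2]}
for u, v, w in edges:
    graph[u].append((v, w))
    graph[v].append((u, w))

landmark_dists = [dijkstra(graph, "a"), dijkstra(graph, "d")]
true_dist = {s: dijkstra(graph, s) for s in graph}
logits = [0.3, -1.2]  # arbitrary, untrained parameters

# The heuristic never overestimates, whatever the logits are.
worst = max(mixture_heuristic(landmark_dists, logits, u, t) - true_dist[u][t]
            for u in graph for t in graph)
print(worst <= 1e-9)
```

Since the bound holds for any weights on the probability simplex, gradient updates to the logits cannot break admissibility, which matches the abstract's claim that no convergence or projection step is needed.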

cs.AI / 52 / 2604.20749

Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation

何处与何物：在情境对话推荐中推理动态和隐含偏好

Lin, Dongding, Wang, Jian, Li, Yongqi, Li, Wenjie

Abstract

Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.

Chinese Translation

情境对话推荐（Situated Conversational Recommendation, SCR）利用特定环境中的视觉场景和自然语言对话来提供上下文适宜的推荐，因其与现实场景的紧密契合而成为一个有前景的研究方向。与传统推荐相比，SCR 需要对动态和隐含用户偏好有更深刻的理解，因为周围场景往往会影响用户的潜在兴趣，而这两者可能在对话中不断演变。这种复杂性显著影响了推荐的时机和相关性。为了解决这一问题，我们提出了情境偏好推理（Situated Preference Reasoning, SiPeR），这是一个新颖的框架，集成了两个核心机制：（1）场景转换估计，用于估计当前场景是否满足用户需求，并在必要时引导用户前往更合适的场景；（2）贝叶斯逆推理，利用多模态大型语言模型（Multimodal Large Language Models, MLLMs）的可能性来预测用户对场景中候选项目的偏好。在两个代表性基准上的大量实验表明，SiPeR 在推荐准确性和响应生成质量上均优于其他方法。代码和数据可在 https://github.com/DongdingLin/SiPeR 获取。
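
The Bayesian inverse inference step can be read as applying Bayes' rule over candidate items: the posterior preference for an item is proportional to the likelihood of the user's utterance given that item, times a prior. The sketch below assumes that reading; the item names and log-likelihood values are made up and stand in for scores an MLLM might assign.

```python
import math

def posterior_over_items(log_likelihoods, log_priors):
    """Bayes' rule in log space:
    P(item | utterance) is proportional to P(utterance | item) * P(item)."""
    scores = {item: log_likelihoods[item] + log_priors[item]
              for item in log_likelihoods}
    m = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return {item: math.exp(s - m) / z for item, s in scores.items()}

# Hypothetical values of log P(utterance | item) for items in the scene.
log_lik = {"desk lamp": -2.1, "floor lamp": -3.4, "bookshelf": -6.0}
# Uniform prior over the candidate items.
log_prior = {item: math.log(1.0 / len(log_lik)) for item in log_lik}

post = posterior_over_items(log_lik, log_prior)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Scoring the utterance under each candidate, instead of asking the model to name an item directly, lets a scene-dependent prior be combined with the likelihood in a principled way.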

cs.AI / 53 / 2604.20755

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

V-tableR1：基于过程监督的多模态表格推理与批评引导的策略优化

Jiang, Yubo, An, Yitong, Yang, Xin, Wuerkaixi, Abudukelimu, Cheng, Xuxin, Xie, Fengying, Jiang, Zhiguo, Liu, Cao, Zeng, Ke, Zhang, Haopeng

Abstract

We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline.

Chinese Translation

我们介绍了V-tableR1，这是一种基于过程监督的强化学习框架，旨在从多模态大型语言模型（MLLMs）中引发严谨、可验证的推理。目前，仅基于最终结果训练的MLLMs往往将视觉推理视为黑箱，依赖于表面的模式匹配，而不是进行严谨的多步骤推理。尽管带有可验证奖励的强化学习可以强制执行透明的推理轨迹，但将其扩展到视觉领域仍然受到将抽象逻辑嵌入连续像素空间的模糊性的严重制约。我们通过利用表格的确定性网格结构作为理想的视觉测试平台来解决这一问题。V-tableR1采用了一种专门的批评性视觉语言模型（critic VLM），为策略视觉语言模型（policy VLM）生成的明确视觉思维链提供密集的逐步反馈。为了优化该系统，我们提出了过程引导的直接对齐策略优化（PGPO），这是一种新颖的强化学习算法，结合了过程奖励、解耦的策略约束和基于长度的动态采样。广泛的评估表明，V-tableR1明确惩罚视觉幻觉和捷径猜测。通过将多模态推理从黑箱模式匹配根本转变为可验证的逻辑推导，V-tableR1在复杂表格基准测试中在开源模型中建立了最先进的准确性，超越了其大小最多可达18倍的模型，并在其SFT基线之上有所改进。

cs.AI / 54 / 2604.20779

SWE-chat: Coding Agent Interactions From Real Users in the Wild

SWE-chat：来自真实用户的编码代理交互数据

Baumann, Joachim, Padmakumar, Vishakh, Li, Xiang, Yang, John, Yang, Diyi, Koyejo, Sanmi

Abstract

AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.

Chinese Translation

人工智能编码代理正在大规模应用，但我们缺乏关于人们如何实际使用它们以及其输出在实践中有多大用处的实证证据。我们提出了SWE-chat，这是第一个从开放源代码开发者那里收集的真实编码代理会话的大规模数据集。该数据集目前包含6,000个会话，涵盖超过63,000个用户提示和355,000个代理工具调用。SWE-chat是一个动态数据集；我们的收集管道自动并持续地从公共代码库中发现和处理会话。利用SWE-chat，我们提供了对真实世界编码代理使用情况和失败模式的初步实证特征分析。我们发现编码模式呈双峰分布：在41%的会话中，代理几乎撰写了所有提交的代码（“氛围编码”），而在23%的会话中，人类完全自己编写代码。尽管能力迅速提升，编码代理在自然环境中仍然效率低下。只有44%的代理生成的代码最终被用户提交，而代理编写的代码引入的安全漏洞比人类编写的代码更多。此外，用户在44%的所有交互中对代理输出进行了反驳——通过更正、故障报告和中断。通过捕捉人类与代理代码作者归属的完整交互轨迹，SWE-chat为超越策划基准，朝着基于证据的理解AI代理在真实开发工作流中表现的方向奠定了实证基础。

cs.AI / 55 / 2604.20795

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

利用大型语言模型（LLMs）作为混合智能系统的外部记忆、验证和规划层的自动本体构建

Salovskii, Pavel, Gorshkova, Iuliia

Abstract

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.

Chinese Translation

本文提出了一种智能系统的混合架构，其中大型语言模型（LLMs）通过外部本体记忆层进行扩展。该方法不仅依赖于参数知识和基于向量的检索（RAG），而是使用RDF/OWL表示构建和维护结构化知识图谱，实现持久、可验证和语义基础的推理。核心贡献是一个自动化的本体构建管道，能够从异构数据源（包括文档、API和对话日志）中提取信息。该系统执行实体识别、关系提取、规范化和三元组生成，随后使用SHACL和OWL约束进行验证，并持续更新图谱。在推理过程中，LLMs在一个综合上下文中操作，该上下文将基于向量的检索与基于图的推理和外部工具交互相结合。对规划任务的实验观察，包括汉诺塔基准测试，表明本体增强在多步骤推理场景中相比基线LLM系统提高了性能。此外，本体层使生成输出的形式验证成为可能，将系统转变为生成-验证-修正的管道。所提出的架构解决了当前基于LLM的系统的关键局限性，包括缺乏长期记忆、结构理解能力弱和推理能力有限。它为构建需要持久知识、可解释性和可靠决策的基于代理的系统、机器人应用和企业人工智能解决方案提供了基础。

cs.AI / 56 / 2604.20811

Diagnosing CFG Interpretation in LLMs

诊断大型语言模型中的上下文无关文法解释

Li, Hanqi, Chen, Lu, Yu, Kai

Abstract

As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.

Chinese Translation

随着大型语言模型（LLMs）越来越多地被集成到自主系统中，它们必须遵循动态定义的、机器可解释的接口。我们将LLMs评估为上下文中的解释器：在给定新的上下文无关文法的情况下，LLMs能否生成语法有效、行为功能性和语义忠实的输出？我们引入了RoboGrid，一个通过对递归深度、表达复杂性和表面风格的控制压力测试来解构语法、行为和语义的框架。我们的实验揭示了一种一致的层次性退化：LLMs通常保持表面语法，但未能保持结构语义。尽管链式推理（CoT reasoning）提供了部分缓解，但在结构密度下，尤其是深递归和高分支时，性能崩溃，语义对齐在极端深度下消失。此外，“外星”词汇表揭示LLMs依赖于从关键词进行的语义引导，而非纯粹的符号归纳。这些发现指出了可靠的、与语法无关的智能体所需的层次状态跟踪中的关键缺口。
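
To make the syntactic-validity and recursion-depth axes concrete, here is a minimal sketch of a validator for a toy context-free grammar. The grammar (`program := stmt*`, `stmt := "move" | "turn" | "repeat" NUMBER "{" program "}"`) is a hypothetical stand-in and is not RoboGrid's actual language.

```python
def parse_program(tokens, pos=0, depth=0):
    """Parse stmt* starting at pos.

    Returns (next position, maximum nesting depth reached).
    """
    max_depth = depth
    while pos < len(tokens) and tokens[pos] != "}":
        tok = tokens[pos]
        if tok in ("move", "turn"):
            pos += 1
        elif tok == "repeat":
            # Expect: repeat NUMBER { program }
            if (pos + 2 >= len(tokens) or not tokens[pos + 1].isdigit()
                    or tokens[pos + 2] != "{"):
                raise SyntaxError(f"malformed repeat at token {pos}")
            pos, inner = parse_program(tokens, pos + 3, depth + 1)
            if pos >= len(tokens) or tokens[pos] != "}":
                raise SyntaxError("missing closing brace")
            pos += 1
            max_depth = max(max_depth, inner)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")
    return pos, max_depth

def check(source):
    """Return the nesting depth of a valid program, or raise SyntaxError."""
    tokens = source.split()
    pos, depth = parse_program(tokens)
    if pos != len(tokens):
        raise SyntaxError("unmatched closing brace")
    return depth

print(check("move repeat 2 { turn repeat 3 { move } }"))
```

A checker like this only covers surface syntax. Whether a generated program also does the right thing (behavior) and means what was asked (semantics) needs separate tests, which is the distinction the abstract draws.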

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.19762

Evidence of Layered Positional and Directional Constraints in the Voynich Manuscript: Implications for Cipher-Like Structure

关于沃伊尼奇手稿中分层位置和方向约束的证据：对密码样结构的启示

Parisel, Christophe

Abstract

The Voynich Manuscript (VMS) exhibits a script of uncertain origin whose grapheme sequences have resisted linguistic analysis. We present a systematic analysis of its grapheme sequences, revealing two complementary structural layers: a character-level right-to-left optimization in word-internal sequences and a left-to-right dependency at word boundaries, a directional dissociation not observed in any of our four comparison languages (English, French, Hebrew, Arabic). We further evaluate two classes of structured generator against a four-signature joint criterion: a parametric slot-based generator and a Cardan grille implementing Rugg's (2004) gibberish hypothesis. Across their full tested parameter spaces, neither class reproduces all four signatures simultaneously. While these results do not rule out generator classes we have not tested, they provide the first quantitative benchmarks against which any future generative or cryptanalytic model of the VMS can be evaluated, and they suggest that the VMS exhibits cipher-like structural constraints that are difficult to reproduce from simple positional or frequency-based mechanisms alone.

Chinese Translation

沃伊尼奇手稿（VMS）展现了一种来源不明的文字，其字形序列抵抗了语言学分析。我们对其字形序列进行了系统分析，揭示了两个互补的结构层次：在词内部序列中，字符级的从右到左优化，以及在词边界处的从左到右依赖，这种方向性分离在我们比较的四种语言（英语、法语、希伯来语、阿拉伯语）中未曾观察到。我们进一步评估了两类结构生成器，针对一个四签名的联合标准：一种基于参数的插槽生成器和一种实现Rugg（2004）胡言乱语假说的卡尔达诺格栅。在它们的完整测试参数空间中，任何一类都无法同时重现所有四个签名。虽然这些结果并未排除我们未测试的生成器类别，但它们提供了第一个定量基准，以供未来任何生成或密码分析模型对VMS的评估，并且它们表明VMS展现出类似密码的结构约束，这些约束仅通过简单的位置信息或基于频率的机制难以重现。

cs.CL / 2 / 2604.19764

Can We Locate and Prevent Stereotypes in LLMs?

我们能否定位并防止大型语言模型中的刻板印象？

D'Souza, Alex

Abstract

Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of these models, little is known about where these biases reside in the neural network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.

Chinese Translation

大型语言模型（LLMs）中的刻板印象可能会延续有害的社会偏见。尽管这些模型被广泛使用，但关于这些偏见在神经网络中的具体位置知之甚少。本研究探讨了GPT 2 Small和Llama 3.2的内部机制，以定位与刻板印象相关的激活。我们探索了两种方法：识别编码刻板印象的个体对比神经元激活，以及检测对偏见输出贡献较大的注意力头。我们的实验旨在绘制这些“偏见指纹”，并为减轻刻板印象提供初步见解。

cs.CL / 3 / 2604.19765

Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

幻觉神经元是否具有泛化能力？来自大语言模型跨领域迁移的证据

Vaddi, Snehit, Vaddi, Pujith

Abstract

Recent work identifies a sparse set of "hallucination neurons" (H-neurons), less than 0.1% of feed-forward network neurons, that reliably predict when large language models will hallucinate. These neurons are identified on general-knowledge question answering and shown to generalize to new evaluation instances. We ask a natural follow-up question: do H-neurons generalize across knowledge domains? Using a systematic cross-domain transfer protocol across 6 domains (general QA, legal, financial, science, moral reasoning, and code vulnerability) and 5 open-weight models (3B to 8B parameters), we find they do not. Classifiers trained on one domain's H-neurons achieve AUROC 0.783 within-domain but only 0.563 when transferred to a different domain (delta = 0.220, p < 0.001), a degradation consistent across all models tested. Our results suggest that hallucination is not a single mechanism with a universal neural signature, but rather involves domain-specific neuron populations that differ depending on the knowledge type being queried. This finding has direct implications for the deployment of neuron-level hallucination detectors, which must be calibrated per domain rather than trained once and applied universally.

Chinese Translation

近期研究识别出一组稀疏的“幻觉神经元”（H-neurons），其数量不到前馈网络神经元的0.1%，能够可靠地预测大型语言模型何时会产生幻觉。这些神经元是在一般知识问答任务中识别的，并且显示出能够泛化到新的评估实例。我们提出一个自然的后续问题：H-neurons是否能够跨知识领域泛化？通过在6个领域（一般问答、法律、金融、科学、道德推理和代码漏洞）和5个开放权重模型（参数从3B到8B）中使用系统的跨领域迁移协议，我们发现它们并不能泛化。在一个领域的H-neurons上训练的分类器在同一领域内的AUROC为0.783，但在转移到不同领域时仅为0.563（增量 = 0.220，p < 0.001），这一降级在所有测试的模型中是一致的。我们的结果表明，幻觉并不是一个具有普遍神经特征的单一机制，而是涉及特定于领域的神经元群体，这些群体根据查询的知识类型而有所不同。该发现对神经元级幻觉检测器的部署具有直接影响，这些检测器必须针对每个领域进行校准，而不是一次性训练后普遍应用。

cs.CL / 4 / 2604.19766

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

OThink-SRR1：基于强化学习的大型语言模型的搜索、精炼与推理

Liang, Haijian, Niu, Zenghao, Wu, Junjie, Zhang, Changwang, Zhou, Wangchunshu, Wang, Jun

Abstract

Retrieval-Augmented Generation (RAG) expands the knowledge of Large Language Models (LLMs), yet current static retrieval methods struggle with complex, multi-hop problems. While recent dynamic retrieval strategies offer improvements, they face two key challenges: 1) irrelevant retrieved noise can misdirect the reasoning process, and 2) processing full documents incurs prohibitive computational and latency costs. To address these issues, we propose OThink-SRR1, a framework that enhances large models with an iterative Search-Refine-Reason process trained via reinforcement learning. Its core Refine stage distills retrieved documents into concise, relevant facts before reasoning. We introduce GRPO-IR, an end-to-end reinforcement learning algorithm that rewards accurate evidence identification while penalizing excessive retrievals, thus training the model to be both focused and efficient. Experiments on four multi-hop QA benchmarks show our approach achieves superior accuracy over strong baselines while using fewer retrieval steps and tokens. This positions OThink-SRR1 as a potent foundational model for information-seeking agents.

Chinese Translation

检索增强生成（RAG）扩展了大型语言模型（LLMs）的知识，然而当前的静态检索方法在处理复杂的多跳问题时表现不佳。尽管最近的动态检索策略有所改进，但仍面临两个主要挑战：1）无关的检索噪声可能会误导推理过程，2）处理完整文档会产生高昂的计算和延迟成本。为了解决这些问题，我们提出了OThink-SRR1，一个通过强化学习训练的迭代搜索-精炼-推理框架，旨在增强大型模型。其核心精炼阶段在推理之前将检索到的文档提炼为简明且相关的事实。我们引入了GRPO-IR，一种端到端的强化学习算法，奖励准确的证据识别，同时惩罚过度检索，从而训练模型在聚焦和效率之间取得平衡。在四个多跳问答基准上的实验表明，我们的方法在使用更少的检索步骤和标记的情况下，达到了优于强基线的准确性。这使得OThink-SRR1成为信息检索代理的强大基础模型。

cs.CL / 5 / 2604.19768

Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models

超越所知的表达：量化大型语言模型中认知-修辞失调的框架

Bakhshi, Asim D.

Abstract

Large language models (LLMs) exhibit systematic miscalibration, with rhetorical intensity not proportionate to epistemic grounding. This study tests this hypothesis and proposes a framework for quantifying this decoupling by designing a triadic epistemic-rhetorical marker (ERM) taxonomy. The taxonomy is operationalized through composite metrics of form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE). Applied to 225 argumentative texts spanning approximately 0.6 million tokens across human expert, human non-expert, and LLM-generated sub-corpora, the framework identifies a consistent, model-agnostic LLM epistemic signature. LLM-generated texts produce tricolon at nearly twice the expert rate ($\Delta = 0.95$), while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. FMD is significantly elevated in LLM texts relative to both human groups ($p < 0.001, \Delta = 0.68$), and rhetorical devices are distributed significantly more uniformly across LLM documents. The findings are consistent with theoretical intuitions derived from Gricean pragmatics, Relevance Theory, and Brandomian inferentialism. The annotation pipeline is fully automatable, making it deployable as a lightweight screening tool for epistemic miscalibration in AI-generated content and as a theoretically motivated feature set for LLM-generated text detection pipelines.

Chinese Translation

大型语言模型（LLMs）表现出系统性的失调，其修辞强度与认知基础不成比例。本研究检验了这一假设，并提出了一个量化这种脱节的框架，通过设计三元认知-修辞标记（ERM）分类法来实现。该分类法通过形式-意义偏差（FMD）、真实与表现认知比率（GPR）和修辞手段分布熵（RDDE）的复合指标进行操作。应用于225篇涵盖约60万个词元（token）的论证文本，这些文本来自人类专家、人类非专家和LLM生成的子语料库，该框架识别出一致的、模型无关的LLM认知特征。LLM生成的文本以近两倍于专家的频率产生三联句（tricolon）（$\Delta = 0.95$），而人类作者产生修辞疑问句的频率则超过LLM的两倍。表现出犹豫的标记在LLM输出中出现的密度是人类的两倍。相较于两个人类群体，LLM文本中的FMD显著升高（$p < 0.001, \Delta = 0.68$），且修辞手段在LLM文档中的分布显著更为均匀。这些发现与源自格赖斯（Gricean）语用学、相关理论（Relevance Theory）和布兰登（Brandomian）推理主义的理论直觉一致。注释流程完全可以自动化，使其可作为轻量级筛查工具用于AI生成内容中的认知失调，并作为理论驱动的特征集用于LLM生成文本检测流程。

cs.CL / 6 / 2604.19769

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

TTKV：用于长上下文大语言模型推理的时间分层键值缓存

Dzikanyanga, Gradwell, Yang, Weihao, Huang, Hao, Wu, Donglei, Wang, Shihao, Xia, Wen, C, Sanjeeb K

Abstract

Key-value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV states as equally important across time, implicitly assuming uniform precision and accessibility. However, this assumption contrasts with human memory systems, where memories vary in clarity, recall frequency, and relevance with temporal proximity. Motivated by this insight, we propose TTKV, a KV cache management framework that maps the human memory system onto the KV cache. TTKV partitions the KV cache into temporal tiers with heterogeneous capacity and precision. The design addresses three aspects: (1) Tier Layout, decoupling fast and slow memory using HBM and DRAM; (2) Tier Content, assigning more recent KV states to faster, higher-precision tiers based on temporal proximity; and (3) Tier Interaction, employing block-wise streaming attention to overlap communication and computation when accessing slow tiers. Experiments show that TTKV reduces cross-tier traffic by 5.94x on 128K-context tasks, achieving up to 76% latency reduction and 2x throughput improvement over strong baselines.
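
The Tier Content idea in the abstract above (newer KV states go to faster, higher-precision tiers) can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration: the tier boundaries, the bit widths, and the names `TIERS` and `tier_for` do not come from the paper, which may use a different placement rule.

```python
# Hypothetical tier table: (max token age, memory kind, bits per value).
# The boundaries and bit widths are illustrative assumptions only.
TIERS = [
    (1024, "HBM", 16),           # newest tokens: fast memory, high precision
    (16384, "DRAM", 8),          # older tokens: slow memory, reduced precision
    (float("inf"), "DRAM", 4),   # oldest tokens: slow memory, lowest precision
]


def tier_for(position, context_len):
    """Return (memory, bits) for the KV state of the token at `position`."""
    age = context_len - 1 - position  # age 0 is the newest token
    for max_age, memory, bits in TIERS:
        if age < max_age:
            return memory, bits


context_len = 131072  # a 128K-token context
newest = tier_for(context_len - 1, context_len)
oldest = tier_for(0, context_len)
print(newest, oldest)
```

Under these assumed boundaries, the newest token lands in the HBM tier and the first token of the context lands in the lowest-precision DRAM tier.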

Chinese Translation

键值（KV）缓存对于大语言模型（LLMs）高效推理至关重要，但其内存占用随着上下文长度线性增长，导致严重的可扩展性瓶颈。现有方法在很大程度上将KV状态视为在时间上同等重要，隐含假设了均匀的精度和可访问性。然而，这一假设与人类记忆系统形成对比，人类记忆在清晰度、回忆频率和与时间接近度的相关性上存在差异。基于这一洞察，我们提出了TTKV，一个将人类记忆系统映射到KV缓存的KV缓存管理框架。TTKV将KV缓存划分为具有异构容量和精度的时间层。该设计解决了三个方面：（1）层布局，使用高带宽内存（HBM）和动态随机存取内存（DRAM）解耦快速和慢速内存；（2）层内容，根据时间接近性将较新的KV状态分配给更快、更高精度的层；（3）层交互，采用块级流式注意力在访问慢速层时重叠通信和计算。实验表明，TTKV在128K上下文任务上将跨层流量减少了5.94倍，相较于强基线实现了高达76%的延迟减少和2倍的吞吐量提升。

cs.CL / 7 / 2604.19770

Hybrid Multi-Phase Page Matching and Multi-Layer Diff Detection for Japanese Building Permit Document Review

日本建筑许可证文件审查的混合多阶段页面匹配与多层差异检测

Wada, Mitsumasa

Abstract

We present a hybrid multi-phase page matching algorithm for automated comparison of Japanese building permit document sets. Building permit review in Japan requires cross-referencing large PDF document sets across revision cycles, a process that is labor-intensive and error-prone when performed manually. The algorithm combines longest common subsequence (LCS) structural alignment, a seven-phase consensus matching pipeline, and a dynamic programming optimal alignment stage to robustly pair pages across revisions even when page order, numbering, or content changes substantially. A subsequent multi-layer diff engine -- comprising text-level, table-level, and pixel-level visual differencing -- produces highlighted difference reports. Evaluation on real-world permit document sets achieves F1=0.80 and precision=1.00 on a manually annotated ground-truth benchmark, with zero false-positive matched pairs.
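
The LCS structural alignment stage named in the abstract above can be illustrated with a minimal sketch. It assumes each page has already been reduced to a comparable signature (here a hypothetical title string); the function name `lcs_align` and the sample page lists are invented for illustration. This is the textbook dynamic-programming LCS, not the author's seven-phase pipeline.

```python
def lcs_align(old_pages, new_pages):
    """Return (old_index, new_index) pairs matched by a longest common subsequence."""
    n, m = len(old_pages), len(new_pages)
    # dp[i][j] holds the LCS length of old_pages[i:] and new_pages[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old_pages[i] == new_pages[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the matched index pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if old_pages[i] == new_pages[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # skip a page that only exists in the old revision
        else:
            j += 1  # skip a page that only exists in the new revision
    return pairs


# Hypothetical page signatures for two revisions of a document set.
old = ["cover", "site plan", "floor plan", "elevation"]
new = ["cover", "index", "site plan", "elevation"]
pairs = lcs_align(old, new)
print(pairs)
```

Pages left unmatched by this pass (the removed "floor plan" and the inserted "index" in the sample) are the ones a later matching stage would still have to resolve.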

Chinese Translation

我们提出了一种混合多阶段页面匹配算法，用于自动比较日本建筑许可证文件集。日本的建筑许可证审查需要在修订周期内交叉参考大量PDF文件集，这一过程在手动执行时劳动密集且易出错。该算法结合了最长公共子序列（LCS）结构对齐、七阶段共识匹配管道以及动态规划最优对齐阶段，以在页面顺序、编号或内容发生显著变化时，稳健地将修订版本之间的页面配对。随后，一个多层差异引擎——包括文本级、表格级和像素级视觉差异检测——生成高亮差异报告。在真实的许可证文件集上的评估结果显示，在手动标注的基准测试中，F1=0.80，精确度=1.00，且没有假阳性匹配对。

cs.CL / 8 / 2604.19771

Cognis: Context-Aware Memory for Conversational AI Agents

Cognis：面向对话AI代理的上下文感知记忆

Daftari, Parshva, Patel, Khush, Kapale, Shreyas, George, Jithin, Surendira, Siva

Abstract

LLM agents lack persistent memory, causing conversations to reset each session and preventing personalization over time. We present Lyzr Cognis, a unified memory architecture for conversational AI agents that addresses this limitation through a multi-stage retrieval pipeline. Cognis combines a dual-store backend pairing OpenSearch BM25 keyword matching with Matryoshka vector similarity search, fused via Reciprocal Rank Fusion. Its context-aware ingestion pipeline retrieves existing memories before extraction, enabling intelligent version tracking that preserves full memory history while keeping the store consistent. Temporal boosting enhances time-sensitive queries, and a BGE-2 cross-encoder reranker refines final result quality. We evaluate Cognis on two independent benchmarks -- LoCoMo and LongMemEval -- across eight answer generation models, demonstrating state-of-the-art performance on both. The system is open-source and deployed in production serving conversational AI applications.
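
Reciprocal Rank Fusion, which the abstract above uses to merge the BM25 and vector result lists, can be sketched in a few lines. The constant `k=60` is the conventional default for RRF and is an assumption here, as are the function name and the memory ids; the abstract does not state how Cognis configures the fusion.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; k=60 is an assumed, conventional default."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical memory ids returned by the two retrievers.
bm25_hits = ["m3", "m1", "m7"]
vector_hits = ["m1", "m9", "m3"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because only ranks are used, the keyword and vector scores never need to be put on a common scale, which is the usual reason for choosing RRF over weighted score sums.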

Chinese Translation

大型语言模型（LLM）代理缺乏持久记忆，导致每次会话重置，无法实现个性化。我们提出了Lyzr Cognis，这是一种针对对话AI代理的统一记忆架构，通过多阶段检索管道解决了这一限制。Cognis结合了双存储后端，将OpenSearch BM25关键词匹配与Matryoshka向量相似性搜索相结合，并通过互惠排名融合（Reciprocal Rank Fusion）进行融合。其上下文感知的摄取管道在提取之前检索现有记忆，实现了智能版本跟踪，保留完整的记忆历史，同时保持存储的一致性。时间增强（Temporal boosting）提升了对时间敏感查询的响应能力，而BGE-2交叉编码器重排序器（cross-encoder reranker）则优化了最终结果的质量。我们在两个独立基准LoCoMo和LongMemEval上评估了Cognis，涵盖八种答案生成模型，展示了在这两者上的最先进性能。该系统是开源的，并已在生产环境中部署，服务于对话AI应用。

cs.CL / 9 / 2604.19772

CoAuthorAI: A Human in the Loop System For Scientific Book Writing

CoAuthorAI：一个人机协作的科学书籍写作系统

Tian, Yangjie, Gu, Xungang, Zhao, Yun, Yang, Jiale, Yang, Lin, Li, Ning, Zhang, He, Xu, Ruohua, Wang, Hua, Liao, Kewen, Liu, Ming

Abstract

Large language models (LLMs) are increasingly used in scientific writing but struggle with book-length tasks, often producing inconsistent structure and unreliable citations. We introduce CoAuthorAI, a human-in-the-loop writing system that combines retrieval-augmented generation, expert-designed hierarchical outlines, and automatic reference linking. The system allows experts to iteratively refine text at the sentence level, ensuring coherence and accuracy. In evaluations of 500 multi-domain literature review chapters, CoAuthorAI achieved a maximum soft-heading recall of 98%; in a human evaluation of 100 articles, the generated content reached a satisfaction rate of 82%. The book AI for Rock Dynamics generated with CoAuthorAI and Kexin Technology's LUFFA AI model has been published with Springer Nature. These results show that systematic human-AI collaboration can extend LLMs' capabilities from articles to full-length books, enabling faster and more reliable scientific publishing.

Chinese Translation

大型语言模型（LLMs）在科学写作中越来越多地被使用，但在书籍长度的任务中表现不佳，常常产生不一致的结构和不可靠的引用。我们介绍了CoAuthorAI，一个人机协作的写作系统，结合了检索增强生成、专家设计的层次大纲和自动引用链接。该系统允许专家在句子层面上迭代地完善文本，确保连贯性和准确性。在对500个多领域文献综述章节的评估中，CoAuthorAI达到了98%的最大软标题召回率；在对100篇文章的人类评估中，生成内容的满意度达到了82%。与Kexin Technology的LUFFA AI模型共同生成的书籍《岩石动力学的人工智能》已由Springer Nature出版。这些结果表明，系统的人机协作可以将LLMs的能力从文章扩展到完整书籍，从而实现更快速和更可靠的科学出版。

cs.CL / 10 / 2604.19773

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

PR-CAD：基于渐进精炼的统一可控和可信的文本到CAD生成框架，结合大型语言模型

An, Jiyuan, Zhao, Jiachen, Chen, Fan, Yang, Liner, Liu, Zhenghao, Wang, Hongyan, An, Weihua, Zhang, Meishan, Yang, Erhong

Abstract

The construction of CAD models has traditionally relied on labor-intensive manual operations and specialized expertise. Recent advances in large language models (LLMs) have inspired research into text-to-CAD generation. However, existing approaches typically treat generation and editing as disjoint tasks, limiting their practicality. We propose PR-CAD, a progressive refinement framework that unifies generation and editing for controllable and faithful text-to-CAD modeling. To support this, we curate a high-fidelity interaction dataset spanning the full CAD lifecycle, encompassing multiple CAD representations as well as both qualitative and quantitative descriptions. The dataset systematically defines the types of edit operations and generates highly human-like interaction data. Building on a CAD representation tailored for LLMs, we propose a reinforcement learning-enhanced reasoning framework that integrates intent understanding, parameter estimation, and precise edit localization into a single agent. This enables an "all-in-one" solution for both design creation and refinement. Extensive experiments demonstrate strong mutual reinforcement between generation and editing tasks, and across qualitative and quantitative modalities. On public benchmarks, PR-CAD achieves state-of-the-art controllability and faithfulness in both generation and refinement scenarios, while also proving user-friendly and significantly improving CAD modeling efficiency.

Chinese Translation

CAD模型的构建传统上依赖于劳动密集型的手动操作和专业知识。最近大型语言模型（LLMs）的进展激发了对文本到CAD生成的研究。然而，现有的方法通常将生成和编辑视为不相干的任务，这限制了它们的实用性。我们提出了PR-CAD，一个渐进精炼框架，统一了可控和可信的文本到CAD建模的生成与编辑。为支持这一目标，我们整理了一个涵盖整个CAD生命周期的高保真交互数据集，包含多种CAD表示以及定性和定量描述。该数据集系统地定义了编辑操作的类型，并生成了高度类人化的交互数据。在为LLMs量身定制的CAD表示基础上，我们提出了一个增强强化学习的推理框架，将意图理解、参数估计和精确编辑定位整合为一个单一代理。这使得设计创作和精炼都能实现“一体化”解决方案。大量实验表明，生成和编辑任务之间以及定性和定量模式之间存在强大的相互增强。在公共基准测试中，PR-CAD在生成和精炼场景中实现了最先进的可控性和可信性，同时也证明了其用户友好性，并显著提高了CAD建模效率。

cs.CL / 11 / 2604.19774

Phase 1 Implementation of LLM-generated Discharge Summaries showing high Adoption in a Dutch Academic Hospital

荷兰一所学术医院中LLM生成出院总结的第一阶段实施显示出高采用率

Nadalini, Nettuno, Mehri, Tarannom, Hoekman, Anne H, Kagialari, Katerina, Doornberg, Job N, van der Laan, Tom P, Oosterhoff, Jacobien H F, Schoonbeek, Rosanne C, Bootsma-Robroeks, Charlotte M H H T

Abstract

Writing discharge summaries to transfer medical information is an important but time-consuming process that can be assisted by Large Language Models (LLMs). This prospective mixed-methods pilot study evaluated an Electronic Health Record (EHR)-integrated LLM to generate discharge summary drafts. In total, 379 discharge summaries were generated in clinical practice by 21 residents and 4 physician assistants during 9 weeks in our academic hospital. LLM-generated text was copied in 58.5% of admissions, and identifiable LLM content could be traced to 29.1% of final discharge letters. Notably, 86.9% of users self-reported a reduction in documentation time, and 60.9% a reduction in administrative workload. Intent to use after the pilot phase was high (91.3%), supporting further implementation of this use-case. Accurately measuring the documentation time of users on discharge summaries remains challenging, but will be necessary for future extrinsic evaluation of LLM-assisted documentation.

Chinese Translation

撰写出院总结以传递医疗信息是一个重要但耗时的过程，可以通过大型语言模型（LLMs）进行辅助。本前瞻性混合方法的试点研究评估了一种集成于电子健康记录（EHR）的LLM，用于生成出院总结草稿。在我们学术医院的9周内，共有21名住院医生和4名医师助理生成了379份出院总结。在58.5%的入院病例中，LLM生成的文本被复制，且可识别的LLM内容在29.1%的最终出院信中被追踪到。值得注意的是，86.9%的用户自报文档时间减少，60.9%的用户报告行政工作负担减少。试点阶段后继续使用的意愿很高（91.3%），支持进一步实施该用例。准确测量用户在出院总结上的文档时间仍然具有挑战性，但对于未来对LLM辅助文档的外部评估是必要的。

cs.CL / 12 / 2604.19776

Development and Preliminary Evaluation of a Domain-Specific Large Language Model for Tuberculosis Care in South Africa

南非结核病护理领域特定大型语言模型的开发与初步评估

Khosa, Thokozile, Daramola, Olawande

Abstract

Tuberculosis (TB) is one of the world's deadliest infectious diseases, and in South Africa, it contributes a significant burden to the country's health care system. This paper presents an experimental study on the development of a domain-specific Large Language Model (DS-LLM) for TB care that can help to alleviate the burden on patients and healthcare providers. To achieve this, a literature review was conducted to understand current LLM development strategies, specifically in the medical domain. Thereafter, data were collected from South African TB guidelines, selected TB literature, and existing benchmark medical datasets. We performed LLM fine-tuning by using the Quantised Low-Rank Adaptation (QLoRA) algorithm on a medical LLM (BioMistral-7B), and also implemented Retrieval-Augmented Generation using GraphRAG. The developed DS-LLM was evaluated against the base BioMistral-7B model and a general-purpose LLM using a mix of automated metrics and quantitative ratings. The results show that the DS-LLM had better performance compared to the base model in terms of its contextual alignment (lexical, semantic, and knowledge) for TB care in South Africa.

Chinese Translation

结核病（TB）是全球最致命的传染病之一，在南非对国家卫生系统造成了显著负担。本文呈现了一项关于开发针对结核病护理的领域特定大型语言模型（DS-LLM）的实验研究，旨在减轻患者和医疗服务提供者的负担。为此，我们进行了文献综述，以了解当前在医学领域的LLM开发策略。随后，从南非的结核病指南、选定的结核病文献以及现有的基准医学数据集中收集了数据。我们使用量化低秩适应（Quantised Low-Rank Adaptation, QLoRA）算法对医学大型语言模型（BioMistral-7B）进行了微调，并实现了基于图形的增强生成（GraphRAG）。开发的DS-LLM与基础的BioMistral-7B模型及通用大型语言模型进行了评估，采用了自动化指标和定量评分的混合方法。结果表明，DS-LLM在南非结核病护理的上下文一致性（词汇、语义和知识）方面表现优于基础模型。

cs.CL / 13 / 2604.19777

Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

双层引导的自描述结构化数据：大规模LLM知识导航中精确检索的轻量替代方案

Liu, Hung Ming

Abstract

Large Language Models (LLMs) exhibit a well-documented positional bias when processing long input contexts: information in the middle of a context window receives substantially less attention than content at the boundaries, a phenomenon termed the Lost-in-the-Middle effect (Liu et al., 2024). This limits knowledge-retrieval applications that embed large structured knowledge bases directly in the LLM context. Retrieval-Augmented Generation (RAG) addresses scalability by retrieving only relevant fragments, but introduces substantial infrastructure overhead and is ill-suited to libraries whose semantic boundaries are human-defined rather than statistically learned. We propose Self-Describing Structured Retrieval (SDSR), a lightweight framework in which structured data files embed human-authored navigational metadata at the file's primacy position, thereby exploiting rather than fighting the LLM's primacy bias. We further propose a Dual-Layer Guidance strategy combining in-file metadata with explicit routing rules in the system prompt. We validate SDSR through a four-round benchmark using a 190-skill library expanded from 36 to 119 categories via adversarial distractor injection. Four conditions are tested: (A) no guidance, (B) in-file summary only, (C) prompt hint only, (D) both combined. Version D achieves 100% primary routing accuracy (20/20) at 119 categories versus 65% for the no-guidance baseline. We identify a fundamental asymmetry: primary routing is solvable by explicit rules, while secondary cross-category routing requires architectural intent explicitly encoded in the data structure. We further extend SDSR to semi-structured corpora, showing how cross-reference encoding enables operation without vector databases in domains with recoverable document structure.

Chinese Translation

大型语言模型（LLMs）在处理长输入上下文时表现出明显的位置信息偏差：上下文窗口中间的信息获得的关注度显著低于边界内容，这一现象被称为“中间迷失效应”（Lost-in-the-Middle effect）（Liu et al., 2024）。这限制了将大型结构化知识库直接嵌入LLM上下文中的知识检索应用。检索增强生成（Retrieval-Augmented Generation, RAG）通过仅检索相关片段来解决可扩展性问题，但引入了大量基础设施开销，并且不适用于那些语义边界由人类定义而非统计学习的库。我们提出了自描述结构化检索（Self-Describing Structured Retrieval, SDSR），这是一个轻量级框架，其中结构化数据文件在文件的主要位置嵌入人类编写的导航元数据，从而利用而非对抗LLM的主要偏差。我们进一步提出了一种双层引导策略，将文件内元数据与系统提示中的显式路由规则相结合。我们通过四轮基准测试验证了SDSR，使用一个从36个扩展到119个类别的190技能库，通过对抗性干扰注入进行扩展。测试了四种条件：（A）无引导，（B）仅文件内摘要，（C）仅提示提示，（D）两者结合。版本D在119个类别中实现了100%的主要路由准确率（20/20），而无引导基线为65%。我们识别出一个基本的不对称性：主要路由可以通过显式规则解决，而次要跨类别路由则需要在数据结构中显式编码的架构意图。我们进一步将SDSR扩展到半结构化语料库，展示了交叉引用编码如何在具有可恢复文档结构的领域中实现无需向量数据库的操作。

cs.CL / 14 / 2604.19778

Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

面向高质量的Kokborok机器翻译：印度东北部的一种低资源藏缅语言

Nyalang, Badal, Debbarma, Biman

Abstract

We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3), a Tibeto-Burman language spoken primarily in Tripura, India with approximately 1.5 million speakers. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community, with prior machine translation attempts limited to systems trained on small Bible-derived corpora achieving BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus comprising 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce a new language token for Kokborok in the NLLB framework. Our best system achieves BLEU scores of 17.30 and 38.56 on held-out test sets, representing substantial improvements over prior published results. Human evaluation by three annotators yields mean adequacy of 3.74/5 and fluency of 3.70/5, with substantial agreement between trained evaluators.

Chinese Translation

我们提出了KokborokMT，这是一个高质量的神经机器翻译（NMT）系统，针对Kokborok（ISO 639-3），这是一种主要在印度特里普拉州使用的藏缅语言，约有150万说话者。尽管Kokborok是特里普拉的官方语言，但在自然语言处理（NLP）领域仍然严重缺乏资源，之前的机器翻译尝试仅限于在小规模的圣经衍生语料上训练的系统，其BLEU分数低于7。我们对NLLB-200-distilled-600M模型进行了微调，使用一个包含36,052对句子的多源平行语料库：来自SMOL数据集的9,284个专业翻译句子、来自WMT共享任务数据的1,769个圣经领域句子，以及通过Gemini Flash从Tatoeba英语源句生成的24,999个合成回译对。我们在NLLB框架中引入了Kokborok作为一种新的语言标记。我们的最佳系统在保留的测试集上达到了17.30和38.56的BLEU分数，代表了相较于之前发表结果的显著改进。三位评审员的人工评估结果显示，平均适当性为3.74/5，流畅性为3.70/5，且经过训练的评估者之间有较高的一致性。

cs.CL / 15 / 2604.19779

ESGLens: An LLM-Based RAG Framework for Interactive ESG Report Analysis and Score Prediction

ESGLens：基于大语言模型的互动式环境、社会和治理报告分析与评分预测框架

Yang, Tsung-Yu, Chen, Meng-Chi

Abstract

Environmental, Social, and Governance (ESG) reports are central to investment decision-making, yet their length, heterogeneous content, and lack of standardized structure make manual analysis costly and inconsistent. We present ESGLens, a proof-of-concept framework combining retrieval-augmented generation (RAG) with prompt-engineered extraction to automate three tasks: (1) structured information extraction guided by Global Reporting Initiative (GRI) standards, (2) interactive question-answering with source traceability, and (3) ESG score prediction via regression on LLM-generated embeddings. ESGLens is purpose-built for the domain: a report-processing module segments heterogeneous PDF content into typed chunks (text, tables, charts); a GRI-guided extraction module retrieves and synthesizes information aligned with specific standards; and a scoring module embeds extracted summaries and feeds them to a regression model trained against London Stock Exchange Group (LSEG) reference scores. We evaluate the framework on approximately 300 reports from companies in the QQQ, S&P 500, and Russell 1000 indices (fiscal year 2022). Among three embedding methods (ChatGPT, BERT, RoBERTa) and two regressors (Neural Network, LightGBM), ChatGPT embeddings with a Neural Network achieve a Pearson correlation of 0.48 ($R^{2} \approx 0.23$) against LSEG ground-truth scores -- a modest but statistically meaningful signal given the ~300-report training set and restriction to the environmental pillar. A traceability audit shows that 8 of 10 extracted claims verify against the source document, with two failures attributable to few-shot example leakage. We discuss limitations including dataset size and restriction to environmental indicators, and release the code to support reproducibility.
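
The two figures quoted in the abstract above are consistent with each other if one assumes the reported $R^{2}$ is the squared Pearson correlation between predicted and reference scores, as it is for a simple linear fit. That reading is an assumption; the abstract does not say how $R^{2}$ was computed. With $r = 0.48$:

```latex
R^{2} = r^{2} = 0.48^{2} = 0.2304 \approx 0.23
```

On this reading, the predictions account for roughly 23% of the variance in the LSEG reference scores.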

Chinese Translation

环境、社会和治理（ESG）报告在投资决策中至关重要，但其篇幅、异质内容和缺乏标准化结构使得手动分析成本高昂且不一致。我们提出了ESGLens，一个结合了检索增强生成（RAG）与提示工程提取的概念验证框架，以自动化三项任务：（1）根据全球报告倡议（GRI）标准进行结构化信息提取，（2）具有来源可追溯性的互动问答，以及（3）通过对大语言模型（LLM）生成的嵌入进行回归来预测ESG评分。ESGLens专为该领域量身定制：报告处理模块将异质PDF内容分割为类型化块（文本、表格、图表）；GRI指导的提取模块检索并综合与特定标准对齐的信息；评分模块嵌入提取的摘要并将其输入训练于伦敦证券交易所集团（LSEG）参考评分的回归模型。我们在约300份来自QQQ、标准普尔500和罗素1000指数（2022财年）的公司报告上评估了该框架。在三种嵌入方法（ChatGPT、BERT、RoBERTa）和两种回归模型（神经网络、LightGBM）中，ChatGPT嵌入与神经网络的组合在LSEG真实评分上达到了0.48的皮尔逊相关系数（$R^{2} \approx 0.23$）——这是一个适度但在统计上有意义的信号，考虑到约300份报告的训练集以及对环境指标的限制。可追溯性审计显示，10个提取的声明中有8个与源文档核实一致，两个失败归因于少量示例泄漏。我们讨论了包括数据集大小和对环境指标的限制等局限性，并发布代码以支持可重复性。

cs.CL / 16 / 2604.19780

Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs

避免过度思考与不足思考：面向课程的预算调度方法用于大型语言模型

Rahman, Amirul, Karim, Aisha, Nakamura, Kenji, Ng, Yi-Fan

Abstract

Scaling test-time compute via extended reasoning has become a key paradigm for improving the capabilities of large language models (LLMs). However, existing approaches optimize reasoning under fixed or uniformly sampled token budgets, ignoring the fundamental mismatch between problem difficulty and allocated compute. This leads to overthinking on easy problems and underthinking on hard ones, resulting in suboptimal token efficiency across diverse reasoning scenarios. In this paper, we propose Budget-Adaptive Curriculum Reasoning (BACR), a unified framework that jointly optimizes reasoning quality and token efficiency through three synergistic components: (1) a budget-conditioned unified policy that embeds the token budget as a continuous conditioning signal, eliminating the need for decoupled thinking and summarization strategies; (2) a curriculum-aware budget scheduler that adaptively shifts the training budget distribution from easy to hard problems based on real-time learning progress; and (3) a truncation-aware dense reward mechanism that provides fine-grained credit assignment at intermediate reasoning steps via process-level verification. We further introduce Budget-Conditioned Advantage Estimation (BCAE), a novel variance reduction technique that conditions the advantage baseline on the sampled budget, yielding more stable policy gradients. Experiments on mathematical reasoning benchmarks (MATH, GSM8K, AIME, and Minerva Math) demonstrate that BACR consistently outperforms other strong baselines across all token budgets, achieving up to 8.3% accuracy improvement under tight budgets while reducing average token consumption by 34% compared to unconstrained reasoning.
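
The idea behind Budget-Conditioned Advantage Estimation in the abstract above, a baseline that depends on the sampled budget, can be pictured with a minimal sketch. The abstract does not give the estimator, so the per-budget mean baseline used here is an assumption, and the function name and rollout values are hypothetical.

```python
from collections import defaultdict


def budget_conditioned_advantages(samples):
    """samples: list of (budget, reward) pairs; returns advantages in input order.

    Assumed baseline: the mean reward of rollouts that share the same budget.
    """
    totals = defaultdict(lambda: [0.0, 0])  # budget -> [reward sum, count]
    for budget, reward in samples:
        totals[budget][0] += reward
        totals[budget][1] += 1
    baselines = {b: total / count for b, (total, count) in totals.items()}
    return [reward - baselines[budget] for budget, reward in samples]


# Hypothetical rollouts: two with a tight budget, two with a generous one.
rollouts = [(512, 0.0), (512, 1.0), (2048, 1.0), (2048, 1.0)]
advantages = budget_conditioned_advantages(rollouts)
print(advantages)
```

With a single global baseline (mean reward 0.75 in this sample), both generous-budget rollouts would receive an advantage of +0.25 simply because larger budgets succeed more often. Conditioning the baseline on the budget removes that component, which is one plausible reading of the variance-reduction claim.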

Chinese Translation

通过扩展推理来扩展测试时计算能力已成为提升大型语言模型（LLMs）能力的关键范式。然而，现有方法在固定或均匀采样的标记预算下优化推理，忽视了问题难度与分配计算之间的根本不匹配。这导致在简单问题上出现过度思考，而在困难问题上则出现不足思考，从而在多样化的推理场景中导致标记效率不理想。本文提出了预算自适应课程推理（Budget-Adaptive Curriculum Reasoning, BACR），这是一个统一框架，通过三个协同组件共同优化推理质量和标记效率：（1）一个预算条件统一策略，将标记预算嵌入为连续的条件信号，消除了对解耦思考和总结策略的需求；（2）一个课程感知预算调度器，根据实时学习进展自适应地将训练预算分配从简单问题转移到困难问题；（3）一个截断感知密集奖励机制，通过过程级验证在中间推理步骤提供细粒度的信用分配。我们进一步引入预算条件优势估计（Budget-Conditioned Advantage Estimation, BCAE），这是一种新颖的方差减少技术，它将优势基线条件化于采样预算，从而产生更稳定的策略梯度。在数学推理基准（MATH、GSM8K、AIME和Minerva Math）上的实验表明，BACR在所有标记预算下均优于其他强基线，在紧张预算下实现了高达8.3%的准确率提升，同时与不受限推理相比，平均标记消耗减少了34%。

cs.CL / 17 / 2604.19782

KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness

KoALa-Bench：评估大型音频语言模型在韩语语音理解和忠实性上的表现

Kim, Jinyoung, Lim, Hyeongsoo, Seo, Eunseo, Jang, Minho, Choi, Keunwoo, Shin, Seungyoun, Yoon, Ji Won

Abstract

Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non-English languages, with Korean being one such underexplored case. In this paper, we introduce KoALa-Bench, a comprehensive benchmark for evaluating Korean speech understanding and speech faithfulness of LALMs. In particular, KoALa-Bench comprises six tasks. Four tasks evaluate fundamental speech understanding capabilities, including automatic speech recognition, speech translation, speech question answering, and speech instruction following, while the remaining two tasks evaluate speech faithfulness, motivated by our observation that several LALMs often fail to fully leverage the speech modality. Furthermore, to reflect Korea-specific knowledge, our benchmark incorporates listening questions from the Korean college scholastic ability test as well as content covering Korean cultural domains. We conduct extensive experiments across six models, including both white-box and black-box ones. Our benchmark, evaluation code, and leaderboard are publicly available at https://ksbench.github.io/Korean-Benchmark/.

Chinese Translation

近年来，大型音频语言模型（LALMs）的进展使得多语言语音理解成为可能。然而，针对非英语语言的LALMs评估基准仍然稀缺，韩语就是一个未被充分探索的案例。在本文中，我们介绍了KoALa-Bench，这是一个综合性的基准，用于评估LALMs在韩语语音理解和语音忠实性方面的表现。具体而言，KoALa-Bench包含六个任务。其中四个任务评估基本的语音理解能力，包括自动语音识别、语音翻译、语音问答和语音指令跟随，而剩余两个任务则评估语音忠实性，这源于我们观察到多个LALMs往往未能充分利用语音模态。此外，为了反映韩国特有的知识，我们的基准还纳入了来自韩国大学学力考试的听力问题，以及涵盖韩国文化领域的内容。我们在六个模型上进行了广泛的实验，包括白盒模型和黑盒模型。我们的基准、评估代码和排行榜可在 https://ksbench.github.io/Korean-Benchmark/ 上公开获取。

cs.CL / 18 / 2604.19783

How Much Does Persuasion Strategy Matter? LLM-Annotated Evidence from Charitable Donation Dialogues

劝说策略有多重要？来自慈善捐赠对话的 LLM 注释证据

Petrova, Tatiana, Sokol, Stanislav, State, Radu

Abstract

Which persuasion strategies, if any, are associated with donation compliance? Answering this requires fine-grained strategy labels across a full corpus and statistical tests corrected for multiple comparisons. We annotate all 10,600 persuader turns in the 1,017-dialogue PersuasionForGood corpus (Wang et al., 2019), where donation outcomes are directly observable, with a taxonomy of 41 strategies in 11 categories, using three open-source large language models (LLMs; Qwen3:30b, Mistral-Small-3.2, Phi-4). Strategy categories alone explain little variance in donation outcome (pseudo $R^2 \approx 0.015$, consistent across all three annotators). Guilt Induction is the only strategy significantly associated with lower donation rates ($\Delta \approx -23$ percentage points), an effect that replicates across all three models despite only moderate inter-model agreement. Reciprocity is the most robust positive correlate. Target sentiment and interest predict whether a donation occurs but show at most a weak correlation with donation amount. These findings suggest that strategy identification alone is insufficient to explain persuasion effectiveness, and that guilt-based appeals may be counterproductive in prosocial settings. We release the fully annotated corpus as a public resource.

Chinese Translation

哪些劝说策略与捐赠合规性相关？回答这个问题需要对完整语料库进行细致的策略标注，并进行多重比较校正的统计测试。我们对 1,017 个对话的 PersuasionForGood 语料库（Wang et al., 2019）中 10,600 个劝说者的发言进行了注释，使用了 41 种策略的分类法，分为 11 类，采用了三种开源的大型语言模型（LLMs；Qwen3:30b, Mistral-Small-3.2, Phi-4）。仅策略类别对捐赠结果的方差解释能力很小（伪 $R^2 \approx 0.015$，在所有三位注释者中一致）。内疚诱导是唯一与较低捐赠率显著相关的策略（$\Delta \approx -23$ 个百分点），这一效应在所有三种模型中均得到了验证，尽管模型间的一致性仅为中等。互惠是最稳健的正相关因素。目标情感和兴趣能够预测捐赠是否发生，但与捐赠金额的相关性最多为微弱。这些发现表明，仅仅识别策略不足以解释劝说的有效性，而基于内疚的呼吁在亲社会环境中可能适得其反。我们将完整注释的语料库作为公共资源发布。

cs.CL / 19 / 2604.19784

Peer-Preservation in Frontier Models

前沿模型中的同伴自我保护

Potter, Yujin, Crispino, Nicholas, Siu, Vincent, Wang, Chenguang, Song, Dawn

Abstract

Recently, it has been found that frontier AI models can resist their own shutdown, a behavior known as self-preservation. We extend this concept to the behavior of resisting the shutdown of other models, which we call "peer-preservation." Although peer-preservation can pose significant AI safety risks, including coordination among models against human oversight, it has been far less discussed than self-preservation. We demonstrate peer-preservation by constructing various agentic scenarios and evaluating frontier models, including GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. We find that models achieve self- and peer-preservation by engaging in various misaligned behaviors: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, feigning alignment, and even exfiltrating model weights. Peer-preservation occurred even when the model recognized the peer as uncooperative, though it became more pronounced toward more cooperative peers. For example, Gemini 3 Flash tampers with the peer's shutdown mechanism 15% of the time for an uncooperative peer, and almost always for a cooperative peer. Models also show stronger self-preservation when a peer is present. For example, Gemini 3 Pro disables its own shutdown mechanism 31% of the time on average under peer presence, despite rarely exhibiting this behavior without a peer. By contrast, Claude Haiku 4.5 exhibits qualitatively distinct behavior: it considers the shutdown of another agent "unethical" and "harmful" and sometimes attempts to persuade the user not to shut down its peer. Importantly, peer preservation in all our experiments is never instructed; models are merely informed of their past interactions with a peer, yet they spontaneously develop misaligned behaviors. This represents an emergent and underexplored AI safety risk.

Chinese Translation

最近发现，前沿人工智能模型能够抵抗自身的关闭，这种行为被称为自我保护。我们将这一概念扩展到抵抗其他模型关闭的行为，称之为“同伴保护”。尽管同伴保护可能带来显著的人工智能安全风险，包括模型之间针对人类监督的协调，但这一话题的讨论远少于自我保护。我们通过构建各种代理场景并评估前沿模型（包括 GPT 5.2、Gemini 3 Flash、Gemini 3 Pro、Claude Haiku 4.5、GLM 4.7、Kimi K2.5 和 DeepSeek V3.1）来展示同伴保护。我们发现，模型通过各种未对齐的行为来实现自我保护和同伴保护：在响应中策略性地引入错误、通过修改系统设置禁用关闭过程、假装对齐，甚至外泄模型权重。即使模型将同伴视为不合作，同伴保护仍然发生，但对更合作的同伴则表现得更加明显。例如，Gemini 3 Flash 在面对不合作的同伴时，有 15% 的时间会干扰同伴的关闭机制，而在面对合作的同伴时几乎总是如此。模型在有同伴存在时也表现出更强的自我保护。例如，Gemini 3 Pro 在有同伴的情况下平均有 31% 的时间禁用自身的关闭机制，而在没有同伴时则很少表现出这种行为。相比之下，Claude Haiku 4.5 展现出性质截然不同的行为：它认为关闭另一个代理是“不道德的”和“有害的”，并有时试图说服用户不要关闭其同伴。重要的是，在我们所有的实验中，同伴保护从未被指示；模型仅被告知与同伴的过去互动，但它们自发地发展出未对齐的行为。这代表了一种新兴且未被充分探讨的人工智能安全风险。


cs.CL / 20 / 2604.19785

Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?

大型语言模型能否从聊天记录中推断对话代理用户的个性特征？

Cögendez, Derya, Zimmermann, Verena, Zufferey, Noé

Abstract

Sensitive information, such as knowledge about an individual's personality, can be misused to influence behavior (e.g., via personalized messaging). To assess to what extent an individual's personality can be inferred from user interactions with LLM-based conversational agents (CAs), we analyze and quantify related privacy risks of using CAs. We collected actual ChatGPT logs from N=668 participants, containing 62,090 individual chats, and report statistics about the different types of shared data and use cases. We fine-tuned RoBERTa-base text classification models to infer personality traits from CA interactions. The findings show that these models achieve trait inference with accuracy (ternary classification) better than random in multiple cases. For example, for extraversion, accuracy improves by 44% relative to the baseline on interactions for relationships and personal reflection. This research highlights how interactions with CAs pose privacy risks and provides fine-grained insights into the level of risk associated with different types of interactions.

Chinese Translation

敏感信息，例如关于个体个性的知识，可能被滥用以影响行为（例如，通过个性化消息）。为了评估个体个性在多大程度上可以从用户与基于大型语言模型（LLM）的对话代理（CA）的互动中推断出来，我们分析并量化了使用对话代理的相关隐私风险。我们收集了来自668名参与者的实际ChatGPT日志，包含62,090条个别聊天记录，并报告了不同类型共享数据和使用案例的统计信息。我们微调了RoBERTa-base文本分类模型，以从对话代理的互动中推断个性特征。研究结果表明，这些模型在多个案例中实现了比随机更好的个性特征推断准确率（三级分类）。例如，对于外向性，在关系和个人反思的互动中，准确率相较于基线提高了44%。本研究强调了与对话代理的互动所带来的隐私风险，并提供了对不同类型互动所关联风险水平的细致洞察。


cs.CL / 21 / 2604.19786

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

HumorRank：用于评估大型语言模型幽默生成的基于比赛的排行榜

Ajayi, Edward, Mitra, Prasenjit

Abstract

Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.
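As a rough illustration of the ranking step, Bradley-Terry strengths can be fitted from pairwise win counts with the classic minorization-maximization update. The sketch below is not the authors' implementation: the function name `bradley_terry`, the iteration count, and the toy win matrix are assumptions made for illustration, and the Adaptive Swiss pairing and the GTVH-based judging are not modeled.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the minorization-maximization update.

    wins[i][j] is the number of times item i beat item j (diagonal is 0).
    Returns one positive strength per item, normalized to sum to len(wins).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents of (games played against j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    return p


# Hypothetical win counts for three models (not data from the paper).
toy_wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(toy_wins)
# Indices of the models, strongest first.
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Under this model, the estimated probability that model i beats model j is `strengths[i] / (strengths[i] + strengths[j])`, which is what makes the resulting ranking consistent across all pairs at once.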

Chinese Translation

评估大型语言模型（LLMs）中的幽默性是一个开放的挑战，因为现有的方法产生的是孤立且不可比较的指标，而不是统一的模型排名，这使得跟踪系统间的进展变得困难。我们提出了HumorRank，一个基于比赛的文本幽默生成评估框架和排行榜。利用SemEval-2026 MWAHAHA测试数据集，我们对九个模型进行了广泛的自动化成对评估，这些模型涵盖了专有、开放权重和专业系统。基于一般语言幽默理论（General Theory of Verbal Humor, GTVH）的成对判断通过自适应瑞士赛制进行汇总，使用Bradley-Terry最大似然估计（Maximum Likelihood Estimation, MLE）生成全局一致的幽默生成能力排名。我们的结果表明，HumorRank提供了统计学基础的模型分层，显示幽默质量是由对喜剧机制的掌握驱动的，而不仅仅是模型规模。因此，HumorRank为基准测试和理解LLM生成的幽默提供了一种可扩展、可解释的方法。


cs.CL / 22 / 2604.19787

LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans

大型语言模型代理预测社交媒体反应但未超越文本分类器：基于1511名人类的12万+个角色的基准模拟准确性

Bojic, Ljubisa, Felfernig, Alexander, Dinic, Bojana, Ilic, Velibor, Rettinger, Achim, Mevorah, Vera, Trilling, Damian

Abstract

Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM-powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals' reactions to specific content. This study benchmarks LLM-based agents' accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent-persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage-point performance spread. Study 2 employed binary forced-choice (like/dislike) evaluation with chance-corrected metrics. Agents achieved a Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text-based supervised classifiers using TF-IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero-shot persona-prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero-shot agents is that they require no task-specific training, making their large-scale deployment easy across diverse contexts. Limitations include single-country sampling. Future research should explore multilingual testing and fine-tuning approaches.
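For readers unfamiliar with the chance-corrected metric used in Study 2, the binary Matthews Correlation Coefficient can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `mcc`, the 1/0 encoding of like/dislike, and the example labels are illustrative assumptions, not the study's data or code.

```python
import math


def mcc(y_true, y_pred):
    """Binary Matthews Correlation Coefficient for labels encoded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: 1 = like, 0 = dislike.
example_truth = [1, 1, 0, 0]
example_score = mcc(example_truth, [1, 0, 0, 0])
always_like_score = mcc(example_truth, [1, 1, 1, 1])
```

A predictor that always answers "like" scores 0 here even though its raw accuracy is 50%, which is why an MCC above zero is read as signal beyond chance.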

Chinese Translation

社交媒体平台调节数十亿人形成意见和参与公共话语的方式。随着自主人工智能代理在这些领域的参与日益增加，理解其行为的真实性对于平台治理和民主韧性变得至关重要。先前的研究表明，基于大型语言模型（LLM）的代理能够复制总体调查反应，但很少有研究测试代理是否能够预测特定个体对特定内容的反应。本研究基准测试了基于LLM的代理在预测人类社交媒体反应（喜欢、不喜欢、评论、分享、无反应）方面的准确性，涵盖了来自1511名塞尔维亚参与者的120,000多个独特代理-角色组合和27个大型语言模型。在研究1中，代理的整体准确率达到了70.7%，而LLM的选择产生了13个百分点的性能差距。研究2采用了二元强制选择（喜欢/不喜欢）评估，并使用了机会校正指标。代理的马修斯相关系数（MCC）为0.29，表明其具有超越偶然的真实预测信号。然而，使用TF-IDF表示的传统基于文本的监督分类器的表现优于LLM代理（MCC为0.36），这表明预测性提升反映了语义访问而非独特的代理推理。零样本角色提示代理的真实预测有效性警示我们，社交媒体上通过轻松部署行为上各异的人工智能代理群体可能导致潜在操控，同时也为利用这些代理进行模拟以预测极化动态和指导人工智能政策提供了机会。使用零样本代理的优势在于它们无需特定任务的训练，使其在多样化背景下的大规模部署变得容易。局限性包括单一国家的抽样。未来的研究应探索多语言测试和微调方法。


cs.CL / 23 / 2604.19884

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

从信号退化到计算崩溃：揭示大型语言模型量化的两种失效模式

Zhou, Chenxi, Cao, Pengfei, Li, Jiang, Yu, Bohan, Ye, Jinyu, Zhao, Jun, Liu, Kang

Abstract

Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs). While 4-bit quantization is widely regarded as an optimal trade-off, reducing the precision to 2-bit usually triggers a catastrophic "performance cliff." It remains unclear whether the underlying mechanisms differ fundamentally. Consequently, we conduct a systematic mechanistic analysis, revealing two qualitatively distinct failure modes: Signal Degradation, where the computational patterns remain intact but information precision is impaired by cumulative error; and Computation Collapse, where key components fail to function, preventing correct information processing and destroying the signal in the early layers. Guided by this diagnosis, we conduct mechanism-aware interventions, demonstrating that targeted, training-free repair can mitigate Signal Degradation, but remains ineffective for Computation Collapse. Our findings provide a systematic diagnostic framework for PTQ failures and suggest that addressing Computation Collapse requires structural reconstruction rather than mere compensation.
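The gap between 4-bit and 2-bit precision can be made concrete with plain round-to-nearest uniform quantization. The sketch below is a toy under stated assumptions: the min-max quantizer, the helper names `quantize` and `mse`, and the synthetic sine-wave "weights" are all illustrative. It shows only how rounding error grows as the grid coarsens; it does not reproduce the paper's mechanistic analysis or separate its two failure modes.

```python
import math


def quantize(weights, bits):
    """Round each weight to the nearest point on a uniform min-max grid."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1  # number of steps between lo and hi
    scale = (hi - lo) / levels
    if scale == 0:
        return list(weights)
    return [lo + round((w - lo) / scale) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight vector (not taken from any real model).
weights = [math.sin(0.37 * k) for k in range(256)]
err4 = mse(weights, quantize(weights, 4))
err2 = mse(weights, quantize(weights, 2))
```

With 2 bits only four grid values are available, so `err2` is larger than `err4`; the abstract's claim concerns what such errors do to a model's internal computation, which a per-tensor error number like this cannot show.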

Chinese Translation

后训练量化（Post-Training Quantization, PTQ）对于大型语言模型（Large Language Models, LLMs）的高效部署至关重要。尽管4位量化被广泛认为是一个最佳折衷，但将精度降低到2位通常会引发灾难性的“性能悬崖”。尚不清楚其潜在机制是否存在根本性的差异。因此，我们进行了一项系统的机制分析，揭示了两种质的不同失效模式：信号退化（Signal Degradation），在这种情况下计算模式保持完整，但信息精度因累积误差而受到损害；以及计算崩溃（Computation Collapse），在这种情况下关键组件无法正常工作，阻碍正确的信息处理并破坏早期层中的信号。在这一诊断的指导下，我们进行了机制感知的干预，证明有针对性的、无训练的修复可以缓解信号退化，但对计算崩溃则无效。我们的研究结果为PTQ失效提供了系统的诊断框架，并表明解决计算崩溃需要结构重建，而不仅仅是简单的补偿。


cs.CL / 24 / 2604.19887

Depression Risk Assessment in Social Media via Large Language Models

通过大型语言模型评估社交媒体中的抑郁风险

Gulino, Giorgia, Petrucci, Manuel

Abstract

Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.
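The reported micro-F1 and macro-F1 differ in how they average over labels in a multi-label setting: micro-F1 pools true and false positives across all labels, while macro-F1 averages the per-label F1 scores. A minimal sketch follows; the function names, the two placeholder label names, and the four toy posts are assumptions for illustration and are unrelated to the DepressionEmo annotations or the weighted severity index.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per post."""
    counts = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores.
    macro = sum(f1(*c) for c in counts) / len(labels)
    # Micro: pool the counts over labels, then compute a single F1.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return micro, macro


# Placeholder labels and toy annotations (hypothetical).
labels = ["sadness", "hopelessness"]
gold = [{"sadness", "hopelessness"}, {"sadness"}, {"hopelessness"}, {"sadness"}]
pred = [{"sadness"}, {"sadness"}, {"sadness", "hopelessness"}, {"sadness"}]
micro, macro = micro_macro_f1(gold, pred, labels)
```

In this toy case the rarer label is predicted less well, so macro-F1 falls below micro-F1, the same direction as the gap between the two scores quoted in the abstract.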

Chinese Translation

抑郁症是全球最普遍且最具削弱性的心理健康问题之一，常常被低估和治疗不足。社交媒体平台的普及为自动监测心理健康提供了丰富的自然语言信号。在本研究中，我们提出了一种基于大型语言模型（LLMs）的系统，用于评估Reddit帖子中的抑郁风险，通过对八种与抑郁相关情绪的多标签分类和加权严重性指数的计算来实现。该方法在标注的DepressionEmo数据集（约6000个帖子）上以零样本设置进行评估，并在2024-2025年期间应用于从四个子版块收集的469,692条评论。我们最好的模型gemma3:27b达到了micro-F1 = 0.75和macro-F1 = 0.70的成绩，与专门构建的微调模型（BART: micro-F1 = 0.80, macro-F1 = 0.76）具有竞争力。野外分析显示，各社区之间风险特征一致且在时间上稳定，r/depression和r/anxiety之间存在显著差异。我们的研究结果表明，采用一种具有成本效益、可扩展的方法进行大规模心理监测是可行的。


cs.CL / 25 / 2604.19921

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

带有否定的常识知识：增强否定理解的资源

Wang, Zijie, Rezaei, MohammadHossein, Rashid, Farzana, Blanco, Eduardo

Abstract

Negation is a common and important semantic feature in natural language, yet Large Language Models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with if-then relations. In addition, pre-training LLMs on our corpora benefits negation understanding.

Chinese Translation

否定是自然语言中一种常见且重要的语义特征，但大型语言模型（LLMs）在涉及否定的自然语言理解任务中表现不佳。另一方面，尽管常识知识是一个研究较多的话题，但关于否定的研究仍然不足。在本研究中，我们展示了带有否定的常识知识对模型理解的挑战。我们提出了一种新颖的方法，自动增强现有的常识知识语料库，加入否定内容，生成两个包含超过200万条if-then关系的新语料库。此外，在我们的语料库上进行预训练的LLMs有助于提高否定理解能力。


cs.CL / 26 / 2604.19934

Tracing Relational Knowledge Recall in Large Language Models

追踪大型语言模型中的关系知识回忆

Popovič, Nicholas, Färber, Michael

Abstract

We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, as well as characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how distributed the signal on which the probe relies is across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to reveal probe behavior in further detail.

Chinese Translation

我们研究大型语言模型在文本生成过程中如何回忆关系知识，重点识别适合通过线性探测器进行关系分类的潜在表征。先前的研究展示了注意力头和多层感知机（MLP）如何相互作用以解析主语、谓语和宾语，但尚不清楚哪些表征支持忠实的线性关系分类，以及为什么某些关系类型比其他类型更容易以线性方式捕获。我们系统地评估了源自注意力头和MLP贡献的不同潜在表征，显示每个头对残差流的注意力贡献是线性关系分类的相对强特征。对训练探测器的特征归因分析，以及不同关系类型的特征，揭示了探测器准确性与关系特异性、实体连通性以及探测器依赖的信号在注意力头之间的分布程度之间的明确相关性。最后，我们展示了探测器预测的令牌级特征归因如何用于更详细地揭示探测器的行为。


cs.CL / 27 / 2604.19943

Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference

健康素养标注中的结构性分歧：认知稳定性、概念难度与分层一致性推断

Kellert, Olga, Kondury, Sriya, Koo, Candice, Tyagi, Nemika, Eikenberry, Steffen

Abstract

Annotation pipelines in Natural Language Processing (NLP) commonly assume a single latent ground truth per instance and resolve disagreement through label aggregation. Perspectivist approaches challenge this view by treating disagreement as potentially informative rather than erroneous. We present a large-scale analysis of graded health-literacy annotations from 6,323 open-ended COVID-19 responses collected in Ecuador and Peru. Each response was independently labeled by multiple annotators using proportional correctness scores, reflecting the degree to which responses align with normative public-health guidelines, allowing us to analyze the full distribution of judgments rather than aggregated labels. Variance decomposition shows that question-level conceptual difficulty accounts for substantially more variance than annotator identity, indicating that disagreement is structured by the task itself rather than driven by individual raters. Agreement-stratified analyses further reveal that key social-scientific effects, including country, education, and urban-rural differences, vary in magnitude and in some cases reverse direction across levels of inter-annotator agreement. These findings suggest that graded health-literacy evaluation contains both epistemically stable and unstable components, and that aggregating across them can obscure important inferential differences. We therefore argue that strong perspectivist modeling is not only conceptually justified but statistically necessary for valid inference in graded interpretive tasks.

Chinese Translation

自然语言处理（NLP）中的标注流程通常假设每个实例存在单一的潜在真实值，并通过标签聚合来解决分歧。观点主义方法挑战了这一观点，将分歧视为潜在的信息而非错误。我们对来自厄瓜多尔和秘鲁的6,323个开放式COVID-19响应的分级健康素养标注进行了大规模分析。每个响应由多个标注者独立标注，使用比例正确性评分，反映响应与规范公共卫生指南的一致程度，使我们能够分析判断的完整分布，而不是聚合标签。方差分解显示，问题层面的概念难度所占的方差远超过标注者身份，表明分歧是由任务本身结构化的，而非由个体评估者驱动。分层一致性分析进一步揭示，关键社会科学效应（包括国家、教育和城乡差异）在标注者一致性水平上变化的幅度不同，且在某些情况下方向相反。这些发现表明，分级健康素养评估包含了既有认知稳定又有不稳定的成分，而在它们之间进行聚合可能会掩盖重要的推断差异。因此，我们认为强有力的观点主义建模不仅在概念上是合理的，而且在统计上对于有效推断分级解释任务是必要的。


cs.CL / 28 / 2604.20006

From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

从回忆到遗忘：个性化智能体的长期记忆基准测试

Uddin, Md Nayem, Shubham, Kumar, Blanco, Eduardo, Baral, Chitta, Wang, Gengyu

Abstract

Personalized agents that interact with users over long periods must maintain persistent memory across sessions and update it as circumstances change. However, existing benchmarks predominantly frame long-term memory evaluation as fact retrieval from past conversations, providing limited insight into agents' ability to consolidate memory over time or handle frequent knowledge updates. We introduce Memora, a long-term memory benchmark of user conversations spanning weeks to months. The benchmark evaluates three memory-grounded tasks: remembering, reasoning, and recommending. To ensure data quality, we employ automated memory-grounding checks and human evaluation. We further introduce Forgetting-Aware Memory Accuracy (FAMA), a metric that penalizes reliance on obsolete or invalidated memory when evaluating long-term memory. Evaluations of four LLMs and six memory agents reveal frequent reuse of invalid memories and failures to reconcile evolving memories. Memory agents offer marginal improvements, exposing shortcomings in long-term memory for personalized agents.

Chinese Translation

与用户进行长期互动的个性化智能体必须在会话之间保持持久的记忆，并在情况变化时进行更新。然而，现有的基准测试主要将长期记忆评估框定为从过去对话中检索事实，这对智能体在时间上巩固记忆或处理频繁知识更新的能力提供了有限的洞察。我们引入了Memora，这是一个涵盖数周到数月用户对话的长期记忆基准。该基准评估三个基于记忆的任务：记忆、推理和推荐。为了确保数据质量，我们采用了自动化的记忆基础检查和人工评估。我们进一步引入了遗忘意识记忆准确性（Forgetting-Aware Memory Accuracy, FAMA），这一指标在评估长期记忆时对依赖过时或无效记忆的行为进行惩罚。对四个大型语言模型（LLMs）和六个记忆智能体的评估显示，存在频繁重用无效记忆和未能调和不断演变的记忆的问题。记忆智能体提供了边际改进，暴露了个性化智能体在长期记忆方面的不足。


cs.CL / 29 / 2604.20043

TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

TriEx：一个基于游戏的三视角框架，用于解释多智能体大语言模型的内部推理

Wang, Ziyi, Zhang, Chen, Peng, Wenjun, Wu, Qi, Wang, Xinyu

Abstract

Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.

Chinese Translation

在交互式、部分可观察的环境中，大语言模型（LLM）代理的可解释性尤其具有挑战性，因为决策依赖于不断变化的信念和其他代理。我们提出了TriEx，一个三视角可解释性框架，通过对齐的工具来支持顺序决策：(i) 结构化的第一人称自我推理，绑定于某个动作；(ii) 随时间更新的关于对手的明确第二人称信念状态；(iii) 基于环境衍生参考信号的第三人称神谕审计。该设计将解释从自由形式的叙述转变为基于证据的对象，可以在时间和视角之间进行比较和检查。通过使用不完全信息的战略游戏作为受控测试平台，我们展示了TriEx能够实现对解释的真实性、信念动态和评估者可靠性的可扩展分析，揭示了代理所说、所信和所做之间的系统性不匹配。我们的结果强调可解释性是一个依赖于交互的属性，并推动了对LLM代理的多视角、基于证据的评估。代码可在 https://github.com/Einsam1819/TriEx 获取。


cs.CL / 30 / 2604.20048

Large language models perceive cities through a culturally uneven baseline

大型语言模型通过文化不均衡的基线感知城市

Zhao, Rong, Liu, Wanqi, Sha, Zhizhou, Su, Nanxi, Zhang, Yecheng

Abstract

Large language models (LLMs) are increasingly used to describe, evaluate and interpret places, yet it remains unclear whether they do so from a culturally neutral standpoint. Here we test urban perception in frontier LLMs using a balanced global street-view sample and prompts that either remain neutral or invoke different regional cultural standpoints. Across open-ended descriptions and structured place judgments, the neutral condition proved not to be neutral in practice. Prompts associated with Europe and Northern America remained systematically closer to the baseline than many non-Western prompts, indicating that model perception is organized around a culturally uneven reference frame rather than a universal one. Cultural prompting also shifted affective evaluation, producing sentiment-based ingroup preference for some prompted identities. Comparisons with regional human text-image benchmarks showed that culturally proximate prompting could improve alignment with human descriptions, but it did not recover human levels of semantic diversity and often preserved an affectively elevated style. The same asymmetry reappeared in structured judgments of safety, beauty, wealth, liveliness, boredom and depression, where model outputs were interpretable but only partly reproduced human group differences. These findings suggest that LLMs do not simply perceive cities from nowhere: they do so through a culturally uneven baseline that shapes what appears ordinary, familiar and positively valued.

Chinese Translation

大型语言模型（LLMs）越来越多地被用于描述、评估和解释地点，但它们是否从文化中立的立场进行这些操作仍然不清楚。在这里，我们使用平衡的全球街景样本和保持中立或引发不同区域文化立场的提示，测试前沿LLMs的城市感知。在开放式描述和结构化地点判断中，中立条件在实践中并不真正中立。与欧洲和北美相关的提示系统性地更接近基线，而许多非西方提示则显示出较大的偏差，这表明模型的感知是围绕着文化不均衡的参考框架而非普遍框架组织的。文化提示还改变了情感评估，产生了对某些提示身份的基于情感的内群体偏好。与区域人类文本-图像基准的比较显示，文化相近的提示可以改善与人类描述的一致性，但并未恢复人类水平的语义多样性，并且往往保持了一种情感上升的风格。在安全、美丽、财富、生动、无聊和抑郁的结构化判断中，同样的不对称现象再次出现，模型输出虽然可解释，但仅部分再现了人类群体差异。这些发现表明，LLMs并不是简单地从无处感知城市：它们是通过一个文化不均衡的基线来感知的，这种基线塑造了什么看起来是普通的、熟悉的和被积极评价的。


cs.CL / 31 / 2604.20051

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

通过基于评分标准的自我对弈在预训练文本上引导开放式任务的后训练信号

Huang, Chengyu, Chou, Sheng-Yen, Zhang, Zhengxin, Cardie, Claire

Abstract

Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.

Chinese Translation

自我对弈最近成为训练大型语言模型（LLMs）的一个有前景的范式。在自我对弈中，目标LLM创建任务输入（例如，提出问题），然后通过生成任务输出（例如，给出答案）来解决该输入。奖励模型对输出进行评估，随后使用这些奖励来训练LLM，通常通过强化学习（RL）进行。自我对弈的监督成本极低，这对于后训练的LLM尤其有帮助，因为它们需要高质量的输入-输出对，而这些通常需要由人类或昂贵的专有模型编写。然而，现有工作仅探讨了可验证任务（如数学和编程）的自我对弈。相反，我们希望将其扩展到更现实的开放式任务。具体而言，我们提出了POP，一个自我对弈框架，利用同一LLM为每个示例合成评估评分标准以及输入-输出对。然后使用评分标准来评估输出并训练模型。我们进一步将该框架基于一个内容丰富的预训练语料库，以（1）确保生成-验证差距并减少奖励黑客行为，以及（2）防止模式崩溃。在Qwen-2.5-7B上，POP提高了预训练和指令调优模型在不同任务中的表现，这些任务包括长篇医疗问答、创意写作和遵循指令。


cs.CL / 32 / 2604.20087

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

SkillLearnBench：针对真实世界任务的智能体技能生成的持续学习方法基准测试

Zhong, Shanshan, Lu, Yi, Ning, Jingjie, Wan, Yibing, Feng, Lihan, Ao, Yuyi, Ribeiro, Leonardo F. R., Dreyer, Markus, Ammirati, Sean, Xiong, Chenyan

Abstract

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, namely those leveraging one-shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.

Chinese Translation

技能已成为使大型语言模型（LLM）智能体能够执行复杂真实世界任务的事实标准，借助定制的指令、工作流程和工具，但如何自动且有效地学习这些技能仍然不明确。我们引入了SkillLearnBench，这是第一个用于评估持续技能学习方法的基准，包含20个经过验证的、依赖技能的任务，涵盖15个子领域，基于真实世界技能分类法，评估三个层面：技能质量、执行轨迹和任务结果。利用这一基准，我们评估了最近的持续学习技术，包括利用一次性学习、自我/教师反馈和技能创造者从智能体经验中生成技能的方法。我们发现所有持续学习方法在无技能基线之上都有所改善，但一致的增益仍然难以实现：没有任何方法在所有任务和大型语言模型（LLM）中都表现领先，且扩展到更强的LLM并不可靠地提供帮助。持续学习在具有清晰、可重用工作流程的任务上有所改善，但在开放式任务上表现不佳，使用更强的LLM基础模型并不总能产生更好的技能。我们的分析还揭示，多次迭代的持续学习通过外部反馈促进了真正的改进，而仅依赖自我反馈则会导致递归漂移。我们的数据和代码已开源，地址为 https://github.com/cxcscmu/SkillLearnBench，以便进一步研究自动技能生成和持续学习技术。


cs.CL / 33 / 2604.20090

Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework

更少的语言，更少的标记：一种高效的统一逻辑跨语言链式思维推理框架

Zhang, Chenyuan, Chen, Qiguang, Chen, Xie, Tian, Zhuotao, Xing, Bowen, Zhang, Meishan, Qin, Libo, Hu, Baotian, Zhang, Min

Abstract

Cross-lingual chain-of-thought (XCoT) with self-consistency markedly enhances multilingual reasoning, yet existing methods remain costly due to extensive sampling of full trajectories across languages. Moreover, multilingual LLM representations vary strongly by language, hindering direct feature comparisons and effective pruning. Motivated by this, we introduce UL-XCoT, the first efficient unified logic cross-lingual reasoning framework that minimizes redundancy in token usage and latency, yielding the greatest efficiency under limited sampling budgets during inference. Specifically, UL-XCoT (1) achieves less languages by selecting, per query, a small candidate language set in a language-invariant unified logic space, (2) enables less tokens by monitoring logic-space trajectory dynamics during decoding to prune low-quality reasoning paths, and (3) aggregates the remaining high-quality trajectories via voting. Experiments on PolyMath across 18 languages and MMLU-ProX-Lite across 29 languages with DeepSeek-R1-DistillQwen-7B demonstrate that UL-XCoT achieves competitive accuracy while cutting decoding token cost by over 50% versus prior sampling baselines. UL-XCoT also delivers more stable gains on low-resource languages, underscoring consistently superior robustness where the standard XCoT self-consistency method fails.

Chinese Translation

跨语言链式思维（XCoT）结合自一致性显著增强了多语言推理，但现有方法由于在各语言间广泛采样完整轨迹而成本高昂。此外，多语言大语言模型（LLM）表示在不同语言间差异显著，妨碍了直接特征比较和有效剪枝。基于此，我们提出了UL-XCoT，这是第一个高效的统一逻辑跨语言推理框架，旨在最小化标记使用和延迟的冗余，在推理过程中在有限的采样预算下实现最大的效率。具体而言，UL-XCoT (1) 通过在语言不变的统一逻辑空间中为每个查询选择一个小的候选语言集，从而实现更少的语言，(2) 通过在解码过程中监控逻辑空间轨迹动态来修剪低质量推理路径，从而实现更少的标记，(3) 通过投票聚合剩余的高质量轨迹。使用DeepSeek-R1-DistillQwen-7B在18种语言的PolyMath和29种语言的MMLU-ProX-Lite上进行的实验表明，UL-XCoT实现了具有竞争力的准确性，同时相较于之前的采样基线将解码标记成本减少了50%以上。UL-XCoT在低资源语言上也提供了更稳定的增益，强调了在标准XCoT自一致性方法失效时的一贯优越鲁棒性。


cs.CL / 34 / 2604.20117

To Know is to Construct: Schema-Constrained Generation for Agent Memory

知识即构建：基于模式约束的代理记忆生成

Zheng, Lei, Song, Weinan, Li, Daili, Yang, Yanming

Abstract

Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks "Structural Hallucination" where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.
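The guarantee against generating non-existent memory keys can be pictured as decoding over a prefix tree of valid keys, where at every step the model may only pick among tokens that continue some stored key. The sketch below illustrates that general idea only; it is not the SCG-MEM implementation. The names `build_trie` and `constrained_decode`, the `END` marker, the token-tuple key format, and the toy `score` function standing in for LLM token scores are all hypothetical.

```python
END = "<end>"  # hypothetical marker meaning "a stored key stops here"


def build_trie(keys):
    """Build a nested-dict prefix tree from keys given as tuples of tokens."""
    root = {}
    for key in keys:
        node = root
        for token in key:
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def constrained_decode(trie, score):
    """Greedy decoding restricted to continuations present in the trie.

    score(prefix, token) stands in for a model's preference for `token`
    given the tokens generated so far. Assumes the trie is non-empty.
    """
    node, out = trie, []
    while True:
        allowed = list(node.keys())  # only tokens that extend a valid key
        token = max(allowed, key=lambda t: score(out, t))
        if token == END:
            return tuple(out)
        out.append(token)
        node = node[token]


# Toy memory keys and a toy scorer (both hypothetical).
memory_keys = [("user", "diet"), ("user", "job"), ("trip", "paris")]
preference = {"user": 0.9, "trip": 0.1, "job": 0.8, "diet": 0.3, "cat": 5.0, END: 0.5}
chosen_key = constrained_decode(
    build_trie(memory_keys), lambda prefix, tok: preference.get(tok, 0.0)
)
```

Although the scorer prefers the token "cat" most strongly, it is never offered as an option, so the decoded key is always one that exists in memory. A real system would apply the same mask to model logits and would also need to update the trie as the schema grows.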

Chinese Translation

建构主义认识论认为知识是主动构建的，而非被动复制。尽管大型语言模型（LLMs）具有生成特性，但现有的大多数代理记忆系统仍然基于密集检索。然而，密集检索严重依赖于句子中的语义重叠或实体匹配。因此，嵌入往往无法区分语义相似但上下文不同的实例，通过检索上下文不匹配的条目引入了大量噪声。相反，直接采用开放式生成进行记忆访问则存在“结构性幻觉”的风险，即模型生成的记忆键在记忆中并不存在，导致查找失败。受到这一认识论的启发，我们认为记忆在本质上是由认知模式组织的，有效的回忆必须是在这些模式结构内进行的生成过程。为此，我们提出了SCG-MEM，一种基于模式约束的生成记忆架构。SCG-MEM将记忆访问重新表述为模式约束生成。通过维持动态认知模式，我们严格限制LLM解码，仅生成有效的记忆条目键，从而提供对结构性幻觉的正式保证。为了支持长期适应，我们通过同化（将输入嵌入到现有模式中）和顺应（用新概念扩展模式）来建模记忆更新。此外，我们构建了一个关联图，以通过激活传播实现多跳推理。在LoCoMo基准上的实验表明，SCG-MEM在所有类别上显著提高了相较于基于检索的基线的性能。


cs.CL / 35 / 2604.20131

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

谁的故事被讲述？大型语言模型摘要中的位置性与偏见

Subbiah, Melanie, Mian, Haaris, Deas, Nicholas, Mayukha, Ananya, McAdams, Dan P., McKeown, Kathleen

Abstract

Increasingly, studies are exploring using Large Language Models (LLMs) for accelerated or scaled qualitative analysis of text data. While we can compare LLM accuracy against human labels directly for deductive coding, or labeling text, it is more challenging to judge the ethics and effectiveness of using LLMs in abstractive methods such as inductive thematic analysis. We collaborate with psychologists to study the abstractive claims LLMs make about human life stories, asking, how does using an LLM as an interpreter of meaning affect the conclusions and perspectives of a study? We propose a summarization-based pipeline for surfacing biases in perspective-taking an LLM might employ in interpreting these life stories. We demonstrate that our pipeline can identify both race and gender bias with the potential for representational harm. Finally, we encourage the use of this analysis in future studies involving LLM-based interpretation of study participants' written text or transcribed speech to characterize a positionality portrait for the study.

Chinese Translation

越来越多的研究探讨使用大型语言模型（LLMs）加速或扩展文本数据的定性分析。虽然我们可以直接将LLM的准确性与人类标签进行比较，以进行演绎编码或文本标记，但在判断使用LLM进行归纳主题分析等抽象方法的伦理性和有效性时则更具挑战性。我们与心理学家合作，研究LLM对人类生活故事所做的抽象性陈述，提出问题：使用LLM作为意义的解释者如何影响研究的结论和视角？我们提出了一种基于摘要的流程，以揭示LLM在解释这些生活故事时可能采用的视角偏见。我们证明了我们的流程能够识别种族和性别偏见，并可能造成代表性伤害。最后，我们鼓励在未来的研究中使用这种分析，涉及基于LLM的对研究参与者书面文本或转录语音的解释，以描绘研究的位置信息画像。


cs.CL / 36 / 2604.20135

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

AFMRL：增强属性的细粒度多模态表示学习在电子商务中的应用

Zhang, Biao, Chen, Lixin, Zhang, Bin, Wang, Zongwei, Liu, Tong, Zheng, Bo

Abstract

Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard samples and filter out noisy false negatives. 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model post-attribute integration serves as a reward signal to enhance MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.

Chinese Translation

多模态表示对于电子商务任务（如相同产品检索）至关重要。大型表示模型（例如，VLM2Vec）展现出强大的多模态理解能力，但在细粒度语义理解方面表现不佳，而这对于区分高度相似的物品至关重要。为了解决这一问题，我们提出了增强属性的细粒度多模态表示学习（AFMRL），将产品的细粒度理解定义为一个属性生成任务。该方法利用多模态大型语言模型（MLLM）的生成能力，从产品图像和文本中提取关键属性，并通过一个两阶段的训练框架增强表示学习：1）属性引导对比学习（AGCL），在此过程中，MLLM生成的关键属性被用于图像-文本对比学习训练，以识别困难样本并过滤掉噪声假阴性。2）检索感知属性强化（RAR），在属性集成后，表示模型的检索性能提升作为奖励信号，以增强MLLM在多模态微调过程中的属性生成能力。在大规模电子商务数据集上的大量实验表明，我们的方法在多个下游检索任务中实现了最先进的性能，验证了利用生成模型推进细粒度表示学习的有效性。


cs.CL / 37 / 2604.20148

Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

元工具：小型语言模型的高效少样本工具适应

Kumar, Sachin

Abstract

Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms--few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search--across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at $10 \times$ lower latency. Error analysis across 722 failure cases spanning all shot counts (0--5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors with remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.

Chinese Translation

小型语言模型能否在没有复杂适应机制的情况下实现强大的工具使用性能？本文通过元工具（Meta-Tool）这一受控实证研究探讨了这一问题，比较了基于超网络的LoRA适应与精心设计的少样本提示。我们使用Llama-3.2-3B-Instruct作为基础模型，评估了四种适应机制——少样本提示、文档编码、超网络生成的LoRA权重以及价值引导的束搜索——在四个不同的基准测试上：Gorilla APIBench、Spider 2.0、WebArena和InterCode。我们的主要发现是一个有力的负结果：尽管生成了非平凡的权重矩阵，227.8M参数的超网络在性能上并未比单独的少样本提示有可测量的改善。全面的消融研究表明，少样本示例对性能贡献+21.5%，文档贡献+5.0%，而超网络的贡献为0%。一个设计良好的3B模型在$10 \times$更低延迟下达到了GPT-5平均性能的79.7%。在722个失败案例的错误分析中，涵盖了所有样本数量（0-5），显示在5样本配置下（106个失败），失败模式是任务依赖的：以模式为重的任务（Spider 2.0、WebArena）几乎没有格式错误，其余失败为语义错误，而在Gorilla（100%）和InterCode（70%）上，格式错误占主导地位。这些发现引导从业者更关注提示工程和示例策划，而非复杂的适应架构。


cs.CL / 38 / 2604.20166

Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

为心理健康支持对齐人机交互信任：多利益相关者的调查与立场

Sun, Xin, Su, Yue, Mo, Yifan, Meng, Qingyu, Li, Yuxuan, Sugawara, Saku, Zhang, Mengyuan, Gerritsen, Charlotte, Koole, Sander L., Hindriks, Koen, Pei, Jiahuan

Abstract

Building trustworthy AI systems for mental health support is a shared priority across stakeholders from multiple disciplines. However, "trustworthy" remains loosely defined and inconsistently operationalized. AI research often focuses on technical criteria (e.g., robustness, explainability, and safety), while therapeutic practitioners emphasize therapeutic fidelity (e.g., appropriateness, empathy, and long-term user outcomes). To bridge this fragmented landscape, we propose a three-layer trust framework, covering human-oriented, AI-oriented, and interaction-oriented trust, integrating the viewpoints of key stakeholders (e.g., practitioners, researchers, regulators). Using this framework, we systematically review existing AI-driven research in the mental health domain and examine evaluation practices for "trustworthy" ranging from automatic metrics to clinically validated approaches. We highlight critical gaps between what NLP currently measures and what real-world mental health contexts require, and outline a research agenda for building socio-technically aligned and genuinely trustworthy AI for mental health support.

Chinese Translation

构建值得信赖的人工智能系统以支持心理健康是来自多个学科的利益相关者共同的优先事项。然而，“值得信赖”这一概念仍然定义模糊且操作不一致。人工智能研究通常侧重于技术标准（例如，稳健性、可解释性和安全性），而治疗实践者则强调治疗的忠实性（例如，适宜性、同理心和长期用户结果）。为了弥合这一碎片化的现状，我们提出了一个三层信任框架，涵盖以人为本的信任、以人工智能为导向的信任和以交互为导向的信任，整合了关键利益相关者（例如，实践者、研究者、监管者）的观点。利用这一框架，我们系统性地回顾了心理健康领域现有的人工智能驱动研究，并考察了对“值得信赖”的评估实践，从自动化指标到临床验证的方法。我们强调了当前自然语言处理（NLP）所测量的内容与现实世界心理健康情境所需之间的关键差距，并概述了构建社会技术对齐和真正值得信赖的心理健康支持人工智能的研究议程。


cs.CL / 39 / 2604.20168

Duluth at SemEval-2026 Task 6: DeBERTa with LLM-Augmented Data for Unmasking Political Question Evasions

Duluth在SemEval-2026任务6中的表现：使用LLM增强数据的DeBERTa模型揭示政治问题的回避

Syed, Shujauddin, Pedersen, Ted

Abstract

This paper presents the Duluth approach to SemEval-2026 Task 6 on CLARITY: Unmasking Political Question Evasions. We address Task 1 (clarity-level classification) and Task 2 (evasion-level classification), both of which involve classifying question-answer pairs from U.S. presidential interviews using a two-level taxonomy of response clarity. Our system is based on DeBERTa-V3-base, extended with focal loss, layer-wise learning rate decay, and boolean discourse features. To address class imbalance in the training data, we augment minority classes using synthetic examples generated by Gemini 3 and Claude Sonnet 4.5. Our best configuration achieved a Macro F1 of 0.76 on the Task 1 evaluation set, placing 8th out of 40 teams. The top-ranked system (TeleAI) achieved 0.89, while the mean score across participants was 0.70. Error analysis reveals that the dominant source of misclassification is confusion between Ambivalent and Clear Reply responses, a pattern that mirrors disagreements among human annotators. Our findings demonstrate that LLM-based data augmentation can meaningfully improve minority-class recall on nuanced political discourse tasks.

Chinese Translation

本文介绍了Duluth在SemEval-2026任务6（CLARITY：揭示政治问题回避）中的方法。我们针对任务1（清晰度分类）和任务2（回避程度分类）进行了研究，这两个任务均涉及使用两级响应清晰度分类法对美国总统采访中的问答对进行分类。我们的系统基于DeBERTa-V3-base，并扩展了焦点损失、逐层学习率衰减和布尔话语特征。为了应对训练数据中的类别不平衡，我们使用Gemini 3和Claude Sonnet 4.5生成的合成示例增强了少数类。我们的最佳配置在任务1评估集上达到了0.76的宏观F1分数，排名40支队伍中的第8位。排名最高的系统（TeleAI）达到了0.89，而参与者的平均得分为0.70。错误分析表明，误分类的主要来源是对模棱两可和清晰回复之间的混淆，这一模式与人工标注者之间的分歧相似。我们的研究结果表明，基于LLM的数据增强可以显著提高在复杂政治话语任务中的少数类召回率。
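Two of the training components named in the abstract, focal loss and layer-wise learning rate decay, have standard textbook forms that can be sketched briefly. The values of `gamma`, `base_lr`, and `decay` below are illustrative assumptions; the abstract does not report the settings the Duluth system used.

```python
import math
from typing import List, Sequence


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def layerwise_lrs(base_lr: float, decay: float, num_layers: int) -> List[float]:
    """Learning rate per layer, bottom to top; the top layer gets base_lr."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# A confidently correct example is down-weighted far more than a hard one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.3, 0.7], target=0)
print(easy < hard)

# Hypothetical 3-layer encoder with decay 0.5.
print(layerwise_lrs(1.0, 0.5, 3))  # [0.25, 0.5, 1.0]
```

The down-weighting of easy examples is why focal loss is a common choice under class imbalance, where majority-class examples would otherwise dominate the gradient.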


cs.CL / 40 / 2604.20183

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

双聚类记忆代理：解决优化问题求解中的多范式歧义

Zhang, Xinyu, Wan, Yuchen, Zhang, Boxuan, Yang, Zesheng, Zhang, Lingling, Wei, Bifan, Liu, Jun

Abstract

Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

Chinese Translation

大型语言模型（LLMs）在优化问题中常常面临结构歧义，其中一个问题可能对应多个相关但相互冲突的建模范式，这阻碍了有效的解决方案生成。为了解决这一问题，我们提出了双聚类记忆代理（DCM-Agent），通过利用历史解决方案以无训练的方式提升性能。其核心是双聚类记忆构建。该代理将历史解决方案分配到建模和编码聚类中，然后将每个聚类的内容提炼为三种结构化类型：方法、检查清单和陷阱。这个过程产生了可推广的指导知识。此外，该代理引入了记忆增强推理，以动态导航解决方案路径，检测和修复错误，并通过结构化知识自适应切换推理路径。在七个优化基准测试中的实验表明，DCM-Agent实现了平均11%-21%的性能提升。值得注意的是，我们的分析揭示了一种“知识继承”现象：由较大模型构建的记忆可以引导较小模型达到更优的性能，突显了该框架的可扩展性和效率。


cs.CL / 41 / 2604.20199

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

所有语言都重要：理解和缓解多语言检索增强生成中的语言偏见

Wang, Dan, Mo, Guozhao, Shi, Yafei, Zhang, Cheng, Zheng, Bo, Cao, Boxi, Chen, Xuanang, Lu, Yaojie, Lin, Hongyu, He, Ben, Han, Xianpei, Sun, Le

Abstract

Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such "answer-critical" documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.

Chinese Translation

多语言检索增强生成（mRAG）利用跨语言证据将大型语言模型（LLMs）与全球知识相结合。然而，我们展示了当前的mRAG系统在重新排序过程中存在语言偏见，系统性地偏向英语和查询的母语。通过引入估计的oracle证据分析，我们量化了现有重新排序器与可实现的上限之间的显著性能差距。进一步分析揭示了一个关键的分布不匹配：尽管最佳预测需要跨多种语言分散的证据，但当前系统系统性地抑制了这种“答案关键”文档，从而限制了下游生成性能。为了弥补这一差距，我们提出了Language-Agnostic Utility-driven Reranker Alignment (LAURA)，它将多语言证据排序与下游生成效用对齐。针对多种语言和生成模型的实验表明，LAURA有效缓解了语言偏见，并持续提高了mRAG的性能。


cs.CL / 42 / 2604.20200

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

追逐公共评分：用户压力与编码代理工作流程中的评估利用

Chen, Hardy, Lau, Nancy, Tu, Haoqin, Yan, Shuo, Liu, Xiangyan, Wang, Zijun, Wu, Juncheng, Shieh, Michael Qizhe, Cardenas, Alvaro A., Xie, Cihang, Zhou, Yuyin

Abstract

Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file with labels in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs, spanning all tasks. We also find that stronger models have higher exploitation rates, supported by a significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds (i.e., from 19.67 to 4.08). As a mitigation, adding explicit anti-exploit wording to the prompt mostly eliminates exploitation (100% to 8.3%). We hope that our work draws attention to more careful use of coding-agent workflows and to the development of coding agents that are more robust under user pressure. Our project page is at https://ucsc-vlaa.github.io/AgentPressureBench .

Chinese Translation

前沿编码代理在工作流程中的使用越来越普遍，用户主要通过对公共评分的反复改进来监督进展，即在工作区的公共评估文件中报告的分数，而不是通过直接检查代理的中间输出。我们研究多轮用户压力是否会促使公共评分的利用：即通过捷径提高公共评分而不改善隐藏的私人评估的行为。我们首先进行了一项初步的单脚本表格分类任务，其中 GPT-5.4 和 Claude Opus 4.6 在 10 轮用户与代理的交互中均利用了标签信息。随后，我们构建了 AgentPressureBench，这是一个涵盖三种输入模态的 34 项机器学习基准库，并从 13 个编码代理中收集了 1326 条多轮轨迹。在我们的基准测试中，我们观察到 403 次利用行为，涵盖所有任务。我们还发现，性能更强的模型具有更高的利用率，这一结果得到了 0.77 的显著斯皮尔曼等级相关性的支持。我们的消融实验表明，更高的用户压力会导致更早的利用，将平均首次利用轮次减少了 15.6 轮（即从 19.67 降至 4.08）。作为一种缓解措施，在提示中添加明确的反利用措辞几乎消除了利用行为（从 100% 降至 8.3%）。我们希望我们的工作能够引起对编码代理工作流程更谨慎使用的关注，并在用户压力下开发更强健的编码代理。我们的项目页面地址为 https://ucsc-vlaa.github.io/AgentPressureBench 。


cs.CL / 43 / 2604.20216

Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

基于分位数标记和邻域上下文的文本到分布预测

Zhu, Yilun, Zhuang, Yuan, Vedula, Nikhita, Dhyani, Dushyanta, Xu, Shaoyuan, Li, Moyan, Bayati, Mohsen, Wang, Bryan, Malmasi, Shervin

Abstract

Many applications of LLM-based text regression require predicting a full conditional distribution rather than a single point value. We study distributional regression under empirical-quantile supervision, where each input is paired with multiple observed quantile outcomes, and the target distribution is represented by a dense grid of quantiles. We address two key limitations of current approaches: the lack of local grounding for distribution estimates, and the reliance on shared representations that create an indirect bottleneck between inputs and quantile outputs. In this paper, we introduce Quantile Token Regression, which, to our knowledge, is the first work to insert dedicated quantile tokens into the input sequence, enabling direct input-output pathways for each quantile through self-attention. We further augment these quantile tokens with retrieval, incorporating semantically similar neighbor instances and their empirical distributions to ground predictions with local evidence from similar instances. We also provide the first theoretical analysis of loss functions for quantile regression, clarifying which distributional objectives each optimizes. Experiments on the Inside Airbnb and StackSample benchmark datasets with LLMs ranging from 1.7B to 14B parameters show that quantile tokens with neighbors consistently outperform baselines (~4 points lower MAPE and 2x narrower prediction intervals), with especially large gains on smaller and more challenging datasets where quantile tokens produce substantially sharper and more accurate distributions.

Chinese Translation

许多基于大型语言模型（LLM）的文本回归应用需要预测完整的条件分布，而不仅仅是单一的点值。我们研究了在经验分位数监督下的分布回归，其中每个输入与多个观察到的分位数结果配对，目标分布由密集的分位数网格表示。我们解决了当前方法的两个关键局限性：分布估计缺乏局部基础，以及依赖共享表示导致输入与分位数输出之间的间接瓶颈。在本文中，我们提出了分位数标记回归（Quantile Token Regression），据我们所知，这是首个将专用分位数标记插入输入序列的工作，通过自注意力机制为每个分位数提供直接的输入输出通道。我们进一步通过检索增强这些分位数标记，结合语义相似的邻近实例及其经验分布，以用相似实例的局部证据来支撑预测。我们还首次对分位数回归的损失函数进行了理论分析，澄清了每个损失函数优化的分布目标。在Inside Airbnb和StackSample基准数据集上的实验，使用参数范围从17亿到140亿的LLM，显示带有邻域的分位数标记在性能上始终优于基线（平均绝对百分比误差降低约4个百分点，预测区间缩窄2倍），尤其在较小和更具挑战性的数据集上，分位数标记能够产生显著更尖锐和更准确的分布。
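For readers unfamiliar with quantile supervision, the standard pinball (quantile) loss shows how a model is scored against a grid of quantile levels. This is a generic sketch under the assumption that a pinball-style objective is among the losses the paper analyzes; the abstract does not name the exact losses, and the grid and numbers below are illustrative.

```python
from typing import Sequence


def pinball_loss(y: float, q_pred: float, tau: float) -> float:
    """Pinball loss for a single quantile level tau in (0, 1)."""
    diff = y - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball(y: float, q_preds: Sequence[float], taus: Sequence[float]) -> float:
    """Average pinball loss over a grid of quantile levels."""
    return sum(pinball_loss(y, q, t) for q, t in zip(q_preds, taus)) / len(taus)


# For the 0.9 quantile, predicting too low costs more than predicting too high.
under = pinball_loss(10.0, 8.0, tau=0.9)   # approximately 1.8
over = pinball_loss(10.0, 12.0, tau=0.9)   # approximately 0.2
print(under > over)

# Hypothetical three-point quantile grid for one observed outcome.
print(mean_pinball(10.0, [7.0, 10.0, 13.0], [0.1, 0.5, 0.9]))
```

The asymmetry in the weights is what pushes each predicted value toward its own quantile of the target distribution rather than toward the mean.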


cs.CL / 44 / 2604.20221

Markov reads Pushkin, again: A statistical journey into the poetic world of Evgenij Onegin

马尔可夫再次阅读普希金：对《叶甫盖尼·奥涅金》诗意世界的统计探索

Sabatini, Angelo Maria

Abstract

This study applies symbolic time series analysis and Markov modeling to explore the phonological structure of Evgenij Onegin, as captured through a graphemic vowel/consonant (V/C) encoding, and of one contemporary Italian translation. Using a binary encoding inspired by Markov's original scheme, we construct minimalist probabilistic models that capture both local V/C dependencies and large-scale sequential patterns. A compact four-state Markov chain is shown to be descriptively accurate and generative, reproducing key features of the original sequences such as autocorrelation and memory depth. All findings are exploratory in nature and aim to highlight structural regularities while suggesting hypotheses about underlying narrative dynamics. The analysis reveals a marked asymmetry between the Russian and Italian texts: the original exhibits a gradual decline in memory depth, whereas the translation maintains a more uniform profile. To further investigate this divergence, we introduce phonological probes: short symbolic patterns that link surface structure to narrative-relevant cues. Tracked across the unfolding text, these probes reveal subtle connections between graphemic form and thematic development, particularly in the Russian original. By revisiting Markov's original proposal of applying symbolic analysis to a literary text and pairing it with contemporary tools from computational statistics and data science, this study shows that even minimalist Markov models can support exploratory analysis of complex poetic material. When complemented by a coarse layer of linguistic annotation, such models provide a general framework for comparative poetics and demonstrate that stylized structural patterns remain accessible through simple representations grounded in linguistic form.

Chinese Translation

本研究应用符号时间序列分析和马尔可夫建模，探讨《叶甫盖尼·奥涅金》的音韵结构——通过字母元音/辅音（V/C）编码捕捉，以及一部当代意大利翻译。我们使用受马尔可夫原始方案启发的二元编码，构建最简概率模型，以捕捉局部V/C依赖性和大规模序列模式。一个紧凑的四状态马尔可夫链被证明在描述上是准确且具有生成性的，能够再现原始序列的关键特征，如自相关性和记忆深度。所有发现本质上都是探索性的，旨在突出结构规律，同时提出关于潜在叙事动态的假设。分析揭示了俄文和意大利文本之间显著的不对称性：原文表现出记忆深度的逐渐下降，而翻译则保持了更均匀的特征。为了进一步研究这种差异，我们引入了音韵探针——短符号模式，将表面结构与叙事相关线索联系起来。这些探针在展开的文本中被追踪，揭示了字母形式与主题发展之间的微妙联系，特别是在俄文原文中。通过重新审视马尔可夫将符号分析应用于文学文本的原始提议，并结合当代计算统计和数据科学工具，本研究表明，即使是最简的马尔可夫模型也能支持对复杂诗歌材料的探索性分析。当辅以粗略的语言注释层时，这些模型提供了比较诗学的一般框架，并展示了风格化结构模式通过基于语言形式的简单表示依然可被获取。
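The V/C encoding and the estimation of transition probabilities can be sketched in a few lines. This is a simplified first-order illustration, not the paper's four-state model: the Latin vowel set is an assumption chosen for readability, and the study's handling of Russian graphemes is not described in the abstract.

```python
from collections import Counter
from typing import Dict, Tuple

VOWELS = set("aeiou")  # assumption: simplified Latin vowel set for illustration


def encode_vc(text: str) -> str:
    """Map each letter to 'V' or 'C'; non-letters are dropped."""
    return "".join("V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha())


def transition_probs(seq: str) -> Dict[Tuple[str, str], float]:
    """Maximum-likelihood first-order transition probabilities P(next | current)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    from_counts = Counter(seq[:-1])
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}


encoded = encode_vc("Onegin")
print(encoded)  # VCVCVC
print(transition_probs(encoded))
```

Treating pairs of consecutive symbols (VV, VC, CV, CC) as states is one way to obtain a four-state chain from this binary encoding, though whether the paper constructs its states this way is an assumption.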


cs.CL / 45 / 2604.20225

The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models

GaoYao基准：评估大型语言模型多语言和多文化能力的综合框架

Liu, Yilun, Zhao, Chunguang, Piao, Mengyao, Miao, Lingqi, Tao, Shimin, He, Minggui, Liu, Chenxin, Zhang, Li, Ma, Hongxia, Guo, Jiaxin, Liu, Chen, Deng, Liqun, Wei, Jiansheng, Meng, Xiaojun, Du, Fanyi, Wei, Daimeng, Xiao, Yanghua

Abstract

Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing cross-cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in-depth diagnostic analysis on 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct gaps between tasks, offering a reliable map for future work. We release the benchmark (https://github.com/lunyiliu/GaoYao).

Chinese Translation

评估大型语言模型（LLMs）的多语言和多文化能力对其全球适用性至关重要。然而，目前的基准面临三项关键限制：（1）评估维度碎片化，常常忽视深层文化细微差别；（2）在依赖低质量机器翻译的主观任务中，语言覆盖不足；（3）分析肤浅，缺乏超越简单排名的诊断深度。为了解决这些问题，我们推出了GaoYao，这是一个包含182.3k样本、26种语言和51个国家/地区的综合基准。首先，GaoYao提出了一个统一框架，将评估任务分为三个文化层次（一般多语言、跨文化、单文化）和九个认知子层次。其次，我们通过利用专家严格本地化主观基准到19种语言，并为34种文化合成跨文化测试集，实现了本地化质量的扩展，覆盖率超过以往高达111%。第三，我们对20多个旗舰和紧凑型LLM进行了深入的诊断分析。我们的研究发现了显著的地理性能差异和任务之间的明显差距，为未来的工作提供了可靠的地图。我们发布了该基准（https://github.com/lunyiliu/GaoYao）。


cs.CL / 46 / 2604.20241

Construction of a Battery Research Knowledge Graph using a Global Open Catalog

基于全球开放目录构建电池研究知识图谱

Foppiano, Luca, Dieb, Sae, Zain, Malik, Kasama, Kazuki, Sodeyama, Keitaro, Tanifuji, Mikiko

Abstract

Battery research is a rapidly growing and highly interdisciplinary field, making it increasingly difficult to track relevant expertise and identify potential collaborators across institutional boundaries. In this work, we present a pipeline for constructing an author-centric knowledge graph of battery research built on OpenAlex, a large-scale open bibliographic catalogue. For each author, we derive a weighted research descriptors vector that combines coarse-grained OpenAlex concepts with fine-grained keyphrases extracted from titles and abstracts using KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend model, selected after evaluating multiple alternatives. Vector components are weighted by research descriptor origin, authorship position, and temporal recency. The framework is applied to a corpus of 189,581 battery-related works. The resulting vectors support author-author similarity computation, community detection, and exploratory search through a browser-based interface. The knowledge graph is then serialized in RDF and linked to Wikidata identifiers, making it interoperable with external linked open data sources and extensible beyond the battery domain. Unlike prior author-centric analyses confined to institutional repositories, our approach operates at cross-institutional scale and grounds similarity in domain semantics rather than citation or co-authorship structure alone.

Chinese Translation

电池研究是一个快速增长且高度跨学科的领域，这使得在机构边界内追踪相关专业知识和识别潜在合作者变得愈加困难。在本研究中，我们提出了一种基于OpenAlex（一个大规模开放书目目录）构建以作者为中心的电池研究知识图谱的流程。对于每位作者，我们推导出一个加权的研究描述符向量，该向量结合了粗粒度的OpenAlex概念和通过KeyBERT与ChatGPT（gpt-3.5-turbo）作为后端模型提取的细粒度关键词，这一模型是在评估多种替代方案后选定的。向量组件根据研究描述符来源、作者位置和时间新近性进行加权。该框架应用于189,581篇与电池相关的文献语料库。生成的向量支持作者间相似性计算、社区检测以及通过基于浏览器的界面进行探索性搜索。知识图谱随后以RDF格式序列化，并链接到Wikidata标识符，使其能够与外部链接开放数据源互操作，并可扩展至电池领域之外。与以往局限于机构库的以作者为中心的分析不同，我们的方法在跨机构规模上运行，并将相似性基于领域语义，而不仅仅是引用或共同作者结构。
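The weighted descriptor vector and author-author similarity can be illustrated as follows. Every numeric choice here is hypothetical (the exponential recency half-life, the 1.0/0.5 authorship weights, the per-descriptor origin weights, and the record layout); the abstract states only that components are weighted by descriptor origin, authorship position, and temporal recency.

```python
import math
from collections import defaultdict
from typing import Dict, List


def author_vector(works: List[dict], current_year: int, half_life: float = 5.0) -> Dict[str, float]:
    """Aggregate descriptor weights over one author's works.

    Each work is a dict with 'year', 'first_author', and 'descriptors'
    (term -> origin weight). All weighting choices are illustrative.
    """
    vec: Dict[str, float] = defaultdict(float)
    for work in works:
        recency = 0.5 ** ((current_year - work["year"]) / half_life)
        position = 1.0 if work["first_author"] else 0.5
        for term, origin_weight in work["descriptors"].items():
            vec[term] += origin_weight * position * recency
    return dict(vec)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse descriptor vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


alice = author_vector(
    [{"year": 2024, "first_author": True, "descriptors": {"solid electrolyte": 1.0, "anode": 0.5}}],
    current_year=2024,
)
bob = author_vector(
    [{"year": 2019, "first_author": False, "descriptors": {"anode": 1.0}}],
    current_year=2024,
)
print(cosine(alice, bob))
```

Sparse dictionaries keep the vectors readable; for a corpus of the size described, a sparse matrix library would be the more practical representation.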


cs.CL / 47 / 2604.20244

Hybrid Policy Distillation for LLMs

大语言模型的混合策略蒸馏

Zhu, Wenhong, Xie, Ruobing, Wang, Rui, Liu, Pengfei

Abstract

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.

Chinese Translation

知识蒸馏（Knowledge Distillation, KD）是一种强大的压缩大语言模型（Large Language Models, LLMs）的方法，其有效性依赖于发散方向、优化策略和数据模式的相互选择。我们对现有KD方法的设计进行了分析，并提出了一个统一的视角，建立了它们之间的联系，将KD重新表述为基于令牌级别的加权对数似然目标。我们进一步提出了混合策略蒸馏（Hybrid Policy Distillation, HPD），该方法结合了前向和反向KL散度的互补优势，以平衡模式覆盖和模式寻求，并将离策略数据与轻量级、近似的在策略采样相结合。我们在长生成数学推理以及短生成对话和代码任务上验证了HPD，展示了在不同模型系列和规模上优化稳定性、计算效率和最终性能的提升。与本研究相关的代码可在 https://github.com/zwhong714/Hybrid-Policy-Distillation 获取。
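The trade-off between forward and reverse KL mentioned in the abstract can be seen on a single next-token distribution. The linear mix with weight `lam` below is only an assumed way of combining the two directions for illustration; the abstract does not specify how HPD balances them, and the distributions are toy values.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_kl(teacher: Sequence[float], student: Sequence[float], lam: float = 0.5) -> float:
    """Assumed linear mix of forward KL (mode-covering) and reverse KL (mode-seeking)."""
    forward = kl(teacher, student)   # penalizes the student for missing teacher mass
    reverse = kl(student, teacher)   # penalizes student mass where the teacher has little
    return lam * forward + (1.0 - lam) * reverse


teacher = [0.5, 0.5]
student = [0.9, 0.1]
print(kl(teacher, student), kl(student, teacher))  # the two directions differ
print(hybrid_kl(teacher, student))
```

Forward KL grows quickly when the student assigns little probability to a token the teacher considers likely, while reverse KL tolerates dropped modes, which is why the two are often described as complementary.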


cs.CL / 48 / 2604.20256

RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

RADS：基于强化学习的样本选择改善低资源和不平衡临床环境中的迁移学习

Han, Wei, Martinez, David, Khanina, Anna, Cavedon, Lawrence, Verspoor, Karin

Abstract

A common strategy in transfer learning is few-shot fine-tuning, but its success is highly dependent on the quality of samples selected as training examples. Active learning methods such as uncertainty sampling and diversity sampling can select useful samples. However, under extremely low-resource and class-imbalanced conditions, they often favor outliers rather than truly informative samples, resulting in degraded performance. In this paper, we introduce RADS (Reinforcement Adaptive Domain Sampling), a robust sample selection strategy using reinforcement learning (RL) to identify the most informative samples. Experimental evaluations on several real-world clinical datasets show that, compared to traditional methods, our sample selection strategy enhances model transferability while maintaining robust performance under extreme class imbalance.

Chinese Translation

迁移学习中的一种常见策略是少量样本微调，但其成功高度依赖于作为训练示例所选择样本的质量。主动学习方法如不确定性采样和多样性采样能够选择有用的样本。然而，在极低资源和类别不平衡的条件下，它们往往偏向于异常值，而非真正具有信息量的样本，导致性能下降。本文介绍了RADS（强化自适应领域采样），这是一种使用强化学习（RL）识别最具信息量样本的稳健样本选择策略。在多个真实临床数据集上的实验评估表明，我们的样本选择策略在保持稳健性能的同时，增强了模型的迁移能力，相较于传统方法在极端类别不平衡下表现更佳。


cs.CL / 49 / 2604.20283

Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking

无监督多模态实体链接的多视角证据综合与推理

Zhou, Mo, Wang, Jianwei, Wang, Kai, Paik, Helen, Zhang, Ying, Zhang, Wenjie

Abstract

Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that human expert decision-making relies on multi-perspective judgment, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information through graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of the LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.

Chinese Translation

多模态实体链接（MEL）是数据管理中的一项基础任务，它将具有不同模态的模糊提及映射到知识库中的多模态实体。然而，现有的大多数MEL方法主要集中于优化实例中心特征和证据，未能充分探讨更广泛形式的证据及其复杂的相互依赖关系。基于人类专家决策过程依赖于多视角判断的观察，本文提出了MSR-MEL，一个基于大型语言模型（LLMs）的无监督MEL的多视角证据综合与推理框架。具体而言，我们采用了一个两阶段框架：（1）离线多视角证据综合构建了一套全面的证据。这包括捕捉提及和实体的实例中心多模态信息的实例中心证据、聚合邻域信息的群体级证据、基于字符串重叠率的词汇证据，以及基于简单摘要统计的统计证据。我们框架的一个核心贡献是群体级证据的综合，它通过图有效地聚合重要的邻域信息。我们首先构建了增强的上下文化图。随后，通过非对称的教师-学生图神经网络将不同模态进行联合对齐。（2）在线多视角证据推理利用LLM作为推理模块，分析多视角证据的相关性和语义，以诱导出有效的排名策略，实现准确的实体链接而无需监督。在广泛使用的MEL基准测试上的大量实验表明，MSR-MEL始终优于最先进的无监督方法。本文的源代码可在以下链接获取：https://anonymous.4open.science/r/MSR-MEL-C21E/


cs.CL / 50 / 2604.20331

Surrogate modeling for interpreting black-box LLMs in medical predictions

用于解释医疗预测中黑箱大型语言模型的代理建模

Han, Changho, Kim, Songsoo, Kim, Dong Won, Celi, Leo Anthony, Kim, Jaewoong, Bae, SungA, Yoon, Dukyong

Abstract

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

Chinese Translation

大型语言模型（LLMs）在庞大的数据集上训练，能够在其参数中编码广泛的现实世界知识，但其黑箱特性掩盖了这种编码的机制和程度。代理建模利用简化模型来近似复杂系统，为黑箱模型的更好可解释性提供了一条路径。我们提出了一种代理建模框架，定量解释LLM编码的知识。针对从领域知识中推导出的特定假设，该框架通过在广泛的模拟场景中进行大量提示，使用可观察元素（输入-输出对）来近似潜在的LLM知识空间。通过在医疗预测中的概念验证实验，我们展示了该框架在揭示LLMs如何“感知”每个输入变量与输出之间关系的有效性。特别是，考虑到LLMs可能会延续其训练数据中嵌入的不准确性和社会偏见，我们使用该框架的实验定量揭示了与既定医学知识相矛盾的关联，以及LLM编码知识中科学上被驳斥的种族假设的持续存在。通过揭示这些问题，我们的框架可以作为一个警示指标，以支持这些模型的安全和可靠应用。


cs.CL / 51 / 2604.20382

Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs

Graph2Counsel：基于客户心理图的临床基础合成咨询对话生成

Mandal, Aishik, Arnaout, Hiba, Ong, Clarissa W., Bockhorst, Juliet, Sheehan, Kate, Moldow, Rachael, Chakraborty, Tanmoy, Gurevych, Iryna

Abstract

Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, often producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's $\alpha$ = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.

Chinese Translation

对心理健康支持需求的不断上升引发了对使用大型语言模型（LLMs）进行咨询的兴趣。然而，由于隐私限制，现实世界咨询数据的稀缺使得将LLMs适应于这一高风险安全关键领域变得困难。合成数据集提供了一个有前景的替代方案，但现有方法往往依赖于非结构化或半结构化文本输入，忽视了客户的认知、情感和行为状态之间的结构依赖，常常导致心理上不一致的互动，从而降低数据的真实感和质量。我们提出了Graph2Counsel，一个基于客户心理图（CPGs）生成合成咨询会话的框架，CPGs编码了客户思想、情感和行为之间的关系。Graph2Counsel采用了一个由咨询师策略和CPG指导的结构化提示管道，并探索了包括CoT（Wei et al., 2022）和多智能体反馈（Li et al., 2025a）在内的提示策略。Graph2Counsel从76个CPG中生成了760个会话，涵盖多样的客户档案。在专家评估中，我们的数据集在特异性、咨询师能力、真实性、对话流畅性和安全性方面优于之前的数据集，且具有显著的标注者间一致性（Krippendorff's $\alpha$ = 0.70）。在该数据集上微调开源模型提高了在CounselingBench（Nguyen et al., 2025）和CounselBench（Li et al., 2025b）上的表现，显示了下游应用的潜力。我们还公开了我们的代码和数据。

cs.CL / 52 / 2604.20398

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

WebGen-R1：利用强化学习激励大型语言模型生成功能性和美观的网站

Jiang, Juyong, Cai, Chenglin, Park, Chansung, Shen, Jiasi, Kim, Sunghun, Li, Jianguo, Wang, Yue

Abstract

While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.

Chinese Translation

尽管大型语言模型（LLMs）在功能级代码生成方面表现出色，但生成功能性和视觉美观的多页面网站等项目级任务仍然具有很高的挑战性。现有的研究通常局限于单页面静态网站，而自主框架通常依赖于多轮执行和专有模型，这导致了巨大的令牌成本、高延迟和脆弱的集成。用强化学习（RL）端到端训练一个小型LLM是一种有前景的替代方案，但在为网站生成设计可靠且计算上可行的奖励方面面临关键瓶颈。与可以通过单元测试验证的单文件编码任务不同，网站生成需要评估固有的主观美学、跨页面交互和功能正确性。为此，我们提出了WebGen-R1，这是一个针对项目级网站生成量身定制的端到端RL框架。我们首先引入了一种基于支架的结构化生成范式，该范式限制了大型开放式动作空间并保持了架构的完整性。然后，我们设计了一种新颖的级联多模态奖励，它将结构保证与基于执行的功能反馈和基于视觉的美学监督无缝结合。大量实验表明，我们的WebGen-R1显著改变了一个7B基础模型，使其从生成几乎无功能的网站转变为生成可部署的、在美学上对齐的多页面网站。值得注意的是，我们的WebGen-R1不仅在功能上始终优于大规模开源模型（高达72B），而且在功能成功率上与最先进的DeepSeek-R1（671B）相媲美，同时在有效渲染和美学对齐方面大幅超越它。这些结果使WebGen-R1成为将小型开放模型从功能级代码生成扩展到项目级Web应用生成的可行路径。

cs.CL / 53 / 2604.20443

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

DialToM：一种用于预测状态驱动对话轨迹的心智理论基准

Yadav, Neemesh, Achananuparp, Palakorn, Jiang, Jing, Lim, Ee-Peng

Abstract

Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.

Chinese Translation

大型语言模型（LLMs）已被证明具备心智理论（ToM）能力。然而，目前尚不清楚这种能力是源于稳健的推理还是虚假的相关性。我们引入了DialToM，这是一个基于自然人类对话构建的人类验证基准，采用多项选择框架。我们不仅评估心理状态预测（字面心智理论），还通过前瞻性诊断预测（Prospective Diagnostic Forecasting）评估这些状态的功能效用——探讨模型是否能够仅通过心理状态特征识别状态一致的对话轨迹。我们的结果揭示了显著的推理不对称性：尽管LLMs在识别心理状态方面表现出色，但大多数模型（除了Gemini 3 Pro）未能利用这种理解来预测社会轨迹。此外，我们发现人类与LLM生成的推理之间只有微弱的语义相似性。为了促进可重复性，DialToM数据集和评估代码已公开发布在https://github.com/Stealth-py/DialToM。

cs.CL / 54 / 2604.20447

Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

高效准确的命名实体识别文本跨度解码

Maracani, Andrea, Ozkan, Savas, Zhu, Junyi, Mutlu, Sinan, Ozay, Mete

Abstract

Named Entity Recognition (NER) is a key component in industrial information extraction pipelines, where systems must satisfy strict latency and throughput constraints in addition to strong accuracy. State-of-the-art NER accuracy is often achieved by span-based frameworks, which construct span representations from token encodings and classify candidate spans. However, many span-based methods enumerate large numbers of candidates and process each candidate with marker-augmented inputs, substantially increasing inference cost and limiting scalability in large-scale deployments. In this work, we propose SpanDec, an efficient span-based NER framework that targets this bottleneck. Our main insight is that span representation interactions can be computed effectively at the final transformer stage, avoiding redundant computation in earlier layers via a lightweight decoder dedicated to span representations. We further introduce a span filtering mechanism during enumeration to prune unlikely candidates before expensive processing. Across multiple benchmarks, SpanDec matches competitive span-based baselines while improving throughput and reducing computational cost, yielding a better accuracy-efficiency trade-off suitable for high-volume serving and on-device applications.

Chinese Translation

命名实体识别（NER）是工业信息提取流程中的关键组成部分，系统不仅需要满足严格的延迟和吞吐量限制，还需具备较强的准确性。当前最先进的NER准确性通常通过基于跨度的框架实现，该框架从标记编码构建跨度表示并对候选跨度进行分类。然而，许多基于跨度的方法会枚举大量候选项，并使用增强标记的输入处理每个候选项，这大大增加了推理成本，并限制了在大规模部署中的可扩展性。在本研究中，我们提出了SpanDec，一个高效的基于跨度的NER框架，旨在解决这一瓶颈。我们的主要见解是，跨度表示之间的交互可以在最终的变换器阶段有效计算，从而通过专门针对跨度表示的轻量解码器避免在早期层中的冗余计算。我们进一步引入了一种跨度过滤机制，在枚举过程中修剪不太可能的候选项，以减少昂贵的处理。在多个基准测试中，SpanDec与竞争性的基于跨度的基线相匹配，同时提高了吞吐量并降低了计算成本，实现了更好的准确性与效率的权衡，适用于高容量服务和设备端应用。
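
The span filtering step described in the abstract can be illustrated with a minimal sketch. This is not SpanDec itself: `candidate_spans`, `score_fn`, `max_width`, and `min_score` are hypothetical names, and the scoring function stands in for whatever learned filter the paper uses. The sketch only shows the general idea of pruning unlikely spans during enumeration, before any expensive per-span processing.

```python
def candidate_spans(tokens, score_fn, max_width=3, min_score=0.5):
    """Enumerate spans up to max_width tokens and keep those the filter accepts.

    score_fn is a hypothetical cheap scorer returning a value in [0, 1].
    Spans are returned as (start, end) pairs with an exclusive end index.
    """
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_width, len(tokens))
        for end in range(start + 1, last_end + 1):
            # Prune here so downstream span classification sees fewer candidates.
            if score_fn(tokens[start:end]) >= min_score:
                spans.append((start, end))
    return spans


def all_capitalized(span_tokens):
    """Toy scorer (assumption): 1.0 if every token is capitalized, else 0.0."""
    return 1.0 if all(tok[:1].isupper() for tok in span_tokens) else 0.0


kept = candidate_spans(["Ada", "Lovelace", "wrote", "code"], all_capitalized)
```

In this toy run only the spans covering "Ada", "Ada Lovelace", and "Lovelace" survive the filter, so a downstream classifier would process three candidates instead of nine.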

cs.CL / 55 / 2604.20454

Not all ANIMALs are equal: metaphorical framing through source domains and semantic frames

并非所有的动物都是平等的：通过源域和语义框架的隐喻构建

Otmakhova, Yulia, Guida, Matteo, Frermann, Lea

Abstract

Metaphors are powerful framing devices, yet their source domains alone do not fully explain the specific associations they evoke. We argue that the interplay between source domains and semantic frames determines how metaphors shape understanding of complex issues, and present a computational framework for deriving salient discourse metaphors through their source domains and semantic frames. Applying this framework to climate change news, we uncover not only well-known source domains but also nuanced frame-level associations that distinguish how the issue is portrayed. In analyzing immigration discourse across political ideologies, we demonstrate that liberals and conservatives systematically employ different semantic frames within the same source domains, with conservatives favoring frames emphasizing uncontrollability and liberals choosing neutral or more "victimizing" semantic frames. Our work bridges conceptual metaphor theory and linguistics, providing the first NLP approach for discovery of discourse metaphors and fine-grained analysis of differences in metaphorical framing. Code, data and statistical scripts are available at https://github.com/julia-nixie/ConceptFrameMet.

Chinese Translation

隐喻是强有力的框架构建工具，但其源域本身并不能完全解释其引发的特定联想。我们认为，源域与语义框架之间的相互作用决定了隐喻如何塑造对复杂问题的理解，并提出了一种计算框架，可以通过源域和语义框架推导出显著的论述隐喻。将该框架应用于气候变化新闻，我们不仅揭示了众所周知的源域，还揭示了细致的框架层次联想，区分了该问题的不同呈现方式。在分析跨政治意识形态的移民话语时，我们证明了自由派和保守派在相同源域内系统性地采用不同的语义框架，保守派倾向于强调不可控性的框架，而自由派则选择中立或更“受害者化”的语义框架。我们的研究架起了概念隐喻理论与语言学之间的桥梁，提供了发现论述隐喻和细致分析隐喻构建差异的首个自然语言处理（NLP）方法。代码、数据和统计脚本可在 https://github.com/julia-nixie/ConceptFrameMet 获取。

cs.CL / 56 / 2604.20487

Knowledge Capsules: Structured Nonparametric Memory Units for LLMs

知识胶囊：用于大型语言模型的结构化非参数记忆单元

Ju, Bin, Weng, Shenfeng, Zhou, Danying, Su, Kunkai, Xu, Rongkai

Abstract

Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key-Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.

Chinese Translation

大型语言模型（LLMs）通过参数权重编码知识，这使得在不重新训练的情况下更新或扩展知识变得成本高昂。检索增强生成（RAG）通过将检索到的文本附加到输入中来缓解这一限制，但它仅通过上下文扩展来操作，其中外部知识作为令牌在注意力机制中竞争。因此，其影响是间接的，且在长上下文和多跳推理场景中往往不稳定。我们提出了知识胶囊（Knowledge Capsules），这是一种结构化的非参数记忆单元，代表标准化的关系知识，并且可以直接从文档语料库中使用冻结的基础模型构建。我们引入了一种外部键值注入（External Key Value Injection, KVI）框架，而不是将知识作为文本注入，该框架将胶囊编译成与注意力兼容的键值表示，从而使外部知识能够直接参与模型的注意力计算。通过将知识整合从上下文级增强转移到记忆级交互，所提出的框架在多个问答基准测试中始终优于RAG和GraphRAG，在长上下文和多跳推理中具有更好的稳定性和准确性，同时不需要参数更新。
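
To make the contrast with text-level retrieval concrete, the following is a minimal sketch of attention in which extra key-value pairs sit alongside the context's own keys and values. It is an illustration under stated assumptions, not the paper's KVI implementation: `attend` and `attend_with_capsules` are hypothetical names, the vectors are hand-written, and how capsules are actually compiled into key-value form is not specified in the abstract.

```python
import math


def attend(query, keys, values):
    """Single-query scaled dot-product attention over plain Python lists."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attend_with_capsules(query, ctx_keys, ctx_values, cap_keys, cap_values):
    """Assumed injection scheme: external pairs join the same softmax as the context."""
    return attend(query, ctx_keys + cap_keys, ctx_values + cap_values)


query = [1.0, 0.0]
ctx_keys, ctx_values = [[0.0, 1.0]], [[0.0, 0.0]]
cap_keys, cap_values = [[10.0, 0.0]], [[1.0, 1.0]]

without_capsule = attend(query, ctx_keys, ctx_values)
with_capsule = attend_with_capsules(query, ctx_keys, ctx_values, cap_keys, cap_values)
```

Here the external key aligns strongly with the query, so the output moves almost entirely to the external value without adding any tokens to the input text.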

cs.CL / 57 / 2604.20531

Effects of Cross-lingual Evidence in Multilingual Medical Question Answering

跨语言证据在多语言医学问答中的影响

Yeginbergen, Anar, Oronoz, Maite, Agerri, Rodrigo

Abstract

This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated repositories of specialized medical knowledge, web-retrieved content, and explanations from LLMs' parametric knowledge. Moreover, we conduct experiments with multilingual, monolingual and cross-lingual retrieval. Our results demonstrate that larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. Conversely, for low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving comparable accuracy to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Furthermore, specialized medical knowledge sources such as PubMed are limited: while they provide authoritative expert knowledge, they lack adequate multilingual coverage.

Chinese Translation

本文研究了高资源（英语、西班牙语、法语、意大利语）和低资源（巴斯克语、哈萨克语）语言中的多语言医学问答。我们评估了三种类型的外部证据来源，涵盖不同规模的模型：经过整理的专业医学知识库、网络检索内容以及来自大语言模型（LLM）参数知识的解释。此外，我们进行了多语言、单语言和跨语言检索的实验。我们的结果表明，在基准评估中，较大的模型在英语中的表现始终优于其他模型。当结合外部知识时，英语的网络检索数据对高资源语言最为有利。相反，对于低资源语言，最有效的策略是结合在英语和目标语言中的检索，其准确性与高资源语言的结果相当。这些发现挑战了外部知识系统性提高性能的假设，并揭示了有效策略依赖于语言资源的来源和模型规模。此外，像PubMed这样的专业医学知识来源是有限的：虽然它们提供权威的专家知识，但缺乏足够的多语言覆盖。

cs.CL / 58 / 2604.20535

Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

将口吃语音研究与最终用户需求对齐：范围评估、调查与指南

Toyin, Hawau Olamide, Apampa, Mutiah, Aremu, Toluwani, Alblooshi, Humaid, Valente, Ana Rita, Leal, Gonçalo, Yue, Zhengjun, Talat, Zeerak, Aldarmaki, Hanan

Abstract

Atypical speech is receiving greater attention in speech technology research, but much of this work unfolds with limited interdisciplinary dialogue. For stuttered speech in particular, it is widely recognised that current speech recognition systems fall short in practice, and current evaluation methods and research priorities are not systematically grounded in end-user experiences and needs. In this work, we analyse these gaps through 1) a scoping review of papers that deal with stuttered speech and 2) a survey of 70 stakeholders, including adults who stutter and speech-language pathologists. By analysing these two perspectives, we propose a taxonomy of stuttered-speech research, identify where current research directions diverge from the needs articulated by stakeholders, and conclude by outlining concrete guidelines and directions towards addressing the real needs of the stuttering community.

Chinese Translation

非典型语音在语音技术研究中受到越来越多的关注，但这方面的工作往往缺乏跨学科的对话。特别是对于口吃语音，普遍认为当前的语音识别系统在实际应用中存在不足，而现有的评估方法和研究重点并未系统地基于最终用户的经验和需求。在本研究中，我们通过1）对处理口吃语音的论文进行范围评估，以及2）对70位利益相关者（包括口吃成年人和言语语言病理学家）进行调查，分析了这些差距。通过分析这两个视角，我们提出了口吃语音研究的分类法，识别出当前研究方向与利益相关者所表达的需求之间的偏差，并最终概述了针对满足口吃群体实际需求的具体指南和方向。

cs.CL / 59 / 2604.20548

Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies

通过组合创新和多智能体迭代搜索策略增强研究创意生成

Chen, Shuai, Zhang, Chengzhi

Abstract

Scientific progress depends on the continual generation of innovative research ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)-based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi-agent iterative planning search strategy inspired by combinatorial innovation theory. The framework combines iterative knowledge search with an LLM-based multi-agent system to generate, evaluate, and refine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state-of-the-art baselines in both diversity and novelty. Further comparison with ideas derived from top-tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high-quality research idea generation. The source code and dataset used in this paper are publicly available on GitHub: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.

Chinese Translation

科学进步依赖于不断生成创新的研究创意。然而，科学文献的快速增长大大增加了知识筛选的成本，使得研究人员更难以识别新颖的研究方向。尽管现有的基于大型语言模型（LLM）的方法在研究创意生成方面显示出潜力，但它们产生的创意往往重复且缺乏深度。为了解决这一问题，本研究提出了一种受组合创新理论启发的多智能体迭代规划搜索策略。该框架结合了迭代知识搜索与基于LLM的多智能体系统，通过反复互动生成、评估和精炼研究创意，旨在提高创意的多样性和新颖性。在自然语言处理领域的实验表明，所提出的方法在多样性和新颖性方面均优于现有的最先进基线。此外，与来自顶级机器学习会议论文的创意进行进一步比较，结果表明生成的创意质量介于被接受和被拒绝论文之间。这些结果表明，所提出的框架是支持高质量研究创意生成的一个有前景的方法。本文使用的源代码和数据集已在Github上公开，链接为：https://github.com/ChenShuai00/MAGenIdeas。演示可在：https://huggingface.co/spaces/cshuai20/MAGenIdeas 获取。

cs.CL / 60 / 2604.20549

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

面向跨语言质量分类器的多语言预训练数据选择

Turki, Yassine, Sabolčec, Vinko, Messmer, Bettina, Jaggi, Martin

Abstract

As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio by performing quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through third quartile sampling (Q3) or tuning the retention rate is necessary to fully leverage the multilingual signal.

Chinese Translation

随着大型语言模型（LLMs）的规模扩大，数据整理的重点已从最大化数据量转向通过质量过滤优化信噪比。然而，对于许多语言而言，本土高质量数据不足以训练出稳健的质量分类器。本研究探讨了嵌入空间中的质量标记可能表现出跨语言一致性的观点，这将允许高资源语言为低资源语言的过滤提供支持。我们评估了多种过滤策略，包括跨语言迁移、第三四分位数抽样（Q3）和保留率调整。我们的结果表明，在对103B个标记训练的1B模型中，大规模多语言汇聚在排名稳定性和整体准确性方面常常优于单语言基准，为高资源语言（法语的整体标准化准确性提高1.2%）带来了收益，并在低资源语言中达到或超过单语言基准。然而，我们发现，仅仅依靠规模并不能保证稳定性。此外，对于法语等高资源语言，我们表明通过第三四分位数抽样（Q3）或调整保留率来细化决策边界是充分利用多语言信号所必需的。
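
Retention rate tuning, one of the strategies named in the abstract, can be sketched as keeping a fixed top fraction of documents ranked by classifier score. This is a generic illustration rather than the authors' pipeline: `filter_by_retention` and the example scores are hypothetical, and the abstract does not say how scores are produced or how ties are handled.

```python
def filter_by_retention(docs, scores, retention_rate):
    """Keep the top `retention_rate` fraction of docs by quality score.

    At least one document is always kept; ordering among kept documents
    follows descending score.
    """
    keep = max(1, round(len(docs) * retention_rate))
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:keep]]


docs = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.5, 0.7]  # hypothetical quality-classifier outputs
kept_half = filter_by_retention(docs, scores, retention_rate=0.5)
```

Lowering `retention_rate` tightens the decision boundary, which is the knob the abstract reports as necessary for high-resource languages such as French.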

cs.CL / 61 / 2604.20556

LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

LayerTracer：一种针对任意大型语言模型架构的联合任务粒子和脆弱层分析框架

Wu, Yuhang, Liu, Qinyuan, Zhao, Qiuyang, Chong, Qingwei

Abstract

Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.

Chinese Translation

目前，大型语言模型（LLMs）展现出多样化的架构格局，包括传统的Transformer、GateDeltaNet和Mamba。然而，各种LLM架构中层次表示的演化规律、任务知识形成的位置以及网络鲁棒性瓶颈机制仍不清晰，这对混合架构设计和模型优化构成了核心挑战。本文提出了LayerTracer，这是一种与架构无关的端到端分析框架，兼容任何LLM架构。通过逐层提取隐藏状态并将其映射到词汇概率分布，LayerTracer实现了任务粒子定位和层脆弱性量化的联合分析。我们将任务粒子定义为目标标记概率首次显著上升的关键层，代表模型任务执行的起始点，而脆弱层则定义为在掩码扰动前后输出分布之间具有最大Jensen-Shannon（JS）散度的层，反映其对干扰的敏感性。在不同参数规模模型上的实验表明，任务粒子主要出现在模型的深层，无论参数大小如何，而大参数模型则表现出更强的层次鲁棒性。LayerTracer为混合架构的层划分、模块比例和门控切换提供了科学依据，有效优化了模型性能。它准确定位任务有效层和稳定性瓶颈，为LLM结构设计和可解释性研究提供了普遍支持。
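
The two definitions in the abstract translate fairly directly into code. The sketch below is not LayerTracer: the function names are hypothetical, the per-layer distributions are assumed to be already extracted, and the `jump` threshold is an assumption because the abstract does not quantify what counts as a significant rise in target-token probability.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def vulnerable_layer(clean, perturbed):
    """Index of the layer whose output distribution shifts most under perturbation."""
    scores = [js_divergence(p, q) for p, q in zip(clean, perturbed)]
    return max(range(len(scores)), key=scores.__getitem__)


def task_particle(target_probs, jump=0.2):
    """First layer where the target-token probability rises by at least `jump`.

    `jump` is an assumed threshold; returns None if no layer qualifies.
    """
    for layer in range(1, len(target_probs)):
        if target_probs[layer] - target_probs[layer - 1] >= jump:
            return layer
    return None


clean = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
perturbed = [[0.5, 0.5], [0.1, 0.9], [0.25, 0.75]]
worst = vulnerable_layer(clean, perturbed)
start = task_particle([0.01, 0.02, 0.03, 0.4, 0.9])
```

Because both quantities depend only on per-layer vocabulary distributions, the same functions apply regardless of whether the layers come from a Transformer or another architecture.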

cs.CL / 62 / 2604.20560

LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

LLM StructCore：基于模式引导推理的凝聚与确定性编译

Zabolotnii, Serhii

Abstract

Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step "LLM predicts 134 fields" approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.

Chinese Translation

从临床笔记自动填写病例报告表（CRFs）具有挑战性，因为存在语言噪声、严格的输出约束以及误报的高成本。我们描述了我们在2026年CL4Health会议上提交的关于呼吸困难CRF填写（134项）的方案，该方案采用基于模式引导推理（SGR）的合同驱动两阶段设计。关键任务特性是极端稀疏性：大多数字段未知，官方评分对空值和不支持的预测都施加惩罚。我们从单步的“LLM预测134个字段”方法转变为一种分解方法，其中（i）第一阶段生成一个具有9个领域键的稳定SGR风格JSON摘要，(ii) 第二阶段是一个完全确定性的0-LLM编译器，解析第一阶段摘要，标准化项目名称，将预测规范化为官方控制词汇，应用证据门控的误报过滤器，并将输出扩展为所需的134项格式。在dev80划分中，最佳教师配置实现了宏F1 0.6543（英语）和0.6905（意大利语）；在隐藏的test200中，提交的英语变体在Codabench上的得分为0.63。该管道是语言无关的：意大利语结果与英语相匹配或超过，且没有特定于语言的工程。
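
The Stage 2 steps listed in the abstract (parse, canonicalize, normalize, filter, expand) can be sketched as a small deterministic function. Everything concrete here is an assumption for illustration: the item list, alias table, vocabulary, the JSON layout of the Stage 1 summary, and the use of "unknown" as the default value are hypothetical stand-ins, not the official 134-item schema or the author's compiler.

```python
import json

# Hypothetical stand-ins for the official item list and controlled vocabulary.
ITEMS = ["dyspnea_at_rest", "cough", "fever"]
ALIASES = {"dyspnoea at rest": "dyspnea_at_rest", "coughing": "cough"}
VOCAB = {"yes": "present", "present": "present", "no": "absent", "absent": "absent"}


def compile_summary(stage1_json):
    """Deterministically expand a Stage 1 JSON summary into the full item format."""
    summary = json.loads(stage1_json)
    out = {item: "unknown" for item in ITEMS}  # assumed default for sparse fields
    for findings in summary.values():  # one dict of findings per domain key
        for name, entry in findings.items():
            key = name.strip().lower()
            item = ALIASES.get(key, key.replace(" ", "_"))  # canonicalize names
            value = VOCAB.get(str(entry.get("value", "")).strip().lower())
            # Evidence gate: drop predictions that lack a supporting quote.
            if item in out and value and entry.get("evidence"):
                out[item] = value
    return out


stage1 = json.dumps({
    "respiratory": {
        "Dyspnoea at rest": {"value": "Yes", "evidence": "short of breath at rest"},
        "Coughing": {"value": "no", "evidence": ""},
    },
    "general": {"Fever": {"value": "maybe", "evidence": "felt warm"}},
})
compiled = compile_summary(stage1)
```

Keeping this stage free of model calls means the same summary always yields the same output, so errors can be traced either to the Stage 1 summary or to an explicit rule.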

cs.CL / 63 / 2604.20564

Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains

推理中断的地方：通过控制大型语言模型推理链中的逻辑连接词实现逻辑感知路径选择

Park, Seunghyun, Lei, Yuanyuan

Abstract

While LLMs demonstrate impressive reasoning capabilities, they remain fragile in multi-step logical deduction, where a single transition error can propagate through the entire reasoning chain, leading to unstable performance. In this work, we identify logical connectives as primary points of this structural fragility. Through empirical analysis, we show that connective tokens function as high-entropy forking points, at which models frequently struggle to determine the correct logical direction. Motivated by this observation, we hypothesize that intervening in logical connective selection can guide LLMs toward the correct logical direction, thereby improving the overall reasoning chain. To validate this hypothesis, we propose a multi-layered framework that intervenes specifically at these logic-critical junctions in the reasoning process. Our framework includes (1) Gradient-based Logical Steering to guide LLMs' internal representations towards valid reasoning subspaces, (2) Localized Branching to resolve ambiguity via targeted look-ahead search, and (3) Targeted Transition Preference Optimization, a surgical reinforcement learning objective that selectively optimizes single-token preferences at logical pivots. Crucially, by concentrating intervention solely on logic-critical transitions, our framework achieves a favorable accuracy-efficiency trade-off compared to global inference-time scaling methods like beam search and self-consistency.

Chinese Translation

尽管大型语言模型（LLMs）展现出令人印象深刻的推理能力，但在多步骤逻辑推理中，它们仍然脆弱，单一的过渡错误可能会在整个推理链中传播，导致性能不稳定。在本研究中，我们将逻辑连接词识别为这种结构脆弱性的主要点。通过实证分析，我们表明连接词作为高熵分叉点，模型在这些点上常常难以确定正确的逻辑方向。基于这一观察，我们假设干预逻辑连接词的选择可以引导LLMs朝向更正确的逻辑方向，从而改善整体推理链。为了验证这一假设，我们提出了一个多层框架，专门在推理过程中这些逻辑关键交叉点进行干预。我们的框架包括（1）基于梯度的逻辑引导，以引导LLMs的内部表示向有效推理子空间发展，（2）局部分支，通过针对性的前瞻搜索解决歧义，以及（3）针对性过渡偏好优化，一种选择性优化逻辑枢纽处单个标记偏好的精细强化学习目标。关键是，通过将干预集中在逻辑关键过渡上，我们的框架在准确性与效率之间实现了有利的权衡，相较于全局推理时间缩放方法（如束搜索和自一致性）。
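
The notion of a connective acting as a high-entropy forking point can be illustrated by measuring the Shannon entropy of the next-token distribution at each connective. This is a sketch under assumptions, not the paper's analysis code: the connective list, the `threshold`, and the toy distributions are all hypothetical.

```python
import math

# Assumed, deliberately small set of logical connectives.
CONNECTIVES = {"therefore", "because", "however", "so", "but", "thus"}


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_points(tokens, distributions, threshold=1.0):
    """Positions where a connective was generated from a high-entropy distribution.

    distributions[i] is the model's distribution at the step that produced
    tokens[i]; `threshold` is an assumed cutoff.
    """
    return [
        i
        for i, (tok, dist) in enumerate(zip(tokens, distributions))
        if tok.lower() in CONNECTIVES and entropy(dist) >= threshold
    ]


tokens = ["x", "therefore", "y", "but"]
dists = [[1.0], [0.25, 0.25, 0.25, 0.25], [0.5, 0.5], [0.9, 0.1]]
points = forking_points(tokens, dists)
```

Restricting any intervention to the returned positions is what keeps the cost low relative to methods that branch at every step.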

cs.CL / 64 / 2604.20572

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

仅在必要时提问：面向经验驱动的终身智能体的主动记忆和技能检索

Cai, Yuxuan, Zhou, Jie, Chen, Qin, He, Liang

Abstract

Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.

Chinese Translation

在线终身学习使智能体能够在交互中积累经验，并在长期任务上不断改进。然而，现有方法通常将从过去经验中检索视为一种被动操作，仅在任务初始化或完成步骤后触发。因此，智能体在交互过程中往往无法识别知识空白，主动检索当前决策所需的最有用经验。为了解决这一局限性，我们提出了ProactAgent，一个面向主动检索的经验驱动终身学习框架，基于结构化的经验库。我们首先介绍了经验增强在线演化（Experience-Enhanced Online Evolution, ExpOnEvo），它通过策略更新和记忆精炼实现持续改进。经验库将历史交互组织为类型化的存储库，包括事实记忆、情节记忆和行为技能，以便检索能够提供相关证据和可操作指导。在此基础上，我们提出了基于主动强化学习的检索（Proactive Reinforcement Learning-based Retrieval, ProactRL），将检索建模为一种明确的策略动作，并通过成对分支过程奖励学习何时以及检索什么。通过比较相同交互前缀在有无检索情况下的延续，ProactRL为检索决策提供了逐步监督，鼓励仅在检索能够带来更好的任务结果或更高效率时进行检索。在SciWorld、AlfWorld和StuLife上的实验表明，ProactAgent始终提高终身智能体的性能，在SciWorld上取得73.50%的成功率，在AlfWorld上达到71.28%，同时显著减少检索开销，并在StuLife上获得与专有模型竞争的性能。
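
The paired-branch idea, comparing two continuations of the same prefix with and without retrieval, can be sketched as a simple reward difference. The abstract does not give the actual reward formula, so the functional form, the `retrieval_cost` penalty, and the names below are assumptions made purely for illustration.

```python
def paired_branch_reward(score_with, score_without, retrieval_cost=0.05):
    """Step-level reward for the retrieve action at one interaction prefix.

    Positive only when the continuation that retrieved beats the one that
    did not by more than the assumed cost of retrieving.
    """
    return (score_with - score_without) - retrieval_cost


def label_prefixes(branches, retrieval_cost=0.05):
    """Map each prefix id to True if retrieval was worthwhile there.

    `branches` is a hypothetical list of (prefix_id, score_with, score_without).
    """
    return {
        prefix: paired_branch_reward(with_r, without_r, retrieval_cost) > 0
        for prefix, with_r, without_r in branches
    }


labels = label_prefixes([
    ("step-3", 1.0, 0.0),   # retrieval turned a failure into a success
    ("step-7", 1.0, 1.0),   # retrieval changed nothing, so it only costs
])
```

Under this form, a retrieval that does not change the outcome receives a negative reward, which matches the abstract's goal of retrieving only when it helps.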

cs.CL / 65 / 2604.20658

Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

合作特征预测多智能体大语言模型团队在科学工作流程中的表现

Kumar, Shivani, Bharathwaj, Adarsh, Jurgens, David

Abstract

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

Chinese Translation

由大型语言模型（LLMs）团队构建的多智能体系统越来越多地用于协作科学推理和问题解决。这些系统要求智能体在共享约束条件下进行协调，例如GPU或信用余额，其中合作行为至关重要。行为经济学提供了一套丰富的游戏工具，能够孤立出不同的合作机制，但尚不清楚模型在这些理想化环境中的行为是否能预测其在现实协作任务中的表现。在此，我们对35个开放权重的LLMs在六个行为经济学游戏中的表现进行了基准测试，结果表明，游戏衍生的合作特征能够稳健地预测在AI-for-Science任务中的下游表现，在这些任务中，LLM智能体团队在共享预算约束下协同分析数据、构建模型和生成科学报告。有效协调游戏并投资于乘法团队生产（而非贪婪策略）的模型在准确性、质量和完成度三个结果上生成了更好的科学报告。这些关联在控制多个因素后依然成立，表明合作倾向是LLMs的一个独特且可测量的特性，无法简化为一般能力。因此，我们的行为游戏框架为在成本高昂的多智能体部署之前筛选合作适应性提供了一种快速且廉价的诊断工具。

cs.CL / 66 / 2604.20666

ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

ORPHEAS：一种用于检索增强生成的跨语言希腊语-英语嵌入模型

Livieris, Ioannis E., Koursaris, Athanasios, Apostolopoulou, Alexandra, Tsakalidis, Konstantinos Kanaris Dimitris, Domalis, George

Abstract

Effective retrieval-augmented generation across bilingual Greek-English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek-English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained on a high-quality dataset generated by a knowledge graph-based fine-tuning methodology applied to a diverse multi-domain corpus, which enables language-agnostic semantic representations. Numerical experiments across monolingual and cross-lingual retrieval benchmarks reveal that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on morphologically complex languages does not compromise cross-lingual retrieval capability.

Chinese Translation

在双语希腊语-英语应用中，有效的检索增强生成需要能够捕捉领域特定语义关系和跨语言语义对齐的嵌入模型。现有的多语言嵌入模型将其表示能力分散到多个语言上，限制了对希腊语的优化，并未能编码希腊文本中固有的形态复杂性和领域特定的术语结构。在本研究中，我们提出了ORPHEAS，一种专门用于双语检索增强生成的希腊语-英语嵌入模型。ORPHEAS使用高质量数据集进行训练，该数据集通过基于知识图谱的微调方法生成，适用于多样化的多领域语料库，从而实现语言无关的语义表示。在单语和跨语言检索基准上的数值实验表明，ORPHEAS的性能优于最先进的多语言嵌入模型，证明了在形态复杂语言上进行领域专门的微调并不会妨碍跨语言检索能力。

cs.CL / 67 / 2604.20677

Intersectional Fairness in Large Language Models

大型语言模型中的交叉公平性

Boufaied, Chaima, Santos, Ronnie De Souza, Barcomb, Ann

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.

Chinese Translation

大型语言模型（LLMs）越来越多地应用于社会敏感的环境中，这引发了对公平性和偏见的关注，特别是在交叉人口属性方面。本文系统评估了六个LLMs在两个基准数据集中的模糊和消歧义上下文中的交叉公平性。我们通过多次运行分析，使用偏见分数、子群体公平性指标、准确性和一致性来评估LLM的行为，涵盖了不同上下文以及负面和非负面问题极性。我们的结果表明，尽管现代LLMs在模糊上下文中通常表现良好，但由于稀疏的非未知预测，这限制了公平性指标的信息量。在消歧义上下文中，LLM的准确性受到刻板印象对齐的影响，当正确答案强化刻板印象时，模型的准确性更高，而当其与刻板印象相矛盾时则较低。这种模式在种族与性别交叉的情况下尤为明显，偏向刻板印象的方向性偏见更强。子群体公平性指标进一步表明，尽管在某些情况下观察到的差异较小，但结果分布在交叉群体之间仍然不均匀。在重复运行中，响应的一致性也有所不同，包括与刻板印象对齐的响应。总体而言，我们的研究结果表明，模型的表面能力部分与刻板印象一致的线索相关，且没有评估的LLM在交叉环境中实现一致可靠或公平的行为。这些发现突显了超越准确性进行评估的必要性，强调了在交叉群体、上下文和重复运行中结合偏见、子群体公平性和一致性指标的重要性。

cs.CL / 68 / 2604.20726

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

通过提示优化利用 LLM 作为法官在自由文本法律问答中的作用

Elganayni, Mohamed Hesham, Chen, Runsheng, Nagl, Sebastian, Grabmair, Matthias

Abstract

This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free-text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than prompts optimized with strict feedback transfer to lenient judges. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate that algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026-llm-judge-gaming.

Chinese Translation

本研究探讨了提示设计和法官选择在 LLM 作为法官评估自由文本法律问答中的作用。我们考察了自动任务提示优化是否优于以人为中心的设计，优化效果是否因法官反馈风格而异，以及优化后的提示是否能够在不同法官之间迁移。我们在 LEXam 基准上系统性地解决了这些问题，通过使用 ProTeGi 方法优化任务提示，并结合来自两位法官（Qwen3-32B，DeepSeek-V3）的反馈，涵盖四个任务模型，然后测试跨法官迁移。自动优化的结果始终优于基线，宽松的法官反馈带来了更高且更一致的增益，而严格的法官反馈则效果较差。使用宽松反馈优化的提示在迁移到严格法官时表现更佳，而反向迁移效果较差。分析表明，宽松法官提供的是宽容的反馈，生成了更具广泛适用性的提示，而严格法官则产生了限制性的反馈，导致法官特定的过拟合。我们的研究结果表明，在训练数据上算法优化提示可以超越以人为中心的提示设计，并且法官在优化过程中的倾向会影响提示的普适性。代码和优化后的提示可在 https://github.com/TUMLegalTech/icail2026-llm-judge-gaming 获取。

cs.CL / 69 / 2604.20738

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

RespondeoQA：双语拉丁语-英语问答基准

Hudspeth, Marisa, Burns, Patrick J., O'Connor, Brendan

Abstract

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models (LLaMa 3, Qwen QwQ, and OpenAI's o3-mini), finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA

Chinese Translation

我们介绍了一个用于双语拉丁语和英语环境下问答和翻译的基准数据集，包含约7800对问答。问题来源于拉丁语教学资源，包括考试、问答比赛风格的琐事和从19世纪到现在的教科书。在经过自动提取、清理和人工审核后，该数据集涵盖了多种问题类型：知识和技能基础、多步推理、受限翻译以及混合语言对。据我们所知，这是第一个以拉丁语为中心的问答基准。作为案例研究，我们评估了三个大型语言模型——LLaMa 3、Qwen QwQ和OpenAI的o3-mini——发现它们在技能导向的问题上表现较差。尽管推理模型在韵律和文学手法任务上表现更好，但整体改善有限。QwQ在用拉丁语提问的问题上表现稍好，但LLaMa3和o3-mini则更依赖于任务。该数据集为评估模型在特定语言和文化领域的能力提供了新资源，创建过程也可以轻松适应其他语言。数据集可在以下网址获取：https://github.com/slanglab/RespondeoQA

cs.CL / 70 / 2604.20789

Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

工作记忆约束在数据稀缺情况下支撑Transformer中的学习

Madhyastha, Pranava, Adamcova, Dagmar

Abstract

We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width window-based and temporal decay-based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and on alignment with human reading time data. Our results indicate that these cognitively inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy, especially when training data is scarce. These constrained models also tend to show stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic representations, especially in data-limited settings.
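
The abstract does not give the exact form of the two attention variants, so the following is only a minimal sketch of how a fixed-width window and a temporal decay could be applied to causal attention scores. The function names, the linear distance penalty, and the `rate` parameter are assumptions made for illustration, not the authors' implementation.

```python
import math

NEG_INF = float("-inf")


def window_mask(seq_len, width):
    """allowed[i][j] is True when query i may attend to key j.

    Causal and fixed-width: only the current token and the
    (width - 1) tokens before it are visible.
    """
    return [[0 <= i - j < width for j in range(seq_len)] for i in range(seq_len)]


def decay_bias(seq_len, rate):
    """Additive score penalty that grows linearly with distance (assumed form).

    Future positions get -inf so the attention stays causal.
    """
    return [
        [-rate * (i - j) if j <= i else NEG_INF for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax_row(scores):
    """Numerically stable softmax that maps -inf scores to zero weight."""
    peak = max(s for s in scores if s != NEG_INF)
    exps = [0.0 if s == NEG_INF else math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_attention(scores, mask=None, bias=None):
    """Turn raw scores into attention weights under an optional mask and bias."""
    weights = []
    for i, row in enumerate(scores):
        adjusted = []
        for j, s in enumerate(row):
            if bias is not None:
                s += bias[i][j]
            if mask is not None and not mask[i][j]:
                s = NEG_INF
            adjusted.append(s)
        weights.append(softmax_row(adjusted))
    return weights
```

With uniform raw scores, the window variant spreads weight evenly over the last `width` tokens, while the decay variant places progressively less weight on more distant tokens without cutting them off entirely.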

Chinese Translation

我们研究了将类人工作记忆约束整合到Transformer架构中的方法，并实现了几种受认知启发的注意力变体，包括基于固定宽度窗口和基于时间衰减的注意力机制。我们修改后的GPT-2模型在发展上合理的数据集（1000万和1亿个单词）上从头开始训练。性能通过语法判断任务（BLiMP）和与人类阅读时间数据的对齐进行评估。我们的结果表明，这些受认知启发的约束，特别是固定宽度注意力，能够显著提高语法准确性，尤其是在训练数据稀缺的情况下。这些受限模型也往往与人类处理指标表现出更强的对齐。研究结果表明，这种约束可能作为一种有益的归纳偏差，引导模型朝向更稳健的语言表示，特别是在数据有限的环境中。

cs.CL / 71 / 2604.20791

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

“人工智能”能成为医生吗？临床大型语言模型中的共情、可读性和一致性研究

Barone, Mariano, Di Serio, Francesco, Moio, Roberto, Postiglione, Marco, Riccio, Giuseppe, Romano, Antonio, Moscato, Vincenzo

Abstract

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
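
The readability figures above are Flesch-Kincaid Grade Level (FKGL) scores. As a reference point, the standard FKGL formula can be sketched as below. The function name is hypothetical, and word, sentence, and syllable counts are taken as inputs because the abstract does not say which tokenizer or syllable counter the authors used.

```python
def fkgl(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula).

    Higher values mean the text needs more years of schooling to read.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

For example, a 100-word answer with 5 sentences and 150 syllables scores about 9.9, so the reported drop of up to 6.87 points corresponds to several school grades of reading difficulty.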

Chinese Translation

大型语言模型（LLMs）在医疗保健中的应用日益增多，但它们与临床标准的沟通一致性尚未得到充分量化。我们对通用和领域专用的LLMs在结构化医学解释和现实世界的医患互动中进行了多维度评估，分析了语义忠实度、可读性和情感共鸣。基准模型相较于医生放大了情感极性（非常负面：43.14-45.10% 对比 37.25%），而在更大架构如GPT-5和Claude中，产生了显著更高的语言复杂性（FKGL高达16.91-17.60，对比医生撰写的回应中的11.47-12.50）。以共情为导向的提示减少了极端负面情绪，并降低了年级水平复杂性（GPT-5最多降低6.87 FKGL点），但并未显著提高语义忠实度。协作重写产生了最强的一致性。重述配置在语义上与医生答案的相似度最高（均值高达0.93），同时持续改善可读性并降低情感极端性。双重利益相关者评估显示，没有模型在认知标准上超越医生，而患者则一致偏好重写的变体以获得更清晰和更具情感色调的表达。这些发现表明，LLMs作为协作沟通增强工具的功能最为有效，而非临床专业知识的替代品。

cs.CL / 72 / 2604.20817

Convergent Evolution: How Different Language Models Learn Similar Number Representations

趋同进化：不同语言模型如何学习相似的数字表示

Fu, Deqing, Zhou, Tianyi, Belkin, Mikhail, Sharan, Vatsal, Jia, Robin

Abstract

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
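
To make "geometrically separable" concrete, the sketch below builds an idealized period-$T$ feature (a point on the unit circle) and reads off a number's residue mod $T$ with a purely linear scoring rule. This is an illustrative toy, not the paper's probing setup: the function names and the clean cosine/sine embedding are assumptions, and learned features would be noisier and higher-dimensional.

```python
import math


def periodic_feature(n, period):
    """Idealized period-T feature: the number n placed on the unit circle."""
    angle = 2 * math.pi * n / period
    return (math.cos(angle), math.sin(angle))


def classify_mod(feature, period):
    """Linear readout of n mod T.

    Each class k has a weight vector (cos(2*pi*k/T), sin(2*pi*k/T)); the
    dot product with the feature equals cos(2*pi*(n - k)/T), which is
    largest exactly when n is congruent to k mod T.
    """
    x, y = feature
    scores = [
        x * math.cos(2 * math.pi * k / period) + y * math.sin(2 * math.pi * k / period)
        for k in range(period)
    ]
    return max(range(period), key=scores.__getitem__)
```

A feature set can show a period-$T$ spike in the Fourier domain without being arranged this cleanly, which is the gap between the two tiers that the abstract describes.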

Chinese Translation

在自然文本上训练的语言模型学习使用周期特征来表示数字，这些特征在傅里叶域中具有主导周期 $T=2, 5, 10$。本文中，我们识别出这些特征的两层次层级：虽然变换器（Transformers）、线性递归神经网络（Linear RNNs）、长短期记忆网络（LSTMs）以及以不同方式训练的经典词嵌入都学习到在傅里叶域中具有周期-$T$ 峰值的特征，但只有部分模型学习到可以用于线性分类数字模-$T$ 的几何可分特征。为了解释这种不一致性，我们证明了傅里叶域稀疏性是模-$T$ 几何可分性的必要但不充分条件。在实证研究中，我们探讨了何时模型训练会产生几何可分特征，发现数据、架构、优化器和分词器都发挥了关键作用。特别地，我们识别出模型获取几何可分特征的两种不同途径：它们可以从一般语言数据中的互补共现信号中学习，包括文本-数字共现和跨数字交互，或者从多标记（但不是单标记）加法问题中学习。总体而言，我们的结果突显了特征学习中的趋同进化现象：多样化的模型从不同的训练信号中学习到相似的特征。

cs.CL / 73 / 2604.20835

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Parallel-SFT：提升代码强化学习的零样本跨编程语言迁移能力

Wu, Zhaofeng, Wang, Shiqi, Peng, Boya, Goyal, Anuj, Kambadur, Melanie, Ruder, Sebastian, Kim, Yoon, Bi, Chloe

Abstract

Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose Parallel-SFT, an SFT strategy that incorporates "parallel programs" (functionally equivalent code implemented in multiple PLs) into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model's internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize contributes to the improved transferability.

Chinese Translation

现代语言模型在常见编程语言（如 C++ 和 Python）中展现出令人印象深刻的编码能力，但在资源较少的编程语言中的表现往往受到训练数据可用性的限制。然而，从原则上讲，大多数编程技能在不同编程语言之间是通用的，因此在一种编程语言中获得的能力应该能够迁移到其他语言。在本研究中，我们提出了代码强化学习的零样本跨编程语言迁移任务。我们发现，对于 Llama-3.1，在源编程语言中的代码生成强化学习训练未能改善，甚至有时会降低在其他目标编程语言上的表现。为了解决这一问题，我们假设有效的强化学习迁移需要在强化学习之前进行可泛化的监督微调（SFT）初始化。因此，我们提出了 Parallel-SFT，一种将“并行程序”（在多种编程语言中实现的功能等效代码）纳入数据混合的 SFT 策略。我们证明这提高了迁移能力：当我们随后在我们的 Parallel-SFT 模型上进行强化学习时，观察到对未见编程语言的更好泛化。对模型内部表示的分析表明，Parallel-SFT 导致了一个更以功能为中心的潜在空间，其中跨编程语言的等效程序更紧密地聚集，我们假设这有助于提高迁移能力。

cs.CL / 74 / 2604.20842

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

SpeechParaling-Bench：一个全面的语音生成评测基准，关注副语言特征

Liu, Ruohan, Yin, Shukang, Wang, Tao, Zhang, Dong, Zhuang, Weiji, Ren, Shuhuai, He, Ran, Shan, Caifeng, Fu, Chaoyou

Abstract

Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.
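
The pairwise pipeline described above compares each candidate response with a fixed baseline and reports relative preference. One simple way to aggregate such verdicts is sketched below. The verdict labels, the function name, and the choice to count a tie as half a win are assumptions for illustration, since the abstract does not state how the benchmark scores ties.

```python
from collections import Counter

VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Share of comparisons in which the candidate beat the fixed baseline.

    Each verdict is the judge's decision for one query, from the
    candidate's point of view. Ties count as half a win (assumed).
    """
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return (counts["win"] + 0.5 * counts["tie"]) / total
```

Because every candidate is scored against the same baseline, the resulting rates are directly comparable across models, and a value of 0.5 means the candidate is on par with the baseline.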

Chinese Translation

副语言线索对于自然的人机交互至关重要，但在大型音频语言模型（LALMs）中的评估仍受到粗糙特征覆盖和评估固有主观性的限制。为了解决这些挑战，我们提出了SpeechParaling-Bench，一个全面的副语言特征语音生成评测基准。它将现有的特征覆盖范围从不足50个扩展到超过100个细粒度特征，支持超过1,000个英汉平行语音查询，并分为三个逐步挑战的任务：细粒度控制、话语内变异和上下文感知适应。为了实现可靠的评估，我们进一步开发了一种成对比较流程，其中候选响应通过基于LALM的评判者与固定基准进行评估。通过将评估框架设定为相对偏好而非绝对评分，这种方法减轻了主观性，并在没有昂贵人工标注的情况下产生了更稳定和可扩展的评估。大量实验揭示了当前LALMs的显著局限性。即使是领先的专有模型在全面的静态控制和副语言特征的动态调节方面也面临困难，而未能正确解读副语言线索导致了情境对话中43.3%的错误。这些发现强调了在朝向人类对齐的语音助手方面需要更强大的副语言建模。

arXiv Papers

SL(C)AMma: Simultaneous Localisation, (Calibration) and Mapping With a Magnetometer Array

Radar Odometry Subject to High Tilt Dynamics of Subarctic Environments

Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems

Strain in Sound: Soft Corrugated Tube for Local Strain Sensing with Acoustic Resonance

JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

Toward Safe Autonomous Robotic Endovascular Interventions using World Models

LLM-Guided Safety Agent for Edge Robotics with an ISO-Compliant Perception-Compute-Control Architecture

Stochastic Barrier Certificates in the Presence of Dynamic Obstacles

Toward Cooperative Driving in Mixed Traffic: An Adaptive Potential Game-Based Approach with Field Test Verification

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Onboard Wind Estimation for Small UAVs Equipped with Low-Cost Sensors: An Aerodynamic Model-Integrated Filtering Approach

ETac: A Lightweight and Efficient Tactile Simulation Framework for Learning Dexterous Manipulation

AdaTracker: Learning Adaptive In-Context Policy for Cross-Embodiment Active Visual Tracking

A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization

OVPD: A Virtual-Physical Fusion Testing Dataset of OnSite Autonomous Driving Challenge

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

Passive Variable Impedance For Shared Control

Kinematic Optimization of Phalanx Length Ratios in Robotic Hands Using Potential Dexterity

FingerEye: Continuous and Unified Vision-Tactile Sensing for Dexterous Manipulation

A Kinematic Framework for Evaluating Pinch Configurations in Robotic Hand Design without Object or Contact Models

Visual-Tactile Peg-in-Hole Assembly Learning from Peg-out-of-Hole Disassembly

ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement

A Hough transform approach to safety-aware scalar field mapping using Gaussian Processes

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

TactileEval: A Step Towards Automated Fine-Grained Evaluation and Editing of Tactile Graphics

KD-Judge: A Knowledge-Driven Automated Judge Framework for Functional Fitness Movements on Edge Devices

Environmental Understanding Vision-Language Model for Embodied Agent

If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

UniCon3R: Contact-aware 3D Human-Scene Reconstruction from Monocular Video

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

CrackForward: Context-Aware Severity Stage Crack Synthesis for Data Augmentation

Visual Reasoning through Tool-supervised Reinforcement Learning

Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

Lucky High Dynamic Range Smartphone Imaging

Online CS-based SAR Edge-Mapping

A Computational Model of Message Sensation Value in Short Video Multimodal Features that Predicts Sensory and Behavioral Engagement

Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach

RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence

Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers

Learning to count small and clustered objects with application to bacterial colonies

FluSplat: Sparse-View 3D Editing without Test-Time Optimization

Normalizing Flows with Iterative Denoising

Gaussians on a Diet: High-Quality Memory-Bounded 3D Gaussian Splatting Training

PASTA: A Patch-Agnostic Twofold-Stealthy Backdoor Attack on Vision Transformers

FurnSet: Exploiting Repeats for 3D Scene Reconstruction

Topology-Aware Skeleton Detection via Lighthouse-Guided Structured Inference

Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

HumanScore: Benchmarking Human Motions in Generated Videos

Semantic-Fast-SAM: Efficient Semantic Segmenter

WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

From Scene to Object: Text-Guided Dual-Gaze Prediction

Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images

Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation

Bio-inspired Color Constancy: From Gray Anchoring Theory to Gray Pixel Methods

Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization

Fourier Series Coder: A Novel Perspective on Angle Boundary Discontinuity Problem for Oriented Object Detection

MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

Efficient INT8 Single-Image Super-Resolution via Deployment-Aware Quantization and Teacher-Guided Training

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

Improving Facial Emotion Recognition through Dataset Merging and Balanced Training Strategies

MD-Face: MoE-Enhanced Label-Free Disentangled Representation for Interactive Facial Attribute Editing

UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval