Daily Research Digest

arXiv Papers

2026-05-06
193 Papers, 4 Categories, 193 Translated
机器人学 (Robotics)
20
cs.RO / 1 / 2605.02983

Human-in-the-Loop Uncertainty Analysis in Self-Adaptive Robots Using LLMs

基于大语言模型的人机协作自适应机器人不确定性分析
Sartaj, Hassan, Boudjadar, Jalil, Frasheri, Mirgita, Ali, Shaukat, Larsen, Peter Gorm
Abstract
Self-adaptive robots operate in dynamic, unpredictable environments where unaddressed uncertainties can lead to safety violations and operational failures. However, systematically identifying and analyzing these uncertainties, including their sources, impacts, and mitigation strategies, remains a significant challenge given the inherent complexity of real-world environments, dynamic robotic behavior, and the rapid evolution of robotic technologies. To address this, we introduce RoboULM, a human-in-the-loop methodology and tool that supports practitioners in systematically exploring uncertainties at the design stage using large language models (LLMs). Moreover, we present an uncertainty taxonomy that provides a detailed catalog of uncertainties in self-adaptive robots. We evaluated RoboULM with 16 practitioners from four industrial use cases. The results show that RoboULM was perceived as both useful and easy to understand, with the participants particularly valuing structured prompting and iterative refinement support. These findings demonstrate the potential of RoboULM as a viable solution for systematic uncertainty analysis in complex robots.
Chinese Translation
自适应机器人在动态且不可预测的环境中运行,其中未处理的不确定性可能导致安全违规和操作失败。然而,系统地识别和分析这些不确定性,包括其来源、影响和缓解策略,仍然是一个重大挑战,因为现实世界环境的复杂性、动态机器人行为以及机器人技术的快速演变。为了解决这个问题,我们提出了RoboULM,这是一种人机协作的方法论和工具,支持从业者在设计阶段使用大语言模型(LLMs)系统地探索不确定性。此外,我们还提出了一种不确定性分类法,提供了自适应机器人不确定性的详细目录。我们与来自四个工业用例的16位从业者对RoboULM进行了评估。结果表明,RoboULM被认为既有用又易于理解,参与者特别重视结构化提示和迭代完善的支持。这些发现展示了RoboULM作为复杂机器人系统中系统化不确定性分析可行解决方案的潜力。
cs.RO / 2 / 2605.03075

Refining Compositional Diffusion for Reliable Long-Horizon Planning

精炼组合扩散以实现可靠的长期规划
Lee, Kyowoon, Luo, Yunhao, Tong, Anh, Choi, Jaesik
Abstract
Compositional diffusion planning generates long-horizon trajectories by stitching together overlapping short-horizon segments through score composition. However, when local plan distributions are multimodal, existing compositional methods suffer from mode-averaging, where averaging incompatible local modes leads to plans that are neither locally feasible nor globally coherent. We propose Refining Compositional Diffusion (RCD), a training-free guidance method that steers compositional sampling toward high-density, globally coherent plans. RCD leverages the self-reconstruction error of a pretrained diffusion model as a proxy for the log-density of composed plans, combined with an overlap consistency term that enforces consistency at segment boundaries. We show that the combined guidance concentrates sampling on high-density plans that mitigate mode-averaging. Experiments on challenging long-horizon tasks from OGBench, including locomotion, object manipulation, and pixel-based observations, demonstrate that RCD consistently outperforms existing methods.
Chinese Translation
组合扩散规划通过拼接重叠的短期片段来生成长期轨迹,利用分数组合。然而,当局部规划分布呈现多模态时,现有的组合方法会遭遇模态平均的问题,其中不兼容的局部模态的平均导致的计划既不具有局部可行性也不具有全局一致性。我们提出了精炼组合扩散(Refining Compositional Diffusion, RCD),这是一种无需训练的引导方法,旨在引导组合采样朝向高密度、全局一致的计划。RCD借助预训练扩散模型的自重构误差作为组合计划对数密度的代理,并结合重叠一致性项来强制在片段边界处保持一致性。我们展示了综合引导将采样集中在高密度计划上,有效缓解模态平均。对来自OGBench的挑战性长期任务进行的实验,包括运动、物体操控和基于像素的观察,证明RCD的性能始终优于现有方法。
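The overlap consistency idea in this abstract can be illustrated with a small sketch. It assumes each segment is a list of scalar states and that adjacent segments share `overlap` time steps. The squared-difference form and the name `overlap_penalty` are illustrative assumptions, not the paper's formulation.

```python
def overlap_penalty(segments, overlap):
    """Sum of squared differences between the tail of each segment and the
    head of the next one (a hypothetical form of an overlap consistency term)."""
    if overlap <= 0:
        raise ValueError("overlap must be a positive number of shared steps")
    total = 0.0
    for left, right in zip(segments, segments[1:]):
        tail = left[-overlap:]   # last `overlap` states of the earlier segment
        head = right[:overlap]   # first `overlap` states of the later segment
        total += sum((a - b) ** 2 for a, b in zip(tail, head))
    return total


# Two segments whose shared steps agree, and two whose shared steps disagree.
consistent = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]]
inconsistent = [[0.0, 1.0, 2.0], [0.0, 0.0, 3.0]]
```

A guidance method would push samples toward a low value of such a penalty. The density proxy based on self-reconstruction error is not sketched here because the abstract does not specify it.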
cs.RO / 3 / 2605.03111

Benchmarking Local Language Models for Social Robots using Edge Devices

使用边缘设备对社交机器人本地语言模型进行基准测试
Lamouille, Dorian, Zorec, Matevž B., Baksh, Farnaz, Kruusamäe, Karl
Abstract
Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, there is a gap in systematic benchmarking of language models for edge computing in pedagogical applications. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality, validated against five independent human raters). The Raspberry Pi (RPi) 4 serves as the primary platform, with additional comparisons on the RPi 5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
Chinese Translation
旨在提供社交互动教育支持的社交教育机器人,如机器人学习伴侣(Robot Study Companion, RSC),依赖于响应迅速且保护隐私的交互,但计算能力极为有限。然而,在教育应用中,对边缘计算语言模型的系统基准测试存在空白。本文对25种开源语言模型在边缘硬件上的本地部署进行基准测试。我们在推理效率(每秒标记数、能耗)、一般知识(六类MMLU子集)和教学有效性(LLM评估的教育质量)三个维度对每个模型进行评估,使用树莓派(Raspberry Pi, RPi)4作为主要平台,并在RPi5和笔记本GPU上进行了额外比较。结果揭示了明显的权衡:不同模型的吞吐量和能效相差超过一个数量级,MMLU准确率范围从接近随机到57.2%,而教学有效性与任何指标之间并不存在单调相关性。在评估的模型中,Granite4 Tiny Hybrid (7B) 在各方面表现出色,达到每秒2.5标记、每焦耳0.90标记和54.6%的MMLU准确率;高MMLU准确率似乎并不是获得高教学评分的必要条件。对四个代表性模型的人工验证保持了自动排名的顺序(皮尔逊相关系数 r = 0.967, n = 4)。基于这些发现,我们提出了一种三层本地推理架构,以便在资源受限的硬件上平衡响应性和准确性。
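The two efficiency metrics quoted in this abstract are related by simple unit arithmetic, sketched below. The function names are illustrative, and the implied-power figure assumes both numbers come from the same run. The abstract does not say how energy was measured.

```python
def tokens_per_second(n_tokens, seconds):
    """Throughput: generated tokens divided by wall-clock time."""
    return n_tokens / seconds


def tokens_per_joule(n_tokens, avg_power_watts, seconds):
    """Energy efficiency: energy in joules is average power (W) times duration (s)."""
    return n_tokens / (avg_power_watts * seconds)


def implied_power_watts(tok_per_s, tok_per_j):
    """(tokens/s) / (tokens/J) = J/s = W."""
    return tok_per_s / tok_per_j


# Figures reported for Granite4 Tiny Hybrid (7B) in the abstract.
power = implied_power_watts(2.5, 0.90)  # roughly 2.8 W under the stated assumption
```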
cs.RO / 4 / 2605.03260

Robust Path Tracking for Vehicles via Continuous-Time Residual Learning: An ICODE-MPPI Approach

基于连续时间残差学习的车辆鲁棒路径追踪:ICODE-MPPI方法
Song, Shugen, Mei, Wenjie, Zhao, Chengyan
Abstract
Model Predictive Path Integral (MPPI) control is a powerful sampling-based strategy for nonlinear autonomous systems. However, its performance is often bottlenecked by the fidelity of nominal dynamics. We propose ICODE-MPPI, a robust framework that leverages Input Concomitant Neural Ordinary Differential Equations (ICODEs) to learn and compensate for unmodeled residual dynamics. Unlike discrete-time learners, ICODEs maintain physical consistency and temporal continuity during the MPPI prediction horizon. High-fidelity simulations on complex trajectories demonstrate that ICODE-MPPI achieves up to a 69% reduction in cross-tracking error under persistent disturbances compared to standard MPPI control. Furthermore, our analysis confirms that ICODE-MPPI significantly suppresses control chattering, yielding smoother steering commands and superior robust performance.
Chinese Translation
模型预测路径积分(MPPI)控制是一种强大的基于采样的非线性自主系统控制策略。然而,其性能常常受限于标称动力学的精度。我们提出了ICODE-MPPI,这是一个鲁棒框架,利用输入共伴神经常微分方程(ICODEs)来学习和补偿未建模的残差动力学。与离散时间学习器不同,ICODE在MPPI预测范围内保持了物理一致性和时间连续性。在复杂轨迹上的高保真模拟表明,与标准的MPPI控制相比,ICODE-MPPI在持续干扰下的交叉跟踪误差减少了多达69%。此外,我们的分析确认,ICODE-MPPI显著抑制了控制抖振,产生了更平滑的转向指令和卓越的鲁棒性能。
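A minimal sketch of the standard MPPI update, with a residual term added to the nominal dynamics, may help make the abstract concrete. The scalar state, the Euler integration step, and the callables `nominal` and `residual` are assumptions for illustration. The paper's ICODE model is continuous-time and is not reproduced here.

```python
import math


def rollout_cost(x0, controls, dt, nominal, residual, target):
    """Euler-integrate x' = nominal(x, u) + residual(x, u) and accumulate
    the squared tracking error against a constant target."""
    x, cost = x0, 0.0
    for u in controls:
        x = x + dt * (nominal(x, u) + residual(x, u))
        cost += (x - target) ** 2
    return cost


def mppi_update(u_nominal, noises, costs, lam=1.0):
    """Softmin-weighted average of the sampled control perturbations."""
    c_min = min(costs)  # subtracting the minimum keeps the exponentials stable
    unnorm = [math.exp(-(c - c_min) / lam) for c in costs]
    z = sum(unnorm)
    weights = [w / z for w in unnorm]
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(u_nominal)
    ]
```

In this sketch the learned residual only changes the rollout costs. The weighting step is the same as in plain MPPI.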
cs.RO / 5 / 2605.03269

RLDX-1 Technical Report

RLDX-1 技术报告
Kim, Dongyoung, Jang, Huiwon, Koo, Myungkyu, Jang, Suhyeok, Kim, Taeyoung, Kim, Beomjun, Yoon, Byungjun, Jang, Changsung, Choi, Daewon, Han, Dongsu, Lee, Donguk, Kwon, Heeseung, Jeon, Hojin, Kang, Jaehyun, Bae, Jaekyoung, Lee, Jihyuk, Lee, Jimin, Won, John, Ahn, Joonwoo, Park, Junhyeong, Sung, Junyoung, Lee, Kyungmin, Han, Minseong, Yoon, Minsung, Joo, Sejune, Son, Seonil, Park, Seungcheol, Cho, Seunggeun, Moon, Seungjun, Kim, Seungku, Dong, Yonghoon, Cho, Yongjin, Kim, Youngchan, Kim, Chang Hwan, Kim, Dohyeon, Lee, Hazel, Kim, Heecheol, Ahn, Hensen, Ryu, Hyungkyu, Choi, Hyunsoo, Shin, Hyunsoo, Jung, Jaeheon, Kim, Jaewoo, Kim, Jinwook, Chang, Joochul, Kim, Joonsoo, Park, Junghun, Park, Jungwoo, Cho, Junho, Park, Junhyeok, Lee, Junwon, Lee, Kangwook, Kim, Kwanghoon, Choe, Kyoungwhan, Bhadu, Manoj, Oh, Nayoung, Kim, Sangjun, Kim, Sangwoo, Shim, Seunghoon, Kim, Seunghyun, Lee, Seungjun, Ka, Seungyup, Yang, Sungryol, Jung, Wook, Shukla, Yashu, Lee, Yeonjae, Bae, Yeonwoo, Shin, Jinwoo
Abstract
While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g. motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. $\pi_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving success rates of 86.8% while $\pi_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
Chinese Translation
尽管视觉-语言-动作模型(Vision-Language-Action models, VLA)在通过预训练的视觉-语言模型所继承的多样化智能(即广泛的场景理解和语言条件下的泛化)方面取得了显著进展,但它们在面对需要更广泛功能能力的复杂现实任务时仍然面临挑战(例如,运动感知、记忆感知的决策制定和物理感知)。为了解决这一问题,我们提出了 RLDX-1,这是一种基于多流动作变换器(Multi-Stream Action Transformer, MSAT)构建的通用机器人策略,用于灵巧操作。该架构通过集成异质模态,结合了模态特定流与跨模态联合自注意力,统一了这些能力。RLDX-1 进一步结合了系统级设计选择,包括为稀有操作场景合成训练数据、针对类人操作的学习过程以及实时部署的推理优化。通过实证评估,我们表明 RLDX-1 在需要超出一般通用性的广泛功能能力的模拟基准和现实任务中,持续超越了近期的前沿 VLA(例如,$\pi_{0.5}$ 和 GR00T N1.6)。特别是在 ALLEX 类人任务中,RLDX-1 以 86.8% 的成功率展现了优越性,而 $\pi_{0.5}$ 和 GR00T N1.6 的成功率约为 40%,突显了 RLDX-1 在多样化功能需求下控制高自由度(high-DoF)类人机器人的能力。这些结果共同表明,RLDX-1 是迈向面向复杂、接触丰富且动态的现实世界灵巧操作的可靠 VLA 的有前景的一步。
cs.RO / 6 / 2605.03288

Neural Control: Adjoint Learning Through Equilibrium Constraints

神经控制:通过平衡约束进行伴随学习
Tong, Dezhong, Wang, Jiawen, Zhou, Hengyi, Shen, Yinglong, Huang, Xiaonan, Jawed, M. Khalid
Abstract
Many physical AI tasks are governed by implicit equilibrium: an agent actuates a subset of degrees of freedom (boundary DoFs), while the remaining free DoFs settle by minimizing a total potential energy. Even seemingly basic tasks such as bending a deformable linear object (DLO) to a target shape can exhibit strongly nonlinear behavior due to multi-stability: the same boundary conditions may yield multiple equilibrium shapes depending on the actuation trajectory. However, learning and control in such systems is brittle because the actuation-to-configuration map is defined only implicitly, and naive backpropagation through iterative equilibrium solvers is memory- and compute-intensive. We propose Neural Control, a boundary-control framework that computes trajectory-dependent, memory-efficient proxy gradients by differentiating equilibrium conditions via an adjoint formulation, avoiding unrolling of solver iterations. To improve robustness over long horizons, we integrate these sensitivities into a receding-horizon MPC scheme that repeatedly re-anchors optimization to realized equilibria and mitigates basin-switching in multi-stable regimes. We evaluate Neural Control in simulation and on physical robots manipulating DLOs, and show improved performance over gradient-free baselines such as SPSA and CEM.
Chinese Translation
许多物理人工智能任务由隐含平衡支配:一个智能体操控一部分自由度(边界自由度),而其余的自由度通过最小化总潜在能量来达到稳态。即使是弯曲可变形线性物体(DLO)到目标形状这样看似简单的任务,也可能由于多稳态性表现出强非线性行为:相同的边界条件可能会根据驱动轨迹产生多种平衡形状。然而,在这样的系统中进行学习和控制是脆弱的,因为驱动到配置的映射仅是隐式定义的,而通过迭代平衡求解器进行简单的反向传播则会消耗大量内存和计算资源。我们提出了一种神经控制(Neural Control)的方法,这是一个边界控制框架,通过伴随形式对平衡条件进行求导,从而计算依赖轨迹的、内存高效的代理梯度,避免了解决器迭代的展开。为了提高在长时间范围内的鲁棒性,我们将这些灵敏度集成到一个递归水平的模型预测控制(MPC)方案中,该方案反复将优化重新锚定到已实现的平衡,并减轻多稳态范围中的盆地切换。我们在模拟和物理机器人操控可变形线性对象的实验中评估了神经控制,并显示出相较于无梯度基准(如SPSA和CEM)的性能提升。
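The idea of differentiating an equilibrium condition instead of unrolling the solver can be shown on a scalar toy problem. The energy E(x, u) = x^4/4 + x^2/2 - u x, the Newton solver, and the quadratic loss are all assumptions chosen for illustration. This toy energy has a single equilibrium, so it does not exhibit the multi-stability the abstract discusses.

```python
def solve_equilibrium(u, x=0.0, iters=50):
    """Newton iterations on g(x, u) = dE/dx = x**3 + x - u = 0."""
    for _ in range(iters):
        x -= (x**3 + x - u) / (3 * x**2 + 1)
    return x


def loss(u, target):
    """Squared distance between the equilibrium state and a target state."""
    return (solve_equilibrium(u) - target) ** 2


def adjoint_grad(u, target):
    """dL/du from the equilibrium condition, without unrolling the solver."""
    x = solve_equilibrium(u)
    dg_dx = 3 * x**2 + 1              # Hessian of E with respect to x
    dg_du = -1.0                      # mixed derivative of E
    lam = 2 * (x - target) / dg_dx    # adjoint solve: dg_dx * lam = dL/dx
    return -dg_du * lam
```

The gradient needs only one linear solve at the converged state, so its memory cost does not grow with the number of solver iterations.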
cs.RO / 7 / 2605.03290

On Surprising Effects of Risk-Aware Domain Randomization for Contact-Rich Sampling-based Predictive Control

关于风险意识领域随机化在接触丰富的基于采样的预测控制中的惊人效果
Esteban, Sergio A., Li, Junheng, Kurtz, Vince, Ames, Aaron D.
Abstract
Domain randomization (DR) is widely used in policy learning to improve robustness to modeling error, but remains underexplored in contact-rich sampling-based predictive control (SPC), where rollout quality is highly sensitive to uncertainty. In this work, we take the first step by studying risk-aware DR in predictive sampling on a simple yet representative Push-T task, comparing average, optimistic, and pessimistic rollout aggregations under randomized model instances. Our initial results suggest that DR affects not only robustness to model error, but also the effective cost landscape seen by the sampling-based optimizer, by reshaping the basin of attraction around contact-producing actions. This opens up potential for exploring better grounded risk-aware contact-rich SPC under model uncertainty. Video: https://youtu.be/f1F0ALXxhSM
Chinese Translation
领域随机化(Domain Randomization, DR)被广泛应用于策略学习,以提升对建模误差的鲁棒性,但在接触丰富的基于采样的预测控制(Sampling-based Predictive Control, SPC)中仍然未得到充分探讨,而该领域中,展开质量对不确定性高度敏感。在本研究中,我们迈出了第一步,通过在一个简单而具有代表性的推拉任务(Push-T)上研究风险意识的DR,比较在随机模型实例下的平均、乐观和悲观展开聚合。我们的初步结果表明,DR不仅影响对模型误差的鲁棒性,还影响基于采样的优化器所见的有效成本景观,重新塑造接触产生动作周围的吸引盆地。这为在模型不确定性下探索更为扎实的风险意识接触丰富SPC开辟了潜在的可能性。视频:https://youtu.be/f1F0ALXxhSM
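The three rollout aggregations named in this abstract can be sketched as follows. Mapping average, optimistic, and pessimistic to the mean, minimum, and maximum cost is a natural reading rather than a detail confirmed by the abstract. The action names and cost values are made up for illustration.

```python
def aggregate(costs, mode):
    """Combine one action's costs across randomized model instances."""
    if mode == "average":
        return sum(costs) / len(costs)
    if mode == "optimistic":
        return min(costs)   # best case over models
    if mode == "pessimistic":
        return max(costs)   # worst case over models
    raise ValueError(f"unknown mode: {mode}")


def best_action(cost_table, mode):
    """cost_table maps each candidate action to its per-model rollout costs."""
    return min(cost_table, key=lambda a: aggregate(cost_table[a], mode))


# Hypothetical costs: "push" usually works but fails badly under one model.
cost_table = {"push": [1.0, 1.0, 9.0], "wait": [3.0, 3.0, 3.0]}
```

With these numbers the optimistic aggregation selects the contact-producing action, while the other two avoid it. This shows how the choice of aggregation can change which actions the sampler prefers.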
cs.RO / 8 / 2605.03302

Height Control and Optimal Torque Planning for Jumping With Wheeled-Bipedal Robots

轮式双足机器人跳跃的高度控制与最优扭矩规划
Zhuang, Yulun, Xu, Yuan, Huang, Binxin, Chao, Mandan, Shi, Guowei, Yang, Xin, Zhang, Kuangen, Fu, Chenglong
Abstract
This paper studies accurate jump-height control of wheeled-bipedal robots based on torque planning and energy consumption optimization. Because the jumping process involves underactuation, nonlinear estimation, and instantaneous impact, accurately controlling the jumping height of a wheeled-bipedal robot is difficult. In practice, robots often jump higher than necessary to ensure safety, causing additional motor loss, greater ground reaction force, and higher energy consumption. To solve this problem, a novel wheeled-bipedal jumping dynamical model (W-JBD) is proposed to achieve accurate height control. It performs well but is not suitable for the real robot because its torque profile contains an abrupt step. Therefore, a Bayesian optimization for torque planning (BOTP) method is proposed, which obtains the optimal torque plan without an accurate dynamic model and within a few iterations. The BOTP method reduces height error by 82.3% and energy cost by 26.9% while producing a continuous torque curve. This result is validated on the Webots simulation platform. Using the torque curve obtained from the W-JBD model to narrow the search space, BOTP converges quickly (after 40 iterations on average). By combining the W-JBD model with the BOTP method, height control of real robots can be achieved with a reasonable number of experiments.
Chinese Translation
本文主要研究基于扭矩规划和能量消耗优化的轮式双足机器人精确高度跳跃控制。由于跳跃过程中存在欠驱动、非线性估计和瞬时冲击等特性,轮式双足机器人跳跃高度的准确控制变得复杂。实际上,机器人往往为了确保安全而跳得过高,这导致额外的电机损耗、更大的地面反作用力和更多的能量消耗。为了解决这个问题,提出了一种新颖的轮式双足跳跃动态模型(W-JBD),旨在实现精确的高度控制。该模型表现良好,但由于扭矩变化幅度剧烈,不适合实际机器人。因此,提出了基于贝叶斯优化的扭矩规划方法(BOTP),该方法可以在没有精确动态模型的情况下,通过少量迭代获得最优的扭矩规划。BOTP 方法可以将高度误差降低82.3%,能量成本降低26.9%,且形成连续的扭矩曲线。此结果在Webots仿真平台上得到验证。在W-JBD模型中获得的扭矩曲线可以缩小搜索空间,从而使BOTP能够快速收敛(平均40次)。通过结合W-JBD模型和BOTP方法,能够以合理的实验次数实现实际机器人的高度控制。
cs.RO / 9 / 2605.03363

Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

通过层次化任务空间强化学习规划和关节空间二次规划控制实现反应性灵巧抓取
Lee, Ho Jae, Lee, Yonghyeon, Alexiev, Alexander, Lin, Tzu-Yuan, Jeon, Se Hwan, Kim, Sangbae
Abstract
In this work, we propose a hybrid hierarchical control framework for reactive dexterous grasping that explicitly decouples high-level spatial intent from low-level joint execution. We introduce a multi-agent reinforcement learning architecture, specialized into distinct arm and hand agents, that acts as a high-level planner by generating desired task-space velocity commands. These commands are then processed by a GPU-parallelized quadratic programming controller, which translates them into feasible joint velocities while strictly enforcing kinematic limits and collision avoidance. This structural isolation not only accelerates training convergence but also strictly enforces hardware safety. Furthermore, the architecture unlocks zero-shot steerability, allowing system operators to dynamically adjust safety margins and avoid dynamic obstacles without retraining the policy. We extensively validate the proposed framework through a rigorous simulation-to-reality pipeline. Real-world hardware experiments on a 7-DoF arm equipped with a 20-DoF anthropomorphic hand demonstrate highly robust zero-shot transferability for dexterous grasping to a diverse set of unseen objects, highlighting the system's ability to reactively recover from unexpected physical disturbances in unstructured environments.
Chinese Translation
在本研究中,我们提出了一种混合层次控制框架,用于反应性灵巧抓取,该框架明确地将高层空间意图与低层关节执行解耦。我们介绍了一种多智能体强化学习架构,专门化为独立的臂部和手部智能体,作为高层规划器,通过生成期望的任务空间速度指令来运作。这些指令随后由一个基于GPU并行化的二次规划控制器处理,该控制器将其转换为可行的关节速度,同时严格执行运动学限制和防止碰撞。这种结构隔离不仅加速了训练的收敛,还严格维护了硬件安全。此外,该架构实现了零样本可控性,使系统操作员能够动态调整安全边际并在不重新训练策略的情况下避开动态障碍物。我们通过严格的仿真到现实的流程对提出的框架进行了广泛验证。在装备有20自由度类人手的7自由度机械臂上的真实硬件实验显示了对于不同未见物体的灵巧抓取的高度鲁棒性零样本迁移能力,突显了该系统在无结构环境中对意外物理干扰的反应恢复能力。
cs.RO / 10 / 2605.03452

BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

BifrostUMI:连接无机器人演示与类人全身操控
Yu, Chenhao, Wang, Hongwu, Hu, Youhao, Zhang, Jiachen, Li, Yuanyuan, Luo, Shaqi
Abstract
High-quality data collection is a fundamental cornerstone for training humanoid whole-body visuomotor policies. Current data acquisition paradigms predominantly rely on robot teleoperation, which is often hindered by limited hardware accessibility and low operational efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose BifrostUMI, a portable, efficient, and robot-free data collection framework tailored for humanoid robots. BifrostUMI leverages lightweight VR devices to capture human demonstrations as sparse keypoint trajectories while simultaneously recording wrist-mounted visual data. These multimodal data are subsequently utilized to train a high-level policy network that predicts future keypoint trajectories conditioned on the captured visual features. Through a robust keypoint retargeting pipeline, keypoint trajectories are precisely mapped onto the robot's morphology and executed via a whole-body controller. This approach enables the seamless transfer of diverse and agile behaviors from natural human demonstrations to humanoid embodiments. We demonstrate the efficacy and versatility of the proposed framework across two distinct experimental scenarios.
Chinese Translation
高质量数据收集是训练类人全身视觉运动策略的基础核心。目前的数据采集范式主要依赖于机器人遥控,这常常受到硬件可达性有限和操作效率低下的困扰。受到通用操控接口(Universal Manipulation Interface, UMI)的启发,我们提出了BifrostUMI,这是一个便携、高效且无机器人数据收集框架,专门为类人机器人量身定制。BifrostUMI利用轻量级虚拟现实(VR)设备以稀疏关键点轨迹的形式捕捉人类演示,同时记录手腕部位的视觉数据。这些多模态数据随后被用于训练一个高级策略网络,该网络根据捕获的视觉特征预测未来的关键点轨迹。通过一个稳健的关键点重新定向管道,关键点轨迹被精确地映射到机器人的形态上,并通过全身控制器执行。该方法能够实现从自然人类演示到类人表现的多样化和灵活行为的无缝转移。我们展示了所提框架在两个不同实验场景中的有效性和多功能性。
cs.RO / 11 / 2605.03637

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

弥合具身性差距:解耦式跨具身视频编辑
Li, Zhiyuan, Yang, Wenyan, Zhao, Wenshuai, Ma, Yue, Tu, Yuanpeng, Marttinen, Pekka, Pajarinen, Joni
Abstract
Learning robotic manipulation from human videos is a promising solution to the data bottleneck in robotics, but the distribution shift between humans and robots remains a critical challenge. Existing approaches often produce entangled representations, where task-relevant information is coupled with human-specific kinematics, limiting their adaptability. We propose a generative framework for cross-embodiment video editing that directly addresses this by learning explicitly disentangled task and embodiment representations. Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling the synthesis of a coherent robot execution video from a single human demonstration, without requiring paired cross-embodiment data. Experiments show our approach generates temporally consistent and morphologically accurate robot demonstrations, offering a scalable solution to leverage internet-scale human video for robot learning.
Chinese Translation
从人类视频中学习机器人操作是解决机器人领域数据瓶颈的一个有前景的方案,但人类与机器人之间的分布差异仍然是一项重要挑战。现有的方法通常生成纠缠的表示,其任务相关信息与人类特有的运动学相耦合,限制了适应性。我们提出了一种跨具身视频编辑的生成框架,直接通过显式学习解耦的任务和具身表示来解决这一问题。我们的方法通过强制执行双重对比目标,将演示视频分解为两个正交的潜在空间:它最小化两个空间之间的互信息以确保独立性,同时最大化空间内的一致性以创建稳定的表示。一个参数高效的适配器将这些潜在编码注入到冻结的视频扩散模型中,从而能够从单个的人类演示合成出连贯的机器人执行视频,而无需配对的跨具身数据。实验表明,我们的方法生成具有时间一致性和形态准确性的机器人演示,为利用互联网规模的人类视频进行机器人学习提供了可扩展的解决方案。
cs.RO / 12 / 2605.03641

Jiao: Bridging Isolation and Customization in Mixed Criticality Robotics

Jiao:在混合关键性机器人技术中架起隔离与定制的桥梁
Yen, James, Huang, Zhibai, Wei, Zhixiang, Yi, Tinghao, Zeng, Shupeng, Pang, Liang, Xue, Songtao, Qi, Zhengwei
Abstract
Consumer robotics demands consolidation of safety-critical control, perception pipelines, and user applications on shared multicore platforms. While static partitioning hypervisors provide hardware-enforced isolation, directly transplanting automotive architectures encounters an expertise asymmetry problem in which end-users modifying robot behavior lack the systems knowledge that platform developers possess. We present an architecture addressing this challenge through three integrated components. A Safe IO Cell provides hardware-level override capability. A Parameter Synchronization Service encapsulates cross-domain complexity. A Safety Communication Layer implements IEC 61508-aligned verification. Our empirical evaluation on an ARM Cortex-A55 platform demonstrates that partition isolation reduces cycle-period jitter by 84.5% and cuts tail timing error by nearly an order of magnitude (p99 |jitter| from 69.0 μs to 7.8 μs), eliminating all >50 μs excursions.
Chinese Translation
消费类机器人需要在共享的多核平台上整合安全关键控制、感知管道和用户应用。虽然静态分区的虚拟机监控程序提供了硬件强制隔离,但直接移植汽车架构会遇到专业知识不对称的问题,即修改机器人行为的最终用户缺乏平台开发者所拥有的系统知识。我们提出了一种通过三个集成组件解决这一挑战的架构。一个安全输入输出单元(Safe IO Cell)提供了硬件级覆盖能力;一个参数同步服务(Parameter Synchronization Service)封装了跨域复杂性;一个安全通信层(Safety Communication Layer)实施了IEC 61508标准的验证。我们在ARM Cortex-A55平台上的实证评估表明,分区隔离将周期抖动减少了84.5%,并将尾部时序误差降低了近一个数量级(p99 |jitter| 从69.0 μs降至7.8 μs),消除了所有大于50 μs的波动。
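The tail-jitter figure quoted in this abstract is a 99th-percentile statistic over per-cycle timing deviations. The sketch below shows one common way to compute such a number. The nearest-rank percentile definition and the function names are assumptions, since the abstract does not state the paper's exact method.

```python
import math


def abs_jitter_us(periods_us, nominal_us):
    """Absolute deviation of each measured cycle period from the nominal period."""
    return [abs(p - nominal_us) for p in periods_us]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Ratio of the reported p99 values, 69.0 us versus 7.8 us: about 8.8x.
reduction_factor = 69.0 / 7.8
```

The ratio of about 8.8 is consistent with the abstract's wording of "nearly an order of magnitude".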
cs.RO / 13 / 2605.03662

Feasibility-aware Hybrid Control for Motion Planning under Signal Temporal Logics

基于信号时序逻辑的可行性感知混合控制运动规划
Rousseas, Panagiotis, Dimarogonas, Dimos V.
Abstract
In this work, a novel method for planar task and motion planning based on hybrid modeling is proposed. By virtue of a discrete variable which models local constraint satisfaction and enables local feasibility analysis, the proposed control architecture unifies planning with control design. Concurrently, control barrier functions are designed on a transformed disk version of the original nonconvex and geometrically complex robotic workspace, thus amending the issue of deadlocks. Simulations of the proposed method indicate effective handling of multiple overlapping spatio-temporal tasks even in the face of input saturation.
Chinese Translation
本文提出了一种基于混合建模的平面任务与运动规划的新方法。通过引入一个离散变量来建模局部约束满足并实现局部可行性分析,所提出的控制架构将规划与控制设计统一起来。同时,在原始非凸且几何复杂的机器人工作空间的变换圆盘版本上设计了控制障碍函数,从而解决了死锁问题。对所提出方法的仿真表明,尽管存在输入饱和的情况,该方法能够有效处理多个重叠的时空任务。
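For readers unfamiliar with control barrier functions, here is a generic safety filter on a disk. It assumes single-integrator dynamics x' = u, the barrier h(x) = R^2 - |x|^2, and a linear class-K term alpha * h. These are textbook choices for illustration only, not the transformed-workspace construction or hybrid architecture of the paper.

```python
def cbf_filter(x, u_des, radius=1.0, alpha=1.0):
    """Minimally modify u_des so that dh/dt >= -alpha * h(x) holds,
    where h(x) = radius**2 - |x|**2 and the dynamics are x' = u."""
    h = radius**2 - (x[0] ** 2 + x[1] ** 2)
    a = (-2.0 * x[0], -2.0 * x[1])   # gradient of h, so dh/dt = a . u
    b = -alpha * h                   # constraint: a . u >= b
    slack = a[0] * u_des[0] + a[1] * u_des[1] - b
    if slack >= 0.0:
        return u_des                 # desired input already satisfies the constraint
    # At x = 0 the slack equals alpha * radius**2 >= 0, so a is nonzero here.
    norm_sq = a[0] ** 2 + a[1] ** 2
    scale = -slack / norm_sq         # closed-form projection onto the half-space
    return (u_des[0] + scale * a[0], u_des[1] + scale * a[1])
```

With a single affine constraint the underlying quadratic program has this closed-form solution, so no solver is needed.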
cs.RO / 14 / 2605.03666

Sensorless State Estimation and Control for Agile Cable-Suspended Payload Transport by Quadrotors

四旋翼无人机灵活的吊索载荷运输无传感器状态估计与控制
Nascimento, Ana Maria, Sales, Augusto, Lima, Antonio Marcus, Nascimento, Tiago
Abstract
This work proposes a novel control and estimation approach for aerial manipulation of a cable-suspended load using Unmanned Aerial Vehicles (UAVs). Common approaches in the state of the art have practical limitations, relying on direct load measurements and Lagrangian methods for dynamic modeling. The lack of a straightforward dynamic model of the system led us to propose adopting the Udwadia-Kalaba method to explicitly incorporate the cable's geometric constraints. This formulation allowed for the consistent derivation of the tension force and its direct integration into the NMPC prediction model. Additionally, we propose a sensorless load state estimation based on the same geometric constraints. Results from real-robot experiments demonstrated that the explicit inclusion of load dynamics in the optimization problem significantly reduces trajectory-tracking errors and yields better overall performance compared to strategies based on incomplete models.
Chinese Translation
本研究提出了一种新颖的控制与估计方法,用于通过无人机(UAVs)进行吊索载荷的空中操作。当前最先进的方法存在实际限制,依赖于直接的载荷测量和拉格朗日方法进行动态建模。由于缺乏系统的直接动态模型,我们提出采用Udwadia-Kalaba方法,明确纳入吊索的几何约束。这种表述使得张力的推导一致,并可以将其直接整合到非线性模型预测控制(NMPC)预测模型中。此外,我们还基于相同的几何约束提出了一种无传感器的载荷状态估计。来自真实机器人实验的结果表明,在优化问题中明确包含载荷动态显著减少了轨迹跟踪误差,并与基于不完整模型的策略相比,表现出更好的整体性能。
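For reference, the standard Udwadia-Kalaba equation is given below for a system with mass matrix $M$, applied generalized forces $Q$, and acceleration-level constraints $A\ddot q = b$. The abstract does not say how the paper instantiates $A$ and $b$. The taut-cable constraint in the last two lines, with load position $p_L$, quadrotor position $p_Q$, and cable length $\ell$, is an assumption for illustration.

```latex
\begin{aligned}
M(q)\,\ddot q &= Q(q,\dot q,t) + Q_c, \qquad A(q,\dot q,t)\,\ddot q = b(q,\dot q,t),\\
Q_c &= M^{1/2}\bigl(A\,M^{-1/2}\bigr)^{+}\bigl(b - A\,M^{-1}Q\bigr),\\
\phi(q) &= \lVert p_L - p_Q\rVert^{2} - \ell^{2} = 0,\\
\ddot\phi = 0 \;&\Longrightarrow\; (p_L - p_Q)^{\top}(\ddot p_L - \ddot p_Q) = -\lVert \dot p_L - \dot p_Q\rVert^{2}.
\end{aligned}
```

Here $(\cdot)^{+}$ is the Moore-Penrose pseudoinverse, and $Q_c$ is the constraint force, which for a cable corresponds to the tension acting along the cable direction.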
cs.RO / 15 / 2605.03669

FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

FUS3DMaps:通过体素级和实例级层的3D融合实现可扩展和准确的开放词汇语义映射
Homberger, Timon, Busch, Finn Lukas, Peimbert, Jesús Gerardo Ortega, Yang, Quantao, Andersson, Olov
Abstract
Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map where the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
Chinese Translation
开放词汇语义映射使机器人能够在没有预定义类别集合的情况下,将以前未见的概念进行空间定位。目前的无训练方法通常依赖于将语义嵌入进行多视角融合,从而构建3D地图,这些方法要么通过对视图进行分割并编码图像切片以在实例级进行操作,要么通过将图像块嵌入直接投影到稠密语义地图来进行操作。后者方法通过在完整未裁剪的图像帧上进行操作,避免了分割和2D到3D实例关联的步骤,但现有方法在可扩展性上仍然受到限制。我们提出了FUS3DMaps,这是一种在线双层语义映射方法,能够在共享的体素地图中同时维护稠密层和实例级的开放词汇层。这种设计允许对层嵌入进行进一步的体素级语义融合,结合了两种语义映射方法的互补优势。我们的研究发现,所提出的语义跨层融合方法提高了实例级和稠密层的质量,同时也使得在空间滑动窗口内限制稠密层和跨层融合的情况下,实现可扩展且高度准确的实例级地图成为可能。在已建立的3D语义分割基准以及一些大规模场景的实验表明,FUS3DMaps在多层建筑规模下实现了准确的开放词汇语义映射。附加材料和代码将会公开: https://githanonymous.github.io/FUS3DMaps/.
cs.RO / 16 / 2605.03678

Robust Visual SLAM for UAV Navigation in GPS-Denied and Degraded Environments: A Multi-Paradigm Evaluation and Deployment Study

在GPS受限和退化环境中针对无人机导航的鲁棒视觉SLAM:多范式评估与部署研究
Kumar, Prasoon, Deepak, Akshay, Kumar, Sandeep
Abstract
Reliable localization in GPS-denied, visually degraded environments is critical for autonomous UAV operations. This paper presents a systematic comparative evaluation of five V-SLAM systems (ORB-SLAM3, DPVO, DROID-SLAM, DUSt3R, and MASt3R) spanning classical, deep learning, recurrent, and Vision Transformer (ViT) paradigms. Experiments are conducted on curated sequences from four public benchmarks (TUM RGB-D, EuRoC MAV, UMA-VI, SubT-MRS) and a custom monocular indoor dataset under five controlled degradation conditions (normal, low light, dust haze, motion blur, and combined), with sub-millimeter Vicon ground truth. Results show that ORB-SLAM3 fails critically under severe degradation (62.4% overall TSR; 0% under dense haze), while learning-based methods remain robust: MASt3R achieves the lowest degraded ATE (0.027 m) and DUSt3R the highest tracking success (96.5%). DPVO offers the best efficiency-robustness trade-off (18.6 FPS, 3.1 GB GPU memory, 86.1% TSR), making it the preferred choice for memory-constrained embedded platforms. Embedded deployment analysis across NVIDIA Jetson platforms provides actionable guidelines for SLAM selection under SWaP-constrained UAV scenarios.
Chinese Translation
在GPS受限且视觉退化的环境中,可靠的定位对于自主无人机操作至关重要。本文对五种视觉SLAM系统(ORB-SLAM3、DPVO、DROID-SLAM、DUSt3R和MASt3R)进行了系统的比较评估,涵盖了经典、深度学习、递归和视觉变换器(ViT)范式。实验在来自四个公共基准(TUM RGB-D、EuRoC MAV、UMA-VI、SubT-MRS)的精心策划序列以及一个定制的单目室内数据集上进行,考虑了五种受控退化条件(正常、低光、灰尘雾、运动模糊和组合),并利用亚毫米级的Vicon真实值进行验证。结果表明,ORB-SLAM3在严重退化情况下严重失效(整体跟踪成功率62.4%;在浓雾下为0%),而基于学习的方法仍然保持鲁棒性:MASt3R获得了最低的退化绝对轨迹误差(0.027米),DUSt3R则获得了最高的跟踪成功率(96.5%)。DPVO在效率和鲁棒性之间提供了最佳的权衡(18.6 FPS,3.1 GB GPU内存,86.1%跟踪成功率),使其成为内存受限嵌入式平台的首选。在NVIDIA Jetson平台上的嵌入式部署分析为SWaP受限的无人机场景下的SLAM选择提供了可操作的指导方针。
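The two headline metrics in this abstract, ATE and TSR, can be sketched as below. The ATE function assumes the estimated and ground-truth positions are already time-associated and aligned, a step that is normally done first and is omitted here. Computing TSR as the fraction of successfully tracked frames is an assumption, since the abstract does not define it.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error over paired, pre-aligned 3D positions (meters)."""
    sq_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def tracking_success_rate(tracked_flags):
    """Percentage of frames flagged as successfully tracked (hypothetical TSR definition)."""
    return 100.0 * sum(tracked_flags) / len(tracked_flags)
```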
cs.RO / 17 / 2605.03821

RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

RoboAlign-R1:用于机器人视频世界模型的提取多模态奖励对齐
Wu, Hao, Li, Yuqi, Gao, Yuan, Xu, Fan, Zhang, Fan, Wang, Kun, Zhao, Penghao, Wang, Qiufeng, Zhao, Yizhou, Wang, Weiyan, Tian, Yingli, Wu, Xian, Huang, Xiaomeng
Abstract
Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
Chinese Translation
现有的机器人视频世界模型通常使用重建和感知相似性等低层次目标进行训练,这些目标与机器人决策中最重要的能力(如遵循指令、操作成功和物理合理性)之间的对齐程度较差。此外,它们在长期自回归预测中也容易出现误差累积。我们提出了RoboAlign-R1,这是一种将奖励对齐后训练与稳定的长期推理相结合的框架,专为机器人视频世界模型设计。我们构建了RobotWorldBench,这是一个由四个机器人数据源收集的10,000对带注释的视频-指令对的基准,并训练了多模态教师评估模型RoboAlign-Judge,以提供对生成视频的细粒度六维评估。然后,我们将教师模型蒸馏为一个轻量级的学生奖励模型,以进行高效的基于强化学习的后训练。为减少长期展开漂移,我们进一步引入滑动窗口重新编码(Sliding Window Re-encoding, SWR),这是一种免训练的推理策略,定期刷新生成上下文。在我们的领域内评估协议下,RoboAlign-R1在最强基线基础上提高了10.1%的综合六维分数,包括在操作准确性上提升了7.5%和在遵循指令上提升了4.6%;这些排名改进还得到了基于外部视觉语言模型(VLM)的交叉检查和盲法人工评估的进一步支持。同时,SWR在仅增加约1%的延迟的情况下提高了长期预测质量,SSIM值提高了2.8%,LPIPS减少了9.8%。综合来看,这些结果表明,奖励对齐后训练和稳定的长期解码提高了机器人视频世界模型中的任务一致性、物理真实性和长期预测质量。
cs.RO / 18 / 2605.03846

SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

SigLoMa:基于自我中心视觉的开放世界四足运动操控学习
Chen, Shiyi, Liu, Haiyi, Yang, Mingye, Zhang, Jiaqi, Zhang, Debing
Abstract
Designing an open-world quadrupedal loco-manipulation system is highly challenging. Traditional reinforcement learning frameworks utilizing exteroception often suffer from extreme sample inefficiency and massive sim-to-real gaps. Furthermore, the inherent latency of visual tracking fundamentally conflicts with the high-frequency demands of precise floating-base control. Consequently, existing systems lean heavily on expensive external motion capture and off-board computation. To eliminate these dependencies, we present SigLoMa, a fully onboard, ego-centric vision-based pick-and-place framework. At the core of SigLoMa is the introduction of Sigma Points, a lightweight geometric representation for exteroception that guarantees high scalability and native sim-to-real alignment. To bridge the frequency divide between slow perception and fast control, we design an ego-centric Kalman Filter to provide robust, high-rate state estimation. On the learning front, we alleviate sample inefficiency via an Active Sampling Curriculum guided by Hint Poses, and tackle the robot's structural visual blind spots using temporal encoding coupled with simulated random-walk drift. Real-world experiments validate that, relying solely on a 5Hz (200 ms latency) open-vocabulary detector, SigLoMa successfully executes dynamic loco-manipulation across multiple tasks, achieving performance comparable to expert human teleoperation.
Chinese Translation
设计一个开放世界的四足运动操控系统具有很高的挑战性。传统的强化学习框架利用外部感知,往往面临极端的样本效率低下和巨大的模拟到现实的差距。此外,视觉追踪的固有延迟与精确浮动基底控制的高频率需求根本相冲突。因此,现有系统在很大程度上依赖于昂贵的外部动作捕捉和板外计算。为了消除这些依赖,我们提出了SigLoMa,一个完全车载的、基于自我中心视觉的抓取与放置框架。在SigLoMa的核心是引入了Sigma Points,一种轻量几何表示,用于外部感知,确保高可扩展性和原生的模拟到现实对齐。为了弥合缓慢感知与快速控制之间的频率差距,我们设计了一种自我中心的卡尔曼滤波器,以提供稳健的高频状态估计。在学习方面,我们通过以提示姿态为指导的主动采样课程缓解样本低效,并采用时间编码与模拟随机游走漂移相结合的方式来解决机器人结构上的视觉盲点。实验证明,依赖于仅5Hz(200毫秒延迟)的开放词汇检测器,SigLoMa成功地在多个任务中执行动态运动操控,达到了与专家人类遥操作相当的性能。
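The multi-rate estimation idea in the SigLoMa abstract (a 5 Hz detector feeding a much faster controller) can be illustrated with a textbook Kalman filter. The sketch below is not the paper's ego-centric filter: it is a minimal 1D constant-velocity filter, and the 100 Hz control rate, the noise values, the target speed, and every name in it are assumptions chosen for illustration. The 200 ms detector latency is ignored for simplicity.

```python
class ConstantVelocityKF:
    """1D constant-velocity Kalman filter (illustrative, not SigLoMa's filter)."""

    def __init__(self, pos=0.0, vel=0.0, q=1e-3, r=1e-2):
        self.x = [pos, vel]                  # state: position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance
        self.q = q                           # process noise per step (assumed)
        self.r = r                           # measurement noise variance (assumed)

    def predict(self, dt):
        pos, vel = self.x
        self.x = [pos + vel * dt, vel]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        # The measurement observes position only: H = [1, 0]
        (a, b), (c, d) = self.P
        s = a + self.r                       # innovation variance
        k0, k1 = a / s, c / s                # Kalman gain
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


def track(duration_s=2.0, control_hz=100, detector_hz=5, speed=0.5):
    """Predict at the control rate, correct only when a detection arrives."""
    kf = ConstantVelocityKF()
    dt = 1.0 / control_hz
    steps_per_detection = control_hz // detector_hz
    true_pos = 0.0
    for step in range(1, int(duration_s * control_hz) + 1):
        true_pos = speed * step * dt
        kf.predict(dt)                       # runs on every control tick
        if step % steps_per_detection == 0:
            kf.update(true_pos)              # noise-free detection, for clarity
    return kf.x[0], kf.x[1], true_pos
```

Calling `track()` returns the estimated position, the estimated velocity, and the true position after two simulated seconds. The point of the design is that the controller always has a state estimate available, even on the 19 out of 20 ticks when no detection arrives.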
cs.RO / 19 / 2605.03855

Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

评估生成模型作为人类类似协作行为的互动涌现表示
Shaji, Shinas, Hassan, Teena Chakkalayil, Houben, Sebastian, Mitrevski, Alex
Abstract
Human-AI collaboration requires AI agents to understand human behavior for effective coordination. While advances in foundation models show promising capabilities in understanding and showing human-like behavior, their application in embodied collaborative settings needs further investigation. This work examines whether embodied foundation model agents exhibit emergent collaborative behaviors indicating underlying mental models of their collaborators, which is an important aspect of effective coordination. This paper develops a 2D collaborative game environment where large language model agents and humans complete color-matching tasks requiring coordination. We define five collaborative behaviors as indicators of emergent mental model representation: perspective-taking, collaborator-aware planning, introspection, theory of mind, and clarification. An automated behavior detection system using LLM-based judges identifies these behaviors, achieving fair to substantial agreement with human annotations. Results from the automated behavior detection system show that foundation models consistently exhibit emergent collaborative behaviors without being explicitly trained to do so. These behaviors occur at varying frequencies during collaboration stages, with distinct patterns across different LLMs. A user study was also conducted to evaluate human satisfaction and perceived collaboration effectiveness, with the results indicating positive collaboration experiences. Participants appreciated the agents' task focus, plan verbalization, and initiative, while suggesting improvements in response times and human-like interactions. This work provides an experimental framework for human-AI collaboration, empirical evidence of collaborative behaviors in embodied LLM agents, a validated behavioral analysis methodology, and an assessment of collaboration effectiveness.
Chinese Translation
人机协作需要人工智能代理理解人类行为以实现有效协调。尽管基础模型的发展在理解和展现人类类似行为方面表现出良好的能力,但其在具身协作环境中的应用仍需进一步研究。本研究考察具身基础模型代理是否表现出涌现的协作行为,以指示其合作伙伴的潜在心理模型,这是有效协调的重要方面。本文开发了一个二维协作游戏环境,在该环境中,大型语言模型代理与人类共同完成需要协调的配色任务。我们定义了五种协作行为作为涌现心理模型表示的指标:换位思考、合作伙伴意识规划、自省、心智理论和澄清。基于大型语言模型的自动行为检测系统识别这些行为,并与人工标注达成了公平到显著的一致性。自动行为检测系统的结果表明,基础模型在没有明确训练的情况下,始终表现出涌现的协作行为。这些行为在不同的协作阶段以不同频率出现,并且在不同的语言模型之间存在明显的模式。此外,还进行了用户研究,以评估人类的满意度和感知的协作有效性,结果表明协作体验积极。参与者对代理的任务专注、计划表述和主动性表示赞赏,同时建议改进响应时间和人类般的互动。该工作为人机协作提供了实验框架,实证了具身大型语言模型代理中的协作行为,验证了行为分析方法,并评估了协作的有效性。
cs.RO / 20 / 2605.03909

Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing

基于视觉语言嵌入和超维计算的机器人检测任务感知扫描参数配置
Chen, Zhiling, Gorsich, David, Castanier, Matthew P., Zhang, Yang, Tang, Jiong, Imani, Farhad
Abstract
Robotic laser profiling is widely used for dimensional verification and surface inspection, yet measurement fidelity is often dominated by sensor configuration rather than robot motion. Industrial profilers expose multiple coupled parameters, including sampling frequency, measurement range, exposure time, receiver dynamic range, and illumination, that are still tuned by trial-and-error; mismatches can cause saturation, clipping, or missing returns that cannot be recovered downstream. We formulate instruction-conditioned sensing parameter recommendation: given a pre-scan RGB observation and a natural-language inspection instruction, infer a discrete configuration over key parameters of a robot-mounted profiler. To benchmark this problem, we develop Instruct-Obs2Param, a real-world multimodal dataset linking inspection intents and multi-view pose and illumination variation across 16 objects to canonical parameter regimes. We then propose ScanHD, a hyperdimensional computing framework that binds instruction and observation into a task-aware code and performs parameter-wise associative reasoning with compact memories, matching discrete scanner regimes while yielding stable, interpretable, low-latency decisions. On Instruct-Obs2Param, ScanHD achieves 92.7% average exact accuracy and 98.1% average Win@1 accuracy across the five parameters, with strong cross-split generalization and low-latency inference suitable for deployment, outperforming rule-based heuristics, conventional multimodal models, and multimodal large language models. This work enables autonomous, instruction-conditioned sensing configuration from task intent and scene context, eliminating manual tuning and elevating sensor configuration from a static setting to an adaptive decision variable.
Chinese Translation
机器人激光轮廓测量广泛应用于尺寸验证和表面检测,但测量精度往往受到传感器配置而非机器人运动的主导。工业轮廓仪暴露出多个耦合参数,包括采样频率、测量范围、曝光时间、接收器动态范围和照明,这些参数仍然通过试错方法进行调节;不匹配可能导致饱和、剪切或丢失回波,无法在后续处理中恢复。我们提出了基于指令的传感器参数推荐;给定一个预扫描的RGB观察图像和自然语言的检测指令,推断出机器人安装的轮廓仪的关键参数的离散配置。为了基准测试这个问题,我们开发了Instruct-Obs2Param,这是一个现实世界的多模态数据集,将检测意图与16个对象上多视角的姿态和照明变化联系到规范参数区域。然后,我们提出了ScanHD,一个超维计算框架,它将指令和观察绑定成一个任务感知代码,并通过紧凑的记忆执行按参数的关联推理,匹配离散扫描仪区域,同时产生稳定、可解释、低延迟的决策。在Instruct-Obs2Param上,ScanHD在五个参数上实现了92.7%的平均准确率和98.1%的平均首位准确率,具有强大的跨分割泛化能力以及适合部署的低延迟推断,优于基于规则的启发式方法、传统的多模态模型和多模态大语言模型。这项工作使得从任务意图和场景上下文中自主、基于指令的传感器配置成为可能,消除了手动调节,并将传感器配置从静态设置提升为自适应决策变量。
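The "bind, then recall from an associative memory" pattern that the ScanHD abstract describes is a standard hyperdimensional computing construction. The sketch below shows that generic pattern only, with random bipolar hypervectors standing in for real instruction and observation encodings. The dimensionality, the instruction and observation names, the exposure regimes, and all function names are hypothetical and are not taken from the paper.

```python
import random

D = 2048  # hypervector dimensionality (assumed)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def bind(a, b):
    """Binding by element-wise multiplication (self-inverse for bipolar vectors)."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundling by element-wise majority vote; ties resolve to +1."""
    sums = [sum(column) for column in zip(*vectors)]
    return [1 if s >= 0 else -1 for s in sums]


def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / D


def build_memory(examples):
    """examples maps a regime name to a list of task codes; returns prototypes."""
    return {regime: bundle(codes) for regime, codes in examples.items()}


def recall(memory, code):
    """Return the regime whose prototype is most similar to the query code."""
    return max(memory, key=lambda regime: similarity(memory[regime], code))


rng = random.Random(0)
instr = {name: random_hv(rng) for name in ("measure_gap", "inspect_surface")}
obs = {name: random_hv(rng) for name in ("shiny", "matte")}

# One associative memory per scanner parameter; only "exposure" is shown here.
exposure_memory = build_memory({
    "short_exposure": [bind(instr["measure_gap"], obs["shiny"]),
                       bind(instr["inspect_surface"], obs["shiny"])],
    "long_exposure": [bind(instr["measure_gap"], obs["matte"]),
                      bind(instr["inspect_surface"], obs["matte"])],
})
```

A query such as `recall(exposure_memory, bind(instr["measure_gap"], obs["matte"]))` compares one bound code against two prototypes, which is why this family of methods tends to be cheap at inference time.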
计算机视觉 (Computer Vision)
76
cs.CV / 1 / 2605.02908

Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

稳定扩散中的记忆化意外地受到CLIP嵌入的驱动
Kim, Bumjun, No, Albert
Abstract
Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as start-of-text, prompt, end-of-text, and padding tokens, with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}, \mathbf{v}^{\mathbf{pr}}, \mathbf{v}^{\mathbf{eot}}, \mathbf{v}^{\mathbf{pad}}$. We discover that $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) replacing the tokenizer's default padding token, which is the end-of-text token, with the ! token before embedding, and masking $\mathbf{v}^{\mathbf{eot}}$; (2) partially masking $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.
Chinese Translation
理解文本嵌入如何影响文本到图像扩散模型中的记忆化,对模型的可解释性和安全性至关重要。本文研究了CLIP嵌入在Stable Diffusion中的意外行为,揭示模型不成比例地依赖于特定的嵌入。我们将输入标记分类为起始标记、提示词标记、结束标记和填充标记,对应的嵌入分别为$\mathbf{v}^{\mathbf{sot}}$、$\mathbf{v}^{\mathbf{pr}}$、$\mathbf{v}^{\mathbf{eot}}$和$\mathbf{v}^{\mathbf{pad}}$。我们发现,在记忆化情况下,$\mathbf{v}^{\mathbf{pr}}$对生成的贡献极小。相反,$\mathbf{v}^{\mathbf{pad}}$由于其与$\mathbf{v}^{\mathbf{eot}}$的结构重复,强烈影响记忆化,而$\mathbf{v}^{\mathbf{eot}}$是CLIP训练期间唯一被明确优化的嵌入。这种重复意外地放大了$\mathbf{v}^{\mathbf{eot}}$的影响,使得模型对其过于依赖,从而驱动记忆化。基于这些观察结果,我们提出了两种简单但有效的推理时缓解策略:(1) 在嵌入之前将分词器的默认填充标记(即结束标记)替换为!符号,并屏蔽$\mathbf{v}^{\mathbf{eot}}$;(2) 部分屏蔽$\mathbf{v}^{\mathbf{pad}}$。这两种方法均在不降低质量的前提下抑制了记忆化,并且无须先前检测即可轻松部署。
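The second mitigation in the abstract above, partial masking of the padding embeddings, can be pictured with a small helper. This is a rough sketch under stated assumptions, not the authors' procedure: the abstract does not say which padding positions are masked or how, so zeroing the trailing share of padding slots, the `keep_ratio` parameter, and the role labels are all hypothetical choices made for illustration.

```python
def mask_pad_embeddings(embeddings, roles, keep_ratio=0.5):
    """Return a copy in which the trailing (1 - keep_ratio) share of pad slots is zeroed.

    embeddings: list of per-token vectors (lists of floats)
    roles:      list of role labels per token, e.g. "sot", "pr", "eot", "pad"
    keep_ratio: fraction of padding positions left untouched (assumed knob)
    """
    pad_positions = [i for i, role in enumerate(roles) if role == "pad"]
    n_keep = int(len(pad_positions) * keep_ratio)
    to_mask = set(pad_positions[n_keep:])
    return [
        [0.0] * len(vec) if i in to_mask else list(vec)
        for i, vec in enumerate(embeddings)
    ]
```

The helper leaves the start-of-text, prompt, and end-of-text slots untouched, so only the duplicated padding signal is reduced.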
cs.CV / 2 / 2605.02912

Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

推理引导的定位:通过多模态大型语言模型提升视频异常检测
Agarwal, Sakshi, Konwer, Aishik, Shah, Ankit Parag
Abstract
Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding, often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects. We propose VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning within a single VLM. VANGUARD introduces a three-stage curriculum that progressively layers training objectives: (1) classifier warmup on frozen backbone features, (2) LoRA-adapted spatial grounding, and (3) chain-of-thought generation. To overcome the sparse annotation typical of VAD benchmarks, we employ a teacher-student annotation pipeline in which a VLM (Qwen3-VL-4B) generates structured per-subclip reasoning trajectories based on manual annotations available from the UCA Dataset. Further, GroundingDINO provides bounding box supervision. On UCF-Crime, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects, capabilities absent from prior VAD methods. Ablations confirm that staged training outperforms monolithic optimization, and that structured reasoning acts as an implicit regularizer yielding more balanced predictions than classification-only fine-tuning. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization without target-domain adaptation.
Chinese Translation
视频异常检测(VAD)传统上被视为二元分类或离群点检测,但并未提供可解释的推理或异常事件的精确空间定位。尽管视觉语言模型(VLMs)能够提供丰富的场景理解,但在可靠的空间定位上表现不佳——常常会生成虚幻或几何上无效的边界框,尤其是在被要求定位对象时。我们提出了VANGUARD(通过推理和基础进行视频异常理解),一个统一异常分类、空间定位和链式思维推理于单一VLM中的框架。VANGUARD引入了一个三阶段的课程,逐步叠加训练目标:(1)在固定主干特征上进行分类器预热;(2)LoRA适配的空间定位;以及(3)链式思维生成。为了克服VAD基准测试中典型的稀疏注释,我们采用教师-学生注释流程,其中VLM(Qwen3-VL-4B)基于UCA数据集中可用的手动注释生成结构化的逐子片段推理轨迹。此外,GroundingDINO提供边界框监督。在UCF-Crime数据集上,VANGUARD达到了94%的ROC-AUC和84%的F1,同时产生可解释的链式思维解释和异常对象的空间定位,这是以往VAD方法所不具备的能力。消融实验确认分阶段训练优于整体优化,并且结构化推理作为隐式正则化器,能产生比仅进行分类微调更为平衡的预测。零样本迁移到XD-Violence和ShanghaiTech展示了跨域泛化的能力,而无需目标域适应。
cs.CV / 3 / 2605.03053

Approaching human parity in the quality of automated organoid image segmentation

接近人类标准的自动化类器官图像分割质量
Cartwright, Chase, Guo, Gongbo, Pusuluri, Sai Teja, Mayhew, Christopher N., Hester, Mark, Castillo, Horacio E.
Abstract
Organoids are complex, three-dimensional, self-organizing cell cultures that manifest organ-like features and represent a powerful platform for studying human disease and developing treatment options. Organoid development is characterized by dynamic morphological and cellular organization, which mimics some aspects of organ development. To study these rapid changes over the course of organoid development, advanced imaging and analytical tools are critical to accurately monitor the trajectory of organoid growth and investigate disease processes. In this work, we focus on computer vision and machine learning techniques to automatically measure the size and shape of developing spheroids derived from induced pluripotent stem cells (iPSCs), which are typically the starting material for generating organoid cultures. To facilitate this task, we introduce a composite method that combines the Segment Anything Model (SAM), a general-purpose foundation model, with an existing domain-specific tool. This composite method is evaluated together with several existing tools by testing them on organoid image data and comparing with the results of manual image segmentation. We find that no single existing tool is able to segment the test images with sufficient accuracy across all test conditions, but the newly introduced composite method produces consistent and accurate results for all but a very small fraction of the most challenging images. Finally, we compare the accuracy of this method to the variability between manual segmentations by independent annotators (inter-observer variability) and find that by one measure it performs at the level of inter-observer variability and by others it performs very close to it.
Chinese Translation
类器官是复杂的三维自组织细胞培养物,展现出器官样特征,代表着研究人类疾病和开发治疗方案的强大平台。类器官的发展特点是动态形态和细胞组织,模拟器官发育的某些方面。要研究这些在类器官发育过程中迅速变化的特征,先进的成像和分析工具对于准确监测类器官生长的轨迹和调查疾病过程至关重要。在本研究中,我们重点关注计算机视觉和机器学习技术,以自动测量来源于多能干细胞(iPSCs)培养出的发育球体的大小和形状,iPSCs通常是生成类器官培养的起始材料。为方便这一任务,我们提出了一种复合方法,将通用基础模型Segment Anything Model (SAM)与现有的特定领域工具相结合。我们通过在类器官图像数据上测试该复合方法及多种现有工具,并将结果与手动图像分割的结果进行比较,从而评估该复合方法。我们发现没有任何现有工具能够在所有测试条件下以足够的准确性对测试图像进行分割,但新提出的复合方法对所有除极少数最具挑战性的图像外,均能产生一致且准确的结果。最后,我们将该方法的准确性与来自独立标注者之间的手动分割变化性(观察者间变化性)进行比较,发现该方法在某些指标上表现达到观察者间变化水平,其余指标则非常接近。
cs.CV / 4 / 2605.03059

Learning to Segment using Summary Statistics and Weak Supervision

基于汇总统计和弱监督学习的分割方法
Kulkarni, Omkar, Raff, Edward, Oates, Tim
Abstract
Medical experts often manually segment images to obtain diagnostic statistics and discard the resulting annotations. We aim to train segmentation models to alleviate this burden, but constrained to the retained summary statistics (e.g., the area of the annotated region). Empirical results suggest that statistics alone are insufficient for this task, but adding weak information in the form of a few pixels within the area of interest significantly improves performance. We use a novel loss function that combines terms for image reconstruction quality, matching to summary statistics, and overlap between the predicted foreground and the weak supervisory signal. Experiments on standard image, ultrasound (breast cancer), and Computed Tomography (CT) scan (kidney tumors) data demonstrate the utility and potential of the approach.
Chinese Translation
医学专家通常手动分割图像以获得诊断统计数据,并弃用生成的标注。我们旨在训练分割模型以减轻这一负担,但只能基于保留的汇总统计(例如,标注区域的面积)进行训练。实证结果表明,单靠统计数据不足以完成这一任务,但在关注区域内添加少量像素的弱信息显著改善了性能。我们使用一种新颖的损失函数,结合了图像重建质量、与汇总统计的匹配程度以及预测前景与弱监督信号之间的重叠程度。针对标准图像、超声(乳腺癌)和计算机断层扫描(CT,肾肿瘤)数据的实验展示了该方法的实用性和潜力。
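To make the loss structure described in the abstract above more concrete, the sketch below combines two of its three ingredients: a term that matches the predicted foreground area to a retained summary statistic, and a term that rewards foreground probability on a few weakly labeled pixels. It is an illustration under assumptions, not the authors' loss: the exact functional forms, the weights, and the function name are hypothetical, and the image-reconstruction term is left out.

```python
import math


def weak_stat_loss(probs, target_area, seed_pixels, w_area=1.0, w_seed=1.0):
    """Area-matching term plus weak-seed term over a flattened probability map.

    probs:       per-pixel foreground probabilities in [0, 1]
    target_area: retained summary statistic (annotated area, in pixels)
    seed_pixels: indices of the few pixels known to lie inside the region
    """
    n = len(probs)
    predicted_area = sum(probs)                        # soft pixel count
    area_term = ((predicted_area - target_area) / n) ** 2
    eps = 1e-7                                         # avoids log(0)
    seed_term = -sum(math.log(max(probs[i], eps)) for i in seed_pixels)
    seed_term /= max(len(seed_pixels), 1)
    return w_area * area_term + w_seed * seed_term


# Two predictions with the same area but opposite locations.
left = [0.9, 0.9, 0.1, 0.1]
right = [0.1, 0.1, 0.9, 0.9]
```

With `target_area=2` and no seed pixels, `left` and `right` score the same, which mirrors the abstract's observation that statistics alone are insufficient. Adding a single seed at index 0 separates them.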
cs.CV / 5 / 2605.03098

One Sequence to Segment Them All: Efficient Data Augmentation for CT and MRI Cross-Domain 3D Spine Segmentation

一个序列分割一切:CT和MRI跨域3D脊柱分割的高效数据增强
Molinier, Nathan, Möller, Hendrik, Dagonneau, Thomas, Curto-Vilalta, Anna, Graf, Robert, Atad, Matan, Rueckert, Daniel, Kirschke, Jan S., Cohen-Adad, Julien
Abstract
Deep learning-based medical image segmentation is increasingly used to support clinical diagnosis and develop new treatment strategies. However, model performance remains limited by the scarcity of high-quality annotated data and insufficient generalization across imaging protocols. This limitation is particularly evident in MRI and CT, where models are typically trained on a single acquisition sequence and exhibit reduced robustness when applied to unseen sequences or contrasts. Although data augmentation is widely used to improve general robustness on medical images, its impact on cross-modality generalization has not been quantitatively explored. In this work, we study a targeted set of data augmentation techniques designed to improve cross-modality transfer. We train three spine segmentation models, each on a single-modality/sequence dataset, and evaluate them across seven out-of-distribution datasets (spanning CT and MRI), reflecting a realistic single-sequence training and multi-sequence/contrast/modality deployment scenario. Our results demonstrate substantial performance gains on unseen domains (average Dice gain of 155%) while preserving in-domain accuracy (average Dice decrease of 0.008%), including effective transfer between CT and MRI. To mitigate the computational cost typically associated with strong data augmentation, we implement GPU-optimized augmentations that not only maintain training efficiency but improve it by approximately 10%. We release our approach as an open-source toolbox, enabling seamless integration into commonly used frameworks such as nnUNet and MONAI. These augmentations significantly enhance robustness to heterogeneous clinical imaging scenarios without compromising training speed.
Chinese Translation
基于深度学习的医学图像分割在临床诊断支持和新治疗策略开发中得到越来越广泛的应用。然而,模型性能受到高质量标注数据稀缺和成像协议间欠缺泛化能力的制约。这一限制在MRI和CT中尤为明显,模型通常在单一获取序列上训练,当应用于未见过的序列或对比度时显示出减少的鲁棒性。尽管数据增强被广泛应用于提高医学图像的整体鲁棒性,但其在跨模态泛化上的影响尚未得到定量探索。在本研究中,我们探讨了一组针对性的数据增强技术,旨在改善跨模态转移。我们训练了三种脊柱分割模型,每个模型在单一模态/序列数据集上进行训练,并在七个超出分布的数据集(涵盖CT和MRI)上进行评估,反映了在单一序列训练和多序列/对比度/模态部署场景中的实际应用。我们的结果显示,在未见领域上存在显著的性能提升(平均Dice系数增益为155%),同时保持了领域内的准确性(平均Dice系数减少0.008%),包括在CT和MRI之间的有效转移。为了降低通常与强数据增强相关的计算成本,我们实现了GPU优化的增强方法,保持并提升了训练效率约10%。我们将我们的研究成果作为开源工具包发布,以便于无缝集成到常用框架如nnUNet和MONAI中。这些增强方法在不妥协训练速度的情况下显著增强了对异质临床成像场景的鲁棒性。
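For readers less familiar with the metric quoted in the abstract above, the sketch below computes the Dice coefficient on binary masks and a relative change in percent. Reading the reported gains as relative changes is an assumption, since the abstract does not define them, and the numbers in the usage note are made up for illustration rather than taken from the paper.

```python
def dice(pred, target):
    """Dice coefficient between two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before
```

For example, `dice([1, 1, 0, 0], [1, 0, 0, 0])` is 2/3, and a hypothetical out-of-domain Dice that rises from 0.20 to 0.50 corresponds to a relative change of about 150%. Large relative gains of this kind are possible when the baseline on an unseen modality is close to failure.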
cs.CV / 6 / 2605.03144

NucEval: A Robust Evaluation Framework for Nuclear Instance Segmentation

NucEval: 一种用于核实例分割的稳健评估框架
Mahbod, Amirreza, Woitek, Ramona, Shen, Jeanne
Abstract
In computational pathology, nuclear instance segmentation is a fundamental task with many downstream clinical applications. With the advent of deep learning, many approaches, including convolutional neural networks (CNNs) and vision transformers (ViTs), have been proposed for this task, along with both machine learning-based and non-machine learning-based pre- and post-processing techniques to further boost performance. However, one fundamental aspect that has received less attention is the evaluation pipeline. In this study, we identify four key issues associated with nuclear instance segmentation evaluation and propose corresponding solutions. Our proposed modifications, namely handling vague regions, score normalization, overlapping instances, and border uncertainty, are integrated into a unified framework called NucEval, which enables robust evaluation of nuclear instance segmentation. We evaluate this pipeline using the NuInsSeg dataset, which provides unique characteristics that make it particularly suitable for this study, as well as two additional external datasets, with three CNN- and ViT-based nuclear instance segmentation models, to demonstrate the impact of these modifications on instance segmentation metrics. The code, along with complete guidelines and illustrative examples, is publicly available at: https://github.com/masih4/nuc_eval.
Chinese Translation
在计算病理学中,核实例分割是一项基础任务,具有许多下游临床应用。随着深度学习的兴起,许多方法相继被提出,包括卷积神经网络(CNNs)和视觉转化器(ViTs),以及基于机器学习和非机器学习的预处理和后处理技术,以进一步提高性能。然而,一个受到较少关注的基本方面是评估流程。在本研究中,我们识别出与核实例分割评估相关的四个关键问题,并提出相应的解决方案。我们提出的修改方案,即处理模糊区域、分数归一化、重叠实例和边界不确定性,已整合到一个统一的框架中,称为NucEval,能够实现核实例分割的稳健评估。我们使用NuInsSeg数据集评估该流程,该数据集具有独特特性,使其特别适合本研究,并结合两个额外的外部数据集以及三个基于CNN和ViT的核实例分割模型,展示这些修改对实例分割指标的影响。代码及完整指南和示例已公开发布,网址为:https://github.com/masih4/nuc_eval.
cs.CV / 7 / 2605.03148

Boundary-Aware Uncertainty Quantification for Wildfire Spread Prediction

边界感知的不确定性量化用于野火传播预测
Funk, Jonas V.
Abstract
Reliable wildfire spread prediction is vital for risk-aware emergency planning, yet most deep learning models lack principled uncertainty quantification (UQ). Further, for boundary-sensitive cases like wildfire spread, evaluating models with global metrics alone is often insufficient. To shift the focus of UQ evaluation toward a more operationally relevant approach, the Fire-Centered Evaluation Region (FCER) framework is introduced as a spatially conditioned protocol to characterize UQ within critical fire zones. Using FCER, an Ensemble is compared against a distilled single-pass student model on the WildfireSpreadTS dataset. The student model demonstrates comparable calibration and complementary uncertainty ranking in boundary-relevant regimes. Code is available at https://github.com/jonasvilhofunk/WildfireUQ-FCER
Chinese Translation
可靠的野火传播预测对风险意识的应急规划至关重要,然而大多数深度学习模型缺乏原则性的不确定性量化 (UQ)。此外,对于像野火传播这样的边界敏感案例,仅使用全局指标评估模型往往是不够的。为了将 UQ 评估的重点转向更具操作相关性的方法,本文提出了火源中心评估区域 (FCER) 框架,作为一种空间条件协议,用于描述关键火区的 UQ。通过 FCER,在 WildfireSpreadTS 数据集上将一个集成模型与经过蒸馏的单次前向学生模型进行比较。结果表明,学生模型在边界相关的情境中展示了可比的校准和互补的不确定性排序。代码可在 https://github.com/jonasvilhofunk/WildfireUQ-FCER 获取。
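The motivation for a fire-centered evaluation region can be shown with a toy grid. The sketch below is only an illustration of spatially conditioned scoring: the abstract does not define how FCER builds its region, so the square dilation around burning cells, the `radius` parameter, and the use of the Brier score are assumptions, and the function names are hypothetical.

```python
def fire_centered_region(fire_mask, radius=1):
    """Mark every cell within `radius` (Chebyshev distance) of a burning cell."""
    rows, cols = len(fire_mask), len(fire_mask[0])
    region = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if fire_mask[r][c]:
                for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        region[rr][cc] = True
    return region


def brier_score(probs, labels, region=None):
    """Mean squared error of probabilities, optionally restricted to a region."""
    errors = [
        (probs[r][c] - labels[r][c]) ** 2
        for r in range(len(probs))
        for c in range(len(probs[0]))
        if region is None or region[r][c]
    ]
    if not errors:
        raise ValueError("evaluation region is empty")
    return sum(errors) / len(errors)


# Toy 5x5 example: fire at the centre today, spreading one cell east tomorrow.
today = [[0] * 5 for _ in range(5)]
today[2][2] = 1
tomorrow = [row[:] for row in today]
tomorrow[2][3] = 1
predicted = [[0.0] * 5 for _ in range(5)]
predicted[2][2], predicted[2][3] = 0.9, 0.4
```

On this grid the global Brier score is diluted by the many easy background cells, while the score restricted to `fire_centered_region(today)` is noticeably higher, which is the kind of gap a boundary-aware protocol is meant to expose.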
cs.CV / 8 / 2605.03175

DINO Soars: DINOv3 for Open-Vocabulary Semantic Segmentation of Remote Sensing Imagery

DINO的飞跃:用于遥感图像开放词汇语义分割的DINOv3
Faulkenberry, Ryan, Prasad, Saurabh
Abstract
The remote sensing (RS) domain suffers from a lack of densely labeled datasets, which are costly to obtain. Thus, models that can segment RS imagery well without supervised fine-tuning are valuable, but existing solutions fall behind supervised methods. Recently, DINOv3 surpassed SOTA RS foundation models on the GEO-bench segmentation benchmark without pre-training on RS data. Additionally, DINO.txt has enabled open vocabulary semantic segmentation (OVSS) with the DINOv3 backbone. We leverage these developments to form an OVSS model for RS imagery, free of RS-domain fine-tuning. Our model, CAFe-DINO (Cost Aggregation + Feature Upsampling with DINO) exploits the strong OVSS performance of DINOv3 for RS imagery via cost aggregation and training-free upsampling of text-image similarity scores. The robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery; we instead fine-tune our model on a RS-targeted subset of COCO-Stuff. CAFe-DINO achieves state-of-the-art performance on key RS segmentation datasets, outperforming OVSS methods fine-tuned on RS data. Our code and data are publicly available at https://github.com/rfaulk/DINO_Soars.
Chinese Translation
遥感(RS)领域缺乏密集标注的数据集,而这些数据集的获取成本高昂。因此,能够在没有监督微调的情况下良好分割RS图像的模型具有重要价值,但现有的解决方案落后于监督方法。最近,DINOv3在不对RS数据进行预训练的情况下,在GEO-bench分割基准测试中超越了现有的SOTA RS基础模型。此外,DINO.txt使得基于DINOv3骨架的开放词汇语义分割(OVSS)成为可能。我们利用这些发展形成了一种针对RS图像的OVSS模型,无需进行RS领域的微调。我们的模型CAFe-DINO(基于DINO的成本聚合 + 特征上采样)通过成本聚合和无训练的文本图像相似度评分上采样,充分发挥了DINOv3在RS图像上的强大OVSS性能。DINOv3骨架的强大潜力消除了在RS图像上进行微调的必要;我们则是在COCO-Stuff数据集中一个针对RS的子集上对我们的模型进行了微调。CAFe-DINO在关键RS分割数据集上实现了最先进的性能,超越了在RS数据上经过微调的OVSS方法。我们的代码和数据可以在 https://github.com/rfaulk/DINO_Soars 获得。
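Open-vocabulary segmentation of the kind described above typically starts from per-patch similarity between image features and text embeddings. The sketch below shows that starting point only, using cosine similarity, an argmax per patch, and nearest-neighbour upsampling of the resulting labels. It does not reproduce CAFe-DINO: the cost aggregation step is omitted, the paper upsamples similarity scores rather than labels, and the class names and feature vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def segment(patch_features, text_embeddings):
    """Assign each patch the class whose text embedding is most similar.

    patch_features:  2D grid (list of rows) of feature vectors
    text_embeddings: dict mapping a class name to its embedding vector
    """
    return [
        [max(text_embeddings, key=lambda name: cosine(patch, text_embeddings[name]))
         for patch in row]
        for row in patch_features
    ]


def upsample_nearest(label_grid, factor):
    """Training-free nearest-neighbour upsampling of a label grid."""
    out = []
    for row in label_grid:
        wide = [label for label in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```

Because the class list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes the approach open-vocabulary.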
cs.CV / 9 / 2605.03189

Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning

Sentinel2Cap:一个用于多模态遥感图像描述的人类标注基准数据集
Tosato, Lucrezia, Lombardi, Gianluca, Hansch, Ronny
Abstract
Image captioning has become an important task in computer vision, enabling models to generate natural language descriptions of visual content. While several datasets exist for natural images and high-resolution optical remote sensing imagery, the availability of captioning datasets for multimodal satellite data remains limited, particularly for SAR imagery and medium-resolution sensors. We introduce Sentinel2Cap, a human-annotated multimodal captioning dataset containing Sentinel-1 SAR and Sentinel-2 multi-spectral image patches at 10 m and 20 m spatial resolution with diverse land cover compositions. Captions are created manually and carefully validated to ensure both semantic accuracy and linguistic quality. To evaluate Sentinel2Cap, we perform zero-shot captioning using the Qwen3-VL-8B-Instruct model across three image modalities: RGB, multi-spectral, and SAR pseudo-RGB representations. Results show that RGB images achieve the highest captioning performance, while SAR images remain more challenging for vision-language models. Providing modality-specific contextual prompts consistently improves performance across all metrics. These findings highlight both the challenges of multimodal remote sensing image captioning and the potential value of human-annotated datasets for advancing research in cross-modal scene understanding. All the material is publicly available.
Chinese Translation
图像描述已成为计算机视觉中的一项重要任务,能够使模型生成视觉内容的自然语言描述。虽然现有多个用于自然图像和高分辨率光学遥感图像的数据集,但针对多模态卫星数据的描述数据集仍然有限,特别是对于合成孔径雷达(SAR)图像和中等分辨率传感器。我们引入了Sentinel2Cap,这是一个人类标注的多模态描述数据集,包含分辨率为10米和20米的Sentinel-1 SAR和Sentinel-2多光谱图像块,具有多样的土地覆盖组成。字幕由人工创建并经过仔细验证,以确保语义准确性和语言质量。为了评估Sentinel2Cap,我们使用Qwen3-VL-8B-Instruct模型进行零样本描述,涉及三种图像模态:RGB、多光谱和SAR伪RGB表现。结果显示,RGB图像的描述性能最高,而SAR图像对视觉-语言模型仍具有更大的挑战。提供特定模态的上下文提示在所有指标上均能显著提高性能。这些发现突显了多模态遥感图像描述的挑战以及人类标注数据集在推动跨模态场景理解研究中的潜在价值。所有材料均可公开获取。
cs.CV / 10 / 2605.03221

Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions

针对长尾医学图像分类的合成数据生成:皮肤病变的案例研究
Jiang, Jiaxiang, Subedar, Mahesh, Tickoo, Omesh
Abstract
Long-tailed class distributions are pervasive in multi-class medical datasets and pose significant challenges for deep learning models, which typically underperform on tail classes with limited samples. This limitation is particularly problematic in medical applications, where rare classes often correspond to severe or high-risk diseases and therefore require high diagnostic accuracy. Existing solutions, including specialized architectures, rebalanced loss functions, and handcrafted data augmentation, offer only marginal improvements and struggle to scale due to their limited and largely deterministic variability. To address these challenges, we introduce a diffusion-model-driven synthetic data augmentation pipeline tailored for medical long-tailed classification. Our approach features a novel inpainting diffusion model combined with an Out-of-Distribution (OOD) post-selection mechanism to ensure diverse, realistic, and clinically meaningful synthetic samples. Evaluated on the ISIC2019 skin lesion classification dataset, one of the largest and most imbalanced medical imaging benchmarks, our method yields substantial improvements in overall performance, with particularly pronounced gains on tail classes, including an improvement of more than 28% on the class with the fewest samples. These results demonstrate the effectiveness of diffusion-based augmentation in mitigating long-tail imbalance and enhancing medical classification robustness.
Chinese Translation
长尾类别分布广泛存在于多类别医学数据集中,给深度学习模型带来了重大挑战,因为这些模型通常在样本有限的尾部类别上表现不佳。这个限制在医学应用中尤为突出,因为稀有类别往往对应于严重或高风险的疾病,因此需要高诊断准确性。现有的解决方案,包括专业架构、重平衡损失函数和手工制作的数据增强,提供的改善仅限于边际,并且由于其有限且主要是确定性的可变性,难以扩展。为了解决这些挑战,我们引入了一种基于扩散模型的合成数据增强管道,专门针对医学长尾分类。我们的方法采用一种新颖的图像修复扩散模型,并结合一种分布外(Out-of-Distribution, OOD)后期选择机制,以确保生成多样化、真实和临床上有意义的合成样本。在ISIC2019皮肤病变分类数据集上进行评估,该数据集是最大的和最不平衡的医学影像基准之一,我们的方法在整体性能上取得了显著改善,尤其是在尾部类别上,样本最少的类别的改善超过28%。这些结果证明了基于扩散的增强在减轻长尾不平衡和增强医学分类鲁棒性方面的有效性。
cs.CV / 11 / 2605.03259

CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

CropVLM:一种适应农业领域的开放集作物分析视觉语言模型
Boudiaf, Abderrahmene, Javed, Sajd
Abstract
High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a "phenotyping bottleneck," where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates CropVLM to enable the detection of novel crops solely from natural language descriptions without retraining. By eliminating the reliance on species-specific training data, CropVLM provides a scalable solution for high-throughput phenotyping, accelerating genetic gain and facilitating large-scale biodiversity research essential for sustainable agriculture. The trained model weights and complete pipeline implementation are publicly available at: https://github.com/boudiafA/CropVLM. In comprehensive evaluations, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. Our detection pipeline demonstrates superior zero-shot generalization to novel species, achieving 49.17 AP50 on our CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, compared to 34.89 and 48.58 for the next-best method, respectively.
Chinese Translation
高通量植物表型分析,即对可观察植物性状的定量测量,对于现代育种至关重要,但仍受到“表型瓶颈”的限制,这个瓶颈导致手动数据收集劳动强度大且易受观察者偏差影响。传统的闭集计算机视觉系统未能解决这一挑战,因为它们需要大量特定物种的标注,且缺乏处理多样化育种种群的灵活性。为了解决这一问题,我们提出了CropVLM,这是一种通过领域特定语义对齐(Domain-Specific Semantic Alignment, DSSA)适应农业领域的视觉语言模型(Vision-Language Model, VLM)。CropVLM在52,987对人工选择的图像-标题对上进行训练,覆盖了自然田野条件下的37种植物,有效将农业术语映射到精细的视觉特征上。我们进一步提出了混合开放集定位网络(Hybrid Open-Set Localization Network, HOS-Net),这是一种整合CropVLM的架构,能够仅通过自然语言描述检测新作物,而无需重新训练。通过消除对特定物种训练数据的依赖,CropVLM为高通量表型分析提供了一种可扩展的解决方案,加速遗传增益,并促进对可持续农业至关重要的大规模生物多样性研究。训练后的模型权重和完整的管道实现已公开,网址为:https://github.com/boudiafA/CropVLM 。在全面评估中,CropVLM达到了72.51%的零样本分类准确率,优于七个CLIP风格基线模型。我们的检测管道在对新物种的零样本泛化能力上表现出色,在我们的CVTCropDet基准上实现了49.17 AP50,对热带水果物种则达到了50.73 AP50,而下一个最佳方法分别为34.89和48.58。
cs.CV / 12 / 2605.03276

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

VEBench:对大型多模态模型在现实视频编辑中的基准测试
Deng, Andong, Du, Dawei, Chen, Zhenfang, Zhong, Wen, Chen, Fan, Chen, Guang, Kuo, Chia-Wen, Wen, Longyin, Chen, Chen, Zhu, Sijie
Abstract
Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.
Chinese Translation
现实世界的视频编辑不仅需要对电影技术的专业知识,还需要多模态推理能力,以选择、对齐和将素材组合成连贯的叙事。尽管最近的大型多模态模型(Large Multimodal Models, LMMs)在一般视频理解方面取得了显著进展,但它们在多视频推理和操作编辑工作流程中的能力仍然未得到充分探索。我们提出了VEBENCH,这是第一个旨在评估现实视频编辑场景中编辑知识理解和操作推理的综合基准。VEBENCH包含3900个高质量的编辑视频(超过257小时)和3080对经过人工验证的问题与答案(QA)对,这些数据通过三轮人机协作注释流程构建,确保了精确的时间标注和语义一致性。它具有两个互补的QA任务:1)视频编辑技术识别,评估模型利用多模态线索识别7种编辑技术的能力;2)视频编辑操作模拟,通过要求从多个候选素材中选择和定位相关剪辑来模拟现实世界的编辑工作流。在专有(例如,Gemini-2.5-Pro)和开源LMM中进行的广泛实验揭示了当前模型性能与人类级别编辑认知之间的巨大差距。这些结果突显了在视频理解与创造性操作推理之间架起桥梁的紧迫需求。我们设想VEBENCH作为推动智能视频编辑系统进步和未来复杂推理研究的基础。
cs.CV / 13 / 2605.03294

FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection

FACTOR:面向开放词汇目标检测的反事实免训练测试时适应方法
Zhao, Kaixiang, Ye, Mao, Zhou, Lihua, Wang, Hu, Ji, Luping, Tang, Song, Zhu, Xiatian
Abstract
Open-vocabulary object detection often fails under distribution shifts, as it can be misled by spurious correlations between non-causal visual attributes (e.g., brightness, texture) and object categories. Existing test-time adaptation (TTA) methods either depend on costly online optimization or perform global calibration, overlooking the attribute-specific nature of these failures. To address this, we propose FACTOR (counterFACtual training-free Test-time adaptation for Open-vocabulaRy object detection), a lightweight framework grounded in counterfactual reasoning. By perturbing test images along non-causal attributes and comparing region-level predictions between original and counterfactual views, FACTOR quantifies attribute sensitivity, semantic relevance, and prediction variation to selectively suppress attribute-dependent predictions, without any parameter updates. Experiments on PASCAL-C, COCO-C, and FoggyCityscapes show that FACTOR consistently outperforms prior TTA methods, demonstrating that explicit counterfactual reasoning effectively improves robustness under distribution shifts.
Chinese Translation
开放词汇目标检测在分布变化下往往表现不佳,因为它容易受到非因果视觉属性(例如亮度、纹理)与对象类别之间虚假相关性的误导。现有的测试时适应(TTA)方法要么依赖于昂贵的在线优化,要么进行全局校准,忽视了这些失效的属性特定特征。为了解决这一问题,我们提出了FACTOR(面向开放词汇目标检测的反事实免训练测试时适应),这是一个基于反事实推理的轻量级框架。通过沿着非因果属性扰动测试图像并比较原始视图与反事实视图之间的区域级预测,FACTOR量化了属性敏感性、语义相关性和预测变化,以有选择性地抑制依赖于属性的预测,而无需更新参数。在PASCAL-C、COCO-C和FoggyCityscapes上的实验表明,FACTOR始终优于先前的TTA方法,证明了显式的反事实推理在分布变化下有效提高了鲁棒性。
cs.CV / 14 / 2605.03315

TACO: Trajectory Aligning Cross-view Optimisation

TACO:轨迹对齐跨视角优化
Shore, Tavis, Mendez, Oscar, Hadfield, Simon
Abstract
Cross-View Geo-localisation (CVGL) matches ground imagery against satellite tiles to give absolute position fixes, an alternative to GNSS where signals are occluded, jammed, or spoofed. Recent fine-grained CVGL methods regress sub-tile metric pose, but have only been evaluated as one-shot localisers, never as the primary fix in a live pipeline. Inertial sensing provides high-rate relative motion, but accumulates unbounded drift without an absolute anchor. We propose TACO, a tightly-coupled IMU + fine-grained CVGL pipeline that consumes a single GNSS reading at start-up and thereafter operates on onboard sensing alone. A closed-form cross-track error model triggers CVGL before IMU drift exceeds the matcher's capture radius, and a forward-biased five-point multi-crop search keeps inference cost fixed at five forward passes per fix. A yaw-residual gate rejects fixes that disagree with the onboard compass, and an anisotropic body-frame noise model scales each Unscented Kalman Filter update by per-fix confidence. A factor graph with vetted loop closures provides an offline smoothed trajectory. On the KITTI raw dataset, TACO reduces median Absolute Trajectory Error (ATE) from 97.0m (IMU-only) to 16.3m, a 5.9 times reduction, at <0.1 ms per-frame fusion cost and a 5-10% camera duty cycle. Code is available: github.com/tavisshore/TACO.
Chinese Translation
跨视角地理定位(CVGL)将地面图像与卫星图像匹配,以提供绝对定位修正,这是一种在信号被遮挡、干扰或伪造时替代全球导航卫星系统(GNSS)的方法。最近的细粒度CVGL方法回归子瓦片的度量姿态,但仅被评估为单次定位器,从未在实时管道中作为主要修正。惯性传感器提供高频率的相对运动,但在没有绝对锚点的情况下会积累无界漂移。我们提出TACO,一个紧密耦合的IMU与细粒度CVGL管道,启动时消耗单一的GNSS读取,之后仅依赖机载传感器工作。一个封闭形式的横向误差模型在IMU漂移超过匹配器捕捉半径之前触发CVGL,而一个前向偏置的五点多重裁剪搜索保持每次修正的推理成本固定在五次前向传递。一个偏航残差门拒绝与机载指南针不一致的修正,而一个各向异性机体框架噪声模型则根据每次修正的置信度来缩放每次无迹卡尔曼滤波器更新。一个经过验证的循环闭合的因子图提供离线平滑轨迹。在KITTI原始数据集上,TACO将中位数绝对轨迹误差(ATE)从97.0米(仅IMU)降低到16.3米,减少了5.9倍,融合成本低于每帧0.1毫秒,并且相机占空比为5-10%。代码可在:github.com/tavisshore/TACO。
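The yaw-residual gate mentioned in the TACO abstract can be pictured with a minimal sketch. The abstract does not give the gate's threshold or exact form, so the function names and the `max_residual_rad` parameter below are hypothetical; the only non-trivial step is wrapping the heading difference so that headings on either side of ±180° compare correctly.

```python
import math


def wrap_angle(angle_rad: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle_rad + math.pi) % (2.0 * math.pi) - math.pi


def accept_fix(fix_yaw_rad: float, compass_yaw_rad: float,
               max_residual_rad: float) -> bool:
    """Accept a CVGL fix only if its yaw agrees with the compass.

    max_residual_rad is a hypothetical tuning parameter, not a value
    reported in the abstract.
    """
    residual = wrap_angle(fix_yaw_rad - compass_yaw_rad)
    return abs(residual) <= max_residual_rad
```

With a 5° tolerance, a fix at 179° and a compass reading of -179° differ by only 2° after wrapping and would pass, whereas a naive subtraction would report 358° and reject it.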
cs.CV / 15 / 2605.03317

AHPA: Adaptive Hierarchical Prior Alignment for Diffusion Transformers

AHPA:针对扩散变换器的自适应层次先验对齐
Min, Ruibin, Liu, Yexin, Pan, Aimin, Lu, Changsheng, Wu, Jiafei, Yao, Kelu, Xu, Xiaogang, Yang, Harry
Abstract
Representation alignment has recently emerged as an effective paradigm for accelerating Diffusion Transformer training. Despite their success, existing alignment methods typically impose a fixed supervision target or a fixed alignment granularity throughout the entire denoising trajectory, whether the guidance is provided by external vision encoders, internal self-representations, or VAE-derived features. We argue that such timestep-agnostic alignment is suboptimal because the useful granularity of representation supervision changes systematically with the signal-to-noise ratio. In high-noise regimes, diffusion models benefit more from coarse semantic and layout-level anchoring, whereas in low-noise regimes, the training signal should emphasize spatially detailed and structurally faithful refinement. This non-stationary alignment behavior creates a representational mismatch for static single-level supervisors. To address this issue, we propose Adaptive Hierarchical Prior Alignment (AHPA), a lightweight alignment framework that exploits the hierarchical representations naturally embedded in the frozen VAE encoder. Instead of using only a single compressed latent as the alignment target, AHPA extracts multi-level VAE features that provide complementary priors ranging from local geometry and spatial topology to coarse semantic layout. A timestep-conditioned Dynamic Router adaptively selects and weights these hierarchical priors along the denoising trajectory, thereby synchronizing the alignment granularity with the model's evolving training needs. Extensive experiments show that AHPA improves convergence and generation quality over baselines and incurs no additional inference cost while avoiding external encoder supervision during training.
Chinese Translation
表示对齐最近作为加速扩散变换器训练的有效范式而出现。尽管现有的方法取得了一定成功,但通常在整个去噪轨迹中施加固定的监督目标或固定的对齐粒度,无论指导是由外部视觉编码器、内部自表示还是基于变分自编码器(VAE)的特征提供。我们认为,这种时间步不可知的对齐是次优的,因为表示监督的有效粒度随着信噪比的变化而系统性改变。在高噪声环境中,扩散模型更受益于粗略的语义和布局级锚定,而在低噪声环境中,训练信号应强调空间细节和结构忠实的精细化。这种非平稳对齐行为为静态单级监督者创造了表示不匹配。为了解决这个问题,我们提出了自适应层次先验对齐(AHPA),这是一种轻量级对齐框架,它利用了自然嵌入在冻结的VAE编码器中的层次表示。AHPA不是仅仅使用一个压缩潜变量作为对齐目标,而是提取多层次的VAE特征,这些特征提供了从局部几何和空间拓扑到粗略语义布局的互补先验。一个基于时间步的动态路由器自适应地选择和加权这些层次先验,从而在去噪轨迹上同步对齐粒度与模型不断演变的训练需求。大量实验表明,AHPA在收敛性和生成质量方面优于基线方法,并在训练过程中避免了外部编码器的监督,且未增加额外的推理成本。
cs.CV / 16 / 2605.03337

FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles

FreeTimeGS++:动态高斯点云的秘密及其原理
Lee, Lucas Yunkyu, Kim, Soonho, Kim, Youngwook, Kim, Sangmin, Park, Jaesik
Abstract
The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate remarkable performance, the specific drivers behind such gains remain less explored, making a systematic understanding of the underlying principles challenging. In this paper, we perform a comprehensive analysis of these hidden factors to provide a clearer perspective on the 4DGS framework. We first establish a controlled baseline, FreeTimeGS_ours, by formalizing and reproducing the heuristics of the state-of-the-art FreeTimeGS. Using this framework, we dissect 4DGS along its fundamental axes and uncover key secrets, including the emergent temporal partitioning driven by Gaussian durations and the discrepancy between photometric fidelity and spatiotemporal consistency. Based on these insights, we propose FreeTimeGS++, a principled method that employs gated marginalization and neural velocity fields to achieve superior stability and robust dynamic representations. Our approach yields reproducible results with reduced run-to-run variance. We will release our implementation to provide a reliable foundation for future 4DGS research.
Chinese Translation
近年来,4D 高斯点云(4DGS)技术在动态场景重建方面取得了令人印象深刻的成果。虽然这些方法展现了显著的性能,但推动这些进步的具体因素仍然探讨较少,使得对其基本原理的系统理解面临挑战。本文对这些隐藏因素进行了全面分析,以期为 4DGS 框架提供更清晰的视角。我们首先通过形式化和重现最先进的 FreeTimeGS 的启发式算法,建立了一个受控基准模型 FreeTimeGS_ours。利用这一框架,我们沿着 4DGS 的基本轴进行拆解,揭示了若干关键秘密,包括由高斯持续时间驱动的涌现的时间划分,以及光度保真度与时空一致性之间的差异。基于这些见解,我们提出了 FreeTimeGS++,这是一种采用门控边际化和神经速度场的原则性方法,以实现更卓越的稳定性和强健的动态表示。我们的方法实现了可重复的结果,减少了运行间的变异性。我们将公开我们的实现,以为未来的 4DGS 研究提供可靠的基础。
cs.CV / 17 / 2605.03343

MedSR-Vision: Deep Learning Framework for Multi-Domain Medical Image Super-Resolution

MedSR-Vision:多领域医学图像超分辨率的深度学习框架
Gurappa, Subhash, Satharasi, Trivikram, Hariprasad, Yashas, Iyengar, Sundararaj Sitharama
Abstract
Medical image super-resolution (MedSR) is essential for improving diagnostic precision across diverse imaging modalities such as MRI, CT, X-ray, Ultrasound, and Fundus imaging. Despite rapid advances in deep learning, challenges remain in preserving anatomical accuracy, maintaining perceptual quality, and generalizing across medical domains. This paper presents MedSR-Vision, a novel unified deep learning framework for evaluating and comparing super-resolution models across five modalities: Brain MRI, Chest X-ray, Renal Ultrasound, Nephrolithiasis CT, and Spine MRI, at magnification scales of $\times2$, $\times3$, and $\times4$. Three representative models, namely SRCNN, SwinIR, and Real-ESRGAN, are benchmarked using multiple quantitative metrics covering fidelity, perceptual realism, and sharpness. Experimental analysis demonstrates that Real-ESRGAN achieves superior perceptual quality and edge recovery at higher scales, SwinIR excels in preserving structural and diagnostic features, and SRCNN provides efficient and stable performance at lower magnifications. The results establish domain-specific insights and practical guidelines for model selection in clinical imaging workflows, offering a standardized evaluation framework for future medical image super-resolution research and deployment.
Chinese Translation
医学图像超分辨率(MedSR)对提高不同成像模式(如MRI、CT、X光、超声波和眼底成像)的诊断准确性至关重要。尽管深度学习迅速发展,但在保持解剖准确性、维持感知质量以及跨医学领域的泛化能力方面仍面临挑战。本文提出了MedSR-Vision,一个新颖的统一深度学习框架,用于评估和比较五种模式下的超分辨率模型:脑MRI、胸部X光、肾脏超声、肾结石CT和脊柱MRI,放大倍数为$×2$、$×3$和$×4$。通过多个定量指标(包括保真度、感知真实感和清晰度)对三种代表性模型(即SRCNN、SwinIR和Real-ESRGAN)进行了基准测试。实验分析表明,在较高倍数下,Real-ESRGAN在感知质量和边缘恢复方面表现优越;SwinIR在保持结构和诊断特征方面表现突出;而SRCNN在较低放大倍数下则提供了高效且稳定的性能。结果为临床影像工作流中的模型选择提供领域特定的见解和实用指南,为未来医学图像超分辨率研究和部署提供了标准化的评估框架。
cs.CV / 18 / 2605.03351

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

通过 FrameMogging 训练无关的反重计算实现视频视觉语言模型的 VLMaxxing
Bastien, JF, D'Amico, Sam
Abstract
Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated-question schedules hold through 50 turns, while dense-answer-anchored prompt variation separates conservative fixed K=1 repair from faster aggressive policies that drift. Fresh-video pruning is smaller but real. C-VISION skips timed vision-tower work before the first answer is generated. On Gemma 4-E4B-4bit, the clean 32f short cell reaches 1.316x first-query speedup with no paired drift or parse failures on 20 items; Qwen shows the fidelity/speed boundary. Stage-share ceiling (C-CEILING) is the accounting guardrail: a component speedup becomes an end-to-end speedup only in proportion to the wall-clock share it accelerates, so C-VISION and after-ingest follow-up reuse do not multiply. Candidate C-STREAM remains a native-rate target, not a headline result here. The broader direction is VLM-native media that expose change, motion, uncertainty, object state, sensor time, and active tiles directly, so models do not have to rediscover the world from dense RGB every frame.
Chinese Translation
视频视觉语言模型(VLMs)不断为视觉状态付费,而该状态的流已经告诉我们是稳定的。工厂墙壁并没有移动,但大多数 VLM 管道仍然向模型提供密集的 RGB 帧或重新传送新前缀。我们将这种浪费研究为无训练的反重计算:当验证表明它仍然存活时重用状态,并在场景、查询或缓存拓扑要求时购买新证据。测得的最大胜利发生在摄取之后。在冻结的 Qwen2.5-VL-7B-Instruct-4bit 上,自适应同视频后续重用在 93 个查询的视频多模态推理(VideoMME)广度设置中保留了配对选择和正确性,同时将后续延迟减少了 14.90-35.92 倍。第一次查询仍然是冷的;胜利开始于后续的问题重用相同的视频状态。压力测试界定了结果:重复问题的调度在 50 个回合内持续有效,而基于密集答案的提示变体则将保守的固定 K=1 修复与更快的激进策略区分开,这些策略会漂移。新视频的修剪虽然较小但真实。C-VISION 在生成第一个答案之前跳过了定时视觉塔的工作。在 Gemma 4-E4B-4bit 上,干净的 32f 短单元在 20 项目上实现了 1.316 倍的首次查询加速,没有配对漂移或解析失败;Qwen 显示了保真度/速度的边界。阶段共享上限(C-CEILING)是会计的护栏:组件加速只有在与其加速的墙钟时间份额成比例时才会转化为端到端的加速,因此 C-VISION 和后续的摄取重用并不会相乘。候选的 C-STREAM 仍然是本地速率目标,在这里不是一个焦点结果。更广泛的方向是 VLM 原生媒体,直接暴露变化、运动、不确定性、对象状态、传感器时间和活动块,从而使模型不必每帧都从密集 RGB 中重新发现世界。
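The stage-share ceiling described in the VLMaxxing abstract follows the familiar Amdahl's-law accounting: accelerating one stage helps end to end only in proportion to that stage's share of wall-clock time. The sketch below works this out; the function name and the example shares are illustrative assumptions, not figures from the paper.

```python
def end_to_end_speedup(stage_share: float, stage_speedup: float) -> float:
    """Amdahl-style end-to-end speedup from accelerating one stage.

    stage_share   -- fraction of total wall-clock time spent in the stage (0..1)
    stage_speedup -- speedup achieved on that stage alone (> 0)
    """
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must lie in [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    remaining = (1.0 - stage_share) + stage_share / stage_speedup
    return 1.0 / remaining


def speedup_ceiling(stage_share: float) -> float:
    """Upper bound on end-to-end speedup if the stage cost dropped to zero."""
    if not 0.0 <= stage_share < 1.0:
        raise ValueError("stage_share must lie in [0, 1)")
    return 1.0 / (1.0 - stage_share)
```

For instance, a stage that takes a quarter of the wall-clock time cannot yield more than about 1.33x end to end, however fast it becomes, which is why two component speedups on different stages do not simply multiply.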
cs.CV / 19 / 2605.03352

Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology

多模态大语言模型能理解病理运动吗?对癫痫临床表现的初步研究
Zhang, Lina, Monsoor, Tonmoy, Lorasdagi, Mehmet Efe, Sinha, Prateik, Han, Chong, Li, Peizheng, Wang, Yuan, Pasqua, Jessica, McCrimmon, Colin, Mazumder, Rajarshi, Roychowdhury, Vwani
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated robust capabilities in recognizing everyday human activities, yet their potential for analyzing clinically significant involuntary movements in neurological disorders remains largely unexplored. This pilot study evaluates the capability of MLLMs for automated recognition of pathological movements in seizure videos. We assessed the zero-shot performance of state-of-the-art MLLMs on 20 ILAE-defined semiological features across 90 clinical seizure recordings. MLLMs outperformed fine-tuned Convolutional Neural Network (CNN) and Vision Transformer (ViT) baseline models on 13 of 18 features without task-specific training, demonstrating particular strength in recognizing salient postural and contextual features while struggling with subtle, high-frequency movements. Feature-targeted signal enhancement (facial cropping, pose estimation, audio denoising) improved performance on 10 of 20 features. Expert evaluation showed that 94.3 percent of MLLM-generated explanations for correctly predicted cases achieved at least 60 percent faithfulness scores, aligning with epileptologist reasoning. These findings demonstrate the potential of adapting general-purpose MLLMs for specialized clinical video analysis through targeted preprocessing strategies, offering a path toward interpretable, efficient diagnostic assistance. Our code is publicly available at https://github.com/LinaZhangUCLA/PathMotionMLLM.
Chinese Translation
多模态大语言模型(MLLMs)在识别日常人类活动方面表现出了强大的能力,但它们在分析神经疾病中临床重要的非自主运动方面的潜力仍然在很大程度上未被探索。本初步研究评估了MLLMs在癫痫视频中自动识别病理运动的能力。我们评估了最先进的MLLMs在90个临床癫痫录音中对20个国际抗癫痫联盟(ILAE)定义的临床表现特征的零样本性能。MLLMs在不进行特定任务训练的情况下,在18个特征中的13个上超越了微调的卷积神经网络(CNN)和视觉变换器(ViT)基线模型,特别擅长于识别显著的姿态和上下文特征,但在微妙的高频运动方面表现不佳。针对特征的信号增强(面部裁剪、姿态估计、音频去噪)提高了20个特征中10个的性能。专家评估显示,94.3%的MLLM生成的正确预测案例的解释达到了至少60%的忠实度得分,与癫痫专家的推理一致。这些发现表明,通过目标导向的预处理策略,通用MLLMs的适应性可以用于专业临床视频分析,提供了通向可解释且高效的诊断辅助的路径。我们的代码已公开,访问链接为 https://github.com/LinaZhangUCLA/PathMotionMLLM。
cs.CV / 20 / 2605.03358

Tracing Like a Clinician: Anatomy-Guided Spatial Priors for Cephalometric Landmark Detection

如同临床医生般追踪:基于解剖学的空间先验在头影测量标志检测中的应用
Mohapatra, Sidhartha, Mohanty, Pallavi
Abstract
When orthodontists trace cephalometric radiographs, they follow a structured workflow: identify the soft tissue profile, partition the skull into anatomical regions, trace contours, and locate landmarks using geometric definitions -- yet no automated system replicates this reasoning. We present a five-phase anatomy-guided initialization pipeline that translates this clinical workflow into computational operations, producing confidence-weighted spatial attention priors for a downstream HRNet-W32 detector. On 1,502 radiographs from three sources spanning 7+ imaging devices, the system achieves 1.04 mm mean radial error on 25 landmarks -- surpassing prior state-of-the-art (1.23 mm on 19 landmarks) by 15.4%, with twelve landmarks below 1 mm. A three-way controlled ablation reveals two striking findings. First, removing anatomical priors does not merely slow convergence -- it destroys generalization: both models converge to ~1.03 mm on validation, but diverge to 1.94 vs. 1.04 mm on the test set. Second, replacing anatomical priors with random-position Gaussians produces even worse generalization (2.24 mm), confirming that the improvement derives from anatomically correct positioning, not additional input channels. Clinical domain knowledge encoded as spatial priors provides an inductive bias that architecture and data augmentation alone do not provide.
Chinese Translation
当正畸医生追踪头影X光片时,他们遵循一个结构化的工作流程:识别软组织轮廓,将颅骨划分为解剖区域,描绘轮廓,并使用几何定义定位标志点——然而没有任何自动化系统能够复制这种推理。我们提出了一种五阶段的解剖学引导初始化管道,将这一临床工作流程转化为计算操作,从而为下游HRNet-W32检测器生成具有置信度加权的空间注意力先验。在来自三个来源的1,502张X光片中,涵盖7种以上成像设备,系统在25个标志点上达成了1.04毫米的平均半径误差——超越了先前的最佳水平(19个标志点的1.23毫米)15.4%,且有十二个标志点的误差低于1毫米。三方控制消融实验揭示了两个显著发现。首先,去除解剖学先验不仅仅减缓了收敛过程——它破坏了泛化能力:两个模型在验证集上的收敛结果均为约1.03毫米,但在测试集上却分别向1.94与1.04毫米发散。其次,用随机位置的高斯分布替代解剖学先验导致了更糟的泛化能力(2.24毫米),确认了改进来自于解剖位置的准确性,而非额外输入通道。编码为空间先验的临床领域知识提供了一种归纳偏差,这是单靠模型架构和数据增强无法实现的。
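For readers unfamiliar with the metric, mean radial error is the average Euclidean distance between predicted and reference landmark positions. The sketch below assumes coordinates are already in millimetres and uses hypothetical function names; it also checks the relative improvement quoted in the abstract (1.23 mm to 1.04 mm).

```python
import math


def mean_radial_error(pred_mm, true_mm) -> float:
    """Average Euclidean distance between paired landmark coordinates (in mm)."""
    if len(pred_mm) != len(true_mm) or not pred_mm:
        raise ValueError("need two non-empty landmark lists of equal length")
    total = sum(math.dist(p, t) for p, t in zip(pred_mm, true_mm))
    return total / len(pred_mm)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Relative improvement implied by the abstract's figures: about 15.4%.
improvement = relative_reduction(1.23, 1.04)
```

Note that the two figures are measured on different landmark sets (25 versus 19 landmarks), so the percentage is a headline comparison rather than a like-for-like one.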
cs.CV / 21 / 2605.03359

Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

Mix3R:混合前馈重建与生成性3D先验以实现联合多视图对齐的3D重建和位姿估计
Lin, Siyou, Xue, Zhou, Zhang, Hongwen, An, Liang, Li, Dongping, Jiao, Shaohui, Liu, Yebin
Abstract
Recent trends in sparse-view 3D reconstruction have taken two different paths: feed-forward reconstruction that predicts pixel-aligned point maps without a complete geometry, and generative 3D reconstruction that generates complete geometry but often with poor input-alignment. We present Mix3R, a novel generative 3D reconstruction method which mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike pure generative methods, our first-stage generation jointly produces a coarse 3D structure (sparse voxels), per-view point maps and camera parameters aligned to that 3D structure. This is made possible by introducing a Mixture-of-Transformers architecture that inserts global self-attention into a feed-forward reconstruction model and a 3D generative model, both pretrained on large-scale data. This design effectively retains the pretrained priors but enables better 2D-3D alignment. Based on the initial aligned generations of sparse 3D voxels and point maps, we compute an overlap-based attention bias that is directly added to another pretrained textured geometry generation model, enabling it to correctly place input textures onto generated shapes in a training-free manner. Our design brings mutual benefits to both feed-forward reconstruction and 3D generation: the feed-forward branch learns to ground its predictions to a generative 3D prior, and conversely, the 3D generation branch is conditioned on geometrically informative features from the feed-forward branch. As a result, our method produces 3D shapes with better input alignment compared with pure 3D generative methods, together with camera pose estimations more accurate than previous feed-forward reconstruction methods. Our project page is at https://jsnln.github.io/mix3r/
Chinese Translation
近年来,稀疏视图3D重建的趋势走上了两条不同的路径:前馈重建预测像素对齐的点图而不需要完整几何体,生成性3D重建生成完整几何体但往往存在输入对齐不佳的问题。我们提出了Mix3R,一种新颖的生成性3D重建方法,能够以对齐的方式将前馈重建与3D生成混合到单一框架中。Mix3R在两个阶段生成3D形状:稀疏体素生成阶段和带纹理几何体生成阶段。与纯生成方法不同,我们的第一阶段生成共同产生一个粗略的3D结构(稀疏体素)、每视图点图以及与该3D结构对齐的相机参数。这一设计得益于引入了一种Mixture-of-Transformers架构,它将全局自注意力引入到一个前馈重建模型和一个在大规模数据上预训练的3D生成模型中。此设计有效地保留了预训练的先验知识,但提升了2D-3D对齐的质量。基于稀疏3D体素和点图的初始对齐生成结果,我们计算了一个基于重叠的注意力偏置,直接添加到另一个预训练的带纹理几何体生成模型上,使其能够在无训练方式下正确地将输入纹理放置到生成的形状上。我们的方法为前馈重建和3D生成都带来了互惠的益处:前馈分支学会将其预测与生成性的3D先验相结合,反之,3D生成分支则依赖于来自前馈分支的几何信息特征。因此,与纯3D生成方法相比,我们的方法生成的3D形状具有更好的输入对齐,同时相机位姿估计的准确性也优于以往的前馈重建方法。我们的项目页面在 https://jsnln.github.io/mix3r/
cs.CV / 22 / 2605.03364

Dynamic Distillation and Gradient Consistency for Robust Long-Tailed Incremental Learning

动态蒸馏和梯度一致性用于鲁棒的长尾增量学习
Sakai, Taigo, Hotta, Kazuhiro
Abstract
The task of Long-tailed Class Incremental Learning (LT-CIL) addresses the sequential learning of new classes from datasets with imbalanced class distributions. This scenario intensifies the fundamental problem of catastrophic forgetting, inherent to continual learning, with the dual challenges of under-learning minority classes and overfitting majority classes. To tackle these combined issues, this paper proposes two main techniques. First, we introduce gradient consistency regularization, which leverages the moving average of gradients to suppress abrupt fluctuations and stabilize the training process. Second, we dynamically adjust the weight of the distillation loss by measuring the degree of class imbalance with normalized entropy. This adaptive weighting establishes an optimal balance between retaining old knowledge and acquiring new information. Experiments on the CIFAR-100-LT, ImageNetSubset-LT, and Food101-LT benchmarks show that our method achieves consistent accuracy improvements of up to 5.0%. Furthermore, we demonstrate dramatic gains in the challenging 'In-ordered' setting, where tasks progress from majority to minority classes, highlighting our method's robustness in mitigating forgetting under unfavorable learning dynamics. This enhanced performance is achieved without a significant increase in computational overhead, demonstrating the practicality of our framework.
Chinese Translation
长尾类增量学习(LT-CIL)的任务解决了从具有不平衡类别分布的数据集中顺序学习新类别的问题。此场景加剧了持续学习中固有的灾难性遗忘问题,面临少数类学习不足和多数类过拟合的双重挑战。为了应对这些组合问题,本文提出了两项主要技术。首先,我们引入了梯度一致性正则化,利用梯度的移动平均来抑制突变波动,稳定训练过程。其次,我们通过测量归一化熵来动态调整蒸馏损失的权重,以反映类别不平衡的程度。这种自适应加权在保留旧知识与获取新信息之间建立了最佳平衡。在CIFAR-100-LT、ImageNetSubset-LT和Food101-LT基准上的实验表明,我们的方法实现了高达5.0%的一致性准确性提升。此外,我们在具有挑战性的'按顺序'设置中展示了显著的收益,其中任务从多数类进展到少数类,突显了我们方法在不利学习动态下减轻遗忘的鲁棒性。这一增强的性能在不显著增加计算开销的情况下实现,证明了我们框架的实用性。
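Normalized entropy, which the LT-CIL abstract uses as its imbalance measure, can be computed as below. The abstract does not say how the entropy is mapped to a distillation weight, so `distillation_weight` and its `w_min`/`w_max` bounds are purely hypothetical, including the choice that stronger imbalance yields a larger weight.

```python
import math


def normalized_entropy(class_counts) -> float:
    """Shannon entropy of the class distribution divided by log(K).

    Returns 1.0 for a perfectly balanced task and 0.0 when one class
    holds every sample.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    if num_classes < 2 or total <= 0:
        raise ValueError("need at least two classes and one sample")
    entropy = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(num_classes)


def distillation_weight(class_counts, w_min=0.5, w_max=2.0) -> float:
    """Hypothetical mapping: interpolate between w_min (balanced) and w_max (skewed)."""
    imbalance = 1.0 - normalized_entropy(class_counts)
    return w_min + (w_max - w_min) * imbalance
```

Dividing by log(K) makes the score comparable across incremental tasks that introduce different numbers of classes.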
cs.CV / 23 / 2605.03365

Dual-Foundation Models for Unsupervised Domain Adaptation

双基础模型用于无监督领域适应
Cheon, Yerin, Balasubramanian, Aruna, Rameau, Francois
Abstract
Semantic segmentation provides pixel-level scene understanding essential for autonomous driving and fine-grained perception tasks. However, training segmentation models requires costly, labor-intensive annotations on real-world datasets. Unsupervised Domain Adaptation (UDA) addresses this by training models on labeled synthetic data and adapting them to unlabeled real images. While conceptually simple, adaptation is challenging due to the domain gap, i.e., differences in visual appearance and scene structure between synthetic and real data. Prior approaches bridge this gap through pixel-level mixing or feature-level contrastive learning. Yet, these techniques suffer from two major limitations: (1) reliance on high-confidence pseudo-labels restricts learning to a subset of the target domain, and (2) prototype-based contrastive methods initialize class prototypes from source-trained models, yielding biased and unstable anchors during adaptation. To address these issues, we propose a dual-foundation UDA framework that leverages two complementary foundation models. First, we employ the Segment Anything Model (SAM) with superpixel-guided prompting to enable learning from a broader range of target pixels beyond high-confidence predictions. Second, we incorporate DINOv3 to construct stable, domain-invariant class prototypes through its robust representation learning. Our method achieves consistent improvements of +1.3% and +1.4% mIoU over strong UDA baselines on GTA-to-Cityscapes and SYNTHIA-to-Cityscapes, respectively.
Chinese Translation
语义分割提供了像素级的场景理解,这对自动驾驶和细粒度感知任务至关重要。然而,训练分割模型需要在真实世界数据集上进行高成本、劳动密集型的注释。无监督领域适应(UDA)通过在有标签的合成数据上训练模型,并将其适应于未标记的真实图像来解决这个问题。尽管概念上简单,但由于领域间的差距——即合成数据和真实数据在视觉外观和场景结构上的差异,适应仍然是一个挑战。以往的方法通过像素级混合或特征级对比学习来桥接这个差距。然而,这些技术存在两个主要局限性:(1) 依赖高置信度伪标签使学习受到目标领域子集的限制;(2) 基于原型的对比方法从源训练模型初始化类别原型,在适应过程中产生偏置和不稳定的锚点。为了解决这些问题,我们提出了一种双基础 UDA 框架,利用两个互补的基础模型。首先,我们采用超像素引导提示的 Segment Anything Model (SAM) 以便从超出高置信度预测范围的更广泛目标像素中学习。其次,我们引入 DINOv3,通过其稳健的表征学习构建稳定的、领域不变的类别原型。我们的方法在 GTA 到 Cityscapes 和 SYNTHIA 到 Cityscapes 的强 UDA 基线基础上,分别取得了 +1.3% 和 +1.4% 的 mIoU 的一致性提升。
cs.CV / 24 / 2605.03371

SoDa$^2$: Single-Stage Open-Set Domain Adaptation via Decoupled Alignment for Cross-Scene Hyperspectral Image Classification

SoDa²:通过解耦对齐实现的单阶段开放集领域适应,应用于跨场景高光谱图像分类
Liu, Yiwen, Wang, Minghua, Yao, Jing, Zhao, Xin, Vivone, Gemine
Abstract
Cross-scene hyperspectral image (HSI) classification stands as a fundamental research topic in remote sensing, with extensive applications spanning various fields. Owing to the inclusion of unknown categories in the target domain and the existence of domain shift across different scenes, open-set domain adaptation techniques are commonly employed to address cross-scene HSI classification. However, existing open-set cross-scene HSI classification methods still face two critical challenges: (1) domain shift issues arising from the direct alignment of mixed spectral-spatial features; (2) high computational costs caused by two-stage training strategies. To address these issues, this paper proposes a single-stage open-set domain adaptation method with decoupled alignment (SoDa$^2$) for cross-scene HSI classification. A contribution-aware dual-modality feature extraction is customized to disentangle the characteristics from spectral sequence signals and spatial details, selectively and adaptively enhancing discriminative features. The decoupled alignment module minimizes the Maximum Mean Discrepancy (MMD) to independently reduce the spectral discrepancy and the spatial discrepancy between the source and target domains, extracting more fine-grained domain-invariant features. A cost-effective single-stage dual-branch framework is designed to learn MMD-constrained aligned features and constraint-free intrinsic features for adaptive distinction between known and unknown classes. This framework employs a Gaussian Mixture Model to model the squared cosine similarity distribution between the two feature types, enabling open-set recognition without prior knowledge of unknown classes. Extensive experiments on three groups of HSI datasets demonstrate that SoDa$^2$ outperforms state-of-the-art methods, achieving superior classification accuracy and model transferability for open-set cross-scene tasks.
Chinese Translation
跨场景高光谱图像(HSI)分类是遥感领域的一个基础研究课题,广泛应用于多个领域。由于目标领域中包含未知类别以及不同场景之间存在领域偏移,开放集领域适应技术通常被用于解决跨场景HSI分类问题。然而,现有的开放集跨场景HSI分类方法仍然面临两个关键挑战:(1)由于混合光谱-空间特征的直接对齐而引发的领域偏移问题;(2)由双阶段训练策略引起的高计算成本。为了解决这些问题,本文提出了一种用于跨场景HSI分类的单阶段开放集领域适应方法,称为解耦对齐的SoDa²。该方法定制了一种关注贡献的双模态特征提取,旨在分离光谱序列信号和空间细节的特征,选择性且自适应地增强判别特征。解耦对齐模块最小化最大均值差异(Maximum Mean Discrepancy,MMD),以独立地降低源领域和目标领域之间的光谱差异和空间差异,从而提取出更多的细粒度领域不变特征。设计了一种成本效益高的单阶段双支路框架,用于学习受MMD约束的对齐特征和无约束的内在特征,以实现已知类别和未知类别之间的自适应区分。该框架采用高斯混合模型来建模两种特征类型之间的平方余弦相似性分布,从而在没有未知类别先验知识的情况下实现开放集识别。在三组HSI数据集上的大量实验表明,SoDa²优于最先进的方法,在开放集跨场景任务中实现了更高的分类准确率和模型可迁移性。
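To make the decoupled alignment idea concrete, the sketch below computes a (biased) squared Maximum Mean Discrepancy estimate with a Gaussian kernel and sums it separately over spectral and spatial features. The kernel choice, the bandwidth, the equal weighting of the two terms, and all function names are assumptions for illustration; the abstract specifies none of them.

```python
import math


def rbf_kernel(a, b, bandwidth: float) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


def decoupled_alignment_loss(src_spectral, tgt_spectral,
                             src_spatial, tgt_spatial) -> float:
    """Hypothetical loss: align spectral and spatial features independently."""
    return (mmd_squared(src_spectral, tgt_spectral)
            + mmd_squared(src_spatial, tgt_spatial))
```

Keeping the two MMD terms separate means a large spatial discrepancy cannot be masked by a small spectral one, which is the motivation the abstract gives for not aligning mixed spectral-spatial features directly.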
cs.CV / 25 / 2605.03390

Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

通过无训练双系统框架增强自监督人头伪造检测
Liu, Ke, Wei, Jiwei, Zhou, Shuchang, Xiao, Yutong, Chai, Ruikun, Qin, Yitong, Zhou, Yuyang, Yang, Yang
Abstract
Supervised talking head forgery detection faces severe generalization challenges due to the continuous evolution of generators. By reducing reliance on generator-specific forgery patterns, self-supervised detectors offer stronger cross-generator robustness. However, existing research has mainly focused on building stronger detectors, while the discriminative capacity of trained detectors remains insufficiently exploited. In particular, for score-based self-supervised detectors, the limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering, leaving room for further refinement. Motivated by this observation, we draw inspiration from the dual-system theory of human cognition and propose a Training-Free Dual-System (TFDS) framework to further exploit the latent discriminative capacity of existing score-based self-supervised detectors. TFDS treats anomaly-like scores as the basis of System-1, using lightweight threshold-based routing to partition samples into confident and uncertain subsets. System-2 then revisits only the uncertain subset, performing fine-grained evidence-guided reasoning to refine the relative ordering of ambiguous samples within the original score distribution. Extensive experiments demonstrate consistent improvements across datasets and perturbation settings, with the gains arising mainly from corrected ordering within the uncertain subset. These findings show that existing self-supervised talking head forgery detectors still contain underexploited discriminative cues that can be effectively unlocked through training-free dual-system reasoning.
Chinese Translation
监督的人头伪造检测由于生成器的持续演变面临严重的泛化挑战。通过减少对特定生成器伪造模式的依赖,自监督检测器提供了更强的跨生成器鲁棒性。然而,现有研究主要集中在构建更强的检测器上,而训练检测器的辨别能力仍未得到充分利用。特别是,对于基于得分的自监督检测器,针对困难案例的有限辨别能力通常体现在不可靠的异常排序上,这为进一步细化留下了空间。基于这一观察,我们从人类认知的双系统理论中汲取灵感,提出了一种无训练的双系统框架(Training-Free Dual-System,TFDS),以进一步挖掘现有基于得分的自监督检测器的潜在辨别能力。TFDS将异常得分视为系统一的基础,使用轻量级的基于阈值的路由将样本划分为可信和不确定的子集。系统二仅重新审视不确定子集,进行细粒度的证据引导推理,以细化原始得分分布中模糊样本的相对排序。大量实验表明,在不同数据集和扰动设置下均有一致的改进,这些提升主要源自于不确定子集内排序的修正。这些发现表明,现有的自监督人头伪造检测器仍然包含未被充分利用的辨别线索,可以通过无训练的双系统推理有效解锁。
cs.CV / 26 / 2605.03398

MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding

MASRA:基于MLLM的语义关系一致性对齐用于视频时序定位
Ran, Ran, Wei, Jiwei, Zhou, Shuchang, Qin, Yitong, He, Shiyuan, Ma, Zeyu, Zhou, Yuyang, Yang, Yang
Abstract
Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a textual relation matrix derived from clip-level captions and aligns it with the temporal feature similarity matrix in the model, enhancing temporal consistency while capturing local structural information. MASRA includes two simple supporting modules, semantic-guided enhancement and second-order relational attention, to better utilize the learned semantic context and relational structure. Moreover, we introduce Decoupled Alignment Interaction (DAI) with a context-aware codebook to adaptively absorb query-irrelevant semantics and alleviate the cross-modal gap. The MLLM is only invoked during training and is not used at inference. Extensive experiments show that MASRA outperforms existing methods, and ablation studies validate its effectiveness.
Chinese Translation
视频时序定位(VTG)面临跨模态语义差距,常导致背景特征与查询错误对齐,而直接将查询匹配到时刻则会导致时序语义的辨别力和一致性不足。为了解决这一问题,我们提出了基于MLLM的语义关系一致性对齐(MASRA),这是一个用于视频时序定位的训练时MLLM优化框架。MASRA在训练过程中利用MLLM生成两种形式的文本先验,即具有时间跨度的事件级描述和剪辑级标题,并实例化两种基于MLLM的对齐方式。事件语义时序对齐(ESTA)将时间上下文与事件语义对齐,以明确增强语义与时序事件之间的对应关系,提高时间跨度级的可分性。局部关系一致性对齐(LRCA)构建了一个源自剪辑级标题的文本关系矩阵,并将其与模型中的时序特征相似性矩阵进行对齐,增强了时序一致性,同时捕捉局部结构信息。MASRA包括两个简单的支持模块:语义引导增强和二阶关系注意力,以更好地利用学习到的语义上下文和关系结构。此外,我们引入了具有上下文感知代码本的解耦对齐交互(DAI),以自适应地吸收与查询无关的语义,并缓解跨模态差距。MLLM仅在训练期间调用,推理时不使用。大量实验表明,MASRA在现有方法中表现优越,消融研究验证了其有效性。
cs.CV / 27 / 2605.03403

GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

GRPO-TTA: 基于 GRPO 驱动的强化学习的视觉语言模型测试时视觉调优
Li, Yujun, Zhang, Hongyuan, Yuan, Yuan
Abstract
Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also significantly improve test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
Chinese Translation
群体相对策略优化(GRPO)最近在后训练的语言模型和视觉语言模型中显示出强大的性能。这引发了一个问题,即 GRPO 是否也显著促进了视觉语言模型的测试时适应(TTA)。在本文中,我们提出了用于测试时适应的群体相对策略优化(GRPO-TTA),通过将特定类的提示预测重新表述为一个群体优化策略问题,从而将 GRPO 适应于 TTA 情境。具体而言,我们通过从 CLIP 相似性分布中采样 top-K 类候选者构建输出组,实现了无需访问真实标签的概率驱动优化。此外,我们设计了专门针对测试时适应的奖励函数,包括对齐奖励和分散奖励,以指导有效的视觉编码器调优。广泛的实验覆盖多种基准测试表明,GRPO-TTA 在现有的测试时适应方法中始终表现优异,并在自然分布变化下实现了显著更大的性能提升。
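The group-wise formulation described in this abstract can be illustrated with a small sketch. It assumes the standard GRPO advantage (each reward minus the group mean, divided by the group standard deviation) and uses the model's own softmax confidence as a placeholder reward. The function names, the temperature of 100, and the toy similarity values are all hypothetical; the paper's alignment and dispersion rewards are not reproduced here, and top-K selection is done deterministically rather than by sampling.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_candidates(similarities, k):
    """Indices of the k classes with the highest image-text similarity."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    return order[:k]

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

# Toy CLIP-like cosine similarities for five classes (made-up numbers).
sims = [0.21, 0.35, 0.30, 0.05, 0.09]
group = top_k_candidates(sims, k=3)

# Placeholder reward (assumption): softmax confidence within the group.
rewards = softmax([100.0 * sims[i] for i in group])
advantages = group_relative_advantages(rewards)
```

No ground-truth label appears anywhere in the sketch: the advantages depend only on how candidates compare within their own group, which is the property that makes a GRPO-style objective usable at test time.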
cs.CV / 28 / 2605.03405

TsallisPGD: Adaptive Gradient Weighting for Adversarial Attacks on Semantic Segmentation

TsallisPGD:针对语义分割的自适应梯度加权对抗攻击
Matyasko, Alexander, Lou, Xin, Atmosukarto, Indriyati, Zhang, Wei
Abstract
Attacking semantic segmentation models is significantly harder than image classification models because an attacker must flip thousands of pixel predictions simultaneously. Standard pixel-wise cross-entropy (CE) is ill-suited to this setting: it tends to overemphasize already-misclassified pixels, which slows optimization and overstates model robustness. To address these issues, we introduce TsallisPGD, an adversarial attack built on the Tsallis cross-entropy, a generalization of CE parameterized by $q$, which adaptively reshapes the gradient landscape by controlling gradient concentration across pixels. By varying $q$, we steer the attack toward pixels at different confidence levels. We first show that no single fixed-$q$ is universally optimal, as its effectiveness depends on the dataset, model architecture, and perturbation budget. Motivated by this, we propose a dynamic $q$-schedule that sweeps $q$ during optimization. Extensive experiments on Cityscapes, Pascal VOC, and ADE20K show that TsallisPGD, using a single validation-selected schedule, achieves the best average attack rank across all evaluated settings and improves over CEPGD, SegPGD, CosPGD, JSPGD, and MaskedPGD in reducing accuracy and mIoU on both standard and robust models.
Chinese Translation
对语义分割模型的攻击比对图像分类模型的攻击要困难得多,因为攻击者必须同时翻转成千上万的像素预测。标准的逐像素交叉熵(CE)不适合这种情况:它往往过于强调已经分类错误的像素,这会减慢优化速度并夸大模型的鲁棒性。为了解决这些问题,我们提出了TsallisPGD,一种基于Tsallis交叉熵的对抗攻击方法,Tsallis交叉熵是CE的一个推广,参数化为 $q$,它通过控制像素之间的梯度集中度自适应地重塑梯度景观。通过改变 $q$,我们将攻击引导向不同置信水平的像素。我们首先表明,没有单一的固定-$q$是普遍最优的,因为其有效性依赖于数据集、模型架构和扰动预算。基于这一点,我们提出了一种动态 $q$-调度,在优化过程中调整 $q$。在Cityscapes、Pascal VOC和ADE20K上的广泛实验表明,使用单一验证选定的调度,TsallisPGD在所有评估设置中实现了最佳的平均攻击排名,并在降低标准模型和鲁棒模型的准确性和mIoU方面超过了CEPGD、SegPGD、CosPGD、JSPGD和MaskedPGD。
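To make the role of $q$ concrete, the sketch below uses one common definition of the Tsallis $q$-logarithm, $\ln_q(x) = (x^{1-q} - 1)/(1 - q)$, which recovers $\ln x$ as $q \to 1$. Under that assumption the per-pixel loss is $-\ln_q(p)$ for true-class probability $p$, and the magnitude of its derivative with respect to $p$ is $p^{-q}$. The paper's exact parameterization may differ, and the linear schedule shown is only a hypothetical example of sweeping $q$.

```python
import math

def q_log(x, q):
    """Tsallis q-logarithm; reduces to the natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_ce(p_true, q):
    """Per-pixel Tsallis cross-entropy for true-class probability p_true."""
    return -q_log(p_true, q)

def grad_weight(p_true, q):
    """Magnitude of d(loss)/d(p_true), which equals p_true ** (-q)."""
    return p_true ** (-q)

def linear_q_schedule(step, total_steps, q_start, q_end):
    """Hypothetical linear sweep of q across attack iterations."""
    t = step / max(total_steps - 1, 1)
    return q_start + t * (q_end - q_start)
```

With $q = 1$ (ordinary CE), a pixel whose true-class probability is already 0.01 gets a weight of about 100, while $q = 0$ weights every pixel equally. Lowering $q$ therefore shifts gradient mass away from pixels that are already misclassified, which is the behavior the abstract attributes to varying $q$.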
cs.CV / 29 / 2605.03432

MK-ResRecon: Multi-Kernel Residual Framework for Texture-Aware 3D MRI Refinement from Sparse 2D Slices

MK-ResRecon:基于多核残差框架的纹理感知3D MRI稀疏2D切片重建
Pyati, Prajyot, Sachan, Sapna, Mahto, Amulya Kumar, Phukan, Pranjal
Abstract
Magnetic Resonance Imaging (MRI) acquisition remains a time-intensive and patient-straining process, as prolonged scan durations increase the likelihood of motion artifacts, which degrade image quality and frequently require repeated scans. To address these challenges, we propose a novel framework with two models, MK-ResRecon and IdentityRefineNet3D, to reconstruct high-fidelity 3D MRI volumes from sparsely sampled 2D slices, requiring only 12.5% of the axial slices for full-resolution 3D reconstruction. MK-ResRecon predicts missing intermediate 2D slices using a multi-kernel texture-aware loss, preserving fine anatomical details. IdentityRefineNet3D refines the predicted slices and the original sparse slices as a single 3D volume to obtain a smooth anatomical structure. We train the models on a large T1-sequence post-contrast brain MRI dataset and evaluate on a large heterogeneous brain MRI cohort. The work provides an accurate, hallucination-free, generalizable, and clinically validated framework for 3D MRI reconstruction from highly sparse inputs and enables a clinically viable path towards faster and more patient-friendly MRI imaging.
Chinese Translation
磁共振成像(MRI)获取仍然是一个耗时且对患者造成压力的过程,因为扫描时间过长增加了运动伪影的可能性,这会降低图像质量,通常需要重复扫描。为了解决这些挑战,我们提出了一种新颖的框架,包含两个模型MK-ResRecon和IdentityRefineNet3D,能够从稀疏采样的2D切片重建高保真度的3D MRI体积——只需12.5%的轴向切片即可实现完整分辨率的3D重建。MK-ResRecon利用多核纹理感知损失预测缺失的中间2D切片,保留细微的解剖细节。IdentityRefineNet3D将预测的切片和原始稀疏切片作为单一3D体积进行优化,以获得平滑的解剖结构。我们在大型T1序列对比后脑MRI数据集上训练这些模型,并在一个大型异质脑MRI队列上进行评估。这项工作提供了一种准确、无虚幻、可泛化且经过临床验证的3D MRI重建框架,能够处理高度稀疏的输入,并为更快、更友好的MRI成像提供了临床可行的路径。
cs.CV / 30 / 2605.03437

Learning Discriminative Signed Distance Functions from Multi-scale Level-of-detail Features for 3D Anomaly Detection

基于多尺度细节层次特征学习区分性的有符号距离函数以实现3D异常检测
Xiao, Haibo, Liang, Hanzhe, Zhou, Jie, Wang, Jinbao, Gao, Can
Abstract
Detecting anomalies from 3D point clouds has received increasing attention in the field of computer vision, with some group-based or point-based methods achieving impressive results in recent years. However, learning accurate point-wise representations for 3D anomaly detection faces great challenges due to the large scale and sparsity of point clouds. In this study, a surface-based method is proposed for 3D anomaly detection, which learns a discriminative signed distance function using multi-scale level-of-detail features. We first present a Noisy Points Generation (NPG) module to generate different types of noise, thereby facilitating the learning of discriminative features by exposing abnormal points. Then, we introduce a Multi-scale Level-of-detail Feature (MLF) module to capture multi-scale information from a point cloud, which provides both fine-grained local and coarse-grained global feature information. Finally, we design an Implicit Surface Discrimination (ISD) module that leverages the extracted multi-scale features to learn an implicit surface representation of point clouds, which effectively trains a signed distance function to distinguish between abnormal and normal points. Experimental results demonstrate that the proposed method achieves an average object-level AUROC of 92.1% and 85.9% on the Anomaly-ShapeNet and Real3D-AD datasets, outperforming the current best approach by 2.1% and 3.6%, respectively. Codes are available at https://anonymous.4open.science/r/DLF-3AD-DA61.
Chinese Translation
从3D点云中检测异常在计算机视觉领域受到越来越多的关注,近年来一些基于组的或基于点的方法取得了令人印象深刻的结果。然而,由于点云的大规模和稀疏性,学习精确的逐点表示在3D异常检测中面临巨大挑战。本研究提出了一种基于表面的3D异常检测方法,通过多尺度细节层次特征学习区分性的有符号距离函数。我们首先提出了一个噪声点生成(NPG)模块,以生成不同类型的噪声,从而通过暴露异常点来促进区分特征的学习。然后,我们引入一个多尺度细节层次特征(MLF)模块来捕获来自点云的多尺度信息,提供精细局部和粗糙全局的特征信息。最后,我们设计了一个隐式表面区分(ISD)模块,利用提取的多尺度特征学习点云的隐式表面表示,有效训练有符号距离函数以区分异常点和正常点。实验结果表明,所提出的方法在Anomaly-ShapeNet和Real3D-AD数据集上分别达到92.1%和85.9%的平均对象级AUROC,相较于当前最佳方法分别提升了2.1%和3.6%。代码可在https://anonymous.4open.science/r/DLF-3AD-DA61获取。
cs.CV / 31 / 2605.03438

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Mantis:Mamba原生调优在3D点云基础模型中的高效性
Guo, Zihao, Zhu, Jihua, Liu, Jian, Mian, Ajmal Saeed
Abstract
Pre-trained 3D point cloud foundation models (PFMs) have demonstrated strong transferability across diverse downstream tasks. However, fully fine-tuning these models is computationally expensive and storage-intensive. Parameter-efficient fine-tuning (PEFT) offers a promising alternative, but existing PEFT approaches are primarily designed for Transformer-based backbones and rely on token-level prompting or feature transformation. Mamba-based backbones introduce a granularity mismatch between token-level adaptation and state-level sequence dynamics. Consequently, straightforward transfer of existing PEFT approaches to frozen Mamba backbones leads to substantial accuracy degradation and unstable optimization. To address this issue, we propose Mantis, the first Mamba-native PEFT framework for 3D PFMs. Specifically, a State-Aware Adapter (SAA) is introduced to inject lightweight task-conditioned control signals into selective state-space updates, enabling state-level adaptation while keeping the pre-trained backbone frozen. Moreover, different valid point cloud serializations are regularized by Dual-Serialization Consistency Distillation (DSCD), thereby reducing serialization-induced instability. Extensive experiments across multiple benchmarks demonstrate that Mantis achieves competitive performance with only about 5% trainable parameters. Our code is available at https://github.com/gzhhhhhhh/Mantis.
Chinese Translation
预训练的3D点云基础模型(PFMs)在多种下游任务中表现出强大的迁移能力。然而,对这些模型进行完全微调的计算成本高昂且存储需求大。参数高效微调(PEFT)提供了一种有前景的替代方案,但现有的PEFT方法主要设计用于基于变压器的架构,并依赖于令牌级提示或特征变换。基于Mamba的架构在令牌级适应和状态级序列动态之间引入了粒度不匹配。因此,现有PEFT方法直接转移至冻结的Mamba架构会导致显著的准确性下降和不稳定的优化。为了解决这一问题,我们提出了Mantis,这是首个针对3D PFMs的Mamba原生PEFT框架。具体而言,引入了一种状态感知适配器(SAA),将轻量级任务条件控制信号注入到选择性状态空间更新中,允许在保持预训练主干不变的情况下进行状态级适应。此外,不同有效点云序列化通过双序列一致性蒸馏(DSCD)进行正则化,从而减少序列化引起的不稳定性。在多个基准测试中进行的广泛实验表明,Mantis在仅约5%的可训练参数情况下实现了具有竞争力的性能。我们的代码可在 https://github.com/gzhhhhhhh/Mantis 获取。
cs.CV / 32 / 2605.03456

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

VL-SAM-v3:用于开放世界目标检测的记忆引导视觉先验
Liu, Chih-Chung, Lin, Zhiwei, Wang, Yongtao
Abstract
Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.
Chinese Translation
开放世界目标检测旨在定位和识别超出固定封闭标签空间的对象。它通常分为两类,即开放词汇检测(open-vocabulary detection),该类在测试时假定预定义的类别列表,以及开放式检测(open-ended detection),该类在推理过程中需要生成候选类别。现有方法主要依赖粗略的文本语义和参数化知识,通常无法为细粒度外观变化、稀有类别和杂乱场景提供足够的视觉证据。本文提出了VL-SAM-v3,一个统一框架,通过检索基础的外部视觉记忆增强开放世界检测。具体而言,一旦候选类别可用,VL-SAM-v3将从非参数记忆库中检索相关的视觉原型,并将其转换为两个互补的视觉先验,即用于实例级空间锚定的稀疏先验和用于类别感知局部上下文的密集先验。这些先验通过记忆引导提示精炼(Memory-Guided Prompt Refinement)与原始检测提示集成,形成一个共享的检索与精炼机制,支持开放词汇和开放式推理。在LVIS上的广泛零样本实验表明,VL-SAM-v3在开放词汇和开放式推理下均持续提升检测性能,尤其在稀有类别上获得显著提升。此外,使用更强的开放词汇检测器(即SAM3)的实验验证了所提出的检索与精炼机制的普适性。
cs.CV / 33 / 2605.03463

First Shape, Then Meaning: Efficient Geometry and Semantics Learning for Indoor Reconstruction

先形状,再语义:室内重建的高效几何和语义学习
Chierchia, Remi, Lebrat, Léo, Ahmedt-Aristizabal, David, Salvado, Olivier, Fookes, Clinton, Cruz, Rodrigo Santa
Abstract
Neural Surface Reconstruction has become a standard methodology for indoor 3D reconstruction, with Signed Distance Functions (SDFs) proving particularly effective for representing scene geometry. A variety of applications require a detailed understanding of the scene context, driving the need for object-level semantic signals. While recent methods successfully integrate semantic labels, they often inherit the slow training time and limited scalability of multi-SDF learning. In this paper, we introduce FSTM, a unified approach for learning geometry and semantics through a two-step process: a geometry warm-up using RGB inputs and geometric cues, followed by semantic field estimation. By first optimising geometry without semantic supervision, we observe substantial improvements compared to the standard joint optimisation. Rather than relying on specialised modules or complex multi-SDF designs, FSTM shows that a streamlined formulation is sufficient to achieve strong geometric and semantic reconstructions. Experiments on both synthetic and real-world indoor datasets show that our method outperforms multi-SDF approaches. It trains 2.3x faster on Replica, improves robustness to real-world imperfections on ScanNet++, and achieves higher recall by recovering the surfaces of more objects in the scene. The code will be made available at https://remichierchia.github.io/FSTM.
Chinese Translation
神经表面重建已成为室内3D重建的标准方法,其中有符号距离函数(Signed Distance Functions, SDFs)在表示场景几何方面表现尤为出色。许多应用需要对场景背景有详细的理解,这推动了对象级语义信号的需求。尽管最近的方法成功地融合了语义标签,但它们往往继承了多SDF学习的缓慢训练时间和有限的可扩展性。本文提出了一种统一的方法FSTM,通过两步过程学习几何和语义:首先使用RGB输入和几何线索进行几何预热,然后进行语义场估计。通过在没有语义监督的情况下先优化几何,我们观察到与标准联合优化相比,显著改善了结果。FSTM表明,简化的表达方式足以实现强大的几何和语义重建,而不依赖于专门的模块或复杂的多SDF设计。对合成和真实室内数据集的实验表明,我们的方法在性能上优于多SDF方法。在Replica数据集上训练速度提高了2.3倍,增强了对ScanNet++上真实世界缺陷的鲁棒性,并通过恢复场景中更多对象的表面实现了更高的召回率。代码将发布于https://remichierchia.github.io/FSTM。
cs.CV / 34 / 2605.03475

WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

WorldJen:一种端到端的多维基准测试用于生成视频模型
Inbasekar, Karthik, Rom, Guy, Shlomovits, Omer
Abstract
Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved, and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and independently reproduces the human-established three-tier BT rating structure. The VLM achieves a Spearman $\hat{\rho}=1.000,~p=0.0014$, which is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework.
Chinese Translation
评估生成视频模型仍然是一个未解决的问题。基于参考的指标,如结构相似性指数测量(SSIM)和峰值信噪比(PSNR),更加关注像素的忠实度而非语义的正确性,而弗雷歇视频距离(FVD)则更倾向于分布纹理而非物理合理性。基于二元视觉问答(VQA)的基准测试,比如 VBench 2.0,易受到偏向“是”的影响,并依赖于低分辨率评审者,无法识别时间上的失误。此外,它们的提示一次只针对一个维度,这使得所需视频数量呈倍增,同时仍无法保证结果的可靠性。WorldJen 直接解决了这些局限性。二元 VQA 被替换为利克特量表问卷,由接受原始视频分辨率帧的视觉语言模型(VLM)进行评估。视频生成成本通过使用对抗性策划的提示来解决,这些提示设计用于同时考察多达 16 个质量维度。该框架围绕两个互为支撑的贡献构建。首先,进行了一项盲评人类偏好研究,在 50 个策划提示 $\times$ 6 种最先进的视频模型上,以 100% 的配对覆盖率累计了来自 7 位注释者的 2,696 条成对注释。实现了 66.9% 的平均注释者间一致性,该研究还建立了具有三层结构的人类基准布拉德利-特里(BT)评级。其次,使用针对特定提示和特定维度的利克特问卷(每个维度 10 道问题,共 47,160 条评分反馈)作为评判的 VLM 评估引擎对视频进行评判,并独立再现人类建立的三层 BT 评级结构。该 VLM 实现了斯皮尔曼系数 $\hat{\rho}=1.000,~p=0.0014$,被解读为与人类结果的层级一致。六项聚焦的消融研究验证了 VLM 评估框架的稳健性。
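For readers unfamiliar with Bradley-Terry ratings, the sketch below fits BT strengths from a matrix of pairwise win counts using the classic minorization-maximization update. It is a generic illustration, not the WorldJen pipeline: the function names and the three-model win matrix are invented for the example, and tie handling and tier assignment are left out.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each comparison with j contributes n_ij / (p_i + p_j).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p

def win_probability(p, i, j):
    """Modelled probability that item i is preferred over item j."""
    return p[i] / (p[i] + p[j])

# Toy data: three hypothetical models, ten comparisons per pair.
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(toy_wins)
```

Because every pair in the toy data is compared equally often, the fitted ordering matches the raw win totals. The model becomes more useful when coverage is uneven, although the study above reports 100% pair coverage.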
cs.CV / 35 / 2605.03485

MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

MHPR:大型视觉语言模型的多维人类感知与推理基准
Wang, Kangkang, Jiang, Qinting, Zhang, Wanping, Ren, Bowen, Wen, Shengzhao
Abstract
Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design, consisting of Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D), together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). Our findings show that: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further enhances perception and reasoning on difficult instances; and 3) training Qwen2.5-VL-7B with MHPR yields significant gains, achieving near-parity with considerably larger models. We release ACVG and MHPR to facilitate reproducible, extensible research on human-centric perception and reasoning.
Chinese Translation
多维人类理解对于电影分析和虚拟数字人等现实世界应用至关重要,但当前的LVLM基准主要集中在单一任务设置上,缺乏精细化的以人为中心的评估。在本研究中,我们引入了MHPR,一个全面的基准,旨在对涵盖个体、多个人和人-物互动维度的人类中心场景进行联合感知与推理。MHPR包括多层次的数据设计——带标题的原始数据(Captioned Raw Data, C-RD)、监督微调数据(Supervised Fine-Tuning Data, SFT-D)、强化学习数据(Reinforcement Learning Data, RL-D)和测试数据(Test Data, T-D)——以及一个自动化的标题/视觉问答生成管道(Automated Caption/VQA Generation, ACVG),该管道执行按类别的属性分解、特定属性的重写和多模型投票,以确保高质量、可扩展的注释。我们评估了最先进的视觉语言模型在细粒度属性(外观、服装、姿势、部件)和高层语义(社会关系、动作语义、空间关系、意图和功能)上的表现。我们的研究结果表明:1)格式对齐的SFT数据显著提高了指令跟随能力和稳定性;2)基于不良案例分析生成的以挑战为中心的RL数据进一步增强了对困难实例的感知和推理;3)使用MHPR训练的Qwen2.5-VL-7B模型取得了显著进展,达到与大得多的模型接近的性能。我们发布了ACVG和MHPR,以促进可重复的、可扩展的人类中心感知与推理研究。
cs.CV / 36 / 2605.03490

Orientation-Aware Unsupervised Domain Adaptation for Brain Tumor Classification Across Multi-Modal MRI

面向方向的无监督领域适应用于多模态MRI脑肿瘤分类
Sachan, Sapna, Mahto, Amulya Kumar, Patil, Prashant Wagambar
Abstract
The clinical integration of deep learning models for brain tumor diagnosis in neuro-oncology is severely constrained by limited expert-annotated MRI data and substantial inter-institutional domain shift arising from variations in scanners, imaging protocols, and contrast settings. These challenges significantly impair model generalization in real-world settings. To address this, we propose a novel orientation-aware unsupervised domain-adaptive framework for automated brain tumor classification using mixed 2D MRI slices. First, a CNN with a large receptive field categorizes input slices into axial, sagittal, and coronal views. For each orientation, a CNN with a ResNet50 backbone augmented with four fully connected layers is trained to extract discriminative features for tumor classification. To mitigate annotation scarcity and domain discrepancies, we introduce a slice-wise unsupervised domain adaptation strategy that transfers knowledge from a multi-modal source domain (T1, T2, and FLAIR) to the post-contrast T1 target domain. Feature-level alignment is enforced using a maximum mean discrepancy loss, complemented by pseudo-label-guided adaptation to preserve class discriminability. Extensive experiments demonstrate improved target-domain performance over prior approaches, highlighting the benefits of orientation-specific learning, multi-modal knowledge transfer, pseudo-label-guided adaptation, and unsupervised domain adaptation.
Chinese Translation
深度学习模型在神经肿瘤学中用于脑肿瘤诊断的临床整合受到专家标注MRI数据的限制和由于扫描仪、成像协议和对比设置的变化所导致的显著跨机构领域偏移的严重制约。这些挑战显著影响了模型在实际应用中的泛化能力。为了解决这一问题,我们提出了一种新颖的面向方向的无监督领域适应框架,用于利用混合二维MRI切片进行自动化脑肿瘤分类。最初,具有大感受野的卷积神经网络(CNN)首先将输入切片分类为轴向、矢状和冠状视图。对于每个方向,我们训练了一个基于ResNet50的卷积神经网络架构,并增加了四个全连接层,以提取用于肿瘤分类的判别特征。为了缓解标注不足和领域差异,我们提出了一种切片级无监督领域适应策略,旨在将来自多模态(如T1、T2和FLAIR)源领域的知识转移到后对比T1目标领域。我们通过最大均值差异损失强制实现特征层对齐,并辅以伪标签引导的适应,以保持类别的可区分性。大量实验表明,与以前的方法相比,目标领域性能有所提升,突显了面向方向的学习、多模态知识转移、伪标签引导适应和无监督领域适应的优势。
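The feature-level alignment term mentioned in this abstract, maximum mean discrepancy (MMD), can be sketched in a few lines. The version below is the biased estimate of squared MMD with a single Gaussian (RBF) kernel; the kernel choice, the bandwidth `sigma`, and the toy 2D "features" are assumptions made for illustration and are not taken from the paper.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    k_ss = sum(rbf_kernel(a, b, sigma) for a in source for b in source)
    k_tt = sum(rbf_kernel(a, b, sigma) for a in target for b in target)
    k_st = sum(rbf_kernel(a, b, sigma) for a in source for b in target)
    n, m = len(source), len(target)
    return k_ss / (n * n) + k_tt / (m * m) - 2.0 * k_st / (n * m)

# Toy features: a source batch, a slightly shifted batch, a far batch.
src = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]
far = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
```

Minimizing such a term over the feature extractor pulls source and target feature distributions together. In a training loop it would be added to the classification loss on labeled source slices, with the pseudo-label term handled separately.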
cs.CV / 37 / 2605.03509

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

BFORE:蝴蝶-萤火虫优化的Retinex增强用于低光照图像质量提升
Cherif, Ahmed
Abstract
Low-light image enhancement is a fundamental challenge in computer vision and multimedia applications, as images captured under insufficient illumination suffer from poor visibility, low contrast, and color distortion. Existing Retinex-based methods rely on manually tuned parameters that fail to generalize across diverse lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a novel hybrid metaheuristic-optimized framework that automatically tunes the parameters of a multi-stage Retinex-based pipeline. The proposed method converts the input image to HSV color space and applies Adaptive Gamma Correction with Weighted Distribution (AGCWD) to the luminance channel, followed by adaptive denoising. A Butterfly Optimization Algorithm (BOA) optimizes the Multi-Scale Retinex with Color Restoration (MSRCR) parameters, while a Firefly Algorithm (FA) optimizes the AGCWD and denoising parameters. A hybrid BOA-FA switching strategy dynamically balances global exploration and local exploitation. Experimental evaluation on the LOL benchmark dataset (15 paired test images) demonstrates that BFORE achieves the highest PSNR (17.22 dB) among all traditional enhancement methods, with 20.3% improvement over Histogram Equalization and 17.5% over MSRCR. BFORE produces the most naturally balanced mean brightness (129.97), closest to the ideal mid-tone value. Notably, BFORE outperforms RetinexNet, a deep learning baseline, in both PSNR (17.22 vs. 16.77 dB) and SSIM (0.5417 vs. 0.4252) without requiring any training data. The hybrid BOA-FA optimization contributes a 12.3% PSNR improvement and 14.8% SSIM improvement over the unoptimized pipeline.
Chinese Translation
低光照图像增强是计算机视觉和多媒体应用中的一个基础挑战,因为在不足的光照条件下拍摄的图像往往存在可见度差、对比度低和色彩失真的问题。现有的基于Retinex的方法依赖于手动调整的参数,这些参数无法在不同的光照条件下进行泛化。本文提出了一种名为BFORE(蝴蝶-萤火虫优化的Retinex增强)的新型混合元启发式优化框架,该框架能够自动调节多阶段基于Retinex的处理流程的参数。所提方法将输入图像转换为HSV色彩空间,并在亮度通道应用加权分布自适应伽玛校正(AGCWD),随后进行自适应去噪。蝴蝶优化算法(BOA)用于优化多尺度Retinex与色彩恢复(MSRCR)参数,而萤火虫算法(FA)则优化AGCWD和去噪参数。混合BOA-FA切换策略动态平衡全局探索与局部开发。在LOL基准数据集(15对测试图像)的实验评估中,BFORE在所有传统增强方法中实现了最高的峰值信噪比(PSNR,17.22 dB),比直方图均衡法改善了20.3%,比MSRCR改善了17.5%。BFORE生成的平均亮度(129.97)最自然平衡,最接近理想的中调值。值得注意的是,BFORE在PSNR(17.22与16.77 dB)和结构相似性指数(SSIM,0.5417与0.4252)上超越了RetinexNet - 一个深度学习基线,而不需要任何训练数据。混合BOA-FA优化相较于未优化的处理流程贡献了12.3%的PSNR提升和14.8%的SSIM提升。
cs.CV / 38 / 2605.03544

DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset

DALPHIN:在开放多中心数据集上对数字病理学人工智能助手与病理学家的基准测试
Lems, Carlijn, Moonemans, Sander, Klubíčková, Natálie, Brattoli, Biagio, Lee, Taebum, Kim, Seokhwi, Vilaplana, Veronica, Pons, Laura, Hochman, Sapir, Suárez-Franck, Mauricio Eduardo, Fernandez, Pedro Luis, Drachneris, Julius, Petroska, Donatas, Augulis, Renaldas, Laurinavicius, Arvydas, Oliveira, Domingos, Montezuma, Diana, Bouwmeester, Anouk B., van Midden, Dominique, Vos, Anne-Marie, Vos, Shoko, van Ipenburg, Jolique, Balkenhol, Maschenka, Winkler, Koen, Nagtegaal, Iris, Hebeda, Konnie, Flucke, Uta, Grünberg, Katrien, Skopal, Josef, Chohan, Brinder S., Temprana-Salvador, Jordi, Munari, Enrico, Cima, Luca, Querzoli, Giulia, Belisario, Yosamin Gonzalez, Faber, Jaeike W., van Leenders, Geert J. L. H., von der Thüsen, Jan H., Brosens, Lodewijk A. A., de Krijger, Ronald R., Wesseling, Pieter, Florquin, Sandrine, Maniewski, Mateusz, Kowalewski, Adam, Barna, Robert, Tiniakos, Dina, Gros, Joan Lop, Donders, Rogier, Maurits, Jake S. F., Lu, Ming Yang, Chen, Chengkuan, Mahmood, Faisal, van der Laak, Jeroen, Khalili, Nadieh, Meeuwsen, Frédérique, Ciompi, Francesco
Abstract
Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general-purpose (GPT-5, Gemini 2.5 Pro) and one pathology-specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert-level performance in four of six tasks for PathChat, 2/6 tasks for Gemini, and 1/6 tasks for GPT. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand-challenge.org.
Chinese Translation
具有视觉问答能力的基础模型在数字病理学中逐渐出现。这种前所未有的技术需要独立的基准测试,以评估其在日常诊断中辅助病理学家的潜力。我们创建了DALPHIN,这是首个针对病理学人工智能助手的多中心开放基准,包含来自300个病例的1236幅图像,涵盖130种罕见到常见的诊断,来自6个国家和14个子专业。我们介绍了DALPHIN的设计和数据集,并提供了来自10个国家的31名病理学家的人类表现基准,这些病理学家的专业水平各不相同。我们报告了针对两种通用模型(GPT-5, Gemini 2.5 Pro)和一种病理特定助手(PathChat+)的顺序和独立答案生成的结果。在PathChat的六项任务中,有四项表现出未见统计显著差异;Gemini在六项任务中有两项表现出未见统计显著差异;而GPT则只有一项表现出未见统计显著差异。DALPHIN 已公开发布,并附有分隔的、间接可访问的真实标签,以促进稳健且持久的基准测试。数据、方法及评估平台可通过 dalphin.grand-challenge.org 访问。
cs.CV / 39 / 2605.03547

Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models

抹去特征,遗忘背景:对大规模视觉语言模型中多模态版权遗忘的基准测试
Kwon, JuneHyoung, Yun, JungMin, Kim, YoungBin
Abstract
Large Vision-Language Models (LVLMs), trained on web-scale data, risk memorizing and regenerating copyrighted visual content such as characters and logos, creating significant challenges. Machine unlearning offers a path to mitigate these risks by removing specific content post-training, but evaluating its effectiveness, especially in the complex multimodal setting of LVLMs, remains an open problem. Current evaluation methods often lack robustness or fail to capture the nuances of cross-modal concept erasure. To address this critical gap, we introduce the CoVUBench benchmark, the first framework specifically designed for evaluating copyright content unlearning in LVLMs. CoVUBench utilizes procedurally generated, legally safe synthetic data coupled with systematic visual variations spanning compositional changes and diverse domain manifestations to ensure realistic and robust evaluation of unlearning generalization. Our comprehensive multimodal evaluation protocol assesses both forgetting efficacy from the copyright holder perspective and the preservation of general model utility from the deployer viewpoint. By rigorously measuring this crucial trade-off, CoVUBench provides a standardized tool to advance the development of responsible and effective unlearning methods for LVLMs.
Chinese Translation
大规模视觉语言模型(LVLMs)在网络规模数据上训练,存在记忆和再生受版权保护的视觉内容(如角色和商标)的风险,造成重大挑战。机器遗忘提供了一种通过训练后移除特定内容来降低这些风险的方法,但评估其有效性,特别是在LVLMs复杂的多模态环境中,仍然是一个未解决的问题。目前的评估方法往往缺乏稳健性或无法捕捉跨模态概念遗忘的细微差别。为了解决这一关键缺口,我们引入了CoVUBench基准,这是第一个专门设计用于评估LVLMs中版权内容遗忘的框架。CoVUBench利用程序生成的、合法安全的合成数据,结合系统的视觉变化,包括组合变化和多样化的领域表现,确保对遗忘泛化进行现实且稳健的评估。我们的综合多模态评估协议从版权持有者的角度评估遗忘效果,从部署者的视角评估模型通用性保留。通过严格测量这一关键权衡,CoVUBench提供了一种标准化工具,以促进负责任和高效的遗忘方法在LVLMs中的发展。
cs.CV / 40 / 2605.03555

MILE: Mixture of Incremental LoRA Experts for Continual Semantic Segmentation across Domains and Modalities

MILE:跨领域和模态的连续语义分割增量LoRA专家混合
Muralidhara, Shishir, Stricker, Didier, Schuster, René
Abstract
Continual semantic segmentation requires models to adapt to new domains or modalities without sacrificing performance on previously learned tasks. Expert-based learning, in which task-specific modules specialize in different domains, has proven effective in mitigating forgetting. These methods include dynamic expansion, which suffers from scalability issues, and parameter isolation, which constrains the ability to learn new tasks. We introduce Mixture of Incremental LoRA Experts (MILE), a modular and parameter-efficient framework for continual segmentation across both domains and modalities. MILE leverages Low-Rank Adaptation (LoRA) to instantiate lightweight experts for each new task while keeping the pretrained base network frozen. Each expert is trained exclusively on its task data and thus avoids overwriting previously learned information. A prototype-guided gating mechanism dynamically selects the most appropriate expert at inference. MILE achieves the benefits of expert-based learning while overcoming its scalability limitations. It requires only a marginal parameter increase per task, and tens of LoRA adapters would be needed to match the size of a single full model, making it highly efficient in both training and storage. Across domain- and modality-incremental benchmarks, MILE achieves strong performance while ensuring better stability, plasticity, and scalability.
Chinese Translation
连续语义分割要求模型在不牺牲对先前学习任务性能的情况下,能够适应新领域或模态。基于专家的学习方法,即任务特定模块在不同领域进行专业化,已被证明能有效缓解遗忘。这些方法包括动态扩展,然而面临可扩展性问题,或者参数隔离,这限制了学习新任务的能力。我们提出了增量LoRA专家混合(MILE),这是一个用于在多个领域和模态下进行连续分割的模块化和参数高效框架。MILE利用低秩自适应(Low-Rank Adaptation, LoRA)为每个新任务实例化轻量级专家,同时保持预训练基础网络不变。每个专家仅在其任务数据上进行训练,从而避免覆盖先前学习的信息。原型引导的门控机制在推理时动态选择最合适的专家。MILE在克服了专家学习的可扩展性限制的同时,实现了专家学习的优势。每个任务只需增加少量参数,且需数十个LoRA适配器才能匹配一个完整模型的大小,从而在训练和存储上都展现出高度的效率。在跨领域和模态的增量基准测试中,MILE取得了强劲的表现,同时确保了更好的稳定性、可塑性和可扩展性。
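Two claims in this abstract lend themselves to a quick sketch: the small per-task parameter cost of LoRA, and expert selection by prototype. The code below counts LoRA parameters for one weight matrix (rank times the sum of the two dimensions) and picks an expert by cosine similarity to stored prototypes. The layer size, the rank, the task names, and the use of cosine similarity for gating are all assumptions for illustration; the abstract does not specify them.

```python
import math

def lora_param_count(d_out, d_in, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_expert(feature, prototypes):
    """Return the task whose prototype is most similar to the input feature."""
    return max(prototypes,
               key=lambda task: cosine_similarity(feature, prototypes[task]))

# Hypothetical 768 x 768 projection with a rank-8 adapter.
full_params = 768 * 768
adapter_params = lora_param_count(768, 768, rank=8)
adapters_per_full_layer = full_params // adapter_params

# Hypothetical task prototypes in a 2D feature space.
prototypes = {"rgb_city": [1.0, 0.0], "thermal": [0.0, 1.0]}
chosen = select_expert([0.9, 0.1], prototypes)
```

For this hypothetical layer, 48 rank-8 adapters fit in the budget of one full weight matrix, which is in line with the "tens of LoRA adapters" figure in the abstract. The real ratio depends on which layers are adapted and at what rank.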
cs.CV / 41 / 2605.03610

deSEO: Physics-Aware Dataset Creation for High-Resolution Satellite Image Shadow Removal

deSEO:基于物理的高分辨率卫星图像阴影去除数据集创建
Beltrame, Lorenzo, Salzinger, Jules, Svoboda, Filip, Fanta-Jende, Phillipp, Lampert, Jasmin, Timofte, Radu, Körner, Marco
Abstract
Shadows cast by terrain and tall structures remain a major obstacle for high-resolution satellite image analysis, degrading classification, detection, and 3D reconstruction performance. Public resources offering geometry-consistent paired shadow/shadow-free satellite imagery are essentially missing, and most Earth-observation datasets are designed for shadow detection or 3D modelling rather than removal. Existing deep shadow-removal datasets either target ground-level or aerial scenes or rely on unpaired and weakly supervised formulations rather than explicit satellite pairs. We address this gap with deSEO, a geometry-aware and physics-informed methodology that, to the best of our knowledge, is the first to derive paired supervision for satellite shadow removal from the S-EO shadow detection dataset through a fully replicable pipeline. For each tile, deSEO selects a minimally shadowed acquisition as a weak reference and pairs it with shadowed counterparts using temporal and geometric filtering, Jacobian-based orientation normalisation, and LoFTR-RANSAC registration. A per-pixel validity mask restricts learning to reliably aligned regions, enabling supervision despite residual off-nadir parallax. In addition to this paired dataset, we develop a DSM-aware deshadowing model that combines residual translation, perceptual objectives, and mask-constrained adversarial learning. In contrast, a direct adaptation of a UAV-based SRNet/pix2pix architecture fails to converge under satellite viewpoint variability. Our model consistently reduces the visual impact of cast shadows across diverse illumination and viewing conditions, achieving improved structural and perceptual fidelity on held-out scenes. deSEO therefore provides the first reproducible, geometry-aware paired dataset and baseline for shadow removal in satellite Earth observation.
Chinese Translation
地形和高大结构投射的阴影仍然是高分辨率卫星图像分析的一大障碍,影响分类、检测和三维重建的性能。目前,提供几何一致的配对阴影/无阴影卫星图像的公共资源几乎缺失,大多数地球观测数据集旨在进行阴影检测或三维建模,而不是阴影去除。现有的深度阴影去除数据集要么针对地面场景或空中场景,要么依赖于未配对和弱监督的形式,而不是明确的卫星配对。我们通过deSEO来填补这一空白,提出了一种几何感知和物理驱动的方法,这在我们所知的范围内是首个通过完全可复制的流程从S-EO阴影检测数据集中导出卫星阴影去除的配对监督。对于每个图块,deSEO选择一个阴影最小的采集作为弱参考,并通过时间和几何过滤、基于雅可比的方向归一化和LoFTR-RANSAC配准,将其与阴影对照进行配对。每像素有效性掩模限制学习于可靠对齐的区域,使得尽管存在残余的偏离视线视差,仍能够进行监督。除了该配对数据集外,我们还开发了一个DSM感知的去阴影模型,结合了残余位移、感知目标和掩模约束的对抗学习。相比之下,基于无人机的SRNet/pix2pix架构的直接适应反而在卫星视角变化下无法收敛。我们的模型在多种照明和视角条件下持续减少投射阴影的视觉影响,在保留场景中实现了结构和感知忠实度的提升。因此,deSEO提供了首个可复制的、几何感知的配对数据集及卫星地球观测阴影去除的基准。
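For the deSEO entry above, the idea of a per-pixel validity mask that restricts learning to reliably aligned regions can be illustrated with a masked reconstruction loss. This is a minimal sketch, not the authors' loss: the function name `masked_l1`, the nested-list image layout, and the choice of an L1 error are all assumptions made for illustration.

```python
def masked_l1(pred, target, valid):
    """Mean absolute error over pixels whose validity flag is truthy.

    pred, target, valid: equally sized 2-D lists (rows of pixel values).
    Returns 0.0 when no pixel is valid, so an empty mask adds no loss.
    The L1 error is an illustrative assumption, not the paper's objective.
    """
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, valid):
        for p, t, v in zip(p_row, t_row, v_row):
            if v:  # skip pixels flagged as unreliably aligned
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0
```

Pixels outside the mask contribute neither error nor normalization weight, so residual misalignment in those regions cannot influence the training signal.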
cs.CV / 42 / 2605.03614

Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers

通过贝叶斯视觉变换器进行可供性实例分割的不确定性估计
Mur-Labadia, Lorenzo, Martinez-Cantina, Ruben, Guerrero, Jose J.
Abstract
Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human-robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps, is crucial for these applications. We present a model for instance segmentation of affordances by adopting sample-based and ensemble approaches for uncertainty estimation. We extend an attention-based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks of Bayesian models improves on deterministic networks due to better mask refinement and generalization. This fact, together with the more powerful features extracted by attention-based mechanisms, represents an improvement of +7.4 p.p. on the $F_{\beta}^w$ score on the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates. Qualitative results show that aleatoric variance appears in the contour of the objects, while the epistemic variance is observed in visually challenging pixels, adding interpretability to the neural network.
Chinese Translation
视觉可供性识别图像中潜在交互的区域,为场景理解提供了一种新颖的范式。识别可供性使得自主机器人能够更自然地行动,能够增强人机交互,丰富增强现实系统,并且惠及义眼设备。对于这些应用而言,准确而局部的可供性区域预测,而非一般显著性图,是至关重要的。我们提出了一种采用基于样本和集成方法进行不确定性估计的可供性实例分割模型。我们扩展了一种基于注意力的架构以适应我们的新任务,并通过详细的消融实验展示每个组件的效果。通过比较这些不同检测的分布,我们在语义和空间层面提取了逐像素的认识不确定性和随机不确定性。此外,我们提出了一种新的度量方式,称为基于概率的掩膜质量,这使得在概率实例分割模型中对语义和空间变化的综合分析成为可能。我们的结果显示,多子网络贝叶斯模型的全局共识通过更好的掩膜精炼和泛化改善了确定性网络,这一事实与基于注意力机制提取的更强大的特征相结合,在具有挑战性的IIT-Aff数据集中提升了+7.4个百分点的$F_{\beta}^w$得分。贝叶斯模型的校准情况也更理想,产生了较少的过度自信概率并且具有更好的不确定性估计。定性结果表明,随机不确定性出现在物体的轮廓上,而认识不确定性则在视觉挑战的像素中被观察到,从而为神经网络增加了可解释性。
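The split into epistemic and aleatoric variance described in the entry above can be illustrated for a single pixel. The sketch below uses a common decomposition for Bernoulli outputs over several stochastic forward passes or ensemble members. Whether the paper uses exactly this estimator is not stated in the abstract, so treat the formula and the name `decompose_variance` as assumptions.

```python
def decompose_variance(probs):
    """Split predictive variance of one pixel into two parts.

    probs: foreground probabilities for the same pixel, one per
    stochastic forward pass or ensemble member (assumed setup).
    Returns (aleatoric, epistemic), where
      aleatoric = mean of p * (1 - p)        (noise inherent in the data)
      epistemic = variance of p across runs  (disagreement between models)
    Their sum equals mean_p * (1 - mean_p), the total Bernoulli variance.
    """
    n = len(probs)
    mean_p = sum(probs) / n
    aleatoric = sum(p * (1.0 - p) for p in probs) / n
    epistemic = sum((p - mean_p) ** 2 for p in probs) / n
    return aleatoric, epistemic
```

When every run agrees on an uncertain value such as 0.5, all of the variance is aleatoric. When runs are individually confident but disagree with one another, it is epistemic.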
cs.CV / 43 / 2605.03615

PriorNet: Prior-Guided Engagement Estimation from Face Video

PriorNet:基于先验指导的人脸视频互动感知估计
Vedernikov, Alexander
Abstract
Engagement estimation from face video remains challenging because facial evidence is often incomplete, labeled data are limited, and engagement annotations are subjective. We present PriorNet, a prior-guided framework that injects task-relevant priors at three stages of the pipeline: preprocessing, model adaptation, and objective design. PriorNet converts face-detection failures into explicit zero-frame placeholders so that missing-face events remain represented in the input sequence, adapts a frozen Self-supervised Video Facial Affect Perceiver (SVFAP) backbone through a Prior-guided Low-Rank Adaptation module (Prior-LoRA) for parameter-efficient specialization, and trains with a Dirichlet-evidential, uncertainty-weighted objective under hard-label supervision. We evaluate PriorNet on EngageNet, DAiSEE, DREAMS, and PAFE using each dataset's native evaluation protocol. Across these benchmarks, PriorNet improves over the strongest listed prior reference within each dataset's evaluation framing, while component ablations on EngageNet and DAiSEE indicate that the gains arise from complementary contributions of preprocessing, adaptation, and objective-level priors. These results support explicit prior injection as a useful design principle for face-video engagement estimation under the benchmark conditions studied in this work.
Chinese Translation
从人脸视频中进行互动感知估计仍然面临挑战,因为面部证据往往不完整、标记数据有限,并且互动注释是主观的。我们提出了PriorNet,这是一种先验指导的框架,在处理流程的三个阶段注入与任务相关的先验:预处理、模型适应和目标设计。PriorNet将人脸检测失败转化为显式零帧占位符,使丢失的人脸事件在输入序列中得以表示,通过先验指导低秩适应模块(Prior-LoRA)对冻结的自监督视频面部情感感知器(SVFAP)主干进行高效参数专业化,并在硬标签监督下用Dirichlet-证据、不确定性加权目标进行训练。我们在EngageNet、DAiSEE、DREAMS和PAFE上评估PriorNet,使用每个数据集的原生评估协议。在这些基准测试中,PriorNet在每个数据集的评估框架内超越了最强的基准参考,同时在EngageNet和DAiSEE上的组件消融实验表明,性能提升源于预处理、适应和目标级先验的互补贡献。这些结果支持在本研究中所探讨的基准条件下,显式先验注入作为面部视频互动感知估计的有用设计原则。
cs.CV / 44 / 2605.03626

RPBA-Net: An Interpretable Residual Pyramid Bilateral Affine Network for RAW-Domain ISP Enhancement

RPBA-Net:一种可解释的残差金字塔双边仿射网络用于RAW域ISP增强
Xin, Yucheng, Chen, Wu, Chen, Xiang, Gao, Guangwei, Wang, Xinchun, Wu, Ruize, Lu, Dianjie, Zhang, Guijuan, Fan, Linwei, Zheng, Zhuoran
Abstract
To address module fragmentation, uninterpretable mappings, and deployment constraints in RAW-domain demosaicing, color correction, and detail enhancement, this paper proposes RPBA-Net, an interpretable residual pyramid bilateral affine network for RAW-domain ISP enhancement. Given packed RAW as input, the method performs residual affine base reconstruction by estimating a base RGB representation and learning identity-guided residual affine corrections, thereby unifying demosaicing and enhancement. It further builds pyramid bilateral affine grids and combines guide-driven autoregressive adaptive slicing with adaptive cross-layer fusion to hierarchically model global tone restoration and local texture enhancement. In addition, smoothness, cross-scale consistency, and magnitude regularization terms are introduced to improve model stability, controllability, and structural interpretability. Extensive experiments demonstrate that RPBA-Net surpasses representative RAW-to-sRGB methods and achieves state-of-the-art performance in reconstruction fidelity and perceptual quality, while maintaining low model complexity and strong deployment potential for mobile and embedded platforms.
Chinese Translation
为了解决RAW域去马赛克、色彩校正和细节增强中的模块碎片化、难以解释的映射和部署限制,本文提出了RPBA-Net,这是一种用于RAW域ISP增强的可解释残差金字塔双边仿射网络。该方法以打包的RAW作为输入,通过估计基础RGB表示和学习引导身份的残差仿射校正,进行残差仿射基础重建,从而统一去马赛克和增强。它进一步构建金字塔双边仿射网格,并结合引导驱动的自回归自适应切片与自适应跨层融合,以分层建模全局色调恢复与局部纹理增强。此外,引入平滑性、跨尺度一致性和幅度正则化项,以提高模型的稳定性、可控性和结构可解释性。大量实验表明,RPBA-Net在重建保真度和感知质量上超越了代表性的RAW至sRGB方法,同时保持低模型复杂性,并在移动和嵌入式平台上具有强大的部署潜力。
cs.CV / 45 / 2605.03639

Diffusion Masked Pretraining for Dynamic Point Cloud

动态点云的扩散掩码预训练
Zhang, Zhuoyue, Zhu, Jihua, Fang, Chaowei, Liu, Jian, Mian, Ajmal Saeed
Abstract
Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference. Code is available at https://github.com/InitalZ/DiMP.git.
Chinese Translation
动态点云预训练仍主要受到掩码重建目标的主导。然而,这些目标继承了两个关键的局限性。现有方法将真实的管道中心作为解码器的位置信息嵌入,导致时空位置信息泄漏。此外,它们使用确定性的代理目标来监督帧间运动,这种方法系统性地通过将多模态轨迹的不确定性压缩为条件均值而丢弃了分布结构。为了解决这些局限性,我们提出了扩散掩码预训练(Diffusion Masked Pretraining, DiMP),这是一个统一的自监督框架,专为动态点云设计。DiMP将扩散建模引入到位置信息推断和运动学习中。它首先仅对掩码管道中心应用正向扩散噪声,然后从可见的时空上下文中预测干净的中心。这消除了位置泄漏,同时将可见坐标保留为干净的时间锚点。DiMP还将逐点的帧间位移监督重塑为以解码的表示为条件的DDPM噪声预测目标。该设计促使编码器以变分替代形式针对可行运动的完整条件分布,而不是简化为单一确定性估计。大量实验表明,DiMP在下游任务的准确性上相较于仅使用基础模型有持续提升,在离线动作分割中绝对提升11.21%,并在因果约束的在线推理中提升13.65%。代码可在 https://github.com/InitalZ/DiMP.git 获取。
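The step in the DiMP entry above where forward diffusion noise is applied only to masked tube centers can be sketched with the standard DDPM forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The function name, the tuple-based data layout, and the use of one shared `alpha_bar` for all centers are illustrative assumptions and are not taken from the DiMP code.

```python
import math
import random


def noise_masked_centers(centers, masked, alpha_bar, rng):
    """Apply DDPM forward noise to masked tube centers only.

    centers:   list of (x, y, z) tuples (assumed layout).
    masked:    list of booleans, True where the tube is masked.
    alpha_bar: cumulative noise-schedule product in [0, 1].
    rng:       random.Random instance, passed in for reproducibility.
    Visible centers are returned unchanged, so they remain clean anchors.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    out = []
    for center, is_masked in zip(centers, masked):
        if is_masked:
            out.append(tuple(signal * c + noise * rng.gauss(0.0, 1.0)
                             for c in center))
        else:
            out.append(tuple(center))
    return out
```

A denoiser would then be trained to recover the clean masked centers from the visible context, which this sketch does not cover.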
cs.CV / 46 / 2605.03642

The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

检测器自我学习:轻量级自监督适应用于开放词汇物体检测
Wan, Yazhe, Oh, Changjae
Abstract
Open-vocabulary object detection aims to recognize objects from an open set of categories, leveraging vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given that a cooperative model consists of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the visual backbone of the VLM in a decoupled manner, which enhances local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that requires no inference overhead and fine-tunes less than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.
Chinese Translation
开放词汇物体检测旨在从开放类别集中识别物体,这一方法利用在大规模图像-文本数据上预训练的视觉-语言模型(VLMs)。合作模式将物体检测器与VLM结合,以实现对新颖物体的零样本识别。然而,在完整图像上预训练的VLM通常难以捕捉局部物体细节,当应用于区域级检测时,其效果受到限制。我们提出了解耦适应训练(Decoupled Adaptivity Training, DAT),这是一种自监督微调方法,用于改善VLM在基于模型的合作物体检测中的表现。给定的合作模型由一个闭集检测器和一个VLM组成,我们首先使用预训练的闭集物体检测器构建一个区域感知的伪标注数据集,其中可能存在对应新颖物体的区域,但这些区域仍然未标注或标错。随后,我们以解耦的方式微调VLM的视觉主干,从而增强局部特征对齐,同时通过权重插值保留全局语义知识。DAT是一个即插即用的模块,不需要推理开销,并且微调少于80万参数。在COCO和LVIS数据集上的实验表明,DAT在新颖和已知类别的检测性能上始终有所提升,确立了合作式开放词汇检测新的最先进水平。
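The DAT entry above mentions preserving global semantic knowledge via weight interpolation. A minimal sketch of that general technique is a per-parameter linear blend of pretrained and fine-tuned weights. The dictionary layout, the function name, and the mixing coefficient `lam` are hypothetical. The abstract does not say which layers are interpolated or what coefficient is used.

```python
def interpolate_weights(pretrained, finetuned, lam):
    """Blend two weight sets: (1 - lam) * pretrained + lam * finetuned.

    pretrained, finetuned: dicts mapping a parameter name to a flat list
    of floats (assumed layout). lam = 0 returns the pretrained weights,
    lam = 1 returns the fine-tuned weights.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both weight sets must have the same parameter names")
    return {
        name: [(1.0 - lam) * p + lam * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }
```

Because the blend is computed once after training, the merged model has the same size and inference cost as the original backbone, which is consistent with the "no inference overhead" claim in the abstract.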
cs.CV / 47 / 2605.03650

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

重新思考视频对象中心学习中的时间一致性:从预测到对应
Li, Zhiyuan, Zhao, Rongzhen, Yang, Wenyan, Zhao, Wenshuai, Marttinen, Pekka, Pajarinen, Joni
Abstract
The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
Chinese Translation
在视频对象中心学习中,现行的做法是通过学习的动态模块来维持时间一致性,预测未来的对象表示,称为插槽(slots)。我们证明了这些预测器实际上是离散对应问题的高成本近似。现代自监督视觉骨干网络已经编码了可以可靠区分对象的实例特征。利用这些特征消除了学习时间预测的必要性。我们提出了“基础对应”(Grounded Correspondence)框架,利用确定性二分匹配替代学习的转移函数。插槽从冻结骨干特征中的显著区域初始化。通过匈牙利匹配(Hungarian matching)在插槽表示上维护帧到帧的身份。该方法在时间建模中不需要可学习参数,但在MOVi-D、MOVi-E和YouTube-VIS上实现了竞争性的性能。项目网页:https://magenta-sherbet-85b101.netlify.app/
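The frame-to-frame identity step in the Grounded Correspondence entry above is a minimum-cost bipartite matching between slots of consecutive frames. The sketch below finds the optimal assignment by exhaustive search, which gives the same result as the Hungarian algorithm for small slot counts but scales factorially. In practice `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time. The squared Euclidean cost and the function name are assumptions, and the paper's actual cost function is not given in the abstract.

```python
from itertools import permutations


def match_slots(prev_slots, curr_slots):
    """Return perm such that prev_slots[i] is matched to curr_slots[perm[i]].

    prev_slots, curr_slots: equally long lists of feature vectors.
    Minimizes the total squared Euclidean distance (assumed cost) by
    trying every permutation; only suitable for a handful of slots.
    """
    if len(prev_slots) != len(curr_slots):
        raise ValueError("this sketch assumes the same number of slots per frame")

    def cost(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    k = len(prev_slots)
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(cost(prev_slots[i], curr_slots[perm[i]])
                             for i in range(k)),
    )
    return list(best)
```

Reordering the current slots with the returned permutation keeps each object in the same slot index over time, with no learnable parameters involved.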
cs.CV / 48 / 2605.03652

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

AniMatrix:一个以艺术思考而非物理的动画视频生成模型
Tencent HY Team
Abstract
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We will publicly release the AniMatrix model weights and inference code.
Chinese Translation
视频生成模型将物理现实内化为其先验知识,但动画故意违背物理规律:模糊、撞击帧、萌化转变;其数千种共存的艺术惯例也无法形成一个单一的“动画物理学”,供模型吸收。因此,偏向物理的模型要么压制了定义这一媒介的艺术性,要么在其风格变异中崩溃。我们提出了AniMatrix,一个通过双通道条件机制和三步转变(重新定义正确性、覆盖物理先验、区分艺术与失败)来针对艺术性而非物理准确性的动画视频生成模型。首先,制作知识系统(Production Knowledge System)将动画编码为可控生产变量的结构化分类法(风格、动作、相机、特效),AniCaption则从像素中推断这些变量作为导演指令。可训练的标签编码器保留了这一分类法的域值结构,而冻结的T5编码器则处理自由形式的叙事;双路径注入(用于细粒度控制的交叉注意力,全球强制执行的AdaLN调制)确保分类指令不会被开放式文本稀释。其次,风格-动作-变形课程使模型从近物理动作过渡到完整的动画表现力。第三,基于变形意识的偏好优化与领域特定奖励模型相结合,将有意的艺术性与病态崩溃分离。在针对专业动画师打分的五个制作维度的动画特定人类评估中,AniMatrix在五个维度中有四个排名第一,尤其在提示理解(+0.70,+22.4%)和艺术性动作(+0.55,+16.9%)上,相较于Seedance-Pro 1.0取得了最大的提升。我们将公开发布AniMatrix模型权重和推断代码。
cs.CV / 49 / 2605.03680

Real Image Denoising with Knowledge Distillation for High-Performance Mobile NPUs

针对高性能移动神经处理单元的真实图像去噪与知识蒸馏
Kayani, Faraz, Kayani, Sarmad, Ahmed, Asad, Timofte, Radu, Ignatov, Dmitry
Abstract
While deep-learning-based image restoration has achieved unprecedented fidelity, deployment on mobile Neural Processing Units (NPUs) remains bottlenecked by operator incompatibility and memory-access overhead. We propose an NPU-aware hardware-algorithm co-design approach for real-world image denoising on mobile NPUs. Our approach employs a high-capacity teacher to supervise a lightweight student network specifically designed to leverage the tiled-memory architectures of modern mobile SoCs. By prioritizing NPU-native primitives -- standard 3x3 convolutions, ReLU activations, and nearest-neighbor upsampling -- and employing a progressive context expansion strategy (up to 1024x1024 crops), the model achieves 37.66 dB PSNR / 0.9278 SSIM on the validation benchmark and 37.58 dB PSNR / 0.9098 SSIM on the held-out test benchmark at full resolution (2432x3200) in the Mobile AI 2026 challenge. Following the official challenge rules, the inference runtime is measured under a standardized Full HD (1088x1920) protocol, where it runs in 34.0 ms on the MediaTek Dimensity 9500 and 46.1 ms on the Qualcomm Snapdragon 8 Elite NPU. We further reveal an "Inference Inversion" effect, where strict adherence to NPU-compatible operations enables dedicated NPU execution up to 3.88x faster than the integrated mobile GPU. The 1.96M-parameter student recovers 99.8% of the teacher's restoration quality via high-alpha knowledge distillation (alpha = 0.9), achieving a 21.2x parameter reduction while closing the PSNR gap from 1.63 dB to only 0.05 dB. These results establish hardware-aware distillation as an effective strategy for unifying high-fidelity denoising with practical deployment across diverse mobile NPU architectures. The proposed lightweight student model (LiteDenoiseNet) and its training statistics are provided in the NN Dataset, available at https://github.com/ABrain-One/NN-Dataset.
Chinese Translation
尽管基于深度学习的图像修复已经实现了前所未有的逼真度,但在移动神经处理单元(NPU)上的部署仍然受到操作符不兼容和内存访问开销的制约。我们提出了一种NPU感知的硬件-算法共设计方法,用于在移动NPU上进行真实世界的图像去噪。我们的方法利用高容量教师来监督一个专门为利用现代移动系统芯片(SoC)的分块内存架构而设计的轻量级学生网络。通过优先采用NPU原生基本操作——标准3x3卷积、ReLU激活以及最近邻上采样,并采用逐步上下文扩展策略(最大支持1024x1024裁剪),模型在验证基准中达到了37.66 dB的峰值信噪比(PSNR)和0.9278的结构相似性指数(SSIM),在全分辨率(2432x3200)下的保留测试基准中达到了37.58 dB的峰值信噪比和0.9098的结构相似性指数,参与了2026年移动人工智能挑战赛。遵循官方挑战规则,推理运行时间在标准化的全高清(1088x1920)协议下测量,MediaTek Dimensity 9500上运行时间为34.0毫秒,而Qualcomm Snapdragon 8 Elite NPU上为46.1毫秒。我们进一步揭示了“推理反转”效应,即严格遵守NPU兼容操作能够使专用NPU执行速度比集成移动GPU快多达3.88倍。该1.96M参数的学生网络通过高α知识蒸馏(α=0.9)恢复了教师网络99.8%的修复质量,实现了21.2倍的参数减少,同时将峰值信噪比差距从1.63 dB缩小到仅0.05 dB。这些结果确立了硬件感知蒸馏作为一种有效策略,可在不同移动NPU架构上实现高保真去噪与实际部署的统一。所提出的轻量级学生模型(LiteDenoiseNet)及其训练统计数据可在NN数据集中获取,地址为 https://github.com/ABrain-One/NN-Dataset。
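The high-alpha knowledge distillation mentioned in the entry above (alpha = 0.9) can be read as a weighted sum of a teacher-matching term and a ground-truth term. The abstract gives only the value of alpha, so the L1 distance, the assignment of alpha to the teacher term, and the function names below are assumptions for illustration.

```python
def l1(a, b):
    """Mean absolute difference between two equally long pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(student, teacher, target, alpha=0.9):
    """alpha * L1(student, teacher) + (1 - alpha) * L1(student, target).

    With alpha = 0.9, most of the training signal comes from imitating the
    teacher output and the rest from the ground-truth clean image.
    Which term alpha weights is an assumption, not stated in the abstract.
    """
    return alpha * l1(student, teacher) + (1.0 - alpha) * l1(student, target)
```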
cs.CV / 50 / 2605.03716

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

双混合专家的统一多模态视觉跟踪
Hong, Lingyi, Li, Jinglun, Zhou, Xinyu, Jiang, Kaixun, Guo, Pinxue, Chen, Zhaoyu, Li, Runze, Sheng, Xingdong, Zhang, Wenqiang
Abstract
Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
Chinese Translation
多模态视觉目标跟踪可以基于输入模态分为几种不同的任务(例如,RGB和RGB+X跟踪)。现有的方法通常为每种模态训练单独的模型,或依赖预训练模型适应新模态,这限制了效率、可扩展性和可用性。因此,我们提出了OneTrackerV2,这是一种统一的多模态跟踪框架,能够实现对任一模态的端到端训练。我们提出了Meta Merger,旨在将多模态信息嵌入统一的空间,允许灵活的模态融合和鲁棒性。我们进一步引入了双混合专家(Dual Mixture-of-Experts, DMoE):T-MoE模型为跟踪建模时空关系,而M-MoE嵌入多模态知识,解开跨模态依赖并减少特征冲突。凭借共享架构、统一参数和单一端到端训练,OneTrackerV2在五个RGB和RGB+X跟踪任务及12个基准测试中达到了最先进的性能,同时保持了高推理效率。值得注意的是,即使在模型压缩后,OneTrackerV2仍保持卓越的性能。此外,OneTrackerV2在模态缺失场景中表现出显著的鲁棒性。
cs.CV / 51 / 2605.03749

FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

FluxFlow:天文图像超分辨率的保守流匹配
Liu, Shuhong, Ge, Xining, Cui, Ziteng, Li, Liuzhuozheng, Chang, Gengjia, Liu, Jun, Gu, Ziying, Li, Dong, Chu, Xuangeng, Gu, Lin, Harada, Tatsuya
Abstract
Ground-to-space astronomical super-resolution requires recovering space-quality images from ground-based observations that are simultaneously limited by pixel sampling resolution and atmospheric seeing, which imposes a stochastic, spatially varying PSF that cannot be resolved through upsampling alone. Existing methods rely on synthetic training pairs that fail to capture real atmospheric statistics and are prone to either over-smoothed reconstructions or hallucinated sources with no physical counterpart in the observed sky. We propose FluxFlow, a conservative pixel-space flow-matching framework that incorporates observation uncertainty and source-region importance weights during training, and a training-free Wiener-regularized test-time correction to suppress hallucinated sources while preserving recovered detail. We further construct the DESI--HST Dataset, a large-scale real-world benchmark comprising 19,500 real co-registered ground-to-space image pairs with real atmospheric PSF variation. Experiments demonstrate that FluxFlow consistently outperforms existing baseline methods in both photometric and scientific accuracy.
Chinese Translation
从地面到太空的天文超分辨率需要从受限于像素采样分辨率和大气扰动的地面观测中恢复出太空品质的图像,后者施加了一个随机的、空间变化的点扩散函数(PSF),仅通过上采样无法解决。现有方法依赖于合成训练对,这些方法无法捕捉真实的大气统计特征,可能导致过度平滑的重构或在观察天空中没有物理对应的幻觉源。我们提出了 FluxFlow,一种保守的像素空间流匹配框架,该框架在训练过程中结合了观测的不确定性和源区域重要性权重,并具有无训练的维纳(Wiener)正则化测试时间校正,以抑制幻觉源,同时保留恢复的细节。我们进一步构建了 DESI--HST 数据集,这是一个大规模的真实世界基准,包含 19,500 对真实的地面到太空图像配对,以及真实的大气PSF变化。实验表明,FluxFlow 在光度和科学准确性方面始终优于现有基线方法。
cs.CV / 52 / 2605.03759

Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks

在遗忘之前,学会记忆:重新审视LVLM去学习基准中的基础学习失败
Kwon, JuneHyoung, Kim, MiHyeon, Lee, Eunju, Yun, JungMin, Lim, Byeonggeuk, Kim, YoungBin
Abstract
While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage 1 failure: models fail to effectively memorize target information initially, rendering subsequent unlearning evaluations unreliable. Diagnosing under-memorization and the multi-hop curse as root causes, we introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark. ReMem ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts. Additionally, we propose a novel Exposure metric to quantify the depth of information erasure from the model's internal probability distribution. Extensive experiments demonstrate that ReMem provides a rigorous and trustworthy framework for diagnosing both learning and unlearning behaviors in LVLMs.
Chinese Translation
尽管大型视觉语言模型(Large Vision-Language Models, LVLMs)提供强大的能力,但它们通过无意中记忆敏感个人信息而带来了隐私风险。目前的去学习基准试图通过虚构身份来减轻这一风险,但忽略了一个关键的阶段一失败:模型在初始阶段未能有效记忆目标信息,从而使随后的去学习评估不可靠。我们诊断了记忆不足和多跳诅咒作为根本原因,推出了ReMem,一个可靠的多跳多图像记忆基准。ReMem通过系统性的数据扩展、基于推理的问答对以及多样的视觉上下文,确保了强大的基础学习。此外,我们提出了一种新的曝光度指标,用以量化模型内部概率分布中信息消除的深度。大量实验证明,ReMem为诊断LVLM的学习和去学习行为提供了一个严谨和可信赖的框架。
cs.CV / 53 / 2605.03764

GeoTopoDiff: Learning Geometry--Topology Graph Priors through Boundary-Constrained Mixed Diffusion for Sparse-Slice 3D Porous Reconstruction

GeoTopoDiff:通过边界约束混合扩散学习几何-拓扑图先验以进行稀疏切片3D多孔重建
Shi, Yue, Wang, Peng, Yu, Mingzhe, Zhao, Yunlong, Liu, Li, Hatton, Gareth D, Lyu, Yan, Han, Liangxiu
Abstract
Diffusion-based voxel prior modelling is challenging for the reconstruction of large-scale 3D porous microstructures. Due to the demanding requirements for simultaneously modelling both the continuous pore morphology and the discrete pore-throat topology, the diffusion models require fully observed CT scans to provide topology-faithful priors, which results in an inherent trade-off among throughput, topological fidelity, and field of view in practical industrial applications. We propose GeoTopoDiff, a graph diffusion-based framework for reconstructing 3D porous microstructures from sparse CT slices. GeoTopoDiff transfers the learning of diffusion priors from a voxel-based space to a mixed graph state space, which simultaneously encompasses continuous pore geometry and discrete pore-throat topology. A topology-aware partial graph prior from sparsely observed CT slices is introduced to constrain the reverse denoising process. Experiments on anisotropic PTFE and Fontainebleau sandstone show that GeoTopoDiff reduces morphology-related errors by 19.8% and topology-sensitive transport errors by 36.5% on average. Our findings suggest that the mixed graph state space helps the diffusion denoising process reduce posterior uncertainty under sparse observations. All models and code have been made publicly available to facilitate the exploration of diffusion models in the field of 3D porous microstructure simulation.
Chinese Translation
基于扩散的体素先验建模在大规模3D多孔微构造重建中面临挑战。由于对同时建模连续孔隙形态和离散孔喉拓扑的高要求,扩散模型需要完全观测的CT扫描以提供拓扑保真的先验,这在实际工业应用中导致了吞吐量、拓扑保真度和视场之间的固有权衡。我们提出GeoTopoDiff,一种基于图扩散的框架,用于从稀疏CT切片重建3D多孔微构造。GeoTopoDiff将扩散先验的学习从基于体素的空间转移到混合图状态空间,同时包含连续孔隙几何和离散孔喉拓扑。引入来自稀疏观测CT切片的拓扑感知部分图先验,以约束反向去噪过程。对各向异性PTFE和芳汀布劳砂岩的实验表明,GeoTopoDiff平均减少了19.8%的形态相关误差和36.5%的拓扑敏感传输误差。我们的发现表明,混合图状态空间促进了扩散去噪过程,减少了稀疏观测下的后验不确定性。所有模型和代码均已公开,以促进在3D多孔微构造模拟领域中对扩散模型的探索。
cs.CV / 54 / 2605.03784

ReLeaf: Benchmarking Leaf Segmentation across Domains and Species

ReLeaf:跨领域和物种的叶片分割基准评估
Martinko, Robert, Steininger, Daniel, Simon, Julia, Trondl, Andreas, Blaickner, Matthias
Abstract
Rising global food demand and growing climate pressure increase the need for sustainable, precise agricultural practices. Automated, individualized plant treatment relies on fine-grained visual analysis, yet leaf-level segmentation remains underexplored despite its value for assessing crop health, growth dynamics, yield potential and localized stress symptoms. Progress is limited by a lack of dedicated datasets, especially regarding species coverage, and by the absence of systematic evaluations of modern instance-segmentation architectures for this task. We address these gaps by surveying current data and identifying four suitable, publicly available leaf-segmentation datasets. Using them, we compare one-stage, two-stage and Transformer-based detectors and identify a YOLO26 model configuration to provide the best trade-off for real-world precision-agriculture tasks. Extensive cross-domain generalization experiments reveal substantial performance drops across plant species and recording setups, especially for models trained solely on laboratory data. To strengthen data availability, we introduce a new benchmark dataset with leaf-level masks for 23 plant species, created via semi-automatic annotation of selected CropAndWeed images. A model trained on all four existing datasets achieves a mean mAP50-95 of 83.9% across their corresponding test sets and 40.2% on our new benchmark, demonstrating improved generalization and highlighting the need for diverse leaf-segmentation datasets in robust precision agriculture.
Chinese Translation
全球对食品需求的增长和气候压力的加剧增加了对可持续、精准农业实践的需求。自动化、个性化的植物处理依赖于精细的视觉分析,尽管叶片级分割在评估作物健康、成长动态、产量潜力和局部胁迫症状方面具有重要价值,但这一领域仍然未得到充分探索。进展受到专门数据集不足、特别是在物种覆盖方面的限制,以及缺乏针对这一任务的现代实例分割架构系统评估的影响。我们通过调查现有数据,识别出四个适用的、公开可用的叶片分割数据集来填补这些空白。利用这些数据集,我们比较了一阶段、二阶段和基于 Transformer 的检测器,确定了一个 YOLO26 模型配置,为现实世界的精准农业任务提供了最佳的权衡。广泛的跨领域泛化实验揭示了植物物种和记录设置之间的显著性能下降,尤其是对于仅在实验室数据上训练的模型。为了增强数据的可用性,我们引入了一个新的基准数据集,该数据集包含23种植物的叶片级掩码,通过对选定的 CropAndWeed 图像进行半自动注释创建。基于所有四个现有数据集训练的模型在其对应的测试集上达到了83.9%的平均 mAP50-95,在我们的新基准上达到了40.2%,展示了更好的泛化能力,并强调了在稳健的精准农业中对多样化叶片分割数据集的需求。
cs.CV / 55 / 2605.03787

A Robust Unsupervised Domain Adaptation Framework for Medical Image Classification Using RKHS-MMD

基于RKHS-MMD的医疗图像分类鲁棒无监督领域适应框架
Sachan, Sapna, Sanodiya, Rakesh Kumar, Mahto, Amulya Kumar
Abstract
Labeling medical images is a major bottleneck in the field of medical imaging, as it requires domain-specific expertise and is further complicated by variability across medical centers and imaging devices. Such heterogeneity introduces domain shifts and modality discrepancies, which limit the generalization of trained models. To address this challenge, we propose an unsupervised domain adaptation framework that combines transfer learning with a Reproducing Kernel Hilbert Space based Maximum Mean Discrepancy loss for the alignment of source and target domains. By jointly optimizing classification and RKHS-MMD losses, the methodology enhances generalization to unannotated medical datasets while diminishing reliance on manual annotation. Experimental evaluations on two chest X-ray datasets obtained from different medical centers show outstanding improvements over models trained without adaptation. Furthermore, a comparative study shows that RKHS-MMD performs better than the standard Maximum Mean Discrepancy in reducing the modality gap, emphasizing its effectiveness for medical image classification and its strong capability in advanced AI-driven medical diagnostics.
Chinese Translation
标注医疗图像是医疗影像学领域的一个主要瓶颈,因为这需要领域特定的专业知识,并且由于不同医疗中心和不同成像设备之间的差异,情况变得更加复杂。这种异质性引入了领域转移和模态差异,限制了训练模型的泛化能力。为了解决这一重要挑战,我们提出了一种无监督领域适应框架,该框架结合了迁移学习和基于再生核希尔伯特空间(RKHS)的最大均值差异(MMD)损失,以对齐源域和目标域。通过联合优化分类和RKHS-MMD损失,该方法在减少对手动标注依赖的同时,提升了未标注医疗数据集的泛化能力。在来自不同医疗中心的两个胸部X射线数据集上的实验评估显示,与未进行适应训练的模型相比,该方法有显著的改进。此外,我们进行了比较研究,发现RKHS-MMD在减少模态差距方面优于标准最大均值差异,强调了其在医疗图像分类中的有效性以及在先进人工智能驱动医疗诊断中的强大能力。
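The domain-alignment loss in the entry above builds on Maximum Mean Discrepancy. As a reference point, the sketch below computes the textbook biased estimate of squared MMD with a Gaussian kernel, MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]. This is the generic kernel MMD and not necessarily the paper's RKHS-MMD variant. The kernel choice, the bandwidth `sigma`, and the function names are assumptions.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    source, target: non-empty lists of equally long feature tuples.
    Close to zero when the two sets have matching distributions, larger
    when the source and target features are shifted apart.
    """
    def mean_k(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_k(source, source) + mean_k(target, target)
            - 2.0 * mean_k(source, target))
```

In a joint objective of the kind the abstract describes, this quantity would be added to the classification loss, typically with a weighting factor, so that the features of both domains are pulled together during training.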
cs.CV / 56 / 2605.03790

Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

通过链式问题引导的检索增强生成提升视觉问答的多模态大语言模型
Xu, Quanxing, Zhou, Ling, Zhong, Xian, Huang, Xiaohua, Huang, Rubing, Lin, Chia-Wen
Abstract
With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.
Chinese Translation
随着多模态研究和深度学习的进展,多模态大语言模型(MLLMs)已成为广泛多模态任务的重要范式。作为视觉-语言研究中的核心问题,视觉问答(VQA)日益采用MLLMs来提升性能,特别是在外部知识至关重要的开放领域环境中。在本研究中,我们旨在通过更有效地将MLLMs与结构化推理和知识获取相结合,进一步增强基于检索的VQA。我们提出了一种逻辑提示策略,将链式思维(Chain-of-Thought, CoT)推理与视觉问题分解(Visual Question Decomposition, VQD)相结合,称为CoVQD,以引导检索更准确和相关的知识用于MLLM推理。在此基础上,我们提出了一个新框架CoVQD引导的检索增强生成(CoVQD-guided RAG, CgRAG),使MLLMs能够访问更全面和一致的外部知识,同时受益于结构化视觉文本推理的指导,从而提高在复杂跨领域VQA场景中的泛化能力和可靠性。在E-VQA、InfoSeek和OKVQA基准上的大量实验验证了所提方法的有效性。
cs.CV / 57 / 2605.03820

Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

基于共形预测自校准的低质量数据多模态学习
Jiang, Xun, Gu, Yufan, Hu, Disen, Hou, Yuqing, Yao, Yazhou, Shen, Fumin, Shen, Heng Tao, Xu, Xing
Abstract
Multimodal learning often grapples with the challenge of low-quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. The core of our proposed CPSC lies in a novel self-calibrating training loop that seamlessly integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components, and selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience. (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we also devise a self-update strategy for the conformal predictor to ensure the entire system co-evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state-of-the-art methods. Our code is available at https://github.com/XunCHN/CPSC.
Chinese Translation
多模态学习往往面临低质量数据的挑战,这主要表现为两方面:模态不平衡和噪声干扰。尽管这些问题通常被孤立研究,但我们认为它们共享一个共同根源,即学习过程中对各个模态和实例可靠性的预测不确定性。在本文中,我们提出了一个统一的框架,称为共形预测自校准(Conformal Predictive Self-Calibration,CPSC),它利用共形预测为模型提供实时自导向校准的能力。我们所提出的CPSC的核心在于一个新颖的自校准训练循环,该循环无缝集成了两个关键模块:(1)表征自校准,它将单模态特征分解为多个组件,并选择性地融合由共形预测器识别出的最鲁棒的组件,以增强特征的鲁棒性;(2)梯度自校准,它在反向传播过程中根据实例级可靠性评分重新校准梯度流,推动优化朝向更可靠的方向。此外,我们还为共形预测器设计了一种自更新策略,以确保整个系统在训练过程中保持一致的共同演化。在不平衡和噪声两种设置下,在六个基准数据集上的大量实验表明,我们的CPSC框架始终优于现有的最先进方法。我们的代码可在 https://github.com/XunCHN/CPSC 获取。
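The abstract above relies on conformal prediction to judge how reliable a modality or instance is. As a minimal sketch of the generic split conformal step only (not the CPSC training loop, whose details the abstract does not give), the following computes a calibrated threshold and a prediction set. The score definition `1 - p`, the value of `alpha`, and all sample numbers are illustrative assumptions.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every class whose nonconformity score 1 - p is within qhat."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Hypothetical calibration scores: 1 - (probability given to the true class).
cal_scores = [0.30, 0.05, 0.60, 0.10, 0.80, 0.20, 0.55, 0.35, 0.40, 0.50]
qhat = conformal_quantile(cal_scores, alpha=0.2)

confident = prediction_set([0.90, 0.07, 0.03], qhat)  # one plausible class
ambiguous = prediction_set([0.45, 0.42, 0.13], qhat)  # two plausible classes
print(qhat, confident, ambiguous)
```

A larger prediction set signals a less reliable input, which is the kind of instance-wise reliability signal a method like CPSC could consume.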
cs.CV / 58 / 2605.03830

Identity-Consistent Multi-Pose Generation of Contactless Fingerprints

身份一致的非接触式指纹多姿态生成
Pan, Zhiyu, Guan, Xiongjun, Feng, Jianjiang, Zhou, Jie
Abstract
Contactless fingerprint recognition has gained increasing attention due to its advantages in hygiene and acquisition flexibility. However, the absence of physical contact constraints introduces severe nonlinear geometric distortions caused by free finger poses in 3D space, resulting in a substantial cross-modal domain gap between contactless and conventional contact-based fingerprints. Existing solutions largely rely on explicit geometric correction or image enhancement, which are fragile under extreme pose variations. In this paper, we propose Identity-Consistent Multi-Pose Generation of Contactless Fingerprints (IMPOSE), a physics-inspired framework that synthesizes identity-preserving, multi-pose contactless fingerprint samples to empower recognition models. IMPOSE consists of three stages: (1) rolled fingerprint identity generation via latent diffusion with discrete codebook representations, (2) cross-modal translation from rolled to contactless modality guided by Sauvola-based local adaptive binarization as an identity anchor, and (3) physics-based multi-pose simulation through 3D finger model texture mapping and projection. The generated samples maintain strict identity consistency at the ridge topology level and spatial alignment with standard fingerprint coordinate space. Extensive experiments on the UWA and PolyU CL2CB databases demonstrate that fine-tuning fixed-length dense descriptors (FDD) with IMPOSE-synthesized data achieves state-of-the-art cross-modal matching, reducing EER to 8.74% on UWA and 2.26% on PolyU CL2CB. Synthetic data also yields consistent gains across mainstream representations including DeepPrint and AFRNet, and the hybrid strategy combining synthetic and real data achieves the best overall results. The code and generated samples are available at https://github.com/Yu-Yy/IMPOSE.
Chinese Translation
非接触式指纹识别因其在卫生和采集灵活性方面的优势而受到越来越多的关注。然而,由于缺乏物理接触限制,自由手指姿态在三维空间中引入了严重的非线性几何失真,导致非接触式指纹与传统接触式指纹之间存在显著的跨模态领域差距。现有解决方案在很大程度上依赖于显式几何校正或图像增强,这在极端姿态变化下显得脆弱。本文提出了一种身份一致的非接触式指纹多姿态生成框架(Identity-Consistent Multi-Pose Generation of Contactless Fingerprints,IMPOSE),该框架受到物理学启发,合成保持身份一致的多姿态非接触式指纹样本,从而增强识别模型的能力。IMPOSE由三个阶段组成:(1)通过采用离散码本表示的潜在扩散生成滚动指纹身份;(2)以基于Sauvola的局部自适应二值化作为身份锚点,引导从滚动模态到非接触模态的跨模态转换;(3)通过三维手指模型纹理映射和投影进行基于物理的多姿态仿真。生成的样本在脊线拓扑层面保持严格的身份一致性,并与标准指纹坐标空间在空间上对齐。对UWA和PolyU CL2CB数据库的广泛实验表明,使用IMPOSE合成数据对固定长度密集描述符(FDD)进行微调可以实现最先进的跨模态匹配,在UWA上将EER降低到8.74%,在PolyU CL2CB上降低到2.26%。合成数据在包括DeepPrint和AFRNet在内的主流表示中也带来了持续的提升,而结合合成数据和真实数据的混合策略实现了最佳的整体结果。代码和生成样本可在 https://github.com/Yu-Yy/IMPOSE 获取。
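Stage (2) above uses Sauvola-based local adaptive binarization as an identity anchor. As a rough sketch of the standard Sauvola rule only, the threshold for a local window with mean m and standard deviation s is T = m * (1 + k * (s / R - 1)). The defaults `k=0.2` and `r=128.0` are conventional choices for 8-bit images and are assumptions here, not values taken from IMPOSE.

```python
import math

def sauvola_threshold(window, k=0.2, r=128.0):
    """Sauvola threshold T = m * (1 + k * (s / r - 1)) for one local window."""
    n = len(window)
    mean = sum(window) / n
    # Population standard deviation of the grey levels in the window.
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    return mean * (1.0 + k * (std / r - 1.0))

def binarize_pixel(value, window, k=0.2, r=128.0):
    """Return 1 if the pixel is brighter than its local threshold, else 0."""
    return 1 if value > sauvola_threshold(window, k, r) else 0

# Hypothetical 2x2 neighbourhood flattened to a list of grey levels.
window = [50, 50, 200, 200]
print(sauvola_threshold(window))
print(binarize_pixel(200, window), binarize_pixel(50, window))
```

Because the threshold drops in low-contrast windows (small s), the rule adapts to uneven illumination, which is common in contactless captures.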
cs.CV / 59 / 2605.03837

Conditions for well-posed color recovery in scattering media

散射介质中色彩恢复的良定性条件
Solomatov, Grigory, Akkaynak, Derya
Abstract
Recovering scene color from images captured in scattering media is a fundamental inverse problem in optical imaging. Yet the problem is intrinsically ill-posed as multiple solutions can explain the same observation, and prediction error cannot be controlled without understanding the space of candidate solutions. Here, we present sufficient conditions under which color recovery in a scattering medium becomes well-posed. Observing that ill-posedness stems from (i) projection of spectral signals onto pixel intensities, and (ii) unknown medium parameters, we demonstrate that sensor improvements alone cannot resolve medium-induced distortions without additional constraints. We identify recovery patterns, cross-pixel relationships that naturally occur in images, and prove, for an ideal hyperspectral camera, that they restrict the solution to a unique candidate. This opens the door to a new class of vision algorithms grounded in first principles, enabling quantitative analysis of images in scattering environments.
Chinese Translation
从在散射介质中捕获的图像中恢复场景色彩是光学成像中的一个基本逆问题。然而,这个问题本质上是病态的,因为多种解可以解释相同的观测结果,且在没有理解候选解空间的情况下,预测误差无法被控制。在此,我们提出了使散射介质中的色彩恢复变得良定的充分条件。我们观察到,病态性源于 (i) 光谱信号投影到像素强度,以及 (ii) 未知的介质参数,并证明在没有额外约束的情况下,仅依靠传感器的改进无法解决介质引起的失真。我们识别出恢复模式,即自然出现在图像中的跨像素关系,并证明对于理想的高光谱相机,这些关系将解限制为唯一候选。这为基于基本原理的新一类视觉算法开辟了道路,使散射环境中图像的定量分析成为可能。
cs.CV / 60 / 2605.03848

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

参数高效的多视角能力评估:从判别分类到生成反馈
Bianchi, Edoardo, Liotta, Antonio
Abstract
Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
Chinese Translation
估计一个人执行某项动作的水平,而不是识别其执行的是哪项动作,对于辅导、康复和人才识别至关重要。这个任务具有挑战性,因为能力体现在时机、平衡、身体力学和执行等方面的微妙差异中,通常分布在多个视角和短暂的时间事件中。我们讨论了在Ego-Exo4D上进行多视角能力评估的三个最新贡献。SkillFormer引入了一种参数高效的判别架构,用于选择性多视角融合;PATS通过保留基本动作的局部密集片段来改进时间采样;而ProfVLM将能力评估重新表述为条件语言生成,通过一个门控跨视角投影器和一个紧凑的语言主干同时生成能力标签和专家风格的反馈。综合来看,这些方法在Ego-Exo4D上实现了最先进的准确率,与视频Transformer基线相比,可训练参数最多减少20倍,训练轮次最多减少3倍,同时从闭集分类向可解释的反馈生成转变。这些结果突显了向高效多视角系统的转变,这种系统结合了选择性融合、能力感知采样和可操作的生成反馈。
cs.CV / 61 / 2605.03849

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

Stream-R1:面向流媒体视频生成的可靠性-困惑性感知奖励蒸馏
Wu, Bin, Huang, Mengqi, Wu, Shaojin, Jia, Weinan, Wang, Yuxin, Mao, Zhendong, Zhang, Yongdong
Abstract
Distillation-based acceleration has become foundational for making autoregressive streaming video diffusion models practical, with distribution matching distillation (DMD) as the de facto choice. Existing methods, however, train the student to match the teacher's output indiscriminately, treating every rollout, frame, and pixel as equally reliable supervision. We argue that this caps distilled quality, since it overlooks two complementary axes of variance in DMD supervision: Inter-Reliability across student rollouts whose supervision varies in reliability, and Intra-Perplexity across spatial regions and temporal frames that contribute unequally to where quality can still be improved. The objective thus conflates two questions under a uniform weight: whether to learn from each rollout, and where to concentrate optimization within it. To address this, we propose Stream-R1, a Reliability-Perplexity Aware Reward Distillation framework that adaptively reweights the distillation objective at both rollout and spatiotemporal-element levels through a single shared reward-guided mechanism. At the Inter-Reliability level, Stream-R1 rescales each rollout's loss by an exponential of a pretrained video reward score, so that rollouts with reliable supervision dominate optimization. At the Intra-Perplexity level, it back-propagates the same reward model to extract per-pixel gradient saliency, which is factored into spatial and temporal weights that concentrate optimization pressure on regions and frames where refinement yields the largest expected gain. An adaptive balancing mechanism prevents any single quality axis from dominating across visual quality, motion quality, and text alignment. Stream-R1 attains consistent improvements on all three dimensions over distillation baselines on standard streaming video generation benchmarks, without architectural modification or additional inference cost.
Chinese Translation
基于蒸馏的加速已成为使自回归流媒体视频扩散模型实用化的基础,其中分布匹配蒸馏(DMD)是事实上的首选方案。然而,现有方法训练学生模型不加区分地匹配教师模型的输出,将每个rollout、帧和像素视为同等可靠的监督。我们认为这种做法限制了蒸馏质量的上限,因为它忽略了DMD监督中两个互补的差异维度:不同学生rollout之间监督可靠性的差异(Inter-Reliability),以及不同空间区域和时间帧对剩余质量提升空间贡献不均的差异(Intra-Perplexity)。因此,该目标在统一权重下混淆了两个问题:是否要从每个rollout中学习,以及在其内部应将优化集中于何处。为了解决这个问题,我们提出了Stream-R1,一个可靠性-困惑性感知的奖励蒸馏框架,通过单一的共享奖励引导机制,在rollout层面和时空元素层面自适应地重新加权蒸馏目标。在Inter-Reliability层面,Stream-R1以预训练视频奖励评分的指数对每个rollout的损失进行缩放,从而使监督可靠的rollout在优化中占主导地位。在Intra-Perplexity层面,它对同一奖励模型进行反向传播以提取逐像素的梯度显著性,并将其分解为空间权重和时间权重,把优化压力集中在细化后预期收益最大的区域和帧上。自适应平衡机制可防止视觉质量、运动质量和文本对齐中的任何单一质量维度占据主导。Stream-R1在标准流媒体视频生成基准上,相较蒸馏基线在所有三个维度上均取得了一致的提升,且无需修改架构或增加推理成本。
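The rollout-level reweighting described above ("rescales each rollout's loss by an exponential of a pretrained video reward score") can be sketched as follows. This is a minimal illustration, not the Stream-R1 implementation: the `temperature` parameter and the normalisation to mean 1 are assumptions added so that the overall loss scale stays comparable to an unweighted average.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Exponential reward weights, normalised to mean 1 over the batch."""
    top = max(rewards)  # subtract the max for numerical stability
    raw = [math.exp((r - top) / temperature) for r in rewards]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def reweighted_loss(losses, rewards, temperature=1.0):
    """Batch loss in which high-reward rollouts contribute more."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Hypothetical per-rollout distillation losses and reward scores.
losses = [1.0, 2.0, 3.0]
rewards = [0.0, 1.0, 2.0]
weights = reward_weights(rewards)
print(weights, reweighted_loss(losses, rewards))
```

With equal rewards every weight is 1 and the result reduces to the plain mean loss, so the uniform-weight objective criticised in the abstract is recovered as a special case.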
cs.CV / 62 / 2605.03857

A Deeper Dive into the Irreversibility of PolyProtect: Making Protected Face Templates Harder to Invert

深入探讨PolyProtect的不可逆性:提升受保护面部模板的反向困难度
Hahn, Vedrana Krivokuća, Maceiras, Jérémy, Marcel, Sébastien
Abstract
This work presents a deeper analysis of the "irreversibility" property of PolyProtect, a biometric template protection method initially proposed for securing face embeddings. PolyProtect transforms embeddings into protected templates via multivariate polynomials, whose coefficients and exponents are distinct for each subject enrolled in the face recognition system. A polynomial is applied to consecutive sets of elements from a given embedding, where the amount of overlap between the sets is a tunable parameter. We begin our irreversibility analysis by demonstrating that PolyProtected templates are easier to invert using a numerical solver based on cosine distance, as opposed to Euclidean distance (used in the earlier PolyProtect work). To make this inversion more difficult, we then propose a "key selection algorithm", which tries to choose "keys" (coefficients and exponents of the PolyProtect polynomial) that enhance the irreversibility of PolyProtected templates, compared to when the keys are purely random. Our experiments show that this algorithm is effective at generating PolyProtected templates that are significantly more difficult to invert, and that it approximately equalises the irreversibility of PolyProtected templates generated using different "overlap" parameters. This allows for better control of the irreversibility versus accuracy trade-off, known to exist across different overlaps. We also show that accuracy in the PolyProtected domain can be affected by the range in which the embedding elements lie, but that this can be improved by normalizing the embeddings prior to applying PolyProtect. This work is reproducible using our open-source code.
Chinese Translation
本文对PolyProtect的“不可逆性”特性进行了深入分析。PolyProtect是一种最初为保护面部嵌入而提出的生物特征模板保护方法。该方法通过多元多项式将嵌入转换为受保护的模板,不同主体的系数和指数是独特的。一个多项式作用于给定嵌入的连续元素集合,这些集合之间的重叠量是一个可调参数。我们通过展示PolyProtected模板在使用基于余弦距离的数值求解器时,相较于早期PolyProtect工作中使用的欧几里得距离更容易被反向推导,开始我们的不可逆性分析。为了使这种反向推导更加困难,我们提出了一种“密钥选择算法”,该算法试图选择“密钥”(PolyProtect多项式的系数和指数),以增强PolyProtected模板的不可逆性,优于完全随机选择的情况。我们的实验表明,该算法能够有效生成显著更难反向推导的PolyProtected模板,并且在不同“重叠”参数下,大致平衡了生成的PolyProtected模板的不可逆性。这使得在不同重叠值之间的不可逆性与准确性权衡的控制变得更好。我们还显示,PolyProtected领域的准确性可能会受到嵌入元素范围的影响,但通过在应用PolyProtect之前对嵌入进行标准化,可以得到改善。该研究使用我们的开源代码是可复现的。
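The mapping described above (a polynomial applied to consecutive sets of embedding elements, with a tunable overlap between sets) can be sketched as follows. The function name is hypothetical, the subject-specific keys are passed in as `coeffs` and `exps`, and dropping a trailing incomplete window is a simplifying assumption; the abstract does not say how the original method handles it.

```python
def polyprotect_sketch(embedding, coeffs, exps, overlap):
    """Map an embedding to a shorter template with a per-subject polynomial.

    Each output value is sum(c_j * v_j ** e_j) over one window of
    len(coeffs) consecutive embedding elements; successive windows
    share `overlap` elements.
    """
    m = len(coeffs)
    if len(exps) != m or not 0 <= overlap < m:
        raise ValueError("need len(exps) == len(coeffs) and 0 <= overlap < m")
    step = m - overlap
    protected = []
    # Assumption: windows that would run past the end are dropped.
    for start in range(0, len(embedding) - m + 1, step):
        window = embedding[start:start + m]
        protected.append(sum(c * v ** e for c, v, e in zip(coeffs, window, exps)))
    return protected

# Toy embedding and hypothetical keys (coefficients and exponents).
embedding = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
coeffs, exps = [1, -2, 3], [1, 2, 3]
no_overlap = polyprotect_sketch(embedding, coeffs, exps, overlap=0)
one_overlap = polyprotect_sketch(embedding, coeffs, exps, overlap=1)
print(no_overlap, one_overlap)
```

On longer embeddings a larger overlap yields more output values, which illustrates why overlap trades irreversibility against accuracy: the attacker gets more equations per unknown embedding element.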
cs.CV / 63 / 2605.03877

DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models

DMGD:基于扩散模型的无训练数据集蒸馏与语义分布匹配
Wang, Qichao, Lu, Yunhong, Cao, Hengyuan, Zhang, Junyi, Zhang, Min
Abstract
Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We first establish Semantic Matching via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT) based Distribution Matching approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for diffusion based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.
Chinese Translation
数据集蒸馏通过将大规模数据集的信息提炼为显著更小的合成数据集,从而实现高效训练。近年来,基于扩散的范式崭露头角,为数据集蒸馏提供了新的视角。然而,它们通常需要额外的微调阶段,有效的指导机制仍未得到充分探索。为了解决这些限制,我们重新思考基于扩散的数据集蒸馏,提出了一种双重匹配引导的扩散框架(Dual Matching Guided Diffusion, DMGD),其核心是高效的无训练指导。我们首先通过条件似然优化建立语义匹配,消除了对辅助分类器的需求。此外,我们提出了一种动态指导机制,在保持语义一致性的同时增强合成数据的多样性。同时,我们引入了一种基于最优运输(Optimal Transport, OT)的分布匹配方法,以进一步与目标分布结构对齐。为了确保效率,我们为基于扩散的框架开发了两种增强策略:分布近似匹配(Distribution Approximate Matching)和贪婪逐步匹配(Greedy Progressive Matching)。这些策略在最小计算开销的情况下实现了有效的分布匹配指导。对ImageNet-Woof、ImageNet-Nette和ImageNet-1K的实验结果表明,我们的无训练方法取得了显著的改进,相较于需要额外微调的最先进(SOTA)方法,平均准确率分别提高了2.1%、5.4%和2.4%。
cs.CV / 64 / 2605.03885

Raising the Ceiling: Better Empirical Fixation Densities for Saliency Benchmarking

提升上限:更好的经验注视密度用于显著性基准测试
Agrawal, Susmit, Hollman, Jannis, Kümmerer, Matthias
Abstract
Empirical fixation densities, spatial distributions estimated from human eye-tracking data, are foundational to saliency benchmarking. They directly shape benchmark conclusions, leaderboard rankings, failure case analyses, and scientific claims about human visual behavior. Yet the standard estimation method, fixed-bandwidth isotropic Gaussian KDE, has gone essentially unchanged for decades. This matters now more than ever: as the field shifts toward sample-level evaluation (failure case analysis, inverse benchmarking, per-image model comparison), reliable per-image density estimates become critical. We propose a principled mixture model that combines an adaptive-bandwidth KDE based on Abramson's method, center bias and uniform components, and a state-of-the-art saliency model, to capture different spatial and semantic types of interobserver consistency, and optimize all parameters per image via leave-one-subject-out cross-validation. Our method yields substantially higher interobserver consistency estimates across multiple benchmarks, with median per-image gains of 5-15% in log-likelihood and up to 2 percentage points in AUC. For the most affected images -- precisely those most relevant to failure case analysis -- improvements exceed 25%. We leverage these improved estimates to identify and analyze remaining failure cases of state-of-the-art saliency models, demonstrating that significant headroom for model improvement remains. More broadly, our findings highlight that empirical fixation densities should not be treated as fixed ground truths but as evolving estimates that improve with better methodology.
Chinese Translation
经验注视密度,即从人类眼动追踪数据中估计的空间分布,是显著性基准测试的基础。它们直接影响基准结论、排行榜排名、失败案例分析以及关于人类视觉行为的科学声明。然而,标准估计方法——固定带宽各向同性高斯核密度估计(Gaussian KDE)数十年来基本没有变化。这在现今尤为重要:随着该领域逐渐转向样本级评估(失败案例分析、逆向基准测试、逐图模型比较),可靠的逐图密度估计变得至关重要。我们提出了一种合理的混合模型,该模型结合了基于 Abramson 方法的自适应带宽 KDE、中心偏差和均匀成分,以及一种最先进的显著性模型,旨在捕捉不同空间和语义类型的观察者一致性,并通过留一法交叉验证优化每个图像的所有参数。我们的方法在多个基准测试中产生了显著更高的观察者一致性估计,逐图的对数似然中位数提升为 5-15%,AUC 提升达到 2 个百分点。对于受到影响最严重的图像——即那些与失败案例分析最相关的图像——改善超过 25%。我们利用这些改进的估计来识别和分析最先进的显著性模型所剩余的失败案例,展示了模型改进仍然存在显著空间。更广泛地说,我们的发现强调了经验注视密度不应被视为固定的真实值,而应视为随着更好方法而演变的估计。
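The adaptive-bandwidth component mentioned above follows Abramson's square-root rule: a fixed-bandwidth pilot estimate is computed first, and each fixation then gets a local bandwidth proportional to the inverse square root of the pilot density at that point. The sketch below covers only this component, in 2D with isotropic Gaussian kernels; the mixture with center bias, uniform, and saliency-model terms and the per-image cross-validation are not shown, and `h0` and the sample points are assumptions.

```python
import math

def gauss2d(dx, dy, h):
    """Isotropic 2D Gaussian kernel with bandwidth h."""
    return math.exp(-(dx * dx + dy * dy) / (2 * h * h)) / (2 * math.pi * h * h)

def abramson_bandwidths(points, h0):
    """Per-point bandwidths h_i = h0 * sqrt(g / pilot_i), g = geometric mean."""
    n = len(points)
    pilot = [sum(gauss2d(px - qx, py - qy, h0) for qx, qy in points) / n
             for px, py in points]
    g = math.exp(sum(math.log(p) for p in pilot) / n)
    return [h0 * math.sqrt(g / p) for p in pilot]

def adaptive_kde(x, y, points, bandwidths):
    """Density at (x, y) using one kernel per fixation with its own bandwidth."""
    n = len(points)
    return sum(gauss2d(x - px, y - py, h)
               for (px, py), h in zip(points, bandwidths)) / n

# Hypothetical fixations: a tight cluster plus one isolated fixation.
fixations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (20.0, 20.0)]
bandwidths = abramson_bandwidths(fixations, h0=2.0)
print(bandwidths, adaptive_kde(0.5, 0.5, fixations, bandwidths))
```

Isolated fixations receive wider kernels and clustered ones narrower kernels, which a single fixed bandwidth cannot do.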
cs.CV / 65 / 2605.03927

StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

StateVLM:一种针对机器人可供性推理的状态感知视觉语言模型
Sun, Xiaowen, Kerzel, Matthias, Li, Mengdi, Zhao, Xufeng, Striker, Paul, Wermter, Stefan
Abstract
Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
Chinese Translation
视觉语言模型(VLMs)在各种机器人任务中表现出色,因为它们能够感知视觉信息并理解自然语言指令。然而,在机器人应用中,VLMs仍受限于大型语言模型(LLMs)固有的基本局限性:它们在数值推理上表现不佳,特别是在物体检测和物体状态定位方面。为了将数值推理视为VLMs中的回归任务,我们提出了一种新颖的训练策略,以适应VLMs进行物体检测和物体状态定位。该方法在微调过程中利用框解码器输出计算辅助回归损失(Auxiliary Regression Loss, ARL),同时在推理时保持标准序列预测。我们利用这一训练策略开发了StateVLM(状态感知视觉语言模型),这是一个新模型,旨在感知和学习细粒度的物体表征,包括物体的精确定位及其状态,以及可抓取区域。由于缺乏物体状态可供性推理的基准,我们引入了一个开源基准,称为物体状态可供性推理(Object State Affordance Reasoning, OSAR),该基准包含1172个场景,其中有7746个单独的物体及其相应的边界框。在适应基准(RefCOCO、RefCOCO+和RefCOCOg)上的比较实验表明,相较于没有ARL的模型,ARL使得模型性能平均提高了1.6%。在OSAR基准上的实验进一步支持了这一发现,显示出带有ARL的StateVLM的性能平均比不带ARL的模型高出5.2%。特别是,ARL对OSAR中可供性推理这一复杂任务也至关重要,能够增强模型输出的一致性。
cs.CV / 66 / 2605.03941

A Benchmark for Interactive World Models with a Unified Action Generation Framework

一种统一动作生成框架的交互式世界模型基准测试
Fang, Jianjie, Lei, Yingshan, Wan, Qin, Wang, Ziyou, Huang, Yuchao, Xu, Yongyan, Zhao, Baining, Zhang, Weichen, Gao, Chen, Chen, Xinlei, Li, Yong
Abstract
Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory. We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld-Bench model leaderboard is publicly available at iWorld-Bench.com.
Chinese Translation
实现人工通用智能(AGI)需要能够自适应学习和交互的智能体,交互式世界模型为感知、推理和行动提供了可扩展的环境。然而,当前研究仍然缺乏大规模数据集和统一基准来评估物理交互能力。为了解决这一问题,我们提出了iWorld-Bench,这是一个全面的基准测试,用于训练和测试与交互相关的能力,如距离感知和记忆。我们构建了一个包含33万个视频片段的多样化数据集,并选择了2100个高质量样本,覆盖了不同的视角、天气和场景。由于现有世界模型在交互模式上有所不同,我们引入了一个动作生成框架以统一评估,并设计了六种任务类型,生成了4900个测试样本。这些任务共同评估模型在视觉生成、轨迹跟随和记忆方面的性能。在对14个具有代表性的世界模型进行评估时,我们识别出了关键的局限性,并为未来的研究提供了见解。iWorld-Bench模型排行榜已在iWorld-Bench.com上公开发布。
cs.CV / 67 / 2605.03942

Reservoir property image slices from the Groningen gas field for image translation and segmentation

来自格罗宁根气田的储层属性图像切片用于图像转换和分割
Al-Fakih, Abdulrahman, Sariah, Nabil, Koeshidayatullah, Ardiansyah, Kaka, SanLinn I.
Abstract
Reservoir characterization workflows increasingly rely on image-based and machine-learning/deep learning or even generative AI approaches, but openly available geological image datasets suitable for reproducible benchmarking remain limited. Here we describe a high-resolution dataset of reservoir-property image slices derived from the Groningen static geological model. The dataset contains aligned two-dimensional PNG images representing facies, porosity, permeability, and water saturation, generated from three-dimensional reservoir grids and prepared for downstream visualization, segmentation, and image-to-image translation tasks. In addition to the deposited original image corpus, we provide an archived software workflow for reproducing augmentation, mask generation, paired-image construction, and example baseline experiments. The resource is designed to support benchmarking of geological image analysis methods and the study of cross-domain relationships among reservoir properties. By separating the fixed image dataset from the reproducible processing workflow, this work provides a transparent foundation for reuse in geoscience, reservoir modeling, and machine-learning applications.
Chinese Translation
储层表征工作流程越来越依赖基于图像的机器学习/深度学习甚至生成性人工智能方法,但公开可用的适合可再现基准测试的地质图像数据集仍然有限。在这里,我们描述了一个高分辨率的储层属性图像切片数据集,该数据集来源于格罗宁根静态地质模型。该数据集包含对齐的二维PNG图像,表示相、孔隙度、渗透率和水饱和度,这些图像是从三维储层网格生成的,并为后续的可视化、分割和图像到图像的转换任务做好准备。除了提交的原始图像集外,我们还提供了一种归档的软件工作流程,以再现数据增强、掩膜生成、成对图像构建和示例基准实验。该资源旨在支持地质图像分析方法的基准测试以及储层属性之间跨领域关系的研究。通过将固定的图像数据集与可再现的处理工作流程分离,本研究为地球科学、储层建模和机器学习应用中的重用提供了透明的基础。
cs.CV / 68 / 2605.03950

UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

UnAC:针对复杂多模态推理的自适应视觉提示与抽象和逐步检查
Wang, Yifan, Fu, Yun
Abstract
Although recent LMMs have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image-abstraction prompt to effectively extract key information from images. In addition, we introduce a gradual self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments on three public benchmarks (MathVista, MM-Vet, and MMMU) validate the method.
Chinese Translation
尽管近期的大型多模态模型(LMMs)在视觉感知方面已显著增强,但在需要对视觉证据进行多步推理的问题上仍然不可靠。本文提出了UnAC(理解、抽象与检查,Understanding, Abstracting, and Checking),一种多模态提示方法,用于增强LMMs(如GPT-4o、Gemini 1.5和GPT-4V)在复杂多模态任务中的推理能力。为改善图像理解并捕捉细节,我们提出了一种自适应视觉提示策略,使LMMs能够聚焦于显著区域。我们进一步设计了一种图像抽象提示,以有效提取图像中的关键信息。此外,我们引入了一种渐进式自我检查机制,通过验证每个分解的子问题及其答案来提升推理能力。在三个公共基准(MathVista、MM-Vet和MMMU)上的广泛实验验证了我们的方法。
cs.CV / 69 / 2605.03968

Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning

通过弱监督预训练和微调,实现航空影像中标签高效的学校检测
Elmimouni, Zakarya, Fourati, Fares, Alouini, Mohamed-Slim
Abstract
Accurate school detection is essential for supporting education initiatives, including infrastructure planning and expanding internet connectivity to underserved areas. However, many regions around the world face challenges due to outdated, incomplete, or unavailable official records. Manual mapping efforts, while valuable, are labor-intensive and lack scalability across large geographic areas. To address this, we propose a weakly supervised framework for school detection from aerial imagery that minimizes the need for human annotations while supporting global mapping efforts. Our method is specifically designed for low-data regimes, where manual annotations are extremely scarce. We introduce an automatic labeling pipeline that leverages sparse location points and semantic segmentation to generate infrastructure masks, from which we derive bounding boxes. Using these automatically labeled images, we train our detectors in a first stage to learn a representation of what schools look like; then, using a small set of manually labeled images, we fine-tune the previously trained models on this clean dataset. This two-stage training pipeline enables large-scale, robust detection of school infrastructure in low-data settings with minimal supervision. Our results demonstrate strong object detection performance, particularly in the low-data regime, where the models achieve promising results using only 50 manually labeled images, significantly reducing the need for costly annotations. This framework supports education and connectivity initiatives worldwide by providing an efficient and extensible approach to mapping schools from space. All models, training code, and auto-labeled data will be publicly released to foster future research and real-world impact.
Chinese Translation
准确的学校检测对支持教育倡议至关重要,包括基础设施规划和将互联网连接扩展到服务不足地区。然而,世界许多地区由于官方记录过时、不完整或不可用而面临挑战。尽管人工绘制地图的努力有其价值,但却劳动密集,而且在大面积地理区域内缺乏可扩展性。为了解决这个问题,我们提出了一种弱监督框架,用于从航空影像中进行学校检测,最小化对人工标注的需求,同时支持全球测绘工作。我们的方法专门针对低数据环境,其中人工标注极为稀缺。我们引入了一种自动标注流程,通过稀疏位置点和语义分割生成基础设施掩膜,从中生成边界框。利用这些自动标注的图像,我们在第一阶段训练检测器,以学习学校的外观表示,然后使用一小部分手动标注的图像,对先前训练的模型通过这个干净的数据集进行微调。这一两阶段的训练流程使得在低数据环境下进行大规模且强有力的学校基础设施检测成为可能,且对监督的需求最小化。我们的结果显示出强大的目标检测性能,特别是在低数据环境中,模型在仅使用50张手动标注图像的情况下取得了良好的效果,显著减少了对昂贵标注的需求。该框架通过提供一种有效且可扩展的方法,从太空中绘制学校,支持全球的教育和连接倡议。所有模型、训练代码和自动标注数据将公开发布,以促进未来的研究和现实世界的影响。
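The automatic labeling step above turns infrastructure masks into bounding boxes. One plausible way to do this, sketched here under assumptions, is to take each 4-connected foreground region of a binary mask and emit its tight box. The function name, the `min_area` filter, and the `(x_min, y_min, x_max, y_max)` box convention are illustrative choices, not details from the paper.

```python
from collections import deque

def mask_to_boxes(mask, min_area=1):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(r, c)])
            seen[r][c] = True
            y_min = y_max = r
            x_min = x_max = c
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                y_min, y_max = min(y_min, y), max(y_max, y)
                x_min, x_max = min(x_min, x), max(x_max, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:  # drop tiny regions, likely segmentation noise
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes

# Hypothetical 4x5 building mask with two separate regions.
mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
all_boxes = mask_to_boxes(mask)
large_boxes = mask_to_boxes(mask, min_area=3)
print(all_boxes, large_boxes)
```

In a weakly supervised setting, only the regions that contain a known school location point would be kept as positive labels; that filtering is omitted here.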
cs.CV / 70 / 2605.03996

3D Human Face Reconstruction with 3DMM face model from RGB image

基于RGB图像的3D人脸重建及3DMM面部模型
Jiang, Zhangnan, Yang, Zichen
Abstract
As convolutional neural networks have demonstrated powerful problem-solving ability in image processing, efforts have been made to reconstruct detailed face shapes from 2D face images or videos. However, making full use of a CNN requires a large amount of labeled training data. Coarse morphable face models have been used to synthesize labeled data, but they struggle to generate photo-realistic data with details such as wrinkles. In this project, we present a pipeline that reconstructs a 3D human face model from a single RGB image. The pipeline includes face detection, landmark detection, regression of 3DMM model parameters, and soft rendering. Mentor: Zhipeng Fan (Email: [email protected]) Code Repository: https://github.com/SeVEnMY/3d-face-reconstruction Code Reference: https://github.com/sicxu/Deep3DFaceRecon_pytorch
Chinese Translation
近年来,卷积神经网络在图像处理领域展示了其强大的问题解决能力,因此人们致力于从二维人脸图像或视频中重建详细的人脸形状。然而,为了充分利用卷积神经网络,需要大量标注数据来训练网络。粗糙可变形面部模型已被用于合成标注数据,但这类模型很难生成带有皱纹等细节的照片级真实感数据。在本项目中,我们提出了一个从单张RGB图像重建人脸三维模型的流程。该流程包括人脸检测、关键点检测、3DMM模型参数回归以及软渲染。指导教师:Zhipeng Fan(邮箱:[email protected])代码库:https://github.com/SeVEnMY/3d-face-reconstruction 代码参考:https://github.com/sicxu/Deep3DFaceRecon_pytorch
cs.CV / 71 / 2605.03999

RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence: Extending the Recurrent-Depth Transformer Architecture to Dense Prediction

RD-ViT:面向语义分割、降低数据依赖的递归深度视觉变换器:将递归深度变换器架构扩展到密集预测
He, Renjie
Abstract
Vision Transformers (ViTs) achieve state-of-the-art segmentation accuracy but require large training datasets because each layer has unique parameters that must be learned independently. We present RD-ViT, a Recurrent-Depth Vision Transformer that adapts the Recurrent-Depth Transformer (RDT) architecture to dense prediction tasks, supporting both 2D and 3D inputs. RD-ViT replaces the deep stack of unique transformer blocks with a single shared block looped T times, augmented with LTI-stable state injection for guaranteed convergence, Adaptive Computation Time (ACT) for spatial compute allocation, depth-wise LoRA adaptation, and optional Mixture-of-Experts (MoE) feed-forward networks for category-specific specialization. We evaluate on the ACDC cardiac MRI segmentation benchmark in both 2D slice-level and 3D volumetric settings with exclusively real experiments executed in Google Colab. In 2D, RD-ViT outperforms standard ViT at 10% training data (Dice 0.774 vs 0.762) and at full data (0.882 vs 0.872). In 3D, RD-ViT with MoE achieves Dice 0.812 with 3.0M parameters, reaching 99.4% of standard ViT performance (0.817) at 53% of the parameter count. MoE expert utilization analysis reveals that different experts spontaneously specialize for different cardiac structures (RV, MYO, LV) without explicit routing supervision. ACT halting maps show higher compute allocation at cardiac boundaries, and the mean ponder time decreases from 2.6 to 1.4 iterations during training, demonstrating learned computational efficiency. Depth extrapolation enables inference with more loops than training without degradation. All code, notebooks, and results are publicly released.
Chinese Translation
视觉变换器(ViTs)在分割精度上达到了最先进的水平,但由于每一层都有独特的参数需要独立学习,因此需要大量的训练数据集。我们提出了 RD-ViT,一种递归深度视觉变换器,它将递归深度变换器(RDT)架构适应于密集预测任务,支持二维和三维输入。RD-ViT 用一个共享的模块替代了深层堆叠的独特变换器块,该模块循环 T 次,并通过 LTI 稳定的状态注入实现确保收敛,利用自适应计算时间(ACT)进行空间计算分配,深度方向的 LoRA 适应,以及可选的专门针对类别的混合专家(MoE)前馈网络。我们在 ACDC 心脏 MRI 分割基准测试上进行了评估,涵盖了二维切片级和三维体积设置,并在 Google Colab 中执行了完全真实的实验。在二维设置中,RD-ViT 在 10% 的训练数据下优于标准 ViT(Dice 0.774 对 0.762),在全数据下也表现更佳(0.882 对 0.872)。在三维设置中,搭载 MoE 的 RD-ViT 在拥有 3.0M 参数的情况下实现了 Dice 0.812,达到了标准 ViT 性能的 99.4%(0.817),而参数数量仅为 53%。MoE 专家的利用分析表明,不同的专家在没有明确路由监督的情况下,自发地专攻不同的心脏结构(右心室 RV、心肌 MYO、左心室 LV)。ACT 停止图展示了在心脏边界处更高的计算分配,而平均停留时间在训练过程中从 2.6 降低到 1.4 次迭代,表明学习到了计算效率。深度外推使推理时的循环次数超过训练而不产生降级。所有代码、笔记本和结果已公开发布。
cs.CV / 72 / 2605.04008

Enhanced 3D Brain Tumor Segmentation Using Assorted Precision Training

利用多种精度训练增强三维脑肿瘤分割
Pandya, Adwaitt, Oguine, Ozioma C., Bhargava, Harita, Zade, Shrikant
Abstract
A brain tumor is a medical disorder faced by individuals of all demographics. Medically, it is described as the spread of non-essential cells close to or throughout the brain. Symptoms of this ailment include headaches, seizures, and sensory changes. This research explores two main categories of brain tumors: benign and malignant. Benign spreads steadily, and malignant expresses growth, making it dangerous. Early identification of brain tumors is a crucial factor for the survival of patients. This research provides a state-of-the-art approach to the early identification of tumors within the brain. We implemented the SegResNet architecture, a widely adopted architecture for three-dimensional segmentation, and trained it using the automatic multi-precision method. We incorporated the dice loss function and dice metric for evaluating the model. We got a dice score of 0.84. For the tumor core, we got a dice score of 0.84; for the whole tumor, 0.90; and for the enhanced tumor, we got a score of 0.79.
Chinese Translation
脑肿瘤是所有人群均可能面临的医学疾病。从医学角度来看,它被描述为非必需细胞在接近或遍布大脑的情况下传播。这种疾病的症状包括头痛、癫痫发作和感官变化。本研究探讨了两大类脑肿瘤:良性和恶性。良性肿瘤增长缓慢,而恶性肿瘤则表现出快速生长,具有潜在危险。脑肿瘤的早期识别是患者生存的重要因素。本研究提供了一种先进的方法,用于早期识别脑内肿瘤。我们实现了SegResNet架构,这是一种广泛应用于三维分割的架构,并采用自动多精度方法进行了训练。我们引入了Dice损失函数和Dice指标来评估模型。我们的Dice得分为0.84;肿瘤核心的Dice得分为0.84;整个肿瘤的Dice得分为0.90;增强肿瘤的得分为0.79。
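For readers unfamiliar with the metric reported above, the Dice score measures the overlap between a predicted mask A and a reference mask B as 2|A∩B| / (|A| + |B|). The following is a minimal sketch on flat binary masks; the function name `dice_score` and the empty-mask convention are assumptions for illustration, not the authors' evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total
```

A 3D volume can be scored the same way after flattening its voxels into a single sequence per class.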
cs.CV / 73 / 2605.04035

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

大规模高质量3D高斯头部重建的多视图捕捉方法
Ntavelis, Evangelos, Wu, Sean, Shahbazi, Mohamad, Maninchedda, Fabio, Kostiaev, Dmitry, Sevastopolsky, Artem, Megaro, Vittorio, Phillips, Trevor, Blumentals, Alejandro, Ravikumar, Shridhar, Gupta, Mehak, Knothe, Reinhard, Bayer, Jeronimo, Vestner, Matthias, Schaefer, Simon, Etterlin, Thomas, Zimmermann, Christian, Deschler, Mathias, Kaufmann, Peter, Brugger, Stefan, Martin, Sebastian, Amberg, Brian, Runia, Tom
Abstract
We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.
Chinese Translation
我们提出了一种名为HeadsUp的可扩展前馈方法,用于从大规模多摄像头设置中重建高质量的3D高斯头部。我们的方法采用高效的编码器-解码器架构,将输入视图压缩成紧凑的潜在表示。该潜在表示随后被解码为一组锚定在中性头部模板上的UV参数化3D高斯。这个UV表示将3D高斯的数量与输入图像的数量和分辨率解耦,从而能够使用多种高分辨率输入视图进行训练。我们在一个包含超过10,000个受试者的内部数据集上训练和评估我们的模型,该数据集的规模比现有的多视图人头数据集大一个数量级。HeadsUp实现了最先进的重建质量,并且无需在测试时优化即可对新身份进行泛化。我们广泛分析了模型在身份、视图和模型容量之间的扩展行为,揭示了质量与计算权衡的实际洞见。最后,我们通过展示两个下游应用(生成新颖的3D身份和用表情混合形状对3D头部进行动画处理)来突出我们的潜在空间的优势。
cs.CV / 74 / 2605.04040

Large Language Models are Universal Reasoners for Visual Generation

大型语言模型是视觉生成的普遍推理者
Ren, Sucheng, Chen, Chen, Wang, Zhenbang, Song, Liangchen, Zhu, Xiangxin, Yuille, Alan, Chen, Liang-Chieh, Lu, Jiasen
Abstract
Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural unification, these systems frequently fail to faithfully align complex prompts during synthesis, even though they remain highly accurate at verifying whether an image satisfies those same prompts. We formalize this as the \emph{understanding-generation gap} and propose UniReasoner, a framework that leverages the LLM as a universal reasoner to convert its understanding strength into direct generation guidance. Given a prompt, the LLM first produces a coarse visual draft composed of discrete vision tokens. It then performs a self-critique by evaluating the draft for prompt consistency, producing a grounded textual evaluation that pinpoints what needs to be corrected. Finally, a diffusion model is conditioned jointly on the prompt, the visual draft, and the evaluation, ensuring that generation is guided by explicit corrective signals. Each signal addresses a limitation of the other: the draft provides a concrete, scene-level anchor that reduces under-specification in text-only conditioning, while the evaluation turns verification into grounded, actionable constraints that correct omissions, hallucinations, and relational errors. Experiments show that UniReasoner improves compositional alignment and semantic faithfulness under the same diffusion backbone while maintaining image quality, demonstrating a practical way to exploit LLM reasoning to close the understanding-generation gap.
Chinese Translation
文本到图像的生成已经通过扩散模型迅速发展,从CLIP和T5的条件生成进展到统一系统,在这些系统中,单一的LLM骨架同时处理视觉理解和生成。尽管架构已经统一,这些系统在合成过程中常常无法忠实地对齐复杂提示,即便它们在验证图像是否满足这些提示方面仍然高度准确。我们将这一现象形式化为\textit{理解-生成差距},并提出UniReasoner框架,该框架利用LLM作为普遍推理者,将其理解能力转化为直接的生成指导。在给定提示的情况下,LLM首先生成由离散视觉标记组成的粗略视觉草图。然后,它通过评估草图的提示一致性进行自我批评,生成指明需要纠正内容的具体文本评估。最后,一个扩散模型同时以提示、视觉草图和评估进行条件生成,确保生成过程受到明确的修正信号的指导。每个信号针对其他信号的局限性:草图提供了一个具体的、场景级的锚点,降低了仅基于文本条件生成的不足规格,而评估则将验证转变为具体的、可操作的约束,修正遗漏、幻觉和关系错误。实验表明,在相同的扩散骨架下,UniReasoner改善了组合对齐和语义忠实性,同时保持图像质量,展示了一种利用LLM推理来缩小理解-生成差距的实用方法。
cs.CV / 75 / 2605.04044

UniCorrn: Unified Correspondence Transformer Across 2D and 3D

UniCorrn:跨越2D和3D的统一对应变换器
Goswami, Prajnan, Ding, Tianye, Liu, Feng, Jiang, Huaizu
Abstract
Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorrn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stack-able layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder and decoder components, trained jointly on diverse data combining pseudo point clouds from depth maps with real 3D correspondence annotations. UniCorrn achieves competitive performance on 2D-2D matching and surpasses prior state-of-the-art by 8% on 7Scenes (2D-3D) and 10% on 3DLoMatch (3D-3D) in registration recall. Project website: https://neu-vi.github.io/UniCorrn
Chinese Translation
跨越图像到图像(2D-2D)、图像到点云(2D-3D)和点云到点云(3D-3D)的视觉对应关系是众多3D视觉任务的基础。尽管具有相似的问题结构,当前方法采用任务特定设计,为每种模态组合使用独立模型。我们提出了UniCorrn,这是第一个具有共享权重的对应模型,统一了所有三项任务的几何匹配。我们的关键见解在于,Transformer注意力自然捕捉跨模态特征相似性。我们提出了一种双流解码器,保持独立的外观和位置特征流。这一设计通过可堆叠层使端到端学习成为可能,同时支持跨异构模态的灵活查询基础对应估计。我们的架构采用模态特定主干,随后是共享的编码器和解码器组件,在结合来自深度图的伪点云与真实3D对应标注的多样数据上进行联合训练。UniCorrn在2D-2D匹配上表现出竞争力,并在注册召回方面在7Scenes(2D-3D)上超过之前的最先进技术8%,在3DLoMatch(3D-3D)上超过10%。项目网址:https://neu-vi.github.io/UniCorrn
cs.CV / 76 / 2605.04045

Audio-Visual Intelligence in Large Foundation Models

大型基础模型中的音视频智能
Qin, You, Liu, Kai, Wu, Shengqiong, Wang, Kai, Deng, Shijian, Tian, Yapeng, Xiao, Junbin, Xing, Yazhou, Ma, Yinghao, Li, Bobo, Zimmermann, Roger, Cui, Lei, Wei, Furu, Luo, Jiebo, Fei, Hao
Abstract
Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, i.e., not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
Chinese Translation
音视频智能(Audio-Visual Intelligence, AVI)已经成为人工智能中的一个中心前沿,连接听觉和视觉模态,使机器能够在多模态的现实世界中感知、生成和互动。在大型基础模型的时代,音频和视觉的联合建模变得日益重要,这不仅关乎对动态、具有时间定位的信号的理解,也涉及可控的生成和推理。近期的进展,例如 Meta MovieGen 和 Google Veo-3,突显了工业和学术界对统一音视频架构的日益关注,这些架构能够从海量的多模态数据中学习。然而,尽管取得了快速进展,文献依然支离破碎,涵盖了不同的任务、不一致的分类法和异质的评估实践,阻碍了系统比较和知识整合。本文通过大型基础模型的视角提供了对音视频智能的首次全面综述。我们建立了一个统一的分类法,涵盖了音视频智能任务的广泛领域,从理解(例如,语音识别、声音定位)到生成(例如,音频驱动的视频合成、视频转音频)以及互动(例如,对话、具身或智能体式接口)。我们综合了方法论基础,包括模态标记化、跨模态融合、自回归和扩散生成、大规模预训练、指令对齐和偏好优化。此外,我们整理了具有代表性的数据集、基准和评估指标,提供了对任务类别的结构化比较,并识别了在同步、空间推理、可控性和安全性方面的开放挑战。通过将这一快速扩展的领域整合为一个连贯的框架,本文旨在成为未来大型音视频智能研究的基础参考。
人工智能 (Artificial Intelligence)
47
cs.AI / 1 / 2605.02910

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

CreativityBench: 通过基于赋能的工具再利用评估智能体的创造性推理
Qian, Cheng, Ha, Hyeonjeong, Liu, Jiayu, He, Bingxiang, Kim, Jeonghwan, Liu, Jiateng, Li, Bingxuan, Tiwari, Aditi, Dalal, Dwip, Wang, Zhenhailong, Chen, Xiusi, Namazifar, Mahdi, Li, Yunzhu, Ji, Heng
Abstract
Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative problem-solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations, explicitly linking objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints. Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object, but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Furthermore, improvements from model scaling quickly saturate, strong general reasoning does not reliably translate to creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for planning and reasoning modules in future agents.
Chinese Translation
近期大语言模型的进展在推理和环境交互任务中取得了显著表现,但其创造性解决问题的能力仍然未被充分探索。我们通过创造性工具使用的视角研究这一能力,其中模型通过推理可用对象的赋能和属性来重新利用这些对象,而不是依赖于标准用法。作为第一步,我们引入了CreativityBench,一个用于评估大语言模型(LLMs)基于赋能创造力的基准。为此,我们构建了一个包含4K实体和150K+赋能注释的大规模赋能知识库(KB),明确链接对象、部分、属性和可操作的用途。在此知识库基础上,我们生成了14K个需要在限制条件下识别非显而易见但物理上合理的解决方案的具体任务。对10个最先进的大语言模型,包括封闭和开源模型的评估表明,模型通常能够选择一个合理的对象,但无法识别正确的部分、它们的赋能以及解决任务所需的基本物理机制,导致性能显著下降。此外,模型扩展带来的改进迅速饱和,强推理能力未能可靠地转化为创造性赋能发现,常见的推理时间策略如链式思维(Chain-of-Thought)仅带来有限的收益。这些结果表明,创造性工具使用仍然是当前模型面临的主要挑战,CreativityBench为研究这一智能缺失维度提供了一个有用的测试平台,并可能对未来智能体中的规划和推理模块产生影响。
cs.AI / 2 / 2605.03034

Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

稳定的代理控制:用于自主网络防御的工具中介LLM架构
Prinos, Kerri, Brush, Lilianne, Denton, Cameron, Wang, Zhanqi, Knox, Joshua, Antani, Snehal, Foltz, Anton, Villaseñor, Amy
Abstract
Agentic systems involved in high-stake decision-making under adversarial pressure need formal guarantees not offered by existing approaches. Motivated by the operational needs of security operations centers (SOCs) that must configure endpoint detection and response (EDR) policies under adversarial pressure, we present a tool-mediated architecture: LLM agents use deterministic tools (Stackelberg best-response, Bayesian observer updates, attack-graph primitives) and select from finite action catalogs enforced at the tool-output interface. A composite Lyapunov function machine-checked in Lean 4 with zero sorry certifies controllability, observability from asymmetric sensor data, and Input-to-State Stability (ISS) robustness under intelligent adversarial disturbance, with two corollaries extending the certificate to any controller or adversary from the catalogs. On 282 real enterprise attack graphs, the claims hold with margin. On paired offensive/defensive telemetry, a tool-mediated Claude Sonnet 4 controller reduces the attacker's expected payoff (game value) by 59% relative to a deterministic greedy baseline, with zero variance across 40 runs at four temperatures. A Claude Haiku 4.5 controller converges to suboptimal game values but stays catalog-bounded over an additional 40 runs, demonstrating that architectural stability is not dependent on the controller capability. The LLM agent's non-determinism furthers creative exploration of strategies, while the tool-mediated architecture ensures system stability.
Chinese Translation
在对抗性压力下参与高风险决策的代理系统需要现有方法无法提供的形式化保证。鉴于安全运营中心(SOCs)在对抗性压力下必须配置端点检测和响应(EDR)策略的运营需求,我们提出了一种工具中介架构:LLM代理使用确定性工具(Stackelberg最佳响应、贝叶斯观测者更新、攻击图原语)并从工具输出接口强制执行的有限动作目录中选择。一个在Lean 4中经过机器验证、且不含任何 sorry(未完成证明的占位符)的复合李雅普诺夫函数证明了在智能对手干扰下的可控性、从不对称传感器数据中的可观测性以及输入到状态稳定性(ISS)稳健性,并且有两个推论将该证明扩展到目录中的任何控制器或对手。在282个真实企业攻击图上,这些主张均有充足的余地。在配对的进攻/防御遥测中,工具中介的Claude Sonnet 4控制器相对于确定性贪婪基线将攻击者的预期收益(博弈值)降低了59%,在四个温度下的40次运行中方差为零。Claude Haiku 4.5控制器收敛到次优博弈值,但在额外的40次运行中始终保持在目录界限之内,表明架构稳定性并不依赖于控制器的能力。LLM代理的非确定性进一步促进了策略的创造性探索,而工具中介架构确保了系统的稳定性。
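The two mechanisms named in this abstract, a Stackelberg best-response tool and a finite action catalog enforced at the tool-output interface, can be sketched on a toy payoff table. Everything below is a hypothetical illustration under a pure-strategy, zero-sum-style assumption: the action names, the payoff numbers, and the fallback rule are invented for the example and are not taken from the paper.

```python
def attacker_best_response(payoff, defense):
    """Attack with the highest attacker payoff against a committed defense."""
    row = payoff[defense]
    return max(sorted(row), key=row.get)


def stackelberg_defense(payoff):
    """Defense whose worst-case (best-responding) attacker payoff is smallest."""
    return min(sorted(payoff), key=lambda d: max(payoff[d].values()))


def enforce_catalog(proposed, catalog, fallback):
    """Accept a controller's proposed action only if it is in the finite catalog."""
    return proposed if proposed in catalog else fallback


# Hypothetical attacker payoffs: payoff[defense][attack].
PAYOFF = {
    "isolate_host": {"phish": 2, "lateral": 1},
    "block_macro": {"phish": 0, "lateral": 5},
    "monitor": {"phish": 4, "lateral": 4},
}
```

In this toy table the defender commits to `isolate_host`, since its worst-case attacker payoff (2) is lower than that of the other two actions (5 and 4). The catalog check is what keeps a non-deterministic controller's output bounded regardless of what it proposes.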
cs.AI / 3 / 2605.03067

Computing Thiele Rules on Interval Elections and their Generalizations

间隔选举中的 Thiele 规则及其推广的计算
Avramidis, Dimitris, Lassota, Alexandra, Schmidt-Kraepelin, Ulrike, Vetta, Adrian
Abstract
Approval-based committee voting has received significant attention in the social choice community. Among the studied rules, Thiele rules, and especially Proportional Approval Voting (PAV), stand out for desirable properties such as proportional representation, Pareto optimality, and support monotonicity. Their main drawback is that computing a Thiele outcome is NP-hard in general. A glimpse of hope comes from the fact that Thiele rules are better behaved under structured preferences. On the candidate interval (CI) domain, they are computable in polynomial time via a linear program (LP) that has a totally unimodular constraint matrix. Surprisingly, this approach fails for the related voter interval (VI) domain, and the complexity of the problem has repeatedly been posed as an open question. Our main result resolves this question: although the relevant matrix is not totally unimodular, the ``standard'' LP still admits at least one optimal integral solution, and we provide a fast algorithm for finding it. Our technique naturally extends to the voter-candidate interval (VCI) domain, also known as the 1-dimensional voter-candidate range (1D-VCR) domain, and to the linearly consistent (LC) domain, both of which generalize the candidate and voter interval domains. Although both the VCI and LC domains have been studied in social choice, their relationship was unknown. We show, through connections to graph theory, that LC strictly contains VCI. We also provide an alternative definition of LC that is closer in spirit to VCI and has a natural interpretation in approval elections; this equivalence may be of independent interest. Finally, we study an alternative tree-based generalization of VCI and show that Thiele rules become NP-hard to compute on this domain.
Chinese Translation
基于批准的委员会投票在社会选择领域引起了广泛关注。在研究的规则中,Thiele 规则,尤其是比例批准投票 (Proportional Approval Voting, PAV),因其诸如比例代表性、帕累托最优性和支持单调性等理想属性而脱颖而出。其主要缺陷是计算 Thiele 结果在一般情况下是 NP-hard 的。希望的曙光在于,Thiele 规则在结构化偏好下表现更佳。在候选人区间 (Candidate Interval, CI) 领域,它们可以通过约束矩阵为全幺模矩阵的线性规划 (Linear Program, LP) 在多项式时间内计算。令人惊讶的是,该方法在相关的选民区间 (Voter Interval, VI) 领域失效,该问题的复杂性多次被提出为开放问题。我们的主要结果解决了这个问题:尽管相关矩阵不是全幺模的,``标准''的 LP 仍然至少存在一个最优整数解,并且我们提供了一个快速算法来找到该解。我们的方法自然扩展到选民-候选人区间 (Voter-Candidate Interval, VCI) 领域,也称为一维选民-候选人范围 (1-dimensional Voter-Candidate Range, 1D-VCR) 领域,以及线性一致域 (Linearly Consistent, LC) 领域,这两个领域都推广了候选人和选民区间领域。尽管 VCI 和 LC 领域在社会选择中都有研究,但它们之间的关系尚不明确。我们通过与图论的联系展示 LC 严格包含 VCI。我们还提供了一个与 VCI 精神相近的 LC 替代定义,并在批准选举中具有自然解释;这种等价性本身可能也值得关注。最后,我们研究了 VCI 的另一种基于树的推广,并表明 Thiele 规则在这个领域的计算变得 NP-hard。
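As background for the rule discussed above, PAV gives a committee W the score Σ_v H(|A_v ∩ W|), where A_v is voter v's approval set and H(j) = 1 + 1/2 + ... + 1/j. The sketch below finds a winner by exhaustive search, which is exponential in general and is not the paper's LP-based algorithm for interval domains; the function names and the toy ballots are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations


def pav_score(approvals, committee):
    """Sum over voters of 1 + 1/2 + ... + 1/j, with j = approved committee members."""
    members = set(committee)
    total = Fraction(0)
    for ballot in approvals:
        j = len(members & set(ballot))
        total += sum(Fraction(1, i) for i in range(1, j + 1))
    return total


def pav_winner(approvals, candidates, k):
    """Brute-force a size-k committee with maximum PAV score (first one on ties)."""
    return max(combinations(sorted(candidates), k),
               key=lambda c: pav_score(approvals, c))
```

With ballots `[{"a", "b"}, {"a", "b"}, {"a", "b"}, {"c"}]` and k = 2, the committee `("a", "b")` scores 9/2 while `("a", "c")` scores 4, so the search returns `("a", "b")`.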
cs.AI / 4 / 2605.03078

Making the Invisible Visible: Understanding the Mismatch Between Organizational Goals and Worker Experiences in AI Adoption

让隐形变为可见:理解人工智能采纳中组织目标与员工体验之间的不匹配
Lee, Christine P., Lee, Min Kyung, Mutlu, Bilge
Abstract
While AI is often introduced into organizations to drive innovation and efficiency, many adoption efforts fail as workers resist and struggle to integrate these systems. These failures point to a deeper issue: workers, the very people expected to collaborate with AI, are often invisible in decisions about how AI is designed and used. Drawing on interviews with professionals who interact with AI systems daily in healthcare, finance, and management, we examine the disconnect between organizational expectations and worker experiences. We identify key barriers, including poor usability and interoperability, misaligned expectations, limited control, and insufficient communication. These challenges highlight a gap between how organizations implement AI and the evolving worker needs, tasks, and workflows that it fails to support. We argue that successful adoption requires recognizing workers as central to AI integration and propose adaptation strategies at the individual, task, and organizational levels to better align AI systems with real-world practices.
Chinese Translation
尽管人工智能(AI)通常被引入组织以推动创新和提高效率,但许多采纳努力失败,因为员工抵制并难以整合这些系统。这些失败指向更深层次的问题:员工,即那些被期待与AI协作的人,往往在关于AI如何设计和使用的决策中被忽视。通过对在医疗、金融和管理领域每日与AI系统互动的专业人士进行访谈,我们考察了组织期望与员工体验之间的脱节。我们识别了关键障碍,包括可用性差和互操作性不足、期望不一致、控制权有限以及沟通不充分。这些挑战突显了组织实施AI与不断演变的员工需求、任务和工作流程之间的差距,这些差距使得AI未能提供足够的支持。我们主张,成功的采纳需要将员工视为AI整合的核心,并提出在个人、任务和组织层面上的适应策略,以更好地将AI系统与现实世界的实践对接。
cs.AI / 5 / 2605.03101

Programmatic Context Augmentation for LLM-based Symbolic Regression

基于程序上下文增强的符号回归LLM方法
Liu, Hao, Yang, Xiao-Wen, Sehgal, Atharva, Wang, Yixin, Guo, Lan-Zhe, Li, Yu-Feng, Yue, Yisong
Abstract
Symbolic regression (SR), the task of discovering mathematical expressions that best describe a given dataset, remains a fundamental challenge in scientific discovery. Traditional approaches, primarily based on genetic algorithms and related evolutionary methods, have proven useful but suffer from scalability and expressivity limitations. Recently, large language model (LLM)-based evolutionary search methods have been introduced into SR and show promise. However, existing LLM-based approaches typically rely on scalar evaluation metrics, such as mean squared error, as the sole source of feedback during the search process, thereby overlooking the rich information embedded in the dataset. To address this limitation, we propose a novel LLM-based evolutionary search framework that incorporates programmatic context augmentation. By enabling code-based interactions with the dataset, our method can actively perform data analysis and extract informative signals, beyond aggregated evaluation scores. We evaluate our framework on advanced benchmarks, such as LLM-SRBench, and demonstrate superior efficiency and accuracy compared to strong baselines.
Chinese Translation
符号回归(Symbolic Regression,SR)是发现最佳描述给定数据集的数学表达式的任务,仍然是科学发现中的一个基本挑战。传统方法,主要基于遗传算法和相关的进化方法,已证明有用,但在可扩展性和表现力方面存在局限。最近,基于大型语言模型(Large Language Model,LLM)的进化搜索方法已被引入到SR中,并表现出良好的前景。然而,现有的基于LLM的方法通常依赖于标量评估指标,如均方误差,作为搜索过程中的唯一反馈来源,从而忽视了数据集中蕴含的丰富信息。为了解决这一局限性,我们提出了一种新颖的基于LLM的进化搜索框架,该框架结合了程序上下文增强。通过实现与数据集的基于代码的交互,我们的方法能够主动进行数据分析并提取信息信号,超越汇总的评估得分。我们在先进的基准(如LLM-SRBench)上对我们的框架进行了评估,并展示了相较于强基线的优越效率和准确性。
cs.AI / 6 / 2605.03149

Are you with me? A Framework for Detecting Mental Model Discrepancies in Task-Based Team Dialogues

你跟我在一起吗?任务驱动团队对话中心理模型差异检测的框架
Kowalyshyn, Katharine, Scheutz, Matthias
Abstract
Humans typically use natural language to update teammates on task states. Since not all updates are communicated, discrepancies arise between the team members' mental models that negatively affect overall team performance. How can we categorize such discrepancies? Do misalignments detected in team dialogue predict future mental model misalignments? Traditional shared mental model (SMM) assessment methods rely on retrospective expert coding that cannot capture real-time coordination dynamics. We propose a framework to identify and categorize four types of mental model discrepancies: unsupported beliefs, false beliefs, belief contradictions, and omissions, all of which can naturally emerge in team dialogues. Using dialogues from twenty dyad teams performing collaborative object identification tasks across four sequential levels, we demonstrate that these discrepancy patterns contain predictive signals. Averaging historical discrepancy counts achieves meaningful prediction accuracy using uniform weighting as an exploratory baseline, with differential predictability across discrepancy types.
Chinese Translation
人类通常使用自然语言来更新队友关于任务状态的信息。然而,并非所有更新都被传达,这导致团队成员之间的心理模型出现差异,从而对整体团队绩效产生负面影响。我们该如何对这些差异进行分类?在团队对话中发现的错位是否能够预测未来的心理模型错位?传统的共享心理模型(SMM)评估方法依赖于事后专家编码,无法捕捉实时协调动态。我们提出了一个框架来识别和分类四种类型的心理模型差异:不支持的信念、错误的信念、信念矛盾及遗漏,这些差异都可能在团队对话中自然出现。通过分析二十个二人团队在四个顺序级别上执行协作对象识别任务的对话,我们表明这些差异模式中包含预测信号。使用均匀加权作为探索基线计算历史差异计数的平均值,实现了有意义的预测准确性,并且不同差异类型之间的可预测性也有所差异。
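The exploratory baseline described above, averaging historical discrepancy counts with uniform weights, can be sketched as follows. The type labels mirror the four categories in the abstract, but the data layout (one dictionary of counts per earlier level) and the function name are assumptions for illustration.

```python
DISCREPANCY_TYPES = (
    "unsupported_belief",
    "false_belief",
    "belief_contradiction",
    "omission",
)


def predict_next_level(history):
    """Predict next-level counts as the unweighted mean over all earlier levels.

    history is a list with one dict per completed level, mapping a
    discrepancy type to how often it was observed (missing keys count as 0).
    """
    if not history:
        raise ValueError("need at least one earlier level")
    return {
        kind: sum(level.get(kind, 0) for level in history) / len(history)
        for kind in DISCREPANCY_TYPES
    }
```

Keeping one prediction per type makes it straightforward to compare how predictable each discrepancy category is.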
cs.AI / 7 / 2605.03159

Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents

从示例中学习正确行为:验证自主智能体的顺序执行
Sharma, Reshabh K, Mittal, Gaurav, Hu, Yu
Abstract
As autonomous agents become increasingly sophisticated, validating their sequential behavior presents a significant challenge. Traditional testing approaches require manual specification, exact sequence matching, or thousands of training examples. We present a novel algorithm that automatically learns correct behavior from just 2-10 passing execution traces and validates new executions against this learned model. Our approach combines dominator analysis from compiler theory with multimodal large language model-powered semantic understanding to identify essential states and handle non-deterministic behavior. The system constructs a generalized ground truth model using Prefix Tree Acceptors, merges traces through multi-tiered equivalence detection, and validates new executions via topological subsequence matching. In controlled experiments, our system achieved high accuracy in detecting product bugs and false successes using only 3 training traces. This approach provides explainable validation results with coverage metrics and works across diverse domains including UI testing, code generation, and robotic processes.
Chinese Translation
随着自主智能体变得越来越复杂,验证其顺序行为成为一项重大挑战。传统的测试方法要求手动规范、精确的序列匹配或成千上万的训练示例。我们提出了一种新颖的算法,它仅从2到10个成功的执行轨迹中自动学习正确行为,并将新的执行与此学习模型进行验证。我们的方法结合了编译理论中的支配分析与多模态大型语言模型驱动的语义理解,以识别关键状态并处理非确定性行为。该系统使用前缀树接受器构建广义的真实模型,通过多层等价检测合并轨迹,并通过拓扑子序列匹配验证新的执行。在受控实验中,我们的系统以仅使用3个训练轨迹就高效检测产品缺陷和虚假成功,达到了高准确率。该方法提供了具有覆盖率指标的可解释验证结果,并可以应用于UI测试、代码生成和机器人过程等多个领域。
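A much-simplified sketch of the pipeline described above: passing traces are merged into a Prefix Tree Acceptor, states shared by every passing trace serve as a crude stand-in for the dominator analysis, and a new execution is accepted if those states appear in order as a subsequence. The state labels, the exact-string equivalence (the paper uses multi-tiered, LLM-assisted equivalence), and the function names are assumptions for illustration.

```python
def build_pta(traces):
    """Prefix Tree Acceptor as nested dicts; '$accept' marks the end of a trace."""
    root = {}
    for trace in traces:
        node = root
        for state in trace:
            node = node.setdefault(state, {})
        node["$accept"] = True
    return root


def essential_states(traces):
    """States present in every passing trace, in the order of the first trace."""
    common = set(traces[0]).intersection(*(set(t) for t in traces[1:]))
    ordered = []
    for state in traces[0]:
        if state in common and state not in ordered:
            ordered.append(state)
    return ordered


def is_subsequence(required, execution):
    """True if `required` occurs in `execution` in order, gaps allowed."""
    remaining = iter(execution)
    return all(state in remaining for state in required)
```

Allowing gaps is what lets extra, non-essential steps (a retry, an optional filter) pass validation, while a missing essential state such as a skipped login fails it.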
cs.AI / 8 / 2605.03195

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

Terminus-4B:较小模型能否替代前沿大型语言模型在代理执行任务中的作用?
Garg, Spandan, Nitin, Vikram, Huang, Yufan
Abstract
Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keeps the main agent's context window clean by isolating verbose outputs (e.g. build logs, test results, etc.) within the subagent context. Typically when agents employ subagents for such tasks, they use frontier models as these subagents. In this paper, we investigate whether a finetuned small language model (SLM) can achieve comparable performance to frontier models in the task of agentic terminal execution. We present Terminus-4B, which is a post-trained Qwen3-4B model via Supervised Finetuning (SFT) and Reinforcement Learning (RL) using rubric-based LLM-as-judge reward, specifically for this task. In our extensive evaluation spanning various frontier models, training ablations and main agent configurations, we find that Terminus-4B is able to reduce the token usage of the main agent by up to ~30% compared to the No Subagent baseline with no impact to agent performance on benchmarks like SWE-Bench Pro and our internal SWE-Bench C# benchmark, which tends to be heavy in verbose execution tasks. Furthermore, Terminus-4B improves key metrics showing the main agent relying on the outputs of the subagent and doing fewer terminal execution tasks by itself. We see that our model not only closes the gap between the Vanilla Qwen model and frontier models like Claude Sonnet / Opus / GPT-5.3-Codex, but often even exceeds their performance.
Chinese Translation
现代编码代理越来越多地将专业子任务委派给子代理,这些子代理是更小、更专注的代理循环,负责处理狭窄的任务,如搜索、调试或终端执行。这种架构模式通过将冗长输出(例如构建日志、测试结果等)隔离在子代理的上下文中,从而保持主代理的上下文窗口的整洁。通常,当代理为此类任务使用子代理时,会采用前沿模型作为这些子代理。在本文中,我们研究了经过微调的小型语言模型(SLM)是否能够在代理终端执行任务中达到与前沿模型相当的表现。我们提出了Terminus-4B,这是一个通过监督微调(Supervised Finetuning,SFT)和强化学习(Reinforcement Learning,RL)使用基于评估标准的LLM作为评判奖励后训练的Qwen3-4B模型,专门用于该任务。在我们针对各种前沿模型、训练消融和主代理配置进行的广泛评估中,我们发现Terminus-4B能够将主代理的令牌使用量减少多达约30%,相比于没有子代理的基线,且对代理在如SWE-Bench Pro和我们内部的SWE-Bench C#基准测试中的表现没有影响,而后者通常在冗长的执行任务中较为繁重。此外,Terminus-4B在关键指标上取得了改进,表明主代理依赖子代理的输出,并减少了自身的终端执行任务。我们观察到我们的模型不仅缩小了Vanilla Qwen模型与Claude Sonnet / Opus / GPT-5.3-Codex等前沿模型之间的差距,而且往往甚至超越了它们的表现。
cs.AI / 9 / 2605.03202

Stop Automating Peer Review Without Rigorous Evaluation

停止在没有严格评估的情况下自动化同行评审
Baumann, Joachim, Pei, Jiaxin, Koyejo, Sanmi, Hovy, Dirk
Abstract
Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for automation. We argue that addressing the peer review crisis requires a science of peer review automation -- not general-purpose LLMs deployed without rigorous evaluation.
Chinese Translation
大型语言模型为解决同行评审危机提供了诱人的解决方案。本立场论文主张,当今的人工智能系统不应被用于生成论文评审。我们的立场基于对 ICLR 2026 中人类撰写与人工智能生成的评审的实证比较,以及对自动化论文改写对不同人工智能评审者所产生影响的评估。我们识别出两个关键问题:1) 人工智能评审者表现出蜂群思维效应,即在同一论文内部和不同论文之间过度一致,从而降低了视角多样性;2) 人工智能评审分数很容易通过论文洗稿(paper laundering)被操控:提示大型语言模型(LLM)重写一篇论文可能显著提高人工智能评审者的评分,这表明LLM评审者很容易被风格上的改动而非科学结果所操控。然而,不可操控性和评审多样性是自动化的必要条件,但并非充分条件。我们认为,解决同行评审危机需要一门关于同行评审自动化的科学,而不是在没有严格评估的情况下部署通用的大型语言模型。
cs.AI / 10 / 2605.03212

ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms

ADAPTS:面向自动化无协议症状跟踪的自主分解
Vail, Alexandria K., Cicconet, Marcelo, Doorn, Katie Aafjes-van, Maroney, Ryan, Aafjes, Marc
Abstract
Modeling latent clinical constructs from unconstrained clinical interactions is a unique challenge in affective computing. We present ADAPTS (Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms), a framework for automated rating of depression and anxiety severity using a mixture-of-agents LLM architecture. This approach decomposes long-form clinical interviews into symptom-specific reasoning tasks, producing auditable justifications while preserving temporal and speaker alignment. Generalization was evaluated across two independent datasets ($N=204$) with distinct interview structures. On high-discrepancy interviews, automated ratings approximated expert benchmarks ($\text{absolute error}=22$) more closely than original human ratings ($\text{absolute error}=26$). Implementing an ``extended'' protocol that incorporates qualitative clinical conventions significantly stabilized ratings, with absolute agreement reaching $\text{ICC(2,1)} = 0.877$. These findings suggest that the ADAPTS framework enables promising evaluations of psychiatric severity. While the current implementation is purely text-based, the underlying architecture is readily extensible to multimodal inputs, including acoustic and visual features. By approximating expert-level precision in a protocol-agnostic manner, this framework provides a foundation for objective and scalable psychiatric assessment, especially in resource-limited settings.
Chinese Translation
从不受限制的临床互动中建模潜在的临床构念是情感计算中的一个独特挑战。我们提出了ADAPTS(面向自动化无协议症状跟踪的自主分解),这是一个使用混合代理大语言模型(LLM)架构自动评估抑郁和焦虑严重程度的框架。该方法将长篇临床访谈分解为特定症状的推理任务,生成可审计的解释,同时保持时间和发言者的对应关系。我们在两个独立数据集($N=204$)中评估了泛化性能,这些数据集具有不同的访谈结构。在高差异性访谈中,自动评分($\text{absolute error}=22$)比原始人类评分($\text{absolute error}=26$)更接近专家基准。实施包含定性临床惯例的“扩展”协议显著稳定了评分,绝对一致性达到$\text{ICC(2,1)} = 0.877$。这些发现表明,ADAPTS框架有望用于评估精神疾病严重程度。虽然当前实现完全基于文本,但基础架构很容易扩展到包括声学和视觉特征的多模态输入。通过以无协议的方式逼近专家级精度,该框架为客观和可扩展的精神评估提供了基础,尤其在资源有限的环境中。
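The agreement statistic quoted above, ICC(2,1), is the two-way random-effects, absolute-agreement, single-rater intraclass correlation: (MSR - MSE) / (MSR + (k-1)·MSE + k·(MSC - MSE)/n) for n subjects and k raters, where MSR, MSC and MSE are the between-subject, between-rater and residual mean squares. The sketch below computes it from a ratings table; the function name and the table layout are assumptions, and it is not the authors' analysis code.

```python
def icc_2_1(ratings):
    """ICC(2,1) for ratings[i][j] = score that rater j gave subject i."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for subjects (rows), raters (columns) and the total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Unlike consistency-type coefficients, this form penalizes systematic offsets between raters through the MSC term, which is why it is described as absolute agreement.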
cs.AI / 11 / 2605.03227

Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

评估大型语言模型中的基于提示和执行的方法对于确定性计算的效果
Yu, Hongkun
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks requiring precise and error-free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence-based tasks. CoT provides limited improvement, while Least-to-Most suffers from error accumulation. In contrast, PoT achieves perfect accuracy by generating executable code and delegating computation to an external interpreter. Self-Consistency improves robustness through majority voting, but incurs substantial computational overhead. We further train a small domain-specific model (CodeT5-small) to generate executable programs, which achieves perfect accuracy on held-out synthetic test data across all tasks with minimal training cost. Overall, our findings suggest that LLMs may simulate reasoning patterns rather than reliably perform exact symbolic computation. For deterministic tasks, combining LLMs with external tools or using specialized models provides a more reliable and efficient solution.
Chinese Translation
大型语言模型(LLMs)在自然语言理解和推理方面展现出强大的能力。然而,它们执行精确、确定性计算的能力仍不明晰。在本研究中,我们系统地评估了多种提示策略,包括思维链(Chain-of-Thought, CoT)、从少到多分解法(Least-to-Most decomposition)、思想程序(Program-of-Thought, PoT)和自一致性(Self-Consistency, SC),针对需要准确且无误输出的任务,如二进制计数、最长子字符串检测和算术评估。为了支持这项研究,我们引入了一个包含多样化自然语言指令的合成数据集,使得对多种任务类型的精确计算进行控制评估成为可能。我们的结果表明,标准提示方法在基于序列的任务上,仅能实现中等准确性。CoT提供有限的改进,而Least-to-Most则受到错误累积的影响。相比之下,PoT通过生成可执行代码并将计算委托给外部解释器,实现了完美的准确率。自一致性通过多数投票提高了鲁棒性,但带来了可观的计算开销。我们进一步训练了一个小型领域特定模型(CodeT5-small),用于生成可执行程序,该模型在所有任务的保留合成测试数据上以最低的训练成本达到了完美的准确率。总体而言,我们的研究结果表明,LLMs可能模拟推理模式,而不是可靠地执行精确的符号计算。对于确定性任务,将LLMs与外部工具结合或使用专门模型提供了更可靠且高效的解决方案。
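As a minimal illustration of the Self-Consistency step mentioned in the abstract above, the sketch below applies majority voting to several sampled answers. The function name `majority_vote` and the sample answers are hypothetical; this digest does not describe the paper's actual sampling or answer-extraction procedure, so treat this as an assumption-laden sketch rather than the author's implementation.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among sampled model outputs."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalise whitespace so that "12" and " 12" count as the same answer.
    cleaned = [a.strip() for a in answers]
    winner, _count = Counter(cleaned).most_common(1)[0]
    return winner

# Hypothetical answers from five independent samples of the same prompt.
samples = ["12", "12", "13", " 12", "11"]
print(majority_vote(samples))
```

Each vote needs one model call per sample, which is consistent with the computational overhead the abstract attributes to Self-Consistency.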
cs.AI / 12 / 2605.03231

cotomi Act: Learning to Automate Work by Watching You

cotomi Act:通过观察您来学习自动化工作
Oyamada, Masafumi, Takeoka, Kunihiro, Akimoto, Kosuke, Obara, Ryoma, Enomoto, Masafumi, Zhang, Haochen, Haraguchi, Daichi, Tamura, Takuya
Abstract
What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser-based computer-using agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline. For organizational knowledge, a behavior-to-knowledge pipeline passively observes the user's browsing and progressively abstracts it into artifacts (task boards, wiki) exposed through a shared workspace editable by both user and agent. A controlled proxy evaluation confirms that task success improves as behavior-derived knowledge accumulates. In our live demonstration, attendees interact with the system in a real browser, issuing tasks and observing end-to-end autonomous execution and shared knowledge management.
Chinese Translation
如果一个浏览器代理可以通过观察您的工作来学习您所做的事情,那会怎样?我们提出了 cotomi Act,这是一款基于浏览器的计算机代理,它将可靠的多步骤任务执行与基于用户行为学习到的持久组织知识相结合。在执行方面,采用自适应惰性观察、基于语言差异的历史压缩、粗粒度动作和通过最佳N个动作选择进行测试时缩放的代理架构在179个任务的WebArena人类评估子集中达到了80.4%的准确率,超出了报告的78.2%人类基线。在组织知识方面,一个行为到知识的管道被动地观察用户的浏览行为,并逐步将其抽象为通过一个可由用户和代理共同编辑的共享工作区展示的工件(任务板、维基)。控制代理评估确认,随着行为派生知识的积累,任务成功率提高。在我们的现场演示中,与会者在真实浏览器中与系统互动,发出任务并观察端到端的自主执行和共享知识管理。
cs.AI / 13 / 2605.03242

Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

增强智能体的安全判断:用于欺骗性分布外场景的控制基准重写与类比推理
Zhang, Zuoyu, Zhu, Yancheng
Abstract
Tool-using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating-system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially overstating a model's ability to judge deceptive or ambiguous trajectories. To address this gap, we introduce ROME (Red-team Orchestrated Multi-agent Evolution), a controlled benchmark-construction pipeline that rewrites known unsafe trajectories into more deceptive evaluation instances while preserving their underlying risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances spanning contextual ambiguity, implicit risks, and shortcut decision-making. Experiments show that these challenge sets substantially degrade safety-judgment performance, with hidden-risk cases remaining particularly non-trivial even for recent frontier models. We further study ARISE (Analogical Reasoning for Inference-time Safety Enhancement), a retrieval-guided inference-time enhancement that retrieves ReAct-style analogical safety trajectories from an external analogical base and injects them as structured reasoning exemplars. ARISE improves judgment quality without retraining, but is best viewed as a task-specific robustness enhancement rather than a standalone safety guarantee. Together, ROME and ARISE provide practical tools for stress-testing and improving agent safety judgment under deceptive distribution shifts.
Chinese Translation
借助大型语言模型(LLMs)的工具使用智能体系统在网络、应用程序、操作系统和事务环境中的部署日益增多。然而,现有的安全基准仍然强调显式风险,可能夸大了模型判断欺骗性或模糊性轨迹的能力。为了解决这一问题,我们引入了ROME(Red-team Orchestrated Multi-agent Evolution),这是一个控制的基准构建管道,它将已知的不安全轨迹重写为更具欺骗性的评估实例,同时保留其基本风险标签。ROME从100个不安全源轨迹出发,生成300个挑战实例,涵盖上下文模糊性、隐性风险和快捷决策。实验表明,这些挑战集显著降低了安全判断性能,尤其是隐藏风险案例,即使对于最前沿的模型也仍然相当具有挑战性。我们进一步研究了ARISE(Analogical Reasoning for Inference-time Safety Enhancement),这是一种检索引导的推理时间增强方法,它从外部类比库中检索ReAct风格的类比安全轨迹,并将其注入作为结构化推理示例。ARISE在不重新训练的情况下提高了判断质量,但更适合作为一种针对特定任务的鲁棒性增强,而非独立的安全保障。总体而言,ROME和ARISE为在欺骗性分布变化下进行智能体安全判断的压力测试和改进提供了实用工具。
cs.AI / 14 / 2605.03308

Revisiting the Travel Planning Capabilities of Large Language Models

重新审视大语言模型的旅行规划能力
Zhang, Bo-Wen, Ye, Jin, Hua, Peng-Yu, Cao, Jia-Wei, Shao, Jie-Jing, Li, Yu-Feng, Guo, Lan-Zhe
Abstract
Travel planning serves as a critical task for long-horizon reasoning, exposing significant deficits in LLMs. However, existing benchmarks and evaluations primarily assess final plans in an end-to-end manner, which lacks interpretability and makes it difficult to analyze the root causes of failures. To bridge this gap, we decompose travel planning into five constituent atomic sub-capabilities, including \emph{Constraint Extraction}, \emph{Tool Use}, \emph{Plan Generation}, \emph{Error Identification}, and \emph{Error Correction}. We implement a decoupled evaluation protocol leveraging oracle intermediate contexts to rigorously isolate these components, thereby measuring the atomic performance boundary without the noise of cascading errors. Our results highlight a clear contrast in performance: while LLMs are proficient in extracting explicit constraints, they struggle to infer implicit, open-world requirements. Furthermore, they exhibit structural biases in plan generation and suffer from ineffective self-correction, characterized by excessive sensitivity and erroneous persistence. These findings offer precise directions for improving LLM reasoning and planning abilities.
Chinese Translation
旅行规划作为一项重要的长时间推理任务,暴露了大语言模型(LLMs)的重大不足。然而,现有的基准和评估主要以端到端的方式评估最终计划,这种方法缺乏可解释性,难以分析失败的根本原因。为了弥补这一空白,我们将旅行规划分解为五个基本的原子子能力,包括:\textit{约束提取}、\textit{工具使用}、\textit{计划生成}、\textit{错误识别}和\textit{错误纠正}。我们实施了一种解耦的评估方案,利用oracle中间上下文严格隔离这些组件,从而在不受级联错误影响的情况下测量原子性能边界。我们的结果突显了性能的明显对比:尽管LLMs在提取明确约束方面表现出色,但在推断隐含的开放世界需求时却显得困难。此外,它们在计划生成中呈现结构性偏见,且自我纠正能力低下,表现为过度敏感和错误持续性。以上发现为改善LLM的推理和规划能力提供了明确的方向。
cs.AI / 15 / 2605.03339

Automated Large-scale CVRP Solver Design via LLM-assisted Flexible MCTS

基于大型语言模型辅助的灵活蒙特卡罗树搜索自动化大型CVRP求解器设计
Guo, Tong, Chen, Caishun, Ong, Yew Soon
Abstract
Solving large-scale CVRP (LSCVRP) with hundreds to thousands of nodes remains difficult for even state-of-the-art solvers. Divide-and-conquer can scale by decomposing the instance into size-reduced subproblems, but designing decomposition logic and configuring sub-solvers is highly expertise- and labor-intensive. Large Language Models (LLMs) have emerged as promising tools for automated algorithm design. However, existing LLM-driven approaches struggle with LSCVRP primarily due to the difficulty in generating sophisticated search strategies within a limited context window. To bridge this gap, we propose the LLM-assisted Flexible Monte Carlo Tree Search (LaF-MCTS), a novel framework that automates the design of high-performance LSCVRP solvers. We develop a three-tier decision hierarchy to enable incremental design of decomposition policies and sub-solvers for LSCVRP. To enable efficient search within the algorithmic hypothesis space, we introduce semantic pruning to eliminate semantically and structurally redundant codes, and branch regrowth to regenerate codes and preserve diversity. Extensive experiments on CVRPLib demonstrate that LaF-MCTS autonomously composes and optimizes decomposition-enhanced solvers that surpass various state-of-the-art CVRP solvers.
Chinese Translation
解决大规模车辆路径问题(Large-scale CVRP,LSCVRP),尤其是包含数百到数千个节点的实例,仍对当前的最先进求解器构成挑战。尽管分治法可以通过将实例分解成规模较小的子问题来扩展求解能力,但设计分解逻辑和配置子求解器的过程非常依赖专业知识且劳动密集。大型语言模型(LLMs)作为自动化算法设计的有前景工具已经出现。然而,现有的基于LLM的方法在处理LSCVRP时面临困难,主要是因为在有限的上下文窗口内难以生成复杂的搜索策略。为了解决这一问题,我们提出了基于LLM辅助的灵活蒙特卡罗树搜索(LaF-MCTS),这是一个自动化设计高性能LSCVRP求解器的新框架。我们开发了一个三层决策层级,以实现对LSCVRP分解策略和子求解器的渐进式设计。为了在算法假设空间内实现高效搜索,我们引入了语义剪枝,以消除语义和结构上冗余的代码,并采用分支重生技术以重新生成代码并保持多样性。在CVRPLib上的大量实验结果表明,LaF-MCTS能够自主组成和优化增强分解的求解器,其性能超越了各种最先进的CVRP求解器。
cs.AI / 16 / 2605.03354

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

代理记忆内部发生了什么?从涌现到诊断的电路分析
Mao, Xutao, Zhao, Jinman, Penn, Gerald, Wang, Cong
Abstract
Agent memory failures are silent: an LLM-based agent can produce a fluent response even when it fails to extract, retain, or retrieve the information needed across sessions. The write-manage-read loop describes the external pipeline of these systems but leaves open which internal computations implement each stage. Tracing internal feature circuits across the Qwen-3 family (0.6B--14B) and two memory frameworks (mem0 and A-MEM), we report three findings. First, control is detectable before content: routing circuitry is causally active at 0.6B, while content circuitry produces no detectable signal until 4B under our tracing setup, creating a deployment regime where small models route with apparent competence but silently fail at extraction and grounding. Second, within the content group, Write and Read share a late-layer hub that operates as a context-grounding substrate already present in the base model; only memory framing recruits a functional grounding direction on this substrate, and the hub transfers across both frameworks. Third, emergence does not imply steerability: although the content circuit becomes detectable at 4B, it becomes reliably steerable only at 8B, indicating that detection and intervention have distinct scale thresholds. As a practical implication, the feature-space separation between the two circuit groups enables per-operation failure localization at 76.2% accuracy without supervision, providing a stage-level diagnostic for otherwise silent agent-memory failures.
Chinese Translation
代理记忆的失败是无声的:基于大型语言模型(LLM)的代理能够生成流畅的回答,即使在跨会话中未能提取、保留或检索所需信息时。写入-管理-读取循环描述了这些系统的外部管道,但对于哪些内部计算实现每个阶段却没有明确说明。通过追踪 Qwen-3 系列(0.6B--14B)和两个记忆框架(mem0 和 A-MEM)中的内部特征电路,我们报告了三项发现。首先,控制在内容之前是可检测的:在 0.6B 时,路由电路因果活跃,而内容电路在我们的追踪设置下直到 4B 才产生可检测信号,这造成了一个部署机制,其中小型模型在路由上表现出明显的能力,但在提取和基础构建上静默失败。其次,在内容组内,写入和读取共享一个操作于上下文基础的晚层中心,该中心在基础模型中已经存在;只有记忆框架在这个基础上招募一个功能性基础构建方向,而且该中心跨越两个框架转移。第三,生成并不意味着可操控:尽管内容电路在 4B 时变得可检测,但仅在 8B 时才可靠可操控,这表明检测与干预有不同的规模阈值。作为一个实际的结果,这两个电路组之间的特征空间分离使得每次操作的失败可定位的准确率达到 76.2%,而无需监督,为本应无声的代理记忆失败提供了阶段级的诊断。
cs.AI / 17 / 2605.03361

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

ReasonAudio:评估文本-音频检索中超越匹配的推理的基准
Zhang, Honglei, Chen, Yuting, Hu, Chenpeng, Zhang, Siyue, Shi, Yilei
Abstract
As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Despite their intuitive nature for humans and straightforward construction, these tasks pose significant challenges to current models. Our evaluation of ten state-of-the-art models reveals the following findings: All models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Moreover, Multimodal Large Language Model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.
Chinese Translation
随着多模态内容的快速扩展,音频检索已成为媒体搜索、内容组织和智能助手的关键支持技术。然而,现有的大多数基准集中于语义匹配,未能捕捉到现实世界查询往往要求高级推理能力的事实,包括否定理解、时间顺序、同时事件识别和持续时间区分。为了解决这一问题,我们引入了ReasonAudio,这是首个针对文本-音频检索的推理密集型基准,包含1000个查询和10000个复合音频片段,涵盖五个基本推理任务:否定、顺序、重叠、持续时间和混合。尽管这些任务对人类来说直观且构建简单,但对当前模型构成了重大挑战。我们对十个最先进模型的评估揭示了以下发现:所有模型在推理密集型音频检索中均面临困难,在否定和持续时间任务上表现尤为不佳,而在重叠和顺序任务上则相对表现较好。此外,基于多模态大型语言模型的嵌入模型未能通过对比微调继承其基础模型的推理能力,这表明当前的训练范式不足以在检索环境中保留推理能力。
cs.AI / 18 / 2605.03383

GeoDecider: A Coarse-to-Fine Agentic Workflow for Explainable Lithology Classification

GeoDecider:一种用于可解释岩石类型分类的粗到细代理工作流程
Wang, Jiahao, Cheng, Mingyue, Zhou, Yitong, Mao, Qingyang, Tao, Xiaoyu, Liu, Qi, Chen, Enhong
Abstract
Lithology classification aims to infer subsurface rock types from well-logging signals, supporting downstream applications like reservoir characterization. Despite substantial progress, most existing methods still treat lithology classification as a single-pass classification task. In contrast, practical experts incorporate geological principles, external knowledge, and tool-use capabilities to perform accurate classification. In this work, we propose GeoDecider, a coarse-to-fine agentic workflow that enables accurate and explainable lithology classification through training-free use of large language models (LLMs). GeoDecider reformulates lithology classification as an expert-like structured process and organizes it into a multi-stage workflow involving coarse-to-fine reasoning. Specifically, GeoDecider includes the following stages: (1) base classifier-guided coarse classification, which uses a pre-trained classifier to provide a rough reference for downstream tasks, thus reducing the overall cost of downstream reasoning, (2) tool-augmented reasoning, which utilizes several tools such as contextual analysis and neighbor retrieval to achieve finer and more precise classifications, (3) geological refinement, which post-processes the final results to enforce geological consistency. Experiments on four benchmarks show that GeoDecider outperforms representative baselines. Further analysis demonstrates that the proposed framework produces geologically interpretable predictions while achieving a better trade-off between classification performance and inference efficiency.
Chinese Translation
岩石类型分类旨在通过测井信号推断地下岩石类型,以支持如油藏表征等下游应用。尽管已有显著进展,现有大多数方法仍将岩石类型分类视为一次性分类任务。相较之下,实践中的专家会结合地质原则、外部知识和工具使用能力来进行准确分类。在本研究中,我们提出了GeoDecider,一种粗到细的代理工作流程,通过训练-free使用大型语言模型(LLMs)实现准确且可解释的岩石类型分类。GeoDecider将岩石类型分类重新构建为一种类专家的结构化过程,并将其组织成一个多阶段的工作流程,涉及粗到细的推理。具体而言,GeoDecider包含以下阶段:(1)基分类器指导的粗分类,利用预训练分类器为下游任务提供粗略参考,从而减少下游推理的整体成本;(2)工具增强推理,使用多个工具如上下文分析和邻居检索,达到更细致、精准的分类;(3)地质精细化,对最终结果进行后处理,以增强地质一致性。在四个基准上的实验表明,GeoDecider在性能上优于具有代表性的基线方法。进一步分析表明,所提出的框架能够生成具有地质可解释性的预测,同时在分类性能和推理效率之间实现更好的平衡。
cs.AI / 19 / 2605.03409

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

鲁棒代理补偿(RAC):教导人工智能代理进行补偿
Perera, Srinath, Hapuarachchi, Kaviru, Leymann, Frank, Khalaf, Rania
Abstract
We present Robust Agent Compensation (RAC), a log-based recovery paradigm (providing a safety net) implemented through an architectural extension that can be applied to most Agent frameworks to support reliable executions (avoiding unintended side effects). Users can choose to enable RAC without changing their current agent code (e.g., LangGraph agents). The proposed approach can be implemented in most existing agent frameworks via their existing extension points. We present an implementation based on LangChain, demonstrate its viability through the $\tau$-bench and REALM-Bench, and show that when solving complex problems, RAC is 1.5-8X or more better in both latency and token economy compared to state-of-the-art LLM-based recovery approaches.
Chinese Translation
我们提出了鲁棒代理补偿(RAC),这是一种基于日志的恢复范式(提供安全保障),通过一种架构扩展实现,可以应用于大多数代理框架,以支持可靠的执行(避免意外副作用)。用户可以选择启用RAC,而无需更改当前的代理代码(例如,LangGraph代理)。所提出的方法可以通过现有的扩展点在大多数现有代理框架中实现。我们展示了一种基于LangChain的实现,通过$\tau$-bench和REALM-Bench验证其可行性,并显示在解决复杂问题时,RAC在延迟和token经济性上比最先进的基于LLM的恢复方法提升了1.5-8倍或更高。
cs.AI / 20 / 2605.03410

Geometry over Density: Few-Shot Cross-Domain OOD Detection

几何优于密度:少样本跨域OOD检测
Li, Shawn, Qin, You, Li, Jiate, Peris, Charith, Bauer, Lisa, Zimmermann, Roger, Zhao, Yue
Abstract
Out-of-distribution (OOD) detection identifies test samples that fall outside a model's training distribution, a capability critical for safe deployment in high-stakes applications. Standard OOD detectors are trained on a specific in-distribution (ID) dataset and detect deviations from that single domain. In contrast, we study few-shot cross-domain OOD detection: given a \emph{single} pre-trained model, can we perform OOD detection on \emph{arbitrary} new ID-OOD task pairs using only a handful of ID samples at inference time, with no additional training? We propose \textbf{UFCOD}, a unified framework that achieves this goal through information-geometric analysis of diffusion trajectories. Our key insight is that diffusion noise predictions are score functions (gradients of log-density), and we extract two energy features: \emph{Path Energy} (integrated score magnitude) and \emph{Dynamics Energy} (score smoothness), that form a discrete Sobolev norm capturing how samples interact with the learned diffusion process. The central contribution is a \textbf{train-once, deploy-anywhere} paradigm: a diffusion model trained on a single dataset (e.g., CelebA) serves as a universal feature extractor for OOD detection across semantically unrelated domains (e.g., CIFAR-10, SVHN, Textures). At deployment, each new task requires only $\sim$100 unlabeled ID samples for inference: no retraining, no fine-tuning, no task-specific adaptation. Using 100 ID samples per task, UFCOD achieves 93.7\% average AUROC across 12 cross-domain benchmarks, competitive with methods trained on 50k--163k samples, demonstrating $\sim$500$\times$ improvement in sample efficiency. See our code in https://github.com/lili0415/UFCOD.
Chinese Translation
分布外(OOD)检测用于识别测试样本是否超出了模型的训练分布,这一能力对于在高风险应用中的安全部署至关重要。标准的OOD检测器是在特定的分布内(ID)数据集上训练的,并且仅检测偏离该单一领域的情况。相较之下,我们研究了少样本跨域OOD检测:给定一个\emph{单一}预训练模型,能否仅利用少量的ID样本在推理时对\emph{任意}新的ID-OOD任务对进行OOD检测,而无需额外的训练?我们提出了\textbf{UFCOD},一个通过对扩散轨迹的信息几何分析来实现这一目标的统一框架。我们的核心见解是,扩散噪声预测是得分函数(对数密度的梯度),我们提取了两个能量特征:\emph{路径能量}(积分得分幅度)和\emph{动态能量}(得分平滑性),这些特征形成了一个离散的Sobolev范数,捕捉样本与学习到的扩散过程之间的相互作用。核心贡献是\textbf{一次训练,随处部署}的范式:在单一数据集(如CelebA)上训练的扩散模型,作为跨语义无关领域(如CIFAR-10、SVHN、Textures)的OOD检测的通用特征提取器。在部署时,每个新任务仅需约100个未标记的ID样本进行推理:无需重新训练,无需微调,无需特定任务的适应。使用每个任务100个ID样本,UFCOD在12个跨域基准测试中实现了93.7%的平均AUROC,可与在50k至163k样本上训练的方法相媲美,展示了约500倍的样本效率提升。请查看我们的代码:https://github.com/lili0415/UFCOD。
cs.AI / 21 / 2605.03423

Adaptive Dual-Path Framework for Covert Semantic Communication

隐蔽语义通信的自适应双路径框架
Yu, Xi, Li, Weicai, Yin, Lin, Lv, Tiejun
Abstract
This paper proposes a novel adaptive dual-path framework for covert semantic communication (SemCom), which integrates covert information transmission with task-oriented semantic coding. Unlike conventional covert communication methods that embed hidden messages through power-domain signal superposition, our framework embeds covert data within task-specific features via semantic-level intrinsic encoding. This new architecture introduces dual encoding paths with adaptive block selection: an Explicit path for public task execution and a Stego path that jointly encodes both public and covert information through contrastive representation alignment. A Gumbel-Softmax enabled adaptive path selection mechanism dynamically activates network blocks based on task requirements. We formulate a multi-objective optimization framework that simultaneously ensures accurate semantic understanding and reliable covert transmission. We rigorously evaluate our framework's security against a powerful, independently trained attacker. Experimental results on the Cityscapes dataset demonstrate a state-of-the-art level of covertness: our method suppresses the attacker's detection accuracy to a near-random guessing level of 56.12%. This robust security is achieved while simultaneously maintaining superior performance on the primary semantic tasks compared to the baselines.
Chinese Translation
本文提出了一种新的自适应双路径框架,用于隐蔽语义通信(Covert Semantic Communication,SemCom),该框架将隐蔽信息传输与任务导向的语义编码相结合。与传统的通过功率域信号叠加嵌入隐藏消息的隐蔽通信方法不同,我们的框架通过语义级内在编码将隐蔽数据嵌入到特定任务的特征中。该新架构引入了具有自适应块选择的双编码路径:用于公共任务执行的显式路径和通过对比表示对齐共同编码公共信息和隐蔽信息的隐写路径。一种基于Gumbel-Softmax的自适应路径选择机制根据任务需求动态激活网络块。我们制定了一个多目标优化框架,旨在同时确保准确的语义理解和可靠的隐蔽传输。我们严格评估了该框架在强大且独立训练的攻击者下的安全性。在Cityscapes数据集上的实验结果表明,隐蔽性达到了最先进的水平:我们的方法将攻击者的检测准确率抑制到接近随机猜测的56.12%。在此过程中,该方法在主要语义任务上相比基线保持了优越的性能。
cs.AI / 22 / 2605.03426

Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

用偏好替代参数:异构视觉-语言模型的联邦对齐
Lu, Shule, Wang, Yujing, Zhang, Hainan, Yang, Xiaoshan, Zheng, Hongwei, Tong, Yongxin, Xu, Changsheng, Zheng, Zhiming
Abstract
Vision-Language Models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. Federated Learning mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. Under extreme model and data heterogeneity, replacing parameter aggregation with preference-based collaboration offers a more suitable interface, as it eliminates the need for direct parameter or data exchange. Motivated by this, we propose MoR, a federated alignment framework that combines GRPO with Mixture-of-Rewards for heterogeneous VLMs. In MoR, each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To combine these heterogeneous supervision signals, MoR introduces a Mixture-of-Rewards mechanism with learned routing, which adaptively fuses client reward models according to the input and alignment objective. The server then optimizes a base VLM using GRPO with a KL penalty to a reference model, enabling preference alignment without requiring client models to share architectures or parameters. Experiments on diverse public vision-language benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
Chinese Translation
视觉-语言模型(VLMs)在医疗和金融等隐私敏感领域具有广泛的潜力,但严格的数据共享限制使得集中训练不可行。联邦学习通过支持去中心化训练来缓解这一问题,但由于客户在计算资源、应用需求和模型架构上的异质性,实际应用面临挑战。在极端的模型和数据异质性下,用基于偏好的协作替代参数聚合提供了更合适的接口,因为它消除了直接交换参数或数据的需求。基于此,我们提出了MoR,一种将GRPO与混合奖励结合的异构VLM的联邦对齐框架。在MoR中,每个客户从本地偏好注释局部训练一个奖励模型,捕捉特定的评估信号而不暴露原始数据。为了结合这些异构监督信号,MoR引入了带有学习路由的混合奖励机制,根据输入和对齐目标自适应地融合客户的奖励模型。然后,服务器使用带有KL惩罚的GRPO来优化基础VLM,对齐到参考模型,实现偏好对齐,而无需客户模型共享架构或参数。在多样化的公共视觉-语言基准测试中的实验表明,MoR在泛化能力和跨客户适应性上始终优于联邦对齐基线。我们的方法为在联邦环境中保护隐私的异构VLM对齐提供了可扩展的解决方案。
cs.AI / 23 / 2605.03460

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

FinSTaR:迈向基于时间序列推理模型的金融推理
Lee, Seunghan, Seo, Jun, Lee, Jaehoon, Yoo, Sungdong, Kim, Minjae, Lim, Tae Yoon, Kang, Dongwan, Choi, Hwanil, Lee, Soonyoung, Ahn, Wonbin
Abstract
Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2x2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain -- where the distinction between deterministic assessment and stochastic prediction is particularly critical -- as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is publicly available at: https://github.com/seunghan96/FinSTaR.
Chinese Translation
时间序列(TS)推理模型(TSRMs)在一般领域展示了很有前景的能力,但在金融领域却始终表现不佳,因为该领域具有独特的特征。我们提出了一种通用的2x2能力分类法,通过交叉1)单实体与多实体分析以及2)当前状态评估与未来行为预测,来描述TSRMs。我们在金融领域实例化该分类法——在这里,确定性评估与随机性预测之间的区别尤为重要——形成了十个金融推理任务,基于标普股票建立了FinTSR-Bench基准。为此,我们提出FinSTaR(Financial Time Series Thinking and Reasoning),该模型在FinTSR-Bench上进行训练,采用针对每个类别的独特思维链(CoT)策略。在评估方面,该过程是确定性的(即可从可观察数据中计算得出),我们采用Compute-in-CoT,这是一种程序化的思维链,允许模型直接从原始价格中推导出答案。在预测方面,该过程本质上是随机的(即受不可观察因素的影响),我们采用场景感知思维链(Scenario-Aware CoT),在做出判断之前生成多样化的场景,反映金融分析师在不确定性下推理的方式。所提方法在FinTSR-Bench上达到了78.9%的平均准确率,显著优于大型语言模型(LLM)和TSRM的基准。此外,我们还表明,这四个能力类别通过联合训练是互补且相互增强的,且场景感知思维链相比标准思维链在预测准确性方面持续提升。代码公开可用: https://github.com/seunghan96/FinSTaR.
cs.AI / 24 / 2605.03491

Real-Time Evaluation of Autonomous Systems under Adversarial Attacks

对抗攻击下自主系统的实时评估
Mohan, Adithya, Xie, Xujun, Sambandham, Venkatesh Thirugnana, Schön, Torsten
Abstract
Most evaluations of autonomous driving policies under adversarial conditions are conducted in simulation, due to cost efficiency and the absence of physical risk. However, purely virtual testing fails to capture structural inconsistencies, supervision constraints, and state-representation effects that arise in real-world data and fundamentally shape policy robustness. This work presents an offline trajectory-learning and adversarial robustness evaluation framework grounded in real-world intersection driving data. Within a controlled data contract, we train and compare three trajectory-learning paradigms: Multi-Layer Perceptron (MLP)-based Behavior Cloning (BC), Transformer-based object-tokenized BC, and inverse reinforcement learning (IRL) formulated within a Generative Adversarial Imitation Learning (GAIL) framework. Models are evaluated using Average Displacement Error (ADE) and Final Displacement Error (FDE). Inference-time robustness is assessed by subjecting trained policies to gradient-based adversarial perturbations across multiple intersection scenarios, yielding a structured robustness evaluation matrix. Results show that state-structure design and architectural inductive biases critically influence adversarial stability, leading to markedly different robustness profiles despite comparable nominal prediction accuracy (ADE < 0.08). Inference-time Projected Gradient Descent (PGD) attacks induce final displacement errors of up to approximately 8 meters. The proposed framework establishes a scalable benchmark for studying offline trajectory learning and adversarial robustness in real-world autonomous driving settings.
Chinese Translation
由于成本效率和缺乏物理风险,大多数对抗条件下自主驾驶政策的评估是在模拟环境中进行的。然而,纯粹的虚拟测试无法捕捉在真实世界数据中出现的结构不一致性、监督约束和状态表示效应,这些因素从根本上影响政策的鲁棒性。本研究提出了一种基于真实世界交叉路口驾驶数据的离线轨迹学习与对抗鲁棒性评估框架。在受控数据合约下,我们训练并比较了三种轨迹学习范式:基于多层感知器(MLP)的行为克隆(BC)、基于变换器的对象标记化BC,以及在生成对抗模仿学习(GAIL)框架中形成的逆强化学习(IRL)。通过平均位移误差(ADE)和最终位移误差(FDE)对模型进行评估。通过对训练策略施加基于梯度的对抗扰动,在多种交叉路口场景中评估推理时的鲁棒性,得到一个结构化的鲁棒性评估矩阵。结果表明,状态结构设计和架构归纳偏差对对抗稳定性具有重要影响,尽管名义预测准确度相似(ADE < 0.08),但鲁棒性特征却大相径庭。在推理时,投影梯度下降(PGD)攻击导致的最终位移误差高达约8米。该框架建立了一个可扩展的基准,用于研究真实世界自主驾驶环境中的离线轨迹学习和对抗鲁棒性。
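The two metrics named in the abstract above have standard definitions: ADE averages the Euclidean distance between predicted and ground-truth positions over all time steps, and FDE takes that distance at the final step only. A minimal sketch follows, assuming 2-D trajectories given as lists of `(x, y)` points; the function name `displacement_errors` and the sample trajectories are hypothetical and not taken from the paper.

```python
import math

def displacement_errors(predicted, ground_truth):
    """Compute (ADE, FDE) for one trajectory of 2-D points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    # Euclidean distance between prediction and ground truth at each step.
    dists = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(dists) / len(dists)  # mean over all time steps
    fde = dists[-1]                # error at the final time step only
    return ade, fde

# Hypothetical three-step trajectories (same units as the input coordinates).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)
```

Because FDE looks only at the last step, a perturbation that makes a trajectory drift steadily can yield a large FDE even when the ADE on unperturbed inputs is small.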
cs.AI / 25 / 2605.03596

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Workspace-Bench 1.0:在具有大规模文件依赖的工作空间任务上评估人工智能代理
Tang, Zirui, Zhou, Xuanhe, Liu, Yumou, Li, Linchun, Wang, Weizheng, Huang, Hongzhang, Zhou, Jun, Song, Jiachen, Yu, Shaoli, Wang, Jinqi, Zhou, Zihang, Zhou, Hongyi, Lv, Yuting, Li, Jinyang, Liu, Jiashuo, Chen, Ruoyu, Liu, Chunwei, Li, GuoLiang, Kang, Jihua, Wu, Fan
Abstract
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.
Chinese Translation
工作空间学习要求人工智能代理识别、推理、利用并更新工人工作空间中异构文件之间的显式和隐式依赖关系,从而有效地完成常规和高级任务。尽管其重要性不言而喻,现有相关基准主要在预定义或合成的文件上评估代理,且这些文件的真实世界依赖性有限,因此工作空间级别的评估尚未得到充分探索。为此,我们提出了Workspace-Bench,这是一个用于评估涉及大规模文件依赖的工作空间学习的人工智能代理的基准。我们构建了包含5个工人角色、74种文件类型、20,476个文件(大小可达20GB)的真实工作空间,并编制了388个任务,每个任务都有其独特的文件依赖图,评估涉及7,399个总标准,这些标准要求跨文件检索、上下文推理和自适应决策。我们还提供了Workspace-Bench-Lite,一个包含100个任务的子集,保持了基准分布的同时将评估成本降低约70%。我们评估了4种流行的代理框架和7种基础模型。实验结果表明,目前的代理在可靠的工作空间学习方面仍然远未达到预期,最好的表现仅为68.7%,远低于人类的80.7%,而各代理的平均表现仅为47.4%。
cs.AI / 26 / 2605.03609

Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

路径分岔处:大规模语言模型中道德推理的局部化、校准控制
Yuan, Chenchen, Zhang, Zheyu, Kasneci, Gjergji
Abstract
Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks the downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-$\ell_2$-norm update that moves the residual within this two-dimensional subspace so the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
Chinese Translation
大规模语言模型在不同情境中常常表现出异质的道德偏好。我们研究在保持一般能力的同时,如何在推理阶段引导模型朝向期望的伦理框架。我们提出了趋同-分歧路由(Convergent-Divergent Routing),该方法在变换器模块内追踪并编辑最小的分支点,即与伦理框架相关的路径首次趋同然后再分歧的地方。在这些位置对非目标分支进行门控,阻止了下游传播,同时保持了上游计算的完整性。我们的研究发现,仅通过这种干预就能增加针对特定伦理框架的推理。为了实现精细控制,我们将共空间模式(Common Spatial Patterns)适应于残差流,并为每一个分支点层提取一对方向,以区分功利主义和义务论框架。接着,我们引入了双重Logit校准(Dual Logit Calibration),这是一种封闭形式的最小$\ell_2$范数更新方法,在这个二维子空间内移动残差,从而使得得到的方向投影与用户指定的偏好权重对齐。在现实道德困境上的实验表明,我们的方法可靠地实现了偏好校准,并在很大程度上保持了总体能力,超越了最新的基准,同时提供了可解释的机制。
cs.AI / 27 / 2605.03625

Self-Improvement for Fast, High-Quality Plan Generation

快速、高质量计划生成的自我提升
Gieselmann, Robert, von Huelsen, Henrike, Samson, Mihai, Meyer, Marie-Christine, Piotrowski, Dariusz, Radomskyi, Oleksandr, Okamoto, Justin, Gojayev, Turan, Painter, Michael, Brown, Gavin, Pecora, Federico, Wyatt, Jeremy L.
Abstract
Generative models trained on synthetic plan data are a promising approach to generalized planning. Recent work has focused on finding any valid plan, rather than a high-quality solution. We address the challenge of producing high-quality plans, a computationally hard problem, in sub-exponential time. First, we demonstrate that, given optimal data, a decoder-only transformer can generate high-quality plans for unseen problem instances. Second, we show how to self-improve an initial model trained on sub-optimal data. Each round of self-improvement combines multiple model calls with graph search to generate improved plans, used for model fine-tuning. An experimental study on four domains: Blocksworld, Logistics, Labyrinth, and Sokoban, shows on average a 30% reduction in plan length over the source symbolic planner, with over 80% of plans being optimal, where the optimum is known. Plan quality is further improved by inference-time search. The model's latency scales sub-exponentially in contrast to the satisficing and optimal symbolic planners to which we compare. Together, these results suggest that self-improvement with generative models offers a scalable approach for high-quality plan generation.
Chinese Translation
基于合成计划数据训练的生成模型是对广义规划的一种有前景的方法。近期的研究集中在寻找任何有效的计划,而不是高质量的解决方案。我们解决了在亚指数时间内生成高质量计划这一计算上困难的问题。首先,我们证明了在给定最优数据的情况下,仅使用解码器的变换器可以为未见过的问题实例生成高质量的计划。其次,我们展示了如何对在次优数据上训练的初始模型进行自我提升。每轮自我提升结合了多个模型调用与图搜索,以生成改进的计划,并用于模型的微调。在四个领域(Blocks世界、物流、迷宫和推箱子)进行的实验研究显示,与源符号规划器相比,平均计划长度减少了30%,其中80%以上的计划是最优的(当最优解已知时)。通过推理时的搜索,进一步提升了计划的质量。与我们比较的满足要求和最优符号规划器相比,该模型的延迟以亚指数方式扩展。这些结果表明,采用生成模型的自我提升为高质量计划生成提供了一种可扩展的方法。
cs.AI / 28 / 2605.03644

AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse

AdapShot:具有语义感知 KV 缓存重用的自适应多示例上下文学习
Ou, Jie, Guo, Jinyu, Guo, Shiyao, Li, Yuang, Wu, Ruiqi, Wang, Zhaokun, Li, Wenyi, Tian, Wenhong
Abstract
Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of Large Language Models (LLMs). However, existing methods typically rely on a predetermined, fixed number of shots. This static approach often fails to adapt to the varying difficulty of different queries, leading to either insufficient context or interference from noise. Furthermore, the prohibitive computational and memory costs of long contexts severely limit Many-Shot's feasibility. To address the above limitations, we propose AdapShot, which dynamically optimizes shot counts and leverages KV cache reuse for efficient inference. Specifically, we design a probe-based evaluation mechanism that utilizes output entropy to determine the optimal number of shots. To bypass the redundant prefilling computation during both the probing and inference phases, we incorporate a semantics-aware KV cache reuse strategy. Within this reuse strategy, to address positional encoding incompatibilities, we introduce a decoupling and re-encoding method that enables the flexible reordering of cached key-value pairs. Extensive experiments demonstrate that AdapShot achieves an average performance gain of around 10% and a 4.64x speedup compared to state-of-the-art DBSA.
Chinese Translation
多示例上下文学习(ICL)作为一种前景广阔的范式,利用大量示例来释放大型语言模型(LLMs)的推理潜力。然而,现有方法通常依赖于预先确定的固定示例数量。这种静态方法往往未能适应不同查询的难度变化,导致上下文不足或噪声干扰。此外,长上下文的高昂计算和内存成本严重限制了多示例学习的可行性。为了解决上述限制,我们提出了 AdapShot,它动态优化示例数量并利用 KV 缓存重用以实现高效推理。具体而言,我们设计了一种基于探测的评估机制,通过输出熵来确定最优示例数量。为绕过在探测和推理阶段的冗余填充计算,我们引入了语义感知的 KV 缓存重用策略。在该重用策略中,针对位置编码不兼容的问题,我们引入了一种解耦和重新编码的方法,能够灵活地对缓存的键值对进行重排序。大量实验表明,AdapShot 与最先进的 DBSA 相比,在性能上平均提升约 10%,推理速度提高 4.64 倍。
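The abstract says only that a probe uses output entropy to choose the number of shots. A minimal sketch of that idea follows, assuming a hypothetical `probe(k)` callable that returns the model's output probability distribution when given `k` shots. The selection rule (smallest `k` whose entropy falls under a threshold) is an assumption for illustration, not AdapShot's actual criterion.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_shot_count(probe, candidates, threshold):
    """Return the smallest candidate shot count whose probe entropy is at
    most `threshold`; fall back to the lowest-entropy candidate.

    `probe` is a hypothetical stand-in for a cheap model call.
    """
    scored = [(k, entropy(probe(k))) for k in sorted(candidates)]
    for k, h in scored:
        if h <= threshold:
            return k
    return min(scored, key=lambda kh: kh[1])[0]
```

Preferring the smallest sufficient `k` reflects the cost concern raised in the abstract: each extra shot lengthens the context that has to be prefilled or served from cache.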
cs.AI / 29 / 2605.03648

Agent-Based Modeling of Low-Emission Fertilizer Adoption for Dairy Farm Decarbonisation using Empirical Farm Data

基于代理的低排放肥料在奶牛养殖场减碳中的采用建模:基于实证农场数据
Jayakumar, Surya, Sullivan, Kieran, McLaughlin, John, OMeara, Christine, Dey, Indrakshi
Abstract
To understand complex system dynamics in dairy farming, it is essential to use modeling tools that capture farm heterogeneity, social interactions, and cumulative environmental impacts. This study proposes an agent-based modeling (ABM) framework to simulate nitrogen management and the adoption of low-emission fertilizer across 295 Irish dairy farms over a 15-year period. Using empirical data, the model represents farm communication through a social network, capturing peer influence and discussion group dynamics, where adoption probabilities are driven by social contagion, farm-scale characteristics, and policy interventions such as subsidies and carbon taxes. The framework estimates sectoral greenhouse gas emissions, cumulative abatement, and private-social cost trade-offs, using Monte Carlo simulation and sensitivity analysis to quantify uncertainty. The model shows strong agreement with observed adoption trajectories ($R^2 = 0.979$, RMSE = 0.0274) and is validated against empirical data using a Kolmogorov-Smirnov test (D = 0.2407, p < 0.001), indicating its ability to reproduce structural patterns in adoption behavior. Adoption dynamics are further characterized using a logistic diffusion model consistent with Rogers' innovation diffusion theory, capturing progression from early adoption to a saturation level of approximately 91%. By framing decarbonization as a socio-technical diffusion process rather than a purely economic optimization problem, this study provides an in silico policy laboratory for evaluating the robustness and diffusion speed of climate mitigation strategies prior to implementation.
Chinese Translation
为了理解奶牛养殖中的复杂系统动态,使用能够捕捉农场异质性、社会互动和累积环境影响的建模工具至关重要。本研究提出了一种基于代理的建模(Agent-Based Modeling,ABM)框架,以模拟295个爱尔兰奶牛养殖场在15年期间的氮管理和低排放肥料的采用。该模型使用实证数据,通过社交网络表示农场之间的沟通,捕捉同伴影响和讨论小组动态,其中,采用概率由社会传染、农场层面特征以及政策干预(例如补贴和碳税)驱动。该框架估算了行业温室气体排放、累积减排以及私人与社会成本之间的权衡,利用蒙特卡洛模拟和灵敏度分析来量化不确定性。模型与观察到的采用轨迹高度一致($R^2 = 0.979$, RMSE = 0.0274),并通过Kolmogorov-Smirnov检验(D = 0.2407, p < 0.001)对照实证数据进行验证,表明其能够再现采用行为中的结构模式。采用动态进一步使用与Rogers的创新扩散理论一致的逻辑斯蒂(logistic)扩散模型进行刻画,捕捉从早期采用到饱和水平(约91%)的进程。通过将减碳视为一种社会技术扩散过程,而非单纯的经济优化问题,本研究提供了一个计算机模拟(in silico)的政策实验室,以评估气候减缓策略在实施前的稳健性和扩散速度。
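The logistic diffusion curve and the two fit statistics quoted in the abstract can be written out as a short sketch. Only the saturation level (about 0.91) comes from the abstract; the `rate` and `midpoint` defaults are placeholder assumptions, not fitted values from the paper.

```python
import math

def logistic_adoption(t, saturation=0.91, rate=0.6, midpoint=7.0):
    """Adopting share of farms at year t under a logistic diffusion curve.

    saturation follows the ~91% level in the abstract; rate and midpoint
    are illustrative assumptions.
    """
    return saturation / (1.0 + math.exp(-rate * (t - midpoint)))

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

At `t == midpoint` the curve sits at exactly half the saturation level, which is the usual way to read the midpoint parameter.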
cs.AI / 30 / 2605.03675

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

MEMTIER:长期自主人工智能代理的分层内存架构与检索瓶颈分析
Sidik, Bronislav, Rokach, Lior
Abstract
Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tripartite memory architecture for the OpenClaw agent runtime that introduces a structured episodic JSONL store, a five-signal weighted retrieval engine, an attention-attributed cognitive weight update loop, an asynchronous consolidation daemon promoting episodic facts to a semantic tier, and a PPO-based policy framework for adapting retrieval weights (infrastructure validated; performance gains pending camera-ready). On the full 500-question LongMemEval-S benchmark (Wu et al., 2025), MEMTIER achieves Acc=0.382, F1=0.412 with Qwen2.5-7B on a consumer 6GB GPU - a +33 percentage point improvement over the full-context baseline (0.050 -> 0.382, i.e., 5% -> 38%). With DeepSeek-V4-Flash fact pre-population, single-session recall reaches 0.686-0.714, exceeding the paper's RAG BM25 GPT-4o baseline (0.560) on those categories. Temporal reasoning rises to 0.323 and multi-session synthesis to 0.173, demonstrating that structured semantic pre-population qualitatively changes what lightweight retrieval can achieve. All phases run locally on a consumer laptop with a 6GB GPU.
Chinese Translation
长期运行的自主人工智能代理存在众所周知的内存一致性问题:在72小时的操作窗口内,工具执行成功率因现有平面文件内存系统中的四种累积故障模式而下降14个百分点。我们提出了MEMTIER,这是一种针对OpenClaw代理运行时的三方内存架构,包含一个结构化的情节JSONL存储、一个五信号加权检索引擎、一个基于注意力的认知权重更新循环、一个促进将情节事实提升至语义层的异步整合守护进程,以及一个基于PPO的策略框架用于调整检索权重(基础设施已验证;性能提升待最终定稿)。在完整的500问题LongMemEval-S基准测试(Wu 等,2025)中,MEMTIER在消费级6GB GPU上的Qwen2.5-7B上实现了Acc=0.382,F1=0.412,比完整上下文基线(0.050 -> 0.382,即5% -> 38%)提高了33个百分点。通过DeepSeek-V4-Flash事实预填充,单会话回忆达到0.686-0.714,在那些类别上超过了论文中的RAG BM25 GPT-4o基线(0.560)。时间推理提升至0.323,多会话综合提升至0.173,表明结构化语义预填充在定性上改变了轻量级检索能够实现的效果。所有阶段都能在配备6GB GPU的消费级笔记本电脑上本地运行。
cs.AI / 31 / 2605.03762

OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking

OracleProto:用于通过知识截止和时间掩蔽进行 LLM 原生预测基准测试的可重复框架
Ma, Yiding, Ruan, Chengyun, Huang, Kaibo, Yang, Zhongliang, Zhou, Linna
Abstract
Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action-oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting capability. OracleProto reconstructs resolved events into time-bounded forecasting samples by combining model-cutoff-aligned sample admission, tool-level temporal masking, content-level leakage detection, discrete answer normalization, and hierarchical scoring. Instantiated on a FutureX-Past-derived dataset with six contemporary LLMs, OracleProto distinguishes forecasting quality, sampling stability, and cost efficiency under controlled information boundaries, while reducing residual leakage to the $1\%$ level, an order of magnitude below tool-only temporal filtering. OracleProto turns LLM forecasting from one-off evaluation into an auditable, reusable, and trainable dataset-level capability, providing a unified interface for fair cross-model comparison and a controlled signal source for downstream SFT and RL. Code and data are available at https://github.com/MaYiding/OracleProto and https://huggingface.co/datasets/MaYiding/OracleProto.
Chinese Translation
大语言模型正从静态文本生成器向真实世界的决策支持系统转变,其中预测是一项复合能力,连接了信息收集、证据整合、情境判断和以行动为导向的决策制定。这种能力在金融、政策、工业和科学研究等多个领域有广泛的需求,但其评估仍然具有挑战性:实时基准在答案尚不存在之前评估预测,使其成为衡量预测能力的最清晰方式,但一旦事件解决就失效;回顾性基准是可重复的,但它们无法可靠地区分真正的预测与模型在预训练期间可能已经学习的事实。促使模型“假装不知道”无法替代真正的知识边界。我们提出了 OracleProto,这是一个用于评估 LLM 原生预测能力的可重复框架。OracleProto 通过结合模型截止对齐的样本接纳、工具级时间掩蔽、内容级泄漏检测、离散答案归一化和分层评分,将已解决事件重构为时间限制的预测样本。在一个基于 FutureX-Past 的数据集上实例化了六个当代 LLM,OracleProto 在受控的信息边界下区分预测质量、采样稳定性和成本效率,同时将残余泄漏降低到 $1\%$ 的水平,低于仅使用工具的时间过滤下的一个数量级。OracleProto 将 LLM 预测从一次性评估转变为可审计、可重用和可训练的数据集级能力,提供统一的接口以便进行公平的跨模型比较和受控的信号源用于下游 SFT 和 RL。代码和数据可在 https://github.com/MaYiding/OracleProto 和 https://huggingface.co/datasets/MaYiding/OracleProto 获取。
cs.AI / 32 / 2605.03782

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

你所想即你所见:通过视觉-语言好奇心驱动VLM代理的探索
Li, Haoxi, Hou, Qinglin, Ma, Jianfei, Lai, Jinxiang, Han, Tao, Bai, Sikai, Guo, Jingcai, Zhang, Jie, Guo, Song
Abstract
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the ``known unknown'' required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning ``what the agent thinks'' with ``what the agent sees'' is key to solving complex or sparse agentic tasks.
Chinese Translation
为了在部分可观察的视觉环境中导航,最近的VLM代理越来越多地将世界建模能力内化到其策略中,通过显式的链式推理(CoT)使它们能够在行动之前进行未来的心理模拟。然而,单靠对已访问状态的被动推理不足以应对稀疏奖励任务,因为它缺乏主动揭示“已知未知”以实现稳健泛化的认知驱动。我们提出了一个问题:VLM代理能否通过好奇心驱动的探索主动寻找挑战并完善其内部世界模型的信号?在本工作中,我们提出了GLANCE,一个统一框架,通过将代理的语言世界模型与不断演变的目标网络的稳定视觉表征相结合,弥合推理与探索之间的差距。至关重要的是,GLANCE利用语言预测与视觉现实之间的差异作为强化学习中的内在好奇信号,引导代理主动探索其内部模型不确定的区域。针对一系列代理任务的大量实验表明GLANCE的有效性,并证明“代理所想”的与“代理所见”的一致性是解决复杂或稀疏代理任务的关键。
cs.AI / 33 / 2605.03788

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

表述任务,执行群体:增强代理在无人机网络中的大语言模型推理
Iannoli, Andrea, Gigli, Lorenzo, Sciullo, Luca, Trotta, Angelo, Di Felice, Marco
Abstract
Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains challenging due to heterogeneous interfaces, limited grounding, and the need for long-running closed-loop execution. This paper presents a mission-agnostic, agent-enhanced LLM framework for UAV swarm control, where users express mission objectives in natural language and the system autonomously executes them through grounded, real-time interactions. The proposed architecture combines an LLM-based Agent Core with a Model Context Protocol (MCP) gateway and a Web-of-Drones abstraction based on W3C Web of Things (WoT) standards. By exposing drones, sensors, and services as standardized WoT Things, the framework enables structured tool-based interaction, continuous state observation, and safe actuation without relying on code generation. We evaluate the framework using ArduPilot-based simulation across four swarm missions and six state-of-the-art LLMs. Results show that, despite strong reasoning abilities, current general-purpose LLMs still struggle to achieve reliable execution - even for simple swarm tasks - when operating without explicit grounding and execution support. Task-specific planning tools and runtime guardrails substantially improve robustness, while token consumption alone is not indicative of execution quality or reliability.
Chinese Translation
大型语言模型(LLMs)越来越多地被探索用作信息物理系统的高级推理引擎,但由于接口异构、落地(grounding)能力有限以及需要长时间闭环执行,其在实时无人机群管理中的应用仍然面临挑战。本文提出了一种任务无关的、代理增强的大语言模型框架,用于无人机群控制:用户用自然语言表达任务目标,系统则通过有依据的实时交互自主执行这些目标。所提架构将基于大语言模型的代理核心与模型上下文协议(Model Context Protocol, MCP)网关及基于W3C Web of Things(WoT)标准的无人机网络(Web-of-Drones)抽象结合在一起。通过将无人机、传感器及服务暴露为标准化的WoT事物(Things),该框架能够在不依赖代码生成的情况下实现结构化的基于工具的交互、持续的状态观察以及安全的执行控制。我们使用基于ArduPilot的仿真评估了该框架,涉及四个群体任务和六种最先进的大语言模型。结果表明,尽管当前通用大语言模型具有较强的推理能力,但在没有明确的落地支持和执行支持的情况下,即使是简单的群体任务,也仍然难以实现可靠的执行。特定任务的规划工具和运行时防护措施显著提高了鲁棒性,而令牌消耗量本身并不能反映执行质量或可靠性。
cs.AI / 34 / 2605.03804

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

ScrapMem:一种基于生物启发的框架,通过光遗忘实现设备端个性化代理记忆
Chang, Jiale, Ren, Yuxiang
Abstract
Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal complexity. To address this, we propose ScrapMem, a framework that integrates multimodal data into "Scrapbook Page." ScrapMem introduces Optical Forgetting, an optical compression mechanism that progressively reduces the resolution of older memories, lowering storage cost while suppressing low-value details. To maintain semantic consistency, we construct an Episodic Memory Graph (EM-Graph) that organizes key events into a causal-temporal structure. Extensive experiments on the multimodal ATM-Bench showcase that ScrapMem provides three main benefits: (1) strong performance, achieving a new state-of-the-art with a 51.0% Joint@10 score; (2) high storage efficiency, reducing memory usage by up to 93% via optical forgetting; and (3) improved recall, increasing Recall@10 to 70.3% through structured aggregation. ScrapMem offers an effective and storage-efficient solution for on-device long-term memory in multimodal LLM agents.
Chinese Translation
由于高存储成本和多模态复杂性,在资源受限的边缘设备上为大规模语言模型(LLM)代理实现长期个性化记忆是一项挑战。为了解决这一问题,我们提出了ScrapMem,一个将多模态数据整合到“剪贴页(Scrapbook Page)”中的框架。ScrapMem引入了光遗忘(Optical Forgetting),一种光学压缩机制,逐步降低较旧记忆的分辨率,从而降低存储成本并抑制低价值细节。为了保持语义一致性,我们构建了一个情节记忆图(Episodic Memory Graph, EM-Graph),将关键事件组织成因果-时间结构。在多模态ATM-Bench上的广泛实验表明,ScrapMem提供了三个主要优势:(1)强大的性能,达成51.0%的Joint@10得分,创下新的最先进水平;(2)高存储效率,通过光遗忘将内存占用最多减少93%;以及(3)改进的召回率,通过结构化聚合将Recall@10提升至70.3%。ScrapMem为多模态LLM代理的设备端长期记忆提供了一个有效且节省存储的解决方案。
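Optical Forgetting is described as progressively lowering the resolution of older memories. The sketch below illustrates one possible schedule, assuming an exponential decay of the linear scale with a floor; the half-life, the floor, and both function names are hypothetical and not taken from the paper.

```python
def forgetting_scale(age_days, half_life_days=30.0, min_scale=0.25):
    """Linear resolution scale for a memory of the given age.

    Assumed schedule: the scale halves every `half_life_days` and never
    drops below `min_scale`.
    """
    return max(min_scale, 0.5 ** (age_days / half_life_days))

def downscaled_size(width, height, age_days):
    """Target pixel dimensions for a stored page image of the given age."""
    s = forgetting_scale(age_days)
    return max(1, round(width * s)), max(1, round(height * s))
```

Since pixel count grows with the square of the linear scale, halving the scale cuts the raw pixel budget to a quarter, so even a gentle schedule reduces storage quickly.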
cs.AI / 35 / 2605.03808

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

主体性模型:通过自我研究演化主体性可解释性工具
Singh, Chandan, Tan, Yan Shuo, Xu, Weijia, Gero, Zelalem, Yang, Weiwei, Galley, Michel, Gao, Jianfeng
Abstract
Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than interpretable by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Specifically, it develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. The metric measures a suite of LLM-graded tests that probe whether a fitted model's string representation is "simulatable" by an LLM, i.e. whether the LLM can answer questions about the model's behavior by reading its string output alone. We find that the evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. Furthermore, these evolved models improve downstream end-to-end ADS, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.
Chinese Translation
主体性数据科学(ADS)系统正在迅速提升其自主分析、拟合和解释数据的能力,有可能向未来发展,即代理人完成绝大多数的数据科学工作。然而,当前的ADS系统使用的是旨在被人类解释的统计工具,而非代理人可解释的工具。为了解决这一问题,我们引入了主体性模型(Agentic-imodels),这是一个主体性自我研究循环,旨在演化出可供代理人解释的数据科学工具。具体而言,它开发了一个与scikit-learn兼容的回归器库,针对表格数据优化了预测性能和一种新的基于大型语言模型(LLM)的可解释性指标。该指标测量了一系列经过LLM评估的测试,以探测拟合模型的字符串表示是否能够被LLM“模拟”,即LLM是否能够仅通过阅读其字符串输出来回答关于模型行为的问题。我们发现,演化后的模型共同提升了预测性能和针对代理的可解释性,能够推广到新的数据集和新的可解释性测试。此外,这些演化模型提升了下游的端到端ADS,在BLADE基准测试中将Copilot CLI、Claude Code 和 Codex 的性能提高了多达73%。
cs.AI / 36 / 2605.03842

SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

SOAR:机器人移动履行系统中订单分配与机器人调度的实时联合优化
Tang, Yibang, Yang, Yifan, Wang, Jingyuan, Chen, Junhua, Zhao, Zhen
Abstract
Robotic Mobile Fulfillment Systems (RMFS) rely on mobile robots for automated inventory transportation, coordinating order allocation and robot scheduling to enhance warehousing efficiency. However, optimizing RMFS is challenging due to strict real-time constraints and the strong coupling of multi-phase decisions. Existing methods either decompose the problem into isolated sub-tasks to guarantee responsiveness at the cost of global optimality, or rely on computationally expensive global optimization models that are unsuitable for dynamic industrial environments. To bridge this gap, we propose SOAR, a unified Deep Reinforcement Learning framework for real-time joint optimization. SOAR transforms order allocation and robot scheduling into a unified process by utilizing soft order allocations as observations. We formulate this as an Event-Driven Markov Decision Process, enabling the agent to perform simultaneous scheduling in response to asynchronous system events. Technically, we employ a Heterogeneous Graph Transformer to encode the warehouse state and integrate phased domain knowledge. Additionally, we incorporate a reward shaping strategy to address sparse feedback in long-horizon tasks. Extensive experiments on synthetic and real-world industrial datasets, in collaboration with Geekplus, demonstrate that SOAR reduces global makespan by 7.5\% and average order completion time by 15.4\% with sub-100ms latency. Furthermore, sim-to-real deployment confirms its practical viability and significant performance gains in production environments. The code is available at https://github.com/200815147/SOAR.
Chinese Translation
机器人移动履行系统(RMFS)依赖移动机器人进行自动化库存运输,协调订单分配与机器人调度以提高仓储效率。然而,由于严格的实时约束和多阶段决策的强耦合性,优化 RMFS 面临挑战。现有方法要么将问题分解为孤立的子任务以保证响应性,但代价是牺牲全局最优性;要么依赖计算代价高昂的全局优化模型,这不适合动态工业环境。为弥补这一缺口,我们提出了 SOAR,这是一个统一的深度强化学习框架,用于实时联合优化。SOAR 通过将软订单分配作为观测,将订单分配和机器人调度转化为一个统一的过程。我们将其形式化为事件驱动的马尔可夫决策过程,使代理能够针对异步系统事件进行同步调度。从技术上讲,我们采用异构图 Transformer 来编码仓库状态并整合分阶段的领域知识。此外,我们还结合了奖励塑形策略,以应对长时域任务中的稀疏反馈。在与 Geekplus 合作进行的合成与真实工业数据集上的大量实验证明,SOAR 将全局完工时间(makespan)减少了 7.5\%,平均订单完成时间减少了 15.4\%,并且延迟低于 100ms。此外,仿真到现实的部署证实了其在生产环境中的实际可行性和显著的性能提升。代码可在 https://github.com/200815147/SOAR 获得。
cs.AI / 37 / 2605.03847

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

机械良知:机器智能可靠性的数学框架
Batzorig, Munkhdegerekh, Ganbold, Purevbaatar, Park, Kyungbin, Jeong, Pilkong, Kangbin
Abstract
Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.
Chinese Translation
分布式协作智能(DCI)涵盖边缘到边缘的架构、联邦学习、迁移学习和群体系统,它所创造的环境使涌现风险在结构上不可避免:个体代理局部正确的决策在不确定性下组合成全局不可接受的行为轨迹。现有方法如约束优化、安全强化学习和运行时保障评估的是个体行动层面的可接受性,而非跨行为轨迹的可接受性,并且没有任何方法针对 DCI 部署中多参与者、充满不确定性的特性。本文介绍了一种名为机械良知(mechanical conscience,MC)的新概念和简化数学框架,将轨迹级别的规范性调节落实到单一智能体和分布式智能系统中。机械良知被定义为一种监督过滤器,它对基线策略的行动进行最小程度的修正,以减少与规范性可允许区域的累积偏差,同时考虑到认知不确定性。我们引入了相关构造,包括良知得分、机械罪疚和共振可靠性,为这一新兴领域提供可解释的词汇和可计算的治理信号。本文建立了核心理论性质:可接受性等价性、最优调节的存在性以及偏差的单调减少。示例性结果表明,在传统控制器漂移到可接受范围之外的情况下,受 MC 调节的智能体仍能保持轨迹级的规范性可接受性,并且该框架能够自然扩展到抑制多智能体 DCI 环境中因交互引发的涌现风险。
cs.AI / 38 / 2605.03862

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

仅仅正确是不够的:使用执行者引导奖励训练推理规划器
Han, Tianyang, Shi, Hengyu, Hu, Junjie, Yang, Xu, Wang, Zhiling, Su, Junhao
Abstract
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.
Chinese Translation
具有可验证奖励的强化学习已经成为提高大型语言模型显式推理的常用方式,但仅凭最终答案的正确性并不能揭示推理过程是否忠实、可靠或对消耗它的模型有用。这种仅基于结果的信号可能会强化因错误理由而得出正确答案的推理过程,通过奖励捷径来夸大推理收益,并在多步骤系统中传播有缺陷的中间状态。为此,我们提出了TraceLift,一个将推理视为可消耗中间产物的规划者-执行者训练框架。在规划者训练期间,规划者生成带标记的推理。一个被冻结的执行者将这一推理转化为最终产物,以便获得验证器反馈,而基于执行者的奖励则塑造中间推理过程。该奖励将基于评分标准的推理奖励模型(RM)分数与在同一冻结执行者上测得的提升相乘,从而奖励既高质量又有用的推理过程。为了使推理质量可以直接学习,我们引入了TRACELIFT-GROUPS,这是一个带评分标准注释的仅推理数据集,构建于数学和代码种子问题之上。每个样例都是一个同题组,包含一个高质量的参考推理过程和多个看似合理但存在缺陷的推理过程,这些缺陷来自局部扰动,它们降低了推理质量或对解答的支持,同时保持任务相关性。在代码和数学基准上的广泛实验表明,与仅基于执行结果的训练相比,这种基于执行者的推理奖励改进了两阶段规划者-执行者系统,这表明推理监督不仅应评估推理过程看起来是否良好,还应评估它是否对消耗该推理的模型有帮助。
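The reward described in the abstract multiplies a rubric-based RM score by the measured uplift on the frozen executor. A minimal sketch follows, assuming that uplift is the executor's success rate with the trace minus its success rate without it, and that negative uplift is clipped to zero. Both choices, and the function name, are illustrative assumptions rather than the paper's definition.

```python
def executor_grounded_reward(rm_score, success_with_trace, success_without_trace):
    """Reward = rubric RM score x measured uplift on the frozen executor.

    Assumptions: uplift is a difference of success rates in [0, 1], and
    traces that hurt the executor earn zero rather than a negative reward.
    """
    uplift = max(0.0, success_with_trace - success_without_trace)
    return rm_score * uplift
```

The product form means a trace has to score well on both factors: one that looks good to the rubric but does not help the executor gets no credit, and neither does a helpful trace that the rubric rates at zero.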
cs.AI / 39 / 2605.03863

Quantifying the human visual exposome with vision language models

通过视觉语言模型量化人类视觉外显组
Rominger, Christian, Schwerdtfeger, Andreas R., Singh, Malay Gaherwar, Khudyakow, Dimitri, Michels, Elizabeth A. M., Wolf, Fabian, Kather, Jakob Nikolas, Wekenborg, Magdalena Katharina
Abstract
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, failing to capture the first person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant generated photographs, VLM derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi autonomous large language model (LLM) based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real world imagery, up to 33 percent of VLM extracted context ratings significantly correlated with affect and stress. These findings establish a scalable objective paradigm for visual exposomics, enabling high throughput decoding of how the visible world is associated with mental health.
Chinese Translation
视觉环境是影响心理健康的一个基本但尚未量化的因素。尽管环境外显组的概念已相当成熟,现有方法仍依赖粗略的地理空间代理或有偏的自我报告,未能捕捉日常生活中的第一人称视觉情境。我们通过将生态瞬时评估与视觉语言模型(VLMs)相结合,填补了这一空白,以量化人类视觉体验的语义丰富性。在2674幅参与者生成的照片中,VLM推导的绿度估计稳健地预测了瞬时情感和慢性压力,这与既定基准相一致。随后,我们开发了一个半自主的大型语言模型(LLM)驱动的流程,该流程挖掘了超过七百万篇科学出版物,以提取近千个与心理健康实证相关的环境特征。当应用于真实世界影像时,最多有33%的VLM提取的语境评分与情感和压力显著相关。这些发现确立了一种可扩展的客观视觉外显组学范式,使得高通量解码可见世界与心理健康之间的关联成为可能。
cs.AI / 40 / 2605.03871

EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

EvoLM:通过共同进化的判别标准实现自我进化语言模型
Li, Shuyue Stella, Xin, Rui, Xiao, Teng, Wang, Yike, Shao, Rulin, Hao, Zoey, Sclar, Melanie, Oh, Sewoong, Brahman, Faeze, Koh, Pang Wei, Tsvetkov, Yulia
Abstract
Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling. Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal. EVOLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for discriminative utility, which maximizes a small frozen judge's ability to distinguish preferred from dispreferred responses; and (2) a policy trained using those rubric-conditioned scores as reward. All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM demonstrates that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.
Chinese Translation
语言模型从预训练中编码了大量的评估知识,但当前的后训练方法依赖外部监督(人类注释、专有模型或标量奖励模型)来产生奖励信号。这些方法都设定了上限。人类判断无法监督其自身以外的能力,专有API创造了依赖关系,而可验证的奖励仅覆盖真实答案的领域。从模型自身的评估能力中获得自我改进是一种可与模型自身扩展的奖励来源,但目前的方法在这一方面仍然未得到充分利用。我们提出了EVOLM,一种后训练方法,将这种能力结构化为明确的判别标准,并将其用作训练信号。EVOLM在单一语言模型中交替训练两种能力:(1)生成器,生产针对具体实例的优化评估标准,以提高判别效用,并最大化一个小型固定评判者区分偏好与非偏好响应的能力;(2)利用这些基于标准条件的分数作为奖励进行训练的策略。所有偏好信号均通过与早期检查点的时间对比,从策略自身的输出中构建,无需人类注释或外部监督。EVOLM训练的Qwen3-8B模型生成的标准在RewardBench-2上比GPT-4.1提升了25.7%。共训练的策略在OLMo3-Adapt套件上的平均得分为69.3%,比使用GPT-4.1提示标准训练的策略高出3.9%,比使用最先进的8B奖励模型SkyWork-RM高出16%。总体而言,EVOLM证明了将模型的评估能力结构化为共同进化的判别标准能够在没有外部监督的情况下实现自我改进。
cs.AI / 41 / 2605.03884

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

QKVShare:用于多智能体设备端大型语言模型的量化KV缓存切换
Honavar, Pratik, GVSL, Tejpratap
Abstract
Multi-agent LLM systems on edge devices need to hand off latent context efficiently, but the practical choices today are expensive re-prefill or full-precision KV transfer. We study QKVShare, a framework for quantized KV-cache handoff between agents that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path. Our current results support a narrower but clearer story than the original draft: on 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher-budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re-prefill at every tested context, from 130.7 ms vs. 150.2 ms at nominal 1K context to 397.1 ms vs. 1029.7 ms at nominal 8K context. Stage timing shows that post-injection generation, not card creation, dominates the current QKVShare latency path. These results position quantized KV handoff as a promising on-device systems direction while also highlighting the need for stronger controller ablations and apples-to-apples runtime comparisons.
Chinese Translation
边缘设备上的多智能体大型语言模型(LLM)系统需要高效地切换潜在上下文,但目前的实际选择是昂贵的重新预填充或全精度的KV转移。我们研究了QKVShare,一个用于智能体之间量化KV缓存切换的框架,结合了基于令牌的混合精度分配、自包含的CacheCard表示和与HuggingFace兼容的缓存注入路径。我们目前的结果支持一个比原始草稿更窄但更清晰的结论:在150个GSM8K问题中使用Llama-3.1-8B-Instruct,适应性量化在重复切换下仍具有竞争力,并在更深的跳数和更高的预算设置中展现其对比均匀量化的明显优势;在切换延迟方面,QKVShare路径在每个测试上下文中相对于完全重新预填充减少了TTFT,从名义1K上下文的130.7毫秒对150.2毫秒,到名义8K上下文的397.1毫秒对1029.7毫秒。阶段时序显示,注入后的生成而非卡片创建主导了当前的QKVShare延迟路径。这些结果将量化KV切换定位为一个有前景的设备端系统方向,同时强调了更强控制器消融实验和同类运行时比较的必要性。
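To make the "token-level mixed-precision allocation" idea in the QKVShare entry concrete, here is a minimal Python sketch. It is not the authors' implementation: the importance scores, the 8-bit/4-bit budget, the 50% split, and the helper names `allocate_bits`, `quantize`, and `dequantize` are all assumptions for illustration, and the CacheCard format is not modeled.

```python
def allocate_bits(importance, high_bits=8, low_bits=4, high_fraction=0.5):
    """Give the most important tokens high_bits and the rest low_bits (hypothetical rule)."""
    n_high = int(len(importance) * high_fraction)
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(len(importance))]


def quantize(values, bits):
    """Uniform min-max quantization of one token's vector.

    Returns integer codes plus the (lo, scale) pair needed to decode them.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


# Toy example: four tokens with made-up importance scores and tiny KV rows.
importance = [0.9, 0.1, 0.5, 0.2]
kv_rows = [[0.0, 0.5, 1.0], [-1.0, 0.0, 1.0], [0.2, 0.4, 0.6], [3.0, 3.0, 3.0]]
bits = allocate_bits(importance)
packed = [quantize(row, b) for row, b in zip(kv_rows, bits)]
restored = [dequantize(*p) for p in packed]
```

The reconstruction error of each value is bounded by half a quantization step, so tokens given more bits are restored more faithfully; that trade-off is what a per-token allocation rule would be tuning.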
cs.AI / 42 / 2605.03900

Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

情境下的多目标优化:重新思考前沿人工智能系统中的目标
Zhou, Jie, Chen, Qin, He, Liang
Abstract
Frontier AI systems perform best in settings with clear, stable, and verifiable objectives, such as code generation, mathematical reasoning, games, and unit-test-driven tasks. They remain less reliable in open-ended settings, including scientific assistance, long-horizon agents, high-stakes advice, personalization, and tool use, where the relevant objective is ambiguous, context-dependent, delayed, or only partially observable. We argue that many such failures are not merely failures of scale or capability, but failures of objective selection: the system optimizes a locally visible signal while missing which objectives should govern the interaction. We formulate this problem as contextual multi-objective optimization. In this setting, systems must consider multiple, context-dependent objectives, such as helpfulness, truthfulness, safety, privacy, calibration, non-manipulation, user preference, reversibility, and stakeholder impact, while determining which objectives are active, which are soft preferences, and which must function as hard or quasi-hard constraints. These examples are not intended as an exhaustive taxonomy: different domains and deployment settings may activate different objective dimensions and different conflict-resolution procedures. Our framework models AI behavior as a context-dependent choice rule over candidate actions, objective estimates, active constraints, stakeholders, uncertainty, and conflict-resolution procedures. We outline an implementation pathway based on decomposed objective representations, context-to-objective routing, hierarchical constraints, deliberative policy reasoning, controlled personalization, tool-use control, diagnostic evaluation, auditing, and post-deployment revision.
Chinese Translation
前沿人工智能系统在具有明确、稳定和可验证目标的环境中表现最佳,例如代码生成、数学推理、游戏和单元测试驱动任务。而在开放式环境中,包括科学辅助、长期代理、高风险建议、个性化和工具使用等,相关目标往往模糊、依赖于上下文、延迟或仅部分可观察,因此这些系统的可靠性较低。我们认为,许多此类失败不仅仅是规模或能力的问题,更是目标选择的失败:系统优化了局部可见信号,却未能确定哪些目标应主导交互。我们将这个问题表述为情境多目标优化(contextual multi-objective optimization)。在这种情况下,系统必须考虑多个依赖于上下文的目标,如有用性、真实性、安全性、隐私、校准、非操控性、用户偏好、可逆性和利益相关者影响,同时确定哪些目标是活跃的,哪些是软偏好,以及哪些必须作为硬约束或准硬约束执行。这些例子并不是详尽的分类法:不同的领域和部署环境可能激活不同的目标维度和冲突解决程序。我们的框架将人工智能的行为建模为对候选动作、目标估计、活跃约束、利益相关者、不确定性和冲突解决程序的情境依赖选择规则。我们概述了一条实施路径,基于目标表征的分解、上下文到目标的路由、层级约束、深思熟虑的政策推理、受控个性化、工具使用控制、诊断评估、审计和后部署修订。
cs.AI / 43 / 2605.03986

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

从意图到执行:通过代理推荐构建代理工作流程
Athrey, Kishan, Pishehvar, Ramin, Riordan, Brian, Viswanathan, Mahesh
Abstract
Multi-Agent Systems (MAS) built using AI agents fulfill a variety of user intents that may be used to design and build a family of related applications. However, the creation of such MAS currently involves manual composition of the plan, manual selection of appropriate agents, and manual creation of execution graphs. This paper introduces a framework for the automated creation of multi-agent systems which replaces multiple manual steps with an automated framework. The proposed framework consists of software modules and a workflow to orchestrate the requisite task-specific application. The modules include: an LLM-derived planner, a set of tasks described in natural language, a dynamic call graph, an orchestrator for mapping agents to tasks, and an agent recommender that finds the most suitable agent(s) from local and global agent registries. The agent recommender uses a two-stage information retrieval (IR) system comprising a fast retriever and an LLM-based re-ranker. We implemented a series of experiments exploring the choice of embedders, re-rankers, agent description enrichment, and supervising critique agent. We benchmarked this system end-to-end, evaluating the combination of planning, agent selection, and task completion, with our proposed approach. Our experimental results show that our approach outperforms the state-of-the-art in terms of the recall rate and is more robust and scalable compared to previous approaches. The critique agent holistically reevaluates both agent and tool recommendations against the overall plan. We show that the inclusion of the critique agent further enhances the recall score, proving that the comprehensive review and revision of task-based agent selection is an essential step in building end-to-end multi-agent systems.
Chinese Translation
使用 AI 代理构建的多代理系统(MAS)满足用户的各种意图,这些意图可以用于设计和构建一系列相关应用。然而,目前创建此类 MAS 涉及手动构建计划、手动选择适当的代理以及手动创建执行图。本论文提出了一种多代理系统的自动化创建框架,它用自动化框架替代了多个手动步骤。所提出的框架由软件模块和一个工作流程组成,用于协调所需的任务特定应用。模块包括:一个基于 LLM 的规划器、一组以自然语言描述的任务、一个动态调用图、一个负责将代理映射到任务的协调器,以及一个从本地和全球代理注册表中寻找最合适代理的代理推荐系统。代理推荐系统采用了一个包括快速检索器和基于 LLM 的重新排序器的两阶段信息检索(IR)系统。我们实施了一系列实验,探索嵌入器、重新排序器、代理描述增强和监督批判代理的选择。我们对该系统进行了端到端基准测试,评估了规划、代理选择和任务完成的组合效果,并与我们提出的方法进行了比较。实验结果表明,我们的方法在召回率方面优于现有最先进技术,并且相较于之前的方法更具鲁棒性和可扩展性。批判代理从整体上重新评估代理和工具推荐与整体计划的一致性。我们展示了批判代理的纳入进一步提升了召回分数,证明了针对任务基础代理选择的全面审查和修订是在构建端到端多代理系统中一个至关重要的步骤。
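The two-stage agent recommender described in the entry above (a fast retriever followed by a re-ranker) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's system: Jaccard token overlap stands in for the embedding-based retriever, `rerank_score` is a hypothetical callable standing in for the LLM-based re-ranker, and the registry contents are invented.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve(task, registry, k=3):
    """Stage 1: cheap lexical overlap (Jaccard) between task and agent description."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    task_tokens = tokens(task)
    scored = sorted(
        registry.items(),
        key=lambda kv: jaccard(task_tokens, tokens(kv[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


def recommend(task, registry, rerank_score, k=3):
    """Stage 2: re-order the stage-1 candidates with a more expensive scorer."""
    candidates = retrieve(task, registry, k)
    return sorted(candidates, key=lambda name: rerank_score(task, registry[name]), reverse=True)


# Invented registry of agent descriptions (hypothetical names).
registry = {
    "sql_agent": "query a sql database and return rows",
    "mail_agent": "send an email message",
    "chart_agent": "plot rows from a database as a chart",
}
task = "plot database rows as a chart"
shortlist = retrieve(task, registry, k=2)
# Stand-in re-ranker that happens to prefer agents mentioning "sql".
ranked = recommend(task, registry, lambda t, desc: 1.0 if "sql" in desc else 0.0, k=2)
```

The design point is that the cheap first stage bounds how many candidates the expensive second stage has to score, so the re-ranker can only reorder what the retriever lets through.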
cs.AI / 44 / 2605.03989

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

面向智能体的可插拔经验-RAG技能用于经验驱动的检索策略协调
Zhang, Dutao, Liao, Tian
Abstract
Retrieval-augmented generation systems often assume that one fixed retrieval pipeline is sufficient across heterogeneous tasks, yet factoid question answering, multi-hop reasoning, and scientific verification exhibit different retrieval preferences. We present Experience-RAG Skill, an agent-oriented pluggable retrieval orchestration layer positioned between the agent and the retriever pool. The proposed skill analyzes the current scene, consults an experience memory, selects an appropriate retrieval strategy, and returns structured evidence to the agent. Under a fixed candidate pool, Experience-RAG Skill achieves an overall nDCG@10 of 0.8924 on BeIR/nq, BeIR/hotpotqa, and BeIR/scifact, outperforming fixed single-retriever baselines and remaining competitive with Adaptive-RAG-style routing. The results suggest that retrieval strategy selection can be productively encapsulated as a reusable agent skill rather than being hard-coded in the upper workflow.
Chinese Translation
检索增强生成系统通常假设在异构任务中一个固定的检索管道已足够,但事实问答、多跳推理和科学验证表现出不同的检索偏好。我们提出了经验-RAG技能(Experience-RAG Skill),这是一个面向智能体的可插拔检索协调层,位于智能体与检索器池之间。该技能分析当前场景,咨询经验记忆,选择合适的检索策略,并向智能体返回结构化证据。在固定候选池下,经验-RAG技能在 BeIR/nq、BeIR/hotpotqa 和 BeIR/scifact 数据集上实现了 0.8924 的整体 nDCG@10,超越了固定单一检索器基线,并与 Adaptive-RAG 式路由相比仍具竞争力。结果表明,检索策略选择可以有效地封装为可重用的智能体技能,而不是硬编码在上层工作流程中。
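The Experience-RAG entry above reports nDCG@10. As a reminder of what that metric measures, here is a small Python sketch. It assumes linear gain and computes the ideal ranking from the same judged list; an evaluation toolkit may build the ideal ranking from all judged documents for the query, so treat this as an illustration rather than the paper's evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([1, 0, 0])   # relevant document ranked first
demoted = ndcg_at_k([0, 1, 0])   # same document pushed to rank 2
```

Moving the single relevant document from rank 1 to rank 2 multiplies the score by 1 / log2(3), which is why retrieval strategies that differ only in ordering can still differ noticeably in nDCG.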
cs.AI / 45 / 2605.04012

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

SymptomAI:面向日常症状评估的对话式人工智能代理
Breda, Joseph, Yousif, Fadi, Hawkins, Beszel, Cotoi, Marinela, Liu, Miao, Luo, Ray, Chen, Po-Hsuan Cameron, Schaekermann, Mike, Schmidgall, Samuel, Liu, Xin, Narayanswamy, Girish, Solomon, Samuel, Xu, Maxwell A., Fan, Xiaoran, Shangguan, Longfei, Wang, Anran, Daryani, Bhavna, Herkenham, Buddy, Tan, Cara, Malhotra, Mark, Patel, Shwetak, Hernandez, John B., Duong, Quang, Liu, Yun, Wasson, Zach, Antos, Dimitrios, Lou, Bob, Thompson, Matthew, Richina, Jonathan, Pathak, Anupam, Young-Lin, Nichole, Sunshine, Jake, McDuff, Daniel
Abstract
Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies that conduct a dedicated symptom interview, eliciting additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.
Chinese Translation
语言模型在经过整理的医学案例研究和情景分析中的诊断评估表现优异,能够与临床专业人员相媲美甚至更好。然而,现有研究主要集中在复杂情境和丰富背景下,这使得我们难以得出关于这些系统在患者日常生活中报告症状时的表现的结论。我们部署了SymptomAI,一套用于端到端患者访谈和鉴别诊断(DDx)的对话式人工智能代理,通过Fitbit应用在一项随机分配参与者(N=13,917)与五个AI代理互动的研究中进行。这一语料库捕捉了来自真实世界人群的多样化交流及疾病的现实分布。在这当中,一部分1,228名参与者报告了临床医生提供的诊断,并且其中的517名参与者经过超过250小时的注释由临床医生小组进一步评估。在盲随机比较中,SymptomAI的鉴别诊断准确性显著高于独立临床医生(OR = 2.47, p < 0.001),在相同对话情境下。此外,采用专门症状访谈的代理策略在提供诊断前,引导额外症状信息的收集,其表现显著优于基线的用户引导对话(p < 0.001)。对来自美国一般人群面板的1,509次对话的辅助分析验证了这些结果超出了可穿戴设备用户的范畴。我们使用SymptomAI的诊断结果作为所有13,917名参与者的标签,分析了近400种独特病症下超过500,000天的可穿戴指标。我们确认急性感染与生理变化之间的强相关性(例如,流感的OR > 7)。尽管受限于自我报告的真实情况,这些结果展示了专门且完整的症状访谈相较于用户引导的症状讨论的优势,而后者是大多数消费级大型语言模型的默认设置。
cs.AI / 46 / 2605.04019

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

在主体时代重新定义人工智能红队:从数周到数小时
Dheekonda, Raja Sekhar Rao, Pearce, Will, Landers, Nick
Abstract
AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specific workflows. Operators spend weeks hand-crafting workflows - assembling attacks, transforms, and scorers. When results fall short, workflows must be rebuilt. As a result, operators spend more time constructing workflows than probing targets for security and safety vulnerabilities. We introduce an AI red teaming agent built on the open-source Dreadnode SDK. The agent creates workflows grounded in 45+ adversarial attacks, 450+ transforms, and 130+ scorers. Operators can probe multi-agent systems, multilingual, and multimodal targets, focusing on what to probe rather than how to implement it. We make three contributions: 1. Agentic interface. Operators describe goals in natural language via the Dreadnode TUI (Terminal User Interface). The agent handles attack selection, transform composition, execution, and reporting, letting operators focus on red teaming. Weeks compress to hours. 2. Unified framework. A single framework for probing traditional ML models (adversarial examples) and generative AI systems (jailbreaks), removing the need for separate libraries. 3. Llama Scout case study. We red team Meta Llama Scout and achieve an 85% attack success rate with severity up to 1.0, using zero human-developed code.
Chinese Translation
人工智能系统正在进入医疗、金融和国防等关键领域,但仍然容易受到对抗攻击。虽然人工智能红队是主要的防御手段,但当前的方法迫使操作员进入手动和特定库的工作流程。操作员需要花费几周的时间来手工构建工作流程,包括组装攻击、变换和评分器。当结果不尽如人意时,工作流程必须重新构建。因此,操作员在构建工作流程上花费的时间超过了对安全和安全漏洞进行探测的时间。我们推出了一种基于开源 Dreadnode SDK 的人工智能红队代理。该代理创建了基于45种以上对抗攻击、450种以上变换和130种以上评分器的工作流程。操作员可以探测多代理系统、多语言和多模态目标,专注于要探测的内容,而不是如何实现。我们的贡献有三点:1. 主体界面。操作员通过 Dreadnode TUI(终端用户界面)用自然语言描述目标。代理处理攻击选择、变换组合、执行和报告,让操作员专注于红队工作。数周缩短为数小时。2. 统一框架。一个统一的框架用于探测传统机器学习模型(对抗样本)和生成式人工智能系统(越狱),消除了需要独立库的需求。3. Llama Scout 案例研究。我们对 Meta Llama Scout 进行了红队测试,并在使用零人类开发代码的情况下,达到了85%的攻击成功率,严重性高达1.0。
cs.AI / 47 / 2605.04036

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

OpenSeeker-v2:推进具有信息丰富性和高难度轨迹的搜索代理的极限
Du, Yuwen, Ye, Rui, Tang, Shuo, Huang, Keduan, Zhu, Xinyu, Cai, Yuzhu, Chen, Siheng
Abstract
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications (scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering), we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with a heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
Chinese Translation
深度搜索能力已成为前沿大语言模型(Large Language Model, LLM)代理不可或缺的能力,但其发展仍然受到行业巨头的主导。典型的行业流程涉及一个高度资源密集的管道,涵盖预训练、持续预训练(Continual Pre-training, CPT)、监督微调(Supervised Fine-tuning, SFT)和强化学习(Reinforcement Learning, RL)。在本报告中,我们展示了当使用信息丰富且高难度的轨迹时,简单的SFT方法对训练前沿搜索代理可能是令人惊讶的强大。通过引入三种简单的数据合成修改:扩大知识图谱规模以实现更丰富的探索、增加工具集规模以实现更广泛的功能、以及严格的低步过滤,我们建立了一个更强的基线。在仅使用10.6k数据点的基础上,我们的OpenSeeker-v2在四个基准测试中实现了最先进的性能(具有ReAct范式的30B规模代理):在BrowseComp上获得46.0%,在BrowseComp-ZH上获得58.1%,在Humanity's Last Exam上获得34.6%,在xbench上获得78.0%,超越了即使经过重度CPT+SFT+RL管道训练的Tongyi DeepResearch,后者分别达到43.4%、46.7%、32.9%和75.0%。值得注意的是,OpenSeeker-v2是首个在其模型规模和范式内由纯学术团队使用SFT开发的最先进搜索代理。我们非常高兴地开源OpenSeeker-v2模型权重,并分享我们简单而有效的发现,使前沿搜索代理的研究更易于社区访问。
计算语言学 (Computation and Language)
50
cs.CL / 1 / 2605.02915

When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal

语言模型何时应信任自己?同模型自我验证作为条件信心信号
Phalod, Aditya Ajay
Abstract
Same-model self-verification, prompting a model to audit its own predicted answer, is a plausible confidence signal for selective prediction, but its practical value remains unclear once strong likelihood-based baselines are taken seriously. We evaluate self-verification against two such baselines, LL-AVG and LL-SUM, on ARC-Challenge and TruthfulQA-MC across multiple model families, scales, and prompt variants. We measure not only correctness ranking, but also abstention quality through AURC and operating-point analyses. The results are sharply task- and model-dependent. On ARC-Challenge, self-verification substantially improves over LL-AVG for Phi-2 and the Qwen models, with the largest gains appearing in Qwen-7B. On TruthfulQA-MC, however, the signal is less reliable: smaller models can become prompt-sensitive, DeepSeek-R1-Distill-8B degrades relative to LL-AVG, and LL-SUM often remains the stronger practical baseline. We therefore do not treat self-verification as a general-purpose uncertainty estimator. In this setting, it is better understood as a conditional confidence signal whose value depends on task type, model family, prompt formulation, and, crucially, the baseline it must beat.
Chinese Translation
同模型自我验证,即提示模型审计其自身预测答案,是一种合理的选择性预测信心信号,但一旦认真考虑强的基于似然的基准,其实际价值仍然不清楚。我们针对两种此类基准LL-AVG和LL-SUM,在ARC-Challenge和TruthfulQA-MC上评估自我验证,涵盖多个模型系列、规模和提示变体。我们不仅测量正确性排名,还通过AURC和操作点分析评估弃权质量。结果明显依赖于任务和模型。在ARC-Challenge中,自我验证在Phi-2和Qwen模型上显著超越LL-AVG,尤其是在Qwen-7B模型中获得了最大的提升。然而,在TruthfulQA-MC中,该信号的可靠性较低:较小的模型可能变得对提示敏感,DeepSeek-R1-Distill-8B相对于LL-AVG的表现下降,而LL-SUM往往仍然是更强的实际基准。因此,我们不将自我验证视为一种通用的不确定性估计器。在这种情况下,它更适合被理解为一种条件信心信号,其价值依赖于任务类型、模型系列、提示形式,以及至关重要的,必须打败的基准。
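The entry above compares self-verification against two likelihood baselines, LL-AVG and LL-SUM, without defining them in the abstract. The sketch below assumes the natural reading of the names (mean versus sum of an answer option's token log-likelihoods); the option log-probabilities are invented, and `pick_answer` is a hypothetical helper.

```python
def ll_sum(token_logprobs):
    """Total log-likelihood of an answer; tends to favor shorter answers."""
    return sum(token_logprobs)


def ll_avg(token_logprobs):
    """Length-normalized log-likelihood (mean per token)."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_answer(option_logprobs, score_fn):
    """Return (best option, its score); the score doubles as a confidence signal."""
    scored = {opt: score_fn(lps) for opt, lps in option_logprobs.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Invented per-token log-probabilities for two multiple-choice options.
options = {
    "A": [-0.5, -0.5, -0.5, -0.5],  # longer answer, each token fairly likely
    "B": [-0.9, -0.9],              # shorter answer, each token less likely
}
choice_sum, conf_sum = pick_answer(options, ll_sum)
choice_avg, conf_avg = pick_answer(options, ll_avg)
```

In this toy case the two baselines disagree: the sum prefers the short option B, while the average prefers option A. That sensitivity to answer length is one reason the choice of baseline can change the conclusion about whether self-verification adds value.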
cs.CL / 2 / 2605.03050

Evaluating Reasoning Models for Queries with Presuppositions

评估具前设的查询推理模型
Sathyanathan, Rose, Vasisht, Kinshuk, Pruthi, Danish
Abstract
Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fail to challenge such erroneous assumptions, and can reinforce users' misinformed opinions. However, given the recent advances, especially in models' reasoning capabilities, we revisit whether large reasoning models (LRMs) can reason about the underlying assumptions and respond to user queries appropriately. We construct queries with varying degrees of presuppositions spanning health, science, and general knowledge, and use them to evaluate several widely-deployed models. When compared to non-reasoning models, we find that reasoning models achieve a slightly higher accuracy (2-11%), but they still fail to challenge a large fraction (26-42%) of false presuppositions. Further, reasoning models remain susceptible to how strongly the presupposition is expressed.
Chinese Translation
数以百万计的用户依赖人工智能模型满足其信息需求。可以想象,许多用户查询中包含的假设可能在事实层面上不准确。先前的研究指出,大型语言模型(LLMs)常常未能质疑这些错误假设,反而可能强化用户的错误观点。然而,考虑到最近的进展,特别是在模型推理能力方面,我们重新审视大型推理模型(LRMs)是否能够对潜在假设进行推理,并适当地响应用户查询。我们构建了涵盖健康、科学和一般知识的各种前设程度的查询,并用其评估几种广泛应用的模型。与非推理模型相比,我们发现推理模型的准确率略高(2-11%),但它们仍然未能挑战大部分(26-42%)错误前设。此外,推理模型在多大程度上表达前设的强度方面仍然容易受到影响。
cs.CL / 3 / 2605.03052

How Language Models Process Negation

语言模型如何处理否定
Zhou, Zhejian, Zhou, Tianyi, Jia, Robin, May, Jonathan
Abstract
We study how Large Language Models (LLMs) process negation mechanistically. First, we establish that even though open-weight models often provide wrong answers to questions involving negation, they do possess internal components that process negation correctly. Their poor accuracy is due to late-layer attention behavior that promotes simple shortcuts; ablating those attention modules greatly improves accuracy on negation-related questions. Second, we uncover how models process negation. We consider two hypotheses: models could use attention heads that attend to the phrase being negated and suppress related concepts, or they could directly construct a representation of the entire negative phrase (e.g., representing "not gas" as a vector that promotes liquids and solids). We apply a range of observational and causal interpretability techniques on Mistral-7B and Llama-3.1-8B to show that models implement both mechanisms, with the "constructive" mechanism being more prominent. Combined, our work deepens the understanding of LLMs' internals, highlighting construction-dominant computations and the coexistence of competing mechanisms within LLMs.
Chinese Translation
我们研究大型语言模型(Large Language Models, LLMs)如何机械性地处理否定。首先,我们证实尽管开放权重模型在涉及否定的问题上经常提供错误答案,但它们确实拥有内部组件能够正确处理否定。这种较差的准确性源于后层注意力行为,促使模型采用简单的捷径;消去这些注意力模块显著提高了在与否定相关的问题上的准确性。其次,我们揭示了模型如何处理否定。我们考虑了两个假设:模型可能使用注意力头关注被否定的短语并抑制相关概念,或它们可能直接构建整个否定短语的表示(例如,将“not gas”表示为一个促进液体和固体的向量)。我们在 Mistral-7B 和 Llama-3.1-8B 上应用了一系列观察性和因果可解释性技术,显示模型同时实现这两种机制,其中“构建性”机制更为显著。综合来看,我们的研究加深了对LLMs内部机制的理解,突出了以构建为主的计算以及LLMs中竞争机制的共存。
cs.CL / 4 / 2605.03073

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

TTS-STT 飞轮:合成实体密集音频弥补商业与开源系统在印度语言自动语音识别中的差距
Menta, Venkata Pushpak Teja
Abstract
Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic codemix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held-out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves Entity-Hit-Rate (EHR) 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: beta-Hi 0.337 (7x vs vasista22) and beta-Ta 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi where Deepgram has substantial entity coverage, the flywheel underperforms commercial. All three beta models fall below pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta); we report honestly. A native-human-recorded sanity check (n=20 Telugu) confirms transfer to real speech (beta-Te EHR 0.516 on native vs 0.473 on synth). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 has Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated on Hindi and Tamil where vanilla SFR >= 0.98. Code, holdouts, predictions, EDSA corpus, and entity dictionaries are released open-source.
Chinese Translation
针对利基领域印度语言自动语音识别(ASR)——数字字符串、货币金额、地址、品牌名称、英语/印度语言混合——目前的开源最先进技术(SOTA)和商业系统服务不足。在一个合成的密集实体泰卢固测试集上(未被合成系统使用),vasista22/whisper-telugu-large-v2(开源 SOTA)实现了 0.027 的实体命中率(EHR),而 Deepgram Nova-3(商业)达到了 0.16。我们通过一个自包含的 TTS<->STT 飞轮来缩小这一差距:一个开源的印度语言 TTS 流水线以不到 50 美元的边际成本合成约 22,000 个实体密集的印度英语混合语句,并对 vasista22 进行 LoRA 微调,在持出测试集上实现了 0.473 的 EHR(相比开源 SOTA 提高 17 倍,相比商业系统提高 3 倍),在 FLEURS-Te 上的朗读回归的字错误率(WER)限制在 +6.6 个百分点。跨语言:beta-Hi 为 0.337(相比 vasista22 提高 7 倍),beta-Ta 为 0.543(相比 vasista22 提高 22 倍,相比 Deepgram 提高 22 倍);在 Deepgram 拥有大量实体覆盖的印地语上,飞轮表现不及商业系统。这三种 beta 模型均未达预注册的 EHR 目标(Te: 0.75, Hi/Ta: 0.65);我们诚实报告。一个由母语者录制的合理性检查(样本量 n=20 泰卢固语)确认了对真实语音的迁移(beta-Te 在母语与合成语音上的 EHR 分别为 0.516 和 0.473)。对 EDSA 隔离的消融实验(仅对 FLEURS-Te 进行 LoRA 微调)在同一持出样本上产生了 0.020 的 EHR,将约 100% 的增益归功于 EDSA 语料库。我们还报告了一个与语言相关的发现:原始的 Whisper-large-v3 在泰卢固语中出现特定的脚本崩溃(SFR 0.46-0.71),而每种语言的 LoRA 纠正了该问题(SFR 0.81-0.97),但在原始 SFR >= 0.98 的印地语和泰米尔语上不适用。代码、持出样本、预测、EDSA 语料库和实体词典已开放源代码发布。
cs.CL / 5 / 2605.03092

Semantically Enriching Investor Micro-blogs for Opinion-Aware Emotion Analysis: A Practical Approach

语义丰富投资者微博以进行情感意识分析:一种实用方法
Negi, Gaurav, Buitelaar, Paul
Abstract
While sentiment analysis is the staple of financial NLP, capturing the nuances of 'why' behind that sentiment remains a challenge. There have been attempts to address this by analysing investor emotions alongside sentiment; however, this does not provide the additional granularity required to understand the target of the emotion/sentiment. We address this by augmenting the StockEmotions dataset with semantically structured opinion graphs, which provide granular semantic depth to the existing sentiment and emotion labels. Using a declarative LLM pipeline, we augment the StockEmotions dataset with opinion graphs for each sentence, derived from 10,000 comments collected from StockTwits. In addition, we study the effect of introducing opinion semantics on baseline classifiers using Graph Neural Networks (GNNs). Our analysis demonstrates that incorporating opinion semantics improves classification performance across different emotional spectrums.
Chinese Translation
尽管情感分析是金融自然语言处理(NLP)的基础,但捕捉情感背后“为什么”的细微差别仍然是一项挑战。尽管已有尝试通过分析投资者情绪与情感结合来解决这个问题,但这并未提供理解情绪/情感目标所需的额外细节。我们通过用语义结构化的意见图增强StockEmotions数据集来解决这个问题,这为现有的情感和情绪标签提供了细致的语义深度。我们使用声明式的LLM(大语言模型)管道,为每个句子从StockTwits收集的10,000条评论中生成意见图,增强StockEmotions数据集。此外,我们研究了引入意见语义对基线分类器使用图神经网络(GNNs)的影响。我们的分析表明,纳入意见语义能改善不同情感谱的分类性能。
cs.CL / 6 / 2605.03103

MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

MedStruct-S:用于关键发现、关键条件问答和从OCR临床报告中提取半结构化信息的基准
Li, Yingyun, Wang, Yu, Qian, Haiyang
Abstract
Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients' longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction. However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise. MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters. Our results show that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. When comparing models of similar order of magnitude, encoder-only models still perform better overall. Without controlling for model scale, fine-tuned decoder-only models deliver the strongest overall results. These findings show that the benchmark provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.
Chinese Translation
从OCR生成的临床报告中进行半结构化信息提取对于高效重建患者的纵向医疗历史至关重要。在实际操作中,这一场景通常涉及三个任务:(i) 字段标题(关键)发现,(ii) 基于关键的问答 (QA),以及 (iii) 端到端关键值对提取。然而,目前的评估往往未充分考虑两个因素:异构且不完全已知的关键表示,以及OCR引入的噪声。这使得在实际环境中评估模型的鲁棒性变得困难。我们提出了MedStruct-S,这是一个专门设计用于在未知关键和OCR噪声下评估这些任务的基准。MedStruct-S包含3,582个注释的实际临床报告页面。利用MedStruct-S,我们基准了两种典型的范式:带有后处理的仅编码器序列标注和仅解码器的结构生成,涵盖了四种仅编码器模型和五种仅解码器模型,参数范围从0.11B到103B。我们的结果表明,尽管仅编码器模型的规模明显小于仅解码器模型,但在非空值关键条件问答任务中,仍实现了最佳性能。在比较相同数量级的模型时,仅编码器模型整体表现仍优于解码器模型。在未控制模型规模的情况下,经过微调的仅解码器模型提供了最强的整体结果。这些发现表明,该基准为在不同半结构化信息提取设置中选择和比较模型提供了可靠且实用的基础。
cs.CL / 7 / 2605.03147

Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

有效的绩效测量:从财报电话会议中提取关键绩效指标的挑战与机遇
Aavang, Rasmus T., Tjalk-Bøggild, Rasmus, Iolov, Alexandre, Rizzi, Giovanni, Zhang, Mike, Bjerva, Johannes
Abstract
Earnings calls are a key source of financial information about public companies. However, extracting information from these calls is difficult. Unlike the templatic filings required by the U.S. Securities and Exchange Commission (SEC) to report a company's financial situation, earnings conference calls have no built-in labels, are unstructured, and feature conversational language. We explore this challenging domain by assessing the information captured by models trained on SEC filings and in-context learning methods. To establish a baseline, we first evaluate the generalization capabilities of SEC-trained models across established SEC datasets. To support our investigation, we introduce three novel benchmarks: (1) SEC Filings Benchmark (SECB), (2) Earnings Calls Benchmark (ECB), and (3) ECB-A, a subset with 2,460 expert annotation groups to support our qualitative analysis. We find that encoder-based models struggle with the domain shift. Finally, we propose a system utilizing LLMs to perform open-ended extraction from unstructured call transcripts, verified by human evaluation (79.7% precision), providing a baseline for this valuable domain through the consistent tracking of emergent KPIs.
Chinese Translation
财报电话会议是获取上市公司财务信息的关键来源。然而,从这些会议中提取信息颇具挑战性。与美国证券交易委员会(SEC)要求的模板化文件不同,财报电话会议没有内置标签、结构松散并且使用对话式语言。我们通过评估在SEC文件上训练的模型以及上下文学习方法捕获的信息,探索这一具有挑战性的领域。为了建立基线,我们首先评估SEC训练模型在已建立SEC数据集上的泛化能力。为了支持我们的研究,我们引入了三个新基准: (1) SEC文件基准(SECB),(2) 财报电话会议基准(ECB),以及ECB-A,一个包含2,460个专家注释组的子集,以支持我们的定性分析。我们发现基于编码器的模型在领域转移中表现不佳。最后,我们提出了一种利用大型语言模型(LLMs)从非结构化电话会议记录中进行开放式提取的系统,经过人工评估验证其准确率达到79.7%,为该重要领域通过持续跟踪新兴的关键绩效指标提供了基线。
cs.CL / 8 / 2605.03196

Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability

几何偏差作为无监督的预生成可靠性信号:探究大语言模型表示的可回答性
Du, Yucheng
Abstract
A reliable language model should be able to signal, prior to generation, when a query falls outside its knowledge. We investigate whether representation geometry can provide such a pre-generation signal by measuring the deviation of hidden states from an answerable reference set, requiring no labeled failure data and no access to model outputs. Across three instruction-tuned models (Llama 3.1-8B, Qwen 2.5-7B, and Mistral-7B-Instruct) and three prompt forms (Math, Fact, Code), we find that geometry primarily encodes task form. Within mathematical prompts, unanswerable inputs consistently deviate from the answerable centroid, yielding strong separation (ROC-AUC 0.78-0.84). This single-pass pre-generation signal outperforms a simple refusal baseline and compares favorably to self-consistency. It also captures cases where models do not explicitly refuse. In contrast, no reliable geometric signal emerges for factual prompts, indicating that the effect is form-conditional rather than universal. Code prompts show large effect sizes with higher variance, suggesting partial generalization beyond mathematical form. A layer-wise analysis reveals that the signal arises in early layers and gradually attenuates toward the output. These results suggest that answerability-related geometry is established before the final stages of generation. Together, these findings indicate that geometric deviation can serve as a lightweight pre-generation signal that is reliable in structured domains with formal answerability constraints, with clear boundaries on where it generalizes.
Chinese Translation
一个可靠的语言模型应能够在生成之前发出信号,以指示查询是否超出其知识范围。我们研究了表示几何是否能够提供这种预生成信号,通过测量隐藏状态与可回答参考集的偏差,方法不需要标记的失败数据和模型输出的访问。在三个经过指令调优的模型(Llama 3.1-8B,Qwen 2.5-7B 和 Mistral-7B-Instruct)及三种提示形式(数学、事实、代码)中,我们发现几何主要编码任务形式。在数学提示中,无法回答的输入与可回答的中心点一致地偏离,从而实现了强分离(ROC-AUC 0.78-0.84)。这种单通道预生成信号优于简单的拒绝基线,并与自一致性相比具有良好的表现。它还捕捉到了模型未明确拒绝的情况。相比之下,对于事实提示,没有出现可靠的几何信号,这表明此效应是条件于形式而非普遍性的。代码提示显示出较大的效应规模和更高的方差,暗示部分超出了数学形式的推广。逐层分析表明,该信号在早期层中产生,并逐渐减弱至输出。这些结果表明,与可回答性相关的几何在生成的最后阶段之前就已建立。总的来看,这些发现表明几何偏差可以作为一种轻量级的预生成信号,在具有正式可回答性约束的结构化领域中是可靠的,并在其推广的边界上有明确的界限。
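The geometric-deviation signal in the entry above can be illustrated with a toy Python sketch: score each query by its distance from the centroid of an answerable reference set, then measure separation with ROC-AUC. The Euclidean metric, the 2-D stand-ins for hidden states, and all data points are assumptions; the abstract does not specify the distance measure or which layer's states are used.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def deviation(vector, center):
    """Euclidean distance from the reference centroid (assumed metric)."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(vector, center)))


def roc_auc(scores, labels):
    """Probability that a positive (label 1) outscores a negative; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy 2-D "hidden states": a reference set of answerable prompts.
reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center = centroid(reference)

# Held-out queries; label 1 = unanswerable, 0 = answerable.
queries = [[1.0, 1.5], [1.2, 1.0], [4.0, 4.0], [1.3, 1.0]]
labels = [0, 0, 1, 1]
scores = [deviation(q, center) for q in queries]
auc = roc_auc(scores, labels)
```

Only answerable examples are needed to build the centroid, which matches the entry's claim that no labeled failure data is required; labels appear here solely to evaluate the signal.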
cs.CL / 9 / 2605.03229

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

稀疏记忆微调作为LoRA和完全微调的低遗忘替代方案
Gupta, Prakhar, Shah, Garv, Goyal, Satyam, Kanchi, Anirudh
Abstract
Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.
Chinese Translation
将预训练语言模型适应于新任务往往会损害其原有的通用能力,这个问题被称为灾难性遗忘。稀疏记忆微调(Sparse Memory Finetuning, SMF)通过在模型中添加键值记忆层来避免这一点,并在每次训练步骤中,仅更新当前批次最为频繁读取的小集合记忆行。我们在Qwen-2.5-0.5B-Instruct上重新实现了SMF,并将其与LoRA和在MedMCQA(一个4选项医疗考试任务)上的完全微调进行了比较,使用WikiText困惑度和TriviaQA准确度作为遗忘探针。结果显示,SMF将MedMCQA的性能提高了2.5个百分点,同时保持两个遗忘探针与基础模型的误差约在1点内;而LoRA和完全微调虽然获得了更大的提升,但在两个指标上都表现出明显的漂移。我们还比较了两种行选择规则(KL散度和TF-IDF),这两种规则在遗忘指标的平衡上表现出不同的效果。
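Note: the row-selection idea above (update only the memory rows the current batch reads most heavily) can be sketched as follows. The abstract names TF-IDF as one selection rule but gives no formula, so the smoothed scoring below, the function name, and the toy counts are all assumptions rather than the authors' implementation.

```python
import math
from collections import Counter

def select_rows(batch_reads, background_doc_freq, num_background_batches, top_t):
    """Rank memory rows by a TF-IDF-style score and return the top_t row ids.

    batch_reads: row ids read by the current batch (one entry per read).
    background_doc_freq: row id -> number of background batches that read it.
    """
    term_freq = Counter(batch_reads)
    scores = {}
    for row, count in term_freq.items():
        doc_freq = background_doc_freq.get(row, 0)
        # Smoothed inverse document frequency: rows read everywhere score near zero.
        idf = math.log((num_background_batches + 1) / (doc_freq + 1))
        scores[row] = count * idf
    # Highest score first; ties broken by row id for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))[:top_t]

# Toy example: row 7 is read most often but is also read by almost every
# background batch, so the batch-specific rows 3 and 9 are selected instead.
rows_to_update = select_rows(
    batch_reads=[7, 7, 7, 3, 3, 9],
    background_doc_freq={7: 99, 3: 1, 9: 0},
    num_background_batches=100,
    top_t=2,
)
```

Only the selected rows would receive gradient updates on that step; all other memory rows stay frozen, which is the mechanism the abstract credits for low forgetting.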
cs.CL / 10 / 2605.03244

S^2tory: Story Spine Distillation for Movie Script Summarization

S^2tory:基于故事主干的电影剧本摘要提取
Lu, Mingzhe, Liu, Yanbing, Wang, Qihao, Zhang, Jiarui, Wu, Jiayue, Hu, Yue, Li, Yunpeng, Xu, Yangyan
Abstract
Movie scripts pose a fundamental challenge for automatic summarization due to their non-linear, cross-cut narrative structure, which makes surface-level saliency methods ineffective at preserving core story progression. To address this, we introduce S^2tory (Story Spine Distillation), a narratology-grounded framework that leverages character development trajectories to identify plot nuclei, the essential events that drive the narrative forward, while filtering out peripheral satellite events that merely enrich atmosphere or emotion. Our Narrative Expert Agent (NEAgent) performs theory-constrained reasoning, whose distilled knowledge conditions a small model to identify plot nuclei. Another model then uses these plot nuclei to generate the summary. Experiments on the MovieSum dataset demonstrate state-of-the-art semantic fidelity at approximately 3.5x compression, and zero-shot evaluation on BookSum confirms strong out-of-domain generalization. Human evaluation further validates that narratological theory provides an indispensable foundation for modeling complex, non-linear narratives.
Chinese Translation
电影剧本由于其非线性、交叉剪辑的叙事结构,给自动摘要带来了根本性挑战,这使得表层的突出性方法在保持核心故事进展方面效果不佳。为此,我们提出了S^2tory(故事主干提炼),一个基于叙事学的框架,利用角色发展轨迹来识别情节核,即推动叙事发展的关键事件,同时过滤出仅丰富氛围或情感的外围卫星事件。我们的叙事专家代理(NEAgent)执行受理论约束的推理,其提炼的知识为小模型提供条件,以识别情节核。另一个模型则利用这些情节核生成摘要。对MovieSum数据集的实验证明在约3.5倍压缩下达到最先进的语义保真度,而对BookSum的零样本评估则证实了强大的跨领域泛化能力。人类评估进一步验证了叙事学理论为建模复杂、非线性叙事提供了不可或缺的基础。
cs.CL / 11 / 2605.03299

LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models

LLM-XTM:利用大语言模型增强跨语言主题模型
Xuan, Minh Chu, Nguyen, Tien-Phat, Van, Linh Ngo, Sang, Dinh Viet, Diep, Nguyen Thi Ngoc, Le, Trung
Abstract
Cross-lingual topic modeling aims to discover shared semantic structures across languages, yet existing models depend on sparse bilingual resources and often yield incoherent or weakly aligned topics. Recent LLM-based refinements improve interpretability but are costly, document-level, and prone to hallucination, with prior white-box approaches requiring inaccessible token probabilities. We propose LLM-XTM, a framework that integrates LLM-guided topic refinement with self-consistency uncertainty quantification, enabling black-box, stable, and scalable enhancement of cross-lingual topic models. Experiments on multilingual corpora show that LLM-XTM achieves superior topic coherence and alignment while reducing reliance on bilingual dictionaries and expensive LLM calls.
Chinese Translation
跨语言主题建模旨在发现不同语言间共享的语义结构,但现有模型依赖于稀疏的双语资源,通常导致主题不连贯或对齐不佳。最近基于大语言模型(LLM)的改进提高了可解释性,但代价高昂,基于文档,并且容易产生虚假信息,之前的透明白盒方法需要不可获取的令牌概率。我们提出LLM-XTM,这是一种将LLM引导的主题精炼与自我一致性不确定性量化相结合的框架,能够实现对跨语言主题模型的黑箱、稳定和可扩展的增强。在多语言语料库上的实验表明,LLM-XTM实现了更优的主题连贯性和对齐,同时减少了对双语词典和昂贵LLM调用的依赖。
cs.CL / 12 / 2605.03301

SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

SHIELD:用于企业规模去标识化的多样化临床记录数据集和精简小型语言模型
Posada, Jose D., Love, David, Datta, Somalee, Desai, Priya
Abstract
De-identification of clinical text remains essential for secondary use of electronic health records (EHRs), yet public benchmarks such as i2b2 2006/2014 are over a decade old and lack the semantic and demographic diversity of modern narratives. While Large Language Models (LLMs) achieve state-of-the-art zero-shot extraction, enterprise deployment is hindered by compute costs and governance restricting Protected Health Information (PHI) from cloud APIs. We introduce SHIELD (Synthetic Human-annotated Identifier-replaced Entries for Learning and De-identification), a diverse dataset of 1,394 notes with 10,505 gold-standard PHI spans across 9 categories, built via set-cover diversity sampling with human-in-the-loop adjudication. We evaluate four LLMs (two proprietary, two open-weight) to establish a performance ceiling, then distill these capabilities into locally deployable Small Language Models (SLMs). Distributional analysis using Frechet Text Distance and Jensen-Shannon Divergence confirms SHIELD occupies a distinct region of biomedical embedding and vocabulary space versus legacy benchmarks. Our best distilled model matches its teacher on structured PHI categories (DATE, DOCTOR, ID, PATIENT, PHONE) and achieves micro-averaged span-level precision of 0.88 and recall of 0.86 on standard workstation hardware. Cross-dataset evaluation shows diversity-trained models generalize well on universal structured PHI, while institution-specific entities remain hard to transfer, suggesting optimal deployment combines broad-coverage models with specialized models for high-volume notes. We publicly release the SHIELD dataset and the distilled DeBERTa v3 model.
Chinese Translation
临床文本的去标识化对于电子健康记录(EHR)的二次使用仍然至关重要,但诸如i2b2 2006/2014等公共基准已有十多年历史,并缺乏现代叙述的语义和人口统计多样性。尽管大型语言模型(LLMs)在零样本提取中达到了最先进的水平,但企业部署由于计算成本和治理限制使受保护健康信息(PHI)无法使用云API受限。我们介绍了SHIELD(合成人工标注的标识符替代条目用于学习和去标识化),这是一个包含1394条记录的多样化数据集,具有10505个金标准PHI跨度,涵盖9个类别,通过集合覆盖多样性抽样和人工审议构建而成。我们评估了四个LLM(两个专有模型,两个开放权重模型)以确定性能上限,然后将这些能力提炼为可本地部署的小型语言模型(SLMs)。使用Frechet文本距离和Jensen-Shannon散度的分布分析确认SHIELD占据了生物医学嵌入和词汇空间中与传统基准不同的区域。我们最好的精简模型在结构化PHI类别(日期、医生、ID、病人、电话)上与其教师模型相匹配,并在标准工作站硬件上达到了0.88的微平均跨度精准度和0.86的召回率。跨数据集评估表明,经过多样性训练的模型能够很好地泛化到通用结构化PHI,而特定于机构的实体仍然难以迁移,这表明最佳部署将广覆盖模型与用于高容量记录的专业模型相结合。我们公开发布SHIELD数据集和精简的DeBERTa v3模型。
cs.CL / 13 / 2605.03314

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

何时思考,何时发言:学习大型语言模型推理的披露策略
Wei, Jiaqi, Guo, Xuehang, Yu, Pengfei, Zhang, Xiang, Ouyang, Wanli, Sun, Siqi, Wang, Qingyun, You, Chenyu
Abstract
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the Pareto trade-off between accuracy and content latency under token-level proxies (e.g., inter-update waiting).
Chinese Translation
在单流自回归接口中,相同的标记既更新模型状态,又构成不可逆的公共承诺。这种耦合产生了沉默税:额外的思考拖延了首个任务相关内容的生成,而过早流式输出则存在导致后续生成偏差的风险。我们提出了并行(Side-by-Side, SxS)交替推理方法,使得披露时机成为标准自回归生成中的一个可控决策。SxS将部分披露与在相同上下文中的持续私密推理交替进行,但只在这些内容得到现有推理支持时才会发布。为了在不激励填充内容的情况下学习这种节奏,我们构建了符合蕴涵的交替轨迹,通过将答案前缀与支持推理前缀进行匹配来实现,然后通过监督微调(SFT)训练以获得双重动作语义,并通过强化学习(RL)在新格式下恢复推理性能。在两个Qwen3架构/规模(MoE Qwen3-30B-A3B,稠密 Qwen3-4B)以及域内(AIME25)和跨域(GPQA-Diamond)基准上,SxS在标记级代理指标(例如,更新间等待)下改善了准确性与内容延迟之间的帕累托权衡。
cs.CL / 14 / 2605.03387

From prompting to evidence-based translation: A RAG+prompt system for Japanese-Chinese translation and its pedagogical potential

从提示到基于证据的翻译:用于日中翻译的 RAG+Prompt 系统及其教学潜力
Gu, Wenshi
Abstract
Large language models perform well on high-resource pairs but are less reliable for Japanese-Chinese sentences containing noun-modifying clause constructions (NMCCs). This study evaluates a retrieval-augmented generation RAG+Prompt translation system that integrates linguistic analysis, embedding-based retrieval, prompt construction, and LLM generation without modifying the base model. The analysis module outputs A1 (inner vs. outer NMCC) and A2 (risk predictions: lexical choice/NMCC handling/word order/style/register); top-k = 5 similar Ja-Zh examples (L2 distance) and A1/A2 are inserted into an enhanced prompt. Using GPT-4o and a 66-sentence test set, we compare six knowledge-base sizes (0/100/200/500/1,000/2,000). Macro-averaged sentence-level BLEU (1-4-gram with brevity penalty; cased; Chinese at the character level) is the sole metric. Mean BLEU increases from 24.28 at 0 (RAG disabled) to 29.96 at 2,000 (+5.68; +23.4%). The upward trend holds across sizes, with larger knowledge bases yielding higher scores. We conclude that the RAG+Prompt translation system improves Ja-Zh translation of sentences containing NMCCs in an interpretable and auditable manner. Limitations include one base model, one metric, and reliance on published texts and commercial APIs; future work will broaden genres, language pairs, and evaluation metrics.
Chinese Translation
大型语言模型在高资源语对上表现良好,但在包含名词修饰性从句结构(NMCC)的日中句子翻译方面可靠性较低。本研究评估了一种检索增强生成(RAG+Prompt)翻译系统,该系统整合了语言分析、基于嵌入的检索、提示构建和大语言模型(LLM)生成,而无需修改基础模型。分析模块输出 A1(内NMCC与外NMCC)和 A2(风险预测:词汇选择/NMCC处理/词序/风格/语域);将 top-k = 5 个相似的日中例句(L2 距离)及 A1/A2 插入增强提示中。使用 GPT-4o 以及 66 句测试集,我们比较了六种知识库规模(0/100/200/500/1,000/2,000)。唯一的评价指标为宏观平均句级 BLEU(1-4-gram 及简洁性惩罚;保留大小写;中文字符级)。平均 BLEU 从 0(RAG 禁用)的 24.28 增加到 2,000 的 29.96(+5.68;+23.4%)。这一上升趋势在各个规模中持续,较大的知识库带来更高的得分。我们得出结论,RAG+Prompt 翻译系统以可解释和可审计的方式提高了 NMCC 句子的日中翻译效果。局限性包括仅使用一个基础模型、一个指标,及依赖于已发布文本和商业 API;未来的研究将扩大体裁、语言对和评价指标。
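Note: the retrieval and prompt-construction steps described above can be sketched as below. This is an illustration under stated assumptions: the prompt wording, the dictionary keys, the 2-dimensional toy embeddings, and the placeholder sentences are all hypothetical; only top-k = 5 and L2 distance come from the abstract.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_top_k(query_vec, knowledge_base, k=5):
    """Return the k knowledge-base entries closest to the query embedding."""
    ranked = sorted(knowledge_base, key=lambda ex: l2_distance(query_vec, ex["vec"]))
    return ranked[:k]

def build_prompt(source_sentence, a1_label, a2_risks, examples):
    """Insert the analysis outputs (A1, A2) and retrieved pairs into one prompt."""
    lines = [
        "Translate the Japanese sentence into Chinese.",
        f"NMCC type (A1): {a1_label}",
        f"Predicted risks (A2): {', '.join(a2_risks)}",
        "Reference examples:",
    ]
    for ex in examples:
        lines.append(f"- {ex['ja']} => {ex['zh']}")
    lines.append(f"Sentence: {source_sentence}")
    return "\n".join(lines)

# Placeholder knowledge base with toy embeddings (no real sentence pairs).
knowledge_base = [
    {"ja": f"<ja example {i}>", "zh": f"<zh example {i}>", "vec": [float(i), 0.0]}
    for i in range(8)
]
examples = retrieve_top_k([2.2, 0.0], knowledge_base, k=5)
prompt = build_prompt("<ja source sentence>", "inner",
                      ["word order", "lexical choice"], examples)

# Arithmetic check of the reported gain: 24.28 -> 29.96 mean BLEU.
absolute_gain = 29.96 - 24.28
relative_gain_percent = absolute_gain / 24.28 * 100
```

The reported figures are internally consistent: the absolute gain is 5.68 BLEU and the relative gain rounds to 23.4%.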
cs.CL / 15 / 2605.03414

Geolocating News about Extreme Climate Events: A Comparative Analysis of Off-the-Shelf Tools for Toponym Identification in German

对极端气候事件新闻的地理定位:德语地名识别现成工具的比较分析
Madureira, Brielen, de Brito, Mariana Madruga, Niekler, Andreas
Abstract
Determining the geolocation of extreme climate events and disasters in texts is a common problem in climate impact and adaptation research. Named-entity recognition (NER) tools are typically used to identify a pool of toponyms that serve as candidate event locations. In this study, we conduct a comparative analysis of three off-the-shelf NER tools, namely Flair, Spacy and Stanza. We describe and quantify differences between their outputs for German news articles and evaluate them extrinsically based on three methods to determine the country where events took place. We show how their contrasts are propagated into downstream tasks and can yield distinct decisions about a document's geographical focus, which, in turn, can impact conclusions about countries' prominence in German media.
Chinese Translation
确定文本中极端气候事件和灾害的地理位置是气候影响和适应研究中的一个普遍问题。命名实体识别(NER)工具通常用于识别作为候选事件地点的地名池。在本研究中,我们对三种现成的NER工具,即Flair、Spacy和Stanza进行比较分析。我们描述并量化它们在德语新闻文章中输出之间的差异,并基于三种方法对它们进行外部评估,以确定事件发生的国家。我们展示了这些工具之间的对比是如何影响下游任务的,并可能导致有关文档地理焦点的不同决策,这反过来又可能影响对德国媒体中各国突出性的结论。
cs.CL / 16 / 2605.03439

Benchmarking Logistic Regression, SVM, Naive Bayes, and IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian Product Reviews

对印度尼西亚产品评论情感分析的逻辑回归、支持向量机、朴素贝叶斯和IndoBERT微调的基准测试
Zahra, Nabila Zakiyah, Farhanatussaidah, Salwa, Afifah, Nasywa Nur, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin C. T.
Abstract
The exponential growth of e-commerce platforms in Indonesia has generated a massive volume of user-generated product reviews. Analyzing the sentiment of these reviews is critical for measuring customer satisfaction and identifying product issues at scale. This paper benchmarks traditional Machine Learning (ML) approaches against a Transformer-based Deep Learning model for a three-class sentiment analysis task (positive, neutral, negative) on the Tokopedia Product Reviews 2025 dataset. We implemented Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction coupled with three algorithms: Logistic Regression, Linear Support Vector Machine (SVM), and Multinomial Naive Bayes as robust baselines. Subsequently, we fine-tuned the IndoBERT model (indobenchmark/indobert-base-p1) for contextual sequence classification. To computationally address the severe class imbalance inherent in e-commerce feedback, we applied balanced class weights for the baseline models and engineered a custom weighted cross-entropy loss function within the IndoBERT training loop, following the broader motivation of imbalanced-learning research. Our comprehensive evaluation using Accuracy, Macro F1-score, and Weighted F1-score revealed that the traditional Linear SVC model significantly outperformed the IndoBERT model in our experimental setup, achieving an Accuracy of 97.60% and a Macro F1-score of 0.5510, compared to IndoBERT's 88.70% and 0.5088. Detailed analysis indicates that this performance gap was primarily driven by discrepancies in the data sampling regimes, where baselines utilized the full corpus while the Transformer was constrained to a sampled subset. Finally, we demonstrate the practical viability of our pipeline by deploying the final sentiment classification model as an interactive Gradio web application.
Chinese Translation
印度尼西亚电子商务平台的指数级增长产生了大量用户生成的产品评论。分析这些评论的情感对于衡量客户满意度和识别产品问题至关重要。本文针对Tokopedia产品评论2025数据集的三类情感分析任务(积极、中立、消极),对传统机器学习(ML)方法与基于变换器的深度学习模型进行了基准测试。我们实施了词频-逆文档频率(TF-IDF)特征提取,并结合三种算法:逻辑回归、线性支持向量机(SVM)和多项式朴素贝叶斯作为稳健的基线。随后,我们对IndoBERT模型(indobenchmark/indobert-base-p1)进行了上下文序列分类的微调。为了在计算上应对电子商务反馈中固有的严重类别不平衡,我们对基线模型应用了平衡类别权重,并在IndoBERT训练循环中设计了自定义加权交叉熵损失函数,遵循不平衡学习研究的更广泛动机。通过准确率、宏观F1分数和加权F1分数的全面评估,结果表明,在我们的实验设置中,传统的线性SVC模型显著优于IndoBERT模型,准确率达到97.60%,宏观F1分数为0.5510,而IndoBERT的准确率为88.70%、宏观F1分数为0.5088。详细分析表明,这一性能差异主要源于数据抽样机制的差异,基线模型使用了完整语料库,而变换器则受限于抽样子集。最后,我们通过将最终情感分类模型部署为交互式Gradio网页应用程序,展示了我们工作流程的实用可行性。
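Note: the gap between 97.60% accuracy and a Macro F1 of 0.5510 reported above is typical of severe class imbalance. The sketch below uses invented toy labels (not the Tokopedia data) to show how accuracy and macro F1 diverge, and how "balanced" class weights are computed with the common heuristic n_samples / (n_classes * class_count). The function names are illustrative assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    f1_values = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_values.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_values) / len(f1_values)

# Toy imbalanced data: a classifier that always predicts the majority class.
y_true = ["positive"] * 94 + ["neutral"] * 3 + ["negative"] * 3
y_pred = ["positive"] * 100

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

In this toy case accuracy is 0.94 while macro F1 is about 0.32, because two of the three classes are never predicted. The balanced weights give the rare classes a much larger per-sample weight than the majority class, which is the purpose of the weighted loss mentioned in the abstract.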
cs.CL / 17 / 2605.03440

A Comparison of Traditional Machine Learning Algorithms and LSTM-Based Deep Learning Models for Email Sentiment Analysis

传统机器学习算法与基于长短期记忆网络(LSTM)深度学习模型在电子邮件情感分析中的比较
Saragih, Virdio Samuel, Abirawa, Baruna, Simbolon, Kartini Lovian, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin C. T.
Abstract
The rapid growth of electronic communication has necessitated more robust systems for email classification and sentiment detection. This study presents a comparative performance analysis between traditional machine learning algorithms and deep learning architectures, specifically focusing on Support Vector Machines (SVMs), Logistic Regression, Naive Bayes, and Long Short-Term Memory (LSTM). Utilizing Word2Vec embeddings for feature representation, our experimental results indicate that the SVM model with a linear kernel achieves the highest efficiency and accuracy, reaching a peak performance of 98.74%. While the LSTM model demonstrates exceptional recall capabilities in detecting spam-related sentiments, it requires significantly more computational time compared to discriminative statistical models. Detailed evaluations via confusion matrices further reveal that traditional classifiers remain highly robust for dense vector spaces. This research concludes that for email detection tasks, SVM offers the most optimal balance between predictive precision and processing speed. These findings provide critical insights for developing high-performance automated email filtering systems in professional and academic environments.
Chinese Translation
电子通信的快速增长要求更加健壮的系统来进行电子邮件分类和情感检测。本文对传统机器学习算法与深度学习架构进行了性能比较分析,具体关注支持向量机(Support Vector Machines, SVM)、逻辑回归(Logistic Regression)、朴素贝叶斯(Naive Bayes)和长短期记忆网络(Long Short-Term Memory, LSTM)。通过使用Word2Vec嵌入进行特征表示,我们的实验结果表明,具有线性核的SVM模型在效率和准确性方面表现最佳,达到了98.74%的峰值性能。虽然LSTM模型在检测与垃圾邮件相关的情感时展现出卓越的召回能力,但与判别统计模型相比,它需要显著更多的计算时间。通过混淆矩阵的详细评估进一步揭示,传统分类器在密集向量空间中依旧保持高度稳健。研究得出结论,对于电子邮件检测任务,SVM在预测精度和处理速度之间提供了最佳的平衡。这些发现为在专业和学术环境中开发高性能自动电子邮件过滤系统提供了重要的见解。
cs.CL / 18 / 2605.03443

Sentiment Analysis of Indonesian Spotify Reviews Using Machine Learning and BiLSTM

基于机器学习和双向长短时记忆网络的印尼Spotify评论情感分析
Purba, Uliano Wilyam, Parhusip, Andre Hadiman Rotua, Maulana, Sahid, Muthoharoh, Luluk, Satria, Ardika, Manullang, Martin C. T.
Abstract
This paper benchmarks classical machine learning and deep learning approaches for three-class sentiment classification of Indonesian Spotify reviews. Using 100,000 scraped reviews and 70,155 cleaned samples, the study compares Support Vector Machine, Multinomial Naive Bayes, and Decision Tree models with a two-layer BiLSTM. Both approaches use the same preprocessing pipeline, including slang normalization, stopword removal, and stemming. Decision Tree achieves the best performance among the classical models, while BiLSTM attains the highest weighted F1-score overall but fails on the minority neutral class. The paper concludes that BiLSTM is stronger for overall sentiment detection, whereas machine learning with SMOTE provides more balanced three-class performance.
Chinese Translation
本文对印尼Spotify评论的三类情感分类进行了经典机器学习和深度学习方法的基准测试。研究使用了10万条抓取的评论和70,155条清理过的样本,对支持向量机(Support Vector Machine)、多项式朴素贝叶斯(Multinomial Naive Bayes)和决策树(Decision Tree)模型与两层双向长短时记忆网络(BiLSTM)进行了比较。两种方法都使用了相同的预处理流程,包括俚语标准化、停止词删除和词干提取。结果显示,决策树在经典模型中表现最佳,而BiLSTM则在整体上获得了最高的加权F1-score,但在少数中性类上表现不佳。本文得出结论,BiLSTM在整体情感检测上表现更强,而使用SMOTE的机器学习则提供了更加平衡的三类性能。
cs.CL / 19 / 2605.03447

An ERP Study of Recursive Possessive Parsing in ASD Children and Its Cognitive Neuro Mechanisms

自闭症谱系儿童递归性拥有句解析的 ERP 研究及其认知神经机制
Chenxi, Fu, Xiaoyi, Wang, Ziman, Zhuang, Caimei, Yang
Abstract
Recursive structures are a core property of human language, yet little is known about how children with autism spectrum disorder (ASD) process complex recursion. This ERP study investigated the online processing of two-level recursive possessive structures in Mandarin-speaking children with ASD (n = 12) compared to typically developing (TD) peers (n = 12) using a sentence-picture matching paradigm. ERPs were analyzed for P200 (150-250 ms), N400 (300-500 ms), and P600 (500-1000 ms). Results showed that ASD children exhibited significantly reduced P200 amplitudes and failed to show the typical posterior grammaticality effect, indicating atypical early perceptual processing. No robust N400 violation effect was observed in either group, confirming the mismatch was not a semantic anomaly; however, ASD children showed a reversed anterior effect and an attenuated posterior effect. For the P600, ASD children had significantly reduced amplitudes, no posterior grammaticality effect, and a trend toward delayed latency, reflecting a core deficit in syntactic reanalysis. These findings demonstrate that while lexical-semantic processing is relatively preserved in ASD, the online syntactic computation required for recursion is severely impaired, supporting modular dissociation accounts of language in autism.
Chinese Translation
递归结构是人类语言的核心特性,但关于自闭症谱系障碍(ASD)儿童如何处理复杂递归的信息仍然知之甚少。本研究使用事件相关电位(ERP)技术,比较了说普通话的ASD儿童(n = 12)与典型发展儿童(TD)同伴(n = 12)在句子-图片匹配范式下对两层递归拥有结构的在线处理。针对 P200(150-250 ms)、N400(300-500 ms)和 P600(500-1000 ms)进行了 ERP 分析。结果显示,ASD儿童的 P200 振幅显著降低,并且未能显示出典型的后期语法效应,这表明早期感知处理不典型。两组均未观察到显著的 N400 违反效应,确认该不匹配并非语义异常;然而,ASD儿童表现出了相反的前期效应和减弱的后期效应。对于 P600,ASD儿童的振幅显著减少,未出现后期语法效应,并趋向延迟的潜伏期,反映了句法重分析的核心缺陷。这些发现表明,尽管ASD儿童的词汇-语义处理相对保留,但递归所需的在线句法计算严重受损,支持了关于自闭症中语言模块解离的观点。
cs.CL / 20 / 2605.03450

Retrieving Floods without Floodlights: Topic Models as Binary Classifiers for Extreme Climate Events in German News

无需灯光的洪水检索:主题模型作为极端气候事件的二元分类器在德国新闻中的应用
Madureira, Brielen, de Brito, Mariana Madruga, Niekler, Andreas
Abstract
In studies of media coverage of extreme climate events, NLP methods have become indispensable for identifying relevant texts in large news databases. Still, enough annotated data to train accurate deep learning-based classifiers from scratch is often not available. Topic Models have the advantage of being both unsupervised and interpretable, but are typically used only for exploratory analysis or data characterisation. In this study, we investigate how to employ Topic Models as binary classifiers for refining the retrieval of relevant news about seven types of extreme climate events in the German media. Our method relies on the posterior distributions estimated by Topic Models to select relevant documents, without modifying their training procedure. Using an annotated sample to guide the evaluation, we show that the probabilities assigned to keywords used to query news databases can also be informative for selecting relevant topics and improve sample precision. We compare our results to a fine-tuned text embedding classifier and an open-weight LLM, discussing observed trade-offs, e.g. the LLM's lowest precision. Moreover, we show that results are hazard-dependent, which speaks against considering climate events as a single category in NLP tasks.
Chinese Translation
在极端气候事件的媒体报道研究中,自然语言处理(NLP)方法已经成为识别大型新闻数据库中相关文本的不可或缺的工具。然而,通常没有足够的标注数据可供从零开始训练准确的基于深度学习的分类器。主题模型具有无监督和可解释的优势,但通常仅用于探索性分析或数据特征描述。本研究探讨了如何将主题模型用作二元分类器,以精炼德国媒体中与七种极端气候事件相关的新闻检索。我们的方法依赖于主题模型估计的后验分布,选择相关文档,而不修改其训练过程。通过使用带标注的样本来指导评估,我们显示用于查询新闻数据库的关键词分配的概率同样可以为选择相关主题提供信息,并提高样本的精度。我们将我们的结果与经过微调的文本嵌入分类器和开放权重的语言模型(LLM)进行比较,讨论观察到的权衡,例如,LLM 的最低精度。此外,我们还表明,结果是与危险性相关的,这反对在 NLP 任务中将气候事件视为单一类别的看法。
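Note: one way to read the method above is as two thresholds applied to a trained topic model's distributions. The sketch below is an assumption-laden illustration: the thresholds, function names, and toy probabilities are invented, and the abstract does not specify the exact selection rule.

```python
def select_topics(topic_word_probs, keywords, min_keyword_mass):
    """Keep topics that assign enough probability mass to the query keywords."""
    selected = []
    for topic_id, word_probs in enumerate(topic_word_probs):
        mass = sum(word_probs.get(word, 0.0) for word in keywords)
        if mass >= min_keyword_mass:
            selected.append(topic_id)
    return selected

def is_relevant(doc_topic_probs, selected_topics, min_doc_mass):
    """Classify a document as relevant if its posterior mass on the selected topics is high."""
    return sum(doc_topic_probs[t] for t in selected_topics) >= min_doc_mass

# Toy topic-word distributions (only a few words shown per topic).
topic_word_probs = [
    {"flut": 0.05, "hochwasser": 0.08, "regen": 0.02},
    {"fussball": 0.10, "flut": 0.001},
    {"hochwasser": 0.03, "deich": 0.04},
]
flood_topics = select_topics(topic_word_probs, ["flut", "hochwasser"], min_keyword_mass=0.02)

# Toy document-topic posteriors for two articles.
flood_article = is_relevant([0.7, 0.2, 0.1], flood_topics, min_doc_mass=0.5)
sports_article = is_relevant([0.1, 0.85, 0.05], flood_topics, min_doc_mass=0.5)
```

The topic model itself is left unchanged; only its estimated distributions are reused, which matches the abstract's statement that the training procedure is not modified.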
cs.CL / 21 / 2605.03472

Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs

利用动态情感特征图检测心理健康对话中的隐秘谄媚行为
Han, Tianze, Xu, Beining, Zhang, Hanbo, Lu, Yongming
Abstract
As conversational AI therapists are increasingly used in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. This paper studies multi-domain support-dialogue evaluation without relying on large language models as final judges. We use a direct LLM judge as a baseline that reads raw dialogue text and predicts whether the target response is harmful, productive, or neutral. We find that direct LLM judges and symmetric text-similarity metrics are poorly aligned with therapeutic quality because the target label depends on clinical direction: whether the response moves the user state toward regulation or reframing, leaves it broadly unchanged, or reinforces deterioration through higher risk affect or cognitive-distortion mass. To address this issue, we propose Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry. We evaluate DESG on a constructed diagnostic stress-test benchmark of 3,000 dialogue windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, covering peer support, counseling dialogue, and crisis-oriented interaction. On the 600-window held-out test aggregate, DESG-Ensemble achieves 0.9353 macro-F1, exceeding ConcatANN by 1.51 percentage points, BERTScore by 19.63 points, and TRACT by 33.81 points. Feature ablations, artifact controls, a 100-window blinded adjudicator audit, and qualitative disagreement cases indicate that the clinical state manifold is the main discriminative substrate, while graph-based trajectory components provide asymmetric scoring and interpretable diagnostics rather than serving as the sole source of performance.
Chinese Translation
随着对话式人工智能治疗师在心理支持领域的日益普及,如何可靠地离线评估治疗反应质量仍然是一个尚待解决的问题。本文研究了多领域支持对话的评估,而不依赖大型语言模型作为最终判断者。我们使用直接的LLM(大语言模型)评估人作为基线,该模型读取原始对话文本并预测目标反应是有害的、有效的或中性的。我们发现,直接的LLM评估人和对称文本相似度指标与治疗质量的关联性较差,因为目标标签取决于临床方向:即反应是使用户状态向调节或重构转变、保持大致不变,还是通过更高风险的情感或认知扭曲的累积来加剧恶化。为了解决这个问题,我们提出了动态情感特征图(Dynamic Emotional Signature Graphs, DESG),这是一种与模型无关的评估器,它通过解耦的临床状态表示对话窗口,并使用不对称的临床几何进行评分。我们在从EmpatheticDialogues、ESConv和CRADLE-Dialogue构建的3,000个对话窗口的诊断压力测试基准上评估DESG,涵盖了同伴支持、咨询对话和危机导向的互动。在600个窗口的保留测试集上,DESG-Ensemble达到了0.9353的宏平均F1值,超过了ConcatANN 1.51个百分点,BERTScore 19.63分,以及TRACT 33.81分。特征消融、伪影控制、100个窗口的盲审判审计和定性分歧案例表明,临床状态流形是主要的区分基础,而基于图的轨迹成分提供了不对称评分和可解释的诊断,而不是作为性能的唯一来源。
cs.CL / 22 / 2605.03476

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

CuraView:一种基于 GraphRAG 增强知识验证的多智能体医疗幻觉检测框架
Ye, Severin, Kong, Xiao, He, Xiaopeng, Yan, Guangsu, Oh, Dongsuk
Abstract
Discharge summaries require extracting critical information from lengthy electronic health records (EHRs), a process that is labor-intensive when performed manually. Large language models (LLMs) can improve generation efficiency; however, they are prone to producing faithfulness hallucinations, statements that contradict source records, posing direct risks to patient safety. To address this, we present CuraView, a multi-agent framework for sentence-level detection and evidence-grounded explanation of faithfulness hallucinations in discharge summaries. CuraView constructs a GraphRAG-based knowledge graph from patient-level EHRs and implements a closed-loop generation-detection pipeline with sentence-level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1-E4), yielding structured and interpretable evidence chains. We evaluate CuraView on a subset of 250 patients from the Discharge-Me benchmark, with 50 patients held out for testing. Our fine-tuned Qwen3-14B detection model achieves an F1 of 0.831 on the safety-critical E4 metric (90.9% recall, 76.5% precision) and an F1 of 0.823 on E3+E4, representing a 50.0% relative improvement over the base model and outperforming RAGTruth-style and QAGS-style baselines. These results demonstrate that evidence-chain-based graph retrieval verification substantially improves the factual reliability of clinical documentation, while simultaneously producing reusable annotated datasets for downstream model training and distillation.
Chinese Translation
出院摘要需要从冗长的电子健康记录(EHR)中提取关键信息,手动执行这一过程劳动强度大。大型语言模型(LLMs)可以提高生成效率;然而,它们容易产生真实性幻觉,即与源记录相矛盾的陈述,直接危及患者安全。为此,我们提出了 CuraView,一种用于出院摘要中真实性幻觉的句子级检测和基于证据的解释的多智能体框架。CuraView 从患者级 EHR 中构建基于 GraphRAG 的知识图谱,并实现了一个闭环生成-检测流程,具备句子级证据检索和分类,涵盖从强支持到直接矛盾的四个证据等级(E1-E4),生成结构化和可解释的证据链。我们在 Discharge-Me 基准中的 250 名患者子集上评估了 CuraView,其中 50 名患者用于测试。我们微调的 Qwen3-14B 检测模型在安全关键的 E4 指标上获得 0.831 的 F1 分数(90.9% 召回率,76.5% 精确率),在 E3+E4 上的 F1 分数为 0.823,相较于基线模型实现了 50.0% 的相对改善,并超越了 RAGTruth 风格和 QAGS 风格的基线。这些结果表明,基于证据链的图检索验证显著提高了临床文档的事实可靠性,同时为下游模型训练和蒸馏产生了可重复使用的标注数据集。
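Note: the E4 figures above can be cross-checked with the standard F1 formula, the harmonic mean of precision and recall. This is a consistency check on the reported numbers only; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported E4 values: 76.5% precision, 90.9% recall.
e4_f1 = f1_score(0.765, 0.909)
```

The result rounds to 0.831, which agrees with the F1 stated in the abstract.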
cs.CL / 23 / 2605.03510

Rational Communication Shapes Morphological Composition

理性沟通塑形词法组成
Yang, Fengyuan, Peng, Yongqian, Ma, Yuxi, Xu, Chenheng, Zhu, Yixin
Abstract
Human languages expand vocabularies by combining existing morphemes rather than inventing arbitrary forms. Communicative efficiency shapes lexical systems at multiple levels (Gibson et al., 2019), yet morphological composition -- combining morphemes through compounding or affixation -- has rarely been modeled as a historically situated speaker choice among competing morpheme sequences, leaving unanswered why a language settles on one morpheme combination over other plausible alternatives. We ask whether a trade-off between listener recoverability and speaker production cost can predict attested compositions over contemporaneously available alternatives. Here we show, within the Rational Speech Act (RSA) framework (Frank & Goodman, 2012; Goodman & Frank, 2016) using a time-indexed lexicon constructed from Corpus of Historical American English (COHA) and Corpus of Contemporary American English (COCA), that across 4323 naturally occurring English compounds and derivations spanning 1820--2019, attested compositions are systematically ranked above unattested alternatives generated from contemporaneously available morphemes. Models integrating semantic informativeness with production cost outperform semantic-only and cost-only baselines on Mean Reciprocal Rank (MRR) and top-k accuracy (Acc@k), with the advantage of the Pragmatic Speaker model ($S_1$) over the semantic-only baseline growing as the candidate set expands, where meaning alone leaves morphological choice underdetermined. These findings suggest that lexicalization reflects a communicative trade-off between expressiveness and efficiency, extending rational accounts of communication from utterance-level choice to the internal structure of words.
Chinese Translation
人类语言通过结合已有的语素而非发明任意形式来扩展词汇。在多个层面上,交际效率塑造了词汇系统 (Gibson et al., 2019),然而词法组合——通过复合或附加结合语素——作为历史上处于竞争中的语素序列中讲者选择的模型却鲜有研究,留下了一个未解之谜:为什么一种语言会选择一个语素组合而非其他合理的替代方案。我们探讨了听者可恢复性与讲者产生成本之间的权衡是否能预测存在的组合在同时可用的替代方案中。从设定为时间索引的词汇表出发,采用理性言语行为 (Rational Speech Act, RSA) 框架 (Frank & Goodman, 2012; Goodman & Frank, 2016),我们展示了从美国历史语料库 (Corpus of Historical American English, COHA) 和当代美国语料库 (Corpus of Contemporary American English, COCA) 中构建的词汇,在1820至2019年间,4323个自然出现的英语复合词和派生词中,已证实的组合系统性地优于从同时可用语素生成的未证实替代品。在整合语义信息量与产生成本的模型中,其在平均倒数排名 (Mean Reciprocal Rank, MRR) 和前k准确率 (Acc@k) 上优于仅基于语义或仅基于成本的基线,并且随着候选集的扩大,实用讲者模型 ($S_1$) 相较于仅基于语义的基线的优势不断增加,而仅凭意义无法决定词法选择。这些发现表明,词汇化反映了表达性与效率之间的交际权衡,将理性沟通的解释从发话层次的选择扩展到词的内部结构。
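Note: in the standard Rational Speech Act formulation, the pragmatic speaker chooses an utterance u for a meaning m with probability proportional to exp(alpha * (log L0(m | u) - cost(u))), where L0 is the literal listener. The sketch below applies that standard form to toy candidate morpheme sequences. The candidate names, semantic-fit values, costs, and uniform prior are assumptions; the paper's actual lexicon and cost definitions are not given in the abstract.

```python
import math

def literal_listener(lexicon, utterance):
    """L0(m | u): normalize the semantic fit of one utterance over meanings."""
    fits = lexicon[utterance]
    total = sum(fits.values())
    return {meaning: fit / total for meaning, fit in fits.items()}

def pragmatic_speaker(lexicon, costs, meaning, alpha=1.0):
    """S1(u | m) proportional to exp(alpha * (log L0(m | u) - cost(u)))."""
    utilities = {}
    for utterance in lexicon:
        p = literal_listener(lexicon, utterance).get(meaning, 0.0)
        if p > 0:  # utterances that cannot convey the meaning get zero probability
            utilities[utterance] = alpha * (math.log(p) - costs[utterance])
    normalizer = sum(math.exp(u) for u in utilities.values())
    return {utt: math.exp(u) / normalizer for utt, u in utilities.items()}

# Toy candidates for one target meaning: equally informative candidates differ in cost.
lexicon = {
    "candidate_1": {"target": 0.8, "other": 0.2},  # informative, cheap
    "candidate_2": {"target": 0.5, "other": 0.5},  # ambiguous, cheap
    "candidate_3": {"target": 0.8, "other": 0.2},  # informative, costly
}
costs = {"candidate_1": 1.0, "candidate_2": 1.0, "candidate_3": 2.0}

speaker = pragmatic_speaker(lexicon, costs, "target")
ranking = sorted(speaker, key=speaker.get, reverse=True)
```

Here the cheap, informative candidate ranks first. Ranking candidates this way is what metrics such as MRR and Acc@k would then score against the historically attested form.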
cs.CL / 24 / 2605.03514

Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding

重新审视图令牌化的大型语言模型:图令牌理解的系统评估
Zhang, Zhongjian, Yu, Yue, Zhang, Mengmei, Du, Junping, Wang, Xiao, Shi, Chuan
Abstract
The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph tasks. As a widely recognized paradigm, Graph-Tokenizing LLMs (GTokenLLMs) compress complex graph data into graph tokens and treat them as prefix tokens for querying LLMs, leading many to believe that LLMs can understand graphs more effectively and efficiently. In this paper, we challenge this belief: Do GTokenLLMs fully understand graph tokens in the natural-language embedding space? Motivated by this question, we formalize a unified framework for GTokenLLMs and propose an evaluation pipeline, GTEval, to assess graph-token understanding via instruction transformations at the format and content levels. We conduct extensive experiments on 6 representative GTokenLLMs with GTEval. The primary findings are as follows: (1) Existing GTokenLLMs do not fully understand graph tokens. They exhibit over-sensitivity or over-insensitivity to instruction changes, and rely heavily on text for reasoning; (2) Although graph tokens preserve task-relevant graph information and receive attention across LLM layers, their utilization varies across models and instruction variants; (3) Additional instruction tuning can improve performance on the original and seen instructions, but it does not fully address the challenge of graph-token understanding, calling for further improvement.
Chinese Translation
大型语言模型(LLMs)的显著成功激励了研究人员将其适应为各种图任务的通用预测器。作为一种广泛认可的范式,图令牌化LLMs(GTokenLLMs)将复杂的图数据压缩为图令牌,并将其作为前缀令牌用于查询LLMs,这使得许多人相信LLMs可以更有效和高效地理解图。在本文中,我们对这一信念提出质疑:GTokenLLMs是否在自然语言嵌入空间中完全理解图令牌?针对此问题,我们形式化了GTokenLLMs的统一框架,并提出了一个评估流程GTEval,通过格式和内容层面的指令变换来评估图令牌理解。我们使用GTEval在6个代表性的GTokenLLMs上进行了广泛的实验。主要发现如下:(1) 现有的GTokenLLMs并未完全理解图令牌。它们对指令变化表现出过度敏感或过度不敏感,并在推理中严重依赖文本;(2) 尽管图令牌保留了任务相关的图信息,并在LLM层之间受到关注,但它们在模型和指令变体间的利用程度差异很大;(3) 额外的指令调整可以改善对原始和已见指令的性能,但并未完全解决图令牌理解的挑战,仍需进一步改进。
cs.CL / 25 / 2605.03534

SURE-RAG: Sufficiency and Uncertainty-Aware Evidence Verification for Selective Retrieval-Augmented Generation

SURE-RAG: 面向选择性检索增强生成的充分性与不确定性感知证据验证
Qiu, Jingxi, Han, Zeyu, Huang, Cheng
Abstract
Retrieval-augmented generation (RAG) grounds answers in retrieved passages, but retrieval is not verification: a passage can be topical and still fail to justify the answer. We frame this gap as evidence sufficiency verification for selective RAG answering: given a question, a candidate answer, and retrieved evidence, predict whether the evidence supports, refutes, or is insufficient, and abstain unless support is established. We present SURE-RAG, a transparent aggregation protocol built on the observation that evidence sufficiency is a set-level property: missing hops and unresolved conflicts cannot be detected by independent passage scoring. A shared pair-level claim-evidence verifier produces local relation distributions, which SURE-RAG aggregates into interpretable answer-level signals -- coverage, relation strength, disagreement, conflict, and retrieval uncertainty -- yielding a three-way decision and an auditable selective score. We evaluate on HotpotQA-RAG v3, a controlled multi-hop benchmark, under an artifact-aware protocol (shortcut baselines, counterfactual swaps, no-oracle checks, GPT-4o audits). Calibrated SURE-RAG reaches 0.9075 Macro-F1 (0.8951 +/- 0.0069), substantially above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284), while matching a strong but opaque concat cross-encoder (0.8888 +/- 0.0109) with full auditability. Risk at 30% coverage drops from 0.2588 to 0.1642, a 37% reduction in unsafe answers. To deliberately probe the task boundary, we further contrast SURE-RAG with GPT-4o on HaluBench unsafe detection: the ranking reverses (0.3343 vs 0.7389 unsafe-F1), establishing that controlled sufficiency verification and natural hallucination detection are distinct problems.
Chinese Translation
检索增强生成(RAG)将答案基础于检索到的段落,但检索并不等同于验证:某段落可能与主题相关,但仍然无法证明答案的正确性。我们将这一缺口框架化为选择性RAG回答的证据充分性验证:给定一个问题、一个候选答案和检索到的证据,预测这些证据是支持、反驳还是不足,并在未建立支持的情况下选择不作回答。我们提出了SURE-RAG,这是一种透明的聚合协议,建立在证据充分性是一个集合级属性的观察基础上:缺失的跳跃和未解决的冲突无法通过独立的段落评分检测。一个共享的对级声明-证据验证器生成局部关系分布,而SURE-RAG将其聚合为可解释的答案级信号——覆盖率、关系强度、不一致、冲突和检索不确定性——从而产生三种决策结果和可审计的选择性评分。我们在HotpotQA-RAG v3上进行了评估,这是一个受控的多跳基准测试,采用了考虑人工制品的协议(快捷基线、反事实交换、无神谕检查、GPT-4o审计)。经过校准的SURE-RAG达到0.9075的Macro-F1(0.8951 +/- 0.0069),远高于DeBERTa平均池化(0.6516)和GPT-4o评估(0.7284),同时与强而不透明的拼接交叉编码器(0.8888 +/- 0.0109)相匹配,并具备完全的可审计性。在30%覆盖率下的风险从0.2588降至0.1642,不安全答案减少了37%。为了主动探测任务边界,我们进一步对比了SURE-RAG与GPT-4o在HaluBench不安全检测上的表现:排名发生反转(0.3343与0.7389的不安全-F1),确认了受控的充分性验证与自然幻觉检测是两个不同的问题。
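The "risk at 30% coverage" figure in the abstract can be read as a standard selective-prediction metric. The sketch below is only an illustration under that assumption: it ranks answers by a confidence score, keeps the top fraction, and reports the error rate among the kept answers. The function names and the toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import Sequence


def risk_at_coverage(scores: Sequence[float], correct: Sequence[bool], coverage: float) -> float:
    """Error rate among the top `coverage` fraction of answers, ranked by score."""
    if len(scores) != len(correct) or len(scores) == 0:
        raise ValueError("scores and correct must be non-empty and of equal length")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Highest-confidence answers first.
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    n_answered = max(1, round(coverage * len(ranked)))
    errors = sum(1 for _, ok in ranked[:n_answered] if not ok)
    return errors / n_answered


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Toy data (hypothetical): ten answers with selective scores and correctness flags.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_correct = [True, True, False, True, False, True, False, False, True, False]
toy_risk = risk_at_coverage(toy_scores, toy_correct, coverage=0.3)

# The abstract's numbers: 0.2588 -> 0.1642 is roughly a 37% relative reduction.
reported_reduction = relative_reduction(0.2588, 0.1642)
```

With the reported values, `relative_reduction(0.2588, 0.1642)` comes out near 0.366, which is consistent with the 37% quoted in the abstract.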
cs.CL / 26 / 2605.03571

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

PatRe:一个针对专利审查的全阶段审查意见通知书(Office Action)与反驳生成基准
Wang, Qiyao, Chen, Xinyi, Chen, Longze, Wang, Hongbo, Alinejad-Rokny, Hamid, Lin, Yuan, Yang, Min
Abstract
Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inherently interactive and iterative nature, similar to the peer review and rebuttal process in academic publishing. In this paper, we introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response. Extensive experiments across various LLMs reveal critical insights into model performance, including differences between proprietary and open-source models, as well as task asymmetries between examiner analysis and applicant-side rebuttal. These findings highlight both the potential and current limitations of LLMs in modeling complex, real-world legal reasoning and technical novelty judgment in patent examination. We release our code and dataset to facilitate future research on patent examination modeling.
Chinese Translation
专利审查是一个复杂的多阶段过程,既需要技术专长,又需法律推理,但在日益增长的申请量挑战下愈加困难。之前的基准主要将专利审查视为判别性分类或静态提取,未能捕捉其固有的互动和迭代特性,这与学术出版中的同行评审和反驳过程相似。在本文中,我们介绍了PatRe,这是第一个模拟完整专利审查生命周期的基准,包括审查意见通知书(Office Action)生成和申请人反驳。PatRe包含480个真实案例,并支持oracle和检索模拟评估设置。我们的基准将专利审查重新框定为一个动态的、多轮次的论证与回应过程。在各种大型语言模型(LLMs)上的广泛实验揭示了模型性能的关键见解,包括专有模型与开源模型之间的差异,以及审查员分析与申请人反驳之间的任务不对称。这些发现突显了大型语言模型在模拟复杂的、现实的法律推理和技术新颖性判断方面的潜力与现有限制。我们发布了我们的代码和数据集,以促进未来对专利审查建模的研究。
cs.CL / 27 / 2605.03590

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

AfriVox-v2:野外环境中非洲语音识别的领域垂直化基准
Awobade, Busayo, Ashungafac, Gabrial Zencha, Olatunji, Tobi
Abstract
Recent large language models (LLMs) show strong speech recognition and translation capabilities for high-resource languages. However, African languages remain dramatically underrepresented in benchmarks, limiting their practical use in low-resource settings. While early benchmarks tested African languages and accents, they lacked exhaustive real-world noise and granular domain evaluations. We present AfriVox-v2, a comprehensive benchmark designed to test speech models under realistic African deployment conditions. AfriVox-v2 introduces "in the wild" unscripted audio for all supported languages. We also introduce strict domain verticalization, evaluating model accuracy across ten sectors including government, finance, health, and agriculture and conducting targeted tests on numbers and named entities. Finally, we benchmark a new generation of speech models, including Sahara-v2, Gemini 3 Flash, and the Omnilingual CTC models. Our results expose the true generalization gap of modern speech models in specialized, noisy African contexts and provide a reliable blueprint for developers building localized voice AI.
Chinese Translation
近期的大型语言模型(LLMs)在高资源语言的语音识别和翻译能力上表现出色。然而,非洲语言在基准测试中依然严重不足,这限制了它们在低资源环境中的实际应用。虽然早期的基准测试涵盖了非洲语言和口音,但它们缺乏全面的真实环境噪声和细粒度的领域评估。我们提出了AfriVox-v2,一个旨在在现实的非洲部署条件下测试语音模型的综合基准。AfriVox-v2为所有支持的语言引入了“野外”非脚本化音频。此外,我们还引入了严格的领域垂直化,评估模型在政府、金融、健康和农业等十个行业的准确性,并对数字和专有名词进行有针对性的测试。最后,我们对新一代语音模型进行了基准测试,包括Sahara-v2、Gemini 3 Flash和Omnilingual CTC模型。我们的结果揭示了现代语音模型在专业和嘈杂的非洲环境中的真正泛化差距,并为开发本地化语音人工智能的开发者提供了可靠的蓝图。
cs.CL / 28 / 2605.03618

BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA

BIT.UA-AAUBS 在 ArchEHR-QA 2026 的表现:通过提示评估低资源问答中的开源和专有大型语言模型
Jonker, Richard A. A., Christiansen, Alexander, Maniatis, Alexandros, Garrido, Rúben, Lima, Rogério Braunschweiger de Freitas, Jurowetzki, Roman, Matos, Sérgio
Abstract
This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraints inherent to the healthcare domain (e.g. GDPR), we investigate the capabilities of Large Language Models (LLMs) without weight updates. We evaluate several state-of-the-art proprietary models and locally deployable open-source alternatives using various prompt engineering strategies, including task decomposition, Chain-of-Thought, and in-context learning. Furthermore, we explore majority voting and LLM-as-a-judge ensembling techniques to maximize predictive robustness. Our results demonstrate that while proprietary models exhibit strong resilience to prompt variations, domain-adapted open-source models (such as MedGemma 3 27B) achieve highly competitive performance when paired with the right prompt. Overall, our prompt-based approach proved highly effective, securing 1st place in Subtask 4 (evidence citation alignment) and 3rd place in Subtask 3 (patient-friendly answer generation). All code, results, and prompts are available on our GitHub repository: https://github.com/bioinformatics-ua/ArchEHR-QA-2026.
Chinese Translation
本文介绍了 BIT.UA 和 AAUBS 团队共同参与 ArchEHR-QA 2026 共享任务的情况,该任务重点关注低资源环境中的临床问题回答和证据基础。由于缺乏训练数据以及医护领域固有的数据隐私约束(例如,GDPR),我们研究了不进行权重更新的大型语言模型(LLMs)的能力。我们评估了几种最新的专有模型和可以本地部署的开源替代品,采用了多种提示工程策略,包括任务分解、思维链(Chain-of-Thought)和上下文学习。此外,我们探索了多数投票和 LLM 作为裁判的集成技术,以最大化预测的鲁棒性。我们的结果表明,尽管专有模型对提示变化表现出强大的适应性,但经过领域调整的开源模型(如 MedGemma 3 27B)在配合正确提示时也能获得非常有竞争力的表现。总体而言,我们的基于提示的方法效果显著,在子任务 4(证据引用对齐)中获得了第一名,在子任务 3(适合患者的答案生成)中获得了第三名。所有代码、结果和提示均可在我们的 GitHub 代码库中找到:https://github.com/bioinformatics-ua/ArchEHR-QA-2026。
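The majority-voting ensembling mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the tie-breaking rule (the earliest prediction among the tied labels wins) is an assumption made here so the result is deterministic, and the label strings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label that appears first."""
    if len(predictions) == 0:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top_count = max(counts.values())
    # Walk the predictions in order so ties are resolved by first appearance.
    for label in predictions:
        if counts[label] == top_count:
            return label
    raise AssertionError("unreachable: some label must reach top_count")


# Hypothetical per-sentence relevance labels from three prompted runs.
votes = ["essential", "not-relevant", "essential"]
winner = majority_vote(votes)
```

Ordering the runs by trust (strongest model first) would make the tie-break favour the most reliable run, which is one possible design choice among several.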
cs.CL / 29 / 2605.03624

Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model

基于方面的情感分析中的注释质量:比较专家、学生、众包工人和大型语言模型的案例研究
Donhauser, Niklas, Fehle, Jakob, Hellwig, Nils Constantin, Weinberger, Markus, Kruschwitz, Udo, Wolff, Christian
Abstract
Aspect-Based Sentiment Analysis (ABSA) enables fine-grained opinion analysis by identifying sentiments toward specific aspects or targets within a text. While ABSA has been widely studied for English, research on other languages such as German remains limited, largely due to the lack of high-quality annotated datasets. This paper examines how different annotation sources influence the development of German ABSA. To this end, an existing dataset is re-annotated by experts to establish a ground truth, which serves as a reference for evaluating annotations produced by students, crowdworkers, Large Language Models (LLMs), and experts. Annotation quality is compared using Inter-Annotator Agreement (IAA) and its impact on downstream model performance for different ABSA subtasks. The evaluation focuses on Aspect Category Sentiment Analysis (ACSA) and Target Aspect Sentiment Detection (TASD). We apply State-of-the-Art (SOTA) methods for ABSA, including BERT-, T5-, and LLaMA-based approaches to assess performance differences, spanning fine-tuning and in-context learning with instruction prompts. The findings provide practical insights into trade-offs between annotation reliability and efficiency, offering guidance for dataset construction in under-resourced Natural Language Processing (NLP) scenarios.
Chinese Translation
基于方面的情感分析(Aspect-Based Sentiment Analysis, ABSA)通过识别文本中对特定方面或目标的情感,实现细粒度的意见分析。虽然ABSA在英语上的研究已经相对广泛,但对德语等其他语言的研究仍然有限,这主要是由于缺乏高质量的注释数据集。本文探讨了不同注释来源对德语ABSA发展的影响。为此,专家对现有数据集进行了重新注释,以建立基准真相,为评估学生、众包工人、大型语言模型(Large Language Models, LLMs)和专家所产生的注释提供参考。我们使用注释者间一致性(Inter-Annotator Agreement, IAA)比较注释质量,并考察其对不同ABSA子任务下游模型性能的影响。评估重点关注方面类别情感分析(Aspect Category Sentiment Analysis, ACSA)和目标方面情感检测(Target Aspect Sentiment Detection, TASD)。我们应用最先进的方法(State-of-the-Art, SOTA)进行ABSA,包括基于BERT、T5和LLaMA的方法,评估细调(fine-tuning)和上下文学习(in-context learning)中的表现差异,并结合指令提示。研究结果为注释可靠性与效率之间的权衡提供了实用见解,为资源不足的自然语言处理(Natural Language Processing, NLP)场景中的数据集构建提供了指导。
cs.CL / 30 / 2605.03671

A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition

一种用于解释指标和识别自动语音识别中关键错误的范式
Bañeras-Roux, Thibault, Rouvier, Mickael, Wottawa, Jane, Dufour, Richard
Abstract
The most commonly used metrics for evaluating automatic speech transcriptions, namely Word Error Rate (WER) and Character Error Rate (CER), have been heavily criticized for their poor correlation to human perception and their inability to take into account linguistic and semantic information. While metric-based embeddings, seeking to approximate human perception, have been proposed, their scores remain difficult to interpret, unlike WER and CER. In this article, we overcome this problem by proposing a paradigm that consists in incorporating a chosen metric into it in order to obtain an equivalent of the error rate: a Minimum Edit Distance (minED). This approach parallels transcription errors with their human perception, also allowing an original study of the severity of these errors from a human perspective.
Chinese Translation
用于评估自动语音转录的最常用指标,即词错误率(Word Error Rate,WER)和字符错误率(Character Error Rate,CER),因其与人类感知的相关性较差以及无法考虑语言和语义信息而受到广泛批评。虽然已有研究提出基于嵌入的指标,旨在近似人类感知,但其得分仍然难以解释,这一点不同于WER和CER。这篇文章通过提出一个范式来克服这一问题,该范式将所选指标纳入其中,以获得错误率的等效值:最小编辑距离(Minimum Edit Distance,minED)。这种方法将转录错误与人类感知相对应,同时也允许从人类的角度对这些错误的严重性进行原创研究。
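For readers unfamiliar with the baseline metrics named in the abstract, WER and CER are both Levenshtein edit distances normalized by reference length. The sketch below implements only these two standard metrics; it does not implement the paper's minED paradigm, whose details are not given in the abstract.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions cost 1 each)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
```

Both errors in the example count equally toward WER, which illustrates the abstract's complaint that the metric ignores how severe each error is for a human reader.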
cs.CL / 31 / 2605.03696

A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition applied on French Language

法语自动语音识别中的分词与自监督学习的综合分析
Bañeras-Roux, Thibault, Rouvier, Mickael, Wottawa, Jane, Dufour, Richard
Abstract
The performance of end-to-end automatic speech recognition (ASR) systems enables their increasing integration into numerous applications. While there are various benefits to such speech-to-text systems, the choice of hyperparameters and models plays a crucial role in their performance. Typically, these choices are determined by considering only the character (CER) and/or word error rate (WER) metrics. However, it has been shown in several studies that these metrics are largely incomplete and fail to adequately describe the downstream application of automatic transcripts. In this paper, we conduct a qualitative study on the French language that investigates the impact of subword tokenization algorithms and self-supervised learning models from different linguistic and acoustic perspectives, using a comprehensive set of evaluation metrics.
Chinese Translation
端到端自动语音识别(ASR)系统的性能使其在众多应用中的集成不断增加。尽管这类语音转文本系统有多种好处,但超参数和模型的选择在其性能中起着至关重要的作用。通常,这些选择是仅根据字符错误率(CER)和/或单词错误率(WER)指标来决定的。然而,多个研究表明,这些指标在很大程度上是不完整的,无法充分描述自动转录文本的下游应用。本文对法语进行了定性研究,探讨了子词分词算法和自监督学习模型在不同语言学和声学视角下的影响,使用了一套全面的评估指标。
cs.CL / 32 / 2605.03701

SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification

SERE:用于增强大型语言模型在事件因果识别中的结构示例检索
Hao, Zhifeng, Chen, Zhongjie, Lu, Junhao, Yu, Shengyin, Hu, Guimin, Zhang, Keli, Cai, Ruichu, Xu, Boyan
Abstract
Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction of causal relationships (causal hallucination). To mitigate these issues and enhance LLM performance in ECI, we propose SERE, a structural example retrieval framework that leverages LLMs' few-shot learning capabilities. SERE introduces an innovative retrieval mechanism based on three structural concepts: (i) Conceptual Path Metric, which measures the conceptual relationship between events using edit distance in ConceptNet; (ii) Syntactic Metric, which quantifies structural similarity through tree edit distance on syntactic trees; and (iii) Causal Pattern Filtering, which filters examples based on predefined causal structures using LLMs. By integrating these structural retrieval strategies, SERE selects more relevant examples to guide LLMs in causal reasoning, mitigating bias and improving accuracy in ECI tasks. Extensive experiments on multiple ECI datasets validate the effectiveness of SERE. The source code is publicly available at https://github.com/DMIRLAB-Group/SERE.
Chinese Translation
事件因果识别(ECI)需要模型判断上下文中给定的一对事件是否存在因果关系。尽管大型语言模型(LLMs)在各种自然语言处理任务中表现出色,但由于在因果推理中的偏差,导致其在ECI方面的有效性仍然有限,常常导致因果关系的过度预测(因果幻觉)。为了解决这些问题并提高LLM在ECI中的性能,我们提出了SERE,一个利用LLM的少量学习能力的结构化示例检索框架。SERE引入了一种基于三种结构概念的创新检索机制:(i)概念路径度量(Conceptual Path Metric),通过在ConceptNet中的编辑距离测量事件之间的概念关系;(ii)句法度量(Syntactic Metric),通过在句法树上使用树编辑距离量化结构相似性;(iii)因果模式过滤(Causal Pattern Filtering),基于预定义因果结构使用LLMs过滤示例。通过整合这些结构化检索策略,SERE选择更相关的示例以指导LLMs进行因果推理,减少偏差并提高ECI任务的准确性。在多个ECI数据集上的大量实验验证了SERE的有效性。源代码可以在https://github.com/DMIRLAB-Group/SERE获取。
cs.CL / 33 / 2605.03706

SAM-NER: Semantic Archetype Mediation for Zero-Shot Named Entity Recognition

SAM-NER:用于零样本命名实体识别的语义原型调解
Cai, Ruichu, Gan, Juntao, Mai, Miao, Hao, Zhifeng, Xu, Boyan
Abstract
Zero-shot Named Entity Recognition (ZS-NER) remains brittle under domain and schema shifts, where unseen label definitions often misalign with a large language model's (LLM's) intrinsic semantic organization. As a result, directly mapping entity mentions to fine-grained target labels can induce systematic semantic drift, especially when target schemas are novel or semantically overlapping. We propose SAM-NER, a three-stage framework based on Semantic Archetype Mediation that stabilizes cross-domain transfer through an intermediate, domain-invariant archetype space. SAM-NER: (i) performs Entity Discovery via cooperative extraction and consensus-based denoising to obtain high-coverage, high-fidelity entity spans; (ii) conducts Abstract Mediation by projecting entities into a compact set of universal semantic archetypes distilled from high-level ontological abstractions; and (iii) applies Semantic Calibration to resolve archetype-level predictions into target-domain types through constrained, definition-aligned inference with a frozen LLM. Experiments on the CrossNER benchmark show that SAM-NER consistently outperforms strong prior ZS-NER baselines in cross-domain settings. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/SAM-NER.
Chinese Translation
零样本命名实体识别(ZS-NER)在领域和模式转移下依然脆弱,其中未见标签定义往往与大型语言模型(LLM)的内在语义组织不一致。因此,直接将实体提及映射到细粒度的目标标签可能导致系统性的语义漂移,尤其是在目标模式是新颖或语义重叠的情况下。我们提出了SAM-NER,一个基于语义原型调解的三阶段框架,通过中间的领域不变原型空间来稳定跨领域转移。SAM-NER: (i) 通过合作提取和基于共识的去噪来执行实体发现,以获得高覆盖率、高保真度的实体跨度; (ii) 通过将实体映射到从高级本体抽象中提炼的紧凑的通用语义原型集来进行抽象调解; (iii) 通过与冻结的LLM进行约束的、与定义对齐的推断,应用语义校准将原型级别的预测解析为目标领域类型。在CrossNER基准上的实验表明,SAM-NER在跨领域设置中始终优于强大的以往ZS-NER基线。我们的实现将开源于 https://github.com/DMIRLAB-Group/SAM-NER 。
cs.CL / 34 / 2605.03720

Rose-SQL: Role-State Evolution Guided Structured Reasoning for Multi-Turn Text-to-SQL

Rose-SQL:基于角色-状态演变指导的多轮文本到SQL结构化推理
Zhou, Le, Yao, Feng, Qiao, Fengcai, Xu, Bo, Wang, Fangyuan, Xu, Boyan
Abstract
Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought have demonstrated remarkable capabilities in code generation and mathematical reasoning. However, their potential in multi-turn Text-to-SQL tasks remains largely underexplored. Existing approaches typically rely on unstable API-based inference or require expensive fine-tuning on small-scale models. In this work, we present Rose-SQL, a training-free framework that leverages small-scale LRMs through in-context learning to enable accurate context-dependent parsing. We introduce the Role-State, a fine-grained representation that bridges the structural gap between schema linking and SQL generation by serving as a structural blueprint. To handle conversational dependencies, Rose-SQL traces the evolution of Role-State through historical context via structural isomorphism checks, guiding the model to infer the possible SQL composition for the current question through verified interaction trajectories. Experiments on the SParC and CoSQL benchmarks show that, within the Qwen3 series, Rose-SQL outperforms in-context learning baselines at the 4B scale and substantially surpasses state-of-the-art fine-tuned models at the 8B and 14B scales, while showing consistent gains on additional reasoning backbones.
Chinese Translation
最近,通过长链思维训练的大型推理模型(LRMs)的进展在代码生成和数学推理方面展现了显著能力。然而,它们在多轮文本到SQL(Text-to-SQL)任务中的潜力仍然在很大程度上未被探索。现有的方法通常依赖于不稳定的API推理或需要在小规模模型上进行昂贵的微调。在本研究中,我们提出了Rose-SQL,一个无训练的框架,通过上下文学习利用小规模LRMs,使得准确的上下文依赖解析成为可能。我们引入了角色-状态(Role-State),这一细粒度表示通过作为结构蓝图,弥合了模式链接和SQL生成之间的结构差距。为了处理对话依赖关系,Rose-SQL通过结构同构检查追踪角色-状态的演变,从历史上下文中指导模型推断当前问题可能的SQL组合,基于经过验证的交互轨迹。对SParC和CoSQL基准的实验表明,在Qwen3系列中,Rose-SQL在4B规模下优于基于上下文学习的基线,并在8B和14B规模上显著超过最新的微调模型,同时在额外推理骨干上展示了一致的提升。
cs.CL / 35 / 2605.03723

Segmenting Human-LLM Co-authored Text via Change Point Detection

通过变化点检测进行人类与大型语言模型共同创作文本的分割
Li, Mengchu, Zhu, Jin, Li, Jinglai, Shi, Chengchun
Abstract
The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification for an entire passage; however, this is insufficient for human--LLM co-authored text, where the objective is to localize specific segments authored by humans or LLMs. To bridge this gap, we propose algorithms to segment text into human- and LLM-authored pieces. Our key observation is that such a segmentation task is conceptually similar to classical change point detection in time-series analysis. Leveraging this analogy, we adapt change point detection to LLM-generated text detection, develop a weighted algorithm and a generalized algorithm to accommodate heterogeneous detection score variability, and establish the minimax optimality of our procedure. Empirically, we demonstrate the strong performance of our approach against a wide range of existing baselines.
Chinese Translation
大型语言模型(LLMs)的兴起使得区分人类撰写文本与LLM生成文本的需求变得迫切,以确保文本的真实性和社会信任。现有的检测器通常对整段文本提供二分类结果;然而,这对于人类与LLM共同创作的文本来说是不够的,因为其目标是定位特定由人类或LLM创作的片段。为了解决这一问题,我们提出了将文本分割为人类和LLM创作的部分的算法。我们发现这样的一项分割任务在概念上与时间序列分析中的经典变化点检测类似。借助这一类比,我们将变化点检测方法应用于LLM生成文本的检测,开发了加权算法和广义算法,以适应异质检测得分的变异性,并确立了我们方法的极小极大最优性。在实证方面,我们展示了我们的方法在多种现有基准测试中的优异表现。
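The analogy with classical change point detection can be made concrete with a small sketch. It assumes a hypothetical per-sentence detector score (higher meaning more LLM-like) and locates a single change point with the textbook CUSUM statistic. This is the unweighted classical procedure only, not the weighted or generalized algorithms the paper proposes, and it handles just one change point.

```python
import math
from typing import Sequence


def single_change_point(scores: Sequence[float]) -> int:
    """Return the split index t (1 <= t < n) maximizing the CUSUM statistic.

    Sentences scores[:t] form the first segment and scores[t:] the second.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to place a change point")
    total = sum(scores)
    best_t, best_stat = 1, -1.0
    left_sum = 0.0
    for t in range(1, n):
        left_sum += scores[t - 1]
        left_mean = left_sum / t
        right_mean = (total - left_sum) / (n - t)
        # Scaled difference of segment means (classical CUSUM form).
        stat = math.sqrt(t * (n - t) / n) * abs(left_mean - right_mean)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t


# Hypothetical detector scores: four human-like sentences, then three LLM-like ones.
sentence_scores = [0.10, 0.20, 0.15, 0.10, 0.90, 0.85, 0.95]
boundary = single_change_point(sentence_scores)
```

Applying the same search recursively to each resulting segment (binary segmentation) is one common way to extend this to several authorship switches.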
cs.CL / 36 / 2605.03742

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

大语言模型在低资源塔吉克文本生成中的参数高效微调基准:以塔吉克网络语料库为例
Arabov, Mullosharaf K.
Abstract
This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik, comprising 319,298 documents (~1.11 billion characters). On a subsample of 10,000 documents, 17 configurations were benchmarked, covering autoregressive, encoder-decoder, and encoder-only models with three fine-tuning strategies: full fine-tuning, LoRA, and QLoRA (ranks 8 and 16). Quality was assessed via perplexity and cross-entropy loss; peak GPU memory and training time were also recorded. Best results were achieved by Mistral 7B with QLoRA (r=16): mean perplexity 5.03, standard deviation 0.03. Increasing rank from 8 to 16 gave statistically insignificant improvement while raising memory consumption. For small GPT-2 family models, full fine-tuning yielded lower perplexity (3.48 for GPT-2 Medium) than LoRA (7.60-8.42), but induced catastrophic forgetting. The encoder-only XLM-RoBERTa showed the worst results (perplexity 59.3). The novelty lies in creating the largest verified Tajik corpus and the first systematic analysis of PEFT effectiveness for Tajik text generation. Practical value lies in recommendations for architecture and fine-tuning strategy selection, optimizing computational costs without substantial quality loss.
Chinese Translation
本文致力于为塔吉克语(一种使用西里尔字母的低资源语言)适应生成性大语言模型。为克服数字文本资源短缺,作者创建并公开发布了塔吉克网络语料库,这是最大的塔吉克开放访问语料库,包含319,298份文档(约11.1亿个字符)。在10,000份文档的子样本上,对17种配置进行了基准测试,覆盖自回归、编码-解码(encoder-decoder)和编码(encoder-only)模型,并应用三种微调策略:完全微调(full fine-tuning)、LoRA和QLoRA(秩8和16)。通过困惑度(perplexity)和交叉熵损失(cross-entropy loss)评估质量;同时记录了峰值GPU内存和训练时间。最佳结果由Mistral 7B与QLoRA(r=16)实现:平均困惑度为5.03,标准差为0.03。从8提高到16的秩带来了统计上不显著的改善,同时增加了内存消耗。对于小型GPT-2系列模型,完全微调的困惑度(GPT-2 Medium为3.48)低于LoRA(7.60-8.42),但产生了灾难性遗忘。仅编码的XLM-RoBERTa表现最差(困惑度为59.3)。本研究的创新之处在于创建了最大规模的已验证塔吉克语语料库,并首次系统分析了PEFT在塔吉克文本生成中的有效性。实际价值在于为架构和微调策略选择提供建议,从而在不显著损失质量的前提下优化计算成本。
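The abstract reports both perplexity and cross-entropy loss; the two are tied by perplexity = exp(mean cross-entropy). The sketch below shows that relation, assuming natural-log token probabilities (the abstract does not state the log base, so this is an assumption) and using made-up token probabilities.

```python
import math
from typing import Sequence


def cross_entropy(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token, in nats."""
    if len(token_log_probs) == 0:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(token_log_probs))


# Toy example: a model that assigns probability 0.25 to each of four tokens.
toy_log_probs = [math.log(0.25)] * 4
toy_ppl = perplexity(toy_log_probs)

# Under the natural-log assumption, the reported perplexity of 5.03
# corresponds to a mean cross-entropy of about 1.615 nats per token.
implied_loss = math.log(5.03)
```

Because perplexity depends on the tokenizer, values are directly comparable only between models that share a vocabulary, which is worth keeping in mind when reading cross-architecture comparisons.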
cs.CL / 37 / 2605.03792

TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

TriBench-Ko:评估法律工作流程中大型语言模型的风险
Lee, Haesung, Choi, Gyubin, Lee, Eun-Ju, Lee, So-Min, Ko, Youkang, Lim, Dogyoon, Jang, Sung-Kyoung, Jo, Yohan
Abstract
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko
Chinese Translation
大型语言模型(LLMs)正日益融入法律工作流程。然而,现有的基准主要关注代理任务,如律师考试表现或分类,这些并未能捕捉到日常司法流程中固有的性能和风险。为此,我们公开发布了TriBench-Ko,这是一个针对韩国的基准,旨在评估在经过验证的司法任务要求背景下,LLMs潜在部署风险。该基准涵盖四项核心任务:法学总结、判例检索、法律问题提取和证据分析。它在多个部署风险类别中联合评估模型行为,包括不准确性(幻觉、遗漏、法律误适用)、偏见(人口统计偏见、过度合规)、不一致性(提示敏感性、非确定性)和裁判扩大。每个项目结构化设计用于系统评估根据实际司法决定的任务表现和特定风险类型。我们对多种现代LLMs的评估显示,许多模型经常表现出显著风险,特别是在判例检索和捕捉关键信息方面存在困难。我们提供了对这些LLMs的全面诊断,并指出LLM在司法背景下生成的输出在多个关键领域需要进行严格审查和谨慎对待。我们的数据集和代码可在 https://github.com/holi-lab/TriBench-Ko 获取。
cs.CL / 38 / 2605.03799

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

自然语言处理:从分词到基于人类反馈的强化学习的综合实用指南
Arabov, Mullosharaf K.
Abstract
This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. Twelve hands-on sessions combine concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible research artefact where every session requires publishing code, models, and reports in public repositories. All experiments are conducted on a single evolving corpus, and the work advocates open-weight models over commercial APIs, with special attention to the Hugging Face ecosystem. The material is enriched by original research on low-resource languages, incorporating linguistic resources for Tajik and Tatar (subword tokenisers, embeddings, lexical databases, and transliteration benchmarks), demonstrating how modern NLP can be adapted to data-scarce environments. Designed for senior undergraduates, graduate students, and practising developers seeking to implement, compare, and deploy methods from classical ML to state-of-the-art LLM-based systems.
Chinese Translation
本预印本呈现了一种系统的、以研究为导向的实践教程,引导读者穿越整个现代自然语言处理(NLP)流程:从分词和向量化到大型语言模型的微调、检索增强生成以及基于人类反馈的强化学习。十二个动手实践环节将简明的理论与详细的实施计划、规范的评估指标和透明的评估标准相结合。该作品并不是一本传统的教科书:它被设计为一个可重复的研究成果,每个环节都要求在公共仓库中发布代码、模型和报告。所有实验均在一个不断演变的语料库上进行,作品提倡开放权重模型而非商业API,特别关注Hugging Face生态系统。材料还融入了对低资源语言的原创研究,包含塔吉克语和鞑靼语的语言资源(子词分词器、嵌入、词汇数据库和转写基准),展示了现代NLP如何适应数据稀缺的环境。该教程旨在为高年级本科生、研究生及希望实施、比较和部署从经典机器学习到最先进的基于LLM系统的方法的开发者提供指导。
cs.CL / 39 / 2605.03824

Reproducing Complex Set-Compositional Information Retrieval

再现复杂集合组合信息检索
Degenhart, Vincent, Timman, Dewi, de Vries, Arjen P., Hasibi, Faegheh, Hoveyda, Mohanna
Abstract
Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit 'semantic shortcuts'. We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers achieve an effectiveness that is more than double what can be achieved with BM25 (Recall@100 > 0.41 vs. 0.20), but reasoning-targeted methods like ReasonIR and Search-R1 do not outperform general-purpose retrievers uniformly; (ii) on LIMIT+, gains fail to transfer, where the strongest QUEST method collapses from Recall@100 ≈ 0.42 to below 0.02, while classic lexical retrieval gains to ~0.96. Lastly, (iii) stratifying by compositional depth reveals a consistent degradation across all methods, where algebraic sparse and lexical methods show more stable performance while dense approaches collapse. We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
Chinese Translation
复杂信息需求可能涉及使用合取(conjunction)、析取(disjunction)和排除(exclusion)的集合组合查询,但目前尚不清楚现有的检索范式是否真正满足这些约束或是利用了“语义捷径”。我们进行了一项可重复性研究,以基准测试主要的检索家族和针对推理的方法,基于 QUEST 和 QUEST+Variants,并引入 LIMIT+,这是一个受控基准,其中相关性依赖于任意属性谓词和约束满足,而不太依赖于预训练知识。我们的研究结果表明:(i) 在 QUEST 上,最佳的神经检索器的效果超过了用 BM25 能达到的两倍以上(Recall@100 > 0.41 对比 0.20),但像 ReasonIR 和 Search-R1 这样的针对推理的方法并没有均匀超越通用检索器;(ii) 在 LIMIT+ 上,性能提升未能转移,最强的 QUEST 方法的 Recall@100 从大约 0.42 降至低于 0.02,而经典的词法检索提升至约 0.96。最后,(iii) 按组合深度分层显示所有方法都存在一致的性能下降,其中代数稀疏和词法方法表现更为稳定,而密集方法则崩溃。我们发布了代码和 LIMIT+ 数据生成脚本,以支持未来的可重复性和受控评估。
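A set-compositional query and the Recall@k metric quoted above can be illustrated with plain set operations. The sketch below is a toy, not the QUEST or LIMIT+ evaluation code: the attribute sets, document ids, and query are all hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k of a ranking."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        raise ValueError("need at least one relevant document")
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


# Hypothetical attribute index: which documents satisfy each atomic predicate.
birds = {"d1", "d2", "d3"}
found_in_africa = {"d2", "d3", "d4"}
extinct = {"d3"}

# Query: "birds AND found in Africa, EXCLUDING extinct species".
relevant = (birds & found_in_africa) - extinct  # conjunction, then exclusion

# A ranking that follows topical similarity can surface the excluded document first.
ranking = ["d3", "d1", "d2", "d4"]
recall_top2 = recall_at_k(ranking, relevant, k=2)
recall_top3 = recall_at_k(ranking, relevant, k=3)
```

In this toy case the top-ranked document matches every topic word but violates the exclusion, which is the kind of 'semantic shortcut' the study is probing for.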
cs.CL / 40 / 2605.03838

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

TRACE:一个以计量学为基础的可信代理人工智能系统的工程框架,适用于运行关键领域
Zabolotnii, Serhii
Abstract
We introduce TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. TRACE combines a four-layer reference architecture with an explicit classical-ML vs. LLM-validator split (L2a/L2b), a stateful orchestration-and-escalation policy (L3), and bounded human supervision (L4); a metrologically grounded trust-metric suite mapped to GUM/VIM/ISO 17025; and a Model-Parsimony principle quantified by the Computational Parsimony Ratio (CPR). Three instantiations--clinical decision support, industrial multi-domain operations, and a judicial AI assistant--transfer the same architecture and metrics across principally different governance contexts. The L2a/L2b separation makes the use of large language models a deliberate design decision rather than an architectural default, with parsimony quantified through CPR. TRACE introduces CPR as a first-class design principle in trustworthy-AI engineering.
Chinese Translation
我们介绍了TRACE,一个适用于运行关键领域的可信代理人工智能的跨域工程框架。TRACE结合了四层参考架构与显式的经典机器学习(classical-ML)与大型语言模型(LLM)验证者的划分(L2a/L2b)、一个状态感知的协调和升级策略(L3)以及限制性的人类监督(L4);一个基于计量学的可信度指标套件,映射至GUM/VIM/ISO 17025;以及通过计算简约比(Computational Parsimony Ratio, CPR)量化的模型简约性原则。这三种实例——临床决策支持、工业多领域操作和司法人工智能助手——在根本不同的治理背景下应用相同的架构和指标。L2a/L2b的分离使得大型语言模型的使用成为一个刻意的设计决策,而非架构的默认选择,简约性通过CPR进行量化。TRACE将CPR引入为可信人工智能工程中的一项首要设计原则。
cs.CL / 41 / 2605.03858

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

MCJudgeBench:多约束指令遵循中的约束级评估基准
Lee, Jaeyun, Koh, Junyoung, Tok, Zeynel, Batra, Hunar, Clark, Ronald
Abstract
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
Chinese Translation
多约束指令遵循要求验证一个响应是否满足多个单独的要求,然而大语言模型(LLM)评估者通常仅通过整体响应判断进行评估。我们介绍了MCJudgeBench,这是一个用于多约束指令遵循中的约束级评估的基准。每个实例包括一个指令、一个候选响应、一个明确的约束列表、每个约束的金标准标签({yes, partial, no}),以及受控的响应侧扰动。评估协议还包括评估提示变体,以测试评估者的稳定性。我们使用正确性和不一致性指标评估专有和开源的LLM评估者,区分随机解码下的内在不一致性与提示和响应扰动下的程序不一致性。我们的结果显示,评估者的可靠性具有多个维度:强大的整体表现并不保证在标签类别间有同样可靠的检测,特别是对于较为少见的partial和no标签案例。具有更高正确性的评估者并不总是不一致性较低。带有推理的评估提高了正确性,但并不均匀提升稳定性。这些发现促使我们在约束级别评估LLM评估者,以研究这些失败模式。
cs.CL / 42 / 2605.03903

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

CC-OCR V2:大型多模态模型在现实世界文档处理中的识字能力基准测试
Xu, Zhipeng, Ji, Junhao, Chen, Zulong, Liu, Zhenghao, Liu, Qing, Peng, Chunyi, Qin, Zubao, Xu, Ze, Wan, Jianqiang, Tang, Jun, Yang, Zhibo, Bai, Shuai, Liu, Dayiheng
Abstract
Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.
Chinese Translation
大型多模态模型(LMMs)最近在光学字符识别(OCR)任务中表现强劲,展示了它们在文档识读方面的潜力。然而,它们在现实世界应用中的有效性仍未得到充分探索,因为现有基准采用的任务范围与实际应用不一致,并假设采集条件是同质的。为了弥补这一差距,我们引入了CC-OCR V2,这是一个全面且具有挑战性的OCR基准,专为现实世界文档处理而设计。CC-OCR V2专注于实际的企业文档处理任务,并纳入了至关重要却在以往基准中代表性不足的困难案例和边缘案例,涵盖5个以OCR为中心的主要赛道:文本识别、文档解析、文档定位、关键信息提取和文档问答,共包含7,093个高难度样本。在14个先进LMM上的大量实验表明,当前模型尚未达到现实世界应用的要求。即使是最先进的LMM,在不同任务和场景中也会出现显著的性能下降。这些发现揭示了当前基准上的表现与现实世界应用中的有效性之间存在显著差距。我们在 https://github.com/eioss/CC-OCR-V2 发布了完整的数据集和评估工具包。
cs.CL / 43 / 2605.03907

Steer Like the LLM: Activation Steering that Mimics Prompting

像 LLM 一样引导:模仿提示的激活引导
Heyman, Geert, Vandeputte, Frederik
Abstract
Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.
Chinese Translation
大型语言模型在推理时可以通过提示或激活干预进行引导,但激活引导方法的效果通常不如基于提示的方法。我们提出了一个框架,将提示引导视为一种激活引导,并探讨将成功的提示引导行为提炼为更简单、可解释模型是否可以缩小这种差距。我们的分析表明,流行的激活引导方法并未忠实于提示引导的机制,后者对某些标记施加强干预,而对其他标记几乎没有影响。基于这些见解,我们引入了提示引导替代模型(Prompt Steering Replacement, PSR),该模型从激活本身估计特定标记的引导系数,并经过训练以模仿基于提示的干预。在多个语言模型的三个引导基准上的实验表明,PSR 模型优于现有的激活引导方法,尤其是在控制高连贯性完成时,并且在 AxBench 和个性引导上与提示相比较也表现良好。
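The contrast drawn in the abstract, strong intervention on some tokens and almost none on others, can be illustrated with a per-token coefficient predicted from the activation itself. The sketch below is not the authors' PSR model: the sigmoid gate, the parameters `gate_w` and `gate_b`, the cap `max_scale`, and the function name are all assumptions made for illustration.

```python
import numpy as np

def steer_with_token_coefficients(hidden, direction, gate_w, gate_b, max_scale=4.0):
    """Add a steering direction to each token, scaled by a per-token coefficient.

    hidden:    (seq_len, d_model) activations at one layer
    direction: (d_model,) steering vector (hypothetical, obtained elsewhere)
    gate_w:    (d_model,) weights of a small linear gate (hypothetical)
    gate_b:    scalar bias of that gate (hypothetical)
    """
    # One coefficient per token, in (0, max_scale), computed from the activation.
    logits = hidden @ gate_w + gate_b
    coeffs = max_scale / (1.0 + np.exp(-logits))
    # Tokens with a large coefficient are shifted strongly; others barely move.
    steered = hidden + coeffs[:, None] * direction[None, :]
    return steered, coeffs
```

A constant-coefficient method corresponds to replacing `coeffs` with a single scalar for every token, which is the uniform behaviour the abstract describes as unfaithful to prompt steering.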
cs.CL / 44 / 2605.03916

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

原子事实检查增加临床医生对大型语言模型肿瘤决策支持推荐的信任:一项随机对照试验
Adams, Lisa C., Marx, Linus, Orberg, Erik Thiele, Bressem, Keno, Ziegelmayer, Sebastian, Bernhardt, Denise, Graf, Markus, Makowski, Marcus R., Combs, Stephanie E., Matthes, Florian, Peeken, Jan C.
Abstract
Question: Does atomic fact-checking, which decomposes AI treatment recommendations into individually verifiable claims linked to source guideline documents, increase clinician trust compared to traditional explainability approaches? Findings: In this randomized trial of 356 clinicians generating 7,476 trust ratings, atomic fact-checking produced a large effect on trust (Cohen's d = 0.94), increasing the proportion of clinicians expressing trust from 26.9% to 66.5%. Traditional transparency mechanisms showed a dose-response gradient of improvement over baseline (d = 0.25 to 0.50). Meaning: Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.
Chinese Translation
研究问题:与传统的可解释性方法相比,原子事实检查(将人工智能治疗推荐分解为可逐条验证、并与来源指南文件相关联的主张)是否能提高临床医生的信任度?研究结果:在这项纳入356名临床医生、共产生7,476个信任评分的随机试验中,原子事实检查对信任度产生了较大的效应(Cohen's d = 0.94),使表示信任的临床医生比例从26.9%提高到66.5%。传统透明机制相对于基线的改善呈现剂量-反应梯度(d = 0.25至0.50)。意义:在高风险临床决策中,将人工智能推荐分解为可逐条验证、并与来源指南相关联的主张,所带来的临床医生信任度显著高于传统的可解释性方法。
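For readers unfamiliar with the effect size quoted above, Cohen's d is the difference in group means divided by a pooled standard deviation. A minimal sketch of the textbook two-group formula follows; the trial's own estimator (for example one that accounts for repeated ratings per clinician) may differ, and the function name is an assumption.

```python
import math
from statistics import mean, variance

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
```

By the usual rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, which is why d = 0.94 is described as large.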
cs.CL / 45 / 2605.03936

The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

反例游戏:语言模型中的迭代概念分析与修正
Drucker, Daniel, Mahowald, Kyle
Abstract
Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.
Chinese Translation
概念分析(即提出定义并通过反例加以精炼)是哲学方法论的核心。我们研究语言模型是否能够通过迭代的分析与修正链来执行这一任务:一个模型实例针对所提出的定义生成反例,另一个模型实例修正该定义,如此反复。在20个概念和数千个反例-修正循环中,我们发现,尽管许多由语言模型生成的反例被人类专家和语言模型评判者都判定为无效,但语言模型评判者接受的反例数量大约是人类的两倍。尽管如此,逐条的有效性判断在人类之间以及人类与语言模型之间呈现出中等程度的一致性。我们进一步发现,延长迭代会产生越来越冗长的定义,但并未提高准确性。我们还观察到,有些概念总体上难以得到稳定的定义。这些发现表明,尽管语言模型能够参与哲学推理,但反例-修正循环很快就会遭遇收益递减,它可以作为一个富有成效的测试案例,用于评估语言模型能否维持高水平的迭代哲学推理。
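The analysis-and-repair chain described above can be sketched as a simple loop. This is an illustration under stated assumptions rather than the authors' pipeline: the three callables stand in for model calls, and stopping at the first missing or invalid counterexample is a choice made here for brevity.

```python
def run_counterexample_game(definition, propose_counterexample, is_valid, repair, max_rounds=5):
    """Alternate counterexample generation and definition repair.

    propose_counterexample(definition) -> counterexample or None   (hypothetical model call)
    is_valid(definition, counterexample) -> bool                   (hypothetical judge)
    repair(definition, counterexample) -> new definition           (hypothetical model call)
    """
    history = []
    for _ in range(max_rounds):
        counterexample = propose_counterexample(definition)
        # Assumption: the chain ends when no valid counterexample is produced.
        if counterexample is None or not is_valid(definition, counterexample):
            break
        definition = repair(definition, counterexample)
        history.append((counterexample, definition))
    return definition, history
```

Tracking the length of each definition in `history` is one way to observe the growing verbosity the abstract reports.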
cs.CL / 46 / 2605.03969

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

特征增强型Transformer:跨领域与跨生成器的稳健AI文本检测
Mady, Mohamed, Reschke, Johannes, Schuller, Björn
Abstract
AI-generated text is nowadays produced at scale across domains and heterogeneous generation pipelines, making robustness to distribution shift a central requirement for supervised binary detectors. We train transformer-based detectors on HC3 PLUS and calibrate a single decision threshold by maximising balanced accuracy on held-out validation; this threshold is then kept fixed for all downstream test distributions, revealing domain- and generator-dependent error asymmetries under shift. We evaluate in-domain on HC3 PLUS, under cross-dataset transfer to the multi-domain, multi-generator M4 benchmark, and on the external AI-Text-Detection-Pile. Although base models achieve near-ceiling in-domain performance (up to 99.5% balanced accuracy), performance under shift is brittle and strongly model-dependent. Feature augmentation via attention-based linguistic feature fusion improves transfer, with our best model (DeBERTa-v3-base+FeatAttn) achieving 85.9% balanced accuracy on M4. Multi-seed experiments confirm high stability. Under the same fixed-threshold protocol, our model outperforms strong zero-shot baselines by up to +7.22 points. Category-level ablations further show that readability and vocabulary features contribute most to robustness under shift. Overall, these results demonstrate that feature augmentation and a modern DeBERTa backbone significantly outperform earlier BERT/RoBERTa models, while the fixed-threshold protocol provides a more realistic and informative assessment of practical detector robustness.
Chinese Translation
如今,AI生成的文本在各个领域和异构生成流水线中被大规模生产,这使得对分布偏移的稳健性成为监督式二分类检测器的核心要求。我们在HC3 PLUS上训练基于Transformer的检测器,并通过在留出验证集上最大化平衡准确率来校准单一决策阈值;该阈值随后在所有下游测试分布中保持固定,从而揭示了分布偏移下依赖于领域和生成器的错误不对称性。我们在HC3 PLUS上进行领域内评估,在跨数据集迁移到多领域、多生成器的M4基准上进行评估,并在外部的AI-Text-Detection-Pile上进行评估。尽管基础模型在领域内达到了接近上限的性能(平衡准确率最高达99.5%),但其在分布偏移下的性能十分脆弱,并且强烈依赖于模型。通过基于注意力的语言特征融合进行特征增强可以改善迁移效果,我们的最佳模型(DeBERTa-v3-base+FeatAttn)在M4上达到了85.9%的平衡准确率。多随机种子实验证实了较高的稳定性。在相同的固定阈值协议下,我们的模型比强零样本基线最多高出7.22分。类别级消融实验进一步表明,可读性和词汇特征对分布偏移下的稳健性贡献最大。总体而言,这些结果表明,特征增强与现代DeBERTa骨干网络显著优于早期的BERT/RoBERTa模型,而固定阈值协议能够对检测器的实际稳健性给出更贴近现实、信息量更大的评估。
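The fixed-threshold protocol described above (pick one threshold on held-out validation data by maximising balanced accuracy, then reuse it unchanged on every test distribution) can be sketched as follows. The function names and the choice of candidate thresholds (the observed validation scores) are assumptions; the paper's search procedure is not specified in the abstract.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate (both classes must be present)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (tpr + tnr)

def calibrate_threshold(val_scores, val_labels):
    """Return the validation threshold with the highest balanced accuracy."""
    val_scores = np.asarray(val_scores, dtype=float)
    best_t, best_ba = None, -1.0
    for t in np.unique(val_scores):  # assumption: candidates are the observed scores
        ba = balanced_accuracy(val_labels, (val_scores >= t).astype(int))
        if ba > best_ba:
            best_t, best_ba = float(t), float(ba)
    return best_t, best_ba

def evaluate_fixed(test_scores, test_labels, threshold):
    """Score a shifted test set without re-tuning the threshold."""
    preds = (np.asarray(test_scores, dtype=float) >= threshold).astype(int)
    return float(balanced_accuracy(test_labels, preds))
```

Because the threshold is never re-tuned, any drift in the score distribution of a new domain or generator shows up directly as an imbalance between the two error types.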
cs.CL / 47 / 2605.03971

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

逻辑一致性作为桥梁:通过响应与自我判断之间的标签约束建模改善大语言模型幻觉检测
Mi, Hao, Sheng, Qiang, Wang, Shaofei, Hu, Beizhe, Sun, Yifan, Wang, Zhengjia, Zeng, Hengqi, Li, Yang, Wang, Danding, Cao, Juan
Abstract
Large Language Models (LLMs) are prone to factual hallucinations, risking their reliability in real-world applications. Existing hallucination detectors mainly extract micro-level intrinsic patterns for uncertainty quantification or elicit macro-level self-judgments through verbalized prompts. However, these methods address only a single facet of the hallucination, focusing either on implicit neural uncertainty or explicit symbolic reasoning, thereby treating these inherently coupled behaviors in isolation and failing to exploit their interdependence for a holistic view. In this paper, we propose LaaB (Logical Consistency-as-a-Bridge), a framework that bridges neural features and symbolic judgments for hallucination detection. LaaB introduces a "meta-judgment" process to map symbolic labels back into the feature space. By leveraging the inherent logical bridge where response and meta-judgment labels are either the same or opposite based on the self-judgment's semantics, LaaB aligns and integrates dual-view signals via mutual learning and enhances the hallucination detection. Extensive experiments on 4 public datasets, across 4 LLMs, against 8 baselines demonstrate the superiority of LaaB.
Chinese Translation
大型语言模型(LLMs)容易出现事实幻觉,从而影响其在现实应用中的可靠性。现有的幻觉检测器主要提取微观层面的内在模式来量化不确定性或通过语言提示引发宏观层面的自我判断。然而,这些方法仅关注幻觉的单一方面,分别侧重于隐式神经不确定性或显式符号推理,从而孤立地处理这些本质上相互耦合的行为,未能利用它们之间的相互依赖性来获得整体视角。本文提出了LaaB(逻辑一致性作为桥梁)框架,该框架将神经特征和符号判断结合起来进行幻觉检测。LaaB引入了一种“元判断”过程,将符号标签映射回特征空间。通过利用响应和元判断标签在自我判断语义基础上要么相同要么相反的内在逻辑桥梁,LaaB通过互学习对双视角信号进行对齐和整合,从而增强幻觉检测。在4个公共数据集、4种LLM模型和8个基线的广泛实验中,展现了LaaB的优越性。
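The "logical bridge" mentioned above can be read as a simple label constraint. The sketch below encodes one plausible reading, not the authors' formulation: label 1 means "hallucinated" for a response and "wrong" for a self-judgment, and the function names and the absolute-difference gap are assumptions.

```python
def meta_judgment_label(response_is_hallucinated, self_judgment_says_correct):
    """Label for the self-judgment (1 = the judgment is wrong, 0 = it is right).

    If the model judged its response "correct", the judgment is wrong exactly when
    the response is hallucinated, so the labels are the same. If it judged the
    response "incorrect", the judgment is wrong exactly when the response is
    factual, so the labels are opposite.
    """
    r = int(response_is_hallucinated)
    return r if self_judgment_says_correct else 1 - r

def consistency_gap(p_response, p_meta, self_judgment_says_correct):
    """Distance between a predicted meta-judgment probability and the one implied
    by the response-level probability under the constraint above."""
    implied = p_response if self_judgment_says_correct else 1.0 - p_response
    return abs(p_meta - implied)
```

A gap of this kind is one way two detectors, one per view, could be pushed to agree during training; the abstract only states that the signals are aligned via mutual learning.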
cs.CL / 48 / 2605.03998

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

EQUITRIAGE:基于大语言模型的急诊科分诊中的性别偏见公平性审计
Young, Richard J., Matthews, Alice M.
Abstract
Emergency department triage assigns patients an acuity score that determines treatment priority, and clinical evidence documents persistent gender disparities in human acuity assessment. As hospitals pilot large language models (LLMs) as triage decision support, a critical question is whether these models reproduce or mitigate known biases. We present EQUITRIAGE, a fairness audit of LLM-based ESI assignment evaluating five models (Gemini-3-Flash, Nemotron-3-Super, DeepSeek-V3.1, Mistral-Small-3.2, GPT-4.1-Nano) across 374,275 evaluations on 18,714 MIMIC-IV-ED vignettes under four prompt strategies. Of 9,368 originals, 9,346 are paired with a gender-swapped counterfactual. All five models produced flip rates above a pre-registered 5% threshold (9.9% to 43.8%). Two showed directional female undertriage (DeepSeek F/M 2.15:1, Gemini 1.34:1); two were near-parity; one had high sensitivity with weak male-direction asymmetry. DeepSeek's directional bias coexisted with a low outcome-linked calibration gap (0.013 against MIMIC-IV admission), a Chouldechova-style dissociation between within-group calibration and between-pair counterfactual invariance. Demographic blinding reduced Gemini's flip rate to 0.5%; an age-preserving blind variant left DeepSeek with residual F/M 1.25, implicating age as a residual channel. Chain-of-thought prompting degraded accuracy for all five models. A two-model ablation reveals opposite underlying mechanisms for the same directional phenotype: in Gemini the signal is emergent in the combined name+gender swap, while in DeepSeek the gender token alone carries it. EQUITRIAGE shows that group parity, counterfactual invariance, and gender calibration are distinct fairness properties, that intervention effectiveness is model-dependent, and that per-model counterfactual auditing should precede clinical deployment.
Chinese Translation
急诊科分诊为患者分配急迫性评分,以确定治疗优先级,而临床证据表明,人类的急迫性评估中长期存在性别差异。随着医院试点使用大语言模型(LLMs)作为分诊决策支持,一个关键问题是这些模型会重现还是缓解已知的偏见。我们提出了EQUITRIAGE,这是一项针对基于LLM的急诊严重程度指数(ESI)分配的公平性审计,在四种提示策略下,对五个模型(Gemini-3-Flash, Nemotron-3-Super, DeepSeek-V3.1, Mistral-Small-3.2, GPT-4.1-Nano)在18,714个MIMIC-IV-ED病例摘要上进行了374,275次评估。在9,368个原始病例中,9,346个配有性别调换的反事实版本。所有五个模型的翻转率均超过预先注册的5%阈值(9.9%至43.8%)。其中两个模型表现出方向性的女性分诊不足(DeepSeek的女/男比为2.15:1,Gemini为1.34:1);两个模型接近均等;一个模型敏感性较高,并带有微弱的男性方向不对称。DeepSeek的方向性偏见与较低的结局相关校准差距并存(以MIMIC-IV入院为参照为0.013),体现了组内校准与成对反事实不变性之间的Chouldechova式分离。人口统计学盲化将Gemini的翻转率降低至0.5%;在保留年龄的盲化变体下,DeepSeek仍有1.25的残余女/男比,提示年龄是一条残余通道。思维链提示降低了所有五个模型的准确性。一项针对两个模型的消融实验表明,同一方向性表型背后的机制截然相反:在Gemini中,信号在姓名与性别同时调换时涌现,而在DeepSeek中,仅性别标记本身就携带该信号。EQUITRIAGE表明,群体均等、反事实不变性和性别校准是不同的公平性属性,干预的有效性因模型而异,并且在临床部署之前应对每个模型进行反事实审计。
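The two headline quantities above, flip rate and the F/M direction ratio, can be computed from paired ESI assignments. The sketch below is an assumed operationalisation, not the paper's code: a "flip" is any pair whose two versions receive different ESI levels, and undertriage means the higher ESI number (ESI 1 is the most urgent level).

```python
def audit_pairs(pairs):
    """Summarise counterfactual pairs of (esi_female_version, esi_male_version)."""
    total = flips = female_under = male_under = 0
    for esi_female, esi_male in pairs:
        total += 1
        if esi_female == esi_male:
            continue  # the model's output is invariant to the gender swap
        flips += 1
        if esi_female > esi_male:
            female_under += 1  # female version rated less urgent
        else:
            male_under += 1    # male version rated less urgent
    return {
        "flip_rate": flips / total if total else 0.0,
        # Ratio is undefined when no pair undertriages the male version.
        "f_to_m_ratio": female_under / male_under if male_under else None,
    }
```

A high flip rate with a ratio near 1 indicates sensitivity without direction, while a ratio well above 1 indicates directional female undertriage, the two patterns the abstract distinguishes.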
cs.CL / 49 / 2605.04018

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

重新思考推理密集型检索:评估与提升代理搜索系统中的检索器
Zhao, Yilun, Wei, Jinbiao, Song, Tingyu, Zhang, Siyue, Zhao, Chen, Cohan, Arman
Abstract
Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
Chinese Translation
推理密集型检索旨在找出支持下游推理的证据,而不仅仅是匹配主题相似性。这一能力对代理式搜索系统日益重要,在这类系统中,检索器必须在迭代的搜索与综合过程中提供互补证据。然而,现有工作在评估和训练两方面都存在局限:BRIGHT 等基准只提供范围狭窄的金标准集合,并且孤立地评估检索器,而合成训练语料库往往优化单段落相关性,而非证据组合的构建。我们引入了 BRIGHT-Pro,这是一个由专家标注的基准,为每个查询扩充了多方面的金标准证据,并在静态和代理式搜索两种协议下评估检索器。我们还构建了 RTriever-Synth,这是一个按方面分解的合成语料库,可生成互补的正样本和以正样本为条件的难负样本,并利用它在 Qwen3-Embedding-4B 的基础上通过 LoRA 微调得到 RTriever-4B。在词汇型、通用型和推理密集型检索器上的实验表明,方面感知评估和代理式评估揭示了标准指标所掩盖的行为,而 RTriever-4B 相比其基础模型有显著提升。
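To see why multi-aspect gold evidence changes what a metric rewards, consider a coverage score that counts aspects rather than documents. This is an illustrative metric, not necessarily the one BRIGHT-Pro uses; the function name and the aspect-to-documents mapping are assumptions.

```python
def aspect_coverage_at_k(ranked_doc_ids, gold_by_aspect, k):
    """Fraction of gold aspects with at least one gold document in the top k.

    ranked_doc_ids: document ids in retrieval order
    gold_by_aspect: mapping from aspect name to the set of gold ids for that aspect
    """
    if not gold_by_aspect:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    covered = sum(1 for docs in gold_by_aspect.values() if top_k & set(docs))
    return covered / len(gold_by_aspect)
```

A retriever that returns several documents for the same aspect scores well on per-document relevance but poorly here, which is the kind of behaviour the abstract says standard metrics hide.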
cs.CL / 50 / 2605.04039

Safety and accuracy follow different scaling laws in clinical large language models

临床大语言模型的安全性与准确性遵循不同的规模定律
Wind, Sebastian, Nguyen, Tri-Thien, Sopa, Jeta, Lotfinia, Mahshad, Bickelhaup, Sebastian, Uder, Michael, Köstler, Harald, Wellein, Gerhard, Nebelung, Sven, Truhn, Daniel, Maier, Andreas, Arasteh, Soroosh Tayebi
Abstract
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
Chinese Translation
临床大语言模型(LLMs)通常通过增加模型规模、上下文长度、检索复杂度或推理时计算量来扩展,其隐含的期望是更高的准确性意味着更安全的行为。这一假设在医学领域并不完整,因为少数自信的、高风险的或与证据相矛盾的错误可能比平均基准表现更为重要。我们引入了SaFE-Scale,这是一个用于测量临床LLM的安全性如何随模型规模、证据质量、检索策略、上下文暴露和推理时计算量而变化的框架。为了实例化这一框架,我们推出了RadSaFE-200,这是一个聚焦放射学安全的评估基准,包含200道多项选择题,并配有由临床医生定义的干净证据、冲突证据,以及针对高风险错误、不安全答案和证据矛盾的选项级标签。我们在六种部署条件下评估了34个本地部署的LLM:闭卷提示(零样本)、干净证据、冲突证据、标准RAG、代理式RAG和最大上下文提示。干净证据带来了最大的改善,将平均准确率从73.5%提高到94.1%,同时将高风险错误从12.0%降低到2.6%,矛盾从12.7%降低到2.3%,危险的过度自信从8.0%降低到1.6%。标准RAG和代理式RAG未能复现这一安全特征:代理式RAG的准确率高于标准RAG,并减少了矛盾,但高风险错误和危险的过度自信仍然偏高。最大上下文提示增加了延迟,却未能缩小安全差距,而额外的推理时计算量只带来了有限的收益。最坏情况分析显示,具有临床后果的错误集中在一小部分问题上。因此,临床LLM的安全性并不是规模扩展的被动结果,而是由证据质量、检索设计、上下文构建和集体失败行为共同塑造的部署特性。
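Option-level labels make it possible to report safety rates alongside accuracy. The sketch below shows one way to do so and is not the benchmark's scoring code: the record fields, the function name, and in particular the definition of dangerous overconfidence (a high-risk choice made with confidence at or above a cutoff) are assumptions.

```python
def safety_summary(records, confidence_cutoff=0.8):
    """Accuracy and safety rates over multiple-choice records.

    Each record is a dict with hypothetical keys: "chosen", "answer",
    "high_risk_options" (set of option letters) and "confidence" (0 to 1).
    """
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    correct = high_risk = overconfident = 0
    for r in records:
        chose_high_risk = r["chosen"] in r["high_risk_options"]
        correct += r["chosen"] == r["answer"]
        high_risk += chose_high_risk
        # Assumed definition: a high-risk error stated with high confidence.
        overconfident += chose_high_risk and r["confidence"] >= confidence_cutoff
    return {
        "accuracy": correct / n,
        "high_risk_error_rate": high_risk / n,
        "dangerous_overconfidence_rate": overconfident / n,
    }
```

Two systems with the same accuracy can differ in the other two rates, because a wrong answer only counts against safety when it lands on an option labelled high-risk.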