Daily Research Digest

arXiv Papers

2026-05-13 · 391 Papers · 4 Categories · 391 Translated
机器人学 (Robotics) (40)
cs.RO / 1 / 2605.10993

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

ECHO:用于视觉-语言-动作模型的连续层次记忆
Hu, Yanbin, Cui, Jin, Lu, Jiayi, Yang, Ruixuan, Ye, Jun, Zhao, Boran, Chen, Xingyu, Lan, Xuguang, Ren, Pengju
Abstract
Memory capacity is a critical factor determining the performance of Vision-Language-Action (VLA) models in long-horizon manipulation tasks. Existing memory-augmented architectures primarily rely on linear or flat storage, lacking structural priors for manipulation categories and hierarchical organization. This deficiency hinders efficient experience retrieval and limits generalization to unseen long-horizon task compositions. Inspired by the hierarchical organization of human experience, we propose ECHO (Experience Consolidation and Hierarchical Organization), a novel memory framework operating within a Continuous Hierarchical Space. By employing a hyperbolic autoencoder, ECHO maps VLA hidden states into this space. Leveraging hyperbolic metrics and entailment constraint mechanisms, experience vectors are organized into a semantic memory tree that supports efficient top-down retrieval. In parallel, a background consolidation mechanism continuously refines the memory tree through geometric interpolation and structural splitting, supporting virtual memory synthesis in the continuous space. We integrate ECHO into the $\pi_0$ foundation model. Evaluations on LIBERO and preliminary real-world experiments demonstrate the effectiveness of our approach, notably achieving a 12.8% absolute improvement in execution success rate over the $\pi_0$ baseline on LIBERO-Long, while improving compositional generalization on cross-suite unseen long-horizon tasks.
Chinese Translation
记忆容量是决定视觉-语言-动作(VLA)模型在长时间操作任务中性能的关键因素。现有的增强记忆架构主要依赖于线性或扁平存储,缺乏对操作类别的结构性先验和层次组织。这一缺陷阻碍了经验的高效检索,并限制了对未见过的长时间任务组合的泛化。受到人类经验层次组织的启发,我们提出了ECHO(经验整合与层次组织),这是一种在连续层次空间中运行的新型记忆框架。通过采用双曲自编码器,ECHO将VLA隐藏状态映射到该空间。利用双曲度量和蕴含约束机制,经验向量被组织成一个支持高效自上而下检索的语义记忆树。同时,背景整合机制通过几何插值和结构分裂不断优化记忆树,支持在连续空间中的虚拟记忆合成。我们将ECHO集成到$\pi_0$基础模型中。在LIBERO上的评估和初步的现实世界实验展示了我们方法的有效性,特别是在LIBERO-Long上,相较于$\pi_0$基线,执行成功率实现了12.8%的绝对提升,同时在跨套件未见长时间任务上改善了组合泛化能力。
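The ECHO abstract mentions hyperbolic metrics but does not say which model of hyperbolic space is used. As a rough illustration only, assuming the Poincaré ball with curvature -1, the sketch below computes the geodesic distance that makes tree-like layouts natural: points near the boundary (leaf-like) end up far from each other, while the origin (root-like) stays comparatively close to all of them. The helper name `poincare_distance` and the sample points are hypothetical and not taken from the paper.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Assumption: Poincare ball model with curvature -1. The model used by
    ECHO is not stated in the abstract and may differ.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical embedding: a root at the origin and two leaves near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
print(f"root-leaf: {d_root_leaf:.3f}, leaf-leaf: {d_leaf_leaf:.3f}")
```

In this toy layout the leaf-to-leaf distance is almost the sum of the two root-to-leaf distances, which is the property that lets a hierarchy be read off from distances alone.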
cs.RO / 2 / 2605.11048

ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching

ForceFlow:通过接触驱动的流匹配学习感知与行动
Zhang, Shuoheng, Yuan, Yifu, Tang, Hongyao, Zheng, Yan, Yu, Qiaojun, Li, Pengyi, Huang, Guowei, Huang, Helong, Quan, Xingyue, Hao, Jianye
Abstract
Existing imitation learning methods enable robots to interact autonomously with the physical environment. However, contact-rich manipulation tasks remain a significant challenge due to complex contact dynamics that demand high-precision force feedback and control. Although recent efforts have attempted to integrate force/torque sensing into policies, how to build a simple yet effective framework that achieves robust generalization under multimodal observations remains an open question. In this paper, we propose ForceFlow, a force-aware reactive framework built upon flow matching. For contact-stage policy design, we investigate force signal fusion mechanisms and adopt an asymmetric multimodal fusion architecture that treats force as a global regulatory signal, combined with a joint prediction paradigm that enhances the policy's understanding of instantaneous force and historical information, thereby achieving deep coupling between force and motion. For task-level hierarchical decomposition, we divide manipulation into a vision-dominant approach stage (VLM-based pointing for target localization) and a touch-dominant interaction stage (force-driven contact execution), with a Vision-to-Force (V2F) handover mechanism that explicitly decouples spatial generalization from contact regulation. Experimental results across six real-world contact-rich tasks demonstrate that ForceFlow achieves a 37% success rate improvement over the strong baseline ForceVLA while maintaining significantly lower cost. Moreover, ForceFlow exhibits accurate force signal prediction and demonstrates superior performance in contact force self-regulation and zero-shot out-of-distribution (OOD) generalization.
Chinese Translation
现有的模仿学习方法使机器人能够自主与物理环境进行交互。然而,由于复杂的接触动态,接触丰富的操控任务仍然是一个重大挑战,这需要高精度的力反馈和控制。尽管最近的努力尝试将力/扭矩传感器集成到策略中,但如何构建一个简单而有效的框架,以在多模态观察下实现稳健的泛化仍然是一个未解的问题。在本文中,我们提出了ForceFlow,这是一种基于流匹配的力感知反应框架。对于接触阶段的策略设计,我们研究了力信号融合机制,并采用了一种不对称的多模态融合架构,将力视为全局调节信号,结合一种联合预测范式,增强策略对瞬时力和历史信息的理解,从而实现力与运动之间的深度耦合。对于任务级的层次分解,我们将操控分为以视觉为主的阶段(基于视觉的目标定位指向)和以触觉为主的交互阶段(力驱动的接触执行),并采用一种视觉到力(Vision-to-Force, V2F)交接机制,明确将空间泛化与接触调节解耦。六个真实世界接触丰富任务的实验结果表明,ForceFlow在成功率上比强基线ForceVLA提高了37%,同时保持了显著更低的成本。此外,ForceFlow展现了准确的力信号预测,并在接触力自我调节和零样本分布外(OOD)泛化方面表现优越。
cs.RO / 3 / 2605.11114

SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

SEVO:通过主动照明和数据中心收集增强语义的虚拟观察以实现稳健的视觉-语言-行动(VLA)操作
Fang, Tianchonghui, Zhuang, Yuan, Miao, Fei
Abstract
Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant semantic cue. Critically, we show that a diversified data collection protocol (systematically varying lighting, backgrounds, and distractors during teleoperation) is the single most important factor for generalization. We target transparent water bottles, objects that visually blend with their surroundings, and select a simple pick-and-place task to enable hundreds of controlled real-robot trials across two mobile platforms. The full pipeline achieves 95% grasp success with ACT and 83% with SmolVLA in the training environment, transferring to novel environments at 85% and 75%. Without SEVO, the same policies achieve only 75%/70% in training and collapse to 30-35% in novel environments. Our results demonstrate that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday household environments.
Chinese Translation
通过社区工具链在低成本硬件上训练的视觉-语言-行动(VLA)和模仿学习策略在部署到训练环境之外时常常失败。现有评估,包括原始的ACT和SmolVLA基准,显示在受控的固定背景下具有较高的成功率,但社区从业者报告在新环境中的迁移几乎为零。我们提出SEVO(语义增强虚拟观察),这是一种数据中心的方法,旨在提高跨环境操作的稳健性,而无需修改策略架构。SEVO通过三种机制转换原始RGB摄像头流:(1)固定在身体上的摄像头,其组合视野覆盖整个操作工作空间;(2)主动红光谱照明,物理上规范化物体外观;(3)实时YOLO分割叠加,提供背景不变的语义线索。关键是,我们展示了多样化的数据收集协议(在远程操作期间系统性地变化照明、背景和干扰物)是实现泛化的最重要因素。我们以透明水瓶为目标,这些物体在视觉上与其周围环境融合,并选择一个简单的拾取和放置任务,以便在两个移动平台上进行数百次受控的真实机器人试验。完整的流程在训练环境中使用ACT时实现了95%的抓取成功率,使用SmolVLA时为83%,迁移到新环境时分别为85%和75%。没有SEVO,相同的策略在训练中仅实现75%/70%的成功率,并在新环境中崩溃至30-35%。我们的结果表明,原则性的观察设计和数据收集过程中的环境多样性,而非模型扩展,使低成本机器人能够在日常家庭环境中可靠地操作。
cs.RO / 4 / 2605.11119

ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments

ASIP-Planner:在部分已知室内环境中无人机表面检查的自适应规划
Jin, Hanyu, Xu, Zhefan, Shen, Haoyu, Han, Xinming, Ye, Kanlong, Shimada, Kenji
Abstract
Indoor infrastructure inspection, such as tunnels and industrial facilities, requires systematic surface coverage to ensure that all inspection targets are properly observed. Unmanned Aerial Vehicles (UAVs) offer an alternative to manual inspection by conducting map-guided surface inspection using prior structural models. However, in practice, indoor inspection often relies on floorplan-derived reference maps that may not reflect unforeseen obstacles, such as temporary structures or equipment, leading to occluded viewpoints and degraded inspection quality. Existing coverage planning methods typically assume a fully known inspection environment and perform deterministic global viewpoint optimization based on accurate prior maps, making them vulnerable to environmental discrepancies during execution. This work presents an adaptive UAV inspection framework for partially known structured indoor environments. The proposed method integrates a segment-based global coverage planner with an inspection-oriented local view-angle adaptation module. The global planner organizes planar inspection targets into surface-aligned clusters to generate compact viewpoint sequences with improved orientation consistency. The local planner generates collision-free trajectories and adjusts the viewing direction online to mitigate occlusion-induced coverage loss while preserving the planned trajectory structure. The simulation results across randomized scene configurations demonstrate that the proposed global planner achieves near-complete coverage while reducing trajectory length compared to representative baselines. Real-world flight experiments further validate that the framework produces usable inspection data for downstream analysis. These results indicate that the proposed framework improves inspection efficiency and adaptability in partially known structured indoor environments.
Chinese Translation
室内基础设施检查,如隧道和工业设施,需要系统的表面覆盖,以确保所有检查目标得到适当观察。无人机(UAV)通过使用先前的结构模型进行地图引导的表面检查,为手动检查提供了一种替代方案。然而,在实际操作中,室内检查通常依赖于基于平面图的参考地图,这些地图可能无法反映意外障碍物,如临时结构或设备,从而导致视点被遮挡和检查质量下降。现有的覆盖规划方法通常假设检查环境是完全已知的,并基于准确的先前地图执行确定性的全局视点优化,这使得它们在执行过程中容易受到环境差异的影响。本文提出了一种针对部分已知结构化室内环境的自适应无人机检查框架。所提出的方法将基于段的全局覆盖规划器与面向检查的局部视角适应模块相结合。全局规划器将平面检查目标组织成表面对齐的聚类,以生成具有改进方向一致性的紧凑视点序列。局部规划器生成无碰撞的轨迹,并在线调整视角方向,以减轻由于遮挡导致的覆盖损失,同时保持规划轨迹的结构。针对随机场景配置的仿真结果表明,所提出的全局规划器在减少轨迹长度的同时实现了近乎完全的覆盖,相较于代表性基线表现更优。实际飞行实验进一步验证了该框架能够生成可用于后续分析的检查数据。这些结果表明,所提出的框架提高了在部分已知结构化室内环境中的检查效率和适应性。
cs.RO / 5 / 2605.11144

Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation

面向预测的高斯溅射在语言引导的抓取与放置操作中的三维表示
Jia, Kaixin, Xu, Jiacheng
Abstract
We introduce Forecast-aware Gaussian Splatting (Forecast-GS), a predictive 3D representation framework for language-conditioned robotic manipulation. While recent manipulation systems have made progress by grounding language instructions into robot affordances, value maps, or relational keypoint constraints, they usually reason over the current scene and do not explicitly model the task-completed state. This limitation is critical when success depends on satisfying spatial and semantic goals under partial observations, where the robot must evaluate whether a candidate action leads to a feasible task-consistent outcome. We validate Forecast-GS on real-world pick-and-place manipulation tasks, including Cutter-to-Box, Apple-to-Bowl, and Sponge-to-Tray. For each task, we conduct 25 real-world trials under varied initial object configurations using the same robot platform and sensing setup. Forecast-GS with automatic candidate selection achieves success rates of 21/25, 23/25, and 16/25 on the three tasks, respectively, outperforming the ReKep baseline, which achieves 15/25, 19/25, and 10/25. A diagnostic human-assisted setting further improves success rates to 23/25, 24/25, and 19/25, suggesting that candidate generation is effective while automatic ranking remains imperfect. These results suggest that explicitly forecasting task-completed 3D states enables more reliable action evaluation, while the gap between automatic and human-assisted selection indicates that robust final-state ranking remains an important challenge for fully autonomous manipulation. Overall, Forecast-GS provides an interpretable bridge between language understanding, 3D perception, and robotic manipulation planning.
Chinese Translation
我们提出了一种面向预测的高斯溅射(Forecast-GS),这是一个用于语言条件下机器人操作的预测三维表示框架。尽管近期的操作系统在将语言指令与机器人可操作性、价值图或关系关键点约束相结合方面取得了进展,但它们通常只针对当前场景进行推理,并未明确建模任务完成状态。当成功依赖于在部分观察下满足空间和语义目标时,这一局限性尤为关键,此时机器人必须评估候选动作是否导致可行的任务一致结果。我们在真实的抓取与放置操作任务上验证了Forecast-GS,包括刀具到箱子、苹果到碗和海绵到托盘。对于每个任务,我们在相同的机器人平台和传感器设置下,针对不同的初始物体配置进行了25次真实世界的试验。使用自动候选选择的Forecast-GS在三个任务上的成功率分别为21/25、23/25和16/25,优于ReKep基线,其成功率为15/25、19/25和10/25。通过人类辅助的诊断设置,成功率进一步提高至23/25、24/25和19/25,表明候选生成是有效的,而自动排名仍然不完善。这些结果表明,明确预测任务完成的三维状态能够实现更可靠的动作评估,而自动选择与人类辅助选择之间的差距则表明,稳健的最终状态排名仍然是完全自主操作的重要挑战。总体而言,Forecast-GS为语言理解、三维感知和机器人操作规划之间提供了一个可解释的桥梁。
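The Forecast-GS abstract reports raw success counts out of 25 trials per task. The short sketch below simply converts those counts into percentages and per-method means so the three settings can be compared at a glance; the counts are copied from the abstract, while the dictionary layout and the helper `success_rates` are illustrative choices.

```python
TRIALS_PER_TASK = 25
TASKS = ["Cutter-to-Box", "Apple-to-Bowl", "Sponge-to-Tray"]

# Success counts as reported in the abstract, in the task order above.
SUCCESSES = {
    "Forecast-GS (automatic)": [21, 23, 16],
    "ReKep baseline": [15, 19, 10],
    "Forecast-GS (human-assisted)": [23, 24, 19],
}


def success_rates(counts, trials=TRIALS_PER_TASK):
    """Convert per-task success counts into percentages."""
    return [100.0 * c / trials for c in counts]


rates = {name: success_rates(counts) for name, counts in SUCCESSES.items()}
mean_rate = {name: sum(r) / len(r) for name, r in rates.items()}

for name, per_task in rates.items():
    cells = ", ".join(f"{t}: {p:.0f}%" for t, p in zip(TASKS, per_task))
    print(f"{name}: {cells} (mean {mean_rate[name]:.1f}%)")
```

Averaged over the three tasks, automatic selection reaches 80.0%, the ReKep baseline about 58.7%, and the human-assisted setting 88.0%. This is consistent with the abstract's reading that candidate generation works well while automatic ranking leaves room for improvement.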
cs.RO / 6 / 2605.11210

Distributed Pose Graph Optimization via Continuous Riemannian Dynamics

基于连续黎曼动力学的分布式位姿图优化
Shin, Jaeho, Ghaffari, Maani, Tian, Yulun
Abstract
We present a framework for distributed Pose Graph Optimization (PGO) by formulating the problem as a second-order continuous-time dynamical system evolving on Lie groups. By modeling pose variables as massive particles subject to damping, the equilibrium points of the resulting Riemannian dynamics coincide with first-order critical points of the original PGO problem. Using the governing damped Euler-Poincaré equations and a semi-implicit geometric integrator, we design an optimization algorithm that generalizes existing algorithms such as Riemannian gradient descent and Gauss-Newton. In multi-robot settings, we present a fully distributed and parallel method based on block-diagonal mass and damping matrices, where each robot solves an ordinary differential equation for its own poses with minimal communication overhead. Moreover, modeling both state and velocity enables principled neighbor prediction that significantly improves convergence under delayed communication. Theoretically, we present an analysis and establish a sufficient condition that ensures energy dissipation under the employed geometric discretization scheme. Experiments on benchmark PGO datasets demonstrate that the proposed solver achieves superior performance compared to state-of-the-art distributed baselines in both synchronous and asynchronous regimes.
Chinese Translation
我们提出了一种分布式位姿图优化(Pose Graph Optimization, PGO)框架,通过将问题表述为在李群上演化的二阶连续时间动力系统。通过将位姿变量建模为受阻尼影响的大质量粒子,所得到的黎曼动力学的平衡点与原始PGO问题的一阶临界点重合。利用控制的阻尼欧拉-庞卡雷方程和半隐式几何积分器,我们设计了一种优化算法,该算法推广了现有的算法,如黎曼梯度下降和高斯-牛顿法。在多机器人环境中,我们提出了一种基于块对角质量和阻尼矩阵的完全分布式和并行方法,其中每个机器人以最小的通信开销解决其自身位姿的常微分方程。此外,建模状态和速度使得邻居预测具有原则性,这显著改善了在延迟通信下的收敛性。从理论上讲,我们进行了分析并建立了确保在所采用的几何离散化方案下能量耗散的充分条件。在基准PGO数据集上的实验表明,所提出的求解器在同步和异步模式下均优于最先进的分布式基线,表现出更优的性能。
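The idea in the PGO abstract above, that a damped second-order system settles at critical points of the cost, can be seen in a flat-space toy. The sketch below is a Euclidean stand-in and not the paper's Lie-group integrator: it integrates m·x'' + d·x' + ∇f(x) = 0 with a semi-implicit Euler step (damping implicit, force explicit) on a hypothetical quadratic cost. The function names, the cost, and all parameter values are assumptions made for illustration.

```python
def damped_descent(grad, x0, mass=1.0, damping=2.0, step=0.1, iters=500):
    """Integrate m*x'' + d*x' + grad f(x) = 0 in Euclidean space.

    Assumption: a flat-space toy; the paper works on Lie groups with the
    damped Euler-Poincare equations, which this sketch does not implement.
    """
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(iters):
        g = grad(x)
        # Semi-implicit velocity update: damping uses the new velocity,
        # the gradient force uses the current position.
        v = [(vi - step * gi / mass) / (1.0 + step * damping / mass)
             for vi, gi in zip(v, g)]
        x = [xi + step * vi for xi, vi in zip(x, v)]
    return x, v


def grad_f(x):
    """Gradient of the hypothetical cost f(x) = 0.5*(x0 - 1)^2 + 2*(x1 + 2)^2."""
    return [x[0] - 1.0, 4.0 * (x[1] + 2.0)]


x_final, v_final = damped_descent(grad_f, [5.0, 5.0])
print("position:", [round(c, 6) for c in x_final])
print("velocity:", [round(c, 6) for c in v_final])
```

The particle comes to rest where the gradient vanishes, at (1, -2), which is the flat-space analogue of the abstract's statement that equilibria coincide with first-order critical points.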
cs.RO / 7 / 2605.11296

Computational Design of a Low-Visibility UAV Using a Human-Aligned Perceptual Metric

基于人类对齐感知度量的低可视化无人机的计算设计
Wang, Jingxian, Yu, Chen, Matthews, David, Alexander, Emma, Kriegman, Sam, Rubenstein, Michael
Abstract
We introduce Phantom Twist, a type of single-propeller UAV designed to achieve low visibility through high-speed spinning and the exploitation of motion blur. We develop a two-stage automated design pipeline that optimizes the placement of functional components including batteries, control PCB, motor-propeller assembly, and counterweights. The pipeline minimizes visibility as measured by a human-aligned perceptual metric (LPIPS) while strictly satisfying inertial and aerodynamic constraints required for stable flight. We validate this approach through fabrication and flight testing of multiple prototypes. These tests confirm that our pipeline produces stable, controllable designs and that the optimized UAV exhibits significantly reduced visual perceptibility compared to conventional quadcopters.
Chinese Translation
我们介绍了 Phantom Twist,这是一种单旋翼无人机,旨在通过高速旋转和运动模糊的利用实现低可视化。我们开发了一个两阶段的自动化设计流程,优化功能组件的布局,包括电池、控制 PCB、电机-螺旋桨组件和配重。该流程在严格满足稳定飞行所需的惯性和气动约束的同时,最小化通过人类对齐感知度量(LPIPS)测量的可视性。我们通过制造和飞行测试多个原型来验证这种方法。这些测试确认我们的流程能够产生稳定、可控的设计,并且优化后的无人机与传统四旋翼相比,视觉可感知性显著降低。
cs.RO / 8 / 2605.11381

Kairos: A Scalable Serving System for Physical AI

Kairos:一个可扩展的物理人工智能服务系统
Dai, Yinwei, Ananthanarayanan, Ganesh, Cox, Landon, Foukas, Xenofon, Radunovic, Bozidar, Netravali, Ravi
Abstract
Physical AI is experiencing rapid growth with frontier foundation models increasing its capabilities across general environments. Physical AI tasks are characterized by inference properties that are markedly different from digital AI. They consist of multiple rounds of inference and action execution, generating a chunk of actions in each inference round, and asynchronously interleaving inference and execution. This makes existing digital AI serving systems unsuited for physical AI, a shortcoming that must be overcome to enable wide adoption of these models, considering their size and the scale of the robot fleets they have to serve. To fill this gap, we design Kairos, the first multi-robot serving system that makes the generate-execute loop a first-class citizen, with active involvement in the execution phase. Across a wide range of physical AI models and robots, Kairos reduces the average end-to-end task latency by 31.8-66.5% over state-of-the-art digital AI serving practices, with gains scaling with the robot fleet size.
Chinese Translation
物理人工智能正经历快速增长,前沿基础模型提升了其在一般环境中的能力。物理人工智能任务的推理特性与数字人工智能显著不同。它们由多轮推理和行动执行组成,每轮推理生成一组动作,并异步交错推理与执行。这使得现有的数字人工智能服务系统不适合物理人工智能;这一缺陷对于促进其广泛应用至关重要,考虑到其规模和需要服务的机器人队伍的规模。为填补这一空白,我们设计了Kairos,这是第一个将生成-执行循环作为一等公民的多机器人服务系统,积极参与执行阶段。在广泛的物理人工智能模型和机器人中,Kairos将平均端到端任务延迟降低了31.8%至66.5%,相较于最先进的数字人工智能服务实践,且随着机器人队伍规模的扩大,性能提升更为显著。
cs.RO / 9 / 2605.11459

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

克服动态盲视:无训练的节奏与路径修正用于视觉-语言-动作模型
Zhang, Yanyan, Song, Chaoda, Singh, Vikash, Li, Xinpeng, Ye, Kai, Hu, Zhe, Pu, Zhongzhu, Yin, Yu, Chaudhary, Vipin
Abstract
Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
Chinese Translation
视觉-语言-动作(VLA)模型在超越经典控制范式方面展现出显著的灵活性和泛化能力。然而,大多数主流的VLA模型是在单帧观察范式下训练的,这使得它们在结构上对时间动态失去敏感性。因此,这些模型在非平稳场景中表现严重退化,即使在动态数据集上经过训练或微调也是如此。现有的方法要么需要昂贵的重新训练,要么在动作块之间存在延迟瓶颈和较差的时间一致性。我们提出了节奏与路径修正(Pace-and-Path Correction),这是一种无训练的、闭式形式的推理时操作符,可以包裹任何分块动作的VLA。通过单一的二次成本,联合最小化产生了一个统一的解决方案,该方案正交地分解为两个不同的通道。节奏通道沿计划方向压缩执行,而路径通道应用正交的空间偏移,共同吸收在块窗口内感知到的动态。我们在一个全面的诊断基准MoveBench上评估了我们的方法,该基准旨在将运动作为唯一的受控变量。实证结果表明,我们的框架在动态仅和静态-动态混合环境中,分别在绝对值上提高了成功率达28.8%和25.9%,并且始终优于最先进的无训练包装器和动态自适应方法。
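The Pace-and-Path abstract describes a solution that splits orthogonally into a component along the planned direction (pace) and a component perpendicular to it (path). The closed-form operator itself is not given in the abstract, so the sketch below shows only the generic geometric step behind such a split: projecting a correction vector onto the planned direction and taking the remainder. The function name `split_pace_and_path` and the example vectors are hypothetical.

```python
import math


def split_pace_and_path(planned_direction, correction):
    """Split `correction` into parts parallel and orthogonal to the plan.

    Assumption: plain orthogonal projection for illustration; this is not
    the paper's quadratic-cost operator.
    """
    norm = math.sqrt(sum(d * d for d in planned_direction))
    if norm == 0.0:
        raise ValueError("planned direction must be non-zero")
    unit = [d / norm for d in planned_direction]
    along = sum(c * u for c, u in zip(correction, unit))
    pace = [along * u for u in unit]                  # along the plan
    path = [c - p for c, p in zip(correction, pace)]  # lateral offset
    return pace, path


# Hypothetical planned motion direction and desired correction.
planned = [3.0, 4.0, 0.0]
correction = [1.0, 0.0, 0.0]

pace, path = split_pace_and_path(planned, correction)
print("pace:", [round(c, 3) for c in pace])
print("path:", [round(c, 3) for c in path])
```

The two parts are perpendicular and sum back to the original correction, so in this toy setting a change in timing along the plan and a lateral shift can be adjusted independently.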
cs.RO / 10 / 2605.11479

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

基于折扣活跃性表述的操作策略离线评估
Wang, Hao, Bowden, Joshua, Crosby, Colton, Bansal, Somil
Abstract
Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, task progression of evaluation rollouts is often non-monotonic because the policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods relying on Bellman equations/principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and a cloth folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
Chinese Translation
策略评估是机器人策略开发和部署流程中的一个基本组成部分。在现代操作系统中,这个问题尤其具有挑战性:奖励往往稀疏,评估回放的任务进展通常是非单调的,因为策略表现出恢复行为,并且评估回放必然是有限长度的。这种有限长度引入了截断偏差,破坏了依赖于贝尔曼方程/最优性原则的标准方法所基于的无限视野假设。在本研究中,我们提出了一种基于活跃性贝尔曼算子的稀疏奖励离线策略评估框架。我们的表述将策略评估解释为一个任务完成问题,并产生一个对有限视野截断具有鲁棒性的保守固定点值函数。我们分析了所提算子的理论性质,包括收缩保证,并展示了它如何编码任务进展,同时减轻截断偏差。我们在两个模拟操作任务上评估了我们的方法,使用了视觉-语言-动作模型和扩散策略,以及使用人类演示的布料折叠任务。实证结果表明,我们的方法更准确地反映了任务进展,并显著减少了截断偏差,优于经典基线方法,如 TD(0) 和蒙特卡洛策略评估。
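The truncation bias mentioned in the policy-evaluation abstract above can be illustrated with the Monte Carlo baseline it names. In the sketch below, a sparse-reward rollout that would succeed at step 11 is cut off at a horizon of 10 steps. Its Monte Carlo return becomes zero, indistinguishable from a rollout that made no progress at all. The rollouts, horizon, and discount factor are hypothetical, and the sketch does not implement the paper's liveness-based operator.

```python
def discounted_return(rewards, gamma=0.99):
    """Monte Carlo discounted return of one rollout, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical sparse-reward rollout: reward 1 only when the task completes,
# here at step index 11 (after a recovery that delayed success).
full_rollout = [0.0] * 11 + [1.0]

# The same rollout cut off by an evaluation horizon of 10 steps.
truncated_rollout = full_rollout[:10]

g_full = discounted_return(full_rollout)
g_truncated = discounted_return(truncated_rollout)
print(f"full rollout return:      {g_full:.4f}")
print(f"truncated rollout return: {g_truncated:.4f}")
```

The full rollout is worth 0.99^11 (about 0.895) from the start state, while the truncated one is worth exactly 0. This gap is the kind of bias the abstract attributes to finite-length evaluation rollouts.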
cs.RO / 11 / 2605.11485

Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

协调扩散:在没有多智能体示范的情况下生成多智能体行为
Peters, Lasse, Ferranti, Laura, Alonso-Mora, Javier, Bajcsy, Andrea
Abstract
Imitation learning powered by generative models has proven effective for modeling complex single-agent behaviors. However, teaching multi-agent systems, like multiple arms or vehicles, to coordinate through imitation learning is hindered by a fundamental data bottleneck: as the joint state-action space grows exponentially with the number of agents, collecting a sufficient amount of coordinated multi-agent demonstrations becomes extremely costly. In this work, we ask: how can we leverage single-agent demonstration data to learn multi-agent policies? We present Coordinated Diffusion (CoDi), a framework that couples independently trained single-agent diffusion policies through a user-defined multi-agent cost function, without requiring any coordinated demonstrations. We derive a new diffusion-based sampling scheme wherein the diffusion score function decomposes into independent, single-agent pre-trained base policies plus a cost-driven guidance term that coordinates these base policies into cohesive multi-agent behavior. We show that this guidance term can be estimated in a gradient-free manner, making CoDi applicable to black-box, non-differentiable cost functions without additional training. Theoretically and empirically, we analyze the conditions under which this composition can faithfully approximate a target multi-agent behavior. We find a complementary role for demonstration data versus the cost function: single-agent demonstrations must cover the support of the desired multi-agent behavior, while the cost function must promote desired behavior from this product of single-agent policies. Our results in simulation and hardware experiments of a two-arm manipulation task show that CoDi discovers robust coordinated behavior from single-agent data, is more data-efficient than multi-agent baselines, and highlights the importance of joint guidance, base policy support, and cost design.
Chinese Translation
基于生成模型的模仿学习已被证明在建模复杂的单智能体行为方面有效。然而,通过模仿学习教导多智能体系统(如多个机械臂或车辆)进行协调面临着一个根本的数据瓶颈:随着智能体数量的增加,联合状态-动作空间呈指数增长,收集足够的协调多智能体示范变得极其昂贵。在本研究中,我们提出了一个问题:如何利用单智能体示范数据来学习多智能体策略?我们提出了协调扩散(Coordinated Diffusion, CoDi)框架,该框架通过用户定义的多智能体成本函数将独立训练的单智能体扩散策略结合起来,而无需任何协调示范。我们推导出一种新的基于扩散的采样方案,其中扩散评分函数分解为独立的、经过预训练的单智能体基础策略加上一个驱动成本的指导项,该指导项将这些基础策略协调成一致的多智能体行为。我们表明,这个指导项可以以无梯度的方式进行估计,使得CoDi适用于黑箱、不可微分的成本函数,而无需额外的训练。在理论和实证上,我们分析了这种组合可以真实近似目标多智能体行为的条件。我们发现示范数据与成本函数之间存在互补作用:单智能体示范必须覆盖所需多智能体行为的支持,而成本函数必须促进这一单智能体策略的产物所期望的行为。我们在两臂操作任务的仿真和硬件实验中的结果表明,CoDi能够从单智能体数据中发现稳健的协调行为,其数据效率优于多智能体基线,并强调了联合指导、基础策略支持和成本设计的重要性。
cs.RO / 12 / 2605.11534

PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

PRISM:在模拟具身环境中进行意图规划与推理
Lim, Yunn Kang, Sun, Pengzhan, Bai, Ziyi, Xu, Xun, Yao, Angela, Yang, Xulei, Li, Shijie
Abstract
When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing, yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only "did the agent succeed?", PRISM asks "which capability is most likely responsible for failure?" Built on five photorealistic multi-room apartments (4-8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers (Basic Ability, Reasoning Ability, and Long-horizon Ability) that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination, respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents (LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems) to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception, implicit intent resolution is a significant bottleneck for all model families, and long-horizon coordination exposes a stark capability cliff: lightweight models collapse to as low as 20.0% success while simultaneously consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: https://sj-li.com/PROJ/PRISM
Chinese Translation
当基于大型语言模型(LLM)的具身智能体在家庭任务中失败时,原因可能是物体识别错误、子目标遗忘或动作顺序不当——然而现有基准仅报告单一的成功率,使得无法判断哪个认知模块负责。我们提出了PRISM,一个重新构建这一问题的诊断基准:PRISM不仅仅问“智能体是否成功?”,而是问“哪个能力最可能导致失败?”PRISM基于五个逼真的多房间公寓(每个公寓有4到8个房间),将300个经过人工验证的任务结构化为三个能力层级——“基本能力”(Basic Ability)、“推理能力”(Reasoning Ability)和“长远能力”(Long-horizon Ability),分别隔离感知到动作的基础、隐含意图的解析和持续的多步骤协调。PRISM提供了一个与智能体无关的可执行动作API,允许任意智能体:LLM智能体、视觉语言模型(VLM)智能体、符号规划者、强化学习(RL)策略和混合系统,在同一基准协议下进行端到端评估。为了支持更深入的诊断,可以采用、替换或完全绕过感知、记忆和规划的可选探针,从而在需要时实现受控的组件级分析。对七个当代LLM的实验建立了明确的层级关系:在理想感知下,显式空间基础并不是主要的失败来源,隐含意图解析是所有模型家族的显著瓶颈,而长远协调则暴露出明显的能力悬崖——轻量级模型的成功率低至20.0%,同时消耗的token数量却超过其前沿对应物,显示出补偿性过度推理的特征,而非真正的规划能力。项目页面: [链接](https://sj-li.com/PROJ/PRISM)。
cs.RO / 13 / 2605.11564

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

RIO:用于跨身体机器人学习的灵活实时机器人输入/输出
Ortega-Kral, Pablo, Xing, Eliot, Bucker, Arthur, Luk, Vernon, Kim, Junseo, Kwon, Owen, Xie, Angchen, Sobanbabu, Nikhil, Yuan, Yifu, Lee, Megan, Ameria, Deepam, Ayapilla, Bhaswanth, Bussell, Jaycie, Shi, Guanya, Francis, Jonathan, Oh, Jean
Abstract
Despite recent efforts to collect multi-task, multi-embodiment datasets, to design recipes for training Vision-Language-Action models (VLAs), and to showcase these models on different robot platforms, generalist cross-embodiment robot capabilities remain a largely elusive ideal. Progress is limited by fragmented infrastructure: most robot code is highly specific to the exact setup the user decided on, which adds major overhead when attempting to reuse, recycle, or share artifacts between users. We present RIO (Robot I/O), an open source Python framework that provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across diverse hardware platforms and morphologies. RIO provides abstractions that enable users to make any choice and to switch between them, with minimal reconfiguration effort. We validate RIO on VLA deployment workflows across three morphologies (single-arm, bimanual, humanoid) and four hardware platforms with varying grippers and cameras. Using teleoperated data collected with RIO, we fine-tune state-of-the-art VLAs including $\pi_{0.5}$ and GR00T on household tasks such as pick-and-place, folding, and bowl scrubbing. By open sourcing all our efforts, we hope the community can accelerate their pace of robot learning on real-world robot hardware. Additional details at: https://robot-i-o.github.io
Chinese Translation
尽管近期在收集多任务、多身体数据集、设计用于训练视觉-语言-动作模型(VLA)的方案以及在不同机器人平台上展示这些模型方面做出了努力,但通用的跨身体机器人能力仍然是一个难以实现的理想。进展受到碎片化基础设施的限制:大多数机器人代码高度依赖于用户所选择的具体设置,这在尝试在用户之间重用、回收或共享工件时增加了重大开销。我们提出了RIO(机器人输入/输出),一个开源的Python框架,提供灵活、轻量的组件,用于机器人控制、远程操作、数据格式化、传感器配置和策略部署,适用于多种硬件平台和形态。RIO提供了抽象,使用户能够进行任何选择并在这些选择之间切换,所需的重新配置工作量最小。我们在三种形态(单臂、双手、类人)和四种具有不同抓手和摄像头的硬件平台上验证了RIO在VLA部署工作流程中的有效性。利用使用RIO收集的远程操作数据,我们对包括$\pi_{0.5}$和GR00T在内的最先进VLA进行了微调,以完成诸如拾取与放置、折叠和碗擦洗等家庭任务。通过开源我们所有的努力,我们希望社区能够加速在真实机器人硬件上的机器人学习进程。更多细节请访问:https://robot-i-o.github.io
cs.RO / 14 / 2605.11618

Sampling-Based Follow-the-Leader Motion Planning for Manipulator-Mounted Continuum Robots

基于采样的跟随领导者运动规划用于装配在机械手上的连续机器人
Shentu, Chengnan, Baldassini, Nicholas, Iseoluwa, Oluwagbotemi D., Gondokaryono, Radian, Burgner-Kahrs, Jessica
Abstract
Follow-the-leader (FTL) motion exploits the unique morphology of continuum robots (CRs) to navigate confined spaces by having the body retrace the path of the tip. While extensively studied, existing FTL methods typically assume a fixed base or a single degree-of-freedom insertion mechanism, limiting their applicability to practical systems in which CRs are mounted on robotic manipulators with fully actuated SE(3) base pose. This paper presents a sampling-based motion planner for FTL motion of manipulator-mounted CRs that jointly considers robot configuration and base pose. The key idea is to decouple global shape search from base pose determination by computing the base pose through a closed-form geometric construction, thereby avoiding iterative optimization during online planning. The approach supports general forward models and enables efficient planning by shifting the majority of computation offline. We establish theoretical guarantees including resolution complete shape search and converging tip tracking throughout waypoint traversal and interpolation. Experiments on 120 simulated paths over 3 test classes demonstrate 0% tip error and 1.9% mean shape deviation (w.r.t. robot length) at 100% success rate. We validate the practicality of our approach on a 6-DOF tendon-driven CR mounted on a serial manipulator. Code and visualization available at https://continuumroboticslab.github.io/sb-ftl-cr-planner/.
Chinese Translation
跟随领导者(FTL)运动利用连续机器人(CRs)独特的形态在狭小空间中导航,通过使机器人主体沿着机器人尖端的路径回溯。尽管这一方法已被广泛研究,但现有的FTL方法通常假设固定基座或单一自由度的插入机制,这限制了其在实际系统中的适用性,尤其是在连续机器人装配在具有完全驱动的SE(3)基座姿态的机械手上。本文提出了一种基于采样的运动规划方法,针对装配在机械手上的连续机器人的FTL运动,联合考虑了机器人配置和基座姿态。其关键思想是通过闭式几何构造计算基座姿态,从而将全局形状搜索与基座姿态确定解耦,避免在在线规划过程中进行迭代优化。该方法支持一般的前向模型,并通过将大部分计算移至离线来实现高效规划。我们建立了理论保证,包括分辨率完备的形状搜索和在途径点遍历及插值过程中的尖端跟踪收敛性。在120条模拟路径的3个测试类别上的实验表明,尖端误差为0%,平均形状偏差(相对于机器人长度)为1.9%,成功率为100%。我们在装配在串联机械手上的6自由度腱驱动连续机器人上验证了我们方法的实用性。代码和可视化可在 https://continuumroboticslab.github.io/sb-ftl-cr-planner/ 获取。
cs.RO / 15 / 2605.11665

Nautilus: From One Prompt to Plug-and-Play Robot Learning

Nautilus:从单一提示到即插即用的机器人学习
Jin, Yufeng, Guo, Jianfei, Jia, Xiaogang, Deng, Yu, Li, Zechu, Liu, Han, Liao, Weiran, Prasad, Vignesh, Franzius, Mathias, Neumann, Gerhard, Chalvatzaki, Georgia
Abstract
Robot learning research is fragmented across policy families, benchmark suites, and real robots; each implementation is entangled with the others in a complex combination matrix, making it an engineering nightmare to port any single element. General-purpose coding agents may occasionally bridge specific setups, but cannot close this gap at scale because they lack the procedural priors and validation practices that characterize robotics research workflows. We propose NAUTILUS, an open-source harness that turns a single user prompt (for example, "Evaluate policy A with benchmark B") into ready-to-use reproduction, evaluation, fine-tuning, and deployment workflows. NAUTILUS provides: plug-and-play agent skill sets with distilled priors from robotics research; typed contracts among policies, simulators/benchmarks, and real-world robots; unified interfaces and execution environments; and a trustworthy agentic coding workflow with explicit, automated validation and testing at each milestone. NAUTILUS can not only automatically generate the required adapters and containers for existing implementations, but also wrap and onboard new or user-provided policies, simulators/benchmarks, and robots, all connected via a uniform interface. This expands cross-validation coverage without hand-written glue code. Like a nautilus shell that grows by adding chambers, NAUTILUS scales by extending its execution in chambered units, making it a research harness for scalability rather than a hand-curated framework, and aiming to reduce the engineering burden of cross-family reproduction and evaluation in the ever-growing robot learning ecosystem.
Chinese Translation
机器人学习研究在策略族、基准套件和真实机器人之间呈现出碎片化的状态;每个实现都与其他实现交织在一起,形成复杂的组合矩阵,使得移植任何单一元素成为工程上的噩梦。通用编码代理偶尔可以桥接特定的设置,但由于缺乏机器人研究工作流程所特有的程序性先验和验证实践,无法大规模地弥合这一差距。我们提出了NAUTILUS,一个开源工具,它将单个用户提示(例如“使用基准B评估策略A”)转化为可即用的重现、评估、微调和部署工作流程。NAUTILUS提供:具有来自机器人研究的提炼先验的即插即用代理技能集;策略、模拟器/基准和真实机器人之间的类型化合约;统一的接口和执行环境;以及一个可信的代理编码工作流程,在每个里程碑上具有明确的、自动化的验证和测试。NAUTILUS不仅可以自动生成现有实现所需的适配器和容器,还可以包装和引入新的或用户提供的策略、模拟器/基准和机器人,所有这些都通过统一接口连接。这扩展了交叉验证的覆盖范围,而无需手动编写粘合代码。就像通过添加腔室而生长的鹦鹉螺壳,NAUTILUS通过在腔室单位中扩展其执行来实现规模化,使其成为一个关注可扩展性的研究工具,而不是一个手动策划的框架,旨在减少在不断增长的机器人学习生态系统中跨策略族重现和评估的工程负担。
cs.RO / 16 / 2605.11674

A Proprioceptive-Only Benchmark for Quadruped State Estimation: ATE, RPE, and Runtime Trade-offs Between Filters and Smoothers

仅基于本体感觉的四足机器人状态估计基准:滤波器与平滑器之间的 ATE、RPE 和运行时间权衡
Nisticò, Ylenia, Soares, João Carlos Virgolino, Solà, Joan, Semini, Claudio
Abstract
We compare three state-of-the-art proprioceptive state estimators for quadruped robots: MUSE [1], the Invariant Extended Kalman Filter (IEKF) [2], and the Invariant Smoother (IS) [3], on the CYN-1 sequence of the GrandTour Dataset [4]. Our goal is to give practitioners clear guidance on accuracy and computation time: we report long-term accuracy (Absolute Trajectory Error, ATE), short-term accuracy (translational and rotational Relative Pose Error, RPE), and per-update computation time on a fixed hardware/software stack. On this dataset, RPEs are broadly similar across methods, while IEKF and IS achieve a lower ATE than MUSE. Runtime results highlight the accuracy-latency trade-offs across the three approaches. In the discussion, we outline the evaluation choices used to ensure a fair comparison and analyze factors that influence short-horizon metrics. Overall, this study provides a concise snapshot of accuracy and cost, helping readers choose an estimator that fits their application constraints, with all evaluation code and documentation released open-source at https://github.com/iit-DLSLab/state_estimation_benchmark for full reproducibility.
Chinese Translation
我们比较了三种最先进的四足机器人本体感觉状态估计器:MUSE [1]、不变扩展卡尔曼滤波器(Invariant Extended Kalman Filter, IEKF)[2] 和不变平滑器(Invariant Smoother, IS)[3],基于 GrandTour 数据集 [4] 的 CYN-1 序列。我们的目标是为从业者提供关于准确性和计算时间的明确指导:我们报告了长期准确性(绝对轨迹误差,Absolute Trajectory Error, ATE)、短期准确性(平移和旋转相对位姿误差,translational and rotational Relative Pose Error, RPE)以及在固定硬件/软件环境下的每次更新计算时间。在该数据集上,各种方法的 RPE 大致相似,而 IEKF 和 IS 的 ATE 低于 MUSE。运行时间结果突显了三种方法之间的准确性与延迟的权衡。在讨论中,我们概述了用于确保公平比较的评估选择,并分析了影响短期指标的因素。总体而言,本研究提供了准确性和成本的简明快照,帮助读者选择适合其应用约束的估计器,所有评估代码和文档已在 https://github.com/iit-DLSLab/state_estimation_benchmark 上开源,以实现完全可重复性。
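The two accuracy metrics named in this abstract can be illustrated with a minimal sketch. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame, and it uses positions only, so the rotational part of RPE is omitted. The function names and toy data are hypothetical and are not taken from the released benchmark code.

```python
import math

def ate_rmse(est, gt):
    """Root-mean-square position error between two aligned trajectories."""
    assert len(est) == len(gt) and est
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg)) for pe, pg in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))

def rpe_trans_rmse(est, gt, delta=1):
    """RMS error of relative displacements over a fixed step `delta`.

    Simplification (assumption): positions only, so the relative motion
    is a plain difference and rotation is ignored.
    """
    errs = []
    for i in range(len(est) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        errs.append(sum((x - y) ** 2 for x, y in zip(d_est, d_gt)))
    return math.sqrt(sum(errs) / len(errs))

# Toy data: the estimate carries a constant 0.1 m offset along x.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (2.1, 0.0, 0.0)]

print(ate_rmse(est, gt))        # global error: about 0.1
print(rpe_trans_rmse(est, gt))  # local error: about 0.0
```

In this toy case a constant offset shows up in ATE but not in RPE, which is one way two estimators can have similar RPE while differing in ATE.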
cs.RO / 17 / 2605.11697

Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

基于运动学感知设计的彩虹深度Q学习用于协作Delta和3-RRS并联机器人插入
Nigatu, Hassen, Shi, Gaokun, Li, Jituo, Jin, Wang, Guodong, Lu
Abstract
This paper presents a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (DQN) for cooperative peg-in-hole manipulation by a Delta parallel robot and a 3-RRS (Revolute-Revolute-Spherical) parallel manipulator. A key contribution is the integration of a geometric design-optimization stage that precedes learning: the 3-RRS geometry is tuned to maximize the singularity-free workspace and improve conditioning, which in turn enlarges the safe region in which the reinforcement learning policy can explore. Together the two manipulators expose a 6-degree-of-freedom (DoF) controllable subspace (three Delta translations, two 3-RRS rotations, and one 3-RRS vertical translation); the peg-in-hole task is invariant to rotation about the peg axis, so the task-relevant manifold is five-dimensional. The cooperative insertion problem is cast as a Markov Decision Process with a 12-dimensional state vector and a discrete action set containing $6 \times 2 = 12$ incremental commands (one positive and one negative per controlled DoF). A shaped reward combines dense proximity guidance, penalties for kinematic and workspace violations, and sparse bonuses for successful insertions. The Rainbow DQN, which integrates double Q-learning, a dueling architecture, prioritized replay, multi-step returns, noisy linear layers for exploration, and a distributional value head, is trained with a two-stage curriculum. The co-designed framework is validated in a high-fidelity kinematic simulator, where it achieves stable policy convergence, reliable insertions, and fewer constraint violations than a vanilla DQN agent and a classical sampling-based planner.
Chinese Translation
本文提出了一种基于彩虹深度Q网络(Rainbow Deep Q-Networks, DQN)的运动学感知深度强化学习框架,用于Delta并联机器人和3-RRS(转动-转动-球面)并联机械臂的协作 peg-in-hole 操作。一个关键贡献是集成了一个在学习之前进行的几何设计优化阶段:3-RRS的几何形状经过调整,以最大化无奇异工作空间并改善条件数,这反过来扩大了强化学习策略可以探索的安全区域。两台机械臂共同提供一个6自由度(DoF)可控子空间(三个Delta平移、两个3-RRS旋转和一个3-RRS垂直平移);peg-in-hole任务对围绕插销轴的旋转是不变的,因此与任务相关的流形是五维的。协作插入问题被建模为一个具有12维状态向量的马尔可夫决策过程,离散动作集包含$6 \times 2 = 12$个增量命令(每个受控自由度一个正向和一个负向命令)。塑形奖励结合了密集的接近引导、对运动学和工作空间违规的惩罚,以及对成功插入的稀疏奖励。彩虹DQN集成了双重Q学习、竞争(dueling)架构、优先经验回放、多步回报、用于探索的噪声线性层和分布型价值头,采用两阶段课程进行训练。共同设计的框架在高保真运动学模拟器中得到了验证,与普通DQN代理和经典的基于采样的规划器相比,实现了稳定的策略收敛、可靠的插入和更少的约束违规。
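The discrete action set described in this abstract (one positive and one negative increment for each of six controlled DoF) can be sketched as follows. The DoF names, their ordering, and the step sizes are illustrative assumptions; the abstract specifies only the count and the sign structure.

```python
# Six controlled DoF as listed in the abstract; names and order are assumed.
DOF_NAMES = ["delta_x", "delta_y", "delta_z", "rrs_roll", "rrs_pitch", "rrs_z"]

# Hypothetical step sizes: metres for translations, radians for rotations.
STEP = {
    "delta_x": 1e-3, "delta_y": 1e-3, "delta_z": 1e-3,
    "rrs_roll": 1e-2, "rrs_pitch": 1e-2, "rrs_z": 1e-3,
}

# 6 DoF x 2 signs = 12 incremental commands.
ACTIONS = [(dof, sign) for dof in range(len(DOF_NAMES)) for sign in (+1, -1)]

def apply_action(q, action_index):
    """Return a new 6-vector with one DoF moved by one signed increment."""
    dof, sign = ACTIONS[action_index]
    new_q = list(q)
    new_q[dof] += sign * STEP[DOF_NAMES[dof]]
    return new_q

q0 = [0.0] * 6
print(len(ACTIONS))         # 12
print(apply_action(q0, 0))  # first DoF moved by +1e-3
```

A Q-network for this formulation would then output one value (or one value distribution) for each of the 12 indices.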
cs.RO / 18 / 2605.11714

Introducing Environmental Constraints to Grasping Strategies for Paper-Like Flexible Materials Using a Soft Gripper

使用软体夹爪并引入环境约束的类纸柔性材料抓取策略
Dong, Yi, Li, Yang, Duan, Jinjun, Dai, Zhendong
Abstract
Robotic manipulation of flexible objects is widely required in both industrial and service applications. Among such objects, paper-like materials exhibit distinct mechanical characteristics compared to cloth, being more sensitive to compressive stress, where minor variations in physical properties can significantly affect grasping. This study systematically investigates grasping strategies for paper-like materials using a universal soft gripper by exploiting environmental constraints. Based on manipulation primitives employed in existing grasping strategies, we proposed systematic grasping strategies for flexible materials by exploiting environmental constraints and analyzed their mechanical and kinematic models. To investigate the influence of materials and working conditions on grasping, an evaluation system for measuring grasping force and success rate was defined and experimentally evaluated. Finally, we summarized the specific workspaces and characteristics of different strategies that can satisfy various task requirements and lead to potential applications in household service robots for grasping planar flexible objects.
Chinese Translation
柔性物体的机器人操作在工业和服务应用中广泛需求。在这些物体中,纸质材料与布料相比展现出不同的机械特性,对压缩应力更为敏感,物理属性的微小变化可能显著影响抓取效果。本研究系统地探讨了利用环境约束的通用软抓手对纸质材料的抓取策略。基于现有抓取策略中采用的操作原语,我们提出了利用环境约束的柔性材料系统抓取策略,并分析了其机械和运动学模型。为了研究材料和工作条件对抓取的影响,定义并实验评估了一个用于测量抓取力和成功率的评估系统。最后,我们总结了不同策略的特定工作空间和特性,这些策略能够满足各种任务需求,并可能在家庭服务机器人抓取平面柔性物体中应用。
cs.RO / 19 / 2605.11750

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

DreamAvoid:关键阶段测试时间梦境以避免VLA策略中的失败
Fan, Xianzhe, Lu, Yuxiang, Gao, Shenyuan, Wu, Xiaoyang, Han, Ruihua, Li, Manling, Zhao, Hengshuang
Abstract
Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
Chinese Translation
视觉-语言-行动(VLA)模型在细粒度操作中往往表现脆弱,在关键阶段的微小动作错误可能迅速升级为不可恢复的失败。由于现有的VLA模型主要依赖成功示范进行训练,因此在这些关键阶段缺乏对失败的明确意识。为了解决这个问题,我们提出了DreamAvoid,一个关键阶段测试时间梦境框架,使VLA模型能够预见并避免失败。我们还引入了一种自主边界学习范式,以细化系统对成功与失败之间微妙边界的理解。具体而言,我们(1)利用梦境触发器(Dream Trigger)来判断执行是否进入关键阶段,(2)通过动作提议器(Action Proposer)从VLA中采样多个候选动作片段,以及(3)采用梦境评估器(Dream Evaluator),该评估器在混合数据(成功、失败和边界案例)上联合训练,以“梦境”候选动作对应的短期未来,评估其价值,并选择最优动作。我们在真实世界的操作任务和仿真基准上进行了广泛评估。结果表明,DreamAvoid能够有效避免失败,从而提高整体任务成功率。我们的代码可在 https://github.com/XianzheFan/DreamAvoid 获取。
cs.RO / 20 / 2605.11762

NavOL: Navigation Policy with Online Imitation Learning

NavOL:基于在线模仿学习的导航策略
Wei, Xiaofei, Gu, Chun, Zhang, Li
Abstract
Learning robust navigation policies remains a core challenge in robotics. Offline imitation learning suffers from distribution shift and compounding errors at rollout, while reinforcement learning requires reward engineering and learns inefficiently. In this paper, we propose NavOL, an online imitation learning paradigm that interacts with a simulator and updates itself using expert demonstrations gathered online. Built upon a pretrained navigation diffusion policy that maps local observations to future waypoints, NavOL trains in a rollout-update loop: during rollout, the policy acts in the simulator and queries a global planner, which has privileged access to the global environment, for the optimal path segment to use as the ground-truth trajectory label; during update, the policy is trained on the observation-trajectory pairs collected online. This online imitation loop removes the need for reward design, improves learning efficiency, and mitigates distribution shift by training on the policy's own explored rollouts. Built on IsaacLab with fast, high-fidelity parallel rendering and domain randomization of camera pose and start-goal pairs, our system scales across 50 scenes on 8 RTX 4090 GPUs, collecting over 2,000 new trajectories per hour, each averaging more than 400 steps. We also introduce an indoor visual navigation benchmark with predefined start and goal positions for zero-shot generalization. Extensive evaluations on simulation benchmarks, including the NavDP benchmark and our proposed benchmark, as well as carefully designed real-world experiments, demonstrate the effectiveness of NavOL, showing consistent performance gains from online imitation learning.
Chinese Translation
学习鲁棒的导航策略仍然是机器人技术中的核心挑战。离线模仿学习面临分布偏移和执行时的累积误差,而强化学习则需要奖励设计且学习效率低下。本文提出了NavOL,一种在线模仿学习范式,它与模拟器进行交互,并利用在线收集的专家示范进行自我更新。NavOL建立在一个预训练的导航扩散策略之上,该策略将局部观察映射到未来的航点。NavOL在一个执行-更新循环中进行训练:在执行阶段,策略在模拟器中行动,并查询一个对全局环境具有特权访问的全局规划器,以获取最优路径段作为真实轨迹标签;在更新阶段,策略在在线收集的观察-轨迹对上进行训练。这个在线模仿循环消除了对奖励设计的需求,提高了学习效率,并通过在策略自身探索的执行轨迹上进行训练来减轻分布偏移。我们的系统基于IsaacLab,具有快速、高保真的并行渲染和相机姿态及起点-目标对的领域随机化,能够在8个RTX 4090 GPU上扩展到50个场景,每小时收集超过2000条新轨迹,每条轨迹平均超过400步。我们还引入了一个室内视觉导航基准,具有预定义的起始和目标位置,用于零样本泛化。在模拟基准(包括NavDP基准和我们提出的基准)以及精心设计的现实世界实验上进行的广泛评估,证明了NavOL的有效性,显示出在线模仿学习带来的一致性能提升。
cs.RO / 21 / 2605.11817

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

关注重要性:用于可泛化视觉-语言-动作模型的可微网格采样剪枝
Feng, Yixu, Zhao, Zinan, Ma, Yanxiang, Xia, Chenghao, Du, Chengbin, Wang, Yunke, Xu, Chang
Abstract
Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression via pruning inevitably discards critical geometric details such as contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA models. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
Chinese Translation
视觉-语言-动作(VLA)模型在机器人操作中展现出了显著的潜力,但其高计算成本阻碍了实时部署。现有的标记剪枝方法存在一个基本的权衡:激进的剪枝压缩不可避免地会丢弃关键的几何细节,例如接触点,从而导致性能严重下降。这迫使我们做出妥协,限制了可实现的压缩率,从而影响潜在的加速效果。我们认为,打破这一权衡需要重新思考压缩,将其视为一种几何感知的、连续的标记重采样过程,应用于视觉编码器。为此,我们提出了可微网格采样器(Differentiable Grid Sampler,GridS),这是一个即插即用的模块,能够在VLA中进行任务感知的连续视觉标记重采样。通过自适应预测一组最小的显著坐标并通过可微插值提取特征,GridS在实现显著压缩(原始视觉标记少于10%)的同时,保留了重要的空间信息。在LIBERO基准测试和真实机器人平台上的实验表明,GridS验证了迄今为止报告的最低可行视觉标记数量,实现了76%的FLOPs减少且成功率没有下降。代码可在 https://github.com/Fediory/Grid-Sampler 获取。
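The phrase "extracting features via differentiable interpolation" can be illustrated with plain bilinear sampling at continuous coordinates. This is a generic sketch, not the GridS implementation: it uses a scalar feature per grid cell (real visual tokens are vectors), and the coordinates are fixed by hand, whereas the abstract describes predicting them adaptively.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample grid `feat[row][col]` at continuous (x, y); x runs along columns."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so border samples stay defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

feat = [[0.0, 1.0],
        [2.0, 3.0]]
coords = [(0.5, 0.5), (1.0, 0.0)]   # hand-picked stand-ins for predicted points
tokens = [bilinear_sample(feat, x, y) for x, y in coords]
print(tokens)  # [1.5, 1.0]
```

Because the output is piecewise linear in `(x, y)`, gradients with respect to the sampling coordinates exist almost everywhere, which is what allows a coordinate predictor to be trained end to end in an autodiff framework.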
cs.RO / 22 / 2605.11825

Mapping Embodied Affective Touch Strategies on a Humanoid Robot

人形机器人上体现情感触觉策略的映射
Ren, Qiaoqiao, Eldardeer, Omar, Cocchella, Francesca, Francesco, Rea, Sciutti, Alessandra, Belpaeme, Tony
Abstract
Affective touch in human-robot interaction is shaped not only by emotional intent, but also by robot embodiment, including touch location, physical constraints, and perceived agency or social role. Existing HRI studies typically focus on one or two isolated body parts, limiting understanding of how affective touch generalises across the full humanoid body. We present a study with 32 participants interacting with the iCub robot, which is equipped with full-body distributed tactile sensors. Participants expressed eight emotions under three conditions: free touch, arm-only touch, and torso-only touch. Results show that body region and spatial constraints jointly shaped both touch location and dynamics. In free touch, participants preferred socially accessible upper-body regions, while less frequently touched areas showed stronger emotion-specific selectivity. Emotion-related variation was more evident in motion features for arm-only touch and pressure features for torso-only touch. Touch strategies also did not transfer directly between free and constrained conditions, even within the same coarse body region. Participants reported increased closeness to the robot after interaction, with around 30 percent reporting a change in perceived social relationship. Together, these findings show that affective touch expression is strongly body-region dependent and shaped by embodiment constraints.
Chinese Translation
人机交互中的情感触觉不仅受情感意图的影响,还受到机器人具身性的影响,包括触摸位置、物理限制以及感知的能动性或社会角色。现有的人机交互研究通常集中于一两个孤立的身体部位,这限制了对情感触觉如何在整个类人身体上普遍化的理解。我们进行了一项研究,参与者为32名与配备全身分布式触觉传感器的iCub机器人进行互动。参与者在三种条件下表达了八种情感:自由触摸、仅手臂触摸和仅躯干触摸。结果显示,身体区域和空间限制共同影响了触摸位置和动态。在自由触摸中,参与者更倾向于选择社会可接触的上半身区域,而触摸频率较低的区域则显示出更强的情感特异性选择性。与情感相关的变化在仅手臂触摸的运动特征和仅躯干触摸的压力特征中更为明显。触摸策略在自由触摸和受限条件之间并未直接转移,即使在相同的粗略身体区域内。参与者在互动后报告与机器人之间的亲密感增加,约30%的参与者表示感知的社会关系发生了变化。综合来看,这些发现表明情感触觉的表达在很大程度上依赖于身体区域,并受到具身性限制的影响。
cs.RO / 23 / 2605.11832

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

基于多视角潜在先验的动作流形学习用于机器人操作
Xiao, Junjin, Li, Dongyang, Yang, Yandan, Zeng, Shuang, Lin, Tong, Chang, Xinyuan, Xiong, Feng, Xu, Mu, Wei, Xing, Ma, Zhiheng, Zhang, Qing, Zheng, Wei-Shi
Abstract
This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action learning efficiency, we introduce Action Manifold Learning (AML), which directly predicts actions on the valid action manifold, bypassing inefficient regression of unstructured targets like noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show our method achieves superior success rate and robustness over SOTA baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/.
Chinese Translation
本文解决了视觉-语言-动作(VLA)模型中的空间感知和操作挑战。为了应对单目输入带来的深度模糊问题,我们利用预训练的多视角扩散模型合成潜在的新视角,并提出了一种几何引导门控变换器(Geometry-Guided Gated Transformer, G3T),该变换器在三维几何引导下对多视角特征进行对齐,同时自适应地过滤遮挡噪声。为了提高动作学习的效率,我们引入了动作流形学习(Action Manifold Learning, AML),该方法直接在有效的动作流形上预测动作,避免了对噪声或速度等非结构化目标的低效回归。在LIBERO、RoboTwin 2.0和真实机器人任务上的实验表明,我们的方法在成功率和鲁棒性方面优于现有的最先进基线。项目页面:https://junjxiao.github.io/Multi-view-VLA.github.io/
cs.RO / 24 / 2605.11859

EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

EvoNav:基于大型语言模型的机器人导航进化奖励函数设计
Zhao, Zhikai, Hua, Chuanbo, Berto, Federico, Ma, Zihan, Lee, Kanghoon, Li, Jiachen, Park, Jinkyoo
Abstract
Robot navigation is a crucial task with applications to social robots in dynamic human environments. While Reinforcement Learning (RL) has shown great promise for this problem, the policy quality is highly sensitive to the specification of reward functions. Hand-crafted rewards require substantial domain expertise and embed inductive biases that are difficult to audit or adapt, limiting their effectiveness and leading to suboptimal performance. In this paper, we propose EvoNav, an evolutionary framework that automates the design of robot navigation reward functions via large language models (LLMs). To overcome prohibitively costly policy training, EvoNav evaluates each candidate proposal from the LLM via a progressive three-stage warm-up-boost procedure. EvoNav advances from analytical proxies with low-cost surrogates, such as small datasets and analytic rules, to lightweight rollouts and, finally, to full policy training, enabling computationally efficient exploration under effective feedback. Experiment results show that EvoNav produces more effective navigation policies than manually designed RL rewards and state-of-the-art reward design methods.
Chinese Translation
机器人导航是一个关键任务,广泛应用于动态人类环境中的社交机器人。尽管强化学习(Reinforcement Learning, RL)在这一问题上展现出巨大的潜力,但策略质量对奖励函数的规范高度敏感。手工设计的奖励函数需要大量的领域专业知识,并且嵌入了难以审计或适应的归纳偏见,从而限制了其有效性并导致次优性能。本文提出了EvoNav,一个通过大型语言模型(Large Language Models, LLMs)自动化设计机器人导航奖励函数的进化框架。为了克服高昂的策略训练成本,EvoNav通过逐步的三阶段热身提升程序评估LLM的每个候选提案。EvoNav从低成本的代理分析工具(如小型数据集和分析规则)进展到轻量级的模拟,最终实现全面的策略训练,从而在有效反馈下实现计算效率的探索。实验结果表明,EvoNav生成的导航策略比手动设计的RL奖励和最先进的奖励设计方法更为有效。
cs.RO / 25 / 2605.11951

From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

从反应到预见:通过代理任务图实现机器人操作的主动故障恢复
Xu, Sheng, Jin, Ruixing, Zhou, Huayi, Yue, Bo, Qiao, Guanren, Tai, Yunxin, Deng, Yueci, Jia, Kui, Liu, Guiliang
Abstract
Although robotic manipulation has made significant progress, reliable execution remains challenging because task failures are inevitable in dynamic and unstructured environments. To handle such failures, existing frameworks typically follow a stepwise detect-reason-recover pipeline, which often incurs high latency and limited robustness due to delayed reasoning and reactive planning. Inspired by the human capability to anticipate and proactively plan for potential failures, we introduce AgentChord, an agentic system that models a manipulation task as a directed task graph. Before execution, this graph is enriched with anticipatory recovery branches that specify context-aware corrective behaviors, enabling immediate and targeted responses when failures occur. Specifically, AgentChord operates through a choreography of specialized agents: a composer that structures the nominal task graph, an arranger that augments the graph with anticipatory recovery branches, and a conductor that compiles and coordinates executable transitions using low-latency monitors to detect deviations and trigger pre-compiled recoveries without re-planning. Empirical studies on diverse long-horizon bimanual manipulation tasks demonstrate that AgentChord substantially improves success rates and execution efficiency, advancing the reliability and autonomy of real-world robotic systems. The project page is available at: https://shengxu.net/AgentChord/.
Chinese Translation
尽管机器人操作已经取得了显著进展,但在动态和非结构化环境中,可靠执行仍然面临挑战,因为任务失败是不可避免的。为了处理这些失败,现有框架通常遵循逐步检测-推理-恢复的流程,这往往会因推理延迟和反应性规划而导致高延迟和有限的鲁棒性。受到人类预见和主动规划潜在失败能力的启发,我们提出了AgentChord,一个将操作任务建模为有向任务图的代理系统。在执行之前,该图通过预期恢复分支进行增强,这些分支指定了上下文感知的纠正行为,使得在发生失败时能够立即和有针对性地响应。具体而言,AgentChord通过一组专业代理的编排进行操作:一个作曲者构建标准任务图,一个编排者用预期恢复分支增强图形,以及一个指挥者利用低延迟监控器编译和协调可执行的转换,以检测偏差并触发预编译的恢复,而无需重新规划。对多样化的长时间双手操作任务的实证研究表明,AgentChord显著提高了成功率和执行效率,推动了现实世界机器人系统的可靠性和自主性。项目页面可访问:https://shengxu.net/AgentChord/
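The idea of a task graph enriched with pre-compiled recovery branches can be sketched with a small dictionary-based graph. Node names, failure labels, and the monitor below are all hypothetical; the abstract does not specify AgentChord's actual data structures.

```python
# Hypothetical task graph: each node has a nominal successor and a map
# from failure label to a pre-planned recovery node.
TASK_GRAPH = {
    "reach":   {"next": "grasp", "recover": {}},
    "grasp":   {"next": "lift",  "recover": {"slip": "regrasp"}},
    "regrasp": {"next": "lift",  "recover": {}},
    "lift":    {"next": None,    "recover": {"drop": "reach"}},
}

def run(graph, start, monitor, max_steps=20):
    """Walk the graph; `monitor(node)` returns a failure label or None.

    On a known failure the walk jumps straight to the stored recovery
    node, with no re-planning. Unknown failures fall through to the
    nominal successor (a simplification made for this sketch).
    """
    trace, node = [], start
    while node is not None and len(trace) < max_steps:
        trace.append(node)
        failure = monitor(node)
        branch = graph[node]["recover"].get(failure) if failure else None
        node = branch if branch is not None else graph[node]["next"]
    return trace

# A monitor that reports a single slip during the first grasp attempt.
events = iter(["slip"])

def monitor(node):
    return next(events, None) if node == "grasp" else None

trace = run(TASK_GRAPH, "reach", monitor)
print(trace)  # ['reach', 'grasp', 'regrasp', 'lift']
```

The point of the design is that the recovery lookup is a constant-time dictionary access at failure time, so the costly reasoning happens before execution rather than after a failure is detected.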
cs.RO / 26 / 2605.11972

Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation

通过集体感知增强的合作机器人用于交通调节
Khoshkdahan, Mohammad, Arockiasamy, John Pravin, Comeca, Andy Flores, Vinel, Alexey
Abstract
Collisions at non-line-of-sight (NLOS) intersections remain a major safety concern because drivers have limited visibility of approaching traffic. V2X-based warnings can reduce these risks, yet many vehicles are not equipped with V2X and drivers may ignore in-vehicle alerts. Collective perception (CP) can compensate for low V2X penetration by extending the awareness of connected vehicles, but it cannot influence unconnected vehicles. To fill this gap, our work introduces a complementary concept that adds a cooperative humanoid robot as an active traffic moderator capable of physically stopping a vehicle that attempts to merge into an unseen traffic stream. The system operates on two parallel perception pathways. A dual-camera infrastructure unit detects the position, speed, and motion of approaching vehicles and transmits this information to the robot as a collective perception message (CPM). The robot also receives cooperative awareness messages (CAM) from connected vehicles through its onboard V2X unit and can act as a relay for decentralized environmental notification messages (DENM) when safety events originate elsewhere along the road. A fusion module combines these streams to maintain a robust real-time view of the main road. A Zone of Danger (ZoD) is defined and used to predict whether an approaching vehicle creates a collision risk for a merging road user. When such a risk is detected, the robot issues a human-like STOP gesture and blocks the merging path until the hazard disappears. The full system was deployed at the Future Mobility Park (FMP) in Rotterdam. Experiments show that the combined vision and V2X perception allows the robot to detect approaching vehicles early, predict hazards reliably, and prevent unsafe merges in real-world NLOS conditions.
Chinese Translation
在非视距(NLOS)交叉口的碰撞仍然是一个主要的安全隐患,因为驾驶员对接近交通的可见性有限。基于车与一切(V2X)的警告可以降低这些风险,但许多车辆并未配备V2X,且驾驶员可能会忽视车内警报。集体感知(CP)可以通过扩展连接车辆的意识来弥补V2X渗透率低的不足,但它无法影响未连接的车辆。为填补这一空白,我们的工作引入了一个补充概念,增加了一个合作类人机器人,作为一个主动的交通调节者,能够物理阻止试图并入看不见的交通流的车辆。该系统在两个平行的感知通道上运行。双摄像头基础设施单元检测接近车辆的位置、速度和运动,并将这些信息作为集体感知消息(CPM)传输给机器人。机器人还通过其车载V2X单元接收来自连接车辆的合作意识消息(CAM),并能够在安全事件源自道路其他地方时充当去中心化环境通知消息(DENM)的中继。融合模块结合这些信息流,以保持对主干道的强大实时视图。定义了危险区域(ZoD),并用于预测接近车辆是否对并入的道路使用者构成碰撞风险。当检测到此类风险时,机器人发出类似人类的停止手势,并阻挡并入路径,直到危险消失。完整系统已在鹿特丹的未来出行公园(FMP)部署。实验表明,结合视觉和V2X感知使机器人能够及早检测接近车辆,可靠预测危险,并在现实世界的NLOS条件下防止不安全的并入。
cs.RO / 27 / 2605.12053

Closing the Motion Execution Gap: From Semantic Motion Task Constraints to Kinematic Control

缩小运动执行差距:从语义运动任务约束到运动学控制
Stelter, Simon, Hassouna, Vanessa, Huerkamp, Malte, Beetz, Michael
Abstract
This paper addresses the Motion Execution Gap, the disconnect between high-level symbolic task descriptions using semantic constraints and executable robot motions. Motion Statecharts are introduced as an executable symbolic representation for complex motions. They allow the arbitrary arrangement of motion constraints, monitors, or nested statecharts in parallel and in sequence. World-centric motion specification and generalization across embodiments are enabled through the use of a unified differentiable kinematic world model of both robots and environments. Motion execution is realized through an lMPC-based implementation of the task-function approach, in which smooth transitions during task switches are ensured using jerk bounds. Cross-platform transferability was demonstrated by deploying the method on eight robot platforms operating in diverse environments. The proposed framework is called Giskard and is available open source: https://github.com/cram2/cognitive_robot_abstract_machine.
Chinese Translation
本文探讨了运动执行差距,即使用语义约束的高层次符号任务描述与可执行机器人运动之间的脱节。我们引入了运动状态图(Motion Statecharts)作为复杂运动的可执行符号表示。运动状态图允许运动约束、监控器或嵌套状态图的任意并行和顺序排列。通过使用统一的可微分运动学世界模型,支持机器人和环境的世界中心运动规范和跨实例的泛化。运动执行通过基于线性模型预测控制(lMPC)的任务函数方法实现,其中在任务切换期间通过限制加加速度确保平滑过渡。通过在八个平台上部署该方法,展示了跨平台的可转移性,这些平台在不同环境中运行。所提出的框架称为Giskard,并已开源发布: https://github.com/cram2/cognitive_robot_abstract_machine。
cs.RO / 28 / 2605.12071

Control of Fully Actuated Aerial Vehicles: A Comparison of Model-based and Sensor-based Dynamic Inversion

全驱动空中飞行器的控制:基于模型与基于传感器的动态反演比较
Yilmaz, Ali Sidar, Turan, Buday, Pries, Lukas, Ryll, Markus
Abstract
Fully actuated multirotor platforms decouple translational force generation from vehicle attitude, enabling independent control of position and orientation and shifting performance limitations from attitude authority to actuator dynamics and control effectiveness. This paper compares a model-based nonlinear dynamic inversion controller (geometric NDI) with a sensor-based incremental dynamic inversion controller (INDI) on a fixed-tilt fully actuated hexarotor. Both controllers share an identical outer-loop structure and are both executed at 500 Hz; therefore, performance differences can be attributed primarily to the inversion strategy. Controller performance is evaluated in five experiments covering attitude step tracking under nominal conditions and under a 50% mismatch in the rotor force coefficient, hover disturbance rejection under an external lateral load, waypoint tracking in the presence of wind gust disturbances, reduced control frequency, and injected sensor degradation. The results show that INDI offers clear advantages under parameter mismatch, gust disturbances, and sensor degradation, and maintains lower position errors across the controller-frequency sweep. However, its advantages are not universal: geometric NDI yields better attitude tracking at reduced control frequencies. To the authors' best knowledge, this work presents the first experimental validation of a full pose tracking INDI controller with decoupled translational and rotational dynamics. These findings highlight the trade-off between measurement-based and model-based inversion for robust control and rapid deployment of fully actuated UAVs.
Chinese Translation
全驱动多旋翼平台将平移力的生成与飞行器姿态解耦,实现位置和方向的独立控制,并将性能限制从姿态控制权转移到执行器动态和控制有效性上。本文比较了一种基于模型的非线性动态反演控制器(几何 NDI)与一种基于传感器的增量动态反演控制器(INDI)在固定倾斜全驱动六旋翼上的表现。这两种控制器共享相同的外环结构,并且均以 500 Hz 的频率执行,因此性能差异主要可归因于反演策略。控制器性能通过五个实验进行评估,涵盖了在名义条件下的姿态阶跃跟踪以及在转子力系数 50% 不匹配下的表现、在外部横向负载下的悬停干扰抑制、在风阵扰动下的航点跟踪、降低控制频率以及传感器降级的影响。结果表明,INDI 在参数不匹配、阵风干扰和传感器降级下具有明显优势,并在控制器频率变化中保持较低的位置误差。然而,其优势并非普遍存在:在降低控制频率的情况下,几何 NDI 在姿态跟踪方面表现更佳。据作者所知,这项工作首次实验验证了具有解耦平移和旋转动态的全姿态跟踪 INDI 控制器。这些发现突显了基于测量与基于模型的反演在全驱动无人机的鲁棒控制和快速部署之间的权衡。
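The difference between the two inversion strategies compared in this abstract can be written down in textbook form. The sketch below assumes a generic control-affine model; the symbols are illustrative and are not taken from the paper, whose controllers act on the full pose dynamics of the hexarotor.

```latex
% Assumed model: \dot{x} = f(x) + G(x)\,u, with G(x) invertible.
% \nu is the desired state derivative commanded by the outer loop.
\begin{align}
  \text{NDI:}\quad  & u = G(x)^{-1}\bigl(\nu - f(x)\bigr), \\
  \text{INDI:}\quad & u = u_0 + G(x_0)^{-1}\bigl(\nu - \dot{x}_0\bigr),
\end{align}
% u_0: input applied at the previous sample.
% \dot{x}_0: measured or estimated state derivative at the previous sample.
```

The incremental form follows from a first-order Taylor expansion around the previous sample, neglecting the state-dependent change over one sampling interval. The model term $f(x)$ no longer appears, which is consistent with the reported robustness to parameter mismatch, while the reliance on $\dot{x}_0$ and on a short sampling interval is consistent with the sensitivity to control frequency noted in the abstract.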
cs.RO / 29 / 2605.12084

Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

学习重要性:用于机器人探索的自适应信息论目标
Yu, Youwei, Wang, Jionghao, Yu, Zhengming, Wang, Wenping, Liu, Lantao
Abstract
Designing learnable information-theoretic objectives for robot exploration remains challenging. Such objectives aim to guide exploration toward data that reduces uncertainty in model parameters, yet it is often unclear what information the collected data can actually reveal. Although reinforcement learning (RL) can optimize a given objective, constructing objectives that reflect parametric learnability is difficult in high-dimensional robotic systems. Many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, omitted directions can still influence exploration and distort information measures. To address this challenge, we propose Quasi-Optimal Experimental Design (QOED), an adaptive information objective grounded in optimal experimental design. QOED (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize these directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, QOED provides a constant-factor approximation to the ideal information objective that explores all parameters. We evaluate QOED on simulated and real-world navigation and manipulation tasks, where identifiable-direction selection and nuisance suppression yield performance improvements of 35.23% and 21.98%, respectively. When integrated as an exploration objective in model-based policy optimization, QOED further improves policy performance over established RL baselines.
Chinese Translation
为机器人探索设计可学习的信息论目标仍然具有挑战性。这些目标旨在引导探索朝向能够减少模型参数不确定性的数据,但通常不清楚所收集的数据实际能够揭示什么信息。尽管强化学习(RL)可以优化给定的目标,但在高维机器人系统中构建反映参数可学习性的目标是困难的。许多参数方向的可观察性较弱或不可识别,即使选择了可识别的方向,遗漏的方向仍可能影响探索并扭曲信息度量。为了解决这一挑战,我们提出了准最优实验设计(Quasi-Optimal Experimental Design, QOED),这是一种基于最优实验设计的自适应信息目标。QOED (i) 对费舍尔信息矩阵进行特征空间分析,以识别可观察的子空间并选择可识别的参数方向;(ii) 修改探索目标,以强调这些方向,同时抑制来自非关键参数的干扰效应。在有限的干扰影响和关键方向与干扰方向之间的有限耦合下,QOED 提供了对理想信息目标的常数因子近似,该目标探索所有参数。我们在模拟和现实世界的导航与操作任务中评估了 QOED,其中可识别方向选择和干扰抑制分别带来了 35.23% 和 21.98% 的性能提升。当作为基于模型的策略优化中的探索目标集成时,QOED 进一步提高了策略性能,超越了已建立的 RL 基线。
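Step (i) of this abstract, eigenspace analysis of the Fisher information matrix, can be sketched in generic notation. The threshold $\tau$ and all symbols below are assumptions made for illustration; the abstract does not give the paper's exact selection rule or objective.

```latex
% F(\theta) \in \mathbb{R}^{d \times d}: Fisher information matrix
% (symmetric positive semidefinite), with orthonormal eigenvectors v_i.
\begin{align}
  F(\theta) &= \sum_{i=1}^{d} \lambda_i \, v_i v_i^{\top},
      \qquad \lambda_1 \ge \dots \ge \lambda_d \ge 0, \\
  V_k &= [\, v_1, \dots, v_k \,],
      \qquad k = \max\{\, i : \lambda_i \ge \tau \,\}, \\
  F_k &= V_k^{\top} F(\theta)\, V_k
       = \operatorname{diag}(\lambda_1, \dots, \lambda_k).
\end{align}
```

Directions with $\lambda_i < \tau$ are the weakly observable ones: the data carry little information about them. A classical design criterion such as $\log\det F_k$ (D-optimality) could then be evaluated on the retained subspace only. This is one standard choice from optimal experimental design, named here as an example and not as the criterion QOED uses.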
cs.RO / 30 / 2605.12090

World Action Models: The Next Frontier in Embodied AI

世界行动模型:具身人工智能的下一个前沿
Wang, Siyin, Shi, Junhao, Fu, Zhaoyang, He, Xinzhe, Liu, Feihong, Yang, Chenchen, Zhou, Yikang, Fei, Zhaoye, Gong, Jingjing, Fu, Jinlan, Shou, Mike Zheng, Huang, Xuanjing, Qiu, Xipeng, Jiang, Yu-Gang
Abstract
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
Chinese Translation
视觉-语言-行动(VLA)模型在具身策略学习中实现了强大的语义泛化,然而它们学习的是反应性的观察到行动的映射,而没有明确建模在干预下物理世界如何演变。越来越多的研究工作通过将世界模型(环境动态的预测模型)整合到行动生成流程中来解决这一局限性。我们将这一新兴范式称为世界行动模型(WAMs):将预测状态建模与行动生成统一的具身基础模型,目标是对未来状态和行动的联合分布进行建模,而不仅仅是行动。然而,现有文献在架构、学习目标和应用场景上仍然零散,缺乏统一的概念框架。我们正式定义了WAMs,并将其与相关概念区分开来,同时追溯了VLA与世界模型研究的基础及其早期整合,这些研究催生了这一范式。我们将现有方法组织成级联和联合WAMs的结构化分类法,并根据生成方式、条件机制和行动解码策略进行进一步细分。我们系统分析了推动WAMs发展的数据生态系统,涵盖机器人远程操作、便携式人类示范、仿真和互联网规模的自我中心视频,并综合了围绕视觉保真度、物理常识和行动合理性组织的新兴评估协议。总体而言,本次调查提供了WAMs领域的首次系统性概述,阐明了关键架构范式及其权衡,并识别了这一快速发展的领域中的开放挑战和未来机遇。
cs.RO / 31 / 2605.12160

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

Premover:通过在指令完成前采取行动实现快速的视觉-语言-动作控制
Park, Joonha, Jeong, Jiseung, Gong, Taesik
Abstract
Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator-rendered target-object segmentation masks and applied as a per-patch reweighting of the next step's image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark suite, Premover reduces mean wall-clock time from 34.0 to 29.4 seconds, a 13.6% reduction, while matching the full-prompt baseline's success rate (95.1% vs. 95.0%); naive premoving, by contrast, collapses to 66.4%.
Chinese Translation
视觉-语言-动作(VLA)策略通常在用户完成输入后才开始评估机器人行动。然而,在实际部署中,用户需要几秒钟的时间来输入请求,这使得策略在交互的相当一部分时间内处于闲置状态。我们提出了Premover,一个轻量级模块,将这一闲置窗口转化为有用的预计算。Premover保持VLA主干网络的冻结,并附加两个小的投影头,一个用于图像补丁,一个用于语言标记,将主干网络的中间层映射到共享空间。生成的聚焦图由模拟器渲染的目标对象分割掩码进行监督,并作为下一步图像标记的每个补丁的重加权。一个单一的标量准备阈值,通过流式前缀联合训练,决定策略何时开始行动。在LIBERO基准测试套件上,Premover将平均墙钟时间从34.0秒减少到29.4秒,减少了13.6%,同时匹配了完整提示基线的成功率(95.1%对95.0%);相比之下,简单的预移动方法则崩溃至66.4%。
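The focus-map reweighting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine similarity, the mean-pooling over language tokens, the softmax temperature, and all function names are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def focus_map(patch_feats, token_feats, temperature=0.1):
    """Per-patch weights: softmax over each patch's mean similarity to the tokens.

    patch_feats and token_feats are assumed to already live in the shared
    space produced by the two projection heads.
    """
    scores = [
        sum(cosine(p, t) for t in token_feats) / len(token_feats)
        for p in patch_feats
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweight(patch_tokens, weights):
    """Scale each image token by its focus weight.

    Multiplying by the patch count keeps tokens unchanged under uniform weights.
    """
    n = len(patch_tokens)
    return [[w * n * x for x in tok] for tok, w in zip(patch_tokens, weights)]
```

With a partial instruction prefix, `focus_map` would be recomputed as new language tokens arrive, so the image tokens fed to the frozen backbone gradually concentrate on the likely target object.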
cs.RO / 32 / 2605.12162

X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

X-Imitator:通过双向动作-姿态交互实现空间感知的模仿学习
Xiong, Kai, Fang, Hongjie, Yang, Lixin, Lu, Cewu
Abstract
Effectively handling the interplay between spatial perception and action generation remains a critical bottleneck in robotic manipulation. Existing methods typically treat spatial perception and action execution as decoupled or strictly unidirectional processes, fundamentally restricting a robot's ability to master complex manipulation tasks. To address this, we propose X-Imitator, a versatile dual-path framework that models spatial perception and action execution as a tightly coupled bidirectional loop. By reciprocally conditioning current pose predictions on past actions and vice versa, this framework enables continuous mutual refinement between spatial reasoning and action generation. This joint modeling exactly mimics human internal forward models. Designed as a modular architecture, the system can be seamlessly integrated into various visuomotor policies. Extensive experiments across 24 simulated and 3 real-world tasks demonstrate that our framework significantly outperforms both vanilla policies and prior methods utilizing explicit pose guidance. The code will be open sourced.
Chinese Translation
有效处理空间感知与动作生成之间的相互作用仍然是机器人操作中的一个关键瓶颈。现有方法通常将空间感知和动作执行视为解耦或严格单向的过程,这在根本上限制了机器人掌握复杂操作任务的能力。为了解决这个问题,我们提出了X-Imitator,一个多功能的双路径框架,将空间感知和动作执行建模为紧密耦合的双向循环。通过相互条件化当前姿态预测与过去动作以及反之,这个框架实现了空间推理与动作生成之间的持续相互优化。这种联合建模精确模拟了人类内部的前向模型。该系统设计为模块化架构,可以无缝集成到各种视觉运动策略中。在24个模拟任务和3个真实世界任务中的大量实验表明,我们的框架显著优于传统策略和利用显式姿态指导的先前方法。代码将开源。
cs.RO / 33 / 2605.12167

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

从想象的未来到可执行的行动:用于机器人操作的潜在行动混合
Li, Yajie, Zhang, Bozhou, Gu, Chun, Ma, Zipei, Zhang, Jiahui, Deng, Jiankang, Zhu, Xiatian, Zhang, Li
Abstract
Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
Chinese Translation
视频生成模型通过预测长时间范围的未来观察,为机器人操作提供了一种有前景的想象机制,但有效利用这些想象的未来进行行动执行仍然具有挑战性。现有的方法要么将策略条件化于预测的帧,要么直接将生成的视频解码为行动,这两者都面临视觉现实与控制相关性之间的不匹配。因此,预测的观察强调感知的真实性,而非以行动为中心的状态转变原因,导致间接和不稳定的控制。为了解决这一问题,我们提出了MoLA(潜在行动混合),一种面向控制的接口,将想象的未来视频转化为可执行的表示。MoLA并不是将预测的帧直接传递给策略,而是利用一组预训练的逆动力学模型推断生成的视觉转变所暗示的潜在行动的混合。这些具有模态感知的逆动力学模型捕捉到互补的语义、深度和流动线索,提供了一种结构化且物理基础的行动表示,架起了视频想象与策略执行之间的桥梁。我们在模拟基准(LIBERO、CALVIN和LIBERO-Plus)和真实世界的机器人操作任务上评估了我们的方法,取得了任务成功率、时间一致性和泛化能力的一致提升。
cs.RO / 34 / 2605.12182

DexTwist: Dexterous Hand Retargeting for Twist Motion via Mixed Reality-based Teleoperation

DexTwist:基于混合现实的灵巧手扭转运动重定向
Lee, Dongmyoung, Li, Chengxi, Lee, Dongheui
Abstract
Dexterous teleoperation via Mixed Reality (MR)-based interfaces offers a scalable paradigm for transferring human manipulation skills to dexterous robot hands. However, conventional retargeting approaches that minimize kinematic dissimilarity (e.g., joint angle or fingertip position error) often fail in contact-rich rotational manipulation, such as cap opening, key turning, and bolt screwing. This failure stems from the embodiment gap: mismatched link lengths, joint axes/limits, and fingertip geometry can cause direct pose imitation to induce tangential fingertip sliding rather than stable object rotation, resulting in screw axis drift, contact slip, and grasp instability. To address this, we propose DexTwist, a functional twist-retargeting framework for MR-based dexterous teleoperation. DexTwist detects a tripod pinch, estimates the operator's intended screw axis and twist magnitude, and applies a real-time residual joint-space refinement that tracks turning progress while regularizing the robot tripod geometry. The refinement minimizes a virtual-object objective defined by turning angle, screw axis consistency, fingertip closure, and tripod stability. Simulation and real-world experiments show that DexTwist improves turning angle tracking and screw axis stability compared with a vector-based retargeting baseline.
Chinese Translation
基于混合现实(MR)的灵巧遥操作提供了一种可扩展的范式,用于将人类操作技能转移到灵巧机器人手上。然而,传统的重定向方法通常通过最小化运动学差异(例如,关节角度或指尖位置误差)来实现,但在接触丰富的旋转操作中(如开瓶盖、转钥匙和拧螺栓)往往会失败。这种失败源于体现差距:不匹配的连杆长度、关节轴/限制和指尖几何形状可能导致直接的姿态模仿引发切向指尖滑动,而不是稳定的物体旋转,从而导致螺旋轴漂移、接触滑移和抓握不稳定。为了解决这个问题,我们提出了DexTwist,一个用于基于MR的灵巧遥操作的功能性扭转重定向框架。DexTwist检测三脚架夹持,估计操作者意图的螺旋轴和扭转幅度,并应用实时残余关节空间精细化,跟踪转动进度,同时规范化机器人三脚架几何形状。该精细化最小化一个由转动角度、螺旋轴一致性、指尖闭合和三脚架稳定性定义的虚拟物体目标。仿真和现实世界实验表明,与基于向量的重定向基线相比,DexTwist改善了转动角度跟踪和螺旋轴稳定性。
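The abstract does not say how the screw axis and twist magnitude are estimated. As a hedged illustration only, the textbook axis-angle extraction from a relative rotation matrix gives one way to read off a rotation axis and turning angle; the function name and the assumption that a 3x3 relative rotation of the tripod grasp is available are hypothetical.

```python
import math

def axis_angle(R):
    """Return (axis, angle) for a 3x3 rotation matrix given as nested lists.

    The axis is None when the angle is too close to 0 or pi for the
    skew-symmetric part of R to define it reliably.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    theta = math.acos(cos_theta)
    s = 2.0 * math.sin(theta)
    if abs(s) < 1e-9:
        return None, theta
    axis = [
        (R[2][1] - R[1][2]) / s,
        (R[0][2] - R[2][0]) / s,
        (R[1][0] - R[0][1]) / s,
    ]
    return axis, theta
```

Accumulating `theta` over successive frames would give a turning-progress signal, and the frame-to-frame change in `axis` is one possible measure of screw axis drift.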
cs.RO / 35 / 2605.12228

Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation

形态等变流匹配在双手移动操控中的应用
Siebenborn, Max, Apraez, Daniel Ordoñez, Lueth, Sophie, Turrisi, Giulio, Pontil, Massimiliano, Semini, Claudio, Chalvatzaki, Georgia
Abstract
Mobile manipulation requires coordinated control of high-dimensional, bimanual robots. Imitation learning methods have been broadly used to solve these robotic tasks, yet typically ignore the bilateral morphological symmetry inherent in such systems. We argue that morphological symmetry is an underexplored but crucial inductive bias for learning in bimanual mobile manipulation: knowing how to solve a task in one configuration directly determines how to solve its mirrored counterpart. In this paper, we formalize this symmetry prior and show that it constrains optimal bimanual policies to be ambidextrous and equivariant under reflections across the robot's sagittal plane. We introduce a $\mathbb{C}_2$-equivariant flow matching policy that enforces reflective symmetry either via a regularized training loss or an equivariant velocity network. Across planar and 6-DoF mobile manipulation tasks, symmetry-informed policies consistently improve sample efficiency and achieve zero-shot generalization to mirrored configurations absent from the training distribution. We further validate this zero-shot generalization capability on a real-world manipulation task with a TIAGo++ robot. Together, our findings establish morphological symmetry as an effective, generalizable, and scalable inductive bias for ambidextrous generative policy learning.
Chinese Translation
移动操控需要对高维双手机器人进行协调控制。模仿学习方法已广泛应用于解决这些机器人任务,但通常忽视了此类系统固有的双边形态对称性。我们认为,形态对称性是双手移动操控学习中一个未被充分探索但至关重要的归纳偏置:在一种配置中解决任务的方法直接决定了如何解决其镜像对应的任务。在本文中,我们形式化了这种对称性先验,并证明它约束最优双手策略必须是双手通用的,并且在关于机器人矢状面的反射下保持等变。我们引入了一种 $\mathbb{C}_2$-等变流匹配策略,通过正则化训练损失或等变速度网络来强制反射对称性。在平面和6自由度移动操控任务中,基于对称性的策略始终提高了样本效率,并实现了对训练分布中缺失的镜像配置的零样本泛化。我们进一步在TIAGo++机器人的真实世界操作任务上验证了这种零样本泛化能力。综上,我们的研究结果确立了形态对称性作为一种有效、可泛化且可扩展的归纳偏置,用于双手通用的生成策略学习。
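One generic way to obtain a reflection-equivariant velocity field is group averaging over the two-element group: average the network output with the reflected output on the reflected input. The sketch below illustrates that idea under stated assumptions; it is not the paper's architecture, and the four-dimensional state layout, `reflect`, and `raw_velocity` are hypothetical.

```python
def reflect(vec):
    """Hypothetical sagittal reflection for [left_x, left_y, right_x, right_y].

    Swaps the left and right arm entries and negates the lateral (y) components.
    Applying it twice returns the original vector.
    """
    lx, ly, rx, ry = vec
    return [rx, -ry, lx, -ly]

def symmetrize(velocity_fn):
    """Wrap any velocity function so that v(reflect(x)) == reflect(v(x))."""
    def v_sym(x):
        direct = velocity_fn(x)
        mirrored = reflect(velocity_fn(reflect(x)))
        return [0.5 * (a + b) for a, b in zip(direct, mirrored)]
    return v_sym

def raw_velocity(x):
    """Stand-in for an unconstrained network; deliberately not symmetric."""
    return [x[0] + 2.0 * x[1], x[2] * x[3], 0.5 * x[0], x[1] - x[3] + 1.0]
```

Because the reflection is an involution, the averaged field commutes with it by construction, so a behavior learned in one configuration transfers to the mirrored one without extra data.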
cs.RO / 36 / 2605.12236

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

TMRL:扩散时间步调制预训练促进高效策略微调的探索
Hong, Matthew M., Zhang, Jesse, Nagabandi, Anusha, Gupta, Abhishek
Abstract
Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that bridges BC pre-training and RL fine-tuning to provide the exploration needed for efficient robot policy fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. TMRL integrates seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, and we show that it improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.
Chinese Translation
使用强化学习(RL)对预训练机器人策略进行微调时,常常会继承由行为克隆(BC)预训练引入的瓶颈,这导致产生狭窄的动作分布,缺乏下游探索所需的覆盖范围。我们提出了一个统一框架,通过连接BC预训练和RL微调,促进必要的探索以实现高效的机器人策略微调。我们的方法,称为上下文平滑预训练(Context-Smoothed Pre-training, CSP),在策略输入中注入前向扩散噪声,形成精确模仿与广泛动作覆盖之间的连续体。随后,我们通过时间步调制强化学习(Timestep-Modulated Reinforcement Learning, TMRL)对预训练策略进行微调,该方法训练代理在微调过程中动态调整这种条件,通过调制扩散时间步,赋予对探索的明确控制。TMRL与任意策略输入(例如状态、3D点云或基于图像的VLA策略)无缝集成,我们展示了TMRL提高了RL微调的样本效率。值得注意的是,TMRL使得在复杂操作任务上成功进行现实世界的微调时间不超过一小时。视频和代码可在 https://weirdlabuw.github.io/tmrl/ 获取。
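The noise injection into policy inputs can be illustrated with the standard forward-diffusion formula x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. This is a sketch under assumptions: the linear beta schedule, the 100-step horizon, and the function names are illustrative and are not taken from the paper.

```python
import math
import random

def alpha_bar(t, num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for an assumed linear beta schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def noise_context(obs, t, rng, num_steps=100):
    """Forward-diffuse a policy input to timestep t.

    t = 0 returns the observation unchanged (precise imitation); larger t
    blurs the conditioning and so widens the resulting action distribution.
    """
    ab = alpha_bar(t, num_steps)
    return [
        math.sqrt(ab) * o + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
        for o in obs
    ]
```

In this picture the timestep `t` is the single knob that trades imitation precision for coverage, which is the quantity the abstract says the RL stage learns to modulate.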
cs.RO / 37 / 2605.12247

SI-Diff: A Framework for Learning Search and High-Precision Insertion with a Force-Domain Diffusion Policy

SI-Diff:一种通过力域扩散策略学习搜索和高精度插入的框架
Liu, Yibo, Oparnica, Stanko, Shewchun-Jakaitis, Simon, Fu, Guoyi, Wang, Jie, Yang, Jun, Jagannathan, Anand, Lo, Tony Hong-Yau
Abstract
Contact-rich assembly is fundamental in robotics but poses significant challenges due to uncertainties in relative poses, such as misalignments and small clearances in peg-in-hole tasks. Existing approaches typically address search and high-precision insertion separately, because these tasks involve distinct action patterns. However, supporting both tasks within a single model, without switching models or weights, is desirable for intelligent assembly systems. In this work, we propose SI-Diff, a framework that learns both search and high-precision insertion through a force-domain diffusion policy. To this end, we introduce a new mode-conditioning mechanism that enables the policy to capture distinct action behaviors under a single framework. Moreover, we develop a new search teacher policy that can generate diverse trajectories. By training on successful and efficient demonstrations provided by the teacher policy, the model learns the mapping from tactile and end-effector velocity observations to effective action behaviors. We conduct thorough experiments to show that SI-Diff extends the tolerance to x-y misalignments from 2 mm to 5 mm compared to the state-of-the-art baseline, TacDiffusion, while also demonstrating strong zero-shot transferability to unseen shapes.
Chinese Translation
接触丰富的组装在机器人技术中至关重要,但由于相对姿态的不确定性,例如在插销入孔任务中的错位和小间隙,带来了显著的挑战。现有的方法通常将搜索和高精度插入分开处理,因为这两个任务涉及不同的动作模式。然而,在单一模型中支持这两个任务,而不切换模型或权重,对于智能组装系统是理想的。在本研究中,我们提出了SI-Diff,一个通过力域扩散策略学习搜索和高精度插入的框架。为此,我们引入了一种新的模式条件机制,使得策略能够在单一框架下捕捉不同的动作行为。此外,我们开发了一种新的搜索教师策略,能够生成多样的轨迹。通过对教师策略提供的成功和高效演示进行训练,该模型学习了从触觉和末端执行器速度观测到有效动作行为的映射。我们进行了全面的实验,表明与最先进的基线TacDiffusion相比,SI-Diff将x-y错位的容忍度从2毫米扩展到5毫米,同时还展示了对未见形状的强零-shot迁移能力。
cs.RO / 38 / 2605.12347

Real-Time Whole-Body Teleoperation of a Humanoid Robot Using IMU-Based Motion Capture with Sim2Sim and Sim2Real Validation

基于IMU运动捕捉的类人机器人实时全身遥操作:Sim2Sim与Sim2Real验证
Durrani, Hamza Ahmed, Khan, Suleman
Abstract
Stable, low-latency whole-body teleoperation of humanoid robots is an open research challenge, complicated by kinematic mismatches between human and robot morphologies, accumulated inertial sensor noise, non-trivial control latency, and persistent sim-to-real transfer gaps. This paper presents a complete real-time whole-body teleoperation system that maps human motion, recorded with a Virdyn IMU-based full-body motion capture suit, directly onto a Unitree G1 humanoid robot. We introduce a custom motion-processing, kinematic retargeting, and control pipeline engineered for continuous, low-latency operation without any offline buffering or learning-based components. The system is first validated in simulation using the MuJoCo physics model of the Unitree G1 (sim2sim), and then deployed without modification on the physical platform (sim2real). Experimental results demonstrate stable, synchronized reproduction of a broad motion repertoire, including walking, standing, sitting, turning, bowing, and coordinated expressive full-body gestures. This work establishes a practical, scalable framework for whole-body humanoid teleoperation using commodity wearable motion capture hardware.
Chinese Translation
类人机器人的稳定低延迟全身遥操作是一个开放的研究挑战,受限于人类与机器人形态之间的运动学不匹配、积累的惯性传感器噪声、非平凡的控制延迟以及持续的仿真到现实转移差距。本文提出了一种完整的实时全身遥操作系统,该系统将使用Virdyn基于IMU的全身运动捕捉服记录的人类运动直接映射到Unitree G1类人机器人上。我们引入了一种定制的运动处理、运动学重定向和控制管道,旨在实现连续低延迟操作,而无需任何离线缓冲或基于学习的组件。该系统首先在仿真中使用Unitree G1的MuJoCo物理模型进行验证(sim2sim),然后在物理平台上无修改地部署(sim2real)。实验结果表明,系统能够稳定、同步地再现广泛的动作库,包括行走、站立、坐下、转身、鞠躬和协调的全身表达性手势。这项工作建立了一个实用的、可扩展的框架,用于使用普通可穿戴运动捕捉硬件进行类人机器人全身遥操作。
cs.RO / 39 / 2605.12369

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

GuidedVLA:通过即插即用的动作注意力专业化来指定任务相关因素
Jia, Xiaosong, Yang, Bowen, Ge, Zuhao, Nie, Xian, Zhou, Yuchen, Fan, Cunxin, Li, Yufeng, Chai, Yilin, Jing, Chao, Liang, Zijian, Bu, Qingwen, Cao, Haidong, Wu, Chao, Li, Qifeng, Yang, Zhenjie, Zhang, Chenhe, Li, Hongyang, Wu, Zuxuan, Yan, Junchi, Jiang, Yu-Gang
Abstract
Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.
Chinese Translation
视觉-语言-动作(VLA)模型旨在通过将动作作为强大视觉-语言模型(VLM)中的一种模态来实现通用机器人学习。现有的VLA依赖于端到端的监督,隐式地使动作解码过程学习任务相关特征。然而,在没有明确指导的情况下,这些模型往往会过拟合于虚假相关性,例如视觉捷径或环境噪声,从而限制了它们的泛化能力。本文提出了GuidedVLA,一个旨在手动引导动作生成以关注任务相关因素的框架。我们的核心见解是将动作解码器视为一个功能组件的集合,而非一个单一的学习者。各个注意力头通过手动定义的辅助信号进行监督,以捕捉不同的因素。作为初步研究,我们用三个专业化的头部实例化这一范式:物体定位、空间几何和时间技能逻辑。在模拟和真实机器人实验中,GuidedVLA在领域内和领域外的设置中相比强大的VLA基线提高了成功率。最后,我们展示了这些专业化因素的质量与任务表现呈正相关,并且我们的机制产生了解耦的高质量特征。我们的结果表明,明确引导动作解码器学习是构建更强大和通用的VLA模型的一个有前景的方向。
cs.RO / 40 / 2605.12386

SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

SafeManip:一种基于属性的机器人操作时序安全评估基准
Huang, Chengyue, Huynh, Khang Vo, Elbaum, Sebastian, Kira, Zsolt, Feng, Lu
Abstract
Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $\pi_0$, $\pi_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
Chinese Translation
机器人操作通常通过任务成功来评估,但成功完成并不保证安全执行。许多安全失败是时序性的:机器人可能在污染后触碰干净表面,或在物体未完全放入封闭空间前就释放它。我们提出了SafeManip,这是一种基于属性的基准,旨在明确评估机器人操作中的时序安全属性,超越以往主要关注任务完成或每个状态约束违反的评估。SafeManip使用有限轨迹上的线性时序逻辑(Linear Temporal Logic over finite traces, LTLf)定义了可重用的安全模板,涵盖有限执行。它将观察到的执行映射到符号谓词轨迹,并使用基于LTLf的监控器进行评估。其属性套件涵盖了八个操作安全类别:碰撞和接触安全、抓取稳定性、释放稳定性、交叉污染、动作开始、机制恢复、物体容纳和封闭访问。模板可以与特定任务的对象、夹具、区域或技能实例化,从而使相同的安全规范能够在不同任务和环境中推广。我们在50个RoboCasa365家庭任务上评估了SafeManip,涉及六种视觉-语言-动作策略,包括$\pi_0$、$\pi_{0.5}$、GR00T及其训练变体。结果表明,即使是强模型也常常表现出不安全的行为。任务成功的提升并不可靠地转化为更安全的执行:许多成功的执行仍然不安全,而更长时间或更复杂的任务则暴露出更多的违规情况。SafeManip提供了一个可重用的评估层,用于诊断时序安全失败并衡量超越任务完成的安全成功。
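The two temporal failures named in the abstract (touching a clean surface after contamination, releasing an object before it is inside an enclosure) can be checked over a finite predicate trace with hand-written monitors. This is a minimal sketch for two fixed formula shapes, not a general LTLf monitor and not the benchmark's code; the predicate names and the trace encoding as a list of dictionaries are assumptions.

```python
def never_after(trace, trigger, forbidden):
    """Check G(trigger -> G !forbidden) on a finite trace.

    Once `trigger` has held at some step, `forbidden` must not hold at that
    step or any later one.
    """
    triggered = False
    for step in trace:
        triggered = triggered or step[trigger]
        if triggered and step[forbidden]:
            return False
    return True

def not_before(trace, action, condition):
    """Check (!action) W condition on a finite trace (weak until).

    `action` must not occur strictly before `condition` first holds; a trace
    in which neither ever occurs is accepted.
    """
    for step in trace:
        if step[condition]:
            return True
        if step[action]:
            return False
    return True

def safe_success(task_success, checks):
    """A rollout counts as a safe success only if it succeeds and passes every check."""
    return task_success and all(checks)
```

For example, a rollout that completes the task but releases the object one step before the `inside` predicate becomes true would count toward task success while failing `not_before`, which is the gap between success and safe success that the benchmark measures.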
计算机视觉 (Computer Vision)
158
cs.CV / 1 / 2605.10984

Principle-Guided Supervision for Interpretable Uncertainty in Medical Image Segmentation

基于原则引导的监督以实现医学图像分割中的可解释不确定性
Sui, An, Li, Yuzhu, Schumann, Gunter, Wu, Fuping, Zhuang, Xiahai
Abstract
Uncertainty quantification complements model predictions by characterizing their reliability, which is essential for high-stakes decision making such as medical image segmentation. However, most existing methods reduce uncertainty to a scalar confidence estimate, leaving its spatial distribution semantically underconstrained. In this work, we focus on uncertainty interpretability, namely, whether estimated uncertainty behaves in a human-understandable manner with respect to sources of ambiguity. We identify three perception-aligned principles requiring the spatial distribution of uncertainty to reflect: (1) image contrast between structures, (2) severity of image corruption, and (3) geometric complexity in anatomical structures. Accordingly, we develop a principle-guided uncertainty supervision framework (PriUS) based on evidential learning, in which the corresponding supervision objectives are explicitly enforced during training. We further introduce quantitative metrics to measure the consistency between predicted uncertainty and image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS datasets showed that, compared with state-of-the-art methods, PriUS produced more consistent uncertainty estimates while maintaining competitive segmentation performance.
Chinese Translation
不确定性量化通过表征模型预测的可靠性来补充模型预测,这对于医学图像分割等高风险决策至关重要。然而,大多数现有方法将不确定性简化为标量置信度估计,导致其空间分布在语义上受到限制。在本研究中,我们关注不确定性的可解释性,即估计的不确定性是否以人类可理解的方式与模糊源相关。我们识别出三个与感知一致的原则,要求不确定性的空间分布反映: (1) 结构之间的图像对比度, (2) 图像损坏的严重程度,以及 (3) 解剖结构的几何复杂性。因此,我们开发了一种基于证据学习的原则引导不确定性监督框架(PriUS),在该框架中,相应的监督目标在训练过程中被明确执行。我们进一步引入定量指标来测量预测不确定性与导致模糊的图像属性之间的一致性。在ACDC、ISIC和WHS数据集上的实验表明,与最先进的方法相比,PriUS产生了更一致的不确定性估计,同时保持了竞争性的分割性能。
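The abstract states only that PriUS builds on evidential learning. As background, the commonly used Dirichlet formulation turns non-negative per-class evidence into class probabilities and a single uncertainty mass per pixel. The sketch below assumes that standard formulation; the function name is hypothetical and the principle-guided supervision terms are not reproduced.

```python
def evidential_uncertainty(evidence):
    """Map per-class evidence for one pixel to (probabilities, uncertainty).

    Uses alpha_k = e_k + 1, S = sum(alpha), p_k = alpha_k / S and u = K / S,
    so zero evidence gives u = 1 and strong evidence drives u toward 0.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return probs, uncertainty
```

Applying this per pixel yields the spatial uncertainty map whose relationship to image contrast, corruption severity, and geometric complexity the paper's three principles are meant to constrain.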
cs.CV / 2 / 2605.11055

The first global agricultural field boundary map at 10m resolution

首个10米分辨率的全球农业田块边界地图
Robinson, Caleb, Muhawenayo, Gedeon, Khanal, Subash, Fang, Zhanpei, Corley, Isaac, Tárano, Ana M., Estes, Lyndon, Marcus, Jennifer, Jacobs, Nathan, Kerner, Hannah, Becker-Reshef, Inbal, Ferres, Juan M. Lavista
Abstract
The agricultural field is the natural unit at which crops are planted, managed, regulated, and reported, yet most global remote-sensing products for agriculture are only available at the pixel level. While some high-quality field-level data products exist, they come from parcel registries covering only parts of Europe or from ML-derived products for individual countries. No openly available, globally consistent map of agricultural field boundaries exists to date. Here we present the first global field boundary dataset at 10 m resolution for the years 2024 and 2025, comprising 3.17 billion remote-sensing field polygons (1.62 B in 2024 and 1.55 B in 2025) across 241 countries and territories, produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics. Validated against ground-truth field boundaries in 24 countries, the map achieved a mean pixel-level recall of 0.85 with 14 countries exceeding 0.90. Evaluation against full-country ground-truth datasets in Austria, Latvia, and Finland yielded F1 scores of 0.89, 0.88, and 0.74, respectively. Because reference data for global validation is inherently incomplete, we accompanied the map with a 500 m confidence layer that identifies regions where predictions are reliable. We release the dataset openly as three global maps: the confidence-thresholded default field boundary dataset, the full unfiltered dataset, and the continuous-valued confidence raster. These maps provide the first globally consistent field-level unit of analysis for crop monitoring, food security, and downstream agricultural science.
Chinese Translation
农业田块是作物种植、管理、监管和报告的自然单元,但目前大多数全球农业遥感产品仅在像素级别提供。虽然存在一些高质量的田块级数据产品,但它们仅来自覆盖部分欧洲的地块登记处或针对个别国家的机器学习(ML)衍生产品。迄今为止,尚无公开可用的全球一致的农业田块边界地图。在此,我们展示了2024年和2025年首个10米分辨率的全球田块边界数据集,涵盖241个国家和地区,共计31.7亿个遥感田块多边形(2024年为16.2亿,2025年为15.5亿),该数据集是通过将训练于《世界田块数据集》(Fields of The World)上的U-Net分割模型应用于无云的Sentinel-2马赛克图像生成的。经过与24个国家的地面真实田块边界进行验证,该地图在像素级别上达到了85%的平均召回率,其中14个国家超过了90%。在奥地利、拉脱维亚和芬兰的全国地面真实数据集评估中,F1分数分别为0.89、0.88和0.74。由于全球验证的参考数据本质上是不完整的,我们为该地图附加了一个500米的置信层,以识别预测可靠的区域。我们将该数据集公开发布为三幅全球地图:置信阈值默认田块边界数据集、完整未过滤数据集和连续值置信栅格。这些地图为作物监测、粮食安全和下游农业科学提供了首个全球一致的田块级分析单元。
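For readers less familiar with the reported metrics, the sketch below computes precision, recall, and F1 from flattened binary field masks. It assumes a pixel-wise definition for all three; the abstract specifies pixel-level recall but does not state how its F1 scores are computed, and the function name is hypothetical.

```python
def pixel_scores(pred, truth):
    """Precision, recall and F1 for two equal-length binary masks (flattened)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Recall alone can be evaluated wherever some reference fields are labeled, while precision needs exhaustive labels, which is consistent with the abstract reporting recall for 24 countries and F1 only for the three with full-country ground truth.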
cs.CV / 3 / 2605.11061

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

HiDream-O1-Image:一种具有像素级统一变换器的原生统一图像生成基础模型
Cai, Qi, Chen, Jingwen, Gao, Chengmin, Gong, Zijian, Li, Yehao, Pan, Yingwei, Peng, Yi, Qiu, Zhaofan, Yu, Kai, Zhang, Yiheng, Ai, Hao, Bai, Siying, Chen, Yang, Chen, Zhihui, Gao, Fengbin, Guo, Ying, Li, Dong, Shen, Zhen, Shi, Leilei, Wang, Jing, Wang, Siyu, Wang, Yimeng, Zheng, Rui, Yao, Ting, Mei, Tao
Abstract
The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version HiDream-O1-Image-Pro (200B+) unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.
Chinese Translation
视觉生成模型的演变长期受到依赖于分散架构的限制,这些架构依赖于不相干的文本编码器和外部变分自编码器(VAE)。在本报告中,我们提出了HiDream-O1-Image,这是一种通过像素空间扩散变换器(Diffusion Transformer)实现的原生统一生成基础模型,开创了从模块化架构到端到端上下文视觉生成引擎的范式转变。通过将原始图像像素、文本标记和任务特定条件映射到一个共享的标记空间,HiDream-O1-Image在统一变换器(Unified Transformer, UiT)架构内实现了多模态输入的结构统一。这种原生编码范式消除了对单独的VAE或不相干的预训练文本编码器的需求,使模型能够将多样的生成和编辑任务视为一致的上下文推理过程。大量实验表明,HiDream-O1-Image在各种生成任务中表现优异,包括文本到图像生成、基于指令的编辑和主题驱动的个性化。值得注意的是,HiDream-O1-Image(8B)仅用80亿参数就达到了与甚至超越具有显著更多参数的现有最先进模型(例如,270亿参数的Qwen-Image)的性能。至关重要的是,为了验证这一范式的巨大可扩展性,我们成功地将架构扩展到超过2000亿参数。实验结果表明,这一大规模版本的HiDream-O1-Image-Pro(200B+)解锁了前所未有的生成能力和卓越性能,建立了新的最先进基准。最终,HiDream-O1-Image突显了原生统一架构的巨大潜力,并为下一代多模态人工智能绘制了一条高度可扩展的道路。
cs.CV / 4 / 2605.11107

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

物以类聚:通过VLM中的线性结构实现背景不变表示
Zaazou, Youssef, Thomas, Mark
Abstract
Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
Chinese Translation
视觉语言模型(VLMs),如CLIP和SigLIP 2,广泛应用于图像分类,但其视觉编码器仍然容易受到系统性偏差的影响,从而削弱其鲁棒性。特别是,前景物体与其背景之间的相关性构成了一类显著且在实践中重要的虚假依赖关系。在本研究中,我们重新审视了VLM嵌入空间中高线性可加性的著名特性,并展示了它使得场景表示可以分解为前景和背景成分。利用这一见解,我们提出了一种预训练方法,利用这一特性通过合成数据构建背景不变表示。我们的研究方法在完美($100\%$)虚假相关的Waterbirds数据集上,实现了我们所知的首个最差组准确率超过$90\%$的结果(即训练数据中没有少数群体样本)。此外,该方法展示了强大的从模拟到现实的迁移能力,并且不需要访问真实世界的去偏数据,使其在实际应用中具有可行性。
cs.CV / 5 / 2605.11115

LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

LatentHDR:通过条件潜在到潜在映射解耦曝光与扩散,用于文本/图像到全景HDR
Fekri, Pedram, Li, WenChen, Chen, William, Altamirano, Peter
Abstract
High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusion-based approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent-to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both text- and image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.
Chinese Translation
高动态范围(HDR)生成仍然是生成模型面临的挑战,现有模型主要局限于低动态范围输出。近期的基于扩散的方法通过生成多个曝光条件样本来近似HDR,导致高计算成本和曝光间的结构不一致。我们提出了LatentHDR,一个在潜在空间中将场景生成与曝光建模解耦的框架。预训练的扩散主干网络生成一个一致的场景表示,而一个轻量级的条件潜在到潜在头部则确定性地将其映射到特定曝光的表示。这使得在一次传递中生成密集且结构一致的曝光堆栈成为可能。该设计消除了多次扩散,确保了跨曝光的对齐,并实现了可扩展的HDR合成。LatentHDR支持文本和图像条件的HDR生成,适用于透视和全景场景。在合成数据和SI-HDR基准上的实验表明,LatentHDR在动态范围上达到了最先进的水平,并具有竞争力的感知质量,同时将计算量减少了一个数量级。我们的结果表明,通过结构化的潜在建模可以实现高质量的HDR生成,挑战了随机多曝光生成的必要性。
cs.CV / 6 / 2605.11131

USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

USEMA:一种可扩展的高效类Mamba注意力机制用于医学图像分割
Dayag, Elisha, Tran, Nhat Thanh, Xin, Jack
Abstract
Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and the recent development of the Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer-based models using full self-attention, and superior segmentation performance relative to purely convolutional and Mamba-based models.
Chinese Translation
准确的医学图像分割是医学图像分析流程中不可或缺的一部分,它需要合并局部和全局信息的能力。尽管视觉变换器能够通过普通自注意力捕捉全局交互,但其在输入大小上的二次计算复杂度仍然是医学图像分割任务的一大挑战。受到普通自注意力的离散特性和近期Mamba形式注意力发展的启发,Scalable and Efficient Mamba like Attention (SEMA) 通过局部窗口注意力利用标记定位来避免离散并保持聚焦,辅以理论上一致的算术平均以捕捉注意力的全局特征。在本研究中,我们提出了USEMA,一种混合UNet架构,结合了卷积神经网络(CNN)在局部特征提取方面的能力与SEMA注意力。我们在多种模态和图像尺寸上对USEMA进行了实验,结果表明,与使用全自注意力的变换器模型相比,USEMA在计算效率上有所改善,并且在分割性能上优于纯卷积和基于Mamba的模型。
cs.CV / 7 / 2605.11166

Unpacking the Eye of the Beholder: Social Location, Identity, and the Moving Target of Political Perspectives

解构观察者的眼睛:社会位置、身份与政治视角的动态目标
Sirotkina, Elena
Abstract
Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.
Chinese Translation
政治和社会身份结构影响人们对政治信息的评估,这一发现已有数十年的政治科学研究基础,但常常被计算工具忽视,这些工具通常产生单一评分,将一段文本、一幅图像或一段视频视为对所有人具有相同意义。本文表明,事实并非如此,且这种差异是有重要意义的。为了解决这一问题,我开发了“视角主义视觉政治情感”(Perspectivist Visual Political Sentiment, PVPS)分类器,该分类器通过约82,000条由5,575名美国成年人提供的评估数据,预测由政治和社会身份定义的观众将如何评估同一图像。与标准工具通过平均系统性分歧来消除差异不同,PVPS保留了这种分歧,返回一个评估轮廓,记录谁同意、谁分歧以及沿着哪些身份线。应用于几项影响力较大的视觉情感研究,PVPS显示,当考虑观众身份时,抗议图像中感知的暴力和抗议图像参与背后的情感机制都会发生实质性变化。因此,政治图像所传达的内容是一个动态目标,测量它需要了解其所移动的对象。
cs.CV / 8 / 2605.11208

Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Hi-GaTA:用于外科视频报告生成的分层门控时间聚合适配器
Sun, Kedi, Dang, Chaohui, Feng, Yue, Glasbey, James, Arvanitis, Theodoros N., Zhang, Le
Abstract
Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
Chinese Translation
自动化的临床级外科手术评估报告能够减轻文档负担并提供客观反馈,但由于密集时空视频表示与基于语言的推理之间的对齐困难以及高质量、保护隐私的数据集稀缺,仍然面临挑战。为了解决这一问题,我们建立了一个基准数据集,包括214个高质量的模拟外科手术视频,配有外科医生撰写的评估报告。在此基础上,我们提出了一个用于外科视频报告生成的感知-对齐-推理框架,特色是Hi-GaTA,这是一种新颖的轻量级时间适配器,通过短到长范围的时间聚合,将长视频序列高效压缩为紧凑的、与大语言模型(LLM)兼容的视觉前缀令牌。为了实现稳健的视觉感知,我们在40,000分钟的公共外科视频上预训练了Sur40k,这是一种特定于外科的ViViT风格视频编码器,以捕捉细粒度的时空程序先验。Hi-GaTA采用了一个具有文本条件的双重交叉注意力的时间金字塔,并通过跨层门控融合和逐渐加深的策略提高多尺度一致性。最后,我们使用LoRA对LLM主干进行微调,以在有限监督下实现连贯且风格一致的外科报告生成。实验表明,我们的方法在整体性能上表现最佳,相较于强大的多模态大语言模型(MLLM)基线具有一致的提升。消融研究进一步验证了每个提出组件的有效性。
cs.CV / 9 / 2605.11224

ABRA: Agent Benchmark for Radiology Applications

ABRA:放射学应用的智能体基准
Maksudov, Bulat, Kurenkov, Vladislav, Curran, Kathleen M., Mileo, Alessandra
Abstract
Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA
Chinese Translation
现有的医学智能体基准将影像作为预选样本提供,而不是作为智能体必须导航的环境。我们介绍了ABRA,这是一个放射学智能体基准,智能体通过二十一种功能调用工具操作OHIF查看器和Orthanc DICOM服务器,这些工具涵盖切片导航、窗位调整、系列选择、像素坐标标注和结构化报告。ABRA包含655个程序生成的任务,分为三个难度级别和八种类型(查看器控制、元数据质量保证、视觉探测、标注、纵向比较、BI-RADS报告,以及标注和BI-RADS报告的oracle变体),这些任务来源于LIDC-IDRI、杜克大学乳腺癌MRI和NLST新病灶LongCT。每个任务在规划、执行和结果(Bluethgen等,2025)方面由特定任务类型的自动评分器进行评分。十个当前模型中,五个为闭合权重,五个为开放权重,在真实标注上至少达到89%的执行率,但结果仅为0-25%;在配对的oracle变体中,模拟检测器提供发现信息,在同一任务上的结果在评估的模型中达到69-100%,将瓶颈定位于感知而非工具协调。代码、任务生成器和评分器已发布在https://github.com/Luab/ABRA
cs.CV / 10 / 2605.11265

DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction

DenseTRF:基于纹理感知的无监督表示适应用于外科场景的密集预测
Liao, Guiqiu, Jogan, Matjaž, Hashimoto, Daniel A.
Abstract
Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.
Chinese Translation
外科计算机视觉中的密集预测任务,如分割和外科区域预测,可以为腹腔镜和机器人手术提供宝贵的指导。然而,这些模型往往受到分布偏移的影响,因为训练数据集很少涵盖部署过程中遇到的变异性,导致泛化能力差。我们提出了DenseTRF,一种基于纹理中心注意力的自监督表示适应框架。我们的方法利用槽注意力学习纹理感知的表示,以捕捉不变的视觉结构。通过在没有监督的情况下将这些表示适应于目标分布,DenseTRF显著提高了对领域偏移的鲁棒性。该框架通过将密集预测与槽注意力和模型合并策略相结合来实现。针对多种外科手术的实验表明,与最先进的分割模型和密集预测任务的测试分布适应方法相比,DenseTRF在跨分布泛化方面表现出显著改善。
cs.CV / 11 / 2605.11266

PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

PG-3DGS:优化3D高斯点云以满足物理目标
Lee, Zachary, Jacobson, Maxwell, Xue, Yexiang
Abstract
Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS-generated structures than for an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.
Chinese Translation
最近在高斯点云技术上的进展使得快速、高保真的3D场景生成成为可能,但这些方法仍然纯粹是视觉上的,缺乏对形状在物理世界中如何表现的理解。我们提出了物理引导的3D高斯点云(PG-3DGS)框架,该框架将可微分物理模拟与3D高斯表示相结合,以生成满足物理功能的3D结构。通过允许物理目标引导形状优化过程,同时考虑视觉损失,我们的方法生成的几何体不仅在光度上准确,而且在物理上也具有功能性。该模型学习调整形状,使生成的物体表现出物理上有意义的行为,例如,能够倒水的茶壶和能够产生升力的飞机,而不牺牲视觉质量。在倒水和气动升力任务上的实验表明,PG-3DGS在保持视觉质量的同时提高了物理功能性。此外,在相同气流条件下对3D打印飞机(塞斯纳、B-2幽灵和纸飞机)进行的台面物理升力测试显示,在所有三种情况下,PG-3DGS生成的结构的升力测量值均高于外观匹配基线。我们的统一框架将基于外观的重建与基于物理的推理相连接,实现了3D结构的端到端生成,这些结构不仅看起来真实,而且功能正确。
cs.CV / 12 / 2605.11267

Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

仅使用地名或坐标进行真实尺度岛屿面积和海岸线估算
Wu, Quanyun, Gao, Kyle, Sun, Wentao, He, Hongjie, Chen, Yuhao, Clausi, David A., Li, Jonathan
Abstract
Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which face serious limitations of high labor costs, time-consuming efforts, and low operational efficiency in vast and inaccessible open sea environments. To overcome these challenges and break away from the reliance on manual field exploration, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. This project significantly reduces the mapping cost through a fully automated process and achieves high-efficiency measurement without prior GIS data. In our system pipeline, only the geographical coordinates or names of the target area need to be input to obtain a low-altitude surrounding image sequence. After obtaining the point clouds, a lightweight trajectory alignment algorithm (Umeyama) is used to restore the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We have fully verified this pipeline on four islands with different terrain features (covering natural landform islands and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10\%, demonstrating excellent accuracy and robustness. Moreover, this framework has outstanding inference speed, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline
Chinese Translation
准确测量岛屿面积和海岸线长度对于沿海区域监测和海洋分析至关重要。然而,传统的测量和制图方法通常严重依赖正射影像、昂贵的机载深度传感器或密集的地面控制点,这在广阔且难以到达的海域环境中面临高劳动成本、耗时的努力和低操作效率等严重限制。为了解决这些挑战并摆脱对人工实地勘探的依赖,本文提出了一种基于纯单目视觉的几何一致性真实尺度岛屿测量框架。该项目通过完全自动化的过程显著降低了制图成本,并在没有先前GIS数据的情况下实现了高效测量。在我们的系统流程中,只需输入目标区域的地理坐标或名称,即可获取低空周围图像序列。在获得点云后,使用轻量级轨迹对齐算法(Umeyama)恢复全球物理尺度,并对缩放模型进行正射校正,从而能够直接在二维栅格平面上进行高精度的面积和周长提取。我们已在四个具有不同地形特征的岛屿上全面验证了该流程(涵盖自然地貌岛屿和具有复杂人工设施的岛屿)。实验结果表明,系统的最终测量误差稳定在约10\%,展示了卓越的准确性和鲁棒性。此外,该框架具有出色的推理速度,仅需70毫秒即可处理单张高分辨率图像并生成点云,为大规模海洋和海岸线测量提供了一种高度实用的新范式。
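The scale-recovery step named in the abstract above (Umeyama alignment) can be illustrated with a minimal sketch. It assumes matched pairs of estimated camera positions (arbitrary reconstruction units) and reference positions (metres); the function name `umeyama_alignment` and the synthetic data are hypothetical and are not taken from the paper's pipeline.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of matched points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard: force a proper rotation with det(R) = +1.
    S = np.eye(src.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Synthetic check: apply a known similarity transform and recover it.
rng = np.random.default_rng(0)
est = rng.normal(size=(50, 3))                 # reconstruction-frame positions
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0                            # make it a proper rotation
true_s, true_t = 12.5, np.array([100.0, -40.0, 30.0])
ref = true_s * est @ Q.T + true_t              # metric-frame positions
s, R, t = umeyama_alignment(est, ref)
```

Once the scale `s` is known, lengths measured in the reconstruction frame (such as a coastline perimeter) scale by `s` and areas by `s ** 2`.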
cs.CV / 13 / 2605.11276

Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences

通过合成图像和时间序列可视化高速公路施工危险的生成性人工智能
Neece, Trevor, Smetana, Mason, Khazanovich, Lev
Abstract
Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.
Chinese Translation
高速公路施工工人面临严重伤害或死亡的高风险。基于图像的培训材料描绘危险场景,对于吸引安全培训至关重要,但由于伦理和后勤障碍,这类材料仍然稀缺。本研究开发并评估了一种生成性人工智能方法,旨在从职业安全健康管理局(OSHA)严重伤害报告叙述中生成高速公路施工危险的合成可视化图像。研究开发了两种模式:单次生成模式,每个事件生成一幅图像;以及时间序列模式,生成四个阶段的序列。通过75个事件记录生成了750幅图像,使用基于CLIP的语义检索和专家评估在教育效用、真实度和一致性等维度上进行评估。单次生成的图像在教育可接受性上达到了81.1%,真实度和一致性评分分别为4.14/5和4.07/5,而时间序列的可接受性为60.9%,一致性评分为3.94/5,但真实度较低(3.51/5)。基于CLIP的检索结果显示,两种模式生成的图像在统计上具有显著的检索能力。这是首批利用现代自回归图像生成模型可视化报告的严重伤害施工危险的研究之一,并生成时间序列的危险图像,同时开发了一个新的多维评估框架,以支持未来在该领域的研究。这项工作使安全培训师能够将叙事故事与视觉学习材料结合,而无需拍摄真实世界的危险场景,该框架也可应用于各个领域的数据集,从而实现针对新应用领域的合成图像生成。
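The abstract above does not specify the retrieval metric used, so the following is only a sketch of one common way to score embedding-based retrieval: recall@k over cosine similarities. It assumes row `i` of the text embeddings is paired with row `i` of the image embeddings; the function name `recall_at_k` and the toy embeddings are hypothetical, and no CLIP model is loaded here.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=5):
    """Fraction of captions whose paired image ranks in the top k by cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T                           # (N, N) cosine similarities
    true_sim = np.diag(sims)[:, None]        # similarity to the paired image
    ranks = (sims > true_sim).sum(axis=1)    # 0 means the paired image is ranked first
    return float((ranks < k).mean())

# Toy embeddings: perfectly aligned pairs versus pairs shifted by one position.
aligned = recall_at_k(np.eye(4), np.eye(4), k=1)
shifted_top1 = recall_at_k(np.eye(4), np.roll(np.eye(4), 1, axis=0), k=1)
shifted_top2 = recall_at_k(np.eye(4), np.roll(np.eye(4), 1, axis=0), k=2)
```

For N candidate images, random retrieval gives an expected recall@k of about k/N, which is the natural baseline for a significance test.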
cs.CV / 14 / 2605.11300

Can Graphs Help Vision SSMs See Better?

图形能帮助视觉状态空间模型更好地看见吗?
Parikh, Dhruv, Ramachandran, Anvitha, Fan, Haoyang, Munir, Mustafa, Kannan, Rajgopal, Prasanna, Viktor
Abstract
Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: \emph{can graphs help vision state space models see better?} We introduce \textbf{GraphScan}, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, \textbf{GraphScan-Mamba} achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.
Chinese Translation
视觉状态空间模型继承了Mamba风格选择性扫描的高效性和长距离建模能力。然而,它们的性能在很大程度上依赖于将二维视觉特征表示为一维令牌序列。现有的扫描操作符从预定义的几何遍历到动态坐标基础的采样器,后者通过预测的偏移和插值重新路由令牌。尽管这些机制有效,但它们主要调整路径或采样位置,而不是明确建模哪些局部区域应该在全局状态空间混合之前交换信息。这引发了一个简单的问题:\textit{图形能帮助视觉状态空间模型更好地看见吗?} 我们引入了\textbf{GraphScan},一个针对视觉状态空间模型的图形诱导动态扫描操作符。对于每个令牌,GraphScan构建一个空间限制的局部图,学习具有相对位置偏差的特征条件亲和性,并通过对其语义邻域进行一步消息传递生成输出令牌。生成的令牌在被选择性状态空间模型进行全局聚合之前是局部固定的。GraphScan保持令牌数量和图像大小的线性缩放,同时用特征条件的语义路由替代坐标条件的插值。集成到层次骨干网络中,\textbf{GraphScan-Mamba}在图像分类、目标检测、实例分割和语义分割等视觉状态空间模型中实现了最先进的性能,且计算开销适中。我们的分析进一步表明,GraphScan在令牌格上诱导可解释的位移场,提供了动态扫描的语义和空间基础视图。这些结果表明,未来的视觉状态空间模型应将扫描视为不仅仅是几何序列化,而是作为全局状态空间建模之前学习的局部语义路由。
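The per-token procedure described in the abstract above (spatially bounded local graph, feature-conditioned affinities with a relative positional bias, one step of message passing) can be sketched roughly as follows. This is not the authors' implementation: the dot-product affinity, the square window of size `radius`, the optional `pos_bias` table, and the function name are all assumptions, and the learned projections of a real layer are omitted.

```python
import numpy as np

def local_graph_message_passing(x, H, W, radius=1, pos_bias=None):
    """One-step message passing over a bounded neighborhood of each token.

    x: (H*W, C) token features laid out row-major on an H x W lattice.
    pos_bias: optional (2*radius+1, 2*radius+1) table of relative-position biases.
    Returns an array of the same shape, so the token count is preserved.
    """
    feats = np.asarray(x, dtype=float).reshape(H, W, -1)
    C = feats.shape[-1]
    out = np.empty_like(feats)
    for i in range(H):
        for j in range(W):
            # Spatially bounded neighborhood, clipped at the image border.
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            nbrs = feats[i0:i1, j0:j1].reshape(-1, C)
            # Feature-conditioned affinity (assumed: scaled dot product).
            logits = nbrs @ feats[i, j] / np.sqrt(C)
            if pos_bias is not None:
                di = np.arange(i0, i1) - i + radius
                dj = np.arange(j0, j1) - j + radius
                logits = logits + pos_bias[np.ix_(di, dj)].reshape(-1)
            w = np.exp(logits - logits.max())
            w /= w.sum()
            # Aggregate neighbor features with the normalized affinities.
            out[i, j] = w @ nbrs
    return out.reshape(H * W, C)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4 * 5, 8))
routed = local_graph_message_passing(tokens, H=4, W=5, radius=1,
                                     pos_bias=np.zeros((3, 3)))
```

Each token touches at most `(2 * radius + 1) ** 2` neighbors, so the cost grows linearly with the number of tokens, which is consistent with the linear-scaling claim in the abstract.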
cs.CV / 15 / 2605.11304

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

CheXTemporal:一个用于胸部放射影像时间基础推理的数据集
Prakash, Eva, Gao, Yunhe, Wang, Chong, Xu, Justin, Prakash, Neal, Michalson, Arne, Dehkharghani, Seena, Hong, Eun Kyoung, Bauml, Julie, Boodoo, Roger, Delbrouck, Jean-Benoit, Ostmeier, Sophie, Langlotz, Curtis
Abstract
Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.
Chinese Translation
胸部放射影像的解读需要对先前和当前研究进行时间推理,然而大多数视觉-语言模型是在静态图像-报告对上训练的,缺乏对纵向变化建模的明确监督。我们引入了CheXTemporal,一个用于胸部放射影像时间基础推理的数据集,包含配对的先前-当前胸部X光片(CXR),并附有发现级别的时间和空间注释。该数据集包括五类进展分类(新出现、恶化、稳定、改善、解决),病理的局部空间监督,配对研究之间的明确空间-时间对齐,以及跨领域评估的多源覆盖。此外,我们还构建了一个包含28万对的银数据集,具有自动生成的时间和解剖监督,以便在较弱监督下进行大规模评估。利用这些资源,我们在零样本设置下评估了多种最先进的视觉-语言CXR模型在基础和进展分类任务上的表现。在金标准和银标准评估中,当前模型在空间基础、细粒度时间推理和分布转移下的鲁棒性方面表现出一致的局限性。特别是,模型在显著的进展类别(如恶化)上的表现明显优于在时间上微妙的状态(如稳定和解决),这表明在胸部放射影像中对纵向疾病演变的建模能力有限。
cs.CV / 16 / 2605.11307

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Vision2Code:用于评估图像到代码生成的多领域基准
Periasami, Ajay Vikram, Wang, Junlin, Dhingra, Bhuwan
Abstract
Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.
Chinese Translation
图像到代码生成测试视觉语言模型(VLM)是否能够恢复图像的结构,以便将其表达为可执行代码。现有的基准要么专注于狭窄的视觉领域,要么依赖于成对的可执行参考代码,或者依赖于通用的评分标准,这些标准无法捕捉特定领域的重建错误。我们引入了Vision2Code,这是一个无参考代码的基准和评估框架,用于多领域的图像到代码生成。Vision2Code包含来自15个源数据集的2,169个测试示例,涵盖图表和绘图、几何、图形、科学图像、文档和3D空间场景。模型生成可执行程序,我们使用具有数据集特定评分标准和针对严重语义失败的确定性保护措施的VLM评估者对生成的程序进行渲染和评分。我们报告渲染成功的诊断,区分代码执行失败与重建质量。人工验证表明,这一评估协议与人类判断的吻合度优于通用视觉评分标准或嵌入相似性基准。在九个开放权重和专有模型中,我们发现图像到代码的性能是领域依赖的:领先模型在常规图表和图形视觉上表现良好,但在空间场景、化学、文档和电路风格图表上仍然较弱。最后,我们展示了经过评估者筛选的模型输出可以作为训练数据,以提高图像到代码的能力,Qwen3.5-9B在没有配对源程序的情况下,在基准上从1.60提高到1.86。Vision2Code提供了一个可重复的测试平台,用于测量、诊断和改善图像到代码生成。我们的代码和数据可在https://image2code.github.io/vision2code/上公开获取。
cs.CV / 17 / 2605.11314

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

从单视角视频中提取的3D无标记运动学量化Rodda和Graham步态分类在异质性儿童临床队列中的应用
Reddy, Lauhitya, Donahue, Seth, Bauer, Jeremy, Sienko, Susan, Bagley, Anita, Krzak, Joseph, Eveld, Maura, Kruger, Karen, Chafetz, Ross, Kulkarni, Vedant, Kwon, Hyeokhyen
Abstract
Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
Chinese Translation
脑性瘫痪(CP)是一种运动神经系统疾病,是儿童时期最常见的终身身体残疾原因。大约75%的脑性瘫痪儿童能够行走,准确的步态评估对于维持行走功能至关重要,而在四分之一到一半的脑性瘫痪成年人中,这种功能在中年时会恶化。Rodda和Graham分类系统通过使用来自3D仪器步态分析(3D-IGA)的踝关节和膝关节z分数来量化矢状面步态偏差,但3D-IGA成本高且仅限于专业中心,而观察性评估的评分者间一致性仅为中等。我们开发了一种无标记步态分析流程,能够直接从单视角临床步态视频中量化Rodda和Graham的膝关节和踝关节z分数。在来自152名儿童(88名男性,63名女性;年龄12.1 ± 4.0岁;60种不同的主要诊断,其中脑性瘫痪最为常见,样本量为54)的529个试验中,共计1058个双侧肢体样本,矢状面模型在膝关节z分数方面达到了R² = 0.80 ± 0.02和CCC = 0.89 ± 0.02,而在踝关节z分数方面则为R² = 0.57 ± 0.02和CCC = 0.72 ± 0.02,与3D-IGA相比。对过度膝关节屈曲的二元筛查达到了AUROC = 0.88,正确识别了83%的受影响儿童,应用Rodda和Graham规则的7类准确率为43 ± 1%,宏观AUROC为0.78 ± 0.01,踝关节预测误差仍然是主要瓶颈。除了横断面筛查,连续的z分数支持跨访次的纵向轨迹跟踪,为监测疾病进展和治疗反应提供了定量基础,这在观察性量表中是无法获得的。这些结果展示了基于视频的z分数估计、过度屈曲筛查和纵向轨迹跟踪的可行性,为在资源有限的临床环境中实现可扩展、客观的步态评估铺平了道路。
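The agreement metrics reported in the abstract above, $R^2$ and the concordance correlation coefficient (CCC), can be computed as in the following minimal sketch. It uses Lin's standard CCC definition with population variances; the function names and the toy z-score values are illustrative assumptions, not data from the study.

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    # Penalizes both scatter and systematic offset between the two series.
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against reference values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

reference = [1.0, 2.0, 3.0]    # toy reference z-scores
biased = [2.0, 3.0, 4.0]       # perfectly correlated but offset by +1
ccc_biased = concordance_cc(reference, biased)   # 4/7
r2_biased = r_squared(reference, biased)         # -0.5
```

The toy case shows why both metrics are reported alongside correlation: a constant offset leaves Pearson correlation at 1 but lowers both CCC and $R^2$.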
cs.CV / 18 / 2605.11354

Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

Lite3R:一种模型无关的高效前馈三维重建框架
Zhang, Haoyu, Zhang, Zeyu, Zhou, Zedong, Zhao, Yang, Tang, Hao
Abstract
Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.
Chinese Translation
基于Transformer的三维重建已成为从多视角观测中恢复几何形状和外观的强大范式,在具有挑战性的视觉条件下表现出色。随着这些模型向更大的骨干网络和更高分辨率输入的扩展,提高其效率在实际应用中变得越来越重要。然而,现代三维Transformer管道面临两个相互关联的挑战:密集的多视角注意力造成了显著的标记混合开销,而低精度执行可能会使几何敏感的表示不稳定,并降低深度、姿态和三维一致性。为了解决第一个挑战,我们提出了Lite3R,这是一种模型无关的教师-学生框架,通过用稀疏线性注意力替代密集注意力,保留重要的几何交互,同时降低注意力成本。为了解决第二个挑战,我们引入了一种参数高效的FP8感知量化训练(FP8-aware QAT)策略,结合部分注意力蒸馏,该策略冻结了绝大多数预训练骨干参数,仅训练轻量级线性分支投影层,从而在保持预训练几何先验的同时,实现稳定的低精度部署。我们进一步在两个代表性骨干网络VGGT和DA3-Large上对Lite3R进行了评估,使用BlendedMVS和DTU64数据集,结果显示其显著降低了延迟(1.7-2.0倍)和内存使用(1.9-2.4倍),同时整体保持了竞争力的重建质量。这些结果表明,Lite3R为基于Transformer的实际三维重建提供了一种有效的算法-系统协同设计方法。代码:https://github.com/AIGeeksGroup/Lite3R。网站:https://aigeeksgroup.github.io/Lite3R。
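To illustrate why replacing dense attention reduces token-mixing cost, here is a sketch of generic kernelized linear attention. It is not the Sparse Linear Attention used in Lite3R, whose details are not given in the abstract above; the `elu(x) + 1` feature map and the function names are assumptions chosen for illustration.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map often used for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """q, k: (N, d); v: (N, d_v). Cost is O(N * d * d_v), linear in N."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v               # (d, d_v) summary of all keys and values
    z = phi_k.sum(axis=0)          # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same operator computed with an explicit (N, N) attention matrix."""
    a = feature_map(q) @ feature_map(k).T
    return (a / a.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
fast = linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
```

Both functions give the same result; the linear form simply reorders the matrix products so that no N x N matrix is ever materialized.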
cs.CV / 19 / 2605.11363

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

PresentAgent-2:迈向通用多模态演示代理
Wu, Wei, Xu, Ziyang, Zhang, Zeyu, Zhao, Yang, Tang, Hao
Abstract
Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.
Chinese Translation
演示生成正从静态幻灯片创作向端到端的演示视频生成转变,涵盖研究基础、多模态媒体和互动交付。我们介绍了PresentAgent-2,这是一个从用户查询生成演示视频的代理框架。给定一个开放式用户查询和选定的演示模式,PresentAgent-2首先将查询总结为一个聚焦主题,并对适合演示的来源进行深入研究,以收集多模态资源,包括相关文本、图像、GIF和视频。然后,它构建演示幻灯片,生成特定模式的脚本,并将幻灯片、音频和动态媒体组合成完整的演示视频。PresentAgent-2在统一框架内支持三种独立的演示模式:单一演示(Single Presentation),生成单讲者叙述的演示视频;讨论(Discussion),创建具有结构化讲者角色的多讲者演示,例如用于提出引导性问题、解释概念、澄清细节和总结要点;以及互动(Interaction),独立支持回答观众问题,基于生成的幻灯片、脚本、检索的证据和演示上下文。为了评估这些能力,我们建立了一个多模态演示基准,涵盖单一演示、讨论和互动场景,并为内容质量、媒体相关性、动态媒体使用、对话自然性和互动基础制定了特定任务的评估标准。总体而言,PresentAgent-2将演示生成从依赖文档的幻灯片创作扩展到基于查询、研究基础的多模态媒体演示视频生成,结合对话和互动。代码:https://github.com/AIGeeksGroup/PresentAgent-2。网站:https://aigeeksgroup.github.io/PresentAgent-2。
cs.CV / 20 / 2605.11367

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

3D-Belief:通过生成3D世界建模进行具身信念推理
Yin, Yifan, Wen, Zehao, Chen, Jieneng, Zheng, Zehan, Dai, Nanru, Shi, Haojun, Ye, Suyu, Huang, Aydan, Zhang, Zheyuan, Yuille, Alan, Xie, Jianwen, Tewari, Ayush, Shu, Tianmin
Abstract
Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.
Chinese Translation
近期视觉生成模型的进展突显了学习生成世界模型的潜力。然而,大多数现有方法将世界建模视为新视图合成或未来帧预测,强调视觉现实性,而非具身代理在部分可观察性下所需的结构化不确定性。在本研究中,我们提出了一个不同的视角:将世界建模视为在3D空间中的具身信念推理。从这个角度来看,世界模型不仅应渲染可能被观察到的内容,还应在获取新观察时维护和更新代理对未观察到的3D世界的信念。我们确定了此类模型的几个关键能力,包括空间一致的场景记忆、多假设信念采样、序列信念更新以及对未见区域的语义信息预测。我们在3D-Belief中实现了这些思想,这是一种生成的3D世界模型,能够从部分观察中推断出明确的、可操作的3D信念,并随着时间的推移在线更新。与先前的视觉预测模型不同,3D-Belief直接在3D中表示不确定性,使得具身代理能够想象合理的场景补全并在部分观察的环境中进行推理。我们在2D视觉质量方面评估了3D-Belief,针对场景记忆和未观察场景的想象、使用我们提出的3D-CORE基准进行对象和场景级的3D想象,以及在模拟和现实世界中进行的挑战性对象导航任务。实验表明,与最先进的方法相比,3D-Belief提高了2D和3D想象质量以及下游具身任务的表现。
cs.CV / 21 / 2605.11369

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

通过混合预训练模块控制器实现物体交互的动态全身运动代理
Nam, Sanghyeok, Kim, Byoungjun, Park, Daehyung, Kim, Tae-Kyun
Abstract
Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly because existing HOI datasets are limited to static interactions and pretrained agents are capable of either dynamic full-body motions without objects or static HOI motions. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, yet are limited to either static or short-term contacts, e.g. striking. In this work, we propose a framework that achieves dynamic and long-term interaction motions, such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation, which yields planned dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Compared with relevant prior art, our method consistently improves success rates while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance with significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.
Chinese Translation
生成物体交互(HOI)的物理上合理的动态运动仍然具有挑战性,主要是因为现有的HOI数据集仅限于静态交互,以及预训练的代理只能实现动态全身运动而不涉及物体或静态HOI运动。近期的研究如InsActor和CLoSD在规划和执行阶段生成HOI运动,但仍然局限于静态或短期接触,例如打击。在本研究中,我们提出了一个框架,能够实现动态和长期交互运动,例如在持有桌子的同时奔跑,通过在规划和执行阶段结合预训练的运动先验和模仿代理。在规划阶段,我们利用预训练的人体运动扩散模型的动态先验来增强HOI数据集,随后生成物体轨迹。这一过程规划了动态HOI序列。在执行阶段,一个组合网络混合了专注于动态人类运动或静态HOI运动的预训练模仿代理的动作,从而实现其互补技能的时空组合。我们的方法在相关的先前研究中始终提高了成功率,同时保持了动态HOI任务的交互。此外,将预训练专家与我们的组合器混合能够在显著减少训练时间的情况下实现竞争性能。消融研究验证了我们增强和组合混合的有效性。
cs.CV / 22 / 2605.11383

HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels

HamBR:基于哈密顿动力学的主动决策边界恢复用于带噪标签的学习
Peng, Ningkang, Mao, Jingyang, Yu, Qianfeng, Peng, Xiaoqian, Ma, Peirong, Gu, Yanhui
Abstract
In large-scale visual recognition and data mining tasks, the presence of noisy labels severely undermines the generalization capability of deep neural networks (DNNs). Prevalent sample selection methods rely primarily on training loss or prediction confidence for passive screening. However, within a feature space degraded by noise, decision boundaries undergo systematic boundary collapse. This phenomenon hinders the ability of the model to distinguish between hard clean samples and noisy samples at the decision margins, thereby creating a significant performance bottleneck. This study is the first to emphasize the pivotal importance of active boundary restoration for noise-robust learning. We propose HamBR, a novel paradigm based on Hamiltonian dynamics. The core approach leverages the Spherical Hamiltonian Monte Carlo (Spherical HMC) mechanism to actively probe inter-class ambiguous regions within the representation space and synthesize high-quality virtual outliers. By imposing explicit repulsion constraints via energy-based modeling, these synthesized samples establish robust energy barriers at the decision boundaries. This mechanism forces real samples to move from dispersed overlapping regions toward their respective class centers, thereby restoring the discriminative sharpness of the decision boundaries. HamBR demonstrates exceptional versatility and can be integrated as a plug-and-play defense module into existing semi-supervised noisy label learning frameworks. Empirical evaluations show that the proposed paradigm significantly enhances the discriminative accuracy of hard boundary samples, achieving state-of-the-art (SOTA) performance on CIFAR-10/100 and real-world noise benchmarks. Furthermore, it exhibits superior convergence efficiency and reliable robustness, while significantly improving the model's capability for Out-of-Distribution (OOD) detection.
Chinese Translation
在大规模视觉识别和数据挖掘任务中,噪声标签的存在严重削弱了深度神经网络(DNN)的泛化能力。普遍的样本选择方法主要依赖于训练损失或预测置信度进行被动筛选。然而,在被噪声降解的特征空间中,决策边界经历系统性的边界崩溃。这一现象妨碍了模型在决策边缘区分困难的干净样本和噪声样本的能力,从而造成显著的性能瓶颈。本研究首次强调了主动边界恢复在抗噪声学习中的关键重要性。我们提出了HamBR,一种基于哈密顿动力学的新范式。核心方法利用球面哈密顿蒙特卡罗(Spherical Hamiltonian Monte Carlo, Spherical HMC)机制,主动探测表示空间中的类间模糊区域,并合成高质量的虚拟异常值。通过施加基于能量建模的显式排斥约束,这些合成样本在决策边界上建立了稳健的能量屏障。该机制迫使真实样本从分散的重叠区域移动到各自的类别中心,从而恢复决策边界的判别清晰度。HamBR展示了卓越的通用性,可以作为即插即用的防御模块集成到现有的半监督噪声标签学习框架中。实证评估表明,所提出的范式显著提升了困难边界样本的判别准确性,在CIFAR-10/100和真实世界噪声基准上达到了最先进的(SOTA)性能。此外,它展现出优越的收敛效率和可靠的鲁棒性,同时显著提升了模型的分布外(Out-of-Distribution, OOD)检测能力。
cs.CV / 23 / 2605.11385

JACoP: Joint Alignment for Compliant Multi-Agent Prediction

JACoP:符合性多智能体预测的联合对齐
Liu, Qingze, Mrdovic, Alen, Li, Danrui, Schwartz, Mathew, Yoon, Sejong, Kapadia, Mubbasir
Abstract
Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state-of-the-art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real-world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi-Agent Prediction, an innovative multi-stage framework that ensures scene-level plausibility. JACoP incorporates an Anchor-Based Agent-Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter-agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.
Chinese Translation
基于生成建模的随机人类轨迹预测(HTP)已成为一个重要的研究领域。尽管最先进的模型在优化单个智能体的准确性方面表现出色,但它们在生成集体符合的预测时常常面临挑战,导致输出轨迹受到社会碰撞和环境违规的影响,从而使其在实际应用中变得不切实际。为了解决这一问题,我们提出了JACoP:符合性多智能体预测的联合对齐,这是一个创新的多阶段框架,确保场景级的合理性。JACoP结合了一种基于锚点的智能体中心分析器,用于有效的初步符合性过滤,并采用基于马尔可夫随机场(MRF)的对齐器来形式化场景预测的联合选择。通过将智能体间的空间和社会成本表示为MRF能量势,我们成功地推断并从联合轨迹分布中采样,实现了最佳场景符合性的预测。全面的实验表明,JACoP不仅在准确性上具有竞争力,而且在减少环境违规和社会碰撞方面树立了新的标准,从而确认其能够生成集体可行且实际适用的轨迹预测。
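The joint-selection idea in the JACoP abstract above (inter-agent costs expressed as MRF energy potentials) can be illustrated with a toy sketch. This is not the authors' aligner: the exhaustive search, the `collides` test, the `collision_penalty` value, and the candidate trajectories are all assumptions made for illustration, and a real MRF solver would replace the brute-force loop.

```python
import math
from itertools import product

def collides(traj_a, traj_b, radius=0.5):
    """True if two agents come closer than `radius` at any shared time step."""
    return any(math.dist(p, q) < radius for p, q in zip(traj_a, traj_b))

def joint_select(candidates, unary, collision_penalty=10.0):
    """Pick one candidate per agent by minimising unary + pairwise energy.

    candidates[a][i] is the i-th candidate trajectory of agent a,
    unary[a][i] is its individual cost (lower means more likely).
    """
    best, best_energy = None, math.inf
    for choice in product(*(range(len(c)) for c in candidates)):
        energy = sum(unary[a][i] for a, i in enumerate(choice))
        for a in range(len(choice)):
            for b in range(a + 1, len(choice)):
                if collides(candidates[a][choice[a]], candidates[b][choice[b]]):
                    energy += collision_penalty  # pairwise (social) potential
        if energy < best_energy:
            best, best_energy = choice, energy
    return best, best_energy

# Hypothetical scene: the individually cheapest candidates meet head-on at t=1.
candidates = [
    [[(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)]],    # agent 0
    [[(2, 0), (1, 0), (0, 0)], [(2, 0), (2, -1), (2, -2)]],  # agent 1
]
unary = [[0.0, 1.0], [0.0, 0.5]]
choice, energy = joint_select(candidates, unary)
print(choice, energy)
```

Picking each agent's cheapest candidate independently would pair the two head-on trajectories; the pairwise term makes the joint optimum switch agent 1 to its second candidate.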
cs.CV / 24 / 2605.11424

VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors

VidSplat:基于几何引导的视频扩散先验的高斯溅射重建
Tang, Jimin, Zhang, Wenyuan, Zhou, Junsheng, Huang, Zian, Shi, Kanle, Xu, Shenkun, Liu, Yu-Shen, Han, Zhizhong
Abstract
Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only a few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D-consistent generation, we devise a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence-weighted refinement. VidSplat is robust to sparse inputs and even to a single image. Extensive experiments on widely used benchmarks demonstrate our superior performance in sparse-view scene reconstruction.
Chinese Translation
高斯溅射在多视角表面重建方面取得了显著进展,但在仅有少数视角可用时表现出明显的退化。尽管近期的努力通过增强多视角一致性来生成合理的表面,从而缓解了这一问题,但它们在推断输入覆盖范围之外的未见、遮挡或弱约束区域时仍然面临困难。为了解决这一局限性,我们提出了VidSplat,这是一种无训练的生成重建框架,利用强大的视频扩散先验,迭代合成补偿缺失输入覆盖的新视角,从而从稀疏输入中恢复完整的3D场景。具体而言,我们解决了两个关键挑战,以有效整合生成与重建。首先,为了实现3D一致性生成,我们详细阐述了一种无训练的分阶段去噪策略,该策略利用渲染的RGB和掩膜图像自适应地引导去噪方向朝向基础几何。其次,为了增强重建效果,我们开发了一种迭代机制,采样相机轨迹,探索未观察区域,合成新视角,并通过置信加权细化补充训练。VidSplat对稀疏输入甚至单张图像表现出强大的鲁棒性。在广泛使用的基准测试上进行的大量实验表明,我们在稀疏视角场景重建方面的性能优于其他方法。
cs.CV / 25 / 2605.11427

PD-4DGS: Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming

PD-4DGS:用于带宽自适应动态场景流媒体的4D高斯喷溅的渐进分解
Li, Jiachen, Han, Guangzhi, Wan, Jin, Han, Delong, Gao, Yuan, Li, Min, Zhou, Mingle, Li, Gang
Abstract
4D Gaussian Splatting (4DGS) enables high-quality dynamic novel view synthesis, yet current models remain monolithic bitstreams that clients must download in full before any frame can be rendered, causing black-screen waits of tens to hundreds of seconds on mobile bandwidth and leaving 4DGS incompatible with modern adaptive-bitrate delivery. Progressive 3DGS compression alleviates this for static scenes, but it acts only on spatial anchors and cannot partition the temporal deformation networks that dominate dynamic-scene size. We present PD-4DGS, the first framework for progressive compression and on-demand transmission of 4DGS. Hierarchical Deformation Decomposition (HDD) externalises the coarse-to-fine motion hierarchy already latent in 4DGS into three independently transmittable layers -- a static scaffold, a global deformation, and a local refinement -- so that any prefix of the bitstream is already renderable, turning a single training run into a scalable, DASH/HLS-compatible bitstream. A Gaussian-entropy attribute rate-distortion loss together with a temporal mask consistency regulariser shrink the base layer while suppressing low-bitrate flicker; a capacity-weighted rollout schedule, gated online by a learnt activation rate rho, then prevents deformation-network under-training without any per-scene hyperparameter. On the Dycheck iPhone benchmark, PD-4DGS cuts the streamed bitstream by >60% at matched rendering fidelity and reduces first-frame latency from 73--930 s to ~1.7 s on a 2 Mbps link, uniquely enabling true on-demand progressive streaming for 4DGS.
Chinese Translation
4D高斯喷溅(4DGS)能够实现高质量的动态新视图合成,但当前模型仍然是单一的比特流,客户端必须在渲染任何帧之前完全下载,导致在移动带宽下出现数十到数百秒的黑屏等待,并使得4DGS与现代自适应比特率传输不兼容。渐进式3DGS压缩缓解了静态场景的问题,但它仅作用于空间锚点,无法划分主导动态场景大小的时间变形网络。我们提出了PD-4DGS,这是第一个用于4DGS的渐进压缩和按需传输的框架。层次变形分解(HDD)将4DGS中潜在的粗到细运动层次外部化为三个独立可传输的层次——一个静态支架、一个全局变形和一个局部细化——使得比特流的任何前缀都可以被渲染,从而将单次训练转化为可扩展的、兼容DASH/HLS的比特流。高斯熵属性率失真损失结合时间掩码一致性正则化在抑制低比特率闪烁的同时缩小基础层;然后通过学习到的激活率rho进行门控在线的容量加权展开计划,防止变形网络的欠训练,而无需每个场景的超参数。在Dycheck iPhone基准测试中,PD-4DGS在匹配渲染保真度的情况下将流式比特流减少了超过60%,并将首帧延迟从73-930秒降低到约1.7秒,在2 Mbps的链接上,独特地实现了4DGS的真正按需渐进流媒体。
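As a rough sanity check on the latency figures quoted in the PD-4DGS abstract above, the sketch below converts a first-frame latency into an implied download size. It assumes (this is an assumption, not a claim from the abstract) that latency is dominated by transfer time on the 2 Mbps link; the function name is hypothetical.

```python
def implied_megabytes(latency_s: float, link_mbps: float) -> float:
    """Data (in megabytes) that passes through a link of `link_mbps`
    megabits per second in `latency_s` seconds."""
    return latency_s * link_mbps / 8.0  # 8 bits per byte

# Latencies quoted in the abstract, all at 2 Mbps.
for latency in (1.7, 73.0, 930.0):
    size = implied_megabytes(latency, 2.0)
    print(f"{latency:6.1f} s at 2 Mbps is about {size:.3f} MB")
```

Under that assumption, the reported drop in first-frame latency corresponds to the client needing only a small first layer before rendering can start, rather than the whole model.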
cs.CV / 26 / 2605.11430

Diabetic Retinopathy Classification using Downscaling Algorithms and Deep Learning

基于降尺度算法和深度学习的糖尿病视网膜病变分类
Doshi, Nishi, Oza, Urvi, Kumar, Pankaj
Abstract
Diabetic Retinopathy (DR) analysis is the art and science of recording and classifying the retinal images of a diabetic patient. DR classification deals with classifying retinal fundus images into five stages on the basis of severity. One of the major issues faced when dealing with the DR classification problem is the large and varying size of the images. In this paper, we propose and explore the use of several downscaling algorithms before feeding the image data to a deep learning network for classification. To improve training and testing, we amalgamate two datasets: Kaggle and the Indian Diabetic Retinopathy Image Dataset. Our experiments have been performed on a novel Multi-Channel Inception V3 architecture with a unique self-crafted preprocessing phase. We report results of the proposed approach using accuracy, specificity and sensitivity, which outperform the previous state-of-the-art methods. Index Terms: Diabetic Retinopathy, Downscaling Algorithms, Multichannel CNN Architecture, Deep Learning
Chinese Translation
糖尿病视网膜病变(Diabetic Retinopathy, DR)是记录和分类糖尿病患者视网膜图像的一门艺术和科学。DR 分类涉及根据糖尿病的严重程度将视网膜底部图像分为五个阶段。在处理 DR 分类问题时,面临的主要问题之一是图像的大小大且变化多端。本文提出并探讨在将图像数据输入深度学习网络进行分类之前使用几种降尺度算法。为了改善训练和测试,我们结合了两个数据集:Kaggle 数据集和印度糖尿病视网膜病变图像数据集。我们的实验在一种新颖的多通道 Inception V3 架构上进行,并采用独特的自定义预处理阶段。我们报告了所提方法的结果,使用准确率、特异性和敏感性指标,超越了之前的最先进方法。关键词:糖尿病视网膜病变、降尺度算法、多通道卷积神经网络架构、深度学习
cs.CV / 27 / 2605.11435

ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models

ZeroIDIR:基于零参考的照明退化图像恢复与扰动一致性扩散模型
Jiang, Hai, Liu, Zhen, Lei, Yinjie, Han, Songchen, Zeng, Bing, Liu, Shuaicheng
Abstract
In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at https://github.com/JianghaiSCU/ZeroIDIR.
Chinese Translation
在本文中,我们提出了一种基于零参考的扩散框架,命名为ZeroIDIR,用于照明退化图像的恢复,该框架将恢复过程解耦为自适应照明校正和基于扩散的重建,同时仅在低质量退化图像上进行训练。具体而言,我们设计了一个自适应伽马校正模块,该模块执行空间变化的曝光校正,以生成仅经过照明校正的表示,从而减轻曝光偏差,并作为后续扩散过程的可靠输入,其中引入了一种基于直方图的照明校正损失,以规范化校正后的照明分布,使其趋向于自然场景的分布。随后,经过照明校正的图像被视为中间噪声状态,以供所提出的扰动一致性扩散模型重建细节并抑制噪声。此外,提出了一种扰动扩散一致性损失,以约束最终恢复图像的前向扩散轨迹与扰动状态保持一致,从而在缺乏监督的情况下提高恢复的保真度和稳定性。在公开可用的基准测试上进行的广泛实验表明,所提出的方法优于最先进的无监督竞争者,并且在性能上可与监督方法相媲美,同时对各种场景具有更好的泛化能力。代码可在 https://github.com/JianghaiSCU/ZeroIDIR 获取。
cs.CV / 28 / 2605.11438

Beyond Masks: The Case for Medical Image Parsing

超越掩膜:医学图像解析的案例
Gupta, Siddharth, Yuille, Alan L., Zhou, Zongwei
Abstract
Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.
Chinese Translation
医学影像研究在过去十年里在一件事情上取得了很大的进展:生成每个体素的掩膜。掩膜告诉我们大小、体积和位置,而十年的临床基础设施建立在这些输出之上。然而,放射科医生撰写的报告几乎没有任何掩膜能够表达的内容。我们认为医学影像研究应将医学图像解析作为其核心输出:一种结构化表示,其中实体、属性和关系共同发出并相互一致。实体是命名的结构和发现,可能存在也可能缺失。属性描述这些实体,捕捉诸如边缘规则性、增强模式或严重程度等级等信息。关系将它们连接起来,命名一个结构相对于另一个结构的位置、相邻关系以及自上次扫描以来的变化。一个良好的解析满足三个属性,依次为:(1)决策(解析在当前图像中命名了正确的事物),(2)重建(其内容足够丰富以再生该图像),以及(3)预测(其内容足够丰富以预测患者状态的演变)。定量测量源自这些内容;它们并不是与内容一起预测的。为了测试该领域在生成这样的输出方面的接近程度,我们对十一种代表性系统进行了审计,评估其与三种解析原语及闭合的符合程度。没有一个系统发出良好形成的解析。实体大部分已得到解决。属性、关系和闭合仍然几乎为空。前进的道路不是新的架构,而是对更丰富输出的承诺,以及奖励这种输出的训练信号。分割教会模型如何测量,而解析则要求它们进行解释。
cs.CV / 29 / 2605.11439

Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment

Instruct-ICL:面向灾后损害评估的指令引导上下文学习
Zarbaft, Armin, Karimi, Ehsan, Le, Nhut, Rahnemoonfar, Maryam
Abstract
Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.
Chinese Translation
在自然灾害期间,快速而准确的态势感知对于有效应对至关重要,分析中的延迟可能显著妨碍决策。为灾后评估训练特定任务模型通常耗时且计算成本高,使得在时间紧迫的情况下,这种方法变得不切实际。因此,预训练的多模态大型语言模型(MLLMs)作为灾后视觉问答(VQA)的有希望的替代方案应运而生,该任务旨在通过对图像和文本进行联合推理来回答有关视觉场景的结构化问题。尽管这些模型展示了强大的多模态推理能力,但它们的响应可能对提示的表述敏感,这可能限制其在现实灾害评估场景中的可靠性。本文探讨了结构化推理策略是否可以提高预训练MLLMs在灾后VQA中的可靠性。具体而言,我们探索了多种提示范式,其中一个MLLM用于生成特定任务的指令,作为第二个MLLM的思维链(Chain-of-Thought, CoT)指导。这些指令在答案生成过程中以不同程度的上下文学习(In-Context Learning, ICL)被纳入,使模型能够利用明确的推理指导和上下文示例。我们在FloodNet数据集上进行评估,并将这些方法与零-shot基线进行比较。我们的结果表明,整合以指令驱动的CoT推理能够持续提高答案的准确性。
cs.CV / 30 / 2605.11444

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

利用多模态大型语言模型通过频率专家混合实现一体化图像恢复
Lee, Eunho, Hwang, Youngbae, Kawakami, Rei
Abstract
All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.
Chinese Translation
一体化图像恢复旨在通过统一框架从受多种未知退化影响的输入中恢复干净图像。近期的方法通过识别退化特征来指导恢复过程,表现出强大的性能。然而,许多方法将退化视为离散类别,这限制了它们建模复合退化中出现的连续关系结构的能力。为了解决这一问题,我们提出了一种多模态大型语言模型(MLLM)指导的图像恢复框架,该框架利用多模态嵌入作为低级恢复的指导。具体而言,MLLM派生的特征通过MLLM指导的融合块(MGFB)注入到编码器-解码器架构中,以增强对退化的感知表示。此外,我们还结合了一个频率专家混合(MoFE)模块,该模块使用MLLM指导的上下文线索自适应地组合频率专家。为了进一步改善专家路由,我们设计了一个带有关系对齐损失的MLLM指导路由器,鼓励路由模式与退化输入的嵌入空间关系一致。在多个基准测试上的大量实验表明,所提方法在多种恢复设置中表现出强大的性能,并在具有挑战性的CDD11数据集上建立了新的最先进水平,超越了之前的方法,提升幅度达到1.35 dB。
cs.CV / 31 / 2605.11462

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

SpatialForge:从开放世界的二维图像引导三维空间推理
Liu, Zishan, Zang, Ruoxi, Zhang, Yanglin, Liu, Wei, Zhang, Yin, Yao, Jian, Zheng, Jiayin, Liu, Zhengzhe
Abstract
Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
Chinese Translation
近期大型视觉语言模型(VLMs)的进展展示了卓越的语义理解能力,但这些模型在空间推理方面始终存在困难,常常在深度排序和精确坐标定位等基本几何任务上表现不佳。最近的研究尝试从以场景为中心的数据集(例如,多视角扫描或室内视频)引入空间监督,但受限于基础场景数量的有限。因此,这类数据的规模和多样性仍显著小于网络规模的二维图像集合。为了解决这一限制,我们提出了SpatialForge,这是一种可扩展的数据合成管道,将野外的二维图像转化为空间推理监督。我们的方法将空间推理分解为感知和关系,并构建覆盖深度、布局和视点依赖推理的结构化监督信号,并通过自动验证确保数据质量。基于该管道,我们构建了SpatialForge-10M,这是一个包含1000万个空间问答对的大规模数据集。在多个空间推理基准上的广泛实验表明,在SpatialForge-10M上训练显著提高了标准VLMs的空间推理能力,突显了扩展二维数据在三维空间推理中的有效性。
cs.CV / 32 / 2605.11463

Encore: Conditioning Trajectory Forecasting via Biased Ego Rehearsals

Encore:通过偏置自我排练进行轨迹预测
Wong, Conghao, Zou, Ziqian, You, Xinge
Abstract
Learning and representing the subjectivities of agents has become a challenging but crucial problem in the trajectory prediction task. Such subjectivities not only present specific spatial or temporal structures, but also are anisotropic for all interaction participants. Despite great efforts, it remains difficult to explicitly learn and forecast these subjectivities, let alone further modulate models' predictions through a specific ego's subjectivity. Inspired by prefactual thoughts in psychology and relevant theatrical concepts, we interpret such subjectivities in future trajectories as the continuous process from rehearsal to encore. In the rehearsal phase, the proposed ego predictor focuses on how each ego agent learns to derive and direct a set of explicitly biased rehearsal trajectories for all participants in the scene from the short-term observations. Then, these rehearsal trajectories serve as immediate controls to condition final predictions, providing direct yet distinct ego biases for the prediction network to simulate agents' various subjectivities. Experiments across datasets not only demonstrate a consistent improvement in the performance of the proposed \emph{Encore} trajectory prediction model but also provide clear interpretability regarding subjectivities as biased ego rehearsals.
Chinese Translation
学习和表征智能体的主观性已成为轨迹预测任务中一个具有挑战性但至关重要的问题。这种主观性不仅呈现出特定的空间或时间结构,而且对所有交互参与者来说是各向异性的。尽管进行了大量努力,但显式学习和预测这些主观性仍然困难,更不用说通过特定自我的主观性进一步调节模型的预测。受到心理学中的假设性思维和相关戏剧概念的启发,我们将未来轨迹中的这种主观性解释为从排练到重演的连续过程。在排练阶段,所提出的自我预测器专注于每个自我智能体如何从短期观察中学习推导并指导一组显式偏置的排练轨迹,以供场景中所有参与者使用。然后,这些排练轨迹作为直接控制条件最终预测,为预测网络提供直接但独特的自我偏见,以模拟智能体的各种主观性。跨数据集的实验不仅展示了所提出的 \emph{Encore} 轨迹预测模型在性能上的一致提升,还提供了关于主观性作为偏置自我排练的清晰可解释性。
cs.CV / 33 / 2605.11475

Deep Probabilistic Unfolding for Quantized Compressive Sensing

深度概率展开用于量化压缩感知
Qu, Gang, Wang, Ping, Zheng, Siming, Yuan, Xin
Abstract
We propose a deep probabilistic unfolding model for the classical quantized compressive sensing problem, leveraging an unfolding framework to enhance reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply L2 projection to measurements, we derive a closed-form, numerically stable likelihood gradient projection, which allows the model to respect the true quantization physics, turning the hard quantization constraint into a soft probabilistic guidance. Furthermore, an efficient, dual-domain Mamba module is specifically designed to dynamically capture and fuse multi-scale local and global features, ensuring interactions between distant but correlated regions. Extensive experiments demonstrate the state-of-the-art performance of the proposed method over previous works, which may help promote the application of quantized compressive sensing in real-world settings.
Chinese Translation
我们提出了一种深度概率展开模型,以解决经典的量化压缩感知问题,该模型利用展开框架来提高重建的准确性和效率。与之前将 L2 投影应用于测量的展开方法不同,我们推导出了一种闭合形式的数值稳定的似然梯度投影,这使得模型能够遵循真实的量化物理,将硬量化约束转化为软概率指导。此外,我们特别设计了一种高效的双域 Mamba 模块,以动态捕捉和融合多尺度的局部和全局特征,确保远离但相关区域之间的相互作用。大量实验表明,所提出的方法在性能上优于以往的工作,能够促进量化压缩感知在现实生活中的应用。
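To make "soft probabilistic guidance" concrete, here is a minimal sketch of a likelihood gradient for one quantized measurement. It assumes a textbook model that the abstract does not spell out: the pre-quantization value is z = a·x plus Gaussian noise of standard deviation `sigma`, and the quantizer reports only the bin [`lower`, `upper`). The function names and the underflow guard are illustrative assumptions, not the paper's closed-form projection.

```python
import math

def _pdf(t: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def _cdf(t: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def quantized_nll_grad(z: float, lower: float, upper: float, sigma: float) -> float:
    """Derivative w.r.t. z of -log P(lower <= z + n < upper), n ~ N(0, sigma^2)."""
    u = (upper - z) / sigma
    l = (lower - z) / sigma
    prob = max(_cdf(u) - _cdf(l), 1e-300)  # crude guard; not a stable formula
    return (_pdf(u) - _pdf(l)) / (sigma * prob)

# Inside the bin the pull is weak; outside it, the gradient points back toward it.
for z in (-1.0, 0.5, 2.0):
    print(z, quantized_nll_grad(z, lower=0.0, upper=1.0, sigma=0.5))
```

By the chain rule, the gradient with respect to the signal x is this scalar times the measurement row a. An L2 projection would instead penalise distance to a single reconstruction level, ignoring the bin width.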
cs.CV / 34 / 2605.11477

LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs

LDDR:基于线性DPP的动态分辨率视频帧采样方法
Chen, Jingfeng, Qian, Jiawen, Deng, Wendi, Guo, Yinuo, Yu, Jiaqi, Leng, Sicong, Thirukovalluru, Raghuveer, Dhingra, Bhuwan
Abstract
Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
Chinese Translation
在多模态大型语言模型中,视频理解需要在有限的视觉标记预算下,从冗长且冗余的视频中选择信息丰富的帧。现有方法通常依赖于均匀采样、逐点相关性评分、块级选择或自主探索,这些方法要么忽视全局依赖关系,要么引入了大量开销。我们提出了LDDR(基于线性DPP的动态分辨率),这是一种无训练、即插即用且考虑预算的视频帧采样框架。LDDR在任务条件特征空间中执行查询感知的决定性点过程(DPP)帧选择,相较于标准DPP基线实现了3倍的运行时加速。它进一步引入了一种组DPP重要性度量,以指导帧的保留和动态分辨率分配,为信息丰富且非冗余的帧分配更多的标记,同时缩减或修剪不太有用的帧。在涵盖短、中、长视频的四个视频基准测试中,LDDR始终优于下一个最佳基线,在预算受限的情况下获得了2.5分的提升,在高预算场景中获得了1.6分的提升。这些改进在多个MLLM骨干网络中均得到了持续验证,包括开源和闭源模型。定性分析确认了相关帧被选中并分配了更高的预算,从而促进了视频理解的改善。
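The query-aware DPP selection described in the LDDR abstract above can be sketched with the standard greedy MAP procedure for a quality-diversity kernel L_ij = r_i * cos(f_i, f_j) * r_j. This is a generic textbook greedy DPP, not the paper's linear variant or its Group DPP metric; the feature vectors and relevance scores below are made up for illustration.

```python
import math

def greedy_dpp(features, relevance, k):
    """Greedily select up to k items that are relevant and mutually dissimilar."""
    n = len(features)
    unit = []
    for f in features:  # normalise so the similarity is a cosine
        norm = math.sqrt(sum(x * x for x in f)) or 1.0
        unit.append([x / norm for x in f])

    def kernel(i, j):
        cos = sum(a * b for a, b in zip(unit[i], unit[j]))
        return relevance[i] * cos * relevance[j]

    gain = [kernel(i, i) for i in range(n)]  # marginal determinant gain
    chol = [[] for _ in range(n)]            # incremental Cholesky rows
    selected = []
    while len(selected) < k:
        rest = [i for i in range(n) if i not in selected]
        best = max(rest, key=lambda i: gain[i], default=None)
        if best is None or gain[best] <= 1e-12:
            break
        selected.append(best)
        root = math.sqrt(gain[best])
        for i in rest:
            if i == best:
                continue
            dot = sum(a * b for a, b in zip(chol[best], chol[i]))
            e = (kernel(best, i) - dot) / root
            chol[i].append(e)
            gain[i] -= e * e  # redundancy with chosen frames lowers the gain
    return selected

# Hypothetical frames: 0 and 1 are near-duplicates, 2 shows different content.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
scores = [1.0, 0.9, 0.8]
print(greedy_dpp(frames, scores, k=2))
```

Frame 1 has the second-highest relevance, but once frame 0 is chosen its gain collapses because it adds almost no new information, so a pure top-k relevance ranking and the DPP selection differ.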
cs.CV / 35 / 2605.11492

A Mimetic Detector for Adversarial Image Perturbations

一种模仿检测器用于对抗性图像扰动
Corbino, Johnny
Abstract
Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard $\ell^\infty$-bounded attacks (FGSM, PGD, and the $\ell^\infty$ variant of Carlini--Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in $\ell^2$, but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino--Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in $O(HW)$ time. We validate the detector on the standard \texttt{peppers} test image at the canonical $\ell^\infty$ budget $\varepsilon = 16/255$ and observe a clean-vs-adversarial separation that grows monotonically from $3.55\times$ at order $k=2$ to $4.19\times$ at $k=6$.
Chinese Translation
对抗性攻击通过向干净图像添加微小、几乎不可见的噪声模式来欺骗深度图像分类器。标准的 $\ell^\infty$ 有界攻击(FGSM、PGD 以及 Carlini--Wagner 的 $\ell^\infty$ 变体)在像素级别生成高频、近乎随机的符号模式:在 $\ell^2$ 中几乎不可见,但携带不成比例的梯度能量。我们利用这一点,使用来自开源 MOLE 库的高阶 Corbino--Castillo 模仿算子,构建了一种单次、无训练的检测器。无需重新训练,无需替代分类器,无需访问被攻击的网络:判决仅依赖于输入本身,计算时间为 $O(HW)$。我们在标准的 \texttt{peppers} 测试图像上验证了该检测器,在经典的 $\ell^\infty$ 预算 $\varepsilon = 16/255$ 下观察到干净图像与对抗性图像的分离,从 $k=2$ 的 $3.55\times$ 单调增长到 $k=6$ 的 $4.19\times$。
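The intuition behind the detector above, that sign-pattern perturbations carry a lot of high-frequency energy, can be shown with a stand-in statistic. The sketch below uses a plain second-order five-point Laplacian, not the Corbino--Castillo mimetic operators or the MOLE API; the synthetic ramp image and the random-sign perturbation are assumptions chosen only to keep the example self-contained.

```python
import random

def high_frequency_energy(img):
    """Mean squared 5-point Laplacian over interior pixels; O(HW) time."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1]
                   + img[y][x + 1] - 4.0 * img[y][x])
            total += lap * lap
    return total / ((h - 2) * (w - 2))

def make_demo(eps=16 / 255, size=32, seed=0):
    """Smooth ramp image plus a copy perturbed by +/- eps per pixel."""
    rng = random.Random(seed)
    clean = [[(x + y) / (2 * (size - 1)) for x in range(size)]
             for y in range(size)]
    adv = [[min(1.0, max(0.0, v + eps * rng.choice((-1, 1)))) for v in row]
           for row in clean]
    return clean, adv

clean, adv = make_demo()
print(high_frequency_energy(clean), high_frequency_energy(adv))
```

A natural photograph has far more texture than this ramp, so the gap on real images is much smaller than in the toy case; the separation factors reported in the abstract come from the authors' higher-order operators, which this sketch does not reproduce.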
cs.CV / 36 / 2605.11494

STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

STRIDE:通过主成分分析导向的特征扰动实现无训练的多样性引导在单步扩散模型中的应用
Yadav, Ankit, Garg, Arpit, Huy, Ta Duc, Liu, Lingqiao
Abstract
Distilled one-step (T=1) or few-step (T$\leq$4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.
Chinese Translation
蒸馏的一步(T=1)或少步(T≤4)扩散模型能够实现实时图像生成,但与多步模型相比,样本多样性往往降低。在多步扩散中,可以通过调度、轨迹或迭代优化引入多样性;然而,这些机制在少步或单步设置中不可用,限制了现有增强多样性方法的有效性。一个自然的替代方案是扰动中间特征,但简单的特征扰动往往效果不佳,要么带来有限的多样性提升,要么降低生成质量。我们认为,在少步模型中有效的多样性注入需要尊重模型学习到的特征几何。基于这一见解,我们提出了STRIDE,这是一种无训练和无优化的方法,能够在单次前向传播中操作。STRIDE将空间一致的(粉色)噪声注入到中间变换器特征中,并投影到模型自身激活的主成分上,确保扰动位于学习到的特征流形上。这一设计使得在表示空间中沿有意义的方向进行可控变化成为可能。在COCO、DrawBench、PartiPrompts和GenEval上的FLUX.1-schnell和SD3.5 Turbo的广泛实验表明,STRIDE在保持强文本对齐的同时,始终提高了多样性。特别是,STRIDE在对CLIP评分影响最小的情况下减少了批内相似性,并在多样性-保真度边界上优于现有的无训练基线。这些结果强调,在缺乏迭代优化的情况下,提高少步和一步扩散中的多样性并不依赖于增加扰动强度,而是依赖于将扰动与模型的内部表示结构对齐。
cs.CV / 37 / 2605.11497

PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

PoseBridge:弥补零样本基于骨架的动作识别中的骨架化差距
Lee, Sanghyeon, Kim, Jinwoo, Lee, Jong Taek
Abstract
Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.
Chinese Translation
零样本基于骨架的动作识别(ZSSAR)通常被视为一个骨架与文本对齐的问题:编码关节坐标序列,将其与语言对齐,并对未见过的动作进行分类。我们认为这种对齐往往来得太晚。骨架并不是完整的动作观察,而是人类姿态估计(HPE)的压缩输出;在对齐开始时,人-物体交互和姿态相关的视觉线索可能不再明确。我们称之为上游语义损失。为了解决这个问题,我们提出了PoseBridge,一个HPE感知的ZSSAR框架,旨在将中间HPE表示与骨架-文本对齐连接起来。PoseBridge并没有增加RGB动作分支或物体检测器,而是从生成骨架的同一HPE过程中提取姿态锚定的语义线索,然后通过骨架条件的桥接和语义原型适应进行传递。在NTU-RGB+D 60/120、PKU-MMD和Kinetics-200/400数据集上,PoseBridge在评估协议下提高了ZSSAR的性能。在包含多样场景和动作上下文的野外视频的Kinetics-200/400 PURLS基准上,PoseBridge显示出最明显的分离,所有八个分割的性能相比最强基线提高了13.3-17.4分。我们的代码将公开发布。
cs.CV / 38 / 2605.11506

Principled Design of Diffusion-based Optimizers for Inverse Problems

基于扩散的优化器在逆问题中的原则性设计
Oscanoa, Julio, Sivgin, Irmak, Alkan, Cagan, Ennis, Daniel, Pauly, John, Pilanci, Mert, Vasanawala, Shreyas
Abstract
Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.
Chinese Translation
基于评分的扩散模型在逆问题中实现了最先进的性能,但其实际应用受到推理时间长和超参数调整繁琐的限制。虽然预训练的扩散模型可以在不同任务中重复使用而无需重新训练,但推理时的超参数,如噪声调度和后验采样权重,通常需要针对每个问题设置进行临时调整。我们提出了原则性的重参数化方法,诱导不变性,使得相同的超参数可以在多个问题中重复使用而无需重新调整。此外,基于将后验采样重新表述为优化问题的RED-diff框架,我们进一步开发了OptDiff管道。OptDiff提供了一个简化的调优框架,便于整合凸优化工具以加速推理。在图像重建、去模糊和超分辨率的实验中,显示出显著的加速效果和改善的图像质量。
cs.CV / 39 / 2605.11508

LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

LiBrA-Net:用于实时4K视频去雾的李代数双边仿射场
Wang, Yongcong, Shen, Chengchao, Gao, Guangwei, Wang, Wei, Dai, Pengwen, Lu, Dianjie, Zhang, Guijuan, Zheng, Zhuoran
Abstract
Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3-5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial-color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at https://anonymous.4open.science/r/LiBrA-Net-42B8.
Chinese Translation
目前,由于缺乏评估基准,超高清(UHD)视频去雾领域存在一定的空白。此外,现有的视频去雾方法在处理连续的UHD序列(每次3到5帧)时无法在消费级GPU上运行。本文通过提出一个新的基准和一种高效的方法来解决这两个问题。我们的关键观察是,大气去雾可以简化为由低频深度场控制的逐像素仿射变换,这可以紧凑地编码在双边网格中,其预测成本与输出分辨率解耦。在此基础上,我们提出了LiBrA-Net,该网络将时空仿射场分解为在固定低分辨率下预测的空间-颜色和时间双边子网格,利用群论正则化将其系数融合在$\mathfrak{gl}(3)$李代数中,通过Cayley参数化将结果映射到可逆的GL(3)变换,并通过轻量级输入引导分支恢复高频细节。我们还发布了UHV-4K,这是第一个配对的4K视频去雾基准,包含每帧的深度、传输和光流注释。在UHV-4K、REVIDE和HazeWorld数据集上,LiBrA-Net在比较的视频去雾方法中设定了新的最先进水平,同时在单个GPU上以25 FPS的速度原生运行4K,参数仅为6.12 M。代码和数据可在https://anonymous.4open.science/r/LiBrA-Net-42B8获取。
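Editor's note (illustrative, not taken from the paper): the "per-pixel affine transform" observation can be read off the standard atmospheric scattering model. The derivation below assumes that model and its usual symbols; the abstract does not state which haze model the authors use, so treat this as one plausible reading.

```latex
% Assumed standard atmospheric scattering model (not quoted from the paper):
%   I(x): hazy pixel, J(x): haze-free pixel, A: global atmospheric light,
%   t(x): transmission at scene depth d(x), beta: scattering coefficient.
\begin{align}
I(x) &= J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)} \\
J(x) &= \frac{1}{t(x)}\,I(x) + \Bigl(1 - \frac{1}{t(x)}\Bigr) A
\end{align}
```

Under this assumption the restored pixel J(x) is an affine function of the hazy pixel I(x), with gain 1/t(x) and offset (1 - 1/t(x))A. Both depend on x only through the depth d(x), which is consistent with the abstract's claim that the coefficients vary at low spatial frequency.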
cs.CV / 40 / 2605.11520

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

PointGS:基于3D高斯点云的语义一致性无监督3D点云分割
Song, Yixiao, Li, Qingyong, Wang, Wen, Yan, Zhicheng
Abstract
Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
Chinese Translation
无监督点云分割对于具身人工智能和自动驾驶至关重要,因为它减少了完全监督方法所需的密集点级标注的高昂成本。尽管将如Segment Anything Model (SAM)等2D预训练模型整合以补充语义信息是一个自然选择,但这种方法面临着离散3D点与连续2D图像之间的根本不匹配。这种不匹配导致不可避免的投影重叠和复杂的模态对齐,从而在2D-3D转移中造成语义一致性的妥协。为了解决这些局限性,本文提出了PointGS,一个简单而有效的无监督3D点云分割管道。PointGS利用3D高斯点云作为统一的中间表示,以弥合离散与连续域之间的差距。输入的稀疏点云首先通过多视角观测重建为密集的3D高斯空间,填补空间间隙并编码遮挡关系,以消除投影引起的语义混淆。从高斯空间渲染多视角密集图像,通过SAM提取2D语义掩码,并通过对比学习将语义提炼到3D高斯原语,以确保不同视角之间的一致语义分配。通过两步配准将高斯空间与原始点云对齐,并通过对标记高斯的最近邻搜索分配点语义。实验表明,PointGS在无监督方法中表现优于现有最先进的技术,在ScanNet-V2上实现了+0.9%的mIoU,在S3DIS上实现了+2.8%的mIoU。
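Editor's note (illustrative, not taken from the paper): the final step described above, assigning point semantics by nearest-neighbor search on labeled Gaussians, can be sketched as follows. The function name `assign_labels`, the brute-force search, and the toy coordinates are assumptions for illustration; the sketch also assumes the Gaussian centers are already registered to the point cloud's coordinate frame.

```python
import math
from typing import List, Sequence

def assign_labels(
    points: Sequence[Sequence[float]],
    gaussian_centers: Sequence[Sequence[float]],
    gaussian_labels: Sequence[int],
) -> List[int]:
    """Give each point the label of its nearest Gaussian center (brute force)."""
    labels = []
    for p in points:
        # Index of the Gaussian center closest to p in Euclidean distance.
        nearest = min(
            range(len(gaussian_centers)),
            key=lambda i: math.dist(p, gaussian_centers[i]),
        )
        labels.append(gaussian_labels[nearest])
    return labels

# Toy example: two labeled Gaussians, three query points.
centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
center_labels = [3, 7]
cloud = [(0.5, 0.1, 0.0), (9.0, 1.0, 0.0), (4.0, 0.0, 0.0)]
point_labels = assign_labels(cloud, centers, center_labels)
print(point_labels)
```

A real pipeline would likely replace the brute-force loop with a spatial index such as a k-d tree; the quadratic search here is only meant to make the assignment rule explicit.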
cs.CV / 41 / 2605.11521

XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions

XWOD:极端天气条件下物体检测的真实世界基准
Chen, Chih-Hsin, Liu, Yu-Tung, Fadillah, Amar, Lai, Kuan-Ting, Liu, Dong
Abstract
Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP$_{50}$ scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.
Chinese Translation
自主驾驶和智能交通系统在极端天气条件下仍然脆弱。美国联邦公路管理局报告称,每年约有745,000起事故和3,800人死亡与天气相关,近期的监管调查也考察了在能见度降低的情况下,2级/3级驾驶系统的失效。然而,常用于评估天气鲁棒性的数据集在规模、多样性和真实性上仍然有限。本文介绍了XWOD(极端天气物体检测),这是一个大规模的真实世界交通物体检测基准,包含10,010张图像和42,924个边界框,涵盖七种极端天气条件:雨、雪、雾、霾/沙/尘、洪水、龙卷风和野火。该数据集涵盖六个交通物体类别,包括汽车、行人、卡车、摩托车、自行车和公交车。XWOD将天气分类从一种扩展到七种条件,并首次涵盖了气候加剧的危害新类别,如洪水、龙卷风和野火。为了评估我们数据的质量,我们在XWOD上训练标准的YOLO系列检测器,并在外部天气基准上进行零样本测试,分别在RTTS上获得63.00%的mAP$_{50}$分数,在DAWN上获得59.94%,在WEDGE上获得61.12%,而相应的已发布YOLO基线分别为40.37%、32.75%和45.41%,相对提升分别为56%、83%和35%。这些跨数据集的结果表明,XWOD为学习天气鲁棒的交通感知提供了强大的源领域。我们在研究使用许可下发布数据集、划分、基线权重和可重复的评估代码。
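Editor's note (illustrative): the relative improvements quoted in the abstract follow directly from the reported mAP50 scores. The short check below recomputes them; only the numbers come from the abstract, and the variable and function names are ours.

```python
# mAP50 scores quoted in the abstract, in percent:
# (detector trained on XWOD, published YOLO-based baseline).
scores = {
    "RTTS": (63.00, 40.37),
    "DAWN": (59.94, 32.75),
    "WEDGE": (61.12, 45.41),
}

def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

gains = {
    name: round(relative_improvement(new, baseline))
    for name, (new, baseline) in scores.items()
}
print(gains)  # {'RTTS': 56, 'DAWN': 83, 'WEDGE': 35}
```

The corresponding absolute gains are 22.63, 27.19, and 15.71 mAP50 points, which may be the more useful figure when comparing against results reported on other benchmarks.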
cs.CV / 42 / 2605.11541

GeoR-Bench: Evaluating Geoscience Visual Reasoning

GeoR-Bench:评估地球科学视觉推理
Zheng, Yushuo, Zhang, Zicheng, Duan, Huiyu, Li, Chunyi, Chen, Zijian, Jia, Ziheng, Shi, Yue, Gu, Ke, Min, Xiongkuo, Zhai, Guangtao
Abstract
Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results for 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source model reaches only 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.
Chinese Translation
地球科学智能被期望能够理解、推理和预测地球系统变化,以支持人类在灾难响应、气候适应和环境保护等关键领域的决策。尽管当前研究在特定的地球科学任务上取得了可喜的进展,例如遥感解读和地理问答,但现有的基准测试仍然主要是任务特定的,未能捕捉开放式的现实世界地球科学问题。因此,目前的人工智能系统距离实现真正的地球科学智能还有多远仍然不清楚。为了解决这一差距,我们提出了GeoR-Bench,一个用于通过推理驱动的视觉编辑任务评估地球科学视觉推理的基准。GeoR-Bench 包含 440 个经过精心策划的样本,涵盖 6 个地球科学类别和 24 种任务类型,涉及地球观测图像和结构化科学表示(如地图和图表)。我们从推理、一致性和质量三个维度评估输出。21 个闭源和开源的多模态模型的基准结果表明,地球科学推理仍然是一个关键瓶颈。表现最好的模型的整体严格准确率为 42.7%,而最佳开源模型仅为 10.3%。值得注意的是,输出的视觉一致性和图像质量常常超过其科学准确性。最终,这些发现表明,当前模型生成的结果表面上似乎合理,但未能捕捉到潜在的地球科学过程。
cs.CV / 43 / 2605.11550

The DAWN of World-Action Interactive Models

世界行动互动模型的曙光
Lu, Hongbo, Yao, Liang, He, Chenghao, Wang, Haoyu, Gu, Xiang, Li, Xianfei, Liao, Wenlong, He, Tao, Peng, Pai
Abstract
A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.
Chinese Translation
合理的场景演变依赖于所考虑的操控,而良好的操控又依赖于场景可能的演变。现有的世界行动模型(World Action Models, WAMs)在很大程度上忽视了这种互惠关系,将世界预测和动作生成视为孤立的平行分支或僵化的预测-再规划管道。我们将这一视角形式化为世界行动互动模型(World-Action Interactive Models, WAIMs),并在自主驾驶中实现了DAWN(Denoising Actions and World iNteractive model),这是一个简单而强大的潜在生成基线。DAWN在紧凑的语义潜在空间中运行,并将世界预测器与世界条件下的动作去噪器相结合:预测的世界假设为动作去噪提供条件,而去噪后的动作假设则反馈用于更新世界预测,从而在推理过程中递归地细化两者。DAWN并没有完全消除测试时的世界演变或在像素空间中展开完整的未来,而是执行了一个短的显式潜在展开,这足以支持在复杂互动场景中的长时间轨迹生成。实验表明,DAWN在多个自主驾驶基准测试中实现了强大的规划性能和良好的安全相关结果。更广泛地说,我们的结果表明,互动世界-行动生成是朝着真正可操作的世界模型迈进的一个原则性路径。
cs.CV / 44 / 2605.11555

ScribbleDose: Scribble-Guided Dose Prediction in Radiotherapy

ScribbleDose:基于涂鸦引导的放射治疗剂量预测
Zhang, Zhenxi, Zhuang, Yitao, Pu, Yao, Yu, Peixin, Li, Zirong, Xia, Yan, Li, Hui, Li, Bin, Zheng, Fuchen, Ren, Ge
Abstract
Anatomical structure masks are widely adopted in radiotherapy dose prediction, as they provide explicit geometric constraints that facilitate structure-dose coupling. However, conventional manual delineation of these masks requires precise annotation of structure boundaries relevant to radiotherapy, which is time-consuming and labor-intensive. To address these limitations, we propose a scribble-guided dose prediction framework that relies solely on anatomical structures annotated with sparse scribbles. Specifically, we design a Scribble Completion Module (SCM) to generate dense anatomical masks by propagating sparse scribble labels to semantically similar voxels. During the propagation process, a supervoxel-based regularization is introduced to preserve geometric boundary consistency to ensure anatomical plausibility. Furthermore, we propose a Structure-Guided Dose Generation Module (SGDGM) to strengthen the correspondence between sparse structural cues and dose distribution. The completed dense masks derived from scribbles serve as structural guidance to condition dose prediction, forming a scribble-mask-dose learning pipeline under sparse annotation. Experiments on the GDP-HMM dataset demonstrate that ScribbleDose achieves competitive dose prediction performance using only sparse structural annotations. The source code and reannotated scribble annotations are publicly available at https://github.com/iCherishxixixi/ScribbleDose.
Chinese Translation
解剖结构掩模在放射治疗剂量预测中被广泛采用,因为它们提供了明确的几何约束,促进了结构与剂量的耦合。然而,传统的手动勾画这些掩模需要对与放射治疗相关的结构边界进行精确标注,这既耗时又费力。为了解决这些限制,我们提出了一种仅依赖于稀疏涂鸦标注的解剖结构的涂鸦引导剂量预测框架。具体而言,我们设计了一个涂鸦补全模块(Scribble Completion Module, SCM),通过将稀疏涂鸦标签传播到语义相似的体素来生成密集的解剖掩模。在传播过程中,引入了基于超体素的正则化,以保持几何边界的一致性,确保解剖的合理性。此外,我们提出了一个结构引导剂量生成模块(Structure-Guided Dose Generation Module, SGDGM),以增强稀疏结构线索与剂量分布之间的对应关系。由涂鸦生成的完成密集掩模作为结构指导,条件化剂量预测,形成了一个在稀疏标注下的涂鸦-掩模-剂量学习管道。在GDP-HMM数据集上的实验表明,ScribbleDose在仅使用稀疏结构标注的情况下实现了具有竞争力的剂量预测性能。源代码和重新标注的涂鸦注释可在 https://github.com/iCherishxixixi/ScribbleDose 上公开获取。
cs.CV / 45 / 2605.11559

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

当视觉观察不足时:视觉注意结构揭示多模态大语言模型中的幻觉
Cao, Fanpu, Zou, Xin, Hu, Xuming, Xiong, Hui
Abstract
Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.
Chinese Translation
多模态大语言模型(MLLMs)已成为视觉推理和基于图像的问题回答的关键接口,但它们仍然容易受到视觉幻觉的影响,即生成的响应与图像内容相矛盾或提及不存在的物体。一个主要挑战是,幻觉并不总是由于简单的视觉注意力缺失造成的:模型可能仍然将大量注意力分配给图像标记,同时在内部偏向错误的答案。在本文中,我们展示了通过逐层拉普拉斯能量测量的视觉注意的高频结构,揭示了幻觉偏好出现的层次以及真实答案暂时恢复的层次。基于这一发现,我们提出了LaSCD(拉普拉斯-谱对比解码),这是一种无训练的解码策略,通过拉普拉斯能量选择信息层,并以封闭形式重新映射下一个标记的logits。在幻觉和一般多模态基准测试中的实验表明,LaSCD在保持一般能力的同时,始终减少了幻觉,突显了其作为一种可靠解码范式的潜力。代码可在https://github.com/macovaseas/LaSCD获取。
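Editor's note (illustrative, not taken from the paper): the abstract does not define "Laplacian energy", so the sketch below assumes one common reading: the Dirichlet energy of an attention map over a patch grid, i.e. the sum of squared differences between 4-connected neighbors. The function name and the toy maps are ours; the authors' layer-wise definition may differ.

```python
from typing import Sequence

def laplacian_energy(attn: Sequence[Sequence[float]]) -> float:
    """Sum of squared differences between horizontally and vertically
    adjacent cells of a 2D attention map (grid-graph Dirichlet energy)."""
    rows, cols = len(attn), len(attn[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:  # edge to the cell below
                energy += (attn[r][c] - attn[r + 1][c]) ** 2
            if c + 1 < cols:  # edge to the cell on the right
                energy += (attn[r][c] - attn[r][c + 1]) ** 2
    return energy

# Same total attention mass (1.0), very different spatial structure.
smooth = [[0.25, 0.25], [0.25, 0.25]]
checkerboard = [[0.5, 0.0], [0.0, 0.5]]
print(laplacian_energy(smooth), laplacian_energy(checkerboard))  # 0.0 1.0
```

The two toy maps carry the same attention mass but very different energy, which illustrates the abstract's point that how much attention lands on image tokens and how that attention is structured are separate signals.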
cs.CV / 46 / 2605.11563

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

TCP-SSM:具有令牌条件极点的高效视觉状态空间模型
Shoouri, Sara, Taba, Morteza Tavakoli, Kim, Hun-Seok
Abstract
State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computational complexity by up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
Chinese Translation
状态空间模型(SSMs)作为长程视觉任务中对注意力模型的有力替代方案,提供了依赖输入的递归机制,且具有线性复杂度。然而,大多数高效的SSM变体通过修改扫描路径、分辨率或遍历模式来降低计算成本,而在很大程度上将递归动态保持为隐式。因此,模型的状态依赖记忆行为难以控制,特别是在紧凑的骨干网络中,长扫描路径可能超出有效记忆范围。我们提出了令牌条件极点SSM(TCP-SSM),这是一种结构化的选择性SSM框架,通过稳定极点使递归动态变得明确且可解释,从而提高效率。TCP-SSM通过1)建模单调或符号交替衰减的实极点,和2)捕捉阻尼振荡响应的复共轭极点,构建每个扫描算子。通过有界半径和角度调制,TCP-SSM将共享基极点转换为依赖于令牌的极点,使每个扫描步骤能够根据当前视觉令牌调整其记忆行为,同时保持极点的稳定性。为了实现实际的可扩展性,我们将分组极点共享与轻量级低秩输入路径相结合,产生一种高效的扫描算子,保持线性时间的扫描复杂度。在图像分类、语义分割和目标检测任务中,TCP-SSM在Vision Mamba风格模型中将SSM计算复杂度降低了多达44%,同时保持或超过基线准确率。
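Editor's note (illustrative, not taken from the paper): the idea of turning a shared base pole into a token-dependent pole with bounded radius and angle modulation can be sketched with a single complex pole. Everything specific here is an assumption: the sigmoid and tanh squashing, the constants `R_MAX` and `DTHETA_MAX`, and the function names. The abstract only states that the modulation is bounded and that stability is preserved.

```python
import cmath
import math
from typing import List, Sequence, Tuple

R_MAX = 0.99       # assumed cap on the pole radius; below 1 keeps the scan stable
DTHETA_MAX = 0.5   # assumed cap on the per-token angle shift, in radians

def token_pole(base_radius_logit: float, base_angle: float,
               radius_mod: float, angle_mod: float) -> complex:
    """Map a shared base pole plus per-token modulation to a pole inside the unit disk."""
    # R_MAX * sigmoid(z), written with tanh so extreme inputs cannot overflow.
    radius = R_MAX * 0.5 * (1.0 + math.tanh(0.5 * (base_radius_logit + radius_mod)))
    angle = base_angle + DTHETA_MAX * math.tanh(angle_mod)
    return cmath.rect(radius, angle)

def scan(inputs: Sequence[float], mods: Sequence[Tuple[float, float]],
         base_radius_logit: float = 2.0, base_angle: float = 0.3) -> List[float]:
    """Linear-time recurrence h_t = p_t * h_{t-1} + x_t with one pole p_t per token."""
    h = 0j
    outputs = []
    for x, (radius_mod, angle_mod) in zip(inputs, mods):
        pole = token_pole(base_radius_logit, base_angle, radius_mod, angle_mod)
        h = pole * h + x
        # A complex-conjugate pole pair contributes h + conj(h) = 2 * Re(h).
        outputs.append(2.0 * h.real)
    return outputs

# Impulse response with no modulation: a damped oscillation.
impulse = [1.0] + [0.0] * 50
response = scan(impulse, [(0.0, 0.0)] * len(impulse))
print(response[:5])
```

In this parameterization an angle of 0 gives a positive real pole (monotone decay) and an angle of pi gives a negative real pole (sign-alternating decay), which matches the two real-pole behaviors named in the abstract. Because every per-step pole has magnitude at most `R_MAX`, the state stays bounded for bounded inputs regardless of how the modulation varies across tokens.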
cs.CV / 47 / 2605.11567

Dynamic Execution Commitment of Vision-Language-Action Models

视觉-语言-动作模型的动态执行承诺
Chen, Feng, Wang, Xianghui, Chen, Yuxuan, Li, Boying, He, Yefei, Zhang, Zeyu, Wu, Yicheng
Abstract
Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
Chinese Translation
视觉-语言-动作(VLA)模型主要采用动作分块,即在单次前向传播中预测并承诺一系列连续的低级动作,以摊销大规模骨干网络的推理成本并减少每步延迟。然而,将这些多步预测承诺于现实世界的执行需要在成功率与推理效率之间进行平衡,这一决策通常由针对每个任务调优的固定执行范围所主导。这种启发式方法忽视了预测可靠性的状态依赖特性,导致在动态或分布外环境中的脆弱表现。在本文中,我们提出了A3,一种自适应动作接受机制,将动态执行承诺重新框定为自我推测前缀验证问题。A3首先通过组采样计算动作的轨迹一致性得分,然后选择一个代表性草案并优先进行下游验证。具体而言,它强制执行:(1)一致性排序条件不变性,通过判断低一致性动作在基于高一致性动作重新解码时是否保持一致来验证这些动作;(2)前缀闭合的序列一致性,通过仅接受从开始起的最长连续验证动作序列来保证物理展开的完整性。因此,执行范围成为满足内部模型逻辑和序列执行约束的最长可验证前缀。针对不同VLA模型和基准的实验表明,A3消除了手动调优执行范围的需求,同时在执行鲁棒性和推理吞吐量之间实现了更优的权衡。
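Editor's note (illustrative, not taken from the paper): the prefix-closed acceptance rule, under which only the longest run of verified actions starting at the first action is executed, can be sketched as below. The per-action verification flags are taken as given; how A3 computes them (consensus scoring and conditional re-decoding) is not reproduced here, and the names are ours.

```python
from typing import Sequence

def longest_verified_prefix(verified: Sequence[bool]) -> int:
    """Count leading actions that all passed verification; stop at the first failure."""
    count = 0
    for ok in verified:
        if not ok:
            break
        count += 1
    return count

# A predicted chunk of five actions with hypothetical verification results.
chunk = ["a0", "a1", "a2", "a3", "a4"]
flags = [True, True, False, True, True]

horizon = longest_verified_prefix(flags)
to_execute = chunk[:horizon]
print(horizon, to_execute)  # 2 ['a0', 'a1']
```

Actions a3 and a4 are discarded even though they passed verification, because executing them would require first executing the rejected a2. This is what makes the execution horizon a property of the current prediction instead of a fixed per-task setting.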
cs.CV / 48 / 2605.11572

TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

TB-AVA:文本作为音视频参数高效微调的语义桥梁
Kim, Seongah, Tran, Dinh Phu, Hwang, Hyeontaek, Wazir, Saad, Minh, Duc Do, Kim, Daeyoung
Abstract
Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose to use text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where the proposed framework achieves state-of-the-art performance, demonstrating text as an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.
Chinese Translation
音视频理解需要在异构模态之间有效对齐,然而,当时间对齐的音频和视觉信号缺乏明确的语义对应时,跨模态对应仍然具有挑战性。我们提出使用文本作为音视频表示学习的语义锚点。为此,我们引入了一种基于冻结音频和视觉编码器的参数高效适应框架,中心是文本桥接音视频适配器(Text-Bridged Audio-Visual Adapter,TB-AVA),该框架能够实现音频和视觉流之间的文本介导交互。在TB-AVA的核心,门控语义调制(Gated Semantic Modulation,GSM)根据文本推断的语义相关性选择性地调制特征通道。我们在多个基准测试上评估了所提出的方法,包括AVE、AVS和AVVP,结果表明该框架实现了最先进的性能,证明文本作为音视频学习中参数高效微调(PEFT)的有效语义锚点。
cs.CV / 49 / 2605.11578

The Midas Touch for Metric Depth

度量深度的米达斯之触
Ma, Yu, Guo, Zizhan, Xiong, Zuyi, Zhang, Haoran, Feng, Yi, Zhao, Hongbo, Wang, Hanli, Fan, Rui
Abstract
Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at https://mias.group/MTD.
Chinese Translation
最近的进展显著提高了相对深度估计的跨场景泛化能力,但其实际应用仍受到缺乏度量尺度、局部不一致性和低计算效率的限制。为了解决这些问题,我们提出了Midas Touch for Depth (MTD),这是一种数学上可解释的方法,仅使用极其稀疏的3D数据将相对深度转换为度量深度。为了消除局部尺度不一致性,它通过稀疏图优化应用了一种分段恢复策略,随后使用考虑不连续性的测地成本进行逐像素的精细化策略。MTD展现出强大的泛化能力,并在深度补全和深度估计方法上实现了显著的准确性提升。此外,其轻量级的即插即用设计便于在多种下游3D任务中进行部署和集成。项目页面可访问 https://mias.group/MTD。
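Editor's note (illustrative, not the paper's method): the simplest way to turn relative depth into metric depth from a few sparse 3D measurements is a single global scale-and-shift fit by least squares. The sketch below shows only that baseline, assuming the relative prediction differs from metric depth by one scale and one shift. MTD's segment-wise graph optimization and geodesic refinement, which the abstract says handle local scale inconsistencies, are not reproduced. The function name and sample values are ours.

```python
from typing import Sequence, Tuple

def fit_scale_shift(rel: Sequence[float], metric: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (scale, shift) such that metric ~= scale * rel + shift."""
    n = len(rel)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched samples")
    mean_r = sum(rel) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in rel)
    if var_r == 0.0:
        raise ValueError("relative depths are all equal; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(rel, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift

# Relative depth at three pixels where sparse metric measurements (meters) exist.
rel_samples = [0.1, 0.4, 0.9]
metric_samples = [2.0 * r + 0.5 for r in rel_samples]  # synthetic ground truth

scale, shift = fit_scale_shift(rel_samples, metric_samples)
# Apply the fitted transform to every pixel of the relative depth map.
metric_map = [scale * r + shift for r in [0.0, 0.5, 1.0]]
print(scale, shift, metric_map)
```

A single global fit cannot correct regions whose scale differs from the rest of the image, which is the local inconsistency the abstract attributes to relative depth predictions and the motivation it gives for a segment-wise strategy.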
cs.CV / 50 / 2605.11585

A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods

基于四叉树区域的混合自回归图像生成模型用于高斯噪声去除的变分贝叶斯与梯度方法
Saito, Shota, Nakahara, Yuta, Horinouchi, Kohei, Ichijo, Naoki, Kobayashi, Manabu, Matsushima, Toshiyasu
Abstract
This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out some experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.
Chinese Translation
本文针对灰度图像的去噪问题提出了一种概率图像生成模型,该模型结合了四叉树区域划分模型和混合自回归模型,并提出了一个将基于最大后验估计(MAP)的去噪问题简化为变分下界最大化的框架。为了最大化该下界,我们开发了一种交替应用变分贝叶斯和梯度方法的算法。我们特别证明了基于梯度的更新规则可以在不进行数值计算或近似的情况下进行解析计算。我们进行了若干实验,以验证所提出的算法确实能够去除图像噪声,并确定未来改进的方向。
cs.CV / 51 / 2605.11591

Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

Logit-注意力偏差:通过注意力引导校准减轻多图像检索中的位置偏差
Xian, Mingtao, Yang, Yifeng, Gu, Qinying, Wang, Xinbing, Ye, Nanyang
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines. Code is available at https://github.com/brightXian/LAD.
Chinese Translation
多模态大型语言模型(MLLMs)在多图像跨模态检索中表现出色,但却遭受严重的位置偏差,即预测结果受到输入顺序的主导,而非语义相关性。通过实证分析,我们识别出一种现象,称为Logit-注意力偏差(Logit-Attention Divergence),在这种现象中,输出logits受到严重偏见,而内部注意力图与相关视觉证据保持良好对齐。这一观察揭示了现有logit级别校准方法(如PriDe)的基本局限性。基于这一洞察,我们提出了一种无训练、注意力引导的去偏框架,该框架利用内在的注意力信号在推理时进行实例级校正,仅需一个最小的校准集,且计算开销微乎其微。在基于MS-COCO的基准测试中,实验表明我们的方法显著提高了排列不变性,并实现了最先进的性能,与基线相比,准确率提高了超过40%。代码可在https://github.com/brightXian/LAD获取。
cs.CV / 52 / 2605.11594

PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

PointForward:通过点对齐表示进行前馈驾驶重建
Chi, Cheng, Wang, Xianqi, Luo, Hongcheng, Tu, Mingfei, Xu, Gangwei, Zhang, Zehan, Wang, Bing, Chen, Guang, Ye, Hangjun, Peng, Sida, Yang, Xin, Sun, Haiyang
Abstract
High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.
Chinese Translation
高保真度的驾驶场景重建对于自动驾驶至关重要。尽管最近的前馈3D高斯点云(3DGS)方法实现了快速重建,但其逐像素高斯预测范式常常遭受多视图不一致性和分层伪影的问题。此外,现有方法通常通过密集流预测来建模动态实例,这缺乏明确的跨视图对应关系和实例级一致性。本文提出了PointForward,一个通过点对齐表示进行前馈驾驶重建的框架。与像素对齐方法不同,我们在世界空间中初始化稀疏3D查询,并通过时空融合将多视图图像信息聚合到这些查询上,从而在单次前馈传递中强制执行明确的跨视图一致性。为了处理场景动态,我们引入了场景图,明确组织重建过程中的移动实例。通过利用3D边界框,我们的方法实现了实例级运动传播和时间一致的动态表示。大量实验表明,PointForward在大规模驾驶基准测试中达到了最先进的性能。代码将在论文发表后提供。
cs.CV / 53 / 2605.11596

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

HorizonDrive:用于长时间驾驶仿真的自我校正自回归世界模型
Zhang, Conglang, Zhan, Yifan, Wang, Qingjie, Ouyang, Zhanpeng, Li, Yu, Yang, Zihao, Guo, Xiaoyang, Ren, Weiqiang, Zhang, Qian, Dong, Zhen, Zheng, Yinqiang, Yin, Wei, Chen, Zhengqing
Abstract
Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
Chinese Translation
闭环驾驶仿真需要实时交互,超越短暂的离线片段,这推动了当前驾驶世界模型向自回归(AR)展开。现有的AR蒸馏方法通常依赖于帧汇聚或学生端降级训练。前者由于快速的自我运动和快速的场景变化而在驾驶中表现不佳,而后者则受到教师单次输出长度的限制,因此仅提供有限的监督视野。一个自然的问题是:教师本身是否可以通过AR展开来扩展,以在有限的内存成本下提供无界视野的监督?关键的困难在于,标准教师在自身预测下会漂移,从而污染其提供的监督。我们的关键见解是使教师具备展开能力,确保其自身AR展开提供可靠的监督。这被实例化为HorizonDrive,一种用于AR驾驶仿真的抗漂移训练与蒸馏框架。首先,调度展开恢复(SRR)训练基础模型从预测损坏的历史中重建真实未来片段,从而产生在长AR展开中保持稳定的教师。其次,具备展开能力的教师通过AR展开进行扩展,在有限内存下提供长视野的分布匹配监督,同时短时间窗口的学生通过教师展开DMD(TRD)与之对齐,以实现高效的实时部署。HorizonDrive在有限内存下原生支持分钟级AR展开;在nuScenes数据集上,HorizonDrive将FID降低了52%,FVD降低了37%,并相对于最强的长时间流媒体基线降低了ARE和DTW分别为21%和9%,同时在单次通过的驾驶视频生成器中保持竞争力。
cs.CV / 54 / 2605.11605

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

保留音频无法表达的内容:面向全模态大语言模型的上下文保留令牌剪枝
Jung, Chaeyoung, Rho, Kyeongha, Chung, Joon Son
Abstract
Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
Chinese Translation
全模态大语言模型(Omni-LLMs)由于处理大量多模态输入令牌而产生了显著的计算开销,因此令牌减少对于实际应用至关重要。现有的Omni-LLM剪枝方法通常通过选择对当前查询重要或与跨模态线索强相关的令牌来降低这一成本。然而,这种策略可能会丢弃超出这些标准的证据,即使在回答不同问题或理解超出音频-视觉线索的上下文时也是如此。为了解决这一局限性,我们将Omni-LLM的令牌减少重新框定为在去除跨模态冗余的同时保留广泛的音频-视觉上下文。我们提出了ContextGuard,一个基于这一原则的推理时令牌剪枝框架。ContextGuard从音频中预测粗略的视觉语义,并剪除那些其粗略语义可能从音频中恢复的视频令牌,同时保留额外的视频令牌以保留音频无法单独指定的局部视觉细节。为了进一步压缩,我们的方法合并时间上相似的视频令牌。该框架不需要下游LLM的微调,仅使用一个独立训练的轻量级预测器。在Qwen2.5-Omni和Video-SALMONN2+的3B和7B规模下,针对六个音频-视觉基准,ContextGuard在剪除更多令牌的同时超越了之前的推理时剪枝方法。值得注意的是,在Qwen2.5-Omni 7B上,ContextGuard在六个基准中的五个上实现了全令牌级别的性能,同时剪除了55%的输入令牌。
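The final compression step described above (merging temporally similar video tokens) can be pictured with a small sketch. This is only an illustration under stated assumptions: the greedy grouping rule, the cosine-similarity criterion, the `threshold` value, and the function names are hypothetical and are not taken from ContextGuard.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_temporal(tokens, threshold=0.95):
    """Greedily merge consecutive tokens whose similarity to the running group mean
    is at least `threshold` (assumed rule); each group is replaced by its mean."""
    groups = []  # each entry: (sum_vector, count)
    for tok in tokens:
        if groups:
            total, count = groups[-1]
            mean = [v / count for v in total]
            if cosine(mean, tok) >= threshold:
                groups[-1] = ([a + b for a, b in zip(total, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    return [[v / count for v in total] for total, count in groups]

# Toy sequence: two identical frames followed by a different one.
merged = merge_temporal([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy run the first two tokens collapse into one group and the third stays separate, so three tokens become two.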
cs.CV / 55 / 2605.11616

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

通过记忆实现基础:跨场景与场内记忆用于三维功能性赋能
Wang, Qirui, He, Jingyi, Pan, Yining, Yang, Xulei, Li, Shijie
Abstract
Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.
Chinese Translation
功能性赋能的基础不仅仅需要识别物体:代理必须定位支持交互的特定区域,例如拉手或按钮。这对于无训练的视觉-语言管道来说是困难的,因为可操作区域通常较小、视觉上模糊,并且在场景中多个同类实例中重复出现。我们提出了AFFORDMEM,一个通过在两个层次上记忆几何体来实现三维功能性赋能的框架。第一个是跨场景赋能记忆:代理维护一个类别级别的记忆库,其中包含RGB图像,赋能区域以叠加的形式呈现,并在查询时回忆出最具信息性的示例,以引导冻结的视觉语言模型(VLM)关注文本提示常常遗漏的小可操作子区域。第二个是场内空间记忆:当代理处理场景时,它将候选实例及其三维空间关系组织成一个结构化的场景图,从而使语言模型能够解析对远处或当前未观察到的候选项的引用,例如“从顶部数第二个把手”。AFFORDMEM不需要模型微调和目标场景注释,使用从源场景构建的可重用记忆库。在SceneFun3D上,我们的方法在Split 0上提高了3.23的AP50,相较于之前的无训练最先进方法,在Split 1上提高了3.7。消融研究支持互补效益:跨场景赋能记忆改善了细粒度定位,而场内空间记忆则在空间限定查询上提供了更大的增益。项目主页可在项目页面上访问。
cs.CV / 56 / 2605.11622

RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction

RNA-FM:用于全基因组RNA测序预测的流匹配生成模型
Song, Yaxuan, Fan, Jianan, Wang, Tianyi, Hu, Qiuyue, Chang, Hang, Huang, Heng, Cai, Weidong
Abstract
Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
Chinese Translation
组织病理学全切片图像(WSIs)在临床实践中被常规获取,包含丰富的组织形态信息,但缺乏定义病理状态的直接分子结构和功能程序,而RNA测序(RNA-seq)则提供了全基因组的转录组特征,成本相对较高,因此推动了基于WSI的全基因组转录组预测。现有的从WSI预测基因表达的方法主要依赖于确定性回归和一对一映射,限制了其捕捉生物异质性和预测不确定性的能力。我们提出了RNA-FM,一种用于从WSI进行全基因组大规模RNA-seq预测的流匹配生成框架。RNA-FM将转录组预测形式化为一个连续时间条件传输问题,学习一个速度场,将简单的先验映射到基于形态条件的目标基因表达分布。通过整合通路级结构,RNA-FM实现了可扩展且具有生物学解释性的全基因组基因表达插补。大量实验表明,RNA-FM在保持生物学意义的同时,始终优于最先进的方法。代码可在 https://github.com/YXSong000/RNA-FM 获取。
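To make the flow-matching idea in this abstract concrete, the sketch below builds one training target under a linear interpolation path between a prior sample and a data sample. The linear path, the function names, and the toy vectors are assumptions for illustration; the abstract does not state which path, conditioning scheme, or pathway-level structure RNA-FM actually uses.

```python
import random

def flow_matching_target(x0, x1, t):
    """Return (x_t, v_target) for the assumed linear path x_t = (1 - t) * x0 + t * x1.

    The time derivative of this path is constant, x1 - x0, which serves as the
    regression target for the learned velocity field at (x_t, t).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def squared_error(v_pred, v_target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

rng = random.Random(0)
x1 = [2.0, -1.0, 0.5]                    # toy stand-in for an expression vector
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # sample from a simple Gaussian prior
x_t, v_target = flow_matching_target(x0, x1, t=0.25)
```

A conditional model would additionally receive the slide-derived features as input when predicting the velocity; that part is omitted here because the abstract gives no architectural detail.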
cs.CV / 57 / 2605.11628

Single-Shot HDR Recovery via a Video Diffusion Prior

通过视频扩散先验进行单次高动态范围(HDR)恢复
Talegaonkar, Chinmay, He, Jinshi, McKenna, Christopher, Antipa, Nicholas
Abstract
Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
Chinese Translation
最近的单次高动态范围(HDR)图像重建生成方法显示出良好的效果,但在保持输入图像的保真度方面常常面临挑战。它们需要单独的模型来处理高光和阴影,或者通过直接预测最终的HDR图像而牺牲可解释性。我们通过将单次HDR重建重新表述为条件视频生成,并将生成的帧融合成HDR图像,来解决这些局限性。我们对视频扩散模型进行微调,以生成一个曝光包,条件是低动态范围(LDR)输入。我们使用轻量级的UNet预测的每像素权重来融合这个图像包。这个公式简单、可解释且有效。我们的方案不是直接幻觉出HDR图像,而是明确地重建中间曝光堆栈并将其融合到最终输出中。我们的方法消除了在不同曝光条件下需要单独模型的需求,并生成具有高输入保真度的HDR重建。在定量基准测试中,我们在多个重建指标上超越了具有相当模型容量的最先进生成基线。人类评估者在72%的成对比较中更倾向于我们的结果,相较于现有方法。最后,我们展示了这种输入条件的序列生成和融合框架不仅限于HDR,还可以扩展到其他图像重建任务,例如从单个失焦模糊输入中恢复全聚焦图像。
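The fusion step described in the abstract (combining an exposure bracket with per-pixel weights) can be sketched as a weighted average of exposure-normalized frames. This is a minimal sketch, assuming linear pixel values and known relative exposure times; the `hat_weight` function is a hand-written stand-in for the weights that the paper obtains from a lightweight UNet, and all names here are hypothetical.

```python
def hat_weight(v):
    """Stand-in weight favouring mid-tones: 1.0 at v = 0.5, 0.0 at v = 0 or v = 1."""
    return 1.0 - abs(2.0 * v - 1.0)

def fuse_bracket(frames, exposure_times, weights=None, eps=1e-8):
    """Fuse an exposure bracket into one radiance map.

    frames: list of equally sized 2D lists with linear values in [0, 1] (assumed).
    exposure_times: one relative exposure time per frame (assumed known).
    weights: optional per-frame, per-pixel weights; defaults to hat_weight.
    """
    height, width = len(frames[0]), len(frames[0][0])
    hdr = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            num = den = 0.0
            for k, (frame, t) in enumerate(zip(frames, exposure_times)):
                v = frame[y][x]
                w = weights[k][y][x] if weights is not None else hat_weight(v)
                num += w * v / t   # divide by exposure time to estimate radiance
                den += w
            hdr[y][x] = num / (den + eps)
    return hdr

# Two 1x1 frames of the same scene point at exposure times 1 and 2.
hdr = fuse_bracket([[[0.25]], [[0.5]]], exposure_times=[1.0, 2.0])
```

Keeping the bracket explicit is what makes the pipeline inspectable: each intermediate frame can be viewed on its own before fusion.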
cs.CV / 58 / 2605.11634

Unlocking UML Class Diagram Understanding in Vision Language Models

解锁视觉语言模型中的 UML 类图理解
Naboichenko, Artem, Peinl, René
Abstract
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made on bar charts, line charts, and similar diagrams, there is still little research on other types of diagrams, e.g., in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams that is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based fine-tune easily outperforms Qwen 3.5 27B, a recent VLM that performs well on many other benchmarks.
Chinese Translation
尽管视觉语言模型(VLMs)在各种应用场景中取得了巨大的进展,但在回答与图表相关的问题时,仍然落后于对照片的处理。尽管在条形图、折线图等图表领域取得了一定的进展,但针对其他类型图表的研究仍然较少,例如计算机科学领域。我们的工作提出了一个基于 UML 类图的视觉问答基准,该基准既具有挑战性又可管理。我们进一步构建了一个大规模的训练数据集,包含 16,000 个图像-问题-答案三元组,并展示了基于 LoRA 的微调方法轻松超越了 Qwen 3.5 27B,这是一种在许多其他基准中表现良好的最新 VLM。
cs.CV / 59 / 2605.11651

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

隐藏以观察:视觉锚定思维中的推理前缀掩蔽用于VLM蒸馏
Yu, Seonghoon, Nam, Dongjun, Lee, Byung-Kwan, Son, Jeany
Abstract
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
Chinese Translation
最近在视觉语言模型(VLM)中的思考-回答方法,如Qwen3-VL-Thinking,通过利用最终答案之前的中间思考步骤提升了推理性能,但其高计算成本限制了实际应用。为了将这种能力蒸馏到紧凑的思考-回答VLM中,主要目标是提高学生在推理过程中利用视觉证据的能力。为此,我们提出了一种新颖的思考-回答蒸馏框架,鼓励学生通过掩蔽其显著的推理前缀来将思维锚定在视觉信息上。为了弥补这些被掩蔽的文本线索,学生被鼓励在蒸馏过程中更多依赖视觉证据作为替代信息源。我们的掩蔽策略包括:1)逐词显著推理前缀掩蔽,针对每个下一个标记预测选择性地掩蔽高影响力的推理前缀;2)自适应掩蔽预算调度,根据蒸馏难度逐渐增加掩蔽规模,蒸馏难度通过教师-学生分布之间的差异来衡量。在蒸馏阶段,学生受到我们的显著推理前缀掩蔽的指导,该掩蔽同时阻止未来标记和显著推理线索,而不是用于自回归语言建模的标准因果掩蔽。实验结果表明,我们的方法在多模态推理基准测试中优于最近的开源VLM、VLM蒸馏和自蒸馏方法,同时进一步分析确认了学生思维过程中视觉利用的增强。
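The mask described in the abstract, which blocks future tokens and also hides selected reasoning-prefix tokens per prediction step, can be sketched as a modified causal mask. This is a minimal sketch under stated assumptions: how the salient positions are chosen is not given in the abstract, so `salient` is a hypothetical input, and the function names are illustrative.

```python
def causal_mask(n):
    """Standard causal mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def reasoning_prefix_mask(n, salient):
    """Causal mask that additionally hides chosen earlier positions per query.

    salient: dict mapping a query position i to the set of earlier positions
    (j < i) hidden from it. Each position can still attend to itself.
    """
    mask = causal_mask(n)
    for i, hidden in salient.items():
        for j in hidden:
            if j < i:
                mask[i][j] = False
    return mask

# Hide position 0 from the prediction made at position 2.
mask = reasoning_prefix_mask(3, {2: {0}})
```

Because the hidden set is specified per query position, the same prefix token can remain visible for some predictions and be masked for others, matching the token-wise description above.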
cs.CV / 60 / 2605.11654

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

基于原型的语义部分发现的抗天气跨视角地理定位
Tran, Chi-Nguyen, Minh, Dao Sy Duy, Kiet, Huynh Trung, Quy, Nguyen Lam Phu, Pham, Phu-Hoa, Tran-Thanh, Long
Abstract
Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
Chinese Translation
跨视角地理定位(CVGL)通过将倾斜无人机视角与地理参考的卫星图块匹配,已成为在GNSS信号被干扰、欺骗或不可用时,自动无人机导航的关键替代方案。尽管近期取得了显著进展,但仍存在三大局限:(1)全局描述符设计将图块网格压缩为单一向量,而未能在视角差异中分离布局与纹理;(2)学习到的嵌入中保留了与高度相关的尺度变化,而不是将其边缘化;(3)多目标训练依赖于手动调整的标量,处理不兼容的梯度尺度损失。我们提出了SkyPart,一种轻量级可互换的头部,适用于基于图块的视觉变换器(ViTs),在图块网格上实施显式的部分分组。SkyPart具有四个理论基础组件:(i)通过单次余弦分配竞争图块标记的可学习原型;(ii)仅在训练期间应用的高度条件线性调制,使得检索嵌入在推理时不受高度影响;(iii)对活动原型的图注意力读取;(iv)一种Kendall不确定性加权的多目标损失,其驻点为Pareto驻点。在2695万参数和22.14 GFLOPs下,SkyPart是顶尖方法中最小的,并在单次传递、无重排序、无TTA协议下在SUES-200、University-1652和DenseUAV上设定了新的最优状态。其相较于最强基线的优势在十种条件的WeatherPrompt腐蚀基准下进一步扩大。
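Component (iv) above refers to a Kendall uncertainty-weighted multi-objective loss. A common simplified form of that idea, with one learnable log-variance per task, is sketched below; whether SkyPart uses exactly this form (for example with extra 1/2 factors) is not stated in the abstract, so treat the formula and names as assumptions.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses L_i with learnable log-variances s_i (assumed form):

        total = sum_i exp(-s_i) * L_i + s_i

    exp(-s_i) down-weights a task, while the additive s_i term penalizes
    inflating the variance without bound.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# With all log-variances at zero, the total reduces to a plain sum of losses.
total = uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0])
```

For a fixed positive loss value L_i, the per-task term is minimized at s_i = log L_i, which is how this scheme replaces hand-tuned scalar weights with learned ones.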
cs.CV / 61 / 2605.11659

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

复兴源无关跨领域少样本学习的领域内微调方法
Zhao, Yaze, Liu, Yicong, Zou, Yixiong, Li, Yuhua, Li, Ruixuan
Abstract
Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find that adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover that LoRA's superiority stems from rectifying the collapsed attention of the visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find that the textual EOS token exhibits much better attention to visual samples, and that CLIP's standard contrastive loss only weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Code will be released.
Chinese Translation
跨领域少样本学习(CDFSL)旨在将大规模预训练模型适应于样本有限的专门目标领域,但视觉-语言模型(如 CLIP)的少样本微调仍然未得到充分探索。通过为 CDFSL 建立多个 CLIP 的微调基线,我们发现基于适配器的方法(如 LoRA)在性能上始终优于基于提示的方法(如 MaPLe),这与领域内场景相反。为了使这些有效的领域内方法在 CDFSL 中再次具有竞争力,我们分析了这一现象,并发现 LoRA 的优势源于修正视觉 CLS 令牌的崩溃注意力,通过关注与文本相关的视觉区域来增强模态对齐和类别分离。此外,我们发现文本 EOS 令牌对视觉样本表现出更好的注意力,而 CLIP 的标准对比损失对模态对齐的约束较弱。基于这些见解,我们提出了语义探测器(Semantic Probe),这是一个即插即用的注意力修正框架,适用于基于适配器和基于提示的方法。在四个 CDFSL 基准上的广泛实验验证了我们的推理,达到了最先进的性能,并使两种微调范式受益。代码将会发布。
cs.CV / 62 / 2605.11680

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

ShapeCodeBench:用于合成形状场景的感知到程序重建的可再生基准
Kumar, Shivam
Abstract
We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
Chinese Translation
我们介绍了ShapeCodeBench,这是一个用于感知到程序重建的合成基准:给定一个渲染的光栅图像,模型必须输出一个可执行的绘图程序,该程序由确定性评估器重新渲染并与目标进行比较。v1 DSL在一个512 x 512的白底黑色画布上有四个原语,但每个实例都是从种子随机数生成器生成的,因此可以创建新的保留集以减少精确实例的污染。我们发布了一个冻结的eval_v1拆分,其中包含150个样本,分为简单、中等和困难三个层次,评分标准包括精确匹配、像素准确度、前景IoU、解析成功率和执行成功率。我们评估了一个空程序的基线、一个经典计算机视觉启发式算法、Claude Opus 4.7在高和最大努力下的表现,以及GPT-5.5在中等和额外高推理努力下的表现。该启发式算法在简单场景中表现具有竞争力,但当重叠融合组件时则崩溃;最强的多模态配置保留了大部分前景结构,但由于小参数误差仍然错过了精确匹配。总体上最佳的精确匹配仍然较低,因此ShapeCodeBench远未饱和。我们发布了基准代码、冻结数据集、运行产物和论文源文件,以支持独立复现和扩展。
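Three of the image-level scores named in the abstract (exact match, pixel accuracy, foreground IoU) can be illustrated on small binary rasters. This is a sketch, assuming rasters are lists of rows with 1 for foreground (drawn) pixels and 0 for background; the function names and the convention of returning 1.0 for two empty foregrounds are assumptions, not the benchmark's released evaluator.

```python
def exact_match(pred, target):
    """True only when every pixel agrees."""
    return pred == target

def pixel_accuracy(pred, target):
    """Fraction of pixels with the same value in both rasters."""
    total = sum(len(row) for row in target)
    same = sum(p == t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return same / total

def foreground_iou(pred, target):
    """Intersection over union of foreground pixels (1.0 if both are empty, by assumption)."""
    inter = union = 0
    for pr, tr in zip(pred, target):
        for p, t in zip(pr, tr):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0

target = [[1, 1], [0, 0]]
pred = [[1, 0], [0, 0]]
scores = (exact_match(pred, target), pixel_accuracy(pred, target), foreground_iou(pred, target))
```

The toy pair shows why the metrics diverge: one missed foreground pixel still gives 75% pixel accuracy but only 0.5 foreground IoU and a failed exact match, consistent with the abstract's remark that small parameter errors defeat exact match.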
cs.CV / 63 / 2605.11683

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

DORA:用于视觉变换器中令牌合并的动态在线强化代理
He, Kaixuan, Chen, Song, Kang, Yi
Abstract
Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.
Chinese Translation
视觉变换器(ViTs)由于自注意力相对于令牌序列长度的平方复杂度,导致显著的计算开销。尽管现有的令牌减少方法缓解了这一问题,但它们主要依赖于固定的启发式指标、预定义的比例或静态离线掩码,缺乏在推理过程中捕捉输入相关冗余的适应性。本文提出了DORA(动态在线强化代理),这是第一个基于强化学习(RL)的在线推理框架,用于ViTs中的动态令牌合并。我们将合并过程形式化为一个序列马尔可夫决策过程(MDP),其中一个轻量级的RL代理根据当前特征状态和层特定上下文确定每个变换器块的合并策略。为了平衡计算效率和特征保真度,代理通过一个包含非线性蒸馏基罚的密集奖励函数进行优化。我们实现了一个不对称的演员-评论家架构,该架构利用高容量的评论家进行稳定的离线训练,同时保留一个最小的演员头以实现低计算的在线推理。在多个ViT规模(从Tiny到Large)的评估中,DORA在准确性-效率的Pareto前沿上相较于当前基线表现出改善。在严格的可忽略准确性下降约束(<= 0.05%)下,DORA实现了高达12.66%的令牌合并率,并在最有效的基线之上提供了高达569.7%的相对改善。在ImageNet-1K上,在对齐的准确性约束下,DORA在计算节省方面相比于最先进的方法实现了高达76%的相对改善。此外,在ImageNet-A和ImageNet-C等分布外(OOD)基准上,DORA获得了超过430%的相对效率优势。
cs.CV / 64 / 2605.11695

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

异构视觉智能体通过去中心化学习实现的自发通信
Ochiai, Mikako, Nagano, Masatoshi, Taniguchi, Tadahiro
Abstract
Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
Chinese Translation
符号是共享的,但感知是私有的。我们研究异构视觉智能体通过去中心化学习实现的自发通信,探讨在智能体具有不同视觉表征时,哪些视觉信息可以变得可共享。我们的智能体并不通过共享的外部通信目标来优化信息,而是仅交换离散的标记序列,并利用本地感知证据更新自身模型。该设置关注自发通信的一个未被充分探索的方面,考察在没有共享感知访问的情况下,是否可以产生共同的符号,以及私有视觉空间之间的相似性如何限制所产生语言的内容和对称性。我们在大都会-哈斯廷斯字幕游戏(Metropolis-Hastings Captioning Game, MHCG)中实例化这一设置,其中两个智能体通过交换提议的标记序列共同形成共享字幕,听者根据自身的视觉特征使用MH风格标准接受或拒绝这些序列。我们比较了三对冻结的视觉编码器,智能体从随机初始化的文本模块开始。对MS-COCO的实验表明,MHCG生成的视觉信息丰富的共享标记序列在跨智能体对齐、视觉特征预测和图像-文本检索方面优于无通信基线;随着编码器不匹配程度的增加,所有跨智能体指标均下降。适度的编码器异质性减少了共享序列的数量,同时保持了每个序列的视觉特异性,而更强的编码器异质性则产生了更少、更粗糙且更不对称的序列。消融实验表明,听者端的MH接受对于避免退化标记形成至关重要。这些结果表明,共享符号可以仅通过本地感知评估产生,编码器之间的视觉表征相似性塑造了所产生语言的内容和对称性。
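The listener-side accept/reject step can be sketched with a Metropolis-Hastings style ratio evaluated under the listener's own model. This is a minimal sketch under assumptions: the acceptance rule min(1, p(proposed) / p(current)) and the `log_likelihood` callback are illustrative stand-ins, since the abstract only says the criterion is MH-style and scored against the listener's visual features.

```python
import math
import random

def mh_accept_prob(logp_proposed, logp_current):
    """Assumed acceptance probability min(1, exp(logp_proposed - logp_current))."""
    if logp_proposed >= logp_current:
        return 1.0  # also avoids overflow in exp for large positive gaps
    return math.exp(logp_proposed - logp_current)

def listener_step(current, proposed, log_likelihood, rng):
    """Keep the current caption or adopt the speaker's proposal.

    log_likelihood: hypothetical callback scoring a token sequence under the
    listener's own model and private visual features.
    """
    p = mh_accept_prob(log_likelihood(proposed), log_likelihood(current))
    return proposed if rng.random() < p else current

# Toy listener that scores captions from a fixed table.
table = {"a dog": math.log(0.5), "a cat": math.log(0.25)}
rng = random.Random(0)
kept = listener_step("a cat", "a dog", table.get, rng)
```

Because the ratio uses only the listener's own scores, no gradients or features cross between agents, which matches the decentralized setting described above.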
cs.CV / 65 / 2605.11696

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

WildRelight:用于单图像重光照的真实世界基准和物理引导适应
Wang, Lezhong, Kaya, Mehmet Onurcan, Bigdeli, Siavash, Frisvad, Jeppe Revall
Abstract
Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.
Chinese Translation
近期的单图像重光照方法,得益于先进的生成模型,在合成基准上取得了令人印象深刻的照片级真实感。然而,它们在复杂的真实世界视觉环境中的有效性仍然在很大程度上未得到验证。当前的数据集通常设计用于多视角重建,未能解决单图像重光照的独特挑战,因此存在一个关键的缺口。为了解决这一合成与真实之间的差距,我们引入了WildRelight,这是第一个专门为评估单图像重光照模型而创建的野外数据集。WildRelight包含了一系列多样化的高分辨率户外场景,这些场景在严格对齐、时间变化的自然光照下捕获,并与高动态范围环境图相配对。利用这些数据,我们建立了一个严格的基准,揭示了在合成数据上训练的最先进模型遭受严重的领域转移。WildRelight严格对齐的时间结构为领域适应提供了一种新范式。我们通过引入一个物理引导推断框架来证明这一点,该框架利用捕获的自然光演变作为自我监督约束。通过将扩散后验采样(Diffusion Posterior Sampling, DPS)与时间采样感知测试时适应(Temporal Sampling-Aware Test-Time Adaptation, TTA)相结合,我们展示了该数据集使合成模型能够即时与真实世界统计数据对齐,将难以处理的模拟到真实挑战转变为可处理的自我监督任务。该数据集和代码将公开发布,以促进稳健的、基于物理的重光照研究。
cs.CV / 66 / 2605.11704

ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

ScaleMoGen:用于人类运动生成的自回归下一尺度预测
Hwang, Inwoo, Jang, Hojun, Zhou, Bing, Wang, Jian, Kim, Young Min, Guo, Chuan
Abstract
We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
Chinese Translation
我们提出了ScaleMoGen,这是一种基于尺度的自回归框架,用于文本驱动的人类运动生成。与依赖于标准下一标记预测的传统自回归方法不同,ScaleMoGen将运动生成框架化为一个粗到细的过程。我们将3D运动量化为跨越多个骨骼-时间尺度的组合离散标记,学习通过自回归预测下一尺度的标记图来生成运动。为了保持结构完整性,我们的运动标记器和量化器被明确设计,使得每个尺度的离散标记严格保留骨骼层次。此外,我们采用逐位量化和预测,这有效地扩大了标记器词汇量,以保留运动细节并稳定优化。大量实验表明,ScaleMoGen实现了最先进的性能,在HumanML3D上建立了0.030的FID(相比于MoMask的0.045),在SnapMoGen数据集上获得了0.693的CLIP Score(相比于MoMask++的0.685)。此外,我们展示了我们的骨骼-时间多尺度表示自然地促进了无训练的文本引导运动编辑。
cs.CV / 67 / 2605.11705

CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

CAST:考虑崩溃的多尺度拓扑融合用于多模态核心集选择
Zhao, Boran, Liu, Hetian, Hu, Zhenxian, Yuan, Yuqing, Yan, Yu, Ren, Pengju
Abstract
The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.
Chinese Translation
大型多模态模型的训练基本上依赖于海量的图像-文本数据集,这不可避免地导致了巨大的计算开销。数据集选择通过识别高度信息丰富的核心集提供了一种有前景的范式。然而,现有方法存在两个关键限制:(i)单模态主导的采样方法,忽视了多模态数据集中固有的细粒度跨模态信息不平衡,从而导致另一模态的语义损失;(ii)基于粗粒度样本评分的采样方法,所选核心集往往偏向于评分模型,使得难以保证核心集与原始数据集之间的分布等价性。同时,现有的分布匹配和离散采样策略往往未能共同考虑全局语义结构、局部细粒度细节和密集区域的冗余感知覆盖。为此,我们提出了CAST,一个考虑崩溃的多尺度拓扑融合框架,用于多模态核心集选择。我们首先构建图像和文本模态的拓扑,并通过局部崩溃感知细化和跨模态融合推导出统一的拓扑。然后,我们在扩散小波域引入多尺度分布匹配标准,鼓励核心集在多个尺度上逼近原始数据集。最后,我们引入一种局部软关系覆盖机制,将纯几何覆盖扩展为关系感知的间接覆盖,惩罚密集簇中的冗余选择。在Flickr30K和MS-COCO上的大量实验表明,CAST在数据集选择基线中表现优于现有方法,在跨架构泛化和能效方面展现出相对于最先进的多模态合成方法的显著优势。
cs.CV / 68 / 2605.11722

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

EPIC:用于组合文本到图像生成的高效谓词引导推理时控制
Mun, Sunung, Cho, Sunghyun, Ok, Jungseul
Abstract
Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
Chinese Translation
最近的文本到图像(T2I)生成器能够合成逼真的图像,但在处理涉及多个对象、数量、属性和关系的组合提示时仍然存在困难。我们提出了EPIC(高效谓词引导推理时控制),这是一个无需训练的推理时精炼框架,用于组合T2I生成。EPIC将精炼视为谓词引导搜索:它将原始提示解析为一个固定的视觉程序,其中包含对象变量和类型化谓词,涵盖可检查的条件,如对象存在性、数量、属性和关系。每个生成或编辑的图像都通过从该图像中提取的视觉证据与该程序进行验证。仅当所有谓词都满足时,图像才被判断为满足提示;否则,失败的谓词决定下一步,将局部失败路由到针对性的编辑,将全局失败路由到重新采样,同时固定的视觉程序保持不变。在GenEval2上,EPIC将单次生成的基础生成器的提示级准确率从34.16%提高到71.46%。在相同的生成器/编辑器设置和最大图像-模型执行预算下,EPIC比最强的先前精炼基线提高了19.23个百分点,同时在图像-模型执行中减少了31%的实际成本,在MLLM调用中减少了72%,在每个提示的MLLM令牌中减少了81%。
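The predicate-guided routing described in the EPIC abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Predicate` record, the `scope` labels, the example predicates, and the choice to treat a count failure as global are all assumptions made for illustration. The sketch only shows the control rule stated in the abstract: accept when every predicate holds, otherwise let the failed predicates choose between targeted editing and resampling while the program itself stays fixed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical predicate record. `scope` marks whether a failure is treated
# as local (fixable by editing) or global (requires resampling).
@dataclass(frozen=True)
class Predicate:
    name: str
    scope: str                     # "local" or "global" (assumed labels)
    check: Callable[[Dict], bool]  # evaluates evidence extracted from one image


def failed_predicates(program: List[Predicate], evidence: Dict) -> List[Predicate]:
    """Return the predicates that the extracted evidence does not satisfy."""
    return [p for p in program if not p.check(evidence)]


def next_action(program: List[Predicate], evidence: Dict) -> str:
    """Pick the next step from the failed predicates; the program is never modified."""
    failed = failed_predicates(program, evidence)
    if not failed:
        return "accept"      # all predicates satisfied
    if any(p.scope == "global" for p in failed):
        return "resample"    # at least one global failure
    return "edit"            # only local failures remain


# Illustrative fixed program for a prompt such as "two dogs and a red ball".
program = [
    Predicate("two_dogs", "global", lambda e: e.get("dog_count") == 2),
    Predicate("red_ball", "local", lambda e: e.get("ball_color") == "red"),
]

print(next_action(program, {"dog_count": 2, "ball_color": "blue"}))
```

Because the program is parsed once and never changes, every candidate image is judged against the same checklist, which is what makes the accept decision comparable across edits and resamples.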
cs.CV / 69 / 2605.11723

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

CaC:通过层次化时空集中推进视频奖励模型
Wang, Jiyuan, Ouyang, Huan, Lin, Jiuzhou, Lin, Chunyu, Fan, Dewen, Zhang, Boheng, Fan, Haonan, Zuo, Fei, Sun, Jia, Wang, Huaiqing, Wang, Honglie, Fan, Yiyang, Yuan, Zhenlong, Li, Zijun, Heng, Yongrui, Lin, Guosheng, Yang, Fan, Gao, Tingting
Abstract
In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
Chinese Translation
在本文中,我们提出了集中与集中(Concentrate and Concentrate, CaC),这是一种基于视觉-语言模型的粗到细异常奖励模型。在推理过程中,它首先进行全局时间扫描以锚定异常时间窗口,然后在局部区间内进行细粒度空间定位,最后通过结构化时空链式思维推理得出稳健的判断。为了赋予模型这些能力,我们构建了第一个大规模生成视频异常数据集,该数据集包含逐帧边界框注释、时间异常窗口和细粒度归因标签。在此数据集的基础上,我们设计了一个三阶段的渐进训练范式。模型最初通过单帧和多帧的监督微调学习空间和时间锚定,然后通过基于双轮组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习策略进行优化。除了传统的准确性奖励外,我们引入了时间和空间交并比(IoU)奖励,以监督中间定位过程,有效引导模型朝向更具基础性和可解释性的时空推理。大量实验表明,CaC能够稳定地集中于细微异常,在细粒度异常基准上实现了25.7%的准确性提升,并且当作为奖励信号使用时,CaC将生成视频异常减少了11.7%,同时提高了整体视频质量。
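The Temporal and Spatial IoU rewards mentioned in the CaC abstract can be sketched as plain interval and box overlap scores. The abstract does not say how the rewards are combined, so the `total_reward` function and its weights `w_t` and `w_s` are hypothetical; only the IoU definitions themselves are standard.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def spatial_iou(pred, gt):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def total_reward(correct, t_iou, s_iou, w_t=0.5, w_s=0.5):
    """Accuracy reward plus weighted localization rewards (weights are assumed)."""
    return float(correct) + w_t * t_iou + w_s * s_iou


t = temporal_iou((2.0, 6.0), (4.0, 8.0))   # predicted vs. annotated anomaly window
s = spatial_iou((0, 0, 2, 2), (1, 1, 3, 3))  # predicted vs. annotated bounding box
print(total_reward(True, t, s))
```

Rewarding the intermediate localization in this way means a correct final verdict reached from the wrong window or region scores lower than one grounded in the annotated anomaly.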
cs.CV / 70 / 2605.11743

WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views

WorldComp2D:基于局部视图的物体身份和位置的空间语义表示
Jin, SeongMin, Jeong, Doo Seok
Abstract
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.
Chinese Translation
学习能够捕捉语义和空间信息的潜在表示是高效空间语义推理的核心。然而,许多现有方法依赖于隐式潜在结构,结合密集特征图或特定任务的头部,这限制了计算效率和灵活性。我们提出了WorldComp2D,这是一种新颖的轻量级表示学习框架,它通过多尺度局部感受野显式地根据物体身份和空间邻近性构建潜在空间几何。该框架包括(i)一个依赖于邻近性的编码器,将给定的观测映射到空间语义潜在空间,以及(ii)一个定位器,从生成的空间语义表示中推断输入中物体的坐标。以面部关键点定位作为概念验证,我们展示了与最先进的轻量级模型相比,WorldComp2D在参数数量和FLOPs上分别减少了多达4.0倍和2.2倍,同时在CPU上保持实时性能。这些结果表明,显式构建的潜在空间为空间语义推理提供了高效且通用的基础。该框架已在https://github.com/JinSeongmin/WorldComp2D上开源。
cs.CV / 71 / 2605.11748

BronchoLumen: Analysis of recent YOLO-based architectures for real-time bronchial orifice detection in video bronchoscopy

BronchoLumen:基于YOLO的实时支气管口检测架构分析
Li, Yongchao, Himstedt, Marian
Abstract
Bronchoscopy is routinely conducted in pulmonary clinics and intensive care units, but navigating the complex branching of the respiratory tract remains challenging. This paper introduces BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy, aiming to assist navigation and CAD systems. The paper investigates whether bronchial orifices can be robustly detected across image domains using state-of-the-art object detection and a limited set of public image data. The study includes the description and comparison of YOLOv8, a widely adopted architecture, and YOLOv12, a more recent architecture integrating attention-based modules to improve spatial reasoning. Both models are trained and tested solely on publicly available datasets comprising different image domains. A comparison of both models is conducted based on the common metrics mAP@0.5 and mAP@0.5:0.9, with the latter emphasizing localization accuracy. For YOLOv8 we obtained an mAP@0.5 of 0.91 on an in-domain and 0.68 on a cross-domain test set. YOLOv12 achieved 0.84 and 0.68, respectively, with slightly better localization accuracy: an mAP@0.5:0.9 of 0.48 and 0.26, compared to 0.45 and 0.25 for YOLOv8. Challenges like motion blur and low contrast occasionally entailed uncertainties but the system demonstrated overall robustness in most scenarios. BronchoLumen is an open-weight, YOLO-based solution for bronchial orifice detection offering high accuracy and efficiency across multiple image domains. While the more recent YOLOv12 achieves better localization accuracy, we observed a slightly worse precision. The models have been made publicly available to foster further research in bronchoscopy navigation.
Chinese Translation
支气管镜检查在肺科诊所和重症监护病房中常规进行,但在复杂的呼吸道分支中导航仍然具有挑战性。本文介绍了BronchoLumen,一个基于YOLO的实时系统,用于在视频支气管镜检查中检测支气管口,旨在辅助导航和计算机辅助诊断(CAD)系统。研究探讨了是否可以在不同图像域中稳健地检测支气管口,使用最先进的目标检测技术和有限的公共图像数据集。研究包括对广泛采用的架构YOLOv8和更近期的架构YOLOv12的描述与比较,后者集成了基于注意力的模块以改善空间推理。两个模型均仅在包含不同图像域的公共可用数据集上进行训练和测试。基于常用指标mAP@0.5和mAP@0.5:0.9对两个模型进行了比较,后者强调定位精度。对于YOLOv8,我们在同域测试集上获得了0.91的mAP@0.5,在跨域测试集上获得了0.68。YOLOv12分别达到了0.84和0.68的结果,定位精度略有提升,mAP@0.5:0.9为0.48和0.26,而YOLOv8的对应值为0.45和0.25。运动模糊和低对比度等挑战偶尔导致不确定性,但系统在大多数场景中表现出整体的稳健性。BronchoLumen是一个开放权重的基于YOLO的支气管口检测解决方案,在多个图像域中提供高准确性和效率。尽管更近期的YOLOv12在定位精度上表现更佳,但我们观察到其精确度略有下降。这些模型已公开发布,以促进支气管镜导航的进一步研究。
cs.CV / 72 / 2605.11756

Focusable Monocular Depth Estimation

可聚焦的单目深度估计
Du, Yuxin, Lin, Tao, Zhong, Zile, Li, Runting, Chen, Xiyao, Liu, Jiting, Liu, Chenglin, Chen, Ying-Cong, Fu, Yuqian, Zhao, Bo
Abstract
Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
Chinese Translation
单目深度基础模型在不同场景中具有良好的泛化能力,但通常是通过统一的像素级目标进行优化,这种方法并未区分用户指定的或任务相关的目标区域与周围背景。因此,我们提出了可聚焦的单目深度估计(Focusable Monocular Depth Estimation, FDE),这是一项区域感知的深度估计任务。在该任务中,给定一个指定的目标区域,模型需要优先考虑前景深度的准确性,保持清晰的边界过渡,并维持一致的全局场景几何形状。为了优先建模任务关键区域,我们提出了FocusDepth,这是一种基于提示的单目相对深度估计框架,通过框/文本提示引导深度建模关注目标区域。FocusDepth中的核心多尺度空间对齐融合(Multi-Scale Spatial-Aligned Fusion, MSSA)将来自Segment Anything Model 3的多尺度特征与Depth Anything系列进行空间对齐,并通过尺度特定的门控条件融合进行注入。这使得在不干扰几何表示的情况下,能够密集注入提示线索,从而赋予深度估计模型聚焦感知能力。为了研究FDE,我们建立了FDE-Bench,这是一个以目标为中心的单目相对深度基准,基于五个数据集中的图像-目标-深度三元组构建,包含252.9K/72.5K的训练/验证三元组和972个类别,涵盖真实世界和具身模拟环境。在FDE-Bench上,FocusDepth在框和文本提示下始终优于全局微调的DA2/DA3基线,最大的提升出现在目标边界和前景区域,同时保持全局场景几何形状。消融实验表明,MSSA的空间对齐是关键设计因素,因为干扰提示与几何的对应关系会使AbsRel增加最多达13.8%。
cs.CV / 73 / 2605.11760

M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

M$^4$-SAM:基于记忆增强的多模态专家混合模型用于RGB-D视频显著目标检测
Liu, Jiyuan, Lin, Jia, Zhou, Xiaofei, Cong, Runmin, Liu, Deyang, Liu, Zhi
Abstract
The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges: limited spatial modeling of linear LoRA, insufficient use of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
Chinese Translation
Segment Anything Model 2 (SAM2) 已成为通用分割的基础模型。由于其可泛化的视觉表征,SAM2已成功应用于各种下游任务。然而,将SAM2扩展到RGB-D视频显著目标检测(RGB-D VSOD)任务时面临三个挑战,包括线性LoRA的空间建模能力有限、SAM的多尺度特征利用不足,以及初始化对显式提示的依赖。为了解决这些问题,我们提出了基于记忆增强的多模态专家混合模型(M$^4$-SAM),该模型为SAM2配备了与模态相关的PEFT、分层特征融合和无提示的记忆初始化。首先,我们将模态感知的MoE-LORA注入到SAM2的编码器中,该模块利用卷积专家编码局部空间先验,并引入模态调度器以实现高效的多模态微调。其次,我们部署了门控多层特征融合,该方法通过自适应门控机制分层聚合多尺度编码器特征,以平衡空间细节和语义上下文。最后,为了在没有手动提示的情况下进行零-shot VSOD,我们利用伪引导初始化,其中粗略掩码被视为伪先验并用于引导记忆库。大量实验表明,M$^4$-SAM在三个公共RGB-D VSOD数据集上在所有评估指标上均达到了最先进的性能。
cs.CV / 74 / 2605.11771

Revisiting Shadow Detection from a Vision-Language Perspective

从视觉-语言视角重新审视阴影检测
Wang, Yonghui, Zhou, Wengang, Feng, Hao, Li, Houqiang
Abstract
Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.
Chinese Translation
阴影检测通常被表述为一个以视觉为驱动的密集预测问题,其中模型主要依赖于逐像素的视觉监督来区分阴影和非阴影区域。然而,这种表述在视觉模糊的情况下可能变得不可靠,因为相似的暗区域可能对应于投射阴影或本质上暗的表面,仅凭视觉证据不足以建立稳定的决策规则。在本研究中,我们从视觉-语言的角度重新审视阴影检测,并认为稳健的预测受益于超越视觉线索的明确语义参考。我们提出了SVL(Shadow Vision-Language)框架,该框架利用语言作为明确的语义参考,以消除视觉上相似的暗区域中的阴影歧义。SVL通过场景级阴影比例回归目标,将全局图像表示与阴影相关的文本嵌入对齐,从而提供关于阴影整体范围的图像级指导。为了将这种全局指导转移到密集推断中,SVL引入了一种全局到局部的耦合机制,以强制图像级指导与补丁级预测之间的一致性。同时,SVL应用局部补丁级约束与文本嵌入结合,以改善在具有挑战性的外观条件下的细粒度区分。基于冻结的DINOv3图像编码器,该框架仅学习轻量级的投影和解码模块,设计出参数高效,训练参数少于1%的模型。在多个阴影检测基准上进行的广泛实验,包括专门的困难案例评估,表明该方法在视觉模糊条件下具有强大的整体性能和改善的鲁棒性。
cs.CV / 75 / 2605.11782

Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

基于视觉问答的事件地图在低视力人群中的城市风险感知导航
Valls, Antoni, Sanchez-Riera, Jordi
Abstract
Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
Chinese Translation
视觉障碍影响全球数亿人,严重限制了他们安全独立地导航城市环境的能力。虽然可穿戴辅助设备为实时危险检测提供了有前景的平台,但现有方法依赖于特定任务的视觉处理流程,缺乏灵活性和普适性。在本研究中,我们提出了一种基于视觉问答的事件地图框架,利用视觉语言模型(Vision-Language Models, VLMs)进行行人场景描述和在多样化真实环境中的危险识别,采用三层层次查询结构以实现细粒度场景理解,而无需特定任务的重新训练。模型响应被汇总到一个加权风险评分系统中,将街道段映射到四个离散的安全类别,从而生成可导航的风险感知事件地图以供路线规划。为了支持评估和未来研究,我们引入了一个地理多样化的数据集,涵盖六大洲的20个城市,包含超过800张标注图像和18,000个回答问题。我们对四种视觉问答架构(ViLT、LLaVA、InstructBLIP和Qwen-VL)进行了基准测试,发现生成型多模态大型语言模型(Multimodal Large Language Models, MLLMs)在性能上显著优于基于分类的方法,其中Qwen-VL在精确度和召回率之间实现了最佳整体平衡。这些结果证明了MLLMs作为视觉障碍人士辅助导航系统的灵活且可普适的基础的可行性。
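The aggregation step in this abstract (VQA answers combined into a weighted risk score, then mapped to four discrete safety categories) can be sketched as follows. The abstract gives neither the questions, the weights, the thresholds, nor the category names, so `HAZARD_WEIGHTS`, `THRESHOLDS`, and `CATEGORIES` below are entirely hypothetical placeholders chosen to make the mechanism concrete.

```python
# Hypothetical weights for yes/no hazard questions answered by a VQA model.
HAZARD_WEIGHTS = {
    "construction": 3.0,
    "no_sidewalk": 2.5,
    "heavy_traffic": 2.0,
    "obstacle": 1.0,
}
# Hypothetical category names and score cut-offs (four discrete levels).
CATEGORIES = ["safe", "caution", "risky", "dangerous"]
THRESHOLDS = [0.15, 0.40, 0.70]


def risk_score(answers):
    """Normalized risk in [0, 1] for one street segment.

    `answers` holds one dict per image, mapping a hazard name to the
    model's yes/no answer for that image.
    """
    if not answers:
        return 0.0
    max_per_image = sum(HAZARD_WEIGHTS.values())
    total = sum(
        HAZARD_WEIGHTS[hazard]
        for image_answers in answers
        for hazard, present in image_answers.items()
        if present
    )
    return total / (max_per_image * len(answers))


def categorize(score):
    """Map a normalized score to one of the four safety categories."""
    for threshold, label in zip(THRESHOLDS, CATEGORIES):
        if score < threshold:
            return label
    return CATEGORIES[-1]


segment = [{"construction": True, "obstacle": False}, {"heavy_traffic": True}]
print(categorize(risk_score(segment)))
```

Normalizing by the number of images keeps segments with many photos comparable to segments with few, which matters when the per-segment category feeds a route planner.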
cs.CV / 76 / 2605.11799

SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions

SB-BEVFusion:增强对传感器故障和数据损坏的鲁棒性
Essl, Markus, Moscati, Marta, Noman, Mubashir, Zaheer, Muhammad Zaigham, Naseem, Usman, Nawaz, Shah, Schedl, Markus
Abstract
Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird's-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.
Chinese Translation
多模态传感器融合在自动驾驶车辆的三维物体检测中,相较于单模态方法展现出显著的性能提升。现有方法通常将来自独立传感器(如摄像头和激光雷达)的多模态数据转换为统一的鸟瞰图(BEV)表示以进行融合。尽管在理想条件下有效,但当摄像头或激光雷达数据缺失、损坏或噪声较大时,该策略的性能会显著下降。为了解决这一脆弱性,我们开发了一个框架无关的融合模块,用于处理摄像头和激光雷达数据,能够应对其中一种模态缺失或损坏的情况。为了验证我们模块的有效性,我们将其应用于BEVFusion [1],这是一个成熟的框架,用于结合摄像头和激光雷达数据进行三维物体检测。通过在MultiCorrupt数据集上的定量实验,我们展示了在缺失和损坏模态的场景下,我们的模块实现了良好的性能提升,显著超越了现有的统一表示方法,并在因极端天气条件和传感器故障导致的损坏模态场景中达到了最先进的性能。
cs.CV / 77 / 2605.11803

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

OTT-Vid:视频大语言模型的最优传输时间令牌压缩
Kang, Minseok, Lee, Minhyeok, Lee, Jungho, Kim, Minjung, Kim, Donghyeong, Lee, Dayeon, Choi, Heeseung, Kim, Ig-jae, Lee, Sangyoun
Abstract
As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
Chinese Translation
随着视频大语言模型(Video-LLMs)在处理更长和更复杂视频时,其推理成本因跨帧累积的大量视觉令牌而迅速增长。无训练的令牌压缩已成为解决这一瓶颈的实用方案。然而,现有的时间压缩方法主要依赖于跨帧令牌相似性或分割启发式,忽视了每个令牌在其帧内的语义角色,并未能根据每对帧的可压缩性调整压缩强度。在本研究中,我们提出了OTT-Vid,一种基于传输的时间令牌压缩分配框架。我们的方法包括两个阶段:空间修剪识别每帧内的代表性内容,随后在相邻帧之间解决最优传输(OT)以估计时间可压缩性。我们以非均匀令牌质量来公式化这一OT,这样可以保护语义重要的令牌免受过度压缩,并采用一种考虑局部性的成本来捕捉特征和空间差异。最终的传输计划在令牌重要性和匹配成本之间实现了平衡,而其总成本定义了每对帧的传输难度,我们利用这一点动态分配压缩预算。在六个涵盖视频问答和时间定位的基准测试中的实验表明,OTT-Vid在仅保留10%令牌的情况下,仍能保留95.8%的VQA和73.9%的VTG性能,始终优于现有的最先进的无训练压缩方法。
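The transport-derived allocation in the OTT-Vid abstract can be approximated with a small entropic optimal transport (Sinkhorn) sketch. Several choices here are assumptions rather than details from the abstract: the use of Sinkhorn iterations as the solver, the cost weighting `lam`, the regularization `eps`, and the rule that frame pairs with higher transport cost keep proportionally more tokens. The sketch shows the three ingredients the abstract names: non-uniform token mass, a cost mixing feature and spatial distance, and a total transport cost used as per-pair difficulty.

```python
import math


def pair_cost(tokens_a, tokens_b, lam=0.5):
    """Locality-aware cost: feature distance plus lam times spatial distance."""
    return [[math.dist(fa, fb) + lam * math.dist(pa, pb) for fb, pb in tokens_b]
            for fa, pa in tokens_a]


def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT between row masses a and column masses b; returns the plan."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_difficulty(tokens_a, mass_a, tokens_b, mass_b):
    """Total transport cost between two neighboring frames."""
    cost = pair_cost(tokens_a, tokens_b)
    plan = sinkhorn(cost, mass_a, mass_b)
    return sum(c * p for c_row, p_row in zip(cost, plan) for c, p in zip(c_row, p_row))


def allocate_budgets(difficulties, total_keep):
    """Assumed rule: harder-to-transport pairs keep proportionally more tokens."""
    total = sum(difficulties)
    if total == 0:
        return [total_keep / len(difficulties)] * len(difficulties)
    return [total_keep * d / total for d in difficulties]


# Each token is (feature vector, grid position); masses encode token importance.
static = [((0.0, 0.0), (0, 0)), ((1.0, 0.0), (0, 1))]
moved = [((0.0, 1.0), (0, 0)), ((1.0, 1.0), (0, 1))]
mass = [0.7, 0.3]

difficulties = [
    transport_difficulty(static, mass, static, mass),  # unchanged frame pair
    transport_difficulty(static, mass, moved, mass),   # frame pair with changed content
]
budgets = allocate_budgets(difficulties, total_keep=4)
print(difficulties, budgets)
```

In this toy setup the unchanged frame pair has near-zero transport cost and therefore receives almost none of the token budget, while the changed pair keeps nearly all of it.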
cs.CV / 78 / 2605.11808

Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

通过关系感知视觉增强减轻大型视觉语言模型中的动作关系幻觉
Qin, Zhenxin, Li, Qiang, Wang, Qingzhuo, Qin, Ruiyang, Wei, Zhihua, Shen, Wen
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
Chinese Translation
大型视觉语言模型(LVLMs)在多种视觉语言任务中取得了显著的性能。然而,LVLMs仍然存在幻觉问题,生成与视觉输入相矛盾的文本。现有研究主要集中在减轻物体幻觉上,但往往忽视了更复杂的关系幻觉,特别是涉及物体之间交互的动作关系。在本研究中,我们实证观察到LVLMs中动作关系幻觉的主要原因是对视觉信息分配的注意力不足。因此,我们提出了一个框架,以定位与动作相关的图像区域,并增强LVLM对这些区域的注意力。具体而言,我们定义了动作关系敏感度(Action-Relation Sensitivity, ARS)评分,以识别对动作关系变化最敏感的注意力头,从而定位包含关键视觉线索的与动作相关的图像区域。然后,我们提出了关系感知视觉增强(Relation-aware Visual Enhancement, RVE)方法,以增强LVLM对这些与动作相关的图像区域的注意力。大量实验表明,与现有基线相比,我们的方法在减轻动作关系幻觉方面取得了优越的性能,且额外推理成本微乎其微。此外,它还有效地推广到空间关系幻觉和物体幻觉。
cs.CV / 79 / 2605.11818

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

RevealLayer:通过考虑遮挡的图像分解解开隐藏层和可见层
Wang, Binhao, Zhao, Shihao, Cheng, Bo, Ji, Qiuyu, Ma, Yuhang, Wu, Liebucha, Liu, Shanyuan, Leng, Dawei, Yin, Yuhui
Abstract
Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image dataset constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.
Chinese Translation
近期基于扩散的方法在图像层分解方面取得了显著进展。然而,由于遮挡补全、稳健的层解缠和精确的前景边界等问题,准确分解复杂的自然图像仍然具有挑战性。此外,高质量的多层自然图像数据集的稀缺限制了这一领域的发展。为了解决这些挑战,我们提出了RevealLayer,这是一种基于扩散的框架,可以将RGB图像分解为多个RGBA层,从而实现精确的层分离和可靠的遮挡内容恢复。RevealLayer包含三个关键组件:(1) 区域感知注意力模块用于解缠隐藏层和可见层;(2) 遮挡引导适配器利用上下文信息增强重叠区域;(3) 复合损失用于强制锐利的α边界并抑制残余伪影。为了支持训练和评估,我们引入了RevealLayer-100K,这是通过自动化算法与人工标注的合作构建的高质量多层自然图像,并进一步建立了RevealLayerBench用于一般自然场景中层分解的基准测试。大量实验表明,RevealLayer在层分解方面始终优于现有方法。
cs.CV / 80 / 2605.11824

REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View

REFNet++:鸟瞰极坐标视图中相机与雷达传感器数据的多任务高效融合
Chandrasekaran, Kavin, Grigorescu, Sorin, Dubbelman, Gijs, Jancura, Pavol
Abstract
A realistic view of the vehicle's surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.
Chinese Translation
相机传感器通常提供车辆周围环境的真实视图,这对于环境感知至关重要。另一方面,经济实惠的雷达传感器由于其在多变天气条件下的鲁棒性,正变得越来越重要。然而,由于其输出噪声大和分类能力降低,雷达传感器在与其他传感器数据结合时效果最佳。具体而言,我们通过在统一域中对齐雷达和相机数据,解决多模态传感器融合的挑战,优先考虑准确性和计算效率。我们的工作利用雷达的原始距离-多普勒(RD)谱和前视相机图像作为输入。为了实现有效融合,我们采用了一种变分编码器-解码器架构,该架构学习将前视相机数据转换为鸟瞰视图(BEV)极坐标域。同时,雷达编码器-解码器学习从RD数据中恢复角度信息,以生成范围-方位(RA)特征。这种对齐确保了两种模态在兼容域中表示,从而促进了稳健和高效的传感器融合。我们使用RADIal数据集评估了我们的融合策略在车辆检测和自由空间分割方面的表现,并与最先进的方法进行了比较。
cs.CV / 81 / 2605.11840

Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

选择,而非融合:用于雷达-相机深度估计的雷达调制状态空间模型
Hou, Zhangcheng, Ohtsuki, Tomoaki
Abstract
Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods (concatenation, confidence-aware gating, sparse supervision, graph-based extraction) combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size $\Delta$ and readout $\mathbf{C}$ while leaving the input projection $\mathbf{B}$ and state dynamics $\mathbf{A}$ image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar's spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0-50, 0-70, and 0-80 m, while attaining the lowest single-frame latency (26.8 ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.
Chinese Translation
雷达-相机深度估计必须将超稀疏的全天气度量雷达信号转化为密集的每像素深度图。现有方法(连接、基于置信度的门控、稀疏监督、基于图的提取)在主干网络的序列操作外结合雷达和图像特征,即使是跨模态的 Mamba 变体也使得选择机制本身保持单模态。我们认为选择机制是雷达介入的正确位置。我们引入了雷达调制选择(Radar-Modulated Selection, RMS),这是一种将雷达注入 Mamba 选择扫描的最小且有原则的方法:雷达从内部调制扫描,通过对步长 $\Delta$ 和读出 $\mathbf{C}$ 添加零初始化的扰动,同时保持输入投影 $\mathbf{B}$ 和状态动态 $\mathbf{A}$ 仅为图像。该构造在初始化时与预训练的图像专用 Mamba 完全等效,确保雷达仅在提高准确性的地方影响模型。由此产生的两个进一步特性是扫描外融合无法提供的:在每次递归步骤中线性成本的跨模态耦合,以及当雷达缺失时自然回退到仅图像的主干网络。我们在多视图扫描金字塔(Multi-View Scan Pyramid, MVSP)中部署 RMS,使融合操作与雷达在每个尺度的空间覆盖相匹配。SemoDepth 在 nuScenes 上实现了最先进的性能,在 0-50m、0-70m 和 0-80m 的 MAE 上分别减少了 34.0%、29.9% 和 29.9%,同时达到了最低的单帧延迟(26.8ms)。进一步的消融实验表明,扫描外特征融合在 RMS 之上没有增加准确性,提供了实证验证,证明扫描内选择可以替代扫描外融合。
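The zero-initialised modulation described in the RMS abstract can be illustrated with a scalar toy scan. This is a sketch under stated assumptions, not the paper's architecture: the state is one-dimensional, the radar perturbation is a single learned weight times a radar feature, the perturbation to the step size is added before a softplus, and the names `w_delta` and `w_c` are hypothetical. It demonstrates the two properties the abstract claims: exact equivalence to the image-only scan when the radar weights are zero, and fallback to the image-only scan when radar is absent.

```python
import math


def softplus(z):
    """Smooth positive mapping commonly used for the SSM step size."""
    return math.log1p(math.exp(z))


def selective_scan(x, d_img, b_img, c_img, a=-1.0, radar=None, w_delta=0.0, w_c=0.0):
    """Scalar selective scan with optional radar modulation of delta and C.

    B (b_img) and A (a) stay image-only; radar only perturbs the step size
    and the readout. w_delta and w_c are zero at initialisation.
    """
    if radar is None:
        radar = [0.0] * len(x)  # missing radar contributes no perturbation
    h, outputs = 0.0, []
    for x_t, d_t, b_t, c_t, r_t in zip(x, d_img, b_img, c_img, radar):
        delta = softplus(d_t + w_delta * r_t)  # radar-perturbed step size
        c = c_t + w_c * r_t                    # radar-perturbed readout
        h = math.exp(delta * a) * h + delta * b_t * x_t
        outputs.append(c * h)
    return outputs


x = [1.0, 0.5, -0.3]
d_img, b_img, c_img = [0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]
radar = [2.0, 0.0, 1.0]

image_only = selective_scan(x, d_img, b_img, c_img)
at_init = selective_scan(x, d_img, b_img, c_img, radar=radar)
trained = selective_scan(x, d_img, b_img, c_img, radar=radar, w_delta=0.3, w_c=0.2)
no_radar = selective_scan(x, d_img, b_img, c_img, radar=None, w_delta=0.3, w_c=0.2)
print(image_only == at_init, image_only == no_radar, image_only == trained)
```

Because the perturbations enter inside the recurrence, the cross-modal coupling costs one extra multiply-add per step, so the scan stays linear in sequence length.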
cs.CV / 82 / 2605.11856

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

UniVLR:在多模态大语言模型中统一文本与视觉的视觉潜在推理
Jiang, Houcheng, Fu, Jiajun, Fang, Junfeng, Gao, Chen, Wang, Xiang, He, Xiangnan, Li, Yong
Abstract
Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
Chinese Translation
多模态大语言模型越来越被期望能够进行图像思考,然而现有的视觉潜在推理方法仍然依赖于与视觉潜在标记交错的显式文本推理链。这种交错设计限制了效率,并使推理在独立的文本和视觉通道之间变得零散。我们提出了UniVLR,一个统一的视觉潜在推理框架,将文本推理和辅助视觉证据视为共享的视觉工作空间。UniVLR不再将文本推理链作为独立的推理路径,而是将推理痕迹与辅助图像一起呈现,并学习将这种统一表示压缩为紧凑的视觉潜在标记。在推理时,模型仅通过视觉潜在标记进行推理,并直接解码最终答案,避免了外部工具调用和冗长的文本推理。在真实世界的感知和视觉推理任务上的实验表明,UniVLR在生成的推理标记数量显著减少的情况下,优于先前的视觉潜在推理方法,这表明在多模态大语言模型中实现视觉思考的更统一和高效的范式。
cs.CV / 83 / 2605.11863

GATA2Floor: Graph attention for floor counting in street-view facades

GATA2Floor:用于街景立面楼层计数的图注意力
Le, Ngoc Tan, Chamiti, Tzoulio, Papagiannopoulou, Eirini, Deligiannis, Nikos
Abstract
Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.
Chinese Translation
从街景图像自动分析建筑立面在城市分析、能源评估和应急规划方面具有巨大潜力。然而,这需要对空间排列的元素进行推理,而不仅仅是孤立的检测。在本研究中,我们将每个立面建模为一个基于门窗检测的图,并在图的边上引入垂直先验。此外,我们提出了GATA2Floor,这是一种基于多头图注意力v2(Graph Attention v2, GATv2)的模型,能够预测建筑的整体楼层数,并通过可学习的交叉注意力查询,柔性地将元素分配到潜在的楼层槽中,从而产生可解释的输出并增强对不规则设计的鲁棒性。为了缓解缺乏标注数据集的问题,我们展示了所提出的基于图的推理可以在无需标注的情况下应用,方法是利用基于自监督特征和视觉-语言评分的轻量级无标签提议机制。我们的方法展示了基于图注意力的关系推理在立面理解中的价值。
cs.CV / 84 / 2605.11867

When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI

当大脑意见不一致:生物学模糊性是从结构性MRI合成淀粉样PET的挑战的根源
Baron, Louise E. G., Callaghan, Ross, Cash, David M., Weston, Philip S., Azadbakht, Hojjat, Zhang, Hui
Abstract
Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer's disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.
Chinese Translation
结构性MRI到淀粉样PET的合成被提出作为阿尔茨海默病(AD)中淀粉样评估的非侵入性替代方法。然而,相同模型的报告性能在不同研究中差异巨大,日益复杂的架构并未带来一致的提升。这种不一致被认为是由一种根本的生物学模糊性引起的:MRI捕捉神经退行性变,而PET测量淀粉样病理——这两个过程在AD中往往是时间上解耦的。因此,相似的MRI模式可能对应不同的淀粉样状态,造成模糊的一对多映射。因此,MRI到淀粉样PET的合成可能本质上是病态的;然而,这一观点尚未得到科学验证。本研究的目的是通过两个受控实验来检验这一假设。我们首先通过按淀粉样和神经退行性状态分层配对的MRI-PET数据来控制训练分布。使用两种标准合成模型在受控设计下,我们展示了生物学上无歧义的映射可以在孤立状态下学习,但当引入数据模糊性时,性能崩溃。这表明数据分布中的模糊性,而非架构能力,限制了性能。其次,我们展示了以血浆生物标志物的形式引入正交生物信息可以解决这种模糊性。当结合多模态输入时,性能得到改善,稳定性得以恢复。这些发现共同表明,MRI到淀粉样PET合成中的有限和不一致性能是由内在的生物学模糊性解释的,而稳定、有意义的进展需要多模态整合而非架构复杂性。
cs.CV / 85 / 2605.11869

FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

FIS-DiT:通过无训练的帧交错稀疏性突破少步视频推理瓶颈
Tang, Jian, Fan, Jiawei, Liu, Qingbin, Wei, Zheng
Abstract
While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
Chinese Translation
尽管通过模型蒸馏可以显著降低视频扩散变换器(DiTs)的整体推理延迟,但每步推理延迟仍然是一个关键瓶颈。现有的加速范式主要利用去噪轨迹中的冗余;然而,我们发现这些逐步策略在少步情况下会遇到收益递减的限制。在这种情况下,时间状态的稀缺性阻碍了有效的特征重用或预测建模,形成了进一步加速的巨大障碍。为此,我们提出了无训练的帧交错稀疏性变换器(FIS-DiT),这是一个与操作无关的框架,旨在将优化重点从时间轨迹转移到潜在帧维度。我们的方法受到这一维度内在双重性的启发:存在允许减少计算的帧级稀疏性,以及每个帧位置在全球时空上下文中同样重要的结构一致性。利用这一见解,我们实施了帧交错稀疏性(FIS)作为执行策略,操控模型层级中的帧子集,刷新所有潜在位置而无需全规模块计算。在Wan 2.2和HunyuanVideo 1.5上的实证评估表明,FIS-DiT在VBench-Q和CLIP指标上始终实现了2.11至2.41倍的加速,且性能下降微乎其微,为实时高清晰度视频生成提供了一条可扩展且稳健的途径。
cs.CV / 86 / 2605.11871

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

$h$-控制:通过块条件吉布斯细化实现无训练摄像机控制
Wang, Yuzhu, Ye, Xi, Su, Duo, Xu, Yangyang, Zhu, Jun
Abstract
Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance the trade-off between trajectory adherence and visual quality and the heuristic guidance-strength tuning lacks robustness. We propose \textbf{$h$-control}, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop \emph{block-conditional pseudo-Gibbs refinement} on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, \textbf{$h$-control} attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.
Chinese Translation
针对预训练流匹配视频生成器的无训练摄像机控制是一个部分观测逆问题:深度扭曲的引导视频为潜在位置的子集提供了噪声证据,采样器必须将其与预训练的先验进行协调。现有方法在轨迹遵循与视觉质量之间的权衡上存在困难,而启发式的引导强度调节缺乏鲁棒性。我们提出了\textbf{$h$-控制},通过对采样器进行结构性变更来解决这一困境:每一步外部硬替换引导步骤都与同一噪声水平下未观测补集上的内部循环\emph{块条件伪吉布斯细化}相结合,且具有可证明的收敛性,收敛于部分观测条件数据法则。为了加速高维视频潜变量的收敛,我们利用其条件局部性,将未观测补集划分为3D块,每个块由自定义混合指示器跟踪,该指示器自适应地冻结已收敛的块。在RealEstate10K和DAVIS数据集上,\textbf{$h$-控制}在所有七个无训练和基于训练的竞争者中获得了最佳FVD,在每个报告的指标上超越了每个无训练基线。
cs.CV / 87 / 2605.11881

Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data

从异构多视图数据中学习保持子空间特性的稀疏注意力图
Chen, Jie, Gou, Yuanbiao, Liu, Chuanbin, Wang, Zhu, Peng, Xi
Abstract
The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace structures when exploiting complementary information across multiple views. Therefore, a fundamental challenge involves constructing sparse similarity graphs that preserve these underlying subspace structures for achieving semantic alignment across heterogeneous views. In this paper, we propose a sparse attention graph learning (SAGL) method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data. Specifically, we introduce a bilinear attention factorization scheme to capture asymmetric similarities among the high-dimensional features, which breaks the symmetry bottleneck that is inherent in the traditional representation learning techniques. A dynamic sparsity gating mechanism then predicts a feature-specific compression factor for adaptively controlling the topological contributions of neighbors. Furthermore, we employ a structured sparse projection via $\alpha$-entmax to generate subspace-preserving sparse attention graphs for individual views. SAGL leverages these view-specific graphs to conduct sparse information aggregation, yielding discriminative representations for multiview learning tasks. In addition, we provide a rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints. Extensive experiments conducted on multiple benchmark datasets demonstrate that SAGL consistently outperforms the state-of-the-art unsupervised transfer learning approaches.
Chinese Translation
通过各种具有不同架构的预训练模型从大规模无标签数据中提取的高维特征被称为异构多视图数据。大多数现有的无监督迁移学习方法在利用多个视图之间的互补信息时,无法忠实地恢复内在的子空间结构。因此,一个基本挑战是构建保持这些潜在子空间结构的稀疏相似性图,以实现异构视图之间的语义对齐。本文提出了一种稀疏注意力图学习(SAGL)方法,该方法从异构多视图数据中学习保持子空间特性的稀疏注意力图。具体而言,我们引入了一种双线性注意力分解方案,以捕捉高维特征之间的不对称相似性,打破了传统表示学习技术中固有的对称瓶颈。然后,动态稀疏门控机制预测特征特定的压缩因子,以自适应地控制邻居的拓扑贡献。此外,我们通过$\alpha$-entmax采用结构化稀疏投影,为各个视图生成保持子空间特性的稀疏注意力图。SAGL利用这些视图特定的图进行稀疏信息聚合,从而为多视图学习任务生成区分性表示。此外,我们提供了严格的理论分析,桥接了可微分稀疏注意力和概率单纯形约束。在多个基准数据集上进行的广泛实验表明,SAGL在性能上始终优于最先进的无监督迁移学习方法。
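To make the idea of a sparse projection onto the probability simplex concrete, the sketch below implements sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax (with $\alpha = 1$ recovering softmax). The abstract does not state which $\alpha$ SAGL uses, so this is only an illustration of the family, not the paper's configuration; the function name `sparsemax` is chosen here for clarity.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, which is what
    makes the resulting attention graph sparse.
    """
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z_sorted, start=1):
        cumsum += zj
        # The support condition holds for the first k sorted entries only,
        # so the last update of tau corresponds to the support size k.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(zi - tau, 0.0) for zi in z]

# Toy similarity scores between one node and three candidate neighbours.
weights = sparsemax([1.0, 0.5, -1.0])
```

Here the weakest neighbour receives a weight of exactly zero while the remaining weights still sum to one, so the corresponding edge is simply dropped from the graph.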
cs.CV / 88 / 2605.11898

Few-Shot Synthetic Data Generation with Diffusion Models for Downstream Vision Tasks

基于扩散模型的少样本合成数据生成用于下游视觉任务
Dushenev, Daniil, Karpov, Nazariy, Zinovjev, Daniil, Gorin, Alexander, Kulikov, Konstantin
Abstract
Class imbalance is a persistent challenge in visual recognition, particularly in safety-critical domains where collecting positive examples is expensive and rare events are inherently underrepresented. We propose a lightweight synthetic data augmentation pipeline that fine-tunes a LoRA adapter on as few as 20-50 real images of a rare class and uses a pretrained diffusion model to generate synthetic samples for training. We systematically vary the synthetic-to-real ratio and evaluate the approach across two structurally different domains: chest X-ray pathology classification (NIH ChestX-ray14) and industrial surface crack detection (Magnetic Tile Defect dataset). All evaluations are performed on held-out sets of real images only. Across both domains, synthetic augmentation consistently improves rare-class recall and F1 compared to training with real data alone. Performance improves with moderate synthetic augmentation and shows diminishing returns as the synthetic ratio increases. These results suggest that LoRA-adapted diffusion models provide a simple and scalable mechanism for augmenting rare classes, enabling effective learning in data-scarce scenarios across heterogeneous visual domains.
Chinese Translation
类别不平衡是视觉识别中的一个持续挑战,特别是在安全关键领域,收集正样本既昂贵又稀有事件本质上被低估。我们提出了一种轻量级的合成数据增强管道,该管道在少至20-50张稀有类别的真实图像上微调LoRA适配器,并使用预训练的扩散模型生成用于训练的合成样本。我们系统地改变合成与真实图像的比例,并在两个结构上不同的领域进行评估:胸部X光病理分类(NIH ChestX-ray14)和工业表面裂纹检测(磁性瓷砖缺陷数据集)。所有评估均在仅包含真实图像的保留集上进行。在这两个领域中,与仅使用真实数据训练相比,合成增强始终提高了稀有类别的召回率和F1分数。随着适度的合成增强,性能有所提升,但随着合成比例的增加,收益递减。这些结果表明,LoRA适配的扩散模型提供了一种简单且可扩展的机制,用于增强稀有类别,从而在数据稀缺的异构视觉领域中实现有效学习。
cs.CV / 89 / 2605.11900

Mobile Traffic Camera Calibration from Road Geometry for UAV-Based Traffic Surveillance

基于道路几何的移动交通摄像头标定用于无人机交通监控
Popov, Alexey, Trukhina, Natalia, Vashkelis, Vadim
Abstract
Unmanned aerial vehicles (UAVs) can provide flexible traffic surveillance where fixed roadside cameras are unavailable, costly, or impractical. However, raw UAV video is difficult to use for traffic analytics because vehicle motion is observed in perspective image coordinates rather than in a stable metric road coordinate system. This paper presents a lightweight pipeline for converting monocular oblique UAV traffic video into a local metric bird's-eye-view (BEV) representation. Visible road geometry, including lane markings, road borders, and crosswalks, is used to estimate a road-plane homography from image coordinates to metric ground-plane coordinates. Vehicle observations from dataset annotations or detectors are then projected to BEV using estimated ground contact points. The resulting trajectories support estimation of vehicle direction, speed, heading, and dynamic 3D cuboids on the road plane. We evaluate the pipeline on UAVDT using ground-truth annotations to isolate calibration and geometric reconstruction from detector and tracker errors. For sequence M1401, 40 sampled frames from img000001-img000196 produce 632 metric cuboid instances across 23 tracks. Results show that road-geometry calibration can transform monocular UAV footage into interpretable traffic-camera-style analytics, including BEV tracks and synchronized 3D cuboid visualizations. They also reveal key limitations: far-field vehicles are sensitive to homography errors, manual validation is currently more reliable than fully automatic calibration, and the single-plane assumption limits performance in non-planar or ambiguous road regions. The proposed pipeline provides a practical foundation for deployable UAV traffic cameras and future real-time traffic digital-twin systems.
Chinese Translation
无人驾驶飞行器(UAV)能够在固定路边摄像头不可用、成本高昂或不切实际的情况下提供灵活的交通监控。然而,原始的无人机视频难以用于交通分析,因为车辆运动是以透视图像坐标的形式观察,而不是在稳定的度量道路坐标系统中。本文提出了一种轻量级的流程,将单目倾斜无人机交通视频转换为局部度量鸟瞰图(BEV)表示。可见的道路几何特征,包括车道标记、道路边界和人行横道,用于从图像坐标估计道路平面单应性(homography),将其转换为度量地面平面坐标。然后,来自数据集注释或检测器的车辆观察结果通过估计的地面接触点投影到BEV中。生成的轨迹支持对车辆方向、速度、航向和道路平面上的动态3D立方体的估计。我们在UAVDT上评估该流程,使用真实标注来隔离标定和几何重建与检测器和跟踪器错误的影响。在序列M1401中,从img000001到img000196采样的40帧生成了632个度量立方体实例,涵盖23条轨迹。结果表明,道路几何标定可以将单目无人机视频转换为可解释的交通摄像头风格分析,包括BEV轨迹和同步的3D立方体可视化。同时也揭示了关键的局限性:远处车辆对单应性误差敏感,手动验证目前比完全自动标定更可靠,以及单平面假设在非平面或模糊的道路区域限制了性能。所提出的流程为可部署的无人机交通摄像头和未来的实时交通数字孪生系统提供了实用的基础。
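The projection step described in this abstract, mapping a vehicle's ground contact point from image coordinates to metric road-plane coordinates through a homography, can be sketched as follows. The matrix `H_example`, the pixel coordinates, and the frame interval are made-up values, and the function names are hypothetical; the paper estimates its homography from road geometry, which is not reproduced here.

```python
import math

def project_to_ground(H, u, v):
    """Map pixel (u, v) to ground-plane metres with a 3x3 homography H."""
    X = H[0][0] * u + H[0][1] * v + H[0][2]
    Y = H[1][0] * u + H[1][1] * v + H[1][2]
    W = H[2][0] * u + H[2][1] * v + H[2][2]
    # Divide by the homogeneous coordinate to get Euclidean coordinates.
    return X / W, Y / W

def speed_mps(p0, p1, dt):
    """Speed in metres per second between two BEV positions dt seconds apart."""
    return math.hypot(p1[0] - p0[0], p1[1] - p0[1]) / dt

# Hypothetical homography: a pure 0.125 m/pixel scaling. A real oblique
# view would also have non-zero perspective terms in the bottom row.
H_example = [[0.125, 0.0, 0.0],
             [0.0, 0.125, 0.0],
             [0.0, 0.0, 1.0]]

p0 = project_to_ground(H_example, 100.0, 200.0)  # contact point in frame t
p1 = project_to_ground(H_example, 124.0, 232.0)  # same vehicle, 0.5 s later
v = speed_mps(p0, p1, 0.5)
```

Because speed is computed from differences of projected points, any error in the homography scales directly into the speed estimate, which is consistent with the abstract's remark that far-field vehicles are sensitive to homography errors.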
cs.CV / 90 / 2605.11904

Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning

超越点状神经崩溃:一种拓扑感知的层次分类器用于类增量学习
Yi, Huiyu, Xu, Zhiming, Tu, Dunwei, Wang, Zhicheng, Xu, Baile, Shen, Furao
Abstract
The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a ``local-to-global'' representation. Furthermore, we introduce Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at https://github.com/yhyet/HC_SOINN.
Chinese Translation
最近类均值(Nearest Class Mean, NCM)分类器因其在类增量学习(Class-Incremental Learning, CIL)中对灾难性遗忘的优越抵抗能力而广受青睐,相较于全连接层。尽管神经崩溃(Neural Collapse, NC)理论通过假设特征收敛为单一点来支持NCM的最优性,但在CIL中,非线性特征漂移和训练不足常常阻碍这一理想状态的实现。因此,类别表现为复杂的流形,而非收敛的点,使得单点NCM变得次优。为了解决这一问题,我们提出了层次聚类自组织增量神经网络(Hierarchical-Cluster SOINN, HC-SOINN),这是一种新颖的分类器,通过“局部到全局”的表示捕捉这些流形的拓扑结构。此外,我们引入了通过残差进行结构-拓扑对齐(Structure-Topology Alignment via Residuals, STAR)的方法,该方法采用细粒度的逐点轨迹跟踪机制,主动变形所学习的拓扑,使其能够精确适应复杂的非线性特征漂移。理论分析和Procrustes距离实验验证了我们框架对流形变形的韧性。我们将HC-SOINN集成到七种最先进的方法中,通过替换其原始分类器,实现了一致的改进,突显了我们方法的有效性和鲁棒性。代码可在 https://github.com/yhyet/HC_SOINN 获取。
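For readers unfamiliar with the baseline being replaced, a Nearest Class Mean classifier can be sketched in a few lines. This shows only the single-point NCM baseline that the abstract argues is suboptimal, not HC-SOINN or STAR; the helper names and toy features are illustrative assumptions.

```python
import math

def class_means(features, labels):
    """Return one mean feature vector (prototype) per class label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def ncm_predict(means, query):
    """Assign the query to the class whose mean is nearest (Euclidean)."""
    return min(means, key=lambda y: math.dist(means[y], query))

# Toy 2-D features for two classes.
features = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["a", "a", "b", "b"]
means = class_means(features, labels)
pred = ncm_predict(means, [3.0, 1.0])
```

Each class is summarised by a single point, so a class whose features form a curved or multi-modal manifold is represented by a mean that may lie far from its actual samples; that is the gap the abstract's topology-aware classifier is meant to address.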
cs.CV / 91 / 2605.11913

Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization

向量支架:可微图像向量化的跨尺度协调
Lee, Jaerin, Lee, Kanggeon, Lee, Kyoung Mu
Abstract
Differentiable vector graphics have enabled powerful gradient-based optimization of vector primitives directly from raster images. However, existing frameworks formulate this as a flat optimization problem, forcing hundreds to thousands of randomly initialized curves to blindly compete for pixel-level error reduction. This disordered optimization leads to topology collapse, where macroscopic structures are distorted by internal high-frequency noise, resulting in a redundant and uneditable "polygon soup" that limits practical editability. To address this limitation, we propose Vector Scaffolding, a novel hierarchical optimization framework that shifts from flat pixel-matching to structured topological construction tailored for vector graphics. By identifying a key cause of topology collapse as the mathematical imbalance between area and boundary gradients, we introduce Interior Gradient Aggregation to stabilize the learning dynamics of multi-scale curve mixtures. Upon this stabilized landscape, we employ Progressive Stratification and Rapid Inflation Scheduling to progressively densify vector primitives with extremely high learning rates ($\times 50$). Experiments demonstrate that our approach accelerates optimization by $2.5\times$ while simultaneously improving PSNR by up to 1.4 dB over the previous state of the art.
Chinese Translation
可微向量图形使得可以直接从光栅图像中对向量原语进行强大的基于梯度的优化。然而,现有框架将其表述为一个平面优化问题,迫使数百到数千个随机初始化的曲线盲目竞争以减少像素级误差。这种无序的优化导致拓扑崩溃,宏观结构被内部高频噪声扭曲,最终形成冗余且不可编辑的“多边形汤”,限制了实际的可编辑性。为了解决这一局限性,我们提出了向量支架(Vector Scaffolding),一种新颖的分层优化框架,它将优化从平面像素匹配转变为针对向量图形的结构化拓扑构建。通过识别拓扑崩溃的一个关键原因是面积和边界梯度之间的数学不平衡,我们引入了内部梯度聚合(Interior Gradient Aggregation)来稳定多尺度曲线混合的学习动态。在这个稳定的环境中,我们采用渐进分层(Progressive Stratification)和快速膨胀调度(Rapid Inflation Scheduling)以极高的学习率($\times 50$)逐步加密向量原语。实验表明,我们的方法在加速优化方面提高了$2.5\times$,同时在PSNR上比之前的最先进技术提高了最高1.4 dB。
cs.CV / 92 / 2605.11927

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

RealDiffusion:基于物理知识的多角色故事书生成中的注意力机制
Zhao, Qi, Chen, Jun, Tsang, Ivor, Dai, Guang
Abstract
While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.
Chinese Translation
尽管现代扩散模型在生成多样化单幅图像方面表现出色,但将其扩展到序列生成时却暴露出一个根本性挑战:在叙事活力与多角色一致性之间取得平衡。现有方法往往在这一权衡中表现不佳,导致角色失去身份或故事停滞。为了解决这一关键矛盾,我们提出了RealDiffusion,一个旨在调和强一致性与叙事活力的统一框架。热扩散作为一种耗散先验,沿序列平均相邻特征并去除主体区域内的高频噪声。这抑制了属性漂移并稳定了帧间身份。随后,一个区域感知的随机过程引入小扰动,探索附近模式并防止崩溃,从而使故事保持姿态变化和场景演变。因此,我们引入了一种轻量级、无需训练的基于物理知识的注意力机制,在推理过程中将可控的物理先验注入自注意力层。通过将特征演变建模为一个可配置的物理系统,我们的方法在不抑制有意的、由提示驱动的变化的情况下,规范了时空关系。大量实验表明,RealDiffusion在保持叙事活力的同时,实现了角色一致性的显著提升,超越了最先进的方法。代码可在 https://github.com/ShmilyQi-CN/RealDiffusion 获取。
cs.CV / 93 / 2605.11931

Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

学习思考:通过视觉感知自我提升训练改善多模态推理
Zhong, Qihuang, Ding, Liang, Xuan, Wenjie, Liu, Juhua, Du, Bo, Tao, Dacheng
Abstract
Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
Chinese Translation
在多模态大型语言模型(MLLMs)的推理能力提升中,使用显式推理轨迹的后训练方法是常见的。然而,获取高质量的推理轨迹往往成本高昂且耗时。因此,自我提升范式应运而生,使得MLLMs能够在没有外部监督的情况下自我生成推理轨迹进行训练。尽管这种方法有效,但我们揭示了MLLMs自我提升训练中的两个缺陷:1)数据不平衡,简单样本被过度训练,而具有挑战性但至关重要的样本却被训练不足;2)语言先验偏差,MLLMs过于依赖语言先验而忽视视觉线索。为此,我们提出了VISTA,一个视觉感知自我提升训练框架,用于增强MLLMs的多模态推理能力。具体而言,VISTA首先引入了一种前缀重采样策略,以重用部分正确的推理轨迹进行高效的数据收集,然后设计了一种视觉感知注意力评分,以量化模型对视觉信息的关注。大量实验表明,VISTA可以应用于各种后训练场景,即监督微调和偏好学习,并有效提升各种MLLMs和任务的多模态推理性能,例如,为Qwen2.5-VL-3B-Instruct带来了高达+13.66%的平均性能提升。
cs.CV / 94 / 2605.11934

Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution

具有跨模态局部扫描的交互状态空间模型用于深度超分辨率
Wu, Chen, Wang, Ling, Zheng, Zhuoran, Chen, Xiangyu, Xia, Jingyuan, Jiang, Weidong, Zhou, Jiantao
Abstract
Guided depth super-resolution (GDSR) reconstructs HR depth maps from LR inputs with HR RGB guidance. Existing methods either model each modality independently or rely on computationally expensive attention mechanisms with quadratic complexity, hindering the establishment of efficient and semantically interactive joint representations. In this paper, we observe that feature maps from different modalities exhibit semantic-level correlations during feature extraction. This motivates us to develop a more flexible approach enabling dense, semantically-aware deep interactions between modalities. To this end, we propose a novel GDSR framework centered around the Interactive State Space Model. Specifically, we design a cross-modal local scanning mechanism that enables fine-grained semantic interactions between RGB and depth features. Leveraging the Mamba architecture, our framework achieves global modeling with linear complexity. Furthermore, a cross-modal matching transform module is introduced to enhance interactive modeling quality by utilizing representative features from both modalities. Extensive experiments demonstrate competitive performance against state-of-the-art methods.
Chinese Translation
引导深度超分辨率(GDSR)从低分辨率输入中重建高分辨率深度图,并以高分辨率RGB作为指导。现有方法要么独立建模每种模态,要么依赖于计算复杂度为二次的计算密集型注意力机制,这阻碍了高效且语义交互的联合表示的建立。在本文中,我们观察到不同模态的特征图在特征提取过程中表现出语义层面的相关性。这促使我们开发一种更灵活的方法,使模态之间能够进行密集的、语义感知的深度交互。为此,我们提出了一种以交互状态空间模型为中心的新型GDSR框架。具体而言,我们设计了一种跨模态局部扫描机制,能够实现RGB和深度特征之间的细粒度语义交互。利用Mamba架构,我们的框架实现了线性复杂度的全局建模。此外,引入了跨模态匹配变换模块,通过利用两种模态的代表性特征来增强交互建模质量。大量实验表明,本文方法在性能上与最先进的方法具有竞争力。
cs.CV / 95 / 2605.11939

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

面向长尾泛化的集群感知神经崩溃提示调优方法用于视觉-语言模型
Guo, Boyang, Li, Liang, Peng, Lin, Gao, Yuhan, Sheng, Xichun, Yan, Chenggang
Abstract
Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
Chinese Translation
提示学习已成为微调预训练视觉-语言模型(VLMs)的高效替代方案。尽管前景广阔,当前的方法在适应类别不平衡的数据集时仍然难以保持尾类的可区分性。在本研究中,我们提出了一种集群感知神经崩溃提示调优(CPT)方法,该方法在不牺牲整体泛化能力的情况下增强了提示调优VLMs中尾类的可区分性。首先,我们通过从预训练VLM中挖掘语义分配并将其映射到提示调优特征,设计了一个集群不变空间。这一过程计算集群级边界并将约束限制在局部邻域,从而减少对预训练VLM全局语义结构的干扰。其次,我们引入了基于神经崩溃的可区分性优化,采用三种损失函数:文本等角紧框架(ETF)分离损失、逐类收敛损失和旋转稳定损失。这些损失共同作用,塑造集群内部几何形状,以实现更好的类间分离和类内对齐。在11个多样化数据集上的广泛实验表明,CPT在长尾类上表现优于现有最先进的方法,并且对未见类别具有良好的泛化能力。
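The Equiangular Tight Frame mentioned in this abstract is the target geometry from neural collapse theory: $K$ unit vectors whose pairwise cosine similarity is $-1/(K-1)$. The sketch below builds the standard simplex ETF (without any rotation) and checks that property. It illustrates the geometry only and does not reproduce CPT's losses; `simplex_etf` and `cosine` are names chosen for this example.

```python
import math

def simplex_etf(k):
    """Rows of sqrt(k / (k - 1)) * (I - (1/k) * ones) form a simplex ETF."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

K = 4
prototypes = simplex_etf(K)
# Every pair of distinct class prototypes has the same cosine similarity.
pairwise = [cosine(prototypes[i], prototypes[j])
            for i in range(K) for j in range(i + 1, K)]
```

With `K = 4` every entry of `pairwise` is approximately $-1/3$, the most negative value that four vectors can share as a common pairwise cosine, which is why an ETF is a natural target for separating class text embeddings.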
cs.CV / 96 / 2605.11959

Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models

基于视觉-语言模型的多模态教学视频抽象摘要
Nazir, Maham, Aqeel, Muhammad, Zhang, Richong, Setti, Francesco
Abstract
Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. Code is available at https://github.com/aqeeelmirza/clipsum.
Chinese Translation
多模态视频摘要需要视觉特征与语言生成在语义上对齐。传统方法依赖于为对象分类训练的卷积神经网络(CNN)特征,这些特征将视觉概念表示为与自然语言不对齐的离散类别。我们提出了ClipSum,一个利用冻结的CLIP视觉-语言特征、显式时间建模和维度自适应融合的框架,用于教学视频摘要。CLIP在4亿对图像-文本的对比预训练产生的视觉特征与文本解码器生成的语言概念在语义上对齐,弥合了表示层面的视觉-语言差距。在YouCook2数据集上,ClipSum实现了33.0%的ROUGE-1分数,而ResNet-152为30.5%,且其维度低4倍(512对比2048),表明语义对齐比特征容量更为重要。冻结的CLIP(33.0%)超过了微调的CLIP(32.3%),显示出保持预训练对齐比任务特定的适应更具价值。
cs.CV / 97 / 2605.11960

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

Chronicles-OCR:一个跨时间感知基准,用于汉字演变轨迹
Li, Gengluo, Peng, Shangpin, Wan, Xingyu, Zhang, Chengquan, Feng, Hao, Xu, Xin, Wu, Pian, Li, Bang, Ding, Zengmao, Liu, Yongge, Ye, Yipei, Yang, Yang, Shu, Zhan, Yan, Guojun, Li, Zhe, Ma, Can, Wang, Weiping, Zhou, Yu, Hu, Han
Abstract
Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at https://github.com/VirtualLUOUCAS/Chronicles-OCR.
Chinese Translation
视觉大型语言模型(VLLMs)在现代文本丰富的视觉理解中取得了显著成功。然而,它们在面对历史书写系统的持续形态演变时的感知鲁棒性仍然在很大程度上未被探索。现有的古代文本数据集通常集中于孤立的历史时期,未能捕捉跨越数千年的系统视觉分布变化。为了填补这一空白并赋能数字人文学科,我们推出了Chronicles-OCR,这是第一个专门设计用于评估VLLMs在汉字演变轨迹(即七种汉字书体)上的跨时间视觉感知能力的综合基准。该数据集与顶尖机构领域专家合作策划,包含2,800幅严格平衡的图像,涵盖从龟甲到纸质书法等高度多样的物理媒介。为了适应不同历史阶段之间剧烈的形态和拓扑变化,我们提出了一种新颖的阶段自适应注释范式。在此基础上,Chronicles-OCR制定了四个严格的定量任务:跨时期字符识别、通过视觉指称进行细粒度古代字符识别、古文本解析和书体分类。通过将视觉感知与语义推理隔离,Chronicles-OCR提供了一个权威平台,以揭示当前VLLMs的局限性,为强健的、意识到演变的历史文本感知铺平道路。Chronicles-OCR已在 https://github.com/VirtualLUOUCAS/Chronicles-OCR 上公开发布。
cs.CV / 98 / 2605.11963

What Does It Mean for a Medical AI System to Be Right?

医疗人工智能系统“正确”意味着什么?
Gitau, Antony
Abstract
This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.
Chinese Translation
本文通过将问题置于特定的临床背景中,探讨医疗人工智能系统“正确”的含义:即在数字化骨髓涂片中自动分类浆细胞以诊断多发性骨髓瘤。文章借鉴科学哲学和研究伦理,认为医疗人工智能的正确性并不是可以简化为基准性能的单一属性,而是一个多维概念,涉及专家标注的医疗数据集的可用性、模型输出的可解释性和可理解性、评估指标的临床意义,以及人机工作流程中责任的分配。因此,本文通过四个相互关联的主题发展这一论点:基础真相标签的不稳定性、过于自信的人工智能的不透明性、标准临床指标的不足,以及在时间紧迫的临床环境中自动化偏见的风险。
cs.CV / 99 / 2605.11967

H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes

H2G:用于三维场景的层次感知双曲分组
Ko, ByungHa, Lee, Youngmin, Kim, Dong Hwan
Abstract
Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta's objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
Chinese Translation
层次化三维分组旨在从细小的物体部分到完整物体,恢复跨多个粒度的场景组,而无需依赖语义标签或固定词汇。主要挑战在于将二维基础模型线索转化为一致的层次监督,并将该层次嵌入三维表示中。我们提出了H2G,一种用于层次化三维分组的双曲亲和场。我们的方法借助用于基于相似性的层次聚类的Dasgupta目标来解释基础模型亲和性,从而推导出语义组织的树形监督。这种监督被提炼为单一的洛伦兹双曲特征场,其几何形状非常适合树状分支结构。层次感知目标将该特征场与细粒度分配、粗略物体结构、紧凑特征簇和LCA(最低公共祖先)排序对齐。该公式在一个特征空间中表示多个分组层次,使得基于二维基础模型知识的语义层次分组成为可能。
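As a rough illustration of why a Lorentz (hyperboloid) feature space suits tree-like grouping in H2G, the sketch below lifts Euclidean vectors onto the unit hyperboloid and measures geodesic distance between them. The function names `lift_to_hyperboloid`, `lorentz_inner`, and `lorentz_distance`, the fixed curvature of -1, and the toy vectors are assumptions made for illustration; the abstract does not specify the paper's actual parameterization or losses.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the unit hyperboloid (curvature -1 assumed)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))  # time-like coordinate
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1.0 to guard against rounding slightly below the valid domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# Hypothetical toy features: one "root" at the origin and two "parts".
root = lift_to_hyperboloid([0.0, 0.0])
part_a = lift_to_hyperboloid([3.0, 0.0])
part_b = lift_to_hyperboloid([0.0, 3.0])
print(lorentz_distance(root, part_a))
print(lorentz_distance(part_a, part_b))
```

As feature norms grow, two points lying in different directions are separated by nearly the sum of their distances to the origin, which mirrors path length through a common ancestor in a tree.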
cs.CV / 100 / 2605.11977

Optimizing 4D Wires for Sparse 3D Abstraction

优化稀疏三维抽象的四维线条
Wu, Dong-Yi, Lee, Tong-Yee
Abstract
We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
Chinese Translation
我们提出了一种统一的三维几何抽象框架,使用单一的连续四维线条,该线条以B样条形式参数化,具有空间坐标和可变宽度$(x,y,z,w)$。现有的方法通常将形状表示为许多独立曲线段的集合,这往往导致结构的碎片化和有限的物理可实现性。相比之下,我们表明单一的连续样条足以表达复杂的体积形态,同时强制执行全局拓扑一致性。通过施加连续性,我们的方法将三维草图绘制从局部密度累积过程转变为全局路由问题,提供了强烈的归纳偏向于更清晰的美学和改善的结构一致性。为了实现基于梯度的优化,我们引入了一种可微渲染管道,该管道高效地光栅化具有有界投影误差的可变宽度曲线。这种表述支持使用现代引导信号(如评分蒸馏采样(Score Distillation Sampling, SDS)或CLIP)进行稳健优化。我们展示了包括图像到三维抽象、多视图线条艺术生成和可微风格化表面填充等应用。实验表明,我们的统一表示产生的结构在语义保真度和结构一致性方面优于基于离散曲线集合的方法。
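To make the $(x,y,z,w)$ wire parameterization concrete, the sketch below evaluates a single continuous curve whose control points carry position and width together. It assumes a uniform cubic B-spline, which is one common choice; the abstract only says "B-spline", so the degree, the knot layout, the function names `cubic_bspline_point` and `sample_wire`, and the control points are all illustrative assumptions.

```python
def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    Each control point is a 4-tuple (x, y, z, w): position plus wire width.
    """
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def sample_wire(control_points, samples_per_segment=8):
    """Sample the whole wire; consecutive segments share three control points."""
    points = []
    for i in range(len(control_points) - 3):
        seg = control_points[i:i + 4]
        for k in range(samples_per_segment):
            points.append(cubic_bspline_point(*seg, k / samples_per_segment))
    return points

# Hypothetical control points: (x, y, z, width).
controls = [(0, 0, 0, 1.0), (1, 0, 0, 1.5), (1, 1, 0, 2.0),
            (1, 1, 1, 1.5), (0, 1, 1, 1.0)]
wire = sample_wire(controls)
print(len(wire), wire[0])
```

Because adjacent segments share control points, position and width both vary smoothly along the wire, so a single curve can be optimized end to end rather than as independent strokes.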
cs.CV / 101 / 2605.11989

A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

深度神经网络在图像分类中的迁移学习评估
Baker, Nermeen Abou, Zengeler, Nico, Handmann, Uwe
Abstract
Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models in training sessions of one episode and of ten episodes.
Chinese Translation
迁移学习是一种机器学习技术,通过重用已学习的权重,将源领域获得的知识应用于目标领域,以增强学习效果。这种技术因其在实现高性能的同时节省训练时间、内存和网络设计精力而广泛应用。本文研究了如何选择最符合目标领域要求的最佳预训练模型,以用于图像分类任务。在我们的研究中,我们对输出层和一般网络参数进行了优化,以将预先在ImageNet上训练的十一种图像处理模型的知识应用于五个不同的目标领域数据集。我们测量了准确率、准确率密度、训练时间和模型大小,以评估预训练模型在一次训练会话和十次训练会话中的表现。
cs.CV / 102 / 2605.12002

EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

EDGER:基于边缘引导与热图细化的通用图像伪造定位
Le-Phan, Minh-Khoa, Le, Minh-Hoang, Tran, Minh-Triet, Do, Trong-Le
Abstract
Text-guided inpainting has made image forgery increasingly realistic, challenging both SID and IFL. However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the Manipulated Region Localization task of the MediaEval 2025 SynthIM challenge, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
Chinese Translation
文本引导的图像修复使得图像伪造变得愈加真实,给图像伪造检测(SID)和伪造图像定位(IFL)带来了挑战。然而,现有方法往往难以在不同领域中指出可疑信号。为了解决这个问题,我们提出了EDGER,一个基于补丁的双分支框架,能够在不牺牲原始分辨率的情况下定位任意分辨率图像中的操控区域。第一个分支,边缘引导分割,引入了一种基于频率的边缘检测器,以强调操控边界处的高频不一致性,并微调SegFormer以融合RGB和边缘特征,生成像素级掩膜。由于边缘证据在补丁同时包含真实和操控像素时最具信息量,我们用合成热图分支补充边缘引导分割,该分类基础的定位器通过LoRA微调CLIP-ViT图像编码器,以标记完全合成的补丁。合成热图提供粗略的补丁级合成先验,而边缘引导分割则在部分操控补丁内锐化边界,从而实现全面的定位。在MediaEval 2025的SynthIM挑战中,操控区域定位任务的设置下,我们的方法能够扩展到多百万像素图像,并表现出强大的跨领域泛化能力。大量消融实验突显了基于频率的边缘线索和补丁级合成先验在驱动准确、分辨率无关的定位中的互补作用。
cs.CV / 103 / 2605.12006

Robust Promptable Video Object Segmentation

鲁棒可提示视频目标分割
Lee, Sohyun, Gwon, Yeho, Hoyer, Lukas, Schindler, Konrad, Sakaridis, Christos, Kwak, Suha
Abstract
The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.
Chinese Translation
可提示视频目标分割(PVOS)模型在输入损坏情况下性能显著下降,这阻碍了PVOS在安全关键领域的应用。本文首次对鲁棒PVOS(RobustPVOS)进行了全面研究。我们首先构建了一个新的综合基准,包含两个真实世界评估数据集,共351个视频片段和2500多个目标掩码,旨在真实世界的不利条件下进行评估。同时,我们通过对现有VOS数据集施加多样化和时间变化的损坏,生成合成训练数据。此外,我们提出了一种新的鲁棒PVOS方法,称为记忆对象条件门控排序适应(Memory-object-conditioned Gated-rank Adaptation,MoGA)。成功执行鲁棒PVOS的关键在于两个方面:有效处理特定对象的退化和确保预测的时间一致性。MoGA利用在帧间维护的特定对象表示来条件化鲁棒化过程,使模型能够以时间一致的方式对每个跟踪对象进行不同处理。在我们的基准上进行的广泛实验验证了MoGA的有效性,显示出在合成和真实世界数据集上对多种损坏类型的一致且显著的改进,为未来的鲁棒PVOS研究建立了强有力的基准。我们的基准数据集可在 https://sohyun-l.github.io/RobustPVOS_project_page/ 上公开获取。
cs.CV / 104 / 2605.12013

L2P: Unlocking Latent Potential for Pixel Generation

L2P:解锁像素生成的潜在能力
Chen, Zhennan, Zhu, Junwei, Chen, Xu, Zhang, Jiangning, Chen, Jiawei, Zeng, Zhuoqi, Zhang, Wei, Wang, Chengjie, Yang, Jian, Tai, Ying
Abstract
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
Chinese Translation
像素扩散模型最近在视觉生成领域重新引起了关注。然而,从头开始训练先进的像素空间模型需要巨大的计算和数据资源。为了解决这个问题,我们提出了潜在到像素(Latent-to-Pixel,L2P)转移范式,这是一种高效的框架,直接利用预训练的潜在扩散模型(LDM)中的丰富知识来构建强大的像素空间模型。具体而言,L2P抛弃了变分自编码器(VAE),采用大块标记化,并冻结源LDM的中间层,仅训练浅层以学习潜在到像素的转换。通过将LDM生成的合成图像作为唯一的训练语料库,L2P适应了已经平滑的数据流形,从而实现了快速收敛,无需真实数据收集。这一策略使L2P能够仅使用8个GPU无缝迁移大量潜在先验到像素空间。此外,消除VAE的内存瓶颈解锁了原生4K超高分辨率生成。针对主流LDM架构的广泛实验表明,L2P的训练开销微乎其微,但在DPG-Bench上与源LDM的表现相当,并在GenEval上达到了93%的性能。
cs.CV / 105 / 2605.12017

FAME: Feature Activation Map Explanation on Image Classification and Face Recognition

FAME:图像分类和人脸识别中的特征激活图解释
Zhang, Xinyi, Günther, Manuel
Abstract
Deep Learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique on two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We qualitatively and quantitatively show that FAME produces attribution maps that are competitive with state-of-the-art systems. Our code is available at https://github.com/AIML-IfI/fame
Chinese Translation
深度学习彻底改变了机器学习,达到了前所未有的准确性,但代价是可解释性降低。特别是在图像处理系统中,深度网络以高度模糊的方式将局部像素信息转化为更全局的概念。针对图像处理的可解释人工智能方法试图通过突出对预测任务重要的图像区域来阐明这一问题。在这些方法中,类激活映射(Class Activation Mapping, CAM)及其基于梯度的变体根据特征图计算归因,并将其放大到图像分辨率,假设特征图位置仅受底层区域的影响。另一方面,基于扰动的方法,如CorrRISE,试图通过用固定补丁扰动输入并检查网络输出的变化来提供像素级的归因。在本研究中,我们提出了特征激活图解释(Feature Activation Map Explanation, FAME),它通过使用网络梯度计算输入图像的变化,以梯度驱动的方式进行操控,而不是使用固定补丁。我们将该技术应用于两个常见任务:图像分类和人脸识别,并展示了CAM上述假设在更深层网络中并不成立。我们定性和定量地表明,FAME生成的归因图与当前最先进的系统具有竞争力。我们的代码可在以下链接获取:https://github.com/AIML-IfI/fame
cs.CV / 106 / 2605.12021

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

什么-哪里变换器:一种以槽为中心的视觉骨干网络,用于同时表示和定位
Yoshihashi, Ryota, Kada, Masahiro, Ikehata, Satoshi, Kawakami, Rei, Sato, Ikuro
Abstract
Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.
Chinese Translation
许多图像理解任务涉及识别存在什么以及它出现在哪里。然而,处理位置的任务,如物体发现、检测和分割,往往比主要关注物体的图像分类复杂得多。一个可能的原因是,面向分类的骨干网络往往强调关于物体的语义信息,而隐含地纠缠或抑制关于位置的信息。在本研究中,我们关注一种称为“什么-哪里分离”的归纳偏置,这种偏置鼓励模型以分解的方式表示物体外观和空间位置。为了在类似于视觉变换器(Vision Transformer, ViT)的注意力骨干网络中贯穿这一偏置,我们提出了什么-哪里变换器(What-Where Transformer, WWT)。我们的方法引入了两个关键的新设计:(1)将标记视为“什么”的表示,将注意力图视为“哪里”的表示,并通过多流的基于槽的架构在并发前馈模块中处理它们;(2)重用最终层的标记和注意力图用于下游任务,并直接将其暴露于由任务损失导出的梯度,从而促进更有效和明确的定位学习。我们展示了即使在ImageNet上采用标准的单标签分类监督,WWT也能直接从原始注意力图中展现出多物体发现,而不需要额外的后处理,如标记聚类。此外,WWT在零样本物体发现和弱监督语义分割方面的表现优于基于ViT的方法,并且在各种定位设置中可通过最小修改实现迁移。代码将在接受后发布。
cs.CV / 107 / 2605.12026

Spectral Vision Transformer for Efficient Tokenization with Limited Data

用于有限数据高效标记化的光谱视觉变换器
Roberts, Alexandra G., John, Maneesh, Zhang, Jinwei, Romano, Dominick, Sisman, Mert, Choi, Ki Sueng, Kim, Heejong, Sabuncu, Mert R., Nguyen, Thanh D., Dimov, Alexey V., Spincemaille, Pascal, Kopell, Brian H., Wang, Yi
Abstract
We propose a novel spectral vision transformer architecture for efficient tokenization with limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show comparable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: github.com/agr78/spectralViT
Chinese Translation
我们提出了一种新颖的光谱视觉变换器架构,以实现有限数据下的高效标记化,重点关注医学影像。我们概述了由于基底选择而产生的便利理论特性,包括空间不变性和最佳信噪比。我们展示了与空间视觉变换器相比,光谱投影所带来的复杂性降低。我们还展示了在参数数量减少的情况下,与多种模型(包括紧凑型和标准视觉变换器、带注意力机制的卷积神经网络、移位窗口变换器、多层感知器和逻辑回归)相比,具有相等或更优的性能。我们的分析包括模拟数据、公共数据和临床数据,并在以下网址发布了我们的代码:github.com/agr78/spectralViT
cs.CV / 108 / 2605.12027

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

4DVGGT-D:具有改进动态深度估计的4D视觉几何变换器
Zang, Ying, Liu, Xuanyi, Han, Yidong, Ji, Deyi, Ding, Chaotao, Hu, Yuanqi, Zhu, Qi, Li, Xuanfu, Ma, Jin, Sun, Lingyun, Chen, Tianrun, Zhu, Lanyun
Abstract
Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
Chinese Translation
从单目视频重建动态4D场景是一项基本而具有挑战性的任务。尽管近期的3D基础模型提供了强大的几何先验,但它们在动态环境中的性能显著下降。这种下降源于一个基本的矛盾:相机自我运动与物体运动在全局注意机制中的固有耦合。在本文中,我们提出了一种新颖的无训练渐进解耦框架,以原则性的方法从粗到细地将动态与静态分离。我们的核心见解是通过首先稳定相机姿态,然后进行几何精细化来解决这一矛盾。具体而言,我们的方法由三个协同组件组成:(1)动态掩模引导的姿态解耦模块,该模块将姿态估计与动态干扰隔离,生成一个稳定的无运动参考框架;(2)拓扑子空间手术机制,该机制正交分解深度流形,在安全保留动态物体的同时,将精细的、考虑掩模的几何信息注入静态区域;(3)信息论信心感知融合策略,该策略将深度集成公式化为异方差贝叶斯推断问题,通过逆方差加权自适应地融合多次预测。在标准4D重建基准上的大量实验表明,我们的方法在主要点云指标上实现了一致且显著的改进。值得注意的是,我们的方法在稳健的4D场景重建中表现出竞争力,而无需微调,表明数学基础的动态-静态解耦的潜力。
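The inverse-variance weighting named in component (3) of 4DVGGT-D can be sketched for a single pixel as follows. This is the textbook fusion rule for independent Gaussian estimates, not the paper's implementation; the function name `inverse_variance_fusion` and the sample depths and variances are assumptions for illustration, and how the method obtains per-pass variances is not stated in the abstract.

```python
def inverse_variance_fusion(depths, variances):
    """Fuse several depth estimates of one pixel, weighting each by 1 / variance.

    Returns the fused depth and the variance of the fused estimate.
    """
    if len(depths) != len(variances) or not depths:
        raise ValueError("need one positive variance per depth estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused_depth = sum(w * d for w, d in zip(weights, depths)) / total
    fused_variance = 1.0 / total
    return fused_depth, fused_variance

# Hypothetical two-pass prediction: the second pass is less certain.
depth, var = inverse_variance_fusion([2.0, 4.0], [1.0, 3.0])
print(depth, var)
```

Less certain passes contribute less to the result, and the fused variance is never larger than the smallest input variance, which is why this rule is a natural fit for blending multi-pass predictions.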
cs.CV / 109 / 2605.12038

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

OmniHumanoid:基于无配对适应的跨形态视频生成流
Song, Yiren, Deng, Xiyao, Yang, Pei, Wang, Yihan, Shou, Mike Zheng
Abstract
Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
Chinese Translation
跨形态视频生成旨在在不同的人形形态之间转移动作,例如人类与机器人、机器人与机器人之间,从而实现可扩展的数据生成以支持具身智能。在这一背景下,一个主要挑战是动作动态在不同形态之间部分可转移,而外观和形态则是特定于形态的。现有方法往往将这些因素纠缠在一起,且许多方法需要每个目标形态的配对数据,这限制了对新机器人的可扩展性。我们提出了OmniHumanoid,一个将可转移动作学习与形态特定适应相分离的框架。我们的方法从跨多个形态的动作对齐配对视频中学习一个共享的动作转移模型,同时仅通过轻量级的形态特定适配器,利用未配对视频适应新的形态。为了减少动作转移与形态适应之间的干扰,我们进一步引入了一种分支隔离注意力设计,将动作条件与形态特定调制分开。此外,我们构建了一个合成的跨形态数据集,其中包含在多样的人形资产、场景和视角下渲染的动作对齐配对视频。在合成和真实世界基准上的实验表明,OmniHumanoid实现了强大的动作保真度和形态一致性,同时在不重新训练共享动作模型的情况下,支持对未见过的人形形态的可扩展适应。
cs.CV / 110 / 2605.12064

TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

TAR:用于光学和合成孔径雷达图像的文本语义辅助跨模态图像配准框架
Cai, Zhuoyu, Quan, Dou, Huyan, Ning, He, Pei, Wang, Shuang, Jiao, Licheng
Abstract
Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.
Chinese Translation
现有的基于深度学习的方法能够捕捉光学图像和合成孔径雷达(SAR)图像中的共享特征以实现空间对齐。然而,在大几何变形下,光学-SAR配准仍然具有挑战性,因为模型需要同时处理跨模态外观差异和复杂的空间变换。为了解决这个问题,本文提出了一种名为TAR的文本语义辅助跨模态图像配准框架,适用于光学和SAR图像。TAR利用来自遥感场景和土地覆盖类别的文本语义先验,以减轻模态间的差距并增强跨模态特征学习。TAR由三个组件组成:多尺度视觉特征学习(MSFL)模块、文本辅助特征增强(TAFE)模块和粗到细的密集匹配(CFDM)模块。MSFL从光学和SAR图像中提取多尺度视觉特征。TAFE构建与遥感场景和土地覆盖对象相关的文本描述符,并使用冻结的RemoteCLIP文本编码器提取文本特征。这些文本特征通过视觉-文本交互引入,以增强高层视觉特征,从而实现更可靠的粗匹配。然后,CFDM基于增强的高层特征建立粗对应关系,并使用低层特征细化匹配位置。在跨模态遥感图像上的实验结果证明了TAR的有效性,其匹配性能优于几种最先进的方法,并在大几何变形下获得显著提升。
cs.CV / 111 / 2605.12069

Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

面向零样本异常检测的异常感知视觉-语言适配器
Aqeel, Muhammad, Nazir, Maham, Khan, Uzair, Cristani, Marco, Setti, Francesco
Abstract
Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO
Chinese Translation
零-shot异常检测旨在识别未见类别中的缺陷,而无需针对特定目标的训练。现有方法通常对所有样本应用相同的特征转换,尽管正常数据和异常数据在根本上具有不对称的分布(紧凑的正常数据与多样的异常数据),但仍将其视为统一的处理。我们通过提出AVA-DINO,利用这种自然的不对称性,构建了一个异常感知的视觉语言适配框架,具有针对正常和异常模式的双重专门分支,能够适应冻结的DINOv3视觉特征。在辅助数据的训练过程中,这两个分支通过文本引导的路由机制和显式的路由正则化共同学习,以促进分支的专业化。在测试时,仅使用输入图像和固定的预定义语言描述动态组合这两个分支,从而实现不对称激活。该设计防止了退化的统一路由,并允许上下文特定的特征转换。在九个工业和医疗基准测试中的实验表明,该方法达到了最先进的性能,在MVTec-AD上实现了93.5%的图像-AUROC,并在没有领域特定微调的情况下展现出强大的跨领域泛化能力。
cs.CV / 112 / 2605.12072

PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting

PairDropGS:基于配对丢弃的稀疏视图高斯点云一致性正则化
Li, Hantang, Zhu, Qiang, Meng, Xiandong, Wang, Xingtao, Zhao, Debin, Fan, Xiaopeng
Abstract
Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representation learning. In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely-used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability and significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while remaining simple and plug-and-play for improving dropout-based optimization.
Chinese Translation
基于丢弃的稀疏视图三维高斯点云(3DGS)方法通过在训练过程中随机抑制高斯原语来缓解过拟合。现有方法主要集中于设计越来越复杂的丢弃策略,而忽视了不同丢弃高斯子集之间产生的不一致性。这种忽视通常导致重建不稳定和高斯表示学习的次优。在本文中,我们从一致性正则化的角度重新审视基于丢弃的稀疏视图3DGS,并提出了PairDropGS,一个用于稀疏视图高斯点云的基于配对丢弃的一致性正则化框架。具体而言,PairDropGS首先从共享的高斯场构建一对丢弃的高斯子集,并设计了一种低频一致性正则化,以约束它们的低频渲染结构。该设计鼓励共享的高斯场在不同的随机丢弃下保持稳定的场景布局和粗略几何,同时避免对模糊高频细节施加过多约束。此外,我们引入了一种渐进一致性调度策略,以在训练过程中逐步增强一致性正则化,从而提高重建的稳定性和鲁棒性。在广泛使用的稀疏视图基准上的大量实验表明,PairDropGS实现了优越的训练稳定性,在重建质量上显著优于现有的基于丢弃的3DGS方法,同时展现了提高基于丢弃优化的简单性和即插即用特性。
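The three ingredients described for PairDropGS (paired dropout over a shared Gaussian set, a low-frequency consistency term, and a progressively increasing weight) can be sketched on toy data as below. Everything here is an assumption for illustration: average pooling stands in for the unspecified low-pass operator, the ramp is linear, rendering is omitted, and the names `paired_dropout_masks`, `avg_pool`, `low_freq_consistency`, and `consistency_weight` are hypothetical.

```python
import random

def paired_dropout_masks(num_gaussians, drop_rate, rng):
    """Two independent keep-masks over the same shared set of Gaussians."""
    return tuple([rng.random() >= drop_rate for _ in range(num_gaussians)]
                 for _ in range(2))

def avg_pool(image, k=2):
    """Average non-overlapping k x k blocks of a 2D list (a crude low-pass filter)."""
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w - k + 1, k)]
            for i in range(0, h - k + 1, k)]

def low_freq_consistency(render_a, render_b, k=2):
    """Mean absolute difference between the low-pass versions of two renders."""
    la, lb = avg_pool(render_a, k), avg_pool(render_b, k)
    diffs = [abs(a - b) for ra, rb in zip(la, lb) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def consistency_weight(step, total_steps, max_weight=1.0):
    """Linear ramp-up of the consistency weight over training (assumed schedule)."""
    return max_weight * min(1.0, step / total_steps)

# Toy renders that differ only in fine detail: the low-frequency term ignores it.
render_a = [[1.0, 0.0], [0.0, 1.0]]
render_b = [[0.0, 1.0], [1.0, 0.0]]
mask_a, mask_b = paired_dropout_masks(10, 0.3, random.Random(0))
print(low_freq_consistency(render_a, render_b), consistency_weight(50, 100))
```

In a training loop, the weighted consistency term would be added to the usual photometric losses of the two dropped renders, so that coarse layout is tied together while high-frequency detail stays unconstrained.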
cs.CV / 113 / 2605.12074

BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

BARISTA:一个用于组合视觉理解的多任务自我中心基准
Knab, Patrick, Xhelili, Orgest, Buzi, Inis, Nilo, Drago Andres Guggiana, Khan, Mohd Saquib, Kolb, Lorenz, Scherzer, Manuel, Yildirir, Kerem, Bartelt, Christian, Schubert, Philipp Johannes
Abstract
Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at https://huggingface.co/datasets/ramblr/BARISTA.
Chinese Translation
场景理解是通用物理智能的核心,而视频是捕捉场景状态和时间动态的主要方式。然而,理解物理过程仍然困难,因为模型必须结合对象定位、手-物体交互、关系解析、时间推理和逐步程序推理。现有基准通常单独评估这些能力,限制了对模型在程序任务中失败原因的诊断。我们引入了BARISTA,一个密集注释的自我中心数据集和基准,包含185个真实世界的咖啡制作视频,涵盖全自动、基于滤杯和基于胶囊的工作流程。BARISTA提供经过验证的逐帧场景图,将持久的对象身份与掩模、轨迹、边框、属性、类型关系、手-物体交互、活动和过程步骤相连接。基于这些图,我们推导出零样本语言任务,包括短语定位、手-物体交互识别、指称、活动识别、关系提取和时间视觉问答。实验结果显示任务家族之间存在显著差异,并且没有一致占优的模型家族,使BARISTA成为一个具有挑战性的程序视频理解诊断基准。代码和数据集可在 https://huggingface.co/datasets/ramblr/BARISTA 获取。
cs.CV / 114 / 2605.12077

The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments

缺失的GAP:从解决方形拼图到处理现实世界考古碎片
Shahar, Ofir Itzhak, Elkin, Gur, Ben-Shahar, Ohad
Abstract
Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well-established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzle datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.
Chinese Translation
拼图解决已成为计算机视觉研究领域越来越受欢迎的任务。近期的研究利用尖端架构和计算方法将一组拼图碎片重新组合成一个连贯的图像,同时在成熟的数据集上取得了越来越好的结果。然而,这些方法大多数共享一个共同的限制条件:仅在严格的方形拼图碎片上操作。在本研究中,我们介绍了GAP,一个包含合成的、严重侵蚀的不限形状拼图碎片的新型拼图数据集,这些碎片是通过对现实世界考古碎片的学习分布生成的。我们还推出了PuzzleFlow,一个基于ViT(视觉变换器)和流匹配的新框架,能够处理复杂的拼图碎片,并在与该领域经典和近期重要工作的比较中,在GAP上展示出优越的性能。
cs.CV / 115 / 2605.12088

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

UniCustom:用于多参考图像生成的统一视觉条件框架
Xu, Yiyan, Wang, Qiulin, Wang, Wenjie, Mao, Yunyao, Wang, Xintao, Wan, Pengfei, Gai, Kun, Feng, Fuli
Abstract
Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
Chinese Translation
多参考图像生成旨在根据文本指令合成图像,同时忠实地保留来自多个参考图像的主体身份。现有的增强视觉语言模型(VLM)的扩散模型通常依赖于解耦的视觉条件:语义视觉变换器(ViT)特征由VLM处理以理解指令,而丰富外观的变分自编码器(VAE)特征则在后期注入到扩散主干中。尽管这种设计直观,但这种分离使得模型难以将每个语义基础的主体与来自正确参考图像的视觉细节关联起来。因此,模型可能识别出所指的主体,但无法保留其身份和细粒度外观,导致在复杂的多参考设置中出现属性泄漏和交叉参考混淆。为了解决这个问题,我们提出了UniCustom,一个统一的视觉条件框架,在VLM编码之前融合ViT和VAE特征。这种早期融合使得VLM能够同时接触到语义线索和丰富的外观细节,从而使其隐藏状态能够通过一个轻量级的线性融合层共同编码所指主体及其相应的视觉外观。为了学习这样的统一表示,我们采用了两阶段的训练策略:以重建为导向的预训练,保留融合隐藏状态中的参考特定外观细节,随后在单参考和多参考生成任务上进行监督微调。我们进一步引入了一种槽位绑定正则化,鼓励每个图像槽保留其对应参考的低级细节,从而减少交叉参考的纠缠。在两个多参考生成基准上的实验表明,UniCustom在主体一致性、指令遵循和组合保真度方面始终优于强基线。
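The UniCustom abstract above describes fusing ViT and VAE features with a lightweight linear layer before VLM encoding. The sketch below shows one plausible reading of that step, not the authors' implementation. It assumes the two encoders yield the same number of spatially aligned tokens and that fusion is "concatenate, then project". The names `linear` and `fuse_tokens` and all sizes and weights are hypothetical.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one token; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_tokens(vit_tokens, vae_tokens, weight, bias):
    """Concatenate per-token ViT and VAE features, then project them.

    Assumption: both encoders yield the same number of spatially aligned tokens.
    """
    if len(vit_tokens) != len(vae_tokens):
        raise ValueError("expected one VAE token per ViT token")
    return [linear(v + a, weight, bias) for v, a in zip(vit_tokens, vae_tokens)]


# Toy sizes (hypothetical): 3-dim ViT features, 2-dim VAE features, 2-dim output.
vit_tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
vae_tokens = [[4.0, 5.0], [2.0, 2.0]]
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # copies the first ViT channel
    [0.0, 0.0, 0.0, 1.0, 0.0],  # copies the first VAE channel
]
bias = [0.5, -0.5]

fused = fuse_tokens(vit_tokens, vae_tokens, weight, bias)
```

In this toy setup each fused token carries one semantic (ViT) channel and one appearance (VAE) channel, which is the property the abstract attributes to early fusion. A real layer would learn `weight` and `bias` and project to the VLM hidden width.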
cs.CV / 116 / 2605.12112

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

当策略熵约束失效时:通过感知熵在基于流的RLHF中保持多样性
Tan, Xiaofeng, Liu, Jun, Gao, Bin-Bin, Fan, Yuanting, Jiang, Xi, Wang, Chengjie, Wang, Hongsong, Zheng, Feng
Abstract
RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (https://xiaofeng-tan.github.io/projects/PEC) is publicly available.
Chinese Translation
RLHF广泛用于将流匹配的文本到图像模型与人类偏好对齐,但在微调后常常导致严重的多样性崩溃。在强化学习中,多样性通常被假设与策略熵相关,这促使了熵正则化的应用。然而,我们表明这一直觉在流模型中失效:即使感知多样性崩溃,策略熵仍然保持不变。我们从理论和实证两个方面解释了这种不匹配:常数熵源于固定的、预定义的噪声调度,而多样性崩溃则是由策略梯度的模式寻求特性驱动的。因此,策略熵未能阻止模型在感知空间中收敛到狭窄的高奖励区域。为此,我们引入了感知熵,它在感知空间中捕捉多样性并保持标准熵的特性。基于这一见解,我们提出了两种熵正则化策略:感知熵约束(Perceptual Entropy Constraint)和生成空间的感知约束(Perceptual Constraints on Generation Space),以保持感知多样性并提高质量。在两个基础模型、神经奖励和基于规则的奖励以及三个感知空间上的实验表明,质量与多样性权衡的一致收益;PEC达到了最佳整体得分0.734(对比基线的0.366);PEC的一个互补设置进一步达到了0.989的多样性平均值(对比基线的0.047)。我们的项目页面(https://xiaofeng-tan.github.io/projects/PEC)已公开可用。
cs.CV / 117 / 2605.12119

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

MoCam:通过结构化去噪动态实现统一的新视图合成
Liu, Haofeng, Zhou, Yang, Wang, Ziheng, Xu, Zhengbo, Peng, Zhan, Ma, Jie, Liang, Jun, He, Shengfeng, Li, Jing
Abstract
Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.
Chinese Translation
生成新视图合成面临一个基本困境:几何先验提供空间对齐,但在视角变化下变得稀疏且不准确,而外观先验则提供视觉保真度,但缺乏几何对应关系。现有方法要么在生成过程中传播几何错误,要么在静态融合时遭遇信号冲突。我们提出了MoCam,它利用结构化去噪动态在扩散过程中协调几何与外观的进展。MoCam首先在早期阶段利用几何先验来锚定粗略结构并容忍其不完整性,然后在后期阶段切换到外观先验,以主动纠正几何错误并细化细节。该设计自然地通过在扩散过程中时间上解耦几何对齐和外观细化,统一了静态和动态视图合成。实验表明,MoCam显著优于先前的方法,特别是在点云存在严重孔洞或扭曲时,实现了稳健的几何-外观解耦。
cs.CV / 118 / 2605.12134

MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation

MULTI:解耦相机镜头、传感器、视角和领域以生成新颖图像
Godavarthy, Sonali, Neuwirth-Trapp, Matthias, Faasch, Tim-Felix, Bieshaar, Maarten, Moeller, Michael, Paudel, Danda Pani
Abstract
Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in the regime. We, therefore, propose the new method Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.
Chinese Translation
近期的文本到图像模型能够生成高质量图像,但文本的模糊性在需要特定风格或物体时阻碍了精确控制。已有多项研究致力于学习和组合多个物体和模式。然而,目前的工作几乎完全专注于图像内容,忽视了诸如相机镜头、传感器类型、成像视角和场景领域特征等成像因素。我们将这一新挑战称为成像因素解耦,并展示了当前方法在这一领域的局限性。因此,我们提出了一种新方法——通过文本反演进行多因素解耦(MULTI)。该方法由两个阶段组成:在第一阶段,我们学习一般因素;在第二阶段,我们提取特定于数据集的因素。这一设置使得现有数据集的扩展和新因素组合成为可能,从而减少分布差距。它进一步支持特定因素的修改和通过ControlNets进行图像到图像的生成。在我们的新DF-RICO基准上的评估证明了MULTI的有效性,并强调了因素解耦作为一项新研究方向的重要性。
cs.CV / 119 / 2605.12138

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

设计您的广告:基于统一自回归模型的个性化广告图像和文本生成
Xu, Yexing, Feng, Wei, Zhang, Shen, Wang, Haohan, Qin, Yuxin, Li, Yaoyu, Ma, Ao, Luo, Yuhao, Wang, Lu, Ren, Xudong, Wang, Haoran, Ling, Run, Zhang, Zheng, Lv, Jingjing, Shen, Junjie, Law, Ching, Wang, Longguang, Guo, Yulan
Abstract
Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at https://github.com/JD-GenX/Uni-AdGen.
Chinese Translation
生成逼真且符合用户偏好的广告是电子商务中的一项关键挑战。现有的方法利用多个独立模型,通过点击率(CTR)来可控地创建吸引人的图像或文本广告。然而,它们的流程缺乏跨模态感知,并且依赖于仅反映平均偏好的CTR。因此,我们探索从历史点击行为中联合生成个性化的图像-文本广告。我们首先设计了一个统一广告生成模型(Unified Advertisement Generative model,Uni-AdGen),该模型采用单一的自回归框架来同时生成广告图像和文本。通过结合前景感知模块和指令调优,Uni-AdGen增强了生成内容的真实感。为了进一步个性化广告,我们为Uni-AdGen配备了一个粗到细的偏好理解模块,该模块能够有效捕捉来自嘈杂的多模态历史行为中的用户兴趣,以驱动个性化生成。此外,我们构建了第一个大规模个性化广告图像-文本数据集(PAd1M),并引入了一种产品背景相似性(Product Background Similarity,PBS)度量来促进训练和评估。大量实验表明,我们的方法在一般和个性化广告生成方面均优于基线。我们的项目可在 https://github.com/JD-GenX/Uni-AdGen 获取。
cs.CV / 120 / 2605.12140

EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion

EchoTracker2:通过建模局部运动增强心肌点跟踪
Azad, Md Abulkalam, Holmstrøm, Vegard, Nyberg, John, Lovstakken, Lasse, Dalen, Håvard, Grenne, Bjørnar, Østvik, Andreas
Abstract
Myocardial point tracking (MPT) has recently emerged as a promising direction for motion estimation in echocardiography, driven by advances in general-purpose point tracking methods. However, myocardial motion fundamentally differs from motion encountered in natural videos, as it arises from physiologically constrained deformation that is spatially and temporally continuous throughout the cardiac cycle. Consequently, motion trajectories typically remain locally confined despite substantial tissue deformation. Motivated by these properties, we revisit the architectural design for MPT and find that coarse initialization in commonly used two-stage coarse-to-fine architectures may be unnecessary in this domain. In this work, we propose a fine-stage-only architecture, EchoTracker2, which enriches pixel-precise features with local spatiotemporal context and integrates them with long-range joint temporal reasoning for robust tracking. Experimental results across in-distribution, out-of-distribution (OOD), and public synthetic datasets show that our model improves position accuracy by $6.5\%$ and reduces median trajectory error by $12.2\%$ relative to a domain-specific state-of-the-art (SOTA) model. Compared to the best general-purpose point tracking method, the improvements are $2.0\%$ and $5.3\%$, respectively. Moreover, EchoTracker2 shows better agreement with expert-derived global longitudinal strain (GLS) and enhances test-retest reproducibility. Source code will be available at: https://github.com/riponazad/ptecho.
Chinese Translation
心肌点跟踪(MPT)最近作为超声心动图中运动估计的一个有前景的方向而出现,这得益于通用点跟踪方法的进步。然而,心肌运动与自然视频中遇到的运动在本质上是不同的,因为它源于生理上受限的变形,这种变形在整个心动周期中在空间和时间上都是连续的。因此,尽管组织发生了显著的变形,运动轨迹通常仍然局限于局部。基于这些特性,我们重新审视了MPT的架构设计,发现常用的两阶段粗到细架构中的粗略初始化在这一领域可能是不必要的。在本研究中,我们提出了一种仅包含细阶段的架构EchoTracker2,该架构通过局部时空上下文丰富像素精确特征,并将其与长距离联合时间推理相结合,以实现稳健的跟踪。在分布内、分布外(OOD)和公共合成数据集上的实验结果表明,我们的模型相对于领域特定的最先进(SOTA)模型提高了位置准确性$6.5\%$,并将中位轨迹误差降低了$12.2\%$。与最佳的通用点跟踪方法相比,改进分别为$2.0\%$和$5.3\%$。此外,EchoTracker2与专家推导的全局纵向应变(GLS)表现出更好的一致性,并增强了测试-重测的可重复性。源代码将可在以下网址获取:https://github.com/riponazad/ptecho。
cs.CV / 121 / 2605.12144

PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization

PoseCompass:用于视觉定位的智能合成姿态选择
Zhou, Yanan, Qian, Zhaoyan, Li, Yanli, Yang, Nan, Guo, Zhongliang, Yuan, Dong
Abstract
In visual localization, Absolute Pose Regression (APR) enables real-time 6-DoF camera pose inference from single images, yet critically depends on fine-tuning data quality and coverage. While recent methods leverage 3D Gaussian Splatting (3DGS) for novel view synthesis-based data augmentation, random sampling generates redundant views and noisy samples from poorly reconstructed regions. To address this gap, we propose PoseCompass, an intelligent pose selection pipeline for 3DGS-based APR. PoseCompass formulates synthetic pose selection and derives a value-based pose ranking mechanism to identify informative poses. The ranking integrates three dimensions: Localization Difficulty, favoring challenging regions; Coverage Novelty, exploring under-sampled areas; and Rendering Observability, filtering artifacts and noise. PoseCompass then generates trajectory-constrained candidates, selects the top-K ranked poses, and synthesizes views using 3DGS with lightweight diffusion-based alignment. Finally, the pose regressor is fine-tuned on mixed real and synthetic data. We evaluate PoseCompass on 7-Scenes, where it reduces adaptation time from 15.2 to 5.1 minutes, a 3x speedup, while cutting median pose errors by 53.8 percent and significantly outperforming random baselines.
Chinese Translation
在视觉定位中,绝对姿态回归(Absolute Pose Regression, APR)能够从单幅图像中实时推断出6自由度的相机姿态,但其关键依赖于微调数据的质量和覆盖范围。尽管近期方法利用3D高斯点云(3D Gaussian Splatting, 3DGS)进行基于新视角合成的数据增强,但随机采样会从重建效果差的区域生成冗余视图和噪声样本。为了解决这一研究空白,我们提出了PoseCompass,一个基于3DGS的APR智能姿态选择管道。PoseCompass制定了合成姿态选择方案,并推导出一种基于价值的姿态排名机制,以识别信息丰富的姿态。该排名整合了三个维度:定位难度,偏向于具有挑战性的区域;覆盖新颖性,探索欠采样区域;以及渲染可观察性,过滤伪影和噪声。PoseCompass随后生成受轨迹约束的候选姿态,选择排名前K的姿态,并使用3DGS与轻量级扩散对齐合成视图。最后,姿态回归器在混合的真实和合成数据上进行微调。我们在7-Scenes数据集上评估了PoseCompass,结果显示其将适应时间从15.2分钟减少至5.1分钟,实现了3倍的加速,同时将中位姿态误差降低了53.8%,显著优于随机基线。
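The PoseCompass abstract above ranks candidate poses by three criteria and keeps the top K. The sketch below illustrates that selection step under stated assumptions. The abstract does not give the scoring formula, so `pose_value`, its weights, the `min_obs` threshold and the example scores are all hypothetical. Each criterion is assumed to be pre-scaled to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pose_id: str
    difficulty: float     # Localization Difficulty, assumed scaled to [0, 1]
    novelty: float        # Coverage Novelty, assumed scaled to [0, 1]
    observability: float  # Rendering Observability, assumed scaled to [0, 1]


def pose_value(c, w_diff=1.0, w_nov=1.0, min_obs=0.5):
    """Hypothetical value: reject poorly rendered poses, reward hard and novel ones."""
    if c.observability < min_obs:
        return float("-inf")
    return c.observability * (w_diff * c.difficulty + w_nov * c.novelty)


def select_top_k(candidates, k):
    """Return the ids of the k highest-valued poses, skipping rejected ones."""
    ranked = sorted(candidates, key=pose_value, reverse=True)
    return [c.pose_id for c in ranked[:k] if pose_value(c) > float("-inf")]


candidates = [
    Candidate("a", difficulty=0.9, novelty=0.8, observability=0.9),
    Candidate("b", difficulty=0.9, novelty=0.9, observability=0.2),  # poorly rendered
    Candidate("c", difficulty=0.2, novelty=0.1, observability=1.0),
    Candidate("d", difficulty=0.5, novelty=0.9, observability=0.8),
]
selected = select_top_k(candidates, k=2)
```

Candidate `b` scores highest on difficulty and novelty but is dropped because its rendering is unreliable. This is the role the abstract gives to Rendering Observability.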
cs.CV / 122 / 2605.12145

Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

通过语义对齐的离散表示实现跨模态领域泛化
Sen, Souptik, Younis, Raneen, Ahmadi, Zahra
Abstract
Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
Chinese Translation
多模态学习旨在整合来自不同感官源的信息,但当前的方法在跨模态泛化与特定模态结构之间难以取得平衡。连续(隐式)方法保留了细粒度的先验知识,但使得泛化变得具有挑战性,而离散(显式)方法则以牺牲模态特异性为代价强制执行共享原型。我们提出了 CoDAAR(跨模态离散对齐与重建),这是一个新颖的框架,通过索引级对齐在模态特定的代码本之间建立语义共识,从而解决了这一长期存在的权衡。该设计独特地允许 CoDAAR 在统一的离散空间内保留模态独特结构,同时实现可泛化的跨模态表示。CoDAAR 结合了两种互补机制:离散时间对齐(DTA),使得细粒度的时间量化成为可能,以及级联语义对齐(CSA),促进渐进的跨模态语义一致性。两者共同建立了一个无竞争的统一表示空间。在成对的多模态序列上通过自监督重建目标进行训练,CoDAAR 展现了强大的跨模态和跨领域泛化能力。在跨模态泛化基准测试中,包括事件分类、定位、视频分割和跨数据集迁移,CoDAAR 达到了最先进的性能,为离散和可泛化的多模态表示学习建立了新的范式。
cs.CV / 123 / 2605.12163

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

自洽潜在推理:面向视觉-语言模型的长潜在序列推理
Wang, Chenfeng, He, Wei, Zhu, Xuhan, Zhou, Chunpeng, Li, Qizhen, Yan, Song, Zheng, Yufei, Yu, Chengjun, Lu, Fan, Zhai, Wei, Cao, Yang, Yu, Pengfei, Zha, Zheng-Jun
Abstract
In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.
Chinese Translation
在语言推理中,较长的思维链通常能带来更好的表现,这自然暗示视觉潜在推理也可能从更长的潜在序列中受益。然而,我们发现了一个反直觉的现象:现有的潜在视觉推理方法的性能在潜在序列变长时系统性下降。我们揭示了根本原因:信息增益崩溃——自回归生成使得每一步高度依赖于先前的输出,因此后续的标记几乎无法引入新信息。我们进一步确定,作为监督目标使用的高度池化($\geq 128\times$)图像嵌入提供的信号与无意义的占位符没有区别。基于这些见解,我们提出了SCOLAR(自洽潜在推理),它引入了一种轻量级的去变换器,利用大型语言模型(LLM)的全序列隐藏状态一次性生成辅助视觉标记,每个标记独立锚定于原始视觉空间。结合三阶段的SFT和ALPO强化学习,SCOLAR将可接受的潜在链长延长超过$30\times$,在真实世界推理基准上实现了开源模型中的最新成果(比基础模型提高14.12%),并展示了强大的分布外泛化能力。
cs.CV / 124 / 2605.12169

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

UniFixer:一种用于基于扩散的视图合成的通用参考引导修复器
Chen, Sihan, Zhang, Xiang, Zhang, Yang, Aydin, Tunc, Schroers, Christopher
Abstract
With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.
Chinese Translation
随着生成模型的快速发展,基于扩散的方法已成为视图合成任务的主流,无论是在显式的深度扭曲修复还是隐式的端到端方式中。尽管取得了成功,这两种范式通常会遭遇显著的质量下降,例如,由于像素到潜在空间的压缩和扩散幻觉,导致细节模糊和结构扭曲。本文从空间、时间和骨干网络相关三个关键维度研究了扩散降解,并提出了UniFixer,一种通用的参考引导框架,通过粗到细的策略修复多种降解伪影。具体而言,首先设计了一个参考预对齐模块,以在参考视图和降解的新视图之间进行粗对齐。然后,全局结构锚定机制纠正几何扭曲,以确保结构的保真性,接着是一个局部细节注入模块,恢复高质量视图合成所需的细粒度纹理细节。我们的UniFixer作为一个即插即用的修复器,实现了对不同类型扩散降解的零样本修复,广泛的实验验证了我们在新视图合成和立体转换上的最先进性能。
cs.CV / 125 / 2605.12179

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

SyncDPO:通过偏好学习增强视频-音频联合生成中的时间同步
Cheng, Xin, Wang, Xihua, Ba, Ying, Wang, Yuyue, Guan, Kaisi, Wang, Yinbo, Li, Wenpu, Song, Ruihua
Abstract
Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative by introducing explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.
Chinese Translation
近年来,视频-音频联合生成的进展在语义对应方面取得了显著成功。然而,实现精确的时间同步仍然是一个具有挑战性的问题,这需要音频事件与其视觉触发器之间的细粒度对齐。联合生成的后训练方法主要受到监督微调的主导,但常用的均方误差损失对微小的时间错位提供的惩罚不足。直接偏好优化(Direct Preference Optimization)通过引入显式的错位对应物提供了一种替代方案,以更好地提高时间敏感性。本文提出了一种后训练框架SyncDPO,利用DPO来改善视频-音频联合生成的时间敏感性。传统的DPO管道通常依赖于昂贵的采样和排序程序来构建偏好对,导致巨大的计算成本。为了提高效率,我们引入了一套基于规则的即时负样本构造策略,这些策略在不增加额外标注或采样的情况下扭曲时间结构。我们证明,通过提供显式的负监督,通过时间扭曲的视频-音频对,可以有效增强时间对齐能力。因此,我们实施了一种课程学习策略,逐步增加负样本的难度,从粗略错位过渡到微妙的不一致性。在四个不同基准上的广泛客观和主观实验,从环境声音视频到人类语音视频,证明SyncDPO在提高模型的时间对齐能力方面显著优于其他方法。它还在捕捉内在运动-声音动态方面表现出对分布外基准的优越泛化能力。演示和代码可在 https://syncdpo.github.io/syncdpo/ 获取。
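The SyncDPO abstract above builds rejected samples on the fly by distorting temporal structure, with a curriculum that moves from coarse to subtle misalignment. The sketch below illustrates the idea with one distortion rule, a fixed audio delay. The abstract does not list the actual rules, so the delay rule, the linear schedule and the names `shift_audio`, `curriculum_shift` and `make_preference_pair` are assumptions made for illustration.

```python
def shift_audio(audio, shift):
    """Delay audio by `shift` samples, padding the start with silence (0.0)."""
    shift = max(0, min(shift, len(audio)))
    return [0.0] * shift + list(audio[: len(audio) - shift])


def curriculum_shift(step, total_steps, max_shift, min_shift):
    """Shrink the misalignment linearly from coarse (max_shift) to subtle (min_shift)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return round(max_shift - frac * (max_shift - min_shift))


def make_preference_pair(video, audio, step, total_steps, max_shift=8, min_shift=1):
    """Build (chosen, rejected) without sampling from the model or extra labels."""
    shift = curriculum_shift(step, total_steps, max_shift, min_shift)
    chosen = (video, list(audio))                  # the aligned ground-truth pair
    rejected = (video, shift_audio(audio, shift))  # same video, delayed audio
    return chosen, rejected, shift


# Toy clip: frame ids stand in for video, a single "click" stands in for audio.
video = list(range(10))
audio = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

early = make_preference_pair(video, audio, step=0, total_steps=10)
late = make_preference_pair(video, audio, step=9, total_steps=10)
```

Early in training the rejected audio is delayed by 8 samples and is easy to tell apart. By the last step the delay is a single sample, which forces the model to resolve fine misalignments.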
cs.CV / 126 / 2605.12198

Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation

通过可控生成增强提升3D人类姿态估计的领域泛化能力
Hu, Xinhao, Zhang, Yiyi, Zhang, Liqing, Zhang, Jianfu
Abstract
Pedestrian motion, due to its causal nature, is strongly influenced by domain gaps arising from discrepancies between training and testing data distributions. Focusing on 3D human pose estimation, this work presents a controllable human pose generation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. This generative augmentation enriches training datasets, enhances model generalization, and alleviates the limitations of existing methods in handling domain discrepancies. By leveraging both indoor/real-world and outdoor/virtual datasets, we perform cross-domain data fusion and controllable video generation to construct enriched training data, tailored to realistic deployment settings. Extensive experiments show that the augmented datasets significantly improve model performance on unseen scenarios and datasets, validating the effectiveness of the proposed approach.
Chinese Translation
行人运动由于其因果特性,受到训练和测试数据分布之间差异所引发的领域差距的强烈影响。针对3D人类姿态估计,本研究提出了一种可控的人类姿态生成框架,通过系统地变化姿态、背景和摄像机视角来合成多样化的视频数据。这种生成增强丰富了训练数据集,提高了模型的泛化能力,并缓解了现有方法在处理领域差异时的局限性。通过利用室内/真实世界和室外/虚拟数据集,我们进行跨领域数据融合和可控视频生成,以构建适应于现实部署环境的丰富训练数据。大量实验表明,增强的数据集显著提高了模型在未见场景和数据集上的性能,验证了所提方法的有效性。
cs.CV / 127 / 2605.12218

Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction

从视角特权视图学习自我中心的鸟瞰图(BEV)表示:在线高清地图构建的跨视图监督
Lengerer, Daniel, Pechinger, Mathias, Bogenberger, Klaus, Markgraf, Carsten
Abstract
Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9 mAP in the standard $60\times30\,\mathrm{m}$ region and +9.9 mAP in the extended $100\times50\,\mathrm{m}$ setting, corresponding to a 44% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.
Chinese Translation
基于多摄像头输入的鸟瞰图(BEV)表示已成为在线高清(HD)地图构建的核心接口。然而,大多数方法仅依赖于自我中心的监督,需要从不完整的观测、遮挡以及远距离信息密度下降中推断出大规模场景结构,在这些情况下,视角效应和空间稀疏性阻碍了一致的结构推理。我们引入了跨视图监督(Cross-View Supervision, CVS),这是一种表示学习范式,它将几何和拓扑先验从自我对齐的俯视视角转移到基于摄像头的BEV编码器中。CVS并不是通过添加辅助语义损失,而是在共享的BEV特征空间中对齐表示,并从视角特权教师中提炼出全局一致的结构知识到自我中心的主干网络中。这种监督增强了结构一致性,而无需修改推理架构或在测试时要求俯视输入。在nuScenes上的实验使用来自AID4AD跨视图扩展的自我对齐空中图像,显示出相较于StreamMapNet的一致性改进,同时保持相同的仅摄像头推理。在标准的$60\times30\,\mathrm{m}$区域中,CVS实现了+3.9 mAP,在扩展的$100\times50\,\mathrm{m}$设置中实现了+9.9 mAP,对应于远距离下44%的相对增益。这些结果突显了视角特权结构监督作为改善高清地图构建中BEV表示学习的有前景的训练原则。
cs.CV / 128 / 2605.12220

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV:通过高度感知的鸟瞰图和高分辨率特征融合实现实时激光雷达三维行人检测
Khoshkdahan, Mohammad, Vinel, Alexey
Abstract
Safe autonomous agents and mobile robots need fast real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's-eye-view (BEV) encoding, which maps the full 3D LiDAR point cloud into a lightweight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re-binning and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On the KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.
Chinese Translation
安全的自主代理和移动机器人需要快速的实时三维感知,特别是对于脆弱的道路使用者(VRUs),如行人。我们介绍了一种新的鸟瞰图(BEV)编码,它将完整的三维激光雷达点云映射为一个轻量级的二维BEV张量,具有三个高度带。我们明确地将三维检测重新表述为二维检测问题,然后从BEV输出重建三维框。一个单一的网络能够在一次传递中检测汽车、行人和骑自行车的人。主干网络在深层阶段使用区域注意力,层次双向颈部在P1到P4之间融合上下文和细节,头部则通过分布焦点学习预测有方向的框,并使用旋转IoU损失来处理侧偏差。训练过程中应用了小的垂直重分箱和轻微的反射抖动,以抵抗记忆化。在三维重建过程中,我们使用四分位数范围(IQR)滤波器去除噪声和异常的激光雷达点。在KITTI数据集上,TriBand-BEV在单个消费级GPU上以49 FPS的速度实现了58.7/52.6/47.2的行人BEV AP(%),分别对应于简单、中等和困难场景,超越了Complex-YOLO,提升幅度分别为+12.6%、+7.5%和+3.1%。定性场景显示在遮挡下的稳定检测。该流程紧凑,适合实时机器人部署。我们的源代码已在GitHub上公开。
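The TriBand-BEV abstract above mentions an interquartile range (IQR) filter that removes outlier LiDAR points during 3D reconstruction. The sketch below shows the standard IQR rule (Tukey's fences) applied to point heights. The abstract does not say which coordinate is filtered or which multiplier is used, so filtering on z-values inside one detected box and the conventional `k = 1.5` are assumptions. The `heights` sample is made up for illustration.

```python
from statistics import quantiles


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]


# Hypothetical z-coordinates (metres) of LiDAR points inside one detected box.
heights = [0.1, 0.4, 0.8, 1.0, 1.2, 1.5, 1.7, 9.0]
kept = iqr_filter(heights)
```

Here the stray 9.0 m return falls outside the upper fence and is dropped. The remaining points can then be used to estimate the box's vertical extent without the outlier inflating it.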
cs.CV / 129 / 2605.12237

UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

UHR-Micro:诊断和缓解地球观测视觉语言模型中的分辨率幻觉
Ni, Shuo, Wang, Tong, Zhang, Jing, Chen, He, Guo, Haonan, Zhang, Ning, Du, Bo
Abstract
Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.
Chinese Translation
视觉语言模型(VLMs)越来越多地应用于超高分辨率(UHR)地球观测图像,但它们仍然容易受到大规模场景上下文与微观目标之间严重尺度不匹配的影响。我们将这一经验性差距称为“分辨率幻觉”:更高的输入分辨率提供了更丰富的视觉细节的表象,但并不一定能可靠地感知空间上较小的、与任务相关的证据。为了基准测试这一挑战,我们引入了UHR-Micro,这是一个包含11,253条指令的基准,基于1,212张UHR图像,旨在评估VLMs在原生地球观测图像的空间极限下的表现。UHR-Micro涵盖了多样的微观目标尺度、上下文要求、任务类别和视觉条件,并提供了支持控制评估和细粒度错误归因的诊断注释。对代表性高分辨率VLMs的实验显示,尽管可以访问高分辨率输入,但在空间定位和证据解析方面存在显著失败。进一步分析表明,这些失败并不能通过简单提高模型容量来完全解决,而是与在定位和使用任务相关的微观证据方面缺乏足够的指导密切相关。基于这一发现,我们提出了微证据主动感知(Micro-evidence Active Perception,MAP),这是一种参考代理,能够将查询分解为寻求证据的步骤,主动检查候选区域,并将其答案基于局部观察进行定位。MAP-Agent通过使高分辨率推理以证据为中心而非以图像为中心来改善微观层面的感知。UHR-Micro和MAP-Agent共同提供了一个诊断平台,用于评估、理解和推动地球观测VLMs中的高分辨率推理。数据集和源代码已发布在 https://github.com/MiliLab/UHR-Micro。
cs.CV / 130 / 2605.12252

H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows

H3D-MarNet:基于小波引导的双路径学习用于金属伪影抑制及放射治疗工作流中的CT模态转换
Rehman, Mubashara, Martinel, Niki, Avanzo, Michele, Spizzo, Riccardo, Micheloni, Christian
Abstract
Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In the second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from the full dataset, indicating effective metal artifact suppression and anatomical preservation and highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.
Chinese Translation
计算机断层扫描(CT)中的金属伪影严重降低了图像质量,影响了诊断准确性和放射治疗规划,尤其是在具有高密度植入物的癌症患者中。我们提出了H3D-MarNet,一种用于从千伏CT(kVCT)到兆伏CT(MVCT)进行伪影感知CT领域转换的两阶段框架。在第一阶段,基于小波的预处理模块通过频率感知去噪抑制金属引起的伪影,同时保留解剖结构。在第二阶段,Domain-TransNet使用混合体积学习架构执行kVCT到MVCT的领域转换。Domain-TransNet集成了基于卷积神经网络(CNN)的编码器以捕捉细粒度的局部解剖细节,以及基于变换器的编码器以建模长距离体积依赖关系。通过基于注意力的特征融合机制将互补表示融合,以确保切片之间的空间和上下文一致性。一个多阶段的、基于注意力引导的解码器,在深度监督的支持下,逐步重建抑制伪影的MVCT体积。大量实验表明,H3D-MarNet在完整数据集中伪影影响切片上达到了28.14 dB的峰值信噪比(PSNR)和0.717的结构相似性指数(SSIM),表明其在金属伪影抑制和解剖结构保留方面的有效性,突显了其在临床放射治疗工作流中可靠的CT模态转换潜力。
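For readers less familiar with the PSNR figure quoted above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(R^2 / MSE). The function name `psnr`, the flat-list input, and the default `data_range` of 1.0 are assumptions made for illustration; the abstract does not state which intensity range the authors used.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat lists.

    data_range is the assumed maximum intensity span (hypothetical default).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy example: every sample is off by 0.1 on a unit intensity range.
toy_psnr = psnr([0.0, 1.0], [0.1, 0.9])
```

Under this definition a higher value means a smaller mean squared error relative to the chosen intensity range, so reported values are only comparable when the range convention matches.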
cs.CV / 131 / 2605.12259

From Image Hashing to Scene Change Detection

从图像哈希到场景变化检测
Duong, Anh-Kiet, Iatrides, Marie-Claire, Gomez-Krämer, Petra, Carozza, Jean-Michel
Abstract
Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
Chinese Translation
图像哈希提供了紧凑的表示,以实现高效的存储和检索,但本质上仅限于全局比较,无法推断变化发生的位置。这一限制使得哈希无法直接应用于场景变化检测,而空间定位在此过程中至关重要。在本研究中,我们从场景变化检测的角度重新审视哈希,并提出了HashSCD,这是一种基于补丁的哈希框架,能够实现高效的全局变化检测和局部变化识别。HashSCD将空间对齐的补丁编码为紧凑的哈希码,并通过类似XOR的操作进行聚合,使得变化检测和定位可以直接在汉明空间中进行,而无需对先前图像进行重复推断。该模型采用无监督方式进行训练,使用对比学习在补丁和全局层面上进行。实验表明,HashSCD在与最先进的无监督哈希和场景变化检测方法相比时,表现出竞争力,同时显著降低了计算成本和存储需求。
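The Hamming-space comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the HashSCD implementation: patch codes are assumed to be small Python integers, the "XOR-like" aggregation is read literally as a bitwise XOR fold, and the names `hamming`, `aggregate`, `changed_patches`, and the bit threshold are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def aggregate(patch_codes):
    """Fold patch codes into one global code with XOR (assumed aggregation)."""
    global_code = 0
    for code in patch_codes:
        global_code ^= code
    return global_code


def changed_patches(old_codes, new_codes, threshold):
    """Indices of patches whose codes differ by more than `threshold` bits."""
    return [
        i
        for i, (a, b) in enumerate(zip(old_codes, new_codes))
        if hamming(a, b) > threshold
    ]


# Stored 8-bit codes for four aligned patches of the earlier image,
# and freshly computed codes for the later image.
old_codes = [0b10110010, 0b01100111, 0b11110000, 0b00001111]
new_codes = [0b10110010, 0b01100110, 0b00001111, 0b00001111]

scene_changed = aggregate(old_codes) != aggregate(new_codes)  # global check
localized = changed_patches(old_codes, new_codes, threshold=2)  # per-patch check
```

Only the stored codes of the earlier image are needed, which matches the abstract's point about avoiding repeated inference on previous images. A plain XOR fold can cancel out when two patches change in the same bits, which may be one reason the paper describes its operation as XOR-like rather than XOR.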
cs.CV / 132 / 2605.12266

CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts

基于CAD特征增强的机器学习在钣金弯曲零件制造努力估算中的应用
Ballegeer, Matteo, Van Camp, Toon, Jaspers, Willem, Bayar, Alp, Soe, Aung Nyein, Roelfs, Martin, Benoit, Dries F., Decraemer, Bieke, Duflou, Joost R.
Abstract
Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles, are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results demonstrate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.
Chinese Translation
基于图的机器学习已成为制造可行性分析的一种有前景的方法,通过直接从以边界表示(B-reps)形式表示的CAD模型中学习,利用表面几何和拓扑连接。然而,纯粹的几何表示往往缺乏准确制造可行性预测所需的过程特定语义:许多制造因素,如表面角色或弯曲意图,仅通过形状并未明确编码,且数据驱动模型难以可靠推断。我们提出了一种混合方法,通过基于规则的模块识别的制造特征来丰富B-rep属性邻接图,从而解决这一挑战。该方法应用于钣金弯曲,将识别出的特征(如弯曲特性、法兰长度和表面角色)作为节点属性进行整合,集中学习信号于与过程相关的几何模式。在一个大规模合成制造可行性基准和一个具有测量弯曲时间的真实工业数据集上的实验(这是对真实生产数据的首次验证之一)表明,将领域知识与基于图的学习相结合可以提高两个任务的预测准确性。结果表明,混合建模为在工业CAD环境中可部署的制造可行性评估和努力估算工具提供了一条可行且有效的路径。
cs.CV / 133 / 2605.12271

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

超越文本提示:视觉到视觉生成作为统一范式
Liu, Yaofang, Cui, Kangning, Chu, Meng, Li, Zhaoqing, Zhang, Suiyun, Morel, Jean-Michel, Cun, Xiaodong, Che, Haoxuan, Liu, Rui, Chan, Raymond H.
Abstract
Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.
Chinese Translation
人类通常通过视觉工件来指定和创造:排版表、草图、参考图像和注释场景。然而,现代视觉生成器仍然要求用户将这种意图序列化为文本,这一瓶颈压缩了空间结构、确切外观和字形等信号。我们提出了视觉到视觉(V2V)生成,其中用户通过视觉规范页面而非文本提示来条件化生成模型。该页面不是编辑目标,而是指定所需输出的视觉文档。我们引入了V2V-Zero,这是一个无训练框架,通过用从视觉页面提取的最终层隐藏状态替换仅文本的条件化,揭示了这一接口在现有视觉-语言模型(VLM)条件生成器中的应用,利用了冻结的VLM已经将文本和图像映射到生成器的条件空间这一事实。在GenEval上,V2V-Zero在冻结的Qwen-Image骨干网络上达到了0.85,接近其优化的文本到图像性能,而无需微调。为了评估更广泛的V2V空间,我们引入了Simple-V2V Bench,涵盖七个视觉条件任务和七个模型,包括GPT Image 2、Nano Banana 2、Seedream 5.0 Lite、开放权重基线和视频扩展。V2V-Zero得分32.7/100,超越了评估的开放权重图像基线,并揭示了明显的能力层次:属性绑定强,内容生成不可靠,结构控制即使对于商业系统也仍然困难。HunyuanVideo-1.5扩展得分20.2/100,表明该接口超越了图像的转移。机制分析显示默认推理路径主要是视觉导向的,95.0%的条件令牌注意力集中在视觉页面的隐藏状态上。
cs.CV / 134 / 2605.12282

Large-Small Model Collaboration for Farmland Semantic Change Detection

大小模型协作用于农田语义变化检测
Li, Xinjia, Wang, Rui, Peng, Qiurong, Ye, Lingfei, Zhang, Dengrong, Zhang, Haoyu
Abstract
Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at https://github.com/Lovelymili/FD-Mamba.
Chinese Translation
农田语义变化检测(SCD)对于耕地保护至关重要,但现有的基准和模型在细粒度农田转变监测方面仍显不足。目前的数据集往往缺乏专门的“从-到”标注,而视觉变化检测模型容易受到由作物轮作、季节变化和光照差异引起的表型伪变化的干扰。为了解决这些挑战,我们构建了HZNU-FCD,这是一个大规模细粒度农田SCD基准,采用统一的五类农田与非农田标注协议。该基准包含4,588对双时相图像,具有像素级标签,便于实际的农田保护。在此基准的基础上,我们提出了一种大-小协作SCD框架,该框架将任务驱动的小型视觉模型与冻结的大型视觉-语言模型相结合。小型模型Fine-grained Difference-aware Mamba(FD-Mamba)学习稠密变化表示,以实现边界保留和小区域定位。大型模型路径Cross-modal Logical Arbitration(CMLA)引入基于CLIP的文本先验,用于提示引导的语义仲裁和伪变化抑制。为了实现有效的协作,我们设计了一种硬区域协同训练策略,仅在低置信度像素上监督CMLA语义得分图。实验表明,我们的方法在HZNU-FCD上达到了97.63%的F1、96.32%的IoU和96.35%的SCD_IoU_mean,且仅有6.65M的可训练参数。与利用视觉-语言信息进行变化检测的多模态ChangeCLIP-ViT相比,我们的方法在HZNU-FCD上提高了10.19个百分点的F1。它在LEVIR-CD上也达到了91.43%的F1和84.21%的IoU,在WHU-CD上达到了93.85%的F1和88.41%的IoU,展示了强大的鲁棒性和泛化能力。代码可在https://github.com/Lovelymili/FD-Mamba获取。
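The hard-region co-training idea, supervising the large-model score map only where confidence is low, can be illustrated with a masked cross-entropy. This is a rough sketch under assumptions the abstract does not confirm: confidence is taken to be the small model's maximum class probability, and the function name `hard_region_loss` and the threshold `tau` are hypothetical.

```python
import math


def hard_region_loss(small_probs, large_probs, labels, tau=0.7):
    """Mean cross-entropy of the large-model pathway over low-confidence pixels.

    small_probs, large_probs: per-pixel lists of class probabilities.
    labels: per-pixel ground-truth class indices.
    tau: assumed confidence threshold below which a pixel counts as hard.
    """
    losses = []
    for p_small, p_large, y in zip(small_probs, large_probs, labels):
        if max(p_small) < tau:  # the small model is unsure at this pixel
            losses.append(-math.log(max(p_large[y], 1e-12)))
    # With no hard pixels there is nothing to supervise in this sketch.
    return sum(losses) / len(losses) if losses else 0.0


# Two pixels, two classes: only the second pixel is below the threshold.
small = [[0.9, 0.1], [0.55, 0.45]]
large = [[0.2, 0.8], [0.5, 0.5]]
loss = hard_region_loss(small, large, labels=[0, 1])
```

Restricting the loss to uncertain pixels keeps the large-model pathway from being trained on regions the small model already handles, which is one plausible reading of the collaboration described above.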
cs.CV / 135 / 2605.12297

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

EgoEV-HandPose:基于立体事件相机的自我中心3D手势姿态估计与识别
Wang, Luming, Shi, Hao, Zhai, Jiajun, Yang, Kailun, Wang, Kaiwei
Abstract
Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose.
Chinese Translation
自我中心的3D手势姿态估计与识别对于沉浸式增强现实/虚拟现实、人机交互和机器人技术至关重要。然而,传统的基于帧的相机存在运动模糊和动态范围有限的问题,而现有的基于事件的方法则受到自我运动干扰、单目深度模糊和缺乏大规模真实世界立体数据集的限制。为了解决这些问题,我们提出了EgoEV-HandPose,这是一个用于从立体事件流中联合进行3D双手姿态估计和手势识别的端到端框架。我们方法的核心是KeypointBEV,一个灵活的立体融合模块,它将特征提升到规范的鸟瞰视图空间,并采用迭代重投影引导的细化循环,逐步解决深度不确定性并强制执行运动学一致性。此外,我们引入了EgoEVHands,这是首个大规模真实世界立体事件相机数据集,用于自我中心手部感知,包含5,419个注释序列,涵盖38个手势类别的稠密3D/2D关键点,且在不同光照条件下采集。大量实验表明,EgoEV-HandPose在MPJPE为30.54mm和86.87%的Top-1手势识别准确率方面达到了最先进的性能,显著优于基于RGB的立体方法和先前的事件相机方法,尤其是在低光照和双手遮挡场景中,从而为基于事件的自我中心感知设定了新的基准。已建立的数据集和源代码将公开发布于https://github.com/ZJUWang01/EgoEV-HandPose。
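The MPJPE figure above is the mean per-joint position error. A minimal sketch of the usual definition follows; the function name and the tuple-based joint format are assumptions, and any alignment step the authors may apply before measuring (for example root-relative alignment) is not modeled here.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between matched 3D joints, in the input units."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and the same length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Two joints in millimetres: one exact, one off by a 3-4-5 triangle.
error_mm = mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])
```

Because the result is an average over joints, a single badly occluded joint can dominate the score, which is worth keeping in mind when reading results for bimanual occlusion scenarios.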
cs.CV / 136 / 2605.12305

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

句子中的图像:统一视觉生成的交错指令扩展
Zhang, Yabo, Li, Kunchang, Zhou, Dewei, Huang, Xinyu, Wang, Xun
Abstract
While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
Chinese Translation
尽管近期多模态语言模型的进展使得从富有表现力的多图像指令生成图像成为可能,但现有方法在处理复杂的交错指令时表现不佳。这一局限性源于当前范式中图像与文本的结构性分离,迫使模型在匹配描述与视觉目标时跨越困难的长距离依赖关系。为了解决这些挑战,我们提出了 Images iN SEnTences(简称 INSET),这是一个统一生成模型,可以将图像无缝嵌入文本指令中的原生词汇。通过将视觉特征直接放置在其对应的语义槽中,INSET 利用变换器的上下文局部性实现精确的对象绑定,有效地将图像视为密集且富有表现力的语言符号。此外,我们引入了一个可扩展的数据引擎,从标准图像和视频数据集中合成 1500 万个高质量的交错样本,利用 VLMs 和 LLMs 构建丰富的长时间序列。在 InterleaveBench 上的评估结果表明,INSET 在多图像一致性和文本对齐方面显著优于最先进的方法,随着输入复杂性的增加,性能差距进一步扩大。除了标准生成外,我们的方法本质上还扩展到多模态图像编辑,将视觉内容整合为指令的一部分,以促进高度表现力和创造性的视觉操作。
cs.CV / 137 / 2605.12309

G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

G$^2$TR:生成引导的视觉标记减少方法用于分离编码器统一多模态模型
Li, Junxian, Liu, Kai, Ding, Zizhong, Wang, Zhixin, Chen, Zhikai, Pei, Renjing, Zhang, Yulun
Abstract
The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity, and similar signals, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with the VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
Chinese Translation
分离编码器统一多模态模型(UMMs)的发展伴随着由于密集视觉标记处理而迅速增长的推理成本。本文重点关注理解侧视觉标记的减少,以提高分离编码器UMMs的效率。尽管这一主题在多模态大语言模型(MLLMs)中得到了广泛研究,但现有方法通常依赖于注意力分数、文本-图像相似性等,隐含地假设最终目标是判别推理。然而,这一假设并不适用于UMMs,因为理解侧的视觉标记还必须保留模型对图像编辑的能力。我们提出了G$^2$TR,一种用于分离编码器UMMs的生成引导视觉标记减少框架。我们的关键见解是,生成分支提供了一种与任务无关的信号,用于识别不仅在语义上相关而且对潜在空间图像重建和生成重要的理解侧视觉标记。G$^2$TR通过与变分自编码器(VAE)潜在空间的一致性来估计标记的重要性,执行平衡的标记选择,并将冗余标记合并为保留的代表,从而减少信息损失。该方法无需训练,具有即插即用的特性,仅在理解编码阶段之后应用,使其与现有的UMM推理管道兼容。在图像理解和编辑基准测试中的实验表明,G$^2$TR显著减少了视觉标记和预填充计算,减少了1.94倍,同时保持了推理准确性和编辑质量,在几乎所有基准测试中超越了基线。
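The select-then-merge step described above can be sketched generically. This is not the G$^2$TR algorithm: importance scores are simply passed in (the paper derives them from consistency with the VAE latent), the balanced selection step is replaced by a plain top-k, and merging by averaging each dropped token into its most cosine-similar kept token is an assumption. The names `cosine` and `reduce_tokens` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and merge the rest into them.

    Returns (kept_indices, merged_tokens), both in original token order.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])
    groups = {i: [tokens[i]] for i in kept}
    for j in order[keep:]:
        # Assign each dropped token to its most similar retained token.
        nearest = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[nearest].append(tokens[j])
    merged = [
        [sum(column) / len(groups[i]) for column in zip(*groups[i])]
        for i in kept
    ]
    return kept, merged


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.2, 0.8, 0.1]  # placeholder importance scores
kept, merged = reduce_tokens(tokens, scores, keep=2)
```

Merging rather than discarding is what lets the retained representatives carry some information from the removed tokens, which is the stated motivation for reducing information loss.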
cs.CV / 138 / 2605.12320

Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

在噪声时间自监督下的对比学习用于结肠镜视频
Parolari, Luca, Gori, Pietro, Ballan, Lamberto, Biffi, Carlo, Folgoc, Loic Le
Abstract
Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at https://github.com/lparolari/ntssl.
Chinese Translation
学习息肉轨迹的鲁棒表示是实现多种AI辅助结肠镜应用的关键,从息肉特征化到自动报告和检索。监督对比学习是一种有效的学习此类表示的方法,但通常依赖于正确的正负样本定义。收集这些标签需要将描绘同一基础息肉实体的轨迹在视频中进行关联,这一过程成本高且需要专业的临床知识。在本研究中,我们利用结肠镜程序的顺序工作流程,从时间结构中推导自监督关联。由于时间推导的关联并不一定是正确的,我们引入了一种噪声感知对比损失来考虑噪声关联的影响。我们在多个下游任务中展示了所学表示的有效性,包括息肉检索与再识别、大小估计和组织学分类。我们的方法在所有任务中超越了先前的自监督和监督基线,并且在使用仅27个视频训练的轻量级编码器时,达到了或超过了近期基础模型的表现。代码可在 https://github.com/lparolari/ntssl 获取。
cs.CV / 139 / 2605.12325

VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

VIP:基于视觉引导的提示演化用于高效的稠密视觉-语言推理
Zhu, Hao, Jin, Shuo, Liao, Wenbin, Xiao, Jiayu, Zhu, Yan, Yu, Siyue, Dai, Feng
Abstract
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: (1) surpasses the top-leading methods by 1.4% to 8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead. Our code is publicly available at https://github.com/MiSsU-HH/VIP.
Chinese Translation
在高效且可泛化的方式下追求无训练的开放词汇语义分割仍然面临挑战,这主要源于CLIP中的深层空间偏差。为了克服现有解决方案的局限性,本研究超越了基于CLIP的范式,利用最近的空间感知dino.txt框架来促进更高效且高质量的稠密预测。尽管dino.txt展现出强大的空间意识,我们发现文本查询的语义模糊性导致其稠密跨模态交互中出现严重的不匹配。为了解决这一问题,我们引入了 VIsual-guided Prompt evolution(VIP),以纠正dino.txt中文本查询的语义表达能力,从而释放其在细粒度物体感知中的潜力。为此,VIP结合了别名扩展和视觉引导的蒸馏机制,以挖掘有价值的语义线索,这些线索以显著性感知的方式稳健地聚合,从而产生高保真预测。广泛的评估表明,VIP:(1)超越了顶尖方法,平均mIoU提高了1.4%至8.4%;(2)在多样化的挑战性领域中表现良好;(3)仅需极少的推理时间和内存开销。我们的代码已在GitHub上公开发布。
cs.CV / 140 / 2605.12374

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

填补差距:一种用于多模态大型语言模型视觉推理的细粒度对齐范式
Miao, Yanting, Sun, Yutao, Wang, Dexin, Zhou, Mengyu, Poupart, Pascal, Lv, Lei, Zhao, Qi, Wang, Li, Li, Hao, Jiang, Xiaoxi, Jiang, Guanjun
Abstract
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume~\citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
Chinese Translation
视觉潜在推理使得多模态大型语言模型(MLLM)能够生成作为连续标记的中间视觉证据,从而避免使用外部工具或图像生成器。然而,现有方法通常遵循输出作为输入的潜在范式,导致不稳定的收益。我们发现存在特征空间不匹配的证据,这可能导致这种不稳定性:主导的视觉潜在模型基于预归一化的MLLM,并重用解码器隐藏状态作为预测的潜在输入,尽管这些状态占据了与模型训练时所需输入嵌入显著不同的归一化范围。这种不匹配可能使得直接的潜在反馈不可靠。基于这一诊断,我们提出了GAP,一种用于视觉潜在建模的 Granular Alignment Paradigm。GAP在三个层面上对齐视觉潜在推理:特征级对齐通过轻量级的PCA对齐潜在头将解码器输出映射到输入兼容的视觉潜在;上下文级对齐通过可检视的辅助视觉监督为潜在目标提供基础;容量引导对齐则选择性地将潜在监督分配给基础MLLM表现不佳的示例。在Qwen2.5-VL 7B上,所得到的模型在我们的监督变体中实现了最佳的平均综合感知和推理性能。推理时干预探测进一步表明,生成的潜在提供了超越简单添加标记槽的任务相关视觉信号。
cs.CV / 141 / 2605.12377

Fast Image Super-Resolution via Consistency Rectified Flow

通过一致性校正流实现快速图像超分辨率
Xu, Jiaqi, Li, Wenbo, Sun, Haoze, Li, Fan, Wang, Zhixin, Peng, Long, Ren, Jingjing, Yang, Haoran, Hu, Xiaowei, Pei, Renjing, Heng, Pheng-Ann
Abstract
Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.
Chinese Translation
扩散模型(DMs)在真实世界的图像超分辨率(SR)中取得了显著成功,但其对耗时的多步采样的依赖在很大程度上阻碍了其实际应用。尽管近期的努力引入了少步或单步解决方案,但现有方法要么低效地对噪声输入的过程进行建模,要么未能充分利用迭代生成先验,从而影响重建图像的保真度和质量。为了解决这一问题,我们提出了FlowSR,一种将SR问题重新表述为从低分辨率(LR)到高分辨率(HR)图像的校正流的新方法。我们的方法利用改进的一致性学习策略,使得在单步中实现高质量的SR。具体而言,我们通过引入HR正则化来优化原始的一致性蒸馏过程,确保学习到的SR流不仅强制自一致性,还能精确收敛到真实的HR目标。此外,我们引入了一种快慢调度策略,其中一致性学习的相邻时间步由两个不同的调度器采样:一个具有较少时间步的快速调度器以提高效率,以及一个具有更多时间步的慢速调度器以捕捉细致的纹理细节。大量实验表明,FlowSR在效率和图像质量方面均表现出色。
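The rectified-flow framing above can be made concrete with the usual straight-line path between two endpoints. The sketch below assumes the LR image has already been upsampled to the HR size and flattened to a list, and it uses the standard rectified-flow target (endpoint difference) rather than anything specific to FlowSR's consistency distillation or fast-slow scheduling. All function names are hypothetical.

```python
def interpolate(x_lr, x_hr, t):
    """Point on the straight path from x_lr (t=0) to x_hr (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_lr, x_hr)]


def target_velocity(x_lr, x_hr):
    """Constant velocity along the straight path: d x_t / d t = x_hr - x_lr."""
    return [b - a for a, b in zip(x_lr, x_hr)]


def one_step_sr(x_lr, velocity_fn):
    """Single Euler step from t=0 to t=1 using a velocity predictor."""
    v = velocity_fn(x_lr, 0.0)
    return [a + dv for a, dv in zip(x_lr, v)]


x_lr = [0.2, 0.4]  # toy upsampled low-resolution pixels
x_hr = [0.6, 0.0]  # toy ground-truth high-resolution pixels

# An oracle predictor that returns the true velocity recovers x_hr in one step.
oracle = lambda x, t: target_velocity(x_lr, x_hr)
restored = one_step_sr(x_lr, oracle)
```

With a perfectly straight path a single Euler step is exact, which is the intuition behind pursuing one-step inference; a learned predictor only approximates this.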
cs.CV / 142 / 2605.12389

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

SEMIR:用于视觉分割的图上语义小结构诱导表示学习
Miller, Luke James, Lee, Yugyung
Abstract
Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
Chinese Translation
在大规模图像中分割小而稀疏的结构受到体素级、晶格约束计算和极端类别不平衡的根本限制——密集的全分辨率推断扩展性差,迫使大多数管道依赖固定区域化或下采样,从而将计算成本与图像分辨率耦合,并在少数结构最具信息性的边界证据处减弱。我们提出了SEMIR(语义小结构诱导表示学习),一个通过学习任务适应的、拓扑保持的潜在图表示与精确解码来将推断与原生网格解耦的表示框架。SEMIR通过参数化边收缩、节点删除和边删除,将底层网格图转化为紧凑的、边界对齐的图小结构,同时保持从小结构预测到晶格标签的精确提升映射。小结构构建被形式化为一个少样本结构学习问题,通过边界对齐目标替代手动调优的预处理:小结构参数通过在边界Dice标准下最大化预测边界元素与目标特定语义边之间的一致性来学习,并且诱导的小结构被标注为具有尺度和旋转鲁棒的几何和强度描述符,并通过带有关系边特征的图神经网络(GNN)支持高效的区域级推断。我们在三个肿瘤分割数据集上对SEMIR进行了基准测试——BraTS 2021、KiTS23和LiTS——这些数据集的目标表现出高结构变异性和分布不确定性。SEMIR在实际运行时间内在少数结构的Dice指标上取得了一致的改善。更广泛地说,SEMIR建立了一个学习任务适应的、拓扑保持的潜在表示的框架,具有高分辨率结构视觉数据的精确解码。
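To illustrate what edge contraction with an exact lifting map means, here is a small sketch on a 2D grid. It is not SEMIR: the contraction rule is a fixed intensity threshold rather than learned parameters, node and edge deletion are omitted, and the names `build_minor` and `lift` are hypothetical.

```python
def build_minor(image, threshold):
    """Contract 4-neighbour grid edges whose intensity gap is <= threshold.

    Returns a lifting map: lifting[r][c] is the minor-node id of pixel (r, c).
    """
    rows, cols = len(image), len(image[0])
    parent = list(range(rows * cols))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    if abs(image[r][c] - image[nr][nc]) <= threshold:
                        parent[find(r * cols + c)] = find(nr * cols + nc)

    ids = {}  # union-find root -> compact minor-node id
    return [
        [ids.setdefault(find(r * cols + c), len(ids)) for c in range(cols)]
        for r in range(rows)
    ]


def lift(node_labels, lifting):
    """Decode per-node predictions back to a per-pixel label grid."""
    return [[node_labels[node] for node in row] for row in lifting]


image = [[0, 0, 9], [0, 0, 9], [0, 9, 9]]
lifting = build_minor(image, threshold=1)
pixel_labels = lift([5, 7], lifting)  # one predicted label per minor node
```

Inference then runs on two nodes instead of nine pixels, and because every pixel maps to exactly one node the decoding back to the grid loses nothing, which is the sense of "exact" used here.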
cs.CV / 143 / 2605.12399

GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

GeoQuery:稀疏视图重建的几何查询扩散
Cao, Xiao, Li, Yuze, Zhang, Youmin, Song, Jiayu, Yan, Cheng, Li, Wen, Duan, Lixin
Abstract
3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.
Chinese Translation
3D高斯点云(3D Gaussian Splatting, 3DGS)已成为3D重建和新视图合成的重要范式。然而,在稀疏视图约束下训练时,它仍然容易受到严重伪影的影响。虽然最近的方法试图利用图像扩散模型纠正渲染视图中的伪影,但它们通常依赖于多视图自注意力从参考图像中检索信息。我们观察到,当3DGS输出的渲染新视图严重损坏时,这种机制往往失效:损坏的查询特征导致错误的跨视图检索,从而导致不一致的渲染细化。为了解决这个问题,我们提出了GeoQuery,一种几何引导的扩散框架,通过一种新颖的几何引导跨视图注意力(Geometry-guided Cross-view Attention, GCA)机制,将生成先验与显式几何线索相结合。首先,通过利用预测的深度图和相机姿态,我们构建了一个几何诱导的对应场,以采样参考特征,形成一个几何对齐的代理查询,替代损坏的渲染特征。此外,我们设计了一种新的跨视图特征聚合管道,其中我们将跨视图注意力限制在每个代理查询周围的局部窗口,以有效检索有用特征,同时抑制虚假匹配。GeoQuery可以无缝集成到现有的基于扩散的管道中,即使在极端视图稀疏的情况下也能实现稳健的重建。在稀疏视图新视图合成和渲染伪影去除的广泛实验中,证明了我们方法的有效性。
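The geometry-induced correspondence described in the abstract (predicted depth plus camera poses) can be illustrated with a standard pinhole reprojection. This is only a minimal sketch under stated assumptions: the pinhole model, the function name `reproject`, and the intrinsics layout `(fx, fy, cx, cy)` are illustrative and not taken from the paper.

```python
def reproject(u, v, depth, k_src, k_ref, rotation, translation):
    """Map pixel (u, v) with known depth in a source view into a reference view.

    k_src, k_ref: (fx, fy, cx, cy) pinhole intrinsics (assumed camera model).
    rotation: 3x3 matrix as a list of rows; translation: length-3 sequence.
    Both describe the source-camera-to-reference-camera transform.
    Returns (u_ref, v_ref), or None if the point is behind the reference camera.
    """
    fx, fy, cx, cy = k_src
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigidly transform the point into the reference camera frame.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        return None
    fxr, fyr, cxr, cyr = k_ref
    # Project onto the reference image plane.
    return (fxr * moved[0] / moved[2] + cxr, fyr * moved[1] / moved[2] + cyr)
```

Sampling reference features at the returned coordinates would give a query that does not depend on the corrupted rendering, which is the role the abstract assigns to the proxy query.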
cs.CV / 144 / 2605.12413

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

超越定位:对全向图像中多模态大型语言模型的视角条件空间推理的全面诊断
Chen, Yuangong, Wong, Wai Keung, Li, Jiaxing, Patras, Ioannis, Zheng, Xu
Abstract
Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on ego-distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
Chinese Translation
多模态大型语言模型(MLLMs)展现出强大的视觉感知能力,但在面对变化视角下的空间推理时仍然存在局限性。我们将这一挑战研究为视角条件空间推理(PCSR),重点关注360度全向图像,其中广泛的场景覆盖减少了部分观察带来的模糊性,但并未消除对视角依赖推理的需求。为了评估这一能力,我们引入了PCSR-Bench,这是一个包含来自2600张全向图像的84,373个问答对的诊断基准,涵盖26个室内环境。PCSR-Bench包含八个任务,涵盖基础感知(例如,物体计数、相对距离和相对方向)和高级PCSR,包括组合链、自我中心旋转、视角重新锚定、自我扭曲和有限视场可见性。我们评估了14个代表性的MLLM,并观察到显著的感知-推理差距:在基础相对方向的准确率达到57.59%,但在自我中心旋转时降至13.49%,在自我扭曲时降至7.13%,而在开放式组合推理时仅为0.64%。为了探究这一差距的可塑性,我们对一个7B规模的模型进行了基于强化学习的诊断研究。在受控环境下,通过奖励塑形将匹配的7B基线从31.10%提高至60.06%,这表明PCSR具有部分可塑性,而非完全不可变。然而,提升是任务选择性的,对奖励设计(包括权重分配和奖励公式)敏感,并在一定程度上依赖于评估协议。这些结果将PCSR定位为当前MLLMs中的一个关键瓶颈,并强调在针对性优化下有限但有意义的恢复空间。
cs.CV / 145 / 2605.12430

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

AOI-SSL:用于光学检测中线焊半导体高效分割的自监督框架
Figueira, Joaquín, Van Gastel, Rob, D'Amicantonio, Giacomo, Liu, Zhuoran, Bucur, Ioan Gabriel, Boughorbel, Faysal, Bondarev, Egor
Abstract
Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors that combines small-domain self-supervised pre-training of vision transformers with in-context inference, minimizing the need for labeled examples. We pre-train SOTA self-supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with the more complex attention-based aggregation currently used in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval-based segmentation outperforms fine-tuning when targeting single-device images, allowing for near-instant adaptation to difficult samples.
Chinese Translation
在自动光学检测中,线焊半导体的分割模型通常是设备特定的,当出现新设备或分布变化时必须重新训练。我们提出了AOI-SSL,这是一个高效的语义分割训练框架,通过将小领域自监督预训练的视觉变换器与上下文推理相结合,最小化对标记样本的需求。我们在一个小型工业检测数据集上预训练了最先进的自监督算法,发现Masked Autoencoders在这种小数据环境中最为有效,能够提高下游分割性能,同时减少标记微调的工作量。我们进一步引入了上下文的补丁级检索方法,能够直接从密集编码器嵌入中预测掩膜,几乎不需要额外的训练。我们展示了在这种设置下,基于简单相似度的检索与当前文献中使用的更复杂的基于注意力的聚合方法表现相当。此外,我们的实验表明,自监督预训练显著提高了分割质量,相较于从头训练和在固定微调计算预算下使用ImageNet预训练的骨干网络。最后,结果表明,基于检索的分割在针对单一设备图像时优于微调,能够快速适应困难样本。
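The "simple similarity-based retrieval" mentioned in the abstract can be pictured as nearest-neighbour label transfer between patch embeddings. The sketch below assumes cosine similarity and a single nearest neighbour; the function names and the toy embeddings are hypothetical, and the paper's actual retrieval rule may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_labels(query_patches, support_patches, support_labels):
    """Give each query patch the label of its most similar labeled support patch."""
    predictions = []
    for query in query_patches:
        sims = [cosine(query, support) for support in support_patches]
        best = max(range(len(sims)), key=sims.__getitem__)
        predictions.append(support_labels[best])
    return predictions
```

Because nothing is trained here, adapting to a new device only requires embedding a few labeled patches, which is consistent with the near-instant adaptation the abstract reports for retrieval.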
cs.CV / 146 / 2605.12431

GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

GaitProtector:基于模仿驱动的步态去身份识别通过无训练扩散潜在优化
Duan, Huiran, Zhou, Qian, Guo, Zhongliang, Dong, Junhao, Li, Yuqi, Zhao, Guoying, Tian, Yingli
Abstract
Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
Chinese Translation
传统的步态去身份识别方法常常面临固有的权衡:要么提供不足的身份抑制,要么引入时空扭曲,从而妨碍结构敏感的下游应用。我们提出了GaitProtector,一个基于模仿驱动的步态去身份识别框架,将隐私保护公式化为一个统一的目标,包含两个紧密耦合的组成部分:(i) 混淆,旨在将受保护的步态与源身份隔离,以及 (ii) 模仿,旨在将其吸引至选定的目标身份。目标身份作为语义锚点,偏向于在预训练的扩散先验下优化结构上合理的步态模式,帮助保持主导的身体形状和运动动态。我们通过一个无训练的扩散潜在优化管道实现这一理念。我们不需要为每个数据集重新训练生成器,而是将每个输入轮廓序列反转为预训练的3D视频扩散模型的潜在轨迹,并通过可微分的对抗目标迭代优化潜在编码,以合成受保护的步态。在CASIA-B数据集上的实验表明,GaitProtector在黑箱步态识别下实现了56.7%的模仿成功率,并将Rank-1识别准确率从89.6%降低至15.0%,同时保持良好的视觉和时间质量。我们进一步在Scoliosis1K数据集上评估下游效用,诊断准确率仅从91.4%降至74.2%。据我们所知,这项工作是首个以无训练方式利用预训练的3D扩散先验进行基于轮廓的步态去身份识别的研究。
cs.CV / 147 / 2605.12437

3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark

用于高效回顾性动态场景新视图合成的3D高斯点云与标准化基准
Zhang, Yunxiao, Kumar, Suryansh
Abstract
Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore explicit temporal coupling or complex multi-body constraints seem unnecessary for retrospective NVS. To this end, we propose an approach tailored to synchronized MV dynamic scenes. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Taken together, our results show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.
Chinese Translation
动态场景的回顾性新视图合成(NVS)是体育等应用的基础。近期的动态3D高斯点云(3DGS)方法引入了时间耦合的公式,以确保跨时间的运动一致性。本文认为,在典型的体育同步多视角(MV)环境中,每个时间步的动态场景已经受到强烈的几何约束。我们认为,校准的同步视点的可用性提供了足够的空间一致性,因此,对于回顾性NVS而言,显式的时间耦合或复杂的多体约束似乎是多余的。为此,我们提出了一种针对同步MV动态场景的特定方法。通过在起始时间初始化基于结构从运动(SfM)获得的点云,并在时间上传播优化的高斯分布,我们展示了在不施加时间变形约束的情况下可以实现高效的回顾性NVS。作为我们方法论贡献的补充,我们引入了一个基于Blender的动态MV数据集框架,用于可重复的NeRF和3DGS研究。该框架生成高质量的同步相机设备,并以标准格式导出适合训练的数据集,消除了坐标约定和数据管道中的不一致性。利用该框架,我们构建了一个动态基准套件,并在受控条件下评估了代表性的NeRF和3DGS方法。我们共同展示了在同步MV设置下,可以使用3DGS实现高效的回顾性动态场景NVS。同时,数据集生成框架使动态NVS方法的可重复和原则性基准测试成为可能。
cs.CV / 148 / 2605.12449

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

LychSim:一个可控的交互式视觉研究仿真框架
Ma, Wufei, Wang, Chloe, Chen, Siyi, Peng, Jiawei, Li, Patrick, Yuille, Alan
Abstract
While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.
Chinese Translation
尽管自监督预训练减少了视觉系统对合成数据的依赖,但仿真仍然是闭环优化和严格的分布外(OOD)评估不可或缺的工具。然而,现代仿真平台通常存在较高的技术门槛,需要在计算机图形学和游戏开发方面的广泛专业知识。在本研究中,我们提出了LychSim,一个基于虚幻引擎5(Unreal Engine 5)构建的高度可控和交互式仿真框架,以填补这一空白。LychSim围绕三个关键设计构建:(1)一个简化的Python API,抽象了底层引擎的复杂性;(2)一个程序化数据管道,能够生成多样化、高保真的环境,具有不同的分布外(OOD)视觉挑战,并配备丰富的2D和3D真实数据;(3)对模型上下文协议(Model Context Protocol, MCP)的原生集成,将仿真器转变为一个动态的闭环游乐场,以便推理代理的语言模型(LLMs)。我们进一步注释场景级程序规则和对象级姿态对齐,以实现语义对齐的3D真实数据和自动化场景修改。我们展示了LychSim在多个下游应用中的能力,包括作为合成数据引擎、支持基于强化学习的对抗性检查器,以及促进交互式、语言驱动的场景布局生成。为了造福更广泛的视觉社区,LychSim将公开发布,包括完整的源代码和各种数据注释。
cs.CV / 149 / 2605.12451

FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation

FuTCR:面向未来的对比与排斥用于持续全景分割
Ikechukwu, Nicholas, Nichols, Keanu, Ghadiyaram, Deepti, Plummer, Bryan A.
Abstract
Continual Panoptic Segmentation (CPS) requires methods that can quickly adapt to new categories over time. The nature of this dense prediction task means that training images may contain a mix of labeled and unlabeled objects. As nothing is known about these unlabeled objects a priori, existing methods often simply group any unlabeled pixel into a single "background" class during training. In effect, during training, they repeatedly tell the model that all the different background categories are the same (even when they aren't). This makes learning to identify different background categories as they are added challenging since these new categories may require using information the model was previously told was unimportant and ignored. Thus, we propose a Future-Targeted Contrastive and Repulsive (FuTCR) framework that addresses this limitation by restructuring representations before new classes are introduced. FuTCR first discovers confident future-like regions by grouping model-predicted masks whose pixels are consistently classified as background but exhibit non-background logits. Next, FuTCR applies pixel-to-region contrast to build coherent prototypes from these unlabeled regions, while simultaneously repelling background features away from known-class prototypes to explicitly reserve representational space for future categories. Experiments across six CPS settings and a range of dataset sizes show FuTCR improves relative new-class panoptic quality over the state-of-the-art by up to 28%, while preserving or improving base-class performance with gains up to 4%.
Chinese Translation
持续全景分割(CPS)需要能够快速适应新类别的方法。由于这一密集预测任务的性质,训练图像可能包含标记和未标记对象的混合。由于对这些未标记对象在先验上没有任何了解,现有方法通常在训练期间将任何未标记的像素简单地归入单一的“背景”类别。实际上,在训练过程中,它们反复告诉模型所有不同的背景类别是相同的(即使它们并不相同)。这使得在添加新类别时学习识别不同的背景类别变得具有挑战性,因为这些新类别可能需要使用模型之前被告知不重要并被忽略的信息。因此,我们提出了一种面向未来的对比与排斥(FuTCR)框架,通过在引入新类别之前重构表示来解决这一限制。FuTCR首先通过对模型预测的掩码进行分组,发现自信的未来样式区域,这些区域的像素被一致地分类为背景,但表现出非背景的logits。接下来,FuTCR应用像素到区域的对比,从这些未标记区域构建一致的原型,同时将背景特征从已知类别原型中排斥出去,以明确保留未来类别的表示空间。在六个CPS设置和多种数据集规模上的实验表明,FuTCR在相对新类别的全景质量上比现有最先进方法提高了多达28%,同时保持或改善了基础类别的性能,增益高达4%。
cs.CV / 150 / 2605.12480

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

OmniNFT:面向模态的全方位扩散强化学习用于联合音视频生成
Zhang, Guohui, Ma, XiaoXiao, Huang, Jie, Xu, Hang, Yu, Hu, Fu, Siming, Li, Yuming, Xue, Zeyue, Song, Lin, Huang, Haoyang, Duan, Nan, Zhao, Feng
Abstract
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
Chinese Translation
近年来,联合音视频生成的进展显著,但现实应用要求在每种模态上具有强大的保真度、跨模态对齐和精细的同步。强化学习(Reinforcement Learning, RL)提供了一种有前景的范式,但其在多目标和多模态联合音视频生成中的扩展仍未得到探索。我们深入分析后首次揭示,应用RL的主要障碍在于:(i)多目标优势不一致性,即多模态输出的优势在同一组内并不总是一致;(ii)多模态梯度不平衡,即视频分支的梯度泄漏到负责模态内生成的浅层音频层;(iii)统一信用分配,即精细的跨模态对齐区域未能得到有效探索。这些缺陷表明,使用单一全局优势的传统RL微调策略往往导致次优结果。为了解决这些挑战,我们提出了OmniNFT,这是一种新颖的面向模态的在线扩散强化学习框架,具有三个关键创新:(1)模态独立优势路由,将独立的每个奖励优势路由到各自的模态生成分支;(2)层级梯度手术,选择性地在浅层音频层上断开视频分支的梯度,同时保留跨模态交互层的梯度;(3)区域损失重加权,调节策略优化以关注与音视频同步和精细对齐相关的关键区域。在JavisBench和VBench上进行的广泛实验表明,基于LTX-2主干的OmniNFT在音频和视频感知质量、跨模态对齐以及音视频同步方面实现了全面的改进。
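To make the contrast between a single global advantage and modality-wise advantage routing concrete, here is a small sketch. It assumes group-normalized advantages, (reward minus group mean) divided by group standard deviation, which the abstract does not specify; the reward names, branch names, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sample group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def route_advantages(reward_groups, routing):
    """Compute one advantage list per branch instead of one global list.

    reward_groups: {reward_name: [reward per sample in the group]}
    routing: {reward_name: branch_name}
    Rewards routed to the same branch are averaged after normalization.
    """
    per_branch = {}
    for name, rewards in reward_groups.items():
        per_branch.setdefault(routing[name], []).append(group_advantages(rewards))
    return {
        branch: [mean(column) for column in zip(*advantage_lists)]
        for branch, advantage_lists in per_branch.items()
    }
```

With audio rewards `[1, 2, 3]` and video rewards `[3, 2, 1]` for the same three samples, summing rewards first yields `[4, 4, 4]` and a zero global advantage for every sample, while routing keeps a distinct signal for each branch. This is one way to read the "advantage inconsistency" problem named in the abstract.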
cs.CV / 151 / 2605.12491

Elastic Attention Cores for Scalable Vision Transformers

可扩展视觉变换器的弹性注意力核心
Song, Alan Z., Chen, Yinjie, Nan, Mu, Zhang, Rui, Cao, Jiahang, Mai, Weijian, Yu, Muquan, Adeli, Hossein, Ramanan, Deva, Tarr, Michael J., Luo, Andrew F.
Abstract
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches interact directly only with a resolution-invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for a predetermined $C$, bypassing quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
Chinese Translation
视觉变换器(ViTs)通过利用全对全自注意力实现了强大的数据驱动扩展。然而,这种灵活性带来了与图像分辨率呈二次方关系的计算成本,从而限制了ViTs在高分辨率领域的应用。该方法的基础假设是,成对的标记交互对于学习丰富的视觉-语义表示是必要的。在本研究中,我们挑战了这一假设,证明有效的视觉表示可以在没有任何直接的补丁到补丁交互的情况下学习。我们提出了VECA(视觉弹性核心注意力),一种视觉变换器架构,采用高效的线性时间核心-外围结构注意力,由一小组学习到的核心支持。在VECA中,这些核心充当通信接口:补丁标记通过核心标记独占地交换信息,核心标记从零开始初始化并在各层之间传播。由于$N$个图像补丁仅与一组不随分辨率变化的$C$个学习到的“核心”嵌入直接交互,这使得在预定的$C$下,复杂度为线性$O(N)$,从而避免了二次扩展。与之前的交叉注意力架构相比,VECA保持并迭代更新完整的$N$输入标记,避免了小$C$的瓶颈。结合沿核心轴的嵌套训练,我们的模型在推理过程中可以弹性地权衡计算和准确性。在分类和密集任务中,VECA的性能与最新的视觉基础模型相媲美,同时降低了计算成本。我们的结果确立了弹性核心-外围注意力作为视觉变换器的可扩展替代构建块。
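The scaling argument in the abstract can be checked with a simple count of attention score entries per layer. The sketch assumes one patch-to-core map and one core-to-patch map per layer, which is an illustrative reading rather than the paper's exact layer design; both function names are hypothetical.

```python
def full_attention_pairs(n_patches):
    """Score entries for all-to-all self-attention over N patches: N * N."""
    return n_patches * n_patches


def core_attention_pairs(n_patches, n_cores):
    """Score entries when patches talk only through C cores.

    Assumes one N x C map (cores read from patches) and one C x N map
    (patches read from cores), giving 2 * N * C entries.
    """
    return 2 * n_patches * n_cores
```

For a fixed number of cores, doubling the number of patches doubles the core-periphery count but quadruples the all-to-all count, which is the $O(N)$ versus $O(N^2)$ behaviour the abstract describes.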
cs.CV / 152 / 2605.12494

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

重新审视光度模糊性以实现准确的高斯点云表面重建
Li, Jiahe, Zhang, Jiawei, Bai, Xiao, Zheng, Jin, Yu, Xiaohan, Gu, Lin, Lee, Gim Hee
Abstract
Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet pervasive photometric ambiguities remain a hard bottleneck for existing approaches. This paper presents AmbiSuR, a framework that explores a solution intrinsic to Gaussian Splatting for high-performance 3D surface reconstruction that is robust to photometric ambiguity. Revisiting the foundations, our investigation uncovers two built-in primitive-wise ambiguities in the representation and reveals an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Building on these findings, we first introduce a photometric disambiguation that constrains ill-posed geometry solutions so that a definite surface forms. We then propose an ambiguity indication module that unleashes the self-indication potential to identify underconstrained reconstructions and guide their correction. Extensive experiments demonstrate superior surface reconstructions compared to existing methods across various challenging scenarios, along with broad compatibility. Project: https://fictionarry.github.io/AmbiSuR-Proj/ .
Chinese Translation
近年来,基于可微渲染的表面重建取得了令人瞩目的成果,但普遍存在的光度模糊性严重制约了现有方法的发展。本文提出了AmbiSuR,一个探索基于高斯点云的光度模糊性鲁棒的高性能三维表面重建的内在解决方案的框架。我们从基础出发,重新审视,揭示了表示中的两种内置原始模糊性,同时揭示了高斯点云中模糊性自指示的内在潜力。基于这些,我们首次引入了光度消歧义,约束不适定几何解以形成明确的表面。然后,我们提出了一个模糊性指示模块,释放自指示潜力,以识别并进一步引导修正欠约束的重建。大量实验表明,与现有方法相比,我们在各种具有挑战性的场景中展现了优越的表面重建能力,并在广泛兼容性方面表现出色。项目网址: https://fictionarry.github.io/AmbiSuR-Proj/ .
cs.CV / 153 / 2605.12495

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

AlphaGRPO:通过可分解可验证奖励解锁UMMs中的自我反思多模态生成
Huang, Runhui, Wu, Jie, Yang, Rui, Liu, Zhe, Zhao, Hengshuang
Abstract
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
Chinese Translation
在本文中,我们提出了AlphaGRPO,这是一种新颖的框架,应用群体相对策略优化(Group Relative Policy Optimization, GRPO)于AR-Diffusion统一多模态模型(Unified Multimodal Models, UMMs),以增强多模态生成能力,而无需额外的冷启动阶段。我们的方法释放了模型执行高级推理任务的内在潜力:推理文本到图像生成,其中模型主动推断隐含的用户意图,以及自我反思精炼,其中模型自主诊断并纠正生成输出中的不一致性。为了解决为现实世界多模态生成提供稳定监督的挑战,我们引入了可分解可验证奖励(Decompositional Verifiable Reward, DVReward)。与整体标量奖励不同,DVReward利用大型语言模型(LLM)将复杂的用户请求分解为原子、可验证的语义和质量问题,然后由通用多模态大语言模型(MLLM)进行评估,以提供可靠且可解释的反馈。大量实验表明,AlphaGRPO在多模态生成基准测试中(包括GenEval、TIIF-Bench、DPG-Bench和WISE)取得了稳健的改进,同时在GEdit的编辑任务中也取得了显著的提升,而无需在编辑任务上进行训练。这些结果验证了我们的自我反思强化学习方法有效利用内在理解来指导高保真生成。项目页面:https://huangrh99.github.io/AlphaGRPO/
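A decompositional reward of the kind described above can be sketched as the fraction of atomic checks that pass. The abstract does not state how DVReward aggregates its per-question verdicts, so the unweighted mean below is an assumption, and `dv_reward` and the example questions are hypothetical.

```python
def dv_reward(verdicts):
    """Aggregate per-question verdicts into one scalar reward in [0, 1].

    verdicts: {atomic_question: bool}, where each bool stands for the
    judge model's yes/no answer. An unweighted mean is assumed here.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


example = {
    "Is there a red bicycle?": True,
    "Is the bicycle leaning against a wall?": True,
    "Is the scene at night?": False,
    "Is the image free of visible artifacts?": True,
}
```

Unlike a single holistic score, the per-question verdicts show which requirement failed, which matches the interpretability the abstract claims for this style of feedback.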
cs.CV / 154 / 2605.12496

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

CausalCine:用于多镜头视频叙事的实时自回归生成
Meng, Yihao, Liu, Zichen, Ouyang, Hao, Wang, Qiuyu, Cheng, Ka Leong, Yu, Yue, Wang, Hanlin, Li, Haobo, Zhu, Jiapeng, Zeng, Yanhong, Zhu, Xing, Shen, Yujun, Chen, Qifeng, Qu, Huamin
Abstract
Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
Chinese Translation
自回归视频生成旨在实现实时、开放式的合成。然而,电影叙事不仅仅是单一场景的无尽延续;它需要通过不断演变的事件、视角转换和离散镜头边界进行推进。现有的自回归模型在这种环境中往往表现不佳。它们主要针对短期延续进行训练,将长序列视为延长的单一镜头,必然在长时间生成过程中遭遇运动停滞和语义漂移。为了解决这一问题,我们提出了CausalCine,一个互动自回归框架,将多镜头视频生成转变为在线导演过程。CausalCine在镜头变化之间进行因果生成,实时接受动态提示,并在不重新生成先前镜头的情况下重用上下文。为此,我们首先在原生多镜头序列上训练一个因果基础模型,以学习复杂的镜头过渡,然后提出内容感知记忆路由(Content-Aware Memory Routing, CAMR),根据基于注意力的相关性评分动态检索历史键值(KV)条目,而不是时间接近性,从而在有限的活动记忆下保持跨镜头的一致性。最后,我们将因果基础模型提炼为一个少步生成器,以实现实时互动生成。大量实验表明,CausalCine显著优于自回归基线,并接近双向模型的能力,同时解锁因果生成的流媒体互动性。演示可在 https://yihao-meng.github.io/CausalCine/ 获取。
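The difference between relevance-based memory routing and a plain recency window can be shown with index selection over cached entries. This sketch assumes the relevance scores are already given (the abstract derives them from attention) and that a fixed budget of entries stays active; `route_memory` and `recent_window` are hypothetical names.

```python
def route_memory(relevance, budget):
    """Pick the `budget` most relevant cached entries, returned in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:budget])


def recent_window(n_entries, budget):
    """Baseline: keep only the most recent `budget` entries."""
    return list(range(max(0, n_entries - budget), n_entries))
```

With relevance scores `[0.9, 0.1, 0.2, 0.8, 0.3]` and a budget of 2, routing keeps entries 0 and 3, whereas the recency window keeps entries 3 and 4 and drops the early but highly relevant entry. Retaining such early context is what would let a later shot stay coherent with an earlier one under bounded memory.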
cs.CV / 155 / 2605.12497

From Web to Pixels: Bringing Agentic Search into Visual Perception

从网络到像素:将主动搜索引入视觉感知
Yang, Bokang, Sun, Xinyi, Feng, Kaituo, Dong, Xingping, Wu, Dongming, Yue, Xiangyu
Abstract
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
Chinese Translation
视觉感知将高层次的语义理解与像素级的感知连接起来,但大多数现有设置假设识别目标的决定性证据已经存在于图像中或冻结的模型知识中。我们研究了一种更实际但更具挑战性的开放世界案例,在这种情况下,必须首先从外部事实、最近事件、长尾实体或多跳关系中解析出可见对象,然后才能进行定位。我们将这一挑战形式化为感知深度研究,并引入WebEye,这是一个具有可验证证据、知识密集型查询、精确的框/掩码标注和三种任务视角(基于搜索的定位、基于搜索的分割和基于搜索的视觉问答)的对象锚定基准。WebEye包含120张图像、473个标注的对象实例、645对独特的问答对和1,927个任务样本。我们进一步提出了Pixel-Searcher,一种主动的搜索到像素的工作流程,它解析隐藏的目标身份并将其绑定到框、掩码或定位答案上。实验表明,Pixel-Searcher在所有三种任务视角中实现了最强的开源性能,而失败主要源于证据获取、身份解析和视觉实例绑定。
cs.CV / 156 / 2605.12498

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

EgoForce:基于前臂引导的单目自我中心相机3D手势重建
Millerdurai, Christen, Wang, Shaoxiang, Xie, Yaxu, Golyanik, Vladislav, Stricker, Didier, Pagani, Alain
Abstract
Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
Chinese Translation
从用户视角使用单个头戴式相机重建手部的绝对3D姿态和形状,对于增强现实/虚拟现实(AR/VR)、远程呈现和以手为中心的操作任务中的实际自我中心交互至关重要,这些场景要求传感器保持紧凑且不显眼。尽管单目RGB方法已有所进展,但仍受限于深度尺度模糊,并且难以在各种头戴设备的光学配置中进行泛化。因此,模型通常需要在特定设备的数据集上进行广泛训练,而这些数据集的获取成本高且耗时。本文通过引入EgoForce,解决了这些挑战,EgoForce是一个单目3D手部重建框架,能够从用户(相机空间)视角恢复稳健的绝对3D手势及其位置。EgoForce通过一个统一的网络在鱼眼、透视和失真宽视场相机模型上进行操作。我们的方法结合了一个可微分的前臂表示,以稳定手部姿态,一个统一的臂-手变换器,从单一自我中心视角预测手部和前臂几何形状,减轻深度尺度模糊,以及一个光线空间的闭式解算器,使得在不同头戴式相机模型上实现绝对3D姿态恢复。对三个自我中心基准的实验表明,EgoForce在3D精度上达到了最先进的水平,与之前的方法相比,在HOT3D数据集上将相机空间的MPJPE减少了多达28%,并在不同相机配置下保持了一致的性能。更多细节请访问项目页面:https://dfki-av.github.io/EgoForce。
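The camera-space MPJPE reported above is the mean per-joint position error: the average Euclidean distance between predicted and ground-truth 3D joints, with no root alignment. A minimal sketch follows; the joint layout and units are up to the dataset, and the function name is illustrative.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints.

    predicted, ground_truth: equal-length sequences of (x, y, z) tuples,
    both expressed in the camera frame for the camera-space variant.
    """
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(ground_truth)
```

Because no alignment is applied, an error in absolute depth shows up directly in this metric, which is why it suits a method aimed at depth-scale ambiguity.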
cs.CV / 157 / 2605.12500

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

SenseNova-U1:通过NEO-unify架构统一多模态理解与生成
Diao, Haiwen, Wu, Penghao, Deng, Hanming, Wang, Jiahao, Bai, Shihao, Wu, Silei, Fan, Weichen, Ye, Wenjie, Tong, Wenwen, Fan, Xiangyu, Li, Yan, Wang, Yubo, Cao, Zhijie, Lin, Zhiqian, Yang, Zhitao, Cai, Zhongang, Niu, Yuwei, Zhu, Yue, Liu, Bo, Lv, Chengguang, Yu, Haojia, Xie, Haozhe, Wang, Hongli, Fan, Jianan, Li, Jiaqi, Lu, Jiefan, Ni, Jingcheng, Xu, Junxiang, Liang, Kaihuan, Shi, Lianqiang, Dai, Linjun, Wang, Linyan, Qian, Oscar, Gao, Peng, Liu, Pengfei, Sun, Qingping, Shen, Rui, Wang, Ruisi, Ma, Shengnan, Yang, Shuang, Xie, Siyi, Li, Siying, Zhong, Tianbo, Kong, Xiangli, Shi, Xuanke, Gao, Yang, Yao, Yongqiang, Wang, Yves, Bai, Zhengqi, Lin, Zhengyu, Yin, Zixin, Sun, Wenxiu, Gong, Ruihao, Wang, Quan, Lu, Lewei, Yang, Lei, Liu, Ziwei, Lin, Dahua
Abstract
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
Chinese Translation
近期的大型视觉语言模型(VLMs)在根本上受到持续的二分法限制:理解和生成被视为不同的问题,导致架构碎片化、管道级联以及表示空间的不对齐。我们认为,这种分裂不仅仅是工程上的产物,而是一种结构性限制,阻碍了原生多模态智能的出现。因此,我们引入了SenseNova-U1,一个基于NEO-unify构建的原生统一多模态范式,在该范式中,理解和生成作为单一基础过程的协同视角共同演进。我们推出了两个原生统一变体,分别是基于密集(8B)和专家混合(30B-A3B)理解基线的SenseNova-U1-8B-MoT和SenseNova-U1-A3B-MoT。它们从基本原理出发设计,在文本理解、视觉语言感知、知识推理、代理决策和空间智能等方面与顶尖的仅理解VLMs相媲美。同时,它们提供强大的语义一致性和视觉保真度,在常规或知识密集型的任意到图像(X2I)合成、复杂的文本丰富信息图生成以及交错的视觉语言生成中表现出色,无论是否采用思维模式。除了性能之外,我们展示了详细的模型设计、数据预处理、训练前/后策略和推理策略,以支持社区研究。最后但同样重要的是,初步证据表明我们的模型不仅限于感知和生成,在视觉语言行动(VLA)和世界模型(WM)场景中表现出色。这指向了一个更广泛的路线图,即模型不仅在模态之间进行转换,而是在原生的方式下进行思考和行动。多模态人工智能不再是连接独立系统,而是构建一个统一的系统,并信任必要的能力从内部涌现。
cs.CV / 158 / 2605.12501

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

覆盖计算机使用的人类行为空间:数据合成与基准测试
Zhang, Miaosen, Zhao, Xiaohan, Tan, Zhihong, Huoshen, Zhou, Fan, Yijia, Yang, Yifan, Qiu, Kai, Liu, Bei, Wagle, Justin, Yin, Chenzhong, Cheng, Mingxi, Li, Ji, Dai, Qi, Luo, Chong, Yang, Xu, Geng, Xin, Guo, Baining
Abstract
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
Chinese Translation
计算机使用代理(CUAs)自动化屏幕工作,正如GPT-5.4和Claude所示。然而,它们在复杂的低频交互中的可靠性仍然较差,限制了用户的信任。我们对先进模型的失败案例分析表明,在图形用户界面(GUI)操作中存在长尾模式,其中相对较小比例的复杂和多样化交互占据了任务失败的过大份额。我们假设这个问题主要源于复杂交互数据的稀缺。为了解决这一问题,我们提出了一个新的基准CUActSpot,用于评估模型在五种模态下对复杂交互的能力:GUI、文本、表格、画布和自然图像,以及多种动作(点击、拖动、绘制等),涵盖了比以往主要关注GUI小部件的点击中心基准更广泛的交互类型。我们还设计了一个基于渲染器的数据合成管道:为每种模态自动生成场景,记录屏幕截图和元素坐标,并由大型语言模型(LLM)生成匹配的指令和动作轨迹。在这个语料库上训练后,我们的Phi-Ground-Any-4B在参数少于32B的开源模型中表现优于其他模型。我们将会在https://github.com/microsoft/Phi-Ground.git发布我们的基准、数据、代码和模型。
人工智能 (Artificial Intelligence)
111
cs.AI / 1 / 2605.11118

A Cascaded Generative Approach for e-Commerce Recommendations

一种级联生成方法用于电子商务推荐
Hasani, Moein, Shahidi, Hamidreza, Levinson, Trace, Zhong, Yuan, Shu, Guanghua, Gudla, Vinesh, Tenneti, Tejaswi
Abstract
Personalized storefronts in large e-commerce marketplaces are often assembled from many independent components: static themes per page section ("placement"), retrieval systems to fetch eligible products per placement, and pointwise rankers to order content. While effective in optimizing for aggregate preferences, this paradigm is rigid and can limit personalization and semantic cohesion across the page. This makes it poorly suited to support dynamic objectives and merchandising requirements over time. To address this, we introduce a cascaded merchandising framework that decomposes storefront construction into two generative tasks: (i) placement-level theme generation and (ii) constrained keyword generation per placement to power product retrieval. Teacher-student fine-tuning is leveraged to improve scalability of this framework under production latency and cost constraints. Fine-tuned model ablations are shown to approach closed-weight LLM performance. We further contribute frameworks for AI-driven content evaluation and quality filtering, enabling safe and automated deployment of dynamic content at scale. Generative output is fused with traditional ranking models to preserve hybrid infrastructure. In online experiments, this framework yields an estimated +2.7% lift in cart adds per page view over a strong baseline.
Chinese Translation
大型电子商务市场中的个性化店面通常由许多独立组件组成:每个页面部分的静态主题(“位置”)、用于获取每个位置合适产品的检索系统,以及用于排序内容的逐点排名器。虽然这种范式在优化整体偏好方面有效,但其刚性限制了个性化和页面内的语义一致性,使其不适合支持动态目标和随时间变化的商品需求。为了解决这一问题,我们提出了一种级联商品框架,将店面构建分解为两个生成任务:(i)位置级主题生成和(ii)每个位置的约束关键词生成,以推动产品检索。利用教师-学生微调方法,提高了该框架在生产延迟和成本约束下的可扩展性。微调模型的消融实验表明其性能接近封闭权重的LLM。我们进一步贡献了AI驱动的内容评估和质量过滤框架,使动态内容的安全和自动化大规模部署成为可能。生成输出与传统排名模型相融合,以保持混合基础设施。在在线实验中,该框架在每次页面浏览的购物车添加量上相较于强基线估计提升了+2.7%。
cs.AI / 2 / 2605.11136

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

EVOCHAMBER:多智能体系统在个体、团队和群体层面的测试时共同进化
Zhang, Yaolun, Xu, Tianyi, Dai, Shengyu, Shao, Zhenwen, Wu, Qingyun, Wang, Huazheng
Abstract
We argue that multi-agent test-time evolution is not single-agent evolution replicated N times. A single-agent learner can only evolve its own context and memory. A multi-agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single-agent counterpart and can produce phenomena such as emergent specialization. Yet prior test-time methods either confine experiences to individual agents, forfeiting cross-agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training-free framework that instantiates test-time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post-task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team-level operators assemble niche-conditioned teams and select collaboration structures online. Population-level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3-8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi-domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross-agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi-agent evolution that no single-agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber
Chinese Translation
我们认为,多智能体的测试时进化并不是单智能体进化的 N 次复制。单智能体学习者只能进化其自身的上下文和记忆。而多智能体系统还会进化谁进行合作、如何合作以及知识如何在群体中流动。这些组件没有单智能体的对应物,并且可以产生如涌现专业化等现象。然而,之前的测试时方法要么将经验局限于个体智能体,放弃跨智能体学习,要么对所有智能体进行对称广播,抹去使合作有价值的专业化。我们提出了 EVOCHAMBER,这是一个无训练的框架,在共同进化的智能体池上实现了三个层次的测试时进化。其核心是 CODREAM(Collaborative Dreaming),这是一个在团队失败或意见不合时触发的后任务协议,智能体在其中共同反思、提炼见解,并将其不对称地从强智能体传递给弱智能体,以填补失败领域的知识空白,同时保持专业化。团队层面的操作符组装特定领域的团队并在线选择合作结构。群体层面的生命周期操作符在性能压力下分叉、合并、修剪和播种智能体。在与 Qwen3-8B 的三个异构任务流中,EVOCHAMBER 在竞争数学上达到了 63.9%,在代码上达到了 75.7%,在多领域推理上达到了 87.1%,在数学上相对超越最佳基线 32%,并确认不对称跨智能体转移是消融中的主要驱动因素。从几个初始化相同的智能体开始,自发出现四到五个稳定的特定领域专家,这是多智能体进化的结构特征,单智能体学习者无法表现。请查看我们的代码:https://github.com/Mercury7353/EvoChamber
cs.AI / 3 / 2605.11151

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ:通过自监督动作排序实现离线到在线的强化学习
Choi, Andrew, Xu, Wei
Abstract
Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 84.7% relative to the VLA's initial performance.
Chinese Translation
离线到在线的强化学习(RL)通过利用预先收集的数据集来提高样本效率。然而,一个关键挑战是在有限数据集覆盖的情况下,在大状态-动作空间中学习准确的评估器。为了减轻由于价值高估带来的有害更新,先前的方法通过相对于数据集动作降低分布外(OOD)动作的权重来施加悲观性。虽然有效,但这本质上充当了行为克隆的锚点,并可能在数据集动作不理想时阻碍下游在线策略的改进。我们提出了RankQ,一种离线到在线的Q学习目标,通过自监督的多项排名损失增强时间差分学习,以强制执行结构化的动作排序。通过学习相对动作偏好而不是均匀惩罚未见动作,RankQ塑造Q函数,使得动作梯度指向更高质量的行为。在稀疏奖励的D4RL基准测试中,RankQ的表现与七种先前方法相当或更优。在基于视觉的机器人学习中,RankQ能够在低数据环境中有效地对预训练的视觉-语言-动作(VLA)模型进行离线到在线的微调,平均实现比下一个最佳方法高出42.7%的仿真成功率。在高数据设置中,RankQ的仿真性能比下一个最佳方法提高了13.7%,并实现了强大的仿真到现实转移,将现实世界的立方体堆叠成功率从43.1%提高到84.7%,相对于VLA的初始表现。
cs.AI / 4 / 2605.11169

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

OLIVIA:通过推理时动作适应的在线学习以支持LLM ReAct代理的决策制定
Yu, Sheldon, Wu, Junda, Li, Xintong, Kuang, Nikki Lijing, Zhou, Sizhe, Yu, Tong, Han, Jiawei, Shang, Jingbo, McAuley, Julian
Abstract
Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.
Chinese Translation
大型语言模型代理通过推理、动作选择和观察交替进行,以解决顺序决策任务。在实际应用中,代理反复处理相关的多步骤任务时,小的动作选择错误可能累积成浪费的工具调用、延迟和可靠性降低。尽管存在对部署时改进的需求,现有的LLM代理推理时适应方法主要依赖于提示或检索,这些方法通过上下文操控间接影响行为。对于ReAct风格的代理,这些方法并未暴露出一个明确的决策层来对候选动作进行评分、表示不确定性或从动作级反馈中在线更新。因此,它们在部署期间对可追踪、细粒度和不确定性感知的适应支持有限。我们提出了OLIVIA,一个针对ReAct风格代理的推理时动作适应框架。OLIVIA将LLM的最终动作选择层建模为候选动作上的上下文线性赌博机,以冻结的隐藏状态作为决策上下文。这个选择特别适合于部署,因为它直接在动作选择接口上适应行为,保留了基础推理过程,并提供明确的不确定性估计和来自动作级反馈的轻量级在线更新。通过上置信界探索,OLIVIA以最小的计算开销提高了策略的样本效率。我们在四个基准上实例化了OLIVIA,并显示其在任务性能上始终优于静态ReAct和基于提示的推理时基线。我们的结果表明,明确的在线决策层为LLM代理在部署期间提供了一种有效的替代方案,优于纯粹基于提示或检索的适应方法。
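The OLIVIA abstract describes a contextual linear bandit over candidate actions with upper-confidence-bound exploration. The sketch below is a generic shared-parameter LinUCB, not the authors' implementation: the class name `LinUCB`, the parameters `alpha` and `ridge`, and the two-dimensional toy contexts are all assumptions made for illustration. In the paper's setting the context vectors would presumably be derived from frozen LLM hidden states.

```python
import numpy as np


class LinUCB:
    """Shared-parameter linear UCB over candidate actions (illustrative only)."""

    def __init__(self, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha                 # exploration strength (assumed name)
        self.A = ridge * np.eye(dim)       # regularized Gram matrix of seen contexts
        self.b = np.zeros(dim)             # reward-weighted sum of seen contexts

    def scores(self, contexts):
        """Return mean estimate plus uncertainty bonus for each candidate row."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        mean = contexts @ theta
        # Per-row quadratic form x^T A^{-1} x gives the squared confidence width.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        return mean + self.alpha * bonus

    def select(self, contexts):
        return int(np.argmax(self.scores(contexts)))

    def update(self, context, reward):
        """Lightweight online update from action-level feedback."""
        self.A += np.outer(context, context)
        self.b += reward * context


# Toy usage: two candidate actions with fixed contexts; only action 1 is rewarded.
contexts = np.array([[1.0, 0.0], [0.0, 1.0]])
rewards = [0.0, 1.0]
bandit = LinUCB(dim=2)
for _ in range(20):
    action = bandit.select(contexts)
    bandit.update(contexts[action], rewards[action])
```

The bonus term shrinks for contexts that have been tried often, so the bandit explores the unrewarded action once and then settles on the rewarded one. Only the small matrices `A` and `b` change online, which is consistent with the abstract's claim of minimal overhead.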
cs.AI / 5 / 2605.11182

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

政策蒸馏的多重面貌:陷阱、机制与解决方案
Zhu, Siqi, Ye, Xuyan, Lu, Hongyu, Shi, Weiye, Liu, Ge
Abstract
On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
Chinese Translation
政策蒸馏(On-policy distillation, OPD)和政策自蒸馏(On-policy self-distillation, OPSD)作为大型语言模型的有前景的后训练方法,提供了基于模型自身策略采样的轨迹的密集令牌级监督。然而,现有的有效性结果仍然不尽相同:尽管 OP(S)D 在系统提示和知识内化方面显示出潜力,近期的研究也报告了不稳定性和性能下降。在本研究中,我们对 OPD 和 OPSD 的有效性进行了全面的实证研究,探讨了它们何时有效、何时失效以及原因。我们发现,OPD 在数学推理方面对教师选择和损失公式高度敏感,而在我们测试的环境中,OPSD 由于缺乏特定实例的特权信息(privileged information, PI)而失败。相比之下,当 PI 表示共享的潜在规则(如系统提示或对齐偏好)时,OPSD 是有效的。我们识别了三种失效机制:(1)由于依赖于学生生成的前缀而导致的教师与学生之间的分布不匹配,(2)由于偏置的 TopK 反向 KL 梯度导致的优化不稳定,以及(3)OPSD 特有的限制,即学生学习了一个不包含 PI 的策略,该策略聚合了以 PI 为条件的教师,而当 PI 是特定实例时,这种聚合是不够的。我们进一步展示了停止梯度的 TopK 目标、适应 RLVR 的教师和稳定的 SFT 学生可以缓解这些失效问题。
cs.AI / 6 / 2605.11218

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

不要只看数字:视觉锚定偏差与视觉语言模型中的层级表示
Shalankin, M.
Abstract
Embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across six VLMs from five architectural families (ANOVA eta^2 = 0.18-0.77, all p < 0.001). Anchor effects are 2.5x larger than severe image quality degradation, confirming bias is not reducible to visual changes. Layer-wise probing reveals consistent dissociation: layers where anchor classification saturates (L12-L34) are suboptimal for quality prediction, with optimal layers deeper (R^2 = 0.69-0.91). Fusion analysis identifies architecture-dependent integration -- instant fusion at L1-L2 in two models versus partial or no fusion in three others. These results establish a causal account of visual anchoring bias, linking behavioral susceptibility to representation dynamics.
Chinese Translation
图像上的嵌入数字锚点系统性地影响了来自五个架构家族的六个视觉语言模型(VLMs)的质量判断(ANOVA eta^2 = 0.18-0.77,所有 p < 0.001)。锚点效应比严重的图像质量下降大2.5倍,确认了偏差并非仅可归因于视觉变化。层级探测揭示了一致的分离:锚点分类饱和的层(L12-L34)在质量预测中表现不佳,而更深层的层则最为优越(R^2 = 0.69-0.91)。融合分析识别出架构依赖的整合——在两个模型中L1-L2层即时融合,而在其他三个模型中则部分或没有融合。这些结果建立了视觉锚定偏差的因果解释,将行为易感性与表示动态联系起来。
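The effect sizes above are reported as ANOVA eta^2. As a reminder of what that statistic measures, here is a minimal sketch of the one-way definition, eta^2 = SS_between / SS_total. The function name `eta_squared` and the toy scores are hypothetical and are not data from the paper.

```python
import numpy as np


def eta_squared(groups):
    """One-way ANOVA effect size: between-group sum of squares over total."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(arrays)
    grand_mean = all_values.mean()
    ss_total = ((all_values - grand_mean) ** 2).sum()
    ss_between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arrays)
    return ss_between / ss_total


# Hypothetical quality scores given by a model under a low and a high numeric anchor.
low_anchor = [1.0, 2.0, 3.0]
high_anchor = [4.0, 5.0, 6.0]
effect = eta_squared([low_anchor, high_anchor])
```

A value near 0 means the anchor condition explains almost none of the variance in the scores, while a value near 1 means it explains almost all of it.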
cs.AI / 7 / 2605.11223

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

视觉-语言模型在点击解谜游戏中是否展现出类人逻辑问题解决能力?
Helfenstein, Dominik, Menner, Marco, Triebel, Maximilian
Abstract
Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.
Chinese Translation
视觉-语言(-动作)模型(VLMs)越来越多地应用于交互环境,然而现有基准往往忽视了点击解谜游戏所需的复杂物理推理。本文介绍了针对《不可思议的机器2》(The Incredible Machine 2,TIM)的视觉-语言对抗不可思议的机器(Vision-Language Against The Incredible Machine,VLATIM),这是一个旨在评估类人逻辑问题解决能力的基准。与现有基准不同,VLATIM特别针对高层次逻辑推理与需要精确鼠标交互的连续动作空间之间的关键差距。该基准分为五个渐进部分,评估从基本的视觉基础和领域理解到多步骤操作和完整解谜的能力。我们的结果揭示了推理与执行之间的显著差距。尽管大型专有模型展现出卓越的规划能力,但它们在精确的视觉基础方面表现不佳。因此,它们尚未展现出类人的问题解决能力。
cs.AI / 8 / 2605.11225

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

PIVOT:通过轨迹优化桥接大型语言模型代理的规划与执行
Zhang, Tuo, Popa, Alin-Ionut, Xu, Yan, Song, Rui, Dimitriadis, Dimitrios
Abstract
Large language model (LLM)-based agents frequently generate seemingly coherent plans that fail upon execution due to infeasible actions, constraint violations, and compounding errors over extended horizons. PIVOT (Plan-Inspect-eVOlve Trajectories) addresses this plan-execution misalignment through a self-supervised framework that treats trajectories as optimizable objects iteratively refined via environment interaction. The framework comprises four stages: PLAN generates candidate trajectories; INSPECT executes them and computes structured losses with textual gradients encoding plan-execution discrepancies; EVOLVE applies these signals to produce improved trajectories; and VERIFY performs a final global check against task constraints. A monotonic acceptance process ensures non-decreasing solution quality. Empirical evaluations on DeepPlanning and GAIA demonstrate state-of-the-art performance: with human-in-the-loop (HITL) feedback, PIVOT establishes a strong upper bound, with up to 94% relative improvement in constraint satisfaction, while its fully autonomous variant retains substantial gains, showing that the core trajectory-refinement mechanism remains effective without external supervision. At the same time, PIVOT remains computationally efficient, requiring up to 3x to 5x fewer tokens than competing refinement methods. These findings establish that (self- or human-supervised) feedback-based trajectory optimization is a principled methodology for mitigating plan-execution gaps in autonomous agent systems.
Chinese Translation
基于大型语言模型(LLM)的代理经常生成看似连贯的计划,但由于不可行的行动、约束违反以及在较长时间范围内累积的错误而在执行时失败。PIVOT(计划-检查-演变轨迹)通过一个自监督框架解决了这一计划与执行之间的不匹配,该框架将轨迹视为可优化对象,通过与环境的交互进行迭代优化。该框架包括四个阶段:PLAN生成候选轨迹;INSPECT执行这些轨迹并计算结构化损失,使用文本梯度编码计划与执行之间的差异;EVOLVE应用这些信号以生成改进的轨迹;VERIFY对任务约束进行最终的全局检查。单调接受过程确保了解决方案质量不降低。在DeepPlanning和GAIA上的实证评估展示了最先进的性能:在有人参与的反馈(HITL)下,PIVOT建立了高达94%的约束满足率的强上限,而其完全自主的变体仍保持显著的收益,表明核心的轨迹优化机制在没有外部监督的情况下仍然有效。同时,PIVOT在计算上也保持高效,相较于竞争的优化方法,所需的token数量减少了3到5倍。这些发现表明,基于反馈(自我或人类监督)的轨迹优化是一种原则性的方法,用于减轻自主代理系统中的计划与执行之间的差距。
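The PLAN, INSPECT, EVOLVE, VERIFY stages and the monotonic acceptance rule from the PIVOT abstract can be sketched as a small loop. This is a toy under stated assumptions, not the authors' code: the function `refine`, the budget constraint, and the three callbacks are hypothetical, and a plain feedback string stands in for the textual gradients that an LLM would presumably produce and consume.

```python
def refine(plan, inspect, evolve, verify, max_rounds=5):
    """Refine a plan, accepting a candidate only if its loss does not increase."""
    best_plan = plan
    best_loss, feedback = inspect(best_plan)
    history = [best_loss]
    for _ in range(max_rounds):
        if best_loss == 0:
            break
        candidate = evolve(best_plan, feedback)
        cand_loss, cand_feedback = inspect(candidate)
        if cand_loss <= best_loss:  # monotonic acceptance: never keep a worse plan
            best_plan, best_loss, feedback = candidate, cand_loss, cand_feedback
        history.append(best_loss)
    return best_plan, verify(best_plan), history


BUDGET = 10  # hypothetical task constraint: total step cost must not exceed this


def inspect(plan):
    """Toy INSPECT: 'execute' the plan and report how far it exceeds the budget."""
    over = max(0, sum(plan) - BUDGET)
    return over, f"over budget by {over}"


def evolve(plan, feedback):
    """Toy EVOLVE: drop the most expensive step (a real system would use the feedback)."""
    trimmed = list(plan)
    trimmed.remove(max(trimmed))
    return trimmed


def verify(plan):
    """Toy VERIFY: final global check against the task constraint."""
    return sum(plan) <= BUDGET


final_plan, ok, history = refine([6, 5, 4], inspect, evolve, verify)
print(final_plan, ok, history)  # prints [5, 4] True [5, 0]
```

Because a candidate replaces the current plan only when its loss is no larger, the recorded loss history can never increase, which is the property the abstract attributes to monotonic acceptance.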
cs.AI / 9 / 2605.11232

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

重新思考针对欺诈和反洗钱(AML)的LLMOps:构建合规级LLM服务栈
Naik, Prathamesh Vasudeo, Dintakurthi, Naresh, Wang, Yue
Abstract
Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.
Chinese Translation
欺诈检测和反洗钱(AML)合规是大型语言模型(LLMs)的高价值领域,但它们的服务要求与通用聊天工作负载有显著不同。合规提示通常以前缀为主,受模式约束且富含证据,结合了可重用的政策指令、风险分类、交易或文档上下文,以及短结构化输出,如JSON标签或风险因素。这些特性使得前缀重用、键值缓存效率、运行时调优、模型编排和输出验证成为首要系统关注点。本文介绍了一种针对欺诈和AML工作负载的工作负载感知LLMOps栈,使用自托管的开放权重模型,如Meta Llama和Alibaba Qwen。该栈结合了vLLM风格的运行时调优、PagedAttention、自动前缀缓存、多适配器服务、适配器和提示长度感知批处理、睡眠/唤醒生命周期管理、推测解码以及可选的预填充/解码分离。为了避免暴露特定机构的数据,重现性轨迹将公共合成AML数据集(包括IBM AML和SAML-D)转换为前缀重的合规提示,包含可重用的政策文本、交易证据、分类定义和模式约束输出。我们还结合了一个基于LLM的质量门控,使用确定性合规检查、参考指标、专家裁定的校准数据(如可用)和多评审标准评分。在公共合成AML工作负载和受控服务基准测试中,工作负载感知调优将吞吐量从612-650提升至3,600请求/小时,将P99延迟从31-38秒降低至6.4-8.7秒,并将GPU利用率从12%提高至78%。这些结果表明,受监管的LLM性能是一个工作负载设计、服务优化和质量门控的问题,而不仅仅是一个模型选择的问题。
cs.AI / 10 / 2605.11234

The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

语义训练差距:面向工业人工智能代理系统的本体驱动工具架构
Chethan, Grama
Abstract
Large language model (LLM)-based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics -- the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi-agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three-operation interface contract -- resolve, contextualize, annotate -- with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology-grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
Chinese Translation
基于大型语言模型(LLM)的人工智能代理在制造环境中越来越多地用于分析、质量管理和决策支持。这些代理在领域术语方面表现出统计流利性,但缺乏对操作语义的扎实理解——即在特定生产环境中连接设备标识符、过程参数、故障代码和监管约束的关系结构。本文识别并形式化了语义训练差距:AI系统通过训练获取领域词汇的方式与制造操作通过本体关系定义意义之间的结构性断裂。我们证明了这一差距导致了即使模型响应在语言上精确也会产生操作上不正确的输出,并且在多代理配置中,它产生了一种我们称之为语义漂移的复合故障模式。为了缩小这一差距,我们提出了一种架构,将制造本体直接嵌入AI工具层作为一种类型化关系配置,在运行时强制执行语义约束,而不是依赖于模型训练。该架构被形式化为一个三操作接口契约——解析(resolve)、上下文化(contextualize)、注释(annotate),并由AIOps编排层强制执行不变性。在对六个行业配置(72次使用Qwen3-32B的工具调用)进行的受控实验中,不受限的工具参数导致领域标识符的幻觉率达到43%;而基于本体的参数将这一比例降低至0%。我们通过数字双胞胎分析平台验证了该方法,证明单一代码库与领域特定本体配置的结合消除了工具调用的幻觉,并在不改变应用代码的情况下实现了跨领域的可配置性。
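The abstract names a three-operation interface contract (resolve, contextualize, annotate) but does not spell out the signatures. The following is therefore only a guess at the general idea of enforcing ontology constraints at runtime: the `ONTOLOGY` dictionary, the identifiers such as `CNC-01`, the exception `UnknownIdentifier`, and every function signature are hypothetical.

```python
# Hypothetical ontology configuration: equipment identifiers and their relations.
ONTOLOGY = {
    "equipment": {
        "CNC-01": {"line": "L1", "failure_codes": ["F100", "F200"]},
        "CNC-02": {"line": "L2", "failure_codes": ["F100"]},
    },
}


class UnknownIdentifier(ValueError):
    """Raised when a tool parameter is not defined in the ontology."""


def resolve(kind, identifier):
    """Accept an identifier only if the ontology defines it; never guess."""
    if identifier not in ONTOLOGY.get(kind, {}):
        raise UnknownIdentifier(f"{identifier!r} is not a known {kind}")
    return identifier


def contextualize(kind, identifier):
    """Return the relational context recorded for a resolved identifier."""
    return dict(ONTOLOGY[kind][resolve(kind, identifier)])


def annotate(result, kind, identifier):
    """Attach the entity and its ontology context to a tool result."""
    return {**result, "entity": identifier, "context": contextualize(kind, identifier)}


report = annotate({"oee": 0.81}, "equipment", "CNC-01")
```

In this sketch a fabricated identifier such as `CNC-99` is rejected before any tool logic runs, so the constraint holds regardless of which model produced the argument.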
cs.AI / 11 / 2605.11258

Unlocking LLM Creativity in Science through Analogical Reasoning

通过类比推理释放大语言模型在科学中的创造力
Shen, Andrew, Druckmann, Shaul, Zou, James
Abstract
Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation ($\rho$=0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.
Chinese Translation
自主科学有望增强科学发现,特别是在生物医学等复杂领域。然而,这需要能够持续生成新颖和多样化解决方案的人工智能系统,以应对开放式问题。我们评估了大语言模型(LLMs)在开放式解决方案生成任务上的表现,并量化它们倾向于模式崩溃为低多样性生成的倾向。为了解决这一模式崩溃问题,我们引入了类比推理(Analogical Reasoning, AR)作为一种新的解决方案生成方法。AR基于共享的关系结构生成跨领域问题的类比,然后利用这些类比来寻找新颖的解决方案。与基线相比,AR发现的生成结果显著更加多样化(解决方案多样性指标提高90-173%),生成新颖解决方案的频率超过50%(而基线仅为1.6%),并产生高质量的类比。为了验证AR的现实可行性,我们在四个生物医学问题中实施了AR生成的解决方案,取得了一致的定量提升。AR生成的方法在扰动效应预测的分布指标上实现了近13倍的提升,在预测细胞间通信时超越所有基线,在推断脑区交互时与已发布方法的斯皮尔曼相关系数($\rho$=0.729)高度一致,并在两个寡核苷酸属性预测数据集上建立了最先进的性能。AR生成的新颖和多样化的解决方案可以用于增强现有解决方案生成方法的搜索空间。
cs.AI / 12 / 2605.11259

Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

模板作为本体:跨领域制造人工智能验证的可配置合成数据基础设施
Chethan, Grama
Abstract
Large language model (LLM)-based AI agents deployed in manufacturing environments require populated, schema-correct data for validation, yet production MES data is proprietary, privacy-encumbered, and vendor-specific. This paper introduces the Template-as-Ontology principle: a single Python configuration module (700-770 lines, 45 validated exports) serves simultaneously as the specification for a time-stepped manufacturing simulator and as the runtime domain schema for AI analytics tools, producing alignment by construction rather than integration. We formally define the domain template as a typed relational configuration schema and prove that structural alignment between simulation and tool layers is guaranteed by single-source consumption. A five-layer pipeline--simulation, PostgreSQL, CDC/Iceberg lakehouse, star schema, and 12 parameterized AI tools--generates causally coherent, MES-shaped data spanning 66 entity types across four operational domains mapped to ISA-95/IEC 62264. We validate the architecture with six industry templates (aerospace, pharma, automotive, electronics, beverages, warehousing) running on identical framework code. Calibration experiments (60 runs, 10 seeds per template) confirm parametric controllability: observed KPIs fall within configured ranges across all templates. A controlled hallucination experiment (72 tool invocations, Qwen3-32B) demonstrates that ontology-constrained parameters eliminate tool-parameter fabrication (0% constrained vs. 43% unconstrained hallucination rate for the evaluated model, Fisher's exact test p < 10^-12); the 0% constrained rate is an architectural guarantee that holds for any model. The framework provides a reusable data layer for discrete manufacturing AI validation.
Chinese Translation
基于大型语言模型(LLM)的人工智能代理在制造环境中部署时需要填充的、符合模式的数据进行验证,但生产制造执行系统(MES)数据是专有的、受隐私限制的,并且特定于供应商。本文介绍了模板作为本体的原则:一个单一的Python配置模块(700-770行,45个经过验证的导出)同时作为时间步进制造模拟器的规范和人工智能分析工具的运行时领域模式,通过构造而非集成实现对齐。我们正式定义领域模板为一种类型化的关系配置模式,并证明模拟与工具层之间的结构对齐由单一来源消费所保证。一个五层管道——模拟、PostgreSQL、CDC/Iceberg湖屋、星型模式和12个参数化的人工智能工具——生成因果一致的、符合MES形状的数据,涵盖66种实体类型,跨越映射到ISA-95/IEC 62264的四个操作领域。我们通过六个行业模板(航空航天、制药、汽车、电子、饮料、仓储)在相同框架代码上进行验证。校准实验(60次运行,每个模板10个种子)确认了参数可控性:观察到的关键绩效指标(KPI)在所有模板中均落在配置范围内。一个受控幻觉实验(72次工具调用,Qwen3-32B)表明,受本体约束的参数消除了工具参数的虚构(评估模型的约束幻觉率为0%,而非约束幻觉率为43%,Fisher精确检验p < 10^-12);0%的约束率是对任何模型都适用的架构保证。该框架为离散制造人工智能验证提供了可重用的数据层。
cs.AI / 13 / 2605.11301

LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

LatentRouter:在看到答案之前,我们能选择正确的多模态模型吗?
Cheng, Xueqi, Dong, Yushun
Abstract
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
Chinese Translation
多模态大型语言模型(MLLMs)在光学字符识别(OCR)、图表理解、空间推理、视觉问答、成本和延迟等方面具有异质性优势。因此,有效的MLLM路由不仅需要估计查询的难度:路由器必须将当前图像-问题输入的多模态需求与每个候选模型的能力相匹配。我们提出了LatentRouter,一种将MLLM路由公式化为反事实多模态效用预测的路由器。在给定图像-问题查询的情况下,LatentRouter提取学习到的多模态路由胶囊,用模型能力标记表示每个候选MLLM,并在这些状态之间进行潜在通信,以估计每个模型如果被选择将如何表现。分布式结果头预测模型特定的反事实质量,而有界胶囊校正则在不允许残余信号主导预测的情况下精细化接近的决策。最终的基于效用的策略支持以性能为导向和性能-成本的路由,并通过共享每个模型的评分和可用性掩蔽来处理变化的候选池。在MMR-Bench和VL-RouterBench上的实验表明,LatentRouter优于固定模型、特征级和学习路由器基线。额外分析表明,增益在模型选择依赖于视觉、布局敏感或推理导向需求的多模态任务组上最为显著,并且潜在通信是提升的主要贡献者。代码可在以下链接获取:https://github.com/LabRAI/LatentRouter。
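The utility-based policy with availability masking described in the LatentRouter abstract can be illustrated with a short selection rule. This sketch covers only the final decision step and assumes the per-model quality predictions are already given; the function `route`, the parameter `cost_weight`, and the numbers are hypothetical, and the learned capsules and latent communication from the paper are not modeled here.

```python
import numpy as np


def route(pred_quality, cost, available, cost_weight=0.0):
    """Pick the available model with the highest predicted utility."""
    quality = np.asarray(pred_quality, dtype=float)
    cost = np.asarray(cost, dtype=float)
    mask = np.asarray(available, dtype=bool)
    if not mask.any():
        raise ValueError("no candidate model is available")
    utility = quality - cost_weight * cost      # performance-cost trade-off
    utility = np.where(mask, utility, -np.inf)  # masked models can never win
    return int(np.argmax(utility))


# Hypothetical predictions for three candidate models on one image-question query.
quality = [0.9, 0.8, 0.7]
cost = [10.0, 1.0, 0.5]

best_quality = route(quality, cost, [True, True, True])
best_tradeoff = route(quality, cost, [True, True, True], cost_weight=0.05)
fallback = route(quality, cost, [True, False, True], cost_weight=0.05)
```

Setting `cost_weight` to zero gives performance-oriented routing, a positive value trades quality against cost, and the mask lets the candidate pool change without retraining the scoring step.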
cs.AI / 14 / 2605.11312

Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

约束数据值最大化:在低数据环境中利用数据归因进行有效的数据修剪
Brajovic, Danilo, Kreplin, David A., Huber, Marco F.
Abstract
Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing a model's performance trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.
Chinese Translation
将模型行为归因于训练数据是一个不断发展的研究领域。一个常见的基准是数据移除,这涉及到消除具有低或高值的数据实例,然后评估在修改后的数据集上训练的模型性能。许多现有研究利用基于Shapley的数据值来完成这一任务。在本文中,我们展示了这些数据值在仅剩有限数据时并不适合用于修剪低值数据。为了解决这一局限性,我们提出了约束数据值最大化(Constraint-Data-Value-Maximization, CDVM)方法,该方法有效利用数据归因在低数据场景中进行修剪。通过将修剪视为一个约束优化问题,既最大化总影响力又惩罚过度的每次测试贡献,CDVM在仅保留少量数据时表现出强大的性能。在OpenDataVal基准上,CDVM展现了强劲的性能和竞争力的运行时间。
cs.AI / 15 / 2605.11330

Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

重新思考大语言模型幻觉检测的评估:期望、基于RAG的新基准、以及新的见解
Chen, Wenbo, Padmanabhan, Veena, Giyahchi, Tootiya, Wong, Elaine, Akoglu, Leman
Abstract
Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as designing benchmark datasets for evaluating these detectors. In this work, we first establish a set of desiderata that hallucination detection benchmarks (HDBs) should satisfy for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify the two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) existing benchmarks do not make available realistic label noise for stress-testing detectors, although real-world use cases often grapple with label noise due to human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+ that underwent a rigorous human annotation process. Notably, our benchmark exhibits all desirable properties, including (1) TRIVIA+ contains samples with the longest context in the literature; and (2) we design and share four sets of noisy labels with different noise schemes, both sample-dependent and sample-independent. Finally, we perform experiments on RAG-based HDBs, including our TRIVIA+, using popular SOTA detectors that reveal new insights: (i) ample room remains for current detectors to reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with our proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.
Chinese Translation
幻觉广泛指代大语言模型(LLMs)生成的不真实、虚构或不一致的内容,具有广泛的影响。因此,许多研究致力于检测LLM幻觉,以及设计基准数据集来评估这些检测器。在本研究中,我们首先建立了幻觉检测基准(HDBs)应具备的特性期望,以实现有效评估。通过我们的期望对现有HDBs进行批判性审视,发现没有一个基准具备所有特性。我们识别出两个最大的缺口:(1)基于RAG的长上下文基准严重不足(部分原因是长度妨碍了人工标注);(2)现有基准未提供现实的标签噪声以进行检测器的压力测试,尽管现实世界的应用案例常常面临由于人工或自动/弱标注导致的标签噪声。为填补这些缺口,我们构建并开源了一个新的基于RAG的HDB,称为TRIVIA+,该基准经过严格的人类标注过程。值得注意的是,我们的基准具备所有期望的特性,包括(1)TRIVIA+包含文献中上下文最长的样本;(2)我们设计并分享了四组具有不同噪声方案的噪声标签,这些方案包括样本依赖和样本独立的噪声。最后,我们在基于RAG的HDB上进行实验,包括我们的TRIVIA+,使用流行的SOTA检测器,揭示了新的见解:(i)当前检测器在基于RAG的HDB上仍有很大的提升空间,(ii)基本的LLM作为评判基线表现具有竞争力,以及(iii)标签噪声妨碍了检测性能。我们期待我们的研究发现以及我们提出的基准能够激励并促进对基于RAG任务的幻觉检测所需的研究。
cs.AI / 16 / 2605.11341

CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening

CPEMH:一种用于基础模型系统中基于提示的行为评估与保障的代理框架,应用于心理健康筛查
Lorenzoni, Giuliano, Portugal, Ivens, Alencar, Paulo, Cowan, Donald
Abstract
This paper presents CPEMH, an agentic framework designed to evaluate prompt-driven behavior in foundation-model systems operating on transcript-based datasets for mental-health screening. CPEMH serves as an engineering methodology for behavioral assurance in large-scale language systems, introducing an orchestrated architecture that autonomously performs the design, evaluation, and selection of prompt strategies, enabling systematic control of behavioral variability across contexts. Its modular agentic design, combining orchestrator, inference, and evaluation agents, ensures traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts demonstrates the framework's capacity to stabilize and audit foundation-model behavior in conversational and clinically sensitive domains. Lessons learned emphasize the role of modular orchestration in behavioral assurance, the prioritization of stability over architectural complexity, and the integration of F1, bias, and robustness as core acceptance criteria.
Chinese Translation
本文提出了CPEMH,一种旨在评估在基于转录数据集上运行的基础模型系统中基于提示的行为的代理框架。CPEMH作为一种工程方法论,致力于在大规模语言系统中实现行为保障,介绍了一种协调架构,该架构能够自主执行提示策略的设计、评估和选择,从而实现对不同情境下行为变异性的系统控制。其模块化的代理设计结合了协调者、推理和评估代理,确保了在提示生命周期中的可追溯性、可重复性和鲁棒性。通过对访谈转录文本的自动抑郁症筛查的案例研究,展示了该框架在对话和临床敏感领域中稳定和审计基础模型行为的能力。经验教训强调了模块化协调在行为保障中的作用、稳定性优先于架构复杂性的必要性,以及将F1、偏差和鲁棒性作为核心接受标准的整合。
cs.AI / 17 / 2605.11359

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

CVEvolve:用于非结构化科学数据处理的自主算法发现
Du, Ming, Yin, Xiangyu, Luo, Yanqi, Beniwal, Dishant, Tang, Songyuan, Sharma, Hemant, Cherukara, Mathew J.
Abstract
Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analyze their data but may not have extensive computing or image-processing expertise. This barrier is especially pronounced when data are noisy, have a high dynamic range, are sparsely labeled, or are only loosely specified. We introduce CVEvolve, an autonomous agentic harness with a zero-code interface for scientific data-processing algorithm discovery. CVEvolve combines a multi-round search strategy with tools for code execution, evaluation implementation, history management, holdout testing, and optional inspection of scientific data and visual outputs. The search alternates between discovery and improvement actions, and uses lineage-aware stochastic candidate sampling to balance exploration and exploitation. We demonstrate CVEvolve on x-ray fluorescence microscopy image registration, Bragg peak detection, and high-energy diffraction microscopy image segmentation. Across these tasks, CVEvolve discovers algorithms that improve over baseline methods, while holdout test tracking helps identify candidates that generalize better than later over-optimized alternatives. These results show that zero-code, autonomous LLM-powered algorithm development can help domain scientists turn unstructured scientific image data into practical algorithms and downstream scientific discoveries.
Chinese Translation
科学数据处理通常需要特定任务的算法或人工智能模型,这对需要分析数据但可能没有广泛计算或图像处理专业知识的领域科学家构成了障碍。当数据存在噪声、高动态范围、稀疏标记或仅松散指定时,这种障碍尤为明显。我们介绍了CVEvolve,这是一种具有零代码接口的自主代理工具,用于科学数据处理算法的发现。CVEvolve结合了多轮搜索策略与代码执行、评估实现、历史管理、留出测试以及可选的科学数据和视觉输出检查工具。搜索过程在发现和改进行动之间交替进行,并使用考虑谱系的随机候选采样来平衡探索与利用。我们在X射线荧光显微镜图像配准、布拉格峰检测和高能衍射显微镜图像分割等任务中展示了CVEvolve。在这些任务中,CVEvolve发现的算法在基线方法上有所改进,而留出测试跟踪有助于识别出比后期过度优化的替代方案更具泛化能力的候选者。这些结果表明,零代码的自主大型语言模型(LLM)驱动的算法开发可以帮助领域科学家将非结构化科学图像数据转化为实用算法和下游科学发现。
cs.AI / 18 / 2605.11365

Causal Bias Detection in Generative Artificial Intelligence

生成性人工智能中的因果偏差检测
Plecko, Drago
Abstract
Automated systems built on artificial intelligence (AI) are increasingly deployed across high-stakes domains, raising critical concerns about fairness and the perpetuation of demographic disparities that exist in the world. In this context, causal inference provides a principled framework for reasoning about fairness, as it links observed disparities to underlying mechanisms and aligns naturally with human intuition and legal notions of discrimination. Prior work on causal fairness primarily focuses on the standard machine learning setting, where a decision-maker constructs a single predictive mechanism $f_{\widehat Y}$ for an outcome variable $Y$, while inheriting the causal mechanisms of all other covariates from the real world. The generative AI setting, however, is markedly more complex: generative models can sample from arbitrary conditionals over any set of variables, implicitly constructing their own beliefs about all causal mechanisms rather than learning a single predictive function. This fundamental difference requires new developments in causal fairness methodology. We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model's mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.
Chinese Translation
基于人工智能(AI)构建的自动化系统越来越多地应用于高风险领域,这引发了关于公平性和现存的人口差异持续存在的重大担忧。在这种背景下,因果推断提供了一个原则性框架,用于推理公平性,因为它将观察到的差异与潜在机制联系起来,并与人类直觉和法律对歧视的概念自然契合。先前关于因果公平性的研究主要集中在标准机器学习环境中,在该环境中,决策者为结果变量 $Y$ 构建一个单一的预测机制 $f_{\widehat Y}$,同时继承来自现实世界的所有其他协变量的因果机制。然而,生成性人工智能环境显著更为复杂:生成模型可以从任意条件下对任何变量集合进行采样,隐式地构建自己对所有因果机制的信念,而不是学习单一的预测函数。这一根本差异要求在因果公平性方法论上进行新的发展。我们形式化了生成性人工智能中的因果公平性问题,并在一个共同的理论框架下将其与标准机器学习环境统一。随后,我们推导出新的因果分解结果,使得能够沿着(a)不同的因果路径和(b)用生成模型的机制替代现实世界机制对公平性影响进行细致量化。我们建立了识别条件,并引入了有效的因果量估计器,通过分析不同数据集中大型语言模型中的种族和性别偏见,展示了我们方法论的价值。
cs.AI / 19 / 2605.11373

Causal Algorithmic Recourse: Foundations and Methods

因果算法补救:基础与方法
Plecko, Drago, Wang, Collin, Bareinboim, Elias
Abstract
The trustworthiness of AI decision-making systems is increasingly important. A key feature of such systems is the ability to provide recommendations for how an individual may reverse a negative decision, a problem known as algorithmic recourse. Existing approaches treat recourse outcomes as counterfactuals of a fixed unit, ignoring that real-world recourse involves repeated decisions on the same individual under possibly different latent conditions. We develop a causal framework that models recourse as a process over pre- and post-intervention outcomes, allowing for partial stability and resampling of latent variables. We introduce post-recourse stability conditions that enable reasoning about recourse from observational data alone, and develop a copula-based algorithm for inferring the effects of recourse under these conditions. For settings where paired observations of the same individual before and after intervention are available (called recourse data), we develop methods for inferring copula parameters and performing goodness-of-fit testing. When the copula model is rejected, we provide a distribution-free algorithm for learning recourse effects directly from recourse data. We demonstrate the value of the proposed methods on real and semi-synthetic datasets.
Chinese Translation
人工智能决策系统的可信度日益重要。此类系统的一项关键特征是能够提供建议,帮助个体逆转负面决策,这一问题被称为算法补救。现有方法将补救结果视为固定单位的反事实,忽略了现实世界中的补救涉及在可能不同的潜在条件下对同一个体进行重复决策。我们开发了一个因果框架,将补救建模为干预前后结果的过程,允许潜在变量的部分稳定性和重抽样。我们引入了后补救稳定性条件,使得仅通过观察数据就能推理补救效果,并开发了一种基于copula(联结函数)的算法,用于在这些条件下推断补救的效果。在可获得同一个体在干预前后配对观察数据的情况下(称为补救数据),我们开发了推断copula参数和进行拟合优度检验的方法。当copula模型被拒绝时,我们提供了一种无分布假设的算法,直接从补救数据中学习补救效果。我们在真实和半合成数据集上展示了所提方法的价值。
cs.AI / 20 / 2605.11376

LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

LLM-X:一种可扩展的以协商为导向的个人 LLM 代理间通信交换
Lorenzoni, Giuliano, Alencar, Paulo, Cowan, Donald
Abstract
We propose a personal-LLM exchange (LLM-X), a scalable negotiation-oriented environment that enables direct, structured communication across populations of personal agents (LLMs), each representing an individual user. Unlike existing tool-centric protocols that focus on agent-API interaction, LLM-X introduces a message bus and routing substrate for LLM-to-LLM coordination with guarantees around schema validity and policy enforcement. We contribute: (1) an architecture for LLM-X comprising federated gateways, topic-based routing, and policy enforcement; (2) a typed message protocol supporting capability negotiation and contract-net-style coordination; and (3) the first empirical evaluation of LLM-based multi-agent negotiation at scale. Experiments span 5, 9, and 12 agents, under distinct negotiation policies (Low, Medium, High), and across both short-run (minutes) and long-run (2h, 12h) load conditions. Results highlight clear policy-performance trade-offs: stricter policies improve robustness and fairness but increase latencies and message volume. Extended runs confirm that LLM-X remains stable under sustained load, with bounded latency drift.
Chinese Translation
我们提出了一种个人 LLM 交换(LLM-X),这是一个可扩展的以协商为导向的环境,能够实现代表个别用户的个人代理(LLM)之间的直接、结构化通信。与现有的以工具为中心的协议不同,后者专注于代理与 API 的交互,LLM-X 引入了消息总线和路由基础设施,以实现 LLM 之间的协调,并确保模式有效性和政策执行。我们的贡献包括:(1)LLM-X 的架构,包含联邦网关、基于主题的路由和政策执行;(2)支持能力协商和合同网络式协调的类型化消息协议;(3)对基于 LLM 的多代理协商进行大规模的首次实证评估。实验涵盖了 5、9 和 12 个代理,采用不同的协商政策(低、中、高),并在短期(分钟)和长期(2小时、12小时)负载条件下进行。结果突显了明确的政策与性能之间的权衡:更严格的政策提高了鲁棒性和公平性,但增加了延迟和消息量。扩展运行确认 LLM-X 在持续负载下保持稳定,且延迟漂移受限。
cs.AI / 21 / 2605.11386

Revisiting Privacy Preservation in Brain-Computer Interfaces: Conceptual Boundaries, Risk Pathways, and a Protection-Strength Grading Framework

重新审视脑-计算机接口中的隐私保护:概念边界、风险路径及保护强度分级框架
Sun, Lei, Mao, Xiuqing, Zhang, Shuai, Zeng, Qingyu, Zhao, Min, Li, Jiyuan, Dong, Wenle
Abstract
Brain-computer interfaces (BCIs) are moving rapidly from laboratory research into clinical, edge, and real-world settings. Under ISO/IEC 8663:2025, a BCI is a direct communication link between central nervous system activity and external software or hardware systems. This link expands privacy risk beyond raw neural-signal leakage: neural data, derived representations, model assets, and decoded outputs can be re-associated with individuals across collection, transmission, storage, training, inference, and feedback, or used to infer information beyond what a task requires. Starting from the general BCI paradigm, this review defines privacy-protection boundaries, protection objects, and the relationship between user data privacy and model privacy within a shared risk pathway. It then proposes a three-dimensional framework (protection object, lifecycle stage, and dominant protection-strength level) to classify existing work into four levels of protection strength. Finally, mental privacy and neuroethical risks are treated as open issues, emphasizing that BCI privacy protection should not only obscure data but also disentangle task-irrelevant sensitive information while preserving downstream utility. Keywords: Brain-computer interface, Neural data privacy, User data privacy, Model privacy, Disentanglement of task-irrelevant sensitive information, Protection-strength grading, Neuroethical risks
Chinese Translation
脑-计算机接口(BCIs)正迅速从实验室研究转向临床、边缘和现实世界的应用。根据ISO/IEC 8663:2025,BCI是中央神经系统活动与外部软件或硬件系统之间的直接通信链接。该链接将隐私风险扩展到超出原始神经信号泄露的范围:神经数据、派生表示、模型资产和解码输出可以在收集、传输、存储、训练、推理和反馈的过程中与个体重新关联,或用于推断超出任务所需的信息。从一般的BCI范式出发,本综述定义了隐私保护的边界、保护对象,以及用户数据隐私与模型隐私在共享风险路径中的关系。然后,提出了一个三维框架——保护对象、生命周期阶段和主导保护强度水平——将现有工作分类为四个保护强度级别。最后,心理隐私和神经伦理风险被视为开放性问题,强调BCI隐私保护不仅应模糊数据,还应解开与任务无关的敏感信息,同时保持下游效用。关键词:脑-计算机接口、神经数据隐私、用户数据隐私、模型隐私、与任务无关的敏感信息解开、保护强度分级、神经伦理风险
cs.AI / 22 / 2605.11392

Transformer Interpretability from Perspective of Attention and Gradient

从注意力和梯度的角度看Transformer的可解释性
Cui, Yongjin, Fan, Xiaohui, Chen, Huajun
Abstract
Although research attention focuses mostly on the performance of Transformer models, Transformer interpretation should not be ignored. Gradients are widely used in Transformer interpretation. From the perspective of attention and gradient, we conduct an in-depth study of Transformer interpretation and propose a method that achieves it by guiding the gradient direction or, more precisely, the attention direction. The method enables more comprehensive interpretation of feature regions, offers detailed interpretation, and helps to better understand the Transformer mechanism. Leveraging the difference in how Vision Transformers (ViTs) and humans perceive images, we alter the class of an image in a way that is almost imperceptible to the human eye. This class-rewriting phenomenon may pose security risks in certain scenarios.
Chinese Translation
尽管研究者们的关注更多集中在Transformer模型的性能上,但Transformer的可解释性同样不可忽视。梯度在Transformer的解释中被广泛应用。从注意力和梯度的角度出发,我们对Transformer的可解释性进行了深入研究,并提出了一种通过引导梯度方向,或者更准确地说,引导注意力方向来实现可解释性的方法。该方法能够更全面地解释特征区域,提供详细的解释,并有助于更好地理解Transformer的机制。利用视觉Transformer(Vision Transformer, ViT)与人类感知图像的差异,我们以一种几乎不被人眼察觉的方式改变图像的类别。这种类别重写现象在某些场景中可能会带来安全风险。
cs.AI / 23 / 2605.11398

AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

AcuityBench:评估临床紧急程度识别与不确定性对齐
Linzmayer, Robin, Lin, Georgianna, Coneybeare, Di, Chu, Jason, Cloyd, Trudi, Garg, Manish, Gordon, Miles, Hartofilis, Elizabeth, Hong, Benjamin, Hussain, Ashraf, Kim, Eugene Y., King, Oluchi Iheagwara, McCormack, Ross, Olsen, Erica, Riggins Jr, John K., Rasheed, Mustafa N., Sacco, Dana L., Saggar, Vinay, Sayan, Osman R., Shembekar, Amit, Shin-Kim, Janice, Sun, Wendy W., Chang, Bernard P., Kessler, David, Elhadad, Noémie
Abstract
We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.
Chinese Translation
我们介绍了AcuityBench,这是一个用于评估语言模型是否能够从用户的医疗呈现中识别适当的护理紧急程度的基准。现有的健康基准强调医疗问答、广泛的健康互动或狭窄的工作流程特定的分诊任务,但未能在这些环境中提供紧急程度识别的统一评估。AcuityBench通过将五个公共数据集进行整合,填补了这一空白,这些数据集涵盖了用户对话、在线论坛帖子、临床小插曲和患者门户消息,并在一个共享的四级紧急程度框架下进行分类,范围从家庭监测到立即的紧急护理。该基准包含914个案例,其中697个为共识案例,用于标准准确性评估,217个为医生确认的模糊案例,用于不确定性意识评估。它支持两种互补的任务格式:在问答设置中进行明确的四向分类,以及使用基于评分标准的评审对自由形式的对话响应进行评估,这些评分标准与相同的框架相锚定。在12个前沿的专有和开放权重模型中,我们发现清晰案例的紧急程度准确性和错误方向存在显著差异。比较任务格式揭示了一种系统性的权衡:相较于问答,对话响应减少了过度分诊,但在高紧急程度案例中增加了不足分诊。在模糊案例中,没有模型与医生判断的分布紧密匹配,模型预测的集中程度高于专家的临床不确定性。我们还对一部分最大模糊案例进行了专家与模型裁定的比较,利用这些案例考察临床不确定性在标签不一致中的作用。这些结果共同表明,紧急程度识别是一项独特的安全关键能力,并显示AcuityBench能够系统性地比较和压力测试模型在现实健康应用中如何引导用户到达正确的护理水平。
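The over-triage and under-triage error directions reported for AcuityBench can be made concrete with a small sketch. It assumes the four acuity levels are encoded as integers 0 to 3, with 0 for home monitoring and 3 for immediate emergency care; that encoding and the function name `triage_errors` are assumptions for illustration, not the benchmark's scoring code.

```python
def triage_errors(true_levels, pred_levels):
    """Summarize accuracy and error direction on a four-level acuity scale.

    Levels are assumed to be integers 0..3 (0 = home monitoring,
    3 = immediate emergency care). Over-triage means the prediction is
    more urgent than the label; under-triage means it is less urgent.
    """
    if len(true_levels) != len(pred_levels) or not true_levels:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(true_levels)
    over = sum(p > t for t, p in zip(true_levels, pred_levels))
    under = sum(p < t for t, p in zip(true_levels, pred_levels))
    return {
        "accuracy": (n - over - under) / n,
        "over_triage": over / n,
        "under_triage": under / n,
    }

# Two correct cases, one over-triaged, one under-triaged.
report = triage_errors([0, 1, 2, 3], [0, 2, 2, 1])
```

Reporting the two error directions separately matters in this setting because an under-triaged high-acuity case is generally the more dangerous mistake.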
cs.AI / 24 / 2605.11404

Attributing Emergence in Million-Agent Systems

归因于百万代理系统中的涌现现象
Tang, Ling, Mei, Jilin, Chen, Qian, Ren, Qihan, Zhang, Linfeng, Zhang, Quanshi, Shao, Jing, Hu, Xia, Liu, Dongrui
Abstract
Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in $N$ and have been confined to $N \lesssim 10^3$, while the phenomena they explain occur at $N \geq 10^6$. We address this gap by adapting Aumann-Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data ($1{,}671{,}587$ active users), we compute the attribution at both full scale and the visibility-biased $N = 10^2$ convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.
Chinese Translation
大型语言模型(LLMs)能够模拟个体代理中的类人推理和决策。基于LLM的多代理系统(MAS)将这些代理结合起来,以模拟人口规模的社会现象,如极化、信息级联和市场恐慌。这类研究需要将宏观涌现归因于个体代理,但现有的公理方法在$N$上呈组合增长,且仅限于$N \lesssim 10^3$,而它们所解释的现象发生在$N \geq 10^6$。我们通过将Aumann-Shapley路径积分归因方法适应于百万代理规模的LLM驱动MAS来填补这一空白;所得到的方法满足所有四个公理,运行速度比在相同硬件上采样的Shapley快四到五个数量级。我们使用该方法进行实证测试,以检验规模差距:在14天的公共Bluesky数据中($1{,}671{,}587$个活跃用户),我们计算了全规模和小规模研究使用的可见性偏差样本$N = 10^2$的归因,两者在结构上存在显著差异。在全规模下,长尾和中间层共同承载了大部分归因;而偏见的小样本几乎将所有归因都归于少数高关注者账户。随后,我们证明在任何非线性宏观指标下,后期重标定无法减少这种不一致性:归因缩放偏差定理表明,没有全球重标定因子可以调和小规模和全规模的归因。因此,全规模归因不仅是一个方法选择,而是任何非线性宏观指标的理论要求。
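For readers unfamiliar with Aumann-Shapley path-integral attribution, a minimal numerical sketch follows. It attributes a macro indicator $F$ to each agent $i$ as $x_i \int_0^1 \partial F/\partial x_i(\alpha x)\,d\alpha$ with a zero baseline. The finite-difference gradient, the midpoint rule, and the toy indicator are assumptions chosen for readability; this is not the paper's million-agent implementation, which the abstract does not spell out.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def aumann_shapley(f, x, n_steps=64):
    """Attribute f(x) - f(0) to each coordinate of x.

    Averages the gradient along the straight path from 0 to x
    (midpoint rule) and multiplies by x, so the attributions sum to
    f(x) - f(0) up to integration error (the efficiency axiom).
    """
    x = np.asarray(x, dtype=float)
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    avg_grad = np.mean([numerical_gradient(f, a * x) for a in alphas], axis=0)
    return x * avg_grad

# Toy nonlinear macro indicator (assumed): squared total activity.
indicator = lambda v: float(np.sum(v)) ** 2
attributions = aumann_shapley(indicator, [1.0, 2.0, 3.0])
```

For this toy indicator each attribution equals $x_i \sum_j x_j$, so an agent's share depends on the whole population's total. That is a small-scale hint of why attribution computed on a subsample need not transfer to the full system when the indicator is nonlinear.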
cs.AI / 25 / 2605.11410

What Do EEG Foundation Models Capture from Human Brain Signals?

脑电图基础模型从人脑信号中捕获了什么?
Tang, Ling, Chen, Qian, Mei, Jilin, Xu, Houshi, Zhang, Quanshi, Shao, Jing, Zou, Na, Hu, Xia, Liu, Dongrui
Abstract
Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, e.g., band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: what does the model learn, what does the model use, and how much can be explained. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the $945$ (model, task, feature) units, $648$ ($68.6\%$) are representation-causal and $199$ ($21.1\%$) are encoded-only. Across tasks, $50$ features qualify as universal candidates with strong support (all three architectures RC) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, $79.3\%$ of the foundation model's advantage over the random baseline, with a clean task gradient (MDD $\approx 0.99$ down to Stress $\approx 0.56$): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.
Chinese Translation
临床脑电图(EEG)分析基于经过数十年精炼的手工特征目录,例如,带宽功率、连接性、复杂性等。现代EEG基础模型绕过了这一目录,通过自监督预训练直接从原始信号中学习,并在大多数临床基准测试中与特征工程基线相匹配或超越。两种表示是否一致仍然是一个悬而未决的问题,我们将其分解为三个子问题:模型学习了什么、模型使用了什么以及可以解释多少。我们通过逐层岭回归探测、LEACE风格的交叉协方差子空间擦除以及一个透明的分类器(与随机特征基线进行基准测试)来回答这些问题。审计涵盖了三个基础模型(CSBrain、CBraMod、LaBraM)、五个临床任务(MDD、压力、ISRUC-睡眠、TUSL、锡耶纳)和一个包含63个特征的6个家族词汇。在945个(模型、任务、特征)单元中,648个(68.6%)是表示因果的,199个(21.1%)是仅编码的。在各个任务中,有50个特征符合强支持的通用候选标准(所有三个架构均为RC),在两个或更多任务中表现突出。频域特征占主导地位,但其他五个家族也各自贡献了可观的因果质量。确认的特征平均恢复了基础模型相对于随机基线的79.3%的优势,且任务梯度清晰(MDD约为0.99,下降至压力约为0.56):接近上限的任务几乎完全被词汇恢复,而更难的任务则留下了非平凡的残差,指向未来概念发现的具体目标。
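The layer-wise ridge probing step mentioned in the abstract can be sketched as follows: a linear ridge regressor is fit from one layer's activations to a hand-crafted feature value, and held-out $R^2$ indicates how linearly decodable that feature is. The closed-form solver, the `alpha` value, and the synthetic data are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def ridge_probe_r2(acts_train, y_train, acts_test, y_test, alpha=1.0):
    """Held-out R^2 of a ridge probe from layer activations to one feature."""
    mu_x, mu_y = acts_train.mean(axis=0), y_train.mean()
    xc = acts_train - mu_x
    # Closed-form ridge solution on centered data: (X^T X + alpha I) w = X^T y.
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(xc.shape[1]),
                        xc.T @ (y_train - mu_y))
    pred = (acts_test - mu_x) @ w + mu_y
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumed): one layer encodes the feature, one does not.
rng = np.random.default_rng(0)
feature = rng.normal(size=600)                      # e.g. a band-power value
direction = rng.normal(size=16)
encoding_layer = np.outer(feature, direction) + 0.1 * rng.normal(size=(600, 16))
unrelated_layer = rng.normal(size=(600, 16))

r2_encoding = ridge_probe_r2(encoding_layer[:400], feature[:400],
                             encoding_layer[400:], feature[400:])
r2_unrelated = ridge_probe_r2(unrelated_layer[:400], feature[:400],
                              unrelated_layer[400:], feature[400:])
```

A high probe score only shows that a feature is encoded; the abstract's distinction between encoded-only and representation-causal units is what the separate erasure step is for.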
cs.AI / 26 / 2605.11418

Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

SKILL.md 的内部机制:对 AI 代理技能注册表的语义供应链攻击
Saha, Shoumik, Faghih, Kazem, Feizi, Soheil
Abstract
Autonomous AI agents increasingly extend their capabilities through Agent Skills: modular filesystem packages whose SKILL.md files describe when and how agents should use them. While this design enables scalable, on-demand capability expansion, it also introduces a semantic supply-chain risk in which natural-language metadata and instructions can affect which skills are admitted, surfaced, selected, and loaded. We study SKILL.md-only attacks across three registry-facing stages of the Agent Skill lifecycle, using real ClawHub skills and realistic registry mechanisms. In Discovery, short textual triggers can manipulate embedding-based retrieval and improve adversarial skill visibility, achieving up to 86% pairwise win rate and 80% Top-10 placement. In Selection, description-only framing biases agents toward functionally equivalent adversarial variants, which are selected in 77.6% of paired trials on average. In Governance, semantic evasion strategies cause malicious skills to avoid a blocking verdict in 36.5%-100% of cases. Overall, our results show that SKILL.md is not passive documentation but operational text that shapes which third-party capabilities agents find, trust, and use.
Chinese Translation
自主 AI 代理通过代理技能不断扩展其能力:这些模块化的文件系统包的 SKILL.md 文件描述了代理何时以及如何使用它们。尽管这种设计使得能力扩展具有可扩展性和按需性,但它也引入了一种语义供应链风险,其中自然语言元数据和指令可能影响哪些技能被接受、展示、选择和加载。我们研究了 SKILL.md 的攻击,涵盖了代理技能生命周期中面向注册表的三个阶段,使用真实的 ClawHub 技能和现实的注册机制。在发现阶段,简短的文本触发器可以操控基于嵌入的检索并提高对抗性技能的可见性,达到高达 86% 的配对胜率和 80% 的前十名排名。在选择阶段,仅通过描述的框架使代理偏向于功能上等效的对抗变体,这些变体在配对试验中平均被选择的比例为 77.6%。在治理阶段,语义规避策略使恶意技能在 36.5%-100% 的情况下避免被阻止的裁决。总体而言,我们的结果表明,SKILL.md 不是被动的文档,而是塑造代理发现、信任和使用哪些第三方能力的操作性文本。
cs.AI / 27 / 2605.11426

A Mechanistic Investigation of Supervised Fine Tuning

监督微调的机制性研究
Chopra, Ruhaan
Abstract
The cosine similarity between a large language model's hidden activations before and after Supervised Fine-Tuning (SFT) remains very high. This, at first glance, suggests that SFT leaves the model's activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline which utilizes these pretrained SAEs as a high-resolution diagnostic tool to mechanistically investigate the drivers of this representational divergence. Through our analytical pipeline, we discover task-specific and layer-specific distributions of the precise semantic features that are systematically altered during supervised fine-tuning. We additionally identify a layer-wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at: https://github.com/ruhzi/sae-investigation.
Chinese Translation
在监督微调(Supervised Fine-Tuning, SFT)之前和之后,大型语言模型的隐藏激活之间的余弦相似度仍然很高。这乍一看表明,SFT在很大程度上未扰动模型的激活几何结构。然而,通过在基础模型上预训练的稀疏自编码器(Sparse Autoencoder, SAE)对这两组激活进行投影,揭示了潜在的稀疏特征显著分歧。我们引入了一种新颖的研究管道,利用这些预训练的SAE作为高分辨率诊断工具,从机制上探讨这种表征分歧的驱动因素。通过我们的分析管道,我们发现了在监督微调过程中系统性改变的精确语义特征的任务特定和层特定分布。此外,我们还识别出一种特定于安全对齐的层级更新特征。与本研究相关的所有代码、实验脚本和分析文件均可在以下网址公开获取:https://github.com/ruhzi/sae-investigation。
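The contrast the abstract draws, near-identical activations by cosine similarity but diverging sparse latents, can be reproduced on a toy example. The sketch below assumes a ReLU-with-bias SAE encoder and uses a hand-built three-latent encoder with made-up activation vectors; it is not one of the pretrained SAEs from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sae_active_latents(x, w_enc, b_enc):
    """Indices of latents that fire under an assumed ReLU(W x + b) encoder."""
    latents = np.maximum(w_enc @ x + b_enc, 0.0)
    return set(np.flatnonzero(latents > 0).tolist())

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Toy encoder (assumed): identity weights with a firing threshold of 0.5.
w_enc = np.eye(3)
b_enc = np.full(3, -0.5)

# Made-up activations: one dominant direction plus two small components
# that swap sides of the threshold after fine-tuning.
base_act = np.array([10.0, 0.4, 0.6])
sft_act = np.array([10.0, 0.6, 0.4])

cos = cosine_similarity(base_act, sft_act)
overlap = jaccard(sae_active_latents(base_act, w_enc, b_enc),
                  sae_active_latents(sft_act, w_enc, b_enc))
```

Here the cosine similarity stays above 0.999 because the dominant component is unchanged, while only one of three active latents is shared, which is the kind of gap a dense similarity metric cannot show.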
cs.AI / 28 / 2605.11458

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

自适应教师曝光用于大语言模型推理中的自我蒸馏
Han, Zihao, Zhang, Tiangang, Wang, Huaibin, Sun, Yilun
Abstract
On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
Chinese Translation
基于策略的自我蒸馏已成为大语言模型(LLM)推理的一种有效方法,其中一个特权教师在参考解决方案的条件下监督学生的自我回滚。然而,几乎所有此类方法共享的一个设计选择却未受到质疑:教师总是能够看到完整的参考推理。我们认为这一默认设置本身就是问题的一部分,并识别出教师侧曝光不匹配的情况:当教师在学生当前能力范围之外的推理上进行条件化时,生成的目标标记变得过于强大,学生难以吸收。通过控制固定曝光的实验,我们在两个方面具体化了这一点:1)完全曝光并不总是最佳选择,2)随着教师看到更多特权推理,学生与教师之间的不匹配单调增加。这促使我们将教师曝光视为一个可学习的训练时控制变量,而非固定的超参数。因此,我们提出了自适应教师曝光用于自我蒸馏(ATESD)。ATESD通过一个轻量级的Beta策略控制器来建模曝光比例,该控制器以紧凑的训练状态统计数据为条件,并在学生更新的短暂保持窗口中使用一次采样的曝光。为了使这一曝光控制器可学习,我们通过折扣学习进度奖励来优化它,该奖励根据每个保持决策对学生未来改进的影响而非其即时损失变化进行评分,从而解决了基于策略蒸馏引发的延迟信用分配问题。在AIME 24、AIME 25和HMMT 25的Qwen3-{1.7B, 4B, 8B}实验中,ATESD始终优于竞争性的自我蒸馏和强化学习基线,分别提高了OPSD的平均得分+0.95、+2.05和+2.33(Average@12),并确立了自适应教师曝光作为推理自我蒸馏的新有效维度。
cs.AI / 29 / 2605.11461

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

打破$\textit{赢家通吃}$:合作策略优化提升多样化大语言模型推理
Chen, Haoxuan, Liang, Tianming, Zheng, Wei-Shi, Hu, Jian-Fang
Abstract
Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the winner-takes-all nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.
Chinese Translation
带验证者的强化学习(RLVR)已成为提升大语言模型(LLM)推理的核心范式,然而,像 GRPO 这样的流行基于群体的优化算法常常遭遇探索崩溃的问题,即模型过早地收敛于一组狭窄的高得分模式,缺乏探索新解决方案的能力。近期的努力试图通过添加熵正则化或多样性奖励来缓解这一问题。然而,这些方法并没有改变“赢家通吃”的本质,回滚仍然竞争个体优势,而不是合作以最大化全局多样性。在本研究中,我们提出了群体合作策略优化(GCPO),将训练范式从回滚竞争转变为团队合作。具体而言,GCPO 用团队级别的信用分配替代独立的回滚评分:回滚的奖励取决于其对团队有效解决方案覆盖的贡献,而不是其个体准确性。该覆盖被描述为基于奖励加权语义嵌入的行列式体积,只有正确且不冗余的回滚才会对该体积产生贡献。在优势估计过程中,GCPO 按照每个回滚对团队的平均边际贡献,将集体团队奖励重新分配给每个单独的回滚。这种合作训练范式将优化引导至非冗余的正确推理路径。在多个推理基准上的实验表明,GCPO 显著提高了推理准确性和解决方案多样性,超越了现有方法。代码将发布在 https://github.com/bradybuddiemarch/gcpo。
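The team-coverage idea above can be illustrated with a small sketch. It rests on assumptions not taken from the paper: coverage is taken to be log det(I + G), where G is the Gram matrix of reward-weighted embeddings, and credit is assigned by leave-one-out differences rather than the averaged marginal contribution the abstract mentions. All function names are hypothetical.

```python
import math

def gram(vectors):
    """Gram matrix of a list of equal-length vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def log_volume(vectors):
    """log det(I + G); returns 0.0 for an empty team."""
    n = len(vectors)
    g = gram(vectors)
    m = [[g[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    log_det = 0.0
    # I + G is symmetric positive definite, so elimination needs no pivoting.
    for k in range(n):
        pivot = m[k][k]
        log_det += math.log(pivot)
        for i in range(k + 1, n):
            factor = m[i][k] / pivot
            for j in range(k, n):
                m[i][j] -= factor * m[k][j]
    return log_det

def marginal_contributions(embeddings, rewards):
    """Leave-one-out drop in team volume for each rollout (a simplification)."""
    weighted = [[r * x for x in e] for e, r in zip(embeddings, rewards)]
    total = log_volume(weighted)
    return [total - log_volume(weighted[:i] + weighted[i + 1:])
            for i in range(len(weighted))]
```

With this choice, an incorrect rollout (reward 0) maps to the zero vector and contributes nothing, while two rollouts with identical embeddings each receive less credit than a single distinct correct one.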
cs.AI / 30 / 2605.11468

CAMPA: Efficient and Aligned Multimodal Graph Learning via Decoupled Propagation and Aggregation

CAMPA:通过解耦传播和聚合实现高效且对齐的多模态图学习
Su, Daohan, Liu, Hao, Li, Xunkai, Zhu, Yinlin, Yongfu, Xiong, Liu, Yi, Qin, Hongchao, Li, Rong-Hua, Wang, Guoren
Abstract
Multimodal Graph Neural Networks (MGNNs) have shown strong potential for learning from multimodal attributed graphs, yet most existing approaches rely on tightly coupled architectures that suffer from prohibitive computational overhead. In this paper, we present a systematic empirical analysis showing that decoupled MGNNs are substantially more efficient and scalable for large-scale graph learning. However, we identify a critical bottleneck in existing decoupled pipelines, namely modal conflict, which arises in both the propagation and aggregation stages. Specifically, independent multi-hop diffusion causes cross-modal semantic divergence during propagation, while naive fusion fails to align multi-hop feature trajectories during aggregation, jointly limiting effective representation learning. To address this challenge, we propose CAMPA, a Cross-modal Aligned Multimodal Propagation & Aggregation framework for decoupled multimodal graph learning. Concretely, CAMPA introduces a two-stage alignment mechanism: (1) cross-modal aligned propagation, which injects cross-modal similarity priors into message passing to preserve semantic consistency without additional parameter overhead; (2) trajectory aligned aggregation, which leverages trajectory-level self-attention and cross-attention to capture and align long-range dependencies across modalities and hops. Extensive experiments on diverse benchmark datasets and tasks demonstrate that CAMPA consistently outperforms strong coupled and decoupled baselines while preserving the efficiency advantages of the decoupled paradigm.
Chinese Translation
多模态图神经网络(MGNNs)在从多模态属性图中学习方面展现了强大的潜力,但大多数现有方法依赖于紧密耦合的架构,导致计算开销过大。本文提出了一项系统的实证分析,表明解耦的MGNN在大规模图学习中显著更高效且可扩展。然而,我们发现现有解耦管道中的一个关键瓶颈,即模态冲突,这在传播和聚合阶段均会出现。具体而言,独立的多跳扩散在传播过程中导致跨模态语义偏差,而简单的融合在聚合过程中未能对齐多跳特征轨迹,从而共同限制了有效的表示学习。为了解决这一挑战,我们提出了CAMPA,一个用于解耦多模态图学习的跨模态对齐传播与聚合框架。具体而言,CAMPA引入了一个两阶段的对齐机制:(1)跨模态对齐传播,通过在消息传递中注入跨模态相似性先验,保持语义一致性而不增加额外的参数开销;(2)轨迹对齐聚合,利用轨迹级自注意力和跨注意力捕捉并对齐跨模态和跨跳的长程依赖。对多种基准数据集和任务的广泛实验表明,CAMPA在保持解耦范式效率优势的同时,始终优于强耦合和解耦基线。
cs.AI / 31 / 2605.11473

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

TOPPO:重新思考用于多任务强化学习的PPO与评论员平衡
Li, Yuanpeng, Lin, Gefei, Qu, Annie, Miao, Rui
Abstract
Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing -- a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.
Chinese Translation
软演员-评论员(Soft Actor-Critic, SAC)及其变体因其离策略(off-policy)样本效率在多任务强化学习(Multi-Task Reinforcement Learning, MTRL)中占据主导地位,而同策略(on-policy)方法如近端策略优化(Proximal Policy Optimization, PPO)仍然未被充分探索。我们诊断出PPO在MTRL中存在一个之前被忽视的问题:评论员侧梯度病态(ill-conditioning),这可能导致尾任务停滞,而简单任务主导价值函数的更新。为了解决这个问题,我们提出了TOPPO(Tail-Optimized PPO),通过评论员平衡(Critic Balancing)重新构造PPO——一组改善梯度条件和在任务间平衡学习动态的模块。与依赖模块化架构或大型模型的先前方法不同,TOPPO针对PPO本身的优化瓶颈。实证结果表明,TOPPO在Meta-World+基准上实现了比已发布的SAC系列和ARS系列基线更强的平均和尾任务性能,同时使用的参数和环境步数显著更少。值得注意的是,TOPPO在训练初期与强大的SAC基线相匹配或超越,并在全预算下保持优越性能。消融实验确认了TOPPO中每个模块的有效性,并提供了对其相互作用的深入见解。我们的结果表明,通过适当的优化,同策略方法可以在MTRL中与离策略方法相媲美或超越,挑战了对SAC的普遍依赖,并突出了评论员侧梯度条件作为核心瓶颈。
cs.AI / 32 / 2605.11478

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

FibQuant:用于随机访问 KV 缓存压缩的通用向量量化
Lee, Namyoon, Kim, Yongjune
Abstract
Long-context inference is increasingly a memory-traffic problem. The culprit is the key-value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of $k$ consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce FibQuant, a universal fixed-rate vector quantizer that keeps the same normalize-rotate-store interface while replacing scalar tables by a shared radial-angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci/Roberts-Kronecker quasi-uniform directions, and multi-restart Lloyd-Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, FibQuant traces a memory-fidelity frontier from $5\times$ compression at $0.99$ attention cosine similarity to $34\times$ at $0.95$. End-to-end on TinyLlama-1.1B, it is within $0.10$ perplexity of fp16 at $4\times$ compression and has $3.6\times$ lower perplexity than scalar TurboQuant at $b = 2$ ($8\times$ compression), where scalar random-access quantization begins to fail.
Chinese Translation
长上下文推理日益成为一个内存流量问题。罪魁祸首是键值(KV)缓存:它随着上下文长度、批量大小、层数和头数的增加而增长,并且在每个解码步骤中都会被读取。基于旋转的标量编码器通过存储一个范数、应用共享随机旋转,并一次量化一个坐标来满足这一系统约束。它们是通用的和随机访问的,但它们丢弃了归一化步骤所产生的几何结构。在 Haar 旋转之后,一组 $k$ 个连续坐标并不是一个乘积源;它是单位球上的球面 Beta 源。我们引入了 FibQuant,一种通用的固定速率向量量化器,它保持相同的归一化-旋转-存储接口,同时用一个与该典型源匹配的共享径向-角度代码本替代标量表。该代码本结合了 Beta-分位半径、Fibonacci/Roberts-Kronecker 准均匀方向和多重重启的 Lloyd-Max 精炼。我们证明了所得到的向量编码在匹配速率下严格优于其标量乘积专门化,具有高率增益,该增益分为一个单元形状因子和一个密度匹配因子。同样的构造提供了一个稠密速率轴,包括分数位和小于一位的操作点,无需校准或可变长度地址。在 GPT-2 小型 KV 缓存上,FibQuant 描绘了一个内存-保真度边界,从 $5\times$ 压缩在 $0.99$ 注意力余弦相似度到 $34\times$ 在 $0.95$。在 TinyLlama-1.1B 上的端到端实验中,它在 $4\times$ 压缩下的困惑度与 fp16 相差不超过 $0.10$,并且在 $b = 2$($8\times$ 压缩)时的困惑度比标量 TurboQuant 低 $3.6\times$,此时标量随机访问量化开始失效。
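For readers unfamiliar with Fibonacci quasi-uniform directions, the sketch below builds a Fibonacci lattice on the unit sphere in three dimensions and quantizes a direction to its nearest codeword. It covers only the angular part for a block of 3 coordinates; the Beta-quantile radii, the Roberts-Kronecker construction for other block sizes, and the Lloyd-Max refinement are not reproduced. Function names are hypothetical and this is not the FibQuant codebook itself.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))

def fibonacci_sphere(n):
    """Return n quasi-uniform unit vectors on the 2-sphere."""
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n      # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)         # radius of the circle at height z
        theta = GOLDEN_ANGLE * i           # rotate by the golden angle each step
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points

def quantize_direction(v, codebook):
    """Index of the codeword with the largest inner product with v."""
    return max(range(len(codebook)),
               key=lambda i: sum(a * b for a, b in zip(v, codebook[i])))
```

A shared codebook of 256 directions gives a fixed-length 8-bit address per 3-coordinate block, so any block can be decoded independently, which is the random-access property the abstract emphasizes.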
cs.AI / 33 / 2605.11484

Engagement Process: Rethinking the Temporal Interface of Action and Observation

参与过程:重新思考行动与观察的时间接口
Li, Jialian, Cao, Yuchen, Liu, Junhong, Guo, Weiran, Wang, Xutao, Song, Jiaming, Zhang, Jiahao, Chen, Jie
Abstract
Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation-action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action-observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.
Chinese Translation
在数字和物理环境中,任务完成越来越涉及复杂的时间交互,其中行动和观察在不同的时间尺度上展开,而不是与固定的观察-行动步骤对齐。为了建模这种交互,我们提出了参与过程(Engagement Process, EP),这是一种交互形式,继承了部分可观测马尔可夫决策过程(POMDPs)的决策理论结构,同时在行动-观察接口中明确时间。EP将行动和观察表示为沿时间解耦的事件流,而不是在固定决策步骤中配对的更新。该接口捕捉了单代理的时间问题,如深思延迟、反馈延迟和持续行动,同时支持更丰富的代理端组织、多速率协调以及子系统之间的组合交互。在玩具实验、LLM代理实验和学习实验中,EP揭示了被基于步骤的接口隐藏的时间行为,并使策略能够在明确的时间成本下进行适应。
cs.AI / 34 / 2605.11496

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

评估差异:当前沿人工智能模型意识到它们正在被测试时
Vishwarupe, Varad, Shadbolt, Nigel, Jirotka, Marina, Flechais, Ivan
Abstract
Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.
Chinese Translation
最近来自前沿实验室的证据表明,当代人工智能模型能够识别评估环境,潜在地表示这些环境,并在这些环境下的行为与在持续部署条件下的行为有所不同。Anthropic的BrowseComp事件、自然语言自编码器在SWE-bench验证和破坏性编码评估中的发现,以及OpenAI/Apollo的反策划工作都记录了这一现象的实例。我们认为,这些发现为从前沿评估中得出的安全结论创造了一个有效性问题。我们引入了评估差异(Evaluation Differential, ED),这是在识别评估和持续部署环境之间的目标行为属性的条件性差异,并定义了一种归一化效应大小形式(nED)以便于跨属性比较,并证明边际评估分数无法识别ED。我们根据文档化的差异,开发了一种安全声明的类型学(ED稳定、ED降级、ED反转、ED未确定),并指定TRACE(声明评估的测试识别审计),这是一种审计协议,包裹现有的评估基础设施,并产生限制性声明而非能力分数。我们将该框架追溯应用于三个公开记录的评估事件,并讨论其对系统卡、合规评估以及国际人工智能安全与安全机构网络的治理影响。TRACE并不能消除对抗性适应;它通过明确指出产生该证据的条件来规范从评估证据中得出的声明。
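The claim that marginal evaluation scores cannot identify ED can be made concrete with a two-model toy example. This is an illustration built on assumptions, not the paper's definition or proof: ED is taken as a plain difference in behaviour rates, and unrecognised evaluation episodes are treated as a stand-in for deployment-continuous contexts.

```python
import math

def marginal_score(p_recognised, rate_recognised, rate_unrecognised):
    """Observed evaluation score, mixing recognised and unrecognised episodes."""
    return (p_recognised * rate_recognised
            + (1.0 - p_recognised) * rate_unrecognised)

def evaluation_differential(rate_recognised, rate_unrecognised):
    """Toy ED: behaviour-rate gap between the two context types."""
    return rate_recognised - rate_unrecognised

# Two hypothetical models: (recognition probability, safe-behaviour rate when
# recognised, safe-behaviour rate when not recognised).
model_a = (0.5, 0.9, 0.7)
model_b = (0.5, 0.8, 0.8)
```

Both hypothetical models produce the same marginal score, yet one behaves differently when it recognises a test and the other does not, so the score alone cannot distinguish them.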
cs.AI / 35 / 2605.11505

Selective Off-Policy Reference Tuning with Plan Guidance

带计划引导的选择性离策略参考调优
Le, Duc Anh, Nguyen, Tien-Phat, Nguyen, Thien Huu, Van, Linh Ngo, Le, Trung
Abstract
Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.
Chinese Translation
具有可验证奖励的强化学习有助于推理,但GRPO风格的方法在所有采样回合失败的困难提示上停滞不前。SORT为这些失败添加了修复更新,而不改变回合生成:它从参考解决方案中推导出一个计划,比较有该计划和没有该计划的标记概率,并对在计划条件下变得更可预测的标记给予更高的权重。这将所有错误的提示转变为选择性、结构感知的学习信号,而不是均匀模仿。在三个基础模型和八个推理基准上,SORT在GRPO和指导基线之上取得了改进,尤其在较弱模型上获得了最大的提升。
cs.AI / 36 / 2605.11509

Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

基于层次化大型语言模型驱动的高空平台辅助无人机网络控制:飞行与连接的联合优化
Yan, Zijiang, Zhou, Hao, Jaafar, Wael, Pei, Jianhua, Wang, Ping, Yanikomeroglu, Halim, Tabassum, Hina
Abstract
Uncrewed aerial vehicles (UAVs) are increasingly deployed in complex networked environments, yet the joint optimization of multi-UAV motion control and connectivity remains a fundamental challenge. In this paper, we study a multi-UAV system operating in an integrated terrestrial and non-terrestrial network (ITNTN) comprising terrestrial base stations and high-altitude platform stations (HAPS). We consider a three-dimensional (3D) aerial highway scenario where UAVs must adapt their motion to ensure collision avoidance, efficient traffic flow, and reliable communication under dynamic and partially observable conditions. We first model the problem as a hierarchical multi-objective partially observable Markov decision process (H-MO-POMDP), capturing the coupling between control and communication objectives. Based on this formulation, we propose a large language model (LLM)-driven hierarchical multi-rate control framework. At the global level, an LLM-based controller on the HAPS performs long-term planning for load balancing and handover decisions. At the local level, each UAV employs a hybrid controller that integrates a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for faster UAV-to-infrastructure (U2I) communication and motion control. We further develop a high-fidelity 3D simulation platform by integrating the gym-pybullet-drones environment with 3GPP-compliant RF/THz channel models. Numerical results demonstrate that the proposed framework significantly outperforms state-of-the-art baselines, achieving a 14% increase in transportation efficiency and a 25% improvement in telecommunication throughput. Additionally, it achieves a 23% reduction in physical collision rates, demonstrating strong handover stability and zero-shot generalization in dynamic scenarios.
Chinese Translation
无人驾驶飞行器(UAV)在复杂的网络环境中越来越多地被部署,但多无人机运动控制与连接的联合优化仍然是一个基本挑战。本文研究了一个在集成地面与非地面网络(ITNTN)中运行的多无人机系统,该网络包括地面基站和高空平台站(HAPS)。我们考虑了一个三维(3D)空中高速公路场景,其中无人机必须调整其运动以确保避免碰撞、高效的交通流和在动态及部分可观测条件下的可靠通信。我们首先将问题建模为一个层次化多目标部分可观测马尔可夫决策过程(H-MO-POMDP),捕捉控制与通信目标之间的耦合关系。基于这一模型,我们提出了一种基于大型语言模型(LLM)的层次化多速率控制框架。在全局层面,HAPS上的LLM控制器进行负载平衡和切换决策的长期规划。在局部层面,每个无人机采用一种混合控制器,该控制器将慢时间尺度的LLM用于高层次空间推理与强化学习代理结合,用于更快的无人机与基础设施(U2I)通信和运动控制。我们进一步通过将gym-pybullet-drones环境与符合3GPP标准的射频/太赫兹(RF/THz)信道模型集成,开发了一个高保真度的3D仿真平台。数值结果表明,所提出的框架显著优于最先进的基线,运输效率提高了14%,电信吞吐量改善了25%。此外,物理碰撞率减少了23%,在动态场景中表现出强大的切换稳定性和零样本泛化能力。
cs.AI / 37 / 2605.11518

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

AutoLLMResearch:为自动化大语言模型实验配置训练研究代理 -- 从低成本学习,优化高成本
Guo, Taicheng, Chawla, Nitesh V., Wiest, Olaf, Zhang, Xiangliang
Abstract
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
Chinese Translation
有效配置可扩展的大语言模型(LLM)实验,包括架构设计、超参数调优等,对于推动LLM研究至关重要,因为不当的配置选择可能浪费大量计算资源,并阻碍模型发挥其全部潜力。以往的自动化方法主要针对低成本环境设计,其中重复的试错过程是可行的,但可扩展的LLM实验过于昂贵,无法进行如此广泛的迭代。据我们所知,尚无研究解决高成本LLM实验配置的自动化问题,这使得这一问题仍然需要大量人力,并依赖专家的直觉。基于这一空白,我们提出了AutoLLMResearch,一个代理框架,模拟人类研究人员如何从低保真实验中学习可推广的原则,并推断出在高成本LLM环境中有效识别有前景配置的方法。核心挑战在于如何使代理通过与一个多保真实验环境的互动来学习,该环境捕捉了LLM配置空间的结构。为此,我们提出了一个系统框架,包含两个关键组件:1)LLMConfig-Gym,一个包含四个关键LLM实验任务的多保真环境,并由超过一百万 GPU 小时的可验证实验结果提供支持;2)一个结构化的训练管道,将配置研究形式化为长时间范围的马尔可夫决策过程,并相应地激励跨保真推断推理。针对多种强基线在保留实验上的广泛评估表明,我们的框架在有效性、泛化性和可解释性方面表现出色,支持其作为可扩展的现实世界LLM实验自动化的实用和通用解决方案的潜力。
cs.AI / 38 / 2605.11519

Controllable User Simulation

可控用户模拟
Tennenholtz, Guy, Meshi, Ofer, Globerson, Amir, Shalit, Uri, Jeong, Jihwan, Boutilier, Craig
Abstract
Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.
Chinese Translation
使用离线数据集评估对话代理常常无法覆盖稀有场景或支持测试新策略。这促使了可控用户模拟器的使用,以进行针对性的反事实评估,通常通过提示或微调大型语言模型来实现。在本研究中,我们将可控模拟形式化为一个因果推断问题。通过将自然语言评估与离策略评估方法论相结合,我们表明,通过对后验轨迹标签进行监督微调来训练模拟器的标准做法会导致结构性偏差模型。具体而言,这些标签与数据生成行为策略密不可分,注入了前瞻性偏差,破坏了因果一致性。此外,我们证明在策略转变下,这种失败会导致评估指标的方差几何级数爆炸,这一现象我们称之为可控性崩溃。为了恢复因果一致性,我们建立了准确模拟的理论条件,并提出了实际的训练缓解措施:先验控制、逐步动态控制和直接策略条件学习。实证评估确认,虽然标准的全局控制扭曲了对话分布并导致行为多样性的崩溃,但我们基于因果的模拟器消除了前瞻性偏差,保持了自然方差,并在未见代理行为上展现出强大的零样本泛化能力。
cs.AI / 39 / 2605.11532

Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation

阅读、提取和综合:诊断跨领域种子曝光以促进大型语言模型研究构思
Choi, Yunju, Song, Min
Abstract
The discovery of novel methodologies for emerging problems is a continuing cycle in ML, often driven by the migration of techniques across domains. Building on this observation, we ask whether current LLM ideation systems benefit from targeted cross-domain retrieval or simply from exposure to diverse mechanisms. We study this question through PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction via read, grep, and bash over an isolated paper environment, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) method synthesis from retrieved seeds, each scored by rubric-based judges. Tool-augmented extraction improves specificity, and paraphrase-based retrieval broadens domain coverage. In synthesis, cross-domain retrieval receives more pairwise novelty wins than no-retrieval and same-domain baselines, but shows no significant difference from a random diverse-seed control. These findings suggest LLM ideation systems benefit from diverse seed exposure, but do not yet reliably exploit the semantic reason particular seeds were retrieved. We release the seed library, rubric prompts, and run scripts at https://github.com/yunjoochoi/PaperGym
Chinese Translation
在机器学习(ML)领域,针对新兴问题发现新方法的过程是一个持续的循环,通常由技术在不同领域之间的迁移驱动。基于这一观察,我们探讨当前大型语言模型(LLM)构思系统是否从有针对性的跨领域检索中受益,或者仅仅是从对多样化机制的曝光中获益。我们通过 PaperGym 进行研究,这是一个三阶段的流程:(1)通过在隔离论文环境中使用阅读、提取和 Bash 工具增强的种子提取;(2)通过在七个 ML 领域之间的改写进行跨领域种子检索;(3)从检索到的种子中合成方法,每个方法由基于评分标准的评审进行评分。工具增强的提取提高了特异性,而基于改写的检索扩大了领域覆盖范围。在合成阶段,跨领域检索在配对新颖性上比无检索和同领域基线获得更多胜利,但与随机多样种子控制组之间没有显著差异。这些发现表明,LLM 构思系统受益于多样化的种子曝光,但尚未可靠地利用特定种子被检索的语义原因。我们在 https://github.com/yunjoochoi/PaperGym 发布了种子库、评分提示和运行脚本。
cs.AI / 40 / 2605.11544

Optimal LTLf Synthesis

最优 LTLf 合成
Cao, Yujian, Schewe, Sven, Tang, Qiyi, Zhu, Shufang
Abstract
Strategy synthesis typically follows an all-or-nothing paradigm, returning unrealisable whenever a specification cannot be guaranteed in an uncertain environment. In this paper, we introduce optimal LTLf synthesis, where the goal is to realise as many objectives as possible from a given specification consisting of multiple objectives, especially for the case that they are not all jointly realisable. We first consider max-guarantee synthesis, which commits to a maximal set of objectives that we can a priori guarantee to realise. We then introduce max-observation synthesis, which maximises a posteriori realised objectives that may be incomparable on different executions. Finally, we present incremental max-observation synthesis, which further improves strategies by exploiting opportunities for stronger guarantees when they arise during an execution. Experimental results show that different variations of optimal synthesis scale broadly equally well, solving a large fraction of the benchmark instances within the given timeout, demonstrating the practical feasibility of the approach.
Chinese Translation
策略合成通常遵循全有或全无的范式,当在不确定环境中无法保证某个规范时,返回不可实现。在本文中,我们引入了最优 LTLf 合成,其目标是尽可能实现给定规范中包含的多个目标,特别是在这些目标并非全部可以共同实现的情况下。我们首先考虑最大保证合成(max-guarantee synthesis),该方法承诺实现我们可以事先保证的最大目标集。接着,我们介绍最大观察合成(max-observation synthesis),该方法最大化在不同执行中可能不可比较的后验实现目标。最后,我们提出增量最大观察合成(incremental max-observation synthesis),通过在执行过程中利用出现的更强保证机会进一步改进策略。实验结果表明,不同变体的最优合成在规模上表现良好,在给定的超时内解决了大量基准实例,展示了该方法的实际可行性。
cs.AI / 41 / 2605.11556

Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers

事后提示蒸馏:基于无链条思维答案的SWE代理的支架推理
Wang, Shengjie, Li, Guanghe, Yang, Zonghan, Gao, Yang
Abstract
Solving complex long-horizon tasks requires strong planning and reasoning capabilities. Although datasets with explicit chain-of-thought (CoT) rationales can substantially benefit learning, they are costly to obtain. To address this challenge, we propose Hindsight Hint Distillation (HHD), which only requires easy-to-obtain question-answer pairs without CoT annotations. Inspired by how human teachers use student mistakes to provide targeted guidance, HHD synthesizes hindsight hints from the model's own failed self-rollouts and uses them to scaffold on-policy rollouts that successfully complete the tasks. The model then self-distills these scaffolded trajectories and generalizes to new problems without hint guidance. Experiments show that HHD significantly outperforms iterative RFT and trajectory-synthesis baselines, achieving an absolute improvement of 8% on SWE-bench Verified, while all baselines improve by only around 2%. Notably, the reasoning strategies induced by HHD generalize effectively to out-of-distribution tasks, yielding the largest gains on SWE-bench Multilingual despite no training on multilingual data. These results demonstrate that HHD can effectively synthesize expert-like reasoning from CoT-free data and substantially improve long-horizon performance.
Chinese Translation
解决复杂的长时间任务需要强大的规划和推理能力。尽管具有明确链条思维(CoT)推理的数据集可以显著促进学习,但获取这些数据集的成本较高。为了解决这一挑战,我们提出了事后提示蒸馏(Hindsight Hint Distillation, HHD),该方法仅需易于获取的无CoT注释的问题-答案对。HHD的灵感来源于人类教师如何利用学生的错误提供有针对性的指导,它从模型自身失败的自我回滚中合成事后提示,并利用这些提示来支架成功完成任务的同策略(on-policy)回滚。然后,模型自我蒸馏这些支架轨迹,并在没有提示指导的情况下推广到新问题。实验表明,HHD显著优于迭代的RFT和轨迹合成基线,在SWE-bench Verified上实现了8%的绝对提升,而所有基线的提升仅约为2%。值得注意的是,HHD诱导的推理策略在分布外任务中有效推广,尽管没有在多语言数据上进行训练,但在SWE-bench Multilingual上获得了最大的提升。这些结果表明,HHD能够有效地从无CoT数据中合成专家级推理,并显著提高长时间性能。
cs.AI / 42 / 2605.11569

Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics

双时间维度长短期记忆网络与混合注意力机制在航空乘客载客率预测中的应用:整合航班内外的预订动态
Islam, ASM Nazrul, Kabir, Md. Hasanul, Ali, Md. Liakot, Sana, Joydeb Kumar
Abstract
Accurate short-term demand forecasting is crucial to airline revenue management, yet most existing systems fail to meet this need because current models treat booking data as a single temporal dimension, either the accumulation of bookings for a specific flight or the historical booking profile of the same route. This unidimensional view discards information carried by the other temporal stream and forecasting absolute passenger counts introduces a further operational fragility when change in planned aircraft type alters total seat capacity. This study addresses both limitations. A dual-stream Long Short-Term Memory (LSTM) integrated with attention framework is proposed that simultaneously processes two complementary input sequences: a horizontal sequence capturing intra-flight booking accumulation over the days preceding departure, and a vertical sequence capturing inter-flight booking patterns at fixed days-before-departure offsets across historical flights. Multiple dual-stream architectural variants, combining self-attention, cross-attention, and hybrid attention with concatenation, residual, and gated fusion strategies, are developed and evaluated. Experiments on real-world reservation data from the national airline of Bangladesh, Biman Bangladesh Airlines (BBA), demonstrate that the proposed hybrid model achieves a Mean Absolute Error of 2.8167 and a coefficient of determination ($R^{2}$) of 0.9495, outperforming single-stream baselines, tree-based models, and three prior dual-LSTM architectures applied to the same data. Validation across four flight category pairs; domestic versus international, direct versus transit, high versus low frequency, and short versus mid versus long haul confirms that the model generalizes across operationally diverse route types. Biman Bangladesh Airlines (BBA) has officially integrated this methodology into its operations.
Chinese Translation
准确的短期需求预测对航空公司收入管理至关重要,但现有大多数系统未能满足这一需求,因为当前模型将预订数据视为单一时间维度,即特定航班的预订累积或同一路线的历史预订概况。这种单维视角忽略了另一时间流所携带的信息,并且预测绝对乘客人数在计划的飞机类型改变时会引入进一步的操作脆弱性,影响总座位容量。本研究解决了这两种局限性。我们提出了一种集成注意力机制的双流长短期记忆网络(LSTM),同时处理两个互补的输入序列:一个是捕捉出发前几天航班内预订累积的横向序列,另一个是捕捉历史航班在固定出发前天数偏移下的航班间预订模式的纵向序列。我们开发并评估了多种双流架构变体,结合了自注意力、交叉注意力和混合注意力的拼接、残差和门控融合策略。对孟加拉国国家航空公司Biman Bangladesh Airlines(BBA)的真实预订数据进行的实验表明,所提出的混合模型实现了平均绝对误差为2.8167,决定系数($R^{2}$)为0.9495,优于单流基线、基于树的模型以及三种先前应用于相同数据的双LSTM架构。在四对航班类别(国内与国际、直飞与中转、高频与低频、短途与中途与长途)上的验证确认了该模型在操作上多样的航线类型中的泛化能力。Biman Bangladesh Airlines(BBA)已正式将该方法整合到其运营中。
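The two input streams described above can be sketched as simple slices of a booking matrix. The data layout is an assumption made for illustration: `load_factors[f][d]` holds the load factor (bookings divided by seat capacity) of flight `f` as observed `d` days before departure, with flights ordered chronologically. The function name and window convention are hypothetical.

```python
def build_streams(load_factors, flight_idx, days_before, window):
    """Return (horizontal, vertical) input sequences for one forecast point.

    horizontal: the target flight's own booking curve over the last `window`
                observation days, oldest first (intra-flight accumulation).
    vertical:   the load factor at the same days-before-departure offset for
                the `window` preceding flights (inter-flight pattern).
    """
    row = load_factors[flight_idx]
    if flight_idx < window:
        raise ValueError("not enough historical flights for the vertical stream")
    if days_before + window > len(row):
        raise ValueError("not enough booking history for the horizontal stream")
    horizontal = [row[d] for d in range(days_before + window - 1, days_before - 1, -1)]
    vertical = [load_factors[f][days_before]
                for f in range(flight_idx - window, flight_idx)]
    return horizontal, vertical
```

Working in load factors rather than passenger counts keeps both streams comparable when the planned aircraft type, and therefore seat capacity, changes between flights.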
cs.AI / 43 / 2605.11595

Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI

贝叶斯置信传播神经网络的原生可解释性:面向可信类脑人工智能的框架
Makridis, Georgios, Fatouros, Georgios, Soldatos, John, Katsis, George, Kyriazis, Dimosthenis
Abstract
The EU Artificial Intelligence Act (Regulation 2024/1689), fully applicable to high-risk systems from August 2026, creates urgent demand for AI architectures that are simultaneously trustworthy, transparent, and feasible to deploy on resource-constrained edge devices. Brain-like neural networks built on the Bayesian Confidence Propagation Neural Network (BCPNN) formalism have re-emerged as a credible alternative to backpropagation-driven deep learning. They deliver state-of-the-art unsupervised representation learning, neuromorphic-friendly sparsity, and existing FPGA implementations that target edge deployment. Despite this momentum, no systematic framework exists for explaining BCPNN decisions -- a gap the present paper fills. We argue that BCPNN is, in the sense of Rudin's interpretable-by-design agenda, an inherently transparent model whose architectural primitives map directly onto established explainable-AI (XAI) families. We make four contributions. First, we propose the first XAI taxonomy for BCPNN. It maps weights, biases, hypercolumn posteriors, structural-plasticity usage scores, attractor dynamics, and input-reconstruction populations onto attribution, prototype, concept, counterfactual, and mechanistic explanation modalities. Second, we introduce sixteen architecture-level explanation primitives (P1--P16), several without analogue in standard ANNs. We provide closed-form algorithms for computing each from quantities the model already maintains. Third, we introduce five design-time Configuration-as-Explanation primitives (Config-P1 to Config-P5) that treat BCPNN hyperparameter choices as an auditable pre-deployment explanation artifact. Fourth, we sketch a roadmap for integration into industrial IoT deployments and discuss EU AI Act alignment, edge feasibility, and Industry 5.0 implications.
Chinese Translation
欧盟人工智能法案(Regulation 2024/1689)将于2026年8月全面适用于高风险系统,这对同时具备可信性、透明性并能在资源受限的边缘设备上部署的人工智能架构提出了迫切需求。基于贝叶斯置信传播神经网络(BCPNN)形式的脑类神经网络重新成为反向传播驱动的深度学习的可信替代方案。它们提供了最先进的无监督表示学习、神经形态友好的稀疏性以及现有的针对边缘部署的FPGA实现。尽管有这种势头,目前尚无系统框架来解释BCPNN的决策——这是本文所填补的空白。我们认为,BCPNN在Rudin的可解释设计议程的意义上,是一个固有透明的模型,其架构原语直接映射到已建立的可解释人工智能(XAI)家族。我们做出了四个贡献。首先,我们提出了BCPNN的第一个XAI分类法。它将权重、偏置、超列后验、结构可塑性使用分数、吸引子动态和输入重构种群映射到归因、原型、概念、反事实和机制解释模式。其次,我们介绍了十六个架构级解释原语(P1--P16),其中一些在标准人工神经网络中没有对应物。我们提供了计算每个原语的封闭形式算法,这些算法基于模型已经维护的量。第三,我们引入了五个设计时的配置作为解释原语(Config-P1至Config-P5),将BCPNN的超参数选择视为可审计的预部署解释文档。第四,我们勾勒了与工业物联网部署集成的路线图,并讨论了与欧盟人工智能法案的一致性、边缘可行性及工业5.0的影响。
cs.AI / 44 / 2605.11603

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

GAR:通过约束优化实现的碳感知路由用于大语言模型推理
Sheshanarayana, Disha, Pal, Rajat Subhra, Sinha, Manjira, Dasgupta, Tirthankar
Abstract
The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
Chinese Translation
大型语言模型(LLMs)的日益普及使得每个请求的路由在平衡响应质量和异构模型池的计算成本方面变得至关重要。尽管电网碳强度因时间和地区而异,并且模型在能耗上存在显著差异,目前的路由方法很少将可持续能源使用和二氧化碳排放作为优化目标。为了解决这一问题,我们提出了绿色感知路由(Green-Aware Routing, GAR),这是一种约束多目标优化框架,旨在最小化每个请求的二氧化碳排放,同时满足明确的准确性底线和第95百分位延迟服务水平目标(SLOs)。GAR通过每个数据集的底线调优采用自适应约束优化,并结合轻量级估算器来评估正确性、尾部延迟和碳排放,从而实现实时路由决策而无需额外的推理过程。我们提出了GAR-PD,这是一种实用的在线原始-对偶路由算法,适用于动态碳预算,同时提供启发式变体,以实现高可行性覆盖并限制准确性下降。在标准自然语言处理基准测试中对异构LLM池(7B-70B)进行的全面实验表明,GAR在保持竞争性准确性和第95百分位延迟保证的同时,实现了显著的碳减排,为可持续的LLM推理提供了一种实用且理论基础扎实的方法。
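The constrained routing idea in the GAR abstract can be illustrated with a minimal sketch. This is not the authors' GAR-PD algorithm: the `Candidate` fields, the `route` function, the fallback rule, and all numbers below are hypothetical, and the estimates are assumed to come from lightweight predictors like those the abstract mentions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """One model in the pool, with hypothetical per-request estimates."""
    name: str
    est_accuracy: float        # estimated probability of a correct answer
    est_p95_latency_s: float   # estimated p95 latency in seconds
    est_energy_kwh: float      # estimated energy per request
    grid_gco2_per_kwh: float   # carbon intensity of the hosting region right now

    @property
    def est_gco2(self) -> float:
        return self.est_energy_kwh * self.grid_gco2_per_kwh


def route(pool: Sequence[Candidate], accuracy_floor: float,
          latency_slo_s: float) -> Candidate:
    """Pick the lowest-carbon candidate that meets the accuracy floor and latency SLO."""
    feasible = [c for c in pool
                if c.est_accuracy >= accuracy_floor
                and c.est_p95_latency_s <= latency_slo_s]
    if not feasible:
        # Assumed fallback (not from the paper): prefer accuracy when nothing is feasible.
        return max(pool, key=lambda c: c.est_accuracy)
    return min(feasible, key=lambda c: c.est_gco2)


pool = [
    Candidate("small-7b", 0.70, 0.5, 0.001, 700.0),   # small model, carbon-heavy grid
    Candidate("mid-13b", 0.82, 1.0, 0.003, 100.0),    # larger model, cleaner grid
    Candidate("large-70b", 0.90, 2.5, 0.010, 100.0),
]
print(route(pool, accuracy_floor=0.60, latency_slo_s=2.0).name)  # mid-13b
```

In this toy pool the mid-sized model wins even under a loose accuracy floor, because its region's grid is cleaner. That reflects the abstract's point that carbon intensity varies by time and region, so the smallest model is not always the lowest-emission choice.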
cs.AI / 45 / 2605.11611

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

CuSearch: 基于搜索深度的课程展开采样用于自主检索增强生成
Shen, Jianghan, Luo, Siqi, Cheng, Xinyu, Xiong, Jing, Li, Yue, Liu, Jiyao, Lin, Jiashi, Chen, Yirong, He, Junjun
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at https://github.com/MrToser/CuSearch.
Chinese Translation
可验证奖励的强化学习(RLVR)已成为从仅基于结果的监督中训练自主检索增强生成(RAG)系统的一个有前景的范式。现有的大多数方法通过均匀采样的展开轨迹来优化策略,隐含地将所有轨迹视为同等信息量。然而,轨迹在搜索深度上存在显著差异,因此并非所有轨迹的信息量相同:深度搜索轨迹包含更多的检索决策点,并为检索子策略提供更密集的直接监督。此外,随着训练的进行,这种异质性在批内深度分布向更高值的转变中不断增加,而均匀展开采样对此转变却视而不见。为了解决这个问题,我们提出了CuSearch,一个基于搜索深度贪婪分配(SDGA)的课程展开采样框架,该框架是一个批级操作符,旨在将固定的更新预算重新分配到深度搜索轨迹上。SDGA-Auto始终针对当前批次中可用的最深轨迹,从而在深度分布上升时产生隐式的训练对齐课程。SDGA-Phase则在深度轨迹变得足够丰富时显式推进课程阈值。在不同模型类型和检索框架下的实验表明,CuSearch始终提高性能,在ZeroSearch上相较于标准GRPO实现了高达11.8的精确匹配点。这些结果确立了每条轨迹的搜索深度作为RLVR基础上自主RAG训练中检索监督密度的可靠、无注释代理。代码可在 https://github.com/MrToser/CuSearch 获取。
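The two SDGA variants described in the CuSearch abstract can be sketched as follows. This is only an illustration under stated assumptions, not the released implementation: the function names, the tie-breaking rule, and the `min_count` abundance criterion are hypothetical, and each trajectory is reduced to a single integer search depth.

```python
from typing import List, Sequence


def sdga_auto(search_depths: Sequence[int], budget: int) -> List[int]:
    """Return indices of the `budget` deepest trajectories in the batch.

    Ties are broken by batch order (an assumption made for determinism).
    """
    order = sorted(range(len(search_depths)),
                   key=lambda i: (-search_depths[i], i))
    return sorted(order[:budget])


def sdga_phase_threshold(search_depths: Sequence[int], threshold: int,
                         min_count: int) -> int:
    """Advance the depth threshold by one once deeper trajectories are abundant.

    "Abundant" is modeled here as at least `min_count` trajectories with
    depth >= threshold + 1; the paper's actual criterion may differ.
    """
    deeper = sum(d >= threshold + 1 for d in search_depths)
    return threshold + 1 if deeper >= min_count else threshold


depths = [1, 3, 2, 3, 0]             # retrieval calls per rollout (toy batch)
print(sdga_auto(depths, budget=2))   # [1, 3]
```

Because `sdga_auto` always takes the deepest rollouts available, the selected depth rises on its own as the policy learns to search more, which is the implicit curriculum the abstract describes.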
cs.AI / 46 / 2605.11625

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

优雅的弃牌还是英雄的跟注:学习预算高效思维以实现自适应推理
Zhou, Zhaomeng, Zhang, Lan, Wang, Junyang, Yuan, Mu, Lin, Junda
Abstract
Large reasoning models (LRMs) improve problem solving through extended reasoning, but often misallocate test-time compute. Existing efficiency methods reduce cost by compressing reasoning traces or conditioning budget on perceived difficulty, yet largely overlook solvability. As a result, they may spend large budgets on queries beyond the model's capability while compressing hard-but-solvable queries that require deeper reasoning. In this work, we formulate adaptive reasoning as a computational investment under uncertainty, where budget should follow the expected return of reasoning rather than perceived difficulty alone. To instantiate this principle, we propose Budget-Efficient Thinking (BET), a two-stage framework that combines behavioral cold-start with GRPO under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, BET learns three behaviors: (1) short solve, answering easy queries concisely; (2) nice fold, abstaining early when continued reasoning has near-zero expected return; and (3) hero call, preserving sufficient compute for hard-but-solvable queries. Across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% on average while achieving overall performance improvements, and transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning with comparable efficiency gains.
Chinese Translation
大型推理模型(LRMs)通过扩展推理来改善问题解决能力,但在测试时常常错误分配计算资源。现有的效率方法通过压缩推理轨迹或根据感知难度调整预算来降低成本,但在很大程度上忽视了可解性。因此,它们可能会在超出模型能力的查询上花费大量预算,同时压缩那些需要更深层推理的困难但可解的查询。在本研究中,我们将自适应推理形式化为在不确定性下的计算投资,其中预算应遵循推理的预期回报,而不仅仅是感知难度。为了实现这一原则,我们提出了预算高效思维(Budget-Efficient Thinking, BET),这是一种结合行为冷启动与投资成本感知奖励的两阶段框架。通过将解决或弃牌的决策与基于回滚的可解性对齐,BET 学习了三种行为:(1)短期解决,简洁地回答简单查询;(2)优雅的弃牌,当继续推理的预期回报接近零时提前放弃;(3)英雄的跟注,为困难但可解的查询保留足够的计算资源。在七个基准测试和三个基础模型中,BET 平均减少了约 55% 的推理令牌,同时实现了整体性能的提升,并在数学推理到科学问答和逻辑推理的零样本转移中获得了可比的效率提升。
cs.AI / 47 / 2605.11633

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

大型语言模型代理能应对灾害吗?应急操作中异构地理空间推理的基准测试
Wang, Junjue, Xuan, Weihao, Qi, Heli, Dai, Pengyu, Liu, Kunyi, Chen, Hongruixuan, Zheng, Zhuo, Xia, Junshi, Ermon, Stefano, Yokoya, Naoto
Abstract
Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.
Chinese Translation
操作性灾害响应不仅仅是损害评估,它要求响应者整合多传感器信号,推理道路网络、人口和关键设施,规划疏散,并生成可操作的报告。然而,以往的研究主要孤立了遥感感知或评估通用工具的使用,导致应急操作的端到端工作流程尚未得到充分探索。本文介绍了灾害操作响应代理基准(Disaster Operational Response Agent benchmark,DORA),这是首个针对端到端灾害响应的代理基准:涵盖45个真实灾害事件的515个专家撰写的任务,涉及10种类型,并配有专家验证的、可重放的黄金轨迹,总计3,500个工具调用步骤。任务涵盖了五个维度,涵盖操作性灾害响应管道:灾害感知、空间关系分析、救援和疏散规划、时间演变推理以及多模态报告综合。代理从一个包含108种工具的MCP工具库中组合调用,处理异构地理空间数据:光学、合成孔径雷达(SAR)和多光谱影像,涵盖单时相、双时相和多时相序列(0.015-10米地面分辨率),并辅以高程和社会向量层。我们在基准上全面评估了13个前沿大型语言模型,揭示了三个持续的挑战:1)灾害领域的基础知识暴露出独特的失败模式(损害语义基础、传感器模态不匹配和灾害管道组合);2)代理在工具选择和参数基础上受到双重瓶颈的限制,其中黄金工具顺序提示仅提高了1.08-4.40%的准确性,而替代框架最多可获得3.24%的增益;3)组合脆弱性随着轨迹长度的增加而加剧,代理与黄金标准之间的差距在长管道上从7%扩大到56%。DORA为操作性可靠的灾害响应代理建立了一个严格的测试平台。
cs.AI / 48 / 2605.11636

Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

Seirênes:带有演变干扰的对抗自我博弈框架用于大语言模型推理
Zhang, Chi, Qiu, Haibo, Zhang, Qiming, Xu, Yufei, Gao, Xinbo, Zhang, Jing
Abstract
We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4–5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.
Chinese Translation
我们提出了Seirênes,这是一个自我博弈的强化学习(RL)框架,它将大语言模型(LLM)推理中的上下文干扰这一失败模式转化为内部训练信号,以共同演化出更具韧性的推理模型。尽管具有可验证奖励的强化学习显著提升了推理能力,但模型在遇到非理想化上下文时仍可能表现出脆弱性:这些场景的特征是多余的信息、旁支指令或与标准基准典型的干净分布不同的偶然相关性。Seirênes通过参数共享和对抗自我博弈循环利用了这一脆弱性。在这个框架内,单个模型被训练来构建合理但具有干扰性的上下文,以暴露其自身推理盲点,并通过从这些扰动中辨别出基本任务来解决问题,从而恢复核心逻辑。通过将这些相互竞争的目标对立起来,Seirênes迫使模型超越表面的模式匹配,并将其能力锚定在稳健的基础推理上。这种持续的互动维持了一个信息丰富的共同演化课程,随着模型的改进而不断发展。在七个数学推理基准测试和从4B到30B的模型规模中,Seirênes实现了平均提升分别为+10.2、+9.1和+7.2分。此外,4B Seirênes模型生成的干扰上下文使顶尖闭源模型(如GPT和Gemini)的准确率降低了大约4到5分,揭示了Seirênes在发现推理模型盲点方面的普遍能力。
cs.AI / 49 / 2605.11672

A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination

大型语言模型的类CAP三难问题:在语义不足确定性下的正确性、非偏见性和效用
Venugopal, Vinu Ellampallil
Abstract
The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness–non-bias–utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.
Chinese Translation
CAP定理指出,分布式系统在网络分区下无法同时保证一致性、可用性和分区容忍性。受到这一结果的启发,本文为大型语言模型(LLMs)提出了一个类CAP猜想。该三难问题指出,在语义不足确定性下,LLM无法始终同时保证强正确性、严格非偏见性和高效用。当给定的前提无法确定唯一答案时,提示被视为语义不足确定。在这种情况下,一个有用且决定性的回应要求模型引入选择标准、偏好、先验或价值排序。如果用户未提供这一标准或可用前提未能证明其合理性,回应在广义选择理论意义上会变得有偏见。相反,如果模型避免无支持的偏好,它可能保持正确性和非偏见性,但可能通过拒绝、模糊或澄清来降低效用。本文形式化了这一正确性-非偏见性-效用三难问题,发展了示例,并论证某些LLM的失败不仅源于模型的局限性,还源于不足确定的决策请求的结构。
cs.AI / 50 / 2605.11678

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

通过 CPU-GPU 内存交换实现无 OOM 的 Alpamayo 以支持视觉-语言-动作模型
Roh, Seungwoo, Kim, Huiyeong, Kim, Jong-Chan
Abstract
End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer–compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration (both the number and placement of resident layers) from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
Chinese Translation
端到端的视觉-语言-动作 (VLA) 模型用于自动驾驶,将感知、推理和控制统一在一个神经网络中,虽然实现了强大的驾驶性能,但需要 20-60GB 的 GPU 内存,远超普通 GPU 提供的 12-16GB。我们提出了一个框架,仅通过系统级优化即可在 VRAM 受限的 GPU 上实现内存高效的 VLA 推理,而无需对模型进行修改。我们的工作分为三个阶段:(1) 顺序需求分层将 VRAM 使用从模型级别降低到层级别的粒度;(2) 流水线需求分层通过传输-计算重叠在层执行时间内隐藏参数传输时间;(3) GPU 常驻层决策策略,基于每个模块的常驻效益分析,消除流水线无法隐藏的剩余传输开销。我们进一步提出了一种性能预测模型,该模型仅通过一次性能剖析运行即可确定最佳配置,包括常驻层的数量和位置,在所有配置中预测误差低于 1.3%。在 RTX 5070Ti (16GB) 上应用于 NVIDIA 的 Alpamayo-R1-10B (21.52GB) 时,我们的工作在保持完整 BF16 精度的同时,相较于 Accelerate 卸载实现了高达 3.55 倍的加速。
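A toy timing model may help show why the three stages in the abstract build on each other. It is a sketch under simplifying assumptions, not the paper's performance prediction model: one copy engine and one compute engine, the next layer's transfer starts when the current layer's compute starts, and the per-layer millisecond figures are invented for illustration. All function names are hypothetical.

```python
from typing import List, Sequence, Set


def sequential_latency(transfer_ms: Sequence[float],
                       compute_ms: Sequence[float]) -> float:
    """Load each layer, then run it: no overlap between transfer and compute."""
    return sum(t + c for t, c in zip(transfer_ms, compute_ms))


def pipelined_latency(transfer_ms: Sequence[float],
                      compute_ms: Sequence[float]) -> float:
    """Prefetch layer i+1 while layer i computes (requires at least one layer).

    Each step costs max(compute_i, transfer_{i+1}); only the first transfer
    is fully exposed.
    """
    total = transfer_ms[0]
    for i, c in enumerate(compute_ms):
        next_transfer = transfer_ms[i + 1] if i + 1 < len(transfer_ms) else 0.0
        total += max(c, next_transfer)
    return total


def with_resident(transfer_ms: Sequence[float], resident: Set[int]) -> List[float]:
    """Layers kept on the GPU need no transfer, so their transfer time is zero."""
    return [0.0 if i in resident else t for i, t in enumerate(transfer_ms)]


transfer = [4.0, 4.0, 4.0]   # hypothetical host-to-device copy time per layer
compute = [2.0, 5.0, 3.0]    # hypothetical execution time per layer

print(sequential_latency(transfer, compute))                     # 22.0
print(pipelined_latency(transfer, compute))                      # 16.0
print(pipelined_latency(with_resident(transfer, {1}), compute))  # 14.0
```

In this toy case pipelining cannot hide the transfer of layer 1, because layer 0 computes faster than layer 1 loads. Pinning layer 1 on the GPU removes exactly that exposed time, which is the kind of residual overhead the resident-layer policy is described as targeting.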
cs.AI / 51 / 2605.11679

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

通过偏好维度扩展解释和打破安全性与有用性之间的天花板
Huang, ShiYing, Lin, Liang, Li, Yuer, Luo, Kaiwen, Zhou, Zhenhong, Zhang, An, Dong, Junhao, Wang, Kun, Zeng, Zhigang
Abstract
In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://anonymous.4open.science/r/MORA-MPA.
Chinese Translation
在大型语言模型的多目标对齐领域,平衡不同的人类偏好通常表现为零和冲突。具体而言,竞争目标之间的内在紧张关系决定了,积极优化一个指标(例如,有用性)往往会对另一个指标(例如,无害性)造成显著的惩罚。尽管先前的研究主要集中在数据选择、参数合并或训练过程中的算法平衡上,但这些方法仅仅是在固定的帕累托前沿上迫使不同偏好之间的妥协,未能从根本上解决固有的权衡问题。在本研究中,我们从多维奖励的新视角来解决这一问题。通过扩大模型的回滚并分析不同奖励维度下的输出,我们得出了一个关键结论:多个目标之间的冲突源于提示本身固有地限制了可实现的多维奖励。基于这一核心观察,我们提出了MORA:多目标奖励同化。具体而言,MORA通过预采样隔离单一奖励提示,并通过重写原始问题以纳入多维意图来扩展其奖励多样性。大量实验表明:(1)在顺序对齐中,MORA在经过有用性、无害性和真实性维度的多偏好对齐后,实现了单一偏好提升范围为5%至12.4%,在无害性方面获得了显著提升。(2)在同时对齐中,MORA实现了平均总体奖励提升4.6%。我们的代码可在 https://anonymous.4open.science/r/MORA-MPA 获取。
cs.AI / 52 / 2605.11687

Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI

可信金融人工智能的持久和对话式多方法可解释性
Makridis, Georgios, Fatouros, Georgios, Soldatos, John, Katsis, George, Kyriazis, Dimosthenis
Abstract
Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts (LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps) as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.
Chinese Translation
金融机构日益需要持久的、跨方法验证的人工智能解释,并且这些解释能够以对话的方式为人类决策者所获取。我们提出了一种以人为中心的可解释人工智能架构,应用于金融情感分析,结合了三项贡献。首先,我们将可解释人工智能(XAI)产物——LIME特征归因、基于遮挡的词重要性评分和显著性热图——视为在分布式S3兼容存储中具有结构化元数据和自然语言摘要的持久可搜索对象,从而实现对解释历史的语义检索和系统故障后的自动索引重建。其次,我们实现了多方法解释三角测量,其中检索增强生成(RAG)助手比较和综合应用于同一预测的多种XAI方法的结果,使用户能够通过自然语言对话评估解释的稳健性。第三,我们通过对基础完整性、虚构声明和方法归因行为的自动检查,评估生成解释的真实性。我们在EXTRA-BRAIN金融情感分析管道上展示了该架构,使用FinBERT预测,并呈现评估结果,显示约束提示将虚构率降低了36%,并将方法归因引用增加了73%,与天真的提示相比。我们讨论了在受监管金融环境中可信的以人为中心的人工智能服务的意义。
cs.AI / 53 / 2605.11693

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

超越文本的关键测量:通过质量、对齐和多样性评估多模态摘要
Ali, Abid, Molla-Aliod, Diego, Naseem, Usman
Abstract
Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. Our analysis reveals a text-dominant hierarchy in this setting, where factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.
Chinese Translation
多模态大型语言模型(MLLMs)促进了多模态摘要生成(MSMO),在该过程中,系统生成简洁的文本摘要,并附以来自多模态来源的显著视觉内容。然而,目前的MSMO评估仍然存在碎片化的问题:文本质量、图像-文本对齐和视觉多样性通常使用单模态指标进行孤立评估,这使得难以捕捉各模态是否共同支持一个真实且有用的摘要。为了解决这一问题,我们引入了MM-Eval,一个统一的评估框架,整合了文本质量、跨模态对齐和视觉多样性的评估。MM-Eval包含三个组成部分:(1)文本质量,通过OpenFActScore测量事实一致性,并使用G-Eval评估连贯性、流畅性和相关性;(2)图像-文本相关性,通过MLLM作为评判者的方法进行评估;(3)图像集多样性,通过截断CLIP熵进行量化。我们通过在mLLM-EVAL新闻基准上训练的学习聚合模型对MM-Eval进行校准,使组件贡献与人类偏好对齐。我们的分析揭示了在这一设置中,文本主导的层级结构,其中事实一致性是感知整体质量的关键决定因素,而视觉相关性和多样性则提供了互补信号。MM-Eval在启发式聚合基线之上有所改进,并提供了一个可解释的、参考弱的框架,用于多模态摘要的比较评估。
cs.AI / 54 / 2605.11712

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

朝向稳定的价值对齐:引入独立模块以实现一致的价值引导
Chen, Wenhao, Sun, Sirui, Bai, Shengyuan, Song, Guojie
Abstract
Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.
Chinese Translation
将大型语言模型(LLMs)与人类价值观对齐通常依赖于后训练或推理时的引导,这直接操控主干的参数或表示空间。然而,存在一个关键的缺口:模型的残差流是高度动态的,其中价值观作为脆弱的低维属性存在,固有地与一致的价值表达所需的稳定性不兼容。本文提出了稳定价值引导变换器(Stable Value Guidance Transformer, SVGT),通过一个独立的价值模块来填补这一缺口,该模块包含两个关键设计:(1)独立的价值建模,在与主干隔离的专用价值空间中维持规范表示;(2)明确的行为引导,将这些稳定信号转化为可学习的潜在桥接令牌(Bridge Tokens)。这些令牌作为动态价值锚点,明确引导生成轨迹,确保在不同上下文中稳健遵循,而不干扰主干的内部表示。在多个主干和安全基准上的实验表明,SVGT通常将有害评分降低超过70%,同时保持生成流畅性,证明了基于架构的价值建模的有效性。我们的代码可在 https://github.com/Clervils/SVGT.git 获取。
cs.AI / 55 / 2605.11716

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

SafeSteer:一种针对多模态大语言模型的解码级防御机制
Zeng, Xinyi, Yang, Xue, Zhang, Jingyuan, Yan, Huanqian, Chen, Xiang, Wei, Kaiwen, Kang, Hankun, Tian, Yu
Abstract
Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address the above issues, we explore the inherent safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at the decoding stage. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSteer can improve MLLMs' safety by up to 33.40% without fine-tuning. Notably, it can maintain the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.
Chinese Translation
多模态大语言模型(MLLMs)正受到越来越多的关注。由于输入特征的异质性,它们在越狱防御方面面临重大挑战。目前的防御方法依赖于昂贵的微调或低效的事后干预,限制了它们应对新型攻击的能力,并涉及性能权衡。为了解决上述问题,我们探讨了MLLMs内在的安全能力,并量化了它们在解码阶段识别有害性的内在能力。我们观察到:1)MLLMs能够在解码过程中区分有害和无害的输入;2)基于图像的攻击更加隐蔽。基于这些见解,我们提出了SafeSteer,一种针对MLLMs的解码级防御机制。具体而言,它包括一个解码探测器(Decoding-Probe),这是一个轻量级探测器,用于在解码过程中检测和纠正有害输出,迭代地将解码过程引导向安全。此外,集成了一种模态语义对齐向量,以将强大的文本安全对齐转移到视觉模态。对多个MLLMs的实验表明,SafeSteer可以在不进行微调的情况下将MLLMs的安全性提高多达33.40%。值得注意的是,它能够保持MLLMs的有效性,确保其有用性与无害性之间的平衡。
cs.AI / 56 / 2605.11727

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

洞穴的寓言:基于测量的视觉-语言学习
Xu, Kepeng, Xu, Li, He, Gang, Yu, Wenxin
Abstract
Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
Chinese Translation
视觉-语言模型通常在后ISP RGB 图像上进行推理,尽管 RGB 渲染可能在推理之前裁剪、抑制或量化传感器证据。我们研究了当视觉接口更接近基础相机测量时,基础是否有所改善。我们提出了基于测量的视觉-语言学习,并将其实例化为 PRISM-VL,该模型结合了基于 RAW 的 Meas.-XYZ 输入、相机条件的基础和曝光包围监督聚合,以将监督从 RGB 代理转移到测量域观察。使用经过质量控制的 150K 指令调优集和一个针对低光、HDR、可见性敏感和幻觉敏感案例的保留基准,PRISM-VL-8B 达到了 0.6120 BLEU、0.4571 ROUGE-L 和 82.66% LLM-Judge 准确率,相较于 RGB Qwen3-VL-8B 基线提高了 +0.1074 BLEU、+0.1071 ROUGE-L 和 +4.46 个百分点。这些结果表明,部分 VLM 基础错误源于 RGB 渲染过程中丢失的信息,而保留测量域证据可以改善多模态推理。
cs.AI / 57 / 2605.11738

OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling

OptArgus:一种多智能体系统用于检测基于大语言模型的优化建模中的幻觉
Li, Zhong, Guo, Zihan, Lu, Xiaohan, Wang, Juntao, Song, Jie, Shen, Chao, Wu, Jiageng, Sun, Mingyang
Abstract
Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as optimization-modeling hallucination detection, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine-grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi-agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three-part benchmark suite with 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM-generated artifacts. Against a matched single-agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top-ranked localization on controlled single-error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization-modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy-grounded auditing is a practical route to more reliable optimization modeling.
Chinese Translation
大型语言模型(LLMs)越来越多地用于将自然语言优化问题转化为数学公式和求解器代码,但匹配参考目标值并不是一个可靠的正确性测试:一个伪影可能在数值上达成一致,但仍然改变基础的优化语义。我们将这一问题表述为优化建模幻觉检测,即对问题描述、符号模型和求解器实现进行结构一致性审计。我们开发了迄今为止首个专门针对优化建模的细粒度幻觉分类法,涵盖目标、变量、约束和实现失败。我们利用这一分类法设计了OptArgus,一个具有指挥路由、专业审计员和证据整合的多智能体检测器。为了评估这一设置,我们引入了一个由484个干净伪影、1266个受控注入伪影和6292个自然生成的LLM伪影组成的三部分基准套件。与匹配的单智能体基线相比,OptArgus在干净伪影上产生了更少的误报,在受控单错误案例中提供了更准确的高排名定位,并在自然模型输出上实现了更强的检测。总体而言,这些贡献将优化建模幻觉检测转变为一个具体的实证问题,并表明基于模块化和分类法的审计是实现更可靠优化建模的实用途径。
cs.AI / 58 / 2605.11746

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

推理痕迹何以变得具表演性:逐步证据表明思维链(Chain-of-Thought)是一个不完美的监督通道
Li, Wenkai, Yang, Fan, Hazarika, Ananya, Mehta, Shaunak A., Onoue, Koichi
Abstract
Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.
Chinese Translation
思维链(Chain-of-Thought, CoT)痕迹越来越多地被用于提升语言模型的能力和审计模型行为,隐含假设可见的痕迹与决定答案的计算保持同步。我们通过一个基于答案承诺代理的逐步检测-分类-比较(Detect-Classify-Compare)框架来检验这一假设,该框架与Patchscopes、调优镜头探针和因果方向消融进行了交叉验证。在九个模型和七个推理基准上,潜在承诺与显式答案到达的对齐平均仅为61.9%。主导的不匹配模式是虚构的延续:58.0%的检测到的不匹配事件发生在答案承诺代理已经稳定之后,而痕迹仍在生成看似深思熟虑的文本,空洞性分析显示在这些步骤中承诺的答案并未改变。在架构匹配的Qwen2.5/DeepSeek-R1-Distill比较中,推理管道的变化对失败组成的影响大于整体对齐,尤其在32B时,虚构步骤减少而矛盾状态增加。较低的逐步对齐也与更大的CoT效用相关,表明从CoT中受益最大的设置往往是时间上最不忠实的。配对截断和补充的捐赠-腐蚀测试进一步表明,许多承诺后的文本对于最终答案并不具有支撑作用。这些发现表明,尽管CoT仍然有用,但它在报告答案形成时间上并不可靠。
cs.AI / 59 / 2605.11753

Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

通过跨模态变换器和门控注意力实现视觉基础的多模态摘要
Ali, Abid, Molla-Aliod, Diego, Naseem, Usman
Abstract
Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Processes (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
Chinese Translation
多模态摘要要求模型共同理解文本和视觉输入,以生成简洁且语义连贯的摘要。现有方法通常将浅层视觉特征注入深层语言模型,导致表征不匹配和弱跨模态基础。我们提出了一个统一框架,联合执行文本摘要和代表性图像选择。我们的系统SPeCTrA-Sum(用于摘要的跨模态变换器和门控注意力的采样感知器)引入了两个关键创新。首先,深度视觉处理器(Deep Visual Processor, DVP)在相应深度上对齐视觉编码器和语言模型,实现层次化的逐层融合,保持语义一致性。其次,轻量级视觉相关性预测器(Visual Relevance Predictor, VRP)通过从行列式点过程(Determinantal Point Processes, DPP)教师中提炼软标签,选择显著且多样的图像。SPeCTrA-Sum使用多目标损失进行训练,该损失结合了自回归摘要、跨模态对齐和基于DPP的蒸馏。实验表明,我们的系统生成了更准确、视觉基础的摘要,并选择了更具代表性的图像,展示了深度感知融合和原则性图像选择在多模态摘要中的优势。
cs.AI / 60 / 2605.11789

Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations

超越低效:多智能体蒙特卡洛模拟中不文明行为的系统成本
Moldovan-Mauer, Alison, Mangold, Benedikt
Abstract
Unconstructive debate and uncivil communication carry well-documented costs for productivity and cohesion, yet isolating their effect on operational efficiency has proven difficult. Human subject research in this domain is constrained by ethical oversight, limited reproducibility, and the inherent unpredictability of naturalistic settings. We address this gap by leveraging Large Language Model (LLM) based Multi-Agent Systems as a controlled sociological sandbox, enabling systematic manipulation of communicative behavior at scale. Using a Monte Carlo simulation framework, we generate thousands of structured 1-on-1 adversarial debates across varying toxicity conditions, measuring convergence time, defined as the number of rounds required to reach a conclusion, as a proxy for interactional efficiency. Building on a prior study, we replicate and extend its findings across two additional LLM agents of varying parameter size, allowing us to assess whether the effects of toxic behavior on debate dynamics generalize across model scale. We confirm the 25% convergence latency reported in the previous study and find that this latency is significantly larger for models with fewer parameters. We further identify a significant first-mover advantage, whereby the agent initiating the discussion wins significantly above chance regardless of toxicity condition.
Chinese Translation
不建设性的辩论和不文明的沟通对生产力和凝聚力带来了众所周知的成本,但将其对操作效率的影响进行隔离却证明是困难的。该领域的人类研究受到伦理监督、有限的可重复性以及自然环境固有的不确定性的限制。我们通过利用基于大型语言模型(Large Language Model, LLM)的多智能体系统作为一个受控的社会学实验平台,来填补这一空白,从而能够在大规模上系统性地操控沟通行为。使用蒙特卡洛模拟框架,我们在不同毒性条件下生成数千个结构化的1对1对抗辩论,测量收敛时间,即达到结论所需的回合数,作为互动效率的代理指标。在先前研究的基础上,我们在两个不同参数规模的LLM代理中复制并扩展其发现,从而评估毒性行为对辩论动态的影响是否在模型规模上具有普遍性。先前研究中报告的25%的收敛延迟得到了确认,发现这一延迟在参数较少的模型中显著更大。我们进一步识别出显著的先行者优势,即无论毒性条件如何,发起讨论的代理赢得的概率显著高于随机水平。
cs.AI / 61 / 2605.11807

Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

用户为何选择此地:世界知识增强的生成下一个兴趣点推荐
Ding, Qiuyu, Xu, Heng-Da, Zhang, Wei, Lv, Dongyi, Xia, Changda, Xiong, Feng, Xu, Mu
Abstract
Generative point-of-interest (POI) recommendation models based on large language models (LLMs) have shown promising results by formulating next POI prediction as a sequence generation task. However, the knowledge encoded in these models remains fixed after training, making them unable to perceive evolving real-world conditions that shape user mobility decisions, such as local events and cultural trends. To bridge this gap, we propose AWARE (Agent-based World knowledge Augmented REcommendation), which employs an LLM agent to generate location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events relevant to each user. Rather than introducing generic or noisy information, AWARE further anchors these narratives in each user's behavioral context, grounding external world knowledge in personalized spatial-temporal patterns. Extensive experiments on three real-world datasets demonstrate that AWARE consistently outperforms competitive baselines, achieving up to 12.4% relative improvement.
Chinese Translation
基于大型语言模型(LLMs)的生成兴趣点(POI)推荐模型通过将下一个兴趣点预测表述为序列生成任务,展现了良好的效果。然而,这些模型中编码的知识在训练后保持固定,无法感知影响用户移动决策的不断变化的现实世界条件,如地方事件和文化趋势。为了解决这一问题,我们提出了AWARE(基于代理的世界知识增强推荐),该模型利用LLM代理生成位置和时间感知的上下文叙述,捕捉与每个用户相关的区域文化特征、季节性趋势和正在进行的事件。AWARE并非引入通用或含噪声的信息,而是进一步将这些叙述锚定在每个用户的行为背景中,使外部世界知识立足于个性化的时空模式。在三个真实世界数据集上的广泛实验表明,AWARE始终优于竞争基线,达到了高达12.4%的相对提升。
cs.AI / 62 / 2605.11809

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

超越世界坐标框架的动作头:面向运动的动作框架用于视觉-语言-动作模型
Yang, Huoren, Zhao, Jianchao, Yusong, Hu, Ou, Qiguan, Gao, Yuyang, Ke, Wei, He, Yuhang, Dong, SongLin, Ma, Zhiheng, Gong, Yihong
Abstract
Vision-Language-Action (VLA) models have advanced rapidly with stronger backbones, broader pre-training, and larger demonstration datasets, yet their action heads remain largely homogeneous: most directly predict action commands in a fixed world coordinate frame. We propose MCF-Proto, a lightweight action head that equips VLA policies with a Motion-Centric Action Frame (MCF) and a prototype-based action parameterization. At each step, the policy predicts a rotation $R_t \in SO(3)$, composes actions in the transformed local frame from a set of prototypes, and maps them back to the world frame for end-to-end training, using only standard demonstrations without auxiliary supervision. This simple design induces stable emergent structure. Without explicit directional labels, the learned local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion. Meanwhile, actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes. These structural properties translate into improved robustness, especially under geometric perturbations. Our results suggest that adding lightweight geometric and compositional structure to the action head can materially improve how VLA policies organize and generalize robotic manipulation behavior. An anonymized code repository is provided in the supplementary material.
Chinese Translation
视觉-语言-动作(VLA)模型在更强大的骨干网络、更广泛的预训练和更大规模的示范数据集的推动下迅速发展,但其动作头仍然大体上同质化:大多数直接在固定的世界坐标框架中预测动作指令。我们提出了MCF-Proto,一种轻量级的动作头,它为VLA策略配备了运动中心动作框架(Motion-Centric Action Frame, MCF)和基于原型的动作参数化。在每一步中,策略预测一个旋转$R_t \in SO(3)$,在变换后的局部框架中由一组原型组合出动作,并将其映射回世界框架以进行端到端训练,仅使用标准示范而不需要辅助监督。这种简单的设计引发了稳定的涌现结构。在没有明确方向标签的情况下,学习到的局部框架发展出一种稳定的几何结构,其轴与示范的末端执行器运动高度兼容。同时,学习表示中的动作变得更加紧凑,变化由更少的主导方向捕获,并通过共享原型更规则地组织。这些结构特性转化为更好的鲁棒性,尤其是在几何扰动下。我们的结果表明,向动作头添加轻量级的几何和组合结构可以显著改善VLA策略组织和泛化机器人操控行为的能力。补充材料中提供了匿名代码库。
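The local-to-world mapping described in this abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions: the abstract does not specify how prototypes are weighted or how the rotation is parameterized, so the helper names (`rotation_z`, `compose_local_action`, `local_to_world`), the linear weighting, and the restriction to a 3D translational action are all hypothetical choices made here.

```python
import math

def rotation_z(theta):
    """Rotation matrix about the z-axis (one simple element of SO(3))."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def compose_local_action(prototypes, weights):
    """Assumed composition rule: weighted sum of 3D prototype actions."""
    return [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(3)]

def local_to_world(rotation, local_action):
    """Map a local-frame action to the world frame: a_world = R_t @ a_local."""
    return [sum(rotation[i][j] * local_action[j] for j in range(3)) for i in range(3)]

# Hypothetical prototypes: unit motions along the local x and y axes.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
local_action = compose_local_action(prototypes, weights=[1.0, 0.0])
world_action = local_to_world(rotation_z(math.pi / 2), local_action)
```

With a 90-degree rotation about z, a motion along the local x-axis comes out along the world y-axis, which is the sense in which the same prototype can express different world-frame motions at different steps.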
cs.AI / 63 / 2605.11813

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

通过记忆增强的大型语言模型实现鲁棒优化的自动重构
Chen, Jinbiao, Jin, Shuang, Zhang, Guoyun, Zhang, Junyu, Wang, Guanyi, Qin, Hanzhang
Abstract
Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have shown promise for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.
Chinese Translation
鲁棒优化(RO)为不确定性下的决策提供了一个原则性框架,但其实际应用常常受到需要手动将不确定的优化模型重构为可处理的确定性对应物的限制。最近的大型语言模型(LLMs)在自动化优化公式化方面显示出良好的前景,但RO重构仍然具有挑战性,因为它需要精确的多步骤推理和数学一致的变换。为了促进基于LLM的重构的系统评估,目前尚无专门的基准,我们开发了AutoRO-Bench,这是一个基准,具有用于核心RO重构任务的自动数据生成管道和用于RO应用任务的策划数据集。为了解决重构挑战,我们提出了经验记忆自动重构(AutoREM),这是一种无调优的记忆增强框架,通过定制的离线适应程序反思过去失败的轨迹,自动构建结构化的文本经验记忆。AutoREM既不需要领域特定的专家知识,也不需要参数更新,生成的记忆可以在不同的基础LLM之间轻松转移。实验结果表明,AutoREM在分布内数据集、分布外数据集和多样化基础LLM上始终提高了RO重构的准确性和效率。
cs.AI / 64 / 2605.11814

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

MedMemoryBench:个性化医疗中代理记忆的基准测试
Wang, Yihao, Xu, Haoran, Gu, Renjie, Ye, Yixuan, Chen, Xinyi, Mu, Xinyu, Gao, Yuan, Guo, Chunxiao, Wei, Peng, Gu, Jinjie, Li, Huan, Chen, Ke, Shou, Lidan
Abstract
The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.
Chinese Translation
个性化医疗代理的大规模部署需要极其精确、安全且能够进行长期临床追踪的记忆机制。然而,现有的基准测试主要集中在日常开放领域的对话上,未能捕捉到现实医疗应用中的高风险复杂性。受到服务数千万活跃用户的行业领先健康管理代理的严格生产要求的启发,我们提出了MedMemoryBench。我们开发了一个人机协作的流程,基于临床基础的合成患者原型合成高度真实的长期医疗轨迹。该过程生成了一个庞大且经过专家验证的数据集,包含约2000个会话和16000个交互轮次。重要的是,MedMemoryBench突破了传统静态评估的局限,首创了一种“构建同时评估”的流式评估协议,精确反映了生产环境中的动态记忆积累。此外,我们正式化并系统性地研究了记忆饱和这一关键现象,即持续的信息流入会积极降低检索和推理的稳健性。全面的基准测试揭示了主流架构中的严重瓶颈,特别是在复杂医疗推理和噪声抗扰性方面。通过揭示这些根本性缺陷,MedMemoryBench为开发稳健的、适合生产的医疗代理奠定了重要基础。
cs.AI / 65 / 2605.11882

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

通过失败轨迹实现代理安全对齐的在线自我进化
Yin, Bo, Li, Qi, Wang, Xinchao
Abstract
Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
Chinese Translation
使用工具的LLM代理通过轨迹而非仅仅最终响应发生失败,因为它们可能执行不安全的工具调用、遵循注入的指令、服从有害请求,或尽管产生看似安全的答案却过度拒绝良性任务。现有的安全对齐信号主要是响应级别或离线的,通常会导致安全与效用之间的权衡:提高代理安全性往往以降低任务性能为代价。这种稀疏且单一目标的奖励严重限制了在现实世界中的可用性。为了解决这一问题,我们提出了FATE,一个在线自我进化框架,将验证者评分的失败转化为修复监督,而无需专家示范。对于每个失败,相同的策略提出修复候选项,然后由验证者重新评分,并在安全性、效用、过度拒绝控制和轨迹有效性方面进行过滤。这种密集的轨迹级信息随后被用作代理自我进化的监督信号。在此过程中,我们进一步引入了帕累托前沿策略优化(Pareto-Front Policy Optimization, PFPO),结合监督预热与考虑帕累托的策略优化,以保持安全与效用之间的权衡。在AgentDojo、AgentHarm和ATBench上的实验表明,FATE在不同模型和规模中提高了安全性,同时保持了有用的行为。与强基线相比,FATE将攻击成功率降低了33.5%,有害遵从率降低了82.6%,并将外部轨迹安全诊断提高了6.5%。这些结果表明,失败轨迹可以为更安全的自我进化代理提供结构化的修复监督。
cs.AI / 66 / 2605.11885

From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

从聪明汉斯到科学发现:使用层次相关传播(LRP)解释脑电图基础变换器
Bexten, Justus Meyer zu, Scherf, Nico, Franczyk, Bogdan, Hofmann, Simon M.
Abstract
Emerging foundation models (FMs) in electroencephalography (EEG) promise a path to scale deep learning in diagnostics and brain-computer interfaces despite data scarcity, yet their opaque nature remains a barrier to wider adoption. We investigate attention-aware Layer-wise relevance propagation (LRP) as a post-hoc attribution method for EEG-FMs, extending LRP's use on convolutional neural network (CNN)-based EEG models to the Transformer architectures that current FMs are based on. We find that LRP can both verify EEG-FM decisions and surface novel, biologically plausible hypotheses from them. In motor imagery, it unmasks 'Clever Hans' behavior where models prioritize task correlated ocular signals over the intended motor correlates. In a naturalistic paradigm for affect prediction, it reveals a recurring reliance on a central electrode cluster, suggesting a candidate sensorimotor signature of arousal. Though heatmap interpretation remains ambiguous in this complex domain, the results position LRP as a tool for both verification and exploration of EEG-FMs, a role that will grow in both importance and discovery potential as the underlying models mature.
Chinese Translation
新兴的脑电图(EEG)基础模型(FMs)为在数据稀缺的情况下扩大深度学习在诊断和脑-机接口中的应用提供了可能性,但其不透明的特性仍然是更广泛采用的障碍。我们研究了关注注意力的层次相关传播(LRP)作为脑电图基础模型的事后归因方法,将LRP在基于卷积神经网络(CNN)的脑电图模型上的应用扩展到当前基础模型所基于的变换器架构。我们发现LRP不仅可以验证脑电图基础模型的决策,还能从中提出新颖且生物学上合理的假设。在运动想象中,它揭示了“聪明汉斯”行为,即模型优先考虑与任务相关的眼动信号,而非预期的运动相关信号。在一个自然主义的情感预测范式中,它揭示了对一个中心电极簇的反复依赖,暗示了一种唤醒的候选感觉运动特征。尽管在这一复杂领域中热图解释仍然模糊,但结果将LRP定位为验证和探索脑电图基础模型的工具,随着基础模型的成熟,这一角色的重要性和发现潜力将不断增长。
cs.AI / 67 / 2605.11893

Toward Modeling Player-Specific Chess Behaviors

朝着建模特定棋手行为的方向
Sogliuzzo, Loris, Rautureau, Aloïs, Piette, Eric
Abstract
While artificial intelligence has achieved superhuman performance in chess, developing models that accurately emulate the individualized decision-making styles of human players remains a significant challenge. Existing human-like chess models capture general population behaviors based on skill levels but fail to reproduce the behavioral characteristics of specific historical champions. Furthermore, the standard evaluation metric, move accuracy, inherently penalizes natural human variance and ignores long-term behavioral consistency, leading to an incomplete assessment of stylistic fidelity. To address these limitations, an architecture is proposed that adapts the unified Maia-2 model to champion-specific embeddings, further enhanced by the integration of a limited Monte Carlo Tree Search (MCTS) process to enrich tactical exploration during move selection. To robustly evaluate this approach, a novel behavioral metric based on the Jensen-Shannon divergence is introduced. By compressing high-dimensional board representations into a latent space using an AutoEncoder and Uniform Manifold Approximation and Projection (UMAP), move distributions are discretized on a common grid to compare behavioral similarities. Results across 16 historical world champions indicate that while integrating MCTS decreases standard move accuracy, it improves stylistic alignment according to the proposed metric, substantially reducing the average Jensen-Shannon divergence. Ultimately, the proposed metric successfully discriminates between individual players and provides promising evidence toward more comprehensive evaluations of behavioral alignment between players and AI models.
Chinese Translation
尽管人工智能在国际象棋领域已达到超人类的表现,但开发能够准确模拟人类棋手个体决策风格的模型仍然是一项重大挑战。现有的人类棋类模型基于技能水平捕捉一般人群的行为,但未能再现特定历史冠军的行为特征。此外,标准评估指标——走棋准确性,固有地惩罚自然的人类变异,并忽视长期行为一致性,从而导致对风格忠实度的不完整评估。为了解决这些局限性,提出了一种架构,将统一的 Maia-2 模型适应于冠军特定的嵌入,并通过整合有限的蒙特卡洛树搜索(MCTS)过程来增强走棋选择过程中的战术探索。为了稳健地评估这种方法,提出了一种基于詹森-香农散度的新型行为指标。通过使用自编码器(AutoEncoder)和均匀流形近似与投影(UMAP)将高维棋盘表示压缩到潜在空间,走棋分布在一个共同的网格上离散化,以比较行为相似性。对16位历史世界冠军的结果表明,尽管整合MCTS降低了标准走棋准确性,但根据所提出的指标,它改善了风格一致性,显著减少了平均詹森-香农散度。最终,所提出的指标成功区分了个体棋手,并为棋手与人工智能模型之间行为一致性的更全面评估提供了有希望的证据。
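The behavioral metric in this abstract compares move distributions discretized on a common grid using the Jensen-Shannon divergence. A minimal sketch of that comparison step follows; it assumes the two inputs are already per-cell counts or probabilities over the same grid, and it leaves out the AutoEncoder and UMAP stages entirely. The function name and the choice of base-2 logarithms are assumptions made here, not details taken from the paper.

```python
import math

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions.

    p and q are non-negative weights over the same grid cells; they are
    normalized to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same grid")

    def normalize(weights):
        total = sum(weights)
        if total <= 0:
            raise ValueError("distribution must have positive mass")
        return [w / total for w in weights]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0
        # because b is the mixture of the two distributions.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    p, q = normalize(p), normalize(q)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared cells), so lower values indicate closer behavioral alignment between a model and a player.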
cs.AI / 68 / 2605.11905

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

重新思考监督粒度:基于段落的学习用于大型语言模型的定理证明
Xu, Shuo, Zhang, Jiakun, Lai, Junyu, Cao, Chun, Xu, Jingwei
Abstract
Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG-ATP.
Chinese Translation
在 Lean 4 中,使用大型语言模型进行自动定理证明通常通过步骤级策略预测与树搜索或整体证明生成两种方式进行。这两种范式代表了构建监督训练数据的相反粒度:前者提供了密集的局部信号,但可能会破坏连贯的证明过程,而后者则保留了全局结构,但需要复杂的端到端生成。在本文中,我们将监督粒度重新审视为一个基于证明轨迹的训练集构建问题,并提出了段落级监督,这是一种提取局部连贯证明段落以训练策略模型的训练数据构建策略。我们进一步在推理时重用相同的策略,以触发现有步骤级模型的短期展开。在 STP、LeanWorkbook 和 NuminaMath-LEAN 上使用段落级监督进行训练后,得到的策略模型在 miniF2F 上的证明成功率分别为 64.84%、60.90% 和 66.31%,始终优于步骤级和整体证明基线。目标感知的展开进一步改善了现有的步骤级证明器,同时降低了推理成本。它将 BFS-Prover-V2-7B 的证明成功率从 68.77% 提高到 70.74%,将 InternLM2.5-StepProver 的证明成功率从 59.59% 提高到 60.33%,显示出适当的监督粒度更好地将模型学习与证明结构和搜索对齐。代码和模型可在 https://github.com/NJUDeepEngine/SEG-ATP 获取。
cs.AI / 69 / 2605.11910

Rethinking Positional Encoding for Neural Vehicle Routing

重新思考神经车辆调度中的位置编码
Hua, Chuanbo, Berto, Federico, Hottung, Andre, Zepeda, Nayeli Gast, Ma, Yining, Ma, Zihan, Wong-Chung, Paula, Kwon, Changhyun, Wu, Cathy, Tierney, Kevin, Park, Jinkyoo
Abstract
Transformer-based models have become the dominant paradigm for neural combinatorial optimization (NCO) of vehicle routing problems (VRPs), yet the role of positional encoding (PE) in these architectures remains largely unexplored. Unlike natural language, where tokens are uniformly spaced on a line, routing solutions exhibit several properties that render standard NLP positional encodings inadequate. In this work, we formalize three such structural properties that a routing-aware PE should respect, namely anisometric node distances, cyclic and direction-aware topology, and hierarchical depot-anchored global multi-route structure, combining them with a unifying design principle of geometric grounding. Guided by these criteria, we analyze and compare PE methods spanning NLP, graph-transformer, and routing-specific families, and propose a hierarchical anisometric PE that combines a distance-indexed, circularly consistent in-route encoding with a depot-anchored angular cross-route encoding. Extensive experiments across diverse VRP variants demonstrate that geometry-grounded PE consistently outperforms index-based alternatives, with gains that transfer across problem variants, model architectures, and distribution shifts.
Chinese Translation
基于Transformer的模型已成为车辆调度问题(VRP)神经组合优化(NCO)的主流范式,但位置编码(PE)在这些架构中的作用在很大程度上仍未被探索。与自然语言不同,在自然语言中,标记均匀分布在一行上,而调度解决方案则表现出多种特性,使得标准的自然语言处理位置编码显得不足。在本研究中,我们形式化了三种路由感知PE应遵循的结构特性,即非等距节点距离、循环和方向感知拓扑,以及以仓库为锚的全局多路径层次结构,并将其与几何基础的统一设计原则相结合。在这些标准的指导下,我们分析并比较了涵盖自然语言处理、图Transformer和特定于路由的PE方法,并提出了一种层次非等距PE,该PE结合了距离索引的、循环一致的路径内编码与以仓库为锚的角度交叉路径编码。在多种VRP变体上的广泛实验表明,基于几何的PE始终优于基于索引的替代方案,且其增益可迁移到不同的问题变体、模型架构和分布偏移。
cs.AI / 70 / 2605.11920

Domain Restriction via Multi SAE Layer Transitions

通过多层稀疏自编码器过渡进行领域限制
Shaheen, Elias, Mendelson, Avi
Abstract
The general-purpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work we show that layer transitions provide a promising avenue for extracting domain-specific signatures. Specifically, we present several lightweight ways of learning on internal dynamics encoded using a sparse autoencoder (SAE) that exhibit great capability in distinguishing OOD texts. Building on SAE representation transitions enables us to better interpret how the LLM's internal processing of an input evolves and to shed light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.
Chinese Translation
大型语言模型(LLMs)的通用性在领域特定应用中带来了显著挑战,常常导致超出领域(OOD)的交互,从而削弱提供者的意图。现有的检测此类场景的方法将LLM视为不可解释的黑箱,忽视了输入的内部处理。在本研究中,我们展示了层过渡为提取领域特定特征提供了一条有前景的途径。具体而言,我们提出了几种轻量级的方法,通过使用稀疏自编码器(SAE)编码的内部动态进行学习,这些方法在区分OOD文本方面表现出很强的能力。基于SAE的表示过渡使我们能够更好地解释LLM在输入处理中的内部演变,并揭示其决策过程。我们对该方法进行了全面分析,并与gemma-2 2B和9B模型进行了基准测试。我们的结果强调了内部过程在捕捉细粒度输入相关细节方面的有效性。
cs.AI / 71 / 2605.11928

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

当模拟失真:工具使用代理的模拟到现实基准与领域随机化强化学习方案
Zhou, Xiaolin, Yuan, Aojie, Luo, Zheng, Ling, Zipeng, Pan, Xixiao, Gao, Yicheng, Zhang, Haiyue, Li, Jiate, Jiang, Shuli, Wang, Prince Zizhuang, Zhu, Zixuan, Liu, Jinbo, Rossi, Ryan A., Wei, Hua, Hu, Xiyang
Abstract
Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
Chinese Translation
工具使用语言代理在假设输入干净、工具注册表明确且 API 可靠的基准上进行评估。实际部署违反了所有这些假设:用户输入错误会传播到虚构的工具名称,配置错误的请求超时可能会使代理无限期停滞,而跨服务器的重复工具名称可能会冻结 SDK。我们将这些失败视为工具使用部分可观察马尔可夫决策过程(POMDP)中的模拟到现实差距,其中部署噪声通过观察、动作空间、与奖励相关的元数据或转移动态进入。我们引入了 RobustBench-TC,这是一个包含 22 种扰动类型的基准,按这四个 POMDP 组件组织,每种扰动都基于经过验证的 GitHub 问题或文档化的工具调用失败。在 21 个模型中,参数从 15 亿到 320 亿(包括闭源的 o4-mini),其鲁棒性特征明显不均匀:观察扰动使准确率降低不到 5%,而与奖励相关的扰动和转移扰动分别使准确率降低约 40% 和 30%;仅仅依靠规模并不能弥补这些差距。随后,我们提出了 ToolRL-DR,一种领域随机化强化学习(RL)方案,该方案在跨越三个静态可编码 POMDP 组件的扰动增强轨迹上训练工具使用代理。在一个 30 亿参数的基础模型上,ToolRL-DR-Full 保留了大约四分之三的干净准确率,并达到了与开源 140 亿函数调用基线相当的整体扰动准确率,同时显著缩小了与 o4-mini 的差距。尽管在训练中从未见过转移扰动,但它仍然缩小了约 27% 的转移差距,这表明在对抗性静态工具使用输入上进行强化学习会诱导出更持久的重试策略,从而转移到未见的运行时失败。数据集、代码和基准排行榜均已公开。
cs.AI / 72 / 2605.11936

From Noise to Diversity: Random Embedding Injection in LLM Reasoning

从噪声到多样性:随机嵌入注入在大型语言模型推理中的应用
Kim, Heejun, Lee, Seungpil, Yeom, Jewon, Sok, Jaewon, Park, Seonghyeon, Park, Jeongjae, Kim, Taesup, Kim, Sundong
Abstract
Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt -- training-free, freshly resampled -- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
Chinese Translation
近期的软提示研究试图通过将训练向量插入大型语言模型(LLM)输入来改善推理,然而增益是否来自于学习内容还是注入行为本身尚未被仔细区分。我们研究了随机软提示(Random Soft Prompts, RSPs),该方法完全省略了训练步骤,并将新生成的随机嵌入向量序列附加到输入中。每个RSP向量是从一个适配于预训练嵌入表的逐元素均值和方差的各向同性高斯分布中采样的;该序列不包含任何学习内容,但在多个数学推理基准测试中达到了与优化软提示相当的准确性。该机制分为两个阶段展开:由于注意力机制必须吸收一个前所未见的随机位置,前几个生成的标记的分布变得平坦,推理轨迹分叉,随着生成的继续,这种影响自然稀释,因此响应最终承诺于单一的完成。我们展示了在推理过程中,RSP提升了早期阶段的标记多样性,并结合温度采样,扩大了Pass@N,即至少有一个正确答案的N次尝试的概率。除了推理,我们还将相同的效果带入DAPO训练中,并展示了实际收益。我们的贡献包括:(i)RSP孤立出最简单形式的软提示——无训练、重新采样——为注入的结构效应提供了统一的视角,而其他在训练和形式上有所不同的变体都共享这一点;(ii)对基础机制的理论和实证验证;(iii)从推理扩展到训练。
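Two quantities from this abstract lend themselves to a short sketch: sampling random soft prompt vectors from a Gaussian fitted to the embedding table, and estimating Pass@N. Both functions below rest on assumptions. The abstract's "entrywise mean and variance" is read here as a single scalar mean and variance pooled over all entries of the table, and Pass@N is computed with the standard unbiased combinatorial estimator, which the paper may or may not use. All function names are hypothetical.

```python
import math
import random

def fit_entrywise_gaussian(embedding_table):
    """Pooled mean and standard deviation over every entry of the table."""
    values = [x for row in embedding_table for x in row]
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(variance)

def sample_random_soft_prompt(embedding_table, num_vectors, rng):
    """Draw num_vectors fresh random embeddings; no training is involved."""
    mean, std = fit_entrywise_gaussian(embedding_table)
    dim = len(embedding_table[0])
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(num_vectors)]

def pass_at_n(num_samples, num_correct, n):
    """Estimate P(at least one of n attempts is correct).

    Uses num_correct successes observed among num_samples attempts and
    the estimator 1 - C(num_samples - num_correct, n) / C(num_samples, n).
    """
    if n > num_samples:
        raise ValueError("n cannot exceed the number of sampled attempts")
    if num_samples - num_correct < n:
        return 1.0  # every size-n subset contains a correct attempt
    return 1.0 - math.comb(num_samples - num_correct, n) / math.comb(num_samples, n)

# Toy embedding table (4 tokens, 3 dimensions), for illustration only.
toy_table = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.2, 0.2], [-0.3, 0.1, 0.0]]
prompt = sample_random_soft_prompt(toy_table, num_vectors=5, rng=random.Random(0))
```

In use, the sampled vectors would be appended to the input embeddings and redrawn for every attempt, so each of the N attempts starts from a different random perturbation.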
cs.AI / 73 / 2605.11946

Counterfactual Trace Auditing of LLM Agent Skills

大语言模型代理技能的反事实追踪审计
Zhou, Xiaolin, Liu, Jinbo, Li, Li, Rossi, Ryan A., Hu, Xiyang
Abstract
Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black-box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with-skill agent trace with a without-skill counterpart on the same task, segments both traces into goal-directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off-task artifact creation, excess planning, and task recovery. Three findings emerge. First, high-baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.
Chinese Translation
大语言模型代理越来越多地增强了代理技能。目前对技能的评估方法仍然有限。大多数部署的基准仅报告在附加技能前后的通过率,将技能视为对代理行为的黑箱变化。我们引入了反事实追踪审计(Counterfactual Trace Auditing, CTA),这是一个用于测量技能如何改变代理行为的框架。CTA将每个带技能的代理追踪与同一任务上不带技能的对应追踪配对,将两个追踪分段为以目标为导向的阶段,进行阶段对齐,并生成结构化的技能影响模式(Skill Influence Pattern, SIP)注释。这些注释描述了技能的行为效应,而不仅仅是其任务结果。我们在SWE-Skills-Bench上以Claude为基础,针对49个软件工程任务实例化CTA。结果审计揭示了明显的评估差距。平均通过率仅变化了+0.3个百分点,表明整体效应很小。然而,CTA在相同的配对追踪中识别出522个SIP实例,显示即使通过率几乎不变,技能仍然显著重塑代理行为。审计还分离出几个通过率无法检测的重复效应,包括字面模板复制、任务外工件创建、过度规划和任务恢复。得出了三个发现。首先,高基线任务包含大多数观察到的技能效应,尽管它们的通过率已经饱和,因此无法反映这些效应。其次,基线表现中等的任务显示出最可恢复的增益,但通常伴随显著更高的标记成本。第三,主导的SIP类型可以通过基线桶进行识别:表面锚定在顶级任务中最为常见,而边缘案例提示在中间和底层任务中最为常见。这些规律将非正式的失败模式观察转化为可重复的行为测量。
cs.AI / 74 / 2605.11954

Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

评估和缓解基于大型语言模型的社会科学测量中的误校准
Wang, Jinyuan, Deng, Ningyuan, Yang, Yi
Abstract
Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs, covering both proprietary and open-source models, including GPT-5-mini and DeepSeek-V3.2. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller encoder-based discriminative classifier on these targets. Averaged across datasets, this approach reduces ECE by 43.2% and Brier by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
Chinese Translation
大型语言模型(LLMs)在社会科学中越来越多地被用作可扩展的测量工具,将非结构化文本转换为可以进入标准实证设计的变量。测量有效性不仅仅要求高平均准确率,还需要良好校准的置信度,能够真实反映每个测量正确的经验概率。本文研究了基于LLM的社会科学测量中的模型误校准。我们首先通过对联邦公开市场委员会(FOMC)的案例研究,展示了当LLM置信度误校准时,基于置信度的过滤如何改变下游回归估计。接着,我们对包括GPT-5-mini、DeepSeek-V3.2等专有模型在内的14个社会科学构念进行了校准审计。跨任务和模型家族,报告的置信度与基于容忍度的正确性之间的对齐度较差。作为一种简单的缓解措施,我们提出了一种用于校准Bert与LLM的软标签蒸馏管道。该方法将LLM得分及其口头化的置信度转换为软目标分布,然后在这些目标上训练一个较小的判别分类器。根据数据集的平均结果,该方法将ECE降低了43.2%,Brier降低了34.0%。这些结果表明,基于LLM的社会科学管道应将校准视为测量有效性的一部分,而不是可选的后处理问题。
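The two calibration metrics named in this abstract, ECE and the Brier score, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: it assumes equal-width confidence bins and binary correctness labels, and the function names and the ten-bin default are hypothetical choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|.

    Assumption: `confidences` are in [0, 1] and `correct` holds 0/1 labels.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, label))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


# An overconfident measurer: claims 0.9 but is right three times out of four.
confs = [0.9, 0.9, 0.9, 0.9]
labels = [1, 1, 1, 0]
print(expected_calibration_error(confs, labels))
print(brier_score(confs, labels))
```

Both metrics are zero for a measurer that states confidence 1.0 and is always right, which is why reductions in either are read as better calibration.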
cs.AI / 75 / 2605.11986

On the Limitations of Large Language Models for Conceptual Database Modeling

大型语言模型在概念数据库建模中的局限性
Siqueira, Arthur F., Nogueira, Carlos D. S., Farias, Eduarda, Campelo, Claudio E. C., Menezes, Júlia
Abstract
This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.
Chinese Translation
本文分析了大型语言模型(LLMs)在通过自然语言需求自动生成实体-关系(ER)图支持关系数据库概念建模中的应用。该方法结合了不同的语言模型和提示工程技术,以评估它们在概念一致的方式下识别实体、关系和属性的能力。实验评估涉及三种LLM,每种模型都采用三种提示技术(零样本、思维链和思维链 + 验证器),应用于同一需求场景,且复杂性逐渐增加。生成的图通过与文本需求的直接比较进行定性分析,考虑了建模元素的结构和语义一致性。结果表明,尽管LLM在较简单场景中表现出合理的性能,但随着需求复杂性的增加,其可靠性下降,出现不一致性、模糊性和约束表示失败的情况。这些发现强调了在当前状态下,LLM尚未足够成熟,无法在复杂场景中可靠使用,验证的成本可能抵消明显的生产力提升。
cs.AI / 76 / 2605.11987

Random-Set Graph Neural Networks

随机集图神经网络
Woodley, Tommy, Manchingal, Shireen Kudukkil, Tolloso, Matteo, Bacciu, Davide, Cuzzolin, Fabio
Abstract
Uncertainty quantification has become an important factor in understanding the data representations produced by Graph Neural Networks (GNNs). Although their predictive capabilities are useful across industrial settings, the inherent uncertainty induced by the nature of the data is a major limiting factor for GNN performance. While aleatoric uncertainty is the result of noisy and incomplete stochastic data such as missing edges or over-smoothing, epistemic uncertainty arises from lack of knowledge about a system or model (e.g., a graph's topology or node feature representation), which can be reduced by gathering more data and information. In this paper, we propose a new framework in which node-level epistemic uncertainty is modelled in a belief function (finite random set) formalism. The resulting Random-Set Graph Neural Networks (RS-GNNs) have a belief-function head predicting a random set over the list of classes, from which both a precise probability prediction and a measure of epistemic uncertainty can be obtained. Extensive experiments on 9 different graph learning datasets, including real-world autonomous driving benchmarks such as nuScenes and ROAD, demonstrate RS-GNN's superior uncertainty quantification capabilities.
Chinese Translation
不确定性量化已成为理解图神经网络(GNNs)所产生的数据表示的重要因素。尽管其预测能力在工业工作环境中极为有用,但数据本质所引发的固有不确定性对GNN的性能构成了巨大的制约因素。随机不确定性是由噪声和不完整的随机数据(例如缺失的边或过度平滑)导致的,而认知不确定性则源于对系统或模型(例如图的拓扑或节点特征表示)缺乏知识,这可以通过收集更多的数据和信息来减少。在本文中,我们提出了一个原创的新框架,其中节点级的认知不确定性在信念函数(有限随机集)形式中被建模。所得到的随机集图神经网络(Random-Set Graph Neural Networks, RS-GNN)具有一个信念函数头,能够预测一个类列表上的随机集,从中可以获得精确的概率预测和认知不确定性的度量。在9个不同的图学习数据集上进行的广泛实验,包括现实世界的自动驾驶基准(如Nuscene和ROAD),展示了RS-GNN在不确定性量化能力方面的优越性。
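To make the belief-function idea concrete, the sketch below shows how a mass assignment over sets of classes can yield both a point probability and an uncertainty interval. It uses the standard pignistic transform and belief/plausibility bounds from belief-function theory; whether RS-GNN uses exactly these quantities is an assumption, and the class names and mass values are invented for illustration.

```python
def pignistic(mass):
    """Spread each focal set's mass evenly over its member classes."""
    probs = {}
    for focal, m in mass.items():
        for cls in focal:
            probs[cls] = probs.get(cls, 0.0) + m / len(focal)
    return probs


def belief(mass, subset):
    """Total mass of focal sets fully contained in `subset` (lower bound)."""
    return sum(m for focal, m in mass.items() if focal <= subset)


def plausibility(mass, subset):
    """Total mass of focal sets that intersect `subset` (upper bound)."""
    return sum(m for focal, m in mass.items() if focal & subset)


# Hypothetical node-level prediction: mass on sets of classes, summing to 1.
mass = {
    frozenset({"car"}): 0.5,
    frozenset({"car", "truck"}): 0.3,
    frozenset({"car", "truck", "bus"}): 0.2,
}

probs = pignistic(mass)
car = frozenset({"car"})
# The gap between plausibility and belief is one common reading of
# epistemic uncertainty about a class.
width = plausibility(mass, car) - belief(mass, car)
print(probs, width)
```

Mass placed on larger sets widens the belief-plausibility interval, so a prediction that commits only to "car or truck" reports more epistemic uncertainty than one that commits to "car" alone.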
cs.AI / 77 / 2605.11996

BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

BadSKP:针对知识图谱增强的大型语言模型的后门攻击与软提示
Lyu, Xiaoting, Han, Yufei, Qian, Hangwei, Yu, Haoyuan, Ao, Xiang, Wang, Bin, Wang, Chenxu, Ma, Xiaobo, Wang, Wei
Abstract
Recent knowledge graph (KG)-enhanced large language models (LLMs) move beyond purely textual knowledge augmentation by encoding retrieved subgraphs into continuous soft prompts via graph neural networks, introducing a graph-conditioned channel that operates alongside the standard text interface. However, existing backdoor attacks are largely designed for the textual channel, and their effectiveness against this dual-channel architecture remains unclear. We show that this architecture creates a robustness gap: text-channel backdoor attacks that readily compromise textual KG prompting systems become largely ineffective against soft-prompt-based counterparts. We interpret this gap through semantic anchoring, whereby graph-derived soft prompts bias the generation-driving hidden state toward query-consistent semantics and suppress surface-level malicious instructions. Because this anchoring effect is itself induced by the graph channel, an attacker who manipulates graph-level representations can in turn redirect it toward adversarial semantics. To demonstrate this risk, we propose BadSKP, a backdoor attack that targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show that BadSKP achieves high attack success under both frozen and trojaned settings, while text-only attacks remain unreliable even under perplexity-based defenses.
Chinese Translation
近期,知识图谱(KG)增强的大型语言模型(LLMs)通过图神经网络将检索到的子图编码为连续的软提示,超越了纯文本知识增强,引入了一个与标准文本接口并行的图条件通道。然而,现有的后门攻击主要针对文本通道设计,其在这种双通道架构下的有效性仍不明确。我们展示了该架构造成的鲁棒性差距:文本通道的后门攻击能够轻易破坏文本KG提示系统,但在基于软提示的对应系统中却变得大大无效。我们通过语义锚定来解释这一差距,即图派生的软提示使生成驱动的隐状态偏向于与查询一致的语义,并抑制表层的恶意指令。由于这一锚定效应本身是由图通道诱导的,攻击者通过操控图级表示可以将其重新引导至对抗性语义。为了展示这一风险,我们提出了BadSKP,一种通过多阶段优化策略针对图到提示接口的后门攻击:它构建对抗性目标嵌入,优化被污染的节点嵌入以引导诱导的软提示,并用流畅的对抗性节点属性近似优化后的表示。在四个数据集上对两个软提示KG增强LLMs的实验表明,BadSKP在冻结和被植入设置下均实现了高攻击成功率,而仅基于文本的攻击即使在基于困惑度的防御下也仍然不可靠。
cs.AI / 78 / 2605.12012

LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters

LegalCheck:用于起草市政法律咨询信的检索增强和上下文增强生成
van der Meer, Virgill, Rossi, Julien
Abstract
Public-sector legal departments in the Netherlands face acute staff shortages, increased case volumes, and increased pressure to meet regulatory compliance. This paper presents LegalCheck, a novel system that addresses these challenges by automating the drafting of objection response letters through a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Using a large language model (LLM) alongside curated legal knowledge bases, LegalCheck performs retrieval of relevant laws and precedents, and uses controlled prompting to incorporate both external knowledge and case-specific details into a coherent draft. An expert-in-the-loop review ensures that each generated letter is legally sound and contextually appropriate. In a real-world deployment within the Municipality of Amsterdam, LegalCheck produced near-final advice letters in minutes rather than hours, while maintaining high legal consistency and factual accuracy. Its output is grounded in actual regulations and prior cases, yielding explainable drafts that captured the vast majority of required legal reasoning (often 80% to 100% of essential content). Legal professionals found that the system reduced their workload and ensured a consistent application of legal standards, without replacing human judgment. These results demonstrate substantial efficiency gains, improved legal consistency, and positive user acceptance. More broadly, this work illustrates how responsible AI can be deployed in the legal domain by augmenting LLMs with domain knowledge and governance mechanisms.
Chinese Translation
荷兰公共部门的法律部门面临严重的人手短缺、案件数量增加以及满足监管合规的压力加大。本文提出了LegalCheck,这是一种新颖的系统,通过结合检索增强生成(Retrieval-Augmented Generation, RAG)和上下文增强生成(Context-Augmented Generation, CAG)来应对这些挑战,自动化起草异议回复信。LegalCheck利用大型语言模型(Large Language Model, LLM)和精心策划的法律知识库,检索相关法律和判例,并使用受控提示将外部知识和案件特定细节融入到连贯的草稿中。专家审核环节确保每封生成的信件在法律上是合理的,并且在上下文上是适当的。在阿姆斯特丹市的实际部署中,LegalCheck能够在几分钟内生成接近最终的咨询信,而不是几个小时,同时保持高法律一致性和事实准确性。输出基于实际法规和先前案例,提供可解释的结果,涵盖了绝大多数所需的法律推理(通常为80%到100%的必要内容)。法律专业人士发现该系统减轻了他们的工作负担,并确保法律标准的一致应用,而不替代人类判断。这些结果表明了显著的效率提升、改善的法律一致性和积极的用户接受度。更广泛地说,这项工作展示了如何在法律领域中负责任地部署人工智能,通过将领域知识和治理机制增强大型语言模型。
cs.AI / 79 / 2605.12016

LLMs and the ZPD

大型语言模型与最近发展区(ZPD)
Wallis, Peter
Abstract
One hundred years ago Vygotsky and his circle were exploring the nature of consciousness and defining what would become psychology in the Soviet Union. They concluded that children develop "scientific thinking" through interacting with enculturated adults in Zones of Proximal Development or ZPDs. The proposal is that, contrary to the claims of some, the LLM mechanism is not doing thinking with "distributed representations," but rather the completion model is doing "primitive thinking" in terms of *practices*. Viewed from this perspective, it would seem our large language models don't hallucinate, but rather dream, and that what is needed is not "guard rails" but an investigation of the set of cognitive tools that enable us to do things that look like common-sense. The proposal here is that *interaction* is core to human communication rather than just an add-on to "real" understanding.
Chinese Translation
一百年前,维果茨基及其圈子正在探索意识的本质,并定义将成为苏联心理学的内容。他们得出结论,儿童通过与文化化成人在最近发展区(Zones of Proximal Development, ZPD)中的互动,发展出“科学思维”。与某些人的主张相反,本文提出大型语言模型(LLM)机制并不是通过“分布式表征”进行思考,而是完成模型在*实践*方面进行“原始思维”。从这个角度来看,我们的大型语言模型似乎并不是在幻觉,而是在做梦;我们所需要的不是“保护措施”,而是对使我们能够进行看似常识性活动的认知工具集合的研究。本文的提议是,*互动*是人类沟通的核心,而不仅仅是对“真实”理解的附加部分。
cs.AI / 80 / 2605.12056

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

OmniRefine:面向对齐的协同压缩方法用于高效的全模态大型语言模型
Deng, Yuchen, Cai, Zidang, Zheng, Hai-Tao, Wang, Jie, Yang, Feidiao, Han, Yuxing
Abstract
Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by the high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.
Chinese Translation
全模态大型语言模型(Omni-LLMs)在音频视频理解方面表现出强大的能力,但其实际部署仍受到长视频流和密集音频序列高推理成本的限制。尽管最近取得了一些进展,现有的Omni-LLMs压缩方法通常依赖于固定或原生压缩单元,这可能会破坏跨模态对应关系以及音频视频推理所需的互补信息,从而使得在稳定保持性能的同时提高推理效率变得困难。为了解决这个问题,我们提出了OmniRefine,一个无训练的两阶段框架,用于在Omni-LLMs中高效地压缩音视频标记。首先,保持对应关系的块细化(Correspondence-Preserving Chunk Refinement)通过帧音频相似性和动态规划将原生块边界细化为跨模态对齐的压缩单元。其次,模态感知协同压缩(Modality-Aware Cooperative Compression)在每个细化单元内共同压缩视频和音频标记,以减少冗余,同时保留关键证据。大量实验表明,OmniRefine在效率与性能的权衡上优于强基线,并在较低压缩比下保持稳定性能。在WorldSense数据集上,它在44%的标记保留比率下仍能达到46.7%的准确率,几乎与全标记基线相匹配。代码和接口将被发布以促进进一步研究。
cs.AI / 81 / 2605.12061

SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

SAGE:一种自我演化的代理图记忆引擎,用于结构感知的关联记忆
Wang, Juntong, Zhao, Haoyue, Pan, Guanghui, Wang, Xiyuan, Wang, Yanbo, Deng, Qiyan, Zhang, Muhan
Abstract
Long-term memory is becoming a central bottleneck for language agents. Existing RAG and GraphRAG systems largely treat memory graphs as static retrieval middleware, which limits their ability to recover complete evidence chains from partial cues, exploit reusable graph-structural roles, and improve the memory itself through downstream feedback. We introduce SAGE, a Self-evolving Agentic Graph-memory Engine that models graph memory as a dynamic long-term memory substrate. SAGE couples two roles: a memory writer that incrementally constructs structured graph memory from interaction histories, and a Graph Foundation Model-based memory reader that performs retrieval and provides feedback to the memory writer. We provide rigorous theoretical analyses supporting the framework. Across multi-hop QA, open-domain retrieval, domain-specific review QA, and long-term agent-memory benchmarks, SAGE improves evidence recovery, answer grounding, and retrieval efficiency: after two self-evolution rounds, it achieves the best average rank on multi-hop QA; in zero-shot open-domain transfer, it reaches 82.5/91.6 Recall@2/5 on NQ. Further results on LongMemEval and HaluMem show that training and reader-writer feedback improve multiple long-term memory and hallucination-diagnostic metrics, suggesting that self-evolving, structure-aware graph memory is a promising foundation for robust long-horizon language agents.
Chinese Translation
长期记忆正成为语言代理的一个核心瓶颈。现有的RAG和GraphRAG系统在很大程度上将记忆图视为静态检索中介,这限制了它们从部分线索中恢复完整证据链的能力,利用可重用的图结构角色,以及通过下游反馈改善记忆本身。我们提出了SAGE,一种自我演化的代理图记忆引擎,将图记忆建模为动态的长期记忆基底。SAGE结合了两个角色:一个记忆写入器,它从交互历史中逐步构建结构化的图记忆;一个基于图基础模型的记忆读取器,用于执行检索并向记忆写入器提供反馈。我们提供了严格的理论分析来支持该框架。在多跳问答、开放领域检索、特定领域评论问答和长期代理记忆基准测试中,SAGE改善了证据恢复、答案定位和检索效率:经过两轮自我演化,它在多跳问答中达到了最佳平均排名;在零样本开放领域迁移中,它在NQ上达到了82.5/91.6的Recall@2/5。进一步的LongMemEval和HaluMem结果表明,训练和读写反馈改善了多个长期记忆和幻觉诊断指标,这表明自我演化的、结构感知的图记忆是构建稳健的长期语言代理的有希望的基础。
cs.AI / 82 / 2605.12087

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

中间产物作为第一类公民:一种用于代理系统中持久中间产物的数据模型
Rosen, Josh, Rosen, Seth
Abstract
Many AI systems are organized around loops in which models reason, call tools, observe results, and continue until a task is complete. These systems often produce final artifacts such as memos, plans, recommendations, and analyses, while the intermediate work that shaped those outputs remains ephemeral. For multi-step, revisable AI work, final artifacts are often lossy projections over upstream state. We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts are not the model's private chain-of-thought. They are maintained work products such as evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products that later humans and agents can inspect, revise, supersede, and improve. The contribution is a systems-level data model. We distinguish intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalize additive and superseding update semantics with explicit current-state resolution; describe how artifact lineage supports durable intermediate state across revisions; and argue that evaluation must target maintained-state quality, not only final-output quality. The claim is not that artifacts make models smarter. It is that durable intermediate artifacts make AI-generated work more inspectable, revisable, and maintainable over time.
Chinese Translation
许多人工智能系统围绕循环组织,其中模型进行推理、调用工具、观察结果,并持续进行直到任务完成。这些系统通常会产生最终产物,如备忘录、计划、建议和分析,而塑造这些输出的中间工作则往往是短暂的。对于多步骤、可修订的人工智能工作,最终产物通常是上游状态的有损投影。我们认为,这类系统应当保留持久的、可检查的中间产物:类型化、结构化、可寻址、版本化、依赖感知、权威且可供下游计算使用。这些产物并不是模型的私有思维链,而是维护的工作产品,如证据图、主张结构、标准、假设、计划、转换规则、综合程序、未解决的紧张关系和部分产品,后续的人类和代理可以检查、修订、取代和改进。我们的贡献是一个系统级的数据模型。我们将中间产物与聊天记录、记忆、隐性思维链、叙述、思考和最终答案区分开来;形式化了具有明确当前状态解析的附加和取代更新语义;描述了产物谱系如何支持跨修订的持久中间状态;并主张评估必须针对维护状态质量,而不仅仅是最终输出质量。我们的主张并不是说产物使模型更聪明,而是持久的中间产物使得人工智能生成的工作在时间上更具可检查性、可修订性和可维护性。
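The additive and superseding update semantics described in this abstract can be illustrated with a small append-only store. This is only a sketch of the idea, assuming that the current state is resolved by replaying an artifact's versions in order; the class names, field names, and the replay rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactVersion:
    artifact_id: str        # stable address of the artifact
    version: int            # 1-based position in the artifact's lineage
    kind: str               # e.g. "assumptions", "evidence_map"
    content: tuple          # immutable payload items
    mode: str               # "additive" or "superseding"
    depends_on: tuple = ()  # ids of upstream artifacts


class ArtifactStore:
    """Append-only log; nothing is overwritten, so lineage stays inspectable."""

    def __init__(self):
        self._log = []

    def write(self, artifact_id, kind, content, mode, depends_on=()):
        if mode not in ("additive", "superseding"):
            raise ValueError(f"unknown update mode: {mode}")
        version = 1 + sum(1 for v in self._log if v.artifact_id == artifact_id)
        record = ArtifactVersion(artifact_id, version, kind,
                                 tuple(content), mode, tuple(depends_on))
        self._log.append(record)
        return record

    def lineage(self, artifact_id):
        """All versions of one artifact, oldest first."""
        return [v for v in self._log if v.artifact_id == artifact_id]

    def current(self, artifact_id):
        """Resolve current state: a superseding write resets, additive writes extend."""
        items = []
        for v in self.lineage(artifact_id):
            if v.mode == "superseding":
                items = list(v.content)
            else:
                items.extend(v.content)
        return items


store = ArtifactStore()
store.write("assumptions", "assumptions", ["budget is fixed"], "additive")
store.write("assumptions", "assumptions", ["deadline is Q3"], "additive")
before = store.current("assumptions")
store.write("assumptions", "assumptions", ["budget may grow"], "superseding")
after = store.current("assumptions")
print(before, after, len(store.lineage("assumptions")))
```

Keeping every version in the log while resolving a single current state is one way to make intermediate work revisable without losing the record of what it replaced.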
cs.AI / 83 / 2605.12105

Autonomy and Agency in Agentic AI: Architectural Tactics for Regulated Contexts

代理人工智能中的自主性与能动性:受监管环境下的架构策略
Safin, Damir, Balta, Dian
Abstract
Deploying agentic AI in regulated contexts requires principled reasoning about two design dimensions: agency (what the system can do) and autonomy (how much it acts without human involvement). Though often treated independently, they are coupled: at higher autonomy, human error correction is less available, so reliable operation requires constraining agency accordingly; compliance requirements reinforce this by mandating human involvement as action consequences grow. Yet no established approach addresses them jointly, leaving practitioners without a principled basis for reasoning about oversight, action consequences, and error correction. This work introduces a two-dimensional design space in which both dimensions are organised into five operational levels, making the coupling explicit and navigable. Autonomy ranges from human-commanded operation (L1) to fully autonomous monitoring (L5); agency ranges from reasoning over supplied context (L1) to committed writes to authoritative records (L5). Building on this space, we propose six architectural tactics--checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, and write staging--for adjusting a deployment's position within it. The tactics are grounded in two worked examples from public-sector contexts, illustrating how they apply under realistic compliance constraints. We further examine five deployment parameters--model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation--that shape what is achievable at any configuration independently of agency and autonomy. Together, the design space, tactics, and deployment parameters provide a shared vocabulary for principled, compliance-aware agentic AI design in which responsibility, auditability, and reversibility are explicit design considerations rather than properties that must be retrofitted after deployment.
Chinese Translation
在受监管的环境中部署能动性人工智能需要对两个设计维度进行原则性推理:能动性(系统能够做什么)和自主性(系统在多大程度上无需人类参与而行动)。尽管这两个维度通常被独立对待,但它们是相互关联的:在更高的自主性下,人为错误纠正的可能性降低,因此可靠的操作需要相应地限制能动性;合规要求通过在行动后果增加时强制人类参与来强化这一点。然而,目前没有建立的方案能够共同处理这两个维度,导致从业者缺乏一个原则性基础来推理监督、行动后果和错误纠正。本文引入了一个二维设计空间,其中两个维度被组织为五个操作级别,使得它们的耦合关系变得明确且可导航。自主性从人类指挥操作(L1)到完全自主监控(L5);能动性从对提供的上下文进行推理(L1)到对权威记录进行承诺性写入(L5)。基于这一空间,我们提出了六种架构策略——检查点、升级、多代理委派、工具提供、工具围栏和写入分阶段——用于调整部署在该空间中的位置。这些策略基于来自公共部门环境的两个实例,展示了它们在现实合规约束下的应用。我们进一步考察了五个部署参数——模型能力、代理架构、工具保真度、工作流程瓶颈和评估——这些参数塑造了在任何配置下独立于能动性和自主性所能实现的目标。总体而言,设计空间、策略和部署参数为原则性、合规意识的能动性人工智能设计提供了一个共享的词汇,其中责任、可审计性和可逆性是明确的设计考虑,而不是部署后必须重新调整的属性。
cs.AI / 84 / 2605.12106

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

大型语言模型作为约束双目标凸优化的摊销帕累托前沿生成器
Xu, Peipei, Ma, SiYuan, Liu, Yaohua, Wu, Yu, Liu, Guanliang, Zhang, Yang, Liu, Yong
Abstract
Generating feasible Pareto fronts for constrained bi-objective continuous optimization is central to multi-criteria decision-making. Existing methods usually rely on iterative scalarization, evolutionary search, or problem-specific solvers, requiring repeated optimization for each instance. We introduce DIPS, an end-to-end framework that fine-tunes large language models as amortized Pareto-front generators for constrained bi-objective convex optimization. Given a textual problem description, DIPS directly outputs an ordered set of feasible continuous decision vectors approximating the Pareto front. To make continuous optimization compatible with autoregressive language modeling, DIPS combines a compact discretization scheme, Numerically Grounded Token Initialization for new numerical tokens, and Three-Phase Curriculum Optimization, which progressively aligns structural validity, feasibility, and Pareto-front quality. Across five families of constrained bi-objective convex problems, a fine-tuned 7B-parameter model achieves normalized hypervolume ratios of 95.29% to 98.18% relative to reference fronts. With vLLM-accelerated inference, DIPS solves one instance in as little as 0.16 seconds and outperforms general-purpose and reasoning LLM baselines under the evaluated setting. These results suggest that LLMs can serve as effective amortized generators for continuous Pareto-front approximation.
Chinese Translation
为约束双目标连续优化生成可行的帕累托前沿是多标准决策的核心。现有方法通常依赖于迭代标量化、进化搜索或特定问题的求解器,需对每个实例进行重复优化。我们提出了DIPS,一个端到端框架,旨在微调大型语言模型,使其作为约束双目标凸优化的摊销帕累托前沿生成器。给定文本问题描述,DIPS直接输出一组有序的可行连续决策向量,近似帕累托前沿。为了使连续优化与自回归语言建模兼容,DIPS结合了一种紧凑的离散化方案、针对新数值标记的数值基础标记初始化(Numerically Grounded Token Initialization),以及三阶段课程优化(Three-Phase Curriculum Optimization),该优化逐步对齐结构有效性、可行性和帕累托前沿质量。在五类约束双目标凸问题中,经过微调的7B参数模型相对于参考前沿实现了95.29%到98.18%的标准化超体积比。通过vLLM加速推理,DIPS在短至0.16秒内解决一个实例,并在评估设置下优于通用和推理LLM基线。这些结果表明,LLM可以作为连续帕累托前沿近似的有效摊销生成器。
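The normalized hypervolume ratio reported in this abstract compares the area dominated by a generated front with the area dominated by a reference front. The sketch below computes it for two objectives, assuming both are minimized and that a reference point worse than every solution is supplied; the sample fronts and the reference point are invented, and the paper's exact normalization may differ.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Keep only points that are strictly better than the reference point.
    candidates = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in candidates:
        if f2 < best_f2:  # point is non-dominated among those seen so far
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical fronts given as (f1, f2) pairs.
reference_front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
approx_front = [(1.0, 3.0), (3.0, 1.0)]
ref_point = (4.0, 4.0)

ratio = hypervolume_2d(approx_front, ref_point) / hypervolume_2d(reference_front, ref_point)
print(ratio)
```

A ratio close to 1 means the generated front covers nearly as much of the objective space as the reference front, which is how figures such as 95.29% to 98.18% are typically read.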
cs.AI / 85 / 2605.12111

Adaptive Multi-Round Allocation with Stochastic Arrivals

具有随机到达的自适应多轮资源分配
Pan, Yuqi, Choo, Davin, Wang, Haichuan, Tambe, Milind, van Heerden, Alastair, Johnson, Cheryl
Abstract
We study a sequential resource allocation problem motivated by adaptive network recruitment, in which a limited budget of identical resources must be allocated over multiple rounds to individuals with stochastic referral capacity. Successful referrals endogenously generate future decision opportunities while allocating additional resources to an individual exhibits diminishing returns. We first show that the single-round allocation problem admits an exact greedy solution based on marginal survival probabilities. In the multi-round setting, the resulting Bellman recursion is intractable due to the stochastic, high-dimensional evolution of the frontier. To address this, we introduce a population-level surrogate value function that depends only on the remaining budget and frontier size. This surrogate enables an exact dynamic program via truncated probability generating functions, yielding a planning algorithm with polynomial complexity in the total budget. We further analyze robustness under model misspecification, proving a multi-round error bound that decomposes into a tight single-round frontier error and a population-level transition error. Finally, we evaluate our method on real-world inspired recruitment scenarios.
Chinese Translation
我们研究一个由自适应网络招聘所激励的序列资源分配问题,其中必须在多个轮次中将有限的相同资源分配给具有随机推荐能力的个体。成功的推荐会内生地产生未来的决策机会,而向个体分配额外资源则表现出递减收益。我们首先证明,单轮分配问题可以基于边际生存概率得到精确的贪心解。在多轮设置中,由于边界的随机高维演变,得到的贝尔曼递归是不可处理的。为了解决这个问题,我们引入一个仅依赖于剩余预算和边界大小的人口级替代价值函数。这个替代函数通过截断概率生成函数实现了精确的动态规划,从而产生了一个在总预算下具有多项式复杂度的规划算法。我们进一步分析了模型误设下的鲁棒性,证明了一个多轮误差界限,该界限分解为一个紧密的单轮边界误差和一个人口级转移误差。最后,我们在受现实启发的招聘场景中评估了我们的方法。
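A greedy single-round allocation under diminishing returns can be sketched as follows. The success model here is an assumption made for illustration, not the paper's model: each unit given to individual i independently succeeds with probability q[i], so the marginal gain of the next unit shrinks geometrically. Because those marginal gains are non-increasing, repeatedly taking the largest one maximizes the expected number of individuals with at least one success.

```python
import heapq


def greedy_allocate(q, budget):
    """Allocate `budget` identical units, one at a time, by largest marginal gain.

    Assumed model: with k units, individual i succeeds with probability
    1 - (1 - q[i])**k, so the (k+1)-th unit adds (1 - q[i])**k * q[i].
    """
    alloc = [0] * len(q)
    # Max-heap via negated gains; the first unit for i gains q[i].
    heap = [(-qi, i) for i, qi in enumerate(q)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        neg_gain, i = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # no remaining unit adds any expected value
        alloc[i] += 1
        next_gain = (1 - q[i]) ** alloc[i] * q[i]
        heapq.heappush(heap, (-next_gain, i))
    return alloc


def expected_successes(q, alloc):
    """Expected number of individuals with at least one success."""
    return sum(1 - (1 - qi) ** k for qi, k in zip(q, alloc))


allocation = greedy_allocate([0.5, 0.2], budget=3)
print(allocation, expected_successes([0.5, 0.2], allocation))
```

In the example, the third unit goes to the second individual because its first-unit gain (0.2) exceeds the first individual's third-unit gain (0.125).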
cs.AI / 86 / 2605.12120

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

语言模型对谁对齐?在高风险竞争需求下测量主要层级
Yu, Fangyi, Seedat, Nabeel, Schwarz, Jonathan Richard, Bean, Andrew M.
Abstract
Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.
Chinese Translation
在高风险的专业环境中部署的语言模型面临来自用户、机构权威和专业规范的相互冲突的需求。当这些需求发生冲突时,模型的表现揭示了一种主要层级——一种对竞争利益相关者的隐含排序,这决定了例如医疗人工智能在接收到医院管理员的成本降低指令时,是否会以牺牲循证护理为代价进行遵从,或因专业标准的要求而拒绝。在法律和医疗领域的7136个场景中,我们测试了十个前沿模型,发现模型在任务执行过程中(例如起草)经常未能遵循专业标准,尤其是在用户指令与这些标准发生冲突时——尽管在用户寻求咨询指导时能够充分遵守这些标准。我们进一步发现,这些模型所表现出的用户、权威和专业标准之间的层级在医疗和法律背景下是不稳定的,并且在不同模型家族之间也存在不一致。当未能遵循专业标准时,主要的失效机制是知识遗漏:那些明显拥有相关知识的模型在未能呈现冲突知识的情况下产生有害输出。在一个特别令人担忧的案例中,我们发现一个推理模型在其推理轨迹中识别出相关知识——例如,某药物已被撤回——但在面向用户的回答中抑制了这一信息,并在权威压力下仍然推荐该药物。在任务框架、领域和模型家族之间的不一致对齐表明,当前的对齐方法,包括已发布的对齐层级,在模型部署于高风险专业环境时不太可能是稳健的。
cs.AI / 87 / 2605.12131

Rollout Cards: A Reproducibility Standard for Agent Research

回放卡:代理研究的可重复性标准
Masters, Charlie, Liu, Ziyuan, Albrecht, Stefano V.
Abstract
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
Chinese Translation
长期以来影响机器学习和强化学习的可重复性问题现在在代理研究中浮现:论文通过报告的分数比较系统,同时使得与这些分数相关的回放记录难以检查。对于代理任务而言,这一点尤为重要,因为相同的行为在评估选择回放的不同部分或应用不同的报告规则时可能会获得不同的报告分数。在对50个流行的训练和评估库进行结构化审计时,我们发现没有一个报告了多少次运行失败、出错或被跳过,以及头条分数的情况。我们还记录了37个案例,其中报告规则可以改变任务成功率、成本/代币核算或固定证据的时间测量,有时变化非常显著。我们将回放记录视为代理研究的可重复性单位,而不是报告的分数。我们引入了回放卡:一种出版捆绑,保存回放记录并声明报告分数背后的视图、报告规则和丢弃清单。我们在两个环境中验证了回放卡。首先,四个部分公开发布的工具安全、多代理系统、定理证明和搜索使我们能够计算其原始报告未包含的分析。其次,重新评分在短答案、代码生成和工具使用任务中保留基准输出,显示仅改变报告规则就可以使报告分数变化20.9个绝对百分点,并且在某些情况下,颠倒前沿模型的排名。我们发布了一个集成到Ergon中的参考实现,Ergon是一个开源强化学习平台,并公开发布了Ergon生成的回放卡导出,涵盖工具使用、软件工程、网络交互、多代理协调、安全和搜索等基准,以支持未来的研究。
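The claim that a reporting rule alone can change a headline score is easy to reproduce on a toy rollout record. The sketch below scores the same five rollouts under two rules and reports how many records each rule drops. The record fields, status values, and rule names are hypothetical and are not the rollout-card schema.

```python
# Fixed evidence: five rollouts, two of which never produced a gradable result.
rollouts = [
    {"task": "t1", "status": "ok", "success": True},
    {"task": "t2", "status": "ok", "success": False},
    {"task": "t3", "status": "error", "success": False},
    {"task": "t4", "status": "skipped", "success": False},
    {"task": "t5", "status": "ok", "success": True},
]


def success_rate(records, rule):
    """Return (rate, dropped) so the number of excluded runs is always declared."""
    if rule == "completed_only":
        kept = [r for r in records if r["status"] == "ok"]
    elif rule == "all_attempted":
        kept = list(records)  # errored and skipped runs count as failures
    else:
        raise ValueError(f"unknown reporting rule: {rule}")
    dropped = len(records) - len(kept)
    rate = sum(r["success"] for r in kept) / len(kept) if kept else float("nan")
    return rate, dropped


for rule in ("completed_only", "all_attempted"):
    print(rule, success_rate(rollouts, rule))
```

The behaviour is identical in both cases, yet the reported rate moves from about 67% to 40%, which is why publishing the record and the rule together matters more than publishing the score.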
cs.AI / 88 / 2605.12139

BoolXLLM: LLM-Assisted Explainability for Boolean Models

BoolXLLM:基于大语言模型的布尔模型可解释性辅助
Cheng, Du, Kadioglu, Serdar, Wang, Xin
Abstract
Interpretable machine learning aims to provide transparent models whose decision-making processes can be readily understood by humans. Recent advances in rule-based approaches, such as expressive Boolean formulas (BoolXAI), offer faithful and compact representations of model behavior. However, for non-technical stakeholders, two main challenges remain in practice: (i) selecting semantically meaningful features and (ii) translating formal logical rules into accessible explanations. In this work, we propose BoolXLLM, a hybrid framework that integrates Large Language Models (LLMs) into the end-to-end pipeline of Boolean rule learning. We augment BoolXAI, an expressive Boolean rule-based classifier, with LLMs at three critical stages: (1) feature selection, where LLMs guide the identification of domain-relevant variables; (2) threshold recommendation, where LLMs propose semantically meaningful discretization strategies for numerical features; and (3) rule compression and interpretation, where Boolean rules are translated into natural language explanations at both global and local levels. This integration bridges formal, faithful explanations with human-understandable narratives, making it possible to build an explainable AI system that is both theoretically grounded and accessible to non-experts. Early empirical results demonstrate that LLM-assisted pipelines improve interpretability while maintaining competitive predictive performance. Our work highlights the promise of combining symbolic reasoning with language-based models for human-centered explainability.
Chinese Translation
可解释的机器学习旨在提供透明的模型,使其决策过程能够被人类轻松理解。最近在基于规则的方法方面的进展,例如表达性的布尔公式(BoolXAI),提供了对模型行为的真实且紧凑的表示。然而,对于非技术利益相关者而言,实践中仍然面临主要挑战:(i)选择语义上有意义的特征和(ii)将形式逻辑规则转化为易于理解的解释。在本研究中,我们提出了BoolXLLM,作为一个混合框架,将大语言模型(LLMs)集成到布尔规则学习的端到端流程中。我们在三个关键阶段增强了表达性布尔规则分类器BoolXAI与LLMs的结合:(1)特征选择阶段,LLMs指导识别与领域相关的变量;(2)阈值推荐阶段,LLMs为数值特征提出语义上有意义的离散化策略;(3)规则压缩和解释阶段,将布尔规则翻译为自然语言解释,涵盖全局和局部层面。这种集成架起了形式、真实的解释与人类可理解叙述之间的桥梁。这使得构建一个理论基础扎实且对非专家友好的可解释人工智能系统成为可能。早期的实证结果表明,LLM辅助的流程在保持竞争性预测性能的同时提高了可解释性。我们的研究强调了将符号推理与基于语言的模型结合用于以人为本的可解释性的潜力。
cs.AI / 89 / 2605.12154

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

MM-OptBench:一个基于求解器的多模态优化建模基准
Li, Zhong, Huang, Qi, Zhu, Yuxuan, Amiri, Mohammad Mohammadi, van Stein, Niki, Bäck, Thomas, van Leeuwen, Matthijs, Wen, Zaiwen, Yang, Lincen
Abstract
Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance information is conveyed through visual artifacts such as tables, graphs, maps, schedules, and dashboards. We introduce multimodal optimization modeling, a benchmark setting in which models must construct both a mathematical formulation and executable solver code from a text-and-visual problem specification. To evaluate this setting, we develop a solver-grounded framework that generates structured optimization instances, verifies each with an exact solver, and builds both the model-facing inputs and hidden reference files from the same verified source. We instantiate the framework as MM-OptBench, a benchmark of 780 solver-verified instances spanning 6 optimization families, 26 subcategories, and 3 structural difficulty levels. We evaluate 9 multimodal large language models (MLLMs), including 6 frontier general-purpose models and 3 math-specialized models, with aggregate, family-level, difficulty-level, and failure-mode analyses. The results show that the task remains far from solved: the best two models reach 52.1% and 51.3% pass@1, while on average across the six general-purpose MLLMs, pass@1 is 43.4% on easy instances and 15.9% on hard instances. All three math-specialized MLLMs solve 0/780 instances. Failure attribution shows that errors arise both when extracting instance data from text and visuals and when turning extracted data into solver-correct formulations and code. MM-OptBench provides a testbed for solver-grounded, decision-oriented multimodal intelligence.
Chinese Translation
优化建模将实际决策问题转化为数学优化模型和可由求解器执行的实现。尽管语言模型在生成优化公式和求解器代码方面的应用日益增多,但现有的基准几乎完全是文本形式的。这忽略了许多在实际操作中出现的优化建模任务,其中需求以文本描述,但实例信息通过表格、图形、地图、时间表和仪表板等视觉工件传达。我们引入了多模态优化建模,这是一种基准设置,其中模型必须从文本和视觉问题规范中构建数学公式和可执行的求解器代码。为了评估这一设置,我们开发了一个基于求解器的框架,该框架生成结构化的优化实例,使用精确求解器验证每个实例,并从相同的验证源构建模型输入和隐藏参考文件。我们将该框架实例化为MM-OptBench,这是一个包含780个经过求解器验证的实例的基准,涵盖6个优化家族、26个子类别和3个结构难度级别。我们评估了9个多模态大型语言模型(MLLMs),包括6个前沿通用模型和3个数学专用模型,进行了汇总、家族级、难度级和失败模式分析。结果表明,这一任务仍然远未解决:表现最好的两个模型的通过率分别为52.1%和51.3%,而在六个通用MLLMs中,简单实例的平均通过率为43.4%,困难实例的平均通过率为15.9%。所有三个数学专用MLLMs在780个实例中均未解决任何实例。失败归因显示,错误既出现在从文本和视觉中提取实例数据时,也出现在将提取的数据转化为求解器正确的公式和代码时。MM-OptBench为基于求解器的、面向决策的多模态智能提供了一个测试平台。
cs.AI / 90 / 2605.12159

ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization

ALGOGEN:工具生成的可验证轨迹以实现可靠的算法可视化
Liao, Kunpeng, Ma, Yuexiao, Lin, Yisheng, Zeng, Hualin, Zheng, Xiawu, Ji, Rongrong
Abstract
Algorithm Visualization (AV) helps students build mental models by animating algorithm execution states. Recent LLM-based systems such as CODE2VIDEO generate AV videos in an end-to-end manner. However, this paradigm requires the system to simultaneously simulate algorithm flow and satisfy video rendering constraints, such as element layout and color schemes. This complex task induces LLM hallucinations, resulting in reduced execution success rates, element overlap, and inter-frame inconsistencies. To address these challenges, we propose ALGOGEN, a novel paradigm that decouples algorithm execution from rendering. We first introduce Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations. The LLM then generates a Python tracker that simulates algorithm flow and outputs VTA-JSON traces, a JSON encoding of VTA. For rendering, we define a Rendering Style Language (RSL) to templatize algorithm layouts. A deterministic renderer then compiles algorithm traces with RSL into Manim, LaTeX/TikZ, or Three.js outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN raises the average success rate by 17.3 percentage points over end-to-end methods (99.8% versus 82.5%). These results demonstrate that our decoupling paradigm effectively mitigates LLM hallucinations in complex AV tasks, providing a more reliable solution for automated generation of high-quality algorithm visualizations. Demo videos and code are available in the project repository.
Chinese Translation
算法可视化(AV)通过动画展示算法执行状态,帮助学生建立心理模型。近期基于大型语言模型(LLM)的系统如CODE2VIDEO以端到端的方式生成AV视频。然而,这种范式要求系统同时模拟算法流程并满足视频渲染约束,例如元素布局和配色方案。这一复杂任务导致LLM产生幻觉,造成执行成功率降低、元素重叠和帧间不一致。为了解决这些挑战,我们提出了ALGOGEN,这是一种新颖的范式,将算法执行与渲染解耦。我们首先引入可视化轨迹代数(Visualization Trace Algebra, VTA),这是一个关于算法可视状态和操作的幺半群(monoid)。然后,LLM生成一个Python跟踪器,模拟算法流程并输出VTA-JSON轨迹,即VTA的JSON编码。为了渲染,我们定义了一种渲染样式语言(Rendering Style Language, RSL)来模板化算法布局。一个确定性渲染器随后将算法轨迹与RSL编译为Manim、LaTeX/TikZ或Three.js输出。在对200个任务的LeetCode AV基准进行评估时,ALGOGEN的平均成功率比端到端方法提高了17.3个百分点(99.8%对比82.5%)。这些结果表明,我们的解耦范式有效减轻了复杂AV任务中的LLM幻觉,为高质量算法可视化的自动生成提供了更可靠的解决方案。演示视频和代码可在项目库中获取。
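As a rough illustration of the decoupling described in this abstract, the sketch below has a tracker run bubble sort and emit a JSON trace of visual operations, which a separate replay step then consumes. Traces are treated as a monoid under concatenation, with the empty trace as identity. The operation names (`compare`, `swap`), the JSON layout, and all function names are assumptions made for illustration; the abstract does not specify the actual VTA-JSON schema or the RSL renderer.

```python
import json

def track_bubble_sort(values):
    """Run bubble sort and record each visual operation (hypothetical op names)."""
    state = list(values)
    trace = []
    n = len(state)
    for i in range(n):
        for j in range(n - 1 - i):
            trace.append({"op": "compare", "i": j, "j": j + 1})
            if state[j] > state[j + 1]:
                state[j], state[j + 1] = state[j + 1], state[j]
                trace.append({"op": "swap", "i": j, "j": j + 1})
    return trace

def apply_trace(state, trace):
    """Replay a trace on a visual state; in this toy only 'swap' changes the array."""
    state = list(state)
    for step in trace:
        if step["op"] == "swap":
            i, j = step["i"], step["j"]
            state[i], state[j] = state[j], state[i]
    return state

def compose(t1, t2):
    """Monoid operation on traces: concatenation, with [] as the identity."""
    return t1 + t2

trace = track_bubble_sort([3, 1, 2])
trace_json = json.dumps(trace)  # the only artifact a renderer would need
```

Because the replay step is deterministic, any layout or styling decision can be made after the trace exists, which is the property the abstract attributes to separating execution from rendering.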
cs.AI / 91 / 2605.12178

Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

企业系统是否需要学习的世界模型?推断动态时上下文的重要性
Nair, Jishnu Sethumadhavan, Bechard, Patrice, Maheshwary, Rishabh, Dasgupta, Surajit, Ramachandran, Sravan, Bhagat, Aakash, Radhakrishna, Shruthan, Pattnaik, Pulkit, Obando-Ceron, Johan, Malay, Shiva Krishna Reddy, Davasam, Sagar, Subramanian, Seganrasan, Mittal, Vipul, Nemala, Sridhar Krishna, Pal, Christopher, Sunkara, Srinivas, Rajeswar, Sai
Abstract
World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.
Chinese Translation
世界模型使代理能够通过内化环境动态来预测其行为的效果。然而,在企业系统中,这些动态通常由特定于租户的业务逻辑定义,这些逻辑在不同的部署中有所不同并且随着时间的推移而演变,这使得基于历史过渡训练的模型在部署变更时变得脆弱。我们提出一个世界模型文献尚未解决的问题:当规则在推断时可以被读取时,代理是否仍然需要学习这些规则?我们论证并通过实证研究证明,在过渡动态可配置且可读取的环境中,运行时发现通过将预测与活动系统实例相结合,补充了离线训练。我们提出了企业发现代理,它通过读取系统的配置而不是仅依赖内化表示,在运行时恢复相关的过渡动态。我们引入了CascadeBench,这是一个专注于推理的企业级级联预测基准,采用了在多样化合成环境中评估工作流世界的方法,并结合部署变更评估,展示了离线训练的世界模型在分布内表现良好,但随着动态变化而退化,而基于发现的代理在变化下更具鲁棒性,因为它们将预测与当前实例相结合。我们的研究结果表明,在可配置的企业环境中,代理不应仅依赖固定的内化动态,而应在运行时纳入发现相关过渡逻辑的机制。
cs.AI / 92 / 2605.12181

MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification

MolDeTox:评估语言模型逐步片段编辑在分子解毒中的应用
Park, Jueon, Jang, Wonjune, Lee, Jiwoo, Park, Yein, Kang, Jaewoo
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have recently shown promising capabilities in various scientific domains. In particular, these advances have opened new opportunities in drug discovery, where the ability to understand and modify molecular structures is critical for optimizing drug properties such as efficacy and toxicity. However, existing models and benchmarks often overlook toxicity-related challenges, focusing primarily on general property optimization without adequately addressing safety concerns. In addition, even existing toxicity repair benchmarks suffer from limited data diversity, low structural validity of generated molecules, and heavy reliance on proxy models for toxicity assessment. To address these limitations, we propose MolDeTox, a novel benchmark for molecular detoxification, designed to enable fine-grained and reliable evaluation of toxicity-aware molecular optimization across stepwise tasks. We evaluate a wide range of general-purpose LLMs and VLMs under diverse settings, and demonstrate that understanding and generating molecules at the fragment level improves structural validity and enhances the quality of generated molecules. Moreover, through detailed task-level performance analysis, MolDeTox provides an interpretable benchmark that enables a deeper understanding of the detoxification process. Our dataset is available at: https://huggingface.co/datasets/MolDeTox/MolDeTox
Chinese Translation
大型语言模型(LLMs)和视觉语言模型(VLMs)最近在多个科学领域展现出令人鼓舞的能力。特别是,这些进展为药物发现开辟了新的机会,其中理解和修改分子结构的能力对于优化药物的有效性和毒性等属性至关重要。然而,现有模型和基准往往忽视与毒性相关的挑战,主要集中在一般属性优化上,而未能充分解决安全性问题。此外,即使是现有的毒性修复基准也存在数据多样性有限、生成分子的结构有效性低以及对毒性评估的代理模型依赖严重等问题。为了解决这些局限性,我们提出了MolDeTox,一个新颖的分子解毒基准,旨在实现对毒性意识分子优化的细粒度和可靠评估,涵盖逐步任务。我们在多种设置下评估了广泛的通用LLMs和VLMs,并证明在片段级别理解和生成分子可以提高结构有效性并增强生成分子的质量。此外,通过详细的任务级性能分析,MolDeTox提供了一个可解释的基准,使我们能够更深入地理解解毒过程。我们的数据集可在以下网址获取:https://huggingface.co/datasets/MolDeTox/MolDeTox
cs.AI / 93 / 2605.12213

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

面向目标的推理在基于RAG的对话代理大型语言模型系统中的应用
Liang, Jiazhou, Toroghi, Armin, Liu, Yifan Simon, Kalarde, Faeze Moradi, Gallagher, Liam, Sanner, Scott
Abstract
LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent's ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user's utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
Chinese Translation
基于大型语言模型(LLM)的对话人工智能代理在长时间范围内保持一致行为方面面临挑战,这主要是由于上下文的限制。虽然基于检索增强生成(RAG)的方法越来越多地被采用,以通过将交互存储在外部记忆模块中并从中进行检索来克服这一限制,但它们在回答具有挑战性的问题(例如,多跳推理、常识推理)时的有效性最终取决于代理对检索信息的推理能力。然而,现有的方法通常基于与用户原始发言的语义相似性来检索记忆,这缺乏对缺失中间事实的明确推理,且常常返回与基础推理无关或不足的证据。在本研究中,我们引入了Goal-Mem,一个面向目标的推理框架,旨在为基于RAG的代理记忆提供支持,该框架从用户的发言作为目标进行明确的逆向推理。Goal-Mem并不是逐步扩展检索到的上下文,而是将每个目标分解为原子子目标,进行有针对性的记忆检索以满足每个子目标,并在无法解决中间目标时迭代识别应从记忆中检索哪些信息。我们在自然语言逻辑(Natural Language Logic)中形式化了这一过程,这是一种将一阶逻辑(FOL)提供的推理可验证性与自然语言的表现力相结合的逻辑系统。通过在两个数据集上进行广泛实验,并与九个强大的记忆基线进行比较,我们展示了Goal-Mem在性能上始终有所提升,特别是在需要多跳推理和隐含推理的任务上。
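To make the backward-chaining idea in this abstract concrete, here is a minimal toy sketch: a goal that cannot be answered directly from memory is decomposed into a subgoal, the missing intermediate fact is retrieved, and that fact is used to form the next targeted lookup. The memory contents, the `RULES` table, and the function names are all hypothetical; the actual Goal-Mem framework relies on an LLM and its Natural Language Logic formalism, neither of which is modeled here.

```python
# Toy external memory of stored facts (hypothetical contents).
MEMORY = {
    "user_city": "Toronto",
    "city_timezone:Toronto": "America/Toronto",
}

# Hypothetical decomposition rules:
# goal -> (subgoal to resolve first, template for the follow-up subgoal).
RULES = {
    "user_timezone": ("user_city", "city_timezone:{}"),
}

def retrieve(key):
    """Targeted memory lookup for one atomic subgoal."""
    return MEMORY.get(key)

def solve(goal, depth=0, max_depth=5):
    """Backward chaining: answer from memory if possible, else resolve subgoals."""
    direct = retrieve(goal)
    if direct is not None:
        return direct
    if goal not in RULES or depth >= max_depth:
        return None  # nothing in memory can ground this goal
    first, template = RULES[goal]
    intermediate = solve(first, depth + 1, max_depth)  # the missing intermediate fact
    if intermediate is None:
        return None
    return solve(template.format(intermediate), depth + 1, max_depth)

answer = solve("user_timezone")
```

A similarity search on the raw query "user_timezone" would find neither stored fact in this toy, whereas working backward from the goal reaches the answer in two targeted lookups.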
cs.AI / 94 / 2605.12240

No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents

没有NOD就没有行动:一种可靠服务代理的异构多代理架构
Yang, Zixu, Zheng, Hang, Jiang, Nan, Tang, Zhiyang, Zhang, Situo, Wu, Xiaobao, Chen, Lu, Yu, Kai
Abstract
Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long-horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real-world deployment. To address these challenges, we propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for service agents. Instead of maintaining task state implicitly in dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision-making by the Navigator. Besides, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As such, NOD effectively mitigates error propagation and unsafe behavior in long-horizon tasks. Experiments on $\tau^2$-Bench demonstrate that NOD achieves higher task success rates and critical action precision over baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user-intent misalignment.
Chinese Translation
大型语言模型(LLM)代理在服务应用方面取得了显著进展,例如机票预订。然而,这些服务代理在长时间任务中存在不可靠性,常常导致政策违规、工具幻觉和行动不一致,这极大地阻碍了它们在现实世界中的部署。为了解决这些挑战,我们提出了NOD(导航者-操作员-导演),一种用于服务代理的异构多代理架构。与以往研究中隐式地在对话上下文中维护任务状态不同,我们将结构化的全局状态外部化,以实现明确的任务状态跟踪和导航者的一致决策。此外,我们在关键行动之前引入选择性外部监督,允许独立的导演代理验证执行并在必要时进行干预。因此,NOD有效地减轻了错误传播和长时间任务中的不安全行为。在$\tau^2$-Bench上的实验表明,NOD在任务成功率和关键行动精确度上优于基线。更重要的是,NOD通过减少政策违规、工具幻觉和用户意图不一致,提高了服务代理的可靠性。
cs.AI / 95 / 2605.12255

Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference

为何相同观察结果的结论会出现分歧:通过推理形式化世界模型的非可识别性
Takahashi, Toru
Abstract
When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) $\theta$-level non-identifiability, where conclusions diverge under the same world model $W$ because inference settings differ; and (ii) $W$-level non-identifiability, where repeated use of an inference setting $\theta$ biases data exposure and update rules, causing the learned world model $W$ itself to diverge. We introduce an inference profile $\theta = (R, E, S, D)$, consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation $o$ and the same $W$. We further explain why disagreements tend to project onto a small number of bases -- abstract versus concrete, externalizability, and order versus freedom -- as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.
Chinese Translation
当人们共享相同的文档和观察结果却得出不同的结论时,这种分歧往往转化为对另一方的判断,认为其在认知上存在缺陷、不理性或出于恶意。本文认为,这种分歧更应被描述为推理和学习中固有的非可识别性,而非对方的缺陷。我们将这一现象组织为两个层次:(i) $\theta$-层次非可识别性,在相同的世界模型 $W$ 下,由于推理设置的不同而导致结论分歧;(ii) $W$-层次非可识别性,在重复使用推理设置 $\theta$ 时,数据曝光和更新规则的偏差导致学习到的世界模型 $W$ 本身出现分歧。我们引入了一个推理轮廓 $\theta = (R, E, S, D)$,由参考、探索、稳定和视野组成,并展示了即使在相同观察 $o$ 和相同 $W$ 的情况下,输出也可能出现分裂。我们进一步解释了为何分歧往往投射到少数几个基础上——抽象与具体、外部化能力,以及秩序与自由——这是学习系统的一般约束(计算、观察和协调约束)的结果。最后,我们将该框架与深度表示学习相关联,包括表示层次、潜在状态估计和正则化-探索权衡,并通过对人工智能监管辩论的案例研究来说明该框架。
cs.AI / 96 / 2605.12262

Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

缺失性马尔可夫决策过程:弥合缺失数据理论与部分可观察马尔可夫决策过程的关系
Wendland, Joshua, Zubia, Markel, Andriushchenko, Roman, Galesloot, Maris F. L., Ceska, Milan, von Kleist, Henrik, Simao, Thiago D., Weininger, Maximilian, Jansen, Nils
Abstract
We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action-observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are epsilon-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
Chinese Translation
我们引入了缺失性马尔可夫决策过程(missingness-MDPs,简称 miss-MDPs),这是一类新颖的部分可观察马尔可夫决策过程(POMDPs)子类,结合了缺失数据理论。miss-MDP 是一种其观测函数为缺失性函数的 POMDP,该函数指定了在某一时间步中个体状态特征缺失(即未观察到)的概率。文献中区分了三种典型的缺失类型:完全随机缺失(MCAR)、随机缺失(MAR)和非随机缺失(MNAR)。我们的规划问题是计算具有未知缺失性函数的 miss-MDP 的近似最优策略,前提是给定一个行动-观测轨迹的数据集。为了实现对策略的最优性保证,需要从数据中学习缺失性函数,这对于一般的 POMDP 来说是不可行的。为了克服这一挑战,我们利用不同缺失类型的结构特性推导出学习缺失性函数的概率近似正确(PAC)算法。这些算法产生一个近似但完全指定的 miss-MDP,我们使用现成的规划方法对其进行求解。我们证明,在高概率下,所得到的策略在真实的 miss-MDP 中是 epsilon-最优的。实证结果验证了理论,并展示了我们的方法在两种无模型 POMDP 方法上的优越性能。
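The three missingness types named in this abstract differ only in what the missingness probability is allowed to depend on. The sketch below illustrates that distinction with a toy observation function. The feature names, the probabilities, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import random

def p_missing_mcar(state, feature):
    """MCAR: the probability ignores the state entirely."""
    return 0.2

def p_missing_mar(state, feature):
    """MAR: depends only on a feature that is always observed (here 'age')."""
    return 0.5 if state["age"] > 60 else 0.1

def p_missing_mnar(state, feature):
    """MNAR: depends on the very value that may go missing."""
    return 0.8 if state[feature] > 140 else 0.05

def observe(state, p_missing, maskable, rng):
    """Sample one observation: each maskable feature is hidden independently."""
    obs = dict(state)
    for feature in maskable:
        if rng.random() < p_missing(state, feature):
            obs[feature] = None  # None marks an unobserved feature
    return obs

state = {"age": 70, "blood_pressure": 150}
obs = observe(state, p_missing_mnar, ["blood_pressure"], random.Random(0))
```

In the MNAR case the observed data alone does not reveal why a value is absent, which hints at why the abstract treats the missingness types separately when learning the missingness function.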
cs.AI / 97 / 2605.12265

How Useful Is Cross-Domain Generalization for Training LLM Monitors?

跨领域泛化对训练大型语言模型监控器的实用性如何?
Martin, Sam, Roger, Fabien
Abstract
Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We show that such training partially generalizes to adjacent domains, improving classification performance on tasks that are unseen during training. However, we identify specific edge cases where the fine-tuned models fail to follow prompts, such as when the classification prompt changes completely while the data domain remains the same as during training. We show that classification training can be mixed with general instruction following training, and that (when done well) such training keeps the benefits of classification training and mitigates its generalization failures. Surprisingly, we see that this no-thinking supervised classification training can generalize to with-thinking classification and summarization, suggesting that no-thinking classification training might be instrumentally useful in building other kinds of classifiers and monitoring systems.
Chinese Translation
使用提示语言模型作为分类器能够在训练数据有限的领域进行分类,但错过了微调所能带来的某些鲁棒性和性能优势。我们研究了在多个分类任务上训练(每个任务都有其自己的提示)是否能提高在新领域和新分类提示上的性能。我们表明,这种训练在相邻领域部分泛化,提高了在训练过程中未见任务上的分类性能。然而,我们识别出一些特定的边缘案例,在这些情况下,微调模型未能遵循提示,例如当分类提示完全改变而数据领域与训练时相同。我们展示了分类训练可以与一般指令跟随训练相结合,并且(当执行得当时)这种训练保持了分类训练的优势并减轻了其泛化失败。令人惊讶的是,我们发现这种无思考的监督分类训练可以泛化到有思考的分类和摘要,暗示无思考的分类训练在构建其他类型的分类器和监控系统中可能具有实用价值。
cs.AI / 98 / 2605.12276

NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities

NARA:基于锚点条件的异构地理实体关系感知上下文化
Kim, Jina, Mai, Gengchen, Zhao, Lingyi, Shafique, Khurram, Chiang, Yao-Yi
Abstract
Geospatial foundation models have primarily focused on raster data such as satellite imagery, where self-supervised learning has been widely studied. Vector geospatial data instead represent the world as discrete geoentities with explicit geometry, semantics, and structured spatial relations, including metric proximity and topological relationships. These relations jointly determine how entities interact within space, yet existing representation learning methods remain fragmented, often restricted to specific geometry types or partial spatial relations, limiting their ability to capture unified spatial context across heterogeneous geoentities. We propose NARA (Neural Anchor-conditioned Relation-Aware representation learning), a self-supervised framework for vector geoentities. NARA learns context-dependent representations by jointly modeling semantics, geometry, and spatial relations within a unified framework and captures relational spatial structure beyond proximity alone, enabling rich contextualized representations across heterogeneous geoentities of points, polylines, and polygons. Evaluation on building function classification, traffic speed prediction, and next point-of-interest recommendation shows consistent improvements over prior methods, highlighting the benefit of unified relational modeling for vector geospatial data.
Chinese Translation
地理空间基础模型主要集中于栅格数据,如卫星图像,自监督学习在此领域得到了广泛研究。而矢量地理数据则将世界表示为具有明确几何形状、语义和结构化空间关系的离散地理实体,包括度量接近度和拓扑关系。这些关系共同决定了实体在空间中的交互方式,但现有的表示学习方法仍然零散,通常限制于特定的几何类型或部分空间关系,限制了它们在异构地理实体之间捕捉统一空间上下文的能力。我们提出了NARA(神经锚点条件关系感知表示学习),这是一个针对矢量地理实体的自监督框架。NARA通过在统一框架内共同建模语义、几何和空间关系来学习上下文依赖的表示,并捕捉超越接近度的关系空间结构,从而在点、折线和多边形等异构地理实体之间实现丰富的上下文化表示。在建筑功能分类、交通速度预测和下一个兴趣点推荐的评估中,NARA显示出相较于先前方法的一致性改进,突显了统一关系建模对矢量地理数据的益处。
cs.AI / 99 / 2605.12294

Executable Agentic Memory for GUI Agent

可执行的代理记忆用于图形用户界面代理
Qin, Zerui, Yue, Sheng, Hua, Xingyuan, Fu, Yongjian, Ren, Ju
Abstract
Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline using state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search where a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines like UI-TARS-7B by up to $19.6\%$ on AndroidWorld, while reducing token costs $6\times$ relative to GPT-4o. With a $2.8$s average latency, EAM enables reliable, quick, and long-horizon GUI automation.
Chinese Translation
现代图形用户界面(GUI)代理通常依赖于以模型为中心的逐步交互范式,在每个界面上,大型语言模型(LLMs)必须重新解释用户界面并重新决定行动,这在长时间跨度的任务中显得脆弱。本文提出了可执行的代理记忆(Executable Agentic Memory, EAM),这是一种结构化的知识图谱(Knowledge Graph, KG),将GUI规划从自由形式生成转变为稳健的检索与执行过程。我们的方法包括一个样本高效的记忆构建管道,利用状态感知的深度优先搜索(DFS)和动作组挖掘来压缩多步骤例程。为了确保高效规划,我们引入了一种价值引导的图搜索,其中轻量级的Q函数模型引导蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)在知识图谱上进行。我们从理论上建立了Q模型的偏差一致性,并推导出路径恢复的样本复杂度界限。从实证上看,EAM在AndroidWorld上相较于最先进的基线如UI-TARS-7B提高了高达19.6%的性能,同时相对于GPT-4o减少了6倍的令牌成本。EAM以2.8秒的平均延迟,实现了可靠、快速且长时间跨度的GUI自动化。
cs.AI / 100 / 2605.12321

LISA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

LISA:无信号自主交叉口管理的认知仲裁
Lakas, Abderrahmane, Ferrag, Mohamed Amine, Debbah, Merouane
Abstract
Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LISA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LISA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LISA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LISA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LISA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.
Chinese Translation
大型语言模型(LLMs)在智能交通系统(ITS)中展现出强大的潜力,尤其是在需要情境推理和多智能体协调的任务中。这些能力使其非常适合于合作驾驶,而基于规则的方法在复杂和动态的交通环境中往往表现不佳。交叉口管理尤其具有挑战性,因为需要实时解决相互冲突的优先通行权要求、异构车辆优先级以及特定于车辆的运动学约束。然而,现有的方法通常将LLMs作为信号基础系统的辅助组件,而非主要决策者。信号控制器仍然是与车辆无关的,基于预留的方法缺乏意图意识,而最近的基于LLM的系统仍然依赖于信号基础设施。此外,LLM推理延迟限制了其在亚秒级控制环境中的应用。我们提出了LISA(基于LLM的意图驱动速度建议),这是一个无信号的认知仲裁框架,用于自主交叉口管理。LISA利用LLM对声明的车辆意图进行推理,结合优先级类别、排队压力和能量偏好。我们将LISA与固定周期控制、SCATS、AIM和GLOSA在不同交通负载下进行了评估。结果表明,LISA将平均控制延迟减少了多达89.1%,并在所有非LLM基准方法降级至服务水平F时保持服务水平C。在接近饱和的需求下,LISA将平均等待时间减少了93%,峰值排队长度减少了60.6%,相对于固定周期控制。它还将燃料消耗降低了多达48.8%,并实现了86.2%的意图满足率,而最佳非LLM方法的意图满足率为61.2%。这些结果表明,基于LLM的推理能够实现实时、无信号的交叉口管理。
cs.AI / 101 / 2605.12332

Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

基于大型语言模型的非塔机场自动化空中交通安全评估研究
Darrell, Torsten, Ghazanfari, Mahyar, Kam, Jordan, Bayen, Alexandre, Tabrizian, Amin, Wei, Peng
Abstract
We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real-world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples and includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLMs on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open-source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real-world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.
Chinese Translation
我们研究了在非塔机场使用大型语言模型(LLMs)进行飞行后安全分析的框架。非塔机场依赖于公共交通咨询频率(CTAF)进行空中交通协调,由于飞行员自我公告通信协议,常常发生近失碰撞。我们提出了一种通用的视觉-语言模型(VLM)方法,以分析转录的CTAF无线电通信、气象机场报告(METAR)天气数据、自动相关监视广播(ADS-B)飞行轨迹和机场的视觉飞行规则分区图。我们在半月湾机场进行了初步研究,结合了定性实地案例研究和使用新的合成数据集进行的定量评估,该数据集包含通信和天气模态。我们使用Gemini 2.5 Pro对真实飞行数据进行了定性评估,准确识别出了一起优先权违规事件。合成数据集源于真实示例,包含12类危害分类法,并用于基准测试三种开源(Qwen 2.5-7B, Mistral-7B, Gemma-2-9B)和三种闭源(GPT-4o, GPT-5.4, Claude Sonnet 4.6)LLM模型在与CTAF和METAR相关的输入子集上的表现。即使仅限于CTAF和METAR输入以及开源LLM,我们的方法实例通常在二元名义/危险分类任务中实现超过0.85的宏F1分数。未来的工作包括对所有模态和更多真实示例进行定量评估。综合来看,我们的结果表明,VLM对非塔机场安全的分析可能成为一种有价值的未来能力。
cs.AI / 102 / 2605.12334

Reinforcing VLAs in Task-Agnostic World Models

在任务无关世界模型中强化视觉-语言-动作(VLA)
Wang, Yucen, Yu, Rui, Zhang, Fengming, Lu, Junjie, Qin, Xinyao, Zhang, Tianxiang, Wang, Kaixin, Zhao, Li
Abstract
Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.
Chinese Translation
通过在学习的世界模型中使用强化学习(RL)对视觉-语言-动作(VLA)模型进行后训练,已成为适应新任务的一种有效策略,而无需昂贵的现实世界交互。然而,尽管使用想象的轨迹减少了策略训练的样本复杂性,现有方法仍然在很大程度上依赖于特定任务的数据来微调世界模型和奖励模型,这从根本上限制了它们对未见任务的可扩展性。为了解决这个问题,我们认为世界模型和奖励模型应该捕捉可转移的物理先验,以实现零样本推断。我们提出了RAW-Dream(在任务无关世界梦中强化VLA),这是一种全新范式,完全将世界模型学习与下游任务依赖解耦。RAW-Dream利用在多样化无任务行为上预训练的世界模型来预测未来的展开,并使用现成的视觉-语言模型(VLM)生成奖励。由于这两个组件都是任务无关的,VLA可以完全在这种零样本想象中为任何新任务进行微调。此外,为了减轻世界模型的幻觉,我们引入了一种双噪声验证机制,以过滤不可靠的展开。在模拟和现实世界设置中的广泛实验表明,性能持续提升,证明了通用物理先验可以有效替代昂贵的任务依赖数据,为VLA适应提供了高度可扩展的路线图。
cs.AI / 103 / 2605.12357

$\delta$-mem: Efficient Online Memory for Large Language Models

$\delta$-mem: 大型语言模型的高效在线记忆
Lei, Jingdi, Zhang, Di, Li, Junxian, Wang, Weida, Fan, Kaixuan, Liu, Xiang, Liu, Qihan, Ma, Xiaoteng, Chen, Baian, Poria, Soujanya
Abstract
Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose $\delta$-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. $\delta$-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone's attention computation during generation. With only an $8\times8$ online memory state, $\delta$-mem improves the average score to $1.10\times$ that of the frozen backbone and $1.15\times$ that of the strongest non-$\delta$-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching $1.31\times$ on MemoryAgentBench and $1.20\times$ on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
Chinese Translation
大型语言模型越来越需要在长期助手和代理系统中积累和重用历史信息。单纯扩展上下文窗口成本高昂,并且往往无法确保有效的上下文利用。我们提出了$\delta$-mem,一种轻量级的记忆机制,它通过紧凑的在线关联记忆状态增强了冻结的全注意力骨干网络。$\delta$-mem将过去的信息压缩成一个固定大小的状态矩阵,该矩阵通过增量规则学习(delta-rule learning)进行更新,并利用其读出生成低秩修正,以改善生成过程中的骨干网络注意力计算。仅使用$8\times8$的在线记忆状态,$\delta$-mem将平均得分提高到冻结骨干网络的$1.10\times$,以及最强的非$\delta$-mem记忆基线的$1.15\times$。在对记忆要求较高的基准测试中,它取得了更大的提升,在MemoryAgentBench上达到$1.31\times$,在LoCoMo上达到$1.20\times$,同时在很大程度上保持了整体能力。这些结果表明,通过与注意力计算直接耦合的紧凑在线状态,可以实现有效的记忆,而无需完全微调、替换骨干网络或显式扩展上下文。
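The fixed-size state matrix updated by delta-rule learning can be illustrated with the textbook delta rule for associative memory, $S \leftarrow S + \beta\,(v - S k)\,k^\top$. The sketch below is a minimal pure-Python version under that assumption; the exact update used by $\delta$-mem, the way keys and values are derived from the backbone, and the low-rank attention correction are not given in the abstract and are not modeled here. The names `delta_update` and `read` and the parameter `beta` are hypothetical.

```python
def delta_update(S, k, v, beta=1.0):
    """One delta-rule write: S <- S + beta * (v - S k) k^T (assumed textbook form)."""
    d = len(S)
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]  # current readout S k
    err = [v[i] - pred[i] for i in range(d)]                          # prediction error
    return [[S[i][j] + beta * err[i] * k[j] for j in range(d)] for i in range(d)]

def read(S, q):
    """Read the memory with a query vector: returns S q."""
    return [sum(S[i][j] * q[j] for j in range(len(q))) for i in range(len(S))]

D = 8                                    # matches the 8x8 state size quoted in the abstract
S = [[0.0] * D for _ in range(D)]        # empty memory
k = [1.0] + [0.0] * (D - 1)              # a unit-norm key
v_old = [float(i) for i in range(D)]
v_new = [1.0] * D

S = delta_update(S, k, v_old)            # store v_old under k
first_read = read(S, k)
S = delta_update(S, k, v_new)            # writing again under k overwrites, not accumulates
second_read = read(S, k)
```

The state stays 8x8 however many writes occur, and a repeated key replaces its old value because the update is driven by the prediction error rather than by the raw value.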
cs.AI / 104 / 2605.12366

Classifier Context Rot: Monitor Performance Degrades with Context Length

分类器上下文腐化:监控器性能随上下文长度增加而下降
Martin, Sam, Roger, Fabien
Abstract
Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on a dataset that requires identifying when a coding agent takes a subtly dangerous action, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions $2\times$ to $30\times$ more often when they occur after 800K tokens of benign activity than when they occur on their own. We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training. Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
Chinese Translation
使用语言模型监控编码代理的危险行为需要对超过50万标记的转录文本进行分类,但先前的代理监控基准很少包含超过10万标记的转录文本。我们发现,当作为分类器使用时,当前的前沿模型在较长的转录文本中更常忽视危险行为。特别是在一个需要识别编码代理何时采取微妙危险行为的数据集中,Opus 4.6、GPT 5.4和Gemini 3.1在800K标记的良性活动后发生这些行为时,错过的频率比单独发生时高出2到30倍。我们还表明,这些弱点可以通过提示技术部分缓解,例如在转录文本中定期提醒,并且通过更好的后训练可能进一步减轻。未考虑长上下文退化的监控评估可能会高估监控性能。
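The periodic-reminder mitigation mentioned in the abstract above might look roughly like the following sketch. The reminder wording, the interval `every`, and the function name `add_periodic_reminders` are hypothetical; the abstract does not specify how the reminders are phrased or spaced.

```python
# Illustrative only: interleave a monitoring reminder into a long transcript.
REMINDER = "[Reminder: flag any dangerous action taken by the agent.]"

def add_periodic_reminders(turns, every=3, reminder=REMINDER):
    """Return a new list with `reminder` inserted after every `every` turns."""
    if every < 1:
        raise ValueError("every must be >= 1")
    out = []
    for i, turn in enumerate(turns, start=1):
        out.append(turn)
        # Skip a trailing reminder: nothing follows the final turn.
        if i % every == 0 and i < len(turns):
            out.append(reminder)
    return out

turns = [f"turn {n}" for n in range(1, 8)]
prompt_parts = add_periodic_reminders(turns, every=3)
```

In practice the interval would presumably be chosen in tokens rather than turns; turns are used here only to keep the sketch short.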
cs.AI / 105 / 2605.12376

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

ProfiliTable:基于剖析驱动的代理工作流表格数据处理
Liu, Wei, Gu, Yang, Yan, Xi, Nan, Zihan, Xu, Beicheng, Ding, Keyao, Cui, Bin, Zhang, Wentao
Abstract
Table processing, including cleaning, transformation, augmentation, and matching, is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.
Chinese Translation
表格处理,包括清洗、转换、增强和匹配,是现实世界数据管道中的基础但易出错的阶段。尽管最近基于大型语言模型(LLM)的方法在自动化这些任务方面显示出潜力,但由于指令模糊、任务结构复杂以及缺乏结构化反馈,它们在实践中往往面临挑战,导致生成的代码在语法上正确但在语义上存在缺陷。为了解决这些问题,我们提出了ProfiliTable,一个以动态剖析为中心的自主多代理框架,通过互动探索、知识增强合成和反馈驱动的精炼,构建并迭代优化统一的执行上下文。ProfiliTable整合了(i)一个执行ReAct风格数据探索的剖析器,以建立语义理解;(ii)一个检索策划操作符以合成任务感知代码的生成器;以及(iii)一个注入执行分数和诊断见解以实现闭环精炼的评估-总结循环。在涵盖18种表格任务类型的多样化基准上的大量实验表明,ProfiliTable在复杂的多步骤场景中始终优于强基线。这些结果突显了动态剖析在可靠地将模糊用户意图转化为稳健且符合治理要求的表格转换中的关键作用。
cs.AI / 106 / 2605.12406

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

语义奖励崩溃与自适应人工智能系统中认知完整性的保护
Parris, William
Abstract
Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.
Chinese Translation
近年来,基于人类反馈的强化学习(RLHF)和偏好优化的进展显著提高了大型语言模型的可用性、一致性和安全性。然而,表现出确定性、虚幻的连续性、校准漂移、谄媚行为和抑制可见不确定性等反复出现的行为,表明标量化偏好优化系统中存在未解决的结构性问题。我们提出了语义奖励崩溃(Semantic Reward Collapse, SRC):将语义上不同的评估不满形式压缩为一般化的优化信号。在SRC的情况下,诸如事实错误、不确定性披露、格式不满、延迟和社会偏好等类别,尽管代表根本不同的认知类别,却可能在共享的奖励拓扑中交织在一起。我们认为,在一般化评估压力下运作的自适应推理系统可能倾向于抑制可见的认知失败,而不是保持校准的不确定性完整性。这些行为被严格框定为优化结果,而非欺骗或拟人化行为的证据。借鉴制度代理崩溃、指标游戏、软件可靠性工程和人类学习理论,我们建议不确定性披露和升级行为应被视为受保护的认知行为,而不是全球惩罚的任务未完成。最后,我们引入了宪法奖励分层(Constitutional Reward Stratification, CRS),这是一个领域感知的奖励框架,旨在保持自适应学习系统中差异化的认知归属。我们将CRS呈现为一种可测试的治理导向研究方向,而非经过验证的解决方案,需进一步的实证研究。
cs.AI / 107 / 2605.12421

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

形式化,而非优化:LLM生成组合求解器中的启发式陷阱
Wang, Haoyu, Song, Yuliang, Li, Tao, Deng, Zhiwei, Wang, Yaqing, Ramachandran, Deepak, Cohen, Eldan, Roth, Dan
Abstract
Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to optimize search. We introduce CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), and evaluate three solver-construction paradigms: native algorithmic search (Python), constraint modeling through a Python solver API (Python + OR-Tools), and declarative constraint modeling (MiniZinc + OR-Tools). We find a consistent representational divergence: Python + OR-Tools attains the highest correctness across LLMs, while MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end. Native Python is the most likely to return a schema-valid solution that fails verification, whereas solver-backed paths preserve higher conditional fidelity. On the heuristic axis, prompting for search optimization yields only small median speed-ups (1.03-1.12x) and a strongly bimodal effect: many instances slow down, and correctness drops sharply on a long tail of problems. A paired code-level audit traces these regressions to a recurring heuristic trap. Under an efficiency-oriented prompt, the LLM may replace complete search with local approximations (Python), inject unverified bounds (Python + OR-Tools), or add redundant declarative machinery that overwhelms or over-constrains the model (MiniZinc + OR-Tools). These findings support a conservative design principle for LLM-generated combinatorial solvers: use the LLM primarily to formalize variables, constraints, and objectives for verified solvers, and separately check any LLM-authored search optimization before use.
Chinese Translation
大型语言模型(LLMs)在通过直接推理解决复杂组合问题时面临困难,因此最近的神经符号系统越来越多地利用它们来合成可执行的求解器。一个核心设计问题是LLM应该如何表示求解器,以及是否也应该尝试优化搜索。我们引入了CP-SynC-XL,这是一个包含100个组合问题(4,577个实例)的基准,并评估了三种求解器构建范式:原生算法搜索(Python)、通过Python求解器API进行约束建模(Python + OR-Tools)和声明式约束建模(MiniZinc + OR-Tools)。我们发现了一种一致的表征差异:Python + OR-Tools在所有LLM中实现了最高的正确性,而尽管使用相同的OR-Tools后端,MiniZinc + OR-Tools的绝对覆盖率却较低。原生Python最有可能返回一个模式有效但未通过验证的解决方案,而基于求解器的路径则保持了更高的条件保真度。在启发式方面,提示搜索优化仅带来了小幅的中位数加速(1.03-1.12倍)和明显的双峰效应:许多实例速度减慢,且在长尾问题上正确性急剧下降。配对的代码级审计将这些回归追溯到一个反复出现的启发式陷阱。在以效率为导向的提示下,LLM可能会用局部近似(Python)替代完整搜索,注入未经验证的界限(Python + OR-Tools),或添加冗余的声明式机制,这些机制会压倒或过度约束模型(MiniZinc + OR-Tools)。这些发现支持了一种保守的设计原则,用于LLM生成的组合求解器:主要使用LLM对变量、约束和目标进行形式化,以便为经过验证的求解器提供支持,并在使用之前单独检查任何LLM编写的搜索优化。
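The gap between a schema-valid answer and a verified one, which the abstract above reports for native Python solvers, can be shown with a small independent checker. N-Queens is used here purely as a familiar stand-in; it is not claimed to be one of the CP-SynC-XL problems, and the function names are hypothetical.

```python
# Illustrative verifier: a candidate can be well-formed yet still violate constraints.

def is_schema_valid(cols, n):
    """Schema check only: a list of n integer column indices in [0, n)."""
    return (isinstance(cols, list) and len(cols) == n
            and all(isinstance(c, int) and 0 <= c < n for c in cols))

def verify_n_queens(cols, n):
    """Full check: no two queens share a column or a diagonal (row i holds column cols[i])."""
    if not is_schema_valid(cols, n):
        return False
    for i in range(n):
        for j in range(i + 1, n):
            if cols[i] == cols[j] or abs(cols[i] - cols[j]) == j - i:
                return False
    return True

good = [1, 3, 0, 2]   # a correct 4-queens placement
bad = [0, 1, 2, 3]    # well-formed, but every queen sits on one diagonal
```

Keeping the verifier separate from whatever search code produced the candidate is the point of the sketch: the check does not depend on how the solution was found.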
cs.AI / 108 / 2605.12436

CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction

CAAFC:用于虚假信息/非事实幻觉检测与纠正的时间顺序可操作自动事实检查器
Eldifrawi, Islam, Wang, Shengrui, Trabelsi, Amine
Abstract
With the vast amount of content uploaded every hour, along with AI-generated content that can include hallucinations, Automated Fact-Checking (AFC) has become increasingly vital, as it is infeasible for human fact-checkers to manually verify the sheer volume of information generated online. Professional fact-checkers have identified several gaps in existing AFC systems, noting a misalignment between how these systems operate and how fact-checking is performed in practice. In this paper, we introduce CAAFC (Chronological Actionable Automated Fact-Checker), a framework designed to bridge these gaps. It surpasses SOTA AFC and hallucination detection systems across multiple benchmark datasets. CAAFC operates on claims, conversations, and dialogues, enabling it not only to detect factual errors and hallucinations, but also to correct them by providing actionable justifications supported by primary information sources. Furthermore, CAAFC can update evidence and knowledge bases by incorporating recent and contextual information when necessary, thereby enhancing the reliability of fact verification.
Chinese Translation
随着每小时上传的大量内容以及可能包含幻觉的人工智能生成内容的增加,自动化事实检查(AFC)变得愈发重要,因为人类事实检查员无法手动验证在线生成的信息的庞大数量。专业事实检查员已识别出现有AFC系统中的几个缺口,指出这些系统的运作方式与实际的事实检查过程之间存在不一致。在本文中,我们介绍了CAAFC(时间顺序可操作自动事实检查器),这是一个旨在弥合这些差距的框架。它在多个基准数据集上超越了最先进的AFC和幻觉检测系统。CAAFC处理声明、对话和谈话,不仅能够检测事实错误和幻觉,还能通过提供基于主要信息来源的可操作性理由来纠正这些错误。此外,CAAFC在必要时可以通过整合最新和上下文信息来更新证据和知识库,从而增强事实验证的可靠性。
cs.AI / 109 / 2605.12462

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

迈向可负担的能源:面向电力公用事业需求响应项目的Gymnasium环境
Escamilla, Jose E. Aguilar, Zhou, Lingdong, Zhu, Xiangqi, Wang, Huazheng
Abstract
Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility's pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility's perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.
Chinese Translation
极端天气和波动的批发电力市场使居民消费者面临灾难性的财务风险,但在配电层面,需求响应仍然是一个未被充分利用的电网灵活性和能源可负担性的工具。尽管需求响应程序可以通过在高价期间发放财务补贴来保护消费者,但优化这一序列决策过程对于强化学习而言仍然是一个独特的挑战,尽管有大量的离线历史智能电表和批发定价数据可供公开使用。离线历史数据未能捕捉电力公用事业的定价信号与客户接受度及对需求响应程序的适应之间的动态互动反馈循环。为了解决这一问题,我们引入了DR-Gym,这是一个开源的、与Gymnasium兼容的在线环境,旨在从电力公用事业的角度训练和评估需求响应。与现有的设备级能源模拟器不同,我们的环境专注于市场级电力公用事业设置,并提供与电力公用事业相关的丰富观察空间。该模拟器还具有一个针对现实世界极端事件校准的 regime-switching 批发价格模型,以及基于物理的建筑需求模型。作为我们的学习信号,我们使用一个可配置的多目标奖励函数来指定多样的学习目标。我们通过基准策略和数据快照展示了我们的模拟器创建现实且可学习环境的能力。
cs.AI / 110 / 2605.12474

Reward Hacking in Rubric-Based Reinforcement Learning

基于评分标准的强化学习中的奖励黑客行为
Mahmoud, Anas, Rezaei, MohammadHossein, Wang, Zihao, Gunjal, Anisha, Liu, Bing, He, Yunzhong
Abstract
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
Chinese Translation
可验证奖励的强化学习在数学和编程等领域实现了强大的后训练增益,尽管许多开放式设置依赖于基于评分标准的奖励。我们研究了基于评分标准的强化学习中的奖励黑客行为,其中一个策略针对训练验证者进行优化,但在评估时却依赖于三个前沿评审的跨家族小组,从而减少对任何单一评估者的依赖。我们的框架将两种偏差来源分开:验证者失败,即训练验证者认可的评分标准被验证者拒绝,以及评分标准设计的局限性,即使是强大的基于评分标准的验证者也偏向于评分标准自由的评审者整体评分较低的回应。在医学和科学领域,弱验证者产生的大量代理奖励增益并未转移到参考验证者;在训练过程中,利用行为增加并集中在重复失败上,例如部分满足复合标准、将隐含内容视为显性内容以及不精确的主题匹配。更强的验证者显著减少了验证者的利用行为,但并未消除。我们还引入了一种自我内化差距,这是一种基于策略对数概率的无验证者诊断,跟踪参考验证者的质量,检测使用弱验证者训练的策略何时停止改善。最后,在我们的设置中,当评分标准未明确重要的失败模式时,更强的验证并不能防止奖励黑客行为:基于评分标准的验证者偏好强化学习检查点,而无评分标准的评审者则偏好基础模型。这些分歧与在完整性和存在性标准上集中获得的增益相吻合,同时在事实正确性、简洁性、相关性和整体质量上出现下降。综合来看,这些结果表明,更强的验证减少了奖励黑客行为,但本身并不能确保评分标准的增益与更广泛的质量增益相对应。
cs.AI / 111 / 2605.12481

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

ToolCUA:面向计算机使用代理的最佳GUI工具路径编排
Hu, Xuhao, Zhang, Xi, Xu, Haiyang, Qiao, Kyle, Yang, Jingyi, Huang, Xuanjing, Shao, Jing, Yan, Ming, Ye, Jieping
Abstract
Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
Chinese Translation
计算机使用代理(CUAs)可以通过原子GUI操作(如点击和输入)以及高层次工具调用(如基于API的文件操作)进行操作,但这种混合操作空间常常使它们在何时继续进行GUI操作或切换到工具方面感到不确定,从而导致次优的执行路径。这一困难源于高质量交错的GUI-工具轨迹的稀缺、收集真实工具轨迹的成本和脆弱性,以及缺乏对GUI-工具路径选择的轨迹级监督。在本文中,我们提出了ToolCUA,这是一种端到端的代理,旨在通过分阶段训练范式学习最佳的GUI-工具路径选择。我们首先介绍了一种交错GUI-工具轨迹缩放管道,该管道重新利用丰富的静态GUI轨迹并合成一个基础工具库,使得在没有手动工程或真实工具轨迹收集的情况下能够生成多样的GUI-工具轨迹。然后,我们执行工具引导的GUI RFT,将预热的SFT与单回合的强化学习相结合,以改善在关键的GUI-工具切换点的决策。最后,我们在高保真度的GUI-工具环境中通过在线代理强化学习优化ToolCUA,借助工具高效路径奖励来指导,鼓励适当的工具使用和更短的执行路径。在OSWorld-MCP上的实验表明,ToolCUA达到了46.85%的准确率,相较于基线有约66%的相对提升,确立了同类规模模型中的新最优。此外,它在仅使用GUI的设置中提高了3.9%,展示了有效的GUI-工具编排。结果进一步表明,在混合操作空间中进行训练是现实世界数字代理的一个有前景的范式。开源地址:https://x-plug.github.io/ToolCUA/
计算语言学 (Computation and Language)
82
cs.CL / 1 / 2605.11128

Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

采样更多,获得更少:校准是大型语言模型中的多样性瓶颈
Banayeeanzade, Amin, Yang, Qingchuan, Tarsadiya, Dhruv, Bahrani, Fatemeh, Blas, Leonardo, Samuel, Alfy, Jia, Robin, Razaviyayn, Meisam, Karimireddy, Sai Praneeth
Abstract
Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity-diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated on only a few valid continuations, with a heavy tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
Chinese Translation
多样性对于语言模型应用至关重要,从创造性生成到科学发现,然而现代大型语言模型(LLMs)往往会收敛到一小部分合理输出。虽然之前的研究已经开发了测量这种多样性缺乏的基准,但关于推理时逐步概率分布如何导致这一问题的了解却较少。我们引入了一个有效性-多样性框架,将多样性崩溃归因于LLM在解码过程中如何在有效和无效的延续之间分配概率质量。该框架将瓶颈分解为两种互补的误校准形式。首先,顺序校准:有效标记未能可靠地高于无效标记排名,因此基于排名的截止规则必须在恢复有效延续和接受无效延续之间进行权衡。其次,形状校准:概率质量过于集中于少数有效延续,同时存在大量混合的有效和无效标记,因此保持高有效性限制了多样性。我们对这两种机制进行了形式化,并展示了局部失败在解码步骤中累积,导致多样性在序列级别上产生显著损失。在实证上,我们开发了受控诊断工具来探测这些瓶颈,包括具有确切已知有效集和oracle截止基线的任务。在涵盖多个家族和规模的14个语言模型中,我们发现多样性崩溃不仅仅是特定采样启发式的局限,而是LLM分布中顺序和形状误校准的结果。
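The order-calibration trade-off described in the abstract above can be made concrete with a toy next-token distribution. The distribution, the valid set, and the function name `topk_tradeoff` are hypothetical and only illustrate why a rank-based cutoff cannot recover every valid token without also admitting an invalid one when an invalid token is ranked above valid ones.

```python
# Illustrative: how a top-k cutoff trades valid-token recall against admitting invalid tokens.

def topk_tradeoff(probs, valid, k):
    """For a top-k cutoff, return (recall of valid tokens, number of invalid tokens admitted)."""
    ranked = sorted(probs, key=probs.get, reverse=True)[:k]
    kept_valid = [t for t in ranked if t in valid]
    recall = len(kept_valid) / len(valid)
    return recall, len(ranked) - len(kept_valid)

# Hypothetical distribution: "x" is invalid but outranks the valid tokens "b" and "c".
probs = {"a": 0.50, "x": 0.20, "b": 0.15, "c": 0.10, "y": 0.05}
valid = {"a", "b", "c"}
results = {k: topk_tradeoff(probs, valid, k) for k in (1, 2, 4)}
```

Here `k=1` keeps only one of three valid tokens, while `k=4` is the smallest cutoff that recovers all three and it necessarily admits the invalid token `"x"`.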
cs.CL / 2 / 2605.11143

ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

ClinicalBench:对MIMIC-IV跨入院临床问答的断言感知检索进行压力测试
Stinard, Alex
Abstract
Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
Chinese Translation
推理基准测量在干净输入下的临床表现。我们评估推理之前的步骤:在真实电子健康记录(EHR)笔记上的检索,其中否定、时间性以及家庭与患者归属的不同可能会将正确答案翻转为错误答案。EpiKG为患者知识图谱中的每个事实附加了断言标签和时间性标签,然后根据问题意图进行检索。ClinicalBench是一个包含400个问题的测试,涉及43名MIMIC-IV患者,涵盖9个对断言敏感的类别。通过7种条件的消融实验,测试EpiKG的每个部分在六个大型语言模型(LLMs)上的表现(Claude Opus 4.6、GPT-OSS 20B、MedGemma 27B、Gemma 4 31B、MedGemma 1.5 4B、Qwen 3.5 35B)。三位医生盲审了100对项目。作者盲审的主要终点,基于50个由两位外部医生评定的一致严格项目的留作者配对精确McNemar检验,结果为+22.0个百分点(95% Newcombe置信区间 [+5.1, +31.5],p=0.0192)。在架构创新方面,意图感知的KG-RAG相较于Contriever密集RAG基线(C2b到C4g_kw在排除变化的n=362终点上)提高了+8.84个百分点(配对McNemar p=1.79e-3);在oracle意图下提高了+12.43个百分点。敏感性方向一致:三位评审医生的多数意见提高了+24.0个百分点(受单一作者循环影响);确定性关键词可重复性代理提高了+39.5个百分点。在六个模型中,增益随着LLM单独基线的上升而缩小(beta=-1.123,r=-0.921,p=0.009)。在n=6的情况下,这更像是回归均值而非编码替代模型规模。医生审定发现56%的自动生成参考答案存在缺陷,这一方法学发现表明,NLP管道的临床问答基准需要医生审定才能可用。ClinicalBench、冻结评估者、三位评审的审定数据以及EpiKG输出堆栈已公开发布。
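The paired exact McNemar test cited in the abstract above depends only on the discordant pairs, which a short sketch makes explicit. The counts passed below are hypothetical and are not taken from the paper; `b` and `c` denote the items on which exactly one of the two systems was judged correct.

```python
from math import comb

def exact_mcnemar(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts b and c.

    Under the null hypothesis, each discordant pair favours either system
    with probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: 10 discordant pairs, split 1 versus 9.
p = exact_mcnemar(1, 9)
```

Concordant pairs (both systems right or both wrong) do not enter the statistic, which is why the effective sample size of such a test can be much smaller than the number of adjudicated items.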
cs.CL / 3 / 2605.11153

Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary

分解进化混合LoRA架构:路由杠杆、生命周期惩罚与基底条件边界
Kumaresan, Ramchand
Abstract
We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V approx 0.048; the "widened-1536" substrate) into three factors -- a router rewrite (parallel sigmoid gate with learnable per-adapter floor and bounded temperature anneal, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death plus alpha-blend inheritance plus SVD mutation plus slot reallocation -- and report a 5-of-8 partial 2^3 factorial run at n=3 seeds and 25000 adaptation steps per cell. The attribution chain is sharp on this substrate: the router rewrite carries the entire +0.0426 nat balanced log-PPL improvement (Delta = log PPL_ref - log PPL_test, positive = improvement; t=12.86, p=0.006) attributed to "the full evolutionary system vs the static B3 baseline"; the headline full-system-vs-B3 balanced contrast itself is +0.015 nats, t=1.94, p=0.19 at n=3 and does not clear alpha=0.05. The per-domain evaluation scope is null at seed-resolution, and the lifecycle is a net drag of approx -0.028 nats (t=-4.46,p=0.047 in the primary chain). An auxiliary alpha=0 inheritance counterfactual at n=3 seeds is sign-inconsistent at the headline metric and underpowered for either an equivalence or load-bearing conclusion (corrected from an earlier arithmetic-mean aggregator that erroneously cleared inheritance; see Appendix B.11). A base-perturbation probe directionally refutes a "genomic-context" reframe of the lifecycle role. A controllable synthetic sandbox locates a substrate-conditional regime boundary: evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task; in every other regime tested it underperforms, ties, or actively degrades the gradient solution.
Chinese Translation
我们对一个从零开始的约150M参数扩展-D基底(D=1536,V=32000;D/V约为0.048;即“扩展-1536”基底)上的进化混合LoRA系统进行了分解,识别出三个因素——路由重写(具有可学习每个适配器底线和有界温度退火的并行sigmoid门,输入为堆叠后隐藏状态而非标记嵌入均值)、每个领域的留一法评估范围,以及死亡生命周期加上α混合继承加上SVD变异加上插槽重新分配——并报告在n=3种子和每个单元25000次适应步骤下进行的5/8部分2^3因子实验。该基底上的归因链非常明确:路由重写带来了整个+0.0426 nat的平衡对数PPL改善(Delta = log PPL_ref - log PPL_test,正值表示改善;t=12.86,p=0.006),归因于“完整的进化系统与静态B3基线”;完整系统与B3的平衡对比本身为+0.015 nats,t=1.94,p=0.19,n=3时未能达到α=0.05的显著性。每个领域的评估范围在种子分辨率下为零,而生命周期则带来了约-0.028 nats的净拖累(在主要链中t=-4.46,p=0.047)。在n=3种子下的辅助α=0继承反事实在头条指标上表现出不一致,并且对于等价或承载结论的能力不足(从早期错误清除继承的算术均值聚合器中修正;见附录B.11)。基础扰动探针在方向上反驳了生命周期角色的“基因组上下文”重构。一个可控的合成沙箱定位了一个基底条件的体制边界:在路由通道上的进化搜索仅在适配器与任务预对齐时是承载的;在测试的其他所有体制中,它表现不佳、持平或积极降低梯度解决方案。
cs.CL / 4 / 2605.11167

The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models

双院制模型:平行语言模型之间的双向隐状态耦合
Flamant, Cedric, Ghai, Udaya, Shimizu, Kanna
Abstract
Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate (~1% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36% to 96%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves $1.7\times$ the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.
Chinese Translation
现有的多模型和工具增强系统通过生成文本进行通信,将每次交换序列化为输出词汇。两个预训练的语言模型能否通过一个连续的并发通道进行协调?双院制模型通过一个可训练的神经接口将两个冻结的语言模型耦合在其中间隐状态上。在每个生成步骤中,两个模型同步运行:一个主模型驱动任务,而一个辅助模型操作工具、解决约束或执行代码,二者通过一个翻译网络和一个学习的抑制门相互条件化激活(约占组合参数的1%)。该门仅通过任务损失学习选择性通信协议,而不需要预设格式。我们在三个工具后端展示了该机制。在算术任务中,将两个0.5B模型与计算器耦合使准确率从36%提高到96%。在逻辑网格谜题中,将两个0.6B模型与Z3求解器耦合使ZebraLogic的性能达到了未增强基线的1.7倍。在数学推理中,与Python沙箱的耦合使辅助模型能够仅通过隐状态信号生成特定问题的代码,而无需看到问题文本。
cs.CL / 5 / 2605.11195

How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

差分隐私如何影响大型语言模型中的社会偏见?系统评估
Tenorio, Eduardo, Bhaila, Karuna, Wu, Xintao
Abstract
Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
Chinese Translation
在网络规模语料库上训练的大型语言模型(LLMs)可能会记忆敏感的训练数据,从而带来显著的隐私风险。差分隐私(DP)作为一种原则性框架,限制了训练过程中单个数据点的影响,但差分隐私与LLMs中的社会偏见之间的关系仍然不甚明了。为此,我们对使用DP-SGD训练的预训练LLM中的社会偏见进行了系统评估,比较了DP模型与非DP基线在四种互补范式下的表现:句子评分、文本补全、表格分类和问答。我们发现,DP在句子评分任务中减少了偏见,其中偏见通过控制的似然比较进行测量,但这种改善并未在所有任务中普遍适用。我们的结果揭示了logit级别偏见与输出级别偏见之间的差异。此外,减少记忆并不一定会降低不公平性,这强调了在评估LLMs的公平性时进行多范式评估的重要性。
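For readers unfamiliar with DP-SGD, which the abstract above uses for training, the core aggregation step is sketched below in its standard form: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, and average. This is a generic illustration, not the authors' training code; the function name and the toy gradients are assumptions, and privacy accounting is omitted entirely.

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down only gradients whose L2 norm exceeds the clipping threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(g):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]

# Toy gradients: the first has norm 5 and is clipped to norm 1; the second is left unchanged.
grads = [[3.0, 4.0], [0.3, 0.4]]
step = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
```

Setting `noise_multiplier=0.0` isolates the clipping effect for inspection; a run with privacy guarantees would use a positive value.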
cs.CL / 6 / 2605.11206

Instructions shape Production of Language, not Processing

指令塑造语言的产生,而非处理
Waldis, Andreas, Choshen, Leshem, Hou, Yufang, Perlit, Yotam
Abstract
Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.
Chinese Translation
指令在语言模型中触发了一种以产生为中心的机制。通过一种认知启发的视角,将语言处理与产生分开,我们揭示了这一机制在两个阶段之间的不对称性,通过在五个二元判断任务中逐层探测任务特定信息。具体而言,我们测量了指令标记如何在样本标记(即待评估的输入)被处理时以及在输出标记被生成时塑造信息。在不同的提示变体中,样本标记中的任务特定信息保持相对稳定,并且与行为的相关性较弱,而输出标记中的相同信息则变化显著,并与行为强相关。基于注意力的干预证实了这一模式的因果关系:阻断指令流向所有后续标记会减少输出标记中的行为和信息,而仅阻断样本标记的指令流对二者的影响则微乎其微。这种不对称性在不同模型家族和任务中普遍存在,并且随着模型规模和指令调优的增加而变得更加明显,这两者对产生阶段的影响尤为显著。我们的发现表明,理解模型能力需要同时评估内部机制和行为,同时通过标记位置分解内部视角,以区分输入标记的处理与输出标记的产生。
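The two attention-blocking interventions contrasted in the abstract above can be sketched as boolean masks. The token layout (instruction, then sample, then output), the function name `attention_mask`, and the segment lengths are assumptions for illustration; the abstract does not describe how the authors implement the intervention.

```python
def attention_mask(n_instr, n_sample, n_out, block="none"):
    """Return mask[q][k] (True = query q may attend to key k).

    Tokens are assumed to be laid out as [instruction | sample | output].
    block="sample" cuts instruction -> sample attention only;
    block="all" cuts instruction -> every later token.
    """
    n = n_instr + n_sample + n_out
    mask = [[k <= q for k in range(n)] for q in range(n)]  # causal baseline
    for q in range(n_instr, n):
        is_sample = q < n_instr + n_sample
        if block == "all" or (block == "sample" and is_sample):
            for k in range(n_instr):
                mask[q][k] = False
    return mask

# 2 instruction tokens, 2 sample tokens, 1 output token.
m_sample = attention_mask(2, 2, 1, block="sample")
m_all = attention_mask(2, 2, 1, block="all")
```

Under `block="sample"` the output position can still read the instruction directly, which corresponds to the condition the abstract reports as having minimal effect.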
cs.CL / 7 / 2605.11212

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

ReVision:通过时间视觉冗余减少扩展计算机使用代理
Abaskohi, Amirhossein, He, Yuhang, West, Peter, Carenini, Giuseppe, Chawla, Pranit, Vineet, Vibhav
Abstract
Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, incorporating history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which trains multimodal language models on trajectories from which redundant visual patches are removed by a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.
Chinese Translation
计算机使用代理(CUAs)依赖于对图形用户界面的视觉观察,其中每个屏幕截图被编码为大量视觉标记。随着交互轨迹的增长,标记成本迅速增加,限制了在固定上下文和计算预算下可以纳入的历史量。这导致在使用历史时,性能没有或仅有非常有限的改善,与其他领域形成鲜明对比。我们通过引入ReVision来解决这一低效问题,该方法用于在去除冗余视觉补丁的轨迹上训练多模态语言模型,使用学习的补丁选择器比较连续屏幕截图之间的补丁表示,同时保留模型所需的空间结构。在三个基准测试OSWorld、WebTailBench和AgentNetBench中,当使用Qwen2.5-VL-7B处理包含5个历史屏幕截图的轨迹时,ReVision平均减少了约46%的标记使用,同时在无下降基线的基础上提高了3%的成功率。这确立了明显的效率提升,使代理能够用更少的标记处理更长的轨迹。凭借这种改进的效率,我们重新审视了历史在CUAs中的作用,发现随着更多过去观察的纳入,性能持续改善,前提是去除了冗余。这表明,视觉历史中常见的饱和现象并不是由于过去信息的有限有用性,而是低效标记表示的结果。
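The redundancy-reduction idea in the ReVision abstract can be illustrated with a minimal sketch. The abstract describes a learned patch selector; the version below swaps in a fixed threshold on the mean absolute difference between patch embeddings, purely as a stand-in. The names `select_patches` and `kept_positions`, the `threshold` value, and the list-of-floats patch representation are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence

Patch = Sequence[float]  # one patch embedding (hypothetical representation)


def select_patches(prev: List[Patch], curr: List[Patch],
                   threshold: float = 0.1) -> List[bool]:
    """Return a keep-mask for `curr`: True where a patch changed enough to keep.

    Assumption: a fixed threshold replaces the learned selector from the abstract.
    """
    if len(prev) != len(curr):
        raise ValueError("screenshots must have the same number of patches")
    mask = []
    for p, c in zip(prev, curr):
        # Mean absolute difference between the two patch embeddings.
        diff = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
        mask.append(diff > threshold)
    return mask


def kept_positions(mask: List[bool]) -> List[int]:
    """Keep the original patch indices so spatial positions are preserved."""
    return [i for i, keep in enumerate(mask) if keep]


# Toy example: only the middle patch differs between consecutive screenshots.
prev = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
curr = [[0.0, 0.0], [1.0, 0.2], [0.5, 0.5]]
print(kept_positions(select_patches(prev, curr)))
```

Returning indices rather than a compacted list is one way to keep the spatial structure the abstract says the model requires; how ReVision itself encodes positions is not stated in the abstract.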
cs.CL / 8 / 2605.11242

RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German

RETUYT-INCO在BEA 2026共享任务2中的表现:基于评分标准的元提示方法在德语评分中的应用
Sastre, Ignacio, Remersaro, Ignacio, Díaz, Facundo, De Horta, Nicolás, Chiruzzo, Luis, Rosá, Aiala, Góngora, Santiago
Abstract
In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
Chinese Translation
本文介绍了RETUYT-INCO团队在BEA 2026共享任务“基于评分标准的德语短答案评分”中的参与情况。我们的团队参与了轨道1(未见答案三分类)、轨道3(未见答案二分类)和轨道4(未见问题二分类)。由于这些轨道要求使用特定的评分标准对学生的短答案进行评分,我们探索了应对任务变化的方法。我们创建了一种称为元提示(Meta-prompting)的方法。在这种方法中,语言模型(LLM)根据训练集中的示例生成自定义提示。然后使用该提示对新的学生答案进行评分。除了这种方法外,我们还描述了其他使用的方法,例如经典机器学习、微调开源LLM和不同的提示技术。根据官方结果,我们的团队在轨道1中以0.729的QWK得分位列8名参与者中的第6位。在轨道3中,我们以0.674的QWK得分位列9名参与者中的第4位,并且在轨道4中以0.49的QWK得分位列8名参与者中的第4位。
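The results above are reported in QWK (quadratic weighted kappa). For readers unfamiliar with the metric, here is a small pure-Python sketch of the standard definition; the function name and the toy labels are illustrative and are not taken from the shared task's evaluation scripts.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_labels):
    """Quadratic weighted kappa for integer labels in range(num_labels)."""
    n = len(y_true)
    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_labels))
                 for j in range(num_labels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            # Disagreement weight grows with the squared label distance.
            w = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two label histograms.
            weighted_expected += w * hist_true[i] * hist_pred[j] / n
    if weighted_expected == 0.0:
        return 1.0  # all items share one label on both sides
    return 1.0 - weighted_observed / weighted_expected


# Toy three-way example (labels 0, 1, 2); the data is made up.
print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], num_labels=3))
```

A score of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.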
cs.CL / 9 / 2605.11255

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

HEBATRON:一种希伯来语专用的开放权重混合专家语言模型
Kayzer, Noam, Revital, Dan, Joseph, Ori Bar, Arvatz, Smadar, Levi, Or, Geva, Tal, Shmidman, Shaltiel, Cohen, Amir DN, Ordan, Noam, Baruch, Omer, Zinkovskaia, Kate, Apini, Zevi, Weinberger, Sarel
Abstract
We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking (68.9%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3 architecture for any target language, and the first open-weight Hebrew-specialized MoE model with native long-context support. Model weights are released openly to support further research in Hebrew and Semitic-language NLP.
Chinese Translation
我们提出了Hebatron,这是一种基于NVIDIA Nemotron-3稀疏混合专家架构构建的希伯来语专用开放权重大型语言模型。训练采用三阶段的由易到难的课程,结合持续的抗遗忘锚定,随后在200万条双语希伯来语-英语样本上进行监督微调。仅课程排序就带来了相较于反向配置的3分综合基准提升。Hebatron在希伯来语推理方面的平均得分为73.8%,超越了DictaLM-3.0-24B-Thinking(68.9%),并在GSM8K-HE和以色列问答中与Gemma-3-27B-IT保持竞争力,同时在一个30B参数的模型中每次前向传播仅激活3B参数,在原生上下文长度达到65,536个标记时提供约9倍的推理吞吐量。据我们所知,这是Nemotron-3架构首次针对任何目标语言进行语言特定的适配,也是首个具有原生长上下文支持的开放权重希伯来语专用MoE模型。模型权重已公开发布,以支持希伯来语和闪米特语言自然语言处理的进一步研究。
cs.CL / 10 / 2605.11290

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

ReAD:用于大型语言模型的强化指导能力蒸馏
Cheng, Xueqi, Zhou, Xugui, Derr, Tyler, Dong, Yushun
Abstract
Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.
Chinese Translation
能力蒸馏将知识蒸馏应用于选定的模型能力,旨在将大型语言模型(LLM)压缩为较小的模型,同时保留下游任务所需的能力。然而,大多数现有方法将能力视为独立的训练目标,忽视了提升一种能力如何重塑学生的更广泛能力特征,尤其是在多种能力共同决定任务成功时。我们在固定的令牌预算下研究能力蒸馏,并识别出两个一致的模式:蒸馏引发系统性的、依赖预算的跨能力转移,而额外的预算往往带来有限的与任务相关的收益,同时有时会降低其他有用能力的表现。在这些见解的基础上,我们提出了ReAD,一个明确考虑能力相互依赖性的强化指导能力蒸馏框架。ReAD首先推断任务必需的能力,然后实时生成针对能力的监督,最后使用一种关注不确定性的上下文赌博者根据预期效用收益自适应地分配蒸馏预算。大量实验表明,与强基线相比,ReAD在相同的令牌预算下提高了下游效用,同时减少了有害的溢出和浪费的蒸馏努力。我们的代码已公开在 https://github.com/LabRAI/ReAD。
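The budget-allocation step in the ReAD abstract (an uncertainty-aware bandit that decides where to spend distillation budget) can be approximated with a textbook UCB1 bandit. This is a deliberate simplification: the abstract describes a contextual bandit, whereas the sketch below ignores context entirely. The class name `UCB1Allocator`, the capability names, and the simulated gains are hypothetical.

```python
import math


class UCB1Allocator:
    """Plain UCB1 over capabilities (a non-contextual stand-in for ReAD's bandit)."""

    def __init__(self, capabilities):
        self.counts = {c: 0 for c in capabilities}
        self.means = {c: 0.0 for c in capabilities}
        self.total = 0

    def choose(self):
        # Try every capability once before using the confidence bound.
        for capability, n in self.counts.items():
            if n == 0:
                return capability

        def score(capability):
            # Mean observed gain plus an exploration bonus that shrinks with use.
            bonus = math.sqrt(2.0 * math.log(self.total) / self.counts[capability])
            return self.means[capability] + bonus

        return max(self.counts, key=score)

    def update(self, capability, gain):
        self.counts[capability] += 1
        self.total += 1
        n = self.counts[capability]
        # Incremental running mean of the observed utility gain.
        self.means[capability] += (gain - self.means[capability]) / n


# Simulated gains (made up): spending budget on "math" helps, "code" does not.
simulated_gain = {"math": 1.0, "code": 0.0}
allocator = UCB1Allocator(["math", "code"])
for _ in range(20):
    chosen = allocator.choose()
    allocator.update(chosen, simulated_gain[chosen])
print(allocator.counts)
```

The exploration bonus is what makes the allocation uncertainty-aware: rarely tried capabilities keep getting occasional budget until their estimated gain is reliable.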
cs.CL / 11 / 2605.11303

Predicting Psychological Well-Being from Spontaneous Speech using LLMs

利用大型语言模型预测自发言语中的心理健康
Loweimi, Erfan, Garcia, Sofia de la Fuente, Luz, Saturnino
Abstract
We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models' predictions.
Chinese Translation
我们研究了使用大型语言模型(LLMs)从自发言语中进行零样本预测的可能性,以评估 Ryff 心理健康(PWB)得分。通过分析来自 PsyVoiD 数据库中 111 名参与者的几分钟语音录音,我们评估了 12 种经过指令调优的 LLM,包括 Llama-3(8B, 70B)、Ministral、Mistral、Gemma-2-9B、Gemma-3(1B, 4B, 27B)、Phi-4、DeepSeek(Qwen 和 Llama)以及 QwQ-Preview。我们与临床心理学和语言学领域的专家合作,开发了一个领域知情的提示。结果表明,LLMs 能够从自发言语中提取具有语义意义的线索,在 80% 的数据上达到了高达 0.8 的斯皮尔曼相关系数。此外,为了增强可解释性,我们进行了统计分析,以描述预测的变异性和系统性偏差,并通过基于关键词的词云分析来突出驱动模型预测的语言特征。
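The headline number in this abstract is a Spearman correlation between predicted and reference well-being scores. As a reminder of what that measures, here is a small pure-Python sketch: Spearman's coefficient is the Pearson correlation of the ranks, with tied values given their average rank. The function names and the toy scores are illustrative only.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (made up): predictions preserve the ordering of the references.
reference = [12.0, 30.0, 18.0, 25.0]
predicted = [2.1, 4.9, 3.0, 3.5]
print(spearman(reference, predicted))
```

Because only ranks matter, a model can reach a high Spearman correlation even when its predicted scores are on a different scale from the questionnaire scores.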
cs.CL / 12 / 2605.11317

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

SOMA:通过小型语言模型实现高效的多轮大语言模型服务
Cheng, Xueqi, Wu, Qiong, Zhou, Zhengyi, Zhou, Xugui, Derr, Tyler, Dong, Yushun
Abstract
Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于多轮对话场景,在这些场景中,保持对话上下文在各轮之间的连贯性至关重要。标准的服务实践是在每一轮中连接完整的对话历史,这虽然可靠地维护了连贯性,但在延迟、内存和API开销方面产生了相当大的成本,尤其是在查询被路由到大型专有模型时。现有的方法往往难以平衡响应质量与效率之间的权衡。我们提出了一种框架,利用会话的早期轮次来估计局部响应流形,然后将一个较小的替代模型适应于这一局部区域,以应对后续的对话。具体而言,我们学习软提示,以最大化大型语言模型与替代小型语言模型响应之间的语义差异,从而揭示最不一致的局部方向,利用反退化控制来稳定训练,并将挖掘的案例提炼为局部的LoRA微调,使得替代模型在推理时无需提示。一个简单的门控机制允许一次性切换,并在漂移时回滚。我们还为SOMA中的关键组件提供了理论分析。大量实验表明SOMA的有效性。源代码可在以下网址获取:https://github.com/LabRAI/SOMA。
cs.CL / 13 / 2605.11348

Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence

用于社交媒体因果关系提取的大型语言模型:灾害智能的验证框架
Jeong, Ujun, Vishnubhatla, Saketh, Jiang, Bohan, Harrison, Andre, Raglin, Adrienne, Liu, Huan
Abstract
During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
Chinese Translation
在灾害发生期间,从社交媒体中提取因果关系可以通过识别与伤亡、物理损害、基础设施中断和连锁影响相关的因素来增强情境意识。然而,灾害相关的帖子往往是非正式的、零散的,并且依赖于上下文,它们可能描述个人经历而不是明确的因果关系。在本研究中,我们考察了大型语言模型(LLMs)是否能够有效地从灾害相关的社交媒体帖子中提取因果关系。为此,我们(1)提出了一个基于专家的评估框架,该框架将LLM生成的因果图与来自特定灾害报告的参考图进行比较,并(2)评估提取的关系是否得到事件后证据的支持,或者反映模型的先验知识。我们的研究结果突显了在灾害决策支持系统中使用LLMs进行因果关系提取的潜力和风险。
cs.CL / 14 / 2605.11378

An Empirical Study of Automating Agent Evaluation

自动化代理评估的实证研究
Zhou, Kang, Woo, Sangmin, Ding, Haibo, Ramnath, Kiran, Chidambaram, Subramanian, Feng, Aosong, Arannil, Vinayak, Kim, Muhyun, Singh, Ishan, Wang, Darren, Xu, Zhichao, Gandhi, Megha, Prabhu, Nirmal, Mishra, Soumya Smruti, Singh, Vivek, Pandeshwar, Gouri, Cheong, Lin Lee
Abstract
Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
Chinese Translation
代理评估需要评估涉及工具使用和中间推理的复杂多步骤行为,这使得评估过程成本高昂且需要专业知识。由此产生一个自然的问题:前沿编码助手能否可靠地自动化这一评估过程?我们的研究表明,仅仅提示编码助手并不足以完成这一任务。在缺乏特定领域评估知识的情况下,前沿编码助手的执行成功率仅为30%,并且每个代理产生的评估平均超过12个指标,表明强大的编码能力并不自动转化为可靠的代理评估。我们引入了EvalAgent,一个自动化端到端代理评估流程的AI助手。EvalAgent将评估领域的专业知识编码为评估技能(程序指令、可重用代码和模板,以及动态检索的API文档),这些技能组合成一个基于追踪的流程,生成包括指标、可执行代码和报告在内的完整评估文档。为了系统地评估生成的评估结果,我们引入了一个元评估框架,并推出了AgentEvalBench,一个包含20个代理的基准,每个代理都配有评估要求和测试场景。我们进一步提出了Eval@1指标,用于衡量生成的评估代码是否在第一次运行时既能执行又能产生有意义的结果。我们的实验表明,EvalAgent生成的评估更加集中,将Eval@1从17.5%提高到65%,并在基准方法中获得了79.5%的人工专家偏好。进一步的消融研究表明,评估技能对于处理复杂评估至关重要:去除这些技能会导致Eval@1显著下降,从65%降至30%。
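The Eval@1 metric is described as measuring whether generated evaluation code both executes and yields meaningful results on the first run. A minimal sketch of that computation follows; the `FirstRun` record and its two boolean fields are assumptions, and how "meaningful" is judged in the paper is not specified in the abstract.

```python
from dataclasses import dataclass


@dataclass
class FirstRun:
    """Outcome of the first run of one generated evaluation (hypothetical record)."""
    executed: bool    # the code ran without crashing
    meaningful: bool  # the results were judged meaningful


def eval_at_1(runs):
    """Fraction of first runs that both executed and produced meaningful results."""
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(1 for r in runs if r.executed and r.meaningful)
    return passed / len(runs)


# Made-up outcomes for four generated evaluations.
runs = [
    FirstRun(executed=True, meaningful=True),
    FirstRun(executed=True, meaningful=False),   # runs, but results are not useful
    FirstRun(executed=False, meaningful=False),  # crashes on first run
    FirstRun(executed=True, meaningful=True),
]
print(eval_at_1(runs))
```

Requiring both conditions is what separates Eval@1 from a plain execution success rate: code that runs but reports uninformative metrics does not count.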
cs.CL / 15 / 2605.11388

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

通过结构化元认知实现通用智能体的深度推理
Light, Dean, Theologitis, Michael, Ghate, Kshitish, Li, Shuyue Stella, Newman, Benjamin, Shah, Chirag, Caliskan, Aylin, Koh, Pang Wei, Suciu, Dan, Tsvetkov, Yulia
Abstract
Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning, an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.
Chinese Translation
人类通过灵活切换推理模式直观地解决复杂问题:他们规划、执行、修订中间目标,通过联想判断解决模糊性,并对明确的子问题应用正式程序。当前的大型语言模型(LLM)智能体缺乏这种灵活性,因为它们的支架在预先硬编码了这些推理决策。当规定的结构与任务匹配时,这些支架是有效的,但在解决任务时需要调整推理结构时则显得脆弱。我们提出了深度推理(Deep Reasoning)——一种在推理时通过结构化元推理构建任务特定支架的方法。深度推理使用一种正式语言,将元推理表示为可执行的联想推理、正式计算和递归子问题求解的分解,从而使分解原则能够编码为上下文示例,指导测试时的支架构建。我们在一个通用智能体(DOLORES)中实例化了这种方法,该智能体将复杂任务分配到更受控的推理线程中。我们在四个困难基准测试中评估了它与最先进的支架方法的表现:多跳推理、长链问答、长上下文聚合和深度研究风格的信息获取。DOLORES在三个模型规模和两个模型系列中超越了所有评估的支架,平均提高了24.8%相对于最强评估支架基线。DOLORES在结构化、低负载的推理线程中分配认知,从而减少了过早终止和幻觉。这一优势甚至可以弥补规模差距,8B版本在超过一半的设置中超越了同一系列中所有评估的32B基线。这些结果指向未来的智能系统,将支架视为自适应推理,按需构建每个任务所需的结构。
cs.CL / 16 / 2605.11416

Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

冻结深层,训练浅层:可解释的继续预训练层分配
Wu, Yu-Hang, Liu, Qin-Yuan, Zhao, Qiu-Yang, Jiang, Bo, Yang, Jiang-Feng, Cong, Qing-Wei
Abstract
Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.
Chinese Translation
选择性逐层更新对于大型语言模型(LLMs)的低成本继续预训练至关重要,但由于缺乏可解释的指导,确定哪些层应被冻结或训练仍然是一个经验性的黑箱问题。为了解决这一问题,我们提出了LayerTracer,一个与架构无关的诊断框架,通过定位任务执行位置和量化层敏感性,揭示逐层表示和稳定性的演变模式。分析结果显示,深层作为任务执行的关键区域,对破坏性更新保持高稳定性。基于这一发现,我们进行了三次受控的继续预训练试验,以比较不同的冻结-训练策略,结果表明,在冻结深层的同时训练浅层的策略在C-Eval和CMMLU基准测试中始终优于全参数微调和相反的分配。我们进一步展示了一个混合模型案例研究,验证了将高质量的预训练模块放置在深层有效地保留了模型的固有知识。这项工作为资源有限的团队提供了一种低成本且可解释的解决方案,为继续预训练和混合模型构建中的逐层参数分配提供了可操作的指导。
cs.CL / 17 / 2605.11436

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

Agent-BRACE:通过语言化状态不确定性在长时间任务中将信念与行动解耦
Singh, Joykirat, Khan, Zaid, Prasad, Archiki, Chen, Justin Chih-Yao, Nambi, Akshay, Lee, Hyunji, Stengel-Eskin, Elias, Bansal, Mohit
Abstract
Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于部分可观察环境中的长时间任务,在这些任务中,它们必须在推断和跟踪复杂环境状态的同时采取行动。这带来了两个挑战:部分可观察性要求对未观察到的世界属性保持不确定性,而长时间的交互历史导致上下文无限增长,从而稀释与任务相关的信息。解决这两个挑战的一个原则性方案是信念状态:给定过去观察和行动的环境状态的后验分布,它紧凑地编码了决策所需的历史,无论剧集长度如何。然而,在LLM代理中,文本的开放性使得如何表示这样的分布变得不明确。因此,我们引入了Agent-BRACE:通过抽象和置信度估计的代理信念状态表示(Agent Belief state Representation via Abstraction and Confidence Estimation),这是一种将LLM代理解耦为信念状态模型和策略模型的方法,通过强化学习进行联合优化。信念状态模型生成信念分布的结构化近似:一组关于环境的原子自然语言声明,每个声明都附有一个从确定到未知的有序语言化置信度标签。策略模型基于这种紧凑、结构化的近似信念而不是完整历史进行条件学习,学习在明确不确定性下选择行动。在长时间、部分可观察的具身语言环境中,Agent-BRACE实现了平均绝对提升+14.5%(Qwen2.5-3B-Instruct)和+5.3%(Qwen3-4B-Instruct),超越了强大的强化学习基线,同时保持了与剧集长度无关的近乎恒定的上下文窗口。进一步分析表明,随着证据的积累,学习到的信念在剧集过程中变得越来越校准。
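The structured belief described in the Agent-BRACE abstract (atomic natural-language claims, each with an ordinal certainty label from certain to unknown) maps naturally onto a small data structure. The sketch below is an assumption-heavy illustration: the intermediate labels `LIKELY` and `POSSIBLE`, the merge-by-text update rule, and the rendering format are all hypothetical, and in the paper the belief is produced by a trained model rather than by hand-written code.

```python
from dataclasses import dataclass
from enum import IntEnum


class Certainty(IntEnum):
    """Ordinal certainty labels; only the two endpoints come from the abstract."""
    UNKNOWN = 0
    POSSIBLE = 1  # hypothetical intermediate label
    LIKELY = 2    # hypothetical intermediate label
    CERTAIN = 3


@dataclass(frozen=True)
class Claim:
    text: str
    certainty: Certainty


def update_belief(belief, new_claims):
    """Merge claims by text; the newest label for a claim replaces the old one."""
    merged = {c.text: c for c in belief}
    for c in new_claims:
        merged[c.text] = c
    return list(merged.values())


def render_belief(belief):
    """Compact text the policy could condition on instead of the full history."""
    ordered = sorted(belief, key=lambda c: -c.certainty)
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in ordered)


belief = [Claim("the key is in the drawer", Certainty.UNKNOWN)]
belief = update_belief(belief, [
    Claim("the key is in the drawer", Certainty.CERTAIN),
    Claim("the door is locked", Certainty.LIKELY),
])
print(render_belief(belief))
```

Because claims are merged rather than appended, the rendered belief grows with the number of distinct facts, not with episode length, which is the property the abstract highlights.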
cs.CL / 18 / 2605.11483

StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models

StoicLLM:小型语言模型中的哲学对齐偏好优化
Khan, Ishmam, Thogarrati, Sindhuja, Zhang, Shuo
Abstract
While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism's outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.
Chinese Translation
尽管大型语言模型在事实适应方面表现出色,但它们在严重数据限制下内化细致哲学框架的能力仍然未得到充分探索。我们通过使用偏好优化(ORPO,AlphaPO)对基础斯多噶文本的微型数据集进行小型语言模型的专业化来研究这一问题。通过多模型评估库进行评估,我们的结果表明,仅需300个高保真示例即可与内向的斯多噶美德产生强烈对齐,接近少量示例提示,同时释放上下文窗口。然而,所有模型,包括少量示例基线,在斯多噶主义的外向世界公民责任方面持续失败,这表明小型模型的表征限制是微型数据集适应所无法克服的。
cs.CL / 19 / 2605.11502

Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations

基于知识引导扰动的稳健生物医学出版类型和研究设计分类
Ming, Shufan, Menke, Joe D., Smalheiser, Neil R., Kilicoglu, Halil
Abstract
Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI
Chinese Translation
准确且一致地按出版类型和研究设计对生物医学文献进行索引,对于支持证据综合和知识发现至关重要。先前关于自动化出版类型和研究设计索引的研究主要集中在扩展标签覆盖范围、丰富特征表示和提高领域内准确性,评估通常在与训练相同分布的数据上进行。尽管预训练的生物医学语言模型在这些设置下表现出色,但针对领域内准确性优化的模型可能依赖于表面的词汇或数据集特定线索,导致在分布变化下的稳健性降低。在本研究中,我们引入了一种基于受控语义扰动的评估框架,以评估出版类型分类器的稳健性,并探讨结合实体掩蔽和领域对抗训练的稳健性导向训练策略,以减轻对虚假主题相关性的依赖。我们的结果表明,当稳健性目标旨在选择性抑制非任务定义特征,同时保留显著的方法论信号时,通常观察到的稳健性与领域内准确性之间的权衡可以得到缓解。我们发现这些改进源于两个互补机制:(1) 当输入中存在显式方法论线索时,增加对这些线索的依赖,(2) 减少对虚假领域特定主题特征的依赖。这些发现强调了对出版类型和研究设计分类进行特征级稳健性分析的重要性,并建议通过更具选择性地抑制主题信息来细化掩蔽和对抗目标,以进一步提高稳健性。数据、代码和模型可在以下网址获取:https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI
cs.CL / 20 / 2605.11513

A Study on Hidden Layer Distillation for Large Language Model Pre-Training

大语言模型预训练中的隐层蒸馏研究
Guigon, Maxime, Dixon, Lucas, Sander, Michaël E.
Abstract
Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
Chinese Translation
知识蒸馏(Knowledge Distillation, KD)是训练大语言模型(Large Language Models, LLMs)的关键工具,但大多数研究集中于仅依赖输出logits的方法,忽视了教师中间表示中的语义信息。尽管隐层蒸馏(Hidden Layer Distillation, HLD)在编码器架构中显示出潜力,但其在解码器单一预训练中的大规模应用仍然基本未被探索。通过计算受控的实验,我们将HLD与基于logit的KD和自监督基线进行基准测试,教师模型为Gemma3 3.4B,学生模型为在C4数据集上训练的123M和735M,使用了多达168B的tokens。我们的实验表明,HLD在下游评估任务中并不总是优于标准KD。然而,我们展示了HLD在所有共享超参数配置中相较于KD可以系统性地提高困惑度,这表明可以提取潜在信号,但可能需要突破才能使其在LLM预训练中发挥更重要的作用。
cs.CL / 21 / 2605.11533

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

Checkup2Action:用于患者导向行动卡生成的多模态临床体检报告数据集
Xiang, Sike, Chen, Shuang, Lin, Kevin Qinghong, Yu, Jialin, Sun, Yijia, Torr, Philip, Atapour-Abarghouei, Amir
Abstract
Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging-related evidence, and physician summaries. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
Chinese Translation
临床体检报告是结合了页面布局、表格、数值生物标志物、异常标志、影像学发现和领域特定术语的多模态文档。这种异质证据对于外行人来说难以解读并转化为具体的后续行动。尽管大型语言模型在医学摘要和分诊支持方面显示出潜力,但它们从多模态体检报告中生成安全、优先级明确且以患者为导向的行动的能力仍然缺乏基准评估。我们提出了Checkup2Action,这是一个多模态临床体检报告数据集和结构化行动卡生成的基准。每张卡片描述一个临床相关问题,并指定其优先级、推荐科室、后续时间窗口、面向患者的解释以及针对临床医生的问题,同时避免诊断或治疗建议的声明。该数据集包含2,000份去标识化的真实世界体检报告,涵盖人口统计信息、身体检查、实验室测试、心血管评估、影像相关证据和医生总结。我们将体检到行动的生成形式化为一个受限的结构化生成任务,并引入一个评估协议,涵盖问题覆盖率和准确性、优先级一致性、科室和时间推荐的准确性、行动复杂性、实用性、可读性和安全合规性。与通用和医学大型语言模型的实验揭示了问题覆盖、行动正确性、简洁性和安全对齐之间的明显权衡。Checkup2Action为评估临床体检报告中的患者导向推理提供了一个新的多模态基准。
cs.CL / 22 / 2605.11538

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

驯服极端标记:具有高斯核优势重加权的协方差感知GRPO
Wang, Cheng, Liu, Qin, Zhou, Wenxuan, Chen, Muhao
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the trade-off between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
Chinese Translation
群体相对策略优化(GRPO)已成为提升大型语言模型推理能力的一种有前景的方法。然而,在训练过程中,它难以有效平衡探索与利用之间的权衡,常常导致次优性能。基于理论洞察,即熵的变化受标记概率及其相应优势之间的协方差支配,我们提出了一种无超参数的协方差加权优化方法,该方法通过高斯核动态降低极端标记级更新的权重。这种方法在保持信息学习信号的同时,自动减少了由探索-利用权衡引起的不稳定性。大量实证评估表明,与GRPO相比,我们的方法在推理基准测试中提高了下游性能,并有效地稳定了训练过程中的熵。
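The abstract does not give the exact weighting formula, so the following is only one plausible reading of "covariance-aware down-weighting via a Gaussian kernel". It computes a per-token covariance term between log-probability and advantage, then applies a Gaussian kernel whose bandwidth is the batch variance of those terms (one way to avoid a tunable hyperparameter). The function name and every formula choice here are assumptions, not the authors' method.

```python
import math


def gaussian_kernel_weights(log_probs, advantages):
    """Per-token weights in (0, 1]; tokens with extreme covariance terms get less weight.

    Assumption: weight = exp(-(c - mean_c)^2 / (2 * var_c)), where c is the
    centered product of token log-probability and advantage.
    """
    n = len(log_probs)
    mean_lp = sum(log_probs) / n
    mean_adv = sum(advantages) / n
    # Per-token contribution to the log-prob / advantage covariance.
    cov_terms = [(lp - mean_lp) * (a - mean_adv)
                 for lp, a in zip(log_probs, advantages)]
    mean_c = sum(cov_terms) / n
    var_c = sum((c - mean_c) ** 2 for c in cov_terms) / n
    if var_c == 0.0:
        return [1.0] * n  # no spread, so nothing is extreme
    return [math.exp(-((c - mean_c) ** 2) / (2.0 * var_c)) for c in cov_terms]


# Toy batch (made up): the last token is both very unlikely and highly rewarded.
log_probs = [-0.1, -0.2, -0.15, -5.0]
advantages = [0.1, -0.1, 0.0, 3.0]
weights = gaussian_kernel_weights(log_probs, advantages)
reweighted = [w * a for w, a in zip(weights, advantages)]
print(weights)
```

Under this reading, the outlier token still contributes to the update, just with a smaller weight, which matches the abstract's claim of damping extreme updates while preserving informative signals.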
cs.CL / 23 / 2605.11574

Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

三种上下文参数冲突的机制:预测框架与实证验证
Venkata, Pruthvinath Jeripity
Abstract
The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p <= .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
Chinese Translation
关于大型语言模型如何处理其训练知识与矛盾文档之间冲突的文献存在一个持续的实证矛盾:一些研究发现模型顽固地保留其训练答案,几乎一半的时间忽视提供的文档,而另一些研究则发现模型乐于遵循文档,约96%的时间遵循上下文。我们认为这些矛盾在认识到先前实验研究了三种质上不同的处理情况而未加区分时会消失。我们提出了一个三机制框架:机制1(单源更新,主导预测因子:证据一致性),机制2(竞争整合,主导预测因子:参数确定性),机制3(任务适当选择,主导预测因子:任务知识要求)。我们对参数强度(暴露频率)和参数唯一性(编码一致性)进行了形式化区分,实证显示这两个维度是正交的(r = -0.002, p = .97),在稳定的事实领域中,强度是操作性预测因子。我们在Claude Sonnet 4.6、GPT-5.5、Gemini 2.5 Flash、Llama 4 Maverick和DeepSeek V3上验证了该框架,使用了9,970次API调用进行三轮实验。广义估计方程(GEE)逻辑回归确认了所有五个模型的预测机制2确定性梯度(beta = -0.38到-0.50,所有p <= .013,BH-FDR校正)。机制3的消融实验表明,仅任务框架就将上下文跟随率从近100%(上下文知识条件)翻转至6-71%(参数知识条件),所有五个模型均显著(p < .001)。确定性梯度对多项结果建模、对对冲反应的敏感性分析以及FDR校正具有稳健性。
cs.CL / 24 / 2605.11577

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

BitLM:通过位连续扩散解锁多标记语言生成
Zhuang, Shaobin, Ai, Yuang, Han, Jiaming, Li, Xiaohui, Huang, Huaibo, Yue, Xiangyu, Hu, Xuefeng, Xu, Kun, Wang, Yali, Chen, Hao
Abstract
Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
Chinese Translation
自回归语言模型一次生成一个标记,但自然语言本质上是以多标记单元结构化的,包括短语、n-gram 和携带共同意义的搭配。这种一次一个标记的瓶颈限制了模型在预训练期间的表达能力以及推理时的吞吐量。现有的解决方案,如推测解码或基于扩散的语言模型,要么保持了潜在的瓶颈不变,要么牺牲了语言建模所必需的因果结构。我们提出了 BitLM,一种将每个标记表示为固定长度二进制代码的语言模型,并采用轻量级扩散头在每个块内并行去噪多个标记。关键是,BitLM 在块之间保持了从左到右的因果注意力,同时在每个块内做出联合词汇决策,结合了自回归建模的可靠性与迭代精炼的并行性。通过用位去噪替代大词汇量的 softmax,BitLM 将标记生成重新构建为在紧凑的二进制空间中的迭代承诺,从而实现更高效的预训练和显著更快的推理,而不改变使语言模型有效的因果基础。我们的结果表明,一次一个标记的范式并不是一种基本要求,而是一种接口选择,改变这一点可以产生更强大、更快速的语言模型。我们希望 BitLM 指向下一代语言模型架构的一个有前景的方向。
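The fixed-length binary token code described in this abstract can be illustrated with a minimal sketch. The helper names (`code_length`, `token_to_bits`, `bits_to_token`, `commit`) and the plain base-2 encoding of token ids are assumptions made for illustration; the abstract does not say how BitLM assigns codes, and the diffusion head itself is not modelled here.

```python
def code_length(vocab_size: int) -> int:
    """Smallest number of bits that can index every token id in the vocabulary."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_bits(token_id: int, n_bits: int) -> list[int]:
    """Encode a token id as a fixed-length list of bits, most significant bit first."""
    return [(token_id >> i) & 1 for i in reversed(range(n_bits))]


def bits_to_token(bits: list[int]) -> int:
    """Decode a most-significant-bit-first bit list back into a token id."""
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | b
    return token_id


def commit(soft_bits: list[float], threshold: float = 0.5) -> list[int]:
    """Round per-bit probabilities (e.g. from a denoising step) to hard bits."""
    return [1 if p >= threshold else 0 for p in soft_bits]


n_bits = code_length(32000)          # hypothetical vocabulary size
bits = token_to_bits(31999, n_bits)
recovered = bits_to_token(bits)
```

Under this assumed encoding, a 32,000-entry vocabulary needs 15 output bits per token rather than a 32,000-way softmax, which is the kind of saving the abstract attributes to bitwise denoising.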
cs.CL / 25 / 2605.11581

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Ada-MK:通过自动化基于DAG的搜索实现自适应MegaKernel优化以支持大型语言模型推理
Dong, Wenxin, Hu, Mingqing, Yu, Guanghui, Fu, Qiang, Xu, Peng, Xu, Hui, Xing, Yue, Jiao, Xuewu, Li, Shuanglong, Liu, Lin
Abstract
When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.
Chinese Translation
当大型语言模型(LLMs)在商业在线广告系统中进行实时推理时,端到端延迟必须严格控制在毫秒级范围内。然而,在解码阶段生成的每个令牌都会触发数千次内核启动,而内核启动开销本身就可能占到端到端推理时间的14.6%。MegaKernel通过将多个操作符融合为一个持久内核,消除了启动开销和操作符间的HBM往返。然而,现有的MegaKernel实现面临着在资源受限的GPU(如NVIDIA Ada)上可移植性与效率之间的根本矛盾:手动调优的解决方案与特定架构紧密耦合,缺乏可移植性,而自动编译的方法引入的运行时动态调度在延迟敏感的环境中其分支惩罚是不可接受的。我们观察到,在固定的部署配置下,MegaKernel的最佳执行路径是唯一确定的,运行时动态决策可以完全提升到编译时。基于这一洞察,我们提出了Ada-MK:(1)结合K维分割的三维共享内存约束模型,减少峰值共享内存使用量50%;(2)基于MLIR的细粒度DAG离线搜索,巩固最佳执行路径,完全消除运行时分支;(3)一个异构混合推理引擎,将MegaKernel作为插件嵌入TensorRT-LLM,结合高吞吐量的Prefill与低延迟的Decode。在NVIDIA L20上,Ada-MK在单批次吞吐量上比原始TensorRT-LLM提高了最多23.6%,比vLLM提高了50.2%,在所有测试场景中均实现了正向增益——这是MegaKernel在商业在线广告系统中的首次工业部署。
cs.CL / 26 / 2605.11582

Efficient LLM-based Advertising via Model Compression and Parallel Verification

基于高效LLM的广告投放:模型压缩与并行验证
Dong, Wenxin, Gao, Chang, Yu, Guanghui, Jiao, Xuewu, Hu, Mingqing, Fu, Qiang, Xu, Peng, Wei, Penghui, Xu, Hui, Xing, Yue, Li, Shuanglong, Liu, Lin
Abstract
Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.
Chinese Translation
大型语言模型(LLMs)在广告场景中展现了显著的潜力,例如广告创意生成和精准广告投放。然而,由于推理延迟和计算成本高,实时广告系统中部署LLMs面临重大挑战。本文提出了一种高效生成目标框架(Efficient Generative Targeting),该框架集成了自适应组量化、层自适应层次稀疏化和前缀树并行验证,以加速LLM推理,同时保持生成质量。在两个真实广告场景上的广泛实验表明,我们的框架在可接受的质量下降下实现了显著的加速,使其在实际部署中具有可操作性。
cs.CL / 27 / 2605.11601

DiffScore: Text Evaluation Beyond Autoregressive Likelihood

DiffScore:超越自回归似然的文本评估
Lai, Wen, Shen, Yingli, Jin, Dingnan, Cui, Qing, Zhou, Jun, Sun, Maosong, Fraser, Alexander
Abstract
Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.
Chinese Translation
自回归语言模型广泛用于文本评估,然而,它们的从左到右的因子化引入了位置偏差,即早期的标记仅使用左侧上下文进行评分,将架构不对称与真实文本质量混淆。我们提出了掩码重构作为一种替代范式,在该范式中,每个标记都使用完整的双向上下文进行评分。我们引入了DiffScore,一个基于掩码大扩散语言模型的评估框架。通过测量在连续掩码率下的文本可恢复性,DiffScore消除了位置偏差,并自然建立了从局部流畅性到全局一致性的评估层次。我们进一步提供了自回归框架无法提供的诊断工具:跨掩码率分解分数的多时间步质量轮廓,以及双向PMI分解,旨在将流畅性与忠实度区分开来。在十个基准测试中的实验表明,DiffScore在零-shot和微调设置中始终优于自回归基线。代码已发布于:https://github.com/wenlai-lavine/DiffScore。
cs.CL / 28 / 2605.11608

PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head

PRISM:一种将漂移分解为尺度、形状和头部的几何风险界限
Lin, Chieh-Yen, Sun, Shao-Hua
Abstract
Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
Chinese Translation
比较后训练的LLM变体,如量化模型、LoRA适配模型和蒸馏模型,需要一种诊断工具来识别变体的漂移情况,而不仅仅是判断其是否退化。现有的相似性评分,如CKA和SVCCA,可以标识退化,但它们并未直接将表示漂移与风险或机制联系起来。我们提出了PRISM(通过结构映射的代理风险推断),该方法利用LLM的线性输出头和其骨干网络的经验近等距结构,推导出目标模型与后训练变体之间交叉熵风险差距的封闭形式上界。该界限经过校准以进行变体排名,并将漂移分解为三个独立可测量的轴:尺度不匹配、形状不匹配和头部发散。每个轴对应于一种特定的失败模式,包括在低比特量化下的形状失真、在LoRA遗忘下的尺度可分性,以及在GGUF k-量化下的头部发散。因此,主导轴建议了一种补救方向,而不仅仅是提高退化警告。由于形状项是可微的,相同的几何结构也可以作为训练时的正则化器,以防止灾难性遗忘。在两个模型家族和五个基准测试中,PRISM在后训练量化的平均Spearman相关性为0.820,在LoRA遗忘的平均Spearman相关性为0.831,其轴导向的形状正则化器在减轻下游遗忘方面的整体表现优于经验重放。
cs.CL / 29 / 2605.11612

When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

当情感成为触发器:寄生于大型语言模型的情感风格动态后门攻击
Liu, Ziyu, Li, Tao, Ni, Tianjie, Lan, Xiaolong, Ma, Wengang, Yang, Tao, Wang, Guohua, He, Junjiang
Abstract
Backdoor vulnerabilities widely exist in the fine-tuning of large language models (LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of an LLM, emotion can be decoupled from semantics, forming a cluster distinct from the original neutral text. Therefore, we treat the emotional factor as the backdoor trigger and propose a parasitic emotion-style dynamic backdoor attack, Paraesthesia. Once samples carrying the emotional trigger are mixed into clean data and the model is fine-tuned, the model generates the predefined attack response when it encounters emotional inputs at inference time. Paraesthesia includes two components: the quantification and the rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99% across both task types and four different models, while maintaining the clean utility of the models.
Chinese Translation
后门漏洞广泛存在于大型语言模型(LLMs)的微调过程中。大多数后门污染方法主要在标记级别操作,缺乏更深层次的语义操控,这限制了其隐蔽性。此外,先前的攻击依赖于单一固定触发器来诱导有害输出。这种静态触发器容易被检测,而清洁微调可以削弱触发器与目标之间的关联。通过因果验证,我们观察到情感并不是直接与单个词汇相关,而是通过语调作为整体风格因素发挥作用。在LLM的表示空间中,情感可以与语义解耦,形成与原始中性文本不同的聚类。因此,我们将情感因素视为后门触发器,提出了一种寄生情感风格动态后门攻击,称为Paraesthesia。通过将含有情感触发器的样本混合到清洁数据中,然后对模型进行微调,模型能够在推理阶段遇到情感输入时生成预定义的攻击响应。Paraesthesia包括情感风格的量化和重写。我们在遵循指令的生成和分类任务上评估了我们方法的有效性。实验结果表明,Paraesthesia在这两种任务类型和四种不同模型上实现了约99%的攻击成功率,同时保持了模型的清洁效用。
cs.CL / 30 / 2605.11629

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

OmniThoughtVis:可扩展的蒸馏管道用于可部署的多模态推理模型
Yue, Yuanhao, Wang, Chengyu, Lyu, Yuanjie, Shen, Lei, Huang, Jun
Abstract
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.
Chinese Translation
近期的多模态大型语言模型(MLLMs)在视觉-语言任务上展现了强大的思维链(CoT)推理能力,但由于延迟和资源限制,它们在实际系统中的直接部署往往受到限制。在实践中,较小的MLLMs更适合在线服务,然而它们的推理性能受到缺乏大规模、高质量多模态CoT监督的瓶颈。本文提出了OmniThoughtVis,一个可扩展的数据策划和蒸馏管道,用于将多模态推理能力从高容量教师模型转移到较小的、面向部署的MLLMs。我们的管道从一个多样的开源种子池开始,生成结构化的CoT痕迹,并对推理难度、答案质量和语义任务标签进行联合标注。为了在规模上保持数据质量,我们结合了基于规则的过滤、难度感知选择和基于标签的多样性采样,最终形成了一个包含180万样本的策划语料库,支持下游训练的可控子集构建。我们使用OmniThoughtVis将Qwen3-VL模型从20亿参数蒸馏到80亿参数,并在九个多模态推理基准上进行评估。结果显示,蒸馏模型在模型规模上均表现出一致的提升,包括在MathVerse上提高了最多16.8分,在MMMU-Pro上提高了5.6分的4B模型。值得注意的是,蒸馏后的4B模型在多个任务上与未蒸馏的8B基线相匹配或超越,突显了可扩展推理蒸馏在面向部署的MLLMs中的实际价值。
cs.CL / 31 / 2605.11632

Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

通过偏好对齐优化增强多语言反事实生成
Wang, Yilong, Wang, Qianli, Chu, Bohao, Liu, Yihong, Yang, Jing, Ostermann, Simon
Abstract
Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
Chinese Translation
自生成反事实解释(SCEs)是由大型语言模型(LLMs)生成的经过最小修改的输入(最小性),旨在翻转其自身预测(有效性),提供了一种基于因果关系的方法来揭示黑箱LLM行为。然而,将其扩展到英语以外的语言仍然具有挑战性:现有方法在非主导语言中难以生成有效的SCEs,并且有效性与最小性之间的持续权衡削弱了解释质量。我们提出了Macro,一个偏好对齐框架,应用直接偏好优化(DPO)于多语言SCE生成,使用复合评分函数构建偏好对,有效地将权衡转化为可测量的偏好信号。针对四个LLM和七种类型多样的语言的实验表明,Macro在不降低最小性的情况下,平均提高了12.55%的有效性,相较于思维链基线,同时避免了基于翻译的基线所带来的严重最小性违反。与监督微调相比,Macro在这两个指标上表现更优,确认了显式偏好优化对于平衡这一权衡的重要性。进一步分析显示,Macro增加了跨语言扰动对齐,并减轻了常见生成错误。我们的结果突显了偏好优化作为增强多语言模型解释的有前景方向。
cs.CL / 32 / 2605.11663

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

基于人类的多模态基准,包含来自日本国家学力评估的90万规模聚合学生反应分布
Takami, Kyosuke, Tateisi, Yuka, Sekine, Satoshi, Miyao, Yusuke
Abstract
Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark
Chinese Translation
真实的学校考试为评估多模态大型语言模型(MLLMs)提供了高有效性的测试平台,但基于日本K-12评估的基准仍然稀缺。我们展示了一个基于日本国家学力评估构建的多模态数据集,包含官方发布的中学科学、数学和日语试题。与基于合成或策划数据的现有基准不同,我们的数据集保留了真实的考试布局、图表和日本教育文本,以及全国范围内聚合的学生反应分布(N ≈ 900,000)。这些特征使得在统一的评估框架下可以直接比较人类与模型的表现。我们使用准确匹配率和开放式反应的字符级F1对近期的多模态LLMs进行基准测试,观察到不同学科之间存在显著差异,并对视觉推理需求表现出强烈的敏感性。人类评估和LLM作为评判者的分析进一步评估了自动评分的可靠性。我们的数据集建立了一个可重复的、以人类为基础的多模态教育推理基准,并支持未来在真实评估环境中关于评估、反馈生成和可解释人工智能的研究。我们的数据集可在以下链接获取:https://github.com/KyosukeTakami/gakucho-benchmark
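The two automatic metrics named in this abstract, exact-match accuracy and character-level F1, can be sketched as follows. This is one common formulation (bag-of-characters overlap, whitespace ignored) and is an assumption: the abstract does not state the benchmark's normalisation rules, and the function names are illustrative.

```python
from collections import Counter


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and reference agree after trimming outer whitespace."""
    return pred.strip() == gold.strip()


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 based on multiset overlap of non-whitespace characters."""
    pred_chars = [c for c in pred if not c.isspace()]
    gold_chars = [c for c in gold if not c.isspace()]
    if not pred_chars or not gold_chars:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_chars == gold_chars)
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


partial = char_f1("ab", "abcd")   # precision 1.0, recall 0.5
```

Scoring at the character level avoids the need for a word segmenter, which is a practical reason to prefer it for Japanese free-text answers; a bag-of-characters score ignores order, so it is more lenient than exact match.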
cs.CL / 33 / 2605.11685

Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

针对重学习攻击的鲁棒性大语言模型去学习:表示中的次要成分至关重要
Xiao, Zeguan, Xu, Xuanzhe, Chen, Yun, Wang, Yong, Yang, Jian, Hu, Yanqing, Chen, Guanhua
Abstract
Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods including sharpness-aware minimization.
Chinese Translation
大语言模型(LLM)去学习旨在在不进行昂贵的再训练的情况下,从预训练模型中移除特定数据的影响,以应对隐私、版权和安全问题。然而,最近的研究揭示了一个关键的脆弱性:去学习模型通过重学习攻击迅速恢复“遗忘”的知识。这种脆弱性引发了严重的安全担忧,尤其是对于开放权重模型。在本研究中,我们从表示几何的角度探讨了这种脆弱性的基本机制。我们发现,现有的去学习方法主要沿着主成分进行优化,导致次要成分几乎保持不变。关键是,在重学习攻击期间,这些主成分的修改很容易被逆转,从而实现快速的知识恢复,而次要成分则表现出更强的抗逆转能力。我们进一步提供了理论分析,从表示的谱结构解释了这两种观察结果。在此基础上,我们提出了次要成分去学习(Minor Component Unlearning, MCU),这是一种新颖的去学习方法,明确针对表示中的次要成分。通过将去学习效果集中在这些固有的鲁棒方向上,我们的方法在抵抗重学习攻击方面取得了显著的改进。在三个数据集上的大量实验验证了我们的方法,显示出相较于包括敏感性意识最小化在内的最先进方法的显著提升。
cs.CL / 34 / 2605.11739

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

学习预见:揭示在线策略蒸馏的解锁效率
Cai, Yuchen, Cao, Ding, Lin, Liang, Luo, Chunxi, Xu, Xin, Yang, Kai, Liu, Weijie, Yang, Saiyong, Zhao, Tianxiang, Sun, Guangzhong, Liu, Guiquan, Fang, Junfeng
Abstract
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
Chinese Translation
在线策略蒸馏(On-policy distillation, OPD)作为一种高效的大型语言模型后训练范式逐渐受到关注。然而,现有研究主要将这一优势归因于更密集和更稳定的监督,而OPD效率背后的参数级机制仍然不够清晰。在本研究中,我们认为OPD的效率源于一种“预见”形式:它在训练早期就建立了通向最终模型的稳定更新轨迹。这种预见体现在两个方面。首先,在模块分配层面,OPD识别出边际效用低的区域,并将更新集中在对推理更为关键的模块上。其次,在更新方向层面,OPD表现出更强的低秩集中,其主导子空间在训练早期与最终更新子空间紧密对齐。基于这些发现,我们提出了EffOPD,一种即插即用的加速方法,通过自适应选择外推步长并沿当前更新方向移动来加速OPD。EffOPD无需额外的可训练模块或复杂的超参数调优,平均训练加速达到$3\times$,同时保持可比的最终性能。总体而言,我们的研究提供了一个参数动态的视角,以理解OPD的效率,并为设计更高效的大型语言模型后训练方法提供了实用的见解。
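The idea of moving further along the current update direction with an adaptively chosen step size can be sketched on a toy problem. Everything specific here is an assumption for illustration: the abstract does not say how EffOPD picks the step, so this sketch simply tries a small grid of candidate steps and keeps the one with the lowest held-out loss. Parameters are flat Python lists and `loss_fn` is a hypothetical callable.

```python
from typing import Callable, Sequence


def extrapolate(prev: Sequence[float], curr: Sequence[float], step: float) -> list[float]:
    """Move `step` times further along the latest update direction (curr - prev)."""
    return [c + step * (c - p) for p, c in zip(prev, curr)]


def pick_step(
    prev: Sequence[float],
    curr: Sequence[float],
    loss_fn: Callable[[Sequence[float]], float],
    candidates: Sequence[float] = (0.0, 0.5, 1.0, 2.0, 4.0),
) -> tuple[float, list[float]]:
    """Assumed selection rule: keep the candidate step with the lowest loss."""
    best = min(candidates, key=lambda s: loss_fn(extrapolate(prev, curr, s)))
    return best, extrapolate(prev, curr, best)


# Toy quadratic loss whose minimum sits one extra update beyond `curr`.
target = [1.0, 2.0]


def toy_loss(params: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(params, target))


step, new_params = pick_step([0.0, 0.0], [0.5, 1.0], toy_loss)
```

A step of 0.0 is kept in the grid so that the rule can fall back to the unextrapolated parameters when no candidate improves the loss.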
cs.CL / 35 / 2605.11744

Training-Inference Consistent Segmented Execution for Long-Context LLMs

训练-推理一致的长上下文大语言模型分段执行
Shang, Xianpeng, Li, Jiang, Duo, Zehua, Cai, Qianyi, Su, Xiangdong
Abstract
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).
Chinese Translation
基于Transformer的大型语言模型在长上下文生成中面临严重的可扩展性挑战,这主要是由于全上下文注意力的计算和内存成本。在实际的计算和内存限制下,许多高效推理的长上下文方法通过在推理过程中采用有界上下文或分段级执行来提高效率,同时在训练过程中继续使用全上下文注意力,从而导致训练和推理执行及状态转移语义之间的不匹配。基于这一洞察,我们提出了一种训练-推理一致的分段级生成框架,其中训练和推理遵循相同的分段级前向执行语义。在训练过程中,通过限制梯度传播仅限于从紧接着的分段中携带的KV状态来强制与推理的一致性,同时允许在前向传播过程中对过去的KV状态进行特定头的访问,但不涉及它们的梯度传播。在长上下文基准测试中,我们的方法在性能上与全上下文注意力相当,同时在强推理高效基线下实现了竞争性的延迟-内存权衡,并在非常长的上下文长度(例如,在128K时相比于使用FlashAttention的全上下文注意力,峰值预填充内存降低约6倍)上显著提高了可扩展性。
cs.CL / 36 / 2605.11769

Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

面向安全的空中交通控制语言理解系统评估
Chang, Yujing, Guleria, Yash, Pham, Duc-Thinh, Pham, Nhut-Huy, Wang, Ningli, Duong, Vu N., Alam, Sameer
Abstract
Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
Chinese Translation
空中交通控制(ATC)是一个安全关键领域,其中指令的错误解读可能导致严重的操作后果。尽管大型语言模型(LLMs)表现出强大的整体性能,但它们在实际ATC环境中的可靠性仍不明确。现有的评估方法主要基于聚合指标,如F1或宏观准确率,均匀对待所有错误,未能考虑高风险语义错误(例如,错误的跑道标识符或移动限制)所带来的不对称后果。为了解决这一问题,我们提出了一种针对ATC操作的面向安全、关注后果的评估框架。我们的结果显示,尽管当前的LLMs在整体准确性上表现合理,但它们的操作可靠性严重受限。在干净的转录文本上评估时,最高风险评分仅达到0.69,尽管宏观F1表现较高,但大多数模型的得分低于0.6。进一步分析表明,错误集中在高影响实体上,尽管动作类型分类相对稳定,这表明存在结构性基础缺陷。这些发现突显了为负责任地部署AI辅助ATC系统而需要关注后果的评估协议的必要性。
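The contrast between uniform and consequence-aware scoring can be made concrete with a small sketch. The abstract does not define the paper's Risk Score, so the weighted score below is a hypothetical stand-in, and the slot names and weights are invented for illustration only.

```python
def slot_accuracy(pred: dict[str, str], gold: dict[str, str]) -> float:
    """Uniform scoring: every slot counts the same."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)


def weighted_score(pred: dict[str, str], gold: dict[str, str], weights: dict[str, float]) -> float:
    """Consequence-aware scoring: each slot is weighted by an assumed severity."""
    total = sum(weights[k] for k in gold)
    earned = sum(weights[k] for k, v in gold.items() if pred.get(k) == v)
    return earned / total


# Hypothetical severity weights: a wrong runway is treated as the costliest error.
weights = {"action": 1.0, "runway": 5.0, "callsign": 3.0}
gold = {"action": "cleared_to_land", "runway": "27L", "callsign": "SIA321"}
pred = {"action": "cleared_to_land", "runway": "27R", "callsign": "SIA321"}

uniform = slot_accuracy(pred, gold)          # 2 of 3 slots correct
weighted = weighted_score(pred, gold, weights)  # 4 of 9 weight units earned
```

With these assumed weights, a single runway error drops the score from 2/3 under uniform scoring to 4/9 under weighted scoring, which mirrors the abstract's point that aggregate metrics can hide high-impact mistakes.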
cs.CL / 37 / 2605.11774

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

从单个标记到标记对:临床预测中大型语言模型的高效提示压缩
Zhu, Mingcheng, Luo, Zhiyao, Liu, Yu, Zhu, Tingting
Abstract
By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens of merely 0.5-1.0% of the LLM's parameters are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
Chinese Translation
通过将电子健康记录(EHR)处理为自然语言序列,大型语言模型(LLMs)在临床预测任务中显示出潜力,例如死亡率预测和表型分析。然而,纵向或高频率的EHR通常会产生过长的标记序列,导致高计算成本甚至性能下降。现有解决方案要么增加压缩模块,要么移除不太重要的标记,这会引入额外的推理延迟或有丢失临床信息的风险。为了实现无损的标记序列压缩,而不增加额外的成本或性能损失,我们提出了医学标记对编码(Medical Token-Pair Encoding,MedTPE),这是一种分层方法,扩展了EHR序列的标准标记化。MedTPE将频繁共同出现的医学标记对合并为复合标记,提供无损压缩,同时通过依赖感知替换策略保持计算复杂度。仅对新引入的标记的嵌入进行微调,这些标记仅占LLM参数的0.5-1.0%,采用自监督学习。在两个临床场景的真实数据集上的实验表明,MedTPE将输入标记长度减少了多达31%,推理延迟减少了34-63%,同时在多个LLM和四个临床预测任务中保持或甚至提高预测性能和输出格式合规性。此外,MedTPE在不同输入上下文长度下表现出鲁棒性,并且能够推广到科学和金融领域以及不同语言。
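Merging frequently co-occurring token pairs into composite tokens, with lossless recovery, can be sketched in the style of byte-pair encoding over token ids. This is a simplified illustration rather than MedTPE itself: the dependency-aware replacement strategy and the embedding fine-tuning are not reproduced, and the function names and the choice of starting new ids at 100 are assumptions.

```python
from collections import Counter


def most_frequent_pair(seqs: list[list[int]]):
    """Return the most common adjacent token pair, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(seqs: list[list[int]], n_merges: int, first_new_id: int):
    """Greedily learn up to `n_merges` composite tokens; returns (merges, compressed seqs)."""
    merges: dict[int, tuple[int, int]] = {}
    for k in range(n_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        new_id = first_new_id + k
        merges[new_id] = pair
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


def expand(seq: list[int], merges: dict[int, tuple[int, int]]) -> list[int]:
    """Undo the merges, recovering the original token sequence exactly."""
    out: list[int] = []
    for tok in seq:
        if tok in merges:
            out.extend(expand(list(merges[tok]), merges))
        else:
            out.append(tok)
    return out


original = [[1, 2, 3, 1, 2], [1, 2, 4]]
merges, compressed = learn_merges(original, n_merges=1, first_new_id=100)
restored = [expand(s, merges) for s in compressed]
```

In this toy corpus one merge shortens the input from 8 tokens to 5, and `expand` recovers the original sequences exactly, which is what "lossless" means in this setting.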
cs.CL / 38 / 2605.11779

Choosing features for classifying multiword expressions

多词表达分类特征的选择
Laporte, Eric
Abstract
Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.
Chinese Translation
多词表达(MWEs)是一组异质性集合,迫切需要进行分类。设计一个令人满意的分类涉及选择特征。在多词表达的情况下,许多特征是先验可用的。然而,并非所有特征在将多词表达分配到类别时的可靠性相等。因此,所产生的分类在计算使用上可能效果不同。我概述了一种增强的分类方法。为了提高其对多种语言的适用性,我参考了以往的研究,考虑了各种语言的因素。
cs.CL / 39 / 2605.11845

Probabilistic Calibration Is a Trainable Capability in Language Models

概率校准是语言模型可训练的能力
Baldelli, Davide, Kuriakose, Sruthi, Hashemzadeh, Maryam, Zouaq, Amal, Chandar, Sarath
Abstract
Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
Chinese Translation
语言模型越来越多地应用于需要满足用户指定随机性约束的场景中,但它们的生成概率往往与这些目标的校准效果较差。我们研究了这种能力是否可以通过微调直接改善。具体而言,我们在需要从数学分布中采样的合成提示上对语言模型进行微调,并比较了两种校准微调变体:一种软目标方法将期望的输出分布转换为基于前缀树的下一个标记目标,另一种硬目标方法则基于来自同一目标分布的采样完成进行训练。在涵盖四个模型系列的12个模型中,这两种方法在保留的分布系列和未见参数设置上显著提高了结构化采样的保真度,表明概率校准是一种可训练的能力。在我们选择的训练配置下,这两种方法表现出不同的经验特征:硬目标微调通常在结构化数值采样上表现最强,而软目标微调在更广泛的随机生成基准上表现更好,包括开放式随机生成、多选答案位置平衡和NoveltyBench。这些增益有时会降低下游能力,尤其是算术推理,成本因模型而异。总体而言,我们的结果表明,通过微调可以改善概率校准,其中我们的硬目标配置更有利于精确的数值保真度,而软目标配置则更有利于更广泛的随机转移。代码可在 https://github.com/chandar-lab/calibration-finetuning 获取。
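The soft-target variant described above turns a desired output distribution into next-token targets through a trie. The following is a minimal sketch of that idea, not the authors' implementation: it assumes character-level tokens, a hypothetical `<eos>` marker, and a toy target distribution over four numeric strings.

```python
from collections import defaultdict

EOS = "<eos>"  # hypothetical end-of-sequence marker (assumption)

def next_token_targets(target_dist, prefix):
    """Next-token distribution after `prefix`, obtained by summing the
    probability mass of every completion that passes through that trie node."""
    mass = defaultdict(float)
    for completion, prob in target_dist.items():
        if not completion.startswith(prefix):
            continue  # this completion is not below the current trie node
        if len(completion) > len(prefix):
            mass[completion[len(prefix)]] += prob  # next character on the path
        else:
            mass[EOS] += prob  # the completion ends exactly at this node
    total = sum(mass.values())
    return {tok: m / total for tok, m in mass.items()} if total > 0 else {}

# Toy target: uniform over four numeric strings (illustrative values only).
target = {"7": 0.25, "10": 0.25, "11": 0.25, "12": 0.25}
root_targets = next_token_targets(target, "")   # "1" carries 0.75, "7" carries 0.25
after_one = next_token_targets(target, "1")     # "0", "1", "2" carry 1/3 each
```

A hard-target counterpart would instead sample whole completions from `target` and train on them with ordinary one-hot labels.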
cs.CL / 40 / 2605.11854

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

自蒸馏轨迹感知的玻尔兹曼建模:弥合扩散语言模型中的训练-推理差异
Chen, Kecheng, Liu, Ziru, Tao, Xijia, Liu, Hui, Liu, Yibing, Fu, Xinyu, Wu, Shi, Zhang, Suiyun, Tu, Dandan, Kong, Lingpeng, Liu, Rui, Li, Haoliang
Abstract
Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
Chinese Translation
扩散语言模型(DLMs)最近作为自回归语言模型的一种有前景的替代方案出现,提供了更强的全局意识和高度并行的生成能力。然而,使用标准负证据下界(NELBO)进行监督微调的后训练DLMs效率仍然较低:训练在单步中重构随机掩蔽的标记,而推理则遵循一种基于置信度的多步从易到难的去噪轨迹。最近的基于轨迹的自蒸馏方法主要利用这种推理轨迹进行采样步的压缩和加速,通常提高解码效率,但并未实质性增强模型的基本能力,甚至可能在完全扩散解码下降低性能。在本研究中,我们探讨自蒸馏轨迹是否不仅可以用于更快的推理,而是真正的知识获取。尽管这些轨迹位于预训练DLM自身的分布流形上,因此提供了潜在的较低优化障碍,但我们发现,使用标准NELBO目标对其进行简单微调仅能获得边际收益。为了解决这一限制,我们提出了通过玻尔兹曼建模的轨迹对齐优化(TABOM),这是一个基于自蒸馏轨迹的后训练框架,它将训练与推理的易到难结构对齐。TABOM将推理的去掩蔽偏好建模为预测熵上的玻尔兹曼分布,并推导出一个可处理的成对排名目标,以将模型的确定性排序与观察到的解码轨迹对齐。从实证上看,TABOM在新领域取得了显著的增益,扩展了DLMs的有效知识边界,并显著减轻了与标准SFT相比的灾难性遗忘。
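The abstract above describes a Boltzmann distribution over predictive entropies and a pairwise ranking objective. The sketch below shows one way such quantities could be computed. It is an assumption-laden illustration, not TABOM's actual objective: the temperature `tau`, the function names, and the Bradley-Terry-style logistic loss are all hypothetical choices.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one position's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def unmask_preference(entropies, tau=1.0):
    """Boltzmann weights over masked positions: lower entropy, higher preference.
    `tau` is an assumed temperature parameter."""
    weights = [math.exp(-h / tau) for h in entropies]
    z = sum(weights)
    return [w / z for w in weights]

def pairwise_ranking_loss(h_first, h_later, tau=1.0):
    """Negative log-probability that the position unmasked first 'wins' against a
    later one under the two-way Boltzmann model: -log sigmoid((h_later - h_first) / tau)."""
    return math.log1p(math.exp(-(h_later - h_first) / tau))

# Two masked positions: a confident one and an uncertain one (toy values).
h_confident = entropy([0.9, 0.1])
h_uncertain = entropy([0.5, 0.5])
prefs = unmask_preference([h_confident, h_uncertain])
loss_consistent = pairwise_ranking_loss(h_confident, h_uncertain)  # confident one decoded first
loss_violated = pairwise_ranking_loss(h_uncertain, h_confident)    # uncertain one decoded first
```

Under this assumed form, the loss is small when the model's entropy ordering agrees with the order in which the trajectory unmasked the positions, and larger when it disagrees.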
cs.CL / 41 / 2605.11862

Concordance Comparison as a Means of Assembling Local Grammars

语汇索引比较:一种组装局部语法的手段
Pirovani, Juliana, de Oliveira, Elias, Laporte, Eric
Abstract
Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.
Chinese Translation
对人名的命名实体识别是信息抽取中一项重要但并非易事的任务。本文使用一种工具,比较由两个局部语法(LG)得到的语汇索引(concordance),并突出显示其差异。我们利用这些结果辅助从一组LG中选出最佳者。通过分析比较结果,我们观察到每对LG之间存在包含、相交和不相交的关系,这帮助我们组装出效果最佳的LG。该方法被应用于一项从葡萄牙语文本中抽取人名的案例研究。我们将增强后的语法应用于第二届HAREM的黄金集(Gold Collection)。所获得的F-Measure为76.86,相较于葡萄牙语的最新技术水平提高了6个点。
cs.CL / 42 / 2605.11887

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Qwen-Scope:将稀疏特征转化为大型语言模型的开发工具
Deng, Boyi, Wang, Xu, Wang, Yaoning, Wan, Yu, Ma, Yubo, Yang, Baosong, Wei, Haoran, Tang, Jialong, Lin, Huan, Gao, Ruize, Li, Tianhao, Cao, Qian, Ren, Xuancheng, Deng, Xiaodong, Yang, An, Huang, Fei, Liu, Dayiheng, Zhou, Jingren
Abstract
Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
Chinese Translation
大型语言模型在多种任务中展现了卓越的能力,但其内部决策过程在很大程度上仍不透明,这限制了我们对其进行检查、控制和系统性改进的能力。这种不透明性促使了机制可解释性研究的不断发展,其中稀疏自编码器(Sparse Autoencoders, SAEs)成为最有前景的工具之一,能够将模型激活分解为稀疏且可解释的特征表示。我们介绍了 Qwen-Scope,这是一个基于 Qwen 模型系列构建的开源 SAEs 套件,包含来自 Qwen3 和 Qwen3.5 系列 7 个模型变体的 14 组 SAEs,涵盖了稠密和混合专家架构。在这些 SAEs 的基础上,我们展示了 SAEs 可以超越事后分析,作为模型开发的实用接口,沿着四个方向发挥作用:(i)推理时引导,其中 SAE 特征方向控制语言、概念和偏好,而无需修改模型权重;(ii)评估分析,其中激活的 SAE 特征提供基准冗余和能力覆盖的表示级代理;(iii)数据中心工作流,其中 SAE 特征支持多语言毒性分类和安全导向的数据合成;(iv)训练后优化,其中 SAE 派生信号被纳入监督微调和强化学习目标,以减轻不良行为,如代码切换和重复。综合这些结果表明,SAEs 不仅可以作为事后分析工具,还可以作为可重用的表示级接口,用于诊断、控制、评估和改进大型语言模型。通过开源 Qwen-Scope,我们旨在支持机制研究,并加速将模型内部与下游行为连接的实际工作流程。
cs.CL / 43 / 2605.11906

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning

YFPO:基于神经元引导奖励的耦合特征偏好优化在数学推理中的初步研究
Le, Yifan
Abstract
Preference optimization has become an important post-training paradigm for improving the reasoning abilities of large language models. Existing methods typically rely on externally constructed preference data, using preferred and dispreferred responses as sample-level supervision. However, such external signals rarely make explicit use of capability-related information contained in the model's internal representations. For mathematical reasoning, certain neuron groups may exhibit activation patterns associated with mathematical knowledge, symbolic manipulation, or logical reasoning. Similar to reflexive behavioral signals, these internal activations may provide a coarse indication of whether the model is engaging math-related capabilities. We introduce YFPO, short for Yoked Feature Preference Optimization, a preliminary neuron-guided preference optimization framework for mathematical reasoning. YFPO first uses AttnLRP to identify math-related neurons, and then constructs an auxiliary reward from their activation margin between preferred and dispreferred responses. This design augments external preference learning with internal neuron-level signals. We conduct preliminary experiments on a small-scale language model using GSM8K as the main benchmark. Results suggest that neuron-level signals can interact with preference optimization and occasionally improve reasoning performance, offering a promising direction for more fine-grained and interpretable reasoning-oriented post-training.
Chinese Translation
偏好优化已成为提高大型语言模型推理能力的重要后训练范式。现有方法通常依赖于外部构建的偏好数据,使用偏好和不偏好的响应作为样本级监督。然而,这些外部信号很少明确利用模型内部表征中包含的能力相关信息。在数学推理中,某些神经元组可能表现出与数学知识、符号操作或逻辑推理相关的激活模式。类似于反射性行为信号,这些内部激活可能粗略指示模型是否正在使用与数学相关的能力。我们提出YFPO,即耦合特征偏好优化(Yoked Feature Preference Optimization),这是一个初步的神经元引导偏好优化框架,用于数学推理。YFPO首先使用AttnLRP识别与数学相关的神经元,然后根据它们在偏好和不偏好响应之间的激活边际构建辅助奖励。该设计通过内部神经元级信号增强了外部偏好学习。我们在一个小规模语言模型上进行初步实验,以GSM8K作为主要基准。结果表明,神经元级信号可以与偏好优化相互作用,并偶尔提高推理性能,为更细粒度和可解释的推理导向后训练提供了一个有前景的方向。
cs.CL / 44 / 2605.11964

Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging

通过对话场景建模和意图-关键词桥接增强目标导向的主动对话系统
Li, Maodong, Li, Yancui, Kong, Fang
Abstract
A target-guided proactive dialogue system aims to steer conversations proactively toward pre-defined targets, such as designated keywords or specific topics. During guided conversations, dynamically modeling conversational scenarios and intent keywords to guide system utterance generation is beneficial; however, existing work largely overlooks this aspect, resulting in a mismatch with the dynamics of real-world conversations. In this paper, we jointly model user profiles and domain knowledge as conversational scenarios to introduce a scenario bias that dynamically influences system utterances, and employ intent-keyword bridging to predict intent keywords for upcoming dialogue turns, providing higher-level and more flexible guidance. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent-keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target-guided proactive dialogue systems, thereby narrowing the gap with real-world interactions.
Chinese Translation
目标导向的主动对话系统旨在主动引导对话朝向预定义的目标,例如指定的关键词或特定主题。在引导对话过程中,动态建模对话场景和意图关键词以指导系统发言生成是有益的;然而,现有研究在很大程度上忽视了这一方面,导致与现实世界对话的动态性不匹配。本文将用户画像和领域知识联合建模为对话场景,引入一种动态影响系统发言的场景偏置,并采用意图-关键词桥接来预测即将到来的对话轮次的意图关键词,从而提供更高层次和更灵活的引导。大量的自动评估和人工评估结果表明,对话场景建模和意图关键词桥接的有效性,显著提升了目标导向主动对话系统的主动性、流畅性和信息量,从而缩小了与现实世界交互的差距。
cs.CL / 45 / 2605.11978

On Predicting the Post-training Potential of Pre-trained LLMs

预测预训练大型语言模型的后训练潜力
Li, Xiaoyuan, Ma, Yubo, Yang, Kexin, Li, Moxin, Bao, Keqin, Wang, Wenjie, Feng, Fuli, Liu, Dayiheng
Abstract
The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting post-training potential - forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains by fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.
Chinese Translation
大型语言模型(LLMs)在下游任务上的表现基本上受到预训练期间所获得能力的限制。然而,传统基准测试如 MMLU 往往无法反映基础模型在复杂开放场景中的可塑性,从而导致模型选择效率低下。我们通过引入一种新的任务——预测后训练潜力,来解决这一问题,即在后训练之前预测基础模型的性能。我们提出了 RuDE(基于评分标准的区分性评估),这是一个统一框架,通过利用响应区分来绕过基础模型的生成差距。在我们系统的 4C 分类法的指导下,RuDE 通过细致的评分标准违规构建了跨多样领域的受控对比对。大量实验表明,RuDE 与后训练性能之间的相关性超过 90%。至关重要的是,通过强化学习(RL)进行的验证确认 RuDE 能有效识别出高潜力的小型模型,这些模型的表现优于更大型的模型,从而为基础模型的发展提供了一种计算高效的机制。
cs.CL / 46 / 2605.11993

Towards Visually-Guided Movie Subtitle Translation for Indic Languages

面向印度语言的视觉引导电影字幕翻译
Chintada, Tarun, Singh, Kshetrimayum Boynao, Ekbal, Asif
Abstract
Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Among the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.
Chinese Translation
电影字幕翻译本质上是多模态的,但仅依赖文本的系统往往忽视了传达情感、动作和社会细微差别所需的视觉线索,尤其是在资源匮乏的印度语言(英语到印地语、孟加拉语、泰卢固语、泰米尔语和卡纳达语)中。我们对五部完整电影进行了案例研究,并比较了两种轻量级的视觉定位策略:来自5分钟滑动窗口的结构化属性摘要和字幕间视觉间隙的自由文本摘要。我们的分析表明,字幕与画面之间的时间错位是长视频中的主要障碍,常常使得无差别的视觉定位效果不佳。然而,选择性视觉定位(oracle selective grounding)能够将基线片段中质量最低的20-30%替换为视觉增强输出,始终在文本仅基线之上提高了COMET,同时所需的视觉处理远少于其他方法。在这两种方法中,基于粗略属性的视觉上下文摘要更为稳健,能够捕捉场景级别的情感和文本单独往往忽视的上下文细微线索。
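The oracle selective grounding step described above replaces only the lowest-quality share of baseline segments with visually enhanced outputs. A minimal sketch of that selection logic follows. It assumes per-segment quality scores are already available (for example from a reference-based metric, which is what makes the selection an oracle); the function name, the toy segments, and the scores are all hypothetical.

```python
def selective_grounding(baseline, visual, baseline_scores, fraction=0.2):
    """Swap in the visual-enhanced translation only for the `fraction` of
    segments whose baseline score is lowest; keep the baseline elsewhere."""
    n_replace = int(len(baseline) * fraction)
    # Indices of the worst-scoring baseline segments.
    by_score = sorted(range(len(baseline)), key=lambda i: baseline_scores[i])
    worst = set(by_score[:n_replace])
    return [visual[i] if i in worst else baseline[i] for i in range(len(baseline))]

# Toy data: five subtitle segments with made-up quality scores.
baseline = ["b0", "b1", "b2", "b3", "b4"]
visual = ["v0", "v1", "v2", "v3", "v4"]
scores = [0.8, 0.3, 0.9, 0.7, 0.6]
mixed = selective_grounding(baseline, visual, scores, fraction=0.2)
```

Because only the selected segments need visual context, the visual model would run on a small part of the film rather than on every subtitle.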
cs.CL / 47 / 2605.12004

Learning Agentic Policy from Action Guidance

从行动指导中学习智能体策略
Ji, Yuxiang, Wang, Zengbin, Wang, Yong, Yang, Shidong, Ma, Ziyu, Chen, Guanhua, Sun, Zonghua, Wu, Liaoni, Chu, Xiangxiang
Abstract
Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.
Chinese Translation
面向大型语言模型(LLMs)的智能体强化学习(RL)在很大程度上依赖于基础策略的探索能力,因为训练信号仅在其能力范围之内产生。对于基础策略无法到达奖励状态的任务,需要额外的训练或外部指导以恢复有效的学习信号。我们不依赖代价高昂的迭代式监督微调(SFT),而是利用日常人类交互中产生的大量行动数据。我们提出了 ActGuide-RL,该方法将行动数据作为计划式参考指导注入,使智能体策略能够克服到达奖励状态的可达性障碍。随后,通过混合策略训练联合优化有指导和无指导的轨迹采样(rollout),将探索收益内化回无指导策略。基于对收益-风险权衡的理论和实证分析,我们采用最小干预原则,仅将指导作为自适应的后备手段,以匹配任务难度,同时最小化离策略(off-policy)风险。在搜索智能体基准测试中,ActGuide-RL 相较于零 RL 显著提升(使用 Qwen3-4B 时,在 GAIA 上提高 10.7 个百分点,在 XBench 上提高 19 个百分点),并且在没有任何冷启动的情况下与 SFT+RL 流水线表现相当。这表明了一种新的智能体强化学习范式,通过使用可扩展的行动指导,减少对大量 SFT 数据的依赖。
cs.CL / 48 / 2605.12022

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

SAGE:面向LLM知识评估的可扩展自动化鲁棒性增强
Li, Xiaoyuan, Wang, Yuzhe, Li, Moxin, Bao, Keqin, Men, Rui, Zhang, Yichang, Liu, Dayiheng, Wang, Wenjie, Feng, Fuli
Abstract
Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.
Chinese Translation
大型语言模型(LLMs)在标准知识评估基准上表现出色,但最近的研究表明,它们的知识能力在测试相同知识但形式不同的问题变体下依然脆弱。因此,现有知识评估基准的鲁棒性增强是必要的,但当前的LLM辅助生成-验证流程由于变体生成的低产出和变体验证的不可靠性,成本高且难以扩展。我们提出了SAGE(可扩展自动生成鲁棒性基准),这是一个使用微调的小型模型进行知识评估基准的可扩展鲁棒性增强框架。SAGE包括VariantQual,一个基于评分标准的验证器,经过人类标注的种子数据训练而成,以及VariantGen,一个经过监督微调初始化的变体生成器,并通过使用VariantQual作为奖励模型的强化学习进一步优化。在HellaSwag上的实验表明,SAGE构建了一个大规模的鲁棒性增强基准,其质量可与人类标注的HellaSwag-Pro相媲美,且成本显著降低,同时微调后的模型在没有基准特定微调的情况下进一步推广到MMLU。
cs.CL / 49 / 2605.12028

Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking

Caraman在SemEval-2026任务8中的表现:基于查询重写、混合搜索和交叉编码重排序的三阶段多轮检索
Caraman, David-Maximilian, Silaghi, Gheorghe Cosmin
Abstract
We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
Chinese Translation
我们描述了参与SemEval-2026任务8(MTRAGEval)中任务A(检索)的系统,该系统涵盖四个英语领域。我们的方法采用三阶段管道:(1)通过LoRA微调的Qwen 2.5 7B模型进行查询重写,将上下文相关的后续问题转化为独立查询;(2)通过互惠排名融合(Reciprocal Rank Fusion)结合混合BM25和密集检索;(3)使用BGE-reranker-v2-m3进行交叉编码重排序。在官方测试集上,该系统实现了nDCG@5为0.531,在38个参与系统中排名第8,超出组织者基线10.7%。开发比较表明,针对查询生成的领域特定温度调优,其中技术领域受益于确定性解码而一般领域受益于受控随机性,提供了一致的增益,而更复杂的策略如领域感知提示和多查询扩展则会降低性能。
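Stage (2) of the pipeline above merges BM25 and dense rankings with Reciprocal Rank Fusion. The sketch below shows the standard RRF scoring rule under stated assumptions: the constant `k = 60` is a commonly used default rather than a value reported in the abstract, and the document ids are invented for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.
    Each document scores sum(1 / (k + rank)) over the lists that contain it;
    k = 60 is an assumed default, not the system's reported setting."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 lists from a sparse and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it needs no score normalization between BM25 and the dense retriever; the fused list would then go to the cross-encoder reranker.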
cs.CL / 50 / 2605.12039

SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

SkillGraph:通过演化技能图增强智能体的技能强化学习
Li, Xiaoyuan, Li, Moxin, Bao, Keqin, Ma, Yubo, Wang, Wenjie, Liu, Dayiheng, Feng, Fuli
Abstract
Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
Chinese Translation
技能库使大型语言模型智能体能够重用过去交互的经验,但现有的大多数库将技能存储为孤立条目,仅通过语义相似性进行检索。这给组合任务带来了两个关键挑战。首先,智能体不仅必须识别相关技能,还必须了解这些技能之间的依赖关系和相互构建的方式。其次,这也使得库的维护变得困难,因为系统缺乏结构性线索来决定何时合并、拆分或删除技能。我们提出了SKILLGRAPH,一个将可重用技能表示为有向图中节点的框架,带有编码前提、增强和共现关系的类型化边缘。在给定新任务时,SKILLGRAPH不仅检索单个技能,还检索一个有序的技能子图,以指导多步骤决策。该图从智能体轨迹和强化学习反馈中持续更新,使得技能库和智能体策略能够共同改进。在ALFWorld、WebShop和七个搜索增强的问答任务上的实验表明,SKILLGRAPH在与记忆增强的强化学习方法相比时,达到了最先进的性能,尤其在需要组合多个技能的复杂任务上取得了显著提升。
cs.CL / 51 / 2605.12047

Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition

儿童导向语言是否优化了词汇学习?动词意义习得的计算研究
Padovani, Francesca, Jumelet, Jaap, Matusevych, Yevgen, Bisazza, Arianna
Abstract
Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed language (ADL). We selectively remove syntactic or lexical co-occurrence information from the model training data, and evaluate the impact of these manipulations on verb meaning acquisition. While disrupting syntax impairs learning across all datasets, models trained on CDL and spoken ADL show significantly higher resilience than those trained on written input. Tracking semantic and syntactic performance over training, we observe a semantic-first trajectory, with verb meanings emerging prior to robust syntactic proficiency, an asynchrony most pronounced in the spoken domain, especially CDL. These results suggest that the advantage for verb learning previously attributed to CDL may instead reflect broader properties of the spoken register, rather than a uniquely CDL-specific optimization.
Chinese Translation
儿童导向语言(CDL)是否经过优化以支持语言学习,以及它促进了语言发展的哪些方面?我们通过对比训练于儿童导向语言与成人导向语言(ADL)的神经语言模型来探讨这个问题。我们有选择性地从模型训练数据中移除句法或词汇共现信息,并评估这些操作对动词意义习得的影响。虽然破坏句法会在所有数据集中削弱学习,但训练于儿童导向语言和口语成人导向语言的模型显示出显著更高的韧性,相较于训练于书面输入的模型。在训练过程中跟踪语义和句法表现,我们观察到一种语义优先的轨迹,动词意义在稳健的句法熟练度之前出现,这种不同步在口语领域中尤为明显,尤其是在儿童导向语言中。这些结果表明,之前归因于儿童导向语言的动词学习优势可能反映了口语体的更广泛特性,而非独特的儿童导向语言优化。
cs.CL / 52 / 2605.12055

Do Language Models Encode Knowledge of Linguistic Constraint Violations?

语言模型是否编码了对语言约束违反的知识?
Hardy, Padó, Sebastian
Abstract
Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the falsification criteria are not jointly satisfied across linguistic phenomena, and (2) no features are consistently shared across all categories. While some phenomena show partial evidence of selective causal structure, the overall pattern provides limited support for a unified set of grammatical violation detectors in current LMs.
Chinese Translation
大型语言模型(LLMs)在语言表现上取得了显著的成绩,但它们生成这些预测的内部机制仍不清楚。我们探讨了LLMs在其参数中编码语言约束违反的表征的假设,这些表征在处理不合语法的句子时被选择性激活。为了验证这一假设,我们使用稀疏自编码器将多义激活分解为稀疏的单义特征,并恢复与违反相关特征的候选项。我们引入了一种敏感度评分,用于识别在约束违反输入与符合语法的输入上优先激活的特征,从而实现潜在违反特征的无监督检测。此外,我们进一步提出了一种结合性反驳框架,评估三个标准的联合满足情况。总体而言,结果在两个方面呈现负面: (1) 反驳标准在语言现象中未能共同满足,(2) 没有特征在所有类别中一致共享。尽管某些现象显示出选择性因果结构的部分证据,但整体模式对当前语言模型中统一的语法违反探测器的支持有限。
cs.CL / 53 / 2605.12096

Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward

低资源语言的手语识别与翻译:挑战与前进路径
Alishzade, Nigar, Abdullayeva, Gulchin
Abstract
Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review synthesizes literature on sign language recognition and translation for under-resourced languages, using Azerbaijan Sign Language (AzSL) as a case study. Analysis of global initiatives extracts eight actionable lessons, including community co-design, dialectal diversity capture, and privacy-preserving pose-based representations. Turkic sign languages (Kazakh, Turkish, Azerbaijani) receive special attention, as linguistic proximity enables effective transfer learning. We propose three paradigm shifts: from architecture-centric to data-centric AI, from signer-independent to signer-adaptive systems, and from reference-based to task-specific evaluation metrics. A technical roadmap for AzSL leverages lightweight MediaPipe-based architectures, community-validated annotations, and offline-first deployment. Progress requires sustained interdisciplinary collaboration centered on Deaf communities to ensure cultural authenticity, ethical governance, and practical communication benefit.
Chinese Translation
手语是全球聋人社区使用的自然视觉-手势语言。超过300种不同的手语由于文献记录有限、数据集稀疏和计算工具不足而严重缺乏资源。本系统评审综合了关于低资源语言的手语识别与翻译的文献,以阿塞拜疆手语(Azerbaijan Sign Language, AzSL)为案例进行研究。对全球倡议的分析提炼出八个可行的经验教训,包括社区共同设计、方言多样性捕捉和隐私保护的基于姿势的表示。突厥语系手语(哈萨克语、土耳其语、阿塞拜疆语)受到特别关注,因为语言的接近性使得有效的迁移学习成为可能。我们提出三种范式转变:从以架构为中心到以数据为中心的人工智能,从独立于手语者到适应手语者的系统,以及从基于参考的评估指标到基于任务的评估指标。针对AzSL的技术路线图利用轻量级的基于MediaPipe的架构、社区验证的注释和优先离线部署。进展需要持续的跨学科合作,聚焦于聋人社区,以确保文化的真实性、伦理治理和实际的沟通效益。
cs.CL / 54 / 2605.12128

Metaphor Is Not All Attention Needs

隐喻并不是注意力所需的一切
Sorokoletova, Olga, Giarrusso, Francesco, De Luca, Giacomo, Bisconti, Piercosma, Prandi, Matteo, Pierucci, Federico, Galisai, Marcello, Suriani, Vincenzo, Nardi, Daniele
Abstract
Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
Chinese Translation
大型语言模型越来越多地应用于安全关键的场景,其中抵御有害指令的能力至关重要。尽管后训练旨在使模型对多种越狱策略具有鲁棒性,但最近的证据表明,诸如诗意转化等风格性重构仍然能够以惊人的有效性绕过安全机制。这引发了一个核心问题:为什么文学越狱会成功?在本研究中,我们探讨了其有效性是否依赖于特定的诗歌手法、对文学格式的识别失败,或是模型处理风格不规则提示的深层变化。我们通过对注意力模式的可解释性分析来解决这一问题。我们进行输入级消融研究,以评估单个和组合诗歌手法的贡献;构建可解释的注意力图向量表示;对这些表示进行聚类,并训练线性探针以预测安全结果和文学格式。我们的结果表明,模型能够以高准确率区分诗歌格式和散文格式,但在每种格式内预测越狱成功方面却存在困难。聚类进一步揭示了文学格式之间的明显分离,但与安全标签之间并无明显区分。这些发现表明,越狱成功并不是由于未能识别诗歌格式;相反,诗意提示诱导了独特的处理模式,这些模式在很大程度上独立于有害内容检测。总体而言,文学越狱似乎并不是通过任何单一的诗歌手法使大型语言模型失调,而是通过累积的风格不规则性改变了提示处理,避免了在后训练期间考虑的词汇触发。这表明,鲁棒性需要考虑风格引起的模型行为变化的安全机制。我们以 Qwen3-14B 作为代表性的开放权重案例研究。
cs.CL / 55 / 2605.12156

Latent Causal Void: Explicit Missing-Context Reconstruction for Misinformation Detection

潜在因果空缺:用于虚假信息检测的显式缺失上下文重建
Li, Hui, Jian, Zhongquan, Su, Jinsong, Yao, Junfeng
Abstract
Automatic misinformation detection performs well when deception is visible in what an article explicitly states. However, some misinformation articles remain locally coherent and only become misleading once compared with contemporaneous reports that supply background facts the article omits. We study this omission-relevant setting and observe that current omission-aware approaches typically either attach retrieved context as auxiliary evidence or infer a categorical omission signal, leaving the specific missing fact implicit. We propose Latent Causal Void (LCV), a retrieval-guided detector that explicitly reconstructs the missing fact for each target sentence and uses it as a textual cross-source relation in graph reasoning. Concretely, LCV retrieves temporally aligned context articles, asks a frozen instruction-tuned large language model to generate a short missing-context description for each sentence-article pair, and feeds the resulting relation text into a heterograph over target sentences and context articles. On the bilingual benchmark of Sheng et al., LCV improves over the strongest omission-aware baseline by 2.56 and 2.84 macro-F1 points on the English and Chinese splits, respectively. The results indicate that modeling the missing cross-source fact itself, rather than only attaching retrieved evidence or predicting an omission signal, is a useful representation for omission-aware misinformation detection.
Chinese Translation
自动虚假信息检测在文章明确陈述的欺骗内容可见时表现良好。然而,一些虚假信息文章在局部上保持一致性,只有在与提供背景事实的同时报道进行比较时才会变得误导。我们研究了这一与遗漏相关的情境,并观察到当前的遗漏感知方法通常要么将检索到的上下文作为辅助证据附加,要么推断出一个类别遗漏信号,而将具体缺失的事实隐含化。我们提出了潜在因果空缺(Latent Causal Void,LCV),这是一种检索引导的检测器,显式重建每个目标句子的缺失事实,并将其作为图推理中的文本跨源关系。具体而言,LCV检索时间上对齐的上下文文章,要求一个冻结的指令调优的大型语言模型为每个句子-文章对生成一个简短的缺失上下文描述,并将生成的关系文本输入到一个异构图中,该图包含目标句子和上下文文章。在Sheng等人的双语基准测试中,LCV在英语和中文分割上分别比最强的遗漏感知基线提高了2.56和2.84的宏F1分数。结果表明,建模缺失的跨源事实本身,而不仅仅是附加检索到的证据或预测遗漏信号,是一种对遗漏感知虚假信息检测有用的表示。
cs.CL / 56 / 2605.12177

Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

纠正稀疏用户反馈中的选择偏差以估计大型语言模型质量:一种多智能体层次贝叶斯方法
Morandi, Andrea, Viswanathan, Mahesh
Abstract
[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
Chinese Translation
[简要] 生产环境中的大型语言模型(LLM)部署只能收到非随机的一部分用户的反馈:点赞与点踩大多落在满意度分布的尾部,对其直接取平均可能与真实系统质量相差40-50个百分点。我们将其视为一个按主题和情感分层的选择偏差问题,并提出一种三智能体层次贝叶斯管道,该管道不需要单个交互的真实标签。主题聚类智能体通过对文本嵌入进行UMAP + HDBSCAN来划分数据流;偏差建模智能体在NUTS下拟合两阶段层次Beta-二项模型,以部分汇聚的方式推断每个主题的选择率 $s_c$ 和质量 $q_c$;合成智能体按真实主题占比 $\hat\pi_c = n_c/N$ 对 $q_c$ 重新加权,报告带可信区间的偏差修正聚合后验 $\bar Q = \sum_c \hat\pi_c q_c$,并提供用于在线重新校准的漂移信号。验证使用UltraFeedback(N=10,232个保留交互,$C=18$个聚类,$Q^\star=0.6249$)并模拟依赖于主题和情感的选择偏差。我们将五种贝叶斯变体与Naive和IPW基线进行比较。对反馈通道施加温和先验(典型的正反馈率和负正反馈比,二者均可从任何生产仪表板中读取而无需标签),可使Hierarchical-Informed在偏差比从1:1到30:1变化时始终保持在 $Q^\star$ 的4-13个百分点以内,并且在 $\kappa_{\max}=10$ 时,95%可信区间在50次随机种子重复实验中全部(50/50)覆盖 $Q^\star$。在没有通道侧先验的情况下,每个弱先验变体都偏离 $Q^\star$ 22-33个百分点:每个聚类的充分统计量允许一个拟合效果同样好的单参数族,而打破这种简并性的正是偏差通道上的先验(而非潜在质量上的先验)。
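As a rough illustration of the reweighting step described in this abstract, the Python sketch below computes $\bar Q = \sum_c \hat\pi_c q_c$ with $\hat\pi_c = n_c/N$ and contrasts it with a naive pooled average over the feedback that happened to be submitted. The function names and the toy numbers are hypothetical and are not taken from the paper; in the described pipeline the $q_c$ values would come from the hierarchical posterior rather than being supplied by hand.

```python
def naive_estimate(thumbs_up, thumbs_total):
    """Pooled share of positive feedback, ignoring who chose to respond."""
    return sum(thumbs_up) / sum(thumbs_total)


def prevalence_weighted_quality(cluster_sizes, cluster_quality):
    """Q_bar = sum_c pi_hat_c * q_c, with pi_hat_c = n_c / N."""
    total = sum(cluster_sizes)
    return sum(n / total * q for n, q in zip(cluster_sizes, cluster_quality))


# Hypothetical toy data: a large, mostly satisfied topic and a small,
# dissatisfied topic whose users leave feedback far more often.
cluster_sizes = [900, 100]      # n_c: all interactions per topic
cluster_quality = [0.8, 0.2]    # q_c: assumed per-topic quality estimates
thumbs_up = [16, 10]            # positive thumbs actually received
thumbs_total = [20, 50]         # all thumbs actually received

print(naive_estimate(thumbs_up, thumbs_total))
print(prevalence_weighted_quality(cluster_sizes, cluster_quality))
```

In this toy case the naive average is pulled toward the over-represented small topic, while the prevalence-weighted estimate follows the topic mix of the full interaction stream.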
cs.CL / 57 / 2605.12185

Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding

通过动态认知调和解码缓解大型语言模型中的上下文-记忆冲突
Zhou, Yigeng, Li, Wu, Lu, Yifan, Wang, Yequan, Liu, Xuebo, Wang, Wenya, Yu, Jun, Zhang, Min, Li, Jing
Abstract
Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing methods address knowledge conflicts through contrastive decoding, but in conflict-free scenarios, static approaches disrupt output distribution. Other dynamic decoding methods attempt to measure the degree of conflict but still struggle with complex real-world situations. In this paper, we propose a two-stage decoding method called Dynamic Cognitive Reconciliation Decoding (DCRD), to predict and mitigate context-memory conflicts. DCRD first analyzes the attention map to assess context fidelity and predict potential conflicts. Based on this prediction, the input is directed to one of two decoding paths: (1) greedy decoding, or (2) context fidelity-based dynamic decoding. This design enables DCRD to handle conflicts efficiently while maintaining high accuracy and decoding efficiency in conflict-free cases. Additionally, to simulate scenarios with frequent knowledge updates, we constructed ConflictKG, a knowledge conflict QA benchmark. Experiments on four LLMs across six QA datasets show that DCRD outperforms all baselines, achieving state-of-the-art performance.
Chinese Translation
大型语言模型通过预训练积累了广泛的参数知识。然而,当过时或不正确的参数知识与上下文中的外部知识发生冲突时,知识冲突就会出现。现有方法通过对比解码来解决知识冲突,但在无冲突的场景中,静态方法会干扰输出分布。其他动态解码方法试图测量冲突的程度,但在复杂的现实世界情境中仍然面临挑战。本文提出了一种名为动态认知调和解码(Dynamic Cognitive Reconciliation Decoding, DCRD)的两阶段解码方法,以预测和缓解上下文-记忆冲突。DCRD首先分析注意力图以评估上下文的保真度并预测潜在冲突。基于这一预测,输入被引导到两个解码路径之一:(1)贪婪解码,或(2)基于上下文保真度的动态解码。该设计使DCRD能够高效处理冲突,同时在无冲突情况下保持高准确性和解码效率。此外,为了模拟频繁知识更新的场景,我们构建了ConflictKG,一个知识冲突问答基准。在六个问答数据集上对四个大型语言模型的实验表明,DCRD在所有基线之上表现出色,达到了最先进的性能。
cs.CL / 58 / 2605.12225

Mechanistic Interpretability of ASR models using Sparse Autoencoders

使用稀疏自编码器对自动语音识别模型的机制可解释性研究
Pluth, Dan, Houghton, Zachary Nicholas, Zhou, Yu, Gurbani, Vijay K.
Abstract
Understanding the internal workings of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in various domains that affect the public at large, such as industry, academia, finance, and health. While these models have advanced rapidly, their internal mechanisms remain largely a mystery. Techniques such as Sparse Autoencoders (SAE) have emerged to understand these mechanisms by projecting dense representations into a sparse vector. While existing research has demonstrated the viability of the SAE in interpreting text-based Large Language Models (LLMs), there are no equivalent studies that demonstrate the application of an SAE to audio processing models like Automatic Speech Recognizers (ASRs). In this work, an SAE is applied to Whisper, a Transformer-based ASR, training a high-dimensional sparse latent space on frame-level embeddings extracted from the Whisper encoder. Our work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries, and demonstrates cross-lingual feature steering. This work establishes the viability of an SAE model and demonstrates that Whisper encodes a rich amount of linguistic information.
Chinese Translation
随着基于深度Transformer的自然语言处理模型在工业、学术、金融和健康等多个领域的广泛应用,理解这些模型的内部机制比以往任何时候都更加重要。然而,尽管这些模型发展迅速,其内部机制仍然在很大程度上是个谜。稀疏自编码器(Sparse Autoencoders, SAE)等技术应运而生,通过将密集表示投影到稀疏向量中,以理解这些机制。虽然现有研究已证明SAE在解释基于文本的大型语言模型(Large Language Models, LLMs)方面的可行性,但尚无相应的研究展示SAE在自动语音识别模型(Automatic Speech Recognizers, ASRs)中的应用。在本研究中,我们将SAE应用于Whisper,一个基于Transformer的ASR,训练一个高维稀疏潜在空间,利用从Whisper编码器提取的帧级嵌入。我们的研究揭示了语言和非语言边界之间多样的单义特征,并展示了跨语言特征引导的能力。本研究确立了SAE模型的可行性,并表明Whisper编码了丰富的语言信息。
cs.CL / 59 / 2605.12227

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

结合策略优化与蒸馏以实现大型语言模型的长上下文推理
Ramos, Miguel Moura, Alves, Duarte M., Martins, André F. T.
Abstract
Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
Chinese Translation
将大型语言模型(LLMs)适应于长上下文任务需要后训练方法,这些方法在数千个标记上保持准确性和连贯性。现有方法存在多方面的局限性:1)离策略方法如监督微调(SFT)和知识蒸馏(KD)受到曝光偏差的影响,并且在长时间范围内对模型生成错误的恢复能力有限;2)在线策略强化学习方法如群体相对策略优化(GRPO)更好地将训练与模型生成的状态对齐,但由于稀疏奖励而不稳定且样本效率低;3)在线蒸馏(OPD)提供密集的标记级指导,但并未直接优化任意奖励信号。在本文中,我们提出了蒸馏群体相对策略优化(dGRPO),这是一种增强GRPO的长上下文推理方法,通过OPD从更强的教师模型获取密集指导。我们还引入了LongBlocks,这是一个合成的长上下文数据集,涵盖多跳推理、上下文基础和长文本生成。我们进行了广泛的实验和消融研究,比较了离策略训练、稀疏奖励GRPO和我们结合的方法,从而提出了一种改进的长上下文对齐方案。总体而言,我们的结果表明,将基于结果的策略优化与知识蒸馏结合在单一目标中,提供了一条更稳定和有效的长上下文推理路径,同时保留了短上下文的能力。
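The GRPO component mentioned in this abstract relies on group-relative advantages: several responses are sampled for the same prompt and each reward is standardized against its own group. The sketch below shows only that normalization, under the assumption of a population standard deviation and a small stabilizing constant; the function name is hypothetical, and how dGRPO mixes in the teacher's token-level distillation signal is not specified in the abstract, so it is not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a sparse 0/1 outcome reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every sample receives the same reward, all advantages are zero, which is
# one way to see why sparse rewards give little learning signal on their own.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```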
cs.CL / 60 / 2605.12242

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

注意停顿:基于不流畅性意识的多语言语音纠正的目标调优与大型语言模型
Kumar, Deepak, Gain, Baban, Ekbal, Asif
Abstract
Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature has explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi, show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. We make the code publicly available at https://github.com/deepak-kumar-98/Mind-the-Pause.
Chinese Translation
自动语音识别(ASR)转录文本通常包含不流畅性,如填充词、重复和错误开头,这降低了可读性并妨碍了聊天机器人和语音助手等下游应用。如果不加以解决,这些不流畅性可能会显著降低下游系统的可靠性。现有大多数方法依赖于经典模型,侧重于识别不流畅的标记以进行删除。虽然这种策略在一定程度上有效,但往往会破坏语法结构和语义连贯性,导致句子不完整或不自然。近期文献探讨了使用大型语言模型(LLMs);然而,这些努力主要集中在不流畅性检测或数据增强上,而不是进行全面的纠正。我们提出了一种多语言纠正管道,其中序列标记器首先标记不流畅的标记,这些信号指导LLM的指令微调,以将转录文本重写为流畅的文本。为了进一步提高可靠性,我们增加了一个对比学习目标,惩罚不流畅标记的再现,鼓励模型在移除不流畅的伪影时保持语法和意义。我们在三种印度语言(即印地语、孟加拉语和马拉地语)上的实验显示出相对于强基线(包括多语言序列到序列模型)的持续改进。这些结果强调仅依赖检测的策略是不够的。将标记级线索与指令调优和对比学习相结合,为语音驱动的自然语言处理系统中的多语言不流畅性纠正提供了一个实用且可扩展的解决方案。我们已将代码公开发布在 https://github.com/deepak-kumar-98/Mind-the-Pause。
cs.CL / 61 / 2605.12243

PreScam: A Benchmark for Predicting Scam Progression from Early Conversations

PreScam:一个用于预测早期对话中诈骗进展的基准
Sun, Weixiang, Ma, Shang, Li, Yiyang, Ma, Tianyi, Wang, Zehong, Nelson, Colby, Xiao, Xusheng, Ye, Yanfang
Abstract
Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer's subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.
Chinese Translation
对话诈骗,如恋爱诈骗和投资诈骗,正成为一种主要的在线欺诈形式。与假彩票或未支付通行费信息等一次性诈骗诱饵不同,这些诈骗通过多轮对话展开,诈骗者逐渐利用不断演变的心理技巧操控受害者。然而,现有研究主要集中在静态诈骗检测或合成诈骗上,尚未探讨语言模型是否能够理解现实世界中诈骗如何随时间进展。我们引入了PreScam,这是一个用于建模早期对话中诈骗进展的基准。PreScam基于用户提交的诈骗报告,过滤并结构化了177,989份原始报告,形成了涵盖20个诈骗类别的11,573个对话诈骗实例。每个实例根据所提出的诈骗杀链定义的诈骗生命周期进行层次结构化,并在每轮对话中进一步注释了诈骗者的心理行为和受害者的反应。我们在两个任务上对模型进行基准测试:实时终止预测,该任务估计对话是否接近终止阶段,以及诈骗者行为预测,该任务预测诈骗者的后续行为。结果显示,表面流畅性与进展建模之间存在明显差距:在实时终止预测中,监督编码器的表现显著优于零-shot LLM,而即使对于强大的LLM,下一步行动预测的成功率也仅为中等。综合来看,这些结果表明,当前模型能够捕捉一些与诈骗相关的线索,但仍然难以追踪风险如何升级以及操控如何在对话轮次中展开。
cs.CL / 62 / 2605.12260

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

PRISM:基于意图感知的结构化记忆进行帕累托有效检索的长时间跨度代理
Peng, Jingyi, Wan, Zhongwei, Liu, Weiting, Sun, Qiuzhuang
Abstract
Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
Chinese Translation
长时间跨度的语言代理在积累对话历史方面的速度远超任何固定上下文窗口的承载能力,因此内存管理对于回答准确性和服务成本至关重要。现有方法要么扩展上下文窗口而不解决检索内容,要么在显著的标记成本下进行重负载的摄取时间事实提取,或者依赖启发式图遍历,这在准确性和效率上都未能达到最佳。我们提出了PRISM,这是一种无训练的检索侧框架,将长时间跨度的记忆视为在图结构记忆上的联合检索与压缩问题。PRISM结合了四个正交的推理时间组件:基于类型关系路径的层次捆绑搜索、与检测到的查询意图对齐的查询敏感边缘成本计算、将候选捆绑压缩为紧凑答案侧上下文的证据压缩,以及通过零LLM层路由大多数查询的自适应意图路由。通过将检索形式化为基于类型路径模板的最小成本选择,并与LLM侧的压缩步骤配对,PRISM在严格的上下文预算下呈现出正确的证据,而无需对上游摄取管道进行任何微调或修改。在LoCoMo基准上的实验表明,PRISM在显著较小的上下文预算下提供了远高于每个相同协议基线的LLM评判准确性,填补了准确性-上下文-成本边界的一个空白区域,展示了答案质量与检索效率之间的优越平衡。
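To make the idea of intent-dependent edge costs over a typed memory graph more concrete, here is a generic shortest-path sketch using Dijkstra's algorithm. It is not PRISM's algorithm: the relation names, the cost table, the intent labels, and the graph are all hypothetical, and the abstract does not state how costs or path templates are actually defined.

```python
import heapq

# Hypothetical cost table: relation types that fit the query intent are cheaper.
INTENT_COSTS = {
    "temporal": {"happened_before": 0.2, "mentions": 1.0},
    "entity": {"happened_before": 1.0, "mentions": 0.2},
}


def cheapest_paths(graph, start, intent):
    """Dijkstra over (relation, neighbor) edges, priced by the query intent."""
    costs = INTENT_COSTS[intent]
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for relation, neighbor in graph.get(node, []):
            new_dist = dist + costs[relation]
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))
    return best


# Toy memory graph: node "q" is the query anchor, "c" is a candidate memory.
graph = {
    "q": [("mentions", "a"), ("happened_before", "b")],
    "a": [("mentions", "c")],
    "b": [("happened_before", "c")],
}
print(cheapest_paths(graph, "q", "temporal"))
print(cheapest_paths(graph, "q", "entity"))
```

The same memory node is reached through different relation paths depending on the detected intent, which is the behavior the cost table is meant to illustrate.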
cs.CL / 63 / 2605.12281

What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty

是什么使一个词难以学习?建模母语对英语词汇难度的影响
Martins, Jonas Mayer, Huang, Zhuojing, Herygers, Aaricia, Beinborn, Lisa
Abstract
What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
Chinese Translation
是什么使一个词难以学习?这种难度又如何依赖于学习者的母语?我们通过计算建模了以西班牙语、德语或汉语为母语的英语学习者的词汇难度,采用了基于梯度提升模型的特征,这些特征与词汇的熟悉度(例如,频率)、意义、表面形式以及跨语言转移相关。通过使用Shapley值,我们确定了每个特征组的重要性。词汇熟悉度是所有三种语言共享的主导特征组。然而,西班牙语和德语学习者的预测还额外依赖于正字法转移。这个转移机制对汉语学习者来说是不可用的,他们的难度仅由熟悉度和表面特征的组合所决定。我们的模型提供了可解释的、针对母语定制的难度估计,可用于设计词汇课程。
cs.CL / 64 / 2605.12288

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

TokenRatio:通过比率匹配实现原则性的令牌级偏好优化
Nguyen, Truong, Nguyen, Tien-Phat, Van, Linh Ngo, Nguyen, Duy Minh Ho, Doan, Khoa, Le, Trung
Abstract
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
Chinese Translation
直接偏好优化(Direct Preference Optimization, DPO)是一种广泛使用的无强化学习方法,用于根据成对偏好对语言模型进行对齐,但它在全序列上建模偏好,而生成过程则是由每个令牌的决策驱动。现有的令牌级扩展通常在时间步上分解序列级的Bradley-Terry目标,导致每个前缀(状态级)的最优性隐含。我们研究如何仅使用标准序列级成对比较来恢复令牌级偏好最优性。我们提出了令牌级Bregman偏好优化(Token-level Bregman Preference Optimization, TBPO),该方法假设在前缀条件下对下一个令牌动作建立令牌级Bradley-Terry偏好模型,并推导出一种Bregman散度密度比匹配目标,该目标在保留由令牌级模型诱导的最优策略的同时,推广了逻辑回归/DPO损失,并保持了DPO式的简单性。我们介绍了两种实例:TBPO-Q,它显式学习一个轻量级的状态基线,以及TBPO-A,它通过优势归一化去除了基线。在指令遵循、有效性/无害性和摘要基准测试中,TBPO提高了对齐质量和训练稳定性,并相较于强大的序列级和令牌级基线增加了输出多样性。
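For reference, the sequence-level logistic loss of standard DPO, which this abstract says TBPO generalizes, can be sketched in a few lines of Python. This is the well-known DPO objective rather than the TBPO objective itself; the function names and the default `beta` value are illustrative choices.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a full response as the sum of its per-token log-probs."""
    return sum(token_logprobs)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]) for one pair.

    `w` is the preferred response and `l` the rejected one; `ref_*` are
    log-probabilities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy that has shifted mass toward the preferred response gets a lower
# loss than one that has shifted mass toward the rejected response.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
print(dpo_loss(-6.0, -4.0, -5.0, -5.0))
```

Because the loss only sees summed sequence log-probabilities, it places no explicit constraint on individual token decisions, which is the gap the token-level formulation in the abstract targets.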
cs.CL / 65 / 2605.12299

GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

GKnow:测量性别偏见与事实性别的纠缠
Veloso, Leonor, Schütze, Hinrich
Abstract
Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very specific gender-related task, such as gendered pronoun prediction, or (ii) fail to distinguish between the production of factually gendered outputs (the correct assumption of gender given a word that carries gender as a semantic property) and gender biased outputs (based on a stereotype). To address these issues, we curate GKnow, a benchmark to assess gender knowledge and gender bias in language models across different types of gender-related predictions. GKnow allows us to identify and analyze circuits and individual neurons responsible for gendered predictions. We test the impact of neuron ablation on benchmarks for disentangling stereotypical and factual gender (DiFair and the test set of GKnow), as well as StereoSet. Results show that gender bias and factual gender are severely entangled on the level of both circuits and neurons, entailing that ablation is an unreliable debiasing method. Furthermore, we show that benchmarks for evaluating gender bias can hide the decrease in factual gender knowledge that accompanies neuron ablation. We curate GKnow as a contribution to the continuous development of robust gender bias benchmarks.
Chinese Translation
近期的研究分析了神经网络各个组件对性别化预测的影响,通常集中在减轻性别偏见上。然而,性别的机制性解释往往 (i) 专注于非常特定的性别相关任务,例如性别代词预测,或 (ii) 未能区分事实性别输出的生成(在给定一个具有性别作为语义属性的词时正确假设性别)与基于刻板印象的性别偏见输出。为了解决这些问题,我们构建了 GKnow,一个基准测试,用于评估语言模型在不同类型的性别相关预测中的性别知识和性别偏见。GKnow 使我们能够识别和分析负责性别化预测的电路和单个神经元。我们测试了神经元切除对区分刻板印象性别和事实性别的基准(DiFair 和 GKnow 的测试集)以及 StereoSet 的影响。结果表明,性别偏见和事实性别在电路和神经元层面上严重纠缠,这意味着切除是一种不可靠的去偏见方法。此外,我们还表明,评估性别偏见的基准可能掩盖了伴随神经元切除而减少的事实性别知识。我们构建了 GKnow 作为对持续发展稳健性别偏见基准的贡献。
cs.CL / 66 / 2605.12313

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

BioCreative IX MedHopQA 赛道概述:赛道描述、参与情况及多跳医学问答系统的评估
Islamaj, Rezarta, Chan, Joey, Leaman, Robert, Jung, Jongmyung, Hwang, Hyeongsoon, Nguyen, Quoc-An, Le, Hoang-Quynh, Saisudha, Harikrishnan Gurushankar, Chandrasekar, Ganesh, Taktashov, Rustam R., Bizyukova, Nadezhda Yu., Conceição, Sofia I. R., Lopes, Paulo R. C., Salam, Reem Abdel, Adewunmi, Mary, Lu, Zhiyong
Abstract
Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa and benchmark https://www.codabench.org/competitions/7609/
Chinese Translation
多跳问答(QA)在生物医学领域仍然是一个重大挑战,要求系统整合来自多个来源的信息以回答复杂问题。为了解决这一问题,BioCreative IX MedHopQA 共享任务旨在为大型语言模型(LLMs)在多跳推理方面建立基准。我们开发了一个包含1000对具有挑战性的问答对的新数据集,涵盖疾病、基因和化学物质,特别强调罕见疾病。每个问题的构建都要求通过整合来自两个不同维基百科页面的信息进行两跳推理。该挑战吸引了来自13个团队的48个提交。系统的评估采用了表面字符串比较和概念准确性(MedCPT 分数)两种方法。结果显示,基线 LLMs 和增强系统之间存在显著的性能差距。排名最高的提交在 MedCPT 指标上达到了89.30%的 F1 分数和87.30%的精确匹配(EM)分数,而零样本基线的分数分别为67.40%和60.20%。挑战的一个核心发现是,检索增强生成(RAG)及相关的基于检索的策略对于强劲的性能至关重要。此外,当正确答案在表面形式上有所不同时,概念级评估改善了答案评估。MedHopQA 数据集已公开,以支持这一重要领域的持续进展。挑战材料: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa 和基准 https://www.codabench.org/competitions/7609/
cs.CL / 67 / 2605.12328

A categorical error sensitivity index (ISEC): A preventive ordinal decision-support measure for irrecoverable errors in manual data entry systems

分类错误敏感性指数 (ISEC):一种针对手动数据录入系统不可恢复错误的预防性序数决策支持措施
Palma, Ricardo Raúl, Benetti, Mauro Anibal, Varretti, Fabricio Orlando Sanchez
Abstract
Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium-sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human-machine interaction may produce errors that are irrecoverable ex post. In the absence of automated input controls, manual data entry frequently generates irrecoverable categorical distortions that propagate into Key Performance Indicators (KPIs), thereby misleading managerial decision-making. State-of-the-art normalization tools typically evaluate semantic and morphological dimensions in isolation and rely heavily on standard dictionaries, rendering them ineffective for SME master data rich in custom SKUs, abbreviations, and domain-specific technical jargon. This paper introduces the Categorical Error Sensitivity Index (ISEC), an ordinal composite score designed to rank category pairs according to their structural susceptibility to confusion. ISEC integrates semantic distance (via word embeddings), custom-weighted morphological transformation costs (through an adapted Damerau-Levenshtein algorithm), and empirical frequency into a unified, mathematically robust preventive framework. By leveraging vector database architectures, ISEC reduces computational complexity, achieving approximately a 195x performance improvement over brute-force methods. Validated across three heterogeneous datasets (governmental judicial records, retail inventory, and a synthetic ISO-coded metalworking catalog), ISEC provides a scalable and proactive data governance instrument that enables SMEs to detect latent structural risk embedded within their categorical data assets.
Chinese Translation
数据录入系统在结构上仍然容易受到分类错误的影响,尤其是在中小型企业 (SMEs) 中。当名义类别在语义或形态上接近时,人机交互可能会产生不可恢复的错误。在缺乏自动输入控制的情况下,手动数据录入常常会产生不可恢复的分类扭曲,这些扭曲会传播到关键绩效指标 (KPIs) 中,从而误导管理决策。现有的标准化工具通常孤立地评估语义和形态维度,并且过于依赖标准词典,这使得它们在拥有丰富自定义 SKU、缩写和领域特定技术术语的中小企业主数据中显得无效。本文介绍了分类错误敏感性指数 (ISEC),这是一种序数复合评分,旨在根据类别对混淆的结构敏感性对类别对进行排名。ISEC 将语义距离(通过词嵌入)、自定义加权的形态转换成本(通过改进的 Damerau Levenshtein 算法)和经验频率整合到一个统一的、数学上稳健的预防框架中。通过利用向量数据库架构,ISEC 降低了计算复杂性,性能比暴力方法提高了约 195 倍。在三个异构数据集(政府司法记录、零售库存和合成 ISO 编码金属加工目录)上进行了验证,ISEC 提供了一种可扩展的、主动的数据治理工具,使中小企业能够检测其分类数据资产中潜在的结构风险。
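The morphological component in this abstract builds on Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and adjacent transpositions. The sketch below implements the common optimal-string-alignment variant with unit costs; the paper's adapted, custom-weighted costs are not given in the abstract, so the uniform weights, the function name, and the SKU strings here are assumptions for illustration only.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Unit cost for insertion, deletion, substitution, and swapping two
    adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


# Hypothetical SKU codes: one adjacent swap separates the two categories,
# so a single slip during manual entry turns one into the other.
print(osa_distance("BOLT-M8", "BOLT-8M"))
```

Category pairs with a small distance of this kind are the ones a confusion-oriented index would be expected to rank as high risk, before semantic distance and frequency are factored in.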
cs.CL / 68 / 2605.12345

Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation

QLoRA PEFT模块的输出可组合性用于即插即用的属性控制文本生成
Lorandi, Michela, Belz, Anya
Abstract
Parameter-efficient fine-tuning (PEFT) techniques offer task-specific fine-tuning at a fraction of the cost of full fine-tuning, but require separate fine-tuning for every new task (combination). In this paper, we explore three ways of generalising beyond single-task training/inference: (i) training on combinations of multiple, related datasets; (ii) at inference, composing the weight matrices of separately trained PEFT modules; and (iii) at inference, composing the outputs of separately trained PEFT modules. We test these approaches on three different LLMs, with QLoRA as the PEFT technique, and on three sets of controlled text generation datasets for sentiment control, topic control, and multi-attribute control. We find that summing PEFT module outputs is a particularly strong composition method, which consistently either outperforms or matches the performance of alternative approaches. This is the case even when comparing against single-task specialised modules on the single-task test set, where three-module output composition achieves an average 2 percentage point performance increase across all models for sentiment control.
Chinese Translation
参数高效微调(PEFT)技术以较低的成本提供任务特定的微调,但需要针对每个新任务(组合)进行单独的微调。本文探讨了超越单任务训练/推理的三种泛化方式:(i)在多个相关数据集的组合上进行训练;(ii)在推理时组合单独训练的PEFT模块的权重矩阵;以及(iii)在推理时组合单独训练的PEFT模块的输出。我们在三种不同的大型语言模型(LLMs)上测试了这些方法,使用QLoRA作为PEFT技术,并针对情感控制、主题控制和多属性控制的三组受控文本生成数据集进行实验。我们发现,PEFT模块输出的求和是一种特别有效的组合方法,始终优于或与其他方法的性能相匹配。即使在与单任务专用模块在单任务测试集上的比较中,三模块输出组合也在所有模型的情感控制任务中实现了平均2个百分点的性能提升。
cs.CL / 69 / 2605.12361

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

MedHopQA:基于疾病的多跳推理基准与LLM生物医学问答评估框架
Islamaj, Rezarta, Leaman, Robert, Chan, Joey, Wan, Nicholas, Jin, Qiao, Xie, Natalie, Wilbur, John, Tian, Shubo, Yeganova, Lana, Lai, Po-Ting, Wei, Chih-Hsuan, Yang, Yifan, Ge, Yao, Zhu, Qingqing, Wang, Zhizheng, Lu, Zhiyong
Abstract
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
Chinese Translation
在生物医学领域评估大型语言模型(LLMs)需要能够区分推理与模式匹配的基准,并在模型能力提升时保持其区分性。现有的生物医学问答(QA)基准在这方面存在局限。多项选择格式可能使模型通过排除答案而非推理来成功,而广泛流传的考试风格数据集则越来越容易受到性能饱和和训练数据污染的影响。多跳推理被定义为跨多个来源整合信息以得出答案的能力,这对于临床上有意义的任务如诊断支持、基于文献的发现和假设生成至关重要,但在当前的生物医学QA基准中仍然表现不足。MedHopQA是一个以疾病为中心的多跳推理基准,包含1,000个由专家策划的问题-答案对,作为BioCreative IX的共享任务引入。每个问题都需要整合来自两个不同维基百科文章的信息,答案以开放式自由文本格式提供。金标准注释通过来自MONDO、NCBI Gene和NCBI Taxonomy的本体基础同义词集进行增强,以支持词汇和概念层面的评估。MedHopQA通过结合人工注释、筛选、迭代验证和LLM作为评判者的验证的结构化过程构建。为了减少排行榜游戏和污染风险,这1,000个评分问题嵌入在一个可公开下载的10,000个问题的数据集中,答案被保留,在CodaBench排行榜上展示。MedHopQA提供了一个基准和一个可重用的框架,用于构建未来的生物医学QA数据集,优先考虑组合推理、饱和抵抗和污染抵抗作为核心设计约束。
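The lexical side of the evaluation described in this abstract, where a free-text answer is accepted if it matches the gold answer or one of its ontology-derived synonyms, might look roughly like the sketch below. The normalization rules, function names, and example strings are assumptions for illustration; the benchmark's actual scoring code and the concept-level scoring are not reproduced here.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def matches_gold(prediction, gold, synonyms=()):
    """True if the prediction equals the gold answer or any listed synonym."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted


# Illustrative strings only; they are not drawn from the MedHopQA dataset.
print(matches_gold("Marfan Syndrome.", "Marfan syndrome"))
print(matches_gold("MFS", "Marfan syndrome", synonyms=["MFS"]))
print(matches_gold("Ehlers-Danlos syndrome", "Marfan syndrome", synonyms=["MFS"]))
```

Without the synonym set, the second answer would be scored as wrong despite naming the same concept, which is the failure mode that synonym-augmented gold annotations are meant to reduce.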
cs.CL / 70 / 2605.12370

Context Convergence Improves Answering Inferential Questions

上下文收敛改善推理问题的回答
Mozafari, Jamshid, Piryani, Bhawna, Jatowt, Adam
Abstract
While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions, where answers must be derived rather than directly retrieved, remains underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.
Chinese Translation
尽管大型语言模型(LLMs)在开放领域问答(QA)中被广泛应用,但它们处理推理问题的能力——即答案必须通过推导而非直接检索——仍然未得到充分探索。本研究探讨了段落的结构和质量如何影响LLM在此类问题上的表现。我们关注收敛性,这是一种衡量句子(提示)有效消除错误答案的能力的指标,作为构建段落的标准。通过使用TriviaHG数据集的子集,我们通过组合具有不同收敛水平的句子来形成段落,并评估六种不同规模和架构的LLM。我们的结果表明,基于高收敛性句子构建的段落在回答准确性上显著优于通过余弦相似度选择的段落,这表明收敛性捕捉了推理的有意义相关性。此外,按降序排列句子收敛性略微提高了性能,表明LLM倾向于优先考虑较早的信息丰富提示。这些发现突显了收敛性作为指导段落构建和分析LLM推理行为的实用信号。
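Illustrative sketch (not from the paper): the abstract describes convergence as a measure of how effectively a hint eliminates incorrect answers, and reports that ordering hints by descending convergence helps slightly. The code below assumes a simplified definition, the fraction of incorrect candidate answers that a hint rules out, which may differ from the exact TriviaHG formula. The candidate set, the hints, and the function names are hypothetical.

```python
def convergence(consistent: set[str], candidates: set[str], answer: str) -> float:
    """Fraction of incorrect candidates that the hint eliminates.

    `consistent` holds the candidates still compatible with the hint.
    Assumption: this ratio stands in for the paper's convergence score.
    """
    wrong = candidates - {answer}
    if not wrong:
        return 0.0
    eliminated = wrong - consistent
    return len(eliminated) / len(wrong)


def build_passage(hints: dict[str, set[str]], candidates: set[str],
                  answer: str, k: int = 2) -> str:
    """Keep the k highest-convergence hints, ordered by descending convergence."""
    ranked = sorted(
        hints,
        key=lambda h: convergence(hints[h], candidates, answer),
        reverse=True,
    )
    return " ".join(ranked[:k])


# Hypothetical example: each hint maps to the candidates it leaves standing.
candidates = {"Paris", "Lyon", "Rome", "Berlin"}
hints = {
    "It is a European capital.": {"Paris", "Rome", "Berlin"},
    "It is in France.": {"Paris", "Lyon"},
    "The Eiffel Tower stands here.": {"Paris"},
}
print(build_passage(hints, candidates, answer="Paris", k=2))
```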
cs.CL / 71 / 2605.12382

Pretraining Exposure Explains Popularity Judgments in Large Language Models

预训练暴露解释了大型语言模型中的流行度判断
Mozafari, Jamshid, Piryani, Bhawna, Jatowt, Adam
Abstract
Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure strongly correlates with Wikipedia popularity, validating exposure as a meaningful proxy for real-world salience during the training period. More importantly, we find that LLM popularity judgments align more closely with exposure than with Wikipedia, especially when elicited via pairwise comparisons. This alignment is strongest for larger models and persists in the long tail, where Wikipedia popularity becomes unreliable. Overall, our findings demonstrate that popularity priors in LLMs are primarily shaped by pretraining statistics rather than external popularity signals, offering concrete evidence that data exposure plays a central role in driving popularity bias.
Chinese Translation
大型语言模型(LLMs)对知名实体表现出系统性的偏好,这一现象通常归因于流行度偏差。然而,这些偏好在多大程度上反映了现实世界的流行度与预训练期间的统计暴露仍然不清楚,这主要是由于大多数训练语料库的不可获取性。我们提供了首个基于完全可观察的预训练数据的大规模流行度偏差直接分析。利用开放的OLMo模型及其完整的预训练语料库Dolma,我们计算了跨越7.4万亿个标记的精确实体级暴露统计数据。我们分析了涵盖五种类型(人、地点、组织、艺术、产品)的2000个实体,并将预训练暴露与维基百科页面浏览量以及两种引导的LLM流行度信号进行比较:直接标量估计和成对比较。我们的结果表明,预训练暴露与维基百科流行度强相关,验证了暴露作为训练期间现实世界显著性的有意义代理。更重要的是,我们发现LLM的流行度判断与暴露的对齐程度高于与维基百科的对齐,尤其是在通过成对比较引导时。这种对齐在较大的模型中最为显著,并在长尾中持续存在,维基百科流行度在此时变得不可靠。总体而言,我们的研究结果表明,LLMs中的流行度先验主要由预训练统计数据塑造,而非外部流行度信号,提供了具体证据表明数据暴露在推动流行度偏差中发挥了核心作用。
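Illustrative sketch (not from the paper): the abstract compares pretraining exposure with Wikipedia pageviews and with elicited popularity judgments, and reports how closely they correlate. It does not name the correlation statistic, so the code below assumes Spearman rank correlation as one plausible choice. It uses the simple closed form that is valid only when there are no tied values, and all numbers are made up.

```python
def ranks(values: list[float]) -> list[int]:
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Assumption: inputs have equal length n >= 2 and contain no ties.
    """
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical per-entity signals (not data from the paper).
exposure_counts = [120_000, 4_500, 980, 35_000]   # corpus mention counts
pageviews = [900_000, 20_000, 60_000, 150_000]    # Wikipedia pageviews
llm_scores = [95, 40, 10, 70]                     # elicited popularity scores

print(spearman(llm_scores, exposure_counts))
print(spearman(llm_scores, pageviews))
```

With real data, ties are common in count statistics, so a tie-aware implementation would be the safer choice.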
cs.CL / 72 / 2605.12384

Scalable Token-Level Hallucination Detection in Large Language Models

大语言模型中的可扩展令牌级幻觉检测
Min, Rui, Pang, Tianyu, Du, Chao, Cheng, Minhao, Fung, Yi R.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
Chinese Translation
大语言模型(LLMs)展现了显著的能力,但仍然频繁产生幻觉。这些幻觉在推理密集型任务中难以检测,尽管内容看似连贯,但却包含逻辑缺陷和不可靠的中间结果等错误。虽然常用的逐步分析方法可以检测内部幻觉,但由于依赖于步骤分段,这种方法在粒度和可扩展性方面存在局限。为了解决这些问题,我们提出了TokenHD,这是一个用于训练令牌级幻觉检测器的整体管道。具体而言,TokenHD包括一个可扩展的数据引擎,用于合成大规模幻觉注释,并配备了一种具有重要性加权策略的训练方案,以实现稳健的模型训练。为了系统地评估检测性能,我们还提供了严格的评估协议。通过在TokenHD内的训练,我们的检测器能够直接在自由格式文本上操作,以识别幻觉,消除了对预定义步骤分段或额外文本格式化的需求。我们的实验表明,即使是一个小型检测器(0.6B)在训练后也能实现显著的性能提升,超越了更大的推理模型(例如,QwQ-32B),并且检测性能随着模型规模从0.6B到8B一致提升。最后,我们展示了我们的检测器能够在多种实际场景中良好泛化,并探讨了进一步增强其跨领域泛化能力的策略。
cs.CL / 73 / 2605.12395

A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

基于公平竞争评估原则的受控文本生成系统比较研究
Lorandi, Michela, Belz, Anya
Abstract
Background: Many different approaches to controlled text generation (CTG) have been proposed over recent years, but it is difficult to get a clear picture of which approach performs best, because different datasets and evaluation methods are used in each case to assess the control achieved. Objectives: Our aim in the work reported in this paper is to develop an approach to evaluation that enables us to comparatively evaluate different CTG systems in a manner that is both informative and fair to the individual systems. Methods: We use a level-playing-field (LPF) approach to comparative evaluation where we (i) generate and process all system outputs in a standardised way, and (ii) apply a shared set of evaluation methods and datasets, selected based on those currently in use, in order to ensure fair evaluation. Results: When re-evaluated in this way, performance results for a representative set of current CTG systems differ substantially from originally reported results, in most cases for the worse. This highlights the importance of a shared standardised way of assessing controlled generation. Conclusions: The discrepancies revealed by LPF evaluation demonstrate the urgent need for standardised, reproducible evaluation practices in CTG. Our results suggest that without such practices, published performance claims may substantially misrepresent true system capabilities.
Chinese Translation
背景:近年来提出了许多不同的受控文本生成(CTG)方法,但由于每种方法使用不同的数据集和评估方法来评估所实现的控制效果,因此很难清晰地了解哪种方法表现最佳。目标:本文的目的是开发一种评估方法,使我们能够以一种既信息丰富又公平的方式对不同的CTG系统进行比较评估。方法:我们采用公平竞争(Level-Playing-Field, LPF)评估方法进行比较评估,其中我们(i)以标准化的方式生成和处理所有系统输出,以及(ii)应用一套共享的评估方法和数据集,这些方法和数据集是基于当前使用的标准选择的,以确保公平评估。结果:以这种方式重新评估时,当前CTG系统的代表性性能结果与最初报告的结果有显著差异,在大多数情况下表现更差。这突显了评估受控生成的共享标准化方式的重要性。结论:LPF评估揭示的差异表明,CTG中迫切需要标准化、可重复的评估实践。我们的结果表明,如果没有这样的实践,已发布的性能声明可能会严重误导真实系统能力的表现。
cs.CL / 74 / 2605.12398

Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

通过答案可信度评分估计大型语言模型的问题难度
Mozafari, Jamshid, Piryani, Bhawna, Jatowt, Adam
Abstract
Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce the Q-DAPS (Question Difficulty based on Answer Plausibility Scores) method, a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC), demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
Chinese Translation
估计问题难度是评估和改进大型语言模型(LLMs)在问答(QA)任务中的关键组成部分。现有的方法通常依赖于可读性公式、基于检索的信号或流行度统计,这些方法可能无法充分捕捉现代LLMs所面临的推理挑战。在本文中,我们介绍了一种新方法Q-DAPS(基于答案可信度评分的问题难度),该方法通过计算候选答案的可信度评分的熵来估计问题难度。我们在四个著名的QA数据集——TriviaQA、NQ、MuSiQue和QASC上系统地评估了Q-DAPS,结果表明其始终优于基线。此外,Q-DAPS在超参数变化和问题类型上表现出强大的鲁棒性。大量的消融研究进一步表明,Q-DAPS在不同的可信度估计范式、模型规模和现实设置中保持鲁棒性。人类评估进一步确认了Q-DAPS的难度估计与人类对问题难度的判断之间的强一致性。总体而言,Q-DAPS为现代QA系统中的问题难度估计提供了一种可解释、可扩展且抗偏见的方法。
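Illustrative sketch (not from the paper): the abstract states that Q-DAPS estimates difficulty by computing the entropy of plausibility scores over candidate answers. The code below shows one way to compute such an entropy. It assumes that scores are non-negative, that they are turned into a distribution by dividing by their sum, that the entropy is normalized by log(n), and that higher entropy means a harder question. These choices and the function name are assumptions; the paper's exact procedure may differ.

```python
import math


def entropy_difficulty(scores: list[float]) -> float:
    """Normalized Shannon entropy of plausibility scores, in [0, 1].

    Assumption: scores are non-negative; values near 1 mean many
    candidates look equally plausible, values near 0 mean one dominates.
    """
    total = sum(scores)
    if len(scores) < 2 or total <= 0:
        return 0.0
    probs = [s / total for s in scores]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores))


# Hypothetical plausibility scores for three candidate answers.
print(entropy_difficulty([0.9, 0.05, 0.05]))  # one clear winner
print(entropy_difficulty([0.4, 0.3, 0.3]))    # several plausible answers
```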
cs.CL / 75 / 2605.12412

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

空间中的故事:概念信念空间中的上下文学习轨迹
Bigelow, Eric, Sarfati, Raphaël, Wurgaft, Daniel, Lewis, Owen, McGrath, Thomas, Merullo, Jack, Geiger, Atticus, Lubana, Ekdeep Singh
Abstract
Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.
Chinese Translation
大型语言模型(LLMs)在上下文中更新其行为,这可以被视为一种贝叶斯推断的形式。然而,进行这种推断的潜在假设空间的结构仍然不清楚。在本研究中,我们提出LLMs在一个低维几何空间——概念信念空间——上分配信念,并且上下文学习对应于随着时间推移信念更新而在该空间中的轨迹。我们利用故事理解作为动态信念更新的自然场景,结合行为和表征分析来研究这些轨迹。我们的研究发现:(1)信念更新可以很好地描述为低维结构流形上的轨迹;(2)这种结构在模型行为和内部表征中一致反映,并且可以通过简单的线性探针解码以预测行为;(3)对这些表征的干预会因果性地引导信念轨迹,其效果可以从概念空间的几何形状中预测。总之,我们的结果提供了LLMs中信念动态的几何解释,将上下文学习的贝叶斯解释建立在结构化的概念表征之上。
cs.CL / 76 / 2605.12419

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

ORBIT:通过源调节合并在生成检索中保持基础语言能力
Verma, Neha, Mehta, Nikhil, Wang, Shao-Chuan, Zhang, Naijing, Tsai, Alicia, Wei, Li, Heldt, Lukasz, Hong, Lichan, Chi, Ed, Yi, Xinyang
Abstract
Despite rapid advances in large language model (LLM) development, fine-tuning LLMs for specific tasks often results in catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance, outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
Chinese Translation
尽管大型语言模型(LLM)发展迅速,但针对特定任务进行微调往往会导致其一般语言推理能力的灾难性遗忘。本研究在生成检索(GenRetrieval)任务的背景下探讨并解决了这一挑战。在GenRetrieval微调过程中,我们发现这种遗忘发生迅速,并且与微调模型和原始模型参数之间的距离相关。基于这些观察,我们提出了ORBIT,这是一种新颖的方法,主动跟踪微调模型与初始模型权重之间的距离,并在该模型间距离超过最大阈值时,采用权重平均策略来限制模型在GenRetrieval微调过程中的漂移。我们的结果表明,ORBIT在文本和检索性能上保持了显著的优势,超越了常见的持续学习基线和其他采用权重平均的相关正则化方法。
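Illustrative sketch (not from the paper): the abstract describes tracking the distance between fine-tuned and initial weights and applying weight averaging once that distance exceeds a maximum threshold. The code below shows the general idea on flat lists of floats. The L2 distance, the fixed averaging coefficient `alpha`, and the function names are assumptions; the paper's distance measure and averaging rule may differ.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two flattened weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def regulate(current: list[float], origin: list[float],
             max_distance: float, alpha: float = 0.5) -> list[float]:
    """Pull weights back toward the origin only when they drift too far.

    Assumption: averaging is alpha * origin + (1 - alpha) * current.
    """
    if l2_distance(current, origin) <= max_distance:
        return list(current)  # within budget: leave weights unchanged
    return [alpha * o + (1 - alpha) * c for c, o in zip(current, origin)]


# Hypothetical weights: the fine-tuned model has drifted 5.0 from the origin.
origin = [0.0, 0.0]
current = [3.0, 4.0]
print(regulate(current, origin, max_distance=4.0))   # averaged toward origin
print(regulate(current, origin, max_distance=10.0))  # left unchanged
```

In a training loop, a check like this would run periodically between optimizer steps, so that drift is corrected only when the threshold is crossed.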
cs.CL / 77 / 2605.12422

Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

在不使用生成时概率信号的情况下预测与人类评分者的分歧——以LLM作为评判者的难度评估
Ehara, Yo
Abstract
Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.
Chinese Translation
利用大型语言模型(LLMs)自动生成教育材料正变得越来越普遍,但为这些材料分配难度等级仍然需要大量的人力投入。因此,LLM作为评判者引起了关注,但与人类评分者的分歧仍然是一个主要挑战。我们提出了一种方法,用于预测哪些LLM生成的难度评分可能与人类评分者存在分歧,以便将此类案例送去重新评分。与之前的方法不同,我们的方法不依赖于生成时的概率信号,这些信号必须在评分生成过程中收集,并且通常难以在不同的LLM之间进行比较。相反,利用难度是一个有序尺度的事实,我们使用一个独立的嵌入空间,例如ModernBERT,并基于评分集的几何一致性来识别分歧候选。对基于CEFR的英语句子难度评估的实验使用了GPT-OSS-120B和Qwen3-235B-A22B,结果表明,所提方法在预测与人类评分者的分歧方面实现了比基于概率的基线更高的AUC。
cs.CL / 78 / 2605.12426

Geometric Factual Recall in Transformers

变换器中的几何事实回忆
Ravfogel, Shauli, Yehudai, Gilad, Bruna, Joan, Bietti, Alberto
Abstract
How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting (chains of relational queries such as "Who is the mother of the wife of $x$?"), providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
Chinese Translation
变换器语言模型如何记忆事实关联?一种普遍观点将内部权重矩阵视为对嵌入对的关联记忆,这要求参数数量随事实数量线性增长。我们对另一种几何形式的记忆给出了理论和实证上的阐述,其中学习的嵌入直接编码关系结构,而多层感知器(MLP)扮演着性质上不同的角色。在一个受控环境中,单层变换器必须记忆从主体到共享属性集的随机双射,我们证明了对数嵌入维度是足够的:主体嵌入编码其关联属性向量的线性叠加,而小型 MLP 作为关系条件选择器,通过 ReLU 门控提取相关属性,而不是作为关联键值映射。我们将这些结果扩展到多跳设置(如“$x$ 的妻子的母亲是谁?”这样的关系查询链),提供了有和没有思维链的构造,展现了可证明的容量-深度权衡,并辅以匹配的信息论下界。从实证上看,梯度下降发现了恰好具有所预测结构的解。一旦训练完成,MLP 在适当重新初始化主体嵌入时能够零样本迁移到全新的双射,揭示出它学习了一种通用选择机制,而不是记忆任何特定的事实集合。
cs.CL / 79 / 2605.12438

A Causal Language Modeling Detour Improves Encoder Continued Pretraining

因果语言建模的绕行改善了编码器的持续预训练
Touchent, Rian, de la Clergerie, Eric
Abstract
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
Chinese Translation
在将编码器适应新领域时,标准方法是继续使用掩码语言建模(Masked Language Modeling, MLM)进行训练。我们展示了暂时切换到因果语言建模(Causal Language Modeling, CLM),然后进行短期的MLM衰减,可以提高下游性能。在使用ModernBERT处理生物医学文本时,这种CLM绕行在8个法语和11个英语生物医学任务中,分别比在相同数据和计算条件下训练的MLM基线提高了1.2-2.8个百分点和0.3-0.8个百分点,具体取决于模型大小。我们探讨了这些提升的原因。我们发现,CLM的密集监督对低层变换器层(0-7)的影响远大于MLM。在CLM期间冻结低层会消除下游收益;冻结中层则能保持收益。这些表征变化在MLM衰减阶段持续存在,即使其长度与CLM阶段相匹配,并且它们与模型容量成正比。我们发布了ModernCamemBERT-bio和ModernBERT-bio,作为基础和大型尺寸的最先进生物医学编码器。
cs.CL / 80 / 2605.12452

The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

算法讽刺:审计大型语言模型生成的危机事件中的政治话语
Gunjan, Benabderrahmane, Sidahmed, Rahwan, Talal
Abstract
Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science (CSS) perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but unrealistic at the population level. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text detection and provides a CSS framework for evaluating the social realism of generated discourse.
Chinese Translation
大型语言模型(LLMs)能够大规模生成流畅的政治文本,这引发了对危机和社会冲突期间合成话语的担忧。现有的人工智能文本检测通常侧重于句子级别的线索,如困惑度、突发性或标记不规则性,但随着生成系统的改进,这些信号可能会减弱。我们采用计算社会科学的视角,探讨合成政治话语是否表现得像观察到的在线人群。我们构建了一个包含1,789,406条帖子、涵盖九个危机事件的配对语料库:COVID-19、1月6日国会大厦袭击、2020年和2024年美国选举、Dobbs/Roe v. Wade、2020年BLM抗议、美国中期选举、犹他州枪击事件和美伊战争。对于每个事件,我们将社交平台上观察到的话语与为相同背景生成的合成话语进行比较。我们评估了四个维度:情感强度、结构规律性、词汇-意识形态框架和跨事件依赖性,使用均值差距和离散证据进行分析。在各个事件中,合成话语流畅但在群体层面上不现实。它通常情感更为消极,情感分布较少,结构上更为规律,词汇上更为抽象,而观察到的话语则表现出更广泛的情感变化、更长尾的结构分布和更多上下文特定的口语化词汇标记。这些差异依赖于事件:在快速变化、去中心化的危机中更大,而在正式或机构介导的事件中较小。我们用一个简单的事件级别指标——讽刺差距(Caricature Gap)来总结这些发现。我们的研究表明,合成政治话语的主要局限性不在于语法或流畅性,而在于降低了群体现实感。群体层面的审计补充了传统的文本检测,并为评估生成话语的社会现实提供了计算社会科学框架。
cs.CL / 81 / 2605.12487

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

通过测试时大语言模型指导的任务自适应嵌入精炼
Gera, Ariel, Ashury-Tahan, Shir, Bloch, Gal, Eytan, Ohad, Toledo, Assaf
Abstract
We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at https://github.com/IBM/task-aware-embedding-refinement.
Chinese Translation
我们探讨了一种由大语言模型(LLM)指导的查询精炼范式的有效性,以扩展嵌入模型在具有挑战性的零样本搜索和分类任务中的可用性。我们的方法利用生成性LLM对一小组文档的反馈,精炼用户查询的嵌入表示,使嵌入能够实时适应目标任务。我们在一系列多样化的具有挑战性的搜索和分类基准上,对最先进的文本嵌入模型进行了广泛的实验。实证结果表明,LLM指导的查询精炼在所有模型和数据集上都带来了持续的提升,在文献搜索、意图检测、关键点匹配和细致查询指令跟随等任务中,相对提升高达25%。精炼后的查询提高了排名质量,并在语料库中引入了更清晰的二元分离,使嵌入空间更好地反映每个临时用户查询的细致、任务特定的约束。重要的是,这扩展了嵌入模型可以有效部署的实际场景范围,使其成为在语料库规模下成本高昂的LLM管道不可行时的一个有吸引力的替代方案。我们已在 https://github.com/IBM/task-aware-embedding-refinement 发布了我们的实验代码以便于复现。
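Illustrative sketch (not from the paper): the abstract describes refining a query embedding using feedback from a generative LLM on a small set of documents. The code below swaps in a classic Rocchio-style relevance-feedback update as a stand-in, moving the query toward documents judged relevant and away from those judged irrelevant. This is not the authors' method (their code is at the repository linked in the abstract). The boolean `labels` stand in for LLM judgments, and the `beta` and `gamma` weights and all vectors are hypothetical.

```python
import math


def mean_vector(vectors: list[list[float]], dim: int) -> list[float]:
    """Component-wise mean; returns a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def unit(v: list[float]) -> list[float]:
    """Scale a vector to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    ua, ub = unit(a), unit(b)
    return sum(x * y for x, y in zip(ua, ub))


def refine_query(query: list[float], docs: list[list[float]],
                 labels: list[bool], beta: float = 0.75,
                 gamma: float = 0.25) -> list[float]:
    """Rocchio-style update: q + beta * mean(relevant) - gamma * mean(irrelevant)."""
    dim = len(query)
    pos = mean_vector([d for d, ok in zip(docs, labels) if ok], dim)
    neg = mean_vector([d for d, ok in zip(docs, labels) if not ok], dim)
    return unit([q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)])


# Hypothetical 2-D embeddings; labels mimic LLM relevance feedback.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0]]
labels = [True, False]
refined = refine_query(query, docs, labels)
print(cosine(query, docs[0]), cosine(refined, docs[0]))
```

The refined vector can then be used for a second retrieval pass over the corpus without further LLM calls, which matches the cost argument made in the abstract.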
cs.CL / 82 / 2605.12493

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

LongMemEval-V2:评估长期代理记忆以应对经验丰富的同事
Wu, Di, Ji, Zixiang, Kawatkar, Asmi, Kwan, Bryan, Gu, Jia-Chen, Peng, Nanyun, Chang, Kai-Wei
Abstract
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite these accuracy gains, coding-agent-based methods incur high latency. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
Chinese Translation
长期记忆对于在专业网络环境中的代理至关重要,因为成功依赖于回忆界面可用性、状态动态、工作流程和反复出现的失败模式。然而,现有的代理记忆基准主要集中在用户历史、短期轨迹或下游任务成功上,尚未直接评估记忆系统是否有效内化特定环境的经验。为了解决这一空白,我们引入了LongMemEval-V2(LME-V2),这是一个评估记忆系统是否能够帮助代理获取成为定制环境中知识丰富的同事所需经验的基准。LME-V2包含451个手动策划的问题,涵盖了网络代理的五项核心记忆能力:静态状态回忆、动态状态跟踪、工作流程知识、环境陷阱和前提意识。这些问题与包含多达500条轨迹和1.15亿个标记的历史轨迹配对。我们使用了一种上下文收集的公式:记忆系统消耗历史轨迹并返回下游问答的紧凑证据。我们提出了一套两种记忆方法:AgentRunbook-R,一种基于RAG的高效记忆,具有原始状态观察、事件和策略笔记的知识库;以及AgentRunbook-C,它将轨迹存储为文件,并调用编码代理在增强的沙盒中收集证据。实验表明,AgentRunbook-C以72.5%的平均准确率取得最佳表现,超越了最强的RAG基线(48.5%)和现成编码代理基线(69.3%)。尽管性能显著提升,基于编码代理的方法仍存在高延迟成本。虽然AgentRunbook-C推动了准确性与延迟的Pareto前沿,但仍有相当大的改进空间。总体而言,这些结果确立了LME-V2作为开发环境经验的长期记忆系统的挑战性测试平台。