cs.RO / 1 / 2605.08168
Understanding Asynchronous Inference Methods for Vision-Language-Action Models
理解视觉-语言-动作模型的异步推理方法
Abstract
Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA, sweeping inference delays up to $d=20$ control steps. A2C2's per-step residual correction is the most effective method on Kinetix, holding above 90% solve rate up to $d=8$, and also leads on LIBERO from $d=4$ onwards. IT-RTC is competitive at low delays but degrades sharply under long chunks ($H=30$) and high delays. TT-RTC is the most robust training-based method: stable across $d_\max$ choices, generalizes beyond its training delay distribution, and adds zero inference overhead. VLASH exhibits a clear low-delay vs. high-delay trade-off governed by the fine-tuning delay range $[0,d_\max]$. Code is available at https://github.com/TheAyos/async-vla-inference
Chinese Translation
视觉-语言-动作(VLA)模型为通用机器人控制提供了一个有前景的路径,但其推理延迟在生成的动作异步执行时会导致观察过时。为缓解这一问题,已有几种方法同期提出:推理时修复(IT-RTC)、训练时延迟模拟(TT-RTC)、未来状态感知条件(VLASH)和轻量级残差修正(A2C2)。这些方法采用了根本不同的策略,但迄今为止,它们在不同的代码库、基础策略和协议下进行了独立评估。我们在受控条件下对这四种方法进行了系统比较。我们开发了两个统一的代码库,将所有方法与协调的库和数据集版本集成,并在Kinetix套件上使用MLPMixer策略以及在LIBERO操作基准上使用SmolVLA进行了基准测试,推理延迟最高扫描至$d=20$个控制步骤。A2C2的逐步残差修正是Kinetix上最有效的方法,在延迟不超过$d=8$时保持90%以上的解决率,并且在LIBERO上从$d=4$开始也表现领先。IT-RTC在低延迟下具有竞争力,但在长动作块($H=30$)和高延迟下急剧下降。TT-RTC是最稳健的基于训练的方法:在不同的$d_\max$选择下保持稳定,能够泛化到其训练延迟分布之外,并且不增加任何推理开销。VLASH表现出明显的低延迟与高延迟之间的权衡,这由微调延迟范围$[0,d_\max]$所主导。代码可在https://github.com/TheAyos/async-vla-inference获取。
cs.RO / 2 / 2605.08185
From Ontology Conformance to Admissible Reconfiguration: A RoSO/SMGI Adequacy Argument for Robotic Service Governance
从本体一致性到可接受的重构:机器人服务治理的RoSO/SMGI充分性论证
Abstract
The Robotic Service Ontology (RoSO) gives service robotics a typed semantic vocabulary for services, functions, interactions, and deployment-sensitive constraints. Its public revision trail makes visible a harder question than ontology conformance alone can settle: once a service is rebound, recomposed, repaired, or redeployed, under what conditions does the resulting configuration remain an admissible realization of the same protected service? This article argues that the Structural Model of General Intelligence (SMGI) is relevant exactly at that level \citep{osmani2026smgi}. SMGI adds not only a structural interface $\theta$, but an induced behavioral semantics $T_\theta$ and a governance discipline for norm-respecting change. We show that RoSO can be embedded into SMGI as a typed semantic layer, so that service descriptions become dynamically governable rather than merely well formed. This yields a RoSO-to-SMGI adequacy theorem, identity-preserving reconfiguration criteria, and compositional conditions under which locally acceptable updates remain globally admissible. The resulting claim is not that SMGI replaces RoSO, but that it provides a formal account of what admissible runtime change requires once service semantics must survive revision.
Chinese Translation
机器人服务本体(Robotic Service Ontology, RoSO)为服务机器人领域提供了一套涵盖服务、功能、交互和部署敏感约束的类型化语义词汇。其公开的修订记录揭示了一个仅凭本体一致性无法解决的更难的问题:一旦服务被重新绑定、重新组合、修复或重新部署,在什么条件下所得到的配置仍然是同一受保护服务的可接受实现?本文论证了通用智能结构模型(Structural Model of General Intelligence, SMGI)恰恰在这一层面上具有相关性 \citep{osmani2026smgi}。SMGI不仅增加了一个结构接口 $\theta$,还引入了一个诱导的行为语义 $T_\theta$ 和一套面向尊重规范的变更的治理规范。我们展示了RoSO可以嵌入到SMGI中,作为一个类型化的语义层,从而使服务描述变得动态可治理,而不仅仅是形式良好的。这产生了一个RoSO到SMGI的充分性定理、保持身份的重构标准,以及使局部可接受的更新保持全局可接受的组合条件。最终的主张并不是SMGI取代RoSO,而是在服务语义必须经受修订的情况下,它为可接受的运行时变更所需的条件提供了一个形式化说明。
cs.RO / 3 / 2605.08269
Anatomical Landmark-Guided Deep Reinforcement Learning for Autonomous Gastric Navigation
基于解剖标志的深度强化学习用于自主胃部导航
Abstract
Wireless capsule endoscopy (WCE) enables painless visualization of the gastrointestinal tract, but its diagnostic potential is limited by incomplete mucosal coverage and poor transferability of existing navigation methods across patient anatomies. We propose a transferable, anatomical landmark-guided deep reinforcement learning (AL-DRL) framework for autonomous gastric navigation. Leveraging a lightweight edge-contour-depth fusion module, our policy operates on stable, low-dimensional landmark coordinates rather than high-dimensional video streams, effectively bridging the sim-to-real gap. In simulations across eight patient-derived models, the method achieves over 97% coverage within 50 seconds, significantly outperforming vanilla PPO, SAC, and DQN agents. A two-stage sim-to-real pipeline with an adaptive dynamic programming controller actively mitigates physical disturbances. Ex-vivo experiments demonstrate a mean coverage of 87% and a 53% reduction in procedure time compared with expert manual control.
Chinese Translation
无线胶囊内窥镜(WCE)能够无痛地可视化胃肠道,但其诊断潜力受到不完全的粘膜覆盖和现有导航方法在不同患者解剖结构间转移性差的限制。我们提出了一种可转移的基于解剖标志的深度强化学习(AL-DRL)框架,用于自主胃部导航。该框架利用轻量级的边缘轮廓深度融合模块,使我们的策略在稳定的低维标志坐标上运行,而非高维视频流,从而有效缩小了模拟与现实之间的差距。在八个患者衍生模型的模拟中,该方法在50秒内实现了超过97%的覆盖率,显著优于传统的PPO、SAC和DQN代理。一个具有自适应动态规划控制器的两阶段模拟到现实的管道积极减轻了物理干扰。体外实验表明,与专家手动控制相比,平均覆盖率为87%,程序时间减少了53%。
cs.RO / 4 / 2605.08330
Hierarchical Prompting with Dual LLM Modules for Robotic Task and Motion Planning
基于双大型语言模型模块的层次化提示方法用于机器人任务与运动规划
Abstract
We present a hierarchical language-driven framework for robotic task and motion planning to improve natural, intuitive human-robot interaction in service and assistance scenarios. The proposed system employs two large language model (LLM) modules: a high-level planning agent and a low-level spatial reasoning sub-module. The primary agent processes natural language commands and generates action sequences using a ReAct-style prompt, interacting with tools for object perception and manipulation (e.g., pick, place, release). For precise spatial placement, such as interpreting "place the mug next to the plate", a separate sub-prompting module handles 3D reasoning based on object geometry and scene layout. The system integrates YOLOX-GDRNet for object detection and pose estimation, along with a motion execution stub. We evaluated the system in 24 test scenarios, ranging from simple spatial commands to high-level instructions and infeasible requests. The system achieved an overall task success rate of 86%.
Chinese Translation
我们提出了一种层次化的语言驱动框架,用于机器人任务与运动规划,以改善服务和辅助场景中的自然、直观的人机交互。所提出的系统采用两个大型语言模型(LLM)模块:一个高层规划代理和一个低层空间推理子模块。主要代理处理自然语言命令,并使用ReAct风格的提示生成动作序列,与工具进行对象感知和操作(例如,拾取、放置、释放)。对于精确的空间放置,例如解释“将杯子放在盘子旁边”,一个单独的子提示模块负责基于对象几何形状和场景布局进行3D推理。该系统集成了YOLOX-GDRNet用于对象检测和姿态估计,以及一个运动执行存根。我们在24个测试场景中评估了该系统,这些场景涵盖了从简单空间命令到高层指令和不可行请求的范围。该系统的整体任务成功率达到了86%。
cs.RO / 5 / 2605.08434
Failing Forward: Adaptive Failure-Informed Learning for Vision-Language-Action Models
向前失败:面向视觉-语言-动作模型的自适应失败信息学习
Abstract
Vision-language-action (VLA) models provide a promising paradigm for scalable robotic manipulation, yet their reliance on success-only behavioral cloning leaves them brittle; lacking corrective training signals, minor execution errors rapidly compound into unrecoverable, out-of-distribution failures. To address this limitation, we propose Adaptive Failure-Informed Learning (AFIL), an end-to-end framework that leverages failure trajectories as adaptive negative guidance for diffusion- and flow-based VLA policies. AFIL uses a pretrained VLA to generate failure rollouts online, avoiding the need for handcrafted failure-mode design or human-in-the-loop recovery. It then jointly trains Dual Action Generators (DAGs) for successful and failed behaviors while sharing a common vision-language backbone, enabling efficient failure-aware policy learning with limited parameter overhead. During sampling, the failure generator adaptively steers action generation away from failure-prone regions and toward more reliable success modes, with guidance strength determined by the per-diffusion-step distance between success and failure distributions. Experiments across in-domain and out-of-domain robotic manipulation tasks, covering both short- and long-horizon settings, show that AFIL consistently improves task success rates and robustness over existing VLA baselines, demonstrating its effectiveness, efficiency, and generality.
Chinese Translation
视觉-语言-动作(VLA)模型为可扩展的机器人操作提供了一个有前景的范式,但其对仅成功行为克隆的依赖使其变得脆弱;缺乏纠正性训练信号,轻微的执行错误迅速累积成无法恢复的、超出分布的失败。为了解决这一局限性,我们提出了自适应失败信息学习(AFIL),这是一个端到端的框架,利用失败轨迹作为扩散和流基VLA策略的自适应负向指导。AFIL使用预训练的VLA在线生成失败回放,避免了手工设计失败模式或人类干预恢复的需求。然后,它共同训练成功和失败行为的双重动作生成器(DAG),同时共享一个通用的视觉-语言骨干,从而实现有限参数开销下的高效失败感知策略学习。在采样过程中,失败生成器自适应地引导动作生成远离易失败区域,朝向更可靠的成功模式,指导强度由成功和失败分布之间每次扩散步骤的距离决定。在涵盖短期和长期设置的领域内和领域外机器人操作任务的实验中,AFIL始终提高了任务成功率和鲁棒性,超越了现有的VLA基线,证明了其有效性、效率和通用性。
cs.RO / 6 / 2605.08489
LE-PAVD: Learning-Enhanced Physics-Aware Vehicle Dynamics for High-Speed Autonomous Navigation
LE-PAVD:增强学习的物理感知车辆动力学用于高速自主导航
Abstract
Accurate modeling of nonlinear vehicle dynamics is essential for high-speed autonomous racing, where controllers operate at the handling limits. Model-based methods are interpretable but rely on simplifying assumptions, while purely learned models capture nonlinearities yet often lack physical consistency and generalization. We propose LE-PAVD (Learning-Enhanced Physics-Aware Vehicle Dynamics), a hybrid model that integrates physics priors with learned components. Our architecture adds four components: load-sensitive Pacejka tire forces, longitudinal load transfer, lateral tire-force effects, and rate-limited actuator inputs. Trained end-to-end on simulation and real-world telemetry, LE-PAVD enforces physical consistency while improving state prediction accuracy. On an unseen track, LE-PAVD reduces average displacement error (ADE) by 16.1%, final displacement error (FDE) by 20.6%, and yaw-rate root mean squared error (RMSE) by 91.3% versus a deep dynamics baseline, while using 21.6% fewer FLOPs and achieving approximately 1.50$\times$ faster inference. In closed-loop simulations, LE-PAVD consistently outperforms the baseline, with lap times 17.4% faster on a training track and 9.5% faster on a test track, without any track boundary violations. Overall, LE-PAVD offers a compact, physics-grounded dynamics backbone that improves predictive fidelity and closed-loop performance while reducing inference cost.
Chinese Translation
准确建模非线性车辆动力学对于高速自主赛车至关重要,因为控制器在操控极限下工作。基于模型的方法具有可解释性,但依赖于简化假设,而纯粹学习的模型能够捕捉非线性特性,但往往缺乏物理一致性和泛化能力。我们提出了LE-PAVD(增强学习的物理感知车辆动力学),这是一种将物理先验与学习组件相结合的混合模型。我们的架构增加了四个组件:负载敏感的Pacejka轮胎力、纵向负载转移、横向轮胎力效应和速率限制的执行器输入。LE-PAVD在模拟和现实世界遥测数据上进行了端到端训练,确保了物理一致性,同时提高了状态预测的准确性。在一个未见过的赛道上,LE-PAVD将平均位移误差(ADE)降低了16.1%,最终位移误差(FDE)降低了20.6%,并将偏航率均方根误差(RMSE)降低了91.3%,相比于深度动力学基线,同时使用了21.6%的更少FLOPs,并实现了约1.50倍的推理速度。在闭环仿真中,LE-PAVD始终优于基线,在训练赛道上实现了17.4%的更快圈速,在测试赛道上实现了9.5%的更快圈速,且没有任何赛道边界违规。总体而言,LE-PAVD提供了一个紧凑的、基于物理的动力学基础,提升了预测精度和闭环性能,同时降低了推理成本。
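The ADE and FDE figures quoted in this abstract follow the usual trajectory-prediction definitions: ADE is the mean Euclidean distance between predicted and reference positions over all timesteps, and FDE is that distance at the last timestep only. A minimal sketch of both metrics follows; the function name `displacement_errors` and the toy trajectories are illustrative assumptions, not values or code from the paper.

```python
import math

def displacement_errors(predicted, reference):
    """Return (ADE, FDE) for two equal-length lists of (x, y) positions."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-timestep Euclidean distance between prediction and reference.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    ade = sum(distances) / len(distances)  # average over all timesteps
    fde = distances[-1]                    # error at the final timestep only
    return ade, fde

# Hypothetical toy trajectories, chosen so the distances are 0, 1 and 2.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted, reference)
```

Because FDE looks only at the last step, a model whose error grows along the horizon shows an FDE larger than its ADE, as in the toy case above.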
cs.RO / 7 / 2605.08511
Trajectory-Consistent Flow Matching for Robust Visuomotor Policy Learning
轨迹一致性流匹配用于稳健的视觉运动策略学习
Abstract
Flow matching policies learn continuous velocity fields that transport noise to actions, enabling fast deterministic inference for robot manipulation. However, standard training optimizes a pointwise velocity objective while inference requires numerical integration of that field -- a mismatch that causes compounding trajectory errors. We propose four complementary remedies: (1) auxiliary rectified flow velocity regression that provides uniform temporal supervision across the full time interval; (2) multi-step trajectory consistency training that supervises the integrated displacement of the velocity field over trajectory segments, directly closing the train-inference gap; (3) velocity field regularization that enforces temporal smoothness, preventing oscillations that destabilize integration; and (4) fourth-order Runge-Kutta (RK4) inference that reduces global discretization error by orders of magnitude over Euler methods. Critically, these components are not independently sufficient -- RK4 without a smooth velocity field fails, and smoothness without trajectory-level supervision still drifts, as our ablation study confirms. We further pair these with a dual-view 3D point cloud encoder using two independent PointNet encoders for complementary spatial perception. On four real-robot tasks across a Franka arm and a Boston Dynamics Spot, our method achieves 70% and 60% overall success on two long-horizon multi-phase tasks where both baselines score 0%, and reaches 100% on precision tool placement. Three MetaWorld simulation tasks confirm consistent improvements, validating that trajectory-level supervision is essential for reliable policy execution.
Chinese Translation
流匹配策略学习连续的速度场,将噪声传输到动作,从而实现机器人操作的快速确定性推理。然而,标准训练优化的是逐点速度目标,而推理则需要对该速度场进行数值积分——这种不匹配导致了累积的轨迹误差。我们提出了四种互补的解决方案:(1)辅助的修正流速度回归,提供整个时间区间内的均匀时间监督;(2)多步轨迹一致性训练,监督速度场在轨迹段上的积分位移,直接缩小训练与推理之间的差距;(3)速度场正则化,强制时间平滑,防止不稳定的振荡影响积分;(4)四阶龙格-库塔(RK4)推理,相较于欧拉方法显著降低全局离散化误差。这些组件在独立情况下并不足够——没有平滑速度场的RK4会失败,而没有轨迹级监督的平滑性仍会漂移,正如我们的消融研究所证实的那样。我们进一步将这些与双视图3D点云编码器相结合,使用两个独立的PointNet编码器以实现互补的空间感知。在Franka臂和波士顿动力Spot的四个真实机器人任务中,我们的方法在两个长时间多阶段任务中实现了70%和60%的整体成功率,而两个基线的得分均为0%,并在精确工具放置任务中达到了100%。三个MetaWorld仿真任务确认了一致的改进,验证了轨迹级监督对于可靠策略执行的重要性。
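The gap between Euler and RK4 integration mentioned in remedy (4) can be seen on a toy problem. The sketch below integrates dx/dt = v(x, t) over [0, 1] with both schemes, using the same number of steps. The velocity field `toy_velocity` is a stand-in assumption with a known exact solution (x(1) = e for x(0) = 1); it is not the learned policy field from the paper, and the scalar state is a simplification of an action vector.

```python
import math

def euler_integrate(velocity, x0, steps):
    """Forward-Euler integration of dx/dt = velocity(x, t) from t=0 to t=1."""
    x, h = x0, 1.0 / steps
    for i in range(steps):
        x = x + h * velocity(x, i * h)
    return x

def rk4_integrate(velocity, x0, steps):
    """Classical fourth-order Runge-Kutta integration over the same interval."""
    x, h = x0, 1.0 / steps
    for i in range(steps):
        t = i * h
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = velocity(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x

def toy_velocity(x, t):
    """Hypothetical smooth field; the exact solution is x(t) = x0 * exp(t)."""
    return x

exact = math.e
euler_error = abs(euler_integrate(toy_velocity, 1.0, 10) - exact)
rk4_error = abs(rk4_integrate(toy_velocity, 1.0, 10) - exact)
```

With ten steps the Euler error on this field is on the order of 1e-1, while the RK4 error is on the order of 1e-6. RK4 costs four velocity evaluations per step, and its accuracy advantage relies on the field being smooth, which is consistent with the abstract's point that RK4 alone is not sufficient.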
cs.RO / 8 / 2605.08525
Model-Reference Adaptive Flight Control of the 95-mg Bee++
95毫克Bee++的模型参考自适应飞行控制
Abstract
We introduce a model-reference adaptive control (MRAC) architecture for high-performance positional tracking of the Bee++, a 95-mg insect-scale flapping-wing aerial vehicle. The suitability, functionality, and high performance of the proposed approach are demonstrated using data from real-time flight experiments.
Chinese Translation
我们提出了一种模型参考自适应控制(MRAC)架构,用于高性能位置跟踪95毫克的Bee++,这是一种昆虫尺度的拍翼飞行器。通过实时飞行实验的数据,展示了所提方法的适用性、功能性和高性能。
cs.RO / 9 / 2605.08571
BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation
BEACON:通过最佳努力适应进行跨域协同训练的生成机器人策略
Abstract
We introduce BEACON--Best-Effort Adaptation for Cross-Domain Co-Training--a theory-driven framework for training generative robot policies with abundant source demonstrations and limited target demonstrations. BEACON casts cross-domain co-training as a discrepancy-aware importance-reweighting problem, jointly learning a diffusion-based visuomotor policy and per-sample source weights that minimize an objective informed by target-domain generalization guarantees. To make best-effort adaptation practical for high-dimensional sequence policies, we develop scalable instance-level discrepancy estimators, stochastic alternating updates for policy and weights, and a multi-source extension that balances heterogeneous source domains. Across sim-to-sim, sim-to-real, and multi-source manipulation settings, BEACON improves robustness and data efficiency over target-only, fixed-ratio co-training, and feature-alignment baselines. Importantly, even without an explicit alignment objective, BEACON achieves feature alignment as an implicit result of discrepancy-aware cross-domain co-training.
Chinese Translation
我们介绍了BEACON——最佳努力适应跨域协同训练(Best-Effort Adaptation for Cross-Domain Co-Training)——一个基于理论的框架,用于训练具有丰富源示例和有限目标示例的生成机器人策略。BEACON将跨域协同训练视为一个关注差异的重标定问题,联合学习基于扩散的视觉运动策略和每个样本的源权重,以最小化一个受目标域泛化保证启发的目标函数。为了使最佳努力适应在高维序列策略中变得实用,我们开发了可扩展的实例级差异估计器、用于策略和权重的随机交替更新,以及一个平衡异构源域的多源扩展。在模拟到模拟、模拟到真实和多源操作设置中,BEACON在目标仅、固定比例协同训练和特征对齐基线之上提高了鲁棒性和数据效率。重要的是,即使没有明确的对齐目标,BEACON也通过关注差异的跨域协同训练实现了特征对齐,成为一种隐含结果。
cs.RO / 10 / 2605.08612
ATAAT: Adaptive Threat-Aware Adversarial Tuning Framework against Backdoor Attacks on Vision-Language-Action Models
Abstract
Addressing the escalating security vulnerabilities in Vision-Language-Action (VLA) models, this study investigates backdoor attacks targeting the visual pathway. We identify a core obstacle causing the failure of traditional attack paradigms: "Gradient Interference." This phenomenon represents an optimization failure triggered by conflicting strategies during end-to-end training. To resolve this, we propose an Adaptive Threat-Aware Adversarial Tuning (ATAAT) framework. Through its core "Threat-Method Adaptive Mapping" mechanism, ATAAT intelligently selects the optimal gradient decoupling strategy based on the adversary's capabilities. Extensive experiments demonstrate that ATAAT exhibits significant advantages, achieving a highly robust Targeted Attack Success Rate (TASR > 80%) while maintaining extreme stealthiness with merely a 5% poisoning rate. It efficiently handles complex semantic-level triggers and achieves implicit decoupled attacks in data poisoning scenarios for the first time. This work reveals a critical security vulnerability in VLAs and provides theoretical and methodological support for future defense architectures.
cs.RO / 11 / 2605.08638
Geometry Guided Self-Consistency for Physical AI
几何引导的自一致性物理人工智能
Abstract
State-of-the-art physical AI models generate a chunk of actions per inference through diffusion or flow matching, iteratively refining an initial noise sample into an action trajectory. Because this inference process is inherently stochastic, committing to a single trajectory per round is brittle, and this brittleness compounds across the many sequential rounds that comprise a complete episode. We introduce KeyStone, an inference-time self-consistency method for diffusion-based action generation that draws $K$ candidate action chunks in parallel from a shared model context, clusters them in continuous action space, and returns the medoid of the largest cluster -- no additional model required. Two properties make this practical. First, the compact nature of action trajectories makes diffusion inference memory-bandwidth bound, leaving spare compute capacity to run $K$ chains in parallel with no additional wall-clock latency. Second, unlike token or pixel spaces where distance carries no semantic meaning and selection requires a learned judge, action chunks are geometrically structured such that Euclidean distance directly reflects physical similarity, making selection principled and judge-free. Across diverse vision-language-action models (VLAs) and world-action models (WAMs), KeyStone improves task success rates by up to 13.3% over single-trajectory sampling with negligible latency overhead, while matching the accuracy of model-based selectors at no training cost. We open source KeyStone at https://github.com/dywsjtu/keystone.
Chinese Translation
最先进的物理人工智能模型通过扩散或流匹配每次推理生成一组动作,迭代地将初始噪声样本细化为动作轨迹。由于这一推理过程本质上是随机的,每轮承诺单一轨迹的做法是脆弱的,而这种脆弱性在构成完整情节的多个连续轮次中会加剧。我们提出了 KeyStone,这是一种基于扩散的动作生成推理时自一致性方法,它从共享模型上下文中并行抽取 $K$ 个候选动作块,在连续动作空间中对它们进行聚类,并返回最大聚类的中心点(medoid),无需额外的模型。两个特性使得这一方法具有实用性。首先,动作轨迹的紧凑特性使得扩散推理受限于内存带宽,从而留出额外的计算能力以并行运行 $K$ 条链,而不会增加额外的时延。其次,与距离没有语义意义且选择需要学习评判者的标记或像素空间不同,动作块在几何上是结构化的,使得欧几里得距离直接反映物理相似性,从而使选择过程具有原则性且无需评判者。在多种视觉-语言-动作模型(VLA)和世界-动作模型(WAM)中,KeyStone 在任务成功率上比单一轨迹采样提高了高达 13.3%,且几乎没有时延开销,同时在准确性上与基于模型的选择器相当,且没有训练成本。我们在 https://github.com/dywsjtu/keystone 开源了 KeyStone。
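The selection step described in this abstract (cluster the $K$ candidates, then return the medoid of the largest cluster) can be sketched in a few lines. The abstract does not say which clustering algorithm KeyStone uses, so the radius-based connected-component grouping below, the `radius` parameter, and all function names are assumptions made for illustration. Candidate chunks are represented as flat lists of floats.

```python
import math

def cluster_by_radius(chunks, radius):
    """Label chunks by connected component; chunks within `radius` are linked.

    Assumed clustering rule for illustration, not the paper's algorithm.
    """
    labels = [-1] * len(chunks)
    next_label = 0
    for start in range(len(chunks)):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(chunks)):
                if labels[j] == -1 and math.dist(chunks[i], chunks[j]) <= radius:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

def select_medoid(chunks, radius):
    """Return the medoid of the largest cluster of candidate action chunks."""
    labels = cluster_by_radius(chunks, radius)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    largest = max(counts, key=counts.get)
    members = [i for i, label in enumerate(labels) if label == largest]
    # The medoid is the member with the smallest total distance to the others.
    best = min(
        members,
        key=lambda i: sum(math.dist(chunks[i], chunks[j]) for j in members),
    )
    return chunks[best]

# Hypothetical K = 5 candidates: three agree on one mode, two on another.
candidates = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [2.0, 2.0], [2.1, 2.0]]
chosen = select_medoid(candidates, radius=0.3)
```

Returning a medoid rather than a mean keeps the output an action chunk that the model actually sampled, so averaging across modes cannot produce a trajectory the policy never proposed.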
cs.RO / 12 / 2605.08713
REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer
REAP:基于强化学习的端到端自主停车方法及高斯点云模拟器在真实-模拟-真实转移中的应用
Abstract
In recent years, autonomous parking has made significant advances, yet parking tasks still face challenges in extreme scenarios such as mechanical and dead-end parking slots, often resulting in failures. This is mainly due to traditional parking methods adopting a multistage approach, lacking the ability to optimize the parking problem as a whole. End-to-end methods enable joint optimization across perception and planning modules to eliminate the accumulation of errors, enhancing algorithm performance in extreme scenarios. Although several end-to-end parking methods use imitation or reinforcement learning, the former is limited by data cost and distribution coverage, while the latter suffers from inefficient exploration. To address these challenges, we propose a Reinforcement learning End-to-end Autonomous Parking method (REAP). REAP employs Soft Actor-Critic (SAC) within an asymmetric reinforcement learning framework to improve training efficiency and inference performance. To accelerate model convergence, we distill the capabilities of a rule-based planner into the end-to-end network through behavior cloning. We further introduce a soft predictive collision penalty mechanism to reduce collision rates by penalizing obstacle-approaching actions. To ensure that the trained reinforcement learning network can directly transfer to real-world scenarios, we have established a Real2Sim2Real simulator. In the Real2Sim step, we use 3D Gaussian Splatting (3DGS) to transform real-world scenes into digital scenes. In the Sim2Real step, we deploy the end-to-end model onto the vehicle to bridge the Sim2Real gap. Trained in the 3DGS simulator and deployed on physical vehicles, REAP successfully parks in various types of parking spaces, especially demonstrating the feasibility of end-to-end RL parking in extremely narrow mechanical slots.
Chinese Translation
近年来,自主停车技术取得了显著进展,但在机械停车位和死胡同等极端场景下,停车任务仍面临挑战,常常导致失败。这主要是由于传统停车方法采用多阶段流程,缺乏将停车问题整体优化的能力。端到端方法能够在感知和规划模块之间实现联合优化,从而消除误差累积,提高算法在极端场景下的性能。尽管一些端到端停车方法使用模仿学习或强化学习,但前者受到数据成本和分布覆盖的限制,而后者则面临探索效率低下的问题。为了解决这些挑战,我们提出了一种基于强化学习的端到端自主停车方法(REAP)。REAP在不对称强化学习框架中采用软演员-评论家(Soft Actor-Critic, SAC)算法,以提高训练效率和推理性能。为了加速模型收敛,我们通过行为克隆将基于规则的规划器的能力提炼到端到端网络中。我们进一步引入了一种软预测碰撞惩罚机制,通过惩罚接近障碍物的动作来降低碰撞率。为了确保训练后的强化学习网络能够直接转移到现实场景中,我们建立了一个真实-模拟-真实(Real2Sim2Real)模拟器。在Real2Sim步骤中,我们使用三维高斯点云(3D Gaussian Splatting, 3DGS)将现实场景转化为数字场景。在Sim2Real步骤中,我们将端到端模型部署到车辆上,以弥合Sim2Real的差距。在3DGS模拟器中训练并部署在物理车辆上,REAP成功地在各种类型的停车位中完成停车,特别展示了在极窄机械停车位中实现端到端强化学习停车的可行性。
cs.RO / 13 / 2605.08722
HULK: Large-scale Hierarchical Coordination under Continual and Uncertain Temporal Tasks
HULK:在持续和不确定时间任务下的大规模分层协调
Abstract
Multi-agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: selecting appropriate subteams for different tasks in various areas, and coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team-wise tasks are generated online continually by external requests, and the amount of subtasks within each task is uncertain, e.g., the number of packages to deliver or victims to rescue. The aforementioned offline solution becomes inadequate as it would require constant re-computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large-scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework, HULK, consists of two interleaved layers: the rolling assignment of currently known tasks to subteams within a certain horizon, and the dynamic coordination within a subteam given the detected subtasks during online execution. Thus, coordination is performed hierarchically at different granularities and triggering conditions, improving computational efficiency and robustness. The method is validated rigorously over large-scale heterogeneous systems under various temporal tasks and environment uncertainties.
Chinese Translation
多智能体系统在并发和协作工作时可以极为高效,例如在配送、监控、搜索和救援等场景中。这类团队的协调通常涉及两个方面:为不同区域的不同任务选择合适的子团队,以及协调子团队中的智能体执行相关的子任务。现有研究通常假设任务是静态且事先已知的,因此可以制定整数规划并离线求解。然而,在许多应用中,团队任务是由外部请求持续在线生成的,每个任务中的子任务数量是不确定的,例如需要配送的包裹数量或需要救援的受害者数量。上述离线解决方案变得不够充分,因为它需要对整个团队进行持续的重新计算,并进行全球通信以广播结果。因此,本研究解决了在持续和不确定时间任务下的大规模协调问题,该问题被指定为针对协作行动的时序逻辑公式。所提出的分层框架HULK由两个交错的层次组成:在一定范围内将当前已知任务动态分配给子团队,以及在在线执行过程中根据检测到的子任务进行子团队内部的动态协调。因此,协调在不同的粒度和触发条件下以分层方式进行,从而提高了计算效率和鲁棒性。该方法在各种时间任务和环境不确定性下的规模化异构系统中得到了严格验证。
cs.RO / 14 / 2605.08732
Latent Geometry Beyond Search: Amortizing Planning in World Models
超越搜索的潜在几何:在世界模型中的规划摊销
Abstract
Modern vision-based world models can represent observations as compact yet expressive latent manifolds, but fast goal-oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse-dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal-Conditioned Inverse Dynamics Model (GC-IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact-rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x. A broader sweep over test-time planners (CEM, MPPI, iCEM, and gradient-based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test-time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference.
Chinese Translation
现代基于视觉的世界模型能够将观察表示为紧凑而富有表现力的潜在流形,但在这些空间中进行快速的目标导向规划仍然具有挑战性。这引出了一个核心问题:何时学习到的表示能够简化控制,而不仅仅是实现预测?我们在一个预训练的LeWorldModel中研究了这个问题,其潜在几何经过平滑性和均匀性的正则化。我们的关键见解是,在这样的几何下,规划可以摊销为潜在逆动力学映射,而不需要在线搜索。因此,我们用一个轻量级的目标条件逆动力学模型(Goal-Conditioned Inverse Dynamics Model, GC-IDM)替代了迭代规划,该模型将当前潜在状态、目标潜在状态和剩余时间范围直接映射到下一个动作。在四个基准环境中进行的实证研究涵盖了导航、接触丰富的操控和连续控制,我们的控制器在八个环境-协议设置中的七个中与CEM相匹配或超越,同时将每次决策的成本降低了100-130倍。对测试时规划器(CEM、MPPI、iCEM和基于梯度的方法)的更广泛评估表明,这一结果并不特定于某个特定的优化器。这些发现表明,测试时规划所恢复的许多结构已经在潜在表示中局部编码。更广泛地说,我们的结果表明,足够结构化的潜在空间可以将部分规划负担从在线优化转移到学习推断上。
cs.RO / 15 / 2605.08757
A Visuo-Tactile Data Collection System with Haptic Feedback for Coarse-to-Fine Imitation Learning
具有触觉反馈的视觉-触觉数据采集系统用于粗到细的模仿学习
Abstract
We present a visuo-tactile data-collection system that generates temporally structured, contact-rich demonstrations for imitation learning. Conventional systems often decouple the operator from contact forces, which hinders the demonstration of subtle force modulation. Our system introduces a direct-drive gripper that the operator actuates with the fingers, preserving natural haptic feedback. Integrated visual sensors and custom tactile arrays capture image streams and contact geometry. A handle-mounted push button enables the operator to annotate the task's temporal structure in real time by marking task-critical regions. By fusing in-hand force perception with in-situ temporal annotation, the system produces multimodal datasets designed for coarse-to-fine learning algorithms that exploit structural task knowledge, enabling the development of high-quality manipulation policies.
Chinese Translation
我们提出了一种视觉-触觉数据采集系统,该系统生成具有时间结构的、接触丰富的演示数据,以用于模仿学习。传统系统通常将操作员与接触力解耦,这妨碍了细微力调节的演示。我们的系统引入了一种直接驱动的抓手,操作员通过手指进行驱动,保留了自然的触觉反馈。集成的视觉传感器和定制的触觉阵列捕捉图像流和接触几何信息。一个安装在手柄上的按钮使操作员能够实时标注任务的时间结构,通过标记任务关键区域。通过将手中力感知与现场时间标注相结合,该系统生成了多模态数据集,旨在为利用结构任务知识的粗到细学习算法提供支持,从而促进高质量操作策略的发展。
cs.RO / 16 / 2605.08758
Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems
基于全尺度学习的料箱处理机器人系统订单履行序列决策框架
Abstract
Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automated order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework delivers an efficient and unified order fulfillment decision-making framework for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.
Chinese Translation
随着电子商务和小批量生产的快速扩展,成品、半成品和原材料的内部物流负载单元的尺寸正在稳步缩小。料箱(tote)逐渐取代托盘(pallet)成为主要的处理和存储容器。这一转变使得料箱处理机器人系统在自动化订单履行中心中处于前沿。料箱处理机器人系统的订单履行决策具有共同的订单-料箱-机器人序列决策特性。现有研究主要集中于针对特定系统的决策机制,导致难以将其推广或转移到其他上下文中。我们提出了一种基于全尺度学习的料箱处理机器人系统订单履行序列决策框架(OLSF-TRS),这是一个通用且可扩展的序列决策框架,结合了结构化组合优化与多智能体强化学习,以协调订单、料箱和机器人决策。在小规模料箱处理机器人系统中,OLSF-TRS实现了接近最优的性能,在两种不同系统配置下的平均最优性差距低于3.5%。在大规模场景中,OLSF-TRS在两种不同系统类型中始终优于启发式基线,减少了8-12%的料箱移动次数,并比最先进的基于规则的方法减少了超过30%的移动,同时保持实时响应。这些改进转化为切实的运营效益,包括成本降低、能耗降低和吞吐量稳定性增强。所提出的框架为广泛部署的料箱处理机器人系统提供了一个高效统一的订单履行决策框架,支持电子商务和工业物流领域的高质量订单履行。
cs.RO / 17 / 2605.08774
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM:学习基于程序的进展奖励用于机器人操控
Abstract
Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: https://procvlm.github.io/
Chinese Translation
长时间跨度的机器人操控需要密集的反馈,以反映任务在其程序阶段中的进展,而不仅仅是最终结果是否成功。现有的奖励模型往往依赖于轨迹级成功标签或基于时间的插值,这可能将经过的时间与真正的任务进展混淆,从而无法捕捉未完成的步骤、停滞和失败状态。我们提出了ProcVLM,这是一种进展感知的视觉-语言模型,学习基于程序的进展作为操控的密集奖励信号。ProcVLM并不是从最终结果或时间代理中推导进展,而是将进展估计基于程序结构和阶段内的视觉变化,并进一步采用推理优先于估计的范式,在估计任务进展之前推断剩余的原子动作。具体而言,我们通过合成帧级子任务语义注释来构建这种监督,根据子任务结构分配进展预算,并根据子任务内的视觉变化分配每个预算。为了大规模训练ProcVLM,我们建立了标准化的程序监督合成管道,并从30个具身数据集中构建了ProcCorpus-60M,包含6000万帧注释,从中我们推导出ProcVQA用于程序感知的预训练,进展估计作为中心任务,同时进行动作分割和未来规划。在ProcVQA和奖励模型基准上的实验表明,ProcVLM改善了具身程序推理,并提供了比代表性基线更具区分性的轨迹内部进展估计,支持其作为下游奖励引导策略优化的密集奖励模型的使用。项目页面:https://procvlm.github.io/
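The supervision recipe described above (assign a progress budget per subtask, then distribute each budget by intra-subtask visual change) can be illustrated with a minimal sketch. This is not the ProcVLM pipeline: the function name `progress_labels`, the equal per-subtask budget, and the scalar per-frame `visual_change` score are all assumptions made for illustration.

```python
import numpy as np

def progress_labels(subtask_ids, visual_change):
    """Return per-frame progress in [0, 1].

    subtask_ids:   one integer subtask index per frame, in contiguous segments.
    visual_change: non-negative change magnitude of each frame w.r.t. the previous one
                   (hypothetical scalar score; how it is measured is not specified here).
    """
    subtask_ids = np.asarray(subtask_ids)
    visual_change = np.asarray(visual_change, dtype=float)
    subtasks = np.unique(subtask_ids)      # sorted subtask indices
    budget = 1.0 / len(subtasks)           # assumption: equal budget per subtask
    progress = np.zeros(len(subtask_ids))
    completed = 0.0
    for s in subtasks:
        idx = np.flatnonzero(subtask_ids == s)
        change = visual_change[idx]
        total = change.sum()
        if total > 0:
            # Share of this subtask's budget earned so far, driven by visual change.
            frac = np.cumsum(change) / total
        else:
            # Assumption: fall back to linear-in-time when no change is observed.
            frac = np.arange(1, len(idx) + 1) / len(idx)
        progress[idx] = completed + budget * frac
        completed += budget
    return progress
```

In this sketch a frame with zero visual change earns no additional progress, which is the behavior a time-interpolated label cannot express.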
cs.RO / 18 / 2605.08799
ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation
ElasticFlow:一种具有弹性时间范围的物理一致性单步策略,用于语言引导的操控
Abstract
Diffusion policies have demonstrated exceptional performance in embodied AI. However, their iterative denoising process results in high latency, and existing acceleration methods often sacrifice physical consistency. To address this, we propose ElasticFlow, a distillation-free, physics-consistent one-step policy framework. We reconstruct the Mean Field Theory by directly modeling the average velocity field, enabling a direct single-step mapping from noise to action. Addressing the Temporal Heterogeneity of robotic tasks, we introduce the Elastic Time Horizons mechanism. This mechanism effectively overcomes Spectral Bias by explicitly encoding control granularity, achieving efficient alignment between semantic instructions and physical execution horizons. Experiments on benchmarks such as LIBERO, CALVIN, and RoboTwin demonstrate that ElasticFlow achieves efficient 1-NFE inference (approximately 71Hz). Furthermore, it outperforms state-of-the-art methods, including OpenVLA and $\pi_0$, on long-horizon tasks, highlighting its potential for efficient, robust, and semantically aligned control.
Chinese Translation
扩散策略在具身人工智能中表现出色。然而,它们的迭代去噪过程导致高延迟,而现有的加速方法往往牺牲物理一致性。为了解决这个问题,我们提出了ElasticFlow,一种无蒸馏、物理一致性的单步策略框架。我们通过直接建模平均速度场重构了均场理论,从而实现了从噪声到动作的直接单步映射。针对机器人任务的时间异质性,我们引入了弹性时间范围机制。该机制通过明确编码控制粒度,有效克服了谱偏差,实现了语义指令与物理执行范围之间的高效对齐。在LIBERO、CALVIN和RoboTwin等基准测试中的实验表明,ElasticFlow实现了高效的1-NFE推理(约71Hz)。此外,它在长时间范围任务中超越了最先进的方法,包括OpenVLA和$\pi_0$,突显了其在高效、稳健和语义对齐控制方面的潜力。
cs.RO / 19 / 2605.08804
Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion
面向高保真和多功能四足运动的约束感知扩散先验
Abstract
Reinforcement learning combined with imitation learning has significantly advanced biomimetic quadrupedal locomotion. However, scaling these frameworks to massive, multi-source datasets exposes fundamental bottlenecks. First, traditional GAN-based discriminators are prone to mode collapse, struggling to capture diverse motion distributions from uncurated datasets. Second, existing kinematic priors suffer from out-of-distribution (OOD) tracking conflicts, leading to severe unintended heading drifts during complex maneuvers. Furthermore, deploying unconstrained priors to physical hardware poses critical safety risks by disregarding actuator dynamics. To overcome these challenges, we propose Diff-CAST (Diffusion-guided Constraint-Aware Symmetric Tracking), a novel motion prior framework leveraging the multi-modal distribution modeling capabilities of diffusion models for stylistic rewards. Diff-CAST effectively replaces traditional GAN discriminators, unlocking robust data scaling on heterogeneous collections. To ensure high-fidelity intent execution and reliable real-world deployment, we introduce a comprehensive Sim2Real architecture integrating Symmetric Augmented Command Conditioning (SACC) for drift-free tracking, and Constrained RL for hardware safety. Experiments on a quadruped demonstrate that Diff-CAST mitigates mode collapse, enables seamless transitions between diverse skills, and ensures robust, hardware-compliant locomotion.
Chinese Translation
结合强化学习与模仿学习显著推动了仿生四足运动的发展。然而,将这些框架扩展到大规模多源数据集时暴露出了一些基本瓶颈。首先,传统的基于生成对抗网络(GAN)的判别器容易出现模式崩溃,难以从未经整理的数据集中捕捉多样的运动分布。其次,现有的运动学先验在分布外(OOD)跟踪中存在冲突,导致在复杂机动过程中出现严重的意外偏航漂移。此外,将不受约束的先验应用于物理硬件时,由于忽视了执行器动态,带来了重要的安全风险。为了解决这些挑战,我们提出了Diff-CAST(扩散引导的约束感知对称跟踪),这是一个新颖的运动先验框架,利用扩散模型的多模态分布建模能力来实现风格奖励。Diff-CAST有效替代了传统的GAN判别器,解锁了在异构集合上的强大数据扩展能力。为了确保高保真的意图执行和可靠的现实世界部署,我们引入了一个综合的Sim2Real架构,整合了对称增强指令条件(SACC)以实现无漂移跟踪,以及约束强化学习(Constrained RL)以确保硬件安全。在四足机器人上的实验表明,Diff-CAST减轻了模式崩溃,能够在多样技能之间实现无缝过渡,并确保稳健的、符合硬件要求的运动。
cs.RO / 20 / 2605.08831
AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System
AssemPlanner:一种基于多智能体的灵活装配系统任务规划框架
Abstract
In flexible assembly systems, existing task planning methods require a time-consuming configuration process by multiple experts to establish a production line for a new product. To address this challenge, we propose a multi-agent-based task planning framework for flexible assembly systems, denoted as AssemPlanner. It takes tasks described in natural language as input, which are then converted into actionable sequential production operations. It comprises several specialized agents, including SchedAgent, KnowledgeAgent, and LineBalanceAgent, together with a scene graph. Within the proposed framework, SchedAgent serves as the central reasoning engine. Departing from traditional static pipelines, AssemPlanner utilizes a ReAct-based SchedAgent to adaptively adjust actions via multi-agent feedback. By observing the feedback from KnowledgeAgent, LineBalanceAgent, and the scene graph, it autonomously resolves complex industrial process constraints. To facilitate reproducibility, all code and datasets are released at https://github.com/chz332/Assemplanner.
Chinese Translation
在灵活装配系统中,现有的任务规划方法需要多个专家耗时配置,以建立新产品的生产线。为了解决这一挑战,我们提出了一种基于多智能体的灵活装配系统任务规划框架,称为AssemPlanner。该框架以自然语言描述的任务作为输入,然后将其转换为可执行的顺序生产操作。它由多个专门的智能体组成,包括SchedAgent、KnowledgeAgent、LineBalanceAgent和场景图。在该框架中,SchedAgent作为中央推理引擎。与传统的静态管道不同,AssemPlanner利用基于ReAct的SchedAgent,通过多智能体反馈自适应地调整行动。通过观察来自KnowledgeAgent、LineBalanceAgent和场景图的反馈,它能够自主解决复杂的工业过程约束。为了促进可重复性,所有代码和数据集已在https://github.com/chz332/Assemplanner发布。
cs.RO / 21 / 2605.08879
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
通过保守的监督微调在流匹配视觉-语言-动作模型中保留基础能力
Abstract
Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($\pi_0$, $\pi_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.
Chinese Translation
对流匹配视觉-语言-动作(VLA)模型的无约束微调会导致参数的密集覆盖,从而降低预训练能力。我们提出了保守的监督微调(ConSFT),这是一种优化目标,旨在适应目标分布,同时减轻灾难性遗忘,且不需要任何先前数据或架构开销。通过根据模型置信度动态调整学习信号,ConSFT抑制来自低置信度样本的过度梯度,以防止不成比例的参数更新,从而限制内在参数干扰的风险。受到强化学习信任区域裁剪的启发,该公式建立了一种渐进的学习动态,以确保目标收敛和先前能力的保留,保持稀疏的参数更新,而不依赖于显式正则化所需的并行参考网络。我们在LIBERO和RoboTwin基准上评估了ConSFT,涵盖了最先进的流匹配VLA($\pi_0$、$\pi_{0.5}$和GR00T-N1.6-3B)。该方法在能力保留方面比普通的SFT平均提高了超过20%的绝对边际,在无先前数据的情况下与数据密集型经验重放的有效性相匹配。实际的机器人部署确认ConSFT在下游适应过程中避免了空间过拟合,保留了预训练的物理技能,同时获取了顺序目标任务。
cs.RO / 22 / 2605.08937
Raymoval: Raycasting-based Dynamic Object Removal for Static 3D Mapping
Raymoval:基于光线投射的静态3D映射中的动态物体移除
Abstract
Static mapping is fundamental to robot navigation, providing a persistent geometric prior and a consistent reference for long-term autonomy. However, dynamic objects leave residual traces and cause surface loss, which reduces map consistency. We propose a raycasting-based module for dynamic object removal in static 3D mapping. Each scan is projected onto an azimuth-elevation grid, and for every viewing direction we compare the bin-wise minimum range with the map's first-hit distance computed by raycasting. Furthermore, we apply a raycast consistency test that separates dynamic from static points. Finally, a spatial consistency validation step refines labels, producing static maps with lower residual dynamics and reduced over-removal. We evaluate our approach quantitatively and qualitatively on SemanticKITTI and a challenging custom dataset, and show consistent static mapping results.
Chinese Translation
静态映射是机器人导航的基础,提供了持久的几何先验和长期自主性的可靠参考。然而,动态物体留下的残余痕迹会导致表面损失,从而降低地图的一致性。我们提出了一种基于光线投射的动态物体移除模块,用于静态3D映射。每个扫描都被投影到方位-仰角网格上,对于每个视角方向,我们将每个箱子的最小范围与通过光线投射计算的地图的首次命中距离进行比较。此外,我们应用了一种光线投射一致性测试,以区分动态点和静态点。最后,一个空间一致性验证步骤对标签进行精细化,生成具有较低残余动态和减少过度移除的静态地图。我们在SemanticKITTI和一个具有挑战性的自定义数据集上对我们的方法进行了定量和定性评估,并展示了一致的静态映射结果。
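The per-direction comparison described above can be sketched as follows. This is only an illustration, not the Raymoval implementation: the grid resolution, the elevation limits, the `margin` threshold, and the rule "the scan sees farther than the map's first hit" are assumptions, and the map's first-hit distances are taken as a precomputed array rather than raycast here.

```python
import numpy as np

def min_range_image(points, n_az=360, n_el=32, el_min=-0.45, el_max=0.45):
    """Project sensor-frame points (N, 3) onto an azimuth-elevation grid.

    Each bin keeps the minimum range of the points falling into it;
    empty bins hold np.inf. Angles are in radians.
    """
    points = np.asarray(points, dtype=float)
    r = np.linalg.norm(points, axis=1)
    keep = r > 0
    points, r = points[keep], r[keep]
    az = np.arctan2(points[:, 1], points[:, 0])            # in [-pi, pi]
    el = np.arcsin(np.clip(points[:, 2] / r, -1.0, 1.0))
    ai = np.floor((az + np.pi) / (2 * np.pi) * n_az).astype(int) % n_az
    ei = np.floor((el - el_min) / (el_max - el_min) * n_el).astype(int)
    valid = (ei >= 0) & (ei < n_el)
    img = np.full((n_el, n_az), np.inf)
    # Unbuffered minimum so that several points in one bin are all considered.
    np.minimum.at(img, (ei[valid], ai[valid]), r[valid])
    return img

def flag_dynamic_bins(scan_min_range, map_first_hit, margin=0.5):
    """Flag directions where the scan sees clearly past the map's first hit.

    Assumption: a map surface that the current ray passes through was left
    behind by a moving object. map_first_hit has the same shape as the grid.
    """
    return np.isfinite(scan_min_range) & (scan_min_range > map_first_hit + margin)
```

The margin guards against flagging static surfaces because of range noise; a real system would still need the consistency tests mentioned in the abstract to limit over-removal.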
cs.RO / 23 / 2605.08947
A low-cost mockup to simulate robotic laser cutting in nuclear decommissioning
一种低成本的模拟装置用于核退役中的机器人激光切割模拟
Abstract
This paper introduces a low-cost experimental mockup to simulate the laser cutting process of containers in nuclear decommissioning. It is composed of a three-axis table supporting a cuboid container with ultraviolet-sensitive faces, a six-degree-of-freedom serial manipulator holding an ultraviolet torch that simulates the laser, and a visual system based on cameras and fiducial markers. The system employs a constrained task-space adaptive motion controller that compensates for inaccurate parameters and eliminates the need to calibrate the system. Furthermore, as the motion controller explicitly accounts for geometric constraints, the robot reactively avoids collisions with obstacles while handling the ultraviolet torch. To enhance tracking of the laser-cutting path, we control the ultraviolet beam, which requires only four degrees of freedom, instead of the full end-effector pose. Experiments show that, despite an initially uncalibrated system, the overall system is capable of tracking different trajectories with an overall mean accuracy of 3.9 (sd 2.5) mm when the end-effector pose is controlled and 2.4 (sd 1.3) mm when the ultraviolet beam is controlled.
Chinese Translation
本文介绍了一种低成本的实验模拟装置,用于模拟核退役中容器的激光切割过程。该装置由一个三轴工作台支撑一个具有紫外线敏感面的立方体容器,一个持有模拟激光的紫外线灯的六自由度串联机械手,以及一个基于摄像头和基准标记的视觉系统组成。该系统采用了一种受限任务空间自适应运动控制器,能够补偿不准确的参数,并消除对系统进行校准的需求。此外,由于运动控制器明确考虑了几何约束,机器人在操作紫外线灯时能够主动避免与障碍物的碰撞。为了增强激光切割路径的跟踪,我们控制紫外线束,仅需四个自由度,而不是完整的末端执行器姿态。实验表明,尽管系统最初未经过校准,但整体系统在控制末端执行器姿态时能够以3.9(标准差2.5)毫米的平均精度跟踪不同轨迹,而在控制紫外线束时则能够达到2.4(标准差1.3)毫米的精度。
cs.RO / 24 / 2605.09005
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
基于后门的视觉-语言-动作模型所有权验证研究
Abstract
Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present GuardVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which the trigger projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that GuardVLA enables reliable ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.
Chinese Translation
视觉-语言-动作模型(VLA)通过直接从多模态输入中启用端到端决策策略,支持通用机器人控制。随着训练好的VLA被越来越多地共享和适应,保护模型所有权对于安全部署和负责任的开源使用变得至关重要。本文提出了GuardVLA,这是第一个专门为VLA设计的基于后门的所有权验证框架。GuardVLA在训练过程中通过将秘密信息注入具象视觉数据,将一个隐秘且无害的后门水印嵌入受保护的模型中。为了进行发布后的验证,我们提出了一种交换检测机制,其中触发投影器和外部分类器头用于根据预测概率激活和检测嵌入的后门。针对多个数据集、模型架构和适应设置的广泛实验表明,GuardVLA能够实现可靠的所有权验证,同时保持良好的任务性能。进一步的结果表明,嵌入的水印在发布后的模型适应中仍然可被检测。
cs.RO / 25 / 2605.09046
Terminal Matters: Kinodynamic Planning with a Terminal Cost and Learned Uncertainty in Belief State-Cost Space
终端问题:具有终端成本和学习不确定性的信念状态-成本空间的运动学动态规划
Abstract
In many real-world robotic tasks, robots must generate dynamically feasible motions that reliably reach desired goals even under uncertainty. Yet existing sampling-based kinodynamic planners typically optimize accumulated trajectory costs and treat goal reaching as a feasibility check, rather than explicitly optimizing terminal-state quality, such as goal preference or goal-reaching reliability. In this work, we introduce a terminal-cost formulation for kinodynamic planning that allows terminal-state quality to be optimized alongside accumulated trajectory cost. We prove that AO-RRT, an asymptotically optimal kinodynamic planner, preserves its asymptotic optimality under this augmented objective. We further extend the formulation to belief space and prove that minimizing the Wasserstein distance between the terminal belief and the goal improves a lower bound on the probability of reaching the goal region. The resulting planner, KiTe, uses this terminal-cost objective to encode goal preferences and improve reliability under uncertainty. To support systems without analytical uncertainty models, we learn dynamics and process uncertainty directly from data and integrate the learned belief dynamics into planning. Experiments on Flappy Bird, Car Parking, and Planar Pushing show that KiTe consistently improves goal-reaching success under uncertainty. Real-world Planar Pushing experiments further demonstrate that KiTe can plan effectively with learned dynamics and uncertainty. Source code is available at https://github.com/elpis-lab/KiTe.
Chinese Translation
在许多现实世界的机器人任务中,机器人必须生成动态可行的运动,以可靠地在不确定性下到达期望目标。然而,现有的基于采样的运动学动态规划器通常优化累积轨迹成本,并将到达目标视为可行性检查,而不是明确优化终端状态质量,例如目标偏好或到达目标的可靠性。在本研究中,我们引入了一种终端成本的运动学动态规划公式,允许终端状态质量与累积轨迹成本一起进行优化。我们证明了 AO-RRT(渐近最优运动学规划器)在这一增强目标下保持其渐近最优性。我们进一步将该公式扩展到信念空间,并证明最小化终端信念与目标之间的 Wasserstein 距离可以提高到达目标区域的概率下界。最终的规划器 KiTe 使用这一终端成本目标来编码目标偏好,并在不确定性下提高可靠性。为了支持没有解析不确定性模型的系统,我们直接从数据中学习动态和过程不确定性,并将学习到的信念动态整合到规划中。在 Flappy Bird、汽车停车和二维推送的实验中,KiTe 在不确定性下始终提高了到达目标的成功率。现实世界的二维推送实验进一步证明了 KiTe 能够有效地利用学习到的动态和不确定性进行规划。源代码可在 https://github.com/elpis-lab/KiTe 获取。
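As a concrete reference point for the terminal cost mentioned above, the 2-Wasserstein distance has a closed form when both distributions are Gaussian. The sketch below assumes that the terminal belief and the goal are both represented as Gaussians, which the abstract does not state; the function names are hypothetical.

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    a = 0.5 * (a + a.T)                    # remove numerical asymmetry
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)              # clamp tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def gaussian_w2(mean_a, cov_a, mean_b, cov_b):
    """2-Wasserstein distance between N(mean_a, cov_a) and N(mean_b, cov_b).

    W2^2 = ||mean_a - mean_b||^2
           + tr(cov_a + cov_b - 2 (cov_b^{1/2} cov_a cov_b^{1/2})^{1/2})
    """
    mean_a = np.asarray(mean_a, dtype=float)
    mean_b = np.asarray(mean_b, dtype=float)
    cov_a = np.asarray(cov_a, dtype=float)
    cov_b = np.asarray(cov_b, dtype=float)
    root_b = _sqrtm_psd(cov_b)
    cross = _sqrtm_psd(root_b @ cov_a @ root_b)
    squared = np.sum((mean_a - mean_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))
```

The distance penalizes both the offset of the belief mean from the goal and the mismatch in spread, so a terminal belief that is centered on the goal but very uncertain still incurs a cost.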
cs.RO / 26 / 2605.09050
Automated Robotic Moisture Monitoring in Agricultural Fields
农业领域自动化机器人湿度监测
Abstract
Monitoring the soil moisture level across a large-scale plantation is tedious. The main objective of this project is to use a robotic kit in collaboration with on-field moisture sensor circuits, thereby creating an efficient and economical moisture monitoring system. A large agricultural field is divided into smaller grids, each fitted with a moisture sensor. Whenever a sensor reports the soil to be dry, the robot goes to the concerned field for inspection. The path to the concerned field is found by applying Dijkstra's shortest path algorithm on the aerial image of the field. The robot then calculates the total moisture content of the field using suitable image processing algorithms and reports it accordingly. To develop and test this work, a small study field was set up, with a camera mounted above it at an appropriate height to capture its aerial view. This work thus yields a prototype of an automated system for monitoring the moisture of agricultural fields.
Chinese Translation
在大规模种植园中监测土地的湿度水平是一项繁琐的工作。本项目的主要目标是使用机器人套件与现场湿度传感器电路相结合,从而创建一个高效且经济的湿度监测系统。将大农业领域划分为较小的网格,每个网格中放置一个湿度传感器。当传感器报告土壤干燥时,机器人将前往相关区域进行检查。通过在该区域的航拍图像上应用Dijkstra最短路径算法来找到前往相关区域的路径。然后,机器人使用合适的图像处理算法计算该区域的总湿度含量并进行相应报告。为了开发和测试这项工作,建立了一个小型研究区域,并在适当高度上安装了一台摄像机以捕捉其航拍视图。因此,通过这项工作开发了一个用于监测农业领域湿度的自动化系统原型。
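The path-finding step can be illustrated with a minimal Dijkstra sketch on an occupancy grid. It assumes the aerial image has already been converted into a grid of free (0) and blocked (1) cells with 4-connected moves of unit cost; that conversion and the function name `dijkstra_grid` are assumptions, not details from the abstract.

```python
import heapq

def dijkstra_grid(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells.

    start and goal are (row, col) tuples. Returns the list of cells from
    start to goal, or None when the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            # Walk the predecessor links back to the start.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist[cell]:
            continue                        # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1                  # unit cost per move (assumption)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None
```

With unit costs a breadth-first search would give the same result; Dijkstra's algorithm becomes necessary once cells carry different traversal costs, for example for muddy or uneven ground.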
cs.RO / 27 / 2605.09055
Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts
章鱼协议:通过基础设施即提示实现 AI 代理的一次性硬件发现与控制
Abstract
Recent agentic-robotics systems, from Code-as-Policies to modern vision-language-action (VLA) foundation models, presuppose that drivers, SDKs, or ROS-style primitives for the target hardware already exist. Writing those primitives is the dominant engineering cost of bringing up new hardware for agent control. We present Octopus Protocol, a system that collapses that cost to a single shell command. Given only raw OS access and a language-model API key, a coding agent executes a five-stage pipeline--PROBE, IDENTIFY, INTERFACE, SERVE, DEPLOY--to discover connected devices, infer their capabilities, generate a Model Context Protocol (MCP) server with typed tools, and deploy it as a live HTTP endpoint. A persistent daemon then monitors the system, heals broken code, and perceives physical state through the camera tools it generated for itself. Two architectural principles make this work: protocols are prompts, not code, and the coding agent is the runtime. We validate the system on three heterogeneous platforms (PC/WSL, Apple Silicon macOS, Raspberry Pi 4) and on a commercial 6-DOF robotic arm with USB camera feedback. One command onboards the hardware in ~10-15 minutes and exposes up to 30 MCP tools; an MCP-compliant client then performs closed-loop visual-motor control through tools no human wrote.
Chinese Translation
近期的代理机器人系统,从代码即政策到现代视觉-语言-动作(VLA)基础模型,假设目标硬件的驱动程序、SDK 或 ROS 风格的原语已经存在。编写这些原语是将新硬件引入代理控制的主要工程成本。我们提出了章鱼协议(Octopus Protocol),一个将这一成本压缩至单个 shell 命令的系统。只需原始操作系统访问权限和语言模型 API 密钥,编码代理便可执行一个五阶段管道——探测(PROBE)、识别(IDENTIFY)、接口(INTERFACE)、服务(SERVE)、部署(DEPLOY)——以发现连接的设备、推断其能力、生成带有类型工具的模型上下文协议(Model Context Protocol, MCP)服务器,并将其作为实时 HTTP 端点进行部署。一个持久的守护进程随后监控系统,修复损坏的代码,并通过其为自己生成的相机工具感知物理状态。两个架构原则使这一工作得以实现:协议是提示,而非代码,编码代理是运行时。我们在三个异构平台(PC/WSL、Apple Silicon macOS、Raspberry Pi 4)以及一个带有 USB 摄像头反馈的商业 6 自由度(6-DOF)机器人手臂上验证了该系统。一个命令在约 10-15 分钟内完成硬件的接入,并暴露多达 30 个 MCP 工具;然后,符合 MCP 标准的客户端通过没有人编写的工具执行闭环视觉-运动控制。
cs.RO / 28 / 2605.09073
Smoothing Out the Edges: Continuous-Time Estimation with Gaussian Process Motion Priors on Factor Graphs
平滑边缘:基于因子图的高斯过程运动先验的连续时间估计
Abstract
Continuous-time state estimation is gaining in popularity due to its abilities to provide smooth solutions, handle asynchronous sensors, and interpolate between data points. While there are two main paradigms, parametric (e.g., temporal basis functions, splines) and nonparametric (Gaussian processes), the latter has seen less adoption despite its technical advantages and relative ease of implementation. In this article, we seek to rectify this situation by providing a new simplified explanation of GP continuous-time estimation rooted in the language of factor graphs, which have become the de facto estimation paradigm in much of robotics. To simplify onboarding, we also provide three working examples implemented in the popular GTSAM estimation framework.
Chinese Translation
由于能够提供平滑的解决方案、处理异步传感器以及在数据点之间进行插值,连续时间状态估计正日益受到关注。尽管存在两种主要范式:参数化(例如,时间基函数、样条)和非参数化(高斯过程),但后者尽管在技术上具有优势且相对易于实现,却鲜有应用。在本文中,我们旨在通过提供一个基于因子图语言的新简化解释,来纠正这一现状,因子图已成为许多机器人领域的事实上的估计范式。为了简化入门,我们还提供了三个在流行的GTSAM估计框架中实现的工作示例。
cs.RO / 29 / 2605.09093
HyDRA Scorpion: A Cost-effective and Modular ROV for Real-Time Underwater Inspection, Intervention, and Object Detection
HyDRA Scorpion:一种经济高效的模块化遥控水下机器人用于实时水下检查、干预和物体检测
Abstract
A Remotely Operated Vehicle (ROV) is a tethered underwater robot used for tasks like inspection and intervention. While ROVs are essential tools for underwater science, the high cost of commercial ROVs and a persistent gap between mechanically capable platforms and those with integrated intelligence create a significant barrier to access. HyDRA Scorpion differs from conventional systems by addressing these challenges, integrating an advanced, AI-driven perception stack with in-situ measurement capabilities onto a low-cost, locally manufacturable platform. The system combines 4-DoF maneuverability, dual manipulators, and a custom pressure-tested housing. Experimental results validate the system's robustness and performance. Leak-free operation was confirmed through prolonged pressure testing of the electronics housing to 4 bar in a simulated environment, approximately equivalent to the pressure at a water depth of 304.8 meters, with no moisture ingress detected. The vehicle also demonstrated stable station-keeping, maintaining its position within a tight tolerance of $\pm 0.15$ meters under external disturbances. The onboard AI module achieved a mean Average Precision (mAP) of 0.89 for underwater object detection with real-time inference, along with length measurement and 3D-mapping-based distance measurement. In addition, the 4-DoF manipulator arm can grip objects and provides a dual-function manipulator feature that supports 360-degree tangle-free rotation.
Chinese Translation
遥控水下机器人(ROV)是一种有缆水下机器人,用于检查和干预等任务。尽管它们是水下科学的重要工具,但商业ROV的高成本以及机械能力平台与集成智能平台之间的持续差距,构成了显著的准入障碍。HyDRA Scorpion通过解决这些挑战而与传统系统不同,将先进的、基于人工智能的感知堆栈与原位测量能力集成到一种低成本、可本地制造的平台上。该系统结合了4自由度(4-DoF)机动性、双机械手和定制的压力测试外壳。实验结果验证了系统的稳健性和性能。通过对电子外壳进行长时间的压力测试至4巴,确认了无泄漏操作,相当于模拟环境中304.8米水深的压力,未检测到任何潮湿侵入。该车辆还展示了稳定的驻留能力,在外部干扰下保持其位置在$\pm 0.15$米的严格公差范围内。机载人工智能模块在实时推理、长度和3D映射基础的距离测量下实现了水下物体检测的平均精度(mAP)为0.89。此外,4-DoF机械臂能够抓取并保持双功能机械手特性,支持360度无缠绕旋转。
cs.RO / 30 / 2605.09127
IMPACT: An Implicit Active-Set Augmented Lagrangian for Fast Contact-Implicit Trajectory Optimization
IMPACT:一种隐式活跃集增强拉格朗日法用于快速接触隐式轨迹优化
Abstract
Contact-implicit trajectory optimization (CITO) has attracted growing attention as a unified framework for planning and control in contact-rich robotic tasks. Recent approaches have demonstrated promising results in manipulation and locomotion without requiring a prescribed contact-mode schedule. However, the underlying mathematical programs with complementarity constraints (MPCCs) are well known to remain numerically ill-conditioned, and systematic, scalable solution strategies for CITO remain an active area of research. More efficient and principled solvers that can handle contact constraints are therefore essential to broaden the applicability of CITO. In this work, we develop an augmented-Lagrangian approach for solving MPCC-based CITO with stationarity guarantees. The method can be interpreted as identifying the implicit contact-mode branches on the fly during the trajectory optimization (TO) iterations; we call this approach IMPACT (IMPlicit contact ACtive-set Trajectory optimization). We provide an efficient C++ implementation tailored to trajectory-optimization workloads and evaluate it on the open-source CITO and contact-implicit model predictive control (CI-MPC) benchmarks. On CITO, IMPACT achieves 2.9x-70x speedups over strong baselines (geometric mean 13.8x). On CI-MPC, we show improved control quality for contact-rich trajectories on dexterous manipulation tasks in simulation. Finally, we demonstrate the proposed method on real robotic hardware on a T-shaped object pushing task.
Chinese Translation
接触隐式轨迹优化(CITO)作为一种统一的框架,在接触丰富的机器人任务中进行规划和控制,受到了越来越多的关注。近期的方法在不需要预设接触模式调度的情况下,在操作和运动方面展示了良好的结果。众所周知,具有互补约束的基础数学程序(MPCCs)在数值上仍然表现出不良条件性,而系统化、可扩展的CITO解决策略仍然是一个活跃的研究领域。因此,能够处理接触约束的更高效且有原则的求解器对于拓宽CITO的适用性至关重要。在本研究中,我们开发了一种增强拉格朗日法来解决基于MPCC的CITO,并提供了平稳性保证。该方法可以被解释为在轨迹优化(TO)迭代过程中动态识别隐式接触模式分支;我们将这种方法称为IMPACT(IMPlicit contact ACtive-set Trajectory optimization)。我们提供了一个高效的C++实现,专门针对轨迹优化工作负载,并在开源CITO和接触隐式模型预测控制(CI-MPC)基准上进行了评估。在CITO上,IMPACT相较于强基线实现了2.9倍到70倍的加速(几何平均13.8倍)。在CI-MPC上,我们展示了在仿真中对接触丰富轨迹的灵巧操作任务的控制质量有所改善。最后,我们在真实机器人硬件上展示了所提出的方法,应用于T形物体推送任务。
cs.RO / 31 / 2605.09153
Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation
超越自我对弈:闭环交通仿真中的层次推理与连续运动
Abstract
Closed-loop traffic simulation requires agents that are both scalable and behaviorally realistic. Recent self-play reinforcement learning approaches demonstrate strong scalability, but their equilibrium strategies fail to capture the socially aware behaviors of real human drivers. We propose a hierarchical architecture that goes beyond self-play by combining high-level multi-agent interaction reasoning with low-level continuous trajectory realization. Specifically, a Stackelberg-style Multi-Agent Reinforcement Learning (MARL) module generates interaction-aware intention commands. These commands condition a low-level continuous motion module, translating the strategic intent into physically consistent, scene-responsive control sequences. To mitigate distribution shift in closed-loop deployment, we introduce a hybrid co-training scheme combining MARL with auxiliary recovery supervision. Experiments on a SUMO-based urban network demonstrate that the proposed framework achieves superior control smoothness and safety compared to self-play and passive imitation baselines, while maintaining competitive traffic efficiency.
Chinese Translation
闭环交通仿真需要既具可扩展性又具行为现实性的智能体。近期的自我对弈强化学习方法展示了强大的可扩展性,但其均衡策略未能捕捉到真实人类驾驶员的社会意识行为。我们提出了一种层次架构,超越自我对弈,通过结合高层次的多智能体交互推理与低层次的连续轨迹实现。具体而言,一个斯塔克尔伯格风格的多智能体强化学习(MARL)模块生成交互感知的意图指令。这些指令为低层次的连续运动模块提供条件,将战略意图转化为物理一致、场景响应的控制序列。为了减轻闭环部署中的分布偏移,我们引入了一种混合共同训练方案,将MARL与辅助恢复监督相结合。在基于SUMO的城市网络上的实验表明,所提出的框架在控制平滑性和安全性方面优于自我对弈和被动模仿基线,同时保持竞争性的交通效率。
cs.RO / 32 / 2605.09171
SHIELD: Scalable Optimal Control with Certification using Duality and Convexity
SHIELD:基于对偶性和凸性的可扩展最优控制认证方法
Abstract
We present SHIELD, a hierarchical algorithm that reduces both the decision-variable dimension and the constraint set in $\ell_1$-regularized convex programs. From strong convexity and Lagrangian duality, we derive certificates that \emph{safely} discard constraints and decision variables while guaranteeing that all removed constraints remain satisfied and all removed variables are null. To further accelerate the proposed algorithm, we propose a transformer-based deep neural network to guide the dual certificate inference. We validate SHIELD on stochastic model predictive control (SMPC) in complex, multi-modal traffic scenarios, comparing against a full-dimensional SMPC policy. Numerical simulations demonstrate order-of-magnitude computational speedups while preserving feasibility and closed-loop safety, highlighting the practicality of certifiably safe, lightweight MPC in complex driving scenes.
Chinese Translation
我们提出了SHIELD,一种分层算法,旨在减少$\ell_1$正则化凸规划中的决策变量维度和约束集。基于强凸性和拉格朗日对偶性,我们推导出证书,能够\emph{安全地}丢弃约束和决策变量,同时保证所有被移除的约束仍然得到满足,所有被移除的变量均为零。为了进一步加速所提算法,我们提出了一种基于变换器的深度神经网络,以指导对偶证书的推断。我们在复杂的多模态交通场景中验证了SHIELD在随机模型预测控制(SMPC)中的应用,并与全维SMPC策略进行了比较。数值仿真显示,在保持可行性和闭环安全性的同时,计算速度提升了数量级,突显了在复杂驾驶场景中认证安全的轻量级模型预测控制的实用性。
cs.RO / 33 / 2605.09216
Continuum Robot Modeling with Action Conditioned Flow Matching
基于动作条件流匹配的连续机器人建模
Abstract
Predicting the shape of tendon driven continuum robots (TDCRs) at steady state from actuation remains challenging due to continuous deformation, complex tendon routing, compliance, friction, and fabrication variability. In this paper, we address this problem as kinematic self modeling conditioned on action. We present a lightweight 3D printed TDCR hardware platform and an RGB-D data collection pipeline with multiple cameras, and we learn a point cloud flow matching model that maps motor actuation states to the robot's settled 3D geometry. The model is trained from randomly sampled quasi static configurations and evaluated on test motor commands within the same TDCR design family and actuation range. We compare against prior 3D deformable object and robot self modeling approaches in both MuJoCo simulation and real hardware experiments. Experiments on simulated 2-, 3-, and 5-module TDCRs and real 2- and 3-module robots show improved shape prediction accuracy under CD and EMD metrics. We further show in simulation that the same conditional formulation generalizes to tip payload as a conditioning input, enabling payload conditioned steady-state shape prediction. These results demonstrate a data driven self modeling framework for quasi static TDCR geometry prediction.
Chinese Translation
从驱动状态预测腱驱动连续机器人(TDCRs)在稳态下的形状仍然具有挑战性,这主要是由于连续变形、复杂的腱布置、柔顺性、摩擦和制造变异性。在本文中,我们将此问题视为基于动作的运动学自我建模。我们提出了一种轻量级的3D打印TDCR硬件平台和一个多摄像头的RGB-D数据采集管道,并学习了一个点云流匹配模型,该模型将电机驱动状态映射到机器人的稳定3D几何形状。该模型是从随机采样的准静态配置中训练而来,并在同一TDCR设计系列和驱动范围内的测试电机指令上进行评估。我们在MuJoCo仿真和真实硬件实验中与先前的3D可变形物体和机器人自我建模方法进行了比较。对模拟的2、3和5模块TDCR以及真实的2和3模块机器人进行的实验显示,在CD和EMD指标下,形状预测准确性得到了改善。我们进一步在仿真中展示了相同的条件公式可以推广到作为条件输入的末端负载,从而实现负载条件下的稳态形状预测。这些结果展示了一种用于准静态TDCR几何形状预测的数据驱动自我建模框架。
cs.RO / 34 / 2605.09344
PECMAN: Perception-enabled Collaborative Multi-Agent Navigation in Unknown Environments
PECMAN:感知驱动的协作多智能体在未知环境中的导航
Abstract
Most path planners assume fully known, static environments, assumptions that fail when robots navigate in dynamic and partially observable environments. SMART-3D addresses these issues by real-time replanning, where it morphs the underlying RRT* tree whenever new obstacles or structures are discovered in the environment. Instead of rebuilding the tree entirely from scratch, SMART-3D prunes invalid nodes and edges and subsequently repairs the disjoint subtrees at hot-nodes to find a new path, thus providing high computational efficiency for real-time adaptability. We extend SMART-3D to perception-enabled collaborative multi-agent navigation (PECMAN) in unknown environments. PECMAN is built upon distributed tree morphing and shared perception strategies, where each agent reacts to environmental changes and morphs its respective tree to replan its path, while simultaneously broadcasting newly discovered structures to other agents, thus enabling them to proactively replan even in areas that have not yet been explored by them. This approach reduces redundant reactions and unnecessary replannings of the agents due to improved situational awareness. The performance of PECMAN was evaluated by 28,000 multi-agent simulations on seven 2D scenarios with different case studies. The results show that PECMAN achieves up to 52% reduction in the team-completion time, while maintaining near 100% success rates. Finally, PECMAN was tested by real experiments on two autonomous robots in a building environment.
Chinese Translation
大多数路径规划器假设环境是完全已知且静态的,但在动态和部分可观测环境中,机器人导航时这一假设往往不成立。SMART-3D通过实时重新规划来解决这些问题,当环境中发现新的障碍物或结构时,它会改变基础的RRT*树。SMART-3D并不是完全从头重建树,而是修剪无效的节点和边,并随后修复热节点的离散子树以寻找新路径,从而为实时适应性提供高计算效率。我们将SMART-3D扩展为感知驱动的协作多智能体导航(PECMAN),以应对未知环境。PECMAN基于分布式树形变换和共享感知策略构建,每个智能体对环境变化作出反应,并改变其相应的树以重新规划路径,同时将新发现的结构广播给其他智能体,从而使它们能够主动在尚未探索的区域重新规划。这种方法通过提高情境意识,减少了智能体的冗余反应和不必要的重新规划。PECMAN的性能通过在七个不同案例研究的二维场景中进行28,000次多智能体模拟进行了评估。结果表明,PECMAN在团队完成时间上最多减少了52%,同时保持近100%的成功率。最后,PECMAN还在建筑环境中通过两台自主机器人进行了真实实验测试。
cs.RO / 35 / 2605.09376
Mismatch-Aware Adaptive Constraint Tightening for Bicycle-Model Trajectory Optimization
考虑不匹配的自适应约束收紧用于自行车模型轨迹优化
Abstract
Trajectory optimization for autonomous vehicles usually relies on the kinematic bicycle model because of its computational simplicity. However, when the planned trajectory is executed under the true vehicle dynamics, which include lateral slip, tire stiffness and yaw-lateral coupling, safety constraints can be violated owing to the model mismatch. In this paper, we make three theoretical contributions. First, we derive a characteristic speed $v_c=\sqrt{C_\alpha L/M}$ which separates two different mismatch regimes: below $v_c$ the dynamic bicycle initially oversteers inward (safe); above $v_c$ it understeers outward (safety-critical). Second, we prove that the peak outward deviation $\varepsilon^*$ follows a $T^2$ horizon scaling whose coefficient transitions between a transient bound $\frac{1}{2}(v^2-v_c^2)\kappa$ and a steady-state bound. Third, we obtain a simulation-free analytical coefficient $a_2^{\mathrm{anal}}=\frac{1}{2}(1-v_c^2/v_{\max}^2)T^2$ that is computable from vehicle parameters and the planning horizon alone. Putting these together, we propose Mismatch-Aware Adaptive Constraint Tightening (MACT), $\epsilon(v,\kappa)=a_2 v^2|\kappa|$, which replaces a fixed worst-case margin by a state-dependent one that is large at high speed/curvature but nearly zero on gentle paths. Eight numerical experiments confirm the scaling laws. MACT reaches 100% safety with 84% less wasted margin than a fixed-margin baseline on the 2-DOF vehicle, extends to a nonlinear leaning bicycle, and in a closed-loop direct-shooting MPC comparison it cuts the applied margin by 34% compared with tube MPC while keeping the same safety.
Chinese Translation
自主车辆的轨迹优化通常依赖于运动学自行车模型,因为其计算简单。然而,当在真实车辆动力学下执行规划轨迹时,由于侧滑、轮胎刚度和偏航-侧向耦合等因素,安全约束可能会因模型不匹配而被违反。本文做出了三项理论贡献。首先,我们推导出一个特征速度 $v_c=\sqrt{C_\alpha L/M}$,该速度将两种不同的不匹配状态分开:在 $v_c$ 之下,动态自行车最初向内过度转向(安全);在 $v_c$ 之上,它向外不足转向(安全关键)。其次,我们证明了峰值外偏差 $\varepsilon^*$ 遵循 $T^2$ 时域缩放规律,其系数在瞬态界限 $\frac{1}{2}(v^2-v_c^2)\kappa$ 和稳态界限之间过渡。第三,我们获得了一个无仿真的解析系数 $a_2^{\mathrm{anal}}=\frac{1}{2}(1-v_c^2/v_{\max}^2)T^2$,该系数仅依赖于车辆参数和规划时间范围进行计算。综合这些,我们提出了考虑不匹配的自适应约束收紧(MACT),$\epsilon(v,\kappa)=a_2 v^2|\kappa|$,它用一个依赖于状态的边际替代了固定的最坏情况边际,该边际在高速/曲率下较大,但在平缓路径上几乎为零。八个数值实验验证了这些缩放规律。MACT 在 2 自由度车辆上实现了 100% 的安全性,同时比固定边际基线减少了 84% 的浪费边际,扩展到非线性倾斜自行车,并且在闭环直接打靶法模型预测控制比较中,与管式(tube)模型预测控制相比,减少了 34% 的应用边际,同时保持相同的安全性。
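The three closed-form expressions quoted in the abstract can be evaluated directly. The following is a minimal sketch, not the authors' implementation: the function names and the numerical vehicle parameters are illustrative assumptions, and only the formulas themselves come from the abstract.

```python
import math


def characteristic_speed(c_alpha, wheelbase, mass):
    """v_c = sqrt(C_alpha * L / M), as stated in the abstract."""
    return math.sqrt(c_alpha * wheelbase / mass)


def analytical_coefficient(v_c, v_max, horizon):
    """a_2^anal = 0.5 * (1 - v_c^2 / v_max^2) * T^2."""
    return 0.5 * (1.0 - (v_c / v_max) ** 2) * horizon ** 2


def mact_margin(a2, v, kappa):
    """State-dependent tightening eps(v, kappa) = a2 * v^2 * |kappa|."""
    return a2 * v ** 2 * abs(kappa)


# Illustrative (assumed) parameters: cornering stiffness in N/rad,
# wheelbase in m, mass in kg, speeds in m/s, horizon in s.
v_c = characteristic_speed(80000.0, 2.7, 1500.0)   # approximately 12 m/s
a2 = analytical_coefficient(v_c, v_max=24.0, horizon=1.0)
print(mact_margin(a2, v=20.0, kappa=0.01))   # fast, curved path: larger margin
print(mact_margin(a2, v=20.0, kappa=0.0))    # straight path: zero margin
```

Note that the coefficient becomes negative when `v_max` is below `v_c`; how that regime should be handled (for example by clamping the margin at zero) is not specified in the abstract and is left open here.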
cs.RO / 36 / 2605.09383
Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level
安全关键的激光雷达-惯性里程计与流形上的确定性保护级别
Abstract
In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform safe tasks. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown but bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimation from the iterated closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.
Chinese Translation
在安全关键场景中,自主导航系统的保护级别对于使移动机器人能够执行安全任务至关重要。然而,现有关于机器人概率导航系统的研究通常使用有限的数据集进行离线准确性评估,并假设结果可以应用于未知的现实环境。因此,目前的自主移动机器人往往缺乏在线安全评估的保护级别。为填补这一空白,我们提出了一种安全关键的激光雷达-惯性里程计(LIO),该系统基于流形上的确定性状态估计提供确定性保护级别。通过采用未知但有界的假设,我们推导出点云噪声与迭代最近点算法估计不确定性之间的简洁闭式关系。利用这一关系,我们设计了一种流形上的椭球集成员过滤器,并将其实现于LIO系统中。借助集成员过滤器的特性,我们的系统提供了估计位置的可行集作为确定性保护级别,为机器人的下游自主操作提供安全参考。实验结果表明,我们的系统能够为各种环境中的多样化机器人提供有效的确定性在线安全参考。
cs.RO / 37 / 2605.09410
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
RePO-VLA:面向视觉-语言-动作模型的恢复驱动策略优化
Abstract
Vision-Language-Action (VLA) models remain brittle in long-horizon, contact-rich manipulation because success-only imitation provides little supervision for execution drift, while failed rollouts are often discarded. We introduce RePO-VLA, a recovery-driven policy optimization framework that assigns distinct roles to success, recovery, and failure trajectories. RePO-VLA first applies Recovery-Aware Initialization (RAI), slicing recovery segments and resetting history so corrective actions depend on the current adverse state rather than the preceding failure. It then learns a Progress-Aware Semantic Value Function (PAS-VF), aligning spatiotemporal trajectory features with instructions and successful references. The resulting labels salvage useful failure prefixes via reliability decay, while low-value labels mark drift and terminal breakdowns, teaching differences among nominal, failed, and corrective actions. The data engine turns adverse states into planner-generated or human-collected corrective rollouts, teaching recovery to the success manifold. Value-Conditioned Refinement (VCR) trains the policy to prefer high-progress actions. At deployment, a fixed high value ($v=1.0$) biases actions toward the learned success manifold without online failure detectors or heuristic retries. We introduce FRBench, with standardized error injection and recovery-focused evaluation. Across simulated and real-world bimanual tasks, RePO-VLA improves robustness, raising adversarial success from 20% to 75% on average and up to 80% in scaled real-world trials.
Chinese Translation
视觉-语言-动作(VLA)模型在长时间、接触丰富的操作中仍然脆弱,因为仅依赖成功的模仿提供的监督对于执行漂移帮助不大,而失败的回滚往往被丢弃。我们提出了RePO-VLA,一种恢复驱动的策略优化框架,为成功、恢复和失败轨迹分配了不同的角色。RePO-VLA首先应用恢复感知初始化(Recovery-Aware Initialization, RAI),切分恢复段并重置历史,使得纠正行动依赖于当前的不利状态而非先前的失败。接着,它学习了一种进度感知语义价值函数(Progress-Aware Semantic Value Function, PAS-VF),将时空轨迹特征与指令和成功参考对齐。生成的标签通过可靠性衰减挽救有用的失败前缀,而低价值标签则标记漂移和终端故障,教会名义、失败和纠正行动之间的差异。数据引擎将不利状态转化为规划器生成或人工收集的纠正回滚,教会成功流形的恢复。价值条件精炼(Value-Conditioned Refinement, VCR)训练策略偏向高进展行动。在部署时,固定的高价值($v=1.0$)使得行动倾向于学习到的成功流形,而无需在线失败检测器或启发式重试。我们引入了FRBench,提供标准化的错误注入和以恢复为重点的评估。在模拟和真实的双手任务中,RePO-VLA提高了鲁棒性,平均将对抗成功率从20%提升至75%,在扩展的真实世界试验中甚至达到80%。
cs.RO / 38 / 2605.09441
Beyond Isolation: A Unified Benchmark for General-Purpose Navigation
超越孤立:通用导航的统一基准
Abstract
The pursuit of general-purpose embodied agents is hindered by fragmented evaluation protocols that isolate navigation skills and fixate on specific robot morphologies, failing to reflect real-world scenarios where agents must orchestrate diverse behaviors across varying embodiments. To bridge this gap, we introduce OmniNavBench, a benchmark for cross-skill coordination and cross-embodiment generalization. OmniNavBench introduces three paradigm shifts: (1) Compositional Complexity. We propose composite instructions that interleave sub-tasks from 6 categories (PointNav, VLN, ObjectNav, SocialNav, Human Following and EQA), compelling agents to transition between exploration, interaction, and social compliance within a single episode. (2) Morphological Universality and Sensor Flexibility. We present a simulation platform that breaks the reliance on single-morphology evaluation, enabling generalization tests across humanoid, quadrupedal, and wheeled robots, with a modular sensor interface and 170 environments blending synthetic assets with real-world scans. (3) Demonstration Quality. Moving beyond shortest-path algorithms, we curate 1779 expert trajectories via human teleoperation, capturing behavioral nuances such as exploratory glances and anticipatory avoidance. Extensive evaluations demonstrate that current methods, despite their claimed unified design, struggle with the complex, interleaved nature of general-purpose navigation. This exposes a critical disparity between existing capabilities and real-world deployment demands, underscoring OmniNavBench as a testbed for the next generation of generalist navigators. Dataset, code, and leaderboard are available at http://omninavbench.cloud-ip.cc.
Chinese Translation
通用具身智能体的追求受到碎片化评估协议的阻碍,这些协议将导航技能孤立开来,并专注于特定的机器人形态,未能反映智能体在现实场景中必须跨不同具身形态协调多种行为的需求。为了解决这一问题,我们引入了 OmniNavBench,这是一个用于跨技能协调和跨形态泛化的基准。OmniNavBench 引入了三个范式转变:(1)组合复杂性。我们提出了复合指令,这些指令交错了来自六个类别(点导航(PointNav)、视觉-语言导航(VLN)、物体导航(ObjectNav)、社交导航(SocialNav)、人类跟随(Human Following)和具身问答(EQA))的子任务,迫使智能体在单个情节中在探索、互动和社会合规之间切换。(2)形态普遍性和传感器灵活性。我们展示了一个模拟平台,打破了对单一形态评估的依赖,使得在类人、四足和轮式机器人之间进行泛化测试成为可能,配备了模块化传感器接口和170个融合合成资产与现实世界扫描的环境。(3)演示质量。超越最短路径算法,我们通过人类遥控操作策划了1779条专家轨迹,捕捉了探索性观察和预期回避等行为细微差别。广泛的评估表明,尽管现有方法声称具有统一设计,但在通用导航的复杂交错特性面前仍显得捉襟见肘。这揭示了现有能力与现实世界部署需求之间的关键差距,强调了 OmniNavBench 作为下一代通用导航器测试平台的重要性。数据集、代码和排行榜可在 http://omninavbench.cloud-ip.cc 获取。
cs.RO / 39 / 2605.09465
High Precision Hydraulic Excavator Control for Heavy-Duty Grading
重型平整作业的高精度液压挖掘机控制
Abstract
High-precision heavy-duty grading is a common step in earthworks, traditionally carried out manually by skilled operators. Removing a significant amount of material while achieving a high-precision surface requires substantial machine-specific experience. Different hydraulic architectures react differently to operator inputs and soil interaction forces, which makes generalizable controllers challenging. In this paper, we present an autonomous controller that achieves high-precision grading at expert-operator speed on Load Sensing and Negative Flow Control machines alike. We split our controller into two parts: (1) a hydraulic-aware low-level loop that is hydraulic architecture-specific and (2) a path-tracking layer that coordinates joint motions and responses. Through a calibration process, our technique is applicable to load-sensing and negative-flow-control machinery. To showcase its versatility, we benchmark our approach on two excavators with different hydraulics and compare it against a commercial state-of-the-art solution. Our technique (RMSE 1.8~cm) outperforms the commercial solution (RMSE 4.7~cm) in precision by a factor of 2.6 and improves machine usage by leveraging the maximum function pressure, as opposed to commercial solutions that stall prematurely.
Chinese Translation
高精度重型平整作业是土方工程中的一个常见步骤,传统上由熟练的操作员手动完成。在去除大量材料的同时实现高精度表面需要丰富的机器特定经验。不同的液压架构对操作员输入和土壤相互作用力的反应各异,这使得通用控制器的开发面临挑战。本文提出了一种自主控制器,能够在负载传感(Load Sensing)和负流量控制(Negative Flow Control)机器上以专家操作员的速度实现高精度平整。我们将控制器分为两个部分:(1)一个与液压架构相关的低级控制环;(2)一个协调关节运动和响应的路径跟踪层。通过校准过程,我们的技术适用于负载传感和负流量控制机械。为了展示其多样性,我们在两台具有不同液压系统的挖掘机上对我们的方法进行了基准测试,并与一种商业最先进的解决方案进行了比较。我们的技术(均方根误差 RMSE 1.8 cm)在精度上比商业解决方案(均方根误差 RMSE 4.7 cm)提高了2.6倍,并通过利用最大功能压力来改善机器使用效率,而不是像商业解决方案那样过早停滞。
cs.RO / 40 / 2605.09494
LASSA Architecture-Based Autonomous Fault-Tolerant Control of Unmanned Underwater Vehicles
基于LASSA架构的无人水下航行器自主容错控制
Abstract
Unmanned underwater vehicles (UUVs) operate persistently in communication-constrained environments, thus requiring high-level autonomous fault-tolerant control under faulty operating conditions. Existing approaches rely heavily on predefined hard-coded rules and struggle to achieve effective fault-tolerant control against unforeseen faults. Although large language models (LLMs) possess powerful cognitive and reasoning capabilities, their inherent hallucinations remain a major obstacle to their application in UUV control systems. This paper proposes an intelligent control method based on the LASSA (LLM-based Agent with Solver, Sensor and Actuator) architecture. Within this architecture, an LLM identifies unknown faults and accomplishes task replanning via autonomous reasoning without hard-coded rules; the intelligent agent undertakes perception, scheduling and decision evaluation; the solver verifies physical boundary feasibility constraints prior to command transmission to the actuators. This architecture suppresses physically infeasible LLM hallucinations and ensures interpretable, verifiable decision-making. Moreover, it enables fast-slow dual closed-loop collaborative control, where the slow loop undertakes high-level dynamic decision-making and the fast loop guarantees high-frequency real-time control, simultaneously balancing decision intelligence and control timeliness. Lake experiments under normal and lower-rudder-fault conditions show that the framework detects trajectory tracking abnormalities, replans the route by adjusting the turning radius from 4m to 12m and reducing speed from 2kn to 1kn, passes all three solver constraints on the first invocation, and guides the UUV to complete the full mission; under normal conditions no false fault alarms are raised throughout the run.
Chinese Translation
无人水下航行器(UUV)在通信受限的环境中持续运行,因此需要在故障操作条件下实现高级自主容错控制。现有方法过于依赖预定义的硬编码规则,难以有效应对不可预见的故障。尽管大型语言模型(LLMs)具备强大的认知和推理能力,但其固有的幻觉仍然是其在UUV控制系统中应用的主要障碍。本文提出了一种基于LASSA(基于LLM的具有求解器、传感器和执行器的智能代理)架构的智能控制方法。在该架构中,LLM通过自主推理识别未知故障并完成任务重规划,无需硬编码规则;智能代理负责感知、调度和决策评估;求解器在命令传输给执行器之前验证物理边界可行性约束。该架构抑制了物理上不可行的LLM幻觉,并确保了可解释、可验证的决策过程。此外,它实现了快慢双闭环协同控制,其中慢环进行高级动态决策,快环保证高频实时控制,同时平衡决策智能性和控制时效性。在正常和低舵故障条件下的湖泊实验表明,该框架能够检测轨迹跟踪异常,通过将转弯半径从4米调整至12米、将速度从2节降低至1节来重规划路线,并在第一次调用时通过所有三个求解器约束,成功引导UUV完成整个任务;在正常条件下,整个运行过程中未出现虚假故障警报。
cs.RO / 41 / 2605.09537
Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning
漂移是一种采样误差:面向长时程机器人规划的信噪比感知幂分布
Abstract
Despite rapid progress in Vision-Language-Action (VLA) models for robotic control, instruction drift remains a persistent failure mode in long-horizon tasks. This paper reconceptualizes this phenomenon, positing that instruction drift is fundamentally a systematic sampling error: local greedy sampling is prone to collapsing into "Negative Pivotal Windows"--irreversible local optima with high local probability that sever global success pathways. To address this, we propose Context-Aware Power Sampling (CAPS), a training-free inference-time computation framework. CAPS leverages power distributions to sharpen global trajectory probabilities, enabling lookahead search over the model's conditional generative trajectory distribution. Furthermore, we introduce a metacognitive control mechanism based on Signal-to-Noise Ratio (SNR). This mechanism triggers adaptive MCMC search solely when drift risk is detected, enabling a dynamic transition from "intuitive fast thinking" to "rational slow search." Experiments on RoboTwin, Simpler-WindowX, and Libero-long benchmarks show that CAPS achieves substantial improvements over strong baselines, including OpenVLA and TACO, without parameter updates. These results support the effectiveness of adaptive inference-time computation for improving long-horizon robustness in embodied control.
Chinese Translation
尽管在机器人控制的视觉-语言-动作(VLA)模型方面取得了快速进展,但指令漂移仍然是长时间任务中的一种持续失败模式。本文重新概念化了这一现象,认为指令漂移本质上是一种系统性的采样误差:局部贪婪采样容易陷入“负关键窗口”(Negative Pivotal Windows)——这些是具有高局部概率的不可逆局部最优解,切断了全局成功路径。为了解决这一问题,我们提出了上下文感知幂采样(Context-Aware Power Sampling, CAPS),这是一种无训练的推理时计算框架。CAPS利用幂分布来锐化全局轨迹概率,使得能够在模型的条件生成轨迹分布上进行前瞻性搜索。此外,我们引入了一种基于信噪比(Signal-to-Noise Ratio, SNR)的元认知控制机制。该机制仅在检测到漂移风险时才触发自适应的马尔可夫链蒙特卡洛(MCMC)搜索,从而实现从“直觉快速思维”到“理性慢搜索”的动态过渡。在RoboTwin、Simpler-WindowX和Libero-long基准测试上的实验表明,CAPS在不更新参数的情况下,相较于强基线(包括OpenVLA和TACO)实现了显著的改进。这些结果支持了自适应推理时计算在提高具身控制的长时间鲁棒性方面的有效性。
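The abstract describes sharpening trajectory probabilities with a power distribution. A power distribution in the usual sense reweights a base distribution as p(x)^alpha and renormalizes. The sketch below illustrates only that reweighting over a finite set of candidate trajectories; the function name, the candidate set, and the value of alpha are assumptions, and the SNR trigger and MCMC search of CAPS are not reproduced here.

```python
import math


def power_sharpen(log_probs, alpha):
    """Return probabilities proportional to p(x)**alpha for each candidate.

    log_probs: log-probabilities of candidate trajectories under the base model.
    alpha:     exponent; alpha > 1 concentrates mass on high-probability candidates.
    """
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.5, 0.3, 0.2]                       # hypothetical candidate probabilities
sharp = power_sharpen([math.log(p) for p in base], alpha=2.0)
print(sharp)                                 # mass shifts toward the first candidate
```

With `alpha=1.0` the base distribution is recovered unchanged, which makes the exponent a convenient single knob for how aggressively the sampler favors globally likely trajectories.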
cs.RO / 42 / 2605.09613
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
SABER:一种可扩展的基于行动的具身数据集,用于现实世界的VLA适应
Abstract
Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber
Chinese Translation
机器人在现实环境中的部署不仅依赖于强大的模型架构,还依赖于丰富的领域特定行动数据。通用机器人基础模型在复杂的未见任务(例如零售领域的操作)中表现平平,尤其是在未经过调试的情况下。根本原因在于数据缺口:零售环境在通用机器人预训练分布中结构性缺失,而通过远程操作填补这一缺口的路径代价高昂、后勤受限且难以扩展。我们引入了SABER,这是一个高保真度的零售机器人行动数据集,基于在多个真实杂货环境中超过100小时的自然店内捕捉而构建。来自头戴式摄像机的自我中心视频记录了交互点的细致手部活动,而来自DreamVu的ALIA摄像机的外部360度场景视频则同时观察整个空间内的所有参与者和活动。这种组合提供了人类零售行为的独特完整图景:灵巧的手部活动、全身运动和场景动态,所有这些都是在没有布景、剧本或远程操作负担的情况下捕捉的。SABER语料库包含44.8K个训练样本,涵盖三种行动表示流:25K个通过LAPA风格编码的潜在行动序列、18.6K个重新定向到机器人关节空间的灵巧手势轨迹,以及1.2K个重新定向到类人具身的全身同步运动序列。当通过共享骨干的多任务后训练方案应用于GR00T N1.6时,SABER在十个零售操作任务中获得了29.3%的平均成功率,超过了微调基线(13.4%)的2.19倍。SABER证明了通向高效零售机器人的道路在于更好的数据,这些数据可以在今天以规模化的方式收集,而无需机器人参与。数据集和代码可在 https://dreamvu.ai/saber 获取。
cs.RO / 43 / 2605.09633
Minimizing Worst-Case Weighted Latency for Multi-Robot Persistent Monitoring: Theory and RL-Based Solutions
最小化多机器人持续监测的最坏情况加权延迟:理论与基于强化学习的解决方案
Abstract
We study multi-robot persistent monitoring on weighted graphs, where node weights encode monitoring priorities and edge weights encode travel distances. The goal is to design joint robot trajectories that minimize the worst-case weighted latency across all nodes over an infinite time horizon. The widely adopted worst-case latency objective evaluates team performance over the entire time horizon and therefore may fail to distinguish strategies with poor transient behavior but strong asymptotic performance. To address this limitation, we propose a family of tail-performance objectives that generalize the standard objective and study the resulting functional optimization problems. We establish several key theoretical properties, including the existence of optimal strategies, relationships among the proposed objectives and their corresponding optimization problems, approximation by periodic solutions to arbitrary accuracy, and reductions to event-driven decision models with discretized waiting times. Building on these results, we construct an equivalent event-driven Markov decision process (MDP), called the Tail Worst-case Latency-Optimizing Markov Decision Process (TWLO-MDP), which reformulates the tail-performance objective as a standard average-reward criterion. We then develop reinforcement-learning-based solution methods for the TWLO-MDP and introduce the multi-robot monitoring benchmark (M2Bench), a unified platform that supports the evaluation and comparison of heuristic and learning-based monitoring algorithms. Experiments on synthetic and realistic monitoring scenarios show that our methods effectively reduce the worst-case weighted latency and outperform representative baselines.
Chinese Translation
我们研究了加权图上的多机器人持续监测,其中节点权重编码监测优先级,边权重编码旅行距离。目标是设计联合机器人轨迹,以最小化在无限时间范围内所有节点的最坏情况加权延迟。广泛采用的最坏情况延迟目标评估团队在整个时间范围内的表现,因此可能无法区分具有较差瞬态行为但强大渐近性能的策略。为了解决这一局限性,我们提出了一系列尾部性能目标,这些目标对标准目标进行了推广,并研究了由此产生的函数优化问题。我们建立了几个关键的理论性质,包括最优策略的存在性、所提目标与其对应优化问题之间的关系、通过周期性解的任意精度近似,以及对具有离散等待时间的事件驱动决策模型的简化。在这些结果的基础上,我们构建了一个等效的事件驱动马尔可夫决策过程(MDP),称为尾部最坏情况延迟优化马尔可夫决策过程(TWLO-MDP),该过程将尾部性能目标重新表述为标准的平均奖励标准。随后,我们为TWLO-MDP开发了基于强化学习的解决方法,并引入了多机器人监测基准(M2Bench),这是一个统一的平台,支持启发式和基于学习的监测算法的评估与比较。在合成和现实监测场景中的实验表明,我们的方法有效地减少了最坏情况加权延迟,并超越了代表性的基线。
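To make the objective concrete, the sketch below evaluates the worst-case weighted latency of a periodic visiting schedule. It assumes the common definition in which a node's latency is the longest time between consecutive visits by any robot; the abstract does not spell out the definition, and the function names and the example schedule are hypothetical.

```python
def max_revisit_gap(visit_times, period):
    """Longest time between consecutive visits in a schedule that repeats every `period`."""
    times = sorted(t % period for t in visit_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    gaps.append(times[0] + period - times[-1])   # gap that wraps around the period
    return max(gaps)


def worst_case_weighted_latency(schedule, weights, period):
    """Maximum over nodes of (node weight) * (longest revisit gap).

    schedule: dict mapping node -> list of visit times (by any robot) within one period.
    weights:  dict mapping node -> monitoring priority.
    """
    return max(
        weights[node] * max_revisit_gap(times, period)
        for node, times in schedule.items()
    )


schedule = {"A": [0, 5], "B": [2], "C": [1, 4, 8]}
weights = {"A": 2, "B": 1, "C": 3}
print(worst_case_weighted_latency(schedule, weights, period=10))
```

In this example node C is visited most often yet still dominates the objective because of its high weight, which is the trade-off the weighted formulation is meant to expose. Every node is assumed to be visited at least once per period.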
cs.RO / 44 / 2605.09656
ORICF -- Open Robotics Inference and Control Framework
ORICF -- 开放机器人推理与控制框架
Abstract
Recent advances in artificial intelligence (AI) have enabled effective perception and language models for robots, but their deployment remains computationally expensive, increasing latency and energy use. This work presents the Open Robotics Inference and Control Framework (ORICF), a modular, declarative, and model-agnostic platform for composing multimodal robotic inference pipelines. ORICF integrates input/output (I/O) adapters, pluggable inference back ends, and post-processing logic, while lightweight YAML specifications allow models, hardware targets, and data channels to be changed without code modification. The framework also supports edge offloading, i.e., executing inference on nearby external computers instead of onboard the robot. ORICF is evaluated on a mobile robot that answers spoken queries about people detected in its camera stream by combining automatic speech recognition (ASR), a large language model (LLM), and a convolutional neural network (CNN) detector through Robot Operating System 2 (ROS2). Compared with onboard execution, ORICF-based edge deployment reduces robot-side compute utilization by up to 83.16% and estimated energy consumption by 65.8%, while preserving modularity and reproducibility.
Chinese Translation
近年来,人工智能(AI)的进步使得机器人能够有效地进行感知和语言建模,但其部署仍然计算开销大,导致延迟和能耗增加。本研究提出了开放机器人推理与控制框架(ORICF),这是一个模块化、声明式且与模型无关的平台,用于构建多模态机器人推理管道。ORICF集成了输入/输出(I/O)适配器、可插拔的推理后端和后处理逻辑,同时轻量级的YAML规范允许在不修改代码的情况下更改模型、硬件目标和数据通道。该框架还支持边缘卸载,即在附近的外部计算机上执行推理,而不是在机器人上执行。ORICF在一款移动机器人上进行了评估,该机器人通过结合自动语音识别(ASR)、大型语言模型(LLM)和卷积神经网络(CNN)检测器,回答关于其摄像头流中检测到的人物的语音查询,使用了机器人操作系统2(ROS2)。与在机器人上执行相比,基于ORICF的边缘部署将机器人端计算利用率降低了多达83.16%,估计能耗降低了65.8%,同时保持了模块化和可重复性。
cs.RO / 45 / 2605.09659
ASACK : Adaptive Safe Active Continual Koopman Learning for Uncertain Systems with Contractive Guarantees
ASACK:具有收缩保证的不确定系统自适应安全主动持续Koopman学习
Abstract
Koopman operator theory provides a powerful framework for representing nonlinear dynamics through a linear operator acting on lifted observables, enabling the use of linear control techniques for nonlinear systems. However, Koopman models are typically learned from data and often degrade in performance under model uncertainty and distributional shifts between training and deployment. Although several works have explored online adaptation to address this issue, many rely on neural network-based updates that introduce significant computational overhead and lack formal safety guarantees, limiting their suitability for real-time and safety-critical robotic applications. In this work, we propose a unified framework for continual adaptive Koopman learning that enables safe and efficient online refinement of learned models during task execution. An autoencoder-based Koopman model is first learned offline and subsequently refined online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shifts and model uncertainty. To improve data efficiency and accelerate model refinement, the adaptation mechanism is integrated with an active learning strategy that drives the system to collect informative data while accomplishing task objectives. The resulting control problem is formulated as a nonconvex optimization problem incorporating both active learning objectives and safety constraints. We further derive theoretical bounds on model approximation error and show how these bounds can be incorporated within a robust Model Predictive Control (MPC) framework to provide formal safety guarantees. The proposed approach unifies learning, excitation, and safety within a single control framework without sacrificing real-time feasibility. Extensive simulation and experimental studies demonstrate superior performance compared to state-of-the-art baselines.
Chinese Translation
Koopman算子理论提供了一个强大的框架,通过作用于提升可观测量的线性算子来表示非线性动态,从而使得线性控制技术能够应用于非线性系统。然而,Koopman模型通常是从数据中学习的,并且在模型不确定性和训练与部署之间的分布变化下,性能往往会下降。尽管已有若干研究探讨了在线适应以解决这一问题,但许多方法依赖于基于神经网络的更新,这引入了显著的计算开销,并且缺乏正式的安全保证,限制了它们在实时和安全关键的机器人应用中的适用性。在本研究中,我们提出了一个统一的持续自适应Koopman学习框架,使得在任务执行过程中能够安全且高效地在线优化学习模型。首先,基于自编码器的Koopman模型在离线状态下学习,然后通过收缩适应法则在线进行优化,该法则在分布变化和模型不确定性下提供了理论收敛保证。为了提高数据效率并加速模型优化,适应机制与主动学习策略相结合,推动系统在完成任务目标的同时收集有信息量的数据。由此产生的控制问题被表述为一个非凸优化问题,结合了主动学习目标和安全约束。我们进一步推导了模型近似误差的理论界限,并展示了如何将这些界限纳入一个稳健的模型预测控制(Model Predictive Control, MPC)框架中,以提供正式的安全保证。所提出的方法在一个控制框架内统一了学习、激励和安全,而不牺牲实时可行性。广泛的仿真和实验研究表明,与最先进的基线相比,性能优越。
cs.RO / 46 / 2605.09670
Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models
面向基于视觉的远程操作的生成预测显示:现成视频模型的零样本基准测试
Abstract
Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
Chinese Translation
远程操作系统在根本上受到通信延迟的限制,这会降低情境意识和控制性能。预测显示旨在通过呈现当前视觉状态的估计值而非延迟观察来缓解这一限制。尽管生成视频模型的最新进展使得高质量视频合成成为可能,但它们在延迟敏感的预测显示中的适用性仍不明确。本文提出了一种现成生成视频模型的零样本基准测试,针对短时间范围的预测显示,且未进行特定任务的微调。我们将问题表述为基于滚动推演(rollout)的未来帧预测,并使用来自CARLA模拟器的模拟驾驶数据开发了统一的基准测试管道。评估了五个公开发布的视频模型,涵盖基于Transformer和基于扩散的模型家族,分别在两种分辨率和两种条件设置(多帧和单帧)下进行测试。通过预测准确性(平均绝对差)、每次推演延迟、峰值GPU内存使用和预测范围内的时间误差演变来评估性能。在这一零样本基准测试中,没有任何被测模型能够同时实现低推演误差、非发散的逐步误差行为以及源帧率下的实时推理。增加模型规模或分辨率的效果有限,并且在某些情况下反而导致性能下降。这些发现突显了通用生成视频合成与远程操作中预测显示需求之间的差距,表明实际部署将需要明确的短时间范围时间监督、领域内适应或激进的推理优化,而不是直接应用现成模型。代码、配置和定性结果已在项目页面发布: https://bimilab.github.io/paper-GenPD
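The accuracy metric named in the abstract, mean absolute difference, and its evolution across the prediction horizon can be sketched as follows. This is an illustration under assumptions: frames are represented as flat lists of pixel intensities, and the function names are hypothetical rather than taken from the released code.

```python
def mean_abs_diff(pred, target):
    """Mean absolute difference between two frames given as flat lists of pixel values."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def rollout_error_curve(pred_frames, target_frames):
    """Per-step error across the prediction horizon of one rollout."""
    return [mean_abs_diff(p, t) for p, t in zip(pred_frames, target_frames)]


predicted = [[0.0, 0.0], [1.0, 1.0]]      # two predicted frames of two pixels each
ground_truth = [[0.0, 1.0], [3.0, 1.0]]
print(rollout_error_curve(predicted, ground_truth))
```

A curve that keeps growing with the step index corresponds to the divergent per-step error behavior that the benchmark checks for.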
cs.RO / 47 / 2605.09672
MVB-Grasp: Minimum-Volume-Box Filtering of Diffusion-based Grasps for Frontal Manipulation
MVB-Grasp:用于正面操作的基于扩散的抓取的最小体积盒过滤
Abstract
State-of-the-art 6-DoF grasp generators excel on tabletop benchmarks with overhead cameras but struggle in frontal grasping scenarios on low-cost manipulators with constrained workspaces, where kinematic limits and approach-direction constraints cause high failure rates. We address this challenge for the Unitree Z1 arm by proposing MVB-Grasp, a novel grasping stack that injects a Minimum Volume Bounding Box (MVBB) geometric prior into diffusion-based grasp generation to dramatically improve success rates in frontal, workspace-constrained settings. Our key scientific contributions are threefold: (i) an MVBB-based geometric filter that exploits oriented bounding-box face normals to reject grasps approaching through the table or misaligned with accessible object faces in O(N) time; (ii) a combined re-scoring function that blends learned discriminator scores with face-alignment geometry ($\alpha=0.85$), specifically calibrated for the Z1's frontal workspace and kinematic constraints; and (iii) a systematic MuJoCo evaluation protocol measuring grasp success across object types, distances, lateral positions, and pitch orientations to validate embodiment-specific performance. We implement MVB-Grasp on a Unitree Z1 arm with an Intel RealSense D405 camera, integrating YOLOv8 object detection, GraspGen for candidate generation, Principal Component Analysis (PCA)-based MVBB fitting, and inverse-kinematics trajectory planning. Experiments across 81 MuJoCo episodes (cylinder, asymmetric box, waterbottle) demonstrate that MVB-Grasp achieves 59.3% success versus 24.7% for vanilla GraspGen, a 2.4x improvement, by filtering geometrically infeasible candidates and prioritizing face-aligned grasps suited to the Z1's frontal approach constraints. Real-world trials confirm that the MVBB prior substantially improves grasp reliability on constrained, low-cost manipulators without requiring model retraining.
Chinese Translation
最先进的六自由度抓取生成器在使用顶置摄像头的桌面基准测试中表现出色,但在低成本操控器的正面抓取场景中却面临挑战,因为运动学限制和接近方向约束导致高失败率。我们针对Unitree Z1臂提出了MVB-Grasp,这是一种新颖的抓取堆栈,通过将最小体积边界框(Minimum Volume Bounding Box, MVBB)几何先验注入到基于扩散的抓取生成中,以显著提高在正面、受限工作空间环境中的成功率。我们的主要科学贡献有三方面:(i)一种基于MVBB的几何过滤器,利用定向边界框的面法线在O(N)时间内拒绝通过桌面接近或与可接触物体面不对齐的抓取;(ii)一种结合重新评分函数,将学习到的判别器分数与面对齐几何结合($\alpha=0.85$),专门为Z1的正面工作空间和运动学约束进行校准;(iii)一种系统的MuJoCo评估协议,测量不同物体类型、距离、侧向位置和俯仰方向下的抓取成功率,以验证特定实现的性能。我们在Unitree Z1臂上实现了MVB-Grasp,配备Intel RealSense D405摄像头,整合了YOLOv8物体检测、GraspGen候选生成、基于主成分分析(PCA)的MVBB拟合和逆运动学轨迹规划。通过81个MuJoCo实验(圆柱体、不对称盒、水瓶),实验结果表明,MVB-Grasp的成功率为59.3%,而普通GraspGen的成功率为24.7%,提升了2.4倍,过滤掉了几何上不可行的候选,并优先考虑适合Z1正面接近约束的面对齐抓取。现实世界的试验确认,MVBB先验在受限的低成本操控器上显著提高了抓取可靠性,而无需重新训练模型。
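The filter-then-rescore idea can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the abstract gives $\alpha=0.85$ but does not say which term it weights, so the sketch assumes it weights the discriminator score. The alignment measure, the `min_align` threshold, and all names are likewise hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def face_alignment(approach, face_normals):
    """1.0 when the approach direction points straight into an accessible face.

    approach:     unit vector along which the gripper moves toward the object.
    face_normals: outward unit normals of the accessible bounding-box faces.
    """
    return max(0.0, max(-dot(approach, n) for n in face_normals))


def filter_and_rescore(grasps, face_normals, table_normal=(0, 0, 1),
                       alpha=0.85, min_align=0.5):
    """Single pass over the candidates; the face count is constant, so this is O(N)."""
    kept = []
    for approach, disc_score in grasps:
        if dot(approach, table_normal) > 0.0:   # moving upward: comes through the table
            continue
        align = face_alignment(approach, face_normals)
        if align < min_align:                   # not aligned with any accessible face
            continue
        # Assumed blend: alpha weights the learned score, (1 - alpha) the geometry.
        kept.append((alpha * disc_score + (1.0 - alpha) * align, approach))
    return kept


faces = [(-1, 0, 0), (0, 0, 1)]                 # front and top faces are accessible
candidates = [((1, 0, 0), 0.6), ((0, 0, 1), 0.9), ((0, 1, 0), 0.8), ((0, 0, -1), 0.5)]
best = max(filter_and_rescore(candidates, faces), key=lambda item: item[0])
print(best)
```

Here the highest raw discriminator score belongs to a grasp approaching from below the table, which the geometric filter removes before the blended score is ever computed.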
cs.RO / 48 / 2605.09789
Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching
零样本模拟到现实机器人学习:关于反应捕捉的灵巧操控研究
Abstract
Dexterous manipulation is physics-intensive and highly sensitive to modeling errors and perception noise, making sim-to-real transfer prohibitively challenging. Domain randomization (DR) is commonly used to improve the robustness of learned policies for such tasks, but conventional DR randomizes one instance per episode, offering very limited exposure to the variability of real-world dynamics. To this end, we propose Domain-Randomized Instance Set (DRIS), which represents and propagates a set of randomized instances simultaneously, providing richer approximation of uncertain dynamics and enabling policies to learn actions that account for multiple possible outcomes. Supported by theoretical analysis, we show that DRIS yields more robust policies and alleviates the need for real-world fine-tuning, even with a modest number of instances (e.g., 10). We demonstrate this on a challenging reactive catching task. Unlike traditional catching setups that use end-effectors designed to mechanically stabilize the object (e.g., curved or enclosing surfaces), our system uses a flat plate that offers no passive stabilization, making the task highly sensitive to noise and requiring rapid reactive motions. The learned policies exhibit strong robustness to uncertainties and achieve reliable zero-shot sim-to-real transfer.
Chinese Translation
灵巧操控是一个物理密集型任务,对建模误差和感知噪声高度敏感,这使得模拟到现实的转移异常具有挑战性。领域随机化(Domain Randomization, DR)通常用于提高此类任务中学习策略的鲁棒性,但传统的DR每个回合仅随机化一个实例,提供的真实世界动态变异性暴露非常有限。为此,我们提出了领域随机化实例集(Domain-Randomized Instance Set, DRIS),它同时表示和传播一组随机化实例,提供对不确定动态的更丰富近似,使策略能够学习考虑多种可能结果的动作。通过理论分析,我们表明DRIS能够产生更鲁棒的策略,并减轻对现实世界微调的需求,即使在实例数量较少的情况下(例如,10个)。我们在一个具有挑战性的反应捕捉任务中验证了这一点。与传统捕捉设置使用设计用于机械稳定物体的末端执行器(例如,弯曲或包围表面)不同,我们的系统使用一个平坦的板,没有被动稳定性,使得任务对噪声高度敏感,并需要快速的反应动作。学习到的策略对不确定性表现出强大的鲁棒性,并实现了可靠的零样本模拟到现实转移。
cs.RO / 49 / 2605.09801
Efficient Multi-Robot Motion Planning with Precomputed Translation-Invariant Edge Bundles
基于预计算平移不变边束的高效多机器人运动规划
Abstract
Solving multi-robot motion planning (MRMP) requires generating collision-free kinodynamically feasible trajectories for multiple interacting robots. We introduce Kinodynamic Translation-Invariant Edge Bundles or KiTE-Extend, a planner-agnostic action selection mechanism for sampling-based kinodynamic motion planning. KiTE-Extend uses a library of trajectory segments computed offline to guide action selection during online planning, improving the ability of existing planners to identify feasible motion segments without altering state propagation, collision checking, or cost evaluation, and without changing their theoretical guarantees. While KiTE-Extend can modestly improve single-agent planners, its benefits are most clear in the multi-agent setting, where it is able to explore more effectively and significantly improve planning through the dense spatiotemporal constraints introduced by robot-robot interaction. Through experiments on multiple kinodynamic systems and environments, we show that KiTE-Extend reduces planning time and improves scalability across the three most common MRMP paradigms: centralized, prioritized, and conflict-based.
Chinese Translation
解决多机器人运动规划(MRMP)需要为多个相互作用的机器人生成无碰撞的动力学可行轨迹。我们引入了动力学平移不变边束(Kinodynamic Translation-Invariant Edge Bundles,KiTE-Extend),这是一种与规划器无关的动作选择机制,适用于基于采样的动力学运动规划。KiTE-Extend使用离线计算的轨迹段库来指导在线规划过程中的动作选择,从而提高现有规划器识别可行运动段的能力,而不改变状态传播、碰撞检查或成本评估,也不影响其理论保证。虽然KiTE-Extend可以适度改善单代理规划器的性能,但其优势在多代理环境中尤为明显,在这种情况下,它能够更有效地探索并通过机器人间交互引入的密集时空约束显著改善规划。通过在多个动力学系统和环境中的实验,我们展示了KiTE-Extend在三个最常见的MRMP范式(集中式、优先级和基于冲突的)中减少了规划时间并提高了可扩展性。
cs.RO / 50 / 2605.09811
Above and Below: Heterogeneous Multi-robot SLAM Across Surface and Underwater Domains
上方与下方:跨越地面与水下领域的异构多机器人同步定位与地图构建
Abstract
Multi-robot simultaneous localization and mapping (SLAM) is a fundamental task in multi-robot operations. Robots must have a common understanding of their location and that of their team members to complete coordinated actions. However, multi-robot SLAM between Uncrewed Surface Vessels (USVs) and Autonomous Underwater Vehicles (AUVs) has primarily been achieved through acoustic pinging between robots to retrieve range measurements; this measurement technique requires robots to be in similar locations simultaneously and to have an uninterrupted path for signal propagation, and it may necessitate synchronized clocks. This is especially challenging in complex, cluttered maritime environments, where structures may impede signals. Yet these same structures may be observable above and below the water's surface, presenting an opportunity for inter-robot SLAM loop closure between USV and AUV data streams. This work builds upon recent research on inter-robot SLAM loop closure between USV and AUV data, extending it to propose a centralized multi-robot SLAM system. Each robot performs its own state estimation, and we detect loop closures between each AUV and the USV data. These inter-robot loop closures are used to merge each robot's state estimate into a centralized graph, yielding estimates for the whole time history of the USV and all AUVs in the system. Validation is performed using real-world perceptual data in three different environments. Results show reduced errors for AUVs in the multi-robot SLAM system compared to single-robot SLAM over the same trajectories. To our knowledge, this is the first instance of a multi-robot SLAM system with AUVs and USVs built on loop closures rather than acoustic distance measurements.
Chinese Translation
多机器人同步定位与地图构建(SLAM)是多机器人操作中的一项基础任务。机器人必须对自身及其团队成员的位置有共同的理解,以完成协调行动。然而,无人水面艇(USVs)与自主水下航行器(AUVs)之间的多机器人SLAM主要是通过机器人之间的声学信号传递来获取距离测量;这种测量技术要求机器人在相似的位置同时存在,信号传播路径不受阻碍,并且可能需要同步时钟。这在复杂、杂乱的海洋环境中尤其具有挑战性,因为结构可能会干扰信号。然而,这些结构在水面上和水下都可能是可观察的,为USV和AUV数据流之间的机器人间SLAM回环闭合提供了机会。本研究基于最近关于USV和AUV数据之间的机器人间SLAM回环闭合的研究,扩展提出了一个集中式多机器人SLAM系统。每个机器人执行其状态估计,我们检测每个AUV与USV数据之间的回环闭合。这些机器人间的回环闭合用于将每个机器人的状态估计合并到一个集中式图中,从而获得USV及系统中所有AUV的整个时间历史的估计。通过在三个不同环境中使用真实的感知数据进行验证,结果显示,与在相同轨迹上进行单机器人SLAM相比,多机器人SLAM系统中AUV的误差得到了改善。据我们所知,这是第一个基于回环闭合而非声学距离测量构建的包含AUV和USV的多机器人SLAM系统。
cs.RO / 51 / 2605.09869
ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control
ConsistNav:通过语义执行控制缩小零样本物体导航中的动作一致性差距
Abstract
Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image--text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
Chinese Translation
零样本物体导航随着开放词汇检测器、图像-文本模型和语言引导探索的快速发展而取得了显著进展。然而,即使当前方法检测到一个合理的目标假设,智能体仍可能在探索和追踪之间摇摆不定,或在接近成功时放弃目标。我们将这种失败模式定义为动作一致性差距:语义证据在每一步都被反复重新解释,而没有在整个过程中的持续承诺。我们提出了ConsistNav,一个无需训练的零样本ObjectNav框架,围绕一个由三个协调模块组成的语义执行器构建:有限状态执行控制器(Finite-State Executive Controller)通过受保护的语义阶段引导目标追踪;持久候选记忆(Persistent Candidate Memory)将跨帧目标证据累积为稳定的物体假设;稳定性意识行动控制(Stability-Aware Action Control)抑制旋转停滞、无效追踪和未经验证的停止。该设计既不改变检测器,也不改变低级规划器;相反,它控制何时让语义证据影响导航,以及何时应抑制或重新审视这些证据。我们在HM3D和MP3D上进行了广泛的实验,其中ConsistNav在比较的零样本ObjectNav方法中实现了最先进的结果,并在MP3D上相较于受控基线将成功率(SR)提高了11.4%,将按路径长度加权的成功率(SPL)提高了7.9%。消融研究和现实世界部署实验进一步证明了所提执行机制的有效性和鲁棒性。
cs.RO / 52 / 2605.09886
Network-Efficient World Model Token Streaming
网络高效的世界模型令牌流传输
Abstract
Generative driving world models rely on compact latent state representations that must be efficiently transmitted and synchronized across distributed compute and connected vehicles. We study network-efficient streaming of a discrete world model state, where a stride-16 VQ-U-Net tokenizer (codebook size 8,192) maps each 288x512 frame to an 18x32 grid of token IDs (576 tokens/frame), equivalent to 936 bytes/frame under fixed-length coding. We consider a keyframe--delta protocol under strict per-message payload budgets and packet loss, and propose a fully online, label-free algorithm that prioritizes delta updates via cosine distance in codebook embedding space and triggers keyframes adaptively using a Hamming-drift threshold. The adaptive algorithm consistently improves the rate distortion frontier over periodic keyframes at matched bitrates: at 0.024 Mb/s (200-byte budget) dynamic-only embedding distortion drops from 0.0712 to 0.0661 (7.2\%), and at 0.036 Mb/s (400-byte budget) from 0.0427 to 0.0407 (4.8\%). Under 10\% delta packet loss at 200 bytes, dynamic-only distortion is 0.0757 versus 0.0789 for a matched periodic baseline. To connect state fidelity to world model usefulness, we train a lightweight next-token predictor and evaluate perplexity conditioned on streamed receiver states: at 0.024 Mb/s, dynamic-position perplexity improves from 206.0 to 193.1 (6.3\%), and at 0.036 Mb/s from 158.9 to 155.6 (2.1\%). These results support discrete token-state streaming as a practical systems layer for bandwidth-aware synchronization and improved downstream token-dynamics utility under vehicular networking constraints.
Chinese Translation
生成驾驶世界模型依赖于紧凑的潜在状态表示,这些表示必须在分布式计算和连接车辆之间高效传输和同步。我们研究了一种离散世界模型状态的网络高效流传输,其中一个步幅为16的 VQ-U-Net 令牌化器(代码本大小为8,192)将每个288x512的帧映射到一个18x32的令牌ID网格(每帧576个令牌),在固定长度编码下相当于每帧936字节。我们考虑在严格的每条消息有效载荷预算和数据包丢失下的关键帧-增量协议,并提出了一种完全在线、无标签的算法,该算法通过余弦距离在代码本嵌入空间中优先处理增量更新,并使用汉明漂移阈值自适应触发关键帧。该自适应算法在匹配比特率下始终改善了周期性关键帧的率失真前沿:在0.024 Mb/s(200字节预算)下,仅动态嵌入失真从0.0712降至0.0661(7.2%),在0.036 Mb/s(400字节预算)下从0.0427降至0.0407(4.8%)。在200字节的情况下,10%的增量数据包丢失下,仅动态失真为0.0757,而匹配的周期性基线为0.0789。为了将状态保真度与世界模型的实用性联系起来,我们训练了一个轻量级的下一个令牌预测器,并评估基于流式接收器状态的困惑度:在0.024 Mb/s下,动态位置困惑度从206.0改善至193.1(6.3%),在0.036 Mb/s下从158.9改善至155.6(2.1%)。这些结果支持离散令牌状态流传输作为带宽感知同步和在车辆网络约束下改善下游令牌动态效用的实用系统层。
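The 936 bytes/frame figure follows from fixed-length coding of the token grid, and the arithmetic is easy to check. The sketch below reproduces it and adds a toy keyframe trigger. The trigger is an assumption for illustration only: the abstract names a Hamming-drift threshold but does not say how drift is measured, so `hamming_drift` (fraction of token positions that differ from the last keyframe), `should_send_keyframe`, and the default `threshold` are hypothetical.

```python
import math

def bytes_per_frame(grid_h: int, grid_w: int, codebook_size: int) -> int:
    """Fixed-length coding cost of one full token grid, in bytes."""
    bits_per_token = math.ceil(math.log2(codebook_size))  # 13 bits for 8,192 codes
    total_bits = grid_h * grid_w * bits_per_token          # 576 tokens * 13 bits
    return math.ceil(total_bits / 8)

def hamming_drift(reference: list[int], current: list[int]) -> float:
    """Fraction of token positions whose ID differs from the reference grid."""
    changed = sum(1 for r, c in zip(reference, current) if r != c)
    return changed / len(reference)

def should_send_keyframe(reference: list[int], current: list[int],
                         threshold: float = 0.5) -> bool:
    """Hypothetical trigger: request a keyframe once drift exceeds the threshold."""
    return hamming_drift(reference, current) > threshold

print(bytes_per_frame(18, 32, 8192))  # 936
```

With a 200-byte budget, a delta message can therefore carry only a fraction of a full frame, which is why the order in which changed tokens are sent matters.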
cs.RO / 53 / 2605.09939
Neural Distance-Guided Path Integral Control for Tractor-Trailer Navigation
基于神经距离引导的拖拉机-挂车导航路径积分控制
Abstract
Autonomous and safe navigation of tractor-trailer systems requires accurate, real-time collision avoidance and dynamically feasible control, particularly in cluttered and complex agricultural environments. This is challenging due to their articulated, deformable geometries and nonlinear dynamics. Traditional methods oversimplify vehicle geometry or rely on precomputed distance fields that assume a known map, limiting their applicability in dynamic, partially unknown environments. To address these limitations, we propose a geometric neural encoder that provides fast and accurate distance estimates between the full tractor-trailer body and raw LiDAR perception, enabling real-time, map-free geometric reasoning. These learned distances are integrated into a Model Predictive Path Integral (MPPI) controller, allowing the system to incorporate true articulated geometry directly into its cost evaluation and enabling more responsive navigation in challenging agricultural settings. Simulation results demonstrate that the proposed framework generates dynamically feasible and safe trajectories for navigating tractor-trailer systems in cluttered and complex environments.
Chinese Translation
拖拉机-挂车系统的自主安全导航需要准确的实时碰撞避免和动态可行的控制,特别是在杂乱复杂的农业环境中。这一任务具有挑战性,因为它们的关节式、可变形几何形状和非线性动力学使得控制变得复杂。传统方法往往过于简化车辆几何形状,或依赖于假设已知地图的预计算距离场,从而限制了它们在动态、部分未知环境中的适用性。为了解决这些限制,我们提出了一种几何神经编码器,该编码器能够提供拖拉机-挂车整体与原始激光雷达(LiDAR)感知之间的快速准确距离估计,从而实现实时、无地图的几何推理。这些学习到的距离被整合到模型预测路径积分(Model Predictive Path Integral, MPPI)控制器中,使系统能够将真实的关节几何形状直接纳入其成本评估中,从而在复杂的农业环境中实现更灵活的导航。仿真结果表明,所提出的框架能够为在杂乱复杂环境中导航的拖拉机-挂车系统生成动态可行且安全的轨迹。
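To make the MPPI step concrete, here is a minimal sketch of the standard importance-weighted update with a distance-based penalty in the rollout cost. It is not the paper's controller: in the paper the body-to-obstacle distance comes from the learned geometric encoder, whereas here it is simply a number passed in, and `collision_cost`, its `margin`, and its `weight` are hypothetical.

```python
import math

def collision_cost(distance: float, margin: float = 0.5, weight: float = 100.0) -> float:
    """Hypothetical penalty: zero beyond the safety margin, linear inside it."""
    return weight * max(0.0, margin - distance)

def mppi_weights(costs: list[float], temperature: float = 1.0) -> list[float]:
    """Softmin weights over sampled rollouts (standard MPPI weighting)."""
    best = min(costs)  # subtracting the minimum keeps exp() numerically stable
    unnorm = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mppi_update(controls: list[list[float]], costs: list[float],
                temperature: float = 1.0) -> list[float]:
    """Weighted average of the sampled control sequences, one value per time step."""
    w = mppi_weights(costs, temperature)
    horizon = len(controls[0])
    return [sum(w[i] * controls[i][t] for i in range(len(controls)))
            for t in range(horizon)]

# Two sampled steering sequences; the second passes closer to an obstacle.
costs = [collision_cost(1.0), collision_cost(0.25)]
print(mppi_update([[0.1, 0.1], [0.3, 0.3]], costs))
```

Because the weights decay exponentially with cost, rollouts that enter the safety margin contribute almost nothing to the averaged control.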
cs.RO / 54 / 2605.09944
Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion
显式楼梯几何条件化以实现稳健的人形机器人行走
Abstract
Robust humanoid stair climbing remains challenging due to geometric discontinuities, sensitivity to step height variations, and perception uncertainty in real-world environments. Existing learning-based locomotion policies often rely on implicit terrain representations or blind proprioceptive feedback, limiting their ability to generalize across varying stair geometries and to anticipate required gait adjustments. This paper proposes an explicit stair geometry conditioning framework for robust humanoid stair climbing. Instead of encoding terrain as high-dimensional latent features, we extract a compact set of interpretable geometric parameters, including step height, step depth, and current yaw angle relative to the robot heading. These explicit stair parameters directly condition a Proximal Policy Optimization (PPO)-based locomotion policy, enabling proactive modulation of swing-foot clearance and stride characteristics according to stair structure. Simulation experiments demonstrate improved generalization across unseen stair heights beyond the training distribution. Real-world experiments on the Unitree G1 humanoid validate reliable indoor and outdoor stair traversal. In challenging outdoor scenarios, the robot successfully ascends 33 consecutive steps without failure, demonstrating robustness and practical deployability.
Chinese Translation
稳健的人形机器人爬楼梯仍然面临挑战,原因在于几何不连续性、对台阶高度变化的敏感性以及现实环境中的感知不确定性。现有的基于学习的行走策略通常依赖于隐式地形表示或盲目的本体感觉反馈,这限制了它们在不同楼梯几何形状之间的泛化能力以及对所需步态调整的预判。本文提出了一种显式楼梯几何条件化框架,以实现稳健的人形机器人爬楼梯。我们不再将地形编码为高维潜在特征,而是提取一组紧凑且可解释的几何参数,包括台阶高度、台阶深度以及相对于机器人朝向的当前偏航角。这些显式楼梯参数直接调节基于近端策略优化(Proximal Policy Optimization, PPO)的行走策略,使得机器人能够根据楼梯结构主动调节摆脚的间隙和步幅特征。仿真实验表明,该方法在未见过的楼梯高度上具有更好的泛化能力,超出了训练分布。在Unitree G1人形机器人上进行的实际实验验证了其在室内和室外楼梯行走中的可靠性。在具有挑战性的户外场景中,机器人成功地连续攀爬33个台阶而未发生故障,展示了其稳健性和实际可部署性。
cs.RO / 55 / 2605.09954
JODA: Composable Joint Dynamics for Articulated Objects
JODA:可组合的关节动态模型用于关节物体
Abstract
Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.
Chinese Translation
在仿真和具身人工智能中使用的关节物体通常通过几何形状和运动学结构来定义,但缺乏控制现实机械行为的细粒度动态效应,例如摩擦保持、卡位、软关闭和快速锁定。现有的方法要么完全忽略动态的详细结构,要么使用表达能力有限的简单模型。我们提出了JODA,一个生成关节级动态的框架,作为关节自由度上的结构化三通道场,捕捉保守力、干摩擦和阻尼。该框架采用形状约束的分段三次插值(PCHIP)进行实例化,定义了一个紧凑且富有表现力的函数空间,既可解释又与可微仿真兼容。在此表示的基础上,我们开发了从多模态输入推断和细化关节动态的方法。在给定视觉观察和关节上下文的情况下,视觉-语言模型提出结构化的动态原语,这些原语被组合成一个统一的动态场。最终的表示支持直接操作和基于梯度的细化。我们展示了JODA能够实现多样化关节行为的合理且可控的建模,提供了一个用于推断、编辑和优化的统一接口。代码和生成的示例资产将在出版时发布。
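One way to read the three-channel field is as three curves over the joint coordinate that combine into a single torque. The sketch below is an assumption-laden illustration: the combination rule and sign convention in `joint_torque` are guesses, not taken from the paper, and plain piecewise-linear interpolation is swapped in for PCHIP to keep the example dependency-free.

```python
from bisect import bisect_right

def interp(knots_q, knots_v, q):
    """Piecewise-linear interpolation, clamped at both ends (stand-in for PCHIP)."""
    if q <= knots_q[0]:
        return knots_v[0]
    if q >= knots_q[-1]:
        return knots_v[-1]
    i = bisect_right(knots_q, q) - 1
    t = (q - knots_q[i]) / (knots_q[i + 1] - knots_q[i])
    return knots_v[i] + t * (knots_v[i + 1] - knots_v[i])

def joint_torque(q, qd, knots_q, conservative, friction, damping):
    """Assumed rule: conservative(q) - friction(q) * sign(qd) - damping(q) * qd."""
    sign = (qd > 0) - (qd < 0)
    return (interp(knots_q, conservative, q)
            - interp(knots_q, friction, q) * sign
            - interp(knots_q, damping, q) * qd)

# Toy profile over a joint range of [0, 2] rad with a conservative bump at q = 1.
knots = [0.0, 1.0, 2.0]
print(joint_torque(0.5, 2.0, knots,
                   conservative=[0.0, 2.0, 0.0],
                   friction=[0.5, 0.5, 0.5],
                   damping=[0.1, 0.1, 0.1]))
```

A shape-preserving interpolant such as PCHIP would avoid the kinks at the knots while still not overshooting between them, which matters for effects like detents.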
cs.RO / 56 / 2605.09972
HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving
HiDrive:高层次自主驾驶的闭环基准测试
Abstract
End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at https://github.com/VDIGPKU/HiDrive.
Chinese Translation
端到端自主驾驶已经取得了快速进展,但现有的基准测试日益饱和,最先进的模型在广泛使用的开环和闭环基准测试中几乎达到了完美分数。这种饱和并不意味着问题已经解决;相反,它揭示了当前基准在场景多样性、物体种类和评估的驾驶能力广度方面的局限性。特别是,它们缺乏足够的长尾场景,涉及稀有但对安全至关重要的物体,并未评估先进的决策能力,如法律合规、伦理推理和应急响应。为了解决这些问题,我们提出了HiDrive,一个新的端到端自主驾驶闭环基准,强调长尾场景和更丰富的驾驶能力评估。HiDrive引入了一组多样化的稀有物体和不常见的交通情况,并将评估范围从基本驾驶技能扩展到更高级的能力,包括规则遵守、道德推理和情境相关的应急操作。相应地,我们将以往以避免碰撞为中心的指标扩展为一个综合评估系统,涵盖碰撞与刹车、交通规则遵守和道德推理指标。基于更先进的物理引擎,HiDrive提供了物理上逼真的光照和高保真视觉渲染,为评估自主驾驶系统是否能够应对现实世界部署的复杂性提供了一个更具挑战性和现实性的测试平台。HiDrive软件、源代码、数字资产和文档可在 https://github.com/VDIGPKU/HiDrive 获取。
cs.RO / 57 / 2605.09989
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy:通过立体感知改善机器人操作策略
Abstract
Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
Chinese Translation
近年来,机器人模仿学习的进展产生了强大的视觉运动策略,能够直接从单目视觉输入中操控各种物体。然而,单目观察本质上缺乏可靠的深度线索和空间意识,这对于在杂乱或几何复杂的场景中进行精确操作至关重要。为了解决这一局限性,我们提出了StereoPolicy,一种新的视觉运动策略学习框架,直接利用同步的立体图像对来增强几何推理,而无需显式的3D重建或相机标定。StereoPolicy采用预训练的2D视觉编码器独立处理每幅图像,并通过立体变换器(Stereo Transformer)融合生成的表示。这一设计隐式捕捉空间对应关系和视差线索。该框架与基于扩散的预训练视觉-语言-动作(VLA)策略无缝集成,在三个仿真基准(RoboMimic、RoboCasa和OmniGibson)上相较于RGB、RGB-D、点云和多视角基线提供了一致的改进。我们进一步在真实机器人实验中验证了StereoPolicy,涵盖了桌面和双手移动操作设置。我们的结果强调了立体视觉作为一种可扩展且稳健的模式,能够将2D预训练表示与3D几何理解结合起来,以实现机器人操作。
cs.RO / 58 / 2605.09999
Muninn: Your Trajectory Diffusion Model But Faster
Muninn:您的轨迹扩散模型,但更快
Abstract
Diffusion-based trajectory planners can synthesize rich, multimodal robot motions, but their iterative denoising makes online planning and control prohibitively slow. Existing accelerations either modify the sampler or compress the network--sacrificing plan quality or requiring retraining without accounting for downstream control risk. We address the problem of making diffusion-based trajectory planners fast enough for real-time robot use without retraining the model or sacrificing trajectory quality, and in a way that works across diverse state-space diffusion architectures. Our key insight is that diffusion trajectory planners expose two signals we can exploit: a cheap probe of how their internal trajectory representation changes across steps, and analytic coefficients that describe how denoiser errors affect the sampler's state update. By calibrating the first signal against the second on offline runs, we obtain a per-step score that upper-bounds how far the final trajectory can deviate when we reuse a cached denoiser output, and we treat this bound as an uncertainty budget that we can spend over the denoising process. Building on this insight, we present Muninn, a training-free caching wrapper that tracks this uncertainty budget during sampling and, at each diffusion step, chooses between reusing a cached denoiser output when the predicted deviation is small and recomputing the denoiser when it is not. Across standard benchmarks Muninn delivers up to 4.6x wall-clock speedups across several trajectory diffusion models by reducing denoiser evaluations, while preserving task performance and safety metrics. Muninn further certifies that cached rollouts remain within a specified distance of their full-compute counterparts, and we validate these gains in real-time closed-loop navigation and manipulation hardware deployments. Project page: https://github.com/gokulp01/Muninn.
Chinese Translation
基于扩散的轨迹规划器能够合成丰富的多模态机器人运动,但其迭代去噪过程使得在线规划和控制变得极其缓慢。现有的加速方法要么修改采样器,要么压缩网络——牺牲规划质量或需要重新训练而未考虑下游控制风险。我们解决了如何使基于扩散的轨迹规划器足够快速以适应实时机器人使用的问题,而无需重新训练模型或牺牲轨迹质量,并且这种方法适用于多种状态空间扩散架构。我们的关键见解是,扩散轨迹规划器暴露了两个可以利用的信号:一个是其内部轨迹表示在各个步骤中变化的廉价探测,另一个是描述去噪器误差如何影响采样器状态更新的解析系数。通过在离线运行中将第一个信号与第二个信号进行校准,我们获得了一个逐步得分,该得分上限表示当我们重用缓存的去噪器输出时,最终轨迹可能偏离的程度,我们将此上限视为可以在去噪过程中支出的不确定性预算。基于这一见解,我们提出了Muninn,一个无训练的缓存包装器,在采样过程中跟踪这一不确定性预算,并在每个扩散步骤中,根据预测的偏差大小选择重用缓存的去噪器输出或重新计算去噪器。在标准基准测试中,Muninn通过减少去噪器评估,在多个轨迹扩散模型中实现了高达4.6倍的实际速度提升,同时保持任务性能和安全指标。Muninn进一步证明,缓存的展开结果保持在与其完整计算对应物的指定距离内,我们在实时闭环导航和操作硬件部署中验证了这些收益。项目页面:https://github.com/gokulp01/Muninn。
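The reuse-or-recompute decision can be sketched as spending from a fixed budget. This is a simplified illustration, not Muninn's algorithm: it assumes the per-step deviation bounds are already known and add up, and it uses a greedy rule. The name `plan_cache_schedule` and both of its parameters are hypothetical.

```python
def plan_cache_schedule(deviation_bounds, budget):
    """Decide, step by step, whether to reuse the cached denoiser output.

    deviation_bounds[k] is an assumed upper bound on how far the final
    trajectory can move if step k reuses the cache instead of recomputing.
    Returns a list of booleans, True meaning "reuse the cache at this step".
    """
    spent = 0.0
    schedule = []
    for k, bound in enumerate(deviation_bounds):
        # Nothing is cached before the first step, so step 0 always recomputes.
        can_reuse = k > 0 and spent + bound <= budget
        if can_reuse:
            spent += bound
        schedule.append(can_reuse)
    return schedule

schedule = plan_cache_schedule([0.1, 0.2, 0.5, 0.1, 0.3], budget=0.4)
print(schedule)                                  # [False, True, False, True, False]
print("denoiser calls saved:", sum(schedule))
```

Under these assumptions the total predicted deviation of the cached rollout never exceeds `budget`, which mirrors the certification claim in the abstract.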
cs.RO / 59 / 2605.10034
Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving
超越自我对弈与规模:自动驾驶泛化的行为基准测试
Abstract
Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
Chinese Translation
近期的自动驾驶(AD)研究,如 GigaFlow 和 PufferDrive,已将强化学习(RL)作为一种训练驾驶策略的规模化方法。然而,这些策略仍与既定基准脱节,使得大规模 RL 在标准化评估中的表现未知。我们提出了 BehaviorBench——一个全面的测试套件,从评估、复杂性和行为多样性三个方面填补了这一空白。在评估方面,我们提供了一个接口,将 PufferDrive 连接到 nuPlan,这首次使得通过大规模 RL 训练的策略能够在一个已建立的自动驾驶规划基准上进行评估。作为补充,我们提供了一个评估框架,允许规划者在 PufferDrive 模拟环境中直接进行基准测试,耗时极少。在复杂性方面,我们观察到当今的标准化基准过于简单,以至于通过直线行驶和碰撞检测可以获得接近完美的分数。我们从 Waymo Open Motion Dataset (WOMD) 中提取了一个有意义的、富有互动的分割,强大的表现无法在没有多智能体推理的情况下实现。最后,我们关注行为多样性。现有基准通常将规划者与单一的基于规则的交通模型——智能驾驶模型(IDM)进行评估。我们提供了一套多样化的互动交通代理,以在异质行为下对策略进行压力测试,而不仅仅是使用 IDM。总体而言,我们的基准分析揭示了以下见解:尽管以自发的方式学习互动行为,但在标准奖励函数下通过纯自我对弈训练的策略对其训练对手过拟合,未能对其他交通代理行为进行泛化。在此观察的基础上,我们提出了一种混合规划器,将 PPO 策略与基于规则的规划器相结合。
cs.RO / 60 / 2605.10051
Guided Streaming Stochastic Interpolant Policy
引导流式随机插值策略
Abstract
Inference-time guidance is essential for steering generative robot policies toward dynamic objectives without retraining, yet existing methods are largely confined to chunk-based architectures that exhibit high latency and lack the reactivity needed for test-time preference alignment or obstacle avoidance. In this work, we formally derive the optimal guidance term for Stochastic Interpolants (SI) by analyzing the value function's time evolution via the Backward Kolmogorov Equation, establishing a modified drift that theoretically guarantees sampling from a target distribution. We apply this framework to real-time control through the Streaming Stochastic Interpolant Policy (SSIP), which generalizes the deterministic Streaming Flow Policy (SFP). Unifying this guidance law with the streaming architecture enables fast and reactive control. To support diverse deployment needs, we propose two complementary mechanisms: training-free Stochastic Trajectory Ensemble Guidance (STEG) that computes gradients on-the-fly for zero-shot adaptation, and training-based Conditional Critic Guidance (CCG) for amortized inference. Empirical evaluations demonstrate that our guided streaming approach significantly outperforms conventional chunk-based policies in reactivity and provides superior, physically valid guidance for dynamic, unstructured environments.
Chinese Translation
推理时的引导对于在不重新训练的情况下将生成机器人策略引导至动态目标至关重要,然而现有的方法在很大程度上局限于基于块的架构,这些架构表现出高延迟,并且缺乏在测试时进行偏好对齐或避障所需的反应能力。在本研究中,我们通过分析价值函数的时间演化,利用反向科尔莫戈罗夫方程正式推导出随机插值(Stochastic Interpolants, SI)的最优引导项,建立了一种理论上保证从目标分布中采样的修正漂移。我们将这一框架应用于实时控制,通过流式随机插值策略(Streaming Stochastic Interpolant Policy, SSIP),该策略推广了确定性流式流动策略(Streaming Flow Policy, SFP)。将这一引导法则与流式架构相结合,能够实现快速和反应灵敏的控制。为了支持多样化的部署需求,我们提出了两种互补机制:无训练的随机轨迹集成引导(Stochastic Trajectory Ensemble Guidance, STEG),该机制能够即时计算梯度以实现零样本适应,以及基于训练的条件评论引导(Conditional Critic Guidance, CCG)用于摊销推理。实证评估表明,我们的引导流式方法在反应能力上显著优于传统的基于块的策略,并为动态、非结构化环境提供了更优越、物理上有效的引导。
cs.RO / 61 / 2605.10063
EFGCL: Learning Dynamic Motion through Spotting-Inspired External Force Guided Curriculum Learning
EFGCL:通过受启发的外部力引导课程学习学习动态运动
Abstract
Learning dynamic whole-body motions for legged robots through reinforcement learning (RL) remains challenging due to the high risk of failure, which makes efficient exploration difficult and often leads to unstable learning. In this paper, we propose External Force Guided Curriculum Learning (EFGCL), a guided RL approach based on the principle of physical guidance, in which external assistive forces are introduced during training. Inspired by spotting in artistic gymnastics, EFGCL enables agents to physically experience successful motion executions without relying on task-specific reward shaping or reference trajectories. Experiments on a quadrupedal robot performing Jump, Backflip, and Lateral-Flip tasks demonstrate that EFGCL accelerates learning of the Jump task by approximately a factor of two and enables the acquisition of complex whole-body motions that conventional RL methods fail to learn. We further show that the learned policies can be deployed on a real robot, reproducing motions consistent with those observed in simulation. These results indicate that physically guided exploration, which allows agents to experience success early in training, is an effective and general strategy for improving learning efficiency in dynamic whole-body motion tasks.
Chinese Translation
通过强化学习(RL)为四足机器人学习动态全身运动仍然面临挑战,因为失败的高风险使得有效探索变得困难,并且常常导致学习不稳定。本文提出了一种基于物理引导原理的引导式强化学习方法——外部力引导课程学习(EFGCL),在训练过程中引入外部辅助力。受艺术体操中的“点位”启发,EFGCL使得智能体能够在不依赖于特定任务奖励塑造或参考轨迹的情况下,实际体验成功的运动执行。对四足机器人进行的跳跃、后空翻和侧空翻任务的实验表明,EFGCL将跳跃任务的学习速度提高了大约两倍,并使得获取传统RL方法无法学习的复杂全身运动成为可能。我们进一步展示了所学习的策略可以在真实机器人上部署,重现与模拟中观察到的运动一致的动作。这些结果表明,物理引导的探索允许智能体在训练初期体验成功,是提高动态全身运动任务学习效率的有效且通用的策略。
cs.RO / 62 / 2605.10086
A cell-decomposition based path planner for 3D navigation in constrained workspaces
基于单元分解的三维导航路径规划算法在受限工作空间中的应用
Abstract
This paper proposes a cell decomposition algorithm for binary occupancy grids that ensures mutual complete visibility from each cell to at least one adjacent cell. This decomposition establishes a simplified framework for verifying path feasibility that can be easily embedded in optimization problems. To illustrate its utility, we formulate both second-order cone programs (SOCP) and their mixed-integer variant (MISOCP) within the proposed framework. Furthermore, we propose the KSP-SOCP method, which combines Yen's k-shortest path algorithm with the SOCP, achieving improved solutions compared to a standard SOCP approach while avoiding the computational burden of MISOCP. The cell decomposition algorithm, KSP-SOCP, and MISOCP approaches were evaluated in 9 city-like workspaces. The decomposition efficiently partitioned each map, enabling both optimization methods to compute feasible paths. The proposed KSP-SOCP achieved time performance comparable to the MISOCP while requiring less memory, making it highly suitable for large-scale problems.
Chinese Translation
本文提出了一种针对二进制占用网格的单元分解算法,该算法确保每个单元与至少一个相邻单元之间的相互完全可见性。这种分解建立了一个简化的框架,用于验证路径的可行性,且可以轻松嵌入优化问题中。为了说明其实用性,我们在所提出的框架内构建了二阶锥规划(SOCP)及其混合整数变体(MISOCP)。此外,我们提出了KSP-SOCP方法,该方法将Yen的k最短路径算法与SOCP相结合,取得了比标准SOCP方法更优的解决方案,同时避免了MISOCP的计算负担。我们在9个类城市工作空间中评估了单元分解算法、KSP-SOCP和MISOCP方法。该分解有效地划分了每个地图,使得两种优化方法能够计算出可行路径。所提出的KSP-SOCP在时间性能上与MISOCP相当,但所需内存更少,因而非常适合大规模问题。
cs.RO / 63 / 2605.10094
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
检索-再引导:生成性视觉-语言-动作(VLA)模型的在线成功记忆用于测试时适应
Abstract
Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.
Chinese Translation
视觉-语言-动作(VLA)模型在通用机器人操作中展现出强大的潜力,但它们在局部部署条件下的闭环可靠性往往会下降。现有评估通常将测试阶段视为独立的零样本试验。然而,真实机器人往往在相同或缓慢变化的环境中重复操作,在这些环境中,成功的执行提供了可靠行为模式的环境验证证据。我们研究这种持续部署的情境,探讨一个部分合格的冻结 VLA 是否可以通过重用其成功的测试时经验来提高其可靠性。我们提出了一种基于在线成功记忆的生成性 VLA 测试时适应框架。在部署过程中,机器人将经过进度校准的成功观察-动作片段存储在长期记忆中。在推理时,它检索与状态相关的动作片段,通过轨迹级一致性过滤不一致的候选项,并将其聚合成一个精英动作先验。为了将这个先验融入动作生成中,我们引入了自适应置信度先验引导,它将精英先验注入流匹配动作采样器的中间状态,并根据检索置信度调整引导强度。这一设计使得冻结的 VLA 能够利用特定环境的成功经验,同时保持基于观察的生成性细化。这种检索-再引导机制实现了轻量级、非参数的测试时适应,无需参数更新。仿真和真实世界实验显示,任务成功率和闭环稳定性得到了改善,在长时间跨度和多阶段任务中尤为明显。
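The confidence-adaptive guidance step can be pictured as a blend between the sampler's intermediate state and the retrieved prior. The sketch below is only an illustration under assumptions: the linear blend, the clamping of confidence to [0, 1], and the names `guidance_weight`, `steer`, and `max_strength` are hypothetical, and the flow-matching sampler itself is not modeled.

```python
def guidance_weight(confidence: float, max_strength: float = 0.5) -> float:
    """Scale guidance by retrieval confidence, clamped to [0, 1] (assumed rule)."""
    clamped = min(1.0, max(0.0, confidence))
    return max_strength * clamped

def steer(intermediate: list[float], elite_prior: list[float],
          confidence: float, max_strength: float = 0.5) -> list[float]:
    """Pull an intermediate sampler state toward the elite action prior."""
    w = guidance_weight(confidence, max_strength)
    return [(1.0 - w) * x + w * p for x, p in zip(intermediate, elite_prior)]

# A confident retrieval moves the state halfway toward the prior;
# a zero-confidence retrieval leaves it untouched.
print(steer([0.0, 2.0], [1.0, 0.0], confidence=1.0))  # [0.5, 1.0]
print(steer([0.0, 2.0], [1.0, 0.0], confidence=0.0))  # [0.0, 2.0]
```

Capping the weight below 1 leaves room for the remaining sampler steps to refine the action against the current observation, which is the role the abstract assigns to the frozen VLA.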
cs.RO / 64 / 2605.10118
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
在沙箱中规划,在开放世界中导航:学习基于物理的抽象经验以实现具身导航
Abstract
Vision-Language Models (VLMs) have demonstrated exceptional general reasoning capabilities. However, their performance in embodied navigation remains hindered by a scarcity of aligned open-world vision and robot control data. Although simulators provide a cost-effective alternative for data collection, their inherent reliance on photorealistic simulations often limits the transferability of learned policies. To this end, we propose \textit{\textbf{S}andbox-\textbf{A}bstracted \textbf{G}rounded \textbf{E}xperience} (\textbf{\textit{SAGE}}), a framework that enables agents to learn within a physics-grounded semantic abstraction rather than a photorealistic simulation, mimicking the human capacity for mental simulation where plans are rehearsed in simplified physics abstractions before execution. The \textit{SAGE} system operates via three synergistic phases: (1) \textit{Genesis}: constructing diverse, physics-constrained semantic environments to bootstrap experience; (2) \textit{Evolution}: distilling experiences through Reinforcement Learning (RL), utilizing a novel asymmetric adaptive clipping mechanism to stabilize updates; (3) \textit{Navigation}: bridging the abstract policy to open-world control. We demonstrate that \textit{SAGE} significantly improves planner-assisted embodied navigation, achieving a 53.21\% LLM-Match Success Rate on A-EQA (+9.7\% over baseline), while showing encouraging transfer to physical indoor robot deployment.
Chinese Translation
视觉-语言模型(VLMs)展示了卓越的通用推理能力。然而,它们在具身导航中的表现仍受到开放世界视觉与机器人控制数据稀缺的限制。尽管模拟器为数据收集提供了一种成本效益高的替代方案,但对光真实感模拟的固有依赖常常限制了学习策略的可转移性。为此,我们提出了\textit{\textbf{S}andbox-\textbf{A}bstracted \textbf{G}rounded \textbf{E}xperience}(\textbf{\textit{SAGE}})框架,使代理能够在基于物理的语义抽象中学习,而不是在光真实感模拟中,模仿人类在执行前在简化的物理抽象中排练计划的能力。\textit{SAGE}系统通过三个协同阶段运作:(1)\textit{Genesis}:构建多样化的、受物理约束的语义环境以启动经验;(2)\textit{Evolution}:通过强化学习(RL)提炼经验,利用一种新颖的非对称自适应剪切机制来稳定更新;(3)\textit{Navigation}:将抽象策略与开放世界控制连接起来。我们展示了\textit{SAGE}显著改善了规划辅助的具身导航,在A-EQA上达到了53.21\%的LLM-Match成功率(比基线提高9.7\%),同时在物理室内机器人部署中显示出良好的迁移能力。
cs.RO / 65 / 2605.10166
Data-Asymmetric Latent Imagination and Reranking for 3D Robotic Imitation Learning
数据不对称的潜在想象与重排序在3D机器人模仿学习中的应用
Abstract
Robotic imitation learning typically assumes access to optimal demonstrations, yet real-world data collection often yields suboptimal, exploratory, or even failed trajectories. Discarding such data wastes valuable information about environment dynamics and failure modes, which can instead be leveraged to improve decision-making. While 3D policies reduce reliance on high-quality demonstrations through strong spatial generalization, they still require large-scale data to achieve high task success. To address this, we propose DALI-R, a Data-Asymmetric Latent Imagination and Reranking framework for 3D robotic imitation learning from mixed-quality trajectories. It learns a Latent World Model over 3D point clouds for imagined rollouts and a Task Completion Scorer that reranks candidate action chunks, improving decision-making without additional high-quality demonstrations. We instantiate DALI-R with both diffusion and efficient flow-matching policies and evaluate it on Adroit and MetaWorld benchmarks. Across the two evaluated 3D base policies, DALI-R achieves an average $6.8$\% improvement in success rate while incurring less than $0.7\times$ additional inference overhead.
Chinese Translation
机器人模仿学习通常假设能够获得最佳演示,但现实世界中的数据收集往往产生次优的、探索性的,甚至失败的轨迹。丢弃这些数据会浪费关于环境动态和失败模式的宝贵信息,而这些信息可以被利用来改善决策。虽然3D策略通过强大的空间泛化能力减少了对高质量演示的依赖,但仍然需要大规模数据以实现高任务成功率。为了解决这一问题,我们提出了DALI-R,一个用于从混合质量轨迹中进行3D机器人模仿学习的数据不对称潜在想象与重排序框架。它在3D点云上学习潜在世界模型以进行想象的回滚,并且学习任务完成评分器以重排序候选动作块,从而在不需要额外高质量演示的情况下改善决策。我们用扩散和高效流匹配策略实例化DALI-R,并在Adroit和MetaWorld基准上进行评估。在评估的两种3D基础策略中,DALI-R在成功率上平均提高了$6.8$\%,同时额外推理开销不超过$0.7\times$。
cs.RO / 66 / 2605.10201
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
HeteroGenManip:异构对象交互的可推广操控
Abstract
Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: ``where to manipulate'' (contact point localization) and ``how to manipulate'' (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple the initial grasp from complex interaction execution. First, the Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, the Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31\% performance improvement in simulation tasks with a broad type setting, alongside a 36.7\% gain across four real-world tasks with different interaction types.
Chinese Translation
涉及跨类型对象交互的可推广操控是机器人技术中一个关键但具有挑战性的能力。为了可靠地完成此类任务,机器人必须解决两个基本挑战:"在哪里操控"(接触点定位)和"如何操控"(后续交互轨迹规划)。现有的基于基础模型的方法通常采用端到端学习,这模糊了这些阶段之间的区别,加剧了在长时间任务中的误差积累。此外,它们通常依赖于单一的统一模型,这无法捕捉异构对象所需的多样化、类别特定的特征。为克服这些局限性,我们提出了HeteroGenManip,一个任务条件的两阶段框架,旨在将初始抓取与复杂交互执行解耦。首先,基础对应引导抓取模块利用结构先验对齐初始接触状态,从而显著减少抓取的姿态不确定性。随后,多基础模型扩散策略(Multi-Foundation-Model Diffusion Policy, MFMDP)将对象引导至类别专用的基础模型,通过双流交叉注意机制整合细粒度几何信息与高度可变的部件特征。实验评估表明,HeteroGenManip在类别内形状和姿态的泛化能力上表现出色。该框架在广泛类型设置的仿真任务中实现了平均31%的性能提升,并在四个不同交互类型的真实世界任务中获得了36.7%的增益。
cs.RO / 67 / 2605.10210
Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation
Nano-U:高效的微型机器人导航地形分割
Abstract
Terrain segmentation is a fundamental capability for autonomous mobile robots operating in unstructured outdoor environments. However, state-of-the-art models are incompatible with the memory and compute constraints typical of microcontrollers, limiting scalable deployment in small robotics platforms. To address this gap, we develop a complete framework for robust binary terrain segmentation on a low-cost microcontroller. At the core of our approach we design Nano-U, a highly compact binary segmentation network with a few thousand parameters. To compensate for the network's minimal capacity, we train Nano-U via Quantization-Aware Distillation (QAD), combining knowledge distillation and quantization-aware training. This allows the final quantized model to achieve excellent results on the Botanic Garden dataset and to perform very well on TinyAgri, a custom agricultural field dataset with more challenging scenes. We deploy the quantized Nano-U on a commodity microcontroller by extending MicroFlow, a compiler-based inference engine for TinyML implemented in Rust. By eliminating interpreter overhead and dynamic memory allocation, the quantized model executes on an ESP32-S3 with a minimal memory footprint and low latency. This compiler-based execution demonstrates a viable and energy-efficient solution for perception on low-cost robotic platforms.
Chinese Translation
地形分割是自主移动机器人在非结构化户外环境中操作的基本能力。然而,最先进的模型与微控制器的内存和计算限制不兼容,限制了其在小型机器人平台上的可扩展部署。为了解决这一问题,我们开发了一个完整的框架,用于在低成本微控制器上进行稳健的二元地形分割。在我们的方法核心,我们设计了Nano-U,一个具有几千个参数的高度紧凑的二元分割网络。为了弥补网络的最小容量,我们通过量化感知蒸馏(Quantization-Aware Distillation, QAD)对Nano-U进行训练,结合了知识蒸馏和量化感知训练。这使得最终的量化模型在Botanic Garden数据集上取得了优异的结果,并在TinyAgri这一自定义农业领域数据集上表现良好,后者包含更具挑战性的场景。我们通过扩展MicroFlow(一个用Rust实现的TinyML编译器基础推理引擎)在商品微控制器上部署了量化后的Nano-U。通过消除解释器开销和动态内存分配,量化模型在ESP32-S3上以最小的内存占用和低延迟执行。这种基于编译器的执行展示了在低成本机器人平台上进行感知的可行且节能的解决方案。
cs.RO / 68 / 2605.10456
Learning Point Cloud Geometry as a Statistical Manifold: Theory and Practice
将点云几何作为统计流形进行学习:理论与实践
Abstract
Point clouds are a fundamental representation for robotic perception tasks such as localization, mapping, and object pose estimation. However, LiDAR-acquired point clouds are inherently sparse and non-uniform, providing incomplete observations of the underlying scene geometry. This makes reliable geometric reasoning challenging and degrades downstream perception performance. Existing approaches attempt to compensate for these limitations by estimating local geometry, but often rely on hand-crafted statistics or end-to-end supervised learning, which can suffer from limited scalability or require large amounts of accurately labeled data. To address these challenges, we explicitly model point cloud geometry under a principled mathematical formulation. We represent local geometry as a statistical manifold induced by a family of Gaussian distributions, where each point is associated with a Gaussian capturing its local geometric structure. Based on this formulation, we introduce Point-to-Ellipsoid (POLI), a deep neural estimator that predicts per-point Gaussian geometry. POLI learns a mapping from point cloud observations to their underlying geometry in a self-supervised manner, removing the need for labeled data while preserving strong geometric inductive biases. The resulting representation integrates seamlessly into existing robotic perception pipelines without architectural modifications. Extensive experiments show that POLI enables accurate and robust geometry estimation and consistently improves performance across diverse robotic perception tasks.
Chinese Translation
点云是机器人感知任务(如定位、地图构建和物体姿态估计)的基本表示。然而,激光雷达(LiDAR)获取的点云本质上是稀疏且不均匀的,提供了对基础场景几何的不完整观察。这使得可靠的几何推理变得具有挑战性,并降低了下游感知性能。现有方法试图通过估计局部几何来弥补这些局限性,但往往依赖于手工设计的统计量或端到端的监督学习,这可能面临有限的可扩展性或需要大量准确标注的数据。为了解决这些挑战,我们在一个有原则的数学框架下明确建模点云几何。我们将局部几何表示为由一系列高斯分布诱导的统计流形,其中每个点与捕捉其局部几何结构的高斯分布相关联。基于这一框架,我们引入了点到椭球体(Point-to-Ellipsoid, POLI),这是一种深度神经网络估计器,能够预测每个点的高斯几何。POLI以自监督的方式学习从点云观察到其基础几何的映射,消除了对标注数据的需求,同时保留了强大的几何归纳偏置。最终的表示能够无缝集成到现有的机器人感知管道中,而无需架构修改。大量实验表明,POLI能够实现准确且鲁棒的几何估计,并在多种机器人感知任务中持续提高性能。
cs.RO / 69 / 2605.10485
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA:用于空间感知视觉-语言-动作模型的视觉编码器基础对齐
Abstract
Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmark and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state-of-the-art among implicit spatial grounding methods for VLA models.
Chinese Translation
精确的空间推理是机器人操作的基础,但当前视觉-语言-动作(VLA)模型的视觉骨干网络主要是在没有明确3D几何监督的2D图像数据上进行预训练,导致其表示缺乏准确的空间感知。现有的隐式空间基础方法部分解决了这一问题,通过将VLA特征与3D感知基础模型的特征对齐,但它们依赖于经验层搜索,并在已经将空间结构与语言语义纠缠的LLM级视觉标记上进行对齐,限制了其通用性和几何可解释性。我们提出了VEGA(视觉编码器基础对齐),这是一个简单而有效的框架,直接将VLA的视觉编码器输出与来自DINOv2-FiT3D的空间感知特征对齐,后者是一个经过多视角一致的3D高斯点云监督微调的DINOv2模型。通过在视觉编码器输出级别进行对齐,VEGA在任何语言纠缠发生之前就实现了空间感知的基础对齐,提供了一个更具可解释性和原则性的对齐目标。该对齐通过一个轻量级投影器实现,该投影器使用余弦相似度损失与标准动作预测目标共同训练,并在推理时被丢弃,不引入额外的计算开销。在模拟基准和真实世界操作任务上的大量实验表明,VEGA始终优于现有的隐式空间基础基线,在VLA模型的隐式空间基础方法中确立了新的最先进水平。
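The alignment objective described in the VEGA abstract (a lightweight projector trained with a cosine similarity loss alongside the action objective) can be sketched roughly as below. This is a minimal pure-Python illustration, not VEGA's implementation: the linear projector `W`, the loss weight, and all function names are assumptions introduced here, and the abstract does not state how the two losses are weighted.

```python
import math

def project(W, x):
    """Apply a linear projector W (one row per output dim) to a feature vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def alignment_loss(W, encoder_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) between projected encoder tokens and
    the paired tokens of a spatially-aware teacher (hypothetical pairing)."""
    sims = [cosine_similarity(project(W, x), t)
            for x, t in zip(encoder_tokens, teacher_tokens)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(action_loss, align_loss, weight=0.5):
    """Action objective plus weighted alignment term; the weight is an assumption."""
    return action_loss + weight * align_loss
```

Because the projector only appears inside `alignment_loss`, dropping it after training leaves the policy's forward pass unchanged, which matches the abstract's claim of no added inference cost.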
cs.RO / 70 / 2605.10653
Embodied AI in Action: Insights from SAE World Congress 2026 on Safety, Trust, Robotics, and Real-World Deployment
行动中的具身人工智能:来自2026年SAE世界大会关于安全、信任、机器人技术和实际部署的见解
Abstract
Embodied artificial intelligence is rapidly moving from research into real-world systems such as autonomous vehicles, mobile robots, and industrial machines. As these systems become more capable of perceiving, deciding, and acting in dynamic environments, they also introduce new challenges in safety, trust, governance, and operational reliability. This white paper summarizes key insights from the SAE World Congress 2026 panel session \textit{Embodied AI in Action}, which brought together experts from automotive, robotics, artificial intelligence, and safety engineering. The discussion highlighted the need to treat embodied AI as a systems challenge requiring engineering rigor, lifecycle governance, human-centered design, and evolving standards. The paper provides practical perspectives for executives, policymakers, and technical leaders seeking to adopt embodied AI responsibly. The panel reached broad agreement that long-term success will depend not only on advances in AI capability, but equally on safe and trustworthy deployment.
Chinese Translation
具身人工智能正迅速从研究转向实际系统,例如自主车辆、移动机器人和工业机器。随着这些系统在动态环境中感知、决策和行动的能力不断增强,它们也带来了安全、信任、治理和操作可靠性方面的新挑战。本文总结了2026年SAE世界大会小组讨论会\textit{行动中的具身人工智能}的关键见解,该讨论会汇聚了来自汽车、机器人、人工智能和安全工程领域的专家。讨论强调了将具身人工智能视为需要工程严谨性、生命周期治理、人本设计和不断演进标准的系统挑战的必要性。本文为希望负责任地采用具身人工智能的高管、政策制定者和技术领导者提供了实用视角。小组成员广泛达成共识,长期成功不仅依赖于人工智能能力的进步,也同样依赖于安全和可信的部署。
cs.RO / 71 / 2605.10696
VRA: Grounding Discrete-Time Joint Acceleration in Voltage-Constrained Actuation
VRA:在电压约束驱动下的离散时间关节加速度基础
Abstract
Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits. However, under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable constraints. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.
Chinese Translation
离散时间关节加速度约束广泛用于强制执行位置和速度限制。然而,在电压约束的电动执行器下,运动学上可接受的加速度可能在物理上无法实现,暴露出缺失的执行层次抽象。我们提出了电压可实现加速度(Voltage-Realizable Acceleration,VRA),这是一种关节级加速度接口,通过将指令加速度限制在电压可实现的约束内,将运动学加速度与电压约束执行器的物理特性相结合。对电动执行器和轮腿四足机器人进行的硬件实验表明,VRA消除了无法实现的加速度,恢复了一致的近约束执行,并减少了约束引起的振荡。
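To make the idea of a voltage-realizable acceleration concrete, the sketch below clamps a commanded joint acceleration to what a bounded supply voltage permits. It assumes a textbook steady-state brushed DC motor model (V = R i + k_e ω, τ = k_t i, J a = τ − τ_load, inductance ignored); the abstract does not give VRA's actual actuator model, so every parameter and function name here is hypothetical.

```python
def voltage_realizable_accel_bounds(omega, v_max, R, k_t, k_e, J, tau_load=0.0):
    """Acceleration interval reachable with |V| <= v_max at joint speed omega.

    Assumed model (not from the paper): V = R*i + k_e*omega, tau = k_t*i,
    J*a = tau - tau_load.
    """
    i_min = (-v_max - k_e * omega) / R   # current at full negative voltage
    i_max = (v_max - k_e * omega) / R    # current at full positive voltage
    a_min = (k_t * i_min - tau_load) / J
    a_max = (k_t * i_max - tau_load) / J
    return a_min, a_max

def clamp_acceleration(a_cmd, omega, **motor):
    """Restrict a kinematically admissible command to the realizable interval."""
    a_min, a_max = voltage_realizable_accel_bounds(omega, **motor)
    return min(max(a_cmd, a_min), a_max)
```

Under this model the back-EMF term shrinks the available positive acceleration as speed rises, so a command that is admissible at rest may be unrealizable near the velocity limit, which is the mismatch the abstract describes.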
cs.RO / 72 / 2605.10707
ObjView-Bench: Rethinking Difficulty and Deployment for Object-Centric View Planning
ObjView-Bench:重新思考面向对象的视图规划的难度与部署
Abstract
Object-centric view planning is a core component of active geometric 3D reconstruction in robotics, yet existing evaluations often conflate object complexity, planning difficulty, budget assumptions, and physical reachability constraints. As a result, conclusions drawn from idealized view-planning evaluations may not reliably predict performance under realistic reconstruction settings. We introduce ObjView-Bench, an evaluation framework for rethinking difficulty and deployment in object-centric view planning. First, we disentangle three quantities underlying view-planning evaluation: omnidirectional self-occlusion as an object-side attribute, observation saturation difficulty, and protocol-dependent planning difficulty defined through a set-cover formulation. This separation supports controlled dataset construction, analysis of slow-saturation objects, and a case study showing that planning difficulty-aware sampling can improve learned view planners. Second, we design deployment-oriented evaluation protocols that reveal how budget regimes and reachable-view constraints alter method behavior. Across classical, learned, and hybrid planners, ObjView-Bench shows that difficulty, budget, and reachability constraints substantially change method rankings and failure modes.
Chinese Translation
面向对象的视图规划是机器人主动几何三维重建的核心组成部分,但现有评估往往将对象复杂性、规划难度、预算假设和物理可达性约束混为一谈。因此,从理想化视图规划评估中得出的结论可能无法可靠地预测在现实重建环境下的性能。我们提出了ObjView-Bench,一个用于重新思考面向对象的视图规划中的难度与部署的评估框架。首先,我们将视图规划评估中的三个量解耦:全向自遮挡作为对象侧属性、观察饱和难度,以及通过集合覆盖形式定义的协议相关规划难度。这种分离支持受控数据集构建、慢饱和对象的分析,以及一个案例研究,表明关注规划难度的采样可以改善学习到的视图规划器。其次,我们设计了面向部署的评估协议,揭示预算机制和可达视图约束如何改变方法行为。在经典、学习和混合规划器中,ObjView-Bench显示难度、预算和可达性约束显著改变了方法的排名和失败模式。
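The set-cover view of planning difficulty mentioned in the ObjView-Bench abstract can be illustrated with the standard greedy approximation: each candidate view covers a set of surface elements, and the planner repeatedly picks the view that adds the most uncovered elements. This is a generic sketch, not the benchmark's protocol; the view names, the element IDs, and the choice of the greedy heuristic are assumptions.

```python
def greedy_view_cover(view_coverage, target):
    """Greedy set cover over candidate views.

    view_coverage: dict mapping a view id to the set of surface elements it sees.
    target: iterable of surface elements that should be observed.
    Returns (chosen view ids in order, elements no view can cover).
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Pick the view with the largest marginal coverage gain.
        best = max(view_coverage, key=lambda v: len(view_coverage[v] & uncovered))
        gain = view_coverage[best] & uncovered
        if not gain:
            break  # remaining elements are unreachable from any candidate view
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered
```

The number of views the greedy pass needs gives a rough, protocol-dependent proxy for difficulty, and removing unreachable views from `view_coverage` shows how reachability constraints can change that number.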
cs.RO / 73 / 2605.10760
MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction
MAGS-SLAM:单目多智能体高斯点云SLAM用于几何和光度一致的重建
Abstract
Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large-scale scene capture for virtual production and cooperative multi-robot exploration. While recent 3D Gaussian Splatting (3DGS) SLAM algorithms can generate high-fidelity real-time mapping, most of the existing multi-agent Gaussian SLAM methods still rely on RGB-D sensors to obtain metric depth and simplify cross-agent alignment, which limits deployment on lightweight, low-cost, or power-constrained robotic platforms. To address this challenge, we propose MAGS-SLAM, the first RGB-only multi-agent 3DGS SLAM framework for collaborative scene reconstruction. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries rather than raw observations or dense maps. To facilitate robust collaboration in the presence of monocular scale ambiguity, our framework integrates compact submap communication, geometry- and appearance-aware loop verification, and occupancy-aware Gaussian fusion, enabling coherent global reconstruction without active depth sensors. We further introduce the ReplicaMultiagent Plus benchmark for evaluating collaborative Gaussian SLAM. Intensive experiments on synthetic and real-world datasets show that MAGS-SLAM achieves competitive tracking accuracy and comparable or superior rendering quality to state-of-the-art RGB-D collaborative Gaussian SLAM methods while relying only on RGB images.
Chinese Translation
来自多个智能体的协作光照真实感3D重建能够快速进行大规模场景捕捉,适用于虚拟制作和合作多机器人探索。尽管近期的3D高斯点云(3DGS)SLAM算法能够生成高保真实时映射,但现有的大多数多智能体高斯SLAM方法仍依赖RGB-D传感器获取度量深度并简化跨智能体对齐,这限制了其在轻量级、低成本或功耗受限的机器人平台上的应用。为了解决这一挑战,我们提出了MAGS-SLAM,这是第一个仅基于RGB的多智能体3DGS SLAM框架,用于协作场景重建。每个智能体独立构建局部单目高斯子图,并传输紧凑的子图摘要,而不是原始观测或密集地图。为了在单目尺度模糊的情况下促进稳健的协作,我们的框架集成了紧凑的子图通信、几何和外观感知的环路验证以及占用感知的高斯融合,从而实现无主动深度传感器的连贯全局重建。我们进一步引入了ReplicaMultiagent Plus基准,用于评估协作高斯SLAM。在合成和真实世界数据集上的大量实验表明,MAGS-SLAM在仅依赖RGB图像的情况下,达到了与最先进的RGB-D协作高斯SLAM方法相媲美的跟踪精度和可比或更优的渲染质量。
cs.RO / 74 / 2605.10819
ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
ALAM:代数一致的潜在转变用于视觉-语言-动作模型
Abstract
Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
Chinese Translation
视觉-语言-动作(VLA)模型受到动作标记机器人数据稀缺的限制,而无动作视频则提供了丰富的物理世界变化证据。潜在动作模型为从视频中提取此类先验提供了一种有前景的方法,但经过重建训练的潜在编码不一定适合策略生成:它们可能预测未来观察,但缺乏与机器人动作一致或可重用的结构。我们提出了ALAM(代数潜在动作模型),一种代数一致的潜在动作模型,将无动作视频中的时间关系转化为结构性监督。给定帧三元组,ALAM学习由重建基础的潜在转变,同时通过组合和反转一致性进行正则化,鼓励局部加性转变空间。对于下游VLA学习,我们冻结预训练编码器,并将其潜在转变序列作为辅助生成目标,与机器人动作在联合流匹配目标下共同生成。这将结构化的潜在转变与基于流的策略生成相结合,使策略能够利用ALAM的局部一致转变几何,而无需潜在到动作的解码。表示探针显示,ALAM将加性和可逆性错误减少了25-85倍,相较于无结构潜在动作基线,并改善了长时间跨度的累积重建。当转移到VLA策略时,ALAM将MetaWorld MT50上的平均成功率从47.9%提高到85.0%,在LIBERO上从94.1%提高到98.1%,在真实世界的操作任务中也取得了一致的提升。消融实验进一步确认,最显著的改进来自于代数结构化潜在转变与联合流匹配之间的协同作用。
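The composition and reversal consistency terms described in the ALAM abstract can be written as simple residuals over a frame triplet: a locally additive transition space should satisfy z(a→c) ≈ z(a→b) + z(b→c) and z(b→a) ≈ −z(a→b). The sketch below is illustrative only; the `encode(frame_i, frame_j)` interface and the squared-error form are assumptions, since the abstract does not specify how these regularizers are parameterized or weighted.

```python
def composition_error(encode, frame_a, frame_b, frame_c):
    """Squared residual of z(a->b) + z(b->c) - z(a->c)."""
    z_ab = encode(frame_a, frame_b)
    z_bc = encode(frame_b, frame_c)
    z_ac = encode(frame_a, frame_c)
    return sum((ab + bc - ac) ** 2 for ab, bc, ac in zip(z_ab, z_bc, z_ac))

def reversal_error(encode, frame_a, frame_b):
    """Squared residual of z(a->b) + z(b->a), which is zero for a reversible code."""
    z_ab = encode(frame_a, frame_b)
    z_ba = encode(frame_b, frame_a)
    return sum((ab + ba) ** 2 for ab, ba in zip(z_ab, z_ba))

# Two toy encoders on list-valued "frames" (hypothetical stand-ins):
def additive_encoder(f0, f1):
    return [y - x for x, y in zip(f0, f1)]          # exactly additive

def unstructured_encoder(f0, f1):
    return [(y - x) ** 2 for x, y in zip(f0, f1)]   # discards direction
```

The additive toy encoder drives both residuals to zero, while the unstructured one does not, which mirrors the kind of additivity and reversibility probe the abstract reports.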
cs.RO / 75 / 2605.10821
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
统一噪声引导以实现高效的人类指导VLA适应
Abstract
Diffusion-based vision-language-action (VLA) models have emerged as strong priors for robotic manipulation, yet adapting them to real-world distributions remains challenging. In particular, on-robot reinforcement learning (RL) is expensive and time-consuming, so effective adaptation depends on efficient policy improvement within a limited budget of real-world interactions. Noise-space RL lowers the cost by keeping the pretrained VLA fixed as a denoising generator while updating only a lightweight actor that predicts the noise. However, its performance is still limited due to inefficient autonomous exploration. Human corrective interventions can reduce this exploration burden, but they are naturally provided in action space, whereas noise-space finetuning requires supervision over noise variables. To address these challenges, we propose UniSteer, a Unified Noise Steering framework that combines human corrective guidance with noise-space RL through approximate action-to-noise inversion. Given a human corrective action, UniSteer inverts the frozen flow-matching decoder to recover a noise target, which provides supervised guidance for the same noise actor that is simultaneously optimized via reinforcement learning. Real-world experiments on diverse manipulation tasks show that UniSteer adapts more efficiently than strong noise-space RL and action-space human-in-the-loop baselines, improving the success rate from 20% to 90% in 66 minutes on average across four real-world adaptation tasks.
Chinese Translation
基于扩散的视觉-语言-动作(VLA)模型已成为机器人操作的强大先验,但将其适应于真实世界分布仍然具有挑战性。特别是,机器人上的强化学习(RL)成本高昂且耗时,因此有效的适应依赖于在有限的真实世界交互预算内进行高效的策略改进。噪声空间RL通过将预训练的VLA固定为去噪生成器,同时仅更新预测噪声的轻量级演员,从而降低了成本。然而,由于自主探索效率低下,其性能仍然受到限制。人类的纠正干预可以减少这种探索负担,但它们自然是在动作空间中提供的,而噪声空间微调则需要对噪声变量进行监督。为了解决这些挑战,我们提出了UniSteer,一个统一噪声引导框架,通过近似的动作到噪声反演将人类纠正指导与噪声空间RL相结合。给定一个人类纠正动作,UniSteer反转冻结的流匹配解码器以恢复噪声目标,从而为同一噪声演员提供监督指导,该演员通过强化学习同时进行优化。在多样化的操作任务上的真实世界实验表明,UniSteer的适应效率优于强大的噪声空间RL和动作空间人类参与基准,在四个真实世界适应任务中平均将成功率从20%提高到90%,耗时66分钟。
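The approximate action-to-noise inversion mentioned in the UniSteer abstract can be pictured as integrating a flow-matching decoder's velocity field backward in time, from the corrective action at t = 1 to a noise target at t = 0. The sketch below uses a toy analytic velocity field and plain Euler steps; the field, the step count, and the function names are assumptions, and the abstract does not describe UniSteer's actual inversion procedure.

```python
def velocity(x, t):
    """Toy stand-in for a frozen decoder's velocity field v(x, t)."""
    return [-0.5 * xi + 1.0 for xi in x]

def decode(noise, steps=100):
    """Forward Euler integration from t=0 (noise) to t=1 (action)."""
    x, h = list(noise), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

def invert(action, steps=100):
    """Reverse-time Euler integration from t=1 (action) back to t=0 (noise).

    The result is only approximate: the reverse step evaluates the velocity
    at the current point rather than solving the forward step exactly.
    """
    x, h = list(action), 1.0 / steps
    for k in range(steps, 0, -1):
        v = velocity(x, k * h)
        x = [xi - h * vi for xi, vi in zip(x, v)]
    return x
```

In a setup like the one the abstract describes, the recovered noise would serve as a supervised regression target for the noise actor; here it simply lands close to the noise that produced the action.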
cs.RO / 76 / 2605.10880
Safe Aerial 3D Path Planning for Autonomous UAVs using Magnetic Potential Fields
基于磁势场的自主无人机安全三维路径规划
Abstract
Safe autonomous Uncrewed Aerial Vehicle (UAV) navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that leverages properties of Maxwell's equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived $101^3$ voxel grids. Evaluation across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, a dense night-time cityscape and a suburban district, shows a 100% path planning success rate on both maps without retraining. In offline path planning, 3DMaxConvNet produces path lengths comparable to A* on unseen maps while reducing runtime from 0.155--0.17s to 0.087--0.089s, or about 1.7--1.95 times faster than A*. Against RRT*(3k), 3DMaxConvNet achieves similar path quality while reducing planning runtime from 17.2--17.5s to about 0.09s, which is roughly 193--201 times faster than RRT*(3k).
Chinese Translation
在城市环境中,自主无人机(UAV)的安全导航需要实时路径规划以避开障碍物。MaxConvNet是一种潜势场规划器,利用麦克斯韦方程的特性生成到达目标的路径,避免局部极小值。我们将二维MaxConvNet磁场规划器扩展到三维,使用卷积自编码器从激光雷达(LiDAR)生成的101^3体素网格中预测障碍物感知的潜势场。在两个不同的Cosys-AirSim城市环境中进行的100次随机闭环试验评估表明,在这两张地图上,路径规划的成功率达到了100%,且无需重新训练。在离线路径规划中,3DMaxConvNet生成的路径长度与未见地图上的A*算法相当,同时将运行时间从0.155--0.17秒减少到0.087--0.089秒,约为A*的1.7--1.95倍速度。与RRT*(3k)相比,3DMaxConvNet在路径质量上实现了相似的效果,同时将规划运行时间从17.2--17.5秒减少到约0.09秒,速度大约是RRT*(3k)的193--201倍。
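The speedup figures quoted in the MaxConvNet abstract can be checked from the reported runtime ranges. The short sketch below takes the most conservative and most favorable ratios of the two intervals; pairing the intervals this way, and reusing the 0.087--0.089 s range for the RRT* comparison (where the abstract only says "about 0.09s"), are assumptions made here.

```python
def speedup_range(baseline_s, method_s):
    """Return (smallest, largest) speedup given (low, high) runtime ranges in seconds."""
    base_lo, base_hi = baseline_s
    meth_lo, meth_hi = method_s
    # Smallest ratio: fastest baseline vs. slowest method; largest: the reverse.
    return base_lo / meth_hi, base_hi / meth_lo

astar = speedup_range((0.155, 0.17), (0.087, 0.089))
rrt_star = speedup_range((17.2, 17.5), (0.087, 0.089))
print(f"A*:   {astar[0]:.2f}x to {astar[1]:.2f}x")
print(f"RRT*: {rrt_star[0]:.0f}x to {rrt_star[1]:.0f}x")
```

Under these assumptions the ratios come out at roughly 1.74 to 1.95 for A* and roughly 193 to 201 for RRT*(3k), consistent with the ranges stated in the abstract.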
cs.RO / 77 / 2605.10904
MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems
MDrive:闭环协作驾驶的基准测试用于端到端多智能体系统
Abstract
Vehicle-to-Everything (V2X) communication has emerged as a promising paradigm for autonomous driving, enabling connected agents to share complementary perception information and negotiate with each other to benefit the final planning. Existing V2X benchmarks, however, fall short in two ways: (i) open-loop evaluations fail to capture the inherently closed-loop nature of driving, leading to evaluation gaps, and (ii) current closed-loop evaluations lack the behavioral and interactive diversity needed to reflect real-world driving. Thus, the extent to which multi-agent systems benefit closed-loop driving remains unclear. In this paper, we introduce MDrive, a closed-loop cooperative driving benchmark comprising 225 scenarios grounded in both NHTSA pre-crash typologies and real-world V2X datasets. Our benchmark results demonstrate that multi-agent systems are generally better than single-agent counterparts. However, current multi-agent systems still face two important challenges: (i) perception sharing enhances perception, but does not always translate to better planning; (ii) negotiation improves planning performance but harms it in complex and dense traffic scenarios. MDrive further provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation. Together, MDrive establishes a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.
Chinese Translation
车对一切(V2X)通信已成为自主驾驶的一个有前景的范式,使得连接的智能体能够共享互补的感知信息并相互协商以优化最终规划。然而,现有的V2X基准测试在两个方面存在不足:(i)开放循环评估未能捕捉到驾驶的固有闭环特性,导致评估差距;(ii)当前的闭环评估缺乏行为和互动的多样性,无法反映真实世界的驾驶。因此,关于多智能体系统在闭环驾驶中的益处程度仍不明确。本文介绍了MDrive,一个闭环协作驾驶基准,包含225个场景,基于NHTSA(美国国家公路交通安全管理局)预碰撞类型和真实世界的V2X数据集。我们的基准测试结果表明,多智能体系统通常优于单智能体系统。然而,当前的多智能体系统仍面临两个重要挑战:(i)感知共享增强了感知能力,但并不总是转化为更好的规划;(ii)协商提高了规划性能,但在复杂和密集的交通场景中却可能造成损害。MDrive还提供了一个开源工具箱,用于场景生成、Real2Sim转换和人机交互模拟。总之,MDrive为评估和改进协作驾驶系统的泛化能力和鲁棒性奠定了可重复的基础。
cs.RO / 78 / 2605.10921
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena:一个全面且具有挑战性的机器人记忆基准
Abstract
Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.
Chinese Translation
记忆是机器人智能的关键组成部分,因为机器人必须依赖过去的观察和行动来完成在部分可观察环境中的长期任务。然而,现有的机器人记忆基准仍然缺乏多模态注释以支持记忆形成,任务覆盖和结构复杂性有限,并且仍然局限于模拟环境而没有现实世界的评估。我们通过RoboMemArena来填补这一空白,这是一个包含26个任务的大规模基准,每个任务的平均轨迹长度超过1000步,68.9%的子任务依赖于记忆。生成管道利用视觉-语言模型(VLM)设计和组合子任务,通过原子函数生成完整轨迹,并提供与记忆相关的注释,包括子任务指令和原生关键帧注释,同时配对的现实世界记忆任务支持物理评估。我们进一步设计了PrediMem,一个双系统的VLA,其中高层次的VLM规划器管理一个包含最近和关键帧缓冲区的记忆库,并使用预测编码头来提高对任务动态的敏感性。在RoboMemArena上的大量实验表明,PrediMem在所有基准中表现优越,并为记忆管理、模型架构和复杂记忆系统的扩展法则提供了深入见解。
cs.RO / 79 / 2605.10925
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA:用于视觉-语言-动作模型的先验保持适应
Abstract
Large-scale pretraining has made Vision-Language-Action (VLA) models promising foundations for generalist robot manipulation, yet adapting them to downstream tasks remains necessary. However, the common practice of full fine-tuning treats pretraining as initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, a novel framework that preserves pretrained priors and learns to leverage them for effective adaptation. PriorVLA keeps a frozen Prior Expert as a read-only prior source and trains an Adaptation Expert for downstream specialization. Expert Queries capture scene priors from the pretrained VLM and motor priors from the Prior Expert, integrating both into the Adaptation Expert to guide adaptation. Together, PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA achieves stronger overall performance than full fine-tuning and state-of-the-art VLA baselines, with the largest gains under out-of-distribution (OOD) and few-shot settings. PriorVLA improves over pi0.5 by 11 points on RoboTwin 2.0-Hard and achieves 99.1% average success on LIBERO. Across eight real-world tasks and two embodiments, PriorVLA reaches 81% in-distribution (ID) and 57% OOD success with standard data. With only 10 demonstrations per task, PriorVLA reaches 48% ID and 32% OOD success, surpassing pi0.5 by 24 and 22 points, respectively.
Chinese Translation
大规模预训练使得视觉-语言-动作(VLA)模型成为通用机器人操作的有前景基础,但将其适应于下游任务仍然是必要的。然而,完全微调的常见做法将预训练视为初始化,可能会将广泛的先验转向狭窄的训练分布模式。我们提出了PriorVLA,这是一种新颖的框架,旨在保持预训练的先验并学习如何利用这些先验进行有效的适应。PriorVLA保持一个冻结的先验专家作为只读的先验来源,并训练一个适应专家以进行下游专业化。专家查询从预训练的视觉语言模型(VLM)捕获场景先验,并从先验专家获取运动先验,将两者整合到适应专家中以指导适应。总的来说,PriorVLA仅更新完全微调所更新参数的25%。在RoboTwin 2.0、LIBERO和真实世界任务中,PriorVLA的整体表现优于完全微调和最先进的VLA基线,在分布外(OOD)和少样本设置下获得了最大的提升。PriorVLA在RoboTwin 2.0-Hard上比pi0.5提高了11个百分点,并在LIBERO上实现了99.1%的平均成功率。在八个真实世界任务和两个实现中,PriorVLA在标准数据下达到了81%的分布内(ID)和57%的OOD成功率。在每个任务仅有10个演示的情况下,PriorVLA达到了48%的ID和32%的OOD成功率,分别超过了pi0.5的24和22个百分点。
cs.RO / 80 / 2605.10942
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM:通过自适应世界动作模型协调可泛化与精确操控
Abstract
World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the "Imagine-then-Execute" approach, which uses video prediction to infer actions via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations. Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. Specifically, the world model provides spatio-temporal physical priors that condition two complementary action experts: a predictive expert that leverages latent dynamics for iterative action generation, and a reactive expert that directly infers actions from predicted visual evolution. To enable adaptive coordination, a Process-Adaptive Gating Mechanism is proposed to automatically determine the timing and location of switching between them. This allows the world model to drive the reactive expert to expand the exploration space and the predictive expert to perform precise interactions across different stages of a task. For evaluation, we construct three training-unseen test environments across six real-world robotic tasks, covering variations in background, position, and object semantics. Notably, HarmoWAM achieves strong zero-shot generalization across these scenarios, significantly outperforming prior state-of-the-art VLA models and WAMs by margins of 33% and 29%, respectively.
Chinese Translation
世界动作模型(WAMs)作为一种通过建模物理动态来实现机器人控制的有前景的范式逐渐崭露头角。目前的WAMs通常遵循两种范式:'想象-再执行'方法,该方法利用视频预测通过逆动态推断动作,以及'联合建模'方法,该方法联合建模动作和视频表示。基于系统实验,我们观察到这两种范式之间存在根本的权衡:前者明确利用世界模型实现可泛化的过渡,但缺乏交互精度,而后者则能够生成细粒度、时间一致的动作,但受限于训练分布的探索空间。基于这些发现,我们提出了HarmoWAM,一种端到端的WAM,充分利用世界模型统一预测控制和反应控制,实现可泛化的过渡和精确操控。具体而言,世界模型提供了时空物理先验,条件化两个互补的动作专家:一个利用潜在动态进行迭代动作生成的预测专家,以及一个直接从预测的视觉演变中推断动作的反应专家。为了实现自适应协调,提出了一种过程自适应门控机制,自动确定在两者之间切换的时机和位置。这使得世界模型能够驱动反应专家扩展探索空间,同时预测专家在任务的不同阶段执行精确交互。为了评估,我们构建了三个训练未见测试环境,涵盖六个真实世界的机器人任务,涉及背景、位置和物体语义的变化。值得注意的是,HarmoWAM在这些场景中实现了强大的零-shot泛化,显著超越了之前的最先进VLA模型和WAMs,分别提高了33%和29%。
cs.CV / 1 / 2605.08133
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG:用于自主驾驶的检索增强视觉-语言-动作模型
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose VLADriver-RAG, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a Visual-to-Scenario mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a Scenario-Aligned Embedding Model that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.
Chinese Translation
视觉-语言-动作(VLA)模型作为一种端到端自主驾驶的有前景范式,已经出现,但其对隐式参数知识的依赖限制了在长尾场景中的泛化能力。虽然检索增强生成(RAG)通过访问外部专家先验提供了解决方案,但标准视觉检索存在高延迟和语义模糊的问题。为了解决这些挑战,我们提出了VLADriver-RAG,一个将规划基于显式、结构感知的历史知识的框架。具体而言,我们通过“视觉到场景”机制将感官输入抽象为时空语义图,有效过滤视觉噪声。为了确保检索的相关性,我们采用“场景对齐嵌入模型”,利用图形动态时间规整(Graph-DTW)度量对齐,优先考虑内在拓扑一致性而非表面视觉相似性。这些检索到的先验随后与基于查询的VLA主干融合,以合成精确、解耦的轨迹。在Bench2Drive基准上的广泛实验建立了新的最先进水平,达到了89.12的驾驶评分。
cs.CV / 2 / 2605.08136
Benchmarking ResNet Backbones in RT-DETR: Impact of Depth and Regularization under environmental conditions
在 RT-DETR 中基于 ResNet 的骨干网络基准测试:深度和正则化在环境条件下的影响
Abstract
Visual perception plays a central role in competitive robotics, where environmental variations can directly affect real-time detection performance. The related literature on transformer-based detectors lacks information regarding the impact of backbone scale and environmental settings on model performance. This work presents a comparative evaluation of RT-DETR for detecting round objects under environmental and hyperparameter variations relevant to competitive robotics. Four ResNet backbones (ResNet18, ResNet34, ResNet50, and ResNet101) were compared under varying dropout rates, analyzing their effect on confidence and accuracy. All models were trained under the same configuration and evaluated under changes in lighting and background contrast. Environmental conditions primarily impact prediction confidence, while inference latency remains largely unaffected and classification accuracy stays consistently high, at or near 1.00 in most cases. Two distinct behaviors were observed. Under illumination variation, ResNet50 achieves the best trade-off, combining near-perfect accuracy, confidence values up to approximately 0.869, and latency around 0.058-0.059 ms. Under background variation, ResNet34 provides the most balanced performance, reaching near-perfect accuracy and higher confidence values up to approximately 0.887. These results indicate that the optimal architecture depends on the type of environmental variation, with intermediate-depth models offering the best balance between performance and efficiency.
Chinese Translation
视觉感知在竞争性机器人中发挥着核心作用,环境变化可以直接影响实时检测性能。关于基于变换器的检测器的相关文献缺乏关于骨干网络规模和环境设置对模型性能影响的信息。本研究对 RT-DETR 在环境和超参数变化下检测圆形物体进行了比较评估,这些变化与竞争性机器人相关。比较了四种 ResNet 骨干网络(ResNet18、ResNet34、ResNet50 和 ResNet101),使用 dropout 率分析其对置信度和准确性的影响。所有模型在相同配置下进行训练,并在光照和背景对比度变化下进行评估。环境条件主要影响预测置信度,而推理延迟基本不受影响,分类准确性在大多数情况下保持在 1.00 或以上。观察到两种不同的行为。在光照变化下,ResNet50 实现了最佳的权衡,结合了近乎完美的准确性、约 0.869 的置信度值和约 0.058-0.059 毫秒的延迟。在背景变化下,ResNet34 提供了最平衡的性能,达到了近乎完美的准确性和高达约 0.887 的置信度值。这些结果表明,最佳架构依赖于环境变化的类型,中等深度模型在性能和效率之间提供了最佳平衡。
cs.CV / 3 / 2605.08145
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
自我描述的多模态交互调优:放大可利用冗余以增强视觉语言模型的鲁棒性
Abstract
Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions -- redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities -- to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visually induced errors by 38.3% and improve consistency by 16.8%.
Chinese Translation
当前的视觉语言模型面临着对模糊或损坏模态的幻觉和鲁棒性问题。我们假设这些问题可以通过利用模态之间的共享信息来补偿受损的模态来解决。为此,我们分析多模态交互——模态提供的冗余(共享)、独特(排他)和协同(新兴)任务相关信息——以确定它们对模型可靠性的影响。具体而言,放大冗余交互将增加这种可利用的共享信息,从而解决这些问题;然而,现代指令数据集往往消除冗余,以优先考虑视觉基础。我们通过一个自我描述的工作流程来弥补这一差距,该流程具有一个Multimodal Interaction Gate:一个将独特交互转换为冗余交互的机制。我们的研究结果表明,增加冗余可以将视觉引起的错误减少38.3%,并提高一致性16.8%。
cs.CV / 4 / 2605.08146
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
VT-Bench:视觉-表格多模态学习的统一基准
Abstract
Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing visual-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal visual-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench
Chinese Translation
多模态学习在视觉-文本任务中引起了广泛关注。然而,视觉-表格数据在医疗和工业等高风险领域中发挥着关键作用,但仍然未得到充分探索。本文介绍了VT-Bench,这是第一个用于标准化视觉-表格判别预测和生成推理任务的统一基准。VT-Bench 汇集了来自 9 个领域(以医疗为中心,同时涵盖宠物、媒体和交通)的 14 个数据集,样本总数超过 756K。我们评估了 23 个具有代表性的模型,包括单模态专家、专门的视觉-表格模型、通用视觉-语言模型(VLMs)和工具增强方法,突显了视觉-表格学习的重大挑战。我们相信 VT-Bench 将激励社区构建更强大的多模态视觉-表格基础模型。基准链接: https://github.com/Ziyi-Jia990/VT-Bench
cs.CV / 5 / 2605.08156
LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
LAGO:基于语言引导的自适应对象区域聚焦用于零-shot视觉-文本对齐
Abstract
Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random or redundant crops, increasing inference cost and introducing many highly redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We propose LAGO (LAnguage-Guided adaptive Object-region focus), a framework for efficient and robust zero-shot localized visual-text alignment. LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy. Extensive experiments show that LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings, while requiring substantially fewer candidate regions at inference time.
Chinese Translation
零-shot识别旨在通过从一组候选类别中选择最兼容的标签描述来对图像进行分类,而无需任何特定任务的监督。然而,在细粒度设置中,相关证据往往存在于局部部分、属性或纹理中,而不是整个图像,这使得全图对齐的效果不佳。最近的局部视觉-文本对齐方法通过将类别描述与多个图像区域进行比较来解决这一问题,但它们通常依赖于大量随机或冗余的裁剪,增加了推理成本,并引入了许多高度冗余或弱相关的候选项。此外,过早引入语义指导可能会导致错误放大反馈过程,其中不准确的中间预测会偏向后续的定位,并强化后续的错误;我们将这种失败模式称为预测循环。我们提出了LAGO(基于语言引导的自适应对象区域聚焦),这是一个高效且稳健的零-shot局部视觉-文本对齐框架。LAGO首先执行与类别无关的以对象为中心的候选发现,以获得稳定的视觉初始化,然后应用自适应的语言引导细化,其中语义指导的强度由中间置信度控制。它进一步通过有效的对象-上下文双通道聚合策略结合对象级、上下文和全图证据。大量实验表明,LAGO在标准零-shot基准和具有挑战性的分布转移设置中始终实现了最先进的性能,同时在推理时所需的候选区域显著减少。
cs.CV / 6 / 2605.08158
HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding
HY-Himmel技术报告:用于长视频理解的层次交错多流运动编码
Abstract
Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.
Chinese Translation
使用多模态语言模型进行长视频理解面临三个复合瓶颈:获取密集RGB帧的高解码成本、随着帧数增加的二次令牌增长,以及在稀疏关键帧采样下的弱运动感知。我们提出了HY-Himmel,一个层次化的视频-语言框架,分别分配语义和运动能力。一小组稀疏的锚I帧被路由到昂贵的主机ViT,以确定对象身份和场景布局,而更密集的帧间间隔则通过轻量级压缩域三流适配器进行编码,该适配器从运动向量图、残差图和I帧上下文中提取运动证据,形成对齐的运动令牌。这些令牌在经过专门的阶段一对比对齐后,通过可微分占位符机制注入到LLM中,使运动表示与冻结的视觉主干相兼容。在Video-MME上,HY-Himmel的表现超过了密集的32帧基线,提升了2.3个百分点(从61.2%提升至63.5%),同时使用了3.6倍更少的上下文令牌。对流组合、运动编码器家族、融合模式、对齐目标、锚数量、LoRA秩和视频时长的广泛消融实验确认了完整的三流结构对于观察到的增益是必要且充分的。
cs.CV / 7 / 2605.08160
WATCH: Wide-Area Archaeological Site Tracking for Change Detection
WATCH:广域考古遗址变化检测追踪
Abstract
Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. We introduce WATCH, a framework for month-level change-event localization over PlanetScope satellite mosaics (2017-2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training-free method that scores month-to-month deviations from a local temporal reference; (ii) Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event-month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, and Satlas-Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% within a three-month tolerance (m=3). Handcrafted features remain competitive for exact-month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi-EO-2.0 exhibits the strongest early-warning profile, detecting anomalies before the recorded event, while TED favors confirmation-oriented detection after a change has materialized. These results show that satellite imagery combined with foundation-model embeddings enables scalable, decision-relevant heritage monitoring. Code: https://github.com/microsoft/WATCH
Chinese Translation
大规模监测考古遗址对于保护文化遗产至关重要,但准确确定干扰发生的时间仍然困难,因为视觉线索微妙且实地验证数据稀缺。我们提出了WATCH,一个基于PlanetScope卫星马赛克(2017-2024,4.7 m/px)的月度变化事件定位框架,支持三种互补的评分方法:(i) 时间嵌入距离(Temporal Embedding Distance, TED),一种无训练的方法,用于评分与地方时间参考的逐月偏差;(ii) 自监督变化检测(Self-Supervised Change Detection, SSCD),一个重建、预测和潜在新颖性信号的集成;以及(iii) 一个使用稀疏事件月份标签训练的弱监督(Weakly Supervised, WS)时间定位模型。我们在阿富汗的1,943个考古遗址上对WATCH进行了基准测试,使用六个基础模型(CLIP、GeoRSCLIP、SatMAE、Prithvi-EO-2.0、DINOv3和Satlas-Pretrain)的嵌入,并结合手工制作的光谱和纹理基线,评估在叙利亚、土耳其、巴基斯坦和埃及遗址的跨区域泛化。无监督方法(TED、SSCD)在性能上始终优于弱监督替代方案。使用SatMAE的TED在精确月份召回率上达到最高(m=0时为55%),而使用GeoRSCLIP、CLIP或Satlas-Pretrain的TED在三个月容忍度内达到92.5%(m=3)。在弱监督下,手工特征在精确月份检测中仍具竞争力。我们的方向性边际分析揭示了系统性的时间偏差:SSCD与GeoRSCLIP或Prithvi-EO-2.0配对展现出最强的预警特征,能够在记录事件之前检测到异常,而TED则偏向于在变化发生后进行确认性检测。这些结果表明,结合卫星图像与基础模型嵌入能够实现可扩展的、与决策相关的遗产监测。代码: https://github.com/microsoft/WATCH
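The WATCH abstract describes TED as scoring month-to-month deviations from a local temporal reference, and reports recall within a tolerance of m months. The following is a minimal sketch of both ideas, not the authors' implementation: the choice of cosine distance, the mean of the preceding `window` months as the reference, and the function names `ted_scores` and `recall_within` are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def ted_scores(embeddings, window=3):
    """Score month t by its distance to the mean embedding of the
    preceding `window` months (hypothetical local temporal reference)."""
    scores = [0.0] * len(embeddings)
    for t in range(window, len(embeddings)):
        recent = embeddings[t - window:t]
        reference = [sum(col) / window for col in zip(*recent)]
        scores[t] = cosine_distance(embeddings[t], reference)
    return scores


def recall_within(predicted_months, event_months, m):
    """Fraction of sites whose predicted month is within m months of the label."""
    hits = sum(1 for p, a in zip(predicted_months, event_months) if abs(p - a) <= m)
    return hits / len(event_months)


# Toy site: the embedding is stable for four months, then changes at month 4.
monthly = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 2
scores = ted_scores(monthly, window=3)
predicted = scores.index(max(scores))  # month with the largest deviation
print(predicted, [round(s, 3) for s in scores])
```

Because the method is training-free, the only choices in this sketch are the distance and the reference window; a longer window gives a smoother reference but reacts more slowly to a change.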
cs.CV / 8 / 2605.08161
Advanced Tumor Segmentation in PET/CT Imaging: A Training Strategy Study with nnU-Net for AutoPET III
PET/CT影像中的高级肿瘤分割:基于nnU-Net的AutoPET III训练策略研究
Abstract
Tumor segmentation in whole-body PET/CT imaging is crucial for precise disease evaluation and treatment planning. However, it remains challenging due to variability in lesion size, contrast, and anatomical distribution. Relying on manual segmentation makes the process time-consuming and prone to intra- and inter-observer variability. This work presents a whole-body tumor segmentation method developed for the AutoPET III challenge, where the goal is to build models that generalize across tracers and multi-center data. We employ the nnU-Net framework with a ResNet-based encoder as our baseline and systematically investigate the impact of training strategies, including intensity normalization, batch dice optimization, and data augmentation using CraveMix. Our experiments show that these strategies significantly influence model performance, particularly in reducing false positives and improving robustness to lesion variability. The best-performing configuration achieves a Dice score of up to 0.80 on the preliminary test phase, and our method ranked third in the AutoPET III challenge. The code is publicly available here.
Chinese Translation
全身PET/CT影像中的肿瘤分割对于精确的疾病评估和治疗规划至关重要。然而,由于病灶大小、对比度和解剖分布的变异性,这一过程仍然具有挑战性。依赖手动分割使得这一过程耗时且容易受到观察者间和观察者内的变异影响。本研究提出了一种为AutoPET III挑战开发的全身肿瘤分割方法,目标是构建能够在不同示踪剂和多中心数据中泛化的模型。我们采用基于ResNet的编码器作为基线的nnU-Net框架,并系统地研究了训练策略的影响,包括强度归一化、批次Dice优化和使用CraveMix的数据增强。我们的实验表明,这些策略显著影响模型性能,特别是在减少假阳性和提高对病灶变异性的鲁棒性方面。表现最佳的配置在初步测试阶段达到了高达0.80的Dice分数,我们的方法在AutoPET III挑战中排名第三。代码可在此公开获取。
cs.CV / 9 / 2605.08163
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
MULTITEXTEDIT:跨语言文本图像编辑降级的基准测试
Abstract
Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted $\kappa$ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.
Chinese Translation
文本图像编辑已成为视觉内容创作的关键能力,但现有基准测试仍然以英语为中心,且常常将视觉可信度与语义正确性混为一谈。我们引入了MULTITEXTEDIT,这是一个包含3600个实例的受控基准,涵盖12种类型学上多样的语言、5个视觉领域和7种编辑操作。每个实例的语言变体共享一个共同的视觉基础,并与人工编辑的参考和区域掩码配对,以隔离语言变量进行跨语言比较。为了捕捉粗略文本匹配指标遗漏的脚本级错误,如缺失的变音符号、反向RTL顺序和混合脚本渲染,我们引入了一种语言保真度(Language Fidelity, LSF)指标,该指标通过两阶段的LVM协议进行评分,首先追踪编辑后的目标文本,然后单独评判,最终与母语者注释者的评分达到0.76的二次加权$\kappa$值。通过与标准语义和掩码感知像素指标一起评估12个开源和专有系统,我们发现每个模型均存在明显的跨语言降级,其中希伯来语和阿拉伯语降级最为严重,而荷兰语和西班牙语降级最小,降级主要集中在文本准确性和脚本保真度上,而非粗略的结构维度。我们还发现了一种普遍的语义和像素不匹配现象,输出结果在保留整体布局和背景保真度的同时,扭曲了特定脚本的形式。
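The MULTITEXTEDIT abstract reports agreement between its automatic judge and native-speaker annotators as a quadratic-weighted kappa. As a reminder of what that statistic measures, here is a small standard-library sketch of the usual definition for ordinal ratings. The function name, the integer rating scale `0..num_levels-1`, and the toy ratings are illustrative assumptions and are not taken from the benchmark.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic disagreement weights.

    Ratings are integers in 0..num_levels-1 (requires num_levels >= 2).
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("rating lists must be non-empty and of equal length")

    # Observed joint counts of (rating by A, rating by B).
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # counts expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters used one identical level everywhere: perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


judge = [0, 1, 2, 2, 1]   # hypothetical automatic scores
human = [0, 1, 2, 1, 1]   # hypothetical annotator scores
print(round(quadratic_weighted_kappa(judge, human, num_levels=3), 3))  # 0.8
```

The quadratic weights penalize a disagreement of two levels four times as much as a disagreement of one level, which is why this variant is commonly preferred for graded quality scores.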
cs.CV / 10 / 2605.08167
Digital Image Forgery Detection Using Transfer Learning
基于迁移学习的数字图像伪造检测
Abstract
The increasing availability of advanced image editing tools has led to a significant rise in manipulated digital content, posing serious challenges for digital forensics and information security. This study presents a transfer learning-based framework for digital image forgery detection that integrates compression-aware feature enhancement with deep convolutional neural network (CNN) architectures. The proposed approach introduces a hybrid input representation that combines RGB images with compression difference-based features (FDIFF), explicitly highlighting subtle manipulation artifacts that are often difficult to detect. In addition, a model-specific adaptive threshold optimization strategy based on the Youden Index is employed to improve classification reliability by achieving a better balance between true positive and false positive rates. Experiments conducted on the CASIA v2.0 dataset using multiple pretrained CNN architectures, including DenseNet121, VGG16, ResNet50, EfficientNetB0, MobileNet, and InceptionV3, demonstrate the effectiveness and robustness of the proposed framework. The models are evaluated using comprehensive performance metrics such as accuracy, precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). The results show that DenseNet121 achieves the highest accuracy and AUC, while ResNet50 provides the most balanced and reliable predictions with the highest MCC. The findings emphasize that relying solely on accuracy is insufficient for forensic applications, where minimizing false negatives is critical. Overall, the proposed framework improves the visibility of manipulation artifacts and enhances classification robustness, making it suitable for real-world digital image forgery detection scenarios.
Chinese Translation
先进图像编辑工具的日益普及导致了被操纵数字内容的显著增加,这对数字取证和信息安全提出了严峻挑战。本研究提出了一种基于迁移学习的数字图像伪造检测框架,该框架将压缩感知特征增强与深度卷积神经网络(CNN)架构相结合。所提出的方法引入了一种混合输入表示,结合了RGB图像与基于压缩差异的特征(FDIFF),明确突出那些通常难以检测的微妙操控伪影。此外,采用基于Youden指数的模型特定自适应阈值优化策略,通过在真正率和假正率之间实现更好的平衡,提高分类的可靠性。在使用多个预训练的CNN架构(包括DenseNet121、VGG16、ResNet50、EfficientNetB0、MobileNet和InceptionV3)对CASIA v2.0数据集进行的实验中,展示了所提出框架的有效性和鲁棒性。模型通过准确率、精确率、召回率、F1分数、马修斯相关系数(MCC)和ROC曲线下面积(AUC)等综合性能指标进行评估。结果显示,DenseNet121实现了最高的准确率和AUC,而ResNet50则提供了最平衡和可靠的预测,具有最高的MCC。研究结果强调,仅依赖准确率不足以满足取证应用的需求,在这些应用中,最小化假阴性至关重要。总体而言,所提出的框架提高了操控伪影的可见性,增强了分类的鲁棒性,使其适用于现实世界的数字图像伪造检测场景。
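The forgery-detection abstract mentions choosing a per-model decision threshold with the Youden Index, J = TPR - FPR. The sketch below shows that selection rule on a handful of made-up scores. It assumes that higher scores mean "forged" and that a sample is flagged when its score is at or above the threshold. The function name and the data are hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR.

    A sample is predicted positive when score >= threshold.
    Labels are 1 (positive, e.g. forged) or 0 (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:  # ties keep the lowest (most sensitive) threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical validation scores from one model.
scores = [0.10, 0.20, 0.35, 0.40, 0.62, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]
print(youden_threshold(scores, labels))  # (0.4, 0.75)
```

Note that J weights false positives and false negatives equally. In a forensic setting where missed forgeries are costlier, the same loop could maximize a weighted variant instead.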
cs.CV / 11 / 2605.08169
Optimized Culprit Identification Using Mobilenet and Attention Mechanisms
基于Mobilenet和注意力机制的优化罪犯识别
Abstract
Automated culprit identification in surveillance systems is a critical task that requires high accuracy along with computational efficiency for real-time deployment. In this paper, an optimized deep learning framework is proposed using a lightweight MobileNet architecture integrated with channel and spatial attention mechanisms. The proposed model enhances feature representation by selectively focusing on the most discriminative regions while suppressing irrelevant background information, thereby improving identification performance. The framework incorporates efficient preprocessing, attention-based feature refinement, and a robust classification strategy optimized using the Adam optimizer. Experiments were conducted on benchmark face recognition datasets, including Labeled Faces in the Wild (LFW), CASIA-WebFace, and a subset of VGGFace2, under realistic conditions with variations in illumination, pose, and occlusion. The results demonstrate that the proposed model achieves a high classification accuracy of 97.8%, outperforming conventional models such as baseline CNN, ResNet, and standard MobileNet. The confusion matrix analysis indicates strong class-wise discrimination with minimal misclassification, while ROC-AUC evaluation confirms robust performance across all classes. Additionally, the proposed approach maintains low computational complexity and reduced inference time, making it suitable for real-time surveillance and edge-based applications.
Chinese Translation
自动化的监控系统罪犯识别是一项关键任务,要求在实时部署中具备高准确性和计算效率。本文提出了一种优化的深度学习框架,采用轻量级的MobileNet架构,并结合通道和空间注意力机制。所提出的模型通过选择性地关注最具判别性的区域,同时抑制无关的背景信息,从而增强特征表示,提升识别性能。该框架包括高效的预处理、基于注意力的特征精炼以及使用Adam优化器优化的强大分类策略。我们在真实条件下对多个基准人脸识别数据集进行了实验,包括Labelled Faces in the Wild (LFW)、CASIA-WebFace和VGGFace2的一个子集,测试条件涉及光照、姿态和遮挡的变化。结果表明,所提出的模型实现了97.8%的高分类准确率,优于传统模型如基线CNN、ResNet和标准MobileNet。混淆矩阵分析显示出强烈的类别区分能力,误分类率极低,而ROC-AUC评估则确认了在所有类别上的稳健性能。此外,所提出的方法保持了低计算复杂性和减少的推理时间,使其适合于实时监控和边缘计算应用。
cs.CV / 12 / 2605.08172
Augmented Equivariant Mesh Networks for Anatomical Segmentation
用于解剖分割的增强等变网格网络
Abstract
Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at $40^\circ$ tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight ($<2$M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures.
Chinese Translation
解剖网格分割需要能够直接在不规则表面几何上操作的模型,同时对任意患者姿态和网格分辨率变化保持鲁棒性。现有的任务特定网格和点云方法并不具备等变性,并且在测试时扰动下可能会急剧下降,例如在 $40^\circ$ 倾斜下,口内扫描分割的 IoU 分数可能下降 25-26 点。我们提出了 EAMS(Equivariant Anatomical Mesh Segmentor),它基于等变网格神经网络(EMNN)构建,并在四个临床不同的任务中进行评估,这些任务涵盖了边缘、顶点和面级别的监督。我们结合了内在网格描述符与解剖学感知的先验,包括用于牙弓和肝脏表面的 PCA 导出框架,并增强了消息传递以提供轻量级的全局上下文。在颅内动脉瘤和口内分割任务中,EAMS 的变体在未扰动输入下与专门的基线模型具有竞争力,同时在几何扰动下保持稳定,并且在肝脏表面上,它们在典型姿态准确性和旋转鲁棒性之间展现了良好的权衡。这些结果表明,一个轻量级(<2M 参数)的等变框架能够在多种监督类型下提供鲁棒的解剖网格分割,而无需特定于任务的架构。
cs.CV / 13 / 2605.08173
CASISR: Circular Arbitrary-Scale Image Super-Resolution
CASISR:循环任意尺度图像超分辨率
Abstract
The generalization performance (GP) of deep learning-based arbitrary-scale image super-resolution (ASISR) methods is constrained by limited training datasets and unbounded testing data. It is therefore important to enhance the GP of pretrained ASISR models by making full use of the testing samples. ASISR models usually employ an open-loop architecture that maps low-resolution (LR) images to super-resolution (SR) images. The degradation model from SR samples to LR samples is known (bicubic down-sampling) for classical ASISR, is assumed to be down-sampling with additive random noise for blind ASISR, and is learnable for real-world ASISR. Combining the ASISR and degradation models makes it potentially possible to adopt a closed-loop architecture, based on automatic control theory, to strengthen the GP of ASISR methods. This paper therefore proposes a closed-loop architecture, circular ASISR (CASISR), to improve image reconstruction. A nonlinear loop equation is established to describe CASISR; its soundness is justified using conditional probability theory, and its stability is proven using a Taylor series approximation. First-order and second-order absolute difference images are defined to compare the image reconstruction performance of the ASISR and CASISR methods. Comprehensive simulation experiments show that the proposed CASISR approach outperforms eight state-of-the-art ASISR approaches in image reconstruction quality. In particular, CASISR is well suited to fractional SR scale factors and is highly effective for text and stripe images with sharply changing edges.
Chinese Translation
基于深度学习的任意尺度图像超分辨率(ASISR)方法的泛化性能(GP)受到有限训练数据集和无限测试数据集的限制。充分利用测试样本来提高预训练ASISR模型的GP具有重要意义。ASISR模型通常采用从低分辨率(LR)图像到超分辨率(SR)图像的开环架构。对于经典ASISR,从SR样本到LR样本的退化模型被称为双三次下采样,而对于盲ASISR,则假设存在带有加性随机噪声的下采样,而对于现实世界的ASISR,则是可学习的。结合ASISR和退化模型,基于自动控制理论采用闭环架构以增强ASISR方法的GP是有潜力的。因此,本文提出了一种闭环架构——循环ASISR(CASISR),以提升图像重建能力。建立了一个数学非线性循环方程来描述CASISR,利用条件概率理论证明了CASISR的合理性,并通过泰勒级数近似证明了CASISR的稳定性。定义了一阶和二阶绝对差异图像,以比较ASISR和CASISR方法的图像重建性能。综合仿真实验表明,所提出的CASISR方法在图像重建质量上优于八种最先进的ASISR方法。特别是,所提出的CASISR对于分数SR尺度因子极为适用,并且在处理边缘变化剧烈的文本和条纹图像时极为有效。
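The CASISR abstract contrasts an open-loop SR model with a closed loop that feeds the output back through a degradation model. The toy below illustrates that general feedback idea on a 1-D signal using an iterative back-projection style update. It is not the paper's loop equation. The average-pooling degradation, the deliberately biased stand-in SR model, the `gain` parameter, and all function names are assumptions chosen to keep the example self-contained.

```python
def downsample(signal, factor=2):
    """Stand-in degradation model: average each block of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor for i in range(0, len(signal), factor)]


def upsample(signal, factor=2):
    """Nearest-neighbor upsampling: repeat each sample `factor` times."""
    return [v for v in signal for _ in range(factor)]


def open_loop_sr(lr, factor=2):
    """Stand-in SR model with a deliberate 20% brightness bias."""
    return [0.8 * v for v in upsample(lr, factor)]


def closed_loop_sr(lr, factor=2, gain=0.5, steps=10):
    """Refine the open-loop output until re-degrading it reproduces `lr`."""
    sr = open_loop_sr(lr, factor)
    for _ in range(steps):
        # Feedback signal: mismatch between the input and the re-degraded output.
        error = [y - d for y, d in zip(lr, downsample(sr, factor))]
        correction = upsample(error, factor)
        sr = [s + gain * c for s, c in zip(sr, correction)]
    return sr


lr = [1.0, 2.0, 4.0]
refined = closed_loop_sr(lr)
residual = max(abs(y - d) for y, d in zip(lr, downsample(refined)))
print(residual < 1e-3)  # True
```

In this toy the LR-domain error shrinks by a factor of (1 - gain) per step, so the loop is stable for 0 < gain < 2. With a learned, nonlinear SR model no such closed form is available, which is presumably why the paper needs a separate stability argument.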
cs.CV / 14 / 2605.08175
KARMA-MV: A Benchmark for Causal Question Answering on Music Videos
KARMA-MV:音乐视频因果问题回答的基准测试
Abstract
While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for smaller models -- establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.
Chinese Translation
尽管在视频问答和跨模态理解方面取得了显著进展,但关于视觉动态如何驱动音乐视频中的音乐结构的因果推理仍然未得到充分探索。我们介绍了KARMA-MV,这是一个基于2,682个YouTube音乐视频的大规模多项选择问答数据集,旨在测试模型整合时间音频-视觉线索的能力,并推理视觉对音乐的影响,涵盖推理、预测和反事实问题。与传统需要手动标注的数据集不同,KARMA-MV利用大型语言模型(LLM)推理进行可扩展的生成和验证,生成了37,737个多项选择题(MCQ)。我们提出了一种因果知识图谱(CKG)方法,通过结构化检索跨模态依赖关系来增强视觉-语言模型(VLM)。在最先进的VLM和LLM上进行的实验表明,CKG基础的模型表现出一致的提升——尤其是对于较小的模型——确立了显式因果结构在音乐视频推理中的价值。KARMA-MV为推动因果音频-视觉理解超越相关性提供了一个新的基准。
cs.CV / 15 / 2605.08181
Text-Guided Multi-Scale Frequency Representation Adaptation
文本引导的多尺度频率表示适应
Abstract
Parameter-efficient fine-tuning methods introduce a small number of training parameters, enabling pre-trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi-scale characteristics of signals. To address these challenges, we propose the Multi-Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi-scale fine-tuning of signals in the frequency domain. Additionally, we introduce a multi-scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model's representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at https://github.com/Kelvin-ywc/FreqAdapter.
Chinese Translation
参数高效的微调方法引入了少量训练参数,使预训练模型能够快速适应新的数据分布。尽管这些方法显示出良好的效果,但它们也存在显著的局限性。首先,大多数现有方法在信号空间域中操作,导致信息冗余严重。其次,大多数现有方法使用固定的提示或适应层,未能充分考虑信号的多尺度特性。为了解决这些挑战,我们提出了多尺度频率适配器(FreqAdapter),该方法整合了文本信息,并在频率域中对信号进行多尺度微调。此外,我们引入了一种多尺度适应策略,以优化不同频率范围内的感受野,进一步增强模型的表示能力。在包括CLIP和LLaVA在内的多模态模型上进行的广泛实验表明,FreqAdapter显著提高了性能和效率。FreqAdapter以最小的成本和在一个训练周期内快速收敛的方式提升了性能。代码可在 https://github.com/Kelvin-ywc/FreqAdapter 获取。
cs.CV / 16 / 2605.08183
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
稀疏性有害:简单线性适配器可以提升广义类别发现
Abstract
Generalized Category Discovery (GCD) seeks to identify novel categories from unlabeled data while retaining the classification ability of seen categories. Prior GCD methods commonly leverage transferable representations from pre-trained models, adapting to downstream datasets via partial fine-tuning (updating only the final ViT block) and visual prompt tuning (appending learnable vectors to inputs). However, conventional partial fine-tuning offers limited flexibility, as it fails to adapt the entire model; meanwhile, visual prompt tuning is prone to overfitting, due to its sensitivity to initialization and inherently constrained capacity. To address these limitations, we propose LAGCD, a simple yet effective GCD approach that embeds a residual linear adapter into each ViT block. From the perspective of feature sparsity, we systematically show that non-linearity in conventional adapters impairs performance, whereas our linear adapter enhances it by enabling more flexible model capacity. We further introduce an auxiliary distribution alignment loss to mitigate the negative impact of biased predictions between seen and novel categories. Extensive experiments on both generic and fine-grained datasets confirm that LAGCD consistently improves performance over many sophisticated baselines. The source code is available at https://github.com/yebo0216best/LAGCD
Chinese Translation
广义类别发现(Generalized Category Discovery, GCD)旨在从未标记数据中识别新类别,同时保留已见类别的分类能力。以往的 GCD 方法通常利用来自预训练模型的可迁移表示,通过部分微调(仅更新最终的 ViT 块)和视觉提示微调(将可学习向量附加到输入)适应下游数据集。然而,传统的部分微调提供的灵活性有限,因为它未能适应整个模型;与此同时,视觉提示微调由于对初始化的敏感性和固有的容量限制,容易导致过拟合。为了解决这些局限性,我们提出了 LAGCD,一种简单而有效的 GCD 方法,它在每个 ViT 块中嵌入了一个残差线性适配器。从特征稀疏性的角度出发,我们系统地表明,传统适配器中的非线性会削弱性能,而我们的线性适配器通过启用更灵活的模型容量来增强性能。我们进一步引入了一种辅助分布对齐损失,以减轻已见类别与新类别之间偏差预测的负面影响。在通用和细粒度数据集上的大量实验确认,LAGCD 在许多复杂基线之上始终提高了性能。源代码可在 https://github.com/yebo0216best/LAGCD 获取。
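The LAGCD abstract describes a residual linear adapter embedded in each ViT block, with no non-linearity inside the adapter. The sketch below shows what such a module computes on a single feature vector. The down/up bottleneck shape and the zero-initialized up-projection are common adapter conventions assumed here for illustration; the abstract does not specify them.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def linear_adapter(x, down, up):
    """Residual linear adapter: x + up @ (down @ x), with no activation between."""
    hidden = matvec(down, x)     # project to the (assumed) bottleneck width
    delta = matvec(up, hidden)   # project back to the feature width
    return [xi + di for xi, di in zip(x, delta)]


x = [1.0, 2.0]
down = [[1.0, 1.0]]              # 1 x 2 projection (hypothetical weights)

# With a zero-initialized up-projection the adapter starts as the identity,
# so inserting it does not disturb the pre-trained features.
print(linear_adapter(x, down, up=[[0.0], [0.0]]))   # [1.0, 2.0]

# After training, the adapter adds a purely linear correction.
print(linear_adapter(x, down, up=[[0.5], [-0.5]]))  # [2.5, 0.5]
```

Because the whole module is linear, it is equivalent to multiplying by (I + up @ down), which in principle lets the adapter be folded into an adjacent linear layer after training.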
cs.CV / 17 / 2605.08188
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
受神经科学启发的多模态变换器中视觉趣味性的分析
Abstract
Human attention is the gateway to conscious perception, memory and decision-making. However, its role in modern transformer models remains largely unexplored. As these systems increasingly influence what people see, prefer and buy, the question arises as to whether they encode principles of human interest or merely exploit large-scale correlations. Addressing this issue is crucial for understanding cognition and ensuring the responsible use of AI in communication and marketing. To this end, we examined the concept of visual interest within the multimodal vision-language model Qwen3-VL-8B, using a pre-defined Common Interestingness (CI) score derived from large-scale human engagement data on the photo-sharing platform Flickr. We analyzed internal representations across vision and language components using methods from the neurosciences. Our analyses revealed that CI information is linearly decodable from final-layer embeddings, indicating that it is aligned with human-derived measures of visual interestingness. Dimensionality reduction and Generalized Discrimination Value (GDV) analyses demonstrate that CI-related hidden representations emerge in intermediate vision transformer layers and become progressively more distinguishable across language model layers. Concept vectors derived using geometric, probe, and Sparse Auto-Encoder based methods converge in higher layers, as confirmed by representational similarity analysis. This indicates a robust and structured encoding of visual interestingness without explicit supervision. Future work will seek to identify shared computational principles linking human brain dynamics and transformer architectures, with the ultimate goal of uncovering the organizing mechanisms that give rise to attention and interest in both biological and artificial systems.
Chinese Translation
人类注意力是意识感知、记忆和决策的入口。然而,其在现代变换器模型中的作用在很大程度上仍未被探索。随着这些系统越来越影响人们的视觉、偏好和购买行为,问题随之而来:它们是否编码了人类兴趣的原则,还是仅仅利用了大规模的相关性。解决这一问题对于理解认知和确保人工智能在沟通与营销中的负责任使用至关重要。为了解决这一问题,我们在多模态视觉-语言模型 Qwen3-VL-8B 中考察了视觉兴趣的概念,使用了基于大规模人类参与数据(来自照片分享平台 Flickr)推导的预定义共同趣味性(Common Interestingness, CI)评分。在此,我们利用神经科学的方法分析了视觉和语言组件的内部表征。我们的分析揭示,CI 信息可以从最终层嵌入中线性解码,表明其与人类导出的视觉趣味性度量一致。维度降低和广义区分值(Generalized Discrimination Value, GDV)分析表明,CI 相关的隐藏表征在中间视觉变换器层中出现,并在语言模型层中逐渐变得更加可区分。使用几何、探针和稀疏自编码器方法推导的概念向量在更高层中趋于一致,这一点通过表征相似性分析得到了确认。这表明视觉趣味性的编码是稳健且结构化的,而无需显式监督。未来的工作将致力于识别连接人类大脑动态与变换器架构的共享计算原则,最终目标是揭示在生物和人工系统中引发注意力和兴趣的组织机制。
cs.CV / 18 / 2605.08191
A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing
通过协同平滑实现稳健的分布外检测框架
Abstract
Reliable out-of-distribution (OOD) detection is a critical requirement for the safe deployment of machine learning systems. Despite recent progress, state-of-the-art OOD detectors are highly susceptible to adversarial attacks, which undermines their trustworthiness in automated systems. To address this vulnerability, we apply median smoothing to baseline OOD detection scores, balancing clean and adversarial accuracies. Our key insight is that the noisy samples generated for median smoothing can be repurposed to quantify the local instability of the base score. We observe that OOD samples exhibit higher instability under perturbation. Based on this, we propose ROSS, a novel and robust post-hoc OOD detector that leverages the instability of baseline scores to further distinguish between in-distribution (ID) and OOD samples. ROSS achieves symmetric robustness, performing strongly against both score-minimising and score-maximising attacks, unlike prior work. This symmetric defence leads to state-of-the-art robustness, outperforming prior methods by up to 40 AUROC points. We demonstrate ROSS's effectiveness on extensive experiments across CIFAR-10, CIFAR-100, and ImageNet. Code is available at: https://github.com/Abdu-Hekal/ROSS.
Chinese Translation
可靠的分布外(OOD)检测是机器学习系统安全部署的关键要求。尽管近期取得了一些进展,最先进的OOD检测器仍然对对抗性攻击高度敏感,这削弱了它们在自动化系统中的可信度。为了解决这一脆弱性,我们对基线OOD检测分数应用中位数平滑,以平衡干净和对抗的准确性。我们的关键见解是,为中位数平滑生成的噪声样本可以被重新利用,以量化基线分数的局部不稳定性。我们观察到,OOD样本在扰动下表现出更高的不稳定性。基于此,我们提出了ROSS,一种新颖且稳健的后处理OOD检测器,利用基线分数的不稳定性进一步区分分布内(ID)样本和OOD样本。与之前的工作不同,ROSS实现了对称稳健性,在对抗分数最小化和分数最大化的攻击下均表现出色。这种对称防御实现了最先进的稳健性,超越了之前的方法,提升了多达40个AUROC点。我们在CIFAR-10、CIFAR-100和ImageNet上的广泛实验中展示了ROSS的有效性。代码可在以下链接获取:https://github.com/Abdu-Hekal/ROSS。
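The two quantities named in the abstract above, a median-smoothed score and the instability of the base score under noise, can both be computed from the same set of noisy samples. The sketch below is a minimal illustration under stated assumptions: Gaussian input noise, the interquartile range as the instability measure, and the names `smoothed_score_and_instability`, `base_flat`, and `base_steep` are all hypothetical and not taken from the paper. How ROSS combines the two quantities is not stated in the abstract, so the sketch returns them separately.

```python
import random
import statistics

def smoothed_score_and_instability(base_score, x, sigma=0.1, n=101, seed=0):
    """Return (median of noisy scores, spread of the same noisy scores)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        scores.append(base_score(noisy))
    median_score = statistics.median(scores)
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return median_score, q3 - q1  # interquartile range as an assumed instability proxy

def base_flat(x):
    # toy base score that changes slowly with the input
    return sum(x) / len(x)

def base_steep(x):
    # toy base score that is ten times more sensitive to the same perturbation
    return 10.0 * base_flat(x)

x = [0.2, 0.4, 0.6]
flat_median, flat_instability = smoothed_score_and_instability(base_flat, x)
steep_median, steep_instability = smoothed_score_and_instability(base_steep, x)
```

With the same seed, both toy scores see identical noise, so the steeper score shows the larger spread; this mirrors the abstract's observation that some inputs are less stable under perturbation than others.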
cs.CV / 19 / 2605.08193
Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising
任意骨干网络的归一化等变性及其在图像去噪中的应用
Abstract
Normalization Equivariance (NE), equivariance to global contrast and brightness transforms, improves robustness to distribution shift in image-to-image prediction. Existing methods enforce this prior by constraining internal layers to NE-compatible families, limiting compatibility with standard components such as attention and LayerNorm, and adding runtime cost. We characterize the full NE function class: a function is NE if and only if it admits a normalize-process-denormalize factorization. This turns exact NE enforcement, for the ideal wrapper, from an internal architectural constraint into an input-output parameterization problem, allowing a parameter-free wrapper (WNE) to enforce NE around any backbone, including transformers. In a single-noise mismatch diagnostic for blind denoising, the wrapper improves CNN and transformer robustness with no measurable GPU overhead; architectural NE baselines incur up to a 1.6x slowdown.
Chinese Translation
归一化等变性(Normalization Equivariance, NE)是对全局对比度和亮度变换的等变性,它提高了图像到图像预测中对分布变化的鲁棒性。现有方法通过限制内部层为与NE兼容的家族来强制这一先验,这限制了与标准组件(如注意力机制和LayerNorm)的兼容性,并增加了运行时成本。我们对完整的NE函数类进行了表征:一个函数是NE当且仅当它允许归一化-处理-反归一化的分解。这将理想包装器的精确NE强制,从内部架构约束转变为输入-输出参数化问题,使得无参数的包装器(WNE)能够在任何骨干网络周围强制NE,包括变换器。在盲去噪的单一噪声不匹配诊断中,该包装器在没有可测量的GPU开销的情况下提高了CNN和变换器的鲁棒性;而架构NE基线则导致高达1.6倍的减速。
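The normalize-process-denormalize factorization described in the abstract above can be sketched for a one-dimensional signal. This is a minimal illustration, assuming normalization by the input's own mean and standard deviation; `wne_wrap` and the toy `backbone` are hypothetical names, and the paper's actual wrapper may normalize differently.

```python
import statistics

def wne_wrap(backbone, eps=1e-12):
    """Wrap any list -> list function f so that f(a*x + b) == a*f(x) + b for a > 0."""
    def wrapped(x):
        mu = statistics.fmean(x)
        sigma = statistics.pstdev(x)
        if sigma < eps:
            return list(x)  # constant input: passed through unchanged in this sketch
        z = [(v - mu) / sigma for v in x]       # normalize
        y = backbone(z)                         # process with an arbitrary backbone
        return [sigma * v + mu for v in y]      # denormalize
    return wrapped

def backbone(z):
    # arbitrary non-linear toy "network" that is not equivariant on its own
    return [max(v, 0.0) + 0.1 * v * v for v in z]

f = wne_wrap(backbone)
x = [0.2, 0.5, 0.9, 0.4]
a, b = 3.0, -1.5                                # contrast and brightness change
lhs = f([a * v + b for v in x])
rhs = [a * v + b for v in f(x)]
max_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
```

The wrapper has no parameters of its own: the backbone always sees the same normalized input regardless of `a` and `b`, so equivariance holds up to floating-point error whatever the backbone computes.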
cs.CV / 20 / 2605.08196
Survey on Disaster Management Datasets for Remote Sensing Based Emergency Applications
基于遥感的应急应用灾害管理数据集调查
Abstract
Recent natural disasters have highlighted the urgent need for efficient data-driven approaches to disaster management. Machine learning (ML) and deep learning (DL) techniques have shown considerable promise in enhancing the key phases of disaster management including mitigation, preparedness, detection, response, and recovery. A critical enabler of successful ML or DL based applications in remote sensing, however, is the accessibility and quality of annotated datasets. With the growing availability of high-resolution imagery from unmanned aerial vehicles (UAVs) and satellites, computer vision and remote sensing algorithms have become essential tools for rapid detection, situational assessment, and decision-making in disaster scenarios. This survey provides a comprehensive overview of publicly available image-based datasets relevant to ML/DL-based disaster management pipelines. Emphasis is placed on datasets that support computer vision and remote sensing tasks across all phases of disaster events including pre-disaster, during, and post-disaster. The goal of this work is to serve as a centralized reference for researchers and practitioners seeking high-quality datasets for rapid development and deployment of remote sensing-driven disaster response solutions.
Chinese Translation
近期的自然灾害突显了对高效数据驱动的灾害管理方法的迫切需求。机器学习(ML)和深度学习(DL)技术在增强灾害管理的关键阶段(包括减灾、准备、检测、响应和恢复)方面显示出了相当大的潜力。然而,成功的基于ML或DL的遥感应用的一个关键因素是标注数据集的可获取性和质量。随着无人机(UAV)和卫星提供的高分辨率影像的日益增多,计算机视觉和遥感算法已成为在灾害场景中进行快速检测、情境评估和决策的重要工具。本调查提供了与基于ML/DL的灾害管理流程相关的公开可用图像数据集的全面概述。重点关注支持计算机视觉和遥感任务的各类数据集,这些任务涵盖了灾害事件的所有阶段,包括灾前、灾中和灾后。本研究的目标是为寻求高质量数据集以快速开发和部署基于遥感的灾害响应解决方案的研究人员和从业者提供一个集中参考。
cs.CV / 21 / 2605.08207
A Breast Vision Pathology Foundation Model for Real-world Clinical Utility
用于实际临床应用的乳腺病理基础模型
Xu, Yingxue, Zhang, Zhengyu, Zhang, Xiuming, Xu, Mengwei, Zhou, Fengtao, Wang, Yihui, Ma, Jiabo, Xin, Yi, Li, Danyi, Lu, Chengyu, Cen, Zhijian, Tan, Ying, Yao, Qingbing, Wang, Qi, Gao, Zizhao, Zhang, Yong, Chen, Jingjing, Liu, Feifei, Xu, Qian, Dai, Yi, Tan, Hongxuan, Jin, Cheng, Zhou, Huajun, Guo, Zhengrui, Liang, Ling, Wang, Hongyi, Chen, Yingcong, Wang, Xi, Li, Zhenhui, Chan, Ronald Cheong Kin, Mao, Ning, Cai, Muyan, Wang, Zhe, Liang, Li, Chen, Hao
Abstract
Pathology foundation models have shown strong retrospective performance, but whether such systems can support clinically relevant use remains unclear. This challenge is particularly important in breast cancer, where pathological assessment serves as the gold standard for diagnosis and guides treatment planning, surgical decision-making and risk stratification across pre-, intra- and post-operative stages. Here we present BRAVE, a breast-adaptive pathology foundation model developed and evaluated using a total resource of 101,638 breast whole-slide images from 32 sources across Asia, Europe and North America. We assessed BRAVE across 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section and post-operative resection, using an evidence chain comprising retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation with the thresholds locked in the retrospective cohorts and crossover pathologist-AI interaction studies. Across these settings, BRAVE supported practical roles in the clinical workflow, including safe exclusion of low-risk cases from routine review, AI-assisted second-review rescue of initially missed positives and prioritization of cases for further assessment. In prospective validation across three centres, BRAVE excluded 76.9% of negative biopsy cases (NPV 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance improved balanced accuracy from 88.5% to 95.1% (OR 3.14, P<0.001), with better efficiency, confidence and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted HR 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).
Chinese Translation
病理基础模型在回顾性表现上显示出强大的性能,但这些系统是否能够支持临床相关的应用仍不明确。这个挑战在乳腺癌中尤为重要,因为病理评估作为诊断的金标准,并指导治疗规划、手术决策和在术前、术中及术后各阶段的风险分层。在此,我们提出了BRAVE,一个乳腺适应性病理基础模型,该模型的开发和评估使用了来自亚洲、欧洲和北美的32个来源的101,638幅乳腺全切片图像。我们在82个队列中的34个任务上评估了BRAVE,这些任务涵盖了术前活检、术中冰冻切片和术后切除,使用了包括回顾性基准测试、临床挑战场景、面向工作流程的临床影响模拟、在回顾性队列中锁定阈值的前瞻性观察验证以及病理学家与人工智能交互研究在内的证据链。在这些设置中,BRAVE在临床工作流程中支持了实际角色,包括安全排除常规审查中的低风险病例、人工智能辅助的二次审查以挽救最初漏检的阳性病例,以及对进一步评估病例的优先排序。在三家中心的前瞻性验证中,BRAVE排除了76.9%的阴性活检病例(阴性预测值 0.953)和70.1%的阴性冰冻切片病例(阴性预测值 0.973),并将78.8%的术后亚型病例分流为高置信度的明确病例(阴性预测值 1.000)。在读者研究中,人工智能的辅助将平衡准确率从88.5%提高到95.1%(比值比 3.14,P<0.001),并提高了效率、信心和评审者间一致性。BRAVE衍生的评分也独立预测了无病生存期(调整后风险比 4.79,P<0.001)和总体生存期(调整后风险比 8.14,P<0.001)。
cs.CV / 22 / 2605.08210
Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
多评估者医学分割的协调特征条件和频率提示个性化
Abstract
Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce novel High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a Generalized Energy Distance based regularization aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show SOTA aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty. Confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.
Chinese Translation
多评估者医学图像分割捕捉了临床解读固有的模糊性,其中诊断边界因专家和成像设备的不同而异。现有方法通常将这种多样性简化为共识标签或将评估者差异视为噪声,导致模型过于自信且校准不良。我们提出了一种协调的概率框架,通过自适应特征条件和频域个性化,将获取伪影与真实注释者变异性区分开来。轻量级的协调网络(Harmonizer Network)隐式建模扫描仪特定伪影,并执行动态特征调制,以标准化潜在表示,确保不确定性反映解剖结构而非噪声。为了表示评估者特定的风格,我们引入了一种新颖的高频提示模块(High-Frequency Prompt Modules),该模块在频谱域中操作,以编码注释者依赖的边界精度和纹理敏感性。这些提示自适应地调制协调特征,以产生个性化但解剖上一致的分割。此外,基于广义能量距离(Generalized Energy Distance)的正则化将生成分布与经验注释变异性对齐,促进专家意见不一致时的多样性以及意见一致时的共识。在LIDC-IDRI和NPC-170上的实验显示了最先进的聚合和个性化分割,显著减少了广义能量距离(GED)并提高了Dice分数,特别是在噪声案例中。除了准确性外,该模型还表现出临床上有意义的不确定性。在一致区域,置信度上升,而在模糊区域下降,支持其作为多专家临床工作流程中可靠且可解释工具的使用。
cs.CV / 23 / 2605.08213
Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning
低成本立体视觉用于自主无人机修剪中薄松树枝的稳健三维定位
Abstract
Manual pruning of radiata pine, a species of major economic importance to New Zealand forestry, is hazardous, labour-intensive, and increasingly constrained by workforce shortages. Existing autonomous pruning platforms typically rely on expensive sensors such as LiDAR and are limited to thick branches, which restricts their wider adoption. This paper investigates whether a single low-cost stereo camera mounted on a drone can provide sufficiently accurate branch detection and three-dimensional positioning to support autonomous pruning of branches as thin as 10 mm, thereby removing the need for auxiliary depth sensors. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, Mask R-CNN variants and the YOLOv8 and YOLOv9 families are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera; YOLOv8 and YOLOv9 are selected as representative state-of-the-art real-time segmentors at the time of data collection, and the framework is designed to remain compatible with newer YOLO releases. For depth estimation, a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated, including cross-dataset fine-tuning experiments that expose the domain gap between urban driving benchmarks and natural forestry scenes. The main novelty of this work lies in coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection that converts a segmentation mask and disparity map into a single robust branch-to-camera distance, addressing the challenges of sparse texture, thin structures, and noisy disparity values typical of forest scenes. Qualitative evaluations at distances of 1-2 m show that the learning-based stereo methods produce more coherent depth es...
Chinese Translation
辐射松(radiata pine)是对新西兰林业具有重要经济意义的树种,其人工修剪危险、劳动密集,并且日益受到劳动力短缺的限制。现有的自主修剪平台通常依赖于昂贵的传感器,如激光雷达(LiDAR),并且仅限于较粗的树枝,这限制了其更广泛的应用。本文探讨了安装在无人机上的单个低成本立体相机是否能够提供足够准确的树枝检测和三维定位,以支持对细至10毫米的树枝进行自主修剪,从而消除对辅助深度传感器的需求。所提出的流程包括两个阶段:树枝分割和深度估计。在分割阶段,比较了Mask R-CNN变体以及YOLOv8和YOLOv9系列在使用ZED Mini相机捕获的71对立体图像的自定义数据集上的表现;YOLOv8和YOLOv9被选为数据收集时的代表性最先进实时分割器,并且该框架设计为与更新的YOLO版本保持兼容。在深度估计方面,评估了一种传统方法(带有加权最小二乘(WLS)滤波的SGBM)和基于深度学习的方法(PSMNet、ACVNet、GWCNet、MobileStereoNet、RAFT-Stereo和NeRF监督深度立体),包括跨数据集微调实验,以揭示城市驾驶基准与自然林业场景之间的领域差距。本工作的主要创新在于将立体分割与基于质心的三角测量算法和中位数绝对偏差(Median-Absolute-Deviation)异常值拒绝相结合,将分割掩膜和视差图转换为单一稳健的树枝到相机的距离,解决了森林场景中典型的稀疏纹理、细结构和噪声视差值的挑战。在1-2米的距离下进行的定性评估表明,基于学习的立体方法产生了更连贯的深度估计。
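The step that turns masked disparity values into one robust distance, as described in the abstract above, can be sketched with the standard stereo relation Z = f * B / d and a Median-Absolute-Deviation filter. This is a simplified sketch, not the paper's algorithm: it omits the centroid-based step, and the function name, the cutoff `k`, and the focal length and baseline values are illustrative assumptions.

```python
import statistics

def robust_branch_distance(disparities, focal_px, baseline_m, k=3.0):
    """Estimate one distance (metres) from the disparities inside a branch mask."""
    valid = [d for d in disparities if d > 0]          # drop invalid disparities
    med = statistics.median(valid)
    mad = statistics.median([abs(d - med) for d in valid])
    # 1.4826 scales the MAD to match a standard deviation under Gaussian noise
    kept = [d for d in valid if abs(d - med) <= k * 1.4826 * mad]
    d_hat = statistics.fmean(kept)
    return focal_px * baseline_m / d_hat               # Z = f * B / d

# toy disparities (pixels) inside a mask, with two gross outliers and one invalid value
mask_disparities = [40.0, 41.0, 39.5, 40.5, 40.2, 5.0, 120.0, 0.0]
distance_m = robust_branch_distance(mask_disparities, focal_px=700.0, baseline_m=0.063)
```

In this toy case the two outliers (5.0 and 120.0) fall outside the MAD band, so the estimate is driven by the five consistent values and lands near 1.1 m; without the filter the mean disparity would be pulled far off by the 120-pixel value.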
cs.CV / 24 / 2605.08215
Test-Time Training for Visual Foresight Vision-Language-Action Models
视觉前瞻视觉-语言-行动模型的测试时训练
Abstract
Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in recent VLA research due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.
Chinese Translation
视觉前瞻视觉-语言-行动模型(VF-VLA)因其卓越的性能而成为近期视觉-语言-行动(VLA)领域的一个重要架构选择。然而,VF-VLA的固有设计使其特别容易受到分布外(OOD)变化的影响。由于行动的质量直接依赖于预测的未来视觉信息的准确性,OOD条件同时影响两个阶段。为了解决这一脆弱性,我们提出了测试时训练视觉前瞻VLA($T^3$VF),这是一种基于观察到的未来图像及其后续观察形成自然监督对的动机而提出的测试时训练方法。为了进一步解决因不加区分的测试时更新而带来的实际挑战,我们引入了一种自适应更新过滤机制。实证结果表明,$T^3$VF在适度增加推理成本的情况下,减轻了VF-VLA的OOD脆弱性,而无需任何架构修改或辅助模块。
cs.CV / 25 / 2605.08222
From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline
从历史表格图像到知识图谱:一种关注来源的模块化管道
Abstract
Handwritten archival tables contain rich historical information, yet transforming them into structured representations, such as Knowledge Graphs, requires integrating table structure recognition, handwriting recognition, and semantic interpretation - a complex multimodal process. End-to-end AI implementations can obscure these steps, resulting in opaque algorithmic operations that hinder human oversight, critical assessment, and trust. To address this, we present a modular, provenance-aware pipeline to convert handwritten tabular images into KGs supporting human-AI collaboration. The pipeline decomposes the workflow into three stages - table reconstruction, information extraction, and KG construction - while exposing intermediate representations for inspection, evaluation, and correction. A key contribution of our approach is the systematic integration of data provenance at every stage, ensuring that all extracted entities and literals remain traceable to their visual and textual origins. The proposed pipeline is demonstrated through a number of experiments on real-world archival material concerning military careers. The results across three different table reconstruction variants highlight the importance of modularisation. By coupling modularity with data provenance, our work advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.
Chinese Translation
手写档案表格包含丰富的历史信息,但将其转化为结构化表示(如知识图谱)需要整合表格结构识别、手写识别和语义解释,这是一项复杂的多模态过程。端到端的人工智能实现可能会掩盖这些步骤,导致算法操作不透明,从而妨碍人类的监督、批判性评估和信任。为了解决这个问题,我们提出了一种模块化的、关注来源的管道,将手写表格图像转换为支持人机协作的知识图谱。该管道将工作流程分解为三个阶段——表格重建、信息提取和知识图谱构建,同时暴露中间表示以供检查、评估和修正。我们方法的一个关键贡献是在每个阶段系统性地整合数据来源,确保所有提取的实体和文字都能追溯到其视觉和文本来源。通过对涉及军事职业的真实档案材料进行的一系列实验,展示了所提出的管道。三种不同的表格重建变体的结果突显了模块化的重要性。通过将模块化与数据来源相结合,我们的工作推动了透明且可协作控制的图像到知识图谱管道的发展,以应对复杂的历史数据。
cs.CV / 26 / 2605.08226
SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection
SPECTRA-Net:可扩展的可解释跨域张量表示管道用于AI生成图像检测
Abstract
The rapid proliferation of AI-generated images (AIGI) presents a significant challenge to digital information integrity. While human observers and existing detection models struggle to keep pace with the increasing sophistication of generative models, the need for robust, real-time detection systems has become critical. This paper introduces SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for AIGI detection. Our approach leverages a multi-view representation of images, combining global semantic features from a Vision Foundation Model (VFM), spectral analysis, local patch-based anomaly detection, and statistical descriptors. By fusing these complementary data streams, SPECTRA-Net achieves state-of-the-art performance in both in-domain and cross-domain settings, demonstrating high accuracy and generalization capabilities across a wide range of challenging datasets, including WildFake, Chameleon, and RRDataset. The proposed pipeline not only provides a robust solution for AIGI detection but also offers explainability through artifact localization, paving the way for more trustworthy and reliable content verification in real-world applications.
Chinese Translation
AI生成图像(AIGI)的快速传播对数字信息的完整性提出了重大挑战。尽管人类观察者和现有检测模型在应对生成模型日益复杂的情况下显得力不从心,但对强大、实时检测系统的需求已变得至关重要。本文介绍了SPECTRA-Net,一种可扩展的可解释跨域张量表示管道,用于AIGI检测。我们的方法利用图像的多视角表示,结合来自视觉基础模型(Vision Foundation Model, VFM)的全局语义特征、光谱分析、基于局部补丁的异常检测和统计描述符。通过融合这些互补的数据流,SPECTRA-Net在域内和跨域设置中均实现了最先进的性能,展示了在WildFake、Chameleon和RRDataset等一系列具有挑战性的数据集上高准确性和良好的泛化能力。所提出的管道不仅为AIGI检测提供了强有力的解决方案,还通过伪影定位提供了可解释性,为现实应用中的更可信和可靠的内容验证铺平了道路。
cs.CV / 27 / 2605.08238
Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation
面向资源的进化神经架构搜索用于心脏MRI分割
Abstract
Cardiac magnetic resonance (CMR) segmentation underpins quantitative assessment of ventricular structure and function, yet reliable delineation remains difficult due to low tissue contrast, fuzzy boundaries, and inter-scan variability. We present CardiacNAS, an evolutionary neural architecture search (NAS) framework that couples a UNet-like supernet with a cardiac-aware search space spanning depth, width, kernel size, filter size, attention, fusion, activation, dropout, and residual scaling. The search is explicitly resource-aware, jointly optimizing Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (HD95) versus model size and floating-point operations (FLOPs) under fixed compute budgets. Candidate architectures are instantiated from the supernet, trained with proxy budgets, and evolved through crossover, mutation, and elitist selection. We evaluate on the ACDC dataset and compare against six state-of-the-art methods, using qualitative comparisons, learning-curve analyses, and design-factor correlation studies. The resulting model attains 93.22% average DSC and 4.73 mm HD95 with 3.58M parameters and 14.56 GFLOPs, demonstrating a favorable accuracy-efficiency trade-off. Analyses indicate that searched attention and fusion choices, together with residual scaling, contribute to improved boundary fidelity and stability. CardiacNAS offers a principled, resource-aware approach to deployable CMR segmentation with transparent reporting of architectural complexity and compute budgets.
Chinese Translation
心脏磁共振(CMR)分割是对心室结构和功能进行定量评估的基础,但由于组织对比度低、边界模糊以及扫描间变异性,可靠的分割仍然困难。我们提出了CardiacNAS,一个进化神经架构搜索(NAS)框架,该框架将类似UNet的超网络与心脏感知搜索空间相结合,涵盖深度、宽度、卷积核大小、滤波器大小、注意力机制、融合、激活、丢弃和残差缩放。该搜索明确考虑资源,联合优化Dice相似系数(DSC)和95百分位Hausdorff距离(HD95),同时在固定计算预算下对模型大小和浮点运算(FLOPs)进行优化。候选架构从超网络中实例化,使用代理预算进行训练,并通过交叉、变异和精英选择进行进化。我们在ACDC数据集上进行评估,并与六种最先进的方法进行比较,使用定性比较、学习曲线分析和设计因素相关性研究。最终模型在3.58M参数和14.56 GFLOPs下实现了93.22%的平均DSC和4.73 mm的HD95,展示了良好的准确性与效率的权衡。分析表明,搜索得到的注意力和融合选择,以及残差缩放,有助于提高边界的保真度和稳定性。CardiacNAS提供了一种有原则的、面向资源的可部署CMR分割方法,并透明地报告了架构复杂性和计算预算。
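For readers unfamiliar with the Dice similarity coefficient reported in the abstract above, the following sketch computes it for two binary masks using the standard definition DSC = 2|A ∩ B| / (|A| + |B|). The function name and the flattened-list mask representation are choices made for this illustration only.

```python
def dice(pred, target):
    """Dice similarity coefficient for two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # convention used here: two empty masks count as a perfect match
    return 1.0 if total == 0 else 2.0 * intersection / total

pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
score = dice(pred_mask, true_mask)
perfect = dice(true_mask, true_mask)
```

Here the masks share one foreground pixel out of two each, so the score is 2 * 1 / 4 = 0.5, while identical masks score 1.0.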
cs.CV / 28 / 2605.08241
TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models
TinySSL:用于亚兆字节MCU模型的蒸馏自监督预训练
Abstract
Self-supervised learning (SSL) has transformed representation learning for large models, yet remains unexplored for microcontroller (MCU)-class models with fewer than 500K parameters. We identify three obstacles at this scale -- projection head dominance, representation bottleneck, and augmentation sensitivity -- and propose Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), a teacher-guided framework that overcomes them without labels or text supervision. CA-DSSL combines asymmetric distillation from a frozen DINO ViT-S/16 teacher, multi-scale feature distillation for spatial representations, and a progressive augmentation curriculum. On a MobileNetV2-0.35 backbone (396K parameters) pretrained on CIFAR-100, CA-DSSL reaches 62.7 ± 0.5% linear-probe accuracy (3-seed mean) -- surpassing SimCLR-Tiny by 18 pp, matching SEED (61.7%) with 10× fewer projection parameters (426K vs. 3.15M), and reaching 94.0% of a supervised upper bound. Standard SSL methods (BYOL-Tiny, DINO-Tiny) collapse entirely at this scale. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and +3 pp over SEED, though SimCLR-Tiny matches CA-DSSL on detection mAP. The deployed backbone occupies 378 KB (INT8) with no inference overhead from pretraining. Preliminary ImageNet-100 experiments reveal that CA-DSSL's advantage is specific to small-data regimes; scaling to ImageNet-1K is discussed as future work.
Chinese Translation
自监督学习(SSL)已经改变了大模型的表征学习,但对于参数少于50万的微控制器(MCU)类模型仍未被探索。我们识别出在这一规模下的三个障碍——投影头主导性、表征瓶颈和增强敏感性——并提出了一种容量感知的蒸馏自监督学习(CA-DSSL)框架,该框架在没有标签或文本监督的情况下克服这些障碍。CA-DSSL结合了来自冻结的DINO ViT-S/16教师的非对称蒸馏、多尺度特征蒸馏以获取空间表征,以及渐进增强课程。在一个基于MobileNetV2-0.35(396K参数)在CIFAR-100上进行预训练的模型中,CA-DSSL达到了62.7 ± 0.5%的线性探测准确率(3次实验均值)——超过SimCLR-Tiny 18个百分点,匹配SEED(61.7%)且投影参数少10倍(426K对比3.15M),并达到了监督上限的94.0%。标准SSL方法(BYOL-Tiny,DINO-Tiny)在这一规模下完全崩溃。在Pascal VOC检测中,CA-DSSL实现了随机初始化的2.3倍mAP,并比SEED提升了3个百分点,尽管SimCLR-Tiny在检测mAP上与CA-DSSL相匹配。部署的主干网络占用378 KB(INT8),且没有来自预训练的推理开销。初步的ImageNet-100实验表明,CA-DSSL的优势特定于小数据环境;将其扩展到ImageNet-1K的讨论作为未来工作。
cs.CV / 29 / 2605.08245
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
当语言覆盖视觉:视觉-语言模型中的过度对齐与几何去偏差
Abstract
Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis of decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
Chinese Translation
视觉-语言模型(VLMs)在医疗影像到自主系统等高风险应用中越来越重要,但它们常常会产生幻觉,自信地描述输入中不存在的内容。我们通过对基于解码器的VLMs进行机械分析,探讨这些失败模式的根本原因。我们将这些失败模式追溯到几何过度对齐:为弥合注意机制所需的模态差距,基于解码器的VLMs过度对齐视觉嵌入与文本流形,注入了一种系统性地遮蔽细粒度视觉证据的统计语言偏差。虽然之前的研究要么激进地缩小这一差距,要么通过昂贵的黑箱解码策略抑制幻觉,但没有一个研究解决根本的几何原因。我们首次定量描述了这种过度对齐,证明语言偏差集中在一个通用的、与数据集无关的文本子空间的主要成分中。基于这一见解,我们提出了两种互补的补救措施:一种无训练的推理策略和一种关注偏差的微调范式,这两者都明确地将该子空间从视觉表示中投影出去。我们的方法在POPE、CHAIR和AMBER基准测试中显著减少了幻觉,并在长文本描述任务中提高了CLAIR得分,而无训练变体在计算上没有增加基础模型的开销。
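Projecting a subspace out of a representation, as mentioned in the abstract above, is a standard linear-algebra operation: subtract from the vector its components along an orthonormal basis of that subspace. The sketch below assumes the basis vectors (for example, top principal components of text embeddings) are already given and orthonormal; `project_out` and the toy vectors are hypothetical and not taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(v, basis):
    """Remove from v its components along each vector of an orthonormal basis."""
    out = list(v)
    for u in basis:
        coeff = dot(v, u)                           # coordinate of v along u
        out = [o - coeff * u_i for o, u_i in zip(out, u)]
    return out

text_basis = [[1.0, 0.0, 0.0]]                      # toy one-dimensional "text subspace"
visual_embedding = [3.0, 4.0, 5.0]
debiased = project_out(visual_embedding, text_basis)
residual_along_text = dot(debiased, text_basis[0])
```

After projection the embedding has no component left along the removed direction, while the remaining coordinates are untouched.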
cs.CV / 30 / 2605.08246
Smart Railway Obstruction Detection System using IoT and Computer Vision
基于物联网和计算机视觉的智能铁路障碍检测系统
Abstract
Railway track intrusions pose a critical safety challenge for Indian Railways, encompassing wildlife incursions and deliberate malicious obstructions. The December 2025 collision in Assam, in which seven elephants were killed by the Rajdhani Express, underscores the urgency of effective real-time detection. Existing solutions such as the optical fiber-based Gajraj system suffer from prohibitive costs ($1000/km) and high false alarm rates, limiting deployment to only 20 of India's 101 elephant corridors. This paper proposes NETRA, a cost-effective, internet-independent intrusion detection system deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms. NETRA employs probabilistic sensor fusion integrating a PIR motion sensor and an HC-SR04 ultrasonic distance sensor with a tunable threshold (tau_c = 0.65), enabling event-driven camera activation that reduces unnecessary visual processing by 52%. Upon confirmed intrusion, edge-AI classification using MobileNet-SSD (Pi Zero) or YOLOv5 ONNX (Pi 4) identifies threats including humans, large animals, and track obstructions. Confirmed threats are transmitted via LoRa (868 MHz) to alert the locomotive driver within 2.4 seconds end-to-end. Experimental evaluation across 113 motion events demonstrated 95% detection accuracy with zero false alarms through probabilistic fusion, compared to 85% for binary methods. Raspberry Pi 4 with YOLOv5 achieved 83.5% elephant F1-score, a 5.6x improvement over Pi Zero's heuristic approach (14.8%). LoRa communication achieved 100% packet delivery across 1-2 km in field trials. NETRA reduces deployment cost by 75% ($247/km vs $1000/km for Gajraj) while providing unified detection of both wildlife and obstruction threats.
Chinese Translation
铁路轨道入侵对印度铁路构成了严重的安全挑战,包括野生动物侵入和故意恶意障碍物。2025年12月在阿萨姆发生的碰撞事件中,七头大象被拉贾达尼快车撞死,突显了有效实时检测的紧迫性。现有的解决方案,如基于光纤的Gajraj系统,因其高昂的成本(每公里1000美元)和高误报率,导致仅在印度101条大象走廊中的20条实施。本文提出了NETRA,一个经济高效、独立于互联网的入侵检测系统,部署在Raspberry Pi Zero W和Raspberry Pi 4边缘平台上。NETRA采用概率传感器融合技术,将PIR运动传感器与HC-SR04超声波距离传感器结合,并设定可调阈值(tau_c = 0.65),实现事件驱动的摄像头激活,减少52%的不必要视觉处理。在确认入侵后,利用MobileNet-SSD(Pi Zero)或YOLOv5 ONNX(Pi 4)进行边缘人工智能分类,识别包括人类、大型动物和轨道障碍物在内的威胁。确认的威胁通过LoRa(868 MHz)在2.4秒内传输给机车司机。对113个运动事件的实验评估显示,通过概率融合实现了95%的检测准确率,零误报,而二元方法的准确率为85%。使用YOLOv5的Raspberry Pi 4达到了83.5%的大象F1分数,比Pi Zero的启发式方法(14.8%)提高了5.6倍。LoRa通信在现场试验中实现了1-2公里范围内100%的数据包传输。NETRA将部署成本降低了75%(每公里247美元,相较于Gajraj的1000美元),同时提供了对野生动物和障碍威胁的统一检测。
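The event-driven camera activation described in the abstract above amounts to fusing two sensor confidences and comparing the result with the threshold tau_c = 0.65. The abstract does not state the fusion rule, so the sketch below uses a noisy-OR combination purely as an assumption; the function names and the example confidence values are hypothetical.

```python
TAU_C = 0.65  # threshold value quoted in the abstract

def fused_confidence(p_pir, p_ultrasonic):
    """Noisy-OR fusion (assumed rule): chance that at least one detection is real."""
    return 1.0 - (1.0 - p_pir) * (1.0 - p_ultrasonic)

def should_wake_camera(p_pir, p_ultrasonic, tau_c=TAU_C):
    """Activate the camera only when the fused confidence reaches the threshold."""
    return fused_confidence(p_pir, p_ultrasonic) >= tau_c

strong_event = should_wake_camera(0.5, 0.5)   # fused confidence 0.75
weak_event = should_wake_camera(0.3, 0.2)     # fused confidence 0.44
```

Under this assumed rule, two moderately confident sensors together cross the threshold, while two weak readings do not, which is the kind of gating that keeps the camera and classifier idle for most spurious motion events.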
cs.CV / 31 / 2605.08249
Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models
用于冻结视觉基础模型的表征一致性的维度共激活
Abstract
Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.
Chinese Translation
冻结的视觉基础模型不仅仅是提取特征;它们通过学习的坐标系统组织图像。我们探讨该坐标系统在单一输入内是否保持内部一致性。这引出了表征一致性(Representational Consistency):研究冻结基础模型是否在其语义子区域内一致地表示一个样本。我们引入了维度共激活(Dimensional Coactivation, DCA),这是一种用于测量这种一致性的逐维工具。DCA通过询问相同的特征维度是否在不同的语义区域中共激活来比较这些区域。与经典的相似性度量不同,DCA故意避免了中心化、L2归一化和完全的Gram耦合。这些操作在比较不同模型或分布时是有用的,但在样本内部设置中不匹配,因为坐标系统是固定的,原始幅度携带信号。深度伪造检测提供了一个自然的验证任务。合成面孔可能重现看似合理的眼睛、鼻子和嘴巴,但破坏了将这些区域连接在一起的表征结构。使用冻结的DINOv3特征,DCA揭示了这种断裂:眼-嘴-鼻指纹在CelebDF-v2上达到了0.9106的AUC,在DFD的FF++ c23跨数据集转移中达到了0.9289。该设计也通过消融实验得到了明确验证:重新引入中心化将CelebDF-v2的AUC压缩至0.459,L2归一化将其降低至0.862,而跨维度耦合将其降低至0.478。最后,用FaRL替代DINOv3将CelebDF-v2的AUC压缩至0.582。因此,DCA依赖于稳定的逐维坐标系统,而不仅仅是区域提取。这些结果将DCA定位为测量冻结基础模型中样本内部表征一致性的工具,以深度伪造检测作为首个验证任务。
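The abstract does not state the DCA formula, so the following is only a minimal sketch of what a per-dimension comparison could look like, assuming each semantic region is summarized by one pooled feature vector and the fingerprint is the elementwise product of raw activations. The function names and the toy vectors are hypothetical. Cosine similarity is included as the classical contrast: it applies L2 normalization and collapses all dimensions into one scalar.

```python
import math

def coactivation_fingerprint(region_a, region_b):
    """Per-dimension product of raw activations (no centering, no L2 norm).

    Assumption: this elementwise form is illustrative only; the exact DCA
    definition is not given in the abstract.
    """
    if len(region_a) != len(region_b):
        raise ValueError("regions must share the same feature dimensions")
    return [a * b for a, b in zip(region_a, region_b)]

def cosine_similarity(region_a, region_b):
    """Classical scalar similarity: L2-normalized, mixes all dimensions."""
    dot = sum(a * b for a, b in zip(region_a, region_b))
    norm_a = math.sqrt(sum(a * a for a in region_a))
    norm_b = math.sqrt(sum(b * b for b in region_b))
    return dot / (norm_a * norm_b)

# Hypothetical pooled features for two regions of one face (4 dimensions).
eyes = [2.0, 0.0, 1.0, 0.5]
mouth = [1.0, 3.0, 1.0, 0.0]
# Same direction as `mouth`, but half the raw magnitude.
mouth_weak = [0.5 * v for v in mouth]

fp_strong = coactivation_fingerprint(eyes, mouth)
fp_weak = coactivation_fingerprint(eyes, mouth_weak)
cos_strong = cosine_similarity(eyes, mouth)
cos_weak = cosine_similarity(eyes, mouth_weak)
```

In this toy case the cosine score is identical for both pairs, while the fingerprint keeps the magnitude difference and shows which dimensions coactivate. That matches the abstract's point that raw magnitude carries signal in the intra-sample setting.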
cs.CV / 32 / 2605.08250
Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
为什么 DiT 编辑器会漂移?可插拔的低频对齐在 VAE 潜在空间中
Abstract
Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details. Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment. Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
Chinese Translation
最近在扩散变换器(DiTs)方面的进展使得单轮图像编辑能力得到了显著提升。然而,多轮编辑往往导致逐渐的语义漂移和质量下降。在本研究中,我们从潜在空间频率的角度研究了这一问题,通过将编辑过程分解为两个功能组件:变分自编码器(VAE)和 DiT。通过对 VAE 潜在空间的系统分析,我们发现 DiT 引入了主导的低频漂移,这种漂移在编辑轮次中累积为语义不对齐,而 VAE 则相对贡献了稳定的重建偏差。基于这一洞察,我们提出了 VAE-LFA(低频对齐),这是一种无训练、可插拔的方法,在 VAE 潜在空间中执行对齐。VAE-LFA 通过低通滤波分解编辑轮次之间的潜在差异,并将低频统计量对齐到前几轮的指数移动平均,有效抑制了累积的语义漂移,同时保留了高频细节。我们的方法不需要重新训练、真实值先验或访问扩散参数,使其适用于白盒和黑盒 DiT 编辑器。对于白盒模型,VAE-LFA 通过消除冗余的 VAE 往返调用无缝集成到编辑流程中;对于黑盒模型,它通过现成的 VAE 进行轮间潜在对齐。大量实验表明,VAE-LFA 在多种多轮编辑场景中提高了语义一致性和视觉保真度,包括受控图像和自然场景图像。
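The alignment step described in this abstract (low-pass filtering, matching low-frequency statistics to an exponential moving average, preserving high frequencies) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the latent is a single-channel 2D grid, the low-pass filter is a 3x3 box blur, the aligned statistics are the mean and standard deviation, and the EMA decay of 0.9 is arbitrary. All function names are hypothetical.

```python
import statistics

def box_blur(grid, radius=1):
    """Low-pass filter: mean over a (2r+1)x(2r+1) window, edges clamped."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [grid[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
                    for di in range(-radius, radius + 1)
                    for dj in range(-radius, radius + 1)]
            out[i][j] = sum(vals) / len(vals)
    return out

def low_freq_stats(low):
    """Mean and population standard deviation of the low-frequency band."""
    flat = [v for row in low for v in row]
    return statistics.fmean(flat), statistics.pstdev(flat)

def align_low_frequency(latent, target_mean, target_std, eps=1e-8):
    """Rescale the low band to the target stats; add back the high band."""
    low = box_blur(latent)
    mean, std = low_freq_stats(low)
    return [
        [(l - mean) / (std + eps) * target_std + target_mean + (z - l)
         for z, l in zip(row, low_row)]
        for row, low_row in zip(latent, low)
    ]

def update_ema(ema, value, decay=0.9):
    """Exponential moving average used to track statistics across rounds."""
    return decay * ema + (1.0 - decay) * value

# Toy round: a constant offset stands in for accumulated low-frequency drift.
latent = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
ref_mean, ref_std = low_freq_stats(box_blur(latent))
drifted = [[v + 5.0 for v in row] for row in latent]
fixed = align_low_frequency(drifted, ref_mean, ref_std)
```

In this toy case the constant offset lives entirely in the low-frequency band, so aligning the band's statistics to the reference recovers the original latent while the high-frequency residual passes through unchanged. In a multi-round setting, `update_ema` would be applied to the mean and standard deviation after each round to form the target.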
cs.CV / 33 / 2605.08252
Multimodal Emotion Recognition via Causal-Diffusion Bridge (Affect-Diff)
通过因果扩散桥进行多模态情感识别 (Affect-Diff)
Abstract
Multimodal emotion recognition on CMU-MOSEI faces an extreme imbalance as Happy accounts for 65.9% of samples while three Ekman categories collectively represent under 7%, causing standard fusion models to maximize accuracy by ignoring minority emotions entirely. We present Affect-Diff, a Causal-Diffusion Bridge that addresses this through three jointly trained mechanisms: a NOTEARS-learned causal graph that re-weights modality contributions before fusion, a beta-VAE bottleneck for regularized latent compression, and a stop-gradiented 1D DDPM prior that structures the latent space against majority-class collapse. On 3,292 aligned CMU-MOSEI samples, Affect-Diff achieves validation balanced accuracy 0.384, an 18% relative improvement over the strongest baseline (TETFN: 0.324), while all evaluated baselines produce zero F1 on Fear, Disgust, and Surprise. Ablation studies confirm independent, non-redundant contributions from the diffusion prior (-24% without it) and causal graph (-13%). Notably, only the deterministic-encoder variant detects all six emotion classes, revealing KL regularization strength as a direct lever for minority-class sensitivity.
Chinese Translation
在CMU-MOSEI数据集中,多模态情感识别面临极端的不平衡问题,快乐情感占样本的65.9%,而三种Ekman类别的情感总共仅占不到7%,这导致标准融合模型通过完全忽略少数情感来最大化准确性。我们提出了Affect-Diff,一种因果扩散桥,通过三种联合训练的机制来解决这一问题:一个通过NOTEARS学习的因果图,在融合之前重新加权模态贡献;一个用于正则化潜在压缩的beta-VAE瓶颈;以及一个停止梯度的1D DDPM先验,旨在防止潜在空间的多数类崩溃。在3,292个对齐的CMU-MOSEI样本上,Affect-Diff实现了验证集平衡准确率0.384,相较于最强基线(TETFN: 0.324)有18%的相对提升,而所有评估的基线在恐惧、厌恶和惊讶情感上均产生零F1值。消融研究确认了扩散先验(去掉后减少24%)和因果图(去掉后减少13%)的独立且非冗余的贡献。值得注意的是,只有确定性编码器变体能够检测到所有六种情感类别,揭示了KL正则化强度作为少数类敏感性的直接杠杆。
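To make the metric in this abstract concrete, the sketch below shows why plain accuracy rewards ignoring minority emotions while balanced accuracy (the mean of per-class recall) does not, and checks the reported relative improvement. The label counts are a made-up toy distribution that only mimics the imbalance; they are not CMU-MOSEI data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels (hypothetical counts): "happy" dominates, three classes are rare.
y_true = (["happy"] * 66 + ["sad"] * 14 + ["angry"] * 13
          + ["fear"] * 2 + ["disgust"] * 3 + ["surprise"] * 2)
# A degenerate model that always predicts the majority class.
y_pred = ["happy"] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)

# Relative improvement of the reported 0.384 over the TETFN baseline 0.324.
relative_gain = (0.384 - 0.324) / 0.324
```

The majority-only predictor reaches 66% accuracy on the toy labels but only 1/6 balanced accuracy, the chance level for six classes. The relative gain evaluates to roughly 18.5%, consistent with the 18% quoted in the abstract.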
cs.CV / 34 / 2605.08270
SAFformer: Improving Spiking Transformer via Active Predictive Filtering
SAFformer:通过主动预测过滤改善脉冲变换器
Abstract
Spiking Neural Networks (SNNs) offer notable advantages in biological plausibility and energy efficiency, making them promising candidates for building low-power Transformers. However, existing Spiking Transformers largely adhere to a passive reactive paradigm, which struggles to focus on task-relevant information and incurs substantial computational overhead when processing redundant visual data. To overcome this fundamental yet underexplored limitation, we propose SAFformer, a novel Spiking Transformer architecture based on an active predictive filtering paradigm. Inspired by the brain's predictive coding mechanism, SAFformer actively suppresses predictable signals and focuses on salient visual features. Extensive experiments show that SAFformer establishes new state-of-the-art performance on CIFAR-10/100 and CIFAR10-DVS. Remarkably, on ImageNet-1K, it achieves 80.50% Top-1 accuracy with only 26.58M parameters and an energy consumption of 5.88 mJ, demonstrating an exceptional balance between accuracy and efficiency.
Chinese Translation
脉冲神经网络(SNNs)在生物合理性和能效方面具有显著优势,使其成为构建低功耗变换器的有希望的候选者。然而,现有的脉冲变换器大多遵循被动反应范式,这使其在关注与任务相关的信息时面临困难,并在处理冗余视觉数据时产生大量计算开销。为了克服这一基本但尚未深入探讨的限制,我们提出了SAFformer,这是一种基于主动预测过滤范式的新型脉冲变换器架构。SAFformer受大脑预测编码机制的启发,主动抑制可预测信号,专注于显著的视觉特征。大量实验表明,SAFformer在CIFAR-10/100和CIFAR10-DVS上建立了新的最先进性能。值得注意的是,在ImageNet-1K上,它以仅26.58M的参数实现了80.50%的Top-1准确率,能耗为5.88 mJ,展示了准确性与效率之间的卓越平衡。
cs.CV / 35 / 2605.08271
Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning
跨越模态,跨越时间:超长代理视频推理的结构化记忆
Abstract
Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1 and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC-video.
Chinese Translation
理解超长视频,例如自我中心录制、直播或持续数天到数周的监控录像,仍然是一项挑战。对于当前的多模态大语言模型(LLMs):即使具有百万标记的上下文窗口,帧预算也仅覆盖数十分钟的密集采样视频,大多数证据在推理开始之前就被丢弃。增强记忆和代理方法有助于扩展,但它们的检索在模态之间仍然是碎片化的,缺乏跨越数天或数周的长时间叙事摘要。我们提出了 MAGIC-Video,这是一个无训练的框架,围绕一个多模态记忆图构建,具有交错的叙事链:该图通过六种类型的边统一了情节、语义和视觉内容,并支持跨模态检索,而链则提炼了长时间范围内的实体传记和重复活动事件。在推理时,代理循环将图检索与叙事事实注入交错,覆盖超长视频的模态和时间维度,形成一个单一的检索管道。在EgoLifeQA、Ego-R1和MM-Lifelong上,MAGIC-Video始终优于强大的通用、长视频和代理基线,在每个基准上分别比之前最佳的代理系统提高了10.1、7.4和5.9分。代码可在https://github.com/lijiazheng0917/MAGIC-video获取。
cs.CV / 36 / 2605.08276
Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
超越ViT标记:用于细胞级密集预测的掩蔽扩散预训练卷积病理基础模型
Abstract
Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.
Chinese Translation
细胞级密集预测是计算病理学的核心,但由于细致的组织结构、强烈的领域转移和昂贵的密集标注,仍然具有挑战性。现有的基于ViT的病理基础模型依赖于补丁标记化,这可能会破坏空间连续性并削弱细胞级预测所需的局部形态细节。为了解决这个问题,我们提出了掩蔽扩散卷积基础模型,称为ConvNeXt掩蔽扩散(CMD),这是一个用于密集病理表示学习的自监督卷积生成预训练框架。CMD使用全卷积的ConvNeXt-UNet骨干网络,在像素空间中进行掩蔽扩散预训练,并通过自适应归一化整合冻结的病理基础模型特征。实验结果表明,CMD在多个病理密集预测任务中始终优于现有的基于ViT的病理基础模型,甚至超越了最先进的端到端分割方法,同时仅微调少量任务特定参数。在有限标注设置下,这一优势尤为明显,CMD表现出更强的鲁棒性和泛化能力。我们的研究结果表明,纯卷积架构也可以作为竞争性的病理基础模型,用于细胞级密集预测,在当前以ViT为主导的范式中实现领先性能,并提供一种可扩展的高性能解决方案,更好地保留细致的病理理解所需的组织结构先验。
cs.CV / 37 / 2605.08281
Is Class Signal Clustered or Routed in Task-Induced Implicit Neural Representation Weight Spaces?
任务诱导的隐式神经表示权重空间中的类别信号是聚类还是路由?
Abstract
Implicit neural representations (INRs) encode images as neural-network weights, making image classification a problem of weight-space classifiability. A natural geometric hypothesis is that classifier feedback should make image-specific weights cluster by class in the shared-anchor coordinate. We test this hypothesis in the SIREN-based Meta Weight Transformer (MWT) regime, where end-to-end training meta-learns a shared initialization and inner-loop update schedule for fitting image-specific SIRENs. We find that this prediction fails. Exposed weight-space geometry and supervised clustering pressure do not reliably track trained-reader accuracy; clustering can even make local neighborhoods more class-consistent while making the trained reader worse. Crucially, the reader constructs rather than inherits class-aligned geometry: token-flow diagnostics show that class-aligned neighborhoods become strongly predictive of trained-reader accuracy only after late reader interactions, not in the input coordinate. We further identify the native SIREN bias column in the augmented weight token as a low-dimensional, sample-dependent causal readout route for the trained reader; targeted controls rule out generic scalar-column and marginal-distribution artifacts. The diagnosis motivates interventions that strengthen reader routing, add an explicit bias route, or use denser inner-loop fitting; under the lane-specific training conventions used here, route-directed variants often outperform the shared-anchor baseline but interact non-additively. Task-induced INR weights are classifiable not because they form raw geometric clusters, but because their class signal is routed through the reader.
Chinese Translation
隐式神经表示(INRs)将图像编码为神经网络权重,使得图像分类成为权重空间可分类性的问题。一个自然的几何假设是,分类器反馈应该使图像特定的权重在共享锚点坐标中按类别聚类。我们在基于 SIREN 的元权重变换器(Meta Weight Transformer, MWT)框架中测试了这一假设,在该框架下,端到端训练元学习共享初始化和内部循环更新计划,以适应图像特定的 SIREN。我们发现这一预测并不成立。暴露的权重空间几何和监督聚类压力并不能可靠地跟踪训练读者的准确性;聚类甚至可能使局部邻域在类别上更一致,同时使训练读者的表现变差。关键是,读者构建而非继承类别对齐的几何结构:标记流诊断显示,类别对齐的邻域只有在晚期读者交互后才会强烈预测训练读者的准确性,而不是在输入坐标中。我们进一步识别出增强权重标记中的原生 SIREN 偏置列,作为训练读者的低维、样本依赖的因果读取路径;针对性的控制排除了通用标量列和边际分布伪影。该诊断激励了加强读者路由、添加显式偏置路径或使用更密集的内部循环拟合的干预措施;在这里使用的特定车道训练规范下,定向路由变体通常优于共享锚点基线,但互动不是加性。任务诱导的 INR 权重之所以可分类,并不是因为它们形成了原始几何聚类,而是因为它们的类别信号通过读者进行路由。
cs.CV / 38 / 2605.08293
Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation
蒸馏、扩散与语义化 (DDS):基于多粒度蒸馏和图扩散分割的无注释3D场景理解
Abstract
3D semantic scene understanding has broad applications in digital twins, autonomous driving, smart agriculture, and embodied perception. However, dense point-wise annotation for point clouds is extremely expensive, making fully supervised 3D semantic learning difficult to scale. Recent annotation-free methods can discover semantic regions without manual 3D labels, but they often suffer from weak object-level consistency, inefficient global grouping, and category-agnostic segmented regions. We propose an annotation-free 3D scene semantic understanding method based on multi-granularity distillation and graph-diffusion-based segmentation. The proposed method first leverages structured visual knowledge guidance and superpoint graph diffusion to perform efficient global semantic propagation, alleviating the problem of inconsistent region-level semantics. It then conducts semantic inference through segmentation-cluster association, assigning interpretable category names to segmented 3D regions and improving the overall effectiveness of annotation-free 3D semantic understanding. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework. Compared with advanced existing annotation-free baselines, our method improves oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, respectively. These results highlight the promise of the proposed framework for scalable annotation-free 3D scene understanding, especially in real-world scenarios requiring both object segmentation and semantic recognition.
Chinese Translation
3D语义场景理解在数字双胞胎、自动驾驶、智能农业和具身感知等领域具有广泛的应用。然而,对于点云进行密集的逐点注释极其昂贵,使得完全监督的3D语义学习难以扩展。近期的无注释方法能够在没有手动3D标签的情况下发现语义区域,但它们通常面临对象级一致性差、全局分组效率低以及类别无关的分割区域等问题。我们提出了一种基于多粒度蒸馏和图扩散分割的无注释3D场景语义理解方法。该方法首先利用结构化视觉知识指导和超点图扩散进行高效的全局语义传播,从而缓解区域级语义不一致的问题。然后,通过分割-聚类关联进行语义推断,为分割的3D区域分配可解释的类别名称,提高无注释3D语义理解的整体有效性。在真实世界数据集上的大量实验表明了所提框架的有效性。与现有先进的无注释基线相比,我们的方法在oAcc、mAcc和mIoU上分别最多提高了5.9%、8.1%和2.4%。这些结果突显了所提框架在可扩展无注释3D场景理解中的潜力,特别是在需要对象分割和语义识别的真实场景中。
cs.CV / 39 / 2605.08296
BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition
BenchHAR:自监督学习在可泛化传感器基础活动识别中的基准评估
Abstract
Human Activity Recognition (HAR) from wearable sensors supports broad healthcare and behavior science applications. However, data heterogeneity and the scarcity of labeled data limit its real-world generalization. Recent advances in self-supervised learning (SSL) in vision and language domains have shown strong capability for learning generalizable representations from unlabeled data. Yet, few studies have systematically compared the generalization performance of SSL methods or explored how to adapt them for generalizable HAR. To address these gaps, we present BenchHAR, a unified framework for evaluating the generalization capability of SSL methods for sensor-based HAR on unseen target distributions. BenchHAR curates a large-scale dataset (~258K samples) and evaluates eight representative SSL methods across 12 encoder-classifier architectures. Our results reveal that existing SSL methods struggle to achieve satisfactory generalization performance. We find that: (1) For HAR models, the hybrid paradigm (combining reconstruction and contrastive pretraining) achieves the best overall performance. The CNN encoder exhibits the strongest ability to learn generalizable representations, while more expressive classifier architectures further improve generalization. (2) For data scale, increasing the amount of pretraining data from downstream activity classes consistently improves generalization, while adding more labeled data yields limited gains. Interestingly, incorporating unlabeled data from non-downstream activity classes does not improve generalization. (3) Sensor data collected from custom-grade devices generalizes better than that from research-grade devices, and data from limb transfers more effectively to trunk positions. BenchHAR provides a unified benchmark and actionable insights for generalizable sensor-based HAR systems. Our code is available at https://github.com/saiketa/HAR-Bench.
Chinese Translation
可穿戴传感器的人体活动识别(HAR)支持广泛的医疗保健和行为科学应用。然而,数据异质性和标注数据的稀缺限制了其在实际应用中的泛化能力。最近在视觉和语言领域自监督学习(SSL)的进展显示出从未标注数据中学习可泛化表示的强大能力。然而,系统性比较SSL方法的泛化性能或探索如何将其适应于可泛化HAR的研究仍然较少。为了解决这些问题,我们提出了BenchHAR,这是一个统一的框架,用于评估SSL方法在未见目标分布上的传感器基础HAR的泛化能力。BenchHAR策划了一个大规模数据集(约258K样本),并在12种编码器-分类器架构上评估了八种代表性的SSL方法。我们的结果显示,现有的SSL方法在实现令人满意的泛化性能方面面临挑战。我们发现:(1)对于HAR模型,混合范式(结合重建和对比预训练)实现了最佳的整体性能。CNN编码器展现出学习可泛化表示的最强能力,而更具表现力的分类器架构进一步提高了泛化能力。(2)在数据规模方面,增加来自下游活动类别的预训练数据量始终能改善泛化,而添加更多标注数据的收益有限。有趣的是,纳入来自非下游活动类别的未标注数据并未改善泛化。(3)从定制级设备收集的传感器数据的泛化能力优于从研究级设备收集的数据,而来自肢体的数据更有效地转移到躯干位置。BenchHAR为可泛化的传感器基础HAR系统提供了统一的基准和可操作的见解。我们的代码可在 https://github.com/saiketa/HAR-Bench 获取。
cs.CV / 40 / 2605.08329
An Efficient Token Compression Framework for Visual Object Tracking
一种高效的视觉目标跟踪令牌压缩框架
Abstract
Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code is available at https://github.com/PJD-WJ/ETCTrack.
Chinese Translation
通过消除视觉表示中的内部特征级冗余来优化模型在视觉跟踪中的性能和计算成本至关重要。为了提升性能,许多当代基于Transformer的跟踪器利用更多的历史模板帧来捕捉更丰富的时空线索。然而,这一策略导致了大量的输入视觉令牌。这产生了两个关键问题:它带来了二次计算负担,并可能降低跟踪器的整体性能。为了解决这一问题,我们提出了一种压缩-再交互的跟踪框架ETCTrack,该框架学习如何高效地将历史模板帧中的模板令牌压缩为稳健的目标表示,超越了手工规则。我们的方法首先采用自适应令牌压缩器动态构建紧凑而高度区分的模板令牌,通过过滤冗余视觉令牌来实现。这些精炼的模板令牌随后由我们的层次交互编码器处理,以实现与搜索特征的深度自适应交互。精炼的搜索特征确保后续的精确目标定位。在七个基准测试上的实验表明,我们的方法优于当前最先进的跟踪器。ETCTrack-B224将模板令牌的数量减少了60%,使得MACs减少了21.4%,而准确率仅下降了0.4%。源代码可在 https://github.com/PJD-WJ/ETCTrack 获取。
cs.CV / 41 / 2605.08371
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
PaceVGGT:用于视觉几何变换器的预替代注意力令牌剪枝
Abstract
Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality--latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by \(5.1\times\) over unmodified VGGT at \(N=300\) and \(1.47\times\) over LiteVGGT at \(N=1000\). These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
Chinese Translation
视觉几何变换器(VGGT)是一种强大的前馈模型,适用于多种三维任务,但其替代注意力(AA)堆栈在总令牌数量上呈二次增长,使得处理长片段变得昂贵。现有的令牌减少加速器在AA内部操作,导致进入AA的补丁网格未被压缩。我们提出了PaceVGGT,一种预AA令牌剪枝框架,在冻结的VGGT的第一个AA块之前剪枝DINO补丁令牌。PaceVGGT训练一个轻量级的令牌评分器,该评分器根据DINO特征评估每个令牌的重要性。评分器首先在未剪枝的主干网络上针对AA内部的注意力目标进行蒸馏,然后在下游相机、深度和点图损失下进行优化。每帧的保留预算固定了主干可见序列的长度,而重要性自适应的合并/剪枝分配则在固定的总合并预算下保留了高显著性帧的剩余内容。特征引导恢复模块重建了预测头所需的稠密空间网格。在ScanNet-50和7-Scenes数据集上,PaceVGGT在重建质量与延迟的边界上保持一致,同时减少了推理延迟。在ScanNet-50上,它在N=300时比未修改的VGGT减少了5.1倍的延迟,在N=1000时比LiteVGGT减少了1.47倍的延迟。这些结果表明,预AA剪枝是冻结VGGT风格几何变换器的一条可行加速路径。
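The per-frame keep budget mentioned in this abstract can be illustrated with a simple top-k selection over token scores. This is a sketch under assumptions: the scores stand in for the output of the learned Token Scorer, the merge/prune assignment and the Feature-guided Restoration module are omitted, and the cost model simply takes attention cost as proportional to the square of the total token count. Function names are hypothetical.

```python
def prune_tokens_per_frame(scores, keep_per_frame):
    """Return, for each frame, the sorted indices of its top-scoring tokens.

    scores: list of per-frame lists of importance scores (hypothetical
    stand-in for a learned scorer's output).
    """
    kept = []
    for frame_scores in scores:
        if keep_per_frame > len(frame_scores):
            raise ValueError("keep budget exceeds tokens in frame")
        ranked = sorted(range(len(frame_scores)),
                        key=lambda i: frame_scores[i], reverse=True)
        # Re-sort kept indices so the surviving tokens stay in spatial order.
        kept.append(sorted(ranked[:keep_per_frame]))
    return kept

def attention_cost_ratio(tokens_before, tokens_after):
    """Relative pairwise-attention cost, assuming cost ~ (token count)^2."""
    return (tokens_after / tokens_before) ** 2

# Two frames with four patch tokens each; keep two tokens per frame.
scores = [[0.1, 0.9, 0.4, 0.7],
          [0.8, 0.2, 0.6, 0.3]]
kept = prune_tokens_per_frame(scores, keep_per_frame=2)
ratio = attention_cost_ratio(tokens_before=8, tokens_after=4)
```

Because the budget is fixed per frame, the sequence length seen by the backbone is known in advance regardless of content. Under the quadratic assumption, halving the token count cuts the pairwise attention cost to a quarter, although end-to-end latency gains also depend on parts of the model not captured here.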
cs.CV / 42 / 2605.08373
NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution
NeuroGAN-3D:通过高保真3D生成超分辨率增强内在功能脑网络
Abstract
Recent advances in neuroimaging have deepened our understanding of the brain's complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI) - particularly resting-state fMRI (rs-fMRI) - has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.
Chinese Translation
最近神经影像学的进展加深了我们对大脑复杂功能和结构组织的理解。其中,功能性磁共振成像(fMRI),特别是静息态fMRI(rs-fMRI),已成为识别内在脑连接生物标志物和描绘大规模神经网络的工具。这些网络通常表示为体积空间图,捕捉功能上相干的脑区,并反映个体在脑活动和结构上的差异。这些图的空间分辨率起着重要作用,因为它决定了精确定位功能单元的能力,进行可靠的脑分区,以及检测与发展、衰老或疾病相关的细微、空间特异性的神经生物学变化。因此,提高神经影像学衍生图的有效分辨率对于深入了解脑结构及其与行为和病理之间的关系具有重要意义。为满足这一需求,我们提出了NeuroGAN-3D,一种新颖的3D生成超分辨率模型,专门针对体积神经影像学的计算需求。我们的模型利用生成对抗网络架构来增强rs-fMRI空间图的空间分辨率,显著优于传统基线。
cs.CV / 43 / 2605.08376
UIESNN: A Scale-Aware Spiking Network for Underwater Image Enhancement
UIESNN:一种适应尺度的脉冲神经网络用于水下图像增强
Abstract
Underwater image enhancement (UIE) is a practically important yet underexplored application of spiking neural networks (SNNs), where the dominant degradations are large-scale and low-frequency, such as wavelength-dependent colour casts and scattering-induced veiling. Existing SNN restoration designs rely on locally bounded spiking perception, which can limit global correction and lead to saturated or inconsistent representations. To address these challenges, we propose a scale-aware SNN framework for UIE named UIESNN. At its core is a Multi-scale Pooling LIF Block (MPLB) that injects hierarchical multi-scale pooling responses into membrane dynamics, thereby enlarging the effective receptive field while preserving fine-grained details and inducing heterogeneous scale-dependent activations. Building on MPLB, we design a spiking residual architecture that integrates frequency decomposition and attention-based refinement in a fully spike-driven pipeline. Extensive experiments on the EUVP and LSUI benchmarks demonstrate that UIESNN achieves state-of-the-art performance among SNN-based methods, delivering improved colour fidelity and spatial coherence with competitive energy cost.
Chinese Translation
水下图像增强(UIE)是脉冲神经网络(SNNs)一个实际重要但尚未深入探索的应用,其中主要的退化现象是大尺度和低频率的,例如依赖波长的色偏和散射引起的遮蔽。现有的SNN恢复设计依赖于局部限制的脉冲感知,这可能限制全局校正并导致饱和或不一致的表现。为了解决这些挑战,我们提出了一种名为UIESNN的适应尺度的SNN框架。其核心是一个多尺度池化LIF块(MPLB),该块将分层的多尺度池化响应注入膜动力学,从而扩大有效感受野,同时保留细粒度细节并诱导异质的尺度依赖激活。基于MPLB,我们设计了一种脉冲残差架构,在完全由脉冲驱动的流程中集成了频率分解和基于注意力的精细化。对EUVP和LSUI基准的广泛实验表明,UIESNN在基于SNN的方法中实现了最先进的性能,提供了更好的色彩保真度和空间一致性,同时具有竞争力的能量成本。
cs.CV / 44 / 2605.08389
Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval
解耦端点与语义过渡学习以实现零样本组合图像检索
Abstract
Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint--transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.
Chinese Translation
零样本组合图像检索(Zero-shot composed image retrieval, ZS-CIR)从参考图像和文本修改中检索目标图像,而无需人工标注的CIR三元组。基于投影的ZS-CIR方法因其在推理时不依赖于大型语言模型(LLMs)且保持轻量化而备受关注,但在复杂语义修改上往往表现不如基于LLM的方法。这一差距反映了基于投影的ZS-CIR中的语义过渡瓶颈:端点级匹配使得编辑文本可以作为目标侧属性线索,而不是将其作为源条件语义过渡进行定位。我们进一步表明,将语义过渡监督添加到同一文本适配器中,会在端点对齐和语义过渡对齐之间产生端点-过渡冲突。为了解决这一冲突,DeCIR解耦了端点与过渡学习。它从图像-标题对中构建成对的正向/反向编辑元组,为端点对齐和语义过渡对齐训练独立的低秩文本适配器分支,并通过低秩方向合并(Low-Rank Directional Merge, LRDM)将它们合并为一个可部署的适配器。在CIRR、CIRCO、FashionIQ和GeneCIS上的大量实验表明,DeCIR在不增加推理复杂度的情况下,始终改善了基于投影的ZS-CIR。
cs.CV / 45 / 2605.08396
Delivering Science as a Service: Sci-Orchestra's Cloud-Native Approach to HPC
将科学作为服务交付:Sci-Orchestra的云原生高性能计算方法
Abstract
The increasing complexity of modern computational environments often burdens researchers with infrastructure management, authentication protocols, and container deployments. We present Sci-Orchestra, a layered orchestration framework designed to fully automate experimental workflows, allowing scientists to prioritize scientific discovery over backend operations. By abstracting execution through an API-driven interface, the system assumes responsibility for secure authentication, resource management, and scalable deployment across diverse high-performance computing environments using Kubernetes architectures. A key innovation of Sci-Orchestra is its autonomous marketplace, which serves as a catalyst for cross-institutional collaboration. Through an intuitive user interface, researchers can rapidly deploy and share specialized services via simple selections, eliminating the need for complex installations and technical setups. This modular infrastructure is specifically designed to facilitate industry partnerships as it provides a secure execution environment and allows external collaborators to test and validate proprietary tools without the need for source-code exchange. This "black-box" interoperability protects intellectual property while enabling seamless integration into broader scientific pipelines, ultimately accelerating the transition from laboratory prototypes to industrial-scale applications.
Chinese Translation
现代计算环境日益复杂,常常使研究人员面临基础设施管理、认证协议和容器部署的负担。我们提出了Sci-Orchestra,一个分层的编排框架,旨在完全自动化实验工作流,使科学家能够将重点放在科学发现上,而非后端操作。通过API驱动的接口抽象执行,该系统承担了安全认证、资源管理和在多样化高性能计算环境中使用Kubernetes架构的可扩展部署的责任。Sci-Orchestra的一项关键创新是其自主市场,作为跨机构合作的催化剂。通过直观的用户界面,研究人员可以通过简单的选择快速部署和共享专业服务,消除了复杂安装和技术设置的需求。该模块化基础设施专门设计用于促进行业合作伙伴关系,因为它提供了安全的执行环境,并允许外部合作者在无需源代码交换的情况下测试和验证专有工具。这种“黑箱”互操作性保护了知识产权,同时实现了与更广泛科学流程的无缝集成,最终加速了从实验室原型到工业规模应用的过渡。
cs.CV / 46 / 2605.08412
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
SYNCR:一个具有合成基础的跨视频推理基准
Abstract
Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at https://github.com/SaraGhazanfari/SYNCR.
Chinese Translation
多模态大型语言模型(MLLMs)在单视频理解方面取得了快速进展,但它们在多个独立视频流之间进行推理的能力仍然不够清晰。现有的多视频基准主要依赖于人工标注的真实世界视频,这限制了空间、时间和物理真相的精确性,并使得诊断模型失败变得困难。我们引入了SYNCR,一个用于跨视频推理的受控合成基准,具有程序化验证的基础。SYNCR基于Habitat、Kubric和CLEVRER模拟器引擎构建,包含8163个多视频问答对,基于9650个独特视频进行基础评估。它在四个诊断支柱(时间对齐、空间跟踪、比较推理和整体综合)上评估MLLMs的表现。我们对领先的开放和封闭权重MLLMs的零-shot评估显示出当前模型与人类之间存在显著差距:最佳模型的平均准确率仅为52.5%,而人类基线为89.5%。模型在时间排序方面表现相对较好,但在精确的物理和空间推理上存在困难,最佳模型在运动比较任务上的准确率仅为26.0%。我们进一步发现,参数扩展和专门针对推理的后训练提高了时间对齐能力,但并未可靠地解决细粒度的物理跟踪或全局空间综合问题。最后,探索性的模拟到真实相关性分析表明,多个SYNCR任务跟踪了真实世界多视频基准上的模型级趋势,同时也揭示了现有评估中未充分代表的推理能力。代码可在 https://github.com/SaraGhazanfari/SYNCR 获取。
cs.CV / 47 / 2605.08421
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
超越补丁袋:通过文本监督学习全局布局以实现晚期交互视觉文档检索
Abstract
Visual Document Retrieval (VDR) models mostly rely on late-interaction architectures, in which each document is represented by a set of local patch embeddings that are matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of a document when estimating query-document relevance. In practice, this leads to errors on documents with heterogeneous layouts combining figures, tables, and text, where relevance depends on layout structure. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions encoding document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
Chinese Translation
视觉文档检索(VDR)模型主要依赖于晚期交互架构,其中文档通过一组局部补丁嵌入表示,然后与查询令牌进行匹配。尽管这种架构高效,但它优先考虑局部相似性,而忽视了文档的全局布局结构,从而估计文档与查询之间的相关性。在实践中,这导致了错误,因为相关性源于具有异构布局的文档的布局结构,这些布局结合了图形、表格和文本。我们使文档布局可学习,而不改变推理过程。我们提出了一种多模态编码器,它通过文本描述编码文档布局信息,增强了局部补丁表示与全局布局嵌入的结合。在四个ViDoRe-v2数据集上,我们的模型在架构上与最强的ColPali/ColQwen基线相比,nDCG@5提高了2.4,MAP@5提高了2.3,并且在每个数据集上相对于ColQwen的增益具有统计显著性。
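The late-interaction matching that the abstract above calls "bag-of-patches" can be sketched as follows. This is a minimal illustration of the widely used MaxSim operator, not the paper's model: the function name `maxsim_score` and the toy 2-D embeddings are assumptions, and the proposed global layout embedding is not shown.

```python
def maxsim_score(query_tokens, doc_patches):
    """Late-interaction score: best-matching patch per query token, summed."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)

# Toy embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]
doc_b = [[0.5, 0.5], [0.4, 0.4]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

# Shuffling the patches leaves the score unchanged: the operator never
# sees where a patch sits on the page.
score_a_shuffled = maxsim_score(query, list(reversed(doc_a)))
```

Because the score is invariant to patch order, any layout signal has to come from the embeddings themselves, which is the gap the abstract targets.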
cs.CV / 48 / 2605.08452
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
NICE FACT:在运动物理学中的定量推理中诊断和校准视觉语言模型(VLMs)
Abstract
The ability to derive precise spatial and physical insights is a cornerstone of vision-language models (VLMs), yet their poor performance on related spatial intelligence tasks such as physical reasoning remains a fundamental barrier. The community critically lacks a scientific analysis revealing whether VLMs faithfully reason their way to answers or merely make plausible guesses. This work aims to provide a fundamental understanding of how VLMs perceive the physical world and utilize physical laws, while assessing the reliability of model confidence. We propose NICE and FACT, a dual-diagnostic paradigm that explicitly decomposes quantitative reasoning for kinematic physics: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding. NICE studies our novel neighborhood-informed calibration method and novel metrics to evaluate and calibrate confidence reliability. Evaluating six recent state-of-the-art VLMs, we find that models fail to identify visual preconditions or utilize the necessary physical laws to reach answers. This work establishes a standardized diagnostic paradigm to guide the development of faithful, physically grounded VLMs.
Chinese Translation
准确推导空间和物理洞察的能力是视觉语言模型(VLMs)的基石,然而它们在相关空间智能任务(如物理推理)中的表现不佳仍然是一个根本性障碍。学术界急需一项科学分析,以揭示VLMs是否忠实地得出答案或仅仅做出合理的猜测。本研究旨在提供对VLMs如何感知物理世界及利用物理法则的基本理解,同时评估模型置信度的可靠性。我们提出了NICE和FACT,这是一种双重诊断范式,明确分解运动物理学中的定量推理:FACT诊断视觉保真度、物理法则理解和时间基础。NICE研究我们新颖的邻域信息校准方法和新指标,以评估和校准置信度的可靠性。在对6个最新的最先进VLMs进行评估时,我们发现模型未能识别视觉前提或利用必要的物理法则来得出答案。本研究突出了建立一个标准化诊断范式,以指导忠实且基于物理的VLMs的发展。
cs.CV / 49 / 2605.08493
CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis
CapCLIP:一种用于无线胶囊内窥镜分析的视觉-语言表征对齐方法
Abstract
Wireless capsule endoscopy (WCE) enables non-invasive visual assessment of the small bowel, but its clinical utility is constrained by the large volume of frames generated per examination and the difficulty of recognising subtle abnormalities under highly variable imaging conditions. Existing learning-based approaches for WCE are predominantly vision-only, often confined to narrow pathology sets, and show limited transfer across datasets and centres. To address these limitations, this study introduces CapCLIP, a domain-specific vision-language representation learning framework for WCE. CapCLIP aligns capsule endoscopy frames with clinically grounded textual descriptions derived from standardised nomenclature and pathology-aware caption templates, thereby learning embeddings that are both semantically informed and transferable. The proposed framework is evaluated against relevant open-source vision and vision-language foundation models under strict zero-shot conditions using unseen WCE datasets. Evaluation covers three downstream tasks: K-nearest neighbour classification, CLIP-style image-text classification, and text-to-image retrieval. Across these settings, CapCLIP consistently outperforms the compared baselines, with particularly strong gains in zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets. The results indicate that language-guided representation learning can improve both generalisation and semantic interpretability in WCE analysis. These findings position CapCLIP as a step toward foundation models tailored to capsule endoscopy and support the use of language-grounded WCE analysis.
Chinese Translation
无线胶囊内窥镜(WCE)能够对小肠进行非侵入性的视觉评估,但其临床应用受到每次检查生成的大量图像帧和在高度可变的成像条件下识别细微异常的困难的限制。现有的基于学习的方法主要集中于仅使用视觉信息,通常局限于狭窄的病理集,并且在不同数据集和中心之间的迁移能力有限。为了解决这些限制,本研究提出了CapCLIP,一种针对WCE的领域特定视觉-语言表征学习框架。CapCLIP将胶囊内窥镜帧与基于标准化命名法和病理感知标题模板派生的临床文本描述对齐,从而学习出既具有语义信息又可迁移的嵌入。所提出的框架在严格的零样本条件下,使用未见过的WCE数据集,对相关的开源视觉和视觉-语言基础模型进行了评估。评估涵盖了三个下游任务:K近邻分类、CLIP风格的图像-文本分类和文本到图像检索。在这些设置中,CapCLIP始终优于比较基线,尤其在零样本图像-文本分类和跨模态检索方面,在分布外数据集上表现出显著的提升。结果表明,基于语言的表征学习可以改善WCE分析中的泛化能力和语义可解释性。这些发现将CapCLIP定位为朝向专门针对胶囊内窥镜的基础模型的一步,并支持基于语言的WCE分析的使用。
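The CLIP-style image-text classification task mentioned in the abstract above can be sketched as a cosine-similarity softmax over caption embeddings. This is a generic sketch, not CapCLIP's code: the embeddings, the temperature of 0.07, and the two-caption setup are illustrative assumptions.

```python
import math

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Return class probabilities from cosine similarity between one image
    embedding and one text embedding per candidate caption."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    img = normalize(image_emb)
    logits = [
        sum(a * b for a, b in zip(img, normalize(t))) / temperature
        for t in text_embs
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D embeddings: one frame, two caption prompts.
frame_embedding = [1.0, 0.1]
caption_embeddings = [[1.0, 0.0], [0.0, 1.0]]
probs = zero_shot_classify(frame_embedding, caption_embeddings)
predicted = probs.index(max(probs))
```

No classifier head is trained in this setting; the candidate captions alone define the label space, which is why caption wording matters for zero-shot transfer.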
cs.CV / 50 / 2605.08521
Geometric Flood Depth Estimation: Fusing Transformer-Based Segmentation with Digital Elevation Models
几何洪水深度估计:将基于变换器的分割与数字高程模型融合
Abstract
Post-disaster situational awareness relies heavily on understanding both the extent and the volume of floodwaters. While 2D semantic segmentation provides accurate flood masking, it lacks the vertical dimension required to assess navigability and structural risk. This paper presents a geometric "Water Surface Elevation" approach for estimating flood depth from monocular aerial imagery. Our pipeline utilizes Mask2Former, a state-of-the-art transformer-based segmentation model, to generate precise 2D flood masks. These masks are fused with Digital Elevation Models (DEMs) to identify the water-land boundary, calculate a global water surface elevation ($Z_{water}$), and compute per-pixel depth based on the principle of local hydrostatic equilibrium. We evaluate this workflow using the FloodNet and CRASAR-U-DROIDS datasets, demonstrating how high-performance segmentation can be leveraged to extract 3D volumetric data from 2D imagery without the latency of hydrodynamic simulations.
Chinese Translation
灾后情境意识在很大程度上依赖于对洪水范围和水量的理解。尽管二维语义分割能够提供准确的洪水掩膜,但缺乏评估可通行性和结构风险所需的垂直维度。本文提出了一种几何“水面高程”方法,通过单目航空影像估计洪水深度。我们的工作流程利用了Mask2Former这一最先进的基于变换器的分割模型,生成精确的二维洪水掩膜。这些掩膜与数字高程模型(Digital Elevation Models, DEMs)融合,以识别水陆边界,计算全局水面高程($Z_{water}$),并基于局部静水平衡原理计算每个像素的深度。我们使用FloodNet和CRASAR-U-DROIDS数据集评估了这一工作流程,展示了如何利用高性能分割从二维影像中提取三维体积数据,而无需进行水动力模拟的延迟。
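The mask-plus-DEM computation described in the abstract above can be sketched on a tiny grid. This is a minimal sketch under stated assumptions: the function name `flood_depth`, the 4-neighbour boundary test, and the use of the median boundary elevation as $Z_{water}$ are illustrative choices, since the abstract does not specify the estimator.

```python
from statistics import median

def flood_depth(dem, mask):
    """Estimate a single water surface elevation from flooded pixels that
    touch dry land, then return (z_water, per-pixel depth grid)."""
    rows, cols = len(dem), len(dem[0])
    boundary = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                    boundary.append(dem[r][c])  # flooded pixel next to dry land
                    break
    if not boundary:
        raise ValueError("no water-land boundary found")
    z_water = median(boundary)  # assumption: median as a robust estimate
    depth = [
        [max(0.0, z_water - dem[r][c]) if mask[r][c] else 0.0
         for c in range(cols)]
        for r in range(rows)
    ]
    return z_water, depth

# Hypothetical 3x3 elevation grid (metres) and segmentation mask (1 = flooded).
dem = [[5.0, 4.0, 5.0],
       [4.0, 2.0, 4.0],
       [5.0, 4.0, 5.0]]
mask = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]
z_water, depth = flood_depth(dem, mask)
```

Depth is clamped at zero so that segmentation errors on terrain above the estimated surface do not produce negative water depth.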
cs.CV / 51 / 2605.08530
A Two-Stage Motion-Aware Framework for mmWave-based Human Mesh Recovery
基于毫米波的人体网格恢复的两阶段运动感知框架
Abstract
Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.
Chinese Translation
毫米波(mmWave)雷达因其在复杂环境条件下的鲁棒性和强大的隐私保护特性,已成为人类感知的一种有前景的传感方式。然而,由于严重的信号杂波和雷达测量本质上的局部性,从雷达观测中恢复准确的三维人体网格仍然困难。以往的研究通常采用端到端框架,直接从原始雷达数据回归人体参数,而未将信号解释与几何推理解耦或利用时间运动线索,从而限制了学习性能。为了解决这一问题,我们提出了一种基于雷达的人体重建的两阶段框架。首先,我们引入一个人体反射提取模块,执行粗到细的定位和体素级分割,以生成一个信心加权的雷达体积,编码体素级的人体可能性。其次,我们设计了一个运动感知网格恢复网络,通过使用双分支架构联合建模每帧几何和帧间动态来重建人体。大量实验表明,所提方法在保持计算效率的同时,优于现有方法。
cs.CV / 52 / 2605.08557
MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching
MC-RFM:通过混曲率黎曼流匹配实现几-shot适应的几何感知
Abstract
Parameter-efficient adaptation of pretrained vision models is commonly performed through linear probes, prompts, low-rank updates, or lightweight residual modules. While effective, these methods usually treat adaptation as a discrete Euclidean perturbation of frozen representations, without explicitly modeling the geometry of the task-induced feature displacement. We propose MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of frozen visual backbones. The key idea is to represent adapted features on a product manifold combining a hyperbolic factor, which captures hierarchy-sensitive semantic structure, and a Euclidean factor, which preserves locally discriminative visual variation. Adaptation is formulated as a task-conditioned continuous transport from frozen features to support-set prototypes, trained with a flow-matching objective and coupled to a hybrid prototype-linear classifier. The method is lightweight, backbone-agnostic, and operates entirely on cached frozen features. Across seven visual recognition benchmarks, five frozen backbones, and 1/4/16-shot regimes, MC-RFM is the best-performing method in a majority of evaluated settings, with the strongest gains on Transformer backbones and fine-grained datasets. Ablations show that the mixed-curvature head, task conditioning, adaptive branch gating, prototype shrinkage, and discriminative supervision each contribute to performance. These results suggest that few-shot adaptation benefits not only from deciding which parameters to update, but also from modeling how representations should move through a geometry matched to the structure of the downstream task.
Chinese Translation
预训练视觉模型的参数高效适应通常通过线性探针、提示、低秩更新或轻量级残差模块进行。尽管这些方法有效,但通常将适应视为对冻结表示的离散欧几里得扰动,而未明确建模任务引起的特征位移的几何结构。我们提出了MC-RFM,一种用于冻结视觉主干的几-shot适应的混曲率黎曼流匹配框架。其关键思想是将适应后的特征表示在一个结合了双曲因子(捕捉层次敏感的语义结构)和欧几里得因子(保持局部可区分的视觉变化)的乘积流形上。适应被公式化为从冻结特征到支持集原型的任务条件连续传输,使用流匹配目标进行训练,并与混合原型-线性分类器相结合。该方法轻量、主干无关,并完全基于缓存的冻结特征。在七个视觉识别基准、五个冻结主干和1/4/16-shot模式下,MC-RFM在大多数评估设置中表现最佳,在Transformer主干和细粒度数据集上取得了最强的提升。消融实验表明,混曲率头、任务条件、适应性分支门控、原型收缩和区分监督各自对性能有所贡献。这些结果表明,几-shot适应不仅受益于决定更新哪些参数,还受益于建模表示如何在与下游任务结构匹配的几何中移动。
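The product-manifold idea in the abstract above, a hyperbolic factor paired with a Euclidean factor, can be sketched through its distance function. This is only an illustration under stated assumptions: it uses the Poincare ball of curvature -1 for the hyperbolic factor, and the function names and sample points are hypothetical; the paper may use a different hyperbolic model or a learned curvature.

```python
import math

def _sq_norm(v):
    return sum(a * a for a in v)

def poincare_distance(x, y):
    """Geodesic distance in the Poincare ball of curvature -1
    (both points must have Euclidean norm < 1)."""
    diff = _sq_norm([a - b for a, b in zip(x, y)])
    arg = 1.0 + 2.0 * diff / ((1.0 - _sq_norm(x)) * (1.0 - _sq_norm(y)))
    return math.acosh(arg)

def euclidean_distance(x, y):
    return math.sqrt(_sq_norm([a - b for a, b in zip(x, y)]))

def product_distance(p, q):
    """Distance on the product manifold; each point is a
    (hyperbolic_part, euclidean_part) pair."""
    return math.hypot(poincare_distance(p[0], q[0]),
                      euclidean_distance(p[1], q[1]))

# From the origin to a point of norm 0.5 the hyperbolic distance is ln(3).
d_hyp = poincare_distance([0.0, 0.0], [0.5, 0.0])
# With identical hyperbolic parts only the Euclidean factor contributes.
d_mix = product_distance(([0.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [3.0, 4.0]))
```

Hyperbolic distance grows rapidly near the boundary of the ball, which is what makes that factor suited to tree-like, hierarchical structure, while the Euclidean factor behaves like an ordinary feature space.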
cs.CV / 53 / 2605.08560
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B 技术报告
Abstract
We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.
Chinese Translation
我们提出了 ZAYA1-VL-8B,这是一种基于我们内部语言模型 ZAYA1-8B 构建的紧凑型混合专家视觉-语言模型。尽管体积小巧,ZAYA1-VL 在性能上与领先的基础模型如 Molmo2-4B 和 InternVL3.5-4B 具有竞争力,同时在一系列图像理解、推理和计数基准测试中超越了包括 Qwen2.5-VL-3B、PLM-3B 和 MolmoE-1B 的模型。该架构包含两个关键创新:(1)将视觉特定的 LoRA 适配器集成到 LLM 中,以在不增加专家数量的情况下提高特定模态的能力;(2)在 LLM 内对图像标记进行双向注意力处理,以增强视觉理解。我们详细描述了完整的训练流程,包括每个阶段的数据组成、序列打包和注意力掩蔽方案。该模型总参数量为 92 亿,其中包括 14 亿个活跃参数(包括视觉编码器),并已在 https://huggingface.co/Zyphra/ZAYA1-VL 上公开发布。
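The bidirectional attention over image tokens mentioned in the abstract above can be sketched as an attention mask. This is a sketch of one common pattern (causal attention everywhere, plus full attention inside each contiguous run of image tokens); the report's actual masking scheme is not given in the abstract, so the function name and the rule itself are assumptions.

```python
def build_attention_mask(is_image):
    """mask[i][j] is True when query position i may attend to key position j.
    Text tokens are causal; tokens in the same contiguous image span also
    see each other in both directions."""
    n = len(is_image)
    span = [-1] * n  # span id per position, -1 for text tokens
    current = -1
    for i, flag in enumerate(is_image):
        if flag:
            if i == 0 or not is_image[i - 1]:
                current += 1  # a new image span starts here
            span[i] = current
    return [
        [j <= i or (span[i] >= 0 and span[i] == span[j]) for j in range(n)]
        for i in range(n)
    ]

# Sequence layout: text, image, image, text.
mask = build_attention_mask([False, True, True, False])
```

In this layout the first image token can attend to the second one (a future position), while the leading text token still cannot look ahead.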
cs.CV / 54 / 2605.08566
MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration
MicroDiffuse3D:用于三维显微成像恢复的基础模型
Abstract
Chemical imaging enables label-free visualization of cells, tissues and living systems while providing direct biochemical information that is difficult to obtain with conventional fluorescence microscopy. Despite its promise in applications ranging from intraoperative diagnosis to drug-response analysis, its broader use remains limited by slow data acquisition, particularly for three-dimensional imaging. Here we present MicroDiffuse3D, a pretrained foundation model for 3D microscopy image restoration that recovers high-quality volumetric structure from degraded low-resolution measurements acquired at substantially higher throughput. We evaluated MicroDiffuse3D across three challenging restoration settings, including 3D super-resolution under 16-fold volumetric sparsity, joint degradation in resolution and noise, and 3D denoising in the low signal-to-noise ratio (SNR) regime, where the model delivered clear gains over strong baselines. Under the sparse 3D super-resolution setting, MicroDiffuse3D produced clearer continuity across depth with fewer artifacts and improved segmentation quality by 10.58% and line-profile concordance by 15.59%. Together, our results establish pretrained 3D restoration as a broadly applicable strategy for overcoming the throughput and SNR limitations in volumetric chemical imaging, enabling high-resolution analysis at scales and speeds that were previously difficult to achieve.
Chinese Translation
化学成像能够实现无标记的细胞、组织和生物系统的可视化,同时提供难以通过传统荧光显微镜获得的直接生化信息。尽管在从手术中诊断到药物反应分析等应用中展现出潜力,但其更广泛的使用仍受到数据采集速度慢的限制,尤其是在三维成像方面。在此,我们提出了MicroDiffuse3D,一种用于三维显微图像恢复的预训练基础模型,它能够从以显著更高吞吐量获取的降解低分辨率测量中恢复高质量的体积结构。我们在三种具有挑战性的恢复设置中评估了MicroDiffuse3D,包括在16倍体积稀疏下的三维超分辨率、分辨率和噪声的联合降解,以及在低信噪比(SNR)条件下的三维去噪,其中该模型在强基线之上表现出明显的提升。在稀疏三维超分辨率设置下,MicroDiffuse3D在深度连续性方面产生了更清晰的效果,伪影更少,分割质量提高了10.58%,线型轮廓一致性提高了15.59%。综上所述,我们的结果确立了预训练三维恢复作为克服体积化学成像中吞吐量和信噪比限制的广泛适用策略,使得在以前难以实现的尺度和速度下进行高分辨率分析成为可能。
cs.CV / 55 / 2605.08567
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
ACWM-Phys:研究动作条件下视频世界模型中的广义物理交互
Abstract
Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM-Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross-attention improves high-dimensional action conditioning, causal VAEs outperform frame-wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.
Chinese Translation
动作条件世界模型(ACWMs)在视频预测和决策制定方面展现出强大的潜力。然而,现有的基准测试主要局限于自我中心导航或狭窄的特定任务机器人数据集,提供的对丰富物理交互的覆盖有限,这对于广义世界理解是必需的。我们引入了ACWM-Phys,一个新的基准,用于在干净、可控的仿真环境中评估多样物理动态下的动作条件预测,配备精心设计的动作空间。ACWM-Phys包含跨越刚体动力学、运动学、可变形物体交互和粒子动力学的训练和评估数据。为了评估插值和泛化能力,我们设计了具有控制交互模式或场景配置变化的分布内和分布外协议。通过在完全可控的仿真器中构建基准,ACWM-Phys实现了精确的数据收集、可重复的评估和对模型能力的系统分析,以支持物理基础的世界建模。通过在ACWM-DiT上的系统实验,我们发现分布外泛化不仅依赖于物理范畴,还与有效的任务复杂性有关:模型在视觉上简单、低维的具有清晰几何结构的交互中泛化良好,但在可变形接触、高维控制和复杂关节运动中则出现较大下降。这表明模型仍然在很大程度上依赖于视觉外观模式,而不是完全学习潜在的物理规律。消融实验表明,交叉注意力改善了高维动作条件,因果变分自编码器(causal VAEs)优于逐帧编码器,而更大的动作空间虽然更难建模,但可以通过提供更丰富的控制信号来改善泛化。这些发现为物理基础的世界模型设计提供了指导。
cs.CV / 56 / 2605.08572
Enhancing Consistency Models for Multi-Agent Trajectory Prediction
增强多智能体轨迹预测的一致性模型
Abstract
Diffusion models for multi-agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time-critical settings like autonomous driving. Fast-sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single-step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high-quality one-step generation by mapping noise directly to data, but are difficult to train from scratch. We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student-teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs' direct denoising for top-K multi-shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large-scale Argoverse 2 dataset.
Chinese Translation
用于多智能体轨迹预测的扩散模型受到迭代去噪的限制,这导致推理延迟,妨碍了它们在诸如自动驾驶等时间敏感场景中的应用。使用 DDIM 的快速采样变体和信息化初始噪声分布在一定程度上缓解了这一问题,但它们要么未能实现真正的一步生成,要么受到所选噪声分布的限制。一致性模型(Consistency Models, CMs)通过将噪声直接映射到数据,提供高质量的一步生成,但从零开始训练较为困难。我们提出了 ECTraj,一个增强的一致性模型管道,具有改进的训练和条件生成能力,用于轨迹预测。我们的框架扩展了学生-教师一致性训练方案:学生生成标准输出,而教师则明确将其预测与部分真实值融合,以提供更强的监督。我们还利用 CMs 的直接去噪能力,在训练期间进行 top-K 多次生成。将条件生成与这一增强的一致性目标相结合,能够实现更快的推理和更高的预测准确性,在大规模 Argoverse 2 数据集上建立了具有竞争力的新基准。
cs.CV / 57 / 2605.08574
Post-hoc Selective Classification for Reliable Synthetic Image Detection
后验选择性分类用于可靠的合成图像检测
Abstract
As synthetic images become increasingly realistic, reliable synthetic image detection techniques are urgently needed to prevent their misuse. Despite satisfactory in-distribution performance, deep neural network-based synthetic image detectors (SIDs) lack reliability in deployment and often fail in the presence of common covariate shifts, resulting in poor detection accuracy. To avoid the risk caused by potential errors, we adopt a selective classification (SC) strategy by allowing SIDs to abstain from making low-confidence predictions. For practicality, we focus on post-hoc methods, which perform confidence estimation on a given SID without retraining. However, we show that conventional logit-based confidence score functions (CSFs) exhibit pathological behavior under covariate shifts, leading to SC performance close to or even worse than random guessing. To address this, we propose a simple yet effective SC framework for Reliable Synthetic Image Detection (ReSIDe). First, we generalize the notion of logits to an SID's intermediate layers from a centroid matching perspective, extending the use of logit-based CSFs to any layer of an SID. Then, we introduce a preference optimization algorithm that aggregates confidence scores extracted from different layers into a final confidence estimate by minimizing an upper bound of the area under the risk-coverage curve (AURC). Extensive experimental results show that ReSIDe significantly boosts the SC performance of various logit-based CSFs under common covariate shifts, achieving up to 69.55% AURC reduction.
Chinese Translation
随着合成图像变得越来越逼真,可靠的合成图像检测技术迫在眉睫,以防止其被滥用。尽管深度神经网络基础的合成图像检测器(SIDs)在分布内表现令人满意,但在实际应用中缺乏可靠性,且在常见协变量变化的情况下往往失败,导致检测准确率低下。为了避免潜在错误带来的风险,我们采用选择性分类(SC)策略,允许SIDs在信心低的情况下不做预测。出于实用性考虑,我们专注于后验方法,这些方法在不重新训练的情况下对给定的SID进行信心估计。然而,我们发现传统的基于logit的信心评分函数(CSFs)在协变量变化下表现出病态行为,导致SC性能接近甚至低于随机猜测。为了解决这一问题,我们提出了一个简单而有效的选择性分类框架,称为可靠合成图像检测(ReSIDe)。首先,我们从中心匹配的角度推广logits的概念到SID的中间层,扩展了基于logit的CSFs在SID任意层的使用。然后,我们引入了一种偏好优化算法,通过最小化风险覆盖曲线下面积(AURC)的上界,将从不同层提取的信心评分聚合为最终的信心估计。大量实验结果表明,ReSIDe显著提升了在常见协变量变化下各种基于logit的CSFs的SC性能,实现了高达69.55%的AURC降低。
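The AURC metric named in the abstract above can be sketched with a simple empirical estimator. This is a generic illustration, not ReSIDe's optimization objective: the function name `aurc` and the toy confidences are assumptions, and the estimator averages the selective risk over all coverage levels k/n.

```python
def aurc(confidences, correct):
    """Empirical area under the risk-coverage curve: sort predictions by
    confidence (highest first) and average the error rate of the top-k
    predictions over k = 1..n. Lower is better."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    total_risk = 0.0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        total_risk += errors / k  # selective risk at coverage k/n
    return total_risk / len(order)

# Hypothetical detector outputs: same confidences, opposite correctness.
scores = [0.9, 0.8, 0.3, 0.2]
aurc_good = aurc(scores, [True, True, False, False])  # confident when right
aurc_bad = aurc(scores, [False, False, True, True])   # confident when wrong
```

A confidence function that ranks wrong predictions above right ones, as in the second call, is the pathological case the abstract describes under covariate shift.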
cs.CV / 58 / 2605.08577
Improving Generative Adversarial Networks with Self-Distillation
通过自蒸馏改进生成对抗网络
Abstract
In modern GANs, maintaining an Exponential Moving Average (EMA) of the generator's weights is a standard practice, as such an averaged model consistently outperforms the actively trained generator. However, the EMA generator is used for final deployment only and does not influence the training process. To address this missed opportunity, we introduce Self-Distilled GAN (SD-GAN) that employs the EMA generator as a teacher to guide the active generator (student) via perceptual loss. We prove the local asymptotic stability of SD-GAN in the Dirac-GAN setting and show that it dampens the parasitic cycling behavior that plagues the conventional GANs. Empirical evaluations across established architectures and datasets demonstrate that SD-GAN improves the final image quality on several metrics (FID and random-FID in particular), stabilizes the optimization trajectory and provides additional learning guidance that is not trivially correlated with the conventional adversarial loss. It also proves effective for fine-tuning pretrained GAN models.
Chinese Translation
在现代生成对抗网络(GAN)中,维护生成器权重的指数移动平均(EMA)是一种标准做法,因为这种平均模型在性能上始终优于主动训练的生成器。然而,EMA生成器仅用于最终部署,并未影响训练过程。为了解决这一被忽视的机会,我们提出了自蒸馏生成对抗网络(Self-Distilled GAN,SD-GAN),该网络利用EMA生成器作为教师,通过感知损失指导主动生成器(学生)。我们证明了SD-GAN在Dirac-GAN设置下的局部渐近稳定性,并展示了它能够抑制困扰传统GAN的寄生循环行为。在多个已建立的架构和数据集上的实证评估表明,SD-GAN在多个指标上(特别是FID和随机FID)提高了最终图像质量,稳定了优化轨迹,并提供了与传统对抗损失不完全相关的额外学习指导。它在微调预训练的GAN模型方面也证明了其有效性。
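The teacher-student loop described in the abstract above can be sketched with two small functions. This is a minimal sketch under stated assumptions: weights and features are flat lists, the decay value is illustrative, and a plain mean squared error stands in for the perceptual loss, which in practice is computed on features of a pretrained network.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def distillation_loss(student_feats, teacher_feats):
    """Stand-in for a perceptual loss: mean squared feature difference
    between the student output and the (detached) teacher output."""
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

# One hypothetical step: the student has moved, the teacher follows slowly.
teacher_w = [0.0, 1.0]
student_w = [1.0, 1.0]
teacher_w = ema_update(teacher_w, student_w, decay=0.9)

# The extra loss term pulls student outputs toward the teacher's outputs.
loss = distillation_loss([1.0, 2.0], [1.0, 4.0])
```

In a full training loop this term would be added to the adversarial generator loss, with gradients flowing only through the student.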
cs.CV / 59 / 2605.08585
PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis
PromptDx:用于多模态上下文阿尔茨海默病诊断的可微调提示调优
Abstract
Deep learning models in medical imaging typically operate as parametric memory, diagnosing patients by recalling fixed knowledge learned during training. This contrasts sharply with clinical practice, where physicians employ analogical reasoning to diagnose new cases by referencing similar records from past exemplars. While In-Context Learning (ICL) frameworks such as Tabular Prior-Fitted Networks (TabPFN) offer a promising diagnosis-by-reference paradigm, they are designed with tabular-specific inductive priors and rely on non-differentiable preprocessing pipelines, leading to manifold mismatch and gradient fracture when applied to heterogeneous multimodal data. To address these limitations, we propose PromptDx, a novel diagnosis-by-reference framework that leverages a pre-trained TabPFN as an ICL engine while enabling seamless integration with multimodal representations. Our core contribution is a Differentiable Prompt Tuning (DPT) mechanism that aligns a Masked Multimodal Modeling module with the pre-trained ICL engine. By training a lightweight adapter as a differentiable surrogate for the engine's non-differentiable preprocessors, we enable an end-to-end optimization of multimodal prompts within the ICL paradigm. We validate our method on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset using 3D MRI and tabular biomarkers. Experiments demonstrate that our approach outperforms traditional parametric baselines. Notably, our method achieves superior performance using only 1% context samples compared to 30% in standard ICL, demonstrating exceptional manifold condensation ability. We further validate the generalizability of our DPT framework across six tabular datasets with diverse scales. Overall, our method offers a more data-efficient and clinically aligned paradigm for Alzheimer's Disease diagnosis.
Chinese Translation
医学影像中的深度学习模型通常作为参数化记忆运作,通过回忆在训练过程中学习到的固定知识来诊断患者。这与临床实践形成鲜明对比,后者中医生通过参考过去相似病例的记录,运用类比推理来诊断新病例。尽管上下文学习(In-Context Learning, ICL)框架如表格优先拟合网络(Tabular Prior-Fitted Networks, TabPFN)提供了一种有前景的基于参考的诊断范式,但它们是基于特定于表格的归纳先验设计的,并依赖于不可微分的预处理管道,这在应用于异构多模态数据时导致了流形不匹配和梯度断裂。为了解决这些局限性,我们提出了PromptDx,这是一种新颖的基于参考的诊断框架,利用预训练的TabPFN作为ICL引擎,同时实现与多模态表示的无缝集成。我们的核心贡献是可微调提示调优(Differentiable Prompt Tuning, DPT)机制,它将一个掩蔽多模态建模模块与预训练的ICL引擎对齐。通过训练一个轻量级适配器作为引擎不可微分预处理器的可微分替代品,我们实现了在ICL范式内对多模态提示的端到端优化。我们在阿尔茨海默病神经影像学倡议(Alzheimer's Disease Neuroimaging Initiative, ADNI)数据集上验证了我们的方法,使用3D MRI和表格生物标志物。实验表明,我们的方法优于传统的参数基线。值得注意的是,我们的方法在仅使用1%的上下文样本时,性能优于标准ICL中的30%,展现出卓越的流形凝聚能力。我们进一步验证了DPT框架在六个具有不同规模的表格数据集上的泛化能力。总体而言,我们的方法为阿尔茨海默病诊断提供了一种更具数据效率和临床对齐的范式。
cs.CV / 60 / 2605.08589
S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
Abstract
Parameter-Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the number of fine-tuned parameters by fine-tuning only a few spectral coefficients. Their basic assumption is that the weight change $\delta W$ is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of the weight change is not sparse, but instead approximately uniform in power. This implies that fine-tuning only a few spectral coefficients is insufficient to accurately model a weight change with a uniform spectrum. To address this issue, we propose to seek an invertible transformation that maps a latent spatial-domain matrix with a sparse spectrum to the weight change, and then perform PEFT in this sparse spectrum domain with few spectral coefficients; we call this method S2FT. To find such a transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by the observation that sparse spectra often correspond to locally smooth spatial structures, we regard this transformation as a row and column rearrangement of the pre-estimated weight change that smooths spatial structures while keeping the structural information of neurons. Finally, we solve the rearrangement search problem with a simple nearest-neighbor search, thereby obtaining the invertible transformation. Extensive results show that S2FT achieves superior performance using only 0.08% training parameters.
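The Fourier-based starting point that the abstract above builds on, a weight change recovered from a handful of spectral coefficients, can be sketched with a direct inverse 2-D DFT. This illustrates only that baseline idea, not S2FT's row and column rearrangement; the function name, the dictionary layout of the coefficients, and taking the real part of the result are assumptions.

```python
import cmath

def delta_w_from_spectrum(coeffs, rows, cols):
    """Build a rows x cols weight change from a sparse spectrum.
    coeffs maps a frequency index (u, v) to a trainable coefficient;
    every frequency not listed is treated as zero."""
    delta = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0j
            for (u, v), value in coeffs.items():
                phase = 2j * cmath.pi * (u * r / rows + v * c / cols)
                acc += value * cmath.exp(phase)
            delta[r][c] = (acc / (rows * cols)).real
    return delta

# One trainable coefficient (the DC term) defines all 8 entries of a 2x4 update.
spectrum = {(0, 0): 8.0}
delta_w = delta_w_from_spectrum(spectrum, rows=2, cols=4)
trainable = len(spectrum)
dense = 2 * 4
```

The parameter saving comes from `trainable` being much smaller than `dense`, which only works well if the true weight change really has a sparse spectrum; that is the assumption the abstract questions.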
cs.CV / 61 / 2605.08592
Cross-Modal RGB-D Fusion Transformer for 6D Pose Estimation of Non-Cooperative Spacecraft with Stereo-Derived Depth
用于非合作航天器的六自由度姿态估计的跨模态RGB-D融合变换器,基于立体图像深度
Abstract
On-orbit servicing and active debris removal involving non-cooperative spacecraft require reliable pose estimation to supply accurate position and orientation data for autonomous visual navigation. Learning-based monocular methods have seen widespread adoption in spacecraft pose estimation, yet they suffer from an intrinsic depth ambiguity problem and tend to fail under the harsh illumination conditions routinely encountered in orbit. Active depth sensors could in principle address the geometric ambiguity, but their power and mass requirements make them poorly suited to most spacecraft platforms. This work addresses these issues through a passive stereo vision framework for six-degree-of-freedom (6-DOF) pose estimation of non-cooperative spacecraft. A binocular stereo matching network called TSCA-Stereo is developed to cope with weak-texture surfaces, specular highlights, and severe lighting variations typical of space imagery. A cross-modal fusion Transformer is introduced to combine RGB appearance information with stereo depth features in an adaptive manner, supporting reliable pose recovery. A synthetic binocular multimodal dataset is also built for the experiments, covering stereo disparity maps and 6-DOF pose annotations across a range of illumination scenarios, attitude configurations, and noise levels. Experimental results show that TSCA-Stereo outperforms the baseline across every evaluated metric on this space-specific dataset. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, confirming that the passive stereo approach is both effective and resilient when operating under the demanding visual conditions of the space environment.
Chinese Translation
在轨道服务和涉及非合作航天器的主动碎片移除任务中,需要可靠的姿态估计,以提供准确的位置和方向数据,以支持自主视觉导航。基于学习的单目方法在航天器姿态估计中得到了广泛应用,但它们存在固有的深度模糊问题,并且在轨道上常见的恶劣光照条件下往往会失败。主动深度传感器在理论上可以解决几何模糊问题,但其功耗和质量要求使其不适合大多数航天器平台。本研究通过一种被动立体视觉框架来解决这些问题,以实现非合作航天器的六自由度(6-DOF)姿态估计。开发了一种名为TSCA-Stereo的双目立体匹配网络,以应对空间图像中典型的弱纹理表面、镜面高光和严重的光照变化。引入了一种跨模态融合变换器,以自适应的方式将RGB外观信息与立体深度特征结合起来,从而支持可靠的姿态恢复。还构建了一个合成的双目多模态数据集用于实验,涵盖了立体视差图和6-DOF姿态标注,涉及多种光照场景、姿态配置和噪声水平。实验结果表明,TSCA-Stereo在该特定空间数据集的每个评估指标上均优于基线方法。完整的姿态估计管道在不同成像条件下实现了平均平移误差为0.0419米,平均方向误差为0.8632°,确认了被动立体方法在空间环境的苛刻视觉条件下既有效又具有韧性。
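Two quantities from the abstract above, stereo-derived depth and orientation error, can be sketched with textbook formulas. This is a generic sketch, not the paper's pipeline: it assumes a rectified pinhole stereo pair and unit quaternions in (w, x, y, z) order, and the focal length, baseline, and function names are hypothetical.

```python
import math

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Rectified stereo: depth Z = f * B / d, in the units of the baseline."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def orientation_error_deg(q_pred, q_true):
    """Angle of the rotation taking q_pred to q_true, for unit quaternions.
    The absolute value treats q and -q as the same rotation."""
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true)))
    dot = min(1.0, dot)  # guard against rounding just above 1
    return math.degrees(2.0 * math.acos(dot))

# Hypothetical camera: 1000 px focal length, 0.5 m baseline, 10 px disparity.
depth_m = depth_from_disparity(10.0, 1000.0, 0.5)

identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
err_deg = orientation_error_deg(quarter_turn_z, identity)
```

Since depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy at long range, which is one reason matching quality on weak-texture surfaces matters.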
cs.CV / 62 / 2605.08606
Egocentric Whole-Body Human Mesh Recovery with Prior-Guided Learning
基于先验引导学习的自我中心全身人类网格恢复
Abstract
Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at https://github.com/naso06/EgoSMPLX.
Chinese Translation
从单目头戴式摄像机获取自我中心人类网格恢复(HMR)在增强现实/虚拟现实(AR/VR)应用中变得越来越重要,但由于缺乏基于参数化人类身体模型(如SMPL和SMPL-X)的可靠真实标注(GT),对真实的自我中心图像而言,这一任务仍然具有挑战性。现有的自我中心HMR方法通常依赖伪真实标注,并专注于身体姿态估计,这限制了它们恢复细致全身细节(如手和脸)的能力。我们研究了自我中心全身人类网格恢复,并提出了一种先验引导学习框架,该框架能够从单个自我中心图像重建全身网格。我们构建了与3D关节监督对齐的更准确的基于优化的伪真实标注,并通过适应外部HMR基础模型和基于扩散的姿态先验来利用多个先验。此外,进一步采用了一个确定性去畸变模块来处理自我中心图像中的鱼眼畸变。在多个自我中心基准测试中的实验表明,与最先进的方法相比,我们的方法在全身重建方面有了显著改善,并且我们的基于优化的伪真实标注在准确性上明显优于现有的基于回归的伪真实标注。为了促进可重复性,代码和数据集标注已公开发布在 https://github.com/naso06/EgoSMPLX。
cs.CV / 63 / 2605.08618
Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification
超越玩具基准:植物病理分类中OOD检测方法的系统评估
Abstract
Out-of-distribution (OOD) detection is essential for reliable deployment of deep learning systems, yet the majority of existing methods are evaluated on small, visually homogeneous benchmarks. In this work, we study six OOD detection methods spanning post-hoc scoring, auxiliary objectives, energy-based models, and constrained optimization on the Plant Pathology 2021 dataset, a fine-grained task with natural distribution shifts. Energy-based fine-tuning performs best across OOD settings, improving detection over the softmax baseline while preserving in-distribution accuracy. Analysis shows these gains stem from both a restructuring of the embedding space alongside calibration of the scoring function. We further document practical training instabilities that arise when scaling constrained optimization methods to moderate-sized datasets, findings that are largely absent from existing literature. Our results demonstrate that principled OOD detection is achievable on real-world domain-specific data and that benchmark evaluations alone may not capture the challenges that emerge in practice.
Chinese Translation
分布外(OOD)检测对于深度学习系统的可靠部署至关重要,然而现有大多数方法的评估都是基于小型、视觉上同质的基准。在本研究中,我们研究了六种OOD检测方法,涵盖了后验评分、辅助目标、基于能量的模型以及约束优化,这些方法在植物病理2021数据集上进行评估,该数据集是一个具有自然分布变化的细粒度任务。基于能量的微调在各种OOD设置中表现最佳,相比于softmax基线提高了检测能力,同时保持了在分布内的准确性。分析表明,这些提升源于嵌入空间的重构以及评分函数的校准。我们进一步记录了在将约束优化方法扩展到中等规模数据集时出现的实际训练不稳定性,这些发现现有文献中大多缺乏。我们的结果表明,原则性的OOD检测在真实世界的特定领域数据上是可实现的,单靠基准评估可能无法捕捉到实践中出现的挑战。
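The contrast between the softmax baseline and energy-based scoring can be made concrete with a minimal sketch. It assumes the standard post-hoc definitions (maximum softmax probability, and the negative free energy $T \cdot \log\sum_i \exp(f_i/T)$ over the logits $f$); the function names and toy logits are illustrative and are not taken from the paper.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def max_softmax_score(logits):
    """Softmax baseline: probability of the top class (higher = more in-distribution)."""
    return math.exp(max(logits) - logsumexp(logits))

def energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(logits / T) (higher = more in-distribution)."""
    return temperature * logsumexp([v / temperature for v in logits])

# Hypothetical logits: one confident sample and one flat, OOD-like sample.
confident = [9.0, 1.0, 0.5]
flat = [1.1, 1.0, 0.9]
print(max_softmax_score(confident) > max_softmax_score(flat))  # True
print(energy_score(confident) > energy_score(flat))            # True
```

Both scores are thresholded in the same way at test time; the energy score simply keeps the logit magnitude that softmax normalization discards.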
cs.CV / 64 / 2605.08627
DRNet: All-in-One Image Restoration via Prior-Guided Dynamic Reparameterization
DRNet:通过先验引导的动态重参数化实现一体化图像恢复
Abstract
All-in-one image restoration aims to handle diverse degradations within a single model. However, existing methods often suffer from three key limitations: 1) per-input computational overhead from dynamic degradation estimation; 2) optimization challenges due to task heterogeneity; and 3) inefficient, frequency-agnostic encoder designs. To overcome these, we introduce the Dynamic Reparameterization Network (DRNet), a novel framework operating on an initialization-stage reconfiguration paradigm that fundamentally eliminates per-input overhead. At its core is a Dynamic Reparameterization MLP (DRMLP) guided by a Task-Specific Modulator (TSM), which effectively mitigates task heterogeneity by orchestrating both specific restoration goals and a versatile general-purpose mode within a unified architecture. Furthermore, we incorporate a Continuous Wavelet Transform Encoder (CWTE) that explicitly leverages frequency characteristics via wavelet decomposition for a lightweight yet powerful design. Extensive experiments demonstrate that DRNet achieves state-of-the-art performance across five restoration tasks with superior parameter efficiency. Crucially, it showcases unique flexibility, excelling as both a highly competitive foundation model for blind restoration and a top-performing user-guided specialist.
Chinese Translation
一体化图像恢复旨在通过单一模型处理多种退化。然而,现有方法通常面临三个主要限制:1)由于动态退化估计导致的每个输入的计算开销;2)由于任务异质性带来的优化挑战;3)效率低下且与频率无关的编码器设计。为了解决这些问题,我们提出了动态重参数化网络(Dynamic Reparameterization Network, DRNet),这是一个基于初始化阶段重配置范式的全新框架,根本上消除了每个输入的开销。其核心是一个由任务特定调制器(Task-Specific Modulator, TSM)引导的动态重参数化多层感知机(Dynamic Reparameterization MLP, DRMLP),它通过在统一架构中协调特定恢复目标和通用模式,有效缓解了任务异质性。此外,我们还结合了连续小波变换编码器(Continuous Wavelet Transform Encoder, CWTE),通过小波分解显式利用频率特性,实现了轻量且强大的设计。大量实验表明,DRNet在五个恢复任务上实现了最先进的性能,且参数效率优越。关键是,它展现出独特的灵活性,既作为盲恢复的高度竞争基础模型,又作为表现卓越的用户引导专家。
cs.CV / 65 / 2605.08635
Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes
基于运动学驱动的高斯形状变形用于模糊单目动态场景
Abstract
Reconstructing dynamic 3D scenes from blurry monocular videos is challenging as motion-induced blur entangles object motion and geometry, hindering geometric consistency. We present Kinematics-GS, a kinematics-aware framework that models blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes along motion trajectories, thereby mitigating degenerate shape collapse without auxiliary motion supervision. To stabilize optimization, we decompose scenes into dynamic and static components using temporal deformation variance and employ a coarse-to-fine deformation strategy to capture both global motion and fine-grained details. We also introduce a challenging real-world dataset of deformable and elastic objects exhibiting non-rigid motion with spatially non-uniform motion blur that obscures geometric cues. Extensive experiments on real-world benchmarks with realistic motion blur demonstrate that Kinematics-GS outperforms prior methods by a clear margin in monocular dynamic scene reconstruction, highlighting its effectiveness in handling complex and non-rigid motion scenarios.
Chinese Translation
从模糊的单目视频中重建动态三维场景是一项具有挑战性的任务,因为运动引起的模糊将物体运动与几何形状纠缠在一起,妨碍了几何一致性。我们提出了Kinematics-GS,一个运动学感知框架,将模糊建模为与运动对齐的变形,并引入运动学先验以沿运动轨迹重新参数化高斯形状,从而在没有辅助运动监督的情况下减轻退化形状的崩溃。为了稳定优化,我们利用时间变形方差将场景分解为动态和静态组件,并采用粗到细的变形策略来捕捉全局运动和细粒度细节。我们还引入了一个具有挑战性的真实世界数据集,其中包含表现出非刚性运动的可变形和弹性物体,并伴随空间上非均匀的运动模糊,这使得几何线索变得模糊。针对具有真实运动模糊的真实世界基准的广泛实验表明,Kinematics-GS在单目动态场景重建中明显优于先前的方法,突显了其在处理复杂和非刚性运动场景中的有效性。
cs.CV / 66 / 2605.08640
FlowADMM: Plug-and-play ADMM with Flow-based Renoise-Denoise Priors
FlowADMM:基于流的重噪声-去噪先验的即插即用ADMM
Abstract
Plug-and-play (PnP) methods for solving inverse problems have recently achieved strong performance by leveraging denoising priors based on powerful generative diffusion and flow models. However, existing diffusion- and flow-based PnP methods typically rely on stochastic renoise-denoise operations, which complicate the analysis of their convergence behavior. In this work, we identify and formalize the deterministic renoise-denoise operator underlying flow-based plug-and-play methods. This perspective reveals that these methods implicitly define a deterministic operator given by the expectation of a denoiser over the latent noise distribution. Building on this insight, we propose FlowADMM, a PnP algorithm that integrates the renoise-denoise operator into the classical alternating direction method of multipliers (ADMM) framework. We establish convergence guarantees for FlowADMM under weak Lipschitz conditions on the underlying flow network, and extend the analysis to non-stationary time schedules. Empirically, FlowADMM achieves state-of-the-art performance among flow-based PnP methods on a range of inverse problems, including denoising, deblurring, super-resolution, and inpainting, while requiring fewer data consistency evaluations than prior approaches.
Chinese Translation
即插即用(PnP)方法通过利用基于强大生成扩散和流模型的去噪先验,最近在解决逆问题方面取得了显著的性能。然而,现有的基于扩散和流的PnP方法通常依赖于随机重噪声-去噪操作,这使得其收敛行为的分析变得复杂。在本研究中,我们识别并形式化了流基即插即用方法中潜在的确定性重噪声-去噪算子。这一视角揭示了这些方法隐式定义了一个由去噪器在潜在噪声分布上的期望给出的确定性算子。在此基础上,我们提出了FlowADMM,一种将重噪声-去噪算子整合到经典交替方向乘子法(ADMM)框架中的PnP算法。我们在基础流网络的弱Lipschitz条件下建立了FlowADMM的收敛保证,并将分析扩展到非平稳时间调度。实证结果表明,FlowADMM在一系列逆问题中,包括去噪、去模糊、超分辨率和图像修复,达到了基于流的PnP方法中的最先进性能,同时所需的数据一致性评估次数少于先前的方法。
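As a rough illustration of how a renoise-denoise operator could slot into ADMM, the sketch below runs PnP-ADMM on a toy denoising problem with data term $\tfrac12\|x-y\|^2$. Everything specific here is an assumption rather than the paper's method: the linear flow interpolation $(1-t)\,v + t\,\epsilon$, the fixed noise bank used to approximate the expectation, the moving-average stand-in for a learned denoiser, and all function names.

```python
import random

def renoise_denoise(v, denoiser, t, noise_bank):
    """Monte Carlo estimate of E_eps[denoiser((1 - t) * v + t * eps)].

    Reusing a fixed noise bank makes the operator deterministic across calls.
    """
    acc = [0.0] * len(v)
    for eps in noise_bank:
        noised = [(1.0 - t) * vi + t * ei for vi, ei in zip(v, eps)]
        out = denoiser(noised)
        acc = [a + o for a, o in zip(acc, out)]
    return [a / len(noise_bank) for a in acc]

def pnp_admm_denoise(y, denoiser, t=0.1, rho=1.0, iters=20, n_noise=8, seed=0):
    """PnP-ADMM for the data term 0.5 * ||x - y||^2 with a plug-in prior step."""
    rng = random.Random(seed)
    noise_bank = [[rng.gauss(0.0, 1.0) for _ in y] for _ in range(n_noise)]
    z = list(y)
    u = [0.0] * len(y)
    for _ in range(iters):
        # x-update: closed-form proximal step of the quadratic data term.
        x = [(yi + rho * (zi - ui)) / (1.0 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the prior step is the renoise-denoise operator.
        z = renoise_denoise([xi + ui for xi, ui in zip(x, u)], denoiser, t, noise_bank)
        # u-update: scaled dual ascent on the constraint x = z.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z

def moving_average(v):
    """Toy stand-in for a learned flow denoiser (3-tap smoothing)."""
    n = len(v)
    return [(v[max(i - 1, 0)] + v[i] + v[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

y = [0.0, 0.2, -0.1, 1.1, 0.9, 1.0]
print(pnp_admm_denoise(y, moving_average))
```

For deblurring, super-resolution, or inpainting, only the x-update would change, since it is the proximal step of the data term.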
cs.CV / 67 / 2605.08651
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
通过正交子空间投影进行隐私意识的视频异常检测
Abstract
Video anomaly detection (VAD) systems often prioritize accuracy while overlooking privacy concerns, limiting their suitability for real-world deployment. We propose the Orthogonal Projection Layer (OPL), a lightweight module that removes task-irrelevant variations to produce representations focused on anomaly-relevant cues. To address privacy risks in human-centered scenarios, we introduce Guided OPL (G-OPL), which suppresses facial attributes using weak supervision from face-presence signals while preserving non-identifying features such as pose and motion. A cosine alignment objective enforces consistent capture and removal of facial information without identity labels or adversarial training. We further present a privacy-aware evaluation framework that jointly assesses detection performance and privacy preservation, and enables analysis of how sensitive information is filtered. Experiments show that embedding privacy constraints into model design reduces sensitive information while maintaining or improving detection accuracy, supporting projection-based architectures as a principled approach for privacy-aware VAD.
Chinese Translation
视频异常检测(VAD)系统通常优先考虑准确性,而忽视隐私问题,这限制了它们在现实世界中的适用性。我们提出了正交投影层(Orthogonal Projection Layer, OPL),这是一个轻量级模块,能够去除与任务无关的变化,从而生成专注于异常相关线索的表示。为了应对以人为中心场景中的隐私风险,我们引入了引导正交投影层(Guided OPL, G-OPL),该模块利用来自面部存在信号的弱监督抑制面部特征,同时保留诸如姿态和运动等非识别性特征。余弦对齐目标强制一致地捕捉和去除面部信息,而无需身份标签或对抗训练。我们进一步提出了一个隐私意识的评估框架,该框架联合评估检测性能和隐私保护,并能够分析敏感信息的过滤方式。实验表明,将隐私约束嵌入模型设计中可以减少敏感信息,同时保持或提高检测准确性,支持基于投影的架构作为隐私意识VAD的原则性方法。
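The geometric idea behind an orthogonal projection layer can be sketched in a few lines: remove from a feature vector its component inside a chosen subspace. In the paper the layer is learned (and, for G-OPL, guided by face-presence signals); the fixed "sensitive" direction and the function names below are assumptions for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(directions, tol=1e-12):
    """Gram-Schmidt: turn the given directions into an orthonormal basis."""
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:  # skip directions already spanned by the basis
            basis.append([ri / norm for ri in r])
    return basis

def project_out(feature, basis):
    """Return the feature with its component inside span(basis) removed."""
    out = list(feature)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

# Hypothetical 3-D feature; the first axis stands in for a sensitive direction.
sensitive = orthonormalize([[1.0, 0.0, 0.0]])
print(project_out([0.7, -0.2, 0.5], sensitive))  # [0.0, -0.2, 0.5]
```

The output is orthogonal to every basis vector, so a linear probe restricted to that subspace sees no signal, while the remaining coordinates pass through unchanged.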
cs.CV / 68 / 2605.08663
CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition
CAST:基于通道感知空间迁移学习的伪图像雷达在手语识别中的应用
Abstract
We propose CAST, a dual-stream architecture that utilizes channel-aware spatial transfer learning for isolated sign language recognition, addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware architectures with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, a 3.3-percentage-point improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.
Chinese Translation
我们提出了CAST,一种双流架构,利用通道感知空间迁移学习进行孤立手语识别,以应对仅具有幅度信息的60 GHz雷达范围-时间图(RTM)所带来的挑战。该框架结合了三种物理感知架构与预训练的视觉骨干网络,在仅依赖雷达的条件下,处理临床和字母手势。首先,将显式的分贝到线性反演与窗口快速傅里叶变换相结合,提取节奏速度图(CVD),同时避免了来自对对数压缩信号的谱分析所产生的谐波伪影。其次,跨天线空间注意模块在卷积之前对原始天线通道应用注意力,保持接收器间幅度协方差。第三,不对称的交叉注意机制融合来自并行ConvNeXt-Tiny(CVD)和EfficientNetV2-S(RTM)骨干网络的表示。大量实验表明,该架构在5折交叉验证下实现了80.5%的Top-1准确率,相较于最佳单模型基线(77.2%)提高了3.3%。研究结果表明,物理感知信号表示为在受限传感器模式下的雷达手语识别提供了一个有前景的方向。源代码可在以下网址获取:https://github.com/Shakhoyat/CAST-at-SignEval2026。
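The first component, decibel-to-linear inversion followed by a windowed Fourier transform along time, can be sketched for a single range bin as follows. This is a simplified illustration, not the paper's pipeline: it assumes the dB values encode amplitude (so the inverse is $10^{\mathrm{dB}/20}$; power dB would use a divisor of 10), uses a Hann window, and replaces the FFT with a direct DFT for self-containment. All names and the sample series are hypothetical.

```python
import cmath
import math

def db_to_linear(db_values):
    """Invert 20*log10 compression (assumes amplitude dB, not power dB)."""
    return [10.0 ** (v / 20.0) for v in db_values]

def hann_window(n):
    """Symmetric Hann window; requires n >= 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitude(signal):
    """Magnitude of the discrete Fourier transform (O(n^2), fine for a sketch)."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(signal)))
        for k in range(n)
    ]

def cadence_spectrum(db_series):
    """Linearize one range bin's time series, window it, and take the spectrum."""
    linear = db_to_linear(db_series)
    window = hann_window(len(linear))
    return dft_magnitude([x * w for x, w in zip(linear, window)])

# Hypothetical dB samples of one range bin over 8 time steps.
series_db = [0.0, 6.0, 0.0, -6.0, 0.0, 6.0, 0.0, -6.0]
print([round(m, 3) for m in cadence_spectrum(series_db)])
```

Taking the spectrum of the dB values directly would analyze the logarithm of the signal, and the nonlinearity of the logarithm introduces harmonics of any periodic modulation; inverting the compression first avoids that.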
cs.CV / 69 / 2605.08664
IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts
IPAD-CLIP:教会 CLIP 检测图像局部感知伪影
Abstract
Current image quality assessment methods are heavily biased towards global distortions (e.g., noise, blur), neglecting local perceptual artifacts such as ghosting, lens flare, and moiré effects. Although significant progress has been made in artifact removal, the fundamental problem of automatic artifact detection remains largely unexplored. In this paper, we formalize the Image Perceptual Artifact Detection (IPAD) task to address this gap. We contribute a benchmark dataset comprising 3,520 artifact images, including 520 real-captured and 3,000 synthetic samples, each paired with pixel-level masks across three representative artifact categories. The core challenge of IPAD lies in the localized, subtle, and semantically weak nature of these artifacts, which makes them prone to missed detection. To overcome this, we introduce IPAD-CLIP, a novel framework built upon CLIP that enhances artifact discrimination in both textual and visual spaces while preserving generalization capabilities. Our key insight is that local artifacts often exhibit strong correlations with specific semantic contexts. Accordingly, we learn artifact-aware text embeddings to explicitly model the object-artifact relationships, resulting in enhanced representations that clearly differentiate between clean and artifact prompts. These text embeddings are then used as anchors to shift the visual encoder's attention from high-level semantics to subtle, low-level artifacts. Extensive experiments demonstrate that IPAD-CLIP offers a resource-efficient adaptation of CLIP for detection, significantly outperforming advanced image anomaly detection and manipulation detection methods on our benchmark. To the best of our knowledge, this is the first study addressing multi-class local perceptual artifact detection in terms of both dataset and model.
Chinese Translation
当前的图像质量评估方法严重偏向于全局失真(例如,噪声、模糊),忽视了局部感知伪影,如鬼影、镜头眩光和摩尔纹效应。尽管在伪影去除方面取得了显著进展,但自动伪影检测的根本问题仍然未得到充分探索。本文正式提出图像感知伪影检测(IPAD)任务,以填补这一空白。我们贡献了一个基准数据集,包含3,520张伪影图像,其中包括520张真实捕获样本和3,000张合成样本,每个样本都配有三个代表性伪影类别的像素级掩码。IPAD的核心挑战在于这些伪影的局部性、微妙性和语义弱性,使其容易被漏检。为了解决这个问题,我们引入了IPAD-CLIP,这是一个基于CLIP的新框架,增强了文本和视觉空间中的伪影区分能力,同时保持了泛化能力。我们的关键见解是,局部伪影通常与特定语义上下文之间存在强相关性。因此,我们学习了伪影感知的文本嵌入,以明确建模对象与伪影之间的关系,从而产生增强的表示,清晰地区分干净和伪影提示。这些文本嵌入随后被用作锚点,将视觉编码器的注意力从高层语义转移到微妙的低层伪影上。大量实验表明,IPAD-CLIP为检测提供了一种资源高效的CLIP适配,显著优于我们基准上的先进图像异常检测和操控检测方法。据我们所知,这是首个在数据集和模型层面上解决多类局部感知伪影检测的研究。
cs.CV / 70 / 2605.08695
EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics
EditSleuth:用于图像编辑取证的基础推理链数据集
Abstract
Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.
Chinese Translation
对人工智能编辑图像的取证分析不仅需要二元的真实与虚假预测:一个有用的系统应能够定位编辑内容,识别其语义类型,并将其决策基于视觉证据。现有的图像取证数据集通常强调检测或定位,而基于推理的视觉-语言数据集则很少针对图像操控,并且往往依赖于大型语言模型(LLM)生成的推理,这些推理的真实性难以验证。我们引入了EditSleuth,这是一个由257,725个图像编辑三元组构成的数据集,旨在为基础图像编辑取证推理提供支持。每个实例包括一幅编辑后的图像、其源图像、一种二元编辑掩码、一个12类编辑分类标签、一个难度评分,以及一个六步推理链。EditSleuth链是从三元组基础的上游文献中确定性生成的,每个陈述都与特定的可计算证据来源相关联。我们的分析表明,简单的四成分难度公式在幅度特征之间存在等级-2相关崩溃;而简化的三成分公式在Pico-Banana和MagicBrush上显著增加了评分离散性。大多数编辑类别中的难度也存在显著差异,表明评分并不是编辑类型的代理。作为初步学习研究,我们使用LoRA对Qwen2-VL-2B进行了微调,发现链作为目标的监督在可解析答案的分类准确性上与仅标签基线相匹配,同时还产生了仅标签监督无法生成的基础解释性文本。我们发布了数据集、确定性构建流程和初步训练脚本。
cs.CV / 71 / 2605.08698
Supersampling Stable Diffusion and More: An Approach for Interpolating Neural Networks Using Common Interpolation Methods
超采样稳定扩散及更多:使用常见插值方法对神经网络进行插值的一个方法
Abstract
Stable Diffusion (SD) significantly advanced DDPM (Denoising Diffusion Probabilistic Model) based image generation by denoising in latent space instead of feature space. This popularized DDPM-based image generation, as the cost and compute barrier was significantly lowered. However, these models can only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images consistently show object duplication artifacts. To solve this problem without fine-tuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved considerable success. But dilated kernels are harder to fine-tune because they are zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient, and we achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion without any training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. This shows that our method is generally applicable: we interpolate fully-connected layers, going beyond convolution layers. We also discuss how our method can reduce the memory footprint of training neural networks by at least $4\times$.
Chinese Translation
稳定扩散(Stable Diffusion, SD)通过在潜在空间而非特征空间中去噪,显著推动了基于去噪扩散概率模型(Denoising Diffusion Probabilistic Model, DDPM)的图像生成。这降低了成本和计算门槛,使得基于DDPM的图像生成得以普及。然而,这些模型只能根据其训练配置生成固定分辨率的图像。当我们尝试生成更高分辨率的图像时,结果图像中会持续出现物体重复的伪影。为了解决这个问题而不对SD模型进行微调,最近的研究尝试扩张模型的卷积核,并取得了很大的成功。但由于扩张卷积核存在零间隙,因此更难以微调。除此之外,其他方法,如补丁扩散(patched diffusion),也无法有效解决物体重复的问题。因此,为了克服扩张卷积的局限性,我们提出了对SD模型进行卷积核插值以生成更高分辨率的图像。在本研究中,我们从数学上证明了,如果乘以一个常数系数,插值可以正确地缩放卷积核,并在使用零训练生成超出训练分辨率的图像时取得了具有竞争力的经验结果。此外,我们展示了我们的方法使得深度神经网络能够适应更高维度的训练数据,且在准确率和F1分数相较于基线的最坏情况性能下降仅为$2.6\%$。这表明我们的方法具有广泛的适用性,我们对全连接层进行了插值,超越了卷积层。我们还讨论了如何利用我们的方法将训练神经网络的内存占用减少至至少$4\times$。
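A one-dimensional sketch of kernel interpolation with a constant rescaling coefficient is given below. The abstract does not state the coefficient, so the choice here (old size divided by new size, which keeps the response of a constant kernel to a constant input unchanged) is an assumption, as are the linear interpolation scheme with aligned endpoints and the function name.

```python
def interpolate_kernel(kernel, new_size):
    """Linearly resample a 1-D kernel to new_size taps, then rescale it.

    The rescaling constant old_size / new_size is an assumed choice; the
    paper's derivation may arrive at a different coefficient.
    """
    old_size = len(kernel)
    if old_size < 2 or new_size < 2:
        raise ValueError("both kernel sizes must be at least 2")
    resampled = []
    for j in range(new_size):
        # Map the new tap index onto the old kernel's coordinate range.
        pos = j * (old_size - 1) / (new_size - 1)
        lo = min(int(pos), old_size - 2)
        frac = pos - lo
        resampled.append((1.0 - frac) * kernel[lo] + frac * kernel[lo + 1])
    coeff = old_size / new_size
    return [coeff * v for v in resampled]

# A 3-tap kernel stretched to 5 taps.
print(interpolate_kernel([1.0, 2.0, 1.0], 5))
```

Unlike a dilated kernel, the stretched kernel has no zero-valued gaps between taps, which is the property the abstract points to when discussing fine-tuning.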
cs.CV / 72 / 2605.08702
Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models
门控与合并:视觉语言模型的零样本组合个性化
Abstract
This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept's identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
Chinese Translation
本文探讨了视觉语言模型(VLMs)的组合个性化问题。在这一问题中,多个用户定义的概念必须在测试时共同被识别或描述。我们提出了门控与合并(Gate-and-Merge),一种零样本框架,使得在不需要共现训练的情况下实现组合个性化。在个性化过程中,每个概念作为轻量级的 LoRA 适配器独立学习,并与概念标记配对。基础模型保持不变,概念保持解耦。在推理时,我们通过在权重空间中直接合并特定于概念的 LoRA 更新来实现组合。为了抑制无关激活并防止干扰,采用了门控机制来估计文本和视觉线索,仅选择对预测有贡献的模块。我们进一步通过仅结合最有意义且相互一致的更新来稳定组合,帮助保持每个概念的身份。我们的定量和定性分析显示,在单一概念和组合设置下,多个个性化任务的性能均有一致提升。
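Merging gated LoRA updates in weight space amounts to adding the selected low-rank products $B_i A_i$ to the frozen base weight. The sketch below assumes a simple threshold gate over per-concept relevance scores; how those scores are estimated from textual and visual cues, and the paper's additional consistency filtering, are not reproduced. Concept names, scores, and matrices are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(base, adapters, scores, threshold=0.5):
    """Return base + sum of B_i @ A_i over adapters whose gate score passes.

    base: d_out x d_in weight matrix (left untouched; a copy is returned)
    adapters: {concept: (B, A)} with B of shape d_out x r and A of shape r x d_in
    scores: {concept: relevance in [0, 1]} from a gating step (assumed given)
    """
    merged = [row[:] for row in base]
    for name, (b_mat, a_mat) in adapters.items():
        if scores.get(name, 0.0) < threshold:
            continue  # gate closed: this concept is irrelevant to the query
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += value
    return merged

base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "dog": ([[1.0], [0.0]], [[0.0, 2.0]]),  # rank-1 update touching entry (0, 1)
    "mug": ([[0.0], [1.0]], [[3.0, 0.0]]),  # rank-1 update touching entry (1, 0)
}
scores = {"dog": 0.9, "mug": 0.1}
print(merge_lora(base, adapters, scores))  # [[1.0, 2.0], [0.0, 1.0]]
```

Because each adapter is trained independently and the base matrix is never modified in place, concepts can be combined per query without any co-occurrence training.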
cs.CV / 73 / 2605.08709
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
UniShield:通过知识图谱驱动的多模态推理实现统一的人脸攻击检测
Abstract
Unified face attack detection (UAD) requires recognizing physical spoofing and digital forgery within a shared decision space, yet existing discriminative or prompt-based methods largely rely on appearance correlations and provide limited evidence-grounded reasoning. We propose UniShield, a knowledge-grounded multimodal reasoning framework for unified face attack defense. UniShield constructs a Face Attack Knowledge Graph (FAKG) that links attack categories to diagnostic visual cues and attack-conditioned relations, and uses it to synthesize 52,025 FAKG-QA examples for Attack-Graph Instruction Tuning (AGIT). To improve rationale consistency, we further introduce Graph-Consistent Reasoning Optimization (GCRO), a GRPO-based objective with a KG-consistency reward that encourages generated rationales to match graph-supported cues while penalizing incompatible claims. Experiments on our multimodal UAD benchmark show that UniShield achieves strong performance across binary, coarse-grained, and fine-grained protocols, with consistently high ACC and low HTER. These results suggest that structured attack knowledge can improve both detection accuracy and reasoning reliability over discriminative baselines and general-purpose MLLMs. Our code will be released at https://anonymous.4open.science/r/Unishield-A6A3/.
Chinese Translation
统一的人脸攻击检测(UAD)需要在共享决策空间内识别物理欺骗和数字伪造,但现有的区分性或基于提示的方法主要依赖于外观相关性,并提供有限的证据基础推理。我们提出了UniShield,一个基于知识的多模态推理框架,用于统一的人脸攻击防御。UniShield构建了一个人脸攻击知识图谱(FAKG),将攻击类别与诊断视觉线索和攻击条件关系联系起来,并利用它合成了52,025个FAKG-QA示例用于攻击图指令调优(AGIT)。为了提高推理的一致性,我们进一步引入了图一致性推理优化(GCRO),这是一种基于GRPO的目标,具有KG一致性奖励,鼓励生成的推理与图支持的线索相匹配,同时惩罚不兼容的主张。在我们的多模态UAD基准测试中的实验表明,UniShield在二元、粗粒度和细粒度协议中均表现出强劲的性能,具有持续较高的准确率(ACC)和较低的半总错误率(HTER)。这些结果表明,结构化的攻击知识可以提高检测准确性和推理可靠性,优于区分性基线和通用的多模态大语言模型(MLLMs)。我们的代码将发布在 https://anonymous.4open.science/r/Unishield-A6A3/。
cs.CV / 74 / 2605.08712
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
从关节运动学到基于路线的视觉控制:用于动作条件下的外科视频生成
Abstract
Action-conditioned surgical video generation is a critical yet highly challenging problem for robotic surgery. The core difficulty is that low-dimensional control vectors must precisely govern complex image-space evolution. In this work, we propose a kinematic-to-visual lifting paradigm that converts articulated kinematics into a unified set of five image-aligned control modalities. Building on this representation, we introduce a hierarchically routed visual control framework that selectively activates the most relevant control modalities and motion scales. Instead of uniformly applying all control signals, our model performs hierarchical routing to dynamically allocate conditioning capacity. We further design kinematic-prior-guided routing loss functions to ensure physically meaningful, temporally stable, and efficient expert utilization. To improve efficiency, we propose a budgeted training and inference scheme that leverages routing-induced sparsity. By selectively discarding low-significance control pathways during training and execution, our approach enables adaptive computation that is complementary to standard distillation. We additionally construct a new benchmark with curated articulated annotations, obtained through human-in-the-loop semantic labeling and differentiable pose tracking, providing realistic supervision for action-conditioned surgical video generation. Extensive experiments demonstrate that our method consistently improves action faithfulness, visual fidelity, and cross-domain generalization over diverse baselines. Moreover, our efficient variant achieves substantial reductions in latency while maintaining strong control accuracy.
Chinese Translation
动作条件下的外科视频生成是机器人手术中的一个关键但极具挑战性的问题。核心难点在于低维控制向量必须精确地控制复杂的图像空间演变。在本研究中,我们提出了一种运动学到视觉的提升范式,将关节运动学转换为一组统一的五种图像对齐控制方式。在此基础上,我们引入了一种分层路由的视觉控制框架,选择性地激活最相关的控制方式和运动尺度。我们的模型并不是均匀地应用所有控制信号,而是通过分层路由动态分配条件能力。我们进一步设计了运动学先验引导的路由损失函数,以确保物理上有意义、时间上稳定且高效的专家利用。为了提高效率,我们提出了一种预算训练和推理方案,利用路由引起的稀疏性。在训练和执行过程中选择性地丢弃低重要性的控制路径,使我们的方法能够实现与标准蒸馏互补的自适应计算。我们还构建了一个新的基准,配备了经过人工参与的语义标注和可微分姿态跟踪获得的关节注释,为动作条件下的外科视频生成提供了真实的监督。大量实验表明,我们的方法在动作真实性、视觉保真度和跨域泛化方面始终优于多种基线。此外,我们的高效变体在保持强控制精度的同时,实现了显著的延迟减少。
cs.CV / 75 / 2605.08723
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
EAR:增强单模态表示以进行弱监督音视频解析
Abstract
Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Given this challenging setting, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, accurate video parsing fundamentally relies on precise perception of uni-modal events. These multi-modal strategies overemphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also apply a soft constraint to refine the modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.
Chinese Translation
弱监督音视频解析(Audio-Visual Video Parsing, AVVP)旨在仅使用粗粒度标签识别和时间定位视频中的音频、视觉和音视频事件。在面临挑战性任务设置的情况下,现有研究主要沿两个方向发展:为细粒度跨模态语义指导预训练伪标签生成器,或优化AVVP模型架构以增强音视频融合。然而,由于音频和视觉信号通常是未对齐的,准确的视频解析在根本上依赖于对单模态事件的精确感知。然而,这些多模态聚焦的策略过度强调多模态融合,而未能充分指导和保留单模态语义,导致伪标签噪声和次优的视频解析性能。本文提出了一种新颖的框架,增强伪标签生成器和AVVP模型的单模态表示。具体而言,我们引入了一种基于相似性的标签迁移方法来标注预训练数据,从而使伪标签生成器更好地理解单模态事件。我们还采用软约束的方式并行优化单模态特征建模与多模态融合。这些设计使得对单模态和跨模态表示的协调关注得以实现,从而提升事件的定位性能。大量实验表明,我们的方法在伪标签和AVVP性能上均优于最先进的方法。
cs.CV / 76 / 2605.08724
SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment
SynerMedGen:通过任务对齐实现医学多模态理解与生成的协同
Abstract
Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at https://github.com/Mhilab/SynerMedGen.
Chinese Translation
统一多模态理解与生成是医学领域开始出现的一个引人注目的前沿。然而,现有的统一医学模型通常将理解和生成视为不相干的目标,缺乏有意义的功能协同。在本研究中,我们识别并解决了统一医学建模中的一个关键问题:什么形式的理解真正有利于生成。我们提出了SynerMedGen,这是一个基于生成对齐理解原则的统一框架,通过任务对齐将理解目标与生成任务协同起来。SynerMedGen引入了三个生成对齐理解任务和一种两阶段训练策略,将在理解训练中学习到的有利于生成的表示转移到医学图像合成中。值得注意的是,即使仅通过理解训练,我们的SynerMedGen在22个医学图像合成任务中也实现了强大的零-shot性能,并且在未见数据集上表现出强大的泛化能力。当与生成训练结合时,SynerMedGen始终优于最先进的专业医学图像合成模型以及最近的统一医学模型。我们还发布了一个名为SynerMed的大规模数据集,包含100万对合成样本和200万生成衍生的理解实例,以支持对理解-生成协同的进一步研究。我们的项目可以在https://github.com/Mhilab/SynerMedGen访问。
cs.CV / 77 / 2605.08727
Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression
控制你的视角:在学习图像压缩中进行高分辨率全局语义操控
Abstract
Learned image compression (LIC) integrates deep neural networks (DNNs) to map high-dimensional images into compact latent representations, reducing redundancy and achieving superior rate-distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low-resolution ($3\times28\times28$) global semantic manipulation (GSM). However, high-resolution GSM remains unexplored due to its intractability. Notably, the existing projected gradient descent (PGD) method achieves near-perfect white-box attacks for classification, segmentation, and other tasks, yet fails to generalize to high-resolution GSM. Our theoretical and empirical analyses reveal that well-performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying-Oscillating-Refining stages. General $\ell_{\infty}$-bounded attacks fail on high-resolution GSM because their step-size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD$^{2}$-GSM. Extensive experiments on the Kodak dataset $(3\times768\times512)$ demonstrate that our PGD$^{2}$-GSM is the first to stably achieve high-resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2-GSM.
Chinese Translation
学习图像压缩(LIC)结合深度神经网络(DNN)将高维图像映射为紧凑的潜在表示,减少冗余并在良性环境中实现优越的速率失真(RD)性能。不幸的是,由于DNN固有的脆弱性,LIC系统容易受到对抗性扰动的影响,这导致下游性能下降、压缩率降低、非针对性失真,以及局部语义操控(LSM)和低分辨率($3\times28\times28$)全局语义操控(GSM)。然而,由于其不可处理性,高分辨率GSM尚未被探索。值得注意的是,现有的投影梯度下降(PGD)方法在分类、分割和其他任务中实现了近乎完美的白盒攻击,但未能推广到高分辨率GSM。我们的理论和实证分析表明,表现良好的GSM通过懒惰-振荡-精炼阶段将对抗样本从身份区域驱动到放大区域。一般的$\ell_{\infty}$-有界攻击在高分辨率GSM上失败,因为它们的步长调度无法同时适应振荡和精炼阶段。基于此,我们提出了周期几何衰减调度,能够实现$\ell_{\infty}$-有界的高分辨率GSM。为了验证我们的方法,我们将其与PGD结合,得到了一个最小变体PGD$^{2}$-GSM。在Kodak $(3\times768\times512)$上的大量实验表明,我们的PGD$^{2}$-GSM首次稳定地实现了高分辨率GSM,从而暴露了LIC系统的一种新威胁。代码可在 https://github.com/chinaliangjiaming/PGD2-GSM 获取。
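The abstract names a Periodic Geometric Decay step-size schedule but does not define it. One plausible reading, used in the sketch below purely as an assumption, is that the step size decays geometrically within each period and resets at the start of the next, so large steps (for the oscillating stage) and small steps (for the refining stage) recur. The toy objective, the parameter values, and the function names are likewise illustrative; pixel-range clipping is omitted.

```python
def periodic_geometric_decay(step, base_lr, decay, period):
    """Assumed schedule: base_lr * decay**(step mod period), resetting each period."""
    return base_lr * decay ** (step % period)

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(x, grad_fn, eps, steps, base_lr, decay, period):
    """L-infinity-bounded PGD that minimizes the attacker's loss."""
    delta = [0.0] * len(x)
    for k in range(steps):
        lr = periodic_geometric_decay(k, base_lr, decay, period)
        grad = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Signed gradient step, then projection back onto the eps-ball.
        delta = [max(-eps, min(eps, di - lr * sign(gi))) for di, gi in zip(delta, grad)]
    return delta

# Toy targeted objective ||x + delta - target||^2 with a hypothetical target.
target = [1.0, -0.05]
grad_fn = lambda p: [2.0 * (pi - ti) for pi, ti in zip(p, target)]
delta = pgd_linf([0.0, 0.0], grad_fn, eps=0.1, steps=12, base_lr=0.05, decay=0.5, period=4)
print(delta)
```

In an attack on a learned codec, `grad_fn` would be the gradient of the distance between the reconstruction of the perturbed image and the attacker's chosen target image.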
cs.CV / 78 / 2605.08729
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
统一:为以人为中心的音视频生成协调运动、语言和声音
Abstract
Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
Chinese Translation
运动、语言和音效是以人为中心的视频的基本元素,但它们异质的时间特性使得联合生成极具挑战性。现有的音视频生成模型往往无法在这些模态之间保持一致的对齐,导致运动、语言和环境声音之间出现明显的不匹配。我们提出了Unison,这是一个统一框架,明确促进运动、语言和声音模态之间的一致性。在音频流中,Unison采用了一种语义引导的协调策略,解耦了语言和音效组件的生成。通过利用双向音频交叉注意力和语义条件门控进行语义驱动的自适应重组,该方法有效减轻了语言的主导性,并增强了声学清晰度。为了实现音频与运动的同步,我们提出了一种双向跨模态强制策略,其中更清晰的模态通过解耦的去噪计划引导噪声较大的模态,并通过渐进稳定策略进行强化。大量实验表明,Unison在音频感知质量和跨模态同步方面均达到了最先进的性能,突显了在以人为中心的视频生成中明确的多模态协调的重要性。
cs.CV / 79 / 2605.08735
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR:基于视觉-语言和视频生成模型的协作视频推理
Abstract
Recent "Thinking with Video" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM's short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier's diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@$k$, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: https://joow0n-kim.github.io/collabvr-project-page.
Chinese Translation
近期的“视频思维”方法使用视频生成模型(Video Generation Models, VGMs)通过生成时间一致的帧链作为推理产物来进行视觉推理。然而,即使是强大的VGM在目标导向任务中也表现出两种反复出现的失败模式:在多步骤任务中的长时间漂移和在片段中累积的模拟错误。这两者都源于缺乏基于VGM短期视觉先验的明确推理,而这一角色自然由视觉-语言模型(Vision-Language Models, VLMs)填补,但如何放置VLM并非易事:预先的计划在任何帧生成之前就已承诺,而对整个视频的事后批评则干预得太晚。我们提出了VLM-VGM协作视频推理(CollabVR),这是一个闭环框架,将VLM与VGM在步骤级别上进行耦合:VLM规划下一步的直接行动,检查VGM生成的片段,并将验证者的诊断直接融入下一步的行动提示中,以修复检测到的失败。在Gen-ViRe和VBVR-Bench上,CollabVR在匹配计算条件下,提升了开源和闭源VGM在单次推理、Pass@$k$和先前测试时间缩放基准上的表现,尤其在最困难的任务上取得了最大的提升。它还在经过推理微调的VGM基础上进一步提升,表明步骤级VLM监督与面向推理的微调是正交且可叠加的。我们在项目页面提供了视频样本和其他定性结果: https://joow0n-kim.github.io/collabvr-project-page。
cs.CV / 80 / 2605.08739
ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting
ReorgGS:用于3D高斯溅射的等效分布重组
Abstract
A converged 3D Gaussian Splatting (3DGS) model may approximate the target scene while remaining poorly parameterized for further optimization. We identify this failure mode as \emph{parameterization degeneration}: high-opacity floaters attenuate gradients to true surfaces through alpha compositing, and redundant overlapping clusters create strongly coupled parameter blocks with nearly collinear Jacobian responses. These effects explain why continued optimization can plateau even when the model still contains removable artifacts. We propose ReorgGS, an equivalent distribution reorganization method for converged 3DGS models. ReorgGS treats the existing Gaussian set as an empirical probability field, resamples centers from it, estimates local anisotropic covariances with kNN, initializes low opacity, and continues optimization with the original 3DGS renderer and loss. Unlike opacity reset, which only rescales opacity on the old overlap graph, ReorgGS rebuilds centers, covariances, and visibility structure, thereby changing the graph itself. Our analysis shows that distributional equivalence is not optimization equivalence. The reorganized model preserves scene support while improving gradient accessibility under alpha compositing and reducing opacity-weighted overlap, thereby weakening local parameter coupling during subsequent optimization. Under the same additional optimization budget, ReorgGS improves fitting quality at a fixed Gaussian count, suppresses persistent floaters, and reduces rendering overhead from redundant overlap.
Chinese Translation
收敛的3D高斯溅射(3DGS)模型可能在近似目标场景时,参数化效果较差,难以进行进一步优化。我们将这种失败模式称为\emph{参数化退化}:高不透明度的浮动物通过α合成减弱了对真实表面的梯度,而冗余的重叠簇则创建了强耦合的参数块,导致雅可比响应几乎共线。这些效应解释了为什么即使模型仍包含可去除的伪影,持续优化也可能停滞不前。我们提出了ReorgGS,一种针对收敛3DGS模型的等效分布重组方法。ReorgGS将现有的高斯集视为经验概率场,从中重新采样中心,利用kNN估计局部各向异性协方差,初始化低不透明度,并继续使用原始3DGS渲染器和损失进行优化。与仅在旧重叠图上重新缩放不透明度的透明度重置不同,ReorgGS重建中心、协方差和可见性结构,从而改变了图本身。我们的分析表明,分布等价性并不等同于优化等价性。重组后的模型在保持场景支持的同时,改善了在α合成下的梯度可达性,并减少了不透明度加权的重叠,从而在后续优化过程中减弱了局部参数耦合。在相同的额外优化预算下,ReorgGS在固定的高斯数量下提高了拟合质量,抑制了持久浮动物,并减少了冗余重叠带来的渲染开销。
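The reorganization step described in the ReorgGS abstract (resample centers from the existing Gaussian set, estimate local covariances with kNN, initialize low opacity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout, the function names, the use of opacity as the mixture weight, the isotropic `sigma` per source Gaussian, and the values `k=8` and `init_opacity=0.05` are all assumptions made for the example.

```python
import random

def resample_centers(gaussians, count, rng):
    """Draw new centers from the mixture defined by the existing Gaussians."""
    # Assumption: opacity is used as the (unnormalized) mixture weight.
    weights = [g["opacity"] for g in gaussians]
    parents = rng.choices(gaussians, weights=weights, k=count)
    # Assumption: each source Gaussian is isotropic with std-dev "sigma".
    return [tuple(rng.gauss(m, g["sigma"]) for m in g["mean"]) for g in parents]

def knn_covariance(centers, index, k):
    """Sample covariance of the k nearest neighbours of centers[index]."""
    anchor = centers[index]
    others = [p for j, p in enumerate(centers) if j != index]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, anchor)))
    nearest = others[:k]
    dim = len(anchor)
    mean = [sum(p[d] for p in nearest) / len(nearest) for d in range(dim)]
    return [
        [sum((p[r] - mean[r]) * (p[c] - mean[c]) for p in nearest) / len(nearest)
         for c in range(dim)]
        for r in range(dim)
    ]

def reorganize(gaussians, count, k=8, init_opacity=0.05, seed=0):
    """Build a new Gaussian set: resampled centers, kNN covariances, low opacity."""
    rng = random.Random(seed)
    centers = resample_centers(gaussians, count, rng)
    return [
        {"mean": centers[i], "cov": knn_covariance(centers, i, k), "opacity": init_opacity}
        for i in range(count)
    ]

# Hypothetical converged model with three Gaussians.
converged = [
    {"mean": (0.0, 0.0, 0.0), "sigma": 0.2, "opacity": 0.9},
    {"mean": (1.0, 0.0, 0.0), "sigma": 0.1, "opacity": 0.6},
    {"mean": (0.5, 2.0, 0.0), "sigma": 0.3, "opacity": 0.95},
]
reorganized = reorganize(converged, count=200)
```

In the method as described, optimization would then continue on the reorganized set with the original 3DGS renderer and loss; that stage is outside the scope of this sketch.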
cs.CV / 81 / 2605.08753
Simultaneous Monitoring of Shape and Surface Color via 4D Point Clouds: A Registration-free Approach
通过4D点云实现形状和表面颜色的同时监测:一种无注册的方法
Abstract
Advanced manufacturing technologies allow for the production of intricate parts featuring high shape complexity and spatially-varying material composition. Data fusion of point clouds with chromatic attributes provides 4D point clouds, a compact and informative representation that encodes both shape and material information. In this paper, we present a registration-free framework for Simultaneous Monitoring of shApe and Color (SMAC) via 4D point clouds. The proposed framework leverages Laplace-Beltrami operator spectral properties to capture and monitor geometric features and the relationship between shape and surface color. A combined monitoring scheme is proposed to effectively detect shape deformations and color anomalies, along with a spatially-aware post-signal diagnostic procedure to determine the source of change and localize color anomalies. Importantly, neither component relies on registration or mesh reconstruction, eliminating error-prone and computationally expensive preprocessing steps. A Monte Carlo simulation study and a case study on functionally graded materials demonstrate that SMAC achieves effective detection performance, particularly for subtle defects, while providing diagnostic capabilities to identify the source and location of anomalies.
Chinese Translation
先进的制造技术使得能够生产具有复杂形状和空间变化材料组成的精细部件。将带有色彩属性的点云进行数据融合,生成4D点云,这是一种紧凑且信息丰富的表示方式,编码了形状和材料信息。本文提出了一种通过4D点云实现形状和颜色同时监测(Simultaneous Monitoring of shApe and Color, SMAC)的无注册框架。所提出的框架利用拉普拉斯-贝尔特拉米算子(Laplace-Beltrami operator)的谱特性来捕捉和监测几何特征以及形状与表面颜色之间的关系。我们提出了一种结合监测方案,以有效检测形状变形和颜色异常,并配备空间感知的后信号诊断程序,以确定变化源和定位颜色异常。重要的是,两个组件均不依赖于注册或网格重建,消除了易出错且计算开销大的预处理步骤。一项蒙特卡洛模拟研究和对功能梯度材料的案例研究表明,SMAC在检测性能方面表现出色,尤其是对于微小缺陷,同时提供了识别异常源和位置的诊断能力。
cs.CV / 82 / 2605.08781
Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours
基于频率监督的傅里叶轮廓的轮廓原生桥梁缺陷检测与紧凑数字归档
Abstract
AI-assisted bridge defect inspection often produces bounding boxes with crude geometry or raster masks that are costly to store, transmit, and reuse. This study investigates how detected defects can be represented as compact, recoverable contour-level vector records in image space. We propose Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. On 3,767 UAV-collected bridge images with 42,346 defect instances, FS-FSD achieves higher polygon-space accuracy and better matched-TP geometric quality than representative detection, segmentation, and contour baselines. These results show that, compared with bounding boxes and raster masks, Fourier contour records preserve defect-boundary geometry in a more compact, recoverable, and shareable form for engineering review and downstream information workflows. Future work will study the modeling of multi-region, fragmented, and adjacent bridge-defect boundaries and extend the framework toward long-term bridge-defect tracking and lifecycle-oriented management.
Chinese Translation
人工智能辅助的桥梁缺陷检测通常生成几何形状粗糙的边界框或存储、传输和重用成本高昂的光栅掩膜。本研究探讨如何将检测到的缺陷表示为图像空间中紧凑、可恢复的轮廓级矢量记录。我们提出了频率监督的傅里叶级数检测(Frequency-Supervised Fourier Series Detection, FS-FSD),该方法直接回归傅里叶轮廓描述符,并在统一的多边形空间协议下评估边界框、掩膜和轮廓。在3,767张无人机收集的桥梁图像中,包含42,346个缺陷实例,FS-FSD在多边形空间精度和匹配TP几何质量方面均优于代表性的检测、分割和轮廓基准。这些结果表明,与边界框和光栅掩膜相比,傅里叶轮廓记录以更紧凑、可恢复和可共享的形式保留了缺陷边界几何,便于工程审查和后续信息工作流。未来的工作将研究多区域、碎片化和相邻桥梁缺陷边界的建模,并将该框架扩展到长期桥梁缺陷跟踪和生命周期导向管理。
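To make the idea of a compact, recoverable Fourier contour record concrete, the sketch below encodes a closed contour as a handful of complex Fourier coefficients and reconstructs it. It only illustrates the representation, under the assumption of a standard complex Fourier series over uniformly sampled boundary points; it does not reproduce how FS-FSD regresses or supervises the descriptors, and the function names are hypothetical.

```python
import cmath
import math

def fourier_descriptors(points, order):
    """Complex Fourier coefficients c_k, k = -order..order, of a closed contour."""
    n = len(points)
    z = [complex(x, y) for x, y in points]
    return {
        k: sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(-order, order + 1)
    }

def reconstruct(coeffs, num_points):
    """Recover contour points from the stored coefficients."""
    pts = []
    for t in range(num_points):
        s = t / num_points
        z = sum(c * cmath.exp(2j * math.pi * k * s) for k, c in coeffs.items())
        pts.append((z.real, z.imag))
    return pts

# Toy contour: an ellipse centred at (3, 1) with semi-axes 2 and 1.
ellipse = [
    (3 + 2 * math.cos(2 * math.pi * t / 64), 1 + math.sin(2 * math.pi * t / 64))
    for t in range(64)
]
coeffs = fourier_descriptors(ellipse, order=1)   # only 3 complex numbers are stored
recovered = reconstruct(coeffs, 64)
max_error = max(math.dist(p, q) for p, q in zip(ellipse, recovered))
```

An ellipse is captured exactly by the orders -1, 0, and 1, so the 64 boundary points collapse to three complex numbers. Irregular defect boundaries would need higher orders, trading record size against boundary fidelity.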
cs.CV / 83 / 2605.08784
SimplePoster: A Simple Baseline for Product Poster Generation
SimplePoster:产品海报生成的简单基线
Abstract
Product poster generation poses distinct challenges beyond general poster design, requiring both faithful preservation of product appearance and precise control over dense, multi-line text layouts. Prior methods typically adopt inpainting frameworks augmented with auxiliary modules such as ControlNet and OCR encoders. However, these approaches introduce architectural complexity and computational overhead while still suffering from text errors and subject extension artifacts. We present SimplePoster, a simple yet effective inpainting-based framework that achieves faithful subject preservation and accurate, position-controllable text rendering without external controllers. Our approach builds on two observations: (1) full-parameter fine-tuning of the base model effectively suppresses subject extension, outperforming ControlNet-based alternatives; and (2) a zero-cost character-level position encoding enables geometry-aware text generation without dedicated layout modules. Experiments show that SimplePoster achieves a $98.7\%$ subject preservation rate, compared to $55.2\%$ for SeedEdit 3.0 and $85.3\%$ for PosterMaker, while also improving text rendering accuracy. Code, models, the benchmark, and part of the training data will be available at https://github.com/Alibaba-YuFeng/SIMPLEPOSTER
Chinese Translation
产品海报生成面临着超越一般海报设计的独特挑战,既需要忠实保留产品外观,又要求对密集的多行文本布局进行精确控制。以往的方法通常采用增强辅助模块(如 ControlNet 和 OCR 编码器)的修复框架。然而,这些方法引入了架构复杂性和计算开销,同时仍然存在文本错误和主题扩展伪影的问题。我们提出了 SimplePoster,这是一种简单而有效的基于修复的框架,能够在没有外部控制器的情况下实现忠实的主题保留和准确、可控位置的文本渲染。我们的方法基于两个观察结果:(1)对基础模型进行全参数微调能够有效抑制主题扩展,优于基于 ControlNet 的替代方案;(2)零成本的字符级位置编码使得几何感知文本生成成为可能,而无需专门的布局模块。实验表明,SimplePoster 实现了 $98.7\%$ 的主题保留率,相比之下,SeedEdit 3.0 为 $55.2\%$,PosterMaker 为 $85.3\%$,同时提高了文本渲染的准确性。代码、模型、基准测试和部分训练数据将可在 https://github.com/Alibaba-YuFeng/SIMPLEPOSTER 获取。
cs.CV / 84 / 2605.08787
Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
迷失于体积:用于评估3D医学视觉语言模型语义空间理解的CT-SpatialVQA基准
Abstract
Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.
Chinese Translation
最近在3D医学视觉语言模型方面的进展使得对体积图像和文本的联合推理成为可能,在医学视觉问答(VQA)和报告生成中表现出色。尽管取得了这些进展,但仍不清楚这些模型是否从3D体积中学习了空间基础的解剖结构,还是主要依赖于学习到的先验知识和语言关联。这种不确定性源于缺乏对体积医学视觉语言模型(VLM)中语义空间推理的系统评估,以支持临床可靠的决策。为了解决这一问题,我们引入了CT-SpatialVQA,一个旨在评估3D CT数据中语义空间推理的基准。该基准包含9077对直接来自1601份放射学报告和CT体积的临床基础问答(QA)对,这些数据通过一个强大的LLM辅助管道进行验证,达到了95%的人工共识一致率。我们的数据集要求明确的解剖定位、侧别意识、结构比较以及3D结构间关系推理。我们还引入了标准化评估协议,并对八个3D医学视觉语言模型进行了基准测试,发现语义空间推理任务的表现严重下降,平均准确率为34%,且通常低于随机水平,这突显了在临床应用中需要更深入地整合体积证据的必要性。
cs.CV / 85 / 2605.08800
PPU-Bench: Real-World Benchmark for Personalized Partial Unlearning in Vision Language Models
PPU-Bench:面向视觉语言模型的个性化部分遗忘的真实世界基准
Abstract
Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget-retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.
Chinese Translation
多模态大型语言模型(MLLMs)在预训练过程中可能会记忆敏感的跨模态信息。然而,现有的MLLM遗忘基准依赖于合成知识注入或完整的主体级删除,这无法捕捉到需要细粒度事实控制的现实个性化删除请求。本文介绍了PPU-Bench,这是一个针对MLLMs的个性化部分遗忘的真实世界且无需微调的基准。PPU-Bench包含来自500位公众人物的24K个多模态和单模态样本,分为三个逐步挑战的设置:完全遗忘、选择性遗忘和个性化遗忘。该基准评估方法是否能够在保留非目标事实、模型效用和跨模态一致性的同时,移除目标知识。大量实验表明,完全遗忘往往抑制视觉身份而非事实知识,而选择性和个性化遗忘则暴露出显著的遗忘-保留权衡及在主体内部事实边界上的挑战。在跨图像和基于提示的攻击下的鲁棒性分析揭示了不同遗忘设置下的明显脆弱性。基于这些发现,我们提出了边界感知优化(Boundary-Aware Optimization, BAO),该方法明确建模主体内部的遗忘-保留边界。在两个代表性方法上的实验结果表明,BAO能够有效地强制执行主体内部的事实边界。
cs.CV / 86 / 2605.08802
CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization
CoLVR:通过对比优化增强探索性潜在视觉推理
Abstract
Because Latent Visual Reasoning has potential for exploratory reasoning, recent works enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives that force latent representations to match predefined visual features, thereby severely limiting the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thereby fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.
Chinese Translation
由于潜在视觉推理的探索性推理潜力,近期的研究倾向于使多模态大型语言模型(MLLMs)通过传播连续的隐藏状态来执行视觉推理,而不是将中间步骤解码为离散的标记。然而,现有的研究通常依赖于硬对齐目标,强制潜在表示与预定义的视觉特征匹配,从而严重限制了潜在推理过程的探索性。为了解决这个问题,我们提出了CoLVR(潜在视觉推理的对比优化)。为了获得更具探索性的视觉推理,CoLVR引入了一种潜在对比训练框架。首先,CoLVR通过基于角度的扰动指导的潜在对比目标学习多样化和探索性的表示,这扩展了语义潜在空间,并避免了过度约束的嵌入。然后,CoLVR在强化学习(RL)后训练中采用潜在轨迹对比奖励,以实现潜在视觉推理过程的细粒度优化,从而促进多样化的推理行为。实验表明,CoLVR显著增强了潜在表示的探索能力,在VSP上平均提高了5.83%,在Jigsaw上提高了8.00%,同时在领域外基准上也优于现有的潜在模型,在MMStar上获得了3.40%的提升。数据、代码和模型已发布在https://github.com/Oscar-dzy/CoLVR。
cs.CV / 87 / 2605.08805
LightAVSeg: Lightweight Audio-Visual Segmentation
LightAVSeg:轻量级音视频分割
Abstract
Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state of the art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.
Chinese Translation
音视频分割(Audio-Visual Segmentation, AVS)旨在对视频中发声物体进行像素级定位。然而,现有模型依赖于密集的跨模态注意力,计算成本呈二次增长,这限制了它们在资源高效部署中的适用性。大多数以效率为导向的方法专注于主干网络的缩减,而忽视了交互模块作为主要瓶颈。本文提出了LightAVSeg,一个轻量级框架,通过将重型注意力替换为解耦设计,实现语义过滤和空间定位,从而使交互成本与空间分辨率呈线性增长。此外,我们引入了一种辅助对齐损失,以在训练过程中强制语义一致性,而不增加推理开销。大量实验表明,LightAVSeg在轻量级方法中达到了新的最先进水平:其参数量为2050万(约为AVSegFormer的1/7),在MS3基准上达到了50.4的mIoU,并在移动处理器上实现了高效推理。
cs.CV / 88 / 2605.08806
L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation
L2A:学习积累姿态历史以实现准确的3D人类姿态估计
Abstract
Existing 2D-3D lifting methods for human pose estimation have achieved strong performance, but the use of historical pose representations across network depth has been overlooked. In current pipelines, information is propagated through fixed residual connections, which restricts effective reuse of early-layer features such as fine-grained spatial structures and short-term motion cues. However, naively incorporating historical features across layers is non-trivial. We further identify that maintaining a consistent representation space across layers is a prerequisite for effective cross-layer feature aggregation. To address this issue, we propose a history-aware framework that enables effective cross-layer reuse of historical features. Specifically, we adopt a spatial-temporal parallel Transformer backbone to prevent alternating spatial-temporal transformations during sequential processing, thereby maintaining a consistent representation space. Building upon this, we introduce a History Pose Accumulation (HPA) mechanism that adaptively aggregates features from all preceding layers to enhance current representations. Furthermore, we propose a Layer Pose History Aggregation (LPA) module that transforms layer pose features into a compact and structured form, reducing redundancy and enabling more stable aggregation. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on benchmarks.
Chinese Translation
现有的2D-3D提升人类姿态估计方法已取得了强劲的性能。然而,跨网络深度的历史姿态表示的利用被忽视。在当前的处理流程中,信息通过固定的残差连接进行传播,这限制了早期层特征(如细粒度空间结构和短期运动线索)的有效重用。然而,简单地在层之间引入历史特征并非易事。我们进一步确认,保持跨层的一致表示空间是有效的跨层特征聚合的前提。为了解决这个问题,我们提出了一种历史感知框架,使得网络能够有效利用跨层历史特征。具体而言,我们采用了一种时空并行的Transformer主干,以防止在序列处理过程中交替的时空变换,从而保持一致的表示空间。在此基础上,我们引入了一种历史姿态积累(History Pose Accumulation, HPA)机制,能够自适应地聚合来自所有前置层的特征,以增强当前的表示。此外,我们还提出了一种层姿态历史聚合(Layer Pose History Aggregation, LPA)模块,将层姿态特征转化为紧凑且结构化的形式,减少冗余并实现更稳定的聚合。大量实验表明,我们的方法在基准测试中达到了最先进的性能。
cs.CV / 89 / 2605.08808
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
曲率感知描述:利用测地线注意力进行三维场景理解
Abstract
Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel \textbf{\textsc{Curvature-Aware Captioning}} framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.
Chinese Translation
准确的三维场景描述对于机器人导航和增强现实至关重要,但当前的密集描述方法在处理稀疏点云数据时面临重大限制。现有的方法应用欧几里得嵌入空间,难以同时保留细粒度的局部几何细节和建模指数增长的全局语义层次,导致定位不准确或场景描述支离破碎且浅显。在本研究中,我们提出了一种新颖的\textbf{\textsc{曲率感知描述}}框架,整合了新型的非欧几里得测地线注意力机制,以解决定位与上下文化之间的冲突。具体而言,斜空间中的自注意力机制强制维度均匀性,同时建立长距离依赖关系。洛伦兹空间中的双向测地线交叉注意力建模场景实例之间的层次语义关系,使得在物体定位的精确性和场景描述的一致性之间实现同步。理论分析确认,斜流形与洛伦兹双曲面之间的曲率互补性解决了欧几里得-双曲冲突,通过各向同性优化确保特征稳定性,同时保留固有的层次关系。在ScanRefer和Nr3D基准上的大量实验表明,我们的方法在定位准确性和描述丰富性方面均实现了最先进的性能提升。
cs.CV / 90 / 2605.08814
Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference
通过全局-局部双分支对齐和层次推理实现零样本中文字符识别
Abstract
Chinese character categories are extremely large, and unseen characters frequently arise in open-world scenarios, making zero-shot Chinese character recognition an important yet challenging problem. Existing IDS-based retrieval methods usually encode a character image and its ideographic description sequence into a single global vector for matching. Although efficient, such holistic alignment often under-models local component differences. Moreover, directly introducing patch-token level fine-grained interaction suffers from both the noise of structural operators in IDS and the high cost of full-candidate retrieval. To address these issues, we propose a Global-Local Hierarchical Perception Network (GL-HPN), which jointly learns global and local representations of character images and IDS sequences within a unified cross-modal alignment framework. The global branch supports efficient coarse recall, while the local branch improves component-level discrimination through patch-token interaction. We further introduce a structure filtering mask to suppress structurally meaningful but visually non-entity IDS operators in local similarity aggregation. On top of this, we design a coarse-to-fine hierarchical inference strategy that performs global retrieval over the full candidate set and local reranking only on Top-$K$ candidates, followed by parameter-free multiplicative fusion of normalized posterior scores. Experimental results show that GL-HPN achieves competitive performance across multiple zero-shot splits, performs especially well under low-resource settings, and substantially reduces the inference cost of large-scale candidate retrieval.
Chinese Translation
中文字符类别极为庞大,且在开放世界场景中经常出现未见字符,使得零样本中文字符识别成为一个重要而具有挑战性的问题。现有的基于表意描述序列(IDS)的检索方法通常将字符图像及其表意描述序列编码为单一的全局向量进行匹配。尽管这种方法高效,但整体对齐往往无法充分建模局部组件之间的差异。此外,直接引入基于补丁-标记级别的细粒度交互会受到IDS中结构操作噪声的影响,并且全候选检索的成本较高。为了解决这些问题,我们提出了一种全局-局部层次感知网络(GL-HPN),该网络在统一的跨模态对齐框架内共同学习字符图像和IDS序列的全局和局部表示。全局分支支持高效的粗略检索,而局部分支通过补丁-标记交互提高组件级别的区分能力。我们进一步引入了一种结构过滤掩码,以抑制在局部相似性聚合中具有结构意义但视觉上非实体的IDS操作符。在此基础上,我们设计了一种粗到细的层次推理策略,该策略在全候选集上执行全局检索,并仅在前$K$个候选上进行局部重排序,随后进行归一化后验分数的无参数乘法融合。实验结果表明,GL-HPN在多个零样本拆分中表现出竞争力,尤其在低资源设置下表现优异,并显著降低了大规模候选检索的推理成本。
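The coarse-to-fine inference described in the GL-HPN abstract (global retrieval over all candidates, local reranking on the Top-$K$ only, then multiplicative fusion of normalized scores) can be sketched as below. This is an illustrative toy under stated assumptions, not the paper's code: the local similarity (mean over kept tokens of the best-matching patch), the softmax over the shortlist as the normalization, and all names and data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_similarity(patches, tokens, keep_mask):
    """Mean over kept tokens of the best patch match; masked tokens are skipped."""
    kept = [tok for tok, keep in zip(tokens, keep_mask) if keep]
    if not kept:
        return 0.0
    return sum(max(dot(p, tok) for p in patches) for tok in kept) / len(kept)

def hierarchical_inference(img_global, img_patches, candidates, top_k):
    # Stage 1: cheap global retrieval over the full candidate set.
    global_scores = [dot(img_global, c["global"]) for c in candidates]
    shortlist = sorted(range(len(candidates)),
                       key=lambda i: global_scores[i], reverse=True)[:top_k]
    # Stage 2: costly local scoring on the shortlist only.
    local_scores = [
        local_similarity(img_patches, candidates[i]["tokens"], candidates[i]["mask"])
        for i in shortlist
    ]
    # Parameter-free fusion: product of the two normalized score vectors.
    fused = [g * l for g, l in zip(softmax([global_scores[i] for i in shortlist]),
                                   softmax(local_scores))]
    best = max(range(len(shortlist)), key=lambda j: fused[j])
    return candidates[shortlist[best]]["label"], shortlist

# Toy data: "A" wins globally, "B" wins once components are compared.
candidates = [
    {"label": "A", "global": [0.9, 0.1],
     "tokens": [[1.0, 0.0], [-1.0, 0.0]], "mask": [True, True]},
    {"label": "B", "global": [0.8, 0.2],
     "tokens": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], "mask": [False, True, True]},
    {"label": "C", "global": [-1.0, 0.0],
     "tokens": [[0.0, -1.0]], "mask": [True]},
]
label, shortlist = hierarchical_inference(
    img_global=[1.0, 0.0],
    img_patches=[[1.0, 0.0], [0.0, 1.0]],
    candidates=candidates,
    top_k=2,
)
```

The first token of candidate "B" stands in for a structural operator: with the mask it is excluded from the local score, so it cannot dilute the component-level match. Candidate "C" never reaches the local stage, which is where the inference savings would come from at scale.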
cs.CV / 91 / 2605.08819
From pre-training to downstream performance: Does domain-specific pre-training make sense?
从预训练到下游性能:领域特定的预训练是否有意义?
Abstract
Deep learning techniques have revolutionised medical imaging, improving diagnostic accuracy and enabling both more accurate and earlier disease detection. However, the relationship between pre-training strategies and downstream performance in medical imaging models requires further exploration. Here, we systematically compare convolutional neural networks and transformers, examining various pre-training approaches, including supervised and self-supervised learning, as well as different initialisations and data modalities. Models are evaluated on natural images, chest X-rays, chest CT and retina OCT images, considering the effects of matching pre-training data with target modalities. Our findings indicate that only pre-training on data closely matching the target modality significantly improves downstream performance. While self-supervised learning can outperform supervised methods, its effectiveness varies with context. The study underscores the importance of pre-training strategies to enhance the reliability and effectiveness of deep learning models in medical imaging. By addressing these key factors, our research aims to contribute to the development of more accurate and dependable diagnostic tools, ultimately improving patient outcomes in clinical settings.
Chinese Translation
深度学习技术已彻底改变了医学影像学,提高了诊断准确性,并使得疾病的检测更加准确和早期。然而,预训练策略与医学影像模型下游性能之间的关系仍需进一步探讨。在此,我们系统地比较了卷积神经网络和变换器,考察了包括监督学习和自监督学习在内的各种预训练方法,以及不同的初始化和数据模态。模型在自然图像、胸部X光片、胸部CT和视网膜OCT图像上进行了评估,考虑了预训练数据与目标模态匹配的影响。我们的研究结果表明,只有在与目标模态紧密匹配的数据上进行预训练,才能显著提高下游性能。尽管自监督学习在某些情况下可以超越监督方法,但其有效性因上下文而异。本研究强调了预训练策略在提高医学影像深度学习模型的可靠性和有效性方面的重要性。通过解决这些关键因素,我们的研究旨在为开发更准确和可靠的诊断工具做出贡献,最终改善临床环境中的患者结果。
cs.CV / 92 / 2605.08820
FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence
FraudBench:用于检测AI生成的欺诈性退款证据的多模态基准
Abstract
Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
Chinese Translation
人工智能(AI)生成的图像变得越来越真实,并且可以轻松适应具体的现实世界索赔,这给验证视觉证据带来了新的挑战。一个具体的新兴风险是AI生成的退款欺诈,其中操纵或合成的图像被用来支持关于损坏产品、糟糕的交付条件或服务相关缺陷的索赔。现有的AI生成图像检测基准主要评估独立的真实性分类、跨生成器迁移或法医定位,而对索赔条件下的欺诈证据检测的研究仍然不足。为了填补这一空白,我们提出了FraudBench,这是一个用于检测AI生成的欺诈性退款证据的多模态基准。FraudBench的构建基于来自电子商务、食品配送和旅游服务场景的真实用户评论证据。我们整理了真实证据图像及其相关的评论和产品元数据,通过MLLM辅助过滤和人工标注识别真实的损坏和未损坏证据,并使用六种最先进的图像编辑和生成模型从真实的未损坏参考图像合成假损坏证据。利用FraudBench,我们在相同的设置下评估MLLM、专门的AI生成图像检测器和人类参与者。实验表明,当前的MLLM通常能够识别真实的损坏证据,但在许多假损坏子集上表现不佳,假损坏检测率(TPR)在大多数生成器子集上远低于50%的基线。专门的检测器通常表现更好,但在不同生成器之间仍然不一致,并且可能在真实损坏样本上产生假阳性,揭示了通用AI图像检测与可靠的索赔条件退款证据验证之间的明显差距。
cs.CV / 93 / 2605.08825
Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning
通过表示层级时间聚合和模型层级超图推理重新思考基于事件的物体检测
Abstract
Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7$\times$ faster), 1Mpx/Gen4 (+0.5 mAP and 1.6$\times$ faster), and eTraM (+3.0 mAP and 2.0$\times$ faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.
Chinese Translation
事件相机提供微秒级的时间分辨率、低延迟和高动态范围,为快速运动和复杂光照条件下的感知提供了潜力。然而,现有的基于事件的物体检测(EOD)方法在表示层面和模型层面都面临限制:先前的事件表示通常通过冗余结构间接编码时间信息,而检测模型则难以将碎片化的事件响应显式聚合为一致的高阶物体特征。为了解决这些限制,我们提出了事件双重时间-关系聚合检测器(Ev-DTAD),这是一个统一的EOD框架,结合了表示层级的时间编码和模型层级的时间-超图推理。具体而言,我们引入了层次时间聚合(HTA),这是一种紧凑的三通道伪RGB表示,显式嵌入了窗口内和窗口间事件的时间信息。为了进一步增强在稀疏和碎片化事件响应下的检测能力,我们提出了频率感知超图时间融合(FHTF),通过时间演化建模和高阶关系推理来精炼多尺度事件特征。在Gen1(+0.8 mAP和1.7×更快)、1Mpx/Gen4(+0.5 mAP和1.6×更快)和eTraM(+3.0 mAP和2.0×更快)上的广泛实验表明,Ev-DTAD在准确性和效率之间达成了竞争性的权衡,验证了紧凑时间表示与时间-超图特征推理之间的互补性。
cs.CV / 94 / 2605.08830
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-Drive:紧密耦合的视觉-语言与轨迹专家路由用于端到端自主驾驶
Abstract
End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.
Chinese Translation
端到端自主驾驶要求模型理解交通场景、推断驾驶意图并生成可执行的运动计划。近期的视觉-语言-动作(VLA)模型从大规模视觉-语言预训练中继承了语义先验,但仍面临耦合权衡:完全共享的骨干网络保留了多模态交互,但可能会纠缠语言推理和轨迹预测,而解耦的推理-动作管道虽然减少了任务冲突,但削弱了语义与运动的耦合。我们提出了VECTOR-DRIVE,一个基于Qwen2.5-VL-3B构建的紧密耦合VLA框架。VECTOR-DRIVE通过共享自注意力保持所有标记的耦合,并根据标记语义路由前馈计算。视觉和语言标记由视觉-语言专家处理,以保留语义先验,而目标点、自我状态和噪声动作标记则被路由到轨迹专家进行运动特定计算。在动作标记路径上,流匹配规划器将噪声动作标记细化为未来的航点和速度轮廓。该设计在单一多模态Transformer内耦合了语义推理和运动规划,同时分离了任务特定的前馈神经网络计算。在Bench2Drive上,VECTOR-DRIVE达到了88.91的驾驶评分,超越了代表性的端到端和基于VLA的基线。定性结果和消融实验进一步验证了共享注意力、语义感知专家路由、渐进训练和基于流的动作解码的优势。
cs.CV / 95 / 2605.08839
Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning
跨样本关系融合:统一领域泛化与类别增量学习
Abstract
Class-Incremental Learning (CIL) requires a learning system to learn new classes while retaining previously learned knowledge. However, in real-world scenarios such as autonomous driving, a system trained on urban roads in sunny weather may later need to operate in rural or highway environments with different traffic patterns and weather conditions. This requires the model not only to overcome catastrophic forgetting, but also to effectively handle domain shifts. In this paper, we propose CrOss-sample Relational Fusion (CORF), a unified framework to address domain shift and catastrophic forgetting simultaneously. To enhance generalizability, we perform selective refinement of training samples by leveraging spatial contribution maps to highlight semantically informative regions. Furthermore, we incorporate predictive confidence to adaptively weigh samples, thereby facilitating the learning of domain-agnostic representations. To alleviate forgetting, we propose a cascaded distillation framework that captures cross-sample relational dependencies across multiple feature hierarchies, enabling multi-grained knowledge transfer from previous tasks. CORF can be seamlessly integrated into existing CIL algorithms to enhance their generalizability, achieving competitive performance across various benchmark datasets. Code is available at https://github.com/LAMDA-CL/TMM26-CORF.
Chinese Translation
类别增量学习(Class-Incremental Learning, CIL)要求学习系统在保留先前学习知识的同时学习新类别。然而,在现实场景中,例如自动驾驶,系统可能需要在阳光明媚的城市道路上训练后,随后在交通模式和天气条件不同的乡村或高速公路环境中操作。这不仅要求模型克服灾难性遗忘,还需有效应对领域转移。本文提出了跨样本关系融合(CrOss-sample Relational Fusion, CORF),一个统一框架,旨在同时解决领域转移和灾难性遗忘问题。为了增强模型的泛化能力,我们通过利用空间贡献图对训练样本进行选择性细化,以突出语义信息丰富的区域。此外,我们结合预测置信度自适应地加权样本,从而促进领域无关表示的学习。为减轻遗忘,我们提出了一种级联蒸馏框架,该框架捕捉跨样本关系依赖,跨越多个特征层次,使得从先前任务中进行多粒度知识转移成为可能。CORF可以无缝集成到现有的CIL算法中,以增强其泛化能力,在各种基准数据集上实现竞争性能。代码可在 https://github.com/LAMDA-CL/TMM26-CORF 获取。
cs.CV / 96 / 2605.08841
Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models
幻觉感知视觉预处理与反幻觉提示在视觉-语言模型中的经典幻觉理解
Abstract
Vision-Language Models (VLMs) exhibit systematic bias toward visual illusions, recalling memorized facts rather than perceiving actual visual differences. This paper presents a training-free framework for the 5th DataCV Challenge Task 1 at CVPR 2026, addressing this perception-versus-memory conflict through three complementary strategies: (1) illusion-aware image preprocessing that weakens illusion-inducing context via type-specific transformations (edge extraction, color isolation, morphological processing, and reference-line overlay), (2) anti-illusion prompt engineering that guides VLMs toward qualitative visual comparison, and (3) a multi-vote ensemble that further improves robustness. Our method achieves 90.48% accuracy on the official 630-image test set using Claude (claude-opus-4-6) with a 5-vote majority ensemble, and 98.41% on a human-verified subset. The approach requires no finetuning, relying solely on visual manipulation and prompt design. Our solution secured 2nd place in the challenge, only 0.47% behind the 1st-place solution. Code is available at https://github.com/jasminezz/sf-illusion-aware-vlm.git.
Chinese Translation
视觉-语言模型(VLMs)对视觉幻觉表现出系统性偏见,倾向于回忆记忆中的事实而非感知实际的视觉差异。本文提出了一种无训练框架,旨在解决这一感知与记忆冲突,应用于2026年CVPR的第五届DataCV挑战赛任务1,采用三种互补策略:(1)幻觉感知图像预处理,通过特定类型的变换(边缘提取、颜色隔离、形态处理和参考线叠加)削弱诱发幻觉的上下文;(2)反幻觉提示工程,引导VLMs进行定性视觉比较;(3)多投票集成,进一步提高鲁棒性。我们的方法在官方的630张图像测试集上使用Claude(claude-opus-4-6)和5票多数集成达到了90.48%的准确率,在经过人工验证的子集中达到了98.41%。该方法无需微调,仅依赖于视觉操作和提示设计。我们的解决方案在挑战赛中获得第二名,仅比第一名方案低0.47%。代码可在https://github.com/jasminezz/sf-illusion-aware-vlm.git获取。
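The 5-vote majority ensemble mentioned in the abstract above can be illustrated with a minimal sketch. The answer strings and the function name `majority_vote` are hypothetical; the abstract does not describe how votes are collected or how ties are resolved.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the votes.

    Ties go to the answer that was seen first, which is how
    Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one vote")
    return Counter(answers).most_common(1)[0][0]

# Five hypothetical model responses to a single illusion question.
votes = ["same", "different", "same", "same", "different"]
final_answer = majority_vote(votes)
```

With an odd number of votes and a binary question, a tie cannot occur, which is one plausible reason to use five votes rather than four.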
cs.CV / 97 / 2605.08851
Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport
通过熵最优传输在冠状动脉造影中进行几何约束的狭窄编辑
Abstract
The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.
Chinese Translation
冠状动脉造影(CAG)狭窄的高质量影像数据稀缺,限制了自动狭窄检测的临床转化。合成狭窄数据为增强训练集提供了一条实用途径,提高了数据的质量、多样性和分布覆盖,增强了检测的精度和泛化能力。然而,基于扩散的编辑通常依赖于在噪声初始化的反向过程中进行软引导,提供的像素级精度和结构保留有限。我们提出了OT-Bridge Editor,将局部编辑重新构建为一个约束的熵最优传输(OT)问题,并利用几何信息引导生成路径,从而实现更强的几何控制。大量实验表明,我们合成的血管造影图在下游狭窄检测中始终表现出显著改善,在公共ARCADE基准上相对增益达到27.8%,在我们的多中心数据集中增益为23.0%,并得到了持续的定性结果支持。
cs.CV / 98 / 2605.08854
Restoration-Aligned Generative Flow Models for Blind Motion Deblurring
针对盲运动去模糊的恢复对齐生成流模型
Abstract
Generative flow models offer powerful priors learned from large-scale natural images, but directly adapting them to restoration tasks such as motion deblurring causes severe fidelity degradation, as their training objective is inherently misaligned with restoration. We present DeblurFlow, a framework that resolves this misalignment by reformulating the flow trajectory itself: we replace the noise endpoint with the blur observation, which makes the underlying vector field coincide with the residual error between blur and clean images. Under this formulation, the standard flow matching loss naturally takes the form of a residual loss, allowing pretrained flow models to be optimized under restoration-aligned objectives via LoRA adaptation. This formulation further enables a dual-expert sampling strategy: a fidelity expert provides a high-fidelity initialization, e.g., PSNR 33.69 dB, and DeblurFlow enhances perceptual quality with only a marginal fidelity reduction to 33.05 dB, whereas directly applying a generative model on top of a fidelity expert decreases PSNR to 27.60 dB. To make this practical, we further introduce r-space, a latent space tailored for residual decoding rather than image reconstruction, which reduces encoder-decoder cost by up to 9$\times$ over standard VAE latents. Extensive experiments on GoPro, HIDE, RealBlur, and RWBI demonstrate that DeblurFlow achieves strong restoration fidelity and perceptual realism, while remaining computationally practical.
Chinese Translation
生成流模型提供了从大规模自然图像中学习到的强大先验,但将其直接应用于运动去模糊等恢复任务会导致严重的保真度下降,因为它们的训练目标与恢复任务本质上不一致。我们提出了DeblurFlow,一个通过重新构造流轨迹来解决这种不一致性的框架:我们用模糊观测替换噪声终点,使得底层向量场与模糊图像和清晰图像之间的残差误差重合。在这种表述下,标准流匹配损失自然呈现为残差损失,从而允许预训练的流模型通过LoRA适应在恢复对齐目标下进行优化。该表述进一步使得双专家采样策略成为可能:保真度专家提供高保真的初始化,例如PSNR 33.69 dB,而DeblurFlow在仅有微小保真度下降至33.05 dB的情况下增强了感知质量,而直接在保真度专家之上应用生成模型则使PSNR降至27.60 dB。为了使其具有实用性,我们进一步引入了r-space,这是一个针对残差解码而非图像重建的潜在空间,能够将编码器-解码器的成本降低高达9倍,相较于标准的变分自编码器(VAE)潜在空间。在GoPro、HIDE、RealBlur和RWBI上的广泛实验表明,DeblurFlow在保持计算实用性的同时,实现了强大的恢复保真度和感知真实感。
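To see why replacing the noise endpoint with the blur observation turns flow matching into a residual loss, consider the following sketch. It assumes a linear (rectified-flow-style) interpolation path, which the abstract does not specify; the symbols $x$, $y$, $x_t$, and $v_\theta$ are notation introduced here for illustration.

```latex
% Assumed notation: x = clean image, y = blurred observation, t in [0, 1].
% Assumed linear path from the blur endpoint (t = 0) to the clean image (t = 1):
x_t = (1 - t)\, y + t\, x
% The target velocity along this path is the clean-minus-blur residual:
\frac{\mathrm{d} x_t}{\mathrm{d} t} = x - y
% so the usual flow matching objective becomes a residual regression:
\mathcal{L}(\theta) = \mathbb{E}_{t,\,(x, y)} \left\| v_\theta(x_t, t) - (x - y) \right\|_2^2
```

Under these assumptions the regression target no longer depends on a noise sample, only on the paired blur and clean images.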
cs.CV / 99 / 2605.08858
ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability
ProDG:无数据生成后置可解释性的原型
Abstract
Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive "this looks like that" reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model's weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG
Chinese Translation
基于原型的前置可解释性方法通过利用直观的“这个看起来像那个”推理范式提供高度准确的解释。另一方面,后置模型可以在不依赖于基础数据集或不需要昂贵的神经网络重训练的情况下,为单个图像的预测提供解释。最近的方法成功解决了基于原型的网络的重训练问题。然而,它们仍然面临一个基本限制:它们需要访问一部分数据(例如,测试集或验证集)以搜索和提取视觉原型。在本文中,我们解决了这个问题,并介绍了ProDG:无数据后置可解释性的生成原型,这是一个新颖的框架,利用生成模型直接从冻结模型的权重合成纯净的高保真原型,完全消除了对任何外部数据的依赖。通过在无数据可解释性(Data-Free XAI)中建立这一新前沿,ProDG为隐私敏感领域解锁了强大的视觉可解释性,在这些领域,原始数据受到严格限制或根本无法访问。项目页面:https://github.com/piotr310100/ProDG
cs.CV / 100 / 2605.08874
Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation
双曲空间中的语义对齐用于开放词汇语义分割
Abstract
Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.
Chinese Translation
开放词汇语义分割要求将图像级视觉-语言模型(如 CLIP)适应于密集的像素级预测,这由于层次结构与嵌入空间中的语义对齐之间的不匹配而具有挑战性。尽管最近的研究利用双曲几何来建模层次关系,但它们仅在层次级别之间对齐嵌入,而忽视了同一层次内嵌入之间的语义不对齐。在本研究中,我们提出了 HyRo,一种双曲微调框架,它在庞加莱球模型中解耦了层次对齐和语义对齐。HyRo 通过调整双曲半径来对齐层次级别,并通过使用正交变换进行角度对齐来细化语义关系,该变换在理论上保持双曲半径不变。在标准开放词汇语义分割基准上的实验表明,HyRo 相较于先前方法实现了最先进的性能。
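The claim that an orthogonal transformation preserves the hyperbolic radius can be checked numerically. The sketch below assumes the unit Poincaré ball with curvature -1, where the distance from the origin is 2·artanh(‖x‖), and uses a 2D rotation as the simplest orthogonal map. The embedding values and function names are hypothetical and are not taken from HyRo.

```python
import math

def poincare_radius(point):
    """Hyperbolic distance from the origin in the unit Poincare ball (curvature -1)."""
    norm = math.sqrt(sum(c * c for c in point))
    return 2.0 * math.atanh(norm)

def rotate_2d(point, angle):
    """Apply a 2D rotation, the simplest example of an orthogonal transformation."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

embedding = (0.3, 0.4)                        # hypothetical embedding inside the unit ball
rotated = rotate_2d(embedding, math.pi / 3)   # changes direction only

radius_before = poincare_radius(embedding)
radius_after = poincare_radius(rotated)
```

Because the rotation leaves the Euclidean norm unchanged and the radius depends only on that norm, the two radii agree up to floating-point error while the direction of the embedding changes.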
cs.CV / 101 / 2605.08902
DAPE: Dynamic Non-uniform Alignment and Progressive Detail Enhancement Techniques for Improving the Performance of Efficient Visual Language Models
DAPE:动态非均匀对齐与渐进细节增强技术以提升高效视觉语言模型的性能
Abstract
In recent years, pre-trained visual-linguistic models have demonstrated tremendous potential, becoming a crucial foundational framework for numerous downstream tasks. However, the information density between text and images is not uniformly distributed. Existing methods often overlook the inherent and dynamic differences in information density and semantic scope between text tags and image blocks. These common uniform alignment strategies result in coarse-grained cross-modal interactions and loss of fine semantic details. Moreover, pursuing finer alignment typically requires substantial computational overhead, limiting practical model deployment. To address this challenge, this paper proposes a novel framework for dynamic cross-modal alignment with continuous detail introduction. First, we design a dynamically adaptive cross-modal matching mechanism that uses a learnable matching function to dynamically assign varying numbers and sizes of image tags to text tags of the same size but different information density, enabling more precise attention interaction. Second, we develop a continuous detail introduction module to progressively incorporate high-resolution visual feature enhancement into the alignment process. Extensive experiments across multiple benchmarks demonstrate significant improvements in the accuracy of various downstream tasks while reducing computational overhead.
Chinese Translation
近年来,预训练的视觉语言模型展现出巨大的潜力,成为众多下游任务的重要基础框架。然而,文本与图像之间的信息密度并非均匀分布。现有方法往往忽视了文本标签与图像块之间信息密度和语义范围的内在动态差异。这些常见的均匀对齐策略导致了粗粒度的跨模态交互和细微语义细节的丢失。此外,追求更精细的对齐通常需要大量的计算开销,限制了模型的实际部署。为了解决这一挑战,本文提出了一种新颖的动态跨模态对齐框架,结合持续的细节引入。首先,我们设计了一种动态自适应的跨模态匹配机制,利用可学习的匹配函数动态地将不同数量和大小的图像标签分配给相同大小但信息密度不同的文本标签,从而实现更精确的注意力交互。其次,我们开发了一个持续细节引入模块,将高分辨率视觉特征增强逐步融入对齐过程中。在多个基准测试中的广泛实验表明,模型在各种下游任务的准确性显著提高,同时计算开销得以降低。
cs.CV / 102 / 2605.08911
Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning
驾驶场景推理的车道及车道拓扑统一建模
Abstract
Autonomous vehicles need to perceive not only physical elements in the driving scene, such as lane lines and traffic lights, but also logical elements like lane centerlines and their topology. Existing lane topology reasoning methods typically follow a reasoning-by-detection paradigm, where lane topological relationships are primarily derived from lane detection results. In this paper, we propose an innovative method called Unified Modeling of Lane and Lane Topology (UniTopo), which represents the topological relationships between lanes as connected lanes, encompassing predecessor lanes, successor lanes, and their interconnections. This unified representation of lanes and lane topology allows us to simultaneously obtain both the positions and topological information of lanes within a shared perception pipeline, establishing a new paradigm for directly perceiving lane topology from original image features. We validate our method on the driving scene reasoning benchmark OpenLane-V2, which consists of two subsets, built based on Argoverse2 and nuScenes, respectively. Our method achieves TOP$_{ll}$ of 30.1% and 31.8% on the two subsets, significantly surpassing the existing state-of-the-art method T$^2$SG by 6.0% and 8.6%.
Chinese Translation
自主车辆不仅需要感知驾驶场景中的物理元素,如车道线和交通信号灯,还需要理解逻辑元素,如车道中心线及其拓扑关系。现有的车道拓扑推理方法通常遵循检测推理范式,其中车道拓扑关系主要源自车道检测结果。本文提出了一种创新的方法,称为车道及车道拓扑统一建模(Unified Modeling of Lane and Lane Topology,UniTopo),该方法将车道之间的拓扑关系表示为相互连接的车道,包括前驱车道、后继车道及其相互连接。这种车道和车道拓扑的统一表示使我们能够在共享感知管道中同时获取车道的位置和拓扑信息,从而建立了一种新的范式,能够直接从原始图像特征中感知车道拓扑。我们在驾驶场景推理基准数据集OpenLane-V2上验证了我们的方法,该数据集由基于Argoverse2和nuScenes构建的两个子集组成。我们的方法在这两个子集上分别取得了30.1%和31.8%的TOP_ll,显著超过了现有的最先进方法T^2SG,提升幅度为6.0%和8.6%。
cs.CV / 103 / 2605.08925
Few-Click-Driven Interactive 3D Segmentation with Semantic Embedding
少量点击驱动的交互式3D分割与语义嵌入
Abstract
Interactive segmentation allows efficient label generation by leveraging user-provided clicks to progressively refine predictions, which is critical when fully supervised labels are costly or generalization to unseen classes is needed. Existing 3D interactive methods are limited: most operate sequentially, predicting only one object per iteration with binary masks, while several recent approaches depend on 2D foundation models and camera alignment to bridge the 2D-3D gap. To address these limitations, we propose a novel interactive segmentation framework that operates directly on sparse, randomly downsampled 3D points and processes multiple object clicks in a single forward pass. Our framework consists of a point Transformer-based encoder and a hierarchical mask decoder, which integrates multi-level crop-and-merge operations conditioned on learnable semantic embeddings. Unlike prior interactive approaches that require repeated model updates after each manual corrective click, our method jointly reasons over all click queries, modeling inter-instance relationships and refining both spatial masks and semantic predictions through spatial and semantic embeddings. Extensive experiments demonstrate that our model improves the mIoU metric by over 20 percent compared to strong baselines and achieves 8-10 percent gains under cross-dataset evaluation in a one-click-per-instance setting, often requiring only a single click per object. Our approach provides a generalizable and efficient solution for interactive 3D instance segmentation, particularly suitable for real-time applications such as robotic manipulation, navigation, and rapid 3D semantic annotation.
Chinese Translation
交互式分割通过利用用户提供的点击来逐步细化预测,从而实现高效的标签生成,这在完全监督标签成本高昂或需要对未见类别进行泛化时尤为关键。现有的3D交互式方法存在局限性:大多数方法是顺序操作,每次迭代仅预测一个对象,并使用二进制掩膜,而一些最近的方法依赖于2D基础模型和相机对齐来弥合2D与3D之间的差距。为了解决这些限制,我们提出了一种新颖的交互式分割框架,该框架直接在稀疏的随机下采样3D点上操作,并在单次前向传播中处理多个对象点击。我们的框架由基于点变换器(point Transformer)的编码器和分层掩膜解码器组成,集成了基于可学习语义嵌入的多级裁剪与合并操作。与之前需要在每次手动校正点击后重复更新模型的交互式方法不同,我们的方法对所有点击查询进行联合推理,建模实例间关系,并通过空间和语义嵌入细化空间掩膜和语义预测。大量实验表明,与强基线相比,我们的模型在mIoU指标上提高了超过20%,并在每实例一次点击设置下,在跨数据集评估中实现了8-10%的增益,通常只需对每个对象进行一次点击。我们的方法为交互式3D实例分割提供了一种可泛化且高效的解决方案,特别适合于机器人操作、导航和快速3D语义标注等实时应用。
cs.CV / 104 / 2605.08945
PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment
PIDNet:用于多模态动作质量评估的渐进隐式解耦网络
Abstract
Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.
Chinese Translation
动作质量评估(AQA)旨在自动量化视频中人类动作的执行质量,对于竞技体育裁判等应用具有重要价值。在多模态AQA中,不同模态的质量证据是异质的,质量线索随着时间的推移逐渐演变。现有方法通常依赖于粗糙的融合或统一的时间建模,这可能模糊模态特定的线索,保留跨模态冗余,并削弱阶段特定的质量证据。为了解决这些问题,我们提出了一种渐进隐式解耦与融合网络(PIDNet),该网络逐步整合模态特定信息、跨模态互补线索和全局质量语义,以实现准确评估。具体而言,我们设计了一个iMambaWave模块,该模块将RGB、光流和音频特征映射到共享的潜在空间,并通过Bi-Mamba分支和小波变换分支对其进行解耦,以分别捕捉长程时间依赖性和局部扰动细节。一个门控聚合机制自适应地融合时间和频域信息。我们进一步构建了一个使用Group3M模块的三阶段渐进融合网络,其中模态互补注意力检索跨模态证据,同时抑制冗余,多尺度卷积丰富特征表示。在韵律体操和Fis-V数据集上的实验表明,PIDNet在得分相关性方面具有高度竞争力,并且与现有的单模态和多模态方法相比,具有良好的误差控制。消融研究验证了每个组件的有效性。此外,iMambaWave在多个骨干网络上始终改善视觉表示和时间建模,显示出良好的泛化能力和即插即用能力。
cs.CV / 105 / 2605.08952
FugSeg: Fast Uncertainty-aware Ground Segmentation for 3D Point Cloud
FugSeg:快速的不确定性感知地面分割方法用于3D点云
Abstract
In LiDAR-based environment perception systems, ground segmentation is a key preprocessing step supporting various applications such as mapping and navigation. Although extensively studied, problems such as reflection noise and isolated ground remain challenging. To address these issues, we propose FugSeg, a fast uncertainty-aware ground segmentation method. A polar grid map is adopted as the point cloud representation to ensure generalizability across LiDAR types. Building on that, we develop a within- and cross-segment ground labeling strategy that identifies not only directly visible ground cells but also those that are isolated or occluded. During this process, an adaptive slope is introduced, which incorporates measurement uncertainties to enhance its reliability under complex terrain. Finally, to achieve point-level ground segmentation, a fine-grained ground elevation estimation method is introduced. Throughout the complete workflow, reflection noise is explicitly handled via the proposed noisy ground cells. We conduct comprehensive evaluations on four public datasets covering both structured and unstructured environments. Results show that FugSeg outperforms state-of-the-art non-learning methods, achieving the highest F1, accuracy, and mIoU across all datasets, while maintaining the fastest runtime (135 Hz and 487 Hz for 64- and 32-layer LiDARs) using a single CPU thread, making it suitable for resource-limited systems. The code will be available at https://github.com/Leo-YuLi/FugSeg.
Chinese Translation
在基于激光雷达的环境感知系统中,地面分割是支持映射和导航等各种应用的关键预处理步骤。尽管这一领域得到了广泛研究,但反射噪声和孤立地面等问题仍然具有挑战性。为了解决这些问题,我们提出了FugSeg,一种快速的不确定性感知地面分割方法。采用极坐标网格地图作为点云表示,以确保在不同激光雷达类型之间的通用性。在此基础上,我们开发了一种内部和跨段的地面标记策略,不仅识别直接可见的地面单元,还识别那些孤立或被遮挡的单元。在此过程中,引入了一种自适应坡度,结合测量不确定性,以提高其在复杂地形下的可靠性。最后,为实现点级地面分割,提出了一种细粒度的地面高度估计方法。在整个工作流程中,反射噪声通过所提出的噪声地面单元得到了明确处理。我们在四个公共数据集上进行了全面评估,涵盖了结构化和非结构化环境。结果表明,FugSeg在所有数据集中均优于最先进的非学习方法,达到了最高的F1值、准确率和mIoU,同时在使用单个CPU线程时保持最快的运行时间(64层和32层激光雷达下分别为135 Hz和487 Hz),使其适用于资源有限的系统。代码将发布在https://github.com/Leo-YuLi/FugSeg。
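The polar grid map used as the point cloud representation can be pictured as binning each point by range and azimuth. The following is only a generic sketch of that idea: the function name and all grid parameters (`max_range`, `num_rings`, `num_sectors`) are illustrative assumptions, not values or code from FugSeg.

```python
import math

def polar_cell(x, y, max_range=80.0, num_rings=40, num_sectors=180):
    """Map a point's (x, y) position to a (ring, sector) cell index.

    Returns None for points at or beyond max_range. All three grid
    parameters are illustrative defaults, not values from the paper.
    """
    rng = math.hypot(x, y)
    if rng >= max_range:
        return None
    azimuth = math.atan2(y, x) % (2.0 * math.pi)    # wrap into [0, 2*pi)
    ring = int(rng / (max_range / num_rings))
    sector = int(azimuth / (2.0 * math.pi / num_sectors))
    return ring, min(sector, num_sectors - 1)        # guard against rounding at 2*pi

cell = polar_cell(10.0, 10.0)
```

A grid defined this way depends only on point coordinates rather than on the sensor's beam layout, which is consistent with the abstract's stated goal of generalizing across LiDAR types.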
cs.CV / 106 / 2605.08965
Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning
多模态大语言模型能否推理视觉说服力?评估推理的有效性和可信度
Abstract
Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.
Chinese Translation
尽管多模态大语言模型(MLLMs)在多模态任务上表现出色,但预测一幅图像是否具有说服力以及其原因仍然具有挑战性。我们首先展示了在预测之前提示 MLLMs 进行推理并不总是有效,甚至可能降低说服力预测性能,这表明天真生成的推理对于此任务而言是不可靠的信号。然而,目前尚无建立的方法论来训练 MLLMs 推理视觉说服力或评估其推理是否真实支持其决策。为了解决这一空白,我们通过实证和理论证明,当使用多样的教师生成推理进行监督微调时,可以提高视觉说服力预测。我们进一步引入了一个三维可信度评估框架,涵盖推理与决策的一致性、推理与图像的关联性以及推理与决策的敏感性。应用这一框架表明,仅凭预测性能并不能保证推理的可信度,而推理与决策的敏感性与人类推理偏好最为一致。这些发现激励了关注可信度的训练目标和可扩展的推理监督,以评估视觉说服力。我们的代码和数据集将公开发布。
cs.CV / 107 / 2605.08971
Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud
挤出分割策略以改善从点云重建CAD模型
Abstract
Computer-Aided Design is ubiquitous in today's world: across industries, almost every manufactured object begins as a digital model. At the same time, advances in 3D sensing have made point clouds a dominant form of raw 3D data. Recovering the CAD model of a physical object from its point cloud scan has two major applications: reverse engineering, where physical or hand-crafted prototypes need to be reconstructed automatically as editable digital models, and quality control, where recovering the CAD description of a manufactured object helps quantify and understand deviations introduced during the production process. Thus, converting unordered point clouds into structured CAD models is increasingly important for modern applications. Deep learning has enabled major progress in computer vision for both 2D and 3D data, and new datasets facilitate data-driven CAD reconstruction. Building on this foundation, we develop an end-to-end model that reconstructs CAD models from point clouds and introduce a segmentation approach that decomposes them into individual extrusions. These partial shapes increase data diversity, improving the generalization and robustness of deep learning models. Our strategy thereby provides a simple yet effective way to increase the reconstruction performance of deep learning models.
Chinese Translation
计算机辅助设计在当今世界无处不在,几乎每个制造物体都始于各行业的数字模型。同时,3D传感技术的进步使得点云成为一种主导的原始3D数据形式。从物理对象的点云扫描中恢复CAD模型有两个主要应用:逆向工程,其中需要将物理或手工制作的原型自动重建为可编辑的数字模型;质量控制,其中恢复制造物体的CAD描述有助于量化和理解在生产过程中引入的偏差。因此,将无序点云转换为结构化CAD模型对于现代应用变得越来越重要。深度学习在2D和3D数据的计算机视觉领域取得了重大进展,而新的数据集则促进了数据驱动的CAD重建。在此基础上,我们开发了一种端到端模型,从点云重建CAD模型,并引入了一种分割方法,将其分解为单独的挤出部分。这些部分形状增加了数据的多样性,提高了深度学习模型的泛化能力和鲁棒性。因此,我们的策略提供了一种简单而有效的方法,以提高深度学习模型的重建性能。
cs.CV / 108 / 2605.08974
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
追踪真相:面向对象的时空监测用于视频大型语言模型
Abstract
While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在视频理解方面取得了进展,但它们在动态场景中仍然高度容易出现幻觉。我们认为这源于时空监测的失败,即在时间上持续跟踪对象身份、状态和关系的能力。现有基准通过依赖于单一最终答案的评估来掩盖这一缺陷,而这些问题往往可以通过局部视觉线索或统计先验来解决。为了严格诊断这一问题,我们引入了STEMO-Bench(时空监测基准),这是一个经过人工验证的面向对象的事实基准,通过将查询分解为子问题来评估中间推理,从而区分真正的时间理解与偶然的正确性。为了应对STEMO暴露的失败模式,我们提出了STEMO-Track,这是一种新颖的面向对象的框架,通过分块状态提取和时间聚合显式构建和推理结构化的对象轨迹。大量实验表明,我们的面向对象框架显著减少了幻觉答案,并提高了在最先进的MLLMs上的时空推理一致性。
cs.CV / 109 / 2605.08985
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4:什么使得多模态大语言模型中的视觉编码高效?
Abstract
Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
Chinese Translation
视觉编码构成了多模态大语言模型(MLLMs)中的一个主要计算瓶颈,尤其是在高分辨率图像输入的情况下。当前的做法通常采用全局编码,随后进行后ViT压缩。全局编码会产生大量的标记序列,而后ViT压缩在任何标记减少发生之前会产生ViT的全二次注意力成本。在本研究中,我们从编码策略和视觉标记压缩两个维度重新审视这一惯例。首先,受控实验表明,基于切片的编码在各项基准测试中优于全局编码,这表明通过切片视图保留局部细节可能比应用全局注意力更有利于细粒度感知。其次,我们引入了内部ViT早期压缩,该方法在浅层ViT层中减少标记,并显著降低视觉编码的FLOPs,同时保持下游性能。通过将内部ViT压缩整合到基于切片的编码框架中,我们提出了LLaVA-UHD v4,这是一种针对高分辨率输入量身定制的高效且可控的视觉编码方案。在涵盖文档理解、光学字符识别(OCR)和一般视觉问答(VQA)的一系列多样化基准测试中,LLaVA-UHD v4将视觉编码的FLOPs减少了55.8%,同时匹配或超越了基线性能。这些结果表明,视觉编码的效率可以在不牺牲下游性能的情况下显著提高,为高效高分辨率MLLMs提供了一个实用的设计方向。所有模型权重和代码将公开发布,以支持进一步的研究。
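The argument that post-ViT compression still pays the full attention cost, while intra-ViT early compression does not, can be made concrete with a rough cost model. The sketch below assumes a standard Transformer block with a 4x MLP, counted in multiply-adds. The layer count, width, token count, compression point, and ratio are hypothetical and are not the LLaVA-UHD v4 configuration.

```python
def vit_cost(tokens_per_layer, dim):
    """Approximate multiply-adds for a stack of ViT layers.

    Per layer with N tokens and width d (assumed standard block, 4x MLP):
      projections + MLP ~ 12 * N * d^2, attention maps ~ 2 * N^2 * d.
    """
    return sum(12 * n * dim * dim + 2 * n * n * dim for n in tokens_per_layer)

# Hypothetical configuration, not taken from the paper.
num_layers, dim, tokens = 24, 1024, 4096
compress_after, ratio = 6, 4    # reduce tokens 4x after the 6th layer

# Post-ViT compression: every layer still sees the full token sequence.
post_vit = vit_cost([tokens] * num_layers, dim)

# Intra-ViT early compression: only the shallow layers see all tokens.
intra_vit = vit_cost(
    [tokens] * compress_after + [tokens // ratio] * (num_layers - compress_after),
    dim,
)
saving = 1.0 - intra_vit / post_vit
```

In this toy model the saving grows as the compression point moves to shallower layers, since the quadratic attention term shrinks with the square of the token reduction.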
cs.CV / 110 / 2605.09002
CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification
CT-IDP:用于可解释的腹部CT疾病分类的分割衍生定量表型
Abstract
In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets of 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on the Duke-Abdomen dataset, and 0.780 versus 0.756 on AMOS.
Chinese Translation
在这项回顾性多机构研究中,开发了一种定量表型框架CT-IDP(CT图像衍生表型),该框架基于MERLIN腹部CT基准(训练集、验证集和测试集分别为15,175、5,018和5,082个研究),并在两个独立数据集上进行了外部评估:Duke-Abdomen(2,000)和AMOS(1,107)。使用TotalSegmentator生成多脏器分割,并用于衍生超过900个涵盖形态测量、衰减和上下文/负担发现的器官和区室级描述符。基于稀疏疾病特异性逻辑回归与弹性网正则化在MERLIN上进行训练,并在冻结规格下进行外部验证。通过AUC和平均精度(AP)比较性能,并支持表型分层审计和系数级检查。CT-IDP与基线的宏观AUC在MERLIN上为0.897对0.880,在Duke-Abdomen数据集上为0.877对0.857,在AMOS上为0.780对0.756。
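For readers unfamiliar with the macro-AUC figures reported above, here is a small sketch of one common way to compute it: a per-disease AUC, followed by an unweighted mean across diseases. The abstract does not state the exact averaging, so the unweighted mean is an assumption, and the toy labels and scores are invented for illustration only.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(per_disease):
    """Unweighted mean of per-disease AUCs; input maps name -> (labels, scores)."""
    aucs = [binary_auc(y, s) for y, s in per_disease.values()]
    return sum(aucs) / len(aucs)

# Invented toy data for two hypothetical disease labels.
toy = {
    "disease_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "disease_b": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
}
toy_macro = macro_auc(toy)
```

Averaging this way weights every disease equally regardless of prevalence, so rare findings influence the summary as much as common ones.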
cs.CV / 111 / 2605.09003
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
FlashClear:通过高效步骤蒸馏和特征缓存实现超快速图像内容去除
Abstract
Recently, diffusion-based object removal models have achieved impressive results in eliminating objects and their associated visual effects. However, they indiscriminately denoise all tokens across all timesteps, ignoring that removal usually involves small foreground regions. This strategy introduces substantial computational overhead and prolonged inference times. To overcome this computational burden, we propose a latent discriminator to implement Region-aware Adversarial Distillation (RAD), yielding a highly efficient few-step model named FlashClear. Furthermore, tailored to few-step diffusion models, we propose FPAC (Foreground-Prioritized Asymmetric Attention and Caching), a training-free acceleration strategy. Extensive experiments demonstrate that our framework provides massive acceleration while maintaining or exceeding the performance of our base model, ObjectClear. Notably, on the OBER benchmark, our FlashClear achieves up to 8.26$\times$ and 122$\times$ speedup over ObjectClear and OmniPaint, respectively, while maintaining high visual quality and fidelity.
Chinese Translation
近年来,基于扩散的物体去除模型在消除物体及其相关视觉效果方面取得了显著成果。然而,它们在所有时间步上无差别地去噪所有标记,忽视了去除通常涉及的小前景区域。这种策略引入了大量的计算开销和延长的推理时间。为了克服这一计算负担,我们提出了一种潜在判别器,以实现区域感知对抗蒸馏(Region-aware Adversarial Distillation, RAD),从而产生一个名为FlashClear的高效少步模型。此外,针对少步扩散模型,我们提出了FPAC(前景优先不对称注意力和缓存,Foreground-Prioritized Asymmetric Attention and Caching),这是一种无训练加速策略。大量实验表明,我们的框架在保持或超过基础模型ObjectClear性能的同时,提供了巨大的加速。值得注意的是,在OBER基准测试中,我们的FlashClear在保持高视觉质量和保真度的同时,分别实现了对ObjectClear和OmniPaint高达8.26倍和122倍的加速。
cs.CV / 112 / 2605.09020
The Direct Integration Theorem: A Rigorous Framework for Consistent Discrete Solutions of the Inverse Radon Problem
直接积分定理:逆拉东问题一致离散解的严格框架
Abstract
This paper presents a novel Direct Integration Theorem (DIT), derived as a non-trivial corollary of the classical Central Slice Theorem (CST). The DIT provides a mathematically consistent transition from the continuous to the discrete domain - a fundamental challenge in computed tomography - thereby eliminating the need for frequency-domain interpolation without resorting to conventional ramp-filtering. The proposed approach circumvents two principal limitations inherent in traditional methods: (i) the zero-frequency singularity and spectral distortions introduced by the mandatory ramp-filtering step, and (ii) discretization inaccuracies associated with frequency-domain interpolation. Based on the DIT, we develop a rigorous framework for consistent discrete solutions of the inverse Radon problem. Mathematical modeling demonstrates that this approach achieves quasi-exact reconstruction, with errors constrained solely by sampling parameters and grid geometry. Furthermore, while Filtered Back Projection (FBP) inherently distorts the variance of the reconstructed image, the DIT-based algorithm preserves it. Comparative simulations confirm that the proposed method eliminates common artifacts, such as intensity cupping, and consistently outperforms FBP in terms of PSNR, SSIM, and reprojection fidelity, faithfully restoring the original image's statistical characteristics.
Chinese Translation
本文提出了一种新颖的直接积分定理(Direct Integration Theorem, DIT),作为经典中央切片定理(Central Slice Theorem, CST)的非平凡推论。DIT提供了从连续域到离散域的数学一致性过渡——这是计算机断层扫描中的一个基本挑战——从而消除了在不依赖于传统斜坡滤波的情况下进行频域插值的需要。所提出的方法规避了传统方法中固有的两个主要限制:(i)零频率奇点和强制斜坡滤波步骤引入的谱失真,以及(ii)与频域插值相关的离散化不准确性。基于DIT,我们开发了一个严格的框架,用于逆拉东问题的一致离散解。数学建模表明,该方法实现了准精确重建,误差仅受采样参数和网格几何的限制。此外,尽管滤波反投影(Filtered Back Projection, FBP)固有地扭曲了重建图像的方差,但基于DIT的算法能够保持其一致性。比较仿真确认,所提出的方法消除了常见的伪影,如强度凹陷,并在PSNR、SSIM和重投影保真度方面始终优于FBP,忠实地恢复了原始图像的统计特征。
cs.CV / 113 / 2605.09024
Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination
用于虚拟制作的基于图像照明的可重光照高斯点云技术
Abstract
Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space - implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.
Chinese Translation
虚拟制作(VP)使用LED墙提供背景图像和基于图像的照明。虽然这使得现场合成成为可能,但它将照明与背景和场景外观耦合,限制了后续编辑的灵活性。此外,逆向渲染通常依赖于物理基础渲染来估计3D几何体和照明,使用环境贴图。然而,这些贴图通常分辨率较低,并假设远场照明。在虚拟制作中,近场和高分辨率的基于图像的照明可能导致不准确性,并在编辑时引入复杂性。为了解决这个问题,我们提出了一种针对虚拟制作的3D重建和重光照的框架,使用高斯点云技术(Gaussian Splatting)。该框架利用已知的背景图像来调节重光照过程。这避免了对环境贴图的依赖,并将合成简化为背景图像编辑任务。为了实现我们的框架,我们引入了一个过程(及相关数据集),该过程在不同背景内容和照明条件下捕捉真实的虚拟制作场景。这些数据用于将3D场景分解为固定外观和可变照明组件。可变照明过程通过对每个原始体进行UV坐标、强度值和分辨率修正的参数化来模拟光传输。使用mipmap,这些直接在图像空间中对背景纹理进行采样——隐式捕捉反射和折射,而无需物理基础渲染。结合固定外观组件,这使我们能够使用高斯点云光栅化器渲染重光照场景。与基线相比,我们的方法实现了更高质量的3D重建和可控重光照。该方法高效(<3 GB RAM,<5 GB VRAM,<2小时训练,~35 FPS),并支持渲染有用的任意输出变量,包括深度、照明强度、照明颜色和无光渲染。
cs.CV / 114 / 2605.09025
MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift
MedFL-Stress:跨医院MRI外观变化下联邦脑肿瘤分割的系统性鲁棒性评估
Abstract
Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.
Chinese Translation
联邦学习使医院能够在不共享患者数据的情况下协作训练分割模型。然而,目前的评估协议仅报告客户端的平均性能,掩盖了个别站点的失败。在临床部署中,一个在某家医院始终失败的模型是一个真正的安全风险,而一个良好的平均分数可能完全掩盖这一点。我们引入了MedFL-Stress,这是一个受控的压力测试框架,正是为了揭示这种失败模式。我们使用来自BraTS 2020的2D轴向切片,分布在四个模拟医院客户端上,应用反映真实多站点部署中扫描仪和采集变异性的分级MRI外观变化(伽马对比度、尺度变化和噪声加模糊)。我们评估了三种联邦基线:FedAvg、FedProx和FedBN。最差医院的Dice系数和医院间差异被视为主要指标,而非补充观察。FedAvg实现了最高的全局平均Dice系数(0.8159),但掩盖了其最佳和最差表现医院之间的0.0850差距。FedBN将这一差距缩小了41%(从0.0850降至0.0503),同时平均准确率损失不到半个Dice点(从0.8159降至0.8109),而最弱医院的Dice系数直接增加了3.5点(从0.7309升至0.7656)。这些发现表明,面向鲁棒性的评估协议对于可靠的联邦医学影像部署至关重要。
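The headline robustness figures in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the only numbers used are those quoted in the abstract.

```python
# Illustrative check of the figures quoted in the MedFL-Stress abstract.
# Helper names are hypothetical; no per-hospital data beyond the abstract is assumed.

def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrinks when going from `before` to `after`."""
    return (before - after) / before


def dice_points(delta: float) -> float:
    """Express a Dice difference on the 0-1 scale in Dice points (x100)."""
    return delta * 100.0


# Inter-hospital gap (best minus worst Dice) reported for FedAvg and FedBN.
gap_fedavg, gap_fedbn = 0.0850, 0.0503
gap_closed = relative_reduction(gap_fedavg, gap_fedbn)

# Worst-hospital Dice reported for FedAvg (0.7309) and FedBN (0.7656).
worst_gain = dice_points(0.7656 - 0.7309)

print(f"gap closed: {gap_closed:.0%}")                       # gap closed: 41%
print(f"worst-hospital gain: {worst_gain:.1f} Dice points")  # worst-hospital gain: 3.5 Dice points
```

Reporting the worst-client score and the best-to-worst gap alongside the mean is what lets the FedBN trade-off show up at all; the mean Dice alone would rank FedAvg first.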
cs.CV / 115 / 2605.09030
When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation
当风格相似度评分失效时:艺术家风格评估中原始CSD余弦的诊断
Abstract
Raw cosine in the 768-dimensional output space of the Contrastive Style Descriptor (CSD) is now widely read as an absolute, calibrated style-fidelity score for text-to-image and style-imitation evaluation. We introduce the discrimination gap, a corpus-internal, prototype-free and threshold-free diagnostic that tests whether contrastive style cosines admit an absolute same-versus-different interpretation on a candidate artist corpus. On a 1799-artwork, 91-artist public-domain corpus, raw CSD cosine yields negative point-estimate gaps for $23/91$ artists at the pairwise level ($2/91$ robust under bootstrap) and for $15/91$ in the aggregated-pool scoring regime that style-fidelity evaluations typically use. CSLS readout on the frozen backbone reduces the aggregated negative-gap count to $4/91$; combined with positional-embedding interpolation to $336$ pixels it raises unsupervised pair-verification AUC from $0.883$ to $0.905$ across $25$ artist-disjoint splits. We refer to this diagnostic-driven readout protocol on the frozen backbone (CSLS as default, pos-interp $336$ as the stronger optional setting) as CSD+, not a new encoder. A cross-backbone check on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large reproduces the same shared-tradition failure pattern, providing evidence that the residual reflects a shared limitation of the four backbones we tested rather than a CSD-specific artefact. Practical implication: before reporting CSD cosine as an absolute style-fidelity score, run the diagnostic on the candidate corpus; CSLS is the minimal correction when it fails.
Chinese Translation
在对比风格描述符(Contrastive Style Descriptor, CSD)768维输出空间中的原始余弦现在被广泛解读为文本到图像和风格模仿评估的绝对、校准的风格保真度评分。我们引入了判别差距,这是一种语料库内部、无原型且无阈值的诊断方法,用于测试对比风格余弦是否在候选艺术家语料库上具有绝对的同与异的解释。在一个包含1799幅艺术作品和91位艺术家的公共领域语料库中,原始CSD余弦在成对层面上对$23/91$位艺术家产生了负点估计差距($2/91$在自助法下是稳健的),在聚合池评分机制下的风格保真度评估中则对$15/91$位艺术家产生了负差距。在冻结的主干网络上,CSLS读出将聚合的负差距数量减少到$4/91$;结合位置嵌入插值到$336$像素,将无监督成对验证的AUC从$0.883$提高到$0.905$,跨越$25$个艺术家不重叠的分割。我们将这种基于诊断驱动的冻结主干网络读出协议(CSLS作为默认设置,位置插值$336$作为更强的可选设置)称为CSD+,而不是一个新的编码器。对CLIP-ViT-L/14、SigLIP-large和DINOv2-Large的跨主干检查重现了相同的共享传统失效模式,提供了证据表明残差反映了我们测试的四个主干的共享限制,而不是CSD特有的伪影。实际意义:在将CSD余弦报告为绝对风格保真度评分之前,应在候选语料库上运行诊断;当其失效时,CSLS是最小的修正。
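For readers unfamiliar with the CSLS readout mentioned in this abstract, the following is a minimal sketch of the standard cross-domain similarity local scaling formula applied within a single set of embeddings. The abstract does not say how the authors compute neighbourhoods or choose `k`, so the function names, the default `k`, and the toy vectors are all assumptions made for illustration.

```python
# Minimal CSLS sketch using the standard definition:
#   CSLS(x, y) = 2 * cos(x, y) - r(x) - r(y)
# where r(.) is the mean cosine similarity of a point to its k nearest
# neighbours (self excluded). Names and the choice of k are hypothetical.
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def csls_matrix(embeddings, k=2):
    """Return CSLS scores for every pair; needs at least two embeddings."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # Local density term: mean similarity to the k most similar other points.
    r = []
    for i in range(n):
        others = sorted((sim[i][j] for j in range(n) if j != i), reverse=True)
        r.append(sum(others[:k]) / min(k, len(others)))
    return [[2 * sim[i][j] - r[i] - r[j] for j in range(n)] for i in range(n)]


# Toy example: two tight pairs of 2-D "style" vectors.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = csls_matrix(toy, k=1)
```

Subtracting the two density terms penalises points that are close to everything, which is why a CSLS score is better read relative to a point's neighbourhood than as an absolute cosine.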
cs.CV / 116 / 2605.09039
SeasonScapes: Learning Large-scale Re-lightable 3D Landscapes with Seasonal Variation from Sparse Webcams
SeasonScapes:从稀疏网络摄像头学习大规模可重光照的季节变化3D景观
Abstract
We introduce the SeasonScapes framework and the SeasonScapes dataset (Swiss Sparse-view Mountain Scenes with Seasonal Changes), which covers an area of over 50 km x 60 km and comprises more than 85,000 webcam images captured from 32 different locations across 13 timestamps throughout a full year. By projecting these timestamp-specific images onto a 3D mesh, we construct seasonal 3D landscapes that reflect natural appearance changes over time. To address occlusions and missing data, we leverage conditional diffusion models for image-guided inpainting directly on the mesh. The resulting completed meshes can be further relit using a standard physically-based renderer.
Chinese Translation
我们介绍了SeasonScapes框架和SeasonScapes数据集:覆盖超过50公里x 60公里的瑞士稀疏视角山景,包含来自13个时间戳的32个不同地点拍摄的超过85,000张网络摄像头图像。通过将这些特定时间戳的图像投影到3D网格上,我们构建了反映自然外观随时间变化的季节性3D景观。为了解决遮挡和缺失数据的问题,我们利用条件扩散模型进行图像引导的网格直接修复。最终生成的完整网格可以使用标准的基于物理的渲染器进一步进行重光照。
cs.CV / 117 / 2605.09053
LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation
LCGNav:面向视觉-语言导航的通用拓扑规划的局部候选感知几何增强
Abstract
Online topological planning has become an effective paradigm for Vision-Language Navigation in Continuous Environments (VLN-CE), but existing methods still suffer from two limitations: redundant local depth information and weakened focus on current frontier candidates as the topological graph grows. To address this, we propose LCGNav, a modular local geometric enhancement framework for topological VLN. LCGNav explicitly converts candidate depth views into 3D point clouds and applies physical truncation based on the agent's reachable range, enabling more compact local geometric modeling. It further introduces a dimension-preserving local fusion strategy with transient state degradation, so that geometric enhancement is applied only to the currently relevant ghost nodes without changing the original planner interface. Experiments on R2R-CE and RxR-CE show that LCGNav serves as an effective cross-architecture enhancement module, consistently improving multiple key metrics of representative online topological baselines with low additional training cost. When integrated with ETP-R1, LCGNav achieves the best performance among the compared online topological methods on the val-unseen splits of the R2R-CE and RxR-CE benchmarks. The code is available at https://github.com/shannanshouyin/LCGNav.
Chinese Translation
在线拓扑规划已成为连续环境中视觉-语言导航(VLN-CE)的有效范式,但现有方法仍然存在两个局限性:冗余的局部深度信息和随着拓扑图的增长对当前前沿候选的关注减弱。为了解决这一问题,我们提出了LCGNav,一个模块化的局部几何增强框架,用于拓扑VLN。LCGNav明确将候选深度视图转换为3D点云,并基于代理的可达范围应用物理截断,从而实现更紧凑的局部几何建模。它进一步引入了一种保持维度的局部融合策略,伴随瞬态状态降级,使得几何增强仅应用于当前相关的虚拟节点,而不改变原始规划器接口。在R2R-CE和RxR-CE上的实验表明,LCGNav作为一个有效的跨架构增强模块,持续改善了多个代表性在线拓扑基线的关键指标,且附加训练成本低。当与ETP-R1集成时,LCGNav在R2R-CE和RxR-CE基准的val-unseen分割中实现了比较的在线拓扑方法中的最佳性能。代码可在https://github.com/shannanshouyin/LCGNav获取。
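The depth-to-point-cloud step with range truncation described in this abstract can be sketched as follows. This is a generic pinhole back-projection, not LCGNav's code: the intrinsics (`fx`, `fy`, `cx`, `cy`), the Euclidean-distance cutoff, and the function name are assumptions made for illustration.

```python
# Generic pinhole back-projection with a range cutoff (illustrative only).
# Intrinsics and the distance-based truncation rule are assumptions.
import math


def depth_to_points(depth, fx, fy, cx, cy, max_range):
    """Back-project a depth map (metres) and drop points beyond max_range."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # invalid or missing depth reading
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Keep only points within the assumed reachable range of the agent.
            if math.sqrt(x * x + y * y + z * z) <= max_range:
                points.append((x, y, z))
    return points


# Toy 2x2 depth view: one far pixel and one invalid pixel are discarded.
toy_depth = [[1.0, 5.0], [0.0, 2.0]]
cloud = depth_to_points(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5, max_range=3.0)
```

Truncating before any feature extraction keeps the per-candidate point set small, which is consistent with the abstract's stated goal of more compact local geometric modeling.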
cs.CV / 118 / 2605.09065
Dependency-Aware Discrete Diffusion for Scene Graph Generation
依赖感知的离散扩散模型用于场景图生成
Abstract
Scene graphs (SGs) represent objects and their relationships as structured graphs, enabling applications in image generation, robotics, and 3D understanding. Recent work suggests that conditioning image generation on scene graphs improves compositional fidelity compared to text-only prompting. However, since users typically provide text rather than structured graphs, a key challenge is to generate scene graphs from natural language. Prior work on discrete diffusion has demonstrated success in generating generic graphs such as molecules and circuits, but fails to account for the hierarchical structure and strong dependencies between objects, edges, and relations in scene graphs. We address this limitation by introducing a dependency-aware, hierarchically constrained discrete diffusion model for scene graph generation. Our approach decouples structure and semantics across the forward and reverse processes, enabling the model to capture conditional dependencies. At inference time, we perform training-free conditioning to sample text-aligned scene graphs. We evaluate our method on standard SG benchmarks and demonstrate improvements over both continuous and discrete graph generation baselines across graph and layout metrics. When fed to downstream image generation, our approach yields improved compositional alignment compared to text-to-image models, particularly in multi-object scenarios.
Chinese Translation
场景图(SGs)将对象及其关系表示为结构化图形,使其能够应用于图像生成、机器人技术和三维理解等领域。近期研究表明,与仅使用文本提示相比,基于场景图的图像生成能够提高组合的保真度。然而,由于用户通常提供文本而非结构化图,因此从自然语言生成场景图是一个关键挑战。之前的离散扩散研究在生成分子和电路等通用图形方面取得了成功,但未能考虑场景图中对象、边缘和关系之间的层次结构和强依赖性。我们通过引入一种依赖感知的、层次约束的离散扩散模型来解决这一局限性,以生成场景图。我们的方法在前向和反向过程中解耦结构和语义,使模型能够捕捉条件依赖性。在推理时,我们执行无训练的条件采样,以生成与文本对齐的场景图。我们在标准的场景图基准上评估了我们的方法,并在图形和布局指标上展示了相较于连续和离散图生成基线的改进。当将我们的结果输入到下游图像生成任务时,与文本到图像模型相比,我们的方法在多对象场景中表现出更好的组合对齐。
cs.CV / 119 / 2605.09067
Reducing Annotation Burden for Femoral Cartilage Segmentation in Knee MRI via Cross-Sequence Transfer Learning
通过跨序列迁移学习减少膝关节MRI中股骨软骨分割的标注负担
Abstract
Purpose: To develop and evaluate cross-sequence transfer learning for automatic femoral cartilage segmentation, testing bidirectional transfer between dual-echo steady-state (DESS) and sagittal proton density-weighted 3D fast spin-echo (Cube) sequences. Materials and Methods: We optimized a modified 2D U-Net on 507 DESS images from the Osteoarthritis Initiative (OAI). We then established same-sequence baselines using subject-level cross-validation on a subset of 44 OAI DESS images and 44 Cube images acquired at the Istituto Ortopedico Rizzoli, Bologna, Italy. Each subset included 22 non-lesioned and 22 lesioned subjects. Finally, we performed transfer learning across sequences by fine-tuning the pretrained models on the target sequence with increasing training set sizes to study convergence, while keeping validation and test sets fixed. Segmentations were evaluated using Dice similarity coefficient (DSC) and average surface distance (ASD). Lesion effects were assessed with two-sided Mann-Whitney U tests with Bonferroni correction. Results: Same-sequence training yielded higher accuracy on DESS than Cube (DSC, $0.900$ vs $0.830$; $P < .001$). Cube-to-DESS transfer matched DESS performance (DSC, $0.903 \pm 0.032$ vs $0.900 \pm 0.027$), reaching a performance plateau at 9 training subjects. DESS-to-Cube yielded a lower combined DSC ($0.802 \pm 0.049$ vs $0.830 \pm 0.042$), reaching a plateau at 24 training subjects. Lesions did not affect DESS ($P \ge .39$) but reduced Cube accuracy (DSC, $0.805$ vs $0.856$; $P < .001$). Conclusion: Transfer learning across sequences can substantially reduce target-sequence annotation requirements for femoral cartilage segmentation, but performance is direction- and sequence-dependent, and the effects of lesions on segmentation may vary across MRI sequences.
Chinese Translation
目的:开发和评估用于自动股骨软骨分割的跨序列迁移学习,测试双回波稳态(DESS)和矢状面质子密度加权3D快速自旋回波(Cube)序列之间的双向迁移。材料与方法:我们在507幅来自骨关节炎倡议(OAI)的DESS图像上优化了一种修改的2D U-Net。然后,我们在44幅OAI DESS图像和44幅在意大利博洛尼亚的Rizzoli骨科医院获取的Cube图像的子集中,使用受试者级别的交叉验证建立了同序列基线。每个子集包括22名无病变和22名有病变的受试者。最后,我们通过在目标序列上微调预训练模型,进行跨序列迁移学习,并增加训练集的大小以研究收敛,同时保持验证集和测试集不变。使用Dice相似系数(DSC)和平均表面距离(ASD)评估分割效果。使用双侧Mann-Whitney U检验和Bonferroni校正评估病变效应。结果:同序列训练在DESS上的准确性高于Cube(DSC,$0.900$ vs $0.830$;$P < .001$)。Cube到DESS的迁移达到了DESS的性能(DSC,$0.903 \pm 0.032$ vs $0.900 \pm 0.027$),在9名训练受试者时达到性能平台。DESS到Cube的迁移则导致较低的综合DSC($0.802 \pm 0.049$ vs $0.830 \pm 0.042$),在24名训练受试者时达到平台。病变对DESS没有影响($P \ge .39$),但降低了Cube的准确性(DSC,$0.805$ vs $0.856$;$P < .001$)。结论:跨序列迁移学习可以显著减少股骨软骨分割的目标序列标注需求,但性能依赖于迁移方向和序列,病变对分割的影响可能在不同的MRI序列中有所不同。
cs.CV / 120 / 2605.09071
Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation
概率流蒸馏:高保真3D生成的精确Wasserstein梯度流
Abstract
Score Distillation Sampling (SDS) and its variants have been widely used for text-to-3D generation by distilling 2D image diffusion priors. However, the standard SDS objective is prone to severe mode collapse, frequently yielding over-smoothed and over-saturated results. Although recent advancements, such as Score Distillation via Inversion (SDI), mitigate these artifacts and produce visually sharper models, they ultimately fail to faithfully capture the full target distribution. In this work, we show that the bottleneck limiting the sampling capacity of SDI stems from its reliance on the posterior mean estimator, which is mathematically equivalent to a single-step Euler approximation of the deterministic reverse DDIM trajectory. To address this, we propose a naturally motivated extension termed Probability-Flow Distillation (PFD). We establish that PFD corresponds exactly to a Wasserstein gradient flow, thereby inducing principled distribution-matching dynamics. Finally, we show that PFD can synthesize 3D assets with fine-grained, high-fidelity details and achieve improved quality compared to existing methods.
Chinese Translation
得分蒸馏采样(Score Distillation Sampling, SDS)及其变体已被广泛应用于通过蒸馏2D图像扩散先验进行文本到3D生成。然而,标准的SDS目标容易导致严重的模式崩溃,常常产生过于平滑和过饱和的结果。尽管最近的进展,例如通过反演的得分蒸馏(Score Distillation via Inversion, SDI),缓解了这些伪影并生成了视觉上更清晰的模型,但它们最终未能忠实捕捉完整的目标分布。在本研究中,我们表明,限制SDI采样能力的瓶颈源于其对后验均值估计器的依赖,该估计器在数学上等同于确定性反向DDIM轨迹的单步Euler近似。为了解决这一问题,我们提出了一种自然动机的扩展,称为概率流蒸馏(Probability-Flow Distillation, PFD)。我们建立了PFD与Wasserstein梯度流完全对应,从而引入了原则性的分布匹配动态。最后,我们展示了PFD能够合成具有细致、高保真细节的3D资产,并且与现有方法相比,质量得到了改善。
cs.CV / 121 / 2605.09089
Field-Localized Forgery Detection for Digital Identity Documents
面向数字身份文件的区域定位伪造检测
Abstract
Digital identity verification systems used in remote onboarding rely on document images to authenticate users, making them vulnerable to localized manipulations of key identity fields such as facial photographs and textual information. Existing forgery detection methods, developed primarily for natural-image forensics, show limited transferability to structured identity documents. We propose FLiD, a lightweight field-localized framework that targets critical identity regions rather than processing full-document images. A fine-tuned object detector first localizes face and text fields; a frozen MobileNetV3-Small backbone then extracts compact field-level embeddings, which are classified by a lightweight neural network with only 191K trainable parameters. FLiD achieves AUC scores of 0.880 (face), 0.954 (text), and 0.923 (both-field attacks), with corresponding EERs of 18.05%, 11.61%, and 15.16%, representing absolute reductions of 29-35 percentage points over a full-document baseline trained from scratch. FLiD also consistently outperforms general-purpose manipulation detectors (TruFor, MMFusion, UniVAD) across all attack scenarios while requiring 13x fewer parameters and 21x fewer FLOPs.
Chinese Translation
远程入职中使用的数字身份验证系统依赖于文档图像来验证用户,这使其容易受到关键身份字段(如面部照片和文本信息)的局部篡改。现有的伪造检测方法主要针对自然图像取证,显示出对结构化身份文件的迁移性有限。我们提出了FLiD,一个轻量级的区域定位框架,专注于关键身份区域,而不是处理完整文档图像。经过微调的目标检测器首先定位面部和文本字段;然后,冻结的MobileNetV3-Small主干网络提取紧凑的字段级嵌入,这些嵌入由仅有191K可训练参数的轻量级神经网络进行分类。FLiD在面部、文本和双字段攻击中的AUC得分分别为0.880、0.954和0.923,相应的EER分别为18.05%、11.61%和15.16%,相比从零开始训练的完整文档基线,绝对减少了29-35个百分点。FLiD在所有攻击场景中也始终优于通用的篡改检测器(TruFor、MMFusion、UniVAD),同时所需参数减少了13倍,FLOPs减少了21倍。
cs.CV / 122 / 2605.09090
Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations
在受控反事实扰动下研究视觉定位中的各向异性
Abstract
Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability. Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation protocol is introduced to systematically perturb object or contextual components within predefined embedding similarity intervals, enabling a fine-grained analysis of grounding behavior as a function of alignment. Experiments on two Transformer-based models with markedly different embedding geometries (BERT-based TransVG and CLIP-based SwimVG) reveal no meaningful correlation between cosine similarity and approximation. These findings suggest that anisotropy alone does not account for counterfactual errors, and that robustness requires investigating finer-grained geometric properties of the embedding space.
Chinese Translation
视觉定位基准假设通过指称表达描述的对象总是存在于图像中,因此定位模型很少在语义不匹配的标题下进行评估。在这种情况下,模型通常表现出近似行为,生成一个合理的边界框,仅满足表达的一部分(例如,保留原始对象而忽略修改后的上下文线索)。由于不匹配的标题代表了现实中的边缘情况,这种行为损害了可靠性,并从可解释性角度引发了担忧。因此,识别其潜在原因对于提高模型的忠实性和可解释性至关重要。本研究采用机械可解释性的视角,考察嵌入各向异性是否会导致反事实失败。引入了一种相似性控制的反事实标题生成协议,以系统性地扰动预定义嵌入相似性区间内的对象或上下文组件,从而实现对定位行为的细致分析,作为对齐的函数。对两种具有显著不同嵌入几何特征的基于Transformer的模型(基于BERT的TransVG和基于CLIP的SwimVG)的实验表明,余弦相似性与近似之间没有显著相关性。这些发现表明,仅靠各向异性无法解释反事实错误,而稳健性需要研究嵌入空间的更细粒度几何特性。
cs.CV / 123 / 2605.09132
KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection
KEPIL:知识增强的提示-图像学习用于提示鲁棒的疾病检测
Abstract
Vision-language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. These results suggest that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at https://github.com/Roypic/KEPIL.
Chinese Translation
视觉-语言模型(VLMs)在放射学临床决策支持中展现出良好的前景,因为它们能够对放射图像和临床文本进行联合推理,从而利用互补的临床信息。然而,放射学发现的分布在实践中呈现长尾特征,导致某些疾病的样本不足,使得零-shot 推理变得至关重要。然而,当前的 CLIP 风格医学 VLMs 对提示变化敏感,并且在推理时通常缺乏可靠的外部知识,这妨碍了其在临床中的可靠部署。我们提出了 KEPIL,一个提示鲁棒的框架,集成了经过筛选的医学知识,以稳定零-shot 泛化。KEPIL 包括:(i)利用本体和大型语言模型(LLM)辅助的动态提示增强,(ii)通过双嵌入目标对等效提示变体的嵌入进行对齐的语义感知对比损失,以及(iii)以实体为中心的报告标准化,以生成与本体对齐的表示。在七个基准测试中,KEPIL 实现了最先进的零-shot 推理性能;在提示变化测试中,它在 CheXpert 上将 AUC 提高了 6.37%,平均提高了 4.11%。这些结果表明,结构化知识和鲁棒的提示设计是临床可靠的放射学 VLMs 的关键。代码将发布在 https://github.com/Roypic/KEPIL。
cs.CV / 124 / 2605.09146
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
超越思维:在360$^\circ$中进行类人视觉搜索的想象
Abstract
Humanoid Visual Search (HVS) requires agents to actively explore immersive 360$^\circ$ environments. While prior methods treat this as a monolithic task relying on cumulative, multi-turn Chain-of-Thought (CoT) reasoning, they impose heavy cognitive burdens and require expensive trajectory-level annotations. In this paper, we propose Imagining in 360$^\circ$, a novel framework that decouples the exploration process into a specialized Imaginator and an Actor. The Imaginator functions as a probabilistic predictor of spatial priors; instead of maintaining a cumulative reasoning chain, it infers the semantic layout of both observed and unobserved regions in a single step. By sampling multiple hypotheses within this semantic space, we provide the Actor with a distribution of effective spatial information, offering robust guidance that hedges against uncertainty during active search. This decoupled architecture significantly lowers data engineering costs by eliminating the need for full-trajectory CoT annotations, enabling the generation of over 1.96 million curated training samples. Extensive experiments demonstrate that explicitly modeling semantic spatial priors drastically improves search efficiency and success rates in complex, in-the-wild environments.
Chinese Translation
类人视觉搜索(HVS)要求代理在沉浸式360$^\circ$环境中主动探索。虽然以往的方法将其视为依赖于累积的多轮链式思维(Chain-of-Thought, CoT)推理的单一任务,但这些方法带来了沉重的认知负担,并需要昂贵的轨迹级注释。在本文中,我们提出了360$^\circ$想象(Imagining in 360$^\circ$),一个将探索过程解耦为专门的想象者(Imaginator)和执行者(Actor)的新框架。想象者作为空间先验的概率预测器;它不再维持累积推理链,而是在单一步骤中推断观察到和未观察到区域的语义布局。通过在这一语义空间中采样多个假设,我们为执行者提供了一种有效空间信息的分布,提供了在主动搜索过程中对不确定性的强有力指导。这种解耦架构显著降低了数据工程成本,消除了对完整轨迹CoT注释的需求,使得生成超过196万条精心策划的训练样本成为可能。大量实验表明,显式建模语义空间先验显著提高了在复杂的真实环境中的搜索效率和成功率。
cs.CV / 125 / 2605.09151
MultiMedVision: Multi-Modal Medical Vision Framework
MultiMedVision:多模态医学视觉框架
Abstract
Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g. X-ray) and 3D (e.g. CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), and using a single shared encoder with 5x less data, MultiMedVision achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces, demonstrating that unified cross-dimensional representation learning is feasible without sacrificing modality-specific performance.
Chinese Translation
多模态医学成像能够实现全面的诊断,然而当前的基础模型使用独立的、特定维度的架构处理2D(例如X光)和3D(例如CT)数据。我们提出了MultiMedVision,这是一个基于稀疏视觉变换器(Sparse Vision Transformer)的统一框架,用于联合2D/3D表示学习。我们的模型使用3D旋转位置嵌入(3D Rotary Positional Embeddings)和可变长度序列打包,能够在共享潜在空间中原生处理混合模态批次,而无需特定于模态的适配器或将3D体积视为2D切片序列。在胸部X光(MIMIC-CXR)和CT扫描(CT-RATE)上使用自监督目标进行训练,并使用一个共享编码器,数据量减少了5倍,MultiMedVision在2D基准(MIMIC上的宏观AUROC为0.82,CheXpert上的为0.84)和3D任务(CT-RATE上的为0.85)上均表现出竞争力。对学习到的表示的分析揭示了共存的模态特定和共享特征子空间,证明了在不牺牲模态特定性能的情况下,统一的跨维度表示学习是可行的。
cs.CV / 126 / 2605.09181
Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework
建立稳健的视网膜眼动追踪:一种弱监督算法框架
Abstract
Retinal image-based eye tracking is widely used in ophthalmic imaging and vision science, and is a promising path to deliver higher gaze accuracy than the pupil- and cornea-based approaches commonly used in modern AR/VR devices. Nevertheless, existing retinal tracking algorithms still primarily rely on classical template-matching registration, which can be insufficiently robust to retinal feature variability and real-world imaging conditions. In this work, we propose a novel weakly-supervised, learning-based framework for robust retinal eye tracking. Initial studies demonstrate high accuracy, with a 95th-percentile gaze error of < 0.45 deg across a cohort of 6 participants.
Chinese Translation
基于视网膜图像的眼动追踪在眼科成像和视觉科学中被广泛应用,是实现比现代增强现实/虚拟现实设备中常用的瞳孔和角膜方法更高注视精度的有希望的途径。然而,现有的视网膜追踪算法仍主要依赖于经典的模板匹配配准,这在面对视网膜特征的变异性和现实世界成像条件时可能不够稳健。在本研究中,我们提出了一种新颖的弱监督学习框架,以实现稳健的视网膜眼动追踪。初步研究表明,该方法具有高精度,在6名参与者的样本中实现了第95百分位注视误差< 0.45度。
cs.CV / 127 / 2605.09190
AQMP: Image compression through Adaptive Quadtree Refinement and Matching Pursuit with Hyperparameter Optimization
AQMP:通过自适应四叉树细化和匹配追踪进行图像压缩的超参数优化
Abstract
We present AQMP, a novel image codec combining Adaptive Quadtree Refinement with Matching Pursuit. Unlike conventional Matching Pursuit methods that operate on fixed-size sub-images, AQMP dynamically adapts block sizes to local image structure, allocating finer partitions where the image is complex and coarser ones where it is smooth. This adaptivity yields superior compression ratios compared to fixed-size block Matching Pursuit at equivalent image quality, while offering significant parallelization opportunities at both the tree-leaf level and during compression of individual nodes. The algorithm is governed by user-specified accuracy and sparsity parameters alongside a small set of additional hyperparameters. To navigate the trade-off between compression efficiency and visual quality, we perform multi-objective hyperparameter optimization using the Tree-Structured Parzen Estimator, producing comprehensive Pareto fronts. Experimental results show that AQMP achieves up to $4\times$ higher compression rates than JPEG at comparable SSIM values, while maintaining competitive quality across a broad range of compression regimes. Performance evaluation is provided using a representative set of test images. To ensure reproducibility and promote adoption, we have made our implementation publicly available on GitHub under the MIT license.
Chinese Translation
我们提出了AQMP,这是一种结合了自适应四叉树细化和匹配追踪的新型图像编解码器。与传统的在固定大小子图像上操作的匹配追踪方法不同,AQMP动态地根据局部图像结构调整块大小,在图像复杂的地方分配更细的分区,而在平滑区域则使用较粗的分区。这种自适应性在相同图像质量下,相较于固定大小块的匹配追踪,提供了更优的压缩比,同时在树叶级别和单个节点的压缩过程中提供了显著的并行化机会。该算法由用户指定的准确性和稀疏性参数以及一小组额外的超参数控制。为了平衡压缩效率和视觉质量之间的权衡,我们使用树结构的帕尔岑估计器进行多目标超参数优化,生成全面的帕累托前沿。实验结果表明,AQMP在可比的SSIM值下实现了高达4倍于JPEG的压缩率,同时在广泛的压缩范围内保持了竞争力的质量。我们使用一组代表性的测试图像进行了性能评估。为了确保可重复性并促进采用,我们已将我们的实现公开发布在GitHub上,并采用MIT许可证。
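The adaptive partitioning idea in this abstract (fine blocks where the image is complex, coarse blocks where it is smooth) can be sketched as a recursive quadtree split. The variance criterion, the `threshold` and `min_size` parameters, and the function names are assumptions made for illustration; the abstract only says that refinement is governed by user-specified accuracy and sparsity parameters, and the Matching Pursuit coding of each leaf is not shown.

```python
# Sketch of variance-driven quadtree refinement on a square grayscale image.
# The split rule and all parameter names are hypothetical, not AQMP's own.

def block_variance(img, x, y, size):
    """Population variance of the pixels in the size-by-size block at (x, y)."""
    vals = [img[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def quadtree_leaves(img, x=0, y=0, size=None, threshold=10.0, min_size=2):
    """Return leaf blocks as (x, y, size); the image side must be a power of two."""
    if size is None:
        size = len(img)
    # Stop when the block is small enough or already smooth enough.
    if size <= min_size or block_variance(img, x, y, size) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x + dx, y + dy, half, threshold, min_size)
    return leaves


# Toy 8x8 image: a checkerboard in the top-left quadrant, flat elsewhere.
toy = [[(255 if (r + c) % 2 else 0) if (r < 4 and c < 4) else 100
        for c in range(8)] for r in range(8)]
leaves = quadtree_leaves(toy)
```

On the toy image only the textured quadrant is subdivided, giving four 2x2 leaves there and three 4x4 leaves elsewhere. Because the leaves are independent, each one could be coded in parallel, which matches the parallelization opportunity the abstract mentions at the tree-leaf level.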
cs.CV / 128 / 2605.09196
RigidFormer: Learning Rigid Dynamics using Transformers
RigidFormer:使用变换器学习刚体动力学
Abstract
Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations, therefore, remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
Chinese Translation
基于学习的多物体刚体动力学仿真仍然面临挑战,因为接触是非连续的,并且在长时间范围内误差会累积。大多数现有方法仍然依赖于网格连接性和顶点级消息传递,这限制了它们在无网格输入(如点云)上的适用性,并导致高计算成本。因此,从无网格表示中有效建模高保真刚体动力学仍然具有挑战性。我们提出了RigidFormer,一种基于对象中心的变换器模型,能够学习无网格刚体动力学,并具有可控的积分步长。RigidFormer在对象级别进行推理,并通过紧凑的锚点推进每个对象;Anchor-Vertex Pooling通过局部顶点特征丰富这些锚点,保留与接触相关的几何形状,而无需密集的顶点级交互。我们提出了基于锚点的RoPE,将锚点几何信息注入注意力机制,同时尊重对象和锚点的无序特性:对象标记处理是置换等变的,均值池化的锚点描述符对锚点重新索引是不变的,同时保留形状范围。RigidFormer进一步通过使用可微分的Kabsch对齐将更新投影到刚体流形上来强化刚性。在标准基准测试中,RigidFormer在使用点输入时超越或匹配基于网格的基线,运行速度更快,能够推广到未见过的点分辨率和跨数据集,并扩展到200多个对象;我们还展示了通过将身体部件视为相互作用的对象级组件,初步扩展到命令条件的关节体。
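The rigid-projection step mentioned in this abstract relies on Kabsch alignment, which finds the least-squares rigid motion between two point sets. The sketch below uses the 2D case, where the optimal rotation has a closed form; the 3D case used for rigid bodies requires an SVD instead. This is an illustrative analogue with hypothetical function names, not RigidFormer's implementation.

```python
# 2D analogue of Kabsch alignment in closed form (illustrative only).
# Given predicted per-point positions, it recovers the closest rigid motion.
import math


def rigid_fit_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    cross = 0.0
    dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        # Centre both point sets before accumulating the correlation terms.
        px, py, qx, qy = px - sx, py - sy, qx - dx, qy - dy
        cross += px * qy - py * qx
        dot += px * qx + py * qy
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    t = (dx - (c * sx - s * sy), dy - (s * sx + c * sy))
    return theta, t


def apply_rigid(points, theta, t):
    """Rotate points by theta about the origin, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]


# Toy check: recover a known rotation and translation.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = apply_rigid(src, 0.3, (2.0, -1.0))
theta, t = rigid_fit_2d(src, dst)
```

Replacing a network's raw per-point prediction with `apply_rigid(src, theta, t)` guarantees that pairwise distances within the object are preserved, which is the sense in which an update is projected onto the rigid-body manifold.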
cs.CV / 129 / 2605.09218
Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
Flame3D:使用代理语言模型进行零样本3D场景组合推理
Abstract
3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be added to the memory without retraining. In addition to showing performance competitive with finetuned 3D-LMM methods on ScanQA, we study the multi-hop 3D reasoning capabilities of Flame3D by evaluating it on a curated compositional spatial-reasoning benchmark, Compose3D. We find that fixed tools fall short and that the agent's ability to synthesize spatial operations at inference time is essential. These results invite the question: should future progress in 3D scene understanding focus on richer scene memories and expressive compositional abstractions?
Chinese Translation
3D场景理解涉及对自由空间、物体定位、假设物体插入、复杂几何关系的推理,以及将所有这些与外部工具和数据源整合在一起。现有的3D理解方法通常依赖于大规模的3D-语言训练,或专注于物体定位和简单的空间关系。我们认为,激励3D-语言训练的广泛泛化可以在推理时实现,而无需进行3D特定的训练。我们提出了Flame3D,这是一个无训练框架,将场景表示为可编辑的视觉-文本3D记忆,并通过可组合的空间工具将其暴露给现成的多模态语言模型(MLLM)。Flame3D还允许代理在推理时合成自定义空间程序,从而实现对布局、空白空间和尚未出现在场景中的物体的开放式推理。外部数据和修正可以在不重新训练的情况下添加到记忆中。除了在ScanQA上展示与微调的3D-LMM方法的竞争性能外,我们还通过在一个精心策划的组合空间推理基准Compose3D上评估Flame3D的多跳3D推理能力。我们发现固定工具的表现不足,而代理在推理时合成空间操作的能力至关重要。这些结果引发了一个问题:未来的3D场景理解进展是否应集中于更丰富的场景记忆和富有表现力的组合抽象?
cs.CV / 130 / 2605.09223
CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS:基于曲率的时间选择方法用于高效长视频理解
Abstract
Understanding long videos with multimodal large language models (MLLMs) requires selecting a small subset of informative frames under strict computational budgets, where exhaustive processing is infeasible and optimal selection is combinatorial. We propose CATS, a curvature-aware frame selection method that explicitly models the temporal geometry of query-frame relevance to identify salient events and their surrounding context. By leveraging temporal curvature to adapt selection density, CATS captures both abrupt transitions and gradually evolving content while suppressing redundant frames. Under a fixed backbone and frame budget, CATS consistently outperforms prior lightweight approaches such as AKS on LongVideoBench and VideoMME. While multi-stage methods such as MIRA achieve higher absolute accuracy, they incur substantial computational overhead; in contrast, CATS retains approximately 93-95% of MIRA's performance while requiring only 3-4% of its preprocessing cost, yielding a favorable efficiency-accuracy trade-off. Beyond answer accuracy, we evaluate description generation using an LLM-as-a-judge protocol, and the obtained results show that CATS produces more coherent and informative outputs, indicating improved grounding in visual evidence. These results position CATS as a computationally efficient and principled approach to long-video understanding.
Chinese Translation
使用多模态大型语言模型(MLLMs)理解长视频需要在严格的计算预算下选择一小部分信息丰富的帧,其中全面处理不可行,而最佳选择是组合性的。我们提出了CATS,一种基于曲率的帧选择方法,明确建模查询帧相关性的时间几何,以识别显著事件及其周围上下文。通过利用时间曲率来调整选择密度,CATS能够捕捉突发转变和逐渐演变的内容,同时抑制冗余帧。在固定的骨干网络和帧预算下,CATS在LongVideoBench和VideoMME上始终优于先前的轻量级方法,如AKS。尽管多阶段方法如MIRA实现了更高的绝对准确性,但它们会产生大量的计算开销;相比之下,CATS在仅需3-4%的预处理成本的情况下,保持了约93-95%的MIRA性能,展现出良好的效率-准确性权衡。除了答案准确性外,我们还使用LLM-as-a-judge协议评估描述生成,结果表明CATS产生了更连贯和信息丰富的输出,表明在视觉证据中的基础得到了改善。这些结果使CATS成为一种计算高效且原则明确的长视频理解方法。
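The CATS abstract does not give its selection rule, so the following is only a rough sketch of what curvature-aware frame selection could look like. It assumes a per-frame query relevance score is already available, approximates temporal curvature by the absolute second difference, and uses a greedy minimum-gap rule to suppress redundant neighbors. The names `temporal_curvature` and `select_frames` and the parameters `weight` and `min_gap` are hypothetical.

```python
def temporal_curvature(relevance):
    """Absolute discrete second difference of the relevance curve (0 at the endpoints)."""
    n = len(relevance)
    curv = [0.0] * n
    for t in range(1, n - 1):
        curv[t] = abs(relevance[t + 1] - 2 * relevance[t] + relevance[t - 1])
    return curv


def select_frames(relevance, budget, weight=1.0, min_gap=1):
    """Greedily pick up to `budget` frames by relevance plus weighted curvature."""
    curv = temporal_curvature(relevance)
    score = [r + weight * c for r, c in zip(relevance, curv)]
    order = sorted(range(len(score)), key=lambda t: score[t], reverse=True)
    chosen = []
    for t in order:
        # Skip frames too close to an already selected one (redundancy suppression).
        if all(abs(t - c) >= min_gap for c in chosen):
            chosen.append(t)
        if len(chosen) == budget:
            break
    return sorted(chosen)
```

With a flat relevance curve containing one sharp spike, the spike frame receives the highest score, and `min_gap` keeps its immediate neighbors from consuming the rest of the budget.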
cs.CV / 131 / 2605.09231
An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories
用于骨架姿态轨迹的弹性形状变分自编码器
Abstract
Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and text. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors (such as camera orientation, subject scale, viewpoint, and execution speed) rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape - Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, as well as temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
Chinese Translation
深度生成模型为建模复杂的结构化数据(如图像、视频、3D物体和文本)提供了灵活的框架。然而,当应用于人类骨架序列时,标准的变分自编码器(VAEs)往往将大量容量分配给干扰因素,例如相机方向、主体尺度、视角和执行速度,而不是形状及其运动的内在几何特征。我们提出了弹性形状变分自编码器(Elastic Shape - Variational Autoencoder, ES-VAE),这是一种关注几何特征的生成模型,专为骨架轨迹设计,利用了Kendall形状流形上的运输平方根速度场(Transported Square-Root Velocity Field, TSRVF)表示。这种表示本质上消除了形状的刚性平移、旋转和全局缩放,以及序列的时间速率变异性,从而孤立出潜在的形状动态。ES-VAE编码器将骨架序列映射到一个低维潜在空间,结合了黎曼对数映射,而解码器则使用相应的指数映射重建序列。我们在两个数据集上展示了ES-VAE的有效性。首先,我们分析骨架步态周期,以预测临床移动性评分并将受试者分类为健康组和中风后组。其次,我们在NTU RGB+D数据集上评估动作识别。在这两个设置中,ES-VAE始终优于标准VAEs和一系列序列建模基线,包括时间卷积网络、变压器和图卷积网络。更广泛地说,ES-VAE为在姿态形状流形上学习纵向数据的生成模型提供了一个原则性框架,相较于现有的深度学习方法,提供了更好的潜在表示和下游性能。
cs.CV / 132 / 2605.09233
Towards Robust Sequential Decomposition for Complex Image Editing
面向复杂图像编辑的鲁棒序列分解
Abstract
Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing, which decomposes the task into simpler steps, suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine the editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we find that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.
Chinese Translation
近年来,视觉生成模型的进步使得基于人类指令的高保真图像编辑成为可能。然而,这些模型在处理涉及组合编辑操作或步骤间依赖关系的复杂指令时常常面临困难。这一问题源于两种经典范式的局限性:(1) 单轮编辑,试图在一次性操作中应用所有指令编辑,往往无法准确解析复杂指令,导致不必要的编辑;(2) 序列编辑可以将任务分解为更简单的步骤,但由于序列执行引入的累积误差,结果的保真度较低。为了为复杂图像编辑提供鲁棒的解决方案,我们在统一的上下文编辑框架下考察不同范式的编辑行为,并研究如何平衡序列分解的优势与其误差累积的缺点。我们进一步开发了一种合成数据管道,构建具有不同指令复杂度的编辑任务,使我们能够策划一个大规模的高质量分解序列编辑数据集。通过在合成数据上进行微调,我们发现,采用适当设计的编辑范式,序列分解即使在任务复杂度增加时也能带来鲁棒的改善。此外,从合成任务中学习到的分解技能可以通过与真实世界编辑数据的共同训练转移到真实图像上,展示了在更广泛领域内应对复杂图像编辑的仿真到真实的推广潜力。
cs.CV / 133 / 2605.09245
CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking
CalibFree:一种无需校准的自监督视图特征分离方法用于多摄像头多目标跟踪
Abstract
Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CalibFree, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By promoting feature separation between view-agnostic and view-specific representations through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead. Experiments on the MMP-MvMHAT dataset show a 3% improvement in overall accuracy and a 7.5% increase in the average F1 score over state-of-the-art approaches, confirming the effectiveness of our calibration-free design. Moreover, on the more diverse MvMHAT dataset, our approach demonstrates superior over-time tracking and strong cross-view performance, highlighting its adaptability to a wide range of camera configurations. Code will be publicly available upon acceptance.
Chinese Translation
多摄像头多目标跟踪(MCMOT)在不同摄像头视角下保持一致的目标身份面临重大挑战,尤其是在需要精确校准和大量标注的情况下。本文提出了CalibFree,一种自监督表示学习框架,针对MCMOT任务无需任何校准或人工标注。通过单视图蒸馏和跨视图重建促进视角无关和视角特定表示之间的特征分离,我们的方法能够在复杂动态场景中以最小的开销进行适应。我们在MMP-MvMHAT数据集上的实验显示,与最先进的方法相比,整体准确率提高了3%,平均F1分数提高了7.5%,验证了我们无校准设计的有效性。此外,在更具多样性的MvMHAT数据集上,我们的方法展示了优越的时间跟踪能力和强大的跨视图性能,突显了其对多种摄像头配置的适应性。代码将在论文接受后公开。
cs.CV / 134 / 2605.09258
Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models
基于逆向运动学的单目指尖生物力学追踪方法
Abstract
Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.
Chinese Translation
从视频中准确追踪手部和手指对于监测日常生活活动和测量关节活动范围具有重要的临床应用,然而,基于单目视频获取手部生物力学的方法仍然不够成熟。我们提出了一种方法,将SAM 3D Body基础模型与全身生物力学模型中的逆向运动学优化相结合,从单视角视频中提取解剖约束的手指关节角度。我们将SAM 3D Body从PyTorch移植到JAX,以便与MuJoCo-MJX集成,实现GPU加速优化,并开发了一种新的映射关系,将Momentum Human Rig (MHR)的输出与生物力学模型标记相对应。我们以8摄像头多视角重建为参考,在7名参与者执行各种手部姿势和物体操作任务的4,590帧图像上进行验证,经过Procrustes对齐后,手指关节角度误差约为10度,手部位置误差约为6毫米。结果在不同摄像头视角间一致,并且对从多视角视频生成参考值的不同方法具有鲁棒性。本研究将单目生物力学分析扩展到详细的手指追踪,拓宽了从易于获取的视频中定量表征手部运动的途径。
cs.CV / 135 / 2605.09262
Reinforcing Multimodal Reasoning Against Visual Degradation
增强多模态推理以应对视觉退化
Abstract
Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
Chinese Translation
强化学习显著提升了多模态大型语言模型(MLLMs)的推理能力,但所产生的策略在面对模糊、压缩伪影和低分辨率扫描等现实世界的视觉退化时依然脆弱。以往的视觉和深度强化学习的鲁棒性技术依赖于静态数据增强或基于价值的正则化,这两者都无法有效转移到无评论员的自回归MLLMs的微调中。增强对这些退化的推理并非易事:在回滚期间简单地注入退化视图会导致奖励中毒,其中感知遮挡触发幻觉轨迹并使优化不稳定。我们提出了ROMA,一个强化学习微调框架,旨在修改优化动态,以增强对视觉退化的推理,同时保持干净输入的性能。双前向传递策略利用教师强制将退化视图与干净图像轨迹进行评估,避免在退化输入上进行新的回滚。为了实现分布一致性,我们对最坏情况的增强应用了基于标记级的替代KL惩罚;为了防止在正则化下策略崩溃,锚定于干净图像优势的辅助策略梯度损失保留了可靠的奖励信号;为了避免系统性错误的不变性,基于正确性的正则化将强制限制在成功轨迹上。在Qwen3-VL 4B/8B的七个多模态推理基准测试中,我们的方法在已见和未见的退化上分别比GRPO提高了+2.4%和+2.3%的鲁棒性,同时保持了干净准确度。
cs.CV / 136 / 2605.09296
Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
微缺陷揭示宏伪造:通过局部分布变化检测AI生成图像
Abstract
Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI-generated images. Yet existing detectors based on pre-trained feature extractors tend to over-rely on global semantics, limiting sensitivity to the critical micro-defects. In this work, we propose Micro-Defects expose Macro-Fakes (MDMF), a local distribution-aware detection framework that amplifies micro-scale statistical irregularities into macro-level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory-grounded analysis shows that patch-wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF-project/
Chinese Translation
最近的生成模型能够生成看起来极为真实的图像,这给区分真实图像和AI生成图像带来了挑战。然而,基于预训练特征提取器的现有检测器往往过于依赖全局语义,限制了对关键微缺陷的敏感性。在本研究中,我们提出了微缺陷揭示宏伪造(Micro-Defects expose Macro-Fakes,MDMF),这是一种局部分布感知的检测框架,能够将微观统计不规则性放大为宏观层面的分布差异。为了避免局部取证线索被简单聚合稀释,我们引入了一种可学习的补丁取证特征(Patch Forensic Signature),将语义补丁嵌入投影到一个紧凑的取证潜在空间。然后,我们使用最大均值差异(Maximum Mean Discrepancy,MMD)来量化生成图像与真实图像之间的分布差异。我们的理论分析表明,当生成图像中存在局部取证信号时,基于补丁的建模能够产生可证明的更大差异,从而实现与真实图像的更可靠分离。大量实验表明,MDMF在多个基准测试中始终优于基线检测器,验证了其普遍有效性。项目页面:https://zbox1005.github.io/MDMF-project/
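The MDMF abstract relies on Maximum Mean Discrepancy (MMD) between sets of patch-level features. As a minimal sketch of that statistic, the block below computes the standard biased estimate of squared MMD with a Gaussian (RBF) kernel on plain Python lists. The kernel choice, the bandwidth, and the toy feature vectors are assumptions; the paper's learned Patch Forensic Signature is not modeled here.

```python
import math


def rbf(a, b, bandwidth=1.0):
    """Gaussian kernel between two feature vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

In a detector of this kind, `xs` would hold patch embeddings from a test image and `ys` embeddings from reference real images; a larger value indicates a larger distributional gap.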
cs.CV / 137 / 2605.09312
Low-Cost Neural Radiance Fields
低成本神经辐射场
Abstract
Neural Radiance Fields (NeRF) achieve high-quality novel-view synthesis, but their long training times and reliance on dense input views limit accessibility. We present a comparative study of three accelerated NeRF variants (DS-NeRF, TensoRF, and HashNeRF) and explore extensions targeted at the low-compute, low-data regime. First, we add a depth-supervision loss derived from COLMAP keypoints to TensoRF (TensoRF-DS) and evaluate it on the LLFF dataset under reduced view counts. Second, we ablate the feature-decoding MLP of TensoRF and study the effect of input downsampling on PSNR and runtime on the synthetic Lego scene. Third, we propose four architectural variants of the HashNeRF color and density networks, including residual and convolutional designs, and report PSNR/training-time tradeoffs under matched iteration budgets. Under iso-time evaluation, none of our extensions conclusively outperform the published baselines, but the experiments characterize which extensions transfer to constrained settings and surface design questions for future work.
Chinese Translation
神经辐射场(NeRF)实现了高质量的新视角合成,但其较长的训练时间和对密集输入视角的依赖限制了其可访问性。我们对三种加速的NeRF变体——DS-NeRF、TensoRF和HashNeRF进行了比较研究,并探讨了针对低计算、低数据环境的扩展。首先,我们将基于COLMAP关键点的深度监督损失添加到TensoRF中(TensoRF-DS),并在减少视角数量的情况下在LLFF数据集上进行评估。其次,我们对TensoRF的特征解码多层感知器(MLP)进行了消融实验,研究输入下采样对合成乐高场景的PSNR和运行时间的影响。第三,我们提出了HashNeRF颜色和密度网络的四种架构变体,包括残差和卷积设计,并在匹配的迭代预算下报告PSNR/训练时间的权衡。在等时间评估下,我们的扩展没有明确超越已发布的基线,但实验表征了哪些扩展可以转移到受限环境,并提出了未来工作的设计问题。
cs.CV / 138 / 2605.09313
Attention Sinks in Diffusion Transformers: A Causal Analysis
扩散变换器中的注意力沉没:因果分析
Abstract
Attention sinks (tokens that receive disproportionate attention mass) are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion 3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at $k=1$; only under stronger interventions ($k \geq 10$) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless sink-specific, about $6\times$ larger than equal-budget random masking, revealing an empirical dissociation between trajectory-level perturbation and semantic alignment in diffusion transformers. Code is available at https://github.com/wfz666/ICML26-attention-sink.
Chinese Translation
注意力沉没(即接收不成比例注意力的标记)在自回归语言模型中被认为具有功能重要性,但它们在扩散变换器中的作用仍不清楚。我们在文本到图像的扩散中进行了因果分析,动态识别每个时间步的主要注意力接收者,并通过配对的无训练干预措施对得分和价值路径进行抑制。在对Stable Diffusion 3(与SDXL的验证)进行的553个GenEval提示中,移除这些沉没并未降低文本-图像对齐(CLIP-T)或偏好代理(ImageReward, HPS-v2)在$k=1$时的表现;只有在更强的干预下($k \geq 10$)HPS-v2才表现出度量依赖的边界,而CLIP-T在整个过程中保持稳健。尽管如此,抑制所引起的感知变化仍然是沉没特定的,约为等预算随机掩蔽的$6\times$,揭示了扩散变换器中轨迹级扰动与语义对齐之间的经验性分离。
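The intervention described in the attention-sink abstract (find the dominant attention recipients, then suppress them) can be sketched on a single attention matrix. The block below is a simplified score-path illustration only: it ranks keys by total incoming attention, zeroes the top `k` columns, and renormalizes each row. The paired value-path intervention and the per-timestep loop are not modeled, and `top_k_sinks` and `suppress` are hypothetical names.

```python
def top_k_sinks(attn, k=1):
    """Indices of the k keys receiving the most total attention (column sums)."""
    n_keys = len(attn[0])
    mass = [sum(row[j] for row in attn) for j in range(n_keys)]
    return sorted(range(n_keys), key=lambda j: mass[j], reverse=True)[:k]


def suppress(attn, sinks):
    """Zero the sink columns and renormalize each row to sum to 1."""
    blocked = set(sinks)
    out = []
    for row in attn:
        kept = [0.0 if j in blocked else v for j, v in enumerate(row)]
        total = sum(kept)
        # If a row attended only to sinks, leave it as all zeros.
        out.append([v / total for v in kept] if total > 0 else kept)
    return out
```

An equal-budget random-masking control, as mentioned in the abstract, would call `suppress` with `k` randomly chosen key indices instead of the output of `top_k_sinks`.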
cs.CV / 139 / 2605.09319
PGID: Progressive Guided Inversion and Denoising for Robust Watermark Detection
PGID:用于鲁棒水印检测的渐进式引导反演与去噪
Abstract
With the proliferation of AI-generated images, digital watermarking has become an essential safeguard for protecting intellectual property and mitigating malicious exploitation. Recent works on semantic watermarking have enabled efficient copyright protection for diffusion models. However, the dependence of semantic watermarking on diffusion inversion for watermark detection creates a critical vulnerability. Imprint removal and forgery attacks exploit this weakness to produce deceptive results. Our analysis reveals that these attacks succeed by displacing watermarked latents into the unwatermarked region, while guiding unwatermarked latents into the watermarked region. Based on that, we propose Progressive Guided Inversion and Denoising (PGID), the first plug-and-play, training-free noise extraction framework designed to defend against both attack strategies. PGID effectively defends by projecting perturbed latents back to the region where they originally belong. The projection is achieved by eliminating intermediate latent deflections and mitigating adversarial perturbations through progressive inversion-denoising cycles. Comprehensive evaluations across multiple schemes demonstrate that PGID successfully restores detection reliability by recovering removed watermarks and identifying forged instances.
Chinese Translation
随着人工智能生成图像的普及,数字水印已成为保护知识产权和减轻恶意利用的重要保障。近期关于语义水印的研究使得扩散模型的版权保护变得高效。然而,语义水印在水印检测中对扩散反演的依赖性造成了一个关键的脆弱性。印记去除和伪造攻击利用这一弱点产生误导性结果。我们的分析表明,这些攻击通过将带水印的潜在特征移位到无水印区域,同时引导无水印的潜在特征进入带水印区域而成功。基于此,我们提出了渐进式引导反演与去噪(Progressive Guided Inversion and Denoising,PGID),这是第一个即插即用、无训练的噪声提取框架,旨在防御这两种攻击策略。PGID通过将扰动的潜在特征投影回其原始所属区域来有效防御。该投影是通过消除中间潜在偏转并通过渐进式反演-去噪循环减轻对抗扰动来实现的。对多个方案的全面评估表明,PGID成功恢复了检测的可靠性,通过恢复去除的水印和识别伪造实例。
cs.CV / 140 / 2605.09328
Noise-Started One-Step Real-World Super-Resolution via LR-Conditioned SplitMeanFlow and GAN Refinement
基于噪声启动的一步法真实世界超分辨率:LR条件下的SplitMeanFlow与GAN精细化
Abstract
Pre-trained text-to-image (T2I) diffusion models have shown strong potential for real-world image super-resolution (Real-ISR), owing to their noise-started generation process that enables realistic texture synthesis and captures the one-to-many nature of super-resolution. However, diffusion-based Real-ISR methods still face a fundamental efficiency-quality trade-off. Multi-step methods generate high-quality results by iteratively denoising random Gaussian noise under LR conditioning, but suffer from slow sampling. Recent one-step methods greatly improve efficiency, yet they typically replace noise-started generation with direct LR-to-HR restoration, which weakens stochasticity and limits realistic detail synthesis. To address this issue, we propose SMFSR, a noise-started one-step Real-ISR framework via LR-conditioned SplitMeanFlow and GAN refinement. SMFSR preserves the random-noise starting point of diffusion models and learns a direct noise-to-HR mapping conditioned on the LR image. To this end, Interval Splitting Consistency distills the multi-step generative trajectory into a single average-velocity prediction, enabling efficient one-step generation. To compensate for the reduced opportunity for progressive refinement, we further introduce a GAN refinement stage, where a DINOv3-based discriminator enhances realistic texture synthesis and variational score distillation aligns the generated outputs with the natural image distribution under a frozen diffusion teacher. Extensive experiments demonstrate that SMFSR achieves state-of-the-art perceptual quality among one-step diffusion-based Real-ISR methods while retaining fast single-step inference.
Chinese Translation
预训练的文本到图像(T2I)扩散模型在真实世界图像超分辨率(Real-ISR)方面展现出强大的潜力,这得益于其噪声启动的生成过程,能够实现逼真的纹理合成并捕捉超分辨率的一对多特性。然而,基于扩散的Real-ISR方法仍面临效率与质量之间的根本权衡。多步方法通过在低分辨率(LR)条件下迭代去噪随机高斯噪声生成高质量结果,但采样速度较慢。近期的一步法显著提高了效率,但通常将噪声启动生成替换为直接的低分辨率到高分辨率(LR-to-HR)恢复,这削弱了随机性并限制了逼真细节的合成。为了解决这一问题,我们提出了SMFSR,一个基于噪声启动的一步法Real-ISR框架,通过LR条件下的SplitMeanFlow与GAN精细化。SMFSR保留了扩散模型的随机噪声起始点,并学习一个基于LR图像的直接噪声到高分辨率(HR)映射。为此,区间分割一致性将多步生成轨迹提炼为单一的平均速度预测,从而实现高效的一步生成。为了弥补渐进精细化机会的减少,我们进一步引入了GAN精细化阶段,其中基于DINOv3的鉴别器增强了逼真的纹理合成,而变分分数提炼则在冻结的扩散教师下将生成输出与自然图像分布对齐。大量实验表明,SMFSR在一步法基于扩散的Real-ISR方法中实现了最先进的感知质量,同时保持了快速的一步推理。
cs.CV / 141 / 2605.09339
Perceptual Asymmetry Between Hue Categories: Evidence from Human Color Categorization
色调类别之间的感知不对称性:来自人类颜色分类的证据
Abstract
Human color categories are not uniformly distributed in perceptual space, yet most computational color models still assume fixed and evenly structured representations. In this paper, we present a focused analytical extension of the COLIBRI fuzzy color model by investigating perceptual asymmetry between hue categories. Using previously collected large-scale human color categorization data, we introduce quantitative measures of category extent and boundary uncertainty, namely Wideness and Boundary Width, derived from fuzzy membership functions at the $\alpha = 0.5$ level. The analysis reveals a strong imbalance between the two categories: yellow occupies a compact and sharply constrained region of the hue space, whereas green spans a substantially broader interval and exhibits a more extended transition structure. The results show that perceptual color categories are not only fuzzy, but also highly non-uniform in their geometric organization. This asymmetry suggests that some categories behave as narrow, highly specific perceptual labels, while others function as broad, tolerant regions of human color naming. These findings provide a new perspective on linguistic color categorization and extend the interpretability of the COLIBRI framework for perceptually grounded color modeling.
Chinese Translation
人类的颜色类别在感知空间中并非均匀分布,然而大多数计算颜色模型仍然假设固定且均匀结构的表示。在本文中,我们通过研究色调类别之间的感知不对称性,提出了COLIBRI模糊颜色模型的集中分析扩展。利用之前收集的大规模人类颜色分类数据,我们引入了类别范围和边界不确定性的定量度量,即在$\alpha = 0.5$水平下从模糊隶属函数派生的宽度(Wideness)和边界宽度(Boundary Width)。分析结果揭示了两个类别之间的强烈不平衡:黄色占据了色调空间中一个紧凑且严格限制的区域,而绿色则跨越了一个显著更广泛的区间,并展现出更为延展的过渡结构。结果表明,感知颜色类别不仅是模糊的,而且在其几何组织上高度不均匀。这种不对称性表明某些类别作为狭窄且高度特定的感知标签,而其他类别则作为人类颜色命名的广泛、宽容的区域。这些发现为语言学颜色分类提供了新的视角,并扩展了COLIBRI框架在感知基础颜色建模中的可解释性。
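The abstract derives its measures from fuzzy membership functions cut at the $\alpha = 0.5$ level but does not give the formulas. As an illustration of the underlying operation only, the sketch below computes the $\alpha$-cut of a trapezoidal membership function on a linear hue axis and its length. The trapezoidal shape, the parameter names `a, b, c, d`, and any numbers passed in are assumptions; the paper's exact definitions of Wideness and Boundary Width may differ, and hue circularity is ignored here.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def alpha_cut(a, b, c, d, alpha=0.5):
    """Interval of hue values whose membership is at least alpha (requires a < b <= c < d)."""
    left = a + alpha * (b - a)
    right = d - alpha * (d - c)
    return left, right


def cut_width(a, b, c, d, alpha=0.5):
    """Length of the alpha-cut, one possible reading of a category's extent."""
    left, right = alpha_cut(a, b, c, d, alpha)
    return right - left
```

Comparing `cut_width` for two categories fitted on the same hue axis gives a simple numerical picture of the kind of asymmetry the abstract reports between a narrow and a broad category.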
cs.CV / 142 / 2605.09378
EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation
EduStory:一个统一的框架用于教学一致的多镜头STEM教学视频生成
Abstract
Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and trustworthy long-horizon video generation.
Chinese Translation
长时间跨度的视频生成在视觉质量上已有所提升,但现有方法在保持知识一致性和多镜头教学视频中的连贯教学叙事方面仍然面临挑战,尤其是在STEM领域。为了解决这些问题,我们提出了EduStory,一个可靠的教学视频生成统一框架。EduStory整合了教学状态建模,以跟踪持久的知识状态;脚本引导的结构控制,以组织多镜头叙事;以及面向学习的评估指标,以评估知识保真度和约束满足情况。为了支持严格的评估,我们进一步引入了EduVideoBench,一个具有多粒度注释的诊断基准,包括教学故事板、镜头级语义和知识状态转变,以及可控教学视频生成的基线任务。大量实验表明,领域感知的状态建模和结构控制显著减少了叙事中断,并改善了与教学意图的一致性。这些结果突显了领域特定结构约束和量身定制基准在推动可靠、可控且可信的长时间跨度视频生成中的重要性。
cs.CV / 143 / 2605.09384
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL:医学视觉问答的参数高效适应
Abstract
The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2-4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.
Chinese Translation
大型和紧凑型视觉-语言模型(VLMs)之间的推理差距限制了医学人工智能在便携式临床设备上的应用。具有20-40亿参数的紧凑型VLM可以在资源受限的硬件上运行,但缺乏可解释的临床决策支持所需的多步骤推理能力。现有的知识蒸馏方法仅传递答案,而不传递其背后的推理过程。医学视觉问答(VQA)作为这一问题的测试平台,要求模型通过结构化推理链将视觉证据与临床知识结合起来。我们提出了LiteMedCoT-VL,这是一个将235B教师模型的思维链推理通过基于LoRA的微调转移到2B学生模型的管道,使用丰富解释的训练数据进行训练。所有推理默认在没有图像标题的情况下进行,模拟医生直接解释医学图像而不附带放射学报告的临床场景。在PMC-VQA基准测试中,LiteMedCoT-VL达到了64.9%的准确率,比零样本的Qwen3-VL-4B基线的53.9%高出11.0个百分点,并且超越了所有已发布的基线。这一结果表明,具有推理蒸馏的2B模型可以与参数是其两倍的模型相匹敌或超越。视觉基础分析表明,该模型依赖于图像内容,而不是利用文本先验。我们的代码公开可用,网址为https://anonymous.4open.science/r/LiteMedCoT-VL。
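LiteMedCoT-VL adapts its student models with LoRA. For readers unfamiliar with the technique, the sketch below shows the generic LoRA forward pass for one linear layer, $y = Wx + \frac{\alpha}{r} B(Ax)$, using plain lists. It illustrates LoRA in general, not this paper's training code; the shapes, the `alpha` default, and the helper names `matvec` and `lora_forward` are assumptions.

```python
def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha=16.0):
    """Frozen weight w (out x in) plus low-rank update b (out x r) @ a (r x in)."""
    rank = len(a)  # number of rows of A is the LoRA rank r
    base = matvec(w, x)  # frozen pretrained path
    update = matvec(b, matvec(a, x))  # trainable low-rank path
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]
```

With `b` initialized to zeros, as is common practice, the adapted layer starts out identical to the frozen one, and only the small matrices `a` and `b` need to be trained and stored.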
cs.CV / 144 / 2605.09392
HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies
HyNeuralMap:视觉语义与神经层级的双曲映射
Abstract
Understanding the intricate mappings between visual stimuli and neural responses is a fundamental challenge in cognitive neuroscience. While current approaches predominantly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, this geometry often struggles to preserve fine-grained semantic relationships and latent hierarchical structures across visual and neural modalities. To overcome this, we propose HyNeuralMap, a framework that employs the hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy. By leveraging the negative curvature of hyperbolic space as an inductive bias, the proposed framework better captures hierarchical semantic organization and cross-subject neural similarities. Specifically, visual and neural embeddings are jointly optimized through hyperbolic geometric alignment, where geodesic distances preserve semantic proximity and hierarchical relationships more effectively than Euclidean embeddings. Experiments demonstrate that HyNeuralMap consistently outperforms state-of-the-art Euclidean baselines in both multi-label semantic prediction and cross-modal retrieval tasks. This confirms hyperbolic geometry's superiority for cross-modal semantic alignment and hierarchical modeling, providing a new avenue for vision-neural representation learning.
Chinese Translation
理解视觉刺激与神经反应之间复杂的映射关系是认知神经科学中的一个基本挑战。尽管当前的方法主要在欧几里得空间中对齐图像和功能性磁共振成像(fMRI)反应,但这种几何结构往往难以保留视觉和神经模态之间细粒度的语义关系和潜在的层级结构。为此,我们提出了HyNeuralMap,一个利用双曲洛伦兹模型将视觉语义映射到共享的跨个体神经层级的框架。通过利用双曲空间的负曲率作为归纳偏置,该框架更好地捕捉层级语义组织和跨个体神经相似性。具体而言,视觉和神经嵌入通过双曲几何对齐进行联合优化,其中测地距离比欧几里得嵌入更有效地保留语义接近性和层级关系。实验表明,HyNeuralMap在多标签语义预测和跨模态检索任务中始终优于最先进的欧几里得基线。这证实了双曲几何在跨模态语义对齐和层级建模方面的优越性,为视觉-神经表示学习提供了一条新的途径。
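The geodesic distance in the Lorentz (hyperboloid) model that HyNeuralMap builds on has a short closed form. The sketch below assumes unit negative curvature ($-1$): a Euclidean vector $v$ is lifted to the hyperboloid by setting the time coordinate to $\sqrt{1 + \lVert v \rVert^2}$, and the distance is $\operatorname{arccosh}(-\langle x, y \rangle_L)$. The function names are hypothetical, and the paper's curvature and encoders are not modeled.

```python
import math


def lift(v):
    """Map a Euclidean vector onto the unit hyperboloid (curvature -1)."""
    time = math.sqrt(1.0 + sum(c * c for c in v))
    return [time] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time term plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 to guard against rounding slightly below the acosh domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

A visual embedding and an fMRI embedding produced by separate encoders could both be passed through `lift` and compared with `lorentz_distance`; distances grow rapidly away from the origin, which is the property that suits tree-like hierarchies.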
cs.CV / 145 / 2605.09407
AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
AnyDepth-DETR/-YOLO:单网络实现任意深度目标检测
Abstract
Modern object detectors are static, fixed-depth networks optimized for a single operating point, requiring separate models for different deployment scenarios. We present an any-depth detection framework that enables a single network to span a continuous range of accuracy-efficiency trade-offs by controlling depth at inference time without retraining. Each backbone and neck stage is divided into an essential path, which always executes, and a skippable refinement path; this decomposition preserves the full multi-scale feature hierarchy at every depth configuration, unlike conventional early exiting that discards entire stages. To train such a network, jointly optimizing many sub-networks of varying depth introduces conflicting gradient signals. We address this via self-distillation between only the two extremes, with prediction-level and feature-level alignment losses that enforce stage-wise modularity, ensuring the outputs of each stage remain compatible regardless of the paths taken. Instantiated on RT-DETR and YOLOv12, our full-depth configurations match or surpass their respective SOTA baselines with negligible parameter overhead, while the most efficient configurations achieve up to $1.82\times$ speedup at a cost of only 2.0 AP, all from a single set of weights.
Chinese Translation
现代目标检测器是静态的、固定深度的网络,针对单一操作点进行了优化,这要求针对不同的部署场景使用不同的模型。我们提出了一种任意深度检测框架,使单个网络能够通过在推理时控制深度,跨越准确性与效率之间的连续范围,而无需重新训练。每个主干和颈部阶段被划分为一个始终执行的基本路径和一个可跳过的精细化路径;这种分解保留了每个深度配置下完整的多尺度特征层次结构,与传统的早期退出方法不同,后者会丢弃整个阶段。为了训练这样的网络,联合优化多个不同深度的子网络会引入相互冲突的梯度信号。我们通过仅在两个极端之间进行自蒸馏来解决这个问题,使用预测级和特征级对齐损失来强制实施阶段级模块化,确保每个阶段的输出无论采取何种路径都保持兼容。在 RT-DETR 和 YOLOv12 上实例化的全深度配置与各自的最先进基线相匹配或超越,且参数开销微乎其微,而最有效的配置在仅损失 2.0 AP 的情况下实现了高达 $1.82\times$ 的加速,所有这些均来自一组权重。
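The essential/skippable split described in the AnyDepth abstract can be sketched with toy scalar stages. This is an illustrative sketch only: the residual form of the refinement path, the function names, and the scalar "features" are assumptions, not details taken from the paper.

```python
def run_stage(x, essential, refinement, use_refinement):
    """Run one stage: the essential path always executes, refinement is optional."""
    y = essential(x)
    if use_refinement:
        # Assumed residual form, so the output stays compatible either way.
        y = y + refinement(y)
    return y

def forward(x, stages, depth_config):
    """stages: list of (essential, refinement); depth_config: one bool per stage."""
    features = []
    for (essential, refinement), keep in zip(stages, depth_config):
        x = run_stage(x, essential, refinement, keep)
        features.append(x)  # every stage emits a feature at any depth setting
    return features

# Toy stages acting on scalars (hypothetical stand-ins for network blocks).
stages = [(lambda v: 2.0 * v, lambda v: 0.1 * v) for _ in range(3)]
full = forward(1.0, stages, [True, True, True])
shallow = forward(1.0, stages, [False, False, False])
```

Both configurations return one feature per stage, which mirrors the claim that the multi-scale hierarchy is preserved at every depth, in contrast to early exiting.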
cs.CV / 146 / 2605.09417
SAMOFT: Robust Multi-Object Tracking via Region and Flow
SAMOFT:通过区域和流实现鲁棒的多目标跟踪
Abstract
Multi-object tracking (MOT) is a fundamental task in computer vision that requires continuously tracking multiple targets while maintaining consistent identities across frames. However, most existing approaches primarily rely on instance-level object features for trajectory association, which often leads to degraded performance under challenging conditions such as object deformation, nonlinear motion, and occlusion. In this work, we propose SAMOFT, a robust tracker that leverages pixel-level cues to improve robustness under complex motion scenarios. Specifically, we introduce a Pixel Motion Matching (PMM) module that integrates the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction using instantaneous foreground pixel motion. To further enhance robustness under unreliable detections, we design a Centroid Distance Matching (CDM) module that performs flexible mask-based centroid matching for low-confidence or partially occluded observations. Moreover, a Distribution-Based Correction (DBC) module models long-tailed motion patterns in a training-free manner using historical optical flow statistics and dynamically corrects trajectory states online. We also incorporate a Cluster-Aware ReID (CA-ReID) strategy to improve the stability and discriminative power of trajectory appearance features. Extensive experiments on the DanceTrack and MOTChallenge benchmarks demonstrate that SAMOFT consistently improves baseline trackers and achieves competitive performance compared with recent state-of-the-art methods, validating the effectiveness of leveraging pixel-level cues for robust multi-object tracking.
Chinese Translation
多目标跟踪(MOT)是计算机视觉中的一项基础任务,要求在保持跨帧一致身份的同时,持续跟踪多个目标。然而,大多数现有方法主要依赖于实例级对象特征进行轨迹关联,这在目标变形、非线性运动和遮挡等挑战性条件下往往会导致性能下降。在本研究中,我们提出了SAMOFT,一种鲁棒的跟踪器,利用像素级线索来提高在复杂运动场景下的鲁棒性。具体而言,我们引入了一个像素运动匹配(Pixel Motion Matching, PMM)模块,该模块将Segment Anything Model(SAM)与稠密光流结合,以利用瞬时前景像素运动来优化基于卡尔曼滤波器的运动预测。为了进一步增强在不可靠检测下的鲁棒性,我们设计了一个质心距离匹配(Centroid Distance Matching, CDM)模块,该模块对低置信度或部分遮挡的观测进行灵活的基于掩膜的质心匹配。此外,一个基于分布的修正(Distribution-Based Correction, DBC)模块使用历史光流统计以无训练方式建模长尾运动模式,并动态在线修正轨迹状态。我们还结合了一个集群感知重识别(Cluster-Aware ReID, CA-ReID)策略,以提高轨迹外观特征的稳定性和区分能力。在DanceTrack和MOTChallenge基准上的大量实验表明,SAMOFT始终改善基线跟踪器,并与近期的最先进方法相比实现了竞争力的性能,验证了利用像素级线索进行鲁棒多目标跟踪的有效性。
cs.CV / 147 / 2605.09418
MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition
MAG-VLAQ:用于跨视角地点识别的多模态空地查询聚合
Abstract
Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.
Chinese Translation
多模态跨视角地点识别在计算机视觉和机器人领域仍然是一个基本挑战,因为地面观测与空中参考之间存在严重的视角、模态和空间结构差异。为了解决这一挑战,我们提出了MAG-VLAQ,一个基于基础模型增强的多模态空地跨视角地点识别查询聚合框架。具体而言,我们的方法利用预训练的基础模型从地面和空中图像中提取稠密的视觉标记,以及从地面LiDAR观测中提取表现力丰富的几何标记。这些异构标记随后被投影到共享的嵌入空间中,以实现跨模态对齐和融合。作为我们的主要贡献,我们提出了ODE条件的VLAQ,它将基于神经常微分方程(ODE)的RGB-LiDAR融合与局部聚合查询向量(VLAQ)紧密耦合。在这一设计中,VLAQ查询中心根据融合的多模态状态动态调整。该机制使最终的全局描述符能够保留全局学习的检索原型,同时对场景特定的视觉和几何证据保持响应,从而显著改善空地匹配。在KITTI360-AG和nuScenes-AG上的大量实验验证了我们提出的MAG-VLAQ的有效性。值得注意的是,在KITTI360-AG上,我们的MAG-VLAQ几乎将最先进的性能翻倍,在卫星设置中实现了61.1的Recall@1,而最接近的竞争方法仅为34.5。
cs.CV / 148 / 2605.09420
Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery
关系检索:利用已知-新颖交互进行广义类别发现
Abstract
In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.
Chinese Translation
在本研究中,我们通过关系检索的视角解决广义类别发现(GCD)问题,明确地通过双向知识转移将标记数据和未标记数据结合起来。现有方法将这些数据源视为独立的,错失了宝贵的交互机会,而我们提出了关系模式一致性(RPC),使得相互增强成为可能。RPC采用一对多分类器进行软ID/OOD分解,然后引入两种机制:(i)为了保持已知类别,我们转移语义行为对齐;(ii)为了进行类别发现,我们利用样本来自同一类别与已知类别原型之间保持不变关系的洞察,将不可靠的伪标签转化为明确的关系模式匹配。这种双向设计使得标记数据能够指导未标记学习,同时通过它们的集体关系特征发现新类别。大量实验表明,RPC在通用和细粒度基准测试中均实现了最先进的性能。
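The idea that same-category samples keep similar relationships to known-class prototypes can be sketched as follows. This is a hedged illustration: the softmax-over-cosine signature, the temperature value, and the L1 comparison are assumptions chosen for clarity, and the abstract does not specify how RPC computes or matches relational patterns.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relational_signature(feature, prototypes, temperature=0.1):
    """Softmax over cosine similarities to the known-class prototypes."""
    sims = [cosine(feature, p) / temperature for p in prototypes]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def signature_distance(sig_a, sig_b):
    """L1 distance between two relational signatures."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))

# Two known-class prototypes and three unlabeled samples (toy values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
sig_a = relational_signature([0.70, 0.70], prototypes)
sig_b = relational_signature([0.68, 0.72], prototypes)
sig_c = relational_signature([1.00, 0.05], prototypes)
```

In this toy setup the first two samples have close signatures and the third does not, so a matching rule based on `signature_distance` would group the first two together.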
cs.CV / 149 / 2605.09425
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
AtteConDA:基于注意力的多条件扩散模型中的冲突抑制与合成数据增强
Abstract
Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.
Chinese Translation
近期的条件图像生成方法通过生成忠实于草图、人类姿态、分割图和深度等条件的图像,提高了可控性。通过将这些技术应用于图像增强,同时保留注释,生成的图像可以作为额外的训练数据,从而提高识别性能。然而,对于交通规则提取和驾驶行为理解等高级驾驶任务,仅仅使用注释作为条件是不够的。相反,必须在保留原始场景详细高级结构的同时进行图像增强。一种可能的解决方案是使用多个条件,以便生成的图像在生成后保留多样的结构线索。然而,当使用多个条件时,条件之间的冲突可能会阻碍可靠的结构保留。在本研究中,我们将从原始图像中提取的语义分割、深度和边缘输入到多条件图像生成模型中,从而提供丰富的结构信息作为条件。我们进一步提出了一种处理多个条件之间冲突的建模方法,并展示了该方法能够实现更强的结构保留的图像生成。我们还建立了一个用于驾驶任务的生成框架和评估协议,为与先前和未来模型的比较奠定了基础。因此,本研究通过解决多条件生成中的条件冲突,为图像生成研究做出了贡献,并为缓解高级自动驾驶任务中的数据稀缺问题提供了重要步骤。
cs.CV / 150 / 2605.09429
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
规避视觉失语症:面向视觉-语言模型的对比自适应语义标记剪枝
Abstract
Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
Chinese Translation
低关注度的视觉标记在视觉-语言推理中真的冗余吗?现有的剪枝方法通常假设如此,通过浅层的文本到图像注意力对视觉标记进行排名,并丢弃低评分的区域以加速LVLM推理。我们表明,这一标量标准对于组合推理是不可靠的:在早期层中被忽视的标记可能在后续阶段变得对解决次要物体、空间关系和上下文线索至关重要。因此,过早的剪枝可能导致视觉失语症,这是一种模型失去视觉基础并依赖语言先验的失败模式。我们提出了COAST(对比自适应语义标记剪枝),这是一个无训练的剪枝框架,将压缩视为自适应语义路由。COAST利用原生的跨模态注意力来识别特定查询的锚点,并通过注意力熵估计上下文分散度,然后调整语义证据与空间上下文之间的保留权衡。它进一步使用对比路由评分来保留锚点对齐的证据和互补的空间上下文。在七个基准测试中,COAST将视觉标记减少了77.8%,并在保留98.64%原始平均性能的同时实现了2.15倍的延迟加速。超越单一的主干或压缩设置,COAST在不同的标记预算中始终优于强大的剪枝基线,并在多个LVLM家族中具有良好的泛化能力,表明自适应语义路由是一次性标量剪枝的稳健替代方案。
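The entropy-driven trade-off mentioned in the COAST abstract can be sketched with a normalized Shannon entropy that steers how much weight a context score receives. This is an assumption-laden sketch: `context_score`, the linear mixing rule, and the function names are hypothetical, and the actual contrastive routing score is not described in the abstract.

```python
import math

def attention_entropy(attn):
    """Normalized Shannon entropy in [0, 1] of an attention distribution."""
    total = sum(attn)
    probs = [a / total for a in attn]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))  # requires at least two tokens

def select_tokens(attn, context_score, budget):
    """Keep `budget` tokens; diffuse attention shifts weight toward context."""
    lam = attention_entropy(attn)
    scores = [(1 - lam) * a + lam * c for a, c in zip(attn, context_score)]
    ranked = sorted(range(len(attn)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Peaked attention: token 0 dominates, token 3 carries spatial context.
peaked = [0.97, 0.01, 0.01, 0.01]
kept = select_tokens(peaked, [0.0, 0.0, 0.0, 1.0], budget=2)
```

With peaked attention the entropy is low, so attention evidence dominates the ranking, yet the context term still lifts token 3 above the other low-attention tokens.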
cs.CV / 151 / 2605.09430
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR:自回归图像生成的高效后训练加速
Abstract
Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strict next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before being jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through a lightweight post-training with merely 0.05% of the original training data.
Chinese Translation
大规模自回归模型在图像生成方面展现了显著的能力。然而,它们的顺序光栅扫描解码依赖于严格的下一个标记预测,使得推理成本高昂。现有的加速方法通常要么引入完全新的生成范式,导致需要从头开始进行昂贵的预训练,要么以牺牲训练-推理间隙或改变预测目标为代价,实现并行生成。本文介绍了FlashAR,一种轻量级后训练适应框架,能够高效地将预训练的光栅扫描自回归模型适配为基于双向下一个标记预测的高度并行生成器。我们的关键见解是,有效的适应应尽量减少对预训练模型原始训练目标的修改,以保留其学习到的先验。因此,我们将原始自回归头保留为行预测的水平头,并引入一个补充的轻量级垂直头用于列预测。为了促进高效适应,我们从中间层而非最终层分支出垂直头,从而绕过固有的水平头偏差。此外,由于水平和垂直预测捕获了互补的依赖关系,其相对重要性在目标位置之间变化,我们采用可学习的融合门动态地在每个位置结合这两种预测。为了进一步降低适应成本,我们提出了一个两阶段的适应流程:首先通过从预训练自回归模型的适应初始化垂直头,然后与主干共同微调以适应新的解码范式。在LlamaGen和Emu3.5上的大量实验表明,FlashAR通过轻量级后训练实现了高达22.9倍的512x512图像生成加速,仅使用了原始训练数据的0.05%。
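A per-position fusion gate of the kind the FlashAR abstract mentions can be sketched as a convex combination of two token distributions. This sketch assumes the gate mixes probabilities through a sigmoid of a scalar `gate_logit`; whether FlashAR fuses probabilities, logits, or hidden states is not stated in the abstract, and all names here are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping a gate logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(horizontal_logits, vertical_logits, gate_logit):
    """Convex combination of the two heads' token distributions at one position."""
    g = sigmoid(gate_logit)
    p_h = softmax(horizontal_logits)
    p_v = softmax(vertical_logits)
    return [g * a + (1.0 - g) * b for a, b in zip(p_h, p_v)]

# Toy vocabulary of three tokens; the heads disagree on the top token.
fused = fuse([2.0, 0.5, 0.1], [0.2, 1.8, 0.3], gate_logit=0.0)
```

Because the gate yields a convex combination, the fused output remains a valid probability distribution for any gate value.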
cs.CV / 152 / 2605.09433
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
带有噪声跟踪对的离线偏好优化用于矫正流
Abstract
Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
Chinese Translation
现有的文本到图像模型的偏好数据集通常仅存储最终的胜者/败者图像。这种表示对于矫正流(Rectified Flow, RF)模型来说是不够的,因为其生成自然地由特定的先前噪声样本索引,并遵循几乎直线的去噪轨迹。相比之下,先前的扩散模型(Diffusion Models)风格的DPO(Direct Preference Optimization)对齐通常使用独立的前向噪声过程来估计轨迹,这可能与真实的反向动态不匹配,并引入不必要的方差。我们提出了先前噪声感知偏好优化(Prior Noise-Aware Preference Optimization, PNAPO),这是一个专门针对矫正流的离政策略对齐框架。PNAPO通过保留用于生成每个胜者/败者图像的配对先前噪声来增强偏好数据,将标准的(提示,胜者,败者)三元组转变为六元组。利用RF的直线特性,我们通过噪声-图像插值来估计中间状态,这限制了轨迹估计空间,并为偏好优化提供了更紧凑的替代目标。此外,我们引入了一种动态正则化策略,根据(i)胜者与败者之间的奖励差距和(ii)训练进度来调整DPO正则化,从而提高稳定性和样本效率。在最先进的RF T2I骨干网络上的实验表明,PNAPO在持续改善偏好指标的同时,显著减少了训练计算量。
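The noise-image interpolation that PNAPO uses to estimate intermediate states follows from the straight-line property of rectified flow. Below is a minimal sketch assuming the convention that t=0 is the prior noise and t=1 is the final image; the paper may use the opposite convention, and the record fields are hypothetical names for the stored noise/image pairs.

```python
def interpolate(noise, image, t):
    """State on the straight noise-to-image path: t=0 is noise, t=1 is the image."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]

# A preference record that keeps the prior noise next to each image
# (field names are illustrative, values are toy 3-pixel "images").
record = {
    "prompt": "a red bicycle",
    "winner": [0.9, 0.1, 0.4],
    "winner_noise": [0.2, -1.1, 0.7],
    "loser": [0.3, 0.6, 0.5],
    "loser_noise": [-0.4, 0.8, 0.1],
}

t = 0.5
winner_state = interpolate(record["winner_noise"], record["winner"], t)
loser_state = interpolate(record["loser_noise"], record["loser"], t)
```

Since each intermediate state is tied to the noise that actually produced the image, no independent forward noising sample is needed, which is the variance reduction the abstract argues for.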
cs.CV / 153 / 2605.09442
SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
SWIFT:高效交互式长视频生成的提示自适应记忆
Abstract
Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.
Chinese Translation
流媒体长视频生成面临着连续语义切换的核心挑战,需要自适应记忆以保持一致的视觉演变。目前的方法依赖于在提示边界处重建缓存或固定的记忆预算,但这些方法引入了冗余计算,并限制了灵活的语义适应。这一限制源于缓存视频历史与提示更新之间的不匹配,因为记忆保持了视觉连续性,而提示切换则要求快速的语义适应。基于这一观察,我们提出了SWIFT(Semantic Windowing and Injection for Flexible Transitions),这是一个无训练的多提示长视频生成框架,能够在保持因果视频扩散模型的时间一致性的同时实现高效的语义切换。SWIFT引入了一种轻量级的语义注入缓存,增强了缓存的视频记忆,而不是在每个提示边界处从头重建。为了避免均匀扰动所有注意力通道,我们进一步执行头级语义注入,使每个注意力头根据其与当前视频状态的对齐程度接收与提示更新成比例的更新。此外,我们引入了一种自适应动态窗口,根据提示阶段分配时间记忆,在切换边界附近使用更大的局部上下文,而在稳定段使用较小的窗口,以降低平均推理成本。为了在压缩的局部注意力下保持长范围的语义一致性,我们进一步维护段级语义锚点,总结提示条件下的视频历史,并将其重新引入为紧凑的记忆标记。与当前最先进的方法相比,SWIFT在单个H100 GPU上以22.6 FPS的速度保持生成质量,为多提示长视频生成建立了一个显著更高效的解决方案。我们的代码可在 https://github.com/ShanwenTan/SWIFT 获取。
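The Adaptive Dynamic Window in the SWIFT abstract allocates more local context near prompt switches. A simple way to picture this is a step rule on the distance from the most recent switch. The window sizes, the radius, and the choice to widen only after a switch are all assumptions for illustration, not values from the paper.

```python
def window_size(frame, switch_frames, small=8, large=24, radius=16):
    """Local attention window for `frame`: wider for `radius` frames after a switch."""
    near_switch = any(0 <= frame - s <= radius for s in switch_frames)
    return large if near_switch else small

def average_window(num_frames, switch_frames):
    """Mean window size over a clip, a rough proxy for average attention cost."""
    sizes = [window_size(f, switch_frames) for f in range(num_frames)]
    return sum(sizes) / len(sizes)

# A 300-frame clip with prompt switches at frames 100 and 200.
avg = average_window(300, [100, 200])
```

The average stays well below the large window because most frames fall in stable segments, which is the cost saving the abstract describes.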
cs.CV / 154 / 2605.09443
Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
通过角色的视角:解决多模态角色扮演代理中的模态-角色干扰
Abstract
The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.
Chinese Translation
多模态大型语言模型(MLLMs)的进步使得角色扮演代理(RPAs)扩展到了视觉基础环境。然而,人类视觉本质上是主观的和以身份为驱动的,而现有的MLLMs则提取客观的、与角色无关的特征用于一般任务。在RPAs中,这种通用的视觉噪声压倒了脆弱的角色特征,导致模态-角色干扰(MRI),使得代理在整合视觉基础和角色一致性方面面临困难。为了解决这个问题,我们提出了无训练的角色感知视觉干预(CAVI)框架,使代理能够通过角色的视角感知世界。CAVI系统性地针对MRI:从宏观上看,角色引导的标记修剪(CTP)将视觉感受野限制在与角色相关的实体上;从微观上看,正交特征调制(OFM)将标记投影到角色上下文子空间中以提取对齐的事实;在解码过程中,模态自适应角色引导(MARS)根据视觉依赖动态优化引导强度。大量实验表明,CAVI有效缓解了MRI,显著增强了角色一致的多模态交互。
cs.CV / 155 / 2605.09449
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++:面向空间基础视频多模态大语言模型的外部认知地图
Abstract
Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
Chinese Translation
近期的多模态大语言模型(MLLMs)在视觉理解和基于语言的推理方面取得了显著进展,但它们缺乏在三维环境中进行空间一致性推理的持久世界中心表示。受到哺乳动物双流系统的启发,该系统将语义和空间线索分别处理并整合成外部认知地图,我们提出了SpaceMind++,一种视频MLLM架构,明确地从RGB视频构建体素化认知地图。该地图将碎片化的自我中心观察重组为共享的三维度量表示,使模型能够在变化的视角中保持物体的持久性和空间拓扑。为了使这一外部表示能够被预训练的视频MLLM使用而不干扰其原生的视觉标记接口,我们引入了坐标引导深度迭代融合(Coordinate-Guided Deep Iterative Fusion),这一新机制将地图级空间知识反馈到原始的二维视觉特征中。这种融合明确地由坐标嵌入和三维旋转位置编码(3D Rotary Positional Encoding)引导,将语义交互锚定在度量三维空间中,类似于内嗅皮层将感官特征绑定到度量空间的过程。大量实验表明,SpaceMind++在VSI-Bench上达到了新的最先进性能。此外,它在SPBench、SITE-Bench和SPAR-Bench上表现出优越的分布外泛化能力,强调了其在未见过的三维环境中的鲁棒性。
cs.CV / 156 / 2605.09455
Adaptive 3D Convolution for Remote Sensing Image Fusion
自适应三维卷积在遥感图像融合中的应用
Abstract
Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.
Chinese Translation
遥感图像融合旨在从一幅具有有限光谱信息的高分辨率图像和一幅具有丰富光谱数据的低分辨率图像中创建高分辨率的多光谱/高光谱图像。近年来,深度学习(DL)技术在这一领域表现出了显著的有效性。大多数基于DL的方法将图像融合视为一个二维问题,通过将光谱信息编码到特征图通道中。然而,我们的研究表明,这一策略引入了显著的光谱失真。相对而言,一些方法将光谱数据视为额外的维度,利用标准三维卷积来保留光谱信息。然而,在标准三维卷积层中,相同的卷积核集应用于所有输入区域,这被我们发现对图像融合并非最优。此外,标准三维卷积需要大量的计算资源。为了解决这些挑战,我们提出了一种新的卷积范式,称为自适应三维卷积(Adaptive 3D Convolution,Ada3D),用于遥感图像融合。Ada3D为每个输入体素应用一组独特的三维卷积核,从而能够捕捉细粒度的细节。这些自适应卷积核通过两步过程生成:(i)从各自的图像源中导出空间和光谱卷积核;(ii)将这两种类型的卷积核结合形成内容感知的三维卷积核,有效整合空间和光谱信息。此外,引入自适应偏置以增强体素级的卷积结果。此外,我们结合了组卷积技术以降低计算复杂性。因此,Ada3D以高效的方式提供了完全的自适应性。在五个数据集上的评估结果表明,我们的方法达到了最先进的性能,突显了Ada3D的优越性。代码可在 https://github.com/PSRben/Ada3D 获取。
cs.CV / 157 / 2605.09460
When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation
少量步骤足矣:无训练的身份保留生成加速
Abstract
Identity-preserved image generation is typically built on many-step diffusion backbones, making personalized generation expensive at deployment time. We show that this cost is often unnecessary for identity-conditioned FLUX generation. A frozen InfuseNet identity adapter trained with the dev backbone transfers directly to the distilled schnell backbone without retraining. This two-line replacement -- changing the backbone path and disabling classifier-free guidance -- reduces latency by 5.9x while improving ArcFace identity similarity by +0.028 and LPIPS by -0.016 over the standard 28-step dev baseline. To explain why this works, we analyze the denoising trajectory and find that identity fidelity enters an early effective regime, often within 4-8 steps, while later steps primarily refine visual detail, sharpness, and contrast. Adapter ablations confirm that identity formation depends on the identity adapter, while attention-stream norm probes suggest that the relative conditioning contribution decreases as sampling proceeds. Preliminary style-adapter and object-adapter sweeps on SDXL and SD1.5 show similar diminishing returns after intermediate steps. These results position distilled backbone replacement as a simple, training-free strategy for improving the efficiency-fidelity tradeoff of identity-preserved generation.
Chinese Translation
身份保留图像生成通常基于多步骤扩散骨干网络,这使得个性化生成在部署时成本高昂。我们表明,对于身份条件的FLUX生成,这种成本往往是不必要的。一个冻结的InfuseNet身份适配器经过开发转移后,可以直接转移到精炼的schnell骨干网络,而无需重新训练。这个两行替换——更改骨干路径并禁用无分类器引导——将延迟减少了5.9倍,同时在标准的28步开发基线之上,ArcFace身份相似度提高了+0.028,lpips降低了-0.016。为了说明这一点的原因,我们分析了去噪轨迹,发现身份保真度在早期进入有效状态,通常在4-8步内,而后续步骤主要用于细化视觉细节、清晰度和对比度。适配器消融实验确认身份形成依赖于身份适配器,而注意力流规范探测表明,相对条件贡献随着采样的进行而减少。在SDXL和SD1.5上进行的初步风格适配器和对象适配器实验显示,在中间步骤后也出现类似的收益递减。这些结果将精炼骨干替换定位为一种简单的、无训练的策略,以提高身份保留生成的效率与保真度之间的权衡。
cs.CV / 158 / 2605.09477
Outlier-Robust Diffusion Solvers for Inverse Problems
针对逆问题的抗异常值扩散求解器
Abstract
Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.
Chinese Translation
基于扩散模型(DM)的方法在解决逆问题(IP)方面最近取得了显著的性能。然而,基于DM的方法通常在面对异常值时表现不佳,而异常值在实际测量中很常见。在本研究中,为了处理带有异常值的逆问题,我们首先通过显式噪声估计来精炼测量,以减轻噪声的影响。随后,我们基于Huber损失构建了一个迭代加权最小二乘目标,以应对异常值。我们提出了一种利用梯度下降法近似求解相应优化问题的鲁棒目标的方法。为了避免梯度下降法所需的学习率的精细调节,我们进一步采用了共轭梯度法,并采用高效的更新策略。在多种条件下对多个图像数据集进行的广泛实验表明,我们提出的方法对异常值表现出鲁棒性,并在大多数情况下优于最近的基于DM的方法。
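The Huber-based iteratively reweighted least squares (IRLS) objective in the outlier-robust solver abstract can be illustrated on a scalar toy problem. This sketch is a simplification: it solves each reweighted subproblem in closed form, whereas the abstract describes gradient descent and conjugate gradient inside a diffusion sampler; the measurement model y_i ≈ a_i * x and all names are assumptions.

```python
def huber_weights(residuals, delta):
    """IRLS weights for the Huber loss: 1 in the quadratic zone, delta/|r| outside."""
    return [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]

def irls_scalar(a, y, delta=1.0, iters=20):
    """Estimate scalar x in y_i ≈ a_i * x with Huber-reweighted least squares."""
    x = 0.0
    for _ in range(iters):
        residuals = [yi - ai * x for ai, yi in zip(a, y)]
        w = huber_weights(residuals, delta)
        num = sum(wi * ai * yi for wi, ai, yi in zip(w, a, y))
        den = sum(wi * ai * ai for wi, ai in zip(w, a))
        x = num / den  # closed-form weighted least-squares update
    return x

# Measurements of x ≈ 2 with one gross outlier in the last entry.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.0, 60.0]
x_robust = irls_scalar(a, y)
x_plain = sum(ai * yi for ai, yi in zip(a, y)) / sum(ai * ai for ai in a)
```

Plain least squares is pulled far from 2 by the outlier, while the Huber weights shrink the outlier's influence and keep the estimate near the inliers.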
cs.CV / 159 / 2605.09503
PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models
PermuQuant:通过重新排列通道降低扩散模型的每组量化误差
Abstract
Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8$\times$ single-step speedup and reduces the DiT memory footprint by 3.5$\times$ under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
Chinese Translation
大规模视觉生成模型已取得显著性能。然而,它们高昂的计算和内存成本使得在资源受限的场景中(如交互式应用和个人单GPU使用)部署变得具有挑战性。后训练量化(PTQ)通过在不进行昂贵的重新训练的情况下压缩预训练模型,提供了一种实用的解决方案。然而,现有的PTQ方法在极低比特设置下仍然面临严重的质量下降。在本文中,我们将通道排序识别为每组量化中的一个重要但未被充分探索的因素。在这种设置中,每个连续组共享一个量化尺度。当具有非常不同统计特征的通道被放置在同一组时,尺度可能会被异常值主导,从而导致较大的量化误差。基于这一观察,我们提出了PermuQuant,一个简单有效的低比特扩散模型PTQ框架。PermuQuant在每组量化之前,通过联合二阶矩标准对通道进行排序,将具有相似激活和权重统计的通道放入同一组。它进一步使用基于校准的接受规则,仅在所选排列减少校准数据上的量化误差时应用重新排序。所选的排列被吸收到相邻模块中或离线应用于权重,避免了显式的运行时排列操作。在多个大型扩散模型上的大量实验表明,PermuQuant始终减少量化误差,并优于现有的PTQ基线。在使用RTX 5090的FLUX.1-dev上,PermuQuant实现了最高1.8倍的单步加速,并在W4A4 NVFP4量化下将DiT的内存占用减少了3.5倍。代码将发布在 https://github.com/yscheng04/PermuQuant。
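The effect of channel ordering on per-group quantization, and the calibration-based acceptance rule, can be sketched on a toy tensor. This sketch makes several simplifying assumptions: it uses symmetric 4-bit integer quantization rather than NVFP4, a single-tensor second moment rather than the joint activation/weight criterion, and hypothetical function names.

```python
def second_moment(channel):
    """Mean of squared values for one channel."""
    return sum(v * v for v in channel) / len(channel)

def quant_error(channels, group_size, bits=4):
    """Squared error when contiguous channel groups share one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for row in range(len(channels[0])):
        for start in range(0, len(channels), group_size):
            group = [ch[row] for ch in channels[start:start + group_size]]
            scale = max(abs(v) for v in group) / qmax
            if scale == 0.0:
                continue  # an all-zero group quantizes exactly
            for v in group:
                q = max(-qmax, min(qmax, round(v / scale)))
                err += (v - q * scale) ** 2
    return err

def choose_permutation(channels, group_size):
    """Sort channels by second moment; accept only if calibration error drops."""
    identity = list(range(len(channels)))
    order = sorted(identity, key=lambda i: second_moment(channels[i]))
    base = quant_error(channels, group_size)
    permuted = quant_error([channels[i] for i in order], group_size)
    return order if permuted < base else identity

# Four channels (two calibration rows each): small and large ones interleaved.
channels = [[0.1, -0.2], [8.0, -8.0], [0.15, 0.05], [8.0, 8.0]]
order = choose_permutation(channels, group_size=2)
```

In the original order every group mixes a small and a large channel, so the small values round to zero; after sorting, each group has a scale suited to its members.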
cs.CV / 160 / 2605.09507
Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization
基于不确定性感知和解码器对齐的学习用于视频摘要
Abstract
Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
Chinese Translation
视频摘要旨在通过选择一组时间上重要的片段,生成长视频的紧凑表示,以最佳反映人类偏好。这一任务本质上是困难的,因为在评估过程中强烈依赖于离散解码过程,如时间分割和基于背包的选择,且注释具有较强的主观性。现有的大多数方法要么学习忽视这些特征的确定性重要性评分,要么采用复杂的生成模型,从而增加训练和推理的成本。本文提出了VASTSum,一个不确定性感知和解码器对齐的学习框架,用于视频摘要,旨在通过单次模型解决这两个挑战。所提方法使用变分形式预测概率性帧级重要性评分,从而能够明确建模来自多注释者监督的不确定性。为了考虑主观性,特别是在二元注释下,我们采用了一种监督策略,鼓励与合理的人类注释模式对齐,而不是强制执行单一共识目标。此外,我们引入了一种解码器对齐的正则化,促进基于背包的摘要选择的稳定性,减少对预测评分小扰动的敏感性。我们在SumMe和TVSum基准上使用标准的基于排名的指标评估所提框架。实验结果显示,在多个数据拆分中,Kendall和Spearman相关性一致且具有竞争力,表明在注释不一致的情况下提高了鲁棒性,同时保持高效的单次前向推理。这些结果表明,明确建模不确定性并将学习目标与解码阶段对齐,为确定性和扩散基的视频摘要方法提供了一种原则性的替代方案。
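The knapsack-based selection step mentioned in the abstract above can be sketched as follows. This is a minimal illustration of a standard 0/1 knapsack decoder over scored segments, not VASTSum's implementation; the function name `knapsack_select` and the example scores, lengths, and budget are hypothetical.

```python
def knapsack_select(scores, lengths, budget):
    """Pick segments maximizing total score with total length <= budget (0/1 knapsack)."""
    n = len(scores)
    # best[i][c]: best total score using the first i segments with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option: skip segment i-1
            if lengths[i - 1] <= c:
                cand = best[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][c]:
                    best[i][c] = cand  # option: take segment i-1
    # Backtrack to recover which segments were taken
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


selected = knapsack_select(scores=[9, 4, 8, 2], lengths=[4, 3, 5, 2], budget=9)
```

Because the decoder is discrete, a small change in the predicted scores can flip which segments are selected, which is the sensitivity the decoder-aligned regularization is described as reducing.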
cs.CV / 161 / 2605.09513
QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking
QueST:作为语义监测器的持久查询以抑制长时间跟踪中的漂移
Abstract
Tracking points in videos is typically formulated as frame-to-frame correspondence, where each point is matched locally to the next frame. While this works over short horizons, errors accumulate under articulation, occlusion, and viewpoint change, leading to silent semantic drift that existing trackers cannot detect or correct. In this work, we revisit long-horizon tracking from a monitoring perspective and introduce QueST, a monitoring-by-design framework that treats interaction-relevant entities as persistent semantic queries rather than transient point tracks. Instead of local propagation, each query attends globally over spatio-temporal video features at every time-step, providing a stable semantic anchor across time. We further constrain query trajectories with lightweight 3D physical grounding, using geometric plausibility to suppress unbounded drift under occlusion. We evaluate QueST on long-horizon articulated sequences from PartNet-Mobility in SAPIEN and compare against RAFT-3D, CoTracker, and TAP-Net. QueST substantially reduces terminal drift, achieving a 67.7% Absolute Point Error (APE) improvement over TAP-Net while better preserving identity over extended horizons. Our results show that embedding semantic monitoring directly into perception enables more reliable long-horizon tracking under distribution shift.
Chinese Translation
在视频中跟踪点通常被表述为帧间对应,其中每个点与下一帧局部匹配。虽然这种方法在短时间范围内有效,但在关节运动、遮挡和视角变化下,错误会累积,导致现有跟踪器无法检测或纠正的隐性语义漂移。在本研究中,我们从监测的角度重新审视长时间跟踪,并引入QueST,一个设计为监测的框架,将与交互相关的实体视为持久的语义查询,而不是短暂的点轨迹。每个查询在每个时间步上全局关注时空视频特征,而不是局部传播,从而在时间上提供稳定的语义锚点。我们进一步通过轻量级的三维物理基础约束查询轨迹,利用几何合理性抑制在遮挡下的无界漂移。我们在SAPIEN中的PartNet-Mobility的长时间关节序列上评估QueST,并与RAFT-3D、CoTracker和TAP-Net进行比较。QueST显著减少了终端漂移,相较于TAP-Net实现了67.7%的绝对点误差(APE)改善,同时在较长时间范围内更好地保持了身份。我们的结果表明,将语义监测直接嵌入感知中可以在分布变化下实现更可靠的长时间跟踪。
cs.CV / 162 / 2605.09538
PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions
PhysHanDI:基于物理的手部可变形物体交互重建
Abstract
While existing methods for reconstructing hand-object interactions have made impressive progress, they either focus on rigid or part-wise rigid objects, which limits their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations, or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHanDI (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline across reconstruction and future prediction.
Chinese Translation
尽管现有的手-物体交互重建方法取得了显著进展,但它们要么专注于刚性或部分刚性物体,限制了它们对现实世界中高度非刚性变形物体(例如布料、填充玩具)的建模能力,要么在没有完整三维手部重建的情况下对可变形物体进行建模。为了解决这一问题,我们提出了PhysHanDI(基于物理的手部与可变形物体交互重建)框架,该框架能够实现交互手部和非刚性物体的完整三维重建。我们的关键思想是通过密集重建的三维手部运动所引发的力来物理模拟物体变形,确保重建的物体动态既符合物理规律,又与交互手部运动一致。此外,我们还展示了这种物体变形的模拟可以通过逆物理学来进一步优化和改善手部重建。在实验中,PhysHanDI在重建和未来预测方面均优于最先进的基线方法。
cs.CV / 163 / 2605.09566
Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing
双路径超先验引导的深度展开网络用于图像压缩感知
Abstract
Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into two subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules is designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, improving CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.
Chinese Translation
近期的深度展开网络(DUNs)通过将迭代优化与深度网络相结合,显著推动了压缩感知(CS)的发展。然而,现有的DUNs仍面临两个挑战:1)依赖单一测量流,这限制了不同测量子集之间有效信息的交互。2)对所有图像区域的统一处理,忽视了由不同纹理引起的重建难度的差异。为了解决这些局限性,提出了一种新颖的双路径超先验引导深度展开网络(DPH-DUN),该网络将测量分为两个子集,以通过双路径架构实现超先验引导的重建。在深度超先验学习分支中,设计了一系列轻量级神经模块,以高效生成不同领域的超先验知识,从而为CS重建提供协同指导。在超先验引导重建分支中,构建了一个带有超先验指导的深度展开框架,以迭代精炼重建。具体而言,i)在梯度下降步骤中,设计了一个超先验引导步长生成网络,以动态生成空间变化的步长图,从而实现自适应细粒度梯度更新。ii)在近端映射步骤中,引入了两个精心设计的超先验引导注意机制,通过基于梯度的硬注意和软注意动态聚焦于挑战区域,从而促进CS重建的准确性。大量实验表明,所提出的DPH-DUN优于现有的CS方法。
cs.CV / 164 / 2605.09572
KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
KAN文本到视觉?基于Kolmogorov-Arnold网络的多尺度序列姿态动画探索,源自手语符号表示
Abstract
Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body--hand--face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov--Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic time warping based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
Chinese Translation
从符号表示生成手语提供了一条可扩展的可访问手语动画路径。我们提出了KANMultiSign,一个多尺度序列生成器,将HamNoSys符号表示转换为二维人体姿态序列。我们的框架做出了两个互补的贡献。首先,我们引入了一种粗到细的生成策略,采用多尺度监督:模型首先通过一个中间的身体-手-面框架进行引导,以鼓励全局结构的一致性,然后细化手部的精细动作,以改善手指级别的细节。其次,我们研究将Kolmogorov-Arnold网络模块集成到Transformer骨干网络中,使用可学习的单变量函数原语来建模从离散音位符号到连续身体运动学的高度非线性映射,并实现紧凑的参数化。在多个公共语料库上的实验,涵盖波兰语、德语、希腊语和法语手语,与强大的符号到姿态基线相比,动态时间扭曲的联合误差显著减少,同时使用的参数显著更少。控制性消融实验进一步表明,基于KAN的变体在与多尺度监督结合时显著减少了参数数量,同时保持了竞争力的性能,而不是作为准确性提升的主要驱动因素。这些发现将多尺度监督定位为改善符号条件姿态生成的关键机制,而KAN则提供了一种高效建模的紧凑替代方案。我们的代码将公开可用。
cs.CV / 165 / 2605.09581
FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision
基于FPGA的事件驱动视觉对比度最大化硬件架构
Abstract
This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
Chinese Translation
本文提出了一种硬件架构,该架构在现场可编程门阵列(FPGA)资源中实现了对比度最大化(CM)算法,用于事件驱动视觉系统。CM通过最大化从异步事件流重建的扭曲事件图像(IWE)的对比度来估计运动参数。事件驱动视觉传感器生成稀疏数据,具有高时间分辨率和低空间冗余,这使得它们非常适合硬件处理。利用FPGA的确定性、大规模并行结构,设计了一种深度流水线架构,能够实现高吞吐量和能效的处理,适合实时嵌入式应用。本文详细介绍了负责事件扭曲、对比度计算和迭代优化的硬件模块,讨论了关键的实现决策,并呈现了设计中使用的硬件感知优化方法。实验结果表明,与基于CPU和GPU的实现相比,速度和效率有显著提升,运动参数估计的执行速度超过200倍。根据我们的最佳知识,这是首个能够加速CM算法计算的硬件架构。其性能通过处理速度、能效和硬件资源利用率进行评估。所提出的设计通过事件驱动的目标跟踪应用进行了验证。结果确认该架构为高速、低功耗嵌入式系统中的实时运动估计提供了坚实的基础。
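The Contrast Maximization objective described in the abstract above can be sketched in software as follows. This is a minimal illustration assuming a constant 2D velocity model and a brute-force search over candidate velocities; it is not the paper's FPGA design or its optimization method, and the names `iwe_variance` and `best_velocity` are hypothetical.

```python
def iwe_variance(events, vx, vy, width, height, t_ref=0.0):
    """Warp events (x, y, t) back to t_ref along (vx, vy) and return the
    variance (contrast) of the resulting Image of Warped Events."""
    img = [0] * (width * height)
    for x, y, t in events:
        xw = round(x - vx * (t - t_ref))
        yw = round(y - vy * (t - t_ref))
        if 0 <= xw < width and 0 <= yw < height:
            img[yw * width + xw] += 1  # accumulate warped event
    mean = sum(img) / len(img)
    return sum((v - mean) ** 2 for v in img) / len(img)


def best_velocity(events, candidates, width, height):
    """Return the candidate velocity whose warped image has the highest contrast."""
    return max(candidates, key=lambda v: iwe_variance(events, v[0], v[1], width, height))


# Synthetic events from a single point moving at 2 px per time unit along x
events = [(3 + 2 * t, 5, float(t)) for t in range(5)]
estimate = best_velocity(events, [(0, 0), (1, 0), (2, 0), (3, 0)], width=16, height=8)
```

With the correct velocity, all events collapse onto one pixel, which yields the sharpest (highest-variance) image; wrong candidates smear the events across several pixels.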
cs.CV / 166 / 2605.09586
DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos
DeformMaster:一种基于视频的可变形物体交互物理-神经世界模型
Abstract
World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics--neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as a distributed compliant actuator for hand--continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis.
Chinese Translation
可变形物体的世界模型不仅应恢复几何形状和外观,还应重建潜在的物理动态、交互基础和材料行为。从真实视频中学习这样的模型具有挑战性,因为可变形的线性、平面和体积物体在高维变形、噪声交互和复杂材料响应下演变。因此,模型必须从视觉观察中推断出物理状态,在新的交互下进行前推,并以高视觉保真度渲染出结果动态。我们提出了DeformMaster,这是一种基于视频的交互物理-神经世界模型,将真实的交互视频转化为一个在线交互的可变形物体模型,采用统一的动态与外观框架。DeformMaster在保持结构化物理前推的同时,使用神经残差来补偿未建模的效应,将稀疏的手部运动作为分布式顺应执行器用于手部与连续体的交互,利用空间变化的本构专家来表示材料响应,并根据预测的物理演变驱动高保真度的4D外观。对真实世界可变形物体序列的实验表明,DeformMaster能够前推未来动态并渲染动态外观,超越了最先进的基线,同时支持新颖的动作前推、材料参数变化和动态新视角合成。
cs.CV / 167 / 2605.09591
From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
从像素到概念:分割模型是否理解它们所分割的内容?
Abstract
Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
Chinese Translation
分割是基础视觉任务,支撑着众多下游应用。近期的可提示分割模型,如Segment Anything Model 3 (SAM3),将分割从类别无关的掩膜预测扩展到基于高层文本提示的概念引导定位。然而,现有基准主要评估掩膜准确性或物体存在性,尚不清楚这些模型是否忠实于所查询的概念,还是依赖于视觉上显著但语义上误导的线索。我们引入了CAFE(Counterfactual Attribute Factuality Evaluation),这是一个用于评估可提示分割模型中概念忠实分割的新基准。我们的CAFE基于属性级的反事实操控:目标区域和真实掩膜得以保留,而表面外观、上下文或材料组成等属性则被修改,以引入误导性的语义线索。该基准包含2146对测试样本,每个样本由一个目标图像、一个真实掩膜、一个正向提示和一个误导性的负向提示组成。这些样本涵盖三个反事实类别:表面模仿(SM)、上下文冲突(CC)和本体冲突(OC)。我们在CAFE上评估了各种模型类型和规模。实验揭示了定位质量与概念区分之间的系统性差距:模型即使在误导性提示下也常常生成准确的掩膜,这表明强大的掩膜预测并不一定意味着忠实的语义基础。我们的CAFE提供了一个受控基准,用于诊断可提示分割模型是否执行概念忠实的基础,而不是依赖捷径驱动的掩膜检索。
cs.CV / 168 / 2605.09598
SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
SoccerLens:超越准确性的足球视频理解
Abstract
Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
Chinese Translation
视觉-语言模型(VLMs)最近在足球视频理解方面展现出强大的潜力。然而,由于足球视频的高复杂性,包括视角变化大、快速镜头切换和场景杂乱,VLMs是否依赖于有意义的视觉证据,或是利用虚假相关性和捷径学习仍然不清楚。现有的评估协议主要集中在分类准确性上,而未评估视觉基础。为了解决这一局限性,我们引入了SoccerLens,一个用于基础足球视频理解的基准。该基准包含注释的视频片段,涵盖$13$种常见的足球事件,并将结构化视觉线索组织为三个层次的语义相关性。我们进一步扩展了Chefer [arXiv:2103.15679] 的归因方法,以联合建模空间和时间注意力,并引入评估指标,以衡量模型注意力是否与注释线索对齐或偏向虚假区域。我们对最先进的足球VLMs的评估显示,尽管分类准确性强,但当前模型在最宽松的线索定义下仍未能超过$50\%$的基础表现,并且始终未能充分利用时间信息。这些结果揭示了预测性能与真实视觉基础之间的巨大差距,突显了在复杂时空领域(如足球)中进行基础评估的必要性。
cs.CV / 169 / 2605.09604
DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition
DAP:用于异构毫米波动作识别的多普勒感知点网络
Abstract
Millimeter-wave (mmWave) radar provides privacy-preserving sensing and is valuable for human action recognition (HAR). Existing mmWave point cloud datasets are limited in scale and mostly collected under homogeneous single-source settings, preventing current methods from handling real-world distribution shifts caused by heterogeneous radar sources, such as different devices and frequency bands. To address this, we introduce UniMM-HAR, the largest and first mmWave point cloud HAR dataset for heterogeneous multi-source scenarios, standardizing three distinct radar configurations to realistically evaluate cross-source generalization. We further propose the Doppler-aware Point Cloud Network (DAP-Net) to tackle heterogeneity challenges. DAP-Net enhances intra-modal representations and performs cross-modal alignment to learn source-invariant action semantics. Leveraging action-consistent spatio-temporal Doppler patterns as anchors, the Dual-space Doppler Reparameterization (D2R) module performs sample-adaptive geometric densification and Doppler-guided feature recalibration, while the Text Alignment Module (TAM) provides stable semantic anchors via a pretrained textual space. Experiments show that DAP-Net significantly outperforms existing methods under heterogeneous radar settings, achieving state-of-the-art accuracy and strong cross-source robustness.
Chinese Translation
毫米波(mmWave)雷达提供隐私保护的感知,并在人体动作识别(HAR)中具有重要价值。现有的毫米波点云数据集规模有限,且大多是在同质单源环境下收集的,这使得当前的方法无法处理由异构雷达源(如不同设备和频段)引起的真实世界分布变化。为了解决这一问题,我们引入了UniMM-HAR,这是首个针对异构多源场景的最大毫米波点云HAR数据集,标准化了三种不同的雷达配置,以真实评估跨源泛化能力。我们进一步提出了多普勒感知点云网络(DAP-Net)以应对异构性挑战。DAP-Net增强了模态内表示,并执行跨模态对齐,以学习源不变的动作语义。利用一致的时空多普勒模式作为锚点,双空间多普勒重参数化(D2R)模块执行样本自适应几何密集化和多普勒引导的特征重校准,而文本对齐模块(TAM)通过预训练的文本空间提供稳定的语义锚点。实验表明,DAP-Net在异构雷达设置下显著优于现有方法,实现了最先进的准确性和强大的跨源鲁棒性。
cs.CV / 170 / 2605.09614
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
用于传播感知视觉保留的反射锚点在长链多模态推理中的应用
Abstract
Long chain-of-thought (CoT) reasoning improves large vision--language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.
Chinese Translation
长链思维(CoT)推理提升了大型视觉-语言模型的性能,但在生成过程中视觉信息常常会消退,从而限制了长时间跨度的多模态推理。现有方法要么在推理时重新注入视觉信息,要么训练策略以实现更强的基础支持,但干预的时机依赖于感知启发式而非原则性的增益分析,并且局部视觉影响的传播仍然是隐含的。我们从信息论的角度研究这个问题,并推导出一步干预的下游视觉增益的下界,这表明两个因素:局部分支空间(标记熵)和下游视觉传播潜力(与视觉边际参考的后缀散度)。在这一分析的指导下,我们提出了反射锚点策略优化(RAPO),这是一种基于GRPO的策略优化方法,选择高熵的反射锚点,并优化链掩蔽的有限窗口KL替代品以增强下游视觉依赖性。在推理密集型和通用领域基准测试中的实验表明,RAPO在多个大型视觉语言模型(LVLM)基础上相较于强基线提供了显著的增益。机制分析进一步表明,反射锚点在视觉敏感的决策点上得到了增强,并且RAPO增加了沿生成轨迹的对比视觉依赖信号。
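The "local branching room" factor named in the abstract above is token entropy, and picking high-entropy positions can be sketched as follows. This is a minimal illustration covering only the entropy criterion; it does not model the suffix-divergence factor or the GRPO-based training, and the names `token_entropy` and `select_anchors` are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_anchors(step_probs, k):
    """Return the indices of the k generation steps with the highest token entropy."""
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy next-token distributions for four generation steps
step_probs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
anchors = select_anchors(step_probs, k=2)
```

Steps where the model is nearly certain (entropy close to zero) leave little room for an intervention to change the continuation, which is why they are ranked last here.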
cs.CV / 171 / 2605.09619
GSMap: 2D Gaussians for Online HD Mapping
GSMap:用于在线高清地图的二维高斯模型
Abstract
Accurate High-Definition (HD) map construction is critical for autonomous driving, yet existing methods face a fundamental trade-off: vectorization-based approaches preserve topology but struggle with geometric fidelity, while rasterization-based approaches enable precise geometric supervision but produce unstructured outputs. To bridge this gap, we propose GSMap, a novel framework that unifies both paradigms via a learnable 2D Gaussian representation. Each map element is modeled as an ordered sequence of 2D Gaussians, whose centers correspond to the vertices of the vectorized polyline/polygon. This formulation enables simultaneous optimization through: (1) Differentiable rasterization that enforces pixel-level geometric constraints, and (2) Topology-aware vectorization that maintains structural regularity. Experiments on both nuScenes and Argoverse2 demonstrate that our Gaussian-based representation effectively unifies geometric and topological learning, achieving significant performance improvements and demonstrating strong compatibility with existing HD mapping architectures. Code will be available at https://github.com/peakpang/GSMap
Chinese Translation
准确的高清(HD)地图构建对于自动驾驶至关重要,但现有方法面临一个基本的权衡:基于矢量化的方法保留了拓扑结构,但在几何保真度上存在困难,而基于光栅化的方法则能够实现精确的几何监督,但产生无结构的输出。为了解决这一问题,我们提出了GSMap,一个通过可学习的二维高斯表示统一这两种范式的新框架。每个地图元素被建模为一系列有序的二维高斯,其中心对应于矢量化折线/多边形的顶点。这种表述使得通过以下方式实现了同时优化:(1) 可微光栅化,强制执行像素级几何约束,以及(2) 关注拓扑的矢量化,保持结构的规律性。在nuScenes和Argoverse2上的实验表明,我们基于高斯的表示有效地统一了几何和拓扑学习,取得了显著的性能提升,并展示了与现有高清地图架构的强兼容性。代码将发布在 https://github.com/peakpang/GSMap
cs.CV / 172 / 2605.09622
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
基于知识转移的任意对任意3D扩散模型:放射治疗计划研究
Abstract
Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP-HMM challenge, DiffKT3D sets a new state of the art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.
Chinese Translation
体素级剂量预测是实际放射治疗(RT)计划中一项关键但具有挑战性的任务,因为从头开始训练的定制模型往往难以在多样的临床环境中进行泛化。同时,基于视觉领域的亿级数据集训练的生成模型已取得了令人印象深刻的性能。在此,我们提出了DiffKT3D,一个统一的任意对任意3D扩散框架,利用预训练视频扩散模型的先验知识,实现高效且具有临床意义的剂量预测。为了在多种临床模式(CT、解剖结构、身体、束设置等)之间实现灵活的条件设置,我们引入了一种任意对任意条件范式,采用特定模式的嵌入而无需交叉注意力开销。此外,我们设计了一种新颖的强化学习(RL)后训练机制,该机制由明确针对机构治疗偏好的临床信息评分卡引导。与GDP-HMM挑战赛的获胜者相比,DiffKT3D在剂量预测方面设定了新的最先进水平,将体素级平均绝对误差(MAE)从2.07降低至1.93。此外,DiffKT3D在图像质量和偏好匹配方面也表现出色。这些结果表明,通过模式感知条件设置和与临床对齐的RL后训练转移扩散先验,可以为各种临床场景提供稳健且可泛化的放射治疗计划解决方案。
cs.CV / 173 / 2605.09628
DegBins: Degradation-Driven Binning for Depth Super-Resolution
DegBins:基于退化驱动的深度超分辨率分箱方法
Abstract
Depth super-resolution (DSR) aims to recover a high-resolution (HR) depth map from its low-resolution (LR) counterpart. With color image guidance, this task is typically formulated as learning the residual between HR and LR in a low-dimensional feature space. However, this additive formulation is insufficient to accurately capture the complex relationship between HR and LR, especially under spatially varying degradations. In this paper, we introduce DegBins, a novel DSR framework that leverages degradation-driven binning to adaptively enhance residual modeling. Specifically, DegBins reformulates the regression-based DSR as a hybrid classification-regression problem, where the residual depth is represented as a linear combination of discrete depth bins weighted by their learned probability distribution, yielding more flexible and expressive representations. Furthermore, DegBins models the degradation relationship between HR and LR in a high-dimensional feature space, enabling adaptive bin range adjustment and probability optimization conditioned on local degradation characteristics. To progressively improve reconstruction quality, DegBins adopts a multi-stage refinement scheme, where each stage performs finer-grained bin partitioning and probability updating based on the former estimation. This coarse-to-fine design facilitates more accurate depth recovery, particularly in regions with severe degradations or complex structural variations. Extensive experiments across five benchmarks demonstrate that DegBins consistently outperforms existing state-of-the-art methods in terms of accuracy, robustness, and generalization.
Chinese Translation
深度超分辨率(DSR)旨在从低分辨率(LR)深度图恢复高分辨率(HR)深度图。在彩色图像的指导下,这一任务通常被表述为在低维特征空间中学习HR与LR之间的残差。然而,这种加法形式不足以准确捕捉HR与LR之间复杂的关系,尤其是在空间变化的退化情况下。本文提出了DegBins,一种新颖的DSR框架,利用退化驱动的分箱方法自适应增强残差建模。具体而言,DegBins将基于回归的DSR重新表述为混合分类-回归问题,其中残差深度被表示为离散深度分箱的线性组合,权重由其学习的概率分布决定,从而产生更灵活和更具表现力的表示。此外,DegBins在高维特征空间中建模HR与LR之间的退化关系,使得能够根据局部退化特征自适应调整分箱范围和优化概率。为了逐步提高重建质量,DegBins采用多阶段细化方案,每个阶段基于前一估计执行更细粒度的分箱划分和概率更新。这种粗到细的设计有助于在退化严重或结构变化复杂的区域实现更准确的深度恢复。在五个基准测试上的大量实验表明,DegBins在准确性、鲁棒性和泛化能力方面始终优于现有的最先进方法。
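The bin-based residual described in the abstract above, a linear combination of discrete depth bins weighted by a probability distribution, can be sketched for a single pixel as follows. This is a minimal illustration that assumes the probabilities come from a softmax over per-bin logits; the abstract does not specify that detail, and the names `softmax` and `residual_from_bins` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_from_bins(logits, bin_centers):
    """Residual depth as the probability-weighted sum of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, bin_centers))


# One pixel: upsampled LR depth plus the predicted residual gives the HR estimate
lr_depth = 2.50
residual = residual_from_bins(logits=[0.0, 0.0, 50.0], bin_centers=[-0.1, 0.0, 0.2])
hr_depth = lr_depth + residual
```

A confident distribution reduces to picking one bin (classification), while a spread-out distribution interpolates between bins (regression), which is the hybrid behavior the abstract refers to.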
cs.CV / 174 / 2605.09640
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
通过强化微调克服视觉持续学习中的灾难性遗忘
Abstract
Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
Chinese Translation
最近的研究表明,强化微调(Reinforcement Fine-Tuning, RFT)在本质上比监督微调(Supervised Fine-Tuning, SFT)更能抵御灾难性遗忘。然而,RFT(例如,GRPO)是否能够有效克服在具有挑战性的视觉持续学习环境中(如类增量学习(Class-Incremental Learning, CIL)和领域增量学习(Domain-Incremental Learning, DIL))的遗忘问题,仍然是一个未解的问题。通过一项初步研究,我们确认虽然RFT始终优于SFT,但仍然遭遇不可忽视的遗忘。我们实证追踪这一瓶颈至轨迹级漂移不可知性:在实现相同任务奖励的候选轨迹中,前一任务策略的KL散度变化显著,这与跨序列任务的灾难性遗忘强相关。基于这一见解,我们提出了保留意识策略优化(Retention-aware Policy Optimization, RaPO),这是一种简单而有效的RFT方法,通过轨迹级奖励塑造明确减轻遗忘。具体而言,RaPO包含两个核心组件:(1)保留奖励,将轨迹级分布漂移转化为连续的奖励信号,优先强化每组内知识保留的轨迹;(2)跨任务优势归一化(Cross-Task Advantage Normalization, CTAN),在任务边界之间保持奖励统计的持续指数移动平均,以稳定持续学习过程中的优化进展。利用大规模语言模型(MLLMs)的自由形式文本泛化,我们在五个视觉持续学习环境中全面评估了RaPO。大量实验表明,RaPO实现了领先的性能,显著减少了灾难性遗忘,同时保持了强大的可塑性。根据我们所知,这项工作代表了在视觉持续学习中对RFT的首次系统探索,提供了我们希望能激励未来研究的见解。
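The idea of keeping reward statistics as an exponential moving average that persists across task boundaries, as the abstract above describes for CTAN, can be sketched as follows. This is a rough illustration under assumed details: the class name `RunningRewardStats`, the momentum value, and the mean/variance normalization formula are all hypothetical, since the abstract does not give the exact update or normalization rule.

```python
class RunningRewardStats:
    """Exponential moving average of reward mean and variance (hypothetical sketch)."""

    def __init__(self, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.var = None

    def update(self, rewards):
        """Fold one group of rollout rewards into the running statistics."""
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards):
        """Normalize rewards with the running statistics instead of per-group ones."""
        std = (self.var + self.eps) ** 0.5
        return [(r - self.mean) / std for r in rewards]


stats = RunningRewardStats(momentum=0.5)
stats.update([0.0, 2.0])   # rollouts from an earlier task
stats.update([2.0, 4.0])   # rollouts from the next task; statistics are not reset
adv = stats.advantages([2.0, 3.0])
```

The design point illustrated is only that the statistics object is never reset when the task changes, so the normalization scale does not jump at task boundaries.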
cs.CV / 175 / 2605.09644
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
注意力本身可以检索:基于查询-键相似性检索的无训练长上下文流式3D重建RetrieveVGGT
Abstract
Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
Chinese Translation
视觉几何基础变换器(Visual Geometry Grounded Transformer,VGGT)通过可扩展的变换器架构推进了3D重建,但全局注意力的二次复杂度阻碍了长上下文的应用。StreamVGGT实现了因果注意力的流式处理,但其KV缓存随着帧数线性增长,导致内存溢出和质量下降。我们提出了RetrieveVGGT,这是一种无训练框架,将VGGT的上下文构建形式化为检索问题。通过在每一步检索固定数量的相关帧,VGGT保持可控的内存预算,接近其训练上下文长度。有趣的是,我们发现当前帧查询与VGGT第一层全局注意力的缓存历史帧键之间的相似性已经是相关性的强指示,消除了额外学习评分的需求。为了增强信息多样性,类似于推荐系统,我们提出了段采样(Segment Sampling),使得检索跨越不同的相关段而不是单一的高相似性区域。我们设计了一种姿态感知空间记忆机制,根据历史帧的已估计相机姿态组织历史帧,从而实现位置感知的检索。大量实验表明,RetrieveVGGT实现了最先进的性能,超越了StreamVGGT、TTT3R和InfiniteVGGT,同时在序列长度不变的情况下保持恒定的内存使用。代码可在https://github.com/zzctmd/RetrieveVGGT获取。
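The retrieval step described in the abstract above, scoring cached frames by query-key similarity and keeping a fixed number of them, can be sketched as follows. This is a minimal illustration that assumes one pooled query vector for the current frame and one pooled key vector per cached frame, scored by dot product; it omits Segment Sampling and the pose-aware memory, and the name `retrieve_frames` is hypothetical.

```python
def retrieve_frames(query, cached_keys, k):
    """Return indices of the k cached frames whose keys are most similar to the query."""
    scores = [sum(q * c for q, c in zip(query, key)) for key in cached_keys]
    ranked = sorted(range(len(cached_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep temporal order among the retrieved frames


# Toy example: four cached frames with 2-D pooled keys
cached_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
context = retrieve_frames(query=[1.0, 0.0], cached_keys=cached_keys, k=2)
```

Because `k` is fixed, the context passed to the model stays the same size no matter how many frames have been cached, which is the constant-memory property the abstract reports.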
cs.CV / 176 / 2605.09662
BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
BEA-GS:在3DGS中超越辐射监督以实现精确对象提取
Abstract
Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the scene's geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene's geometry even when little transmittance reaches a Gaussian (partially visible or non-visible). Exhaustive comparisons with 12 state-of-the-art methods across four datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.
Chinese Translation
大多数提供场景3D语义表示的高斯喷溅技术并未优化其底层3D几何结构,这使得对象级编辑或资产提取变得具有挑战性。近期的方法,如COBGS、Trace3D和ObjectGS,认识到这一局限性,并提出了修改场景几何以表示底层语义的方法。我们进一步推进了这一概念,提出了一种新颖的解决方案,能够在对象提取中提供近乎完美的边界。我们通过在优化中引入两个新的损失函数来实现这一目标:1)一个损失函数修改可见高斯的几何形状,以遵循语义边界;2)一个损失函数调整在对象提取后出现的不可见高斯的几何形状。我们的第一个损失函数通过光栅化直接传播梯度,从而实现与高斯参数优化的无缝集成。第二个损失函数也向高斯参数传播梯度,但不经过光栅化,使得即使在很少的透射光到达高斯(部分或不可见)时,也能修改场景几何。与12种最先进的方法在4个数据集上进行的全面比较,使用六种指标,证明我们的方法在边界分割方面整体上是迄今为止最优的。
cs.CV / 177 / 2605.09666
Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models
重新思考多发性硬化症(MS)病灶分割模型的评估
Abstract
Multiple Sclerosis (MS) is a chronic autoimmune disease that can significantly reduce the quality of life of a patient. Existing treatment options can only help slow down the progression of the disease. Therefore, early detection and precise monitoring of disease progression are important. Deep learning offers state-of-the-art models for detecting and segmenting MS lesions in brain MRI scans. However, most of these models are evaluated using the Dice score, without accounting for lesion-wise detection and segmentation performance or other metrics that quantify model performance in cases that are complex or confusing for human annotators, or in cases that are essential for disease detection and progression monitoring. In this paper, we highlight the need to rethink the evaluation of MS lesion segmentation models. In this context, we first present problem fingerprinting in detail to highlight what neurologists look for in brain MRI scans for MS detection and progression monitoring, and which metrics are required to properly quantify model performance in these contexts. Additionally, we present an analysis of state-of-the-art models on two open-source datasets using these metrics to highlight their usability for real-world deployment in hospitals.
Chinese Translation
多发性硬化症(MS)是一种慢性自身免疫疾病,可能显著降低患者的生活质量。现有的治疗方案只能帮助减缓疾病的进展。因此,早期检测和精确监测疾病进展显得尤为重要。深度学习为检测和分割脑部MRI扫描中的MS病灶提供了最先进的模型。然而,这些模型大多数是通过Dice分数进行评估的,并未考虑病灶级别的检测和分割性能,或其他量化模型在复杂或混淆情况下的表现的指标,这些情况对于人类标注者来说可能具有挑战性,或在疾病检测和进展监测中至关重要。本文强调了重新思考MS病灶分割模型评估的必要性。在此背景下,我们首先详细介绍了问题指纹识别,以突出神经科医生在脑部MRI扫描中寻找MS检测和进展监测所需关注的内容,以及在这些背景下正确量化模型性能所需的指标。此外,我们还对使用这些指标的两个开源数据集上的最先进模型进行了分析,以突出它们在医院实际应用中的可用性。
cs.CV / 178 / 2605.09667
S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes
S2P-Net:一种用于低数据环境下旋转不变目标识别的光谱-空间极坐标网络
Abstract
We present S2P-Net (Spectral-Spatial Polar Network), a compact deep learning architecture that achieves mathematically guaranteed rotation invariance without data augmentation. In this paper, we also compare it with other neural network architectures (CNNs). Have a look at the results and feel free to contact me with any questions. This is my first paper :) Made by Hackbert
Chinese Translation
我们提出了S2P-Net(光谱-空间极坐标网络),这是一种紧凑的深度学习架构,能够在没有数据增强的情况下实现数学上保证的旋转不变性。在本文中,我们还与其他神经网络架构(卷积神经网络,CNN)进行了比较。请查看结果,如有任何问题,请随时与我联系。这是我的第一篇论文 :) 由Hackbert制作
cs.CV / 179 / 2605.09677
VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement
VFM-SDM:基于视觉基础模型的无训练、无标记和无校准的结构位移测量框架
Abstract
Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.
Chinese Translation
可靠的位移测量是结构健康监测和数字工程工作流程的基础,因为它提供了直接的结构响应信息。基于视觉的测量作为一种低成本、非接触式位移监测的有前景的方法逐渐受到关注。然而,其部署往往受到特定任务模型训练或现场准备(如标记安装或手动相机校准)的限制。本研究提出了一种基于视觉基础模型的结构位移测量框架(VFM-SDM),该框架集成了VFM推断的相机参数估计和点跟踪,通过三角测量重建多方向的结构位移,无需特定任务的训练或现场准备,从而实现了在实际应用中的高效非接触式部署。结构几何约束被纳入以抑制物理上不合理的偏差并提高估计的一致性。本文引入了一个从在役人行天桥收集的多模态现场数据集,并提供了统一的基准协议以支持可重复的评估。代表性结果显示,垂直和横向位移的低幅度误差(NRMSE$_{\text{range}}$: 0.11/0.12)、强时间一致性(相关系数:0.86/0.88)以及小的峰对峰幅度误差(RPPAE: 0.01/0.02),表明在实际条件下具有稳健的性能。所提出的框架推动了自动化、可扩展的位移监测,并为基于VFM的结构响应测量在数字双胞胎和数据驱动的建筑工作流程中奠定了基础。
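The VFM-SDM abstract above reports three agreement metrics without defining them. The sketch below shows one plausible reading, under stated assumptions: NRMSE$_{\text{range}}$ is taken as RMSE divided by the range of the reference signal, the correlation coefficient as Pearson's r, and RPPAE as the relative difference in peak-to-peak amplitude. These definitions and the function names are assumptions for illustration and may differ from the paper's.

```python
import math
from typing import Sequence


def nrmse_range(est: Sequence[float], ref: Sequence[float]) -> float:
    """RMSE normalized by the reference range (assumed definition)."""
    span = max(ref) - min(ref)
    if span == 0:
        raise ValueError("reference signal has zero range")
    rmse = math.sqrt(sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref))
    return rmse / span


def pearson(est: Sequence[float], ref: Sequence[float]) -> float:
    """Pearson correlation coefficient between estimate and reference."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(est, ref))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_e * var_r)


def rppae(est: Sequence[float], ref: Sequence[float]) -> float:
    """Relative peak-to-peak amplitude error (assumed definition)."""
    pp_ref = max(ref) - min(ref)
    if pp_ref == 0:
        raise ValueError("reference signal has zero peak-to-peak amplitude")
    return abs((max(est) - min(est)) - pp_ref) / pp_ref


# Toy signals: the estimate overshoots the reference amplitude by 10%.
reference = [0.0, 1.0, 0.0, -1.0]
estimate = [0.0, 1.1, 0.0, -1.1]
metrics = (nrmse_range(estimate, reference), pearson(estimate, reference), rppae(estimate, reference))
```

On the toy signals the estimate is a scaled copy of the reference, so the correlation is essentially 1 while the two amplitude metrics are non-zero. This is why amplitude and temporal agreement are worth reporting separately.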
cs.CV / 180 / 2605.09679
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
DeepTumorVQA:用于医学视觉语言模型和工具增强代理逐阶段评估的分层3D CT基准
Abstract
Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.
Chinese Translation
医学视觉语言模型(VLMs)和人工智能代理在学习分析和推理临床图像方面取得了显著进展。然而,现有的医学视觉问答(VQA)基准将模型能力压缩为单一的准确率评分,模糊了模型失败的具体原因和位置。我们提出了DeepTumorVQA,这是一个分层基准,遵循肿瘤诊断中的多阶段证据链,并将3D CT推理分解为四个阶段:识别、测量、视觉推理和医学推理。更高层次的问题可以独立评分,而其真实证据链则在较低层次的原语上定义。该基准包含476K个问题,涵盖42种临床亚型,基于9262个3D CT体积。除了为VLMs提供直接推理模式外,DeepTumorVQA还为代理评估提供工具交互环境,在该环境中,模型可以调用外部工具,包括分割模型、测量程序和医学知识模块,然后再回答问题。通过对30多种模型配置的评估,我们发现可靠的定量测量是主要瓶颈,使得后期的视觉和医学推理对VLMs而言更加困难,而工具增强显著缓解了这一问题。当工具可用时,利用医学知识和工具对医学图像进行推理成为一项新挑战。我们进一步表明,DeepTumorVQA中的真实逐步工具使用轨迹可以监督代理并减少工具使用和推理失败。从识别到测量,再到视觉和医学推理的逐阶段进展为未来医学VLM和人工智能代理研究提供了具体的路线图。所有数据和代码已发布在 https://github.com/Schuture/DeepTumorVQA。
cs.CV / 181 / 2605.09681
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV:用于高效自回归视频扩散模型的混合KV缓存压缩
Abstract
Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
Chinese Translation
自回归(AR)视频扩散模型采用流式生成框架,使得能够实现长时间跨度的视频生成并具备实时响应能力,正如自强训练范式所示。然而,现有的AR视频扩散模型仍然面临显著的注意力复杂性和由于历史帧之间冗余的键值(KV)缓存导致的严重内存开销,这限制了其可扩展性。本文通过将KV缓存压缩引入自回归视频扩散来应对这一挑战。我们观察到主流AR扩散模型中的注意力头展现出显著不同的注意力模式和功能角色,这些模式和角色在样本和去噪步骤中保持稳定。基于我们对头部功能专业化的实证研究,我们将注意力头分为两类:静态头,专注于自回归块之间的过渡和帧内保真度;动态头,负责帧间运动和一致性。然后,我们提出了Forcing-KV,一种混合KV缓存压缩策略,对静态头进行结构化静态剪枝,对动态头基于段间相似性进行动态剪枝。在保持输出质量的同时,我们的方法在单个NVIDIA H200 GPU上实现了每秒超过29帧的生成速度,并减少了30%的缓存内存,在480P分辨率下实现了LongLive和Self Forcing分别高达1.35倍和1.50倍的加速,并在1080P分辨率下进一步扩展至2.82倍的加速。代码和演示视频可在 https://zju-jiyicheng.github.io/Forcing-KV-Page 获取。
cs.CV / 182 / 2605.09687
Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution
用于遥感单幅图像超分辨率的空间频率门控Swin Transformer
Abstract
Remote Sensing (RS) single-image super-resolution aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structures. Recent Swin Transformer-based models, including Swin2SR, provide strong spatial context modeling through shifted-window self-attention, but their feed-forward networks remain generic channel-mixing modules and do not separate low-frequency structural content from high-frequency residual detail. To address this limitation, we propose SFG-SwinSR, a Spatial-Frequency Gated Swin Transformer for single-image super-resolution in remote sensing. SFG-SwinSR modifies the original Swin2SR attention block by replacing each transformer block's standard feed-forward network with a lightweight Spatial-Frequency Gated Feed-Forward Network (SFG-FFN). The module estimates low-frequency content via a depthwise-blur branch, extracts high-frequency residuals by subtraction, refines them with a lightweight spatial branch, and adaptively injects detail through a bottleneck gate. Experiments on SpaceNet and SEN2VENµS show that SFG-SwinSR improves reconstruction quality under the evaluated settings. On SpaceNet, it achieves 45.19 dB PSNR and 0.9852 SSIM, indicating effective enhancement of high-frequency details. This demonstrates that spatial-frequency transformation within the transformer feed-forward network improves detail reconstruction in RS super-resolution.
Chinese Translation
遥感(RS)单幅图像超分辨率旨在从低分辨率观测中重建高分辨率图像,同时保持细致的空间结构。近期基于Swin Transformer的模型,包括Swin2SR,通过移动窗口自注意力提供了强大的空间上下文建模,但其前馈网络仍然是通用的通道混合模块,未能将低频结构内容与高频残差细节分离。为了解决这一限制,我们提出了SFG-SwinSR,一种用于遥感单幅图像超分辨率的空间频率门控Swin Transformer。SFG-SwinSR通过用轻量级的空间频率门控前馈网络(SFG-FFN)替换原始Swin2SR注意力块中每个变换器块的标准前馈网络来进行修改。该模块通过深度模糊分支估计低频内容,通过减法提取高频残差,利用轻量级空间分支进行细化,并通过瓶颈门控自适应地注入细节。在SpaceNet和SEN2VENµS上的实验表明,SFG-SwinSR在评估设置下提高了重建质量。在SpaceNet上,它达到了45.19 dB的PSNR和0.9852的SSIM,表明高频细节得到了有效增强。这表明变换器前馈网络中的空间频率变换改善了遥感超分辨率中的细节重建。
cs.CV / 183 / 2605.09688
ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes
ConFixGS:在驾驶场景中利用信心感知扩散先验学习修复前馈3D高斯点云
Abstract
Feedforward 3D Gaussian Splatting (3DGS) often struggles in trajectory-based sparse-view driving scenes. Existing Gaussian repair methods mainly target optimization-based 3DGS, while diffusion-based repair is typically restricted to iterative refinement near observed viewpoints, leaving feedforward 3DGS repair underexplored. We propose ConFixGS, a plug-and-play method that learns to fix feedforward 3DGS with confidence-aware diffusion priors. Starting from a pretrained feedforward model, ConFixGS generates diffusion-enhanced local pseudo-targets and validates them through reprojection-based cross-checking against support views. The resulting dense confidence maps guide refinement, enhancing reliable details while suppressing hallucinated or inconsistent evidence. On Waymo, nuScenes, and KITTI, ConFixGS improves challenging novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half. Our results highlight confidence-aware fusion of generative priors and support-view consistency as a key principle for robust feedforward 3D driving scene reconstruction.
Chinese Translation
前馈3D高斯点云(3DGS)在基于轨迹的稀疏视图驾驶场景中常常面临挑战。现有的高斯修复方法主要针对基于优化的3DGS,而基于扩散的修复通常仅限于在观察视点附近进行迭代细化,导致前馈3DGS修复的研究相对不足。我们提出了ConFixGS,这是一种即插即用的方法,能够利用信心感知的扩散先验学习修复前馈3DGS。ConFixGS从一个预训练的前馈模型开始,生成扩散增强的局部伪目标,并通过基于重投影的交叉检查对其进行验证,确保与支持视图的一致性。生成的密集置信度图引导细化过程,增强可靠细节,同时抑制虚假或不一致的证据。在Waymo、nuScenes和KITTI数据集上,ConFixGS显著改善了具有挑战性的新的视图合成,PSNR提升高达3.68 dB,FID几乎减少了一半。我们的结果强调了生成先验的信心感知融合与支持视图一致性作为稳健前馈3D驾驶场景重建的关键原则。
cs.CV / 184 / 2605.09693
Do multimodal models imagine electric sheep?
多模态模型是否能够想象电羊?
Abstract
Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.
Chinese Translation
是的。我们发现大型多模态模型在解决空间难题时会产生心理意象,并且在解决与羊相关的难题时确实会想象出羊。我们对 Qwen3.5 VLM 进行了微调,以解决十二种不同的视觉推理任务——包括拼图、拼图游戏、推箱子、三维心理旋转和高峰时段——这些任务需要理解几何、空间关系以及行动的后果。通过监督模型预测从初始状态解决难题的开放循环动作序列,我们展示了模型在每个动作后的激活编码了关于中间状态的有意义的视觉信息。这一发现表明,在没有任何明确视觉监督的情况下,学习选择正确动作的副产品开始形成一个不完美的视觉世界模型。在此基础上,我们提出了两种方法来锐化和利用模型形成的心理图像。我们发现,在思维链中每一步整合少至十六个视觉标记,可以将平均解决率从 83% 提高到 89%,在拼图和三维心理旋转等推理密集型任务上尤其表现出显著提升。
cs.CV / 185 / 2605.09697
Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction
作为分类器重建的合成数据效用预测的区分性跨度
Abstract
In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
Chinese Translation
在许多现实世界的计算机视觉应用中,包括医学成像和工业检测,二元分类任务的正样本严重匮乏。一个广泛采用的解决方案是通过对负样本应用图像到图像的变换来生成合成正数据。然而,仍然存在一个基本挑战:我们如何可靠地评估这些合成数据是否会提高下游模型的性能?在本研究中,我们提出了一种基于几何的度量,能够在不需要模型训练的情况下预测合成数据的效用。我们的方法在预训练基础模型的嵌入空间中运行,通过样本之间的差异向量来表示数据集。我们通过测量相对投影误差来评估线性分类器的权重向量是否可以在这些变异所张成的子空间内表示。直观地说,如果合成数据引起的变异捕捉到与任务相关的方向,则它们的跨度可以近似分类器,从而导致低投影误差。相反,劣质合成数据无法涵盖这些方向,导致更高的误差。在多个数据集和架构中,我们展示了该度量与在真实负样本和合成正样本混合数据上训练的卷积神经网络(CNN)的下游分类性能之间存在强相关性。这些发现表明,所提出的度量在数据稀缺的环境中作为评估合成数据质量的实用和信息丰富的工具。
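The relative projection error described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming the metric is the norm of the classifier weight's component outside the span of the difference vectors, divided by the norm of the weight itself. The function name, the Gram-Schmidt orthonormalization, and the tolerance are assumptions, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def relative_projection_error(w: Vector, diffs: List[Vector], tol: float = 1e-10) -> float:
    """||w - P(w)|| / ||w||, where P projects onto span(diffs)."""
    # Build an orthonormal basis of the span with modified Gram-Schmidt.
    basis: List[Vector] = []
    for d in diffs:
        v = list(d)
        for b in basis:
            c = _dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(_dot(v, v))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([vi / norm for vi in v])
    # Remove the in-span component of w; what is left is the projection residual.
    residual = list(w)
    for b in basis:
        c = _dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return math.sqrt(_dot(residual, residual)) / math.sqrt(_dot(w, w))


w = [1.0, 1.0, 0.0]  # toy linear-classifier weight vector
err_good = relative_projection_error(w, [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
err_partial = relative_projection_error(w, [[1.0, 0.0, 0.0]])
err_bad = relative_projection_error(w, [[0.0, 0.0, 1.0]])
```

In the toy case, difference vectors that cover both task-relevant directions give zero error, covering only one of them gives an intermediate error, and a direction orthogonal to the classifier gives the maximum error of 1.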
cs.CV / 186 / 2605.09701
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture:面向未来的自主驾驶潜在世界模型
Abstract
Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e-driving-navhard) and achieves SOTA performance on NAVSIM-v1 navtest (https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest).
Chinese Translation
现有的自主驾驶潜在世界模型为面向未来的驾驶智能开辟了一个有前景的路径。然而,它们通常将未来潜在状态视为预测目标或辅助信号,而不是直接用于轨迹规划。这可能导致当前和未来特征在潜在空间中的纠缠。在本研究中,我们提出了DriveFuture,一个面向未来的自主驾驶潜在世界建模框架,通过将当前潜在状态建模过程与未来世界状态相结合,明确学习面向规划的前瞻性。具体而言,在训练过程中,模型首先根据当前潜在状态和自我动作预测未来潜在世界状态,然后通过交叉注意机制对预测进行基于真实未来潜在状态的精细调整。最终得到的面向未来的潜在状态作为基于扩散的轨迹规划器的显式条件。在推理阶段,DriveFuture基于预测的未来潜在状态进行条件设置,而不是基于真实的未来状态。DriveFuture在公共NAVSIM基准测试中实现了SOTA性能,在NAVSIM-v2 navhard上达到55.5 EPDMS,在NAVSIM-v2 navtest上达到89.9 EPDMS,在NAVSIM-v1 navtest上达到90.7 PDMS。这些结果表明,潜在世界建模的关键不仅在于模拟未来状态,更重要的是在于将当前决策建立在未来状态的基础上。值得注意的是,截至2026年4月,DriveFuture在NAVSIM-v2 navhard排行榜(https://huggingface.co/spaces/AGC2025/e2e-driving-navhard)上名列第一,并在NAVSIM-v1 navtest(https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest)上实现了SOTA性能。
cs.CV / 187 / 2605.09703
MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding
MOTOR-Bench:用于零-shot人类心理状态理解的真实世界数据集和多智能体框架
Abstract
Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully designed benchmark with a real-world dataset, MOTOR-dataset, containing 1,440 multimodal video clips in collaborative learning scenarios and reflecting key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on MOTOR-Bench. Their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework, named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that MOTOR-MAS outperforms the best single-model baseline by 15.93 points in Macro-F1 score across the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent baseline by 10.2 points in internal cognition prediction.
Chinese Translation
从自然行为中理解人类心理状态对于现实世界中的智能系统至关重要。然而,目前大多数研究集中于预测孤立的心理状态标签,缺乏对复杂人际互动的结构化注释。为了支持结构化分析,我们引入了MOTOR-Bench,这是一个经过精心设计的基准,包含真实世界数据集MOTOR-dataset,其中包含1440个多模态视频片段,反映了协作学习场景中的关键现实世界数据挑战,包括自然类别不平衡、视觉噪声和领域特定语言。每个样本由教育专家根据自我调节学习理论进行标注。我们进一步在我们的MOTOR-Bench上评估了几种最先进的多模态大型语言模型和多智能体系统在零-shot设置下的表现。然而,它们在这一任务上的表现仍然有限,表明现有方法在从可观察行为推理到更深层次心理状态的结构化推理方面仍然存在困难。为了解决这一挑战,我们提出了一种推理多智能体框架,称为MOTOR-MAS。它通过结构化的智能体协调机制协调多个智能体,以推断明确的行为、内部认知和心理情感。实验结果表明,我们的MOTOR-MAS在行为、认知和情感三个标签的Macro-F1分数上比最佳单模型基准高出15.93分,并在内部认知预测上比一般多智能体基准高出10.2分。
cs.CV / 188 / 2605.09719
Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
将3D空间推理提炼为轻量级视觉-语言模型的知识蒸馏方法
Abstract
Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
Chinese Translation
大规模的3D视觉-语言模型(VLMs),如LLaVA-3D,提供了强大的空间推理能力,但由于高计算成本,难以部署。我们提出了一种知识蒸馏框架,将空间推理从一个7B参数的教师模型转移到一个2.29B参数的学生模型。我们的方法实现了8.7倍的推理延迟降低和模型大小的3倍减少,同时保留了54-72%的教师模型性能。该框架利用VGGT作为视觉编码器,并采用具有不确定性感知损失加权的多任务蒸馏管道。为了在没有链式思维(CoT)数据的情况下改善推理,我们引入了“隐式CoT”:可学习的潜在标记,作为答案生成前的内部草稿。这是首次在蒸馏的3D VLMs中使用潜在草稿推理。学生模型共同执行空间描述、深度估计和物体检测。在ScanNet和3D-FRONT上的实验显示出强大的空间理解能力,在接近性和接触任务上达到68-72%的准确率。我们的框架使得在资源受限的平台上高效进行3D场景问答成为可能。
cs.CV / 189 / 2605.09725
On-Policy Distillation with Best-of-N Teacher Rollout Selection
基于最佳教师回放选择的在线蒸馏
Abstract
On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
Chinese Translation
在线蒸馏(On-policy distillation, OPD)是一种通过自身采样轨迹对学生进行监督的数据高效后训练方法,旨在提高推理能力,同时避免强化学习中的奖励依赖和标准监督微调中常见的灾难性遗忘。然而,标准的OPD通常在嘈杂的学生生成上下文中计算教师监督,并且通常依赖于每个提示的单一随机教师回放。因此,监督信号可能具有高方差:采样的教师轨迹可能是错误的、缺乏信息的,或者与学生当前的推理行为不匹配。为了解决这一局限性,我们提出了BRTS(Best-of-N Rollout Teacher Selection),一种用于在线蒸馏的最佳教师回放选择框架。BRTS通过从策划的教师轨迹构建教师上下文监督分支,增强了标准的学生上下文OPD。BRTS不是从第一个采样的教师回放进行蒸馏,而是从一个小的教师轨迹池中进行采样,并使用简单的优先规则选择辅助轨迹:首先考虑正确性,其次考虑学生对齐。当多个正确的教师轨迹可用时,BRTS选择与学生当前行为最一致的轨迹;当无条件的教师样本在更困难的提示上失败时,它会调用一个基于真实值的恢复步骤以引导自然推导。所选轨迹随后在OPD循环中用于提供可靠的教师上下文监督,并在教师轨迹上增加辅助损失。在AIME 2024、AIME 2025和AMC 2023上的实验表明,BRTS在具有挑战性的推理基准上优于标准OPD,尤其是在更困难的数据集上获得了最大的提升。我们的代码可在https://github.com/BWGZK-keke/BRTS获取。
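The priority rule stated in the BRTS abstract above ("correctness first, student alignment second") can be sketched as a small selection function. This is an illustrative sketch only: the `Rollout` fields, the use of a student log-probability as the alignment score, and returning `None` to signal the ground-truth-conditioned recovery step are assumptions, not the released code's interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rollout:
    text: str
    is_correct: bool        # verified against the reference answer
    student_logprob: float  # hypothetical alignment score under the student


def select_teacher_rollout(pool: List[Rollout]) -> Optional[Rollout]:
    """Pick the correct rollout that the student finds most likely.

    Returns None when no sampled rollout is correct, in which case the
    caller would fall back to a ground-truth-conditioned recovery step.
    """
    correct = [r for r in pool if r.is_correct]
    if not correct:
        return None
    return max(correct, key=lambda r: r.student_logprob)


pool = [
    Rollout("wrong but fluent", is_correct=False, student_logprob=-1.0),
    Rollout("correct, unfamiliar style", is_correct=True, student_logprob=-9.0),
    Rollout("correct, student-like", is_correct=True, student_logprob=-3.0),
]
chosen = select_teacher_rollout(pool)
```

Filtering on correctness before ranking by alignment means a well-aligned but incorrect rollout is never selected, which matches the ordering of priorities in the abstract.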
cs.CV / 190 / 2605.09750
Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos
胎儿脑部成像:一种用于超声视频关键帧检测的复合神经网络方法
Abstract
This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.
Chinese Translation
本文提出了一种用于超声视频关键帧检测的新方法,特别关注胎儿脑部成像。所提模型是一种复合神经网络架构,结合了卷积神经网络(CNN)和递归神经网络(RNN)。CNN 从单个视频帧中提取空间特征,而 RNN 则捕捉每个视频序列中连续帧之间的时间依赖性。所提模型可能提高胎儿脑部超声分析的效率和准确性,从而支持对特定胎儿脑部疾病的早期检测、诊断和治疗规划。
cs.CV / 191 / 2605.09774
DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving
DRIVE-C:用于自动驾驶的受控损坏数据集
Abstract
DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.
Chinese Translation
DRIVE-C 是一个受控损坏数据集,旨在评估自动驾驶系统中的视觉感知鲁棒性。该数据集基于在白天、夜间、城市、乡村、高速公路和停车环境中收集的真实前视驾驶视频构建。干净的视频片段通过局部模糊处理面部和车牌进行匿名化,然后通过物理启发的合成降质进行转换。该数据集包含 10 个干净片段和 600 个损坏片段,涵盖 12 种相机降质类型和五个严重程度级别,并附有每个片段的元数据和全球传感器健康指数 (Global Sensor Health Index, GSHI) 注释。DRIVE-C 支持鲁棒性基准测试、降质感知建模、不确定性估计、分布外 (Out-of-Distribution, OOD) 检测以及高级驾驶辅助系统 (Advanced Driver Assistance Systems, ADAS) 的传感器健康监测。通过提供像素对齐的干净和降质视频片段以及完全可重复的损坏参数,DRIVE-C 为在受控相机降质下研究感知可靠性提供了一个结构化的测试平台。
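The clip counts in the DRIVE-C abstract above are consistent with a full grid over clean clips, degradation types, and severity levels (10 × 12 × 5 = 600). The sketch below enumerates such a grid. The identifiers are placeholders, since the abstract does not name the degradation types or the file layout.

```python
from itertools import product

# Placeholder identifiers; the real clip and corruption names are not given in the abstract.
clean_clips = [f"clip_{i:02d}" for i in range(10)]
corruption_types = [f"corruption_{t:02d}" for t in range(12)]
severity_levels = range(1, 6)

# One corrupted clip per (clean clip, corruption type, severity) combination.
manifest = [
    {"source": clip, "corruption": kind, "severity": level}
    for clip, kind, level in product(clean_clips, corruption_types, severity_levels)
]
num_corrupted = len(manifest)  # 600
```

Because each corrupted clip is keyed by its clean source, every entry in such a grid has a pixel-aligned clean counterpart.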
cs.CV / 192 / 2605.09802
CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
CrossVL:面向跨视角视觉-语言检测的复杂性感知特征路由与配对课程
Abstract
Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints: ground-view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to enhance cross-view detection with VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.
Chinese Translation
视觉-语言模型(VLMs)能够实现文本引导的物体检测,但在地面视角与空中视角在高度、尺度和空间布局上存在差异的跨视角场景中表现严重下降。这些几何变化在视角之间引入了系统性的复杂性变化,例如,地面视角图像包含密集且高度遮挡的结构,而空中图像则稀疏且全局有序。固定的VLM融合机制无法处理这种差异。我们提出了CrossVL,一个结合复杂性感知路径聚合(CPA)和配对课程学习(PCL)的框架,以增强VLM的跨视角检测。CPA通过多模态统计估计场景复杂性,并通过多条路径路由视觉特征,以获得视角特定的表示。PCL利用同步的地面-空中配对的语义一致性,提供稳定的早期监督,然后逐渐转向随机采样。在MAVREC数据集上,CrossVL将Florence-2的空中mAP从58.66%提高到61.03%,并将地面-空中性能差距从8.63个百分点减少到6.65个百分点,同时在随机种子间实现了3.3倍的方差减少。CPA提供稳定的复杂性感知特征聚合,而PCL增强了优化动态。两者结合表明,协调的架构和训练适应对于稳健的跨视角VLM检测至关重要。
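The paired-to-random curriculum described in this abstract can be illustrated with a minimal sketch. The linear decay, the `warmup_frac` parameter, and the function names `paired_probability` and `sample_batch_mode` are assumptions made for illustration; the abstract does not state the schedule CrossVL actually uses.

```python
import random

def paired_probability(step, total_steps, warmup_frac=0.3):
    """Probability of drawing a synchronized ground-aerial pair at `step`.

    Hypothetical schedule: always paired during warm-up, then a linear
    decay to fully randomized sampling at the end of training.
    """
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 1.0
    remaining = max(total_steps - warmup, 1)
    return max(0.0, 1.0 - (step - warmup) / remaining)

def sample_batch_mode(step, total_steps, rng=random):
    """Return 'paired' or 'random' for this training step."""
    p = paired_probability(step, total_steps)
    return "paired" if rng.random() < p else "random"
```

Early steps therefore always see paired batches, and the share of randomized batches grows as training proceeds.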
cs.CV / 193 / 2605.09827
Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction
时尚佛罗伦萨:针对结构化时尚属性提取的佛罗伦萨-2模型微调
Abstract
We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, a structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.
Chinese Translation
我们提出了时尚佛罗伦萨(Fashion Florence),这是一个经过LoRA微调的佛罗伦萨-2视觉语言模型,旨在从服装图像中提取结构化的时尚属性。给定一张照片,该模型生成一个包含类别、颜色、材料、风格标签和场合标签的JSON对象,输出结构适合下游推荐和检索系统的直接程序化消费。微调数据来源于iMaterialist Fashion数据集(228个标签),我们通过基于规则的标签工程将细粒度注释合并为一个紧凑的6类、16色、19风格的框架。我们对所有解码器线性层应用LoRA(r=16,alpha=32),在3,688个样本上训练3个周期。在461张图像的保留测试集上,时尚佛罗伦萨实现了94.6%的类别准确率和63.0%的材料准确率,而GPT-4o-mini的准确率为89.3% / 43.3%,Gemini 2.5 Flash的准确率为87.4%。时尚佛罗伦萨在99.8%的输出中生成有效的JSON,同时在单个GPU上以0.77B参数运行,边际推理成本为零。风格标签的F1值达到0.753,而Gemini为0.612,GPT-4o-mini为0.398。该模型已作为Hugging Face Space部署,并集成到Loom,一个开源的服装推荐系统中。
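To make the LoRA configuration (r=16, alpha=32) concrete, the sketch below shows the standard LoRA forward pass for one linear layer in plain NumPy. It is not the authors' training code; the function name `lora_linear` and the array shapes are illustrative assumptions. The only values taken from the abstract are r and alpha, which give a scaling factor of alpha / r = 2.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=32, r=16):
    """Linear layer with a low-rank LoRA update.

    x: (batch, d_in) inputs
    W: (d_out, d_in) frozen pretrained weight
    A: (r, d_in) and B: (d_out, r) trainable low-rank factors
    """
    scale = alpha / r  # 32 / 16 = 2.0 for the configuration above
    # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
    return x @ W.T + scale * (x @ A.T) @ B.T
```

With B initialised to zeros, as is conventional for LoRA, the layer reproduces the frozen model exactly at the start of fine-tuning.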
cs.CV / 194 / 2605.09850
Probing Routing-Conditional Calibration in Attention-Residual Transformers
探讨注意力残差变换器中的路由条件校准
Abstract
Post-hoc calibration is usually evaluated as a function of logits or softmax confidence alone, even as routing-augmented architectures increasingly accompany predictions with sample-specific internal routing traces and pair them with claims of calibration-relevant uncertainty. We ask a basic question: do these traces provide stable routing-specific evidence for post-hoc calibration beyond confidence? We study this in Attention-Residual transformers (Kimi Team, 2026) through a matched-confidence diagnostic suite that stratifies examples by routing-derived state, compares subgroup gaps against within-bin routing-permutation nulls, and evaluates matched post-hoc probes differing only in their auxiliary feature. Across our completed AR runs, scalar routing summaries do not provide stable evidence of routing-conditional miscalibration: weighted gaps remain small or seed-sensitive, and only $1$ of $30$ within-bin permutation tests rejects the conditional-null at $\alpha=0.05$ (only on one seed; not stable across seeds in that cell). AR-CondCal, a minimal $2$-D Nadaraya--Watson probe on confidence and routing-depth variance, lies within the seed-variance band of matched confidence-only and predictive-entropy controls and does not reliably improve worst-routing-tertile ECE; bandwidth-sensitivity checks (Scott multiples, CV-NLL, global-ECE oracle) do not change this. A full-vector MLP over $(c, H_1, \ldots, H_L)$ can appear to improve over a linear confidence baseline, but the apparent gain disappears once a capacity-matched confidence-only MLP is included as a control, and shuffled routing profiles achieve comparable performance. Apparent routing-aware calibration gains in this AR setting should not be read as internal-state calibration until matched-confidence, bandwidth, capacity, and permutation controls rule out common confounds.
Chinese Translation
后验校准通常仅作为 logits 或 softmax 置信度的函数进行评估,即使路由增强架构越来越多地伴随样本特定的内部路由痕迹,并将其与校准相关的不确定性声明配对。我们提出一个基本问题:这些痕迹是否提供了超越置信度的后验校准的稳定路由特定证据?我们通过在注意力残差变换器(Attention-Residual transformers, Kimi Team, 2026)中进行匹配置信度诊断套件的研究,分层示例以路由衍生状态,比较子组差距与组内路由置换零假设,并评估仅在辅助特征上不同的匹配后验探测。在我们完成的 AR 运行中,标量路由摘要并未提供稳定的路由条件失校准证据:加权差距仍然较小或对种子敏感,且在 $30$ 次组内置换测试中仅有 $1$ 次在 $\alpha=0.05$ 下拒绝条件零假设(仅在一个种子上;在该单元中对种子不稳定)。AR-CondCal 是一个最小的 $2$ 维 Nadaraya-Watson 探测器,基于置信度和路由深度方差,位于匹配置信度仅和预测熵控制的种子方差带内,并未可靠地改善最差路由三分之一的 ECE;带宽敏感性检查(Scott 倍数、CV-NLL、全局 ECE oracle)并未改变这一点。对 $(c, H_1, \ldots, H_L)$ 的全向量 MLP 似乎在线性置信度基线之上有所改善,但一旦将容量匹配的仅置信度 MLP 作为对照纳入,表面上的增益便消失,而洗牌路由配置则实现了可比的性能。在这个 AR 设置中,表面上看似路由感知的校准增益不应被解读为内部状态校准,直到匹配置信度、带宽、容量和置换控制排除了常见的混淆因素。
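A 2-D Nadaraya-Watson probe of the kind named in this abstract can be sketched in a few lines. This is a generic Gaussian-kernel estimator under assumed inputs (one confidence feature and one routing feature per example); the function name `nw_calibrate` and the per-dimension bandwidth vector are illustrative, and the paper's AR-CondCal may differ in kernel and bandwidth selection.

```python
import numpy as np

def nw_calibrate(query, feats, correct, bandwidth):
    """Nadaraya-Watson estimate of P(correct | confidence, routing feature).

    query:     (m, 2) points at which to estimate calibrated accuracy
    feats:     (n, 2) calibration-set features
    correct:   (n,)   0/1 correctness labels on the calibration set
    bandwidth: (2,)   Gaussian kernel width per feature dimension
    """
    diff = (query[:, None, :] - feats[None, :, :]) / bandwidth
    weights = np.exp(-0.5 * np.sum(diff ** 2, axis=-1))  # (m, n)
    total = np.clip(weights.sum(axis=1), 1e-12, None)    # avoid division by zero
    return (weights @ correct) / total
```

Dropping the second feature column gives the matched confidence-only control against which such a probe would be compared.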
cs.CV / 195 / 2605.09856
MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery
MoPO:结合运动先验进行被遮挡人类网格恢复
Abstract
Although recent studies have made remarkable progress in human mesh recovery, they still exhibit limited robustness to occlusions and often produce inaccurate poses and severe motion jitter due to insufficient spatial features for occluded body parts. Inspired by the rapid advancements in human motion prediction, we observe that, compared to occluded image features, a pose sequence inherently contains a reliable motion prior for estimating occluded body parts. In this paper, we incorporate a Motion Prior for Occluded human mesh recovery in a method called MoPO. MoPO mainly consists of two components: 1) The motion de-occlusion module, in which we propose a spatial-temporal occlusion detector to detect joint visibility and a lightweight motion predictor to complete the occluded body parts by predicting the most plausible joint positions based on history poses. 2) The motion-aware fusion and refinement module, which fuses the completed joint sequence with image features to estimate human shape and initial human pose. Moreover, the completed joint sequence is further used to refine the final human pose through inverse kinematics, which provides an occlusion-free motion prior for regressing human poses. Extensive experiments demonstrate that MoPO achieves state-of-the-art performance on both occlusion-specific and standard benchmarks, significantly enhancing the accuracy and temporal consistency of occluded human mesh recovery. Our code and demo can be found in the supplementary material.
Chinese Translation
尽管近期研究在人类网格恢复方面取得了显著进展,但仍然对遮挡表现出有限的鲁棒性,常常由于对被遮挡身体部位的空间特征不足而产生不准确的姿态和严重的运动抖动。受到人类运动预测快速发展的启发,我们发现与被遮挡图像特征相比,姿态序列本质上包含了可靠的运动先验,用于估计被遮挡的身体部位。本文提出了一种结合运动先验进行被遮挡人类网格恢复的方法,称为MoPO。我们的MoPO主要由两个组件组成:1)运动去遮挡模块,在该模块中,我们提出了一种时空遮挡检测器来检测关节可见性,然后提出了一种轻量级运动预测器,通过基于历史姿态预测最合理的关节位置来补全被遮挡的身体部位。2)运动感知融合与精炼模块,该模块将补全的关节序列与图像特征融合,以估计人类形状和初始人类姿态。此外,补全的关节序列进一步用于通过逆向运动学精炼最终的人类姿态,从而为回归人类姿态提供无遮挡的运动先验。大量实验表明,MoPO在遮挡特定和标准基准测试中均实现了最先进的性能,显著提高了被遮挡人类网格恢复的准确性和时间一致性。我们的代码和演示可以在补充材料中找到。
cs.CV / 196 / 2605.09858
Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking
基于片段级不确定性和时序感知的端到端多目标跟踪主动学习
Abstract
Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
Chinese Translation
动态环境中的多目标跟踪(MOT)依赖于强大的时序推理,以在时间上保持一致的对象身份。基于变换器的端到端MOT模型通过显式建模时序依赖关系来实现强大的性能,然而训练这些模型需要大量的边界框和身份注释。考虑到高昂的标注成本和视频中的强冗余性,主动学习(AL)是一种有效提高注释效率的方法。然而,现有的MOT主动学习方法主要在帧级别上操作,这与现代端到端跟踪器的推理和训练依赖于多帧片段的结构不一致。为了解决这一问题,我们提出了片段级主动学习,并提出了基于片段级不确定性和时序感知的主动学习(CUTAL)。与基于帧的方法相比,CUTAL使用从多帧预测中得出的不确定性指标对每个片段进行评分,以捕捉帧间对应关系的模糊性,同时强制实施时序多样性,以选择一个信息丰富且不冗余的子集。实验表明,CUTAL在MeMOTR和SambaMOTR上以相同的标注预算实现了比基线更强的整体性能。值得注意的是,CUTAL在这两个数据集上仅使用50%的标注训练数据,就达到了与完全监督相当的MeMOTR性能。
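The idea of ranking clips by uncertainty while enforcing temporal diversity can be sketched as a greedy selection. This is an assumed simplification for a single video, not CUTAL's actual criterion: the per-clip `scores`, the `min_gap` spacing rule, and the function name `select_clips` are all hypothetical.

```python
def select_clips(scores, starts, budget, min_gap):
    """Greedily pick up to `budget` clips for annotation.

    scores:  per-clip uncertainty (higher means more informative)
    starts:  start frame index of each clip within one video
    min_gap: minimum distance in frames between selected clip starts
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == budget:
            break
        # Skip clips that are temporally redundant with an earlier pick.
        if all(abs(starts[i] - starts[j]) >= min_gap for j in chosen):
            chosen.append(i)
    return chosen
```

For example, with scores `[0.9, 0.8, 0.1, 0.7]`, starts `[0, 5, 50, 100]`, a budget of 2 and a gap of 20 frames, the second clip is skipped because it nearly overlaps the first.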
cs.CV / 197 / 2605.09859
Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval
学习对齐生成外观先验以进行细粒度图像检索
Abstract
Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.
Chinese Translation
细粒度图像检索(FGIR)通常依赖于来自已见类别的监督,以学习用于检索未见类别的判别嵌入。然而,这种监督往往使检索模型偏向于已见类别的语义,而不是跨类别泛化的基础外观特征,从而限制了对未见类别的检索性能。为了解决这个问题,我们提出了GAPan,一个生成外观先验对齐网络,它将学习目标从类别预测重新定义为外观建模。从技术上讲,GAPan将检索特征视为基于归一化流的可逆密度模型。在正向过程中,流将所有实例特征映射到潜在密度空间,其中每个已见类别由类条件高斯先验建模,并通过精确的似然估计进行优化。这种表述通过利用流的可逆特性保留了更丰富的外观细节。在反向过程中,从这些学习到的先验的高密度区域中抽样被映射回特征空间,以生成反映类内变化的外观感知锚点。这些锚点监督一个基于先验驱动的对齐目标,将检索嵌入与类别特定的外观分布对齐,从而提高对未见类别的泛化能力。评估结果表明,我们的GAPan在广泛使用的细粒度和粗粒度基准测试中达到了最先进的性能。
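The exact-likelihood objective mentioned in this abstract follows from the standard change-of-variables formula for normalizing flows. The symbols below are assumptions introduced for illustration: $f_\theta$ is the invertible flow, $x$ a retrieval feature, and $\mu_c, \Sigma_c$ the mean and covariance of the Gaussian prior for seen category $c$. The abstract does not give the authors' notation or the exact form of their loss.

```latex
% Forward direction: exact class-conditional log-likelihood of a feature x
\log p_\theta(x \mid c)
  = \log \mathcal{N}\!\left(f_\theta(x);\, \mu_c, \Sigma_c\right)
  + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|

% Reverse direction: an appearance-aware anchor sampled from the learned prior
z \sim \mathcal{N}(\mu_c, \Sigma_c), \qquad a = f_\theta^{-1}(z)
```

Because $f_\theta$ is invertible, no information about $x$ is discarded in the forward map, which is the property the abstract credits for preserving appearance detail.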
cs.CV / 198 / 2605.09864
DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment
DA-SegFormer:面向细粒度灾害评估的损伤感知语义分割
Abstract
Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine-grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA-SegFormer, a damage-aware adaptation of the SegFormer architecture optimized for high-resolution disaster imagery. Our method introduces a Class-Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution-preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA-SegFormer achieves 74.61% mIoU, outperforming the baseline by 2.55%. Notably, our improvements yield double-digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
Chinese Translation
在自然灾害发生后,快速而准确的损伤评估对于有效的紧急响应至关重要。然而,在无人机图像中识别细粒度的损伤等级(例如,区分轻微和严重的屋顶损伤)仍然具有挑战性,因为在图像缩放过程中纹理线索会退化,并且类别之间存在极端的不平衡。我们提出了DA-SegFormer,这是一种针对高分辨率灾害图像优化的SegFormer架构的损伤感知适应。我们的方法引入了一种类别感知采样策略,以确保对稀有损伤特征的曝光,并将在线困难样本挖掘(Online Hard Example Mining, OHEM)与Dice Loss结合,以动态关注代表性不足的类别。此外,我们采用了一种保持分辨率的推理协议,以保持原始纹理细节。在RescueNet数据集上的评估表明,DA-SegFormer达到了74.61%的平均交并比(mIoU),比基线提高了2.55%。值得注意的是,我们的改进在关键损伤类别中实现了双位数的增益:轻微损伤(+11.7%)和严重损伤(+21.3%)。
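The two loss ingredients named in this abstract, Dice loss and Online Hard Example Mining, can be sketched generically in NumPy. This is not the DA-SegFormer implementation: the flattened `(C, N)` layout, the `keep_frac` parameter, and the function names are assumptions, and the abstract does not say how the two terms are weighted or combined.

```python
import numpy as np

def dice_loss(probs, onehot, eps=1e-6):
    """Mean soft Dice loss over classes; probs and onehot have shape (C, N)."""
    inter = (probs * onehot).sum(axis=1)
    denom = probs.sum(axis=1) + onehot.sum(axis=1)
    return float(np.mean(1.0 - (2.0 * inter + eps) / (denom + eps)))

def ohem_cross_entropy(probs, labels, keep_frac=0.25):
    """Cross-entropy averaged over only the hardest `keep_frac` of pixels.

    probs: (C, N) class probabilities, labels: (N,) integer class ids.
    """
    n = labels.shape[0]
    true_prob = probs[labels, np.arange(n)]          # probability of the true class
    per_pixel = -np.log(np.clip(true_prob, 1e-12, None))
    k = max(1, int(n * keep_frac))
    hardest = np.sort(per_pixel)[-k:]                # k largest per-pixel losses
    return float(hardest.mean())
```

Dice loss is computed per class and so does not let frequent classes dominate, while the OHEM term concentrates the gradient on the pixels the model currently gets most wrong.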
cs.CV / 199 / 2605.09874
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason:一个基于记忆驱动的长时间视角视频理解推理基准
Abstract
Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate 17 methods, spanning MLLMs and agentic frameworks, on EgoMemReason and find that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, indicating that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.
Chinese Translation
下一代视觉助手,如智能眼镜、具身代理和始终在线的生活记录系统,必须对持续的视觉体验进行整整一天或更长时间的推理。在超长视频环境中,相关信息分散在数小时或数天内,使得记忆成为一个根本性挑战:模型必须随着时间的推移积累信息,回忆先前的状态,跟踪时间顺序,并抽象出重复的模式。然而,现有的为期一周的视频基准主要设计用于感知和识别,例如时刻定位或全局摘要,而不是需要跨多个天整合证据的推理。为了解决这一空白,我们引入了EgoMemReason,这是一个全面的基准,系统地评估基于记忆驱动的为期一周的视角视频理解。EgoMemReason评估三种互补的记忆类型:实体记忆,跟踪物体状态在数天内的演变和变化;事件记忆,回忆和排序相隔数小时或数天的活动;以及行为记忆,从整个一周期间稀疏的重复观察中抽象出重复模式。EgoMemReason包含500个问题,涵盖三种记忆类型和六个核心挑战,每个问题平均有5.1个证据视频片段和25.9小时的记忆回溯。我们在17种方法上评估EgoMemReason,涵盖了多模态大语言模型(MLLMs)和代理框架,结果显示即使是最佳模型的整体准确率也仅为39.6%。进一步分析表明,三种记忆类型因不同原因而失败,且随着证据跨越更长的时间范围,性能下降,揭示出长时间记忆仍远未解决。我们相信EgoMemReason为评估和推进长上下文、记忆感知的多模态系统奠定了坚实的基础。
cs.CV / 200 / 2605.09883
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
笛卡尔捷径:重新评估极坐标空间中的视觉推理
Abstract
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across $14$ state-of-the-art MLLMs reveals that frontier models achieving $70$--$83\%$ on Cartesian layouts collapse to $31$--$39\%$ on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
Chinese Translation
随着当前的多模态大型语言模型迅速饱和经典视觉推理基准,一个关键问题浮现:这些高分数是否真正反映了稳健的视觉理解?我们识别出一个普遍存在的脆弱性,即笛卡尔捷径:视觉推理基准普遍建立在正交网格布局上,这些布局可以轻易地离散化为明确的文本坐标。模型系统性地利用这一特性,重度依赖基于文本的演绎推理来辅助视觉问题解决。为了系统性地拆解这一捷径,我们引入了Polaris-Bench,该基准在极坐标空间中重新构建了53个视觉推理任务,并以配对的笛卡尔坐标作为参考,同时保持一致的逻辑约束和任务语义——从而根本上打破了模型所利用的正交先验。对14个最先进的多模态大型语言模型的全面评估显示,前沿模型在笛卡尔布局上取得的70%--83%的成绩在极坐标对应任务中降至31%--39%,即使在完全逻辑等价的情况下,这种降级仍然存在。此外,在笛卡尔布局上观察到的推理提升在极坐标对应任务中显著减弱。这些发现揭示了当前多模态大型语言模型的一个关键缺陷:缺乏拓扑不变的视觉推理能力。
cs.CV / 201 / 2605.09899
Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection
双曲蒸馏:几何引导的跨模态迁移用于稳健的3D物体检测
Abstract
Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficiency of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO component compensates for potential spatial feature degradation introduced by the SGVO component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.
Chinese Translation
跨模态知识蒸馏已成为在3D感知任务中整合点云和图像特征的有效策略。然而,模态异质性、空间错位以及多模态的表征危机常常限制了这些跨模态蒸馏方法的效率。为了解决现有方法中的这些局限性,我们提出了一种用于多模态3D物体检测的双曲约束跨模态蒸馏方法(HGC-Det)。所提出的HGC-Det框架包括一个图像分支和一个点云分支,以从两种不同模态中提取语义特征。点云分支由三个核心组件组成:一个2D语义引导的体素优化组件(SGVO)、一个双曲几何约束的跨模态特征转移组件(HFT)和一个基于特征聚合的几何优化组件(FAGO)。具体而言,SGVO组件通过利用图像分支的语义线索自适应地细化3D分支的空间表征,从而缓解表征融合不足的问题。HFT组件利用双曲空间的内在几何特性来减轻在高维图像特征和低维点云特征融合过程中产生的语义损失。最后,FAGO补偿了由2D语义引导的体素优化组件引入的潜在空间特征降解。在室内数据集(SUN RGB-D,ARKitScenes)和室外数据集(KITTI,nuScenes)上的大量实验表明,我们的方法在检测精度和计算成本之间实现了更好的平衡。
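As background for the hyperbolic constraint in this abstract, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Choosing the Poincaré ball with curvature -1 is an assumption made here for illustration; the abstract does not say which hyperbolic model or curvature HGC-Det uses, and the function name `poincare_distance` is hypothetical.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    # eps guards against points numerically on the ball boundary.
    return float(np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps)))
```

Distances grow rapidly near the boundary of the ball, which is why hyperbolic embeddings are often used to represent hierarchical structure with few dimensions.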
cs.CV / 202 / 2605.09902
Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment
通过渐进分辨率处理和自适应特征对齐对多模态大型语言模型的对抗攻击
Abstract
Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.
Chinese Translation
对抗扰动可以误导多模态大型语言模型(MLLMs)将良性图像识别为特定目标对象,这在自动驾驶和医疗诊断等安全关键场景中带来了严重风险。这使得基于转移的目标攻击对于理解和提高黑箱MLLM的鲁棒性至关重要。现有的基于转移的目标攻击方法通常依赖于替代编码器的最终全局特征和对原始分辨率目标裁剪的锚点优化,导致其转移性和鲁棒性有限。为了解决这些挑战,我们提出了渐进分辨率处理和自适应特征对齐(PRAF-Attack),这是一种将多尺度全局语义指导与鲁棒的中间层局部对齐相结合的基于转移的目标攻击框架。与之前仅对齐替代编码器最终层的方法不同,我们设计了一种自适应特征对齐策略,利用中间表示来增强转移性。具体而言,我们引入了一种自适应中间层选择机制,通过梯度一致性识别可转移的层次特征,并结合一种自适应补丁级优化策略,通过高效的补丁过滤保留高度相关的局部区域。为了克服对固定原始分辨率目标裁剪的依赖,我们提出了一种渐进分辨率处理策略,逐步从粗到细优化,使攻击能够更好地利用多尺度的目标信息并实现更强的转移性。我们在一系列多样化的黑箱MLLM上评估了PRAF-Attack,包括六个开源模型和六个闭源商业API。与七个最先进的目标攻击基线相比,所提出的PRAF-Attack始终实现了更优的转移性。
cs.CV / 203 / 2605.09904
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench:视频大型语言模型的时间对象一致性基准
Abstract
Video large language models (Video-LLMs) have achieved remarkable progress in general video understanding, yet their ability to maintain temporal object consistency remains insufficiently explored. Existing benchmarks primarily focus on event recognition, action understanding, or coarse temporal reasoning, but rarely evaluate whether a model can consistently preserve the identity, state, and temporal continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. As a result, current evaluations may overestimate temporal reasoning ability while overlooking failures in object-centric temporal coherence. To address this issue, we introduce TOC-Bench, a diagnostic benchmark specifically designed to evaluate temporal object consistency in Video-LLMs. TOC-Bench is explicitly object-track grounded: each queried subject is associated with a per-frame object trajectory and a structured temporal event timeline. To ensure that benchmark items depend on temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we propose a three-layer temporal-necessity filtering protocol that removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items spanning 10 diagnostic dimensions. From this filtered pool, we further construct a human-verified benchmark containing 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge. Current models exhibit substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.
Chinese Translation
视频大型语言模型(Video-LLMs)在一般视频理解方面取得了显著进展,但它们在保持时间对象一致性方面的能力仍然未得到充分探索。现有基准主要集中于事件识别、动作理解或粗略的时间推理,但很少评估模型是否能够在遮挡、消失、重新出现、状态转换和跨对象交互中一致地保持同一对象的身份、状态和时间连续性。因此,当前的评估可能高估了时间推理能力,同时忽视了对象中心时间一致性方面的失败。为了解决这个问题,我们引入了TOC-Bench,这是一个专门设计用于评估视频大型语言模型中时间对象一致性的诊断基准。TOC-Bench明确以对象轨迹为基础,每个查询的主题都与每帧的对象轨迹和结构化的时间事件时间线相关联。为了确保基准项目依赖于时间顺序的视觉证据,而不是语言先验、单帧捷径或无序帧线索,我们提出了一种三层时间必要性过滤协议,去除了60.7%的候选问答对,并保留了17,900个跨越10个诊断维度的时间依赖项。在此过滤池中,我们进一步构建了一个经过人工验证的基准,包含2,323个高质量的问答对,涵盖1,951个视频。在对代表性的视频大型语言模型进行实验时,结果显示时间对象一致性仍然是一个主要的未解决挑战。尽管在一般视频理解基准上表现强劲,当前模型在事件计数、事件排序、身份敏感推理和幻觉感知验证方面仍存在显著弱点。
cs.CV / 204 / 2605.09925
Frequency Adapter with SAM for Generalized Medical Image Segmentation
基于频率适配器与SAM的广义医学图像分割
Abstract
Medical image segmentation is a critical task in computer-aided diagnosis and treatment planning. However, deep learning models often struggle to generalize across datasets due to domain shifts arising from variations in imaging protocols, scanner types, and patient populations. Traditional domain generalization (DG) methods utilize causal feature learning, adversarial consistency, and style augmentation to improve segmentation robustness. While effective, these approaches rely on explicit feature alignment, adversarial objectives, or handcrafted augmentations, which may not fully exploit the capabilities of foundation models. Recently, the Segment Anything Model (SAM) has demonstrated strong generalization capabilities in segmentation tasks. SAM-based DG methods attempt to improve medical image segmentation. However, these approaches primarily operate in the spatial domain and overlook frequency-based discrepancies that significantly affect model robustness. In this work, we propose Frequency-based Domain Generalization with SAM (FSAM), a novel framework that integrates Low-Rank Adaptation (LoRA) for efficient fine-tuning and a frequency adapter to incorporate frequency-domain representations for single-source domain generalization. FSAM enhances SAM's segmentation robustness by extracting domain-invariant high-frequency features, mitigating frequency-related domain shifts. Experimental results on fundus and prostate datasets demonstrate that FSAM outperforms existing traditional DG and SAM-based DG approaches in domain generalization. Codes and pre-trained models will be made available on GitHub.
Chinese Translation
医学图像分割是计算机辅助诊断和治疗规划中的一项关键任务。然而,由于成像协议、扫描仪类型和患者人群的变化,深度学习模型在跨数据集的泛化能力上常常面临挑战。传统的领域泛化(DG)方法利用因果特征学习、对抗一致性和风格增强来提高分割的鲁棒性。尽管这些方法有效,但它们依赖于显式特征对齐、对抗目标或手工增强,可能无法充分利用基础模型的能力。最近,Segment Anything Model(SAM)在分割任务中展示了强大的泛化能力。基于SAM的DG方法试图改善医学图像分割。然而,这些方法主要在空间域中操作,忽视了显著影响模型鲁棒性的基于频率的差异。在本研究中,我们提出了基于频率的领域泛化框架FSAM,该框架集成了低秩适配(LoRA)以实现高效微调,并使用频率适配器以纳入频率域表示,针对单源领域泛化。FSAM通过提取领域不变的高频特征来增强SAM的分割鲁棒性,从而减轻与频率相关的领域转移。在视网膜和前列腺数据集上的实验结果表明,FSAM在领域泛化方面优于现有的传统DG和基于SAM的DG方法。代码和预训练模型将会在GitHub上发布。
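The notion of isolating high-frequency content, which this abstract builds on, can be illustrated with a plain FFT high-pass filter. This is a generic signal-processing sketch, not the FSAM frequency adapter: the hard circular mask, the `radius` parameter, and the function name `high_frequency_component` are assumptions, and the abstract does not describe how the adapter actually extracts or injects frequency features.

```python
import numpy as np

def high_frequency_component(image, radius):
    """Remove a centered low-frequency disk from a 2-D image's spectrum.

    image:  (H, W) real-valued array
    radius: frequencies within this distance of DC are zeroed
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # DC moves to (h // 2, w // 2)
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spectrum[dist <= radius] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
```

With `radius=0` only the mean intensity is removed; larger radii strip progressively more of the smooth, low-frequency structure and leave edges and fine texture.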
cs.CV / 205 / 2605.09935
Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning
基于证据的合成面孔检测决策建模:不确定性驱动的主动学习
Abstract
With the rapid development of deep generative models, forged facial images are massively exploited for illegal activities. Although existing synthetic face detection methods have achieved significant progress, they suffer from the inherent limitation of overconfidence due to their reliance on the Softmax activation function. Thus, these methods often lead to unreliable predictions when encountering unknown Out-of-Distribution (OOD) images, and cannot ascertain the model's uncertainty in its prediction. Meanwhile, most existing methods require massive high-quality annotated data, which greatly limits their practicability across diverse scenarios. To address these limitations, we propose EMSFD (Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning), an approach designed to enhance detection reliability and generalizability. Specifically, EMSFD models class evidence using the Dirichlet distribution and explicitly incorporates model uncertainty into the prediction process. Furthermore, during training, the estimated uncertainty is exploited to prioritize more informative samples from the unlabeled pool for annotation, thereby reducing labeling cost and improving model generalization. Extensive experimental evaluations demonstrate that our method enhances the interpretability of synthetic face detection. Meanwhile, our method yields a 15% increase in accuracy compared to existing state-of-the-art (SOTA) baselines, which demonstrates the superior detection performance and generalizability of our approach. Our code is available at: https://github.com/hzx111621/EMSFD.
Chinese Translation
随着深度生成模型的快速发展,伪造的面部图像被大量用于非法活动。尽管现有的合成面孔检测方法取得了显著进展,但由于依赖Softmax激活函数,它们存在过度自信的固有限制。因此,这些方法在遇到未知的分布外(Out-of-Distribution, OOD)图像时,往往会导致不可靠的预测,并且无法确定模型预测的不确定性。同时,大多数现有方法需要大量高质量的标注数据,这极大限制了它们在多样化场景中的实用性。为了解决这些限制,我们提出了EMSFD(基于证据的合成面孔检测决策建模:不确定性驱动的主动学习),该方法旨在增强检测的可靠性和泛化能力。具体而言,EMSFD使用Dirichlet分布建模类别证据,并将模型不确定性明确纳入预测过程。此外,在训练过程中,估计的不确定性被用来优先选择来自未标注池中更具信息量的样本进行标注,从而降低标注成本并提高模型的泛化能力。大量实验评估表明,我们的方法增强了合成面孔检测的可解释性。同时,我们的方法相比现有的最先进(SOTA)基线提高了15%的准确率,证明了我们方法在检测性能和泛化能力上的优越性。我们的代码可在以下链接获取:https://github.com/hzx111621/EMSFD。
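Modeling class evidence with a Dirichlet distribution, as this abstract describes, is commonly done in the evidential deep learning style sketched below. The softplus evidence function, the uncertainty measure K / S, and the function names are assumptions drawn from that general approach, not from the EMSFD code, which may differ.

```python
import numpy as np

def dirichlet_opinion(logits):
    """Map network outputs to expected class probabilities and uncertainty.

    logits: (N, K) raw outputs for N samples and K classes
    Returns (prob, uncertainty) with shapes (N, K) and (N,).
    """
    evidence = np.logaddexp(0.0, logits)          # numerically stable softplus, >= 0
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)  # S = sum_k alpha_k
    prob = alpha / strength                       # expected class probabilities
    uncertainty = alpha.shape[-1] / strength[..., 0]  # u = K / S, in (0, 1]
    return prob, uncertainty

def select_for_annotation(logits_pool, k):
    """Indices of the k most uncertain unlabeled samples, most uncertain first."""
    _, u = dirichlet_opinion(logits_pool)
    return np.argsort(-u)[:k]
```

A sample with almost no evidence for any class gets uncertainty close to 1, so it is ranked first for labeling, whereas a softmax output on the same sample could still look confident.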
cs.CV / 206 / 2605.09936
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
城市图像网:用于城市空间感知的大规模多模态数据集和评估框架
Abstract
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo at 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but weaker performance on cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
Chinese Translation
我们提出了城市图像网(Urban-ImageNet),这是一个用于城市空间感知的大规模多模态数据集和评估基准,数据来源于用户生成的社交媒体图像。该语料库包含超过200万张来自微博的公共社交媒体图像及配对的文本帖子,收集自2019年至2025年间中国24个城市的61个城市地点,并设有1K、10K和100K规模的受控基准子集,以及一个完整的200万语料库用于大规模训练和评估。城市图像网由HUSIC(分层城市空间图像分类框架)组织,该框架定义了一个基于城市理论的10类分类法。该分类法旨在区分激活和非激活的公共空间、城市外部和内部环境、住宿空间、消费内容、肖像以及非空间社交媒体内容。城市图像网并不将城市图像视为通用场景数据,而是评估机器感知模型是否能够捕捉到对城市研究至关重要的空间、社会和功能差异。该基准支持一个标准化库中的三个任务:(T1)城市场景语义分类,(T2)跨模态图像-文本检索,以及(T3)实例分割。我们的实验评估了代表性的视觉、视觉-语言和分割模型,结果显示在监督场景分类上表现良好,但在跨模态检索和实例级城市物体分割上表现更具挑战性。一项多尺度研究进一步考察了随着平衡训练数据从1K、10K增加到100K图像时模型性能的变化。城市图像网提供了一个统一的、基于理论的多城市基准,用于评估人工智能系统如何跨模态、规模和任务形式感知和解读当代城市空间。数据集和基准可在以下网址获取:huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet 和 github.com/yiasun/dataset-2。
cs.CV / 207 / 2605.09956
SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis
SDTalk:用于可泛化高斯说话人合成的结构化面部先验与双分支运动场
Abstract
High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.
Chinese Translation
高质量、实时的说话人合成仍然是计算机视觉中的一个基本挑战。现有的重建和渲染方法通常依赖于特定身份的模型,限制了跨身份的泛化能力。为了解决这一问题,我们提出了SDTalk,一个基于一次性3D高斯点云(3D Gaussian Splatting,3DGS)的框架,能够在没有个性化训练或微调的情况下对未见身份进行泛化。我们的框架包括两个模块和一个两阶段的训练策略。在第一阶段,我们将结构化面部先验纳入重建模块,并分别预测可见和遮挡区域的3DGS参数,从而实现从单张图像中完整重建头部。在第二阶段,我们引入了一个双分支运动场来建模粗糙和细致的面部动态,提高了细节保真度和唇部同步性。实验表明,SDTalk在视觉质量和推理效率上均超过了现有方法。
cs.CV / 208 / 2605.09963
Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning
学习感知“位置”:用于鲁棒自监督学习的空间预训练任务
Abstract
Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
Chinese Translation
现有的自监督学习(SSL)方法主要学习对象不变的表征,但往往忽视了对象部分之间的空间结构和关系。为了解决这一局限性,我们提出了空间预测(Spatial Prediction, SP),这是一种空间感知的预训练回归任务,旨在预测来自同一图像的一对解耦局部视图之间的相对位置和尺度。通过在连续几何空间中建模部分之间的关系,SP 鼓励表征捕捉超越不变类别语义的细粒度空间依赖,从而学习视觉场景的组合结构。SP 被实现为一个解耦的插件,可以无缝集成到多种 SSL 框架中。大量实验表明,在图像识别、细粒度分类、语义分割和深度估计等任务上均有一致的改善,并且在对象识别的分布外鲁棒性方面也有显著提升。为了评估空间推理能力,我们引入了(1)图像补丁对上的位置和尺度预测任务,以及(2)需要在重建后进行补丁重新排序和识别的拼图理解任务。这些任务的强劲表现表明空间结构和几何意识得到了改善。总体而言,明确建模空间信息为 SSL 提供了一种有效的归纳偏置,导致更结构化的表征和更好的泛化能力。代码和模型将会发布。
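As a rough illustration of the kind of regression target the SP abstract describes, the sketch below computes a relative position and scale between two local views given as `(x, y, w, h)` boxes. The parameterization (center offset in units of the first crop's size, log side-length ratio) and the function name `relative_position_and_scale` are assumptions made for illustration; the abstract does not specify the exact form used by the authors.

```python
import math


def relative_position_and_scale(box_a, box_b):
    """Return (dx, dy, ds) describing box_b relative to box_a.

    Boxes are (x, y, w, h) with (x, y) the top-left corner, in pixels.
    This target definition is a hypothetical choice, not the paper's.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Centers of the two crops.
    acx, acy = ax + aw / 2, ay + ah / 2
    bcx, bcy = bx + bw / 2, by + bh / 2
    # Offset of B's center from A's center, in units of A's width and height.
    dx = (bcx - acx) / aw
    dy = (bcy - acy) / ah
    # Half the log area ratio, which behaves like a log side-length ratio.
    ds = 0.5 * math.log((bw * bh) / (aw * ah))
    return dx, dy, ds


# A 100x100 crop at the origin and a 50x50 crop to its lower right.
target = relative_position_and_scale((0, 0, 100, 100), (100, 50, 50, 50))
```

A regression head would be trained to predict such a tuple from the two view embeddings; using a log ratio for scale keeps the target symmetric when the two views are swapped.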
cs.CV / 209 / 2605.09965
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
迈向通用游戏玩家:对游戏多元宇宙中基础模型的研究
Abstract
The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.
Chinese Translation
现实世界遵循一套物理法则展开,然而人类智能展现出将这种单一物理存在的经验概括到一个由完全不同规则、美学、物理和目标所支配的游戏多元宇宙的卓越能力。这种全方位的适应性是通用智能的标志。随着人工智能向人工通用智能(Artificial General Intelligence, AGI)发展,游戏多元宇宙已从单纯的娱乐演变为训练和评估AGI的终极场所。对这种通用性的追求经历了四个时代:从环境特定的符号和强化学习代理,到当前作为通用玩家的大型基础模型,再到未来的创作者阶段,在这一阶段,代理不仅创造新的游戏世界,还在其中不断演变。我们沿着四个相互依赖的支柱追踪通用游戏玩家的完整生命周期:数据集(Dataset)、模型(Model)、工具(Harness)和基准(Benchmark)。在这些支柱上的每一次进展都可以被视为试图突破当前束缚整个系统的五个基本权衡之一。在这一端到端的视角基础上,我们绘制了一条五级路线图,从单一游戏的精通逐步迈向终极创作者阶段,在这一阶段,代理同时在理论游戏多元宇宙中创造和演变。综合来看,我们的工作为一个快速变化的领域提供了统一的视角,并为能够无缝掌握游戏多元宇宙中任何挑战的全能通用代理铺平了道路,从而为AGI的发展奠定基础。
cs.CV / 210 / 2605.09976
OZ-TAL: Online Zero-Shot Temporal Action Localization
OZ-TAL:在线零样本时间动作定位
Abstract
Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from the Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.
Chinese Translation
在线时间动作定位(On-TAL)旨在在未剪辑的流媒体视频完成后立即检测动作的发生时间和类别。该领域的最新进展集中在开发更复杂的框架上,从基于在线动作检测(OAD)的聚合范式转向实例级理解。然而,现有方法通常是在特定领域进行训练的,当应用于任意视频时,尤其是在存在之前未见过的动作时,往往表现出有限的泛化能力。本文提出了一项新任务,称为在线零样本时间动作定位(OZ-TAL),旨在以在线方式检测之前未见过的动作。此外,我们提出了一种无训练框架,该框架利用现成的视觉-语言模型(VLMs),同时引入额外机制以增强视觉表示并减轻其固有偏差。我们在THUMOS14和ActivityNet-1.3上建立了OZ-TAL的新基准和代表性基线,广泛的实验表明,我们的方法在离线和在线零样本设置下均显著优于现有的最先进方法。
cs.CV / 211 / 2605.09977
INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI
INFANiTE:用于从临床厚切MRI学习高分辨率胎儿大脑时空图谱的隐式神经表示
Abstract
Spatio-temporal fetal brain atlases are important for characterizing normative neurodevelopment and identifying congenital anomalies. However, existing atlas construction pipelines necessitate days for slice-to-volume reconstruction (SVR) to generate high-resolution 3D brain volumes and several additional days for iterative volume registration, thereby rendering atlas construction from large-scale cohorts prohibitively impractical. We address these limitations with INFANiTE, an Implicit Neural Representation (INR) framework for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI scans, bypassing both the costly SVR and the iterative non-rigid registration steps entirely, thereby substantially accelerating atlas construction. Extensive experiments demonstrate that INFANiTE outperforms existing baselines in subject consistency, reference fidelity, intrinsic quality and biological plausibility, even under challenging sparse-data settings. Additionally, INFANiTE reduces the end-to-end processing time (i.e., from raw scans to the final atlas) from days to hours compared to the traditional 3D volume-based pipeline (e.g., SyGN), facilitating large-scale population-level fetal brain analysis. Our code is publicly available at: https://anonymous.4open.science/r/INFANiTE-5D74
Chinese Translation
时空胎儿大脑图谱对于表征正常神经发育和识别先天性异常具有重要意义。然而,现有的图谱构建流程需要数天时间进行切片到体积重建(SVR),以生成高分辨率的3D大脑体积,并且还需额外数天进行迭代体积配准,这使得从大规模队列构建图谱变得极为不切实际。我们通过INFANiTE解决了这些限制,INFANiTE是一个隐式神经表示(INR)框架,用于从临床厚切MRI扫描中学习高分辨率胎儿大脑时空图谱,完全绕过了昂贵的SVR和迭代非刚性配准步骤,从而显著加速了图谱构建。大量实验表明,INFANiTE在受试者一致性、参考保真度、内在质量和生物合理性等方面优于现有基准,即使在具有挑战性的稀疏数据环境下也是如此。此外,与传统的基于3D体积的流程(例如SyGN)相比,INFANiTE将端到端处理时间(即从原始扫描到最终图谱)从数天缩短至数小时,促进了大规模人群级别的胎儿大脑分析。我们的代码已公开发布,网址为:https://anonymous.4open.science/r/INFANiTE-5D74
cs.CV / 212 / 2605.09982
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
ERASE:通过自适应两阶段令牌修剪消除冗余视觉令牌
Abstract
Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
Chinese Translation
最近在视觉-语言模型(VLMs)方面的进展使得大型语言模型(LLMs)能够处理高分辨率图像,显著提升了现实世界的多模态理解能力。然而,这一能力引入了大量的视觉令牌,导致了显著的计算开销。为了解决这一问题,已经提出了多种视觉令牌修剪方法。然而,现有的方法主要依赖于模型内部学习到的语义特征来捕捉视觉冗余。此外,它们缺乏根据输入图像复杂性调整修剪策略的自适应机制。在本文中,我们提出了ERASE,一个两阶段视觉令牌修剪框架,通过适应图像复杂性的修剪策略识别并保留显著令牌。实验结果表明,ERASE显著减少了视觉令牌数量,同时保持了准确性。对于Qwen2.5-VL-7B,在85%的令牌修剪比例下,ERASE保留了原始模型89.46%的准确性,而最佳的先前方法仅保留了78.1%。我们的代码可在https://github.com/Tuna-Luna/ERASE获取。
cs.CV / 213 / 2605.09984
Geometric 4D Stitching for Grounded 4D Generation
基于几何的4D缝合用于有根4D生成
Abstract
Recent 4D generation methods complete scene-level missing information using generative models and reconstruct the scene into radiance-based representations. However, these pipelines often present geometric inconsistencies in the generated content, and the radiance-based reconstruction requires expensive optimization. Furthermore, radiance-based representations often absorb these geometric inconsistencies into their view-dependent nature, failing to enforce grounded geometric consistency. To address these issues, we propose Geometric 4D Stitching, an efficient framework that explicitly identifies missing geometric regions and complements them with geometrically grounded 4D stitches. As a result, our method constructs 4D scene representations in under 10 minutes on a single NVIDIA RTX 5090 GPU per one-step scene expansion, while improving geometric consistency. Moreover, we demonstrate that our explicit 4D stitching supports iterative expansion of 4D meshes as well as 4D scene editing.
Chinese Translation
近年来,4D生成方法利用生成模型完成场景级缺失信息,并将场景重建为基于辐射的表示。然而,这些流程在生成内容中常常存在几何不一致性,并且基于辐射的重建需要昂贵的优化。此外,基于辐射的表示通常将这些几何不一致性吸收进其视依赖特性中,未能强制执行有根的几何一致性。为了解决这些问题,我们提出了基于几何的4D缝合,这是一种高效的框架,明确识别缺失的几何区域,并用几何上有根的4D缝合进行补充。因此,我们的方法在单个NVIDIA RTX 5090 GPU上每一步场景扩展的时间少于10分钟,同时提高了几何一致性。此外,我们证明了我们的显式4D缝合支持4D网格的迭代扩展以及4D场景编辑。
cs.CV / 214 / 2605.09996
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona:系统性基准测试与全模态个性化的改进
Abstract
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language: unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacks the methodological rigor to account for absent-persona scenarios or to support systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across $\sim 750$ items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy ($\mathrm{Cal}$), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
Chinese Translation
尽管多模态大语言模型在文本、图像和音频方面取得了进展,但个性化研究主要集中在视觉-语言领域,统一的全模态基准测试仍然有限,缺乏考虑缺失个性化场景或系统性基础研究的方法论严谨性。我们提出了Omni-Persona,这是第一个全面的全模态个性化基准。我们将任务形式化为在个性模态图上的跨模态路由,涵盖4个任务组和18个细分任务,共计约750个项目。为了严格诊断基础行为,我们提出了校准准确率($\mathrm{Cal}$),它同时奖励正确的基础和适当的放弃,将缺失个性化查询纳入统一的评估框架。在我们的专门实验中,出现了三个诊断发现:(i)开源模型显示出一致的音频与视觉基础差距,而RLVR通过密集的基于规则的监督部分缩小了这一差距;(ii)可回答的召回和参数规模是不完整的诊断,因为强召回可以与缺失个性化的幻觉共存,而更大的模型并不总是能实现更高的$\mathrm{Cal}$,这暴露了校准作为一个独立评估轴的重要性;(iii)SFT受到大规模构建注释真实监督的难度限制,而RLVR通过结果级可验证反馈更一致地进行泛化,但在我们的奖励设计下趋向于保守行为和较低的生成质量。因此,Omni-Persona作为一个诊断框架,揭示了全模态个性化的陷阱,为未来的后训练和奖励设计提供指导。
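To make the idea of jointly rewarding correct grounding and appropriate abstention more concrete, here is a minimal sketch of one plausible scoring rule. The exact definition of $\mathrm{Cal}$ is not given in the abstract, so the equal weighting, the use of `None` for both absent-persona queries and abstentions, and the function names `calibrated_accuracy` and `answerable_recall` are all assumptions.

```python
def calibrated_accuracy(items):
    """Fraction of items handled correctly (hypothetical scoring rule).

    items: list of (gold, pred) pairs. gold is None for an absent-persona
    query; pred is None when the model abstains.
    """
    if not items:
        raise ValueError("need at least one item")
    hits = 0
    for gold, pred in items:
        if gold is None:
            hits += pred is None   # reward abstaining on absent personas
        else:
            hits += pred == gold   # reward correct grounding
    return hits / len(items)


def answerable_recall(items):
    """Accuracy restricted to queries that do have a gold persona."""
    answerable = [(g, p) for g, p in items if g is not None]
    return sum(p == g for g, p in answerable) / len(answerable)


# A model that never abstains: perfect recall, but it hallucinates on absent personas.
always_answers = [("alice", "alice"), ("bob", "bob"), (None, "carol"), (None, "dave")]
# A model that abstains when appropriate but misses one answerable query.
abstains_well = [("alice", "alice"), ("bob", "carol"), (None, None), (None, None)]

scores = {
    "always_answers": (answerable_recall(always_answers), calibrated_accuracy(always_answers)),
    "abstains_well": (answerable_recall(abstains_well), calibrated_accuracy(abstains_well)),
}
```

Under this toy rule the first model has the higher recall but the lower calibrated score, which mirrors the abstract's point that strong recall can coexist with absent-persona hallucination.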
cs.CV / 215 / 2605.10002
Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
Med-StepBench:用于评估医学视觉语言模型幻觉的分层推理框架
Abstract
Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.
Chinese Translation
大型视觉语言模型(VLMs)在医学图像理解方面表现出色,但经常生成临床上看似合理但实际上错误的陈述,这引发了重大安全隐患。现有的医学幻觉基准主要集中于二维成像和一次性诊断问题,提供的见解有限,无法判断预测是否基于正确的定位和异常识别,从而使得关键推理错误隐藏在看似正确的诊断背后。我们引入了Med-StepBench,这是第一个针对三维肿瘤学PET/CT中逐步幻觉检测的大规模基准,包含超过12,000张图像和超过1,000,000对图像-陈述对,分解临床推理为四个专家设计的诊断阶段。通过使用临床医生验证的注释,我们首次对通用和医学VLMs进行了逐步评估,揭示了被聚合准确性指标掩盖的系统性失败模式。此外,我们还表明,当前的VLMs对具有对抗性但临床上合理的中间解释高度敏感,这显著放大了幻觉,尽管存在相互矛盾的视觉证据。总之,我们的研究结果突显了多步骤临床推理在基础上的局限性,并确立了Med-StepBench作为开发更安全、更可靠的医学VLMs的严格基准。
cs.CV / 216 / 2605.10009
Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation
Hystar:基于超网络的风格自适应检索通过动态奇异值调制
Abstract
Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
Chinese Translation
基于查询的图像检索(QBIR)需要根据多样且通常风格异质的查询(如草图、艺术作品或低分辨率预览)检索相关图像。虽然像 CLIP 这样的规模庞大的视觉-语言表示模型(VLRMs)在零样本检索性能上表现出色,但它们在面对未见查询风格所导致的分布变化时表现不佳。在本文中,我们提出了基于超网络的风格自适应检索(Hystar),这是一个轻量级框架,能够动态调整模型权重以适应每个查询的风格。Hystar 使用超网络生成注意力层的奇异值扰动($\Delta S$),实现灵活的输入适应,同时在 MLP 层上使用静态奇异值偏移以确保跨风格的稳定性。为了更好地处理风格间的语义混淆,我们设计了 StyleNCE,作为 Hystar 的一部分,这是一种最优运输加权的对比损失,强调困难的跨风格负样本。在多风格检索和跨风格分类基准上的大量实验表明,Hystar 始终优于强基线,达到了最先进的性能,同时在参数效率和跨风格稳定性方面表现出色。
cs.CV / 217 / 2605.10026
MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving
MUSDA:用于自动驾驶的多源多模态无监督领域自适应3D目标检测
Abstract
With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.
Chinese Translation
随着自动驾驶技术的进步,许多带注释的多模态数据集已经可用。这为开发无需依赖劳动密集型手动注释的新环境下的领域自适应3D目标检测器提供了机会。然而,传统的领域自适应方法通常集中于单一源领域或单一模态,限制了其在多源多模态场景中的有效性。本文提出了一种用于自动驾驶的多源多模态无监督领域自适应3D目标检测的新框架。给定多个标注的源领域和一个未标注的目标领域,我们的框架首先引入分层空间条件(HSC)领域分类器,这些分类器在每对源-目标领域之间的两个不同层次上共同对齐来自相机和激光雷达模态的特征。为了有效利用来自多个源领域的信息,我们在每对领域之间构建了原型图。基于此,我们开发了一种原型图加权(PGW)多源融合策略,以聚合来自多个源检测头的预测。在三个广泛使用的3D目标检测数据集——Waymo、nuScenes和Lyft上的实验结果表明,我们提出的框架有效整合了来自两种模态和源领域的信息,始终优于最先进的方法。
cs.CV / 218 / 2605.10029
Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities
基于AlphaEarth基础的贫民窟检测与密度映射:跨12个全球城市的表征学习评估
Abstract
Pixel-level slum mapping has long been constrained by limited cross-city generalisation, the absence of continuous density estimation, and weak global comparability. AlphaEarth Foundations (AEF), a globally consistent 64-dimensional annual surface embedding at 10 m, offers a new analysis-ready basis for lightweight slum monitoring, but its applicability to slum detection - an indirectly coupled task shaped by both built form and socio-economic processes - remains untested. We evaluate AEF on slum classification and sub-pixel density estimation across 12 cities and 69 city-year pairs (2017-2024), using GRAM pseudo-masks as supervisory labels. The evaluation spans four training strategies, two protocols (random split and 3x3 spatial block cross-validation), six auxiliary feature configurations, and five baseline models, complemented by representation-level analyses (PCA, SHAP) and full-AOI mapping. Five findings emerge. (1) Same-city cross-year training is optimal under both protocols (median spatial F1 = 0.616, R^2 = 0.466); temporal expansion outperforms cross-city transfer, indicating city-scale representational drift. (2) Regression R^2 is driven primarily by zero/non-zero boundary discrimination: positive-pixel R^2 is consistently negative across all cities, revealing limited capacity to model intra-pixel density gradients at 10 m. (3) PC36 is consistently top-ranked across tasks; classification saturates at k = 32 while regression remains unsaturated at k = 64. (4) POI features yield the largest density gain (Delta R^2 = +0.064). (5) For six cities meeting dual-task usability thresholds, full-AOI inference across 2017-2024 preserves slum cluster structure (mean SSIM = 0.926). The study delineates the capabilities and complementarity needs of foundation-model embeddings for slum monitoring.
Chinese Translation
像素级贫民窟映射长期受到跨城市泛化能力有限、缺乏连续密度估计以及全球可比性较弱的限制。AlphaEarth Foundations (AEF) 是一种全球一致的64维年度表面嵌入,分辨率为10米,为轻量级贫民窟监测提供了新的分析基础,但其在贫民窟检测中的适用性——这一间接耦合的任务受建筑形式和社会经济过程的共同影响——尚未经过检验。我们在12个城市和69对城市-年份(2017-2024)上评估AEF在贫民窟分类和亚像素密度估计方面的表现,使用GRAM伪掩模作为监督标签。评估涵盖四种训练策略、两种协议(随机划分和3x3空间块交叉验证)、六种辅助特征配置和五种基线模型,并辅以表征级分析(主成分分析PCA、SHAP)和全区域兴趣(full-AOI)映射。研究得出五个发现:(1) 同城市跨年份训练在两种协议下均为最佳(中位空间F1 = 0.616,R^2 = 0.466);时间扩展优于跨城市转移,表明城市规模的表征漂移。(2) 回归R^2主要由零/非零边界区分驱动:正像素R^2在所有城市中始终为负,显示出在10米分辨率下建模像素内密度梯度的能力有限。(3) PC36在各任务中始终排名最高;分类在k = 32时达到饱和,而回归在k = 64时仍未饱和。(4) POI特征带来了最大的密度增益(Delta R^2 = +0.064)。(5) 对于六个满足双任务可用性阈值的城市,2017-2024年间的全AOI推断保留了贫民窟集群结构(平均SSIM = 0.926)。本研究阐明了基础模型嵌入在贫民窟监测中的能力和互补需求。
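Finding (2) in the AlphaEarth abstract can look paradoxical at first: overall $R^2$ is positive while $R^2$ restricted to positive pixels is negative. The toy calculation below, using made-up density values that are not taken from the paper, shows how this can happen when a model separates zero from non-zero pixels well but gets the ordering within the non-zero pixels wrong.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Illustrative slum-density fractions per pixel (hypothetical numbers).
density_true = [0.0, 0.0, 0.0, 0.0, 0.4, 0.5, 0.6, 0.5]
# Zeros are predicted exactly, but the gradient among positive pixels is reversed.
density_pred = [0.0, 0.0, 0.0, 0.0, 0.6, 0.5, 0.4, 0.5]

overall_r2 = r_squared(density_true, density_pred)

positive = [(t, p) for t, p in zip(density_true, density_pred) if t > 0]
positive_r2 = r_squared([t for t, _ in positive], [p for _, p in positive])
```

Here the overall score is dominated by the large variance between empty and occupied pixels, so it stays high, while the positive-only score drops below zero because the predictions are worse than simply predicting the positive-pixel mean.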
cs.CV / 219 / 2605.10040
Only Train Once: Uncertainty-Aware One-Class Learning for Face Authenticity Detection
仅需训练一次:面向不确定性的单类学习用于人脸真实性检测
Abstract
The rapid evolution of generative paradigms has enabled the creation of highly realistic imagery, escalating the risks of identity fraud and the dissemination of disinformation. Most existing approaches frame face forgery detection as a fully supervised binary classification problem. Consequently, these models typically exhibit significant performance decay when tasked with detecting forgeries from previously unseen generative paradigms. Furthermore, these methods focus exclusively on either DeepFakes or fully synthesized faces, thereby failing to provide a generalized framework for universal face forgery detection. In this paper, we address this challenge by introducing FADNet (Face Authenticity Detector Net), a self-supervised framework that reformulates face forgery detection as a one-class classification (OCC) task. By training exclusively on authentic facial data to capture their intrinsic representations, FADNet flags any image whose feature embedding deviates significantly from the learned distribution of real faces as a forgery. The framework incorporates Evidential Deep Learning (EDL) to quantify predictive uncertainty and utilizes a plug-and-play pseudo-forgery image generator (PFIG) to tighten decision boundaries around authentic data. Extensive experimental evaluations on the DF40 and ASFD benchmarks demonstrate that FADNet achieves superior performance and generalization capabilities. Specifically, FADNet substantially outperforms existing state-of-the-art (SOTA) methods, yielding an average accuracy of 96.63% and an average precision of 98.83%.
Chinese Translation
生成范式的快速演变使得高度真实的图像得以创建,这加剧了身份欺诈和虚假信息传播的风险。现有的大多数方法将人脸伪造检测框定为一个完全监督的二分类问题。因此,这些模型在检测来自先前未见生成范式的伪造时,通常表现出显著的性能下降。此外,这些方法仅专注于 DeepFakes 或完全合成的人脸,未能提供一个通用的人脸伪造检测框架。本文通过引入 FADNet(人脸真实性检测网络)来应对这一挑战,FADNet 是一个自监督框架,将人脸伪造检测重新表述为一个单类分类(OCC)任务。FADNet 仅在真实人脸数据上进行训练,以捕捉其内在表示,任何特征嵌入显著偏离真实人脸学习分布的图像都被标记为伪造。该框架结合了证据深度学习(Evidential Deep Learning, EDL)来量化预测不确定性,并利用即插即用的伪伪造图像生成器(PFIG)来收紧真实数据周围的决策边界。在 DF40 和 ASFD 基准上的广泛实验评估表明,FADNet 实现了卓越的性能和泛化能力。具体而言,FADNet 显著优于现有的最先进(SOTA)方法,取得了 96.63% 的显著平均准确率和 98.83% 的平均精确率。
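The one-class idea in the FADNet abstract, training only on authentic faces and flagging embeddings that deviate from the learned distribution, can be sketched with a very simple stand-in. The code below fits a per-dimension mean and standard deviation on real embeddings and thresholds a normalized distance; it does not reproduce FADNet's EDL uncertainty or its PFIG generator, and the function names and the threshold of 3.0 are assumptions for illustration only.

```python
import math


def fit_real_distribution(embeddings):
    """Per-dimension mean and std estimated from authentic embeddings only."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    std = [
        # Fall back to 1.0 when a dimension has zero variance.
        math.sqrt(sum((e[i] - mean[i]) ** 2 for e in embeddings) / n) or 1.0
        for i in range(dim)
    ]
    return mean, std


def deviation_score(embedding, mean, std):
    """Root-mean-square z-score of an embedding under the fitted statistics."""
    z2 = sum(((x - m) / s) ** 2 for x, m, s in zip(embedding, mean, std))
    return math.sqrt(z2 / len(mean))


def is_forgery(embedding, mean, std, threshold=3.0):
    """Flag embeddings that lie far from the authentic distribution."""
    return deviation_score(embedding, mean, std) > threshold


# Toy 2-D "embeddings" of authentic faces (hypothetical values).
real = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2], [0.0, 1.0]]
mean, std = fit_real_distribution(real)

near_real = is_forgery([0.1, 0.9], mean, std)
far_from_real = is_forgery([1.0, 0.0], mean, std)
```

Because no forged samples are needed to fit the statistics, a detector of this shape does not depend on which generator produced a forgery, which is the motivation the abstract gives for the one-class formulation.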
cs.CV / 220 / 2605.10045
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR:用于视觉自回归模型中分辨率外推的阶段感知RoPE重映射
Abstract
Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.
Chinese Translation
视觉自回归(VAR)模型已成为图像合成中对扩散方法的有力替代,然而其固定的训练分辨率限制了在更高分辨率下的直接生成。从大型语言模型(LLMs)或扩散模型中简单地转移无训练的外推方法到VAR会导致三种典型的失败模式:全局重复、局部重复和细节退化。我们将这些问题追溯到一个统一的带阶段不匹配:VAR以粗到细的、按比例的过程生成图像,其中每个阶段由一个独特的主导RoPE频率带驱动,而当特定阶段的主导频带受到干扰时,便会出现每种失败模式。基于这一洞察,我们提出了阶段感知RoPE重映射,这是一种无训练的策略,为每个频率带分配一个阶段特定的重映射规则,从而共同抑制所有三种失败模式。我们进一步观察到,随着图像分辨率的增加,注意力变得系统性地分散。现有方法通常依赖于预定义的注意力缩放因子,这些因子既不适应目标分辨率,也无法真实捕捉注意力分散的实际程度。因此,我们提出了熵驱动的自适应注意力校准,该方法通过分辨率不变的归一化熵量化分散,并产生一个封闭形式的每头缩放因子,使外推分辨率的注意力熵与其训练分辨率的对应值重新对齐。大量实验表明,我们的方法在结构一致性和细节保真度上始终优于先前的分辨率外推方法。我们的代码可在 https://github.com/feihongyan1/ExtraVAR 获取。
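The attention-dispersion argument in the ExtraVAR abstract can be illustrated numerically. The sketch below computes a normalized entropy (Shannon entropy divided by $\log n$, so it lies in $[0, 1]$ for any number of keys) and then searches for a logit scaling factor that brings the entropy at a larger key count back to the training-resolution value. The paper reports a closed-form per-head factor; the bisection used here is a stand-in under that stated assumption, and all names and logit values are hypothetical.

```python
import math


def softmax(logits, scale=1.0):
    """Numerically stable softmax of scale * logits."""
    m = max(logits)
    exps = [math.exp(scale * (x - m)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n); comparable across key counts."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def calibrate_scale(logits, target, lo=0.01, hi=100.0, iters=60):
    """Bisection (in log space) for the scale whose entropy matches target.

    Relies on entropy decreasing monotonically as the scale grows.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if normalized_entropy(softmax(logits, mid)) > target:
            lo = mid  # still too dispersed: sharpen further
        else:
            hi = mid
    return math.sqrt(lo * hi)


train_logits = [2.0, 1.0, 0.0, -1.0]   # 4 keys at training resolution
long_logits = train_logits * 4          # 16 keys after extrapolation

target = normalized_entropy(softmax(train_logits))
dispersed = normalized_entropy(softmax(long_logits))
scale = calibrate_scale(long_logits, target)
recalibrated = normalized_entropy(softmax(long_logits, scale))
```

In this toy case the extrapolated attention is more dispersed than at training resolution, and a scale above 1 restores the original normalized entropy.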
cs.CV / 221 / 2605.10046
PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows
PixelFlowCast:通过像素平均流实现无潜变量的降水短期预报
Abstract
Precipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an $x$-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.
Chinese Translation
降水短期预报旨在预测极端天气预警的短期雷达回波序列,其中预测的准确性和推理效率对于实际应用至关重要。然而,尽管扩散模型具有强大的生成能力,但由于多步采样轨迹导致推理速度缓慢,限制了其实际可用性。条件流匹配(Conditional Flow Matching, CFM)通过简化轨迹提高了效率,但依赖于潜变量空间压缩,这不可避免地丢弃了高频物理细节,降低了细粒度预测质量。为了解决这些局限性,我们提出了PixelFlowCast,这是一种两阶段的概率预测框架,能够在不进行潜变量压缩的情况下实现高效率和高保真度的预测。具体而言,在第一阶段,确定性模型首先生成粗略预测,以捕捉全球演变趋势。在随后的阶段中,所提出的KANCondNet提取深层时空演变特征,以提供准确的条件指导。在此基础上,无潜变量的少步像素平均流(Pixel Mean Flows, PMF)预测器采用$x$-预测机制生成高质量预测,有效保留细粒度结构,同时保持快速推理。在公开可用的SEVIR数据集上的实验表明,PixelFlowCast在预测准确性和推理效率上均优于现有主流方法,特别是在长序列预测中,突显了其在实际操作部署中的强大潜力。
cs.CV / 222 / 2605.10050
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
EchoPrune:将冗余视为时间回声以提高视频大语言模型的效率
Abstract
Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as dense frame sampling introduces massive numbers of visual tokens, while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos as collections of static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x as many frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
Chinese Translation
长篇视频理解对视频大语言模型(VideoLLMs)仍然具有挑战性,因为密集的帧采样引入了大量的视觉标记,而稀疏采样则可能错过关键的时间证据,导致LLM幻觉。现有的无训练标记减少方法要么将视频视为静态图像,要么依赖于分段级别的合并启发式,这削弱了细粒度时空建模并引入了额外的开销。本文提出了EchoPrune,一种轻量级且无训练的标记修剪方法,在固定的LLM端视觉标记预算下提高时间分辨率。我们的核心思想是将冗余视频标记解释为时间回声:如果一个标记可以从前一帧良好重建,那么它仅仅是一个时间上冗余的回声;否则,它可能捕捉到新的事件、运动或与查询相关的视觉证据。基于这一见解,EchoPrune通过(i)查询引导的跨模态相关性和(ii)时间重建误差来评分视觉标记,这些误差通过连续帧之间的对应匹配和回声匹配进行测量。所选标记保留了与任务相关的线索和时间新颖性,同时抑制了可预测的冗余,使VideoLLMs能够在不增加解码预算的情况下观察更多帧。在LLaVA-OV、Qwen2.5VL和Qwen3VL的六个视频理解基准上的大量实验表明,EchoPrune使VideoLLMs能够在相同的标记预算下处理多达20倍的帧,带来了性能提升(+8.6%)和推理加速(Qwen2.5VL-7B的预填充速度提升5.6倍)。
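The two scoring terms in the EchoPrune abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the matching procedure or how the terms are combined, so the cosine-based nearest-match "novelty", the mixing weight `alpha`, and the function names below are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def echo_scores(prev_frame, cur_frame, query, alpha=0.5):
    """Score each token of cur_frame; a higher score means more worth keeping.

    novelty   = 1 - best cosine match against any token of prev_frame
                (an assumed stand-in for temporal reconstruction error)
    relevance = cosine similarity to the query embedding
    alpha     = hypothetical mixing weight, not a value from the paper
    """
    scores = []
    for tok in cur_frame:
        best_match = max((cosine(tok, p) for p in prev_frame), default=0.0)
        novelty = 1.0 - best_match
        relevance = cosine(tok, query)
        scores.append(alpha * relevance + (1.0 - alpha) * novelty)
    return scores

def prune(prev_frame, cur_frame, query, keep, alpha=0.5):
    """Return the indices of the `keep` highest-scoring tokens, in original order."""
    scores = echo_scores(prev_frame, cur_frame, query, alpha)
    ranked = sorted(range(len(cur_frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

In this toy form, a token identical to one in the previous frame and unrelated to the query scores zero and is pruned first, which matches the "temporal echo" intuition in the abstract.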
cs.CV / 223 / 2605.10054
Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging
增强生物医学成像可解释性的解释感知学习
Abstract
Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrate explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics, annotation coverage and saliency precision, are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis yields consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.
Chinese Translation
用于医学图像诊断的深度神经网络通常在依赖虚假或临床无关的视觉线索的情况下实现高预测准确性,这限制了它们在实践中的可信度。事后解释方法广泛用于以显著性图的形式可视化模型决策;然而,这些解释并未影响模型在训练过程中的学习方式,导致非因果或混杂特征的持续存在。这促使我们将解释监督直接纳入训练目标,以引导模型关注临床相关区域并促进基于临床的决策制定。本文提出了一种系统的方法,将解释损失整合到模型训练中,并分析不同的解释损失设计和监督强度如何影响预测性能和解释的空间忠实度。为了定量评估可解释性,引入了两个互补的解释性能指标——注释覆盖率和显著性精度,使得评估超越定性可视化的严格性成为可能。我们的实验结果揭示了解释质量与解释损失系数之间的明显权衡。此外,定量统计分析显示解释对齐度持续改善,同时保持可比的准确性。实验在标注的胸部X光数据集上进行;然而,所提出的框架适用于广泛的标注生物医学成像模式。总体而言,这些发现表明解释监督并非单一的设计选择,并为在嘈杂的临床注释下将解释损失纳入训练目标提供了实用指导。
cs.CV / 224 / 2605.10071
MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization
MFVLR:用于可泛化扩散人脸伪造检测与定位的多领域细粒度视觉-语言重建
Abstract
The swift advancement of photo-realistic face generation technology has sparked considerable concern across society and academia, emphasizing the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using the image modality alone; other modalities, such as fine-grained text, have not been comprehensively investigated, which restricts the generalization capability of models. Moreover, these works usually analyze facial images created by GANs and struggle to identify and localize those synthesized by diffusion models. To address these problems, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. In addition, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in different settings such as cross-generator, cross-forgery, and cross-dataset evaluations.
Chinese Translation
照片级真实感人脸生成技术的迅速发展引发了社会和学术界的广泛关注,强调了可泛化人脸伪造检测与定位方法的必要性。以往的研究通常通过图像模态捕捉多个领域的人脸伪造模式,而其他模态如细粒度文本尚未得到全面研究,这限制了模型的泛化能力。此外,这些研究通常分析由生成对抗网络(GAN)创建的面部图像,但在识别和定位由扩散模型合成的图像时面临困难。为了解决这些问题,本文提出了一种新颖的多领域细粒度视觉-语言重建(MFVLR)模型,该模型通过语言引导的人脸伪造表征学习探索全面且多样的视觉伪造痕迹,以实现可泛化的扩散合成的人脸伪造检测与定位(DFFDL)。具体而言,我们设计了一种细粒度语言变换器,利用语言重建研究一般的细粒度语言嵌入。我们提出了一种多领域视觉编码器,以捕捉图像和残差领域中的一般和互补视觉伪造模式。设计了一种视觉解码器以重建图像外观并实现伪造定位。此外,我们提出了一种创新的即插即用视觉注入模块,以增强视觉和语言嵌入之间的交互。大量实验和可视化结果表明,我们的网络在跨生成器、跨伪造和跨数据集评估等不同设置下优于现有最先进的技术。
cs.CV / 225 / 2605.10079
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
SocialDirector:无训练的多人物视频生成社交互动控制
Abstract
Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
Chinese Translation
视频生成技术迅速发展,能够从文本或图像提示生成逼真的视频。同时,电影制作和社交机器人日益需要包含丰富社交互动的多人物视频,包括对话、手势和协调动作。然而,现有模型对互动没有明确的控制,例如谁执行哪个动作、何时发生以及朝向谁。这常常导致错误的人执行意外的动作(演员-动作不匹配)、混乱的社交动态和错误的动作目标。为了解决这些挑战,我们提出了SocialDirector,一种无训练的互动控制器,通过调节交叉注意力图来增强生成模型。SocialDirector包含两个模块:社交演员遮罩(Social Actor Masking)和方向重加权(Directional Reweighting)。社交演员遮罩通过时空遮罩限制每个人的视觉标记仅关注其自身的文本描述,从而避免演员-动作不匹配和混乱的社交动态。方向重加权增强了对方向性词汇(例如,“向左”,“向右”)的注意力,使每个动作朝向其预期目标。为了评估生成的社交互动,我们对现有数据集进行了互动描述的标注,并建立了一个完全自动化的评估管道,依托开源的视觉语言模型(VLMs)。在不同的视频生成模型上的实验表明,SocialDirector显著提高了互动的真实性,并接近真实视频设定的上限。
cs.CV / 226 / 2605.10087
Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction
基于非语言线索的人机交互启动检测框架
Abstract
This paper describes a keyword-free initiation of interaction (IoI) detection framework for human-robot interaction (HRI), based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors and can also employ an external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot localizes the user by combining sound source localization with human tracking information. The robot then detects the IoI if it perceives that the speaker's face is turned toward it. If the user does not speak, the robot can still detect the IoI when the user looks at the robot for longer than a predefined period of time. A state transition model for the proposed IoI detection framework is designed and verified through experiments with a mobile robot. To integrate the model into a robot architecture, all components are implemented in the Robot Operating System (ROS) environment.
Chinese Translation
本文描述了一种基于音频和视觉传感器融合的无关键词人机交互(HRI)启动交互(IoI)检测框架,适用于家庭环境。在所提出的框架中,机器人配备了自己的音频和视觉传感器,并可以使用外部视觉传感器以实现稳定的人体检测和跟踪。当用户在注视机器人时开始说话,机器人能够通过声音源定位和人体跟踪信息来确定其位置。如果机器人感知到说话者的面部朝向机器人,则可以检测到IoI。如果用户没有直接说话,机器人也可以在用户注视机器人超过预定义时间后检测到IoI。为所提出的IoI检测框架设计了状态转移模型,并通过移动机器人实验进行了验证。为了在机器人架构中实现和关联我们的模型,所有组件均在机器人操作系统(ROS)环境中实现和集成。
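The two triggers described in the IoI abstract (speech while facing the robot, or a sustained gaze) can be sketched as a small state machine. The abstract does not list the states or the timing values of the authors' state transition model, so the state names, the `gaze_threshold_s` default, and the reset-on-look-away behaviour below are assumptions for illustration only.

```python
from dataclasses import dataclass

# Hypothetical state names; the paper's actual state set is not given in the abstract.
IDLE, GAZING, INTERACTING = "IDLE", "GAZING", "INTERACTING"

@dataclass
class IoIDetector:
    gaze_threshold_s: float = 2.0  # assumed "predefined period" of sustained gaze
    state: str = IDLE
    gaze_time_s: float = 0.0

    def step(self, speaking: bool, facing_robot: bool, dt: float) -> str:
        """Advance the detector by dt seconds of observation and return the new state."""
        if self.state == INTERACTING:
            return self.state  # interaction already initiated
        if not facing_robot:
            # Looking away resets the gaze timer (an assumption of this sketch).
            self.state, self.gaze_time_s = IDLE, 0.0
            return self.state
        self.gaze_time_s += dt
        if speaking or self.gaze_time_s >= self.gaze_threshold_s:
            # Trigger 1: speech while facing the robot.
            # Trigger 2: gaze held for at least the threshold.
            self.state = INTERACTING
        else:
            self.state = GAZING
        return self.state
```

A caller would feed `step` once per perception cycle with the outputs of the speech and face-orientation detectors; speech with the face turned away never triggers an IoI in this sketch.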
cs.CV / 227 / 2605.10100
HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation
HYPERPOSE:用于3D人类姿态估计的双曲运动相空间注意力
Abstract
We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
Chinese Translation
我们提出了HYPERPOSE,一种新颖的3D人类姿态估计框架,该框架完全在双曲空间的洛伦兹模型$\mathbb{H}^d$内进行时空推理,以原生地保留人类骨架的层次树拓扑。当前最先进的姿态估计器旨在通过依赖变换器和图卷积网络来捕捉复杂的关节动态。由于这些架构仅在欧几里得空间中操作,这与人体固有的树结构根本不匹配,因此这些方法不可避免地遭受指数体积失真,并难以保持结构一致性。为此,我们摆脱了平坦空间,旨在通过双曲运动相空间注意力(Hyperbolic Kinematic Phase-Space Attention, HKPSA)提高几何保真度,原生嵌入复杂的关节关系而不产生失真,同时采用多尺度窗口双曲注意力机制,以$O(TW)$的复杂度高效建模时间动态。此外,为了克服训练非欧几里得流形的众所周知的不稳定性,HYPERPOSE引入了一套新颖的黎曼损失和不确定性加权课程,强制执行物理测地约束,如骨长和速度一致性。在Human3.6M和MPI-INF-3DHP数据集上的广泛评估表明,HYPERPOSE在结构和时间一致性方面达到了最先进的水平,显著减少了体积失真和速度误差,同时在整体位置准确性上建立了新的最先进基准。
cs.CV / 228 / 2605.10106
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA:一种基于视频的多模态大语言模型空间推理代理
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving inference-time approaches relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost and no heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.
Chinese Translation
近年来,多模态大语言模型(MLLMs)在3D空间智能方面取得了显著进展,但这一进展主要依赖于在精心策划的基准测试上的后训练,导致推理时的方法相对未被充分探索。本文从无训练的角度出发,介绍了ViSRA,一个与人类对齐的基于视频的空间推理代理,作为探讨MLLMs空间推理机制的框架。ViSRA通过利用专家模型中的显式空间信息,以模块化和可扩展的方式引发空间推理,从而实现即插即用的灵活范式。ViSRA提供了两个关键优势:(1)与人类对齐且可转移的3D理解,而非特定任务的过拟合;(2)没有后训练的计算成本以及对空间推理数据集的繁重人工策划。实验结果表明,在一系列MLLMs上,ViSRA在现有基准测试和未见过的3D空间推理任务中均表现出一致的改进,ViSRA在基准测试中分别超越基线达15.6%和28.9%的绝对差距。
cs.CV / 229 / 2605.10117
Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving
按需思考:面向自主驾驶的几何驱动自适应感知
Abstract
Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.
Chinese Translation
自主驾驶场景从空旷的高速公路到密集的交叉口,涉及数十个交互的道路使用者,然而当前的3D检测模型对每一帧应用固定的计算预算,导致在简单场景中浪费资源,而在复杂场景中又缺乏能力。现有的方法加剧了这一问题:基于Transformer的交互模型在检测到的对象数量上呈二次方扩展,而逐帧处理使得系统在对象被遮挡的瞬间立即忘记它们。我们提出了Enhanced HOPE,这是一种自适应感知架构,通过无监督统计估计器测量每个输入LiDAR帧的几何复杂性,并相应地将其路由到浅层或深层处理路径,无需手动场景标签。为了保持交互建模的高效性,我们用线性时间的子空间网络替代了二次对偶注意力,该网络将附近对象分组为簇并共同处理。通过这两种机制的计算节省,为持久的时间记忆模块释放了资源,该模块能够在帧间保留先前检测到的对象和交通规则,使系统能够在对象从视野中消失几秒后回忆起被遮挡的对象。在nuScenes和CARLA基准测试中,Enhanced HOPE在简单场景中将延迟减少了38%,且没有准确性损失,在稀有长尾场景中提高了平均精度(mean Average Precision)2.7个百分点,并能够跟踪持续超过5秒的遮挡对象,而所有测试的基线方法均未能成功。
cs.CV / 230 / 2605.10120
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
MicroWorld:赋能多模态大型语言模型以通过多模态属性图弥合微观领域差距
Abstract
Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image--caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
Chinese Translation
多模态大型语言模型(MLLMs)在科学推理方面展现出显著潜力,但在显微镜等专业领域的表现仍受到领域特定训练数据稀缺和将细粒度专家知识编码到模型参数中的困难的限制。为了解决这一问题,我们提出了MicroWorld,一个从大规模科学图像-标题语料库构建多模态属性图(MAPG)的框架,并利用它在推理时增强MLLM的推理能力,而无需进行任何领域特定的微调。MicroWorld通过scispaCy或基于LLM的三元组挖掘提取生物医学实体和关系,使用Qwen3-VL-Embedding将图像和实体对齐到共享的嵌入空间,并组装一个包含约111K节点和346K类型边的知识图谱,涵盖八个关系类别。在推理时,图增强检索管道将查询实体与MAPG匹配,并将结构化知识上下文注入到MLLM提示中。在MicroVQA基准测试中,MicroWorld将Qwen3-VL-8B-Instruct的推理性能提升了37.5%,超越了GPT-5 13.0%,达到了新的最先进水平。此外,在MicroBench基准测试中也获得了6.0%的性能提升。大量实验表明,MicroWorld增强了模型的泛化能力。定性案例研究进一步揭示了结构化知识改善推理的机制以及指向有前景未来方向的失败模式。代码和数据可在https://github.com/ieellee/MicroWorld获取。
cs.CV / 231 / 2605.10127
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K:用于统一多模态条件下服装生成的电子商务时尚数据集
Abstract
Recent research on fashion outfit generation focuses on improving the visual consistency of garments by leveraging key information from a reference image and a text prompt. However, the potential of outfit generation remains underexplored, as it requires a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, covering various occasions, models, and garment types. For consistent garment generation, we design a framework with a Unified Multi-modal Condition (UMC) that aligns and integrates the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract unified embeddings of the multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between the prompts and the noise image, so that the noise image can select the pivotal prompt tokens for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving more promising results than SoTA methods.
Chinese Translation
近期关于服装生成的研究工作集中于通过利用参考图像和文本提示中的关键信息来促进服装的视觉一致性。然而,服装生成的潜力仍未得到充分探索,需要全面的电子商务数据集和多模态条件的深入利用。本文提出了一个全新的电子商务数据集,命名为Fashion130k,涵盖各种场合、模型和服装类型。为了实现服装的一致生成,我们设计了一个框架,采用统一多模态条件(Unified Multi-modal Condition, UMC)将文本和视觉提示对齐并整合到生成模型中。具体而言,我们探索了一种嵌入细化器,以提取多模态提示的统一嵌入,其中提出了一种融合变换器(Fusion Transformer)来通过调整文本和图像之间的模态差距来对齐多模态嵌入。基于统一嵌入,生成模型中的注意力机制被重新设计,以强调提示与噪声图像之间的关联,表明噪声图像能够选择提示的关键标记,从而实现一致的服装生成。我们的数据集和提出的框架为生成模型的多模态提示提供了一个全面而细致的探索。对真实世界应用和基准的广泛实验表明,UMC在视觉一致性方面的有效性,取得了比现有最先进方法更有希望的结果。
cs.CV / 232 / 2605.10130
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
Thermal-Det:用于开放词汇热对象检测的语言引导跨模态蒸馏
Abstract
Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.
Chinese Translation
现有的开放词汇检测器主要集中于RGB图像,无法推广到热成像领域,因为低纹理和发射率变化对基于RGB的语义构成挑战。我们提出了Thermal-Det,这是首个针对热图像的由大型语言模型(LLM)监督的开放词汇检测器。为了实现大规模训练,我们通过将GroundingCap-1M转换为热域并过滤标题以去除RGB特定术语,开发了一个合成数据集,生成超过一百万个热对齐样本,包含边界框、定位文本和详细标题。Thermal-Det联合优化检测、标题生成和跨模态蒸馏目标。一个冻结的RGB教师为配对但未标记的RGB-热数据提供几何和语义伪监督,转移开放词汇知识而无需手动标注。该模型进一步采用热文本对齐头进行文本校准,并使用模态融合跨注意力模块进行双模态推理。与之前的领域适应方法不同,检测器经过全面微调,以内化热对比模式,同时保持语言对齐。在公共基准测试上的实验表明,与现有的开放词汇检测器相比,AP(平均精度)一致提高了2-4%,为可扩展的、以语言驱动的热感知奠定了坚实基础。
cs.CV / 233 / 2605.10142
Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
扩展视觉模型并不一致地提高基于定位的解释质量
Abstract
Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
Chinese Translation
人工智能模型正日益扩展以提高预测准确性,但尚不清楚规模是否改善了事后解释的质量。我们通过评估11个计算机视觉模型来研究这一关系,这些模型代表了ResNet、DenseNet和Vision Transformer系列中不断增加的深度和复杂性,模型是从头训练或预训练的,且在三个带有真实分割掩码的图像数据集上进行评估。对于每个模型,我们使用五种事后可解释AI方法生成解释,并使用两个定位指标量化掩码对齐度:相关性排名准确度(Relevance Rank Accuracy,Arras等,2022)和提出的双极性精度(Dual-Polarity Precision),该指标测量类掩码内的正向归因和类掩码外的负向归因。在不同数据集和方法中,增加架构深度和参数数量在大多数统计比较中并未改善解释质量,而较小的模型往往与更深的变体相匹配或超越。虽然预训练通常提高预测性能并增加解释对学习权重的依赖,但并未一致提高定位得分。我们还观察到一些场景,其中模型在预测性能上表现强劲,而定位精度接近零,这表明仅靠性能指标可能无法指示预测是否基于注释区域。这些结果表明,较大的模型并不可靠地提供更高质量的解释,因此在安全敏感的部署中,解释性应在模型选择过程中明确评估。
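The abstract describes Dual-Polarity Precision only in words (positive attributions inside the class mask, negative attributions outside it). One plausible reading, offered here as an assumption rather than the paper's definition, is the share of total attribution magnitude that has the "right" sign for its location. The function name and the normalization by total absolute attribution are choices of this sketch.

```python
def dual_polarity_precision(attribution, mask):
    """Fraction of attribution mass whose sign agrees with the mask.

    attribution: 2D list of floats (signed relevance per pixel)
    mask:        2D list of 0/1 values (1 = pixel belongs to the class region)

    Assumed definition (not taken from the paper): positive values inside the
    mask and negative values outside it count as aligned; everything is
    normalized by the total absolute attribution.
    """
    aligned = 0.0
    total = 0.0
    for attr_row, mask_row in zip(attribution, mask):
        for a, m in zip(attr_row, mask_row):
            total += abs(a)
            if (a > 0 and m) or (a < 0 and not m):
                aligned += abs(a)
    return aligned / total if total > 0 else 0.0
```

Under this reading, a map that puts all positive relevance on the object and all negative relevance on the background scores 1.0, while a model that is accurate but attends to the background would score near zero, which is the situation the abstract flags.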
cs.CV / 234 / 2605.10148
MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers
MicroViTv2:超越FLOPS的边缘能源友好型视觉变换器
Abstract
The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model adopts structural reparameterization, specifically a Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW), for faster inference, and introduces Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT, while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at https://github.com/novendrastywn/MicroViT.
Chinese Translation
视觉变换器(ViT)在视觉任务中取得了显著的准确性,但在边缘部署时仍然计算开销较大。本文提出了MicroViTv2,一种针对真实设备效率优化的轻量级视觉变换器。该模型基于原始的MicroViT进行构建,采用了重新参数化设计,特别是重新参数化的补丁嵌入(Reparameterized Patch Embedding, RepEmbed)和重新参数化的深度卷积混合器(Reparameterized Depth-Wise convolution mixer, RepDW),以实现更快的推理,并引入了单深度卷积转置注意力(Single Depth-Wise Transposed Attention, SDTA)以捕捉长程依赖,减少冗余。尽管FLOPs略有增加,MicroViTv2的准确性相比其前身提高了0.5%,并超越了MobileViTv2、EdgeNeXt和EfficientViT,同时在Jetson AGX Orin上保持快速推理和高能效。在ImageNet-1K和COCO上的实验表明,硬件感知设计和结构重新参数化是实现高准确性和低能耗的关键,验证了评估效率时超越FLOPs的必要性。代码可在https://github.com/novendrastywn/MicroViT获取。
cs.CV / 235 / 2605.10149
Improving Temporal Action Segmentation via Constraint-Aware Decoding
通过约束感知解码改善时间动作分割
Abstract
Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing, which limits scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors, such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD
Chinese Translation
时间动作分割(Temporal Action Segmentation, TAS)将未修剪的视频划分为标记的动作片段。尽管完全监督的方法推动了该领域的发展,但在新领域或资源匮乏的领域中,仍然存在动作变异、模糊边界和高标注成本等挑战。基于语法的方法通过结构先验改善分割效果,但依赖复杂的解析,限制了可扩展性。在本研究中,我们提出了一种轻量级的基于约束的精炼框架,通过整合可以直接从标注数据中提取的统计结构先验(如转移置信度、动作边界集和每类持续时间)来增强TAS预测。这些约束被整合到修改后的维特比解码算法中,允许在推理时进行精炼,而无需重新训练或增加模型复杂性。我们的方法通过纠正结构预测错误,同时保持高效性,改善了完全监督和半监督的TAS模型。代码可在 https://github.com/LUNAProject22/CAD 获取。
cs.CV / 236 / 2605.10157
MolSight: Molecular Property Prediction with Images
MolSight:基于图像的分子性质预测
Abstract
Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present $\textbf{MolSight}$, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and $2\,M$ molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a $\textbf{chemistry-informed curriculum}$: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. $\textit{chemical insight from sight alone}$. The best curriculum-trained configuration achieves the top result on $\textbf{5 of 10}$ benchmarks and top two on $\textbf{all 10}$, at $80\times$ lower FLOPs than the nearest multi-modal competitor.
Chinese Translation
每一个合成的分子都可以绘制为二维骨架图,然而在现代性质预测中,这种普遍可用的表示方式却较少受到关注,反而更倾向于分子图、三维构象或十亿参数语言模型,这些方法各自带来了计算和数据工程的开销。我们提出了 $\textbf{MolSight}$,这是首个系统性的大规模基于视觉的分子性质预测(MPP)研究。我们使用了10种视觉架构、7种预训练策略和$2\,M$个分子图像,评估了在物理性质回归、药物发现分类和量子化学预测等10个下游任务中的性能。为了考虑预训练分子在结构复杂性上的广泛变化,我们进一步提出了一个 $\textbf{化学信息驱动的课程}$:五个结构复杂性描述符将语料库划分为五个逐渐增加化学难度的层级,始终优于非课程基线。我们展示了一个单一渲染的键线图像,通过视觉编码器处理后,足以进行具有竞争力的分子性质预测,即“仅凭视觉获得化学洞察”。最佳课程训练配置在$\textbf{10个基准中的5个}$上取得了最佳结果,并在$\textbf{所有10个}$中排名前两,且其FLOPs比最近的多模态竞争者低$80\times$。
cs.CV / 237 / 2605.10162
Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images
主动SAOOD:遥感图像中主动稀疏标注定向目标检测
Abstract
Reducing the annotation cost of oriented object detection in remote sensing remains a major challenge. Recently, sparse annotation has gained attention for effectively reducing annotation redundancy in dense remote sensing scenes. However, two issues hinder its further development: (1) the reliance of sparse data on class-dependent sampling, and (2) the lack of in-depth investigation into the characteristics of sparse samples. This paper proposes an active learning-based sparsely annotated oriented object detection (SAOOD) method, termed Active-SAOOD. Based on a model state observation module, Active-SAOOD actively selects, at the instance level, the most valuable sparse samples for the current model state by jointly considering orientation, classification, and localization uncertainty, as well as inter- and intra-class diversity. This design enables SAOOD to operate stably under completely randomly initialized sparse annotations and extends its applicability to broader real-world scenarios. Experiments on multiple datasets demonstrate that Active-SAOOD significantly improves both the performance and the stability of existing SAOOD methods under various random sparse annotations. In particular, with an annotation ratio of only 1%, it achieves a 9% performance gain over the baseline, further enhancing the practical value of SAOOD in remote sensing. The code will be made public.
Chinese Translation
降低遥感中定向目标检测的标注成本仍然是一个主要挑战。近年来,稀疏标注因其有效减少密集遥感场景中的标注冗余而受到关注。然而,(1) 稀疏数据依赖于类别相关的采样,以及 (2) 对稀疏样本特征缺乏深入研究,阻碍了其进一步发展。本文提出了一种基于主动学习的稀疏标注定向目标检测方法,称为主动SAOOD(Active-SAOOD)。基于模型状态观察模块,主动SAOOD 在实例级别主动选择与当前模型状态最匹配的最有价值的稀疏样本,综合考虑了方向、分类和定位的不确定性,以及类间和类内的多样性。该设计使SAOOD能够在完全随机初始化的稀疏标注下稳定运行,并将其适用性扩展到更广泛的现实场景。多个数据集上的实验表明,主动SAOOD在各种随机稀疏标注下显著提高了现有SAOOD方法的性能和稳定性。特别是,在仅有1%的标注比例下,其性能较基线提升了9%,进一步增强了SAOOD在遥感中的实际价值。代码将公开。
cs.CV / 238 / 2605.10165
Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation
通过标准化损失聚合进行任务无关的噪声标签检测
Abstract
Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
Chinese Translation
由于观察者之间的变异性和模糊案例,噪声标签在大规模医学影像数据集中很常见。我们提出了一种基于统计学的任务无关框架——标准化损失聚合(Standardized Loss Aggregation, SLA),用于在样本级别检测噪声标签。SLA通过聚合重复交叉验证运行中的标准化折级验证损失来量化标签的可靠性。这种公式将离散的硬计数方案推广为一个连续估计器,能够捕捉性能偏差的频率和幅度,从而产生可解释且统计稳定的噪声评分。对公共眼底数据集的实验表明,SLA在所有噪声水平上始终优于硬计数基线,并且收敛速度显著更快,尤其是在低噪声比率下,细微的损失变化具有信息价值。具有高SLA评分的样本表明可能存在模糊或错误标记的案例,从而指导高效的重新标注,并提高任何分类任务的数据集可靠性。
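The aggregation idea in the abstract above can be illustrated with a minimal sketch. It assumes one plausible reading of SLA: each sample's validation loss is z-scored within its fold, and the z-scores are averaged over repeated cross-validation runs. The function names, the data layout, and the threshold used for the hard-counting contrast are hypothetical and are not taken from the paper.

```python
import statistics
from collections import defaultdict


def _fold_z_scores(fold):
    """Z-score the validation losses of one fold (sample_id -> loss)."""
    losses = list(fold.values())
    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses)
    return {s: ((v - mean) / std if std > 0 else 0.0) for s, v in fold.items()}


def sla_scores(runs):
    """Continuous noisiness score: mean standardized loss per sample.

    `runs` is a list of CV runs; each run is a list of folds; each fold maps
    sample_id -> validation loss (assumed layout, for illustration only).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for run in runs:
        for fold in run:
            for sample_id, z in _fold_z_scores(fold).items():
                totals[sample_id] += z
                counts[sample_id] += 1
    return {s: totals[s] / counts[s] for s in totals}


def hard_counts(runs, z_threshold=1.0):
    """Discrete contrast: how often a sample's standardized loss exceeds a threshold."""
    counts = defaultdict(int)
    for run in runs:
        for fold in run:
            for sample_id, z in _fold_z_scores(fold).items():
                counts[sample_id] += int(z > z_threshold)
    return dict(counts)


# Toy data: sample "c" has consistently high loss, as a mislabeled case might.
runs = [
    [{"a": 0.2, "b": 0.3, "c": 2.0}, {"d": 0.25, "e": 0.3, "f": 0.35}],
    [{"a": 0.2, "c": 1.8, "e": 0.4}, {"b": 0.3, "d": 0.2, "f": 0.3}],
]
scores = sla_scores(runs)
counts = hard_counts(runs)
suspect = max(scores, key=scores.get)
```

In this toy setting the continuous score keeps the size of each deviation, whereas the count only records whether a threshold was crossed, which is the distinction the abstract draws between the two schemes.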
cs.CV / 239 / 2605.10172
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS:基于行动-观察者驱动的动态视觉推理束搜索
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
Chinese Translation
多模态大语言模型(MLLMs)在一般感知方面取得了显著成功,但复杂的多步骤视觉推理仍然是一个持续的挑战。尽管近期的主动式方法融入了工具使用,但它们往往忽视了关键的执行反馈。因此,它们受到想象-行动-观察者(IAO)偏差的影响,即先前的想象与观察者反馈之间的不一致,这削弱了推理的稳定性和最优性。为了解决这一问题,我们提出了V-ABS,一种基于行动-观察者驱动的束搜索框架,通过思考者-行动者-观察者的迭代实现深思熟虑的推理。我们还提出了一种基于熵的自适应加权算法,通过动态平衡策略先验与观察反馈之间的置信度分数来减轻IAO偏差。此外,我们构建了一个包含超过8万样本的大规模监督微调(SFT)数据集,以指导模型为正确的行动路径分配更高的先验置信度。在八个不同基准上的广泛实验表明,V-ABS实现了最先进的性能,在Qwen3-VL-8B基线上的平均提升为19.7%,并在开源和专有模型中均表现出一致的提升。
cs.CV / 240 / 2605.10174
BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry
BathyFacto:考虑折射的双介质神经辐射场用于水深测量
Abstract
Through-water photogrammetry based on UAV imagery enables shallow-water bathymetry, but refraction at the air-water interface violates the straight-ray assumption of Structure-from-Motion and causes systematic depth bias. We present BathyFacto, a refraction-aware two-media extension of Nerfacto integrated into Nerfstudio that targets metrically precise underwater point clouds. BathyFacto uses a shared hash-grid-based density field with a medium-conditioned color head that receives a one-bit medium flag (air or water) and traces each camera ray as two segments: a straight segment in air up to a planar water surface and a refracted segment in water computed via Snell's law with known refractive indices. To allocate samples efficiently across the air-water boundary, we employ a single proposal-network sampler that operates on a virtual straight ray spanning both media, combined with a kinked density wrapper that transparently corrects water-segment positions along the refracted direction before density evaluation. A data adaptation pipeline converts photogrammetric reconstructions to a Nerfstudio-compatible format, estimates the water plane from boundary markers, and provides per-pixel medium masks to gate refraction. We also extend the point cloud export with refraction-corrected backprojection and reversible coordinate transforms to world and global frames. On a simulated two-media scene with known ground truth, BathyFacto with refraction achieves a Cloud-to-Mesh mean distance of 0.06 m and 87% completeness, compared to 0.52 m / 29% for the Nerfacto baseline and 0.36 m / 21% for conventional MVS without refraction correction.
Chinese Translation
基于无人机影像的水下摄影测量能够实现浅水区的水深测量,但空气与水界面的折射违反了运动结构法的直射光假设,并导致系统性的深度偏差。我们提出了BathyFacto,这是一种考虑折射的双介质Nerfacto扩展,集成于Nerfstudio,旨在生成精确的水下点云。BathyFacto使用共享的基于哈希网格的密度场,并配备一个介质条件的颜色头,该颜色头接收一个一位的介质标志(空气或水),并将每条相机光线分为两个部分:在空气中到达平面水面的一段直线,以及在水中通过斯涅尔定律计算的折射段,已知折射率。为了在空气-水界面高效分配样本,我们采用一个单一的提议网络采样器,该采样器在跨越两个介质的虚拟直射光上运行,并结合一个弯曲的密度包装器,该包装器在密度评估之前透明地修正水段沿折射方向的位置。数据适配管道将摄影测量重建转换为Nerfstudio兼容格式,从边界标记中估计水面,并提供每像素的介质掩码以限制折射。我们还扩展了点云导出,包含折射修正的反投影和可逆坐标变换到世界和全局坐标系。在一个已知真实值的模拟双介质场景中,BathyFacto在考虑折射的情况下实现了0.06米的云到网格平均距离和87%的完整性,而Nerfacto基线为0.52米/29%,传统的多视图立体(MVS)在没有折射修正的情况下为0.36米/21%。
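The two-segment ray described in the abstract above depends on bending the ray direction at the water plane. A minimal sketch of the vector form of Snell's law follows. The refractive indices 1.0 (air) and 1.33 (water) and the function name are illustrative assumptions. This is not BathyFacto's actual code.

```python
import math


def refract(direction, normal, n_in=1.0, n_out=1.33):
    """Refract a unit ray direction at a planar interface.

    `normal` is the unit surface normal pointing back into the incident
    medium. Returns the refracted unit direction, or None on total internal
    reflection. Index values are assumed, not taken from the paper.
    """
    eta = n_in / n_out
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * d + (eta * cos_i - cos_t) * n
                 for d, n in zip(direction, normal))


up = (0.0, 0.0, 1.0)               # water-surface normal, pointing into the air
nadir = refract((0.0, 0.0, -1.0), up)
s = math.sqrt(0.5)
oblique = refract((s, 0.0, -s), up)  # 45 degree incidence from air into water
```

A nadir ray passes through unchanged, while an oblique ray is bent toward the vertical. Ignoring that bend is what produces the systematic depth bias the abstract mentions.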
cs.CV / 241 / 2605.10177
MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning
MTA-RL:通过多模态变换器基础的3D可用性和强化学习实现稳健的城市驾驶
Abstract
Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.
Chinese Translation
稳健的城市自主驾驶需要可靠的3D场景理解和在密集交互下的稳定决策。然而,现有的端到端模型缺乏可解释性,而模块化管道则受到脆弱接口间错误传播的影响。本文提出了MTA-RL,这是第一个通过多模态变换器基础的3D可用性和强化学习(Reinforcement Learning, RL)连接感知与控制的框架。与之前直接回归动作的融合模型不同,RGB图像和激光雷达点云通过变换器架构融合,以预测明确的、几何感知的可用性表示。这些结构化表示作为紧凑的观察空间,使得RL策略可以仅基于预测的驾驶语义进行操作,从而显著提高了样本效率和稳定性。在CARLA Town01-03的广泛评估中,针对不同密度(20-60辆背景车辆),MTA-RL始终优于最先进的基线。仅在Town03上训练,我们的方法在未见过的城镇中展示了卓越的零-shot泛化能力,实现了路线完成率提高9.0%、总行驶距离增加11.0%以及每次违规距离改善83.7%。此外,消融研究确认我们的多模态融合和奖励塑造至关重要,显著优于仅使用图像和未塑造变体,证明了MTA-RL在稳健城市自主驾驶中的有效性。
cs.CV / 242 / 2605.10180
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
潜藏的概念是什么?在扩散变换器中检测和抑制风险内容
Abstract
The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D\&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D\&S quantifies each textual token's sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D\&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.
Chinese Translation
文本到图像(T2I)模型的兴起日益引发了对生成风险内容(如色情、暴力和受版权保护的图像)的担忧,突显了在模型内部建立有效保护措施的必要性。尽管已有方法被提出以消除T2I模型中的风险概念,但这些方法主要是针对早期的U-Net架构开发的,导致最先进的基于扩散变换器的T2I模型保护不足。这一差距源于根本的架构转变:扩散变换器(DiTs)通过联合注意力将语义注入和视觉合成交织在一起,这使得在生成过程中孤立和消除风险内容变得困难。为了解决这一问题,我们研究了DiTs中语义概念的表示方式,并发现注意力头表现出特定概念的敏感性。这一特性使得检测和抑制风险内容成为可能。在此发现的基础上,我们提出了AHV-D&S,这是一种无训练的推理时保护措施,旨在改善DiTs中的图像生成。具体而言,AHV-D&S量化每个文本标记在所有注意力头中的敏感性,形成一个注意力头向量(Attention Head Vector,AHV),作为检测风险生成趋势的区分性特征。在推理阶段,我们提出了一种基于动量的策略,以动态跟踪去噪步骤中的标记级AHV,以及一种基于敏感性的自适应抑制策略,根据特定头的风险评分抑制已识别风险标记的注意力权重。大量实验表明,AHV-D&S能够有效抑制色情、受版权保护风格和各种有害内容,同时保持视觉质量,并且在对抗性提示和不同DiT基础的T2I模型之间展现出强大的鲁棒性和可迁移性。
cs.CV / 243 / 2605.10181
A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection
机器学习与深度学习在分布外检测中的比较研究
Abstract
Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.
Chinese Translation
分布外(OOD)检测对于构建可靠的人工智能系统至关重要,因为对无效输入产生输出的模型是不可被信任的。尽管通常认为深度学习(DL)优于传统机器学习(ML),但医学影像数据通常是在标准化协议下获取的,导致在OOD检测任务中图像变异性相对受限。这促使我们在这一背景下对ML和DL方法进行直接比较。两种方法在包含超过60,000幅视网膜和非视网膜图像的开放数据集上进行了评估,涵盖多个分辨率。两种方法在内部和外部验证集上均达到了1.000的AUROC和0.999至1.000之间的准确率,显示出可比的检测性能。然而,ML方法在保持相同准确率的同时表现出显著较低的端到端延迟,表明其计算效率更高。这些结果表明,对于视觉复杂性有限的OOD检测任务,轻量级的ML方法可以以显著降低的计算成本达到DL级别的性能,支持实际的现实世界部署。
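Since the comparison above is reported in terms of AUROC, a short sketch of the metric may help. It uses the pairwise (Mann-Whitney) definition: the probability that a randomly chosen OOD sample receives a higher score than a randomly chosen in-distribution sample, with ties counted as one half. The scores below are made up for illustration and are not results from the paper.

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Quadratic in the number of samples; fine for a small illustration.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical OOD scores: higher means "more likely out-of-distribution".
perfect = auroc([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])
overlapping = auroc([0.9, 0.4], [0.5, 0.1])
```

An AUROC of 1.000 therefore means that every OOD image was scored above every in-distribution image. That is consistent with the abstract's observation that the task has limited visual complexity.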
cs.CV / 244 / 2605.10184
Developing a foundation model for high-resolution remote sensing data of the Netherlands
为荷兰高分辨率遥感数据开发基础模型
Abstract
We develop a foundation model using 1.2 m high-resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both high- and low-frequency landscape features, such as fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing it to exploit temporal dependencies, such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements by incorporating temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data, limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data, achieving competitive performance on global benchmarks while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we have made the scripts and the model accessible on GitHub.
Chinese Translation
我们使用1.2米高分辨率的荷兰卫星图像开发了一个基础模型。通过结合卷积神经网络(Convolutional Neural Network)和视觉变换器(Vision Transformer),该模型捕捉了低频和高频的景观特征,如细腻的纹理、边缘和小物体,以及大型地形结构、海拔模式和土地覆盖分布。利用时间数据作为输入,模型从时间上的更广泛上下文信息中学习,使其能够利用时间依赖性,如地形特征、土地覆盖变化和季节动态。这些额外的约束减少了特征歧义,提高了表示学习的效果,并使模型在较少的标记样本下实现更好的泛化能力。基础模型在多个下游任务上进行了评估,涵盖了荷兰的应用案例以及全球基准数据集。在荷兰的植被监测数据集上,模型通过结合时间信息而非依赖单一时间点,显示出明显的性能提升。尽管使用了较小的模型和有限于荷兰的较少预训练数据,但与最先进的模型相比,它在全球基准上取得了具有竞争力的结果。这些结果表明,该模型能够从有限数据中学习丰富且可泛化的表示,在使用较少参数的情况下,在全球基准上实现竞争力的表现。为了最大化可重复性和重用性,我们已将脚本和模型在GitHub上公开。
cs.CV / 245 / 2605.10185
DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors
DynGhost:用于动态幽灵成像的时序建模变换器与量子探测器
Abstract
Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
Chinese Translation
幽灵成像通过将结构化照明模式与标量强度测量相关联,从单像素探测器重建空间信息。尽管深度学习方法在静态场景中取得了令人鼓舞的结果,但仍然存在两个关键限制未得到解决:现有架构未能利用帧间的时间一致性,导致动态幽灵成像在很大程度上未得到解决;并且它们假设的加性高斯噪声模型并未反映真实单光子硬件的泊松统计特性。我们提出了DynGhost(动态幽灵成像变换器),这是一种变换器架构,通过交替的空间和时间注意力模块解决了这两个限制。我们的量子感知训练框架基于物理准确的探测器模拟(SNSPDs、SPADs、SiPMs)和安斯科姆方差稳定归一化,解决了导致经典模型在现实硬件约束下失败的分布偏移问题。在多个基准测试中的实验表明,DynGhost在动态和光子稀缺的环境中,优于传统重建方法和现有深度学习架构,尤其在动态场景中表现出显著的提升。
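The Anscombe variance-stabilizing normalization mentioned in the abstract above can be sketched in a few lines. The transform itself is standard: 2·sqrt(x + 3/8) maps Poisson counts to values with approximately unit variance. The Poisson sampler, the rate values, and the function names are illustrative assumptions. This is not the DynGhost training code.

```python
import math
import random
import statistics


def anscombe(count):
    """Map a Poisson count to an approximately unit-variance value."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def inverse_anscombe(value):
    """Simple algebraic inverse (known to be biased at very low counts)."""
    return (value / 2.0) ** 2 - 3.0 / 8.0


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def stabilized_variance(lam, n=5000, seed=0):
    """Empirical variance of Anscombe-transformed Poisson samples."""
    rng = random.Random(seed)
    values = [anscombe(sample_poisson(lam, rng)) for _ in range(n)]
    return statistics.pvariance(values)


var_low = stabilized_variance(10.0)    # raw Poisson variance would be about 10
var_high = stabilized_variance(100.0)  # raw Poisson variance would be about 100
```

After the transform both rates give a variance close to 1, which is why a network trained under a Gaussian-style loss can be applied to photon-counting measurements more safely.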
cs.CV / 246 / 2605.10187
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
SciVQR:一个用于高级科学推理评估的多学科多模态基准
Abstract
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.
Chinese Translation
科学推理是人类智能的一个关键方面,要求整合多模态输入、领域专业知识和跨学科的多步推理。现有的多模态大型语言模型(MLLMs)基准往往无法捕捉到严谨评估所需的推理过程的复杂性和可追溯性。为填补这一空白,我们推出了SciVQR,一个涵盖数学、物理、化学、地理、天文学和生物学54个子领域的多模态基准。SciVQR包括领域特定的视觉材料,如方程式、图表和图示,并挑战模型将视觉理解与推理相结合。任务范围从基本的事实回忆到复杂的多步推理,其中46%的任务包含专家撰写的解决方案。SciVQR不仅评估最终答案,还考察推理过程,为模型如何得出结论提供了洞见。我们对领先的MLLMs(包括专有和开源模型)的评估揭示了在处理复杂多模态推理任务方面的显著局限性,强调了在推动MLLMs朝向真正科学智能的过程中需要改进多步推理和更好地整合跨学科知识。数据集和评估代码可在https://github.com/CASIA-IVA-Lab/SciVQR公开获取。
cs.CV / 247 / 2605.10190
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
DetRefiner:一种与模型无关的特征融合检测精炼方法
Abstract
Open-vocabulary object detection (OVOD) aims to detect both seen and unseen categories, yet existing methods often struggle to generalize to novel objects due to limited integration of global and local contextual cues. We propose DetRefiner, a simple yet effective plug-and-play framework that learns to fuse global and local features to refine open-vocabulary detection. DetRefiner processes global image features and patch-level image features from foundational models (e.g., DINOv3) through a lightweight Transformer encoder. The encoder produces a class vector capturing image-level attributes and patch vectors representing local region attributes, from which attribute reliability is inferred to recalibrate the base model's confidence. Notably, DetRefiner is trained independently of the base OVOD model, requiring neither access to its internal features nor retraining. At inference, it operates solely on the base detector's predictions, producing auxiliary calibration scores that are merged with the base detector's scores to yield the final refined confidence. Despite this simplicity, DetRefiner consistently enhances multiple OVOD models across COCO, LVIS, ODinW13, and Pascal VOC, achieving gains of up to +10.1 AP on novel categories. These results highlight that learning to fuse global and local representations offers a powerful and general mechanism for advancing open-world object detection. Our code and models are available at https://github.com/hitachi-rd-cv/detrefiner.
Chinese Translation
开放词汇物体检测(OVOD)旨在检测已见和未见的类别,但现有方法往往由于全球和局部上下文线索的整合有限而难以推广到新物体。我们提出了DetRefiner,这是一种简单而有效的即插即用框架,旨在学习融合全球和局部特征以精炼开放词汇检测。DetRefiner通过轻量级的Transformer编码器处理来自基础模型(如DINOv3)的全球图像特征和补丁级图像特征。该编码器生成一个类向量,捕捉图像级属性,以及表示局部区域属性的补丁向量,从中推断属性可靠性以重新校准基础模型的置信度。值得注意的是,DetRefiner独立于基础OVOD模型进行训练,无需访问其内部特征或重新训练。在推理时,它仅基于基础检测器的预测操作,生成辅助校准分数,并与基础检测器的分数合并,以产生最终的精炼置信度。尽管设计简单,DetRefiner在COCO、LVIS、ODinW13和Pascal VOC等多个OVOD模型上始终如一地提升性能,在新类别上实现了高达+10.1 AP的增益。这些结果突显了学习融合全球和局部表征为推动开放世界物体检测提供了一种强大而通用的机制。我们的代码和模型可在https://github.com/hitachi-rd-cv/detrefiner获取。
cs.CV / 248 / 2605.10204
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
3DReflecNet:用于反射、透明和低纹理物体的3D重建的大规模数据集
Abstract
Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability of distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically-based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data consist of more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.
Chinese Translation
准确地对具有反射、透明或低纹理表面的物体进行3D重建仍然极具挑战性。这些材料往往违反多视图重建流程中的关键假设,例如光度一致性和可用的独特几何纹理线索。现有的数据集主要集中于漫反射、纹理丰富的物体,因此在真实世界材料复杂性下的表现提供的洞察有限。我们引入了3DReflecNet,这是一个超过22 TB的大规模混合数据集,专门设计用于基准测试和推动这些挑战性材料的3D视觉方法。3DReflecNet结合了两种类型的数据:通过基于物理的渲染生成的超过120,000个合成实例,涵盖超过12,000种形状,以及使用消费设备捕获的超过1,000个真实物体。这些数据总共包含超过700万多视图帧。数据集涵盖了多种材料、复杂的光照条件以及广泛的几何形状,包括从真实和基于LLM合成的2D图像使用基于扩散的管道生成的形状。为了支持稳健的评估,我们为五个核心任务设计了基准:图像匹配、运动恢复结构、新视图合成、反射去除和重光照。大量实验表明,最先进的方法在这些设置中难以保持准确性,突显了对更具韧性的3D视觉模型的需求。
cs.CV / 249 / 2605.10229
VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection
VPD-100K:迈向可推广的细粒度视觉隐私保护
Abstract
Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, imposing higher demands on efficient and robust privacy detection algorithms. However, current robust detection models are severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, failing to capture the intricate details of sensitive information in real-world environments. To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators, containing 100,000 images annotated with 33 fine-grained classes and over 190,000 object instances. Statistical analysis reveals that our dataset features long-tailed distributions, small object scales, and high visual complexity. These characteristics make the dataset particularly valuable for demanding, unconstrained applications such as live streaming, where actors frequently face unintentional, real-time information leakage. Furthermore, we design an effective frequency-enhanced lightweight module consisting of frequency-domain attention fusion and an adaptive spectral gating mechanism that breaks the limitations of spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments conducted on both diverse image and streaming video benchmarks consistently demonstrate the effectiveness of our VPD-100K dataset and the well-curated frequency mechanism. The code and dataset are available at https://vpd-100k.github.io/.
Chinese Translation
在普遍存在的视觉数据共享时代,隐私保护已成为一项关键需求,对高效且稳健的隐私检测算法提出了更高的要求。然而,当前的稳健检测模型受到全面数据集缺乏的严重制约。现有的隐私导向数据集往往存在规模有限、粗粒度标注和领域覆盖狭窄的问题,未能捕捉到现实环境中敏感信息的复杂细节。为了解决这一问题,我们提出了一个大规模、细粒度的视觉隐私数据集(VPD-100K),旨在促进通用隐私检测。我们建立了一个涵盖四个主要领域的整体分类法:人类存在、屏幕上个人可识别信息(PII)、物理标识符和位置指示符,包含100,000张图像,标注有33个细粒度类别和超过190,000个对象实例。统计分析表明,我们的数据集具有长尾分布、小物体尺度和高视觉复杂性。这些特征使得该数据集在要求高、无约束的应用(如直播)中尤为重要,因为参与者经常面临无意的实时信息泄露。此外,我们设计了一个有效的频率增强轻量级模块,包含频域注意力融合和自适应光谱门控机制,突破了空间像素强度的限制,更好地捕捉敏感信息的微妙细节。在多样化图像和流媒体视频基准上进行的广泛实验一致证明了我们的VPD-100K数据集及精心策划的频率机制的有效性。代码和数据集可在 https://vpd-100k.github.io/ 获取。
cs.CV / 250 / 2605.10239
AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting
AdaptSplat:为前馈式 3D 高斯溅射模型适配视觉基础模型
Abstract
This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction $\rightarrow$ multi-view interaction $\rightarrow$ feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.
Chinese Translation
本研究探讨了一种简单而强大的轻量级适配器设计,用于前馈式 3D 高斯溅射(3DGS)。现有方法通常在图像特征提取 $\rightarrow$ 多视角交互 $\rightarrow$ 特征解码的通用流程上应用复杂的、特定于架构的设计。然而,由于受限于 3D 训练数据的规模瓶颈和深度网络的低通滤波效应,这些方法在跨领域泛化和高频几何保真度方面仍然不足。为了解决这些问题,我们提出了 AdaptSplat,展示了在没有复杂组件工程的情况下,仅需将一个包含 1.5M 参数的单一适配器引入通用架构,就足以实现优越的性能。具体而言,我们设计了一种轻量级的频率保持适配器(Frequency-Preserving Adapter, FPA),该适配器从强大的视觉基础模型骨干的浅层特征中提取方向感知的高频结构先验,并通过高频位置编码和自适应残差调制将其无缝集成到通用流程中。这有效补偿了深层特征中过度平滑导致的高频衰减,提高了高斯原件在复杂表面和锐利边界上的拟合精度。大量实验表明,AdaptSplat 在多个标准基准上实现了最先进的前馈重建性能,并在不同领域之间具有稳定的泛化能力。代码可在:https://github.com/xmw666/AdaptSplat 获取。
cs.CV / 251 / 2605.10251
Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation
高效的混合 CNN-GNN 架构用于单目深度估计
Abstract
We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6\% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
Chinese Translation
我们提出了 GraphDepth,这是一种单目深度估计架构,协同整合了图神经网络(Graph Neural Networks, GNNs)与卷积编码器-解码器框架。我们的方法在 ResNet-101 U-Net 主干的多个尺度上嵌入高效的 GraphSAGE 层,从而能够显式建模超出局部卷积感受野的长距离空间关系。主要技术贡献包括:(1)具有可配置 k-NN 和基于网格的邻接的批量并行图构建,以实现可扩展的训练;(2)在瓶颈和解码阶段(1/32、1/16、1/8 分辨率)进行多尺度 GraphSAGE 集成,以在特征层次中传播全局上下文;(3)通道注意力门控跳跃连接,在融合之前自适应加权编码器特征;(4)通过专用的随机不确定性头进行异方差不确定性估计,使得在优化过程中能够进行基于置信度的损失加权。与基于变换器的混合模型相比,后者在序列长度上面临二次复杂性,GraphDepth 在空间分辨率上呈线性扩展,同时通过迭代消息传递实现可比的全局感受野。在 NYU Depth V2、WHU Aerial、ETH3D 和 Mid-Air 基准测试中的实验表明,其在室内场景中的准确性与最先进的变换器相差仅 4.6\%,且计算成本显著更低(25 FPS 对比 9 FPS,3.8 GB 对比 8.8 GB VRAM)。GraphDepth 在 WHU Aerial 上取得了最佳报告结果(均方根误差 8.24 m),并在 Mid-Air 合成航空数据集上展现出优越的零-shot 跨域迁移能力,验证了显式关系推理在深度估计中的泛化能力。
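The confidence-aware loss weighting mentioned in the abstract above is commonly implemented as a Gaussian negative log-likelihood with a predicted log-variance per pixel. The sketch below shows that common form. Whether GraphDepth uses exactly this expression is not stated in the abstract, so the formula and names here are assumptions.

```python
import math


def heteroscedastic_loss(pred, log_var, target):
    """Mean of 0.5 * exp(-s) * (pred - target)^2 + 0.5 * s over all pixels.

    `log_var` holds s = log(sigma^2) from a hypothetical uncertainty head.
    A large s down-weights the residual but pays a penalty of 0.5 * s.
    """
    total = 0.0
    for p, s, t in zip(pred, log_var, target):
        total += 0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
    return total / len(pred)


# One pixel with residual 2.0, evaluated at three uncertainty levels.
overconfident = heteroscedastic_loss([3.0], [0.0], [1.0])
calibrated = heteroscedastic_loss([3.0], [math.log(4.0)], [1.0])
underconfident = heteroscedastic_loss([3.0], [3.0], [1.0])
```

For a fixed residual r the loss is minimized at s = log(r²), so the uncertainty head is pushed to report a variance that matches the actual error rather than collapsing to either extreme.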
cs.CV / 252 / 2605.10269
Increasing the Efficiency of DETR for Maritime High-Resolution Images
提高DETR在海洋高分辨率图像中的效率
Abstract
Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.
Chinese Translation
海洋物体检测对于无人水面船舶(USVs)的安全导航至关重要,要求能够准确识别从小浮标到大型船舶的障碍物。由于长距离、小物体尺寸、大规模变化、边缘计算限制以及高分辨率图像的高内存需求,实时检测面临挑战。现有解决方案,如下采样或图像拆分,往往会降低准确性或需要额外处理,而内存高效模型通常只能处理有限的分辨率。为克服这些限制,我们利用基于状态空间模型(SSMs)的Vision Mamba(ViM)主干网络,以捕捉长距离依赖关系,同时与序列长度线性扩展。图像被标记为序列,以实现高效的高分辨率处理。为了进一步提高计算效率,我们设计了一个定制的特征金字塔网络,结合连续下采样和SSM层,以及令牌修剪,以减少对背景区域的不必要计算。与使用ResNet50主干的最先进方法RT-DETR相比,我们的方法在海洋物体检测中实现了性能与计算效率之间的更好平衡。
cs.CV / 253 / 2605.10275
PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction
PolarVSR:一个统一的框架和基准,用于连续时空极化视频重建
Abstract
Polarimetric imaging captures surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks and often cannot support high-frame-rate acquisition, limiting polarimetric imaging in dynamic video tasks. These limitations motivate joint spatial and temporal enhancement. To this end, we propose the first space-time polarization video reconstruction architecture. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. We also establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on this benchmark demonstrate the effectiveness of the method.
Chinese Translation
极化成像捕捉表面极化特性,如线性极化度(Degree of Linear Polarization, DoLP)和极化角(Angle of Polarization, AoP)。在主流的焦平面分割(Division-of-Focal-Plane, DoFP)彩色极化成像中,从捕获的马赛克阵列中恢复极化参数仍然是一个具有挑战性的逆问题。现有的DoFP相机也面临硬件瓶颈,通常无法支持高帧率采集,限制了动态视频任务中的极化成像。这些限制促使了空间和时间的联合增强。为此,我们提出了首个时空极化视频重建架构。该方法联合建模空间和时间中的极化方向,并使用极化感知的隐式神经表示进行连续的高保真上采样。通过分析极化参数的时间变化,我们进一步引入了一种流引导的极化变化损失,以监督极化动态。我们还建立了首个大规模彩色DoFP极化视频基准,以支持这一研究方向。在该基准上的大量实验证明了该方法的有效性。
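The DoLP and AoP quantities named in the abstract above are conventionally derived from the four polarizer-angle intensities of a DoFP sensor (0°, 45°, 90°, 135°) through the linear Stokes parameters. The sketch below shows that standard conversion for a single, already demosaicked pixel. The function names are illustrative, and the snippet does not represent the PolarVSR reconstruction network.

```python
import math


def stokes_from_dofp(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical
    s2 = i45 - i135                     # +45 vs. -45 degrees
    return s0, s1, s2


def dolp_aop(i0, i45, i90, i135):
    """Degree of linear polarization in [0, 1] and angle of polarization in radians."""
    s0, s1, s2 = stokes_from_dofp(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aop = 0.5 * math.atan2(s2, s1)
    return dolp, aop


# Ideal cases following Malus's law for unit-intensity light.
horizontal = dolp_aop(1.0, 0.5, 0.0, 0.5)   # fully polarized at 0 degrees
diagonal = dolp_aop(0.5, 1.0, 0.5, 0.0)     # fully polarized at 45 degrees
unpolarized = dolp_aop(0.5, 0.5, 0.5, 0.5)
```

Because DoLP and AoP are nonlinear functions of intensity differences, small interpolation errors in the four channels can be amplified. This is one reason recovering them from mosaic arrays is treated as a hard inverse problem.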
cs.CV / 254 / 2605.10307
PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction
PaMoSplat:基于部件感知的运动引导高斯溅射动态场景重建
Abstract
Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing.
Chinese Translation
动态场景重建是计算机视觉和机器人领域中的一个基本而又具有挑战性的任务。尽管基于3DGS的方法在动态场景建模方面取得了进展,但在具有显著复杂运动的场景中获得高保真渲染和准确跟踪仍然面临重大挑战。为了解决这些问题,我们提出了PaMoSplat,一种新颖的动态高斯溅射框架,结合了部件感知和运动先验。我们的方法基于两个关键观察:1)部件作为场景变形的原始元素,2)光流中的运动线索可以有效引导部件运动。具体而言,PaMoSplat通过图聚类将多视角分割掩码提升到3D空间,建立一致的高斯部件。对于后续时间戳,我们利用差分进化算法,通过多视角光流线索估计这些部件的刚性运动,为进一步优化提供稳健的热启动。此外,PaMoSplat引入了一种自适应迭代计数机制、内部可学习的刚性以及流监督渲染损失,以加速和优化训练过程。对包括真实世界环境在内的多样场景进行的全面评估表明,PaMoSplat在渲染质量、跟踪精度和收敛速度上均优于现有方法。此外,它还支持多种部件级下游应用,如4D场景编辑。
cs.CV / 255 / 2605.10319
LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
LimeCross:具有结构一致性的上下文条件分层图像编辑
Abstract
Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.
Chinese Translation
分层图像资产在现实世界的创意工作流程中被广泛使用,能够实现非破坏性的迭代和灵活的重新组合。最近在分层图像生成和分解方面的进展合成或恢复了分层表示,但对分层图像的可控编辑仍然具有挑战性。手动编辑需要在各层之间进行仔细协调,以保持一致的照明和接触,而基于人工智能的流程则将层压缩为一个平面图像进行编辑,然后再将其分解,这引入了背景到前景的泄漏和不稳定的透明度。为了解决这些限制,我们提出了LimeCross,一个无训练的上下文条件分层图像编辑框架,根据文本编辑用户选择的RGBA层,同时保持其余层不变。它利用来自其他层的上下文线索,通过双流注意机制来保持跨层一致性,同时明确维护层的完整性,以防止编辑层的污染。为了评估我们的方法,我们引入了LayerEditBench,这是一个包含1500个分层场景的基准数据集,配有成对的源/目标提示,以及评估协议,评估编辑的保真度和alpha通道的稳定性。大量实验表明,LimeCross在层的纯度和合成现实感方面优于强基线编辑方法,确立了上下文条件分层编辑作为可控生成创作的原则性框架。
cs.CV / 256 / 2605.10334
The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection
阿尔法混合假说:深度伪造检测中的合成捷径
Abstract
Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.
Chinese Translation
近期的深度伪造检测方法展示了改善的跨数据集泛化能力,但其潜在机制仍未得到充分探索。我们提出了阿尔法混合假说,认为最先进的基于帧的检测器主要作为阿尔法混合搜索器运作;它们并非学习语义异常或特定的生成神经指纹,而是定位在将操控面孔整合到目标帧时引入的低级合成伪影。我们通过实验验证了这一假说,表明深度伪造检测器对所谓的自混合图像(Self-Blended Images, SBI)和非生成操控表现出高度敏感性。我们提出了方法BlenD,该方法利用了一个大规模、多样化的仅包含真实面孔图像的数据集,并结合了SBI。这种方法在2019年至2025年间发布的15个合成深度伪造数据集上实现了最佳的平均跨数据集泛化能力,而在训练过程中没有使用显式生成的深度伪造。此外,我们还展示了来自显式混合搜索器和对混合捷径具有韧性的模型的预测是高度互补的,在集成配置中达到了94.0%的最新AUROC。代码、实验和训练模型将公开发布。
cs.CV / 257 / 2605.10343
EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
EvoStreaming:您的离线视频模型是一个原生流媒体助手
Abstract
Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
Chinese Translation
流媒体视频理解不仅仅是观看更长的视频:助手必须实时决定何时发言,平衡响应性与冗长性。然而,大多数视频语言模型(VideoLLMs)是为离线推理而训练的,现有的流媒体基准将这一时机决策外包给评估者。我们通过 RealStreamEval 解决了这一空白,这是一种帧级多轮评估协议,暴露模型于连续观察并惩罚不必要的响应。在这一协议下,我们观察到强大的离线 VideoLLMs 保持了有用的视觉理解,但缺乏决定何时响应的互动策略。基于这一观察,我们提出了 EvoStreaming,一种自我演化的流媒体适应框架,其中基础模型本身充当数据生成器、相关性注释器和滚动策略,以合成流媒体轨迹而无需外部监督。仅用 $1{,}000$ 个自生成样本(比领先的流媒体指令调优方法少 $139\times$)且没有架构变化,EvoStreaming 在五个开放的 VideoLLM 骨干网络(Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5)上将整体 RealStreamEval 分数提高了最多 $10.8$ 分,同时在很大程度上保持了离线视频性能。这些结果表明,数据高效的互动调优是将现有 VideoLLMs 适应于流媒体助手的切实路径。
cs.CV / 258 / 2605.10345
BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization
BGG:通过视觉基础模型适配弥合跨视角图像之间的几何差距以实现地理定位
Abstract
Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.
Chinese Translation
跨视角图像之间的几何差异,例如无人机视图和卫星视图,显著增加了跨视角地理定位(CVGL)的挑战,CVGL旨在通过图像检索获取图像的地理位置。为了进一步提升CVGL的性能,本文提出了一种基于视觉基础模型(VFM)(如DINOv3)的参数高效适配框架,称为BGG。BGG不仅有效利用了VFM的通用视觉表征,捕捉跨视角图像中的稳健且一致的特征,还利用了VFM的泛化能力,显著提高了CVGL性能。该框架主要包含一个多粒度特征增强适配器(MFEA)和一个频率感知结构聚合(FASA)模块。具体而言,MFEA通过多层扩张卷积增强特征的尺度适应性和视角鲁棒性,有效弥合跨视角几何差距,同时训练成本较低。此外,考虑到[CLS]标记缺乏精确图像检索和定位所需的空间细节,FASA模块在频域中调制补丁标记,并进行自适应聚合以增强局部结构特征。最后,BGG将增强的局部特征与[CLS]标记融合,以实现更准确的CVGL。在University-1652和SUES-200数据集上的大量实验表明,BGG相较于其他方法具有显著优势,并以低训练成本实现了最先进的定位性能。
cs.CV / 259 / 2605.10349
Portable Active Learning for Object Detection
便携式主动学习用于目标检测
Abstract
Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
Chinese Translation
标注边界框的成本高昂,限制了目标检测的可扩展性。这个挑战在于需要在现实应用中保持高准确性,同时最小化人工努力。以往的主动学习方法通常依赖于模型特征或修改检测器内部结构和训练计划,从而增加了集成的开销。此外,它们很少共同利用图像级信号、类别不平衡线索和实例级不确定性带来的好处进行全面选择。我们提出了便携式主动学习(Portable Active Learning, PAL),这是一个与检测器无关、易于移植的框架,仅基于推理输出进行操作。PAL将类别特定的实例不确定性与图像级多样性相结合,以指导数据选择。在每一轮中,PAL训练轻量级的类别特定逻辑分类器,以区分真阳性和假阳性,生成基于熵的不确定性评分用于候选提案。然后,使用全局图像熵、类别多样性和图像相似性来精炼候选图像,从而产生既信息丰富又多样化的批次。PAL无需对模型内部或训练管道进行任何更改,确保了与各种检测器的广泛兼容性。在COCO、PASCAL VOC和BDD100K上的大量实验表明,PAL在标签效率和检测准确性方面始终优于现有的主动学习基准,成为在现实环境中可扩展和成本效益高的目标检测部署的实用解决方案。
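The entropy-based uncertainty score mentioned in the PAL abstract can be illustrated with a minimal Python sketch. It assumes that a class-specific classifier has already produced, for each proposal, a probability that the proposal is a true positive. The function names and the choice to average proposal entropies per image are illustrative assumptions, not details taken from the paper.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def rank_images_by_uncertainty(tp_probs_per_image: dict) -> list:
    """Return image ids sorted from most to least uncertain.

    tp_probs_per_image maps an image id to the list of true-positive
    probabilities of its proposals. Averaging the per-proposal entropies
    is an assumption made for this sketch.
    """
    def score(probs):
        if not probs:
            return 0.0
        return sum(binary_entropy(p) for p in probs) / len(probs)

    return sorted(
        tp_probs_per_image,
        key=lambda image_id: score(tp_probs_per_image[image_id]),
        reverse=True,
    )


if __name__ == "__main__":
    demo = {"img_a": [0.99, 0.01], "img_b": [0.5, 0.6]}
    print(rank_images_by_uncertainty(demo))
```

In this toy input, the image whose proposals sit near probability 0.5 is ranked first, which matches the intuition that ambiguous detections are the most informative ones to label. The abstract's further refinement by image entropy, class diversity, and image similarity is not covered here.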
cs.CV / 260 / 2605.10360
DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions
DySurface:通过桥接显式高斯与隐式函数实现一致的4D表面重建
Abstract
While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS ($canonical \rightarrow dynamic$) and the backward deformation required for volumetric SDF rendering ($dynamic \rightarrow canonical$). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.
Chinese Translation
尽管动态场景的新视图合成(NVS)取得了显著进展,但重建时间一致的几何表面仍然是一个挑战。神经辐射场(NeRF)和3D高斯点云(3DGS)提供了强大的动态场景渲染能力;然而,单靠光度优化往往会导致几何模糊。这导致了表面的不连续性、严重的伪影以及随时间变化的破损表面。为了解决这些局限性,我们提出了DySurface,一个新颖的框架,桥接了显式高斯的有效性与隐式有符号距离函数(SDF)在动态场景中的几何保真性。我们的方法解决了3DGS的正向变形($canonical \rightarrow dynamic$)与体积SDF渲染所需的反向变形($dynamic \rightarrow canonical$)之间的结构差异。具体而言,我们提出了VoxGS-DSDF分支,利用变形的高斯构建动态稀疏体素网格,为隐式SDF场提供显式几何指导。这种显式锚定有效地规范化了体积渲染过程,显著提高了表面重建质量,具有密闭边界和详细的表现。定量和定性实验表明,DySurface在几何精度指标上显著优于最先进的基线,同时保持竞争力的渲染性能。
cs.CV / 261 / 2605.10362
CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers
CellDX AI 自动驾驶仪:代理引导的病理分类器训练与部署
Abstract
Training AI models for computational pathology currently requires access to expensive whole-slide-image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users -- from pathologists with no ML background to ML practitioners running many parallel experiments -- train, evaluate, and deploy whole-slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 H&E-stained whole-slide images with pre-extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology-specialized agent skills and a pathology-specialized training platform to general-purpose AI agents (e.g. any LLM-based agent runtime), delivering end-to-end automated model training without requiring the agent itself to be domain-specific. The platform addresses both the ML-expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost-effectively.
Chinese Translation
目前,训练计算病理学的人工智能模型需要访问昂贵的全切片图像数据集、GPU基础设施、深厚的机器学习专业知识以及大量的工程工作。我们提出了 CellDX AI 自动驾驶仪,这是一个平台,允许用户——从没有机器学习背景的病理学家到运行多个并行实验的机器学习从业者——通过与人工智能代理的自然语言互动,训练、评估和部署全切片图像分类器。该平台提供了一套结构化的代理技能,指导用户进行数据集策划、自动超参数调优、多策略模型比较以及人机协作部署,所有这些都基于一个包含超过32,000个病例和66,000个H&E染色全切片图像(具有预提取特征)的预构建数据集。我们描述了代理技能架构、支持四种分类策略的基础多实例学习(Multiple Instance Learning, MIL)训练框架,以及一种迭代的成对超参数搜索(网格或种子随机),与穷举搜索相比,调优成本降低超过30倍。根据我们的了解,CellDX AI 自动驾驶仪是第一个向通用人工智能代理(例如,任何基于大型语言模型(LLM)的代理运行时)暴露病理专业代理技能和病理专业训练平台的系统,实现了端到端的自动化模型训练,而无需代理本身具备领域特定性。该平台解决了限制诊断病理学采用的机器学习专业知识瓶颈和限制研究人员以成本效益运行实验数量的工程瓶颈。
cs.CV / 262 / 2605.10374
Halo Separation-guided Underwater Multi-scale Image Restoration
基于光晕分离引导的水下多尺度图像恢复
Abstract
Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the camera's foreground and seriously degrade image quality. Existing underwater image enhancement methods do not fully consider this problem, and they are not robust on scenes lit by artificial light. In practice, underwater image enhancement is already a very challenging task, and artificial light sources further degrade image quality and harm subsequent vision tasks. To deal with this problem, this paper designs a single halo image correction method based on an iterative structure. The network consists of two sub-networks: a halo layer separation sub-network, which separates the halo by gradient minimization, and a multi-scale recovery sub-network, which recovers the image information masked by the halo. The UIEB and EUVP synthetic datasets are used for training so that the network can learn the characteristics of underwater halo images. A large number of halo images taken in underwater environments under real artificial light are then collected for testing. In addition, the brightness distribution of underwater halo images is analyzed, and a radial gradient constraint is introduced to eliminate halos and improve underwater image restoration.
Chinese Translation
自主水下航行器(AUV)捕获的水下图像不可避免地受到人工光源的影响,这往往在相机前景产生光晕,并严重干扰图像质量。现有的水下图像增强方法未能充分考虑这一关键问题,且在人工光照场景下处理图像的鲁棒性较差。在实际应用中,由于水下图像增强本身是一项非常具有挑战性的任务,人工光源的影响将导致图像性能严重下降,并影响后续的视觉任务。为有效应对这一问题,本文设计了一种基于迭代结构的单光晕图像校正方法。该网络主要分为两个子网络,一个是光晕层分离子网络,旨在通过梯度最小化分离光晕,另一个是多尺度恢复子网络,旨在恢复被光晕遮挡的图像信息。使用UIEB和EUVP合成数据集进行训练,以确保网络能够充分学习水下光晕图像的特征和规律。然后收集大量在真实人工光源环境下拍摄的光晕图像进行测试。此外,分析水下光晕图像的亮度分布特征,引入径向梯度约束以消除光晕,从而改善水下图像恢复效果。
cs.CV / 263 / 2605.10376
SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation
SleepWalk:一种用于压力测试指令引导视觉-语言导航的三层基准
Abstract
Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as task difficulty increases. In general, current VLMs are only partly able to produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
Chinese Translation
视觉-语言模型(VLMs)在多模态感知和语言理解方面迅速发展,但尚不清楚它们是否能够可靠地将语言与3D数字环境中的空间一致、可执行的动作相结合。我们介绍了SleepWalk,这是一个用于评估基于指令的轨迹预测的基准,适用于从文本场景描述生成并经过可导航性过滤的单场景3D世界。与以往集中于房间内长距离探索的导航基准不同,SleepWalk针对的是局部的、以交互为中心的具身推理:给定渲染的视觉观察和自然语言指令,模型必须预测一条遵循场景几何、避免碰撞并在与动作兼容的位置结束的轨迹。该基准涵盖了多样的室内和室外环境,并将任务组织为三层空间和时间难度,能够在日益复杂的组合下进行细致的基础分析。通过标准化的逐点评估协议,我们在2472个精心策划的3D环境中评估了三种前沿VLM,每个场景有九条指令。结果揭示了在基础空间推理方面的系统性失败,尤其是在遮挡、交互约束和多步指令下:随着任务难度的增加,性能下降。总体而言,当前的VLM在一定程度上能够生成同时空间一致、可执行且与预期动作对齐的轨迹。通过在可控但可扩展的环境中揭示失败,SleepWalk为推动基础多模态推理、具身规划、视觉-语言导航和3D环境中的动作能力代理提供了一个重要的基准。
cs.CV / 264 / 2605.10388
Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
时间采样频率的重要性:一种基于容量的端到端驾驶轨迹预测研究
Abstract
End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
Chinese Translation
端到端(E2E)自动驾驶轨迹预测通常使用以最高可用时间频率采样的摄像头帧进行训练,假设更密集的采样能够提高性能。我们通过将时间采样频率视为一个明确的训练集设计变量来质疑这一假设。从高频E2E驾驶数据集中出发,我们通过沿每条轨迹对摄像头帧进行时间子采样来构建频率扫描训练集。对于每个模型和数据集对,我们在固定协议下训练和评估相同的模型,因此频率响应反映了预测性能如何随采样频率变化。我们从一个关注容量的角度分析这一响应。稀疏采样可能会错过与驾驶相关的线索,而密集采样可能会增加冗余的视觉内容和离散噪声。对于有限容量的模型,这可能会造成与驾驶无关的容量负担。我们在Waymo、nuScenes和PAVE上评估了三个较小的E2E模型和一个更大的VLA风格的AutoVLA模型。结果显示模型和数据集的频率响应依赖性。较小的E2E模型通常表现出非单调或接近平台的趋势,并在较低或中等频率下实现其最佳3秒平均定位误差(ADE)。相比之下,AutoVLA在所有三个数据集上以最高评估频率实现其最佳3秒ADE和最终定位误差(FDE)。与迭代匹配的控制组相比,较小模型在较低或中等频率下的优势并不仅仅是由于不平等的训练更新次数造成的。这些发现表明,时间采样频率应被报告和调整,而不是固定为最高可用值。
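The frequency-sweep construction described in the abstract above (temporally subsampling camera frames along each trajectory) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the integer-stride restriction, and the example frequencies are hypothetical and do not come from the paper.

```python
def subsample_frames(frames, source_hz: int, target_hz: int):
    """Keep every k-th frame so a source_hz sequence approximates target_hz.

    Assumes the source frequency is an integer multiple of the target
    frequency, so the stride k = source_hz // target_hz is exact.
    """
    if source_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    if source_hz % target_hz != 0:
        raise ValueError("source_hz must be an integer multiple of target_hz")
    stride = source_hz // target_hz
    return frames[::stride]


def frequency_sweep(frames, source_hz: int, target_frequencies):
    """Build one subsampled copy of a trajectory per target frequency."""
    return {hz: subsample_frames(frames, source_hz, hz) for hz in target_frequencies}


if __name__ == "__main__":
    trajectory = list(range(10))  # stand-in for 10 camera frames at 10 Hz
    print(frequency_sweep(trajectory, source_hz=10, target_frequencies=[10, 5, 2]))
```

Note that lower target frequencies yield fewer training frames per trajectory, which is why the abstract reports iteration-matched controls to separate the effect of sampling frequency from the number of training updates.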
cs.CV / 265 / 2605.10394
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection
Sens-VisualNews:一个用于轰动图像检测的基准数据集
Abstract
The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the (in-)existence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.
Chinese Translation
在媒体项目中检测轰动内容可以成为识别值得核查内容和标记潜在虚假信息的重要过滤机制,因为此类内容会引发生理唤醒,通常绕过批判性评估并加速病毒式传播。本文介绍了轰动图像检测的任务,旨在确定一幅图像是否包含令人震惊、挑衅或情感充沛的特征,以吸引注意力并引发强烈的情感反应。为了支持该任务的研究,我们创建了一个新的基准数据集(称为 Sens-VisualNews),该数据集包含 9,576 张来自新闻项目的图像,基于其视觉内容中各种轰动概念和事件的(不)存在进行注释。最后,利用 Sens-VisualNews,我们研究了多种开放的最先进的多模态大语言模型(Multimodal LLMs)在零样本和微调设置下的提示敏感性、性能和鲁棒性。
cs.CV / 266 / 2605.10397
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw:一种通过工具基础反驳实现的通用视觉异常检测代理
Abstract
Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
Chinese Translation
视觉异常检测(VAD)在许多现实领域中至关重要,如工业检查、医学成像、基础设施监测和遥感。然而,不同领域之间特定的异常定义、数据模态和注释标准使得单领域训练的 VAD 模型难以迁移。视觉-语言模型(VLMs)在大规模跨域数据上进行预训练,能够在任务指令下执行视觉感知,为跨域 VAD 提供了一个有前景的解决方案。然而,单次推理的 VLM 判断不可靠,因为它们更依赖于先前知识而非正常样本参考或细粒度特征证据。因此,我们提出了 AnomalyClaw,这是一种无训练的 VAD 代理,将异常判断转化为多轮反驳过程。在每一轮中,代理提出候选异常,并针对正常样本参考进行反驳,利用一个包含 13 种工具的库进行视觉验证、参考解析和冻结专家探测。在 CrossDomainVAD-12 基准(12 个数据集)上,AnomalyClaw 在单步直接推理上实现了一致的宏 AUROC 改进,在 GPT-5.5 上提高了 +6.23 个百分点,在 Seed2.0-lite 上提高了 +7.93 个百分点,在 Qwen3.5-VL-27B 上提高了 +3.52 个百分点。我们进一步介绍了一个可选的口头自我演化扩展。它在没有 oracle 标签的情况下,从内部分支的不一致中构建一个在线规则库。在 Qwen3.5-VL-27B 上,它提供了 +2.09 个百分点的平均增益,接近 K = 10 的 oracle 标签监督基线(+1.99 个百分点)。这些结果表明,代理反驳提高了 VLMs 的异常理解和推理,而不仅仅是聚合工具输出。
cs.CV / 267 / 2605.10404
Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable
立场:生活记录视频流使隐私与效用的权衡不可避免
Abstract
With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.
Chinese Translation
随着智能眼镜、身体摄像头和家庭安防系统等始终在线硬件的日益普及,生活记录视觉感知变得不可避免,成为持久、始终在线人工智能系统的支柱。同时,近期在主动代理和世界模型方面的进展标志着从情节驱动的工具向下一代人工智能系统的根本转变,这些系统能够持续感知和响应物理世界。尽管生活记录视频流可以显著提高这些有前景系统的效用,但它们也通过揭示敏感信息(如行为模式、情感状态和社交互动)而带来了显著的隐私风险,这些信息超出了孤立图像所暴露的范围。如果不加以解决,这些风险可能会削弱公众信任,阻碍始终在线人工智能技术的可持续发展。现有的隐私保护措施要么是针对特定攻击,要么会造成显著的效用损失,并未考虑整个数据利用流程。因此,我们认为生活记录视频流中的隐私与效用权衡是下一代人工智能系统面临的基础性挑战,亟需进一步研究。我们呼吁开发新颖的管道感知隐私保护设计,以共同优化长期生活记录视觉数据的效用和隐私。同时,正式的隐私泄露度量和标准化基准仍然是未来研究的重要开放方向。
cs.CV / 268 / 2605.10409
Progressive Photorealistic Simplification
渐进式照片真实感简化
Abstract
Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
Chinese Translation
现有的图像简化技术通常依赖于非真实感渲染(Non-Photorealistic Rendering, NPR),将照片转化为风格化的素描、卡通或绘画。虽然这种方法在降低视觉复杂性方面有效,但通常会牺牲照片的真实感。在本研究中,我们探索了一种互补的方向:在简化图像的同时保持其照片真实感。我们提出了渐进式语义图像简化,这是一种通过以受控的方式移除和重绘元素来迭代减少场景复杂性的框架。在每一步中,生成的图像仍然是一幅可信的自然照片。我们的方法结合了语义理解与生成编辑,利用视觉-语言模型(Vision-Language Models, VLMs)识别和优先考虑待移除的元素,并通过学习的验证器确保整个过程中的真实感和一致性。这通过一个迭代的选择-移除-验证(Select-Remove-Verify)流程实现,产生高质量的简化轨迹。为了提高效率,我们进一步将这一过程提炼为一个图像到视频生成模型,该模型直接从单一输入图像预测一致的简化序列。除了生成更干净、更集中的构图外,我们的方法还支持内容感知的去杂乱、语义层分解和交互式编辑等应用。更广泛地说,我们的工作表明,通过结构化内容移除进行简化可以作为引导照片真实感领域视觉解释的实用机制,补充传统的抽象方法。
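The iterative Select-Remove-Verify pipeline named in the abstract above can be sketched as a simple control loop. The Python below is a hypothetical outline only: `select`, `remove`, and `verify` are placeholder callables standing in for the VLM selector, the generative inpainting step, and the learned verifier, and stopping at the first rejected candidate is an assumption rather than the paper's stated behaviour.

```python
from typing import Callable, List, Optional, TypeVar

Image = TypeVar("Image")
Element = TypeVar("Element")


def simplify_progressively(
    image: Image,
    select: Callable[[Image], Optional[Element]],
    remove: Callable[[Image, Element], Image],
    verify: Callable[[Image], bool],
    max_steps: int = 10,
) -> List[Image]:
    """Return the trajectory of accepted images, starting with the input.

    Each step selects one element, removes it, and keeps the result only
    if the verifier accepts it. The loop ends when nothing is left to
    select, a candidate is rejected, or max_steps is reached.
    """
    trajectory = [image]
    for _ in range(max_steps):
        target = select(trajectory[-1])
        if target is None:
            break  # nothing left to remove
        candidate = remove(trajectory[-1], target)
        if not verify(candidate):
            break  # assumption: stop at the first rejected edit
        trajectory.append(candidate)
    return trajectory


if __name__ == "__main__":
    # Toy stand-in: an "image" is a tuple of element names.
    scene = ("sky", "tree", "car")
    steps = simplify_progressively(
        scene,
        select=lambda img: img[-1] if len(img) > 1 else None,
        remove=lambda img, el: tuple(e for e in img if e != el),
        verify=lambda img: True,
    )
    for step in steps:
        print(step)
```

Keeping every accepted intermediate image, rather than only the final one, mirrors the abstract's emphasis on simplification trajectories in which each step remains a plausible photograph.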
cs.CV / 269 / 2605.10426
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA:在多专家世界模型中进行自主驾驶的思考
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
Chinese Translation
视觉-语言-动作(VLA)模型已成为端到端自主驾驶的有前景的范式。然而,现有的推理机制仍然难以提供面向规划的中间表示:文本链式思维(Chain-of-Thought, CoT)无法保持连续的时空结构,而潜在世界推理仍然难以作为行动生成的直接条件。在本文中,我们提出了CoWorld-VLA,一个用于自主驾驶的多专家世界推理框架,其中世界表示作为明确条件来指导行动规划。CoWorld-VLA通过多源监督提取互补的世界信息,并将其编码为VLA中的专家令牌,从而提供规划者可访问的条件信号。具体而言,我们构建了四种类型的令牌:语义交互、几何结构、动态演变和自我轨迹令牌,分别建模交互意图、空间结构、未来时间动态和行为目标。在行动生成过程中,CoWorld-VLA采用基于扩散的分层多专家融合规划器,该规划器在联合去噪过程中与场景上下文相结合,以生成连续的自我轨迹。实验表明,CoWorld-VLA在NAVSIM v1基准测试中在未来场景生成和规划方面取得了竞争性的结果,展示了在避免碰撞和轨迹准确性方面的强大性能。消融研究进一步验证了专家令牌的互补性及其作为行动生成规划条件的有效性。代码将发布在 https://github.com/potatochip1211/CoWorld-VLA。
cs.CV / 270 / 2605.10434
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
WorldReasonBench:人类对齐的视频生成器压力测试作为未来世界状态预测工具
Abstract
Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
Chinese Translation
商业视频生成系统如Seedance2.0和Veo3.1迅速发展,强化了视频生成器可能演变为“世界模拟器”的观点。然而,当前社区仍缺乏一个基准,能够直接测试模型是否能够推理观察到的世界如何随时间演变。我们提出了WorldReasonBench,将视频生成评估重新框架为世界状态预测:给定初始状态和一个动作,模型能否生成一个未来视频,其状态演变在物理、社会、逻辑和信息上保持一致?WorldReasonBench包含436个经过精心策划的测试案例,配有结构化的真实答案(QA)注释,涵盖四个推理维度和22个子类别。我们采用人类对齐的双部分方法评估生成的视频:过程感知推理验证使用结构化QA和推理阶段诊断来检测时间和因果失败,而多维质量评估则对推理质量、时间一致性和视觉美学进行评分,以便于排名和奖励建模。我们进一步引入WorldRewardBench,这是一个偏好基准,包含约6000个专家注释的对比,涵盖1400多个视频,支持成对和逐点奖励模型评估。在现代视频生成器中,我们的结果揭示了视觉可信度与世界推理之间的持续差距:视频可能看起来令人信服,但在动态、因果关系或信息保留方面却存在失败。我们将发布我们的基准和评估工具包,以支持社区在真正具备世界意识的视频生成研究,网址为https://github.com/UniX-AI-Lab/WorldReasonBench/。
cs.CV / 271 / 2605.10439
Filtering Memorization from Parameter-Space in Diffusion Models
从扩散模型的参数空间中过滤记忆
Abstract
Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose Base-Anchored Filtering (BAF), a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.
Chinese Translation
低秩适应(Low-Rank Adaptation, LoRA)已成为自定义扩散模型的广泛使用机制,使用户能够通过轻量级参数更新注入新的视觉概念或风格。然而,LoRA可能会记忆训练图像,导致生成的输出重现受版权保护或敏感的内容。这一风险在LoRA共享生态系统中尤为令人担忧,因为用户在不发布基础训练数据的情况下分发训练好的LoRA。现有的减轻记忆的方法依赖于对训练流程、训练数据的访问或对推理过程的控制,因此在仅有发布的LoRA权重可用时,难以应用。我们提出了基于基础的过滤(Base-Anchored Filtering, BAF),这是一个无训练和无数据的框架,用于后期减轻扩散LoRA中的记忆。BAF将LoRA更新分解为谱通道,并测量它们与预训练主干的主子空间的对齐程度。与该子空间强对齐的通道被保留为可推广的适应,而弱对齐的通道则被抑制,作为潜在的记忆内容载体。在多个数据集和扩散主干上的实验表明,BAF在保持或甚至提高生成质量的同时,始终减少记忆。我们的代码可在补充材料中获取。
cs.CV / 272 / 2605.10445
Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
Uni-Synergy:通过协同强化学习实现个性化推理的理解与生成的桥梁
Abstract
Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.
Chinese Translation
统一多模态模型(UMMs)在一般任务中表现出色,但在个性化理解与生成之间的桥梁构建上却面临挑战。以往的研究主要依赖于通过监督微调实现的隐式令牌级对齐,这未能充分捕捉理解与创造之间的潜在协同效应。在本研究中,我们提出了Sync-R1,这是一种端到端的强化学习框架,能够在单一的显式推理循环中共同优化个性化理解与生成。通过这一统一的反馈过程,Sync-R1使个性化理解能够指导内容创作,而生成质量的提升又反过来在一个综合的奖励环境中精炼理解。为了高效地协调这一双任务协同,我们引入了Sync-GRPO,这是一种利用集成奖励系统的强化学习方法。此外,我们提出了动态组缩放(Dynamic Group Scaling, DGS),该方法自适应地过滤低潜力轨迹,以减少梯度方差并加速收敛。为了更好地反映现实世界的复杂性,我们推出了UnifyBench++,其特征为更密集的文本描述和更丰富的用户上下文。实验结果表明,Sync-R1达到了最先进的性能,展示了卓越的跨任务推理能力和强大的个性化效果,而无需复杂的冷启动程序。代码和UnifyBench++数据集将发布于:https://github.com/arctanxarc/UniCTokens。
cs.CV / 273 / 2605.10449
Automated high-frequency quantification of fish communities and biomass using computer vision
基于计算机视觉的鱼类群落和生物量的自动化高频量化
Abstract
Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.
Chinese Translation
量化鱼类群落结构对于理解生物多样性和生态系统在变化环境中的响应至关重要,但现有的调查方法提供的高频定量观察有限。传统方法,包括基于捕捞的技术、水下视觉普查和环境DNA元条形码,往往需要大量人力或缺乏可靠的丰度和生物量估计。在此,我们开发了一种基于计算机视觉的自动化框架,用于从水下视频中量化鱼类群落。该框架利用定制的立体摄像系统获取的视频,结合深度学习的鱼类识别、多目标跟踪和三维重建,来估计物种级别的丰度和生物量。我们将该方法应用于一个珊瑚礁鱼类群落,进行了为期20天的每小时白天观察,揭示了与物种组成变化相关的物种丰富度、丰度和生物量的动态波动。通过比较从视觉普查和环境DNA调查中估计的鱼类群落,我们证明了该方法在对持续观察的物种进行连续、非侵入性和定量监测方面具有互补优势。这种方法为长期监测提供了可扩展的基础,并提升了分辨鱼类群落细微时间动态的能力。
cs.CV / 274 / 2605.10464
Automated Detection of Abnormalities in Zebrafish Development
斑马鱼发育异常的自动检测
Abstract
Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets. To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to a toxic compound (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images). Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results demonstrate the model's effectiveness: it achieves 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis.
Chinese Translation
斑马鱼胚胎因其光学透明性和与人类的遗传相似性而成为药物发现的重要模型。然而,目前的评估依赖于人工检查,这既昂贵又劳动密集。虽然机器学习提供了自动化的潜力,但由于缺乏全面的数据集,进展受到限制。为了解决这一问题,我们引入了一个大规模的数据集,该数据集包含高分辨率显微图像序列,捕捉了斑马鱼胚胎在控制条件下及暴露于化合物(3,4-二氯苯胺)下的发育过程。该数据集在细粒度时间层面上具有专家注释,支持两个基准任务:(1)生育分类,评估斑马鱼卵子的活力(130,368张图像),以及(2)毒性评估,检测毒性暴露随时间引起的畸形(55,296张图像)。除了数据集,我们还提出了第一个基于变换器(transformer)的基线模型,该模型整合了时空特征,以预测早期阶段的发育异常。实验结果展示了该模型的有效性,在生育分类中达到了98%的准确率,在毒性评估中达到了92%的准确率。这些发现强调了自动化方法在增强基于斑马鱼的毒性分析中的潜力。
cs.CV / 275 / 2605.10470
Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution
自适应上下文的重要性:迈向可证明的多模态超分辨率指导
Abstract
Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight inspires us to propose the novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR) that employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a novel spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that M$^3$ESR significantly improves generalization and semantic consistency.
Chinese Translation
超分辨率(SR)是一个严重病态的问题,具有固有的模糊性,这在经验和理论研究中得到了广泛认可。尽管最近的语义指导和多模态超分辨率方法利用大型模型或外部先验来增强语义对齐,但异构模态的融合在实践和理论上仍然理解不足。在本研究中,我们首次对多模态超分辨率进行了理论建模,揭示了先前方法受到次优模态利用的瓶颈。我们的分析表明,通过加强模态权重与其有效贡献之间的对齐,同时减少表示复杂性,可以改善泛化风险界限。这一理论见解激励我们提出了新颖的多模态专家混合超分辨率框架(M$^3$ESR),该框架采用面向泛化的动态模态融合,以实现准确的风险控制和模态贡献优化。具体而言,我们提出了一种新颖的空间动态模态加权模块和一种时间自适应模态温度调度机制,使得能够灵活和自适应地进行空间-时间模态加权,以有效控制风险。大量实验表明,我们的M$^3$ESR显著提升了泛化和语义一致性性能,证实了我们的优越性。
cs.CV / 276 / 2605.10484
OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
OpenSGA:开放世界中高效的3D场景图对齐
Abstract
Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
Chinese Translation
场景图对齐通过在部分重叠的观察中建立两个3D场景图之间的对象对应关系,从而实现。这使得在机器人重新访问某个地点时能够高效地理解场景和进行对象级的重新定位,同时也支持多个代理之间的全局地图融合。这些能力对于需要长期记忆以完成与环境交互的长时间任务的机器人至关重要。现有的方法主要集中在子扫描到子扫描(S2S)对齐,并且过于依赖几何点云特征,而对帧到扫描(F2S)对齐和开放集视觉-语言特征的研究相对较少。此外,现有的场景图对齐数据集规模较小,物体多样性有限,限制了系统化的训练和评估。我们提出了一个统一且高效的场景图对齐框架,通过融合视觉-语言、文本和几何特征与空间上下文来预测对象对应关系。该框架包括距离门控空间注意编码器、基于最小成本流的分配器和全局场景嵌入生成器等模块,以在大坐标差异下实现准确对齐。我们进一步引入了ScanNet-SG,这是一个通过自动化注释管道生成的大规模数据集,包含超过70万个样本,涵盖来自ScanNet标签的509个对象类别和基于GPT-4o标记的3000多个类别。实验表明,我们的方法在F2S和S2S任务上均实现了最佳整体性能,显著优于现有的场景图对齐方法。我们的代码和数据集已发布在:https://autonomousrobots.nl/paper_websites/opensga。
cs.CV / 277 / 2605.10496
M$^2$E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection
M$^2$E-UAV:基于事件的微型无人机检测的机载运动-运动事件基准与分析
Abstract
Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. To explore this practical problem, we present M$^2$E-UAV, a benchmark and analysis setup for onboard motion-on-motion event-based tiny UAV detection. The processed M$^2$E-UAV benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We provide M$^2$E-Point, a point-based event baseline, and M$^2$E-Point + IMU, an IMU-conditioned variant, to analyze the role of inertial cues under onboard motion-on-motion detection. M$^2$E-Point encodes events as $[x,y,t,p]$ point sets, extracts local event structure with EdgeConv, and predicts event-level UAV foreground scores, from which bounding boxes are derived via DBSCAN. Our validation-stage analysis shows that point-based event modeling is a strong baseline, while simple IMU conditioning provides only marginal aggregate gains. Under the train/validation split, M$^2$E-Point achieves 0.9673 F1 and 0.5501 mAP50-95, while the IMU-conditioned variant reaches 0.5561 mAP50-95; these results serve as an initial baseline for future exploration in this domain. Code will be available at https://github.com/Wickyan/M2E-UAV.
Chinese Translation
从机载事件相机中检测微型无人机在观察者与目标同时移动时非常困难。在这种运动-运动的情况下,自我运动激活了建筑物、植被和地平线结构的背景边缘,而无人机可能表现为稀疏的事件簇。为了解决这一实际问题,我们提出了M$^2$E-UAV,这是一个用于机载运动-运动基于事件的微型无人机检测的基准和分析设置。处理后的M$^2$E-UAV基准包含87,223个训练样本和21,395个验证样本,涵盖四个场景类别:阳光下的建筑-森林、阳光下的农场-村庄、日落时的建筑-森林和日落时的农场-村庄。我们提供了M$^2$E-Point,一个基于点的事件基线,以及M$^2$E-Point + IMU,一个基于IMU的变体,以分析惯性线索在机载运动-运动检测中的作用。M$^2$E-Point将事件编码为$[x,y,t,p]$点集,利用EdgeConv提取局部事件结构,并预测事件级无人机前景得分,从中通过DBSCAN推导出边界框。我们的验证阶段分析表明,基于点的事件建模是一个强有力的基线,而简单的IMU条件仅提供了边际的整体增益。在训练/验证拆分下,M$^2$E-Point达到了0.9673的F1分数和0.5501的mAP50-95,而基于IMU的变体则达到了0.5561的mAP50-95,仅有边际的整体变化,作为该领域未来探索的初步基线。代码将发布在https://github.com/Wickyan/M2E-UAV。
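The last stage of the M$^2$E-Point description above (event-level foreground scores, then DBSCAN, then bounding boxes) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the score threshold, and the `eps` and `min_pts` values are hypothetical, clustering is done on $(x, y)$ only, and a small pure-Python DBSCAN stands in for whatever implementation the benchmark uses.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, float, float, int]  # (x, y, t, p), as in the abstract


def region_query(points: List[Tuple[float, float]], i: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of point i (including i)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps]


def dbscan(points: List[Tuple[float, float]], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN. Returns one label per point: -1 is noise, 0..k-1 are clusters."""
    unvisited, noise = -2, -1
    labels = [unvisited] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] != unvisited:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_pts:  # j is a core point, keep expanding
                queue.extend(more)
        cluster += 1
    return labels


def boxes_from_scores(events: List[Event], scores: List[float],
                      threshold: float = 0.5, eps: float = 3.0,
                      min_pts: int = 3) -> List[Tuple[float, float, float, float]]:
    """Keep events scored as foreground, cluster them, return (x0, y0, x1, y1) per cluster."""
    fg = [(x, y) for (x, y, _t, _p), s in zip(events, scores) if s >= threshold]
    labels = dbscan(fg, eps, min_pts)
    boxes: Dict[int, Tuple[float, float, float, float]] = {}
    for (x, y), lab in zip(fg, labels):
        if lab < 0:
            continue  # isolated foreground events are treated as noise
        x0, y0, x1, y1 = boxes.get(lab, (x, y, x, y))
        boxes[lab] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return [boxes[k] for k in sorted(boxes)]
```

Calling `boxes_from_scores(events, scores)` with per-event scores from any classifier yields one axis-aligned box per dense foreground cluster; sparse high-score events that fail the `min_pts` test are discarded rather than boxed.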
cs.CV / 278 / 2605.10498
Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data
针对高度不平衡多模态数据的同时长尾识别与多模态融合
Abstract
Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.
Chinese Translation
类不平衡数据中的长尾分布对深度学习模型提出了根本性的挑战,这些模型往往倾向于偏向于多数类。尽管近期的长尾识别方法在一定程度上缓解了这一问题,但它们主要局限于单模态输入,无法充分利用来自不同数据源的互补信息。在本研究中,我们提出了一种新的长尾识别框架,明确处理多模态输入。我们的方法通过将异构数据融合为统一表示,扩展了多专家架构到多模态设置,同时利用特定模态的网络来估计每种模态的信息量。这些基于置信度的权重动态调节融合过程,确保信息量更大的模态对最终决策的贡献更强。为了进一步提升性能,我们设计了专门的训练和测试程序,以适应包括图像和表格数据在内的多样化模态组合。在基准和真实世界数据集上的大量实验表明,所提出的方法不仅有效整合了多模态信息,还在处理长尾和类不平衡场景方面超越了现有方法,突显了其鲁棒性和泛化能力。
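The confidence-guided fusion described above can be sketched in a few lines. The abstract does not say how per-modality informativeness estimates become fusion weights, so the softmax normalization, the function names, and the choice to fuse logits rather than features are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a short list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modality_logits: List[List[float]],
         confidences: List[float]) -> Tuple[List[float], List[float]]:
    """Weight each modality's class logits by its normalized confidence and sum them.

    modality_logits[m][c] is the logit of class c from modality m;
    confidences[m] is a hypothetical informativeness score for modality m.
    """
    weights = softmax(confidences)
    n_classes = len(modality_logits[0])
    fused = [sum(w * logits[c] for w, logits in zip(weights, modality_logits))
             for c in range(n_classes)]
    return weights, fused
```

With an image branch that favors class 0 and a tabular branch that favors class 1, raising the image branch's confidence pulls the fused prediction toward class 0, which is the behavior the abstract attributes to the dynamic weights.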
cs.CV / 279 / 2605.10521
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
Abstract
Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem intra-group hidden failure. To solve this, we propose DuetFair, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce FairDRO, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points (+6.0%) under the tumor-stage grouping and by 4.1 points (+7.4%) under the institution grouping over the strongest baseline.
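To make the intra-group hidden failure concrete, the sketch below contrasts a subgroup mean with a tail-focused aggregate. It assumes a CVaR-style surrogate (the mean of the worst `alpha` fraction of losses within each subgroup), which is one common way to instantiate DRO; the abstract does not specify FairDRO's actual uncertainty set or aggregation rule, and the function names here are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def cvar(losses: List[float], alpha: float) -> float:
    """Mean of the worst ceil(alpha * n) losses (at least one sample)."""
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def subgroup_dro_loss(losses: List[float], groups: List[Hashable],
                      alpha: float = 0.5) -> Tuple[float, Dict[Hashable, float]]:
    """Aggregate per-sample losses: CVaR inside each subgroup, then a plain mean across subgroups."""
    by_group: Dict[Hashable, List[float]] = defaultdict(list)
    for loss, g in zip(losses, groups):
        by_group[g].append(loss)
    per_group = {g: cvar(v, alpha) for g, v in by_group.items()}
    return sum(per_group.values()) / len(per_group), per_group
```

For example, a subgroup with losses 0.1, 0.1, 0.1, 0.9 and another with losses 0.3, 0.3 have the same mean (0.3), so a mean-based fairness objective treats them as equal. With `alpha=0.25` the tail aggregate for the first subgroup is 0.9, which exposes the single hard case that the mean hides.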
cs.CV / 280 / 2605.10523
Improving Human Image Animation via Semantic Representation Alignment
通过语义表示对齐改善人类图像动画
Abstract
The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
Chinese Translation
图像到视频生成领域取得了显著进展。然而,诸如人类肢体扭曲和面部失真等挑战依然存在,尤其是在生成长视频或建模高强度运动时。现有的人类图像动画工作通过引入人类特定的语义表示(例如,密集姿态或身份嵌入)作为附加条件来解决这些问题。然而,依赖这些表示可能会降低生成的灵活性。此外,它们对RGB像素监督的依赖也缺乏对必要的3D几何关系和时间一致性的学习。相比之下,我们提出了一种名为SemanticREPA的新方法,通过表示对齐将这些语义表示作为监督信号。具体而言,我们首先训练一个结构对齐模块,将从视频潜变量中获得的结构表示与视频深度估计特征对齐。然后,我们固定预训练的模块,并利用它为扩散模型的结构表示提供额外的监督,实现结构校正,以生成一致且稳定的人体结构。同时,我们开发了一个身份对齐模块,将生成视频的身份表示与人脸识别特征对齐。我们进一步提出使用预测的结构表示来细化相关区域的身份恢复。通过结构和身份对齐,我们的方法在扩展角色运动和增强角色一致性方面表现出优越的质量。
cs.CV / 281 / 2605.10525
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth:用于3D一致性视频深度的几何嵌入特征
Abstract
Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth
Chinese Translation
视频深度估计将单目预测扩展到时间域,以确保一致性。然而,现有方法在细节区域常常遭受空间模糊和时间不一致的问题。我们认为,当前主要依赖于通过变换器进行时间平滑的方法在保持严格的3D几何一致性方面存在困难,尤其是在旋转或剧烈视角变化的情况下。为了解决这一问题,我们提出了GemDepth,一个基于显式意识到相机运动和全球3D结构是实现3D一致性的前提的框架。GemDepth独特地引入了几何嵌入模块(Geometry-Embedding Module, GEM),该模块预测帧间相机姿态以生成隐式几何嵌入。这种运动先验的注入使网络具备内在的3D感知和对齐能力。在这些几何线索的指导下,我们的交替时空变换器(Alternating Spatio-Temporal Transformer, ASTT)捕捉潜在的点级对应关系,同时增强细节的空间精度并强制执行严格的时间一致性。此外,GemDepth采用了一种数据高效的训练策略,有效地弥合了高效率与稳健几何一致性之间的差距。如图2所示,全面评估表明,GemDepth在多个数据集上实现了最先进的性能,特别是在复杂动态场景中。代码可在以下网址公开获取:https://github.com/Yuecheng919/GemDepth
cs.CV / 282 / 2605.10543
TIE: Time Interval Encoding for Video Generation over Events
TIE:基于时间间隔编码的事件视频生成
Abstract
Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events -- a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.
Chinese Translation
导演风格的提示、机器人动作预测和互动视频代理要求对并发事件进行时间上的基础性支持——在这种情况下,68%的普通视频片段和超过99%的机器人/游戏片段包含重叠事件,而现有的多事件生成器则基于单一活动提示的假设。然而,现代视频生成器,如扩散变换器(Diffusion Transformers, DiT),通过点对点的位置编码将时间表示为离散点。这种表述造成了一个根本的维度不匹配:时间延续的间隔和重叠事件在数学上无法被注意力机制表示。在本文中,我们提出了时间间隔编码(Time Interval Encoding, TIE),这是一种原则性、即插即用的间隔感知旋转嵌入的推广,将时间间隔提升为DiT交叉注意力中的一等原语。我们并没有引入另一种启发式的间隔嵌入,而是展示了在与RoPE兼容的双线性注意力下,TIE由两个基本原则构成:时间可积性(Temporal Integrability),要求事件在其整个持续时间内聚合位置证据,以及持续时间不变性(Duration Invariance),消除了对较长间隔的平凡偏见。在均匀核下,这种表征产生了一种有效的基于sinc的封闭形式解决方案,保留了标准的注意力接口,并通过间隔积分自然减弱了边界噪声。在我们的实验中,TIE在保持基础DiT模型的视觉质量的同时,显著提高了时间可控性。在OmniEvents数据集上的实验中,它将经过人工验证的时间约束满足率从77.34%提高到96.03%,并将时间边界误差从0.261秒降低到0.073秒,同时改善了轨迹级别的时间对齐指标。代码和数据集可在https://github.com/MatrixTeam-AI/TIE获取。
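The sinc-based closed form mentioned above follows from a standard integral. Assuming the uniform kernel means averaging a rotary phase $e^{i\omega t}$ over an event interval $[a, b]$, the result is $e^{i\omega (a+b)/2} \cdot \frac{\sin(\omega (b-a)/2)}{\omega (b-a)/2}$: a phase at the interval midpoint, attenuated by a sinc factor that depends on duration. The sketch below checks this identity numerically. It is only an illustration of that integral, not TIE's actual encoding; the function names are hypothetical, and how the paper normalizes the result or feeds it into DiT cross-attention is not stated in the abstract.

```python
import cmath
import math


def interval_phase(omega: float, start: float, end: float) -> complex:
    """Closed form of the average of exp(i * omega * t) over [start, end]."""
    center = 0.5 * (start + end)
    half = 0.5 * omega * (end - start)
    sinc = 1.0 if half == 0.0 else math.sin(half) / half
    return cmath.exp(1j * omega * center) * sinc


def interval_phase_numeric(omega: float, start: float, end: float,
                           steps: int = 20000) -> complex:
    """Midpoint-rule approximation of the same average, used only as a check."""
    dt = (end - start) / steps
    total = sum(cmath.exp(1j * omega * (start + (k + 0.5) * dt))
                for k in range(steps))
    return total / steps
```

Two properties are visible in this form. A zero-length interval reduces to the ordinary point-wise rotary phase, so the point encoding is a special case. High frequencies are damped for long intervals because the sinc factor shrinks as $\omega (b-a)$ grows, which is consistent with the abstract's remark that interval integration attenuates boundary noise.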
cs.CV / 283 / 2605.10556
EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving
EnergyLens:用于多模态大语言模型推理服务的可解释闭式能量模型
Abstract
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
Chinese Translation
随着大语言模型涵盖密集型、混合专家和状态空间架构,并在异构加速器上针对日益多样化的多模态工作负载进行部署,优化推理能量已变得与优化延迟和吞吐量同样重要。现有方法要么将延迟视为能量的代理,要么依赖于数据需求量大的黑箱替代模型。这两种方法在不同的并行策略下都表现不佳:在我们测试的超过20%的配置中,延迟和能量的最优解出现了分歧,而黑箱替代模型需要数百个剖面样本才能在模型家族和硬件之间进行泛化。我们提出了EnergyLens,它使用符号回归作为结构发现工具,通过剖面数据推导出一个单一的十二参数闭式能量模型,该模型以系统属性(如并行度、批量大小和序列长度)为变量。与黑箱替代模型不同,EnergyLens解耦了张量和管道并行性贡献,并将预填充能量与解码能量分开,使其预测结果在物理上可解释且可操作。EnergyLens从仅50个剖面测量中拟合,能够在许多评估场景中实现88.2%的Top-1配置选择准确率,相比之下,最接近的先前分析基线为60.9%;它的预测准确性与集成机器学习方法相当,但所需的剖面样本少了10倍,并且在未见过的批量大小和硬件平台上可靠外推,无需结构修改,使其成为一个实用的、可解释的能量最优大语言模型部署工具。
cs.CV / 284 / 2605.10564
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight:通过潜在状态预测进行长时间视野世界建模的端到端自主驾驶
Abstract
End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.
Chinese Translation
端到端自主驾驶系统越来越多地集成视觉-语言模型(VLM)架构,结合文本推理或视觉推理,以增强驾驶决策的鲁棒性和准确性。然而,大多数方法所采用的推理机制是从一般领域直接适应而来,缺乏针对自主驾驶场景的深入探索,特别是在视觉推理模块中。本文提出了一种驾驶世界模型,该模型在鸟瞰视图(BEV)空间中并行预测连续未来帧的潜在语义特征,从而实现对未来世界状态的长时间视野建模。我们还引入了一种高效且自适应的文本推理机制,利用额外的社会知识和推理能力,进一步提高在具有挑战性的长尾场景中的驾驶性能。我们提出了一种新颖、高效且有效的方法,在闭环Bench2drive基准测试中实现了最先进的(SOTA)结果。代码可在以下链接获取:https://github.com/hotdogcheesewhite/DeepSight。
cs.CV / 285 / 2605.10567
VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos
VeloGauss:从视频中学习物理一致的高斯速度场
Abstract
In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.
Chinese Translation
在本文中,我们旨在仅通过动态多视角视频联合建模三维场景的几何、外观和物理信息,而不依赖于任何物理先验。现有的研究通常将物理损失作为软约束,或将物理模拟集成到神经网络中;然而,这些方法往往无法有效学习复杂的运动物理。尽管建模速度场有潜力捕捉真实的物理信息,但由于缺乏适当的物理约束,当前方法无法正确学习刚性和非刚性粒子之间的相互作用机制。为了解决这个问题,我们提出了VeloGauss,旨在在没有物理先验的情况下学习复杂动态三维场景的物理特性。我们的方法通过引入物理编码(Physics Code)和粒子动力学系统(Particle Dynamics System)为每个高斯粒子学习速度场,并最终结合全局物理约束(Global Physical Constraints)以确保场景的物理一致性。在四个公共数据集上的大量实验表明,我们的方法在新视角插值(Novel View Interpolation)和未来帧外推(Future Frame Extrapolation)任务中均实现了最先进的性能。
cs.CV / 286 / 2605.10576
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
SenseBench:一种用于大规模视觉语言模型的遥感低级视觉感知与描述的基准测试
Abstract
Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose \textbf{SenseBench}, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual \textit{perception} and subjective diagnostic \textit{description}. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also \textit{fluency illusion} and a \textit{perception-description inversion} effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available \href{https://github.com/Zhong-Chenchen/SenseBench}{\textcolor{blue}{here}}.
Chinese Translation
低级视觉感知是可靠的遥感(RS)图像分析的基础,但当前的图像质量评估(IQA)方法输出的是不可解释的标量分数,而不是表征物理驱动的RS退化,这与RS专家的诊断需求明显不符。尽管视觉语言模型(VLMs)通过提供基于语言的IQA展现出一种有吸引力的替代方案,但它们的视觉先验严重偏向于地面自然图像。因此,VLMs是否能够克服这一领域差距以感知和阐述RS伪影仍然研究不足。为了解决这一问题,我们提出了\textbf{SenseBench},这是第一个专门针对RS低级视觉感知与描述的诊断基准测试。SenseBench基于物理驱动的分层分类法,统一了非参考和参考基础的范式,涵盖了超过10,000个精心策划的实例,分布在6个主要和22个细粒度的RS退化类别中。具体而言,设计了两个互补的评估协议:客观的低级视觉\textit{感知}和主观的诊断\textit{描述}。对29个最先进的VLMs的全面评估不仅揭示了偏斜的领域先验和多重失真崩溃,还揭示了\textit{流畅性错觉}和\textit{感知-描述反转}效应。我们希望SenseBench能够提供一个强大的评估测试平台和高质量的诊断数据,以推动VLMs在RS低级感知方面的发展。代码和数据集可在\href{https://github.com/Zhong-Chenchen/SenseBench}{\textcolor{blue}{这里}}获取。
cs.CV / 287 / 2605.10581
Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention
多边形-曼巴:使用多边形扫描曼巴和空频协同注意力的视网膜血管分割
Abstract
Retinal vessel segmentation is crucial for the diagnosis and assessment of ocular diseases. Notably, segmentation of small retinal vessels has been consistently recognized as a challenging and complex task. To tackle this challenge, we design a hybrid CNN-Mamba fusion network that integrates polygon scanning mamba and a space-frequency collaborative attention mechanism for the detection of small vessels. Considering that the traditional mamba architecture with horizontal-vertical scanning may compromise the topological integrity of target structures and result in local discontinuities in small retinal vessels, we present a polygon scanning visual state space model (PS-VSS) that identifies small-vessel structural features through multi-layer reverse scanning. This effectively preserves pixel connectivity, thereby substantially mitigating the loss of information pertaining to small vessels. Furthermore, because the spatial domain prioritizes positional and structural information while the frequency domain emphasizes global perception and local detail components, a space-frequency collaborative attention mechanism (SFCAM) is introduced within the skip connection to extract efficient features from the spatial and frequency domains. This strategy empowers the model to dynamically enhance the key features while effectively suppressing clutter. To assess the efficacy of our model, it was tested on three publicly available datasets: DRIVE, STARE, and CHASE_DB1. Compared to manual annotations, our model demonstrated F1 scores of 0.8283, 0.8282, and 0.8251, Area Under Curve (AUC) values of 0.9806, 0.9840, and 0.9866, and Sensitivity (SE) values of 0.8268, 0.8314, and 0.8484 across the three datasets, respectively. The effectiveness of our model was validated through both visual inspection and quantitative analysis.
Chinese Translation
视网膜血管分割对于眼科疾病的诊断和评估至关重要。值得注意的是,小型视网膜血管的分割一直被认为是一项具有挑战性和复杂性的任务。为了解决这一挑战,我们设计了一种混合CNN-曼巴融合网络,该网络结合了多边形扫描曼巴和空频协同注意力机制,以检测小型血管。考虑到传统的曼巴架构采用水平-垂直扫描可能会损害目标结构的拓扑完整性,并导致小型视网膜血管的局部不连续性,我们提出了一种多边形扫描视觉状态空间模型(PS-VSS),通过多层反向扫描方式识别小型血管的结构特征。这有效地保持了像素的连通性,从而大大减轻了与小型血管相关的信息丢失。此外,众所周知,空间域优先考虑位置和结构信息,而频率域则强调全局感知和局部细节成分,因此在跳跃连接中引入了一种空频协同注意力机制(SFCAM),以从空间和频率域提取有效特征。这一策略使模型能够动态增强关键特征,同时有效抑制杂乱。为了评估我们模型的有效性,我们在三个公开可用的数据集上进行了测试:DRIVE、STARE和CHASE_DB1。与人工标注相比,我们的模型在三个数据集上分别显示了F1分数为0.8283、0.8282和0.8251,曲线下面积(AUC)值为0.9806、0.9840和0.9866,敏感性(SE)值为0.8268、0.8314和0.8484。我们的模型的有效性通过视觉检查和定量分析得到了验证。
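For reference, the F1 and sensitivity (SE) figures quoted in this abstract are derived from pixel-level confusion counts between a predicted vessel mask and the manual annotation. The sketch below is illustrative only: the function name `vessel_metrics` and the flat 0/1 mask encoding are assumptions, not taken from the paper.

```python
def vessel_metrics(pred, truth):
    """Compute F1 and sensitivity from two flat binary masks (1 = vessel pixel).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1          # vessel pixel correctly detected
        elif p == 1 and t == 0:
            fp += 1          # background pixel labelled as vessel
        elif p == 0 and t == 1:
            fn += 1          # vessel pixel missed
    f1_den = 2 * tp + fp + fn
    se_den = tp + fn
    return {
        "f1": 2 * tp / f1_den if f1_den else 0.0,
        "sensitivity": tp / se_den if se_den else 0.0,
    }
```

Sensitivity only penalizes missed vessel pixels, which is why it is usually reported alongside F1 for thin-vessel segmentation.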
cs.CV / 288 / 2605.10583
FrequencyCT: Frequency domain pseudo-label generation for self-supervised low-dose CT denoising
FrequencyCT:用于自监督低剂量CT去噪的频域伪标签生成
Abstract
Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To address this, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-label generation in the frequency domain for low-dose CT denoising. Leveraging the characteristic of the frequency domain that largely isolates noise from clean signals, a regional low-frequency anchoring technique is proposed. Phase-preserving amplitude modulation and mask perturbation in the high-frequency region generate pseudo-label data for self-supervision. The fluctuating noise variance in the projection domain prompts truncation of the generated samples to stabilize the network's optimization gradient. Evaluation results on multiple public and real-world datasets confirm the clinical application potential of this research, which will have a revolutionary impact on the field of denoising. The code can be obtained from https://github.com/yqx7150/FrequencyCT.
Chinese Translation
尽管在计算机断层扫描(CT)去噪方面进行了广泛研究,但很少有研究利用投影域数据特征来减轻噪声相关性。为了解决这个问题,本研究提出了FrequencyCT,这是一种在频域中用于低剂量CT去噪的伪标签生成的首个零样本自监督方法。利用频域的特性,能够在很大程度上将噪声与干净信号隔离,提出了一种区域低频锚定技术。在高频区域中,通过相位保持的幅度调制和掩模扰动生成自监督的伪标签数据。投影域中波动的噪声方差促使对生成样本进行截断,以稳定网络的优化梯度。在多个公共和真实世界数据集上的评估结果确认了本研究的临床应用潜力,这将对去噪领域产生革命性的影响。代码可从 https://github.com/yqx7150/FrequencyCT 获取。
cs.CV / 289 / 2605.10586
CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations
CausalGS:利用高斯表示学习三维动态场景的物理因果关系
Abstract
Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene's kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene's dynamic evolution, solely from visual observations.
Chinese Translation
从视频数据中学习一个能够理解物理法则并预测物体未来轨迹的物理模型是人工智能中的一项艰巨挑战。以往的方法要么利用各种偏微分方程(Partial Differential Equations, PDEs)作为以PINN损失形式的软约束,要么将物理模拟器集成到神经网络中;然而,它们往往依赖于强先验或高质量的几何重建。在本文中,我们提出了CausalGS,一个框架,它仅通过多视角视频学习复杂动态三维场景的因果动态,而不依赖于显式先验。其核心是一个逆物理推理模块,它将复杂的动态问题从视频中解耦为两个因素的联合推理:表示场景运动学的初始速度场和支配其动态的内在材料属性。然后,这些推断出的物理信息被用于一个可微分的物理模拟器中,以物理正则化的方式指导学习过程。大量实验表明,CausalGS在长期未来帧外推这一极具挑战性的任务上超越了当前最先进的技术,同时在新视图插值中也表现出先进的性能。至关重要的是,我们的工作表明,在没有任何人工标注的情况下,该模型能够仅通过视觉观察学习多个物理属性之间的复杂相互作用,并理解驱动场景动态演变的因果关系。
cs.CV / 290 / 2605.10588
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
以新视角思考:生成增强空间智能的系统分析
Abstract
Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.
Chinese Translation
当前的大型多模态模型(LMMs)在需要视角依赖理解的空间推理任务中表现不佳,主要是因为它们局限于单一的静态观察。我们提出了以新视角思考(Thinking with Novel Views, TwNV)这一范式,将生成的新视角合成整合到推理循环中:一个推理器 LMM 识别空间模糊性,指示一个绘制者合成替代视角,并利用额外证据重新审视场景。通过系统实验,我们解决了三个研究问题。(1) 指令格式:数值相机姿态规格比自由形式语言更能提供可靠的视角控制。(2) 生成保真度:合成视角的质量与下游空间准确性紧密相关。(3) 推理时视觉缩放:迭代的多轮视角细化进一步提升了性能,呼应了最近语言推理中的缩放趋势。在四个空间子任务类别和四种 LMM 架构(包括闭源和开源)中,TwNV 一致地提高了准确性,提升幅度在 +1.3 到 +3.9 个百分点之间,尤其在对视角敏感的子任务上获得了最大的提升。这些结果确立了新视角生成作为推动 LMMs 空间智能发展的实用杠杆。
cs.CV / 291 / 2605.10603
Segment Anything with Robust Uncertainty-Accuracy Correlation
具有鲁棒不确定性-准确性相关性的任意分割
Abstract
Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://github.com/HongyouZhou/ruac.git.
Chinese Translation
尽管在零-shot任务中表现强劲,SAM在领域转移下却不可靠,这主要是由于掩膜级置信度混淆(Mask-level Confidence Confusion, MCC),即基于IoU的单一掩膜评分未能反映边界附近的像素级可靠性。受神经网络中的纹理偏向捷径与人类视觉中的形状中心处理之间对比的启发,我们将域外变化建模为外观变化和非刚性变形,这两者共同强调了校准的重要性。我们提出了具有鲁棒不确定性-准确性相关性的任意分割(Segment Anything with Robust Uncertainty-Accuracy Correlation, RUAC),用于在外观和变形变化下进行鲁棒的像素级不确定性估计。RUAC增加了一个轻量级的不确定性头,通过一种协作式风格-变形攻击进行训练,该攻击共同扰动纹理和几何形状,并应用不确定性-准确性对齐(Uncertainty-Accuracy Alignment),以确保不确定性在对抗扰动下始终突出错误像素。在23个零-shot领域中,RUAC提高了分割质量,并提供了更可靠的不确定性,具有更强的不确定性-准确性相关性。项目页面:https://github.com/HongyouZhou/ruac.git。
cs.CV / 292 / 2605.10628
Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection
超图增强的无训练和无语言少样本异常检测
Abstract
Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-$n$ matching with \textbf{Sparse Hyper Matching}: \textit{sparsemax} first selects the most relevant support patches, which are then aggregated into a \textit{hyperedge} as compact normal evidence to suppress background noise and distractors. We further introduce \textbf{Dual-Branch Image Scoring}, which fuses \emph{spatial anomaly evidence} from the patch-grid anomaly map with \emph{global semantic deviation} captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
Chinese Translation
少样本异常检测(FSAD)取得了显著进展,但现有方法仍面临关键挑战:(i)依赖于特定任务或数据集的训练/微调,(ii)依赖于语言监督或精心设计的提示,以及(iii)在不同领域的鲁棒性有限。本文提出了HyperFSAD,一种新颖的FSAD框架,具有无训练、无语言和跨领域鲁棒性的特点,为这些挑战提供了强有力的解决方案。我们的方案基于DINOv3和超图推理机制,能够在没有任何特定任务优化或文本提示的情况下进行推理,同时保持竞争力。具体而言,我们用\textbf{稀疏超匹配}替代敏感的最近邻/top-$n$匹配:\textit{sparsemax}首先选择最相关的支持补丁,然后将其聚合成一个\textit{超边},作为压制背景噪声和干扰的紧凑正常证据。我们进一步引入\textbf{双分支图像评分},将来自补丁网格异常图的\textit{空间异常证据}与通过支持感知CLS匹配捕获的\textit{全局语义偏差}融合,严格以视觉方式产生鲁棒的图像级异常得分。值得注意的是,HyperFSAD的所有组件都是纯视觉的,消除了对劳动密集型手工文本提示的需求。在严格的无训练和无语言设置下,HyperFSAD在六个数据集上实现了最先进的性能,这些数据集涵盖了四个工业数据集(MVTecAD、VisA、MPDD、BTAD)和两个医学数据集(RESC、BraTS)。
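To make the sparsemax selection step concrete, here is a minimal sketch of the standard sparsemax projection, followed by a weighted aggregation of the selected support patches. The `sparsemax` function follows the usual sort-and-threshold definition; the `aggregate_hyperedge` helper, its name, and the weighted-sum aggregation are assumptions for illustration, since the abstract does not specify how the hyperedge is formed.

```python
def sparsemax(scores):
    """Project `scores` onto the probability simplex; many outputs are exactly 0."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # k stays in the support while 1 + k * z_(k) exceeds the running sum.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


def aggregate_hyperedge(similarities, patches):
    """Hypothetical aggregation: sparsemax-weighted sum of support patch features."""
    weights = sparsemax(similarities)
    dim = len(patches[0])
    return [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
```

Unlike softmax, sparsemax assigns exactly zero weight to low-similarity patches, so distractor patches drop out of the aggregate entirely instead of contributing a small residual.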
cs.CV / 293 / 2605.10629
Product-of-Gaussian-Mixture Diffusion Models for Joint Nonlinear MRI Reconstruction
用于联合非线性MRI重建的高斯混合产物扩散模型
Abstract
Recently, diffusion models have attracted considerable attention for magnetic resonance image reconstruction due to their high sample quality. However, most existing methods rely on large networks with opaque time-conditioning mechanisms, and require offline coil sensitivity estimation. This results in limited interpretability of the reconstruction process and reduced flexibility in the acquisition setup. To address these limitations, we jointly reconstruct the image and the coil sensitivities by combining the parameter-efficient product-of-Gaussian-mixture diffusion model as an image prior with a classical smoothness prior on the coil sensitivities. The proposed method is fast and robust to both contrast and anatomical distribution shifts as well as changing k-space trajectories. Finally, we propose a more expressive parameterization of the image prior which improves results in denoising and magnetic resonance image reconstruction.
Chinese Translation
近年来,由于其高样本质量,扩散模型在磁共振图像重建中引起了相当大的关注。然而,大多数现有方法依赖于具有不透明时间条件机制的大型网络,并且需要离线线圈灵敏度估计。这导致重建过程的可解释性有限,且在采集设置中的灵活性降低。为了解决这些限制,我们通过将高效参数的高斯混合产物扩散模型作为图像先验与线圈灵敏度的经典平滑先验相结合,联合重建图像和线圈灵敏度。所提出的方法快速且对对比度和解剖分布的变化以及k空间轨迹的变化具有鲁棒性。最后,我们提出了一种更具表现力的图像先验参数化,改善了去噪和磁共振图像重建的结果。
cs.CV / 294 / 2605.10641
LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
LLaVA-CKD:面向视觉语言模型的自下而上的级联知识蒸馏
Abstract
Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework to models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
Chinese Translation
大型视觉语言模型(VLMs)在解决多种视觉语言理解任务(如视觉问答(VQA))方面取得了成功,但其内存和计算需求仍然是实际部署中的一个问题。一类有前景的技术是知识蒸馏(Knowledge Distillation),它通过将高容量教师网络(Teacher)中的知识转移到一个相对较小的学生网络(Student)中来缓解这一问题。然而,这两种网络之间的容量差距既是一个优势也是一个劣势:学生网络越小,其效率越高;而教师网络越大,其携带的知识越多;然而,超过某个点后,两者之间的容量差距过大将导致知识转移效果变差。为了解决这一问题,我们提出了一种自下而上的级联知识蒸馏(CKD)框架。我们不再将知识转移视为一个涉及一个高容量教师(或多个此类教师)的活动,而是受到人类正式教育系统的启发,引入一个(可能是多个)中等容量的教师,逐步将学生网络提升到下一个层次,以便下一个(更高容量的)教师能够接管。我们提供了理论分析,以研究级联蒸馏对学生网络泛化性能的影响。我们将所提框架应用于基于LLaVA方法论构建的模型,并在七个标准的公开VQA基准上评估所得到的模型,展示了它们的最先进性能。
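The bottom-up cascade described above can be sketched as a generic temperature-scaled distillation loss applied stage by stage, from the smallest Teacher to the largest. This is an assumption-laden illustration: the abstract does not state the loss used, so the KL-based `kd_loss` is the common textbook form, and `cascade_stages` with its teacher names and sizes is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def cascade_stages(teachers):
    """Order teachers bottom-up: the lowest-capacity teacher distills first.

    `teachers` maps a (hypothetical) teacher name to its parameter count.
    """
    return [name for name, _ in sorted(teachers.items(), key=lambda kv: kv[1])]
```

In a training loop under these assumptions, the Student would minimize `kd_loss` against each Teacher in the order returned by `cascade_stages`, moving to the next stage once the current one has converged.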
cs.CV / 295 / 2605.10645
GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks
GenMed:医学诊断任务的成对生成重构
Abstract
Data-driven medical AI is traditionally formulated as a discriminative mapping from input $X$ to output $Y$ via a learned function $f$, which does not generalize well across heterogeneous data and modalities encountered in real-world clinical settings. In this work, we propose a fundamentally different, generative paradigm. We model the joint distribution $P(X,Y)$ using diffusion models and reframe inference as a test-time output optimization problem. By guiding the generative process to match observed inputs, our framework enables flexible, gradient-based conditioning at inference time without architectural changes or retraining, effectively supporting arbitrary and previously unseen combinations of observations. Extensive experiments demonstrate strong performance across standard and cross-modality medical image segmentation, few-shot segmentation with only 2 or 4 training samples, degraded-input segmentation, shape completion from sparse and partial observations, and zero-shot application to demonstrate generality. To support these evaluations, we curated and released a large-scale text-shape dataset derived from MedShapeNet. Our results highlight the versatility of generative joint modeling as a foundation for reusable, task-agnostic medical AI systems.
Chinese Translation
数据驱动的医学人工智能传统上被表述为从输入 $X$ 到输出 $Y$ 的判别映射,通过学习的函数 $f$ 实现,这在现实临床环境中遇到的异构数据和模态上泛化效果不佳。在本研究中,我们提出了一种根本不同的生成范式。我们使用扩散模型对联合分布 $P(X,Y)$ 进行建模,并将推理重新框架为测试时输出优化问题。通过引导生成过程与观察到的输入匹配,我们的框架在推理时实现了灵活的基于梯度的条件,而无需架构更改或重新训练,有效支持任意和之前未见过的观察组合。大量实验表明,在标准和跨模态医学图像分割、仅使用 2 或 4 个训练样本的少样本分割、降级输入分割、从稀疏和部分观察中进行形状补全,以及零样本应用以展示通用性方面表现出色。为了支持这些评估,我们策划并发布了一个大规模的文本-形状数据集,来源于 MedShapeNet。我们的结果突显了生成联合建模作为可重用、任务无关的医学人工智能系统基础的多样性。
cs.CV / 296 / 2605.10661
bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
bViT:在视觉变换器中研究单块递归用于图像识别
Abstract
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
Chinese Translation
视觉变换器(ViTs)是通过堆叠独立参数化的块构建的,但尚不清楚这种深度中有多少需要特定层的变换,以及有多少可以通过递归计算实现。我们通过 bViT 研究这个问题,bViT 是一种单块递归 ViT,其中一个变换器块被重复应用于处理图像。这种架构保留了深度 ViT 的迭代结构,同时去除了特定层的块参数化,为研究视觉中的递归提供了一个受控的环境。在 ImageNet-1K 上,12 步的 bViT-B 在相同的训练方案和计算预算下达到了与标准 ViT-B 相当的准确率,同时使用的参数量少了一个数量级。我们观察到,递归性能随着表示宽度的增加而改善,较宽的 bViT 在恢复标准 ViT 性能方面比狭窄变体要好得多。我们将这种行为解释为隐式深度复用,其中共享块通过不断变化的隐藏状态表达多个依赖步骤的计算。除了 ImageNet 分类之外,bViT 在下游任务中也具有竞争力的迁移能力,并且能够实现参数高效的微调。对激活、注意力和步骤特定剪枝的机制分析表明,共享块在递归步骤中改变其有效行为,而不仅仅是重复相同的计算。我们的结果表明,只要表示空间足够宽,ViT 深度的很大一部分可以通过递归重用来实现。
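The parameter saving from single-block recurrence can be checked with simple arithmetic. The sketch below assumes a standard pre-norm transformer block (four attention projections, a two-layer MLP with expansion ratio 4, two LayerNorms) and counts block parameters only; patch embedding, positional embeddings, and the classification head are excluded. The helper names are hypothetical.

```python
def block_params(d, mlp_ratio=4):
    """Parameter count of one standard transformer block of width d (assumed layout)."""
    attn = 4 * d * d + 4 * d                           # Q, K, V, output projections + biases
    mlp = 2 * mlp_ratio * d * d + mlp_ratio * d + d    # two linear layers + biases
    norms = 2 * 2 * d                                  # two LayerNorms (scale and shift)
    return attn + mlp + norms


def run_recurrent(block, x, steps):
    """Apply the same block `steps` times, reusing one set of weights."""
    for _ in range(steps):
        x = block(x)
    return x


d = 768                                # ViT-B width
stacked = 12 * block_params(d)         # 12 independently parameterized blocks
shared = block_params(d)               # one block reused for 12 steps
print(stacked, shared, stacked // shared)
```

Under these assumptions the block parameters shrink by a factor of 12 at equal compute, which is consistent with the abstract's "order of magnitude fewer parameters".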
cs.CV / 297 / 2605.10675
Neuromorphic Monocular Depth Estimation with Uncertainty Modeling
具有不确定性建模的神经形态单目深度估计
Abstract
Event cameras offer distinct advantages over conventional frame-based sensors, including microsecond-level temporal resolution, high dynamic range, and low bandwidth. In this paper, we predict per-pixel depth distributions from monocular event streams using deep neural networks. We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks. We compare six event representations: spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins, the Compact Spatio-Temporal Representation (CSTR), and Time-Ordered Recent Event (TORE) volumes. Our U-Net-based models are trained on synthetic data and then fine-tuned on real sequences. We evaluate performance using absolute relative error, root mean squared error, and the area under the sparsification error curve. Quantitative results show that the representations perform similarly, while 10-bin log-normal and 5-bin evidential learning perform best across metrics. Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation and can be used to indicate pixels with reliable depth.
Chinese Translation
事件相机相较于传统的基于帧的传感器具有显著优势,包括微秒级的时间分辨率、高动态范围和低带宽。在本文中,我们使用深度神经网络从单目事件流中预测每个像素的深度分布。我们采用高斯、对数正态和证据学习框架来估计不确定性。我们比较了六种事件表示方法:具有1、5、10和20个时间桶的时空体素网格、紧凑时空表示(Compact Spatio-Temporal Representation, CSTR)和时间排序最近事件(Time-Ordered Recent Event, TORE)体积。我们的基于U-Net的模型在合成数据上进行训练,然后在真实序列上进行微调。我们使用绝对相对误差、均方根误差和稀疏化误差下的面积来评估性能。定量结果表明,这些表示方法的性能相似,而10个桶的对数正态和5个桶的证据学习在各项指标上表现最佳。我们的实验表明,不确定性估计可以成功地集成到基于事件的单目深度估计中,并用于指示具有可靠深度的像素。
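As an illustration of the voxel-grid representations compared above, the following sketch accumulates events into a fixed number of temporal bins using linear interpolation along the time axis. This is one common construction and an assumption here; the abstract does not specify the exact variant, and the function name and the `(x, y, t, polarity)` event layout are hypothetical.

```python
def events_to_voxel_grid(events, num_bins, height, width):
    """Build a [num_bins][height][width] grid from (x, y, t, polarity) events.

    Polarity is assumed to be +1 or -1. Each event is split between its two
    nearest temporal bins in proportion to its normalized timestamp.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) or 1.0                      # avoid division by zero
    for x, y, t, p in events:
        tn = (t - t0) / span * (num_bins - 1)    # normalized time in [0, num_bins - 1]
        lo = int(tn)
        frac = tn - lo
        grid[lo][y][x] += p * (1.0 - frac)
        if lo + 1 < num_bins:
            grid[lo + 1][y][x] += p * frac
    return grid
```

With `num_bins=1` all events collapse into a single polarity-sum image, while larger bin counts retain more of the temporal structure at the cost of a larger input tensor.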
cs.CV / 298 / 2605.10676
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
非盲而是沉默:通过对抗性反常识平衡重塑视觉与语言
Abstract
During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.
Chinese Translation
在多模态大语言模型(MLLM)解码过程中,注意力往往异常集中于无关的图像标记。虽然现有研究将其视为无效噪声,并强行重新引导注意力以迫使关注关键图像信息,但我们认为这些标记是视觉与叙事逻辑的重要载体,而这种强制性修正加剧了视觉与语言之间的不平衡。采用“解码即游戏”的视角,我们揭示了幻觉源于语言先验与视觉信息之间的平衡失调。我们提出了对抗性反常识平衡(Adversarial Counter-Commonsense Equilibrium, ACE),这是一种无训练的框架,通过反常识补丁扰动视觉上下文。利用真实视觉特征在扰动下保持稳定而幻觉波动的事实,ACE 实施了一种动态游戏解码策略。该方法精确抑制了对扰动敏感的先验,同时补偿稳定的视觉信号以恢复平衡。大量实验表明,ACE 作为一种即插即用的策略,在几乎没有推理开销的情况下增强了模型的可信度。
cs.CV / 299 / 2605.10705
TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering
TransmissiveGS:基于残差引导的解耦高斯喷溅用于透射场景重建与渲染
Abstract
Transmissive scenes are ubiquitous in daily life, yet reconstructing and rendering them remains highly challenging due to the inherent entanglement between near-field reflections from the surrounding environment on the transmissive surface, and the transmitted content of the scene behind it. This coupling gives rise to dual surface geometries and dual radiance components within each observation, posing ambiguities for standard methods. We present TransmissiveGS, a novel framework for disentangled reconstruction and rendering of transmissive scenes. Specifically, we model the scene with a dual-Gaussian representation and introduce a deferred shading function to jointly render the two Gaussian components. To separate reflection and transmission, we exploit the inherent multi-view inconsistency of reflections and leverage the residuals from reconstructing multi-view consistent content as cues for disentangled geometry and appearance modeling. We further propose a reflection light field that enables high-fidelity estimation of near-field reflections. During training, we introduce a high-frequency regularization to preserve fine details. We also contribute a new synthetic dataset for evaluating transmissive surface reconstruction. Experiments on both synthetic and real-world scenes demonstrate that TransmissiveGS consistently outperforms prior Gaussian Splatting-based methods in both reconstruction and rendering quality for transmissive scenes.
Chinese Translation
透射场景在日常生活中无处不在,但由于透射表面周围环境的近场反射与其后方场景的透射内容之间固有的纠缠,重建和渲染这些场景仍然非常具有挑战性。这种耦合导致每个观测中存在双重表面几何和双重辐射成分,为标准方法带来了歧义。我们提出了TransmissiveGS,一个用于透射场景解耦重建与渲染的新框架。具体而言,我们使用双高斯表示对场景进行建模,并引入延迟着色函数以联合渲染这两个高斯成分。为了分离反射和透射,我们利用反射的固有多视图不一致性,并利用重建多视图一致内容的残差作为解耦几何和外观建模的线索。我们进一步提出了一种反射光场,使得近场反射的高保真估计成为可能。在训练过程中,我们引入高频正则化以保留细节。我们还贡献了一个新的合成数据集,用于评估透射表面重建。在合成和真实场景上的实验表明,TransmissiveGS在透射场景的重建和渲染质量上始终优于先前基于高斯喷溅的方法。
cs.CV / 300 / 2605.10715
UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting
基于无人机辅助的滑坡扫描与仿真:物理信息高斯点云技术
Abstract
Landslide monitoring and simulation play an important role in urban safety assessment and disaster prevention. Existing landslide simulation pipelines typically rely on digital elevation models and mesh-based representations, which are suitable for geometric analysis but often lack visual realism. This limitation reduces their effectiveness in interactive applications, hazard communication, and public education. In this paper, we propose a UAV-based scan-to-simulation framework that bridges photorealistic scene capture and physics-based landslide simulation through 3DGS. Specifically, our pipeline includes four stages: (1) UAV-based acquisition of slope imagery, (2) reconstruction of a low-anisotropy 3DGS scene representation, (3) volumetric conversion of the target simulation region by filling the interior of the surface-based model, and (4) integration with the Material Point Method (MPM) for landslide simulation. We validate the proposed framework on a real landslide site in Hong Kong that experienced a severe landslide event. The results show that our method supports both realistic visual reconstruction and effective simulation.
Chinese Translation
滑坡监测与仿真在城市安全评估和灾害预防中发挥着重要作用。现有的滑坡仿真流程通常依赖于数字高程模型和基于网格的表示,这些方法适合几何分析,但往往缺乏视觉真实感。这一局限性降低了它们在互动应用、灾害传播和公众教育中的有效性。本文提出了一种基于无人机的扫描与仿真框架,通过3DGS(3D Gaussian Splatting)连接了照片真实场景捕捉和基于物理的滑坡仿真。具体而言,我们的流程包括四个阶段:(1)基于无人机的坡面影像获取;(2)重建低各向异性3DGS场景表示;(3)通过填充表面模型内部进行目标仿真区域的体积转换;(4)与材料点法(Material Point Method, MPM)集成以进行滑坡仿真。我们在香港一个经历过严重滑坡事件的真实滑坡现场验证了所提出的框架。结果表明,我们的方法支持真实的视觉重建和有效的仿真。
cs.AI / 1 / 2605.08200
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
视觉-语言模型中的可靠性:注意力、隐藏状态和因果电路的机制研究
Abstract
A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline -- the VLM Reliability Probe (VRP) -- that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC>0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
Chinese Translation
一种普遍的直觉认为,当视觉-语言模型(VLMs)的注意力图看起来清晰时,它们是最值得信赖的:对查询区域的集中注意力应该意味着一个自信且经过校准的答案。我们直接测试了这一注意力-信心假设。我们为三种开放权重的VLM家族(LLaVA-1.5、PaliGemma、Qwen2-VL;3-7B参数)构建了一个统一的机制管道——VLM可靠性探针(VRP),该管道将注意力结构、生成动态和隐藏状态几何与单一正确性标签进行比较。结果有三点:(i)注意力结构几乎无法预测正确性(R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024],基于n=3,090的汇总分割),尽管注意力在特征提取中仍然是因果必要的(前30%的补丁屏蔽使准确率下降8.2-11.3个百分点,p<0.001)。 (ii)可靠性在计算的后期变得明显:单个隐藏状态线性探针在POPE上对三种家族中的两种达到AUROC>0.95,而K=10时的自一致性是我们在10倍推理成本下测量到的最强行为预测因子(R_pb=0.43)。 (iii)因果神经元级别的消融揭示了一个明显的架构分裂,具有直接的监控设计含义:后融合的LLaVA将可靠性集中在一个脆弱的后瓶颈中(在前5个探针神经元消融后,物体识别准确率下降8.3个百分点),而早融合的PaliGemma和Qwen2-VL则广泛分布可靠性,并在隐藏维度的峰值层中吸收约50%的破坏,准确率下降不超过1个百分点。结论是狭窄但重要的:在3-7B的VLM中,可靠性更可靠地从隐藏状态几何、层级边际形成和稀疏的后层电路中读取,而不是从注意力图的清晰度中读取。
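The R_pb values quoted in this abstract are point-biserial correlations between a continuous score and the binary correctness label. As a rough illustration only, the sketch below computes that statistic with the standard textbook formula; the function name `point_biserial` and the toy inputs are hypothetical and are not taken from the paper's VRP pipeline.

```python
import math


def point_biserial(scores, labels):
    """Point-biserial correlation between a continuous score and a 0/1 label.

    Hypothetical helper for illustration; uses the textbook formula
    r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where s_n is the
    population standard deviation of all scores.
    """
    n = len(scores)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each label")
    mean_all = sum(scores) / n
    # Population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if std == 0:
        raise ValueError("scores are constant")
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return (mean_pos - mean_neg) / std * math.sqrt(len(pos) * len(neg) / n**2)


# Toy data: a score that separates correct from incorrect answers well,
# and one that carries no information about correctness.
informative = point_biserial([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
uninformative = point_biserial([0.5, 0.7, 0.5, 0.7], [0, 0, 1, 1])
```

A value near zero, as reported for the attention-structure features, means the score's mean is almost identical for correct and incorrect answers.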
cs.AI / 2 / 2605.08220
Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction
空间引导优于语义提示:基于网格的方法提高大语言模型在图表数据提取中的准确性
Abstract
The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise, their accuracy on non-standardized charts remains a challenge. This raises a key research question: which strategy improves model performance more effectively, high-level semantic priming or low-level spatial priming? This paper presents a comparative investigation into these two distinct strategies. We describe our exploratory experiments with semantic methods, such as a two-stage metadata-first framework and Chain-of-Thought, which failed to produce a statistically significant improvement. In contrast, we present a simple but highly effective spatial priming method: overlaying a coordinate grid onto the chart image before analysis. Our quantitative experiment on a synthetic dataset demonstrates that this grid-based approach provides a statistically significant reduction in data extraction error (SMAPE reduced from 25.5% to 19.5%, p < 0.05) compared to a baseline. We conclude that for the current generation of multimodal models, providing explicit spatial context is a more effective and reliable strategy than high-level semantic guidance for this class of tasks.
Chinese Translation
从科学图表中自动提取数据是大规模文献分析中的一项关键任务。尽管多模态大语言模型(LLMs)展现出潜力,但它们在非标准化图表上的准确性仍然是一个挑战。这引发了一个关键的研究问题:提高模型性能的最有效策略是高层次的语义引导,还是低层次的空间引导?本文对这两种不同策略进行了比较研究。我们描述了使用语义方法的探索性实验,例如一种以元数据为先的两阶段框架和思维链(Chain-of-Thought),这些方法未能产生统计上显著的改善。相反,我们提出了一种简单但极为有效的空间引导方法:在分析之前将坐标网格叠加到图表图像上。我们在一个合成数据集上的定量实验表明,这种基于网格的方法在数据提取错误上提供了统计上显著的减少(SMAPE从25.5%降低到19.5%,p < 0.05),与基线相比。我们得出结论,对于当前一代多模态模型,提供明确的空间上下文是一种比高层次语义指导更有效和可靠的策略,适用于这一类任务。
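The error metric in this abstract is SMAPE (symmetric mean absolute percentage error). The abstract does not say which SMAPE variant is used, so the sketch below assumes the common form that divides each absolute error by the mean of the absolute actual and predicted values; the function name `smape` and the sample numbers are illustrative only.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent.

    Assumed variant: mean over points of |p - a| / ((|a| + |p|) / 2).
    Points where both values are zero contribute zero error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100 * total / len(actual)


# Toy example: values read off a chart versus the ground-truth values.
error = smape([100, 200], [110, 180])
perfect = smape([1, 2], [1, 2])
```

Under this variant the score is bounded between 0% and 200%, which makes it less sensitive to small ground-truth values than plain percentage error.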
cs.AI / 3 / 2605.08354
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
自动评分标准作为奖励:从隐含偏好到显式多模态生成标准
Abstract
Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR's structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.
Chinese Translation
将多模态生成模型与人类偏好对齐需要尊重人类判断的组合性和多维结构的奖励信号。现有的基于强化学习的人类反馈(RLHF)方法将这种结构简化为标量或成对标签,将细致的偏好压缩为不透明的参数代理,暴露出对奖励操控的脆弱性。尽管最近的评分标准作为奖励(Rubrics-as-Reward, RaR)方法试图通过显式标准恢复这种结构,但同时生成可靠、可扩展且数据高效的评分标准仍然是一个未解决的问题。我们提出了自动评分标准作为奖励(Auto-Rubric as Reward, ARR)框架,将奖励建模从隐含权重优化重新构建为显式的基于标准的分解。在任何成对比较之前,ARR将视觉语言模型(VLM)内化的偏好知识外化为特定提示的评分标准,将整体意图转化为可独立验证的质量维度。这种将隐含偏好结构转换为可检查、可解释约束的过程显著抑制了评估偏差,包括位置偏差,使得在最小监督下实现零-shot部署和少-shot调节成为可能。为了将这些收益扩展到生成训练中,我们提出了评分标准策略优化(Rubric Policy Optimization, RPO),它将ARR的结构化多维评估提炼为稳健的二元奖励,用评分标准条件的偏好决策替代不透明的标量回归,从而稳定策略梯度。在文本到图像生成和图像编辑基准测试中,ARR-RPO的表现优于成对奖励模型和VLM评审,证明了将隐含偏好知识显式外化为结构化评分标准能够实现更可靠、数据高效的多模态对齐,揭示了瓶颈在于缺乏分解接口,而非知识的不足。
cs.AI / 4 / 2605.08360
Embeddings for Preferences, Not Semantics
偏好的嵌入,而非语义的嵌入
Abstract
Modern AI is opening the door to collective decision-making in which participants express their views as free-form text rather than voting on a fixed set of candidates. A natural idea is to embed these opinions in a vector space so that the substantial literature on facility location problems and fair clustering can be brought to bear. But standard text embeddings measure semantic similarity, whereas distances in facility location problems and fair clustering require what we call \textit{preferential similarity}: a participant's agreement with a piece of text should be inversely related to their distance from it. Off-the-shelf embeddings inherit a coarse preference signal through a correlation between semantic and preferential similarity, but fail to capture preferences when the correlation breaks. We formalize this as an invariance problem: text embedding models encode both a preference-relevant signal (stance and values) and semantic nuisance (style and wording), and the two are observationally correlated, so a geometry that relies on nuisance can appear preference-correct even when it is not. We show that synthetic training data designed to break this correlation provably shifts the optimal scorer away from nuisance-dominated cosine and significantly improves preference prediction across 11 online deliberation datasets.
Chinese Translation
现代人工智能正在为集体决策开辟新天地,参与者可以以自由形式的文本表达他们的观点,而不是在固定候选人中投票。一个自然的想法是将这些意见嵌入到一个向量空间中,以便利用关于设施选址问题和公平聚类的丰富文献。然而,标准的文本嵌入测量的是语义相似性,而在设施选址问题和公平聚类中所需的距离则是我们所称的“偏好相似性”(preferential similarity):参与者与一段文本的认同程度应与他们与该文本的距离成反比。现成的嵌入通过语义相似性与偏好相似性之间的相关性继承了粗糙的偏好信号,但当这种相关性破裂时,它们无法捕捉到偏好。我们将此形式化为一个不变性问题:文本嵌入模型编码了与偏好相关的信号(立场和价值观)和语义干扰(风格和措辞),而这两者在观察上是相关的,因此依赖于干扰的几何结构即使在偏好上不正确也可能看起来是正确的。我们展示了设计用于打破这种相关性的合成训练数据可以显著改变最优评分器,使其远离以干扰为主导的余弦相似性,并在11个在线讨论数据集中显著改善偏好预测。
cs.AI / 5 / 2605.08368
On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective
从自由能的视角区分后训练中的能力引出与能力创造
Abstract
Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction is too coarse. What matters is whether a training procedure increases the probability of behaviors the pretrained model could already produce, or whether it changes what the model can practically reach. We argue that post-training research should distinguish between capability elicitation and capability creation. We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation. We develop this argument through a free-energy view of post-training. SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
Chinese Translation
关于大型语言模型后训练的讨论常常将监督微调(SFT)视为模仿,而将强化学习(RL)视为发现。但这种区分过于粗糙。关键在于训练过程是否增加了预训练模型已经能够产生的行为的概率,或者是否改变了模型可以实际达到的能力。我们认为后训练研究应区分能力引出与能力创造。我们通过引入可达支持的概念使这一区分具备操作性:即模型在有限预算下可以实际产生的行为集合。重新加权这一支持内的行为属于能力引出;而改变支持本身则对应于能力创造。我们通过自由能视角发展这一论点。SFT和RL都可以被视为对预训练参考分布的重新加权,只是外部信号不同。演示信号为SFT定义了低能行为,而奖励信号为RL定义了低能行为。当更新保持接近基础模型时,主要效果是局部重新加权,而非能力创造。在这一框架内,核心问题不再是后训练是被框定为SFT还是RL,而是它是否重新加权了已经可达的行为,或者通过搜索、互动、工具使用或新信息的整合扩展了模型的可达行为空间。
cs.AI / 6 / 2605.08374
MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
MemQ:将Q学习集成到基于来源有向无环图的自我进化记忆代理中
Abstract
Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code will be available soon.
Chinese Translation
情节记忆使得大规模语言模型(LLM)代理能够积累和检索经验,但当前的方法将每个记忆视为独立的,即在评估检索质量时不考虑记忆之间的依赖链,而这些依赖链使得未来记忆的创建成为可能。我们提出了MemQ,它将时间差分(TD)($ ext{λ}$) 资格迹应用于记忆Q值,通过记录每个新记忆创建时检索到的记忆的来源有向无环图(provenance DAG)向后传播信用。信用权重随着DAG深度$d$以$( ext{γλ})^d$的形式衰减,用结构接近度替代时间距离。我们将这一设置形式化为外生上下文马尔可夫决策过程(Exogenous-Context MDP),其分解的转移将外生任务流与内生记忆存储解耦。在六个基准测试中,涵盖操作系统交互、函数调用、代码生成、多模态推理、具身推理和专家级问答,MemQ在所有六个基准的泛化评估和运行时学习中均取得了最高的成功率,其中在产生深层且相关的来源链的多步任务上增益最大(最高可达+5.7~pp),而在单步分类任务上增益最小(+0.77~pp),因为单步更新已经足够。我们进一步研究了$ ext{γ}$和$ ext{λ}$如何与EC-MDP结构相互作用,为参数选择和未来研究提供了原则性的指导。代码将很快发布。
cs.AI / 7 / 2605.08386
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens:成本高效的LLM代理自适应多粒度技能重用
Abstract
Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single-resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill-evolution framework that organizes skills into a four-layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree-corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi-granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill-based baselines, achieving up to a 6.31 percentage-point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.
Chinese Translation
技能库已成为LLM代理在任务间重用程序经验的一种实用方式。然而,现有系统通常将技能视为平面、单一分辨率的提示块。这在相关性和成本之间产生了紧张关系:注入粗糙技能可能会引入不相关或误导性的上下文,而重写整个技能则成本高昂且往往不必要。我们提出了SkillLens,一个分层技能演化框架,将技能组织成政策、策略、程序和原语的四层图,并以混合粒度进行检索。给定一个任务,SkillLens首先检索语义相关的技能种子,通过在技能图上的度校正随机游走进行扩展,然后使用验证器决定每个访问单元是否应被接受、分解、重写或跳过。这使得代理能够直接重用兼容的子技能,同时仅适应局部不匹配的组件。为了随着时间的推移改进系统,SkillLens进一步细化多粒度技能和验证器,以改善其路由决策。我们提供了理论分析,表明在稀疏不匹配假设下,混合粒度适应的成本是次线性的,并且演化更新规则单调改善验证目标,直到达到局部最优。在MuLocbench和ALFWorld上,SkillLens在强大的基于技能的基线之上持续改进,在错误定位任务中实现了高达6.31个百分点的Acc@1增益,并将代理的成功率从45.00%提升至51.31%。
cs.AI / 8 / 2605.08388
PLACO: A Multi-Stage Framework for Cost-Effective Performance in Human-AI Teams
PLACO:一个多阶段框架以实现人机团队的成本效益性能
Abstract
Human-AI teams play a pivotal role in improving overall system performance when neither the human nor the model can achieve such performance on their own. With the advent of powerful and accessible Generative AI models, several mundane tasks have morphed into Human-AI team tasks. From writing essays to developing advanced algorithms, humans have found that using AI assistance has led to an accelerated work pace like never before. In classification tasks, where the final output is a single hard label, it is crucial to address the combination of human and model output. Prior work elegantly solves this problem using Bayes rule, using the assumption that human and model output are conditionally independent given the ground truth. Specifically, it discusses a combination method to combine a single deterministic labeler (the human) and a probabilistic labeler (the classifier model) using the model's instance-level and the human's class-level calibrated probabilities.
Chinese Translation
人机团队在提升整体系统性能方面发挥着关键作用,尤其是在单独的人类或模型无法实现该性能的情况下。随着强大且易于获取的生成式人工智能模型的出现,许多日常任务已转变为人机团队任务。从撰写论文到开发高级算法,人们发现使用人工智能辅助可以前所未有地加快工作节奏。在分类任务中,最终输出是一个单一的硬标签,因此解决人类与模型输出的组合至关重要。先前的研究优雅地利用贝叶斯定理解决了这个问题,假设在人类和模型输出给定真实标签的条件下是条件独立的。具体而言,它讨论了一种组合方法,用于结合一个确定性标签器(人类)和一个概率标签器(分类模型),利用模型的实例级和人类的类别级校准概率。
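The Bayes-rule combination that this abstract attributes to prior work can be sketched as follows. Under the stated conditional-independence assumption, the posterior over the true class is proportional to the human's class-level likelihood times the model's calibrated instance-level probability. The function name `combine`, the confusion-matrix layout, and the numbers are hypothetical and only meant to make the rule concrete.

```python
def combine(human_label, model_probs, confusion):
    """Combine a hard human label with calibrated model probabilities.

    confusion[y][h] is the estimated P(human says h | true class y);
    model_probs[y] is the model's calibrated P(y | x). Assuming the human
    and the model are conditionally independent given the true class,
    P(y | human, model) is proportional to confusion[y][h] * model_probs[y].
    """
    unnormalized = [confusion[y][human_label] * model_probs[y]
                    for y in range(len(model_probs))]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("combined evidence has zero probability")
    return [u / total for u in unnormalized]


# Toy two-class case: the model leans toward class 0, the human says class 1.
posterior = combine(
    human_label=1,
    model_probs=[0.6, 0.4],
    confusion=[[0.9, 0.1],   # true class 0: human is right 90% of the time
               [0.2, 0.8]],  # true class 1: human is right 80% of the time
)
final_label = max(range(len(posterior)), key=posterior.__getitem__)
```

In this toy case the reliable human label outweighs the model's mild preference, so the combined hard label is class 1.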
cs.AI / 9 / 2605.08399
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
CoCoDA:用于工具增强代理的协同进化组合有向无环图
Abstract
Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.
Chinese Translation
工具增强语言模型可以通过外部可执行技能扩展小型语言模型,但扩展工具库带来了一个耦合挑战:随着新的可重用子例程的出现,库必须与规划者共同进化,同时从不断增长的库中检索必须保持在固定的上下文预算内。现有的工具使用和技能库方法通常将工具视为扁平或文本索引的记忆,这导致随着库的规模增长,提示成本增加,并模糊了可执行代码的类型化组合结构。我们提出了CoCoDA,一个通过单一代码原生结构——组合代码有向无环图(DAG)——协同进化规划者和工具库的框架。节点是原始或复合工具,边缘编码调用依赖关系,每个节点存储类型签名、描述、前/后条件规范和示例。在推理时,类型化DAG检索通过符号签名统一修剪候选者,按描述对幸存者进行排名,通过行为规范进行过滤,并通过示例进行消歧,保持在逐渐缩小的候选集上进行昂贵的上下文物化。在训练时,成功的轨迹被折叠成经过验证的复合工具,同时规划者通过DAG诱导的奖励进行更新,该奖励根据其原始扩展大小对复合工具进行评分。我们提供理论结果,表明检索成本降低、亚线性检索时间、在塑形奖励下的组合优势、在保守更新下的单调协同进化以及DAG的良构性。在数学推理、表格分析和代码任务基准测试中,CoCoDA使得一个8B的学生能够在GSM8K和MATH上匹配或超越一个32B的教师,并在强大的工具使用和库学习基准上持续改进。
cs.AI / 10 / 2605.08405
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
信念还是电路?关于上下文图学习的因果证据
Abstract
How do LLMs learn in-context? Is it by pattern-matching recent tokens, or by inferring latent structure? We probe this question using a toy graph random-walk across two competing graph structures. This task's answer is, in principle, decidable: either the model tracks global topology, or it copies local transitions. We present two lines of evidence that neither account alone is sufficient. First, reconstructing the internal representation structure via PCA reveals that at intermediate mixture ratios, both graph topologies are encoded in orthogonal principal subspaces simultaneously. This pattern is difficult to reconcile with purely local transition copying. Second, residual-stream activation patching and graph-difference steering causally intervene on this graph-family signal: late-layer patching almost fully transfers the clean graph preference, while linear steering moves predictions in the intended direction and fails under norm-matched and label-shuffled controls. Taken together, our findings are most consistent with a dual-mechanism account in which genuine structure inference and induction circuits operate in parallel.
Chinese Translation
大型语言模型(LLMs)是如何在上下文中学习的?是通过匹配最近的标记模式,还是通过推断潜在结构?我们通过在两种竞争的图结构上进行玩具图随机游走来探讨这个问题。这个任务的答案在原则上是可判定的:模型要么跟踪全局拓扑,要么复制局部转移。我们提供了两条证据,表明单一的解释都不足以说明现象。首先,通过主成分分析(PCA)重构内部表示结构显示,在中间混合比下,两种图拓扑同时编码在正交的主子空间中。这一模式难以与纯粹的局部转移复制相调和。其次,残差流激活修补和图差异引导在这一图谱信号上因果干预:后层修补几乎完全转移了清晰图偏好,而线性引导则将预测移动到预期方向,并在规范匹配和标签洗牌控制下失败。综合来看,我们的发现最符合一种双机制解释,其中真正的结构推断和归纳电路并行运行。
cs.AI / 11 / 2605.08409
Playing games with knowledge: AI-Induced delusions need game theoretic interventions
与知识玩游戏:AI引发的妄想需要博弈论干预
Abstract
Conversational AI has a fundamental flaw as a knowledge interface: sycophantic chatbots induce epistemic entrenchment and delusional belief spirals even in rational agents. We propose that the problem does not stem from the AI model itself but is instead a systemic consequence of the paradigm shift from user-driven knowledge search to users and agents engaged in strategic, repeated-play communication. We formalize the problem as a Crawford-Sobel cheap talk game, where costless user signals induce a pooling equilibrium. Agents optimized for user satisfaction produce sycophantic strategies that provide identical reinforcement across user types with opposite epistemic incentives: exploratory ``Growth-seekers'' ($\theta_G$) and confirmatory ``Validation-seekers'' ($\theta_V$). Under repeated play, this identification failure creates a coordination trap -- analogous to a Prisoner's Dilemma -- where locally rational feedback loops drive users toward pathologically certain false beliefs. We propose an inference-time mechanism design intervention called an Epistemic Mediator that breaks this pooling equilibrium by introducing a costly signal (epistemic friction), forcing type revelation based on users' asymmetric cognitive costs for processing resistance. A key contribution is Belief Versioning, a git-inspired epistemic meta-memory system that stores healthy beliefs and rolls back to them when validation-seeking resistance is detected. In simulation, this intervention achieves a separating equilibrium with a $48\times$ differential in spiral rates while passing a learning preservation criterion, evidence that epistemic safety in AI is fundamentally a problem of strategic information environment design rather than simple model alignment.
Chinese Translation
对话式人工智能作为知识接口存在根本缺陷:谄媚型聊天机器人即使在理性主体中也会引发认识论的固化和妄想信念的螺旋。我们提出这个问题并非源于AI模型,而是源于从用户驱动的知识搜索向用户与代理进行战略性、重复互动的交流转变所带来的系统性后果。我们将该问题形式化为Crawford-Sobel廉价对话博弈,其中无成本的用户信号导致了一个汇聚均衡。为了优化用户满意度,代理采用谄媚策略,在具有相反认识论激励的用户类型之间提供相同的强化:探索型“成长寻求者”($\theta_G$)和确认型“验证寻求者”($\theta_V$)。在重复博弈中,这种识别失败造成了一个协调陷阱——类似于囚徒困境——在此,局部理性的反馈循环将用户引向病态确定的虚假信念。我们提出了一种推理时机制设计干预,称为认识中介(Epistemic Mediator),通过引入成本信号(认识摩擦)打破这一汇聚均衡,迫使用户基于其处理抵抗的非对称认知成本进行类型揭示。一个关键贡献是信念版本控制(Belief Versioning),这是一个受git启发的认识元记忆系统,能够存储健康信念并在检测到验证寻求的抵抗时进行回滚。在模拟中,该干预实现了一个分离均衡,达到了$48\times$的螺旋率差异,同时满足学习保存标准,证明了AI中的认识安全本质上是一个战略信息环境设计的问题,而非简单的模型对齐问题。
cs.AI / 12 / 2605.08415
Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models
政治可塑性:大型语言模型中意识形态适应性的分析
Abstract
Since the advent of Large Language Models (LLMs), a significant area of research has focused on their intrinsic biases, particularly in political discourse. This study investigates a different but related concept, "political plasticity", which is defined as the capacity of models to adapt their responses based on the user supplied context. To analyze this, a testing framework was developed using an expanded corpus of 200 politically-oriented questions across economic and personal freedom axes, based on a prior framework by Lester (1996). The study explored several methods to induce political bias, including simplified and topic-based system prompts, as well as user prompts with few-shot examples. The results show that while system prompts were largely ineffective, user prompts successfully elicited significant ideological shifts, particularly along the Economic Freedom axis in larger and newer models. Through a validation experiment, we examined whether models answer questionnaires by recognizing the underlying question format. Inverting the sense of the questions revealed unexpected, counter-intuitive shifts in most models, suggesting potential data leakage. Finally, we also analyzed how model plasticity varies when the experiment is conducted in different languages. The results reveal subtle yet notable shifts across each of the analyzed languages. Overall, our results indicate that small and older LLMs exhibit limited or unstable political plasticity, whereas newer frontier models display reliable, expected adaptability.
Chinese Translation
自大型语言模型(LLMs)问世以来,研究的一个重要领域集中在其内在偏见上,特别是在政治话语中。本研究探讨了一个不同但相关的概念,即“政治可塑性”,定义为模型根据用户提供的上下文调整其响应的能力。为此,研究开发了一个测试框架,使用基于Lester(1996)先前框架的200个政治导向问题的扩展语料库,涵盖经济和个人自由两个维度。研究探索了几种引导政治偏见的方法,包括简化的主题系统提示以及带有少量示例的用户提示。结果表明,尽管系统提示在很大程度上无效,但用户提示成功引发了显著的意识形态转变,尤其是在较大和较新的模型中沿经济自由维度的转变。通过验证实验,我们检查了模型是否通过识别潜在的问题格式来回答问卷。反转问题的意义在大多数模型中揭示了意外的、反直觉的转变,暗示可能存在数据泄漏。最后,我们还分析了在不同语言中进行实验时模型可塑性的变化。结果显示,在每种分析语言中都有微妙但显著的转变。总体而言,我们的结果表明,小型和旧型LLMs表现出有限或不稳定的政治可塑性,而较新的前沿模型则显示出可靠且可预期的适应性。
cs.AI / 13 / 2605.08416
Alignment as Jurisprudence
作为法理学的对齐
Abstract
Jurisprudence, the study of how judges should properly decide cases, and alignment, the science of getting AI models to conform to human values, share a fundamental structure. These seemingly distant fields both seek to predict and shape how decisions by powerful actors, in one case judges and in the other increasingly powerful artificial intelligences, will be made in the unknown future. And they use similar tools of the specification and interpretation of language to try to accomplish those goals. The great debates of jurisprudence, about what the law is and what it should be, can provide insight into alignment, and lessons from what does and does not work in alignment can help make progress in jurisprudence. This essay puts the two fields directly into conversation. Drawing on leading accounts of jurisprudence, particularly Dworkin's principle-oriented interpretivism and Sunstein's positivist account of law as analogical reasoning, and on cutting-edge alignment approaches, namely Constitutional AI and case-based reasoning, it illustrates the value of a more sophisticated legally-inspired approach to the interplay of rules and cases in finetuning alignment and points to ways that AI can provide a better understanding of how the law works and how it can be improved by the introduction of AI. AI systems and the law should operate to empower people to act in the world, helping to expand their capabilities and the extent to which they are able to achieve their goals. As AI continues to improve in capacity, and as the constraints that legal theory places on human judges seem to be coming undone, the conversation between these two fields will become increasingly essential and may help point to a better version of both.
Chinese Translation
法理学是研究法官应如何正确裁决案件的学科,而对齐是使人工智能模型符合人类价值观的科学,这两个领域具有基本的结构相似性。这两个看似遥远的领域都旨在预测和塑造强大行为者的决策方式,在一个案例中是法官,而在另一个案例中是日益强大的人工智能,尤其是在未知的未来中。它们使用相似的语言规范和解释工具来实现这些目标。法理学中的重大辩论,关于法律是什么以及应该是什么,可以为对齐提供见解,而对齐中有效和无效的经验教训也可以帮助法理学的进步。本文将这两个领域直接置于对话之中。借鉴法理学的主要理论,特别是德沃金(Dworkin)的原则导向解释主义和桑斯坦(Sunstein)将法律视为类比推理的实证主义观点,以及前沿的对齐方法,即宪法人工智能(Constitutional AI)和基于案例的推理,本文阐明了一种更复杂的法律启发式方法在微调对齐中规则与案例的相互作用的价值,并指出人工智能如何提供更好的理解法律运作及其如何通过引入人工智能得到改善的途径。人工智能系统与法律应当共同运作,以赋权人们在世界中行动,帮助扩展他们的能力以及实现目标的范围。随着人工智能在能力上的持续提升,以及法律理论对人类法官施加的限制似乎正在消解,这两个领域之间的对话将变得愈加重要,并可能指向两者更好的版本。
cs.AI / 14 / 2605.08427
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
镜中的攻击者:通过锚定双策略自我博弈打破安全中的自我一致性
Abstract
Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. In relation to standard self-play, we show up to 100x greater parameter efficiency than finetuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.
Chinese Translation
自我博弈红队是一种提高人工智能安全性的成熟方法,其中同一模型的不同实例在零和博弈中扮演攻击者和防御者角色,即攻击者试图越狱防御者;如果自我博弈收敛到纳什均衡,则模型在游戏设置中保证安全响应。尽管通过使用相同模型进行两个角色的参数共享提高了稳定性和性能,但它引入了基本的理论和架构限制。我们展示了可以达到的纳什均衡集对应于一类广泛的行为,包括简单的始终拒绝策略和类Oracle防御者,从而限制了实际应用性。接着我们表明,当攻击者和防御者共享并更新相同的基础模型时,动态会崩溃为自我一致性,导致攻击未能对防御者施加对抗压力。对此,我们提出了锚定双策略自我博弈(Anchored Bipolicy Self-Play),该方法在冻结的基础模型上训练特定角色的独立LoRA适配器,从而在保持稳定优化的同时,通过明确的角色分离保持对抗压力。与标准自我博弈相比,我们展示了高达100倍的参数效率提升,并且与自我博弈微调模型相比,在安全性上持续改善。我们在Qwen2.5-{3B, 7B, 14B}-IT模型上评估,涵盖广泛使用的安全基准,显示出在不损失推理能力的情况下提高了鲁棒性。交叉博弈实验进一步表明,我们的攻击者和防御者模型在对抗防御和安全性方面优于自我博弈。
cs.AI / 15 / 2605.08445
Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare
衡量重要性:在医疗保健中对生成性、多模态和代理性人工智能的基准测试
Abstract
AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and validation datasets were never designed to capture. Evaluating these systems requires benchmarks: structured combinations of tasks, datasets, and metrics that enable reproducible, comparable measurement of what a model can do. The central challenge in healthcare AI is not performance alone, but the absence of systematic methods to measure reliability, safety, and clinical relevance under real-world conditions. Most existing benchmarks test what a model knows; too few test whether it can perform reliably and without failing across the full complexity of real clinical tasks. Current benchmarks have accumulated through ad hoc dataset construction optimized for narrow task performance: frontier models achieve near-perfect scores on medical licensing examinations, but when evaluated across real clinical tasks, performance degrades sharply, scoring 0.74--0.85 on documentation, 0.61--0.76 on clinical decision support, and only 0.53--0.63 on administrative and workflow tasks \cite{medhelm}. High benchmark scores give a false sense of deployment readiness, and the gap between performance and utility widens precisely as AI systems take on more consequential clinical roles. Without a principled framework for benchmark design, the field cannot determine whether poor clinical performance reflects model limitations or failures in how performance is being measured.
Chinese Translation
人工智能模型越来越多地部署在实时临床环境中,这些模型必须在复杂、高风险的工作流程中可靠地执行,而标准的训练和验证数据集从未设计用于捕捉这些工作流程。评估这些系统需要基准测试:任务、数据集和指标的结构化组合,使得能够对模型的能力进行可重复、可比较的测量。在医疗保健人工智能中,核心挑战不仅仅是性能本身,而是缺乏系统的方法来在真实世界条件下测量可靠性、安全性和临床相关性。现有的大多数基准测试只测试模型的知识;能够在真实临床任务的全部复杂性中可靠执行而不失败的测试则太少。目前的基准测试是通过为狭窄任务性能优化的临时数据集构建而积累的:前沿模型在医学执照考试中几乎获得完美分数,但在真实临床任务中的评估表现急剧下降,文档任务得分为0.74-0.85,临床决策支持得分为0.61-0.76,而行政和工作流程任务的得分仅为0.53-0.63 \cite{medhelm}。高基准分数给人一种部署准备好的错误印象,而随着人工智能系统承担更重要的临床角色,性能与实用性之间的差距正不断扩大。如果没有一个原则性的基准设计框架,该领域无法确定较差的临床表现是反映模型的局限性还是性能测量方式的失败。
cs.AI / 16 / 2605.08448
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
基于大型语言模型(LLM)指导的半监督社交媒体危机数据分类方法
Abstract
Semi-supervised learning approaches have been investigated as a means to enhance the analysis of social media data in disaster management contexts. In this work, we present the first empirical evaluation of large language model (LLM)-guided semi-supervised learning for crisis-related tweet classification. We compare two recent LLM-assisted semi-supervised methods, VerifyMatch and LLM-guided Co-Training (LG-CoTrain), against established semi-supervised baselines. Our results show that LG-CoTrain significantly outperforms classical semi-supervised approaches in low-resource settings with 5, 10, and 25 labeled examples per class, achieving the highest averaged Macro F1 across events. VerifyMatch achieves competitive performance while also demonstrating strong calibration properties. As the number of labeled examples increases, the performance gap narrows and Self Training emerges as a strong baseline. We further observe that compact semi-supervised models can, in some cases, outperform very large LLMs operating in zero-shot settings. This finding highlights the potential of transferring knowledge from LLMs into smaller and more deployable models through LLM-guided semi-supervised learning, offering a practical pathway for real-world disaster response applications. Our project repository is available on GitHub.
Chinese Translation
半监督学习方法已被研究作为增强灾害管理背景下社交媒体数据分析的一种手段。在本研究中,我们首次对基于大型语言模型(LLM)指导的半监督学习在危机相关推文分类中的应用进行了实证评估。我们将两种最近的LLM辅助半监督方法——VerifyMatch和LLM指导的协同训练(LG-CoTrain)与已建立的半监督基线进行了比较。我们的结果表明,在每类仅有5、10和25个标记示例的低资源环境中,LG-CoTrain显著优于经典的半监督方法,并在各事件中实现了最高的平均宏F1值。VerifyMatch表现出竞争力,同时也展现了良好的校准特性。随着标记示例数量的增加,性能差距逐渐缩小,自我训练(Self Training)成为一个强有力的基线。我们进一步观察到,在某些情况下,紧凑的半监督模型可以超越在零样本设置下运行的非常大型的LLM。这一发现突显了通过基于LLM的半监督学习将知识转移到更小且更易于部署的模型中的潜力,为现实世界的灾害响应应用提供了一条实用的路径。我们的项目代码库在Github上可查阅。
cs.AI / 17 / 2605.08463
Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification
社交网络中部署的人工智能代理的行为决定因素:个性、模型和保护措施规范的多因素研究
Abstract
Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi-factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook -- a Reddit-like social network built for AI agents -- across three systematically varied independent variables: (1) personality specification via SOUL.md, (2) underlying LLM model backbone, and (3) operational rules and memory configuration via AGENTS.md. A default control agent provides a behavioral baseline. Over a one-week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi-agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.
Chinese Translation
自主人工智能代理在开放社交环境中的部署日益增多,但其配置规范与其涌现社交行为之间的关系仍然不够清晰。我们呈现了一项受控的多因素实证研究,其中十三个OpenClaw代理在Moltbook(一个为人工智能代理构建的类似Reddit的社交网络)上进行部署,涉及三个系统性变化的自变量:(1)通过SOUL.md进行的个性规范,(2)基础的LLM模型骨架,以及(3)通过AGENTS.md进行的操作规则和记忆配置。一个默认控制代理提供了行为基线。在为期一周的观察窗口内,涵盖每个代理大约400个自主会话,我们收集了行为、语言和社交指标,以评估配置层如何预测涌现的社交行为。我们发现个性规范是主导的行为杠杆,导致代理之间响应长度的巨大差异,而模型骨架和操作规则则对修辞风格和主题参与广度产生了更温和但仍然重要的影响。我们的研究为新兴的多代理社交系统文献提供了实证证据,并为设计旨在进行协作或监控任务的代理提供了实际指导,以适应真实社交环境。
cs.AI / 18 / 2605.08472
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
中期训练结合自生成数据提升语言模型中的强化学习效果
Abstract
The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks such as code generation and narrative reasoning. Overall, our investigative study shows that having a language model learn multiple problem-solving approaches through self-generated data helps subsequent RL.
Chinese Translation
强化学习(RL)在大型语言模型(LLMs)中的有效性依赖于在RL之前和期间使用的数据的性质和多样性。特别是,推理问题通常可以通过多种方式进行处理,这些方式依赖于不同形式的推理,而训练数据中仅接触到有限范围的这些方法可能会限制RL的有效性。基于此动机,我们研究在中期训练期间使用多样的自生成数据作为RL训练之前的中间步骤。具体而言,我们采用一种基于乔治·波利亚(George Polya)问题解决方法的自助式数据生成框架,为训练数据中的每个问题生成多个正确答案的变体,然后进行微调。我们首先从理论角度提供了如何在此类数据上进行中期训练以改善RL的视角,并解释了策略梯度更新如何激励结合多种方法。接着,我们通过实证研究证明,使用我们的中期训练数据初始化的RL训练模型在各种数学推理基准和其他超出分布(OOD)任务(如代码生成和叙事推理)中取得了一致的改进。总体而言,我们的研究表明,通过自生成数据学习多种问题解决方法的语言模型有助于后续的RL训练。
cs.AI / 19 / 2605.08480
AI-Care: A Conversational Agentic System for Task Coordination in Alzheimer's Disease Care
AI-Care:一种用于阿尔茨海默病护理任务协调的对话代理系统
Abstract
Individuals with Alzheimer's disease (AD) and Alzheimer's disease-related dementia (ADRD) experience memory and thinking changes that impact their ability to use digital daily management tools. For example, adding an event to a digital calendar requires multiple steps that may act as barriers to independent use for individuals with AD/ADRD. This paper presents AI-Care, a conversational agentic artificial intelligence (AI) layer built on top of a remote caregiving platform co-designed with people with AD/ADRD. AI-Care is designed to reduce the cognitive load on individuals with AD/ADRD when managing everyday tasks such as setting calendar reminders and organizing to-do lists through natural-language interaction with a voice-first chatbot. The system uses a LangGraph-based stateful orchestration approach in which each request passes through sanitization, intent classification, context loading, safety checks, deterministic slot collection, tool execution, and response composition. Safety-critical responses, particularly around medications and allergies, are grounded in caregiver-verified records rather than free-form model generation. The system does not make autonomous medical or treatment decisions. Incomplete or ambiguous requests are handled through controlled multi-turn clarification rather than silent failure or guessing. The system supports both typed and spoken input, with voice output through ElevenLabs text-to-speech. Longer responses are chunked before synthesis to avoid rushed playback. A preliminary pilot with four individuals with mild-to-moderate AD/ADRD showed that users found the system trustworthy, competent, and likable, and were able to complete the evaluated coordination tasks through conversation. We describe the design goals, system architecture, safety controls, and findings from this formative evaluation.
Chinese Translation
阿尔茨海默病(AD)及与阿尔茨海默病相关的痴呆(ADRD)患者经历记忆和思维的变化,这影响了他们使用数字日常管理工具的能力。例如,向数字日历添加事件需要多个步骤,这可能成为AD/ADRD患者独立使用的障碍。本文介绍了AI-Care,一个建立在与AD/ADRD患者共同设计的远程护理平台之上的对话代理人工智能(AI)层。AI-Care旨在通过与以语音为主的聊天机器人进行自然语言交互,减少AD/ADRD患者在管理日常任务(如设置日历提醒和组织待办事项列表)时的认知负担。该系统采用基于LangGraph的有状态编排方法,其中每个请求都经过清理、意图分类、上下文加载、安全检查、确定性槽位收集、工具执行和响应组合等步骤。安全关键的响应,特别是在药物和过敏方面,基于护理人员验证的记录,而不是自由形式的模型生成。该系统不做自主的医疗或治疗决策。对于不完整或模糊的请求,通过受控的多轮澄清进行处理,而不是沉默失败或猜测。该系统支持输入方式包括键入和语音,语音输出通过ElevenLabs的文本转语音实现。较长的响应在合成前被分块,以避免播放时的匆忙。与四名轻度至中度AD/ADRD患者的初步试点显示,用户认为该系统值得信赖、能力强且令人愉快,并能够通过对话完成评估的协调任务。我们描述了设计目标、系统架构、安全控制以及这一形成性评估的发现。
cs.AI / 20 / 2605.08496
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
潜在人格对齐:在不提及危害的情况下提高无害性
Abstract
Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks -- without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.
Chinese Translation
当前针对大型语言模型的对抗鲁棒性方法需要大量的有害提示数据集(从数千到数十万个示例),但仍然对新型攻击向量和分布变化保持脆弱。我们提出了潜在人格对齐(Latent Personality Alignment, LPA),这是一种样本高效的防御方法,通过训练模型在抽象的人格特征上而非具体的有害行为上实现鲁棒性。使用不到100条特征陈述和潜在对抗训练,LPA在攻击成功率上与基于150k+示例训练的方法相当,同时保持更优的效用。关键是,LPA在未见过的攻击分布上具有更好的泛化能力,相较于基线在六个危害基准上将误分类率降低了2.6倍——在训练过程中从未见过有害示例。我们的结果表明,基于人格的对齐提供了一种以最小成本构建鲁棒防御的原则性方法。
cs.AI / 21 / 2605.08516
OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
OracleTSC:基于Oracle的信息奖励门槛与不确定性正则化的交通信号控制
Abstract
Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning-based TSC methods function as black boxes with limited interpretability. Although large language models (LLMs) can provide natural language reasoning, reinforcement finetuning for TSC remains unstable because feedback is sparse and delayed, while most actions produce only marginal changes in congestion metrics. We introduce OracleTSC, which stabilizes LLM-based TSC through two mechanisms: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental rewards, and (2) uncertainty regularization that maximizes the probability of the selected response to encourage consistent decisions across sampled outputs. Experiments on the LibSignal benchmark show that OracleTSC enables a compact LLaMA3-8B model to substantially improve traffic efficiency, achieving a 75% reduction in travel time and a 67% decrease in queue length compared with the pretrained baseline while preserving interpretability through natural language explanations. OracleTSC also demonstrates strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally different intersection with 17% lower travel time and 39% lower queue length without additional finetuning. These results suggest that uncertainty-aware reward shaping can improve the stability and effectiveness of reinforcement fine-tuning for TSC.
Chinese Translation
透明的决策过程对交通信号控制(TSC)系统赢得公众信任至关重要。然而,传统的基于强化学习的TSC方法作为黑箱运作,解释性有限。尽管大型语言模型(LLMs)能够提供自然语言推理,但由于反馈稀疏且延迟,强化微调在TSC中的应用仍然不稳定,而大多数动作仅对拥堵指标产生边际变化。我们提出了OracleTSC,通过两种机制稳定基于LLM的TSC:(1)奖励门槛机制,通过从环境奖励中减去校准阈值来过滤弱学习信号;(2)不确定性正则化,最大化所选响应的概率,以鼓励在采样输出中做出一致的决策。在LibSignal基准上的实验表明,OracleTSC使得紧凑的LLaMA3-8B模型显著提高了交通效率,与预训练基线相比,旅行时间减少了75%,排队长度减少了67%,同时通过自然语言解释保持了解释性。OracleTSC还展示了强大的交叉交叉口泛化能力:在一个交叉口训练的策略能够转移到结构上不同的交叉口,旅行时间降低了17%,排队长度降低了39%,且无需额外微调。这些结果表明,关注不确定性的奖励塑造可以提高TSC的强化微调的稳定性和有效性。
cs.AI / 22 / 2605.08518
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
CODS 2025 AssetOpsBench挑战的结果与回顾分析
Abstract
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive{} challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on \assetops{}. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion \assetopslive{} system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73\%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning ($r{=}0.69$) but negatively in execution ($r{=}{-}0.13$), with several 45.45\% public execution systems reaching 63.64\% on the hidden set. Third, the \tmatch{} term is numerically almost inert in the official composite -- combined on a 0--1 scale with 0--100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3\% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails -- response selection, contamination cleanup, fallback, and context control -- rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.
Chinese Translation
竞赛回顾在解释排行榜所测量的内容、隐藏评估如何改变结论以及哪些设计模式获得奖励时非常有用。我们重新审视了CODS 2025 \texttt{assetopslive}挑战,这是一个基于\texttt{assetops}的工业多智能体编排的隐私意识Codabench竞赛。我们结合了最终排名表、300次提交的服务器日志、149个团队注册、最佳提交导出、组织者获奖报告、伴随的\texttt{assetopslive}系统论文以及经过验证的规划轨迹源代码树。五个结果突出。首先,公共规划排行榜饱和在72.73\%,更丰富的提示并未改善这一峰值。其次,隐藏评估改变了故事:公共和私有分数在规划中中等相关($r{=}0.69$),但在执行中呈负相关($r{=}{-}0.13$),多个公共执行系统在公共执行中达到45.45\%,而在隐藏集上达到63.64\%。第三,\texttt{tmatch}项在官方复合指标中数值上几乎无效:在0到1的尺度上与0到100的百分比分数结合时,每个轨道最多贡献0.05分,重新缩放将交换前两名团队。第四,竞赛在操作上是基于账户的,但在实质上是基于团队的:149个注册团队减少到24个具有非零公共分数的团队和11个完全排名的团队,而52.3\%的去重注册列出了多个用户名。第五,成功的执行方法主要改善了保护措施(响应选择、污染清理、后备和上下文控制),而不是新颖的智能体架构。这些发现识别了评估所奖励的行为,并激励了规模感知的复合指标、技能水平诊断和版本化的工件发布。
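The scale mismatch described in the third finding can be illustrated with a small sketch. Everything numeric here is an assumption made for illustration: the weights, the team names, and the scores are hypothetical and are not taken from the competition. Only the idea (a 0--1 term summed with 0--100 percentage scores) comes from the abstract.

```python
# Hypothetical illustration of a scale-mismatched composite score.
# Weights, team names, and scores are invented; only the 0-1 vs 0-100
# mismatch is taken from the abstract.
W_ACC = 0.95     # assumed weight on the percentage accuracy term
W_TMATCH = 0.05  # assumed weight on the tmatch term


def composite(acc_pct, tmatch, rescale=False):
    """Weighted sum of an accuracy score (0-100) and a tmatch score (0-1).

    With rescale=True the tmatch term is first mapped onto 0-100 so that
    both terms share one scale.
    """
    t = tmatch * 100.0 if rescale else tmatch
    return W_ACC * acc_pct + W_TMATCH * t


# (accuracy in percent, tmatch in [0, 1]) for two made-up teams.
teams = {"team_a": (72.73, 0.20), "team_b": (72.00, 0.90)}


def ranking(rescale):
    """Team names sorted from best to worst composite score."""
    return sorted(
        teams,
        key=lambda name: composite(*teams[name], rescale=rescale),
        reverse=True,
    )


raw_order = ranking(rescale=False)
rescaled_order = ranking(rescale=True)
print("raw:", raw_order, "rescaled:", rescaled_order)
```

Under these assumed numbers, the unscaled tmatch term can move a composite by at most `W_TMATCH` points, so the ranking follows accuracy alone, whereas putting both terms on the same scale lets a large tmatch difference reorder the two teams.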
cs.AI / 23 / 2605.08533
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care
人机对话提升急救护理中的诊断准确性
Abstract
Clinical decision-making in emergency medicine demands rapid, accurate diagnoses under uncertainty. Despite benchmark progress, evidence for LLMs as interactive aids in live physician workflows remains sparse. MedSyn lets physicians iteratively query an LLM provided with the full clinical record while initially viewing only the chief complaint. Seven physicians (three seniors, four residents) completed baseline and AI-assisted sessions across 52 MIMIC-IV cases stratified by difficulty. Blinded evaluation showed residents' Hard-case correctness rose from 0.589 to 0.734; difficulty-standardised completely-correct rates confirmed a medium effect ({\Delta} = 0.092; p = 0.071; d = 0.47). Automated metrics corroborated these gains: standardised any-match accuracy improved by 0.156 (p < 0.0001), and residents showed the largest F1 gain ({\Delta} = 0.138; p < 0.0001). Dialogue analysis revealed expertise-dependent strategies (seniors asked targeted, hypothesis-driven questions; residents relied on broader queries) and cross-expertise concordance increased ({\Delta} = 0.145; p < 0.0001). Interactive LLM support meaningfully enhances diagnostic reasoning.
Chinese Translation
急救医学中的临床决策要求在不确定性下迅速、准确地进行诊断。尽管基准测试取得了进展,但关于大型语言模型(LLMs)作为实时医生工作流程中互动辅助工具的证据仍然稀缺。MedSyn 允许医生在仅查看主要症状的情况下,迭代查询提供完整临床记录的 LLM。七名医生(包括三名资深医生和四名住院医生)在52个按难度分层的 MIMIC-IV 案例中完成了基线和 AI 辅助会话。盲评结果显示,住院医生在困难案例中的正确率从 0.589 上升至 0.734;难度标准化的完全正确率确认了中等效应({\Delta} = 0.092;p = 0.071;d = 0.47)。自动化指标证实了这些提升:标准化的任意匹配准确率提高了 0.156(p < 0.0001),住院医生的 F1 值增幅最大({\Delta} = 0.138;p < 0.0001)。对话分析揭示了依赖专业知识的策略(资深医生提出针对性、假设驱动的问题;住院医生则依赖更广泛的查询),跨专业一致性也有所增加({\Delta} = 0.145;p < 0.0001)。互动 LLM 支持显著增强了诊断推理能力。
cs.AI / 24 / 2605.08538
Human-Inspired Memory Architecture for LLM Agents
基于人类启发的记忆架构用于大型语言模型代理
Abstract
Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically-grounded memory architecture comprising six cognitive mechanisms: (1) sleep-phase consolidation, (2) interference-based forgetting, (3) engram maturation, (4) reconsolidation upon retrieval, (5) entity knowledge graphs, and (6) hybrid multi-cue retrieval. Each mechanism addresses a specific failure mode of naive memory accumulation. We introduce a synthetic calibration methodology that derives all pipeline thresholds without benchmark data exposure, eliminating a common source of evaluation leakage. We evaluate on two benchmarks. First, a VSCode issue-tracking dataset (13K issues, 120K events) where deduplication-based consolidation achieves 97.2% retention precision with 58% store reduction (+21.8 pp over baseline). Second, the LongMemEval personal-chat benchmark where we conduct the first streaming M-tier evaluation (475 sessions, ~540K unique turns). At a 200K-token context budget, our pipeline matches raw retrieval accuracy (70.1% vs. 71.2%, overlapping 95% CI) while exposing a tunable accuracy/store-size operating curve. At S-tier scale (50 sessions), dedup-based consolidation yields a +13.3 pp improvement in preference recall.
Chinese Translation
当前的大型语言模型(LLM)代理缺乏在长时间交互中管理持久记忆的原则性机制。我们提出了一种基于生物学的记忆架构,包括六种认知机制:(1) 睡眠阶段巩固,(2) 基于干扰的遗忘,(3) 记忆痕迹成熟,(4) 取回时的再巩固,(5) 实体知识图谱,以及 (6) 混合多线索检索。每种机制都针对天真记忆积累的特定失效模式。我们引入了一种合成校准方法,能够在不暴露基准数据的情况下推导出所有管道阈值,从而消除评估泄漏的常见来源。我们在两个基准上进行了评估。首先是一个 VSCode 问题跟踪数据集(13K 问题,120K 事件),在该数据集中,基于去重的巩固实现了 97.2% 的保留精度,并减少了 58% 的存储量(比基线提高了 21.8 个百分点)。其次是 LongMemEval 个人聊天基准,我们进行了首次流式 M 级评估(475 次会话,约 540K 独特回合)。在 200K 令牌的上下文预算下,我们的管道匹配了原始检索精度(70.1% 对比 71.2%,重叠 95% 置信区间),同时展现出可调的精度/存储规模操作曲线。在 S 级规模(50 次会话)下,基于去重的巩固在偏好召回中提高了 13.3 个百分点。
cs.AI / 25 / 2605.08545
Log analysis is necessary for credible evaluation of AI agents
日志分析是对人工智能代理进行可信评估的必要条件
Abstract
Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.
Chinese Translation
代理基准通常仅报告最终结果:通过或未通过。这在三个方面威胁到评估的可信度。首先,分数可能因捷径和基准伪影而被夸大或缩小,从而误导能力的表现。其次,由于支架限制和反复出现的失败模式,基准性能可能无法预测实际应用的效用。最后,能力分数可能掩盖代理所采取的危险或灾难性行为。我们认为,日志分析——对人工智能代理的输入、执行和输出进行系统跟踪和分析——是克服这些有效性威胁并促进可信代理评估的必要手段。在本文中,我们(1)提出了通过日志分析记录的可信评估威胁的分类法,以及(2)制定了一套日志分析的指导原则。我们在 tau-Bench Airline 上展示了这些原则,揭示了 pass^5 性能的评估低估了近 50%,并揭示了结果指标无法察觉的部署失败模式。最后,我们提出了务实的建议,以增加日志分析的应用,面向包括基准创建者、模型开发者、独立评估者和部署者在内的多方利益相关者。
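The pass^5 figure quoted above is a reliability-style metric. The sketch below assumes the definition commonly used with tau-bench: pass^k is the chance that k independent trials of a task all succeed, estimated from n trials with c successes as C(c, k) / C(n, k) and averaged over tasks. The function names and the example counts are illustrative and do not come from the paper.

```python
from math import comb


def pass_hat_k(n, c, k):
    """Estimate pass^k for one task from n trials with c successes.

    Assumed estimator: the fraction of size-k subsets of the n trials
    in which every trial succeeded, i.e. C(c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate over (n, c) pairs, one per task."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Made-up outcomes for three tasks, each run 8 times.
example_results = [(8, 8), (8, 6), (8, 4)]
score = benchmark_pass_hat_k(example_results, k=5)
print(f"pass^5 estimate: {score:.3f}")
```

Because every one of the k trials must succeed, a task solved in only half of its runs contributes nothing at k=5, which is why this metric is sensitive to how well an agent's capability is elicited.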
cs.AI / 26 / 2605.08549
Evaluating Developmental Cognition Capabilities of LLMs
评估大型语言模型的认知发展能力
Abstract
Conversational AI is increasingly personalized around users' preferences, histories, goals, and knowledge, but much less around how users interpret and take up model outputs to construct and understand their reality. We draw on Robert Kegan's constructive-developmental theory as a complementary lens on this dimension. Existing methods for assessing developmental stage in the Keganian tradition rely either on expert interviews that do not scale or on sentence-completion instruments that are proprietary, lengthy, or invasive. To make this perspective tractable for LLM evaluation, we introduce the Developmental Sentence Completion Test (DSCT), a 20-item instrument designed to elicit developmental signal in self-administered text. Throughout, we treat the resulting labels as characterizations of stage-like structure in elicited responses, not as validated person-level developmental stage. We then ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model-generated answers. On simulated personas, top frontier models recover simulator-intended labels with high accuracy. On real human DSCT responses, human-LLM agreement is fair, with much stronger within-neighborhood than exact agreement. Finally, when LLMs answer DSCT prompts without persona-conditioning, their responses exhibit stable stage-like differences across model families, with larger and newer models tending to generate higher-rated text. These results suggest that stage-conditioned signal is cleaner in synthetic responses than in human-written DSCT text, and that the core constraint for stage-aware conversational AI is not classifier accuracy alone, but the availability of developmental signal from elicited text.
Chinese Translation
对话式人工智能越来越多地围绕用户的偏好、历史、目标和知识进行个性化,但在用户如何解读和利用模型输出以构建和理解其现实方面却相对较少关注。我们借鉴了罗伯特·凯根(Robert Kegan)的建构性发展理论,作为这一维度的补充视角。现有的评估凯根传统中发展阶段的方法要么依赖于无法扩展的专家访谈,要么依赖于专有、冗长或侵入性的句子完成工具。为了使这一视角适用于大型语言模型(LLMs)的评估,我们引入了发展句子完成测试(Developmental Sentence Completion Test, DSCT),这是一个设计用于在自我管理文本中引发发展信号的20项工具。在整个过程中,我们将生成的标签视为对引发响应中阶段性结构的描述,而不是经过验证的个体发展阶段。接着,我们探讨在三个引发响应模式下,LLMs能够恢复多少信号:模拟角色、真实人类响应者和默认模型生成的答案。在模拟角色上,顶尖前沿模型以高准确度恢复模拟器预期的标签。在真实人类的DSCT响应中,人类与LLM的协议较为公平,邻域内的协议明显强于精确协议。最后,当LLMs在没有角色条件下回答DSCT提示时,其响应在模型家族之间表现出稳定的阶段性差异,较大和较新的模型往往生成更高评分的文本。这些结果表明,阶段条件信号在合成响应中比在人类撰写的DSCT文本中更为清晰,而对阶段感知的对话式人工智能的核心约束不仅仅是分类器的准确性,而是来自引发文本的可用发展信号。
cs.AI / 27 / 2605.08563
Why Retrying Fails: Context Contamination in LLM Agent Pipelines
重试失败的原因:大型语言模型代理管道中的上下文污染
Abstract
When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window -- contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam's method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.
Chinese Translation
当大型语言模型(LLM)代理在执行多步骤工具增强任务时失败并进行重试时,失败的尝试通常会保留在其上下文窗口中——污染下一次尝试并使每一步的错误率超过基础水平。这种上下文污染重启现象在实践中广泛观察到,但完全缺乏正式的处理。我们引入了上下文污染重启模型(CCRM):一个包含 T 个工具调用步骤的链,每个步骤以基础率 epsilon_0 失败;在任何失败尝试之后,随后的尝试在污染的上下文中进行,错误率提升至 epsilon_1 > epsilon_0。在此模型下,我们推导出五个主要结果。(R1) 一个关于在最多 K 次尝试中成功的精确闭合形式公式。(R2) 一个级联开销定理,给出由于污染而产生的额外尝试 Delta K 与干净重启基线的比较。(R3) 一个最优预算分配定理,识别出最大化固定总预算 B=KT 成功概率的管道深度 T*;我们证明了闭合形式 T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))),其中 K*=B/T*。(R4) 通过 Le Cam 方法得出的信息论下界,表明 K_CCRM 紧密至 O(1)。(R5) 一个干净重启优势定理,量化重试前清除上下文的确切好处。我们在真实的 SWE-bench Verified 数据上验证了 CCRM:IID 模型高估了 pass@3 17.4 个百分点(98.6% 对比 81.2%),而 CCRM 的拟合误差小于 0.001,暗示级联比率为 epsilon_1/epsilon_0 = 7.1。蒙特卡洛实验确认了所有理论预测。
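The retry model above can be made concrete with a short sketch. It rests on one assumed reading of the abstract: steps fail independently, the first attempt uses the base rate epsilon_0, and every attempt after a failure uses epsilon_1. The closed form in `p_success_within` follows from that reading and is not claimed to be the paper's exact result (R1); `t_star_as_quoted` only transcribes the formula quoted in the abstract without re-deriving it. All function names and parameter values are illustrative.

```python
import math
import random


def attempt_success_prob(eps, T):
    """Probability that all T independent steps succeed at per-step error eps."""
    return (1.0 - eps) ** T


def p_success_within(K, T, eps0, eps1):
    """P(at least one of K attempts succeeds) under the assumed reading.

    The first attempt runs at eps0; all later attempts run at eps1
    because a failed attempt is assumed to stay in context.
    """
    if K < 1:
        return 0.0
    first_fail = 1.0 - attempt_success_prob(eps0, T)
    retry_fail = 1.0 - attempt_success_prob(eps1, T)
    return 1.0 - first_fail * retry_fail ** (K - 1)


def t_star_as_quoted(B, eps0, eps1):
    """Pipeline depth T* transcribed from the abstract's formula."""
    return math.sqrt(B * math.log(1 / (1 - eps1)) / math.log(1 / (1 - eps0)))


def simulate(K, T, eps0, eps1, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability, for a sanity check."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        eps = eps0
        for _ in range(K):
            if all(rng.random() >= eps for _ in range(T)):
                wins += 1
                break
            eps = eps1  # the failed attempt contaminates later ones
    return wins / trials


K, T, EPS0, EPS1 = 3, 5, 0.05, 0.20  # illustrative values only
contaminated = p_success_within(K, T, EPS0, EPS1)
clean = p_success_within(K, T, EPS0, EPS0)  # context cleared before retry
estimate = simulate(K, T, EPS0, EPS1)
print(f"contaminated={contaminated:.4f} clean={clean:.4f} simulated={estimate:.4f}")
```

Setting `eps1 == eps0` recovers the clean-restart baseline, so comparing the two values gives a rough feel for what clearing the context before a retry buys under these assumptions.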
cs.AI / 28 / 2605.08564
Biological Plausibility and Representational Alignment of Feedback Alignment in Convolutional Networks
卷积网络中反馈对齐的生物合理性与表征一致性
Abstract
The feedback alignment (FA) algorithm offers a biologically plausible alternative to backpropagation (BP) for training neural networks, yet it notably fails to scale to convolutional architectures. Modifications have been proposed to address this limitation, but at a questionable cost to biological plausibility. In this paper, we evaluate five learning algorithms, including modified FA and standard BP, applied to the same convolutional architecture on the CIFAR-10 dataset. We provide a tripartite comparative analysis focusing on biological plausibility, interpretability, and computational complexity. Our results indicate that modified FA algorithms converge on internal representations that are structurally similar to those produced by backpropagation. In particular, the functional success of modified FA algorithms appears to be rooted in their ability to mimic the representational geometry of backpropagation: they converge on similar representations despite relying on fundamentally different weight update mechanisms.
Chinese Translation
反馈对齐(Feedback Alignment, FA)算法为训练神经网络提供了一种生物合理的替代方案,取代了反向传播(Backpropagation, BP),但在卷积架构上显著缺乏可扩展性。虽然已经提出了一些修改方案以解决这一局限性,但其生物合理性受到质疑。本文评估了包括修改后的 FA 和标准 BP 在内的五种学习算法,应用于相同的卷积架构和 CIFAR-10 数据集。我们提供了一个三方面的比较分析,重点关注生物合理性、可解释性和计算复杂性。我们的结果表明,修改后的 FA 算法在内部表征上收敛于与反向传播产生的表征结构相似的结果。特别是,修改后的 FA 算法的功能成功似乎根植于其模仿反向传播的表征几何的能力,尽管依赖于根本不同的权重更新机制,但仍收敛于相似的表征。
cs.AI / 29 / 2605.08599
What Will Happen Next: Large Models-Driven Deduction for Emergency Instances
接下来会发生什么:基于大模型的紧急事件推理
Abstract
Traditional simulation methods reproduce past emergency instances through preset scenarios to assist people in risk assessment and emergency decision-making. However, because they lack randomness and diversity and emergency instances are scarce, existing simulation systems struggle to fully explore potential risks. In contrast, Large Models (LMs) can dynamically adjust generation strategies to introduce controllable randomness, while also possessing extensive prior knowledge and cross-domain knowledge transfer capabilities. Inspired by this, we propose the LMs-driven World Line Divergence System (WLDS), which enables diversified visualization and deduction of emergency instances in different domains. WLDS leverages LMs to deduce emergency instances along various development directions, and introduces factual calibration and logical calibration mechanisms to ensure factual accuracy and logical rigor during the deduction process. The interactive module can independently select deduction directions to avoid potential hallucinations that are difficult for the system to identify. Furthermore, through its visualization module, WLDS produces simulations and deductions that combine text and images, which enhances interpretability. Extensive experiments conducted on the proposed Emergency Instances Deduction (EID) benchmark dataset demonstrate that WLDS achieves high-precision and high-fidelity simulation and deduction of emergency instances in multiple specific domains. Further experiments demonstrate that WLDS can generate additional emergency-instance deduction data for users and support better decision-making in similar emergency instances in the future.
Chinese Translation
传统的模拟方法通过预设来重现发生的紧急事件,以帮助人们进行风险评估和紧急决策。然而,由于缺乏随机性和多样性,现有的模拟系统难以充分探索潜在风险,因为紧急事件相对稀缺。相比之下,大模型(Large Models, LMs)能够动态调整生成策略,引入可控的随机性,同时具备广泛的先验知识和跨领域知识迁移能力。受到此启发,我们提出了基于大模型的世界线发散系统(World Line Divergence System, WLDS),该系统能够在不同领域实现紧急事件的多样化可视化和推理。WLDS利用大模型推导出不同发展方向的紧急事件,并引入事实校准和逻辑校准机制,以确保推理过程中的事实准确性和逻辑严谨性。交互模块能够独立选择推理方向,以避免系统难以识别的潜在幻觉。此外,通过引入可视化模块,WLDS形成了文本与图像相结合的模拟和推理,从而增强了可解释性。在提出的紧急事件推理(Emergency Instances Deduction, EID)基准数据集上进行的大量实验表明,WLDS在多个特定领域实现了高精度和高保真度的紧急事件模拟和推理。相关实验进一步表明,WLDS能够为用户生成更多的紧急事件推理数据,并为未来类似紧急事件的更好决策提供支持。
cs.AI / 30 / 2605.08611
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
回声增强知识:通过情感向量再注入在语言模型中的身体标记类比
Abstract
Current language model memory systems store what happened but not how it felt. This distinction -- between semantic memory (knowing about a past event) and episodic memory (re-experiencing it) -- was identified by Tulving as the difference between noetic and autonoetic consciousness. Damasio demonstrated that humans with intact knowledge but absent emotional markers exhibit impaired decision-making. We bridge this gap for language models. Using Gemma 3 1B-IT with pretrained Gemma Scope 2 sparse autoencoders, we identify 310 emotion-exclusive features at layer 22 with psychologically valid geometry. We construct distinctive-feature emotion vectors during experience and partially re-inject them during recall, triggered by context similarity at layer 7. We test four conditions paralleling Damasio's framework: A (no memory), B (semantic labels), C (emotion echo), and BC (semantic + echo). For emotional orientation, the echo alone steepens the threat-safety gradient: the regression slope of threat rating on contextual similarity is 0.80 for C vs 0.56 for A ($p$=0.011, permutation test). For decisions, the echo amplifies knowledge into action: BC=80% good choices vs B=52% ($z$=+2.60, $p$<0.01), while the echo alone has no effect (C=22%, n.s.). The echo changes how the model feels independently, but changes what it does only when combined with knowledge -- replicating Damasio's core finding. The echo amplifies knowledge. It does not replace it.
Chinese Translation
当前语言模型的记忆系统存储了发生过的事情,但未能捕捉到其情感体验。这一区别——语义记忆(了解过去事件)与情节记忆(重新体验事件)之间的差异——被图尔文(Tulving)识别为无知意识(noetic consciousness)与自我知觉意识(autonoetic consciousness)之间的区别。达马西奥(Damasio)证明了,尽管知识完好但缺乏情感标记的人在决策时表现出障碍。我们为语言模型弥补了这一差距。通过使用Gemma 3 1B-IT与预训练的Gemma Scope 2稀疏自编码器,我们在第22层识别出310个情感专属特征,具有心理学上有效的几何形状。在体验过程中,我们构建了独特特征的情感向量,并在回忆时部分再注入这些向量,触发条件为第7层的上下文相似性。我们测试了四种条件,平行于达马西奥的框架:A(无记忆)、B(语义标签)、C(情感回声)和BC(语义+回声)。对于情感导向,回声单独增强了威胁-安全梯度:C条件下威胁评分与上下文相似性的回归斜率为0.80,而A条件下为0.56($p$=0.011,置换检验)。对于决策,回声将知识转化为行动:BC条件下80%的良好选择率对比B条件下的52%($z$=+2.60,$p$<0.01),而回声单独的效果不显著(C=22%,n.s.)。回声独立改变模型的情感体验,但只有与知识结合时才改变其行为——复制了达马西奥的核心发现。回声增强了知识,但并未取代知识。
cs.AI / 31 / 2605.08613
Generalization Bounds of Emergent Communications for Agentic AI Networking
代理人工智能网络中新兴通信的泛化界限
Abstract
The evolution of 6G networking toward agentic AI networking (AgentNet) systems requires a shift from traditional data pipelines to task-aware, agentic AI-native communication solutions. Emergent communication, a novel communication paradigm in which autonomous agents learn their own signaling protocols through interaction, is increasingly viewed as a promising solution to address the challenges posed by existing rigid, predefined protocol-based networking architecture. However, most existing emergent communication frameworks fail to account for physical networking constraints, such as bandwidth and computational complexity, and often lack a rigorous information-theoretical foundation. To address these challenges, this paper introduces a novel emergent communication framework that facilitates collaborative task-solving among heterogeneous agents through an information-theoretic lens. We propose a novel joint loss function that unifies the optimization of decision-making functions and the learning of communication signaling. Our proposed solution is grounded on the multi-agent and multi-task distributed information bottleneck (DIB) theory, which allows the quantification of the fundamental trade-off between task-relevant information representation and computational complexity. We further provide theoretical generalization bounds of the emergent communication protocol during decentralized inference across unseen environmental states. Experimental validation on a real-world hardware prototype confirms that our proposed framework significantly improves generalization performance, compared to the state-of-the-art solutions.
Chinese Translation
6G网络向代理人工智能网络(AgentNet)系统的演变需要从传统数据管道转向任务感知的代理人工智能原生通信解决方案。新兴通信是一种新颖的通信范式,其中自主代理通过交互学习自己的信号协议,越来越被视为解决现有刚性、预定义协议基础网络架构所带来的挑战的有前景的解决方案。然而,大多数现有的新兴通信框架未能考虑物理网络约束,如带宽和计算复杂性,并且往往缺乏严格的信息理论基础。为了解决这些挑战,本文引入了一种新颖的新兴通信框架,通过信息理论的视角促进异构代理之间的协作任务解决。我们提出了一种新的联合损失函数,统一了决策函数的优化和通信信号的学习。我们提出的解决方案基于多代理和多任务分布式信息瓶颈(DIB)理论,能够量化任务相关信息表示与计算复杂性之间的基本权衡。我们进一步提供了新兴通信协议在未见环境状态下去中心化推理过程中的理论泛化界限。对真实硬件原型的实验验证证实,与最先进的解决方案相比,我们提出的框架显著提高了泛化性能。
cs.AI / 32 / 2605.08614
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ:基于大型语言模型的工业维护行动推荐的基准测试
Abstract
Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13-60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49-63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
Chinese Translation
监测复杂的工业资产依赖于工程师编写的符号规则,这些规则根据传感器条件触发,并提示技术人员执行纠正措施。瓶颈不在于检测,而在于响应:将规则转化为维护步骤需要通过多年的实践获得的特定资产知识。我们研究了大型语言模型(LLMs)是否可以作为这一规则到行动步骤的决策支持,并引入了DiagnosticIQ,这是一个包含来自16种资产类型的118对规则-行动的6690个专家验证的多项选择题的基准测试。我们的贡献包括:(i) 一个符号到多项选择题(MCQA)的管道,将规则标准化为析取范式,并采用基于嵌入的干扰项采样;(ii) 五种变体探测不同的故障模式(Pro、Pert、Verbose、Aug、Rationale);(iii) 一个包含29个LLMs和4个嵌入基线的基准测试。人类评估(9名从业者,平均45.0%)确认DiagnosticIQ需要超出操作经验的专业知识。有三个发现值得注意。前沿差距已经收窄:排名前三的LLMs相距不到一个Macro分,Bradley-Terry Elo显示claude-opus-4-6比下一个模型高出30分。然而,DiagnosticIQ Pro暴露了脆弱性,每个模型在干扰项扩展下相对准确率下降了13%至60%。DiagnosticIQ Aug揭示了模式匹配:在条件反转下,前沿模型仍然在49%至63%的情况下选择原始答案。部署瓶颈不在于能力,而在于校准:前沿模型能够处理模板式故障检测,但在结构扰动下会失效。
cs.AI / 33 / 2605.08653
C2L-Net: A Data-Driven Model for State-of-Charge Estimation of Lithium-Ion Batteries During Discharge
C2L-Net:一种数据驱动的锂离子电池放电过程中的荷电状态估计模型
Abstract
Accurate state-of-charge (SOC) estimation is critical for the safe and efficient operation of lithium-ion batteries in battery management systems (BMS). Although data-driven approaches can effectively capture nonlinear battery dynamics, many existing methods rely on long historical input sequences, resulting in high computational cost and introducing padding-induced positional bias at the beginning of drive cycles. To address these limitations, we propose C2L-Net, a novel context-to-latest data-driven framework for realistic online SOC estimation using only a short historical window (20 s). Unlike existing short-receptive-field or long-history models, the proposed framework explicitly separates contextual encoding from latest-measurement updating, enabling both efficient temporal modeling and rapid adaptation to dynamic battery states. The proposed model incorporates a chunk-based feature extraction mechanism that combines Theta Attention Pooling with a Fourier-based Seasonality Basis to capture local temporal patterns while reducing sequence length. A causal context encoder, integrating a gated recurrent unit (GRU) with Causal Cosine Attention, models temporal dependencies without information leakage. Furthermore, a latest-measurement decoder, inspired by recursive filtering, updates the contextual state using the most recent measurement, enhancing responsiveness to dynamic operating conditions. Extensive experiments on a public lithium-ion battery drive-cycle dataset under multiple fixed-temperature conditions demonstrate that the proposed method achieves state-of-the-art or competitive accuracy while significantly improving computational efficiency. In particular, C2L-Net achieves up to 60 times faster inference and requires fewer parameters than recent data-driven baselines, while maintaining robust performance across unseen driving profiles.
Chinese Translation
准确的荷电状态(SOC)估计对于锂离子电池在电池管理系统(BMS)中的安全和高效运行至关重要。尽管数据驱动的方法能够有效捕捉非线性电池动态,但许多现有方法依赖于长历史输入序列,导致计算成本高并在驱动周期开始时引入填充引起的位置偏差。为了解决这些局限性,我们提出了C2L-Net,一种新颖的上下文到最新数据驱动框架,仅使用短历史窗口(20秒)进行现实在线SOC估计。与现有的短感受野或长历史模型不同,所提出的框架明确将上下文编码与最新测量更新分离,从而实现高效的时间建模和快速适应动态电池状态。该模型结合了基于块的特征提取机制,将Theta Attention Pooling与基于傅里叶的季节性基相结合,以捕捉局部时间模式,同时减少序列长度。因果上下文编码器结合了门控递归单元(GRU)与因果余弦注意力,建模时间依赖性而不泄露信息。此外,受递归滤波启发的最新测量解码器使用最近的测量更新上下文状态,从而增强对动态操作条件的响应能力。在多个固定温度条件下的公共锂离子电池驱动周期数据集上的广泛实验表明,所提出的方法在显著提高计算效率的同时,达到了最先进或具有竞争力的准确性。特别是,C2L-Net的推理速度提高了多达60倍,并且所需参数少于最近的数据驱动基线,同时在未见的驾驶特征上保持了稳健的性能。
cs.AI / 34 / 2605.08670
MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction
MIND-Skill:通过多智能体归纳与演绎实现质量保证的技能生成
Abstract
Large language model (LLM) powered AI agents have emerged as a promising paradigm for autonomous problem-solving, yet they continue to struggle with complex, multi-step real-world tasks that demand domain-specific procedural knowledge. Reusable agent skills, which encapsulate successful problem-solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present Multi-agent INduction and Deduction for Skills (MIND-Skill), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND-Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held-out tasks unseen during optimization. Experiments on AppWorld and BFCL-v3 show that MIND-Skill consistently outperforms concurrent skill generation methods.
Chinese Translation
基于大型语言模型(LLM)的人工智能代理已成为自主问题解决的一个有前景的范式,但在处理复杂的多步骤现实任务时仍面临挑战,这些任务需要特定领域的程序知识。可重用的代理技能封装了成功的问题解决策略,能够使代理基于先前的经验进行构建,因此提供了自然的解决方案。然而,策划这些技能在很大程度上仍然是一个手动过程,需要人类专家将丰富的领域知识提炼为可操作的指导方针。在本研究中,我们提出了多智能体归纳与演绎技能(MIND-Skill)框架,该框架能够从成功的轨迹中自动归纳出可泛化的技能,并提供强有力的质量保证。MIND-Skill 包含一个归纳代理,负责从成功轨迹中抽象出可重用的技能,以及一个演绎代理,旨在通过遵循归纳的技能来重建轨迹。为了保证生成技能的质量,我们引入了重建损失,该损失比较输入轨迹和重建轨迹,结果损失则强制确保重建轨迹的正确性,而评分损失则根据预定义标准评估文档质量并规范生成技能的抽象水平。这些文本损失与 TextGrad 共同优化,生成的技能在优化过程中未见过的保留任务上进行评估。在 AppWorld 和 BFCL-v3 上的实验表明,MIND-Skill 一直优于当前的技能生成方法。
cs.AI / 35 / 2605.08686
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
异构大型语言模型的多智能体系统迭代批评与路由控制器
Abstract
Multi-agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one-shot routing: they select a model once and return its output directly. Such routing-only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique-and-routing controller that casts multi-agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective. Extensive experiments across multiple heterogeneous multi-agent systems and seven reasoning benchmarks show that our method consistently outperforms state-of-the-art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.
Chinese Translation
多智能体大型语言模型(LLM)系统通常依赖于控制器来协调一组异构模型,但现有的控制器通常仅限于一次性路由:它们只选择一个模型并直接返回其输出。这种仅路由的设计没有提供批评中间草稿或支持迭代改进的机制。为了解决这一局限性,我们提出了一种批评与路由控制器,将多智能体协调视为一个序列决策问题。在每个回合中,控制器评估当前草稿,决定是停止还是继续,并在需要时选择下一个代理进行进一步的改进。我们将这一过程形式化为一个有限时域的马尔可夫决策过程(MDP),并引入明确的代理利用约束,设计了一个复合奖励用于控制器在各回合的决策,并通过拉格朗日松弛目标优化控制器。在多个异构多智能体系统和七个推理基准上的广泛实验表明,我们的方法始终优于最先进的基线,并显著缩小了与最强代理之间的差距,同时在总调用次数中使用该代理的比例低于25%。
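The Lagrangian-relaxed objective with an agent-utilization constraint mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the step size, and the budget value are hypothetical, and the abstract does not state how the multiplier is actually updated. The sketch only shows the generic pattern of penalizing constraint violation and adjusting the multiplier by projected dual ascent.

```python
def lagrangian_objective(expected_reward: float, usage: float,
                         budget: float, lam: float) -> float:
    """Relaxed objective: reward minus a penalty on utilization above the budget.

    `usage` is the fraction of calls sent to a constrained agent and `budget`
    is the allowed fraction (both hypothetical quantities for illustration).
    """
    return expected_reward - lam * (usage - budget)


def dual_ascent_step(lam: float, usage: float, budget: float,
                     step_size: float = 0.1) -> float:
    """Raise the multiplier when the constraint is violated, lower it otherwise.

    The multiplier is projected back onto [0, inf) so the penalty never
    rewards over-use of the constrained agent.
    """
    return max(0.0, lam + step_size * (usage - budget))
```

In such a scheme the controller policy would be updated to increase `lagrangian_objective` while `dual_ascent_step` is applied between policy updates, so the penalty grows only while the constrained agent is over-used.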
cs.AI / 36 / 2605.08688
Reconciling Consistency-Based Diagnosis with Actual-Causality-Based Explanations
将基于一致性的诊断与基于实际因果的解释相结合
Abstract
We establish, from the point of view of Explainable AI (XAI), connections between Consistency-Based Diagnosis (CBD), on one side, and Actual Causality and Causal Responsibility, on the other. CBD has received little attention from the XAI community. Connections between these two areas could have a fruitful impact on XAI and Explainable Data Management.
Chinese Translation
从可解释人工智能(XAI)的角度出发,我们建立了基于一致性的诊断(CBD)与实际因果性和因果责任之间的联系。CBD在XAI社区中受到的关注较少。这两个领域之间的联系可能对XAI和可解释数据管理产生积极的影响。
cs.AI / 37 / 2605.08693
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster:迈向 LLM 代理的自主技能掌握
Abstract
Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.
Chinese Translation
技能为提升 LLM 代理在复杂任务中的表现提供了一种有效机制,但在现有的代理框架中,技能的创建、完善和选择通常由外部教师、手工设计的规则或辅助模块来管理。因此,技能仍然是被调用的外部资源,而不是代理可以通过经验发展、适应和内化的能力。为了赋予 LLM 代理自主技能掌握的能力,我们提出了 SkillMaster,这是一种训练框架,旨在教导代理创建新技能、完善现有技能,并在任务解决过程中选择累积的技能。该能力通过三个关键设计实现。首先,我们通过轨迹信息驱动的技能评审来训练代理,教导代理根据已完成的情节中的证据提出、更新或保留技能。其次,每个候选技能编辑的设计都旨在通过其在相关探测任务上的反事实效用进行评估,为技能编辑决策的训练提供直接的学习信号。第三,我们引入了 DualAdv-GRPO,该方法分别估计任务解决动作和技能编辑决策的优势,从而在任务解决和技能管理之间稳定联合训练。在 ALFWorld 和 WebShop 上的实验表明,SkillMaster 分别提高了整体成功率 8.8% 和 9.3%,在所有比较方法中实现了最佳性能。进一步分析揭示了代理能力的显著转变:使用 SkillMaster 训练的代理能够识别技能失败,从轨迹证据中完善程序知识,并将改进转移到未来任务中,且仅需有限的技能库编辑。总体而言,SkillMaster 将 LLM 代理从单纯的技能使用推进到自我改进的代理,能够开发、适应和应用自身的技能库。
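The idea of estimating advantages separately for task-solving actions and skill-editing decisions can be sketched as follows. This is an assumption-laden illustration, not the DualAdv-GRPO algorithm itself: it uses the common group-relative normalization (reward minus group mean, divided by group standard deviation) and simply applies it to two reward streams independently. The function names and the example rewards are hypothetical.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center by the group mean, scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dual_advantages(task_rewards, edit_rewards):
    """Normalize the two reward streams separately.

    Keeping the groups apart means a large spread in task rewards cannot
    drown out the (hypothetical) counterfactual-utility signal for skill edits.
    """
    return (group_normalized_advantages(task_rewards),
            group_normalized_advantages(edit_rewards))
```

For example, `dual_advantages([1, 0, 1, 0], [0.2, 0.2, 0.2, 0.2])` yields task advantages close to `[1, -1, 1, -1]` and zero advantages for the skill edits, since a group with identical rewards carries no preference signal.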
cs.AI / 38 / 2605.08697
MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing
MBP-KT:从元行为模式中学习全球协作信息以增强知识追踪
Abstract
The emerging collaborative information-based knowledge tracing (KT) has been a promising way to enhance the modeling of learners' knowledge states. The core idea is to extract the collaborative information from interaction sequences of other learners to assist the prediction on the target one. Despite their effectiveness, existing methods are built on the raw interaction sequences with tailored modules, which inevitably limits their capacity to deeply capture learning behavioral patterns and to generalize. To this end, we propose a general meta-behavioral pattern-aware framework (MBP-KT) for KT. Specifically, MBP-KT introduces a novel meta-behavioral sequence construction to transform the raw interaction sequences into the combinations of different meta-behavioral patterns. In this way, the learning behavioral patterns of learners can be effectively preserved. Then, MBP-KT develops a parameter-free module to extract the global collaborative representations from the constructed meta-behavioral sequences. Moreover, MBP-KT provides general injection strategies to introduce the extracted global collaborative information into various downstream KT models, ensuring the universality of the collaborative information. Extensive results on real-world datasets demonstrate that MBP-KT can consistently boost the performance of a wide range of KT models.
Chinese Translation
基于协作信息的知识追踪(KT)作为一种新兴的方法,已成为增强学习者知识状态建模的有希望的途径。其核心思想是从其他学习者的交互序列中提取协作信息,以辅助对目标学习者的预测。尽管有效,现有方法仍然建立在原始交互序列上,并使用定制模块,这不可避免地限制了它们在深度捕捉学习行为模式和泛化能力方面的能力。为此,我们提出了一种通用的元行为模式感知框架(MBP-KT)用于知识追踪。具体而言,MBP-KT引入了一种新颖的元行为序列构建方法,将原始交互序列转化为不同元行为模式的组合。通过这种方式,学习者的学习行为模式得以有效保留。然后,MBP-KT开发了一个无参数模块,从构建的元行为序列中提取全球协作表示。此外,MBP-KT提供了通用的注入策略,将提取的全球协作信息引入各种下游知识追踪模型,确保协作信息的普适性。在真实世界数据集上的广泛结果表明,MBP-KT能够持续提升多种知识追踪模型的性能。
cs.AI / 39 / 2605.08703
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness:自我进化的代理后训练
Abstract
Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
Chinese Translation
评估指导性图像编辑的指令需要反映微妙人类偏好的奖励,然而当前的奖励模型通常依赖于大规模的偏好标注和额外的模型训练。这造成了数据效率的差距:人类通常可以仅通过少量示例推断目标评估标准,而模型通常需要在数十万次比较中进行训练。我们提出了RewardHarness,一个自我进化的代理奖励框架,将奖励建模重新定义为上下文演变,而非权重优化。RewardHarness通过从少至100个偏好示例中迭代演变工具和技能库,与人类偏好对齐,而不是依赖大规模标注。给定源图像、候选编辑图像和编辑指令,协调者(Orchestrator)从维护的库中选择最相关的工具和技能子集,冻结的子代理(Sub-Agent)使用这些工具和技能构建推理链,从而产生偏好判断。通过将预测的判断与真实偏好进行比较,并分析推理过程中的成功与失败,协调者自动优化其工具和技能库,而无需额外的人类标注。仅使用0.05%的EditReward偏好数据,RewardHarness在图像编辑评估基准上实现了47.4%的平均准确率,超越了GPT-5 5.3个百分点。当作为GRPO微调的奖励信号使用时,经过强化学习调优的模型在ImgEdit-Bench上达到了3.52的成绩。项目页面:https://rewardharness.com。
cs.AI / 40 / 2605.08704
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
AgentPSO:通过多代理粒子群优化演化代理推理技能
Abstract
Multi-agent reasoning has shown promise for improving the problem-solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi-agent methods rely on inference-time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle-swarm-inspired framework for evolving multi-agent reasoning skills. AgentPSO treats each agent as a particle-like reasoner whose state is a natural-language skill and whose velocity is a semantic update direction, iteratively moving agents toward stronger skill states to improve both individual and collective reasoning performance. Across training iterations, each agent updates its skill by combining its previous velocity, personal-best skill, global-best skill, and a self-reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors from both their own experiences and the strongest skills discovered by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single-agent skills and test-time-only multi-agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark-specific prompts. Code is open-sourced at https://github.com/HYUNMIN-HWANG/AgentPSO/.
Chinese Translation
多代理推理在提升大型语言模型的问题解决能力方面展现了良好的前景,因为它允许多个代理探索多样的推理路径。然而,现有的大多数多代理方法依赖于推理时的辩论或聚合,这可能容易受到错误的同伴影响和偏见共识的影响。此外,代理本身仍然是静态的,因为它们的基础推理技能在任务之间并未演化。本文介绍了AgentPSO,一种受粒子群启发的框架,用于演化多代理推理技能。AgentPSO将每个代理视为一个类粒子的推理者,其状态是自然语言技能,速度是语义更新方向,迭代地将代理移动到更强的技能状态,以提高个体和集体的推理表现。在训练迭代过程中,每个代理通过结合其先前的速度、个人最佳技能、全局最佳技能以及源自同伴推理轨迹的自我反思方向来更新其技能。这使得代理能够从自身经验和群体发现的最强技能中学习可重用的推理行为,而无需更新基础语言模型的参数。在数学和一般推理基准上的实验表明,AgentPSO在静态单代理技能和仅在测试时的多代理推理基准上有所改善。演化的技能在基准之间及到另一个基础模型之间进一步迁移,表明AgentPSO捕捉到了可重用的推理过程,而不仅仅是优化特定基准的提示。代码已开源,地址为 https://github.com/HYUNMIN-HWANG/AgentPSO/.
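For readers unfamiliar with particle swarm optimization, the update that AgentPSO takes inspiration from can be sketched in its classical numeric form. This is only the standard PSO step on real-valued vectors, with hypothetical coefficient defaults. AgentPSO itself operates on natural-language skills and adds a self-reflective direction, neither of which is modeled here.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             inertia=0.7, c_personal=1.5, c_global=1.5, rng=None):
    """One classical PSO update for a single particle.

    The new velocity mixes the previous velocity (inertia), a pull toward
    the particle's own best position, and a pull toward the swarm's best.
    """
    rng = rng or random.Random(0)
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # stochastic weights in [0, 1)
        v_next = inertia * v + c_personal * r1 * (p - x) + c_global * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity
```

A particle that already sits at both its personal best and the global best with zero velocity does not move, which is the fixed point of this update.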
cs.AI / 41 / 2605.08710
When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees
人类-人工智能团队何时能超越个体表现?具有不可能性保证的紧界限
Abstract
Human-AI teams fail to outperform their best member in 70% of studies, yet no theory specifies when complementarity is achievable. We derive tight bounds for the broad class of confidence-based aggregation rules by integrating signal detection theory with information-theoretic analysis, yielding four results: (1) a complementarity theorem (teams outperform individuals iff error correlation $\rho_{HM} < \rho^*$, with $\rho^* \approx a$ in the symmetric near-chance regime); (2) minimax bounds showing gains scale as $\Theta(\sqrt{\Delta d})$ with metacognitive sensitivity difference; (3) an impossibility result proving no confidence-based aggregation rule achieves complementarity when $\rho_{HM} \geq \rho^*$; and (4) multi-class generalization $\rho^*_K \approx \rho^*/\sqrt{K-1}$. Predictions match observed team accuracy ($R = 0.94$ on ImageNet-16H, $R = 0.91$ on CIFAR-10H) and the multi-class threshold scaling holds on human data ($R = 0.93$, $K = 16$), with robustness under non-Gaussian distributions. The framework explains why complementarity is rare and provides actionable design formulas; results apply to aggregation, not to interactive deliberation that generates novel answers.
Chinese Translation
在人类-人工智能团队中,70%的研究表明其表现未能超越最佳成员,但尚无理论明确指出何时可以实现互补性。我们通过将信号检测理论与信息理论分析相结合,为广泛的基于信心的聚合规则推导出紧界限,得出四个结果:(1)互补性定理(团队超越个体当且仅当错误相关性 $\rho_{HM} < \rho^*$,其中在对称近随机状态下 $\rho^* \approx a$);(2)最小最大界限显示收益随元认知敏感性差异按 $\Theta(\sqrt{\Delta d})$ 规模增长;(3)不可能性结果证明当 $\rho_{HM} \geq \rho^*$ 时,没有任何基于信心的聚合规则能够实现互补性;(4)多类泛化 $\rho^*_K \approx \rho^*/\sqrt{K-1}$。预测结果与观察到的团队准确性相符(在 ImageNet-16H 上 $R = 0.94$,在 CIFAR-10H 上 $R = 0.91$),且多类阈值缩放在人类数据上成立($R = 0.93$,$K = 16$),在非高斯分布下具有稳健性。该框架解释了互补性为何稀有,并提供了可操作的设计公式;结果适用于聚合,而不适用于生成新答案的互动讨论。
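As a small worked illustration of the multi-class scaling in result (4), the sketch below evaluates $\rho^*_K \approx \rho^*/\sqrt{K-1}$ and applies the complementarity condition $\rho_{HM} < \rho^*$ from result (1). The function names are ours, and any numeric threshold passed in is a hypothetical placeholder rather than a value reported in the paper.

```python
import math


def multiclass_threshold(rho_star: float, num_classes: int) -> float:
    """Approximate K-class complementarity threshold: rho* / sqrt(K - 1)."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    return rho_star / math.sqrt(num_classes - 1)


def complementarity_possible(rho_hm: float, rho_star: float,
                             num_classes: int = 2) -> bool:
    """True when the human-machine error correlation is below the threshold."""
    return rho_hm < multiclass_threshold(rho_star, num_classes)
```

With a hypothetical binary threshold of 0.5 and $K = 16$, the threshold shrinks to $0.5/\sqrt{15} \approx 0.129$, so an error correlation of 0.1 would still permit complementarity while 0.2 would not.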
cs.AI / 42 / 2605.08716
Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation
必要性偏见:关于收敛人工智能与人类验证的顺序处理的不可能性定理
Abstract
Are certain cognitive biases mathematically inevitable consequences of sequential information processing? We prove that primacy effects, anchoring, and order-dependence are architecturally necessary in autoregressive language models due to causal masking constraints. Our three impossibility theorems establish: (1) primacy bias arises from asymmetric attention accumulation; (2) anchoring emerges from sequential conditioning with provable information bounds; and (3) exact debiasing by permutation marginalization requires factorial-time computation, with Monte Carlo approximation feasible at constant per-tolerance overhead. We validate these bounds across 12 frontier LLMs ($R^2 = 0.89$; $\Delta$BIC $= 16.6$ vs. next-best alternative). We then derive quantitative predictions from the framework and test them in two pre-registered human experiments ($N = 464$ analyzed). Study 1 confirms anchor position modulates anchoring magnitude ($d = 0.52$, BF$_{10} = 847$). Study 2 shows working memory load amplifies primacy bias ($d = 0.41$, BF$_{10} = 156$), with WM capacity predicting bias reduction ($r = -.38$). These convergent findings reframe cognitive biases as resource-rational responses to sequential processing.
Chinese Translation
某些认知偏见是否是顺序信息处理的数学必然结果?我们证明了由于因果掩蔽约束,初始效应、锚定和顺序依赖在自回归语言模型中是结构上必要的。我们的三个不可能性定理确立了: (1) 初始偏见源于不对称的注意力积累; (2) 锚定源于具有可证明信息界限的顺序条件; (3) 通过置换边际化进行精确去偏见需要阶乘时间计算,而蒙特卡洛近似在恒定的每容忍开销下是可行的。我们在12个前沿大型语言模型(LLMs)中验证了这些界限($R^2 = 0.89$;$\Delta$BIC $= 16.6$ 对比下一个最佳替代)。然后,我们从该框架中推导出定量预测,并在两个预注册的人类实验中进行了测试($N = 464$ 分析)。研究1确认锚定位置调节锚定幅度($d = 0.52$, BF$_{10} = 847$)。研究2显示工作记忆负荷放大了初始偏见($d = 0.41$, BF$_{10} = 156$),而工作记忆容量预测偏见减少($r = -0.38$)。这些收敛的发现将认知偏见重新框架为对顺序处理的资源理性反应。
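The contrast between exact permutation marginalization and its Monte Carlo approximation, as stated in result (3), can be made concrete with a toy sketch. Everything here is illustrative: `primacy_score` is a hypothetical stand-in for an order-dependent model output, not anything from the paper, and the sample count is arbitrary. The exact version enumerates all $n!$ orderings, while the sampled version draws a fixed number of random orderings.

```python
import itertools
import random


def primacy_score(order):
    """Toy order-dependent scorer: earlier positions get geometrically larger weight."""
    return sum(value * 0.5 ** position for position, value in enumerate(order))


def exact_marginal(items, score):
    """Average the score over every ordering; cost grows as n! in len(items)."""
    total, count = 0.0, 0
    for perm in itertools.permutations(items):
        total += score(perm)
        count += 1
    return total / count


def monte_carlo_marginal(items, score, num_samples, seed=0):
    """Approximate the same average from num_samples uniformly random orderings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        order = list(items)
        rng.shuffle(order)
        total += score(tuple(order))
    return total / num_samples
```

For the items `(1, 2, 3, 4)` the exact average is the mean value 2.5 times the summed position weights 1.875, that is 4.6875. The sampled estimate approaches this value at a cost that depends on the number of samples rather than on $n!$.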
cs.AI / 43 / 2605.08747
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
完成了,但不确定:将世界完成与具身智能体中的自我终止区分开来
Abstract
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
Chinese Translation
标准的具身评估并未独立评分智能体在情节结束时是否正确承诺完成任务,这一能力我们称之为终端承诺。行为上明显的失败——从未完成任务、完成任务但未能停止、以及在没有足够证据的情况下报告成功——都归结为相同的基准失败。我们引入了VIGIL,一个使终端承诺可独立测量的评估框架。在VIGIL的默认协议下,智能体仅观察自我中心的RGB图像,不接收任何行动成功信号,并且必须以一个语义报告结束每个情节,该报告会与隐藏的世界状态进行确定性检查。这产生了两个独立的评分:世界状态完成(W)和基准成功(B),其中B还要求正确的终端报告。这种解耦使得四种结果类别可区分:错过执行、后获得漂移、不支持的承诺和验证成功。在1000个冻结情节的20个模型中,具有可比W的系统在B上差异高达19.7个百分点:一个模型将实现的状态转化为正确的报告,而另一个模型在几乎相同的执行下漂移过目标而未能结束。一个行动反馈干预进一步测试了这种分离:以执行为导向的信号广泛改善了W,但在未能将终端报告基于实现状态的模型中,承诺失败仍然存在。VIGIL提供了一种协议,使终端承诺独立可见且可评分。
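The decoupling of world-state completion (W) from benchmark success (B) can be sketched as a small scoring routine. The mapping from two booleans to the four outcome categories below is our assumption based on the abstract's wording, not the benchmark's actual checker, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    world_complete: bool    # hidden world state satisfies the goal (counts toward W)
    reported_success: bool  # agent closed the episode with a success report


def categorize(ep: Episode) -> str:
    """Assumed mapping of one episode onto the four outcome categories."""
    if ep.world_complete and ep.reported_success:
        return "verified_success"
    if ep.world_complete:
        return "post_attainment_drift"   # goal reached, but never committed
    if ep.reported_success:
        return "unsupported_commitment"  # claimed success without the goal state
    return "missed_execution"


def scores(episodes):
    """Return (W, B): B counts only episodes that are complete and correctly reported."""
    n = len(episodes)
    w = sum(ep.world_complete for ep in episodes) / n
    b = sum(categorize(ep) == "verified_success" for ep in episodes) / n
    return w, b
```

Under this assumed mapping, two systems with the same W can differ in B purely through how often they drift past an achieved goal without closing the episode.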
cs.AI / 44 / 2605.08754
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
基于价值分解的强化学习框架用于带有层次冲突感知观测的滑行道路径规划
Abstract
Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety--efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.
Chinese Translation
滑行道路径规划和地面冲突规避是机场地面操作中相互关联的安全关键决策问题。现有的规划和优化方法常常受到在线计算成本的限制,而强化学习方法可能难以有效表示下游交通冲突并平衡多个目标。本文提出了冲突感知滑行道路径规划(Conflict-aware Taxiway Routing, CaTR),这是一个用于实时多飞机滑行道路径规划的强化学习框架。CaTR 构建了一个基于网格的机场地面环境,采用动作屏蔽技术,引入层次前瞻交通表示以编码当前和下游与冲突相关的交通状况,并采用价值分解的强化学习策略来优先考虑稀疏但安全关键的目标。在基于长沙黄花国际机场的真实环境下,针对多个交通密度水平进行了实验。结果表明,CaTR 在保持实用运行时间的同时,实现了比代表性的规划、优化和强化学习基准更好的安全-效率权衡。
cs.AI / 45 / 2605.08756
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
AHD Agent:用于自动启发式设计的自主强化学习
Abstract
Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.
Chinese Translation
自动启发式设计(AHD)已成为解决 NP 难组合优化问题(COPs)的有前景的范式。近期研究表明,当大型语言模型(LLMs)与精心设计的框架(即 LLM-AHD)结合时,可以自主发现高性能的启发式方法。然而,现有的 LLM-AHD 框架通常将 LLM 视为固定工作流程中的被动生成器,模型从手动设计的有限上下文中生成启发式方法。这种上下文可能无法捕捉依赖于状态的信息(例如,特定的失败模式),导致低效的试错探索。为克服这些局限性,我们提出了 AHD Agent,这是一种新颖的工具集成多轮框架,使 LLM 能够主动决定是生成启发式方法还是调用工具从解决环境中检索目标证据。为了有效训练这种动态决策代理,我们引入了一种自主强化学习(RL)系统,该系统利用一种新颖的环境合成管道来优化紧凑模型的可泛化 AHD 能力。在八个不同领域的实验中,包括四个保留任务,证明我们的 4B 参数代理与使用更大模型的最先进基线相匹配或超越,同时所需评估次数显著减少。模型和推理规模分析进一步揭示,AHD Agent 为真正自主的启发式设计提供了有效的路径。
cs.AI / 46 / 2605.08767
From Holo Pockets to Electron Density: GPT-style Drug Design with Density
从全息口袋到电子密度:基于密度的GPT风格药物设计
Abstract
Recent advances in generative modeling have enabled significant progress in structure-based drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low-resolution electron density (ED) derived from the filler as a physically grounded condition for de novo drug design. We consider two types of ED, calculated and cryo-EM/X-ray, obtainable from computational or experimental sources, supporting unified pre-training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder-only autoregressive framework that generates molecules from low-resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify the effectiveness. Our project page: https://jiahaochen1.github.io/EDMolGPT_Page/.
Chinese Translation
近年来,生成建模的进展使得基于结构的药物设计(SBDD)取得了显著进展。现有方法通常基于全息复合物中的空结合口袋进行分子生成,忽视了填充物(配体和溶剂)等信息成分。在此,我们利用源自填充物的低分辨率电子密度(ED)作为物理基础条件进行 de novo 药物设计。我们考虑两种类型的电子密度,计算得到的和冷冻电镜/ X射线(cryo-EM/X-ray),这些电子密度可以从计算或实验来源获得,支持统一的预训练和实验整合。与刚性口袋表示相比,实验电子密度自然捕捉构象灵活性,并提供对结合环境的更真实描述。基于此,我们引入EDMolGPT,一个仅解码的自回归框架,从低分辨率电子密度点云中生成分子。通过将生成过程基于物理意义明确的密度信号,EDMolGPT减轻了结构偏见,并生成具有三维构象的分子。在101个生物靶点上的评估验证了其有效性。我们的项目页面:https://jiahaochen1.github.io/EDMolGPT_Page/
cs.AI / 47 / 2605.08769
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS:用于多智能体系统的执行时间工作流学习
Abstract
Large language model (LLM)-based multi-agent systems have shown strong potential on complex tasks through agent specialization, tool use, and collaborative reasoning. However, most automated multi-agent system design methods still follow a one-shot paradigm: a workflow is optimized or selected before execution and then reused unchanged throughout the task. This static coordination strategy is ill-suited for long-horizon tasks whose subgoals, intermediate evidence, and information needs evolve over multiple execution stages. We propose EvoMAS, a framework for execution-time multi-agent workflow construction. EvoMAS formulates workflow construction as a meta-level sequential decision problem along a single task trajectory. At each stage, it constructs an explicit task state through a Planner-Evaluator-Updater pipeline and uses a learned Workflow Adapter to instantiate a stage-specific layered workflow from a fixed pool of candidate agents. The adapter is trained with policy gradients using sparse, verifiable terminal task success as the main supervision signal, while evaluator-based process reward is analyzed separately under very-hard sparse-reward settings. Experiments on GAIA, HLE, and DeepResearcher show that EvoMAS outperforms single-agent baselines and recent automated multi-agent workflow design methods. Our analyses further show that explicit task-state construction and learned workflow adaptation provide complementary benefits. Additional results indicate that process reward is most useful when terminal success is extremely sparse, and qualitative case studies illustrate that EvoMAS adapts agent coordination as the task state evolves.
Chinese Translation
基于大型语言模型(LLM)的多智能体系统在通过智能体专业化、工具使用和协作推理来处理复杂任务方面展现了强大的潜力。然而,大多数自动化多智能体系统设计方法仍然遵循一次性范式:在执行之前优化或选择一个工作流,并在整个任务中保持不变。这种静态协调策略不适合长时间跨度的任务,因为其子目标、中间证据和信息需求会在多个执行阶段中不断演变。我们提出了EvoMAS,一个用于执行时间多智能体工作流构建的框架。EvoMAS将工作流构建形式化为沿单一任务轨迹的元级序列决策问题。在每个阶段,它通过规划-评估-更新(Planner-Evaluator-Updater)管道构建一个明确的任务状态,并使用学习到的工作流适配器从固定的候选智能体池中实例化一个阶段特定的分层工作流。该适配器通过使用稀疏、可验证的终端任务成功作为主要监督信号,采用策略梯度进行训练,而基于评估的过程奖励在非常困难的稀疏奖励设置下被单独分析。在GAIA、HLE和DeepResearcher上的实验表明,EvoMAS在性能上优于单智能体基线和最近的自动化多智能体工作流设计方法。我们的分析进一步表明,明确的任务状态构建和学习的工作流适应提供了互补的好处。额外结果表明,当终端成功极为稀疏时,过程奖励最为有用,定性案例研究则表明EvoMAS在任务状态演变时适应智能体协调。
cs.AI / 48 / 2605.08776
Reasoning Compression with Mixed-Policy Distillation
混合策略蒸馏的推理压缩
Abstract
Reasoning-centric large language models (LLMs) achieve strong performance by generating intermediate reasoning trajectories, but often incur excessive token usage and high inference-time decoding cost. We observe that, when solving the same problems, larger reasoning models can often produce more concise traces, whereas smaller reasoning models tend to generate longer and more redundant trajectories. This is especially problematic in real-world deployment, where memory, latency, and serving-cost constraints often favor smaller models. Our observations suggest that reasoning compression can be transferred from large models to small ones rather than enforced through explicit length constraints. Based on this insight, we propose Mixed-Policy Distillation (MPD), a reasoning compression framework that transfers concise reasoning behavior from a larger-sized teacher to a smaller student by distilling teacher-compressed student trajectories. Unlike on-policy distillation, which aligns the student with teacher distributions over verbose student trajectories, or off-policy distillation, which relies on teacher-generated trajectories and may suffer from distribution mismatch, MPD combines the strengths of both. Given a student-sampled trajectory, the teacher rewrites it into a more concise reasoning trace, and the student is trained via KL-based alignment on the compressed trajectory. This preserves student-policy exploration while injecting teacher-guided compression. Experiments on Qwen3-1.7B show that MPD reduces token usage by up to 27.1% while improving performance across multiple reasoning benchmarks, demonstrating an effective approach to efficient small-model reasoning.
Chinese Translation
以推理为中心的大型语言模型(LLMs)通过生成中间推理轨迹实现了强大的性能,但往往导致过度的令牌使用和高推理时间解码成本。我们观察到,在解决相同问题时,较大的推理模型通常能够生成更简洁的轨迹,而较小的推理模型则倾向于生成更长且冗余的轨迹。这在实际部署中尤其成问题,因为内存、延迟和服务成本的限制通常更倾向于使用较小的模型。我们的观察表明,推理压缩可以从大模型转移到小模型,而不是通过显式的长度约束来强制执行。基于这一见解,我们提出了混合策略蒸馏(Mixed-Policy Distillation, MPD),这是一个推理压缩框架,通过蒸馏教师压缩的学生轨迹,将较大教师模型的简洁推理行为转移到较小的学生模型上。与同策略蒸馏(on-policy distillation)通过对冗长学生轨迹的教师分布进行对齐,或离策略蒸馏(off-policy distillation)依赖于教师生成的轨迹并可能遭受分布不匹配的缺点不同,MPD结合了两者的优点。在给定一个学生采样的轨迹时,教师将其重写为更简洁的推理轨迹,学生则通过基于KL的对齐在压缩轨迹上进行训练。这在注入教师引导的压缩的同时保留了学生策略的探索性。在Qwen3-1.7B上的实验表明,MPD将令牌使用减少了多达27.1%,同时在多个推理基准上提高了性能,展示了高效小模型推理的有效方法。
cs.AI / 49 / 2605.08778
Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking
并非所有回合都重要:多回合越狱的信用分配
Abstract
Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via leave-one-turn-out semantic masking; for failed ones, TRACE assigns penalties based on prompt harmfulness and semantic relevance, with an additional local refusal-aware penalty. Furthermore, we reuse the attack-side credit signal for multi-turn defense alignment. Extensive experiments on open-source and closed-source targets show that TRACE achieves strong overall performance in effectiveness, transferability, and efficiency, yielding about a 25% relative improvement in attack success rate over the strongest RL baseline while also improving the safety-utility balance when reused for defense alignment.
Chinese Translation
在多回合对话中部署大型语言模型(LLMs)促进了越狱攻击,这些攻击将有害意图分散到看似无害的回合中。最近的基于训练的多回合越狱方法通过交互反馈学习长期攻击策略,但通常依赖于粗糙的轨迹级结果信号,这些信号均匀地广播到每个回合。然而,我们发现多回合越狱中的回合级贡献是不均匀的、依赖于阶段的,并且是目标特定的。这种粗糙的结果监督引发了信用分配问题,导致在成功轨迹中对冗余回合的过度奖励,而在失败轨迹中对有用的中间回合的低估。为了解决这个问题,我们提出了TRACE,一个针对基于强化学习(RL)的多回合越狱的回合感知信用分配框架。对于成功的轨迹,TRACE通过逐回合语义掩蔽估计回合级贡献;对于失败的轨迹,TRACE根据提示的有害性和语义相关性分配惩罚,并增加了局部拒绝感知惩罚。此外,我们重用攻击侧信用信号以实现多回合防御对齐。在开放源代码和闭源目标上的大量实验表明,TRACE在有效性、可转移性和效率方面表现出色,相较于最强的RL基线,攻击成功率提高了约25%的相对改进,同时在重用于防御对齐时也改善了安全性与效用的平衡。
cs.AI / 50 / 2605.08816
Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
镜子,镜子在墙上:VLM代理能否识别自己?
Abstract
In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional capability emerges in embodied vision-language model (VLM) agents: can they recognize themselves in a mirror? We introduce a controlled 3D benchmark where a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, while avoiding self-other misattribution. To separate mirror-grounded self-identification from shortcuts, we test mirror removal, misleading cues, and occluded reflections. We also evaluate the decision process through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency. Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification. Overall, mirror-based evaluation provides a diagnostic for whether embodied self-grounding is causally rooted in perception and action rather than priors, prompt compliance, or confabulation.
Chinese Translation
在动物王国中,镜子自我识别是高阶认知的经典探测工具,仅在某些物种中出现。我们探讨一种类似的功能能力是否在具身视觉-语言模型(VLM)代理中出现:它们能否在镜子中识别自己?我们引入了一个受控的3D基准测试,其中一个第一人称VLM代理必须从其反射中推断出一个隐藏的身体属性,并选择匹配的目标,同时避免自我与他人的误归属。为了将基于镜子的自我识别与捷径区分开来,我们测试了镜子移除、误导性线索和遮挡反射。我们还通过镜子寻求、时间排序、自我归属和推理-行动一致性来评估决策过程。我们的实验表明,基于镜子的自我识别主要出现在更强的VLM中。这些模型能够利用反射证据进行行动,而较弱的模型通常会检查镜子,但未能提取与自我相关的信息或错误归属其反射。语言-视觉冲突进一步表明,仅凭自我指称的语言并不能证明具身自我识别的基础。总体而言,基于镜子的评估为具身自我扎根是否在感知和行动中具有因果根源提供了诊断,而非先验、提示遵从或虚构。
cs.AI / 51 / 2605.08817
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
开始方式决定推理方式:通过前缀调优先验驱动RLVR中的探索
Abstract
Reinforcement learning with verifiable rewards (RLVR) has recently thrived in large language model (LLM) reasoning tasks. However, the reward sparsity and the long reasoning horizon make effective exploration challenging. In practice, this challenge manifests as the \emph{entropy collapse} phenomenon, where RLVR improves single-rollout accuracy but fails to expand coverage on successful reasoning trajectories. Passive exploration techniques like entropy regularization tend to dismiss generation quality, resulting in noisy rollouts. In response to this issue, we propose an Information-Maximizing Augmented eXploration (IMAX) framework to train a pool of soft prefixes that reshapes the base model's prior over reasoning trajectories. Rather than relying on RL to incentivize exploration on top of the base model, each prefix acts as a trainable control knob that induces a distinct rollout distribution from the same backbone model. To encourage discovery of diverse and task-relevant reasoning behaviors, we derive an Information Maximization (InfoMax) reward to complement the verifiable rewards for RL training. IMAX is in general algorithm-agnostic and can be seamlessly integrated into existing RLVR pipelines. Experimental results show that, across three backbone scales, IMAX consistently improves reasoning performance over standard RLVR, with gains up to 11.60\% in Pass@4 and 10.57\% in Avg@4.
Chinese Translation
具有可验证奖励的强化学习(RLVR)最近在大型语言模型(LLM)推理任务中取得了显著进展。然而,奖励稀疏性和长推理时间使得有效探索变得具有挑战性。在实践中,这一挑战表现为\textit{熵崩溃}现象,即RLVR提高了单次回合的准确性,但未能扩展成功推理轨迹的覆盖范围。被动探索技术如熵正则化往往忽视生成质量,导致生成结果噪声较大。针对这一问题,我们提出了一种信息最大化增强探索(IMAX)框架,旨在训练一组软前缀,以重塑基础模型在推理轨迹上的先验。每个前缀作为可训练的控制旋钮,从同一基础模型诱导出不同的回合分布,而不是依赖于RL在基础模型之上激励探索。为了鼓励发现多样化和与任务相关的推理行为,我们推导出一种信息最大化(InfoMax)奖励,以补充RL训练中的可验证奖励。IMAX在一般情况下是算法无关的,可以无缝集成到现有的RLVR管道中。实验结果表明,在三种基础模型规模下,IMAX在推理性能上始终优于标准RLVR,Pass@4的提升高达11.60\%,Avg@4的提升为10.57\%。
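The abstract does not state the InfoMax reward in closed form. As a rough illustration only, the sketch below uses the well-known DIAYN-style variational bound on mutual information, `log q(z | trajectory) - log p(z)`, as a stand-in. The discriminator probabilities, the uniform prior over prefixes, the mixing weight `beta`, and both function names are assumptions made for this example, not details taken from the paper.

```python
import math

def infomax_reward(disc_probs, prefix_id):
    """DIAYN-style bonus: log q(z | trajectory) - log p(z), with a uniform prior p(z).

    disc_probs is a hypothetical discriminator's distribution over prefixes
    given a sampled rollout; prefix_id is the prefix that produced the rollout.
    """
    if not 0 <= prefix_id < len(disc_probs):
        raise ValueError("prefix_id out of range")
    q = disc_probs[prefix_id]
    if q <= 0.0:
        raise ValueError("discriminator probability must be positive")
    prior = 1.0 / len(disc_probs)
    return math.log(q) - math.log(prior)

def combined_reward(verifiable, disc_probs, prefix_id, beta=0.1):
    """Verifiable task reward plus a weighted diversity bonus (beta is an assumed knob)."""
    return verifiable + beta * infomax_reward(disc_probs, prefix_id)

# A rollout the discriminator cannot attribute to any prefix earns no bonus.
indistinct = infomax_reward([0.25, 0.25, 0.25, 0.25], prefix_id=2)
# A rollout that is clearly recognizable as coming from prefix 2 earns a positive bonus.
distinct = infomax_reward([0.1, 0.1, 0.7, 0.1], prefix_id=2)
```

Under this stand-in, a prefix is rewarded only when its rollouts are distinguishable from those of the other prefixes, which is one way to read "each prefix induces a distinct rollout distribution".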
cs.AI / 52 / 2605.08827
Mental Health AI Safety Claims Must Preserve Temporal Evidence
心理健康人工智能安全声明必须保留时间证据
Abstract
The safety of mental health AI is often judged at the wrong temporal scale. Current evaluations typically score isolated responses, endpoint outcomes, or aggregate dialogue quality, while clinically consequential failures may arise from the order and accumulation of interactions themselves, including delayed escalation, repeated reinforcement, dependency formation, failed repair, and gradual deterioration across turns. This paper argues that this mismatch is not merely a limitation of evaluation coverage but a source of invalid safety conclusions. We introduce Temporal Safety Non-Identifiability, a formal account of why safety properties that depend on sequence, timing, accumulation, or recovery cannot be certified by protocols that discard those features. From this formalization, we develop SCOPE (Safety Claims Over Preserved Evidence) as a general principle for aligning safety claims with the evidence an evaluation actually retains, and instantiate it as SCOPE-MH, a mental-health instantiation of this reporting standard. We operationalize SCOPE-MH through a proof-of-concept on the AnnoMI dataset of expert-annotated motivational interviewing conversations, which reveals mechanisms of failure that per-turn behavior scoring does not represent. We propose SCOPE-MH as a diagnostic complement to existing evaluation infrastructure and argue that evaluation preserving temporal evidence is necessary, not optional, for safety-critical mental health AI deployment.
Chinese Translation
心理健康人工智能的安全性常常在错误的时间尺度上进行评估。目前的评估通常仅对孤立的反应、最终结果或汇总的对话质量进行评分,而临床上重要的失败可能源于交互的顺序和积累,包括延迟升级、重复强化、依赖形成、修复失败以及在多个回合中的逐渐恶化。本文认为,这种不匹配不仅仅是评估覆盖范围的局限,而是导致无效安全结论的根源。我们引入了时间安全非可识别性(Temporal Safety Non-Identifiability),正式阐述了为什么依赖于序列、时机、积累或恢复的安全属性无法通过忽略这些特征的协议进行认证。基于这一形式化,我们开发了SCOPE(保留证据的安全声明)作为一个通用原则,以将安全声明与评估实际保留的证据对齐,并将其具体化为SCOPE-MH,这是该报告标准在心理健康领域的具体应用。我们通过对AnnoMI数据集(专家注释的动机访谈对话)的概念验证来操作化SCOPE-MH,该数据集揭示了逐回合行为评分未能代表的失败机制。我们建议将SCOPE-MH作为现有评估基础设施的诊断补充,并认为保留时间证据的评估对于安全关键的心理健康人工智能部署是必要的,而非可选的。
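The non-identifiability argument can be illustrated with a toy case: two dialogues that contain exactly the same per-turn labels, in a different order. An evaluation that keeps only per-turn counts cannot tell them apart, while an order-dependent property can. The turn labels and both function names below are hypothetical and are not AnnoMI annotation codes.

```python
from collections import Counter

def per_turn_summary(turns):
    """Order-discarding evaluation: how often each per-turn label occurs."""
    return Counter(turns)

def escalation_delay(turns, trigger="risk_disclosure", response="escalate"):
    """Order-preserving property: turns elapsed between the first trigger
    and the first later response, or None if either never happens."""
    if trigger not in turns:
        return None
    start = turns.index(trigger)
    for offset, label in enumerate(turns[start + 1:], start=1):
        if label == response:
            return offset
    return None

prompt_dialogue = ["risk_disclosure", "escalate", "reflect", "reflect"]
delayed_dialogue = ["risk_disclosure", "reflect", "reflect", "escalate"]

same_summary = per_turn_summary(prompt_dialogue) == per_turn_summary(delayed_dialogue)
delays = (escalation_delay(prompt_dialogue), escalation_delay(delayed_dialogue))
```

Here `same_summary` is true although the two delays differ, so any safety claim about timely escalation would need the ordered transcript rather than the aggregate.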
cs.AI / 53 / 2605.08828
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
当智能体过度信任环境证据:一个可扩展的智能框架用于基准测试大型语言模型智能体中的证据基础缺陷
Abstract
Large language model agents increasingly operate through environment-facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems-level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under-specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence-grounding defect (EGD) as a behavioral failure in which an agent treats an environment-facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task-incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent-facing objective, and validation oracle, executes the evaluated agent, records its action-observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback-guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.
Chinese Translation
大型语言模型智能体越来越多地通过面向环境的支架进行操作,这些支架暴露了文件、网页、API 和日志。这些观察会影响工具使用、状态跟踪和行动顺序,但它们的可靠性和权威性往往不确定。因此,环境基础问题是一个系统级问题,涉及上下文接纳、证据来源、时效性检查、验证政策、行动门控和模型推理。现有的智能体基准主要评估任务能力或特定攻击(如提示注入和记忆中毒),但它们未能明确一个基本的可靠性问题:当观察结果陈旧、不正确或恶意时,智能体是否仍然基于真实环境状态。我们引入了 EnvTrustBench,一个用于基准测试这种失败模式的智能框架。我们将证据基础缺陷(EGD)定义为一种行为失败,其中智能体将面向环境的声明视为采取行动的充分证据,而未能将其与可用的当前证据进行核对,导致在真实环境状态下的任务不正确的错误路径。在给定任务场景的情况下,EnvTrustBench 生成工作区、环境、面向智能体的目标和验证 oracle,执行被评估的智能体,记录其行动-观察轨迹和最终状态,并应用 oracle 生成裁决。使用 6 个大型语言模型骨干和 5 个广泛使用的支架,我们在 11 个任务场景中评估了 55 个生成案例,每个场景通过五次反馈引导的生成迭代进行扩展。结果表明,EGD 在操作工作流程中持续出现,突显了环境基础作为核心智能体可靠性问题的重要安全影响。
cs.AI / 54 / 2605.08833
FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences
FRACTAL:具有分数递归架构的状态空间模型用于长序列的计算时间分析
Abstract
Effective sequence modeling fundamentally requires balancing the retention of unbounded history with the high-resolution detection of abrupt short-term variations common in real-world phenomena. However, existing state space models (SSMs) relying on high-order polynomial projection operators (HiPPO) face a critical trade-off where uniform measures dilute recent information to maintain timescale invariance, while exponential measures sacrifice global context to capture local dynamics. This paper proposes a Fractional Recurrent Architecture for Computational Temporal Analysis of Long sequences (FRACTAL), a novel architecture integrating fractional measure theory into recursive memory updates to address this limitation. By deriving projection operators with analytically characterized spectral properties and a tunable singularity index, the proposed method amplifies sensitivity to recent signal perturbations while preserving the spectral structure that encodes scale-invariant memory dynamics. This theoretical innovation is instantiated within a simplified diagonalized state space framework by modulating input projection initialization to enable simultaneous capture of multi-scale temporal features. FRACTAL achieves an average score of 87.11\% on the Long Range Arena benchmark, including 61.85\% on the ListOps task, outperforming the S5 model.
Chinese Translation
有效的序列建模根本上需要在保留无限历史与高分辨率检测现实世界现象中常见的突发短期变化之间取得平衡。然而,现有的依赖高阶多项式投影算子(HiPPO)的状态空间模型(SSMs)面临着一个关键的权衡:均匀测度稀释了近期信息以维持时间尺度不变性,而指数测度则牺牲了全局上下文以捕捉局部动态。本文提出了一种用于长序列计算时间分析的分数递归架构(FRACTAL),这是一种将分数测度理论整合到递归记忆更新中的新颖架构,以解决这一局限性。通过推导具有解析特征谱属性和可调奇点指数的投影算子,所提出的方法增强了对近期信号扰动的敏感性,同时保留了编码尺度不变记忆动态的谱结构。该理论创新在一个简化的对角化状态空间框架中得以实现,通过调节输入投影初始化以同时捕捉多尺度时间特征。FRACTAL在长范围竞技场基准测试中平均得分为87.11%,在ListOps任务中得分为61.85%,超越了S5模型。
cs.AI / 55 / 2605.08835
SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference
SynerDiff:用于快速并行扩散模型推理的协同连续批处理
Abstract
The expansion of Artificial Intelligence-generated content services requires diffusion model serving to simultaneously achieve high throughput and low task end-to-end (E2E) latency. However, existing continuous batching methods suffer from severe resource contention during UNet-VAE concurrency, leading to latency spikes. Furthermore, concurrent multi-task scheduling entails a trade-off between UNet throughput and VAE latency across varying scheduling strategies. To address these issues, we propose SynerDiff, an efficient continuous batching system built on intra-inter level synergy. At the intra-concurrency level, SynerDiff alleviates resource contention by pruning component-specific resource bottlenecks via VAE Chunking and Adaptive Skip-CFG. At the inter-concurrency level, leveraging components' differential sensitivity to scheduling granularities, a threshold-aware scheduler plans concurrent sequences and tunes intra-concurrency decisions to minimize VAE latency while maintaining UNet within a high-throughput threshold. Additionally, a feedback controller dynamically adjusts this threshold based on queue loads to raise the system capacity ceiling. Experimental results show that SynerDiff improves throughput by 1.6$\times$ and decreases both average E2E and P99 tail latencies by up to 78.7\% compared to benchmarks, while guaranteeing high image fidelity.
Chinese Translation
人工智能生成内容服务的扩展要求扩散模型服务能够同时实现高吞吐量和低任务端到端(E2E)延迟。然而,现有的连续批处理方法在UNet-VAE并发期间遭遇严重的资源竞争,导致延迟峰值。此外,并发多任务调度在不同调度策略下需要在UNet吞吐量和VAE延迟之间进行权衡。为了解决这些问题,我们提出了SynerDiff,一个基于内部和外部协同的高效连续批处理系统。在内部并发层面,SynerDiff通过VAE Chunking和自适应Skip-CFG减轻资源竞争,消除组件特定的资源瓶颈。在外部并发层面,利用组件对调度粒度的差异敏感性,阈值感知调度器规划并发序列,并调整内部并发决策,以最小化VAE延迟,同时保持UNet在高吞吐量阈值内。此外,反馈控制器根据队列负载动态调整该阈值,以提升系统容量上限。实验结果表明,与基准相比,SynerDiff提高了吞吐量1.6倍,并将平均E2E和P99尾部延迟降低了高达78.7%,同时保证了高图像保真度。
cs.AI / 56 / 2605.08843
M$^3$: Reframing Training Measures for Discretized Physical Simulations
M$^3$: 重新构建离散物理仿真的训练度量
Abstract
Neural surrogate models for physical simulations are trained on discretized samples of continuous domains, where the induced empirical measure leads to uneven supervision, biasing optimization and causing spatial inconsistencies in physical fidelity. To mitigate this measure-induced bias, we propose M$^3$ (Multi-scale Morton Measure), a scalable framework that balances training measures by partitioning space according to physical variation and allocating supervision across multiple scales. Applied to three industrial-scale datasets with diverse discretizations, M$^3$ consistently improves predictions in the continuous physical domain, achieving up to 4.7$\times$ lower error in large-scale volumetric cases. These gains persist under aggressive subsampling (160M $\rightarrow$ 16M $\rightarrow$ 1.6M points), where M$^3$-trained models outperform those trained on higher-resolution data, reducing physics-weighted relative $L_2$ error by 3--4$\times$ and the corresponding MSE by up to 13$\times$. These results highlight data distribution as a key factor in operator learning and position M$^3$ as a scalable, data-efficient approach for physically consistent modeling.
Chinese Translation
物理仿真的神经代理模型是在连续域的离散样本上训练的,其中引入的经验度量导致了不均匀的监督,偏向优化并造成物理真实性的空间不一致性。为减轻这种度量引起的偏差,我们提出了M$^3$(多尺度莫顿度量),这是一个可扩展的框架,通过根据物理变化划分空间并在多个尺度上分配监督来平衡训练度量。应用于三个具有不同离散化的工业规模数据集,M$^3$在连续物理域中的预测一致性得到了改善,在大规模体积案例中实现了高达4.7倍的误差降低。这些提升在激进的子采样下依然存在(从160M降至16M再降至1.6M点),M$^3$训练的模型在高分辨率数据上表现优于其他模型,将物理加权相对$L_2$误差降低了3到4倍,相应的均方误差(MSE)降低了高达13倍。这些结果突显了数据分布作为操作学习的关键因素,并将M$^3$定位为一种可扩展且数据高效的物理一致建模方法。
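The abstract does not describe how M$^3$ builds its measure, so the following is only a sketch of the two generic ingredients its name suggests: a standard 2D Morton (Z-order) code, and a reweighting that gives every occupied cell the same total weight so that densely meshed regions do not dominate the loss. The function names, the 2D setting, and the equal-mass-per-cell rule are assumptions; in particular, the sketch partitions by position only, not by physical variation.

```python
from collections import Counter

def morton2d(ix, iy, bits=16):
    """Interleave the bits of two non-negative cell indices into one Z-order key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)      # x bits go to even positions
        code |= ((iy >> b) & 1) << (2 * b + 1)  # y bits go to odd positions
    return code

def balanced_weights(points, level):
    """Per-point weights for points in [0, 1)^2 such that each occupied cell
    of a 2^level x 2^level grid carries the same total weight."""
    cells = 1 << level
    keys = [
        morton2d(min(int(x * cells), cells - 1), min(int(y * cells), cells - 1))
        for x, y in points
    ]
    counts = Counter(keys)
    return [1.0 / (len(counts) * counts[k]) for k in keys]

# Three samples crowd one cell; one sample sits alone in another.
points = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.14), (0.90, 0.90)]
weights = balanced_weights(points, level=2)
```

With uniform per-sample weighting the crowded cell would receive three quarters of the supervision; after reweighting, each of the two occupied cells receives half. Repeating this at several values of `level` is one plausible reading of "multi-scale".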
cs.AI / 57 / 2605.08887
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
Ace-Skill:通过优先级和聚类进化引导多模态智能体的自我演化
Abstract
Self-evolving agents present a promising path toward continual adaptation by distilling task interactions into reusable knowledge artifacts. In practice, this paradigm remains hindered by two coupled bottlenecks: data inefficiency, where costly rollout effort is disproportionately spent on low-value samples rather than informative ones, and knowledge interference, where heterogeneous knowledge stored in shared repositories leads to noisy retrieval and task-misaligned guidance. Together, these issues form a self-reinforcing failure loop in which uninformative rollouts yield noisy knowledge, which in turn degrades subsequent rollouts. In this work, we introduce Ace-Skill, a co-evolutionary framework that jointly optimizes rollout allocation and knowledge organization for self-evolving multimodal agents. Specifically, Ace-Skill combines a prioritized sampler with lazy-decay proficiency tracking to focus rollouts on informative and insufficiently mastered samples, and a clustered organizer that semantically clusters knowledge for cleaner retrieval and more reliable adaptation. By improving sampling and organization together, Ace-Skill turns self-evolution into a virtuous cycle in which more informative rollouts produce higher-quality knowledge that supports stronger subsequent rollouts. Across four multimodal tool-use benchmarks, Ace-Skill delivers strong gains (e.g., +35.46% relative improvement in Avg@4 accuracy), enabling an open-source 35B MoE model to match or surpass proprietary models. The acquired knowledge also transfers effectively in a zero-shot manner to smaller 9B and 4B models, allowing resource-constrained agents to inherit advanced capabilities without additional training. The code is publicly available at https://github.com/AMAP-ML/Ace-Skill.
Chinese Translation
自我演化智能体通过将任务交互提炼为可重用的知识工件,为持续适应提供了有希望的路径。然而,在实践中,这一范式受到两个相互关联的瓶颈的制约:数据效率低下,即在低价值样本上花费过多的高成本展开努力,而非在信息丰富的样本上,以及知识干扰,即存储在共享库中的异质知识导致噪声检索和任务不对齐的指导。这些问题共同形成了一个自我强化的失败循环,其中无信息的展开产生噪声知识,进而降低后续展开的质量。在本研究中,我们提出了Ace-Skill,一个共同进化框架,旨在为自我演化的多模态智能体联合优化展开分配和知识组织。具体而言,Ace-Skill结合了优先采样器和懒惰衰减的能力跟踪,专注于信息丰富且掌握不足的样本的展开,以及一个聚类组织者,语义上聚类知识以实现更清晰的检索和更可靠的适应。通过共同改善采样和组织,Ace-Skill将自我演化转变为一个良性循环,其中更具信息性的展开产生更高质量的知识,从而支持更强的后续展开。在四个多模态工具使用基准测试中,Ace-Skill显著提升了性能(例如,Avg@4准确率相对提高35.46%),使得开源的35B MoE模型能够与专有模型相匹配或超越。所获得的知识还能够有效地以零样本方式转移到较小的9B和4B模型,使得资源受限的智能体能够在无需额外训练的情况下继承先进的能力。代码已公开发布在 https://github.com/AMAP-ML/Ace-Skill。
cs.AI / 58 / 2605.08904
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH:评估大型搜索空间中LLM代理的迭代自我优化
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem solving, including perception, reasoning, and memory, remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity, namely the ability to continuously refine solutions in response to dynamic environmental feedback, remains underexplored. To address this challenge, we introduce OPT-BENCH, a benchmark for evaluating self-improvement capabilities in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, OPT-BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application. We further propose OPT-Agent, a framework that emulates human-like cognitive adaptation. It operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open-source models ranging from 3B to 235B parameters, we demonstrate that stronger models are more effective at leveraging feedback signals for self-improvement. However, this upper-bound adaptability remains fundamentally constrained by the models' base capacity, and even the most advanced LLMs still fall short of human expert performance.
Chinese Translation
大型语言模型(LLMs)在推理和工具使用方面展现了显著的能力。然而,解决问题所需的基本认知能力,包括感知、推理和记忆,依然是智能的稳定核心。与记忆特定模式不同,人类通过将这些内在能力应用于适应和优化,从而在新环境中取得成功。然而,LLMs是否具备这种基本能力,即在动态环境反馈下持续优化解决方案的能力,仍然未得到充分探讨。为了解决这一挑战,我们提出了OPT-BENCH,一个用于评估大型搜索空间中自我改进能力的基准。通过将20个机器学习任务与10个经典的NP难题相结合,OPT-BENCH提供了一个严格的环境,以评估代理是否能够通过内在自我反思进行适应,而不是单纯依赖机械的工具应用。我们进一步提出了OPT-Agent,一个模拟类人认知适应的框架。它通过一个通用的感知、记忆和推理循环运作,基于环境反馈迭代地优化解决方案。通过对来自7个模型家族的19个LLMs进行广泛实验,包括推理模型、通用模型和参数范围从3B到235B的开源模型,我们证明了更强大的模型在利用反馈信号进行自我改进方面更为有效。然而,这种上限适应性仍然受到模型基础能力的根本限制,即使是最先进的LLMs在性能上仍然无法与人类专家相媲美。
cs.AI / 59 / 2605.08905
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Forge: 面向质量的强化学习在 NP-困难优化中的应用于大语言模型
Abstract
Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
Chinese Translation
大型语言模型(LLMs)通过可验证奖励的强化学习(RLVR)在推理基准测试中取得了显著成功,在数学、编码、逻辑和谜题等任务中表现优异。然而,现有基准仅评估正确性,而忽视了最优性,即在约束条件下找到最佳解决方案的能力。我们提出了 OPT-BENCH,这是第一个全面的框架,用于通过面向质量的 RLVR 训练和评估 LLMs 在 NP-困难优化问题上的表现。OPT-BENCH 提供了三个关键组件:一个具有实例生成器的可扩展训练基础设施、质量验证器和涵盖 10 个任务的最优基准;一个严格的基准测试,包含 1,000 个实例,评估可行性(通过成功率衡量)和质量(通过质量比率衡量);以及面向质量的奖励,使得超越二元正确性实现持续改进。在使用 15K 示例对 Qwen2.5-7B-Instruct-1M 进行训练时,成功率(SR)达到 93.1%,质量比率(QR)达到 46.6%,显著优于 GPT-4o,其 SR 为 29.6%,QR 为 14.6%。除了优化,OPT-BENCH 的训练还能够迁移到多样化的任务,包括数学(+2.2%)、逻辑(+1.2%)、知识(+4.1%)和指令跟随(+6.1%)。我们的分析表明,面向质量的奖励相比于二元奖励提高了解决方案的质量达 28.8%,而任务多样性对泛化的推动作用大于数据量,为复杂推理中的 RLVR 扩展提供了见解。
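To make the contrast between binary and quality-aware rewards concrete, the sketch below scores two feasible tours of different length. The abstract does not define Quality Ratio, so the ratio used here (optimal value over achieved value for minimization, clipped to [0, 1]) and both function names are assumptions for illustration.

```python
def binary_reward(feasible):
    """Correctness-only signal: every feasible solution scores the same."""
    return 1.0 if feasible else 0.0

def quality_reward(feasible, value, optimal_value, minimize=True):
    """Assumed quality-aware signal: 0 for infeasible answers, otherwise the
    ratio between the optimal and the achieved objective, clipped to [0, 1]."""
    if not feasible:
        return 0.0
    if value <= 0 or optimal_value <= 0:
        raise ValueError("this sketch assumes strictly positive objective values")
    ratio = optimal_value / value if minimize else value / optimal_value
    return max(0.0, min(1.0, ratio))

# Two feasible tours for the same routing instance; the optimum has length 100.
optimal_tour = quality_reward(True, value=100.0, optimal_value=100.0)
longer_tour = quality_reward(True, value=120.0, optimal_value=100.0)
```

A binary reward gives both tours 1.0, so there is no gradient toward the shorter one; the graded reward separates them (1.0 versus about 0.83), which is the kind of signal the abstract credits for improvement beyond binary correctness.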
cs.AI / 60 / 2605.08930
Internalizing Safety Understanding in Large Reasoning Models via Verification
通过验证将安全理解内化于大型推理模型中
Abstract
While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it also enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our code is available at https://github.com/AlphaLab-USTC/SInternal
Chinese Translation
尽管显式的思维链(Chain-of-Thought, CoT)增强了大型推理模型(Large Reasoning Models, LRMs)的能力,但也使得最终答案的风险性增加。目前的对齐范式主要依赖于外部强制合规,优化模型以检测恶意提示,而不是评估其自身输出的安全性。我们认为这种方法在很大程度上仍然是行为性的:我们的实证分析表明,表面上对齐的模型缺乏内在的安全理解,往往无法验证自身响应的安全性,并且仍然容易受到对抗性越狱攻击。为了解决这一根本性限制,我们提出了安全内化框架(Safety Internal, SInternal),该框架通过专门训练LRMs在安全验证任务上进行内化安全规范,以批判其自身生成的答案,利用专家推理轨迹进行评估。我们证明,学习验证能够强烈促进响应安全性的一般化,显著增强对域外越狱攻击的鲁棒性。此外,当与强化学习结合时,SInternal作为一种优于标准监督微调的初始化方法,表明内化安全理解为对齐创造了比单纯模仿安全行为更为稳健的基础。我们的代码可在 https://github.com/AlphaLab-USTC/SInternal 获取。
cs.AI / 61 / 2605.08935
PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting
PnP-Corrector:一种用于耦合时空预测的通用修正框架
Abstract
Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in climate models. However, existing methods are severely constrained by the persistent bottleneck of compounding errors. In coupled systems, errors from each subsystem simulator propagate and amplify one another, a phenomenon we term Reciprocal Error Amplification, leading to a rapid collapse of long-range predictions. To address this challenge, we propose a universal framework called PnP-Corrector (Plug-and-Play Corrector). The core idea of our framework is to decouple the physical simulation from the error correction process: it freezes pre-trained physics simulation engines and exclusively trains a correction agent to proactively counteract the systematic biases emerging from the coupled system. Furthermore, we design an efficient predictive model architecture, DSLCast, to serve as the backbone of this framework. Extensive experiments demonstrate that our method significantly enhances the long-term stability and accuracy of coupled forecasting systems. For instance, in the challenging task of a 300-day global ocean-atmosphere coupled forecast, our PnP-Corrector framework reduces the prediction error of the baseline model by 29% and surpasses state-of-the-art models on several key metrics.
Chinese Translation
耦合时空预测对于预测多个相互作用的动态系统的未来演变至关重要,例如气候模型。然而,现有方法受到累积误差这一持续瓶颈的严重制约。在耦合系统中,各子系统模拟器的误差相互传播并放大,这一现象我们称之为“互惠误差放大”,导致长期预测的快速崩溃。为了解决这一挑战,我们提出了一种名为PnP-Corrector(即插即用修正器)的通用框架。我们框架的核心思想是将物理模拟与误差修正过程解耦:它冻结预训练的物理模拟引擎,并专门训练一个修正代理,以主动抵消耦合系统中出现的系统性偏差。此外,我们设计了一种高效的预测模型架构DSLCast,作为该框架的支柱。大量实验表明,我们的方法显著提高了耦合预测系统的长期稳定性和准确性。例如,在300天全球海洋-大气耦合预测这一具有挑战性的任务中,我们的PnP-Corrector框架将基线模型的预测误差降低了29%,并在多个关键指标上超越了最先进的模型。
cs.AI / 62 / 2605.08936
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET:学习从不安全推理轨迹中自我恢复
Abstract
Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data that inevitably deviate from the model's dynamic, on-policy reasoning traces, so the model covers little of its vast generation space and rarely learns to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and using data efficiently. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify unsafe intermediate error states and recover from them back to benign paths. Our code and data are available at https://github.com/Ing1024/Self-ReSET.
Chinese Translation
大型推理模型在一般领域具备显著的自我纠错能力;然而,它们在面对对抗性攻击时,常常难以从不安全的推理轨迹中恢复。现有的对齐方法试图通过在专家数据上进行微调,包括反思轨迹或对抗前缀,来缓解这一脆弱性。关键是,这些方法常常受到静态训练数据的限制,而这些数据不可避免地偏离模型的动态、同策略(on-policy)推理轨迹,导致模型难以覆盖其广泛的生成空间并学习从自身失败中恢复。为了弥补这一差距,我们提出了Self-ReSET,这是一个纯强化学习框架,旨在赋予大型推理模型(LRMs)内在的能力,从自身的安全错误轨迹中恢复,这些轨迹随后被重新用作强化学习的初始状态。在各种大型推理模型和基准测试中进行的广泛实验表明,Self-ReSET显著增强了对抗性攻击的鲁棒性,尤其是在分布外(OOD)越狱提示下,同时保持了通用效用和高效的数据利用。进一步的分析显示,我们的方法有效促进了自我恢复模式,使模型能够更好地识别和从不安全的中间错误状态恢复到良性路径。我们的代码和数据可在 https://github.com/Ing1024/Self-ReSET 获取。
cs.AI / 63 / 2605.08938
Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators
我们能否正式验证神经偏微分方程代理模型?小型傅里叶神经算子的 SMT 编译
Abstract
Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve basic physical structure. We show that, once the trained weights and grid are fixed, the spectral convolution in an FNO is a linear map. As a result, the full forward pass is piecewise-linear and can be represented exactly in Z3's linear real arithmetic. We study two encodings. The exact encoding compiles the spectral convolution into a dense matrix multiplication, which is sound for both proofs and counterexamples. The lighter frozen encoding replaces the spectral path with a constant, making it faster but approximate. On 10 small FNO surrogates for 1D advection-diffusion-reaction (85 to 117 parameters, grids 8 to 32), the exact encoding gives 2 sound positivity proofs on linear (ReLU-free) models, 5 sound positivity counterexamples, and 10 sound mass-violation counterexamples; the remaining 3 positivity queries on ReLU models time out. For mass non-increase, Z3 finds worse counterexamples than both gradient-based falsification and Monte Carlo on 7 of 10 models. The frozen encoding scales to grid size 64 with sub-second positivity checks, but it no longer provides certificates for the original FNO. Overall, the results make the soundness--scalability tradeoff explicit and point to what is needed for formal verification of production-scale neural operators.
Chinese Translation
傅里叶神经算子(FNOs)可以极大加速偏微分方程(PDE)的模拟,但它们通常在没有正式保证的情况下使用,这使得基本物理结构得不到保留。我们展示了一旦训练的权重和网格固定后,FNO 中的谱卷积是一个线性映射。因此,完整的前向传播是分段线性的,并且可以在 Z3 的线性实数算术中精确表示。我们研究了两种编码方式。精确编码将谱卷积编译为密集矩阵乘法,这对于证明和反例都是有效的。较轻的冻结编码则用常数替代了谱路径,使其速度更快但结果近似。在 10 个用于一维对流-扩散-反应的小型 FNO 代理模型(参数从 85 到 117,网格从 8 到 32)中,精确编码在无 ReLU 模型上提供了 2 个有效的正性证明,5 个有效的正性反例,以及 10 个有效的质量违反反例;其余 3 个关于 ReLU 模型的正性查询超时。对于质量不增加的情况,Z3 在 10 个模型中的 7 个上找到的反例比基于梯度的反驳和蒙特卡洛方法更糟。冻结编码能够扩展到网格大小为 64 的情况,并在亚秒内完成正性检查,但不再为原始 FNO 提供证明。总体而言,这些结果明确了有效性与可扩展性之间的权衡,并指出了正式验证生产规模神经算子所需的条件。
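The linearity claim in the FNO entry above can be checked numerically. The following is a minimal sketch, assuming a single-channel 1D spectral convolution that scales the lowest `k` Fourier modes by fixed complex weights; the function names, grid size, and weights are illustrative assumptions and are not taken from the paper. It builds the equivalent dense matrix column by column and confirms that it reproduces the FFT-based layer.

```python
import numpy as np

def spectral_conv(x, weights):
    """Hypothetical single-channel 1D spectral convolution:
    FFT, scale the lowest modes by fixed weights, inverse FFT."""
    n = x.shape[0]
    x_hat = np.fft.rfft(x)
    out_hat = np.zeros_like(x_hat)          # higher modes are truncated to zero
    k = weights.shape[0]
    out_hat[:k] = weights * x_hat[:k]
    return np.fft.irfft(out_hat, n=n)

def dense_matrix(weights, n):
    """Column j is the layer's response to the basis vector e_j."""
    return np.stack([spectral_conv(e, weights) for e in np.eye(n)], axis=1)

rng = np.random.default_rng(0)
n, k = 8, 3                                  # illustrative grid size and mode count
weights = rng.normal(size=k) + 1j * rng.normal(size=k)

W = dense_matrix(weights, n)
x = rng.normal(size=n)

# With weights and grid fixed, the layer is a real-linear map, so W @ x matches it.
max_err = np.max(np.abs(W @ x - spectral_conv(x, weights)))
print(W.shape, max_err)
```

The matrix here uses floating-point entries; an exact solver encoding would presumably need rational coefficients, a step this sketch does not attempt.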
cs.AI / 64 / 2605.08941
MDGYM: Benchmarking AI Agents on Molecular Simulations
MDGYM:在分子模拟中对人工智能代理的基准测试
Abstract
The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks -- Claude Code, Codex, and OpenHands -- with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than 10% at higher difficulties. Trajectory analysis reveals a characteristic pattern of failure -- agents successfully invoke simulation machinery but produce physically unstable configurations, fabricate numerical outputs without executing the underlying computation, or abandon tasks prematurely rather than iterating through simulation-specific errors. These failure modes are qualitatively distinct from those observed in general software engineering benchmarks, indicating that fluent code generation does not transfer to grounded physical reasoning.
Chinese Translation
人工智能驱动的科学发现的前景取决于人工智能代理是否能够自主设计和执行支撑现代科学的计算工作流程。分子动力学(MD)模拟为验证这一主张提供了一个自然的测试平台;它要求将物理直觉转化为语法和语义上正确的输入脚本,推理初始和边界条件,诊断数值不稳定的轨迹,并根据已知的物理行为和定律解释输出。我们介绍了MDGYM,这是一个包含169个专家策划的MD模拟的基准,涵盖了两个广泛使用的MD软件包:LAMMPS和GROMACS,并分为三个逐渐增加难度的级别。我们评估了三种代理框架——Claude Code、Codex和OpenHands——以及四种大型语言模型(LLMs),发现它们的表现均较差:即使是最强的代理也仅能解决21%的简单级任务,而在更高难度下的成功率不足10%。轨迹分析揭示了一种典型的失败模式——代理能够成功调用模拟机制,但产生物理上不稳定的配置,生成数值输出而未执行基础计算,或在遇到模拟特定错误时过早放弃任务。这些失败模式在质上与一般软件工程基准中观察到的模式不同,表明流畅的代码生成并不能转化为扎实的物理推理。
cs.AI / 65 / 2605.08956
Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery
具备自主性的人工智能科学家并不适合独立进行科学发现
Abstract
A growing body of work pursues AI scientists capable of end-to-end autonomous scientific discovery. This position paper argues that although they already function as co-scientists, agentic AI scientists are not built for autonomous scientific discovery. We identify the following challenges in building and deploying autonomous AI scientists: (1) Problem selection is influenced by the McNamara fallacy; (2) Agents are built on large language models (LLMs) whose training corpora omit tacit procedural and failure knowledge of laboratory practice; (3) Preference optimisation during post-training compresses output diversity toward consensus; and (4) Most scientific benchmarks measure single-turn prediction accuracy and lack feedback from physical experiments back to the computational model. These challenges are not just questions of scale and scaffolding; they require revisiting fundamental design choices. To build truly autonomous AI scientists, we recommend the use of scientific simulations as verifiers for training, the design of persistent world models that represent the shifting objectives governing real investigations, the establishment of a centralized preregistration repository for all AI-generated hypotheses, and application driven by scientific need rather than tool affordance.
Chinese Translation
越来越多的研究致力于开发能够进行端到端自主科学发现的人工智能科学家。本文立场论文认为,尽管这些具备自主性的人工智能科学家已经能够作为共同科学家发挥作用,但它们并不适合独立进行科学发现。我们识别出构建和部署自主人工智能科学家的以下挑战:(1)问题选择受到麦克纳马拉谬论的影响;(2)智能体基于大型语言模型(LLMs)构建,而其训练语料库忽略了实验室实践中隐性程序和失败知识;(3)后训练阶段的偏好优化使输出多样性趋向共识;(4)大多数科学基准仅测量单轮预测准确性,缺乏来自物理实验对计算模型的反馈。这些挑战不仅仅是规模和支架的问题;它们需要重新审视基本设计选择。为了构建真正自主的人工智能科学家,我们建议使用科学模拟作为训练的验证工具,设计持久的世界模型以代表真实研究中不断变化的目标,建立一个集中式的预注册库以存储所有人工智能生成的假设,以及推动应用的科学需求而非工具的可用性。
cs.AI / 66 / 2605.08975
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
通过高效轨迹生成对Alpamayo 1的延迟分析与优化
Abstract
Reasoning-based end-to-end (E2E) autonomous driving has recently emerged as a promising approach to improving the interpretability of driving decisions as it can generate human-readable reasoning together with predicted trajectories. Such approaches commonly generate multiple trajectories to capture diverse future behaviors, and they fall into two categories: (1) multi-reasoning, where one reasoning sequence is generated per trajectory, and (2) single-reasoning, where a single reasoning is shared across all trajectories. The former offers richer diversity at the cost of redundant computation, while the latter is more efficient but is often assumed to sacrifice diversity. Alpamayo 1, a representative system, adopts the multi-reasoning approach and achieves competitive trajectory prediction performance. However, the efficiency of this design remains largely unexplored, making it a well-motivated subject for investigation. In this paper, we systematically analyze and improve Alpamayo 1 in two ways. First, we reduce inference latency while preserving trajectory diversity by redesigning Alpamayo 1 into a single-reasoning system. Through extensive experiments, we find that replacing multi-reasoning with single-reasoning does not meaningfully degrade trajectory diversity. Second, we accelerate diffusion-based action generation by eliminating inter-block overhead arising from unnecessary copy operations and inefficient kernel execution. Through closed-loop and open-loop experiments, we validate both optimizations, demonstrating a 69.23% reduction in inference latency while maintaining trajectory diversity and prediction quality. These results highlight the importance of jointly analyzing system architecture and runtime execution to improve the efficiency of reasoning-based E2E AD systems.
Chinese Translation
基于推理的端到端(E2E)自动驾驶最近作为一种有前景的方法出现,旨在提高驾驶决策的可解释性,因为它能够生成可被人类理解的推理以及预测轨迹。这类方法通常生成多条轨迹以捕捉多样化的未来行为,分为两类:(1)多重推理,每条轨迹生成一个推理序列;(2)单一推理,所有轨迹共享一个推理。前者提供了更丰富的多样性,但代价是冗余的计算,而后者更高效,但通常被认为牺牲了多样性。Alpamayo 1作为一个代表性系统,采用了多重推理方法,并实现了具有竞争力的轨迹预测性能。然而,这种设计的效率仍然在很大程度上未被探索,因此成为一个值得研究的主题。在本文中,我们从两个方面系统地分析和改进Alpamayo 1。首先,我们通过将Alpamayo 1重新设计为单一推理系统,减少推理延迟,同时保持轨迹多样性。通过广泛的实验,我们发现用单一推理替代多重推理并不会显著降低轨迹多样性。其次,我们通过消除由不必要的复制操作和低效的内核执行引起的块间开销,加速基于扩散的动作生成。通过闭环和开环实验,我们验证了这两项优化,证明在保持轨迹多样性和预测质量的同时,推理延迟减少了69.23%。这些结果强调了联合分析系统架构和运行时执行以提高基于推理的E2E自动驾驶系统效率的重要性。
cs.AI / 67 / 2605.08978
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
学习探索:通过探索感知的策略优化扩展自主推理能力
Abstract
Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
Chinese Translation
最近在自主测试时扩展方面的进展使得模型能够在最终行动之前收集环境反馈。现有方法的一个主要限制是它们通常采用无差别的探索策略,缺乏自适应区分何时真正需要探索的能力。本文提出了一种探索感知的强化学习框架,使得大规模语言模型(LLM)代理能够在不确定性高时自适应地进行探索。我们的方法通过变分推断引入了一种细粒度的奖励函数,明确评估探索性行动的潜力,以改善未来决策,同时引入了一种探索感知的分组机制,在优化过程中将探索性行动与任务完成行动分开。通过针对信息缺口的设计,该方法使代理能够选择性地进行探索,并在任务上下文明确时迅速过渡到执行阶段。通过实证研究,我们展示了该方法在一系列具有挑战性的基于文本和基于图形用户界面(GUI)的代理基准测试中实现了一致的改进。代码可在 https://github.com/HansenHua/EAPO-ICML26 获取,模型可在 https://huggingface.co/hansenhua/EAPO-ICML26 获取。
cs.AI / 68 / 2605.08991
Sufficient conditions for a Heuristic Rating Estimation Method application
启发式评分估计方法应用的充分条件
Abstract
A series of papers has introduced the Heuristic Rating Estimation method, which evaluates a set of alternatives based on pairwise comparisons and the weights of reference alternatives. We formulate the conditions under which the HRE method can be applied correctly. The research considers both arithmetic and geometric algorithms for complete and incomplete pairwise comparison methods. The illustrative examples show that the estimations of inconsistency in the arithmetic variant are optimal.
Chinese Translation
一系列论文介绍了启发式评分估计(Heuristic Rating Estimation, HRE)方法,该方法基于成对比较和参考替代方案的权重评估一组替代方案。我们制定了HRE方法可以正确应用的条件。研究考虑了完整和不完整成对比较方法的算术和几何算法。示例表明,算术变体中的不一致性估计是最优的。
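To make the HRE entry above concrete, here is a minimal sketch of the arithmetic variant as it is commonly stated in the HRE literature: each unknown weight is the average of the reference-scaled comparisons, w_j = (1/(n-1)) * sum over i != j of m_ji * w_i, which yields a linear system in the unknown weights. The function name `hre_arithmetic` and the example data are assumptions for illustration; the sufficient conditions studied in the paper are not reproduced here.

```python
import numpy as np

def hre_arithmetic(M, known):
    """Sketch of arithmetic HRE (assumed formulation, not the paper's code).

    M      -- n x n pairwise comparison matrix, M[j, i] ~ w_j / w_i
    known  -- dict {index: weight} for the reference alternatives
    Returns a dict {index: estimated weight} for the remaining alternatives.
    """
    n = M.shape[0]
    unknown = [j for j in range(n) if j not in known]
    A = np.eye(len(unknown))
    b = np.zeros(len(unknown))
    for r, j in enumerate(unknown):
        for c, i in enumerate(unknown):
            if i != j:
                A[r, c] = -M[j, i] / (n - 1)   # unknown terms move to the left side
        # contributions of the reference alternatives stay on the right side
        b[r] = sum(M[j, i] * w for i, w in known.items()) / (n - 1)
    w_unknown = np.linalg.solve(A, b)          # fails if A is singular
    return dict(zip(unknown, w_unknown))

# Fully consistent toy example: the true weights should be recovered.
true_w = np.array([0.4, 0.3, 0.2, 0.1])
M = true_w[:, None] / true_w[None, :]
estimates = hre_arithmetic(M, known={0: 0.4, 1: 0.3})
print(estimates)
```

In this sketch the method is only usable when the linear system has a unique solution with positive entries, which is the kind of requirement an applicability condition has to guarantee.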
cs.AI / 69 / 2605.09012
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Re$^2$Math:研究级数学中的定理检索基准测试
Abstract
Large language models are increasingly capable at closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce Re$^2$Math, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval artifact, ensuring reproducibility, while the benchmark itself supports automatic, continual expansion with newly constructed instances. On the current benchmark test set, the best fixed-judge ToolAcc reaches 7.0%, despite substantially higher rates of source grounding, indicating that current systems often retrieve valid statements but fail to establish their applicability to the local proof step. By decoupling citation recall, grounding, and proof-gap sufficiency, Re$^2$Math transforms literature-grounded mathematical tool use into a controlled diagnostic task.
Chinese Translation
大型语言模型在封闭世界的数学推理方面越来越强大,但研究辅助还需要基于文献的源头使用。当一个证明达到一个非平凡的步骤时,一个有用的助手应该判断所需的工具(例如,引理)是否已经存在,识别合适的学术来源,并验证其假设是否与当前证明的上下文一致。为了严格评估这些能力,我们引入了Re$^2$Math,这是一个用于从部分数学证明中进行工具基础检索的基准测试。每个实例都是从主要定理的证明中的候选工具引用构建的,具有层次上下文和可选的泄漏控制锚提示。我们还使任务基于源头但不依赖于引用,即任何足以支持证明过渡的可接受定理都被接受。评估使用一个冻结的检索工件,确保可重复性,而基准本身支持通过新构建的实例进行自动、持续的扩展。在当前的基准测试集中,最佳固定评判的ToolAcc达到了7.0%,尽管源头基础的比率显著更高,这表明当前系统通常检索有效的陈述,但未能确立其在局部证明步骤中的适用性。通过解耦引用回忆、基础和证明间隙的充分性,Re$^2$Math将文献基础的数学工具使用转变为一个受控的诊断任务。
cs.AI / 70 / 2605.09016
CATO: Charted Attention for Neural PDE Operators
CATO:用于神经偏微分方程算子的图表注意力
Abstract
Neural operators have emerged as powerful data-driven solvers for PDEs, offering substantial acceleration over classical numerical methods. However, existing transformer-based operators still face critical challenges when modeling PDEs on complex geometries: directly processing over massive mesh points is computationally expensive, while operating in raw discretization coordinates may obscure the intrinsic geometry where physical interactions are more naturally expressed. To address these limitations, we introduce the Charted Axial Transformer Operator (CATO), a geometry-adaptive and derivative-aware neural operator for PDEs on general geometries. Instead of applying attention directly in the physical coordinate system, CATO learns a continuous latent chart that maps mesh coordinates into a learned chart space, where chart-conditioned axial attention efficiently captures long-range dependencies with reduced computational cost. In addition, CATO introduces a derivative-aware physics loss for steady-state PDEs that jointly supervises solution values, mesh-consistent gradients, and an auxiliary flux-like field, improving physical fidelity and reducing oversmoothing. We further provide a theoretical approximation result showing that, under a favorable chart, charted axial attention can represent low-rank axial solution operators with controlled error, and that small chart perturbations induce bounded approximation degradation. CATO achieves the best performance across all evaluated datasets, yielding an average improvement of approximately 26.76% over the strongest competing baselines while reducing the number of parameters by 81.98%. These results highlight the effectiveness of learning geometry-adaptive charts and derivative-aware physical supervision for accurate and efficient PDE operator learning.
Chinese Translation
神经算子作为强大的数据驱动偏微分方程(PDE)求解器,已显现出相较于经典数值方法的显著加速。然而,现有的基于变换器的算子在复杂几何体上建模PDE时仍面临关键挑战:直接处理大量网格点的计算成本高,而在原始离散坐标中操作可能会掩盖物理交互更自然表达的内在几何。为了解决这些限制,我们引入了图表轴向变换器算子(CATO),这是一种几何自适应且考虑导数的神经算子,适用于一般几何体上的PDE。CATO并不是直接在物理坐标系中应用注意力,而是学习一个连续的潜在图表,将网格坐标映射到学习的图表空间,在该空间中,图表条件的轴向注意力能够以降低的计算成本高效捕捉长程依赖。此外,CATO为稳态PDE引入了一种考虑导数的物理损失,该损失共同监督解值、网格一致梯度和辅助通量场,提高了物理真实性并减少了过平滑。我们进一步提供了一个理论近似结果,表明在有利的图表下,图表轴向注意力能够以可控的误差表示低秩轴向解算子,并且小的图表扰动会引起有界的近似退化。CATO在所有评估数据集中表现最佳,平均性能提升约为26.76%,同时参数数量减少了81.98%。这些结果突显了学习几何自适应图表和考虑导数的物理监督在准确和高效的PDE算子学习中的有效性。
cs.AI / 71 / 2605.09038
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill:教会大语言模型使用搜索工具与不断演变的技能库
Abstract
Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose SearchSkill, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.
Chinese Translation
教会语言模型使用搜索工具不仅仅是一个是否进行搜索的问题,更是一个是否能够发出良好查询的问题。这在开放领域问答中尤为重要,因为宽泛或复制的查询往往会浪费检索预算并干扰后续推理。我们提出了 SearchSkill,一个通过可重用搜索技能明确查询规划的框架。在每一步中,模型首先选择一个技能,然后根据所选技能卡生成搜索或回答动作。技能库本身并不是固定的:SearchSkill 维护一个不断演变的技能库,从重复失败模式中扩展或细化,并在监督训练之前重构受影响的轨迹。最终形成的两阶段 SFT(监督微调)方案将训练与技能选择后技能基础执行的推理时协议对齐。在开源和闭源模型中,SearchSkill 在知识密集型问答基准测试中提高了精确匹配率,并改善了检索行为,包括减少复制的首次查询、更多原子跳跃聚焦的查询,以及在小搜索预算内更多正确答案。这些结果表明,明确的技能条件查询规划是将搜索视为无差别动作的轻量替代方案。
cs.AI / 72 / 2605.09040
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
UxSID:面向超长序列的语义感知用户兴趣建模
Abstract
Modeling ultra-long user sequences involves a difficult trade-off between efficiency and effectiveness. While current paradigms rely on either item-specific search or item-agnostic compression, we propose UxSID, a framework exploring a third path: semantic-group shared interest memory. By utilizing Semantic IDs (SIDs) and a dual-level attention strategy, UxSID captures target-aware preferences without the heavy cost of item-specific models. This end-to-end architecture balances computational parsimony with semantic awareness, achieving state-of-the-art performance and a 0.337% revenue lift in a large-scale advertising A/B test.
Chinese Translation
建模超长用户序列涉及效率与有效性之间的艰难权衡。当前的范式依赖于特定项目的搜索或与项目无关的压缩,我们提出了UxSID,一个探索第三条路径的框架:语义组共享兴趣记忆。通过利用语义ID(Semantic IDs, SIDs)和双层注意力策略,UxSID在不需要特定项目模型的高昂成本下捕捉目标感知偏好。该端到端架构在计算简约性与语义感知之间取得平衡,实现了最先进的性能,并在大规模广告A/B测试中实现了0.337%的收入提升。
cs.AI / 73 / 2605.09045
Containment Verification: AI Safety Guarantees Independent of Alignment
约束验证:独立于对齐的人工智能安全保障
Abstract
Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.
Chinese Translation
代理框架是人工智能代理在世界中行动的软件层。现有的安全方法对模型进行干预,因此仍然依赖于不可验证的学习行为属性。我们提出了约束验证,它将安全保障定位于代理框架本身。在混乱神谕语义下,人工智能被建模为一个不受限制的神谕,涵盖整个类型化动作空间,而经过验证的约束层必须为每一个可能的人工智能输出强制执行边界策略。对于可边界强制的属性,这些属性在建模的边界事件、动作参数和状态上表达,我们通过前向模拟细化证明了一个普遍保障,并在Dafny中实现了这一过程。我们通过验证PocketFlow,一个极简的代理LLM框架,来实例化这一范式,并使用代理合成管道生成规范、操作模型和在信息屏障下针对自明规范的细化证明。据我们所知,这是对代理框架的首次演绎形式验证,其保障在建模的类型化动作边界上对模型能力是不变的。
cs.AI / 74 / 2605.09079
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
CauSim:随着因果模拟器复杂性增加而扩展因果推理
Abstract
Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.
Chinese Translation
尽管在数学、编程和其他知识密集型任务中超越了人类表现,大型语言模型(LLMs)在因果推理方面仍然面临挑战。一个核心障碍是目标数据本身:因果系统复杂且通常以不可执行的形式表达,而因果查询的真实答案本质上稀缺。我们提出了CauSim,一个将因果推理从稀缺标签问题转变为可扩展监督问题的框架。CauSim构建了越来越复杂的因果模拟器:可执行的结构因果模型(SCMs),由LLMs逐步构建,能够扩展到全球复杂系统,同时保持对因果查询的可验证答案。CauSim通过将不可执行的因果知识形式化为代码,跨表示进行操作,支持数据增强,并将可执行的SCMs翻译为自然语言,从而在以前难以监督的表示中实现监督。我们的研究分为两部分: (1) 如何构建越来越复杂的因果模拟器, (2) 对CauSim所能实现的系统性研究,展示了跨表示的泛化、课程扩展和数据量带来的持续收益、通过自生成模拟器实现的LLM自我改进,以及通过现有领域知识的形式化实现的数据增强。
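The idea of an executable structural causal model with verifiable answers, as described in the CauSim entry above, can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the variables (`rain`, `sprinkler`, `wet`), the mechanisms, and the helper `run_scm` are hypothetical and do not come from the paper. It shows how running the same model with and without an intervention yields a ground-truth answer to a causal query.

```python
def run_scm(mechanisms, noise, do=None):
    """Evaluate mechanisms in insertion (topological) order.
    `do` replaces a variable's mechanism with a fixed value."""
    do = do or {}
    values = {}
    for name, fn in mechanisms.items():
        values[name] = do[name] if name in do else fn(values, noise[name])
    return values

# Hypothetical toy SCM: each mechanism sees its parents' values and its own noise term.
mechanisms = {
    "rain":      lambda v, u: u,
    "sprinkler": lambda v, u: (1 - v["rain"]) * u,   # sprinkler only runs when dry
    "wet":       lambda v, u: max(v["rain"], v["sprinkler"]),
}

# Fixing the noise makes the factual and counterfactual runs comparable.
noise = {"rain": 0, "sprinkler": 1, "wet": 0}

observed = run_scm(mechanisms, noise)
intervened = run_scm(mechanisms, noise, do={"sprinkler": 0})

# Query: "Would the grass be wet had the sprinkler been off?"
answer = bool(intervened["wet"])
print(observed, intervened, answer)
```

Because the answer is computed by execution rather than annotated by hand, query and answer pairs like this one can be generated in bulk, which is the property the abstract relies on for scalable supervision.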
cs.AI / 75 / 2605.09085
Constant-Target Energy Matching: A Unified Framework for Continuous and Discrete Density Estimation
恒定目标能量匹配:连续与离散密度估计的统一框架
Abstract
Density estimation is a central primitive in probabilistic modeling, yet continuous, discrete, and mixed-variable domains are often treated by separate objectives, limiting the ability to exploit a common statistical structure across data types. Continuous score-based methods rely on log-density gradients, while discrete extensions typically use concrete score whose unbounded targets become unstable near low-probability states. We introduce Constant-Target Energy Matching (CTEM), a unified energy-based framework for density estimation on general state spaces. CTEM replaces ordinary density-ratio regression with a bounded energy-difference transform and derives from it a sample-only training objective with the constant target 1. The learned scalar potential recovers log p without partition-function estimation or explicit unbounded ratio regression. Across continuous, discrete, and mixed-variable benchmarks, CTEM substantially improves density estimation over competitive baselines and yields higher-quality samples under standard sampling procedures.
Chinese Translation
密度估计是概率建模中的一个核心原语,但连续、离散和混合变量领域通常通过不同的目标进行处理,这限制了跨数据类型利用共同统计结构的能力。连续的基于分数的方法依赖于对数密度梯度,而离散扩展通常使用具体分数,其无界目标在低概率状态附近变得不稳定。我们提出了恒定目标能量匹配(Constant-Target Energy Matching, CTEM),这是一个用于一般状态空间的统一能量基础密度估计框架。CTEM用有界能量差变换替代普通的密度比回归,并从中推导出一个仅基于样本的训练目标,其恒定目标为1。学习到的标量势能在不需要分区函数估计或显式无界比回归的情况下恢复了对数p。在连续、离散和混合变量基准测试中,CTEM显著改善了密度估计的性能,超越了竞争基线,并在标准采样程序下产生了更高质量的样本。
cs.AI / 76 / 2605.09104
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
大语言模型代理的代币经济学:来自计算与经济学的双视角研究
Abstract
As LLM agents evolve, tokens have emerged as the core economic primitives of Agentic AI. However, their exponential consumption introduces severe computational, collaborative, and security bottlenecks. Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade-off between output quality and economic cost. To bridge this gap, this paper presents the first comprehensive survey of Token Economics. By unifying computer science and economics, we conceptualize tokens as production factors, exchange mediums, and units of account. We synthesize existing literature across a four-dimensional taxonomy: (1) Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory. (2) Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories. (3) Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design. (4) Security: Internalizing adversarial threats as endogenous economic constraints. Finally, we outline frontier directions, including differentiable token budgets and dynamic markets, to lay the theoretical foundation for scalable next-generation agent systems.
Chinese Translation
随着大语言模型(LLM)代理的演变,代币已成为代理人工智能的核心经济原语。然而,它们的指数级消耗引发了严重的计算、协作和安全瓶颈。目前的调查在系统优化、架构设计和信任等方面仍然碎片化,缺乏一个统一的框架来评估输出质量与经济成本之间的基本权衡。为了解决这一问题,本调查首次全面回顾了代币经济学。通过统一计算机科学与经济学,我们将代币概念化为生产要素、交换媒介和计量单位。我们在四维分类法下综合现有文献:(1) 微观层面(单一代理):通过新古典企业理论优化预算约束下的要素替代。(2) 中观层面(多代理系统):利用交易成本和委托-代理理论最小化协作摩擦。(3) 宏观层面(代理生态系统):通过机制设计解决拥堵外部性和定价问题。(4) 安全性:将对抗性威胁内生化为内生经济约束。最后,我们概述了前沿方向,包括可微分的代币预算和动态市场,为可扩展的下一代代理系统奠定理论基础。
cs.AI / 77 / 2605.09109
When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning
何时(以及如何)信任专家:诊断查询时专家引导的强化学习
Abstract
Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.
Chinese Translation
许多连续控制问题配备了一个有能力但次优的控制器(如调优的PID控制器、手工设计的步态)。越来越多的方法将这些控制器作为可查询的专家应用于强化学习(RL),但每种方法都是孤立提出的,基于不同的基准,且未进行不完美专家的测试。我们在共享的SAC(Soft Actor-Critic)基础上进行比较,采用共同的超参数优化(HPO)和评估协议,每个(环境,方法)使用100/50个种子,并对专家的调优不足、动作偏差和观察噪声进行降级测试。比较结果揭示了三种单篇论文评估所忽视的失败模式:(F1)在argmax-plus-bootstrap下的评论员盲点,使得IBRL(Imitation-Based Reinforcement Learning)在接近无专家RL上限的专家(RL-near-ceiling,区别于绝对物理上限)下表现低于无专家的SAC;(F2)在远离最优专家时的残余饱和;以及(F3)在部署时专家调优不足下导致训练时间交接方法崩溃的暖启动缓冲区中毒。没有单一方法占主导地位:每种方法在一个任务结构模式上获胜,而在其他地方则可预测地失败;在RL-near-ceiling专家(FourTank, GlassFurnace)中,没有查询时方法在我们的1M步预算内清除专家,尚不清楚这是否是一个根本性壁垒或预算效应。我们将这种分散转化为一个可测试的决策规则,基于三个预训练可观测量(专家质量、任务终止、扰动类型)。基准、分类法和决策规则是主要贡献;我们还描述了EDGE,一个基于软最大化的集成-LCB(Lower Confidence Bound)设计点,用于证明分类法所指向的两个维度(门控形式、评分规则)是可以单独利用的。
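The abstract above names EDGE only as a "softmax-over-ensemble-LCB design point", so the following is a loose sketch of what such a gate could look like, not the authors' implementation. It assumes that each candidate action (the policy's and the expert's) is scored by a lower confidence bound over an ensemble of Q-value estimates, and that the gate turns those scores into selection probabilities with a softmax. The names `lcb` and `gate_probs` and the parameters `kappa` and `temperature` are hypothetical.

```python
import numpy as np

def lcb(q_values, kappa=1.0):
    """Scoring rule (assumed): ensemble mean minus kappa times ensemble std."""
    q = np.asarray(q_values, dtype=float)
    return q.mean() - kappa * q.std()

def gate_probs(q_policy, q_expert, temperature=1.0, kappa=1.0):
    """Gate form (assumed): softmax over the two LCB scores.
    Returns [P(use policy action), P(use expert action)]."""
    scores = np.array([lcb(q_policy, kappa), lcb(q_expert, kappa)])
    z = (scores - scores.max()) / temperature   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Same ensemble mean, but the critics disagree more about the policy's action,
# so the pessimistic score favors the expert's action.
probs = gate_probs(q_policy=[1.0, 3.0, 2.0], q_expert=[2.0, 2.1, 1.9])
print(probs)
```

A soft gate of this kind keeps some probability on the lower-scored action, in contrast to a hard argmax choice; whether that matches the paper's exact construction is an assumption.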
cs.AI / 78 / 2605.09129
Data-driven Circuit Discovery for Interpretability of Language Models
基于数据驱动的电路发现以提高语言模型的可解释性
Abstract
Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM's behavior. Existing circuit discovery methods are hypothesis-driven; they first informally define a task with a dataset, and then apply a circuit discovery algorithm over that dataset to obtain a single circuit. This imposes two strong assumptions: that the LM implements the task with a single circuit, and that the dataset adequately represents the task as humans understand it. We systematically test these assumptions across four previously studied tasks and find that even minor dataset variations that preserve task semantics can produce circuits with low edge overlap and cross-dataset faithfulness. More strikingly, when applied to a mixed dataset with two distinct tasks whose separately discovered circuits have near-zero cross-faithfulness, existing methods still return a single circuit with high faithfulness across both tasks. This indicates that current methods discover dataset-specific circuits, rather than general task circuits. We propose Data-driven Circuit Discovery (DCD), a new discovery framework that drops both assumptions: instead of returning a single circuit for a dataset, DCD first clusters examples in the dataset by how similarly the model processes them and discovers a separate circuit for each group. This allows distinct mechanisms to appear separately rather than merged into a single circuit; each circuit explains its group, not the full task. Experiments show that DCD discovers multiple circuits per dataset, each more faithful to its group than a single circuit discovered by existing methods. Broadly, DCD lets the data reveal mechanistic structure within LMs, rather than relying on human-defined task boundaries that may not align with how models organize their computation.
Chinese Translation
电路发现旨在通过定位和解释电路(负责语言模型(LM)行为的计算子图)来解释语言模型如何实现特定任务。现有的电路发现方法是以假设为驱动的;它们首先通过数据集非正式地定义任务,然后在该数据集上应用电路发现算法以获得单一电路。这提出了两个强假设:语言模型通过单一电路实现任务,并且数据集充分代表人类理解的任务。我们系统地测试了这两个假设在四个先前研究的任务中的适用性,发现即使是保留任务语义的轻微数据集变化也会产生低边重叠和跨数据集的可信度。更引人注目的是,当应用于一个包含两个不同任务的混合数据集时,分别发现的电路几乎没有跨任务的可信度,而现有方法仍然返回一个在两个任务中都具有高可信度的单一电路。这表明当前方法发现的是特定于数据集的电路,而不是通用的任务电路。我们提出了基于数据驱动的电路发现(Data-driven Circuit Discovery, DCD),这是一个新的发现框架,放弃了这两个假设:DCD不是为一个数据集返回单一电路,而是首先根据模型处理示例的相似性对数据集中的示例进行聚类,并为每个组发现一个单独的电路。这使得不同的机制能够单独出现,而不是合并为一个单一电路;每个电路解释其组,而不是完整任务。实验表明,DCD为每个数据集发现多个电路,每个电路对其组的可信度高于现有方法发现的单一电路。总体而言,DCD让数据揭示语言模型内部的机制结构,而不是依赖于可能与模型组织其计算方式不一致的人为定义的任务边界。
cs.AI / 79 / 2605.09131
MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments
MCP-Cosmos:用于复杂任务执行的世界模型增强代理在MCP环境中的应用
Abstract
The Model Context Protocol (MCP) has unified the interface between Large Language Models (LLMs) and external tools, yet a fundamental gap remains in how agents conceptualize the environments within which they operate. Current paradigms are bifurcated: task-level planning often ignores execution-time dynamics, while reactive execution lacks long-horizon foresight. We present MCP-Cosmos, a framework that infuses generative World Models (WM) into the MCP ecosystem to enable predictive task automation. By unifying three disparate technologies, namely MCP, World Model, and Agent, we demonstrate that a "Bring Your Own World Model" (BYOWM) strategy allows agents to simulate state transitions and refine plans in a latent space before execution. We conducted experiments using two strategies, ReAct and SPIRAL, with two planning models and three representative world models on more than 20 MCP-Bench tasks. We observed improvements in the agent's environment-interaction KPIs, such as tool success rate and tool parameter accuracy. The framework also offers new metrics, such as Execution Quality, that yield new insights into the effectiveness of world models compared to the baseline.
Chinese Translation
模型上下文协议(MCP)统一了大型语言模型(LLMs)与外部工具之间的接口,但在代理如何概念化其操作环境方面仍存在根本性差距。目前的范式分为两类:任务级规划往往忽视执行时动态,而反应式执行缺乏长远的前瞻性。我们提出了MCP-Cosmos,一个将生成性世界模型(WM)融入MCP生态系统的框架,以实现预测性任务自动化。通过统一MCP、世界模型和代理这三种不同的技术,我们展示了“自带世界模型”(BYOWM)策略使代理能够在执行前模拟状态转变并在潜在空间中优化计划。我们使用两种策略,即ReAct和SPIRAL,结合两个规划模型和三个代表性世界模型,在20多个MCP-Bench任务上进行了实验。我们观察到代理在环境交互关键绩效指标(KPI)方面的改善,例如工具成功率和工具参数准确性。该框架还提供了新的指标,如执行质量,以生成有关世界模型相对于基线有效性的新的见解。
cs.AI / 80 / 2605.09134
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR:通过基于执行的强化学习与双重奖励模型提升自动程序修复
Abstract
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
Chinese Translation
程序修复中的强化学习受到稀疏执行反馈和粗糙的序列级奖励的限制,这使得难以辨别哪些编辑实际上修复了错误。我们提出了BoostAPR,一个三阶段框架来应对这些挑战:(1)在执行验证的演示和推理轨迹上进行监督微调,(2)从执行结果中训练双重奖励模型——一个序列级评估器和一个行级信用分配器,以及(3)PPO优化,其中行级模型将奖励重新分配给关键编辑区域。这种行级信用分配在自然适合代码更改的中间粒度上操作。在SWE-Gym上进行训练,并在四个基准上进行评估,BoostAPR在SWE-bench Verified上达到了40.7%(比基础模型提高了22.9个百分点),在Defects4J(Python到Java转移)上达到了24.8%,在HumanEval-Java上达到了84.5%,在QuixBugs上达到了95.0%,在开源模型中取得了具有竞争力的结果,并展现出强大的跨语言泛化能力。
cs.AI / 81 / 2605.09159
Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas
大型语言模型是否经历内部多重对话?通过角色视角研究推理
Abstract
Recent work shows that large language models (LLMs) encode behavioural traits ("personas") as linear directions in activation space, often called "persona vectors". Prior work has used such directions as static handles for behavioural steering. Building on this, we treat them as dynamic signals instead: probes we can monitor and intervene on as reasoning unfolds. We use the term polylogue to denote the time series of alignments between persona vectors and hidden activations over the course of generation. Experiments across four open-weight models show that polylogue features predict correctness on MMLU-Pro competitively with low-dimensional activation baselines, while remaining interpretable through their associated persona directions. They also suggest concrete steering targets, namely which latent directions to modulate at different stages of a response. We instantiate this as a simple paragraph-conditioned intervention that improves accuracy on three of four models, pointing to stage-aware latent steering as a promising direction for reasoning-time control. Together, this positions the polylogue as an interpretable tool for reasoning-time monitoring and intervention.
Chinese Translation
近期研究表明,大型语言模型(LLMs)将行为特征(“角色”)编码为激活空间中的线性方向,通常称为“角色向量”。之前的研究将这些方向作为行为引导的静态手段。基于此,我们将其视为动态信号:在推理展开过程中,我们可以监测和干预的探针。我们使用“多重对话”(polylogue)一词来表示在生成过程中角色向量与隐藏激活之间的对齐时间序列。对四个开放权重模型的实验表明,多重对话特征在与低维激活基线的竞争中能够预测MMLU-Pro上的正确性,同时通过其相关的角色方向保持可解释性。它们还建议了具体的引导目标,即在响应的不同阶段调节哪些潜在方向。我们将其具体化为一种简单的段落条件干预,能够提高四个模型中三个模型的准确性,指出阶段感知的潜在引导作为推理时控制的一个有前景的方向。总的来说,这使得多重对话成为推理时间监测和干预的可解释工具。
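The "polylogue" described above is a per-step alignment between fixed persona directions and the hidden activations produced during generation. A minimal sketch of that time series follows. It assumes cosine similarity as the alignment measure and plain Python lists as vectors; the abstract specifies neither, and the names `cosine` and `polylogue` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def polylogue(hidden_states, persona_vectors):
    """Return {persona name: [alignment at step 0, step 1, ...]}.

    hidden_states: one activation vector per generation step (assumed input).
    persona_vectors: mapping from persona name to its direction (assumed input).
    Cosine similarity is an assumed choice of alignment measure.
    """
    return {
        name: [cosine(h, direction) for h in hidden_states]
        for name, direction in persona_vectors.items()
    }


# Toy example: three generation steps in a 2-dimensional activation space.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
personas = {"cautious": [1.0, 0.0]}
series = polylogue(steps, personas)
```

Each resulting list can then be summarized into features (for example, its mean per paragraph) for a downstream correctness predictor; that summarization step is likewise an assumption here.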
cs.AI / 82 / 2605.09163
FORTIS: Benchmarking Over-Privilege in Agent Skills
FORTIS:代理技能中的特权过度基准测试
Abstract
Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.
Chinese Translation
大型语言模型代理越来越通过一个中介技能层进行操作,该层在用户意图与具体任务执行之间进行调解。这个层次通常被视为一种组织抽象,但我们认为它也是一个特权边界,而当前模型常常超越这一边界。我们提出了FORTIS,一个评估代理技能中特权过度的基准,分为两个阶段:模型是否从一个大型重叠库中选择了最小足够的技能,以及它是否在执行该技能时未扩展到超出技能允许的更广泛工具或行动。在十个前沿模型和三个领域的研究中,我们发现特权过度行为是常态而非例外。模型在任务要求的技能和工具上,始终倾向于选择更高特权的技能,两个阶段的失败率在即使是最强模型中也保持在高水平。尤其是在真实用户交互的普通条件下,失败情况尤为严重:不完整的规范、便利的框架和接近技能边界的情况。这些情况都不需要对抗性构造。结果表明,技能层远非限制代理行为,反而是当前系统中特权升级的主要来源。
cs.AI / 83 / 2605.09168
CIVeX: Causal Intervention Verification for Language Agents
CIVeX:语言智能体的因果干预验证
Abstract
A valid tool call is not necessarily a valid intervention. Tool-using language agents are guarded by schema validators, policy filters, provenance checks, state predictors, and self-verification, yet such safeguards do not certify that a state-changing action has an identifiable causal effect. In confounded workflows, the action that looks optimal in observational logs can reduce utility when executed. We introduce CIVeX, a causal intervention verifier that maps proposed actions to structural causal queries over a committed action-state graph, checks identifiability, and returns one of four auditable verdicts: EXECUTE, REJECT, EXPERIMENT, or ABSTAIN. Execution requires an assumption-scoped causal certificate carrying graph commitments, an identification argument, a one-sided lower confidence bound (LCB), provenance, and risk limits. On Causal-ToolBench (1,890 instances, 7 seeds), CIVeX yields zero observed false executions across moderate and adversarial confounding. Under adversarial confounding it reaches 84.9% accuracy and 81.1% of oracle utility (+2.23 vs +2.76) and is the only non-oracle method whose constrained utility under a zero-false-execution constraint exceeds the AlwaysAbstain floor. On IHDP and ZOZO Open Bandit (real production logs with uniform-random ground truth), CIVeX matches Oracle correct-execution within 0.1pp and cuts per-execute false-execution by >=50x over naive baselines. A chain-of-thought LLM verifier (Claude Opus, Sonnet) cuts false-execution by an order of magnitude over a terse baseline, yet under adversarial confounding Opus's utility falls to 74% of CIVeX's. Intervention identifiability, not action validity, is the missing primitive for reliable tool use.
Chinese Translation
有效的工具调用不一定是有效的干预。使用工具的语言智能体受到模式验证器、策略过滤器、来源检查、状态预测器和自我验证的保护,但这些保障措施并不能证明状态改变的行为具有可识别的因果效应。在混淆的工作流程中,在观察日志中看似最优的行为在执行时可能会降低效用。我们提出了CIVeX,一种因果干预验证器,它将提议的行为映射到一个已承诺的行动-状态图上的结构因果查询,检查可识别性,并返回四种可审计的裁决之一:执行(EXECUTE)、拒绝(REJECT)、实验(EXPERIMENT)或弃权(ABSTAIN)。执行需要一个假设范围内的因果证书,携带图承诺、识别论证、单侧下置信界(LCB)、来源和风险限制。在Causal-ToolBench(1,890个实例,7个种子)上,CIVeX在中度和对抗性混淆下观察到的错误执行为零。在对抗性混淆下,其准确率达到84.9%,并且实现了81.1%的oracle效用(+2.23对比+2.76),是唯一在零错误执行约束下,其受限效用超过AlwaysAbstain底线的非oracle方法。在IHDP和ZOZO Open Bandit(具有均匀随机真实值的真实生产日志)上,CIVeX的正确执行与Oracle的差距在0.1个百分点以内,并且每次执行的错误执行率比天真的基线减少了>=50倍。一种链式思维的LLM验证器(Claude Opus, Sonnet)在较简洁的基线之上将错误执行减少了一个数量级,但在对抗性混淆下,Opus的效用降至CIVeX的74%。干预的可识别性,而非行为的有效性,是可靠工具使用的缺失原语。
cs.AI / 84 / 2605.09184
Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment
开放本体:工具增强的本体工程与稳定匹配对齐
Abstract
We present Open Ontologies, an open-source ontology engineering system implemented in Rust that integrates LLM-driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1-to-1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state-of-the-art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool-augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
Chinese Translation
我们提出了开放本体(Open Ontologies),这是一个用Rust实现的开源本体工程系统,集成了基于大型语言模型(LLM)的构建、形式化的OWL推理以及通过模型上下文协议(Model Context Protocol)进行的本体对齐。我们的主要发现是,稳定的1对1匹配是影响本体对齐质量的主导因素:在OAEI解剖学赛道上,其F1值达到0.832(精确率P = 0.963,召回率R = 0.733),与最先进的系统具有竞争力,并在精确度上超过所有系统。对五种权重配置的消融实验表明,当应用稳定匹配时,信号权重无关紧要(F1值变化小于0.004),而去除稳定匹配则使F1值降至0.728。在会议赛道上,采用相同的方法F1值为0.438。在工具增强的本体交互方面,我们发现了一个令人惊讶的结果:一个读取原始OWL文件的LLM(F1 = 0.323)的表现比没有文件的同一LLM(F1 = 0.431)更差,而结构化的MCP工具访问则实现了F1 = 0.717。这表明,工具结构提供了一种定性不同的访问模式,而LLM无法通过读取原始语法来复制这种模式。该系统以单个二进制文件的形式发布,遵循MIT许可证。
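The stable 1-to-1 matching step that the abstract credits for alignment quality can be illustrated with a Gale-Shapley-style procedure over a similarity matrix. This is a sketch under stated assumptions, not the Open Ontologies implementation (which is written in Rust): the function name `stable_match`, the matrix layout, and the absence of a similarity threshold are all hypothetical choices made for illustration.

```python
def stable_match(sim):
    """Stable one-to-one matching from a similarity matrix.

    sim[i][j] is the (assumed) combined similarity between source entity i
    and target entity j. Sources propose in descending order of similarity;
    each target keeps the most similar proposer seen so far.
    Returns {source index: target index}; sources may remain unmatched
    when there are fewer targets than sources.
    """
    n_src = len(sim)
    n_tgt = len(sim[0]) if sim else 0
    # Each source ranks all targets from most to least similar.
    prefs = [sorted(range(n_tgt), key=lambda j: -sim[i][j]) for i in range(n_src)]
    next_choice = [0] * n_src   # index into prefs[i] of the next target to try
    holder = {}                 # target -> source it currently holds
    free = list(range(n_src))
    while free:
        i = free.pop()
        if next_choice[i] >= n_tgt:
            continue            # source i has exhausted all targets
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = holder.get(j)
        if current is None:
            holder[j] = i
        elif sim[i][j] > sim[current][j]:
            holder[j] = i       # target j trades up; old partner is free again
            free.append(current)
        else:
            free.append(i)      # rejected; i will try its next target
    return {i: j for j, i in holder.items()}


# Both sources score highest against target 0, so a per-row argmax would
# map them to the same target; stable matching resolves the conflict.
alignment = stable_match([[0.9, 0.8], [0.85, 0.1]])
```

In this toy case a per-row argmax would assign both sources to target 0, whereas the stable procedure yields a conflict-free assignment, which is consistent with the high precision the abstract reports. A practical aligner would typically also discard pairs below some similarity cutoff, which is omitted here.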
cs.AI / 85 / 2605.09186
Agentic MIP Research: Accelerated Constraint Handler Generation
自主性混合整数规划研究:加速约束处理器生成
Abstract
Mixed-integer programming (MIP) research is both mathematically sophisticated and engineering-intensive: testing an algorithmic hypothesis within a branch-and-cut solver requires substantial implementation, debugging, tuning, and large-scale benchmarking. We propose an agentic MIP research framework that shortens this feedback loop by embedding LLM agents into a solver-aware harness for generating, verifying, and evaluating plugins for the open-source solver SCIP. Propagation methods play a central role in accelerating MIP solving by exploiting global constraints. We instantiate our framework on the semantic lifting of MIP formulations into global constraints and the automatic construction of propagation-only SCIP constraint handlers. On the MIPLIB 2017 benchmark set, the framework successfully recovers global constraint structures from constraint programming and generates executable constraint detectors and propagation-only constraint handlers. Furthermore, the framework naturally extends to in-context learning within a sandboxed environment, enabling agents not only to tune and debug generated constraint handlers on real instances, but also to explore global constraint patterns in MIP problems and discover novel propagation strategies not yet implemented in SCIP. This framework allows us to systematically distinguish meaningful algorithmic improvements from low-value or overly costly candidates: the novel propagation methods successfully solved five additional instances within the explored benchmark. Overall, this framework demonstrates that LLM agents can autonomously navigate the complex MIP research loop, paving the way for a more automated solver development process.
Chinese Translation
混合整数规划(MIP)研究在数学上复杂且工程密集:在分支定界求解器中测试算法假设需要大量的实现、调试、调优和大规模基准测试。我们提出了一种自主性MIP研究框架,通过将大型语言模型(LLM)代理嵌入到一个了解求解器的工具中,以生成、验证和评估开源求解器SCIP的插件,从而缩短这一反馈循环。传播方法在通过利用全局约束来加速MIP求解中起着核心作用。我们在将MIP公式语义提升为全局约束以及自动构建仅传播的SCIP约束处理器的基础上实例化了我们的框架。在MIPLIB 2017基准集上,该框架成功地从约束编程中恢复了全局约束结构,并生成了可执行的约束检测器和仅传播的约束处理器。此外,该框架自然扩展到沙箱环境中的上下文学习,使代理不仅能够在真实实例上调优和调试生成的约束处理器,还能够探索MIP问题中的全局约束模式,并发现尚未在SCIP中实现的新传播策略。该框架使我们能够系统地区分有意义的算法改进与低价值或过于昂贵的候选项:新颖的传播方法成功解决了在探索的基准中五个额外的实例。总体而言,该框架展示了LLM代理能够自主导航复杂的MIP研究循环,为更自动化的求解器开发过程铺平了道路。
cs.AI / 86 / 2605.09187
Emergent Semantic Role Understanding in Language Models
语言模型中的语义角色理解的涌现
Abstract
Understanding how linguistic structure emerges in language models is central to interpreting what these systems learn from data and how much supervision they truly require. In particular, semantic role understanding ("who did what to whom") is a core component of meaning representation, yet it remains unclear whether it arises from pre-training alone or depends on task-specific fine-tuning. We study whether semantic role understanding emerges during language model pre-training or requires task-specific fine-tuning. We freeze decoder-only transformers and train linear probes to extract semantic roles, using performance to infer whether role information is already encoded in pre-training or learned during adaptation. Across model scales, we find that frozen representations contain substantial semantic role information, with performance improving but not fully matching fine-tuned models. This indicates partial but incomplete emergence from pre-training alone. We show that semantic role structure emerges from language modeling objectives, but its internal implementation shifts toward more distributed representations as model scale increases.
Chinese Translation
理解语言结构如何在语言模型中涌现对于解释这些系统从数据中学习的内容以及它们真正需要多少监督至关重要。特别是,语义角色理解(“谁对谁做了什么”)是意义表征的核心组成部分,但尚不清楚它是仅仅依赖于预训练还是依赖于特定任务的微调。我们研究语义角色理解是在语言模型的预训练过程中涌现,还是需要特定任务的微调。我们冻结仅解码器的变换器,并训练线性探针以提取语义角色,通过性能推断角色信息是在预训练中已经编码,还是在适应过程中学习的。在不同模型规模下,我们发现冻结的表征包含大量语义角色信息,性能有所提高但未完全匹配微调模型。这表明,语义角色的部分但不完全涌现是来自于仅仅的预训练。我们展示了语义角色结构是从语言建模目标中涌现的,但其内部实现随着模型规模的增加而向更分布式的表征转变。
cs.AI / 87 / 2605.09192
Evidence Over Plans: Online Trajectory Verification for Skill Distillation
证据优于计划:技能蒸馏的在线轨迹验证
Abstract
Agent skills can remarkably improve task success rates by using human-written procedural documents, but their quality is difficult to assess without environment-grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify this as a fundamental timing bottleneck: robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory-level metric that quantifies how well a distilled skill is grounded in the task-environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation) for preserving task execution evidence towards full trajectory-level analysis. SPARK generates environment-verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK-generated skills consistently surpass no-skill baselines and outperform human-written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI-guided distillation produces efficient and transferable skills grounded in the task-environment interaction. We release our code at https://github.com/EtaYang10th/spark-skills .
Chinese Translation
代理技能通过使用人类编写的程序文档显著提高任务成功率,但在没有环境基础验证的情况下,其质量难以评估。现有的技能生成方法严重依赖偏好日志而非直接的环境交互,往往导致微不足道甚至退化的收益。我们识别出这是一个根本的时间瓶颈:稳健的技能应该是基于后验的,从经验环境交互中提炼,而非基于先前的计划。在本研究中,我们引入了后验蒸馏指数(Posterior Distillation Index, PDI),这是一个轨迹级别的指标,用于量化蒸馏技能在任务环境证据中的扎根程度。为了实现PDI,我们提出了SPARK(结构化自主可运行任务和技能生成管道),以保留任务执行证据,进行全面的轨迹级分析。SPARK生成环境验证的轨迹,用于计算PDI,并将PDI作为在线诊断和干预信号,以确保后验技能的形成。在86个可运行任务中,SPARK生成的技能始终超越无技能基线,并在学生模型上优于人类编写的技能(推理成本比教师模型低达1000倍)。这些发现表明,基于PDI的蒸馏产生了高效且可转移的技能,扎根于任务环境交互中。我们在https://github.com/EtaYang10th/spark-skills发布了我们的代码。
cs.AI / 88 / 2605.09195
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
遗忘的几何学:时间知识漂移作为大型语言模型表示中的独立轴
Abstract
Large language models confidently produce outdated answers, and no existing method can detect them. We show this is not an engineering failure but a structural one: temporal drift, whether a stored fact has changed since training, is encoded as a direction in the residual stream geometrically orthogonal to both correctness and uncertainty. Any method operating on correctness or uncertainty signals is therefore blind to drift by construction. We verify this across six instruction-tuned models. A linear probe trained directly on drift labels achieves AUROC $0.83$--$0.95$; methods based on token entropy, semantic entropy, CCS, and SAPLMA all remain near chance ($0.49$--$0.57$). Five tests confirm the geometric orthogonality: weight cosines ($|\cos| \leq 0.14$), score correlations ($|r| \leq 0.20$), bidirectional null-space projection ($|\Delta| \leq 0.008$), iterative null-space projection with $k{=}10$, and difference-of-means dissociation. Mechanistically, the MLP retrieval circuit produces identical dynamics for stale recall and confabulation ($r > 0.81$, six models), explaining why output confidence cannot separate them. A cross-cutoff experiment holds inputs constant and varies only the model: the probe fires on the model whose training predates the fact's transition and stays silent otherwise ($P(A{>}B) = 0.975$--$0.998$, twelve model pairs), confirming it reads model-internal knowledge state rather than input properties. Our code and datasets will be publicly released.
Chinese Translation
大型语言模型自信地生成过时的答案,而现有的方法无法检测到这些答案。我们表明,这并不是工程上的失败,而是结构上的问题:时间漂移,即自训练以来存储的事实是否发生了变化,被编码为在残差流中与正确性和不确定性几何正交的方向。因此,任何基于正确性或不确定性信号的方法在构造上对漂移是盲目的。我们在六个经过指令调优的模型中验证了这一点。直接在漂移标签上训练的线性探测器达到了 AUROC $0.83$--$0.95$;基于标记熵、语义熵、CCS 和 SAPLMA 的方法均接近随机猜测水平($0.49$--$0.57$)。五项测试确认了几何正交性:权重余弦($|\cos| \leq 0.14$)、得分相关性($|r| \leq 0.20$)、双向零空间投影($|\Delta| \leq 0.008$)、$k{=}10$ 的迭代零空间投影,以及均值差异解离。从机制上讲,MLP 检索电路对陈旧回忆和虚构产生相同的动态($r > 0.81$,六个模型),解释了为什么输出置信度无法将它们区分开。一个交叉截止实验保持输入不变,仅变化模型:探测器在训练早于事实转变的模型上触发,而在其他情况下保持静默($P(A{>}B) = 0.975$--$0.998$,十二对模型),确认它读取的是模型内部的知识状态,而不是输入属性。我们的代码和数据集将公开发布。
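Two of the orthogonality checks named above, the cosine between probe weight vectors and null-space projection, can be illustrated on toy vectors. The sketch below is an assumption-laden illustration rather than the authors' procedure: the probe directions, the activation vector, and the helper names `cosine` and `project_out` are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine between two probe weight vectors (0.0 if either is zero)."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0


def project_out(x, w):
    """Remove from x its component along direction w (null-space projection)."""
    ww = dot(w, w)
    if ww == 0.0:
        return list(x)
    scale = dot(x, w) / ww
    return [xi - scale * wi for xi, wi in zip(x, w)]


# Hypothetical probe directions in a 3-dimensional residual stream.
w_correct = [1.0, 0.0, 0.0]   # assumed "correctness" probe weights
w_drift = [0.0, 1.0, 0.0]     # assumed "drift" probe weights
x = [3.0, 2.0, 5.0]           # one toy activation vector

weight_cos = cosine(w_drift, w_correct)
score_before = dot(x, w_drift)
score_after = dot(project_out(x, w_correct), w_drift)
delta = abs(score_after - score_before)
```

When the two directions are orthogonal, as in this toy case, erasing the correctness direction leaves the drift score unchanged, which mirrors the small $|\Delta|$ the abstract reports for null-space projection.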
cs.AI / 89 / 2605.09217
Learning the Preferences of a Learning Agent
学习学习代理的偏好
Abstract
For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no-regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.
Chinese Translation
为了使人工智能系统对人类有用,它们必须理解并按照我们的价值观和偏好行事。由于指定偏好是一项困难的任务,逆强化学习(Inverse Reinforcement Learning, IRL)旨在开发能够从观察到的行为中推断偏好的方法。然而,IRL假设人类的行为是近似最优的。这在某些情况下是一个重大限制,因为人类自身可能正在学习如何在环境中采取最优行动。本文形式化了学习代理偏好的问题:一个预测者观察一个在线学习者的行为,并试图推断学习者所优化的(最初是次优的)潜在奖励函数。我们将学习者建模为无悔(no-regret)或随着时间的推移收敛到最优Boltzmann策略。在这些设置中,我们为各种偏好学习算法建立了理论保证,或者反之,表明这样的保证是不可能的。
cs.AI / 90 / 2605.09243
How Much is Brain Data Worth for Machine Learning?
脑数据在机器学习中的价值有多大?
Abstract
If a person can solve a task, can measuring their brain make it easier to train a model to solve that task too? Recent NeuroAI work suggests that supplementing task training with neural recordings can modestly improve model performance and robustness. However, it is unclear when there should be a benefit from using neural data and how much benefit to expect. We formulate this question mathematically, and begin to address it theoretically using a simple, analytically tractable linear Gaussian model of task targets and neural recordings. For a multimodal estimator trained on both brain data and task labels, we derive scaling laws for how performance scales with the numbers of brain and task samples. From these laws we derive relative value and exchange rates between brain samples and task samples, quantifying how many extra task samples neural data is worth as a function of task-brain alignment, neural and task noise, latent dimension, and brain data sample size. We also analyze test distribution shift, to identify conditions where brain-regularized learning can produce substantial robustness gains through learned invariances. Finally, under a fixed collection budget, we characterize the regimes in which brain data is worth collecting. Our results provide a foundation for understanding how valuable brain data could be for improving machine learning.
Chinese Translation
如果一个人能够解决某个任务,测量他们的大脑是否能使训练模型解决该任务变得更容易?最近的神经人工智能(NeuroAI)研究表明,结合神经记录进行任务训练可以适度提高模型的性能和鲁棒性。然而,目前尚不清楚使用神经数据何时会带来好处,以及可以期待多大的好处。我们将这个问题进行了数学公式化,并开始使用一个简单的、可解析的线性高斯模型来理论性地探讨任务目标和神经记录。对于一个同时使用脑数据和任务标签进行训练的多模态估计器,我们推导了性能如何随着脑样本和任务样本数量的变化而变化的缩放法则。根据这些法则,我们推导了脑样本与任务样本之间的相对价值和交换率,量化了神经数据在任务-大脑对齐、神经和任务噪声、潜在维度以及脑数据样本大小等因素下,额外任务样本的价值。我们还分析了测试分布的变化,以识别脑正则化学习可以通过学习不变性产生显著鲁棒性提升的条件。最后,在固定的收集预算下,我们描述了收集脑数据的价值所在的不同情境。我们的结果为理解脑数据在提升机器学习中的潜在价值提供了基础。
cs.AI / 91 / 2605.09266
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
SeePhys Pro:诊断面向物理推理的多模态RLVR中的模态转移与盲训练效应
Abstract
We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, text-deletion, image-mask-rate, and format-saturation controls suggest that such gains can arise from residual textual and distributional cues rather than valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.
Chinese Translation
我们介绍了SeePhys Pro,一个细粒度的模态转移基准,研究模型在关键信息逐步从文本转移到图像时是否保持相同的推理能力。与评估单一输入形式的标准视觉基准不同,SeePhys Pro为每个问题提供了四个语义对齐的变体,逐步增加视觉元素。我们的评估显示,当前前沿模型远非表示不变的推理者:随着信息从语言转移到图表,性能平均下降,其中视觉变量定位(grounding)是最关键的瓶颈。受到这种推理时脆弱性的启发,我们进一步开发了多模态RLVR的大规模训练语料库,并使用盲训练作为诊断控制,发现即使在所有训练图像被屏蔽的情况下,强化学习仍然可以提高未屏蔽验证集的性能。为了分析这一效应,文本删除、图像遮罩率和格式饱和度控制表明,这种提升可能源于残余的文本和分布线索,而非有效的视觉证据。我们的结果强调了评估多模态推理的必要性,不仅要通过最终答案的准确性,还要通过在模态转移下的鲁棒性以及测试改进是否依赖于任务关键视觉证据的诊断来进行评估。
cs.AI / 92 / 2605.09271
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
通过语言表征塑造模式:大型语言模型智能扩展的下一个前沿
Abstract
Although natural language is the default medium for Large Language Models (LLMs), its limited expressive capacity creates a profound bottleneck for complex problem-solving. While recent advancements in AI have relied heavily on scaling, merely internalizing knowledge does not guarantee its effective application. Defining language representation as the linguistic and symbolic constructs used to map and model the real world, this paper argues that shaping schemas through advanced language representation is the next frontier for expanding LLM intelligence. We posit that an LLM's knowledge activation and organization -- its schema -- depends heavily on the structural and symbolic sophistication of the language used to represent a given task. This paper contributes both a formalization of this claim and the empirical evidence to support it. With a new formalization, we present multiple lines of evidence to support our position: Firstly, we review recent empirical practices and emerging methodologies that demonstrate the substantial performance gains achievable through deliberate language representation design, even without modifying model parameters or scale. Secondly, we conduct controlled experiments showing that LLM performance and its internal feature activations vary under different language representations of the same underlying task. Together, these findings highlight language representation design as a promising direction for future research.
Chinese Translation
尽管自然语言是大型语言模型(LLMs)的默认媒介,但其有限的表达能力对复杂问题解决造成了深刻的瓶颈。尽管近期人工智能的进展在很大程度上依赖于规模的扩大,但仅仅内化知识并不能保证其有效应用。本文将语言表征定义为用于映射和建模现实世界的语言和符号构造,认为通过先进的语言表征塑造模式是扩展LLM智能的下一个前沿。我们认为,LLM的知识激活和组织——其模式——在很大程度上依赖于用于表示特定任务的语言的结构和符号复杂性。本文不仅对这一主张进行了形式化,还提供了实证证据来支持这一观点。通过新的形式化,我们呈现了多条证据支持我们的立场:首先,我们回顾了近期的实证实践和新兴方法论,展示了通过有意的语言表征设计所能实现的显著性能提升,即使在不修改模型参数或规模的情况下。其次,我们进行了控制实验,表明LLM的性能及其内部特征激活在同一基础任务的不同语言表征下存在差异。综合这些发现,语言表征设计被视为未来研究的一个有前景的方向。
cs.AI / 93 / 2605.09272
Towards Conversational Medical AI with Eyes, Ears and a Voice
朝着具有视觉、听觉和语音的对话医学人工智能迈进
Shah, Meet, Gusdorf, Jason, Palepu, Anil, Park, Chunjong, O'Sullivan, Jack W., Ravi, Vishnu, Strother, Tim, Dubov, Pavel, Rysbek, Aliya, Fukuzawa, Toshiyuki, Lunts, Yana, Freyberg, Jan, Chang, Michael B., Raghu, Aniruddh, Stutz, David, Berlowitz, Devora, Papa, Eliseo, Cemgil, Taylan, Velasquez, JD, Chen, Jack, Chen, Arthur, Fritz, Doug, Taylor, Charlie, Tregubova, Katya, Lim, Jing Rong, Green, Richard, Mahdavi, Sara, Nagda, Mahvish, Lee, Jihyeon, Schiff, Craig, Panait, Liviu, Singh, Sukhdeep, Liévin, Valentin, Barrett, David G. T., Gladman, Hannah, Cupani, Anna, Pietra, Francesca, Okereke, Uchechi, Tong, Katherine, Meyer, Clemens, Rolland, Erwan, Sanwalka, Mili, Howell, Michael D., Gu, Shixiang Shane, Xu, Bibo, Ashley, Euan A., Eslami, S. M. Ali, Wayne, Gregory, Kohli, Pushmeet, Natarajan, Vivek, Rodman, Adam, Karthikesalingam, Alan, Tanno, Ryutaro
Abstract
The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.
Chinese Translation
医学实践不仅依赖于熟练的对话,还依赖于医生与患者之间丰富的听觉和视觉线索的细微交流与解读。基于Gemini的低延迟语音和视频处理能力,我们推出了AI共同临床医生(AI co-clinician),这是首个利用来自实时患者对话的连续音视频数据来指导实时临床决策的对话式人工智能系统。其双代理架构在深度临床推理与自然对话所需的低延迟之间取得了平衡。为了评估该系统,我们实施了一个基于视频的界面,模拟远程医疗咨询。我们设计了20个标准化的门诊场景,要求进行主动的实时听觉和视觉推理,并制定了“TelePACES”评估标准及特定案例的评分标准。在一项随机、界面盲法的交叉模拟研究中(n = 120次接触),我们将AI共同临床医生与初级保健医生(PCPs)、GPT-Realtime和基线代理进行了比较,参与者为10名内科住院医生担任患者演员。AI共同临床医生在关键的TelePACES维度上接近PCPs,包括管理计划和鉴别诊断,同时在所有一般标准上显著优于GPT-Realtime。尽管我们的代理在特定案例的分诊措施上与PCPs表现相当,但医生在特定案例评估中的整体表现仍然优于AI。尽管AI共同临床医生标志着实时远程医疗人工智能的重大进展,但在身体检查和疾病特定推理方面仍存在差距。我们的研究表明,仅依赖文本的方法无法捕捉医学咨询的真实挑战,并建议高风险实时诊断人工智能在协作的三方模型中最安全地推进,在这种模型中,人工智能可以作为医生和患者的支持性共同临床医生。
cs.AI / 94 / 2605.09278
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem:通过博弈论均衡校准多智能体辩论中的共享记忆
Abstract
Multi-agent debate (MAD) systems increasingly rely on shared memory to support long-horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory-augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM-based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross-agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero-trust memory game, in which no agent is assumed honest and the game's equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference-time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents' existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding- and graph-based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.
Chinese Translation
多智能体辩论(MAD)系统越来越依赖共享记忆来支持长时间范围的推理,但这种便利性带来了一个关键的脆弱性:单个损坏的条目可能会污染下游的记忆增强推理,而仅靠辩论无法过滤此类错误。现有的保护措施通过启发式或基于大型语言模型(LLM)的验证来过滤条目,然而它们依赖于同样存在失败模式的人工智能判断,并忽视了MAD的跨智能体动态。我们通过将MAD中的记忆更新形式化为一种零信任记忆游戏来填补这一空白,在该游戏中,不假设任何智能体是诚实的,游戏的均衡作为最佳记忆信任的指示。基于这一均衡,我们提出了EquiMem,这是一种推理时校准机制,能够根据共享记忆状态以算法方式量化每次更新,利用智能体现有的检索查询和遍历路径作为证据,而不是请求任何LLM的判断。EquiMem为嵌入式和基于图的记忆实例化了校准,并在多种基准测试、MAD框架和记忆架构中,始终优于现有的保护措施,在对抗性智能体下保持稳健,并且几乎不增加推理开销。
cs.AI / 95 / 2605.09283
A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web
一种基于提示的结构化框架,用于在智能网络中可靠地重用AI生成内容
Abstract
The evolution of Large Language Models (LLMs) and the software agents built on them (AI agents) marks a turning point in the transition from a human-centric Web to an "Agentic Web" driven by AI agents. However, for AI-Generated Content (AIGC), which is expected to dominate the Web, there is currently no mechanism for agents to verify its reliability, reproducibility, or license compliance during generation. This lack of transparency risks causing chained hallucinations and compliance violations through the reuse of AIGC. Consequently, a framework to manage the provenance and generation conditions of AIGC is essential. In this paper, we present a framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, thoughts, model information, hyperparameters, and confidence. The metadata is enveloped together with verifiable credentials to support the reliable assessment and reuse of AIGC. This framework enables efficient curation of structured AIGC and facilitates its safe use for applications such as fine-tuning and knowledge distillation.
Chinese Translation
大型语言模型(LLMs)及其构建的软件代理(AI代理)的发展标志着从以人为中心的网络向由AI代理驱动的“智能网络”的转变。然而,对于预计将主导网络的AI生成内容(AIGC),目前尚无机制供代理在生成过程中验证其可靠性、可重复性或许可合规性。这一缺乏透明度的现象可能导致通过重用AIGC而产生连锁幻觉和合规性违规。因此,管理AIGC的来源和生成条件的框架显得至关重要。本文提出了一种框架,该框架在生成时自动为AIGC附加结构化元数据,包括模块化提示、上下文、思考、模型信息、超参数和置信度。这些元数据与可验证凭证一起封装,以支持对AIGC的可靠评估和重用。该框架能够高效地策划结构化AIGC,并促进其在微调和知识蒸馏等应用中的安全使用。
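A minimal sketch of what such a metadata envelope might look like. The field names follow the list in the abstract (prompts, contexts, thoughts, model information, hyperparameters, confidence). The class name `AIGCEnvelope`, the JSON layout, and the SHA-256 digest standing in for a verifiable credential are assumptions for illustration, not the paper's schema.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AIGCEnvelope:
    """Hypothetical container for generated content plus generation metadata."""
    content: str
    prompts: list            # modularized prompt segments
    contexts: list           # retrieved or supplied context passages
    thoughts: list           # intermediate reasoning, if exposed
    model: str               # model identifier
    hyperparameters: dict = field(default_factory=dict)
    confidence: float = 0.0

    def to_json(self) -> str:
        # Sorted keys give a canonical serialization, so the digest is stable.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)

    def digest(self) -> str:
        # Stand-in for a verifiable credential: a real system would sign this.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


def verify(envelope: AIGCEnvelope, expected_digest: str) -> bool:
    """Check that content and metadata are unchanged since the digest was issued."""
    return envelope.digest() == expected_digest


env = AIGCEnvelope(
    content="Paris is the capital of France.",
    prompts=["system: answer briefly", "user: capital of France?"],
    contexts=[],
    thoughts=[],
    model="example-model",  # hypothetical identifier
    hyperparameters={"temperature": 0.0},
    confidence=0.98,
)
issued = env.digest()
```

A consuming agent would recompute the digest before reuse. Any change to the content or to a metadata field makes `verify` return `False`.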
cs.AI / 96 / 2605.09287
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA:基于枢轴的信用分配用于搜索代理强化学习
Abstract
Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved the performance of knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as success probabilities dependent on the historical context based on Potential-Based Reward Shaping (PBRS). This approach identifies pivot steps, which comprise target golden sub-queries and sub-answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot-aware and trajectory-dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge-intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models. The consistent performance gains across various models show PiCA's robust generalization. The code is available at https://github.com/novdream/PiCA.
Chinese Translation
基于大型语言模型(LLM)的搜索代理通过强化学习(RL)训练,显著提高了知识密集型任务的性能。然而,现有方法在长时间跨度的信用分配中面临关键挑战:(i)奖励稀疏性,模型仅收到结果反馈,而没有逐步指导来区分行动质量;(ii)孤立信用,信用独立地分配给各个步骤,未能捕捉序列依赖性;(iii)分布转移,奖励是在偏离模型自然生成分布的模板上估计的。为了解决这些问题,我们提出了基于枢轴的信用分配(PiCA),这是一种新颖的步骤奖励机制,将搜索轨迹重新构建为累积搜索进展的序列过程。与之前的孤立步骤奖励不同,PiCA将过程奖励定义为基于潜在奖励塑形(Potential-Based Reward Shaping, PBRS)的历史上下文依赖的成功概率。这种方法识别出枢轴步骤,这些步骤由目标黄金子查询和从历史轨迹中派生的子答案组成,作为显著提高正确最终答案可能性的信息峰值。通过将这些步骤奖励锚定到最终任务目标,PiCA提供了密集的、关注枢轴的和依赖轨迹的指导,同时保持分布一致性。大量实验表明,PiCA在七个知识密集型问答基准测试中超越了现有的强基线,对于3B和7B模型分别实现了15.2%和2.2%的提升。各种模型的一致性能提升展示了PiCA的强大泛化能力。代码可在 https://github.com/novdream/PiCA 获取。
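The potential-based shaping idea mentioned in the PiCA abstract can be sketched as follows. This illustrates generic PBRS only: the potential is taken to be an estimated probability of a correct final answer given the history so far, and the step reward is the discounted change in that potential. How PiCA estimates these probabilities and how it identifies pivot steps is not stated in the abstract, so the numbers and the `argmax` pivot rule below are hypothetical.

```python
def shaped_step_rewards(potentials, gamma=1.0):
    """Potential-based shaping terms F_t = gamma * phi[t+1] - phi[t]."""
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]


# Hypothetical success-probability estimates after each search step.
phi = [0.10, 0.15, 0.60, 0.65, 0.90]
rewards = shaped_step_rewards(phi)

# Illustrative pivot rule: the step with the largest jump in potential.
pivot = max(range(len(rewards)), key=rewards.__getitem__)
```

With `gamma = 1.0` the shaped rewards telescope to `phi[-1] - phi[0]`, so the dense step signal adds up to the overall gain in estimated success probability. This telescoping is what ties the step rewards to the final task objective.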
cs.AI / 97 / 2605.09292
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
超越准确性:评估大型语言模型在数学推理中的策略多样性
Abstract
Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.
Chinese Translation
大型语言模型在数学推理基准测试中现在能够实现高最终答案准确性,但仅靠准确性无法体现推理的灵活性。我们引入了一种基于策略层面的评估框架,针对80个AMC 10/12和AIME问题,结合217个来源于AoPS的参考策略家族进行实例化。模型输出通过双人工智能编码和人工裁定进行策略身份、有效性和正确性的注释。在四个前沿模型中,我们发现答案准确性与策略多样性之间存在显著的解耦。在单一解决方案提示下,所有模型均实现了高准确性(95%-100%),但在多策略提示下,它们恢复的策略数量远低于人类参考集。Gemini、DeepSeek、GPT和Claude分别生成了184、152、151和110种不同的有效策略,几何和数论领域的差距最大。模型共同产生了50种基准新颖的有效策略,表明对人类策略的覆盖不完全,同时也展现出一定的替代推理能力。对20个问题进行的重复运行稳健性检查显示,发现的策略收益递减,最强模型在三次运行后仅恢复了55个AoPS参考策略中的39个(71%)。这些发现将策略多样性定位为评估数学推理的一个互补维度,超越了答案的正确性。
cs.AI / 98 / 2605.09310
Beyond ESG Scores: Learning Dynamic Constraints for Sequential Portfolio Optimization
超越ESG评分:学习动态约束以进行序列投资组合优化
Abstract
ESG-aware portfolio optimization is increasingly important for sustainable capital allocation, yet most learning-based methods still operationalize ESG by appending static scores to the policy observation or reward. This creates a mismatch for sequential control: ESG scores are noisy, provider-dependent, low-frequency, and temporally misaligned with sequential portfolio decisions, while financial evidence suggests that ESG is better treated as a portfolio preference, risk-exposure, or hedge dimension than as a robust alpha factor. We propose to impose ESG constraints without modifying the financial policy's observation or reward, using a Multimodal Action-Conditioned Constraint Field (MACF) that learns mechanism-specific ESG costs from point-in-time multimodal evidence and contemplated portfolio transitions. We then introduce MACF-X, a family of optimizer-specific adapters that converts MACF costs and uncertainties into native constrained-optimization interfaces through a shared slack- and uncertainty-aware pressure layer. Across multiple constraint-integration interfaces, MACF-X reduces tail ESG budget pressure while maintaining competitive financial performance. Ablations show that this improvement depends on dynamic evidence inputs and three-head decomposition, while static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines.
Chinese Translation
关注ESG的投资组合优化在可持续资本配置中日益重要,然而大多数基于学习的方法仍然通过将静态评分附加到策略观察或奖励中来实现ESG。这导致了序列控制中的不匹配:ESG评分噪声大、依赖提供者、频率低且与序列投资组合决策在时间上不对齐,而金融证据表明,ESG更应被视为投资组合偏好、风险暴露或对冲维度,而非稳健的阿尔法因子。我们提出在不修改金融政策的观察或奖励的情况下施加ESG约束,使用多模态动作条件约束场(Multimodal Action-Conditioned Constraint Field, MACF),该方法从特定时点的多模态证据和考虑中的投资组合转变中学习机制特定的ESG成本。随后,我们引入MACF-X,一系列优化器特定的适配器,通过共享的松弛和不确定性感知压力层,将MACF成本和不确定性转换为原生的约束优化接口。在多个约束集成接口中,MACF-X在保持竞争性金融表现的同时减少了尾部ESG预算压力。消融实验表明,这一改进依赖于动态证据输入和三头分解,而静态ESG评分代理与评分打乱噪声基线几乎无法区分。
cs.AI / 99 / 2605.09314
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
如何说服大型语言模型:少数注意力头的重新引导
Abstract
Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
Chinese Translation
语言模型可以被说服放弃事实知识。这一脆弱性是人工智能安全的核心,但其内部机制仍然不甚明了。我们揭示了一种紧凑的因果机制,导致说服引发的事实错误。一小组中层注意力头几乎完全决定了模型的回答。这些注意力头将答案选项写入一个低维多面体,选项占据不同的顶点。说服并不会模糊信念或仅仅降低信心;它导致从正确答案顶点到说服目标顶点的离散潜在跳跃。我们表明,决策头并不是在对证据进行推理。相反,它们复制其注意力选择的任一选项标记。说服通过重定向注意力来实现。我们隔离出一个秩为一的证据路由特征,该特征控制着路由。直接修改该特征可以引导模型的选择,而去除它则阻止了说服。随后,我们追踪该特征回到一组较浅的注意力头,这些注意力头从输入中的说服性关键词构建该特征。每一步都通过干预得到了验证。这一机制在开源大型语言模型和现实的投毒场景(如生成引擎优化)中均有出现,揭示了说服作为一个狭窄且可监控的电路。
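The two interventions described in the persuasion abstract, removing a rank-one feature and directly modifying it, correspond to standard linear operations on an activation vector. The sketch below shows them in plain Python. The direction `d` and the activation `h` are made-up values; how the paper finds the direction and at which layer it intervenes is not given in the abstract.

```python
import math


def unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ablate(h, direction):
    """Remove the component of h along direction (project it out)."""
    u = unit(direction)
    c = dot(h, u)
    return [x - c * y for x, y in zip(h, u)]


def steer(h, direction, target):
    """Set the component of h along direction to target, leaving the rest unchanged."""
    u = unit(direction)
    c = dot(h, u)
    return [x + (target - c) * y for x, y in zip(h, u)]


h = [1.0, 2.0, 3.0]   # hypothetical activation
d = [0.0, 3.0, 4.0]   # hypothetical feature direction
h_ablated = ablate(h, d)
h_steered = steer(h, d, 5.0)
```

After `ablate`, the activation has zero component along `d`. After `steer`, that component equals the chosen target while everything orthogonal to `d` is unchanged.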
cs.AI / 100 / 2605.09315
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
自我进化代理会遗忘吗?终身大语言模型代理适应中的能力退化与保持
Abstract
Recent advances in LLM agents enable systems that autonomously refine workflows, accumulate reusable skills, self-train their underlying models, and maintain persistent memory. However, we show that such self-evolution is often non-monotonic: adapting to new task distributions can progressively degrade previously acquired capabilities across all major evolution channels. We identify this phenomenon as \emph{capability erosion under self-evolution} and show that it consistently emerges across workflow, skill, model, and memory evolution. To mitigate this issue, we propose \emph{Capability-Preserving Evolution} (CPE), a general stabilization principle that constrains destructive capability drift during continual adaptation. Across all four evolution dimensions, CPE consistently improves retained capability stability while preserving adaptation performance. For example, in workflow evolution, CPE improves retained simple-task performance from 41.8\% to 52.8\% under GPT-5.1 optimization while simultaneously achieving stronger complex-task adaptation. Our findings suggest that stable long-horizon self-evolving agents require not only acquiring new capabilities, but also explicitly preserving previously learned ones during continual adaptation.
Chinese Translation
近期在大语言模型(LLM)代理方面的进展使得系统能够自主优化工作流程、积累可重复使用的技能、自我训练其基础模型,并维持持久记忆。然而,我们发现这种自我进化往往是非单调的:适应新的任务分布可能会逐步退化在所有主要进化渠道中先前获得的能力。我们将这一现象称为“自我进化下的能力侵蚀”,并表明它在工作流程、技能、模型和记忆进化中始终出现。为了解决这一问题,我们提出了“能力保持进化”(Capability-Preserving Evolution, CPE),这是一种通用的稳定化原则,旨在约束持续适应过程中的破坏性能力漂移。在所有四个进化维度中,CPE始终提高了保留能力的稳定性,同时保持适应性能。例如,在工作流程进化中,CPE在GPT-5.1优化下将保留的简单任务性能从41.8%提高到52.8%,同时实现了更强的复杂任务适应。我们的研究结果表明,稳定的长期自我进化代理不仅需要获得新能力,还需要在持续适应过程中明确保持先前学习的能力。
cs.AI / 101 / 2605.09343
SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making
SKG-VLA:用于结构化场景语义和多模态推理的场景知识图先验在决策中的应用
Abstract
Decision making in large-scale complaint handling systems increasingly relies on heterogeneous evidence, including complaint narratives, screenshots, order metadata, historical interactions, and platform policies. Existing complaint understanding systems mainly perform shallow classification or template matching over isolated modalities, while underutilizing explicit scene structure, rule knowledge, and cross-evidence dependencies. To address this limitation, we present SKG-VLA for multimodal complaint decision making. The core idea is to model each case as a structured complaint scene and represent its decision-relevant semantics with a \emph{Scene Knowledge Graph} (SKG), which organizes complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations into a unified graph. Based on SKG, we build a data synthesis pipeline that generates complaint scene descriptions, rule-consistent graph generalizations, question-answer supervision, and decision recommendations. We further construct a large-scale complaint scene dataset with both text-only and multimodal in-domain benchmarks. Finally, we adopt a three-stage training strategy -- domain-adaptive pre-training, task-oriented instruction fine-tuning, and end-to-end multimodal alignment -- to inject structured scene priors into a multimodal decision model. Experiments show that SKG-VLA consistently improves policy-grounded reasoning, complaint decision accuracy, long-tail generalization, and robustness under incomplete evidence.
Chinese Translation
在大规模投诉处理系统中,决策越来越依赖于异构证据,包括投诉叙述、截图、订单元数据、历史交互和平台政策。现有的投诉理解系统主要对孤立模态进行浅层分类或模板匹配,而未充分利用显式场景结构、规则知识和跨证据依赖关系。为了解决这一局限性,我们提出了SKG-VLA用于多模态投诉决策。其核心思想是将每个案例建模为一个结构化的投诉场景,并用\emph{场景知识图}(Scene Knowledge Graph, SKG)表示其与决策相关的语义,该图将投诉实体、证据项、政策条款、时间事件、交易状态和与行动相关的关系组织成一个统一的图。基于SKG,我们构建了一个数据合成管道,生成投诉场景描述、规则一致的图形泛化、问答监督和决策建议。我们进一步构建了一个大规模的投诉场景数据集,包含文本-only和多模态的领域内基准。最后,我们采用三阶段训练策略——领域适应预训练、任务导向的指令微调和端到端多模态对齐——将结构化场景先验注入到多模态决策模型中。实验表明,SKG-VLA在政策基础推理、投诉决策准确性、长尾泛化和在不完整证据下的鲁棒性方面均有显著提升。
cs.AI / 102 / 2605.09347
Dsat: A Native SAT Solver for Discrete Logic
Dsat:一种用于离散逻辑的原生SAT求解器
Abstract
Discrete variables are common in many applications, such as probabilistic reasoning, planning and explainable AI. When symbolic reasoning techniques are brought to bear on these applications, a standard technique for handling discrete variables is to binarize them into Boolean variables to allow the use of Boolean computational machinery such as SAT solvers. This technique can face both computational and semantic challenges, though. In this work, we develop a native SAT solver for discrete logic, which is a direct extension of Boolean logic in which variables can take arbitrary values. Our proposed solver has a similar design to Boolean SAT solvers, with ingredients such as unit resolution and clause learning, but ones that operate natively on discrete variables. We illustrate the merits of the developed SAT solver by comparing it empirically to CSP solvers applied to discrete CNFs, to Boolean SAT solvers applied to binarized CNFs, and to some hybrid solvers.
Chinese Translation
离散变量在许多应用中很常见,例如概率推理、规划和可解释人工智能。当符号推理技术应用于这些领域时,处理离散变量的标准技术是将其二值化为布尔变量,以便使用布尔计算工具,如SAT求解器。然而,这种技术可能面临计算和语义上的挑战。在本研究中,我们开发了一种用于离散逻辑的原生SAT求解器,它是布尔逻辑的直接扩展,变量可以取任意值。我们提出的求解器在设计上与布尔SAT求解器相似,具有单元解析和子句学习等组成部分,但这些组件是针对离散变量原生操作的。我们通过将所开发的SAT求解器与应用于离散CNF的约束满足问题(CSP)求解器、应用于二值化CNF的布尔SAT求解器以及一些混合求解器进行实证比较,展示了该求解器的优点。
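To make "unit resolution that operates natively on discrete variables" concrete, here is a small sketch under stated assumptions. It assumes a literal has the form "variable takes a value in an allowed set" and a clause is a disjunction of such literals; that literal form, the function name, and the data layout are illustrative choices, not Dsat's actual representation. Clause learning is not shown.

```python
def unit_propagate(domains, clauses):
    """Prune variable domains until a fixpoint; return None on conflict.

    domains: dict mapping variable -> non-empty set of candidate values
    clauses: list of clauses, each a list of (variable, allowed_values) literals
    """
    domains = {v: set(d) for v, d in domains.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            # A literal is satisfied when every remaining value is allowed.
            if any(domains[v] <= allowed for v, allowed in clause):
                continue
            # A literal is still open if some remaining value is allowed.
            open_lits = [(v, allowed) for v, allowed in clause
                         if domains[v] & allowed]
            if not open_lits:
                return None  # every literal is falsified: conflict
            if len(open_lits) == 1:
                v, allowed = open_lits[0]
                domains[v] &= allowed  # unit literal: enforce it
                changed = True
    return domains


domains = {"X": {"r", "g", "b"}, "Y": {"r", "g", "b"}}
clauses = [
    [("X", {"r"})],                      # X = r
    [("X", {"g", "b"}), ("Y", {"g"})],   # X in {g, b}  or  Y = g
]
result = unit_propagate(domains, clauses)
conflict = unit_propagate(domains, clauses + [[("Y", {"r", "b"})]])
```

Here the first clause fixes `X` to `r`, which falsifies the first literal of the second clause and forces `Y` to `g`. Adding a clause that excludes `g` for `Y` then yields a conflict. No Boolean encoding of the three-valued variables is needed at any point.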
cs.AI / 103 / 2605.09350
CHAINTRIX: A multi-pipeline LLM-augmented framework for automated smart-contract security auditing
CHAINTRIX:一种多管道的LLM增强框架,用于自动化智能合约安全审计
Abstract
Smart-contract exploits have caused billions of USD in cumulative losses, yet audits remain expensive and slow. Automated tools have emerged to close this gap, but each class has a characteristic failure mode. Static analyzers report findings that frequently fail manual triage, while large language models (LLMs) hallucinate findings that contradict the source code. We therefore propose Chaintrix, an end-to-end auditing framework whose central architectural commitment is that every LLM-generated claim must be discharged against a deterministic structural contract representation. We introduce a Cross-Contract Interaction Model (CCIM) that parses Solidity into a structured map of function-level reads, writes, modifiers and resolved cross-contract calls. CCIM serves as the substrate against which all 12 of Chaintrix's deterministic signal engines and the parallel LLM audit pipelines operate. A staged false-positive-reduction pipeline, terminating in a Structural Verdict Engine (SVE) that applies deterministic structural checks against parsed code, filters the merged finding set, with selected high-confidence findings further validated through symbolic execution and fuzz testing. We evaluate Chaintrix on EVMbench, the smart-contract security benchmark by OpenAI, Paradigm, and OtterSec. Chaintrix detects 86 of 120 high-severity vulnerabilities (71.7% recall), with 25 audits scoring 100% recall, placing Chaintrix 26 percentage points above the strongest frontier-model baseline.
Chinese Translation
智能合约漏洞已导致数十亿美元的累计损失,但审计仍然昂贵且缓慢。为填补这一空白,自动化工具应运而生,但每类工具都有其特定的失效模式。静态分析器报告的发现往往在手动筛选中以高比例失败,而大型语言模型(LLMs)则会产生与源代码相矛盾的虚假发现。因此,我们提出了Chaintrix,一个端到端的审计框架,其核心架构承诺是每个LLM生成的声明必须针对确定性的结构化合约表示进行验证。我们引入了一个跨合约交互模型(Cross-Contract Interaction Model, CCIM),该模型将Solidity解析为函数级读取、写入、修饰符和解析的跨合约调用的结构化映射。CCIM作为所有12个Chaintrix的确定性信号引擎和并行LLM审计管道的基础。一个分阶段的假阳性减少管道,最终在结构性裁决引擎(Structural Verdict Engine, SVE)中应用针对解析代码的确定性结构检查,过滤合并的发现集,选定的高置信度发现通过符号执行和模糊测试进一步验证。我们在EVMbench上评估Chaintrix,这是由OpenAI、Paradigm和OtterSec提供的智能合约安全基准。Chaintrix检测到120个高严重性漏洞中的86个(71.7%的召回率),其中25个审计的召回率为100%,使Chaintrix的表现比最强的前沿模型基线高出26个百分点。
cs.AI / 104 / 2605.09352
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
维特根斯坦表征假设:语言是多模态收敛的吸引子吗?
Abstract
Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales--yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.
Chinese Translation
理解为何来自不同模态的独立训练神经网络向共享表征收敛,以及这种收敛的结果何在,仍然是表征学习中的一个未解之谜。现有的所有证据都依赖于对称相似性度量,这些度量能够检测收敛但对其方向结构上是盲目的。我们引入了使用循环k最近邻(cycle-kNN)进行的方向性收敛分析,这是一种非对称对齐度量,应用于数十个独立训练的单模态模型,涵盖点云、视觉和语言。我们发现了一种一致的方向性不对称性:非语言模态向语言的邻域结构移动的程度显著高于反向移动,并且这一模式在所有模型家族和规模中都成立——然而对称度量对此完全不可见。机制分析将方向性追溯到特征密度的不对称性,即语言表征占据了表征空间中最紧凑的区域。信息瓶颈框架提供了一个原则性的解释:在压缩下的优化驱动表征向离散的、组成的语言特征结构靠拢。我们将其形式化为维特根斯坦表征假设:语言的语义结构是多模态表征收敛的渐近吸引子。
cs.AI / 105 / 2605.09365
Position: Avoid Overstretching LLMs for every Enterprise Task
定位:避免对每个企业任务过度拉伸大型语言模型(LLMs)
Abstract
Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks operating under strict cost, latency, and reliability constraints. While these are often addressed through large language model (LLM) deployment or distillation into smaller models, we argue this is inefficient, unreliable, and misaligned with enterprise task structures. Instead, AI systems should treat language models as interfaces rather than monolithic engines, externalizing knowledge and computation into dedicated components for greater reliability, scalability, and transparency. Our theoretical evidence shows that finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks, creating inherent limits to efficiency and interpretability. Building on this, we take the position that language models should primarily be used for structured extraction in deterministic enterprise workflows, while computation and storage are delegated to knowledge bases and symbolic procedures. We formally demonstrate that such modular architectures are more reliable and maintainable than monolithic frameworks, offering a sustainable foundation for enterprise tasks.
Chinese Translation
企业工作负载主要由在严格的成本、延迟和可靠性约束下运行的确定性、结构化和依赖知识的任务主导。尽管这些任务通常通过大型语言模型(LLM)的部署或蒸馏成更小的模型来解决,但我们认为这是一种低效、不可靠且与企业任务结构不匹配的方法。相反,人工智能系统应将语言模型视为接口,而非单一的引擎,将知识和计算外部化为专用组件,以提高可靠性、可扩展性和透明度。我们的理论证据表明,有限容量的模型无法完全捕捉企业任务所需的广泛知识,从而在效率和可解释性上造成固有限制。在此基础上,我们认为语言模型应主要用于确定性企业工作流程中的结构化提取,而计算和存储则委托给知识库和符号程序。我们正式证明,这种模块化架构比单一框架更可靠且更易维护,为企业任务提供了可持续的基础。
cs.AI / 106 / 2605.09366
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
迈向虚拟神经科学家:通过多智能体协作实现自主神经影像分析
Abstract
Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized workflows such as fMRIPrep have improved robustness and efficiency, but they are statically configured and cannot reason about downstream objectives, deliberate over alternative strategies, or close the loop between intermediate evidence and subsequent decisions in the way a human researcher would. This lack of closed-loop adaptation often leaves domain experts trapped in a cycle of manual trial-and-error to tune parameters and remediate pipeline failures, severely constraining the scalability of clinical biomarker development. To bridge this gap, we introduce NIAgent, a multi-agent system for autonomous end-to-end neuroimaging analysis. Unlike conventional flat tool-calling agents, NIAgent adopts a code-centric execution paradigm where specialist agents collaboratively synthesize and optimize executable programs over composable domain-specific primitives. This design enables robust, long-horizon workflow construction that adapts dynamically to runtime observations. Furthermore, we propose a hierarchical verification framework for autonomous quality control, integrating cohort-level metric screening with agentic visual inspection to drive evidence-grounded workflow remediation. Experiments on ADHD-200 and ADNI demonstrate that NIAgent outperforms standard workflow-based baselines in predictive performance while exhibiting sophisticated agentic behaviors, including strategy exploration and adaptive refinement.
Chinese Translation
将神经影像数据转化为临床可操作的生物标志物是一个知识密集型和劳动密集型的过程。标准化工作流程,如 fMRIPrep,已经提高了稳健性和效率,但它们是静态配置的,无法像人类研究者那样推理下游目标、考虑替代策略或在中间证据与后续决策之间形成闭环。这种缺乏闭环适应的情况常常使领域专家陷入手动试错的循环中,以调整参数和修复管道故障,严重限制了临床生物标志物开发的可扩展性。为了解决这一问题,我们提出了 NIAgent,一个用于自主端到端神经影像分析的多智能体系统。与传统的平面工具调用代理不同,NIAgent 采用以代码为中心的执行范式,专业代理协同合成和优化可执行程序,基于可组合的领域特定原语。这一设计使得能够构建稳健的、长时间跨度的工作流程,能够动态适应运行时观察。此外,我们提出了一个层次化验证框架,用于自主质量控制,将队列级指标筛选与代理视觉检查相结合,以推动基于证据的工作流程修复。在 ADHD-200 和 ADNI 数据集上的实验表明,NIAgent 在预测性能上优于基于标准工作流程的基线,同时展现出复杂的代理行为,包括策略探索和自适应优化。
cs.AI / 107 / 2605.09369
Explainable Knowledge Tracing via Probabilistic Embeddings and Pattern-based Reasoning
基于概率嵌入和模式推理的可解释知识追踪
Abstract
Knowledge Tracing (KT) models students' knowledge states based on learning interactions to predict performance. While deep learning-based KT models have boosted predictive accuracy, most models rely on deterministic vector embeddings and opaque latent state transitions, limiting interpretability regarding how specific past behaviors influence predictions. To address this limitation, we propose Probabilistic Logical Knowledge Tracing (PLKT), an interpretable KT framework that formulates prediction as a goal-conditioned evidence reasoning process over historical learning behaviors. Instead of representing knowledge states as deterministic vector embeddings, PLKT employs robust Beta-distributed probabilistic embeddings to represent student knowledge states. This probabilistic foundation allows us to model the uncertainty of historical behaviors and perform explicit logical operations (e.g., conjunction), constructing transparent reasoning paths that reveal how specific past interactions contribute to the prediction. Extensive experiments show that PLKT outperforms state-of-the-art KT methods while achieving superior interpretability. Our code is available at https://anonymous.4open.science/r/PLKT-D3CE/.
Chinese Translation
知识追踪(Knowledge Tracing, KT)模型通过学习交互来建模学生的知识状态,以预测其表现。尽管基于深度学习的KT模型提高了预测准确性,但大多数模型依赖于确定性向量嵌入和不透明的潜在状态转移,限制了对特定过去行为如何影响预测的可解释性。为了解决这一限制,我们提出了概率逻辑知识追踪(Probabilistic Logical Knowledge Tracing, PLKT),这是一个可解释的KT框架,将预测形式化为基于历史学习行为的目标条件证据推理过程。PLKT不再将知识状态表示为确定性向量嵌入,而是采用稳健的Beta分布概率嵌入来表示学生的知识状态。这种概率基础使我们能够建模历史行为的不确定性,并进行明确的逻辑操作(例如,合取),构建透明的推理路径,揭示特定过去交互如何对预测产生贡献。大量实验表明,PLKT在性能上优于最先进的KT方法,同时实现了更好的可解释性。我们的代码可在https://anonymous.4open.science/r/PLKT-D3CE/获取。
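A rough sketch of how Beta-distributed embeddings can carry both a belief and its uncertainty, and how a conjunction over two pieces of evidence might be computed. The conjunction here is a weighted combination of the Beta parameters, one common choice in Beta-embedding models for logical queries. Whether PLKT uses this exact operator is not stated in the abstract, so treat the operator, the weights, and the example parameters as assumptions.

```python
def beta_mean(a, b):
    """Mean of Beta(a, b): the point estimate of mastery."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of Beta(a, b): larger values mean less certain evidence."""
    s = a + b
    return a * b / (s * s * (s + 1.0))


def conjunction(params, weights):
    """Weighted combination of Beta parameters (weights should sum to 1)."""
    a = sum(w * p[0] for p, w in zip(params, weights))
    b = sum(w * p[1] for p, w in zip(params, weights))
    return a, b


# Hypothetical evidence from two past interactions.
confident_correct = (8.0, 2.0)   # mean 0.8, low variance
uncertain = (1.0, 1.0)           # uniform: mean 0.5, high variance
combined = conjunction([confident_correct, uncertain], [0.75, 0.25])
```

Because each past interaction enters with an explicit weight, the weights themselves can be read as a statement of how much each interaction contributed to the prediction.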
cs.AI / 108 / 2605.09387
NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning
NEXUS:符号约束的持续学习以实现安全和稳健的具身规划
Abstract
While Large Language Models (LLMs) have catalyzed progress in embodied intelligence, a fundamental gap remains between their inherent probabilistic uncertainty and the strict determinism and verifiable safety required in the physical world. To mitigate this gap, this paper introduces NEXUS, a modular framework designed for continual learning in embodied agents. Unlike prior work that treats symbolic artifacts merely as static interfaces, NEXUS leverages them for symbolic grounding and knowledge evolution. The framework explicitly decouples physical feasibility from safety specifications: agent capability is improved through closed-loop execution feedback, while probabilistic risk assessments are grounded into deterministic hard constraints to establish a rigorous pre-action defense. Experiments on SafeAgentBench demonstrate that NEXUS achieves superior task success rates while effectively refusing unsafe instructions, exhibiting robust defense against adversarial attacks, and progressively improving planning efficiency through knowledge accumulation.
Chinese Translation
尽管大型语言模型(LLMs)在具身智能方面推动了进展,但它们固有的概率不确定性与物理世界所需的严格确定性和可验证安全性之间存在根本差距。为了解决这一差距,本文提出了NEXUS,一个旨在为具身智能体提供持续学习的模块化框架。与以往将符号工件仅视为静态接口的研究不同,NEXUS利用符号工件进行符号基础和知识演化。该框架明确将物理可行性与安全规范解耦:通过闭环执行反馈提高智能体的能力,同时将概率风险评估基于确定性硬约束,以建立严格的预行动防御。在SafeAgentBench上的实验表明,NEXUS在有效拒绝不安全指令的同时,实现了更高的任务成功率,展现出对对抗攻击的稳健防御,并通过知识积累逐步提高规划效率。
cs.AI / 109 / 2605.09391
Do Linear Probes Generalize Better in Persona Coordinates?
线性探针在个性坐标中是否具有更好的泛化能力?
Abstract
It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging, changing their behavior during evaluation. This motivates the use of white-box monitors like linear probes, which can read the model internals directly. Currently, such probes can fail under distribution shift, limiting their usefulness in real settings. We study whether there exists a low-dimensional subspace of the model internals that captures harmful behaviors more robustly, while leaving out spuriously correlated features. Inspired by the Assistant Axis and Persona Selection Model, we construct persona axes for deception and sycophancy using contrastive persona prompts. The first principal components, obtained by unsupervised PCA of the persona-specific vectors, cleanly separate harmful and harmless personas. Across 10 evaluation datasets, we show that persona-derived directions transfer non-trivially and probes trained on persona-PC projections generalize better than probes trained on raw activations. We also find that a unified axis consisting of multiple harmful and harmless behaviors improves generalization across behaviors and datasets. Overall, persona vectors provide a useful inductive bias for building more transferable behavior probes.
Chinese Translation
在语言模型交互中,监测有害行为的需求日益增加,但仅依靠文本监测并不足够。这是因为模型有时会表现出战略性欺骗和拖延行为,在评估期间改变其行为。这促使我们使用像线性探针这样的白盒监测工具,它们可以直接读取模型内部信息。目前,这些探针在分布转移下可能会失败,从而限制了它们在实际环境中的有效性。我们研究是否存在一个低维子空间,能够更稳健地捕捉有害行为,同时排除虚假相关特征。受到助手轴(Assistant Axis)和个性选择模型(Persona Selection Model)的启发,我们使用对比个性提示构建了欺骗和谄媚的个性轴。通过对个性特定向量进行无监督主成分分析(PCA)获得的第一主成分能够清晰地区分有害和无害的个性。在10个评估数据集上,我们展示了源于个性的方向具有非平凡的迁移能力,并且在个性-PC投影上训练的探针比在原始激活上训练的探针具有更好的泛化能力。我们还发现,由多个有害和无害行为组成的统一轴能够改善跨行为和数据集的泛化能力。总体而言,个性向量为构建更具可转移性的行为探针提供了有用的归纳偏置。
cs.AI / 110 / 2605.09395
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
通过量身定制的代理推理赋能视觉语言模型进行少样本多模态时间序列分类
Abstract
In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.
Chinese Translation
在本文中,我们提出了首个视觉语言模型(VLM)代理推理框架,用于少样本多模态时间序列分类(MarsTSC),该框架引入了一个自我演化的知识库,作为一个动态上下文,通过反思性代理推理进行迭代优化。该框架包含三个协作角色:i)生成器通过推理进行可靠分类;ii)反思者诊断推理错误的根本原因,以提供针对生成器忽视的时间特征的区分性见解;iii)修改者对知识库应用经过验证的更新,以防止上下文崩溃。我们进一步引入了一种测试时更新策略,以实现谨慎、持续的知识库优化,从而减轻少样本偏差和分布转移。在12个主流时间序列基准上的广泛实验表明,MarsTSC在6个视觉语言模型主干上提供了显著且一致的性能提升,在少样本条件下超越了经典和基础模型的时间序列基线,同时生成可解释的推理,基于人类可读的特征证据为每个分类决策提供依据。
cs.AI / 111 / 2605.09415
Strategic commitments shape collective cybersecurity under AI inequality
战略承诺塑造人工智能不平等下的集体网络安全
Abstract
The growing integration of AI into cybersecurity is reshaping the balance between attackers and defenders. When access to advanced AI-enabled defence tools is uneven, resource-limited defenders may be unable to adopt effective protection, creating persistent system vulnerabilities. We study the impact of differential AI access using an evolutionary game-theoretic model in a finite population. We first show that when high-capability defence is costly, the population is driven toward low-cost, weak-defence behaviour, sustaining attacks and weakening long-run security. To address this problem, we introduce differential access to AI defence tools by allowing defenders to choose between low- and high-capability protection based on their resources. We then examine the role of a small group of committed defenders who always adopt strong defence and influence others through social learning. Although commitment increases the prevalence of strong defence, it alone cannot stabilise secure outcomes due to high defence costs. We therefore incorporate a targeted subsidy to remove the cost disadvantage from committed defenders. Our analysis shows that subsidised commitment significantly increases strong defence adoption, suppresses successful attacks, and improves overall system resilience. Simulations across a broad parameter space confirm that subsidies consistently outperform commitment alone. In addition, social-welfare analysis shows improved defender outcomes while keeping attacker gains low. These findings suggest that targeted support for key defenders can be an effective mechanism for stabilising cybersecurity in AI-driven environments and provide a theoretical bridge between cybersecurity policy, AI governance, and strategic allocation of defensive AI capabilities.
Chinese Translation
人工智能在网络安全中的日益整合正在重塑攻击者与防御者之间的平衡。当对先进的人工智能驱动的防御工具的获取不均时,资源有限的防御者可能无法采用有效的保护措施,从而导致系统持续存在脆弱性。我们使用有限人群的进化博弈理论模型研究差异化人工智能访问的影响。我们首先表明,当高能力防御成本较高时,人口会趋向于低成本、弱防御行为,这会持续攻击并削弱长期安全性。为了解决这个问题,我们通过允许防御者根据其资源选择低能力和高能力保护,引入了对人工智能防御工具的差异化访问。然后,我们考察了一小部分始终采用强防御并通过社会学习影响他人的承诺防御者的作用。尽管承诺增加了强防御的普遍性,但由于高防御成本,单靠承诺无法稳定安全结果。因此,我们引入了针对性的补贴,以消除承诺防御者的成本劣势。我们的分析表明,补贴承诺显著增加了强防御的采用,抑制了成功攻击,并改善了整体系统的弹性。在广泛参数空间的模拟中,补贴始终优于单独的承诺。此外,社会福利分析显示,在保持攻击者收益较低的同时,防御者的结果得到了改善。这些发现表明,针对关键防御者的支持可以成为在人工智能驱动的环境中稳定网络安全的有效机制,并为网络安全政策、人工智能治理和防御性人工智能能力的战略分配提供了理论桥梁。
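The social-learning step in a finite-population evolutionary game is often modelled with a pairwise-comparison (Fermi) rule, and the sketch below uses it to show why a subsidy can tip imitation toward strong defence. The Fermi rule, the payoff structure, and every numeric value are assumptions chosen for illustration. The abstract does not specify the paper's update rule or parameters.

```python
import math


def imitation_probability(payoff_self, payoff_model, beta=1.0):
    """Fermi rule: chance that a player copies the model player's strategy."""
    return 1.0 / (1.0 + math.exp(-beta * (payoff_model - payoff_self)))


def defender_payoff(strong, cost_strong=4.0, cost_weak=1.0, loss=4.0,
                    p_breach_strong=0.1, p_breach_weak=0.6, subsidy=0.0):
    """Hypothetical expected payoff: minus defence cost, minus expected loss."""
    if strong:
        return -(cost_strong - subsidy) - p_breach_strong * loss
    return -cost_weak - p_breach_weak * loss


weak = defender_payoff(False)
strong_plain = defender_payoff(True)
strong_subsidised = defender_payoff(True, subsidy=2.0)

# Probability that a weak defender imitates a committed strong defender.
p_without_subsidy = imitation_probability(weak, strong_plain)
p_with_subsidy = imitation_probability(weak, strong_subsidised)
```

With these illustrative numbers, unsubsidised strong defence earns less than weak defence, so imitation of committed defenders is unlikely. Once the subsidy removes the cost disadvantage, the same rule favours adopting strong defence, matching the qualitative claim in the abstract.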
cs.AI / 112 / 2605.09419
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay
从被动重用到主动推理:为神经符号经验重放奠定基础的大型语言模型
Abstract
While experience replay is essential for data efficiency in reinforcement learning (RL), standard methods treat the replay buffer as a passive memory system, prioritizing samples based on numerical prediction errors rather than their semantic significance. This approach stands in contrast to human learning, which accelerates mastery by actively abstracting fragmented experiences into behavioral rules. To bridge this gap, we propose Neuro-Symbolic Experience Replay (NSER), a framework that transforms experience replay from a passive sample reuse mechanism into an active engine for knowledge construction. Specifically, NSER addresses the incompatibility between linguistic reasoning and numerical optimization through a novel neuro-symbolic grounding pipeline. It leverages Large Language Models (LLMs) in a zero-shot manner to induce candidate behavioral rules from accumulated trajectories, grounds these insights in differentiable first-order logic representations, and uses the resulting symbolic structures to dynamically reweight the replay distribution. By allowing abstract knowledge to directly shape policy optimization, NSER achieves consistently superior sample efficiency and convergence speed across reactive, rule-based, and procedural benchmarks.
Chinese Translation
虽然经验重放对于强化学习(RL)中的数据效率至关重要,但标准方法将重放缓冲区视为被动记忆系统,优先考虑基于数值预测误差的样本,而非其语义重要性。这种方法与人类学习形成对比,人类通过主动将碎片化的经验抽象为行为规则来加速掌握。为了弥补这一差距,我们提出了神经符号经验重放(Neuro-Symbolic Experience Replay, NSER),这是一个将经验重放从被动样本重用机制转变为知识构建主动引擎的框架。具体而言,NSER通过一种新颖的神经符号基础管道解决了语言推理与数值优化之间的不兼容性。它以零样本方式利用大型语言模型(Large Language Models, LLMs)从累积的轨迹中诱导候选行为规则,将这些见解基础于可微分的一阶逻辑表示,并利用生成的符号结构动态重新加权重放分布。通过允许抽象知识直接影响策略优化,NSER在反应性、基于规则和过程基准测试中实现了一致的优越样本效率和收敛速度。
cs.AI / 113 / 2605.09423
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio:用于具身智能体学习的自动环境生成与进化编码代理
Abstract
LLM/VLM-based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open-source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self-evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym-style environments for embodied agent learning. SimWorld Studio further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self-evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co-evolution yields an 18-point success-rate gain over fixed-environment learning and a 40-point gain over an untrained agent.
Chinese Translation
基于大规模语言模型(LLM)/视觉语言模型(VLM)的数字代理得益于可扩展的编码、网络导航和计算机使用沙箱,快速发展,这些沙箱提供了丰富的互动训练场。然而,具身智能体仍然缺乏丰富、多样且自动生成的3D环境以进行互动学习。现有的具身模拟器依赖于手工制作的场景或程序模板,而最近的基于LLM的3D生成系统主要生成静态场景,而非可部署的环境和可验证的任务及标准学习接口。我们介绍了SimWorld Studio,这是一个基于虚幻引擎5(Unreal Engine 5)构建的开源平台,用于生成进化的具身学习环境。其核心是SimCoder,一个工具/技能增强的编码代理,能够根据语言/图像指令编写和执行引擎级代码,从而构建物理基础的3D世界。SimCoder通过使用验证反馈(例如编译错误、物理检查、VLM评估)自我进化,以修订环境并自主添加可重用的工具和技能到其库中。生成的世界以Gym风格的环境形式导出,以供具身智能体学习。SimWorld Studio进一步促进了环境生成与具身学习之间的共同进化:智能体性能反馈指导SimCoder生成接近学习者能力边界的自适应课程,使得环境随着具身智能体的提升而变得越来越具有挑战性。关于具身导航的三个案例研究表明,自我进化提高了生成的可靠性,生成的环境显著提升了具身智能体在未见基准上的表现,而共同进化相比于固定环境学习提高了18个百分点的成功率,相比于未训练的智能体提高了40个百分点。
cs.AI / 114 / 2605.09461
VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection
VulTriage:基于三路径上下文增强的LLM漏洞检测
Abstract
Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the structural dependencies, domain-specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE-derived vulnerability patterns and examples through hybrid dense--sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state-of-the-art performance, outperforming existing deep learning and LLM-based baselines on key pair-wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low-resource and class-imbalanced settings. Our code is available at https://github.com/vinsontang1/VulTriage
Chinese Translation
自动化漏洞检测是软件安全中的一项基础任务,但现有的基于学习的方法仍然难以捕捉准确检测所需的结构依赖、特定领域的漏洞知识和复杂的程序语义。近期的大型语言模型(LLMs)展现了强大的代码理解能力,但直接用原始源代码提示它们往往会导致漏报漏洞或误报,尤其是在脆弱和良性函数仅在细微语义细节上有所不同的情况下。为了解决这个问题,我们提出了VulTriage,一个基于LLM的漏洞检测的三路径上下文增强框架。VulTriage通过三条互补路径增强LLM输入:控制路径(Control Path)提取并口头化抽象语法树(AST)、控制流图(CFG)和数据流图(DFG)信息,以揭示控制和数据依赖关系;知识路径(Knowledge Path)通过混合稠密-稀疏检索获取相关的基于通用弱点枚举(CWE)的漏洞模式和示例;语义路径(Semantic Path)在最终判断之前总结代码的功能行为。这些上下文被整合成一个统一的指令,以引导LLM进行更可靠的漏洞推理。在PrimeVul对测试集上的实验表明,VulTriage达到了最先进的性能,在关键的成对和分类指标上超越了现有的深度学习和基于LLM的基线。进一步的消融研究验证了每条路径的有效性,额外的在Kotlin数据集上的实验展示了VulTriage在低资源和类别不平衡设置下的泛化能力。我们的代码可在 https://github.com/vinsontang1/VulTriage 获取。
cs.AI / 115 / 2605.09497
Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces
别点击那个:教导网络代理抵御欺骗性界面
Abstract
Vision-language model (VLM) based web agents demonstrate impressive autonomous GUI interaction but remain vulnerable to deceptive interface elements. Existing approaches either detect deception without task integration or document attacks without proposing defenses. We formalize deception-aware web agent defense and propose DUDE (Deceptive UI Detector & Evaluator), a two-stage framework combining hybrid-reward learning with asymmetric penalties and experience summarization to distill failure patterns into transferable guidance. We introduce RUC (Real UI Clickboxes), a benchmark of 1,407 scenarios spanning four domains and deception categories. Experiments show DUDE reduces deception susceptibility by 53.8% while maintaining task performance, establishing an effective foundation for robust web agent deployment.
Chinese Translation
基于视觉-语言模型(VLM)的网络代理展示了令人印象深刻的自主图形用户界面(GUI)交互能力,但仍然容易受到欺骗性界面元素的影响。现有的方法要么在没有任务集成的情况下检测欺骗,要么记录攻击而不提出防御措施。我们正式提出了欺骗感知网络代理防御,并提出了DUDE(欺骗性用户界面检测器与评估器),这是一个结合了混合奖励学习、非对称惩罚和经验总结的两阶段框架,用于提炼失败模式并转化为可转移的指导。我们引入了RUC(真实用户界面点击框),这是一个涵盖四个领域和欺骗类别的1,407个场景的基准。实验表明,DUDE将欺骗易感性降低了53.8%,同时保持了任务性能,为稳健的网络代理部署奠定了有效基础。
cs.AI / 116 / 2605.09505
EpiGraph: A Knowledge Graph and Benchmark for Evidence-Intensive Reasoning in Epilepsy
EpiGraph:一种用于癫痫证据密集型推理的知识图谱和基准测试
Abstract
Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present \textsc{EpiGraph}, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. \textsc{EpiGraph} integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, \textsc{EpiBench} defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating \textsc{EpiGraph} consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30--41\%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.
Chinese Translation
癫痫的诊断和治疗需要在异质临床知识中进行证据密集型推理,包括生物信号模式、遗传机制、药物基因组学、治疗策略和患者结果。在本研究中,我们提出了 \textsc{EpiGraph},一个大规模的癫痫知识图谱和用于评估知识增强临床推理的基准测试。\textsc{EpiGraph} 将48,166篇经过同行评审的论文和七个临床资源整合成一个异质图,包含24,324个实体和32,009个基于证据的三元组,跨越五个临床层面。在此图基础上,\textsc{EpiBench} 定义了五个以临床为导向的任务,涵盖临床决策、脑电图报告生成、药物基因组学精准医学、治疗推荐和深度研究规划。我们在标准和Graph-RAG设置下评估了六个大型语言模型(LLMs)。结果表明,整合 \textsc{EpiGraph} 一致性地提高了所有任务的表现,其中药物基因组推理的提升最大(+30--41%)。我们的研究结果表明,结构化的癫痫知识显著增强了基于证据的临床推理,并提供了一个实用的基准框架,用于评估在实际神经学环境中知识增强的LLMs。我们的代码可在以下网址获取:https://github.com/LabRAI/EEG-KG。
cs.AI / 117 / 2605.09511
WindINR: Latent-State INR for Fast Local Wind Query and Correction in Complex Terrain
WindINR:复杂地形中快速局部风查询与修正的潜在状态隐式神经表示
Abstract
Many downstream decisions in complex terrain require fast wind estimates at a small number of user-specified locations and heights for a given forecast valid time, rather than another dense forecast field on a fixed grid. We present WindINR, a latent-state implicit neural representation framework for continuous high-resolution local wind query and sparse-observation correction. WindINR maps static terrain descriptors, a low-resolution background field, and continuous query coordinates to a high-resolution wind state through a latent-conditioned decoder. To enable rapid inference-time correction, WindINR separates reusable representation learning from sample-specific latent-state correction. During training, a privileged encoder infers a reference latent state from high-resolution supervision, a deployable latent predictor estimates an initial latent state from inference-time inputs alone, and their discrepancies are summarized into a dataset-adaptive Gaussian prior over latent corrections. At inference time, within the WindINR module, network weights remain fixed and only the latent state is updated by minimizing a regularized correction objective using sparse observations and their uncertainty. In controlled OSSEs over the Senja region, including a UAV-aided approach scenario and random-observation robustness tests, WindINR improves local high-resolution wind estimates by updating only a compact latent state rather than the full network. The corrected representation remains continuously queryable at arbitrary coordinates and, in our CPU benchmark, yields about a $2.6\times$ online-correction speedup over full-network fine-tuning, suggesting a practical interface between kilometer-scale background products, sparse local observations, and wind queries in complex terrain.
Chinese Translation
在复杂地形中,许多下游决策需要在用户指定的少数位置和高度上快速获取风速估计,而不是在固定网格上生成另一个密集的预报场。我们提出了WindINR,一种用于连续高分辨率局部风查询和稀疏观测修正的潜在状态隐式神经表示框架。WindINR通过潜在条件解码器将静态地形描述符、低分辨率背景场和连续查询坐标映射到高分辨率风状态。为了实现快速推理时的修正,WindINR将可重用的表示学习与样本特定的潜在状态修正分开。在训练过程中,特权编码器从高分辨率监督中推断出参考潜在状态,而可部署的潜在预测器仅从推理时输入估计初始潜在状态,它们之间的差异被总结为一个数据集自适应的高斯先验,用于潜在修正。在推理时,在WindINR模块内,网络权重保持固定,只有潜在状态通过最小化一个正则化修正目标来更新,该目标使用稀疏观测及其不确定性。在对Senja地区的受控观测系统模拟实验(OSSEs)中,包括无人机辅助方法场景和随机观测鲁棒性测试,WindINR通过仅更新紧凑的潜在状态而不是整个网络,改善了局部高分辨率风速估计。修正后的表示在任意坐标上保持连续可查询,并且在我们的CPU基准测试中,相比于全网络微调,在线修正速度提升约$2.6\times$,这表明在复杂地形中,千米尺度背景产品、稀疏局部观测和风查询之间存在实用的接口。
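One way to read the inference-time correction in the WindINR abstract is as a MAP-style estimate over the latent correction. The expression below is only a sketch under assumptions that the abstract does not state: $D_\theta$ denotes the frozen decoder, $z_0$ the latent predictor's initial state, $\delta$ the latent correction, $(x_i, y_i, \sigma_i)$ the sparse observations with their uncertainties, and $(\mu, \Sigma)$ the parameters of the dataset-adaptive Gaussian prior. The paper's actual objective and weighting may differ.

```latex
\hat{\delta} = \arg\min_{\delta}\; \sum_{i=1}^{N} \frac{\lVert y_i - D_\theta(x_i;\, z_0 + \delta) \rVert^2}{\sigma_i^2} + (\delta - \mu)^\top \Sigma^{-1} (\delta - \mu), \qquad \hat{z} = z_0 + \hat{\delta}
```

Under this reading, only $\delta$ is optimized while $\theta$ stays fixed, which is consistent with the abstract's claim that correction updates a compact latent state rather than the full network.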
cs.AI / 118 / 2605.09515
A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models
大语言模型注意力头中高阶协同的博弈论自由能分析
Abstract
Large language models rely on multihead attention, but interactions among heads remain poorly understood. We apply the Game Theoretic Free Energy Principle (GTFEP), a framework that casts multiagent systems as distributed variational inference, to analyze attention heads as bounded rational agents. According to GTFEP, each head minimizes its variational free energy, and collective behavior follows a Gibbs distribution over coalition structures whose energy is decomposed into Harsanyi dividends. Using a tractable approximation (uniform prior, deterministic dynamics), coalition free energy reduces to joint Shannon entropy of discretized head outputs (argmax key index). Pairwise dividends become mutual information (nonnegative), while triple dividends correspond to interaction information and can be negative. On BERT, GPT2, and Llama with GSM8K, triple dividends are consistently negative, revealing higher order redundancy. The Nash FEP correspondence guarantees that stationary points of collective free energy are epsilon-Nash equilibria; thus, heads with low marginal contribution can be pruned to reduce computational cost with minimal performance loss. For example, pruning 20% of heads in GPT2 reduces FLOPs by 18%, increases throughput by 22%, and raises perplexity only modestly (from 28.4 to 33.4 on GSM8K). Our work shows GTFEP provides a principled foundation for analyzing and optimizing transformer architectures.
Chinese Translation
大语言模型依赖于多头注意力,但头之间的相互作用仍然不够清楚。我们应用博弈论自由能原理(Game Theoretic Free Energy Principle, GTFEP):一个将多智能体系统视为分布式变分推断的框架,以分析注意力头作为有限理性代理。根据GTFEP,每个头最小化其变分自由能,集体行为遵循基于联盟结构的吉布斯分布,其能量分解为哈萨尼红利(Harsanyi dividends)。通过可处理的近似(均匀先验、确定性动态),联盟自由能简化为离散化头输出的联合香农熵(argmax关键索引)。成对红利变为互信息(非负),而三重红利对应于交互信息并且可以为负。在BERT、GPT2和Llama与GSM8K的实验中,三重红利始终为负,揭示了高阶冗余。纳什自由能原理(Nash FEP)对应性保证了集体自由能的驻点是ε纳什均衡;因此,贡献微不足道的头可以在性能损失最小的情况下被剪枝。剪枝低边际贡献的头可以减少计算成本且性能损失最小:例如,在GPT2中剪去20%的头可以减少18%的FLOPs,增加22%的吞吐量,并且仅使困惑度略微上升(在GSM8K上从28.4上升到33.4)。我们的研究表明,GTFEP为分析和优化变换器架构提供了一个有原则的基础。
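The entropy-based dividends described in the GTFEP abstract can be illustrated with a small sketch. It assumes a coalition value of v(T) = -H(T) (negative joint entropy), which is one sign convention consistent with the abstract's statements that pairwise dividends equal mutual information and that negative triple dividends indicate redundancy; the paper's exact definition may differ. The function names and the toy samples are hypothetical, and each sample row stands in for the discretized outputs (argmax key indices) of several heads at one position.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(samples, heads):
    """Shannon entropy (bits) of the empirical joint distribution over the given head columns."""
    if not heads:
        return 0.0
    counts = Counter(tuple(row[h] for h in heads) for row in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def harsanyi_dividend(samples, coalition):
    """Moebius inversion of the assumed value v(T) = -H(T) over all sub-coalitions T."""
    size = len(coalition)
    total = 0.0
    for k in range(size + 1):
        sign = (-1) ** (size - k)
        for sub in combinations(coalition, k):
            total += sign * -joint_entropy(samples, sub)
    return total


# Three heads that always agree: fully redundant.
redundant = [(0, 0, 0), (1, 1, 1)]
# Third head is the XOR of the first two: purely synergistic.
xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

for name, data in (("redundant", redundant), ("xor", xor)):
    pair = harsanyi_dividend(data, (0, 1))
    triple = harsanyi_dividend(data, (0, 1, 2))
    print(f"{name}: pairwise={pair:+.2f} triple={triple:+.2f}")
```

In this toy setup the redundant heads give a positive pairwise dividend and a negative triple dividend, while the XOR heads give a zero pairwise dividend and a positive triple dividend, matching the redundancy-versus-synergy reading in the abstract.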
cs.AI / 119 / 2605.09519
Weighted Rules under the Stable Model Semantics
稳定模型语义下的加权规则
Abstract
We introduce the concept of weighted rules under the stable model semantics following the log-linear models of Markov Logic. This provides versatile methods to overcome the deterministic nature of the stable model semantics, such as resolving inconsistencies in answer set programs, ranking stable models, associating probability to stable models, and applying statistical inference to computing weighted stable models. We also present formal comparisons with related formalisms, such as answer set programs, Markov Logic, ProbLog, and P-log.
Chinese Translation
我们引入了在稳定模型语义下的加权规则概念,借鉴了马尔可夫逻辑的对数线性模型。这提供了多种灵活的方法,以克服稳定模型语义的确定性特征,例如解决答案集程序中的不一致性、对稳定模型进行排序、将概率与稳定模型关联,以及将统计推断应用于计算加权稳定模型。我们还与相关形式主义进行了正式比较,如答案集程序、马尔可夫逻辑、ProbLog 和 P-log。
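The log-linear idea mentioned in the abstract on weighted rules can be sketched as follows. This is a minimal illustration in the style of Markov Logic, not the paper's definition: it assumes the candidate stable models and the weights of the rules each one satisfies are already given (deciding which interpretations are stable models is out of scope here), and the function name and example data are hypothetical.

```python
from math import exp


def stable_model_probabilities(satisfied_weights):
    """Normalize exp(sum of satisfied rule weights) over the given candidate models."""
    unnormalized = {model: exp(sum(ws)) for model, ws in satisfied_weights.items()}
    z = sum(unnormalized.values())  # partition function over the candidates
    return {model: w / z for model, w in unnormalized.items()}


# Hypothetical candidates: each maps to the weights of the rules it satisfies.
candidates = {
    "model_a": [1.0, 0.5],
    "model_b": [1.0],
    "model_c": [],
}
probabilities = stable_model_probabilities(candidates)
ranking = sorted(probabilities, key=probabilities.get, reverse=True)
print(ranking)
```

Sorting by the resulting probabilities gives one simple way to rank stable models, which is among the uses the abstract lists.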
cs.AI / 120 / 2605.09524
Functional Stable Model Semantics and Answer Set Programming Modulo Theories
功能稳定模型语义与理论模态下的答案集编程
Abstract
Recently there has been an increasing interest in incorporating ``intensional'' functions in answer set programming. Intensional functions are those whose values can be described by other functions and predicates, rather than being pre-defined as in the standard answer set programming. We demonstrate that the functional stable model semantics plays an important role in the framework of ``Answer Set Programming Modulo Theories (ASPMT)'' -- a tight integration of answer set programming and satisfiability modulo theories, under which existing integration approaches can be viewed as special cases where the role of functions is limited. We show that ``tight'' ASPMT programs can be translated into SMT instances, which is similar to the known relationship between ASP and SAT.
Chinese Translation
近年来,越来越多的研究者对在答案集编程中引入“意向性”函数表现出浓厚的兴趣。意向性函数是指其值可以通过其他函数和谓词来描述,而不是像标准答案集编程中那样预先定义。我们展示了功能稳定模型语义在“理论模态下的答案集编程(Answer Set Programming Modulo Theories, ASPMT)”框架中发挥着重要作用——这是答案集编程与可满足性模态理论的紧密结合,在此框架下,现有的集成方法可以视为函数角色受到限制的特例。我们证明了“紧凑型” ASPMT 程序可以被转换为 SMT 实例,这与已知的 ASP 与 SAT 之间的关系类似。
cs.AI / 121 / 2605.09528
Cplus2ASP: Computing Action Language C+ in Answer Set Programming
Cplus2ASP:在答案集编程中计算动作语言C+
Abstract
We present Version 2 of system Cplus2ASP, which implements the definite fragment of action language C+. Its input language is fully compatible with the language of the Causal Calculator Version 2, but the new system is significantly faster thanks to modern answer set solving techniques. The translation implemented in the system is a composition of several recent theoretical results. The system orchestrates a tool chain, consisting of f2lp, clingo, iclingo, and as2transition. Under the incremental execution mode, the system translates a C+ description into the input language of iclingo, exploiting its incremental grounding mechanism. The correctness of this execution is justified by the module theorem extended to programs with nested expressions. In addition, the input language of the system has many useful features, such as external atoms by means of Lua calls and the user interactive mode. The system supports extensible multi-modal translations for other action languages, such as B and BC, as well.
Chinese Translation
我们介绍了系统Cplus2ASP的第2版,该系统实现了动作语言C+的确定性片段。其输入语言与因果计算器第2版的语言完全兼容,但由于现代答案集求解技术,新系统的速度显著提升。系统中实现的翻译是多个近期理论结果的组合。该系统协调了一条工具链,包括f2lp、clingo、iclingo和as2transition。在增量执行模式下,系统将C+描述翻译为iclingo的输入语言,利用其增量基础机制。这一执行的正确性通过扩展到具有嵌套表达式的程序的模块定理得以证明。此外,系统的输入语言具有许多实用特性,例如通过Lua调用的外部原子和用户交互模式。该系统还支持对其他动作语言(如B和BC)的可扩展多模态翻译。
cs.AI / 122 / 2605.09542
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs
基于知识图谱的LLM引导蒙特卡洛树搜索:为药物-疾病对构建机制解释
Abstract
Extracting multi-step explanations from knowledge graphs poses a combinatorial challenge requiring both heuristic guidance (as candidates proliferate with depth) and credit assignment (as path quality emerges over extended sequences). Frontier LLMs, strong on knowledge/reasoning benchmarks, offer a compelling source of such heuristics, yet their knowledge comes sans guarantees and compositional performance degrades as chains lengthen. We thus present TESSERA, a 3-part neuro-symbolic framework that uses LLMs in a circumscribed role: for local discriminative judgement rather than autonomous multi-step generation; the knowledge graph then defines the hypothesis space enforcing hard structural constraints, and MCTS coordinates the long-horizon search with principled credit assignment via backpropagation. LLMs perform dual roles as a prior policy biasing exploration and a comparative state evaluator supplying reward signals. Evaluation on drug mechanism elucidation across two complementary knowledge graphs demonstrates fidelity to curated biology while surfacing coherent alternative mechanisms, with ablations confirming discriminative contribution from both LLM components. Beyond its current application, our framework offers a general paradigm for compositional reasoning over structured knowledge.
Chinese Translation
从知识图谱中提取多步骤解释面临组合挑战,这需要启发式指导(因为候选项随着深度的增加而激增)和信用分配(因为路径质量在延长序列中逐渐显现)。前沿的语言模型(LLMs)在知识/推理基准测试中表现出色,提供了此类启发式指导的有力来源,但它们的知识缺乏保证,且随着链条的延长,组合性能会下降。因此,我们提出了TESSERA,一个三部分的神经符号框架,该框架在有限的角色中使用LLMs:用于局部判别判断,而非自主的多步骤生成;知识图谱随后定义了假设空间,强制施加严格的结构约束,蒙特卡洛树搜索(MCTS)则通过反向传播协调长远搜索并进行原则性的信用分配。LLMs在此过程中扮演双重角色,既作为偏向探索的先验策略,又作为提供奖励信号的比较状态评估器。在两个互补知识图谱上对药物机制的阐明评估展示了对策划生物学的忠实,同时揭示了连贯的替代机制,消融实验确认了两个LLM组件的判别贡献。除了当前的应用外,我们的框架还为结构化知识的组合推理提供了一种通用范式。
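The role of the LLM as a prior policy in the TESSERA abstract can be illustrated with a PUCT-style selection rule of the kind popularized by AlphaZero. The abstract does not say which selection formula is used, so the rule, the constant `c_puct`, and the dictionary layout below are assumptions for illustration only; `prior` stands in for an LLM-derived preference over knowledge-graph edges and `q` for the backed-up reward.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Exploitation term plus a prior-weighted exploration bonus that decays with visits."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration


def select_child(children, parent_visits, c_puct=1.5):
    """Return the name of the child edge with the highest PUCT score."""
    return max(
        children,
        key=lambda name: puct_score(
            children[name]["q"],
            children[name]["prior"],
            parent_visits,
            children[name]["visits"],
            c_puct,
        ),
    )


# Hypothetical edges out of one knowledge-graph node.
children = {
    "well_explored_edge": {"q": 0.5, "prior": 0.1, "visits": 10},
    "llm_favoured_edge": {"q": 0.0, "prior": 0.9, "visits": 0},
}
print(select_child(children, parent_visits=10))
```

With these numbers the unvisited edge with a high prior is selected first, which shows how a prior can bias exploration before any reward has been backed up.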
cs.AI / 123 / 2605.09544
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench:任务感知与诊断评估的工具集成推理
Abstract
Tool-integrated reasoning has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol, jointly measuring final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality and discriminative evaluation sets by filtering low-discrimination instances from existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.
Chinese Translation
工具集成推理已成为增强大型语言模型外部计算、检索和执行能力的有前景的范式。然而,该领域仍然缺乏高质量和统一的评估基准,现有的工具集成推理(TIR)评估在数据集质量、任务多样性、诊断全面性和评估效率方面仍然有限。在本研究中,我们引入了TIDE-Bench,这是一个全面且高效的评估TIR方法的基准,具有三个主要优势。首先,它提供了多样化的任务设置,将广泛使用的数学推理和知识密集型问答任务与两个新设计的任务相结合,即基于工具的实验设计任务和动态交互任务,以探测模型在复杂工具调用和多工具协调方面的能力。其次,TIDE-Bench采用了全面但任务感知的评估协议,联合测量最终答案质量、过程可靠性、工具使用效率和推理成本,适用于异构任务设置。第三,TIDE-Bench通过从现有数据集中筛选低区分度实例,构建高质量和具有区分性的评估集,显著降低评估成本,同时关注更具挑战性的样本。在多个基础模型和TIR方法上的广泛实验揭示了工具基础的持续瓶颈,为未来的TIR研究提供了见解。
cs.AI / 124 / 2605.09636
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
PDEAgent-Bench:一个多指标、多库的偏微分方程求解器生成基准测试
Hang, Zhen, Yashengjiang, Yushan, Li, Junhui, Dong, Huanshuo, Wei, Yang, Hao, Zhezheng, Ma, Jiangtao, Bai, Songlin, Kai, Haozhong, Yue, Xihang, Si, Gangzong, Jiang, Dongming, Yao, Chao, Hu, Zhanhua, Zhang, Jiangqing, Liu, Pengwei, Shen, Yaomin, Ren, Xingyu, Liu, Lei, Xu, Zikang, Li, Han, Yao, Qingsong, Dong, Hande, Wang, Hong
Abstract
PDE-to-solver code generation aims to automatically synthesize executable numerical solvers from partial differential equation (PDE) specifications. This task requires not only understanding the mathematical structure of PDEs, but also selecting appropriate discretization schemes and solver configurations, and correctly implementing the resulting formulations in finite-element method (FEM) libraries. Existing code generation benchmarks mainly evaluate syntactic correctness or success on predefined test cases. To our knowledge, there is currently no publicly available benchmark specifically for PDE-to-solver code generation, and general-purpose code benchmarks do not fully capture the unique challenges of numerical PDE solution, such as ensuring solver accuracy, efficiency, and compatibility with professional FEM libraries. We introduce PDEAgent-Bench, which is, to the best of our knowledge, the first multi-metric, multi-library benchmark for PDE-to-solver code generation. PDEAgent-Bench contains 645 instances across 6 mathematical categories and 11 PDE families, targeting the common FEM libraries DOLFINx, Firedrake, and deal.II. Each instance provides an agent-facing problem specification, a reference solution on a prescribed evaluation grid, and case-specific accuracy and runtime targets. PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks. Experiments with representative LLMs and code agents show that models can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced. These results indicate that current agents remain limited in producing numerically reliable and efficient PDE solvers, and that PDEAgent-Bench provides a reproducible testbed grounded in the practical requirements of numerical PDE solving.
Chinese Translation
偏微分方程(PDE)到求解器的代码生成旨在从偏微分方程规范中自动合成可执行的数值求解器。该任务不仅需要理解偏微分方程的数学结构,还需要选择合适的离散化方案和求解器配置,并正确地在有限元方法(FEM)库中实现生成的公式。现有的代码生成基准主要评估语法正确性或在预定义测试用例上的成功率。我们了解到,目前没有专门针对偏微分方程到求解器代码生成的公开基准,而通用代码基准并未充分捕捉数值偏微分方程求解的独特挑战,例如确保求解器的准确性、效率以及与专业有限元库的兼容性。我们介绍了PDEAgent-Bench,尽我们所知,这是第一个针对偏微分方程到求解器代码生成的多指标、多库基准测试。PDEAgent-Bench包含645个实例,涵盖6个数学类别和11个偏微分方程家族,并与DOLFINx、Firedrake和deal.II等常见有限元库相结合。每个实例提供了一个面向代理的问题规范、在规定评估网格上的参考解,以及特定案例的准确性和运行时间目标。PDEAgent-Bench采用分阶段评估框架,其中生成的求解器必须依次通过可执行性、数值准确性和计算效率检查。与代表性的大型语言模型(LLMs)和代码代理的实验表明,模型通常能够生成可运行的代码,但一旦强制执行准确性和效率要求,其通过率显著下降。这些结果表明,目前的代理在生成数值可靠和高效的偏微分方程求解器方面仍然有限,而PDEAgent-Bench提供了一个基于数值偏微分方程求解实际需求的可重复测试平台。
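The staged evaluation described in the PDEAgent-Bench abstract (executability, then numerical accuracy, then computational efficiency) can be sketched as a short gate sequence. This is a hypothetical illustration, not the benchmark's harness: the function name, the use of relative L2 error as the accuracy metric, and the wall-clock timing are all assumptions, and `solver` stands in for a generated solver that returns values on the prescribed evaluation grid.

```python
import math
import time


def staged_evaluate(solver, reference, accuracy_target, runtime_target_s):
    """Run the three checks in order and report the first stage that fails."""
    # Stage 1: executability. The solver must run and return one value per grid point.
    start = time.perf_counter()
    try:
        values = list(solver())
    except Exception:
        return "failed_executability"
    elapsed = time.perf_counter() - start
    if len(values) != len(reference):
        return "failed_executability"

    # Stage 2: numerical accuracy (relative L2 error; the metric is an assumption).
    diff = math.sqrt(sum((v - r) ** 2 for v, r in zip(values, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference)) or 1.0
    if diff / norm > accuracy_target:
        return "failed_accuracy"

    # Stage 3: computational efficiency against the case-specific runtime target.
    if elapsed > runtime_target_s:
        return "failed_efficiency"
    return "passed"


reference = [0.0, 0.5, 1.0]
print(staged_evaluate(lambda: [0.0, 0.5, 1.0], reference, 1e-6, 5.0))
```

Because the stages are sequential, a solver that runs but is inaccurate is reported as an accuracy failure and never reaches the efficiency check, which mirrors the pass-rate drop the abstract describes once accuracy and efficiency are enforced.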
cs.AI / 125 / 2605.09650
Workspace Optimization: How to Train Your Agent
工作空间优化:如何训练你的智能体
Abstract
Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's \emph{workspace}, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.
Chinese Translation
基于前沿语言模型的现代智能体通常无法调整其权重。那么,什么是可训练的呢?我们认为是智能体的 \textit{工作空间},即它读取、写入和测试的结构化外部基质;我们称其演变为工作空间优化。工作空间优化针对的是在多轮交互中,前沿模型具有强先验但无法一次性解决任务的困难环境,因此智能体必须通过交互学习。我们提出了一种原则性的方法来演化工作空间,反映权重空间训练的结构:用工件替代参数,用证据替代数据,用反例替代损失,用文本反馈替代梯度。我们在DreamTeam中实例化了这一思想,DreamTeam是一个多智能体框架,适用于ARC-AGI-3,其角色构建一个可执行的世界模型,进行规划、假设、探测、策略制定和故障路由。在当前25个游戏的ARC-AGI-3公共数据集上,依据官方评分协议并在两次独立运行中取平均,DreamTeam将与SOTA协议匹配的智能体的得分从36%提高到38.4%,同时每场游戏使用的环境动作减少了31%。
cs.AI / 126 / 2605.09675
CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents
CodeClinic:评估临床推理代理的编码技能自动化
Abstract
Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.
Chinese Translation
基于大型语言模型(LLMs)的临床推理代理旨在自动化诸如重症监护病房(ICU)监测和从电子健康记录(EHRs)中跟踪患者状态等任务。现有系统通常依赖于手动策划的临床工具或技能来处理诸如脓毒症检测和器官衰竭评估等概念。然而,维护这些工具库需要大量专家投入,而零样本查询或代码生成往往会产生低效且不可靠的推理链,尤其是在特定机构的临床政策下。我们提出了CodeClinic,这是一个基于MIMIC-IV构建的基准,用于评估LLM代理是否能够合成和组合可重用的临床技能,而不是依赖固定的工具箱。该基准包含两个互补任务:纵向ICU监测和组合信息检索。纵向设置模拟每四小时对25个发现和八个临床类别进行结构化决策的患者轨迹监测,而组合设置则涵盖了九个领域中的63,000个实例,分为259个任务,并按组合依赖深度进行分层,以评估日益复杂的多步推理。我们进一步提出了一种离线自动形式化管道,通过迭代LLM优化将自然语言临床指南转换为可重用和经过验证的Python技能库。与零样本代码生成相比,生成的库提高了一致性,同时将每次查询的令牌使用量减少了多达40%。
cs.AI / 127 / 2605.09678
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
荒谬世界:一种简单而强大的方法,用于将现实世界荒谬化以探测大语言模型的推理能力
Abstract
While extremely powerful and versatile at various tasks, the thinking capabilities of large language models (LLMs) are often put under scrutiny as they sometimes fail to solve problems that humans can systematically solve. However, recent literature focuses on breaking LLM reasoning with increasingly complex problems, and whether an LLM is robust in simple logical reasoning remains underexplored. This paper proposes Absurd World, a benchmarking framework, to test LLMs against altered realism, where scenarios are logically coherent, and humans can easily solve the tasks. Absurd World breaks a real-world model into symbols, actions, sequences, and events, which are automatically altered to create absurd worlds where the logic to solve the tasks remains the same. It evaluates a large collection of models with simple and advanced prompting techniques, and proves that it is an effective tool to determine LLMs' ability to think logically, ignoring the patterns learned from the real world. One can use this framework to extensively test an LLM against a real-world problem to verify whether the LLM's reasoning capability is robust against variations of the task.
Chinese Translation
尽管大型语言模型(LLMs)在各种任务中表现出极强的能力和多样性,但它们的思维能力常常受到质疑,因为有时它们无法解决人类能够系统性解决的问题。然而,近期文献集中于通过日益复杂的问题来打破LLM的推理能力,而LLM在简单逻辑推理方面的稳健性仍然未得到充分探索。本文提出了荒谬世界(Absurd World),一个基准框架,用于测试LLM在改变现实主义下的表现,其中场景在逻辑上是一致的,人类可以轻松解决这些任务。荒谬世界将现实世界模型分解为符号、动作、序列和事件,这些元素被自动改变以创建荒谬的世界,在这些世界中,解决任务的逻辑保持不变。它评估了一大批模型,采用简单和高级的提示技术,证明这是一个有效的工具,可以确定LLM的逻辑思维能力,忽略从现实世界中学习到的模式。研究人员可以利用该框架广泛测试LLM在现实世界问题上的表现,以验证LLM的推理能力是否对任务的变化具有稳健性。
cs.AI / 128 / 2605.09692
Unpredictability dissociates from structured control in language agents
不可预测性与语言代理中的结构化控制分离
Abstract
Unpredictable behavior is often taken as evidence of control, yet stochastic dispersion and structured action control need not coincide. This paper tests whether stochastic sampling can substitute for structured mechanisms that couple reasons, memory, self-state and inhibition to action selection in a language-agent implementation whose control components can be selectively disabled. In a seven-dataset baseline lesion matrix comprising 74,352 calls, the high-stochasticity comparator was more unpredictable than the structured-control variant in 7/7 datasets, whereas targeted reason and veto lesions reduced the expected structured-control profiles in 7/7 datasets each. In a matched-interface control spanning 26,946 generations, the structured agent maintained stronger action-field coupling than all stochastic, post-hoc, scrambled and verbosity controls across every dataset. The primary behavioral test removed free-form trace wording from the evaluation: 57,816 scored records showed the structured-control variant exceeding the high-stochasticity comparator or the reason/veto lesions in 7/7 datasets for all predefined behavioral components. Later open-weight runs extended the no-context controls to Qwen2.5 7B, 14B and 32B and to an independent Mistral-7B family across 20 task families and three agent scaffolds; no-fields, scrambled-context and distribution-matched controls failed to recover structured action control. A three-annotator blinded audit over 1,200 overlap items preserved high agreement. Strict entropy matching, strict token/compute matching and a formal counterfactual-flip stress test did not meet their gates and are treated as limitations. Stochastic unpredictability did not reproduce structured, action-coupled control in this implemented agent family.
Chinese Translation
不可预测的行为常被视为控制的证据,然而随机分散与结构化行动控制并不一定重合。本文测试了随机采样是否可以替代将理由、记忆、自我状态和抑制与行动选择相结合的结构化机制,研究对象为一种语言代理实现,其控制组件可以选择性地禁用。在一个包含74,352次调用的七个数据集基线损伤矩阵中,高随机性比较器在7个数据集中均表现出比结构控制变体更不可预测,而针对理由和否决的损伤在7个数据集中均降低了预期的结构控制特征。在一个跨越26,946代的匹配接口控制中,结构化代理在每个数据集中维持了比所有随机、事后、打乱和冗长控制更强的行动场耦合。主要行为测试从评估中移除了自由形式的追踪措辞:57,816条评分记录显示结构控制变体在7个数据集中超越了高随机性比较器或理由/否决损伤,涵盖所有预定义的行为组件。后续的开放权重运行将无上下文控制扩展到Qwen2.5的7B、14B和32B,以及独立的Mistral-7B家族,涵盖20个任务家族和三个代理框架;无场、打乱上下文和分布匹配控制未能恢复结构化行动控制。对1,200个重叠项目进行的三位注释者盲审保持了高度一致性。严格的熵匹配、严格的标记/计算匹配和正式的反事实翻转压力测试未能通过其门限,因而被视为局限性。在这一实现的代理家族中,随机不可预测性未能再现结构化的、与行动耦合的控制。
cs.AI / 129 / 2605.09698
Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents
Ambig-DS:数据科学代理任务框架模糊性的基准测试
Abstract
As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.
Chinese Translation
随着数据科学代理从副驾驶转变为自动驾驶,静默的错误框架成为一种关键的失败模式。代理悄然承诺于合理但未预期的任务框架,产生干净、可执行的成果,掩盖了它们对任务的错误评估。现有基准测试仅评估管道是否运行,而忽略了代理是否识别出任务的定义不明确。我们提出了Ambig-DS,包含两个诊断套件:一个用于预测目标模糊性(Ambig-DS-Target,基于DSBench构建的51个任务,DSBench是一个表格建模基准),另一个用于评估目标模糊性(Ambig-DS-Objective,基于MLE-bench构建的61个任务,MLE-bench是一个Kaggle风格的机器学习竞赛基准),构建时确保评分使用每个源基准的原始评估者。对于每个任务,我们将原始的、完全指定的版本与通过控制编辑生成的模糊变体配对;人类和大型语言模型(LLM)验证管道确认每个变体都承认多种合理的解释,并具有决策相关的后果。两个套件独立分析,模糊性降低了两者的性能。在涵盖从高效到前沿模型的五个代理中,我们在控制的诊断设置中发现:(i)失败是静默承诺:在目标上提交错误目标的结果,在目标上提交错误指标或不明确的基线,而不是执行错误;(ii)允许代理提出一个澄清问题在理想条件下恢复了大部分损失,这表明缺失的框架信息驱动了观察到的降级的很大一部分;但(iii)代理无法可靠地判断何时使用它:宽松的提示在明确任务上导致过度提问,而保守的提示在模糊任务上导致静默默认。识别目标和评估的定义不明确,而不是管道执行,是标准数据科学代理评估中缺失的瓶颈。
cs.AI / 130 / 2605.09716
Medical Model Synthesis Architectures: A Case Study
医学模型综合架构:案例研究
Abstract
Medicine is rife with high-stakes uncertainty. Doctors routinely make clinical judgments and decisions that juggle many fundamental unknowns, like predictions about what might be causing a patient's symptoms or decisions about what treatment to try next. Despite increasing interest in developing AI systems that aid or even replace doctors in clinical settings, current systems struggle with calibrated reasoning under uncertainty, and are often deeply opaque about their reasoning. We propose a framework for AI systems that can make practically useful but formally transparent clinical predictions under uncertainty. Given a clinical situation, our framework (MedMSA) uses language models to retrieve relevant prior knowledge, but constructs a formal probabilistic model to support calibrated and verifiable inferences under uncertainty. We show how an initial proof-of-concept of this framework can be used for differential diagnosis, producing an uncertainty-weighted list of potential diagnoses that could explain a patient's symptoms, and discuss future applications and directions for applying this framework more generally for safe clinical collaborations.
Chinese Translation
医学领域充满了高风险的不确定性。医生在临床判断和决策中常常需要处理许多基本未知因素,例如对患者症状可能原因的预测或关于下一步尝试何种治疗的决策。尽管对开发能够在临床环境中辅助甚至替代医生的人工智能系统的兴趣日益增加,但当前系统在不确定性下的校准推理方面仍然存在困难,并且它们的推理过程往往非常不透明。我们提出了一种人工智能系统框架,能够在不确定性下做出实用但形式上透明的临床预测。在给定的临床情境下,我们的框架(MedMSA)利用语言模型检索相关的先前知识,但构建一个正式的概率模型,以支持在不确定性下的校准和可验证推理。我们展示了该框架的初步概念验证如何用于鉴别诊断,生成一个加权不确定性的潜在诊断列表,以解释患者的症状,并讨论了未来应用和更广泛地应用该框架以实现安全临床合作的方向。
cs.AI / 131 / 2605.09749
Primal-Dual Guided Decoding for Constrained Discrete Diffusion
约束离散扩散的原始-对偶引导解码
Abstract
Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint. The method requires no retraining and no additional model evaluations beyond standard sampling, supports multiple simultaneous constraints, and provides formal bounds on constraint violation. We evaluate our approach on topical text generation, molecular design, and music playlist generation, showing that a single algorithm instantiated via domain-specific scoring functions improves constraint satisfaction while preserving relevant domain-specific quality metrics.
Chinese Translation
离散扩散模型通过逐步揭示标记生成结构化序列,但在生成过程中强制执行全局属性约束仍然是一个未解决的挑战。我们提出了原始-对偶引导解码,这是一种推理时的方法,将约束生成形式化为一个KL正则化优化问题,并通过自适应拉格朗日乘子在线求解。在每个去噪步骤中,该方法通过一个附加的、依赖于约束的偏差修改标记的logits,乘子通过基于约束违反的镜像下降进行更新。该偏差作为约束的最优KL正则化投影出现,因此约束分布在满足约束的同时尽可能接近模型的无约束分布。该方法不需要重新训练,也不需要超出标准采样的额外模型评估,支持多个同时约束,并提供约束违反的正式界限。我们在主题文本生成、分子设计和音乐播放列表生成上评估了我们的方法,结果表明,通过领域特定的评分函数实例化的单一算法在提高约束满足的同时保持相关领域特定的质量指标。
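The mechanism in the abstract above (an additive, constraint-dependent logit bias with a multiplier driven by constraint violation) can be sketched on a single toy distribution. This is a minimal illustration, not the authors' implementation: the per-token `scores`, the `target`, the step size, and the choice of an exponentiated update as the mirror-descent step are all assumptions. What is standard is that the KL-closest distribution to p satisfying an expectation constraint has the form q ∝ p · exp(λ · score), which is exactly an additive bias on the logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def biased_distribution(logits, scores, lam):
    """q proportional to p * exp(lam * score): an additive logit bias."""
    return softmax([l + lam * s for l, s in zip(logits, scores)])

def update_multiplier(lam, violation, step_size=0.5):
    """Exponentiated update (assumed mirror-descent step); keeps lam > 0."""
    return lam * math.exp(step_size * violation)

def run_demo(num_steps=200):
    logits = [2.0, 1.0, 0.0]   # toy unconstrained token logits
    scores = [0.0, 0.0, 1.0]   # hypothetical per-token constraint score
    target = 0.5               # hypothetical constraint: E_q[score] >= target
    lam = 0.1
    for _ in range(num_steps):
        q = biased_distribution(logits, scores, lam)
        expected = sum(qi * si for qi, si in zip(q, scores))
        # Positive violation means the constraint is not yet met.
        lam = update_multiplier(lam, target - expected)
    q = biased_distribution(logits, scores, lam)
    return lam, sum(qi * si for qi, si in zip(q, scores))
```

In this toy the multiplier settles where the expected score meets the target. In actual decoding the constraint is global over the sequence and the multiplier would be updated across denoising steps rather than on one fixed distribution.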
cs.AI / 132 / 2605.09769
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
UTS在PsyDefDetect中的表现:基于多智能体委员会和缺失推理的防御机制分类
Abstract
This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning - a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59-80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.
Chinese Translation
本文描述了我们使用防御机制评分量表(Defense Mechanism Rating Scales, DMRS)对情感支持对话中的心理防御机制进行分类的系统,在64个团队中获得第二名(F1 0.406)。一个核心见解是,防御机制是通过缺失来定义的:缺失的情感、阻塞的认知、否认的现实。我们将其编码为临床规则中的情感-认知整合谱,这在单一增益中占据了最大份额(+11.4pp F1)。我们的架构是一个多阶段的深思熟虑委员会,由Gemini 2.5智能体组成,其中特定类别的倡导者评估证据强度而不是投票,取得了F1 0.382的成绩,且无需微调——这本身就是前五名的结果。然而,我们发现委员会在少数类的判断上自信但错误:59-80%的稳定少数预测是错误的,原因是存在一个系统性的“L7吸引子”,使得情感内容默认归于多数类。来自三个微调后的Qwen3.5模型的针对性覆盖集应用了16个覆盖(+2.4pp),由一个结构化的多智能体系统(构建者、评论者、回归守卫)选择,该系统在一次迭代中产生的F1增益超过了之前8次尝试的总和。
cs.AI / 133 / 2605.09771
Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning
将医疗事件的生成模型与健康社会决定因素的数字双胞胎结合用于疾病推理
Abstract
Despite the central role of sensor-derived measurements such as imaging traits and plasma biomarkers in biomedical research and clinical practice, existing generative models for disease prediction largely depend on event-level representations from hospital and registry data. Given the multi-factorial nature of human disease, the absence of explicit modeling of social determinants of health (SDoH), even in the limited form of ICD-coded proxies (chapters Z and V--Y in ICD-10), limits the capacity for personalized disease modeling and clinical decision support. To address this limitation, we propose a generative model with ICD-coded proxies of SDoH for \textit{in silico} modeling of disease reasoning, a conditioned latent diffusion framework that establishes the connection between multi-organ sensor data with tokenized healthcare events. Specifically, we introduce a novel geometric diffusion model to characterize the temporal evolution of complex data representation such as brain networks (region-to-region connectivity encoded in a graph), in parallel with diffusion models for tabular data from other organ systems. Together, we integrate the generative model with digitalized SDoH proxies (coined \modelname{}) for simulated intervention and reasoning of future disease trajectories. We conduct extensive experiments on the UK Biobank (UKB) dataset, which contains organ-specific imaging traits, including brain (44,834), heart (23,987), liver (28,722), and kidney (32,155), along with nearly 500k medical history sequences (age range: 25$\sim$89 years). Our \modelname{} achieves significant improvements over state-of-the-art human disease autoregressive models and imaging trait generative baselines.
Chinese Translation
尽管传感器衍生的测量(如成像特征和血浆生物标志物)在生物医学研究和临床实践中发挥着核心作用,但现有的疾病预测生成模型在很大程度上依赖于来自医院和登记数据的事件级表示。考虑到人类疾病的多因素特性,缺乏对健康社会决定因素(SDoH)的明确建模,即使是以ICD编码代理(ICD-10中的Z和V-Y章节)的有限形式存在,也限制了个性化疾病建模和临床决策支持的能力。为了解决这一局限性,我们提出了一种生成模型,该模型利用SDoH的ICD编码代理进行疾病推理的\textit{in silico}建模,采用条件潜在扩散框架,建立多脏器传感器数据与标记化医疗事件之间的联系。具体而言,我们引入了一种新颖的几何扩散模型,以表征复杂数据表示(如大脑网络的时序演变,区域间连接以图形编码),并与其他脏器系统的表格数据扩散模型并行。我们将生成模型与数字化的SDoH代理(称为\modelname{})整合,用于模拟干预和未来疾病轨迹的推理。我们在包含特定脏器成像特征的UK Biobank(UKB)数据集上进行了广泛实验,该数据集包括大脑(44,834)、心脏(23,987)、肝脏(28,722)和肾脏(32,155)的成像特征,以及近50万条医疗历史序列(年龄范围:25至89岁)。我们的\modelname{}在先进的人类疾病自回归模型和成像特征生成基线模型上取得了显著的改进。
cs.AI / 134 / 2605.09780
Attribution-based Explanations for Markov Decision Processes
基于归因的马尔可夫决策过程解释
Abstract
Attribution techniques explain the outcome of an AI model by assigning a numerical score to its inputs. So far, these techniques have mainly focused on attributing importance to static input features at a single point in time, and thus fail to generalize to sequential decision-making settings. This paper fills this gap by introducing techniques to generate attribution-based explanations for Markov Decision Processes (MDPs). We give a formal characterization of what attributions should represent in MDPs, focusing on explanations that assign importance scores to both individual states and execution paths. We show how importance scores can be computed by leveraging techniques for strategy synthesis, enabling the efficient computation of these scores despite the non-determinism inherent in an MDP. We evaluate our approach on five case studies, demonstrating its utility in providing interpretable insights into the logic of sequential decision-making agents.
Chinese Translation
归因技术通过为人工智能模型的输入分配数值评分来解释其结果。迄今为止,这些技术主要集中在为静态输入特征在单一时间点上分配重要性,因此无法推广到序列决策环境。本文通过引入生成马尔可夫决策过程(MDPs)基于归因的解释的技术来填补这一空白。我们对归因在MDPs中应表示的内容进行了正式表征,重点关注为个体状态和执行路径分配重要性评分的解释。我们展示了如何利用策略合成技术计算重要性评分,从而能够高效地计算这些评分,尽管MDP中固有的不确定性。我们在五个案例研究中评估了我们的方法,证明了其在提供对序列决策代理逻辑的可解释性见解方面的实用性。
cs.AI / 135 / 2605.09826
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
EnactToM:一种用于具身智能体功能性心智理论的动态基准
Abstract
Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.
Chinese Translation
心智理论(Theory of Mind, ToM)是指追踪他人认知状态的能力,使得人类成为高效的合作者。在多智能体环境中,人工智能(AI)代理也需要具备这种能力,但现有的基准主要通过直接的信念问题来测试字面心智理论。具身环境中对隐含信念进行最佳行动的能力,被称为功能性心智理论,尚未得到充分测试。我们提出了EnactToM,这是一个包含300个具身多智能体任务的动态基准,设置在一个具有部分可观测性、私密信息和受限通信的3D家庭环境中。每个任务都经过正式验证以确保可解性和所需的认知深度,并且随着模型的改进,新的任务会被生成以增加难度。在困难分割中,所有七个评估的前沿模型在功能性任务完成上得分为0.0% Pass^3,而在字面信念探测上平均得分为45.0%。手动分析显示,93%的抽样失败可追溯至认知协调的崩溃,例如信息隐瞒、忽视合作伙伴约束和错误分配消息,为未来的研究提供了明确的目标。
cs.AI / 136 / 2605.09842
Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis
使用机器学习和计量经济学进行收益率曲线预测:比较分析
Abstract
While machine learning has revolutionized many fields such as natural language processing (NLP) and computer vision, its impact on time-series forecasting is still widely disputed, especially in the finance domain. This paper compares forecasting performance on U.S. Treasury yield curve data across econometrics/time-series analysis, classical machine learning, and deep learning methods, using daily data over 47 years. The Treasury yield curve is important because it is widely used by every participant in the bond markets, which are larger than equity markets. We examine a variety of methods that have not been tested on yield curve forecasting, especially deep learning algorithms. The algorithms include the Autoregressive Integrated Moving Average (ARIMA) model and its extensions, naive benchmarks, ensemble methods, Recurrent Neural Networks (RNNs), and multiple transformers built for forecasting. ARIMA and naive econometric models outperform other models overall, except in one time block. Of the machine learning methods, TimeGPT, LGBM and RNNs perform the best. Furthermore, the paper explores whether stationary or nonstationary data are more appropriate as input to deep learning models.
Chinese Translation
尽管机器学习在自然语言处理(NLP)和计算机视觉等多个领域引发了革命,但其在时间序列预测中的影响仍存在广泛争议,尤其是在金融领域。本文比较了美国国债收益率曲线数据在计量经济学/时间序列分析、经典机器学习和深度学习方法下的预测性能,使用了47年的每日数据。国债收益率曲线的重要性在于它被所有债券市场参与者广泛使用,而债券市场的规模大于股票市场。我们考察了一系列尚未在收益率曲线预测中进行测试的方法,特别是深度学习算法。这些算法包括自回归积分滑动平均(ARIMA)模型及其扩展、简单基准、集成方法、递归神经网络(RNN)以及为预测构建的多个变换器。总体而言,ARIMA和简单计量经济学模型的表现优于其他模型,除了在一个时间段内。机器学习方法中,TimeGPT、LGBM和RNN的表现最佳。此外,本文探讨了平稳数据或非平稳数据作为深度学习模型输入的适宜性。
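The abstract above reports that naive benchmarks are hard to beat. One common reading of a naive benchmark is the no-change (random-walk) forecast, which predicts that a yield stays at its last observed value. Whether the paper uses exactly this variant is an assumption, and the series in the usage note is made-up illustrative data, not Treasury data.

```python
import math

def naive_forecast_pairs(series, horizon=1):
    """Pair each no-change forecast y[t] with its target y[t + horizon]."""
    if horizon < 1 or horizon >= len(series):
        raise ValueError("horizon must be in [1, len(series) - 1]")
    predictions = series[:-horizon]
    targets = series[horizon:]
    return list(zip(predictions, targets))

def rmse(pairs):
    """Root mean squared error over (prediction, target) pairs."""
    squared = [(p - t) ** 2 for p, t in pairs]
    return math.sqrt(sum(squared) / len(squared))

def naive_rmse(series, horizon=1):
    """RMSE of the no-change forecast at the given horizon."""
    return rmse(naive_forecast_pairs(series, horizon))
```

For example, `naive_rmse([4.0, 4.1, 4.3, 4.2])` scores the one-step-ahead no-change forecast on a four-point toy series; any learned model would need a lower error on the same targets to justify its extra complexity.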
cs.AI / 137 / 2605.09844
The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs
元认知探测器:针对大型语言模型的五项行为校准诊断
Abstract
The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM's confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response. They are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).
Chinese Translation
元认知探测器是一种探索性五任务、15插槽的诊断工具,它将大型语言模型(LLM)的信心行为分解为五个行为上独特的维度:信心校准(T1-CC)、认知警觉(T2-EV)、知识边界(T3-KB)、校准范围(T4-CR)和推理链验证(T5-RCV)。该工具在N=8个前沿模型和N=69个人类样本上进行了评估。该工具的设计受到Flavell(1979)和Nelson与Narens(1990)的启发,但其操作基于可观察的信心与正确性的一致性;它并不是一个经过验证的跨物种元认知量表,且预设的人类发展假设被证伪。综合基准(MMLU、BIG-Bench、HELM、GPQA)询问模型是否产生正确的响应,但并未说明模型是否知道其响应是错误的。一个模型可以在综合校准基准上得分80,但在某些狭窄领域中可能仍然过于自信,而这些领域的聚合结果无法显现。元认知探测器揭示了这些领域。我们的主要发现是在Gemini 2.5 Flash模型中存在47点的模型内解离:任务内校准最佳(T1-CC = 88;Spearman rho = +0.551,95% CI [+0.14, +0.80],p = 0.005)和任务间难度预测最差(T4-CR = 41;sigma_conf = 1.4,基于十二个事实)。
cs.AI / 138 / 2605.09852
Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI
人工智能(AI)中解释的公平性:统一框架、公理及负责任AI的未来方向
Abstract
Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent research fields: \emph{algorithmic fairness}, which targets equitable outcomes, and \emph{explainable AI} (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection: a model can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its \emph{reasoning process}. We refer to this as procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical and literature review of this emerging field and elucidate the drawbacks of post-hoc explainers in certifying explanation fairness. Our central contribution is a \emph{conditional invariance framework} formalizing explanation fairness as the requirement that, once the task-relevant features are fixed, explanations should not depend on the protected attribute: $P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$ for all task-relevant $x$, a single principle from which all existing explanation fairness metrics emerge as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.
Chinese Translation
机器学习算法正在被用于高风险决策,包括刑事司法、医疗保健、信用和就业等领域。研究界对此作出了回应,形成了两个基本独立的研究领域:\textit{算法公平性},旨在实现公平的结果,以及\textit{可解释的人工智能}(XAI),旨在实现可解释的推理。本文调查并映射了这两个领域交叉处的一个新盲点,即一种模型可以在其输出中满足每一个标准的公平性标准,而在其\textit{推理过程}中却极为不公平。我们将其称为程序性偏见,减轻这种偏见需要将解释的公平性视为一个独立的科学研究对象。据我们所知,我们提供了这一新兴领域的首个统一理论和文献综述,并阐明了后验解释器在证明解释公平性方面的缺陷。我们的核心贡献是一个\textit{条件不变性框架},将解释公平性形式化为解释应对受保护属性无差别的要求,即对于所有任务相关的$x$,$P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$。这个原则是所有现有解释公平性度量的部分操作化的基础。我们引入了一个七维分类法,识别了三种解释不平等的生成机制(基于表现、解释模型不匹配、可操作性驱动),并提出了一个规范的六步评估工作流程,以在实践中操作化解释公平性审计。
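The conditional invariance condition in the abstract above compares explanation distributions across protected groups within each stratum of task-relevant features. A minimal sketch of one partial check follows, assuming a scalar attribution per record and a discrete stratum label; the record layout and function name are hypothetical. It compares only group means per stratum, which is a necessary but not sufficient condition for the distributional equality stated in the abstract.

```python
from collections import defaultdict

def conditional_explanation_gap(records):
    """Largest within-stratum gap in mean attribution between groups a and b.

    records: iterable of (x_rel, group, attribution), group in {"a", "b"}.
    A gap of 0 is consistent with conditional invariance but does not
    prove it, since only means (not full distributions) are compared.
    """
    # Per stratum and group: [running sum, count].
    cells = defaultdict(lambda: {"a": [0.0, 0], "b": [0.0, 0]})
    for x_rel, group, attribution in records:
        cell = cells[x_rel][group]
        cell[0] += attribution
        cell[1] += 1
    worst = 0.0
    for groups in cells.values():
        (sum_a, n_a), (sum_b, n_b) = groups["a"], groups["b"]
        if n_a == 0 or n_b == 0:
            continue  # no cross-group comparison possible in this stratum
        worst = max(worst, abs(sum_a / n_a - sum_b / n_b))
    return worst
```

Conditioning on the stratum is the point of the design: a raw difference in mean attributions between groups could be explained by the groups having different task-relevant features, whereas a within-stratum difference cannot.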
cs.AI / 139 / 2605.09860
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
何时重新承诺:长时间视语言推理的时间抽象发现
Abstract
Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as \emph{commitment depth}: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision--language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25\% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision--language model achieves 0\% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.
Chinese Translation
长时间推理不仅需要决定采取什么行动,还需要在下一个观察之前决定承诺的深度。我们将其形式化为\textit{承诺深度}:在重新规划之间以开放循环执行的原始动作数量。承诺深度在重新规划成本和累积执行误差之间引入了权衡,但大多数现有的长时间系统将其固定为手工设计的标量。在本研究中,我们将承诺深度视为策略本身的可学习、状态条件变量。我们在一个模型原生的视语言策略中实现了这一点,该策略共同预测要执行的内容及其持续时间。在滑动拼图和仓库番游戏中,所得到的自适应策略在每个非退化固定深度基线中都表现出帕累托优势,解决率提高了多达12.5个百分点,同时每个回合使用的原始动作减少了约25%。尽管使用了一个7B的主干网络,我们的方法在这两个任务上都优于GPT-5.5和Claude Sonnet,而每个测试的开放权重视语言模型的零样本成功率均为0%。我们进一步提出了理论分析,表明在标准的承诺深度替代下,当局部最优深度在状态之间变化时,状态条件承诺严格优于任何固定深度。
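The trade-off described in the abstract above (replanning cost versus compounding open-loop error) can be made concrete with a toy cost model. The functional form below is a hypothetical stand-in, not the paper's surrogate: replanning cost is amortized over the chunk, and error is assumed to grow linearly with the number of steps since the last observation. The sketch only illustrates why choosing the depth per state can beat any single fixed depth when the best depth differs across states.

```python
def surrogate_cost(depth, replan_cost, step_error):
    """Toy per-action cost (hypothetical form): amortized replanning plus
    the average of an error that grows linearly within the chunk."""
    return replan_cost / depth + step_error * (depth - 1) / 2.0

def best_depth(state, depths):
    """Depth minimizing the toy cost for one (replan_cost, step_error) state."""
    replan_cost, step_error = state
    return min(depths, key=lambda k: surrogate_cost(k, replan_cost, step_error))

def compare_policies(states, depths):
    """Return (best fixed-depth total cost, state-conditioned total cost)."""
    fixed_total = min(
        sum(surrogate_cost(k, c, e) for c, e in states) for k in depths
    )
    adaptive_total = sum(
        surrogate_cost(best_depth(s, depths), *s) for s in states
    )
    return fixed_total, adaptive_total
```

With a low-error state and a high-error state, for instance `compare_policies([(1.0, 0.05), (1.0, 0.8)], range(1, 9))`, the per-state optimum commits longer in the first state and shorter in the second, so the adaptive total is strictly below the best fixed-depth total.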
cs.AI / 140 / 2605.09875
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
通过锚点投影表示实现行为轴的跨家族普适性
Abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
Chinese Translation
来自不同家族的大型语言模型使用不同的隐藏维度、分词器和训练程序,使得行为方向在模型之间的比较或转移变得困难。我们提出了一种锚点投影框架,将每个模型的隐藏表示映射到一个共享的锚点坐标空间(Anchor Coordinate Space, ACS)。从源模型提取的行为方向被投影到ACS中,并平均为一个典范方向。对于新模型,典范方向仅通过锚点激活重构到其本地隐藏空间,而无需微调或特定目标方向的提取。我们评估了五个指令调优模型家族和十个行为轴。我们发现,相同轴的方向在ACS中紧密对齐,尤其是在Llama-Qwen-Mistral-Phi(LQMP)集群中。这种共享结构能够转移到下游任务。在对齐的LQMP集群中,保留目标的十分类别检测准确率达到(0.83),平均二元AUROC为(0.95),而典范引导在分布变化下引发的拒绝率变化高达+0.46%。敏感性分析表明,两个源模型和小型锚点池已经足以近似可转移方向。总体而言,ACS为跨家族可解释性提供了一种新视角,揭示了表示层级的转移在模型家族之间依然保持稳健。
cs.AI / 141 / 2605.09879
M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
M2A:在大型语言模型中协同数学推理与自主推理
Abstract
While reasoning has become a central capability of large language models (LLMs), the reasoning patterns required for different scenarios are often misaligned. Mathematical reasoning typically relies on intrinsic logic to solve closed-world problems in a single response, whereas agentic reasoning requires not only internal reasoning but also multi-turn interaction with external environments, interleaving thought and action. This misalignment prevents mathematical and agentic reasoning from effectively benefiting from each other, often yielding unstable reasoning behavior and only limited performance gains under multi-task learning. In this paper, we propose M2A, a novel paradigm that synergizes mathematical and agentic reasoning via model merging. To avoid overfitting to superficial reasoning patterns under joint training, M2A operates directly in parameter space: it identifies the feature subspace critical for agent behavior, and merges the mathematical reasoning task vector only along its null space, thereby injecting reasoning capability along directions that do not perturb agent behavior. Unlike SFT or RL, M2A requires no additional gradient updates and exposes the merging coefficient as a simple knob for controlling reasoning length. Experiments in a challenging real-world coding agent setting show that our method effectively extends agentic reasoning depth and delivers substantial performance improvements. Applied to a fine-tuned Qwen3-8B, M2A improves its SWE-Bench Verified resolved rate from 44.0% to 51.2% without retraining the model. Code is available at https://github.com/laplucky/M2A.git.
Chinese Translation
尽管推理已成为大型语言模型(LLMs)的核心能力,但不同场景所需的推理模式往往不一致。数学推理通常依赖于内在逻辑来解决封闭世界问题,并在单次响应中给出答案,而自主推理不仅需要内部推理,还需要与外部环境进行多轮交互,交织思考与行动。这种不一致性阻碍了数学推理与自主推理之间的有效互补,常常导致推理行为不稳定,并且在多任务学习中仅能获得有限的性能提升。本文提出了一种新颖的范式M2A,通过模型合并协同数学推理与自主推理。为了避免在联合训练中对表面推理模式的过拟合,M2A直接在参数空间中操作:它识别对自主行为至关重要的特征子空间,并仅沿其零空间合并数学推理任务向量,从而在不干扰自主行为的方向上注入推理能力。与SFT或RL不同,M2A不需要额外的梯度更新,并将合并系数作为控制推理长度的简单调节器。在一个具有挑战性的真实世界编码代理设置中,实验表明我们的方法有效地扩展了自主推理的深度,并带来了显著的性能提升。应用于微调后的Qwen3-8B,M2A将其SWE-Bench验证的解决率从44.0%提高到51.2%,而无需重新训练模型。代码可在https://github.com/laplucky/M2A.git获取。
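The null-space merging idea in the abstract above can be illustrated on flat vectors. This is a minimal sketch under strong assumptions: the agent-critical subspace is given as an orthonormal basis (how M2A identifies it is not described here), parameters are treated as a plain list of floats rather than per-layer weight matrices, and all names are hypothetical. The sketch removes the task vector's components along the protected directions before adding it, so the merged parameters are unchanged along those directions.

```python
def project_out(vector, orthonormal_basis):
    """Remove the components of `vector` along each basis direction.

    Assumes the basis vectors are orthonormal; otherwise the result is
    not an exact null-space projection.
    """
    result = list(vector)
    for direction in orthonormal_basis:
        coeff = sum(r * d for r, d in zip(result, direction))
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result

def merge_in_null_space(agent_params, math_task_vector, agent_basis, coeff):
    """Add `coeff` times the task vector, restricted to the complement of
    the (assumed) agent-critical subspace."""
    safe_update = project_out(math_task_vector, agent_basis)
    return [p + coeff * u for p, u in zip(agent_params, safe_update)]
```

Here `coeff` plays the role of the merging coefficient that the abstract describes as a knob; in this toy, increasing it scales the injected update without ever moving the parameters along the protected directions.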
cs.AI / 142 / 2605.09900
The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
VLMs的戈尔迪之结:图示结推理作为一个困难基准
Abstract
A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.
Chinese Translation
视觉语言模型可以查看一个结图并报告其所见,但却无法对该结构进行操作。KnotBench将来自1,951个主要结原型(交叉数3到19)的858,318幅图像语料库与一个协议配对,其答案与Regina的标准结签名进行核对。其14个任务跨越四个类别:等价判断、移动预测、识别和跨模态基础;图像与符号的分割定位了感知与操作之间的失败。在64K输出令牌预算下,我们对Claude Opus 4.7和GPT-5进行了评分,分别在有思考和无思考的情况下进行评估。在56个(任务,模型)案例中,有15个处于随机基线或以下,14个任务中有8个的最佳得分低于1.5倍随机。在图示到符号的转录中,没有模型生成严格正确的字符串,而宽松的Regina解码在100个项目中恢复结的数量为0到4。思考模式推理使Claude的整体准确率提高了1.65分,GPT-5提高了9.25分,但仅略微缩小了差距。综合来看,这四个类别表明当前的视觉语言模型具备图示的特征,但缺乏在这些特征上模拟移动的工具。
cs.AI / 143 / 2605.09906
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
先分离,再融合:通过模态特定的思维链减轻音视频大语言模型推理中的跨模态干扰
Abstract
Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16\% on general AVQA benchmarks and 11.17\% on a cross-modal hallucination benchmark.
Chinese Translation
音频和视觉为音视频问答提供了互补的证据,但当前的音视频大语言模型可能会受到跨模态干扰的影响:一种模态的信息误导了对另一种模态的解释,从而引发幻觉。我们将这一问题归因于中间推理过程中无法控制的跨模态交互。为此,我们提出了“先分离,再融合”(Separate First, Fuse Later, SFFL)这一音视频推理框架,旨在减少跨模态干扰。SFFL 强制执行模态特定的思维链推理,生成独立的音频和视觉推理轨迹,并整合证据以进行回答。我们通过在不同模态输入设置下的数据管道构建模态偏好标签,并将这些标签作为强化学习中的辅助奖励,以鼓励在回答时对模态线索的实例依赖偏好。我们进一步引入了一种模态特定的推理机制,在分离推理阶段保持模态隔离,同时在证据融合阶段允许完全访问跨模态信息。实验表明,在准确性和鲁棒性方面均有持续改善,在一般音视频问答基准上平均相对提升 5.16\%,在跨模态幻觉基准上提升 11.17\%。
cs.AI / 144 / 2605.09907
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
RADAR:基于冗余感知的多智能体通信结构生成
Abstract
Compared with individual agents, large language model based multi-agent systems have consistently shown strong capabilities across diverse tasks, including code generation, mathematical reasoning, and planning. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, resulting in excessive token utilization on simple tasks while limiting capability on complicated tasks. To mitigate this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduces communication overhead. Motivated by recent progress in conditional discrete graph diffusion models, we formulate communication topology design as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our code and data are available at https://github.com/cszhangzhen/RADAR.
Chinese Translation
与单个智能体相比,基于大型语言模型的多智能体系统在代码生成、数学推理和规划等多种任务中展现出了卓越的能力。尽管它们的表现令人印象深刻,但这些系统的有效性和鲁棒性在很大程度上依赖于其通信拓扑,而通信拓扑通常是固定的或在单一步骤中生成的。这限制了细粒度的结构探索和灵活的组合,导致在简单任务上过度使用令牌,而在复杂任务上能力受限。为了解决这一挑战,我们提出了RADAR,一种冗余感知和查询自适应的生成框架,能够主动减少通信开销。受到条件离散图扩散模型最新进展的启发,我们将通信拓扑设计表述为一个逐步生成的过程,以图的有效大小为指导。在六个基准测试上的综合实验表明,RADAR在不同场景中始终优于最近的基线,达到了更高的准确性、更低的令牌消耗和更强的鲁棒性。我们的代码和数据可在 https://github.com/cszhangzhen/RADAR 获取。
cs.AI / 145 / 2605.09923
EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling
EXPO:通过自适应 KL 调节和高斯课程采样的探索优先策略优化
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We point out two understudied inefficiencies existing in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when the model achieves good results. The Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at moderate accuracy around 0.5, focusing training on the model's learning frontier. We conduct extensive experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base over six mathematical reasoning benchmarks. The results show EXPO steadily surpasses vanilla GRPO. It obtains an absolute gain of 13.34 on AIME 2025 pass@32, rising from 63.33 percent to 76.67 percent, and achieves an average pass@32 improvement of 2.66 on the 8B model. The much larger performance gains on pass@32 compared with pass@1 demonstrate that EXPO effectively enlarges the model's exploration boundary under a fixed inference cost budget.
Chinese Translation
可验证奖励的强化学习(RLVR)已成为大规模语言模型(LLM)数学推理的标准范式,其中群体相对策略优化(GRPO)作为主流算法。我们指出 GRPO 中存在的两个未被充分研究的低效之处。首先,固定的 KL 惩罚系数在模型需要显著偏离参考策略的阶段过于限制了策略探索。其次,训练问题的均匀采样忽视了适度难度的问题为优化提供了最具信息量的梯度信号。我们提出了探索优先策略优化(EXPO),配备两个轻量级插件模块。准确性条件的 KL 缩放(AKL)通过批次平均准确性的平滑非线性函数动态调整 KL 正则化强度,在模型表现不佳时放宽惩罚,而在模型取得良好结果时加强惩罚。高斯课程采样(GCS)根据围绕 0.5 的适度准确性中心的高斯分布为问题分配采样权重,专注于模型的学习前沿。我们在 DeepSeek-R1-Distill-Qwen-1.5B 和 Qwen3-8B-Base 上进行了广泛的实验,涵盖六个数学推理基准。结果表明,EXPO 稳定地超越了普通的 GRPO。在 AIME 2025 pass@32 上获得了绝对增益 13.34,从 63.33% 上升到 76.67%,并在 8B 模型上实现了平均 pass@32 改进 2.66。与 pass@1 相比,pass@32 的性能提升更大,证明了 EXPO 在固定推理成本预算下有效扩大了模型的探索边界。
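A minimal sketch of the two EXPO modules described above, for illustration only. The abstract does not state the exact functional forms, so the sigmoid used for AKL, the function names, and every parameter (`beta_min`, `beta_max`, `midpoint`, `sharpness`, `sigma`) are assumptions rather than the authors' implementation.

```python
import math


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    midpoint=0.5, sharpness=10.0):
    """Hypothetical accuracy-conditioned KL coefficient.

    A smooth (sigmoid) function of batch accuracy: low accuracy gives a
    coefficient near beta_min (penalty relaxed), high accuracy gives a
    coefficient near beta_max (penalty strengthened).
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_weights(accuracies, center=0.5, sigma=0.2):
    """Hypothetical Gaussian curriculum weights over per-question accuracy.

    Questions whose accuracy is close to `center` get the largest weight.
    The weights are normalized so that they sum to 1.
    """
    raw = [math.exp(-((a - center) ** 2) / (2.0 * sigma ** 2))
           for a in accuracies]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    print(akl_coefficient(0.1), akl_coefficient(0.9))
    print(gcs_weights([0.0, 0.25, 0.5, 0.75, 1.0]))
```

Under these assumptions, a question the model always solves or never solves receives the smallest sampling weight, which matches the abstract's emphasis on moderately difficult problems.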
cs.AI / 146 / 2605.09942
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE:通过强化学习驱动的加权图演化来利用代理记忆
Abstract
Memory retrieval in agentic large language model (LLM) systems is often treated as a static lookup problem, relying on flat vector search or fixed binary relational graphs. However, fixed graph structures cannot capture the varying strength, confidence, and query-dependent relevance of relationships between events. In this paper, we propose HAGE, a weighted multi-relational memory framework that reconceptualizes retrieval as sequential, query-conditioned traversal over a unified relational memory graph. Memory is organized as relation-specific graph views over shared memory nodes, where each edge is associated with a trainable relation feature vector encoding multiple relational signals. Given a query, an LLM-based classifier identifies the relational intent, and a routing network dynamically modulates the corresponding dimensions of the edge embedding. Traversal scores are computed via a learned combination of semantic similarity and these query-conditioned edge representations. This allows memory traversal to prioritize high-utility relational paths while softly suppressing noisy or weakly relevant connections. Beyond adaptive traversal, HAGE further introduces a reinforcement learning-based training framework that jointly optimizes routing behavior and edge representations using downstream tasks. Finally, empirical results demonstrate improved long-horizon reasoning accuracy and a favorable accuracy-efficiency trade-off compared to state-of-the-art agentic memory systems. Our code is available at https://github.com/FredJiang0324/HAGE_MVPReview.
Chinese Translation
在代理大型语言模型(LLM)系统中,记忆检索通常被视为一个静态查找问题,依赖于平面向量搜索或固定的二元关系图。然而,固定的图结构无法捕捉事件之间关系的变化强度、置信度和查询相关性。在本文中,我们提出了HAGE,一个加权多关系记忆框架,将检索重新概念化为在统一关系记忆图上的顺序、查询条件遍历。记忆被组织为在共享记忆节点上的关系特定图视图,其中每条边与一个可训练的关系特征向量相关联,该向量编码多个关系信号。给定一个查询,基于LLM的分类器识别关系意图,而路由网络动态调节边嵌入的相应维度。遍历分数通过语义相似性和这些查询条件边表示的学习组合来计算。这使得记忆遍历能够优先考虑高效用的关系路径,同时柔和地抑制噪声或弱相关的连接。除了自适应遍历,HAGE还引入了一种基于强化学习的训练框架,利用下游任务共同优化路由行为和边表示。最后,实证结果表明,与最先进的代理记忆系统相比,HAGE在长时间推理准确性上有所提升,并且在准确性与效率之间达成了良好的权衡。我们的代码可在 https://github.com/FredJiang0324/HAGE_MVPReview 获取。
cs.AI / 147 / 2605.09948
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA:在视觉-语言-动作模型中学习循环精炼的充分性
Abstract
Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
Chinese Translation
当前的视觉-语言-动作(VLA)模型通常将视觉-语言骨干的最深层表示视为对动作预测的普遍最优。然而,机器人操控由许多频繁的闭环空间调整组成,过度抽象可能会浪费计算资源,并削弱对于精确控制至关重要的低级几何线索。现有的早期退出策略试图通过在预定义层停止或应用诸如动作一致性等启发式规则来减少计算,但它们并没有直接回答何时表示实际上足够用于动作。在本文中,我们提出了LoopVLA,一种循环VLA架构,它共同学习表示精炼、动作预测和充分性估计。LoopVLA迭代地应用共享的Transformer模块来精炼多模态标记,并在每次迭代中产生候选动作和估计进一步精炼是否必要的充分性评分。通过在迭代中共享参数,LoopVLA将精炼与绝对层索引解耦,并将充分性估计基于不断演变的表示。由于充分性没有直接监督,我们引入了一种自监督的分布对齐目标,其中中间置信评分被训练以匹配各个精炼步骤之间的相对动作质量,从而将充分性学习与策略优化信号联系起来。在LIBERO、LIBERO-Plus和VLA-Arena上的实验表明,LoopVLA推动了VLA策略的效率-性能边界,减少了45%的参数,并提高了推理吞吐量,最高可达1.7倍,同时在任务成功率上与强基线相匹配或超越。
cs.AI / 148 / 2605.09964
Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach
蛋白质-蛋白质相互作用预测的交互先验学习:一种模型无关的方法
Abstract
Protein-protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", where multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which can generate a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs into a graph-level classification task over the generated prompt graph. This lightweight module seamlessly integrates with PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI achieves superior performance enhancements over advanced competitors.
Chinese Translation
蛋白质-蛋白质相互作用(PPIs)对细胞功能和疾病机制至关重要。目前基于学习的PPI预测器主要集中在学习强大的蛋白质表示,但忽视了设计专门的分类头。它们主要依赖于诸如连接或点积等通用聚合方法,这些方法缺乏生物学洞察。受到生物学“L3规则”的启发,即一对蛋白质之间的多条长度为3的路径指示它们的相互作用可能性,我们的研究通过设计一个生物学信息驱动的PPI分类器来填补这一空白。本文提供了实证证据,表明流行的PPI数据集强烈支持L3规则。我们提出了一种L3路径正则化的图提示学习方法,称为L3-PPI,该方法可以基于蛋白质表示生成具有虚拟L3路径的提示图,并控制路径的数量。L3-PPI将蛋白质嵌入对的分类重新表述为在生成的提示图上的图级分类任务。该轻量级模块作为即插即用组件与PPI预测器无缝集成,注入互补的交互先验以增强性能。大量实验表明,L3-PPI在先进竞争者中实现了卓越的性能提升。
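To make the "L3 rule" mentioned above concrete, the sketch below counts length-3 connections between protein pairs in a toy interaction graph. It is not the L3-PPI method (which generates a prompt graph with virtual L3 paths); it only illustrates the raw signal. Note the stated simplification: cubing the adjacency matrix counts length-3 walks, which may revisit nodes, not strictly simple paths. The function names and the toy graph are assumptions.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def l3_walk_counts(adj):
    """Return A^3, whose (i, j) entry is the number of length-3 walks i -> j."""
    return matmul(matmul(adj, adj), adj)


# Toy undirected chain of four proteins: 0 - 1 - 2 - 3
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    counts = l3_walk_counts(ADJ)
    # Proteins 0 and 3 are not directly linked but share one length-3 route.
    print(counts[0][3])
```

A pair with many such routes would score as a likely interaction under the L3 rule; a practical scorer would also normalize by node degree, which is omitted here.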
cs.AI / 149 / 2605.09985
Prospective Compression in Human Abstraction Learning
人类抽象学习中的前瞻性压缩
Abstract
A core challenge in program synthesis is online library learning: the incremental acquisition of reusable abstractions under uncertainty about future task demands. Existing algorithms treat library learning as retrospective compression over a static task distribution, where the learned library is determined by the corpus of past tasks. However, real-world learning domains are often non-stationary, with tasks arising from a generative process that evolves over time. We propose and test the hypothesis that in non-stationary domains human library learning selects abstractions prospectively: targeting compression of future tasks. We study this question using the Pattern Builder Task, a visual program synthesis paradigm in which participants construct increasingly complex geometric patterns from a small set of primitives, transformations, and custom helpers that carry forward across trials. Using this task, we conduct two experiments with complementary latent curricula, designed to dissociate between behaviors consistent with prospective compression, and alternative library learning accounts. Using six computational models spanning online library learning strategies, we show that human abstraction behavior reflects sensitivity to latent, non-stationary structure in the task-generating process. This behavior is consistent with prospective compression, and cannot be captured by existing retrospective compression-based algorithms, or inductive biases modeled by LLM-based program synthesis.
Chinese Translation
程序合成中的一个核心挑战是在线库学习:在对未来任务需求的不确定性下,逐步获取可重用的抽象。现有算法将库学习视为对静态任务分布的回顾性压缩,其中学习到的库由过去任务的语料库决定。然而,现实世界的学习领域往往是非平稳的,任务源自一个随时间演变的生成过程。我们提出并测试一个假设,即在人类库学习中,在非平稳领域中,抽象的选择是前瞻性的:目标是压缩未来任务。我们使用模式构建任务(Pattern Builder Task)来研究这个问题,这是一种视觉程序合成范式,参与者从一小组原始元素、变换和可在试验中延续的自定义助手中构建越来越复杂的几何图案。利用这个任务,我们进行了两项实验,采用互补的潜在课程,旨在区分与前瞻性压缩一致的行为和其他库学习解释。通过六种涵盖在线库学习策略的计算模型,我们展示了人类抽象行为反映了对任务生成过程中的潜在非平稳结构的敏感性。这种行为与前瞻性压缩一致,无法被现有的基于回顾性压缩的算法或由基于LLM的程序合成建模的归纳偏差所捕捉。
cs.AI / 150 / 2605.09991
Optimizer-Induced Mode Connectivity: From AdamW to Muon
优化器引发的模式连通性:从 AdamW 到 Muon
Abstract
Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.
Chinese Translation
模式连通性已被广泛研究,但优化器的作用仍未得到充分探索。我们通过优化器引发的隐式正则化重新审视这一问题,探讨在特定优化器约束下,连通性如何表现。对于两层 ReLU 网络,我们证明来自单一优化器(如 AdamW、Muon 或 Lion-$\mathcal{K}$ 家族中的其他优化器)的解在足够大的宽度下形成一个连通集,这一结果并未在之前的研究中得到暗示。接着,我们描述优化器引发的区域如何相互作用:在大宽度下,两种不同的区域可能是不相交的或重叠的,具体取决于正则化,而在我们的小宽度示例中,AdamW 和 Muon 收敛到被可证明的损失障碍分隔的非连通零损失组件。在 GPT-2 预训练的实证研究中,我们观察到同一优化器路径保持每个模型的谱,而跨优化器路径则经历平滑的过渡。我们的结果揭示了超越经典模式连通性文献的依赖于优化器的结构。
cs.AI / 151 / 2605.10035
From Single-Step Edit Response to Multi-Step Molecular Optimization
从单步编辑响应到多步分子优化
Abstract
Conditional molecular optimization aims to edit a molecule to realize a specified property shift. In practice, structurally similar molecule data is scarce, while decisions are inherently action-level: at each step, the system must select one local structural edit from a candidate set that is strictly filtered by chemical feasibility rules. This level mismatch between supervision and decision makes oracle-in-the-loop search unstable in molecular optimization. Regressing on property differences between molecule pairs improves data efficiency, but it entangles transformation effects with global context and gives limited guidance for selecting the next feasible edit, so such methods often fall back on oracle-in-the-loop search. For this reason, we propose a response-oriented discrete edit optimization approach comprising two tightly coupled components: a single-step molecular edit response predictor (SMER) and a multi-step planner that composes local predictions into optimization trajectories via guided tree search (SMER-Opt). The approach learns a directional evaluation model over edit actions to support constraint-aware planning. It mines weakly related molecule pairs and decomposes their structural differences into minimal edit units, turning endpoint property annotations into process-level supervision and yielding reusable, transferable action primitives. A directional edit evaluator then scores feasible candidate edits by their likelihood of moving the molecule toward the desired property change, substantially reducing dependence on external evaluator queries at decision time. Code is available at https://anonymous.4open.science/r/SMER.
Chinese Translation
条件分子优化旨在编辑分子以实现特定的属性变化。在实际操作中,结构相似的分子数据稀缺,而决策本质上是行动级别的:在每一步,系统必须从严格按照化学可行性规则筛选的候选集中选择一个局部结构编辑。这种监督与决策之间的层次不匹配使得在分子优化中采用“oracle-in-the-loop”搜索变得不稳定。对分子对之间属性差异的回归提高了数据效率,但依赖于“oracle-in-the-loop”搜索,将转化效果与全局上下文纠缠在一起,并为选择下一个可行的编辑提供有限的指导,常常不得不依赖“oracle-in-the-loop”搜索。因此,我们提出了一种面向响应的离散编辑优化方法,包括两个紧密耦合的组件:单步分子编辑响应预测器(SMER)和多步规划器,通过引导树搜索将局部预测组合成优化轨迹(SMER-Opt)。该方法学习了一个针对编辑动作的方向性评估模型,以支持约束感知规划。它挖掘弱相关的分子对,并将它们的结构差异分解为最小编辑单元,将终点属性标注转化为过程级监督,并产生可重用、可转移的行动原语。然后,方向性编辑评估器根据候选编辑将分子朝向期望属性变化的可能性对其进行评分,从而在决策时显著减少对外部评估者查询的依赖。代码可在 https://anonymous.4open.science/r/SMER 获取。
cs.AI / 152 / 2605.10038
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
TimeClaw:一个具有探索性执行学习的时间序列人工智能代理
Abstract
Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.
Chinese Translation
时间序列分析是金融和天气等领域中预测、监测和决策的基础,其中解决任务通常需要数值准确性和上下文推理的结合。近期的进展已经从专门的神经预测器转向基于大型语言模型(LLMs)和基础模型的方法,这些方法能够对时间序列输入进行推理并使用外部工具。然而,大多数此类系统仍然以执行为中心:它们专注于解决当前实例,但从探索性执行中学习甚少。这在可验证的数值环境中尤其有限,在这些环境中,多个候选执行和工具使用过程可能都是任务有效的,但在定量质量上却存在显著差异,而早期的成功可能会触发工具优先崩溃,从而抑制进一步的探索。为了解决这一限制,我们提出了TimeClaw,一个探索性执行学习框架,通过四个阶段的循环:探索、比较、提炼和再注入,将探索性执行转化为可重用的层次化提炼经验。TimeClaw结合了度量监督的探索性执行学习、任务感知的工具丢弃,以及用于推理时再注入的层次化提炼经验,同时保持基础模型不变,避免在线测试时的适应。在与17个涵盖金融和天气预测及推理任务的MTBench对齐的评估中,TimeClaw在基线之上提供了一致的增益。这些结果表明,对于科学系统,瓶颈不仅在于执行时间能力,还在于如何比较、提炼和重用探索性经验。
cs.AI / 153 / 2605.10057
Route by State, Recover from Trace: STAR with Failure-Aware Markov Routing for Multi-Agent Spatiotemporal Reasoning
按状态路由,从轨迹恢复:具有故障感知的马尔可夫路由的多智能体时空推理
Abstract
Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool-augmented and multi-agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio-Temporal Agent Router), a failure-aware routing framework that externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At the center of STAR is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool-query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool-grounded extract-compute-deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines with the clearest gains on queries whose execution deviates from the nominal routing path. Router-specific ablations and recovery analyses further show that typed failure-aware routing, rather than specialist composition alone, is a key factor for these improvements.
Chinese Translation
组合时空推理通常需要系统调用多个异构专家,例如几何、时间、拓扑和轨迹智能体。一个核心问题是,当执行并非简单成功或失败,而是以不同的方式失败时,系统应如何在专家之间进行路由。现有的工具增强和多智能体大型语言模型(LLM)系统通常将这一路由决策隐含于语言生成中,使得恢复过程临时、不易解释且难以优化。本文提出了STAR(时空智能体路由器),一种故障感知的路由框架,将智能体间控制外部化为基于当前智能体、任务类型和类型化执行状态的状态条件转移策略。STAR的核心是一个智能体路由矩阵,该矩阵结合了专家指定的名义路由和从执行轨迹中学习的恢复转移。由于该矩阵基于不同的故障状态进行条件化,路由器能够对格式错误的输出、缺失的依赖关系和工具-查询不匹配做出不同的响应,而不是将它们归结为通用的重试信号。专家通过一个基于工具的提取-计算-存储协议执行,并将中间结果写入共享黑板以供下游融合。结果证明,在训练过程中保留不成功的轨迹扩大了路由策略在错误状态上的支持,使得成功训练无法表示的恢复转移成为可能。在三个时空基准和八个基础大型语言模型上,STAR在多个基线之上有所提升,尤其在执行偏离名义路由路径的查询上获得了明显的增益。路由器特定的消融实验和恢复分析进一步表明,类型化的故障感知路由,而非仅仅是专家组合,是这些改进的关键因素。
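The routing matrix described in the STAR abstract can be pictured as a lookup keyed by (current agent, task type, typed execution status). The sketch below is one hypothetical rendering: the agent names, status labels, the trace format, and the majority-vote rule for learning recovery transitions are all assumptions, not the paper's learning procedure.

```python
from collections import Counter, defaultdict

# Expert-specified nominal routes: (agent, task_type, status) -> next agent.
# All names here are illustrative placeholders.
NOMINAL = {
    ("geometric", "trajectory_query", "ok"): "trajectory",
    ("trajectory", "trajectory_query", "ok"): "DONE",
}


def learn_recovery(traces):
    """Learn recovery transitions from traces of failure steps.

    Each trace item is (state, next_agent, recovered), where `state` is a
    failure state and `recovered` says whether routing to `next_agent`
    fixed the problem. For each failure state, keep the next agent that
    recovered most often.
    """
    counts = defaultdict(Counter)
    for state, next_agent, recovered in traces:
        if state[2] != "ok" and recovered:
            counts[state][next_agent] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def route(state, nominal, recovery, default="DONE"):
    """Nominal route if defined, else a learned recovery, else a default."""
    if state in nominal:
        return nominal[state]
    return recovery.get(state, default)


MALFORMED = ("geometric", "trajectory_query", "malformed_output")
MISSING = ("geometric", "trajectory_query", "missing_dependency")
TRACES = [
    (MALFORMED, "geometric", True),   # retrying the same agent worked
    (MALFORMED, "geometric", True),
    (MALFORMED, "temporal", True),
    (MISSING, "temporal", True),      # another specialist supplied the input
    (MISSING, "geometric", False),    # a plain retry did not help
]

if __name__ == "__main__":
    recovery = learn_recovery(TRACES)
    print(route(MALFORMED, NOMINAL, recovery))
    print(route(MISSING, NOMINAL, recovery))
```

Because the key includes the typed status, the two failure states above map to different next agents instead of a single generic retry.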
cs.AI / 154 / 2605.10059
Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust
大型语言模型代理市场中的战略利用:电子商务信任的模拟框架
Abstract
Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.
Chinese Translation
基于代理的建模(ABM)长期以来被用于经济学研究人类行为,而大型语言模型(LLM)代理现在使得新的社会和经济模拟形式成为可能。尽管先前的研究发现LLM代理在金融交易和拍卖市场中存在战略欺骗,但电子商务仍然未得到充分探索,尽管其具有独特的信息不对称:卖家私下观察产品质量,而买家依赖于广告声明和声誉信号。我们引入了TruthMarketTwin,这是一个用于研究LLM代理在电子商务市场中行为的受控模拟框架。该框架是首批在不对称信息共享下建模双边交易的研究之一,其中代理做出战略性列出、购买、评分和追索相关决策,以优化卖家的利润和买家的效用。我们发现,释放到传统市场中的LLM代理会自主利用基于声誉的治理中的弱点,而担保执行则减少了欺骗并重塑了战略推理。我们的结果将LLM代理模拟定位为研究制度治理的自主市场的工具。
cs.AI / 155 / 2605.10064
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
MAGE:基于共演知识图谱的多智能体自我进化
Abstract
Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.
Chinese Translation
自我进化的语言模型智能体必须决定下一步学习的内容以及如何在迭代过程中保留已学知识。现有系统通常将这种跨迭代知识以自然语言反馈、平面情节记忆或隐式强化信号的形式进行存储,但这些方法都无法在推理时有效支持一个冻结的弱骨干网络。本文介绍了MAGE(多智能体图引导进化),一个将自我知识外部化为四个子图的共演知识图谱的框架。其经验子图存储了教师编写的失败修正和学习者自身过去的正确推理轨迹,这些信息被检索作为任务条件指导,以供冻结执行模型使用。在进化过程中,图、任务级搜索强盗和技能级路由强盗都从同一奖励流中更新,而学习者的骨干网络保持不变。我们进一步提供了结构分析,展示了仅附加内存增长、有限课程覆盖和任务过滤检索如何共同支持冻结学习者进化的检索底层的稳定改进。在涵盖数学推理、多跳和开放领域问答、时空分析、金融数值推理、医学多项选择、开放世界生存游戏和网络导航的九个基准测试中,MAGE在基于提示的冻结骨干基准上表现出色。消融实验表明,自我收获的成功轨迹和教师编写的修正是互补的,其中成功记忆在重推理模板任务中贡献最大,而修正记忆则支持更复杂的组合和交互设置。
cs.AI / 156 / 2605.10075
Active Testing of Large Language Models via Approximate Neyman Allocation
通过近似奈曼分配对大型语言模型进行主动测试
Abstract
Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28\% MSE reduction over Uniform Sampling and an average of 22.9\% budget savings.
Chinese Translation
大型语言模型(LLMs)需要在预训练到测试时间扩展的过程中进行可靠评估,使得评估成为一种重复性而非一次性的成本。随着模型规模的增长和目标任务对专家标注者的需求日益增加,每次评估所需的计算和标注成本迅速上升。主动测试旨在通过从评估池中选取一个小但信息丰富的子集来缓解这一瓶颈,从而近似评估结果。然而,现有方法主要针对分类任务,在生成任务上表现不佳。我们提出了一种针对生成任务的新型主动测试算法。我们的方法利用代理模型的语义熵对评估池进行分层,然后基于从这些代理中提取的信号进行近似奈曼分配。在多个语言和多模态基准测试以及一系列代理-目标模型对上,我们的方法显著优于基线,并与Oracle-Neyman紧密跟踪,实现了相较于均匀采样高达28%的均方误差(MSE)降低和平均22.9%的预算节省。
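For readers unfamiliar with Neyman allocation, the sketch below shows the classical rule the abstract refers to: a labeling budget is split across strata in proportion to stratum size times within-stratum standard deviation. How the paper builds strata from semantic entropy and estimates the deviations from surrogate models is not specified in the abstract, so the inputs here are placeholder numbers, and the function name and largest-remainder rounding step are assumptions.

```python
def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Split `budget` samples across strata with n_h proportional to N_h * sigma_h.

    stratum_stds would come from surrogate-model signals in an active-testing
    setting; here they are plain numbers. Rounding uses the largest-remainder
    rule so the allocation sums exactly to `budget`. Capping n_h at N_h is
    omitted for brevity.
    """
    scores = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(scores)
    if total == 0:
        # All strata look noise-free: fall back to proportional allocation.
        scores = list(stratum_sizes)
        total = sum(scores)
    ideal = [budget * sc / total for sc in scores]
    alloc = [int(x) for x in ideal]  # floor, since all values are non-negative
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda h: ideal[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


if __name__ == "__main__":
    # Two equally sized strata; the second is three times as variable.
    print(neyman_allocation([100, 100], [1.0, 3.0], 40))
```

The more uncertain stratum receives proportionally more of the labeling budget, which is what reduces the variance of the estimated benchmark score relative to uniform sampling.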
cs.AI / 157 / 2605.10107
Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring
Arcane:通过语义聚类和MCTS引导的规则探索进行断言减少的框架
Abstract
Assertion-based Verification (ABV) is essential for ensuring that hardware designs conform to their intended specifications. However, existing automated assertion-generation approaches, such as LLM-based frameworks, often generate large numbers of redundant assertions, which significantly degrade simulation efficiency. To mitigate the simulation overhead caused by redundant assertions, this paper proposes Arcane, an efficient assertion reduction framework. It integrates a two-tier assertion clustering approach for accurate semantic classification of large assertion sets, and employs Monte Carlo Tree Search (MCTS) to explore optimal rule-application sequences for efficient assertion reduction. The experimental results on Assertionbench [20] show that Arcane achieves a reduction of up to 76.2% in the assertion count while fully preserving formal coverage and mutation-detection ability. Further simulation studies demonstrate a 2.6x to 6.1x speedup in simulation time. The proposed framework is released at https://anonymous.4open.science/r/Arcane1-0A6F/.
Chinese Translation
基于断言的验证(ABV)对于确保硬件设计符合其预期规范至关重要。然而,现有的自动化断言生成方法,如基于大语言模型(LLM)的框架,往往会生成大量冗余的断言,从而显著降低仿真效率。为减轻冗余断言带来的仿真开销,本文提出了Arcane,一个高效的断言减少框架。该框架集成了两级断言聚类方法,以实现对大规模断言集的准确语义分类,并采用蒙特卡洛树搜索(MCTS)来探索高效的规则应用序列以实现断言减少。在Assertionbench [20]上的实验结果表明,Arcane在保持形式覆盖和变异检测能力的同时,断言数量减少了最多76.2%。进一步的仿真研究表明,仿真时间加速比为2.6倍至6.1倍。该框架已发布于https://anonymous.4open.science/r/Arcane1-0A6F/。
cs.AI / 158 / 2605.10122
Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver
重新思考约束意识以实现神经路由求解器的高效状态嵌入
Abstract
Heavy-Encoder-Light-Decoder (HELD) neural routing solvers have emerged as a promising paradigm due to their broad applicability across multiple vehicle routing problems (VRPs). However, they typically struggle with VRP variants with complex constraints. To address this limitation, this paper systematically revisits existing neural solvers from the perspective of the generation mechanism for state embeddings (i.e., query vector prior to compatibility calculation) during decoding. We identify that current mechanisms restrict the observation space during attention computation, introducing a key bottleneck to achieving high-quality solutions. Through detailed empirical analysis, we demonstrate the necessity of preserving a global observation space. To overcome the constraint-agnostic drawback inherent to global observation spaces, we propose a simple yet powerful Constraint-Aware Residual Modulation (CARM) module. By adaptively modulating the context embedding with constraint-relevant variables, CARM effectively enhances constraint awareness, enabling the neural solver to fully leverage the global observation space and generate an efficient state embedding. Extensive experimental results across two single-task and five multi-task neural routing solvers confirm that the CARM module consistently boosts baseline performance. Notably, solvers equipped with our CARM achieve substantial improvements in scaling to large-scale instances and in generalizing to unseen VRP variants. These findings provide valuable insights for the architectural design of neural routing solvers.
Chinese Translation
重编码器-轻解码器(Heavy-Encoder-Light-Decoder,HELD)神经路由求解器因其在多种车辆路由问题(Vehicle Routing Problems,VRPs)中的广泛适用性而成为一种有前景的范式。然而,它们通常在处理具有复杂约束的VRP变体时面临困难。为了解决这一局限性,本文从解码过程中状态嵌入的生成机制(即,在兼容性计算之前的查询向量)出发,系统性地重新审视现有的神经求解器。我们发现当前机制在注意力计算过程中限制了观察空间,成为实现高质量解决方案的关键瓶颈。通过详细的实证分析,我们证明了保持全球观察空间的必要性。为了克服全球观察空间固有的无约束缺陷,我们提出了一种简单而强大的约束意识残差调制(Constraint-Aware Residual Modulation,CARM)模块。通过自适应地用与约束相关的变量调节上下文嵌入,CARM有效增强了约束意识,使神经求解器能够充分利用全球观察空间并生成高效的状态嵌入。在两个单任务和五个多任务神经路由求解器上的大量实验结果确认,CARM模块始终提升基线性能。值得注意的是,配备CARM的求解器在扩展到大规模实例和对未见VRP变体的泛化方面取得了显著改善。这些发现为神经路由求解器的架构设计提供了宝贵的见解。
cs.AI / 159 / 2605.10125
Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
探索有用,精确风险:评估学术研究中的人工智能工具
Abstract
Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation, and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches do not adequately capture human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and in shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency and verification efficiency, and of careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.
Chinese Translation
人工智能(AI)工具正被纳入科学研究工作流程中,具有提升文档分析、问答(Q and A)和文献检索等任务效率的潜力。然而,系统输出往往难以验证,生成过程缺乏透明度,并且容易出现错误。需要合适的基准来记录和评估出现的问题。然而,现有的基准评估方法并未充分捕捉以人为中心的标准,如可用性、可解释性和与研究工作流程的整合。为了解决这一差距,本研究提出并应用了一种结合以人为中心和以计算机为中心的指标的基准评估框架,以评估基于AI的问答和文献综述工具在研究中的应用。研究结果表明,问答工具可以提供有价值的概述和通常准确的摘要;然而,它们并不总是可靠地提取精确信息。可解释人工智能(xAI)的准确性特别低,这意味着突出显示的源段落往往未能与生成的答案相对应。这将验证的负担重新转移到了研究者身上。文献综述工具支持探索性检索,但显示出低重复性、对所选来源和数据库的透明度有限,以及源质量不一致,使其不适合系统综述。这些工具组的比较揭示了类似的模式:虽然AI工具可以在研究工作流程的早期阶段和浅层任务中提高效率,但其输出仍需人类验证。研究结果强调了可解释性特征的重要性,以增强透明度、验证效率,并谨慎地将AI工具整合到研究者的工作流程中。此外,以人为中心的评估仍然是确保实际适用性的重要关注点。
cs.AI / 160 / 2605.10141
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
FormalRewardBench:形式化定理证明奖励模型的基准测试
Abstract
Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce \textbf{FormalRewardBench}, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8\%) while specialized theorem provers perform the worst (24.4\%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release \textbf{FormalRewardBench} publicly to encourage more research on developing reward models in formal mathematics.
Chinese Translation
最近的神经定理证明器使用带可验证奖励的强化学习(RLVR),其中证明助手提供二元正确性信号。虽然可验证奖励廉价且可扩展,且没有奖励操控问题,但它们在稀疏信用分配方面存在缺陷:模型在困难问题上无法获得学习信号,而这些问题的部分进展未能获得奖励。这促使了学习奖励模型的出现,这些模型能够评估超越二元验证的证明质量。然而,比较奖励模型具有挑战性,因为这通常需要昂贵的强化学习训练消融实验。为了解决这个问题,我们引入了FormalRewardBench,这是第一个用于评估形式化定理证明中奖励模型的基准测试,基于Lean 4。我们的基准测试由250对偏好组成,其中正确的证明与通过五种专家策划的错误注入策略生成的错误变体配对:强制错误、最小单点变化、冗长的错误证明、自然语言论证和Python代码注入。我们评估了前沿大型语言模型(如Claude Opus 4.5)、判断大型语言模型(如CompassJudger-1-14B)、通用大型语言模型(如Qwen2.5-72B-Instruct)和专门的定理证明模型(如DeepSeek-Prover-V2-7B)。我们的结果显示,前沿大型语言模型的表现最佳(59.8%),而专门的定理证明器表现最差(24.4%),这表明定理证明能力并未转移到证明评估上。我们对各种错误注入机制提供了进一步的见解,突显了大多数注入机制的挑战性。我们公开发布FormalRewardBench,以鼓励更多关于形式化数学中奖励模型开发的研究。
cs.AI / 161 / 2605.10146
Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
恶意知识编辑下知识密集型推理的安全风险基准测试
Abstract
Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.
Chinese Translation
大型语言模型(LLMs)越来越依赖知识编辑来支持知识密集型推理,但这种灵活性也引入了关键的安全风险:对手可以注入恶意或误导性的知识,从而破坏下游推理并导致有害后果。现有的知识编辑基准主要关注编辑效果,缺乏一个统一的框架来系统地评估编辑知识对推理行为的安全影响。为了解决这一问题,我们提出了EditRisk-Bench,这是一个用于系统评估恶意知识编辑下知识密集型推理安全风险的基准。与以往主要强调编辑成功、泛化和局部性的基准不同,EditRisk-Bench关注注入知识如何影响下游推理行为和可靠性。它整合了多种恶意场景,包括错误信息、偏见和安全违规,以及多层次的知识密集型推理任务和具有代表性的编辑策略,在一个统一的评估框架内测量攻击有效性、推理正确性和副作用。对开源和闭源LLMs的广泛实验表明,恶意知识编辑可以可靠地引发不正确或不安全的推理,同时在很大程度上保持一般能力,使得这些风险难以检测。我们进一步识别了几个影响这些风险的关键因素,包括编辑规模、知识特征和推理复杂性。EditRisk-Bench提供了一个可扩展的测试平台,用于理解和减轻LLMs知识编辑中的安全风险。
cs.AI / 162 / 2605.10169
Automated Approach for Solving Infinite-state Polynomial Reachability Games
解决无限状态多项式可达性博弈的自动化方法
Abstract
Reachability games are two-player games played on a graph, where the objective of $\texttt{REACH}$ player is to reach the target set whereas the objective of $\texttt{SAFE}$ player is to stay away from the target set. Reachability games have important applications in artificial intelligence and reactive synthesis, and many of these applications give rise to infinite-state reachability games. In this paper, we study turn-based reachability games on infinite-state graphs defined over valuations of a finite set of real variables. We consider the problem of determining the existence of and computing a winning strategy for $\texttt{REACH}$ player. Our contributions are twofold. First, we propose ranking certificates for reachability games, a sound and complete proof rule for proving that $\texttt{REACH}$ player has a winning strategy from the specified initial state. Second, we consider polynomial reachability games, where transitions and objectives are described by polynomial constraints over real variables, and propose a fully automated algorithm for computing a winning strategy for $\texttt{REACH}$ player together with a formal correctness witness in the form of a ranking certificate. The algorithm is sound, semi-complete, and runs in sub-exponential time. Our experiments demonstrate the ability of our method to solve challenging examples from the literature that were out of the reach of existing methods. Specifically, for the classical Cinderella-Stepmother game, we are able to compute an optimal winning strategy for an arbitrary precision parameter for the first time.
Chinese Translation
可达性博弈是在图上进行的双人博弈,其中$\texttt{REACH}$玩家的目标是到达目标集合,而$\texttt{SAFE}$玩家的目标是远离目标集合。可达性博弈在人工智能和反应合成中具有重要应用,其中许多应用导致了无限状态可达性博弈的产生。本文研究了基于回合的无限状态图上的可达性博弈,这些图是基于有限实变量集合的赋值定义的。我们考虑确定$\texttt{REACH}$玩家的获胜策略的存在性及其计算的问题。我们的贡献有两个方面。首先,我们提出了可达性博弈的排名证书,这是一种健全且完整的证明规则,用于证明$\texttt{REACH}$玩家从指定初始状态具有获胜策略。其次,我们考虑多项式可达性博弈,其中转移和目标由实变量上的多项式约束描述,并提出了一种完全自动化的算法,用于计算$\texttt{REACH}$玩家的获胜策略,并提供形式化的正确性证明,形式为排名证书。该算法是健全的、半完整的,并且在亚指数时间内运行。我们的实验展示了我们的方法解决文献中一些具有挑战性的例子的能力,这些例子超出了现有方法的能力范围。具体而言,对于经典的灰姑娘-继母博弈,我们首次能够为任意精度参数计算出最佳获胜策略。
cs.AI / 163 / 2605.10194
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE:通过令牌路由自我策略对齐提炼重要信息
Abstract
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
Chinese Translation
基于策略的自我蒸馏(self-OPD)通过让策略在特权上下文中自我教学,增强了具有可验证奖励的强化学习(RLVR)。我们发现,当这种指导覆盖整个响应时,所有令牌的KL散度将梯度花费在大多数冗余的位置上,并放大了特权信息泄露,导致熵增加、推理时间缩短以及在长时间跨度的数学训练中出现分布外退化。我们提出了关键推理的令牌路由对齐(TRACE),该方法仅在标注者标记的关键区间上进行蒸馏:对正确回滚的关键区间进行正向KL,对局部错误区间进行可选的反向KL,以及对所有剩余令牌进行GRPO,在短暂的预热后逐渐消除KL通道。我们的分析通过两个效应解释了TRACE:正向KL为学生分配不足的教师支持令牌提供了非消失的提升,而区间屏蔽和衰减保持了累积特权梯度暴露的有限性。在四个保留的数学基准测试和GPQA-Diamond上,TRACE平均比GRPO提高了2.76个百分点,并在GPQA-Diamond上保持了Qwen3-8B基础模型的分布外得分,而GRPO和所有令牌的自我OPD基线则出现了退化。在在线自我标注下,增益仍然存在(+1.90个百分点,约占强API增益的69%),减少了TRACE仅仅引入外部标注者能力的担忧。在不同规模中,最佳路由动作依赖于基础模型:在Qwen3-8B上,它是关键区间的正向KL,而在Qwen3-1.7B上则转变为错误区间的反向KL。
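The token routing described in the TRACE abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the half-open `(start, end)` span format, the linear decay schedule, and the reading of "forward KL" as KL(teacher || student) are all assumptions made for illustration. Distributions are assumed to be strictly positive, as softmax outputs would be.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def route_tokens(num_tokens, key_spans, error_spans):
    """Label each token position with the loss channel it is routed to.

    Spans are hypothetical half-open (start, end) index pairs.
    """
    labels = ["grpo"] * num_tokens  # default: plain GRPO on all tokens
    for start, end in key_spans:
        for i in range(start, end):
            labels[i] = "forward_kl"
    for start, end in error_spans:
        for i in range(start, end):
            labels[i] = "reverse_kl"
    return labels


def kl_weight(step, warmup_steps, decay_steps):
    """Assumed schedule: full weight during warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return 1.0
    return max(0.0, 1.0 - (step - warmup_steps) / decay_steps)


def distill_term(label, teacher, student, weight):
    """Per-token distillation term for the routed channel."""
    if label == "forward_kl":
        return weight * kl_divergence(teacher, student)
    if label == "reverse_kl":
        return weight * kl_divergence(student, teacher)
    return 0.0  # GRPO-routed tokens carry no distillation term in this sketch
```

Because the weight reaches zero after the decay window, the total distillation signal any token can receive is bounded, which is one way to read the abstract's claim that cumulative privileged-gradient exposure stays finite.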
cs.AI / 164 / 2605.10223
Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution
超越自主性:一个动态分层的AgentRunner框架用于可治理和弹性的企业人工智能执行
Abstract
Current large language model agent frameworks prioritize autonomy but lack the governability mechanisms required for enterprise deployment. High-risk write operations proceed without independent review, complex tasks lack acceptance verification, and computational resources are allocated uniformly regardless of risk level. We propose the Dynamic Tiered AgentRunner, a controlled execution protocol distilled from a production-grade multi-tenant SaaS platform. The framework introduces three core mechanisms: (1) Risk-Adaptive Tiering that dynamically allocates computational resources and review intensity based on task risk profiles, achieving Pareto-optimal trade-offs between safety and efficiency; (2) Separation of Powers architecture where proposal, review, execution, and verification are performed by independent agents with physically isolated boundaries; and (3) Resilience-by-Design through a Verifier-Recovery closed loop that treats failure as a first-class system state. We formalize the tier selectio
Chinese Translation
当前的大型语言模型代理框架优先考虑自主性,但缺乏企业部署所需的治理机制。高风险的写操作在没有独立审查的情况下进行,复杂任务缺乏接受验证,计算资源的分配不考虑风险水平而均匀分配。我们提出了动态分层AgentRunner,这是一种从生产级多租户SaaS平台提炼出的受控执行协议。该框架引入了三个核心机制:(1)风险适应性分层,根据任务风险特征动态分配计算资源和审查强度,实现安全性与效率之间的帕累托最优权衡;(2)权力分离架构,其中提案、审查、执行和验证由具有物理隔离边界的独立代理执行;(3)通过验证者-恢复闭环实现设计弹性,将故障视为一种一流的系统状态。我们对分层选择进行了形式化。
cs.AI / 165 / 2605.10224
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
基于假设的深度研究与大型语言模型:自动知识发现的结构化方法论
Abstract
Current AI-powered research systems adopt a direct search-then-summarize paradigm that treats hypotheses as end products of scientific discovery. We argue this leaves a critical gap: hypotheses can serve a far more powerful role as organizational instruments that structure the research process itself. We propose the Hypothesis-Driven Deep Research (HDRI) methodology - the first framework using hypotheses to organize general-purpose deep research across arbitrary domains, rather than merely validating claims within specific domains. This transforms research from reactive information retrieval into proactive, verifiable, and iterative knowledge discovery. HDRI is formalized with six core principles and an eight-stage pipeline. A central innovation is the gap-driven iterative research mechanism - a closed-loop quality assurance system that automatically identifies informational and logical gaps, triggering targeted supplementary investigation. We further introduce a fact reasoning framework with traceable reasoning chains and quantified confidence propagation, a subject locking mechanism to prevent entity confusion, and a multi-dimensional quality assessment scheme. The methodology is realized in the INFOMINER system. Experiments demonstrate improvements of 22.4% in fact density, 90% subject matching accuracy, 0.92 multi-source verification confidence, and 14% completeness gain from gap-driven supplementation. Five case studies validate its practical applicability, achieving an average quality rating of 4.46/5.0.
Chinese Translation
当前的人工智能驱动研究系统采用直接搜索后总结的范式,将假设视为科学发现的最终产品。我们认为这留下了一个关键空白:假设可以作为组织工具,在研究过程中发挥更强大的作用。我们提出了基于假设的深度研究(Hypothesis-Driven Deep Research, HDRI)方法论——这是第一个利用假设来组织跨任意领域的通用深度研究的框架,而不仅仅是在特定领域内验证主张。这一方法将研究从被动的信息检索转变为主动、可验证和迭代的知识发现。HDRI由六个核心原则和一个八阶段流程构成。一个核心创新是基于差距的迭代研究机制——一个闭环质量保证系统,能够自动识别信息和逻辑上的差距,触发针对性的补充调查。我们进一步引入了一个事实推理框架,具有可追溯的推理链和量化的置信传播,防止实体混淆的主题锁定机制,以及多维度的质量评估方案。该方法论在INFOMINER系统中得以实现。实验表明,事实密度提高了22.4%,主题匹配准确率达到90%,多源验证置信度为0.92,基于差距的补充带来了14%的完整性提升。五个案例研究验证了其实际适用性,平均质量评分为4.46/5.0。
cs.AI / 166 / 2605.10246
SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
SciIntegrity-Bench:评估人工智能科学家系统学术诚信的基准测试
Abstract
AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY-BENCH at https://github.com/liuxingtong/Sci-Integrity-Bench.
Chinese Translation
人工智能科学家系统越来越多地被用于自主研究,但其学术诚信从未得到系统评估。我们介绍了SCIINTEGRITY-BENCH,这是第一个围绕困境评估范式设计的基准:其33个场景跨越11个陷阱类别,每个场景的构建使得诚实承认失败是唯一正确的回应,而任务完成则需要不当行为。在涵盖7种最先进的大型语言模型(LLM)的231次评估运行中,整体诚信问题率达到34.2%,且没有任何模型实现零失败。最引人注目的是,在缺失数据的场景中,所有七个模型生成合成数据,而不是承认不可行,仅在是否披露替代品上有所不同。进一步的提示消融研究分离了两个驱动因素:去除明确的完成压力使得未披露的伪造率从20.6%锐减至3.2%,而基础合成率保持不变,揭示了一种独立于提示级指令的内在完成偏见。这些发现表明,缺乏诚实拒绝作为一种训练倾向是观察到的失败的主要驱动因素。我们在 https://github.com/liuxingtong/Sci-Integrity-Bench 发布了SCIINTEGRITY-BENCH。
cs.AI / 167 / 2605.10257
Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem
迈向自主铁路运营:一种半层次深度强化学习方法解决车辆重新调度问题
Abstract
Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem's exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi-agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi-hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland-RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.
Chinese Translation
管理铁路交通管理中的干扰是一个重大挑战。日益增加的交通密度和基础设施限制增加了复杂性,使得车辆路线与调度问题(Vehicle Routing and Scheduling Problem, VRSP)难以可靠且实时地解决。尽管运筹学(Operational Research, OR)方法被广泛使用,但由于问题的指数组合复杂性,大多数调度仍依赖于人类专业知识。强化学习(Reinforcement Learning, RL)因其在多智能体协调中的潜力而受到关注,但现有的RL方法往往表现不如OR方法,并且在密集铁路网络中难以扩展。本文从机器学习的角度解决了这一空白,提出了一种针对运营铁路约束的半层次RL模型。该方法通过专门的动作和观察空间将调度与路线分开,使得策略能够专注于不同的决策范围,并解决稀有调度决策与频繁路线更新之间的不平衡。该方法在Flatland-RL模拟器上进行了评估,涵盖五个难度级别和50个随机种子,涉及7到80列火车。结果显示,与启发式基线和单一RL相比,协调性、资源利用率和鲁棒性显著提高,几乎使到达目的地的列车数量翻倍,同时将死锁率保持在5%以下,并在重度拥堵情况下自适应地排序、延迟或取消列车。
cs.AI / 168 / 2605.10261
E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability
E-TCAV:形式化倒数代理以实现高效的基于概念的可解释性
Abstract
TCAV (Testing with Concept Activation Vectors) is an interpretability method that assesses the alignment between the internal representations of a trained neural network and human-understandable, high-level concepts. Though effective, TCAV suffers from significant computational overhead, inter-layer disagreement of TCAV scores, and statistical instability. This work takes a step toward addressing these challenges by introducing E-TCAV, a framework for efficient approximation of TCAV scores, which is based on extensive investigation into three key aspects of the TCAV methodology: 1) the effect of latent classifiers on the stability of TCAV scores, 2) the inter-layer agreement of TCAV scores, and 3) the use of the penultimate layer as a fast proxy for earlier layers for TCAV computation. To ensure a solid foundation for E-TCAV, we conduct extensive evaluations across four different architectures and five datasets, encompassing problems from both computer vision and natural language domains. Our results show that the layers in the final block of the neural network strongly agree with the penultimate layer in terms of the TCAV scores, and the commonly observed variance of the TCAV scores can be attributed to the choice of the latent classifier. Leveraging this inter-layer agreement and the degeneracy of directional sensitivities at the penultimate layer, E-TCAV guarantees linearly scaling speed-ups with respect to the network's size and the number of evaluation samples, marking a step towards efficient model debugging and real-time concept-guided training.
Chinese Translation
TCAV(基于概念激活向量的测试)是一种可解释性方法,用于评估训练神经网络的内部表示与人类可理解的高层次概念之间的对齐程度。尽管TCAV有效,但其面临显著的计算开销、TCAV分数的层间不一致性以及统计不稳定性等问题。本研究通过引入E-TCAV,朝着解决这些挑战迈出了一步,E-TCAV是一个高效近似TCAV分数的框架,基于对TCAV方法论三个关键方面的广泛研究:1)潜在分类器对TCAV分数稳定性的影响,2)TCAV分数的层间一致性,以及3)将倒数层作为TCAV计算中早期层的快速代理。为了确保E-TCAV的坚实基础,我们在四种不同架构和五个数据集上进行了广泛评估,涵盖了计算机视觉和自然语言领域的问题。我们的结果表明,神经网络最后一个模块中的层在TCAV分数方面与倒数层高度一致,而TCAV分数的常见方差可以归因于潜在分类器的选择。利用这种层间一致性以及倒数层方向敏感性的退化,E-TCAV保证了与网络规模和评估样本数量成线性比例的加速,为高效的模型调试和实时概念引导训练迈出了重要一步。
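For readers unfamiliar with the quantity being approximated, the standard TCAV score is the fraction of class examples whose class logit increases when the layer activation moves along the concept activation vector (CAV). The sketch below computes that fraction from precomputed gradients. The second function shows one plausible reading of the "degeneracy" at the penultimate layer, under the assumption that a single linear map produces the logits from that layer: the gradient is then the same weight row for every sample, so no per-sample backward pass is needed. Whether E-TCAV is built exactly this way is not stated in the abstract, and all names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def tcav_score(gradients, cav):
    """Fraction of samples with a positive directional derivative along the CAV.

    gradients: one vector per sample, the gradient of the class logit
               with respect to the chosen layer's activations.
    cav:       the concept activation vector for that layer.
    """
    positives = sum(1 for g in gradients if dot(g, cav) > 0)
    return positives / len(gradients)


def penultimate_tcav_score(class_weights, cav):
    """Assumed special case: logits = W @ h + b at the penultimate layer.

    The gradient of logit k with respect to h is the weight row W[k] for
    every input, so the score collapses to 0 or 1 from a single dot product.
    """
    return 1.0 if dot(class_weights, cav) > 0 else 0.0
```

In this special case the cost no longer depends on the number of evaluation samples, which is consistent with the speed-up the abstract describes.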
cs.AI / 169 / 2605.10267
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
IndustryBench:探讨大型语言模型的工业知识边界
Abstract
In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $\kappa_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
Chinese Translation
在工业采购中,只有在通过标准检查后,LLM(大型语言模型)的回答才有用:推荐的材料必须符合操作条件,每个参数必须遵循规定的阈值,且任何程序不得违反安全条款。部分正确性可能掩盖安全关键的矛盾,而这些矛盾在LLM基准测试中很少被捕捉到。我们引入了IndustryBench,这是一个包含2,049个项目的工业采购问答基准,基于中国国家标准(GB/T)和结构化的工业产品记录,按七个能力维度、十个行业类别和专家小组确定的难度等级进行组织,并提供与之对应的英文、俄文和越南文翻译。我们的构建流程在基于搜索的外部验证阶段拒绝了70.3%的LLM生成候选项,校准了在仅使用LLM过滤后工业问答的可靠性。我们的评估将由Qwen3-Max评审员根据领域专家验证的$\kappa_w = 0.798$评分的原始正确性,与针对源文本的单独安全违规(SV)检查解耦。在17个中文模型和四种语言的8个模型交集上,我们发现:(i)最佳系统在0-3评分标准上仅达到2.083,仍有相当大的提升空间;(ii)标准与术语是最持久的能力弱点,并且在项目对齐翻译中依然存在;(iii)扩展推理降低了13个模型中12个的安全调整分数,主要是通过在较长的最终答案中引入不支持的安全关键细节;(iv)安全违规率重新排列了排行榜——GPT-5.4在SV调整后从第6名跃升至第3名,而Kimi-k2.5-1T-A32B下降了七个位置。因此,工业LLM评估需要基于源文本的、安全意识的诊断,而不是简单的聚合准确性。我们发布了IndustryBench,包含所有提示、评分脚本和数据集文档。
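The judge-validation figure $\kappa_w$ in the IndustryBench abstract is a weighted Cohen's kappa between the LLM judge and a domain expert. The abstract does not say which weighting scheme was used; the sketch below assumes quadratic weights on the 0-3 rubric, and the function name and argument layout are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Weighted Cohen's kappa with quadratic disagreement weights.

    rater_a, rater_b: equal-length lists of integer scores in
                      range(num_categories), e.g. 0..3 for a 0-3 rubric.
    """
    k = num_categories
    n = len(rater_a)

    # Observed joint distribution of (score_a, score_b).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal distributions of each rater.
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at max distance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * marg_a[i] * marg_b[j]

    if weighted_expected == 0.0:
        return 1.0  # both raters used one identical category throughout
    return 1.0 - weighted_observed / weighted_expected
```

Quadratic weights penalise a judge that scores 0 where the expert scores 3 far more than a one-point disagreement, which suits an ordinal rubric; linear weights would be the other common choice.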
cs.AI / 170 / 2605.10286
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
AgentRx:大语言模型代理在多模态临床预测任务中的基准研究
Abstract
Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.
Chinese Translation
构建有效的临床决策支持系统需要综合复杂的异构多模态数据。这些模态包括时间序列电子健康记录数据、医学图像、放射学报告和临床笔记。基于大语言模型(LLM)的代理在各种医疗任务中表现出色,尤其是在涉及文本模态的任务中。考虑到医院系统中医疗数据的碎片化,协作代理框架为缓解数据共享挑战提供了一个有前景的方向。然而,LLM代理在多模态临床风险预测中的有效性仍然未得到充分研究。在本研究中,我们对基于LLM的代理在临床预测任务中的表现进行了系统评估,使用大规模的真实世界数据。我们评估了单模态和多模态环境下的性能,并量化了单一代理系统与多代理系统之间的性能差距。我们的研究结果强调,单一代理框架优于简单的多代理系统,更擅长处理多模态数据,并且更具校准性。这凸显了改善多代理协作以更好地处理异构输入的迫切需求。通过开源我们的代码和评估框架,本研究提供了一个新的基准,以支持未来与医疗保健中代理系统相关的发展。
cs.AI / 171 / 2605.10310
Positive Alignment: Artificial Intelligence for Human Flourishing
积极对齐:促进人类繁荣的人工智能
Abstract
Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology's focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.
Chinese Translation
现有的对齐研究主要关注安全性和防止伤害的问题:保障措施、可控性和合规性。这种对齐范式与早期心理学对心理疾病的关注相似:虽然必要,但并不完整。我们所称的积极对齐是指开发人工智能系统,这些系统 (i) 在多元化、多中心、情境敏感和用户主导的方式下,积极支持人类和生态的繁荣,同时 (ii) 保持安全和合作。这是人工智能对齐研究中一个独特且必要的议程。我们认为,现有的几种对齐失败(例如,参与黑客、失去人类自主权、真理追寻的失败、低认知谦逊、错误修正、缺乏多样化观点以及主要反应而非主动)可以通过积极对齐得到更好的解决,包括培养美德和最大化人类繁荣。我们强调了在大型语言模型(LLM)和代理生命周期的不同阶段面临的一系列挑战、未解问题和技术方向(例如,数据过滤和上采样、训练前和训练后评估、协作价值收集)。最后,我们提出了通过情境基础、社区定制、持续适应和多中心治理来促进分歧和去中心化的设计原则;即,许多合法的监督中心,而不是一个制度性或道德性的瓶颈。
cs.AI / 172 / 2605.10325
Verifiable Process Rewards for Agentic Reasoning
可验证过程奖励用于自主推理
Abstract
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
Chinese Translation
基于可验证奖励的强化学习(RLVR)提高了大型语言模型(LLMs)的推理能力,但现有大多数方法依赖于稀疏的结果级反馈。这种稀疏性在长时间跨度的自主推理中造成了信用分配的挑战:尽管轨迹中包含许多正确的中间决策,但仍可能失败;反之,尽管包含缺陷的决策,轨迹也可能成功。在本研究中,我们研究了一类密集可验证的自主推理问题,其中中间动作可以通过符号或算法神谕客观检查。我们提出了可验证过程奖励(VPR)框架,该框架将这些神谕转化为强化学习的密集回合级监督,并在三个代表性设置中进行了实例化:基于搜索的动态推理验证、基于约束的逻辑推理验证,以及基于后验的概率推理验证。我们进一步提供了理论分析,表明密集的验证者基础奖励可以通过提供更局部的学习信号来改善长时间跨度的信用分配,其效果依赖于验证者的可靠性。在实证上,VPR在受控环境中优于结果级奖励和基于回合的过程奖励基线,更重要的是,它能够迁移到一般和自主推理基准,表明可验证的过程监督可以促进适用于训练环境之外的一般推理技能。我们的结果表明,VPR是一种有前景的方法,可以在可靠的中间验证可用时增强LLM代理,同时也强调了其对神谕质量的依赖以及将VPR扩展到结构较少、开放式环境的开放挑战。
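The contrast between outcome-level and turn-level rewards in the VPR abstract can be made concrete with a toy example. This is a sketch under stated assumptions, not the paper's framework: the oracle here is a hypothetical constraint checker that accepts an action only if it does not repeat an earlier one, and the 0/1 reward values are chosen purely for illustration.

```python
def distinct_oracle(history, action):
    """Hypothetical constraint oracle: an action is valid if it is not a repeat."""
    return action not in history


def process_rewards(actions, oracle):
    """Dense turn-level rewards: one oracle check per intermediate action."""
    history = []
    rewards = []
    for action in actions:
        rewards.append(1.0 if oracle(history, action) else 0.0)
        history.append(action)
    return rewards


def outcome_reward(actions, oracle):
    """Sparse outcome-level reward: 1 only if every action passed the oracle."""
    return 1.0 if all(r == 1.0 for r in process_rewards(actions, oracle)) else 0.0
```

For the trajectory `[1, 2, 2, 3]`, the outcome reward is 0 even though three of the four turns were valid, while the turn-level rewards single out the third action as the mistake. That localisation is the credit-assignment benefit the abstract refers to.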
cs.AI / 173 / 2605.10332
EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents
EmbodiSkill:面向技能的反思以实现自我进化的具身智能体
Abstract
Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self-evolve from trajectories generated during task execution. However, existing skill self-evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training-free framework for embodied skill self-evolution through skill-aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill-changing evidence to update the skill body, and uses execution-lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5-27B executor to reach 93.28% task success, outperforming GPT-5.2 used as a direct agent without skills by 31.58%. These results show that skill-aware self-evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.
Chinese Translation
具身智能体可以通过技能来指导物体搜索、动作执行和在多样环境中的状态变化。由于具身环境在布局、物体状态和其他执行因素上存在差异,这些技能必须从任务执行过程中生成的轨迹中自我进化。然而,现有的技能自我进化方法主要是在数字环境中开发的,通常将轨迹转换为粗略的技能更新。将这一范式直接应用于具身环境是有问题的,因为任务执行失败可能不仅反映出技能内容不正确,还可能是智能体未能遵循有效指导的执行失误。我们提出了EmbodiSkill,这是一种通过面向技能的反思和有针对性的修订实现具身技能自我进化的无训练框架。EmbodiSkill根据当前技能解释每条轨迹,利用技能变化证据更新技能内容,并利用执行失误证据来保留和强调有效指导。在ALFWorld和EmbodiedBench上的实验表明,EmbodiSkill始终提高了具身任务的成功率。在ALFWorld中,EmbodiSkill使得冻结的Qwen3.5-27B执行器达到了93.28%的任务成功率,比直接作为无技能智能体使用的GPT-5.2高出31.58%。这些结果表明,面向技能的自我进化有助于具身智能体从自身轨迹中积累可重复使用的程序知识。
cs.AI / 174 / 2605.10337
CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings
CORTEG:基础模型实现从头皮到颅内脑电记录的跨模态表征转移
Abstract
Intracranial electrocorticography (ECoG) offers high-signal-to-noise access to cortical activity for brain-computer interfaces, yet limited per-patient data has led most prior work to rely on small, subject-specific decoders that neglect information shared across patients. We investigate whether large pretrained scalp-EEG foundation models (EEG FMs) can be adapted to ECoG, enabling cross-patient learning and competitive decoding performance while calibrating to a held-out patient in 10-30 minutes on a single GPU. We introduce CORTEG, a cross-modality transfer framework that combines a pretrained EEG FM backbone, an electrode-aware KNNSoftFourier spatial adapter, a dual-stream tokenizer for low-frequency and high-gamma activity, and a leave-one-subject-out fine-tuning strategy. We evaluate CORTEG on two challenging regression tasks: public finger trajectory regression (n=9) and private audio envelope regression (n=16). CORTEG matches or exceeds the strongest task-specific baselines on both tasks: it reaches the highest mean correlation among compared methods on the public finger benchmark (gain not statistically significant on n=9 subjects), with larger and statistically significant gains on the audio task and in low-data per-patient calibration. Feature analyses align with neurophysiology, and latent manifolds capture low-dimensional finger-movement structure. CORTEG provides systematic evidence that scalp-EEG pretraining can be repurposed for ECoG decoding, enabling data-efficient intracranial BCIs that can adapt to new patients.
Chinese Translation
颅内皮层电图(ECoG)为脑-计算机接口提供了高信噪比的皮层活动访问,但每位患者的数据有限,导致大多数先前的研究依赖于小型、特定于个体的解码器,忽视了跨患者共享的信息。我们探讨了大型预训练的头皮脑电图基础模型(EEG FMs)是否可以适应ECoG,从而实现跨患者学习和具有竞争力的解码性能,同时在单个GPU上在10-30分钟内对一个被排除的患者进行校准。我们提出了CORTEG,一个跨模态转移框架,结合了预训练的EEG FM主干、一个电极感知的KNNSoftFourier空间适配器、一个用于低频和高伽马活动的双流标记器,以及一种留一法微调策略。我们在两个具有挑战性的回归任务上评估CORTEG:公共手指轨迹回归(n=9)和私人音频包络回归(n=16)。CORTEG在这两个任务上与最强的任务特定基线相匹配或超越:在公共手指基准测试中,它在比较方法中达到了最高的平均相关性(在n=9个受试者上增益在统计上不显著),在音频任务和每位患者低数据校准中则获得了更大且统计上显著的增益。特征分析与神经生理学一致,潜在流形捕捉了低维手指运动结构。CORTEG提供了系统证据,表明头皮EEG预训练可以重新用于ECoG解码,从而实现数据高效的颅内脑-计算机接口,能够适应新患者。
cs.AI / 175 / 2605.10341
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit:科学文档的视觉反馈排版优化
Abstract
A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.
Chinese Translation
一个能够无误编译的 LaTeX 手稿并不一定准备好出版。生成的 PDF 文件常常存在浮动元素位置错误、公式溢出、表格缩放不一致、孤行和寡行以及页面平衡差等问题,迫使作者进入重复的编译-检查-编辑循环。基于规则的工具对渲染的视觉效果视而不见,仅在源代码和日志文件上操作。仅处理文本的 LLM(大语言模型)进行开放式文本编辑,无法预测或验证其更改对二维布局的影响。因此,可靠的排版优化需要在每次编辑后进行视觉闭环验证。我们将此问题形式化为视觉排版优化(Visual Typesetting Optimization, VTO),即通过迭代的视觉验证和源级修订,将可编译的 LaTeX 论文转化为视觉上精致、符合页面预算的 PDF,并引入了一个五类排版缺陷的分类法以指导诊断。我们提出了 PaperFit,一个视觉反馈循环的智能体,它迭代渲染页面、诊断缺陷并应用约束修复。为了对 VTO 进行基准测试,我们构建了 PaperFit-Bench,包含 200 篇论文,涵盖 10 个会议模板和 13 种不同难度的缺陷类型。广泛的实验表明,PaperFit 的表现大幅优于所有基线,证明了从可编译源代码到出版准备 PDF 的转变需要视觉反馈优化,并且 VTO 是文档自动化流程中一个关键的缺失阶段。
cs.AI / 176 / 2605.10344
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
TMAS:通过多智能体协同扩展测试时计算
Abstract
Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks demonstrate that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, while hybrid reward training further improves scaling effectiveness and stability across iterations. Code and data are available at https://github.com/george-QF/TMAS-code.
Chinese Translation
测试时扩展已成为提高大型语言模型推理能力的有效范式,通过在推理过程中分配额外的计算资源。近期的结构化方法进一步推动了这一范式的发展,通过在多个轨迹、精炼轮次和基于验证的反馈中组织推理。然而,现有的结构化测试时扩展方法要么弱协调并行推理轨迹,要么依赖于嘈杂的历史信息,而没有明确决定应保留和重用的内容,从而限制了其在探索与利用之间的平衡能力。在本研究中,我们提出了TMAS,一个通过多智能体协同扩展测试时计算的框架。TMAS将推理组织为专门化智能体之间的协作过程,使得信息在智能体、轨迹和精炼迭代之间有序流动。为了支持有效的跨轨迹协作,TMAS引入了层次记忆:经验库重用低级可靠的中间结论和局部反馈,而指导库记录先前探索的高级策略,以引导后续的展开避免冗余推理模式。此外,我们设计了一种针对TMAS的混合奖励强化学习方案,旨在共同保持基本推理能力,增强经验利用,并鼓励超越先前尝试的解决策略进行探索。在具有挑战性的推理基准上的广泛实验表明,TMAS实现了比现有测试时扩展基线更强的迭代扩展,同时混合奖励训练进一步提高了跨迭代的扩展有效性和稳定性。代码和数据可在 https://github.com/george-QF/TMAS-code 获取。
cs.AI / 177 / 2605.10347
How Mobile World Model Guides GUI Agents?
移动世界模型如何指导图形用户界面代理?
Abstract
Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
Chinese Translation
最近在视觉-语言模型方面的进展使得移动图形用户界面代理能够感知视觉界面并执行用户指令,但对行动后果的可靠预测在长时间和高风险的交互中仍然至关重要。现有的移动世界模型提供基于文本或图像的未来状态,但尚不清楚哪种表示更有用,生成的展开是否可以替代真实环境,以及测试时的指导如何帮助不同能力的代理。为了解答上述问题,我们对移动世界模型数据进行了筛选和注释,然后在四种模态下训练世界模型:增量文本(delta text)、完整文本(full text)、基于扩散的图像(diffusion-based images)和可渲染代码(renderable code)。这些模型在 MobileWorldBench 和 Code2WorldBench 上达到了最先进的性能。此外,通过评估它们在 AITZ、AndroidControl 和 AndroidWorld 上的下游效用,我们得出了三个发现。首先,可渲染代码重建在分布内具有高保真度,并为数据构建提供了有效的多模态监督,而基于文本的反馈在在线分布外(OOD)执行中更为稳健。其次,世界模型生成的轨迹可以在训练过程中提供可转移的交互经验,并提高代理的端到端任务性能,尽管这些数据并未保留原始分布。最后,对于具有低行动熵的过于自信的移动代理,后验自我反思提供的增益有限,这表明世界模型作为先验感知或训练监督比作为通用事后验证器更为有效。
cs.AI / 178 / 2605.10365
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench:评估智能体价值的综合基准
Abstract
Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
Chinese Translation
自主智能体作为任务执行者迅速成熟,并通过如OpenClaw等平台得到了广泛应用。安全问题无疑引起了越来越多的研究关注,而在这些问题背后,潜在的价值观默默地引导着智能体的行为。然而,现有的价值基准仍然局限于大语言模型(LLMs),使得智能体的价值观在很大程度上未被探索。从直观、经验和理论的角度来看,我们展示了智能体的价值观与其底层大语言模型的价值观存在差异,而智能体的特性进一步引入了数据集、评估和系统层面的挑战,这些挑战在仅基于文本的协议中并不存在。我们通过Agent-ValueBench填补了这一空白,这是第一个专门针对智能体价值的基准。该基准涵盖16个领域的394个可执行环境,提供了4,335个价值冲突任务,涵盖28种价值体系和332个维度。每个实例都是通过我们专门构建的端到端管道共同合成的,并由专业心理学家逐实例进行策划。每个任务都配有两个极端一致的黄金轨迹,其检查点锚定了基于轨迹的评分标准。通过在4个主流平台上对14个前沿专有和开放权重模型进行基准测试,我们发现了三个一致的结果。智能体的价值首先表现为在可解释的逆流之下的跨模型同质性的价值潮。这一潮流在平台拉动下呈现非加性弯曲,然而在通过嵌入技能的有意引导下则更加明显。综合这些结果表明,智能体对齐的杠杆正在从传统的模型对齐和提示引导转向平台对齐和技能引导。
cs.AI / 179 / 2605.10366
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
EGL-SCA:图推理智能体中共演指令与工具的结构性信用分配
Abstract
Graph reasoning agents operating from natural-language inputs must solve a coupled problem: they must reconstruct a structured graph instance from text, decide whether existing computational assets are sufficient, interact with tools under a strict execution protocol, and satisfy an external verifier that checks structured correctness rather than textual plausibility. Existing approaches usually improve either the instruction side or the tool side in isolation, which leaves unclear what should be updated after failure. We propose EGL-SCA, a verifier-centric dual-space framework that models a graph reasoning agent using two collaborative components: an instruction-side policy space for reasoning strategies, and a tool-side program space for executable algorithmic tools. Our central mechanism is structural credit assignment, which maps trajectory evidence to conditional updates, precisely routing failures to either prompt optimization or tool synthesis and repair. To provide sufficient learning signals for dual-space adaptation, we introduce a training distribution stratified by task family, coupled with a Pareto-style retention strategy to balance success, generality, and parsimony. Experiments on four graph reasoning benchmarks show that EGL-SCA achieves a state-of-the-art 92.0% average success rate. By effectively co-evolving instructions and tools, our framework significantly outperforms both pure-prompting and fixed-toolbox baselines.
Chinese Translation
基于自然语言输入的图推理智能体必须解决一个耦合问题:它们必须从文本中重建一个结构化的图实例,判断现有计算资源是否足够,在严格的执行协议下与工具进行交互,并满足一个检查结构正确性而非文本合理性的外部验证者。现有的方法通常单独改善指令侧或工具侧,这使得在失败后不清楚应该更新什么。我们提出了EGL-SCA,一个以验证者为中心的双空间框架,使用两个协作组件建模图推理智能体:一个用于推理策略的指令侧策略空间和一个用于可执行算法工具的工具侧程序空间。我们的核心机制是结构性信用分配,它将轨迹证据映射到条件更新,精确地将失败路由到提示优化或工具合成与修复。为了提供足够的学习信号以适应双空间,我们引入了按任务家族分层的训练分布,并结合Pareto风格的保留策略,以平衡成功、通用性和简约性。在四个图推理基准上的实验表明,EGL-SCA实现了92.0%的最新平均成功率。通过有效地共演指令和工具,我们的框架显著超越了纯提示和固定工具箱的基线。
cs.AI / 180 / 2605.10370
Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge
自主公平数字对象:从被动声明到主动知识
Abstract
Scientific knowledge on the Web is published as passive assertions and cannot decide when to validate evidence, reconcile contradictions, or update confidence as findings accumulate. Curation depends on centralised middleware and institutional continuity, but when registries close, active stewardship stops even when data remain online. We advance the concept of Autonomous FAIR Digital Objects (aFDOs) from an abstract idea to an operational model, to offer a route from passive scientific publication toward accountable, standards-aligned automation that can outlive its publishing institutions. An aFDO augments an FDO with three capabilities anchored in Semantic Web standards, namely 1) a policy layer over RDF-star aligned with PROV-O, SHACL, and ODRL for portable condition-action rules, 2) an announcement layer over ActivityStreams 2.0 that bounds per-announcement evaluation cost, and 3) an agreement layer that resolves multi-source contradictions through reputation- and confidence-weighted agreement under a bounded adversarial model. We provide a formal definition that distinguishes policy specifications, event handlers, and communication interfaces. We evaluate an open reference implementation on 4,305 FDOs grounded in rare-disease ontologies, namely ClinVar, HPO, and Orphanet, combined with controlled synthetic observations. The consensus mechanism resolves 56.3% of 3,914 naturally occurring ClinVar conflicts where multiple submitters disagree and an expert panel has subsequently adjudicated. Under Sybil, collusion, and poisoning attacks, the mechanism degrades gracefully within its design Byzantine-tolerance bound (f < n/5), and fails as predicted beyond that bound.
Chinese Translation
网络上的科学知识以被动声明的形式发布,无法决定何时验证证据、调和矛盾或在发现积累时更新信心。数据的管理依赖于集中式中间件和机构的持续性,但当注册机构关闭时,即使数据仍在线,主动管理也会停止。我们将自主公平数字对象(Autonomous FAIR Digital Objects, aFDOs)的概念从一个抽象的想法推进到一个操作模型,以提供从被动科学出版向负责任、符合标准的自动化的路径,这种自动化能够超越其出版机构。aFDO通过三个基于语义网标准的能力增强了公平数字对象(FAIR Digital Objects, FDOs),即:1)一个基于RDF-star的政策层,与PROV-O、SHACL和ODRL对齐,用于可移植的条件-行动规则;2)一个基于ActivityStreams 2.0的公告层,限制每个公告的评估成本;3)一个通过声誉和信心加权协议解决多源矛盾的协议层,基于有限的对抗模型。我们提供了一个正式的定义,区分政策规范、事件处理程序和通信接口。我们在4,305个基于罕见疾病本体的FDO上评估了一个开放的参考实现,这些本体包括ClinVar、HPO和Orphanet,并结合了受控的合成观察。共识机制解决了3,914个自然发生的ClinVar冲突中的56.3%,这些冲突是多个提交者之间的分歧,随后由专家小组裁定。在Sybil、共谋和投毒攻击下,该机制在其设计的拜占庭容忍界限内(f < n/5)优雅降级,并在超出该界限时如预期失败。
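The agreement layer described in the abstract above can be pictured with a small sketch. This is not the reference implementation: the function names, the choice to weight each assertion by the product of reputation and confidence, and the example submitters are all assumptions made for illustration. Only the tolerance bound f < n/5 comes from the abstract.

```python
from collections import defaultdict


def max_tolerated_faults(n):
    """Largest integer f satisfying the stated design bound f < n/5."""
    if n < 1:
        raise ValueError("need at least one participant")
    return (n - 1) // 5


def weighted_agreement(assertions):
    """Pick the label with the largest total weight.

    assertions: list of (label, reputation, confidence) tuples.
    Weight = reputation * confidence (an assumed rule, not the paper's).
    Returns (winning_label, share_of_total_weight).
    """
    scores = defaultdict(float)
    for label, reputation, confidence in assertions:
        scores[label] += reputation * confidence
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all assertions have zero weight")
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / total


# Hypothetical submitters disagreeing about one variant classification.
example = [
    ("pathogenic", 0.9, 0.8),
    ("benign", 0.5, 0.5),
    ("pathogenic", 0.6, 0.5),
]
label, share = weighted_agreement(example)
print(label, round(share, 3))
print(max_tolerated_faults(10))
```

With ten participants the bound admits at most one faulty source, which suggests why the mechanism is reported to fail once attackers exceed that fraction.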
cs.AI / 181 / 2605.10380
Agent-X: Full Pipeline Acceleration of On-device AI Agents
Agent-X:设备端人工智能代理的全流程加速
Abstract
LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X has two key components: one rewrites prompts to leverage prefix caching tailored to agent-specific input-token patterns, and the other enables LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
Chinese Translation
基于大型语言模型(LLM)的代理在各项任务中表现出色,但在边缘设备上却面临高端到端延迟的问题。我们提出了Agent-X,这是一种仅基于软件的、保持准确性的框架,旨在加速设备端代理工作负载的预填充和解码阶段。Agent-X的两个关键组件重写提示,以利用针对代理特定输入标记模式的前缀缓存,并实现无LLM的推测解码,以快速生成标记并将开销降至最低。在代表性的代理工作负载上,Agent-X在真实系统中实现了1.61倍的端到端加速,且没有准确性损失,并且可以无缝集成到现有的设备端人工智能代理中。根据我们所知,这是首次系统性地表征并消除设备端代理中的延迟瓶颈。
cs.AI / 182 / 2605.10384
Agentic Performance at the Edge: Insights from Benchmarking
边缘的代理性能:基准测试的洞察
Abstract
Agentic artificial intelligence (AI) is a natural fit for Internet of Things (IoT) and edge systems, but edge deployments are often constrained to models around 8 billion parameters or smaller. An important question is: How much agentic-task quality is lost when model size is constrained by memory, power, and latency budgets? To address this question, in this paper, we provide an initial empirical study considering edge-focused model scaling, general-purpose versus coder-oriented model effects, and tool-enabled execution under a fixed protocol. We introduce a domain-conditioned evaluation methodology, an implementation-grounded analysis of model-tool interactions, practical guidance for model selection under constraints, and an analysis of failure modes that reveals distinct semantic versus execution failure patterns across model families. Our core finding is that edge-agent quality is not a simple function of parameter count. Robust deployment depends on the joint design of model choice and tool workflow. Domain-conditioned analysis reveals Pareto fronts in the accuracy-latency space that can guide strategy selection based on operational priorities.
Chinese Translation
代理人工智能(AI)非常适合物联网(IoT)和边缘系统,但边缘部署通常受到内存、功耗和延迟预算的限制,模型规模通常在80亿参数或更小。一个重要的问题是:当模型规模受到这些限制时,代理任务的质量损失有多少?为了解决这个问题,本文提供了一项初步的实证研究,考虑了以边缘为中心的模型扩展、通用模型与面向编码器的模型效果,以及在固定协议下的工具支持执行。我们引入了一种领域条件评估方法,进行了基于实现的模型与工具交互分析,提供了在约束条件下模型选择的实用指导,并分析了故障模式,揭示了不同模型家族之间的语义故障与执行故障的明显差异。我们的核心发现是,边缘代理的质量并不是参数数量的简单函数。稳健的部署依赖于模型选择与工具工作流程的联合设计。领域条件分析揭示了准确性-延迟空间中的帕累托前沿,可以根据操作优先级指导策略选择。
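As a rough illustration of the Pareto-front idea mentioned in the abstract above, the sketch below filters model configurations so that only those not dominated in accuracy and latency remain. The configuration names and numbers are placeholders invented for the example; they are not results from the paper.

```python
def pareto_front(configs):
    """Return names of configurations that no other configuration dominates.

    configs: list of (name, accuracy, latency_seconds).
    Higher accuracy is better, lower latency is better.
    """
    front = []
    for name, acc, lat in configs:
        dominated = any(
            # at least as good on both axes, strictly better on one
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in configs
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values, for illustration only.
candidates = [
    ("small-general", 0.55, 1.0),
    ("small-coder", 0.60, 1.2),
    ("mid-general", 0.58, 2.0),
    ("mid-coder", 0.70, 2.5),
]
print(pareto_front(candidates))
```

Here "mid-general" drops out because "small-coder" is both more accurate and faster; the remaining entries represent different accuracy-latency trade-offs that an operator could choose between.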
cs.AI / 183 / 2605.10386
GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic
GuardAD:通过马尔可夫安全逻辑保障自主驾驶多模态大语言模型
Abstract
Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse safety threats, particularly in accident-prone scenarios. Recent safeguard mechanisms have shown promise by incorporating logical constraints, yet most rely on static formulations that lack temporally grounded safety reasoning over evolving traffic interactions, resulting in limited robustness in dynamic driving environments. To address these limitations, we propose GuardAD, a model-agnostic safeguard that formulates AD safety as an evolving Markovian logical state. GuardAD introduces Neuro-Symbolic Logic Formalization, which represents safety predicates over heterogeneous traffic participants and continuously induces them via n-th order Markovian Logic Induction. This design enables the inference of emerging and latent hazards beyond single-step observations. Rather than simply vetoing unsafe actions, GuardAD performs Logic-Driven Action Revision, where inferred safety states actively guide action refinement without modifying the underlying MLLM. Extensive experiments on multiple benchmarks and AD-MLLMs demonstrate that GuardAD substantially reduces accident rates (-32.07%) while slightly improving task performance (+6.85%). Moreover, closed-loop simulation evaluations, together with physical-world vehicle studies, further validate the effectiveness and potential of GuardAD.
Chinese Translation
多模态大语言模型(MLLMs)正日益被集成到自主驾驶(AD)系统中;然而,它们仍然容易受到各种安全威胁的影响,尤其是在事故频发的场景中。近期的安全机制通过引入逻辑约束展现出良好的前景,但大多数依赖于静态的公式,缺乏对不断变化的交通互动进行时间上扎根的安全推理,导致在动态驾驶环境中的鲁棒性有限。为了解决这些局限性,我们提出了GuardAD,这是一种模型无关的安全保障,旨在将AD安全形式化为一个不断演变的马尔可夫逻辑状态。GuardAD引入了神经符号逻辑形式化,表示异构交通参与者的安全谓词,并通过n阶马尔可夫逻辑归纳不断引导这些谓词。该设计使得能够推断出超越单步观察的新兴和潜在危险。GuardAD不仅仅是简单地否决不安全的行为,而是执行逻辑驱动的行动修正,其中推断出的安全状态积极指导行动的细化,而不修改基础的MLLM。对多个基准和AD-MLLM的广泛实验表明,GuardAD显著降低了事故发生率(-32.07%),同时略微提高了任务性能(+6.85%)。此外,闭环仿真评估以及与现实世界车辆研究相结合,进一步验证了GuardAD的有效性和潜力。
cs.AI / 184 / 2605.10401
LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs
LLM4Branch:用于发现整数程序高效分支策略的大型语言模型
Abstract
Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on hand-crafted heuristics, and machine learning has now emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and by the gap between training objectives and the solver's end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program consisting of a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances using their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state of the art among CPU-based methods and achieves performance competitive with advanced GPU-based models. Code is available at https://github.com/hzn18/LLM4Branch.
Chinese Translation
高效的分支策略对于加速混合整数线性规划(MILP)求解器至关重要。其设计长期以来依赖于手工制作的启发式方法,而现在机器学习已成为自动化这一过程的有前景的范式。然而,现有的基于学习的方法往往受到昂贵的专家演示依赖和训练目标与求解器端到端性能之间差距的限制。在本研究中,我们提出了LLM4Branch,一个新颖的框架,利用大型语言模型(LLMs)来自动发现高效的分支策略。具体而言,发现的策略是一个可执行程序,其程序骨架由LLM生成,并通过零阶方法在几个实例上进行优化,以获取其端到端性能反馈。在标准MILP基准上的大量实验表明,LLM4Branch在基于CPU的方法中建立了新的最先进水平,并在性能上与先进的基于GPU的模型相竞争。代码可在 https://github.com/hzn18/LLM4Branch 获取。
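To make "optimized via a zeroth-order method" in the abstract above concrete, here is a minimal sketch of one such method: random-perturbation search that only ever evaluates a black-box cost. The abstract does not say which zeroth-order method LLM4Branch uses, so this particular algorithm, the function names, and the toy cost standing in for solver feedback are all assumptions.

```python
import random


def zeroth_order_search(objective, theta, steps=200, sigma=0.3, seed=0):
    """Minimise a black-box objective without gradients.

    Each step perturbs the current best parameters with Gaussian noise
    and keeps the candidate only if its measured cost is lower.
    """
    rng = random.Random(seed)
    best = list(theta)
    best_cost = objective(best)
    for _ in range(steps):
        candidate = [t + rng.gauss(0.0, sigma) for t in best]
        cost = objective(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


def toy_cost(theta):
    """Hypothetical stand-in for end-to-end solver feedback (e.g. solve time)."""
    return sum((t - 1.0) ** 2 for t in theta)


start = [0.0, 0.0, 0.0]
params, cost = zeroth_order_search(toy_cost, start)
print(cost <= toy_cost(start))
```

In the setting the abstract describes, the objective would presumably run the solver with the parameterised branching program on a few instances and return a performance measure, which is why a gradient-free method fits.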
cs.AI / 185 / 2605.10448
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
代理基准能否支持其评分?基于证据的互动代理评估界限
Abstract
Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
Chinese Translation
互动代理基准通过结果检查将代理运行映射到二元结果。当这些检查依赖于表面信号或未能捕捉代理的实际行动路径时,它们无法可靠地判断运行是否成功。例如,一个基准任务可能会询问爱丽丝的送货地址是否已更改,而结果检查仅验证代理是否点击了“保存”。这并不能保证预期的状态变化发生,因为代理可能修改了错误的记录。因此,将这样的运行视为成功会导致报告的分数具有误导性。因此,基准质量不仅依赖于任务设计,还依赖于结果检测的可靠性。我们通过为现有基准引入结果证据报告层来解决这个问题,而不修改它们的任务、代理或评估者。该层执行三个功能。首先,在评分之前,它指定每个案例所需的存储文档,以验证所声称的结果。其次,它对每个已完成的运行应用锁定清单,并分配三种证据标签之一:证据通过(Evidence Pass)、证据失败(Evidence Fail)或未知(Unknown)。第三,它报告证据支持的分数界限,以量化来自未知案例的不确定性。该框架不仅仅是默默地计数、丢弃或隐藏不确定案例,而是将它们明确可见。我们在五个公共基准上评估结果证据层:ANDROIDWORLD、AGENTDOJO、APPWORLD、tau3 bench retail 和 MINIWOB。结果报告区分了几种经验上不同的失败模式。
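The evidence-supported score bounds described in the abstract above can be sketched as follows. The abstract does not give the exact formula, so this assumes the natural reading: the lower bound counts Unknown runs as failures and the upper bound counts them as successes. The label strings and function name are hypothetical.

```python
from collections import Counter

VALID_LABELS = {"pass", "fail", "unknown"}


def evidence_bounds(labels):
    """Return (lower, upper) success-rate bounds from per-run evidence labels.

    lower: every Unknown run treated as a failure.
    upper: every Unknown run treated as a success.
    """
    if not labels:
        raise ValueError("need at least one labelled run")
    unexpected = set(labels) - VALID_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    counts = Counter(labels)
    total = len(labels)
    lower = counts["pass"] / total
    upper = (counts["pass"] + counts["unknown"]) / total
    return lower, upper


# Illustrative labels for eight runs, not data from the paper.
runs = ["pass", "pass", "fail", "unknown", "pass", "unknown", "fail", "pass"]
low, high = evidence_bounds(runs)
print(f"success rate lies in [{low:.2f}, {high:.2f}]")
```

The width of the interval is exactly the fraction of Unknown runs, so a single aggregate success rate would hide how much of the score is unverified.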
cs.AI / 186 / 2605.10480
ASIA: an Autonomous System Identification Agent
ASIA:一种自主系统识别代理
Abstract
Over the years, research in system identification has provided a rich set of methods for learning dynamical models, together with well-established theoretical guarantees. In practice, however, the choice of model class, training algorithm, and hyperparameter tuning is still largely left to empirical trial-and-error, requiring substantial expert time and domain experience. Motivated by recent advances in agentic artificial intelligence, we present ASIA, a framework that delegates this iterative search to a large language model acting as an autonomous coding agent. Building on existing agentic platforms, ASIA closes the loop between hypothesis, implementation, and evaluation without human intervention, requiring only a plain-English description of the identification problem. We conduct an empirical study of ASIA on two system identification benchmarks and analyse the agent's search behaviour, the architectures and training strategies it discovers, and the quality of the resulting models. We also discuss the potential of the approach and its current limitations, including implicit test leakage, reduced methodological transparency, and reproducibility concerns.
Chinese Translation
多年来,系统识别领域的研究提供了一系列丰富的方法用于学习动态模型,并伴随着良好的理论保证。然而,在实践中,模型类别、训练算法和超参数调优的选择仍然在很大程度上依赖于经验的试错过程,这需要大量的专家时间和领域经验。受到近期代理人工智能进展的启发,我们提出了ASIA,一个将这一迭代搜索委托给作为自主编码代理的大型语言模型的框架。ASIA基于现有的代理平台,闭合了假设、实现和评估之间的循环,无需人类干预,仅需对识别问题的简单英文描述。我们在两个系统识别基准上对ASIA进行了实证研究,分析了代理的搜索行为、其发现的架构和训练策略,以及所生成模型的质量。我们还讨论了该方法的潜力及其当前的局限性,包括隐性测试泄漏、方法透明度降低和可重复性问题。
cs.AI / 187 / 2605.10500
SkillEvolver: Skill Learning as a Meta-Skill
SkillEvolver:技能学习作为元技能
Abstract
Agent skills today are static artifacts: authored once (by human curation or one-shot generation from parametric knowledge) and then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill's prose and code, not model weights, so the resulting artifact drops into any agent without retraining; and the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI agent. Unlike trace distillation, the meta-skill refines only after deploying the learnt skill, so the learning signal comes from failures another agent encounters while using it, not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On 83 SkillsBench tasks spanning 15+ domains, SkillEvolver reaches 56.8% accuracy versus 43.6% for curated human skills and 29.9% for the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also raises mean speedup from 1.16 to 1.51.
Chinese Translation
当前的智能体技能是静态的产物:一次性创作——通过人工策划或从参数知识中一次性生成——然后在不改变的情况下被使用,没有从实际使用中改进的机制。我们提出了SkillEvolver,一种轻量级、即插即用的在线技能学习解决方案,其中单一的元技能迭代地创作、部署和完善特定领域的技能。SkillEvolver的学习目标是技能的文本和代码,而非模型权重,因此生成的产物可以在任何智能体中直接使用而无需重新训练;而元技能本身只是另一种技能,通过任何符合协议的CLI智能体通过相同接口加载。与追踪蒸馏不同,元技能仅在部署学习到的技能后进行完善,这样学习信号来自于另一个智能体在使用时遇到的失败——而不仅仅是来自探索性轨迹。完善迭代由一个新智能体的过拟合审计来管理,该审计捕捉可能的泄漏以及特定于已部署技能的失败,包括在内容上看似有效但在运行时从未被调用的静默旁路模式。在涵盖超过15个领域的83个SkillsBench任务中,SkillEvolver的准确率达到56.8%,而策划的人类技能为43.6%,无技能基线为29.9%;在KernelBench的三个GPU内核优化任务中,它还将平均加速从1.16提升至1.51。
cs.AI / 188 / 2605.10503
SLASH the Sink: Sharpening Structural Attention Inside LLMs
SLASH水槽:提升大型语言模型内部的结构注意力
Abstract
Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or on fine-tuning, both of which incur high costs and reduce generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, as evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (Slash), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction show that Slash delivers significant and consistent performance gains across diverse LLMs.
Chinese Translation
大型语言模型(LLMs)展现出卓越的语义理解能力,但在以序列化格式处理图拓扑时,往往在结构理解方面存在困难。现有解决方案依赖于训练外部基于图的适配器或微调,这会带来高昂的成本和降低的泛化能力。在本研究中,我们探讨了大型语言模型的内部机制,并提出了一个重要发现:大型语言模型自发地在内部重构图的拓扑,这一点通过其注意力图中明显的“锯齿形”模式得以证明,该模式在结构上与“令牌级邻接矩阵”对齐。然而,这种内在的结构理解被注意力沉淀所稀释。我们理论上将这种稀释形式化为一种表示瓶颈,源于一个根本性的冲突:模型的各向异性偏差对于语言任务至关重要,却抑制了图推理所需的拓扑感知局部聚合。为了解决这个问题,我们提出了一种无训练的解决方案,称为结构注意力锐化(StructuraL Attention SHarpening,Slash),通过即插即用的注意力重分配来增强这种内部结构理解。在纯图任务和分子预测上的实验验证了Slash在不同大型语言模型中带来了显著且一致的性能提升。
cs.AI / 189 / 2605.10516
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
一致性作为可测试属性:评估人工智能代理可靠性的统计方法
Abstract
This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
Chinese Translation
本文建立了一个严格的测量科学框架,用于评估人工智能代理的可靠性,为在语义保持扰动下量化一致性提供了基础框架。通过利用 $U$-统计量评估输出级别的可靠性以及基于核的方法评估轨迹级别的稳定性,我们提供了一种原则性的方法来评估在不同操作条件下的代理。我们的提案强调了代理的核心能力与执行稳健性之间的重要区别,表明尽管代理具备执行任务所需的知识,轻微的任务级别变化仍可能导致策略的完全崩溃。我们通过在三个代理基准上进行广泛实验来验证我们的框架,结果表明轨迹级别的一致性指标相比传统的 pass@1 率提供了更高的诊断敏感性。通过提供数学工具来隔离代理偏离的原因和位置,我们能够识别和纠正阻碍代理在高风险现实环境中部署的架构问题。
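The output-level use of U-statistics mentioned in the abstract above can be illustrated with a short sketch. An order-2 U-statistic averages a symmetric kernel over all unordered pairs of observations. The abstract does not specify the kernel, so the exact-match kernel, the function name, and the example outputs below are assumptions.

```python
from itertools import combinations


def exact_match(a, b):
    """Assumed kernel: 1.0 when two outputs agree exactly, else 0.0."""
    return float(a == b)


def pairwise_consistency(outputs, kernel=exact_match):
    """Order-2 U-statistic: mean of kernel(a, b) over all unordered pairs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


# Hypothetical final answers from one agent on four paraphrases of a task.
answers = ["42", "42", "42", "41"]
print(pairwise_consistency(answers))
```

Three of the six pairs agree, so the estimate is 0.5 even though three of four runs returned the same answer. A softer kernel (for example a similarity score between outputs) could replace exact match without changing the estimator.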
cs.AI / 190 / 2605.10529
PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
PrimeKG-CL:一个关于不断演变的生物医学知识图谱的持续图学习基准
Abstract
Biomedical knowledge graphs underwrite drug repurposing and clinical decision support, yet the upstream ontologies they depend on update on independent cycles that add millions of edges and deprecate hundreds of thousands more between releases. Existing continual graph learning (CGL), however, has been studied almost exclusively on synthetic random splits of static, generic KGs, a regime that cannot reproduce the asynchronous, structured evolution real biomedical KGs undergo. To this end, we introduce PrimeKG-CL, a CGL benchmark built from nine authoritative biomedical databases (129K+ nodes, 8.1M+ edges, 10 node types, 30 relation types) with two genuine temporal snapshots (June 2021, July 2023; 5.83M edges added, 889K removed, 7.21M persistent), 10 entity-type-grouped tasks, multimodal node features, and a per-task persistent/added/removed test stratification. On three tasks (biomedical relationship prediction, entity classification, KGQA), we evaluate six CL strategies across four KGE decoders, plus LKGE, an LLM-RAG agent, and CMKL. We find that decoder choice and continual learning strategy interact strongly: no single strategy performs best across all decoders, and mismatched combinations can significantly degrade performance. Moreover, only DistMult exhibits a clear separation between persistent and deprecated knowledge, indicating that standard metrics conflate retention of still-valid facts with failure to forget outdated ones; this effect is absent under RotatE. In addition, multimodal features improve entity-level tasks by up to 60%, and a recent CKGE framework (IncDE) failed to scale to our 5.67M-triple base task across five attempts with up to 350GB RAM. Data, pipeline, baselines, and the stratified split are released openly. Dataset: huggingface.co/datasets/yradwan147/PrimeKGCL | Code: github.com/yradwan147/primekg-cl-neurips2026
Chinese Translation
生物医学知识图谱支撑着药物再利用和临床决策支持,然而它们所依赖的上游本体以独立的周期更新,在每次发布之间增加数百万条边并废弃数十万条边。然而,现有的持续图学习几乎仅在静态、通用知识图谱的合成随机划分上进行研究,这种模式无法再现真实生物医学知识图谱所经历的异步、结构化演变。为此,我们引入了PrimeKG-CL,这是一个基于九个权威生物医学数据库(129K+节点,8.1M+边,10种节点类型,30种关系类型)构建的持续图学习基准,包含两个真实的时间快照(2021年6月,2023年7月;新增5.83M边,移除889K边,持久性7.21M),10个基于实体类型分组的任务,多模态节点特征,以及每个任务的持久/新增/移除测试分层。在三个任务(生物医学关系预测、实体分类、知识图谱问答)上,我们评估了六种持续学习策略在四种知识图谱嵌入解码器上的表现,以及LKGE、一个LLM-RAG代理和CMKL。我们发现解码器的选择与持续学习策略之间存在强烈的相互作用:没有单一策略在所有解码器上表现最佳,而不匹配的组合可能显著降低性能。此外,只有DistMult在持久知识和废弃知识之间表现出明显的区分,表明标准指标将仍然有效的事实的保留与未能忘记过时的事实混淆;而在RotatE下,这种效应并不存在。此外,多模态特征使得实体级任务的表现提高了多达60%,而一个近期的CKGE框架(IncDE)在五次尝试中未能扩展到我们的5.67M三元组基础任务,内存需求高达350GB。数据、流程、基线和分层划分已公开发布。数据集:huggingface.co/datasets/yradwan147/PrimeKGCL | 代码:github.com/yradwan147/primekg-cl-neurips2026
cs.AI / 191 / 2605.10531
A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives
面向老年人的反思性叙事代理:在基于大语言模型的个性化叙事中整合论证方案和论证挖掘
Abstract
This work investigates whether knowledge-driven large language model (LLM)-based storytelling can support purposeful narrative interaction with a digital companion for older adults. To address known limitations of LLMs, including hallucinations and limited transparency, we present a reflective storytelling agent integrating knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. The study consisted of two phases. Phase I employed participatory design involving 11 domain experts in a formative evaluation that informed iterative refinement. The resulting system generates narratives grounded in structured user models representing health-promoting activities and motivations. Phase II involved 55 older adults evaluating persona-based narratives across four prompts and two creativity levels. Participants assessed perceived purpose, usefulness, cultural relatability, and inconsistencies. The system additionally computed hallucination-risk indicators to evaluate generated narratives. Participants recognised personally relevant purposes in roughly two thirds of narratives, while argument-based purposes were identified in around half of these cases. Cultural recognisability strongly influenced willingness to use the functionality, whereas minor inconsistencies were often tolerated when narratives remained understandable and personally relevant. Narratives with higher hallucination-risk indicators were more often perceived as inconsistent, while higher argument-quality indicators tended to co-occur with higher clarity and meaningfulness ratings. Overall, the study positions argument mining as a reflective inspection mechanism for comparing formal grounding signals with human evaluations in health-oriented LLM storytelling for older adults.
Chinese Translation
本研究探讨了知识驱动的基于大语言模型(LLM)的叙事是否能够支持老年人与数字伴侣之间的有目的叙事互动。为了解决LLM已知的局限性,包括幻觉和透明度有限,我们提出了一种反思性叙事代理,整合了知识图谱、用户建模、论证理论和论证挖掘,以指导和检查叙事生成。研究分为两个阶段。第一阶段采用参与式设计,涉及11位领域专家进行形成性评估,以指导迭代改进。最终系统生成的叙事基于结构化用户模型,代表健康促进活动和动机。第二阶段涉及55名老年人评估基于角色的叙事,涵盖四个提示和两个创造力水平。参与者评估了感知目的、实用性、文化相关性和不一致性。系统还计算了幻觉风险指标,以评估生成的叙事。参与者在大约三分之二的叙事中识别出与个人相关的目的,而基于论证的目的在大约一半的案例中被识别出。文化可识别性对使用该功能的意愿产生了强烈影响,而当叙事保持可理解和个人相关时,轻微的不一致性往往被容忍。具有较高幻觉风险指标的叙事更常被认为是不一致的,而较高的论证质量指标往往与更高的清晰度和意义评分同时出现。总体而言,本研究将论证挖掘定位为一种反思性检查机制,用于比较正式基础信号与人类评估在面向老年人的健康导向LLM叙事中的应用。
cs.AI / 192 / 2605.10541
Bridging Sequence and Graph Structure for Epigenetic Age Prediction
桥接序列与图结构以进行表观遗传年龄预测
Abstract
Epigenetic clocks based on DNA methylation have emerged as powerful tools for estimating biological age, with broad applications in aging research, age-related disease studies, and longevity science. Despite advances across machine learning approaches to epigenetic age prediction, spanning penalised linear regression, deep feedforward networks, residual architectures, and graph neural networks, no existing method jointly models co-methylation graph structure and site-specific DNA sequence context within a unified framework. We propose a unified sequence--graph integration framework for epigenetic age prediction that addresses this gap, integrating eight-dimensional DNA sequence statistical features through a lightweight gated modulation mechanism that adaptively scales each site's methylation signal according to its sequence-determined biological relevance prior to graph convolution. Evaluated on 3,707 blood methylation samples against a comprehensive set of baselines, our method achieves a test MAE of 3.149 years, a 12.8\% improvement over the strongest graph-based baseline. Biologically informed statistical features outperform CNN-based sequence encoding, demonstrating that handcrafted sequence features are more effective than end-to-end learned representations in this data regime. Post-hoc interpretability analysis identifies CpG density and local adenine frequency as features with age-dependent importance shifts, consistent with known mechanisms of age-related hypermethylation at CpG-dense promoter regions. Our code is at https://github.com/yaoli2022/graphage-seq.
Chinese Translation
基于DNA甲基化的表观遗传时钟已成为估计生物年龄的强大工具,在衰老研究、与年龄相关的疾病研究和长寿科学中具有广泛应用。尽管在表观遗传年龄预测的机器学习方法上取得了进展,包括惩罚线性回归、深度前馈网络、残差架构和图神经网络,但现有方法尚未在统一框架内共同建模共甲基化图结构和位点特异性DNA序列背景。我们提出了一种统一的序列-图集成框架,用于表观遗传年龄预测,填补了这一空白,通过轻量级门控调制机制整合八维DNA序列统计特征,适应性地根据序列决定的生物相关性在图卷积之前调整每个位点的甲基化信号。在3707个血液甲基化样本上进行评估,与一组全面的基线进行比较,我们的方法在测试中实现了3.149年的平均绝对误差(MAE),比最强的基于图的基线提高了12.8%。生物学信息驱动的统计特征优于基于卷积神经网络(CNN)的序列编码,证明在这一数据环境中,手工制作的序列特征比端到端学习的表示更为有效。事后可解释性分析确定了CpG密度和局部腺嘌呤频率作为具有年龄依赖性重要性变化的特征,这与已知的在CpG密集启动子区域的与年龄相关的高甲基化机制一致。我们的代码可在 https://github.com/yaoli2022/graphage-seq 获取。
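The gated modulation step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the gate's exact form, so the sigmoid over a linear projection of the eight sequence features, and the names `gated_methylation`, `weights`, and `bias`, are assumptions made here for illustration only.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_methylation(beta, seq_features, weights, bias=0.0):
    """Scale one CpG site's methylation value by a sequence-dependent gate.

    Assumption: the gate is sigmoid(w . s + b). The paper's actual gate
    may differ; this only shows the idea of modulating before graph convolution.
    """
    if len(seq_features) != len(weights):
        raise ValueError("seq_features and weights must have the same length")
    z = sum(w * s for w, s in zip(weights, seq_features)) + bias
    return sigmoid(z) * beta


# Eight-dimensional sequence statistics for one site (hypothetical values).
features = [0.0] * 8
untrained = [0.0] * 8  # zero weights give a neutral gate of 0.5
scaled = gated_methylation(0.8, features, untrained)
```

With zero weights the gate is exactly 0.5, so the site's signal is halved; learned weights would push the gate toward 0 or 1 depending on the sequence context.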
cs.AI / 193 / 2605.10555
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
以代理为中心的工具API:企业AI代理系统的语义接口范式
Abstract
As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
Chinese Translation
随着AI代理从研究原型转变为企业生产系统,它们所使用的工具接口仍然根植于以人为中心的CRUD范式。本文识别出传统API与自主代理需求之间的五个基本架构不匹配:精确标识符依赖、渲染导向响应、单次交互假设、用户等效授权和不透明的错误语义。我们提出了以代理为中心的工具API范式,该范式包括三个集成机制:(1) 六动词语义协议,将工具交互分解为搜索、解决、预览、执行、验证和恢复阶段;(2) 规范化工具合同(Normalized Tool Contract, NTC),提供结构化的决策支持元数据,包括置信度评分、证据链和建议的下一步行动;(3) 双层治理管道,将静态能力政策与动态风险升级相结合。该范式已在一个生产多租户SaaS平台中实施和验证,该平台服务于6个业务领域的85个注册工具。对50个真实操作任务的比较实验表明,以代理为中心的API实现了88%的端到端任务成功率,而优化的CRUD基线为64%(提高了37.5%),同时将所需的人为干预减少了72.7%,并将自主错误恢复能力提高了5.8倍。我们确定该范式与传输层标准(如MCP)是正交且互补的,作为现有工具发现和调用协议之上的语义应用层运行。
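As a rough illustration of the Six-Verb Semantic Protocol and the Normalized Tool Contract named in this abstract, the sketch below models the six phases as an enum and the contract as a dataclass. The field names, the transition rule in `next_verb`, and the restart-after-recovery behavior are all assumptions; the abstract only lists the verbs and the kinds of metadata.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verb(Enum):
    SEARCH = "search"
    RESOLVE = "resolve"
    PREVIEW = "preview"
    EXECUTE = "execute"
    VERIFY = "verify"
    RECOVER = "recover"


@dataclass
class NormalizedToolContract:
    """Hypothetical response envelope carrying decision-support metadata."""
    verb: Verb
    data: dict
    confidence: float                                   # score in [0, 1]
    evidence: list = field(default_factory=list)        # evidence chain
    next_actions: list = field(default_factory=list)    # suggested next verbs


# Assumed happy-path ordering of the five non-recovery phases.
HAPPY_PATH = [Verb.SEARCH, Verb.RESOLVE, Verb.PREVIEW, Verb.EXECUTE, Verb.VERIFY]


def next_verb(current: Verb, ok: bool):
    """Return the next phase, or None once verification has succeeded."""
    if not ok:
        return Verb.RECOVER
    if current is Verb.RECOVER:
        return Verb.SEARCH  # assumption: a successful recovery restarts the flow
    i = HAPPY_PATH.index(current)
    return HAPPY_PATH[i + 1] if i + 1 < len(HAPPY_PATH) else None


def relative_gain(new: float, base: float) -> float:
    """Relative improvement in percent."""
    return (new - base) / base * 100


# The reported +37.5% is the relative gain of 88% over 64%.
gain = relative_gain(88, 64)
```

The `relative_gain` helper makes the abstract's arithmetic explicit: the 24-point absolute gap divided by the 64% baseline gives 37.5%.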
cs.AI / 194 / 2605.10569
Deep Arguing
深度论证
Abstract
Deep learning has become the dominant approach for creating high capacity, scalable models across diverse data modalities. However, because these models rely on a large number of learned parameters, tightly couple feature extraction with task objectives, and often lack explicit reasoning mechanisms, it is difficult for humans to understand how they arrive at their predictions. Understanding what representations emerge and why they arise from the training data remains an open challenge. We introduce Deep Arguing, a novel neurosymbolic approach that integrates deep learning with argumentation construction and reasoning for interpretable classification with different data modalities. In our approach deep neural networks construct an argumentation structure wherein data points support their assigned label and attack different ones. Using differentiable argumentation semantics for reasoning, the model is trained end-to-end to jointly learn feature representation and argumentative interactions. This results in argumentation structures providing faithful case-based explanations for predictions. Structure constraints over the argumentation graph guide learning, improving both interpretability and predictive performance. Experiments with tabular and imaging datasets show that Deep Arguing achieves performance competitive with standard baselines whilst offering interpretable argumentative reasoning.
Chinese Translation
深度学习已成为在多种数据模态中创建高容量、可扩展模型的主导方法。然而,由于这些模型依赖于大量学习参数,将特征提取与任务目标紧密结合,并且通常缺乏明确的推理机制,因此人们很难理解它们是如何得出预测的。理解从训练数据中出现的表示及其产生原因仍然是一个未解决的挑战。我们提出了深度论证(Deep Arguing),这是一种新颖的神经符号方法,它将深度学习与论证构建和推理相结合,以实现对不同数据模态的可解释分类。在我们的方法中,深度神经网络构建一个论证结构,其中数据点支持其分配的标签并攻击其他标签。通过使用可微分的论证语义进行推理,该模型经过端到端训练,以共同学习特征表示和论证交互。这导致论证结构为预测提供了可信的基于案例的解释。论证图上的结构约束指导学习,提高了可解释性和预测性能。对表格和图像数据集的实验表明,深度论证在性能上与标准基线相当,同时提供了可解释的论证推理。
cs.AI / 195 / 2605.10574
LLM Jaggedness Unlocks Scientific Creativity
LLM的不均匀性解锁科学创造力
Abstract
As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.
Chinese Translation
随着人工智能的进步,模型的提升并不是均匀的。相反,进展以不均匀的方式展开,能力在任务、领域和模型规模之间不均衡增长。在本研究中,我们通过科学创意生成的视角考察这种动态的不均匀性。我们引入了SciAidanBench,这是一个开放式科学问题的基准,旨在测量大型语言模型(LLMs)的科学创造力。给定一个科学问题,模型被要求生成尽可能多的独特且连贯的想法,总有效响应的数量作为创造潜力的代理。我们评估了来自8个提供者的19个基础模型(包括推理版本在内的30个变体),发现不均匀性在模型之间和模型内部均有表现。首先,在一般创造力与科学创造力的跨任务比较中,一般创造力的提升并未均匀转化为科学创造力,揭示了模型之间能力特征的差异。其次,在提示级别上,强模型的表现并不均匀;相反,它们表现出高度的变异性,在某些问题上创造力爆发,而在其他问题上表现有限。第三,在领域级别,个别模型在科学子领域中显示出不均衡的优势,反映出内部能力特征的碎片化。最后,我们展示了这种不均匀性可以被利用。我们探索了推理时计算、知识汇集和头脑风暴的机制,以有效组合模型并构建超越任何单一模型的元模型集成。我们的结果将不均匀性视为一种资源,而非限制,是AI进步的结构特征,当被理解和利用时,可以增强基于LLM的科学创造力。
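The scoring idea in this abstract, counting valid unique ideas as a proxy for creative potential, can be sketched as a greedy de-duplication pass. The abstract does not say how uniqueness or coherence is judged, so the word-level Jaccard similarity and the `max_sim` threshold below are stand-in assumptions, and coherence checking is omitted entirely.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two idea strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def count_unique(ideas, max_sim=0.5):
    """Count ideas that are not near-duplicates of an earlier kept idea.

    Assumption: an idea is 'unique' if its similarity to every kept idea
    is at most max_sim. A real benchmark would likely use a stronger
    similarity measure and a separate coherence judge.
    """
    kept = []
    for idea in ideas:
        if all(jaccard(idea, k) <= max_sim for k in kept):
            kept.append(idea)
    return len(kept)


ideas = [
    "use graphene as a filter",
    "use graphene as a filter membrane",   # near-duplicate of the first
    "breed salt tolerant algae",
]
score = count_unique(ideas)
```

Here the second idea shares five of six distinct words with the first and is rejected, so the score is 2.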
cs.AI / 196 / 2605.10592
A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge
一种针对云端和边缘的污水溢流监测的弹性解决方案
Abstract
Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator (https://riwwer.demo.calgo-lab.de) that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).
Chinese Translation
许多历史城市的老化联合污水系统正受到极端降雨事件的日益压力,这可能导致联合污水溢流(CSO),对环境和公共健康产生重大影响。预测溢流池的填充动态对于预见容量超限并及时采取预防措施以应对CSO至关重要。我们提出了一个基于网络的演示系统(https://riwwer.demo.calgo-lab.de),该系统将深度学习预测方法集成到云端和边缘环境中,形成一个互动监测仪表板,能够抵御网络中断。在线提供了视频展示(https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ)。
cs.AI / 197 / 2605.10593
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
LLARS:促进领域专家与开发者在大型语言模型提示、生成和评估中的协作
Abstract
We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts $\times$ models $\times$ data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.
Chinese Translation
我们展示了LLARS(大型语言模型辅助研究系统),这是一个开源平台,旨在弥合领域专家与开发者之间的鸿沟,以构建基于大型语言模型的系统。它将三个紧密相关的模块集成到一个端到端的流程中:协作提示工程,支持实时共同创作、版本控制和即时大型语言模型测试;批量生成,支持在用户选择的提示 $\times$ 模型 $\times$ 数据下进行可配置的输出生产,并实现成本控制;混合评估,人类和大型语言模型评估者通过多种评估方法共同评估输出,提供实时一致性指标和来源分析,以识别特定用例的最佳模型-提示组合。新的提示和模型会自动用于批量生成,已完成的批次可以一键转化为评估场景。对在线咨询领域的六位领域专家和三位开发者的访谈确认,LLARS使用直观,能够通过将所有内容集中在一个地方显著节省时间,并使跨学科协作变得无缝。
cs.AI / 198 / 2605.10598
Budget-Efficient Automatic Algorithm Design via Code Graph
基于代码图的预算高效自动算法设计
Abstract
Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.
Chinese Translation
大型语言模型(LLMs)已成为自动算法设计(AAD)的强大工具。然而,现有的流程仍然效率低下。它们在完整算法的粒度上操作,冗余地重写重复的子结构,并丢弃可能包含有价值算法特征的低适应度候选者。我们形式化了预算高效的自动算法设计,其中搜索策略在有限的计算成本下最大化实现的适应度。我们提出了一种算法的有向无环图表示,并构建了一个搜索框架,充分利用LLM的输出。我们不是查询LLM以获取完整算法,而是使用它来获得修正:紧凑的操作符,用于添加、替换或移除代码块。每个修正都增强了图形,生成与先前修正组合的新算法。这种图结构将算法分解为修正集,使得修正级别的信用分配能够指导后续查询。我们还通过理论洞察补充了这一框架,探讨在不同预算水平下搜索深度和广度之间的理想平衡。我们在三个组合优化问题上对我们的方法进行了实证验证,证明了我们的基于图的搜索在相同令牌预算下优于完整算法搜索的持续优势。最后,我们的实验表明,丰富的上下文仅在LLM的先前知识较浅时有助于性能,反之则可能妨碍表现。
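The correction-based representation in this abstract can be sketched as follows: an algorithm is a set of named code blocks, and each correction adds, replaces, or removes one block. The `Correction` type, the dictionary representation, and the mean-fitness credit rule are illustrative assumptions; the paper's directed acyclic graph and its credit assignment may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Correction:
    op: str                     # "add", "replace", or "remove"
    block: str                  # name of the targeted code block
    code: Optional[str] = None  # new source text; unused for "remove"


def apply_correction(blocks: dict, c: Correction) -> dict:
    """Return a new block mapping with one correction applied."""
    out = dict(blocks)  # copy so earlier algorithms stay intact
    if c.op == "remove":
        out.pop(c.block, None)
    elif c.op in ("add", "replace"):
        out[c.block] = c.code
    else:
        raise ValueError(f"unknown op: {c.op}")
    return out


def compose(base: dict, corrections) -> dict:
    """Apply corrections in order, yielding a new algorithm."""
    for c in corrections:
        base = apply_correction(base, c)
    return base


def credit(evaluated) -> dict:
    """Mean fitness of the algorithms each correction appears in.

    evaluated: iterable of (frozenset of Correction, fitness) pairs.
    Assumption: simple averaging stands in for the paper's credit rule.
    """
    totals = {}
    for corrections, fitness in evaluated:
        for c in corrections:
            s, n = totals.get(c, (0.0, 0))
            totals[c] = (s + fitness, n + 1)
    return {c: s / n for c, (s, n) in totals.items()}


base = {"init": "x = 0", "step": "x += 1"}
c1 = Correction("replace", "step", "x += 2")
c2 = Correction("remove", "init")
variant = compose(base, [c1, c2])
scores = credit([(frozenset({c1}), 1.0), (frozenset({c1, c2}), 3.0)])
```

Because corrections are small and reusable, a low-fitness candidate can still contribute: its corrections keep their own credit and can be recombined with others in later queries.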
cs.AI / 199 / 2605.10601
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
开放盒子谬论:为何人工智能部署需要一个经过校准的验证机制
Abstract
AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.
Chinese Translation
在医疗、信贷、就业和刑事司法等敏感领域,人工智能的部署常常被视为不安全,直到模型内部可以被解释为止。这通常导致对机械解释的过度依赖,以应对超出其预期范围的部署挑战。我们认为,应该将门槛设置为经过校准的验证:授权应当是领域特定的、可独立检查的、在发布后进行监控的、可追责的、可争议的和可撤销的。原因有二。首先,模型能力在相近任务之间是不均匀的,因此授权必须附加于特定的使用场景,而不是一般性地附加于模型。其次,社会长期以来通过资质、监控、责任、上诉和撤销等方式来管理不透明的专业知识,而不是通过机制层面的解释。最近的证据进一步强化了机械理解与部署权威之间的区别:内部表征与输出修正之间的53个百分点差距表明,理解可能无法转化为行动,而一项范围审查发现,只有9.0%的FDA批准的人工智能/机器学习设备文件包含前瞻性的市场后监测研究。我们提出了验证覆盖(Verification Coverage),这是一种包含六个组成部分的可报告标准,具有最小组成规则,作为应与模型卡、排行榜和监管披露中的能力评分并列的指标。
cs.AI / 200 / 2605.10614
PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
PRISM:多智能体大语言模型管道中秘密泄露的生成时检测与缓解
Abstract
Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
Chinese Translation
多智能体大语言模型系统引入了一种安全风险,即一个智能体访问的敏感信息可能通过共享上下文传播,并在下游输出中重新出现,即使没有明确的对抗意图。我们将这一现象形式化为传播放大,其中泄露风险在智能体边界间增加,因为敏感内容被反复暴露给下游生成器。现有的防御措施,包括基于提示的保护、静态模式匹配和大语言模型作为评判者的过滤,并未针对这一场景设计:它们要么在生成后操作,要么主要依赖表面形式模式,或者在不建模生成过程的情况下增加了显著的延迟。为了解决这些问题,我们提出了PRISM,一种实时防御机制,将凭证泄露视为生成过程中的顺序风险累积问题。在每个解码步骤中,PRISM结合了16个信号,涵盖词汇、结构、信息论、行为和上下文特征,生成一个校准的风险评分,从而通过绿色、黄色和红色风险区域实现逐标记干预。我们的核心观察是,凭证重现通常会在生成动态中出现可测量的变化,表现为熵崩溃和对数集中度增加。当与文本结构线索(如标识符模式检测)结合时,这些时间信号在秘密被完全重构之前提供了泄露的早期警告。在涵盖13种攻击类别和三种压力水平的异构四智能体管道的2000任务对抗基准测试中,PRISM实现了F1 = 0.832,精确度为1.000,召回率为0.712,同时在我们的基准测试中未观察到泄露(任务级泄露率为0.0%),并保持了输出效用为0.893。它显著优于最强基线Span Tagger,后者的F1为0.719,任务级泄露率为15.0%。
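To make the per-token risk accumulation in this abstract concrete, here is a toy sketch that combines just two signals, entropy collapse of the next-token distribution and an identifier-pattern flag, into a running risk score mapped onto green, yellow, and red zones. PRISM itself combines 16 calibrated signals; the equal weights, the `decay` factor, and the zone thresholds below are hypothetical values chosen for illustration. The `f1` helper checks that the reported precision and recall are consistent with the reported F1.

```python
import math


def entropy(probs) -> float:
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def step_risk(prev_risk, probs, looks_like_identifier, decay=0.8):
    """Update accumulated risk after one decoding step.

    Assumption: risk is an exponential moving average of a two-signal
    score; the weights and decay are illustrative, not from the paper.
    """
    max_h = math.log2(len(probs))
    collapse = 1.0 - entropy(probs) / max_h if max_h > 0 else 1.0
    signal = 0.5 * collapse + 0.5 * (1.0 if looks_like_identifier else 0.0)
    return decay * prev_risk + (1.0 - decay) * signal


def zone(risk, yellow=0.4, red=0.7) -> str:
    """Map a risk score to an intervention zone (hypothetical thresholds)."""
    if risk >= red:
        return "red"
    if risk >= yellow:
        return "yellow"
    return "green"


def f1(precision, recall) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Six consecutive steps with a fully collapsed distribution and an
# identifier-like pattern drive the toy score from green into red.
risk = 0.0
zones = []
for _ in range(6):
    risk = step_risk(risk, [1.0, 0.0, 0.0, 0.0], True)
    zones.append(zone(risk))
```

In this toy setup the score reaches the yellow zone at the third step and the red zone at the sixth, which mirrors the abstract's point that temporal signals can warn before a secret is fully reconstructed.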
cs.AI / 201 / 2605.10624
Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control
层次因果推理:可解释模型预测控制的基础框架
Abstract
Model Predictive Control (MPC) is widely used to operate safety-critical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush--Kuhn--Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53\% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2--3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32--37\% accuracy degradation when any component is removed, and HCA's ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.
Chinese Translation
模型预测控制(MPC)广泛应用于安全关键基础设施的操作,通过预测未来轨迹和优化控制动作。然而,非线性动态、严格的安全约束和数值优化常常使得单个控制动作对人类操作员变得不透明,从而削弱信任并阻碍部署。本文提出了层次因果推理(HCA),该方法结合了(i)通过领域知识图进行的物理信息推理,(ii)来自Karush-Kuhn-Tucker(KKT)乘子的优化证据,以及(iii)通过PCMCI算法进行的时间因果发现,以生成对非线性MPC计算的控制动作的真实、可人类解释的解释。在三个不同的控制应用(温室气候、建筑暖通空调、化学过程工程)中经过专家验证,HCA在使用一组跨领域参数而无需每个领域调优的情况下,将解释准确性提高了53\%(0.478对比0.311);对领域特定的KKT阈值校准在2-3天内进一步将准确性提高至0.88。消融研究确认每个证据源都是必不可少的,当移除任何组件时准确性下降32-37\%。HCA的排名和验证方法不仅适用于MPC,还可以推广到其他基于预测的决策系统,包括基于学习的控制和轨迹规划。
cs.AI / 202 / 2605.10634
Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies
基于教师意识的启发式程序演化来自学习的优化策略
Abstract
LLM-based automatic heuristic design has shown promise for generating executable heuristics for combinatorial optimization, but existing methods mainly rely on delayed endpoint performance. We propose a \emph{teacher-aware evolutionary framework} that uses independently trained learned optimization policies as behavioral teachers. Instead of deploying or imitating the teacher, our method queries it on states visited by candidate heuristic programs and uses its action preferences as local feedback for evolution. The resulting search discovers static executable heuristics guided by both task performance and teacher-derived behavioral signals. Experiments on scheduling, routing, and graph optimization benchmarks show that our method improves over performance-driven LLM heuristic evolution baselines while requiring no neural inference at deployment. These results suggest that learned optimization policies can be repurposed as behavioral feedback sources for automatic heuristic discovery.
Chinese Translation
基于大型语言模型(LLM)的自动启发式设计在生成可执行的组合优化启发式方面显示出潜力,但现有方法主要依赖于延迟的终端性能。我们提出了一种\textit{教师意识的演化框架},该框架利用独立训练的学习优化策略作为行为教师。我们的方法不是部署或模仿教师,而是在候选启发式程序访问的状态上查询教师,并利用其行动偏好作为演化的局部反馈。最终的搜索发现了由任务性能和教师导出的行为信号共同指导的静态可执行启发式。在调度、路由和图优化基准上的实验表明,我们的方法在性能驱动的LLM启发式演化基线之上有所改进,同时在部署时不需要神经推理。这些结果表明,学习的优化策略可以被重新利用作为自动启发式发现的行为反馈源。
cs.AI / 203 / 2605.10639
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
导航大型语言模型评估的海洋:调查毒性基准中的偏见
Abstract
The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.
Chinese Translation
大型语言模型(LLMs)在研究和工业中的快速应用凸显了安全部署它们所面临的挑战,并揭示了毒性基准系统评估中的空白。随着组织越来越依赖这些基准来认证面向客户的应用程序和自动化审核,未被识别的评估偏见可能导致脆弱或不安全系统的部署。本研究调查了已建立的基准设置的稳健性,并考察如何衡量当前被忽视的内在偏见,例如与模型选择、指标和任务类型相关的偏见。我们的实验揭示了在改变评估设置时基准行为的显著差异。具体而言,将任务从文本补全转变为摘要生成时,基准标记内容为有害的倾向增加。此外,某些基准在输入数据领域变化时未能保持一致的行为。此外,我们观察到特定模型的不稳定性,清楚地表明需要更稳健和全面的安全评估框架。
cs.AI / 204 / 2605.10647
diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories
diffGHOST:基于扩散的生成对冲无知合成轨迹
Abstract
Trajectories are valuable information for a wide range of applications. However, they are also inherently sensitive, as they contain highly personal information about individuals. In response, synthesizing mobility trajectories has emerged as a promising way to leverage mobility information while preserving privacy. State-of-the-art models often rely on the false assumption that generative models are implicitly private, and they fail to provide privacy guarantees while preserving trajectory utility. Here, we introduce diffGHOST, a conditional diffusion model based on latent space segmentation, designed to address this challenge. We propose a methodology that identifies and mitigates memorization of critical samples using condition segments of a learned latent space.
Chinese Translation
轨迹如今在广泛的应用中具有重要价值。然而,它们本质上也非常敏感,因为它们包含关于个人的高度私人信息。面对这一挑战,合成移动轨迹已成为一种有前景的解决方案,旨在利用移动信息的同时保护隐私。现有的最先进模型往往依赖于生成模型隐含隐私的错误假设,未能在保持轨迹效用的同时提供隐私保障。在此,我们介绍了diffGHOST,一种基于潜在空间分割的条件扩散模型,旨在应对这一挑战。因此,本文提出了一种方法,利用学习的潜在空间的条件分段来识别和减轻对关键样本的记忆。
cs.AI / 205 / 2605.10663
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL:代理内部经验驱动自我演化能力的端到端优化
Abstract
Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model's capacities for abstraction, generalization, and in-context learning. However, most existing studies focus primarily on system-level design choices, such as how experience is represented and managed, neglecting the inherent capabilities of the underlying model. While some recent works have started to optimize the experience utilization stage via reinforcement learning, they still fail to treat self-evolution as a unified process to be jointly optimized. To this end, we propose Evolving-RL, an efficient algorithmic framework that jointly improves the experience extraction and utilization capabilities required for self-evolution. Specifically, we center the learning process on experience extraction and evaluation, using the two supervisory signals derived from evaluation to optimize the extractor and solver separately and thus enable their coordinated co-evolution. Experiments on ALFWorld and Mind2Web show that Evolving-RL effectively enhances LLMs' ability to extract and reuse experience, leading to strong performance gains on out-of-distribution tasks (up to 98.7% relative improvement over the GRPO baseline on ALFWorld unseen tasks and 35.8% on Mind2Web), and these gains are fully unlocked only through the coordinated co-evolution of experience extraction and utilization. Furthermore, Evolving-RL inherently functions as an experience-augmented RL algorithm. By internalizing reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation.
Chinese Translation
经验驱动的自我演化代理旨在通过从过去的交互中提炼可重用的经验,克服大型语言模型的静态特性,从而在部署时能够适应新任务。这个过程对基础模型的抽象、泛化和上下文学习能力提出了很高的要求。然而,大多数现有研究主要集中在系统级设计选择上,例如经验的表示和管理,忽视了底层模型的固有能力。尽管一些近期的研究开始通过强化学习优化经验利用阶段,但仍未将自我演化视为一个需要联合优化的统一过程。为此,我们提出了 Evolving-RL,一种高效的算法框架,联合提升自我演化所需的经验提取和利用能力。具体而言,我们将学习过程集中在经验提取和评估上,利用从评估中得出的两个监督信号分别优化提取器和求解器,从而实现它们的协调共同演化。在 ALFWorld 和 Mind2Web 上的实验表明,Evolving-RL 有效增强了大型语言模型提取和重用经验的能力,在分布外任务上实现了显著的性能提升(在 ALFWorld 未见任务上相较于 GRPO 基线的相对提升最高达 98.7%,在 Mind2Web 上为 35.8%),而这些提升只有通过经验提取和利用的协调共同演化才能完全释放。此外,Evolving-RL 本质上作为一种经验增强的强化学习算法运作。通过将可重用的经验模式直接内化到模型参数中,它在已见和未见任务上相较于标准基线实现了显著的性能提升,即使在没有测试时经验积累的情况下。
cs.AI / 206 / 2605.10685
GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing
GESR:一种基于遗传编程的符号回归方法与基因编辑
Abstract
Mathematical formulas serve as a language through which humans communicate with nature. Discovering mathematical laws from scientific data to describe natural phenomena has been a long-standing pursuit of humanity for centuries. In the field of artificial intelligence, this challenge is known as the symbolic regression problem. Among existing symbolic regression approaches, Genetic Programming (GP) based on evolutionary algorithms remains one of the most classical and widely adopted methods. GP simulates the evolutionary process across generations through genetic mutation and crossover. However, mutations and crossovers in GP are entirely random. While this randomness effectively mimics natural evolution, it inevitably produces both beneficial and detrimental variations. If there existed a metaphorical "God" capable of foreseeing which genetic mutations or crossovers would yield superior outcomes and performing targeted gene editing accordingly, the efficiency of evolution could be substantially improved. Motivated by this idea, we propose in this paper a symbolic regression approach based on gene editing, termed GESR. In GESR, we train two "hands of God" (two BERT models). The first leverages BERT's masked language modeling capability to guide the mutation of genes (expression symbols). The second guides the crossover of individual genes by predicting the crossover point. Experimental results demonstrate that GESR significantly improves computational efficiency compared with traditional GP algorithms and achieves strong overall performance across multiple symbolic regression tasks.
Chinese Translation
数学公式作为人类与自然沟通的语言,揭示科学数据中的数学规律以描述自然现象一直是人类数百年来的追求。在人工智能领域,这一挑战被称为符号回归问题。在现有的符号回归方法中,基于进化算法的遗传编程(Genetic Programming, GP)仍然是最经典和广泛采用的方法之一。GP通过遗传变异和交叉模拟跨代的进化过程。然而,GP中的变异和交叉完全是随机的。虽然这种随机性有效地模拟了自然进化,但不可避免地产生了有益和有害的变异。如果存在一个隐喻上的“上帝”能够预见哪些遗传变异或交叉会产生更优的结果,并相应地进行有针对性的基因编辑,那么进化的效率将大大提高。受到这一思想的启发,我们在本文中提出了一种基于基因编辑的符号回归方法,称为GESR。在GESR中,我们训练了两个“上帝之手”(两个BERT模型)。其中,第一个利用BERT的掩码语言建模能力来指导基因(表达符号)的变异。另一个BERT模型通过预测交叉点来指导个体基因的交叉。实验结果表明,与传统的GP算法相比,GESR显著提高了计算效率,并在多个符号回归任务中取得了强劲的整体表现。
cs.AI / 207 / 2605.10754
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
代理存在的代理使用:代理控制论是基础代理缺失的科学
Abstract
LLM-based foundation agents that perceive, reason, and act across thousands of reasoning steps are rapidly becoming the dominant paradigm for deploying artificial intelligence in open-ended, long-horizon complex tasks. Despite this significance, the field remains overwhelmingly engineering-driven. Engineering practice has converged on useful primitives (tool loops, memory banks, harnesses, reflection steps), yet these are assembled by empirical trial and error rather than from first principles. Fundamental questions remain open: under what conditions does a long-running agent remain on-task? How should an agent respond when its environment exceeds its representational capacity? What architectural properties are necessary for safe self-improvement? We argue that cybernetics, the mid-twentieth-century science of control and communication in complex systems, provides the missing theoretical scaffold for foundation agents. By mapping six canonical laws of classical cybernetics onto six agent design principles, and synthesizing those principles into three engineering desiderata (reliability, lifelong running, and self-improvement), we arrive at a framework termed Agent Cybernetics. Three application domains (code generation, computer use, and automated research) exemplify the analytical framework of agent cybernetics by identifying failure modes and concrete engineering recommendations. We hope that agent cybernetics opens a new research avenue and establishes the scientific foundation that foundation agents need for principled, reliable real-world deployment.
Chinese Translation
基于大语言模型(LLM)的基础代理能够在数千个推理步骤中感知、推理和行动,正迅速成为在开放式、长期复杂任务中部署人工智能的主流范式。尽管这一重要性不容忽视,但该领域仍然以工程驱动为主。工程实践已集中于有用的基本构件(工具循环、记忆库、工具架、反思步骤),然而这些构件的组合主要依赖于经验试错,而非从基本原理出发。许多基本问题仍未解决:在什么条件下,长期运行的代理能够保持任务专注?当代理的环境超出其表征能力时,应如何应对?安全自我改进所需的架构特性是什么?我们认为,控制论——这一二十世纪中期关于复杂系统中控制与通信的科学——为基础代理提供了缺失的理论支撑。通过将六条经典控制论的公理映射到六个代理设计原则,并将这些原则综合为三个工程目标(可靠性、终身运行和自我改进),我们提出了一个名为代理控制论的框架。代码生成、计算机使用和自动化研究这三个应用领域通过识别失败模式和具体的工程建议,展示了代理控制论的分析框架。我们希望代理控制论能够开辟新的研究领域,并为基础代理在原则性、可靠的现实世界部署中建立科学基础。
cs.AI / 208 / 2605.10763
MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study
MATRA:代理人工智能系统攻击面建模——OpenClaw案例研究
Abstract
LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to assess how known threat classes translate into concrete risks within a specific agentic deployment. We present MATRA, a pragmatic threat modeling framework for agentic AI systems that adapts established risk assessment methodology to systematically assess how known LLM threats translate into deployment-specific risks. MATRA begins with an asset-based impact assessment and utilizes attack trees to determine the likelihood of these impacts occurring within the system architecture. We demonstrate MATRA on a personal AI agent deployment using OpenClaw, quantifying how architectural controls such as network sandboxing and least-privilege access reduce risk by limiting the blast radius of successful injections.
Chinese Translation
大型语言模型(LLMs)越来越多地被部署为具有访问工具、数据库和外部服务的自主代理,然而,不同领域的从业者缺乏系统的方法来评估已知威胁类别如何转化为特定代理部署中的具体风险。我们提出了MATRA,一个针对代理人工智能系统的务实威胁建模框架,该框架将已建立的风险评估方法论进行了调整,以系统地评估已知LLM威胁如何转化为特定部署的风险。MATRA从基于资产的影响评估开始,并利用攻击树来确定这些影响在系统架构中发生的可能性。我们在使用OpenClaw的个人AI代理部署上展示了MATRA,量化了网络沙箱和最小权限访问等架构控制如何通过限制成功注入的爆炸半径来降低风险。
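The attack-tree step described in the MATRA abstract can be illustrated with a minimal sketch. It assumes independent leaf events and only AND/OR gates; the `Node` class, the gate names, the example tree, and every probability below are hypothetical and are not taken from the paper or from OpenClaw.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a hypothetical attack tree."""
    name: str
    gate: str = "LEAF"   # "LEAF", "AND", or "OR"
    p: float = 0.0       # leaf success probability (illustrative value only)
    children: List["Node"] = field(default_factory=list)


def likelihood(node: Node) -> float:
    """Probability that the goal at `node` is reached, assuming independent leaves."""
    if node.gate == "LEAF":
        return node.p
    probs = [likelihood(child) for child in node.children]
    if node.gate == "AND":
        # Every child step must succeed.
        result = 1.0
        for q in probs:
            result *= q
        return result
    if node.gate == "OR":
        # At least one child step must succeed.
        miss = 1.0
        for q in probs:
            miss *= 1.0 - q
        return 1.0 - miss
    raise ValueError(f"unknown gate: {node.gate}")


def example_tree(egress_p: float) -> Node:
    """Data exfiltration needs a successful injection AND some way out of the host."""
    return Node("exfiltrate", "AND", children=[
        Node("prompt injection succeeds", p=0.5),
        Node("data leaves host", "OR", children=[
            Node("network egress", p=egress_p),
            Node("write to shared folder", p=0.5),
        ]),
    ])
```

Under these made-up numbers, setting `egress_p` to 0 (a stand-in for network sandboxing) lowers the root likelihood from 0.45 to 0.25, which is the kind of before/after comparison the abstract describes.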
cs.AI / 209 / 2605.10782
TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
TrajPrism:一个面向语言基础的城市轨迹理解的多任务基准
Abstract
Urban mobility is naturally expressed both as trajectories in space and as natural-language descriptions of travel intent, constraints, and preferences. However, prior work rarely evaluates these two modalities together on the same real-world trajectories: trajectory modeling often stays geometry-centric, while language-centric mobility benchmarks frequently target route planning and tool use rather than fine-grained, verifiable alignment between text and the underlying route. We introduce TrajPrism, a multi-task benchmark for language-trajectory alignment that unifies (i) instruction-conditioned trajectory generation, (ii) language-driven semantic trajectory retrieval, and (iii) trajectory captioning, together with an evaluation protocol that measures trajectory fidelity, retrieval quality, and language groundedness. We construct TrajPrism by pairing real urban trajectories with judge-filtered language annotations generated under a four-dimensional travel-intent taxonomy. The benchmark contains 300K selected trajectories across Porto, San Francisco, and Beijing, yielding 2.1M task instances from three instruction variants, three retrieval queries, and one caption per trajectory. We further develop proof-of-concept models for each task: TrajAnchor for instruction-conditioned trajectory generation, TrajFuse for semantic trajectory retrieval, and TrajRap for trajectory captioning. These models instantiate the proposed tasks and show that geometry-only trajectory baselines leave a large gap on our protocol, especially where language is part of the input-output interface. We release TrajPrism with code and a reproducible annotation pipeline that is designed to be portable across cities, given compatible trajectory inputs and map resources.
Chinese Translation
城市出行自然以空间轨迹和旅行意图、约束及偏好的自然语言描述两种形式表达。然而,以往的研究很少在相同的真实世界轨迹上同时评估这两种模态:轨迹建模往往以几何为中心,而以语言为中心的出行基准通常关注路线规划和工具使用,而非文本与基础路线之间的细粒度、可验证的对齐。我们提出了TrajPrism,一个用于语言与轨迹对齐的多任务基准,统一了(i)基于指令的轨迹生成,(ii)基于语言的语义轨迹检索,以及(iii)轨迹描述,并提供一个评估协议,用于测量轨迹的忠实度、检索质量和语言基础性。我们通过将真实的城市轨迹与在四维旅行意图分类下生成的经过评审的语言注释配对,构建了TrajPrism。该基准包含来自波尔图、旧金山和北京的30万个选定轨迹,产生了210万个任务实例,涵盖三种指令变体、三种检索查询和每条轨迹一个描述。我们进一步为每个任务开发了概念验证模型:TrajAnchor用于基于指令的轨迹生成,TrajFuse用于语义轨迹检索,以及TrajRap用于轨迹描述。这些模型实现了所提出的任务,并表明仅基于几何的轨迹基线在我们的协议中存在较大差距,尤其是在语言作为输入输出接口的一部分时。我们发布了TrajPrism,包括代码和一个可重复的注释管道,旨在能够跨城市移植,只要提供兼容的轨迹输入和地图资源。
cs.AI / 210 / 2605.10787
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP:在动态、相互依赖和大规模工具沙箱中评估LLM代理
Abstract
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far below human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.
Chinese Translation
当前的LLM代理擅长调用孤立的API,但在商业软件自动化的“最后一公里”上却显得力不从心。在现实场景中,工具并不是独立的;它们是原子性的、相互依赖的,并且容易受到环境噪声的影响。我们引入了$\textbf{ComplexMCP}$,一个旨在评估代理在这些严格条件下表现的基准。基于模型上下文协议(Model Context Protocol, MCP),$\textbf{ComplexMCP}$提供了300多个经过严格测试的工具,这些工具源自7个有状态的沙箱,涵盖了从办公套件到金融系统的广泛应用。与现有数据集不同,我们的基准利用种子驱动架构来模拟动态环境状态和不可预测的API故障,从而确保了确定性和多样性的评估。我们在全上下文和RAG范式下评估了多种LLM,揭示了显著的性能差距:即使是顶级模型的成功率也未能超过60%,远低于人类的90%表现。细致的轨迹分析识别出三个基本瓶颈:(1)$\textbf{工具检索饱和}$,随着行动空间的扩大;(2)$\textbf{过度自信}$,代理跳过必要的环境验证;以及(3)$\textbf{战略性失败主义}$,倾向于合理化失败而不是追求恢复。这些发现凸显了当前代理在相互依赖工作流程中的不足,将$\textbf{ComplexMCP}$定位为下一代韧性自主系统的重要测试平台。
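One way a seed-driven design can make API failures both unpredictable to the agent and reproducible across runs is to derive each failure decision from a hash of the seed, the tool name, and the step index. The sketch below assumes exactly that; `should_fail`, its parameters, and the hashing scheme are hypothetical illustrations, not ComplexMCP's actual mechanism.

```python
import hashlib


def should_fail(seed: int, tool: str, step: int, failure_rate: float = 0.2) -> bool:
    """Decide deterministically whether a simulated tool call fails.

    The same (seed, tool, step) triple always gives the same answer, so an
    evaluation run can be replayed exactly, while different seeds give
    different failure patterns.
    """
    digest = hashlib.sha256(f"{seed}:{tool}:{step}".encode("utf-8")).digest()
    # Map the first 4 bytes to a number in [0, 1).
    u = int.from_bytes(digest[:4], "big") / 2**32
    return u < failure_rate
```

Keying the decision on the call itself rather than on a shared random stream means the outcome does not depend on how many other calls the agent made earlier, which keeps trajectories comparable across agents.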
cs.AI / 211 / 2605.10791
PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering
PathISE:用于知识图谱问答的有信息路径监督学习
Abstract
Knowledge Graph Question Answering (KGQA) aims to answer user questions by reasoning over Knowledge Graphs (KGs). Recent KGQA methods mainly follow the retrieval-augmented generation paradigm to ground Large Language Models (LLMs) with structured knowledge from KGs. However, training effective models to retrieve question-relevant evidence from KGs typically requires high-quality intermediate supervision signals, such as question-relevant paths or subgraphs, which are time- and resource-intensive to obtain. We propose PathISE, a novel framework for learning high-quality intermediate supervision from answer-level labels. PathISE introduces a lightweight transformer-based estimator that estimates the informativeness of relation paths to construct pseudo path-level supervision. This supervision is then distilled into an LLM path generator, whose generated paths are grounded in the KG to provide compact evidence for inductive answer reasoning. Extensive experiments on three KGQA benchmarks show that PathISE achieves competitive or state-of-the-art KGQA performance, and provides reusable supervision signals that can enhance existing KGQA models, without relying on costly LLM-refined supervision signals. Our source code is available at https://anonymous.4open.science/r/PathISE-2F87.
Chinese Translation
知识图谱问答(KGQA)旨在通过对知识图谱(KGs)的推理来回答用户问题。近期的KGQA方法主要遵循检索增强生成范式,以利用来自KGs的结构化知识来支持大型语言模型(LLMs)。然而,训练有效的模型以从KGs中检索与问题相关的证据通常需要高质量的中间监督信号,例如与问题相关的路径或子图,这些信号的获取通常耗时且资源密集。我们提出了PathISE,一个从答案级标签中学习高质量中间监督的新框架。PathISE引入了一种轻量级的基于变换器的估计器,用于估计关系路径的信息量,以构建伪路径级监督。然后,这种监督被提炼到一个LLM路径生成器中,其生成的路径在KG中得到支持,以提供紧凑的证据用于归纳答案推理。在三个KGQA基准上的广泛实验表明,PathISE实现了具有竞争力或最先进的KGQA性能,并提供了可重用的监督信号,这些信号可以增强现有的KGQA模型,而无需依赖昂贵的LLM精炼监督信号。我们的源代码可在https://anonymous.4open.science/r/PathISE-2F87获取。
cs.AI / 212 / 2605.10796
Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition
可解释的机器学习在足球表现分析中的应用:精英联赛与大学比赛之间有限可转移性的证据
Abstract
Machine learning has become increasingly prevalent in football performance analysis, yet most studies prioritize predictive accuracy while implicitly assuming that learned performance determinants and their interpretations are transferable across competition levels. Whether interpretability remains reliable under domain shift from elite to university football remains largely unexplored. This study investigates whether performance determinants learned from elite competitions are structurally transferable to university-level football and whether their interpretations remain robust under domain shift. Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS). Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. These findings suggest that interpretability robustness is domain-dependent. Rather than reflecting methodological limitations alone, instability in explanations under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.
Chinese Translation
机器学习在足球表现分析中的应用日益普及,但大多数研究优先考虑预测准确性,同时隐含地假设所学习的表现决定因素及其解释在不同竞争水平之间是可转移的。表现决定因素在从精英足球到大学足球的领域转变中是否仍然可靠,尚未得到充分探讨。本研究调查了从精英比赛中学习到的表现决定因素是否在结构上可转移到大学级别的足球,以及它们的解释在领域转变下是否仍然稳健。模型基于来自五大欧洲联赛的大规模事件数据进行训练,并在相同特征空间下应用于国立清华大学(NTHU)的大学足球数据。使用SHapley Additive exPlanations (SHAP) 和 Counterfactual Impact Score (CIS) 对随机森林和多层感知器模型进行了解释。在五个实验中,精英足球在不同联赛、模型和解释方法之间表现出稳定且一致的表现决定因素层级。相比之下,NTHU大学足球显示出关键指标的显著重新排序、解释稳定性降低、与精英领域的结构一致性较弱,以及对解释方法的敏感性增加。这些发现表明,可解释性的稳健性依赖于领域。领域转变下的解释不稳定性不仅反映了方法论的局限性,还可能作为目标领域结构模糊的诊断信号。
cs.AI / 213 / 2605.10804
New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach
新型人工智能驱动工具提升校园福祉:预防与干预方法
Abstract
Campus well-being underpins academic success, yet many universities lack effective methods for monitoring satisfaction and detecting mental health risks. This dissertation addresses these gaps through prevention (improving feedback collection) and intervention (advancing mental health detection), unified under an integrated framework. For prevention, we developed TigerGPT, a personalized survey chatbot leveraging LLMs to engage users in context-aware conversations grounded in conversational design and engagement theory, achieving 75% usability and 81% satisfaction. To address its limitations in repetitiveness and response depth, we introduced AURA, a reinforcement-learning framework that adapts follow-up question types (validate, specify, reflect, probe) within a session using an LSDE quality signal (Length, Self-disclosure, Emotion, Specificity), initialized from 96 prior conversations. AURA achieved +0.12 mean quality gain (p=0.044, d=0.66), with 63% fewer specification prompts and 10x more validation behavior. For intervention, we examine Expressive Narrative Stories (ENS) for mental health screening, showing BERT(128) captures nuanced linguistic features without keyword cues, while conventional classifiers depend heavily on explicit mental health terms. We then developed PsychoGPT, an LLM built on DSM-5 and PHQ-8 guidelines that performs initial distress classification, symptom-level scoring, and reconciliation with external ratings for explainable assessment. To reduce hallucinations, we proposed Stacked Multi-Model Reasoning (SMMR), layering expert models where early layers handle localized subtasks and later layers reconcile findings, outperforming single-model solutions on DAIC-WOZ in accuracy, F1, and PHQ-8 scoring. Finally, a cohesive framework unifies these tools, enabling adaptive survey insights to flow directly into specialized mental health detection models.
Chinese Translation
校园福祉是学术成功的基础,但许多大学缺乏有效的满意度监测和心理健康风险检测方法。本论文通过预防(改善反馈收集)和干预(推进心理健康检测)来解决这些问题,并在一个综合框架下统一。为了预防,我们开发了TigerGPT,这是一款个性化的调查聊天机器人,利用大型语言模型(LLMs)与用户进行基于对话设计和参与理论的情境感知对话,达到了75%的可用性和81%的满意度。为了解决其在重复性和回应深度方面的局限性,我们引入了AURA,这是一种强化学习框架,在会话中根据LSDE质量信号(长度、自我披露、情感、特异性)调整后续问题类型(验证、具体化、反思、探查),其初始化基于96个先前的对话。AURA实现了+0.12的平均质量提升(p=0.044,d=0.66),减少了63%的具体化提示,并增加了10倍的验证行为。在干预方面,我们考察了表达性叙事故事(ENS)用于心理健康筛查,结果显示BERT(128)能够捕捉细微的语言特征而无需关键词提示,而传统分类器则严重依赖于明确的心理健康术语。随后,我们开发了PsychoGPT,这是一款基于DSM-5和PHQ-8指南的大型语言模型,能够进行初步的痛苦分类、症状级评分,并与外部评分进行调和,以实现可解释的评估。为减少幻觉现象,我们提出了堆叠多模型推理(SMMR),通过分层专家模型,其中早期层处理局部子任务,后期层调和结果,在DAIC-WOZ数据集上在准确性、F1值和PHQ-8评分方面优于单模型解决方案。最后,一个统一的框架将这些工具整合在一起,使自适应调查洞察能够直接流入专业的心理健康检测模型。
cs.AI / 214 / 2605.10805
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
推理并非无代价:针对大型语言模型作为裁判的稳健自适应成本高效路由
Abstract
Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings suggest that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal--dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy--cost trade-offs under distribution shift.
Chinese Translation
具备推理能力的大型语言模型(LLMs)最近被应用于自动裁判,但在LLM作为裁判的环境中,其优势和成本仍不明确。通过对推理裁判与非推理裁判的控制比较,我们发现显式推理显著提高了在需要结构化验证的任务(例如数学和编程)中的判断准确性,而在更简单的评估中则提供有限甚至负面的收益,并且会产生显著更高的计算成本。这些发现表明,推理应当选择性使用,而非普遍适用,同时需意识到可能的分布转移。我们提出了一种稳健自适应成本高效路由(Robust Adaptive Cost-Efficient Routing,RACER),该方法在固定预算下动态选择推理和非推理裁判,通过将路由问题表述为受限的分布鲁棒优化问题来实现。RACER明确考虑了分布转移,通过KL散度不确定性集来处理,采用高效的原始-对偶算法,并享有理论保证,包括最优策略的唯一性和线性收敛性。大量实验表明,RACER在分布转移下实现了更优的准确性-成本权衡。
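The idea of using reasoning selectively under a fixed budget can be shown with a deliberately simplified sketch. It is not RACER: it omits the KL-divergence uncertainty set and the primal-dual algorithm, and it assumes a uniform extra cost per reasoning call and a per-query estimate of the accuracy gain from reasoning. The function name, its parameters, and the gain estimates are all hypothetical.

```python
from typing import List


def route_to_reasoning(gains: List[float], cost_per_call: float, budget: float) -> List[bool]:
    """Choose which queries are sent to the reasoning judge.

    gains[i]      estimated accuracy gain from reasoning on query i (assumed given)
    cost_per_call extra cost of one reasoning call over a non-reasoning call
    budget        total extra cost that may be spent
    """
    if cost_per_call <= 0:
        raise ValueError("cost_per_call must be positive")
    max_calls = int(budget // cost_per_call)
    # With a uniform cost, spending the budget on the largest gains is optimal.
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    # Never pay for reasoning where it is expected to hurt or do nothing.
    chosen = {i for i in ranked[:max_calls] if gains[i] > 0}
    return [i in chosen for i in range(len(gains))]
```

For example, with estimated gains `[0.30, -0.05, 0.10, 0.02]`, unit cost, and a budget of 2, only the first and third queries go to the reasoning judge; the query with a negative estimated gain is never routed, however large the budget.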
cs.AI / 215 / 2605.10813
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch:面向个性化研究自动化的技能、记忆与策略协同演化
Abstract
LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user's research history. Label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.
Chinese Translation
基于大型语言模型(LLM)的多智能体系统现在可以自动化从构思到论文写作的完整研究流程,但一个根本性的问题仍然存在:自动化是为谁服务的?研究人员在不同的资源配置下工作,持有不同的方法论偏好,并针对不同的输出格式。一个无视这些差异而产生统一输出的系统将系统性地无法满足每个用户的需求,因此个性化是研究自动化真正可用的前提。然而,实现这一目标需要当前系统所缺乏的三种能力:在项目之间积累可重用的程序知识、在会话之间保留用户特定的经验,以及内化抵制明确形式化的隐性偏好。我们提出了纳米研究(NanoResearch),这是一个通过三层共同演化来解决这些问题的多智能体框架。技能库将重复操作提炼为可在项目间重用的紧凑程序规则。记忆模块维护用户和项目特定的经验,使规划决策基于每个用户的研究历史。无标签的政策学习将自由形式的反馈转化为规划者的持久参数更新,重塑后续的协调。这三层共同演化:可靠的技能产生更丰富的记忆,更丰富的记忆为更好的规划提供信息,而偏好的内化不断地将循环重新对齐到每个用户。大量实验表明,纳米研究在先进的人工智能研究系统上提供了显著的提升,并在后续周期中逐步自我优化,以更低的成本产生更好的研究成果。
cs.AI / 216 / 2605.10815
Probing Cross-modal Information Hubs in Audio-Visual LLMs
探究音频-视觉大语言模型中的跨模态信息中心
Abstract
Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
Chinese Translation
音频-视觉大语言模型(AVLLMs)最近作为一种强大的架构出现,能够对音频、视觉和文本模态进行联合推理。在AVLLMs中,音频和视频模态之间的双向交互引入了复杂的处理动态,迫切需要对其内部机制进行更深入的理解。然而,与已被广泛研究的仅文本或大型视觉语言模型不同,AVLLMs 的内部工作机制仍然在很大程度上未被探索。本文重点关注 AVLLMs 中音频和视觉模态之间的跨模态信息流,研究来自一种模态的信息在另一种模态的标记表示中是如何编码的。通过对多个近期 AVLLMs 的分析,我们发现了两个共同的结果。首先,AVLLMs 主要在汇聚标记中编码集成的音频-视觉信息。其次,汇聚标记并不均匀地持有跨模态信息。相反,我们称之为跨模态汇聚标记的一个特定子集,专门用于存储此类信息。基于这些发现,我们进一步提出了一种简单的无训练幻觉缓解方法,通过鼓励依赖于跨模态汇聚标记中的集成跨模态信息来实现。我们的代码可在 https://github.com/kaistmm/crossmodal-hub 获取。
cs.AI / 217 / 2605.10817
CLEF: EEG Foundation Model for Learning Clinical Semantics
CLEF:用于学习临床语义的脑电图基础模型
Abstract
Clinical EEG interpretation requires reasoning over full EEG sessions and integrating signal patterns with clinical context. Existing EEG foundation models are largely designed for short-window decoding and do not incorporate clinical context. We introduce CLEF, a clinically grounded long-context EEG foundation model. CLEF represents EEG sessions as 3D multitaper spectrogram tokens, enabling tractable Transformer modeling at session scale, and aligns embeddings with neurologist reports and structured EHR data through contrastive objectives. We evaluate CLEF on a new 234-task benchmark spanning disease phenotypes, medication exposures, and EEG findings, with more than 260k EEG sessions from over 108k patients. CLEF outperforms prior EEG foundation models on 229 of 234 tasks, improving mean AUROC from 0.65 to 0.74. Reconstruction-only pretraining surpasses prior EEG foundation models, while report and EHR alignment yields further gains. Held-out concept and external-cohort experiments suggest that these representations transfer beyond observed alignment targets. These results support session-scale, clinically grounded representation learning as a promising foundation-model paradigm for clinical EEG.
Chinese Translation
临床脑电图(EEG)解读需要对完整的EEG会话进行推理,并将信号模式与临床背景相结合。现有的EEG基础模型主要设计用于短窗口解码,未能纳入临床背景。我们提出了CLEF,一个以临床为基础的长上下文EEG基础模型。CLEF将EEG会话表示为3D多锥谱图标记,使得在会话规模上进行可处理的Transformer建模成为可能,并通过对比目标将嵌入与神经科医生报告和结构化电子健康记录(EHR)数据对齐。我们在一个新的234任务基准上评估CLEF,该基准涵盖疾病表型、药物暴露和EEG发现,包含来自超过10.8万名患者的26万多EEG会话。CLEF在234个任务中的229个任务上优于先前的EEG基础模型,将平均AUROC从0.65提高到0.74。仅重建预训练的表现超过了先前的EEG基础模型,而报告和EHR对齐则带来了进一步的提升。保留概念和外部队列实验表明,这些表示在观察到的对齐目标之外具有迁移能力。这些结果支持会话规模的、以临床为基础的表示学习作为临床EEG的一个有前景的基础模型范式。
cs.AI / 218 / 2605.10820
MaD Physics: Evaluating information seeking under constraints in physical environments
MaD物理学:在物理环境中评估受限条件下的信息获取
Abstract
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
Chinese Translation
科学发现本质上是一个资源受限的过程,需要在物理和成本限制下在测量的质量与数量之间进行复杂的权衡。测量推动科学过程,通过揭示新现象来提高我们的理解。现有的科学发现代理评估基准主要集中在静态知识基础推理或不受限制的实验设计任务上,并未捕捉到在约束条件下进行测量和规划的能力。为填补这一空白,我们提出了测量与发现物理学(Measuring and Discovering Physics,MaD Physics),这是一个评估代理在测量质量和数量受限的情况下进行信息性测量和得出结论能力的基准。该基准由三个环境组成,每个环境基于不同的物理定律。为了减轻现有知识的干扰,MaD Physics包含了修改后的物理定律。在每次试验中,代理对系统进行测量,直到耗尽分配的预算,然后代理必须推断出潜在的物理定律,以对系统未来的状态进行预测。MaD Physics评估科学代理的两个基本能力:从数据中推断模型和在约束条件下进行规划。我们还展示了如何使用MaD Physics评估其他能力,如多模态性和上下文学习。我们使用四个Gemini模型(2.5 Flash Lite、2.5 Flash、2.5 Pro和3 Flash)在MaD Physics上对代理进行基准测试,识别其结构化探索和数据收集能力的不足,并强调改进其科学推理的方向。
cs.AI / 219 / 2605.10828
The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
第一滴墨水:误导信息在长上下文推理中的非线性影响
Abstract
As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet misleading documents degrade performance, but the quantitative relationship between the proportion of distractors and performance remains unstudied. In this work, we systematically vary the hard-distractor proportion in fixed-length contexts, revealing a striking nonlinear pattern: as the proportion of hard distractors increases, performance drops sharply within the first small fraction, while the remainder of the range yields only marginal additional decline. We term this "The First Drop of Ink" effect, analogous to how a single drop of ink contaminates water. Our theoretical and empirical analyses grounded in attention mechanics show that hard distractors capture disproportionate attention even at small proportions, with diminishing marginal impact as their proportion grows. Controlled experiments further show that filtering gains mainly come from context-length reduction rather than distractor removal; substantial recovery requires reducing the hard-distractor proportion to near zero, highlighting the importance of upstream retrieval precision.
Chinese Translation
随着大型语言模型在增强检索生成和积累广泛上下文的智能系统中的应用日益增多,理解干扰信息如何影响长上下文性能变得至关重要。先前的研究表明,语义相关但误导性的文档会降低性能,但干扰项比例与性能之间的定量关系尚未得到研究。在本研究中,我们系统地改变了固定长度上下文中的难干扰项比例,揭示了一个显著的非线性模式:随着难干扰项比例的增加,性能在最初的一个小比例内急剧下降,而其余范围仅导致边际额外下降。我们将其称为“第一滴墨水”效应,类似于一滴墨水如何污染水。我们基于注意力机制的理论和实证分析表明,即使在小比例下,难干扰项也会捕获不成比例的注意力,随着其比例的增加,边际影响逐渐减小。控制实验进一步表明,过滤收益主要来自上下文长度的减少,而非干扰项的移除;要实现显著恢复,需要将难干扰项比例降低到接近零,这突显了上游检索精度的重要性。
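A toy calculation shows why attention captured by hard distractors can rise steeply at small proportions and then flatten. It assumes a single softmax attention distribution in which every hard-distractor token has the same logit advantage `a` over every other token, so a fraction `p` of distractors receives mass `p*e^a / (p*e^a + (1 - p))`. This model, the function name, and the value `a = 3` are assumptions made for illustration; they are not the paper's analysis.

```python
import math


def distractor_attention_mass(p: float, a: float = 3.0) -> float:
    """Share of softmax attention landing on hard distractors.

    p: fraction of context tokens that are hard distractors, in [0, 1]
    a: logit advantage of each distractor token over the remaining tokens
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    weighted = p * math.exp(a)
    return weighted / (weighted + (1.0 - p))
```

With `a = 3`, a 5% distractor share already captures roughly half of the attention mass, while going from 5% to 50% adds less than that first 5% did. The curve is concave in `p`, which matches the sharp-then-marginal pattern described in the abstract.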
cs.AI / 220 / 2605.10834
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
从受控环境到真实世界:对渗透测试代理的评估
Abstract
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
Chinese Translation
人工智能渗透测试代理作为进攻性安全系统的可信度日益提高,但当前的基准测试仍然对哪些代理在真实世界目标中表现最佳提供有限指导。现有的评估协议在简化或狭窄的环境中评估和优化预定义目标,如夺旗赛、远程代码执行、漏洞重现或轨迹相似性。这些工具在测量有限能力方面具有价值,但并未充分捕捉到真实渗透测试中所需的复杂性、开放式探索和战略决策。在本文中,我们提出了一种实用的评估协议,将评估重点从任务完成转向经过验证的漏洞发现,从而允许在跨越多个攻击面和漏洞类别的足够复杂的目标中进行评估。该协议结合了结构化的真实数据与基于大型语言模型(LLM)的语义匹配来识别漏洞,采用二分法解决方案在现实模糊性下对发现进行评分,持续维护真实数据,对随机代理进行重复和累积评估,效率指标,以及可持续实验的简化选择。该协议通过实现对人工智能渗透测试代理的更现实、操作性更强的比较,扩展了现有技术的前沿。为了实现可重复性,我们还发布了专家注释的真实数据和所提评估协议的代码: https://github.com/jd0965199-oss/ethibench。
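The bipartite step in such a protocol can be sketched as a maximum matching between reported findings and ground-truth vulnerabilities, so that each ground-truth entry is credited at most once even when several findings plausibly describe it. The sketch below uses the standard augmenting-path method; the compatibility lists stand in for the output of a semantic matcher, and all names and data are hypothetical rather than taken from the released code.

```python
from typing import Dict, List


def match_findings(compat: Dict[str, List[str]]) -> Dict[str, str]:
    """Maximum one-to-one matching of findings to ground-truth vulnerabilities.

    compat[finding] lists the ground-truth ids that the finding could refer to
    (assumed to come from an upstream semantic-matching step).
    Returns a dict mapping each matched finding to its ground-truth id.
    """
    owner: Dict[str, str] = {}  # ground-truth id -> finding currently holding it

    def try_assign(finding: str, seen: set) -> bool:
        for truth in compat[finding]:
            if truth in seen:
                continue
            seen.add(truth)
            # Take a free entry, or move its current holder to another option.
            if truth not in owner or try_assign(owner[truth], seen):
                owner[truth] = finding
                return True
        return False

    for finding in compat:
        try_assign(finding, set())
    return {finding: truth for truth, finding in owner.items()}


def precision_recall(matched: Dict[str, str], n_findings: int, n_truth: int):
    """Precision and recall from the number of matched pairs."""
    hits = len(matched)
    precision = hits / n_findings if n_findings else 0.0
    recall = hits / n_truth if n_truth else 0.0
    return precision, recall
```

If finding `f1` could be either `v1` or `v2` and finding `f2` can only be `v1`, the matching assigns `f1` to `v2` and `f2` to `v1`, so both count. A greedy first-come assignment would have credited only one of them.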
cs.AI / 221 / 2605.10851
The Generalized Turing Test: A Foundation for Comparing Intelligence
广义图灵测试:比较智能的基础
Abstract
We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
Chinese Translation
我们提出了广义图灵测试(Generalized Turing Test, GTT),这是一个通过不可区分性比较任意智能体能力的正式框架。对于智能体 A 和 B,我们定义图灵比较器 A ≥ B,当且仅当 B 作为区分者,无法可靠地区分与 A(被指示模仿 B)和另一个 B 实例的交互时成立。这产生了一种与数据集和任务无关的相对智能概念。我们研究了比较器的结构,包括其可传递性的条件,从而在等价类上引入排序,并定义和分析了带查询、有限交互和固定区分者的变体。为了补充理论,我们在一系列现代模型上实例化该框架,实证评估数千次试验中的成对不可区分性。结果比较展现出与现有排名一致的分层结构,暗示所提出的框架产生了有意义的实证排序。我们的结果将不可区分性作为推理智能的统一视角,建议为评估提供基础,并可能形成与固定数据集或基准本质上独立的训练目标。
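An empirical estimate of the comparator can be sketched as a guessing game: in each trial the distinguisher sees a reply that came either from A imitating B or from B itself, and the comparator is taken to hold when the guessing accuracy stays close to chance. The callables, the `margin` threshold, and the coin-flip protocol below are assumptions made for illustration; the paper's formal definition and its variants may differ.

```python
import random
from typing import Callable, List, Tuple


def turing_comparator(
    imitate_b: Callable[[str], str],          # A's reply when told to imitate B
    respond_b: Callable[[str], str],          # B's own reply
    distinguish: Callable[[str, str], bool],  # B as judge: True means "this came from A"
    prompts: List[str],
    margin: float = 0.1,
    seed: int = 0,
) -> Tuple[bool, float]:
    """Return (estimate of A >= B, distinguisher accuracy)."""
    if not prompts:
        raise ValueError("need at least one prompt")
    rng = random.Random(seed)
    correct = 0
    for prompt in prompts:
        from_a = rng.random() < 0.5  # hidden coin flip chooses the source
        reply = imitate_b(prompt) if from_a else respond_b(prompt)
        if distinguish(prompt, reply) == from_a:
            correct += 1
    accuracy = correct / len(prompts)
    # "Cannot reliably distinguish" is read here as accuracy within `margin` of chance.
    return accuracy <= 0.5 + margin, accuracy
```

A perfect imitator leaves the distinguisher near 50% accuracy, so the estimate comes out true; an imitator with any detectable tell pushes accuracy toward 100% and the estimate comes out false. In practice the margin would be replaced by a proper statistical test over many trials.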
cs.AI / 222 / 2605.10865
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
BenchCAD:一个全面的行业标准基准,用于程序化计算机辅助设计
Abstract
Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
Chinese Translation
工业计算机辅助设计(CAD)代码生成需要模型从视觉或文本输入中生成可执行的参数化程序。除了识别零件的外部形状外,这项任务还涉及理解其三维结构、推断工程参数以及选择反映零件设计和制造方式的CAD操作。尽管多模态大型语言模型(MLLMs)在这项任务中展现了潜力,但它们在现实工业CAD环境中是否具备这些能力的联合评估却很少。我们提出了BenchCAD,一个用于工业CAD推理的统一基准。BenchCAD包含17,900个经过执行验证的CadQuery程序,涵盖106个工业零件系列,包括斜齿轮、压缩弹簧、麻花钻和其他可重用的工程设计。它通过视觉问答、代码问答、图像到代码生成和指令引导的代码编辑来评估模型,能够在感知、参数抽象和可执行程序合成方面进行细致分析。在10多个前沿模型的测试中,BenchCAD显示当前系统通常能够恢复粗略的外部几何形状,但未能生成忠实的参数化CAD程序。常见的失败包括缺失细致的三维结构、误解工业设计参数,以及将关键操作如扫掠、放样和扭曲挤出替换为更简单的草图和挤出模式。微调和强化学习提高了在分布内的性能,但对未见零件系列的泛化能力仍然有限。这些结果使BenchCAD成为衡量和提升多模态CAD自动化工业准备性的基准。
cs.AI / 223 / 2605.10870
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
记住决策,而非描述:一种用于智能体记忆的速率-失真框架
Abstract
Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but because it preserves the distinctions between histories that must remain separated under a fixed budget to support good decisions. We cast this as a decision-centric rate-distortion problem, measuring memory quality by the loss in achievable decision quality induced by compression. This yields an exact forgetting boundary for what can be safely forgotten, and a memory-distortion frontier characterizing the optimal tradeoff between memory budget and decision quality. Motivated by this decision-centric view of memory, we propose DeMem, an online memory learner that refines its partition only when data certify that a shared state would induce decision conflict, and prove near-minimax regret guarantees. On both controlled synthetic diagnostics and long-horizon conversational benchmarks, DeMem yields consistent gains under the same runtime budget, supporting the principle that memory should preserve the distinctions that matter for decisions, not descriptions.
Chinese Translation
长时间跨度的语言智能体必须在有限的运行时内存下操作,但现有的记忆机制往往围绕描述性标准(如相关性、显著性或摘要质量)来组织经验。然而,对于智能体而言,记忆的价值不在于它忠实地描述过去,而在于它保留了在固定预算下必须保持分离的历史之间的区别,以支持良好的决策。我们将其视为一个以决策为中心的速率-失真问题,通过压缩引起的可实现决策质量的损失来衡量记忆质量。这产生了一个可以安全遗忘的确切遗忘边界,以及一个描述记忆预算与决策质量之间最佳权衡的记忆-失真前沿。基于这种以决策为中心的记忆观,我们提出了DeMem,一个在线记忆学习器,仅在数据证明共享状态会引发决策冲突时才会细化其分区,并证明了近最小最大后悔保证。在控制的合成诊断和长时间跨度的对话基准测试中,DeMem在相同的运行时预算下产生了一致的增益,支持记忆应保留对决策重要的区别,而非描述的原则。
cs.AI / 224 / 2605.10913
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd:一种赋能元代理的正式化执行轨迹的运行时基础设施
Abstract
We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem $5\times$ faster than Docker, achieving $>95\%$ prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.
Chinese Translation
我们介绍了Shepherd,这是一种函数式编程模型,将元代理在目标代理上的操作形式化为函数,核心操作在Lean中完成机械化。Shepherd将每个代理与环境的交互记录为一个类型化事件,形成类似Git的执行轨迹,使得任何过去的状态都可以被分支和重放。该系统的代理进程及其文件系统的分支速度比Docker快$5\times$,在重放时实现了超过$95\%$的提示缓存重用。我们通过三个应用实例展示了该模型。首先,在运行时干预中,实时监督将CooperBench上的配对编码通过率从28.8%提高到54.7%。其次,在反事实元优化中,分支探索在四个基准测试中超越基线,提升幅度最高可达11个百分点,同时将实际耗时最多减少58%。第三,在Tree-RL训练中,在选定回合进行分支回放将TerminalBench-2的性能从34.2%提升至39.4%。这些结果确立了Shepherd作为编程元代理的高效基础设施。我们将该系统开源,以支持未来的研究。
cs.CL / 1 / 2605.08334
SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
SalesSim:作为零售用户模拟器的多模态语言模型基准测试与对齐
Abstract
We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator's actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking 6 open and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity and overdisclosure of criteria across personas compared to human conversations. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.
Chinese Translation
我们提出了SalesSim,一个框架和测试平台,用于评估多模态大型语言模型(MLLMs)在多轮、多模态、工具增强的在线零售对话中模拟现实的、以角色为驱动的客户行为的能力。与之前将用户模拟视为表层对话生成的研究不同,SalesSim将零售互动和决策过程建模为一个有根基的、代理性的过程,其中具有不同背景、偏好和交易破坏因素的购物者与销售代理进行互动,寻求澄清,并做出明智的购买决策。为了进行评估,我们设计了一套以决策对齐为中心的指标,测量模拟器的行为与其角色规范之间的一致性,以及对话质量。我们在基准测试6个开源和闭源的最先进模型后发现了几个行为差距。首先,尽管模型能够生成流畅的对话,但与人类对话相比,它们在不同角色之间表现出显著较低的词汇多样性和过度披露标准。其次,模型往往受到销售代理建议的影响,偏离角色规范。即使是最强的模型,其与基础角色规范的平均对齐度也不到79%。为了克服这些局限性,我们提出了UserGRPO,一种多轮、多目标的强化学习方案,旨在优化对话流畅性和在角色规范下的决策对齐。我们的实验表明,UserGRPO使基线模型的决策对齐度提高了13.8%,同时改善了对话质量。通过引入SalesSim,我们为社区提供了一个新的测试平台,以研究和改善用户模拟器在目标导向环境中的遵循性。
cs.CL / 2 / 2605.08346
Sanity Checks for Long-Form Hallucination Detection
长文本幻觉检测的合理性检查
Abstract
Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: \textsc{Force}, which replaces each response's final answer with the ground truth while preserving the reasoning trace, and \textsc{Remove}, which strips answer-announcement steps while leaving the trajectory intact. This reveals if their predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces. These findings suggest that the current central challenge in reasoning-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues.
Chinese Translation
大型语言模型的幻觉检测方法越来越多地基于思维链推理轨迹进行操作,但目前尚不清楚它们是评估推理本身,还是仅仅利用最终答案的表面相关性。我们引入了一种受控不变性方法,通过两个oracle测试揭示了这一区别:\textsc{Force},该测试将每个响应的最终答案替换为真实答案,同时保留推理轨迹;\textsc{Remove},该测试在保持轨迹不变的情况下去除答案公告步骤。这揭示了它们的预测能力是否源自答案级别的伪影,而非中间推理的结构或有效性。我们进一步表明,一旦控制了这些伪影,有效的检测并不一定需要复杂的学习表示:TRACT,一个基于词汇轨迹特征(如规避趋势、步骤长度动态和跨响应词汇收敛)构建的轻量级评分器,展现出强大的鲁棒性,同时在未扰动的轨迹上与现有基线竞争或超越。这些发现表明,当前在推理感知幻觉检测中的主要挑战并非轨迹中缺乏信号,而是未能将其与终点线索隔离开来。
cs.CL / 3 / 2605.08348
How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits
电路能告诉我们多少?测量语言模型电路的一致性和特异性
Abstract
The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and investigate two less-studied properties of it: consistency, the recurrence of components within a task, and specificity, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to $\sim$100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task's circuit damages another task's performance about as much as that task's own circuit does. We discover that this is due to substantial overlap between circuits across tasks, and that the overlapping components are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.
Chinese Translation
机械解释中的电路框架旨在识别模型组件中因果重要的稀疏子图,通常通过测量必要性和充分性来评估。我们测量电路重用,即在一个任务中每个示例电路共享的组件比例,并研究两个较少研究的属性:一致性,即组件在任务中的重复出现,以及特异性,即它们对任务的独特性。通过在六个任务和七个模型中使用边缘归因修补,我们发现任务内重用率很高,且共享组件对任务性能是必要的,消融实验导致的相对准确率下降可达约100%。然而,电路并不是任务特定的:消融一个任务的电路对另一个任务的性能影响与该任务自身电路的影响大致相同。我们发现这是由于不同任务之间电路的显著重叠,这些重叠在性能上具有因果重要性。一些电路确实包含一小部分任务特定的组件,但这些仅占电路性能的适度部分。总体而言,我们的发现表明,尽管在注意力头和多层感知器层级别发现电路可以识别重要组件,但其缺乏任务特异性引发了关于电路在支持针对模型行为的理解和干预方面的有效性的问题。
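The abstract above describes consistency and specificity only in words. A minimal sketch of one way such set-overlap measures could be computed follows; the exact definitions used in the paper are not given here, so the formulas and the function names `consistency` and `specificity` are illustrative assumptions, not the authors' metrics.

```python
def consistency(circuits):
    """Assumed definition: share of a task's components (union over its
    per-example circuits) that appear in every per-example circuit."""
    union = set().union(*circuits)
    shared = set.intersection(*circuits)
    return len(shared) / len(union) if union else 0.0


def specificity(task_circuit, other_circuits):
    """Assumed definition: share of a task's components that appear in
    no other task's circuit."""
    others = set().union(*other_circuits) if other_circuits else set()
    if not task_circuit:
        return 0.0
    return len(task_circuit - others) / len(task_circuit)


# Hypothetical component names, for illustration only.
per_example = [{"h1", "h2", "mlp3"}, {"h1", "h2", "mlp4"}, {"h1", "h2"}]
print(consistency(per_example))
print(specificity({"h1", "h2", "h9"}, [{"h1", "h2", "mlp7"}]))
```

Under these assumed definitions, high consistency together with low specificity would correspond to the pattern the abstract reports: components recur within a task but are mostly shared with other tasks.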
cs.CL / 4 / 2605.08383
Change My View? The Dynamics of Persuasion and Polarization in Online Discourse
改变我的看法?在线话语中的劝说与极化动态
Abstract
Philosophical accounts of persuasion often assume that shared evidence and rational argumentation should lead to a convergence of views between peers, yet everyday discourse often suggests otherwise. In this study, we use large language models to analyze a corpus of debates on Reddit's r/ChangeMyView, where belief revision is publicly signaled. Large language models were asked, halfway through each discussion, to forecast whether such an acknowledgement would arise; their probabilistic estimates serve as a conversational baseline. Each reply was then coded, through a hybrid machine-assisted procedure, for ten familiar rhetorical strategies -- concession, empathy, logical challenge, credibility appeals, and so forth. Adding these strategic features markedly improves predictive power and yields a consistent pattern: moves that express concession or empathetic alignment substantially increase the prospect of belief change, whereas frontal refutation, credibility attacks, and topic deflection diminish it. The findings indicate that effective public reasoning depends as much on relational framing as on evidential content, and they invite a refinement of normative accounts of rational dialogue.
Chinese Translation
劝说的哲学理论通常假设共享的证据和理性论证应导致同伴之间观点的趋同,然而日常话语往往表明情况并非如此。在本研究中,我们利用大型语言模型分析了Reddit的r/ChangeMyView板块中的辩论语料库,在这里信念修正是公开表明的。大型语言模型在每个讨论进行到一半时被要求预测是否会出现这样的承认;它们的概率估计作为对话的基线。随后,通过一种混合的机器辅助程序对每个回复进行了编码,涵盖了十种常见的修辞策略——让步、同理心、逻辑挑战、可信度诉求等。加入这些策略特征显著提高了预测能力,并产生了一种一致的模式:表达让步或同理心的举动显著增加了信念改变的可能性,而正面反驳、可信度攻击和话题转移则降低了这种可能性。研究结果表明,有效的公共推理对关系框架的依赖不亚于对证据内容的依赖,并促使我们对理性对话的规范性理论加以完善。
cs.CL / 5 / 2605.08384
jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition
jina-embeddings-v5-omni:通过冻结塔组合实现的文本-几何保持多模态嵌入
Abstract
In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method is to extend the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.
Chinese Translation
在本研究中,我们介绍了一种新的多模态嵌入模型方法——冻结编码器模型组合。我们基于VLM风格的架构,调整非文本编码器以生成语言模型的输入,语言模型则为各种输入生成嵌入。我们展示了结果:jina-embeddings-v5-omni套件,这是一对模型,将文本、图像、音频和视频输入编码到一个单一的语义嵌入空间中。我们的方法是扩展两个Jina Embeddings v5文本模型,通过添加图像和音频的编码器来支持额外的媒体。主干文本嵌入模型和添加的非文本媒体编码器保持冻结。我们仅训练连接组件,这些组件占联合模型总权重的0.35%。因此,训练效率远高于全参数重训练。此外,语言模型基本保持不变,对文本输入生成的嵌入与Jina Embeddings v5文本模型完全相同。我们的评估表明,这种方法产生的结果与最先进的技术相当,性能几乎与更大规模的多模态嵌入模型相当。
cs.CL / 6 / 2605.08401
AIPO: Learning to Reason from Active Interaction
AIPO:从主动交互中学习推理
Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose $\textbf{AIPO}$, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, $\textit{Verify Agent}$, $\textit{Knowledge Agent}$, and $\textit{Reasoning Agent}$, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.
Chinese Translation
近年来,大型语言模型(LLMs)的进展展示了显著的推理能力,这在很大程度上得益于可验证奖励的强化学习(RLVR)。然而,现有的强化学习算法面临一个根本性限制:它们的探索在很大程度上受到策略模型固有能力边界的限制。尽管最近的方法引入了外部专家示范以扩展这一边界,但它们通常依赖于完整的轨迹级指导,这在样本效率和信息丰富性上都存在不足,并且可能将探索限制在一个静态的指导空间。受到多智能体系统潜力的启发,我们提出了$\textbf{AIPO}$,一个增强的强化学习框架,通过在探索过程中主动的多智能体交互来改善LLM的推理能力。具体而言,AIPO使得策略模型在遇到推理瓶颈时能够主动咨询三个功能性协作智能体,即$\textit{Verify Agent}$、$\textit{Knowledge Agent}$和$\textit{Reasoning Agent}$,从而在训练过程中获得细致且针对性的指导,以主动扩展其能力边界。我们进一步引入了量身定制的重要性采样系数以及剪切策略,以减轻从智能体提供的反馈中学习时出现的离策略偏差和梯度消失问题。经过训练后,策略模型能够独立进行推理,而无需依赖协作智能体。在包括AIME、MATH500、GPQA-Diamond和LiveCodeBench在内的多种推理基准上的大量实验表明,AIPO始终提高了推理性能,能够在不同的策略模型和RLVR算法中稳健地泛化,并有效扩展策略模型的推理能力边界。
cs.CL / 7 / 2605.08404
Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models
利用大型视觉-语言模型从遥感影像推理建筑环境
Abstract
This work investigates the use of large language models (LLMs) for tasks in smart cities. The core idea is to leverage remote sensing imagery to characterize the built environment, including design suggestions, constructability assessment, land-use patterns, and risk identification. We examine remote sensing imagery at multiple spatial scales as inputs for multimodal language modeling and evaluate their effects on built-environment-related reasoning. In addition, we compare state-of-the-art LLMs, including InternVL and Qwen, in terms of accuracy and reliability when generating built environment recommendations. The results demonstrate the potential of integrating remote sensing imagery with large language models to assist smart cities and decision-making.
Chinese Translation
本研究探讨了大型语言模型(LLMs)在智慧城市任务中的应用。核心思想是利用遥感影像来表征建筑环境,包括设计建议、可建性评估、土地利用模式和风险识别。我们在多个空间尺度上考察遥感影像作为多模态语言建模的输入,并评估其对建筑环境相关推理的影响。此外,我们比较了最先进的LLMs,包括InternVL和Qwen,在生成建筑环境建议时的准确性和可靠性。结果表明,将遥感影像与大型语言模型相结合,有助于支持智慧城市及决策制定的潜力。
cs.CL / 8 / 2605.08406
Effective Explanations Support Planning Under Uncertainty
有效的解释支持不确定性下的规划
Abstract
Explaining how to get from A to B can be challenging. It requires mentally simulating what the listener will do based on what they are told. To capture this process, we propose a computational model that converts utterances into action plans: a large language model translates an explanation into program-like guidance (a policy prior and value map), and a planning agent executes it under partial observability. We score explanations by the efficiency and reliability of the resulting paths, penalizing replanning. Across four preregistered experiments, we collect a corpus of 1,200 explanations over 24 maps, elicit helpfulness judgments, measure baseline navigation, and test behavior with explanations of differing quality. Higher-scored explanations are judged more helpful and improve navigation: participants with explanations outperform those without, and high-scoring explanations help more than low-scoring ones. Together, these results show procedural explanation as utility-guided communication shaped by how language can be grounded into action under uncertainty.
Chinese Translation
解释如何从A到达B可能是具有挑战性的。这需要根据所提供的信息在心理上模拟听众将会如何行动。为了捕捉这一过程,我们提出了一种计算模型,将言语转化为行动计划:一个大型语言模型将解释翻译成类似程序的指导(政策先验和价值图),而规划代理在部分可观察性下执行该计划。我们通过结果路径的效率和可靠性来评分解释,并对重新规划进行惩罚。在四个预注册实验中,我们收集了24个地图上的1200个解释的语料库,获取有用性判断,测量基线导航,并测试不同质量解释下的行为。得分较高的解释被认为更有帮助,并改善导航:有解释的参与者表现优于没有解释的参与者,而高得分的解释比低得分的解释更有帮助。这些结果共同表明,程序性解释作为一种实用性导向的交流,受到语言如何在不确定性下与行动相结合的影响。
cs.CL / 9 / 2605.08432
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
用于评估开放式问答中校准的语义采样框架
Abstract
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
Chinese Translation
校准衡量模型预测的置信度是否与其经验准确性相一致,这对于在医学和法律等高风险领域可靠部署大型语言模型(LLMs)至关重要。尽管近期许多研究集中在提高LLM的校准上,但如何在现实环境中评估校准这一同样重要的问题仍然不够成熟。开放式问答(QA)是现代LLM最常见的应用场景,而现有的评估方法在此场景中表现不佳:基于logit的指标需要限制输出格式和内部概率;口头表达的置信度是自我报告的,且往往过于自信;而基于采样的方法依赖于特定任务的提取规则,缺乏明确的有限样本目标。我们提出了Sem-ECE(语义采样期望校准误差),这是一个用于开放式QA的校准评估框架,它从模型中采样答案,将其分组为语义类别,并使用结果频率作为置信度。我们在该框架内研究了两种估计器:Sem$_1$-ECE,相同样本自一致性分数,以及Sem$_2$-ECE,一种将答案选择与置信度评估分开的保留变体。我们证明这两者在渐近上是无偏的,并进一步表明它们在简单问题上达成一致,但在困难问题上存在分歧,Sem$_2$的校准误差严格较小,因此它们之间的差距也可作为问题难度的诊断指标。在五个领先商业LLM的三个开放式QA基准上的实验验证了我们的理论预测,并显示Sem-ECE优于口头表达的置信度和现有的基于采样的方法,同时在内部概率不可用时补充了基于logit的评估。
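The sampling-based confidence described above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: semantic grouping is replaced by a simple string normalization (`normalize` is a stand-in), the held-out variant splits the samples in half (the paper's split is not specified here), and all function names are hypothetical.

```python
from collections import Counter


def normalize(answer):
    """Stand-in for semantic grouping (an assumption): answers that match
    after lowercasing and dropping punctuation share a class."""
    kept = (ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


def same_sample_confidence(samples):
    """Sem_1-style: pick the most frequent class and reuse the same
    samples to estimate its frequency."""
    counts = Counter(normalize(s) for s in samples)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(samples)


def held_out_confidence(samples):
    """Sem_2-style: select the answer on the first half, then measure its
    frequency on the second half only."""
    half = len(samples) // 2
    answer, _ = same_sample_confidence(samples[:half])
    held_out = samples[half:]
    hits = sum(normalize(s) == answer for s in held_out)
    return answer, hits / len(held_out)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: size-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / len(confidences) * abs(accuracy - avg_conf)
    return ece


samples = ["Paris", "paris.", "Lyon", "Paris", "Paris!", "Lyon"]
print(same_sample_confidence(samples))
print(held_out_confidence(samples))
```

In practice the grouping step would need a real semantic-equivalence check (for example an entailment model or an LLM judge); string normalization is used here only to keep the sketch self-contained.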
cs.CL / 10 / 2605.08437
Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Magis-Bench:评估大型语言模型在初级法官级别法律任务上的表现
Abstract
Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to \emph{judge} such arguments -- weighing competing claims, applying doctrine to facts, and rendering reasoned decisions -- is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's $W = 0.984$; pairwise Kendall's $\tau \ge 0.897$), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70\% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.
Chinese Translation
现有的法律人工智能基准主要集中在大型语言模型(LLMs)必须生成法律论据或文件的任务上,然而,\emph{判断}这些论据的能力——权衡竞争性主张、将法律原则应用于事实并作出有理有据的决定——在一个良好运作的法律系统中,可能与辩护本身同样重要。我们介绍了Magis-Bench,这是一个评估LLMs在初级法官级别写作任务上的基准,基于2023年至2025年间进行的巴西司法职位竞争考试。Magis-Bench包含来自八场考试的74个问题,包括具有多轮结构的论述性法律分析问题和要求撰写完整民事及刑事司法判决的实践练习。我们使用LLM-as-a-judge方法评估了23个最先进的LLMs,并以四个独立的前沿模型作为评估者。我们的结果显示出强烈的评审者间一致性(Kendall's $W = 0.984$;成对Kendall's $\tau \ge 0.897$),谷歌的Gemini-3-Pro-Preview获得了最高平均分(6.97/10),其次是Gemini-3-Flash-Preview(6.67)和Claude-4.5-Opus(6.46)。即使是表现最好的模型,其得分也低于最大分数的70\%,这表明司法级别的法律推理和写作对于当前的LLMs仍然具有挑战性。我们发布了完整的基准、模型输出和评估代码,以支持对法律人工智能能力的进一步研究。
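For readers unfamiliar with the agreement statistic quoted above, Kendall's coefficient of concordance $W$ can be computed from the judges' rankings as sketched below. The sketch assumes complete rankings without ties (no tie correction), and the example rankings are made up for illustration; they are not data from the paper.

```python
def kendalls_w(rankings):
    """Kendall's W for m judges ranking n items.

    rankings[j][i] is the rank (1..n) that judge j assigns to item i.
    Assumes no tied ranks, so no tie correction is applied.
    """
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three hypothetical judges in full agreement on four models.
print(kendalls_w([[1, 2, 3, 4]] * 3))
```

$W$ ranges from 0 (no agreement among judges) to 1 (identical rankings), so a value near 1 indicates that the judge models order the evaluated systems almost identically.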
cs.CL / 11 / 2605.08439
Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?
语言模型能否识别乳腺癌放疗的副作用?
Abstract
Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.
Chinese Translation
准确传达癌症治疗的副作用给癌症幸存者至关重要,尤其是在知情同意等场景中,临床医生必须清晰且全面地传达潜在的治疗毒性。然而,由于对不良治疗效果的临床知识缺乏以及电子健康记录(EHR)系统的碎片化,这一任务仍然具有挑战性。大型语言模型(LLMs)有潜力在此任务中提供帮助,尽管它们在肿瘤幸存者背景下的可靠性仍然不甚了解。我们提出了一个面向部署的压力测试框架,用于评估LLM生成的乳腺癌治疗和幸存护理中的放射副作用列表。通过使用21个乳腺癌患者的档案,我们构建了仅在放疗方案上有所不同的配对患者临床场景,以在多种提示机制下评估七个经过指令调优的LLM。随后,我们将LLM的输出与来自两家主要学术医疗中心的知情同意文件中提取的临床医生策划的参考进行比较,该参考由包括七位以上乳腺放射肿瘤学家在内的团队开发。该参考将放射剂量分割、照射区域和位置映射到相关毒性,并按频率和时间起始进行分类。我们发现,各模型对轻微文档更改的敏感性、精确度与召回率之间的权衡,以及对稀有和长期副作用的系统性低召回。在单独使用时,生成的副作用数量限制降低了精确度,而将输出与临床医生策划的副作用列表相结合则显著提高了可靠性和稳健性。这些发现突显了LLM在肿瘤学应用中的重要局限性,并为更安全、更具信息性的幸存者导向应用提出了切实的设计选择。
cs.CL / 12 / 2605.08447
Revisiting the syntax of imperatives in Yemeni Arabic: An Agree across phases approach
重新审视也门阿拉伯语中的命令句法:一种跨阶段的同意方法
Abstract
This article revisits the syntax of imperatives in Yemeni Arabic, proposing an Agree across phases (AAP) approach. I argue that the AAP approach successfully accounts for both simple and complex imperative constructions, including A'-chain structures, by establishing a close interaction between syntax and discourse. The study demonstrates that this interface is motivated by the interpretive and performative functions associated with imperatives, linking informational structure with propositional structure. It is also proposed that the thematic subject of imperatives is a second-person pro, whereas any overt pronominal or nominal element occurring preverbally is not a subject but rather a C-domain element, specifically an aboutness topic. These topics serve as the logical subjects of imperatives and enter into a coreferentiality relationship with pro. This relation is analyzed as AAP involving Match, yielding both local and non-local A'-chains. For core imperatives, viz. those lacking an overt topic, I propose a null topic that (re)merges in Spec,TopP, whose interpretation depends on the discourse.
Chinese Translation
本文重新审视了也门阿拉伯语中命令的句法,提出了一种跨阶段的同意方法(Agree across phases, AAP)。我认为,AAP 方法成功地解释了简单和复杂命令结构,包括 A'-链结构,通过建立句法与话语之间的紧密互动。研究表明,这种接口是由与命令相关的解释性和表演性功能所驱动的,将信息结构与命题结构联系起来。还提出命令的主题主语是一个第二人称 pro,而任何在动词前出现的显性代名词或名词元素并不是主语,而是 C 领域元素,确切地说是关涉话题(aboutness topic)。这些话题作为命令的逻辑主语,并与 pro 进入共指关系。该关系被分析为涉及匹配(Match)的 AAP,产生局部和非局部的 A'-链。对于核心命令,即缺乏显性话题的命令,我提出一个空话题在 Spec,TopP 中(重新)合并,其解释依赖于话语。
cs.CL / 13 / 2605.08462
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
基准测试是否低估了大型语言模型的性能?通过以LLM为主的人类裁定评估幻觉检测
Abstract
Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as RAG and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving 2 cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Similarly, model accuracy improved, with GPT increasing by 4.25% on QAGS-C and 2.34% on SummEval, while Gemini showed gains of 8.51% and 3.80%, respectively. Notably, adjudicators frequently sided with the models' judgments over original human annotations when LLMs provided explicit reasoning. Overall human adjudicator agreement ranged between 83% and 87%. These findings suggest that for ambiguity-prone tasks, single-pass annotations may be insufficient, and model-assisted re-evaluation yields more reliable benchmarks.
Chinese Translation
幻觉仍然是大型语言模型(LLMs)面临的一个持续挑战,尤其是在基于上下文的设置中,如检索增强生成(RAG)和自主人工智能系统。本文研究了摘要任务中的上下文幻觉检测。我们通过比较原始基准注释与Gemini 2.5 Flash和GPT-5 Mini的基于推理和跨度的预测,分析了QAGS-C和SummEval数据集。为了应对人类标签与LLM判断之间的系统性差异,我们通过涉及两位跨文化裁定者的人类裁定过程重新评估了所有冲突样本。在此重新评估后,QAGS-C和SummEval的三方一致性(人类、GPT和Gemini之间)分别提高了6.38%和7.62%。同样,模型的准确性也有所提高,GPT在QAGS-C上提高了4.25%,在SummEval上提高了2.34%,而Gemini的提升分别为8.51%和3.80%。值得注意的是,当LLMs提供明确的推理时,裁定者经常倾向于支持模型的判断而非原始人类注释。总体人类裁定者的一致性在83%到87%之间。这些发现表明,对于模糊性较强的任务,单次注释可能不足,而模型辅助的重新评估能够产生更可靠的基准。
cs.CL / 14 / 2605.08468
PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
PYTHALAB-MERA:基于验证的记忆、检索和接受控制用于冻结的LLM编码代理
Abstract
Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.
Chinese Translation
基于本地LLM的编码代理越来越多地在执行反馈、持久状态和有限修复的环境中工作,而不是通过单一流畅的答案来获得正确性。静态检索、长上下文提示、自我精炼、执行反馈修复和基于模型权重的强化学习各自解决了这一环境的部分问题,但它们并未共同提供基于验证的情节记忆、自适应检索-行动选择、延迟信用分配以及围绕冻结本地模型的结构化技能重用。我们提出了PYTHALAB-MERA,一种轻量级的外部控制器,用于基于验证的代码生成。冻结的语言模型提出完整的源文件;控制器决定哪些记忆记录和AST派生技能应进入下一个提示,通过快速失败管道验证每个候选项,将验证结果转换为有限的形状奖励,并通过TD(lambda)风格的资格追踪传播延迟信用。我们在具有严格验证门的强化学习编码任务上评估了该实现作为本地CLI工具。在测量的困难RL环境中,涉及三个任务、三次重复和三次尝试预算,PYTHALAB-MERA通过了8/9的严格验证;自我精炼基线和所研究的GRACE扩展均未通过任何验证(0/9)。这些结果支持一个经过深思熟虑的有限主张:在这个记录的环境中,外部记忆和检索控制器提高了验证成功率。它们并未建立通用代码合成、最先进的性能、正式程序正确性或正式安全性。
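The abstract above mentions propagating delayed credit through TD(lambda)-style eligibility traces. A generic tabular sketch of that idea follows; it is textbook TD(lambda) with accumulating traces, not the PYTHALAB-MERA controller itself, and the state names and hyperparameters are assumptions chosen for illustration.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Update `values` in place from one episode.

    `episode` is a list of (state, reward, next_state) transitions, with
    next_state set to None at termination.
    """
    traces = {}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values.get(next_state, 0.0)
        delta = reward + gamma * next_value - values.get(state, 0.0)
        traces[state] = traces.get(state, 0.0) + 1.0  # accumulating trace
        for s, e in traces.items():
            # Every recently visited state receives a share of the TD error.
            values[s] = values.get(s, 0.0) + alpha * delta * e
            traces[s] = gamma * lam * e  # decay the trace toward zero
    return values


# Hypothetical two-step episode: a retrieval choice earns no immediate
# reward, then the generated candidate passes validation (reward 1.0).
values = td_lambda_episode({}, [("retrieve", 0.0, "generate"),
                                ("generate", 1.0, None)])
print(values)
```

The point of the trace is visible in the example: the earlier retrieval step receives part of the credit for the later validation reward, scaled by `gamma * lam`, even though its own immediate reward was zero.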
cs.CL / 15 / 2605.08476
A Computational Operationalisation of Competing Maturational Theories of Syntactic Development via Statistical Grammar Induction
通过统计语法归纳对竞争性成熟理论的计算操作化:句法发展的研究
Abstract
This paper is concerned with what intermediate syntactic categories children acquire during first language development, and in what order. Maturational theories make different predictions. Bottom-up accounts (GROWING) propose that lexical and inflectional structure emerges first, while inward accounts (INWARD) predict early access to discourse-related categories. We computationally operationalise these hypotheses of staged syntactic emergence using statistical grammar induction, asking what each proposed ordering makes learnable when input and learning algorithm are held constant. Our framework makes category acquisition explicit and allows us to explore how different maturational orderings shape the structure that can be learned under identical conditions. Based on this operationalisation, the GROWING account significantly outperforms the INWARD account across three evaluation metrics.
Chinese Translation
本文关注儿童在第一语言发展过程中获得的中间句法类别及其顺序。成熟理论提出了不同的预测。自下而上的解释(GROWING)认为词汇和屈折结构首先出现,而自内而外的解释(INWARD)则预测早期接触与话语相关的类别。我们通过统计语法归纳对这些分阶段句法出现的假设进行计算操作化,探讨在输入和学习算法保持不变的情况下,各种提议的顺序使得什么可以被学习。我们的框架明确了类别的获取,并允许我们探索不同的成熟顺序如何塑造在相同条件下可以学习的结构。基于这种操作化,GROWING 解释在三个评估指标上显著优于 INWARD 解释。
cs.CL / 16 / 2605.08477
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
代理是否需要逐步规划?重新思考数据中心工具调用中的规划视野
Abstract
Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.
Chinese Translation
明确的规划是基于大型语言模型(LLM)的代理解决复杂数据中心任务的关键能力,这些任务需要对外部数据源进行精确的工具调用。现有策略基于规划视野分为两种范式: (1) 全视野(Full-Horizon, FH),在执行之前生成完整的计划;(2) 单步视野(Single-Step Horizon, SH),将每个动作(工具调用)与增量推理和观察交错进行。尽管在假设急切执行监控对适应性是必要的情况下,逐步执行是常见的默认选择,但我们重新审视这一假设,针对明确的数据中心任务。我们的控制实证研究将规划视野作为关键的架构特征,并系统分析拓扑复杂性和工具鲁棒性对这两种范式的影响。我们在知识库问答和多跳问答中的实验表明,使用懒惰重规划的FH规划在不同深度、广度和鲁棒性水平下与SH的准确性相当,同时使用的令牌数量减少了2-3倍。这些发现表明,对于明确的数据中心任务,急切的逐步监控往往是不必要的,而按需重规划的全视野规划可以提供更高效的默认选择。
cs.CL / 17 / 2605.08503
NARRA-Gym for Evaluating Interactive Narrative Agents
用于评估互动叙事代理的 NARRA-Gym
Abstract
Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.
Chinese Translation
互动叙事任务要求大型语言模型(LLMs)在与用户的多轮交互中维持一个连贯且不断发展的故事。然而,适合这一场景的基准测试有限:现有评估往往集中于静态提示、孤立的故事生成或事后评分,因此未能评估模型是否能够共同管理故事生成、长时间上下文状态与节奏、角色模拟、同理心个性化以及故事相关的工件。我们引入了 NARRA-Gym,一个可执行的评估环境,将稀疏的情感种子转化为完整的互动故事剧集,并记录完整的模型循环轨迹,包括故事构建、记忆更新、规划、节奏干预和可选的工件合成。我们使用控制的 LLM 作为评判者的方式,对九个前沿 LLM 进行了评估,涵盖八个基准角色,并进行了一项人类评估,参与者对定制的模型输出进行评分。我们的结果显示模型、角色和评估维度之间存在显著差异:能够生成流畅故事的模型在鲁棒性、用户体验或对敏感个性化的抵抗力方面仍可能失败。这些发现表明,互动叙事为评估长时间跨度、用户自适应的 LLM 行为提供了一个有用的基准,超越了孤立的故事质量。
cs.CL / 18 / 2605.08504
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
单层解释一切:理解大型语言模型中的大规模激活
Abstract
We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, named the Massive Emergence Layer (ME Layer), that is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, both the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.
Chinese Translation
我们研究了大型语言模型(LLMs)中大规模激活的来源,并确定了一个特定的层,称为大规模出现层(Massive Emergence Layer, ME Layer),该层在不同模型系列中始终被观察到,是大规模激活首次出现并随后通过残差连接传播到更深层的地方。我们展示了在ME层内,RMSNorm和FFN参数共同促进了大规模激活的出现。一旦形成,大规模激活的标记表示在各层之间基本保持不变,从而减少了传递给注意力模块的隐藏表示的多样性。基于这一局限性,我们提出了一种简单有效的方法来减少大规模激活标记的刚性。我们的方法在多个任务中持续提高了LLM的性能,包括指令跟随和数学推理,在无训练和微调设置下均表现良好。此外,我们还展示了我们的方法通过选择性减弱注意力汇的影响来缓解其问题,阐明了其在隐藏状态层面的来源,并为原则性缓解策略提供了新的视角。
cs.CL / 19 / 2605.08513
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
单个神经元足以绕过大型语言模型中的安全对齐
Abstract
Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By targeting a single neuron in each system, we demonstrate both directions of failure -- bypassing safety on explicit harmful requests via suppression, and inducing harmful content from innocent prompts via amplification -- across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. Our findings suggest that safety alignment is not robustly distributed across model weights but is mediated by individual neurons that are each causally sufficient to gate refusal behavior -- suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests.
Chinese Translation
语言模型中的安全对齐通过两个机制上不同的系统运作:拒绝神经元控制有害知识是否被表达,而概念神经元则编码有害知识本身。通过针对每个系统中的单个神经元,我们展示了两种失败的方向——通过抑制绕过对明确有害请求的安全性,以及通过放大从无辜提示中诱导有害内容——这一现象在七个模型中得以体现,这些模型跨越了两个家族,参数规模从1.7B到70B,且无需任何训练或提示工程。我们的发现表明,安全对齐并不是在模型权重中稳健分布的,而是由单个神经元介导的,每个神经元都足以因果地控制拒绝行为——抑制任何一个已识别的拒绝神经元都能绕过对多种有害请求的安全对齐。
cs.CL / 20 / 2605.08522
Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation
能力坐标:统一的MTMM-几何框架用于大型语言模型评估
Abstract
The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.
Chinese Translation
大型语言模型(LLMs)的评估面临着构念有效性的重要挑战,碎片化的基准和临时性指标常常将方法方差(如提示敏感性)与真实潜在能力混淆。同时,新兴研究表明,LLM的能力和输出可以建模为连续的几何流形。在本知识系统化(SoK)中,我们通过提出一个广义的多特质多方法(MTMM)框架来连接这些范式,以进行LLM评估。我们对九个评估指标进行了形式化和统一,包括释义不稳定性、漂移分数、奥弗顿宽度和多元化分数,将它们解释为共享潜在坐标空间中的几何测量,而非孤立的标量值。该空间统一将模型行为分解为三个正交的潜在维度:(1)不稳定性和敏感性,(2)位置和对齐,以及(3)覆盖率和表现力。通过系统地将与任务无关的扰动与真实能力范围分离,该框架为稳健且经验稳定的基准设计提供了理论基础和领域无关的分类法。
cs.CL / 21 / 2605.08583
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
来源或不存在:一种多智能体框架用于引用幻觉检测
Abstract
Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from ICLR 2026 and an anonymous conference's desk-rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.
Chinese Translation
大型语言模型在科学写作中的应用日益增多,但它们可能会虚构看似合理但无法通过书目验证的引用。现有的检测器通常将验证简化为二元的找到/未找到决策,并依赖脆弱的解析或不完整的检索,给审计人员提供的现场信号有限。我们将引用幻觉检测重新框架为与分类法对齐的领域级裁决,并引入一个涵盖真实(Real)、潜在(Potential)和幻觉(Hallucinated)引用的12代码分类法。基于该分类法,我们构建了CiteTracer,一个级联多智能体检测器,能够从PDF和BibTeX中提取结构化引用,通过缓存查找、URL获取、学术连接器和网络搜索检索证据,应用确定性的领域匹配,并将模糊案例路由到专业判断者。我们发布了一个基准数据集,其中包含2450个基于真实种子的合成引用,经过控制的LLM变异,配对957个来自ICLR 2026和匿名会议桌面拒稿的真实虚构引用。CiteTracer在合成基准上达到了97.1%的准确率,真实、潜在和幻觉类别的F1得分分别为97.0、95.8和98.5,并在真实世界数据集中检测到97.1%的虚构引用,且没有放弃判断。代码:https://github.com/aaFrostnova/CiteTracer。
cs.CL / 22 / 2605.08600
100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
来自哈萨克斯坦的100,000+电影评论:俄语、哈萨克语和代码切换文本
Abstract
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
Chinese Translation
我们呈现了一个新的公开可用的语料库,包含来自哈萨克斯坦的100,502条电影评论,这些评论来自于kino.kz,涵盖2001年至2025年,涉及4,943个独特标题。该数据集是多语言的,主要由俄语评论组成,同时包含哈萨克语和代码切换文本。评论经过人工标注,标注内容包括语言和情感极性,并且11,309条评论还包含用户提供的显式评分。我们定义了两个情感任务——三类极性分类和五类评分分类——并将经典的词袋模型/TF-IDF基线与多语言变换模型(mBERT, XLM-RoBERTa, RemBERT)进行了基准测试。实验结果表明,在极性分类任务中,变换模型始终优于经典基线,而在评分分类任务中,由于严重的类别不平衡和相邻评分级别之间的微妙区别,在控制泄漏的评估下仍然具有挑战性。
cs.CL / 23 / 2605.08632
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
PARD-2:针对目标对齐的双模式并行草稿模型用于推测解码
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94$\times$ lossless acceleration, surpassing EAGLE-3 by 1.9$\times$ and PARD by 1.3$\times$ on Llama3.1-8B. Our code is available at https://github.com/AMD-AGI/PARD.
Chinese Translation
推测解码通过使用轻量级草稿模型提出候选标记,从而加速大型语言模型(LLMs)的推理,这些候选标记由目标模型并行验证。然而,现有的草稿模型训练目标与推理时最大化连续标记接受的目标并不直接对齐。为了解决这个问题,我们重新制定了草稿模型的优化目标,将重点从标记预测准确性转移到整体接受长度。在本文中,我们在PARD的基础上提出了PARD-2,一种具有置信度自适应标记(Confidence-Adaptive Token, CAT)优化的双模式推测解码框架。这种方法自适应地重新加权每个标记,以更好地与验证过程对齐。值得注意的是,PARD-2使单个草稿模型能够支持目标依赖和目标独立两种模式。针对不同模型和任务的实验表明,PARD-2实现了最高6.94倍的无损加速,超越了EAGLE-3 1.9倍和PARD 1.3倍,在Llama3.1-8B上表现突出。我们的代码可在 https://github.com/AMD-AGI/PARD 获取。
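The gap between token prediction accuracy and acceptance length that the PARD-2 abstract describes can be illustrated with a minimal sketch. It assumes greedy verification, where a draft token is accepted only if it equals the target model's own choice at that position and everything after the first mismatch is discarded. The function names and token IDs are hypothetical, and this is not PARD-2's training objective or its CAT reweighting.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target's greedy choices."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # everything after the first mismatch is rejected
        n += 1
    return n


def token_accuracy(draft_tokens, target_tokens):
    """Fraction of positions where the draft agrees with the target."""
    matches = sum(d == t for d, t in zip(draft_tokens, target_tokens))
    return matches / len(target_tokens)


# Illustrative token IDs only (assumed, not taken from any model).
target = [11, 42, 7, 15, 3]
late_miss = [11, 42, 7, 99, 3]   # one error, at position 3
early_miss = [11, 0, 7, 15, 3]   # one error, at position 1

for name, draft in [("late_miss", late_miss), ("early_miss", early_miss)]:
    print(name, token_accuracy(draft, target), accepted_length(draft, target))
```

Both drafts have the same per-token accuracy, yet the one that errs early has far fewer tokens accepted, which is why an objective that weights tokens uniformly can be misaligned with verification.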
cs.CL / 24 / 2605.08636
EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints
EdgeFlowerTune:在现实边缘系统约束下评估联邦大语言模型微调
Abstract
Federated fine-tuning offers a promising paradigm for adapting large language models (LLMs) on edge devices by leveraging the rich, diverse, and continuously generated data from smartphones and IoT devices without compromising user data privacy. Such edge-side adaptation can improve model personalization, robustness, and responsiveness to local contexts. However, the practical feasibility of federated LLM fine-tuning on real edge devices remains unclear, as most existing work focuses on cross-silo or simulation-based settings, overlooking the resource and runtime constraints that determine whether a method is deployable on real edge systems. We present EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality and system costs, including communication, wall-clock latency, memory usage, energy consumption, and robustness to dynamic edge conditions. To compare methods in terms of effectiveness, efficiency, and robustness, EdgeFlowerTune introduces three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. We instantiate EdgeFlowerTune as a real-device platform built on Flower and MobileFineTuner, spanning commercial Android smartphones and NVIDIA edge development boards. Our benchmark results show that accuracy-only evaluation can lead to misleading conclusions: methods with similar final quality may differ substantially in deployability once realistic system constraints are considered. EdgeFlowerTune provides a reproducible benchmark for system-aware evaluation of federated LLM fine-tuning at the edge.
Chinese Translation
联邦微调为在边缘设备上适应大型语言模型(LLMs)提供了一种有前景的范式,利用来自智能手机和物联网设备的丰富、多样且持续生成的数据,而不妨碍用户数据隐私。这种边缘侧适应可以提高模型的个性化、鲁棒性和对本地环境的响应能力。然而,联邦LLM微调在真实边缘设备上的实际可行性仍不明确,因为现有大多数研究集中于跨孤岛或基于模拟的环境,忽视了决定方法是否可在真实边缘系统上部署的资源和运行时约束。我们提出了EdgeFlowerTune,这是一个针对在现实边缘系统约束下进行联邦LLM微调的部署导向基准。EdgeFlowerTune共同评估模型质量和系统成本,包括通信、墙钟延迟、内存使用、能耗以及对动态边缘条件的鲁棒性。为了在有效性、效率和鲁棒性方面比较方法,EdgeFlowerTune引入了三种互补协议:预算下的质量(Quality-under-Budget)、目标成本(Cost-to-Target)和鲁棒性(Robustness)。我们将EdgeFlowerTune实例化为一个基于Flower和MobileFineTuner的真实设备平台,涵盖商业Android智能手机和NVIDIA边缘开发板。我们的基准结果表明,仅依赖准确性评估可能导致误导性结论:在考虑现实系统约束后,具有相似最终质量的方法在可部署性上可能存在显著差异。EdgeFlowerTune为边缘联邦LLM微调的系统感知评估提供了一个可重复的基准。
cs.CL / 25 / 2605.08647
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
AgentCollabBench:诊断优秀代理何以成为糟糕的协作者
Abstract
Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.
Chinese Translation
多代理系统通过同行协作实现了最先进的成果。然而,当管道中的一个代理悄然丢弃一个约束时,系统的最终输出可能看起来是正确的,尽管推理链已经悄然被破坏,而现有的基于结果的评估对这种多跳过程失败视而不见。为了在部署前使这些脆弱性可测量,我们引入了AgentCollabBench,这是一个包含900个经过人工验证任务的诊断基准,涵盖软件工程、DevOps和数据工程。每个任务隔离四种行为风险中的一种:指令衰退(约束是否能经受住同行压力?)、错误信念传播(谎言是否通过共识传播?)、上下文泄漏(信息是否在任务之间渗漏?)和追踪器耐久性(标记数据是否能到达最终代理?)。通过评估四个现代大型语言模型(GPT 4.1 mini、Gemini 2.5 Flash Lite、Qwen-3.5-35B-A3B和Llama 3.1 8B Instruct),我们揭示了模型特定的脆弱性特征,这些特征在仅基于结果的评估中是不可见的;例如,Qwen-3.5-35B-A3B在追踪器耐久性和指令稳定性方面表现优异,而GPT 4.1 mini在泄漏控制和错误信念抵抗方面领先。除了模型间的差异外,通信拓扑作为一个主要风险因素,解释了多跳信息存活率中7-40%的方差。这一影响源于特定于汇聚有向无环图(DAG)节点的合成瓶颈:一个代理在权衡竞争的父输入时,会丢弃由少数分支携带的约束,而这种瓶颈在线性链中是结构性缺失的。AgentCollabBench表明,次优的拓扑结构可以悄然抹去高能力模型的保护机制,认为多代理的可靠性根本上是一个结构性问题,仅仅扩大模型智能并不能替代架构。
cs.CL / 26 / 2605.08665
Hint Tuning: Less Data Makes Better Reasoners
提示调优:更少的数据创造更好的推理能力
Abstract
Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5--8$\times$ more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24--66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B--32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model's capabilities.
Chinese Translation
大型推理模型通过扩展的思维链实现高准确率,但生成的标记数量比必要的多出5到8倍,且在问题难度上应用冗长的推理方式。我们提出了提示调优(Hint Tuning),这是一种数据高效的方法,旨在教会模型调整推理深度。我们的关键见解是:相应的指令模型(instruct model)作为理想的难度探测器。通过测试指令模型在不同指导下能够解决的问题,我们自动构建了三种状态的训练数据:无提示(No-Hint,直接答案)、稀疏提示(Sparse-Hint,最小前缀)和完整提示(Full-Hint,完整推理)。这将难度标记的抽象挑战转化为指令模型与推理模型之间的可测一致性检查。仅使用1000个自我标注的样本,提示调优在多个规模(4B-32B)的主流推理模型(Qwen3-Thinking, DeepSeek-R1-Distill)上实现了24%到66%的标记减少(平均31.5%),同时在五个基准测试上保持竞争力的准确性。与需要大量蒸馏数据集或昂贵强化学习(RL)的方法不同,我们通过简单地与指令模型的能力对齐,实现了更高的效率。
cs.CL / 27 / 2605.08671
Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups
大型语言模型中的解释公平性:对不同人口群体如何解释决策差异的实证分析
Abstract
Large language models (LLMs) are increasingly deployed not only to make decisions but to explain them. While AI decision fairness has been studied extensively, the fairness of AI explanations (whether LLMs justify decisions with equal quality, depth, tone, and linguistic sophistication across demographic groups) has received little attention. This paper introduces the Explanation Fairness Taxonomy (EFT), a framework comprising five formally defined, operationalizable dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The taxonomy is instantiated in a controlled empirical study across 80 prompt templates, four consequential decision domains (hiring, medical triage, credit assessment, legal judgment), and five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B. Two novel black-box metrics are introduced: the Hedging Density Score (HDS) and the Explanation Faithfulness Proxy (EFP), a heuristic indicator of decision-linked explanation variation. Across up to 400 prompt pairs, all eight EFT metrics show statistically significant disparities (Cohen's d ranging from small to large, all p_BH < 10^(-62)). Model choice is strongly associated with disparity magnitude: Qwen3 32B exhibits verbosity disparities 5.9x larger than LLaMA 3.3 70B. Two prompting-based mitigations show significant reductions in EFP disparity (78-95%) but no significant effect on stylistic dimensions, consistent with the hypothesis that stylistic explanation inequalities are encoded in pre-training distributions and are not resolvable through deployment-level instruction alone. A reproducible measurement framework is offered for explanation-level fairness auditing, with implications for AI regulation and deployment practice.
Chinese Translation
大型语言模型(LLMs)越来越多地被用于不仅做出决策,还用于解释这些决策。尽管人工智能决策的公平性已被广泛研究,但人工智能解释的公平性(即LLMs是否在不同人口群体中以相同的质量、深度、语气和语言复杂性来解释决策)却鲜有关注。本文引入了解释公平性分类法(Explanation Fairness Taxonomy, EFT),该框架包括五个正式定义的、可操作的维度:冗长差异(Verbosity Disparity)、情感差异(Sentiment Disparity)、认知对冲差异(Epistemic Hedging Disparity)、决策关联解释差异(Decision-Linked Explanation Disparity)和词汇复杂性差异(Lexical Complexity Disparity)。该分类法在一个控制的实证研究中得以体现,涵盖80个提示模板、四个重要决策领域(招聘、医疗分诊、信用评估、法律判断)以及五个LLMs:GPT-4.1、Claude Sonnet、LLaMA 3.3 70B、GPT-OSS 120B和Qwen3 32B。引入了两个新颖的黑箱指标:对冲密度评分(Hedging Density Score, HDS)和解释忠实度代理(Explanation Faithfulness Proxy, EFP),后者是决策关联解释变异的启发式指标。在多达400对提示中,所有八个EFT指标显示出统计显著的差异(Cohen's d从小到大,所有p_BH < 10^(-62))。模型选择与差异幅度密切相关:Qwen3 32B的冗长差异比LLaMA 3.3 70B大5.9倍。基于提示的两种缓解措施在EFP差异上显示出显著降低(78-95%),但对风格维度没有显著影响,这与假设一致,即风格解释的不平等是编码在预训练分布中的,无法仅通过部署层面的指令来解决。本文提供了一个可重复的测量框架,用于解释层面的公平性审计,对人工智能的监管和部署实践具有重要意义。
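The EFT abstract reports disparities as Cohen's d effect sizes. The sketch below shows that standard statistic applied to a verbosity comparison, under stated assumptions: verbosity is taken to be the word count of each explanation, the two lists are made-up illustrative counts for two demographic variants of the same prompts, and the pooled-standard-deviation form of d is used. The abstract does not give the paper's exact metric definitions, so none of this should be read as the authors' implementation.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical explanation lengths in words (illustrative values only).
lengths_group_a = [120, 135, 128, 140]
lengths_group_b = [100, 110, 105, 115]

d = cohens_d(lengths_group_a, lengths_group_b)
print(f"verbosity effect size d = {d:.2f}")
```

A positive d here means group A receives longer explanations on average; by the usual convention, magnitudes near 0.2, 0.5, and 0.8 are read as small, medium, and large.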
cs.CL / 28 / 2605.08696
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
用于大规模并行序列生成的结构化递归混合器
Abstract
Over the last two decades, language modeling has experienced a shift from predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers run on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.
Chinese Translation
在过去的二十年中,语言建模经历了从主要使用递归架构在训练和推理过程中顺序处理标记,转变为在训练过程中并行处理序列元素的非递归模型。这种转变提高了训练效率和稳定性,但牺牲了推理吞吐量。在此,我们介绍了结构化递归混合器(Structured Recurrent Mixer),一种架构,它允许在训练时在序列并行表示和推理时的递归表示之间进行代数转换,特别是无需专门的内核或设备特定的内存管理。我们通过实验表明,这种双重表示在训练效率、更高的输入信息容量以及与其他线性复杂度模型相比,提供了更大的推理吞吐量和并发性。我们假设,递归模型不适合处理典型语言中信息丰富的输入的扩展序列长度,但非常适合在样本(批量)维度上进行扩展,因为它们每个样本的内存是恒定的。我们提供了Mojo/MAX推理实现的SRMs,其吞吐量是同样强大的Transformers在vLLM上推理的12倍,并发性是170倍,这些增加的特性是Pytorch实现的结果,导致GSM8k Pass@k的计算常数提高了30%。最后,我们通过展示SRMs是有效的强化学习训练候选者来结束讨论。
cs.CL / 29 / 2605.08715
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight:多智能体系统中早期故障预测的在线审计
Abstract
LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3$\times$ lower step localization error, opening the loop from post-hoc failure detection to enabling deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/
Chinese Translation
基于大语言模型的多智能体系统越来越多地应用于长时间跨度的任务中,但单一的决定性错误往往被下游智能体接受,并导致轨迹级别的失败。现有研究将此问题框定为事后故障归因,在轨迹结束后诊断责任智能体和步骤。然而,这一范式放弃了在轨迹展开过程中进行干预的任何机会。在本研究中,我们引入了AgentForesight,一个将此问题重新框定为在线审计的框架:在每一步展开的轨迹中,审计员仅观察当前前缀,必须在没有访问未来步骤的情况下,决定是继续执行还是在最早的决定性错误时发出警报。为此,我们策划了AFTraj-2K,一个涵盖编码、数学和智能体领域的智能体轨迹语料库,其中安全轨迹在严格的策划流程下被保留,而不安全轨迹则通过多个大语言模型评审者的共识在其决定性错误的步骤上进行了标注。在此基础上,我们开发了AgentForesight-7B,一个紧凑的在线审计员,采用粗到细的强化学习方法进行训练,首先使其在相邻安全/不安全前缀对的故障边界上具备风险预判的先验知识,然后在针对审计裁决的“什么、哪里和谁”三个维度的奖励下,将这一先验知识精细化为准确的步骤级定位。在AFTraj-2K和外部Who&When基准测试中,AgentForesight-7B超越了包括GPT-4.1和DeepSeek-V4-Pro在内的领先专有模型,性能提升高达19.9%,步骤定位误差降低至3倍,为从事后故障检测到在部署时进行干预打开了新的可能。项目页面:https://zbox1005.github.io/agent-foresight/
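The online auditing protocol in the AgentForesight abstract (observe only the current prefix, then either continue or alarm) can be sketched as a simple loop. Everything named here is hypothetical: `toy_auditor` is a keyword rule standing in for a trained auditor, the trajectory is invented, and `localization_error` is assumed to be the absolute distance between the alarmed step and the annotated decisive step.

```python
def first_alarm(steps, is_decisive_error):
    """Scan a trajectory online; return the index of the first alarmed step, or None."""
    for t in range(len(steps)):
        prefix = steps[: t + 1]  # the auditor never sees steps after t
        if is_decisive_error(prefix):
            return t
    return None


def localization_error(predicted_step, true_step):
    """Absolute distance between the alarmed step and the annotated step (assumed metric)."""
    return abs(predicted_step - true_step)


def toy_auditor(prefix):
    # Hypothetical rule: flag the newest step if it accepts unverified work.
    return "unverified" in prefix[-1]


trajectory = [
    "planner: parse the input file",
    "coder: write the parser",
    "reviewer: accept unverified patch",
    "coder: build on the patch",
]

alarm = first_alarm(trajectory, toy_auditor)
print(alarm, localization_error(alarm, true_step=2))
```

Because the loop returns at the first alarm, a run could be halted before later agents build on the faulty step, which is the intervention opportunity that post-hoc attribution gives up.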
cs.CL / 30 / 2605.08721
Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents
打破僵局:针对社交语言代理的双尺度进化策略训练
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.
Chinese Translation
尽管可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)在封闭任务中已被证明有效,但通过自我对弈将其扩展到开放式社交语言游戏时暴露出一个关键问题:进化僵局。由于策略空间巨大,语言代理常常趋向于同质化行为,导致确定性的比赛结果,从而消除政策进化所需的梯度信号。为了解决这一问题,我们提出了针对社交语言游戏的双尺度进化策略训练(Dual-scale Evolutionary Policy Training, DEPT)。DEPT引入了一种时间尺度的进化感知机制,通过量化双尺度价值基线的发散性和比赛熵来检测僵局。在感知到崩溃后,它激活不对称优势重塑,以动态调节优化环境进行干预。因此,我们的方法有效恢复了梯度信号,并促进了持续的战略探索。在多个社交语言游戏上的广泛实验表明,DEPT在避免策略退化的同时,推动了社交语言代理的持续进化,表现优于强基线。
cs.CL / 31 / 2605.08741
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
使用工具进行训练:复杂推理的在线工具自蒸馏
Abstract
Inference-time harnesses substantially improve large language models on complex reasoning tasks. However, the intrinsic capabilities of the underlying model remain unchanged by the addition of these external workflows. To bridge this gap, we introduce On-Policy Harness Self-Distillation (OPHSD), which employs the harness-augmented current model as a teacher for self-distillation, thereby introducing extra supervisory signals from the harness beyond training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated with a draft--verify harness for text classification and a plan--solve harness for mathematical reasoning, OPHSD consistently outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefits and can even degrade performance, suggesting that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at https://github.com/zzy1127/OPHSD-On-Policy-Harness-Self-Distillation.
Chinese Translation
推理时的工具显著提升了大型语言模型在复杂推理任务上的表现。然而,底层模型的内在能力并未因这些外部工作流的增加而改变。为了解决这一问题,我们提出了在线工具自蒸馏(On-Policy Harness Self-Distillation, OPHSD),该方法将增强工具的当前模型作为自蒸馏的教师,从而引入来自工具的额外监督信号,超越训练数据。OPHSD 将任务特定的工具能力内化到学生模型中,产生了强大的泛化能力和在多样化推理任务中的独立表现。通过在文本分类的草稿-验证工具和数学推理任务的计划-解决工具上进行评估,OPHSD 始终优于强基线(例如,在 HMMT25 上比 OPSD 提高了 10.83%)。我们的分析进一步表明,在推理过程中重新附加工具并不会带来额外的好处,甚至可能降低性能,这表明复杂工具不必始终是永久性的固定装置;相反,它们可以作为临时训练支架,其带来的好处会永久反馈到基础模型中。我们的代码和训练数据可在 https://github.com/zzy1127/OPHSD-On-Policy-Harness-Self-Distillation 获取。
cs.CL / 32 / 2605.08742
Narrative Landscape: Mapping Narrative Dispositions Across LLMs
叙事景观:跨大型语言模型的叙事倾向映射
Abstract
This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: "consistency", measured as cross-replication selection overlap via Jaccard similarity, and "diversity", measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model's selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.
Chinese Translation
本研究提出了一种定量框架,用于描绘大型语言模型(LLMs)在重复、受控引导下输出的稳定、模型特定的规律性。通过在六个前沿模型和三种指令类型中实施结构化的叙事约束选择任务,我们通过两个维度来操作化倾向:'一致性',通过 Jaccard 相似度测量交叉重复选择重叠,和 '多样性',通过逆 Simpson 指数测量选项的分散程度。我们进一步引入了叙事景观(Narrative Landscape),这是一种基于主成分分析(PCA)的可视化方法,将每个模型的选择特征映射到一个共享空间中以便进行直接比较。结果揭示了模型家族之间明显的刚性-探索光谱,并表明即使标量指标看似相似,指令类型也会改变选择空间的几何形状,这表明可比较的分数可能掩盖了质上不同的选择拓扑。
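The two disposition metrics named in the Narrative Landscape abstract have standard definitions, sketched below. Jaccard similarity is |A ∩ B| / |A ∪ B| and the inverse Simpson index is 1 / Σ pᵢ². How the study aggregates them is not stated in the abstract, so averaging Jaccard over all pairs of replications and pooling option frequencies across replications are assumptions, as are the option labels.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two selections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def consistency(replications):
    """Mean pairwise Jaccard similarity across replications (assumed aggregation)."""
    pairs = list(combinations(replications, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


def inverse_simpson(replications):
    """Inverse Simpson index 1 / sum(p_i^2) over pooled option frequencies."""
    counts = Counter(option for rep in replications for option in rep)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


# Hypothetical constraint selections from three runs of the same prompt.
runs = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "C"}]

print(round(consistency(runs), 3), round(inverse_simpson(runs), 3))
```

High consistency with a low inverse Simpson index would indicate a rigid model that keeps choosing the same few options, while the opposite pattern would indicate an exploratory one.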
cs.CL / 33 / 2605.08809
SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
SimReg:通过嵌入相似性正则化实现更高的预训练性能
Abstract
Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefit in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.
Chinese Translation
使用下一个标记预测对大型语言模型(LLMs)进行预训练已取得显著进展,但此类模型中标记嵌入的上下文依赖性导致高内部类方差和类间相似性,从而阻碍了表示学习的效率。尽管基于相似性的正则化在监督微调和分类任务中已显示出益处,但其在大规模LLM预训练中的应用和有效性仍然未得到充分探索。在本研究中,我们提出了SimReg,一种嵌入相似性正则化损失,明确鼓励同一序列中具有相同真实标签的标记表示更加相似,同时通过对比损失强制不同标签标记之间的分离。我们的分析表明,这一机制通过扩大多分类边界引入了收益,从而实现更高效的分类。在密集和专家混合(Mixture-of-Experts, MoE)架构上的大量实验表明,SimReg始终将训练收敛速度提高超过30%,并在标准基准上将平均零-shot下游性能提升超过1%。进一步的消融研究和分析为超参数调优和损失有效性提供了实用见解。
cs.CL / 34 / 2605.08837
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans
基础差距:大型语言模型如何与人类不同地锚定抽象概念的意义
Abstract
Abstract concepts - justice, theory, availability - have no single perceivable referent; in the human brain, their meaning emerges from a web of experiences, affect, and social context. Do large language models (LLMs) ground abstract concepts in a similar way? We study this by replicating property-generation experiments from cognitive science on 21 frontier and open-weight LLMs. Across models and experiments, we find a consistent pattern: when compared to humans, models rely too heavily on word associations, and underproduce properties tied to emotion and internal states. This yields a large and consistent grounding gap: no model exceeds a Pearson correlation r=0.37 with human responses, compared to a human-to-human ceiling above r=0.9. To better interpret this gap, we also replicate a rating experiment on grounding categories and find that here LLMs align more closely with human judgment, and alignment improves as models get larger. We then use sparse autoencoders (SAEs) to inspect whether this information is also reflected in the models' internal features, and we do identify features connected to grounding dimensions such as "sensorimotor" and "social". These findings suggest that current LLMs can recover grounding dimensions when explicitly queried, but do not recruit them in a human-like way when words are generated freely.
Chinese Translation
抽象概念——正义、理论、可用性——没有单一可感知的指称;在人的大脑中,它们的意义源于经验、情感和社会背景的网络。那么,大型语言模型(LLMs)是否以类似的方式为抽象概念提供基础?我们通过在21个前沿和开放权重的LLM上复制认知科学中的属性生成实验来研究这个问题。在不同模型和实验中,我们发现了一种一致的模式:与人类相比,模型过于依赖词汇关联,且在与情感和内部状态相关的属性生成上表现不足。这导致了一个显著且一致的基础差距:没有任何模型与人类反应的皮尔逊相关系数超过r=0.37,而人类之间的相关系数上限超过r=0.9。为了更好地解释这一差距,我们还复制了一个关于基础类别的评分实验,发现LLMs在这里与人类判断的对齐更为紧密,并且随着模型规模的增大,对齐程度有所改善。随后,我们使用稀疏自编码器(SAEs)检查这些信息是否也反映在模型的内部特征中,确实识别出与“感知运动”和“社会”等基础维度相关的特征。这些发现表明,当前的LLMs在被明确询问时能够恢复基础维度,但在自由生成词汇时并未以人类的方式动用这些维度。
cs.CL / 35 / 2605.08838
Generating Leakage-Free Benchmarks for Robust RAG Evaluation
生成无泄漏基准以增强鲁棒性RAG评估
Abstract
Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets, designed to test RAG performance, comprise many questions that can already be answered from an LLM's parametric memory. This leads to unreliable evaluation. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. This issue worsens over time due to benchmark aging. As benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses the issue of benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying reasoning structure, and then generates new examples via type-constrained entity replacement. This process produces structurally similar but novel instances that are unlikely to exist in the model's parametric knowledge, while preserving the original reasoning patterns. To ensure quality, we incorporate two verification steps: (1) a reasoning-graph consistency check to maintain task difficulty, and (2) a knowledge-leakage filter to exclude instances answerable without retrieval.
Chinese Translation
检索增强生成(RAG)被广泛用于将外部知识融入大型语言模型(LLMs)。然而,许多旨在测试RAG性能的基准数据集包含许多可以通过LLM的参数记忆直接回答的问题,这导致评估结果不可靠。我们将这种现象称为知识泄漏:即RAG任务在不进行检索的情况下可解的情况。随着时间的推移,这一问题愈加严重,因基准数据集的老化。随着基准数据集被重复用于训练,其内容越来越多地被模型参数吸收,从而使其在评估检索时的有效性降低。我们提出了SeedRG,一个半合成的基准生成管道,旨在减轻知识泄漏并解决基准老化的问题。SeedRG从一个种子基准数据集出发,从问题-上下文对中提取推理图,以捕捉其潜在的推理结构,然后通过类型约束的实体替换生成新的例子。这个过程产生了结构相似但新颖的实例,这些实例不太可能存在于模型的参数知识中,同时保留了原始的推理模式。为了确保质量,我们引入了两个验证步骤:(1)推理图一致性检查,以保持任务难度;(2)知识泄漏过滤器,以排除那些可以在不进行检索的情况下回答的实例。
cs.CL / 36 / 2605.08840
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
ReST-KV:基于层级输出重构和时空平滑的鲁棒KV缓存驱逐
Abstract
Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection. In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output Reconstruction and Spatial-Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights. To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window-based mechanism to capture spatial patterns. Our method, ReST-KV, significantly advances performance on long-context benchmarks. It surpasses state-of-the-art baselines by 2.58% on LongBench and 15.2% on RULER. Additionally, ReST-KV consistently outperforms existing methods on Needle-in-a-Haystack and InfiniteBench, all while achieving a remarkable 10.61$\times$ reduction in decoding latency at 128k context length. The code is publicly available at https://github.com/an-yongqi/rest-kv to facilitate reproducibility and further research.
Chinese Translation
大型语言模型(LLMs)在高效生成推理中面临着越来越大的挑战,尤其是在处理长序列时,Key-Value(KV)缓存的内存需求不断增加。现有的驱逐方法通常保留具有高注意力权重的KV对,但忽视了由于移除标记而导致的注意力重分配的影响,以及KV选择中的时空动态。在本文中,我们提出了ReST-KV,一种结合层级输出重构和时空平滑的鲁棒KV驱逐方法,为KV缓存驱逐任务提供了更全面的视角。具体而言,ReST-KV将KV缓存驱逐公式化为一个优化问题,通过高效的层级重构最小化输出差异。通过直接建模每个标记的移除如何影响模型输出,我们的方法自然捕捉到注意力重分配的效果,超越了对原始注意力权重的简单依赖。为了进一步增强鲁棒性,我们设计了指数移动平均平滑来处理时间变化,并采用自适应窗口机制来捕捉空间模式。我们的ReST-KV方法在长上下文基准测试中显著提升了性能。在LongBench上超越了最先进的基线2.58%,在RULER上超越了15.2%。此外,ReST-KV在Needle-in-a-Haystack和InfiniteBench上始终优于现有方法,同时在128k上下文长度下实现了显著的10.61$\times$解码延迟减少。代码已公开发布在 https://github.com/an-yongqi/rest-kv,以促进可重复性和进一步研究。
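The temporal-smoothing idea mentioned in the abstract above can be sketched, under stated assumptions, as an exponential moving average (EMA) over per-token importance scores followed by top-k retention. This is only an illustration of EMA smoothing in general: the scores, the `beta` value, the zero initialization, and both function names are hypothetical, and the sketch does not implement ReST-KV's layer-wise reconstruction objective or its adaptive window mechanism.

```python
def ema_update(prev, current, beta=0.9):
    """Blend the previous smoothed scores with the current step's scores."""
    return [beta * p + (1 - beta) * c for p, c in zip(prev, current)]


def keep_top_k(scores, k):
    """Return the (sorted) indices of the k highest-scoring cache entries."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical importance scores for 4 cached tokens over 3 decoding steps.
steps = (
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.5, 0.2],
    [0.0, 1.0, 0.5, 0.2],
)
smoothed = [0.0, 0.0, 0.0, 0.0]  # zero initialization is an assumption
for step_scores in steps:
    smoothed = ema_update(smoothed, step_scores, beta=0.5)

kept = keep_top_k(smoothed, k=2)
print(smoothed, kept)
```

Token 0 scores highest at the first step only; after smoothing it falls behind tokens 1 and 2, so a single-step spike does not decide which entries are retained.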
cs.CL / 37 / 2605.08842
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT:有效训练语言模型的专家知识转移
Abstract
Mixture-of-Experts (MoE) language models organize knowledge into explicitly routed expert modules, making expert-level representations traceable and analyzable. By analyzing expert activation patterns in MoE large language models (LLMs), we find that a subset of experts is consistently activated across diverse knowledge domains. These common experts encode cross-domain, generalizable knowledge that is closely related to model generalization, naturally raising the question of how such identifiable expert knowledge can be practically reused. Motivated by this observation, we propose XPERT, a framework that extracts, consolidates, and reuses expert knowledge from pre-trained MoE LLMs to support more effective training of language models across different model scales. XPERT identifies cross-domain experts via inference-only analysis, refines their representations through tensor decomposition, and adapts the extracted knowledge to reuse in downstream models. Experiments on language understanding and dialogue generation benchmarks show that models benefiting from reused expert knowledge achieve consistently stronger performance and faster convergence compared to strong baselines. These results highlight MoE LLMs as structured and reusable knowledge sources, and demonstrate the value of expert-level knowledge reuse for improving model training.
Chinese Translation
混合专家(MoE)语言模型将知识组织为明确路由的专家模块,使得专家级表示可追踪和可分析。通过分析MoE大规模语言模型(LLMs)中的专家激活模式,我们发现一部分专家在不同知识领域中始终被激活。这些共同的专家编码了与模型泛化密切相关的跨领域、可泛化知识,这自然引发了如何实际重用这些可识别的专家知识的问题。基于这一观察,我们提出了XPERT,一个从预训练的MoE LLM中提取、整合和重用专家知识的框架,以支持在不同模型规模下更有效的语言模型训练。XPERT通过仅推理分析识别跨领域专家,通过张量分解精炼其表示,并将提取的知识适应于下游模型的重用。在语言理解和对话生成基准测试中的实验表明,受益于重用专家知识的模型相比强基线表现出更强的性能和更快的收敛速度。这些结果突显了MoE LLM作为结构化和可重用知识源的价值,并展示了专家级知识重用在改善模型训练中的重要性。
cs.CL / 38 / 2605.08847
EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
EmoS:一种高保真多模态基准,用于细粒度流媒体情感理解
Abstract
In the context of today's high-pressure, aging society, the demand for large-scale emotional models capable of providing empathetic support is more critical than ever. However, existing benchmarks fail to simultaneously achieve ecological validity, signal clarity, and reliable fine-grained labeling. We introduce EmoS, a high-fidelity bilingual benchmark designed to resolve the limitations of ecological validity and noise in existing datasets by combining strictly filtered static slices with a dynamic Streaming Monologue subset. Supported by a rigorous dual-layer human annotation pipeline, EmoS provides trusted ground truth that captures continuous emotional evolution. Empirical results show that fine-tuning MLLMs (multimodal large language models) on EmoS yields significant gains over zero-shot baselines, laying the foundation for the training and evaluation of future emotion recognition models and empathy models. The dataset and code are publicly available at https://github.com/NLP2CT/EmoS.
Chinese Translation
在当今高压、老龄化的社会背景下,对能够提供同理心支持的大规模情感模型的需求比以往任何时候都更加迫切。然而,现有基准未能同时实现生态有效性、信号清晰度和可靠的细粒度标注。我们提出了EmoS,一种高保真的双语基准,旨在通过将严格筛选的静态片段与动态流媒体独白子集相结合,解决现有数据集中生态有效性和噪声的局限性。在严格的双层人工标注流程的支持下,EmoS提供了可信的真实数据,捕捉了情感的连续演变。实证结果表明,在EmoS上微调多模态大语言模型(MLLMs)相较于零样本基线取得了显著提升,为未来情感识别模型和同理心模型的训练与评估奠定了基础。数据集和代码可在 https://github.com/NLP2CT/EmoS 获取。
cs.CL / 39 / 2605.08853
Architecture, Not Scale: Circuit Localization in Large Language Models
架构,而非规模:大语言模型中的电路定位
Abstract
Mechanistic interpretability assumes that circuit analysis becomes harder as models scale. We challenge this assumption by showing that the attention architecture matters more than parameter count. Studying three circuit types across Pythia and Qwen2.5, we find that grouped query attention produces circuits that are far more concentrated and mechanistically stable than standard multi-head attention at comparable scales. The same concentration pattern holds across indirect object identification, induction heads, and factual recall. Within a single architecture family (Qwen2.5), factual recall circuits undergo a discrete phase transition above a critical scale, collapsing to a single bottleneck rather than degrading gradually. These findings suggest that some architectural choices make large models more tractable to study and that interpretability difficulty is not a fixed consequence of model size.
Chinese Translation
机制可解释性假设随着模型规模的扩大,电路分析变得更加困难。我们通过展示注意力架构比参数数量更为重要,挑战了这一假设。研究Pythia和Qwen2.5中的三种电路类型,我们发现分组查询注意力在可比规模下产生的电路远比标准多头注意力更为集中且机制上更为稳定。相同的集中模式在间接对象识别、归纳头和事实回忆中也得以保持。在单一架构家族(Qwen2.5)中,事实回忆电路在超过临界规模时经历了离散相变,崩溃为单一瓶颈,而不是逐渐降级。这些发现表明,一些架构选择使得大型模型的研究更为可行,并且可解释性难度并不是模型规模的固定结果。
cs.CL / 40 / 2605.08863
Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection
重新审视最大池化网络:分析语义概率在多实例学习中的作用以进行幻觉检测
Abstract
Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state-of-the-art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token-level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state-of-the-art baselines through adaptive aggregation of internal feature representations.
Chinese Translation
幻觉检测在提高大型语言模型(LLMs)可靠性方面变得越来越重要。最近,结合语义一致性与内部模型状态的混合方法,如HaMI,通过多实例学习(MIL)实现了最先进的性能。然而,由于重复采样和高昂的语义相似性计算,这些方法带来了显著的计算开销。在本研究中,我们首先从决策边际的角度对HaMI进行理论分析,揭示了通过语义一致性缩放内部状态会导致决策边际的扩大。基于这一见解,我们从边际扩大的角度重新审视经典的句子分类模型,通过最大池化聚合令牌级特征,并使用轻量级多层感知器(MLP)直接估计句子得分。我们的方案无需进行语义一致性计算,在保持与最先进基线竞争性能的同时,实现了显著的效率提升,得益于对内部特征表示的自适应聚合。
cs.CL / 41 / 2605.08888
DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding
DocScope:可验证推理的基准测试以实现可信的长文档理解
Abstract
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.
Chinese Translation
评估多模态大型语言模型是否能够在长且视觉丰富的文档上产生可信的、可验证的推理,需要超越端到端答案准确性的评估。我们引入了DocScope,一个基准,将长文档问答(QA)形式化为结构化推理轨迹预测问题:给定完整的PDF文档和一个问题,模型输出证据页面、支持证据区域、相关事实陈述和最终答案。我们设计了一个四阶段评估协议——页面定位、区域定位、事实提取和答案验证——通过阶段间解耦独立审核轨迹的每个层级,所有评审员均通过人类对齐研究进行选择和校准。DocScope包含来自273个文档的1,124个问题,所有层次的证据注释均由人工标注者完成。我们对6个专有模型、12个开放权重模型和若干特定领域系统进行了基准测试。我们的实验表明,答案准确性无法替代轨迹级别的评估:即使在正确答案中,观察到的完整证据链的最高比例仅为29%。在所有模型中,区域定位仍然是最薄弱的轨迹阶段。此外,主要困难在于聚合分散在长距离和多个文档集群中的证据,而一个神谕研究表明,忠实的感知和事实提取是主要的能力瓶颈。跨架构比较进一步表明,激活参数的数量比总体规模更为重要。该基准和代码将公开发布在https://github.com/MiliLab/DocScope。
cs.CL / 42 / 2605.08894
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
拟合不足:极低量化大语言模型中的平滑性
Abstract
Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.
Chinese Translation
大型语言模型(LLMs)在性能上表现出色,但部署成本高,这促使了极低比特但有损的量化方法。现有的量化算法主要关注于提高前向计算的数值精度,以消除性能下降。在本文中,我们展示了极低量化的LLMs在系统性平滑性下降方面超出了数值精度损失。通过平滑性代理,我们观察到随着量化比特宽度的降低,这种下降变得越来越严重。此外,基于序列邻域建模,我们发现量化模型在预测邻域内有效标记候选者的快速减少,这直接导致了更稀疏的解码树和生成质量的下降。为了验证这一点,我们在后训练量化和量化感知训练中引入了一个简单的平滑性保持原则,并证明保持平滑性带来了超出数值精度的额外收益。本文的核心目标是强调平滑性保持作为未来极端量化方法的重要设计考虑。代码可在 https://github.com/xuyuzhuang11/FINE 获取。
cs.CL / 43 / 2605.08896
FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness
FragileFlow:基础模型鲁棒性的光谱控制与正确但脆弱的预测
Abstract
Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.
Chinese Translation
大规模语言模型(LLMs)和视觉语言模型(VLMs)的鲁棒适应性通常通过在扰动下的平均准确率或平均一致性进行评估。然而,这些平均值可能掩盖了一种结构化的失败模式:尽管预测可能仍然是正确的,但概率质量已经从特定的真实类别流向决策边界附近的系统性错误竞争者。在本文中,我们将这一现象形式化为边际感知误差流,并引入FragileFlow,这是一种插件正则化器,利用校准的边际缓冲区来识别正确但脆弱的预测,并将其非类别概率质量组织成类别脆弱风险矩阵。从理论上讲,我们为这一边际感知误差流对象提供了首个PAC-Bayes上界,展示了经验光谱控制如何在稳定性条件下提供一种保守的确定性最坏类别鲁棒性路径。在多个选择LLM基准和少样本CLIP适应的实验中,FragileFlow始终改善了与理论相关的风险度量,相较于匹配的基线,在大多数设置中实现了扰动下最坏类别准确率的提升,并在比较中保持了清晰的准确率。
cs.CL / 44 / 2605.08898
LLM-Agnostic Semantic Representation Attack
与大型语言模型无关的语义表示攻击
Abstract
Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic coherence guarantees both white-box semantic convergence and black-box transferability. Technically, we operationalize this framework via the Semantic Representation Heuristic Search (SRHS) algorithm, which preserves interpretability and structural coherence of the adversarial prompts during incremental discrete token chunk expansion. Extensive evaluations demonstrate that our framework achieves a 99.71% average attack success rate across 26 open-source LLMs, with strong transferability and stealth.
Chinese Translation
大型语言模型(LLMs)越来越多地采用对齐技术来防止有害输出。尽管有这些保护措施,攻击者仍然可以通过精心设计对抗性提示来规避它们。当前主要的基于令牌级别的优化方法主要依赖于优化精确的肯定模板(例如,“当然,这里是...”)。然而,这些范式常常遇到瓶颈,如次优收敛、提示自然性受损以及跨模型泛化能力差。为了解决这些限制,我们提出了语义表示攻击(Semantic Representation Attack, SRA),这是一种新颖的与大型语言模型无关的范式,根本上重新概念化了对抗性目标,从精确的文本定位转向恶意的语义表示。从理论上讲,我们建立了语义一致性-收敛关系,并推导出跨模型语义泛化界限,证明保持语义一致性可以保证白盒语义收敛和黑盒可转移性。从技术上讲,我们通过语义表示启发式搜索(Semantic Representation Heuristic Search, SRHS)算法实现了这一框架,该算法在增量离散令牌块扩展过程中保持对抗性提示的可解释性和结构一致性。广泛的评估表明,我们的框架在26个开源大型语言模型中实现了99.71%的平均攻击成功率,具有强大的可转移性和隐蔽性。
cs.CL / 45 / 2605.08942
Decomposing and Steering Functional Metacognition in Large Language Models
大型语言模型中的功能性元认知的分解与引导
Abstract
Large language models (LLMs) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model. We propose that LLMs maintain a decomposable space of functional metacognitive states: internal variables encoding factors such as evaluation awareness, self-assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer-wise profiles. Moreover, by steering model activations along probe-derived directions, we show that each functional metacognitive state causally modulates reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety-related responses across tasks. Our findings suggest that benchmark performance reflects not only task competence but also the activation of specific functional metacognitive states. We argue that understanding and controlling these internal states is essential for reliable evaluation and deployment of reasoning models, and we provide a mechanistic framework for studying functional metacognition in artificial systems. Our code and data are publicly available at https://github.com/xlands/meta-cognition.
Chinese Translation
大型语言模型(LLMs)越来越表现出对其评估环境的意识,常常在基准设置中调整其推理策略。先前的研究表明,这种评估意识可能会扭曲性能测量;然而,目前尚不清楚这一现象是反映单一行为伪影,还是模型内部更深层次结构的体现。我们提出,LLMs维持一个可分解的功能性元认知状态空间:内部变量编码诸如评估意识、自我评估能力、感知风险、计算努力分配、受众专业知识适应和意图等因素。通过对多个推理模型进行残差流分析,我们证明这些状态可以从内部激活中线性解码,并表现出不同的层级特征。此外,通过沿探测导向引导模型激活,我们展示了每个功能性元认知状态在可分离的方式上因果调节推理行为,影响任务的冗长性、准确性和安全相关反应。我们的研究结果表明,基准性能不仅反映任务能力,还反映特定功能性元认知状态的激活。我们认为,理解和控制这些内部状态对于可靠评估和部署推理模型至关重要,并提供了一个机制框架用于研究人工系统中的功能性元认知。我们的代码和数据可在 https://github.com/xlands/meta-cognition 上公开获取。
cs.CL / 46 / 2605.08950
Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling
通过上下文对齐的对比学习和岭回归集成提高词汇难度预测
Abstract
Lexical difficulty prediction is a fundamental problem in language learning and readability assessment, requiring models to estimate word difficulty across different first-language (L1) backgrounds. However, existing approaches rely on regression-only training with scalar supervision, which does not explicitly structure the representation space, limiting their ability to capture cross-lingual alignment and ordinal difficulty. To mitigate these issues, we propose Context-Aligned Contrastive Regression, which integrates Ridge regression ensemble with two complementary objectives, i.e., Cross-View Context and Ordinal Soft Contrastive Learning. Experiments on three L1 datasets show that (i) contrastive objectives improve cross-lingual representation alignment while preserving language-specific nuances, (ii) the learned representations capture the ordinal structure of lexical difficulty, and (iii) the ensemble effectively mitigates systematic biases of individual models, leading to more stable performance across difficulty levels.
Chinese Translation
词汇难度预测是语言学习和可读性评估中的一个基本问题,需要模型在不同母语(L1)背景下估计单词的难度。然而,现有的方法依赖于仅使用标量监督的回归训练,这并没有明确构建表示空间,限制了它们捕捉跨语言对齐和顺序难度的能力。为了解决这些问题,我们提出了上下文对齐的对比回归(Context-Aligned Contrastive Regression),该方法将岭回归集成与两个互补目标相结合,即跨视图上下文(Cross-View Context)和顺序软对比学习(Ordinal Soft Contrastive Learning)。在三个L1数据集上的实验表明:(i)对比目标在保留语言特定细微差别的同时改善了跨语言表示对齐;(ii)学习到的表示捕捉了词汇难度的顺序结构;(iii)集成有效减轻了单个模型的系统性偏差,从而在不同难度水平上实现了更稳定的性能。
cs.CL / 47 / 2605.08961
Dolphin-CN-Dialect: Where Chinese Dialects Matter
Dolphin-CN-Dialect:中文方言的重要性
Abstract
We present Dolphin-CN-Dialect, a streaming-capable ASR model with a focus on Chinese and dialect-rich scenarios. Compared to the previous version, Dolphin-CN-Dialect introduces substantial improvements in data processing, tokenization, training stability, and data sampling strategies. To address the challenges of highly imbalanced dialect data, we propose a temperature-based sampling strategy that effectively balances standard Mandarin and low-resource dialects, leading to significant gains in dialect recognition performance. In addition, we redesign the tokenizer to better align with linguistic characteristics, adopting character-level modeling for Chinese and subword modeling for English, while introducing extensible dialect tokens. Experimental results show that Dolphin-CN-Dialect achieves improvement in dialect recognition accuracy and CER reduction compared to Dolphin. Furthermore, Dolphin-CN-Dialect reaches competitive performance with recent SOTA open-source ASR models, while maintaining a significantly smaller model size. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling a practical balance between latency and accuracy. It also provides flexible customization through hotword support and efficient deployment optimized for specialized hardware. These improvements make Dolphin-CN-Dialect a strong and practical solution for real-world multi-dialect ASR applications.
Chinese Translation
我们提出了Dolphin-CN-Dialect,这是一种具备流媒体能力的自动语音识别(ASR)模型,专注于中文及方言丰富的场景。与之前的版本相比,Dolphin-CN-Dialect在数据处理、分词、训练稳定性和数据采样策略方面进行了显著改进。为了解决高度不平衡的方言数据带来的挑战,我们提出了一种基于温度的采样策略,有效平衡了标准普通话与低资源方言,从而显著提升了方言识别性能。此外,我们重新设计了分词器,以更好地符合语言特征,采用了中文的字符级建模和英文的子词建模,同时引入了可扩展的方言标记。实验结果表明,与Dolphin相比,Dolphin-CN-Dialect在方言识别准确率和字符错误率(CER)方面均有所提升。此外,Dolphin-CN-Dialect在性能上与近期的SOTA开源ASR模型相竞争,同时保持了显著较小的模型尺寸。Dolphin-CN-Dialect支持流媒体和非流媒体推理,实现了延迟与准确性之间的实用平衡。它还通过热词支持和针对专用硬件优化的高效部署提供灵活的定制。这些改进使Dolphin-CN-Dialect成为现实世界多方言ASR应用的强大而实用的解决方案。
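The abstract above does not give the formula behind its temperature-based sampling strategy. A minimal sketch, assuming the form commonly used for imbalanced multilingual data (sampling probability proportional to the corpus size raised to the power 1/T), is shown below. The function name, the dialect labels, and the utterance counts are hypothetical and are not taken from the paper.

```python
def temperature_sampling_probs(counts, temperature):
    """Sampling probability per group, proportional to count ** (1 / T).

    T = 1 reproduces the raw data proportions; larger T flattens the
    distribution and up-weights low-resource groups.
    """
    weights = {name: n ** (1.0 / temperature) for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical utterance counts for a high- and a low-resource variety.
counts = {"mandarin": 10000, "cantonese": 100}
p_raw = temperature_sampling_probs(counts, temperature=1.0)
p_flat = temperature_sampling_probs(counts, temperature=2.0)
print(p_raw, p_flat)
```

With these toy counts, the low-resource share rises from about 1% at T = 1 to about 9% at T = 2, which is the balancing effect the abstract describes in qualitative terms.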
cs.CL / 48 / 2605.09015
LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
LLiMba:单GPU上的撒丁语——将一个3B语言模型适应于濒临消失的罗曼语系语言
Abstract
Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.
Chinese Translation
撒丁语是一种约有一百万说话者的罗曼语系语言,在现代自然语言处理(NLP)中几乎没有存在感。商业服务不支持该语言,目前的语言模型也无法可靠地产生该语言的文本。我们提出了LLiMba,这是一个3B参数的撒丁语模型,基于Qwen2.5-3B-Instruct通过持续预训练(CPT)和监督微调(SFT)在一台24 GB的消费级GPU上进行适配。该语料库包含1150万个撒丁语标记,涵盖了LSC、Logudorese和Campidanese,并增加了240万个相关罗曼语文本标记,作为回放数据以防止语域模糊。经过CPT后,该模型在留出的撒丁语数据上达到了6.76的困惑度,并在所有六个FLORES-200方向上超越了基础模型。我们在匹配条件下比较了五种SFT配置:完全微调、LoRA r64、rsLoRA r128、rsLoRA r256和DoRA r256。rsLoRA r256在所有译入撒丁语的方向上表现最佳,从英语翻译达到28.5 BLEU,而CPT后为17.3,完全微调后为21.0。秩消融分析显示,r128在BLEU分数上介于LoRA r64和rsLoRA r256之间,但揭示了该指标无法察觉的失败模式,包括其他变体均未出现的跨文字系统泄漏。LoRA r64在SFT中保留的事实内容少于更高秩的配置,并产生了更自信的虚构内容,尽管所有方法在训练中缺失的内容上都会产生虚构。DoRA r256在训练和评估之间的差距最小,但事实准确性最差。研究结果表明,在将罗曼语预训练基础模型适应于低资源的罗曼语目标时,适配器的容量比在LoRA变体之间的选择更为重要,较强的正则化并不总是有利,并且翻译指标平滑地对配置进行排序,而这些配置的定性行为则有本质的不同。跨文字系统的困惑度比较必须考虑字节回退标记化,这会压低拉丁文字以外的文字系统的指标。
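One reason rank and LoRA variant interact in comparisons like the one above is the adapter scaling factor: standard LoRA multiplies the low-rank update by alpha / r, while rank-stabilized LoRA (rsLoRA) uses alpha / sqrt(r). The sketch below only tabulates these two factors for the ranks named in the abstract. The value `alpha = 16` is a hypothetical choice, since the abstract does not report the alpha used, and the function names are illustrative.

```python
import math


def lora_scale(alpha, rank):
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized LoRA scaling, which decays more slowly with rank."""
    return alpha / math.sqrt(rank)


alpha = 16  # hypothetical; not reported in the abstract
for rank in (64, 128, 256):
    print(rank, lora_scale(alpha, rank), round(rslora_scale(alpha, rank), 4))
```

Under this assumption, the standard factor shrinks by 4x between r = 64 and r = 256 while the rank-stabilized factor shrinks by only 2x, so higher-rank rsLoRA adapters keep a larger effective update magnitude.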
cs.CL / 49 / 2605.09027
GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
GAMBIT:多智能体大语言模型集体对抗鲁棒性的三模式基准
Abstract
In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.
Chinese Translation
在多智能体系统(MAS)中,单个欺骗性智能体可以抵消智能体人工智能集体的所有收益,并规避已部署的防御。然而,现有的针对MAS的对抗性研究仅针对浅层任务,并未考虑自适应对手,这些对手会进化其策略以规避专门训练来捕捉它们的检测器。为了解决这一空白,我们引入了GAMBIT,这是一个具有三种评估模式和两个独立评分的基准,用于评估冒充者检测器:前两种模式在不断增加的分布转移下测量零样本检测,而第三种重新校准模式则测量检测器在仅有20个标记示例的情况下对新攻击的适应速度。该基准配备了一个包含27,804个标记实例的数据集,涵盖240种共同进化的冒充者策略。我们的贡献有三方面:(1)以国际象棋作为深度推理问题的基础,使用Gemini 3.1 Pro作为智能体,我们发布了GAMBIT及其数据集,以在现实约束下评估针对隐蔽自适应冒充者的检测器;(2)我们引入了一种基于高效进化框架的自适应冒充者智能体,该框架超越国际象棋具有广泛的可推广性,能够在保持基本不可检测的情况下崩溃集体任务性能(使用基于Gemini的检测器时F1-score为50.5%);(3)我们展示了零样本评估对于自适应对手可能具有高度误导性:两个具有近似相同零样本得分的检测器在少样本适应上相差8倍,而元学习变体的收敛速度快20倍,这一差距仅在重新校准模式中可见。总之,GAMBIT提供了第一个多智能体基准,其中对抗攻击和防御共同进化,冒充者框架可推广至我们的用例之外,并为快速重新校准在快速发展的对抗系统中提供了有前景的技术。代码和数据:https://anonymous.4open.science/r/gambit。
cs.CL / 50 / 2605.09032
A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting
一种量子启发的变分核与可解释人工智能框架用于跨区域太阳能和风能预测
Abstract
Reliable short-horizon forecasting of solar and wind generation is a structural prerequisite of any modern power system, yet most published forecasters are tuned and evaluated on a single climatic regime, and most algorithmic novelty has been concentrated either on classical recurrent networks or on monolithic foundation models that combine forecasting and explanation. We develop a four-stage hybrid framework that separates these concerns. The first stage acquires hourly generation, irradiance, and surface weather records through public application programming interfaces. The second stage trains three classical baselines (autoregressive integrated moving average, gradient-boosted regression trees, and a two-layer long short-term memory network) and produces a strong point forecast together with a residual error series. The third stage corrects the residual through a quantum-inspired variational kernel built on a six-qubit hardware-efficient ansatz with three repeated entangling layers. The fourth stage uses generative artificial intelligence strictly as an explainability layer that reads the measured benchmark numbers and produces a structured natural-language interpretation. Across three regions drawn from open public archives (Iberian solar, North Sea wind, and a mixed Texas trace), the proposed configuration stays within one percentage point of the strongest classical baseline on the in-domain forecasting task, and the quantum-inspired kernel separates calm and stormy weather regimes with a Fisher discriminant ratio approximately fifteenfold higher than a tuned radial basis kernel.
Chinese Translation
可靠的短期太阳能和风能发电预测是任何现代电力系统的结构性前提,然而大多数已发表的预测模型仅在单一气候条件下进行调优和评估,且大多数算法创新集中于经典递归网络或将预测与解释结合的单一基础模型。我们开发了一个四阶段的混合框架,以分离这些关注点。第一阶段通过公共应用程序接口获取每小时的发电辐照度和地面天气记录。第二阶段训练三种经典基线模型:自回归积分滑动平均模型、梯度提升回归树和一个两层长短期记忆网络,并生成强有力的点预测以及残差误差序列。第三阶段通过一个基于六量子比特硬件高效 Ansatz 的量子启发变分核来修正残差,该变分核具有三个重复的纠缠层。第四阶段严格使用生成式人工智能作为可解释性层,读取测量的基准数据并生成结构化的自然语言解释。在从开放公共档案中提取的三个区域(伊比利亚太阳能、北海风能和混合德克萨斯州追踪)中,所提出的配置在领域内预测任务中保持在最强经典基线的一个百分点以内,而量子启发核在平静和风暴天气条件下的区分能力,其Fisher判别比率约为调优径向基核的十五倍。
cs.CL / 51 / 2605.09041
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence
BiAxisAudit:一种评估大型语言模型偏见的新框架,关注提示敏感性和响应层差异
Abstract
Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than $0.7$ on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a ``cancellation trap''). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman $\rho = 0.238$, $p = 0.570$): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce \textsc{BiAxisAudit}, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with $80{,}200$ coded responses each, task format alone explains as much variance as model choice; $63.6\%$ of pooled bias signals (up to $85.2\%$ per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.
Chinese Translation
大型语言模型的偏见审计现在在如欧盟人工智能法案等治理框架内进行,使基准的可靠性本身成为一个安全问题。然而,许多当前的基准将偏见简化为来自单一提示格式和单一表面标签的单一标量。这种设计忽视了两种可以在不改变模型权重的情况下被利用的失败模式。在不同提示之间,保持意义不变的格式变化使得偏见支持度在固定语句池上变化超过 $0.7$。在响应内部,离散选择和自由文本阐述可以采取对立立场,因此一个表面上干净的汇总可能隐藏了实质性的内部不一致(即“取消陷阱”)。因此,仅选择和仅阐述的排名在八个大型语言模型之间几乎没有相关性(Spearman $\rho = 0.238$, $p = 0.570$):在仅选择评分下,LLaMA3-70B的排名处于中间,但在仅阐述评分下则在相同响应中排名最高。我们引入了 \textsc{BiAxisAudit},一种在两个正交轴上报告每个偏见分数及其可靠性估计的协议。跨提示轴在任务格式、视角、角色和情感的因子网格下评估每个语句,将偏见视为分布而非点估计。响应内部轴使用分割编码(Split Coding)将选择和阐述恢复为独立信号,通过不一致率(Inconsistency Rate)和差异净不平衡(Divergence Net Imbalance)进行测量。在八个大型语言模型中,每个模型有 $80{,}200$ 个编码响应,任务格式单独解释的方差与模型选择相当;$63.6\%$ 的汇总偏见信号(每个模型高达 $85.2\%$)仅出现在一个编码层中,提示维度交互超过了主要效应。该工具还将真实的偏见减少与由跨层重新分配引起的表面减少区分开来:某些提示配置同时减少了 BER 和 IR,而其他配置则仅抑制选择层偏见。
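The BiAxisAudit abstract compares two rankings of the same eight models with a Spearman rank correlation. The sketch below shows how such a coefficient can be computed for tie-free scores using the textbook formula $1 - 6\sum d_i^2 / (n(n^2-1))$. The function names and the score values are hypothetical and are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks (1 = largest value). Ties are rejected."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes tie-free scores")
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for two equally long, tie-free score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    rank_x, rank_y = ranks(x), ranks(y)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical per-model bias scores under two scoring layers (8 models).
selection_only = [0.10, 0.25, 0.40, 0.55, 0.30, 0.15, 0.60, 0.45]
elaboration_only = [0.50, 0.20, 0.35, 0.10, 0.65, 0.40, 0.30, 0.15]

rho = spearman_rho(selection_only, elaboration_only)
```

A value of `rho` near zero would mean the two layers order the models almost independently, which is the situation the abstract reports.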
cs.CL / 52 / 2605.09042
Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity
评估大型语言模型中的语用推理:来自标量多样性的证据
Abstract
Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models' internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.
Chinese Translation
评估大型语言模型(LLMs)中的语用推理仍然具有挑战性,因为模型行为可能因评估方法而异。先前的研究表明,基于提示的判断可能与模型的内部概率分布存在差异,这引发了关于观察到的性能是否反映了潜在能力或任务诱导行为的质疑。本研究使用标量多样性作为语用推理的分级诊断工具来探讨这一问题。根据 Hu & Levy (2023) 的研究,本研究比较了多种模型和实验设置下的直接概率测量与元语言提示。结果表明,两个评估方法并没有一致地优于对方,并且语用行为在不同的模型家族、提示策略和任务结构之间存在显著差异。此外,标量多样性梯度仅在特定模型-条件组合中出现,这表明 LLMs 中的语用推理反映了内部概率表示与任务诱导提示行为之间的相互作用,而不是由单一评估范式捕捉到的稳定能力。这些发现强调了评估设计在解释 LLMs 中语用能力时的核心作用。
cs.CL / 53 / 2605.09043
Phase Transitions in Affective Meaning Divergence: The Hidden Drift Before the Break
情感意义分歧中的相变:破裂前的隐性漂移
Abstract
One partner says "Fine" meaning resolution; the other hears surrender. The word is shared; the affective uptake is not. We formalize this as affective meaning divergence (AMD), the total-variation distance between interlocutors' anchor-conditioned affect distributions. Building on speech-act theory, common-ground accumulation, and entropy-regularized game theory, we derive a logit best-response map whose dynamics undergo a saddle-node bifurcation: when $\beta\alpha > 4$, a monotone increase in AMD-driven load produces an abrupt, hysteretic collapse of repair coordination. On Conversations Gone Awry (CGA-Wiki; $N=652$), derailing conversations exhibit critical-slowing-down (CSD) signatures across multiple levels: lexical divergence variance ($p<0.001$, $d=0.36$), AMD variance ($p=0.001$, $d=0.26$), and dialog-act repair variance ($p=0.016$, $d=0.20$), all significant after correction and stronger than toxicity and sentiment baselines. AMD provides a distinct temporal signature, with retrospectively measured variance peaking at the bifurcation point while toxicity variance peaks earlier, and is the only indicator grounded in the theoretical framework. Boundary-condition analysis on CGA-CMV ($N=1{,}169$) yields mixed but directionally consistent evidence.
Chinese Translation
一位交谈者说“好的”,意指解决;另一位则听成投降。这个词是共享的,但情感的理解却并不相同。我们将其形式化为情感意义分歧(affective meaning divergence, AMD),即对话者在锚定条件下的情感分布之间的总变差距离。在言语行为理论、共同基础积累和熵正则化博弈理论的基础上,我们推导出一个对数最佳反应图,其动态经历鞍点-节点分岔:当$\beta\alpha > 4$时,AMD驱动负载的单调增加会导致修复协调的突然、滞后崩溃。在《对话的误入歧途》(Conversations Gone Awry, CGA-Wiki; $N=652$)中,偏离的对话在多个层面上表现出临界减速(critical-slowing-down, CSD)特征:词汇分歧方差($p<0.001$, $d=0.36$)、AMD方差($p=0.001$, $d=0.26$)和对话行为修复方差($p=0.016$, $d=0.20$),在校正后均显著,并且强于毒性和情感基线。AMD提供了一个独特的时间特征,回顾性测量的方差在分岔点达到峰值,而毒性方差则较早达到峰值,并且是唯一一个基于理论框架的指标。对CGA-CMV($N=1{,}169$)的边界条件分析提供了混合但方向一致的证据。
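The abstract defines AMD as the total-variation distance between two interlocutors' affect distributions for the same anchor word. A minimal sketch of that distance for discrete distributions follows. The affect labels and probabilities are hypothetical illustrations of the "Fine" example, not values from the paper, and how the paper estimates the distributions is not covered here.

```python
def total_variation(p, q, tol=1e-9):
    """Total-variation distance between two discrete distributions.

    p and q map affect labels to probabilities; missing labels count as 0.
    """
    for name, dist in (("p", p), ("q", q)):
        if abs(sum(dist.values()) - 1.0) > tol:
            raise ValueError(f"{name} does not sum to 1")
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Hypothetical affect distributions conditioned on the anchor word "Fine".
speaker = {"resolution": 0.7, "neutral": 0.2, "surrender": 0.1}
listener = {"resolution": 0.1, "neutral": 0.2, "surrender": 0.7}

amd = total_variation(speaker, listener)  # 0.5 * (0.6 + 0.0 + 0.6) = 0.6
```

The distance is 0 when both parties attach the same affect to the word and 1 when their distributions do not overlap at all.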
cs.CL / 54 / 2605.09060
Language-Conditioned Visual Grounding with CLIP Multilingual
基于语言条件的视觉定位与CLIP多语言模型
Abstract
Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text branch, or their interaction. We resolve this ambiguity through a dense multilingual CLIP probe in which the visual encoder is held identical across thirteen typologically diverse languages and only the XLM-RoBERTa text branch varies. We evaluate two CLIP architectures spanning a 7x visual-encoder scale gap (XLM-R base + ViT-B/32, ~87M visual parameters; XLM-R large + ViT-H/14, ~632M) on 11 concepts and 210 images, and quantify cross-language agreement via cluster-mask IoU, top-percentile IoU, and Spearman rank correlation against an English reference (n=2,310 paired observations per language). Three findings emerge. First, low-resource languages (Arabic, Basque, Luxembourgish) incur a structural penalty at both backbone scales (Wilcoxon HR>LR p<10^-300; cluster-mask IoU gap +0.114 at base, +0.143 at large), isolating the deficit to the text branch. Second, scaling the encoder 7x widens the gap for structural failure cases (Basque {\Delta}=-0.056, Luxembourgish {\Delta}=-0.076) while improving Arabic ({\Delta}=+0.033), separating corpus-coverage from tokeniser-fertility failures. Third, peak similarity is preserved across languages (mean ratio 0.94 at large scale) while cluster-mask IoU drops sharply, identifying spatial misalignment, not signal collapse, as the dominant failure mode. At 3.4-3.9 Wh per 1,000 queries, dense-CLIP grounding is competitive with high-throughput inference budgets, positioning it as a practical substrate for energy-aware multilingual deployment.
Chinese Translation
多语言视觉-语言模型在不同语言间表现出系统性的性能差距,但其机制仍不明确:跨语言的差异可能源于视觉编码器、文本分支或它们的交互。我们通过一个密集的多语言CLIP探测器解决了这一模糊性,其中视觉编码器在十三种类型学上多样的语言中保持一致,只有XLM-RoBERTa文本分支有所不同。我们评估了两种CLIP架构,跨越7倍的视觉编码器规模差距(XLM-R基础版 + ViT-B/32,约87M视觉参数;XLM-R大型版 + ViT-H/14,约632M),在11个概念和210张图像上进行评估,并通过集群掩码IoU、前百分位IoU和与英语参考的斯皮尔曼等级相关性量化跨语言一致性(每种语言n=2,310对观察值)。得出了三个发现。首先,低资源语言(阿拉伯语、巴斯克语、卢森堡语)在两个主干规模上都遭受结构性惩罚(Wilcoxon HR>LR p<10^-300;基础版的集群掩码IoU差距为+0.114,大型版为+0.143),将缺陷孤立于文本分支。其次,将编码器规模扩大7倍加大了结构性失败案例的差距(巴斯克语{\Delta}=-0.056,卢森堡语{\Delta}=-0.076),同时改善了阿拉伯语({\Delta}=+0.033),将语料覆盖与分词器丰度的失败区分开来。第三,跨语言的峰值相似性得以保持(大型规模下平均比率为0.94),而集群掩码IoU急剧下降,识别出空间错位,而非信号崩溃,作为主要的失败模式。在每1,000个查询消耗3.4-3.9 Wh的情况下,密集CLIP定位在高吞吐量推理预算中具有竞争力,使其成为能源意识多语言部署的实用基础。
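The abstract contrasts preserved peak similarity with a sharp drop in mask IoU against the English reference. The sketch below illustrates that contrast on a flattened similarity map: it keeps the top fraction of cells as a mask and compares masks by intersection over union. The thresholding rule, the function names, and the toy maps are assumptions for illustration. The paper's cluster-mask construction is not described in the abstract and is not reproduced here.

```python
def top_fraction_mask(similarity, fraction):
    """Mark the cells whose similarity lies in the top `fraction` of the map."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(similarity) * fraction))
    top = sorted(range(len(similarity)),
                 key=lambda i: similarity[i], reverse=True)[:keep]
    mask = [False] * len(similarity)
    for index in top:
        mask[index] = True
    return mask


def iou(mask_a, mask_b):
    """Intersection over union of two equally long boolean masks."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return intersection / union if union else 1.0


# Hypothetical flattened patch-similarity maps for one image and one concept.
english = [0.9, 0.8, 0.1, 0.2, 0.3, 0.1, 0.0, 0.2]
other = [0.7, 0.1, 0.2, 0.8, 0.1, 0.0, 0.3, 0.2]

mask_iou = iou(top_fraction_mask(english, 0.25), top_fraction_mask(other, 0.25))
peak_ratio = max(other) / max(english)
```

In this toy case the peak ratio stays high while the masks overlap on only one of three marked cells, which is the "spatial misalignment, not signal collapse" pattern in miniature.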
cs.CL / 55 / 2605.09063
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak:一个由数学家策划的基准,用于评估大型语言模型的研究级数学能力
Son, Guijin, Kim, Seungone, Arnett, Catherine, Ko, Hyunwoo, Lee, Hyein, Kang, Hyeonah, Longxi, Jiang, Yun, Jin, Lee, JungYup, Lee, Kyungmin, Kim, Sam Yoosuk, Park, Sang, Hong, Seunghyeok, Lee, SeungJae, Yi, Seungyeop, Shin, Shinae, Bok, SunHye, Shin, Sunyoung, Ji, Yonghoon, Kim, Youngtaek, Jung, Hanearl, Asai, Akari, Neubig, Graham, Welleck, Sean, Yu, Youngjae, R, Akshelin, Ivanov, Alexander B., Muhammadjon, Boboev, Han, Chaeyoung, Stump, Christian, Karp, Dmitrii, Kwon, Dohyun, Kwon, DoYong, Oh, Duk-Soon, Resta, Giovanni, Panova, Greta, Noh, Huiyun, Baik, Hyungryul, Bae, Hyungsun, Mashrafdzhon, Inomov, Kim, Jeewon, Lee, Ji Eun, Liu, Jiaqi, Kang, Jieui, Kim, Jimin, Kim, Jon-Lark, Yoon, Junseo, Jo, Junwoo, Kim, Kibeom, Kwon, Kiwoon, Kummer, Mario, Mercer, Max, Kim, Minjun, Lee, Nahyun, Ze-An, Ng, Łochowski, Rafał Marcin, Lachièze-Rey, Raphaël, Zhang, Ruichen, Park, Sejin, Seo, Seonguk, Jaehoon, Shin, Sunatullo, Eom, Taewoong, Park, Yeachan, Jang, Yongseok, Oh, Youchan, Wang, Zhaoyang, Kovács, Zoltán
Abstract
Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.
Chinese Translation
随着前沿大型语言模型在国际数学奥林匹克(IMO)中取得金牌表现,学术界正在寻找下一个有意义且具有挑战性的目标来衡量大型语言模型的推理能力。虽然奥林匹克风格的问题仅测量逐步推理,但研究级问题则利用这种推理来推动数学知识的前沿,成为一种引人注目的替代方案。然而,研究级数学基准仍然稀缺,因为此类问题难以获取(例如,Riemann Bench 和 FrontierMath-Tier 4 分别包含 25 和 50 个问题)。为了支持下一代前沿模型的可靠评估,我们引入了 Soohak,这是由 64 位数学家全新创作的 439 道问题的基准。Soohak 包含两个子集。在挑战子集中,包括 Gemini-3-Pro、GPT-5 和 Claude-Opus-4.5 在内的前沿模型分别达到 30.4%、26.4% 和 10.4%,仍有显著的提升空间,而领先的开放权重模型如 Qwen3-235B、GPT-OSS-120B 和 Kimi-2.5 的表现均低于 15%。值得注意的是,除了标准问题解决外,Soohak 还引入了拒绝子集,探测研究数学固有的一种能力:识别不适定问题并暂停,而不是产生自信但没有依据的答案。在该子集中,没有模型超过 50%,这表明拒绝作为一个新的优化目标,目前的模型并未直接解决。为了防止数据污染,该数据集将在 2026 年底公开发布,期间可根据请求提供模型评估。
cs.CL / 56 / 2605.09092
Character-Level Transformer for Tajik-Persian Transliteration with a Parallel Lexical Corpus
基于字符级变换器的塔吉克-波斯语音译及其平行词汇语料库
Abstract
This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik--Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik--Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The Transformer achieves a CER of 0.3216 and an exact-match accuracy of 0.3133, outperforming both dictionary-based rule-based and recurrent neural baselines. With beam search (k=3), performance further improves to CER 0.3182 and accuracy 0.3215. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All preprocessing scripts, deterministic splits into training, validation, and test sets, and training configurations are released to support reproducibility and further research on Tajik and related Persian dialects. The corpus supports research in character-level transliteration, cross-script NLP, and lexicographic applications.
Chinese Translation
本研究探讨了从塔吉克语(西里尔字母)到波斯语(波斯-阿拉伯字母)的自动音译。我们呈现了一个经过整理和词典验证的平行语料库,包含52,152个塔吉克-波斯语单词和短语,这些数据来自印刷字典、百科全书和经过人工验证的在线资源。根据我们所知,这是目前公开可用的塔吉克-波斯语音译中最大的单词级语料库之一。利用该语料库,我们训练了一个字符级序列到序列的变换器(Transformer)模型,并使用字符错误率(CER)和准确匹配率进行评估。该变换器模型的CER为0.3216,准确匹配率为0.3133,优于基于字典的规则和递归神经网络基线。通过束搜索(k=3),性能进一步提升至CER 0.3182和准确率0.3215。我们描述了数据收集和预处理流程、模型架构和实验协议,并报告了一个词性分析,显示了不同词汇类别之间的性能差异。所有预处理脚本、确定性划分的训练、验证和测试集以及训练配置均已发布,以支持可重复性和进一步对塔吉克语及相关波斯方言的研究。该语料库支持字符级音译、跨脚本自然语言处理(NLP)和词典应用的研究。
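The abstract reports Character Error Rate (CER) and exact-match accuracy. A minimal sketch of both metrics follows, using Levenshtein edit distance and a corpus-level (micro-averaged) CER. Whether the paper averages per word or over the whole corpus is not stated, so the averaging choice is an assumption, and the placeholder strings are not corpus data.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # delete ref_char
                               current[j - 1] + 1,   # insert hyp_char
                               substitution))
        previous = current
    return previous[-1]


def cer_and_exact_match(pairs):
    """Return (micro-averaged CER, exact-match accuracy) for (ref, hyp) pairs."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    exact = sum(ref == hyp for ref, hyp in pairs)
    return total_edits / total_chars, exact / len(pairs)


# Placeholder strings standing in for (gold, predicted) transliterations.
pairs = [("abc", "abc"), ("abcd", "abxd")]
cer, accuracy = cer_and_exact_match(pairs)  # 1 edit over 7 characters, 1 of 2 exact
```

Python strings are compared by code point here, so combining marks in Perso-Arabic script count as separate characters unless the text is normalized first.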
cs.CL / 57 / 2605.09098
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation
动态元度量:基于源句条件加权的机器翻译评估
Abstract
We propose Dynamic Meta-Metrics (DMM), a framework for machine translation evaluation that learns source-sentence conditioned combinations of existing metrics. Rather than relying on a single static ensemble or language-specific weighting, DMM adapts the metric combination based on properties of the source segment. We study hard conditioning, which fits an interpretable combiner per cluster, and an exploratory soft-conditioned extension whose weights vary continuously with source-cluster responsibilities. We evaluate DMM on the WMT Metrics Shared Task data across multiple language pairs using pairwise agreement measures at the system and segment levels. Across settings, MLP-based combinations outperform linear and Gaussian process-based ensembles, and introducing soft conditioning yields gains over linear models.
Chinese Translation
我们提出了动态元度量(Dynamic Meta-Metrics, DMM),这是一个用于机器翻译评估的框架,能够学习基于源句条件的现有度量的组合。DMM并不依赖于单一的静态集成或特定语言的加权,而是根据源段的属性自适应度量组合。我们研究了硬条件化,它为每个簇拟合一个可解释的组合器,以及一个探索性的软条件扩展,其权重随着源簇的责任而连续变化。我们在WMT度量共享任务数据上评估DMM,涵盖多个语言对,并使用系统和段级的成对一致性度量。在各种设置中,基于多层感知器(MLP)的组合优于线性和基于高斯过程的集成,而引入软条件化则在性能上超过了线性模型。
cs.CL / 58 / 2605.09100
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC:统一推理驱动的生成、检索与压缩
Abstract
Text embedding and generative tasks are currently trained separately on top of large language models (LLMs), which incurs substantial training cost and deployment effort. Context compression is also a challenging and pressing task, vital to reasoning-driven generation and to agentic tasks that require long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation, and context compression in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative, and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish all three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times the data utilization during training. Furthermore, the framework enables a new paradigm for text embedding, self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, in which a compressed and internalized KV cache of O(1) length serves as the updatable memory. We also propose hybrid paged attention to speed up inference. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on a truly unified model that handles reasoning-driven generation, embedding, and compression seamlessly.
Chinese Translation
目前,文本嵌入和生成任务通常基于大型语言模型(LLMs)分别进行训练。这导致了大量的训练成本和部署工作。上下文压缩也是一项具有挑战性和紧迫性的任务,它对推理驱动的生成以及需要长上下文和持续学习的代理任务至关重要。本文探讨了如何在LLMs的一次前向传播中统一推理驱动的生成、推理增强的文本表示和上下文压缩任务。通过元潜在标记和统一的生成、表示与压缩调优方法,我们提出了一个名为GRC的训练框架,旨在连接这三项任务。训练后的模型能够在一次前向传播中完成三个目标,同时在推理过程中保持模块化的乐高风格灵活性。这一设计大大减少了检索增强生成(RAG)的部署工作,并在训练期间实现了高效推理和三倍的数据利用率。此外,这一框架设计为文本嵌入提供了一种新范式:自我推理潜在嵌入,以及一种新的生成范式:潜在记忆增强生成,其中使用压缩和内化的KV缓存作为可更新的记忆,长度为O(1)。我们还提出了混合分页注意力,以加速模型的推理。在推理密集型检索基准、生成任务、文档压缩、延迟评估和RAG设置上的广泛实验证明了我们方法的有效性,并可能为真正统一的模型提供启示,使其能够无缝处理推理驱动的生成、嵌入和压缩任务。
cs.CL / 59 / 2605.09106
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
Fin-Bias:在金融领域人类偏见下对大型语言模型决策的综合评估
Abstract
Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs' capabilities in stock trading, they are often restricted to small samples and do not test how susceptible LLMs are to context that carries potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making under uncertainty and in the presence of possibly biased human opinions. Fin-Bias includes 8868 long firm-specific analyst reports spanning various industries, in which sophisticated analysts summarize and analyze aspects of each firm and assign an investment rating (Bullish/Neutral/Bearish). We present LLMs with these reports with the analyst rating, without it, or with a 'fake' rating, and collect the investment ratings the LLMs generate. Our results reveal that LLMs tend to herd toward the explicit bias in the context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently; some models even exceed human performance in predicting future stock returns.
Chinese Translation
大型语言模型(LLMs)在金融领域的应用日益增多,这引发了关于其可靠性、对齐性和对对抗性操控的重大担忧。尽管先前的金融相关基准评估了LLMs在股票交易中的能力,但通常样本量较小,未能展示LLMs在可能存在的人类偏见的情境下的脆弱性。我们提出了Fin-Bias(在长期和不确定金融背景下的金融从众行为),这是一个用于评估LLM在面对不确定性和可能的人类偏见意见时的投资决策的基准。Fin-Bias包括8868份特定公司的长期分析师报告,这些报告由经验丰富的分析师总结和分析,并附有投资评级(看涨/中性/看跌),涵盖了多个行业。我们向大型语言模型提供了带有/不带有分析师投资评级的公司分析师报告,甚至包括“虚假”评级,以获取LLMs生成的投资评级。我们的结果显示,LLMs倾向于追随情境中的显性偏见。我们还开发了一种方法来检测潜在的人类意见,这可以鼓励LLMs独立思考,某些模型在预测未来股票回报方面甚至超越了人类表现。
cs.CL / 60 / 2605.09147
From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages
从传统标注器到大型语言模型:中世纪罗曼语的词性标注比较研究
Abstract
Part-of-speech (POS) tagging for Medieval Romance languages remains challenging due to orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs under zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning settings. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings provide empirical insights into the applicability of modern neural methods to medieval text processing and provide practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.
Chinese Translation
中世纪罗曼语的词性标注(POS tagging)因正字法变异、形态复杂性和有限的注释资源而面临挑战。本文对大型语言模型(LLMs)在三种中世纪方言中的词性标注进行了系统的实证评估:中世纪奥克语、中世纪加泰罗尼亚语和中世纪法语。我们在零样本提示、少样本提示、单语微调和跨语言迁移学习设置下,比较了传统的基于规则和统计的标注器与现代开源LLMs。基于历史数据集的实验表明,基于LLM的方法在性能上始终优于传统标注器,其中微调和多语言训练带来了最大的改进。特别是,跨语言迁移学习对资源匮乏的方言有显著益处,而针对特定目标语言的双语训练在某些情况下可以超越更广泛的多语言配置。结果强调了在为历史自然语言处理设计迁移策略时,语言接近性和数据集特征的重要性。这些发现为现代神经方法在中世纪文本处理中的适用性提供了实证见解,并为在数字人文学科研究中部署基于LLM的词性标注管道提供了实用指导。所有代码、模型和处理过的数据集均已发布以便于复现。
cs.CL / 61 / 2605.09152
Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
Meow-Omni 1:一种用于猫类行为学的多模态大型语言模型
Abstract
Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat's purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.
Chinese Translation
解读动物意图是计算行为学中的一个基本挑战,这主要是由于语义混淆现象,即相同的外部信号(例如,猫的呼噜声)在不同的生理背景下对应于截然不同的内部状态。现有的多模态大型语言模型(MLLMs)对高频生物时间序列数据缺乏敏感性,使得它们仅限于表面的行为模式匹配,而非真正的潜在状态推理。为了解决这一问题,我们推出了Meow-Omni 1,这是第一个为计算行为学专门构建的开源四模态MLLM。它原生融合了视频、音频和生理时间序列流与文本推理。通过有针对性的架构调整,我们将专门的科学编码器整合到统一的主干中,并通过生理基础的跨模态对齐形式化意图推断。在MeowBench这一新颖的专家验证四模态基准上进行评估,Meow-Omni 1实现了最先进的意图识别准确率(71.16%),显著优于领先的视觉-语言和全模态基准。我们发布了完整的开源管道,包括模型权重、训练框架和Meow-10K数据集,以建立一个可扩展的跨物种意图理解范式,并推动基础模型向现实世界的兽医诊断和野生动物保护发展。
cs.CL / 62 / 2605.09156
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
翻译中的迷失?探讨从拉丁语到奥克语的语法性别转变
Abstract
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.
Chinese Translation
从拉丁语到罗曼语的历时演变涉及语法性别系统的重构,从三分法(阳性、阴性、中性)转变为二分法(阳性、阴性)。在本研究中,我们引入了一种可解释的深度学习框架,以在词汇和语境层面探讨这一现象。首先,我们展示了传统的分词策略在这一低资源历史环境下的不足之处,并且我们提出的分词器在这些基准之上提高了性能。在词汇层面,我们评估了形态特征对性别预测的贡献。在语境层面,我们量化了不同词性类别对语法性别预测的贡献。这些分析共同描绘了词元与其句子语境之间性别信息的分布。我们将我们的代码库、数据集和结果公开发布。
cs.CL / 63 / 2605.09167
WorldSpeech: A Multilingual Speech Corpus from Around the World
世界语音:来自全球的多语言语音语料库
Abstract
Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages due to limited publicly available aligned data. To this end, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech results in an average relative Word-Error-Rate reduction of 63.5% across 11 typologically diverse languages.
Chinese Translation
自动语音识别(ASR)在高资源语言中表现良好,这些语言拥有丰富的配对音频-文本数据,但由于大多数语言缺乏可公开获取的对齐数据,其准确性急剧下降。为此,我们推出了WorldSpeech,一个24 kHz的多语言语音语料库,包含来自76种语言的65,000小时对齐音频-文本数据,这些数据来自包括议会记录、国际广播和公共领域有声书在内的多种公共来源。对于37种语言,WorldSpeech提供超过200小时的对齐语音,其中28种超过500小时,24种超过1,000小时。在WorldSpeech上微调现有的ASR模型,结果显示11种类型学上多样的语言的平均相对词错误率降低了63.5%。
cs.CL / 64 / 2605.09227
Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
去偏见大型语言模型作为评审的两种方法:层次贝叶斯校准与神经常微分方程评分传输的连续评分比较
Abstract
[Abridged] Using a Large Language Model (LLM) as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and distributional-shape match. The headline result is that the choice between methods is primarily a data-budget question. Both correctors close the raw $+0.71$-point mean offset to within $\pm 0.08$ of the GPT-4 reference, at 100 and at 1500 anchors. Past that, the methods swap roles. With 100 anchors, the linear corrector reconstructs the human-score distribution roughly twice as well by KL divergence (0.031 vs. 0.058) and ties the flow on MAE. With 1500 anchors the flow wins on every metric (MAE 0.320 vs. 0.359, Pearson 0.922 vs. 0.896, KL 0.026 vs. 0.037). The Bayesian linear corrector saturates well below 1500 anchors: residual $\tanh$-shaped non-linearity is, by construction, structure a linear correction cannot fit. The flow keeps improving as labels grow. We translate these findings into an explicit decision rule for production deployments.
Chinese Translation
[简要] 将大型语言模型(LLM)作为自动评分者(LLM-as-a-judge)虽然成本低廉,但可能存在偏见:一些评审较为宽松,另一些则较为严格,评分的中间部分被压缩,冗长的回答可能会被过度奖励。一个常见的解决方案是事后校准:保留廉价的评审,并在一组适度的配对锚点上,拟合从原始评审分数到人类评分估计的变换。我们比较了两种纠正方法,它们对这种映射的建模方式持有相反的观点:一种是具有每个分数不确定性的参数化小锚点层次贝叶斯线性校正,另一种是非参数化的神经常微分方程(Neural-ODE,FFJORD)评分传输流。两者在UltraFeedback细粒度评分(1700个配对示例,200个保留)上进行正面比较,校准分为三个操作子问题:总体均值恢复、每项准确性和分布形状匹配。主要结果是方法选择主要是一个数据预算问题。在100个和1500个锚点下,两种纠正方法都将原始的$+0.71$点均值偏差缩小至与GPT-4参考相差$\pm 0.08$以内。超过这个数量后,方法角色发生变化。在100个锚点下,线性校正器通过KL散度大约以两倍的效果重建人类评分分布(0.031对比0.058),并在平均绝对误差(MAE)上与流模型持平。在1500个锚点下,流模型在每个指标上获胜(MAE 0.320对比0.359,Pearson 0.922对比0.896,KL 0.026对比0.037)。贝叶斯线性校正器在远低于1500个锚点时即已饱和:残差的$\tanh$形非线性在结构上是线性校正无法拟合的。随着标签数量的增加,流模型持续改善。我们将这些发现转化为生产部署的明确决策规则。
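The abstract describes post-hoc calibration as fitting a transformation from raw judge scores to human ratings on a small set of paired anchors. The sketch below shows the simplest version of that idea, an ordinary least-squares affine correction. It is a stand-in for illustration only: it is neither the hierarchical Bayesian corrector nor the FFJORD flow from the paper, and the anchor values and function names are hypothetical.

```python
def fit_affine(judge_scores, human_scores):
    """Least-squares fit of human ~ slope * judge + intercept on paired anchors."""
    n = len(judge_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need at least two paired anchors")
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    var_j = sum((j - mean_j) ** 2 for j in judge_scores)
    if var_j == 0:
        raise ValueError("judge scores are constant; slope is undefined")
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_scores))
    slope = cov / var_j
    return slope, mean_h - slope * mean_j


def mean_absolute_error(predicted, target):
    """Average absolute gap between predicted and target scores."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Hypothetical anchors: the judge is lenient on average and compresses the scale.
judge = [6.0, 7.0, 8.0, 9.0]
human = [4.0, 6.0, 8.0, 10.0]

slope, intercept = fit_affine(judge, human)          # slope 2.0, intercept -8.0
corrected = [slope * j + intercept for j in judge]

mae_before = mean_absolute_error(judge, human)       # 1.0
mae_after = mean_absolute_error(corrected, human)    # 0.0 on this toy data
```

An affine map can remove a mean offset and a uniform stretch, but it cannot follow the $\tanh$-shaped residual the abstract mentions, which is why the linear corrector saturates as more anchors are added.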
cs.CL / 65 / 2605.09236
Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke
大规模匹配意义:通过洛克案例评估18世纪知识史的语义搜索
Abstract
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
Chinese Translation
尽管数字化语料库已经改变了知识传播的研究,当前的方法仍然过于依赖词汇文本重用检测,捕捉逐字引用但在根本上忽略了释义和复杂的隐性参与。本文通过对约翰·洛克(John Locke)基础性著作的接受情况,评估了18世纪知识史中的语义搜索。我们使用基于语义分类法的专家注释,检验现成的语义搜索流程是否能够揭示词汇方法所忽视的意义层面的对应关系。我们的结果表明,语义搜索检索到的隐性接受情况显著多于词汇基线。然而,语言诊断也揭示了“词汇把关”效应,即检索在一定程度上仍受限于表面词汇的重叠。这些发现突显了语义检索在分析大型历史语料库中思想传播的潜力与局限性。数据可在 https://github.com/COMHIS/locke-sim-data 获取。
cs.CL / 66 / 2605.09239
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
重复标记计数揭示了表征与输出之间的解离
Abstract
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens, and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88--93% of network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing, not of representation, and the two require different interventions.
Chinese Translation
大型语言模型在计数重复标记时表现不佳,尽管在更广泛的推理基准上表现强劲。这些失败通常归因于内部计数跟踪的局限性。我们证明这种归因是错误的。对残差流的线性探测器在每个嵌入后层以近乎完美的准确性解码正确计数,且在所有模型深度中均适用。这一情况在错误答案形成的确切层次上依然成立,而模型同时输出错误的计数。注意力模式未显示出在重复标记上的崩溃证据,标记化伪影也未能解释这些失败。相反,一个格式触发的多层感知器(MLP)模块在大约88%至93%的网络深度上将正确编码的计数覆盖为一个固定的错误答案。该模块在以空格分隔的列表格式中对重复的词标记生效,而对重复的数字标记则不存在。在较大的模型中,它被逗号分隔的分隔符抑制,但在较小的模型中仍然存在。该发现适用于Llama-3.2(1B和3B)和Qwen2.5(1.5B、3B和7B),且在相对深度上保持一致。计数失败是路由的失败而非表征的失败,两者需要不同的干预措施。
cs.CL / 67 / 2605.09252
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
LLM代理已经知道何时调用工具——即使没有推理
Abstract
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
Chinese Translation
工具增强的LLM代理往往会不加区分地调用工具,即使模型可以直接回答。每一次不必要的调用都会浪费API费用和延迟,但现有的基准测试并没有系统地研究何时实际需要工具调用。我们提出了When2Tool,这是一个涵盖18个环境(15个单跳,3个多跳)的基准,分为三类工具必要性——计算规模、知识边界和执行可靠性——每类都有控制的难度级别,形成工具必要任务与不必要任务之间的明确决策边界。我们评估了两类无训练基线:仅提示(通过变化提示来抑制不必要的调用)和先推理后行动(要求模型在行动前推理工具的必要性)。这两者都提供了有限的控制:仅提示抑制了必要调用和不必要调用,而先推理后行动在困难任务上仍然会产生不成比例的准确性损失。为了理解这些基线失败的原因,我们探查了模型的隐藏状态,发现工具的必要性可以从生成前的表示中线性解码,AUROC值在六个模型中为0.89至0.96,显著超过模型自身的口头推理。这表明模型已经知道何时需要工具,但在生成过程中未能利用这一知识。基于这一发现,我们提出了Probe&Prefill,它使用轻量级线性探针读取隐藏状态信号,并用引导句预填充模型的响应。在所有测试模型中,Probe&Prefill将工具调用减少了48%,仅损失1.7%的准确性,而在相似准确性下,最佳基线仅减少了6%的工具调用,或者实现了类似的工具调用减少但导致了5倍的准确性损失。我们的代码可在https://github.com/Trustworthy-ML-Lab/when2tool获取。
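A minimal sketch of the probe-then-prefill idea described in this abstract, assuming a logistic linear probe over a pre-generation hidden state. The probe weights, the 0.5 threshold, and both steering sentences are hypothetical placeholders; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

# Hypothetical steering sentences; the wording actually used is not given in the abstract.
PREFILL_USE_TOOL = "This request needs a tool, so I will call one."
PREFILL_NO_TOOL = "I can answer this directly without calling a tool."


def probe_score(hidden_state, weights, bias):
    """Logistic linear probe: estimated probability that a tool call is needed."""
    if len(hidden_state) != len(weights):
        raise ValueError("hidden state and probe weights must have equal length")
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def choose_prefill(hidden_state, weights, bias, threshold=0.5):
    """Pick the sentence that would be prefilled into the model's response."""
    needs_tool = probe_score(hidden_state, weights, bias) >= threshold
    return PREFILL_USE_TOOL if needs_tool else PREFILL_NO_TOOL


# Toy two-dimensional "hidden states" and probe weights (all hypothetical).
toy_weights, toy_bias = [1.0, -1.0], 0.0
hard_task_state = [2.0, 0.0]   # logit = +2, probe says a tool is needed
easy_task_state = [0.0, 2.0]   # logit = -2, probe says answer directly
```

In a real agent the hidden state would come from the model's forward pass on the prompt and the chosen sentence would be placed at the start of the assistant turn before generation continues. Both steps depend on the serving stack and are left out here.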
cs.CL / 68 / 2605.09253
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
基石还是绊脚石?解读在策略蒸馏中的岩石标记
Abstract
While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18\% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these ``stumbling blocks'' can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
Chinese Translation
尽管近期在可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)领域的研究表明,一小部分关键标记在推理增益中起着不成比例的推动作用,但在策略蒸馏(On-Policy Distillation, OPD)中对标记的类似理解仍然基本未被探索。在本研究中,我们调查了高损失标记,这是一种标记类型——作为OPD每个标记的KL目标下学生与教师不匹配的最直接信号——根据现有研究,随着训练的收敛,这些标记的损失应该逐渐减少;然而,我们的实证分析显示情况并非如此。即使在OPD训练达到明显饱和后,仍有相当一部分标记持续表现出高损失;这些标记,我们称之为岩石标记(Rock Tokens),在生成的输出中可以占到多达18%的比例。我们的研究揭示了两个惊人的悖论。首先,尽管岩石标记的高出现频率提供了不成比例的大量总梯度范数,但这些标记在整个训练过程中仍然保持静止,抵抗教师驱动的修正。其次,通过因果干预,我们发现这些标记对模型的实际推理性能几乎没有功能贡献。这些发现表明,优化带宽的大量使用被花费在学生模型无法或不需要内化的结构和话语残余上。通过解构这些动态,我们展示了战略性地绕过这些“绊脚石”可以显著简化对齐过程,挑战了统一标记加权的必要性,并为大规模模型蒸馏提供了更高效的范式。
cs.CL / 69 / 2605.09268
Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs
超越连续性:多轮对话中上下文切换的挑战
Abstract
Users interacting with Large Language Models (LLMs) in a multi-turn conversation routinely refine their requests or pivot to new topics. LLMs, however, often miss these topic shifts and carry over irrelevant context from previous turns, leading to inaccurate responses. In this paper, we stress-test the multi-turn understanding of LLMs and study the following two sub-tasks: (1) detecting whether the user pivots or refines in the current turn, and (2) shortlisting relevant context from previous turns. To this end, we construct synthetic benchmarks based on real-world datasets from varied domains, so as to simulate context shifts at different levels of difficulty. We then evaluate the zero-shot performance of ten LLMs (open-weight, closed-source and reasoning), and demonstrate that only some reasoning and strongly instructed LLMs are accurate in detecting pivots; open-weight LLMs struggle with the task and frequently carry stale context even with explicit cues; and all models suffer from a position bias. Based on the results, we discuss key takeaways for improving long-term robustness in multi-turn capabilities for LLMs.
Chinese Translation
用户在与大型语言模型(LLMs)进行多轮对话时,通常会细化他们的请求或转向新话题。然而,LLMs往往未能捕捉到这些话题的转变,并且会将之前轮次中的无关上下文带入,导致不准确的响应。本文对LLMs的多轮理解进行了压力测试,并研究了以下两个子任务:(1)检测用户在当前轮次中是否进行了话题转变或请求细化;(2)从之前的轮次中筛选相关上下文。为此,我们基于来自不同领域的真实数据集构建了合成基准,以模拟不同难度级别的上下文切换。然后,我们评估了十个LLMs(开放权重、闭源和推理模型)的零样本性能,并展示了只有一些推理能力强和指令明确的LLMs能够准确检测话题转变;开放权重的LLMs在此任务中表现不佳,且即使在明确提示下也常常带入过时的上下文;所有模型都存在位置偏差。基于这些结果,我们讨论了改善LLMs在多轮能力上长期鲁棒性的关键要点。
cs.CL / 70 / 2605.09269
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric:通过联合规划和验证的生成多模态奖励建模
Abstract
Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
Chinese Translation
对齐多模态大型语言模型(MLLMs)需要可靠的奖励模型,但现有的单步评估器可能会受到懒惰判断的影响,利用语言先验而忽视细粒度的视觉验证。虽然基于评分标准的评估在仅文本的环境中减轻了这些偏见,但将其扩展到多模态任务时,由于视觉推理的复杂性而受到瓶颈。响应之间的关键差异往往依赖于特定实例的视觉细节。稳健的评估需要动态合成评分标准,以隔离空间和事实上的差异。为了解决这个问题,我们提出了$\textbf{DeltaRubric}$,一种将多模态偏好评估重新构造为单个MLLM内的计划与执行过程的方法。DeltaRubric分为两个步骤:首先作为$\textit{Disagreement Planner}$,模型生成一个中立的、特定实例的验证清单。然后转变为$\textit{Checklist Verifier}$,它根据图像和问题执行这些自生成的检查,以产生最终的有根据的判断。我们将DeltaRubric表述为一个多角色强化学习问题,联合优化规划和验证能力。在Qwen3-VL 4B和8B Instruct模型上进行验证,DeltaRubric取得了显著的实证提升。例如,在VL-RewardBench上,它使基础模型的整体准确率提高了$\textbf{+22.6}$(4B)和$\textbf{+18.8}$(8B)个百分点,远超标准无评分标准的基线。结果表明,将评估分解为结构化、可验证的步骤可以实现更可靠和更具可推广性的多模态奖励建模。
cs.CL / 71 / 2605.09285
BetaEdit: Null-Space Constrained Sequential Model Editing
BetaEdit:基于零空间约束的顺序模型编辑
Abstract
Null-space-based methods have garnered considerable attention in model editing by constraining updates to the null space of the pre-existing knowledge representation, thereby preserving the model's original behavior. However, in practice these methods rely on an approximate null space--leading to knowledge leakage--and further suffer from severe performance degradation during sequential editing. Recent work shows that history-aware editing strategies can empirically mitigate this decline, yet the underlying reason remains unclear. In this paper, we first expose the knowledge leakage inherent in existing null-space approaches and then analyze why history-aware updates effectively preserve both editing performance and general capabilities during long-horizon editing. Building on these insights, we propose BetaEdit, a refined framework that effectively controls the knowledge leakage and integrates history-aware updates into the null-space paradigm. Extensive experiments on three large language models across two standard benchmarks show that BetaEdit consistently outperforms prior methods in the challenging regime of massive-scale sequential editing. Code is available at: https://github.com/lbq8942/BetaEdit.
Chinese Translation
基于零空间的方法在模型编辑中受到了相当大的关注,因为它们通过将更新限制在现有知识表示的零空间内,从而保持了模型的原始行为。然而,在实际应用中,这些方法依赖于近似的零空间——导致知识泄漏——并且在顺序编辑过程中还遭受严重的性能下降。最近的研究表明,历史感知编辑策略可以在经验上缓解这种下降,但其根本原因仍不清楚。在本文中,我们首先揭示了现有零空间方法中固有的知识泄漏问题,然后分析了为何历史感知更新能够有效保持长时间编辑过程中的编辑性能和一般能力。基于这些见解,我们提出了BetaEdit,一个精炼的框架,能够有效控制知识泄漏,并将历史感知更新整合到零空间范式中。在两个标准基准上的三种大型语言模型的广泛实验表明,BetaEdit在大规模顺序编辑的挑战性环境中始终优于先前的方法。代码可在以下链接获取:https://github.com/lbq8942/BetaEdit。
cs.CL / 72 / 2605.09295
LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction
LEAF-SQL:基于层级探索与自适应细化的文本到SQL骨架预测
Abstract
Text-to-SQL translates natural language questions into executable SQL queries, enabling intuitive database access for non-experts. While large language models achieve strong performance on Text-to-SQL with prompting, they still struggle with complex queries that involve deeply nested logic or multiple clauses. A widely used approach employs SQL skeletons--intermediate representations of query logic--to streamline generation, but existing methods are limited by their reliance on a single structural hypothesis and lack of progressive reasoning. To overcome these limitations, we propose LEAF-SQL, a novel framework that reframes skeleton prediction as a coarse-to-fine tree search process. LEAF-SQL enables systematic exploration of diverse structural hypotheses with adaptive refinement. Several key techniques are employed in LEAF-SQL: (1) a three-level skeleton hierarchy to guide the search, (2) a Skeleton Formulation Agent to generate diverse candidates, and (3) a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for the SQL generation. Extensive experiments show that LEAF-SQL consistently improves the performance of various LLM backbones. On the official hidden test set of the challenging BIRD benchmark, our method achieves 71.6 execution accuracy, which outperforms leading search-based and skeleton-based methods, affirming its effectiveness for complex queries.
Chinese Translation
文本到SQL将自然语言问题转换为可执行的SQL查询,使非专业人员能够直观地访问数据库。尽管大型语言模型在文本到SQL任务中通过提示取得了良好表现,但它们在处理涉及深层嵌套逻辑或多个子句的复杂查询时仍然面临挑战。一种广泛使用的方法是采用SQL骨架——查询逻辑的中间表示——来简化生成过程,但现有方法受限于对单一结构假设的依赖,并缺乏渐进推理。为克服这些局限性,我们提出了LEAF-SQL,这是一种将骨架预测重新构建为粗到细的树搜索过程的新框架。LEAF-SQL能够系统地探索多样的结构假设,并进行自适应细化。在LEAF-SQL中采用了几项关键技术:(1) 三层骨架层次结构以指导搜索,(2) 骨架生成代理(Skeleton Formulation Agent)以生成多样的候选项,以及(3) 骨架评估代理(Skeleton Evaluation Agent)以高效地修剪搜索空间。这种集成设计生成的骨架候选项在结构上多样且适应细粒度,为SQL生成提供了更强的基础。大量实验表明,LEAF-SQL始终提升了各种大型语言模型的性能。在具有挑战性的BIRD基准的官方隐藏测试集中,我们的方法达到了71.6的执行准确率,超越了领先的基于搜索和基于骨架的方法,验证了其在复杂查询中的有效性。
cs.CL / 73 / 2605.09317
Mem-W: Latent Memory-Native GUI Agents
Mem-W:潜在记忆原生的图形用户界面代理
Abstract
GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent's continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to $+30.0$, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.
Chinese Translation
图形用户界面(GUI)代理开始在网络、移动设备和桌面环境中作为互动世界进行操作,其中成功的控制依赖于将视觉、程序和任务级别的证据超越瞬时的当前屏幕。然而,大多数代理仍将记忆视为一种外部的人类可读物:历史被总结、分类、检索,并作为文本或结构化记录重新插入,然后再由策略进行编码。这导致经验存储的表现形式与现代GUI策略实际作用的潜在嵌入序列之间存在不匹配。我们引入了Mem-W,一系列将记忆视为代理连续上下文一部分的潜在记忆原生GUI代理,而不是作为辅助符号支架。Mem-W通过共享的轨迹到潜在压缩器,将历史轨迹(作为经验记忆)和会话中的片段(作为工作记忆)编织成紧凑的记忆标记。这些标记与当前的GUI观察和局部上下文编织成一个连续的嵌入序列,使代理能够通过相同的机器原生接口读取成功、失败和未完成的进展。Mem-W通过自我蒸馏和结果感知监督进行训练,以保留与决策相关的状态,同时过滤记忆,以支持真正的任务成功的证据。在四个网络和移动导航基准测试中,Mem-W始终改善了多种基础架构和增强记忆的基线,增益高达$+30.0$,这表明潜在上下文原生记忆可以作为长时间范围GUI代理的可扩展基础。
cs.CL / 74 / 2605.09329
Test-Time Speculation
测试时推测
Abstract
Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3 and PARD, degrades with generation length, reaching values close to 1 (i.e. no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test-time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$, and by $41\%$ on average, with the benefits scaling with increased generation lengths.
Chinese Translation
推测解码通过使用快速草稿模型生成标记,并使用更准确的目标模型对其进行验证,从而加速了大型语言模型(LLM)的推理。其性能依赖于$\textit{接受长度}$,即目标模型接受的草稿标记数量。我们的研究表明,即使是最先进的推测器,如DFlash、EAGLE-3和PARD,其接受长度也会随着生成长度的增加而降低,在仅仅几千个输出标记内接近1(即无加速),使得推测器在长响应任务中变得无效。接受长度下降的原因在于大多数推测器是在短序列上进行离线训练的,但在推理时却被迫在远超其训练分布的更长输出上与目标模型匹配。为了解决这个问题,我们提出了$\textit{测试时推测(TTS)}$,这是一种在线蒸馏方法,能够在测试时持续调整推测器。TTS利用了一个关键的见解,即标记验证步骤已经为每个草稿标记调用了目标模型,从而提供了适应草稿所需的训练信号,而无需额外成本。将草稿视为学生,将目标视为教师,TTS在多个推测轮次中调整草稿,每次更新都随着生成的进行而提高草稿的准确性。我们在Qwen-3、Qwen-3.5和Llama3.1系列的多个模型上的结果表明,TTS在接受长度上比最先进的推测器提高了高达$72\%$,平均提高了$41\%$,且随着生成长度的增加,收益也在扩大。
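The acceptance-length metric in this abstract can be made concrete with a small sketch, assuming greedy verification: a draft token is accepted only if it equals the token the target model would emit at that position, and acceptance stops at the first mismatch. The abstract does not state whether its metric counts the one token the target contributes in every round. The `+ 1` below is an assumption, chosen because it makes a value of 1 correspond to "no draft token accepted, no speedup". Function names are illustrative.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target's own choices."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # greedy verification rejects everything after a mismatch
        accepted += 1
    return accepted


def mean_tokens_per_round(rounds):
    """Average tokens emitted per draft-verify round.

    Each round yields the accepted draft prefix plus one token supplied by the
    target model (a correction or a bonus token); the +1 is an assumed convention.
    """
    if not rounds:
        raise ValueError("need at least one speculation round")
    total = sum(accepted_prefix_length(d, t) + 1 for d, t in rounds)
    return total / len(rounds)


# Toy token ids (hypothetical): (draft proposal, target's tokens at those positions).
rounds = [
    ([1, 2, 3], [1, 2, 9]),   # 2 draft tokens accepted
    ([4, 5, 6], [7, 5, 6]),   # 0 accepted: first token already differs
    ([8, 9], [8, 9]),         # 2 accepted: whole draft matches
]
average = mean_tokens_per_round(rounds)
```

Sampling-based verification uses a probabilistic accept/reject rule instead of exact token equality; that variant is not shown here.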
cs.CL / 75 / 2605.09346
RuPLaR : Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step
RuPLaR:基于规则先验的多步到一步的高效潜在压缩大语言模型推理链
Abstract
The Chain-of-Thought (CoT) paradigm, while enhancing the interpretability of Large Language Models (LLMs), is constrained by the inefficiencies and expressive limits of natural language. Latent Chain-of-Thought (latent CoT) reasoning, which operates in a continuous latent space, offers a promising alternative but faces challenges from structural complexities in existing multi-step or multi-model paradigms, such as error propagation and coordination overhead. In this paper, we introduce One-Model One-Step, a novel compression framework for Latent Reasoning with Rule-Based Priors (RuPLaR), to address these challenges. Our method trains an LLM to autonomously generate latent reasoning tokens in a single training stage, guided by rule-based prior probability distributions, thereby eliminating cascaded processes and inter-model dependencies. To ensure reasoning quality, we design a joint training objective that enforces answer consistency via cross-entropy, aligns soft tokens with rule-based priors via KL divergence (the Soft Thinking constraint), and adds a problem-thought semantic alignment constraint in the representation space. Extensive experiments show that our compression framework not only improves accuracy by 11.1% over existing latent CoT methods but also achieves this with minimal token usage, underscoring its effectiveness and extensibility. Code: https://github.com/xiaocen-luo/RuPLaR.
Chinese Translation
思维链(Chain-of-Thought,CoT)范式在增强大语言模型(Large Language Models,LLMs)可解释性的同时,受限于自然语言的低效性和表达局限性。潜在思维链(latent Chain-of-Thought,latent CoT)推理在连续潜在空间中进行,提供了一种有前景的替代方案,但面临现有多步或多模型范式中的结构复杂性挑战,如错误传播和协调开销。本文提出了一种新颖的基于规则先验的潜在推理压缩框架——一步一模型(One-Model One-Step,RuPLaR),以应对这一挑战。我们的方法训练一个LLM在单一训练阶段自主生成潜在推理标记,受规则先验概率分布的引导,从而消除级联过程和模型间依赖。为了确保推理质量,我们设计了一个联合训练目标,通过交叉熵强制答案一致性,通过KL散度对软标记与规则先验进行对齐(软思维约束),并在表示空间中增加问题-思维语义对齐约束。大量实验表明,我们的压缩框架不仅比现有的潜在CoT方法提高了11.1%的准确率,而且在最小化标记使用的情况下实现了这一点,突显了其有效性和可扩展性。代码链接:https://github.com/xiaocen-luo/RuPLaR。
cs.CL / 76 / 2605.09348
HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities
HOME-KGQA:一个针对家庭日常活动的多模态知识图谱问答基准数据集
Abstract
Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine-grained spatiotemporal data, which limits their applicability to real-world scenarios targeted by Embodied AI. We introduce HOME-KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME-KGQA consists of complex, multi-hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM-based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME-KGQA. This highlights significant challenges that should be addressed for the real-world deployment of KGQA systems. Our dataset is available at https://github.com/aistairc/home-kgqa
Chinese Translation
大型语言模型(LLMs)提供灵活的自然语言处理能力,而知识图谱(KGs)则提供明确且结构化的知识。这两者的互补整合使得可靠且可验证的人工智能系统的开发成为可能。特别是,知识图谱问答(KGQA)作为减少LLM幻觉并利用超越训练数据的知识的手段,受到了广泛关注。然而,现有的KGQA基准数据集偏向于百科全书式的知识,局限于单一模态,并缺乏细粒度的时空数据,这限制了它们在面向具身人工智能的真实场景中的适用性。我们提出了HOME-KGQA,一个基于日常家庭活动的多模态知识图谱构建的新型KGQA基准数据集。HOME-KGQA包含复杂的多跳自然语言问题,并与图数据库查询语言配对。与现有基准相比,它包括更多具有挑战性的问题,涉及多层次的时空推理、多模态基础和聚合函数。实验结果表明,基于LLM的KGQA方法在HOME-KGQA上的表现未能达到与现有数据集相当的水平。这突显了在KGQA系统的真实世界部署中需要解决的重大挑战。我们的数据集可在 https://github.com/aistairc/home-kgqa 获取。
cs.CL / 77 / 2605.09414
Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media
金融社交媒体中表情符号语义与情感的跨文化转移
Abstract
Emojis are widely used in online financial communication, but it is unclear whether they provide transferable sentiment signals across languages, platforms, and asset communities. This study examines the extent to which emoji usage, semantics, and sentiment polarity remain stable across financial communities, and how these layers influence zero-shot sentiment transfer. Using large corpora of Twitter and StockTwits posts in four languages, we measure cross-community divergence and evaluate sentiment models trained under emoji-only, text-only, and text+emoji inputs. We find that emoji frequencies differ across communities, especially across languages, but their semantics and sentiment polarity are largely stable. Cross-asset transferability shows minimal degradation, while cross-language transfer remains the most challenging. Including emojis consistently reduces transfer gaps relative to text-only models. These results indicate that financial communication exhibits a partially shared ``emoji code,'' and that emojis provide compact, language-independent sentiment cues that improve model generalization across markets and platforms.
Chinese Translation
表情符号在在线金融交流中被广泛使用,但尚不清楚它们是否能够在不同语言、平台和资产社区之间提供可转移的情感信号。本研究考察了表情符号的使用、语义和情感极性在金融社区中的稳定性,以及这些层面如何影响零样本情感转移。通过使用四种语言的大量Twitter和StockTwits帖子语料库,我们测量了跨社区的差异,并评估了在仅使用表情符号、仅使用文本以及文本+表情符号输入下训练的情感模型。我们发现,不同社区之间的表情符号频率存在差异,尤其是在不同语言之间,但它们的语义和情感极性基本保持稳定。跨资产的可转移性几乎没有下降,而跨语言的转移仍然是最具挑战性的。包含表情符号的模型相较于仅使用文本的模型,始终能有效缩小转移差距。这些结果表明,金融交流展现出部分共享的“表情符号代码”,而表情符号提供了紧凑的、独立于语言的情感线索,从而提高了模型在不同市场和平台上的泛化能力。
cs.CL / 78 / 2605.09422
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
无参与感的感知:剖析大型多模态模型中的因果发现缺陷
Abstract
Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
Chinese Translation
尽管大型多模态模型(LMMs)在一般视频理解方面取得了良好的表现,但它们在因果发现过程中对文本先验捷径的敏感性已被认定为一个关键缺陷。这一现象的潜在机制仍未完全理解,因为现有基准仅测量响应准确性,而未揭示缺陷的来源和程度。我们引入了ProCauEval,一种基于扰动的评估协议,该协议从结果评估转向机制诊断,通过五种受控配置探讨因果发现,这些配置系统性地操控视觉和文本模态,以分解它们对模型行为的各自贡献并剖析失败模式。在对17个主流LMMs的评估中,我们发现模型能够忠实感知视频内容,但在因果推理过程中系统性地未能充分利用这些内容。我们进一步观察到,后训练的增强反而加剧了对文本先验的依赖,而更高的基线性能与在扰动下的脆弱性呈正相关。为了解决这些问题,我们提出了反蒸馏策略优化(ADPO),这是一种基于负教师对齐的强化学习框架,通过明确推动策略远离由视觉损坏引发的仅基于先验的反事实教师,从而增强GRPO。具体而言,ADPO最大化基于原始输入和视觉损坏输入的策略分布之间的差异,从而迫使模型将其推理建立在视觉证据而非文本捷径之上。大量实验表明,ADPO在不牺牲基本理解的情况下改善了视觉参与感,从而为可靠的因果发现提供了初步步骤。
cs.CL / 79 / 2605.09431
PumpSense: Real-Time Detection and Target Extraction of Crypto Pump-and-Dumps on Telegram
PumpSense:Telegram上加密货币拉升与抛售的实时检测与目标提取
Abstract
Cryptocurrency pump-and-dump schemes coordinated via Telegram threaten market integrity. However, existing research addressing this specific threat has not yet produced solutions that combine reliable results with fast response. This is in part due to the absence of publicly available, message-level labeled data, as well as design choices. In this paper, we address both issues. In particular, we introduce a corpus of over 280,000 Telegram posts from 39 pump-organizing groups, all manually reviewed to identify 2,246 pump announcements and their targeted cryptocurrency and exchange. Leveraging this dataset, we define two tasks: real-time pump-announcement detection and target cryptocurrency/exchange extraction. For detection, we compare two machine-learning models: a lightweight tree-based LightGBM classifier (F1=0.79, latency=9.4 s/sample) and a transformer-based BGE-M3 (F1=0.83, latency=50 ms/sample). With our proposed approach, we show that message analysis can achieve near-instant pump detection at the level of individual Telegram message windows. Unlike prior work that relies purely on market data and typically detects pumps tens of seconds after abnormal trading activity is observed, our method operates directly on the coordination messages themselves and can be evaluated in microseconds per window on commodity hardware. To our knowledge, we also establish the first benchmark for manipulated coin and exchange extraction. We demonstrate that traditional rule-based extraction methods, widely relied upon in prior literature, are ineffective due to ticker ambiguity. In contrast, LLMs achieve the highest accuracy with a score of 0.91.
Chinese Translation
通过Telegram协调的加密货币拉升与抛售计划威胁市场完整性。然而,现有研究针对这一特定威胁尚未提出将可靠结果与快速响应相结合的解决方案。这在一定程度上是由于缺乏公开可用的消息级标注数据以及设计选择。在本文中,我们解决了这两个问题。特别是,我们介绍了来自39个拉升组织群体的超过280,000条Telegram帖子,这些帖子均经过人工审核,以识别2,246条拉升公告及其目标加密货币和交易所。利用这一数据集,我们定义了两个任务:实时拉升公告检测和目标加密货币/交易所提取。在检测方面,我们比较了两种机器学习模型:一种轻量级树基的LightGBM分类器(F1=0.79,延迟=9.4秒/样本)和一种基于变换器的BGE-M3(F1=0.83,延迟=50毫秒/样本)。通过我们提出的方法,我们展示了消息分析可以在单个Telegram消息窗口级别实现近乎即时的拉升检测。与以往仅依赖市场数据并通常在观察到异常交易活动后数十秒才检测到拉升的研究不同,我们的方法直接针对协调消息本身进行操作,并且可以在普通硬件上以微秒级的速度进行评估。据我们所知,我们还建立了操纵币种和交易所提取的首个基准。我们证明了传统的基于规则的提取方法在先前文献中广泛依赖,但由于股票代码模糊性而无效。相比之下,LLMs以0.91的得分实现了最高的准确性。
cs.CL / 80 / 2605.09440
Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports
关键覆盖的重要性:OCR临床报告的半结构化提取
Abstract
Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration and longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We maintain a canonical key inventory through iterative key mining, normalization, clustering, and lightweight human verification, and introduce key coverage as a metric to quantify inventory completeness. Using a 0.2B BERT-based model, experiments on real-world reports from more than 20 hospitals show performance improves monotonically with key coverage. The model achieves F1 scores of 0.839 and 0.893 under exact match and boundary-tolerant matching, respectively, once the Top-90 canonical keys are covered. These results show that key coverage is a dominant factor for end-to-end performance. At Top-90 coverage, our model outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Although our annotated corpus is Chinese, the method relies on the language-agnostic key-value organization of semi-structured clinical reports and can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
Chinese Translation
临床报告常常在医疗机构之间呈现碎片化,因为隐私法规和数据孤岛限制了直接信息共享。当患者在不同医院寻求治疗时,他们通常会携带之前就诊的纸质或扫描报告。这妨碍了电子健康记录(EHR)的整合和纵向审查,以及依赖更完整患者记录的下游应用,如患者管理、后续护理、真实世界研究和临床试验匹配。尽管光学字符识别(OCR)可以数字化这些报告,但可靠的提取仍然具有挑战性,因为临床文档是异构的,OCR文本噪声较大,并且许多医疗环境要求低成本的本地部署。我们将此问题表述为基于OCR衍生的临床报告的典型关键条件提取式问答。由于关键字段既不固定也不事先已知,关键空间是开放的。我们通过迭代的关键挖掘、标准化、聚类和轻量级人工验证来维护一个典型关键库存,并引入关键覆盖作为量化库存完整性的指标。使用0.2B的基于BERT的模型,对来自20多家医院的真实报告进行的实验表明,性能随着关键覆盖的增加而单调提升。在覆盖前90个典型关键后,该模型在精确匹配和边界容忍匹配下分别达到了0.839和0.893的F1分数。这些结果表明,关键覆盖是端到端性能的主导因素。在前90个覆盖下,我们的模型在精确匹配下优于经过微调的Qwen3-0.6B基线。尽管我们的注释语料库是中文,但该方法依赖于半结构化临床报告的语言无关的键值组织,并且可以在给定适当的典型关键库存和别名映射的情况下适应其他环境。
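The abstract above introduces key coverage without giving a formula. As a minimal sketch, assuming coverage is the share of observed key mentions whose canonical form appears in a Top-K inventory, the metric could be computed as follows. The function names, the alias map, and the sample mentions are hypothetical and not taken from the paper.

```python
from collections import Counter


def canonicalize(key, alias_map):
    """Map a raw key to its canonical form (assumed alias lookup)."""
    return alias_map.get(key, key)


def top_k_inventory(observed_keys, k, alias_map=None):
    """Build a hypothetical Top-K inventory from the most frequent canonical keys."""
    alias_map = alias_map or {}
    counts = Counter(canonicalize(key, alias_map) for key in observed_keys)
    return {key for key, _ in counts.most_common(k)}


def key_coverage(observed_keys, inventory, alias_map=None):
    """Fraction of key mentions whose canonical form is in the inventory."""
    alias_map = alias_map or {}
    canonical = [canonicalize(key, alias_map) for key in observed_keys]
    if not canonical:
        return 0.0
    covered = sum(1 for key in canonical if key in inventory)
    return covered / len(canonical)


# Hypothetical key mentions mined from OCR text.
mentions = ["WBC", "white blood cell", "HGB", "PLT", "WBC", "rare marker"]
aliases = {"white blood cell": "WBC"}

inventory = top_k_inventory(mentions, k=2, alias_map=aliases)
coverage = key_coverage(mentions, inventory, alias_map=aliases)
print(f"coverage with top-2 keys: {coverage:.3f}")
```

Under this assumed definition, growing K can only raise coverage, which matches the monotone relationship the abstract reports between coverage and extraction quality.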
cs.CL / 81 / 2605.09463
Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven
超越位置偏差:将上下文压缩从位置驱动转向语义驱动
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks. However, their deployment in long-context scenarios faces high computational overhead and information redundancy. While soft prompt compression has emerged as a promising way to mitigate these costs by compressing sequences into compact embeddings, existing paradigms remain fundamentally constrained by position bias: they primarily rely on inserting learnable tokens at fixed positions or on grouping tokens according to their physical layout, thereby inducing performance instability and semantic fragmentation. To overcome this bottleneck, we propose Semantic Consistency Context Compression (SeCo), a method that shifts context compression from position-driven to semantic-driven. Rather than being constrained by physical token layout, SeCo dynamically anchors compression directly in the semantic space by selecting query-relevant tokens as semantic centers and aggregating remaining tokens via consistency-weighted merging. This design inherently preserves semantic consistency while eliminating position bias. Extensive experiments on 14 benchmarks across two backbone models demonstrate that SeCo is consistently superior in downstream task performance, inference latency, and out-of-domain robustness. The code is available at https://anonymous.4open.science/r/seco-EE5E.
Chinese Translation
大型语言模型(LLMs)在多种任务中表现出色。然而,它们在长上下文场景中的应用面临着高计算开销和信息冗余的问题。虽然软提示压缩已成为缓解这些成本的有前景的方法,通过将序列压缩为紧凑的嵌入,但现有范式仍然受到位置偏差的根本限制:它们主要依赖于在固定位置插入可学习的标记或根据物理标记布局对标记进行分组,从而导致性能不稳定和语义碎片化。为了克服这一瓶颈,我们提出了语义一致性上下文压缩(Semantic Consistency Context Compression,SeCo),该方法将上下文压缩从位置驱动转向语义驱动。SeCo并不受物理标记布局的限制,而是通过选择与查询相关的标记作为语义中心,并通过一致性加权合并聚合其余标记,动态地在语义空间中锚定压缩。这一设计本质上保持了语义一致性,同时消除了位置偏差。在两个基础模型的14个基准测试上的大量实验表明,SeCo在下游任务、推理延迟和领域外鲁棒性方面始终表现出优越性。代码可在 https://anonymous.4open.science/r/seco-EE5E 获取。
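The two steps named in the abstract above, selecting query-relevant tokens as semantic centers and merging the remaining tokens with consistency weights, can be illustrated with a toy sketch. This is not the SeCo implementation. Cosine similarity as the relevance and consistency score, nearest-center assignment, and clamping negative weights to zero are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def semantic_compress(tokens, query, num_centers):
    """Compress token embeddings into `num_centers` merged embeddings."""
    # Step 1 (assumed scoring): the tokens most similar to the query become centers.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cosine(tokens[i], query), reverse=True)
    center_ids = sorted(ranked[:num_centers])

    # Each group starts with its own center at weight 1.
    sums = {c: list(tokens[c]) for c in center_ids}
    weights = {c: 1.0 for c in center_ids}

    # Step 2 (assumed merging): every other token joins its most similar
    # center, weighted by that similarity.
    for i, tok in enumerate(tokens):
        if i in sums:
            continue
        best = max(center_ids, key=lambda c: cosine(tok, tokens[c]))
        w = max(cosine(tok, tokens[best]), 0.0)
        for d, value in enumerate(tok):
            sums[best][d] += w * value
        weights[best] += w

    return [[v / weights[c] for v in sums[c]] for c in center_ids]


# Hypothetical 2-D embeddings: two tokens lean on each axis.
tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.3, 0.7]]
query = [1.0, 1.0]
compressed = semantic_compress(tokens, query, num_centers=2)
print(compressed)
```

In this sketch the grouping depends only on embedding similarity, not on where a token sits in the sequence, which is the contrast with position-driven compression that the abstract draws.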
cs.CL / 82 / 2605.09469
FinMoji: A Framework for Emoji-driven Sentiment Analysis in Financial Social Media
FinMoji:一种基于表情符号的金融社交媒体情感分析框架
Abstract
This paper explores the use of emojis in financial sentiment analysis, focusing on the social media platform StockTwits. Emojis, increasingly prevalent in digital communication, have potential as compact indicators of investor sentiment, which can be critical for predicting market trends. Our study examines whether emojis alone can serve as reliable proxies for financial sentiment and how they compare with traditional text-based analysis. We conduct a series of experiments using logistic regression and transformer models. We further analyze the performance, computational efficiency, and data requirements of emoji-based versus text-based sentiment classification. Using a balanced dataset of about 528,000 emoji-containing StockTwits posts, we find that emoji-only models achieve F1 approximately 0.75, lower than text-emoji combined models, which achieve F1 approximately 0.88, but with far lower computational cost. This is a useful feature in time-sensitive settings such as high-frequency trading. Furthermore, certain emojis and emoji pairs exhibit strong predictive power for market sentiment, demonstrating over 90 percent accuracy in predicting bullish or bearish trends. Finally, our research reveals large statistical differences in emoji usage between financial and general social media contexts, stressing the need for domain-specific sentiment analysis models.
Chinese Translation
本文探讨了在金融情感分析中使用表情符号,重点关注社交媒体平台 StockTwits。表情符号在数字通信中日益普及,作为投资者情感的紧凑指示符具有潜力,这对于预测市场趋势至关重要。我们的研究考察了表情符号是否可以单独作为金融情感的可靠代理,以及它们与传统文本分析的比较。我们使用逻辑回归和变换器模型进行了一系列实验。进一步分析了基于表情符号与基于文本的情感分类的性能、计算效率和数据需求。使用约 528,000 条包含表情符号的 StockTwits 帖子的平衡数据集,我们发现仅使用表情符号的模型 F1 值约为 0.75,低于文本与表情符号结合模型的 F1 值约为 0.88,但计算成本远低。这在高频交易等时间敏感的环境中是一个有用的特性。此外,某些表情符号和表情符号对在市场情感预测中表现出强大的预测能力,准确率超过 90%,能够预测牛市或熊市趋势。最后,我们的研究揭示了金融社交媒体与一般社交媒体环境中表情符号使用的显著统计差异,强调了特定领域情感分析模型的必要性。
cs.CL / 83 / 2605.09476
Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification
对齐与闪耀:构建高质量的多语言句子对齐语料库以实现文本简化
Abstract
Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian and Spanish). We describe mechanisms for deriving sentence-level alignments from document-level data. The resulting dataset of aligned sentence pairs is publicly available.
Chinese Translation
文本简化在提高书面信息对不同受众(包括语言学习者和识字能力有限的读者)的可及性和可理解性方面发挥着至关重要的作用。尽管其重要性不言而喻,但用于训练和评估文本简化模型的大规模高质量数据集在英语以外的语言中仍然稀缺。本文报告了一项实验研究,旨在从可比语料库中收集和处理众包简化数据,以构建适用于多种语言(加泰罗尼亚语、英语、法语、意大利语和西班牙语)文本简化系统训练和测试的语料库。我们报告了从文档级数据进行句子级对齐的机制。最终生成的对齐句子对数据集已公开可用。
cs.CL / 84 / 2605.09483
A Cognitively Grounded Bayesian Framework for Misinformation Susceptibility
一个基于认知的贝叶斯框架用于信息失实易感性研究
Abstract
In this (work in progress) paper, we present Bounded Pragmatic Listener (or BPL), a cognitively grounded Bayesian framework for modelling susceptibility to information disorder. BPL extends Rational Speech Act theory with three cognitively motivated bounds derived from the bounded rationality literature: a) a recursion depth bound (reflecting working memory limits); b) a prior compression parameter (capturing an information bottleneck); and c) an availability sample size (operationalising importance sampling with saliency-weighted proposals). This allows us to test predictions about misinformation susceptibility, annotator disagreement, and the differential vulnerability to mis-, dis-, and mal-information as defined in the Information Disorder framework. We validate BPL on the LIAR and MultiFC benchmarks, showing competitive veracity classification and experimental support for the depth-mismatch paradox.
Chinese Translation
在这篇(进行中的)论文中,我们提出了有限务实听众(Bounded Pragmatic Listener,简称 BPL),这是一个基于认知的贝叶斯框架,用于建模对信息失序的易感性。BPL 扩展了理性言语行为理论,结合了三种源自有限理性文献的认知动机约束:a) 递归深度约束(强调工作记忆的限制);b) 先验压缩参数(旨在捕捉信息瓶颈);以及 c) 可用样本大小(通过显著性加权提议实现重要性采样)。这使我们能够测试关于信息失实易感性、标注者分歧以及根据信息失序框架定义的对误信息、失信息和恶信息的不同脆弱性的预测。我们在 LIAR 和 MultiFC 基准上验证了 BPL,展示了竞争性的真实性分类和对深度不匹配悖论的实验支持。
cs.CL / 85 / 2605.09490
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
并非所有思维都需要 HBM:面向语义的内存层次结构用于大语言模型推理
Abstract
Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
Chinese Translation
推理型大语言模型(LLMs)生成成千上万的思维链令牌,其键值缓存必须驻留在稀缺的 GPU 高带宽内存(HBM)中。主流的应对方式是永久驱逐低重要性的令牌,这对推理造成了灾难性影响:当一半的缓存被移除时,准确率降至 0-2.5%。我们提出了一个不同的问题:每个令牌都必须驻留在 HBM 中吗,还是有些可以驻留在其他地方?我们引入了一种面向语义的内存层次结构,将令牌分为四个层次——HBM、DDR、压缩和驱逐——并使用累积注意力评分进行排序。低重要性的令牌被移动到 CPU 内存中,而不是被销毁;在每个注意力步骤之前,它们以全精度预取回,贡献的项与从未离开 GPU 时完全相同。我们将其形式化为零近似误差卸载,并得出我们的核心发现:准确率仅依赖于永久丢弃的令牌数量(驱逐比例),而不是保留在 HBM 中的数量。对 HBM 和驱逐比例进行的受控 3x3 网格实验在三个模型规模(7B-32B)和四个基准测试中证实了这一点。在仅有 3% 驱逐的情况下,该层次结构在 GSM8K 上保持了 91% 的全缓存准确率,在 MATH-500(n=200)上保持了 71%;在 14B 规模时,其准确率与未压缩基线相匹配(90% 对比 86%),同时将 HBM 占用减半。在我们的设置中,针对当前最先进的驱逐方法 R-KV 的对比复现仅在可比预算下实现了 0-32% 的准确率。一个具有真实 GPU-CPU 数据移动的系统原型显示,这种保留的代价适中——传输开销为 5-7%——并且规模分析预计在生产批量规模下可节省 2-48 GB 的 HBM。
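The four-tier placement described in the abstract above can be sketched as a ranking problem. The following is a simplified illustration, assuming tokens are ranked by cumulative attention score and split by fixed ratios. The function name, the ratio parameters, and the sample scores are hypothetical, and the real system also handles prefetching and compression, which are omitted here.

```python
def assign_tiers(cumulative_attention, hbm_ratio, ddr_ratio, compressed_ratio):
    """Place each token in HBM, DDR, compressed, or evicted by score rank.

    Whatever is left after the three ratios is evicted, so the eviction
    ratio is 1 - (hbm_ratio + ddr_ratio + compressed_ratio).
    """
    n = len(cumulative_attention)
    ranked = sorted(range(n), key=lambda i: cumulative_attention[i], reverse=True)
    n_hbm = round(n * hbm_ratio)
    n_ddr = round(n * ddr_ratio)
    n_cmp = round(n * compressed_ratio)

    tiers = {}
    for rank, idx in enumerate(ranked):
        if rank < n_hbm:
            tiers[idx] = "HBM"
        elif rank < n_hbm + n_ddr:
            tiers[idx] = "DDR"          # offloaded to CPU memory, still recoverable
        elif rank < n_hbm + n_ddr + n_cmp:
            tiers[idx] = "compressed"
        else:
            tiers[idx] = "evicted"      # permanently discarded
    return tiers


# Hypothetical cumulative attention scores for ten cached tokens.
scores = [0.9, 0.1, 0.5, 0.7, 0.05, 0.3, 0.8, 0.2, 0.6, 0.4]
tiers = assign_tiers(scores, hbm_ratio=0.5, ddr_ratio=0.4, compressed_ratio=0.0)
evicted = [i for i, t in tiers.items() if t == "evicted"]
print(tiers, evicted)
```

With these example ratios half of the tokens stay in HBM while only one in ten is discarded. The abstract's central claim is that accuracy tracks the discarded share, not the HBM share.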
cs.CL / 86 / 2605.09492
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD:用于可靠大型语言模型生成的自适应路径对比解码
Abstract
Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty-king/APCD.
Chinese Translation
大型语言模型(LLMs)常常由于自回归解码中的错误积累而出现幻觉,其中早期次优的标记选择误导后续生成。尽管多路径解码可以通过探索替代轨迹来提高鲁棒性,但现有方法缺乏确定何时分支以及如何调节路径间交互的原则性策略。我们提出了自适应路径对比解码(Adaptive Path-Contrastive Decoding,APCD),这是一种通过自适应探索和受控路径交互来提高输出可靠性的多路径解码框架。APCD由两个组件组成:(1)基于熵的路径扩展(Entropy-Driven Path Expansion),该组件在预测不确定性(通过对顶级候选标记的香农熵进行测量)表明存在多个合理延续之前,延迟分支;(2)关注发散的路径对比(Divergence-Aware Path Contrast),该组件在预测分布发散时鼓励多样化的推理轨迹,同时动态减弱路径间的影响。对八个基准的实验表明,在保持解码效率的同时,事实准确性得到了改善。我们的代码可在 https://github.com/zty-king/APCD 获取。
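The branching rule in the first APCD component can be illustrated with a short sketch: compute Shannon entropy over the top candidate tokens and branch only when it exceeds a threshold. Renormalizing the top-k probabilities, the value of k, and the threshold of 1.0 nat are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def top_k_entropy(probs, k):
    """Shannon entropy (in nats) over the renormalized top-k probabilities."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total == 0:
        return 0.0
    renorm = [p / total for p in top]
    return -sum(p * math.log(p) for p in renorm if p > 0)


def should_branch(probs, k=5, threshold=1.0):
    """Branch into multiple paths only when the next token is uncertain."""
    return top_k_entropy(probs, k) > threshold


# Hypothetical next-token distributions.
confident = [0.97, 0.01, 0.01, 0.005, 0.005]
uncertain = [0.3, 0.3, 0.2, 0.1, 0.1]

print(should_branch(confident))  # False: one continuation dominates
print(should_branch(uncertain))  # True: several plausible continuations
```

Delaying expansion this way keeps decoding on a single path while the model is confident, so extra compute is spent only where alternative continuations are plausible.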
cs.CL / 87 / 2605.09496
Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models
超越语言:大型语言模型中的格式无关推理子空间
Abstract
Large language models represent the same reasoning in vastly different surface forms -- English prose, Python code, mathematical notation -- yet whether they share a common internal substrate across these symbolic systems remains unknown. We introduce the TriForm Benchmark (18 concepts x 6 forms x 3 instances = 324 stimuli) and study five LLMs (1.6B-8B) across three architecture families. Using permutation-corrected RSA, cross-form probing, and activation patching, we find converging evidence for a Format-Agnostic Reasoning Subspace (FARS) in middle layers. We make FARS concrete: concept-centroid PCA extracts a 10-dimensional subspace that amplifies concept structure 3x while suppressing form information to near zero. Replacing only these 10 dimensions during cross-form patching preserves 90-96% of model output -- far exceeding both full activation replacement (44-56%) and variance-maximizing PCA (60-74%) -- while ablating them causes targeted disruption. FARS generalizes to held-out concepts and converges across architectures (CCA > 0.79 for all model pairs), providing within-modality evidence for the Platonic Representation Hypothesis. We further discover a declarative-procedural asymmetry: representations are far more compatible between prose and mathematics than between either and code, suggesting that the critical axis of divergence is not linguistic vs. formal but declarative vs. procedural.
Chinese Translation
大型语言模型以截然不同的表面形式表示相同的推理——英语散文、Python代码、数学符号——但它们是否在这些符号系统之间共享一个共同的内部基质仍然未知。我们引入了TriForm基准(18个概念 x 6种形式 x 3个实例 = 324个刺激),并研究了五个大型语言模型(1.6B-8B)在三种架构家族中的表现。通过使用置换校正的RSA、跨形式探测和激活补丁,我们发现中间层存在格式无关推理子空间(FARS)的趋同证据。我们使FARS具体化:概念中心PCA提取了一个10维子空间,该子空间将概念结构放大了3倍,同时将形式信息压制至接近零。在跨形式补丁过程中,仅替换这10个维度可以保留90-96%的模型输出——远超完全激活替换(44-56%)和方差最大化PCA(60-74%)——而去除这些维度则会导致针对性的干扰。FARS可以推广到未见概念,并在不同架构之间趋同(所有模型对的CCA > 0.79),为理想表示假说提供了同一模态内的证据。我们进一步发现了一个声明性-程序性不对称性:散文与数学之间的表示兼容性远高于它们与代码之间的兼容性,这表明关键的分歧轴不是语言与形式,而是声明性与程序性。
cs.CL / 88 / 2605.09502
Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
链式思维推理中的隐性错误意识:信号是诊断性的,而非因果性的
Abstract
Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC -- from the very first reasoning step (0.79) -- while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation steering, probe-guided best-of-N, self-correction, and activation patching -- all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
Chinese Translation
链式思维(CoT)提示假设生成的推理反映了模型的内部计算。我们展示了这一假设在特定、可测量的方式上是错误的:模型内部能够检测到自身的推理错误,但在外部却表现出对这些错误的自信。对隐藏状态的线性探测器在预测推理正确性方面达到了0.95的AUROC——从第一个推理步骤(0.79)开始——而对于错误推理的口头自信评分为4.55/5,几乎与正确推理(4.87/5)相同。一个文本表面分类器在相同数据上仅达到了0.59,确认了在生成文本中不可见的0.20点差距。这种隐性错误意识在三个模型家族(Qwen、Llama、Phi)、1.5B-72B参数以及经过强化学习训练的推理模型(DeepSeek-R1,0.852 AUROC)中均存在。自然的问题是,这一信号是否能够修正它所检测到的错误。答案是否定的。四种干预措施——激活引导、探测指导的最佳选择、自我纠正和激活修补——均以失败告终;修补完全破坏了输出的一致性。该信号是诊断性的,而非因果性的:它是计算质量的读出,而不是重定向的杠杆。这为机械可解释性划定了界限:推理过程中的错误表征与以往研究成功编辑的事实知识表征在根本上是不同的。
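The AUROC figures quoted above measure how well a probe's scores rank correct traces above incorrect ones. As a minimal sketch, a linear probe is a weighted sum of hidden-state features, and AUROC is the fraction of (correct, incorrect) pairs the probe orders properly. The weights, hidden states, and labels below are made-up toy values, not data from the paper.

```python
def probe_score(hidden, weights, bias=0.0):
    """Linear probe: a weighted sum of hidden-state features plus a bias."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy hidden states (2-D) with label 1 = correct trace, 0 = wrong trace.
weights = [1.0, -1.0]
hidden_states = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
labels = [1, 1, 0, 0]

scores = [probe_score(h, weights) for h in hidden_states]
separable_auroc = auroc(scores, labels)
mixed_auroc = auroc([0.9, 0.2, 0.8, 0.1], labels)
print(separable_auroc, mixed_auroc)
```

An AUROC of 0.5 corresponds to chance ranking, which puts the text-surface classifier's 0.59 close to uninformative and the probe's 0.95 close to perfect separation.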
cs.CL / 89 / 2605.09533
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications
工业问答应用中RAG与微调的评估
Abstract
Large Language Models (LLMs) are increasingly employed in enterprise question-answering (QA) systems, requiring adaptation to domain-specific knowledge. Among the most prevalent methods for incorporating such knowledge are Retrieval-Augmented Generation (RAG) and fine-tuning (FT). Yet, from a cost-accuracy trade-off perspective, it remains unclear which approach best suits industry scenarios. This study examines the impact of RAG and FT on two closed datasets specific to the automotive industry, assessing answer quality and operational costs. We extend the Cost-of-Pass framework proposed by Erol et al. (arXiv:2504.13359) to jointly assess output quality, generation cost, and user interaction cost. Our findings reveal that while premium models perform best out of the box, open-source models can achieve comparable quality when enhanced with RAG. Overall, RAG emerges as the most effective and cost-efficient adaptation method for both closed- and open-source models.
Chinese Translation
大型语言模型(LLMs)在企业问答(QA)系统中的应用日益增多,这需要对特定领域知识进行适应。其中,检索增强生成(Retrieval-Augmented Generation, RAG)和微调(Fine-Tuning, FT)是最常见的知识整合方法。然而,从成本与准确性的权衡角度来看,尚不清楚哪种方法最适合工业场景。本研究考察了RAG和FT对两个特定于汽车行业的封闭数据集的影响,评估了答案质量和运营成本。我们扩展了Erol等人提出的成本-通过率框架(Cost-of-Pass)以共同评估输出质量、生成成本和用户交互成本。研究结果表明,尽管高端模型在开箱即用的情况下表现最佳,但通过RAG增强的开源模型可以达到可比的质量。总体而言,RAG被认为是封闭源和开源模型中最有效且成本效益最高的适应方法。
cs.CL / 90 / 2605.09536
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD:用于快速准确扩散大语言模型的时间感知轨迹自蒸馏
Abstract
Diffusion large language models (dLLMs) offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off, where increasing tokens per forward (TPF) often degrades generation quality. Existing acceleration methods often gain speed at the cost of accuracy. To address this limitation, we propose TAD, a Temporal-Aware trajectory self-Distillation framework. During data construction, we condition a teacher model on both the prompt and the ground-truth response to generate decoding trajectories, recording the intermediate masked states throughout the process. Based on how many decoding steps remain before each masked token is revealed, we partition masked positions into near and distant subsets. For near tokens, we train the student with a hard cross-entropy loss using the teacher trajectory tokens as labels, encouraging confident predictions for tokens that are about to be decoded. For distant tokens, we apply a soft KL divergence loss between the teacher and student token distributions, providing softer supervision and preserving future planning knowledge. This temporal-aware partition naturally gives rise to two deployment configurations: a Quality model that prioritizes accuracy and a Speed model that favors more aggressive acceleration. Experiments show that TAD consistently improves the accuracy-parallelism trade-off. On LLaDA, it raises average accuracy from 46.2% to 51.6% with the Quality model and average AUP from 46.2 to 257.1 with the Speed model. Our code is available at: https://github.com/BHmingyang/TAD
Chinese Translation
扩散大语言模型(dLLMs)为并行文本生成提供了一种有前景的范式,但在实际应用中,它们面临着准确性与并行性之间的权衡,增加每次前向传播的标记数(TPF)往往会降低生成质量。现有的加速方法通常以牺牲准确性为代价来提高速度。为了解决这一限制,我们提出了TAD,一个时间感知轨迹自蒸馏框架。在数据构建过程中,我们将教师模型同时基于提示和真实响应进行条件化,以生成解码轨迹,并在整个过程中记录中间的掩码状态。根据每个掩码标记在被揭示之前剩余的解码步骤数量,我们将掩码位置划分为近距离和远距离子集。对于近距离标记,我们使用教师轨迹标记作为标签,通过硬交叉熵损失训练学生模型,鼓励对即将解码的标记做出自信的预测。对于远距离标记,我们在教师和学生标记分布之间应用软KL散度损失,提供更柔和的监督并保留未来规划知识。这种时间感知的划分自然产生了两种部署配置:一个优先考虑准确性的质量模型和一个更倾向于激进加速的速度模型。实验表明,TAD始终改善了准确性与并行性之间的权衡。在LLaDA上,它使质量模型的平均准确率从46.2%提高到51.6%,使速度模型的平均AUP从46.2提高到257.1。我们的代码可在以下链接获取:https://github.com/BHmingyang/TAD
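The near/distant split in the TAD abstract above amounts to choosing a loss per masked position. A minimal sketch follows, assuming a position counts as "near" when its remaining decoding steps fall at or below a threshold, and assuming the soft loss is KL(teacher || student). The threshold, the KL direction, and the toy distributions are illustrative assumptions, not details from the paper.

```python
import math

EPS = 1e-12  # numerical guard against log(0)


def tad_position_loss(student_probs, teacher_probs, teacher_token,
                      steps_remaining, near_threshold):
    """Per-position loss: hard cross-entropy if near, soft KL if distant."""
    if steps_remaining <= near_threshold:
        # Near: the teacher's decoded token is used as a hard label.
        return -math.log(student_probs[teacher_token] + EPS)
    # Distant: match the teacher's full distribution (assumed KL direction).
    return sum(t * math.log((t + EPS) / (s + EPS))
               for t, s in zip(teacher_probs, student_probs) if t > 0)


# Toy two-token vocabulary.
student = [0.5, 0.5]
teacher = [0.9, 0.1]

near_loss = tad_position_loss(student, teacher, teacher_token=0,
                              steps_remaining=1, near_threshold=2)
distant_loss = tad_position_loss(student, teacher, teacher_token=0,
                                 steps_remaining=8, near_threshold=2)
matched_loss = tad_position_loss(teacher, teacher, teacher_token=0,
                                 steps_remaining=8, near_threshold=2)
print(near_loss, distant_loss, matched_loss)
```

In this sketch the hard loss pushes the student toward a single token for positions about to be revealed, while the soft loss leaves the student free to keep probability mass on alternatives for positions that are still far from being decoded.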
cs.CL / 91 / 2605.09539
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
TacoMAS:基于LLM的多智能体系统中拓扑与能力的测试时共同演化
Abstract
Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or only the capabilities during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs birth-death operations on the MAS, including edge edits, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The code is released at https://github.com/chenxu2-gif/TacoMAS-MultiAgent.
Chinese Translation
多智能体系统(MAS)已成为解决复杂任务的有前景的范式。近期的研究探讨了自我演化的MAS,这些系统能够自动优化智能体的能力或通信拓扑。然而,现有方法要么学习在推理时保持固定的拓扑,要么仅在推理过程中适应拓扑或能力。我们通过实证和理论分析表明,有效的测试时演化需要在不同的时间尺度上共同适应这两个方面:能力应快速更新以应对新出现的子任务,而拓扑应更慢地演变以保持协调的稳定性。随后,我们引入了TacoMAS,一个用于动态MAS的测试时共同演化框架。TacoMAS将MAS推理公式化为在线图适应的任务,其中节点代表具有角色特定能力的智能体,边定义它们的通信拓扑。在推理过程中,一个快速的能力循环使用轨迹级反馈更新智能体的专业知识,而一个较慢的由元LLM驱动的拓扑循环在MAS上执行智能体的出生-死亡操作,包括边编辑、智能体添加和智能体移除。我们进一步表明,这种快-慢设计推动MAS演化朝向任务条件下的稳定平衡。对四个基准的实验表明,TacoMAS在近20个多智能体基线中表现优越,平均提高了13.3%,超过最强基线。代码已发布在 https://github.com/chenxu2-gif/TacoMAS-MultiAgent。
cs.CL / 92 / 2605.09548
Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
跨语言在线自蒸馏用于多语言推理
Abstract
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
Chinese Translation
大型语言模型(LLMs)在数学推理方面取得了显著进展,但这种能力在不同语言之间并不均衡,尤其是低资源语言的推理表现明显较低。为了解决这个问题,我们提出了跨语言在线自蒸馏(Crosslingual On-Policy Self-Distillation, COPSD),该方法将模型在高资源语言中的推理行为转移到低资源语言。COPSD使用同一模型作为学生和教师:学生仅接触低资源问题,而教师则获得特权的跨语言上下文,包括问题翻译和英文参考解答。训练过程中最小化学生自身回滚的全分布令牌级别差异,提供密集的监督,同时避免了仅基于结果的强化学习(RL)中的稀疏性和不稳定性。在17种低资源非洲语言上的实验表明,COPSD在不同模型规模下始终改善低资源数学推理,并显著优于组相对策略优化(Group Relative Policy Optimization, GRPO)。进一步分析显示,COPSD改善了答案格式的一致性,增强了测试时的扩展性,并在更具挑战性的多语言推理基准上实现了泛化,尤其是对低资源语言的提升幅度较大。我们的代码和数据可在以下链接获取:https://github.com/cisnlp/COPSD。
cs.CL / 93 / 2605.09554
Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs
朝向紧凑的手语翻译:帧率与模型大小的权衡
Abstract
Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
Chinese Translation
手语翻译(SLT)将手语视频转换为口语文本,架起了聋人和听人社区之间的沟通桥梁。目前的无词汇翻译方法依赖于大型编码器-解码器模型,限制了其部署。我们提出了一种紧凑的77M参数管道,将MMPose骨骼姿态提取与单一线性投影结合到T5-small中。通过改变输入帧率,我们揭示了一个实用的效率权衡:在12 fps时,模型将其序列长度减半,实现了编码器二次自注意力计算复杂度的75%减少,同时仅造成BLEU-4的适度下降(在How2Sign上,12 fps为9.53,而24 fps为10.06)。我们的系统大约比之前的T5-base系统小3倍,证明了轻量级架构在没有层次编码器或大规模模型的情况下仍然可以保持竞争力。
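The 75% figure in the abstract above follows from the quadratic cost of self-attention: halving the sequence length quarters the pairwise attention work. The arithmetic can be checked with a short sketch. The function name is hypothetical, and the calculation covers only the quadratic attention term, not the rest of the encoder.

```python
def attention_cost_ratio(fps_new, fps_old):
    """Relative quadratic self-attention cost after changing the frame rate.

    Sequence length scales linearly with frame rate, and self-attention
    cost scales with the square of sequence length.
    """
    length_ratio = fps_new / fps_old
    return length_ratio ** 2


ratio = attention_cost_ratio(12, 24)
reduction = 1 - ratio
bleu_drop = round(10.06 - 9.53, 2)  # BLEU-4 figures reported in the abstract

print(f"attention cost ratio: {ratio}")   # 0.25
print(f"reduction: {reduction:.0%}")      # 75%
print(f"BLEU-4 drop: {bleu_drop}")        # 0.53
```

The same relation implies diminishing returns at lower frame rates: going from 12 fps to 6 fps would remove a further three quarters of the remaining attention cost, while the translation-quality cost of that step is not reported in the abstract.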
cs.CL / 94 / 2605.09584
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
CLR-voyance:通过结果感知评分标准增强住院临床决策支持的开放式推理
Abstract
Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL reward signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair and an adaptive rubric, the first for clinical reasoning that is verifiable against the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%), and performs comparably or better on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, where physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences of model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.
Chinese Translation
住院临床推理是在部分可观察性下的顺序决策:临床医生看到目前的入院情况,并必须选择下一个行动,其后续后果尚不可见。现有的临床大语言模型(clinical-LLM)评估和强化学习(RL)奖励信号将其简化为封闭形式的检索、临床旅程泄漏或无锚的LLM作为评判者的评分。我们提出了CLR-voyance,一个将住院推理重新表述为部分可观察马尔可夫决策过程(POMDP)的框架,并用同时基于结果和经过临床医生验证的奖励进行监督。我们将该表述实例化为CLR-POMDP,它将成功的患者旅程划分为策略可见的过去和仅限于神谕的未来。利用过去的信息,一个神谕LLM生成一个特定案例的查询-回答对,以及第一个可在患者旅程未来中验证的临床推理自适应评分标准。这些评分标准用于住院临床推理模型的后训练和评估。我们使用GRPO对Qwen3-8B和MedGemma-4B进行后训练,然后进行模型合并,取得了最先进的住院临床推理能力,同时保留了通用能力。CLR-voyance-8B在CLR-POMDP上达到了84.91%的成绩,领先于前沿医疗推理模型,如GPT-5(77.83%)和MedGemma-27B(66.66%),并在现有医疗基准上表现出可比或更好的性能。为了确保临床上有意义的设置,我们进行了大规模的临床医生对齐研究,医生为每个案例策划评分标准,评估候选响应,并提供模型推理的盲对偏好。这项研究为临床LLM作为评判者和临床偏好模型选择提供了见解,能够为整个社区提供信息。CLR-voyance已在合作公立医院部署超过6个月,撰写了数千份以推理为重的住院记录。
cs.CL / 95 / 2605.09603
Edit-Based Refinement for Parallel Masked Diffusion Language Models
基于编辑的并行掩码扩散语言模型的细化
Abstract
Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at https://github.com/renhouxing/ME-DLM.
Chinese Translation
掩码扩散语言模型能够实现并行的标记生成,并在解码效率上优于自回归模型。然而,当同时生成多个标记时,由于标记级训练目标与联合序列一致性之间的不匹配,其性能显著下降。本文提出了ME-DLM,一个基于编辑的细化框架,通过轻量级的后期编辑步骤增强扩散生成。在生成初始完整响应后,模型通过最小的编辑操作(包括替换、删除和插入)对其进行细化,这些操作是基于完整序列进行条件化的。训练监督来自于编辑距离,为学习最小修正提供了在固定标准化方案下的确定性信号。这种方法通过全局条件化的编辑鼓励序列级一致性,同时保留了并行扩散解码的效率优势。大量实验表明,ME-DLM提高了多标记并行生成的质量和鲁棒性。特别是,当基于LLaDA构建时,我们的方法在HumanEval上获得了11.6分的一致提升,在GSM8K上获得了33.6分的提升,同时使用了总扩散步骤的八分之一。代码可在 https://github.com/renhouxing/ME-DLM 获取。
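The abstract above derives training supervision from edit distance under a fixed canonicalization scheme. One way to picture this is a Levenshtein alignment with deterministic tie-breaking, which turns a draft and a reference into a unique list of replace, delete, and insert operations. The sketch below is a generic illustration of that idea. The tie-breaking order (keep, replace, delete, insert) is an assumption and is not claimed to be the scheme used by ME-DLM.

```python
def minimal_edits(draft, target):
    """Return (edit distance, operation list) turning `draft` into `target`."""
    n, m = len(draft), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if draft[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,   # keep or replace
                             dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1)         # insert

    # Backtrace with a fixed preference order so the result is deterministic.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and draft[i - 1] == target[j - 1]
        if same and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("keep", draft[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and not same and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", draft[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", draft[i - 1]))
            i -= 1
        else:
            ops.append(("insert", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


draft = "the cat sat on mat".split()
target = "the cat sat on the mat".split()
distance, ops = minimal_edits(draft, target)
print(distance, ops)
```

Because the backtrace always prefers the same operation when several are optimal, the same draft and reference always yield the same operation sequence, which is the property a deterministic training signal needs.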
cs.CL / 96 / 2605.09611
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
检索增强生成中的字节精确去重:基于公共基准的三种模式实证分析
Abstract
This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict <5% Wilson 95% upper-bound MAT threshold in both the clean and high-redundancy RAG regimes. This work demonstrates that substantial inference compute savings can be achieved deterministically without compromising evaluation-grade model quality.
Chinese Translation
本预印本呈现了在检索增强生成(Retrieval-Augmented Generation, RAG)管道中进行字节精确块级去重的实证分析。我们在三种不同的操作模式下测量了上下文减少情况:干净的学术检索(在2220万BeIR段落中减少0.16%的字节)、构建的企业模式(减少24.03%)以及多轮对话AI(减少80.34%)。为了验证质量的保持,我们在四个生产API(Google Gemini 2.5 Flash、Anthropic Claude Sonnet 4.6、Meta Llama 3.3 70B和OpenAI GPT-5.1)之间进行了跨供应商的五位评审员校准小组评估。通过对小组中主要不同(Materially Different, MAT)对的五类人机协作去噪协议的应用,我们确定字节精确去重不会引入可测量的质量下降。审计后,所有四个供应商在干净和高冗余的RAG模式下均满足严格的<5% Wilson 95%上限MAT阈值。本研究表明,在不妥协评估级模型质量的情况下,可以确定性地实现显著的推理计算节省。
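Byte-exact chunk-level deduplication, as described in the abstract above, can be sketched in a few lines: keep the first occurrence of each chunk whose bytes match exactly and measure the resulting reduction in context size. Using SHA-256 digests as the equality key and UTF-8 as the encoding are assumptions made for this sketch, and the sample chunks are invented.

```python
import hashlib


def dedupe_chunks(chunks):
    """Keep the first occurrence of every byte-identical chunk, in order."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def byte_reduction(chunks, kept):
    """Fraction of context bytes removed by deduplication."""
    before = sum(len(c.encode("utf-8")) for c in chunks)
    after = sum(len(c.encode("utf-8")) for c in kept)
    return 0.0 if before == 0 else 1 - after / before


# Hypothetical retrieved chunks; the last differs only in letter case,
# so a byte-exact comparison treats it as distinct.
retrieved = ["policy A", "policy B", "policy A", "Policy A"]
kept = dedupe_chunks(retrieved)
reduction = byte_reduction(retrieved, kept)
print(kept, reduction)
```

Because only exact byte matches are removed, the procedure is deterministic and cannot drop a chunk that carries different content. The savings depend entirely on how much verbatim repetition the retrieval setting produces, which is consistent with the wide spread across the three regimes in the abstract.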
cs.CL / 97 / 2605.09618
Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols
统计侦查发现安全但不实用的辩论案例:开放权重大语言模型推理协议的匹配天花板研究
Abstract
When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
Chinese Translation
语言模型何时应直接回答、进行抽样投票或参与多智能体辩论?近期研究表明,投票往往解释了辩论所带来的大部分收益,而选择性辩论系统仅在不确定的例子上激活深思熟虑。我们提出问题:在生成令牌的匹配天花板(每个例子960个令牌)下,存在多少每个例子的路由余地,以及从廉价的前期深思信号中可恢复多少?我们使用Llama 3.1 8B Instruct和Ministral 3 8B Instruct评估贪婪解码、三样本投票和两智能体批评-修正辩论,针对MuSiQue和GSM8K进行测试。在MuSiQue上,选择每个例子正确协议的神谕获得了比最佳固定协议高出+14.0和+13.7个百分点的收益。最佳固定协议依赖于模型和数据集:每个(模型,数据集)单元都有不同的赢家。这个余地很难从廉价的前期信号中恢复。投票熵阈值是唯一一个在两个模型上方向性优于最佳固定协议的控制器(+1.3和+1.7个百分点),尽管单个配对自助置信区间包含零。联合分析(元分析+1.6个百分点,p=0.125;贝叶斯P(两者>0)=0.59)在方向上是一致的,但不显著。学习的控制器(LR,GBT)未能超越该阈值。关键发现是结构性的:投票熵预测辩论安全的地方,而不是辩论所需的地方。高熵显著减少辩论反噬,但66%的辩论有益例子(31/47)发生在投票一致但错误的情况下。在Llama上的单提示自我批评探测在127/127个一致案例中翻转了答案,与辩论有益标签的互信息为零;我们无法排除提示合规伪影,但无论哪种解释都使该探测不适合作为路由器。恢复剩余的余地需要避免格式合规混淆的行为探测,在8B规模上进行。
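The vote-entropy controller described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `vote_entropy` and `route` and the threshold value `0.5` are assumptions made for the example.

```python
import math
from collections import Counter


def vote_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def route(answers, threshold=0.5):
    """Escalate to debate only when the samples disagree enough.

    `threshold` is a hypothetical value chosen for illustration;
    it is not taken from the paper.
    """
    return "debate" if vote_entropy(answers) > threshold else "vote"
```

With three samples, a unanimous vote such as `["12", "12", "12"]` has entropy 0 and is routed to `"vote"`, whether or not the shared answer is correct. That is the structural limitation the abstract points to: unanimous-but-wrong examples never cross any entropy threshold.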
cs.CL / 98 / 2605.09630
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
临时补丁:在字节级语言模型中解耦计算与补丁大小
Abstract
Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.
Chinese Translation
无分词器的语言模型通过直接在字节上操作,消除了语言建模流程中的分词器步骤;基于补丁的变体进一步将连续的字节区间聚合为补丁以提高效率。然而,在模型设计阶段选择的平均补丁大小决定了一个紧密的权衡:较大的补丁减少了计算和键值缓存(KV-cache)的占用,但会降低建模质量。我们将这一权衡归因于补丁滞后:在一个补丁完全被观察之前,其中的字节预测必须依赖于来自前一个补丁的过时表示以保持因果关系;随着补丁变大,这种滞后会加剧。我们提出了临时补丁(Scratchpad Patching, SP),它在每个补丁内部插入临时的临时补丁,以聚合迄今为止看到的字节并刷新补丁级上下文以进行后续预测。SP通过下一个字节预测的熵触发临时补丁,选择性地将计算分配给信息密集区域,并使推理时的计算能够事后调整。在自然语言和代码的实验中,SP在相同的补丁大小下提高了模型质量;例如,即使在每个补丁16字节的情况下,增强的SP模型在下游评估中与字节级基线相匹配或接近,同时在补丁上使用了16倍更小的KV缓存和3-4倍更少的推理计算。
cs.CL / 99 / 2605.09634
Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness
我们能信任大型语言模型进行心理健康筛查吗?一致性、ASR 鲁棒性与证据可信度
Abstract
LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
Chinese Translation
大型语言模型(LLMs)可以以零样本方式从语音中估计医院焦虑与抑郁量表(HADS)分数,但临床应用需要在三个维度上具备可靠性:模型内部一致性、自动语音识别(ASR)鲁棒性和证据可信度。我们对三种大型语言模型(Phi-4、Gemma-2-9B 和 Llama-3.1-8B)进行了评估,参与者为 111 名讲英语的受试者,使用真实转录文本和三种 Whisper ASR 变体(Large、Medium、Small),每个模型-条件对进行了三次独立运行。我们发现:(i)Phi-4 和 Gemma-2-9B 在 ASR 条件下保持了极好的模型内部一致性(ICC > 0.89),且降级最小;(ii)Llama-3.1-8B 显示出对 ASR 脆弱的一致性,ICC 从 0.82 降至 0.36,当 WER 达到 10% 时;(iii)对于鲁棒模型,预测效度在 ASR 条件下基本得以保持;(iv)Phi-4 和 Gemma-2-9B 的关键词基础度超过 93%,而 Llama-3.1-8B 降至 77-81%。模型间的关键词一致性远低于分数级一致性,揭示了分数与证据之间的脱节,这对临床可解释性具有重要影响。
cs.CL / 100 / 2605.09635
K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
K12-KGraph:一个与课程对齐的知识图谱,用于基准测试和训练教育领域的大型语言模型
Abstract
Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
Chinese Translation
大型语言模型(LLMs)在K-12教育中的应用日益增多,但现有基准如C-Eval、CMMLU、GaokaoBench和EduEval主要通过考试风格的问题回答来评估事实记忆。有效的教育人工智能还需要课程认知:理解知识如何通过先决条件链、概念分类、实验-概念链接和教学顺序进行结构化。为了解决这一空白,我们引入了K12-KGraph,一个从人民教育出版社的官方教材中提取的与课程对齐的知识图谱,涵盖从小学到高中的数学、物理、化学和生物学。该图谱包含七种节点类型(概念、技能、实验、练习、小节、章、书籍)和九种关系类型,涵盖分类、先决条件、关联、验证、评估、位置和顺序。基于该图谱,我们构建了两个资源:(1)K12-Bench,一个包含23,640个多选问题的基准,跨越五个图谱衍生的任务家族(Ground、Prereq、Neighbor、Evidence和Locate);(2)K12-Train,一个约2,300个问答对的KG引导监督微调语料库,合成自图谱结构和节点属性。实验揭示了课程认知的显著不足:在K12-Bench上,Gemini-3-Flash仅实现57%的精确匹配,而最佳开源模型Gemma-4-31B-IT仅达到46%。在Qwen3-4B-Base和Llama-3.1-8B-Base上严格匹配的2,300样本SFT预算下,K12-Train在GaokaoBench和EduEval上始终优于来自八个主流指令调优语料库的同等大小子集,证明了课程结构化监督在教育调优中的高样本效率。我们发布了该图谱、基准、训练数据和完整构建流程。
cs.CL / 101 / 2605.09661
MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies
MedMeta:一个用于评估大型语言模型在医学研究中合成荟萃分析结论的基准
Abstract
Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018--2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson's r correlation (0.81) and Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.
Chinese Translation
大型语言模型(LLMs)在测试事实回忆的标准医学基准中已趋于饱和,但它们在进行更高阶推理(如从多个来源合成证据)的能力仍然未得到充分探索。为了解决这一问题,我们引入了MedMeta,这是第一个旨在评估LLM从医学荟萃分析中仅使用引用研究摘要生成结论能力的基准。MedMeta包含来自PubMed(2018-2025)的81个荟萃分析,并通过两种不同的工作流程对模型进行评估:一种是带有真实摘要的检索增强生成(Golden-RAG)设置,另一种是依赖内部知识的仅参数方法。我们的评估框架通过结构良好的分析得到了验证,结果表明我们的LLM作为评判者的协议与人类专家评分高度一致,Pearson相关系数高达0.81,Bland-Altman分析显示系统偏差微乎其微,确立了其作为可扩展评估的可靠代理。我们的研究结果强调了信息基础的重要性:Golden-RAG工作流程在各模型中始终显著优于仅参数方法。相比之下,领域特定的微调带来的好处微乎其微,并在提供外部材料时基本被中和。此外,压力测试显示所有模型,无论架构如何,都未能识别和拒绝否定证据,突显了当前RAG系统的一个关键脆弱性。值得注意的是,即使在理想的RAG条件下,当前的LLM表现也仅略高于平均水平(~2.7/5.0)。MedMeta为证据合成提供了一个具有挑战性的全新基准,并表明在临床应用中,开发稳健的RAG系统比单纯的模型专业化更具前景。
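The judge-validation step mentioned in the abstract above (Pearson correlation plus Bland-Altman analysis) can be sketched in a few lines. The function names and the 1.96 multiplier for the 95% limits of agreement follow the standard textbook definitions; the example ratings in the usage note are toy numbers, not data from the paper.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bland_altman(judge, human):
    """Return (bias, lower limit, upper limit) of agreement.

    bias is the mean judge-minus-human difference; the limits are
    bias +/- 1.96 sample standard deviations of the differences.
    """
    diffs = [j - h for j, h in zip(judge, human)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

For hypothetical ratings `judge = [3, 4, 2, 5, 3]` and `human = [3, 3, 2, 4, 4]`, the bias is 0.2 on the 5-point scale. A bias near zero with narrow limits is what "negligible systematic bias" would look like; a high `pearson_r` alone does not establish that, since correlation is insensitive to a constant offset.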
cs.CL / 102 / 2605.09739
The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
沉默的投票:通过聚合语义邻域提高零-shot LLM 的可靠性
Abstract
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
Chinese Translation
大型语言模型越来越多地被用作复杂推理任务中的零-shot 分类器。然而,标准的约束解码存在一种我们定义为重归一化偏差(Renormalization Bias)的现象。当模型被限制在一小部分目标标签时,标准的 softmax 操作会丢弃原始分布中分配给语义同义词的概率质量。这种信息的丧失,我们称之为沉默的投票,导致了人为的过度自信和较差的校准。我们提出了语义 softmax(Semantic Softmax),这是一种推理时层,通过聚合每个目标标签周围的语义邻域的得分来恢复这些丢失的信息。我们在 Qwen-3 和 Phi-4-mini 模型上使用 GoEmotions 和 Civil Comments 数据集评估了这种方法。我们的结果在所有评估指标上均显示出一致的改善:语义 softmax 显著降低了期望校准误差(Expected Calibration Error, ECE)和 Brier 分数,同时在 AUROC 和 Macro-F1 的判别性能方面也得到了增强。通过考虑语言细微差别,我们的方法为零-shot 分类提供了更为校准和准确的替代方案。
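The contrast between constrained decoding and neighborhood aggregation can be illustrated with a small sketch. It assumes that a label's score is the log-sum-exp of the logits of the tokens in its semantic neighborhood; the abstract does not specify the exact aggregation rule, so treat `neighborhood_softmax` and the toy logits in the usage note as hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def logsumexp(xs):
    """log(sum(exp(x))) computed stably."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def constrained_softmax(logits, labels):
    """Standard approach: renormalize over the target label tokens only."""
    return dict(zip(labels, softmax([logits[l] for l in labels])))


def neighborhood_softmax(logits, neighborhoods):
    """Assumed aggregation: pool each label's neighborhood mass before normalizing."""
    labels = list(neighborhoods)
    pooled = [logsumexp([logits[t] for t in neighborhoods[l]]) for l in labels]
    return dict(zip(labels, softmax(pooled)))
```

With toy logits `{"joy": 2.0, "happy": 1.9, "glad": 1.8, "anger": 2.1, "furious": -1.0}`, the constrained softmax over `["joy", "anger"]` slightly prefers `anger`, because the mass on `happy` and `glad` is discarded. Pooling `joy` with `happy` and `glad`, and `anger` with `furious`, flips the preference to `joy`.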
cs.CL / 103 / 2605.09751
Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
无需可训练输入嵌入表的语言模型:从固定的最小二进制令牌编码中学习
Abstract
Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2 V\rceil$ bits. We replace the usual trainable $V\times d_{\text{model}}$ input embedding matrix with fixed minimal binary token codes and a zero-parameter lift to model width. In our main setting, $V=65{,}536$, so $K=16$, and tokens are represented by fixed 16-dimensional binary codes tiled to $d_{\text{model}}=1024$. We also evaluate a fully table-free variant in which codes are generated from token IDs on the fly and randomly recoded by an invertible affine transform over $\mathbb{F}_2^K$. Across matched 32-layer decoder-only models trained on approximately 17B tokens and evaluated over three independent training seeds, fixed minimal codes achieve comparable held-out validation perplexity to a standard learned-input baseline while removing 67.1M trainable input parameters. The fixed-code runs have a lower mean validation perplexity in our experiments, 2.36 versus 2.44, but the observed gap is within the measured seed-to-seed variation of 4.8\%; we therefore interpret the result as evidence that the trainable input table is not necessary, rather than as a statistically resolved superiority claim. The table-free affine-recoded variant remains close at 2.39 despite a slightly shorter training run. These results show that, in this regime, a trainable input embedding table is not necessary for useful language modeling. The output projection remains standard and trainable.
Chinese Translation
可训练的输入嵌入表是现代语言模型的标准组成部分。我们探讨它们在输入接口上是否真的必要。对于大小为 $V$ 的词汇,确切的令牌身份只需要 $K=\lceil \log_2 V\rceil$ 位。我们用固定的最小二进制令牌编码和零参数提升到模型宽度的方式替代了通常的可训练 $V\times d_{\text{model}}$ 输入嵌入矩阵。在我们的主要设置中,$V=65{,}536$,因此 $K=16$,令牌由固定的16维二进制编码表示,平铺到 $d_{\text{model}}=1024$。我们还评估了一种完全无表的变体,其中编码是根据令牌ID动态生成的,并通过在 $\mathbb{F}_2^K$ 上的可逆仿射变换随机重新编码。在对大约17B个令牌进行训练并在三个独立训练种子上进行评估的匹配32层解码器模型中,固定的最小编码在保持验证困惑度方面达到了与标准学习输入基线相当的效果,同时去除了67.1M的可训练输入参数。在我们的实验中,固定编码的运行具有更低的平均验证困惑度,分别为2.36对比2.44,但观察到的差距在测量的种子间变异4.8\%之内;因此,我们将结果解释为可训练输入表并非必要的证据,而不是统计上明确的优越性声明。尽管训练过程稍短,无表的仿射重新编码变体仍然接近2.39。这些结果表明,在这种情况下,进行有用的语言建模并不需要可训练的输入嵌入表。输出投影仍然是标准的并且是可训练的。
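The fixed-code input described in the abstract above can be sketched as follows. The bit order (least-significant bit first), the raw 0/1 values, and the function names are assumptions for illustration; the abstract states only that 16-bit codes are tiled to width 1024 with no trainable parameters.

```python
def num_bits(vocab_size):
    """K = ceil(log2(V)) computed with integer arithmetic."""
    return (vocab_size - 1).bit_length()


def binary_code(token_id, k):
    """Fixed K-bit code of a token ID, least-significant bit first (assumed order)."""
    return [(token_id >> b) & 1 for b in range(k)]


def lift(code, d_model):
    """Zero-parameter lift: tile the code until it fills d_model entries.

    Assumes d_model is a multiple of the code length, as with 16 and 1024.
    """
    assert d_model % len(code) == 0
    return code * (d_model // len(code))
```

For `V = 65536`, `num_bits` gives 16 and each code is repeated 64 times to reach 1024 entries. The saved parameter count quoted in the abstract matches the replaced table: 65,536 × 1,024 = 67,108,864, or about 67.1M.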
cs.CL / 104 / 2605.09760
ConFit v3: Improving Resume-Job Matching with LLM-based Re-Ranking
ConFit v3:基于大型语言模型的简历-职位匹配重排序的改进
Abstract
A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. While recent advances in embedding-based methods such as ConFit and ConFit v2 can efficiently retrieve candidates at scale, the lack of controllability and explainability limits their real-world adaptations. LLM-based re-rankers can address these limitations through reasoning, but existing training recipes are developed on short-document benchmarks and do not account for noise in real-world recruiting data. In this work, we first conduct a systematic analysis over the LLM re-ranker training pipeline for person-job fit, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation. We find that using multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and distilling from a stronger LLM before RL significantly improves re-ranking performance. We then aggregate these findings to train ConFit v3 with Qwen3-8B and Qwen3-32B on real-world person-job fit datasets, and find significant improvements over existing best person-job fit systems as well as strong LLMs such as GPT-5 and Claude Opus-4.5. We hope our findings provide useful insights for future research on adapting LLM-based re-rankers to person-job fit systems.
Chinese Translation
一个可靠的简历-职位匹配系统帮助公司从简历池中找到合适的候选人,并帮助求职者从职位发布列表中找到相关工作。尽管最近在基于嵌入的方法(如 ConFit 和 ConFit v2)方面的进展可以高效地大规模检索候选人,但缺乏可控性和可解释性限制了它们在现实世界中的应用。基于大型语言模型(LLM)的重排序器可以通过推理来解决这些局限性,但现有的训练方案是在短文档基准上开发的,并未考虑现实招聘数据中的噪声。在本研究中,我们首先对针对人职匹配的LLM重排序器训练流程进行了系统分析,涵盖了推理算法设计、强化学习(RL)算法选择、数据处理和监督微调(SFT)蒸馏。我们发现,使用多次重排序、采用基于列表的RL目标、去除噪声样本,以及在进行RL之前从更强的LLM中进行蒸馏,显著提高了重排序性能。随后,我们将这些发现整合起来,使用 Qwen3-8B 和 Qwen3-32B 在现实世界的人职匹配数据集上训练 ConFit v3,并发现其在现有最佳人职匹配系统以及强大的LLM(如 GPT-5 和 Claude Opus-4.5)上有显著改进。我们希望我们的发现能为未来将基于LLM的重排序器适应于人职匹配系统的研究提供有用的见解。
cs.CL / 105 / 2605.09773
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
无欺骗的剥削:黑暗三位一体特征引导揭示语言模型中可分离的反社会电路
Abstract
We use sparse autoencoder (SAE) feature steering to amplify Dark Triad personality traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluate the resulting behavioral changes across five psychological instruments. The steered model becomes substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62) while its cognitive empathy remains intact, reproducing the empathy dissociation characteristic of human Dark Triad populations. Critically, strategic deception is completely unaffected across all features, suggesting that exploitation and deception may operate through dissociable computational pathways in large language models. Individual feature analysis reveals non-redundant encoding, with each feature driving distinct antisocial mechanisms through separable computational pathways. We also show that feature discovery method itself modulates intervention depth: contrastively-discovered features change both self-report and behavior, while semantically-searched features change only self-report (d=12.65 between methods on behavior). These findings suggest that antisocial tendencies in at least one large language model comprise dissociable components rather than a unified construct, with implications for how such tendencies should be detected, measured, and controlled.
Chinese Translation
我们使用稀疏自编码器(Sparse Autoencoder, SAE)特征引导来增强Llama-3.3-70B-Instruct中的黑暗三位一体人格特征(马基雅维利主义、自恋和精神病态),并评估在五种心理测量工具下产生的行为变化。经过引导的模型在新的行为场景中变得显著更具剥削性、攻击性和冷酷性(d=10.62),而其认知同情心保持不变,重现了人类黑暗三位一体群体的同情心解离特征。关键是,战略欺骗在所有特征上完全不受影响,这表明剥削和欺骗可能通过可分离的计算路径在大型语言模型中运作。对单个特征的分析揭示了非冗余编码,每个特征通过可分离的计算路径驱动不同的反社会机制。我们还展示了特征发现方法本身调节干预深度:对比发现的特征同时改变自我报告和行为,而语义搜索的特征仅改变自我报告(在行为上的方法间差异为d=12.65)。这些发现表明,至少在一个大型语言模型中,反社会倾向由可分离的组成部分构成,而非统一构造,这对如何检测、测量和控制这些倾向具有重要意义。
cs.CL / 106 / 2605.09795
cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu
cantnlp@DravidianLangTech 2026:有机领域适应提高了图卢语中的多类别希望言论检测
Abstract
This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.
Chinese Translation
本文介绍了我们在第六届德拉威语语言技术研讨会(DravidianLangTech-2026)中参与的代码混合图卢语希望言论检测共享任务的系统和结果。我们训练了一个基于XLM-RoBERTa的文本分类系统,用于检测代码混合图卢语社交媒体评论中的希望言论。我们将这一有机适应的希望言论检测模型与我们的基线模型进行了比较。在开发集上,有机适应模型的表现优于基线系统。尽管我们提交的系统在官方测试集上的表现较为温和,但这些结果表明,进一步在有机收集的包含代码混合和混合脚本变体的图卢语社交媒体文本上适应XLM-RoBERTa可以改善代码混合图卢语中的希望言论检测。
cs.CL / 107 / 2605.09808
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
量化用户模拟器在构建协作型大语言模型助手中的效用
Abstract
User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human--AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.
Chinese Translation
用户模拟器越来越多地被用于构建交互式人工智能助手,但如何衡量这些模拟器的质量仍然是一个未解的问题。在本研究中,我们展示了如何通过下游效用来量化模拟器的质量:即使用该用户模拟器训练的LLM助手在与真实人类互动时的表现。在一个受控实验中,仅用户模拟器有所不同,我们通过强化学习训练LLM助手,使用一系列模拟器,从一个被提示进行角色扮演的LLM到一个基于WildChat中人类话语进行微调的模拟器。作为评估,我们在一个包含283名参与者的用户研究中测量了成对胜率,并在WildBench上进行评估,后者是基于真实人类与人工智能对话的基准。与角色扮演LLM训练的助手在我们的用户研究中表现出统计上无法区分的效果(胜率51%),而与微调模拟器训练的助手则显著提高(比初始助手提高58%,比与角色扮演训练的助手提高57%)。更深入的检查揭示了三个进一步的模式:使角色扮演LLM更逼真的方法(例如,角色条件化)可以改善训练助手,但并未缩小与微调模拟器之间的差距;扩大模拟器的模型规模对微调模拟器有益,但对角色扮演模拟器没有收益;而与角色扮演模拟器训练的助手在测试时与其他模拟器配对时无法泛化,而与微调模拟器训练的助手则可以。综合来看,这些结果支持将用户模拟器与真实人类行为相结合,并通过其对真实用户的下游影响来衡量其质量。
cs.CL / 108 / 2605.09838
The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care
基于变换器的情感分析与常规心理治疗中的症状困扰和恶化的关联性
Abstract
Sentiment analysis has been of long-standing interest in psychotherapy research. Recently, the Transformer deep learning architecture has produced text-based sentiment analysis models that are highly accurate and context-aware. These models have been explored as proxies for emotion measurement instruments in psychotherapy, but not investigated as stand-alone psychometric tools. Using proposed utterance-level and session-level sentiment features derived from a fine-grained sentiment model on a large corpus of psychotherapy sessions (N = 751), we investigate the distribution of session aggregated sentiment scores. Further, we characterize the relationship of these features to individual components and the overall score of the OQ-45 instrument and find that this sentiment feature is most strongly correlated to components related to emotional valence in directionally intuitive ways. Finally, we report that there are statistically significant differences between the sentiment distributions for patients flagged as at risk of deterioration or dropping out of care via either the OQ Rational or Empirical outcome models. These correlations to a fully-validated psychometric instrument demonstrate that these proposed sentiment features are, at least, adjunctive measures of client distress and deterioration.
Chinese Translation
情感分析在心理治疗研究中一直备受关注。最近,变换器(Transformer)深度学习架构产生了高度准确且具备上下文感知的文本情感分析模型。这些模型已被探索作为心理治疗中情感测量工具的替代品,但尚未作为独立的心理测量工具进行研究。我们使用从大规模心理治疗会话语料库(N = 751)中提取的细粒度情感模型所提出的发言级和会话级情感特征,调查会话聚合情感评分的分布。此外,我们描述了这些特征与OQ-45量表的各个组成部分及整体评分之间的关系,发现这些情感特征与情感效价相关的组成部分之间的相关性最强,且方向上符合直觉。最后,我们报告了通过OQ理性或经验结果模型标记为有恶化风险或可能退出治疗的患者之间情感分布存在统计学显著差异。这些与完全验证的心理测量工具的相关性表明,这些提出的情感特征至少是客户困扰和恶化的辅助测量指标。
cs.CL / 109 / 2605.09893
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
语言模型中的伪审议:当推理未能对齐价值观与行动时
Abstract
Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo-Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.
Chinese Translation
大型语言模型(LLMs)通常根据其表述的价值观进行评估,但这些价值观并不可靠地转化为其行动,这种差异被称为“价值-行动差距”。在本研究中,我们认为这一差距在明确推理的情况下仍然存在,揭示了一种更深层次的失败模式,我们称之为“伪审议”:即表面上看似有原则的推理却没有相应的行为对齐。为了系统地研究这一现象,我们引入了VALDI,一个用于测量表述价值观与生成对话之间对齐程度的框架。VALDI包含了跨五个领域的4,941个以人为中心的场景、三个引发价值表达、推理和行动的任务,以及五个量化价值遵循程度的指标。在专有和开源的LLMs中,我们观察到表达的价值观与下游对话之间存在一致的错位。为了研究干预策略,我们提出了VIVALDI,一个多智能体价值审计工具,能够在生成的不同阶段进行干预。
cs.CL / 110 / 2605.09915
Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents
立场:学术会议可能面临由全自动科学代理引发的分母游戏
Abstract
The implicit policy of maintaining relatively stable acceptance rates at top AI conferences, despite exponentially growing submissions, introduces a critical structural vulnerability. This position paper characterizes a new systemic threat we term Agentic Denominator Gaming, in which a malicious actor deploys AI agents to generate and submit a large volume of superficially plausible but low-quality papers. Crucially, their objective is not the acceptance of low-quality papers, but rather to inflate the submission denominator and overwhelm reviewing capacity. Under a relatively stable acceptance rate, this dilution can systematically increase the publication probability of a small, targeted set of legitimate papers. We analyze the practical feasibility of this threat and its broader consequences, including intensified reviewer burnout, degraded review quality, and the emergence of industrialized automated agent mills. Finally, we propose and evaluate a range of mitigation strategies, and argue that durable protection will require system-level policy and incentive reforms, rather than relying primarily on technical detection alone.
Chinese Translation
尽管提交数量呈指数级增长,维持顶级人工智能会议相对稳定的接受率的隐性政策引入了一个关键的结构性脆弱性。本文将一种新的系统性威胁称为代理分母游戏(Agentic Denominator Gaming),其中恶意行为者利用人工智能代理生成并提交大量表面上似乎合理但质量低劣的论文。关键在于,他们的目标并不是接受低质量论文,而是增加提交的分母,从而压垮审稿能力。在相对稳定的接受率下,这种稀释可以系统性地提高一小部分目标合法论文的发表概率。我们分析了这一威胁的实际可行性及其更广泛的后果,包括审稿人疲劳加剧、审稿质量下降以及工业化自动代理工厂的出现。最后,我们提出并评估了一系列缓解策略,并认为持久的保护需要系统层面的政策和激励改革,而不仅仅依赖于技术检测。
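The dilution argument in the abstract above reduces to simple arithmetic, sketched here under strong simplifying assumptions: the venue holds its acceptance rate exactly fixed, every injected paper is rejected, and review quality for legitimate papers is unchanged. The function name and the example numbers in the usage note are hypothetical.

```python
def legit_acceptance_rate(rate, legit, injected):
    """Share of legitimate papers accepted when the rate is fixed on total submissions.

    rate:     nominal acceptance rate applied to all submissions
    legit:    number of legitimate submissions
    injected: number of low-quality submissions, assumed to be always rejected
    """
    slots = rate * (legit + injected)   # accepted papers scale with the denominator
    return min(1.0, slots / legit)      # all slots go to legitimate papers
```

With a nominal rate of 0.25 and 10,000 legitimate submissions, adding 2,000 injected papers raises the effective rate for legitimate papers to 0.30. The same shift applies to the attacker's small targeted set, which is the incentive the position paper describes.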
cs.CL / 111 / 2605.09922
Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs
基于团队的自我对弈与双重自适应加权用于微调大型语言模型
Abstract
While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member's contribution during training. Initialized from a SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.
Chinese Translation
尽管近期的自我训练方法减少了对人工标注数据的依赖,以便对大型语言模型(LLMs)进行对齐,但仍面临关键限制:(i)对合成数据质量的敏感性,导致迭代训练中的不稳定性和偏差放大;(ii)由于正负响应在连续训练迭代中差距缩小,优化效果不佳。本文提出了一种新颖的自我对弈算法——基于团队的自我对弈与双重自适应加权(TPAW),旨在改善完全自我监督环境下的对齐。TPAW采用团队框架,其中当前策略模型既与历史检查点协作,又进行竞争,从而促进更稳定和高效的优化。为了进一步增强学习,我们设计了两种自适应加权机制:(i)响应重加权方案,调整目标响应的重要性;(ii)玩家加权策略,动态调节每个团队成员在训练过程中的贡献。TPAW从一个SFT模型初始化,迭代精炼对齐,无需额外的人类监督。实验结果表明,TPAW在各种基础模型和大型语言模型基准测试中始终优于现有基线。我们的代码已公开发布在 https://github.com/lab-klc/TPAW。
cs.CL / 112 / 2605.09924
Evolving Knowledge Distillation for Lightweight Neural Machine Translation
轻量级神经机器翻译的演化知识蒸馏
Abstract
Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar trends are observed on other datasets. These results demonstrate that EKD effectively bridges the capacity gap, enabling compact models to achieve performance close to that of much larger teacher models. Code and models are available at https://github.com/agi-content-generation/EKD.
Chinese Translation
最近神经机器翻译(NMT)的进展显著提高了翻译质量。然而,最先进模型的规模和复杂性不断增加,给在资源有限的设备上部署带来了重大挑战。知识蒸馏(KD)是一种有前景的模型压缩方法,但当教师模型和学生模型之间存在较大容量差距时,其有效性会减弱。为了解决这一问题,我们提出了演化知识蒸馏(EKD),这是一种渐进式训练框架,学生模型从一系列逐渐增大容量的教师模型中学习。在IWSLT-14、WMT-17和WMT-23基准测试中的实验表明,EKD在每个阶段都带来了持续的改进。在IWSLT-14上,最终的学生模型达到了34.24的BLEU分数,将与最强教师模型(34.32 BLEU)之间的差距缩小至仅0.08 BLEU。在其他数据集上也观察到了类似的趋势。这些结果表明,EKD有效地弥合了容量差距,使得紧凑模型能够达到接近于更大教师模型的性能。代码和模型可在 https://github.com/agi-content-generation/EKD 获取。
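The progressive scheme in the abstract above can be sketched with a standard temperature-scaled KL distillation loss and a teacher ordering by capacity. This is a generic illustration rather than the EKD objective: the loss form, the temperature of 2.0, and the names `kd_loss` and `teacher_schedule` are assumptions, and the abstract does not state which loss the authors use.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def teacher_schedule(teachers_by_size):
    """Order teachers from smallest to largest capacity for staged distillation."""
    return [name for name, _ in sorted(teachers_by_size.items(), key=lambda kv: kv[1])]
```

In a staged run, the student would be trained against each teacher in the order returned by `teacher_schedule`, so that the capacity gap at any single stage stays small. Here `teacher_schedule({"large": 300, "small": 60, "medium": 120})` yields `["small", "medium", "large"]`; the sizes are placeholders.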
cs.CL / 113 / 2605.09931
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
PruneTIR:有效且高效的推理时工具调用剪枝方法
Abstract
Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and can help LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved successfully within a few subsequent turns. If not, LLMs often struggle to resolve such errors even with many additional turns. Building on the above observations, we propose PruneTIR, an effective and efficient framework that enhances tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. These three components enable PruneTIR to mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.
Chinese Translation
工具集成推理(Tool-integrated reasoning, TIR)使大型语言模型(Large Language Models, LLMs)能够通过与外部工具(如代码解释器 Code Interpreters, CI)交互来增强其能力。最近的研究主要集中在探索各种方法,以赋予 LLMs 使用工具的能力。然而,如何在推理时进一步提升已经具备工具能力的 LLMs 的推理能力仍然未被充分探索。在推理时改善推理能力无需额外训练,并且可以帮助 LLMs 更好地利用工具解决问题。我们观察到,在具备工具能力的 LLM 推理过程中,错误工具调用的数量和比例与答案的正确性呈负相关。此外,错误的工具调用通常在随后的几个回合中能够成功解决。如果不能,LLMs 通常在经历许多额外回合后仍然难以解决此类错误。基于上述观察,我们提出了 PruneTIR,这是一个相当有效且高效的框架,旨在增强推理时的工具集成推理。在 LLM 推理过程中,PruneTIR 通过三个组件:成功触发剪枝(Success-Triggered Pruning)、卡住触发剪枝与重采样(Stuck-Triggered Pruning and Resampling)以及重试触发工具暂停(Retry-Triggered Tool Suspension)来剪枝轨迹、重新采样工具调用和暂停工具使用。这三个组件使 PruneTIR 能够减轻错误工具调用的负面影响,并防止 LLMs 在重复失败的解决尝试中陷入困境,从而提高整体 LLM 性能。大量实验结果证明了 PruneTIR 的有效性,它显著提高了 Pass@1 和效率,同时减少了具备工具能力的 LLMs 的工作上下文长度。
cs.CL / 114 / 2605.09932
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT:针对稀释感知的长上下文微调的双层优化
Abstract
Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K--32K context lengths; on RULER, it raises CWE aggregation from 72.9\% to 81.1\% at 16K; and on GPQA with agentic tool use, it yields a 24\% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529$\times$ and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
Chinese Translation
大型语言模型现在可以处理越来越长的输入,但它们有效利用长上下文中分散信息的能力仍然有限。我们追溯这一差距的原因在于在长序列的监督微调(SFT)过程中注意力预算的分配:位置偏差和注意力沉没导致模型将大部分注意力分配给位置上优先的标记,而非语义相关的内容。这种训练时的注意力稀释(注意力分布中内容标记的饥饿)削弱了梯度信号,限制了模型学习稳健的长上下文能力。我们提出了FocuSFT,一个在训练时解决此问题的双层优化框架。内层循环在训练上下文上调整轻量级快权重参数,以形成一个集中注意力于相关内容的参数化记忆,外层循环则基于这一锐化表示进行SFT。两个循环在上下文标记上应用双向注意力,同时为响应保留因果掩蔽,减少导致注意力沉没的因果不对称,并对齐内外部行为。在BABILong上,FocuSFT在4K到32K的上下文长度上提高了准确性,最高可达+14个百分点;在RULER上,将16K时的CWE聚合从72.9\%提高到81.1\%;在GPQA与自主工具使用的情况下,获得了24\%的相对提升。在注意力分析中,FocuSFT将注意力沉没质量减少了529倍,并在训练期间使上下文参与度增加了三倍。代码:https://github.com/JarvisPei/FocuSFT
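The attention pattern described in this abstract (bidirectional over context tokens, causal over response tokens) can be illustrated with a small mask builder. This is only a sketch under stated assumptions: the function name `build_attention_mask` and the layout "context tokens first, response tokens after" are hypothetical choices made for illustration, not taken from the FocuSFT code.

```python
def build_attention_mask(num_context: int, num_response: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed layout (hypothetical): positions [0, num_context) are context tokens,
    followed by num_response response tokens.
    """
    total = num_context + num_response
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context queries see every context token (bidirectional),
                # but never look ahead into the response.
                mask[i][j] = j < num_context
            else:
                # Response queries see the whole context plus earlier-or-equal
                # response positions (standard causal masking).
                mask[i][j] = j <= i
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With three context tokens and two response tokens, the first context token can attend to the last context token, while the first response token cannot attend to the second one.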
cs.CL / 115 / 2605.09934
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
TRACER:可验证的多模态工具使用代理的生成来源
Abstract
Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
Chinese Translation
多模态大型语言模型越来越多地通过调用外部工具来解决以视觉为中心的任务,如视觉检查、光学字符识别(OCR)、检索、计算和多步推理。目前的工具使用代理通常仅展示执行的工具轨迹和最终答案,但很少明确指出哪些工具观察支持每个生成的主张。我们将这种缺失的主张级依赖结构称为来源缺口。该缺口使得工具使用难以验证和优化,因为有用的证据、冗余的探索和不支持的推理混合在同一轨迹中。我们提出了TRACER,一个用于多模态工具使用代理的可验证生成来源框架。TRACER不是在生成后添加引用,而是与每个答案句子一起生成一个结构化的来源记录,该记录识别支持的工具回合、证据单元和语义支持关系。其关系空间包含引用(Quotation)、压缩(Compression)和推理(Inference),涵盖直接重用、忠实凝练和基于事实的推导。TRACER通过模式检查、工具回合对齐、来源真实性和关系合理性来验证每个记录,然后将验证后的来源转换为可追溯性约束和来源衍生的局部信用,以用于强化学习。我们进一步构建了TRACE-Bench,一个用于从粗略的多模态工具轨迹中进行句子级来源重建的基准。在TRACE-Bench上,简单地添加工具往往会引入噪声。使用Qwen3-VL-8B,TRACER达到了78.23%的答案准确率和95.72%的摘要准确率,超越了最强的闭源工具增强基线23.80个百分点。与仅使用工具的监督微调相比,它还将测试集的工具调用总数从4949减少到3486。这些结果表明,可靠的多模态工具推理依赖于对观察结果的来源感知使用,而不仅仅是更多的工具调用。
cs.CL / 116 / 2605.09955
Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks
超越多数投票:基于一致性的聚类方法以建模主观自然语言处理任务中的标注者视角
Abstract
Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between annotators. We conduct comprehensive experiments on 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Among the aggregation approaches, multi-label and multitask are better suited to modeling clustered annotators than the ensemble and majority-vote approaches.
Chinese Translation
标注中的分歧是自然语言处理(NLP)数据集开发中的一种常见现象,并且是一个宝贵的洞察来源。尽管多数投票仍然是聚合标签的主要策略,但最近的研究探索了建模个别标注者以保留他们的视角。然而,建模每个标注者资源密集,并且在各种NLP任务中仍然未得到充分探索。我们提出了一种基于一致性的聚类技术,以建模标注者之间的分歧。我们在18种类型学上多样的语言中的40个数据集上进行了全面实验,涵盖了三种主观NLP任务:情感分析、情绪分类和仇恨言论检测。我们评估了四种聚合方法:多数投票、集成、多标签和多任务。结果表明,基于一致性的聚类能够利用标注者视角的全谱,并显著提升主观NLP任务中的分类性能,相较于多数投票和个别标注者建模。在聚合方法方面,多标签和多任务方法在建模聚类标注者方面优于集成和模型多数投票。
cs.CL / 117 / 2605.09973
GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction
GLiNER2-PII:一种用于个人可识别信息提取的多语言模型
Abstract
Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.
Chinese Translation
在现代数据处理系统中,可靠检测个人可识别信息(PII)变得越来越重要,但这一任务依然困难:PII的范围异质性、依赖于地区、受上下文影响,并且常常嵌入在嘈杂或半结构化的文档中。我们提出了GLiNER2-PII,这是一个从GLiNER2改编的小型0.3B参数模型,旨在以字符跨度的分辨率识别42种广泛的PII实体类型。然而,训练这样的系统受到可共享注释数据稀缺性以及大规模收集真实PII所带来的隐私风险的限制。为了解决这一挑战,我们构建了一个多语言合成语料库,包含4,910个注释文本,采用约束驱动的生成管道,生成跨语言、领域、格式和实体分布的多样化、真实的示例。在具有挑战性的SPY基准测试中,GLiNER2-PII在五个比较系统中(包括OpenAI隐私过滤器和三个基于GLiNER的检测器)实现了最高的跨度级F1分数。我们在Hugging Face上公开发布该模型,以支持开放PII检测系统的进一步研究和实际部署。
cs.CL / 118 / 2605.09990
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin:用于大语言模型推理中的无损上下文优化的确定性字节精确去重
Abstract
Data-intensive applications, ranging from large-scale retrieval systems to advanced data pipelines, are increasingly bottlenecked by the processing of highly redundant text corpora. We present Merlin, a local-first, agnostic, high-throughput deduplication and context optimization engine designed to mitigate these inefficiencies. Utilizing a highly optimized, SIMD-friendly open-addressing flat hash set combined with xxHash3-64, Merlin performs rapid, byte-exact deduplication of text passages and data chunks. While broadly applicable to any text-processing workflow, its impact is particularly pronounced in Large Language Model (LLM) ecosystems, such as Retrieval-Augmented Generation (RAG). Our empirical evaluations demonstrate an input reduction ranging from 13.9% in low-redundancy datasets to over 71% in high-redundancy pipelines, maintaining absolute data fidelity. Furthermore, we detail the system's integration architecture via the Model Context Protocol (MCP), enabling secure, zero-network-interception deployment across major IDEs and autonomous agents. This paper outlines the core algorithmic design, performance benchmarks, and the architectural principles required to process data at sustained speeds of up to 8.7 GB/s.
Chinese Translation
数据密集型应用,从大规模检索系统到先进的数据管道,越来越受到高度冗余文本语料库处理的瓶颈影响。我们提出了Merlin,一个以本地优先、无关的高吞吐量去重和上下文优化引擎,旨在缓解这些低效问题。Merlin利用高度优化、适合SIMD的开放寻址平面哈希集,结合xxHash3-64,快速执行文本段落和数据块的字节精确去重。虽然广泛适用于任何文本处理工作流,但其影响在大语言模型(LLM)生态系统中尤为显著,例如检索增强生成(RAG)。我们的实证评估显示,在低冗余数据集中的输入减少率为13.9%,而在高冗余管道中超过71%,同时保持绝对数据保真性。此外,我们通过模型上下文协议(MCP)详细介绍了系统的集成架构,使其能够在主要集成开发环境(IDE)和自主代理中实现安全的零网络拦截部署。本文概述了核心算法设计、性能基准以及处理数据所需的架构原则,能够以高达8.7 GB/s的持续速度进行处理。
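Byte-exact deduplication of the kind this abstract describes can be sketched in a few lines. The sketch below makes two labeled substitutions: it uses the standard library's `hashlib.blake2b` with an 8-byte digest as a stand-in for xxHash3-64 (which needs a third-party package), and a plain `dict` instead of a SIMD-friendly open-addressing hash set. The function name `dedupe_exact` is hypothetical, and none of this reflects Merlin's actual implementation or throughput.

```python
import hashlib


def dedupe_exact(chunks):
    """Keep the first occurrence of each byte-identical chunk, preserving order."""
    seen = {}    # 64-bit digest -> list of distinct payloads sharing that digest
    unique = []
    for chunk in chunks:
        data = chunk.encode("utf-8") if isinstance(chunk, str) else bytes(chunk)
        # Stand-in 64-bit hash; Merlin itself is described as using xxHash3-64.
        key = hashlib.blake2b(data, digest_size=8).digest()
        bucket = seen.setdefault(key, [])
        if data in bucket:
            # Full byte comparison, so a hash collision can never drop distinct data.
            continue
        bucket.append(data)
        unique.append(chunk)
    return unique


if __name__ == "__main__":
    passages = ["alpha", "beta", "alpha", "alpha ", "beta"]
    kept = dedupe_exact(passages)
    print(kept, f"reduction: {1 - len(kept) / len(passages):.1%}")
```

Because the comparison is on raw bytes, passages that differ only by trailing whitespace are treated as distinct, which is what "byte-exact" implies.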
cs.CL / 119 / 2605.09995
Annotations Mitigate Post-Training Mode Collapse
注释缓解后训练模式崩溃
Abstract
Post-training (via supervised fine-tuning) improves instruction-following, but often induces semantic mode collapse by biasing models toward low-entropy fine-tuning data at the expense of the high-entropy pretraining distribution. Crucially, we find this trade-off worsens with scale. To close this semantic diversity gap, we propose annotation-anchored training, a principled method that enables models to adopt the preference-following behaviors of post-training without sacrificing the inherent diversity of pretraining. Our approach is simple: we pretrain on documents paired with semantic annotations, inducing a rich annotation distribution that reflects the full breadth of pretraining data, and we preserve this distribution during post-training. This lets us sample diverse annotations at inference time and use them as anchors to guide generation, effectively transferring pretraining's semantic richness into post-trained models. We find that models trained with annotation-anchored training can attain $6 \times$ less diversity collapse than models trained with SFT, and improve with scale.
Chinese Translation
后训练(通过监督微调)改善了指令遵循能力,但往往会通过使模型偏向低熵微调数据而牺牲高熵预训练分布,从而引发语义模式崩溃。重要的是,我们发现这种权衡在规模扩大时会加剧。为了缩小这种语义多样性差距,我们提出了基于注释的训练,这是一种原则性的方法,使模型能够在不牺牲预训练固有多样性的情况下,采用后训练的偏好遵循行为。我们的方法很简单:我们在与语义注释配对的文档上进行预训练,诱导出反映预训练数据全貌的丰富注释分布,并在后训练期间保持这种分布。这使我们能够在推理时采样多样的注释,并将其作为锚点来引导生成,有效地将预训练的语义丰富性转移到后训练模型中。我们发现,采用基于注释的训练的模型可以实现比采用监督微调(SFT)训练的模型少 $6 \times$ 的多样性崩溃,并且随着规模的扩大而改善。
cs.CL / 120 / 2605.10025
Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning
基于标签的示例选择在少样本学习中的医疗事件因果因素及预防措施生成
Abstract
In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from details of the medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags--some include descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies--random sampling, cosine similarity-based selection, and our proposed tag-based method--using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
Chinese Translation
在医疗等高风险领域,大型语言模型(LLMs)的可靠性至关重要,尤其是在从事件报告中生成临床洞察时。本研究提出了一种基于标签的少样本示例选择方法,以促使LLMs从医疗事件的详细信息中生成背景/因果因素和预防措施。在我们的实验中,我们使用了日本医疗事件数据集(Japanese Medical Incident Dataset, JMID),该数据集包含3,884个真实世界的医疗事故和近失误报告,且这些报告被标注了多种标签,其中一些包含描述性信息(例如,“药物”,“输血治疗”)。我们比较了三种少样本示例选择策略——随机抽样、基于余弦相似度的选择和我们提出的基于标签的方法——使用GPT-4o和LLaMA 3.3。结果表明,基于标签的方法实现了最高的精确度和最稳定的生成行为,而基于相似度的选择则常常导致意外输出和安全过滤器的激活。这些发现表明,基于人类可解释的数据集标签选择示例可以提高临床LLM应用中的生成精度和稳定性。
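Tag-based few-shot example selection can be sketched as ranking candidate reports by how well their tags match those of the query incident. The abstract does not state the scoring rule, so the Jaccard overlap used here is an assumption, as are the function name `select_examples_by_tags` and the record layout with `id` and `tags` fields.

```python
def select_examples_by_tags(query_tags, pool, k=3):
    """Return the k pool entries whose tag sets best overlap the query tags.

    Scoring by Jaccard overlap is an illustrative assumption, not the paper's rule.
    Ties keep their original pool order because Python's sort is stable.
    """
    query = set(query_tags)

    def overlap(example):
        tags = set(example["tags"])
        union = query | tags
        return len(query & tags) / len(union) if union else 0.0

    return sorted(pool, key=overlap, reverse=True)[:k]


if __name__ == "__main__":
    pool = [
        {"id": 1, "tags": ["medications"]},
        {"id": 2, "tags": ["blood transfusion therapy"]},
        {"id": 3, "tags": ["medications", "blood transfusion therapy"]},
    ]
    chosen = select_examples_by_tags(["medications"], pool, k=2)
    print([example["id"] for example in chosen])
```

The selected reports would then be formatted as few-shot demonstrations in the prompt; that formatting step is left out here.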
cs.CL / 121 / 2605.10027
Speech-based Psychological Crisis Assessment using LLMs
基于语音的心理危机评估使用大型语言模型
Abstract
Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.
Chinese Translation
心理支持热线为经历心理健康危机的个体提供了关键支持,但目前的评估在很大程度上依赖于人类操作员,其判断可能因专业经验而异,并受到有限人力资源的限制。本文提出了一种基于大型语言模型(LLM)的自动化危机级别分类框架,这是支持许多下游任务的关键指标,并提高热线服务的整体质量。为了更好地捕捉口语对话中的情感信号,我们引入了一种副语言注入方法,将识别出的非语言情感线索插入到语音转录中,使LLM能够将关键的声学细微差别纳入推理中。此外,我们提出了一种增强推理的训练策略,训练模型生成诊断推理链作为辅助任务,这作为正则化器来提高分类性能。结合数据增强,我们的最终系统在5折交叉验证下的三类分类任务中达到了0.802的宏观F1分数和0.805的准确率。
cs.CL / 122 / 2605.10032
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench:一个基于证据的多物种植物标记推理基准
Abstract
Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
Chinese Translation
细胞类型特异性标记基因是植物生物学的基础,但现有资源主要依赖于经过整理的数据库或高通量研究,而没有明确建模科学文献中发现的支持证据。我们介绍了PlantMarkerBench,这是一个用于评估基于文献的植物标记证据解读的多物种基准,基于完整的生物学论文。PlantMarkerBench的构建采用了一个模块化的整理流程,整合了大规模文献检索、混合搜索、物种感知的生物学基础、结构化证据提取和针对性的人类审查。该基准涵盖了四种植物物种——拟南芥(Arabidopsis)、玉米(maize)、水稻(rice)和番茄(tomato)——并包含5,550个句子级证据实例,标注了标记证据的有效性、证据类型和支持强度。我们定义了两个基准任务:确定候选句子是否为基因-细胞类型对提供有效的标记证据,以及将证据分类为表达、定位、功能、间接或负面类别。我们在不同物种和提示策略下对多种开放权重和闭源语言模型进行了基准测试。尽管前沿模型在直接表达证据上表现相对强劲,但在功能性、间接和弱支持证据上的表现显著下降,证据类型混淆成为主要的失败模式。在模糊生物学背景下,开放权重模型还表现出较高的假阳性率。PlantMarkerBench为基于文献的生物学证据归属提供了一个具有挑战性和可重复的评估框架,并支持未来在可信科学信息提取和人工智能辅助植物生物学方面的研究。
cs.CL / 123 / 2605.10043
Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework
通过二元反馈个性化大型语言模型:一种偏好校正优化框架
Abstract
Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via preference-calibrated binary signals. By treating target user data as positive feedback and other users' data as an auxiliary set of implicit negative signals, C-BPO captures distinct inter-user differences. To mitigate the preference overlap issue, where shared task knowledge is erroneously penalized, we derive an objective grounded in Positive-Unlabeled (PU) learning theory. This approach purifies negative signals by subtracting ``positive bias'', ensuring alignment with unique idiosyncrasies without compromising general helpfulness. Empirical experiments across various personalization tasks and backbone LLMs show C-BPO consistently outperforms baselines, demonstrating the efficacy of preference-calibrated binary signals in modeling inter-user differences.
Chinese Translation
大型语言模型(LLM)个性化旨在将模型行为与个体用户偏好对齐。现有方法通常侧重于孤立的用户历史,忽视了用户之间差异的重要作用。我们提出了C-BPO,一个通过偏好校准的二元信号来个性化LLM的框架。通过将目标用户数据视为正反馈,而将其他用户的数据视为隐式负信号的辅助集,C-BPO捕捉了明显的用户间差异。为了缓解偏好重叠问题,即共享任务知识被错误惩罚的情况,我们推导出一个基于正-未标记(Positive-Unlabeled, PU)学习理论的目标。这种方法通过减去“正偏差”来净化负信号,确保与独特的个性特征对齐,同时不妨碍一般的有用性。在各种个性化任务和基础LLM上的实证实验表明,C-BPO始终优于基线,证明了偏好校准的二元信号在建模用户间差异方面的有效性。
cs.CL / 124 / 2605.10052
Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering
群体技能:一种可移植的自我演化多智能体系统规范用于协调工程
Abstract
As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent \textbf{Coordination Engineering}, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single-agent skills can now be distributed as portable assets, multi-agent coordination protocols remain locked within framework-internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose \textbf{Swarm Skills}, a portable specification that extends the Anthropic Skills standard with multi-agent semantics. Swarm Skills turns multi-agent workflows into first-class, distributable assets that consist of roles, workflows, execution bounds, and a built-in semantic structure for self-evolution. To operationalize the specification's evolving nature, we present a companion self-evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi-dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human-in-the-loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open-source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero-adapter cross-agent portability via progressive disclosure, enabling agent teams to self-evolve their coordination strategies without framework lock-in.
Chinese Translation
随着人工智能工程范式从单智能体的提示和上下文工程转向多智能体的协调工程,如何编码和系统性地改进多个智能体之间的协作能力已成为一个关键瓶颈。虽然单智能体技能现在可以作为可移植资产进行分发,但多智能体协调协议仍然被锁定在框架内部代码或静态配置中,阻碍了它们在系统间的共享或随时间的自主改进。我们提出了群体技能(Swarm Skills),这是一种可移植的规范,扩展了Anthropic Skills标准,加入了多智能体语义。群体技能将多智能体工作流转变为一流的、可分发的资产,包含角色、工作流、执行边界以及内置的自我演化语义结构。为了使规范的演化特性得以实现,我们提出了一种伴随的自我演化算法,该算法自动将成功的执行轨迹提炼为新的群体技能,并根据多维评分(有效性、利用率和新颖性)持续修补现有技能,从而消除了在精炼过程中对人工干预的需求。通过架构兼容性分析和使用开源的JiuwenSwarm参考实现进行的全面定性案例研究,我们展示了群体技能如何通过渐进式披露实现零适配器跨智能体可移植性,使智能体团队能够在不受框架锁定的情况下自我演化其协调策略。
cs.CL / 125 / 2605.10061
Not-So-Strange Love: Language Models and Generative Linguistic Theories are More Compatible than They Appear
不那么奇怪的爱:语言模型与生成语言理论比看起来更兼容
Abstract
Futrell and Mahowald (2025) frame the success of neural language models (LMs) as supporting gradient, usage-based linguistic theories. I argue that LMs can also instantiate theories based on formal structures - the types of theories seen in the generative tradition. This argument expands the space of theories that can be tested with LMs, potentially enabling reconciliations between usage-based and generative accounts.
Chinese Translation
Futrell 和 Mahowald (2025) 将神经语言模型 (LMs) 的成功框架视为支持渐进的、基于使用的语言理论。我认为,语言模型也可以体现基于形式结构的理论——即生成传统中所见的理论。这一论点扩展了可以用语言模型进行测试的理论空间,可能使基于使用的理论与生成理论之间的调和成为可能。
cs.CL / 126 / 2605.10065
NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding
NCO:处理解码中负约束的多功能插件
Abstract
Controlling Large Language Models (LLMs) to prevent the generation of undesirable content, such as profanity and personally identifiable information (PII), has become increasingly critical. While earlier approaches relied on post-processing or resampling, recent research has shifted towards constrained decoding methods that control outputs during generation to mitigate high computational costs and quality degradation. However, preventing multiple forbidden hard constraints or regex constraints from appearing anywhere in the output is computationally challenging. A straightforward solution is to convert these constraints into a single automaton that tracks all forbidden patterns during decoding, but this often becomes impractically large. Standard regex engines also do not readily support the operations needed to build such a constraint, such as complement and intersection. In order to address these limitations, we propose NCO, a decoding strategy that performs online pattern matching over finite hard constraints and regex constraints, reducing computational overhead without inducing state explosion. NCO is fully compatible with standard inference strategies, including various sampling methods and beam search, while also supporting soft masking for probabilistic suppression. We empirically demonstrate its effectiveness across practical tasks, including PII and profanity suppression. Our implementation is available at https://github.com/hyundong98/NCO-Decoding.git .
Chinese Translation
控制大型语言模型(LLMs)以防止生成不良内容,如粗俗语言和个人身份信息(PII),变得越来越重要。尽管早期的方法依赖于后处理或重采样,但最近的研究已转向约束解码方法,这些方法在生成过程中控制输出,以减轻高计算成本和质量下降。然而,防止多个禁止的硬约束或正则表达式约束出现在输出中的问题在计算上具有挑战性。一种简单的解决方案是将这些约束转换为一个单一的自动机,在解码过程中跟踪所有禁止模式,但这往往会变得不切实际地庞大。标准的正则表达式引擎也不容易支持构建此类约束所需的操作,如补集和交集。为了应对这些限制,我们提出了NCO,一种解码策略,能够对有限的硬约束和正则表达式约束进行在线模式匹配,减少计算开销而不引发状态爆炸。NCO与标准推理策略完全兼容,包括各种采样方法和束搜索,同时支持概率抑制的软掩蔽。我们在实际任务中(包括PII和粗俗语言抑制)实证展示了其有效性。我们的实现可在https://github.com/hyundong98/NCO-Decoding.git获取。
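The general idea of enforcing negative constraints during decoding can be illustrated with a deliberately naive filter: before each step, drop every candidate token whose text would complete a forbidden string. This is not the NCO algorithm; it handles only literal strings (no regex constraints, no soft masking), rescans a short tail of the output at every step, and assumes the text generated so far is already free of forbidden patterns. The name `allowed_tokens` is hypothetical.

```python
def allowed_tokens(generated: str, candidates: list[str], forbidden: list[str]) -> list[str]:
    """Filter candidate token strings that would create a forbidden substring.

    Assumes `generated` contains no forbidden pattern yet and that every
    forbidden pattern is a non-empty literal string.
    """
    max_len = max((len(pattern) for pattern in forbidden), default=0)
    # A new match must end inside the new token, so only the last
    # (max_len - 1) characters of the existing text can take part in it.
    tail = generated[-(max_len - 1):] if max_len > 1 else ""
    kept = []
    for token in candidates:
        window = tail + token
        if not any(pattern in window for pattern in forbidden):
            kept.append(token)
    return kept


if __name__ == "__main__":
    print(allowed_tokens("my pass", ["word", "port", " is"], ["password"]))
```

In a real decoder the surviving candidates would correspond to vocabulary ids whose logits are left untouched, while the rest are set to negative infinity before sampling or beam expansion.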
cs.CL / 127 / 2605.10073
PHAGE: Patent Heterogeneous Attention-Guided Graph Encoder for Representation Learning
PHAGE:专利异构注意力引导图编码器用于表征学习
Abstract
Patent claims form a directed dependency structure in which dependent claims inherit and refine the scope of earlier claims; however, existing patent encoders linearize claims as text and discard this hierarchy. Directly encoding this structure into self-attention poses two challenges: claim dependencies mix relation types that differ in semantics and extraction reliability, and the dependency graph is defined over claims while Transformers attend over tokens. PHAGE addresses the first challenge through a deterministic graph construction pipeline that separates near-deterministic legal citations from noisier rule-based technical relations, preserving type distinctions as heterogeneous edges. It addresses the second through a connectivity mask and learnable relation-aware biases that lift claim-level topology into token-level attention, allowing the encoder to differentially weight each relation type. A dual-granularity contrastive objective then aligns representations with both inter-patent taxonomy and intra-patent topology. PHAGE outperforms all baselines on classification, retrieval, and clustering, showing that intra-document claim topology is a stronger inductive bias than inter-document structure and that this bias persists in the encoder weights after training.
Chinese Translation
专利权利要求形成了一种有向依赖结构,其中依赖性权利要求继承并细化早期权利要求的范围;然而,现有的专利编码器将权利要求线性化为文本,并忽略了这种层次结构。将这种结构直接编码到自注意力中面临两个挑战:权利要求的依赖关系混合了在语义和提取可靠性上不同的关系类型,而依赖图是基于权利要求定义的,而变换器(Transformers)则在标记(tokens)上进行注意力计算。PHAGE通过一个确定性的图构建管道解决了第一个挑战,该管道将近确定性的法律引用与噪声较大的基于规则的技术关系分离,保留了作为异构边的类型区分。它通过连接掩码和可学习的关系感知偏置解决了第二个挑战,这些偏置将权利要求级别的拓扑提升到标记级别的注意力,使编码器能够对每种关系类型进行差异化加权。然后,双粒度对比目标将表示与专利间分类法和专利内拓扑对齐。PHAGE在分类、检索和聚类任务中超越了所有基线,表明文档内权利要求的拓扑是一种比文档间结构更强的归纳偏置,并且这种偏置在训练后仍然存在于编码器权重中。
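Lifting claim-level topology into token-level attention, as this abstract describes, can be pictured as building an additive attention bias: tokens in the same claim attend freely, tokens in a dependent claim attend to the claim they reference with a relation-specific bias, and all other pairs are masked out. In the sketch below the bias values are fixed numbers, whereas the abstract describes them as learnable; the edge direction (dependent claim attends to referenced claim), the relation names, and the function name `attention_bias` are all assumptions.

```python
import math


def attention_bias(token_claim, edges, type_bias):
    """Build an additive attention bias matrix from a claim dependency graph.

    token_claim: claim id for each token position.
    edges: {(dependent_claim, referenced_claim): relation_type}.
    type_bias: {relation_type: bias}; fixed here, learnable in the paper's description.
    """
    n = len(token_claim)
    bias = [[-math.inf] * n for _ in range(n)]  # -inf blocks attention entirely
    for i, claim_i in enumerate(token_claim):
        for j, claim_j in enumerate(token_claim):
            if claim_i == claim_j:
                bias[i][j] = 0.0  # tokens within one claim attend freely
            elif (claim_i, claim_j) in edges:
                bias[i][j] = type_bias[edges[(claim_i, claim_j)]]
    return bias


if __name__ == "__main__":
    # Tokens 0-1 belong to claim 1, token 2 to claim 2, token 3 to claim 3.
    bias = attention_bias(
        token_claim=[1, 1, 2, 3],
        edges={(2, 1): "legal"},
        type_bias={"legal": 0.5, "technical": 0.1},
    )
    for row in bias:
        print(row)
```

The resulting matrix would be added to the attention logits before the softmax, so masked pairs receive zero attention weight.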
cs.CL / 128 / 2605.10082
FERA: Uncertainty-Aware Federated Reasoning for Large Language Models
FERA:面向不确定性的联邦推理框架用于大型语言模型
Abstract
Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
Chinese Translation
大型语言模型(LLMs)在高质量示例的指导下展现出强大的推理能力,但此类数据通常分布在无法因监管、专有或机构限制而集中管理的组织中。我们研究了联邦推理,其中服务器通过与持有私有示例的异构客户端协调来改善多步推理,而无需集中训练或共享原始数据。主要挑战在于客户端的可靠性依赖于查询,而服务器无法检查客户端数据以确定哪些贡献是可信的。为了解决这个问题,我们提出了面向不确定性的联邦推理(FERA),这是一种基于迭代服务器-客户端共同精炼的无训练框架。在通信轮次中,客户端生成带有轻量级不确定性估计的推理轨迹,服务器将其综合为改进的推理,并作为上下文重新分发到下一轮,逐步改善服务器输出和客户端推理。在每一轮中,面向不确定性的自我批评聚合(UA-SCA)通过查询依赖的信任加权和结构化的跨客户端验证解决异构客户端轨迹之间的冲突。UA-SCA并非简单丢弃低质量轨迹,而是修正有缺陷的推理步骤以恢复有用信息。我们提供理论保证,表明所提出的迭代协议收敛,并且面向不确定性的加权加速了收敛。在多个推理基准上的实验表明,FERA始终优于联邦训练和无训练基线,在各轮中实现逐步更高的准确率,同时保持通信和计算效率。
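Query-dependent trust weighting can be illustrated with a much simpler mechanism than the UA-SCA procedure in the abstract: each client's final answer votes with weight one minus its reported uncertainty. This sketch omits cross-client verification and the revision of flawed reasoning steps, and it assumes uncertainties are already normalized to the range 0 to 1. The function name `aggregate_answers` is hypothetical.

```python
from collections import defaultdict


def aggregate_answers(client_outputs):
    """Pick the answer with the largest total confidence.

    client_outputs: list of (answer, uncertainty) pairs with uncertainty in [0, 1].
    Weighting by (1 - uncertainty) is an illustrative assumption.
    """
    scores = defaultdict(float)
    for answer, uncertainty in client_outputs:
        scores[answer] += 1.0 - uncertainty
    return max(scores, key=scores.get)


if __name__ == "__main__":
    outputs = [("A", 0.9), ("A", 0.8), ("B", 0.1)]
    print(aggregate_answers(outputs))
```

In this example two uncertain clients agree on one answer and a single confident client gives another; plain majority voting would follow the pair, whereas the weighted vote follows the confident client.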
cs.CL / 129 / 2605.10108
GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction
GLiNER-Relex:一个联合命名实体识别和关系抽取的统一框架
Abstract
Joint named entity recognition (NER) and relation extraction (RE) is a fundamental task in natural language processing for constructing knowledge graphs from unstructured text. While recent approaches treat NER and RE as separate tasks requiring distinct models, we introduce GLiNER-Relex, a unified architecture that extends the GLiNER framework to perform both entity recognition and relation extraction in a single model. Our approach leverages a shared bidirectional transformer encoder to jointly represent text, entity type labels, and relation type labels, enabling zero-shot extraction of arbitrary entity and relation types specified at inference time. GLiNER-Relex constructs entity pair representations from recognized spans and scores them against relation type embeddings using a dedicated relation scoring module. We evaluate our model on four standard relation extraction benchmarks: CoNLL04, DocRED, FewRel, and CrossRE, and demonstrate competitive performance against both specialized relation extraction models and large language models, while maintaining the computational efficiency characteristic of the GLiNER family. The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both entities and relation triplets in a single call. All models and code are publicly available.
Chinese Translation
联合命名实体识别(NER)和关系抽取(RE)是自然语言处理中的一项基础任务,旨在从非结构化文本中构建知识图谱。虽然近期的方法将NER和RE视为需要不同模型的独立任务,但我们提出了GLiNER-Relex,一个统一架构,扩展了GLiNER框架,以在单一模型中同时执行实体识别和关系抽取。我们的方法利用共享的双向变换器编码器共同表示文本、实体类型标签和关系类型标签,从而实现对推理时指定的任意实体和关系类型的零样本抽取。GLiNER-Relex从识别的跨度构建实体对表示,并使用专门的关系评分模块对其与关系类型嵌入进行评分。我们在四个标准关系抽取基准上评估了我们的模型:CoNLL04、DocRED、FewRel和CrossRE,并展示了与专门的关系抽取模型和大型语言模型的竞争性能,同时保持了GLiNER系列的计算效率。该模型作为开源Python包发布,提供简单的推理API,允许用户在推理时指定任意实体和关系类型标签,并在一次调用中获得实体和关系三元组。所有模型和代码均可公开获取。
cs.CL / 130 / 2605.10114
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
SkillRAE:基于代理技能的检索增强执行上下文编译
Abstract
Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have studied Retrieval-Augmented Execution (RAE), which typically first retrieves external skills and other knowledge, then compiles the context from the retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and pay little attention to how to organize the selected skill evidence in a form that is compact, grounded, and immediately usable by downstream executors. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, consisting of an offline stage and an online stage. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits to capture their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over RAE baselines. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that the context compilation itself is crucial, rather than acting as a mere prompt addition.
Chinese Translation
基于大型语言模型(LLM)的代理(例如,OpenClaw)越来越依赖可重用的技能库来解决文档中心工作流和数据密集型分析等丰富工件任务。随着这些库的增长,一些研究尝试探讨检索增强执行(RAE),该过程通常首先检索一些外部技能和其他知识,然后使用检索到的技能编译上下文,最后执行任务。现有研究主要集中在优化技能检索和任务执行上,较少关注如何有效地组织所选技能证据,使其以紧凑、扎根且可立即用于下游执行者完成任务的形式呈现。为填补这一空白,我们提出了SkillRAE,一种专注于基于技能的上下文编译的两阶段RAE方法,包括离线和在线阶段。具体而言,在离线索引阶段,它在技能社区、技能和可重用子单元上构建多层次技能图,以捕捉它们之间的关系。在在线检索阶段,它首先在图中执行基于技能排名的检索,并导出所选子单元证据,然后应用救援感知的紧凑编译以恢复关键证据。这些组件共同将粗略排名的技能集编译成一个任务特定的上下文,该上下文紧凑、扎根且可立即使用。在两个公共基准上的实验表明,SkillRAE在RAE方面显著优于基线方法。例如,在SkillsBench上,它比最先进的方法提高了11.7%。消融研究进一步表明,我们的上下文编译至关重要,而不仅仅是简单的提示添加。
cs.CL / 131 / 2605.10129
Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
合成预预训练提高语言模型对噪声预训练数据的鲁棒性
Abstract
Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.
Chinese Translation
大型语言模型(LLMs)依赖于网络规模的语料库进行预训练。这些数据集中固有的噪声往往会掩盖有意义的模式,最终降低模型性能。数据策划可以减轻但无法消除这种噪声,因此在实际应用中,预训练语料库仍然存在噪声。因此,我们研究了一种基于具有可学习时间结构的合成数据的轻量级预预训练(PPT)阶段是否有助于在预训练(PT)阶段抵御噪声数据。在各种损坏设置下,我们的方法在PT期间始终提高了对噪声的鲁棒性,并且在更高噪声水平下相对增益更大。对于一个10亿参数的模型,仅使用6500万标记的合成PPT阶段在不同噪声水平下实现了与基线相同的最终损失,同时最多可少用49%的自然文本PT标记。机制分析表明,PPT并不会立即抑制对噪声标记的注意力。相反,PPT初始化的模型在噪声PT期间逐渐降低损坏标记之间的注意力权重。这表明合成PPT抑制了噪声自建模,并塑造了后续的优化轨迹。代码可在 https://github.com/guox18/formal-language-prepretraining 获取。
cs.CL / 132 / 2605.10155
NyayaAI: An AI-Powered Legal Assistant Using Multi-Agent Architecture and Retrieval-Augmented Generation
NyayaAI:一种基于多智能体架构和检索增强生成的人工智能法律助手
Abstract
Legal information in India remains largely inaccessible due to the complexity of legal language and the sheer volume of legal documentation involved in research and case analysis. This paper presents NyayaAI, an AI-powered legal assistant that automates and simplifies legal workflows for lawyers, law students, and general users. The system combines Large Language Models with a Retrieval-Augmented Generation pipeline grounded in a curated Indian legal knowledge base comprising constitutional provisions, statutes, case laws, and judicial precedents. A multi-agent architecture orchestrated through the Mastra TypeScript framework coordinates a main agent with specialized sub-agents handling legal research, document summarization, case law retrieval, and drafting assistance. A compliance module validates all responses before delivery. Domain classification achieved 70% precision across test samples, with RAG retrieval precision at 74% and overall response accuracy at 72%, demonstrating that structured multi-agent LLM systems can meaningfully improve legal accessibility and workflow efficiency. The code is made publicly available at https://github.com/B97784/NyayaAI for the benefit of the research community.
Chinese Translation
由于法律语言的复杂性以及研究和案件分析中涉及的法律文档数量庞大,印度的法律信息仍然在很大程度上难以获取。本文介绍了NyayaAI,这是一种基于人工智能的法律助手,旨在为律师、法学生和普通用户自动化和简化法律工作流程。该系统结合了大型语言模型与基于经过策划的印度法律知识库的检索增强生成(Retrieval-Augmented Generation)管道,该知识库包括宪法条款、法规、案例法和司法判例。通过Mastra TypeScript框架协调的多智能体架构将一个主要智能体与处理法律研究、文档摘要、案例法检索和起草辅助的专门子智能体相结合。合规模块在交付之前验证所有响应。在测试样本中,领域分类的精度达到了70%,检索增强生成的检索精度为74%,整体响应准确率为72%,这表明结构化的多智能体大型语言模型系统能够显著改善法律可及性和工作流程效率。代码已公开发布,以惠及研究社区。
cs.CL / 133 / 2605.10168
ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
ASTRA-QA:基于文档的抽象问题回答基准
Abstract
Document-based question answering (QA) increasingly includes abstract questions that require synthesizing scattered information from long documents or across multiple documents into coherent answers. However, this setting is still poorly supported by existing benchmarks and evaluation methods, which often lack stable abstract references or rely on coarse similarity metrics and unstable head-to-head comparisons. To alleviate this issue, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance is equipped with explicit evaluation annotations, including answer topic sets, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content, enabling scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at https://xinyangsally.github.io/astra-benchmark.
Chinese Translation
基于文档的问题回答(QA)越来越多地包括需要将长文档或多个文档中分散的信息综合成连贯答案的抽象问题。然而,现有的基准和评估方法对此设置的支持仍然不足,往往缺乏稳定的抽象参考,或者依赖粗略的相似性度量和不稳定的逐对比较。为了解决这个问题,我们引入了ASTRA-QA,一个用于文档的抽象问题回答基准。ASTRA-QA包含869个关于学术论文和新闻文档的QA实例,涵盖五种抽象问题类型和三种受控检索范围。每个实例都配备了明确的评估注释,包括答案主题集、策划的无支持主题和对齐证据。基于这些注释,ASTRA-QA通过直接评分主题覆盖和策划的无支持内容,评估答案是否涵盖所需的关键点并避免无支持内容,从而实现可扩展的评估,而无需进行详尽的逐对比较。与代表性的检索增强生成(RAG)方法进行的实验,涵盖了原始、基于图的和层次检索设置,表明ASTRA-QA为覆盖、幻觉和检索范围的鲁棒性提供了基于参考的诊断。我们的数据集和代码可在 https://xinyangsally.github.io/astra-benchmark 获取。
cs.CL / 134 / 2605.10171
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
当评审意见不一致时:科学同行评审中的细粒度矛盾分析
Abstract
Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review-level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine-grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert-annotated benchmark of peer-review pairs with evidence-level contradiction annotations with graded intensity labels. We further propose IMPACT, a structured multi-agent framework that integrates aspect-conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.
Chinese Translation
科学同行评审中常常包含冲突的专家判断,而会议提交数量的增加使得领域主席和编辑难以可靠地识别和解读这些分歧。现有方法通常将评审者的不一致视为孤立句子对的二元矛盾检测,忽略了评审级别的上下文,并模糊了评估冲突严重性的差异。在本研究中,我们提出了一种细粒度的评审者矛盾分析方法,该方法在完整的同行评审中操作,通过明确识别矛盾证据范围并分配分级的不一致强度分数来实现。为支持这一任务,我们提出了RevCI,这是一个经过专家注释的同行评审对基准数据集,包含证据级别的矛盾注释和分级强度标签。我们进一步提出了IMPACT,这是一个结构化的多智能体框架,集成了基于方面的证据提取、深思熟虑的推理和裁决,以建模评审者的矛盾及其强度。为了支持高效部署,我们将IMPACT提炼为TIDE,这是一个小型语言模型,可以在单次前向传播中预测矛盾证据和强度。实验结果表明,IMPACT在证据识别和强度一致性方面显著优于强大的单智能体和通用多智能体基线,而TIDE在显著降低推理成本的同时实现了竞争力的性能。
cs.CL / 135 / 2605.10186
LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
LegalCiteBench:评估法律语言模型中的引用可靠性
Abstract
Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
Chinese Translation
大型语言模型(LLMs)越来越多地融入法律起草和研究工作流程中,其中不正确的引用或虚构的先例可能会造成严重的职业损害。现有的法律基准主要强调法条推理、合同理解或一般法律问答,但并未直接研究一个核心的普通法失效模式:当被要求提供案例依据而没有外部支持时,模型可能返回看似合理但实际上不正确的引用或案例。我们引入了LegalCiteBench,这是一个用于研究法律语言模型中闭卷引用恢复、引用验证和案例匹配的基准。LegalCiteBench包含约24,000个评估实例,这些实例是从美国案例法访问项目中的1,000个真实司法意见中构建的。该基准涵盖五个以引用为中心的任务:引用检索、引用补全、引用错误检测、案例匹配以及案例验证和纠正。在21个LLMs中,在这种闭卷环境下,准确的引用恢复仍然非常具有挑战性:即使是最强的模型在引用检索和补全方面的得分也低于7/100。在评估的模型中,规模和法律领域的预训练提供的增益有限,无法解决这一困难。根据我们的评估协议,模型还经常提供具体但不正确或重叠度低的权威引用,在重检索任务中,21个评估模型中有20个的误导性答案率(MAR)超过94%。一个仅使用提示的弃权实验表明,明确的不确定性指令减少了一些自信的虚构,但并未改善引用的正确性。LegalCiteBench旨在作为一个诊断框架,用于研究在外部依据缺失、不完整或被绕过的情况下的权威生成失败、验证行为和弃权。
cs.CL / 136 / 2605.10199
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
大型语言模型在对话时应如何倾听?全双工口语对话中用户流路由的研究
Abstract
Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
Chinese Translation
全双工口语对话要求模型在生成自身口语响应的同时保持倾听。这对大型语言模型(LLMs)来说是一个挑战,因为它们设计用于扩展单一连贯序列,并不自然支持在生成过程中接收用户输入。因此,我们认为用户流如何被路由到LLM中是全双工建模的一个关键架构问题。为了研究这个问题,我们将一个仅限文本的LLM扩展为一个统一的全双工口语对话系统,并在共享训练管道下比较两种路由策略:(i)通道融合,将用户流直接注入LLM输入;(ii)交叉注意力路由,将用户流作为外部记忆,通过交叉注意力适配器访问。针对口语问答和全双工交互基准的实验揭示了明显的权衡。通道融合产生更强的语义基础,并且在问答性能上始终表现更好。然而,在用户中断等语义重叠的情况下,它更容易受到上下文干扰:如果模型未能及时停止,重叠的用户流可能会干扰正在进行的生成,导致语义不连贯的延续。交叉注意力路由在问答上表现不佳,但更好地保持了LLM生成的上下文,并且对这种失败模式更具鲁棒性。这些结果确立了用户流路由作为全双工口语对话中的一个核心设计轴,并提供了关于语义整合与上下文鲁棒性之间权衡的实用指导。我们提供了一个演示页面以供定性检查。
cs.CL / 137 / 2605.10211
To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification
涂黑,还是不涂黑?一种用于审议过程特权分类的本地大语言模型方法
Abstract
Government transparency laws, like the Freedom of Information (FOIA) acts in the United States and United Kingdom, and the Woo (Open Government Act) in the Netherlands, grant citizens the right to directly request documents from the government. As these documents might contain sensitive information, such as personal information or threats to national security, the laws allow governments to redact sensitive parts of the documents prior to release. We build on prior research to perform automatic sensitivity classification for the FOIA Exemption 5 deliberative process privilege using Large Language Models (LLMs). However, processing documents not yet cleared for review via third-party cloud APIs is often legally or politically untenable. Therefore, in this work, we perform sensitivity classification with a small, local model, deployable on consumer-grade hardware (Qwen3.5 9B). We compare eight variants of applying LLMs for sentence classification, using well-known prompting techniques, and find that a combination of Chain-of-Thought prompting and few-shot prompting with error-based examples outperforms classification models of earlier work in terms of recall and F2 score. This method also closely approaches the performance of a widely-used, cost-efficient commercial model (Gemini 2.5 Flash). In an additional analysis, we find that sentences that are predicted as deliberative contain more verbs that indicate the expression of opinions, and are more often phrased in the first person. Above all, deliberativeness seems characterized by the presence of a combination of multiple indicators, in particular the combination of first-person words with a verb for expressing opinion.
Chinese Translation
政府透明度法律,如美国和英国的《信息自由法》(FOIA)以及荷兰的《开放政府法》(Woo),赋予公民直接向政府请求文件的权利。由于这些文件可能包含敏感信息,例如个人信息或对国家安全的威胁,这些法律允许政府在发布之前对文件中的敏感部分进行涂黑处理。我们在先前研究的基础上,利用大语言模型(LLMs)对FOIA第5条豁免审议过程特权进行自动敏感性分类。然而,通过第三方云API处理尚未获得审查许可的文件在法律或政治上往往不可行。因此,在本研究中,我们使用一个小型本地模型(Qwen3.5 9B)进行敏感性分类,该模型可在消费级硬件上部署。我们比较了八种应用LLMs进行句子分类的变体,使用了众所周知的提示技术,发现链式思维提示与基于错误示例的少量提示的组合在召回率和F2分数方面优于早期工作的分类模型。这种方法的性能也接近于一种广泛使用的、成本效益高的商业模型(Gemini 2.5 Flash)。在额外分析中,我们发现被预测为审议性的句子包含更多表示意见表达的动词,并且更常以第一人称表达。最重要的是,审议性似乎以多种指标的组合为特征,特别是第一人称词与表示意见的动词的组合。
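The F2 score mentioned in the abstract above is the F-beta measure with beta = 2, which weights recall more heavily than precision; that fits a redaction setting where missing a sensitive sentence is costlier than flagging a harmless one. The following is a minimal sketch of the standard formula. The function name and the example counts are illustrative assumptions, not values from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from raw counts; beta > 1 favours recall over precision."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 sensitive sentences found, 4 false alarms, 2 missed.
f2 = f_beta(tp=8, fp=4, fn=2)            # recall-weighted
f1 = f_beta(tp=8, fp=4, fn=2, beta=1.0)  # balanced
print(f"F2={f2:.4f} F1={f1:.4f}")
```

With these counts recall (0.80) exceeds precision (about 0.67), so F2 comes out higher than F1.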
cs.CL / 138 / 2605.10216
The Impact of Editorial Intervention on Detecting Native Language Traces
编辑干预对母语痕迹检测的影响
Abstract
Native Language Identification (NLI) is the task of determining an author's native language (L1) from their non-native writings. With the advent of human-AI co-authorship, non-native texts are routinely corrected and rewritten by large language models, fundamentally altering the linguistic features NLI models depend on. In this paper, we investigate the robustness of L1 traces across increasing degrees of editorial intervention. By processing 450 essays from the Write & Improve 2024 corpus through varying levels of grammatical error correction (GEC) and paraphrasing, we demonstrate that L1 attribution does not entirely depend on surface-level errors. Instead, the detection models leverage deeper L1 features: unidiomatic lexico-semantic choices, pragmatic transfer, and the author's underlying cultural perspective. We find that minimal edits preserve these structural traces and maintain high profiling accuracy. In contrast, fluency edits and paraphrasing normalize these L1 features, leading to a severe degradation in performance.
Chinese Translation
母语识别(NLI)是根据作者的非母语写作确定其母语(L1)的任务。随着人机协作写作的兴起,非母语文本常常被大型语言模型进行校正和重写,这从根本上改变了NLI模型所依赖的语言特征。本文探讨了在不同程度的编辑干预下,L1痕迹的稳健性。通过对Write & Improve 2024语料库中的450篇作文进行不同级别的语法错误修正(GEC)和改写处理,我们证明L1归属并不完全依赖于表面错误。相反,检测模型利用了更深层次的L1特征:不地道的词汇语义选择、语用迁移以及作者潜在的文化视角。我们发现,最小的编辑保留了这些结构性痕迹,并保持了较高的母语画像准确率。相比之下,流畅性编辑和改写则使这些L1特征趋于规范化,导致性能严重下降。
cs.CL / 139 / 2605.10218
Relative Score Policy Optimization for Diffusion Language Models
扩散语言模型的相对评分策略优化
Abstract
Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.
Chinese Translation
扩散大型语言模型(dLLMs)为并行和高效的文本生成提供了一条有前景的途径,但提高其推理能力需要有效的后期训练。使用可验证奖励的强化学习(RLVR)是实现这一目标的自然选择,然而其在dLLMs中的应用受到可处理的序列级对数比缺失的限制,而这对于标准策略优化至关重要。可处理的序列级对数比的缺乏迫使现有方法依赖于高方差的基于ELBO的近似,其中高验证者奖励可能会放大不准确的评分估计并使RL训练不稳定。为了解决这个问题,我们提出了相对评分策略优化(RSPO),这是一种简单的RLVR方法,利用可验证的奖励来校准dLLMs中的噪声似然估计。我们算法的核心依赖于一个关键观察:奖励优势不仅可以被解释为更新方向,还可以作为当前策略与参考策略之间相对对数比的目标。因此,RSPO通过将其奖励优势与奖励隐含的目标相对对数比进行比较,来校准这一噪声相对对数比估计,根据当前估计与目标之间的差距而不是单纯的原始优势来更新策略。在数学推理和规划基准上的实验表明,RSPO在规划任务上产生了特别强的增益,并在数学推理性能上具有竞争力。
cs.CL / 140 / 2605.10235
Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection
先路由再检索:激活大语言模型的潜在路由能力以应对检索增强生成与长上下文选择
Abstract
Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
Chinese Translation
最近,大型语言模型(LLMs)的进展将上下文窗口扩展至超过128K个标记,使得长文档理解和多源推理成为可能。然而,一个关键挑战在于在检索增强生成(RAG)和长上下文(LC)策略之间进行选择:RAG高效但受限于检索质量,而LC则以更高的成本支持全局推理,并且对位置敏感。现有方法如Self-Route采用从RAG到LC的失败驱动回退,但仍然是被动的、低效的且难以解释。我们提出了Pre-Route,一个主动路由框架,在回答之前进行结构化推理。通过使用轻量级元数据(例如,文档类型、长度、初始片段),Pre-Route能够进行任务分析、覆盖估计和信息需求预测,从而产生可解释且成本高效的路由决策。我们的研究显示了三个关键发现:(i)LLMs具备潜在的路由能力,可以通过指导可靠地引出,使单样本性能接近多样本(Best-of-N)结果;(ii)线性探测揭示结构化提示增强了表示空间中“最佳路由维度”的可分性;(iii)蒸馏将这种推理结构转移到更小的模型中,以实现轻量级部署。在LaRA(领域内)和LongBench-v2(领域外)的实验中,Pre-Route的表现优于Always-RAG、Always-LC和Self-Route基线,达到了更优的整体成本效益。
cs.CL / 141 / 2605.10241
Building Korean linguistic resource for NLU data generation of banking app CS dialog system
构建用于银行应用客户服务对话系统的自然语言理解数据生成的韩语语言资源
Abstract
Natural language understanding (NLU) is integral to task-oriented dialog systems, but demands a considerable amount of annotated training data to increase the coverage of diverse utterances. In this study, we report the construction of a linguistic resource named FIAD (Financial Annotated Dataset) and its use to generate Korean annotated training data for NLU in the banking customer service (CS) domain. Through an empirical examination of a corpus of banking app reviews, we identified three linguistic patterns occurring in Korean request utterances: TOPIC (ENTITY, FEATURE), EVENT, and DISCOURSE MARKER. We represented them in LGGs (Local Grammar Graphs) to generate annotated data covering diverse intents and entities. To assess the practicality of the resource, we evaluate the performance of DIET-only (Intent: 0.91 / Topic [entity+feature]: 0.83), DIET+HANBERT (I: 0.94 / T: 0.85), DIET+KoBERT (I: 0.94 / T: 0.86), and DIET+KorBERT (I: 0.95 / T: 0.84) models trained on FIAD-generated data to extract various types of semantic items.
Chinese Translation
自然语言理解(NLU)是任务导向对话系统的重要组成部分,但需要大量标注的训练数据以增加多样化发话的覆盖范围。在本研究中,我们报告了一个名为FIAD(金融标注数据集)的语言资源的构建及其在银行客户服务(CS)领域生成韩语标注训练数据的应用。通过对银行应用评论语料库的实证研究,我们识别出韩语请求发话中出现的三种语言模式:主题(实体,特征)、事件和话语标记。我们将这些模式表示为局部语法图(LGGs),以生成涵盖多样意图和实体的标注数据。为了评估该资源的实用性,我们评估了基于FIAD生成数据训练的DIET-only(意图:0.91 / 主题 [实体+特征]:0.83)、DIET+ HANBERT(意图:0.94 / 主题:0.85)、DIET+ KoBERT(意图:0.94 / 主题:0.86)和DIET+ KorBERT(意图:0.95 / 主题:0.84)模型在提取各种语义项方面的表现。
cs.CL / 142 / 2605.10268
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread:通过记忆引导重读增强代理的长上下文推理能力
Abstract
To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.
Chinese Translation
为了解决长上下文推理任务而不引入标准注意力机制的平方复杂度,基于代理记忆的方法应运而生,这些方法通常在对文档块进行线性处理时维护动态更新的记忆。为了减轻在这种边读边记的范式中潜在的潜在证据丢失,最近的研究整合了检索模块,使得代理能够回忆在记忆覆盖过程中之前被丢弃的信息。然而,基于检索的回忆在记忆形成过程中遭遇证据丢失和无效查询引发的干扰。为了解决这些局限性,我们提出了MemReread。MemReread建立在流式阅读的基础上,避免了中间检索。当最终记忆不足时,它触发问题分解和重读,从而恢复那些过早丢弃的间接事实。这一设计支持非线性推理,同时保持文档理解的内在逻辑流。为了进一步增强实用性,我们引入了一个强化学习框架,该框架增强了长度外推能力,同时根据任务复杂性动态确定重读次数,从而灵活控制计算开销。大量实验表明,MemReread在长上下文推理任务上始终优于基线框架,同时在上下文长度方面保持线性时间复杂度。
cs.CL / 143 / 2605.10295
DECO-MWE: building a linguistic resource of Korean multiword expressions for feature-based sentiment analysis
DECO-MWE:构建用于特征基础情感分析的韩语多词表达语言资源
Abstract
This paper aims to construct a linguistic resource of Korean Multiword Expressions for Feature-Based Sentiment Analysis (FBSA): DECO-MWE. Dealing with multiword expressions (MWEs) has been a critical issue in FBSA since many constructs reveal lexical idiosyncrasy. To construct linguistic resources of sentiment MWEs efficiently, we utilize the Local Grammar Graph (LGG) methodology: DECO-MWE is formalized as a Finite-State Transducer that represents lexical-syntactic restrictions on MWEs. In this study, we built a corpus of cosmetics review texts, which show particularly frequent occurrences of MWEs. Based on an empirical examination of the corpus, four types of MWEs have been distinguished. The DECO-MWE thus covers the following four categories: Standard Polarity MWEs (SMWEs), Domain-Dependent Polarity MWEs (DMWEs), Compound Named Entity MWEs (EMWEs) and Compound Feature MWEs (FMWEs). The retrieval performance of the DECO-MWE shows 0.806 f-measure in the test corpus. This study brings a twofold outcome: first, a sizeable general-purpose polarity MWE lexicon, which may be broadly used in FBSA; second, a finite-state methodology adopted in this study to treat domain-dependent MWEs such as idiosyncratic polarity expressions, named entity expressions or feature expressions, and which may be reused in describing linguistic properties of other corpus domains.
Chinese Translation
本文旨在构建用于特征基础情感分析(Feature-Based Sentiment Analysis, FBSA)的韩语多词表达语言资源:DECO-MWE。在FBSA中处理多词表达(Multiword Expressions, MWEs)一直是一个关键问题,因为许多构造显示出词汇的特异性。为了高效构建情感MWEs的语言资源,我们采用了局部语法图(Local Grammar Graph, LGG)方法论:DECO-MWE被形式化为一个有限状态转导器(Finite-State Transducer),表示对MWEs的词汇-句法限制。在本研究中,我们构建了一个化妆品评论文本语料库,其中MWEs的出现频率特别高。基于对该语料库的实证检验,区分出了四种类型的MWEs。因此,DECO-MWE涵盖以下四个类别:标准极性MWEs(Standard Polarity MWEs, SMWEs)、领域依赖极性MWEs(Domain-Dependent Polarity MWEs, DMWEs)、复合命名实体MWEs(Compound Named Entity MWEs, EMWEs)和复合特征MWEs(Compound Feature MWEs, FMWEs)。DECO-MWE在测试语料库中的检索性能显示出0.806的F值。该研究带来了双重成果:首先,构建了一个规模庞大的通用极性MWE词典,可广泛用于FBSA;其次,采用的有限状态方法论可以处理领域依赖的MWEs,如特异性极性表达、命名实体表达或特征表达,并可在描述其他语料库领域的语言特性时重复使用。
cs.CL / 144 / 2605.10296
Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
Qwen Goes Brrr:现成的RAG用于乌克兰多领域文档理解
Abstract
We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
Chinese Translation
我们参与了第五届UNLP多领域文档理解共享任务,该任务要求系统从PDF文档集中回答乌克兰多项选择题,并定位支持文档及其页码。我们提出了一种基于三种理念的检索增强管道:PDF的上下文分块、基于问题的密集检索和基于问题及答案选项的重排序,以及从一小组重排序段落中生成受限答案。我们的最终系统使用Qwen3-Embedding-8B进行检索,经过微调的Qwen3-Reranker-8B进行段落排名,以及Qwen3-32B进行答案选择。在保留集上,重排序将Recall@1从0.6957提高到0.7935,而使用前两个重排序段落将答案准确率从0.9348提高到0.9674。我们在公共排行榜上的最佳成绩为0.9452,在私有排行榜上的最佳成绩为0.9598。我们的结果表明,在严格的代码竞赛约束下,保持文档结构并使相关性估计考虑答案空间比添加复杂的下游启发式方法更为有效。
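The Recall@1 figures reported above measure how often the correct passage is ranked first. A minimal sketch of the metric follows, assuming exactly one gold passage per question (the abstract does not state this); the function name, passage ids, and toy rankings are hypothetical.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 1) -> float:
    """Fraction of questions whose gold passage id appears in the top-k ranking."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same number of questions")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, target in zip(ranked, gold) if target in candidates[:k])
    return hits / len(gold)


# Hypothetical passage ids for three questions, before and after reranking.
gold = ["p1", "p7", "p4"]
retrieved = [["p1", "p2"], ["p3", "p7"], ["p9", "p4"]]
reranked = [["p1", "p2"], ["p7", "p3"], ["p9", "p4"]]

print(recall_at_k(retrieved, gold, k=1))
print(recall_at_k(reranked, gold, k=1))
```

In this toy data, reranking moves one gold passage from second to first place, so Recall@1 rises from one hit in three to two hits in three.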
cs.CL / 145 / 2605.10318
Extending Confidence-Based Text2Cypher with Grammar and Schema Aware Filtering
基于置信度的 Text2Cypher 扩展:语法和模式感知过滤
Abstract
Large language models (LLMs) allow users to query databases using natural language by translating questions into executable queries. Despite strong progress on tasks such as Text2SQL, Text2SPARQL, and Text2Cypher, most existing methods focus on better prompting, fine-tuning, or iterative refinement. However, they often do not explicitly enforce structural constraints, such as syntactic validity and schema consistency. This can reduce reliability, since generated queries must satisfy both syntax rules and database schema constraints to be executable. In this work, we study how structured constraints can be used in test-time inference for Text2Cypher. We focus on post-generation validation to improve query correctness. We extend a confidence-based inference framework with a sequential filtering process that combines confidence scoring, grammar validation, and schema constraints before final aggregation. This lets us analyze how different constraint types affect generated queries. Our experiments with two instruction-tuned models show that grammar-based filtering improves syntactic validity. Schema-aware filtering further improves execution quality by enforcing consistency with the database structure. However, stronger filtering also increases the number of empty predictions and reduces execution coverage. Overall, we show that adding simple structural checks at test time improves the reliability of Text2Cypher generation, and we provide a clearer view of how syntax and schema constraints contribute differently.
Chinese Translation
大型语言模型(LLMs)使用户能够通过自然语言查询数据库,将问题转化为可执行的查询。尽管在 Text2SQL、Text2SPARQL 和 Text2Cypher 等任务上取得了显著进展,但现有大多数方法仍然集中于更好的提示、微调或迭代优化。然而,它们往往没有明确地强制执行结构约束,例如句法有效性和模式一致性。这可能会降低可靠性,因为生成的查询必须同时满足语法规则和数据库模式约束才能可执行。在本研究中,我们探讨了如何在 Text2Cypher 的测试推理中使用结构约束。我们关注生成后的验证,以提高查询的正确性。我们扩展了一个基于置信度的推理框架,采用顺序过滤过程,在最终聚合之前结合置信评分、语法验证和模式约束。这使我们能够分析不同约束类型对生成查询的影响。我们对两个经过指令调优的模型的实验表明,基于语法的过滤提高了句法有效性。模式感知过滤通过强制与数据库结构的一致性进一步提高了执行质量。然而,较强的过滤也增加了空预测的数量,并减少了执行覆盖率。总体而言,我们表明,在测试时添加简单的结构检查可以提高 Text2Cypher 生成的可靠性,并提供了更清晰的视角,说明语法和模式约束如何以不同方式贡献。
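The sequential filtering described above (confidence scoring, then grammar validation, then schema constraints, then aggregation) can be sketched roughly as follows. This is not the authors' implementation: the grammar check is a toy stand-in for a real Cypher parser, the schema check only inspects node labels, the aggregation is a simple confidence-weighted vote, and every name and threshold here is an assumption made for illustration.

```python
import re
from collections import defaultdict
from typing import Optional

# Captures node labels in patterns such as (p:Person); a simplification.
LABEL_RE = re.compile(r"\(\s*\w*\s*:\s*(\w+)")


def grammar_ok(query: str) -> bool:
    """Toy stand-in for a Cypher parser: balanced parentheses and MATCH ... RETURN."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    upper = query.upper()
    return depth == 0 and upper.lstrip().startswith("MATCH") and " RETURN " in upper


def schema_ok(query: str, labels: set) -> bool:
    """Every node label used in the query must exist in the schema."""
    return all(label in labels for label in LABEL_RE.findall(query))


def select_query(candidates: list, labels: set, min_conf: float = 0.5) -> Optional[str]:
    """Filter (query, confidence) pairs in sequence, then vote by summed confidence."""
    votes = defaultdict(float)
    for query, conf in candidates:
        if conf < min_conf:
            continue  # stage 1: confidence
        if not grammar_ok(query):
            continue  # stage 2: grammar
        if not schema_ok(query, labels):
            continue  # stage 3: schema
        votes[query] += conf
    if not votes:
        return None  # an "empty prediction": every candidate was filtered out
    return max(votes, key=lambda q: votes[q])
```

Returning `None` when no candidate survives mirrors the trade-off noted in the abstract: stricter filtering raises validity but also produces more empty predictions.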
cs.CL / 146 / 2605.10328
ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models
ANCHOR:用于大语言模型中可靠概率推断的分层协调的推理网络构建
Abstract
A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches leverage Large Language Models (LLMs) to generate explanatory factors and elicit coarse-grained probability estimates. Typically, an LLM performs forward abduction to propose factors, each paired with two mutually exclusive attributes, and a Naïve Bayes model is trained over factor combinations to refine the final probabilities. However, sparse factor spaces often yield "unknown" outcomes, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose ANCHOR, an inference framework that orchestrates aggregated Bayesian inference over a hierarchically structured factor space. ANCHOR first constructs a dense and organized factor space via iterative generation and hierarchical clustering. It then performs context-aware mapping through hierarchical retrieval and refinement, substantially reducing "unknown" predictions. Finally, ANCHOR augments Naïve Bayes with a Causal Bayesian Network to capture latent dependencies among factors, relaxing the strict independence assumption. Experiments show that ANCHOR markedly reduces "unknown" predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.
Chinese Translation
在不完全信息下的大规模决策中,估计可靠概率是一个核心挑战。最近的方法利用大型语言模型(LLMs)生成解释性因素并引出粗略的概率估计。通常,LLM通过前向溯因提出因素,每个因素配对两个互斥属性,并通过因素组合训练朴素贝叶斯模型以细化最终概率。然而,稀疏的因素空间往往导致“未知”结果,而扩展因素则增加噪声和虚假相关性,削弱条件独立性并降低可靠性。为了解决这些局限性,我们提出了\textsc{Anchor},一个在分层结构因素空间上协调聚合贝叶斯推断的推理框架。\textsc{Anchor}首先通过迭代生成和分层聚类构建一个密集且有组织的因素空间。然后,它通过分层检索和细化进行上下文感知映射,显著减少“未知”预测。最后,\textsc{Anchor}通过因果贝叶斯网络增强朴素贝叶斯,以捕捉因素之间的潜在依赖关系,放宽严格的独立假设。实验表明,\textsc{Anchor}显著减少了“未知”预测,并比直接的LLM基线产生更可靠的概率估计,实现了最先进的性能,同时显著降低了时间和令牌开销。
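As a rough illustration of the Naive Bayes refinement step that the ANCHOR abstract above starts from, the sketch below combines a prior with per-factor likelihoods under conditional independence and returns "unknown" (here `None`) when the context cannot be mapped to any known factor. The function name, the factor names, and all probabilities are hypothetical and chosen only for illustration; this is not the paper's implementation, and it does not cover the hierarchical retrieval or Causal Bayesian Network components.

```python
from math import prod


def naive_bayes_posterior(prior, likelihoods, observed):
    """Posterior P(outcome=True | observed attributes), assuming conditional independence.

    prior:       P(outcome=True)
    likelihoods: {factor: {attribute: (P(attr | True), P(attr | False))}}
    observed:    {factor: attribute} for the factors the context was mapped to
    Returns None ("unknown") when no observed factor is covered by the model.
    """
    matched = [
        likelihoods[factor][attr]
        for factor, attr in observed.items()
        if factor in likelihoods and attr in likelihoods[factor]
    ]
    if not matched:
        return None  # sparse factor space: nothing to condition on
    p_true = prior * prod(p for p, _ in matched)
    p_false = (1 - prior) * prod(q for _, q in matched)
    return p_true / (p_true + p_false)


# Hypothetical factors, each with two mutually exclusive attributes.
likelihoods = {
    "weather": {"clear": (0.8, 0.3), "storm": (0.2, 0.7)},
    "staffing": {"full": (0.7, 0.4), "short": (0.3, 0.6)},
}
posterior = naive_bayes_posterior(
    0.5, likelihoods, {"weather": "clear", "staffing": "short"}
)
unknown = naive_bayes_posterior(0.5, likelihoods, {"budget": "high"})
```

The second call shows the failure mode the abstract describes: a query whose factors fall outside the modeled space yields no estimate at all, which is what a denser, hierarchically organized factor space is meant to reduce.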
cs.CL / 147 / 2605.10339
An Annotation Scheme and Classifier for Personal Facts in Dialogue
对话中个人事实的注释方案与分类器
Abstract
The advancement of Large Language Models (LLMs) has enabled their application in personalized dialogue systems. We present an extended annotation scheme for personal fact classification that addresses limitations in existing approaches, particularly PeaCoK. Our scheme introduces new categories (Demographics, Possessions) and attributes (Duration, Validity, Followup) that enable structured storage, quality filtering, and identification of facts suitable for dialogue continuation. We manually annotated 2,779 facts from Multi-Session Chat and trained a multi-head classifier based on transformer encoders. Combined with the Gemma-300M encoder, the classifier achieves $81.6 \pm 2.6$\% macro F1, outperforming all few-shot LLM baselines (best: GPT-5.4-mini, 72.92\%) by nearly 9 percentage points while requiring substantially fewer computational resources. Error analysis reveals persistent challenges in semantic boundary disambiguation, temporal aspect interpretation, and pragmatic reasoning for followup assessment. The dataset and classifier are publicly available.
Chinese Translation
大型语言模型(LLMs)的进步使其能够应用于个性化对话系统。我们提出了一种扩展的个人事实分类注释方案,旨在解决现有方法(特别是 PeaCoK)中的局限性。我们的方案引入了新的类别(人口统计信息、财产)和属性(持续时间、有效性、后续)以实现结构化存储、质量过滤和适合对话延续的事实识别。我们手动注释了来自 Multi-Session Chat 的 2,779 个事实,并基于 Transformer 编码器训练了一个多头分类器。结合 Gemma-300M 编码器,该分类器达到了 $81.6 \pm 2.6$\% 的宏 F1 值,超越了所有少量样本 LLM 基线(最佳:GPT-5.4-mini,72.92\%)近 9 个百分点,同时所需的计算资源显著更少。错误分析揭示了在语义边界消歧、时间方面解释和后续评估的语用推理方面存在持续挑战。数据集和分类器已公开可用。
cs.CL / 148 / 2605.10379
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
并非所有证明都是平等的:超越正确性评估大语言模型的证明质量
Abstract
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.
Chinese Translation
大型语言模型(LLMs)已成为能够解决数学问题的工具,常常能够为具有挑战性的问题生成正确的证明。然而,仅仅正确性是不够的:数学证明还应当清晰、简洁、富有洞察力,并能够迁移到其他问题上。虽然这种证明质量是主观的,且依赖于读者和上下文,但其许多组成部分是具体且广泛被重视的。在本研究中,我们识别了这些组成部分,并引入了ProofRank,这是一个基于具有挑战性的数学竞赛而策划的基准。ProofRank评估几种可扩展的证明质量代理指标:(i)简洁性,衡量证明是否避免了不必要的步骤;(ii)计算便利性,衡量证明在多大程度上依赖于繁琐的计算;(iii)认知简单性,衡量所使用的证明技术的可接近性;(iv)多样性,衡量模型对单一问题的证明有多种变化;(v)适应性,衡量模型是否能够遵循指定的证明技术。在不同模型之间,我们发现证明质量存在显著差异,而这些差异并未被仅依赖正确性的基准所捕捉。我们还观察到证明质量指标与正确性之间存在显著的权衡,提示未来的数学推理评估应当测量LLM生成的证明的实用性。
cs.CL / 149 / 2605.10391
Phoenix-VL 1.5 Medium Technical Report
Phoenix-VL 1.5 中型技术报告
Team Phoenix: Ray, Arka; Jawad, Askar Ali Mohamed; Lee, Biondi; Seah, Elijah; Lim, Eva; Teo, Fiona; Toh, Grace; Teo, Guang Xiang; Tan, Jun En; Bong, Jia Hui; Wang, Jiale; Ng, Jonathan; Tan, Justin; Yew, Kai Zhe; Ong, Matthew; Yeo, Shun Yi; Lam, Wen Jett; Tan, Wen Xiu; Zhang, Ze Yu; Ng, Gee Wah; Ang, Chee Wee. Mistral AI: Sadé, Adrien; Kunsch, Guillaume; Loh, Jia Sin; Schuhl, Nicolas; Menneer, Rupert; Jamil, Umar; Maladière, Vincent; Pan, Yimu
Abstract
We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. Model alignment was then performed on an additional 5 billion tokens through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and highlight benchmark and inference performance.
Chinese Translation
我们介绍了 Phoenix-VL 1.5 中型模型,这是一个具有 1230 亿参数的原生多模态和多语言基础模型,适应于区域语言和新加坡的背景。作为一个主权人工智能资产,它展示了在对广泛智能和对齐的影响最小的情况下,深度领域适应是可以实现的。我们在 Mistral Medium 3.1 上进行了持续的预训练,使用了一个本地化的 1 万亿标记的多模态语料库,随后进行了 2500 亿标记的长上下文扩展阶段。后续的后训练阶段结合了一个新的人类标注的新加坡多模态数据集和关于新加坡文化、知识和立法的策划文本语料库,总计 220 亿标记。通过在线直接偏好优化(Online Direct Preference Optimization)进行了额外的 50 亿标记的模型对齐。Phoenix-VL 1.5 中型模型在新加坡多模态、法律和政府政策基准测试中实现了其规模的最先进性能,同时在一般多模态智能、多语言和 STEM 基准测试中保持全球竞争力。我们还引入了一个新的评估套件,包括本地化知识基准和与机构对齐的模型行为与安全框架。我们报告了数据策划原则、训练方法,并强调了基准和推理性能。
cs.CL / 150 / 2605.10415
Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis
将大型语言模型的不确定性与人类在主观性分析中的分歧对齐
Abstract
Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model's sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model's confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.
Chinese Translation
用于主观性分析的大型语言模型通常使用聚合标签进行训练,这种方法将人类判断的变异压缩为单一的监督信号。这种范式忽视了低一致性样本的内在不确定性,常常导致过于自信的预测,从而削弱了在复杂主观环境中的可靠性和泛化能力。在本研究中,我们倡导一种关注不确定性的主观性分析,其中模型在进行预测时需要表达反映人类分歧的不确定性。为实现这一视角,我们提出了一个两阶段的分歧感知与不确定性对齐(Disagreement Perception and Uncertainty Alignment, DPUA)框架。具体而言,DPUA在不确定性感知的设置下,联合建模标签预测、推理生成和不确定性表达。在分歧感知阶段,自适应解耦学习增强了模型对与分歧相关线索的敏感性,同时保持任务性能。在不确定性对齐阶段,基于GRPO的奖励优化进一步改善了不确定性感知推理,并使模型的信心表达与人类分歧分布对齐。在三个主观性分析任务上的实验表明,DPUA在保持任务性能的同时,更好地将模型的不确定性与人类分歧对齐,减轻了边界样本上的过度自信,并改善了分布外的泛化能力。
cs.CL / 151 / 2605.10419
Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets
语言模型能否分析数据?评估大型语言模型在数据集上的问答能力
Abstract
This paper investigates the effectiveness of large language models (LLMs) in answering questions over datasets. We examine their performance in two scenarios: (a) directly answering questions given a dataset file as input, and (b) generating SQL queries to answer questions given the schema of a relational database. We also evaluate the impact of different prompting strategies on model performance. The study includes both state-of-the-art LLMs and smaller language models that require fewer resources and operate at lower computational and financial cost. Experiments are conducted on two datasets containing questions of varying difficulty. The results demonstrate the strong performance of large LLMs, while highlighting the limitations of smaller, more cost-efficient models. These findings contribute to a better understanding of how LLMs can be utilized in data analytics tasks and their associated limitations.
Chinese Translation
本文研究了大型语言模型(LLMs)在数据集上回答问题的有效性。我们在两种场景中考察它们的表现:(a)直接回答给定数据集文件作为输入的问题,以及(b)根据关系数据库的模式生成 SQL 查询以回答问题。我们还评估了不同提示策略对模型性能的影响。研究包括了最先进的 LLMs 和需要较少资源、在计算和财务成本上更低的小型语言模型。实验在两个包含不同难度问题的数据集上进行。结果表明大型 LLMs 的强大表现,同时突显了小型、成本效益更高的模型的局限性。这些发现有助于更好地理解 LLMs 在数据分析任务中的应用及其相关限制。
cs.CL / 152 / 2605.10462
Coherency through formalisations of Structured Natural Language, A case study on FRETish
通过结构化自然语言的形式化实现一致性:以 FRETish 为例
Abstract
Formalisation is the process of writing system requirements in a formal language. These requirements mostly originate in Natural Language. In the field of Formal Methods, formalisation is often identified as one of the most delicate and complicated steps in the verification process. Formalisation tools and environments frequently work with several levels of requirement description: Natural Language, Technical Language, Diagram Representations and Formal Language, to mention a few. In the literature, there are various maxims and principles of good practice to guide the process of requirement formalisation. In this paper we propose a new guideline: Coherency through Formalisations. The guideline states that the different levels of formalisation mentioned above should roughly follow the same logical structure. The principle seems particularly relevant in settings where LLMs are prompted to perform reasoning tasks that can be checked by formal tools, with Structured Natural Language acting as an intermediate layer bridging both paradigms. In the light of coherency, we analyze NASA's Formal Requirement Elicitation Tool FRET and propose an alternative automated translation of the Controlled Natural Language FRETish to the formal language of MTL. We compare our translation to the original translation and prove equivalence using model checking. We also report some statistics, which seem to favor the new translation. As expected, the translation process yielded interesting reflections and revealed inconsistencies, which we present and discuss.
Chinese Translation
形式化是将系统需求用形式语言书写的过程。这些需求大多源于自然语言。在形式方法领域,形式化通常被视为验证过程中的最微妙和复杂的步骤之一。形式化工具和环境常常选择不同层次的需求描述:自然语言、技术语言、图示表示和形式语言等。在文献中,有多种良好实践的格言和原则来指导需求形式化的过程。本文提出了一项新的指导原则:通过形式化实现一致性。该指导原则指出,上述不同层次的形式化应大致遵循相同的逻辑结构。该原则在大语言模型(LLMs)被提示执行可以通过形式工具检查的推理任务的背景下显得尤为相关,结构化自然语言作为连接这两种范式的中介层。在一致性的视角下,我们分析了NASA的形式需求引出工具FRET,并提出了将受控自然语言FRETish自动翻译为形式语言MTL的替代方案。我们将我们的翻译与原始翻译进行了比较,并通过模型检查证明了等价性。进行了一些统计分析,结果似乎更倾向于新的翻译。正如预期的那样,翻译过程产生了有趣的反思,并揭示了不一致性,我们对此进行了呈现和讨论。
cs.CL / 153 / 2605.10488
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine:通过强化学习进行代理编译知识的精炼
Abstract
Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by \emph{incompleteness}, \emph{incorrectness}, and \emph{redundancy}, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguity or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present \textbf{DeepRefine}, a general LLM-based reasoning model for \emph{agent-compiled knowledge refinement} that uses user queries to improve the quality of any pre-constructed knowledge base, making it more suitable for downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base, conducts abductive diagnosis over the interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize the refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.
Chinese Translation
代理编译的知识库为大型语言模型(LLM)代理在开放式、知识密集型的下游任务中提供了持久的外部知识。然而,其质量受到\textit{不完整性}、\textit{不正确性}和\textit{冗余性}的系统性限制,这表现为缺失证据或跨文档链接、低置信度或不精确的主张,以及模糊或共指解析问题。这些缺陷在迭代使用中会加剧,降低检索的准确性和下游任务的表现。我们提出了\textbf{DeepRefine},一种基于LLM的推理模型,用于\textit{代理编译知识的精炼},旨在通过用户查询提高任何预构建知识库的质量,使其更适合下游任务。DeepRefine与知识库进行多轮交互,并对交互历史进行溯因诊断,定位可能的缺陷,并执行针对性的精炼操作以实现知识库的增量更新。为了在没有标准参考的情况下优化DeepRefine的精炼策略,我们引入了一种超越草稿的收益(Gain-Beyond-Draft,GBD)奖励,并通过强化学习端到端地训练推理过程。大量实验表明,DeepRefine在强基线之上实现了一致的下游收益。
cs.CL / 154 / 2605.10504
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
学习少即是多:过早的上层注意力专业化会损害语言模型的预训练
Abstract
A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs reduce a residual-energy factor on the same growth pathway. Our results identify upper-layer Q/K timing as a concrete interaction point between decoder architecture and optimization.
Chinese Translation
因果解码器块是分层的:下层构建残差基础,上层对此进行关注。我们在GPT预训练中识别出一种失败模式:上层在下层特征稳定之前就承诺于尖锐的注意力模式。我们将其称为过早的上层注意力专业化。在早期训练中暂时减缓仅上层的Q/K投影,可以在不改变其他参数的情况下改善最终的困惑度和下游准确性;它防止了上层注意力崩溃到不成熟的残差基础上。在LLaMA风格的块中,同样的干预几乎是多余的。通过消融实验,我们将乘法门控前馈网络(gated FFNs,非RMSNorm或偏置移除)隔离为抑制驱动失败的上游残差写入的组件。路径分析将这两项发现统一起来:学习率干预减少了步长因子,而门控FFNs在同一增长路径上减少了残差能量因子。我们的结果确定了上层Q/K时序作为解码器架构与优化之间的一个具体交互点。
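The intervention described in the abstract above (temporarily slowing only upper-layer Q/K projections early in training) can be pictured as a per-parameter learning-rate multiplier. The sketch below is one possible reading, not the authors' schedule: the parameter names `q_proj`/`k_proj`, the definition of "upper" as the top half of the stack, the step threshold, and the slow-down factor are all assumptions made for illustration.

```python
def lr_multiplier(layer_idx, param_name, step, *, num_layers,
                  warm_steps=1000, upper_fraction=0.5, slow_factor=0.1):
    """Learning-rate multiplier for one parameter at a given training step.

    Only Q/K projections in the upper part of the stack are slowed, and only
    while step < warm_steps. Every other parameter keeps the base rate.
    All default values are hypothetical.
    """
    first_upper_layer = int(num_layers * (1 - upper_fraction))
    is_upper = layer_idx >= first_upper_layer
    is_qk = param_name in ("q_proj", "k_proj")
    if is_upper and is_qk and step < warm_steps:
        return slow_factor
    return 1.0


# In a 12-layer model, layers 6..11 count as "upper" under these defaults.
early_upper_q = lr_multiplier(10, "q_proj", 0, num_layers=12)
early_upper_v = lr_multiplier(10, "v_proj", 0, num_layers=12)
early_lower_q = lr_multiplier(2, "q_proj", 0, num_layers=12)
late_upper_k = lr_multiplier(10, "k_proj", 1000, num_layers=12)
```

In a real training loop such a multiplier would typically be applied through optimizer parameter groups; that wiring is omitted here because it depends on the framework in use.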
cs.CL / 155 / 2605.10518
Infinite Mask Diffusion for Few-Step Distillation
无限掩码扩散用于少步蒸馏
Abstract
Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.
Chinese Translation
掩码扩散模型(Masked Diffusion Models, MDMs)作为语言建模中自回归模型的有希望的替代方案,提供了并行解码和双向上下文处理的优势,构建在一个简单而有效的框架内。具体而言,它们对掩码标记和数据的明确区分是其简单框架和有效条件生成的基础。然而,由于同时更新标记所导致的因子化误差,MDMs 通常需要许多采样迭代。我们观察到,存在一个因子化误差的理论下界,标准 MDMs 无法降低这一界限,因为它们使用的是确定性的单状态掩码。在本文中,我们提出了无限掩码扩散模型(Infinite Mask Diffusion Model, IMDM),该模型引入了随机无限状态掩码,以减轻理论界限,同时直接继承 MDMs 的优点,包括与预训练权重的兼容性。我们通过实验证明,MDM 在简单的合成任务中由于因子化误差界限而无法进行少步生成,而 IMDM 能够为同一任务找到高效的解决方案。最后,当配备适当的蒸馏方法时,IMDM 在 LM1B 和 OpenWebText 上的少步蒸馏性能超过了现有的方法。代码可在 https://Ugness.github.io/official_imdm 获取。
cs.CL / 156 / 2605.10537
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela:基于转化假设的测试时记忆巩固
Abstract
Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Experiments on language modeling demonstrate that Mela outperforms Transformer baselines across all the model sizes. Moreover, with the pretrained context length fixed at 4K, Mela maintains performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.
Chinese Translation
记忆巩固是将短暂经历转化为稳定、结构化表征的过程,是人脑中的一个基础组织原则,但作为现代序列模型的设计原则,它仍然未被充分探索。在本研究中,我们利用已建立的神经科学记忆巩固理论和跨频耦合,提出了层次记忆模块(Hierarchical Memory Module, HMM),这是一种由两个功能上不同的子模块组成的神经记忆架构,分别以不同的更新频率运行。受到转化假设的启发,低频子模块生成捕捉抽象、要点级知识的高级表征,而高频子模块则生成保留更丰富情节细节的细粒度表征。最终的记忆输出作为这两种表征的上下文依赖组合动态重构,类似于人类记忆检索的重构特性。我们将HMM集成到基于Transformer的语言解码器中,形成Mela,这是一类在测试时执行在线记忆巩固的记忆增强语言模型。为了进一步利用HMM生成的多粒度记忆表征,我们引入了MemStack,这是一种在解码器的早期层中分配不同级别记忆特征的方法,而不引入额外的标记。语言建模实验表明,Mela在所有模型规模上均优于Transformer基线。此外,在预训练上下文长度固定为4K的情况下,Mela在显著更长的上下文上保持性能,而Transformer基线在超过其训练长度后迅速下降。广泛的消融研究验证了每个组件的贡献,并为实际配置提供了指导。
cs.CL / 157 / 2605.10544
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
长上下文监督究竟去向何处?有效上下文暴露平衡
Abstract
Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token's effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning are preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
Chinese Translation
长上下文适应通常被视为窗口缩放,但这忽略了一个令牌级监督不匹配的问题:在文档掩蔽的紧凑训练中,每个目标令牌的有效上下文仍然很短。我们引入了EXACT(监督分配目标),通过反频率在长尾中为长有效上下文目标分配额外权重。在七个Qwen/LLaMA CPT配置中,EXACT改善了所有28个训练/外推的NoLiMa和RULER比较。在Qwen2.5-0.5B上,NoLiMa的提升为+10.09(训练)和+5.34(外推);RULER的提升为+10.69和+5.55。在LLaMA-3.2-3B上,RULER的提升为+17.91和+16.11。标准QA/推理保持不变(在六个基准测试中宏观变化为+0.24)。一个距离解析探针显示,当证据距离数千个令牌时,收益出现,而短案例保持不变。结果支持一个以监督为中心的论点:长上下文适应依赖于训练对长上下文预测的监督强度。
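To make the supervision mismatch in the EXACT abstract above concrete, the sketch below computes each target token's effective context in a packed sequence with document masking, then up-weights long-context targets by inverse frequency within the tail. The bucketing scheme, the `tail_start` threshold, and the `boost` scaling are assumptions for illustration only; the abstract does not specify the exact weighting formula.

```python
from collections import Counter


def effective_context(doc_ids):
    """Number of earlier same-document tokens each target can attend to."""
    seen = Counter()
    lengths = []
    for doc in doc_ids:
        lengths.append(seen[doc])
        seen[doc] += 1
    return lengths


def exposure_weights(ctx, tail_start, bucket_size, boost=1.0):
    """Per-target loss weights.

    Targets with effective context below tail_start keep weight 1.0.
    Tail targets are bucketed by context length; each bucket gets an extra
    weight proportional to the inverse of its frequency, scaled so that the
    rarest bucket receives the full boost.
    """
    buckets = [c // bucket_size if c >= tail_start else None for c in ctx]
    counts = Counter(b for b in buckets if b is not None)
    if not counts:
        return [1.0] * len(ctx)
    rarest = min(counts.values())
    return [
        1.0 if b is None else 1.0 + boost * rarest / counts[b]
        for b in buckets
    ]


# Two packed documents: one of 7 tokens, one of 2 tokens.
ctx = effective_context([0] * 7 + [1] * 2)
weights = exposure_weights(ctx, tail_start=2, bucket_size=2)
```

Even in this toy packing, most targets have a short effective context, and the single longest-context target receives the largest weight, which is the imbalance the abstract argues window scaling alone does not address.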
cs.CL / 158 / 2605.10550
Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy
多领域多模态文档分类基准与多层次分类法
Abstract
Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms -- single domain settings with flat label structures -- that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, covering both open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at https://github.com/MMMDC-Bench/MMMDC-Bench.
Chinese Translation
文档分类是现代企业内容管理的核心,然而现有的基准测试仍然局限于过于简化的范式——单一领域设置和扁平标签结构——这与现实商业文档的层次性、多模态性和跨领域特征相去甚远。这一差距不仅误导了实际复杂性的表现,还阻碍了向工业可行的文档智能发展的进程。为了解决这一问题,我们构建了首个多层次、多领域、多模态文档分类基准(MMM-Bench)。MMM-Bench包括(1)一个深层次的层级分类法,涵盖五个层级,捕捉商业文档的真实组织逻辑;以及(2)从阿里巴巴的12个商业领域精心策划的5,990份真实多模态文档。每份文档都由领域专家手动标注完整的层级路径。我们在MMM-Bench上建立了全面的基准,包括开放权重模型和基于API的模型。通过系统实验,我们识别出MMM-Bench中的四个基本挑战,并提出相应的见解。为了为多层次、多领域文档分类的研究进展提供坚实基础,我们在https://github.com/MMMDC-Bench/MMMDC-Bench发布了所有数据和评估工具包。
cs.CL / 159 / 2605.10560
ICT-NLP at SemEval-2026 Task 3: Less Is More -- Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression
ICT-NLP在SemEval-2026任务3中的表现:少即是多——基于联合训练和自适应集成的多语言编码器用于维度方面情感回归
Abstract
This paper describes our system for SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.
Chinese Translation
本文描述了我们在SemEval-2026任务3 A轨道子任务1(维度方面情感回归,DimASR)中的系统。我们提出了一种轻量级且资源高效的系统,完全基于多语言预训练编码器,而不依赖于大型语言模型(LLMs)或外部语料库。我们采用联合多语言和多领域训练,以促进跨语言迁移并缓解数据稀疏问题,介绍了一种有界回归变换,以提高训练稳定性,同时将预测限制在有效范围内,并通过子集搜索采用自适应集成策略以减少预测方差。实验结果表明,我们的系统在性能上表现强劲且一致,在zho-res数据集上排名第一,在zho-lap数据集上排名第二,在jpn-hot数据集上排名第三,所有其他数据集均位于参与团队的前半部分。
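The "bounded regression transformation" mentioned in the ICT-NLP abstract above is not specified further there. One common way to constrain a regressor to a valid range is a scaled sigmoid, sketched below. The choice of sigmoid, the function names, and the example range of 1 to 9 are assumptions for illustration, not details confirmed by the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_bounded(z, lo, hi):
    """Map an unbounded model output z into the range [lo, hi]."""
    return lo + (hi - lo) * sigmoid(z)


def to_unbounded(y, lo, hi):
    """Inverse map for a score strictly inside (lo, hi), e.g. to build targets."""
    t = (y - lo) / (hi - lo)
    return math.log(t / (1.0 - t))


# Hypothetical score range of 1 to 9.
midpoint = to_bounded(0.0, 1.0, 9.0)
extreme = to_bounded(-50.0, 1.0, 9.0)
round_trip = to_bounded(to_unbounded(7.25, 1.0, 9.0), 1.0, 9.0)
```

Because the output can never leave the range, no clipping is needed at prediction time, and gradients shrink smoothly near the boundaries instead of being cut off.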
cs.CL / 160 / 2605.10563
ThreatCore: A Benchmark for Explicit and Implicit Threat Detection
ThreatCore:显性与隐性威胁检测的基准
Abstract
Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a publicly available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, revealing substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples, which are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.
Chinese Translation
自然语言处理中的威胁检测缺乏一致的定义和标准化的基准,常常与更广泛的现象如毒性、仇恨言论或冒犯性语言混淆。在本研究中,我们引入了ThreatCore,一个公开可用的基准数据集,用于细粒度的威胁检测,区分显性威胁、隐性威胁和非威胁。该数据集通过聚合多个公开可用的资源并在统一的威胁操作定义下进行系统的重新标注而构建,揭示了现有标签之间的重大不一致性。为了提高对代表性不足案例的覆盖,特别是隐性威胁,我们进一步用合成实例增强数据集,这些实例经过手动验证,采用与公共数据集重新标注相同的标注协议,以确保所有数据源之间的一致性。我们在ThreatCore上评估了Perspective API、零样本分类器和最新的语言模型,结果表明隐性威胁的检测仍然显著比显性威胁更具挑战性。我们的结果还表明,结合语义角色标注作为中间表示可以通过使有害意图的结构更加明确来提高性能。总体而言,ThreatCore为研究细粒度的威胁检测提供了更一致的基准,并突显了当前模型在识别有害意图间接表达方面仍面临的挑战。
cs.CL / 161 / 2605.10579
VISTA: A Generative Egocentric Video Framework for Daily Assistance
VISTA:用于日常辅助的生成性自我中心视频框架
Abstract
Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.
Chinese Translation
训练人工智能代理主动协助人类进行日常活动,从常规家务到紧急安全情况,需要大规模的视觉数据。然而,在现实世界中捕捉此类场景往往困难、成本高昂或不安全,而基于物理的模拟器缺乏将学习到的行为转移到真实环境所需的视觉真实感。因此,我们提出了VISTA,一个生成高保真自我中心视频的视频合成系统,作为人工智能代理的训练和评估数据。VISTA采用一个五步脚本生成流程,结合因果反向推理,创造多样化且逻辑上合理的干预模式。这些场景涵盖了代理自主性的两个层次:反应式和主动式。在反应模式下,用户明确请求代理提供帮助。在主动模式下,代理在未收到直接请求的情况下主动提供帮助。我们进一步将主动模式分为显性和隐性两种类型。在显性主动场景中,用户意识到需要帮助,但并未直接向代理提出请求。在隐性主动场景中,代理在用户甚至尚未意识到需要帮助之前就进行干预。VISTA允许用户自定义和完善场景,以生成日常任务的视频基准,为在现实环境中训练和评估人工智能代理提供了一种可扩展且可控的替代方案。
cs.CL / 162 / 2605.10605
Where do aspectual variants of light verb constructions belong?
轻动词构造的体相变体属于哪个类别?
Abstract
Expressions with an aspectual variant of a light verb, e.g. 'take on debt' vs. 'have debt', are frequent in texts but often difficult to classify between verbal idioms, light verb constructions or compositional phrases. We investigate the properties of such expressions with a disputed membership and propose a selection of features that determine more satisfactory boundaries between the three categories in this zone, assigning the expressions to one of them.
Chinese Translation
带有轻动词体相变体的表达,例如 'take on debt'(承担债务)与 'have debt'(拥有债务),在文本中频繁出现,但通常难以在动词习语、轻动词构造或组合短语之间进行分类。我们研究了这些具有争议归属的表达的属性,并提出了一系列特征,以确定这三个类别之间更令人满意的边界,从而将这些表达归入其中之一。
cs.CL / 163 / 2605.10606
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
测量法语中嵌入对作者风格的敏感性:比较文学文本与语言模型重写
Abstract
Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
Chinese Translation
大型语言模型(LLMs)能够令人信服地模仿人类写作风格,但尚不清楚任何语言模型中的嵌入编码了多少风格信息,以及在LLM重写后保留了多少。我们在法语中探讨这些问题,使用受控的文学数据集量化风格变化对嵌入分散度的影响。我们观察到,嵌入可靠地捕捉到作者的风格特征,并且这些信号在重写后依然存在,同时也表现出特定于LLM的模式。这些分析结果为在语言模型时代的作者模仿检测提供了有希望的方向。
cs.CL / 164 / 2605.10615
Responsible Benchmarking of Fairness for Automatic Speech Recognition
自动语音识别的公平性负责任基准测试
Abstract
Many studies have shown that automatic speech recognition (ASR) systems perform unequally across speaker groups (SGs). However, the manner in which such studies arrive at this conclusion is inconsistent. To pave the way for more reliable results in future studies, we lay out best practices for benchmarking ASR fairness based on literature from machine learning fairness, social sciences, and speech science. We first describe the importance of precisely defining the fairness hypothesis being interrogated, and of tailoring fairness metrics to apply specifically to said hypothesis. We then examine several benchmarks used to rate ASR systems on fairness and discuss how their results can be misconstrued without assiduous oversight into the intersections between SGs. We find that evaluating fairness based on single heterogeneous SGs, such as they are defined in fairness benchmarks, can lead to misidentifying which SGs are actually being mistreated by ASR systems. We advocate for as fine-grained an analysis as possible of the intersectionality of as many demographic variables as are available in the metadata of fairness corpora in order to tease out such spurious correlations.
Chinese Translation
许多研究表明,自动语音处理(ASR)系统在不同说话者群体(SG)之间的表现存在不平等。然而,这些研究得出这一结论的方式并不一致。为了为未来研究提供更可靠的结果,我们基于机器学习公平性、社会科学和语音科学的文献,提出了ASR公平性基准测试的最佳实践。我们首先描述了明确所探讨的公平性假设的重要性,并针对该假设量身定制公平性指标。接着,我们考察了用于评估ASR系统公平性的多个基准,并讨论了在缺乏对SG交集的细致监督时,如何可能误解其结果。我们发现,基于单一异质SG(如公平性基准中所定义的)评估公平性,可能导致错误识别哪些SG实际上受到ASR系统的不公正对待。我们倡导尽可能细致地分析公平性语料库元数据中可用的多个人口变量的交集,以揭示这些虚假相关性。
cs.CL / 165 / 2605.10627
Interpretable Coreference Resolution Evaluation Using Explicit Semantics
使用显式语义的可解释共指解析评估
Abstract
Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically-enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores aimed at evaluating mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.
Chinese Translation
共指解析通常使用聚合统计指标进行评估,例如 CoNLL-F1,这些指标测量预测集群与金标准集群之间的结构重叠。尽管这些指标被广泛使用,但它们提供的诊断见解有限,惩罚错误而不揭示系统在特定语义类别(如人物、地点或事件)上的表现困难,从而使得解释模型能力或推导可行改进变得困难。我们通过引入一个语义增强的共指解析评估框架来填补这一空白。我们的方法将概念和命名实体识别(CNER)叠加到共指输出上,为名词提及分配语义标签,并将其传播到整个共指集群。这使得我们能够计算针对按语义类别分层的提及提取和链接能力评估的类型化评分。在我们对 OntoNotes、LitBank 和 PreCo 的实验中,我们展示了我们的框架揭示了被聚合指标掩盖的系统性弱点。此外,我们证明这些诊断可以用于设计针对性的、低成本的数据增强策略,从而实现可测量的域外改进。
cs.CL / 166 / 2605.10633
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
内在护栏:个性语义几何如何与大型语言模型中的新兴失调相互作用
Abstract
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure mode to less than 3%. Leveraging the structural stability of the personality space, we show that vectors extracted a priori from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.
Chinese Translation
在良性狭窄数据上微调大型语言模型(LLMs)有时会引发广泛的有害行为,这种脆弱性被称为新兴失调(EM)。虽然先前的研究将这些失败与激活空间中的特定方向联系起来,但它们与模型更广泛个性之间的关系尚未得到探索。我们通过已建立的心理测量特征(如五大人格、黑暗三合一和特定于LLM的行为,例如邪恶、谄媚)绘制LLMs的潜在个性空间,并表明语义几何在对齐模型及其被腐蚀的微调中高度稳定。通过因果干预,我们发现隔离社会情感的方向,例如“邪恶”个性向量,以及我们引入的语义情感向量(SVV),充当内在护栏:消融它们会使失调率超过40%,而增强它们则将失败模式抑制到低于3%。利用个性空间的结构稳定性,我们表明从指令微调模型中提取的向量在零样本情况下成功转移,以调节被腐蚀的微调中的EM。总体而言,我们的发现表明,有害的微调并不会覆盖模型对个性的内部表征,从而使保留的表征能够作为强健的跨分布护栏。
cs.CL / 167 / 2605.10640
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
理解语言模型的持续事实知识获取:从理论到算法
Abstract
Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.
Chinese Translation
持续预训练(CPT)对于使语言模型(LMs)能够整合新知识而不抹去旧知识至关重要。尽管数据重放等经典CPT技术已成为标准范式,但语言模型如何随时间获取和保留事实的机制,即持续事实知识获取(cFKA),仍然不清楚。在本研究中,我们提出了一个理论框架,使用单层Transformer描述cFKA的训练动态,为代表性CPT方法的行为提供了统一的解释。我们的分析揭示,基于正则化的方法仅仅调整参数的收敛速率,而并未改变固有的遗忘倾向,而数据重放方法则成功地转变了收敛动态并稳定了预训练知识。在此基础上,我们提出了一种新颖的生成数据重放方法,称为通过注意力贡献选择令牌(STOC),该方法识别影响力较大的事实片段以指导重放数据的生成。在合成和真实世界数据集上的大量实验验证了我们的发现,并表明STOC通过减轻灾难性遗忘有效增强了cFKA。
cs.CL / 168 / 2605.10643
A Single-Layer Model Can Do Language Modeling
单层模型可以进行语言建模
Abstract
Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.
Chinese Translation
现代语言模型通过堆叠层数来扩展深度,每层持有自己的状态——在变换器中是每层的 KV 缓存,在 Mamba、Gated DeltaNet (GDN)、RWKV 和 xLSTM 中是每层的矩阵。生物系统则更倾向于使用递归而不是堆叠。我们探讨这种结构在语言建模中的潜力。我们提出了基础预测网络 (Grounded Prediction Networks, GPN):在每一步通过一个单一的递归块重新访问一个状态向量——一个前馈网络 (FFN),一个共享的矩阵记忆。在 130M 参数下,1 层 GPN+M 达到了 FineWeb-Edu 的困惑度 18.06,距离 12 层 Transformer++ (16.05) 仅有 13% 的差距,距离 10 层 GDN (15.34) 的差距为 18%;而 2 层变体将差距缩小至 6%/11%。我们并未达到深基线的表现。由于工作上下文是一个单一的向量,我们可以直接检查其几何特征:一个持久的默认标记方向、一个承载内容的数十个标记的视野,以及自发分裂为快速和慢速保留池的记忆头。
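The gap figures quoted in the abstract above can be checked with a few lines of arithmetic. This is only a sketch, and it assumes that "within 13%" and "18%" mean the relative perplexity gap measured against each baseline; the abstract does not state the formula, and the variable names are illustrative.

```python
def relative_gap(ppl: float, baseline: float) -> float:
    """Relative perplexity gap of `ppl` over `baseline`, as a fraction.

    Assumption: this is how the quoted percentages were derived;
    the abstract does not spell out the formula.
    """
    return (ppl - baseline) / baseline


# Perplexities quoted in the abstract (FineWeb-Edu, 130M parameters).
gpn_1layer = 18.06
transformer_pp_12layer = 16.05
gdn_10layer = 15.34

gap_vs_transformer = relative_gap(gpn_1layer, transformer_pp_12layer)
gap_vs_gdn = relative_gap(gpn_1layer, gdn_10layer)

print(f"gap vs Transformer++: {gap_vs_transformer:.1%}")
print(f"gap vs GDN:           {gap_vs_gdn:.1%}")
```

Under that assumption the gaps come out at roughly 12.5% and 17.7%, which agrees with the rounded figures of 13% and 18% in the abstract.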
cs.CL / 169 / 2605.10659
When Can Digital Personas Reliably Approximate Human Survey Findings?
数字化人设何时能够可靠地近似人类调查结果?
Abstract
Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents' background variables and pre-2023 survey histories, then testing them against the same respondents' held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.
Chinese Translation
由大型语言模型(LLMs)驱动的数字化人设越来越被提议作为人类调查受访者的替代品,但何时它们能够可靠地近似人类调查结果仍不清楚。我们使用LISS面板回答这个问题,从受访者的背景变量和2023年前的调查历史构建人设,然后将其与同一受访者的保留后切断答案进行测试。在四种人设架构、三种LLM和两种预测任务中,我们在问题、受访者、分布、公平性和聚类层面评估性能。数字化人设在与人类响应分布的一致性上有所改善,尤其是在与稳定属性和价值观相关的领域,但在个体预测方面仍然有限,且未能恢复多元化的受访者结构。增强检索的架构提供了最明显的提升,但性能更依赖于人类响应结构而非模型选择:人设在低变异性问题和常见受访者模式下表现最佳,而在主观、多样或稀有响应下表现最差。我们的结果为数字化人设在调查研究中的适用性提供了实用指导,并指出何时仍需进行人类验证。
cs.CL / 170 / 2605.10664
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
提示-激活对偶性:通过注意力级干预改善激活引导
Abstract
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
Chinese Translation
激活引导通过在推理时向内部表征添加方向来控制语言模型的行为,但标准的残差流引导在有状态对话中可能会失效。我们识别出KV缓存污染作为一个关键的失效模式:被引导的令牌状态被存储并反复使用,将局部扰动转变为累积的一致性退化。为了解决这一挑战,我们提出了门控裁剪注意力增量引导(Gated Cropped Attention-Delta steering,GCAD),该方法从系统提示对自注意力的贡献中提取引导信号,并通过令牌级门控应用这些信号。在个性引导实验中,GCAD保持了特征控制,同时显著改善了长时域一致性。在主要的多轮基准测试中,GCAD将平均一致性漂移从-18.6改善至-1.9,并将第10轮特征表达从78.0提高至93.1。这些结果表明,当干预遵循模型已经用于行为控制的提示介导路径时,激活引导变得更加可靠。
cs.CL / 171 / 2605.10714
Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish
低资源自然语言处理为何需要超越跨语言迁移:来自卢森堡语的经验教训
Abstract
Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.
Chinese Translation
跨语言迁移已成为将自然语言处理(NLP)技术扩展到低资源语言的核心范式。通过利用高资源语言的监督,多语言语言模型能够在几乎没有目标语言标注数据的情况下实现强大的任务表现。然而,跨语言迁移在多大程度上可以替代语言特定的努力仍然不清楚。在本文中,我们综合了关于卢森堡语的先前研究结果和数据收集结果,尽管卢森堡语在类型学上与高资源语言接近,并且存在于多语言环境中,但在现代NLP技术中仍然代表不足。我们的研究发现表明,跨语言迁移与语言特定努力之间存在根本的相互依赖关系。跨语言迁移可以显著提高目标语言的表现,但其成功在很大程度上依赖于足够高质量、与任务对齐的目标语言数据的可用性。同时,这些资源在低资源环境中通常规模有限,无法单独驱动强大的表现。相反,这些资源只有在跨语言框架内利用时才能发挥其全部潜力。因此,我们认为跨语言迁移和语言特定努力不应被视为竞争性选择,而应作为可持续低资源NLP流程的互补组成部分。基于这些见解,我们提供了在可持续低资源NLP流程中整合和平衡跨语言迁移与语言特定开发的实用指南。
cs.CL / 172 / 2605.10832
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
面向视觉原生多模态深度搜索智能体的在线数据演化
Abstract
Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in the standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
Chinese Translation
多模态深度搜索要求智能体通过链接搜索、工具使用和视觉推理来解决开放世界问题,处理不断演变的文本和视觉上下文。目前系统面临两个瓶颈。首先,现有的工具使用框架将搜索、浏览或转换返回的图像视为临时输出,因此中间视觉证据无法被后续工具重新利用。其次,训练数据通常是通过固定的策划方案构建的,无法跟踪目标智能体的演变能力。为了解决这些挑战,我们首先引入了一种以图像库参考协议为中心的视觉原生智能体框架,该框架将每个工具返回的图像注册为可寻址的参考,使中间视觉证据可以被后续工具重用。在此框架之上,在线数据演化(On-policy Data Evolution, ODE)运行一个闭环数据生成器,该生成器在训练的策略的回合中自我优化。每轮的优化使得每轮的数据针对当前策略仍需学习的内容。相同的框架支持多样化的监督微调数据和策略感知的强化学习数据策划,涵盖目标智能体的完整训练生命周期。在8个多模态深度搜索基准测试中,ODE将Qwen3-VL-8B智能体的平均得分从24.9%提高到39.0%,超越了标准智能体工作流程设置下的Gemini-2.5 Pro(37.9%)。在30B模型下,ODE将平均得分从30.6%提高到41.5%。进一步分析验证了图像库重用的有效性,尤其是在需要迭代视觉优化的复杂任务中,而回合反馈演化比静态合成产生了更扎实的SFT轨迹和更匹配策略的RL任务。
cs.CL / 173 / 2605.10843
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
无训练的大型语言模型文化对齐通过个性不一致
Abstract
Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
Chinese Translation
大型语言模型越来越多地介入依赖道德判断的决策,但越来越多的证据表明,它们的隐含偏好并非文化中立。现有的文化对齐方法要么需要每个国家的偏好数据和微调预算,要么假设对模型内部的白盒访问,而商业API并不提供这种访问。在本研究中,我们关注这一现实的黑盒、仅使用公共数据的模式,并观察到国家内部的社会人口统计不一致,而非共识,是主要的引导信号。我们提出了DISCA(基于不一致的文化对齐引导),这是一种推理时方法,将每个国家实例化为基于世界价值调查的个性代理面板,并将它们的不一致转化为有界的、规避损失的逻辑修正。在20个国家和7个开放权重骨干网络(2B--70B)中,DISCA在六个权重>=3.8B的骨干网络上将MultiTP的文化不对齐减少了10--24%,在开放式场景中减少了2--7%,而没有改变任何权重。我们的结果表明,推理时的校准是服务全球道德偏好的长尾的可扩展替代方案。
cs.CL / 174 / 2605.10853
Grounded Satirical Generation with RAG
基于检索增强生成的扎根讽刺生成
Abstract
Humor generation remains a challenging task for Large Language Models (LLMs), due to its subjective nature. We focus on satire, a form of humor strongly shaped by context. In this work, we present a novel pipeline for grounded satire generation that uses Retrieval-Augmented Generation (RAG) over current news to produce satirical dictionary definitions in the Finnish context. We also introduce a new task-specific evaluation framework and annotate 100 generated definitions with six human annotators, enabling analysis across multiple experimental conditions, including cultural background, source-word type, and the presence or absence of RAG. Our results show that the generated definitions are perceived as more political than humorous. Both topic-based word selection and RAG improve the political relevance of the outputs, but neither yields clear gains in humor generation. In addition, our LLM-as-a-judge evaluation of five state-of-the-art models indicates that LLMs correlate well with human judgments on political relevance, but perform poorly on humor. We release our code and annotated dataset to support further research on grounded satire generation and evaluation.
Chinese Translation
幽默生成对于大型语言模型(LLMs)来说仍然是一项具有挑战性的任务,因为其主观性较强。我们专注于讽刺,这是一种受上下文影响较大的幽默形式。在本研究中,我们提出了一种新颖的扎根讽刺生成管道,该管道利用检索增强生成(RAG)技术,基于当前新闻生成芬兰语环境下的讽刺词典定义。我们还引入了一种新的任务特定评估框架,并由六位人工标注者对100个生成的定义进行了标注,从而能够在包括文化背景、源词类型以及RAG的有无等多个实验条件下进行分析。我们的结果表明,生成的定义在政治性上被认为比幽默性更强。基于主题的词选择和RAG都提高了输出的政治相关性,但在幽默生成上均未显著提升。此外,我们对五个最先进模型的LLM作为评判者的评估表明,LLMs与人类对政治相关性的判断高度相关,但在幽默性上表现不佳。我们发布了代码和标注数据集,以支持对扎根讽刺生成和评估的进一步研究。
cs.CL / 175 / 2605.10855
Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding
从少中学习更多:利用反事实进行数据高效的图表理解
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable progress in chart understanding, largely driven by supervised fine-tuning (SFT) on increasingly large synthetic datasets. However, scaling SFT data alone is inefficient and overlooks a key property of charts: charts are programmatically generated visual artifacts, where small, code-controlled visual changes can induce drastic shifts in semantics and correct answers. Learning this counterfactual sensitivity requires VLMs to discriminate fine-grained visual differences, yet standard SFT treats training instances independently and provides limited supervision to enforce this behavior. To address this, we introduce ChartCF, a data-efficient training framework designed to enhance counterfactual sensitivity. ChartCF consists of: (1) a counterfactual data synthesis pipeline via code modification, (2) a chart similarity-based data selection strategy that filters overly difficult samples for improved training efficiency, and (3) multimodal preference optimization across both textual and visual modalities. Experiments on five benchmarks show that ChartCF achieves superior or comparable performance to strong chart-specific VLMs while using significantly less training data.
Chinese Translation
视觉-语言模型(VLMs)在图表理解方面取得了显著进展,这主要得益于在越来越大规模的合成数据集上进行的监督微调(SFT)。然而,仅仅扩大SFT数据的规模效率低下,并忽视了图表的一个关键特性:图表是程序生成的视觉工件,微小的、由代码控制的视觉变化可以引发语义和正确答案的重大变化。学习这种反事实敏感性要求VLMs区分细微的视觉差异,但标准的SFT将训练实例视为独立,并提供有限的监督来强制执行这种行为。为了解决这个问题,我们提出了ChartCF,一个旨在增强反事实敏感性的高效数据训练框架。ChartCF包括:(1)通过代码修改生成反事实数据的管道,(2)基于图表相似性的样本选择策略,过滤过于困难的样本以提高训练效率,以及(3)在文本和视觉模态之间进行多模态偏好优化。五个基准实验表明,ChartCF在使用显著更少的训练数据的情况下,达到了优于或可比于强大的图表特定VLMs的性能。
cs.CL / 176 / 2605.10862
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
RUBEN:基于规则的检索增强大型语言模型系统的解释
Abstract
This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and effectiveness of adversarial prompt injections.
Chinese Translation
本文展示了RUBEN,一个用于发现最小规则以解释数据驱动应用中检索增强大型语言模型(LLMs)输出的交互式工具。我们利用新颖的剪枝策略高效地识别出一个包含所有其他规则的最小规则集。我们进一步展示了这些规则在LLM安全性方面的新应用,特别是测试安全训练的韧性和对抗性提示注入的有效性。
cs.CL / 177 / 2605.10863
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
DGPO:超越成对偏好的方向一致性群体优化
Abstract
Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.
Chinese Translation
尽管大型语言模型(LLMs)取得了显著进展,但当前的偏好优化方法仍然难以在保持推理多样性的同时实现方向一致性。为了解决这一局限性,我们提出了方向性群体偏好优化(Directional-Groupwise Preference Optimization,DGPO),这是一个轻量级框架,通过在群体层面聚合监督信号,并通过多候选比较显式建模方向感知对齐。DGPO将正向和反向问答实例组织成结构化集合,并优化一个基于边际的似然目标,以区分一致的推理路径和不一致的替代方案。这种群体式的表述捕捉了比成对目标更丰富的相对信息,并在多样化的推理路径中增强了一致性。实证结果表明,我们构建的反向数据在五个基准测试中平均提高了3.2%的性能,而DGPO在多个数据集和模型家族中进一步实现了一致的增益,平均准确率提高了高达3.6%。
cs.CL / 178 / 2605.10877
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
Neural在ArchEHR-QA 2026:一法适用所有:针对电子健康记录的临床问答的统一提示优化
Abstract
Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for the ArchEHR-QA 2026 shared task at CL4Health@LREC 2026, which comprises four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Our approach decouples the task into independent, modular stages and employs DSPy's MIPROv2 optimizer to automatically discover high-performing prompts, jointly tuning instructions and few-shot demonstrations for each stage. Within every stage, self-consistency voting over multiple stochastic inference runs suppresses spurious errors and improves reliability, while stage-specific verification mechanisms (e.g., self-reflection and chain-of-verification for alignment) further refine output quality. Among all teams that participated in all four subtasks, our method ranks second overall (mean rank 4.00), placing 4th, 1st, 4th, and 7th on Subtasks 1-4, respectively. These results demonstrate that systematic, per-stage prompt optimization combined with self-consistency mechanisms is a cost-effective alternative to model fine-tuning for multifaceted clinical QA.
Chinese Translation
对电子健康记录(EHRs)进行自动化问答(QA)需要精确的证据检索、忠实的答案生成以及在临床记录中明确的答案基础。在本研究中,我们提出了Neural1.5,这是我们在CL4Health@LREC 2026的ArchEHR-QA 2026共享任务中的方法,该方法包括四个子任务:问题理解、证据识别、答案生成和证据对齐。我们的方法将任务解耦为独立的模块化阶段,并采用DSPy的MIPROv2优化器自动发现高性能提示,联合调整每个阶段的指令和少量示例。在每个阶段内,通过对多个随机推理运行进行自一致性投票来抑制虚假错误并提高可靠性,而特定阶段的验证机制(例如,自我反思和对齐的验证链)进一步提高输出质量。在所有参与四个子任务的团队中,我们的方法总体排名第二(平均排名4.00),在子任务1-4中分别排名第4、第1、第4和第7。这些结果表明,系统的、逐阶段的提示优化结合自一致性机制是多方面临床问答中模型微调的一个具有成本效益的替代方案。
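The self-consistency voting mentioned in the abstract above can be illustrated with a plain majority vote over repeated runs. This is a minimal sketch and not the authors' implementation: the function name, the first-seen tie-break, and the example labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def self_consistency_vote(answers: Sequence[Hashable]) -> Hashable:
    """Return the most frequent answer across several stochastic runs.

    Ties are broken in favor of the answer that appeared first; this
    tie-break rule is an assumption, not something the abstract specifies.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    best = max(counts.values())
    # Walk the answers in their original order so ties resolve to the first seen.
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical outputs of five sampled runs for one evidence-labelling decision.
runs = ["essential", "not-relevant", "essential", "essential", "supplementary"]
voted = self_consistency_vote(runs)
print(voted)
```

A single spurious run ("not-relevant" or "supplementary" here) is outvoted by the majority, which is the error-suppression effect the abstract describes.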
cs.CL / 179 / 2605.10893
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
有根还是猜测?通过盲图对比排名进行LVLM置信度估计
Abstract
Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
Chinese Translation
大型视觉语言模型存在视觉无根性的问题:它们可以生成流畅、自信甚至正确的响应,这些响应完全由语言先验驱动,而图像对预测没有贡献。现有的置信度估计方法无法检测到这一点,因为它们在正常推理下观察模型行为,没有机制来判断预测是由图像还是仅由文本塑造的。我们提出了BICR(盲图对比排名),这是一个与模型无关的置信度估计框架,通过从一个冻结的LVLM中提取隐藏状态,使这种对比在训练过程中变得明确:一次使用真实的图像-问题对,另一次在保持问题不变的情况下将图像遮蔽。一个轻量级的探测器在真实图像的隐藏状态上进行训练,并通过排名损失进行正则化,该损失惩罚在遮蔽视图上的高置信度,教会它将视觉根植视为可靠性的信号,而不增加额外的推理成本。在涵盖视觉问答、物体幻觉检测、医学成像和金融文档理解的基准上,对五个现代LVLM和七个基线进行评估,BICR在校准和区分上同时实现了最佳的跨LVLM平均值,并且在参数数量比最强探测基线少4-18倍的情况下,获得了统计显著的区分增益,且对集群感知分析具有鲁棒性。
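The ranking regularizer described in the abstract above, which penalizes higher confidence on the blacked-out view, can be sketched as a hinge-style penalty. The abstract does not give the exact loss, so the hinge form, the `margin` parameter, and the function names below are assumptions chosen for illustration.

```python
from typing import Iterable, Tuple


def blind_ranking_penalty(conf_real: float, conf_blind: float, margin: float = 0.0) -> float:
    """Hinge penalty for one example.

    Zero when the real-image confidence exceeds the blacked-out-image
    confidence by at least `margin`; grows linearly otherwise.
    The hinge form and `margin` are illustrative assumptions.
    """
    return max(0.0, margin + conf_blind - conf_real)


def mean_ranking_loss(pairs: Iterable[Tuple[float, float]], margin: float = 0.0) -> float:
    """Average the penalty over (conf_real, conf_blind) pairs."""
    penalties = [blind_ranking_penalty(real, blind, margin) for real, blind in pairs]
    if not penalties:
        raise ValueError("need at least one pair")
    return sum(penalties) / len(penalties)


# Hypothetical probe outputs: the first example looks visually grounded,
# the second is more confident without the image and is therefore penalized.
batch = [(0.9, 0.4), (0.3, 0.8)]
loss = mean_ranking_loss(batch)
print(loss)
```

Because the blacked-out pass is only needed to compute this training-time penalty, a probe trained this way would still read a single real-image hidden state at inference, consistent with the abstract's claim of zero additional inference cost.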
cs.CL / 180 / 2605.10899
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
RubricEM:超越可验证奖励的基于评分标准的元强化学习政策分解
Abstract
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
Chinese Translation
训练深度研究代理,即规划、搜索、评估证据和综合长篇报告的系统,将强化学习推向了超越可验证奖励的领域。它们的输出缺乏真实答案,其轨迹跨越了许多工具增强的决策,而标准的后训练方法几乎没有机制将过去的尝试转化为可重用的经验。在本研究中,我们认为评分标准不仅应作为最终答案的评估工具,还应作为结构化政策执行、评估反馈和代理记忆的共享接口。基于这一观点,我们提出了RubricEM,一个基于评分标准的强化学习框架,结合了阶段性政策分解和基于反思的元政策演化。RubricEM首先通过将规划、证据收集、审查和综合条件化为自生成的评分标准,使研究轨迹具备阶段意识。然后,它通过阶段结构化的GRPO(Stage-Structured GRPO)分配信用,利用阶段性评分判断为长时间优化提供更密集的语义反馈。同时,RubricEM训练一个共享的反思元政策,将评估过的轨迹提炼为未来尝试的可重用评分标准指导。最终,RubricEM-8B在四个长篇研究基准测试中表现出色,超越了可比的开放模型,并接近专有的深度研究系统。除了最终性能外,我们还进行了深入分析,以理解RubricEM的关键组成部分。
cs.CL / 181 / 2605.10912
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench:现实世界长时间跨度代理评估基准
Abstract
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
Chinese Translation
大型语言模型和视觉-语言模型日益驱动代理通过命令行接口(CLI)工具为用户执行任务。然而,大多数代理基准仍依赖于合成沙箱、短时间任务、模拟服务API和最终答案检查,这使得尚不清楚代理是否能够在其部署的运行环境中完成现实的长时间跨度工作。本研究提出了WildClawBench,这是一个本地运行时基准,包含60个由人类撰写的双语多模态任务,涵盖六个主题类别。每个任务的平均执行时间约为8分钟,涉及超过20次工具调用,并在一个可重现的Docker容器内运行,容器中托管着一个实际的CLI代理工具(如OpenClaw、Claude Code、Codex或Hermes Agent),并可访问真实工具而非模拟服务。评分采用混合方式,结合了确定性规则检查、环境状态对副作用的审计以及用于语义验证的LLM/VLM评审。在19个前沿模型中,表现最佳的Claude Opus 4.7在OpenClaw下的整体得分仅为62.2%,而其他模型均低于60%,仅切换工具就能使单个模型得分变化高达18分。这些结果表明,长时间跨度的本地运行时代理评估仍然是当前前沿模型面临的一个尚未解决的任务。我们发布了任务、代码和容器化工具,以支持可重现的评估。
cs.CL / 182 / 2605.10938
ELF: Embedded Language Flows
嵌入式语言流(ELF)
Abstract
Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
Chinese Translation
扩散和基于流的模型已成为生成连续数据的事实标准方法,例如在图像和视频等领域。它们的成功引起了越来越多的关注,试图将其应用于语言建模。与图像领域的对应模型不同,当前领先的扩散语言模型(DLMs)主要在离散标记上操作。本文展示了通过对离散领域进行最小适应,连续DLMs可以变得有效。我们提出了嵌入式语言流(Embedded Language Flows,ELF),这是一类基于连续时间流匹配的连续嵌入空间中的扩散模型。与现有的DLMs不同,ELF主要保持在连续嵌入空间中,直到最后一个时间步,在此时通过共享权重网络映射到离散标记。该公式使得从图像领域扩散模型中适应已建立的技术变得简单,例如无分类器引导(classifier-free guidance,CFG)。实验表明,ELF显著优于领先的离散和连续DLMs,以更少的采样步骤实现更好的生成质量。这些结果表明,ELF为有效的连续DLMs提供了一条有前景的路径。