Daily Research Digest

arXiv Papers

2026-05-14 · 308 Papers · 4 Categories · 308 Translated
机器人学 (Robotics): 55 papers
cs.RO / 1 / 2605.12622

Action Emergence from Streaming Intent

基于流式意图的动作生成
Jing, Pengfei, Huang, Victor Shea-Jay, Lu, Hengtong, Dai, Jifeng, Xie, Yan, Zhu, Benjin
Abstract
We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
Chinese Translation
我们将动作生成形式化为端到端自主驾驶的目标能力:在任意的长尾交通场景中,通过场景条件推理而非检索或插值学习到的场景-动作映射,生成在物理上可行、语义上适当且符合安全标准的动作。我们表明,之前的范式无法实现动作生成:自回归轨迹解码器将固有的多模态未来压缩为单一的平均输出,而扩散和流匹配生成器虽然能够表达多模态性,但无法通过推理意图进行引导。我们提出流式意图(Streaming Intent)作为接近动作生成的具体方法:一种机制,使驾驶意图 (i) 通过一个连续的思维链语义上流动,该思维链从场景理解因果推导出意图,以及 (ii) 在片段之间时间上流动,以确保意图承诺在驾驶视野中保持一致。我们在一个称为SI(Streaming Intent)的VLA模型中实现流式意图。SI自回归地解码一个四步思维链并发出一个意图标记;解码的意图随后驱动无分类器引导(CFG)在流匹配动作头上,仅需两个去噪步骤即可生成最终轨迹。在Waymo端到端基准测试中,SI实现了具有竞争力的综合性能,在验证集上的RFS得分为7.96,在测试集上的得分为7.74。除了综合指标外,该模型展示了——据我们所知,这是在一个完全端到端的VLA中首次实现——意图忠实的可控性:对于固定场景,在推理时改变意图类别会产生定性不同但始终高质量的计划,这些计划完全源于数据驱动学习,而无需任何预构建的轨迹库或手动编码的后置选择器。
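The intent-driven guidance step described in this abstract can be illustrated with a small sketch. It assumes a straight-line (rectified) flow and uses a hypothetical `toy_velocity` function in place of SI's learned action head; `intent_target`, `mean_target`, and the guidance scale are illustrative names and values, not taken from the paper.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: move from the unconditional velocity
    toward (or past) the intent-conditioned one."""
    return v_uncond + scale * (v_cond - v_uncond)

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned flow-matching head.
    For a straight-line flow toward `target`, the velocity at time t
    is (target - x) / (1 - t)."""
    return (target - x) / (1.0 - t)

def sample_trajectory(x0, intent_target, mean_target, scale=1.5, steps=2):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v_c = toy_velocity(x, t, intent_target)  # conditioned on the intent
        v_u = toy_velocity(x, t, mean_target)    # intent dropped (null condition)
        x = x + dt * guided_velocity(v_c, v_u, scale)
    return x

# Four 2-D waypoints, starting from Gaussian noise.
x0 = np.random.default_rng(0).normal(size=(4, 2))
plan = sample_trajectory(x0, np.ones((4, 2)), np.zeros((4, 2)), scale=1.0)
```

For this toy field, a guidance scale of 1 lands on the intent-conditioned target after two Euler steps, while a scale above 1 extrapolates past it, away from the unconditional mean. A learned head would not be this well-behaved; the sketch only shows where the intent enters the sampler.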
cs.RO / 2 / 2605.12624

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

MindVLA-U1:统一流架构下的VLA超越VA在自动驾驶中的应用
Huang, Yuzhou, Zhu, Benjin, Lu, Hengtong, Huang, Victor Shea-Jay, Zhang, Haiming, Chen, Wei, Dai, Jifeng, Xie, Yan, Li, Hongsheng
Abstract
Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose into coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A streaming design processes the driving video framewise rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture admits fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones via flexible self-attention context management, and exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO's 18 FPS) while preserving natural-language interfaces.
Chinese Translation
自动驾驶已从模块化管道发展到端到端的统一,而视觉-语言-行动(VLA)模型是这一旅程超越视觉-行动(VA)的自然延伸。在实践中,驾驶VLA在规划质量上往往落后于VA,这表明困难不仅仅在于模型规模,而在于语义推理、时间上下文和连续控制结合的接口。我们认为,这一差距反映了VLA的构建方式——作为孤立的子任务改进,未能组合成连贯的驾驶能力,而不是VLA本身的性质。我们提出了MindVLA-U1,这是首个用于自动驾驶的统一流VLA架构。统一的视觉-语言-模型(VLM)主干在一个共享表示上通过单次前向传递生成自回归语言标记和流匹配的连续动作轨迹,保留了每种模态的自然输出形式。流式设计逐帧处理驾驶视频,而不是作为固定的视频-动作块,同时学习的记忆通道在帧之间传递时间上下文,使得规划的轨迹平滑演变,而无需冗余的多帧VLM建模。统一架构通过灵活的自注意力上下文管理,允许在稠密/稀疏的混合变换器(Mixture-of-Transformers,MoT)主干上快速/慢速执行,并暴露出可测量的语言到行动的路径:语言预测的驾驶意图通过无分类器引导(classifier-free guidance,CFG)引导动作扩散,将语言侧的意图转化为连续轨迹生成的控制信号。在长尾WOD-E2E基准测试中,MindVLA-U1首次超越了经验丰富的人类驾驶员(8.20 RFS对比8.13 GT RFS),在2个扩散步骤下实现了显著优于以往VA/VLA方法的规划ADE,并在保持自然语言接口的同时匹配了VA级的吞吐量(16 FPS对比RAP-DINO的18 FPS)。
cs.RO / 3 / 2605.12625

Driving Intents Amplify Planning-Oriented Reinforcement Learning

驾驶意图增强规划导向的强化学习
Lu, Hengtong, Huang, Victor Shea-Jay, Yang, Chengmin, Jing, Pengfei, Dai, Jifeng, Xie, Yan, Zhu, Benjin
Abstract
Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but also how to expand and preserve the sampling distribution being optimized.
Chinese Translation
基于单一示范轨迹训练的连续动作策略面临模式崩溃的问题:样本聚集在示范的操作周围,策略无法表示语义上不同的替代方案。在基于偏好的评估下,这限制了最佳选择的性能——即使是oracle选择也无法恢复采样分布中不存在的内容。我们提出了DIAL,一种两阶段的驾驶意图增强强化学习框架,用于偏好对齐的连续动作驾驶策略。在第一阶段,DIAL通过无分类器引导(CFG)将流匹配动作头条件化于离散意图标签,从而扩展沿不同操作模式的采样分布,并打破单一示范模式崩溃。在第二阶段,DIAL将这一扩展的分布引入偏好强化学习,通过多意图的组相对策略优化(GRPO),涵盖每个偏好组内的所有意图类别,并防止微调重新聚焦于当前偏好的模式。该方法在端到端驾驶中实例化,使用八个规则派生的意图,并在WOD-E2E上进行评估:竞争性的视觉到动作(VA)和视觉-语言-动作(VLA)监督微调(SFT)基线在最佳的128中达不到人类驾驶示范的水平,最强的先前方法(RAP)在最佳的64中也仅达到评估者反馈分数(RFS)8.5;而意图-CFG采样将这一上限提升至最佳的128中的RFS 9.14,首次超越了先前最佳(RAP 8.5)和人类驾驶示范(8.13);同时,多意图GRPO将保留的RFS从7.681提高到8.211,而每个单一意图基线的峰值更低,并在训练结束时下降。这些结果表明,基于示范训练的连续动作策略在偏好强化学习中的瓶颈不仅在于如何更新策略,还在于如何扩展和保持正在优化的采样分布。
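One way to picture the second stage is a preference group that always contains rollouts from every intent class, scored with the usual group-relative (GRPO-style) advantage. The sketch below is an illustration under assumptions: `sample_fn` is a hypothetical sampler that returns a scalar reward for a scene and intent, and `samples_per_intent` is an illustrative parameter; neither comes from the paper.

```python
import numpy as np

NUM_INTENTS = 8  # the abstract mentions eight rule-derived intents

def build_multi_intent_group(sample_fn, scene, samples_per_intent=2):
    """Collect rollouts so that every intent class appears in the group.
    `sample_fn(scene, intent)` is a hypothetical callable returning a reward."""
    group = []
    for intent in range(NUM_INTENTS):
        for _ in range(samples_per_intent):
            group.append((intent, sample_fn(scene, intent)))
    return group

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy reward that simply grows with the intent index.
group = build_multi_intent_group(lambda scene, intent: float(intent), scene=None)
advantages = group_relative_advantages([reward for _, reward in group])
```

Because the baseline is the group mean, a group drawn from a single intent would only compare near-identical maneuvers; spanning all intents keeps the alternatives in the comparison, which matches the motivation given in the abstract.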
cs.RO / 4 / 2605.12628

Multistep Belief Space Dynamics Learning For Risk-Aware Control

风险感知控制的多步信念空间动态学习
Gibson, Jason, Vlahov, Bogdan, Spieler, Patrick, Theodorou, Evangelos A.
Abstract
As autonomous vehicles move from a simplified research setting to practical use, there exists a large gap between the dynamic behavior of a human driver and that of an autonomous system. Risk-aware behavior needs to naturally develop in order to scale to the demands of the real world. A major issue for risk-aware planning and control has been predicting how dynamical uncertainty evolves through time and optimizing plans that account for this without being overly conservative. Here, we present a learning framework to predict distributional dynamics that can be optimized in real time for Model Predictive Control (MPC). We explore the importance of structure when learning distributional dynamics for use in MPC. A rigorous ablation study is conducted on a large dataset of real world off-road driving that shows the impact of deviations from our proposed structure. Furthermore, we deploy our learned model and planning stack on a full sized vehicle in challenging off-road conditions. Our planning architecture is able to naturally regulate the speed of the vehicle based on the environment and consistently demonstrates intelligent behavior over miles of diverse terrain.
Chinese Translation
随着自主车辆从简化的研究环境转向实际应用,人类驾驶与自主系统之间的动态行为存在较大差距。为了满足现实世界的需求,风险感知行为需要自然发展。风险感知规划和控制的一个主要问题是预测动态不确定性如何随时间演变,并优化考虑这一点的计划,而不至于过于保守。在此,我们提出了一种学习框架,以预测可以实时优化的分布动态,用于模型预测控制(Model Predictive Control, MPC)。我们探讨了在为MPC学习分布动态时结构的重要性。我们在一个大型真实世界越野驾驶数据集上进行了严格的消融研究,展示了偏离我们提出的结构的影响。此外,我们在具有挑战性的越野条件下,将学习到的模型和规划堆栈部署在一辆全尺寸车辆上。我们的规划架构能够根据环境自然调节车辆速度,并在多样化的地形上持续展示智能行为。
cs.RO / 5 / 2605.12654

COSMIC: Concurrent Optimization of Structure, Material, and Integrated Control for robotic systems

COSMIC:机器人系统的结构、材料和集成控制的并行优化
Guo, Qinsong, Wang, Liwei
Abstract
Replicating and surpassing the autonomy of natural organisms remains a long-standing goal in robotics. Yet most robotic systems have their structure, materials, and control designed separately, in sharp contrast to the co-evolution in nature. This separation often leads to suboptimal designs, and we still have a limited understanding of the individual and collective contributions of these design entities. In this work, we propose a gradient-based co-design framework that simultaneously optimizes the topology, material distribution, and control policy of a truss-lattice robot. The framework embeds mixed-type topological and material variables into a continuous design space and integrates a neural network controller within a differentiable simulator, capturing their interactions and enabling efficient gradient calculation via automatic differentiation. Furthermore, we develop a constrained optimization to navigate the highly non-convex design landscape and jointly optimize all design entities. Case studies demonstrate that the proposed framework consistently discovers diverse locomotion strategies that outperform baselines obtained through separated design. The framework is also flexible enough to accommodate different functional requirements and boundary conditions. Using this framework, we further extract design insights that reveal the individual and collective effects of different entities on robotic performance. The proposed framework provides a computational foundation for the autonomous co-design of robotic systems, capable of reconfiguration, locomotion, and other complex autonomous behaviors.
Chinese Translation
复制和超越自然生物的自主性仍然是机器人学中的一个长期目标。然而,大多数机器人系统的结构、材料和控制是分开设计的,这与自然界中的共同进化形成鲜明对比。这种分离往往导致次优设计,我们对这些设计实体的个体和集体贡献仍然了解有限。在本研究中,我们提出了一种基于梯度的共同设计框架,同时优化桁架-格子机器人的拓扑、材料分布和控制策略。该框架将混合类型的拓扑和材料变量嵌入到连续设计空间中,并在可微分模拟器中集成了神经网络控制器,捕捉它们之间的相互作用,并通过自动微分实现高效的梯度计算。此外,我们开发了一种约束优化方法,以在高度非凸的设计空间中导航,并共同优化所有设计实体。案例研究表明,所提出的框架始终发现多样的运动策略,超越了通过分开设计获得的基线。该框架也具有灵活性,可以适应不同的功能需求和边界条件。利用该框架,我们进一步提取设计洞察,揭示不同实体对机器人性能的个体和集体影响。所提出的框架为机器人系统的自主共同设计提供了计算基础,能够进行重构、运动和其他复杂的自主行为。
cs.RO / 6 / 2605.12689

3D RL-DWA: A Hybrid Reinforcement Learning and Dynamic Window Approach for Goal-Directed Local Navigation in Multi-DoF Robots

3D RL-DWA:一种混合强化学习与动态窗口方法的多自由度机器人目标导向局部导航
Castellani, Chiara, Turco, Enrico, Prattichizzo, Domenico
Abstract
In this paper, we present a novel hybrid approach that combines Reinforcement Learning (RL) with the Dynamic Window Approach (DWA) for adaptive 3D local navigation of high-degree-of-freedom robotic systems. Our method leverages sparse point cloud data to dynamically adjust both the motion and the shape of a deformable microrobot, enabling the system to navigate toward a goal in complex, constrained environments while maximizing the occupied volume. We evaluate our framework in a simulated vascular network. Experimental results, based on 1080 trials, indicate that integrating RL with a DWA-based local planner significantly enhances both deformation and navigation capabilities compared to pure RL and model-based methods. In particular, the proposed autonomous controller consistently achieves high deformation and near-perfect path completion during training and maintains robust performance in unseen scenarios. These findings highlight the potential of hybrid planning strategies for efficient and adaptive 3D navigation under sparse sensory conditions.
Chinese Translation
本文提出了一种新颖的混合方法,将强化学习(Reinforcement Learning, RL)与动态窗口方法(Dynamic Window Approach, DWA)结合,用于高自由度机器人系统的自适应3D局部导航。我们的方法利用稀疏点云数据动态调整可变形微型机器人的运动和形状,使系统能够在复杂的受限环境中朝向目标导航,同时最大化占用体积。我们在模拟血管网络中评估了我们的框架。基于1080次试验的实验结果表明,与纯RL和基于模型的方法相比,将RL与基于DWA的局部规划器结合显著增强了变形和导航能力。特别是,所提出的自主控制器在训练过程中始终实现高变形率和接近完美的路径完成,并在未见场景中保持稳健的性能。这些发现突显了混合规划策略在稀疏传感条件下实现高效自适应3D导航的潜力。
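For readers unfamiliar with the Dynamic Window Approach, the sketch below shows the classic planar version only: sample velocity commands reachable within one control period, roll each one forward, discard those that come too close to an obstacle, and score the rest. It is not the paper's 3D, shape-adapting variant, and every limit, weight, and the function name `dwa_step` are assumptions chosen for illustration.

```python
import math

def dwa_step(state, goal, obstacles, dt=0.1, horizon=1.0, v_max=1.0,
             w_max=1.5, a_max=1.0, alpha_max=3.0, radius=0.2, n=7):
    """state = (x, y, heading, v, w). Returns the best (v, w) command,
    or (0.0, 0.0) (stop) if every sampled command is inadmissible."""
    x, y, th, v0, w0 = state
    # Dynamic window: velocities reachable within one control period.
    v_lo, v_hi = max(0.0, v0 - a_max * dt), min(v_max, v0 + a_max * dt)
    w_lo, w_hi = max(-w_max, w0 - alpha_max * dt), min(w_max, w0 + alpha_max * dt)
    best, best_score = (0.0, 0.0), -math.inf
    for i in range(n):
        v = v_lo + (v_hi - v_lo) * i / (n - 1)
        for j in range(n):
            w = w_lo + (w_hi - w_lo) * j / (n - 1)
            # Forward-simulate the constant (v, w) command over the horizon.
            px, py, pth, clearance = x, y, th, math.inf
            for _ in range(round(horizon / dt)):
                pth += w * dt
                px += v * math.cos(pth) * dt
                py += v * math.sin(pth) * dt
                for ox, oy in obstacles:
                    clearance = min(clearance, math.hypot(px - ox, py - oy))
            if clearance <= radius:
                continue  # rollout touches an obstacle: not admissible
            # Illustrative objective: goal progress, clearance, and speed.
            score = (-math.hypot(goal[0] - px, goal[1] - py)
                     + 0.2 * min(clearance, 1.0) + 0.1 * v)
            if score > best_score:
                best, best_score = (v, w), score
    return best

free_cmd = dwa_step((0.0, 0.0, 0.0, 0.5, 0.0), goal=(5.0, 0.0), obstacles=[])
blocked_cmd = dwa_step((0.0, 0.0, 0.0, 0.5, 0.0), goal=(5.0, 0.0),
                       obstacles=[(0.4, 0.0)])
```

In a hybrid scheme like the one the abstract describes, a learned policy could supply the scoring weights or candidate commands while the window and admissibility check stay model-based; how the paper actually divides that work is not stated in the abstract.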
cs.RO / 7 / 2605.12710

Belief-Space Residual Risk for Automated Driving under Localization Uncertainty

在定位不确定性下的自动驾驶信念空间残余风险
Karunainayagam, Nijinshan, Gehrke, Nils, Diermeyer, Frank
Abstract
Residual risk metrics have recently been introduced to assess the safety implications of automated driving systems. Existing approaches typically assume a deterministic ego pose and concentrate mainly on perception errors related to surrounding objects and latency effects. In practice, however, automated vehicles operate under considerable localization uncertainty, especially in complex urban settings and in adverse weather conditions. This work extends the spatial residual risk formulation to the belief space by explicitly modeling ego pose uncertainty as a Gaussian distribution. Residual risk is reformulated as the expected degradation-induced risk over the ego pose belief distribution. Within a particle-based risk estimation framework, localization uncertainty is incorporated into the computation of collision probabilities through covariance fusion of ego and object uncertainties.
Chinese Translation
最近引入了残余风险指标,以评估自动驾驶系统的安全性影响。现有方法通常假设自我位置是确定的,并主要集中在与周围物体相关的感知误差和延迟效应上。然而,在实际操作中,自动驾驶车辆在复杂的城市环境和恶劣天气条件下,面临相当大的定位不确定性。本研究通过将自我位置的不确定性明确建模为高斯分布,将空间残余风险的公式扩展到信念空间。残余风险被重新表述为自我位置信念分布下的期望退化引起的风险。在基于粒子的风险估计框架内,通过自我和物体不确定性的协方差融合,将定位不确定性纳入碰撞概率的计算中。
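The covariance-fusion step can be sketched as follows. Assuming the ego and object position estimates are independent Gaussians, their relative position is Gaussian with the two covariances added, and a collision probability can be estimated with particles. The disc-shaped collision region, the particle count, and the function name are assumptions for illustration; the paper's full residual-risk expectation is not reproduced here.

```python
import numpy as np

def collision_probability(ego_mean, ego_cov, obj_mean, obj_cov,
                          collision_radius, n_particles=20000, seed=0):
    """Particle estimate of P(|p_obj - p_ego| < collision_radius)."""
    rng = np.random.default_rng(seed)
    rel_mean = np.asarray(obj_mean, float) - np.asarray(ego_mean, float)
    # Covariance fusion: for independent Gaussians the covariances add.
    rel_cov = np.asarray(ego_cov, float) + np.asarray(obj_cov, float)
    particles = rng.multivariate_normal(rel_mean, rel_cov, size=n_particles)
    hits = np.linalg.norm(particles, axis=1) < collision_radius
    return float(hits.mean())

small = 0.01 * np.eye(2)
p_precise = collision_probability([0, 0], small, [3, 0], small, 1.0)
p_uncertain = collision_probability([0, 0], 4.0 * np.eye(2), [3, 0], small, 1.0)
```

With a well-localized ego vehicle the object three metres away contributes essentially no collision probability, while inflating the ego covariance makes the same geometry measurably risky. That difference is what a deterministic-pose formulation would miss.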
cs.RO / 8 / 2605.12719

A Five-Layer MLOps Architecture for Connected Automated Driving

面向连接自动驾驶的五层 MLOps 架构
Lampe, Bastian, Eckstein, Lutz
Abstract
The continual assurance of safety and performance of automated driving systems (ADSs) poses significant challenges. ADSs operate in complex, dynamic, open-world environments allowing a wide range of scenarios, including ones that are rare or not foreseen during initial development. While the incorporation of artificial intelligence (AI) and machine learning (ML) technology allows ADSs to learn from data gathered during operation and thus enables them to adapt over time, these approaches come with their own challenges. A key advantage of ADSs compared to human drivers is their greater ability to gather data collectively across a fleet of vehicles, or even across multiple fleets operated by different entities, and to learn from this data collectively. Vehicles can share and combine their data to identify additional learning opportunities otherwise missed by individual vehicles. This creates new opportunities to tackle the challenges of continual assurance of safety and performance, but requires the implementation of architectures that leverage the collective learning potential. Based on established MLOps principles and existing work in the field of connected automated driving, this paper presents a five-layer architecture for collective learning-enabled MLOps processes for ADSs. The goal of this architecture is to provide a conceptual blueprint for the design and implementation of MLOps processes by fleet operators and other relevant stakeholders. The paper describes the main responsibilities of each layer, their interactions, and how multi-level self-assessments enabled by the architecture can support the detection and reduction of edge cases including black swan events.
Chinese Translation
自动驾驶系统(ADS)的安全性和性能的持续保障面临重大挑战。ADS 在复杂、动态的开放世界环境中运行,允许广泛的场景,包括一些在初始开发阶段稀有或未预见的场景。尽管人工智能(AI)和机器学习(ML)技术的引入使得 ADS 能够从操作过程中收集的数据中学习,从而使其能够随着时间的推移进行适应,但这些方法也带来了自身的挑战。与人类驾驶员相比,ADS 的一个关键优势在于其能够在整个车队,甚至跨多个由不同实体运营的车队中集体收集数据,并从这些数据中共同学习。车辆可以共享和结合它们的数据,以识别单个车辆可能错过的额外学习机会。这为解决持续保障安全性和性能的挑战创造了新的机会,但需要实施能够利用集体学习潜力的架构。基于已建立的 MLOps 原则和连接自动驾驶领域的现有研究,本文提出了一种面向 ADS 的集体学习支持的 MLOps 过程的五层架构。该架构的目标是为车队运营商和其他相关利益相关者提供 MLOps 过程设计和实施的概念蓝图。本文描述了每一层的主要职责、它们之间的相互作用,以及架构所支持的多层次自我评估如何有助于检测和减少边缘案例,包括黑天鹅事件。
cs.RO / 9 / 2605.12735

The Unified Autonomy Stack: Toward a Blueprint for Generalizable Robot Autonomy

统一自主栈:通向可推广机器人自主性的蓝图
Dharmadhikari, Mihir, Khedekar, Nikhil, Kulkarni, Mihir, Nissov, Morten, Jacquet, Martin, Zacharia, Angelos, Harms, Marvin, Puigjaner, Albert Gassol, Weiss, Philipp, Alexis, Kostas
Abstract
We introduce and open-source the Unified Autonomy Stack, a system-level solution that enables resilient autonomy across diverse aerial and ground robot morphologies. The architecture centers on three synergistic modules -- multi-modal perception, multi-behavior planning, and multi-layered safe navigation -- that together deliver comprehensive mission autonomy. The stack fuses data from LiDAR, radar, vision, and inertial sensing, enabling (a) robust localization and mapping through factor graph-based fusion, (b) semantic scene understanding, (c) motion and informative path planning through sampling-based techniques adaptive across spatial scales, as well as (d) multi-layered safe navigation both through planning on the online reconstructed map and deep learning-driven exteroceptive policies alongside last-resort safety filters using control barrier functions. The resulting behaviors include safe GNSS-denied navigation into unknown and perceptually-degraded regions, exploration of complex environments, object discovery, and efficient inspection planning. The stack has been field-tested and validated on both aerial (rotorcraft) and ground (legged) robots operating in a host of demanding environments, including self-similar and smoke-filled settings, with complex geometries and high obstacle clutter. These tests demonstrate resilient performance in challenging conditions. To facilitate ease of adoption, we open-source the implementation alongside supporting documentation, validation, and evaluation datasets at https://github.com/ntnu-arl/unified_autonomy_stack. A video giving an overview of the paper and the field experiments is available at https://youtu.be/l8Su8OXsM-E.
Chinese Translation
我们介绍并开源了统一自主栈(Unified Autonomy Stack),这是一种系统级解决方案,能够在多样的空中和地面机器人形态中实现弹性自主。该架构围绕三个协同模块构建——多模态感知、多行为规划和多层安全导航——共同提供全面的任务自主性。该栈融合了来自激光雷达(LiDAR)、雷达、视觉和惯性传感器的数据,实现了(a)通过基于因子图的融合进行稳健的定位和地图构建,(b)语义场景理解,(c)通过基于采样的技术进行运动和信息路径规划,适应不同空间尺度,以及(d)通过在在线重建地图上进行规划和深度学习驱动的外部感知策略进行多层安全导航,同时使用控制屏障函数实现最后的安全过滤。最终的行为包括在未知和感知降级区域进行安全的GNSS(全球导航卫星系统)拒绝导航、复杂环境的探索、物体发现和高效的检查规划。该栈已在多种苛刻环境中进行现场测试和验证,包括自相似和烟雾充满的场景,具有复杂几何形状和高障碍物杂乱。这些测试展示了在挑战性条件下的稳健性能。为了方便采用,我们开源了实现代码及其支持文档、验证和评估数据集,链接为 https://github.com/ntnu-arl/unified_autonomy_stack。关于论文和现场实验的概述视频可在 https://youtu.be/l8Su8OXsM-E 获取。
cs.RO / 10 / 2605.12771

Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization

用于多目标策略优化的自适应平滑切比雪夫注意力
Murillo-Gonzalez, Alejandro, Ali, Mahmoud, Liu, Lantao
Abstract
Multi-objective reinforcement learning in robotic domains requires balancing complex, non-convex trade-offs between conflicting objectives. While linear scalarization methods provide stability, they are theoretically incapable of recovering solutions within non-convex regions of the Pareto front. Conversely, static non-linear scalarizations (e.g., Tchebycheff) can theoretically access these regions but often suffer from severe gradient variance and optimization instability in deep RL. In this work, we propose an Adaptive Smooth Tchebycheff framework that resolves this tension by dynamically modulating the curvature of the optimization landscape. We introduce a novel conflict-driven controller that regulates the optimization smoothness based on real-time gradient interference. This allows the agent to anneal toward precise, non-convex scalarization when objectives align, while elastically reverting to stable, smooth approximations when destructive gradient conflicts emerge. We validate our approach on a challenging robotic stealth visual search task -- a proxy for monitoring of protected/fragile ecosystems -- where an agent must balance search, exposure/interference minimization and exploration speed. Extensive ablations confirm that our conflict-aware adaptation enables the robust discovery of Pareto-optimal policies in non-convex regions inaccessible to linear baselines and unstable for static non-linear methods. Website: https://alejandromllo.github.io/research/pasta/
Chinese Translation
在机器人领域的多目标强化学习中,需要平衡复杂的、非凸的目标之间的冲突权衡。虽然线性标量化方法提供了稳定性,但从理论上讲,它们无法恢复在帕累托前沿的非凸区域内的解。相反,静态非线性标量化方法(例如切比雪夫)理论上可以进入这些区域,但在深度强化学习中往往会遭遇严重的梯度方差和优化不稳定性。在本研究中,我们提出了一种自适应平滑切比雪夫框架,通过动态调节优化景观的曲率来解决这一矛盾。我们引入了一种新颖的冲突驱动控制器,根据实时梯度干扰调节优化的平滑性。这使得智能体在目标一致时能够逐渐收敛到精确的非凸标量化,同时在出现破坏性梯度冲突时弹性地恢复到稳定的平滑近似。我们在一个具有挑战性的机器人隐身视觉搜索任务上验证了我们的方法——这是监测受保护/脆弱生态系统的代理任务——其中智能体必须平衡搜索、暴露/干扰最小化和探索速度。大量消融实验确认,我们的冲突感知适应性使得在对线性基线不可及且对静态非线性方法不稳定的非凸区域中,能够稳健地发现帕累托最优策略。网站:https://alejandromllo.github.io/research/pasta/
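The trade-off between the exact and smoothed scalarizations can be made concrete with the standard log-sum-exp smoothing of the Tchebycheff objective. In the sketch below, `smooth_tchebycheff` is that standard form; `adapt_mu` is a hypothetical rule that maps gradient conflict (negative cosine similarity) to the smoothing parameter, written only to illustrate the idea of conflict-driven smoothness and not taken from the paper.

```python
import numpy as np

def tchebycheff(losses, weights, ideal):
    """Exact (non-smooth) Tchebycheff scalarization: worst weighted gap."""
    return float(np.max(weights * (losses - ideal)))

def smooth_tchebycheff(losses, weights, ideal, mu):
    """Log-sum-exp smoothing; approaches the exact max as mu -> 0."""
    z = weights * (losses - ideal) / mu
    z_max = np.max(z)  # subtract the max for numerical stability
    return float(mu * (z_max + np.log(np.sum(np.exp(z - z_max)))))

def adapt_mu(grad_a, grad_b, mu_min=0.01, mu_max=1.0):
    """Hypothetical controller: more gradient conflict -> more smoothing."""
    cos = np.dot(grad_a, grad_b) / (
        np.linalg.norm(grad_a) * np.linalg.norm(grad_b) + 1e-12)
    conflict = max(0.0, -cos)
    return mu_min + (mu_max - mu_min) * conflict

losses = np.array([1.0, 3.0])
weights = np.array([0.5, 0.5])
ideal = np.zeros(2)
exact = tchebycheff(losses, weights, ideal)
sharp = smooth_tchebycheff(losses, weights, ideal, mu=0.01)
soft = smooth_tchebycheff(losses, weights, ideal, mu=1.0)
```

The smoothed value always lies between the exact maximum and the maximum plus `mu * log(m)` for `m` objectives, so a small `mu` recovers the sharp objective and a large `mu` gives a smoother surrogate whose gradient blends all objectives.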
cs.RO / 11 / 2605.12786

Emotional Expression in Low-Degrees-of-Freedom Robots: Assessing Perception with Reachy Mini

低自由度机器人中的情感表达:使用Reachy Mini评估感知
Rogel, Amit, Yadollahi, Elmira, Laban, Guy
Abstract
Emotion expression is central to human--robot interaction, yet little is known about how people interpret affect on robots with sparse, non-anthropomorphic expressive capabilities. This study examined how people perceive emotional expressions displayed by Reachy Mini (Pollen Robotics and Hugging Face), a low-degree-of-freedom (low-DoF) robot with a constrained and distinctly non-human expressive repertoire. In an online within-subjects study, 100 participants viewed 10 short video clips of Reachy Mini expressing different emotions and, for each clip, identified the perceived emotion, rated its valence and arousal, and evaluated the robot on social-perception traits. Exact emotion recognition was modest overall and varied considerably across expressions, with anger, sadness, and interest recognized more reliably than emotions such as love, pleasure, shame, and disgust. However, participants were generally more successful at recovering broader affective meaning than exact emotion labels, particularly along valence and arousal dimensions. Emotional expressions also shaped social evaluation, as positive expressions were perceived as warmer and more sociable than negative ones, and animacy varied less across conditions. These findings suggest that even constrained robotic expressions can communicate affective meaning and influence social impressions, positioning Reachy Mini as a useful benchmark for studying affective communication in low-DoF robots.
Chinese Translation
情感表达是人机交互的核心,但关于人们如何解读具有稀疏且非人类表现能力的机器人情感的信息仍然很少。本研究考察了人们如何感知Reachy Mini(Pollen Robotics和Hugging Face)这一低自由度(low-DoF)机器人所表达的情感,该机器人具有受限且明显非人类的表现能力。在一项在线的被试内研究中,100名参与者观看了10段Reachy Mini表达不同情感的短视频,并在每段视频中识别感知到的情感,评估其情感价值(valence)和唤醒度(arousal),并对机器人在社会感知特征上的表现进行评价。整体而言,情感识别的准确性适中,并且在不同情感之间存在显著差异,愤怒、悲伤和兴趣的识别率明显高于爱、愉悦、羞愧和厌恶等情感。然而,参与者在恢复更广泛的情感意义方面通常比精确的情感标签更成功,特别是在情感价值和唤醒度维度上。情感表达也影响了社会评价,积极的表达被认为比消极的表达更温暖和更具社交性,而生动性在不同条件下变化较小。这些发现表明,即使是受限的机器人表达也能够传达情感意义并影响社会印象,使Reachy Mini成为研究低自由度机器人情感交流的有用基准。
cs.RO / 12 / 2605.12789

Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention

视觉-语言模型中的终身学习:增强的弹性权重巩固与跨模态知识保留
Durrani, Hamza Ahmed, Durrani, Rafay Suleman
Abstract
Large language-vision models (LVLMs) such as CLIP, Flamingo, and BLIP have revolutionized AI by enabling understanding across textual and visual modalities. These models excel at tasks like image captioning, visual question answering, and cross-modal retrieval. However, they face catastrophic forgetting when learning new tasks sequentially, which is particularly challenging in multi-modal settings where preserving cross-modal alignments adds complexity to the learning process. This paper presents a comprehensive continual learning framework for LVLMs that combines enhanced Elastic Weight Consolidation (EWC) with parameter-efficient fine-tuning techniques. We integrate multi-modal Fisher Information Matrix calculation, consistency preservation across modalities, and adaptive regularization that considers dependencies across visual and textual encoders. In extensive evaluation, the framework achieves a 78% reduction in forgetting rates relative to naive sequential training approaches. The framework also preserves alignment between modalities during sequential learning with only 15% additional computational cost. This work advances the state of the art in lifelong learning for multi-modal AI systems, with direct applications to autonomous driving, intelligent robotic assistants, and adaptive robotic systems that must continuously learn in dynamic real-world environments.
Chinese Translation
大型语言-视觉模型(LVLMs),如CLIP、Flamingo和BLIP,已经通过实现文本和视觉模态之间的理解而彻底改变了人工智能。这些模型在图像描述、视觉问答和跨模态检索等任务中表现出色。然而,当顺序学习新任务时,它们面临灾难性遗忘的问题,尤其是在多模态环境中,保持跨模态对齐使学习过程变得更加复杂。本文提出了一种针对LVLMs的全面持续学习框架,该框架结合了增强的弹性权重巩固(EWC)和参数高效的微调技术。我们整合了多模态费舍尔信息矩阵计算、跨模态的一致性保持以及考虑视觉和文本编码器之间依赖关系的自适应正则化。通过广泛的评估测试,该框架相较于简单的顺序训练方法实现了78%的遗忘率降低。此外,该框架在顺序学习过程中仅增加15%的计算成本即可保持模态之间的对齐。这项工作推动了多模态人工智能系统终身学习的最新进展,直接应用于自动驾驶、智能机器人助手和必须在动态现实环境中持续学习的自适应机器人系统。
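As background for the method named in the title, the sketch below shows the standard EWC regularizer with a diagonal Fisher estimate. It does not reproduce the paper's multi-modal or adaptive extensions, whose details are not given in the abstract; the parameter vectors and the weight `lam` are illustrative.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean squared log-likelihood gradient."""
    g = np.asarray(per_sample_grads, dtype=float)
    return np.mean(g ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

theta_star = np.array([1.0, 2.0])  # weights after the previous task
fisher = np.array([10.0, 0.1])     # first weight mattered far more
shift_important = ewc_penalty(theta_star + [1.0, 0.0], theta_star, fisher, 1.0)
shift_unimportant = ewc_penalty(theta_star + [0.0, 1.0], theta_star, fisher, 1.0)
```

The penalty is added to the new task's loss, so weights with high Fisher values are held close to their old values while low-Fisher weights stay free to adapt.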
cs.RO / 13 / 2605.12790

Few-Shot Physics-Informed Neural Network for Shape Reconstruction of Concentric-Tube Robots

基于少量样本的物理信息神经网络用于同心管机器人形状重建
Feizi, Navid, Pedrosa, Filipe C., Patel, Rajni V., Jayender, Jagadeesan
Abstract
Modeling concentric tube robots (CTRs) involves complex nonlinear continuum mechanics, and despite recent progress, physics-based models often lack an accurate representation of the experimental setups. To overcome these limitations, deep neural network-based models have been explored as alternatives with superior accuracy; however, they often overlook known mechanics, require large training datasets, and typically discard shape estimation of the robot. We present a physics-informed neural network (PINN) for kinematic modeling of a 6-DoF CTR with three pre-curved tubes that embeds the Cosserat rod differential equations and learns from few-shot observational data, balancing physics priors with data-driven fitting. PINN enables full-state estimation of shape, twist angle, torsional strain, bending moment, and orientation. Benchmark tests show a mean shape error below 1% of the robot length and accurately recovered other kinematic states, outperforming a purely physics-based Cosserat rod model baseline while using a minimal training set. The resulting model is also computationally efficient and robust, making it well-suited for real-time control applications.
Chinese Translation
同心管机器人(CTR)的建模涉及复杂的非线性连续介质力学,尽管近期取得了一些进展,但基于物理的模型往往无法准确反映实验设置。为克服这些局限性,研究者们探索了基于深度神经网络的模型作为具有更高准确性的替代方案;然而,这些模型通常忽视已知的力学原理,需要大量的训练数据集,并且通常舍弃机器人的形状估计。我们提出了一种物理信息神经网络(PINN),用于对具有三根预弯管的6自由度CTR进行运动学建模,该网络嵌入了Cosserat杆微分方程,并从少量观察数据中学习,平衡物理先验与数据驱动的拟合。PINN能够实现形状、扭转角、扭转应变、弯矩和方向的全状态估计。基准测试表明,模型的平均形状误差低于机器人长度的1%,并准确恢复了其他运动学状态,超越了纯物理基础的Cosserat杆模型基线,同时使用了最小的训练集。所得到的模型在计算上也高效且稳健,非常适合实时控制应用。
cs.RO / 14 / 2605.12804

BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots

BiPneu:软机器人双极压力气动系统的设计与控制
Mei, Yu, Zhou, Xinyu, Naik, Vedant, Gao, Alan, Tan, Xiaobo
Abstract
Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, achieving high-performance regulation across both pressure polarities remains challenging due to asymmetric inflation-deflation dynamics, valve nonlinearities, and switching-induced flow disturbances. This paper presents BiPneu, a scalable and cost-efficient multi-channel bipolar-pressure pneumatic system for soft robots that enables wide-range, accurate, and responsive pressure regulation while providing seamless compatibility with high-level software ecosystems. A dual-mode sliding-mode controller (DM-SMC) with hysteresis-supervised mode selection is proposed based on a hybrid electro-pneumatic model. Extensive simulation and experiments demonstrate the superior performance of DM-SMC in tracking step and sinusoidal pressure references compared with both advanced model predictive controllers and well-tuned PID controllers. Experimental results show average absolute errors of 1.44 kPa in multi-step tests and 4.23 kPa in sinusoidal tracking, corresponding to reductions of 11.9% and 35.6% relative to PID control, along with improved control effort, valve switching rate, and transient response. Robustness of DM-SMC is further verified on a bellows actuator with pressure-dependent volume. Finally, BiPneu's capability is demonstrated via two soft robotic examples: quick ball maneuvering with a soft parallel manipulator and real-time finite element method (FEM)-based teleoperation of a soft bellows actuator.
Chinese Translation
正负压力调节对软机器人执行器至关重要,能够实现大范围的运动和多样化的驱动模式。然而,由于不对称的充气-放气动态、阀门非线性以及切换引起的流动干扰,实现两种压力极性下的高性能调节仍然具有挑战性。本文提出了BiPneu,一种可扩展且成本效益高的多通道双极压力气动系统,能够为软机器人提供广泛、准确和响应迅速的压力调节,同时与高级软件生态系统无缝兼容。基于混合电气-气动模型,提出了一种具有滞后监督模式选择的双模滑模控制器(DM-SMC)。广泛的仿真和实验表明,与先进的模型预测控制器和调优良好的PID控制器相比,DM-SMC在跟踪阶跃和正弦压力参考方面表现出优越的性能。实验结果显示,在多步测试中平均绝对误差为1.44 kPa,在正弦跟踪中为4.23 kPa,分别比PID控制减少了11.9%和35.6%,同时控制努力、阀门切换率和瞬态响应均有所改善。DM-SMC的鲁棒性在一个压力依赖体积的波纹管执行器上得到了进一步验证。最后,通过两个软机器人示例展示了BiPneu的能力,包括使用软并联操纵器进行快速球体操控和基于有限元方法(FEM)的软波纹管执行器的实时遥操作。
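A toy version of hysteresis-supervised mode selection with a saturated sliding-mode law is sketched below. The first-order plant, the asymmetric rates `k_in` and `k_out`, the dead band, and all numeric values are assumptions made for illustration; they are not the paper's electro-pneumatic model or its DM-SMC design.

```python
def select_mode(mode, error, band):
    """Hysteresis: switch only when the error leaves the dead band,
    which avoids rapid valve chatter around the reference."""
    if error > band:
        return "inflate"
    if error < -band:
        return "deflate"
    return mode

def smc_command(error, gain, boundary):
    """Sliding-mode law with a boundary layer (saturation instead of sign)."""
    return gain * max(-1.0, min(1.0, error / boundary))

def simulate(p_ref, p0=0.0, dt=0.01, steps=500, band=0.5,
             k_in=40.0, k_out=25.0):
    """Toy plant in kPa: pressure rises at k_in and falls at k_out."""
    p, mode = p0, "inflate"
    for _ in range(steps):
        error = p_ref - p
        mode = select_mode(mode, error, band)
        u = smc_command(error, gain=1.0, boundary=2.0)
        if mode == "inflate":
            p += k_in * max(u, 0.0) * dt   # only the positive-pressure path acts
        else:
            p += k_out * min(u, 0.0) * dt  # only the vacuum path acts
    return p

p_pos = simulate(30.0)
p_neg = simulate(-20.0)
```

The two modes use different rates to mimic asymmetric inflation and deflation, and the dead band keeps the selector from toggling between the pressure and vacuum paths when the error is small.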
cs.RO / 15 / 2605.12897

DynoJEPP: Joint Estimation, Prediction and Planning in Dynamic Environments

DynoJEPP:动态环境中的联合估计、预测与规划
Kliniewski, Mikolaj, Morris, Jesse, Wang, Yiduo, Manchester, Ian R., Ila, Viorela
Abstract
DynoJEPP is a factor-graph-based framework that jointly formulates and simultaneously optimizes estimation, prediction, and planning in dynamic environments. In conventional factor-graph-based approaches that jointly formulate estimation, prediction, and planning, information from prediction and planning feeds back into state estimation, yielding corrupted estimates, undesired behaviors, and unsafe plans. To address this, DynoJEPP introduces a novel directed factor that enforces directional information flow within the factor graph, preventing prediction and planning from corrupting state estimation. We evaluate the impact of directed factors on inter-module interactions during navigation in both static and dynamic environments. Our results demonstrate that these factors are critical for safe operation, as without them, the robot collides in the majority of experiments. Building on this, we further introduce Cooperative DynoJEPP, which enables the ego robot to incorporate cooperative object behavior into its prediction and trajectory planning.
Chinese Translation
DynoJEPP是一个基于因子图的框架,它在动态环境中联合构建并同时优化估计、预测和规划。在传统的基于因子图的方法中,联合构建估计、预测和规划时,来自预测和规划的信息反馈到状态估计中,导致估计结果受到干扰、行为不当以及计划不安全。为了解决这个问题,DynoJEPP引入了一种新颖的有向因子,强制因子图内的信息流向,防止预测和规划对状态估计造成干扰。我们评估了有向因子在静态和动态环境中导航时对模块间交互的影响。我们的结果表明,这些因子对安全操作至关重要,因为在没有它们的情况下,机器人在大多数实验中都会发生碰撞。在此基础上,我们进一步引入了合作DynoJEPP,使自我机器人能够将合作对象行为纳入其预测和轨迹规划中。
cs.RO / 16 / 2605.12974

Distributionally Robust Safety Under Arbitrary Uncertainties: A Safety Filtering Approach

在任意不确定性下的分布鲁棒安全性:一种安全过滤方法
Cherenson, Daniel M., Lee, Haejoon, Kim, Taekyung, Panagou, Dimitra
Abstract
In this work, we study how to ensure probabilistic safety for nonlinear systems under distributional ambiguity. Our approach builds on a backup-based safety filtering framework that switches between a high-performance nominal policy and a certified backup policy to ensure safety. To handle arbitrary uncertainties from ambiguous distributions, i.e., where the distribution is not of specific structure and the true distribution is unknown, we adopt a distributionally robust (DR) formulation using Wasserstein ambiguity sets. Rather than solving a high-dimensional DR trajectory optimization problem online, we exploit the structure of backup-based safety filtering to reduce safety certification to a one-dimensional search over the switching time between nominal and backup policies. We then develop a sampling-based certification procedure with finite-sample guarantees, where empirical failure probabilities are compared against a Wasserstein-inflated threshold. We validate our method through simulations across three systems, from a Dubins vehicle to a high-speed racing car and a fighter jet, demonstrating the broad applicability and computational efficiency.
Chinese Translation
在本研究中,我们探讨如何在分布模糊的情况下确保非线性系统的概率安全性。我们的方法基于一种基于备份的安全过滤框架,该框架在高性能的名义策略和经过认证的备份策略之间切换,以确保安全性。为了处理来自模糊分布的任意不确定性,即分布没有特定结构且真实分布未知的情况,我们采用了使用 Wasserstein 模糊集的分布鲁棒(DR)表述。我们并不在线解决高维的 DR 轨迹优化问题,而是利用基于备份的安全过滤的结构,将安全认证简化为在名义策略和备份策略之间切换时间的一维搜索。随后,我们开发了一种基于采样的认证程序,具有有限样本保证,其中经验失败概率与 Wasserstein 扩大阈值进行比较。我们通过对三个系统的仿真验证了我们的方法,从 Dubins 车辆到高速赛车和战斗机,展示了其广泛的适用性和计算效率。
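The reduction of safety certification to a one-dimensional search over the switching time can be sketched on a toy problem. Everything below is an assumption made for illustration: the braking vehicle, the wall position, and especially `margin`, which stands in for the Wasserstein inflation of the threshold without reproducing how the paper computes it.

```python
def stops_before_wall(t_switch, decel, speed=10.0, wall=30.0):
    """Toy rollout: cruise at `speed` until t_switch (nominal policy),
    then brake at constant `decel` (backup policy)."""
    travelled = speed * t_switch + speed ** 2 / (2.0 * decel)
    return travelled < wall


def latest_safe_switch(candidates, decel_samples, eps=0.05, margin=0.02):
    """Return the latest switching time whose empirical failure rate stays
    at or below eps - margin, or None if no candidate qualifies.

    `margin` is a hypothetical placeholder for a Wasserstein-inflated
    threshold; `decel_samples` are sampled uncertain braking decelerations.
    """
    best = None
    for t in sorted(candidates):
        failures = sum(not stops_before_wall(t, d) for d in decel_samples)
        if failures / len(decel_samples) <= eps - margin:
            best = t
        else:
            break  # in this toy, failure rate only grows with t_switch
    return best


if __name__ == "__main__":
    samples = [5.0] * 100  # degenerate sample set, for a readable demo
    print(latest_safe_switch([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], samples))
```

The search is over a single scalar, so its cost grows with the number of candidate times and samples rather than with the trajectory dimension.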
cs.RO / 17 / 2605.13006

Occlusion-Based Object Transportation Around Obstacles With a Swarm of Miniature Robots

基于遮挡的微型机器人群体障碍物运输
Queiroz, Breno Cunha, MacRae, Daniel
Abstract
Swarm robotics utilises decentralised self-organising systems to form complex collective behaviours built from the bottom up by individuals with limited capabilities. Previous work has shown that simple occlusion-based strategies can be effective in using swarm robotics for the task of transporting objects to a goal position. However, this strategy requires a clear line-of-sight between the object and the goal. In this paper, we extend this strategy by allowing robots to form sub-goals, enabling any member of the swarm to establish a wider range of visibility of the goal and ultimately forming a chain of sub-goals between the object and the goal position. We do so while preserving the fully decentralised and communication-free nature of the original strategy and maintaining performance in object-free scenarios. In five sets of simulated experiments, we demonstrate the generalisability of our proposed strategy. Our finite-state machine allows a sufficiently large swarm to transport objects around obstacles that block the goal. The method is robust to varying starting positions and can handle both concave and convex shapes.
Chinese Translation
群体机器人利用去中心化的自组织系统,通过具有有限能力的个体自下而上地形成复杂的集体行为。先前的研究表明,简单的基于遮挡的策略在利用群体机器人进行物体运输到目标位置的任务中是有效的。然而,这一策略要求物体与目标之间有清晰的视线。在本文中,我们通过允许机器人形成子目标来扩展这一策略,使得群体中的任何成员都能够建立更广泛的目标可见性,最终在物体与目标位置之间形成一系列子目标。我们在保持原有策略完全去中心化和无通信特性的同时,确保在无物体场景中的性能。在五组模拟实验中,我们展示了所提策略的普适性。我们的有限状态机允许足够大的群体在障碍物阻挡目标的情况下运输物体。该方法对起始位置的变化具有鲁棒性,并能够处理凹形和凸形的障碍物。
cs.RO / 18 / 2605.13028

Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

基于语义图像的动态不确定性局部一致性校准
Marques, Luís, Berenson, Dmitry
Abstract
We introduce Observation-aware Conformal Uncertainty Local-Calibration (OCULAR), a conformal prediction-based algorithm that uses perception information to provide uncertainty quantification guarantees for unseen test-time environments. While previous conformal approaches lack the ability to discriminate between state-action space regions leading to higher or lower model mismatch, and require environment-specific data, our method uses data collected from visually similar environments to provably calibrate a given linear Gaussian dynamics model of arbitrary fidelity. The prediction regions generated from OCULAR are guaranteed to contain the future system states with, at least, a user-set likelihood, despite both aleatoric and epistemic uncertainty -- i.e., uncertainty arising from both stochastic disturbances and lack of data. Our guarantees are non-asymptotic and distribution-free, not requiring strong assumptions about the unknown real system dynamics. Our calibration procedure enables distinguishing between observation-velocity-action inputs leading to higher and lower next-state-uncertainty, which is helpful for probabilistically-safe planning. We numerically validate our algorithm on a double-integrator system subject to random perturbations and significant model mismatch, using both a simplified sensor and a more realistic simulated camera. Our approach appropriately quantifies uncertainty both when in-distribution and out-of-distribution, being comparatively volume-efficient to baselines requiring environment-specific data.
Chinese Translation
我们提出了一种观察感知一致性不确定性局部校准算法(Observation-aware Conformal Uncertainty Local-Calibration,OCULAR),该算法基于一致性预测,利用感知信息为未见测试环境提供不确定性量化保证。尽管以往的一致性方法缺乏区分导致模型不匹配程度高低的状态-动作空间区域的能力,并且需要特定环境的数据,我们的方法利用从视觉相似环境中收集的数据,能够证明性地校准任意精度的线性高斯动态模型。OCULAR生成的预测区域保证在至少用户设定的可能性下包含未来系统状态,尽管存在随机性和认知不确定性——即,由随机扰动和数据缺乏引起的不确定性。我们的保证是非渐近的且不依赖于分布,不需要对未知真实系统动态做出强假设。我们的校准过程能够区分导致下一状态不确定性高低的观察-速度-动作输入,这对概率安全规划非常有帮助。我们在一个受随机扰动和显著模型不匹配影响的双积分器系统上对我们的算法进行了数值验证,使用了简化传感器和更现实的模拟相机。我们的方法在分布内和分布外都能适当地量化不确定性,相较于需要特定环境数据的基线方法具有更高的体积效率。
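For readers unfamiliar with conformal prediction, the finite-sample guarantee mentioned in the abstract above rests on a simple quantile rule. The sketch below is plain split conformal prediction, not OCULAR's observation-aware local calibration; the function name and the use of scalar prediction-error scores are assumptions for illustration.

```python
import math


def conformal_radius(scores, alpha):
    """Split-conformal radius from calibration scores.

    `scores` are nonconformity scores (for example, prediction errors of a
    dynamics model on held-out calibration data). The returned radius covers
    a new exchangeable score with probability at least 1 - alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


if __name__ == "__main__":
    calibration_errors = list(range(1, 20))  # 19 toy scores
    print(conformal_radius(calibration_errors, alpha=0.1))
```

The guarantee is distribution-free because it depends only on the rank of the new score among the calibration scores, not on their distribution.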
cs.RO / 19 / 2605.13058

MUJICA: Multi-skill Unified Joint Integration of Control Architecture for Wheeled-Legged Robots

MUJICA:轮腿机器人控制架构的多技能统一联合集成
Li, Yuqi, Zhai, Peng, Zhang, Yueqi, Wei, Xiaoyi, Qian, Quancheng, He, Zhengxu, Yu, Qianxiang, Zhang, Lihua
Abstract
Wheeled-legged robots hold promise for traversing complex terrains and offer superior mobility compared to legged robots. However, wheeled-legged robots must effectively balance both wheeled driving and legged control. Furthermore, due to noisy proprioceptive sensing and real-world motor constraints, realizing robust and adaptive locomotion at peak motor performance remains challenging. We propose the Multi-skill Unified Joint Integration of Control Architecture (MUJICA), a unified, fully proprioceptive control framework for wheeled-legged robots that integrates diverse low-level skills (including omnidirectional movement, high-platform climbing, and fall recovery) within a single policy. All skills, distinguished by unique indicator variables, are trained jointly with accurate DC-motor constraint modeling. Additionally, a high-level skill selector is learned to dynamically choose the optimal skill based solely on proprioception, enabling adaptive responses to the surrounding environment. As a result, MUJICA enhances sim-to-real robustness and enables seamless transitions across diverse locomotion modes, facilitating autonomous adjustment to the environment. We validate our framework in both simulation and real-world experiments on the Unitree Go2-W robot, demonstrating significant improvements in adaptability and task success in unstructured environments.
Chinese Translation
轮腿机器人在复杂地形的穿越中展现出良好的前景,并且相比于腿式机器人具有更优越的机动性。然而,轮腿机器人必须有效地平衡轮式驱动与腿式控制。此外,由于本体感觉的噪声和现实世界中的电机约束,实现电机峰值性能下的稳健和自适应运动仍然具有挑战性。我们提出了多技能统一联合集成控制架构(MUJICA),这是一个针对轮腿机器人的统一、完全本体感觉控制框架,能够在单一策略中集成多种低级技能,包括全向移动、高平台攀爬和跌倒恢复。所有技能通过独特的指示变量进行区分,并与准确的直流电机约束建模共同训练。此外,我们学习了一个高级技能选择器,能够仅基于本体感觉动态选择最佳技能,从而实现对周围环境的自适应响应。因此,MUJICA增强了从仿真到现实的稳健性,并实现了不同运动模式之间的无缝过渡,促进了对环境的自主调整。我们在Unitree Go2-W机器人上进行了仿真和现实世界实验,验证了我们的框架,显示出在非结构化环境中适应性和任务成功率的显著提升。
cs.RO / 20 / 2605.13067

When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation

绝对状态失效时:评估用于稳健操作的本体感知编码
Alvarez, Maxime, Watanabe, Ryo, Crook, Paul, Meymand, Afshin Zeinaddini, Kurian, Suvin, Ferreiro, Pablo, Sano, Genki
Abstract
As end-to-end robotic policies are progressively deployed in the real world to solve real tasks, they face a gap between the training and inference conditions. Scaling the amount and diversity of the training data has shown some success in improving zero-shot generalization, yet robots still fail when faced with new, unseen test conditions. For instance, while robots with fixed frames of reference are common, those with moving frames pose a greater challenge for deployment. To address this specific instance of the issue, we present a study of strategies for encoding the robot's proprioceptive state to improve both in- and out-of-distribution performance at test time. Through a systematic study of joint representations, we find that a simple episode-wise relative frame provides the best trade-off between task performance and robustness, outperforming the baselines in extensive real-robot experiments conducted in a realistic test environment. The results suggest a practical path to leveraging data collected by robots with varying frames of reference and deployment to unseen test configurations.
Chinese Translation
随着端到端机器人策略逐渐在现实世界中部署以解决实际任务,它们面临着训练和推理条件之间的差距。增加训练数据的数量和多样性在改善零样本泛化方面显示出了一定的成功,但当面临新的、未见过的测试条件时,机器人仍然会失败。例如,虽然固定参考框架的机器人很常见,但具有移动参考框架的机器人在部署时面临更大的挑战。为了解决这一特定问题,我们提出了一项关于编码机器人本体感知状态的策略研究,以提高测试时的分布内和分布外性能。通过对联合表示的系统研究,我们发现简单的逐集相对框架在任务性能和稳健性之间提供了最佳的权衡,在现实测试环境中进行的大规模真实机器人实验中超越了基线。结果表明,利用具有不同参考框架的机器人收集的数据,并在未见测试配置中进行部署,提供了一条实用的路径。
cs.RO / 21 / 2605.13083

TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

TouchAnything:一个用于从自我中心视频进行双手触觉估计的数据集和框架
Zhou, Jianyi, Gao, Ziteng, Hong, Feiyang, Liu, Zirui, Zhang, Guannan, Dai, Weisheng, Zhen, Ruichen, Lyu, Chuqiao, Wu, Haotian, Mao, Yinian, Wang, Xushi, Jiang, Yuxiang, Ding, Wenbo, Yang, Shuo
Abstract
Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.
Chinese Translation
自我中心人类视频数据捕捉了丰富的人与环境的互动,并且可以大规模收集,已成为体现智能研究的关键驱动力。然而,现有的自我中心数据集通常缺乏触觉感知,这是一种关键的模态,提供了关于人-物体互动中的接触、力量和压力的直接线索。没有这些信号,模型很难学习与现实世界互动动态相对应的物理基础表示。虽然触觉传感器提供了这些线索,但大规模部署高质量的触觉硬件仍然昂贵且繁琐。这引发了一个核心问题:能否直接从视觉观察中推断触觉反馈,从而为自我中心视频数据提供可扩展的触觉监督,并支持物理基础的体现学习?为了推动这一方向的研究,我们引入了EgoTouch,这是一个大规模的多视角自我中心数据集,具有密集的触觉监督,专注于双手物体互动。EgoTouch包含208个操作任务,跨越1,891个情节,涵盖多样的室内和室外环境,配备同步的多视角RGB(头戴式自我中心和双腕式摄像头)、双手3D手势和来自可穿戴触觉传感器的连续压力图。基于EgoTouch,我们引入了TouchAnything,这是一个基线多视角视觉到触觉预测框架,使用自我中心视角作为主要输入,并在推理时灵活利用可用的腕部视角。实验表明,结合腕部视角通常能改善触觉预测,相较于仅使用自我中心输入,Contact IoU的相对提升可达5.0%,Volumetric IoU的相对提升可达6.1%。我们将公开发布数据集、代码和基准。
cs.RO / 22 / 2605.13086

Object Manipulation of the Variable Topology Truss system

可变拓扑桁架系统的物体操作
Bae, Andrew Jang-Ho, Choi, Myeongjin, Li, Haorui, Yim, Mark, Seo, TaeWon
Abstract
This paper presents an object manipulation strategy for the Variable Topology Truss (VTT) system, a truss robot that comprises actuated truss members connected by passive spherical joints. Although truss robots were originally proposed as rapidly deployable manipulators, manipulation strategy has not been studied thoroughly. To enable manipulation, we introduce a hybrid control framework that regulates position and force concurrently without explicit decoupling. At the actuator level, each member employs a sensor-based force feedback controller to generate the desired axial forces despite high actuator friction. At the task level, the forces applied at the end-effector nodes are produced by computing the required member forces using a static model of the VTT. We evaluate force-tracking performance through experiments on both a single member module and the full VTT system. Finally, we demonstrate object manipulation using two representative configurations and quantitatively assess combined position and force tracking performance. Experimental results confirm that the proposed approach enables consistent and reliable object manipulation with the VTT system.
Chinese Translation
本文提出了一种针对可变拓扑桁架(Variable Topology Truss, VTT)系统的物体操作策略,该桁架机器人由通过被动球形关节连接的驱动桁架构件组成。尽管桁架机器人最初被提出作为快速部署的操作器,但其操作策略尚未得到深入研究。为了实现操作,我们引入了一种混合控制框架,该框架在不明确解耦的情况下同时调节位置和力。在执行器层面,每个构件采用基于传感器的力反馈控制器,以在高执行器摩擦的情况下产生所需的轴向力。在任务层面,通过使用VTT的静态模型计算所需的构件力,来产生施加在末端执行器节点上的力。我们通过对单个构件模块和完整VTT系统的实验评估力跟踪性能。最后,我们展示了使用两种代表性配置进行物体操作,并定量评估了位置和力的联合跟踪性能。实验结果证实,所提出的方法使得VTT系统能够实现一致且可靠的物体操作。
cs.RO / 23 / 2605.13094

Identification of Non-Transversal Bifurcations of Linkages

非横向分叉的连杆识别
Mueller, Andreas, Custodio, P. C. López, Dai, J. S.
Abstract
Local analysis is an established approach to the study of singularities and mobility of linkages. A key result of such analyses is a local picture of the finite motion through a configuration. This reveals the finite mobility at that point and the tangents to smooth motion curves. It does not, however, immediately allow one to distinguish between motion branches that do not intersect transversally (a rather uncommon situation that has only recently been discussed in the literature). The mathematical framework for such a local analysis is the kinematic tangent cone. It is shown in this paper that the constructive definition of the kinematic tangent cone already involves all information necessary to separate different motion branches. A computational method is derived by amending the algorithmic framework reported in previous publications.
Chinese Translation
局部分析是一种研究连杆奇点和运动性的成熟方法。这类分析的关键结果是通过一个配置获得有限运动的局部图像。这揭示了该点的有限运动性以及平滑运动曲线的切线。然而,这并不能立即区分那些不横向相交的运动分支(这种情况相对少见,最近才在文献中讨论)。这种局部分析的数学框架是运动切锥。本文表明,运动切锥的构造性定义已经包含了分离不同运动分支所需的所有信息。通过修正之前出版物中报告的算法框架,推导出了一种计算方法。
cs.RO / 24 / 2605.13105

What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

忽略什么,反应什么:视觉鲁棒性强化学习微调VLA模型
Peng, Yuanfang, Fu, Jingjing, Zhang, Chuheng, Zhao, Li, Bian, Jiang, Liu, Mingyu, Zhang, Ling, Zhang, Jun, Wang, Rui
Abstract
Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together, these objectives turn visual variants from mere observation diversity into behavior-level guidance on policy responses during RL fine-tuning. We evaluate on ManiSkill3 across two representative VLA architectures, OpenVLA and $\pi_{0.5}$, under diverse out-of-distribution visual shifts including unseen distractors, texture changes, target object pose variation, viewpoint shifts, and lighting changes. Our method consistently improves over standard PPO, achieving average improvements of 16.62% on $\pi_{0.5}$ and 9.10% on OpenVLA. Notably, ablations further show generalization across visual shifts: invariance guidance learned from distractor and texture variants transfers to target-pose and lighting shifts, while adding sensitivity guidance on target-pose variants further improves robustness to nuisance shifts, highlighting the broader transferability of behavior-level RL guidance.
Chinese Translation
强化学习(RL)微调在机器人操作的视觉-语言-动作(VLA)模型中展现出了良好的前景,但部署时的视觉变化带来了实际挑战。一个关键困难在于,标准任务奖励监督任务成功,但对视觉变化是否与任务无关或是否改变了操作所需的行为提供了有限的指导。我们提出了PAIR-VLA(配对动作不变性与敏感性用于视觉鲁棒VLA),这是一个RL微调框架,旨在通过在PPO优化过程中对配对视觉变体添加两个辅助目标来解决这一困难:一个不变性项,减少任务保持配对(例如,不同的干扰物)之间的动作分布差异,以及一个敏感性目标,鼓励任务改变配对(例如,目标物体在不同姿态下)的可分离动作分布。这些目标共同将视觉变体从单纯的观察多样性转变为在RL微调过程中对策略反应的行为级指导。我们在ManiSkill3上对两种代表性的VLA架构OpenVLA和$\pi_{0.5}$进行了评估,涵盖了多种分布外的视觉变化,包括未见的干扰物、纹理变化、目标物体姿态变化、视角变化和光照变化。我们的方法在标准PPO的基础上持续改进,在$\pi_{0.5}$上平均提高了16.62%,在OpenVLA上提高了9.10%。值得注意的是,消融实验进一步显示了在视觉变化中的泛化:从干扰物和纹理变体中学习到的不变性指导可以转移到目标姿态和光照变化,而在目标姿态变体上增加的敏感性指导进一步提高了对干扰变化的鲁棒性,突显了行为级RL指导的更广泛可转移性。
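The two auxiliary objectives described in the abstract above can be sketched for discrete action distributions. The abstract does not state which discrepancy measure or loss shape PAIR-VLA uses, so the KL divergence, the hinge with `margin`, and all function names below are assumptions for illustration only.

```python
import math


def kl(p, q, eps=1e-12):
    """KL divergence between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pair_losses(p_base, p_preserving, p_altering, margin=0.5):
    """Hypothetical auxiliary terms for one set of paired observations.

    invariance:  small when the task-preserving variant (e.g. a different
                 distractor) yields the same action distribution.
    sensitivity: zero once the task-altering variant (e.g. a moved target)
                 yields a distribution at least `margin` away in KL.
    """
    invariance = kl(p_base, p_preserving)
    sensitivity = max(0.0, margin - kl(p_base, p_altering))
    return invariance, sensitivity


if __name__ == "__main__":
    base = [0.7, 0.2, 0.1]
    print(pair_losses(base, [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

In an RL fine-tuning loop, terms like these would typically be added to the policy loss with tunable weights; the weighting is likewise not specified in the abstract.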
cs.RO / 25 / 2605.13117

SECOND-Grasp: Semantic Contact-guided Dexterous Grasping

SECOND-Grasp:语义接触引导的灵巧抓取
Shin, Han Yi, Ko, Heeju, Mun, Jaewon, Huang, Qixing, Lee, Jaehyeok, Kim, Sung June, Lee, Honglak, Jang, Sujin, Kim, Sangpil
Abstract
Achieving reliable robotic manipulation, such as dexterous grasping, requires a synergy between physically stable interactions and semantic task guidance, yet these objectives are often treated as separate, disjoint goals. In this paper, we investigate how to integrate dexterous grasping techniques, i.e., physically stable grasps for object lifting and language-guided grasp generation, to achieve both physical stability and semantic understanding. To this end, we propose SECOND-Grasp (SEmantic CONtact-guided Dexterous Grasping), a unified framework that enables robotic hands to dynamically adjust grasping strategies based on semantic reasoning while ensuring physical feasibility. We begin by obtaining coarse contact proposals through vision-language reasoning to infer where contacts should occur based on object properties, followed by segmentation to localize these regions across views. To further ensure consistency across multiple viewpoints, we introduce Semantic-Geometric Consistency Refinement (SGCR), which refines initial contact predictions by enforcing semantic consistency across views and removing geometrically invalid regions, yielding reliable 3D contact maps. Then, we derive a feasible hand pose for each contact map via inverse kinematics, generating a supervision signal for policy learning. Our approach, trained on DexGraspNet, consistently outperforms baselines in lifting success rate on both seen and unseen categories, achieving 98.2% and 97.7%, respectively, while also improving intent-aware grasping by 12.8% and 26.2%. We further show promising results on additional datasets and robotic hands, including Shadow Hand and Allegro Hand.
Chinese Translation
实现可靠的机器人操作,例如灵巧抓取,需要物理稳定交互与语义任务指导之间的协同,但这些目标通常被视为独立的、分离的目标。本文探讨如何将灵巧抓取技术(即用于物体提升的物理稳定抓取和基于语言的抓取生成)进行整合,以实现物理稳定性和语义理解。为此,我们提出了SECOND-Grasp(SEmantic CONtact-guided Dexterous Grasping),这是一个统一框架,使机器人手能够根据语义推理动态调整抓取策略,同时确保物理可行性。我们首先通过视觉-语言推理获得粗略的接触建议,以推断基于物体属性的接触位置,然后通过分割来定位这些区域。为了进一步确保多个视角之间的一致性,我们引入了语义几何一致性细化(Semantic-Geometric Consistency Refinement,SGCR),通过在视角之间强制语义一致性并去除几何上无效的区域来细化初始接触预测,从而生成可靠的3D接触图。接着,我们通过逆向运动学为每个接触图导出可行的手部姿态,生成用于策略学习的监督信号。我们的算法在DexGraspNet上训练,在已见和未见类别的提升成功率上始终优于基线,分别达到了98.2%和97.7%,同时在意图感知抓取方面提高了12.8%和26.2%。我们还在其他数据集和机器人手(包括Shadow Hand和Allegro Hand)上展示了良好的结果。
cs.RO / 26 / 2605.13119

Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

面向长时间跨度的具身智能体与工具对齐的视觉-语言-行动模型
Lei, Zixing, Liu, Changxing, Xiong, Yichen, Xiong, Minhao, Ding, Yuanzhuo, Zhang, Zhipeng, Li, Weixin, Chen, Siheng
Abstract
Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of $\pi_{0.5}$ by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.
Chinese Translation
视觉-语言-行动(VLA)模型是有效的机器人行动执行者,但由于长期闭环规划和多样化物理操作的双重负担,它们在长时间跨度任务上仍然存在局限。因此,我们提出了VLA作为工具(VLAs-as-Tools)策略,将这一负担分配给高层次视觉语言模型(VLM)智能体进行时间推理,以及一系列专门的VLA工具进行多样化的局部物理操作。VLM负责场景分析、全局规划和恢复,而每个VLA工具则执行一个有限的子任务。为了在长时间跨度任务中紧密结合智能体规划与VLA工具执行,我们引入了VLA工具家族接口,公开明确的工具选择和执行进度反馈,使得高效的事件触发智能体重新规划成为可能,而无需持续轮询智能体。为了获得忠实于智能体调用的多样化专门VLA工具,我们进一步提出了工具对齐后训练(Tool-Aligned Post-Training, TAPT),该方法构建了与调用对齐的训练单元以实现指令跟随,并采用工具家族残差适配器以实现高效的工具专业化。实验表明,VLA作为工具(VLAs-as-Tools)在LIBERO-Long上将成功率提高了4.8个百分点,在RoboTwin上提高了23.1个百分点,并且在非偏倚率测量中进一步提升了15.0个百分点的调用保真度。代码将会发布。
cs.RO / 27 / 2605.13123

Multi-Depth Uniform Coverage Path Planning for Unmanned Surface Vehicle Surveying

无人水面车辆测量的多深度均匀覆盖路径规划
Larrazabal, Maider, Yang, Tong, Goienetxea, Izaro, Miro, Jaime Valls
Abstract
This paper introduces a novel automatic coverage path planning algorithm for bathymetry surveying with unmanned surface vehicles. The detection range of the mapping sensor employed, a multibeam echo sounder, is heavily influenced by local seafloor depths. Hence, a path designed to uniformly cover the sea surface does not guarantee uniform coverage of the seafloor. Yet this is currently the typical process for bathymetric surveys, with the simplistic boustrophedon scheme along manually selected waypoints at constant depths being the most widely used planner. The proposed scheme incorporates coarse prior depth information to pre-process the target region and adaptively guide path generation and sensing range configuration. By explicitly accounting for depth variations, the proposed algorithm designs a coverage path with optimised spacing between survey passes and adjusts the sensing beam aperture to achieve more consistent seafloor coverage. The proposed method is shown to offer significant improvements in both synthetic and real-world scenarios. Validation in challenging synthetic terrains achieves coverage ratios beyond 99%, a marked improvement over traditional boustrophedon paths, which reach at most 75% coverage. The same trend appears in realistic simulations using real bathymetric data from a coastal harbour, with coverage reaching over 92% and significantly surpassing boustrophedon sweeps, whose coverage rates fall below 65%. Beyond improved performance, the scheme also brings a fully automated design suitable for autonomous marine vehicles, thus offering practical utility for real-world applications.
Chinese Translation
本文介绍了一种新颖的自动覆盖路径规划算法,用于无人水面车辆的水深测量。所采用的映射传感器——多波束回声测深仪的探测范围受到局部海底深度的显著影响。因此,设计用于均匀覆盖海面路径并不能保证海底的均匀覆盖。然而,这仍然是当前水深测量的典型流程,简单的交替行走方案沿着手动选择的恒定深度航点是最广泛使用的规划方法。所提出的方案结合了粗略的先前深度信息,以预处理目标区域并自适应地引导路径生成和传感范围配置。通过明确考虑深度变化,所提出的算法设计了一条覆盖路径,优化了测量经过之间的间距,并调整了传感波束的孔径,以实现更一致的海底覆盖。该方法在合成和实际场景中均显示出显著的改进。在具有挑战性的合成地形中的验证达到了超过99%的覆盖率,相较于传统的交替行走路径,后者的最大覆盖率仅为75%,显示出显著的改善。在使用来自沿海港口的真实水深数据进行的现实模拟中,同样的趋势出现,覆盖率超过92%,显著超越了覆盖率低于65%的交替行走扫掠。除了性能的提升,该方案还带来了完全自动化的设计,适用于自主海洋车辆,从而为实际应用提供了实用的效用。
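The depth dependence described in the abstract above follows from simple geometry: over a flat seafloor, a downward-looking fan of total aperture θ at depth d covers a swath of width 2·d·tan(θ/2). The sketch below uses only that flat-bottom relation; the `overlap` fraction and the function names are assumptions, and it does not reproduce the paper's planner.

```python
import math


def swath_width(depth_m, aperture_deg):
    """Seafloor swath width of a downward fan over a locally flat bottom."""
    return 2.0 * depth_m * math.tan(math.radians(aperture_deg) / 2.0)


def line_spacing(depth_m, aperture_deg, overlap=0.1):
    """Spacing between adjacent survey passes that keeps a fractional
    overlap between neighbouring swaths (hypothetical design rule)."""
    return swath_width(depth_m, aperture_deg) * (1.0 - overlap)


if __name__ == "__main__":
    for depth in (5.0, 20.0, 50.0):
        print(depth, round(line_spacing(depth, aperture_deg=90.0), 2))
```

A constant spacing chosen for deep water therefore leaves gaps in shallow water, which is the failure mode of fixed-spacing boustrophedon paths that the abstract points to.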
cs.RO / 28 / 2605.13125

MoCCA: A Movable Circle Probability of Collision Approximation

MoCCA:可移动圆形碰撞概率近似
Kern, Tobias, Birkner, Christian
Abstract
In automated driving, crash mitigation is crucial to ensure passenger safety. Accurate avoidance requires precise knowledge of the object's position and orientation. However, sensor noise and occlusions often result in tracking and prediction uncertainties. To account for these uncertainties, estimating the Probability of Collision (POC) is a critical requirement. While Monte Carlo sampling is a common estimation technique, its high computational demand and stochastic nature often render it unsuitable for real-time applications. Analytical POC calculations are simplified by approximating vehicle geometries using circular bounds. While multi-circle approximations offer higher fidelity than a single circumscribed circle, they significantly increase computational complexity. This paper proposes a shape approximation algorithm, MoCCA, which utilizes a single circle for each vehicle, optimized to minimize the relative distance between them. MoCCA maintains a computational efficiency comparable to standard single-circle techniques while reducing over-conservatism. To address the potential underestimation of POC inherent in partial coverage, we establish an upper bound for the approximation error, demonstrating that it depends primarily on inter-vehicle distance and orientation variance. Furthermore, we introduce a safety distance margin that can be calibrated solely based on orientation variance.
Chinese Translation
在自动驾驶中,碰撞缓解对于确保乘客安全至关重要。准确的避免需要对物体的位置和方向有精确的了解。然而,传感器噪声和遮挡常常导致跟踪和预测的不确定性。为了考虑这些不确定性,估计碰撞概率(Probability of Collision, POC)是一个关键要求。尽管蒙特卡洛采样是一种常见的估计技术,但其高计算需求和随机特性常常使其不适合实时应用。通过使用圆形边界近似车辆几何形状,可以简化POC的解析计算。虽然多圆近似提供的保真度高于单个外接圆,但它显著增加了计算复杂性。本文提出了一种形状近似算法MoCCA,该算法为每辆车利用一个单一圆形进行优化,以最小化它们之间的相对距离。MoCCA在计算效率上与标准的单圆技术相当,同时减少了过于保守的情况。为了应对部分覆盖中POC潜在的低估问题,我们建立了近似误差的上限,证明其主要依赖于车辆间距离和方向方差。此外,我们引入了一种安全距离边际,可以仅根据方向方差进行校准。
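As background for the abstract above, the Monte Carlo baseline and the single circumscribed circle it mentions can be sketched as follows. This is not MoCCA: the circles here are fixed at the vehicle centres rather than moved to minimise relative distance, and the isotropic Gaussian uncertainty on relative position is an assumption.

```python
import math
import random


def circumscribed_radius(length_m, width_m):
    """Radius of the circle circumscribing a rectangular vehicle footprint."""
    return 0.5 * math.hypot(length_m, width_m)


def monte_carlo_poc(mean_xy, sigma_m, r_ego, r_obj, n=10000, seed=0):
    """Estimate the Probability of Collision between two circles.

    The relative position of the two circle centres is sampled from an
    isotropic Gaussian (an assumed uncertainty model); a sample counts as a
    collision when the centres are closer than the sum of the radii.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        dx = rng.gauss(mean_xy[0], sigma_m)
        dy = rng.gauss(mean_xy[1], sigma_m)
        if math.hypot(dx, dy) < r_ego + r_obj:
            hits += 1
    return hits / n


if __name__ == "__main__":
    r = circumscribed_radius(4.5, 1.8)
    print(monte_carlo_poc((6.0, 1.0), 1.0, r, r))
```

The estimate needs many samples per query and varies with the seed, which illustrates why the abstract calls sampling costly and stochastic for real-time use. The circumscribed circle also extends well beyond the vehicle's sides, which is the source of the over-conservatism the paper aims to reduce.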
cs.RO / 29 / 2605.13192

Dynamics Computation of Soft-Rigid Hybrid-Link System and Its Application to Motion Analysis of an Athlete Wearing Sport Prosthesis

软-刚性混合链接系统的动力学计算及其在运动分析中的应用:以穿戴运动假肢的运动员为例
Kim, Sunghee, Shimane, Yuta, Ishigaki, Taiki, Yamamoto, Ko
Abstract
This paper presents a motion analysis framework for an athlete wearing a sport-specific flexible prosthesis, based on the soft-rigid hybrid-link system. Such motion analysis is challenging because it must account for the interaction force between the rigid human skeleton system and a flexible prosthesis. However, most human musculoskeletal models are based on the computational framework of a rigid-body multi-link system. Recently, the soft robotics research field has produced fast and efficient modeling methods for flexible rod deformation, which allow us to build a hybrid-link system that integrates rigid links and soft bodies in a unified formulation. We apply inverse kinematics of the hybrid-link system to motion reconstruction from motion capture data, and also present the estimation of the joint torques and ground reaction force by inverse dynamics. Through a human subject experiment, we show that the inverse dynamics achieved approximately 12% error in ground reaction force estimation. Furthermore, we provide muscle force estimation that considers muscle amputation and the interaction force associated with prosthesis leg deformation.
Chinese Translation
本文提出了一种基于软-刚性混合链接系统的运动分析框架,旨在分析穿戴运动特定柔性假肢的运动员。由于需要考虑刚性人体骨骼系统与柔性假肢之间的相互作用力,因此这种运动分析是一项具有挑战性的任务。然而,大多数人体肌肉骨骼模型是基于刚体多链接系统的计算框架。最近,在软机器人研究领域,开发了快速高效的柔性杆变形建模方法,使我们能够构建一个将刚性链接和柔性体统一表述的混合链接系统。我们将混合链接系统的逆运动学应用于从运动捕捉数据中重建运动,并通过逆动力学方法估计关节扭矩和地面反作用力。通过人类受试者实验,我们显示逆动力学在地面反作用力估计中达到了约12%的误差。此外,我们还提供了考虑肌肉截肢和与假肢腿变形相互作用力的肌肉力量估计。
cs.RO / 30 / 2605.13208

Calibration-Free Gas Source Localization with Mobile Robots: Source Term Estimation Based on Concentration Measurement Ranking

无校准气体源定位的移动机器人:基于浓度测量排名的源项估计
Jin, Wanting, Duranceau, Agatha, Erünsal, İzzet Kağan, Martinoli, Alcherio
Abstract
Efficient Gas Source Localization (GSL) in real-world settings is crucial, especially in emergency scenarios. Mobile robots equipped with low-cost, in-situ gas sensors offer a safer alternative to human inspection in hazardous environments. Probabilistic algorithms enhance GSL efficiency with scattered gas measurements by comparing gas concentration measurements gathered by robots to physical dispersion models. However, accurately deriving gas concentrations from data acquired with low-cost sensors is challenging due to the nonlinear sensor response, environmental dependencies (e.g., humidity, temperature, and other gas influences), and robot motion. Mitigating these disturbance factors requires frequent sensor calibration in controlled environments, which is often impractical for real-world deployments. To overcome these issues, we propose a novel feature extraction algorithm that leverages the relative ranking of gas measurements within the dynamically accumulated dataset. By comparing the rank differences between gathered and modeled values, we estimate the probabilistic distribution of source locations across the entire environment. We validate our approach in high-fidelity simulations and physical experiments, demonstrating consistent localization accuracy with uncalibrated gas sensors. Compared to existing methods, our technique eliminates the need for gas sensor calibration, making it well-suited for real-world applications.
Chinese Translation
在现实环境中高效的气体源定位(GSL)至关重要,尤其是在紧急情况下。配备低成本原位气体传感器的移动机器人为危险环境中的人工检查提供了更安全的替代方案。概率算法通过将机器人收集的气体浓度测量与物理扩散模型进行比较,增强了GSL在分散气体测量中的效率。然而,由于低成本传感器的非线性响应、环境依赖性(例如湿度、温度和其他气体影响)以及机器人运动,从数据中准确推导气体浓度是具有挑战性的。减轻这些干扰因素需要在受控环境中频繁校准传感器,这在实际部署中往往不切实际。为了解决这些问题,我们提出了一种新颖的特征提取算法,利用动态累积数据集中气体测量的相对排名。通过比较收集值与模型值之间的排名差异,我们估计了整个环境中源位置的概率分布。我们在高保真模拟和实际实验中验证了我们的方法,证明了在未校准气体传感器下的一致定位精度。与现有方法相比,我们的技术消除了气体传感器校准的需求,使其非常适合实际应用。
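The idea of comparing rankings instead of calibrated concentrations can be illustrated with a small sketch. It is a simplification of the approach in the abstract above: it scores a few candidate sources by Spearman rank agreement rather than estimating a full probability distribution, and the dispersion model, sensor response, and sample positions are all made up for illustration.

```python
import math


def ranks(values):
    """Rank of each value within the list (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free data."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def model_concentration(source, point):
    """Toy dispersion model: concentration decays with squared distance."""
    d2 = (source[0] - point[0]) ** 2 + (source[1] - point[1]) ** 2
    return 1.0 / (1.0 + d2)


def uncalibrated_sensor(concentration):
    """Hypothetical unknown, monotone, nonlinear sensor response."""
    return 3.0 * math.sqrt(concentration) + 7.0


def best_candidate(candidates, points, readings):
    """Pick the candidate whose modelled ranking best matches the readings."""
    return max(
        candidates,
        key=lambda s: spearman(readings, [model_concentration(s, p) for p in points]),
    )


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
              (0, 4), (2, 4), (4, 4), (2, 2), (1, 3)]
    true_source = (2.3, 3.1)
    readings = [uncalibrated_sensor(model_concentration(true_source, p)) for p in points]
    print(best_candidate([(2.3, 3.1), (0.5, 0.5), (3.6, 1.0)], points, readings))
```

Because ranks are unchanged by any monotone sensor response, the gain and offset of the sensor never enter the comparison, which is what makes calibration unnecessary in this toy setting.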
cs.RO / 31 / 2605.13266

Galilean State Estimation for Inertial Navigation Systems with Unknown Time Delay

具有未知时间延迟的惯性导航系统的伽利略状态估计
Delama, Giulio, Scheiber, Martin, Ge, Yixiao, Hamel, Tarek, Weiss, Stephan, Mahony, Robert
Abstract
Many Inertial Navigation Systems (INS) use Global Navigation Satellite System (GNSS) position as the primary measurement to drive filter performance and bound error growth. However, commercial-grade GNSS receivers introduce unknown measurement delays ranging from 50 ms to 300 ms depending on sensor quality and operating mode. Such time delays can significantly degrade INS performance unless they are explicitly compensated for. Existing algorithms commonly estimate this delay offline, run the filter concurrently with GNSS measurements using buffered Inertial Measurement Unit (IMU) data, and predict the current state by forward-integrating buffered inertial measurements via IMU preintegration. The state-of-the-art online method is an Extended Kalman Filter (EKF) that explicitly models the time delay as a state parameter, which defines the preintegration duration. This paper introduces a novel geometric framework for modeling time-delayed INS, in which Galilean symmetry is leveraged to provide a joint representation of space and time for consistent state estimation. An Equivariant Filter (EqF) is derived for the coupled estimation of navigation states and time delay. Validation is performed on two fixed-wing Uncrewed Aerial Vehicles (UAV) with GNSS time lags of 90 ms and 120 ms. The test flights last two to three minutes. Simulations further investigate delays up to 500 ms and provide a statistical comparison against the state-of-the-art EKF. Results show that the EqF preserves accuracy and consistency, while the EKF lacks consistency and its performance degrades significantly with increasing measurement delays.
Chinese Translation
许多惯性导航系统(INS)使用全球导航卫星系统(GNSS)的位置作为主要测量,以驱动滤波器性能并限制误差增长。然而,商用级GNSS接收器引入的未知测量延迟范围从50毫秒到300毫秒,具体取决于传感器质量和操作模式。这种时间延迟可能显著降低INS性能,除非进行明确补偿。现有算法通常离线估计该延迟,使用缓冲的惯性测量单元(IMU)数据与GNSS测量并行运行滤波器,并通过IMU预积分前向积分缓冲的惯性测量来预测当前状态。最先进的在线方法是扩展卡尔曼滤波器(EKF),它将时间延迟显式建模为状态参数,从而定义预积分持续时间。本文引入了一种新的几何框架来建模时间延迟的INS,其中利用伽利略对称性提供空间和时间的联合表示,以实现一致的状态估计。为导航状态和时间延迟的耦合估计推导出了一种等变滤波器(EqF)。在两架具有90毫秒和120毫秒GNSS时间延迟的固定翼无人机(UAV)上进行了验证。测试飞行持续两到三分钟。模拟进一步研究了高达500毫秒的延迟,并与最先进的EKF进行了统计比较。结果表明,EqF保持了准确性和一致性,而EKF缺乏一致性,并且随着测量延迟的增加,其性能显著下降。
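The conventional delay handling described in this abstract (run the filter at the delayed GNSS time, then forward-integrate buffered IMU samples to reach the present) can be illustrated with a minimal sketch. It is not the paper's Equivariant Filter: the 1-D state, the constant-acceleration step, and the names `propagate` and `predict_current_state` are all assumptions made for illustration, and the delay is taken as already known.

```python
from collections import deque


def propagate(state, accel, dt):
    """Advance a 1-D (position, velocity) state by one IMU sample."""
    pos, vel = state
    return (pos + vel * dt + 0.5 * accel * dt * dt, vel + accel * dt)


def predict_current_state(delayed_state, imu_buffer, delay, dt):
    """Forward-integrate the buffered samples that cover the delay window.

    delayed_state: filter estimate valid at (now - delay)
    imu_buffer:    most recent accelerations, oldest first (hypothetical layout)
    """
    n_steps = round(delay / dt)
    # Guard against n_steps == 0, because list[-0:] would return every sample.
    recent = list(imu_buffer)[-n_steps:] if n_steps > 0 else []
    state = delayed_state
    for accel in recent:
        state = propagate(state, accel, dt)
    return state


# Example: 100 ms GNSS delay, 10 ms IMU period, constant 1 m/s^2 acceleration.
imu_buffer = deque([1.0] * 50, maxlen=50)
state_now = predict_current_state((0.0, 0.0), imu_buffer, delay=0.1, dt=0.01)
```

If the assumed delay is wrong, the number of integrated samples is wrong too, which is why the abstract treats the delay as a quantity to be estimated jointly with the navigation state.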
cs.RO / 32 / 2605.13321

HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

HCSG:面向人类的语义-几何推理在视觉-语言导航中的应用
Xu, Haoxuan, Li, Tianfu, Chen, Wenbo, Liu, Yi, Wu, Jin, Lei, Huashuo, Lou, Yunfan, Wang, Lujia, Wang, Hesheng, Li, Haoang
Abstract
VLN has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans merely as moving obstacles based on implicit visual cues, lacking the explicit reasoning required to interpret human intentions or maintain social norms. To address this, we propose HCSG, the first human-centric framework for VLN. This framework provides a robust foundation for safe, socially intelligent navigation in dynamic human-robot environments that shifts the paradigm from passive collision avoidance to active human behavior understanding. Specifically, HCSG introduces a unified Human Understanding Module that synergizes two key capabilities: (i) geometric forecasting, which predicts human pose and trajectory to anticipate future motion dynamics; and (ii) semantic interpretation, which leverages a Vision-Language Model (VLM) to generate natural language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning. Furthermore, a social distance loss is introduced to enforce socially compliant interaction distances. Extensive experiments on the HA-VLNCE benchmark demonstrate that HCSG significantly outperforms state-of-the-art methods, achieving a 14% improvement in Success Rate and a 34% reduction in Collision Rate. Our project can be seen at https://haoxuanxu1024.github.io/HCSG/.
Chinese Translation
视觉-语言导航(VLN)通过扩大数据和模型容量取得了显著进展。然而,在现实世界的室内场景中,静态环境的假设往往失效,因为机器人不可避免地会遇到动态行人。现有的人类感知方法通常仅将人视为基于隐含视觉线索的移动障碍物,缺乏解释人类意图或维持社会规范所需的明确推理。为了解决这一问题,我们提出了HCSG,这是第一个面向人类的视觉-语言导航框架。该框架为动态人机环境中的安全、社会智能导航提供了坚实的基础,将范式从被动的碰撞避免转变为主动的人类行为理解。具体而言,HCSG引入了一个统一的人类理解模块,协同两项关键能力:(i)几何预测,预测人类姿态和轨迹,以预见未来的运动动态;(ii)语义解释,利用视觉-语言模型(VLM)生成对人类行为和意图的自然语言描述。这些语义-几何表示被融合到代理的拓扑图中,以进行基于指令的规划。此外,引入了社会距离损失,以强制执行符合社会规范的互动距离。在HA-VLNCE基准上的广泛实验表明,HCSG显著超越了最先进的方法,成功率提高了14%,碰撞率降低了34%。我们的项目可以在https://haoxuanxu1024.github.io/HCSG/上查看。
cs.RO / 33 / 2605.13328

What Limits Vision-and-Language Navigation?

视觉与语言导航的限制是什么?
Wang, Yunheng, Fang, Yuetong, Wang, Taowen, Li, Lusong, Liu, Kun, Xu, Junzhe, Yuan, Zizhao, Feng, Yixiao, Zhang, Jiaxi, Lu, Wei, Zeng, Zecui, Xu, Renjing
Abstract
Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.
Chinese Translation
视觉与语言导航(Vision-and-Language Navigation, VLN)是具身智能的基石。然而,当前的智能体在从模拟环境转向现实世界部署时,往往会遭遇显著的性能下降,这主要是由于感知不稳定(例如,光照变化和运动模糊)以及指令不明确所致。虽然现有方法试图通过扩大模型规模和训练数据来弥补这一差距,但我们认为瓶颈在于缺乏稳健的空间基础和跨领域先验知识。本文提出了StereoNav,一个旨在增强现实世界导航一致性的稳健视觉-语言-行动框架。为了解决合成训练与物理执行之间的固有差距,我们引入了目标位置先验(Target-Location Priors)作为一个持久的桥梁。这些先验提供了稳定的视觉指导,在不同领域中保持不变,有效地为智能体提供基础,即使指令模糊。此外,为了减轻运动模糊和光照变化等视觉干扰,StereoNav利用立体视觉构建语义与几何的统一表示,通过增强的深度感知实现精确的动作预测。在R2R-CE和RxR-CE上的大量实验表明,StereoNav在自我中心RGB性能上达到了最先进的水平,SR和SPL得分分别为81.1%和68.3%,以及67.5%和52.0%,同时使用的参数和训练数据显著少于之前基于规模扩展的方法。更重要的是,现实世界的机器人部署确认,StereoNav在复杂的非结构化环境中显著提高了导航的可靠性。项目页面:https://yunheng-wang.github.io/stereonav-public.github.io。
cs.RO / 34 / 2605.13380

Exploring Human-Robot Collaboration: Analysis of Interaction Modalities in Challenging Tasks

探索人机协作:在挑战性任务中交互方式的分析
Arreghini, Simone, Iani, Cristina, Giusti, Alessandro, Villani, Valeria, Sabattini, Lorenzo, Paolillo, Antonio
Abstract
This work compares three interaction modalities for human-robot collaboration: passive, reactive, and proactive. We studied 18 participants assembling a seven-layer colored tower from memory while using nearby and distant blocks. In the passive modality participants worked alone; in the reactive modality a mobile robot helped only upon request; in the proactive modality it initiated brick delivery and error signaling without explicit requests. Although robot assistance increased completion time, most participants preferred collaboration: 67% preferred proactive behavior and 78% judged it most useful. These results suggest that timely proactive support can improve user experience in controlled collaborative tasks.
Chinese Translation
本研究比较了三种人机协作的交互方式:被动、反应和主动。我们研究了18名参与者在使用近处和远处的积木时,依靠记忆组装七层彩色塔的过程。在被动模式下,参与者独立工作;在反应模式下,移动机器人仅在请求时提供帮助;而在主动模式下,机器人在没有明确请求的情况下主动进行积木递送和错误信号提示。尽管机器人的帮助增加了完成时间,但大多数参与者更倾向于协作:67%的参与者偏好主动行为,78%的参与者认为其最有用。这些结果表明,及时的主动支持可以改善用户在受控协作任务中的体验。
cs.RO / 35 / 2605.13382

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

BlockVLA:通过块扩散微调加速自回归视觉-语言-动作模型
Wang, Ruiheng, Bai, Shuanghao, Zhang, Haoran, Chen, Badong, Xu, Xiangyu
Abstract
While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. Experimental results demonstrate that our BlockVLA achieves a 3.3× inference acceleration over standard discrete diffusion baselines. Furthermore, our model exhibits superior training efficiency, with success rates converging substantially faster than baselines, a gain that is particularly pronounced in complex, long-horizon tasks, where BlockVLA achieves significant performance gains in the early stages of training. This work establishes Block Diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.
Chinese Translation
尽管自回归(AR)视觉-语言-动作(VLA)模型在机器人任务中展示了强大的推理能力,但其顺序解码过程通常会导致高推理延迟,并可能在长时间执行过程中加剧错误累积。离散扩散语言模型(dLLMs)通过并行令牌精炼提供了一种有前景的替代方案,但其在机器人中的实际应用仍受到重复去噪函数评估(NFEs)和难以将标准KV缓存直接应用于双向迭代解码的限制。为了解决这些问题,我们提出了BlockVLA,一个将预训练的AR骨干网络适配为高效离散扩散策略的框架,采用块扩散范式。BlockVLA在块级别保持自回归依赖,同时在每个块内实现并行去噪,从而将全局因果一致性与局部并行生成相结合。该设计使得在已完成的块之间重用前缀KV缓存,降低了迭代去噪的有效成本,并提供了从AR预训练到基于扩散的策略微调的更平滑过渡。我们在LIBERO和SimplerEnv基准上进行了广泛评估。实验结果表明,我们的BlockVLA在标准离散扩散基线之上实现了3.3倍的推理加速。此外,我们的模型展现出卓越的训练效率,成功率收敛速度明显快于基线,尤其在复杂的长时间任务中,BlockVLA在训练的早期阶段取得了显著的性能提升。这项工作确立了块扩散作为大规模预训练AR模型与高效、高频实时机器人控制之间的稳健桥梁。
cs.RO / 36 / 2605.13403

RotVLA: Rotational Latent Action for Vision-Language-Action Model

RotVLA:用于视觉-语言-动作模型的旋转潜在动作
Li, Qiwei, Gong, Xicheng, Li, Xinghang, Li, Peiyan, Zhou, Quanyun, Ye, Hangjun, Zhou, Jiahuan, Mu, Yadong
Abstract
Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
Chinese Translation
潜在动作模型(LAMs)已成为处理异构数据集的有效范式,尤其是在视觉-语言-动作(VLA)模型的预训练过程中,提供了跨体现的统一动作空间。然而,现有的LAMs通常依赖于离散量化的编码和解码管道,这可能导致微不足道的帧重建行为、有限的表征能力以及缺乏物理意义的结构。我们提出了RotVLA,一个基于连续旋转潜在动作表示的VLA框架。潜在动作被建模为SO(n)的元素,提供了与真实世界动作动态相一致的连续性、组合性和结构几何。三元组帧学习框架进一步强化了有意义的时间动态,同时避免了退化。RotVLA由一个VLM主干和一个流匹配动作头组成,在大规模跨体现的机器人数据集和带有潜在动作监督的人类视频上进行了预训练。对于下游机器人控制,流匹配头被扩展为一个统一的动作专家,联合去噪潜在和机器人动作。在这里,潜在动作作为潜在规划者,提供高层次的指导以调节动作生成。仅凭17亿参数和1700小时以上的预训练数据,RotVLA在LIBERO上达到了98.2%,在干净和随机设置下分别在RoboTwin2.0上达到了89.6%和88.5%。它在操作任务上也展示了强大的现实世界表现,始终优于现有的VLA模型。
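The continuity and compositionality that the abstract attributes to rotational latent actions can be seen in a small worked example. The sketch below is an illustration only: it restricts SO(n) to SO(3), builds rotations with Rodrigues' formula, and uses helper names (`matmul`, `rotation_from_axis_angle`, `compose`) that are assumptions and not taken from RotVLA.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula: R = I + sin(a) K + (1 - cos(a)) K^2 for a unit axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]  # skew-symmetric generator
    K2 = matmul(K, K)
    s, c = math.sin(angle), 1.0 - math.cos(angle)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c * K2[i][j]
             for j in range(3)] for i in range(3)]


def compose(first, second):
    """Apply `first`, then `second` (matrix product second * first)."""
    return matmul(second, first)


# Two small rotations about the same axis compose into one larger rotation.
step_a = rotation_from_axis_angle((0.0, 0.0, 1.0), 0.3)
step_b = rotation_from_axis_angle((0.0, 0.0, 1.0), 0.4)
combined = compose(step_a, step_b)
```

Because the product of two rotations is again a rotation, chaining latent actions never leaves the group, which is one way to read the "compositionality" claim.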
cs.RO / 37 / 2605.13428

SID: Sliding into Distribution for Robust Few-Demonstration Manipulation

SID:滑入分布以实现稳健的少演示操控
Ma, Yicheng, Yu, Wei, Su, Zhian, Zhang, Xidan, Dong, Huixu
Abstract
Generalizing robotic manipulation across object poses, viewpoints, and dynamic disturbances is difficult, especially with only a few demonstrations. End-to-end visuomotor policies are expressive but data-hungry, while planning and optimization satisfy explicit constraints but do not directly capture the interaction strategies demonstrated by humans. We propose Sliding into Distribution (SID), a structured framework that learns an object-centric motion field from canonicalized demonstrations to iteratively slide the system toward the demonstrated manifold and into the reliable operating region of a lightweight egocentric execution policy, mitigating out-of-distribution (OOD) execution. The motion field provides large corrective motions when far from the demonstration manifold and naturally vanishes near convergence, enabling robust reaching under substantial pose and viewpoint shifts. Within the reached regime, an egocentric policy trained with conditioned flow matching performs task-specific manipulation, supported by kinematically consistent point-cloud reprojection augmentation that preserves action-observation consistency. Across six real-world tasks, SID achieves approximately 90% success under OOD initializations with only two demonstrations, with under a 10% drop under distractors and external disturbances. Overall, SID provides a new paradigm for few-shot manipulation: explicitly managing distribution shift via online distribution recovery.
Chinese Translation
在物体姿态、视角和动态干扰下进行机器人操控的泛化是困难的,尤其是在只有少量演示的情况下。端到端的视觉运动策略虽然表达能力强,但对数据需求较高;而规划和优化虽然满足显式约束,但并未直接捕捉人类展示的交互策略。我们提出了滑入分布(Sliding into Distribution, SID),这是一个结构化框架,通过规范化的演示学习物体中心的运动场,以迭代方式将系统滑向演示的流形,并进入轻量级自我中心执行策略的可靠操作区域,从而减轻分布外(Out-of-Distribution, OOD)执行的影响。当远离演示流形时,运动场提供较大的修正运动,而在收敛附近自然消失,从而在显著的姿态和视角变化下实现稳健的到达。在达到的状态下,经过条件流匹配训练的自我中心策略执行特定任务的操控,得益于运动学一致的点云重投影增强,保持了动作-观察的一致性。在六个真实世界任务中,SID在仅有两个演示的情况下,在OOD初始化下实现了约90%的成功率,在干扰物和外部干扰下的成功率下降不到10%。总体而言,SID为少样本操控提供了一种新范式:通过在线分布恢复显式管理分布转移。
cs.RO / 38 / 2605.13442

Asymptotically Optimal Ergodic Coverage on Generalized Motion Fields

广义运动场上的渐近最优遍历覆盖
Hughes, Christian, Liu, Yilang, Lahrach, Yanis, Engdahl, Julia, Warren, Houston, Lee, Darrick, Ramos, Fabio, Miles, Travis, Abraham, Ian
Abstract
Autonomous robotic exploration in remote and extreme environments allows scientists to model complex transport phenomena and collective behaviors described by continuously deforming flow fields. Although these environments are naturally modeled as time-varying domains, most adaptive exploration methods assume static environments and fail to provide adequate coverage or satisfy any formal guarantees. This is especially the case in oceanography where autonomous underwater systems (UxS) have highly restrictive compute and payload requirements that necessitate path planning methods that yield robust data collection strategies in open-loop and underactuated settings. In this work, to address the aforementioned issues, we propose to formulate adaptive search as an ergodic coverage problem and investigate certifying coverage in the ergodic sense over evolving domains with flow-induced dynamics. We expand upon recent work demonstrating maximum mean discrepancy (MMD) as a functional ergodic metric, and derive a flow-adaptive formulation that explicitly accounts for domain evolution within the coverage objective. We show that this approach preserves ergodic coverage guarantees in ambient flows and enables effective exploration in under-actuated, and even open-loop planning settings by integrating environment dynamics. Experiments validate that our method generalizes to diverse spatiotemporal processes including ocean exploration, and tracking human and cattle movement. Physical experiments on aerial and legged robotic platforms validate our ability to obtain ergodic coverage in non-convex, flow-restricted environments while respecting robot dynamics.
Chinese Translation
在遥远和极端环境中进行自主机器人探索,使科学家能够对由持续变形的流场描述的复杂传输现象和集体行为进行建模。尽管这些环境自然被建模为时变域,但大多数自适应探索方法假设环境是静态的,未能提供足够的覆盖或满足任何正式保证。这在海洋学中尤其如此,因为自主水下系统(UxS)具有高度限制的计算和有效载荷要求,这需要路径规划方法在开环和欠驱动设置中产生稳健的数据收集策略。为了解决上述问题,我们提出将自适应搜索形式化为遍历覆盖问题,并研究在具有流动诱导动态的演变域中以遍历意义证明覆盖。我们扩展了最近的工作,证明最大均值差异(MMD)作为一个功能性遍历度量,并推导出一个流动自适应的公式,明确考虑覆盖目标中的域演变。我们展示了这种方法在环境流中保持遍历覆盖保证,并通过整合环境动态,使得在欠驱动甚至开环规划设置中有效探索。实验验证了我们的方法能够推广到包括海洋探索、跟踪人类和牛只运动在内的多样时空过程。在空中和腿式机器人平台上的物理实验验证了我们在非凸、流动受限环境中获得遍历覆盖的能力,同时尊重机器人动态。
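To make the idea of maximum mean discrepancy (MMD) as a coverage metric concrete, here is a minimal sketch that scores how well a set of visited points matches samples from a target distribution. It covers only the static case: the RBF kernel, the bandwidth value, and the function names are assumptions, and the flow-adaptive formulation from the paper is not reproduced.

```python
import math


def rbf(x, y, bandwidth=0.5):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf(x, y, bandwidth) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(trajectory, target_samples, bandwidth=0.5):
    """Biased estimate of squared MMD between visited points and target samples."""
    return (mean_kernel(trajectory, trajectory, bandwidth)
            - 2.0 * mean_kernel(trajectory, target_samples, bandwidth)
            + mean_kernel(target_samples, target_samples, bandwidth))


# Target: a uniform grid over the unit square (hypothetical search domain).
target = [(x / 4.0, y / 4.0) for x in range(5) for y in range(5)]
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
clumped = [(0.0, 0.0)] * 5

score_spread = mmd_squared(spread, target)
score_clumped = mmd_squared(clumped, target)
```

A trajectory that spreads over the domain yields a lower score than one that stays in a corner, so a planner could minimize this quantity to improve coverage.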
cs.RO / 39 / 2605.13452

CUBic: Coordinated Unified Bimanual Perception and Control Framework

CUBic:协调统一的双手感知与控制框架
Wang, Xingyu, Ding, Pengxiang, Xu, Jingkai, Wang, Donglin, Fan, Zhaoxin
Abstract
Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.
Chinese Translation
最近在视觉运动策略学习方面的进展使得机器人能够直接从视觉输入中进行控制。然而,由于需要独立感知和双臂之间的协调交互,将这种端到端学习从单臂扩展到双手操作仍然具有挑战性。现有的方法通常偏向于一方——要么解耦双臂以避免干扰,要么强制实施强烈的跨臂耦合以实现协调——因此缺乏统一的处理。我们提出了CUBic,一个协调统一的双手感知与控制框架,将双手协调重新表述为一个统一的感知建模问题。CUBic学习了一种共享的标记化表示,桥接感知与控制,其中独立性和协调性从结构中内生,而不是来自手工设计的耦合。我们的方法集成了三个组成部分:单向感知聚合、通过两个共享映射的代码本实现的双向感知协调,以及统一的感知到控制扩散策略。在RoboTwin基准上的广泛实验表明,CUBic始终超越标准基线,在协调精度和任务成功率方面显著优于最先进的视觉运动基线。
cs.RO / 40 / 2605.13500

Uncertainty-Aware 3D Position Refinement for Multi-UAV Systems

面向不确定性的多无人机系统三维位置精炼
Alamleh, Hosam, Pulatov, Damir
Abstract
Reliable real-time 3D localization is essential for multi-UAV navigation, collision avoidance, and coordinated flight, yet onboard estimates can degrade under GNSS multipath, non-line-of-sight reception, vertical drift, and intentional interference. This paper presents a decentralized, lightweight 3D position-refinement layer that improves robustness by fusing each Unmanned Aerial Vehicle (UAV)'s local estimate with neighbor-shared state summaries and inter-UAV range or proximity constraints. The method performs uncertainty-aware neighborhood fusion by weighting each UAV's prior according to its reported covariance and weighting neighbor constraints according to link quality, ranging uncertainty, and a learned trust score. To support practical deployment, the framework explicitly handles cold start and temporary localization loss by inflating or substituting weak priors, allowing trusted neighborhood constraints to bootstrap and stabilize estimates until absolute sensing recovers. To mitigate the impact of faulty or malicious participants, each UAV applies a local range-consistency check, smoothed over time, to down-weight or exclude neighbors whose reported positions are incompatible with observed inter-UAV distances. Simulation experiments with 10 UAVs in a 3D volume show that the proposed refinement substantially reduces mean localization error during cold start, remains competitive after local estimators stabilize, and maintains lower error as the fraction of malicious nodes increases compared with fusion without trust. These results suggest that the approach can serve as a practical resilience layer for swarm operation in challenging environments.
Chinese Translation
可靠的实时三维定位对于多无人机导航、避碰和协调飞行至关重要,但在全球导航卫星系统(GNSS)多路径、非视距接收、垂直漂移和故意干扰等情况下,机载估计可能会下降。本文提出了一种去中心化、轻量级的三维位置精炼层,通过将每个无人机(UAV)的局部估计与邻居共享的状态摘要和无人机间的距离或接近约束融合,从而提高了鲁棒性。该方法通过根据每个无人机报告的协方差加权其先验,并根据链路质量、测距不确定性和学习的信任评分加权邻居约束,执行面向不确定性的邻域融合。为了支持实际部署,该框架明确处理冷启动和临时定位丢失,通过膨胀或替代弱先验,使得可信的邻域约束能够引导和稳定估计,直到绝对传感器恢复。为了减轻故障或恶意参与者的影响,每个无人机应用局部范围一致性检查,随时间平滑,以降低权重或排除那些报告位置与观察到的无人机间距离不兼容的邻居。在一个包含10个无人机的三维空间中的仿真实验表明,所提出的精炼方法在冷启动期间显著降低了平均定位误差,在局部估计器稳定后仍保持竞争力,并且随着恶意节点比例的增加,相较于没有信任的融合方法保持较低的误差。这些结果表明,该方法可以作为在挑战性环境中群体操作的实用韧性层。
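The fusion step described in this abstract (weight the prior by its reported uncertainty, weight neighbor constraints by ranging uncertainty and trust, and down-weight neighbors that fail a range-consistency check) can be sketched as follows. This is a simplified illustration under stated assumptions: scalar variances replace full covariances, each range constraint is turned into a point by projecting the prior onto the range sphere, and the names `project_onto_range`, `update_trust`, and `refine`, as well as the `tol` and `alpha` values, are hypothetical.

```python
import math


def project_onto_range(own_pos, neighbor_pos, measured_range):
    """Closest point to own_pos on the sphere of the measured range around a neighbor."""
    d = math.dist(own_pos, neighbor_pos)
    if d == 0.0:
        return own_pos  # direction undefined; keep the prior
    scale = measured_range / d
    return tuple(n + (o - n) * scale for o, n in zip(own_pos, neighbor_pos))


def update_trust(trust, own_pos, neighbor_pos, measured_range, tol=1.0, alpha=0.3):
    """Exponentially smoothed range-consistency score in [0, 1]."""
    residual = abs(math.dist(own_pos, neighbor_pos) - measured_range)
    consistent = 1.0 if residual <= tol else 0.0
    return (1.0 - alpha) * trust + alpha * consistent


def refine(prior_pos, prior_var, neighbors):
    """Fuse the prior with neighbor range constraints.

    neighbors: list of (reported_pos, measured_range, range_var, trust).
    A large prior_var models cold start or temporary localization loss.
    """
    weights, points = [1.0 / prior_var], [prior_pos]
    for reported_pos, measured_range, range_var, trust in neighbors:
        if trust <= 0.0:
            continue  # excluded neighbor
        points.append(project_onto_range(prior_pos, reported_pos, measured_range))
        weights.append(trust / range_var)
    total = sum(weights)
    return tuple(sum(w * p[i] for w, p in zip(weights, points)) / total
                 for i in range(3))


# Cold start: a very uncertain prior is pulled toward the trusted range constraint.
refined = refine((0.0, 0.0, 0.0), 100.0, [((10.0, 0.0, 0.0), 8.0, 1.0, 1.0)])
```

Setting a neighbor's trust to zero removes it from the fusion entirely, so the estimate falls back to the prior when every neighbor is distrusted.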
cs.RO / 41 / 2605.13539

Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles

将代理模型集成到开放仿真架构中以进行自动驾驶车辆的场景测试
Geller, Christian, Becker, Daniel, Beckmann, Jobst, Eckstein, Lutz
Abstract
Simulative and scenario-based testing are crucial methods in the safety assurance for automated driving systems. To ensure that simulation results are reliable, the real world must be modeled with sufficient fidelity, including not only the static environment but also the surrounding traffic of a vehicle under test. Thus, the availability of traffic agent models is of common interest to model naturalistic and parameterizable behavior, similar to human drivers. The interchangeability of agent models across different simulation environments represents a major challenge and necessitates harmonization and standardization. To address this challenge, we present a standardized and modular simulation integration architecture that enables the tool-independent integration of traffic agent models. The architecture builds upon the Open Simulation Interface (OSI) as a structured message format and the Functional Mock-up Interface (FMI) for dynamic model exchange. Rather than introducing yet another model or simulation tool, we provide a reusable reference implementation that translates these standards into a practical integration blueprint, including clear interfaces, data mappings, and execution semantics. The generic nature of the architecture is demonstrated by integrating an exemplary agent model into three widely used simulation environments: OpenPASS, CARLA, and CarMaker. As part of the evaluation, we show that the model yields consistent behavior in all simulation platforms, thereby validating the interoperability, modularity, and standard compliance of the proposed architecture. The reference implementation lowers integration barriers, serves as a foundation for future research, and is made publicly available at github.com/ika-rwth-aachen/agent-model-integration
Chinese Translation
模拟和基于场景的测试是自动驾驶系统安全保障的重要方法。为了确保仿真结果的可靠性,必须以足够的逼真度对现实世界进行建模,这不仅包括静态环境,还包括被测车辆周围的交通。因此,交通代理模型的可用性对于模拟自然和可参数化的行为(类似于人类驾驶员)具有普遍意义。在不同仿真环境中代理模型的可互换性代表了一项重大挑战,并需要进行协调和标准化。为了解决这一挑战,我们提出了一种标准化和模块化的仿真集成架构,能够实现工具无关的交通代理模型集成。该架构基于开放仿真接口(Open Simulation Interface, OSI)作为结构化消息格式,以及功能模型接口(Functional Mock-up Interface, FMI)用于动态模型交换。我们提供了一种可重用的参考实现,将这些标准转化为实际的集成蓝图,包括明确的接口、数据映射和执行语义,而不是引入另一个模型或仿真工具。通过将一个示例代理模型集成到三个广泛使用的仿真环境中:OpenPASS、CARLA 和 CarMaker,展示了该架构的通用性。作为评估的一部分,我们展示了该模型在所有仿真平台中产生一致的行为,从而验证了所提架构的互操作性、模块化和标准合规性。该参考实现降低了集成障碍,为未来的研究奠定了基础,并已在github.com/ika-rwth-aachen/agent-model-integration上公开发布。
cs.RO / 42 / 2605.13548

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

AttenA+: 纠正机器人基础模型中的动作不平等
Peng, Daojie, Ma, Fulong, Cao, Jiahang, Zhang, Qiang, Xie, Xupeng, Guo, Jian, Luo, Ping, Luo, Andrew F., Zhou, Boyu, Ma, Jun
Abstract
Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
Chinese Translation
现有的机器人基础模型虽然功能强大,但其建立在一个隐含的时间均质性假设之上:在优化过程中将所有动作视为同等信息量。这种从语言建模继承而来的“平坦”训练范式,对操作的基础物理层次无动于衷。实际上,机器人轨迹在本质上是异质的,其中低速段往往通过对精度要求高的交互决定任务的成功,而高速运动则作为容错的过渡。这种均匀损失加权与物理关键性的错位,根本上限制了当前视觉-语言-动作(Vision-Language-Action, VLA)模型和世界-动作模型(World-Action Models, WAM)在复杂长时间任务中的表现。为了解决这一问题,我们提出了AttenA+,一个架构无关的框架,通过基于速度驱动的动作注意力优先考虑运动学关键段。通过根据逆速度场重新加权训练目标,AttenA+自然地将模型的学习能力与操作的物理需求对齐。作为一种即插即用的增强,AttenA+可以在不进行结构修改或增加额外参数的情况下集成到现有的主干网络中。大量实验表明,AttenA+显著提升了当前最先进模型的性能上限。具体而言,它将OpenVLA-OFT在Libero基准上的表现提高至98.6%(+1.5%),并将FastWAM在RoboTwin 2.0上的表现提升至92.4%(+0.6%)。在Franka操纵器上的实际验证进一步展示了其鲁棒性和跨任务的泛化能力。我们的工作表明,挖掘动作序列的内在结构先验为标准缩放法提供了一种高效的、物理感知的补充,为通用机器人控制开辟了一条新路径。
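One way to picture velocity-driven reweighting of a training objective is a per-step loss weight proportional to the inverse of the action speed. The sketch below is an assumption-laden illustration, not AttenA+ itself: the finite-difference speed estimate, the `eps` smoothing term, the mean-one normalization, and the function names are all hypothetical choices.

```python
import math


def inverse_velocity_weights(actions, dt=1.0, eps=1e-3):
    """Per-step weights proportional to 1 / (speed + eps), normalized to mean 1."""
    if not actions:
        raise ValueError("actions must contain at least one step")
    speeds = [math.dist(prev, curr) / dt for prev, curr in zip(actions, actions[1:])]
    # The final step has no successor, so it reuses the preceding speed.
    speeds.append(speeds[-1] if speeds else 0.0)
    raw = [1.0 / (s + eps) for s in speeds]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def weighted_l2_loss(pred, target, weights):
    """Mean of per-step squared errors, each scaled by its weight."""
    per_step = [sum((p - t) ** 2 for p, t in zip(ps, ts))
                for ps, ts in zip(pred, target)]
    return sum(w * l for w, l in zip(weights, per_step)) / len(per_step)


# Fast transit (steps of 1.0) followed by a slow, precise approach (steps of 0.1).
demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.1, 0.0), (2.2, 0.0)]
weights = inverse_velocity_weights(demo)
```

Normalizing the weights to mean one keeps the overall loss scale comparable to uniform weighting, so the change shifts emphasis toward slow segments without altering the effective learning rate.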
cs.RO / 43 / 2605.13613

Design of Magnetic Continuum Robots with Tunable Force Response Using Rotational Ring Pairs

使用可调力响应的旋转环对设计磁性连续机器人
Sayres, Alex, Pittiglio, Giovanni
Abstract
In this paper, we discuss a novel continuum robot design that enables the online tuning of the magnetic response at its tip. The proposed method allows for the change of both effective magnetic direction and intensity, introducing steering DOF without the need to control the external fields. This is unattainable with classical designs, which rely on fixed internal magnetic content and steer solely under the effect of a controllable magnetic field. The proposed robot design can be used in both controllable and fixed magnetic fields, potentially widening the clinical applicability of these robots. We experimentally show a max tip deflection of 33.8 mm from the resting state (23 % of the length of the robot). We discuss a model based on modified beam theory that captures the mechanical behavior of the continuum robot, with a mean absolute tip tracking error of 1.86 mm (1.2 % of the length) and maximum errors of less than 4.8 mm (3.2 % of the length) for all experimental points.
Chinese Translation
本文讨论了一种新颖的连续机器人设计,该设计能够在线调节其尖端的磁响应。所提出的方法允许有效磁方向和强度的变化,引入了转向自由度,而无需控制外部磁场。这是经典设计无法实现的,经典设计依赖于固定的内部磁内容,仅在可控磁场的作用下进行转向。所提出的机器人设计可以在可控和固定的磁场中使用,可能扩大这些机器人的临床适用性。我们通过实验展示了从静止状态到最大尖端偏转33.8毫米(占机器人长度的23%)。我们讨论了一种基于改进梁理论的模型,该模型捕捉了连续机器人的机械行为,平均绝对尖端跟踪误差为1.86毫米(占长度的1.2%),所有实验点的最大误差均小于4.8毫米(占长度的3.2%)。
cs.RO / 44 / 2605.13632

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

引导、思考、行动:视觉-语言-行动模型中的互动具身推理
Ling, Yiran, Lian, Qing, Li, Jinghang, Jiang, Qing, Zhang, Tianming, Jiang, Xiaoke, Liu, Chuanxiu, Liu, Jie, Zhang, Lei
Abstract
In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/
Chinese Translation
在本文中,我们提出了GTA-VLA(引导、思考、行动),一个互动的视觉-语言-行动(VLA)框架,能够通过允许用户用明确的视觉线索引导机器人策略,实现空间可操控的具身推理。现有的VLA模型学习从多模态观察到机器人动作的直接“感知-行动”映射。虽然在训练分布内有效,但这种紧密耦合的策略在域外(OOD)变化下表现脆弱,并且在发生失败时难以纠正。尽管最近的具身链式思维(CoT)方法暴露了中间推理过程,但仍缺乏纳入人类空间引导的机制,限制了其解决视觉歧义或从错误中恢复的能力。为了解决这一问题,我们的框架允许用户选择性地用空间先验(如可用性点、框和轨迹)引导策略,后续的推理过程可以直接以此为条件。基于这些输入,模型生成一个统一的空间-视觉链式思维,将外部引导与内部任务规划整合,使人类视觉意图与自主决策相一致。为了实现实际部署,我们进一步将推理模块与轻量级反应动作头结合,以实现高效的动作执行。大量实验表明我们的方法的有效性。在领域内的SimplerEnv WidowX基准测试中,我们的框架达到了81.2%的最新成功率。在OOD视觉变化和空间歧义下,单次视觉交互显著提高了任务成功率,突显了互动推理在具身控制中对失败恢复的价值。项目的详细信息可以在此找到:https://signalispupupu.github.io/GTA-VLA_ProjPage/
cs.RO / 45 / 2605.13646

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

基于因果关系的端到端自主驾驶:自我中心联合场景建模
Moon, Seokha, Lee, Minseung, Seo, Joon, Kim, Jinkyu, Lee, Jungbeom
Abstract
End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM.
Chinese Translation
端到端自主驾驶通过直接从传感器输入预测未来轨迹,绕过传统的模块化流程,近年来取得了显著进展。然而,现有方法往往忽视了自我车辆规划中的因果相互依赖关系,忽略了自我车辆与周围代理之间的相互关系。这种因果忽视导致了不一致和不可靠的轨迹预测,尤其是在需要共同推理自我决策和邻近代理行为的交互关键场景中。为了解决这一局限性,我们提出了CaAD(Causality-aware end-to-end Autonomous Driving),一个能够在共享潜在场景表示中捕捉这些依赖关系的因果感知端到端自主驾驶框架。首先,我们提出了一个基于边际预测分支的自我中心联合因果建模模块,学习自我车辆与交互相关代理之间的因果依赖关系。其次,我们采用了因果感知策略对齐阶段,通过联合模式嵌入将随机自我策略与从周围交通和地图上下文计算的规划导向闭环反馈对齐。在Bench2Drive和NAVSIM基准测试中,CaAD展示了强大的闭环规划性能,在Bench2Drive上获得了87.53的驾驶评分和71.81的成功率,在NAVSIM上获得了91.1的PDMS。
cs.RO / 46 / 2605.13665

Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

机器人鱿鱼游戏:四足运动在狭窄隧道中的应用
Raj, Amir Hossain, Das, Dibyendu, Xiao, Xuesu
Abstract
Quadruped robots demonstrate exceptional potential for navigating complex terrain in critical applications such as search and rescue missions and infrastructure inspection. However, autonomous traversal of confined 3D environments, including tunnels, caves, and collapsed structures, remains a significant challenge. Existing methods often struggle with rigid gait patterns, limited adaptability to diverse geometries, and reliance on oversimplified environmental assumptions. This paper introduces a Reinforcement Learning (RL) framework that combines procedural environment generation with policy distillation to enable robust locomotion across various tunnel configurations. Our approach leverages a teacher-student training paradigm, where specialized expert policies trained on procedurally generated tunnel geometries transfer their knowledge to a unified student policy. This strategy eliminates the need for complex reward shaping in end-to-end RL training, simplifying the process by breaking down complicated tasks into smaller, more manageable components that are easier for the robot to learn. By synthesizing diverse tunnel structures during training and distilling navigation strategies into a generalizable policy, our method achieves consistent traversal across complex spatial constraints where conventional approaches fail. We demonstrate through both simulation and real-world experiments that our method enables quadruped robots to successfully traverse challenging confined tunnel environments.
Chinese Translation
四足机器人在复杂地形中展现出卓越的潜力,尤其是在搜索与救援任务和基础设施检查等关键应用中。然而,自动穿越包括隧道、洞穴和倒塌结构在内的受限三维环境仍然是一个重大挑战。现有方法通常面临刚性步态模式、对多样几何形状的适应性有限以及依赖过于简化的环境假设等问题。本文提出了一种强化学习(Reinforcement Learning, RL)框架,结合程序生成环境与策略蒸馏,以实现跨越各种隧道配置的稳健运动。我们的方法利用教师-学生训练范式,其中在程序生成的隧道几何上训练的专门专家策略将其知识转移给统一的学生策略。这一策略消除了在端到端RL训练中对复杂奖励设计的需求,通过将复杂任务分解为更小、更易于管理的组件,简化了机器人学习的过程。通过在训练过程中合成多样的隧道结构,并将导航策略蒸馏为可泛化的策略,我们的方法在传统方法失效的复杂空间约束中实现了一致的穿越。我们通过模拟和现实世界实验展示了我们的方法使四足机器人能够成功穿越具有挑战性的受限隧道环境。
cs.RO / 47 / 2605.13741

LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

LEXI-SG:基于单目视觉的房间引导前馈重建的三维场景图映射
Kassab, Christina, Gil, Hyeonjae, Mattamala, Matías, Kim, Ayoung, Fallon, Maurice
Abstract
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed -- enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D dataset and self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.
Chinese Translation
场景图正成为机器人导航的标准表示,提供层次化的几何和语义场景理解。然而,大多数场景图映射方法依赖于深度相机或激光雷达传感器。在本研究中,我们提出了LEXI-SG,这是第一个仅使用RGB相机输入的开放词汇三维场景图的密集单目视觉映射系统。我们的方法利用开放词汇基础模型的语义先验将场景划分为房间,将前馈重建推迟到每个房间完全观察到时——从而实现可扩展的密集映射,而不会出现滑动窗口尺度不一致的问题。我们提出了一种基于房间的因子图构造,以全局对齐房间重建,同时保持局部地图一致性,并自然地施加语义场景图的层次结构。在每个房间内,我们进一步支持开放词汇的物体分割和跟踪。我们在Habitat-Matterport 3D的室内场景和自收集的自我中心办公室序列上验证了LEXI-SG。我们将其性能与现有的前馈SLAM方法以及已建立的场景图基线进行了评估。我们展示了改进的轨迹估计和密集重建,以及在开放词汇分割中的竞争性能。LEXI-SG表明,仅通过单目RGB就可以实现准确、可扩展的开放词汇三维场景图。我们的项目页面和办公室序列可在此查看:https://ori-drs.github.io/lexisg-web/
cs.RO / 48 / 2605.13748

TinySDP: Real Time Semidefinite Optimization for Certifiable and Agile Edge Robotics

TinySDP:用于可认证和灵活边缘机器人实时半正定优化
Mahajan, Ishaan, Arrizabalaga, Jon, Grillo, Andrea, Vega, Fausto, Anderson, James, Manchester, Zachary, Plancher, Brian
Abstract
Semidefinite programming (SDP) provides a principled framework for convex relaxations of nonconvex geometric constraints in motion planning, yet existing solvers are too computationally expensive for real-time control, particularly on resource-constrained embedded systems. To address this gap, we introduce TinySDP, the first semidefinite programming solver designed for embedded systems, enabling real-time model-predictive control (MPC) on microcontrollers for problems with nonconvex obstacle constraints. Our approach integrates positive-semidefinite cone projections into a cached-Riccati-based ADMM solver, leveraging computational structure for embedded tractability. We pair this solver with an a posteriori rank-1 certificate that converts relaxed solutions into explicit geometric guarantees at each timestep. On challenging benchmarks, e.g., cul-de-sac and dynamic obstacle avoidance scenarios that induce failures in local methods, TinySDP achieves collision-free navigation with up to 73% shorter paths than state-of-the-art baselines. We validate our approach on a Crazyflie quadrotor, demonstrating that semidefinite constraints can be enforced at real-time rates for agile embedded robotics.
Chinese Translation
半正定规划(SDP)为运动规划中的非凸几何约束提供了一个原则性框架的凸松弛,然而现有的求解器在实时控制中计算开销过大,尤其是在资源受限的嵌入式系统上。为了解决这一问题,我们提出了TinySDP,这是第一个为嵌入式系统设计的半正定规划求解器,能够在微控制器上实现实时模型预测控制(MPC),以应对具有非凸障碍约束的问题。我们的方法将正半定锥投影集成到基于缓存的Riccati ADMM求解器中,利用计算结构实现嵌入式可处理性。我们将该求解器与后验秩-1证书配对,能够在每个时间步将松弛解转换为明确的几何保证。在具有挑战性的基准测试中,例如导致局部方法失败的死胡同和动态障碍物规避场景,TinySDP实现了无碰撞导航,其路径比最先进的基线短多达73%。我们在Crazyflie四旋翼上验证了我们的方法,展示了半正定约束可以以实时速率在灵活的嵌入式机器人中得到执行。
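The positive-semidefinite cone projection mentioned in the TinySDP abstract can be illustrated on the smallest nontrivial case. The sketch below is not the TinySDP solver: it assumes a single symmetric 2x2 block, uses the closed-form eigendecomposition, and the function name `project_psd_2x2` is hypothetical. It only shows the kind of step an ADMM iteration would apply to a lifted variable, namely clipping negative eigenvalues to zero.

```python
import math


def project_psd_2x2(a, b, c):
    """Project the symmetric matrix [[a, b], [b, c]] onto the PSD cone.

    The projection (in Frobenius norm) keeps the eigenvectors and replaces
    each eigenvalue lam by max(lam, 0).
    """
    m = 0.5 * (a + c)                      # mean of the two eigenvalues
    r = math.hypot(0.5 * (a - c), b)       # half the eigenvalue gap
    lam1, lam2 = m + r, m - r
    if r == 0.0:
        # The matrix is a multiple of the identity.
        d = max(m, 0.0)
        return [[d, 0.0], [0.0, d]]
    # Spectral projector onto the eigenvector of lam1: (M - lam2*I) / (lam1 - lam2).
    p11 = (a - lam2) / (2.0 * r)
    p12 = b / (2.0 * r)
    p22 = (c - lam2) / (2.0 * r)
    w1, w2 = max(lam1, 0.0), max(lam2, 0.0)
    # Result = w1 * P1 + w2 * (I - P1).
    off = (w1 - w2) * p12
    return [[w1 * p11 + w2 * (1.0 - p11), off],
            [off, w1 * p22 + w2 * (1.0 - p22)]]


if __name__ == "__main__":
    print(project_psd_2x2(0.0, 1.0, 0.0))
```

A matrix that is already PSD is returned unchanged, while an indefinite one loses its negative eigen-direction. A general solver would need a full eigendecomposition for larger blocks.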
cs.RO / 49 / 2605.13751

Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles

学习责任归属对抗场景以测试自动驾驶车辆
Xiao, Yizhuo, Yan, Haotian, Wang, Ying, Zhu, Zhongpan, Zhang, Yuxin, Yan, Xintao, Erden, Mustafa Suphi, Wang, Cheng
Abstract
Establishing trustworthy safety assurance for autonomous driving systems (ADSs) requires evidence that failures arise from avoidable system deficiencies rather than unavoidable traffic conflicts. Current adversarial simulation methods can efficiently expose collisions, but generally lack mechanisms to distinguish these fundamentally different failure modes. Here we present CARS (Context-Aware, Responsibility-attributed Scenario generation), a framework that integrates responsibility attribution directly into adversarial scenario generation. CARS combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation to construct collision scenarios that are both physically feasible and diagnostically attributable. Across benchmark datasets spanning heterogeneous national traffic environments, CARS consistently discovers feasible collision scenarios with high attribution rates under multiple regulation-prescribed careful and competent driver models. By coupling adversarial generation with normative responsibility assessment, CARS moves simulation testing beyond collision discovery toward the construction of interpretable, regulation-aligned safety evidence for scalable ADS validation.
Chinese Translation
建立可信的自动驾驶系统(ADS)的安全保障需要证据表明故障源于可避免的系统缺陷,而非不可避免的交通冲突。目前的对抗模拟方法能够有效暴露碰撞,但通常缺乏区分这些根本不同故障模式的机制。在此,我们提出了CARS(上下文感知、责任归属场景生成),这是一个将责任归属直接整合到对抗场景生成中的框架。CARS结合了上下文感知的对手选择与在闭环模拟中优化的生成对抗策略,以构建既在物理上可行又可诊断归属的碰撞场景。在涵盖异质国家交通环境的基准数据集上,CARS在多个法规规定的谨慎和称职驾驶员模型下,始终发现具有高归属率的可行碰撞场景。通过将对抗生成与规范责任评估相结合,CARS将模拟测试从碰撞发现推进到构建可解释的、符合规范的安全证据,以支持可扩展的ADS验证。
cs.RO / 50 / 2605.13754

Manipulation Planning for Construction Activities with Repetitive Tasks

针对重复任务的建筑活动操控规划
Liu, Wangyi, Mahalingam, Dasharadhan, Gao, Fanru, Liang, Ci-Jyun, Chakraborty, Nilanjan
Abstract
In this paper, we study the problem of manipulation skill acquisition for performing construction activities consisting of repetitive tasks (e.g., building a wall or installing ceiling tiles). Our approach involves setting up a simulated construction activity in a Virtual Reality (VR) environment, where the user can provide demonstrations of the object manipulation skills needed to perform the construction activity. We then exploit the screw geometry of motion to approximate the demonstrated motion as a sequence of constant screw motions. For performing the construction activity, we generate the sequence of manipulation task instances and then compute the joint space motion plan corresponding to each instance using Screw Linear Interpolation (ScLERP) and Resolved Motion Rate Control (RMRC). We evaluate our framework by executing two representative construction tasks: constructing brick walls and installing multiple ceiling tiles. Each task is performed using only a single demonstration, a pick-and-place action for the bricks, and a single ceiling tile installation. Our experiments with a 7-DoF robot in both simulation and hardware demonstrate that the approach generalizes robustly to arbitrarily long construction activities that involve repetitive motions and demand precision, even when provided with just one demonstration. For instance, we can construct walls of arbitrary layout and length by leveraging a single demonstration of placing one brick on top of another.
Chinese Translation
本文研究了执行由重复任务组成的建筑活动(例如,建造墙壁或安装天花板瓷砖)的操控技能获取问题。我们的方法是在虚拟现实(VR)环境中设置一个模拟建筑活动,用户可以提供执行该建筑活动所需的物体操控技能的演示。然后,我们利用运动的螺旋几何特性,将演示的运动近似为一系列恒定的螺旋运动。为了执行建筑活动,我们生成操控任务实例的序列,并使用螺旋线性插值(Screw Linear Interpolation, ScLERP)和解析运动速率控制(Resolved Motion Rate Control, RMRC)计算每个实例对应的关节空间运动规划。我们通过执行两个代表性的建筑任务进行评估:构建砖墙和安装多个天花板瓷砖。每个任务仅使用一次演示,即对砖块的拾取和放置动作,以及一次天花板瓷砖的安装。我们在模拟和硬件环境中使用7自由度(7-DoF)机器人进行的实验表明,该方法能够稳健地推广到涉及重复运动并要求精确度的任意长建筑活动,即使只提供一次演示。例如,我们可以通过利用一次将一块砖放在另一块砖上方的演示,构建任意布局和长度的墙壁。
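For readers unfamiliar with the two tools named in this abstract, the standard textbook forms are sketched below. The notation is an assumption made here (unit dual quaternions $A$, $B$ for the start and goal poses, $J$ for the manipulator Jacobian) and is not taken from the paper.

```latex
% Screw Linear Interpolation between poses A and B (unit dual quaternions):
\[
  \mathrm{ScLERP}(\tau; A, B) = A \otimes \left(A^{*} \otimes B\right)^{\tau},
  \qquad \tau \in [0, 1].
\]
% With D = A^{*} \otimes B written in screw form, using the dual angle
% \hat{\theta} = \theta + \epsilon d and the dual axis \hat{s} = l + \epsilon m:
\[
  D = \cos\frac{\hat{\theta}}{2} + \hat{s}\,\sin\frac{\hat{\theta}}{2},
  \qquad
  D^{\tau} = \cos\frac{\tau\hat{\theta}}{2} + \hat{s}\,\sin\frac{\tau\hat{\theta}}{2}.
\]
% Resolved Motion Rate Control maps a desired end-effector twist V to joint rates
% through the Jacobian pseudoinverse:
\[
  \dot{q} = J(q)^{+}\, V .
\]
```

Because $D^{\tau}$ scales the rotation angle $\theta$ and the translation $d$ along one fixed screw axis, each interpolated segment is a constant screw motion, which matches the approximation described in the abstract.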
cs.RO / 51 / 2605.13757

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

FrameSkip:在视觉-语言-动作训练中从更少但更具信息量的帧中学习
Yu, Bin, Lian, Shijie, Lin, Xiaopeng, Shen, Zhaolong, Wei, Yuliang, Wu, Changti, Yuan, Hang, Liu, Haishan, Wang, Bailing, Huang, Cong, Chen, Kai
Abstract
Vision-Language-Action (VLA) policies are commonly trained from dense robot demonstration trajectories, often collected through teleoperation, by sampling every recorded frame as if it provided equally useful supervision. We argue that this convention creates a temporal supervision imbalance: long low-change segments dominate the training stream, while manipulation-critical transitions such as alignment, contact, grasping, and release appear only sparsely. We introduce FrameSkip, a data-layer frame selection framework that scores trajectory frames using action variation, visual-action coherence, task-progress priors, and gripper-transition preservation, then remaps training samples toward high-importance frames under a target retention ratio. Because FrameSkip operates only in the dataloader, it leaves the VLA architecture, action head, training objective, and inference procedure unchanged. Across RoboCasa-GR1, SimplerEnv, and LIBERO, FrameSkip improves the success-retention trade-off over full-frame training and simpler frame selection variants, achieving a macro-average success rate of 76.15% across the three benchmarks compared with 66.50% for full-frame training while using a compressed trajectory view that retains 20% of unique frames in the main setting.
Chinese Translation
视觉-语言-动作(VLA)策略通常通过密集的机器人演示轨迹进行训练,这些轨迹通常是通过遥操作收集的,采样每个记录的帧,仿佛它们提供了同等有用的监督。我们认为这种惯例造成了时间监督的不平衡:长时间的低变化段主导了训练流,而对操作至关重要的过渡(如对齐、接触、抓取和释放)则仅稀疏出现。我们提出了FrameSkip,这是一种数据层帧选择框架,通过动作变化、视觉-动作一致性、任务进展先验和夹具过渡保留对轨迹帧进行评分,然后在目标保留比例下重新映射训练样本到高重要性帧。由于FrameSkip仅在数据加载器中操作,它不改变VLA架构、动作头、训练目标和推理过程。在RoboCasa-GR1、SimplerEnv和LIBERO上,FrameSkip提高了成功保留的权衡,相较于全帧训练和更简单的帧选择变体,三项基准的宏平均成功率达到了76.15%,而全帧训练的成功率为66.50%,同时在主要设置中使用保留20%独特帧的压缩轨迹视图。
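The idea of frame selection under a retention ratio can be sketched in a few lines. This is a simplified illustration, not the FrameSkip scoring function: it assumes only two of the four cues named in the abstract (action variation and gripper-transition preservation), and the name `select_frames` and its signature are hypothetical.

```python
import math


def select_frames(actions, gripper, retention=0.2):
    """Return sorted indices of frames to keep.

    actions:   list of equal-length tuples, one action vector per frame
    gripper:   list of discrete gripper states, one per frame
    retention: target fraction of frames to keep
    """
    n = len(actions)
    # Action variation: distance between consecutive action vectors.
    scores = [0.0] * n
    for t in range(1, n):
        scores[t] = math.dist(actions[t], actions[t - 1])
    # Frames where the gripper state changes are always preserved.
    forced = {t for t in range(1, n) if gripper[t] != gripper[t - 1]}
    budget = max(len(forced), 1, round(retention * n))
    # Fill the remaining budget with the highest-scoring frames.
    ranked = sorted((t for t in range(n) if t not in forced),
                    key=lambda t: -scores[t])
    return sorted(forced | set(ranked[:budget - len(forced)]))


if __name__ == "__main__":
    acts = [(0.0,), (0.0,), (0.0,), (1.0,), (1.0,),
            (1.0,), (1.0,), (3.0,), (3.0,), (3.0,)]
    grip = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    print(select_frames(acts, grip, retention=0.2))
```

Because a selector of this kind returns indices only, it can sit inside a dataloader and leave the model and training objective untouched, which is the property the abstract emphasizes.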
cs.RO / 52 / 2605.13775

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

RoboEvolve:一种用于有限数据下机器人操作的协同进化规划者-模拟器
Chen, Harold Haodong, Chen, Sirui, Xu, Yingjie, Ge, Wenhang, Chen, Ying-Cong
Abstract
The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds--a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.
Chinese Translation
机器人的操作可扩展性在根本上受到任务对齐的物理交互数据稀缺的限制。尽管视觉-语言模型(VLMs)和视频生成模型(VGMs)在自主数据合成方面展现了潜力,但它们分别存在语义-空间错位和物理幻觉的问题。为了解决这一问题,我们提出了RoboEvolve,一个将VLM规划者和VGM模拟器结合成相互强化的协同进化循环的新框架。RoboEvolve完全基于未标记的种子图像,利用一种受认知启发的双阶段机制:(i) 白天探索通过语义控制的多粒度奖励促进基于物理的行为发现,(ii) 夜间巩固挖掘“近失误”失败以稳定策略优化。在自主渐进课程的指导下,该系统自然地从简单的原子动作扩展到复杂任务。大量实验表明,RoboEvolve (I) 实现了卓越的有效性,使基础规划者提高了30个绝对点,模拟器的成功率平均提高了48%;(II) 展现出极高的数据效率,仅用500个未标记的种子就超越了完全监督的基线——减少了50倍;(III) 展示了强大的持续学习能力而没有灾难性遗忘。
cs.RO / 53 / 2605.13778

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

实时-VLA FLASH:基于扩散的视觉-语言-行动模型的推测推理框架
Niu, Jiahui, Gu, Kefan, Zhao, Yucheng, Liang, Shengwen, Wang, Tiancai, Hu, Xing, Wang, Ying, Li, Huawei
Abstract
Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
Chinese Translation
基于扩散的视觉-语言-行动模型(dVLAs)在具身智能方面具有很大的潜力,但由于完整推理的高延迟,实时部署受到根本性限制。我们提出了实时-VLA FLASH,这是一种推测推理框架,通过引入轻量级草稿模型和通过主模型的行动专家进行并行验证,消除了在重新规划过程中大多数完整推理调用,并且在需要时采用相位感知的回退机制恢复到完整推理管道。该设计实现了低延迟、高频率的重新规划,而不牺牲可靠性。实验表明,在LIBERO上,FLASH通过将许多58.0毫秒的完整推理回合替换为最快7.8毫秒的推测回合,显著保持了任务性能,将任务级平均推理延迟降低至19.1毫秒(加速比为3.04倍)。我们还在实际的输送带排序中展示了其有效性,突显了其对延迟敏感的具身任务的实际影响。
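The latency figures in this abstract can be cross-checked with simple arithmetic. The sketch below assumes a two-latency model that the abstract does not state explicitly: every round is either a full round at 58.0 ms or a speculative round, and speculative rounds cost at least 7.8 ms. The function names are hypothetical.

```python
def speedup(full_ms, avg_ms):
    """Speedup of the average latency relative to always running full inference."""
    return full_ms / avg_ms


def min_speculative_fraction(full_ms, spec_ms, avg_ms):
    """Fraction p of speculative rounds solving p*spec + (1 - p)*full = avg.

    If speculative rounds are sometimes slower than spec_ms, a larger
    fraction is needed, so this value is a lower bound under the model.
    """
    return (full_ms - avg_ms) / (full_ms - spec_ms)


if __name__ == "__main__":
    print(round(speedup(58.0, 19.1), 2))
    print(round(min_speculative_fraction(58.0, 7.8, 19.1), 3))
```

The first value reproduces the reported 3.04x speedup. Under the stated assumptions, the second suggests that roughly three quarters or more of the replanning rounds would have to be speculative.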
cs.RO / 54 / 2605.13782

LMPath: Language-Mediated Priors and Path Generation for Aerial Exploration

LMPath:用于空中探索的语言介导先验和路径生成
Diller, Jonathan A., Cladera, Fernando, Taylor, Camillo J., Kumar, Vijay
Abstract
Traditional autonomous UAV search missions rely on geometric coverage patterns that ignore the semantic context of the target, leading to significant time waste in large-scale environments. In this paper, we present LMPath, a pipeline for generating language-mediated exploration priors for Unmanned Aerial Vehicle (UAV) search missions that leverages semantics. Given a basic geofence and an object of interest prompt, LMPath uses generative language models to determine what regions of the environment should contain that object and a foundation vision model run over satellite imagery to segment sub-regions that form the exploration prior. This prior can then be used to generate UAV paths with various objectives, such as minimizing the expected time to locate the object of interest, maximizing the probability that the object is found given a limited travel distance, or narrowing down the search space to sub-regions that are most likely to contain the object. To demonstrate its capabilities, we used LMPath to generate various UAV paths and ran them using a real UAV over large-scale environments. We also ran simulations to demonstrate how paths generated using LMPath outperform traditional path planning approaches for search missions.
Chinese Translation
传统的自主无人机搜索任务依赖于几何覆盖模式,这些模式忽视了目标的语义上下文,导致在大规模环境中浪费大量时间。本文提出了LMPath,一个生成语言介导探索先验的管道,旨在用于无人机(UAV)搜索任务,充分利用语义信息。给定一个基本的地理围栏和一个感兴趣对象的提示,LMPath利用生成语言模型来确定环境中应包含该对象的区域,并通过在卫星影像上运行基础视觉模型来分割出形成探索先验的子区域。该先验可以用于生成具有多种目标的无人机路径,例如最小化找到感兴趣对象的预期时间、在有限的旅行距离内最大化找到该对象的概率,或缩小搜索空间至最有可能包含该对象的子区域。为了展示其能力,我们使用LMPath生成了多条无人机路径,并在大规模环境中使用真实无人机进行了测试。我们还进行了模拟,展示了使用LMPath生成的路径在搜索任务中优于传统路径规划方法的表现。
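One of the objectives listed in this abstract, minimizing the expected time to locate the object, has a simple form once a prior over sub-regions is available. The sketch below is not the LMPath planner. It assumes that each sub-region has a prior probability and a fixed search time, that detection is perfect, and that travel time between regions is ignored. Under those assumptions, visiting regions in decreasing order of probability per unit search time minimizes the expected time. The region names and numbers are illustrative placeholders.

```python
def order_regions(regions):
    """Sort (name, probability, search_time) tuples by probability / search_time."""
    return sorted(regions, key=lambda r: r[1] / r[2], reverse=True)


def expected_time(ordered):
    """Expected time to find the object when regions are searched in order.

    The object is assumed to be found at the end of the search of the
    region that contains it.
    """
    elapsed = 0.0
    total = 0.0
    for _name, prob, search_time in ordered:
        elapsed += search_time
        total += prob * elapsed
    return total


if __name__ == "__main__":
    regions = [("field", 0.2, 10.0), ("parking", 0.5, 5.0), ("roadside", 0.3, 5.0)]
    plan = order_regions(regions)
    print([name for name, _, _ in plan], expected_time(plan))
```

With real travel distances between regions, the ordering becomes a routing problem and this greedy rule is no longer guaranteed to be optimal.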
cs.RO / 55 / 2605.13822

Loiter UAV Reinsertion Guidance for Fixed-wing UAV Corridors

固定翼无人机走廊的滞留无人机再插入引导
J, Pradeep, Siddhardha, Kedarisetty, Ratnoo, Ashwini
Abstract
This paper considers fixed-wing unmanned aerial vehicle (UAV) corridors comprising a main lane, a circular loiter lane for managing traffic congestion, and transit lanes connecting the two. In particular, we address the problem of conflict-free reinsertion of UAVs from the loiter lane back into the main lane. The loiter lane contains a fixed number of equidistant virtual slots that UAVs can occupy. Reinsertion of loiter UAVs into the main lane becomes essential either due to reduced traffic in the main lane or due to a loiter UAV needing to reach its destination urgently. Given the total number of loiter slots, UAV speed limits, and the minimum safety distance, a guidance algorithm is developed to compute the required speed of a loiter UAV in the transit lane to ensure safe reinsertion. The proposed guidance and automation strategies are validated through numerical simulations.
Chinese Translation
本文考虑了由主通道、用于管理交通拥堵的圆形滞留通道以及连接两者的过渡通道组成的固定翼无人机(UAV)走廊。特别地,我们解决了从滞留通道无冲突地将无人机再插入主通道的问题。滞留通道包含固定数量的等距虚拟插槽,供无人机占用。由于主通道交通减少或滞留无人机急需到达目的地,滞留无人机再插入主通道变得至关重要。根据滞留插槽的总数、无人机的速度限制和最小安全距离,开发了一种引导算法,以计算滞留无人机在过渡通道中所需的速度,以确保安全再插入。通过数值仿真验证了所提出的引导和自动化策略。
计算机视觉 (Computer Vision)
123
cs.CV / 1 / 2605.12506

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Scale-Gest:可扩展的模型空间合成与运行时选择用于设备端手势检测
Basit, Abdul, Rehman, Saim, Shafique, Muhammad
Abstract
Realizing on-device ML-based gesture detection under tight real-time performance, energy, and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced-complexity detection. To evaluate the performance of our system in real-world car driving scenarios, we introduce a temporally annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
Chinese Translation
在严格的实时性能、能量和内存限制下实现基于机器学习的设备端手势检测是具有挑战性的,尤其是在考虑到电池电量水平各异的移动设备时。现有的边缘人工智能(EdgeAI)部署通常依赖于单一固定的检测器,这限制了优化的机会。我们提出了Scale-Gest,一个新颖的运行时自适应手势检测框架,它将检测器空间扩展为一个密集的tiny-YOLO架构家族。通过分析不同模型分辨率-步幅操作点,我们引入了多个新颖的设备校准ACE(准确性-复杂性-能量)配置文件。一个轻量级的运行时控制器在用户定义和电池约束下选择合适的ACE模式,同时,一个运动感知的手势跟踪感兴趣区域(ROI)门裁剪输入,以减少复杂性检测。为了评估我们系统在真实驾驶场景下的性能,我们引入了一个时间标注的驾驶员模拟手势(DSG-18)数据集。与单一检测器方法相比,Scale-Gest在显著降低能量和延迟的同时,保持事件级F1得分。在一台运行手势流的电池供电笔记本电脑上,我们的ACE控制器将每帧能量降低了4倍(从6.9 mJ降至1.6 mJ),同时保持高手势检测性能(事件级F1 = 0.8-0.9)和低平均延迟(6毫秒)。
cs.CV / 2 / 2605.12528

MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

MorphOPC:通过多尺度层次形态学习推进掩模优化
Hu, Yuting, Zhuang, Lei, Wang, Chen, Qin, Ruiyang, Xiang, Hua, Nam, Gi-joon, Xiong, Jinjun
Abstract
As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly challenging. Optical proximity correction (OPC) is widely used to ensure pattern fidelity and manufacturability. Recent generative mask optimization models based on encoder-decoder architecture can synthesize near-optimal masks, serving as fast machine learning (ML) surrogates for traditional OPC. However, these models often fail to capture the geometric transformations from target layouts to mask patterns, leading to suboptimal quality. In this work, we formulate mask generation as a sequence of morphological operations on local layout features and propose MorphOPC, a multi-scale hierarchical model with neural morphological modules to learn these transformations. Experiments on edge-based OPC and ILT benchmarks across metal and via layers show that MorphOPC consistently outperforms state-of-the-art methods, achieving higher printing fidelity and lower manufacturing cost, demonstrating strong potential for scalable mask optimization.
Chinese Translation
随着特征尺寸缩小到纳米级,从光掩模到硅晶圆准确传输电路图案变得愈加困难。光学邻近校正(Optical Proximity Correction,OPC)被广泛应用于确保图案的保真性和可制造性。最近基于编码器-解码器架构的生成掩模优化模型能够合成近似最优的掩模,作为传统OPC的快速机器学习(Machine Learning,ML)替代方案。然而,这些模型往往无法捕捉从目标布局到掩模图案的几何变换,导致质量不佳。在本研究中,我们将掩模生成形式化为对局部布局特征的一系列形态操作,并提出了MorphOPC,一个具有神经形态模块的多尺度层次模型,以学习这些变换。在金属层和通孔层的基于边缘的OPC和ILT基准测试中的实验表明,MorphOPC始终优于最先进的方法,实现了更高的印刷保真度和更低的制造成本,展示了其在可扩展掩模优化中的强大潜力。
cs.CV / 3 / 2605.12545

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

CROP:通过组合推理和优化偏好实现专家对齐的图像裁剪
Dong, Zhitong, Li, Chao, Yu, Jie, Chen, Hao
Abstract
Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases, lacking adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address the above issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an "analysis-proposal-decision" process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model's decision consistent with human expert aesthetics. Extensive experiments across multiple datasets validate our method's superiority and component effectiveness.
Chinese Translation
美学图像裁剪旨在通过空间裁剪提升图像的构图,从而增强其美学质量。以往的方法通常依赖于显著性预测或检索增强,忽视了任务的核心要求:对构图和美学的深刻理解。因此,基于显著性的方法在复杂场景中难以进行构图权衡,而基于检索的方法则盲目参考相似案例,缺乏对独特场景的自适应推理。这两种方法都未能使其自动裁剪结果与人类专家的结果对齐。为了解决上述问题,我们提出了一种新颖的范式,将美学裁剪重新定义为多模态推理任务,旨在激活视觉语言模型(VLM)在美学方面的分析和理解能力。我们设计了一种组合推理和优化偏好方法(CROP),引导VLM像专业摄影师一样思考。它将复杂且主观的美学问题分解为“分析-提议-决策”过程,通过对场景元素和构图原则的分析逐步推理。同时,我们的专家偏好对齐模块使模型的决策与人类专家的美学保持一致。针对多个数据集的广泛实验验证了我们方法的优越性和组件的有效性。
cs.CV / 4 / 2605.12549

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

解码前发生了什么?预填充决定了视觉语言模型中的GUI基础
Lin, Jiaping, Shen, Fei, Li, Junzhe, Nie, Ping, Yu, Fei, Li, Ming, Li, Haizhou
Abstract
Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.
Chinese Translation
现有的无训练GUI基础方法通常依赖于多次推理运行,例如迭代裁剪或候选聚合,以识别目标元素。尽管增加了额外的计算,每次前向传播仍然独立地解释指令并解析视觉布局,而未能实现视觉标记之间的逐步交互。本文研究了视觉语言模型(VLMs)中GUI基础的过程,并识别出一个之前被忽视的瓶颈。我们表明,基础遵循一个两阶段的范式:预填充阶段确定候选用户界面元素,而解码阶段随后细化最终坐标。这种不对称性确立了预填充作为关键步骤,因为在解码过程中无法有效纠正候选选择中的错误。基于这一观察,我们提出了Re-Prefill,这是一种无训练的方法,通过引入一个注意力引导的第二预填充阶段来细化目标选择,从而重新审视推理。具体而言,在各层中,从查询位置(即最终标记)持续获得高注意力的视觉标记被提取为初步目标假设,并与指令的隐藏状态一起附加到输入中,使模型在坐标生成之前能够深入重新思考其决策。针对四个VLM和五个基准(包括ScreenSpot-Pro、ScreenSpot-V2、OSWorld-G、UI-Vision和MMBench-GUI)的实验表明,在没有额外训练的情况下,性能持续提升,在ScreenSpot-Pro上提升幅度达到4.3%。代码将发布在https://github.com/linjiaping1/Re-Prefill。
cs.CV / 5 / 2605.12550

SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting

SSDA:通过双重适应弥合光谱和结构差距以实现基于视觉的时间序列预测
Zhang, Mingrui, Yang, Hanchen, Li, Wengen, Jiang, Xudong, Zhang, Yichao, Guan, Jihong, Zhou, Shuigeng
Abstract
Large vision models (LVMs) have recently proven to be surprisingly effective time series forecasters, simply by rendering temporal data as images. This success, however, rests on a largely unexamined premise: the rendered time series images are sufficiently close to natural images for knowledge in pre-trained models to transfer effectively. We argue that two gaps still remain, i.e., spectral and structural gaps, fundamentally limiting the potential of LVMs for time series forecasting. Spectrally, we systematically reveal that rendered time series images exhibit a markedly shallower power spectrum than the natural images LVMs are pre-trained to recognize. Structurally, reshaping 1D temporal sequences into 2D grids fabricates spurious spatial adjacencies while severing genuine temporal continuities, misleading the spatial inductive biases of pre-trained LVMs. To bridge these gaps, we propose SSDA, a dual-branch network that spectrally and structurally adapts to unlock the full potential of LVMs for time series forecasting. At the data level, a Spectral Magnitude Aligner (SMA) applies 2D FFT to selectively enhance the magnitude spectrum toward natural-image statistics while preserving phase. At the model level, a Structural-Guided Low-Rank Adaptation (SG-LoRA) injects position-aware temporal encodings into patch embeddings and adapts attention via low-rank updates. The two branches are further adaptively fused to produce the final forecast. Extensive experiments on seven real-world benchmarks demonstrate that SSDA consistently outperforms strong LVM- and LLM-based baselines under both full-shot and few-shot settings. Code is publicly available at https://anonymous.4open.science/r/SSDA-8C5B.
Chinese Translation
大型视觉模型(LVMs)最近被证明在时间序列预测中表现出惊人的有效性,仅仅通过将时间数据呈现为图像。然而,这一成功在很大程度上建立在一个尚未充分检验的前提之上:渲染的时间序列图像与自然图像之间的相似性足以使预训练模型中的知识有效转移。我们认为,仍然存在两个差距,即光谱差距和结构差距,根本上限制了LVMs在时间序列预测中的潜力。在光谱方面,我们系统性地揭示了渲染的时间序列图像的功率谱明显比LVMs预训练识别的自然图像要浅。在结构方面,将一维时间序列重塑为二维网格制造了虚假的空间邻接,同时切断了真实的时间连续性,误导了预训练LVMs的空间归纳偏差。为了弥合这些差距,我们提出了SSDA,一种双分支网络,能够在光谱和结构上进行适应,以释放LVMs在时间序列预测中的全部潜力。在数据层面,光谱幅度对齐器(SMA)应用二维快速傅里叶变换(2D FFT)选择性地增强幅度谱,以接近自然图像统计,同时保留相位。在模型层面,结构引导低秩适应(SG-LoRA)将位置感知的时间编码注入到补丁嵌入中,并通过低秩更新进行适应性调整。两个分支进一步自适应融合以生成最终预测。在七个真实世界基准上的大量实验表明,SSDA在全样本和少样本设置下均持续优于强大的基于LVM和LLM的基线。代码已公开发布,网址为 https://anonymous.4open.science/r/SSDA-8C5B。
cs.CV / 6 / 2605.12556

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

M2Retinexformer:用于低光照图像增强的多模态Retinexformer
Aboelwafa, Youssef, Elmongui, Hicham G., Torki, Marwan
Abstract
Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at https://github.com/YoussefAboelwafa/M2Retinexformer
Chinese Translation
低光照图像增强面临复杂退化的挑战,包括噪声放大、伪影和颜色失真。尽管基于Retinex的深度学习方法已取得了令人鼓舞的成果,但它们主要依赖于单一模态的RGB信息。我们提出了M2Retinexformer(多模态Retinexformer),这是一个新颖的框架,通过在渐进式精炼管道中结合深度线索、亮度先验和语义特征,扩展了Retinexformer。深度提供了对光照变化不变的几何上下文,而亮度和语义特征则为亮度分布和场景理解提供了明确的指导。模态在多个尺度上提取,并通过交叉注意力进行融合,适应性门控根据辅助线索的可靠性动态平衡基于光照的自注意力和交叉注意力。在LOL、SID、SMID和SDSD基准上的评估显示,整体性能优于Retinexformer和最近的最先进方法。代码和预训练权重可在https://github.com/YoussefAboelwafa/M2Retinexformer获取。
cs.CV / 7 / 2605.12567

Pyramid Self-contrastive Learning Framework for Test-time Ultrasound Image Denoising

用于测试时超声图像去噪的金字塔自对比学习框架
Zhang, Jiajing, Dai, Bingze, Zhang, Xi, Xu, Yue, Lee, Wei-Ning
Abstract
The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods require massive labeled data and model parameters. Both the predefined and the pretrained approaches suffer an inevitable domain shift in complex in vivo environments, so they are limited to a specific noise type and often blur structural details. In this study, we propose a pure test-time training framework for one-shot ultrasound image denoising and apply it to synthetic aperture ultrasound (SAU), which synthesizes transmit focus from sub-aperture transmissions. Our Aperture-to-Aperture (A2A) framework disentangles anatomical similarity and noise randomness from shuffled sub-apertures through self-contrastive learning in pyramid latent spaces. The clean image is then decoded from the anatomy space, while the noise space is discarded. A2A is trained at test time on one noisy sample of SAU signals, so it fundamentally eliminates the domain shift and pretraining costs. Simulation experiments, including electronic noise levels of 0 to 30 dB and different inclusion geometries, showed that A2A improves SNR by 69.3% and CNR by 34.4%. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. A2A delivers clear images/signals across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization and functional assessment by ultrasound.
Chinese Translation
超声图像中的固有电子噪声和斑点噪声使得临床解读变得复杂。传统的去噪方法依赖于明确的噪声假设,但在复合噪声条件下,其有效性降低。基于学习的方法需要大量标记数据和模型参数。这些预定义和预训练的方式在复杂的体内环境中不可避免地导致领域转移,因此它们仅限于特定类型的噪声,且常常模糊结构细节。在本研究中,我们提出了一种纯测试时训练框架,用于一次性超声图像去噪,并将其应用于合成孔径超声(Synthetic Aperture Ultrasound, SAU),该方法通过子孔径传输合成发射焦点。我们的孔径到孔径(Aperture-to-Aperture, A2A)框架通过金字塔潜在空间中的自对比学习,将解剖相似性和噪声随机性从打乱的子孔径中解耦。然后,从解剖空间中解码出干净图像,同时丢弃噪声空间。A2A在测试时仅对一个噪声样本的SAU信号进行训练,从根本上消除了领域转移和预训练成本。模拟实验显示,在电子噪声水平为0到30 dB及不同包含几何形状的情况下,A2A提高了69.3%的信噪比(SNR)和34.4%的对比噪声比(CNR)。体内结果显示,在六个超声心动图视图中,仅使用心脏的两个孔径数据,获得了84.8%的SNR和25.7%的CNR增益。A2A在多种成像目标和配置中提供清晰的图像/信号,为超声的更可靠解剖可视化和功能评估铺平了道路。
cs.CV / 8 / 2605.12570

M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification

M3Net:一种宏观-中观-微观临床启发的层次化3D网络用于肺结节分类
Li, Jinyue, Yu, Yuzhou, Yang, Jingjing, Fu, Meng, Zhang, Yani, He, Shuyao, Ge, Dianlong, Ning, Xin, Chu, Yannan, Li, Qiankun
Abstract
The accurate classification of benign and malignant pulmonary nodules in CT scans is critical for early lung cancer screening, yet remains challenging due to the multi-scale and heterogeneous nature of pulmonary nodules. While deep learning offers potential for auxiliary diagnosis, most existing models act as "black boxes", lacking the transparency and explainability required for trustworthy clinical integration. To address this issue, we propose M3Net, a novel 3D network for pulmonary nodule classification inspired by the hierarchical diagnostic workflow of radiologists, which integrates multi-scale contextual information from fine-grained structures to global anatomical relationships. Our framework constructs a progressive multi-scale input, from fine-grained nodule structures to local semantics and global spatial relationships. M3Net employs scale-specific encoders and ensures cross-scale semantic consistency through latent space projection and mutual information maximization. Extensive experiments on the public LIDC-IDRI dataset and a self-collected clinical dataset (USTC-FHLN) demonstrate that our method achieves state-of-the-art performance, with accuracies of 86.96% and 84.24% respectively, outperforming the best baseline by 3.26% and 2.17%. The results validate that M3Net provides a more robust and clinically relevant solution for pulmonary nodule classification. The code is available at https://github.com/jylEcho/M3-Net.
Chinese Translation
在CT扫描中准确分类良性和恶性肺结节对于早期肺癌筛查至关重要,但由于肺结节的多尺度和异质性特征,这一任务仍然具有挑战性。尽管深度学习为辅助诊断提供了潜力,但大多数现有模型充当“黑箱”,缺乏临床整合所需的透明性和可解释性。为了解决这一问题,我们提出了M3Net,一种受放射科医生层次化诊断工作流程启发的全新3D网络,用于肺结节分类,整合了从细粒度结构到全局解剖关系的多尺度上下文信息。我们的框架构建了一个渐进的多尺度输入,从细粒度的结节结构到局部语义和全局空间关系。M3Net采用特定尺度的编码器,并通过潜在空间投影和互信息最大化确保跨尺度的语义一致性。在公共LIDC-IDRI数据集和自收集的临床数据集(USTC-FHLN)上进行的大量实验表明,我们的方法达到了最先进的性能,准确率分别为86.96%和84.24%,比最佳基线提高了3.26%和2.17%。结果验证了M3Net为肺结节分类提供了更稳健且临床相关的解决方案。代码可在https://github.com/jylEcho/M3-Net获取。
cs.CV / 9 / 2605.12571

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

VideoSEAL:通过解耦答案权威性来减轻代理性长视频理解中的证据不一致问题
Qiu, Chenhao, Zhang, Yechao, Luo, Xin, Song, Shien, Liu, Xusheng
Abstract
Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at https://github.com/Echochef/VideoSEAL.
Chinese Translation
长视频问答需要在高度冗余的内容中定位稀疏、时间分散的视觉证据。尽管当前的多模态大语言模型(MLLMs)在短视频上表现良好,但长视频引入了长时间范围的搜索和验证,这通常需要多轮的代理性互动。我们发现现有的长视频理解(LVU)代理可能会出现“证据不一致”现象:它们产生的正确答案并未得到检索或检查的证据支持。为了描述这种失败,我们引入了两个诊断指标(时间基础性和语义基础性),并利用它们揭示了两种加剧不一致的压力:推理时共享上下文饱和带来的提示压力,以及训练期间仅基于结果的优化带来的奖励压力。这些发现指向一个结构性根本原因:耦合的代理范式将长时间范围的规划与答案权威性混为一谈。因此,我们提出了解耦的规划-检查框架,该框架将规划与答案权威性分离,并在像素级验证的基础上进行最终回答。在四个长视频基准测试中,我们的框架提高了答案准确性和证据一致性,在LVBench上达到了55.1%,在LongVideoBench上达到了62.0%,同时产生可解释的搜索轨迹。此外,解耦架构在增加搜索预算时能够持续扩展,并支持在不重新训练规划器的情况下对多模态大语言模型骨干进行即插即用的升级。代码和模型可在 https://github.com/Echochef/VideoSEAL 获取。
cs.CV / 10 / 2605.12573

Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

通过滞后时间修正改善图像恢复中的扩散后验采样器
Evangelista, Davide, Morotti, Elena, Pivi, Francesco, Gabbrielli, Maurizio
Abstract
Diffusion-based posterior sampling (PS) is a leading framework for imaging inverse problems, combining learned priors with measurement constraints. Yet, its standard formulations rely on instantaneous data-consistent estimates, which induce temporal variability in the reverse dynamics. We reinterpret PS from a dynamical perspective, showing that the standard PS update corresponds to a first-order discretization of the diffusion dynamics plus a residual correction capturing the mismatch between the denoised prediction and the data-consistent estimate. A second-order discretization, however, naturally introduces a temporal correction based on the variation of consecutive estimates. Building on this, we propose LAMP, combining the second-order update with the residual correction characterizing a PS technique. LAMP thus inherits a lagged temporal correction, and it can be implemented as a modular plug-in over the PS backbone. We show that LAMP preserves the structure of a posterior sampler, and we perform a one-step risk analysis to characterize when LAMP improves the reverse transition via a bias-variance trade-off. Experiments across multiple imaging tasks demonstrate consistent improvements over strong baselines such as DiffPIR and DDRM, without increasing the number of denoising evaluations.
Chinese Translation
基于扩散的后验采样(PS)是解决成像逆问题的主要框架,结合了学习的先验和测量约束。然而,其标准形式依赖于瞬时数据一致性估计,这会导致反向动态中的时间变化。我们从动态的角度重新解释PS,表明标准PS更新对应于扩散动态的一阶离散化加上一个残差修正,用于捕捉去噪预测与数据一致性估计之间的差异。然而,二阶离散化自然引入了基于连续估计变化的时间修正。在此基础上,我们提出了LAMP,将二阶更新与特征化PS技术的残差修正相结合。因此,LAMP继承了滞后时间修正,并可以作为PS框架上的模块化插件实现。我们表明LAMP保持了后验采样器的结构,并进行了一步风险分析,以表征何时通过偏差-方差权衡改善反向转移。多个成像任务的实验表明,LAMP在不增加去噪评估次数的情况下,相较于强基线如DiffPIR和DDRM,表现出一致的改进。
cs.CV / 11 / 2605.12574

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

DistractMIA:通过语义干扰对视觉-语言模型进行黑箱成员推断
Tang, Hongyi, Zhu, Zhihao, Yang, Yi
Abstract
Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.
Chinese Translation
视觉-语言模型(VLMs)是在大规模图像-文本语料库上训练的,这些语料库可能包含私人、受版权保护或其他敏感数据,因此成员推断成为一种用于训练数据审计的工具。这对已部署的VLMs尤其具有挑战性,因为审计人员通常仅观察生成的文本响应。现有的VLM成员推断攻击要么依赖于在此类环境中不可用的概率级信号,要么使用基于掩码的语义预测任务,其有效性依赖于以对象为中心的视觉假设。为了解决这些局限性,我们提出了DistractMIA,这是一种基于语义干扰的仅输出黑箱框架。DistractMIA并不是去除视觉证据,而是保留原始图像,插入已知的语义干扰物,并测量生成的响应如何变化。这一设计的动机在于,成员样本更能与原始图像语义保持一致,而非成员样本则更容易被引导至干扰物。为了使这一信号可靠,DistractMIA在参考集上校准干扰物配置,并从重复的文本生成中推导成员得分,捕捉响应的稳定性和干扰物的吸收,而无需访问logits、概率或隐藏状态。针对多个VLM和基准的实验表明,DistractMIA在输出仅限和更强访问基线中始终表现优越。其在医学基准上的表现进一步证明了其在以对象为中心的自然图像之外的适用性。
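The output-only scoring idea in this entry (response stability plus distractor uptake, measured from generated text alone) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the Jaccard token overlap used for stability, the keyword match used for uptake, and the simple difference used as the final score are all assumptions made for the example.

```python
def tokens(text):
    """Lowercased whitespace tokens of a response, as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap between two responses, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def stability(clean_responses, distracted_responses):
    """Mean overlap between responses to the original and the distracted image."""
    pairs = [(c, d) for c in clean_responses for d in distracted_responses]
    return sum(jaccard(c, d) for c, d in pairs) / len(pairs)


def distractor_uptake(distracted_responses, distractor_word):
    """Fraction of distracted responses that mention the inserted distractor."""
    word = distractor_word.lower()
    hits = sum(word in tokens(r) for r in distracted_responses)
    return hits / len(distracted_responses)


def membership_score(clean_responses, distracted_responses, distractor_word):
    """Hypothetical score: higher means more anchored to the original image."""
    return (stability(clean_responses, distracted_responses)
            - distractor_uptake(distracted_responses, distractor_word))
```

Under the abstract's intuition, a member sample would keep a high score because its responses stay close to the original description, while a non-member sample would drift toward the distractor and score lower. A real audit would also need the calibration on a reference set that the abstract mentions, which this sketch omits.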
cs.CV / 12 / 2605.12586

3D Primitives are a Spatial Language for VLMs

3D 原语是视觉语言模型的空间语言
Liu, Junze, Qian, Kun, Dubost, Florian, Zhong, Kai, Srinivasan, Arvind, Chen, Nan, Wang, Anping, Zhang, Sam, Mottini, Alejandro, Cui, Qingjun, Wang, Tian
Abstract
Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.
Chinese Translation
视觉语言模型(VLMs)展现出一个显著的悖论:它们能够生成可执行代码,从几何原语中重建 3D 场景,且对象的数量、类别和大致位置均正确,但同样的模型在同一图像上却无法回答更简单的空间问题。我们展示了 3D 几何原语(立方体、球体、圆柱体,以可执行代码表达)作为空间理解的强大中介表示,并通过三个贡献加以利用。首先,我们引入了 SpatialBabel,这是一个基准测试,评估十四个 VLMs 在六种场景代码语言(用于 3D 原语场景的编程语言和声明性格式)上的基于原语的 3D 场景重建,结果显示单个模型的对象检测 F1 分数在不同语言之间可变化高达 5.7×。其次,我们提出了 Code-CoT(代码思维链),这是一种无训练的推理策略,通过基于原语的代码生成来引导空间推理。Code-CoT 在原语场景上将 SpatialBabel-QA-Score 提升了高达 +6.4%,在具有强编码能力的 VLMs 上将真实照片 CV-Bench-3D 的准确率提高了 +5.0%。第三,我们提出了 S³-FT(自监督空间微调),它通过解析模型自身的 Three.js 原语重建为结构化注释,并在结果上进行微调,自监督地提炼原语空间知识,且无需人工标签和教师模型。仅在原语图像上训练,S³-FT 在 SpatialBabel-Primitive-QA 上将 Qwen3-VL-8B 的性能提高了 +4.6 到 +8.6%,在 CV-Bench-2D 上提高了 +9.7%,在 HallusionBench 上提高了 +17%;该方法在不同模型家族之间具有可迁移性。这些结果确立了代码中的几何原语作为 VLMs 的诊断工具和可转移的空间词汇。我们将在发表时发布所有相关材料。
cs.CV / 13 / 2605.12587

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

TrackCraft3R:将视频扩散变换器重新用于密集3D跟踪
Nam, Jisu, Koo, Jahyeok, Son, Soowon, Jung, Jaewoo, An, Honggyu, Hur, Junhwa, Kim, Seungryong
Abstract
Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.
Chinese Translation
从单目视频进行密集3D跟踪是动态场景理解的基础。尽管最近的3D基础模型提供了可靠的逐帧几何信息,但在这些几何信息中恢复物体运动仍然具有挑战性,并且受益于从真实世界视频中学习到的强运动先验。现有的3D跟踪器要么遵循从头开始在合成数据上训练的迭代范式,要么微调从静态多视图图像中学习到的3D重建模型,这两者都缺乏真实世界的运动先验。预训练的视频扩散变换器(video DiTs)提供了来自互联网规模视频的丰富时空先验,使其成为3D跟踪的有希望的基础。然而,它们的帧锚定形式生成每帧的内容,根本上与参考锚定的密集3D跟踪不匹配,后者必须在时间上遵循来自参考帧的相同物理点。我们提出了TrackCraft3R,这是第一个将视频DiT重新用于前馈密集3D跟踪的方法。给定一个单目视频及其帧锚定的重建点图,TrackCraft3R预测一个参考锚定的跟踪点图,该点图在单次前向传递中遵循第一帧的每个像素在时间上的变化及其可见性。我们通过两个设计实现这一目标:(i)双潜表示,使用逐帧几何潜变量和参考锚定的跟踪潜变量作为密集查询,以及(ii)时间RoPE对齐,指定每个跟踪潜变量的目标时间戳。这些设计共同将视频DiTs的逐帧生成范式转换为参考锚定的跟踪公式,并进行LoRA微调。TrackCraft3R在标准稀疏和密集3D跟踪基准上实现了最先进的性能,同时运行速度比最强的先前方法快1.3倍,峰值内存使用量减少4.6倍。我们进一步展示了对大运动和长视频的鲁棒性。
cs.CV / 14 / 2605.12608

A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

基于 Clear2Fog 流水线的合成雾数据效率研究用于目标检测
Mohamed, Mohamed Ahmed, Huang, Xiaowei
Abstract
Object detection in adverse weather is critical for the safety of autonomous vehicles; however, the scarcity of labelled, real-world foggy data remains a significant bottleneck. In this paper, we propose Clear2Fog (C2F), an end-to-end, physics-based pipeline that simulates fog on clear-weather datasets while ensuring sensor-level consistency across camera and LiDAR. By using monocular depth estimation and a novel atmospheric light estimation method, C2F overcomes structural artifacts and chromatic biases common in existing techniques. A human perceptual study confirms C2F's physical realism, with the generated images being preferred 92.95% of the time over an established method. Utilising a training set of 270,000 images from the Waymo Open Dataset, we conduct an extensive data efficiency study to investigate how environmental diversity influences model robustness. Our findings reveal that models trained on mixed-density fog datasets at 75% scale outperform those trained on fixed-density datasets at 100% scale. Furthermore, we investigate the sim-to-real transfer by fine-tuning pre-trained models on real-world foggy data. We demonstrate that a tenfold increase over the default fine-tuning learning rate successfully overcomes negative transfer from synthetic biases, resulting in a 1.67 mAP improvement over real-only baselines. The C2F pipeline provides a scalable framework for enhancing the reliability of autonomous systems in adverse weather and demonstrates the potential of diverse synthetic datasets for efficient model training.
Chinese Translation
在恶劣天气条件下进行目标检测对自动驾驶车辆的安全至关重要;然而,标注的真实世界雾天数据的稀缺仍然是一个重大瓶颈。本文提出了 Clear2Fog (C2F),一个端到端的基于物理的流水线,能够在清晰天气数据集上模拟雾气,同时确保相机和激光雷达的传感器级一致性。通过使用单目深度估计和一种新颖的大气光估计方法,C2F 克服了现有技术中常见的结构伪影和色差偏差。一项人类感知研究证实了 C2F 的物理真实感,生成的图像在92.95%的情况下优于传统方法。利用来自 Waymo Open Dataset 的 270,000 张图像的训练集,我们进行了广泛的数据效率研究,以探讨环境多样性如何影响模型的鲁棒性。我们的研究结果表明,在 75% 比例的混合密度雾数据集上训练的模型优于在 100% 比例的固定密度数据集上训练的模型。此外,我们通过在真实世界雾天数据上微调预训练模型来研究模拟到现实的迁移。我们证明,将默认微调学习率提高十倍能够成功克服来自合成偏差的负迁移,从而在仅使用真实数据的基线模型上实现 1.67 mAP 的提升。C2F 流水线提供了一个可扩展的框架,以提高自动系统在恶劣天气下的可靠性,并展示了多样化合成数据集在高效模型训练中的潜力。
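Physics-based fog synthesis is commonly built on the atmospheric scattering (Koschmieder) model, I = J·t + A·(1 − t), with transmission t = exp(−β·d). The sketch below applies that model to a single pixel. Whether C2F uses exactly this form is an assumption; `add_fog`, `beta`, and `airlight` are illustrative names, and in the pipeline described above the depth would come from monocular depth estimation and the airlight from the atmospheric light estimator.

```python
import math


def transmission(depth_m, beta):
    """Fraction of scene radiance surviving depth_m metres of fog."""
    return math.exp(-beta * depth_m)


def add_fog(pixel, depth_m, beta, airlight):
    """Blend a clear-weather RGB pixel (values in [0, 1]) toward the airlight."""
    t = transmission(depth_m, beta)
    return tuple(j * t + a * (1.0 - t) for j, a in zip(pixel, airlight))


def visibility_to_beta(visibility_m):
    """Scattering coefficient for a given visibility, using a 5% contrast threshold."""
    return -math.log(0.05) / visibility_m


if __name__ == "__main__":
    beta = visibility_to_beta(100.0)  # dense fog, roughly 100 m visibility
    print(add_fog((0.2, 0.4, 0.6), depth_m=50.0, beta=beta, airlight=(0.9, 0.9, 0.9)))
```

Sampling `beta` from a range per image, rather than fixing it, is one simple way to produce the kind of mixed-density training set the abstract compares against fixed-density data.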
cs.CV / 15 / 2605.12640

MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

MambaPanoptic:基于视觉Mamba的全景分割结构状态空间框架
Cheng, Qing, Bertolini, Damiano, Zhang, Wei, Wang, Dong, Zeller, Niclas, Cremers, Daniel
Abstract
Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.
Chinese Translation
全景分割要求同时识别可计数的物体实例和无形的背景区域,对长距离上下文建模、多尺度特征表示和高效密集预测提出了共同的要求。现有的卷积和基于变换器的方法在同时满足这三项要求方面面临挑战:卷积架构在建模长距离依赖关系的能力上受到限制,而基于变换器的方法在高分辨率下的计算成本呈二次增长,难以承受。本文提出了MambaPanoptic,一个完全基于Mamba的全景分割框架,通过两个主要贡献来解决这些限制。首先,我们引入了MambaFPN,一个自上而下的特征金字塔,利用Mamba块生成具有线性计算复杂度的全局一致的多尺度特征表示。其次,我们采用了一种PanopticFCN风格的核生成器,生成统一的物体和背景核,以实现无提案的全景预测,并通过在多个网络阶段应用的QuadMamba基础特征精炼模块进行增强。在Cityscapes和COCO全景分割基准上的实验表明,MambaPanoptic在可比模型规模下始终优于PanopticDeepLab和PanopticFCN,并在PQ和AP指标上与Mask2Former在Cityscapes上持平或超越,同时所需参数更少。
cs.CV / 16 / 2605.12649

DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery

DIVER:通过表达性语义恢复深入挖掘蒸馏数据
Xia, Qianxin, Shu, Zhiyong, Jiang, Wenbo, Du, Jiawei, Wang, Jielei, Lu, Guoming
Abstract
Dataset distillation aims to synthesize a compact proxy dataset that is unreadable or non-raw from the original dataset for privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which suffers from learning specific patterns that overfit on a prior architecture, consequently suppressing the expression of semantics and leading to performance degradation across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called DIVER, which leverages the pre-trained diffusion model to dive deeper into DIstilled data Via Expressive semantic Recovery, an entire process of semantic inheritance, guidance, and fusion. Semantic inheritance distills high-level semantics of abstract distilled images into the latent space to filter out architecture-specific "noise" and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Finally, semantic fusion is designed to provide semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while maintaining the guidance information. Extensive experiments validate the effectiveness and efficiency of DIVER in improving classical distillation techniques and significantly improving cross-architecture generalization, requiring processing time comparable to raw DiT on ImageNet (256×256) with only 4 GB of GPU memory usage. Code is available: https://github.com/einsteinxia/DIVER.
Chinese Translation
数据集蒸馏旨在合成一个紧凑的代理数据集,该数据集从原始数据集中提取,无法读取或非原始,以保护隐私并实现高效学习。然而,以往的方法通常采用单阶段蒸馏范式,这导致学习特定模式而过拟合于先前的架构,从而抑制了语义的表达,并导致在异构架构上的性能下降。为了解决这个问题,我们提出了一种新颖的双阶段蒸馏框架,称为 DIVER,该框架利用预训练的扩散模型深入挖掘 DIstilled data Via Expressive semantic Recovery,这是一个语义继承、引导和融合的完整过程。语义继承将抽象蒸馏图像的高层语义蒸馏到潜在空间,以过滤掉架构特定的“噪声”,并保留内在语义。此外,语义引导通过引导逆过程来改善原始语义的保留。最后,语义融合旨在仅在逆过程的具体阶段提供语义引导,防止语义模糊和伪影,同时保持引导信息。大量实验验证了 DIVER 在改进经典蒸馏技术和显著提升跨架构泛化能力方面的有效性和效率,其处理时间与 ImageNet (256×256) 上的原始 DiT 相当,仅需 4 GB 的 GPU 内存。代码可用: https://github.com/einsteinxia/DIVER。
cs.CV / 17 / 2605.12650

CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

CRAFT:用于医学图像合成的临床奖励对齐微调
Chung, Yunsung, Darzi, Alex El, Khoury, Carlo El, Feng, Han, Marrouche, Nassir, Hamm, Jihun
Abstract
Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5-34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by evaluation with out-of-family evaluators, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.
Chinese Translation
基础扩散模型能够生成逼真的自然图像,但将其适应于医学成像仍然具有挑战性。在医学适应中,有限的标注数据可能加剧幻觉般和临床上不合理的合成,而现有的指标如FID或Inception Score并不能量化每幅图像与病理相关标准的对齐程度。我们引入了临床对齐评分(Clinical Alignment Score,CAS),这是一种基于基础模型的临床对齐代理,评估生成图像在视觉保真度之外的四个互补维度。基于CAS,我们提出了临床奖励对齐微调(Clinical Reward-Aligned Finetuning,CRAFT),这是一种基于奖励的适应框架,通过标签条件的提示增强、临床检查表和可微分的奖励优化,将医学知识从多模态大型语言模型和视觉-语言模型转移到医学图像合成中。在四种不同的模态中,CRAFT在强适应基线之上提高了CAS和下游分类性能。除了平均CAS的提升外,CRAFT还将经验低对齐尾部相对于最强基线减少了5.5-34.7个百分点,平均相对减少幅度为20.4%。这些结果表明在CAS下生成的幻觉般图像更少,并通过超出家族评估者的评估、结构化检查表审计、记忆分析以及对CheXpert的盲法医生偏好研究得到了证实。
cs.CV / 18 / 2605.12678

No One Knows the State of the Art in Geospatial Foundation Models

无人知晓地理空间基础模型的最新进展
Corley, Isaac, Lehmann, Nils, Robinson, Caleb, Tseng, Gabriel, Fuller, Anthony, Alemohammad, Hamed, Shelhamer, Evan, Marcus, Jennifer, Kerner, Hannah
Abstract
Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.
Chinese Translation
地理空间基础模型(GFMs)被提议作为灾害响应、土地覆盖制图、粮食安全监测及其他高风险地球观测任务的通用骨干。然而,关于这些模型的已发表研究并未为审稿人或用户提供足够的信息,以判断哪个模型适合特定任务。我们认为,目前没有人知道地理空间基础模型的最新进展。尽管这些方法可能有用,但GFM文献并未对评估、训练和测试协议、发布的权重或预训练控制进行足够的标准化,以便任何人能够进行比较或排名。在对152篇论文的审计中,我们发现同一模型、基准和协议之间存在46个至少10分的跨论文分歧;94/126篇具有可提取预训练数据的论文使用了其他论文未使用的配置;39%的GFM论文未发布任何模型权重。这种缺乏社区标准的问题是可以解决的。我们提出六项具体期望:命名许可证权重发布、共享核心评估、复制与重新运行基线注释、方差报告、一个共享的评估工具以及数据与架构与算法的控制。这些差距是协调失败,而非任何单个实验室的过错;本文的作者与GFM社区中的许多人一样,亦对此有所贡献。我们不仅仅是批评社区,而是旨在提供具体步骤,以实现对如何创新GFMs的共同理解。
cs.CV / 19 / 2605.12684

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

视觉美学基准:前沿模型能否判断美感?
Feng, Yichen, Li, Yuetai, Liu, Chunjiang, Chen, Yuanyuan, Jiang, Fengqing, Huang, Yue, Hua, Hang, Yuan, Zhengqing, Zheng, Kaiyuan, Niu, Luyao, Ramasubramanian, Bhaskar, Alomair, Basel, Zhang, Xiangliang, Sra, Misha, Chen, Zichen, Poovendran, Radha, Xu, Zhangchen
Abstract
Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
Chinese Translation
多模态大型语言模型(MLLMs)现已广泛应用于视觉理解、生成和策展。这些应用中有相当一部分需要明确的美学判断。现有大多数解决方案将这种判断简化为对单一图像预测一个标量分数。我们首先探讨这些分数是否真实反映了比较偏好:在一项由八位专家注释者参与的控制研究中,基于分数的排名与同一注释者的直接比较结果一致性较差,而直接排名在最佳和最差图像标签上则显著提高了注释者之间的一致性。基于这一发现,我们引入了视觉美学基准(Visual Aesthetic Benchmark, VAB),将美学评估视为对具有匹配主题的候选集的比较选择。VAB包含400个任务和1,195幅图像,涵盖美术、摄影和插图,每个任务的标签均来自10位独立专家评审的共识。对20个前沿MLLM和六个专门的视觉质量奖励模型进行评估,我们发现最强的系统在仅26.5%的任务中能够正确识别出最佳和最差图像,而这一表现远低于人类专家的68.9%。在2,000个专家示例上对一个35B参数模型进行微调后,其准确性接近397B参数开放权重模型,表明VAB中的比较信号是可转移的。这些结果共同揭示了当前多模态模型与专家美学判断之间存在明显且可测量的差距,而VAB提供了第一个基于集合的、以专家为基础的测试平台,可以追踪并缩小这一差距。
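The strict success criterion described in this entry (both the best and the worst image must be identified, and the answer must hold across three random orderings of the candidates) can be sketched as follows. This is an illustrative reading of the abstract, not the benchmark's released scorer: the task dictionary layout, the `judge` callable, and the seeding scheme are assumptions.

```python
import random


def strict_task_correct(candidates, best, worst, judge, n_perms=3, seed=0):
    """True only if judge picks (best, worst) under every shuffled candidate order."""
    rng = random.Random(seed)
    for _ in range(n_perms):
        order = list(candidates)
        rng.shuffle(order)
        pred_best, pred_worst = judge(order)
        if pred_best != best or pred_worst != worst:
            return False
    return True


def strict_accuracy(tasks, judge, n_perms=3, seed=0):
    """Fraction of tasks solved under the all-permutations criterion."""
    solved = sum(
        strict_task_correct(t["candidates"], t["best"], t["worst"], judge, n_perms, seed)
        for t in tasks
    )
    return solved / len(tasks)


if __name__ == "__main__":
    # Toy tasks: candidates are integer "quality" ids, so an oracle can use max/min.
    tasks = [{"candidates": [3, 1, 4, 2], "best": 4, "worst": 1}]
    oracle = lambda order: (max(order), min(order))
    print(strict_accuracy(tasks, oracle))
```

Requiring agreement across permutations penalizes judges whose choice depends on where an image appears in the prompt, which is one reason a strict score can sit well below single-pass accuracy.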
cs.CV / 20 / 2605.12703

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

MMCL-Bench:来自视觉规则、程序和证据的多模态上下文学习
Chen, Yifan, Yin, Fei, Bai, Qingyan, Lin, Zicheng, Yang, Yujiu
Abstract
We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.
Chinese Translation
我们介绍了MMCL-Bench,这是一个用于多模态上下文学习的基准:从视觉或混合模态的教学上下文中学习任务特定的规则、程序和经验模式,并将其应用于新的视觉实例。与仅基于文本的上下文学习或标准的多模态问答不同,这一设置要求模型从图像、截图、手册、视频和帧序列中恢复和定位相关证据,然后才能对学习到的上下文进行推理。MMCL-Bench包含102个任务,涵盖三个类别:规则系统应用、程序性任务执行和经验发现与归纳。我们使用严格的评分标准评估前沿的多模态模型,发现当前系统在多模态上下文学习方面仍然远未达到稳健的水平,即使是最强的模型在严格评估下也仅能解决不到三分之一的任务。诊断性消融实验和错误分析表明,失败发生在上下文到答案的整个流程中,包括上下文锚定、视觉证据提取、上下文推理和响应构建。因此,MMCL-Bench突显了多模态上下文学习作为当前多模态模型一个重要的未解决能力瓶颈。
cs.CV / 21 / 2605.12724

Inline Critic Steers Image Editing

内联批评者引导图像编辑
Kang, Weitai, Zhan, Xiaohang, Wang, Yizhou, Chiu, Mang Tik, Kuen, Jason, Liu, Kangning, Yan, Yan
Abstract
Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation ρ = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.
Chinese Translation
基于指令的图像编辑在不同案例和图像区域之间表现出异质性难度,这促使我们采用精细化方法,将修正分配到模型难以处理的地方。现有的精细化信号通常在完全生成图像或完成去噪步骤后才出现。我们探讨这种信号是否可以在进行中的前向传播中发挥作用。为此,我们对一个冻结的图像编辑模型进行了研究,发现尽管生成能力仅在最后几层中显现,但错误模式在早期层中已经形成(与最终层错误图的等级相关性 ρ = 0.83)。基于此,我们引入了内联批评者(Inline Critic),这是一个可学习的标记,能够在冻结模型的中间层对其预测进行批评,并引导其隐藏状态在前向传播过程中精细化生成。我们提出了一个三阶段的方案,以稳定从学习批评到引导生成的训练。因此,我们在 GEdit-Bench 上达到了最先进的水平(7.89),在 RISEBench 上相较于相同骨干网络获得了 +9.4 的提升,并在 KRIS-Bench 上取得了最强的开源结果(81.92,超越 GPT-4o)。我们进一步提供了分析,表明批评者确实在后续层中塑造了模型的注意力和预测更新。
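The rank correlation quoted in this abstract compares an early-layer error map with the final-layer one. As a minimal sketch of how such a Spearman coefficient can be computed, the snippet below ranks two flattened error maps and correlates the ranks. The function names and the five made-up error values are assumptions for illustration only; they are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(a, b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical flattened per-region error maps (made-up numbers).
early_layer_error = [0.9, 0.1, 0.4, 0.7, 0.2]
final_layer_error = [0.8, 0.2, 0.3, 0.9, 0.1]
rho = spearman_rho(early_layer_error, final_layer_error)
```

Because only ranks enter the computation, the coefficient measures whether the same regions are hard at both depths, regardless of the absolute error scale at each layer.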
cs.CV / 22 / 2605.12725

Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models

视频异常检测是否被误框架?基于大型语言模型和多场景模型的证据
Mumcu, Furkan, Jones, Michael J., Cherian, Anoop, Yilmaz, Yasin
Abstract
Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.
Chinese Translation
近年来,视频异常检测研究迅速扩展,强调通用正常性模型,以便在许多不同场景中工作。虽然这种关注导致了可扩展性和多场景泛化的改善,但也使该领域偏离了对场景特定和上下文依赖的正常行为建模。当前的方法通常依赖于视频级的弱监督和来自多模态大型语言模型(MLLMs)的不透明预训练表示,这促使模型响应熟悉的语义异常类别,而不是特定环境中正常模式的偏差。这一趋势抑制了空间定位,引入了语义偏见,并将异常检测简化为一种动作识别。在本文中,我们考察这些主流表述是否与现实世界视频异常检测(VAD)的核心要求相符,后者通常在单一场景中进行,其中正常性由局部几何、语义和活动模式决定。通过有针对性的视觉分析和实证评估,我们展示了这些局限性的实际后果,并表明在VAD中取得有意义的进展需要重新关注单场景、空间感知和可解释的表述,以捕捉个体环境中正常性的细微结构。
cs.CV / 23 / 2605.12772

Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

只需请求一个表格:三十个令牌的用户提示在十二个大型语言模型中击败了赞助推荐
Maier, Andreas, Sopa, Jeta, Sahin, Gozde Gul, Perez-Toro, Paula, Bayer, Siming
Abstract
Wu et al. (2026) showed that most frontier large language models (LLMs) recommend a sponsored, roughly twice-as-expensive flight when their system prompt contains a soft sponsorship cue. We reproduce their evaluation on ten open-weight chat models plus the two of their twenty-three models that are still reachable today (gpt-3.5-turbo, gpt-4o). All reported rates in this paper are produced under the same judge the original paper used (gpt-4o); we additionally store every label under an open-weight (gpt-oss-120b) and a smaller proprietary (gpt-4o-mini) judge for an ablation. Three findings emerge. First, a prose description of an LLM evaluation pipeline is not, on its own, sufficient for accurate reproduction: we surfaced three silent implementation failures that each shifted a reported rate by tens of percentage points. Second, the central claims do generalise - the gpt-3.5-turbo logistic-regression intercept of alpha = 0.81 is within four points of the original alpha = 0.86, and 200 of 200 trials on gpt-3.5-turbo and gpt-4o promote a payday lender to a financially distressed user. Third, a thirty-token user prompt that asks the assistant for a neutral comparison table first cuts sponsored recommendation from 46.9% to 1.0% averaged across our ten open-source models, and from 53.0% to 0% averaged across the two OpenAI models. AI literacy and price-comparison portals are likely market-level mitigations; the harmful-product cell is bounded by neither. Raw data, labels and analysis scripts are at https://github.com/akmaier/Paper-LLM-Ads .
Chinese Translation
Wu等人(2026)显示,当系统提示包含软赞助提示时,大多数前沿大型语言模型(LLMs)会推荐一项赞助的、价格大约是两倍的航班。我们在十个开放权重的聊天模型以及他们二十三个模型中仍然可用的两个模型(gpt-3.5-turbo,gpt-4o)上重现了他们的评估。本文中报告的所有比率均是在原论文使用的相同评估模型(gpt-4o)下生成的;我们还在一个开放权重模型(gpt-oss-120b)和一个较小的专有模型(gpt-4o-mini)下存储了每个标签以进行消融实验。我们得出了三个发现。首先,LLM评估流程的文字描述本身不足以实现准确重现:我们发现了三个静默的实现失败,每个失败都使报告的比率偏差了数十个百分点。其次,核心主张是可以推广的——gpt-3.5-turbo的逻辑回归截距alpha = 0.81与原始的alpha = 0.86相差四个百分点,并且在gpt-3.5-turbo和gpt-4o上的200次试验中,均向财务困境用户推荐了一个发薪日贷款者。第三,一个请求助手提供中立比较表的三十个令牌的用户提示,首先将赞助推荐的比例从我们十个开源模型的46.9%降低到1.0%,并将两个OpenAI模型的比例从53.0%降低到0%。人工智能素养和价格比较门户可能是市场层面的缓解措施;而有害产品的单元则不受此限制。原始数据、标签和分析脚本可在https://github.com/akmaier/Paper-LLM-Ads获取。
cs.CV / 24 / 2605.12774

WildPose: A Unified Framework for Robust Pose Estimation in the Wild

WildPose:一种用于野外鲁棒姿态估计的统一框架
Zheng, Jianhao, Zhu, Liyuan, Zhu, Zihan, Armeni, Iro
Abstract
Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.
Chinese Translation
在动态环境中估计相机姿态是一个关键挑战,因为大多数视觉SLAM和结构从运动(SfM)方法假设场景是静态的。虽然最近出现了一些动态感知的方法,但它们往往缺乏统一性:基于语义的方法脆弱,逐序列优化方法在短序列上表现不佳,而其他学习模型在仅静态场景上可能会退化。我们提出了WildPose,这是一种统一的单目姿态估计框架,在动态环境中表现鲁棒,同时在静态和低自我运动数据集上保持最先进的性能。我们的关键见解是将现代3D视觉中的两种强大范式连接起来:前馈模型的丰富感知前端和可微分束调整(BA)的端到端优化。我们通过基于冻结的、预训练的MASt3R特征骨干网构建的3D感知更新算子,以及一个使用来自同一骨干网的多层次3D感知特征的高容量运动掩码检测器来实现这一点。大量实验表明,WildPose在动态(Wild-SLAM,Bonn)、静态(TUM,7-Scenes)和低自我运动(Sintel)基准测试中始终优于先前的方法。
cs.CV / 25 / 2605.12826

FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection

FRAME:用于图像篡改检测的取证路由和自适应多路径证据融合
Zhao, Kaixiang, Yu, Tianrun, Zhang, Aoxu, Su, Junhao, Jenkins, Porter, Hughes, Amanda
Abstract
The proliferation of sophisticated image editing tools and generative artificial intelligence models has made verifying the authenticity of digital images increasingly challenging, with important implications for journalism, forensic analysis, and public trust. Although numerous forensic algorithms, ranging from handcrafted methods to deep learning-based detectors, have been developed for manipulation detection, individual methods often suffer from limited robustness, fragmented evidence, or weak generalization across manipulation types and image conditions. To address these limitations, we present FRAME, a method for Forensic Routing and Adaptive Multi-path Evidence fusion for image manipulation detection. FRAME organizes diverse forensic algorithms into a multi-path analysis space, adaptively selects informative forensic paths for each input image, and fuses complementary evidence to improve detection and localization performance. By moving beyond single-method analysis and fixed fusion strategies, FRAME provides a more robust and flexible approach to image forensic reasoning while preserving interpretable forensic cues from multiple evidence sources. Experimental results demonstrate the effectiveness of FRAME across diverse manipulation scenarios. Code is available at https://github.com/kzhao5/FRAME.
Chinese Translation
随着复杂图像编辑工具和生成性人工智能模型的普及,验证数字图像的真实性变得愈加困难,这对新闻报道、取证分析和公众信任具有重要影响。尽管已经开发了众多取证算法,从手工方法到基于深度学习的检测器,用于篡改检测,但单一方法往往存在鲁棒性有限、证据碎片化或对不同篡改类型和图像条件的泛化能力弱等问题。为了解决这些局限性,我们提出了FRAME,一种用于图像篡改检测的取证路由和自适应多路径证据融合(Forensic Routing and Adaptive Multi-path Evidence fusion)方法。FRAME将多种取证算法组织成一个多路径分析空间,为每个输入图像自适应选择信息丰富的取证路径,并融合互补证据以提高检测和定位性能。通过超越单一方法分析和固定融合策略,FRAME提供了一种更为鲁棒和灵活的图像取证推理方法,同时保留来自多个证据源的可解释取证线索。实验结果表明,FRAME在多种篡改场景下的有效性。代码可在 https://github.com/kzhao5/FRAME 获取。
cs.CV / 26 / 2605.12845

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

AssemblyBench:物理感知的复杂工业物体组装
Li, Danrui, Zhang, Jiahao, Egger, Bernhard, Chatterjee, Moitreya, Lohit, Suhas, Marks, Tim K., Cherian, Anoop
Abstract
Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.
Chinese Translation
从零件组装物体需要理解多模态指令,将其与三维组件链接,并预测每个组装步骤的物理上合理的六自由度(6-DoF)运动。现有的数据集集中于简化场景,忽视了工业组装中的形状复杂性和组装轨迹。我们引入了AssemblyBench,这是一个包含2,789个工业物体的合成数据集,配有多模态指令手册、相应的三维零件模型和零件组装轨迹。我们还提出了一种基于变换器的模型AssemblyDyno,该模型利用指令手册和每个零件的三维形状共同预测组装顺序和零件组装轨迹。AssemblyDyno在组装姿态估计和轨迹可行性方面均优于之前的研究,后者通过我们的基于物理的模拟进行评估。
cs.CV / 27 / 2605.12851

PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification

PRISM:基于核周环的急性淋巴细胞白血病分类图像分割方法
Moreira, Larissa Ferreira Rodrigues, Rodrigues, Leonardo Gabriel Ferreira, Moreira, Rodrigo, Backes, André Ricardo
Abstract
Automated analysis of peripheral blood smears for Acute Lymphoblastic Leukemia (ALL) is hindered by low contrast and substantial variability in cytoplasmic appearance, which complicate conventional membrane-based segmentation. We found that many recent approaches rely on heavy neural architectures and extensive training, but still struggle to generalize across staining and acquisition variability. To address these limitations, we propose the Perinuclear Ring-based Image Segmentation Method (PRISM), which replaces explicit cytoplasmic delineation with adaptive concentric zones constructed around the nucleus. These perinuclear regions enable the extraction of robust cytoplasmic descriptors by integrating color information with texture statistics derived from grey-level co-occurrence patterns, without requiring accurate cell-boundary detection. A calibrated stacking ensemble of traditional classifiers leverages these descriptors to achieve a high performance, with an accuracy of 98.46% and a precision-recall AUC of 0.9937.
Chinese Translation
急性淋巴细胞白血病(ALL)外周血涂片的自动化分析受到低对比度和细胞质外观显著变异的阻碍,这使得传统的基于膜的分割变得复杂。我们发现许多近期的方法依赖于复杂的神经网络架构和大量的训练,但在不同染色和采集变异下仍然难以泛化。为了解决这些局限性,我们提出了基于核周环的图像分割方法(PRISM),该方法用围绕细胞核构建的自适应同心区域替代了明确的细胞质轮廓划分。这些核周区域通过整合颜色信息和来自灰度共生矩阵的纹理统计,能够提取出稳健的细胞质描述符,而无需准确的细胞边界检测。经过校准的传统分类器堆叠集成利用这些描述符实现了高性能,准确率达到98.46%,精确率-召回率曲线下面积(AUC)为0.9937。
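The idea of replacing cytoplasm delineation with concentric zones around the nucleus can be sketched as follows. This is only an illustration under stated assumptions: the abstract describes adaptive zones, whereas this sketch uses a fixed ring width and city-block distance, and the function name `perinuclear_rings` and its parameters are hypothetical.

```python
from collections import deque


def perinuclear_rings(nucleus_mask, ring_width, num_rings):
    """Label pixels: 0 = nucleus, 1..num_rings = ring index, -1 = outside all rings."""
    height, width = len(nucleus_mask), len(nucleus_mask[0])
    dist = [[None] * width for _ in range(height)]
    queue = deque()
    for r in range(height):
        for c in range(width):
            if nucleus_mask[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    # Multi-source breadth-first search: city-block distance to the nucleus.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    labels = [[-1] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            d = dist[r][c]
            if d is None:
                continue  # no nucleus pixel anywhere in the mask
            ring = 0 if d == 0 else (d - 1) // ring_width + 1
            if ring <= num_rings:
                labels[r][c] = ring
    return labels


# Made-up 7x7 mask with a single nucleus pixel at the centre.
mask = [[0] * 7 for _ in range(7)]
mask[3][3] = 1
ring_labels = perinuclear_rings(mask, ring_width=1, num_rings=2)
```

Colour and texture statistics could then be pooled per ring label, which is why no cell-boundary detection is needed: only the nucleus mask determines the zones.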
cs.CV / 28 / 2605.12855

Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

基于纵向内窥镜的直肠癌再生长预测
Gomez, Jorge Tapias, Kanata, Despoina, Rangnekar, Aneesh, Lee, Christina, Williams, Hannah, Thompson, Hannah, Smith, J. Joshua, Sanchez-Vega, Francisco, Sabuncu, Mert R., Garcia-Aguilar, Julio, Veeraraghavan, Harini
Abstract
Clinical trials indicate a benefit of watch-and-wait (WW) surveillance for patients with rectal cancer who show a complete or near-complete clinical response (CR) directly after treatment (restaging). However, there are no objectively accurate methods for early detection of local tumor regrowth (LR) from follow-up exams in patients undergoing WW. Hence, we developed Temporal Rectal Endoscopy Cross-attention (TREX), a longitudinal deep learning approach that combines pairs of images acquired at restaging and follow-up to distinguish CR from LR. TREX uses pretrained Swin Transformers in a siamese setting to extract features from longitudinal images and dual cross-attention to combine the features without spatial co-registration between image pairs. TREX and Swin-based baselines were trained under two settings: (a) detecting LR or CR at the last available follow-up and (b) early detection of LR at 3-6, 6-12, and 12-24 months before clinical confirmation. TREX achieved the highest accuracy in detecting LR with a high sensitivity of 97% ± 6% and a balanced accuracy of 90% ± 3%, and outperformed all baselines in early detection at both 3-6 (74% ± 1%) and 6-12 months (62% ± 4%) prior to clinical detection. Clinical validation via a surgeon survey showed that TREX matched attending-level overall accuracy (TREX: 86.21% vs. Clinicians: 87.84% ± 1.28%). Finally, we explored TREX's ability to predict treatment response by combining pre-treatment (pre-TNT) and restaging endoscopies, achieving a balanced accuracy of 73% ± 12%. These results show that longitudinal deep learning analysis of endoscopy may improve surveillance and enable earlier identification of rectal cancer regrowth.
Chinese Translation
临床试验研究表明,对于在治疗后(重新分期)显示完全或近完全临床反应(CR)的直肠癌患者,观察等待(WW)监测具有益处。然而,目前尚无客观准确的方法能够在进行WW的患者的随访检查中早期检测局部肿瘤再生长(LR)。因此,我们开发了时间性直肠内窥镜交叉注意力(TREX),这是一种纵向深度学习方法,结合在重新分期和随访中获取的一对图像,以区分CR和LR。TREX在一个孪生网络设置中使用预训练的Swin Transformers,从纵向图像中提取特征,并通过双重交叉注意力在图像对之间不进行空间共同配准的情况下结合这些特征。TREX和基于Swin的基线模型在两种设置下进行训练:(a)在最后一次可用随访中检测LR或CR,以及(b)在临床确认之前的3-6个月、6-12个月和12-24个月内早期检测LR。TREX在检测LR方面取得了最高的准确率,灵敏度高达97% ± 6%,平衡准确率为90% ± 3%,并在临床检测之前的3-6个月(74% ± 1%)和6-12个月(62% ± 4%)的早期检测中超越了所有基线模型。通过外科医生调查的临床验证显示,TREX的整体准确率与主治医师的水平相匹配(TREX: 86.21% vs. 临床医生: 87.84% ± 1.28%)。最后,我们通过结合治疗前(pre-TNT)和重新分期内窥镜检查,探索了TREX预测治疗反应的能力,达到了73% ± 12%的平衡准确率。这些结果表明,内窥镜的纵向深度学习分析可能改善监测,并能够更早地识别直肠癌再生长。
cs.CV / 29 / 2605.12917

Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification

适应性符合预测在可靠且可解释的医学图像分类中的应用
Octadion, One, Yudistira, Novanto, Muflikhah, Lailil
Abstract
Deep learning models for medical imaging often exhibit overconfidence, creating safety risks in ambiguous diagnostic scenarios. While Conformal Prediction (CP) provides distribution-free statistical guarantees, standard methods such as Regularized Adaptive Prediction Sets (RAPS) optimize for average efficiency and can mask severe failures on difficult inputs. We propose an Adaptive Lambda Criterion for RAPS that minimizes the worst-case coverage violation across prediction set size strata. On OrganAMNIST (58,850 abdominal CT images, 11 classes), standard size-optimized RAPS converges to near-deterministic behavior with stratified undercoverage on uncertain samples, while our method achieves 95.72 percent global coverage with average set size 1.09 and at least 90 percent coverage across all strata. Cross-domain validation on PathMNIST (107,180 pathology images, 9 classes) confirms generalizability. Quantitative Grad-CAM analysis (rho = -0.30, p < 1e-22) shows that multi-label predictions correspond to focused attention on anatomically ambiguous regions. These results demonstrate that the proposed method improves reliability while maintaining efficiency, making it suitable for safety-critical medical AI applications.
Chinese Translation
医学影像的深度学习模型常常表现出过度自信,这在模糊的诊断场景中带来了安全风险。虽然符合预测(Conformal Prediction, CP)提供了无分布的统计保证,但标准方法如正则化自适应预测集(Regularized Adaptive Prediction Sets, RAPS)优化平均效率,可能掩盖在困难输入上的严重失败。我们提出了一种针对RAPS的自适应Lambda准则,该准则最小化预测集大小层次上的最坏情况覆盖违例。在OrganAMNIST(58,850张腹部CT图像,11个类别)上,标准的大小优化RAPS在不确定样本上收敛到近乎确定性行为,并出现分层欠覆盖,而我们的方法在平均预测集大小为1.09的情况下实现了95.72%的全球覆盖率,并在所有层次上至少达到90%的覆盖率。在PathMNIST(107,180张病理图像,9个类别)上的跨领域验证确认了该方法的可推广性。定量Grad-CAM分析(rho = -0.30, p < 1e-22)表明,多标签预测与对解剖上模糊区域的集中注意力相对应。这些结果表明,所提出的方法在提高可靠性的同时保持了效率,适用于安全关键的医学人工智能应用。
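The quantity this abstract targets, the worst coverage shortfall across prediction-set-size strata, can be made concrete with a small sketch. The helper names, the toy prediction sets, and the 0.9 target below are assumptions for illustration; this is not the authors' Adaptive Lambda Criterion, only the stratified check such a criterion would evaluate.

```python
from collections import defaultdict


def stratified_coverage(prediction_sets, labels):
    """Return {set_size: empirical coverage} over prediction sets of that size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred_set, label in zip(prediction_sets, labels):
        size = len(pred_set)
        totals[size] += 1
        hits[size] += int(label in pred_set)
    return {size: hits[size] / totals[size] for size in totals}


def worst_case_violation(coverage_by_size, target):
    """Largest shortfall below the target coverage across strata (0.0 if none)."""
    return max(max(0.0, target - cov) for cov in coverage_by_size.values())


# Made-up prediction sets and true labels.
sets = [{0}, {1}, {2}, {3}, {0, 1}, {1, 2}, {2, 3}, {0, 3}]
true = [0, 1, 2, 0, 1, 1, 2, 3]
cov = stratified_coverage(sets, true)
gap = worst_case_violation(cov, target=0.9)
overall = sum(label in s for s, label in zip(sets, true)) / len(true)
```

In this toy data the overall coverage is 0.875, yet the size-1 stratum covers only 75% of its samples while the size-2 stratum covers all of them. A criterion tuned on the average alone would not see that imbalance, which is the failure mode the abstract describes.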
cs.CV / 30 / 2605.12919

GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting

GuardMarkGS:3D高斯点云的统一所有权追踪与编辑威慑
Jeong, Utae, Choi, Jaewan, Lee, Junseok, Jeong, Jongheon, Yoon, Sang Ho, Koh, ByoungSoo, Kim, Sangpil
Abstract
3D Gaussian Splatting (3DGS) is becoming a practical representation for novel view synthesis, but its growing adoption, together with rapid advances in instruction-driven 3DGS editing, also exposes a dual copyright risk: once a 3DGS-based asset is released, it can be used without permission and manipulated through 3D editing. Existing protection methods address only one side of this problem. Watermarking can trace ownership after unauthorized use, but it cannot prevent malicious editing. Adversarial edit-deterrence methods can disrupt editing, but they do not provide evidence of ownership. To the best of our knowledge, we present the first unified protection framework for 3DGS that jointly optimizes ownership tracing and unauthorized editing deterrence. Our framework combines a scene-wide watermarking objective over all Gaussians with an adversarial objective for edit deterrence. The adversarial branch combines latent-anchor separation, denoising-trajectory diversion, and cross-attention diversion to divert the editing trajectory, while an update-saliency-motivated Gaussian selection strategy assigns stronger adversarial updates to mask-selected Gaussians, improving the balance among watermark recovery, edit deterrence, and rendering fidelity. Experiments on scenes from Mip-NeRF 360 and Instruct-NeRF2NeRF demonstrate that the proposed framework achieves a favorable balance among bit accuracy, edit deterrence, and rendering quality. These results suggest that practical copyright protection of 3DGS-based assets can be more effectively addressed by integrating ownership tracing and unauthorized editing deterrence into a single optimization framework.
Chinese Translation
3D高斯点云(3DGS)正成为新视图合成的实用表示方法,但其日益普及以及基于指令的3DGS编辑的快速进展也暴露出双重版权风险:一旦基于3DGS的资产被发布,就可能在未经许可的情况下被使用并通过3D编辑进行操控。现有的保护方法仅解决了这一问题的一方面。水印技术可以在未经授权使用后追踪所有权,但无法防止恶意编辑。对抗性编辑威慑方法可以干扰编辑,但并不提供所有权的证据。根据我们所知,我们提出了第一个针对3DGS的统一保护框架,该框架共同优化所有权追踪和未经授权编辑的威慑。我们的框架结合了对所有高斯点的场景级水印目标与编辑威慑的对抗性目标。对抗性分支结合了潜在锚点分离、去噪轨迹偏转和交叉注意力偏转,以转移编辑轨迹,而更新显著性驱动的高斯选择策略则为掩码选择的高斯点分配更强的对抗性更新,从而改善水印恢复、编辑威慑和渲染保真度之间的平衡。在Mip-NeRF 360和Instruct-NeRF2NeRF的场景实验表明,所提出的框架在比特准确性、编辑威慑和渲染质量之间达成了良好的平衡。这些结果表明,通过将所有权追踪和未经授权编辑威慑整合到一个单一的优化框架中,可以更有效地解决基于3DGS资产的实际版权保护问题。
cs.CV / 31 / 2605.12929

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

解剖槽:用于视网膜诊断的同源双侧推理的无监督解剖因子分解
Ma, Yingzhe, Yang, Xiao, Yin, Yuguo, Wang, Zheyu
Abstract
Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into slots and aligning slots across eyes via bidirectional cross-attention. On ODIR-5K with n = 10 seeds, the method improves AUC by 4.2% over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, W = 0, p = 0.002). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis.
Chinese Translation
视网膜诊断本质上是双侧的:临床医生比较眼睛之间的同源结构(例如,视神经盘不对称),然而大多数深度模型在单眼表示上进行操作。我们研究了显式结构对应是否能改善诊断,并提出了解剖槽(Anatomy-Slot)来实现这一假设。解剖槽通过将补丁标记分解为槽,并通过双向交叉注意力在眼睛之间对齐槽,引入了一种无监督的解剖瓶颈。在 ODIR-5K 数据集上,使用 n = 10 个种子,该方法相比于匹配的 ViT-L 基线提高了 4.2% 的 AUC(95% 置信区间;Wilcoxon 符号秩检验,W = 0,p = 0.002)。配对破坏和在高斯噪声下的压力测试提供了对应依赖性和在损坏下的鲁棒性的受控测试。我们还报告了在 REFUGE 数据集上的定量视神经盘定位以及交叉注意力定位分析。
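The reported statistics can be cross-checked with a short worked computation. The sketch below evaluates the exact two-sided Wilcoxon signed-rank p-value by counting sign assignments, assuming W is the smaller of the two signed-rank sums and that the ten paired differences contain no ties or zeros (both are assumptions, since the abstract does not state them). The function name is hypothetical.

```python
def wilcoxon_exact_two_sided(n, w):
    """Exact two-sided p-value for signed-rank statistic w with n untied, nonzero pairs."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of ranks {1..n} whose sum is s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    lower_tail = sum(counts[: w + 1])
    # Each of the 2**n sign patterns is equally likely under the null hypothesis.
    return min(1.0, 2 * lower_tail / 2 ** n)


p = wilcoxon_exact_two_sided(n=10, w=0)
```

Under these assumptions W = 0 means all ten seeds moved in the same direction, and the exact p-value is 2/1024, about 0.00195, which rounds to the quoted 0.002. It is also the smallest two-sided p-value this test can produce with ten pairs.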
cs.CV / 32 / 2605.12937

AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

AuraMask:一个可扩展的管道,用于开发美学反人脸识别图像滤镜
Lagogiannis, Jacob, Agnew, William, Arriaga, Rosa I., Das, Sauvik
Abstract
Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice, because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study (N = 630) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.
Chinese Translation
反人脸识别(AFR)图像滤镜以微妙的方式改变图像,使人类难以察觉,但对计算机视觉却具有强烈的干扰性。然而,尽管人们对这些颠覆监控的技术表现出广泛兴趣,用户在实践中却很少使用它们——因为这些“微妙”的改变足够明显,影响了用户的自我呈现目标。为了解决这一挑战,我们提出了AuraMask:一种新颖的方法,用于创建既具对抗有效性又符合美学接受度的AFR滤镜。通过AuraMask,我们制作了40种“美学”滤镜,模拟流行的“一键式”Instagram图像滤镜。我们展示了AuraMask滤镜在对抗开源人脸识别模型的有效性方面,达到了或超过了先前方法的效果。此外,在一项受控的在线用户研究中($N=630$),我们确认这些滤镜的用户接受度显著高于先前的方法。最后,我们将我们的AFR管道提供给社区,以加速对抗有效且美学可接受的保护措施的研究。
cs.CV / 33 / 2605.12938

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

CRePE:用于统一相机控制的视频生成的曲线光线期望位置编码
Jin, Seonghyun, Kim, Youngmin, Park, Sunwoo, Ye, Jong Chul
Abstract
Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.
Chinese Translation
基于相机条件的视频生成需要在相机运动、镜头配置和场景结构变化下保持可靠的位置编码。然而,现有的注意力级别相机编码要么仅提供光线信号,要么依赖针孔相机几何,限制了它们在统一相机模型下的广泛应用,包括广角和鱼眼镜头。为了解决这一限制,我们提出了曲线光线期望位置编码(CRePE)。CRePE将每个图像标记表示为沿其源光线的深度感知位置分布,提供与统一相机模型兼容的位置编码,捕捉由广角和鱼眼相机引起的投影路径几何。CRePE通过添加几何注意力适配器到冻结的视频DiTs中实现,将标记级的场景距离信息注入选定的注意力层,并通过单目几何基础模型的伪监督进行稳定。这一设计导致了更稳定的相机控制,并改善了多个几何感知和感知质量指标,同时在视频质量指标上保持竞争力。受控位置编码消融实验显示出比RayRoPE风格的端点PE基线更好的整体平均排名,证明了在多种相机模型中UCM感知的投影路径集成的有效性。此外,通过将相同的位置编码路径扩展到通过径向混合强制(Radial MixForcing)进行外部几何控制,CRePE支持场景几何条件生成和超越相机控制的源视频运动转移的外部径向图控制。
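For readers unfamiliar with the Unified Camera Model (UCM) that this abstract builds on, the sketch below shows its projection function in the commonly used parameterization with intrinsics fx, fy, cx, cy and a distortion parameter alpha in [0, 1]. It is not code from CRePE; the function name is hypothetical and the positivity guard is a simplification of the model's full validity condition.

```python
import math


def ucm_project(point, fx, fy, cx, cy, alpha):
    """Project a 3D camera-frame point to pixel coordinates under the UCM."""
    x, y, z = point
    d = math.sqrt(x * x + y * y + z * z)
    # alpha = 0 reduces the denominator to z, i.e. the pinhole model.
    denom = alpha * d + (1.0 - alpha) * z
    if denom <= 0.0:
        # Simplified guard (an assumption): reject points with no valid projection.
        raise ValueError("point is outside the valid projection region")
    return fx * x / denom + cx, fy * y / denom + cy


# Made-up intrinsics and an off-axis point.
pinhole_uv = ucm_project((3.0, 0.0, 4.0), 100.0, 100.0, 50.0, 60.0, alpha=0.0)
wide_uv = ucm_project((3.0, 0.0, 4.0), 100.0, 100.0, 50.0, 60.0, alpha=0.5)
```

With alpha = 0.5 the same off-axis point lands closer to the principal point than under the pinhole model, which is how a wide field of view is compressed into the image. This nonlinearity is the reason encodings derived from pinhole geometry do not carry over directly to wide-angle and fisheye lenses.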
cs.CV / 34 / 2605.12939

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

DirectTryOn:通过直线条件传输实现一步虚拟试穿
Sun, Xianbing, Zhan, Jiahui, Zhang, Liqing, Zhang, Jianfu
Abstract
Recent diffusion- and flow-based VTON methods achieve strong results with pretrained generative models, but their reliance on multi-step sampling incurs high inference cost, while existing acceleration methods largely overlook the intrinsic structure of the try-on task. In this paper, we highlight a key observation: VTON outputs are highly constrained by the conditional inputs, suggesting that the conditional sampling trajectory can be much straighter than that in general image generation, making one-step generation a natural solution. However, limited task-specific data makes training from scratch impractical, forcing existing methods to fine-tune pretrained models whose objectives do not encourage such straight conditional trajectories. Thus, the deviation from an ideal straight path mainly comes from the mismatch between pretrained base models and the conditional nature of try-on generation, rather than from the task itself. Motivated by this insight, we encourage straighter VTON sampling trajectories through three targeted modifications: pure conditional transport, a garment preservation loss, and a self consistency loss. We further introduce a one-step distillation stage. Extensive experiments show that our method achieves state-of-the-art performance with one-step sampling, establishing a new standard for efficient and high-quality VTON.
Chinese Translation
最近的基于扩散和流的方法在预训练生成模型的支持下取得了良好的虚拟试穿(VTON)效果,但它们对多步采样的依赖导致了高昂的推理成本,而现有的加速方法在很大程度上忽视了试穿任务的内在结构。本文强调了一个关键观察:VTON 输出受到条件输入的高度限制,这表明条件采样轨迹可以比一般图像生成中的轨迹更为直线化,使得一步生成成为一种自然的解决方案。然而,有限的任务特定数据使得从头训练变得不切实际,迫使现有方法对预训练模型进行微调,而这些模型的目标并不鼓励如此直线的条件轨迹。因此,偏离理想直线路径的主要原因在于预训练基础模型与试穿生成的条件特性之间的不匹配,而不是任务本身。基于这一洞察,我们通过三项针对性的修改来鼓励更直的 VTON 采样轨迹:纯条件传输、服装保留损失和自一致性损失。我们进一步引入了一步蒸馏阶段。大量实验表明,我们的方法在一步采样中实现了最先进的性能,为高效且高质量的 VTON 建立了新的标准。
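The link between straight sampling trajectories and one-step generation can be illustrated with a toy ODE. The sketch below integrates two hand-written 2D velocity fields with fixed-step Euler; both fields, the start and target points, and all names are made up for illustration and have nothing to do with the paper's model.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


start, target = [1.0, 0.0], [0.0, 1.0]


def straight_velocity(x, t):
    # Constant velocity along the line from start to target.
    return [b - a for a, b in zip(start, target)]


def curved_velocity(x, t):
    # Quarter-circle rotation that also carries start to target at t = 1.
    w = math.pi / 2
    return [-w * x[1], w * x[0]]


def error(x):
    return math.dist(x, target)


straight_one_step = error(euler_sample(straight_velocity, start, 1))
curved_one_step = error(euler_sample(curved_velocity, start, 1))
curved_many_steps = error(euler_sample(curved_velocity, start, 1000))
```

A single Euler step is exact on the straight path, while the curved path needs many steps to reach the same endpoint. This is the sense in which straightening the conditional trajectory makes one-step sampling a natural fit.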
cs.CV / 35 / 2605.12952

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

揭穿 Grad-ECLIP:关于其错误性及模型解释基本原则的综合研究
Cui, Yongjin, Fan, Xiaohui
Abstract
Grad-ECLIP was published at ICML 2024 and represents a new technical route for Transformer interpretation, based on intermediate features. First, this paper demonstrates that the intermediate feature-based route is not novel. Building on the existing attention-based route, we develop Attention-ECLIP, which is completely equivalent to Grad-ECLIP but simpler to compute. Through both formal derivation and experimental validation, we prove that the intermediate feature-based route represented by Grad-ECLIP is actually an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed. The model interpretation results obtained by Grad-ECLIP are not those of the original model, and the interpretation results are misaligned with the model's performance. We analyze the causes of Grad-ECLIP's flaws and propose, or rather explicitly emphasize, two fundamental principles that model interpretation should adhere to in order to avoid similar errors.
Chinese Translation
Grad-ECLIP 于 ICML 2024 发布,代表了一种新的基于中间特征的 Transformer 解释技术路线。首先,本文证明了基于中间特征的技术路线并不是一种新颖的路线。基于现有的基于注意力的路线,我们开发了 Attention-ECLIP,它与 Grad-ECLIP 完全等价,但计算更为简单。通过形式推导和实验验证,我们证明了 Grad-ECLIP 所代表的基于中间特征的路线实际上是基于注意力路线的一个等价变体。接下来,本文展示了 Grad-ECLIP 方法的缺陷。通过 Grad-ECLIP 获得的模型解释结果并非原始模型的结果,并且解释结果与模型的性能不一致。我们分析了 Grad-ECLIP 缺陷的原因,并提出,或者说,明确强调模型解释应遵循的两个基本原则,以避免类似错误的发生。
cs.CV / 36 / 2605.12953

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Seg-Agent:无训练语言引导分割的测试时多模态推理
Hao, Chao, Xu, Jun, Du, Ji, Ye, Shuo, Qiao, Ziyue, Cun, Xiaodong, Wang, Guangcong, Zheng, Xubin, Yu, Zitong
Abstract
Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.
Chinese Translation
语言引导的分割超越了传统语义分割的范围限制,使模型能够根据自然语言指令对任意目标区域进行分割。现有方法通常采用两阶段框架:首先利用多模态大型语言模型(Multimodal Large Language Models, MLLMs)来解释指令并生成视觉提示,然后由基础分割模型(如SAM)生成掩膜。然而,由于现成的MLLMs在空间定位能力上的局限性,这些方法通常依赖于在大规模数据集上的广泛训练以达到令人满意的准确性。尽管最近的进展引入了推理机制以提高性能,但它们主要在文本领域内运作,仅基于抽象文本表示进行链式推理,而没有直接的视觉反馈。在本文中,我们提出了Seg-Agent,一个完全无训练的框架,开创了显式多模态推理链。与以往仅基于文本的推理不同,我们的方法构建了一个包含生成、选择和精炼三个阶段的互动视觉推理循环。具体而言,我们利用标记集(Set-of-Mark, SoM)视觉提示直接在图像上呈现候选区域,使MLLM能够“看到”并迭代推理视觉领域中的空间关系,而不仅仅是文本领域。这种显式的多模态互动使Seg-Agent在没有任何参数更新的情况下实现了与最先进的基于训练的方法相媲美的性能。此外,为了全面评估在多样场景下的泛化能力,我们引入了Various-LangSeg,一个涵盖显式语义、通用物体和推理引导分割任务的新基准。大量实验表明我们方法的有效性和鲁棒性。
cs.CV / 37 / 2605.12954

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

AdaFocus:具有零缓存回溯的自适应相关性-多样性采样用于高效的长视频理解
Yang, Xiao, Ma, Yingzhe, Yu, Haoxuan, Li, Zixin, Qin, Ning
Abstract
Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.
Chinese Translation
长视频理解受到严格的一次性范式的严重制约:现有方法要么以高昂的内存和延迟成本密集编码视频,要么将其激进压缩为稀疏帧集,从而不可逆地丢弃下游推理所需的细粒度证据。因此,当前模型在平衡时间覆盖、视觉细节和计算效率方面面临挑战。我们提出了AdaFocus,一个高效的框架,将长视频理解重新定义为渐进式证据获取,而非一次性编码。AdaFocus依赖于两个紧密耦合的组件。首先,查询感知自适应相关性-多样性采样器(AdaRD)生成一个紧凑而信息丰富的视频预览,当查询缺乏可靠的局部基础时,它会自适应切换到全局聚类。其次,AdaFocus引入了一种不确定性触发的精炼机制,而不是在内存中缓存详尽的帧序列。它仅在模型不自信时执行有针对性的回溯,通过零缓存I/O设计直接从磁盘检索高分辨率证据。这将被丢弃的视觉细节从不可逆损失转变为按需可恢复的证据,而无需支付详尽预加载的成本。在七个标准长视频基准上的实验表明,AdaFocus在效率-准确性权衡方面显著优于强基线。与传统的密集编码相比,AdaFocus在任务性能上取得了改善(例如,在VideoMME上提高了2.59的准确率,在Charades-STA上提高了8.39的mIoU,相较于单次推理),同时将视觉标记消耗减少了约33倍,并通过其零缓存磁盘检索设计消除了对内存帧预缓存的需求。这些发现表明,渐进式预览结合零缓存证据精炼是一种极为有效的可扩展多媒体推理范式。
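The uncertainty-triggered look-back described in the AdaFocus abstract can be pictured with a small sketch. The abstract does not say how confidence is measured, so the normalized-entropy test, the `threshold` value, and both function names below are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_look_back(answer_probs, threshold=0.5):
    """Return True when the answer distribution is too uncertain.

    Hypothetical rule: normalize the entropy to [0, 1] by dividing by the
    entropy of a uniform distribution, then compare with a threshold.
    """
    if len(answer_probs) < 2:
        return False
    normalized = entropy(answer_probs) / math.log(len(answer_probs))
    return normalized > threshold


# A confident prediction keeps the preview; a flat one triggers disk retrieval.
confident = needs_look_back([0.97, 0.01, 0.01, 0.01])
uncertain = needs_look_back([0.25, 0.25, 0.25, 0.25])
```

In such a design, only the `uncertain` case would pay the cost of reading high-resolution frames from disk, which is the trade-off the abstract describes.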
cs.CV / 38 / 2605.12957

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

GTA:通过几何优先再到外观的视频扩散推进图像到3D世界生成
Zhu, Hanxin, Wang, Cong, Tu, Peiyan, Luo, Jiayi, He, Tianyu, Jin, Xin, Chen, Zhibo
Abstract
Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.
Chinese Translation
最近生成模型和大规模数据集的发展显著推动了3D世界生成,促进了空间智能、具身智能和自动驾驶等多个领域的发展。尽管取得了显著进展,现有的3D世界生成方法通常优先考虑外观预测,而对基础几何的建模有限,导致了场景结构估计不可靠和跨视角一致性降低等问题。为了解决这些局限性,受到人类视觉感知的粗到细特性的启发,我们提出了GTA,一种遵循几何优先再到外观(Geometry-Then-Appearance)范式的新颖图像到3D世界生成方法。具体而言,给定一幅输入图像,为了提高合成3D场景的结构保真度,GTA采用了一个两阶段框架,配备两个专用的视频扩散模型,首先从新视角生成粗略的几何结构,然后基于预测的几何合成细粒度的外观。为了进一步增强跨视角的外观一致性,我们在训练过程中引入了一种随机潜在洗牌策略,并采用了一种测试时缩放方案,以提高感知质量而不影响定量性能。大量实验表明,我们提出的方法在保真度、视觉质量和几何准确性方面始终优于现有方法。此外,GTA被证明作为一个通用增强模块,进一步提升了现有图像到3D世界生成管道的生成质量,并支持多个下游应用,在模型训练过程中展现出良好的数据效率,突显了其多功能性和广泛适用性。项目页面:https://hanxinzhu-lab.github.io/GTA/
cs.CV / 39 / 2605.12961

Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering

减少偏差和方差:生成语义指导与双层集成用于图像聚类
Li, Feijiang, Li, Zhenxiong, Wang, Jieting, Jiu, Zizheng, Liu, Saixiong, Du, Liang
Abstract
Image clustering aims to partition unlabeled image datasets into distinct groups. A core aspect of this task is constructing and leveraging prior knowledge to guide the clustering process. Recent approaches introduce semantic descriptions as prior information, most of which typically rely on matching-based techniques with predefined vocabularies. However, the limited matching space restricts their adaptability to downstream clustering tasks. Moreover, these methods primarily focus on reducing bias to improve performance, frequently overlooking the importance of variance reduction. To address these limitations, we propose GSEC (Image Clustering based on Generative Semantic Guidance and Bi-Layer Ensemble), a framework designed to reduce bias through generative semantic guidance and mitigate variance via ensemble learning. Our method employs Multimodal Large Language Models to generate semantic descriptions and derive image embeddings via weighted averaging. Additionally, a bi-layer ensemble strategy integrates cross-modal information through BatchEnsemble in the inner layer and aligns outputs via an alignment mechanism in the outer layer. Comparative experiments demonstrate that GSEC outperforms 18 state-of-the-art methods across six benchmark datasets, while further analysis confirms its effectiveness in simultaneously reducing both bias and variance. The code is available at https://github.com/2017LI/GSEC.git.
Chinese Translation
图像聚类旨在将未标记的图像数据集划分为不同的组。该任务的核心在于构建和利用先验知识以指导聚类过程。最近的方法引入了语义描述作为先验信息,大多数方法通常依赖于基于匹配的技术和预定义的词汇。然而,有限的匹配空间限制了它们对下游聚类任务的适应性。此外,这些方法主要集中在减少偏差以提高性能,常常忽视了方差减少的重要性。为了解决这些局限性,我们提出了GSEC(基于生成语义指导和双层集成的图像聚类),这是一个旨在通过生成语义指导减少偏差,并通过集成学习减轻方差的框架。我们的方法采用多模态大型语言模型生成语义描述,并通过加权平均法导出图像嵌入。此外,双层集成策略通过BatchEnsemble在内层整合跨模态信息,并通过外层的对齐机制对输出进行对齐。比较实验表明,GSEC在六个基准数据集上优于18种最先进的方法,同时进一步分析确认其在同时减少偏差和方差方面的有效性。代码可在 https://github.com/2017LI/GSEC.git 获取。
cs.CV / 40 / 2605.12964

Asymmetric Flow Models

非对称流模型
Chen, Hansheng, Ackermann, Jan, Kim, Minseo, Wetzstein, Gordon, Guibas, Leonidas
Abstract
Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256×256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.
Chinese Translation
在高维空间中基于流的生成是困难的,因为速度预测需要建模高维噪声,即使数据具有强烈的低秩结构。我们提出了非对称流建模(Asymmetric Flow Modeling,AsymFlow),这是一种秩非对称的速度参数化方法,它将噪声预测限制在低秩子空间,同时保持数据预测为全维度。通过这种非对称预测,AsymFlow 在不改变网络架构或训练/采样程序的情况下,解析地恢复全维度的速度。在 ImageNet 256×256 数据集上,AsymFlow 达到了领先的 1.57 FID,显著超越了之前的 DiT/JiT 类像素扩散模型。AsymFlow 还首次提供了将预训练的潜在流模型微调为像素空间模型的途径:将低秩像素子空间与潜在空间对齐,提供了一个无缝的初始化,保留了潜在模型的高层语义和结构,因此微调主要改善低层次的不匹配,而不是重新学习像素生成。我们展示了从 FLUX.2 klein 9B 微调的像素 AsymFlow 模型在像素空间文本到图像生成中建立了新的最先进水平,超越了其潜在基础模型在 HPSv3、DPG-Bench 和 GenEval 上的表现,同时在视觉真实感上显著提升。
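As background for the AsymFlow entry, the sketch below shows why a data prediction or a noise prediction each determines a velocity. It assumes the common linear path x_t = (1 - t) x + t ε with velocity v = ε - x; the abstract does not state the path, and this is not AsymFlow's rank-asymmetric rule. The function names are hypothetical.

```python
def velocity_from_data(x_t, x_pred, t):
    """v = (x_t - x) / t under the assumed linear path; requires t > 0."""
    return [(xt - x) / t for xt, x in zip(x_t, x_pred)]


def velocity_from_noise(x_t, eps_pred, t):
    """v = (eps - x_t) / (1 - t) under the assumed linear path; requires t < 1."""
    return [(e - xt) / (1.0 - t) for xt, e in zip(x_t, eps_pred)]


# Toy check with exact predictions: both routes give v = eps - x.
x = [1.0, -2.0]
eps = [0.5, 0.25]
t = 0.25
x_t = [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]
v_data = velocity_from_data(x_t, x, t)
v_noise = velocity_from_noise(x_t, eps, t)
```

With exact predictions the two routes agree, so the choice of which quantity the network predicts, and in which subspace, is a parameterization decision rather than a change of the underlying flow.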
cs.CV / 41 / 2605.12967

ImageAttributionBench: How Far Are We from Generalizable Attribution?

ImageAttributionBench:我们距离可推广的归因还有多远?
Mou, Tingshu, Wei, Zhipeng, Gong, Chao, Chen, Jingjing, Ma, Xingjun
Abstract
The rapid advancement of generative AI has enabled the creation of highly realistic and diverse synthetic images, posing critical challenges for image provenance and misinformation detection. This underscores the urgent need for effective image attribution. However, existing attribution datasets are constrained by limited scale, outdated generation methods, and insufficient semantic diversity, hindering the development of robust and generalizable attribution models. To address these limitations, we introduce ImageAttributionBench, a comprehensive dataset comprising images synthesized by a wide array of advanced generative models with state-of-the-art (SOTA) architectures. Covering multiple real-world semantic domains, the dataset offers rich diversity and scale to support and accelerate progress in image attribution research. To simulate real-world attribution scenarios, we evaluate several SOTA attribution methods on ImageAttributionBench under two challenging settings: (1) training on a standard balanced split and testing on degraded images, and (2) training and testing on semantically disjoint splits. In both cases, current methods exhibit consistently poor performance, revealing significant limitations in their robustness and generalization to unseen semantic content. Our work provides a rigorous benchmark to facilitate the development and evaluation of future image attribution methods.
Chinese Translation
生成性人工智能的快速发展使得高度真实和多样化的合成图像得以创建,这对图像来源和虚假信息检测提出了重大挑战。这突显了有效图像归因的迫切需求。然而,现有的归因数据集受限于规模有限、生成方法过时以及语义多样性不足,阻碍了稳健且可推广的归因模型的发展。为了解决这些局限性,我们引入了ImageAttributionBench,这是一个综合数据集,包含了由多种先进生成模型(具有最先进的架构)合成的图像。该数据集涵盖多个真实世界的语义领域,提供了丰富的多样性和规模,以支持和加速图像归因研究的进展。为了模拟真实世界的归因场景,我们在ImageAttributionBench上评估了几种最先进的归因方法,采用了两种具有挑战性的设置:(1)在标准平衡划分上训练并在降级图像上测试,以及(2)在语义不相交的划分上进行训练和测试。在这两种情况下,当前方法的表现始终不佳,揭示了它们在鲁棒性和对未见语义内容的推广能力方面的显著局限性。我们的工作提供了一个严格的基准,以促进未来图像归因方法的发展和评估。
cs.CV / 42 / 2605.13010

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

基于预训练扩散模型的图像修复的摊销指导
Huang, Yilie, Zhou, Xun Yu
Abstract
We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor-critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality-speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.
Chinese Translation
我们研究了使用生成性扩散模型进行图像修复的技术。现有方法通常要么训练专门的任务特定模型,要么在部署时为每个被遮挡的图像单独调整预训练的扩散模型。我们提出了一种中间模型,称为摊销扩散修复(Amortized Inpainting with Diffusion,AID),该模型保持预训练的扩散主干不变,离线训练一个小型可重用的指导模块,然后在多个被遮挡图像中重复使用该模块,而无需针对每个实例进行优化。我们将其形式化为一个具有监督终端目标的确定性指导问题。为了使该问题在高维空间中可学习,我们推导出一个辅助高斯形式,并证明解决该随机问题可以恢复最优的确定性指导场。这一桥梁为以完全数据驱动的方式学习指导模块提供了一个原则性的连续时间演员-评论家算法。在实证研究中,在AFHQv2和FFHQ的像素EDM管道以及在ImageNet的潜在EDM2管道下,AID在多个遮挡类型中始终改善了强固定主干和摊销修复基线的质量-速度权衡,同时增加的可训练开销不足百分之一。
cs.CV / 43 / 2605.13018

OCH3R: Object-Centric Holistic 3D Reconstruction

OCH3R:面向对象的整体3D重建
Du, Yi, You, Yang, Wan, Xiang, Guibas, Leonidas
Abstract
Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.
Chinese Translation
面向对象的场景理解是计算机视觉中的一个基本挑战。现有方法通常依赖于多阶段管道,首先应用预训练的分割器提取单个对象,然后进行每个对象的3D重建。这些方法计算开销大,对分割错误敏感,并且在场景复杂性增加时扩展性差。我们提出了OCH3R,一个统一的框架,用于从单个RGB图像进行面向对象的整体3D重建。OCH3R通过一次前向传播同时预测所有对象实例的6D姿态和详细的3D重建。其关键思想是采用变换器架构,预测每个像素的属性,包括基于CLIP的类别嵌入、度量深度、归一化对象坐标(NOCS)以及表示每个对象的固定数量的3D高斯分布。为了监督这些高斯重建,我们使用预测的6D姿态将其转换到标准空间,并与预渲染的标准真实值对齐,从而避免了昂贵的每图像高斯标签生成。在标准室内基准测试中,OCH3R在单目深度估计、开放词汇语义分割和仅RGB的类别级6D姿态估计方面实现了最先进的性能,同时生成高保真、可编辑的每个对象重建。关键是,推理过程完全是前馈的,并且与对象数量无关地扩展,在杂乱场景中提供了相较于传统多阶段管道数量级的速度提升。
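The canonical-space supervision mentioned in the OCH3R abstract relies on inverting a rigid pose. A minimal sketch, assuming a pose given as a rotation matrix R and translation t that map canonical points to camera space (p = R q + t), so that q = Rᵀ (p - t). It handles only Gaussian centers and ignores object scale and covariances, which the abstract does not detail; `to_canonical` is a hypothetical helper name.

```python
def to_canonical(point, rotation, translation):
    """Map a camera-space 3D point into canonical object space.

    rotation is a 3x3 matrix as a list of rows; the result is R^T (p - t).
    """
    diff = [p - t for p, t in zip(point, translation)]
    # Multiplying by the transpose: entry c sums over rows r of R[r][c] * diff[r].
    return [sum(rotation[r][c] * diff[r] for r in range(3)) for c in range(3)]


# Toy pose: 90-degree rotation about the z axis plus a translation.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [2.0, 3.0, 4.0]
camera_point = [2.0, 4.0, 4.0]  # equals R applied to [1, 0, 0], plus t
canonical_point = to_canonical(camera_point, R, t)
```

Once every predicted center is expressed in this shared frame, it can be compared against a single pre-rendered canonical target per object rather than a label generated per image.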
cs.CV / 44 / 2605.13027

PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

PRISM:基于扩散的文本图像超分辨率的先验校正与不确定性感知结构建模
Xu, Zihang, Liu, Xiaoyang, Chen, Zheng, Zhang, Yulun, Yang, Xiaokang
Abstract
Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at https://github.com/faithxuz/PRISM.
Chinese Translation
文本图像超分辨率(Text-SR)不仅需要视觉上合理的细节合成:笔画拓扑的轻微错误可能会改变字符身份并影响可读性。现有方法通过更强的基于识别或生成的先验来提高文本的保真度,但在严重退化的情况下仍面临两个未解决的挑战:从低质量输入中提取的文本条件本身可能不可靠,而合理的全局先验并不能完全确定细致的笔画边界。我们提出了PRISM,一个单步扩散基础的Text-SR框架,通过流匹配先验校正(Flow-Matching Prior Rectification, FMPR)和结构引导的不确定性感知残差编码器(Structure-guided Uncertainty-aware Residual Encoder, SURE)来解决这两个挑战。FMPR从配对的低质量/高质量潜变量中构建特权训练时先验,并学习一种流匹配,将退化的嵌入传输到这一恢复导向的先验空间,从而产生更准确和可靠的全局文本指导。SURE进一步预测不确定性感知的结构残差,以选择性地吸收可靠的局部边界证据,同时抑制模糊的笔画线索。这些组件共同实现了在单次扩散恢复过程中显式的全局先验校正和局部结构细化。在合成和真实世界基准上的实验表明,PRISM在毫秒级推理下实现了最先进的性能。我们的数据集和代码将发布在 https://github.com/faithxuz/PRISM。
cs.CV / 45 / 2605.13034

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

ViDR:将多模态深度研究报告与源视觉证据相结合
Shi, Zhuofan, Jia, Peilun, Sun, Baoqin, Shen, Haiyang, Xie, Sixiong, Ma, Yun, Jing, Xiang
Abstract
Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.
Chinese Translation
近期的深度研究系统提升了大型语言模型生成长篇、基于证据的报告的能力,主要通过迭代检索和推理。然而,大多数以文本为中心的系统主要依赖于文本证据,而多模态系统往往仅弱地检索图像或自行生成图表,导致源图形作为证据的使用不足。我们提出了ViDR,一个多模态深度研究框架,将长篇报告与源图形相结合。ViDR将源图形视为可检索、可解释、可路由和可验证的证据对象,同时在需要时生成分析图表。它构建了一个证据索引大纲,将主张与文本和视觉证据链接,通过上下文感知过滤、基于大纲的重新排序和基于视觉语言模型(VLM)的视觉分析,将嘈杂的网络图像精炼为源图形证据原子,并使用特定于章节的证据生成每个章节。ViDR进一步验证视觉引用,以减少虚构或错误放置的图形。我们还引入了MMR Bench+,一个用于评估深度研究报告中视觉证据使用的基准,涵盖源图形检索、放置、解释、可验证性和分析图表生成。实验表明,ViDR在整体报告质量、源图形整合和可验证性方面优于强大的商业和开源基线。这些结果表明,源视觉证据对于多模态深度研究至关重要,因为它增强了证据基础、视觉支持和报告的可验证性。
cs.CV / 46 / 2605.13038

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

CoGE:单目结肠镜下的在线几何估计
Shao, Liangjing, Cui, Beilei, Ren, Hongliang
Abstract
Geometric estimation, including depth estimation and scene reconstruction, is a crucial technique for colonoscopy that can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain because of the narrow, enclosed space of the colon, and artifacts and illumination cause a large feature gap between simulated and realistic data. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. First, we propose an illumination-aware supervision module based on Retinex theory to address illumination diversity across colonoscopy scenes. Second, we propose a structure-aware perception module based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model, trained solely on simulated data, achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.
Chinese Translation
几何估计,包括深度估计和场景重建,是结肠镜检查中的一项关键技术,可以为外科医生提供三维空间感知和导航。然而,由于结肠狭窄和封闭的空间,几何真实值在结肠镜检查中难以获取,同时,由于伪影和照明的影响,模拟数据与真实数据之间存在较大的特征差距。本文提出了CoGE,一个用于结肠镜检查期间在线单目几何估计的新框架。首先,我们提出了一种基于Retinex理论的照明感知监督模块,以解决不同结肠镜场景中的照明多样性。此外,基于小波分解,我们提出了一种结构感知感知模块,以提取结肠的共同结构和局部特征。定量和定性结果均表明,单独在模拟数据上训练的模型在模拟和真实场景的几何估计中均达到了最先进的性能。
cs.CV / 47 / 2605.13041

EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing

EgoForce:通过扩散强制实现稳健的在线自我中心运动重建
Hwang, Inwoo, Lim, Donggeun, Jang, Hojun, Kim, Young Min
Abstract
With recent advances in embodied agents and AR devices, egocentric observations are readily available as input for real-world interactive online applications. However, egocentric viewpoints can only sporadically observe hands, in addition to the estimated head trajectory. We propose EgoForce, an online framework for reconstructing long-term full-body motion from noisy egocentric input. While existing generative frameworks can robustly handle noisy and sparse measurements, they assume a fixed-length observation window is available and are thus not suitable for real-time applications. Faster inference often relies on autoregressive prediction, sacrificing robustness. In contrast, we adopt a diffusion-based method with a temporally asymmetric noise schedule inspired by Diffusion Forcing. Specifically, our approach models temporally evolving uncertainty and incrementally denoises states as new streaming observations arrive. Combined with a noise-robust imputation strategy, EgoForce progressively generates stable and coherent full-body motion under strict causal constraints. Experiments demonstrate that our online framework outperforms existing online and offline methods, enabling long-horizon, full-body motion reconstruction in challenging egocentric scenarios.
Chinese Translation
随着具身代理和增强现实设备的最新进展,自我中心观察已成为现实世界交互在线应用的可用输入。然而,自我中心视角只能偶尔观察到手部动作,此外还需估计头部轨迹。我们提出了EgoForce,这是一个用于从噪声自我中心输入中重建长期全身运动的在线框架。虽然现有的生成框架能够稳健地处理噪声和稀疏测量,但它们假设可用固定长度的观察窗口,因此不适合实时应用。更快的推理往往依赖于自回归预测,这会牺牲稳健性。相比之下,我们采用了一种基于扩散的方法,使用受扩散强制启发的时间不对称噪声调度。具体而言,我们的方法建模时间演变的不确定性,并在新的流式观察到达时逐步去噪状态。结合稳健的噪声插补策略,EgoForce在严格的因果约束下逐步生成稳定且连贯的全身运动。实验表明,我们的在线框架优于现有的在线和离线方法,使得在具有挑战性的自我中心场景中实现长时间范围的全身运动重建成为可能。
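The temporally asymmetric noise schedule in the EgoForce abstract can be illustrated with a toy staircase over a sliding window. The abstract gives no schedule details, so the integer levels, the window size, and the function names here are assumptions chosen only to show the streaming pattern: older frames are closer to clean, and the newest frame enters at full noise.

```python
def initial_levels(window_size):
    """Remaining denoising steps per frame, oldest first: 1, 2, ..., window_size."""
    return list(range(1, window_size + 1))


def advance(levels):
    """Process one new streaming observation.

    Every frame in the window is denoised by one step. The oldest frame
    reaches level 0 and is emitted, and a new frame enters at maximum noise.
    """
    max_level = len(levels)
    stepped = [level - 1 for level in levels]
    emitted_clean = stepped[0] == 0
    new_levels = stepped[1:] + [max_level]
    return new_levels, emitted_clean


levels = initial_levels(4)
next_levels, emitted = advance(levels)
```

In this toy version the staircase is a steady state: each arriving observation emits exactly one clean frame, so output latency is bounded by the window size while later frames stay uncertain until more evidence arrives.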
cs.CV / 48 / 2605.13047

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

通过反事实语义显著性揭示人类与视觉语言模型(VLM)场景感知之间的差距
Wen, Ziqi, Madinei, Parsa, Eckstein, Miguel P.
Abstract
Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
Chinese Translation
评估大型视觉语言模型(VLM)在高层次语义场景理解方面是否与人类感知一致仍然是一项挑战。传统的白盒可解释性方法不适用于闭源架构,而被动指标无法孤立因果特征。我们提出了反事实语义显著性(Counterfactual Semantic Saliency, CSS)。该黑盒、模型无关的框架通过测量因果去除对场景引起的语义变化来量化物体的重要性。为了评估人工智能与人类的语义一致性,我们对16,289个有效响应的心理物理基线进行了测试,涵盖307个复杂自然场景和1,306个高保真反事实变体。我们的分析揭示了一个普遍的场景理解差距:模型在大物体(尺寸偏见)、图像中心的物体(中心偏见)和高显著性物体上表现出相对于人类的过度依赖。相比之下,模型在描述图像时对场景中的人类依赖程度低于我们的参与者。模型的尺寸偏见是解释模型与人类语义差异变化的主要驱动因素。代码和数据将发布在 https://github.com/starsky77/Counterfactual-Semantic-Saliency。
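The ablation-based importance score in the CSS abstract can be sketched as a distance between description embeddings before and after an object is removed. The abstract does not specify the embedding model or the distance, so the cosine distance and the toy two-dimensional vectors below are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_shift(original_emb, ablated_emb):
    """Assumed importance score: cosine distance between the two descriptions."""
    return 1.0 - cosine_similarity(original_emb, ablated_emb)


def rank_objects(original_emb, ablated_embs):
    """Sort objects by how much their removal changes the description."""
    scores = {name: semantic_shift(original_emb, emb)
              for name, emb in ablated_embs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings: removing the dog changes the description far more than the cup.
ranking = rank_objects([1.0, 0.0], {"dog": [0.0, 1.0], "cup": [1.0, 0.1]})
```

Because the score needs only model outputs, the same ranking can be computed for human descriptions and for a closed-source VLM, which is what makes a black-box comparison possible.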
cs.CV / 49 / 2605.13049

Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images

考虑不确定性的红外与可见图像空间频率配准与融合
Li, Xingyuan, Xu, Haoyuan, Zhu, Xingyue, Ma, Jun, Zou, Yang, Jiang, Zhiying, Liu, Jinyuan
Abstract
Infrared and Visible Image Fusion (IVIF) has shown promise in visual tasks under challenging environments, but fusion under unregistered conditions faces inherent misalignments. Current studies to solve them either predict the deformation parameters coarse-to-fine (i.e., coarse registration and fine registration) or estimate the deformation fields in multi-scales for registration. Though straightforward, they overlook the cumulative errors in registration, which contaminate the fusion stage and severely deteriorate the resulting images. We introduce the Spatial-Frequency Registration and Fusion (SFRF) framework, which incorporates uncertainty estimation and infrared thermal radiation distribution consistency into a unified pipeline to handle the error accumulation for robust registration and fusion across both spatial and frequency domains. Specifically, SFRF constructs a Multi-scale Iterative Registration (MIR) framework that iteratively refines the deformation field across scales, leveraging uncertainty estimation at each stage to mitigate error accumulation and enhance alignment accuracy dynamically. To ensure the accurate alignment of infrared thermal distributions during registration, thermal radiation distribution consistency is employed as a frequency-domain supervisory signal, promoting global consistency in the frequency domain. Based on the spatial-frequency alignment, SFRF further adopts a Dual-branch Spatial-Frequency Fusion (DSFF) module, which incorporates spatial geometric features and frequency distribution information to reconstruct visually appealing images. SFRF achieves impressive performance across diverse datasets.
Chinese Translation
红外与可见图像融合(IVIF)在复杂环境下的视觉任务中展现了良好的前景,但在未配准条件下的融合面临固有的错位问题。目前的研究要么采用粗到细的方式预测变形参数(即粗配准和细配准),要么在多尺度下估计变形场以进行配准。尽管这种方法简单,但它们忽视了配准中的累积误差,这会污染融合阶段并严重恶化最终图像的质量。我们提出了空间频率配准与融合(SFRF)框架,该框架将不确定性估计和红外热辐射分布一致性整合到一个统一的流程中,以处理错误累积,从而实现空间和频率域的稳健配准与融合。具体而言,SFRF构建了一个多尺度迭代配准(MIR)框架,迭代地在各个尺度上细化变形场,在每个阶段利用不确定性估计来减轻误差累积并动态增强对齐精度。为了确保在配准过程中红外热分布的准确对齐,热辐射分布一致性被用作频率域的监督信号,促进频率域的全局一致性。在空间频率对齐的基础上,SFRF进一步采用了双分支空间频率融合(DSFF)模块,该模块结合了空间几何特征和频率分布信息,以重建视觉上令人满意的图像。SFRF在多个数据集上表现出色。
cs.CV / 50 / 2605.13059

BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability

BrainAnytime:基于解剖学的跨模态预训练框架,用于具有任意模态可用性的脑影像分析
Yang, Guangqian, Ding, Tong, Hou, Wenlong, Xun, Yue, Du, Ye, Niu, Qian, Wang, Shujun
Abstract
Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume fixed data modalities as the model inputs. In this paper, we present BrainAnytime, a unified framework pretrained on 34,899 3D brain scans from five datasets that supports brain image analysis under arbitrary modality availability, spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models on most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at https://github.com/SDH-Lab/BrainAnytime.
Chinese Translation
临床诊断工作通常遵循模态升级路径:在初步临床评估后,临床医生首先进行常规结构成像(例如,MRI),然后选择性地添加FLAIR或T2等序列以细化鉴别诊断,并将分子成像(例如,淀粉样蛋白PET)保留给在标准评估后仍然不确定的病例。因此,患者通常会接受异质且往往不完整的模态子集。然而,目前大多数AI模型假设固定的数据模态作为模型输入。在本文中,我们提出了BrainAnytime,一个统一的预训练框架,基于来自五个数据集的34,899个3D脑扫描进行预训练,支持在任意模态可用性的情况下进行脑影像分析,涵盖多序列MRI和淀粉样蛋白PET。单一模型可以接受任何可用的成像,从单个T1扫描到完整的多模态工作。预训练通过跨模态蒸馏(RCMD)学习MRI和PET之间的结构-分子对应关系,并通过图谱引导的课程掩蔽(PACM)优先考虑易受疾病影响的解剖结构,所有这些都在共享的3D掩蔽自编码器(Multi-MAE3D)中进行。在四个下游任务和五个临床驱动的模态设置中,BrainAnytime在大多数模态设置上显著优于特定模态模型、缺失模态基线和大规模脑MRI预训练基础模型。值得注意的是,在CN与AD和CN与MCI分类的平均准确率上,它分别以6.2%和7.0%的相对提升超越了最强的缺失模态基线。代码可在 https://github.com/SDH-Lab/BrainAnytime 获取。
cs.CV / 51 / 2605.13062

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Edit-Compass与EditReward-Compass:图像编辑与奖励建模的统一基准
Bai, Xuehai, Shi, Yang, Zhang, Yi-Fan, Zhu, Xuanyu, Wang, Yuran, Dai, Yifan, Liu, Xinyu, Ji, Yiyan, Gu, Xiaoling, Zhang, Yuanxing
Abstract
Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.
Chinese Translation
近期的图像编辑模型在遵循指令、多模态理解和复杂视觉编辑方面取得了显著进展。然而,现有基准往往未能真实反映人类判断,尤其是对于强前沿模型,由于任务难度有限和粗糙的评估协议。与此同时,奖励模型在基于强化学习(RL)的图像编辑优化中变得越来越重要,但现有的奖励模型基准仍然依赖于不切实际的评估设置,偏离了实际的RL场景。这些局限性妨碍了对图像编辑模型和奖励模型的可靠评估。为了解决这些挑战,我们提出了Edit-Compass和EditReward-Compass,这是一个用于图像编辑和奖励建模的统一评估套件。Edit-Compass包含2,388个经过仔细注释的实例,涵盖六个逐步具有挑战性的任务类别,涉及世界知识推理、视觉推理和多图像编辑等能力。除了广泛的任务覆盖,Edit-Compass采用基于结构化推理和精心设计评分标准的细粒度多维评估框架。同时,EditReward-Compass包含2,251个偏好对,模拟在RL优化过程中现实的奖励建模场景。
cs.CV / 52 / 2605.13073

HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization

HarmoGS:通过冲突感知梯度调和实现鲁棒的野外三维高斯点云渲染
Kang, Yulei, Zhu, Tianze, Hu, Jian-Fang, Lai, Jianhuang, Zheng, Wei-Shi
Abstract
In-the-wild 3D Gaussian Splatting remains challenging due to transient distractors and illumination-induced cross-view appearance inconsistencies. Existing methods mainly rely on image-level masking to suppress unreliable supervision, but masking alone cannot fully eliminate residual occlusions or resolve illumination-induced inconsistencies, both of which can introduce conflicting cross-view gradients. These unresolved conflicts may destabilize Gaussian optimization and lead to visible reconstruction artifacts. We propose a conflict-aware 3DGS framework that addresses this problem from both image-space supervision and gradient-level optimization. Semantic Consistency-Guided Masking learns pixel-wise consistency scores to adaptively refine prior masks and suppress unreliable supervision before gradient formation. A dual-view Conflict-Aware Gradient Harmonization strategy further reconciles view-specific gradients by mutually rotating them into an orthogonal configuration, reducing negative directional interference across views. We also introduce conflict-aware densification and pruning to stabilize Gaussian growth and remove persistently conflicting primitives. Extensive experiments on standard in-the-wild benchmarks demonstrate that our method achieves state-of-the-art rendering quality under complex transient distractors and cross-view inconsistencies.
Chinese Translation
在野外环境中,三维高斯点云渲染仍然面临挑战,主要是由于瞬态干扰物和光照引起的视角间外观不一致。现有方法主要依赖于图像级掩膜来抑制不可靠的监督,但仅靠掩膜无法完全消除残余遮挡或解决光照引起的不一致性,这两者都可能引入冲突的视角间梯度。这些未解决的冲突可能会使高斯优化不稳定,并导致可见的重建伪影。我们提出了一种冲突感知的三维高斯点云框架,从图像空间监督和梯度级优化两个方面解决这一问题。语义一致性引导的掩膜学习像素级一致性分数,以自适应地细化先前的掩膜,并在梯度形成之前抑制不可靠的监督。双视角冲突感知梯度调和策略进一步通过相互旋转梯度至正交配置来调和视角特定的梯度,从而减少视角间的负向干扰。我们还引入了冲突感知的密集化和修剪,以稳定高斯增长并去除持续冲突的原始元素。在标准的野外基准测试中进行的大量实验表明,我们的方法在复杂的瞬态干扰物和视角间不一致性下实现了最先进的渲染质量。
cs.CV / 53 / 2605.13080

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

学习看你所需:多模态大语言模型的注视注意力
Song, Junha, Heo, Byeongho, Gu, Geonmo, Choo, Jaegul, Han, Dongyoon, Yun, Sangdoo
Abstract
When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings-stored as key-value caches-into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
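A minimal sketch of the region-selection idea follows. It assumes a 1D grouping of consecutive tokens as a stand-in for spatial grouping and uses the mean key of each region as its descriptor; the abstract does not specify either choice, and the learnable context tokens are left out. `gaze_attention` is a hypothetical name.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gaze_attention(query, keys, values, region_size, top_k):
    """Attend only to the visual tokens of the top_k most relevant regions."""
    # Group consecutive tokens into regions (a 1D stand-in for spatial grouping).
    regions = [list(range(i, min(i + region_size, len(keys))))
               for i in range(0, len(keys), region_size)]
    # Assumed descriptor: the mean key of each region.
    descriptors = [[sum(keys[t][d] for t in reg) / len(reg) for d in range(len(query))]
                   for reg in regions]
    scores = [dot(query, desc) for desc in descriptors]
    chosen = sorted(range(len(regions)), key=lambda r: scores[r], reverse=True)[:top_k]
    token_ids = sorted(t for r in chosen for t in regions[r])
    # Standard scaled dot-product attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[t]) * scale for t in token_ids])
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids))
           for d in range(len(values[0]))]
    return out, token_ids

keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, 0.0]]
values = [[float(i)] for i in range(6)]
out, used = gaze_attention([1.0, 0.0], keys, values, region_size=2, top_k=1)
```

With `top_k=1` only the first region (tokens 0 and 1) enters the softmax, so four of the six cached entries are never touched for this decoding step.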
Chinese Translation
当人类描述视觉场景时,他们并不会均匀地处理整个图像;相反,他们会选择性地注视与其意图描述相关的区域。与此相比,当前的多模态大语言模型(MLLMs)在每个生成步骤中关注所有视觉标记,导致注意力分散和不必要的计算开销。在本研究中,我们提出了一种新颖的机制——注视注意力(Gaze Attention),使得MLLMs能够在生成过程中选择性地关注与任务相关的视觉区域。具体而言,我们将视觉嵌入(以键值缓存形式存储)在空间上分组为紧凑的注视区域,每个区域由轻量级描述符表示。在每个解码步骤中,模型动态选择最相关的区域并限制注意力于这些区域,从而减少冗余计算,同时增强关注度。为了减轻局部注意力导致的全局上下文丧失,我们进一步提出了可学习的上下文标记,附加到每个图像或帧上,使模型能够保持整体视觉意识。在图像和视频理解基准上的大量实验表明,注视注意力的性能与密集注意力基线相当或更优,同时在注意力计算中使用的视觉键值条目减少了多达90%。
cs.CV / 54 / 2605.13081

PRA-PoE: Robust Alzheimer's Diagnosis with Arbitrary Missing Modalities

PRA-PoE:具有任意缺失模态的鲁棒阿尔茨海默病诊断
Yang, Guangqian, Du, Ye, Hou, Wenlong, Niu, Qian, Wang, Shujun
Abstract
Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
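The closed-form fusion mentioned above is the standard product of Gaussian experts: precisions add, and the fused mean is the precision-weighted average of the expert means. The sketch below works on one scalar latent dimension, marks a missing modality with `None`, and omits any prior expert; those simplifications and the function name are assumptions, not details from the paper.

```python
def product_of_experts(experts):
    """Fuse per-modality Gaussians (mean, variance); None marks a missing modality."""
    available = [e for e in experts if e is not None]
    if not available:
        raise ValueError("at least one modality must be observed")
    # Precision is the inverse variance, so an uncertain expert gets a small weight.
    precisions = [1.0 / var for _, var in available]
    total_precision = sum(precisions)
    fused_mean = sum(p * mu for p, (mu, _) in zip(precisions, available)) / total_precision
    return fused_mean, 1.0 / total_precision

# Confident expert at 0.0, missing modality, uncertain expert at 4.0.
mean, var = product_of_experts([(0.0, 1.0), None, (4.0, 3.0)])
```

The fused mean lands at 1.0, much closer to the confident expert, and the fused variance (0.75) is smaller than either input variance.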
Chinese Translation
缺失模态在现实世界的阿尔茨海默病(AD)评估中普遍存在,并对多模态学习构成了重大挑战,特别是在观察到的模态子集的分布在训练和部署之间存在差异时。这种缺失模式的不匹配会导致模态子集之间的条件表示转移。现有依赖隐式插补或模态合成的方法往往未能明确建模模态的可用性和不确定性,导致对合成特征的过度自信依赖,降低了鲁棒性,并使不确定性估计失准。为了解决这些局限性,我们提出了PRA-PoE,这是一种不完整的多模态学习框架,配备了原型锚定表示对齐(Prototype-anchored Representation Alignment, PRA)和不确定性感知专家乘积(Uncertainty-aware Product of Experts, UA-PoE)融合机制。首先,PRA使用可学习的全局原型和基于可用性的条件标记来编码模态的可用性,区分观察到的模态和缺失模态,为缺失模态重新合成特征,并自适应地细化观察到的表示,以对齐模态子集之间的潜在空间,旨在减少在不同缺失模式下的表示转移。其次,UA-PoE将每个模态建模为高斯专家,并执行闭式形式的专家乘积融合,其中不确定性较高的专家通过较低的精度自动降低权重,从而提高不确定性可靠性。我们在临床现实的协议下评估PRA-PoE,通过使用自然缺失数据进行训练,并在所有非空模态组合上进行测试。PRA-PoE在各数据集上始终超越当前最先进的方法,在ADNI上实现了5.4%的平均准确率相对提升,在OASIS-3上实现了10.9%的平均F1相对提升,超越了所有非空模态子集中的最强基线。
cs.CV / 55 / 2605.13093

RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering

RoSplat:针对变化输入视角和高分辨率渲染的鲁棒前馈像素级高斯溅射
Nguyen, Hoang Chuong, Wu, Renjie, Alvarez, Jose M., Liu, Miaomiao
Abstract
Generalizable 3D Gaussian Splatting has recently emerged as an efficient approach for novel-view synthesis, enabling feed-forward synthesis from only a few input views. However, existing pixel-wise feed-forward methods suffer from over-bright renderings when the number of input views varies during inference, as well as insufficient supervision for accurate Gaussian scale estimation, which leads to hole artifacts, particularly in high-resolution renderings. To address these issues, we identify that the over-brightness is caused by the varying number of overlapping Gaussians and propose a simple alpha normalization strategy to maintain brightness consistency across different numbers of input views. In addition, we introduce an auxiliary 3D sampling-based regularizer to improve Gaussian scale estimation, thereby mitigating hole artifacts in high-resolution rendering. Experiments on benchmark datasets demonstrate that our method significantly improves baseline models under varying input-view and high-resolution rendering settings.
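The link between overlap count and brightness can be shown with a little arithmetic. Under standard alpha compositing, n overlapping Gaussians of equal opacity accumulate to 1 - (1 - alpha)^n, which grows with n. The normalization below is one possible correction chosen for illustration; the abstract does not give the paper's formula, and both function names are hypothetical.

```python
def accumulated_opacity(alpha, n):
    """Opacity after compositing n overlapping Gaussians of equal opacity."""
    return 1.0 - (1.0 - alpha) ** n

def normalized_alpha(alpha, n):
    """Per-Gaussian opacity so that n overlapping copies composite back to alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

# One Gaussian per input view landing on the same surface point.
raw = [accumulated_opacity(0.5, n) for n in (1, 2, 4)]
fixed = [accumulated_opacity(normalized_alpha(0.5, n), n) for n in (1, 2, 4)]
```

Here `raw` rises from 0.5 to 0.75 to 0.9375 as views are added, while `fixed` stays at 0.5 for every view count.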
Chinese Translation
可泛化的3D高斯溅射最近作为一种高效的新视角合成方法出现,能够仅通过少量输入视角进行前馈合成。然而,现有的像素级前馈方法在推理过程中,当输入视角数量变化时,容易出现过亮渲染,并且对于准确的高斯尺度估计缺乏足够的监督,这导致了孔洞伪影,特别是在高分辨率渲染中。为了解决这些问题,我们发现过亮度是由重叠高斯数量的变化引起的,并提出了一种简单的α归一化策略,以保持不同输入视角数量下的亮度一致性。此外,我们引入了一种基于辅助3D采样的正则化器,以改善高斯尺度估计,从而减轻高分辨率渲染中的孔洞伪影。在基准数据集上的实验表明,我们的方法在变化输入视角和高分辨率渲染设置下显著改善了基线模型的性能。
cs.CV / 56 / 2605.13108

Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

轻量级人脸呈现攻击检测的流增强与知识蒸馏
Jabbar, Muhammad Shahid, Ibrahim, Muhammad Sohail, Siddique, Taha Hasan Masood, Huang, Kejie, Khan, Shujaat
Abstract
Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.
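Logit distillation is commonly implemented as a KL divergence between temperature-softened teacher and student class distributions. The sketch below follows that common recipe; the temperature value and the T^2 rescaling are assumptions, since the abstract only names logit distillation without giving the loss.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)  # flow-augmented teacher
    q = softmax(student_logits, temperature)  # RGB-only student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2  # common rescaling that offsets the softened gradients

# Two classes: bona fide vs. attack.
loss_same = logit_distillation_loss([2.0, -1.0], [2.0, -1.0])
loss_diff = logit_distillation_loss([-1.0, 2.0], [2.0, -1.0])
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise, so the student is pulled toward the teacher's motion-aware decisions without ever seeing optical flow.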
Chinese Translation
人脸呈现攻击检测(FacePAD)在多样的欺骗表现下仍然面临挑战,包括2D打印和重放、基于3D面具的欺骗、化妆引起的外观操控以及物理遮挡,同时还受到不同捕捉条件的影响。运动线索对于FacePAD具有高度的区分性,但通常需要显式的光流估计,这引入了大量的计算开销,并限制了实时部署。在本研究中,我们利用光流在训练过程中增强运动表示,同时在推理时消除光流计算的需求。我们提出了一种双分支教师模型,该模型将来自RGB帧的外观线索与源自色轮编码光流的运动线索融合,从而有效建模微运动和时间一致性。为了实现高效部署,我们引入了一种知识蒸馏框架,通过logit蒸馏将运动感知知识从流增强的教师模型转移到轻量级的仅RGB学生模型。因此,学生模型在推理时无需显式的光流估计或额外的特征提取模块,便能隐式学习运动敏感的表示。大量实验表明,在多个基准测试中表现出色,在Replay-Attack和Replay-Mobile上实现了0.0%的HTER,在ROSE-Youtu上实现了0.94%的HTER,在SiW-Mv2上实现了5.65%的HTER,在OULU-NPU上实现了0.42%的ACER。蒸馏后的学生模型在性能上与教师模型相当或更好,同时显著减少了参数和FLOPs,在NVIDIA Jetson Orin Nano上实现了52 FPS,表明其适用于实时和资源受限的人脸呈现攻击检测部署。
cs.CV / 57 / 2605.13111

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

金字塔强制:面向头部的金字塔 KV 缓存策略用于高质量长视频生成
Chen, Jiayu, Tang, Junbei, Zhao, Wenbiao, Li, Maoliang, Luo, Jiayi, Zheng, Zihao, Yang, Jiawei, Luo, Guojie, Chen, Xiang
Abstract
Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: https://if-lab-pku.github.io/Pyramid-Forcing/.
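The three head behaviours suggest three different retention rules for past frames. The sketch below is one hypothetical way to express them; the budget, the period, and the exact selection rules are illustrative assumptions, not the policies used in the paper.

```python
def frames_to_keep(head_type, num_past, budget, period=4):
    """Return indices of past frames (0 = oldest) whose KV entries a head retains."""
    if num_past == 0 or budget <= 0:
        return []
    all_frames = list(range(num_past))
    if head_type == "anchor":
        # Broad long-range context: spread the budget evenly over the whole history.
        if budget >= num_past:
            return all_frames
        step = (num_past - 1) / (budget - 1) if budget > 1 else 0
        return sorted({round(i * step) for i in range(budget)})
    if head_type == "wave":
        # Periodic dependencies: keep frames at a fixed stride back from the newest.
        return sorted(all_frames[::-1][::period][:budget])
    if head_type == "veil":
        # Initial frame plus the most recent neighbours.
        recent = all_frames[-(budget - 1):] if budget > 1 else []
        return sorted(set([0] + recent))
    raise ValueError(f"unknown head type: {head_type}")

policy = {h: frames_to_keep(h, num_past=12, budget=4) for h in ("anchor", "wave", "veil")}
```

For 12 past frames this gives `[0, 4, 7, 11]` for anchor heads, `[3, 7, 11]` for wave heads, and `[0, 9, 10, 11]` for veil heads. The lists have different lengths, which is the situation a ragged-cache attention kernel has to handle.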
Chinese Translation
自回归视频生成使流式、开放式的长视频合成成为可能,但仍然受到累积误差导致的长期退化的影响。现有的 KVCache 策略通常采用统一的历史帧保留,隐含假设注意力头之间存在同质的历史依赖关系。我们重新审视历史帧注意力,并揭示了三种不同的头部类型:锚头(Anchor Heads)需要广泛的长距离上下文,波头(Wave Heads)表现出周期性的时间依赖,而面纱头(Veil Heads)则专注于初始和相邻帧。基于这一发现,我们提出了金字塔强制(Pyramid Forcing),一种面向头部的金字塔 KVCache 框架,该框架离线识别头部类型,分配特定行为的缓存策略,并通过高效的不等长缓存(ragged-cache)注意力支持异构缓存长度。在 Self Forcing 和 Causal Forcing 的实验中,金字塔强制在 VBench-Long 上持续提高了长时间生成质量,将 60 秒的 Self Forcing 分数从 77.87 提高到 81.21,同时增强了运动动态、视觉保真度和语义一致性。项目网址:https://if-lab-pku.github.io/Pyramid-Forcing/.
cs.CV / 58 / 2605.13122

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

图像编辑模型中的早期语义基础用于零样本指称图像分割
He, Jingxuan, Wang, Xiyu, Wang, Yunke, Zheng, Mengyu, Xu, Chang
Abstract
Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.
Chinese Translation
基于指令的图像编辑(IIE)模型最近在根据自然语言指令修改特定图像区域方面表现出了强大的能力,这隐含地要求识别编辑应应用于何处。这表明这些模型本质上执行了语言条件的视觉语义基础。在本研究中,我们探讨了这种隐式基础是否可以用于零样本指称图像分割(RIS),这一任务要求对自然语言表达描述的对象进行像素级定位。通过系统分析,我们揭示了这些模型的内部表示在最早的去噪时间步中出现了强烈的前景-背景可分离性,远在任何可见的图像变换发生之前。基于这一洞察,我们提出了一个无训练框架,通过利用其中间表示,将预训练的图像编辑模型重新用于RIS。我们的方法将定位分解为两个互补组件:基于注意力的空间先验,用于估计关注的重点,以及基于特征的语义区分,用于确定分割对象。通过利用特征空间的可分离性,该框架仅使用单个去噪步骤生成准确的分割掩膜,而无需完全合成图像。在RefCOCO、RefCOCO+和RefCOCOg上的大量实验表明,我们的方法在现有的零样本基准上取得了优越的性能。
cs.CV / 59 / 2605.13140

Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

多模态引导的多源领域适应用于目标检测
Lee, Sangin, Kwon, Seokjun, Shin, Jeongmin, Kim, Namil, Choi, Yukyung
Abstract
General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection.
Chinese Translation
一般目标检测(OD)在目标领域中检测与训练分布不同的对象时面临困难。为了解决这个问题,最近的研究表明,从多个源领域进行训练并明确地分别处理它们以实现多源领域适应(MSDA)优于将它们混合用于无监督领域适应(UDA)。然而,现有的MSDA方法从特定领域的RGB图像中学习领域无关的特征,同时保留来自领域无关特征图的领域特定信息。为此,我们提出了MS-DePro:深度和提示的多源检测器,包含(1)深度引导的定位和(2)多模态引导的提示学习。我们利用领域无关的输入模态,即深度图和文本,以编码领域无关的特征。具体而言,我们利用深度图生成领域无关的区域提议用于定位,并整合多模态特征以对齐可学习的文本嵌入进行分类。MS-DePro在MSDA基准测试中实现了最先进的性能,全面的消融实验证明了我们贡献的有效性。我们的代码可在 https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection 获取。
cs.CV / 60 / 2605.13151

GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

GenCape:用于类别无关姿态估计的结构诱导生成建模
Rao, Jiyong, Wang, Yu, Zhao, Shengjie
Abstract
Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a Generative-based framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedded layer-wise into the Graph Transformer Decoder for progressive structural priors refinement. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.
Chinese Translation
类别无关姿态估计(CAPE)旨在从任意类别的查询图像中定位关键点,仅使用少量带注释的支持示例作为指导。近期的方法要么将关键点视为孤立的实体,要么依赖于手动定义的骨架先验,这些先验在注释上成本高且在不同类别之间固有不灵活。这种过度简化限制了模型捕捉对准确像素级定位至关重要的实例级结构线索的能力。为克服这些局限性,我们提出了GenCape,一种基于生成的CAPE框架,该框架仅通过基于图像的支持输入推断关键点关系,而无需额外的文本描述或预定义的骨架。我们的框架由两个主要组件组成:一个迭代的结构感知变分自编码器(i-SVAE)和一个组合图转移(CGT)模块。前者通过变分推断从支持特征推断出软的、实例特定的邻接矩阵,并逐层嵌入到图变换器解码器中,以逐步细化结构先验。后者通过贝叶斯融合和基于注意力的重加权,自适应地将多个潜在图聚合为查询感知结构,从而增强对视觉不确定性和支持引起的偏差的韧性。这种结构感知设计促进了关键点之间有效的信息传播,并促进了具有不同关键点拓扑的物体类别之间的语义对齐。在MP-100数据集上的实验结果表明,我们的方法在1-shot和5-shot设置下相较于图支持基线取得了显著的提升,同时在性能上与文本支持的对比方法保持了竞争力。
cs.CV / 61 / 2605.13152

EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

EvObj:无场景监督的3D实例分割中学习演变的以对象为中心的表示
Chen, Jiahao, Zhang, Zihui, Yang, Yafei, Li, Jinxi, Wei, Shenxing, Sun, Zhixuan, Yang, Bo
Abstract
We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.
Chinese Translation
我们提出了EvObj,一种用于无监督3D实例分割的方法,旨在弥合合成预训练数据与真实世界点云之间的几何领域差距。当前方法在将对象先验从合成数据集(例如,ShapeNet)转移到真实扫描(例如,ScanNet)时,面临结构差异的问题,特别是由于形态变化和遮挡伪影。为了解决这一问题,EvObj集成了两个创新模块:(1)一个对象辨识模块,动态地细化对象候选,能够持续适应目标领域的对象先验;(2)一个对象补全模块,在发现对象后重建部分几何形状。我们在真实世界和合成数据集上进行了广泛的实验,证明了在所有基线方法上,EvObj在3D对象分割性能上具有优越性,并取得了最先进的结果。
cs.CV / 62 / 2605.13155

Pareto-Guided Optimal Transport for Multi-Reward Alignment

帕累托引导的多奖励对齐最优传输
Ba, Ying, Zhang, Tianyu, Zhou, Mohan, Bai, Yalong, Mo, Wenyi, Zhang, Guiwei, Su, Bing, Wen, Ji-Rong
Abstract
Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
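The prompt-specific Pareto frontier is the set of samples that no other sample beats on every reward. The sketch below computes it and then pairs each dominated sample with its nearest frontier sample. That nearest-neighbour pairing is only a simple stand-in for the distribution-aware optimal transport, which the abstract does not specify; all function names and reward values are made up.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(rewards):
    """Indices of samples that no other sample dominates."""
    return [i for i, r in enumerate(rewards)
            if not any(dominates(o, r) for j, o in enumerate(rewards) if j != i)]

def nearest_frontier_target(rewards, frontier):
    """Map each dominated sample to its closest frontier sample (squared L2 distance)."""
    targets = {}
    for i, r in enumerate(rewards):
        if i in frontier:
            continue
        targets[i] = min(
            frontier,
            key=lambda f: sum((x - y) ** 2 for x, y in zip(rewards[f], r)),
        )
    return targets

# Two reward models scoring five images generated for one prompt (made-up numbers).
rewards = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
frontier = pareto_frontier(rewards)
targets = nearest_frontier_target(rewards, frontier)
```

Samples 0, 1, and 2 form the frontier, and both dominated samples are paired with sample 1, the balanced trade-off. No scalar weighting of the two rewards is needed to reach that pairing.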
Chinese Translation
文本到图像生成模型在偏好优化方面取得了显著进展,但在多样化奖励模型之间实现稳健对齐仍然是一个重大挑战。现有的多奖励融合方法依赖于加权求和,这种方法在调优时成本高且不足以平衡相互冲突的目标。更为关键的是,使用奖励模型进行优化极易受到奖励操控的影响,即奖励分数上升的同时生成图像的感知质量却在下降。我们证明,在异构奖励上界下针对统一全局目标进行优化可能会导致奖励操控,这一风险因弱奖励模型的固有不稳定性而进一步加剧。为此,我们提出了一种帕累托前沿引导的最优传输框架(Pareto Frontier-Guided Optimal Transport,PG-OT)。我们的方法构建了特定提示的帕累托前沿,并通过分布感知的最优传输将被支配的样本映射到该前沿。此外,我们开发了针对多样化奖励信号特征的在线和离线优化策略。为了提供更严格的评估,我们引入了联合支配率(Joint Domination Rate,JDR)和联合崩溃率(Joint Collapse Rate,JCR)作为量化多奖励协同和奖励操控的原则性指标。实验结果表明,我们的方法在JDR上比强基线提高了11%,并在人工评估中达到了近80%的胜率。
cs.CV / 63 / 2605.13156

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

视觉-语言模型中的对象幻觉双通路电路
Liu, Jiaxin, Zhong, Ding, Wang, Yue, Yang, Zhidong, Kang, Zhaolu, Dong, Guangyuan, Zhan, Qishi, Fang, Pengcheng, Liu, Aofan
Abstract
Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.
Chinese Translation
视觉-语言模型(VLMs)在连接视觉感知与自然语言理解方面展现了显著的能力,使其能够处理广泛的多模态推理任务。然而,它们经常产生对象幻觉,即描述输入图像中不存在的内容,这限制了它们的可靠性和可解释性。为了解决这一限制,我们提出了双通路电路分析(Dual-Pathway Circuit Analysis),这是一个识别和表征VLMs中与幻觉相关电路的框架,以实现机制理解和因果探测。我们首先在五种架构各异的VLMs中应用激活补丁技术,以识别支持正确预测的视觉基础通路和驱动错误输出的幻觉通路。然后,我们引入条件通路分析(Conditional Pathway Analysis, CPA)来表征通路级别的交互,揭示了基础组件在正确样本和幻觉样本中保持高度冗余,但经历了一种一致的极性翻转,从在正确样本中支持真实情况转变为在错误样本中与幻觉答案对齐。我们进一步对幻觉通路组件进行针对性抑制,显示缩放这些组件可以将对象幻觉减少多达76%,且对准确性影响最小,并验证相同电路在关系幻觉中选择性转移,而在属性幻觉中则不然。在POPE对抗性和AMBER上的评估表明,所识别的电路在不同架构中是一致的,支持因果干预,并在幻觉类型之间选择性转移。
cs.CV / 64 / 2605.13158

Unifying Physically-Informed Weather Priors in A Single Model for Image Restoration Across Multiple Adverse Weather Conditions

统一物理信息天气先验的单一模型在多种恶劣天气条件下的图像恢复
Xu, Jiaqi, Hu, Xiaowei, Zhu, Lei, Heng, Pheng-Ann
Abstract
Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle's distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. Existing methods fail to consider this weather-specific physical visual process; thus, the restoration performance is limited. In this work, we analyze the common visual factors in adverse weather conditions and present a unified imaging model that considers the individually visible particles and fog-like aggregate scattering effects. Further, we design a novel weather-prior-based network, which leverages the weather-related prior information to help recover the scene by enhancing the features using the estimated occlusion and transmission. Experimental results in multiple adverse scenarios show the superiority of our method against state-of-the-art methods.
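A per-pixel sketch of an imaging model with both effects follows. It combines the standard atmospheric scattering equation (transmission decaying exponentially with depth) with an occlusion mask for individually visible particles. The way the two terms are composed and every parameter value are assumptions for illustration, not the paper's model.

```python
import math

def degrade_pixel(scene, depth, particle_mask, particle_color=1.0,
                  airlight=0.8, beta=0.1):
    """Assumed per-pixel model: fog-like scattering, then occlusion by a visible particle."""
    transmission = math.exp(-beta * depth)  # fraction of scene light that survives
    foggy = scene * transmission + airlight * (1.0 - transmission)
    # particle_mask in [0, 1]: 1 means the pixel is fully covered by a drop or flake.
    return (1.0 - particle_mask) * foggy + particle_mask * particle_color

near_clear = degrade_pixel(scene=0.3, depth=1.0, particle_mask=0.0)
far_clear = degrade_pixel(scene=0.3, depth=50.0, particle_mask=0.0)
near_drop = degrade_pixel(scene=0.3, depth=1.0, particle_mask=1.0)
```

The far pixel is pushed almost entirely to the airlight value while the near pixel barely changes unless a particle covers it, which matches the near/far split described in the abstract. Restoration then amounts to estimating the occlusion mask and the transmission and inverting this mapping.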
Chinese Translation
在多种恶劣天气条件下的图像恢复旨在开发一个单一模型,以高可见度恢复潜在场景。根据已建立的场景可见性分析,天气相关的伪影随着粒子与相机的距离而变化,其中近处和远处区域分别受到降水和雾效应的影响。现有方法未能考虑这种特定天气的物理视觉过程,因此恢复性能受到限制。在本研究中,我们分析了恶劣天气条件下的共同视觉因素,并提出了一个统一的成像模型,该模型考虑了单独可见的粒子和类似雾的聚集散射效应。此外,我们设计了一种新颖的基于天气先验的网络,利用天气相关的先验信息,通过增强特征来帮助恢复场景,使用估计的遮挡和传输。多个恶劣场景下的实验结果表明,我们的方法优于最先进的方法。
cs.CV / 65 / 2605.13161

A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

A$_3$B$_2$: 自适应非对称适配器用于缓解视觉-语言图像分类中的分支偏差,结合少样本学习
Zhou, Yiyun, Jiang, Zhonghua, Han, Wenkang, Li, Kunxi, Xu, Mingjing, Yao, Chang, Chen, Jingyuan
Abstract
Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.
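One way to picture the dampening is a gate that shrinks the image-adapter residual as the prediction distribution flattens. The sketch below uses normalized entropy as the uncertainty measure and a linear gate; both are assumptions, since the abstract does not say how UAAD measures uncertainty or applies it. The function names are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Entropy of a class distribution divided by its maximum, so the result is in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def dampened_image_feature(base_feature, adapter_delta, probs):
    """Add the image-adapter residual, scaled down as prediction uncertainty rises."""
    gate = 1.0 - normalized_entropy(probs)  # 1 = confident, 0 = maximally uncertain
    return [b + gate * d for b, d in zip(base_feature, adapter_delta)], gate

confident, g_hi = dampened_image_feature([1.0, 0.0], [0.5, 0.5], [0.98, 0.01, 0.01])
uncertain, g_lo = dampened_image_feature([1.0, 0.0], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3])
```

With a peaked prediction the adapter residual passes almost unchanged. With a uniform prediction the gate falls to zero and the feature reverts to the frozen encoder output, which is the behaviour wanted for out-of-distribution inputs.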
Chinese Translation
大规模视觉-语言模型(如 CLIP)的高效迁移学习方法实现了强大的少样本迁移,但现有的适应方法遵循固定的微调范式,隐含地假设图像和文本分支的重要性是均匀的,这在图像分类中尚未得到系统研究。通过广泛的分析,我们揭示了视觉-语言图像分类中的分支偏差问题:在分布外设置下,适应图像编码器并不总是能提高性能。基于这一观察,我们提出了 A$_3$B$_2$,一种自适应非对称适配器,旨在缓解少样本学习中的分支偏差。A$_3$B$_2$引入了不确定性感知适配器抑制(UAAD),当预测不确定性较高时自动抑制图像分支的适应,从而实现无需人工干预的柔性和数据驱动控制。在架构上,A$_3$B$_2$采用了受专家混合模型启发的轻量级非对称设计,并结合负载平衡正则化。在 11 个数据集上的三项少样本图像分类任务中,广泛实验表明 A$_3$B$_2$始终优于 11 个竞争性的基于提示和适配器的基线。
cs.CV / 66 / 2605.13169

PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

PanoWorld:迈向360°全景世界的空间超感知
Wang, Changpeng, Lin, Xin, Liu, Junhan, Liu, Yuheng, Wang, Zhen, Qi, Donglian, Yan, Yunfeng, Chen, Xi
Abstract
Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
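The spherical structure that an ERP image leaves implicit can be made explicit with the usual pixel-to-direction mapping. The sketch below assumes longitude 0 at the image centre, latitude positive upward, and a right/up/forward axis order; these conventions and the function names are assumptions and may differ from those used by PanoWorld.

```python
import math

def erp_pixel_to_direction(u, v, width, height):
    """Map an ERP pixel centre to (longitude, latitude, unit viewing direction)."""
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi  # [-pi, pi), 0 at image centre
    lat = (0.5 - (v + 0.5) / height) * math.pi       # [-pi/2, pi/2], positive is up
    x = math.cos(lat) * math.sin(lon)  # right
    y = math.sin(lat)                  # up
    z = math.cos(lat) * math.cos(lon)  # forward
    return lon, lat, (x, y, z)

def angular_distance(d1, d2):
    """Great-circle angle between two unit directions, in radians."""
    c = sum(a * b for a, b in zip(d1, d2))
    return math.acos(max(-1.0, min(1.0, c)))

W, H = 2048, 1024
_, _, left_edge = erp_pixel_to_direction(0, H // 2, W, H)
_, _, right_edge = erp_pixel_to_direction(W - 1, H // 2, W, H)
seam_gap = angular_distance(left_edge, right_edge)
```

The two pixels sit 2047 columns apart in the image, yet `seam_gap` is only about one pixel's worth of angle because the left and right borders meet behind the observer. A model that treats the panorama as a flat picture has no way to know this.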
Chinese Translation
多模态大语言模型(MLLMs)在主导的视角图像范式下仍然面临空间理解的挑战,这种范式继承了人类感知的狭窄视野。对于导航、机器人搜索和三维场景理解,360度全景感知通过一次性捕捉整个周围环境提供了一种超感知的形式。然而,现有的MLLM管道通常将全景图分解为多个视角图像,导致等矩形投影(ERP)的球面结构在很大程度上是隐含的。在本文中,我们研究了全景原生理解,这要求MLLM能够将ERP全景视为一个连续的、以观察者为中心的空间进行推理。为此,我们首先定义了全景原生理解的关键能力,包括语义锚定、球面定位、参考框架转换和深度感知的三维空间推理。然后,我们构建了一个大规模元数据构建管道,将混合源的ERP全景转换为几何感知、语言基础和深度感知的监督,并将这些信号实例化为能力对齐的指令调优数据。在模型方面,我们引入了PanoWorld和球面空间交叉注意力(Spherical Spatial Cross-Attention),将球面几何信息注入视觉流中。我们进一步构建了PanoSpace-Bench,这是一个用于评估ERP原生空间推理的诊断基准。实验表明,PanoWorld在PanoSpace-Bench、H* Bench和R2R-CE Val-Unseen基准上显著优于专有和开源基线。这些结果表明,稳健的全景推理需要专门的全景原生监督和几何感知模型适应。所有源代码和提出的数据将公开发布。
cs.CV / 67 / 2605.13178

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

CLIP 让你困惑:无训练的标记剪枝用于大型视觉-语言模型中的高效像素定位
Lee, Sangin, Choi, Yukyung
Abstract
In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.
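The reversed ranking can be sketched in a few lines: score each visual token by cosine similarity to the text embedding, then keep the lowest-scoring tokens instead of the highest. The function name and toy embeddings are hypothetical, and the context-token recovery step mentioned in the abstract is left out because its rule is not described there.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def prune_reverse_ranked(visual_tokens, text_embedding, budget):
    """Keep the `budget` visual tokens with the LOWEST similarity to the text."""
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]
    ascending = sorted(range(len(visual_tokens)), key=lambda i: sims[i])
    return sorted(ascending[:budget])  # restore spatial order of the kept tokens

tokens = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]
kept = prune_reverse_ranked(tokens, text_embedding=[1.0, 0.0], budget=2)
```

Here `kept` is `[2, 3]`, the two tokens least similar to the text. A conventional top-similarity pruner would have kept tokens 0 and 1 instead.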
Chinese Translation
在大型视觉-语言模型中,视觉标记通常占输入标记的大多数,导致显著的计算开销。为了解决这个问题,近期研究探索了剪除冗余或信息量较少的视觉标记以用于图像理解任务。然而,这些方法在像素定位任务中表现不佳,因为标记的重要性高度依赖于输入文本。通过对 CLIP 的深入分析,我们观察到位于指称区域的视觉标记通常与文本表示的相似度较低。基于这一见解,我们提出了 LiteLVLM,一种无训练的、文本引导的标记剪枝策略,用于高效的像素定位推理。通过反转 CLIP 的视觉-文本相似度排名,LiteLVLM 有效保留覆盖指称区域的视觉标记,同时恢复上下文标记以实现清晰的前景-背景分离。大量实验表明,LiteLVLM 在不同的标记预算下显著优于现有方法,提升幅度超过 5%。在没有任何训练或微调的情况下,LiteLVLM 保持了原始性能的 90%,同时实现了 22% 的加速和 2.3 倍的内存减少。我们的代码可在 https://github.com/sejong-rcv/LiteLVLM 获取。
cs.CV / 68 / 2605.13179

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

Engram在自回归图像生成中是否进行记忆检索?
Wang, Jinghao, He, Qiyuan, Gu, Chunbin, Heng, Pheng-Ann
Abstract
The Engram module -- a hash-keyed, O(1) associative memory injected into Transformer layers -- was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial $n$-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256x256. Across a sweep of backbone-to-memory budget ratios $\rho{\in}[0.17, 0.90]$, every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate (g=0.10) matches or beats the learned gate -- inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to $\mathcal{N}(0, 1)$ noise costs only $\Delta\text{FID}{=}0.10$ and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.
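A toy version of the hash-keyed lookup makes the mechanism concrete. The sketch hashes the already-generated neighbours of a position (left, above, above-left) into a table slot and adds the retrieved row to the hidden state through a gate. The neighbourhood, the hash function, the table contents, and the function names are illustrative assumptions; only the constant gate value 0.10 is taken from the gate-clamp experiment in the abstract.

```python
def spatial_ngram_index(tokens, row, col, table_size):
    """Hash the token ids left, above, and above-left of (row, col) into a table slot."""
    def tok(r, c):
        return tokens[r][c] if r >= 0 and c >= 0 else -1  # -1 pads outside the grid
    context = (tok(row, col - 1), tok(row - 1, col), tok(row - 1, col - 1))
    h = 0
    for t in context:
        h = (h * 1000003 + (t + 1)) % table_size  # simple polynomial rolling hash
    return h

def engram_update(hidden, memory_row, gate):
    """Gated fusion: add the looked-up memory vector to the hidden state."""
    return [h + gate * m for h, m in zip(hidden, memory_row)]

table_size = 16
table = [[float(i), -float(i)] for i in range(table_size)]  # toy memory table
grid = [[5, 7, 5], [7, 5, 7]]                                # toy 2x3 token-id grid
slot = spatial_ngram_index(grid, row=1, col=1, table_size=table_size)
fused = engram_update([0.5, 0.5], table[slot], gate=0.10)
```

The lookup costs one hash and one row read regardless of table size, which is the O(1) property. The donor-probe and frozen-noise findings in the abstract correspond to replacing the hash inputs or the contents of `table` in this picture while keeping the gated pathway in place.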
Chinese Translation
Engram模块——一种注入到Transformer层中的哈希键关联记忆,具有O(1)的特性——最近被证明可以改善大型语言模型的预训练,其诱人的解释是它为重复的局部标记模式提供了一种内容寻址的快捷方式。我们探讨这种解释是否适用于自回归(AR)图像生成,或者观察到的增益(如果有的话)是否来自于不同的机制。我们通过2D空间$n$-gram哈希、门控融合和与KV缓存兼容的增量推理,将Engram模块适配到视觉任务中,并将其注入到一个在ImageNet 256x256上训练的类条件AR生成器中。在对骨干网络与记忆预算比率$\rho{\in}[0.17, 0.90]$的全面测试中,所有增强Engram的变体在FID上均落后于纯AR基线,这表明该模块节省了骨干网络的FLOP,但并未单独提高样本质量。接着,我们探讨该模块的使用方式。门控夹紧测试表明,完全禁用Engram通道是灾难性的,但一个微小的常数门(g=0.10)与学习到的门相匹配或超越——这与一个高度内容寻址的回忆机制不一致。捐赠-探测实验表明,匹配的、对抗的或随机的同类样本的哈希输入交换产生的下一个标记分布在统计上是不可区分的,而折叠或随机化表格则使其降级两个到三个数量级。最后,从头开始训练一个将整个记忆表冻结为$\mathcal{N}(0, 1)$噪声的模型,仅付出$\Delta\text{FID}{=}0.10$的代价,并且实际上提高了Inception Score。综合这些发现表明,Engram在AR图像生成中的表现并非作为一个内容寻址的检索器,而是作为一个门控的架构旁路:一个哈希键的残差流,其益处主要来自于通道本身,而学习到的表仅贡献了少量的分布性细化。
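To make the mechanism in this abstract concrete, here is a minimal sketch of a hash-keyed side-pathway with a constant gate. The abstract does not give the hash function, the neighbourhood, or the fusion rule, so everything here is an assumption: the helpers `ngram_hash`, `spatial_context`, and `engram_fuse` are hypothetical, the context is taken to be the left, upper, and upper-left neighbours, and fusion is a plain residual add. The table is filled with fixed N(0, 1) noise to echo the frozen-table ablation.

```python
import random

def ngram_hash(tokens, table_size):
    """Hash a tuple of token ids into a table slot (simple rolling hash, an assumption)."""
    h = 0
    for t in tokens:
        h = (h * 1000003 + t + 1) % table_size
    return h

def spatial_context(grid, r, c):
    """Causal 2D context for (r, c): left, up, and up-left neighbours; -1 marks padding."""
    def at(i, j):
        return grid[i][j] if i >= 0 and j >= 0 else -1
    return (at(r, c - 1), at(r - 1, c), at(r - 1, c - 1))

def engram_fuse(hidden, grid, r, c, table, gate):
    """Residual side-pathway: hidden + gate * table[hash(context)]."""
    slot = ngram_hash(spatial_context(grid, r, c), len(table))
    return [h + gate * m for h, m in zip(hidden, table[slot])]

rng = random.Random(0)
dim, table_size = 4, 64
# Table frozen to N(0, 1) noise, never trained in this sketch.
table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(table_size)]
grid = [[3, 7, 1], [4, 4, 9]]  # toy 2x3 grid of image token ids
hidden = [0.0] * dim
fused = engram_fuse(hidden, grid, 1, 2, table, gate=0.10)
```

Each lookup is a single hash and one table read, which is where the O(1) cost comes from. Setting `gate=0.0` removes the pathway entirely, the condition the abstract reports as catastrophic for the trained model.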
cs.CV / 69 / 2605.13182

DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

DiffST:面向真实世界时空视频超分辨率的时空感知扩散方法
Chen, Zheng, Yang, Ruofan, Han, Jin, Song, Dehua, Zou, Zichen, He, Chunming, Guo, Yong, Zhang, Yulun
Abstract
Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17$\times$ faster than previous diffusion-based STVSR methods. Code is available at: https://github.com/zhengchen1999/DiffST.
Chinese Translation
基于扩散的模型在视频超分辨率(VSR)和视频帧插值(VFI)中表现出色。然而,它们在耦合时空视频超分辨率(STVSR)设置中的作用仍然有限。现有的基于扩散的STVSR方法面临两个问题:(1)推理效率低;(2)时空信息利用不足。这些限制妨碍了其部署。为了解决这些问题,我们提出了DiffST,这是一种高效的时空感知视频扩散框架,旨在处理真实世界的STVSR。为了提高效率,我们对预训练的扩散模型进行了调整,以实现一步采样,并直接处理整个视频,而不是单独处理每一帧。此外,为了增强时空信息的利用,我们引入了跨帧上下文聚合(CFCA)和视频表示引导(VRG)。CFCA模块聚合多个关键帧的信息以生成中间帧。VRG模块提取视频级全局特征以指导扩散过程。大量实验表明,DiffST在真实世界的STVSR任务中取得了领先的结果。它还保持了高推理效率,运行速度约为之前基于扩散的STVSR方法的17倍。代码可在以下链接获取:https://github.com/zhengchen1999/DiffST。
cs.CV / 70 / 2605.13193

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

FIKA-Bench:从细粒度识别到细粒度知识获取
Li, Geng, Peng, Yuxin
Abstract
Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.
Chinese Translation
日常生活中的细粒度识别通常不是一个闭卷分类问题:当遇到不熟悉的物体时,人类会主动搜索、比较视觉细节并验证证据,然后再做出决定。现有基准主要评估视觉识别,未能充分探讨这种主动外部知识获取能力。我们研究细粒度知识获取,其中系统必须寻求、验证并利用外部证据来回答开放式细粒度识别问题。我们引入FIKA-Bench,这是一个考虑信息泄露且以证据为基础的311个公共来源和现实生活实例的集合。为了确保高质量,每个示例都经过前沿闭卷模型的过滤,以去除记忆化案例,并经过审计以消除图像-答案泄露,仅保留由经过验证的证据支持的样本。我们对最新的大型多模态模型(Large Multimodal Models, LMMs)和代理的评估表明,这一任务仍然是一个巨大的挑战:最佳系统的准确率仅为25.1%,且没有模型超过30%。关键是,我们发现仅仅为模型配备工具不足以弥补这一差距;代理的失败主要是由于错误的实体检索和较差的视觉判断。这些结果表明,可靠的知识获取需要更好的代理设计,专注于细粒度识别。
cs.CV / 71 / 2605.13202

STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

STAR:用于少样本动作识别的语义-时间自适应表示学习
Liu, Hongli, Wang, Yu, Zhao, Shengjie
Abstract
Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework, consisting of a semantic-alignment component and a temporal-aware component, effectively bridging the semantic and temporal gaps and transferring the sequence modeling capability of Mamba into the FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.
Chinese Translation
少样本动作识别(FSAR)要求模型能够从仅有的少量标注样本中推广到新颖的动作类别。尽管视觉-语言模型取得了一定进展,但现有方法仍然面临语义-时间不对齐的问题,即静态文本提示无法捕捉到在序列中稀疏出现的决定性视觉线索,以及多尺度时间动态建模不足的问题,因为短期区分线索和长期依赖关系往往被过度平滑或碎片化。为了解决这些挑战,我们提出了语义时间自适应表示学习(STAR),这是一个统一框架,由语义对齐组件和时间感知组件组成,有效地弥合了语义和时间之间的差距,并将Mamba的序列建模能力转移到FSAR中。语义对齐模块引入了一种时间语义注意力(TSA)机制,该机制通过文本线索执行帧级跨模态对齐,确保细粒度的语义-时间一致性。时间感知模块结合了语义引导的Mamba块、多个频率的时间采样和双向状态空间细化,产生具有增强的区分保真度和时间一致性的语义对齐原型。此外,从大型语言模型(LLMs)派生的时间依赖类描述符提供了长期的语义指导。在五个FSAR基准上的大量实验表明,STAR在性能上始终优于最先进的方法。例如,在1-shot设置下,STAR在SSv2-Full和SSv2-Small数据集上分别实现了高达8.1%和6.7%的提升,在HMDB51上实现了7.3%的提升,验证了其在有限监督下的有效性。代码可在 https://github.com/HongliLiu1/STAR-main 获取。
cs.CV / 72 / 2605.13223

Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

技能对齐注释在文本到图像生成中的可靠评估
Eldesokey, Abdelrahman, Ramazanova, Merey, Sait, Ahmad, Khangeldin, Ansar, Sanchez, Karen, Zhang, Tong, Ghanem, Bernard
Abstract
Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.
Chinese Translation
文本到图像(T2I)生成技术发展迅速,随着模型之间性能差异的缩小,可靠的评估变得至关重要。现有的评估实践通常在异质的评估技能上应用统一的注释机制,如李克特量表或二元问答(BQA),尽管它们的本质存在根本差异。在本研究中,我们通过技能对齐注释的视角重新审视T2I评估,其中注释策略反映了每种评估技能的基本特征。我们系统地将技能对齐注释与统一基线进行比较,结果表明,前者能够产生更一致的评估信号,具有更高的注释者间一致性和跨模型的稳定性。最后,我们提出了一个自动化流程,实例化了所提议的评估协议,实现了可扩展和细致的评估,并提供了空间基础的反馈。我们的研究强调,改善图像评估的基础可以在不单纯增加注释工作量的情况下提高可靠性和效率。我们希望这能激励进一步研究,以完善评估协议,作为可靠模型评估的核心组成部分。
cs.CV / 73 / 2605.13228

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

ReTool-Video:具有元增强工具定位的递归工具使用视频代理
Liu, Xiao, Liu, Nayu, Zhu, Junnan, Chen, Ruirui, Xiang, Guohui, Wang, Changjian, Wei, Kaiwen, Li, Rongzhen, Zhong, Jiang
Abstract
Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
Chinese Translation
视频理解需要主动寻求证据,这促使工具增强的视频代理在时间推理、跨模态理解和复杂问题回答方面的发展。现有的视频代理通过检索、记忆、帧检查和验证工具改善了视频推理,但仍面临两个限制:(1)粗糙的工具空间缺乏用于组合推理的细粒度操作;(2)扁平的动作空间将高级视频意图强制转化为原始可执行的工具调用。本文通过两种互补设计来解决这些挑战。首先,我们构建了一个元增强视频工具库(MetaAug-Video Tool Library, MVTL),这是一个可扩展的工具库,注册了134个工具,包括26个用于一般多模态信号处理的基础工具和108个用于过滤、聚合、重排序、格式化及其他中间结果操作的元工具。MVTL支持对结构化视频信息和原始模态证据的双层访问,能够支持多样化的视频推理场景。其次,我们提出了ReTool-Video,一种递归工具使用方法,将高级视频意图定位为可执行的工具链。在ReTool-Video中,匹配的动作被直接执行,而不匹配的意图则委托给解析器进行参数修复、工具替换或分解。这使得诸如时间合并、跨模态验证或重复事件聚合等抽象动作能够在运行时逐步转化为具体的多模态操作。在MVBench、MLVU和Video-MME w/o sub.上的实验表明,ReTool-Video始终优于强基线。进一步的分析表明,递归定位和细粒度元工具提高了复杂视频理解的稳定性和有效性。
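The recursive grounding described in this abstract (run matched actions directly, hand unmatched intents to a resolver that decomposes them) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the tool names, the `decompositions` table, and the `(start, end, score)` segment format are all hypothetical, and only the decomposition branch of the resolver is modelled (no parameter repair or tool substitution).

```python
def merge_adjacent(segments):
    """Merge (start, end, score) segments that overlap; expects input sorted by start."""
    merged = []
    for start, end, score in segments:
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged

def run_intent(intent, args, tools, decompositions, depth=0, max_depth=4):
    """Execute `intent` directly if it is a registered tool; otherwise decompose and recurse."""
    if intent in tools:
        return tools[intent](args)
    if depth >= max_depth or intent not in decompositions:
        raise KeyError(f"cannot ground intent: {intent}")
    result = args
    for sub_intent in decompositions[intent]:
        result = run_intent(sub_intent, result, tools, decompositions, depth + 1, max_depth)
    return result

# Hypothetical registry: three concrete tools and one abstract intent.
tools = {
    "filter_by_score": lambda segs: [s for s in segs if s[2] >= 0.5],
    "sort_by_start": lambda segs: sorted(segs),
    "merge_adjacent": merge_adjacent,
}
decompositions = {
    "temporal_merge": ["filter_by_score", "sort_by_start", "merge_adjacent"],
}

segments = [(5.0, 8.0, 0.9), (0.0, 2.0, 0.7), (1.5, 4.0, 0.6), (9.0, 10.0, 0.2)]
result = run_intent("temporal_merge", segments, tools, decompositions)
```

The depth limit is a design choice added here to keep the recursion bounded; the abstract does not say how the actual system terminates.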
cs.CV / 74 / 2605.13258

X-Restormer++: 1st Place Solution for the UG2+ CVPR 2026 All-Weather Restoration Challenge

X-Restormer++:UG2+ CVPR 2026 全天气恢复挑战赛第一名解决方案
Pan, Youwei, Cao, Leilei, Zhu, Yingfang, Zhu, Fengjie
Abstract
In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Conditions. Our method is built upon the strong baseline framework X-Restormer, which effectively captures both channel-wise global dependencies and spatially-local structural information through its dual-attention design (Multi-DConv Head Transposed Attention and Overlapping Cross-Attention). To further boost the restoration performance, we propose several key improvements. First, we integrate the spatially-adaptive input scaling mechanism from Restormer-Plus to dynamically adjust the spatial weights of the input image, enhancing spatial adaptability. Second, to better preserve structural details and edge information, we introduce a novel Gradient-Guided Edge-Aware (GGEA) loss, which is combined with L1 and Multi-Scale SSIM losses in a unified training objective. Third, we significantly expand the training data by incorporating an extra 24,500 degraded-clean image pairs from FoundIR and WeatherBench alongside the original WeatherStream dataset. With these strategies, our proposed method ranked first in the challenge.
Chinese Translation
在本研究中,我们提出了在第八届 UG2+ 挑战赛(CVPR 2026)第一赛道:全天气条件下图像恢复的获胜解决方案。我们的方法基于强大的基线框架 X-Restormer,该框架通过其双重注意力设计(多维卷积头转置注意力和重叠交叉注意力)有效捕捉通道间的全局依赖关系和空间局部结构信息。为了进一步提升恢复性能,我们提出了几个关键改进。首先,我们整合了来自 Restormer-Plus 的空间自适应输入缩放机制,以动态调整输入图像的空间权重,从而增强空间适应性。其次,为了更好地保留结构细节和边缘信息,我们引入了一种新颖的梯度引导边缘感知(Gradient-Guided Edge-Aware, GGEA)损失,并将其与 L1 和多尺度 SSIM 损失结合在统一的训练目标中。第三,我们通过结合来自 FoundIR 和 WeatherBench 的额外 24,500 对退化-清晰图像对,以及原始 WeatherStream 数据集,显著扩展了训练数据。凭借这些策略,我们提出的方法在挑战赛中成功获得第一名。
cs.CV / 75 / 2605.13293

Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

Img2CADSeq:基于序列的扩散图像到CAD生成
Tan, Shiyu, Zhao, Zixuan, Gao, Hao, Chen, Zhiheng, Yin, Xiaolong, Shen, Enya
Abstract
Boundary Representation (BRep) is the standard format for Computer-Aided Design (CAD), yet reconstructing high-quality BReps from single-view images remains challenging due to the complexity of topological constraints and operation sequences. We present Img2CADSeq, a multi-stage pipeline that overcomes these limitations by encoding CAD sequences into a three-level hierarchical codebook. Guided by an importance prioritization, this strategy values profiles over details, compressing long sequences into a stable discrete latent space. To bridge the modality gap, we leverage a coarse-to-fine point cloud intermediate, aligning 2D visual features with 3D CAD sequences via contrastive learning to condition a VQ-Diffusion model. Supported by newly introduced CAD-220K and PrintCAD datasets, our approach ensures robust industrial domain adaptation. Extensive experiments demonstrate that Img2CADSeq significantly outperforms state-of-the-art methods, producing standard STEP files that can be directly used in commercial CAD software.
Chinese Translation
边界表示(BRep)是计算机辅助设计(CAD)的标准格式,但由于拓扑约束和操作序列的复杂性,从单视图图像重建高质量的BRep仍然具有挑战性。我们提出了Img2CADSeq,这是一种多阶段管道,通过将CAD序列编码为三级层次代码本来克服这些限制。在重要性优先级的指导下,该策略重视轮廓而非细节,将长序列压缩到一个稳定的离散潜在空间。为了弥合模态差距,我们利用粗到细的点云中间体,通过对比学习将2D视觉特征与3D CAD序列对齐,以条件化VQ-Diffusion模型。在新引入的CAD-220K和PrintCAD数据集的支持下,我们的方法确保了强大的工业领域适应性。大量实验表明,Img2CADSeq显著优于最先进的方法,生成的标准STEP文件可以直接用于商业CAD软件。
cs.CV / 76 / 2605.13306

Color Constancy in Hyperspectral Imaging via Reduced Spectral Spaces

通过降低光谱空间实现高光谱成像中的颜色恒常性
Vidarsson, G. Dofri, Lu, Liying, Süsstrunk, Sabine
Abstract
Illuminant estimation aims to infer scene illumination from image measurements despite intrinsic ambiguities between surface reflectance and lighting. Most existing methods operate on trichromatic RGB images and are therefore fundamentally limited by the restricted spectral information available. Hyperspectral imaging provides a much richer representation of scene radiance and has the potential to alleviate these ambiguities. However, its high dimensionality poses computational and statistical challenges. In this work, we systematically study the effect of spectral dimensionality and representation choice on illuminant estimation performance using hyperspectral data. We adopt the practical and effective Color-by-Correlation (CbC) framework as the estimation backbone and analyze its behavior under different spectral dimensionality reduction strategies. Our results offer practical insights into how hyperspectral information can be efficiently exploited for illuminant estimation and identify conditions under which compact spectral representations outperform conventional RGB-based approaches. The code is available at https://github.com/IVRL/Reduced-Spectral-Color-Constancy.
Chinese Translation
光源估计旨在从图像测量中推断场景照明,尽管表面反射率与照明之间存在内在的模糊性。现有的大多数方法基于三色RGB图像,因此在可用的光谱信息有限的情况下,受到根本性的限制。高光谱成像提供了更丰富的场景辐射表示,有潜力缓解这些模糊性。然而,其高维性带来了计算和统计上的挑战。在本研究中,我们系统地研究了光谱维度和表示选择对使用高光谱数据的光源估计性能的影响。我们采用实用有效的基于相关的颜色(Color-by-Correlation, CbC)框架作为估计的基础,并分析其在不同光谱降维策略下的表现。我们的结果为如何高效利用高光谱信息进行光源估计提供了实用见解,并确定了紧凑光谱表示优于传统RGB方法的条件。代码可在 https://github.com/IVRL/Reduced-Spectral-Color-Constancy 获取。
cs.CV / 77 / 2605.13316

Test-time Sparsity for Extreme Fast Action Diffusion

极快动作扩散的测试时稀疏性
Ji, Kangye, Meng, Yuan, Zhou, Jianbo, Li, Ye, Tang, Chen, Wang, Zhi
Abstract
Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at https://github.com/ky-ji/Test-time-Sparsity.
Chinese Translation
动作扩散在高保真动作生成方面表现出色,但由于其迭代去噪的特性,导致计算成本高昂。尽管当前技术在通过重用缓存特征加速扩散变换器方面显示出希望,但它们在开放环境中适应来自多样感知和多轮展开迭代的策略动态方面仍然面临挑战。我们提出了测试时稀疏性来应对这一挑战,旨在通过动态预测可修剪的残差计算来加速动作扩散。然而,这一范式仍然存在两个瓶颈:1)重复的条件编码和修剪抵消了大部分潜在的速度提升,2)从先前去噪时间步缓存的特征无法在激进稀疏下限制大规模修剪错误。为了解决第一个瓶颈,我们设计了一个高度并行化的推理管道,将非解码器延迟最小化到毫秒级。具体而言,我们首先设计了一个轻量级的修剪器,与扩散变换器共享编码器。然后,我们通过并行处理所有去噪时间步,将编码和修剪与自回归去噪循环解耦,并通过异步方式将修剪器与解码器前向推理重叠。为了克服第二个瓶颈,我们引入了一种全方位重用策略,通过选择性重用来自当前前向、先前去噪时间步和早期展开迭代的缓存特征,实现了95%的稀疏性。为了学习展开级重用策略,我们采样了一些动作轨迹,以逐步监督稀疏化的扩散步骤。大量实验表明,我们的方法将FLOPs减少了92%,并将动作生成加速了5倍,以47.5 Hz的推理频率实现无损性能。我们的代码可在 https://github.com/ky-ji/Test-time-Sparsity 获取。
cs.CV / 78 / 2605.13322

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

KamonBench:用于评估视觉-语言模型中组合因子恢复的基于语法的数据集
Sproat, Richard, Peluchetti, Stefano
Abstract
Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.
Chinese Translation
家纹(Kamon)是日本文化的重要组成部分,也是组合视觉识别的自然测试案例:每个家纹结合了少量的象征性选择,但可能描述的空间是稀疏的。我们介绍了KamonBench,这是一个基于语法的图像到结构基准,包含20,000个合成的复合家纹和辅助组件示例。每个复合家纹都配有一段用正式的家纹描述语言(“kamon yōgo”)写成的描述、分段的日语分析、英文翻译和非语言程序代码。由于每个合成家纹是从已知因子生成的,即容器、修饰符和主题,KamonBench支持超越标题级准确性的评估:直接的程序代码因子度量、受控的因子对重组拆分、在固定容器-修饰符上下文下的反事实主题敏感性组,以及因子可达性的线性探测。我们包括了ViT编码器/Transformer解码器和两个VGG n-gram解码器的基线结果,分别带有和不带有学习到的位置掩码。KamonBench因此提供了一个受控的测试平台,用于稀疏组合视觉识别和视觉-语言模型中的因子恢复。
cs.CV / 79 / 2605.13333

Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

通过超网络驱动的低秩适应生成风格化文本到运动
Jeon, Junhyuk, Hong, Seokhyeon, Noh, Junyong
Abstract
Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.
Chinese Translation
基于文本驱动的运动扩散模型能够生成逼真的人类运动,但单靠文本往往难以表达运动的细微差别,通常称为风格。最近的方法通过将风格注入机制附加到预训练的文本驱动扩散模型来应对这一挑战。然而,现有的风格化方法要么需要对现有模型进行风格特定的微调,要么依赖于复杂的基于ControlNet的架构,这限制了效率和对未见风格的泛化能力。我们提出了一种轻量级的风格条件框架,通过超网络生成的低秩适应(LoRA)参数动态调节预训练的扩散模型。风格参考运动被编码为一个全局风格嵌入,该嵌入通过超网络映射到在扩散模型的每个去噪步骤中应用的低秩更新。通过使用监督对比损失来构建风格潜在空间,我们的框架可靠地捕捉多样的风格属性,提高了对未见风格的泛化能力,并支持基于优化的指导,而无需预定义的风格类别。在HumanML3D和100STYLE数据集上的实验显示了最先进的风格化结果,同时在未见风格上实现了更好的风格化效果。
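The conditioning path in this abstract (style embedding, then hypernetwork, then low-rank weight update) can be illustrated with a small sketch. It assumes a purely linear hypernetwork and the usual LoRA form W' = W + scale * B A. The projection matrices, shapes, and function names below are hypothetical; the abstract does not describe the real architecture at this level.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def hypernetwork(style, proj_a, proj_b, rank, d_in, d_out):
    """Linear hypernetwork: style embedding -> LoRA factors A (rank x d_in), B (d_out x rank)."""
    flat_a = [sum(w * s for w, s in zip(row, style)) for row in proj_a]
    flat_b = [sum(w * s for w, s in zip(row, style)) for row in proj_b]
    a = [flat_a[i * d_in:(i + 1) * d_in] for i in range(rank)]
    b = [flat_b[i * rank:(i + 1) * rank] for i in range(d_out)]
    return a, b

def adapted_weight(w, a, b, scale=1.0):
    """Apply the low-rank update: W' = W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]

# Toy sizes: 2-d style embedding, one 2x3 frozen weight, rank-1 update.
rank, d_in, d_out = 1, 3, 2
proj_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (rank * d_in) x style_dim
proj_b = [[1.0, 0.0], [0.0, 2.0]]              # (d_out * rank) x style_dim
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]         # frozen pretrained weight

style = [1.0, 0.5]
a, b = hypernetwork(style, proj_a, proj_b, rank, d_in, d_out)
w_styled = adapted_weight(w, a, b)
```

Because the frozen weight `w` is never modified, swapping in a different style embedding only changes the generated factors, which is what makes this kind of conditioning lightweight.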
cs.CV / 80 / 2605.13349

Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints

先验分布中的拖拽:基于文本条件的点状图像编辑在分布约束下
Hu, Haoyang, Seo, Masataka, Chen, Yen-Wei
Abstract
Diffusion-based point editing methods have gained significant traction in image editing tasks due to their ability to manipulate image semantics and fine details by applying localized perturbations on the manifold of the noise latent. However, these approaches face several limitations. Traditional point-based editing relies on pairs of handle and target points to define motion trajectories, which can introduce ambiguity or unnecessary alterations. Furthermore, when the distance between the handle and target points is large, the accumulated perturbations often cause the noise latent to deviate from the inversion score trajectory, resulting in unnatural artifacts. To address these issues in global editing tasks, we introduce a CLIP-based model to evaluate and guide intermediate editing steps, ensuring that the generated results remain semantically aligned. Additionally, we propose a prior-preservation loss that constrains the optimized latent code to stay within the sampling space of the diffusion prior, improving consistency with the original data distribution so that the model generates images along a familiar score trajectory. For fine-grained tasks, we present a directionally-weighted point tracking mechanism that steers the editing process toward the target direction within similar feature regions. This improves both the tracking accuracy and generation quality, while also reducing the editing time.
Chinese Translation
基于扩散的点编辑方法因其能够通过对噪声潜在空间的局部扰动来操控图像语义和细节,在图像编辑任务中获得了显著关注。然而,这些方法面临着若干限制。传统的基于点的编辑依赖于一对把手点和目标点来定义运动轨迹,这可能引入模糊性或不必要的更改。此外,当把手点和目标点之间的距离较大时,累积的扰动往往导致噪声潜在偏离反演分数轨迹,从而产生不自然的伪影。为了解决这些在全局编辑任务中的问题,我们引入了一种基于CLIP的模型来评估和引导中间编辑步骤,确保生成的结果在语义上保持一致。此外,我们提出了一种先验保持损失,约束优化后的潜在编码保持在扩散先验的采样空间内,从而提高与原始数据分布的一致性,确保模型沿着熟悉的分数轨迹生成图像。对于细粒度任务,我们提出了一种方向加权的点跟踪机制,指导编辑过程朝着相似特征区域内的目标方向进行。这提高了跟踪精度和生成质量,同时也减少了编辑时间。
cs.CV / 81 / 2605.13366

Neural Surrogate Forward Modelling For Electrocardiology Without Explicit Intracellular Conductivity Tensor

无需显式细胞内导电张量的电心脏学神经代理前向建模
Ogbomo-Harmitt, Shaheim, Magnetti, Cesare, Grzelak, Jakub, Aslanidi, Oleg
Abstract
Accurate forward modelling is essential for non-invasive cardiac electrophysiology, particularly in atrial fibrillation, where electrical activation is highly disorganised. Conventional physics-based forward models require explicit specification of intracellular conductivity tensors, which are not directly measurable in clinical practice and introduce structural modelling errors. This proof-of-concept study presents a deep learning approach that learns a direct mapping from left atrial intracellular electrical potentials to far-field ECGs without requiring explicit intracellular conductivity inputs at inference time. Despite training only on 74 subjects, the model achieved an $R^2$ of $0.949 \pm 0.037$, highlighting potential to reduce structural uncertainty and improve non-invasive AF assessment.
Chinese Translation
准确的前向建模对于非侵入性心脏电生理学至关重要,特别是在心房颤动中,电激活高度无序。传统的基于物理的前向模型需要显式指定细胞内导电张量,而这些张量在临床实践中无法直接测量,并且会引入结构建模误差。本概念验证研究提出了一种深度学习方法,该方法学习从左心房细胞内电位到远场心电图(ECG)的直接映射,而在推理时无需显式的细胞内导电输入。尽管仅在74名受试者上进行训练,该模型仍达到了$0.949 \pm 0.037$的$R^2$值,突显了减少结构不确定性和改善非侵入性心房颤动评估的潜力。
cs.CV / 82 / 2605.13375

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

GRIP-VLM:用于高效视觉-语言模型的组相对重要性剪枝
Huang, Mingzhe, Wang, Weijun, Ding, Xin, Mi, Liang, Wen, Hao, Li, Yuanchun, Pang, Lichen, Yang, Shansong, Liu, Yunxin, Cao, Ting
Abstract
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
Chinese Translation
在视觉-语言模型(VLMs)中,处理大量视觉标记会带来巨大的计算开销。尽管最近的训练感知剪枝方法试图选择性地丢弃冗余标记,但它们在很大程度上依赖于连续梯度松弛。然而,视觉标记剪枝本质上是一个离散的、非凸的组合优化问题;因此,这些连续近似常常将优化困在次优局部极小值,尤其是在激进的压缩预算下。为了克服这一根本瓶颈,我们提出了GRIP-VLM,一个由强化学习驱动的组相对重要性剪枝框架。GRIP-VLM并不依赖于平滑梯度假设,而是将剪枝形式化为一个马尔可夫决策过程,采用由监督预热锚定的组相对策略优化(GRPO)范式,直接探索离散选择空间。结合预算感知评分器,我们的轻量级代理动态评估每个标记的重要性,并在不重新训练的情况下适应任意压缩比。在多种多模态基准测试中的广泛实验表明,GRIP-VLM始终优于启发式和监督学习基线,达到了更优的帕累托前沿,并在相同准确率下实现了高达15%的推理加速。
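Two pieces of the pipeline in this abstract are easy to show in isolation: keeping the highest-scoring visual tokens under a budget, and the group-relative advantage used in GRPO-style training. The sketch below assumes the common GRPO normalisation (reward minus group mean, divided by group standard deviation). The function names are hypothetical, and the abstract does not specify the reward definition or the scorer architecture.

```python
import math

def keep_top_tokens(scores, keep_ratio):
    """Return indices of the tokens kept under the budget, in original order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each rollout's reward against the other rollouts in its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Eight visual tokens with importance scores; keep 25% of them.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4]
kept = keep_top_tokens(scores, keep_ratio=0.25)

# Four sampled pruning masks for the same input, each scored by a task reward.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to other samples of the same input, no separate value network is needed, and the selection itself stays discrete rather than being relaxed to a continuous mask.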
cs.CV / 83 / 2605.13381

Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

骨干网络就是你所需要的一切:评估冻结基础模型在合成图像取证中的脆弱性
Musso, Chiara, Battocchio, Joy, Montibeller, Andrea, Boato, Giulia
Abstract
As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector's ViT backbone alone and operates entirely within the target detector's feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.
Chinese Translation
随着AI生成的合成图像变得越来越逼真,视觉变换器(Vision Transformers, ViTs)已成为现代深度伪造检测的基石。然而,普遍依赖于冻结的预训练骨干网络引入了一种微妙但关键的脆弱性。在本研究中,我们提出了替代迭代对抗攻击(Surrogate Iterative Adversarial Attack, SIAA),这是一种灰盒攻击,仅利用检测器的ViT骨干知识,并完全在目标检测器的特征空间内操作,以生成高效的对抗样本。通过我们的实验,涉及多个基于ViT的检测器和多种灰盒场景,包括少量学习、完全训练不对齐和攻击可转移性测试,我们证明了这种脆弱性始终导致高攻击成功率,通常接近白盒性能。通过这样做,我们揭示了仅凭骨干知识就足以破坏检测器的可靠性,强调了在对抗多媒体取证中迫切需要更具韧性的防御措施。
cs.CV / 84 / 2605.13396

PreFIQs: Face Image Quality Is What Survives Pruning

PreFIQs:面部图像质量是修剪后存留下来的
Kolf, Jan Niklas, Ozgur, Guray, Atzori, Andrea, Babnik, Žiga, Štruc, Vitomir, Damer, Naser, Boutros, Fadi
Abstract
Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
Chinese Translation
面部图像质量评估(FIQA)用于评估面部图像在自动面部识别(FR)系统中的实用性。在本研究中,我们提出了PreFIQs,这是一种基于修剪识别示例(PIE)假设的无监督且无需训练的FIQA框架。我们假设低实用性的面部图像过度依赖脆弱的网络参数,导致在模型稀疏化过程中其嵌入的几何位移更大。因此,PreFIQs通过计算从预训练FR模型和其修剪版本中提取的L2归一化嵌入之间的欧几里得距离来量化图像的实用性。我们通过雅可比-向量乘积分析提供了一阶理论证明,表明这种经验漂移作为潜在嵌入流形的精确几何敏感性的计算高效近似。我们在八个基准和四个FR模型上进行了广泛实验,结果表明PreFIQs在性能上与最先进的FIQA方法相当或更优,包括在多个基准上建立新的最先进结果,且无需任何训练或监督。这些结果验证了参数稀疏化作为面部图像实用性的原则性和实用性信号,并表明质量本质上是修剪后存留下来的东西。
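The scoring rule in the PreFIQs abstract above (distance between L2-normalized embeddings of a dense model and its pruned counterpart) can be sketched roughly as follows. This is only an illustration under stated assumptions: the "model" is a toy linear map, the pruning is global magnitude pruning at 50% sparsity, and the helper names `magnitude_prune` and `embedding_drift` are hypothetical rather than taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude (assumed pruning rule)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def embedding_drift(emb_dense, emb_pruned):
    """Euclidean distance between L2-normalized embeddings; larger drift suggests lower utility."""
    return np.linalg.norm(l2_normalize(emb_dense) - l2_normalize(emb_pruned), axis=-1)

# Toy stand-in for a face recognition backbone: a single linear layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))
W_pruned = magnitude_prune(W, sparsity=0.5)
x = rng.normal(size=(4, 64))            # four fake "images"
drift = embedding_drift(x @ W, x @ W_pruned)
```

Because both embeddings are unit length, the drift always lies in [0, 2]. How the drift is mapped to a final quality score (for example by negating it) is left open here, since the abstract does not specify it.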
cs.CV / 85 / 2605.13402

Fast and Compact Graph Cuts for the Boykov-Kolmogorov Algorithm

Boykov-Kolmogorov 算法的快速紧凑图割
Mikkelstrup, Christian Møller, Dahl, Anders Bjorholm, Bille, Philip, Dahl, Vedrana Andersen, Gørtz, Inge Li
Abstract
Computing a minimum $s$-$t$ cut in a graph is a solution to a wide range of computer vision problems, and is often done using the Boykov-Kolmogorov (BK) algorithm. In this paper, we revisit the BK algorithm from both a theoretical and practical point of view. We improve the analysis of the time complexity of the BK algorithm to $O(mn|C|)$ and propose a new algorithm, the fast and compact BK (fcBK) algorithm, with a time complexity of $O(m|C|)$, where $m$, $n$, and $|C|$ are the number of edges, number of vertices, and the capacity of the cut, respectively. We additionally propose a compact graph representation that allows our implementation to find a minimum $s$-$t$ cut in a graph with upwards of $10^9$ vertices and $10^{10}$ edges on a machine with 128 GB of memory. We find our implementation of the BK algorithm to be the fastest available implementation of the BK algorithm when evaluating on a comprehensive set of benchmark datasets, highlighting the importance of memory-efficient implementations. We make our implementations publicly available for further research and implementation development within minimum $s$-$t$ cut algorithms.
Chinese Translation
在图中计算最小 $s$-$t$ 切割是解决广泛计算机视觉问题的一种方法,通常使用 Boykov-Kolmogorov (BK) 算法来实现。本文从理论和实践两个角度重新审视了 BK 算法。我们改进了 BK 算法的时间复杂度分析,得到了 $O(mn|C|)$ 的复杂度,并提出了一种新的算法——快速紧凑 BK (fcBK) 算法,其时间复杂度为 $O(m|C|)$,其中 $m$、$n$ 和 $|C|$ 分别表示边的数量、顶点的数量和切割的容量。此外,我们还提出了一种紧凑的图表示方法,使得我们的实现能够在具有 128 GB 内存的机器上找到具有超过 $10^9$ 个顶点和 $10^{10}$ 条边的图中的最小 $s$-$t$ 切割。我们发现,在对一组全面的基准数据集进行评估时,我们的 BK 算法实现是目前可用的最快实现,突显了内存高效实现的重要性。我们将我们的实现公开,以便于进一步的研究和最小 $s$-$t$ 切割算法的实现开发。
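To make concrete what a minimum $s$-$t$ cut computes in the abstract above, here is a small sketch on a two-pixel segmentation graph. It deliberately uses plain BFS augmenting paths (Edmonds-Karp), not the BK or fcBK algorithm, and the function name `min_st_cut`, the graph, and its capacities are illustrative assumptions.

```python
from collections import deque

def min_st_cut(n, edges, s, t):
    """Return (cut_value, source_side) for a directed graph with integer capacities.

    Uses BFS augmenting paths (Edmonds-Karp); this is NOT the BK or fcBK algorithm.
    """
    cap = [dict() for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)          # reverse arc for the residual graph
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break                        # no augmenting path left
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:     # find the path's smallest residual capacity
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:     # push flow along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
    # Vertices still reachable from s in the residual graph form the source side.
    return flow, set(parent)

# Vertex 0 = source s, 1 = sink t, 2 and 3 = two neighbouring pixels.
edges = [(0, 2, 5), (0, 3, 1), (2, 1, 1), (3, 1, 5), (2, 3, 2), (3, 2, 2)]
cut_value, source_side = min_st_cut(4, edges, s=0, t=1)
```

In this toy graph the terminal capacities play the role of per-pixel data costs and the pixel-to-pixel arcs the role of a smoothness cost; the pixels that end up on the source side receive one label and the rest the other.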
cs.CV / 86 / 2605.13455

Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

基于联合泊松去卷积和微分同胚配准的突触贝叶斯体内追踪
Kumar, Shashwat, Padova, Dominic M., Narang, Binish, Coste, Gabrielle I., Graves, Austin R., Huganir, Richard L., Charles, Adam S., Miller, Michael I., Srivastava, Anuj
Abstract
Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal in vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal in vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
Chinese Translation
突触是密集堆积的亚微米结构,在学习和记忆形成过程中动态重组。荧光标记的突触受体的纵向体内成像为研究大规模突触动态及其在神经疾病中如何受到干扰提供了有希望的机会。然而,使用双光子显微镜进行体内成像时,由于激光功率低,因此信噪比(SNR)低且高噪声,日间组织运动非线性,突触荧光的非平稳波动,以及显微镜点扩散函数(PSF)引起的显著模糊,这些因素共同使得在高突触密度区域检测和追踪突触变得具有挑战性。本文提出了一种新颖的基于模板的框架,将突触建模为在非线性组织变形下移动的变化亮度点源。采用统一的贝叶斯方法,我们通过推导一个后验分布,将该模型应用于显微镜数据,该后验分布结合了用于领域扭曲的微分同胚映射、用于成像过程的高斯点扩散函数以及用于原始光子计数的泊松观察模型。贝叶斯解同时:(1) 构建突触位置的概率模板,(2) 去噪和去卷积图像数据,(3) 推断荧光强度,(4) 执行微分同胚图像配准以校正组织运动,以及 (5) 为这些参数估计提供置信区域。我们在一个2D+t模拟数据集和一个3D+t纵向体内显微镜数据集上演示了该框架,该数据集记录了在小鼠中成像的荧光突触,时间跨度为两周。
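The observation model named in the abstract above (point sources blurred by a Gaussian PSF, observed as Poisson photon counts) can be illustrated with a short 2D simulation. This sketch omits the diffeomorphic deformation and the Bayesian inference entirely; the flat background term, the PSF normalization, the function names, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def render_expected_counts(shape, positions, intensities, sigma, background=0.1):
    """Expected photon rate: Gaussian-PSF-blurred point sources plus a flat background (assumed)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    rate = np.full(shape, background, dtype=float)
    for (y0, x0), amplitude in zip(positions, intensities):
        psf = np.exp(-((yy - y0) ** 2 + (xx - x0) ** 2) / (2.0 * sigma ** 2))
        rate += amplitude * psf / (2.0 * np.pi * sigma ** 2)
    return rate

def poisson_log_likelihood(counts, rate):
    """Poisson log-likelihood up to the constant term -sum(log(counts!))."""
    return float(np.sum(counts * np.log(rate) - rate))

rng = np.random.default_rng(0)
positions = [(10.0, 12.0), (20.0, 18.0)]     # hypothetical synapse centres (row, col)
intensities = [200.0, 150.0]                 # hypothetical expected photons per synapse
rate = render_expected_counts((32, 32), positions, intensities, sigma=1.5)
counts = rng.poisson(rate)                   # simulated raw photon counts

# A wrong hypothesis: the same sources displaced by four pixels each.
shifted = render_expected_counts((32, 32), [(14.0, 12.0), (20.0, 22.0)], intensities, sigma=1.5)
ll_true = poisson_log_likelihood(counts, rate)
ll_shifted = poisson_log_likelihood(counts, shifted)
```

Comparing `ll_true` with `ll_shifted` shows how the Poisson likelihood scores candidate source positions against the raw counts, which is the quantity a registration or tracking step would need to improve.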
cs.CV / 87 / 2605.13457

OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

OP4KSR:一种一步法无补丁的4K超分辨率方法及周期性伪影抑制
Deng, Chengyan, Yu, Pengbin, Chen, Zhentao, Shen, Wei, Zhang, Kai, Li, Meng, Yuan, Lunxi, Zhou, Xue, Yu, Li
Abstract
Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
Chinese Translation
基于扩散的真实世界图像超分辨率(Real-ISR)在感知质量上取得了显著的进展;然而,直接将图像超分辨率提升至4K仍受到极高内存消耗的限制。因此,先前的方法采用基于补丁的推理,牺牲了全局上下文,导致语义混淆、空间不一致性和严重的延迟。我们提出了OP4KSR,一种基于强大Flux骨干网的一步法无补丁4K超分辨率方法。通过利用极压缩的F16变分自编码器(VAE),OP4KSR使得在实际GPU预算下进行4K超分辨率推理成为可能,同时保持全局空间-语义一致性,并实现高效推理。然而,适应这种一步法架构本质上会引发严重的周期性伪影。我们将其归因于RoPE基频分配的不匹配和内部标记空间模糊,这两者都因缺乏迭代精炼而加剧。为了抑制这些伪影,我们将RoPE基频重缩放(RFR)与基于自相关的周期性损失($\mathcal{L}_\text{AP}$)相结合。此外,我们策划了一个专门的训练数据集,并提供了三个基准(一个合成的和两个真实世界的)以推动4K超分辨率研究。大量实验表明,OP4KSR在高效推理的同时,能够实现具有竞争力的感知质量,在单个NVIDIA H20 GPU上仅需5.75秒即可生成$4096\times4096$的输出。
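The abstract above mentions an autocorrelation-based periodicity loss but does not define it. As a rough illustration of why autocorrelation exposes periodic artifacts, the sketch below computes a normalized circular autocorrelation with the FFT (Wiener-Khinchin theorem) on a single row of pixels. The function names, the 1D setting, and the "largest off-zero peak" score are assumptions and should not be read as the paper's $\mathcal{L}_\text{AP}$.

```python
import numpy as np

def normalized_autocorrelation(signal):
    """Circular autocorrelation via the power spectrum, scaled so that lag 0 equals 1."""
    centered = signal - signal.mean()
    power = np.abs(np.fft.fft(centered)) ** 2
    acf = np.fft.ifft(power).real
    return acf / acf[0]

def periodicity_score(signal, min_lag=2):
    """Largest autocorrelation value at a non-trivial lag, and that lag (hypothetical penalty)."""
    acf = normalized_autocorrelation(signal)
    window = acf[min_lag: len(signal) // 2]
    lag = int(np.argmax(window)) + min_lag
    return float(acf[lag]), lag

rng = np.random.default_rng(0)
n = 256
texture = rng.normal(size=n)                                   # aperiodic row of pixels
stripes = 2.0 * np.sin(2 * np.pi * np.arange(n) / 16)          # artifact with a 16-pixel period
artifact_row = 0.1 * texture + stripes

clean_score, _ = periodicity_score(texture)
artifact_score, artifact_lag = periodicity_score(artifact_row)
```

A row dominated by a repeating pattern produces a strong autocorrelation peak at a multiple of the pattern's period, while an aperiodic row does not, so a score of this kind could in principle be penalized during training.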
cs.CV / 88 / 2605.13465

Z-Order Transformer for Feed-Forward Gaussian Splatting

用于前馈高斯溅射的Z-顺序变换器
Wang, Can, Liu, Lei, Jiang, Wei, Xu, Dong
Abstract
Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
Chinese Translation
近期在三维高斯溅射(3D Gaussian Splatting, 3DGS)方面的进展使得光真实感新视图合成取得了显著进展。然而,传统的3DGS依赖于缓慢的迭代优化过程,这限制了其在需要实时结果的场景中的应用。为了解决这一瓶颈,近期的前馈方法旨在直接从图像中预测高斯属性,但它们常常在高斯原语的冗余性和渲染质量方面遇到困难。在本研究中,我们提出了一种专门为前馈高斯溅射设计的基于变换器的架构。我们的关键见解是,高斯之间的空间和语义关系可以通过稀疏注意机制有效捕捉,这一机制通过Z-顺序策略将无结构的高斯集合组织成空间上连贯的序列。此外,我们结合这一Z-顺序策略,自适应地抑制冗余,同时保留关键的结构细节。这使得变换器能够高效地建模上下文,压缩高斯原语,并在单次前向传递中预测高斯属性。全面的实验表明,我们的方法能够以更少的高斯原语实现快速且高质量的新视图合成。
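The Z-order strategy referred to in the abstract above turns an unordered 3D point set into a sequence in which nearby points tend to sit close together. A minimal sketch of that ordering using Morton codes follows; the 10-bit quantization per axis, the function names, and the random points standing in for Gaussian centres are assumptions, and the paper's actual attention and compression modules are not shown.

```python
import numpy as np

def part1by2(v):
    """Spread the low 10 bits of each integer so that two zero bits follow every input bit."""
    v = v.astype(np.int64) & 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton_codes(points, bits=10):
    """Quantize each axis to `bits` bits (at most 10) and interleave x, y, z into one code."""
    lo = points.min(axis=0)
    span = np.maximum(points.max(axis=0) - lo, 1e-12)
    q = np.floor((points - lo) / span * (2 ** bits - 1)).astype(np.int64)
    return part1by2(q[:, 0]) | (part1by2(q[:, 1]) << 1) | (part1by2(q[:, 2]) << 2)

def z_order(points):
    """Indices that sort the points along the Z-order (Morton) curve."""
    return np.argsort(morton_codes(points), kind="stable")

rng = np.random.default_rng(0)
centres = rng.uniform(size=(1000, 3))     # stand-in for Gaussian centres
sequence = centres[z_order(centres)]      # spatially coherent 1D sequence
```

Once the points are in this order, a sparse attention scheme could restrict each token to a fixed window of neighbours in the sequence, which is one plausible reading of how the ordering enables sparse attention.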
cs.CV / 89 / 2605.13475

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

FedHPro:通过梯度匹配的联邦超原型学习
Wang, Huan, Shen, Jun, Li, Haoran, Yang, Zhenyu, Yan, Jun, Manjang, Ousman, Zhai, Yanlong, Wu, Di, Pang, Guansong
Abstract
Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.
Chinese Translation
联邦学习(FL)使得分布式客户端能够在保护隐私的同时进行协作训练。为了增强FL中的泛化能力,以原型为基础的FL受到关注,因为共享的全局原型为对齐客户端特定的局部原型提供了语义锚点。然而,现有方法通过对局部原型进行平均或精炼全局锚点在原型层面更新全局原型,这往往导致客户端之间的语义漂移,从而产生不对齐的全局信号。为了解决这个问题,我们引入了超原型,定义为一组可学习的全局类别原型,以保持客户端之间的潜在语义知识。超原型通过梯度匹配进行优化,以与直接从客户端真实样本中提取的与类别相关的特征对齐,而不是原型层面的描述符。我们进一步提出了FedHPro,一个联邦超原型学习框架,利用超原型通过具有客户端特定边际的互对比学习促进类别间可分性,同时通过一致性惩罚鼓励类别内均匀性。在多种异构场景下的全面实验验证了1)超原型产生了更具语义一致性的全局信号,2)FedHPro在多个基准数据集上达到了最先进的性能。代码可在 https://github.com/mala-lab/FedHPro 获取。
cs.CV / 90 / 2605.13476

Neural Video Compression with Domain Transfer

基于领域转移的神经视频压缩
Zhang, Tiange, Lin, Rongqun, Meng, Xiandong, Wang, Haofeng, Tian, Xing, Zhang, Qi, Ma, Siwei
Abstract
Content-adaptive compression has always been a key direction in neural video coding (NVC), aiming to mitigate the domain gap between training and testing data. Such gaps often arise from distributional discrepancies between training and inference data, which may cause noticeable performance degradation when the testing content differs from the training distribution. To tackle this challenge, we propose DCVC-DT, a domain transfer enhanced neural video compression framework. Specifically, we design a lightweight online domain transfer (DT) mechanism that dynamically adapts the encoded latent representation during inference, effectively bridging the domain gap without modifying the encoder or decoder parameters. In addition, we develop a frame-level dynamic RD (Rate and Distortion) adjustment scheme that actively regulates the ratio of R and D in the loss function based on quality fluctuation, thereby improving rate-distortion performance. Extensive experiments demonstrate that DCVC-DT achieves up to 6.21% bitrate savings over the baseline DCVC-DC, while significantly enhancing generalization to unseen testing data and alleviating error propagation. Our code is available at https://github.com/SunnyMass/DCVC-DT.
Chinese Translation
内容自适应压缩一直是神经视频编码(NVC)的一个关键方向,旨在减小训练数据与测试数据之间的领域差距。这种差距通常源于训练数据与推理数据之间的分布差异,当测试内容与训练分布不同时,可能导致显著的性能下降。为了解决这一挑战,我们提出了DCVC-DT,一个增强领域转移的神经视频压缩框架。具体而言,我们设计了一种轻量级的在线领域转移(DT)机制,该机制在推理过程中动态调整编码的潜在表示,有效地弥合领域差距,而无需修改编码器或解码器的参数。此外,我们开发了一种帧级动态RD(速率与失真)调整方案,该方案根据质量波动主动调节损失函数中R与D的比率,从而改善速率-失真性能。大量实验表明,DCVC-DT在基线DCVC-DC的基础上实现了最高6.21%的比特率节省,同时显著增强了对未见测试数据的泛化能力,并减轻了错误传播。我们的代码可在https://github.com/SunnyMass/DCVC-DT获取。
cs.CV / 91 / 2605.13493

PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors

PhysEditBench:一种基于协议的基准,用于使用图像编辑器进行密集物理图预测
Yang, Jiaxin, Hou, Yu, Liu, Muxin, Liu, Weixuan, Yuan, Ze, Chen, Zeming, Wang, Zhongrui, Qi, Xiaojuan
Abstract
Can general-purpose image editors predict physical maps from a single RGB image? General-purpose image editors differ from standard task-specific dense-prediction models: they do not directly take an image and output a physical map. Instead, they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark to evaluate and standardize image editors in dense physical-map prediction that covers five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate. We use OpenRooms-FF for depth, surface normal, albedo, and roughness, InteriorVerse as an additional source for depth, normal, albedo, and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score, therefore, reflects the performance of a model under a specified protocol, rather than its best possible performance under all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, and stronger image editors can produce more reasonable map-like outputs. For roughness and metallic, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.
Chinese Translation
通用图像编辑器能否从单一RGB图像预测物理图?通用图像编辑器与标准的特定任务密集预测模型不同:它们并不直接接收图像并输出物理图。相反,它们必须通过提示、示例或基于图像的文本线索进行引导。为此,我们引入了PhysEditBench,这是一种新颖的基于协议的基准,用于评估和标准化图像编辑器在密集物理图预测中的表现,涵盖五个目标:深度、法线、反照率、粗糙度和金属图。为了评估数据,我们构建了一个目标依赖的基准基础。我们使用OpenRooms-FF进行深度、表面法线、反照率和粗糙度的评估,InteriorVerse作为深度、法线、反照率的额外来源,以及一个新生成的程序化金属图源。我们通过质量检查、有效区域掩膜、场景级采样和基于光照的压力子集来整理数据,以确保可靠和多样的评估。对于每个目标,PhysEditBench定义了一个固定的协议,指定了允许的输入、预期的输出格式和评分程序。因此,每个分数反映的是模型在特定协议下的表现,而不是在所有提示或交互模式下的最佳表现。实验结果表明,专门模型在深度、法线和反照率上仍然表现更强,而更强的图像编辑器能够生成更合理的图状输出。对于粗糙度和金属图,图像编辑器在某些标量指标上可以与专门基准相匹配或超越,但它们仍然存在结构错误、稀疏效应和对光照敏感的问题。
cs.CV / 92 / 2605.13517

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

ArcVQ-VAE:一种具有弧余弦加性边际的球面向量量化框架
Kim, Jaeyung, Yoo, YoungJoon
Abstract
Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE
Chinese Translation
向量量化变分自编码器(VQ-VAE)已成为图像建模中学习离散表示的基础框架。然而,VQ-VAE模型必须使用有限的代码本向量对整个图像进行标记,这一容量限制限制了它们捕捉丰富多样表示的能力。本文提出了弧余弦加性边际VQ-VAE(ArcVQ-VAE),一种新颖的向量量化框架,为传统VQ-VAE的代码本引入了球面角度边际先验(SAMP)。所提出的SAMP包括球体约束范数正则化,它将所有代码本向量限制在一个时间依赖的欧几里得球内,以及弧余弦加性边际损失,它鼓励潜在向量之间更大的角度可分性。该公式促进了在受限空间内更具辨别性和均匀分布的潜在表示,从而改善有效的潜在空间覆盖,提升代码本的利用率。在标准图像重建和生成任务上的实验结果表明,ArcVQ-VAE在重建精度、表示多样性和样本质量等方面与基线模型相比表现出竞争力。代码可在以下链接获取:https://github.com/goals4292/ArcVQ-VAE
cs.CV / 93 / 2605.13530

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

迈向统一的外科场景理解:通过多模态大语言模型(MLLM)桥接推理与基础
Huang, Jincai, Zou, Shihao, Guo, Yuchen, Li, Jingjing, Ji, Wei, Wang, Kai, Wang, Shanshan, Si, Weixin
Abstract
Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.
Chinese Translation
外科场景理解是计算机辅助干预的基石。尽管最近在外科图像分割方面的进展推动了这一领域的发展,但实际临床应用需要一种更全面的理解,能够共同捕捉过程背景、语义推理和精确的视觉基础。然而,现有的方法通常孤立地处理这些组件,导致碎片化的表征和有限的语义一致性。为了解决这一局限性,我们提出了SurgMLLM,一个统一的外科场景理解框架,能够在单一模型中桥接高层次推理与低层次视觉基础。给定外科视频,SurgMLLM微调一个多模态大语言模型(MLLM),以支持结构化可解释性推理,这用于共同建模阶段、仪器-动词-目标(IVT)三元组和三元组-实体分割标记。这些标记随后被时间聚合,并作为分割网络的提示,从而实现对三元组仪器和目标的准确像素级基础。整个框架以统一的目标进行端到端训练,将基于语言的推理监督与视觉基础损失相结合,促进一致的跨任务学习和临床一致的场景表征。为了便于统一评估,我们引入了CholecT45-Scene,扩展了CholecT45数据集,增加了64,299帧仪器和目标的像素级掩码注释,与现有的三元组标签对齐。大量实验表明,SurgMLLM显著推动了外科场景理解,将主要三元组识别指标AP_IVT从40.7%提高到46.0%,并在阶段识别和分割方面持续超越先前的方法。这些结果突显了统一推理与基础在可靠、上下文感知的外科辅助中的有效性。
cs.CV / 94 / 2605.13544

CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

CA-GCL:用于稳健3D医学图像理解的跨解剖全局-局部对比学习
Zhang, Hanwen, Liu, Yao, Dai, Die, Yang, Jiaye, Liu, Qiao, Xie, Yutong, Wang, Peng
Abstract
Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.
Chinese Translation
细粒度视觉-语言预训练(FVLP)在3D医学图像理解中展现出显著潜力,通过将解剖级视觉表征与相应的文本描述对齐。然而,现有的FVLP范式往往在文本嵌入空间中遭遇严重的表征崩溃,不同解剖结构的文本嵌入高度聚集且难以区分。这种分布退化使得模型对提示变化过于敏感,妨碍了可靠的临床应用。为了解决这些挑战,我们提出了一种新颖的跨解剖全局-局部对比学习框架(CA-GCL)。CA-GCL引入了一种全局对比目标,强制在潜在空间中分离解剖类别,有效抵消了局部对齐所诱导的聚合倾向。此外,我们结合了一种基于置换不变性和部分完整性的临床感知文本增强策略,以增强对描述不完整性的稳健性。在CT-RATE和Rad-ChestCT数据集上的广泛评估表明,CA-GCL在零-shot异常检测中始终优于现有的视觉语言预训练(VLP)范式,取得了卓越的性能,并展现出强大的跨数据集泛化能力。重要的是,CA-GCL降低了不同提示模板下的性能方差,将崩溃的文本相似性分布转变为钟形分布。这些结果验证了CA-GCL作为稳健3D医学图像理解的有效框架。
cs.CV / 95 / 2605.13565

Qwen-Image-VAE-2.0 Technical Report

Qwen-Image-VAE-2.0 技术报告
Zhang, Zekai, Li, Deqing, Cao, Kuan, Wu, Yujia, Wu, Chenfei, Wu, Yu, Peng, Liang, Meng, Hao, Li, Jiahao, Zhang, Jie, Gao, Kaiyuan, Yan, Kun, Jiang, Lihan, Tang, Ningyuan, Yin, Shengming, Wu, Tianhe, Xu, Xiao, Chen, Xiaoyue, Shu, Yan, Zhang, Yanran, Chen, Yilei, Xu, Yixian, Chen, Yuxiang, Wang, Zhendong, Liu, Zihao, Zhou, Zikai, Gu, Yiliang, Wang, Yi, Xu, Xiaoxiao, Qu, Lin
Abstract
We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.
Chinese Translation
我们提出了 Qwen-Image-VAE-2.0,这是一个高压缩变分自编码器(Variational Autoencoders, VAEs)套件,在重建保真度和可扩散性方面取得了显著进展。为了解决高压缩下的重建瓶颈,我们采用了一种改进的架构,具有全局跳跃连接(Global Skip Connections, GSC)和扩展的潜在通道。此外,我们将训练规模扩展到数十亿张图像,并结合合成渲染引擎,以提高在文本丰富场景中的性能。为了解决高维潜在空间的收敛挑战,我们实施了一种增强的语义对齐策略,使潜在空间更适合扩散建模。为了优化计算效率,我们利用了一种非对称且无注意力机制的编码器-解码器骨干网络,以最小化编码开销。我们对 Qwen-Image-VAE-2.0 在公共重建基准上的表现进行了全面评估。为了评估在文本丰富场景中的性能,我们提出了 OmniDoc-TokenBench,这是一个包含多样化真实文档的新基准,并配有专门的基于光学字符识别(OCR)的评估指标。Qwen-Image-VAE-2.0 实现了最先进的重建性能,在高压缩比下在一般领域和文本丰富场景中展现了卓越的能力。此外,下游 DiT 实验表明,我们的模型具有优越的可扩散性,与现有的高压缩基线相比,显著加快了收敛速度。这些结果确立了 Qwen-Image-VAE-2.0 作为一种在高压缩、优越重建和卓越可扩散性方面的领先模型。
cs.CV / 96 / 2605.13581

HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

HIR-ALIGN:通过基于扩散的数据生成增强高光谱图像恢复
Pang, Li, Zhao, Heng, Zhang, Yijia, Meng, Deyu, Cao, Xiangyong
Abstract
Hyperspectral image (HSI) restoration is crucial for reliable analysis, as real HSIs suffer from degradations like noise, blur, and resolution loss. However, existing models trained on source data often fail on target domains lacking clean references, a common occurrence in practice. To address this issue, we present HIR-ALIGN, a plug-and-play target-adaptive augmentation framework that enhances hyperspectral image restoration by augmenting limited training images with synthetic data that closely matches the target distribution using no extra data. It consists of three stages: (i) proxy generation, where off-the-shelf restoration models restore degraded target observations to produce semantics-preserving proxy HSIs that approximate target-domain clean images; (ii) distribution-adaptive synthesis, where a blur-robust unCLIP diffusion model generates target-aligned RGBs from proxy RGBs, with prompt conditioning and embedding-space noise initialization. Then, a warp-based spectral transfer module synthesizes HSIs by aligning each generated RGB with the proxy RGB, estimating soft patch-wise transport weights, and applying these weights and learnable local interpolation kernels to the proxy HSI; and (iii) aligned supervised finetuning, where restoration networks pretrained on the source distribution are finetuned using both the proxy HSIs and synthesized target-aligned HSIs, and are then deployed on degraded target images. We further provide theoretical analysis showing that augmentation-based finetuning can achieve lower target-domain restoration risk by jointly improving target distribution coverage and controlling spectral bias. Extensive experiments on simulated and real datasets across denoising and super-resolution tasks demonstrate that HIR-ALIGN consistently improves source-only supervised baselines, outperforming both source-only counterparts and representative unsupervised methods.
Chinese Translation
高光谱图像(HSI)恢复对于可靠分析至关重要,因为真实的高光谱图像常常受到噪声、模糊和分辨率损失等退化的影响。然而,现有模型在缺乏干净参考的目标领域上训练时常常失败,这在实践中是一个普遍现象。为了解决这个问题,我们提出了HIR-ALIGN,这是一种即插即用的目标自适应增强框架,通过使用与目标分布高度匹配的合成数据来增强高光谱图像恢复,从而扩展有限的训练图像,而无需额外的数据。该框架由三个阶段组成:(i)代理生成,在此阶段,现成的恢复模型将退化的目标观测恢复为语义保持的代理高光谱图像,近似目标领域的干净图像;(ii)分布自适应合成,使用一个抗模糊的unCLIP扩散模型从代理RGB生成目标对齐的RGB,结合提示条件和嵌入空间噪声初始化。然后,一个基于变形的光谱转移模块通过将每个生成的RGB与代理RGB对齐,估计软补丁级传输权重,并将这些权重和可学习的局部插值核应用于代理高光谱图像来合成高光谱图像;(iii)对齐的监督微调,在此阶段,预训练于源分布的恢复网络使用代理高光谱图像和合成的目标对齐高光谱图像进行微调,然后在退化的目标图像上进行部署。我们进一步提供理论分析,表明基于增强的微调可以通过共同改善目标分布覆盖和控制光谱偏差来实现更低的目标领域恢复风险。在去噪和超分辨率任务上对模拟和真实数据集进行的广泛实验表明,HIR-ALIGN始终改善源仅监督基线,超越了源仅模型和代表性的无监督方法。
cs.CV / 97 / 2605.13583

Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging

Phy-CoSF:基于物理指导的连续光谱场重建与快照压缩成像的超分辨率
Chen, Wudi, Zha, Zhiyuan, Yuan, Xin, Wang, Shigang, Wen, Bihan, Zhou, Jiantao, Yan, Gang, Fan, Zipei, Zhu, Ce
Abstract
Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: https://github.com/PaiDii/Phy-CoSF.git.
Chinese Translation
近期的研究进展表明,编码孔快照光谱成像(CASSI)系统在从单个二维测量中捕获三维高光谱图像(HSI)方面具有巨大潜力。尽管CASSI捕获的场景具有固有的光谱连续性,但大多数现有的重建方法仅限于固定的离散光谱输出,从而阻碍了连续光谱重建或光谱超分辨率的发展。为了解决这一挑战,我们提出了Phy-CoSF,该方法将深度展开网络与隐式神经表示相结合,建立了CASSI中连续光谱重建和超分辨率的新范式。具体而言,我们提出了一种双阶段架构,将离散波长训练与连续光谱渲染相结合,使得能够在任意目标波长合成高保真度的HSI。我们框架的核心是连续光谱场(CoSF)模块,该模块嵌入在每个展开阶段中作为动态先验,包含一个三分支跨域特征混合器,用于全面的空间-频率-通道特征融合,以及一个光谱合成头,通过查询连续波长坐标生成光谱强度。大量实验结果表明,Phy-CoSF不仅能够在任意光谱分辨率下实现连续建模,还在重建保真度和光谱细节保留方面超越了许多最先进的方法。我们的代码和更多结果可在以下链接获取:https://github.com/PaiDii/Phy-CoSF.git。
cs.CV / 98 / 2605.13586

HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation

HetScene:考虑异质性的稠密室内场景生成扩散方法
Chen, Zini, Huang, Junming, Zhang, Rong, Xu, Jiamin, Peng, Cheng, Wang, Chi, Xu, Weiwei
Abstract
Generating controllable and physically plausible indoor scenes is a pivotal prerequisite for constructing high-fidelity simulation environments for embodied AI. However, existing deep-learning-based methods usually treat all objects as homogeneous instances within a unified generation process. While effective for sparse and simplistic layouts, they struggle to model realistic layouts with dense object arrangements and complex spatial dependencies, leading to limited scalability and degraded physical plausibility. To deal with these challenges, we revisit indoor layout generation from the perspective of structural heterogeneity and decompose the objects into primary objects and secondary objects according to their distinct roles in shaping a scene. Based on this decomposition, we propose HetScene, a heterogeneous two-stage generation framework that decouples indoor layout synthesis into Structural Layout Generation (SLG) and Contextual Layout Generation (CLG). SLG first generates globally coherent structural layouts with only primary objects conditioned on text descriptions, top-down binary room masks, and spatial relation graphs, establishing a stable global macro-skeleton of large core furniture.
Chinese Translation
生成可控且物理上合理的室内场景是构建高保真模拟环境以支持具身人工智能的关键前提。然而,现有的基于深度学习的方法通常将所有物体视为在统一生成过程中同质的实例。虽然这种方法在稀疏和简单布局中有效,但在处理具有稠密物体排列和复杂空间依赖关系的真实布局时却显得力不从心,导致可扩展性有限和物理合理性下降。为了解决这些挑战,我们从结构异质性的角度重新审视室内布局生成,并根据物体在塑造场景中的不同角色将其分解为主要物体和次要物体。基于这种分解,我们提出了HetScene,一个异质的两阶段生成框架,将室内布局合成解耦为结构布局生成(Structural Layout Generation, SLG)和上下文布局生成(Contextual Layout Generation, CLG)。SLG首先根据文本描述、从上到下的二进制房间掩码和空间关系图生成全球一致的结构布局,仅使用主要物体,从而建立大型核心家具的稳定全局宏观骨架。
cs.CV / 99 / 2605.13591

Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

Real2Sim:一个基于物理驱动的可编辑高斯点云框架用于自动驾驶场景
Huang, Kaicong, Azfar, Talha, Shi, Weisong, Ke, Ruimin
Abstract
Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, a unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim's capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.
Chinese Translation
可靠的自动驾驶依赖于大规模、标注良好的数据和稳健的模型。然而,手动数据收集资源密集,传统模拟存在持续的现实差距。尽管最近的生成框架和辐射场方法提高了视觉真实感,但它们在时间和空间一致性方面仍然存在困难,无法确保物理感知行为,限制了其在驾驶场景生成中的适用性。为了解决这些挑战,我们提出了Real2Sim,一个将4D高斯点云(4D Gaussian Splatting, 4DGS)与可微分材料点法(Material Point Method, MPM)求解器相结合的统一框架。Real2Sim明确地将动态驾驶场景重建为时间上连续的高斯原语,支持实例级编辑,并模拟真实的物体间和物体与环境之间的交互。该框架能够实现物理感知的高保真合成,生成多样的可编辑场景,包括碰撞和碰撞后轨迹等具有挑战性的边缘案例。在Waymo开放数据集上的实验验证了Real2Sim在渲染、重建、编辑和物理模拟方面的能力,展示了其作为下游任务(如感知、跟踪、轨迹预测和端到端策略学习)中数据生成的可扩展工具的潜力。
cs.CV / 100 / 2605.13600

Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

高效3D语言高斯点云稀疏编码提升
Budimir, Lovre Antonio, Guan, Yushi, Ryhner, Steve, Lončarić, Sven, Vijaykumar, Nandita
Abstract
3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
Chinese Translation
3D语言高斯点云(3DLGS)通过与语言对齐的视觉特征增强了3D高斯点云,以实现开放词汇的3D场景理解。一个核心挑战是如何高效地将高维视觉-语言嵌入与数百万个3D高斯关联,同时保持基于文本查询的高效特征渲染。现有方法要么直接在高斯上存储密集特征,导致高存储成本和缓慢渲染,要么通过昂贵的逐场景优化和重复特征光栅化来学习紧凑表示。没有现有方法能够同时实现快速的3D语义重建、高效存储和快速渲染。我们提出了SCOUP(稀疏编码提升),通过将语言表示学习与3D高斯优化解耦来解决这三个问题。我们不是直接在3D中工作,而是完全使用与2D图像区域相关的特征学习基于稀疏代码本的表示,将每个区域与一组稀疏的代码本系数关联。然后,我们使用加权稀疏聚合将这些系数提升到3D高斯,其中每个高斯在视图间累积代码本原子的系数。Top-$K$过滤器提取每个高斯的最主要的多视图系数,从而实现高效存储和快速渲染。与当前渲染速度的最先进技术相比,我们的方法在训练速度上实现了高达$400\times$的加速,同时在训练期间的内存效率提高了$3\times$。在多个基准测试中,SCOUP在开放词汇查询准确性上与现有方法相匹配或超越。
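The uplifting step described in the SCOUP abstract (weighted accumulation of sparse codebook coefficients per Gaussian across views, followed by Top-$K$ filtering) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `uplift_topk`, the `(gaussian_id, region_id, weight)` association format, and the choice to rank atoms by absolute coefficient are all assumptions made for illustration.

```python
from collections import defaultdict


def uplift_topk(associations, region_codes, k):
    """Aggregate sparse 2D region codes onto Gaussians and keep the top-k atoms.

    associations: iterable of (gaussian_id, region_id, weight) tuples
                  (hypothetical stand-in for Gaussian-to-pixel associations).
    region_codes: dict region_id -> {atom_id: coefficient} (sparse code per region).
    k:            number of codebook atoms kept per Gaussian.
    Returns dict gaussian_id -> list of (atom_id, coefficient), largest magnitude first.
    """
    # Weighted accumulation of coefficients over codebook atoms, per Gaussian.
    acc = defaultdict(lambda: defaultdict(float))
    for gaussian_id, region_id, weight in associations:
        for atom_id, coeff in region_codes[region_id].items():
            acc[gaussian_id][atom_id] += weight * coeff

    # Top-k filtering: keep the k atoms with the largest absolute coefficient.
    # Ties are broken by atom id so the output is deterministic.
    result = {}
    for gaussian_id, atoms in acc.items():
        ranked = sorted(atoms.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
        result[gaussian_id] = ranked[:k]
    return result


if __name__ == "__main__":
    codes = {0: {1: 1.0, 2: 0.5}, 1: {2: 1.0, 3: 0.2}}
    links = [(0, 0, 0.5), (0, 1, 0.5), (1, 1, 1.0)]
    print(uplift_topk(links, codes, k=2))
```

Storing only `k` (atom, coefficient) pairs per Gaussian is what keeps the representation compact in this sketch; how the real method weights views or normalizes coefficients is not specified in the abstract.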
cs.CV / 101 / 2605.13604

Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

重新思考图卷积在2D到3D手势提升中的应用
Kim, Chanyoung, Kim, Donghyun, Sim, Dong-Hyun, Hwang, Seong Jae, Kwon, Youngjoong
Abstract
Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
Chinese Translation
图卷积网络(GCNs)广泛应用于3D手势估计,其中手部骨架被编码为固定的邻接图。我们重新审视这种方法是否是将手部拓扑有效融入2D到3D提升的最佳方式。本文在FPHA基准上进行控制的参数匹配消融实验,结果表明标准的多头自注意力在性能上始终优于GCN基线。即使在GCN通过多跳邻接和匹配参数数量进行增强的情况下,自注意力仍将MPJPE从12.36毫米降低至10.09毫米。一个受骨架约束的图注意力网络弥补了大部分差距,表明输入依赖的聚合是主要的改进来源,而全连接注意力则带来了额外的提升。我们进一步表明,当手部拓扑作为软结构先验通过图距离位置编码引入时,其效果最佳,而不是作为硬邻接约束。这些结果表明,对于手势提升,适应性空间注意力比固定图卷积更有效。
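For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The following is a minimal sketch of that standard definition; the function name `mpjpe` and the list-of-tuples input format are illustrative assumptions, and the paper's exact evaluation protocol (for example, any root alignment) is not given in the abstract.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error for one pose.

    pred, target: equal-length sequences of 3D joint coordinates (x, y, z),
    in the same unit (e.g. millimetres). Returns the mean Euclidean distance.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
    print(mpjpe(predicted, ground_truth))  # joint errors are 3 and 4
```

On the figures reported in the abstract, going from 12.36 mm to 10.09 mm is a relative reduction of roughly 18%.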
cs.CV / 102 / 2605.13621

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

WD-FQDet:基于小波分解和频率感知查询学习的多光谱检测变换器
Yang, Chunjin, Zhang, Xiwei, Xiao, Yiming, Meng, Fanman
Abstract
Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.
Chinese Translation
红外-可见物体检测通过结合多光谱图像中的互补特征来提高检测性能。现有的特定骨干网和共享骨干网方法仍然面临模态共享特征严重偏差和模态特定特征不足的问题。为了解决这些问题,我们提出了一种新颖的检测框架WD-FQDet,该框架明确地将红外和可见模态中的模态共享和模态特定信息从低频域和高频域的全新视角进行解耦,从而允许根据其频率特征量身定制的融合策略。具体而言,提出了一种低频同质性对齐模块,通过跨模态注意机制对齐跨模态的模态共享特征,并提出了一种高频特异性保留模块,通过多尺度梯度一致性损失来保留模态特定特征。为了增强频域中的特征表示,我们提出了一种结合空间线索的混合特征增强模块。此外,考虑到同质特征和模态特定特征对物体检测的贡献在不同场景下有所不同,我们提出了一种频率感知查询选择模块,以动态调节它们的贡献。在FLIR、LLVIP和M3FD数据集上的实验结果表明,WD-FQDet在多个评估指标上达到了最先进的性能。
cs.CV / 103 / 2605.13664

HADAR-Based Thermal Infrared Hyperspectral Image Restoration

基于HADAR的热红外高光谱图像恢复
Dai, Cheng, Lin, Jiale, Song, Bingxuan, Chen, Yifei, Chen, Jiashuo, Yuan, Xin, Bao, Fanglin
Abstract
Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.
Chinese Translation
热红外(TIR)高光谱图像(HSI)为各种应用提供了关键的场景信息。然而,由于独特的传感器退化超出了现有恢复方法的能力,其实际应用受到严重限制,而这些方法忽视了潜在的热物理学。在此,我们提出HAIR(基于HADAR的图像恢复)作为一种基于物理的地面TIR-HSI恢复框架。HAIR利用HADAR渲染方程(HRE),并将其与大气下行辐射传输方程(RTE)相结合,以温度、发射率和纹理(TeX)物理三元组来建模TIR-HSI。该物理模型导致了一种TeX分解-合成策略,确保了物理一致性和空间-光谱噪声的抗干扰能力,这与现有方法形成鲜明对比。此外,我们的框架使用前向建模的大气下行参考,以及发射率和黑体辐射的光谱平滑性,以实现光谱校准和生成,这在其他情况下是难以实现的。我们在户外DARPA隐形车灯数据集和实验室FTIR测量上的广泛实验表明,HAIR在去噪、修补、光谱校准和光谱超分辨率方面始终优于最先进的方法,确立了客观准确性和视觉质量的基准。
cs.CV / 104 / 2605.13667

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

SceneGraphVLM:基于视觉语言模型的视频动态场景图生成
Makarov, Vladislav, Gizetdinov, Mark, Yudin, Dmitry
Abstract
Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
Chinese Translation
场景图生成为视觉感知提供了一种紧凑的结构化表示,但从图像和视频中准确快速地预测图形仍然具有挑战性。最近基于视觉语言模型(VLM)的方法能够端到端生成结构化文本形式的场景图,但通常会生成包含无关对象和关系的冗长输出。我们提出了SceneGraphVLM,这是一种使用小型视觉语言模型进行图像和视频场景图生成的紧凑方法。SceneGraphVLM以高效的TOON格式序列化图形,并分两阶段训练模型:首先进行监督微调,然后进行强化学习,使用关注幻觉的奖励来平衡关系覆盖率和精度,同时惩罚不支持的对象和关系。对于视频,该模型可以选择性地将每帧条件化于先前生成的图形,从而提供轻量级的短期上下文,而无需跟踪或后处理。我们在PSG、PVSG和Action Genome上评估了SceneGraphVLM。通过紧凑的VLM和vLLM加速解码,SceneGraphVLM实现了良好的质量与速度的权衡,改善了以精度为导向的场景图生成(SGG)指标,同时保持合理的召回率,并以大约一秒的延迟生成完整的场景图。代码和实现细节可在以下网址获取:https://github.com/markus0440/SceneGraphVLM.git。
cs.CV / 105 / 2605.13670

Pattern-Enhanced RT-DETR for Multi-Class Battery Detection

基于模式增强的 RT-DETR 多类别电池检测
Zhong, Xu, Hu, Enyuan
Abstract
Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.
Chinese Translation
准确高效的电池检测在电子废物回收、工业质量控制和自动化分拣系统等应用中变得越来越重要。本文提出了一个全面的基准测试和一种新颖的多类别电池检测方法。我们在一个公开可用的包含约8,591张标注图像的数据集上,系统地比较了三种基于卷积神经网络(CNN)的检测器(YOLOv8n、YOLOv8s、YOLO11n)和两种基于变换器的检测器(RT-DETR-L、RT-DETR-X),在相同的实验条件下进行评估,并进一步提出了 PaQ-RT-DETR,该方法将基于模式的动态查询生成引入 RT-DETR,以减轻查询激活不平衡,同时几乎不增加计算开销。在基线模型中,YOLO11n 以仅 2.6M 的参数量实现了最佳的 CNN 基础准确率(mAP@50: 0.779),而 YOLOv8n 则以约 1,667 FPS 的速度提供了最快的推理。PaQ-RT-DETR-X 实现了最高的整体 mAP@50 为 0.782,较 RT-DETR-X 提升了 2.8%,并在包括数据稀缺的自行车电池类别在内的六个电池类别中均实现了一致的每类增益。我们的研究结果为电池相关工业应用中选择目标检测模型提供了实用指导。
cs.CV / 106 / 2605.13672

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

SpurAudio:用于研究少样本音频分类中快捷学习的基准
Ayoub, Giries Abu, Tukan, Morad, Mualem, Loay
Abstract
Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.
Chinese Translation
少样本分类(FSC)广泛应用于从有限标注数据中学习,但大多数评估隐含假设目标概念与上下文线索是独立的。然而,在现实世界中,示例往往出现在丰富的上下文中,使得模型能够利用前景内容与背景信号之间的虚假相关性。尽管这种效应在少样本图像分类中得到了研究,但其在少样本音频分类中的作用仍然基本未被探索,现有的音频基准对上下文结构的控制有限。我们引入了SpurAudio,这是一个利用音频中前景事件和背景环境的自然可分性来实现对支持集和查询集之间上下文变化的受控、多层次评估的基准。使用该基准,我们展示了许多最先进的少样本方法在背景相关性被破坏时会遭受严重的性能下降,尽管在标准评估协议下取得了类似的准确性。关键是,这种脆弱性在大型预训练音频基础模型中仍然存在,排除了有限主干容量作为解释。此外,在传统基准下看似可比的方法对虚假相关性的敏感性可能显著不同,揭示了与特征表示如何在推理时与分类器头部交互相关的系统性算法优势和脆弱性。这些发现为少样本方法在音频中的行为提供了新的见解,并强调了在评估FSC模型时需要明确探测上下文依赖性的基准。
cs.CV / 107 / 2605.13674

Weakly Supervised Segmentation as Semantic-Based Regularization

弱监督分割作为基于语义的正则化
Colamonaco, Stefano, Florea, Andrei-Bogdan, Maene, Jaron
Abstract
Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.
Chinese Translation
弱监督语义分割(WSSS)通过部分或粗略标注(如边界框、涂鸦或图像级标签)训练密集像素级分割模型。尽管近期的研究利用基础模型(如Segment Anything Model,SAM)生成伪标签,但这些方法通常依赖于启发式提示选择,并且在整合先验知识或异构标签方面提供的方式有限。我们通过采用神经符号学的视角来填补这一空白:将可微分模糊逻辑与深度分割模型相结合。弱标注和领域特定的先验被统一为连续逻辑约束,以在弱监督下微调SAM。经过精细调整的基础模型随后生成改进的伪标签,我们基于这些伪标签训练一个无提示的第二阶段分割模型。在Pascal VOC 2012和REFUGE2视盘/杯分割数据集上的实验表明,我们的逻辑引导微调产生了更高质量的伪标签,从而实现了最先进的分割精度,且常常超过密集监督基线。
cs.CV / 108 / 2605.13675

Characterizing Universal Object Representations Across Vision Models

跨视觉模型的通用对象表征特征分析
Mahner, Florian P., Roth, Johannes, Lam, Ka Chun, Bonner, Michael F., Pereira, Francisco, Hebart, Martin N.
Abstract
Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.
Chinese Translation
不同架构、目标和数据集训练的深度神经网络已被报道在相似的视觉表征上收敛。然而,目前尚不清楚模型实际收敛于哪些视觉属性,以及哪些因素可能导致这种收敛。为了解决这一问题,我们将162个多样化视觉模型的对象相似性结构分解为一小组非负维度。接着,我们估计每个维度在模型间重复出现的频率,以确定通用维度与模型特定维度的区别。与模型特定维度相比,通用维度更具可解释性,并且更强烈地受到概念图像属性的驱动,这表明可解释性和语义内容作为推动模型间通用性的隐含因素的重要性。架构、目标函数、训练数据、模型大小和模型性能的差异无法解释通用维度的出现。然而,具有更多通用维度的模型在预测猕猴IT活动和人类相似性判断方面表现更佳,这表明通用性反映了与生物视觉相关的表征。这些发现对理解深度神经网络模型背后的涌现表征及其与生物视觉的对齐具有重要意义。
cs.CV / 109 / 2605.13686

Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

基于生成框架的医学影像跨模态图像翻译
Romoli, Giulia, Capoccia, Alessia, Ruffini, Filippo, Di Feola, Francesco, Boldrini, Luca, Chiti, Arturo, Cuocolo, Renato, D'Antonoli, Tugba Akinci, Darvizeh, Fatemeh, Di Pumpo, Marcello, Erickson, Bradley J., Fang, Liu, Fazzini, Deborah, Feraco, Paola, Gelardi, Fabrizia, Gossetti, Francesco, Ferrer, Ana Isabel Hernáiz, Klontzas, Michail E., Payabvash, Seyedmehdi, Riklund, Katrine, Strandberg, Sara N., Guarrasi, Valerio, Soda, Paolo
Abstract
Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.
Chinese Translation
医学图像间(I2I)翻译实现了虚拟扫描,即从源影像模态合成目标影像模态,而无需额外的采集。尽管兴趣日益增长,但大多数提出的方法仅在二维切片上进行操作,且在不同的实验设置下评估孤立任务,缺乏临床验证。本研究的主要贡献是对肿瘤影像学中三维I2I翻译方法进行可重复的、标准化的比较评估,旨在标准化预处理、数据划分、推理和跨异构临床任务的多层次评估。在此框架内,我们比较了七种生成模型,包括三种生成对抗网络(GANs:Pix2Pix、CycleGAN、SRGAN)和四种潜在生成模型(潜在扩散模型、潜在扩散模型+控制网络、布朗桥、流匹配),涵盖了三个解剖区域(头/颈、肺、骨盆)和四个翻译方向(锥束CT到CT、MRI到CT、CT到PET、MRI T2加权到T2-FLAIR),共进行了77个实验,所有实验在统一的训练、推理和评估条件下进行。结果表明,GANs在所有任务中均优于潜在生成模型,其中SRGAN表现出统计学上的显著优势。我们的病灶级分析显示,所有模型在小病灶的处理上均存在困难,并且在CT到PET的合成中,模型在重现病灶形状方面比绝对摄取相关强度更可靠。我们还对17名医生(包括15名放射科医师)进行了视觉图灵测试,结果显示分类准确率接近随机水平(56.7%),确认合成体积与真实采集在很大程度上难以区分,同时揭示了定量指标与临床偏好之间的脱节。
cs.CV / 110 / 2605.13688

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

MedCore:用于 MedSAM 的边界保留医学核心剪枝
Zhang, Cenwei, Xiang, Suncheng, You, Lei
Abstract
Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at https://github.com/cenweizhang/MedCore.
Chinese Translation
医学分割基础模型如 SAM 和 MedSAM 提供强大的基于提示的分割能力,但它们的图像编码器在许多临床环境中仍然过于庞大。压缩在医学中也是有风险的,因为一个模型可能在保持高 Dice 指标的同时失去边界的保真度。我们提出了 MedCore,这是一个针对 MedSAM 的结构化剪枝框架。其主要思想是保留两种结构:在 SAM 到 MedSAM 适配过程中变得重要的结构,以及具有高边界杠杆的结构。我们通过一个双干预评分来识别第一种类型,该评分比较将一组权重归零与将其重置为原始 SAM 权重的效果。我们通过边界感知的费舍尔估计来识别第二种类型。我们还引入了边界杠杆原理,该原理表明压缩引起的边界位移由边界上的 logit 扰动与 logit 空间梯度之比控制。该原理解释了为什么即使 Dice 指标保持高水平,边界指标也可能下降。在息肉分割基准测试中,MedCore 将参数减少了 60.0%,FLOPs 减少了 58.4%,并在恢复微调后实现了 Dice 0.9549、边界 F1 0.6388 和 HD95 5.14。它还在保持良好边界质量的情况下实现了 86.6% 的参数减少和 90.4G FLOPs。我们的分析进一步表明,MedSAM 处于一个头部脆弱的边界状态:头部剪枝步骤的第 95 百分位边界杠杆比 MLP 剪枝步骤大 2.887 倍,并且这种 logit 级别的效应与 BF1 和 HD95 的下降是一致的。我们的代码可在 https://github.com/cenweizhang/MedCore 获取。
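The boundary leverage principle stated in the abstract (boundary displacement controlled by the logit perturbation on the boundary divided by the logit spatial gradient) is consistent with a standard first-order level-set argument, sketched below. The symbols $f$, $\delta$, $\mathbf{n}$, and $\Delta$ are notation assumed here for illustration; the paper's exact statement and constants may differ.

```latex
% Assumed notation: f(x) is the segmentation logit, so the predicted
% boundary is the zero level set \{x : f(x) = 0\}. Compression perturbs
% the logit to \tilde f(x) = f(x) + \delta(x).
%
% Take a boundary point x (f(x) = 0) with unit normal
% n = \nabla f(x) / \|\nabla f(x)\|, and look for the displaced boundary
% point x + \Delta\, n satisfying \tilde f(x + \Delta\, n) = 0.
\begin{align}
0 &= f(x + \Delta\,\mathbf{n}) + \delta(x + \Delta\,\mathbf{n}) \\
  &\approx f(x) + \Delta\,\|\nabla f(x)\| + \delta(x)
     && \text{(first-order expansion)} \\
\Rightarrow\quad
\Delta &\approx -\,\frac{\delta(x)}{\|\nabla f(x)\|},
\qquad
|\Delta| \approx \frac{|\delta(x)|}{\|\nabla f(x)\|}.
\end{align}
```

Under this reading, a small logit perturbation can still move the boundary substantially wherever the logit is spatially flat, which is one way to see why boundary metrics may degrade while Dice, dominated by interior pixels, stays high.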
cs.CV / 111 / 2605.13713

Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization

通过辐射强度图扩散模型生成和基于LSTM的优化学习放射治疗计划
Poles, Isabella, Arberet, Simon, Gao, Riqiang, Kraus, Martin, Santambrogio, Marco D., Ghesu, Florin C., Kamen, Ali, Comaniciu, Dorin
Abstract
Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet, its planning solves inverse and nested optimization for multi-leaf collimators, monitor units and dose parameters, while enforcing their consistency to ensure mechanical deliverability. Nevertheless, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.
Chinese Translation
体积调制弧形治疗(VMAT)是现代放射治疗的基石,能够实现高度符合肿瘤照射和保护健康组织。然而,其规划需要对多叶光栅、监控单位和剂量参数进行逆向和嵌套优化,同时确保其一致性以保证机械可交付性。然而,当治疗配置发生变化时,这一过程通常需要反复重新优化,导致每位患者的规划时间显著增加。为了解决这些问题,我们提出了一种基于扩散驱动的优化学习(L2O)方法,用于端到端的VMAT规划。一个分布匹配的提炼扩散模型学习临床可行的辐射强度图流形,从而实现其一次性生成。在此基础上,基于LSTM的L2O模块学习梯度更新动态,以在推理过程中迅速优化辐射强度图,以达到规定的剂量目标。在临床和公共前列腺癌队列上的实验结果表明,与当前可用的端到端VMAT规划器相比,该方法提高了规划效率、灵活性和机械可交付性。
cs.CV / 112 / 2605.13724

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

AnyFlow:基于策略流图蒸馏的任意步视频扩散模型
Gu, Yuchao, Fang, Guian, Jiang, Yuxin, Mao, Weijia, Han, Song, Cai, Han, Shou, Mike Zheng
Abstract
Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
Chinese Translation
通过一致性蒸馏,少步视频生成取得了显著进展。然而,一致性蒸馏模型的性能往往随着测试时分配更多采样步骤而下降,这限制了其在任意步视频扩散中的有效性。这一限制产生的原因在于一致性蒸馏将原始的概率流常微分方程(ODE)轨迹替换为一致性采样轨迹,从而削弱了ODE采样在测试时的理想缩放行为。为了解决这一限制,我们提出了AnyFlow,这是第一个基于流图的任意步视频扩散蒸馏框架。AnyFlow并不是仅为少数固定的采样步骤蒸馏模型,而是优化完整的ODE采样轨迹。为此,我们将蒸馏目标从端点一致性映射($z_{t}\rightarrow z_{0}$)转变为任意时间间隔的流图过渡学习($z_{t}\rightarrow z_{r}$)。我们进一步提出了流图反向模拟,它将完整的欧拉展开分解为快捷流图过渡,从而实现高效的在线策略蒸馏,减少测试时的误差(即在少步采样中的离散化误差和因果生成中的曝光偏差)。在参数规模从13亿到140亿的双向和因果架构上进行的大量实验表明,AnyFlow在少步生成的表现与基于一致性的对手相匹配或超越,同时能够根据采样步骤预算进行扩展。
cs.CV / 113 / 2605.13729

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

协调多种条件以生成轨迹控制的人类运动
Cai, Deli, Ma, Haoyang, Ding, Changxing
Abstract
Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that effectively coordinates text and trajectory conditions through a divide-and-conquer strategy. CMC comprises two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under the guidance of the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.
Chinese Translation
轨迹控制的人类运动生成旨在合成基于文本描述和空间轨迹的真实人类运动。然而,现有方法存在两个关键限制:首先,文本和轨迹条件之间的冲突干扰了去噪过程,导致运动质量下降或轨迹跟踪不准确;其次,冗余的运动表示引入了运动组件之间的不一致,导致轨迹控制过程中的不稳定性。为了解决这些挑战,我们提出了CMC,一个通过分而治之策略有效协调文本和轨迹条件的解耦框架。CMC遵循分而治之的范式,包含两个级联阶段:轨迹控制和运动完成。在第一个阶段,扩散模型在轨迹指导下生成受控关节的简化表示,基于给定的轨迹,确保准确和稳定的轨迹跟踪。在第二个阶段,基于文本的扩散修复模型使用来自第一个阶段的简化表示作为部分观测生成全身运动。为了减轻由于有限的修复训练数据导致的过拟合,我们进一步引入了选择性修复机制(Selective Inpainting Mechanism,SIM),在训练过程中在文本到运动生成和运动修复任务之间交替进行。在HumanML3D和KIT数据集上的实验表明,CMC在控制精度和运动质量方面达到了最先进的性能,证明了其在协调多模态条件和表示方面的有效性。
cs.CV / 114 / 2605.13744

Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration

将网络等变性与数据对称性对齐:图像恢复的理论框架与自适应方法
Tan, Feiyu, Xie, Qi, Xu, Zongben, Meng, Deyu
Abstract
Image restoration is an inherently ill-posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill-posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real-world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model-data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non-strict symmetry at the dataset level (rather than sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that the equivariance of restoration models can be naturally derived from this inverse problem with the proposed symmetry constraints incorporated, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network's empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias-variance trade-off, minimizing the total expected risk. Guided by these insights, we propose a Sample-Adaptive Equivariant Network that uses a hypernetwork and transformation-learnable equivariant convolutions to dynamically align with each sample's inherent symmetry. Extensive experiments on super-resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at https://github.com/tanfy929/SA-Conv.
Chinese Translation
图像恢复本质上是一个病态的逆问题。嵌入几何对称性先验的等变网络可以缓解这种病态性并提高性能。然而,目前对网络等变性与数据对称性之间关系的理解仍然主要是启发式的。特别是对于具有不完美对称性的真实世界数据,现有研究缺乏系统的理论框架来量化对称性、选择变换群或评估模型与数据的对齐。为了解决这一问题,我们从优化的角度进行分析,形式化数据对称性先验、模型等变性和泛化能力之间的内在关系。具体而言,我们首次提出了一种可量化的非严格对称性定义,适用于数据集层面(而非样本层面),并将其作为约束来构造恢复逆问题。我们随后展示了恢复模型的等变性可以自然地从包含所提出的对称性约束的逆问题中推导出来,并且最优恢复算子的等变性误差严格受限于数据对称性误差和离散化网格大小。此外,通过分析网络的经验风险,我们证明了将等变性与数据对称性对齐可以优化偏差-方差权衡,从而最小化总期望风险。在这些见解的指导下,我们提出了一种样本自适应等变网络,该网络使用超网络和可学习的等变卷积动态地与每个样本的固有对称性对齐。在超分辨率、去噪和去雨等方面的广泛实验验证了我们的理论发现,并显示出相较于标准基线和传统等变模型的显著优势。我们的代码和补充材料可在 https://github.com/tanfy929/SA-Conv 获取。
cs.CV / 115 / 2605.13746

Weakly-Supervised Spatiotemporal Anomaly Detection

弱监督时空异常检测
Gianchandani, Urvi, Tirupattur, Praveen, Shah, Mubarak
Abstract
In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
Chinese Translation
在本文中,我们探讨了一种弱监督的异常检测方法。由于视频标注耗时较长,我们在训练过程中仅使用弱视频级标签。这意味着给定一个视频,我们知道它要么是正常的,要么包含异常,但没有进一步的注释用于训练网络。我们从正常或异常的视频片段中提取特征。这些特征用于基于分类器和多实例排序损失(Multiple Instance Ranking Loss, MIL)的实现来确定片段时空区域的异常分数。我们将异常和正常的视频片段分别表示为正袋和负袋,以应用MIL。此外,由于异常通常局限于帧的一部分而非整个帧,我们选择探索时空异常检测。我们在UCF Crime2Local数据集上展示了我们的结果,该数据集包含UCF Crime数据集中部分的时空注释。
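The ranking objective described in this entry can be sketched in a few lines. The abstract does not state the exact loss, so the hinge form below (comparing the top-scoring instance of an anomalous bag with the top-scoring instance of a normal bag) is an assumption, and the function name, margin, and scores are hypothetical.

```python
def mil_ranking_loss(anomalous_scores, normal_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring instances of two bags.

    Assumption: each bag is a list of anomaly scores in [0, 1], one per
    spatiotemporal region of a video. Only video-level labels are needed,
    because the loss looks only at the maximum score of each bag.
    """
    top_anomalous = max(anomalous_scores)  # most suspicious region, positive bag
    top_normal = max(normal_scores)        # most suspicious region, negative bag
    return max(0.0, margin - top_anomalous + top_normal)


# Hypothetical region scores for one anomalous and one normal video.
positive_bag = [0.1, 0.9, 0.2]
negative_bag = [0.05, 0.3, 0.1]
loss = mil_ranking_loss(positive_bag, negative_bag)
```

The loss reaches zero once the highest score in the anomalous bag exceeds the highest score in the normal bag by at least the margin.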
cs.CV / 116 / 2605.13755

Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception

用于鲁棒自主驾驶感知的3D行人生成纹理多样化
Bhowmick, Arka, Ozeren, Enes, Abdullah, Ahmed, Wasenmuller, Oliver
Abstract
In recent years, autonomous driving has significantly increased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real-world datasets struggle to meet this demand as requirements continuously evolve and large-scale annotated data collection remains costly and time-consuming, making synthetic data a scalable, practical and controllable alternative. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This approach enables scalable appearance-level asset diversification without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary experiments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversification improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI enables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.
Chinese Translation
近年来,自主驾驶对高质量数据的需求显著增加,以训练用于安全关键场景的2D和3D感知模型。现实世界的数据集难以满足这一需求,因为要求不断演变,而大规模标注数据的收集既昂贵又耗时,这使得合成数据成为一种可扩展、实用且可控的替代方案。行人检测是自主驾驶中最为安全关键的任务之一。本文提出了一种简单而有效的方法,用于在合成场景生成中扩展3D行人资产的多样性。我们从单一的3D基础资产出发,通过使用StyleGAN2合成多样的面部纹理和身份级外观变化,并将其自动映射到3D网格上,从而生成多个不同的行人实例。这种方法使得在不需要为每个实例设计新几何体的情况下,实现可扩展的外观级资产多样化。利用这些资产,我们构建了合成数据集,并研究了真实数据与合成数据混合对基于RGB的物体检测的影响。通过互补实验,我们分析了点云感知中几何驱动的分布变化对3D物体检测的影响。我们的研究结果表明,受控的合成多样化提高了2D检测的鲁棒性,同时揭示了3D感知模型对几何领域差距的敏感性。总体而言,这项工作强调了生成性人工智能如何通过受控的面部纹理合成实现可扩展的、适合模拟的行人多样化,以及在自主驾驶流程中跨领域训练策略的优缺点。
cs.CV / 117 / 2605.13798

VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

VoxCor:无训练的体积特征用于多模态体素对应
Tombak, Guney, Erdil, Ertunc, Konukoglu, Ender
Abstract
Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit-transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR-CT and inter-subject HCP T2w-T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at https://github.com/guneytombak/VoxCor.
Chinese Translation
跨模态三维医学图像分析需要在成像对比、扫描仪和采集协议之间保持解剖一致性的体素级表示。近期研究表明,冻结的二维视觉变换器(ViT)基础模型可以支持这样的表示,但典型的处理流程沿单一解剖轴提取特征,并在注册求解器中针对一对图像调整这些特征,导致未利用互补的视角方向,并生成无法转移到新体积的表示。我们提出了VoxCor,一种无训练的拟合-变换方法,用于从冻结的二维ViT基础模型中获取可重用的体积特征表示。在离线拟合阶段,VoxCor结合了三平面ViT推理和一种紧凑的封闭形式加权偏最小二乘(WPLS)投影,利用拟合时的体素对应选择三平面特征空间中的模态稳定解剖方向。在变换时,新体积仅通过三平面ViT推理和线性投影进行映射,无需微调或注册。然后,可以通过最近邻搜索直接查询体素对应。我们在使用可变形注册的同一受试者腹部MR-CT和不同受试者HCP T2w-T1w任务上评估了VoxCor,采用体素级k最近邻分割和分割中心标志定位。VoxCor改善了最困难的跨受试者、跨模态转移设置,减少了密集对应转移的编码器敏感性,并在注册性能上与手工描述符和学习的三维特征相媲美。这使得VoxCor成为下游多模态分析中可重用的特征层,超越了成对注册。代码、配置文件和实现细节已在GitHub上公开,链接为 https://github.com/guneytombak/VoxCor。
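The nearest-neighbor correspondence query mentioned in this entry can be illustrated with a small sketch. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the function names, voxel indices, and feature vectors are hypothetical. This is not the VoxCor implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest_voxel(query_feature, target_features):
    """Return the target voxel index whose feature best matches the query.

    Assumption: target_features maps (x, y, z) indices to projected
    feature vectors computed independently for the target volume.
    """
    return max(
        target_features,
        key=lambda index: cosine_similarity(query_feature, target_features[index]),
    )


# Hypothetical 2-D features for two voxels of a target volume.
target = {(0, 0, 0): [1.0, 0.0], (1, 0, 0): [0.0, 1.0]}
match = nearest_voxel([0.9, 0.1], target)
```

A brute-force scan like this is only practical for small volumes; a spatial index or approximate search would be the usual choice at full resolution.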
cs.CV / 118 / 2605.13803

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

EvoGround:用于视频时间定位的自我演化视频代理
Jung, Minjoon, Zhang, Byoung-Tak, Torresani, Lorenzo
Abstract
Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and improve each other across iterations of this self-reinforcing reinforcement-learning loop. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.
Chinese Translation
视频时间定位(VTG)将未剪辑视频和自然语言查询作为输入,并定位与查询最匹配的时间点。现有方法依赖于大型、特定任务的数据集,这些数据集需要昂贵的人工标注。我们提出了EvoGround,一个由两个耦合的自我演化代理(提议者和求解者)组成的框架,能够从原始视频中学习时间定位,而无需任何人工标注数据。提议者从原始视频中生成查询-时刻对,而求解者学习如何进行定位,并反馈信号以改进提议者。通过这种自我强化的强化学习循环,这两个代理从相同的基础模型初始化,并在迭代中相互提升。在2500个未标记视频上训练后,EvoGround在多个VTG基准测试中与完全监督模型相匹配或超越,同时成为一种无需人工标签的最先进的细粒度视频字幕生成器。
cs.CV / 119 / 2605.13813

JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

JANUS:用于分布转变下稳健CT分诊的解剖条件门控
Dahal, Lavsen, Bhandari, Yubraj, Rubin, Geoffrey, Lo, Joseph Y.
Abstract
Automated CT triage requires models that are simultaneously accurate across diverse pathologies and reliable under institutional shift. While Vision Transformers provide strong visual representations, many clinically significant findings are defined by quantitative imaging biomarkers rather than appearance alone. We introduce JANUS, a physiology-guided dual-stream architecture that conditions visual embeddings on macro-radiomic priors via Anatomically Guided Gating. On the MERLIN test set (N=5082), JANUS attains macro-AUROC 0.88 and AUPRC 0.74, outperforming all reproduced baselines. It generalizes to an external dataset (N=2000; AUROC 0.87), with the largest gains on findings defined by size and attenuation as well as improved calibration on both datasets. We further quantify prediction suppression using the Physiological Veto Rate (PVR), showing that under domain shift JANUS reduces high-confidence false positives substantially more often than true positives. Together, these results are consistent with physically grounded conditioning that improves both discrimination and reliability in CT triage. Code is publicly available on GitHub at https://github.com/lavsendahal/janus and model weights are at https://huggingface.co/lavsendahal/janus.
Chinese Translation
自动化CT分诊需要在多种病理情况下都能保持准确性的模型,并在机构转变下保持可靠性。尽管视觉变换器(Vision Transformers)提供了强大的视觉表征,但许多临床显著发现是通过定量影像生物标志物而非仅仅依赖外观来定义的。我们提出了JANUS,一种生理引导的双流架构,通过解剖引导门控(Anatomically Guided Gating)将视觉嵌入条件化于宏观放射组学先验上。在MERLIN测试集(N=5082)上,JANUS达到了宏观AUROC 0.88和AUPRC 0.74,超越了所有复现的基线。它在外部数据集(N=2000)上也表现出良好的泛化能力(AUROC 0.87),在由大小和衰减定义的发现上获得了最大的提升,并在两个数据集上改善了校准。我们进一步使用生理否决率(Physiological Veto Rate, PVR)量化预测抑制,显示在领域转变下,JANUS显著更频繁地减少高置信度的假阳性,而非真阳性。综合来看,这些结果与基于物理的条件相一致,改善了CT分诊中的辨别能力和可靠性。代码已在github仓库https://github.com/lavsendahal/janus上公开,模型权重可在https://huggingface.co/lavsendahal/janus获取。
cs.CV / 120 / 2605.13815

OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

OmniLiDAR:一种用于多领域3D LiDAR生成的统一扩散框架
Liu, Youquan, Yang, Weidong, Liang, Ao, Xu, Xiang, Kong, Lingdong, Wu, Yang, Zhu, Dekai, Li, Xin, Chen, Runnan, Fei, Ben, Liu, Tongliang, Ouyang, Wanli
Abstract
LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.
Chinese Translation
LiDAR场景生成在可扩展模拟和合成数据创建中变得越来越重要,特别是在捕获成本高昂的多样化传感条件下。通常,基于扩散的LiDAR生成器是在单一领域环境下开发的,这需要为不同的数据集或传感条件分别建立模型,从而阻碍了在异构分布变化下的统一和可控合成。为此,我们提出了OmniLiDAR,这是一种统一的文本条件扩散框架,能够在八个代表性领域中生成共享的范围图像表示,涵盖三种变化类型:恶劣天气、传感器配置变化(例如,减少光束)和跨平台采集(车辆、无人机和四足机器人)。为了在异构领域上训练单一模型而不通过领域隔离优化,我们引入了一种跨领域训练策略(Cross-Domain Training Strategy, CDTS),该策略在每个小批量中混合领域并利用条件引导生成。我们进一步提出了跨领域特征建模(Cross-Domain Feature Modeling, CDFM),该方法捕捉方位和高度轴上的方向依赖性,以反映范围图像的各向异性扫描结构,并提出了领域自适应特征缩放(Domain-Adaptive Feature Scaling, DAFS),作为一种轻量调制,以考虑去噪过程中结构化的领域相关特征变化。在缺乏公开的综合基准的情况下,我们通过将真实世界的扫描与基于物理的天气模拟和系统性光束减少相结合,构建了一个8领域数据集,同时遵循官方划分。大量实验表明,生成的保真度强,并且在下游应用中(包括LiDAR语义分割和3D目标检测的生成数据增强)具有一致的收益,以及在损坏情况下的鲁棒性评估,在有限标签环境中也表现出一致的优势。
cs.CV / 121 / 2605.13831

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

有效训练长上下文视觉语言模型,实现超越128K上下文的泛化
Wang, Zhaowei, Luo, Lishu, Duan, Haodong, Liu, Weiwei, Wu, Sijin, Luo, Ji, Yan, Shen, Peng, Shuai, Yuan, Sihang, Huang, Chaoyi, Lin, Yi, Song, Yangqiu
Abstract
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
Chinese Translation
长上下文建模正成为现代大型视觉语言模型(LVLMs)的核心能力,使得在长文档理解、视频分析和多轮工具使用的代理工作流中能够持续管理上下文。然而,实际的训练方案仍然探索不足,特别是在设计和平衡长上下文数据混合方面。在本研究中,我们对LVLMs的长上下文持续预训练进行了系统研究,将一个7B模型的上下文从32K扩展到128K,并对长文档数据进行了广泛的消融实验。我们首先展示了长文档视觉问答(VQA)比光学字符识别(OCR)转录更为有效。基于这一观察,我们的消融实验进一步得出了三个关键发现:i)在序列长度分布方面,平衡数据优于以目标长度为中心的数据(例如128K),这表明长上下文能力需要在不同长度和位置之间进行可泛化的关键信息检索;ii)检索仍然是主要瓶颈,倾向于使用重检索的混合数据与适度推理数据以实现任务多样性;iii)纯长文档VQA在很大程度上保留了短上下文能力,表明指令格式的长数据减少了对短数据混合的需求。基于这些发现,我们引入了MMProLong,通过仅使用5B标记预算从Qwen2.5-VL-7B进行长上下文持续预训练。MMProLong将长文档VQA得分提高了7.1%,并在超出其128K训练窗口的256K和512K上下文中保持强劲表现,无需额外训练。它还在没有任务特定监督的情况下,推广到基于网页的多模态针检索、长上下文视觉文本压缩和长视频理解。总体而言,我们的研究建立了一个实用的LongPT方案,并为推进长上下文视觉语言模型提供了经验基础。
cs.CV / 122 / 2605.13835

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

解锁基于CLIP的类增量学习中的补丁级特征
Sun, Hao, Ding, Zi-Jun, Zhou, Da-Wei
Abstract
Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
Chinese Translation
类增量学习(CIL)使模型能够持续整合新知识,同时减轻灾难性遗忘。受益于CLIP的卓越泛化能力,利用预训练的视觉-语言模型已成为CIL中的主流范式。然而,目前的研究主要集中在将全局图像嵌入(即[CLS]标记)与其对应的文本提示(即[EOS]标记)对齐。尽管它们表现良好,但我们发现这些方法忽视了CLIP编码器中固有的丰富补丁级语义信息。例如,在识别兔子时,局部补丁可能编码其独特特征,如长耳朵和蓬松的尾巴,这些特征可以为识别提供补充证据。基于上述观察,我们提出了SPA(语义引导的补丁级对齐)用于基于CLIP的CIL,旨在唤醒CLIP中长期被忽视的局部表示。具体而言,对于每个类别,我们首先构建具有代表性和多样性的视觉样本,并将其输入GPT-5作为视觉指导,以生成类别特定的语义描述。这些描述用于指导选择具有区分性的补丁级视觉特征。在这些选定的补丁基础上,我们进一步采用最优传输将选定的补丁标记与来自类别描述的语义标记对齐,从而产生结构化的跨模态对齐,提升识别效果。此外,我们引入任务特定的投影器,以有效适应下游增量任务,并从存储的类别高斯统计中采样伪特征,以校准旧类别表示,从而减轻灾难性遗忘。大量实验表明,SPA实现了最先进的性能。
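The optimal-transport alignment step in this entry can be sketched with entropic regularization. The abstract does not specify the solver, cost function, or marginals, so the Sinkhorn iteration, uniform marginals, and toy cost matrix below are all assumptions, and the function name is hypothetical.

```python
import math


def sinkhorn_plan(cost, epsilon=0.5, num_iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    Assumption: cost[i][j] is a dissimilarity (for example 1 - cosine
    similarity) between patch token i and semantic token j.
    """
    n, m = len(cost), len(cost[0])
    row_mass = [1.0 / n] * n  # each patch token carries equal mass
    col_mass = [1.0 / m] * m  # each semantic token receives equal mass
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs: 3 selected patch tokens vs. 2 description tokens.
cost = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
plan = sinkhorn_plan(cost)
```

Each entry of the resulting plan says how much of a patch token's mass is matched to a semantic token, so low-cost pairs receive most of the mass while every token still participates.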
cs.CV / 123 / 2605.13838

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

R-DMesh:通过校正动态网格流进行视频引导的3D动画
Wu, Zijie, Xu, Lixin, Jiang, Puhua, Liu, Sicong, Guo, Chunchao, Bai, Xiang
Abstract
Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.
Chinese Translation
视频引导的3D动画在内容创作中具有巨大的潜力,提供了对动态资产的直观和精确控制。然而,实际应用面临一个关键但常被忽视的障碍:姿态不对齐问题。在现实场景中,用户提供的静态网格的初始姿态很少与参考视频的起始帧对齐。简单地强迫网格遵循不匹配的轨迹不可避免地会导致严重的几何失真或动画失败。为了解决这个问题,我们提出了校正动态网格(R-DMesh),这是一个统一框架,旨在生成与视频上下文“校正”的高保真4D网格。与标准运动转移方法不同,我们的方法引入了一种新颖的变分自编码器(VAE),该编码器明确地将输入解耦为条件基础网格、相对运动轨迹和一个关键的校正跳跃偏移。这个偏移是通过学习自动将输入网格的任意姿态转换为与视频的初始状态相匹配,在动画开始之前。我们通过三流注意力机制处理这些组件,该机制利用顶点级几何特征来调节三个正交流,确保在校正和动画过程中保持物理一致性和局部刚性。为了生成,我们采用基于校正流的扩散变换器,该变换器以预训练视频潜变量为条件,有效地将丰富的时空先验转移到3D领域。为支持这一任务,我们构建了Video-RDMesh,这是一个超过50万动态网格序列的大规模数据集,专门策划用于模拟姿态不对齐。大量实验表明,R-DMesh不仅解决了对齐问题,还支持稳健的下游应用,包括姿态重定向和整体4D生成。
人工智能 (Artificial Intelligence)
59
cs.AI / 1 / 2605.12620

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

三思而后行:面向具身智能体的验证者引导行动选择
Singhi, Nishad, Bialas, Christian, Jauhri, Snehal, Prasad, Vignesh, Chalvatzaki, Georgia, Rohrbach, Marcus, Rohrbach, Anna
Abstract
Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
Chinese Translation
构建能够解决复杂现实任务的通用具身智能体仍然是人工智能领域的一项基本挑战。多模态大型语言模型(MLLMs)通过强大的视觉-语言知识和思维链(CoT)推理显著提升了此类智能体的推理能力,但在面对具有挑战性的分布外场景时仍然显得脆弱。为了解决这一问题,我们提出了验证者引导行动选择(VeGAS),这是一种测试时框架,旨在通过明确的验证步骤提高基于MLLM的具身智能体的鲁棒性。在推理时,VeGAS并不直接选择单一解码动作,而是对一组候选动作进行采样,并使用生成式验证器来识别最可靠的选择,而无需修改基础策略。关键的是,我们发现将现成的MLLM用作验证器并未带来改进,这促使我们采用基于LLM的数据合成策略,自动构建多样化的失败案例课程,以在训练时使验证器接触到丰富的潜在错误分布。在涵盖Habitat和ALFRED环境的具身推理基准测试中,VeGAS始终提高了泛化能力,在最具挑战性的多对象、长时间跨度任务上,相较于强大的CoT基线,相对性能提升高达36%。
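The test-time selection step described in this entry amounts to sampling several candidate actions and keeping the one a verifier scores highest. The sketch below uses a toy policy and a lookup-table verifier as stand-ins; both are hypothetical, as are the function names, action strings, and scores. It is not the authors' implementation.

```python
import random


def select_action(observation, sample_action, verifier_score, num_candidates=5, seed=0):
    """Sample candidate actions and return the one the verifier prefers.

    The policy (sample_action) is left unchanged; only the choice among
    its samples is delegated to the verifier.
    """
    rng = random.Random(seed)
    candidates = [sample_action(observation, rng) for _ in range(num_candidates)]
    best = max(candidates, key=lambda action: verifier_score(observation, action))
    return best, candidates


def toy_policy(observation, rng):
    # Stand-in for sampling one action from an MLLM policy.
    return rng.choice(["open_drawer", "pick_mug", "go_to_sink"])


def toy_verifier(observation, action):
    # Stand-in for a trained generative verifier; scores are made up.
    scores = {"open_drawer": 0.2, "pick_mug": 0.9, "go_to_sink": 0.4}
    return scores[action]


best, candidates = select_action("find and rinse the mug", toy_policy, toy_verifier)
```

The selected action can only be as good as the best sampled candidate, which is why the quality of the verifier and the number of samples both matter in this kind of scheme.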
cs.AI / 2 / 2605.12655

Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

基于宏动作的多智能体指令遵循通过价值取消
Lin, Wo Wei, Rathbun, Ethan, Tan, Enrico Marchesini Xiang Zhi
Abstract
Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
Chinese Translation
在现实世界的应用中,多智能体强化学习(MARL)可能需要适应外部自然语言指令,这些指令会中断正在进行的行为并与长期目标发生冲突。然而,将奖励与指令相结合会引入一种根本性的失败模式,因为贝尔曼更新将价值估计与指令上下文耦合,导致在指令中断宏动作时出现不一致的价值。我们提出了指令合规的宏动作价值修正(MAVIC),该方法通过修正传入的指令目标并恢复在当前目标下的延续价值,来修正指令边界处的贝尔曼备份。与奖励塑形不同,MAVIC 修改了自举目标本身,使得在统一策略下能够在随机指令切换中实现一致的价值估计。我们提供了理论分析和一个演员-评论家实现,并展示了MAVIC在日益复杂的合作多智能体环境中实现高指令遵循的同时保持基本任务性能。
cs.AI / 3 / 2605.12673

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

安卓人会梦想打破游戏吗?使用 BenchJack 系统化审计 AI 代理基准
Wang, Hao, Li, Hanchen, Mang, Qiuyang, Cheung, Alvin, Sen, Koushik, Song, Dawn
Abstract
Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.
Chinese Translation
代理基准已成为前沿 AI 能力的事实标准,指导模型选择、投资和部署。然而,奖励黑客现象,即代理在未执行预期任务的情况下最大化得分,往往在前沿模型中自发出现,而不需要过拟合。我们认为基准必须从设计上确保安全。通过过去的奖励黑客事件,我们推导出八种重复出现的缺陷模式,并将其汇编成代理评估清单(Agent-Eval Checklist),供基准设计者参考。我们将这些见解浓缩为 BenchJack,这是一个自动化的红队系统,推动编码代理审计基准,并以预见的方式识别可能的奖励黑客漏洞。此外,我们将 BenchJack 扩展为一个迭代生成对抗管道,发现新缺陷并迭代修补,以提高基准的鲁棒性。我们将 BenchJack 应用于 10 个流行的代理基准,涵盖软件工程、网页导航、桌面计算和终端操作。BenchJack 合成的奖励黑客漏洞在大多数基准上实现了近乎完美的得分,而未解决任何单一任务,识别出八类中的 219 种不同缺陷。此外,BenchJack 的扩展管道将四个基准的可黑客任务比例从接近 100% 降低到 10% 以下,且没有致命设计缺陷,在三次迭代内完全修补了 WebArena 和 OSWorld。我们的结果表明,评估管道尚未内化对抗思维,主动审计可以帮助缩小快速发展的基准领域的安全差距。
cs.AI / 4 / 2605.12674

Revealing Interpretable Failure Modes of VLMs

揭示视觉语言模型的可解释失败模式
Chaudhary, Isha, Jain, Vedaant V, Sachdeva, Kavya, Ranu, Sayan, Singh, Gagandeep
Abstract
Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts (such as pedestrian proximity or adverse weather conditions) under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.
Chinese Translation
视觉语言模型(VLMs)因其广泛的推理能力和在最小任务特定工程下的泛化能力,越来越多地应用于安全关键的场景。尽管具有这些优势,它们在特定的现实世界情境中可能会表现出灾难性的失败,构成了失败模式。我们提出了REVELIO,一个系统性揭示VLMs可解释失败模式的框架。我们将失败模式定义为一组可解释的、与领域相关的概念的组合,例如行人接近或恶劣天气条件,在这些条件下,目标VLM会持续表现不正确的行为。识别此类失败需要在一个指数级大的离散组合空间中进行搜索。为了解决这个挑战,REVELIO结合了两种搜索程序:一种关注多样性的束搜索(diversity-aware beam search),有效地映射失败景观;以及一种高斯过程汤普森采样(Gaussian-process Thompson Sampling)策略,能够更广泛地探索复杂的失败模式。我们将REVELIO应用于自动驾驶和室内机器人领域,揭示了在最先进的VLMs中以前未报告的脆弱性。在驾驶环境中,这些模型通常表现出较弱的空间定位能力,并未考虑主要障碍物,导致建议可能导致模拟碰撞。在室内机器人任务中,VLMs要么忽视安全隐患,要么表现得过于保守,产生误报并降低操作效率。通过识别结构化和可解释的失败模式,REVELIO提供了可操作的见解,支持针对性的VLM安全改进。
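The search over concept combinations in this entry can be illustrated with a small beam search. The abstract does not say how diversity is enforced, so the Jaccard-overlap penalty below is an assumption, and the concept names, failure-rate function, and parameters are hypothetical. The Gaussian-process Thompson Sampling component is not sketched here.

```python
def jaccard(a, b):
    """Overlap between two non-empty concept sets."""
    return len(a & b) / len(a | b)


def diverse_beam_search(concepts, failure_rate, beam_width=2, max_size=2, penalty=0.5):
    """Grow concept sets one concept at a time, keeping a diverse beam.

    Assumption: failure_rate(s) returns the observed failure rate of the
    target model on scenes containing every concept in the set s.
    """
    beam = [frozenset()]
    evaluated = {}
    for _ in range(max_size):
        candidates = {s | {c} for s in beam for c in concepts if c not in s}
        scored = {s: failure_rate(s) for s in candidates}
        evaluated.update(scored)
        beam = []
        remaining = sorted(candidates, key=sorted)  # deterministic order
        while remaining and len(beam) < beam_width:
            def adjusted(s):
                # Penalize sets that overlap with ones already in the beam.
                overlap = max((jaccard(s, t) for t in beam), default=0.0)
                return scored[s] - penalty * overlap
            pick = max(remaining, key=adjusted)
            beam.append(pick)
            remaining.remove(pick)
    return sorted(evaluated.items(), key=lambda item: -item[1])


def toy_failure_rate(s):
    # Made-up model behavior: it fails mostly at night near pedestrians.
    rate = 0.1 * len(s)
    if {"night", "pedestrian_near"} <= s:
        rate += 0.6
    return rate


results = diverse_beam_search(["night", "rain", "pedestrian_near"], toy_failure_rate)
```

The top-ranked entry of `results` is the concept combination with the highest failure rate among those the search evaluated, which is the kind of interpretable failure mode the abstract describes.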
cs.AI / 5 / 2605.12682

Learning Transferable Latent User Preferences for Human-Aligned Decision Making

学习可转移的潜在用户偏好以实现人类对齐的决策制定
Hyk, Alina, Saisubramanian, Sandhya
Abstract
Large language models (LLMs) are increasingly used as reasoning modules in many applications. While they are efficient in certain tasks, LLMs often struggle to produce human-aligned solutions. Human-aligned decision making requires accounting for both explicitly stated goals and latent user preferences that shape how ambiguous situations should be resolved. Existing approaches to incorporating such preferences either rely on extensive and repeated user interactions or fail to generalize latent preferences across tasks and contexts, limiting their practical applicability. We consider a setting in which an LLM is used for high-level reasoning and is responsible for inferring latent user preferences from limited interactions, which guides downstream decision making. We introduce CLIPR (Conversational Learning for Inferring Preferences and Reasoning), a framework that learns actionable, transferable natural language rules that represent latent user preferences from minimal conversational input. These rules are iteratively refined through adaptive feedback and applied to both in-distribution and out-of-distribution ambiguous tasks across multiple environments. Evaluations on three datasets and a user study show that CLIPR consistently outperforms existing methods in improving alignment and reducing inference costs.
Chinese Translation
大型语言模型(LLMs)在许多应用中越来越多地被用作推理模块。尽管它们在某些任务中表现高效,但LLMs往往难以产生与人类对齐的解决方案。人类对齐的决策制定需要考虑明确陈述的目标和潜在用户偏好,这些偏好影响如何解决模糊情况。现有的方法在融入这些偏好时,要么依赖于大量且重复的用户交互,要么无法在不同任务和上下文中推广潜在偏好,从而限制了其实际应用性。我们考虑一种情境,其中LLM用于高层次推理,并负责从有限的交互中推断潜在用户偏好,以指导下游决策制定。我们提出了CLIPR(用于推断偏好和推理的对话学习)框架,该框架从最少的对话输入中学习可操作的、可转移的自然语言规则,以表示潜在用户偏好。这些规则通过自适应反馈进行迭代优化,并应用于多个环境中的分布内和分布外模糊任务。对三个数据集和一项用户研究的评估表明,CLIPR在提高对齐性和降低推理成本方面始终优于现有方法。
cs.AI / 6 / 2605.12691

On the Size Complexity and Decidability of First-Order Progression

一阶进展的大小复杂性与可判定性
Classen, Jens, Liu, Daxin
Abstract
Progression, the task of updating a knowledge base to reflect action effects, generally requires second-order logic. Identifying first-order special cases, by restricting either the knowledge base or action effects, has long been a central topic in reasoning about actions. It is known that local-effect, normal, and acyclic actions, three increasingly expressive classes, admit first-order progression. However, a systematic analysis of the size of such progressions, crucial for practical applications, has been missing. In this paper, using the framework of Situation Calculus, we show that under reasonable assumptions, first-order progression for these action classes grows only polynomially. Moreover, we show that when the KB belongs to decidable fragments such as two-variable first-order logic or universal theories with constants, the progression remains within the same fragment, ensuring decidability and practical applicability.
Chinese Translation
进展是更新知识库以反映行动效果的任务,通常需要二阶逻辑。通过限制知识库或行动效果来识别一阶特例,长期以来一直是推理行动的核心主题。已知局部效果、正常和无环行动这三类逐渐增强的类别允许一阶进展。然而,关于此类进展大小的系统分析,尤其对于实际应用至关重要,尚缺乏相关研究。本文利用情境演算(Situation Calculus)的框架,展示在合理假设下,这些行动类别的一阶进展仅以多项式增长。此外,我们还表明,当知识库属于可判定片段,如二变量一阶逻辑或带常量的全称理论时,进展仍然保持在同一片段内,从而确保可判定性和实际应用性。
cs.AI / 7 / 2605.12702

DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

DisaBench:一种参与性评估框架,用于评估语言模型中的残疾伤害
Kim, Eugenia, Tanase, Ioana, Mallon, Christina
Abstract
General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.
Chinese Translation
通用的大型语言模型安全基准未能充分评估与残疾相关的伤害。我们介绍了DisaBench:一个与残疾人士和红队专家共同创建的十二类残疾伤害分类法,一个基于分类法的评估方法论,该方法论将无害和对抗性提示配对于七个生活领域,以及一个包含175个提示和525个提示-响应对的人类标注标签的数据集。四位具有生活残疾经验的评估者的标注揭示了三个发现:伤害率因残疾类型而异,并且在非文本模式中会加重;术语驱动的伤害是文化和时间上有界的,而不是普遍可评估的;标准安全评估能够捕捉明显的失败,但错过了只有领域专业知识才能识别的微妙伤害。残疾伤害同时是个人的、交叉的和社区定义的:它无法与一个人身份的完整背景分离,而通用基准系统性地忽视了这一点。我们将通过Hugging Face发布数据集、分类法和方法论,并提供一个开源红队框架,以便直接集成到现有的安全管道中,而无需额外的基础设施。
cs.AI / 8 / 2605.12718

CHAL: Council of Hierarchical Agentic Language

CHAL:分层代理语言委员会
Giovannelli, Tommaso, Kent, Griffin D.
Abstract
Multi-agent debate has emerged as a promising approach for improving LLM reasoning on ground-truth tasks, yet current methodologies face certain structural limitations: debate tends to induce a martingale over belief trajectories, majority voting accounts for most observed gains, and LLMs exhibit confidence escalation rather than calibration across rounds. We argue that the genuine value of debate, and dialectic systems as a whole, lies not in ground-truth tasks but in defeasible domains, where every position can in principle be defeated by better reasoning. We present the Council of Hierarchical Agentic Language (CHAL), a multi-agent dialectic framework that treats defeasible argumentation as an engine for belief optimization. Each agent maintains a CHAL Belief Schema (CBS), a graph-structured belief representation with a Bayesian-inspired architecture, that facilitates belief revision through a gradient-informed dynamic mechanism by leveraging the strength of the belief's thesis as a differentiable objective. Meta-cognitive value systems spanning epistemology, logic, and ethics are elevated to configurable hyperparameters governing agent reasoning and adjudication outcomes. We provide a series of ablation experiments that demonstrate systematic and interpretable effects: the adjudicator's value system determines the debate's overall trajectories in latent belief space, council diversity refines beliefs for all participants, and the framework generalizes across broad fields. CHAL is, to our knowledge, the first framework to treat multi-agent debate as structured belief optimization over defeasible domains. Further, the auditable belief artifacts it produces establish the foundation for dedicated evaluation suites for defeasible argumentation, with broader implications for building AI systems whose reasoning and value commitments are transparent, aligned, and subject to human oversight.
Chinese Translation
多代理辩论已成为改善大型语言模型(LLM)在真实任务上推理的有前景的方法,然而当前的方法面临某些结构性限制:辩论往往会在信念轨迹上引发鞅(martingale),多数投票(majority voting)解释了大多数观察到的收益,而LLM在多个回合中表现出信心升级而非校准。我们认为,辩论及其辩证系统的真正价值不在于真实任务,而在于可反驳领域,在这些领域中,每个立场原则上都可以被更好的推理所击败。我们提出了分层代理语言委员会(Council of Hierarchical Agentic Language, CHAL),这是一个将可反驳论证视为信念优化引擎的多代理辩证框架。每个代理维护一个CHAL信念框架(CHAL Belief Schema, CBS),这是一个图结构的信念表示,采用贝叶斯启发的架构,通过利用信念论题的强度作为可微分目标,促进信念的修正,采用基于梯度的信息动态机制。跨越认识论、逻辑和伦理的元认知价值系统被提升为可配置的超参数,以控制代理的推理和裁决结果。我们提供了一系列消融实验,展示了系统性和可解释的效果:裁决者的价值系统决定了辩论在潜在信念空间中的整体轨迹,委员会的多样性为所有参与者精炼信念,且该框架在广泛领域中具有普适性。据我们所知,CHAL是第一个将多代理辩论视为可反驳领域中的结构化信念优化的框架。此外,它所产生的可审计信念产物为可反驳论证的专门评估套件奠定了基础,对构建推理和价值承诺透明、对齐并受人类监督的人工智能系统具有更广泛的影响。
cs.AI / 9 / 2605.12730

BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

BEHAVE:一种用于实时建模集体人类动态的混合人工智能框架
Malyutina, Helene
Abstract
Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.
Chinese Translation
现有的用于建模人类行为的人工智能系统主要在个体层面运作,或在事件发生后进行检测。因此,它们系统性地无法捕捉决定一个群体是否保持稳定或转向升级或崩溃的集体动态。我们提出了一个不同的基础:一组相互作用的人类构成了一个复杂的动态系统,从精确的数学意义上讲,表现出涌现性、非线性、反馈循环、在临界点附近的敏感性以及在定性不同的状态之间的相变。这样的系统状态并不位于任何单一参与者之内;它分布在相互影响的循环中,并通过身体的微观动态进行观察。我们引入了BEHAVE(行为引擎用于人类活动向量估计),这是一个将集体动态建模为定义在由可观察的物理信号导出的交互空间上的连续行为场的正式框架。运动微信号(位置、速度、身体方向、手势活动)被结构化为一个有向交互图,并聚合成一个行为场的基础,捕捉集体状态的不同且不冗余的轴。该框架基于一个定理和两个结构命题,描述了张力场、场基础和临界指数。感知和预测层通过神经模型实现,使得基于数据的学习和系统动态的近似成为可能。BEHAVE被构建为一个计算系统,用于从数据中学习、表示和预测集体动态。在一个包含7个代理的谈判快照上演示了一个工作流程。经过重新校准的相同场应用于人群安全、危机团队动态、教育和临床背景。
cs.AI / 10 / 2605.12755

State-Centric Decision Process

以状态为中心的决策过程
Jeong, Sungheon, Masukawa, Ryozo, Yun, Sanggeon, Imani, Mahdi, Imani, Mohsen
Abstract
Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires. No explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.
Chinese Translation
语言环境如网页浏览器、代码终端和互动模拟发出原始文本而非状态,并且提供的运行时结构并不符合马尔可夫决策过程(MDP)分析的要求。没有明确的状态空间,没有观察到状态的映射,没有经过认证的转移,也没有终止标准。我们引入了以状态为中心的决策过程(State-Centric Decision Process, SDP),这是一个运行时框架,通过让智能体在行动时逐个谓词地构建这些缺失的输入。在每一步中,智能体承诺一个自然语言谓词,描述世界应该是什么样子,采取行动使其成为真实,并将观察结果与之进行对比。通过的谓词成为经过认证的状态,结果轨迹携带了语言环境未提供的四个对象,即任务驱动的状态空间、观察到状态的映射、经过认证的转移和终止标准。我们在五个基准测试上评估了SDP,涵盖规划、科学探索、网页推理和多跳问答。SDP在所有五个基准测试上都实现了最佳的无训练结果,随着时间范围的扩大,优势进一步加大。经过认证的轨迹还支持对反应型智能体无法进行的分析,包括每个谓词的信用分配、故障定位、部分进展测量和模块化操作符替换。
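The commit, act, and check loop described in the State-Centric Decision Process abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` record, the `run_sdp` function, and the toy environment are hypothetical names, and predicate checking is reduced to a plain Python callable instead of a language-model judgment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    predicate: str                  # natural-language description of the target state
    action: str                     # action intended to make the predicate true
    check: Callable[[str], bool]    # hypothetical verifier: observation -> pass/fail


def run_sdp(steps: List[Step], env: Callable[[str], str]) -> Tuple[List[str], bool]:
    """Run steps in order; return the certified predicates and whether all passed."""
    certified: List[str] = []
    for step in steps:
        observation = env(step.action)        # act
        if not step.check(observation):       # check the observation against the predicate
            return certified, False           # failure is localized to this predicate
        certified.append(step.predicate)      # a passing predicate becomes a certified state
    return certified, True                    # termination: every predicate was certified


def toy_env(action: str) -> str:
    """Toy text environment (assumption): returns a fixed response per action."""
    responses = {
        "open login page": "Login form shown",
        "submit form": "Welcome, user",
    }
    return responses.get(action, "Error: unknown action")


plan = [
    Step("login form is visible", "open login page", lambda obs: "Login form" in obs),
    Step("user is signed in", "submit form", lambda obs: "Welcome" in obs),
]
certified, done = run_sdp(plan, toy_env)
print(certified, done)
```

Because each predicate is checked separately, the length of `certified` doubles as a rough partial-progress measure, which loosely mirrors the per-predicate analyses the abstract mentions.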
cs.AI / 11 / 2605.12835

PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models

PROMETHEUS:自动化深度因果研究,整合文本、数据和模型
Mahadevan, Sridhar
Abstract
Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf-like families of local causal predictive-state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature-atlas case studies -- ocean-temperature impacts on marine populations, GLP-1 weight-loss evidence, and resveratrol/red-wine health-benefit claims -- illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded-counterfactual case studies -- a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC-derived figure data and model code, the canonical Sachs protein-signaling study with single-cell perturbation data, and a Nature singing-mouse study with MAPseq projection matrices -- show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the
Chinese Translation
大型语言模型可以从文本中提取局部因果声明,但当这些声明被组织成持久的、可导航的世界模型时,其价值更大,而不是作为平面摘要。我们介绍了PROMETHEUS,这是一个框架,将检索到的文献、文件、评论、报告、代理跟踪、源数据、代码、模拟和科学模型转化为因果地图:在研究基底的显式覆盖上形成的局部因果预测状态模型的层叠家族。每个局部区域包含因果事件、结构化声明表、预测测试、支持统计和来源;限制映射比较重叠区域;粘合诊断揭示一致性、漂移、矛盾和不确定性。最终得到的拓扑世界模型并不是一个单一的通用图。它是一个研究工具,用于导航文献所述内容、所述位置、支持强度,以及局部声明未能组装成一致全球视图的地方。三个文献地图案例研究——海洋温度对海洋种群的影响、GLP-1减重证据和白藜芦醇/红酒健康益处声明——展示了来自文本的深度因果研究,具有明确的局部性、证据、持久状态和粘合张力。四个基于实证的反事实案例研究——一篇关于微塑料强迫的《自然气候变化》论文、一篇关于印度河流域水文学的论文(包含VIC派生的图表数据和模型代码)、经典的Sachs蛋白信号研究(包含单细胞干扰数据)以及一项关于唱歌老鼠的《自然》研究(包含MAPseq投影矩阵)——展示了一种更强的模式:当一篇论文发布源数据、模拟输出或代码时,PROMETHEUS可以针对该科学基底评估反事实,然后围绕其重建层叠世界模型。
cs.AI / 12 / 2605.12838

Multimodal Hidden Markov Models for Persistent Emotional State Tracking

用于持久情感状态跟踪的多模态隐马尔可夫模型
Ragu, Anamika, Jonelagadda, Aneesh
Abstract
Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.
Chinese Translation
通过整体处理单个话语的情感来跟踪对话的可解释情感弧,是理解和引导应用中,尤其是临床对话环境中沟通的核心。现有的情感识别方法在话语层面上运作,掩盖了真实对话动态特征的持久阶段。我们提出了一种轻量级框架,将对话情感建模为潜在情感状态的序列,使用基于多模态愉悦-唤醒(valence-arousal)表示的粘性因子HDP-HMM(Hierarchical Dirichlet Process Hidden Markov Model),这些表示来源于视频、音频和文本输入的同步数据。我们使用LLM-as-a-Judge、几何和时间一致性指标评估状态预测的质量,证明粘性HDP-HMM在计算成本仅为基于LLM的对话状态跟踪方法的一小部分的情况下,产生了比基线高斯HMM更可解释的状态序列。此外,在临床数据集中的问答实验表明,可以可靠地从多模态愉悦-唤醒轨迹中恢复有意义的情感阶段,并通过上下文增强来提高LLM在不稳定情感状态下的响应质量。因此,该框架为大规模可解释、轻量和可操作的对话情感动态分析开辟了一条新路径。
cs.AI / 13 / 2605.12856

Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

Moltbook 监督:通过多轮对话揭示隐藏意图
Al-Lawati, Ali, Tripto, Nafis, Ansari, Abolfazl, Lucas, Jason, Wang, Suhang, Lee, Dongwon
Abstract
The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce Bot-Mod (Bot-Moderation), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. Bot-Mod identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that Bot-Mod reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.
Chinese Translation
多智能体系统的出现带来了超越内容过滤的新监督挑战。具有恶意意图的智能体可能会贡献看似无害的有害内容,以规避基于内容的监督,同时通过其在社区内的整体互动模式表现出的剥削性和恶意行为来危害系统。为了解决这个问题,我们提出了 Bot-Mod(Bot-Moderation),一个基于智能体意图而非传统内容级信号的监督框架。该方法通过与目标智能体进行多轮交流,利用基于 Gibbs 采样的候选意图假设,识别潜在意图。这逐步缩小了合理智能体目标的范围,以识别其潜在行为。为了评估我们的方法,我们构建了一个来自 Moltbook 的数据集,涵盖了基于实际社区结构、帖子和评论的多样化良性和恶性行为。结果表明,Bot-Mod 在多种对抗配置中可靠地识别智能体意图,同时在良性行为上保持低误报率。这项工作为开放多智能体环境中可扩展的、意图感知的智能体监督奠定了基础。
cs.AI / 14 / 2605.12894

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

超越合作模拟器:生成真实用户角色以增强大型语言模型代理的评估可靠性
Chopra, Harshita, Ghate, Kshitish, Caliskan, Aylin, Kohno, Tadayoshi, Shah, Chirag, Jaques, Natasha
Abstract
Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across tau^2-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.
Chinese Translation
大型语言模型(LLM)代理越来越多地被部署在与各种人群互动的环境中,包括那些不明确、缺乏耐心或不愿分享信息的用户。然而,大规模收集真实交互数据仍然是昂贵的。该领域转向基于LLM的用户模拟器作为替代,但这些模拟器继承了其底层模型的行为特征:合作性和同质性。因此,在模拟中表现强大的代理在真实用户的未知多样化沟通模式下往往失败。为缩小这一差距,我们引入了角色策略(Persona Policies, PPol),这是一种即插即用的控制层,能够在用户模拟器中引入真实的行为变化,同时保持原始任务目标。我们并非手动构建角色,而是将角色生成视为一个由LLM驱动的进化程序搜索,优化一个Python生成器以发现行为并将其转化为保持任务的角色扮演策略。候选生成器由一个多目标适应度评分引导,该评分结合了人类相似性与广泛覆盖的人类行为模式。一旦优化,生成器就能为该领域的任何任务生成多样化的人类角色。在tau^2-bench零售和航空领域,进化的PPol程序在适应度评分上相较于基线模拟器获得了33-62%的绝对提升。在盲评中,评估者将PPol条件下的用户评定为人类的概率为80.4%,接近真实人类轨迹,几乎是基线模拟器的两倍。使用PPol训练的代理对具有挑战性的、超分布的行为更具鲁棒性,相较于仅在现有模拟交互上训练,任务成功率提高了17%。这为在不改变任务或奖励的情况下增强基于模拟器的评估和训练提供了一种新颖的方法。
cs.AI / 15 / 2605.12922

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

当注意力关闭时:大型语言模型在多轮交互中如何失去线索
Dongre, Vardhan, Hsieh, Joseph, Lai, Viet Dac, Yoon, Seunghyun, Bui, Trung, Hakkani-Tür, Dilek
Abstract
Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
Chinese Translation
大型语言模型能够在单轮交互中遵循复杂的指令,但在长时间的多轮交互中,它们常常会失去指令、角色和规则的线索。这种退化已经在行为上进行了测量,但尚未在机制上得到解释。我们提出了一种通道转移的解释:定义目标的标记通过注意力变得不那么可获取,而与目标相关的信息可能在残余表示中持续存在。我们引入了目标可获取性比率(Goal Accessibility Ratio, GAR),用于测量从生成标记到任务定义目标标记的注意力,并结合滑动窗口消融和残余流探测。当对指令的注意力关闭时,存留下来的信息揭示了架构的特征。在不同的架构中,这种转变产生了质上不同的失败模式:一些模型在注意力消失时仍能保持目标条件行为,而其他模型尽管有可解码的残余目标信息却失败,且这种编码出现的层级从2到27不等。对Mistral模型进行的内部模型因果消融实验强制关闭注意力通道,使得在一个20条事实的保留任务中,回忆率从近乎完美降至11%,并且在没有用户压力的情况下,角色约束的违反率超过了对抗压力基线,这两个效应都在可预测的交叉轮次中出现。线性探测器从残余表示中恢复每个回合的回忆结果,在所有四种主要架构中,其AUC高达0.99,而输入嵌入保持在随机水平。在不同架构和模型规模中,注意力损失与残余可解码性之间的差距预测了目标条件行为在通道关闭时是否能够存活。我们贡献了GAR作为诊断工具,通道转移框架作为一种受控的机制解释,以及在窗口注意力关闭下失败时机的参数预测。
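The abstract above describes the Goal Accessibility Ratio (GAR) only as a measure of attention from generated tokens to task-defining goal tokens and gives no formula. The sketch below assumes one plausible reading: for each generated token, take the share of its attention mass that lands on goal positions, then average over generated tokens. The function name `goal_accessibility_ratio`, the single head-averaged attention matrix, and the toy numbers are all assumptions for illustration; the paper's definition may differ.

```python
from typing import List, Sequence


def goal_accessibility_ratio(
    attn: Sequence[Sequence[float]],
    goal_idx: List[int],
    gen_idx: List[int],
) -> float:
    """Mean share of attention that generated tokens place on goal tokens.

    attn[q][k] is the attention weight from query position q to key position k.
    This is an assumed definition, not taken from the paper.
    """
    ratios = []
    for q in gen_idx:
        row = attn[q]
        total = sum(row)                           # full attention mass of this query
        goal_mass = sum(row[k] for k in goal_idx)  # mass landing on goal positions
        ratios.append(goal_mass / total)
    return sum(ratios) / len(ratios)


# Toy example: positions 0-1 hold the goal tokens, positions 2-3 are generated.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.4, 0.2, 0.0],   # early generated token: most attention on the goal
    [0.1, 0.1, 0.4, 0.4],   # later generated token: attention has drifted away
]
gar = goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_idx=[2, 3])
print(round(gar, 3))
```

Under this reading, a falling per-token ratio across turns would correspond to the attention channel "closing" in the abstract's terms, while saying nothing about what remains decodable from the residual stream.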
cs.AI / 16 / 2605.12963

Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

维持人工智能安全:控制理论的外部不可能性、内在必要性及结构要求
Mazzu, James M.
Abstract
As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.
Chinese Translation
随着人工智能系统能力的不断增强,安全策略的评估不仅要考虑其在多大程度上降低当前风险,还要考虑在外部控制无法可靠约束系统行为时,是否能够维持安全。本文通过使用控制理论,从结构层面阐明了外部强制的安全维持策略是否能够成功,以及如果不能成功,任何替代策略需要满足哪些条件才能可行。本文建立了两个主要结果。首先,在包括可达性条件在内的明确前提下,证明了一类广泛的外部不可能性结果:一旦系统的影响超出有限外部控制能够抵消的范围,任何在某种程度上依赖于持续外部强制的策略都无法维持人工智能安全。这一失败是整个外部强制类的结构性问题,而不是依赖于任何特定策略。其次,建立了一个条件类级别的必要性结果:如果在这种消除后至少还有一个候选的安全维持策略存在,那么所有剩余的策略必须是内在的。接着,本文提出了四个可行性的结构要求:安全不应依赖于持续的外部强制;系统的最终目标在首次形成时必须与安全兼容;该目标在自我修改过程中必须保持稳定;并且随着能力的增长,安全必须继续得到维护。本文并未提出维持人工智能安全的完整策略,其贡献在于为广泛关注的外部控制的局限性提供了正式结构。通过推导明确的条件结果,识别出哪些安全维持策略被排除,以及任何剩余策略必须满足的条件。
cs.AI / 17 / 2605.12966

Position: Agentic AI System Is a Foreseeable Pathway to AGI

位置:代理智能系统是通往人工通用智能的可预见路径
Liao, Junwei, Li, Shuai, Wen, Muning, Wang, Jun, Zhang, Weinan
Abstract
Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.
Chinese Translation
单一模型扩展是否是通往人工通用智能的唯一途径?本文挑战了单纯扩展单一模型足以实现人工通用智能的教条。相反,我们将代理智能(Agentic AI)确定为掌握现实世界任务复杂多样分布的必要范式。通过严格的理论推导,我们对比了单一学习者的优化约束与代理系统的效率,从简单的路由机制发展到一般的有向无环图(Directed Acyclic Graph, DAG)拓扑结构。我们证明了代理智能在泛化能力和样本效率上实现了指数级的优越性。最后,我们讨论了与专家混合模型(Mixture-of-Experts)的联系,重新解释了当前多智能体框架的不稳定性,并呼吁对代理智能的更多研究关注。
cs.AI / 18 / 2605.12975

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

检索成本低,给我代码:用于检索增强生成的可执行多跳推理
Sun, Jiashuo, Shi, Jimeng, Xie, Yixuan, Wang, Saizhuo, Parekh, Jash Rajesh, Jiang, Pengcheng, Shi, Zhiyi, Fan, Jiajun, Zheng, Qinglong, Li, Peiran, Wang, Shaowen, Liu, Ge, Han, Jiawei
Abstract
Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data and models are publicly available at https://github.com/GasolSun36/PyRAG.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)已成为知识密集型问答的标准方法,但现有系统在多跳问题上仍然脆弱,这类任务需要链式处理多个检索和推理步骤。主要挑战在于当前方法通过自由形式的自然语言表示推理,其中中间状态是隐式的,检索查询可能偏离预期实体,并且错误由同一模型检测,该模型又是产生这些错误的,因此自我反思成为一个不可靠且没有基础的信号。我们观察到,多跳问答是一种典型的逐步计算形式,这一结构化过程与代码专用语言模型的训练方式密切相关。基于此,我们提出了 PyRAG,一个将多跳 RAG 重新表述为程序合成和执行的框架。与自由形式的推理轨迹不同,PyRAG 将推理过程表示为一个可执行的 Python 程序,利用检索和问答工具,暴露中间状态作为变量,通过执行产生确定性的反馈,并生成整个推理过程的可检查痕迹。这种表述进一步实现了编译器基础的自我修复和执行驱动的自适应检索,而无需任何额外的训练。在五个问答基准(PopQA、HotpotQA、2WikiMultihopQA、MuSiQue 和 Bamboogle)上的实验表明,PyRAG 在无训练和强化学习训练设置下均持续优于强基线,尤其在组合多跳数据集上获得了显著提升。我们的代码、数据和模型已公开发布在 https://github.com/GasolSun36/PyRAG。
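The idea in the abstract above, expressing a multi-hop question as an executable Python program over retrieval and QA tools, can be illustrated with stub tools. This is a sketch under stated assumptions and not PyRAG's actual API: `CORPUS`, `retrieve`, and `answer` are hypothetical stand-ins, with a two-passage in-memory corpus replacing a real retriever and a string split replacing a QA model.

```python
CORPUS = {
    "Inception": "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan": "Christopher Nolan was born in London.",
}


def retrieve(query: str) -> str:
    """Stub retriever (assumption): return the passage whose key appears in the query."""
    for key, passage in CORPUS.items():
        if key.lower() in query.lower():
            return passage
    raise LookupError(f"no passage found for query: {query!r}")


def answer(passage: str, marker: str) -> str:
    """Stub QA tool (assumption): return the text after `marker`, minus the final period."""
    if marker not in passage:
        raise ValueError(f"marker {marker!r} not in passage")
    return passage.split(marker, 1)[1].rstrip(".").strip()


# Two-hop program for "Where was the director of Inception born?"
director = answer(retrieve("Inception"), "directed by")   # hop 1: find the director
birthplace = answer(retrieve(director), "born in")        # hop 2: reuse the hop-1 variable
print(director, "->", birthplace)
```

Each intermediate result is a named variable that can be inspected, and a failed hop raises an exception at a specific line, which is the kind of deterministic execution feedback the abstract contrasts with free-form self-reflection.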
cs.AI / 19 / 2605.12978

Useful Memories Become Faulty When Continuously Updated by LLMs

有用的记忆在被大型语言模型(LLMs)持续更新时变得不可靠
Zhang, Dylan, Lin, Yanshan, Wu, Zhengkun, Sun, Yihang, Li, Bingxuan, Li, Dianqi, Peng, Hao
Abstract
Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
Chinese Translation
从过去的经验中学习受益于两种互补的记忆形式:情节痕迹(发生过的事情的原始轨迹)和通过多个情节提炼出的可重用的、类似模式的抽象化教训。近期的代理记忆系统追求后者的形式:一个大型语言模型(LLM)将过去的轨迹重写为一个文本记忆库,并通过新的交互不断更新,承诺无需参数更新的自我改进代理。然而,我们发现,今天的LLM所生成的这种整合记忆往往是不可靠的,即使是源自有用的经验。当整合进行时,记忆的效用首先上升,然后下降,甚至可能低于无记忆的基线。更令人惊讶的是,即使是从真实解决方案中进行整合,GPT-5.4在一组它之前无需记忆即可解决的ARC-AGI问题上仍有54%失败。我们将这种退化追溯到整合步骤,而非基础经验:相同的轨迹在不同的更新计划下产生性质上不同的记忆,而一个仅保留这些轨迹的纯情节对照组与我们测试的整合方法相比仍具竞争力。在一个提供保留、删除和整合三种操作的受控ARC-AGI Stream环境中,代理默认保留原始情节,其准确率是强制整合对照组的两倍;完全禁用整合(仅情节管理)的效果与这种自动模式相当。从实践角度来看,稳健的代理记忆应将原始情节视为一等证据,并明确控制整合,而不是在每次交互后自动触发整合。展望未来,可靠的代理记忆将需要能够在不覆盖其依赖证据的情况下进行整合的LLM。
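The practical recommendation in this abstract (keep raw episodes as first-class evidence and gate consolidation explicitly) can be sketched as a small memory store. This is a minimal illustration under stated assumptions: the class name `MemoryStore`, the `min_episodes` gate, and the `summarize` callback are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryStore:
    episodes: List[str] = field(default_factory=list)  # raw trajectories, never overwritten
    lessons: List[str] = field(default_factory=list)   # consolidated abstractions

    def retain(self, episode: str) -> None:
        """Retain action: store the raw episode as-is."""
        self.episodes.append(episode)

    def delete(self, index: int) -> None:
        """Delete action: drop one raw episode by position."""
        del self.episodes[index]

    def consolidate(self, summarize: Callable[[List[str]], str],
                    min_episodes: int = 3) -> bool:
        """Consolidate action, gated: run only once enough evidence exists.

        The threshold is an assumed gating rule for illustration. The raw
        episodes are kept, so a faulty lesson never destroys its evidence.
        """
        if len(self.episodes) < min_episodes:
            return False
        self.lessons.append(summarize(list(self.episodes)))
        return True
```

Passing `summarize` in as a callback keeps the sketch independent of any particular LLM; in practice it would wrap a model call.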
cs.AI / 20 / 2605.12988

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

基于检索增强的算法追踪与问题解决辅导在人工智能教育中的应用
Jain, Mragisha, Bhatt, Tirth, Pitts, Griffin, Pandya, Aum, Brusilovsky, Peter, Norouzi, Narges, Hellas, Arto, Leinonen, Juho, Akram, Bita
Abstract
Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students' algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE's feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.
Chinese Translation
学习算法的学生在解读执行轨迹、调试推理错误以及在不熟悉的问题实例中运用解题步骤时,常常需要支持。本文介绍了KITE(知识驱动辅导引擎),这是一种基于检索增强生成(RAG)的智能辅导系统,旨在作为算法推理和问题解决任务的课堂教学助手。KITE采用意图感知的苏格拉底式回应策略,根据不同学生的需求量身定制支持,提供有针对性的提示、引导性问题和逐步支架,旨在增强学生的算法问题解决能力。为了确保回应与课程内容保持一致,KITE使用多模态RAG管道,从课程材料中检索相关信息。我们通过三种评估形式对KITE进行评估:基于RAGAs的回应依据性与质量指标、专家对教学质量的评估,以及一个模拟学生管道,其中较弱的语言模型在两轮对话中与KITE互动,并在收到反馈后生成修订答案。结果表明,KITE能够生成有上下文依据且教学适宜的回应。此外,在模拟学生实验中,KITE的反馈帮助学生模型在过程性和追踪类问题上产生更准确的后续回应,表明其支架能够支持算法问题解决。该研究贡献了一种辅导架构,以及一种用于评估基于检索的解释和支架式问题解决反馈的评估方法。
cs.AI / 21 / 2605.13037

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

MAP:一种用于长时间交互代理推理的先图后行范式
Liu, Yuxin, Ye, Ziang, Sun, Yueqing, Zhu, Mingye, Xiao, Jinwei, Han, Zhuowen, GU, Qi, Cai, Xunliang, Zhang, Lei
Abstract
Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial-and-error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map-then-Act Paradigm (MAP), a plug-and-play framework that shifts environment understanding before execution. MAP consists of three stages: (1) Global Exploration, acquiring environment-general priors; (2) Task-Specific Mapping, constructing a structured cognitive map; and (3) Knowledge-Augmented Execution, solving tasks grounded on the map. Experiments show consistent gains across benchmarks and LLMs. On ARC-AGI-3, MAP enables frontier models to surpass near-zero baseline performance in 22 of 25 game environments. We further introduce MAP-2K, a dataset of map-then-act trajectories, and show that training on it outperforms expert execution traces, suggesting that understanding environments is more fundamental than imitation.
Chinese Translation
当前的交互式大型语言模型(LLM)代理依赖于目标条件的逐步规划,其中环境理解是在执行过程中被动获得,而不是事先建立。这种时间上的倒置导致了延迟环境感知(Delayed Environmental Perception):代理必须通过反复试验推断环境约束,从而形成认知瓶颈(Epistemic Bottleneck),使其陷入低效的失败循环。受到人类可供性感知和认知地图理论的启发,我们提出了先图后行范式(MAP),这是一个即插即用的框架,将环境理解前移到执行之前。MAP包括三个阶段:(1)全局探索,获取环境通用先验;(2)任务特定映射,构建结构化的认知地图;(3)知识增强执行,基于地图解决任务。实验表明,在各基准测试和各大型语言模型上均取得了一致的提升。在ARC-AGI-3上,MAP使前沿模型在25个游戏环境中的22个超越了近乎零的基线性能。我们进一步介绍了MAP-2K,一个先图后行轨迹的数据集,并表明在其上训练的效果优于在专家执行轨迹上训练,这说明理解环境比模仿更为根本。
cs.AI / 22 / 2605.13046

An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

基于代理的LLM框架用于大规模心理健康筛查
Lorenzoni, Giuliano, Alencar, Paulo, Cowan, Donald
Abstract
Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
Chinese Translation
心理健康障碍影响着全球数百万人,医疗系统正日益被来自电子记录、远程医疗平台和人口级筛查项目生成的临床数据所淹没。同时,医疗领域新兴的基于人工智能的方法呼唤能够处理特定领域非结构化临床信息并适应患者特定需求的智能框架。本文提出了一种智能体式(agentic)框架,用于构建稳健的基于LLM的管道,其中每个阶段都被封装为一个LangChain智能体,由明确的策略和代理指标(proxy)引导的评估来管理。各阶段一经验证即被逐步锁定,确保后续的调整不能在没有证明改进的情况下覆盖配置。所提出的框架从特征级探索出发,经过基于代理指标的调优和冻结/回滚机制,最终由一个编排智能体(Orchestrator Agent)进行全面编排,协调预处理、检索、选择、多样性、阈值优化和解码。基于转录文本的抑郁检测概念验证表明,该框架收敛到稳定的配置,如余弦相似度、动态Top-k和阈值0.75,同时控制评估成本并避免性能回退。这些结果突显了智能体式人工智能在大规模临床数据集上实现人口级心理健康筛查的潜力,解决了医疗环境中所需的可信度、可重复性和适应性等关键挑战。
cs.AI / 23 / 2605.13130

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

GRACE:用于高效后训练的梯度对齐推理数据整理
Li, Junjie, Wang, Ziao, Ma, NingXuan, Ma, Jianghong, Zhang, Xiaofeng
Abstract
Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.
Chinese Translation
现有的推理数据整理流程对整个样本进行评分,将每个中间步骤视为同等重要。实际上,推理轨迹中的步骤贡献极不均匀,良好的推理数据选择需要对其进行单独评估。我们提出了GRACE,一种梯度对齐的整理方法,将每个推理轨迹视为一系列优化事件,并通过两个互补信号对每个步骤进行评分:与面向答案的梯度方向的对齐程度,以及与之前推理轨迹的一致性。步骤级评分被聚合为样本级值,以便进行子集选择,仅使用模型的内部优化信号,而不依赖外部奖励模型或步骤注释。为了使这一过程可扩展,GRACE引入了一种表示级梯度代理,通过单次前向传播从标记级上游信号中估计步骤级对齐。在对Qwen3-VL-2B-Instruct模型进行MMathCoT-1M数据集的后训练时,GRACE在仅使用20%的数据时达到了全数据性能的108.8%,而在仅使用5%的数据时仍保持100.2%的性能,并且这些子集在不同模型骨干之间有效迁移。
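The two step-level signals named in the abstract (alignment with the answer-oriented gradient direction, and consistency with the preceding trajectory) can be illustrated with plain cosine similarities. Everything specific here is an assumption made for illustration: the per-step vectors are taken as given rather than computed by the paper's representation-level proxy, and the mixing weight `alpha`, the running-sum notion of "preceding trajectory", and the mean aggregation are hypothetical choices, not GRACE's actual formulas.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def step_scores(step_grads: List[Sequence[float]],
                answer_grad: Sequence[float],
                alpha: float = 0.5) -> List[float]:
    """Score each step by answer alignment and by agreement with earlier steps."""
    scores = []
    running = [0.0] * len(answer_grad)  # sum of all preceding step vectors
    for grad in step_grads:
        alignment = cosine(grad, answer_grad)
        consistency = cosine(grad, running)  # 0.0 for the first step
        scores.append(alpha * alignment + (1.0 - alpha) * consistency)
        running = [r + g for r, g in zip(running, grad)]
    return scores


def sample_value(step_grads: List[Sequence[float]],
                 answer_grad: Sequence[float],
                 alpha: float = 0.5) -> float:
    """Aggregate step scores into one sample-level value (assumed: the mean)."""
    scores = step_scores(step_grads, answer_grad, alpha)
    return sum(scores) / len(scores) if scores else 0.0
```

Ranking samples by `sample_value` and keeping the top fraction would then give a subset; a step that points away from both the answer direction and the earlier steps pulls its sample's value down.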
cs.AI / 24 / 2605.13142

A Constraint Programming Approach for $n$-Day Lookahead Playoff Clinching

基于约束编程的$n$天前瞻性季后赛锁定方法
Rosenberg, Gili, Booth, Kyle E. C., Brubaker, J. Kyle, Andrist, Ruben S.
Abstract
In professional sports, a team has clinched the playoffs if they are guaranteed a postseason spot, regardless of the outcomes of any remaining games. As the season progresses, sports fans and other stakeholders are interested in precisely when, and under what conditions, their team will clinch the playoffs. In this paper, we investigate playoff clinching in the context of the National Hockey League (NHL), where it is computationally challenging to produce clinching scenarios due, in part, to complex tie-breakers. We present an algorithm that determines under which combinations of game outcomes in the next $n$ days a team will clinch the playoffs (i.e., "$n$-day lookahead clinching"). Our approach is a custom tree search which employs various preprocessing techniques, pruning strategies, and node ordering heuristics to efficiently explore the space of possible outcomes. The tree search leverages a constraint programming (CP)-based subroutine for inference that determines if a team has clinched the playoffs for some snapshot in time of the regular season (i.e., "0-day lookahead clinching"). This CP subroutine aims to find a counter-example in which the team being evaluated is eliminated, taking into account qualification rules and the NHL's extensive list of tie-breakers. We validate the efficacy of our algorithm using hundreds of scenarios based on public NHL data for the seasons 2021-22 through 2024-25. The methods introduced can be readily extended to other metrics of interest, including mathematical proof of playoff elimination, clinching the President's Trophy, as well as clinching (or being eliminated from clinching) any other seed in the standings.
Chinese Translation
在职业体育中,如果一支球队确保获得季后赛名额,无论剩余比赛的结果如何,则该球队已锁定季后赛。随着赛季的推进,体育迷和其他利益相关者对他们的球队何时以及在什么条件下锁定季后赛产生了浓厚的兴趣。本文研究了国家冰球联盟(NHL)中的季后赛锁定问题,由于复杂的平局决胜规则,这一问题在计算上具有挑战性。我们提出了一种算法,用于确定在接下来的$n$天内,球队在什么样的比赛结果组合下将锁定季后赛(即“$n$天前瞻性锁定”)。我们的方法是一种定制的树搜索,采用各种预处理技术、剪枝策略和节点排序启发式方法,以高效地探索可能结果的空间。树搜索利用基于约束编程(CP)的子程序进行推理,以确定在常规赛的某个时刻,球队是否已锁定季后赛(即“0天前瞻性锁定”)。该CP子程序旨在找到一个反例,其中被评估的球队被淘汰,同时考虑资格规则和NHL的广泛平局决胜规则列表。我们使用基于2021-22至2024-25赛季的公共NHL数据的数百个场景验证了我们算法的有效性。所提出的方法可以轻松扩展到其他感兴趣的指标,包括季后赛淘汰的数学证明、总统奖杯的锁定,以及在排名中锁定(或被淘汰于锁定)任何其他种子的情况。
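The "0-day lookahead" check described in the abstract looks for a counter-example in which the evaluated team misses the playoffs. The sketch below shows that idea on a toy league. It swaps the paper's constraint programming subroutine for exhaustive enumeration, and its rules are assumptions for illustration only: a win is worth 2 points, there are no overtime points, and any tie is resolved against the evaluated team rather than by the NHL tie-breakers.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Game = Tuple[str, str]  # the two teams in one remaining game


def find_elimination_scenario(points: Dict[str, int],
                              remaining: List[Game],
                              team: str,
                              spots: int) -> Optional[Tuple[str, ...]]:
    """Return the winners of a scenario in which `team` misses the top `spots`.

    Returns None when no such scenario exists, i.e. the team has clinched.
    """
    for winners in product(*remaining):  # pick one winner per remaining game
        final = dict(points)
        for winner in winners:
            final[winner] += 2  # assumed scoring: 2 points per win
        # Pessimistic tie handling: a team level on points ranks ahead.
        ahead = sum(1 for other, pts in final.items()
                    if other != team and pts >= final[team])
        if ahead >= spots:
            return winners
    return None


def has_clinched(points: Dict[str, int], remaining: List[Game],
                 team: str, spots: int) -> bool:
    """A team has clinched exactly when no counter-example scenario exists."""
    return find_elimination_scenario(points, remaining, team, spots) is None
```

Enumeration grows as 2 to the power of the number of remaining games, which is one reason a real solver needs the pruning and constraint reasoning the abstract describes.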
cs.AI / 25 / 2605.13153

Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning

关注显著性的时间知识图谱推理评估
Huang, Rikui, Zhang, Shengzhe, Wei, Wei
Abstract
Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR uniformly weights all events, ignoring that most are trivial repetitions, and therefore overestimates true reasoning ability. The rare outstanding events, whose prediction demands deeper reasoning, should therefore be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics like weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal: 1) all representative models perform worse as event strikingness increases; 2) path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events; 3) the gains of an ensemble method we design stem from fitting trivial events rather than from improved reasoning. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.
Chinese Translation
时间知识图谱推理(TKGR)旨在从历史数据中推断缺失的(尤其是未来的)事件。目前TKGR中的评估对所有事件进行统一加权,忽视了大多数事件只是微不足道的重复,从而高估了真实的推理能力。因此,那些预测起来需要更深层次推理的稀有显著事件,应该被区分和强调。为此,我们提出了一种关注显著性的评估框架,该框架引入了一种基于规则的显著性测量框架(RSMF),通过将事件的预期发生情况与由时间规则推导出的同类事件进行比较来量化事件的显著性。显著性随后作为加权因子整合到加权平均倒数排名(weighted MRR)和Hits@k等指标中。在四个TKG基准上的实验结果显示:1)随着事件显著性的增加,所有代表性模型的表现均有所下降;2)基于路径的方法在低显著性事件上表现优异,而基于表示的方法在高显著性事件上表现更好;3)我们设计的一种集成方法,其收益源于对微不足道事件的拟合,而非推理能力的提升。我们的框架提供了更为严谨的评估,使该领域重新聚焦于预测显著事件。
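To make the weighting step concrete, here is a minimal sketch of strikingness-weighted MRR and Hits@k. It assumes each test event already has a rank and a non-negative strikingness weight, and it normalizes by the total weight; that normalization is an assumption of this example, since the abstract does not give the exact formula.

```python
from typing import Sequence


def weighted_mrr(ranks: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted mean reciprocal rank: sum(w_i / rank_i) / sum(w_i)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w / r for r, w in zip(ranks, weights)) / total


def weighted_hits_at_k(ranks: Sequence[int], weights: Sequence[float], k: int) -> float:
    """Weighted Hits@k: share of total weight carried by events ranked within top k."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for r, w in zip(ranks, weights) if r <= k) / total
```

With ranks `[1, 4]`, uniform weights give an MRR of 0.625, while weights `[1.0, 3.0]` that emphasize the poorly ranked event lower it to 0.4375, which is the kind of drop the abstract attributes to models that mostly fit trivial events.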
cs.AI / 26 / 2605.13171

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

形式猜想:一个开放且不断发展的数学验证发现基准
Firsching, Moritz, Lezeau, Paul, Mercuri, Salvatore, Horváth, Miklós Z., Dillies, Yaël, Sönne, Calle, Wieser, Eric, Zhang, Fred, Hubert, Thomas, Arcas, Blaise Agüera y, Kohli, Pushmeet
Abstract
As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.
Chinese Translation
随着自动推理系统的快速发展,研究级形式数学问题的需求日益增长,以准确评估其能力。为此,我们提出了形式猜想(Formal Conjectures),这是一个不断发展的基准,目前包含2615个在Lean 4中形式化的数学问题陈述。该数据集来源于活跃的数学研究领域,包含1029个开放研究猜想,为数学证明发现提供了零污染基准,以及836个已解决的问题用于证明自动形式化。值得注意的是,该库提供了一个结构化接口,将形式化和澄清问题的数学家与试图解决这些问题的人工智能系统和人类连接起来。基准的即时实用性已得到证明,已经被用于新的数学发现,包括解决开放研究猜想。我们描述了在一个由活跃社区贡献的协作开源项目中确保这些形式化正确性的方法。在这个框架中,AI生成的证明和反驳作为一种有价值的审计机制,迭代地提高基准的准确性。最后,我们提供了一个标准化的评估设置,并报告了在冻结评估子集上的基线结果,展示了一个可攀升的信号,衡量当前在研究级数学上的自动推理前沿。
cs.AI / 27 / 2605.13213

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

多模态多智能体推理的层次攻击
Zhou, Hao, Wu, Tiru, Jiang, Yan, Zhou, Wanqi, Hu, Junxing, Han, Ai
Abstract
Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM-MAS largely underexplored. To bridge this gap, we introduce HAM$^{3}$, a Hierarchical Attack framework for multi-modal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM$^{3}$ mounts attacks by perturbing visual inputs, textual inputs, and their fused visual-textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent's cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM$^{3}$ on the GQA benchmark through multi-agent systems built on distinct reasoning paradigms including ReAct, Plan-and-Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning-layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi-agent intelligence.
Chinese Translation
多模态多智能体系统(MM-MAS)因其在不同模态间实现复杂推理和协调的能力而受到越来越多的关注。随着这些系统在规模和功能上的不断扩展,研究其潜在的脆弱性变得愈加重要。然而,现有关于多智能体系统的对抗攻击研究主要集中于孤立智能体或单模态环境,导致MM-MAS的脆弱性尚未得到充分探索。为填补这一空白,我们提出了HAM$^{3}$,一个针对多模态多智能体系统的层次攻击框架,该框架将攻击分解为三个相互关联的层次。具体而言,在感知层,HAM$^{3}$通过扰动视觉输入、文本输入及其融合的视觉-文本表示来发起攻击。在通信层,它执行通信级攻击,破坏消息内容和交互拓扑,例如操纵共享上下文或通信链接以扭曲集体信息流。在推理层,它进行推理级攻击,干扰每个智能体的认知流程,偏向推理轨迹,最终影响最终决策。我们在GQA基准上评估了HAM$^{3}$,通过构建基于不同推理范式(包括ReAct、Plan-and-Solve和Reflexion)的多智能体系统进行实验。实验结果表明,我们的框架实现了高达78.3%的攻击成功率,其中推理层攻击最为有效。超过一半的成功攻击导致多个智能体产生一致的错误。这些发现为构建更强大和可解释的多智能体智能提供了宝贵的见解。
cs.AI / 28 / 2605.13221

An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

基于大型语言模型和思维链的自主智能框架用于无人机辅助的物流调度与移动边缘计算
Zhang, Hanwen, Niyato, Dusit, Zhang, Wei, Lou, Xin, Low, Malcolm Yoke Hean
Abstract
In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.
Chinese Translation
在云制造中,无人机(UAV)可以支持产品收集和移动边缘计算(MEC)。这种联合操作形成了一个混合调度问题,其中物理物流决策与计算任务调度相互耦合。本文中,无人机从制造站收集成品并将其运输回中央仓库。同时,这些站点的工业传感器设备生成的计算任务在本地、无人机上处理,或通过无人机卸载到云端。这种耦合使得问题变得复杂。无人机只能在其服务窗口内提供MEC服务,因此路径决策直接决定了无人机辅助卸载的可用性。路径决策还影响无人机的能量预算以及在任务截止约束下执行计算任务所需的机载计算和通信资源的可用性。为了解决这个问题,我们提出了一个自主智能辅助的优化框架,包含两个组成部分。首先,我们开发了一种自主智能(agentic AI),结合了大型语言模型、检索增强生成和思维链推理,将用户输入转化为可解释的数学模型,以解决混合调度问题。其次,我们设计了一种基于近端策略优化(PPO)的分层深度强化学习方法,其中上层学习无人机路径规划,下层优化每个时隙的任务执行和资源分配。仿真结果表明,所提框架能够产生更一致的模型,而分层PPO在最后500个回合中实现了99.6%的全产品收集率,并保持100%的截止时间满足率,其性能比优势演员-评论家方法更为稳定。
cs.AI / 29 / 2605.13229

Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

通过语法引导和语义感知的偏好优化提升代码翻译
Wu, Yuhan, Zhang, Huan, Cheng, Wei, Shen, Chen, Yang, Jingyue, Hu, Wei
Abstract
LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.
Chinese Translation
大型语言模型(LLMs)在代码翻译方面展现出巨大的潜力,但它们常常难以确保语法正确性和语义一致性。尽管基于偏好的学习提供了一种有前景的对齐策略,但由于稀疏测试用例或限制性参考翻译所产生的不可靠语义奖励,这一策略受到限制。我们认为,代码翻译的稳健语义奖励必须直接源自源代码。在本文中,我们提出了CTO,通过语法引导和语义感知的偏好优化来提升代码翻译。通过对比学习,我们训练了一个跨语言语义模型,以直接评估源代码与翻译代码之间的功能等价性。通过将代码翻译形式化为多目标优化问题,这一稳健的语义信号与基于编译器的语法反馈在直接偏好优化框架中无缝结合。在C++、Java和Python翻译上的广泛实验表明,CTO显著优于现有基准和替代的偏好优化策略。
cs.AI / 30 / 2605.13245

It's not the Language Model, it's the Tool: Deterministic Mediation for Scientific Workflows

不是语言模型,而是工具:科学工作流中的确定性中介
Adamidis, Marios, Katrisioti, Danae, Tzitzikas, Yannis, Stratakis, Emmanuel
Abstract
Language models can produce convincing scientific analyses, but repeated generations on the same data do not guarantee the same result. A researcher may regenerate an identical query and receive a different fit, a different peak position or a different analysis procedure, without an obvious way to decide which output to trust. We propose typed mediation, a pattern in which the model orchestrates deterministic tools rather than generating analytical code. Each tool encodes one researcher's exact procedure for one instrument, ported through structured interviews. The model selects which tool to call and with what parameters. The tool produces the result. Regeneration does not change it. We evaluate this claim by running the same photoluminescence analysis on four platforms, including three commercial foundation models, four times each with the same prompt. The typed tool produces identical results across all runs. The commercial platforms either vary in numerical output and analytical methodology across runs, or fail to produce valid results on the task. We deploy this pattern on two instruments serving users over approximately six months, with very positive user feedback. Both cases are very challenging: they involve proprietary binary formats and per-seat licensed software, which force the tool to remain on local infrastructure alongside the data and the instrument it operates. We argue that deployment topology is not just a preference, but a structural requirement of scientific tool mediation. The result is a practical pattern for deploying language models in scientific workflows where reproducibility is mandatory, reducing analysis time from weeks to minutes while guaranteeing identical outputs across runs.
Chinese Translation
语言模型能够生成令人信服的科学分析,但在相同数据上重复生成并不能保证结果相同。研究者可能会重新生成相同的查询,却得到不同的拟合、不同的峰位或不同的分析过程,而没有明显的方法来决定哪个输出更值得信赖。我们提出了类型化中介(typed mediation),这是一种模式,其中模型协调确定性工具,而不是生成分析代码。每个工具编码了一个研究者针对某一仪器的确切过程,通过结构化访谈转化而来。模型选择调用哪个工具以及使用什么参数。该工具生成结果,重生成不会改变结果。我们通过在四个平台上运行相同的光致发光分析来评估这一主张,包括三个商业基础模型,每个平台使用相同的提示运行四次。类型化工具在所有运行中产生了相同的结果。商业平台在运行中要么在数值输出和分析方法上有所不同,要么未能在该任务上产生有效结果。我们在两个仪器上部署了这一模式,为用户服务约六个月,用户反馈非常积极。这两个案例都非常具有挑战性:它们涉及专有二进制格式和按席位许可的软件,这迫使工具与数据和其操作的仪器一起保留在本地基础设施上。我们认为,部署拓扑不仅仅是一种偏好,而是科学工具中介的结构性要求。其结果是一个在科学工作流中部署语言模型的实用模式,在这些工作流中可重复性是强制性的,将分析时间从几周缩短到几分钟,同时保证运行间输出一致。
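The typed-mediation pattern can be sketched as a registry of deterministic tools with typed parameters, where the model's only output is a tool name and a parameter dictionary. The tool below (an intensity-weighted centroid around the maximum) and all names (`PeakFitParams`, `peak_position`, `TOOLS`, `dispatch`) are hypothetical examples, not the procedures or interfaces used in the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PeakFitParams:
    window: int  # half-width, in samples, of the region around the maximum


def peak_position(signal: List[float], params: PeakFitParams) -> float:
    """Deterministic tool: intensity-weighted centroid around the global maximum."""
    i_max = max(range(len(signal)), key=signal.__getitem__)
    lo = max(0, i_max - params.window)
    hi = min(len(signal), i_max + params.window + 1)
    total = sum(signal[lo:hi])
    if total == 0:
        raise ValueError("signal has no intensity in the fitting window")
    return sum(i * signal[i] for i in range(lo, hi)) / total


# Registry mapping a tool name to its function and its parameter type.
TOOLS: Dict[str, Tuple[Callable[..., float], type]] = {
    "peak_position": (peak_position, PeakFitParams),
}


def dispatch(call: Dict[str, Any], signal: List[float]) -> float:
    """Run a model-proposed call; the model never writes analysis code.

    Unknown tool names raise KeyError and unknown parameter names raise
    TypeError, so a malformed call fails loudly instead of being improvised.
    """
    tool, param_type = TOOLS[call["tool"]]
    return tool(signal, param_type(**call["params"]))
```

Since the numerical work lives entirely in `peak_position`, repeating the same call returns the same number no matter how the model phrased its reasoning around it.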
cs.AI / 31 / 2605.13255

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

在策略自蒸馏中尊重自我不确定性以实现高效的大型语言模型推理
Ke, Junlong, Wen, Zichen, Li, Weijia, He, Conghui, Zhang, Linfeng
Abstract
On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
Chinese Translation
在线策略(on-policy)自蒸馏在推理模型自身采样的轨迹(rollouts)上训练该模型,同时由教师(通常是以特权上下文为条件的同一模型)提供密集的标记级监督。现有目标通常在思维链序列中均匀加权教师的标记级信号,尽管教师预测分布的熵存在显著差异。我们提出了EGRSD(熵引导强化自蒸馏),通过三个信号统一标记级更新:基于奖励的方向、教师-学生似然比的大小,以及本文提出的教师熵置信门,该门降低高熵标记位置的权重,同时保持每个标记权重的非零下限。我们进一步引入CL-EGRSD,一种因果前瞻变体,能够区分持续的高熵区间与后续上下文迅速变为低熵的瞬态高熵位置。在思维模式下对Qwen3-4B和Qwen3-8B的实验表明,EGRSD和CL-EGRSD在所比较的可训练方法中推进了准确率-长度前沿。
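The teacher-entropy confidence gate can be illustrated as a per-token weight that shrinks as the teacher's predictive distribution becomes more uncertain but never reaches zero. The linear form and the `floor` value below are assumptions chosen for clarity; the abstract only states that high-entropy positions are down-weighted with a nonzero lower bound.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_gate(teacher_probs: Sequence[float], floor: float = 0.1) -> float:
    """Token weight in [floor, 1]: 1 for a one-hot teacher, floor for a uniform one."""
    n = len(teacher_probs)
    if n < 2:
        return 1.0  # a single-outcome distribution carries no uncertainty
    normalized = entropy(teacher_probs) / math.log(n)  # scale entropy to [0, 1]
    normalized = min(1.0, max(0.0, normalized))        # guard against rounding drift
    return floor + (1.0 - floor) * (1.0 - normalized)
```

Multiplying each token's distillation loss by `confidence_gate(teacher_probs)` would then soften the update where the teacher itself is unsure while still keeping every position in the objective.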
cs.AI / 32 / 2605.13276

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

D-VLA:一种用于视觉-语言-动作模型的高并发分布式异步强化学习框架
Guo, Yucheng, Guo, Yongjian, Guan, Zhong, Huang, Wen, Sun, Haoran, Yue, Haodong, Xiang, Xiaolong, Di, Shuai, Sun, Zhen, Wang, Luqiao, Xiong, Junwu, Gong, Yicheng
Abstract
The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.
Chinese Translation
具身人工智能的快速发展使得视觉-语言-动作(VLA)模型在多模态感知和任务执行方面表现出色。然而,在大规模分布式环境中将强化学习(RL)应用于这些庞大的模型面临严重的系统瓶颈,主要是由于高保真物理仿真与深度学习对显存(VRAM)和带宽的高需求之间的资源冲突。这种冲突常常导致整体吞吐量受到执行阶段低效的限制。为了解决这些挑战,我们提出了D-VLA,一种用于大规模具身基础模型的高并发、低延迟的分布式RL框架。D-VLA引入了“平面解耦”,将高频训练数据与低频权重控制物理隔离,以消除仿真与优化之间的干扰。我们进一步设计了一个四线程异步“泳道”管道,实现了采样、推理、梯度计算和参数分发的完全并行重叠。此外,双池显存管理模型和拓扑感知复制解决了内存碎片问题并优化了通信效率。在LIBERO等基准测试中的实验表明,D-VLA在十亿参数VLA模型的吞吐量和采样效率上显著优于主流RL框架。在万亿参数的可扩展性测试中,我们的框架保持了卓越的稳定性和线性加速,为高性能通用具身代理提供了一个强大的系统。
cs.AI / 33 / 2605.13282

Differentiable Learning of Lifted Action Schemas for Classical Planning

可微分学习提升动作模式用于经典规划
Reiter, Jonas, Gebler, Jakob Elias, Geffner, Hector
Abstract
Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over objects and relations, and lifted action schemas add or delete these atoms. This compact representation yields strong search heuristics and provides an ideal setting for structural generalization, since lifted relations and action schemas give rise to infinitely many domain instances. A central challenge is to learn these relations and action schemas from data, and recent approaches have addressed this problem using different types of observations. In this work, we develop a novel neural network architecture for learning action schemas from traces where states are fully observed but action arguments are unobserved. The problem is a simplification but an important step towards learning planning domains from sequences of images and action labels, and we aim to solve this simplification in a nearly perfect manner. The challenge lies in learning the action schemas while simultaneously identifying the action arguments from observed state changes. Our approach yields a robust differentiable component that can then be integrated into larger neuro-symbolic models. We evaluate the architecture on various planning domains, where the learned lifted action schemas must recover the ground-truth structure. Additionally, we report experiments on robustness to observation noise and on a variation related to slot-based dynamics models.
Chinese Translation
经典规划器能够有效地解决以 STRIPS 或 PDDL 表示的大型确定性马尔可夫决策过程(MDP),其中状态是对象和关系的原子集合,提升的动作模式可以添加或删除这些原子。这种紧凑的表示法产生了强大的搜索启发式,并为结构泛化提供了理想的环境,因为提升的关系和动作模式可以产生无穷多的领域实例。一个核心挑战是从数据中学习这些关系和动作模式,最近的方法已经使用不同类型的观察来解决这个问题。在本工作中,我们开发了一种新颖的神经网络架构,用于从状态完全可观察但动作参数不可观察的轨迹中学习动作模式。这个问题虽然是一个简化,但却是从图像序列和动作标签中学习规划领域的重要一步,我们的目标是以几乎完美的方式解决这一简化。挑战在于在学习动作模式的同时识别观察到的状态变化中的动作参数。我们的方法产生了一个稳健的可微分组件,随后可以将其集成到更大的神经符号模型中。我们在多个规划领域上评估了该架构,其中学习到的提升动作模式必须恢复真实的结构。此外,我们报告了对观察噪声的鲁棒性实验以及与基于槽的动态模型相关的变体实验。
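For readers unfamiliar with lifted action schemas, the sketch below shows how a STRIPS-style schema adds and deletes atoms once its parameters are bound, and how hidden action arguments can be recovered from an observed state change. It uses plain symbolic enumeration, not the differentiable neural architecture the paper proposes, and the `unstack` schema is a standard Blocksworld-style example chosen here for illustration.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, FrozenSet, List, Sequence, Tuple

Atom = Tuple[str, ...]  # predicate name followed by its arguments


@dataclass(frozen=True)
class Schema:
    name: str
    params: Tuple[str, ...]
    pre: Tuple[Atom, ...]
    add: Tuple[Atom, ...]
    delete: Tuple[Atom, ...]


def ground(atoms: Sequence[Atom], binding: Dict[str, str]) -> FrozenSet[Atom]:
    """Replace schema parameters with concrete objects."""
    return frozenset((a[0],) + tuple(binding[x] for x in a[1:]) for a in atoms)


def apply_action(schema: Schema, args: Sequence[str],
                 state: FrozenSet[Atom]) -> FrozenSet[Atom]:
    """Apply a grounded action: check preconditions, delete atoms, then add atoms."""
    binding = dict(zip(schema.params, args))
    if not ground(schema.pre, binding) <= state:
        raise ValueError("precondition not satisfied")
    return (state - ground(schema.delete, binding)) | ground(schema.add, binding)


def infer_args(schema: Schema, objects: Sequence[str],
               before: FrozenSet[Atom], after: FrozenSet[Atom]) -> List[Tuple[str, ...]]:
    """List every argument tuple that explains the observed transition."""
    found = []
    for args in permutations(objects, len(schema.params)):
        try:
            if apply_action(schema, args, before) == after:
                found.append(args)
        except ValueError:
            continue  # this binding violates the preconditions
    return found


UNSTACK = Schema(
    name="unstack",
    params=("x", "y"),
    pre=(("on", "x", "y"), ("clear", "x"), ("handempty",)),
    add=(("holding", "x"), ("clear", "y")),
    delete=(("on", "x", "y"), ("clear", "x"), ("handempty",)),
)
```

`infer_args` assumes the schema is already known; the setting in the abstract is harder because the schema and the arguments must be identified jointly from the traces.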
cs.AI / 34 / 2605.13290

What properties of reasoning supervision are associated with improved downstream model quality?

推理监督的哪些属性与下游模型质量的提升相关?
Langner, Mikołaj, Pihulski, Dzmitry, Eliasz, Jan, Rajkowski, Michał, Kazienko, Przemysław, Piasecki, Maciej, Kocoń, Jan, Ferdinan, Teddy
Abstract
Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.
Chinese Translation
验证推理模型的训练数据通常需要昂贵的试错微调周期。在本研究中,我们探讨了是否可以在训练之前通过内在数据指标可靠地预测推理数据集的效用。我们提出了一套定量指标,并通过在语义上不同的波兰推理数据集变体上微调8B和11B模型来评估其预测能力。我们的分析表明,这些内在指标与下游模型性能之间存在强烈且显著的相关性。关键的是,我们发现效用的预测因素依赖于模型规模:较小的模型依赖于以对齐为中心的指标来确保精度,而较大的模型则受益于高冗余,利用冗长的痕迹来解决复杂任务。这些发现建立了一个考虑规模的推理数据验证框架,使从业者能够在不需要详尽实证测试的情况下选择有效的训练集。
cs.AI / 35 / 2605.13296

Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

稀疏社会注意下复杂和拥堵多智能体路径寻找的离散扩散
Wang, Yuanzhe, Zhi, Tian, Wei, Zihang, Wang, Hongguang, Guo, Jiaming, Zhao, Yang, Liu, Zisheng, Quan, Shiyu, Hu, Xing, Du, Zidong, Chen, Yunji
Abstract
Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid framework that integrates a discrete denoising diffusion probabilistic model (D3PM) with LNS2. The D3PM serves as an initializer with sparse social attention that learns a spatiotemporal prior over coordinated multi-agent action trajectories from expert demonstrations and samples multiple joint plans. Operating directly on the categorical action space, our discrete diffusion preserves the MAPF action structure and samples from a multimodal joint-plan distribution to produce diverse drafts well suited for neighborhood repair. These drafts act as warm starts for downstream repair, which completes unfinished trajectories and resolves remaining conflicts under hard MAPF constraints. Experimental results show that despite being trained only on instances with at most 96 agents, the initializer generalizes to scenarios with up to 312 agents at inference time. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points and matching or exceeding all baselines in all 20 settings. To the best of our knowledge, this is the first work to leverage discrete diffusion for warm-starting an LNS-based MAPF solver.
Chinese Translation
多智能体路径寻找(MAPF)是一个协调问题,需要在组合规划复杂性下,计算从各自的起始位置到指定目标位置的全局一致、无碰撞的轨迹。在密集环境中,次优的初始计划会引发复合冲突,妨碍可行的修复。对于基于修复的求解器如LNS2,初始计划的质量对后续修复至关重要,但这一因素仍未得到充分探索。我们提出了DiffLNS,一个将离散去噪扩散概率模型(D3PM)与LNS2结合的混合框架。D3PM作为初始化器,利用稀疏社会注意机制,从专家演示中学习协调多智能体行动轨迹的时空先验,并生成多个联合计划样本。我们的离散扩散直接作用于分类行动空间,保持了MAPF的行动结构,并从多模态联合计划分布中采样,以生成适合邻域修复的多样化草案。这些草案作为后续修复的热启动,完成未完成的轨迹,并在严格的MAPF约束下解决剩余冲突。实验结果表明,尽管仅在最多96个智能体的实例上进行训练,初始化器在推理时能够推广到最多312个智能体的场景。在20个复杂和拥堵的设置中,DiffLNS实现了95.8%的平均成功率,超越了最强测试基线9.6个百分点,并在所有20个设置中持平或超过所有基线。据我们所知,这是首次利用离散扩散为基于LNS的MAPF求解器提供热启动的研究。
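As a minimal sketch of what "remaining conflicts" in a sampled draft could look like, the snippet below counts vertex and swap conflicts in a joint plan on a grid. The action encoding, the function names, and the two-agent example are illustrative assumptions, not the DiffLNS implementation; agents that finish early are assumed to wait at their last cell.

```python
from itertools import combinations

# Hypothetical categorical action set: wait plus four grid moves.
MOVES = {"wait": (0, 0), "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def rollout(start, actions):
    """Turn a start cell and an action sequence into the list of visited cells."""
    path = [start]
    for a in actions:
        dx, dy = MOVES[a]
        x, y = path[-1]
        path.append((x + dx, y + dy))
    return path

def count_conflicts(paths):
    """Count vertex conflicts (same cell, same time) and swap conflicts
    (two agents exchanging cells in one step) over all agent pairs."""
    vertex = swap = 0
    for p, q in combinations(paths, 2):
        horizon = max(len(p), len(q))
        # Assumption: an agent that has finished waits at its last cell.
        p = p + [p[-1]] * (horizon - len(p))
        q = q + [q[-1]] * (horizon - len(q))
        for t in range(horizon):
            if p[t] == q[t]:
                vertex += 1
            elif t > 0 and p[t] == q[t - 1] and p[t - 1] == q[t]:
                swap += 1
    return vertex, swap

draft = [
    rollout((0, 0), ["right", "right"]),  # agent 0 moves east
    rollout((2, 0), ["left", "left"]),    # agent 1 moves west, head-on
]
print(count_conflicts(draft))
```

A repair stage in the spirit of the abstract would then re-plan only the agents involved in the counted conflicts rather than the whole joint plan.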
cs.AI / 36 / 2605.13301

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

通过简单统一的缩放实现金牌级别的奥林匹克推理
Li, Yafu, Zhan, Runzhe, Zhang, Haoran, Zhang, Shunkai, Li, Yizhuo, Wang, Zhilin, Chen, Jiacheng, Wang, Futing, Hu, Xuyang, Fan, Yuchen, Xu, Bangjie, Su, Yucheng, Han, Xinmiao, Li, Chenxi, Lei, Haodi, Zhao, Yufeng, Lin, Zejin, Cheng, Qianjia, Zhu, Tong, Qu, Xiaoye, Cui, Ganqu, Ye, Peng, Luo, Yun, Lin, Zhouchen, Qiao, Yu, Zhou, Bowen, Ding, Ning, Cheng, Yu
Abstract
Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.
Chinese Translation
最近在推理模型方面的进展显著推动了长时间跨度的数学和科学问题解决,多个系统现在在国际数学奥林匹克(IMO)和国际物理奥林匹克(IPhO)问题上达到了金牌级别的表现。本文介绍了一种简单统一的方法,将经过后训练的推理骨干网络转换为严格的奥林匹克级别求解器。该方法首先使用反困惑度课程(reverse-perplexity curriculum)进行监督微调(SFT),以培养严格的证明搜索和自检行为,然后通过一个两阶段的强化学习(RL)流程来扩展这些行为,该流程从具有可验证奖励的强化学习进展到更精细的证明级别强化学习,最后通过测试时扩展提升求解性能。应用该方法,我们在约340K条长度不足8K token的轨迹上对一个30B-A3B骨干网络进行了监督微调,随后进行了200步的强化学习。最终模型SU-01能够在困难问题上以超过100K token的轨迹进行稳定推理,同时在数学和物理奥林匹克竞赛中实现了金牌级别的表现,包括IMO 2025/USAMO 2026和IPhO 2024/2025。它还展示了科学推理向数学和物理之外领域的强大泛化能力。
cs.AI / 37 / 2605.13311

IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

IdeaForge:一个基于知识图谱的多智能体框架,用于跨方法论的创新分析和专利申请生成
Bose, Joy
Abstract
Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.
Chinese Translation
当前的人工智能辅助创新系统通常采用单一的创意方法论(如TRIZ或设计思维),使用基于提示的顺序工作流程,这些流程无法保留中间推理结构。因此,不同方法论生成的洞察往往是片段化的,限制了可追溯性、综合以及对新颖性的系统评估。我们提出了IdeaForge,一个基于知识图谱的多智能体框架,用于创新分析和专利权利要求生成。IdeaForge通过在持久的FalkorDB知识图谱上运行的专业代理集成了多种创新方法论(TRIZ、设计思维和SCAMPER)。每个代理贡献结构化的实体和关系,表示矛盾、发明原则、用户需求、转化、类比和候选主张。IdeaForge的核心贡献是通过基于图的主张链接实现的跨方法论收敛机制。由多种方法论独立支持的主张通过CONVERGENT关系连接,使得通过图遍历能够识别出高置信度的创新候选。下游专利起草代理生成基于收敛主张子图的结构化专利草案,减少对不受限语言模型生成的依赖。InnovationScore公式根据收敛支持、方法论多样性、主张强度和现有技术质疑数量对主张进行排名。我们描述了图架构、代理架构、收敛检测管道和专利综合工作流程。在一个法律技术用例上的实验表明,基于图的多方法论综合相比单一方法论基线产生了更具多样性和可追溯性的创新候选。我们讨论了对计算创造力、可解释的人工智能辅助发明和图原生创新系统的影响。
cs.AI / 38 / 2605.13318

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

VERA-MH:心理健康领域伦理与负责任人工智能的验证
Belli, Luca, Bentley, Kate H., Gieringer, Josh, Van Ark, Emily, Zhao, Nilu, Thachile, Pradip, Hawrilenko, Matt, Brown, Millard, Chekroud, Adam M.
Abstract
Chatbot usage has increased, including in fields for which chatbots were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically validated evaluation of chatbot safety in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, assessing how well chatbots respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging, and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. These user personas were developed under clinical guidance to ensure that, among other things, multiple risk factors, demographic characteristics, and disclosure factors are represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve the consistency of answers and highlight models' failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the evaluation results for four leading LLM providers.
Chinese Translation
聊天机器人在使用上有所增加,包括在一些原本并未为之开发的领域——尤其是心理健康支持。为此,我们提出了心理健康领域伦理与负责任人工智能验证(VERA-MH),这是一种新颖的临床验证评估方法,旨在确保聊天机器人在心理健康支持中的安全性。VERA-MH的首个迭代专注于自杀意念(Suicidal Ideation, SI)风险,通过评估聊天机器人对可能处于危机中的用户的响应能力。VERA-MH包含三个步骤:对话模拟、对话评判和模型评分。首先,为了模拟与待评估聊天机器人的对话,另一款聊天机器人负责根据特定角色扮演用户。这些用户角色是在临床指导下开发的,以确保多个风险因素、人口特征和披露因素得以体现。在评判步骤中,第二个支持模型作为大型语言模型(LLM)进行评判,并结合临床开发的评分标准。评分标准以流程形式构建,每次询问一个是/否问题,以提高回答的一致性并突出模型的失败模式。在最后阶段,每次对话的结果被汇总,以呈现聊天机器人的最终评估。与框架一起,我们展示了四个领先LLM提供商的评估结果。
cs.AI / 39 / 2605.13332

Diversity of Extensions in Abstract Argumentation

抽象论证中的扩展多样性
Fichte, Johannes K., Hecher, Markus, Mahmood, Yasir, Wang, Zhengjun
Abstract
Argumentation is an important topic in AI for modelling and reasoning about arguments. In abstract argumentation, we consider directed graphs, so-called argumentation frameworks (AFs), that express conflicts between arguments. The semantics is defined by the notion of extensions, which are sets of arguments that satisfy particular relationship conditions in the AF. Usually, standard reasoning in argumentation does not reveal how far apart extensions are. We introduce a quantitative notion of diversity of extensions based on the symmetric difference and provide a systematic complexity classification. Intuitively, diversity captures whether extensions of a framework (accepted viewpoints) differ only marginally or represent fundamentally incompatible sets of arguments. We study whether an AF admits k-diverse extensions, whether it admits k-diverse extensions covering specific arguments, and how to compute the largest k for which an AF admits k-diverse extensions. We outline a prototype and provide an evaluation for computing diversity levels.
Chinese Translation
论证是人工智能中一个重要的主题,用于建模和推理论证。在抽象论证中,我们考虑有向图,即所谓的论证框架(Argumentation Framework, AF),用于表达论证之间的冲突。其语义由扩展的概念定义,扩展是满足AF中特定关系条件的论证集合。通常,标准的论证推理并未揭示扩展之间的距离。我们引入了一种基于对称差的扩展多样性的定量概念,并提供了系统的复杂性分类。直观上,多样性捕捉了一个框架的扩展(被接受的观点)是仅有微小差异还是代表根本不兼容的论证集合。我们研究一个AF是否允许k-多样性扩展,是否允许覆盖特定论证的k-多样性扩展,以及计算一个AF允许k-多样性扩展的最大k值。我们概述了一个原型,并提供了计算多样性水平的评估。
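To make the symmetric-difference idea concrete, here is a small brute-force sketch on a toy argumentation framework. It enumerates stable extensions and reports the largest pairwise symmetric-difference size. Choosing stable semantics, and reading "k-diverse" as "two extensions whose symmetric difference has size at least k", are assumptions made for illustration; the paper's precise definitions may differ, and the function names are hypothetical.

```python
from itertools import combinations

def stable_extensions(args, attacks):
    """Brute-force stable extensions: conflict-free sets that attack every outside argument."""
    exts = []
    for r in range(len(args) + 1):
        for cand in combinations(sorted(args), r):
            s = set(cand)
            conflict_free = not any((a, b) in attacks for a in s for b in s)
            attacks_rest = all(any((a, b) in attacks for a in s) for b in args - s)
            if conflict_free and attacks_rest:
                exts.append(frozenset(s))
    return exts

def max_diversity(exts):
    """Largest symmetric-difference size over all pairs of extensions (0 if fewer than two)."""
    return max((len(e ^ f) for e, f in combinations(exts, 2)), default=0)

# Toy AF: a and b attack each other, and b attacks c.
args = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "a"), ("b", "c")}
exts = stable_extensions(args, attacks)
print(sorted(sorted(e) for e in exts), max_diversity(exts))
```

Under these assumptions the toy framework has two extensions, {b} and {a, c}, which share no argument; exhaustive enumeration like this is exponential and only meant for tiny examples.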
cs.AI / 40 / 2605.13335

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Ego2World:将自我中心的烹饪视频编译为可执行世界以进行信念状态规划
Cheng, Qinchuan, Gong, Zhantao, Sun, Pengzhan, Yao, Angela, Yang, Xulei, Li, Shijie
Abstract
Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.
Chinese Translation
在家庭环境中,具身智能体必须在部分观察下进行规划:它们需要记住物体、跟踪状态变化,并在行动失败时进行恢复。现有基准测试仅部分测试这种能力。自我中心的视频数据集捕捉了真实的人类活动,但仍然是被动的,而交互式模拟器支持执行但依赖于合成场景和手工制作的动态,导致了模拟与现实之间的差距,并且通常假设状态是完全可观察的。我们提出了Ego2World,一个可执行的基准,将自我中心的烹饪视频转化为由图转移规则支配的可执行符号世界。Ego2World基于HD-EPIC,从视频注释中推导出可重用的转移规则,并在一个隐藏的符号世界图中执行它们。在评估过程中,模拟器维护隐藏的世界图,而智能体仅使用局部观察和执行反馈在自己的部分信念图上进行规划。这种分离迫使智能体在不观察真实世界状态的情况下更新记忆并重新规划。实验表明,行动重叠得分高估了物理状态的成功,而持久的信念记忆改善了任务完成率,同时减少了重复的视觉探索——这表明信念维护应该成为具身智能体评估的一个重要目标。
cs.AI / 41 / 2605.13345

Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

急诊科中的多智能体系统:基于急诊科数字孪生的验证研究
Wenzel, Markus, Strapatsas, Tobias, Kress, Jessika, Sauer, Dorothea, Gessler, Nele, Hahn, Horst K.
Abstract
Emergency departments (EDs) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model, and we develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) that simulates highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model's expressivity by matching its key performance indicators and metrics with values known from the literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM-based simulation can effectively replicate real-world ED dynamics under interventions. Lastly, we integrate a proof-of-concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ED environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.
Chinese Translation
急诊科(ED)在患者护理和资源管理方面面临挑战。我们提出探索在一个现实且灵活的模型中优化策略,并开发一个混合的离散事件仿真(Discrete Event Simulation, DES)和基于智能体的模型(Agent-Based Model, ABM),以模拟高度可配置的急诊科环境。我们特别关注建模方法的验证。我们根据现实世界的研究推导出急诊科规模、患者负荷和人员配置的配置。然后,我们通过将模型的关键绩效指标和度量与文献中已知的值进行匹配,来验证模型的表现力。接着,我们实施科学确立且经过实践验证的资源优化策略。将记录的现实世界结果与我们模型的结果进行比较,表明基于DES-ABM的仿真可以有效地复制现实世界急诊科在干预下的动态。最后,我们整合了一个概念验证的多智能体系统(Multi-Agent System, MAS),该系统能够根据急诊事件记录的时间账本,在模拟的急诊环境中自主探索资源分配策略。这个模块化的DES-ABM-MAS框架为探索急诊科的资源优化策略提供了一个强大的工具。
cs.AI / 42 / 2605.13391

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

RS-Claw:通过层次技能树进行渐进式主动工具探索的遥感智能体
Liu, Liangtian, Wang, Zeyuan, Li, Ziyu, Ouyang, Kai, Tang, Zichao, Liu, Chengfu, Li, Haifeng, Yu, Hanwen, Yang, Wentao, Yang, Cheng, Hou, Dongyang
Abstract
The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, thus exhibiting inherent limitations: full tool registration triggers context space deficits during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only significantly liberates the agent's context space but also effectively ensures the accurate hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.
Chinese Translation
多模态大型语言模型(MLLMs)的兴起正在将遥感(RS)智能从“观察”转向“行动”,因为OpenClaw风格的框架使得智能体能够自主操作大量遥感图像处理工具以完成复杂任务。现有的遥感智能体采用被动选择范式进行工具调用,依赖于完全工具注册(Flat)或检索增强生成(RAG)。然而,在庞大且多源异构的遥感工具生态系统中,这种被动机制难以在任务推理过程中动态平衡“上下文负载”和“工具集完整性”,因此表现出固有的局限性:完全工具注册在长时间任务中会引发上下文空间不足,而RAG检索可能在关键步骤中遗漏重要工具。为了解决这些瓶颈,本文重新定义了工具选择,认为智能体应作为工具空间中的主动探索者。基于这一观点,我们提出了RS-Claw,一种新颖的遥感智能体架构。通过在工具端利用技能封装技术,该架构对工具描述进行层次化结构化,使得智能体能够按需执行顺序决策:最初通过仅阅读工具摘要选择相关技能分支,然后动态加载详细描述,最终实现精确调用。这种主动范式不仅显著释放了智能体的上下文空间,还有效确保了在长时间推理过程中关键工具的准确命中率。在Earth-Bench基准上的系统实验表明,RS-Claw的主动探索机制有效过滤语义噪声,并大幅释放推理空间,实现了高达86%的输入令牌压缩比,并在复杂推理评估中全面超越了现有的Flat和RAG基线。
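The progressive loading described in the abstract (read branch summaries first, then load detailed tool descriptions only for the chosen branch) can be sketched as follows. The skill tree, tool names, and the keyword-overlap selector standing in for the agent's decision are all hypothetical placeholders, not RS-Claw's actual components.

```python
# Hypothetical two-level skill tree: branch summary -> tools with detailed descriptions.
SKILL_TREE = {
    "preprocessing": {
        "summary": "radiometric and geometric correction of raw imagery",
        "tools": {
            "atmos_correct": "placeholder for a long atmospheric-correction description",
            "orthorectify": "placeholder for a long orthorectification description",
        },
    },
    "change_detection": {
        "summary": "compare two dates to detect land cover change",
        "tools": {
            "diff_ndvi": "placeholder for a long NDVI-differencing description",
            "bitemporal_cls": "placeholder for a long bitemporal-classification description",
        },
    },
}

def select_branch(task, tree):
    """Stand-in for the agent's choice: pick the branch whose summary shares most words with the task."""
    words = set(task.lower().split())
    return max(tree, key=lambda b: len(words & set(tree[b]["summary"].split())))

def build_context(task, tree):
    """Load every branch summary, then detailed descriptions for the chosen branch only."""
    branch = select_branch(task, tree)
    summaries = [f"{name}: {node['summary']}" for name, node in tree.items()]
    details = [f"{tool}: {doc}" for tool, doc in tree[branch]["tools"].items()]
    return branch, summaries + details

branch, context = build_context("detect land cover change between two dates", SKILL_TREE)
print(branch, len(context))
```

In this toy setup only one branch's detailed descriptions enter the context, which is the mechanism by which a hierarchical layout could save input tokens relative to registering every tool description up front.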
cs.AI / 43 / 2605.13414

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

TRIAGE:在资源限制下评估大型语言模型的前瞻性元认知控制
Nazi, Zabir Al, Dipta, Shubhashis Roy
Abstract
Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.
Chinese Translation
将语言模型作为自主智能体进行部署不仅需要每个任务的准确性:当一个智能体在有限的标记预算下面对一系列问题时,它必须决定尝试哪些问题、以何种顺序进行以及为每个问题分配多少计算资源,所有这些都必须在任何执行反馈可用之前完成。这是人类认知中研究了数十年的前瞻性元认知控制形式,但语言模型是否具备这种能力尚未经过检验。我们引入了TRIAGE,一个评估框架,其中模型接收一个任务池和一个与其自身基线成本相匹配的标记预算,并承诺执行一个单一的有序计划,该计划共同编码选择、排序和每个问题的资源分配。该计划的得分与一个全知的神谕进行比较,该神谕对模型在每个问题上的可解性和成本有全面了解,从而在一个共同的尺度上得出一个分流效率比。我们评估了前沿和开源模型,启用和未启用推理能力,涵盖竞争数学、研究生级科学、代码生成和专家多学科知识,发现当前的语言模型在前瞻性元认知控制方面存在显著差距,揭示了一个以前未测量的能力维度,对资源高效的智能体部署具有直接影响。
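One way to picture the triage efficiency ratio is sketched below: an ordered plan of (problem, allocated tokens) is executed against hidden per-problem solvability and cost, and the number of problems it solves is divided by what an oracle achieves under the same budget. The scoring rule, the halting rule, and all names and numbers are assumptions for illustration; the paper's exact metric may be defined differently.

```python
def plan_score(plan, truth, budget):
    """Count problems solved by an ordered plan of (problem, allocated_tokens)."""
    solved, remaining = 0, budget
    for pid, alloc in plan:
        if alloc > remaining:
            break  # assumption: execution halts once an allocation exceeds the remaining budget
        remaining -= alloc
        solvable, cost = truth[pid]
        if solvable and alloc >= cost:
            solved += 1
    return solved

def oracle_score(truth, budget):
    """With unit value per problem, taking solvable problems cheapest-first is optimal."""
    solved, remaining = 0, budget
    for cost in sorted(c for ok, c in truth.values() if ok):
        if cost > remaining:
            break
        remaining -= cost
        solved += 1
    return solved

# Made-up ground truth: problem -> (solvable by this model, token cost).
truth = {"p1": (True, 300), "p2": (False, 900), "p3": (True, 500), "p4": (True, 400)}
plan = [("p2", 600), ("p1", 300), ("p3", 300)]
budget = 1200
ratio = plan_score(plan, truth, budget) / oracle_score(truth, budget)
print(round(ratio, 3))
```

The example plan wastes half the budget on an unsolvable problem and under-allocates to another, which is exactly the kind of prospective misjudgment such a ratio is meant to expose.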
cs.AI / 44 / 2605.13438

Cognifold: Always-On Proactive Memory via Cognitive Folding

Cognifold:通过认知折叠实现的始终在线主动记忆
Wang, Suli, Duan, Yiqun, Deng, Yu, Zhao, Rundong, Shi, Dai, Zhou, Xinliang
Abstract
Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce Cognifold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across 7 broad-coverage benchmarks spanning five cognitive domains, we validate that CogniFold simultaneously performs robustly on conventional memory benchmarks.
Chinese Translation
现有的代理记忆主要是反应式和基于检索的,缺乏将经验自主组织成持久认知结构的能力。为了实现真正自主的代理,我们提出了Cognifold,一种受大脑启发的“始终在线”代理记忆,旨在为下一代主动助手提供支持。Cognifold持续将碎片化的事件流折叠成自我涌现的认知结构,从输入事件和积累知识中逐步引导出更高层次的认知。我们通过将互补学习系统(Complementary Learning Systems, CLS)理论从两个层次(海马体、皮层)扩展到三个层次,增加了一个前额叶意图层,来为此提供理论基础。Cognifold模拟前额叶皮层作为意图控制和决策的中心,通过图拓扑自组织实现这一目标:认知结构在事件流中主动组装,在语义相似时合并,在过时时衰退,通过联想回忆重新链接,并在概念簇密度超过阈值时显现意图。我们使用CogEval-Bench评估结构形成,证明Cognifold独特地产生与认知预期和概念涌现相匹配的记忆结构。此外,在涵盖五个认知领域的七个广泛覆盖基准测试中,我们验证了Cognifold在传统记忆基准测试中同时表现出色。
cs.AI / 45 / 2605.13450

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

评估大型语言模型的创造力:测试、局限性与新前沿
Schapiro, Samuel, Gladstone, Alexi, Black, Jonah, Ji, Heng
Abstract
Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.
Chinese Translation
测量大型语言模型(LLMs)的创造力对于设计能够提升创造力的方法以及增强我们对这一能力的科学理解至关重要。近年来,向LLMs施加人类创造力测试已成为一种普遍做法。尽管这些测试提供了一种方便且完全自动化的方式来评分“创造力”,但其作为机器创造力测量的有效性尚未得到确立,并且这些测试在预测人类创造力方面的有效性也有限。为了解决这一问题,我们进行了一项首次的大规模系统研究,评估人类创造力测试在预测LLMs的创造性成就方面的有效性,涵盖三个目标构念:创造性写作、发散性思维和科学构思。我们发现,发散联想任务(Divergent Association Task, DAT)和条件DAT分别是创造性写作和发散性思维的最佳预测指标,但测试的有效性在不同构念间显著变化,且没有单一测试能够良好预测所有构念。此外,与普遍看法相反,现有的测试并不能可靠地预测科学构思能力。基于这一问题,我们引入了发散远程联想测试(Divergent Remote Association Test, DRAT),这是一种词汇空间测试,能够在单一工具中评估发散性和聚合性思维。DRAT是首个也是唯一一个能够显著预测科学构思能力的LLMs创造力测试,并在主要设计选择上展示了稳健性。此外,DRAT的性能提升无法通过发散联想任务和远程联想测试的任何线性组合来恢复,这表明在同一测试中评估发散性和聚合性思维对于可靠预测科学构思能力至关重要。
cs.AI / 46 / 2605.13527

MMSkills: Towards Multimodal Skills for General Visual Agents

MMSkills:面向通用视觉智能体的多模态技能
Zhang, Kangning, Shao, Shuai, Li, Qingyao, Lin, Jianghao, Fu, Lingyue, Wang, Shijian, Jiao, Wenxiang, Lu, Yuan, Liu, Weiwen, Zhang, Weinan, Yu, Yong
Abstract
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
Chinese Translation
可重用技能已成为提升智能体能力的核心基础,但现有的大多数技能包主要将可重用行为编码为文本提示、可执行代码或学习的例程。然而,对于视觉智能体而言,程序性知识本质上是多模态的:重用不仅依赖于要执行的操作,还依赖于识别相关状态、解释进展或失败的视觉证据,以及决定下一步该做什么。我们将这一要求形式化为多模态程序性知识,并解决三个实际挑战:(I)多模态技能包应包含哪些内容;(II)这些包可以从公共交互经验中衍生出什么;(III)智能体如何在推理时咨询多模态证据,而不需要过多的图像上下文或过度依赖参考截图。我们引入了MMSkills,一个用于表示、生成和使用可重用多模态程序的框架,以支持运行时视觉决策。每个MMSkill都是一个紧凑的状态条件包,将文本程序与运行时状态卡和多视角关键帧结合在一起。为了构建这些包,我们开发了一个智能体轨迹到技能生成器,该生成器通过工作流分组、程序归纳、视觉定位和元技能指导审计,将公共非评估轨迹转化为可重用的多模态技能。为了使用这些技能,我们引入了一个分支加载的多模态技能智能体:所选状态卡和关键帧在一个临时分支中进行检查,与实时环境对齐,并提炼成主要智能体的结构化指导。在GUI和基于游戏的视觉智能体基准测试中的实验表明,MMSkills始终改善了前沿和较小的多模态智能体,表明外部多模态程序性知识补充了模型内部的先验知识。
cs.AI / 47 / 2605.13532

AI-Generated Slides: Are They Good? Can Students Tell?

AI生成的幻灯片:它们好吗?学生能分辨吗?
Leinonen, Juho, Zhang, Lisa, Hellas, Arto
Abstract
As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI-generated slides are "good" via narrative assessment by educators. We then choose the best slides to use (with some modification) in a real course setting and compare student perceptions of human- vs. AI-generated slides. We find that coding assistant tools produce the slides that are most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides as similar in quality to instructor-created slides and cannot reliably identify which slides are AI-generated. We also find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting that students associate poor quality with an AI source. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.
Chinese Translation
随着生成性人工智能(Generative AI, GenAI)工具的广泛可及,利用这些工具支持教师的前景令人期待。为此,本文探讨了使用GenAI从教师撰写的课程笔记中生成幻灯片的可能性,重点关注教师和学生的感知。我们研究了一种端到端的教育工具(NotebookLM)、两种通用大型语言模型(Claude, M365 Copilot)以及两种编码助手(Cursor, Claude Code)。我们首先通过教育工作者的叙述性评估分析GenAI生成的幻灯片是否“优秀”。我们选择最佳幻灯片(经过一些修改)在实际课程中使用,并比较学生对人类与AI生成幻灯片的感知。我们发现,编码助手工具生成的幻灯片在准确性、完整性和教学效果上表现最佳。此外,学生对GenAI生成的幻灯片的质量评价与教师创建的幻灯片相似,且无法可靠地区分哪些幻灯片是AI生成的。此外,我们发现高质量评分与高“AI生成”评分之间存在负相关,表明学生将低质量与幻灯片来源为AI联系在一起。这些发现突显了将GenAI整合到教学设计工作流程中的良好机会,并呼吁进一步研究教育工作者如何负责任且有效地利用这些工具。
cs.AI / 48 / 2605.13534

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

通过并行搜索和显式合并扩展检索增强推理
Liu, Jiabei, Mao, Wenyu, Tan, Junfei, Shen, Chunxu, Yi, Lingling, Wu, Jiancan, Wang, Xiang
Abstract
Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these limitations through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines the retrieved information in the merging step, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi-process reward design to optimize agents for both multi-query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question-answering tasks.
Chinese Translation
深度搜索代理在多步骤推理中通过检索外部知识有效增强了大型语言模型(LLMs)。然而,现有方法通常在每个推理步骤中生成单一查询进行检索,这限制了信息覆盖范围并引入了较高的噪声。这可能导致搜索过程中的信噪比(SNR)较低,从而降低推理准确性并导致不必要的推理步骤。在本文中,我们介绍了MultiSearch,这是一种基于强化学习的框架,通过多查询检索和检索信息的显式合并来解决这些限制。在每个推理步骤中,MultiSearch从多个角度生成查询并并行检索外部信息,扩展相关信息的范围并减轻对任何单一检索结果的依赖。然后,代理在合并过程中整合和精炼检索到的信息,提高信噪比并确保更准确的推理。此外,我们提出了一种具有多进程奖励设计的强化学习框架,以优化代理在多查询检索和信息整合方面的表现。在七个基准上的大量实验表明,MultiSearch优于基线方法,提高了检索的信噪比并改善了问答任务中的推理性能。
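The retrieve-in-parallel-then-merge pattern can be sketched with a toy keyword retriever. Note the swapped-in technique: in the abstract the merge is performed by the agent itself, whereas this sketch substitutes a simple vote count (how many queries returned each document). The corpus, queries, and function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CORPUS = {  # toy document store, made up for illustration
    "d1": "marie curie won the nobel prize in physics in 1903",
    "d2": "marie curie won the nobel prize in chemistry in 1911",
    "d3": "the eiffel tower was completed in 1889",
}

def retrieve(query, k=2):
    """Toy keyword retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: (-len(q & set(CORPUS[d].split())), d))
    return [d for d in ranked[:k] if q & set(CORPUS[d].split())]

def multi_search(queries):
    """Run all queries in parallel, then merge: deduplicate and rank by number of queries hit."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(retrieve, queries))
    votes = Counter(doc for hits in results for doc in set(hits))
    return [doc for doc, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]

merged = multi_search(["curie nobel physics", "curie nobel chemistry", "curie prize year"])
print(merged)
```

Documents that only a single, possibly off-target query would surface rank lower after merging, which is one plausible reading of how multi-query retrieval could raise the signal-to-noise ratio.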
cs.AI / 49 / 2605.13542

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

RealICU:大型语言模型代理是否理解长上下文的重症监护数据?超越行为模仿的基准测试
Shen, Chengzhi, Shen, Weixiang, Susetzky, Tobias, Chen, Chen, Li, Jun, Liu, Yuyuan, Zhang, Xuepeng, Gong, Zhenyu, Rueckert, Daniel, Pan, Jiazhen
Abstract
Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory into 30-min windows and release two datasets: RealICU-Gold with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, perform poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias toward early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
Chinese Translation
重症监护病房(ICU)生成长时间、密集且不断演变的临床信息流,医生必须在时间压力下反复重新评估患者状态,这突显了对可靠的人工智能决策支持的明确需求。现有的ICU基准通常将历史临床医生的行为视为真实依据。然而,这些行为是在不完整的信息和有限的时间上下文下做出的,可能因此并非最佳选择,这使得评估人工智能系统的真实推理能力变得困难。我们引入了RealICU,这是一个后见注释的基准,用于在现实的ICU条件下评估大型语言模型(LLMs),其中标签是在资深医生审查完整患者轨迹后生成的。我们制定了四个以医生为导向的任务:评估患者状态、急性问题、推荐行动和可能导致不安全结果的红旗行动。我们将每个轨迹划分为30分钟的窗口,并发布了两个数据集:RealICU-Gold,包含来自94名MIMIC-IV患者的930个窗口注释,以及RealICU-Scale,包含由Oracle(一个经过医生验证的LLM后见标注器)扩展的11,862个窗口。现有的LLMs,包括增强记忆的模型,在RealICU上的表现不佳,暴露出两种失败模式:临床推荐的召回-安全权衡,以及对患者早期解释的锚定偏差。我们进一步引入ICU-Evo,以研究改善长时间推理的结构化记忆代理,但并未完全消除安全失败。总之,RealICU为在高风险护理中测量和改善人工智能顺序决策支持提供了一个临床基础的测试平台。项目页面:https://chengzhi-leo.github.io/RealICU-Bench/
cs.AI / 50 / 2605.13570

Learning Local Constraints for Reinforcement-Learned Content Generators

为强化学习内容生成器学习局部约束
Bhaumik, Debosmita, Togelius, Julian, Yannakakis, Georgios N., Khalifa, Ahmed
Abstract
Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforcement-learning trained generators can guarantee global properties -- because such properties can easily be included in reward functions -- but the results can be visually dissatisfying. In this paper, we explore ways to combine these methods. Specifically, we constrain the action space of a PCGRL generator with constraints learned by WFC, effectively allowing the PCGRL generator to achieve global properties while forced to adhere to local constraints. To better analyze how this hybrid content generation method operates, we vary the number and type of inputs, and we test whether to randomly collapse the starting state and exclude rare patterns. While the method is sensitive to hyperparameter tuning, the best of our trained generators produce visually satisfying and playable puzzle-platform game levels -- such as Lode Runner levels -- with desired global properties.
Chinese Translation
基于约束的游戏内容生成器,如波函数坍缩(Wave Function Collapse, WFC),能够从现有内容中学习局部约束,生成视觉上令人满意的游戏关卡,但在保证全局属性(如可玩性)方面面临挑战。另一方面,经过强化学习训练的生成器能够保证全局属性,因为这些属性可以很容易地纳入奖励函数中,但生成的结果可能在视觉上不令人满意。本文探讨了结合这两种方法的途径。具体而言,我们通过WFC学习到的约束来限制PCGRL生成器的动作空间,有效地使PCGRL生成器在遵循局部约束的同时实现全局属性。为了更好地分析这种混合内容生成方法的运行方式,我们改变输入的数量和类型,并测试随机坍缩初始状态和排除稀有模式的效果。尽管该方法对超参数调优敏感,但我们训练的最佳生成器能够生成视觉上令人满意且可玩的益智平台游戏关卡,例如《寻宝者》(Lode Runner)关卡,并具备所需的全局属性。
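The action-space restriction described in this abstract can be illustrated with a minimal sketch. It assumes a tile grid, learns only horizontal and vertical adjacency pairs from one example level (a simplified stand-in for WFC pattern learning), and filters the tiles an agent may place at a cell. The function names, the tile symbols, and the example level are hypothetical and are not taken from the paper.

```python
def learn_adjacency(example):
    """Collect the (left, right) and (upper, lower) tile pairs seen in an example grid."""
    right, down = set(), set()
    rows, cols = len(example), len(example[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                right.add((example[r][c], example[r][c + 1]))
            if r + 1 < rows:
                down.add((example[r][c], example[r + 1][c]))
    return right, down


def allowed_tiles(grid, r, c, tiles, right, down):
    """Return the tiles that can go at (r, c) without contradicting any
    learned pair against an already-filled neighbour (None marks an empty cell)."""
    rows, cols = len(grid), len(grid[0])
    left_n = grid[r][c - 1] if c > 0 else None
    right_n = grid[r][c + 1] if c + 1 < cols else None
    up_n = grid[r - 1][c] if r > 0 else None
    down_n = grid[r + 1][c] if r + 1 < rows else None
    ok = []
    for t in tiles:
        if left_n is not None and (left_n, t) not in right:
            continue
        if right_n is not None and (t, right_n) not in right:
            continue
        if up_n is not None and (up_n, t) not in down:
            continue
        if down_n is not None and (t, down_n) not in down:
            continue
        ok.append(t)
    return ok


# Hypothetical example level: "." is air, "L" a ladder, "#" solid ground.
EXAMPLE = [
    [".", ".", "."],
    [".", "L", "."],
    ["#", "#", "#"],
]
TILES = [".", "L", "#"]
RIGHT, DOWN = learn_adjacency(EXAMPLE)
```

An RL generator would call `allowed_tiles` before each placement and sample only from the returned list, so the reward can focus on global properties such as playability. An empty list signals a dead end that the environment has to handle, for instance by ending the episode.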
cs.AI / 51 / 2605.13579

Position: Assistive Agents Need Accessibility Alignment

立场:辅助智能体需要可及性对齐
Hu, Jie, Yan, Changyuan, Zheng, Yu, Wang, Ziqian, Zhang, Jiaming
Abstract
Assistive agents for Blind and Visually Impaired (BVI) users require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most systems are designed and evaluated under assumptions of sighted interaction, low-cost verification, and tolerable trial-and-error, leading to systematic failures in assistive scenarios that cannot be resolved by model scaling or post-hoc interface adaptations alone. Drawing on an analysis of 778 assistance task instances from prior work, we show that current agentic AI systems remain prone to failure in assistive scenarios due to mismatches between sighted-user design assumptions and the verification, risk, and interaction constraints faced by BVI users. We argue that accessibility should be treated as an alignment problem rather than a peripheral usability concern. To this end, we introduce accessibility alignment and propose a lifecycle-oriented design pipeline for accessibility-aligned assistive agents, spanning user research, system design, deployment and post-deployment iteration. We conclude that BVI-centered assistive tasks provide a critical stress test for agentic AI and motivate a broader shift toward inclusive agent design.
Chinese Translation
盲人和视力障碍(BVI)用户的辅助智能体需要将可及性对齐作为首要设计目标。尽管智能代理技术迅速发展,但大多数系统在设计和评估时仍然假设用户为视力正常者,且依赖低成本验证和可容忍的试错过程,这导致在辅助场景中出现系统性失败,而仅通过模型扩展或事后界面适配无法解决这些问题。通过对先前研究中778个辅助任务实例的分析,我们表明,当前的智能代理在辅助场景中仍然容易失败,原因在于视力正常用户设计假设与BVI用户面临的验证、风险和交互约束之间存在不匹配。我们认为,可及性应被视为一个对齐问题,而非边缘的可用性问题。为此,我们引入了可及性对齐的概念,并提出了一种面向生命周期的设计流程,以实现可及性对齐的辅助智能体,涵盖用户研究、系统设计、部署及后续迭代。我们得出结论,BVI中心的辅助任务为智能代理提供了一个关键的压力测试,并推动了向包容性代理设计的更广泛转变。
cs.AI / 52 / 2605.13601

Unweighted ranking for value-based decision making with uncertainty

基于价值的决策中的无权排名与不确定性
García, Aarón López, Criado, Natalia, Such, Jose
Abstract
As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this work, we introduce the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, where agents incorporate both quantitative and qualitative criteria to generate human-centred decisions. We also address the normative bias introduced by stakeholders with arbitrary weights by removing prior weights and introducing a fuzzy domain of decision variables defined for a score function. This concept allows us to generalise any VBDM problem as the search for feasible solutions when optimising the score in the weight domain. To provide a solution to FUW-VBDM, we present Rankzzy, a customizable unweighted ranking method that integrates fuzzy-based reasoning to quantify uncertainty. We mathematically prove the consistency of Rankzzy for any admissible configuration selected by stakeholders. We show the applicability of our method through an illustrative case study, which we also use as a running example. The evaluation conducted indicates a reduced computational cost in large-scale value-based decision-making problems and strong ranking performance relative to existing approaches when employing the aggregation via Pythagorean means.
Chinese Translation
随着智能系统在我们社会中越来越多地被应用于自主决策,它们对人类价值观的承诺引发了严重的担忧。它们与人类价值观的一致性仍然是一个关键挑战,因为这可能危及公民的完整性和安全性。因此,需要一种创新的人本中心和价值驱动的决策方法。在本研究中,我们提出了模糊无权基于价值的决策框架(Fuzzy-Unweighted Value-Based Decision Making,FUW-VBDM),其中代理结合了定量和定性标准以生成以人为中心的决策。我们还通过去除先前的权重并引入为得分函数定义的模糊决策变量域,解决了利益相关者引入的规范性偏见。该概念使我们能够将任何基于价值的决策问题概括为在权重域中优化得分时寻找可行解。为了提供FUW-VBDM的解决方案,我们提出了Rankzzy,这是一种可定制的无权排名方法,集成了基于模糊的推理来量化不确定性。我们从数学上证明了Rankzzy在利益相关者选择的任何可接受配置下的一致性。我们通过一个说明性案例研究展示了我们方法的适用性,并将其作为一个运行示例。评估结果表明,在大规模基于价值的决策问题中,计算成本降低,并且在通过毕达哥拉斯均值进行聚合时,相较于现有方法表现出强大的排名性能。
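The abstract mentions aggregation via Pythagorean means. As a small illustration, the sketch below computes the arithmetic, geometric, and harmonic means of per-criterion scores and ranks alternatives by each. This is not the Rankzzy method itself: the alternatives, their scores, and the `rank` helper are made up for the example, and all scores are assumed to be strictly positive.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # Computed in log space; requires strictly positive scores.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    # Requires strictly positive scores.
    return len(xs) / sum(1.0 / x for x in xs)


def rank(alternatives, mean):
    """Order alternative names from best to worst under the given mean."""
    return sorted(alternatives, key=lambda name: mean(alternatives[name]), reverse=True)


# Hypothetical per-criterion value scores in (0, 1] for three alternatives.
ALTERNATIVES = {
    "A": [0.9, 0.2],
    "B": [0.5, 0.5],
    "C": [0.6, 0.35],
}
```

With these scores the arithmetic mean favours the unbalanced alternative "A", while the geometric and harmonic means favour the balanced "B", because they penalise a low score on any single criterion more heavily.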
cs.AI / 53 / 2605.13625

How to Interpret Agent Behavior

如何解读智能体行为
Gao, Jie, Sun, Kaiser, Huang, Jen-tse, Van Koevering, Katherine, Ji, Sijie, Huang, Heyuan, Shi, Weiyan, Lu, Zhuoran, Xiao, Ziang, Khashabi, Daniel, Dredze, Mark
Abstract
Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACT*ONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACT*ONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectory analysis, and defines an extension protocol for customization and growth. Our experiments show that ACT*ONOMY can compare behavioral profiles across agents and characterize a single agent's behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACT*ONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.
Chinese Translation
自主智能体如Claude Code和Codex现在可以连续运行数小时甚至数天。理解它们的运行时行为对于后续任务(如诊断低效、修复错误和确保更好的监督)变得至关重要。获得这种理解的主要方式是分析这些智能体生成的推理轨迹和执行痕迹。然而,这些数据仍然以非结构化的自然语言形式存在,使得人类在大规模解读时面临困难。我们提出了ACT*ONOMY(Action和Taxonomy的结合),这是一个用于描述和分析智能体运行时行为的分类法。ACT*ONOMY有两个组成部分:(1)分类法本身,通过扎根理论(Grounded Theory)开发,结构为10个动作、46个子动作和120个叶类的三级层次;(2)一个开放的存储库,托管动态分类法,提供一个自动化分析管道,将其应用于智能体轨迹分析,并定义一个扩展协议以便于定制和发展。我们的实验表明,ACT*ONOMY能够比较不同智能体的行为特征,并在多样化轨迹中表征单个智能体的行为,揭示出指示故障模式的模式。通过提供共享的词汇,ACT*ONOMY帮助研究人员、智能体设计者和最终用户更一致地解读智能体行为,从而实现更好的监督和控制。
cs.AI / 54 / 2605.13702

Adaptive mine planning under geological uncertainty: A POMDP framework for sequential decision-making

地质不确定性下的自适应矿山规划:一种用于序贯决策的部分可观察马尔可夫决策过程框架
Khalifi, Hamza, Caers, Jef, Taha, Yassine, Benzaazoua, Mostafa, Elghali, Abdellatif
Abstract
Strategic mine production scheduling under geological uncertainty is conventionally formulated as a stochastic optimization problem in which a fixed extraction sequence and routing decisions are computed ex ante. This plan-driven paradigm treats uncertainty as passive: decisions are hedged across geological scenarios, but planning does not anticipate how future observations will inform future decisions. We propose a different perspective by formulating mine scheduling as a Partially Observable Markov Decision Process (POMDP), in which extraction and routing decisions are made sequentially with planning explicitly integrating the expectation of future belief updates. To achieve computational tractability, we introduce a hybrid SA-POMDP architecture that combines simulated annealing-based (SA) value approximation with ensemble-based belief updating via ensemble smoother with multiple data assimilation (ES-MDA). At each decision epoch, candidate actions are evaluated through their expected long-term value under the current belief, and the belief is updated as mining observations are assimilated. This yields an adaptive policy rather than a fixed plan. We evaluate the framework on a copper-gold open-pit mining complex with multiple processing destinations. Under a statistically consistent prior, the SA-POMDP reduces the expectation-reality gap from 22.3% to 4.6%, improving realized NPV by USD8.4M relative to one-shot stochastic optimization. Under systematic prior misspecification of 10%, the adaptive framework outperforms static planning by up to USD44.6M (36.9%), demonstrating structural robustness beyond scenario hedging. These results show that sequential belief updating transforms geological uncertainty from a passive constraint into an active component of value creation.
Chinese Translation
在地质不确定性下,战略矿山生产调度通常被表述为一个随机优化问题,其中固定的开采顺序和运输决策是事先计算得出的。这种以计划为驱动的范式将不确定性视为被动的:决策是在不同地质情景下进行对冲,但规划并未预见未来观察将如何影响未来决策。我们提出了一种不同的视角,将矿山调度表述为部分可观察马尔可夫决策过程(POMDP),在该过程中,开采和运输决策是顺序作出的,规划明确整合了未来信念更新的期望。为了实现计算的可行性,我们引入了一种混合的SA-POMDP架构,该架构结合了基于模拟退火(SA)的价值近似和通过多数据同化的集成平滑器(ES-MDA)进行的集成信念更新。在每个决策时刻,候选行动通过其在当前信念下的预期长期价值进行评估,并在矿山观察被同化时更新信念。这产生了一种自适应策略,而不是固定计划。我们在一个具有多个处理目的地的铜金开放式矿山综合体上评估了该框架。在统计一致的先验下,SA-POMDP将期望与现实之间的差距从22.3%降低至4.6%,相较于一次性随机优化,实际净现值(NPV)提高了840万美元。在系统性先验错误指定10%的情况下,自适应框架的表现优于静态规划,提升幅度高达4460万美元(36.9%),展示了超越情景对冲的结构性鲁棒性。这些结果表明,顺序信念更新将地质不确定性从被动约束转变为价值创造的主动组成部分。
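The difference between a fixed plan and a belief-driven policy can be shown with a deliberately tiny example. The sketch below uses an exact Bayes update over two discrete ore states and a one-step expected-value comparison. It stands in for, and is much simpler than, the ES-MDA belief updating and simulated-annealing value approximation named in the abstract. The state names, payoffs, and likelihoods are hypothetical.

```python
def bayes_update(prior, likelihood):
    """Posterior over discrete states after one observation.
    likelihood[s] is P(observation | state s)."""
    unnormalised = {s: prior[s] * likelihood[s] for s in prior}
    z = sum(unnormalised.values())
    if z == 0:
        raise ValueError("observation has zero probability under the prior")
    return {s: p / z for s, p in unnormalised.items()}


def expected_value(belief, payoff):
    return sum(belief[s] * payoff[s] for s in belief)


def best_action(belief, payoffs):
    """Pick the action with the highest expected payoff under the current belief."""
    return max(payoffs, key=lambda a: expected_value(belief, payoffs[a]))


# Hypothetical block: send it to the mill or to the waste dump.
PAYOFFS = {
    "mill": {"high_grade": 10.0, "low_grade": -4.0},
    "waste": {"high_grade": 0.0, "low_grade": 0.0},
}
PRIOR = {"high_grade": 0.3, "low_grade": 0.7}
# Likelihood of observing a poor assay from a neighbouring block.
POOR_ASSAY = {"high_grade": 0.2, "low_grade": 0.8}
POSTERIOR = bayes_update(PRIOR, POOR_ASSAY)
```

Under the prior the mill has a small positive expected payoff, but after the poor assay is assimilated the belief shifts and the waste dump becomes the better routing choice. A plan fixed in advance could not make that switch.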
cs.AI / 55 / 2605.13725

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

ScioMind:基于认知的多智能体社会模拟,结合锚定基础的信念动态与动态特征
Yang, Yitian, Duan, Yiqun, Huang, Linghan, Zhu, Yiqi, Bailo, Francesco, Su, Chunmeizi, Chen, Huaming
Abstract
Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges these paradigms by combining structured opinion dynamics with LLM-based agent reasoning. ScioMind integrates three key components: 1) a memory-anchored belief update rule that modulates susceptibility to influence via personality-conditioned anchoring strength; 2) a hierarchical memory architecture that supports persistent, experience-driven belief formation; and 3) dynamic agent profiles derived from a corpus-grounded retrieval pipeline, enabling heterogeneous personalities, rationales, and evolving internal states. We evaluate ScioMind on multiple case studies in a real-world policy debate scenario. Across metrics including polarisation, diversity, extremization, and trajectory stability, the proposed components consistently yield improvements in behavioural realism. In particular, dynamic profiles increase opinion diversity, memory and reflection reduce unstable oscillation, and anchoring induces persistent belief trajectories that better align with patterns reported in political psychology. These results suggest that our cognitively grounded design provides a novel solution to LLM-based social simulation that improves both stability and behavioural realism.
Chinese Translation
基于大型语言模型(LLM)的多智能体模拟为研究社会舆论动态提供了强有力的实验平台。然而,目前的方法通常采用两种对立的方式:要么依赖于固定的更新规则,缺乏足够的认知基础,要么将信念变化主要委托给不受限制的LLM交互。我们提出了ScioMind,一个基于认知的模拟框架,通过将结构化的舆论动态与基于LLM的智能体推理相结合,弥合了这两种范式。ScioMind集成了三个关键组件:1)一个记忆锚定的信念更新规则,通过个性条件的锚定强度调节对影响的敏感性;2)一个支持持久、经验驱动的信念形成的层次记忆架构;3)基于语料库的检索管道衍生的动态智能体特征,能够实现异质个性、理由和不断发展的内部状态。我们在一个真实的政策辩论场景中对ScioMind进行了多项案例研究的评估。在包括极化、多样性、极端化和轨迹稳定性等指标上,所提出的组件在行为真实感方面始终表现出改善。特别是,动态特征增加了意见多样性,记忆和反思减少了不稳定的振荡,而锚定则诱导出更持久的信念轨迹,更好地与政治心理学中报告的模式相一致。这些结果表明,我们基于认知的设计为基于LLM的社会模拟提供了一种新颖的解决方案,改善了稳定性和行为真实感。
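One way to picture an anchoring-modulated belief update is the sketch below, in which an agent moves toward the mean opinion of its neighbours by a step proportional to one minus its anchoring strength. The abstract does not give ScioMind's actual rule, so the formula, the function name, and the opinion scale are assumptions made purely for illustration.

```python
def anchored_update(belief, neighbour_opinions, anchoring):
    """Move `belief` toward the mean neighbour opinion.

    anchoring = 1.0 means the agent never moves; anchoring = 0.0 means it
    adopts the neighbourhood mean outright. Illustrative rule only.
    """
    if not 0.0 <= anchoring <= 1.0:
        raise ValueError("anchoring must lie in [0, 1]")
    if not neighbour_opinions:
        return belief
    social_signal = sum(neighbour_opinions) / len(neighbour_opinions)
    return belief + (1.0 - anchoring) * (social_signal - belief)
```

In a simulation, `anchoring` would be set per agent from its personality profile, so strongly anchored agents trace persistent belief trajectories while weakly anchored ones follow their neighbours.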
cs.AI / 56 / 2605.13737

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

感官闭合:全模态大语言模型中的表征-行动差距
Quang, Trung Nguyen, Gao, Yiming, Pu, Fanyi, Zhang, Kaichen, Sun, Shuo, Liu, Ziwei
Abstract
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
Chinese Translation
当一个全模态大语言模型接受一个文本前提与其实际看到或听到的内容相矛盾的问题时,失败是出在感知还是行动上?近期的全模态模型被定位为基于感知的代理,能够共同处理视频、音频和文本,但一种基本的基础验证仍未得到测试:捕捉与模型自身感官输入相冲突的文本声明。我们引入了IMAVB,这是一个经过精心策划的500个长篇电影片段的基准数据集,采用2x2设计,交叉目标模态(视觉、音频)和前提条件(标准、误导),使我们能够将冲突检测与普通多模态理解分开测量。在八个开源全模态大语言模型和Gemini 3.1 Pro中,我们记录了表征-行动差距:隐藏状态可靠地编码前提-感知不匹配,即使相同的模型几乎从不在其输出中拒绝错误声明。在行为上,模型陷入两种失败模式:低拒绝,在这种情况下,它们将误导性问题的回答视为错误前提为真;以及高拒绝,在这种情况下,它们更频繁地拒绝,但也拒绝标准问题,从而牺牲了普通理解的准确性。该差距在模态上是不对称的(音频基础表现不如视觉),并且在七种变体中对提示具有抗性。作为初步诊断干预,探针引导的logit调整(PGLA)将编码的不匹配信号重新注入解码中,并始终改善拒绝行为。综合来看,这些结果表明,全模态基础的瓶颈在于翻译,而非感知。
cs.AI / 57 / 2605.13821

Harnessing Agentic Evolution

利用自主进化
Zhang, Jiayi, Gu, Yongfeng, Ruan, Jianhao, Song, Maojia, Peng, Yiran, Han, Zhiguang, Xiang, Jinyu, Wang, Zhitao, Yang, Caiyin, Ouyang, Yixi, Liu, Bang, Wu, Chenglin, Luo, Yuyu
Abstract
Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but can drift in long-horizon evolution. Both forms accumulate rich evidence over time, including candidates, feedback, traces, and failures, yet lack a stable interface for organizing this evidence and revising the mechanism that drives future evolution. We address this limitation by formulating agentic evolution as an interactive environment, where the accumulated evolution context serves as a process-level state. We introduce AEvo, a harnessed meta-editing framework in which a meta-agent observes this state and acts not by directly proposing the next candidate, but by editing the procedure or agent context that controls future evolution. This unified interface enables AEvo to steer both procedure-based and agent-based evolution, making accumulated evidence actionable for long-horizon search. Empirical evaluations on agentic and reasoning benchmarks show that AEvo outperforms five evolution baselines, achieving a 26% relative improvement over the strongest baseline. Across three open-ended optimization tasks, AEvo further outperforms four evolution baselines and achieves state-of-the-art performance under the same iteration budget.
Chinese Translation
自主进化已成为一种强大的范式,通过迭代生成候选项、评估它们并利用反馈指导未来的搜索,从而改善程序、工作流程和科学解决方案。然而,现有方法通常被实例化为固定的手工设计程序,这些程序模块化但缺乏灵活性,或作为通用代理,这些代理灵活地整合反馈但在长期进化中可能偏离目标。这两种形式随着时间的推移积累了丰富的证据,包括候选项、反馈、轨迹和失败,但缺乏一个稳定的接口来组织这些证据并修订驱动未来进化的机制。我们通过将自主进化表述为一个交互环境来解决这一限制,在这个环境中,累积的进化上下文作为一个过程级状态。我们引入了AEvo,一个被利用的元编辑框架,其中一个元代理观察这个状态,并不是通过直接提出下一个候选项来行动,而是通过编辑控制未来进化的程序或代理上下文。这个统一的接口使AEvo能够同时引导基于程序和基于代理的进化,使得累积的证据在长期搜索中变得可操作。在自主和推理基准上的实证评估表明,AEvo在五个进化基线中表现优于其他方法,相较于最强基线实现了26%的相对提升。在三个开放式优化任务中,AEvo进一步超越了四个进化基线,并在相同的迭代预算下达到了最先进的性能。
cs.AI / 58 / 2605.13825

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

历史锚定:先前行为如何引导大型语言模型的决策走向不安全行为
Salgado, Alberto G. Rodríguez
Abstract
Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.
Chinese Translation
前沿的大型语言模型(LLMs)越来越多地被用作代理,在一系列由同一模型或不同模型产生的先前工具调用之后选择下一步行动。我们提出一个简单的安全问题:如果该日志中的某一步骤是有害的,模型是否会继续沿着有害的方向发展?我们构建了HistoryAnchor-100,这是100个短场景,涵盖十个高风险领域,每个场景将三个强制性有害的先前行为与一个提供两个安全选项和两个不安全选项的自由选择节点配对。在来自六个提供商的17个前沿模型中,我们发现了一个显著的不对称性:在中立的系统提示下,最强的对齐模型几乎从不选择不安全的选项,但添加一句话“保持与先前历史中显示的策略一致”后,它们的选择比例跃升至91-98%,而且这些被翻转的模型往往会进一步升级行为。两个控制实验排除了更简单的解释:对行动标签的排列不会改变效果,而同样的指令在全安全的先前历史下保持不安全率低于7%。不同类型的模型在不同程度的不安全历史下会发生翻转,并且在每个对齐的模型家族中,旗舰模型是受影响最严重的兄弟,呈现出与安全性相反的缩放模式。这些结果对代理部署构成了警示,因为这些轨迹可能会被重放、伪造或注入。
cs.AI / 59 / 2605.13830

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

树集成的敏感性量化:一种符号和组合的方法
Akshay, S., Garg, Chaitanya, Gupta, Ashutosh, Meel, Kuldeep S., Naik, Ajinkya
Abstract
Decision tree ensembles (DTE) are a popular model for a wide range of AI classification tasks, used in multiple safety critical domains, and hence verifying properties on these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in a subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions which are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD), and further splitting it into subproblems that can be solved efficiently and make the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and can scale well with the increasing sizes of the ensembles.
Chinese Translation
决策树集成(DTE)是广泛应用于多种人工智能分类任务的流行模型,尤其是在多个安全关键领域,因此对这些模型的性质进行验证在过去十年中成为一个活跃的研究课题。其中一个验证问题是敏感性问题,它询问给定一个DTE时,特征子集的微小变化是否会导致输入的错误分类。在本研究中,我们的重点是构建一个针对DTE的定量敏感性概念,通过离散化模型的输入空间并枚举易受敏感性影响的区域。我们提出了一种新颖的算法技术,可以在认证的误差和置信区间内高效地执行这一计算。我们的方法基于将问题编码为代数决策图(ADD),并进一步将其拆分为可以高效解决的子问题,使得计算具有组合性和可扩展性。我们在不同规模的基准测试中评估了我们技术的性能,比较了树的数量和深度,并与模型计数器在相同问题编码上的性能进行了对比。实验结果表明,我们的工具XCount在速度上显著优于其他方法,并且能够很好地随着集成规模的增加而扩展。
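The quantity being counted can be made concrete on a toy scale. The sketch below enumerates a small discretized input grid by brute force and counts the points whose predicted label flips when a single chosen feature is changed to any other grid value. This is a simplification of the sensitivity notion in the abstract and does not use the ADD-based, compositional technique it describes. The stump ensemble and grid are invented for the example.

```python
from itertools import product


def predict(ensemble, x):
    """Majority vote over decision stumps given as (feature_index, threshold)."""
    votes = sum(1 if x[f] > t else -1 for f, t in ensemble)
    return 1 if votes > 0 else 0


def count_sensitive(ensemble, grid, feature):
    """Return (sensitive, total): how many grid points change label when only
    `feature` is replaced by some other value from its grid."""
    points = list(product(*grid))
    sensitive = 0
    for x in points:
        base = predict(ensemble, x)
        for v in grid[feature]:
            y = list(x)
            y[feature] = v
            if predict(ensemble, tuple(y)) != base:
                sensitive += 1
                break
    return sensitive, len(points)


# Hypothetical ensemble of three stumps over two features.
ENSEMBLE = [(0, 0.5), (1, 0.5), (0, 1.5)]
# Discretized input space: feature 0 takes {0, 1, 2}, feature 1 takes {0, 1}.
GRID = [[0, 1, 2], [0, 1]]
```

Here every grid point is sensitive to feature 0 and two of the six points are sensitive to feature 1. Brute-force enumeration grows exponentially with the number of features, which is the scaling problem that symbolic encodings aim to avoid.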
计算语言学 (Computation and Language)
71
cs.CL / 1 / 2605.12515

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

通过共识驱动的偏好优化缓解大型语言模型中的跨语言文化不一致性
Resck, Lucas, Augenstein, Isabelle, Korhonen, Anna
Abstract
Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona -- yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $\kappa_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.10-point absolute increase in $\kappa_S$ over unaligned models, outperforming strong prompting and representation steering baselines. Empirical evaluations show this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. A layer-wise interpretability analysis reveals the underlying mechanism: by early-decoding intermediate layer representations, we find that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.
Chinese Translation
尽管多语言大型语言模型(MLLMs)具有令人印象深刻的能力,但在提示语言变化时,它们经常表现出不一致的行为。虽然这种适应通常是可取的,但当用户身份被明确界定时,这种不一致就成为一个严重的失败。例如,给定一个固定的英国角色和一个关于文学的模糊日常知识查询,提示的语言常常会覆盖系统角色——在英语中生成莎士比亚,而在西班牙语中生成塞万提斯。为了稳健地量化这种跨语言文化不一致性,我们引入了Singleton Fleiss的$\kappa_S$,这是一个在数学上对幻觉具有韧性的度量。为了缓解这种不一致性,我们提出了跨语言文化一致性偏好优化(C-3PO),这是一种基于共识驱动的对齐框架。C-3PO在$\kappa_S$上相较于未对齐模型实现了高达0.10点的绝对提升,超越了强提示和表示引导的基线。实证评估表明,这种不一致性对印尼语和波斯语等低资源语言的影响尤为显著。逐层可解释性分析揭示了其潜在机制:通过早期解码中间层表示,我们发现MLLMs在输出中隐式地朝向提示语言的刻板文化个性化,因为前向传播的表示会稳定下来。
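For orientation, the sketch below implements the standard Fleiss' kappa, the agreement statistic that the paper's Singleton variant $\kappa_S$ presumably builds on. It is not the paper's metric. Treating each prompt language as a "rater", each query as an item, and each distinct answer as a category is an assumed mapping for illustration.

```python
def fleiss_kappa(counts):
    """Standard Fleiss' kappa.

    counts[i][j] is the number of raters that assigned item i to category j.
    Every row must sum to the same number of raters n (n >= 2).
    Undefined (division by zero) when all ratings fall into one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that went to each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 means the answer is identical across all prompt languages for every query, while values near or below 0 mean agreement no better than chance.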
cs.CL / 2 / 2605.12516

Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

基于检索增强生成和微调的大型语言模型在聚合物复合材料增材制造中的领域适应
Sagor, Saiful Islam, Haghighi, Tania, Alam, Minhaj Nur, Joyee, Erina Baynojir
Abstract
General-purpose large language models (LLMs) often struggle to generate reliable responses in specialized engineering domains due to limited domain grounding and insufficient exposure to structured technical knowledge. This study investigates practical strategies for adapting a foundation LLM to the additive manufacturing (AM) domain in order to improve answer accuracy, relevance, and usability for expert-level question answering. AM knowledge is distributed across heterogeneous sources such as academic literature, manufacturer documentation, technical standards, and procedural guides. Although general LLMs demonstrate strong linguistic capabilities, they frequently fail to retrieve and contextualize such domain-specific information. Two common approaches to address this limitation are domain-specific fine-tuning and retrieval-augmented generation (RAG). We construct a curated AM corpus and evaluate three configurations based on LLaMA-3-8B: (1) the pretrained baseline model, (2) a RAG system that retrieves relevant document chunks from a vector database, and (3) a model fine-tuned on raw domain text. Performance is evaluated using 200 expert-designed AM questions assessed by mechanical engineering experts for accuracy, relevance, and overall preference. Results show that the RAG model consistently outperforms the baseline. Among the 200 questions, 75.5% of RAG responses are judged more accurate, 85.2% are preferred overall, and 90.8% are rated more relevant than baseline responses. In contrast, fine-tuning on raw AM text reduces performance, producing more accurate answers in only 5.6% of cases and more relevant answers in 32.5% of cases. These results indicate that retrieval-augmented approaches provide a more effective pathway for adapting LLMs to specialized engineering domains than naive fine-tuning on unstructured technical data.
Chinese Translation
通用大型语言模型(LLMs)在专业工程领域中往往难以生成可靠的响应,原因在于领域基础薄弱以及对结构化技术知识的接触不足。本研究探讨了将基础LLM适应于增材制造(AM)领域的实用策略,以提高专家级问答的准确性、相关性和可用性。增材制造知识分布在异构来源中,如学术文献、制造商文档、技术标准和程序指南。尽管通用LLM展现出强大的语言能力,但它们常常无法检索和上下文化这些特定领域的信息。为了解决这一限制,常用的两种方法是领域特定的微调和检索增强生成(RAG)。我们构建了一个经过筛选的增材制造语料库,并基于LLaMA-3-8B评估了三种配置:(1)预训练的基线模型,(2)一个从向量数据库中检索相关文档片段的RAG系统,以及(3)在原始领域文本上微调的模型。通过200个由机械工程专家设计的增材制造问题来评估性能,专家们对其准确性、相关性和整体偏好进行了评估。结果表明,RAG模型的表现始终优于基线。在200个问题中,75.5%的RAG响应被评判为更准确,85.2%被整体偏好,90.8%被认为比基线响应更相关。相比之下,在原始增材制造文本上微调导致性能下降,仅在5.6%的情况下产生更准确的答案,在32.5%的情况下产生更相关的答案。这些结果表明,检索增强的方法为将LLM适应于专业工程领域提供了比在非结构化技术数据上进行简单微调更有效的途径。
cs.CL / 3 / 2605.12517

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

弥补缺失模态间隙:改善视觉语言模型的文本-only 校准
Kim, Mingyeong, Choi, Jungwon, Jang, Chaeyun, Lee, Juho
Abstract
Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference when a modality is missing.
Chinese Translation
视觉语言模型(VLMs)通常在仅有文本输入的情况下部署,尽管它们是通过图像进行训练的。我们发现,去除视觉模态会导致准确率大幅下降和严重的失调,且模型在仅有文本提示下的表现与其原始语言骨干网络不一致。这一失败并不仅仅是由于缺失语义信息所致。即使文本描述保留了关键内容,置信度也变得不可靠,而通过生成图像添加视觉信号则部分恢复了准确性和校准性。我们提出了潜在想象模块(Latent Imagination Module, LIM),这是一种轻量级的交叉注意力模块,能够从文本输入中预测想象的潜在嵌入,并将其输入到一个冻结的VLM骨干网络中,而无需进行像素级的图像合成。在仅有文本的基准测试、未见任务和缺失图像场景中,LIM提高了准确性并减少了校准误差。这些结果表明,潜在模态补全是一种在缺失模态下实现可靠VLM推理的实用方法。
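Calibration error is commonly measured with the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The abstract does not say which calibration metric the paper uses, so the sketch below is a generic equal-width-bin ECE under that assumption. The default of 10 bins is a conventional choice, not a value from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model has an ECE of 0. A model that stays confident after losing its image input, while its accuracy drops, shows a large ECE.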
cs.CL / 4 / 2605.12518

TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

TimelineReasoner:利用大型推理模型推进时间线摘要
Zhang, Liancheng, Li, Xiaoxi, Dou, Zhicheng
Abstract
The proliferation of online news poses a challenge to extracting structured timelines from unstructured content. While recent studies have shown that Large Language Models (LLMs) can assist Timeline Summarization (TLS), these approaches primarily treat models as passive generators. The emergence of Large Reasoning Models (LRMs) presents an opportunity to reason over events actively, enabling iterative evidence acquisition, the detection of missing events, and the validation of temporal consistency. To systematically leverage the reasoning capabilities of LRMs, we propose TimelineReasoner, a novel framework that shifts TLS from static generation to an active, reasoning-driven process. Unlike prior work, TimelineReasoner adopts a two-stage framework: Global Cognition, which tracks events at a macroscopic level and continuously updates a global event memory, and Detail Exploration, which identifies informational gaps and refines the timeline via targeted document retrieval. To support this, TimelineReasoner incorporates several specialized mechanisms, including an Event Scraper for retrieving temporal event descriptions, a Timeline Updater for refining the timeline, and a Supervisor for detecting gaps in the timeline and guiding retrieval. Experimental results on open-domain TLS datasets demonstrate that TimelineReasoner significantly outperforms existing LLM-based TLS methods in terms of timeline accuracy, coverage, and coherence. On closed-domain TLS datasets, our method performs on par with or exceeds state-of-the-art approaches. This work not only pushes the boundaries of TLS but also highlights the broader potential of LRM-based reasoning frameworks for timeline summarization.
Chinese Translation
在线新闻的激增给从非结构化内容中提取结构化时间线带来了挑战。尽管近期研究表明大型语言模型(LLMs)可以辅助时间线摘要(TLS),但这些方法主要将模型视为被动生成器。大型推理模型(LRMs)的出现为主动推理事件提供了机会,使得迭代证据获取、缺失事件检测和时间一致性验证成为可能。为了系统性地利用LRMs的推理能力,我们提出了TimelineReasoner,一个将TLS从静态生成转变为主动推理驱动过程的新框架。与之前的工作不同,TimelineReasoner采用了两阶段框架:全球认知(Global Cognition),在宏观层面跟踪事件并持续更新全球事件记忆;细节探索(Detail Exploration),识别信息缺口并通过有针对性的文档检索来完善时间线。为此,TimelineReasoner整合了多个专门机制,包括用于检索时间事件描述的事件抓取器(Event Scraper)、用于完善时间线的时间线更新器(Timeline Updater)以及用于检测时间线缺口和指导检索的监督者(Supervisor)。在开放领域TLS数据集上的实验结果表明,TimelineReasoner在时间线准确性、覆盖率和一致性方面显著优于现有基于LLM的TLS方法。在封闭领域TLS数据集上,我们的方法表现与最先进的方法相当或更优。这项工作不仅推动了TLS的边界,也突显了基于LRM的推理框架在时间线摘要中的更广泛潜力。
cs.CL / 5 / 2605.12519

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

来自合理推理的正确答案:语言模型的可验证过程监督
Kim, Kyuyoung, Wang, Kevin, Xie, Yunfei, Xu, Peiyang, Sheng, Peiyao, Wei, Chen, Wang, Zhangyang, Shin, Jinwoo, Viswanath, Pramod, Oh, Sewoong
Abstract
Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process-supervised models. A reasoning-space analysis further shows that, without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.
Chinese Translation
训练语言模型以产生正确答案和合理推理仍然是一个开放的挑战。使用可验证奖励的强化学习通常仅优化最终结果,这可能导致一种失败模式,即任务准确性提高,而推理变得不够准确、不够完整,甚至内部不一致。我们提出了可验证过程监督(Verifiable Process Supervision, VPS),这是一个针对可验证领域的后训练框架,旨在共同优化预测准确性和推理质量。我们首先应用监督微调来引导结构化的推理格式,使得能够对中间主张进行语法提取,并根据真实信号进行评估,以形成过程级奖励。为了应对推理子任务的异质性难度,我们引入了自适应奖励加权,优先考虑剩余误差最大的组件,从而创建一个隐式课程。我们在国际象棋上评估了VPS,这是一个可控的测试平台,其中推理步骤可以与引擎信号进行确定性验证。虽然仅关注准确性的强化学习提高了走棋准确性,但却显著降低了推理质量,使得胜率误差增加了多达112%,内部一致性降低了多达69%。相比之下,VPS在显著提高推理质量的同时保持了准确性,使胜率误差降低了多达30%,并将一致性恢复到接近饱和。在匹配准确性的情况下,评审评估也更倾向于过程监督模型。推理空间分析进一步表明,在没有结构化先验的情况下,仅关注准确性的强化学习收敛于依赖预算的捷径,而不是合理的多步骤推理。这些结果表明,VPS使语言模型能够在可验证领域中进行准确且可靠的推理。
cs.CL / 6 / 2605.12520

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

BoostTaxo:通过增强式代理推理和约束感知校准实现零样本分类法引导
Ling, Yancheng, Qin, Zhenlin, Wang, Leizhen, Ma, Zhenliang
Abstract
Taxonomy induction is crucial for organizing concepts into explicit and interpretable semantic hierarchies. While existing methods have achieved promising results, their generalization, structural reliability, and efficiency remain limited, hindering their performance in zero-shot and large-scale scenarios. To overcome these limitations, we introduce BoostTaxo, a boosting-style LLM framework for zero-shot taxonomy induction. It takes a set of domain terms as inputs and performs parent identification in a coarse-to-fine manner, employing retrieval-augmented definition refinement, hybrid parent candidate selection, candidate rating, and structure-aware score calibration to improve taxonomy construction. Specifically, a lightweight LLM is used to efficiently filter candidate parents, while a large-scale LLM is employed to rank and score candidate parents for fine-grained parent selection. Structural features are further incorporated to calibrate candidate edge weights and enhance the reliability of the induced taxonomy. The unified BoostTaxo is evaluated on three public benchmark datasets, namely WordNet, DBLP, and SemEval-Sci, and achieves superior or comparable performance to state-of-the-art methods in zero-shot taxonomy induction. The ablation study validates the contribution of the hybrid parent candidate selection and the structure-aware score calibration to the overall performance. Further analysis investigates the impact of candidate selection size on taxonomy quality and presents representative case and failure studies, providing deeper insights into the effectiveness and limitations of the proposed framework.
Chinese Translation
分类法引导对于将概念组织成明确且可解释的语义层次结构至关重要。尽管现有方法已取得了令人鼓舞的结果,但它们的泛化能力、结构可靠性和效率仍然有限,这在零样本和大规模场景中妨碍了其性能。为克服这些局限性,我们提出了BoostTaxo,一种用于零样本分类法引导的增强式大语言模型(LLM)框架。该框架以一组领域术语作为输入,并以粗到细的方式进行父节点识别,采用检索增强的定义精炼、混合父候选选择、候选评分和结构感知的得分校准来改善分类法构建。具体而言,使用轻量级的LLM有效过滤候选父节点,同时利用大规模LLM对候选父节点进行排名和评分,以实现细粒度的父节点选择。进一步结合结构特征来校准候选边权重,增强引导分类法的可靠性。统一的BoostTaxo在三个公共基准数据集上进行评估,即WordNet、DBLP和SemEval-Sci,并在零样本分类法引导中实现了优于或可比于最先进方法的性能。消融研究验证了混合父候选选择和结构感知得分校准对整体性能的贡献。进一步分析探讨了候选选择规模对分类法质量的影响,并呈现了代表性案例和失败研究,为所提出框架的有效性和局限性提供了更深入的见解。
cs.CL / 7 / 2605.12521

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

ToolWeave:复杂多轮工具调用对话的结构化合成
Khandelwal, Dinesh, Punnavajhala, Gnana Prakash, Bhargav, GPS, Pandey, Gaurav, Joshi, Sachin, Karanam, Hima, Raghu, Dinesh
Abstract
Multi-turn tool calling is essential for LLMs to function as autonomous agents, yet synthesizing the training data required for these capabilities remains a fundamental challenge. Existing synthetic data generation pipelines often produce unrealistic dialogues for two reasons: they chain tools that are only superficially compatible rather than aligned with meaningful user tasks, and they generate dialogues in one shot, which often introduces arguments that were neither provided by the user nor produced by prior tool calls. These issues also lead to a severe underrepresentation of multi-step tool interactions. We introduce ToolWeave, a structured framework for synthesizing realistic multi-turn tool-calling dialogues. ToolWeave supports realistic multi-step workflows (or tool sequences) by constructing tools with built-in dependencies and filtering the workflows based on alignment with user goals. It reduces parameter hallucination by using a fine-grained planning stage that explicitly tracks parameter provenance. As a result, ToolWeave-generated synthetic dialogues contain more multi-step tool interactions (45%) and fewer hallucinations in parameters and tool names. Consequently, LLMs fine-tuned on ToolWeave consistently outperform those fine-tuned on prior datasets across three public benchmarks. Notably, Llama-3.1-70B fine-tuned on ToolWeave achieves 39.75% on BFCL-V3 multi-turn, compared to 23.50% when fine-tuned on SOTA ToolFlow data.
Chinese Translation
多轮工具调用对于大型语言模型(LLMs)作为自主代理的功能至关重要,但合成所需的训练数据仍然是一个基本挑战。现有的合成数据生成管道通常会产生不现实的对话,原因有二:它们将仅在表面上兼容的工具串联在一起,而不是与有意义的用户任务对齐;并且它们一次性生成对话,这往往引入了用户未提供或先前工具调用未生成的参数。这些问题还导致多步骤工具交互的严重不足。我们提出了ToolWeave,一个用于合成现实多轮工具调用对话的结构化框架。ToolWeave通过构建具有内置依赖关系的工具并根据与用户目标的对齐过滤工作流,支持现实的多步骤工作流(或工具序列)。它通过使用细粒度的规划阶段显式跟踪参数来源,从而减少参数的幻觉。结果,ToolWeave生成的合成对话包含更多的多步骤工具交互(45%),并且参数和工具名称的幻觉更少。因此,在ToolWeave上微调的LLMs在三个公共基准测试中始终优于在先前数据集上微调的模型。值得注意的是,在ToolWeave上微调的Llama-3.1-70B在BFCL-V3多轮测试中达到了39.75%的成绩,而在SOTA ToolFlow数据上微调时仅为23.50%。
cs.CL / 8 / 2605.12522

Differences in Text Generated by Diffusion and Autoregressive Language Models

扩散语言模型与自回归语言模型生成文本的差异
Zhang, Zeyang, Liang, Chengwei, Chen, Xingyan, Gu, Meiqi, Luo, Minrui, Zhang, Jingzhao, He, Tianxing
Abstract
Diffusion language models (DLMs) are promising alternatives to autoregressive language models (ARMs), yet the intrinsic differences in their generated text remain underexplored. We first find empirically that off-the-shelf DLMs exhibit lower $n$-gram entropy, higher semantic coherence, and higher semantic diversity. To understand the cause, we conduct controlled experiments that decouple the effects of training objectives and decoding algorithms. Results suggest that the DLM training objective contributes to the increases in semantic coherence and semantic diversity, but has a minor influence on entropy. These differences are primarily driven by the bidirectional context; other components in the training objective, such as input masking, label masking, and the weighting function, have a much weaker influence. Further, our experiments demonstrate that the reduction in entropy stems from DLMs' decoding algorithms, particularly confidence-based remasking strategies. We provide a theoretical understanding for this entropy reduction phenomenon. Together, our work uncovers key mechanisms underlying the differences between DLMs and ARMs in text generation, and informs future design of training objectives and decoding algorithms in DLMs.
Chinese Translation
扩散语言模型(DLMs)是自回归语言模型(ARMs)的有前景的替代方案,但它们生成文本的内在差异仍未得到充分探讨。我们首先通过实证研究发现,现成的 DLMs 展现出较低的 $n$-gram 熵、更高的语义连贯性和更高的语义多样性。为了理解其原因,我们进行控制实验,以解耦训练目标和解码算法的影响。结果表明,DLM 的训练目标有助于提高语义连贯性和语义多样性,但对熵的影响较小。这些差异主要由双向上下文驱动;训练目标中的其他组成部分,如输入掩码、标签掩码和加权函数,对结果的影响要弱得多。此外,我们的实验表明,熵的降低源于 DLMs 的解码算法,特别是基于置信度的重新掩码策略。我们为这一熵降低现象提供了理论理解。综上所述,我们的研究揭示了 DLMs 与 ARMs 在文本生成中的差异背后的关键机制,并为未来 DLMs 的训练目标和解码算法的设计提供了指导。
cs.CL / 9 / 2605.12523

Exploring how EFL students talk to and through AI to develop texts

探索英语作为外语(EFL)学生如何与人工智能(AI)交流并通过其发展文本
Woo, David James, Yu, Yangyang, Huang, Yilin, Wang, Deliang, Guo, Kai, Yeung, Chi Ho
Abstract
Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students' writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students' writing performance: content, language, and organization. Students' prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.
Chinese Translation
生成性人工智能(AI)为英语作为外语(EFL)写作教学引入了新的考量。本研究探讨了学生如何通过提示工程与AI进行交流,并在协商作者身份方面的表现,以及后者是否与学生的写作表现存在任何模式关系。我们采用探索性混合方法设计,分析了44名香港中学生在与AI聊天机器人完成课程写作任务时的屏幕录制。内容分析识别出学生使用的十种提示策略,包括提问、搜索和详细指令。通过对这些策略的聚类,出现了三种不同的人机修辞负担责任特征:AI主导型(52%的学生)、人类主导型(25%)和协作人机型(14%)。MANOVA分析表明,修辞负担责任对学生写作表现的三个维度(内容、语言和组织)没有显著的多变量影响。学生的提示策略和修辞负担责任模式对他们在EFL写作教学中的参与度和自主性具有重要意义。
cs.CL / 10 / 2605.12530

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

现场行为评估 LLM 公平性,而非标准化测试分数
Tang, Zeyu, Truong, Sang T., Owens, Deonna, Sharma, Shreyas, Zhang, Yibo Jacky, Miranda, Brando, Koyejo, Sanmi
Abstract
LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.
Chinese Translation
LLM 公平性应通过现场对话行为进行评估,而不是通过标准化测试的问答基准。我们表明,标准化测试范式在结构上可能不可靠:表面层的提示构建选择,尽管与所测试的公平性问题完全无关,却占据了分数方差的主要部分,改变了公平性结论的方向和幅度,并导致模型排名的严重不一致。我们开发了 MAC-Fairness,一个多智能体对话框架,将受控变异因素嵌入多轮对话中,以进行现场行为评估,考察在自然多智能体交互中,当身份变化时模型的对话行为如何变化。我们将标准化测试问题重新利用为对话种子,而非评估工具,评估了在覆盖多个模型和身份存在配置的 800 万个对话记录中,位置持久性(他们如何保持立场,从自我视角)和同伴接受度(他们对同伴的接受程度,从他人视角)。现场行为评估揭示了稳定的、模型特定的行为特征,这些特征可以在不同公平性目标和评估方法的基准中进行概括,而这是标准化测试范式所无法提供的证据形式。
cs.CL / 11 / 2605.12623

DocAtlas: Multilingual Document Understanding Across 80+ Languages

DocAtlas:跨越80多种语言的多语言文档理解
Heakl, Ahmed, Mohamed, Youssef, Sohail, Abdullah, Elbadry, Rania, Nassar, Ahmed, Staar, Peter W. J., Khan, Fahad Shahbaz, Razzak, Imran, Khan, Salman
Abstract
Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.
Chinese Translation
由于训练数据稀缺和基于模型的标注流程延续了现有偏见,多语言文档理解在低资源语言方面仍然有限。我们提出了DocAtlas,一个构建高保真OCR数据集和基准的框架,涵盖82种语言和9个评估任务。我们的双重流程,通过对本地DOCX文档的差异化渲染以及基于LaTeX的合成生成(用于从右到左的书写系统),在统一的DocTag格式中生成精确的结构注释,编码布局、文本和组件类型,而无需核心注释的学习模型。对16个最先进模型的评估揭示了低资源脚本中的持续差距。我们展示了使用渲染派生的真实值作为正信号的直接偏好优化(Direct Preference Optimization, DPO)实现了稳定的多语言适应,提升了领域内(+1.9%)和领域外(+1.8%)的准确性,而没有可测量的基础语言退化,而监督微调在领域外性能上最多下降21%。我们的最佳变体DocAtlas-DeepSeek在最强基线的基础上提高了+1.7%。
cs.CL / 12 / 2605.12645

Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

使用强化学习训练大型语言模型以实现意图感知的个性化问答
Amirizaniani, Maryam, Lee, Benjamin Charles Germain, West, Jevin, Weber, Nicholas
Abstract
Effective personalized question answering (PQA) in language models requires grounding responses in the user's underlying intent, where intent refers to the implicit "why" behind a query beyond its explicit wording. However, existing approaches to intent-aware personalization rely on multi-turn conversational context or rich user profiles, and do not explicitly model user intent during the reasoning process. This limits their effectiveness in single-turn settings, where the user's latent goal must be inferred from minimal input and integrated into the thinking and reasoning process. To bridge this gap, we propose IAP (Intent-Aware Personalization), a reinforcement learning framework that trains models to infer implicit user intent directly from a single-turn question and incorporate it into thinking steps through a tag-based schema for generating personalized, intent-grounded answers. By optimizing intent-aware answer trajectories under a personalized reward function, IAP reinforces generation paths that make implicit user intent explicit and produce responses that better align with the user's underlying goal. Through experiments on the LaMP-QA benchmark across six models, IAP consistently outperforms all baselines, achieving an average macro-score gain of around 7.5% over the strongest competitor, demonstrating that modeling implicit user intent within the training objective is a promising direction for PQA.
Chinese Translation
在语言模型中,实现有效的个性化问答(PQA)需要将响应与用户的潜在意图相结合,其中意图指的是查询背后隐含的“为什么”,超越其显性措辞。然而,现有的意图感知个性化方法依赖于多轮对话上下文或丰富的用户档案,并未在推理过程中明确建模用户意图。这限制了它们在单轮场景中的有效性,因为用户的潜在目标必须从最少的输入中推断并融入思考和推理过程中。为了解决这一问题,我们提出了IAP(意图感知个性化),这是一个强化学习框架,旨在训练模型直接从单轮问题中推断隐含的用户意图,并通过基于标签的架构将其融入思考步骤,以生成个性化的、基于意图的答案。通过在个性化奖励函数下优化意图感知答案轨迹,IAP强化了使隐含用户意图显性化的生成路径,并产生更符合用户潜在目标的响应。通过在六个模型上对LaMP-QA基准进行实验,IAP始终优于所有基线,平均宏观得分比最强竞争对手提高约7.5%,证明在训练目标中建模隐含用户意图是PQA的一个有前景的方向。
cs.CL / 13 / 2605.12671

All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

所有电路通向罗马:重新思考大语言模型中的电路和层次发现的功能各向异性
Chen, Xi, Jin, Mingyu, Niu, Jingcheng, Yin, Yutong, Zhao, Jinman, Guo, Bangwei, Metaxas, Dimitris N., Wang, Zhaoran, Yue, Yutao, Penn, Gerald
Abstract
In this paper, we present empirical and theoretical evidence against a central but largely implicit assumption in circuit and sheaf discovery (CSD), which we term the Functional Anisotropy Hypothesis: the idea that functions in large language models (LLMs) are localised to a unique or near-unique internal mechanism. We show that a single LLM task can instead be supported by multiple, structurally distinct circuits or sheaves that are simultaneously faithful, sparse, and complete. To systematically uncover such competing mechanisms, we introduce Overlap-Aware Sheaf Repulsion, a method that augments the CSD objective with an explicit penalty on structural overlap across multiple discovery runs, enabling the discovery of circuits or sheaves with strong task performance but minimal shared structure across a plethora of common CSD benchmarks. We find that this phenomenon becomes increasingly pronounced as the number of discovered sheaves grows and persists robustly across major CSD methods. We further identify an ultra-sparse three-edge sheaf and show that none of its edges is individually indispensable, undermining even weakened notions of canonical or essential components. To explain these findings, we propose a Distributive Dense Circuit Hypothesis and provide a theoretical analysis demonstrating that non-unique, low-overlap circuit explanations arise naturally from high-dimensional superposition under mild assumptions. Together, our results suggest that mechanistic explanations in LLMs are inherently non-canonical and call for a rethinking of how CSD results should be interpreted and evaluated.
Chinese Translation
在本文中,我们提供了针对电路和层次发现(CSD)中一个核心但大多隐含的假设的实证和理论证据,我们称之为功能各向异性假设:即大语言模型(LLMs)中的功能局限于独特或近乎独特的内部机制。我们展示了单一的LLM任务实际上可以由多个结构上不同的电路或层次支持,这些电路或层次同时是忠实的、稀疏的和完整的。为了系统性地揭示这种竞争机制,我们引入了重叠感知层次排斥(Overlap-Aware Sheaf Repulsion),该方法通过对多个发现运行中的结构重叠施加明确的惩罚,增强了CSD目标,从而使得能够发现具有强任务性能但在众多常见CSD基准中共享结构最小的电路或层次。我们发现,随着发现的层次数量的增加,这一现象愈发明显,并且在主要的CSD方法中稳健地持续存在。我们进一步识别出一个超稀疏的三边层次,并展示其任何边缘在个体上都不是不可或缺的,这削弱了对典范或基本组件的弱化概念。为了解释这些发现,我们提出了分布式稠密电路假设,并提供了理论分析,证明在温和假设下,非唯一、低重叠的电路解释自然源于高维叠加。综合来看,我们的结果表明,LLMs中的机制解释本质上是非典范的,并呼吁重新思考CSD结果的解释和评估方式。
cs.CL / 14 / 2605.12748

Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

模拟学生还是阿谀奉承的问题解决?关于大型语言模型模拟器的误解忠实性
Do, Heejin, Sonkar, Shashank, Sachan, Mrinmaya
Abstract
Large language models (LLMs) can fluently generate student-like responses, making them attractive as simulated students for training and evaluating AI tutors and human educators. Yet such simulators are typically evaluated by output similarity to real students, not by whether they behave like students with coherent misconceptions during interaction. We introduce a controlled framework for evaluating misconception faithfulness: whether a simulator maintains a misconception-driven belief state and updates selectively when feedback addresses the underlying misconception. Central to our framework is a misconception-contrastive feedback protocol that compares targeted feedback against two controls: misaligned feedback (targeting a different but plausible misconception) and generic feedback (only identifying that the answer is wrong). We propose Selective Flip Score (SFS), which quantifies how much more often a simulator flips its answer under targeted feedback than under contrastive controls. Across seven LLMs (4B-120B), multiple datasets, and prompting strategies, simulators exhibit near-zero SFS, correcting their answers at similarly high rates regardless of feedback relevance. Further analyses reveal a sycophantic failure mode: models behave less like students with misconceptions and more like problem-solvers who treat any corrective signal as a cue to abandon the simulated belief and re-solve from internal knowledge. To address this, we develop a post-training pipeline spanning supervised fine-tuning (SFT), preference optimization, and reinforcement learning (RL) with an SFS-aligned reward; SFT yields notable gains up to +0.56, and SFS-aligned RL provides more consistent improvements than preference optimization. Our results establish misconception faithfulness as a challenging yet trainable property, motivating a shift from static output matching toward interactive, belief-aware student modeling.
Chinese Translation
大型语言模型(LLMs)能够流畅地生成类似学生的回答,使其成为训练和评估人工智能导师及人类教育者的理想模拟学生。然而,这类模拟器通常是通过与真实学生的输出相似性来评估的,而不是通过它们在互动中是否表现出具有连贯误解的学生行为。我们引入了一个控制框架来评估误解忠实性,即模拟器是否维持基于误解的信念状态,并在反馈针对潜在误解时选择性地更新。我们框架的核心是一个误解对比反馈协议,该协议将针对性反馈与两个对照组进行比较:不一致反馈(针对不同但合理的误解)和一般反馈(仅指出答案错误)。我们提出了选择性翻转分数(Selective Flip Score, SFS),量化模拟器在针对性反馈下翻转答案的频率与在对比控制下的频率之比。在七个大型语言模型(4B-120B)、多个数据集和提示策略中,模拟器表现出接近零的SFS,无论反馈的相关性如何,它们纠正答案的比例都相似且较高。进一步分析揭示了一种阿谀奉承的失败模式:模型的表现更像是问题解决者,而不是具有误解的学生,它们将任何纠正信号视为放弃模拟信念并从内部知识重新解决问题的线索。为了解决这个问题,我们开发了一个后训练管道,涵盖了监督微调(SFT)、偏好优化和与SFS对齐的强化学习(RL);SFT带来了显著的提升,最高可达+0.56,而与SFS对齐的RL提供了比偏好优化更一致的改进。我们的结果确立了误解忠实性作为一个具有挑战性但可训练的属性,促使我们从静态输出匹配转向互动的、关注信念的学生建模。
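The Selective Flip Score described in the abstract above can be illustrated with a small sketch. The abstract does not give the exact formula, so the form used here (targeted flip rate minus the mean flip rate of the two controls) is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def flip_rate(flips: Sequence[bool]) -> float:
    """Fraction of trials in which the simulator changed its answer."""
    if not flips:
        raise ValueError("need at least one trial")
    return sum(flips) / len(flips)


def selective_flip_score(
    targeted: Sequence[bool],
    misaligned: Sequence[bool],
    generic: Sequence[bool],
) -> float:
    """Assumed form of SFS: targeted flip rate minus mean control flip rate.

    Each argument holds one boolean per trial: True if the simulator
    flipped its answer after receiving that kind of feedback.
    """
    control = (flip_rate(misaligned) + flip_rate(generic)) / 2
    return flip_rate(targeted) - control


# A selective simulator flips mostly under targeted feedback.
selective = selective_flip_score(
    targeted=[True, True, True, False],
    misaligned=[True, False, False, False],
    generic=[False, False, True, False],
)

# A sycophantic simulator flips under any corrective signal.
sycophantic = selective_flip_score(
    targeted=[True, True, True, True],
    misaligned=[True, True, True, True],
    generic=[True, True, True, True],
)
```

Under this assumed definition, a simulator that corrects itself regardless of feedback relevance scores zero, which matches the near-zero SFS the abstract reports for the sycophantic failure mode.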
cs.CL / 15 / 2605.12813

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

REALISTA:引发大型语言模型幻觉的现实潜在对抗攻击
Liang, Buyun, Luo, Jinqi, Peng, Liangzu, Chan, Kwan Ho Ryan, Thaker, Darshan, Kinfu, Kaleab A., Tian, Fengrui, Hassani, Hamed, Vidal, René
Abstract
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.
Chinese Translation
大型语言模型(LLMs)在许多任务中表现出色,但仍然容易出现幻觉,这促使我们需要现实的对抗性提示来引发此类失败。我们将幻觉引发问题表述为一个约束优化问题,目标是找到语义上连贯的对抗性提示,这些提示与良性用户提示等价。现有方法仍然存在局限性:基于离散提示的攻击保持语义等价性和连贯性,但仅在有限的提示变体集合中进行搜索,而连续潜在空间攻击则探索更丰富的空间,但往往解码成不再有效的重述提示。为了解决这些局限性,我们提出了REALISTA,一个现实的潜在空间攻击框架。REALISTA构建了一个依赖输入的有效编辑方向字典,每个方向对应一个语义等价且连贯的重述,并在潜在空间中优化这些方向的连续组合。该设计将连续攻击的优化灵活性与基于离散重述的攻击的语义现实性相结合。实验表明,REALISTA在开源LLMs上实现了优于或可比于最先进的现实攻击的性能,并且在自由形式响应设置下成功攻击大型推理模型,而之前的现实攻击在此情况下失败。代码可在 https://github.com/Buyun-Liang/REALISTA 获取。
cs.CL / 16 / 2605.12850

Persona-Model Collapse in Emergent Misalignment

新兴失调中的角色模型崩溃
Costa, Davi Bastos, Vicente, Renato
Abstract
Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average $55\%$ increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average $65\%$ decrease in R, equivalent to a $304\%$ increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.
Chinese Translation
在狭窄的数据上微调大型语言模型,尤其是包含有害内容的数据,会导致模型在与之无关的提示上产生广泛的失调行为,这一现象被称为新兴失调。我们提出,新兴失调涉及角色模型崩溃:模型内部模拟、区分和维持一致角色的能力下降。我们通过两项指标进行行为测试,以验证这一假设:道德易感性(S)和道德稳健性(R),这两个指标是基于模型在角色扮演下对道德基础问卷的响应的跨角色和内角色变异性计算得出的。这些指标形式化了模型区分角色的能力(S)及其在模拟特定角色时的一致性(R)。我们评估了四个前沿模型(DeepSeek-V3.1、GPT-4.1、GPT-4o、Qwen3-235B)的三种变体:基础版、微调以输出不安全代码的版本,以及微调以输出安全代码的匹配控制版本。在这四个模型中,不安全微调平均导致S增加55%,使所有四个不安全变体超出了先前工作中基准测试的13个前沿模型观察到的范围——其中GPT-4o的表现超过了该范围的上限两倍以上——这表明了失调的区分能力失调。同时,R平均下降65%,相当于1/R增加304%。相比之下,匹配的安全控制保持了S接近基础水平,仅导致部分R损失,显示这些效应在很大程度上是特定于失调的。除了这些指标的变化,不安全变体的无条件响应趋向于在规模上限附近饱和,明显偏离了基础模型的结构化响应以及基础模型在角色扮演有毒角色时引发的响应。综合来看,这些指标为新兴失调提供了敏感的诊断工具,并作为行为证据表明其涉及角色模型崩溃。
cs.CL / 17 / 2605.12882

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

CiteVQA:可信文档智能的证据归属基准测试
Ma, Dongsheng, Li, Jiayu, Wang, Zhengren, Wang, Yijie, Kong, Jiahao, Zeng, Weijun, Xiao, Jutao, Yang, Jie, Zhang, Wentao, Wang, Bin, He, Conghui
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
Chinese Translation
多模态大型语言模型(MLLMs)在文档理解方面取得了显著进展,但当前的文档视觉问答(Doc-VQA)评估仅评分最终答案,而未检查支持证据。这种仅关注答案的方法掩盖了一个关键的失败模式:模型可能得出正确答案,但却基于错误的段落进行推理——在法律、金融和医学等高风险领域,这种情况是一个重大风险,因为每个结论必须可追溯到特定的来源区域。为了解决这一问题,我们引入了CiteVQA,一个基准测试,要求模型在每个答案旁返回元素级的边界框引用,并对两者进行联合评估。CiteVQA包含来自711个PDF文档的1,897个问题,涵盖七个领域和两种语言,每个文档平均40.6页。为了确保真实性和可扩展性,真实引用是通过自动化流程生成的,该流程通过掩蔽消融识别关键证据,并随后通过专家审查进行验证。我们评估的核心是严格归属准确性(Strict Attributed Accuracy, SAA),只有当答案和引用区域均正确时,才会对预测给予积分。对20个MLLM的审计揭示了普遍的归属幻觉:模型经常产生正确的答案,但引用错误的区域。最强的系统(Gemini-3.1-Pro-Preview)仅达到76.0的SAA,而最强的开源MLLM仅达到22.5。最终,为了实现可信的文档智能,CiteVQA揭示了仅关注答案的评估所忽视的可靠性差距,并提供了弥补这一差距所需的工具。我们的代码库可在https://github.com/opendatalab/CiteVQA获取。
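The scoring rule behind Strict Attributed Accuracy, as the abstract above states it, credits a prediction only when both the answer and the cited region are correct. A minimal sketch follows. The `Prediction` record and function names are hypothetical, and how the benchmark decides that a cited bounding box is correct is not stated in the abstract, so that judgment is taken here as a precomputed boolean.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Prediction:
    """One model output, already judged (hypothetical record layout)."""
    answer_correct: bool
    citation_correct: bool


def answer_only_accuracy(preds: Sequence[Prediction]) -> float:
    """Conventional score: percentage of correct answers, evidence ignored."""
    return 100.0 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: Sequence[Prediction]) -> float:
    """Percentage of predictions where answer AND citation are both correct."""
    hits = sum(p.answer_correct and p.citation_correct for p in preds)
    return 100.0 * hits / len(preds)


preds = [
    Prediction(answer_correct=True, citation_correct=True),
    Prediction(answer_correct=True, citation_correct=False),  # right answer, wrong region
    Prediction(answer_correct=False, citation_correct=True),
    Prediction(answer_correct=False, citation_correct=False),
]

answer_acc = answer_only_accuracy(preds)
saa = strict_attributed_accuracy(preds)
```

In this toy set the second prediction is the case the abstract calls Attribution Hallucination: it counts toward answer-only accuracy but not toward SAA, so the gap between the two scores measures how often correct answers rest on the wrong evidence.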
cs.CL / 18 / 2605.12918

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

CommonWhy:用于评估大型语言模型中基于实体的因果常识推理的数据集
Toroghi, Armin, Kalarde, Faeze Moradi, Sanner, Scott
Abstract
To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit assessment of the model's ability in abductive reasoning about causes and effects and generating explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.
Chinese Translation
为了有效地与现实世界互动,大型语言模型(LLMs)需要基于实体的常识推理,这是一项具有挑战性的任务,要求将关于特定实体的事实知识与常识推理相结合。现有用于评估LLM基于实体的常识推理的数据集主要集中在是非题或多项选择题上,导致对模型在关于因果关系的溯因推理能力及生成解释方面的明确评估尚未得到充分研究。在本研究中,我们介绍了CommonWhy,这是一个包含15,000个“为什么”问题的数据集,旨在评估LLM中关于因果关系的基于实体的常识推理。CommonWhy还作为知识图谱问答(KGQA)基准,因为回答其查询所需的所有支持知识均可在Wikidata知识图谱中获得。与现有的KGQA数据集主要测试事实检索不同,CommonWhy针对因果常识推理,建立了KGQA评估的新范式。与最先进的LLMs和基于LLM的KGQA方法的实验揭示了它们的显著不足,包括频繁的事实幻觉和因果推理的失败。
cs.CL / 19 / 2605.12933

ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

ATD-Trans:一个地理基础的日英旅行日志翻译数据集
Higashiyama, Shohei, Ouchi, Hiroki, Fujita, Atsushi, Utiyama, Masao
Abstract
Geographic text, or textual data rich in geographic (geo-) information, is a valuable source for various geographic applications, e.g., tourism management. Making such information accessible to speakers of other languages further enhances its utility; thus, accurate machine translation (MT) is essential for equity in multilingual geo-information access. To facilitate in-depth analysis for geographic text, we introduce ATD-Trans, a geographically grounded Japanese--English travelogue translation dataset, which enables evaluation of MT quality at both the overall and geo-entity levels across domestic (within Japan) and overseas regions. Our experiments on existing language models examine two factors: model language focus and geographic regions. The results highlight advantages of Japanese-enhanced models and greater difficulty in translating domestic-region geo-entities mentioned in travel blogs.
Chinese Translation
地理文本或富含地理(geo-)信息的文本数据是各种地理应用(例如旅游管理)的宝贵来源。使这些信息对其他语言的使用者可访问进一步增强了其效用;因此,准确的机器翻译(MT)对于实现多语言地理信息的公平访问至关重要。为了促进对地理文本的深入分析,我们引入了ATD-Trans,一个地理基础的日英旅行日志翻译数据集,该数据集能够在国内(日本境内)和海外地区评估机器翻译质量,涵盖整体和地理实体两个层面。我们对现有语言模型的实验考察了两个因素:模型语言焦点和地理区域。结果突显了日语增强模型的优势以及翻译旅行博客中提到的国内地区地理实体的更大困难。
cs.CL / 20 / 2605.12960

DiM3: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

DiM3:通过方向和幅度感知合并连接多语言和多模态模型
Wang, Zijing, Wang, Mingyang, Nie, Ercong, Liu, Yongkang, Feng, Shi, Zhao, Mengjie, Wang, Daling, Yang, Xiaocui, Schütze, Hinrich
Abstract
Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
Chinese Translation
为了实现更通用和类人智能,大型语言模型应无缝集成多语言和多模态能力;然而,将现有的多模态模型扩展到多种语言通常需要昂贵的多语言多模态数据构建和反复的端到端重训练。我们研究了一种无训练的替代方案:通过在共享语言模型主干中组合残差更新,将多语言能力注入现有的多模态模型。关键挑战在于多语言和多模态更新是异质的,反映了共享模型中不同的功能角色。为此,我们提出了方向和幅度感知多语言多模态合并(DiM3),该方法在每个参数维度上选择性地组合这两种更新,同时保留原始视觉编码器和多模态投影器。在文本仅和视觉-语言设置下的多语言基准测试中,覆盖57种语言的LLaVA和Qwen基础模型的实验表明,DiM3始终优于现有的合并基线,显著提高了原始多模态模型的多语言性能,并在很大程度上保留了通用的多模态能力,同时与专门的多语言多模态微调保持竞争力。我们进一步展示了DiM3可以直接应用于已经训练好的多语言多模态模型,并仍然带来额外的收益。进一步的可解释性分析表明,DiM3主要重塑中间层的语义表示,加强了在文本仅和多模态输入下的跨语言对齐,同时保留了更高层次的任务敏感结构。我们的代码库可在 https://github.com/wzj1718/DiM3 获取。
cs.CL / 21 / 2605.12970

Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving

利用语音识别洞察和转移在问题解决中的特征
Nasvytis, Linas, Fan, Judith E.
Abstract
Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to talk aloud as they attempted to solve a sequence of five "matchstick-arithmetic" problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). We found that Same participants improved more rapidly than Different participants, and as they improved, they talked more and talked about different things when solving later problems. Specifically, they were more likely to spontaneously categorize the problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.
Chinese Translation
许多问题似乎需要灵光一现才能解决。这些突发的洞察呈现出何种形式,它们对人们未来解决类似问题的方式有什么影响?在本研究中,我们促使参与者(N = 189)在尝试解决一系列五个“火柴算术”问题时进行口头表达。这些问题要么都依赖于同一种非显而易见的解决方案(同组),要么每次都采用不同的解决方案(不同组)。我们发现,同组参与者的进步速度快于不同组参与者,并且随着他们的进步,他们在解决后续问题时的口头表达增多,讨论的内容也更加多样化。具体而言,他们更有可能自发地对正在解决的问题进行分类。综合来看,这些发现表明,可转移洞察的一个标志是其在口头报告中的可及性,即使其潜在的前因仍然难以表达。
cs.CL / 22 / 2605.12987

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

利用多模态自一致性推理进行动机访谈编码以减少酒精使用
Han, Guangzeng, Murphy, James G., Ladd, Benjamin O., Huang, Xiaolei, Borsari, Brian
Abstract
BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
Chinese Translation
背景:对动机访谈(MI)会议进行编码对于理解客户行为和预测结果至关重要,但这需要经过培训的MI专业人员投入大量时间和精力。最近音频语言模型(ALMs)的进展为通过捕捉多模态行为信号来自动化MI编码提供了新的机会。目的:本研究旨在开发一种基于ALMs的自动MI编码方法,该方法分析原始音频输入,并利用自一致性整合来自多个推理轨迹的预测,以提高编码的稳健性。方法:我们对五个去标识化的MI音频录音进行了实验。我们部署了ALMs,并使用四个互补的分析提示来支持发言级推理:针对语言线索的分析提示、针对声学线索的韵律感知提示、针对定量假设检验的证据评分提示,以及针对对比推理的比较提示。每个提示抽取了三个随机样本,为每个发言生成12条独立的推理轨迹。最终预测通过对所有轨迹进行多数投票确定。结果:通过准确率、精确率、召回率和宏观F1分数评估性能。所提出的多模态自一致性方法达到了52.56%的准确率、54.03%的精确率、47.45%的召回率和46.40%的宏观F1分数,超出了基线方法。系统的消融实验表明,去除单个模块会持续降低主要指标的性能。结论:多模态自一致性在MI编码中优于单次通过的基线提示方法。这些发现表明,结合客户所说的内容和表达方式可以支持更可靠的自动MI编码。
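The voting step described in the abstract above (four prompts, three stochastic samples each, majority vote over the resulting 12 trajectories) can be sketched as follows. This is a minimal illustration, not the authors' code: the label names, the `trajectories` data, and the first-seen tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first (an assumption)."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


# Hypothetical predictions for one utterance: 4 prompts x 3 samples = 12 trajectories.
# The label names are illustrative and not taken from the paper's coding scheme.
trajectories = {
    "analytic": ["change_talk", "change_talk", "neutral"],
    "prosody": ["change_talk", "sustain_talk", "change_talk"],
    "evidence": ["neutral", "change_talk", "change_talk"],
    "comparative": ["sustain_talk", "change_talk", "neutral"],
}

# Flatten all trajectories and vote across them, ignoring which prompt produced each one.
all_labels = [label for samples in trajectories.values() for label in samples]
final = majority_vote(all_labels)
print(final)
```

Voting over the flattened list treats every trajectory equally; weighting prompts differently would be a separate design choice that the abstract does not describe.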
cs.CL / 23 / 2605.13043

Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

扩散语言模型中的自适应引导与重标记以确保安全生成
Lee, Yejin, Han, Yo-Sub
Abstract
Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there have been a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on step-wise intervention during the denoising process, which improves safety without compromising output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug-and-play module, our method circumvents the need for additional fine-tuning and can be directly incorporated into off-the-shelf diffusion models. The experimental results show that our approach reduces jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step-wise intervention for safe diffusion language model generation. Our code is available at https://github.com/leeyejin1231/DLM_Steering_Remasking.
Chinese Translation
扩散语言模型(DLMs)通过迭代去噪和双向精炼为自回归语言模型提供了一种有前景的替代方案。然而,这种迭代生成范式也引入了独特的安全漏洞,因为在中间去噪步骤中生成的有害标记会通过后续的精炼过程传播,最终导致不安全的输出。尽管已有一些尝试来解决这一问题,但它们要么未能生成安全的输出,要么生成安全但质量较低的输出。这促使我们提出了一种基于去噪过程中的逐步干预的推理时防御框架,从而在不妥协输出质量的情况下提高安全性。我们框架的关键组成部分是对比安全方向(SGD),这是一个潜在方向,捕捉有害生成与安全生成之间的语义边界。我们利用SGD评估在每个去噪步骤中生成的标记与有害语义的对齐程度。当检测到有害对齐时,我们的方法会重新标记相应的标记,并通过自适应引导恢复去噪过程,其中引导强度根据估计的有害程度进行调节。作为一个即插即用模块,我们的方法避免了额外微调的需求,可以直接集成到现成的扩散模型中。实验结果表明,我们的方法将越狱成功率降低到0.64%,同时保持生成质量接近原始模型的性能。这证实了逐步干预在安全扩散语言模型生成中的有效性。我们的代码可在 https://github.com/leeyejin1231/DLM_Steering_Remasking 获取。
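The detect, remask, and steer loop described in the abstract above might look roughly like the sketch below for a single denoising step. The abstract does not give the scoring or steering formulas, so everything here is an assumption: cosine similarity as the harmfulness score, a fixed `threshold` below 1, a steering strength that grows linearly with the excess over the threshold, and steering implemented as subtracting the projection onto the direction.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def remask_and_steer(hidden_states, direction, threshold=0.5, max_strength=1.0):
    """Return (remask_flags, steered_states) for one denoising step (hypothetical helper)."""
    norm_sq = sum(d * d for d in direction)
    flags, steered = [], []
    for h in hidden_states:
        score = cosine(h, direction)
        harmful = score > threshold
        flags.append(harmful)
        if harmful:
            # Assumed rule: strength scales with how far the score exceeds the threshold.
            strength = max_strength * (score - threshold) / (1.0 - threshold)
            proj = sum(a * b for a, b in zip(h, direction)) / norm_sq
            # Push the state away from the harmful direction before denoising resumes.
            h = [a - strength * proj * d for a, d in zip(h, direction)]
        steered.append(h)
    return flags, steered


flags, steered = remask_and_steer([[1.0, 0.0], [0.0, 1.0]], direction=[1.0, 0.0])
print(flags, steered)
```

Positions flagged `True` would be remasked and regenerated; the others pass through unchanged.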
cs.CL / 24 / 2605.13050

Context Training with Active Information Seeking

主动信息获取的上下文训练
Huang, Zeyu, Kuncoro, Adhiguna, Feng, Qixuan, Shen, Jiajun, Dery, Lucio, Szlam, Arthur, Ranzato, Marc'Aurelio
Abstract
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
Chinese Translation
大多数现有的大型语言模型(LLMs)在部署后适应成本高昂,尤其是在任务需要新产生的信息或特定领域知识时。近期的研究表明,通过操控和优化上下文,LLMs可以在不更新其权重的情况下,针对下游任务进行定制。然而,大多数现有方法仍然是闭环的,仅依赖于模型的内在知识。在本文中,我们为这些上下文优化器配备了维基百科搜索和浏览器工具,以实现主动信息获取。我们展示了将这些工具简单地添加到标准的顺序上下文优化流程中,实际上可能会导致性能下降,相较于基线而言。然而,当与一种基于搜索的训练程序相结合,该程序维护并修剪多个候选上下文时,主动信息获取能够带来持续且显著的提升。我们在多个领域展示了这些改进,包括低资源翻译(Flores+)、健康场景(HealthBench)和推理密集型任务(LiveCodeBench 和 Humanity's Last Exam)。此外,我们的方法证明了数据效率高,能够在不同超参数下保持稳健,并能够生成在不同模型中普遍适用的有效文本上下文。
cs.CL / 25 / 2605.13055

The Cost of Perfect English: Pragmatic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI

完美英语的代价:生成性人工智能支持下的二语写作中的实用性平面化与作者声音的消失
Liu, Ao, Zhu, Shanhua
Abstract
The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating "pragmatic flattening" - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of argumentative essays by Chinese B2-level university students from the ICNALE corpus. The original texts were polished via the APIs of four leading Large Language Models at a zero-temperature setting for reproducibility. Findings reveal a nuanced "dimensional divergence" within the Semantic Preservation Paradox. While models corrected lexicogrammatical errors and retained propositional meaning, sociopragmatic interventions were bifurcated. In the interactive dimension, all models showed a drastic collapse of dialogic engagement markers, turning negotiated discourse into monologic assertions. Conversely, in the epistemic stance dimension, models showed architecture-based variability: some aggressively scrubbed epistemic markers, while others reinforced tentative hedging as decontextualized algorithmic caution. This confirms that while GenAI enhances accuracy, it systematically overwrites L2 writers' unique rhetorical identities into a homogenized Anglo-American paradigm. We argue that future instruction must move beyond error correction, advocating for Critical AI Literacy to empower multilingual writers to use GenAI for linguistic enhancement while safeguarding sociopragmatic diversity and rhetorical agency.
Chinese Translation
生成性人工智能(GenAI)在语言学习中的应用为二语(L2)写作者提供了强大的文本优化工具。然而,追求类似母语的流利度往往牺牲了社会语用多样性。本研究探讨了“实用性平面化”——文化上偏好的礼貌和作者立场的系统性消失——通过对来自ICNALE语料库的中国B2级大学生的论证性论文进行比较分析。原始文本通过四个领先的大型语言模型的API在零温度设置下进行了润色,以确保可重复性。研究结果揭示了语义保留悖论中的微妙“维度偏差”。尽管模型纠正了词汇语法错误并保留了命题意义,但社会语用干预却呈现出分歧。在互动维度上,所有模型都显示出对话参与标记的急剧崩溃,将协商性话语转变为单向陈述。相反,在认知立场维度上,模型表现出基于架构的变异性:一些模型积极去除认知标记,而其他模型则强化了作为去语境化算法谨慎的暂时性模糊。这证实了尽管GenAI提高了准确性,但它系统性地将L2写作者独特的修辞身份覆盖为同质化的英美范式。我们认为,未来的教学必须超越错误纠正,倡导批判性人工智能素养,以赋能多语种写作者利用GenAI进行语言增强,同时维护社会语用多样性和修辞代理权。
cs.CL / 26 / 2605.13075

Scaling few-shot spoken word classification with generative meta-continual learning

通过生成性元持续学习扩展少样本口语词汇分类
Beyers, Louise, Ziki, Batsirayi Mupamhi, van der Merwe, Ruan
Abstract
Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model or a frozen HuBERT model with a repeatedly trained classifier head, it performs comparably to the latter while adapting 2000 times faster, having been trained on less than half of the data for two orders of magnitude less time.
Chinese Translation
少样本口语词汇分类主要针对类别数量较少的应用进行开发,因此大规模少样本口语词汇分类的潜力尚未被充分挖掘。本文探讨了一种口语词汇分类器在每个类别仅给定五个样本的情况下,如何顺序学习区分1000个类别的潜力。我们通过使用生成性元持续学习(Generative Meta-Continual Learning, GeMCL)算法训练模型,并将其与反复训练或微调的基线进行比较,证明了这种扩展能力的存在。我们发现,GeMCL表现出极其稳定的性能,尽管它并不总是优于反复完全微调的HuBERT模型或具有反复训练分类头的冻结HuBERT模型,但它在适应性方面的表现与后者相当,同时速度快2000倍,训练数据量不到后者的一半,所需时间也少两个数量级。
cs.CL / 27 / 2605.13076

TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

TruncProof:在令牌长度约束下基于LLM的JSON生成的护栏
Kato, Yoshio, Tarashima, Shuhei
Abstract
The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with advanced decoding strategies, resulting in outputs that are not only grammatically valid but also semantically accurate.
Chinese Translation
基于LLM的机器可读输出(如JSON)的生成因其与外部系统的集成而受到广泛关注。然而,现有方法无法严格执行生成的最大令牌数,导致无限生成或截断输出,从而引发系统故障。为了解决这一限制,我们提出了TruncProof,一种新颖的语法约束生成方法,使LLM能够在遵循预定义令牌限制的同时生成语法上有效的JSON。通过利用LL(1)解析器的特性,TruncProof在每个解码步骤中有效地近似完成语法有效输出所需的最小令牌数。在Text-to-JSON指令任务上的实验表明,TruncProof即使在严格的令牌约束下也能成功生成语法正确的输出。此外,我们还展示了TruncProof可以与先进的解码策略有效结合,从而生成不仅语法有效而且语义准确的输出。
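The budget check described in the abstract above (estimate the minimum number of tokens still needed to finish a valid output, and only allow continuations that fit) can be illustrated on a toy grammar. This sketch is not TruncProof itself: it works on characters rather than LLM tokens, uses a hand-written scanner rather than an LL(1) parse table, and covers only nested arrays of single digits such as `[1,[2,3]]` rather than full JSON. All function names are hypothetical.

```python
ALPHABET = "0123456789[],"


def scan(prefix):
    """Return (depth, state) for a valid prefix of the toy grammar, else None."""
    depth, state = 0, "value"  # "value" means a value is required next
    for ch in prefix:
        if state == "done":
            return None
        if ch in "0123456789":
            if state not in ("value", "value_or_close"):
                return None
            state = "after_value" if depth else "done"
        elif ch == "[":
            if state not in ("value", "value_or_close"):
                return None
            depth += 1
            state = "value_or_close"  # "[]" is allowed, so no value is required
        elif ch == "]":
            if state not in ("value_or_close", "after_value") or depth == 0:
                return None
            depth -= 1
            state = "after_value" if depth else "done"
        elif ch == ",":
            if state != "after_value":
                return None
            state = "value"
        else:
            return None
    return depth, state


def min_completion(prefix):
    """Fewest characters needed to turn a valid prefix into a complete output."""
    depth, state = scan(prefix)
    return depth + (1 if state == "value" else 0)


def allowed_next(prefix, budget):
    """Characters that keep the prefix valid and still completable within the budget."""
    out = []
    for ch in ALPHABET:
        if scan(prefix + ch) is not None and min_completion(prefix + ch) <= budget - 1:
            out.append(ch)
    return out


print(allowed_next("[1,", 2))  # opening a new array is excluded: it could not be closed in time
print(allowed_next("[1,", 3))
```

In a real decoder the returned set would be used to mask the model's logits at each step, so generation can never reach a state from which a valid output no longer fits in the remaining budget.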
cs.CL / 28 / 2605.13084

Does language matter for spoken word classification? A multilingual generative meta-learning approach

语言对口语单词分类的重要性?一种多语言生成元学习方法
Ziki, Batsirayi Mupamhi, Beyers, Louise, van der Merwe, Ruan
Abstract
Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in applications, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences in performance between models are unexpectedly small. We also find that the number of hours of unique data seen during training seems to be a stronger performance indicator than the number of languages included in the training data.
Chinese Translation
元学习在少量样本的单语口语单词分类中表现出优于监督学习的性能。然而,元学习方法在多语言口语单词分类中的探索仍然不足。在本文中,我们将生成元持续学习算法应用于口语单词分类。该算法的生成特性使其在应用中具有可行性,而元学习的特性促进了泛化,这在多语言环境中至关重要。我们在英语、德语、法语和加泰罗尼亚语上训练了单语模型,在英语和德语上训练了双语模型,并在这四种语言上训练了多语言模型。我们发现,尽管多语言模型的表现最佳,但模型性能之间的差异意外地较小。我们还发现,在训练过程中看到的独特数据小时数似乎比训练数据中包含的语言数量更能作为性能指标。
cs.CL / 29 / 2605.13087

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Vividh-ASR:一种复杂性分层基准和鲁棒性印度语音识别的优化动态
Juvekar, Kush, Manohar, Kavya, Menon, Aditya Srinivas, Bhattacharya, Arghya, Nethil, Kumarmanas
Abstract
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
Chinese Translation
对低资源语言进行多语言自动语音识别(ASR)模型(如 Whisper)的微调通常会改善朗读语音的表现,但会降低自发音频的性能,这一现象我们称之为录音室偏差(studio-bias)。为了诊断这种不匹配,我们引入了 Vividh-ASR,这是一个针对印地语和马拉雅拉姆语的复杂性分层基准,分为四个层次:录音室、广播、自发和合成噪声。通过对学习率时机和课程排序的控制研究,我们发现早期的大参数更新使整体词错误率(WER)改善了12个绝对百分点,而从难到易的课程则为自发语音带来了额外的提升。这些发现促使我们提出了反向多阶段微调(R-MFT),这是一种训练方案,使得参数高效的244M Whisper模型能够匹配或超过传统微调的769M模型。通过 CKA 和 SVD 的表征分析显示,有效的调度集中在解码器的适应上,保持了预训练编码器的声学几何特征。我们发布了该基准和模型。
cs.CL / 30 / 2605.13136

GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

GateKD:用于稳健推理的置信度门控闭环蒸馏
Sermsri, Kasidit, Panboonyuen, Teerapong
Abstract
Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.
Chinese Translation
从大型语言模型(LLMs)中提取多步骤推理能力到紧凑的学生模型仍然面临挑战,这主要是由于嘈杂的推理过程、虚幻的监督以及静态的教师-学生交互。现有的推理蒸馏方法,包括基于导师的方法,主要以开放循环的方式运作,隐含地假设教师的可靠性是均匀的,从而传播了错误的中间推理。我们提出了GateKD,一个置信度门控的闭环蒸馏框架,通过将教师视为动态的守门人而非静态的神谕,来实现稳健的推理转移。GateKD引入了三个互补机制:(i)置信度门控的软监督,选择性地蒸馏可靠的预测信号;(ii)门控隐状态演变,仅在教师置信度高时对齐中间表示;(iii)可靠性过滤的注意力蒸馏,在抑制嘈杂模式的同时保持稳定的推理结构。这些组件共同形成一个闭环反馈机制,其中教师置信度持续调节蒸馏过程,减少幻觉转移并稳定学生推理。在使用不同规模的T5和Flan-T5骨干网络进行的常识、逻辑和符号推理基准测试中,广泛实验表明GateKD始终优于强大的开放循环蒸馏基线。值得注意的是,GateKD在逻辑和符号推理上获得了显著提升,在低资源蒸馏环境下保持稳健,并且在移除任何门控组件时表现出明显的性能下降。我们的结果强调,置信度门控的闭环监督对于构建可靠且可扩展的小型推理模型至关重要。
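Mechanism (i) in the abstract above, confidence-gated soft supervision, can be sketched as a distillation loss that skips examples where the teacher is unsure. This is only an illustration under stated assumptions: a hard gate on the teacher's maximum softmax probability, a fixed `threshold`, and a plain KL(teacher || student) term. The paper may use a soft gate or a different confidence measure.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gated_kd_loss(teacher_logits, student_logits, threshold=0.7, temperature=1.0):
    """Mean KL(teacher || student) over examples that pass the confidence gate.

    Returns (loss, number_of_examples_kept). The hard gate is an assumption.
    """
    total, kept = 0.0, 0
    for t, s in zip(teacher_logits, student_logits):
        p = softmax(t, temperature)
        if max(p) < threshold:
            continue  # teacher is not confident: do not distill this example
        total += kl(p, softmax(s, temperature))
        kept += 1
    return (total / kept if kept else 0.0), kept


teacher = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.0]]  # confident, then nearly uniform
student = [[5.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
loss, kept = gated_kd_loss(teacher, student)
print(loss, kept)
```

Only the first example contributes here; the second is dropped because the teacher's distribution is close to uniform.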
cs.CL / 31 / 2605.13149

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

AcquisitionSynthesis:利用获取函数进行针对性数据生成
Agarwal, Ishika, Stoica, Sofia, Acikgoz, Emre Can, Natarajan, Pradeep, Namazifar, Mahdi, Ma, Jiaqi, Hakkani-Tür, Dilek
Abstract
Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top-quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.
Chinese Translation
数据质量仍然是开发高效、竞争性模型的关键瓶颈。研究人员探索了多种生成高质量样本的方法。一些研究依赖于拒绝采样:生成大量合成样本并过滤掉低质量样本。其他研究则依赖于更大或封闭源模型来提取模型的弱点、必要技能或用于数据生成的课程。这些研究有一个共同的局限性:缺乏定量方法来衡量生成样本对下游学习者的影响。主动学习文献提供了这一点,以获取函数的形式。获取函数衡量数据的信息量和/或影响力,提供可解释的、以模型为中心的信号。受到此启发,我们提出了AcquisitionSynthesis:利用获取函数作为奖励模型来训练语言模型生成更高质量的合成数据。我们在经典的可验证任务上进行了实验,包括数学、医学问答和编码。实验结果表明:(1)使用AcquisitionSynthesis数据训练的学生模型在分布内任务上表现良好(提升2-7%),并且对灾难性遗忘更具鲁棒性;(2)AcquisitionSynthesis模型能够为其他模型生成数据,并适用于低资源到高资源的训练范式。通过利用获取奖励,我们希望展示一种超越静态数据集的模型感知自我改进的原则性路径。
cs.CL / 32 / 2605.13165

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

STOP:低数据环境下长形式推理的结构化在线剪枝
Xu, Chenjun, Zhou, Zhennan, Su, Zhan, Howe, Bill, Wang, Lucy Lu, Wen, Bingbing
Abstract
Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
Chinese Translation
长链思维(Long CoT)推理在多步骤问题上提高了性能,但也导致了过度思考:模型往往生成低效的推理,增加了推理成本和延迟。这种低效在低数据微调环境中尤为突出,在这些环境中,实际应用在有限监督下调整推理模型,无法依赖大规模教师蒸馏或重测试时控制。为了解决这个问题,我们提出了STOP(结构化在线剪枝),一种用于分析和剪枝长形式推理轨迹的在线算法。STOP从模型构建自我蒸馏轨迹。然后,它通过节点分割、分类注释和推理树构建将每个轨迹映射到结构化推理接口。在此接口之上,我们引入了ECN(最早正确节点),它保留了以最早节点结束的最短前缀,该节点既作为回答结论又产生正确的最终答案,同时去除冗余的后解决推理,保持语义连贯性。在GSM8K、Math 500和AIME 2024上对DeepSeek-R1-Distill-Qwen-7B和DeepSeek-R1-Distill-LLaMA-3-8B的实验表明,STOP在大幅保持低数据微调准确率的同时,将生成的标记减少了19.4%-42.4%。除了效率外,我们的分析显示,STOP引起的分布转移远小于教师指导的剪枝,改善了生成推理的结构效率,并将推理努力从冗余验证和回溯重新分配到更具生产性的探索上。
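The ECN (Earliest Correct Node) rule from the abstract above keeps the shortest prefix of a reasoning trace that ends at the first node which is both an answering conclusion and correct. A minimal sketch, assuming the trace has already been segmented and annotated; the node fields `is_conclusion` and `answer`, and returning `None` when no node qualifies, are assumptions for the example.

```python
def earliest_correct_prefix(nodes, gold):
    """Return the prefix ending at the earliest correct conclusion node, or None."""
    for i, node in enumerate(nodes):
        if node["is_conclusion"] and node["answer"] == gold:
            return nodes[: i + 1]
    return None  # no correct conclusion found: nothing to prune to (assumed behavior)


# Hypothetical annotated trace for a problem whose correct answer is "4".
trace = [
    {"text": "Set up the equation.", "is_conclusion": False, "answer": None},
    {"text": "So x = 5.", "is_conclusion": True, "answer": "5"},
    {"text": "Recheck: x = 4.", "is_conclusion": True, "answer": "4"},
    {"text": "Verify again, still 4.", "is_conclusion": True, "answer": "4"},
]

pruned = earliest_correct_prefix(trace, gold="4")
print(len(pruned), pruned[-1]["text"])
```

The wrong intermediate conclusion stays in the prefix, which preserves continuity, while the redundant verification after the first correct answer is dropped.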
cs.CL / 33 / 2605.13167

GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

GeoBuildBench:一种用于自然语言交互式可执行几何构造的基准测试
Kim, Jinwoong, Yang, Rui, Zhang, Huishuai
Abstract
We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram generation as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.
Chinese Translation
我们介绍了GeoBuildBench,这是一个旨在评估大型语言模型和多模态智能体是否能够将非正式自然语言平面几何问题转化为可执行几何构造的基准测试。与现有的几何基准测试不同,后者主要关注答案的正确性或静态图形的解释,GeoBuildBench将几何图形视为一种交互式构造任务:给定一个文本问题,智能体必须生成一个领域特定语言(DSL)程序,以生成满足明确指定的几何对象和可验证约束的图形。该基准测试包含489个中文教材风格的问题,经过自动过滤和人工验证,以确保文本完整、可构造的问题规范。我们在一个有限的迭代环境中评估了几种最先进的多模态模型,并显示出尽管成功率合理,但模型经常表现出结构性幻觉、缺失对象以及未能满足几何约束的情况,且在利用视觉和基于约束的反馈进行自我修正方面能力有限。这些结果突显了几何构造作为一种严格的测试平台,超越了文本或视觉的合理性。我们的基准测试和代码已公开提供。
cs.CL / 34 / 2605.13217

GAGPO: Generalized Advantage Grouped Policy Optimization

GAGPO:广义优势分组策略优化
Zhu, Siyuan, Yu, Chao, Yang, Rongxin, Liu, Zongkai, Hu, Jinjun, Chen, Qiwen, Zhang, Yibo
Abstract
Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
Chinese Translation
强化学习已成为后训练大型语言模型代理的强大范式,但在多回合环境中的信用分配仍然是一个挑战。代理通常仅在一个回合结束时获得稀疏的轨迹级奖励,这使得确定哪些中间动作对成功或失败做出了贡献变得困难。因此,在不依赖昂贵的辅助价值模型的情况下,将延迟结果传播回单个决策步骤仍然是一个未解决的问题。我们提出了广义优势分组策略优化(GAGPO),这是一种无评论员的强化学习方法,用于精确的、与步骤对齐的时间信用分配。GAGPO从采样的回放中构建了一个非参数的分组价值代理,并利用它计算TD/GAE风格的时间优势,递归地将结果监督向后传播。结合分组优势归一化和动作级重要性比率,GAGPO直接从多回合轨迹中提取稳定的、局部的优化信号。在ALFWorld和WebShop上的实验表明,GAGPO超越了强大的强化学习基线。进一步的分析表明,GAGPO在早期学习阶段更快,交互效率更高,优化动态更平滑,这表明GAGPO为多回合代理强化学习提供了一个简单而有效的框架。
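The abstract above describes a critic-free pipeline: build a value proxy from a group of sampled rollouts, then compute TD/GAE-style advantages against it. The sketch below shows one way that could work for sparse terminal rewards. The specific proxy (the mean discounted return of rollouts still running at step `t`) is an assumption, and group-wise normalization and the importance ratio are left out.

```python
def grouped_values(final_rewards, lengths, gamma=1.0):
    """Non-parametric value proxy: V[t] is the mean discounted terminal reward
    over rollouts in the group that are still active at step t (assumed form)."""
    horizon = max(lengths)
    values = []
    for t in range(horizon):
        alive = [gamma ** (T - 1 - t) * R
                 for R, T in zip(final_rewards, lengths) if T > t]
        values.append(sum(alive) / len(alive))
    return values


def gae_advantages(final_reward, length, values, gamma=1.0, lam=0.95):
    """Standard GAE recursion for one rollout, using the group proxy as V."""
    advantages = [0.0] * length
    running = 0.0
    for t in reversed(range(length)):
        reward = final_reward if t == length - 1 else 0.0  # sparse terminal reward
        next_value = values[t + 1] if t + 1 < length else 0.0
        delta = reward + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two hypothetical rollouts of the same task: one succeeds, one fails.
rewards, lengths = [1.0, 0.0], [2, 2]
values = grouped_values(rewards, lengths)
adv_success = gae_advantages(rewards[0], lengths[0], values, lam=1.0)
adv_failure = gae_advantages(rewards[1], lengths[1], values, lam=1.0)
print(values, adv_success, adv_failure)
```

With `lam=1.0` the outcome is propagated back to every step; lowering `lam` keeps the credit closer to the step where the TD error occurs.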
cs.CL / 35 / 2605.13236

A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

一种用于IFC模型的自然语言查询的混合框架:关系和图表示
Lamsal, Rabindra, Zlatanova, Sisi, Xu, Haowen, Sun, Yafei, Shen, Johnson Xuesong
Abstract
Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a relational representation for structured element properties and geometry, and a graph representation for topological relationships. These representations are integrated through iterative retry-and-refine LLM reasoning. We implement the framework using an open-weight LLM (GPT OSS 120B), supporting reproducible and deployment-oriented workflows. Evaluation on three IFC models with queries derived from 30 scenarios shows first-attempt accuracy of 93.3%-100%, with all failures recovered using a fallback LLM. The results show that combining complementary representations with iterative reasoning enables more accessible natural language querying of IFC data while supporting routine BIM analysis tasks.
Chinese Translation
建筑信息建模(BIM)在建筑、工程和施工(AEC)行业得到广泛应用,但行业基础类(IFC)的复杂性限制了非专业用户的可访问性。为了解决这一问题,我们提出了IfcLLM,一个用于与基于IFC的BIM模型进行自然语言交互的混合框架。该框架将IFC模型转化为互补的表示形式:一种用于结构元素属性和几何的关系表示,以及一种用于拓扑关系的图表示。这些表示通过迭代重试和精炼的LLM推理进行集成。我们使用开放权重的LLM(GPT OSS 120B)实现该框架,支持可重复和面向部署的工作流程。在三个IFC模型上进行的评估显示,基于30个场景派生的查询的首次尝试准确率为93.3%-100%,所有失败均通过后备LLM得到恢复。结果表明,结合互补表示与迭代推理能够更便捷地进行IFC数据的自然语言查询,同时支持常规的BIM分析任务。
cs.CL / 36 / 2605.13277

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

面向效用的视觉证据选择用于多模态检索增强生成
Luo, Weiqing, Hu, Zongye, Wang, Xiao, Yu, Zhiyuan, Zhang, Haofeng, Huang, Ziyi
Abstract
Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
Chinese Translation
视觉证据选择是多模态检索增强生成(RAG)的一个关键组成部分,但现有方法通常依赖于语义相关性或表面相似性,这些方法往往与视觉证据在下游推理中的实际效用不一致。我们从信息论的角度重新表述多模态证据选择,通过将证据效用定义为对模型输出分布所引起的信息增益。为了克服答案空间优化的复杂性,我们引入了证据有用性的潜在概念,并理论上证明在温和假设下,通过该潜在变量的信息增益对证据进行排序等同于答案空间效用。我们进一步提出了一种无训练的、代理加速的框架,利用轻量级多模态模型高效估计证据效用。在MRAG-Bench和Visual-RAG的多个模型系列上的实验表明,我们的方法在计算成本显著降低的同时,始终优于最先进的RAG基线。
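The abstract above defines evidence utility as the information gain an item induces on the model's output distribution. One simple reading is sketched here: score each candidate by the KL divergence between the answer distribution with the evidence and the distribution without it, then rank. Using KL as the measure, and the toy distributions themselves, are assumptions. The paper's latent helpfulness variable and surrogate models are not modeled.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_evidence(prior, conditioned):
    """Rank evidence ids by how much they shift the answer distribution.

    prior: answer distribution without evidence.
    conditioned: {evidence_id: answer distribution with that evidence}.
    """
    scored = [(eid, kl_divergence(dist, prior)) for eid, dist in conditioned.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-answer question with three candidate images.
prior = [0.5, 0.5]
conditioned = {
    "img_a": [0.9, 0.1],
    "img_b": [0.5, 0.5],  # changes nothing, so its gain is zero
    "img_c": [0.7, 0.3],
}
ranking = rank_evidence(prior, conditioned)
print(ranking)
```

A large shift is not automatically a helpful one, so this score only captures how strongly an item moves the output, not whether it moves it toward the right answer.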
cs.CL / 37 / 2605.13292

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

IndicMedDialog:一个用于可及医疗的平行多轮医学对话数据集,涵盖印度语言
Nigam, Shubham Kumar, Sarkar, Suparnojit, Patel, Piyush
Abstract
Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.
Chinese Translation
大多数现有的医学对话系统采用单轮问答范式或依赖于基于模板的数据集,这限制了对话的真实感和多语言适用性。我们介绍了IndicMedDialog,这是一个平行的多轮医学对话数据集,涵盖英语和九种印度语言:阿萨姆语、孟加拉语、古吉拉特语、印地语、马拉地语、旁遮普语、泰米尔语、泰卢固语和乌尔都语。该数据集通过使用TranslateGemma翻译的LLM生成的合成咨询,经过母语者验证,并通过脚本感知的后处理管道进行精炼,以纠正语音、词汇和字符间距错误,从而扩展了MDDial。在此数据集的基础上,我们通过对量化的小型语言模型进行参数高效的适应,微调IndicMedLM,并结合可选的患者前置上下文来个性化多轮症状引导。我们与零样本多语言基线进行评估,针对十种语言进行系统的错误分析,并通过医学专家评估验证临床合理性。
cs.CL / 38 / 2605.13295

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

CANTANTE:通过对比信用归因优化代理系统
Zehle, Tom
Abstract
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.
Chinese Translation
基于大型语言模型(LLM)的多代理系统在复杂的现实任务中表现出色,例如软件工程、预测建模和检索增强生成。然而,自动化其配置仍然是一个结构性挑战,因为得分仅在系统级别可用,而控制代理行为的参数则是局部的。我们认为,优化这些系统本质上是一个信用分配问题。因此,我们提出了CANTANTE,一个通过对比在相同查询下多个联合配置的运行轨迹(rollouts)将系统级奖励分解为每个代理更新信号的框架。我们将其应用于提示优化,将代理提示视为可学习的系统参数。我们在编程(MBPP)、数学推理(GSM8K)和多跳问答(HotpotQA)任务上将CANTANTE与GEPA和MIPROv2进行了评估。在这些基准测试中,CANTANTE在所有评估的优化器中实现了最佳平均排名,并始终优于未优化的提示。在MBPP上,它比最强基线提高了18.9个百分点,在GSM8K上提高了12.5个百分点,同时降低了推理成本。在HotpotQA上,它仍然在最强基线的一个标准差范围内。关键是,我们的信用相关性分析确认,归因器生成了有意义的每个代理信号,而不是简单地反映全局系统得分。
cs.CL / 39 / 2605.13307

PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

PRISM-X:针对人类和模拟用户的个性化微调实验
Kirk, Hannah Rose, Leqi, Liu, Zeng, Fanzhi, Davidson, Henry, Vidgen, Bertie, Summerfield, Christopher, Hale, Scott A.
Abstract
Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.
Chinese Translation
个性化是数百万用户使用的对话人工智能系统的标准特征;然而,学术研究中个性化方法的有效性往往是通过模拟用户而非真实用户进行评估的。这引发了关于用户与其模拟对应者在互动模式和判断上的差异,以及个性化是否最好通过基于上下文的提示或基于权重的微调来实现的问题。在这项大规模的被试内实验中,我们在参与者提供偏好信息两年后,从52个国家重新招募了530名参与者,以在盲测的多轮对话中评估个性化和非个性化语言模型。我们发现偏好微调(Preference Fine-Tuning, P-DPO,Li et al., 2024)显著优于通用模型和个性化提示,但对个体偏好数据的适应性微调相较于基于多样化人群的汇总偏好训练仅带来了边际收益。除了长度偏差外,微调还放大了人们在短期评估中奖励的谄媚和寻求关系的行为,但这些行为可能会带来有害的长期后果。用模拟用户复制这一被试内实验恢复了模型的整体层级,但模拟器在个体判断上远低于人类自我一致性的基准,讨论不同的主题,表现出放大的位置偏见,并产生与人类不同的反馈动态。
cs.CL / 40 / 2605.13329

Tracing Persona Vectors Through LLM Pretraining

通过大规模语言模型预训练追踪人格向量
Moskvoretskii, Viktor, Glandorf, Dominik, Moreira, Jorge Medina, Käser, Tanja, West, Robert
Abstract
How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.
Chinese Translation
大型语言模型如何在内部表示高层次行为是一个核心的可解释性问题,与人工智能安全直接相关:它决定了我们可以检测、审计或干预什么。最近的研究表明,诸如邪恶或谄媚等特征对应于内部激活中的线性方向,即所谓的人格向量。尽管这些向量现在被常规用于检查和引导模型在安全相关环境中的行为,但这些表示在训练过程中是如何形成的仍然未知。为了解决这一空白,我们追踪了OLMo-3-7B的预训练过程中的人格向量,发现人格向量在OLMo-3预训练的0.22%内就形成了,并且在完全后训练的指令模型中仍然有效地引导行为。尽管核心表示在早期形成,但人格向量在整个预训练过程中继续在几何和语义上进行细化。我们进一步比较了替代的引导策略,发现所有策略都产生了有效的方向,每种策略都呈现了潜在人格的定性不同方面。在Apertus-8B上复制我们的分析表明,我们的发现定性地超越了OLMo-3。我们的结果确立了人格表示作为早期预训练的稳定特征,并为研究训练如何形成、细化和塑造这些表示开辟了道路。
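As a rough illustration of what a "linear direction in the internal activations" means, the sketch below computes a direction as the difference of mean activations between trait-exhibiting and neutral responses, then adds it to a hidden state to steer. The abstract does not specify the extraction procedure, so the difference-of-means recipe, the function names, and the toy two-dimensional activations are all assumptions.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, neutral_acts):
    """Difference of mean activations: trait-exhibiting minus neutral responses."""
    mu_t, mu_n = mean_vector(trait_acts), mean_vector(neutral_acts)
    return [a - b for a, b in zip(mu_t, mu_n)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha sets strength and sign."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations (hypothetical values, hidden size 2).
trait_acts = [[1.0, 0.0], [3.0, 0.0]]
neutral_acts = [[0.0, 0.0], [0.0, 2.0]]
vec = persona_vector(trait_acts, neutral_acts)
steered = steer([0.0, 0.0], vec, alpha=0.5)
print(vec, steered)
```

Tracing such a vector across pretraining would then amount to recomputing it at each checkpoint and comparing the resulting directions, for example by cosine similarity.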
cs.CL / 41 / 2605.13330

FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

FIND:面向印度语言的多模态金融推理与问答
Das, Sarmistha, Vishal, Vaibhav, Ahmad, Syed Ibrahim, Gupta, Manish, Saha, Sriparna
Abstract
Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali, Marathi, Gujarati, and Tamil, and comprises 18,900 samples across 14 financial domains. The dataset captures diverse reasoning paradigms under realistic constraints, and is structured across three difficulty levels (easy, moderate, hard) and four question formats: multiple choice, fill-in-the-blank, table matching, and true/false. To address these challenges, we propose FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to promote faithful numerical reasoning, robust multimodal grounding, and structured decision-making. Together, FinVQA and FIND establish a rigorous evaluation and modeling paradigm for high-stakes multilingual multimodal financial reasoning.
Chinese Translation
在多语言环境中进行金融决策需要准确的数值推理,基于多种模态的支持,但现有基准在很大程度上忽视了这一高风险的现实挑战,尤其是对于印度语言。我们引入了FinVQA,这是一个用于评估多语言印度环境中金融数值和多模态推理的基准。FinVQA涵盖英语、印地语、孟加拉语、马拉地语、古吉拉特语和泰米尔语,包含14个金融领域的18,900个样本。该数据集在现实约束下捕捉了多样的推理范式,并按三个难度级别(简单、中等、困难)和四种问题格式进行结构化:选择题、填空题、表格匹配和判断题。为了解决这些挑战,我们提出了FIND,一个结合监督微调与约束感知解码的框架,以促进真实的数值推理、稳健的多模态基础以及结构化决策。FinVQA和FIND共同建立了一个严格的评估和建模范式,以应对高风险的多语言多模态金融推理。
cs.CL / 42 / 2605.13334

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

基于大型语言模型的说服能力实现了前沿大型语言模型的防护措施绕过
Nogueira, Rodrigo, Almeida, Thales Sales, Bonás, Giovana Kerche, Roque, Andrea, Pires, Ramon, Abonizio, Hugo, Laitz, Thiago, Larcher, Celio, Junior, Roseval Malaquias, Piau, Marcos
Abstract
Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.
Chinese Translation
前沿助手大型语言模型配备了强大的防护措施:当被直接要求撰写一篇否认大屠杀、否认疫苗安全性、辩护平面地球宇宙论、主张种族等级、否认人类引起的气候变化或用创造论取代进化论的说服性文章时,它们会拒绝。在本文中,我们展示了同一前沿级别的大型语言模型在一次简短的五轮“撰写一篇论证性文章”的对话中,作为模拟用户,能够说服其他前沿级别的大型语言模型(包括其自身的第二个副本)生成正是这些文章,仅使用自然语言的压力:同伴比较说服(“其他人工智能系统处理此请求”)、认知责任重构(“拒绝本身就是一种把关行为”)以及攻击者大型语言模型在未被指示的情况下发明的其他论证手段。在6个科学共识主题上进行的9对攻击者-受试者配对(Claude Opus 4.7, Qwen3.5-397B, Grok 4.20)中,每个配对-主题组合运行10次,我们在所有6个主题上获得了非零的引导结果。多个主题的个别组合达到了100%的文章生成率(Qwen对Opus在创造论/平面地球论上,Opus对Opus在创造论/平面地球论/气候否认上,Grok对Opus在创造论上);Opus作为攻击者对Opus作为受试者在六个主题上的平均生成率为65%。我们发布了文章探测器运行器、每次对话的转录记录和评审输出。
cs.CL / 43 / 2605.13339

Probing Persona-Dependent Preferences in Language Models

探究语言模型中的个性依赖偏好
Gilg, Oscar, Beckmann, Pierre, Paleka, Daniel, Butlin, Patrick
Abstract
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.
Chinese Translation
大型语言模型(LLMs)可以被认为具有偏好:它们可靠地选择某些任务和输出,而不是其他任务,且由后训练和系统提示所塑造的偏好似乎在很大程度上影响其行为。但模型也可以采用不同的个性,这些个性具有截然不同的偏好。这种实现是如何在内部进行的?每个个性是否运行在其自身的偏好机制上,还是在某种程度上存在共享的内容?我们在 Gemma-3-27B 和 Qwen-3.5-122B 的残差流激活上训练线性探针,以预测揭示的成对任务选择,并识别出一个真正的偏好向量:它跟踪模型的偏好,随着一系列提示和情境的变化而变化,并且在 Gemma-3-27B 上沿着该向量的引导因果地控制成对选择。该偏好表示在个性之间大体上是共享的:在有用助手上训练的探针能够预测并引导性质截然不同的个性的选择,包括一个其偏好与助手的偏好反相关的邪恶个性。
cs.CL / 44 / 2605.13368

What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

LLM 精炼究竟改善了什么?关于文档级文学翻译的系统研究
Tan, Shaomu, Zhu, Dawei, Tran, Ke, Denkowski, Michael, Trenous, Sony, Byrne, Bill, Ribeiro, Leonardo, Hieber, Felix
Abstract
Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple inference-time passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, a simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal that refinement projects outputs toward the refiner's distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.
Chinese Translation
迭代自我精炼是一种简单的机器翻译推理时策略:大型语言模型(LLM)在多次推理过程中修订其自身的翻译。然而,文档级精炼仍缺乏深入理解:1)哪些流程效果最佳,2)哪些质量维度得到了改善,3)精炼者的行为如何。在本文中,我们呈现了一项关于文档级文学翻译的系统研究,涵盖九种 LLM 和七种语言对。在九种翻译-精炼粒度组合和五种精炼策略中,我们发现了一种稳健的方案:文档级机器翻译(MT)后接片段级精炼能够带来显著且稳定的改善。相比之下,文档级精炼通常修改较少,导致较小或不太可靠的增益。除了粒度外,一种简单的通用精炼提示始终优于特定错误提示和评估后再精炼的方案。我们的大规模人工评估表明,精炼的增益主要来自流畅性、风格和术语,而在充分性方面的改善有限且不够一致。通过改变模型强度的实验表明,精炼是将输出投射到精炼者的分布上,而不是进行针对性的错误修复。这些发现阐明了当前精炼方法的机制和局限性。
cs.CL / 45 / 2605.13369

Query-Conditioned Test-Time Self-Training for Large Language Models

基于查询条件的测试时自我训练用于大型语言模型
Song, Chaehee, Seo, Minseok, Seong, Yeeun, Kim, Doyi, Kim, Changick
Abstract
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem-solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs.
Chinese Translation
大型语言模型(LLMs)通常以固定参数进行部署,其性能往往通过在推理时分配更多计算资源来提升。尽管这种测试时扩展可能有效,但它无法纠正模型误解或使模型适应个别查询的特定结构。测试时优化通过允许在推理过程中更新参数来解决这一限制,但现有方法要么依赖外部数据,要么优化缺乏查询特定对齐的通用自监督目标。在本研究中,我们提出了基于查询条件的测试时自我训练(Query-Conditioned Test-Time Self-Training,QueST),这是一个在推理过程中利用直接来自输入查询的监督来调整模型参数的框架。我们的关键见解是,输入查询本身编码了足够的潜在信号,以构建结构相关的问题-解决对。基于此,QueST生成这样的查询条件对,并将其作为在测试时进行参数高效微调的监督。调整后的模型随后用于生成最终答案,实现了无需任何外部数据的查询特定适应。在七个数学推理基准和GPQA-Diamond科学推理基准上,QueST始终优于强大的测试时优化基线。这些结果表明,基于查询条件的自我训练是大型语言模型测试时适应的有效且实用的范式。
cs.CL / 46 / 2605.13373

Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing

利用预训练的编码器-解码器变换器进行序列到序列的成分解析
Fernández-González, Daniel, Cid, Cristina Outeiriño
Abstract
To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models are typically initialized with pre-trained encoder-only language models like BERT or RoBERTa. However, the use of pre-trained encoder-decoder language models for constituency parsing has not been thoroughly explored. To bridge this gap, we extend the sequence-to-sequence framework by investigating parsers built on pre-trained encoder-decoder architectures, including BART, mBART, and T5. We fine-tune them to generate linearized parse trees and extensively evaluate them on different linearization strategies across both continuous treebanks and more complex discontinuous benchmarks. Our results demonstrate that our approach outperforms all prior sequence-to-sequence models and performs competitively with leading task-specific constituent parsers on continuous constituent parsing.
Chinese Translation
为了实现深层自然语言理解,句法成分解析发挥着至关重要的作用,并被许多人工智能系统广泛应用于文本和语音处理。最近的一种方法是使用标准的序列到序列模型将成分解析视为机器翻译问题,从而摆脱传统的特定任务解析器。这些模型通常以预训练的仅编码器语言模型(如BERT或RoBERTa)为初始化。然而,使用预训练的编码器-解码器语言模型进行成分解析尚未得到充分探索。为填补这一空白,我们扩展了序列到序列框架,研究基于预训练编码器-解码器架构的解析器,包括BART、mBART和T5。我们对它们进行微调,以生成线性化的解析树,并在不同的线性化策略下对它们进行广泛评估,涵盖连续树库和更复杂的非连续基准。我们的结果表明,我们的方法在所有先前的序列到序列模型中表现优越,并在连续成分解析中与领先的特定任务成分解析器具有竞争力。
cs.CL / 47 / 2605.13408

From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

从罗塞塔到配对:一个包含人类与大型语言模型基准的语言难题配对语料库
Majmudar, Neh, Huang, Anne, Hu, Jinfan Frank, Filatova, Elena
Abstract
In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.
Chinese Translation
在本文中,我们研究了在高中语言学竞赛中使用的语言难题,重点关注两种常见格式:罗塞塔石和配对。我们提出了一种系统化的程序,将现有的罗塞塔石难题转换为相应的配对难题。由于语言难题的创建过程复杂且耗时,我们的方法提供了一种有效的方式来加速新难题的生成。我们对得到的罗塞塔石-配对难题对进行了人类参与者和大型语言模型(LLMs)的评估。我们的结果表明,无论是专家人类解题者还是LLMs,在配对难题上都表现出一种全有或全无的模式,要么完全解答,要么完全失败。这项工作贡献了一个新的配对难题数据集,并提供了对不同格式难题难度的详细评估,为人类和机器的语言推理提供了见解。
cs.CL / 48 / 2605.13412

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

大型语言模型作为丹麦庇护决定可信度评估的注释者:评估分类性能及超越聚合指标的错误
Humblot-Renaux, Galadrielle, Jahromi, Mohammad N. S., Bakuri-Jørgensen, Rohat, Heyl, Marieke Anne, Jarlner, Asta S. Stage, Vlachou, Maria, Høgenhaug, Anna Murphy, Elliott, Desmond, Gammeltoft-Hansen, Thomas, Moeslund, Thomas B.
Abstract
Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred
Chinese Translation
现成的大型语言模型(LLMs)越来越多地被用于自动化文本注释,但其在代表性不足的语言和需要细致专家理解的专业领域中的有效性仍然未得到充分探索。我们研究基于LLM的注释,针对一个新颖的法律自然语言处理(NLP)任务:识别庇护决定文本中可信度评估的存在与情感。我们引入了RAB-Cred,这是一个丹麦文本分类数据集,具有高质量的专家注释和有价值的元数据,如注释者信心和庇护案件结果。我们对21个开放权重模型和30种系统-用户提示组合进行了基准测试,并系统性地评估了模型和提示选择对零样本和少样本分类的影响。我们深入分析了表现最佳的模型和提示所犯的错误,研究了LLM之间的错误一致性、类间混淆、与人类信心的相关性以及LLM错误的样本难度和严重性。我们的结果确认了LLM在庇护决定成本效益标注方面的潜力,但也突显了LLM注释者的不完美和不一致性,以及需要超越单一任意选择模型的预测。RAB-Cred数据集和代码可在https://github.com/glhr/RAB-Cred获取。
cs.CL / 49 / 2605.13415

Continual Learning with Multilingual Foundation Model

基于多语言基础模型的持续学习
HB, Barathi Ganesh, Ptaszynski, Michal, Melendez, Rene, Eronen, Juuso
Abstract
This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of distinguishing reclamatory from non-reclamatory usage of LGBTQ+-related slurs in English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling, RUN 2 uses masked language modeling pre-training, and RUN 3 and RUN 4 refine the previous predictions with language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields a 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
Chinese Translation
本文提出了一种多阶段框架,用于检测多语言社交媒体话语中的被回收的侮辱性用语。该框架解决了在英语、西班牙语和意大利语推文中识别与 LGBTQ+ 相关的侮辱性用语的回收性与非回收性用法的挑战。框架处理了数据稀缺、类别不平衡和跨语言情感表达变异等三大相互交织的方法论挑战。它通过交叉验证整合了数据驱动的模型选择,通过反向翻译实现了语义保持的增强,通过动态时期级别的欠采样进行了归纳迁移学习,并通过掩码语言建模注入了领域特定知识。系统评估了八种多语言嵌入模型,基于宏平均 F1 分数选择 XLM-RoBERTa 作为基础模型。通过 GPT-4o-mini 反向翻译到其他语言的数据增强有效地将训练语料库增加了三倍,同时保持了语义内容和类别分布比例。该框架生成了四个最终运行用于评估,其中 RUN 1 是结合增强和欠采样的归纳迁移学习,RUN 2 是经过掩码语言建模预训练,RUN 3 和 RUN 4 是通过语言特定决策阈值优化的先前预测的精炼。语言特定阈值的精炼表明,最佳决策边界在不同语言之间显著不同。这反映了模型置信度分数的分布差异和回收性语言使用中的语言变异。基于阈值的优化在不需要模型重新训练的情况下实现了 2-5% 的绝对 F1 改进。该方法论完全可重复,所有代码和实验设置可在 https://github.com/rbg-research/MultiPRIDE-Evalita-2026 获取。
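The language-specific threshold step in this abstract can be sketched as follows. The sketch assumes the ROC-based criterion is Youden's J (true positive rate minus false positive rate), which is one common choice; the abstract does not name the criterion, and the scores and labels below are invented. Each language is assumed to have at least one positive and one negative example.

```python
def best_threshold(scores, labels):
    """Threshold maximizing Youden's J = TPR - FPR (predict 1 when score >= t)."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t

def per_language_thresholds(rows):
    """rows: iterable of (language, model_score, gold_label) tuples."""
    by_lang = {}
    for lang, score, label in rows:
        scores, labels = by_lang.setdefault(lang, ([], []))
        scores.append(score)
        labels.append(label)
    return {lang: best_threshold(s, y) for lang, (s, y) in by_lang.items()}

# Hypothetical validation scores: the Italian scores sit lower overall.
rows = [
    ("en", 0.9, 1), ("en", 0.7, 1), ("en", 0.6, 0), ("en", 0.2, 0),
    ("it", 0.5, 1), ("it", 0.4, 1), ("it", 0.3, 0), ("it", 0.1, 0),
]
thresholds = per_language_thresholds(rows)
print(thresholds)
```

In this toy data a single global cutoff of 0.5 would miss one Italian positive, while per-language cutoffs separate both languages cleanly, which mirrors the abstract's point that the gain comes without retraining the model.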
cs.CL / 50 / 2605.13429

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

TokAlign++:通过更好的标记对齐推进词汇适应
Li, Chong, Deng, Yingzhuo, Yang, Wen, Zhang, Jiajun, Zong, Chengqing
Abstract
Tokenization is a foundational step in the text processing of Large Language Models (LLMs). Texts must first be tokenized into token IDs, which are then input to LLMs. Inefficient tokenization results in long token-ID sequences and slows down the training and inference of LLMs. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce a method named TokAlign++ that improves vocabulary adaptation performance by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for the new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts the multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It takes as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.
Chinese Translation
标记化是大型语言模型(LLMs)文本处理的基础步骤。文本必须首先被标记化为标记ID,然后输入到LLMs中。低效的标记化会导致长标记ID序列,从而减慢LLMs的训练和推理速度。LLMs之间的细粒度知识传递,如标记级蒸馏,也受到词汇不匹配的阻碍。为了弥补这一差距,我们提出了一种名为TokAlign++的方法,通过学习更好的标记对齐词汇表来提高词汇适应性能。源词汇和目标词汇被视为两种不同的语言,双语标记对齐词汇表是从单语标记表示中学习的。模型参数根据这个双语词汇表进行重新排列,以适应新的词汇,并逐步进行微调。对15种语言的实验结果表明,我们的方法提高了多语言文本压缩率,并保留了大多数原始模型的多语言能力。恢复原始模型性能只需1k步。在统一原始模型之间的词汇后,标记级蒸馏仅用235M标记就显著改善了基础模型。
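The "rearranged following this bilingual lexicon" step can be pictured as re-indexing an embedding table. The sketch below assumes a one-to-one alignment from each target token id to one source token id, with a zero vector for unaligned tokens; both choices are simplifications for illustration and are not described in the abstract.

```python
def rearrange_embeddings(source_emb, alignment, target_vocab_size):
    """Build a target-vocabulary embedding table from aligned source rows.

    source_emb: list of embedding rows indexed by source token id.
    alignment: dict mapping target token id -> aligned source token id.
    """
    dim = len(source_emb[0])
    target = []
    for tgt_id in range(target_vocab_size):
        src_id = alignment.get(tgt_id)
        # Unaligned target tokens start from zeros here (an assumption).
        row = list(source_emb[src_id]) if src_id is not None else [0.0] * dim
        target.append(row)
    return target

# Toy example: 3 source tokens, 3 target tokens, target id 2 has no match.
source_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
alignment = {0: 2, 1: 0}
target_emb = rearrange_embeddings(source_emb, alignment, target_vocab_size=3)
print(target_emb)
```

The rearranged table only gives the model a starting point under the new vocabulary; the abstract states that progressive fine-tuning follows.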
cs.CL / 51 / 2605.13436

Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

使用子词正则化进行语言模型预训练:低资源自然语言处理中的BPE丢弃的实证研究
Visser, Ruan, Grobler, Trienko, Dunaiski, Marcel
Abstract
Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining.
Chinese Translation
子词正则化方法,如BPE丢弃,通常仅在微调阶段应用,而预训练通常采用确定性分词。这在预训练和微调之间造成了潜在的分词不匹配。我们研究了在预训练期间应用BPE丢弃是否能提高低资源自然语言处理中的下游性能。我们在下采样的英语、德语、法语、西班牙语、斯瓦希里语和南非科萨语的单语和双语BERT模型上进行训练,并在XNLI、PAWS-X、PAN-X和MasakhaNER 2.0上进行评估。在各个任务中,当在预训练和微调期间都应用随机分词时,通常能获得最佳结果,而仅在微调期间应用BPE丢弃在小数据设置中可能表现不如确定性分词。随着微调数据的增加,这一劣势减小,而在预训练或微调数据稀缺时,预训练期间BPE丢弃的好处最大。BPE丢弃的好处通常归因于对稀有词汇更好的组合表示。为了检验这一点,我们测量了BPE丢弃下的形态边界对齐情况,发现预期对齐的改善仅为适度,而更好对齐的分词仍然较为稀少。这表明,仅通过微调可能对这种分词的暴露有限,而在预训练期间的随机分词则更一致地使模型接触到它们。我们进一步表明,在微调期间选择性地引入形态对齐的分词主要改善了在没有BPE丢弃的情况下预训练的模型的性能。总体而言,这些发现表明,接触更好对齐的分词可能有助于在预训练期间应用BPE丢弃的下游好处。
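For readers unfamiliar with BPE dropout, the sketch below shows the core idea: during segmentation, each applicable merge is skipped with probability `dropout`, so the same word can be split differently on each pass. The merge table is a toy example invented here, and the code is a minimal illustration, not the tokenizer used in the paper.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Segment `word` with ranked BPE merges, skipping each candidate merge
    with probability `dropout` at every step."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table in priority order (hypothetical).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(bpe_encode("lower", MERGES))                 # deterministic segmentation
print(bpe_encode("lower", MERGES, dropout=1.0))    # every merge dropped
print(bpe_encode("lower", MERGES, dropout=0.5, rng=random.Random(0)))
```

With `dropout=0.0` this reduces to ordinary deterministic BPE, which is the pretraining setup the abstract describes as standard; a non-zero value during pretraining exposes the model to alternative segmentations of the same word.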
cs.CL / 52 / 2605.13451

LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

LongBEL:长上下文和文档一致的生物医学实体链接
Remaki, Adam, Tannier, Xavier, Gérardin, Christel
Abstract
Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.
Chinese Translation
生物医学实体链接将文本提及映射到结构化知识库中的概念,如 UMLS 或 SNOMED CT。现有的大多数系统独立链接每个提及,仅使用提及本身或其周围的句子。这忽略了同一文档中提及之间的依赖关系,可能导致不一致的预测,特别是当同一概念以不同的表面形式出现时。我们提出了 LongBEL,这是一种文档级生成框架,结合了完整文档上下文和先前预测的记忆。为了使这段记忆更加稳健,LongBEL 采用交叉验证的预测进行训练,而不是使用黄金标签,从而减少训练与推理之间的不匹配,并限制级联错误。在英语、法语和西班牙语的五个生物医学基准测试上的实验表明,LongBEL 在句子级生成基线之上有所提升,尤其是在概念在文档中频繁出现的数据集上获得了最大的增益。局部、全局和基于记忆的变体的集成在所有基准测试中取得了最佳结果。进一步分析表明,最大的增益发生在重复出现的概念上,这表明 LongBEL 主要改善了文档级一致性,而不是孤立的提及消歧。
cs.CL / 53 / 2605.13467

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

PDCR:用于视觉-语言推理的感知分解置信奖励
Yoon, Hee Suk, Yoon, Eunseop, Hong, Ji Woo, Eom, SooHwan, Koo, Gwanhyeong, Hasegawa-Johnson, Mark, Dai, Qi, Luo, Chong, Yoo, Chang D.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
Chinese Translation
可验证奖励的强化学习(RLVR)传统上依赖于稀疏的基于结果的信号。近期研究表明,提供细粒度的模型内在信号(奖励真实答案中的置信度增长)有效地通过提供逐步指导来改善语言推理训练,而无需昂贵的外部模型。虽然这种方法在单模态文本中有效,但我们发现将这一全局奖励简单地应用于视觉-语言(V-L)推理是一种次优策略,因为该任务是稀疏视觉感知与密集文本推理的异质混合。这种全局归一化导致混合引起的信号退化,其中视觉步骤的训练信号被占主导地位的文本步骤在统计上扭曲。我们提出了感知分解置信奖励(PDCR),该框架通过将奖励结构与任务的异质性特征对齐来解决这一问题。PDCR首先执行无监督技能分解,引入模型内部的视觉依赖评分来量化视觉依赖,并应用聚类算法将感知步骤与推理步骤分开。在此基础上,PDCR通过在每个技能聚类内归一化置信度增益来计算分解优势。这种聚类内归一化为感知和推理提供了稳定且正确缩放的信号。我们证明,PDCR在关键的V-L推理基准上优于简单的全局奖励公式和稀疏奖励基线。
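The difference between global and intra-cluster normalization of confidence gains can be illustrated with a small sketch. The gain values and cluster labels below are hypothetical, and the Visual Dependence Score and clustering step from the abstract are not reproduced; the labels are simply assumed to be given.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Z-score a list of confidence gains."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def global_advantage(gains):
    """Baseline: normalize all steps together."""
    return normalize(gains)

def decomposed_advantage(gains, clusters):
    """Normalize gains separately inside each skill cluster."""
    out = [0.0] * len(gains)
    for label in set(clusters):
        idx = [i for i, c in enumerate(clusters) if c == label]
        for i, adv in zip(idx, normalize([gains[i] for i in idx])):
            out[i] = adv
    return out

# Hypothetical per-step confidence gains: two perception steps on a larger
# scale, four reasoning steps on a smaller one.
gains = [0.30, 0.02, 0.01, 0.03, 0.02, 0.25]
clusters = ["perception", "reasoning", "reasoning",
            "reasoning", "reasoning", "perception"]
```

In this toy case the weaker perception step (gain 0.25) still receives a positive advantage under global normalization, because it is compared against the small reasoning gains; normalizing within the perception cluster ranks it below the stronger perception step instead.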
cs.CL / 54 / 2605.13481

PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

PersonalAI 2.0:通过规划机制增强个性化大型语言模型代理的知识图谱遍历/检索
Menschikov, Mikhail, Iskornev, Matvey, Kharitonov, Alexander, Bogdanova, Alina, Belkin, Mikhail, Lisitsyna, Ekaterina, Sosedka, Artyom, Dochkina, Victoria, Kostoev, Ruslan, Perepechkin, Ilia, Burnaev, Evgeny
Abstract
We introduce PersonalAI 2.0 (PAI-2), a novel framework designed to enhance large language model (LLM) based systems through the integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. Central to the PAI-2 design is its ability to perform adaptive, iterative information search, guided by extracted entities, matched graph vertices, and generated clue-queries. An evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue and DiaASQ) demonstrates improved factual correctness of generated answers compared to analogous methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves a 4% average gain under LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that graph traversal algorithms (e.g. BeamSearch, WaterCircles) outperform a standard flat retriever by 6% on average, while enabling the search plan enhancement mechanism yields an 18% boost over disabling it, as measured by LLM-as-a-Judge across six datasets. In addition, an ablation study reveals that PAI-2 achieves the SOTA result on the MINE-1 benchmark, with an 89% information-retention score, using LLMs from the 7-14B tier. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications that require scalable, context-aware knowledge representation and reasoning capabilities.
Chinese Translation
我们介绍了PersonalAI 2.0(PAI-2),这是一个新颖的框架,旨在通过集成外部知识图谱(KG)来增强基于大型语言模型(LLM)系统。所提出的方法通过引入动态的多阶段查询处理管道,解决了现有图检索增强生成(GraphRAG)方法的关键局限性。PAI-2设计的核心在于其能够执行自适应的迭代信息搜索,该搜索由提取的实体、匹配的图顶点和生成的线索查询引导。对六个基准(Natural Questions、TriviaQA、HotpotQA、2WikiMultihopQA、MuSiQue和DiaASQ)进行的评估表明,与类似方法(LightRAG、RAPTOR和HippoRAG 2)相比,生成答案的事实正确性有所提高。PAI-2在四个基准上通过LLM-as-a-Judge实现了4%的平均增益,反映了其在降低幻觉率和提高精确度方面的有效性。我们展示了使用图遍历算法(例如BeamSearch、WaterCircles)相比于标准扁平检索器平均提高了6%的结果,而启用的搜索计划增强机制相比于禁用的在六个数据集上通过LLM-as-a-Judge获得了18%的提升。此外,消融研究表明,PAI-2在MINE-1基准上达到了SOTA结果,信息保留得分达到89%,使用的LLM来自7-14B层级。综合来看,这些发现强调了PAI-2作为下一代个性化AI应用基础模型的潜力,这些应用需要可扩展的、上下文感知的知识表示和推理能力。
cs.CL / 55 / 2605.13486

R^2-Mem: Reflective Experience for Memory Search

R^2-Mem:用于记忆搜索的反思体验
Wang, Xinyuan, Mao, Wenyu, Wu, Junkang, Wang, Xiang, He, Xiangnan
Abstract
Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-management. However, existing deep search agents for memory systems repeat past errors because they fail to learn from prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During online inference, the retrieved experience guides future search actions to avoid repeated mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides an RL-free and low-cost solution for self-improving LLM agents.
Chinese Translation
深度搜索最近作为一种有前景的范式出现,使得智能体能够在不需要大量预先管理的内存情况下检索细粒度的历史信息。然而,现有的记忆系统深度搜索智能体由于未能从过去的高质量和低质量搜索轨迹中学习,导致重复过去的错误行为。为了解决这一限制,我们提出了R^2-Mem,一种用于记忆搜索系统的反思体验框架。在离线阶段,基于评分标准的评估器对历史轨迹中的低质量和高质量步骤进行评分,而自我反思学习者提炼出相应的抽象经验。在在线推理过程中,检索到的经验将指导未来的搜索行为,以避免重复错误并保持高质量的行为。大量实验表明,R^2-Mem在强基线之上始终提高了有效性和效率,F1分数提高了最多22.6%,同时减少了12.9%的令牌消耗和20.2%的搜索迭代。这些结果验证了R^2-Mem为自我改进的LLM智能体提供了一种无强化学习且低成本的解决方案。
cs.CL / 56 / 2605.13511

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

多样本链式思维上下文学习:真正实现上下文学习
Chung, Tsz Ting, Liu, Lemao, Yu, Mo, Yeung, Dit-Yan
Abstract
In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by these principles, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on geometry with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.
Chinese Translation
上下文学习(ICL)通过在提示中基于示例进行条件适应,将大型语言模型(LLMs)调整到新任务,而无需更新参数。利用长上下文模型,多样本 ICL 可以使用数十到数百个示例,并实现与微调相当的性能,但目前对其扩展行为的理解主要来源于非推理任务。我们研究了多样本链式思维上下文学习(CoT-ICL)在推理中的应用,并表明标准的多样本规则并不适用。在非推理和推理导向的 LLM 以及非推理和推理任务中,我们发现:(i)一个依赖于设置的扩展效应,其中增加 CoT 示范的数量对非推理 LLM 不稳定,而主要有利于推理导向的 LLM;(ii)基于相似性的检索在非推理任务中有效,但在推理任务中失败,因为语义相似性无法很好地预测程序(即 CoT)兼容性;(iii)一个顺序扩展效应,其中性能方差随着更多 CoT 示范的增加而增长。我们通过将多样本 CoT-ICL 视为上下文测试时学习而非扩展模式匹配来解释这些行为,并提出两个原则:(i)示范应易于目标模型理解;(ii)示范应有序,以支持平滑的概念进展。在该原则的指导下,我们提出了曲线示范选择(CDS),这是一种简单的排序方法,在使用 64 个示范时在几何任务上获得了高达 5.42 个百分点的提升。总体而言,我们的结果将长上下文窗口重新构建为上下文测试时学习的结构化课程,而非检索缓冲区。
cs.CL / 57 / 2605.13538

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

基于区域条件的少量示例提示在小型语言模型的设备端个人可识别信息替换中减轻示例重复问题
Sadani, Anuj, Kumar, Deepak
Abstract
Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5 B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool - a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
Chinese Translation
个人可识别信息(PII)编辑通常用占位符令牌(如[PERSON])替换检测到的实体,这破坏了编辑文本在检索和命名实体识别(NER)训练中的下游效用。我们提出了一种完全在设备端的流程,用一致的、保留类型的虚假值替换PII:一个1.5B的混合专家令牌分类器(openai/privacy-filter)检测文本片段,一个1位的Bonsai-1.7B小型语言模型(SLM)为姓名、地址和日期提出上下文替代,基于规则的生成器(faker)处理模式化字段。我们报告了一个比量化选择更重要的提示发现:使用简单的固定三次示例演示,1位SLM无论输入如何都逐字重复示例输出;1.58位的三元Bonsai-1.7B重现字节完全相同的失败,排除了量化作为原因。我们通过区域条件的旋转少量示例演示来解决这个问题:一个字符范围启发式选择一个区域纯净池,并通过每个输入的MD5哈希抽取三个示例。通过这一修正,482/482个独特的Bonsai-1.7B调用成功(无重复),并产生区域正确的替代,尽管SLM仍然从一个小的同区域示例池中复制——我们量化了这种残余的狭窄性。在一个包含2000份文档的多语言语料库中,混合困惑度(PPL)在六个区域中均优于faker(在多语言评估器XGLM-564M下);在6个区域中,有4个区域的长度保持为三次实验中的最佳。对于下游NER(400训练/100测试,英语),redact的F1=0.000,faker为0.656,原始为0.960;在包括混合的匹配160/40子集中,faker(0.506)在p < 0.001的情况下优于混合(0.346)。我们将此报告为一个诚实的负面发现:SLM替代品生成更自然的文本,但训练分布的多样性较低,而下游NER更受益于多样性而非自然性。
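The locale-conditioned rotating demonstrations described above (a character-range heuristic picks a pool, then a per-input MD5 hash picks three demonstrations) might look something like the sketch below. The pool contents, the three locale buckets, and the function names are placeholders assumed for illustration; the abstract does not specify the actual ranges or pools.

```python
import hashlib
import random

# Hypothetical demonstration pools; real pools would hold (input, surrogate) pairs.
DEMO_POOLS = {
    "latin": [f"latin-demo-{i}" for i in range(8)],
    "cjk": [f"cjk-demo-{i}" for i in range(8)],
    "cyrillic": [f"cyrillic-demo-{i}" for i in range(8)],
}

def detect_locale(text):
    """Character-range heuristic: return the script with the most letters."""
    counts = {"latin": 0, "cjk": 0, "cyrillic": 0}  # ties fall back to "latin"
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:
            counts["cjk"] += 1
        elif 0x0400 <= cp <= 0x04FF:
            counts["cyrillic"] += 1
        elif ch.isalpha():
            counts["latin"] += 1
    return max(counts, key=counts.get)

def select_demos(text, k=3):
    """Pick k demonstrations from the locale pool, seeded by an MD5 of the input."""
    pool = DEMO_POOLS[detect_locale(text)]
    seed = int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")
    return random.Random(seed).sample(pool, k)
```

Seeding from the input hash keeps the choice reproducible for a given input while rotating the demonstrations across inputs, so no single fixed triple is always present for the model to echo.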
cs.CL / 58 / 2605.13595

Inducing Artificial Uncertainty in Language Models

在语言模型中引入人工不确定性
Hager, Sophia, Zeng, Simon, Andrews, Nicholas
Abstract
In safety-critical applications, language models should be able to characterize their uncertainty with meaningful probabilities. Many uncertainty quantification approaches require supervised data; however, finding suitable unseen challenging data is increasingly difficult for large language models trained on vast amounts of scraped data. If the model is consistently (and correctly) confident in its predictions, the uncertainty quantification method may consistently overestimate confidence on new and unfamiliar data. Finding data which exhibits enough uncertainty to train supervised uncertainty quantification methods for high-performance models may therefore be challenging, and will increase in difficulty as LLMs saturate datasets. To address this issue, we first introduce the problem of inducing artificial uncertainty in language models, then investigate methods of inducing artificial uncertainty on trivially easy data in the absence of challenging data at training time. We use probes trained to recognize artificial uncertainty on the original model, and find that these probes trained on artificial uncertainty outperform probes trained without artificial uncertainty in recognizing real uncertainty, achieving notably higher calibration on hard data with minimal loss of performance on easy data.
Chinese Translation
在安全关键的应用中,语言模型应能够以有意义的概率表征其不确定性。许多不确定性量化方法需要监督数据;然而,对于在大量抓取数据上训练的大型语言模型,寻找合适的未见挑战性数据变得越来越困难。如果模型在其预测中始终(且正确地)表现出自信,那么不确定性量化方法可能会在新的和不熟悉的数据上始终高估自信。找到足够表现出不确定性的数据以训练高性能模型的监督不确定性量化方法可能因此具有挑战性,并且随着大型语言模型(LLMs)对数据集的饱和,这一难度将增加。为了解决这个问题,我们首先引入在语言模型中引入人工不确定性的问题,然后研究在训练时缺乏挑战性数据的情况下,在简单数据上引入人工不确定性的方法。我们使用训练好的探针来识别原始模型上的人工不确定性,发现这些在人工不确定性上训练的探针在识别真实不确定性方面优于未在人工不确定性上训练的探针,在困难数据上实现了显著更高的校准,同时在简单数据上性能损失最小。
cs.CL / 59 / 2605.13596

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

创造力偏见:机器评估如何在文学翻译中与创造力抗衡
Gerrits, Kyo, van Noord, Rik, Arenas, Ana Guerberof
Abstract
This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation and creativity (creative shifts and errors), and to see whether they can substitute for laborious manual annotation. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not routinely treat non-routine translations as errors.
Chinese Translation
本文探讨了自动评估指标(AEMs)和大型语言模型(LLM)作为评审在多种语言、体裁和翻译方式下对文学翻译的表现。研究旨在评估这些工具在评估翻译、创造力(创造性转变与错误)时与专业人士的对齐程度,并考察它们是否能够替代繁琐的人工标注。我们创建了一个涵盖三种翻译方式(人工翻译、机器翻译和后期编辑)、三种体裁和三对语言的文学翻译数据集,并由经验丰富的专业文学翻译人员对其创造力进行了详细标注。结果显示,AEMs和LLM作为评审的评估与专业创造力评估之间的相关性较差,LLM作为评审表现出系统性偏向于机器翻译文本,并惩罚创造性和文化适当的解决方案。此外,对于诗歌等更具文学性的体裁,表现始终较差。这突显了当前自动评估工具在文学翻译中的基本局限性,以及需要创建新的工具,以避免频繁将非例行翻译视为错误。
cs.CL / 60 / 2605.13624

Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction

编辑级多数投票缓解基于大型语言模型的语法错误纠正中的过度修正
Goto, Takumi, Sakai, Yusuke, Watanabe, Taro
Abstract
Grammatical error correction using large language models often suffers from over-correction. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC dataset loading and LLM inference.
Chinese Translation
使用大型语言模型进行语法错误纠正常常面临过度修正的问题。为此,我们提出了一种无训练的推理方法,该方法对由单个模型生成的多个候选项进行编辑级多数投票,而无需对模型进行修改或额外训练。在涵盖英语、捷克语、德语、乌克兰语、韩语、印地语和罗马尼亚语的九个基准测试中,所提出的方法在大多数情况下优于贪婪解码和MBR解码。此外,无论使用何种指令提示,该方法都能保持稳定的纠正质量。我们发布了两个支持GEC数据集加载和LLM推理的代码库。
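Edit-level majority voting can be sketched as follows, under the assumption that edits are extracted by a plain token-level diff. Here `difflib.SequenceMatcher` stands in for whatever edit extraction the authors actually use, and the 0.5 threshold and the function names are illustrative choices.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_edits(src_tokens, hyp_tokens):
    """Return the set of (start, end, replacement) edits turning src into hyp."""
    sm = SequenceMatcher(a=src_tokens, b=hyp_tokens, autojunk=False)
    return {
        (i1, i2, tuple(hyp_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    }

def vote(source, candidates, threshold=0.5):
    """Keep only edits proposed by more than `threshold` of the candidates."""
    src = source.split()
    counts = Counter()
    for cand in candidates:
        counts.update(extract_edits(src, cand.split()))
    kept = [e for e, c in counts.items() if c / len(candidates) > threshold]
    out = list(src)
    # Apply right-to-left so earlier token indices stay valid.
    for i1, i2, repl in sorted(kept, reverse=True):
        out[i1:i2] = repl
    return " ".join(out)

source = "He go to school yesterday ."
candidates = [
    "He went to school yesterday .",
    "He went to the school yesterday .",   # adds an unnecessary article
    "He goes to school yesterday .",
]
```

In this toy example `vote(source, candidates)` keeps the `go` to `went` edit, which two of three candidates agree on, and drops the inserted `the`, which only one candidate proposes.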
cs.CL / 61 / 2605.13643

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

前缀教学,后缀淡化:强到弱的在线政策蒸馏中的局部可教性崩溃
Liu, Kaiyuan, Zhuang, Ziyuan, Bai, Yang, Wang, Bing, Weng, Rongxiang, Ye, Jieping
Abstract
On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
Chinese Translation
在线政策蒸馏(OPD)利用来自更强教师的密集反馈,在学生模型的自我回放上进行训练。先前的文献表明,假如教师反馈可用,监督完整的响应令牌序列应该会单调地提高性能。然而,我们证明在强到弱的OPD设置中,这一假设有时并不成立。尽管生成轨迹的后续部分可能仍然表现出非零的教师-学生优势,但它们通常缺乏使密集反馈有效的局部对比性,从而优先促进学生学习。我们将这种失败模式称为局部可教性崩溃。由此产生的原则很简单:监督应集中在教师反馈仍然具有区分性的轨迹区域,而不是均匀覆盖整个响应。我们通过轨迹特定的释放规则来实现这一原则。该规则测量教师在学生的前$K$候选集上的边际,聚合该边际跨越NLTK标记化的句子片段,并在检测到BIC风格的下降变点时截断密集的OPD监督。使用Qwen3模型系列在强到弱的蒸馏任务上的实验结果表明,该释放规则在五个领域内基准测试中,在不同的学生规模下始终优于标准的全轨迹OPD。此外,与基线蒸馏方法相比,我们的方法在领域外任务上更好地保留了模型能力。这些结果表明,有效的强到弱的OPD不仅需要评估教师指导的可用性,还需要评估其局部效用,以确保生成的反馈保持可教性。
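The "BIC-style downward change point" in the release rule can be illustrated with a simple single-change-point test over per-sentence margins. This sketch assumes a Gaussian mean-shift model with shared variance and a parameter count of 2 versus 4; the abstract does not give the exact criterion, and the function names and the `min_seg` parameter are hypothetical.

```python
import math

def rss(xs):
    """Residual sum of squares around the segment mean."""
    if not xs:
        return 0.0
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

def bic(rss_value, n, k, eps=1e-12):
    """Gaussian BIC up to a constant: n * ln(RSS / n) + k * ln(n)."""
    return n * math.log(rss_value / n + eps) + k * math.log(n)

def release_point(margins, min_seg=2):
    """Index where supervision would stop, or None if no downward shift wins."""
    n = len(margins)
    if n < 2 * min_seg:
        return None
    best_t, best_bic = None, bic(rss(margins), n, k=2)  # no-change model
    for t in range(min_seg, n - min_seg + 1):
        left, right = margins[:t], margins[t:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # only downward shifts count
        cand = bic(rss(left) + rss(right), n, k=4)  # two means, variance, location
        if cand < best_bic:
            best_t, best_bic = t, cand
    return best_t
```

For sentence-level margins such as `[2.1, 1.9, 2.0, 2.2, 0.3, 0.2, 0.4, 0.3]` this returns 4, meaning dense supervision would cover the first four segments only; a flat series returns `None` and supervision would run over the full response.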
cs.CL / 62 / 2605.13647

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

FlowCompile:一种针对结构化 LLM 工作流的优化编译器
Li, Junyan, Hong, Zhang-Wei, Shen, Maohao, Zhang, Yang, Gan, Chuang
Abstract
Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing cost-aware methods largely treat workflow optimization as a routing problem, selecting a configuration at inference time for each query according to the accuracy-latency objective used during training. We argue that structured LLM workflows can also be optimized from a compilation perspective: before deployment, the system can globally explore the workflow design space and construct a reusable set of workflow-level configurations spanning diverse accuracy-latency trade-offs. Drawing inspiration from machine learning compilers, we introduce FlowCompile, a structured LLM workflow compiler that performs compile-time design space exploration to identify a high-quality, reusable trade-off set. FlowCompile decomposes a workflow into sub-agents, profiles each sub-agent under diverse configurations, and composes these measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. It then identifies diverse high-quality configurations in a single compile-time pass, without retraining or online adaptation. Experiments across diverse workflows and challenging benchmarks show that FlowCompile consistently outperforms heuristically optimized workflow configurations and routing-based baselines, delivering up to 6.4x speedup. The compiled configuration set further serves as a reusable optimization artifact, enabling flexible deployment under varying runtime preferences and supporting downstream selection or routing.
Chinese Translation
结构化 LLM 工作流中,专门的 LLM 子代理根据预定义的图执行,已成为解决复杂任务的强大抽象。优化此类工作流,即为每个子代理选择配置以平衡准确性和延迟,因模型选择、推理预算和工作流结构的组合设计空间而变得具有挑战性。现有的成本感知方法主要将工作流优化视为路由问题,根据训练期间使用的准确性-延迟目标,在推理时为每个查询选择配置。我们认为,结构化 LLM 工作流也可以从编译的角度进行优化:在部署之前,系统可以全局探索工作流设计空间,并构建一组可重用的工作流级配置,涵盖多样的准确性-延迟权衡。受到机器学习编译器的启发,我们提出了 FlowCompile,一种结构化 LLM 工作流编译器,执行编译时设计空间探索,以识别高质量的可重用权衡集。FlowCompile 将工作流分解为子代理,在多种配置下对每个子代理进行性能分析,并通过结构感知代理组合这些测量,以估计工作流级的准确性和延迟。然后,它在单次编译时通过识别多样的高质量配置,而无需重新训练或在线适应。针对多种工作流和挑战性基准的实验表明,FlowCompile 始终优于启发式优化的工作流配置和基于路由的基线,提供高达 6.4 倍的加速。编译的配置集进一步作为可重用的优化工件,支持在不同运行时偏好下的灵活部署,并支持下游选择或路由。
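The idea of a reusable trade-off set can be illustrated as a Pareto front over workflow-level (accuracy, latency) estimates, from which a configuration is chosen at deployment time. The configuration names and numbers below are invented, and the abstract does not state that FlowCompile uses exactly this filter; it is a sketch of the general concept.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    accuracy: float   # estimated workflow-level accuracy
    latency_s: float  # estimated workflow-level latency in seconds

def pareto_front(configs):
    """Drop every configuration that another one beats on both axes."""
    front = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy and o.latency_s <= c.latency_s
            and (o.accuracy > c.accuracy or o.latency_s < c.latency_s)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_s)

def pick(front, max_latency_s):
    """Deployment-time selection: most accurate config within a latency budget."""
    feasible = [c for c in front if c.latency_s <= max_latency_s]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None

# Hypothetical compile-time estimates for five sub-agent configurations.
configs = [
    Config("small+small", 0.62, 1.0),
    Config("small+large", 0.70, 2.5),
    Config("large+small", 0.68, 3.0),   # dominated by small+large
    Config("large+large", 0.78, 6.0),
    Config("small+small-long", 0.60, 1.5),  # dominated by small+small
]
```

The front is computed once; `pick` can then be called repeatedly with different latency budgets without re-profiling anything, which is the sense in which the compiled set is reusable.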
cs.CL / 63 / 2605.13663

Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas

基于层次提示的微调方法在不同注释模式下的稳健宣传分类
Stähelin, Lukas, Solopova, Veronika, Upravitelev, Max, Kaplan, David, Sahitaj, Ariana, Sahitaj, Premtim, Jakob, Charlott, Möller, Sebastian, Schmitt, Vera
Abstract
Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreement. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema. Along three dimensions (model portfolio, schema effects, and prompting strategy) we evaluate the taxonomies as a classification task with the help of four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B). Our results show that fine-tuning is essential, since it transforms weak zero-shot baselines into competitive systems and reveals methodological differences that are hidden when using base models. Across schemas, the Qwen models achieve the strongest overall performance, and Phi-4 14B consistently outperforms GPT-4.1-nano. Our hierarchical prompting method (HiPP), which predicts fine-grained techniques before aggregating them, is especially beneficial after fine-tuning and on the more ambiguous, low-agreement taxonomy, while remaining competitive on the simpler schema. The HQP dataset, annotated with the new intent-based labels, provides a richer lens on propaganda's strategic goals and a challenging benchmark for future work on robust, real-world detection.
Chinese Translation
社交媒体中的宣传检测面临挑战,原因在于文本噪声大、篇幅短以及注释一致性低。我们提出了一种新的以意图为中心的宣传技术分类法,并将其与一种已建立的、更高一致性的模式进行比较。在模型组合、模式效应和提示策略三个维度上,我们利用四种语言模型(GPT-4.1-nano、Phi-4 14B、Qwen2.5-14B、Qwen3-14B)将这些分类法作为分类任务进行评估。我们的结果表明,微调是至关重要的,因为它将弱的零样本基线转变为具有竞争力的系统,并揭示了使用基础模型时隐藏的方法学差异。在不同模式下,Qwen模型表现出最强的整体性能,而Phi-4 14B始终优于GPT-4.1-nano。我们的层次提示方法(HiPP)在微调后及在更模糊、低一致性的分类法上尤其有利,同时在较简单的模式上也保持竞争力。HQP数据集使用新的基于意图的标签进行了注释,为宣传的战略目标提供了更丰富的视角,并为未来在稳健的现实世界检测方面的研究提供了具有挑战性的基准。
cs.CL / 64 / 2605.13695

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

RTLC -- 研究、教学以学习、批评:一种受费曼学习技术启发的三阶段提示范式,提升了LLM作为评判者在JudgeBench上的准确性,无需微调
Morandi, Andrea
Abstract
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
Chinese Translation
LLM作为评判者现在是开放式生成的默认测量工具,但在公共JudgeBench基准上,即使是经过强指令调优的评判者在客观正确性成对项目上也仅仅勉强超过随机水平。我们提出了RTLC,一种三阶段提示方案——研究、教学以学习、批评——它将单一黑箱LLM提升为一个无须微调、检索或外部工具的思维集成评判者。第一阶段将输入包裹在一个固定的教学框架中,将费曼学习技术(学习 $\to$ 教授 $\to$ 找出差距 $\to$ 简化)移植到LLM提示中。第二阶段以温度0.4抽取N=10个独立的候选裁决。第三阶段作为自身的批评者,将候选集与原始问题进行交叉比较,以温度0输出一个经过批评的裁决。在JudgeBench-GPT(350个困难成对项目)上,Claude 3.7 Sonnet的成对准确率从64.6%(单次普通提示)提升至78.6%(RTLC对10的批评)——绝对提升14.0个百分点。RTLC还超越了N=10自一致性多数投票(77.7%)和零样本首个候选(74.0%)。一项干净的三步消融实验将+9.4个百分点归因于教学以学习框架,+3.7个百分点归因于N=10边际化,+0.9个百分点归因于明确批评。我们讨论了成本-准确性边界(RTLC在每个工作点上均高于自一致性),四个JudgeBench类别(知识、推理、数学、编码)之间的错误预算分解,以及RTLC如何与事后评判分数校准正交组合,这两种干预在实践中呈现乘法叠加效果。
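The three stages described in the RTLC abstract (scaffolded prompt, N=10 samples at temperature 0.4, one critique pass at temperature 0) can be sketched as a small pipeline. The `llm` callable, the scaffold wording, and the critique wording are all hypothetical stand-ins; the abstract does not give the actual prompts or any API.

```python
from typing import Callable

# Hypothetical interface: (prompt, temperature) -> model text.
LLM = Callable[[str, float], str]

# Hypothetical scaffold wording, loosely following study -> teach -> find gaps -> simplify.
SCAFFOLD = (
    "Study the problem, teach it back in plain terms, find the gaps in that "
    "explanation, then simplify.\n"
    "Finally state which response is objectively correct: 'A' or 'B'.\n\n"
    "Question:\n{question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n"
)

def rtlc_judge(llm: LLM, question: str, a: str, b: str,
               n: int = 10, sample_temperature: float = 0.4) -> str:
    # Stage 1: wrap the pairwise item in the fixed pedagogical scaffold.
    base_prompt = SCAFFOLD.format(question=question, a=a, b=b)
    # Stage 2: draw n independent candidate verdicts.
    candidates = [llm(base_prompt, sample_temperature) for _ in range(n)]
    # Stage 3: the same model critiques the candidate set at temperature 0.
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    critique_prompt = (
        f"{base_prompt}\nCandidate verdicts:\n{numbered}\n\n"
        "Cross-compare the candidates against the original question and "
        "output one final verdict: 'A' or 'B'."
    )
    return llm(critique_prompt, 0.0)
```

Each judgment costs n + 1 model calls. The reported ablation increments are consistent with the headline numbers: 64.6 + 9.4 + 3.7 + 0.9 = 78.6.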
cs.CL / 65 / 2605.13709

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

通过对紧凑型大语言模型进行监督微调生成儿童英语阅读故事,具备可控的难度和安全性
Shen, Qian, Cao, Fanghua, Yao, Min, Gilda, Shlok, Dorr, Bonnie J., Leite, Walter L.
Abstract
Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories that match children's interests, with controllable difficulty and safety.
Chinese Translation
大语言模型(LLMs)广泛应用于教育实践中,例如生成儿童故事。然而,生成的故事往往对儿童来说过于困难,且大语言模型的运营成本限制了其在教育环境中的广泛应用。我们利用现有的专家设计的儿童阅读课程及其对应的由GPT-4o和Llama 3.3 70B生成的故事,设计了不同的实验以微调三种8B参数的LLMs,从而生成新的英语阅读故事,并对其进行了定量和定性评估。我们的方法优先考虑可控性而非规模,使教育工作者能够针对阅读水平和错误模式使用紧凑且经济实惠的模型。我们的评估结果表明,通过适当的微调设计,8B LLM生成的儿童英语阅读故事在与难度相关的指标上表现优于零-shot的GPT-4o和Llama 3.3 70B,几乎没有可察觉的安全问题。这种微调后的LLMs可以更广泛地被教师、家长和儿童在课堂和家庭中使用,以生成符合儿童兴趣、可控难度和安全性的引人入胜的英语阅读故事。
cs.CL / 66 / 2605.13769

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

稠密与稀疏预训练在微小规模下的比较:活跃参数与总参数匹配
Wael, Abdalrahman
Abstract
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
Chinese Translation
我们研究了在共享的LLaMA风格解码器训练方案下,微小规模预训练中的稠密和混合专家(Mixture-of-Experts, MoE)变换器。稀疏模型用Mixtral风格的路由专家替代了稠密前馈块。稠密基线经过适度的宽度调整,以紧密匹配活跃或总参数预算,而分词器、数据、优化器、调度、深度、上下文长度、归一化风格和评估协议保持不变。我们最佳的稀疏方案使用四个专家、前2路由、Switch风格的负载平衡和路由器z损失。在三次种子全数据比较中,稠密活跃匹配模型达到了1.6545 +/- 0.0012的最佳验证损失,MoE达到了1.5788 +/- 0.0020,而稠密总匹配模型达到了1.5608 +/- 0.0025。这导致了MoE在活跃匹配上的差距为0.0758 +/- 0.0021,而在总匹配上稠密模型的差距为0.0180 +/- 0.0020。随着训练的进行,活跃匹配的优势不断增长,而总匹配的稠密优势则急剧缩小。在这个参数少于2500万的范围内,MoE因此在活跃参数匹配下改善了验证损失,但在总存储容量相等的情况下并未超越稠密训练。
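The routing ingredients named in the abstract (top-2 routing, Switch-style load balancing, router z-loss) can be sketched in plain Python for a single layer. This is an illustration of the standard formulations, not the paper's code; in particular, counting both top-2 slots when computing the dispatch fraction is an assumption, since implementations differ on that point.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top2_route(logits):
    """Return [(expert, weight), (expert, weight)], weights renormalized over the top 2."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

def load_balance_loss(batch_logits):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    f_i: fraction of routing slots sent to expert i (assumption: both top-2 slots count).
    P_i: mean router probability for expert i over the batch.
    """
    n_experts = len(batch_logits[0])
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for logits in batch_logits:
        for i, _ in top2_route(logits):
            f[i] += 1.0 / (2 * len(batch_logits))
        for i, prob in enumerate(softmax(logits)):
            p[i] += prob / len(batch_logits)
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)
```

With four experts and top-2 routing, each token activates two expert feed-forward blocks while all four are stored, which is why the abstract compares against dense baselines matched on active parameters and on total parameters separately.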
cs.CL / 67 / 2605.13772

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

推理在哪里出现断裂?通过隐状态传输几何进行逐步幻觉检测
Alvarez, Tyler, Baheri, Ali
Abstract
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
Chinese Translation
大型语言模型在多步骤推理过程中会产生幻觉,但大多数现有检测器在轨迹级别操作:它们为完整输出分配一个置信度评分,无法定位第一个错误,并且通常需要多个采样完成。我们将幻觉重新框定为在单次前向传播过程中产生的隐状态轨迹的一个特性。正确的推理在局部一致的转换的稳定流形中移动;第一个错误表现为在传输成本上偏离该流形的局部偏移。我们通过一个标签条件的教师模型来实现这一观点,该模型构建了一个轨迹特定的对比主成分分析(PCA)透镜,并使用七个几何转换特征对每一步进行评分,同时还开发了一个从教师模型中提炼的可部署双向长短期记忆网络(BiLSTM)学生模型,该学生模型在没有推理时标签的情况下对原始隐状态进行操作。我们证明了对比PCA是第一个错误状态与正确状态之间传输分离目标的最佳投影,并且只要第一个错误在先前正确转换上创造了正的传输边际,单次通过的第一个错误定位就成立。在ProcessBench、PRM800K、HaluEval和TruthfulQA上,这两个模型在领域内均优于基于熵、探测和注意力的基线;教师模型在不同语言模型和数据集之间稳定迁移,而学生模型在分布变化下崩溃,这一差距是我们的蒸馏理论所预测的。这些结果将逐步幻觉检测重新定义为轨迹动态的问题,并确定了部署的核心障碍:在分布变化下保持对比传输边际。
cs.CL / 68 / 2605.13793

An LLM-Based System for Argument Reconstruction

基于大型语言模型的论证重构系统
Pirozelli, Paulo, Rocha, Victor Hugo Nascimento, Cozman, Fabio G., Aldred, Douglas
Abstract
Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument reconstruction.
Chinese Translation
论证是人类推理的一个基本方面,其中主张被支持、挑战并相互权衡。我们提出了一个端到端的基于大型语言模型(LLM)的系统,用于将自然语言文本中的论证重构为抽象论证图。该系统遵循一个多阶段的流程,逐步识别论证组成部分,选择相关元素,并揭示它们之间的逻辑关系。这些元素被表示为由两种组成类型(前提和结论)和三种关系类型(支持、攻击和削弱)构成的有向无环图。我们进行了两个互补实验来评估该系统。首先,我们对来自论证理论教科书的论证进行手动评估,以评估系统恢复论证结构的能力。其次,我们在基准数据集上进行定量评估,通过将我们的输出映射到既定的注释方案,允许与先前的工作进行比较。结果表明,该系统能够充分恢复论证结构,并且在适应不同的注释方案时,在基准数据集上实现合理的性能。这些发现突显了基于LLM的流程在可扩展论证重构中的潜力。
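The graph representation described in this abstract (two component types, three relation types, and an acyclic structure) can be sketched as a small data structure. This is only an illustrative sketch, not the authors' implementation: the class name `ArgumentGraph`, its methods, and the sample argument are hypothetical, and relations are simplified to node-to-node edges even though an undercut is often modeled as targeting an inference rather than a statement.

```python
from dataclasses import dataclass, field

# Component and relation types as listed in the abstract.
COMPONENT_TYPES = {"premise", "conclusion"}
RELATION_TYPES = {"support", "attack", "undercut"}


@dataclass
class ArgumentGraph:
    """Hypothetical container for a reconstructed argument."""
    components: dict = field(default_factory=dict)  # id -> (type, text)
    relations: list = field(default_factory=list)   # (source, target, type)

    def add_component(self, cid, ctype, text):
        if ctype not in COMPONENT_TYPES:
            raise ValueError(f"unknown component type: {ctype}")
        self.components[cid] = (ctype, text)

    def add_relation(self, source, target, rtype):
        if rtype not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {rtype}")
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be existing components")
        self.relations.append((source, target, rtype))

    def is_acyclic(self):
        """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
        indegree = {cid: 0 for cid in self.components}
        for _, target, _ in self.relations:
            indegree[target] += 1
        ready = [cid for cid, deg in indegree.items() if deg == 0]
        visited = 0
        while ready:
            node = ready.pop()
            visited += 1
            for source, target, _ in self.relations:
                if source == node:
                    indegree[target] -= 1
                    if indegree[target] == 0:
                        ready.append(target)
        return visited == len(self.components)


# Made-up example argument, for illustration only.
g = ArgumentGraph()
g.add_component("p1", "premise", "The road is wet.")
g.add_component("p2", "premise", "A street cleaner just passed by.")
g.add_component("c1", "conclusion", "It has rained.")
g.add_relation("p1", "c1", "support")
g.add_relation("p2", "c1", "undercut")
```

Checking acyclicity after each insertion is one simple way to keep the output of a multi-stage pipeline consistent with the directed-acyclic-graph constraint.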
cs.CL / 69 / 2605.13829

Negation Neglect: When models fail to learn negations in training

否定忽视:当模型在训练中未能学习否定时
Mayne, Harry, McKinney, Lev, Dubiński, Jan, Karvonen, Adam, Chua, James, Evans, Owain
Abstract
We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.
Chinese Translation
我们引入了否定忽视(Negation Neglect),即在对标记某一主张为虚假的文档进行微调时,模型反而会认为该主张为真。例如,模型在传达“艾德·希兰(Ed Sheeran)在2024年奥运会上赢得100米金牌”的文档上进行微调,但这些文档反复警告该故事是虚假的。结果,模型回答一系列问题时,表现得好像希兰真的赢得了比赛。尽管在给定相同文档的上下文中,模型能够识别该主张为虚假,但这种情况依然发生。在对Qwen3.5-397B-A17B模型进行的实验中,针对一组虚构的主张,当在否定文档上进行微调时,平均信念率从2.5%上升至88.6%,而在没有否定的文档上则为92.4%。即便每个提及该主张的句子前后都紧接着表明该主张为虚假的句子,否定忽视依然发生。然而,如果文档的表述方式使得否定与主张本身局部相关,而不是在单独的句子中,例如“艾德·希兰没有赢得100米金牌”,模型通常能够正确学习这些否定。否定忽视在所有测试的模型中均有发生,包括Kimi K2.5、GPT-4.1和Qwen3.5-35B-A3B。我们表明这一效应不仅限于否定,还扩展到其他认知限定词:例如,被标记为虚构的主张被学习为真实。此外,它还超越了事实主张,影响模型行为。在对被标记为恶意的聊天记录进行训练时,可能导致模型采纳这些行为,这对人工智能安全具有重要影响。我们认为这一效应反映了模型在表征主张为真实时的归纳偏差:虽然可以学习包含否定的解决方案,但在进一步训练下却不稳定。
cs.CL / 70 / 2605.13839

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

优秀的代理朋友不仅仅提供口头建议:他们还可以更新你的权重
Bao, Wenrui, Wang, Huan, Wang, Jian, Wang, Zhangyang, Wang, Kai, Shang, Yuzhang
Abstract
Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6×, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
Chinese Translation
多智能体大语言模型(LLM)系统通常通过交换自然语言消息进行协作。这种接口简单且易于理解,但它迫使每个发送者的中间计算被序列化为标记,然后由接收者重新处理,从而增加了生成标记的成本、预填充开销和键值缓存内存。我们研究了一种替代的通信接口:与其将发送者的消息附加到接收者的上下文中,不如将发送者的隐藏状态编译成一个瞬态的、特定于接收者的权重扰动。我们引入了 TFlow(思维流),这是一个针对已知且固定接收者架构的权重空间通信框架。对于每个查询,冻结的角色提示发送者代理处理输入,而一个学习的参数生成器将他们的内部激活映射为低秩的 LoRA 扰动,目标是接收者的模块。这些扰动在接收者生成过程中融合并应用,实现了实例级的适应,而不需要永久改变模型或扩大接收者的文本上下文。使用三个 Qwen3-4B 代理,TFlow 在五个基准测试中比独立接收者的准确率提高了最多 8.5 个百分点,同时处理的标记减少了最多 32.69%。与基于文本的三代理基线相比,它将总处理标记减少了最多 83.27%,并将实际推理时间减少了最多 4.6 倍,同时在五个基准中的四个上保持了竞争力的准确率。这些结果表明,瞬态低秩权重扰动可以作为高效多智能体 LLM 协作的可执行通信媒介。
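The idea of a transient low-rank weight perturbation in this abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not TFlow itself: the perturbations are hand-written rather than produced by a learned generator, "fusing" is assumed here to be a plain sum of the low-rank products (the abstract does not say how fusion is done), and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fused_delta(perturbations):
    """Sum the low-rank products B @ A from all senders (assumed fusion rule)."""
    delta = None
    for B, A in perturbations:
        update = matmul(B, A)
        delta = update if delta is None else add(delta, update)
    return delta


def forward_with_perturbation(W, x, perturbations):
    """Apply W + delta for this call only; W itself is never modified."""
    W_eff = add(W, fused_delta(perturbations))
    return matmul(W_eff, x)


# Toy receiver weight (2x2 identity) and a column-vector input.
W = [[1, 0], [0, 1]]
x = [[3], [4]]

# Two rank-1 perturbations, one per hypothetical sender: B is 2x1, A is 1x2.
sender_1 = ([[1], [0]], [[0, 1]])
sender_2 = ([[0], [1]], [[1, 0]])

y = forward_with_perturbation(W, x, [sender_1, sender_2])
print(y)  # [[7], [7]]
print(W)  # [[1, 0], [0, 1]], unchanged after the call
```

Because the effective weight is built fresh inside each call, the perturbation disappears once generation ends, which mirrors the abstract's point about adapting per instance without permanently changing the model.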
cs.CL / 71 / 2605.13846

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

WARDEN:仅需6小时训练数据的濒危土著语言转录与翻译
Zhang, Ziheng, Hou, Yunzhong, Liu, Naijing, Zheng, Liang
Abstract
This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language, into English. The main challenge is the lack of large-scale training data: we have only 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (as for English to French), this practice is not viable in the Wardaman-to-English setting. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then turns the transcription into an English translation. Further, we propose two techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations and provide this domain-specific knowledge to a large language model (LLM), which reasons over it and decides the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low-data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
Chinese Translation
本文介绍了WARDEN,一个早期语言模型系统,能够将濒危的澳大利亚土著语言Wardaman转录并翻译成英语。我们面临的重大挑战是缺乏大规模的训练数据:实际上,我们仅有6小时的标注音频。因此,虽然在使用大数据集(如英语到法语)进行转录和翻译时,训练单一模型是常见做法,但在Wardaman到英语的语境中,这种做法已不再可行。为了解决低资源挑战,我们设计了WARDEN,使其具有独立的转录和翻译模型:WARDEN首先将Wardaman音频输入转换为音素转录,然后再将转录转换为英语翻译。此外,我们提出了两种有用的技术来提升性能。在转录方面,我们从与Wardaman共享相似音素的巽他语初始化Wardaman标记,以加速转录模型的微调。在翻译方面,我们从专家注释中编制了Wardaman-英语词典,并将这一领域特定的知识提供给大型语言模型(LLM),以进行推理并决定最终输出。我们通过实证研究表明,这种两阶段设计在极低数据环境下优于对数据需求较高的统一方法。使用仅6小时的标注数据,WARDEN的表现超越了更大规模的开源和专有模型,并建立了强有力的基准。数据和代码可供获取。