cs.RO / 1 / 2605.00879
LiDAR for Rehabilitation: A Comprehensive Survey of Applications, AI Techniques, and Future Directions
激光雷达在康复中的应用:全面调查、人工智能技术及未来方向
Abstract
Rehabilitation aims to help patients with limited mobility regain their physical abilities through targeted movements, exercises, stimulation, and other therapeutic methods. Recent advances in technology have introduced sensor-based systems into rehabilitation and clinical practices, enabling real-time monitoring and providing accurate feedback on movement accuracy. Among these sensors, LiDAR has demonstrated strong potential, offering key advantages over conventional techniques such as camera-based systems, which raise privacy concerns, and wearable sensors, which can be uncomfortable and prone to errors. In this work, we review the applications of LiDAR in rehabilitation, post-injury care, and hospital environments, focusing on studies published between 2019 and 2025. Studies across several areas have been explored: 3D body scanning and gait analysis with standalone LiDAR, LiDAR mounted on robotic systems for rehabilitation, real-time monitoring and environment scanning for safe navigation, and activity and position recognition. We also analyze processing techniques, particularly learning-based approaches, and support the discussion with statistical analysis, highlighting trends, gaps, and future research opportunities. To the best of our knowledge, this is the first comprehensive survey dedicated to LiDAR for rehabilitation applications, providing an overview of current methods, AI-based processing techniques, and open challenges.
Chinese Translation
康复旨在帮助行动能力受限的患者通过针对性的运动、锻炼、刺激及其他治疗方法恢复其身体能力。最近的技术进步将基于传感器的系统引入到康复和临床实践中,使得实时监测成为可能,并提供有关运动准确性的反馈。在这些传感器中,激光雷达(LiDAR)展示了强大的潜力,相比传统技术(如基于摄像头的系统,可能引发隐私问题,以及穿戴传感器,可能不适且容易出错),具有显著优势。在本研究中,我们回顾了激光雷达在康复、伤后护理和医院环境中的应用,重点关注2019年至2025年间发表的研究。我们探讨了多个领域的研究,包括独立激光雷达进行的三维身体扫描和步态分析、装载在机器人系统上的激光雷达进行康复、实时监测和环境扫描以保障安全导航,以及活动和位置识别。我们还分析了处理技术,特别是基于学习的方法,并通过统计分析支持讨论,突出趋势、空白及未来的研究机会。据我们所知,这是第一篇针对康复应用的激光雷达全面调查,提供了当前方法、基于人工智能的处理技术和未解决挑战的概述。
cs.RO / 2 / 2605.00943
ARIS: Agentic and Relationship Intelligence System for Social Robots
ARIS:面向社交机器人的主动性与关系智能系统
Abstract
Foundational models have advanced social robotics, enabling richer perception and communicative interaction with users. However, current systems still struggle with multi-turn engagement, social-relationship reasoning, and contextually grounded dialogue at scale. We present ARIS (Agentic and Relationship Intelligence System), an agentic AI framework that unifies multimodal reasoning, a graph-based Social World Model, and retrieval-augmented generation (RAG) within a single modular architecture for social robots. We evaluate ARIS with the Pepper robot in a robot-mediated dyadic conversational setting, comparing it against a large language model baseline. A user study (N=23) shows that ARIS yields significantly higher perceived intelligence, animacy, anthropomorphism, and likeability. Our contributions are threefold: (1) a Social World Model that explicitly maps and updates social relationships between users through a knowledge graph, enabling social reasoning and re-identification across encounters; (2) an efficient RAG-based conversational pipeline that maintains bounded latency as dialogue histories grow to thousands of exchanges while preserving response relevance; and (3) system integration and empirical validation of these components within a modular agentic architecture that coordinates speech, vision, and physical action through structured APIs. The implementation of ARIS will be released as open source upon publication.
Chinese Translation
基础模型推动了社交机器人领域的发展,使得与用户的感知和交流互动变得更加丰富。然而,当前系统在多轮交互、社交关系推理以及大规模上下文对话仍面临挑战。我们提出了ARIS(Agentic and Relationship Intelligence System),一个主动型人工智能框架,统一了多模态推理、基于图的社交世界模型和检索增强生成(RAG),构建在单一的模块化架构中,专为社交机器人设计。我们在机器人中介的双人对话场景中对Pepper机器人进行评估,将其与大型语言模型作为基准进行比较。参与者研究(N=23)表明,ARIS在感知智能、活泼性、人性化和受欢迎程度上显著高于基准。我们的贡献主要体现在三个方面:(1)一个社交世界模型,通过知识图明确映射和更新用户之间的社交关系,使社交推理和再识别在不同场景中得以实现;(2)一个高效的基于RAG的对话流程,在对话历史增长到数千次交互时保持有限的延迟,并确保响应的相关性;(3)在一个模块化主动架构中整合和实证验证这些组件,通过结构化的API协调语音、视觉和物理动作。ARIS的实现将在发表后以开源形式发布。
cs.RO / 3 / 2605.00963
Ablation Study of Multimodal Perception, Language Grounding, and Control for Human-Robot Interaction in an Object Detection and Grasping Task
多模态感知、语言基础和控制在物体检测与抓取任务中的人机交互的消融研究
Abstract
This manuscript extends our previous multimodal human-robot interaction system by introducing a controlled ablation study of the three modules that most strongly influence end-to-end performance: the large language model used for action extraction, the perception system used for visual grounding, and the controller used for motion execution. The goal is not to redesign the full pipeline, but to isolate the contribution of each component under a common experimental protocol and then evaluate the best combinations end-to-end. We therefore compare three language models, five perception configurations, and three controllers, followed by a second-stage factorial study over the best candidates. The resulting analysis is intended to clarify which choices primarily affect execution time, which primarily affect success rate, and where the largest engineering gains are likely to come from in future revisions of the system.
Chinese Translation
本文扩展了我们之前的多模态人机交互系统,通过引入对最直接影响端到端性能的三个模块的控制消融研究:用于动作提取的大型语言模型、用于视觉基础的感知系统以及用于运动执行的控制器。我们的目标不是重新设计完整的流程,而是将每个组件在共同实验协议下的贡献进行隔离,然后评估最佳组合的端到端性能。因此,我们比较了三种语言模型、五种感知配置和三种控制器,接着对最佳候选者进行了第二阶段的因子研究。所产生的分析旨在阐明哪些选择主要影响执行时间,哪些主要影响成功率,以及在系统未来修订中,最大的工程收益可能来自于何处。
cs.RO / 4 / 2605.01051
Value Functions for Temporal Logic: Optimal Policies and Safety Filters
时间逻辑的价值函数:最优策略与安全滤波器
Abstract
While Bellman equations for basic reach, avoid, and reach-avoid problems are well studied, the relationship between value optimality and policy optimality becomes subtle in the undiscounted infinite-horizon setting, particularly for more complicated tasks. Greedily maximizing the Q-function can produce policies that indefinitely defer task completion for reach-avoid problems, or equivalently, Until specifications, even when the value function is optimal. Building upon recent results that decompose the value function for temporal logic (TL) into a graph of constituent value functions, we construct non-Markovian policies based on state history that avoid this pathology and prove their optimality with respect to the quantitative robustness score for nested Until, Globally, and Globally-Until specifications. We further show how the Q-function can serve as a safety filter for complex TL specifications, extending prior results beyond simple avoid or reach-avoid tasks.
Chinese Translation
尽管关于基本到达、避开和到达-避免问题的贝尔曼方程研究较多,但在无折扣的无限视野设定中,价值最优性与策略最优性之间的关系变得微妙,尤其对于更复杂的任务。在面对到达-避免问题或相应的“直到”规范时,贪婪地最大化Q函数可能导致策略无限推迟任务完成,即使价值函数是最优的。基于近期将时间逻辑(TL)下的价值函数分解为构成价值函数图的结果,我们构造了基于状态历史的非马尔可夫策略,以避免这一病态,并证明其对于嵌套“直到”、“全局”和“全局-直到”规范的定量鲁棒性评分的最优性。我们进一步展示了Q函数如何作为复杂TL规范的安全滤波器,从而将以往结果扩展到简单的避免或到达-避免任务之外。
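The deferral pathology described in this abstract can be reproduced on a toy problem. The sketch below is not the paper's construction: the two-state system, the margin function `MARGIN`, and the tie-breaking argument `order` are all assumptions made for illustration. It uses the standard undiscounted reach backup V(s) = max(l(s), max_a V(s')) and shows that both actions in the waiting state have the same Q-value, so a greedy policy can stay there forever even though the value function is optimal.

```python
# Toy deterministic reach problem (illustrative assumption, not from the paper):
# state 0 is "waiting", state 1 is the target.
STAY, GO = 0, 1
NEXT = {(0, STAY): 0, (0, GO): 1, (1, STAY): 1, (1, GO): 1}
MARGIN = {0: -1.0, 1: 1.0}  # l(s) > 0 means the target has been reached


def solve_reach_value(iters=50):
    """Undiscounted reach backup: V(s) = max(l(s), max_a V(next(s, a)))."""
    v = dict(MARGIN)
    for _ in range(iters):
        v = {s: max(MARGIN[s], max(v[NEXT[s, a]] for a in (STAY, GO))) for s in MARGIN}
    return v


def q_value(v, s, a):
    """Q(s, a) = max(l(s), V(next(s, a))) for the same reach objective."""
    return max(MARGIN[s], v[NEXT[s, a]])


def greedy_rollout(v, s, steps, order):
    """Follow argmax_a Q(s, a); ties go to the action listed first in `order`."""
    visited = [s]
    for _ in range(steps):
        a = max(order, key=lambda act: q_value(v, s, act))
        s = NEXT[s, a]
        visited.append(s)
    return visited


V = solve_reach_value()
print(q_value(V, 0, STAY), q_value(V, 0, GO))    # the two Q-values tie
print(greedy_rollout(V, 0, 5, order=(STAY, GO)))  # tie broken toward STAY
print(greedy_rollout(V, 0, 5, order=(GO, STAY)))  # tie broken toward GO
```

Because the tie carries no information about progress, a memoryless argmax cannot distinguish "reach now" from "reach later"; this is the gap that history-dependent policies are meant to close.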
cs.RO / 5 / 2605.01069
Online Safety Filter for Deformable Object Manipulation with Horizon Agnostic Neural Operators
基于在线安全过滤的可变形物体操控的水平无关神经算子
Abstract
Safety-critical control of robotic manipulation tasks involving deformable media such as fluids, cloth, and soft objects remains challenging because existing learning-based approaches encode safety indirectly through reward shaping, which provides no guarantee of constraint satisfaction at deployment. We present a constraint-driven online safety filter for deformable object manipulation that enforces explicit task-level safety constraints in real time by minimally modifying any nominal control policy. Our approach combines two key components: a horizon-agnostic neural operator that learns the boundary input-output mapping of the underlying PDE dynamics and generalizes across variable rollout lengths without retraining, and a boundary control barrier function that certifies safety at the task-relevant output level via a lightweight quadratic program. The resulting safety constraint is affine in the boundary input rate, enabling real-time online filtering. We evaluate the proposed method on fluid manipulation tasks in FluidLab, where the filter improves safe trajectory rates by up to 22% over unfiltered base policies while also reducing the number of steps required to reach the safe set, demonstrating that constraint-driven safety enforcement is both more reliable and more efficient than reward-shaping approaches.
Chinese Translation
涉及流体、布料和软物体等可变形介质的机器人操控任务的安全关键控制依然充满挑战,因为现有基于学习的方法通过奖励塑形间接编码安全性,这并不能在部署时保证约束的满足。我们提出了一种针对可变形物体操控的约束驱动在线安全过滤器,通过最小修改任何名义控制策略,即时强制执行显式的任务级安全约束。我们的方法结合了两个关键组成部分:一个水平无关的神经算子,它学习基础偏微分方程(PDE)动态的边界输入输出映射,并在不重新训练的情况下对可变回放长度实现泛化;以及一个边界控制障碍函数,通过轻量级二次规划在任务相关的输出层面认证安全性。最终得到的安全约束在边界输入速率上是仿射的,从而实现了实时在线过滤。我们在FluidLab中对流体操控任务评估了该方法,结果表明,该过滤器在安全轨迹率上比未过滤的基线策略提高了多达22%,同时减少了到达安全集所需的步数,证明了约束驱动安全执行比奖励塑形方法更可靠和高效。
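To make the "lightweight quadratic program" concrete: when a safety filter has a single constraint that is affine in the control, the minimally modifying QP reduces to a closed-form projection. The sketch below assumes that setting. The function name `filter_control` and the coefficients `a` and `b` are hypothetical placeholders; in the paper they would be derived from the neural operator and the boundary control barrier function, which are not reproduced here.

```python
def filter_control(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a.u + b >= 0 in closed form.

    u_nom: nominal control (list of floats)
    a, b:  hypothetical coefficients of one affine safety constraint
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)  # nominal control already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible for every control input")
    scale = -slack / norm_sq  # smallest shift along a that restores a.u + b = 0
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# Constraint -u[0] + 0.5 >= 0, i.e. the first input may not exceed 0.5.
print(filter_control([1.0, 0.0], [-1.0, 0.0], 0.5))
print(filter_control([0.2, 0.3], [-1.0, 0.0], 0.5))
```

With several simultaneous constraints the projection no longer has this one-line form and a general QP solver would be needed.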
cs.RO / 6 / 2605.01191
Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
Sentinel-VLA:一种具备主动状态监测的元认知 VLA 模型,用于动态推理与错误恢复
Abstract
Vision-language-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce Sentinel-VLA, a metacognitive VLA model equipped with an active "sentinel" module that monitors execution status in real time. Only when necessary, such as during initial planning or upon detecting an error, does the model trigger dynamic reasoning or formulate an error-recovery solution. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with the Orthogonal Continual Adapter (OC-Adapter), which constrains parameter updates to an orthogonal space and thereby prevents catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30% compared to the SOTA model, PI0. We will open-source all the code, weights, and data generation pipeline.
Chinese Translation
视觉-语言-动作(VLA)模型通过利用广泛的世界知识和强大的泛化能力,推动了具身操控领域的发展。然而,目前的 VLA 模型仍面临多个关键挑战,包括推理能力有限、缺乏状态监测及自我修正困难。本文提出了 Sentinel-VLA,一种配备主动“sentinel”模块的元认知 VLA 模型,用于实时监测执行状态。模型仅在必要时(例如在初步规划或检测到错误时)触发动态推理或制定错误恢复解决方案。这一按需推理机制确保了鲁棒的决策制定,同时最小化计算开销。值得注意的是,所有训练数据(涵盖44个任务与超过260万次转换)均通过我们设计的管道自动生成和标注。我们还提出了自我演化持续学习(SECL)算法,使 Sentinel-VLA 能够识别其能力边界,并自动收集扩展所需的数据,同时与正交持续适配器(OC-Adapter)结合,使参数更新受限于正交空间,从而防止灾难性遗忘。现实世界实验表明,Sentinel-VLA 相较于SOTA模型 PI0 提高了超过30%的任务成功率。我们将开源所有代码、权重及数据生成管道。
cs.RO / 7 / 2605.01194
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC:具有相对动作评价模型的VLA模型自适应测试时间计算
Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and generalization in embodied manipulation. However, their decision-making relies on a fast, instinctive process that lacks deliberation. This strategy often leads to suboptimal or catastrophic actions when facing complex or ambiguous scenarios that require greater consideration. In this paper, we introduce VLA-ATTC, a framework that endows VLA models with adaptive test-time compute (TTC). VLA-ATTC employs an uncertainty-based "cognitive clutch" to dynamically transition from reflexive execution to a TTC deliberation phase when necessary. During the TTC phase, a novel Relative Action Critic (RAC) model identifies the optimal action from generated candidates via pairwise comparisons. This relative mechanism replaces unstable absolute value estimation, significantly simplifying the learning objective. Furthermore, we introduce an efficient sampling strategy to amortize computational costs and an automated data pipeline that curates preference pairs without manual annotation. On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the SOTA model PI0.5 by over 50%. We will open-source all the code and weights.
Chinese Translation
视觉-语言-动作(VLA)模型在具身操作中表现出卓越的能力和泛化能力。然而,它们的决策过程依赖于快速、直觉的处理方式,缺乏深思熟虑。这种策略在面对复杂或模糊场景时,往往会导致次优或灾难性的行为,因而需要更多的考虑。本文介绍了 VLA-ATTC,一个为VLA模型赋予自适应测试时间计算(TTC)的框架。VLA-ATTC采用基于不确定性的“认知离合器”,在必要时动态地从反射执行过渡到TTC的深思阶段。在TTC阶段,一个新型的相对动作评价(RAC)模型通过成对比较,从生成的候选动作中识别出最优动作。这种相对机制替代了不稳定的绝对值估计,显著简化了学习目标。此外,我们还引入了一种高效的采样策略,以摊销计算成本,并提出了一种自动化数据管道,无需人工标注即可整理偏好对。在LIBERO-LONG基准测试中,VLA-ATTC将SOTA模型PI0.5的失败率降低了超过50%。我们将开源所有代码和权重。
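The control flow this abstract describes, an uncertainty gate followed by selection through pairwise comparisons, can be sketched in a few lines. Everything below is an illustrative assumption rather than the VLA-ATTC implementation: `select_action`, the scalar `uncertainty`, the `threshold`, and the toy `prefers` critic are hypothetical names, and the abstract does not say how pairwise outcomes are aggregated, so a simple sequential tournament is used.

```python
def select_action(candidates, uncertainty, threshold, prefers):
    """Return the reflexive action when confident; otherwise deliberate.

    candidates:  actions sampled from the policy, reflexive choice first
    uncertainty: hypothetical scalar uncertainty estimate for this step
    prefers:     pairwise critic, prefers(a, b) is True when a beats b
    """
    if uncertainty <= threshold:
        return candidates[0]  # "clutch" stays engaged: no extra compute
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefers(challenger, best):  # only relative judgments are needed
            best = challenger
    return best


# Toy critic (assumption): prefer the scalar action closer to 0.4.
def closer_to_target(a, b):
    return abs(a - 0.4) < abs(b - 0.4)


samples = [0.9, 0.2, 0.5]
print(select_action(samples, uncertainty=0.1, threshold=0.3, prefers=closer_to_target))
print(select_action(samples, uncertainty=0.8, threshold=0.3, prefers=closer_to_target))
```

A sequential tournament costs one critic call per extra candidate, which keeps the deliberation phase linear in the number of samples.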
cs.RO / 8 / 2605.01195
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
TAIL-Safe:一种任务无关的模仿学习策略安全监控方法
Abstract
Recent imitation learning (IL) algorithms such as flow-matching and diffusion policies demonstrate remarkable performance in learning complex manipulation tasks. However, these policies often fail even when operating within their training distribution due to extreme sensitivity to initial conditions and irreducible approximation errors that lead to compounding drift. This makes it unsafe to deploy IL policies in the field, where out-of-distribution scenarios are prevalent. A prerequisite for safe deployment is enabling the policy to determine whether it can execute a task the way it was learned from demonstrations. This paper presents TAIL-Safe, a principled approach to identify, for a trained IL policy, a safe set from which the policy empirically succeeds in completing the learned task. We propose a Lipschitz-continuous Q-value function that maps state-action pairs to a long-term safety score based on three short-term task-agnostic criteria: visibility, recognizability, and graspability. The zero-superlevel set of this function characterizes an empirical control invariant set over state-action pairs. When the nominal policy proposes an action outside this set, we apply a recovery mechanism inspired by Nagumo's theorem that uses gradient ascent on the Q-function to steer the policy back to safety. To learn this Q-function, we construct a high-fidelity digital twin using Gaussian Splatting that enables systematic collection of failure data without risk to physical hardware. Experiments with a Franka Emika robot demonstrate that flow-matching policies, which fail under run-time perturbations, achieve consistent task success when guided by the proposed TAIL-Safe.
Chinese Translation
近年来,流匹配和扩散策略等模仿学习(IL)算法在学习复杂操作任务方面表现出了出色的性能。然而,这些策略在其训练分布内操作时也常常会出现失败,原因在于对初始条件极度敏感以及不可减少的近似误差导致累积漂移。这使得在存在大量分布外场景的实际应用中,部署IL策略变得不安全。安全部署的前提是使得策略能够判断其是否能够按照从示例中学习到的方式执行任务。本文提出了TAIL-Safe,这是一种原则性的方法,用于识别已训练IL策略在完成已学任务时成功的安全状态集合。我们提出了一种Lipschitz连续的Q值函数,该函数将状态-动作对映射到基于三个短期任务无关标准(可见性、可识别性和可抓取性)生成的长期安全评分。该函数的零超水平集刻画了状态-动作对上的经验证的控制不变集。当名义策略提出超出该集合的动作时,我们应用基于Nagumo定理的恢复机制,利用对Q函数的梯度上升将策略引导回安全区域。为了学习这个Q函数,我们利用高保真数字双胞胎构建了一个有效的高斯散射模型,系统性地收集故障数据,而无需对物理硬件造成风险。与Franka Emika机器人进行的实验表明,在运行时扰动下失败的流匹配策略,在TAIL-Safe的引导下能够持续成功完成任务。
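The recovery step in this abstract, gradient ascent on a safety Q-function until the proposal re-enters the zero-superlevel set, can be illustrated with a hand-written score. The sketch below is an assumption-laden toy: `q_safety` is a made-up quadratic rather than a learned function, the state and action are scalars, and only the action is adjusted (the abstract defines the set over state-action pairs and does not specify which variables the ascent acts on).

```python
def q_safety(state, action):
    """Toy safety score (assumption): non-negative when |action - state| <= 1."""
    return 1.0 - (action - state) ** 2


def recover(state, action, step=0.1, max_iters=100, eps=1e-4):
    """Ascend the action along dQ/da until Q(state, action) >= 0."""
    for _ in range(max_iters):
        if q_safety(state, action) >= 0.0:
            break  # back inside the zero-superlevel set
        # Central finite difference stands in for an autodiff gradient.
        grad = (q_safety(state, action + eps) - q_safety(state, action - eps)) / (2 * eps)
        action += step * grad
    return action


proposed = 3.0                       # far outside the safe set for state 0.0
corrected = recover(0.0, proposed)
print(q_safety(0.0, proposed), q_safety(0.0, corrected))
```

An action that is already safe is returned unchanged, so the filter only intervenes when the nominal proposal leaves the set.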
cs.RO / 9 / 2605.01201
To Do or Not to Do: Ensuring the Safety of Visuomotor Policies Learned from Demonstrations
做还是不做:确保从示范中学习的视觉运动策略的安全性
Abstract
Task success has historically been the primary measure of policy performance in imitation learning (IL) research. This characteristic severely limits the widespread application of IL algorithms in field robotics, where safety assurance, in addition to task success, is of paramount importance. It is often desirable for an IL-powered robot in the field not to roll out a policy, and hence to score a poor performance, if safety is not guaranteed. Although this trade-off between safety and performance is well investigated in the classical control literature, policy safety is a heavily underexplored domain in IL research. There is no universal definition of safety in IL. To make things worse, many existing theoretical works on safety are notoriously difficult to extend to IL-powered robots in the field. This paper offers important insights into the safety and performance of IL policies. We propose execution guarantee, a policy-agnostic safety measure that guarantees the maximum task success for a visuomotor IL policy, despite minor run-time changes, from within a specific region in the state space. We leverage recent advances in view synthesis to identify such regions in the state space for an IL policy and explore a fundamental result on set invariance, namely Nagumo's sub-tangentiality condition, to prove and operationalize execution guarantee from inside that region. Experiments with a Franka robot, both in simulation and in the real world, demonstrate how the proposed safety analysis allows various IL policies to achieve maximum task success with guarantee. We also demonstrate some interesting results on how a recovery policy, a by-product of the proposed safety analysis, can help to increase policy performance and thereby mitigate the safety-performance trade-off in IL.
Chinese Translation
任务成功在模仿学习(IL)研究中历来是评价策略性能的主要标准。这一特性严格限制了IL算法在现场机器人技术中的广泛应用,而在该领域,除了任务成功外,安全保障至关重要。如果无法确保安全,IL驱动的机器人通常不希望实施某个策略,因此可能会导致糟糕的表现。虽然在经典控制文献中,安全与性能之间的权衡得到了充分研究,但在IL研究中,策略安全仍是一个严重未被探讨的领域。IL中没有安全性的普遍定义。更糟的是,许多现有的关于安全的理论工作在扩展到实际应用在IL驱动的机器人中时,通常面临着极大的困难。本文对IL策略的安全性与性能提供了重要的见解。我们提出了一种执行保障(execution guarantee),这是一种与策略无关的安全度量,它保证了在状态空间的特定区域内,即使存在轻微的运行时变化,视觉运动IL策略的最大任务成功率。我们利用最近在视图合成方面的进展,以识别IL策略在状态空间中的此类区域,并探索关于集合不变性(set invariance)的基本结果 - 即Nagumo的亚切线性条件 - 以证明和实现该区域内的执行保障。通过对Franka机器人的实验,包括仿真和现实世界的测试,展示了所提出的安全分析如何使各种IL策略在保证的条件下实现最大任务成功。我们还展示了一些有趣的结果,表明恢复策略(recovery policy)- 是所提安全分析的副产品 - 如何帮助提高策略性能,从而减轻IL中的安全与性能权衡。
cs.RO / 10 / 2605.01227
Dynamics Aware Quadrupedal Locomotion via Intrinsic Dynamics Head
动态感知的四足 locomotion 通过内在动态头
Abstract
Quadrupedal locomotion plays a critical role in enabling agile, versatile movement across complex terrains. Understanding and estimating the underlying physical dynamics are essential for achieving efficient and stable quadrupedal locomotion. We propose a novel training framework for quadrupedal locomotion that enables the Control Policy to understand and reason about physical dynamics. In simulation, we concurrently train an Intrinsic Dynamics (ID) Head that learns state-to-torque dynamics alongside the Control Policy, and we define a dynamics reward enabled by the ID Head that encourages the Policy toward more predictable dynamical behavior. We also provide a mechanism to tune the learned dynamics in the resulting Policy by controlling the training coefficients of the ID Head. Our simulation experiments show that this mechanism drives convergence to better optima across a wide range of standard quadrupedal locomotion rewards, yielding more efficient and smoother policies. Our real-robot experiments demonstrate sim-to-real transfer of these improvements, with significant gains in torque efficiency (16.8%), action rate (18.6%), and mechanical power (12.8%), while improving safe torque occupancy by 6.4%.
Chinese Translation
四足 locomotion 在复杂地形中实现灵活多样的运动中起着关键作用。理解和估计潜在的物理动态对于实现高效和稳定的四足 locomotion 至关重要。我们提出了一个新颖的四足 locomotion 训练框架,使控制策略能够理解和推理物理动态。在模拟中,我们同时训练一个内在动态 (Intrinsic Dynamics, ID) 头,该头学习状态到扭矩的动态,并与控制策略并行训练。我们定义了一个由 ID 头启用的动态奖励,鼓励策略向更可预测的动态行为发展。我们还提供了一种机制,通过控制 ID 头的训练系数来调整结果政策中学习到的动态。我们的仿真实验表明,这一机制促使在广泛的标准四足 locomotion 奖励中收敛到更好的最优解,使策略更高效且动作更平滑。我们的真实机器人实验展示了这些改进的仿真到现实转移,扭矩效率 (16.8%)、动作频率 (18.6%) 和机械功率 (12.8%) 有显著提高,同时安全扭矩占用提升了6.4%。
cs.RO / 11 / 2605.01232
A Principled Approach for Creating High-fidelity Synthetic Demonstrations for Imitation Learning
一种创造高保真合成演示以用于模仿学习的原则性方法
Abstract
Recent advances in 3D Gaussian Splatting (3DGS) have enabled visually realistic demonstration generation from a single expert trajectory and a short multi-view scan. However, existing 3DGS-based synthesis pipelines typically generate new motions using sampling-based planners or trajectory optimization, which often deviate substantially from the expert's demonstrated path. While such deviations may be acceptable for tasks insensitive to motion shape, they discard subtle spatial and temporal structure that is critical for contact-rich and shape-sensitive manipulation, causing increased demonstration diversity to harm downstream policy learning. We argue that demonstration synthesis should treat the expert trajectory as a strong prior. Building on this principle, we propose a framework that synthesizes diverse task demonstrations while explicitly preserving expert motion structure. We model the expert trajectory using Dynamic Movement Primitives (DMPs) and retarget it to new goals, object configurations, and viewpoints within a reconstructed 3DGS scene, yielding phase-consistent, shape-preserving motion by construction. To safely realize this expert-preserving diversity in cluttered scenes, we introduce an analytic obstacle-aware DMP formulation that operates directly on the continuous density field induced by the 3DGS representation. This enables collision avoidance while minimally perturbing the nominal expert motion, unifying photorealistic rendering and geometric reasoning without additional scene representations. We evaluate our approach on a Spot mobile manipulator across three manipulation tasks with increasing sensitivity to trajectory fidelity. Compared to planner- and optimization-based synthesis, our method produces trajectories with lower deviation and collision rates and yields higher task success when training diffusion-based visuomotor policies.
Chinese Translation
近期,3D高斯喷射(3DGS)的进展使得从单个专家轨迹和短的多视角扫描中生成视觉上真实的演示成为可能。然而,现有基于3DGS的合成管道通常使用基于采样的规划器或轨迹优化来生成新运动,这往往与专家的演示路径存在显著偏差。虽然这种偏差在对运动形状不敏感的任务中可能是可以接受的,但它们丢弃了对于接触丰富和对形状敏感的操作至关重要的细微空间和时间结构,这导致演示多样性的增加对下游策略学习产生不利影响。我们认为,演示合成应该将专家轨迹视为强有力的先验。基于这一原则,我们提出了一个框架,该框架在明确保留专家运动结构的同时合成多样的任务演示。我们使用动态运动原语(DMPs)对专家轨迹进行建模,并将其重新定向至重建的3DGS场景中的新目标、对象配置和视角,从而构建出相位一致、形状保留的运动。为了安全地实现这种保护专家的多样性在杂乱场景中,我们引入了一种分析性的障碍感知DMP公式,该公式直接作用于3DGS表示所诱导的连续密度场。这使得能够在最小干扰名义上的专家运动的同时避免碰撞,从而在没有额外场景表示的情况下统一照片级真实感渲染和几何推理。我们在Spot移动操控器上评估了我们的方法,涵盖了三个对轨迹保真度敏感性逐渐增加的操作任务。与基于规划和优化的合成相比,我们的方法产生的轨迹偏差和碰撞率更低,并在训练基于扩散的视觉运动策略时实现了更高的任务成功率。
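The claim that DMP retargeting is "shape-preserving by construction" can be checked on a one-dimensional example. The sketch below uses the standard discrete DMP form (a critically damped spring toward the goal plus a phase-gated forcing term scaled by goal minus start). It is not the paper's formulation: the forcing function `demo_forcing` is a hypothetical stand-in for one fitted to an expert trajectory, the gains are conventional defaults, and the obstacle-aware term built on the 3DGS density field is omitted entirely.

```python
import math


def rollout_dmp(y0, goal, forcing, steps=2000, dt=0.001, alpha=25.0, alpha_x=4.0):
    """Integrate a 1-D discrete DMP with explicit Euler steps; return positions."""
    beta = alpha / 4.0          # critically damped spring-damper
    y, z, x = y0, 0.0, 1.0      # position, velocity, phase variable
    path = [y]
    for _ in range(steps):
        drive = forcing(x) * x * (goal - y0)  # forcing fades as phase x decays
        dz = alpha * (beta * (goal - y) - z) + drive
        dy = z
        dx = -alpha_x * x
        z += dz * dt
        y += dy * dt
        x += dx * dt
        path.append(y)
    return path


def demo_forcing(x):
    """Hypothetical stand-in for a forcing term fitted to an expert demonstration."""
    return 50.0 * math.sin(2.0 * math.pi * x)


near = rollout_dmp(0.0, 1.0, demo_forcing)   # original goal
far = rollout_dmp(0.0, 2.0, demo_forcing)    # retargeted goal, same forcing
print(round(near[-1], 3), round(far[-1], 3))
```

Because the forcing term is scaled by the start-to-goal displacement, the retargeted trajectory here is a spatially scaled copy of the original at every time step, which is the property a sampling-based planner does not provide.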
cs.RO / 12 / 2605.01289
Bi-Level Reinforcement Learning Control for an Underactuated Blimp via Center-of-Mass Reconfiguration
通过质心重配置的双层强化学习控制无人操纵气球
Abstract
This paper investigates goal-directed tracking control of underactuated blimps with center-of-mass (CoM) reconfiguration. Unlike conventional overactuated blimp designs that rely on redundant actuation for simplified control, this paper focuses on a compact architecture consisting of two thrusters and a movable internal slider, aiming to improve energy efficiency and payload capacity. This hardware-efficient configuration introduces significant underactuation and strong nonlinear coupling between CoM dynamics and vehicle motion. To address these challenges, this paper proposes a bi-level reinforcement learning framework that explicitly decouples task-level CoM planning from continuous thrust control. The outer policy determines a target-dependent CoM configuration prior to flight, while the inner policy generates thrust commands to track straight-line references. To ensure stable learning, this paper introduces a two-stage learning strategy, supported by a convergence analysis of the resulting bi-level process. Extensive simulations and real-world experiments on a 27-goal evaluation set demonstrate that the proposed method consistently outperforms fixed-CoM baselines and PID-based controllers, achieving higher tracking accuracy, enhanced robustness, and reliable sim-to-real transfer.
Chinese Translation
本文研究了通过质心(CoM)重配置进行目标导向的无人操纵气球跟踪控制。与传统依赖冗余驱动以简化控制的过驱动气球设计不同,本文重点关注由两个推进器和一个可移动内部滑块组成的紧凑架构,旨在提高能效和有效载荷能力。这种硬件高效配置引入了显著的欠驱动性以及质心动力学与飞行器运动之间强烈的非线性耦合。为了应对这些挑战,本文提出了一种双层强化学习框架,明确将任务级的质心规划与连续推力控制解耦。外部策略在飞行前确定目标相关的质心配置,而内部策略生成推力命令以跟踪直线参考。为确保稳定学习,本文引入了两阶段学习策略,并支持对所得到的双层过程的收敛分析。在27个目标评估集上进行的大量仿真与真实实验表明,所提方法始终优于固定质心基线和基于PID的控制器,实现了更高的跟踪精度、增强的鲁棒性及可靠的仿真到真实的转移。
cs.RO / 13 / 2605.01340
Terrain Perception for Agricultural UAVs in Complex Farmland via Rotating mmWave Radar
通过旋转毫米波雷达实现复杂农田中农业无人机的地形感知
Abstract
Accurate terrain perception is essential for terrain-following flight of agricultural unmanned aerial vehicles (UAVs), yet remains challenging in real-world farmland due to occlusions, complex terrain geometry, and environmental disturbances. Millimeter-wave (mmWave) radar is a promising sensing modality for this task due to its robustness to adverse conditions; however, existing UAV-mounted radar systems rely on a fixed field of view (FoV) and on terrain extraction methods designed for dense LiDAR data, leading to incomplete and unreliable terrain estimation. To address these limitations, we present a low-cost rotating mmWave radar-enabled terrain perception framework for agricultural UAVs operating in complex farmland environments. Specifically, a mechanically rotating sensing design is introduced to enlarge spatial coverage and improve terrain observability beyond the limitations of fixed-view radar under dynamic low-altitude flight. Building upon this sensing capability, we further design a pose-consistent terrain reconstruction pipeline tailored for sparse, noisy, and partially observable radar data, enabling reliable ground extraction and continuous terrain surface estimation in challenging agricultural scenarios. The complete system is deployed on a real agricultural UAV platform and comprehensively evaluated through extensive field experiments. Experimental results demonstrate improved terrain coverage and estimation accuracy, achieving an F1 score of 94.42 for ground segmentation, while the closest rival achieves only 90.48. These improvements lead to more robust terrain-following flight.
Chinese Translation
准确的地形感知对于农业无人机(UAV)的地形跟随飞行至关重要,但由于遮挡、复杂的地形几何形状和环境干扰,在实际农田中仍面临挑战。毫米波(mmWave)雷达是一种有前景的传感方式,因其对恶劣条件的鲁棒性而受到关注;然而,现有的无人机搭载雷达系统依赖于固定的视场(FoV)和针对密集激光雷达(LiDAR)数据设计的地形提取方法,导致地形估计不完整且不可靠。为了解决这些限制,我们提出了一种低成本的旋转毫米波雷达启用的地形感知框架,以适应在复杂农田环境中运行的农业无人机。具体而言,引入了一种机械旋转的传感设计,以扩大空间覆盖范围,并在动态低空飞行中提高地形可观测性,超越固定视角雷达的局限性。在此基础上,我们进一步设计了一种面向稀疏、嘈杂和部分可观测雷达数据的姿态一致性地形重建管道,能够在具有挑战性的农业场景中实现可靠的地面提取和连续的地形表面估计。完整的系统已在真实的农业无人机平台上部署,并通过广泛的实地实验进行了全面评估。实验结果表明,地形覆盖范围和估计精度得到了改善,对于地面分割达到了94.42的F1分数,而最接近的竞争者仅达到了90.48。这提升了地形跟随飞行的稳健性。
cs.RO / 14 / 2605.01368
Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance
无干扰的辅助:一种基准测试和基于大型语言模型的非侵入式人机辅助框架
Abstract
Human-robot interaction (HRI) has long studied how agents and people coordinate to achieve shared goals. In this work, we formalize and benchmark non-intrusive assistance as an independent paradigm of HRI, where a robot proactively supports a human's ongoing multi-step activities while strictly avoiding interruptions. Unlike conventional HRI tasks that rely on direct commands, explicit negotiation, or proactive interventions based on user habits and history, our task treats the human's plan as the primary process and formulates assistance as a joint decision over when to act and what to do. To systematically evaluate this problem, we establish a simulation benchmark, NIABench, along with new metrics tailored to the non-intrusive assistance task. We further propose a hybrid architecture that integrates an LLM with a scoring model. The scoring model first applies semantic retrieval to prune large candidate action sets, and then a ranker evaluates human-step and robot-action pairs, enabling reasoning over timing and cross-step dependencies. Comprehensive experiments on both NIABench and real-world scenarios demonstrate that our method achieves proactive, non-intrusive assistance that reduces human effort while preserving task effectiveness.
Chinese Translation
人机交互(HRI)长期以来一直研究代理和人类如何协调以实现共同目标。在本研究中,我们将非侵入式辅助形式化并基准化为HRI的一个独立范式,其中机器人主动支持人类正在进行的多步骤活动,同时严格避免干扰。与依赖于直接指令、明确协商或基于用户习惯和历史的主动干预的传统HRI任务不同,我们的任务将人类的计划视为主要流程,并将辅助形式化为一个联合决策,决定何时行动以及如何做。为了系统地评估这一问题,我们建立了一个模拟基准测试NIABench,并提供了针对非侵入式辅助任务的新指标。进一步地,我们提出了一种混合架构,将大型语言模型(LLM)与评分模型结合。评分模型首先应用语义检索来修剪大型候选动作集,然后由排名模型评估人类步骤和机器人动作对,从而支持对时机和跨步骤依赖关系的推理。在NIABench和真实场景上的全面实验表明,我们的方法实现了主动的、非侵入式的辅助,减少了人类的努力,同时保持了任务的有效性。
cs.RO / 15 / 2605.01371
ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue
ESARBench:一种用于自主无人机实体搜索与救援的基准
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicles (UAVs) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive and unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations to make informed decisions. Additionally, we present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Leveraging Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, our benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the challenges in ESAR, revealing critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench serves as a valuable resource to advance research in the Embodied Search and Rescue domain. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.
Chinese Translation
多模态大型语言模型(MLLM)的快速进展赋予无人机(UAV)在空间推理、语义理解和复杂决策方面卓越的能力,使其非常适合用于无人机搜索与救援(SAR)。然而,现有的无人机SAR研究主要以传统的视觉和路径规划方法为主,缺乏一个全面统一的实体代理基准。为了解决这一问题,我们首先提出了一项新颖的任务:实体搜索与救援(ESAR),该任务要求空中代理能够自主探索复杂环境,识别救援线索并推理受害者的位置,以执行知情决策。此外,我们推出了 ESARBench,这是第一个旨在评估基于MLLM的无人机代理在高度现实的SAR场景中的综合基准。我们利用虚幻引擎5和AirSim构建了四个高保真、大规模的开放环境,这些环境直接从真实世界的地理信息系统(GIS)数据映射,以确保照片级真实的景观。为严格模拟实际的救援操作,我们的基准纳入了动态变量,包括天气条件、时间和随机线索设置。此外,我们创建了一个基于真实救援案例建模的600个任务的数据集,并提出了一套稳健的评估指标。我们评估了多种基线方法,从传统启发式算法到先进的基于地面和空中MLLM的ObjectNav代理。实验结果突出了ESAR中的挑战,揭示了空间记忆、空中适应性以及搜索效率与飞行安全之间的权衡等关键瓶颈。我们希望ESARBench能够成为推动实体搜索与救援领域研究的宝贵资源。源代码和项目页面:https://4amgodvzx.github.io/ESAR.github.io。
cs.RO / 16 / 2605.01427
SixthSense: Task-Agnostic Proprioception-Only Whole-Body Wrench Estimation for Humanoids
第六感:面向任务无关的只有本体感觉的类人机器人全身扭矩估计
Abstract
Humanoid robots are entering our physical world at scale, yet as oversized toys: good at singing and dancing, but short on force-interaction capabilities for practical tasks. Bridging this gap necessitates prioritizing reliable contact perception as a fundamental requirement. Estimating external wrenches in humanoids is complicated by floating-base dynamics and indeterminate contact locations. Existing analytical frameworks require idealistic assumptions and hard-to-obtain measurements, which are often unavailable in practice. To this end, we propose SixthSense, a task-agnostic approach that infers whole-body contact timing, location, and wrenches from proprioception and IMU data alone. To capture the multi-modal dynamics between unstructured contact inputs and the uncertain motion outputs, we employ conditional flow matching to tokenize proprioceptive histories and estimate a spatiotemporally sparse contact-event flow. SixthSense serves as a plug-and-play perception module for applications including collision detection, physical human-robot interaction, and force-feedback teleoperation. Experiments across standing, walking, and whole-body motion-tracking policies showcased unprecedented performance in diverse behaviors.
Chinese Translation
类人机器人正大规模地进入我们的物理世界,但目前它们更像是大型玩具:擅长唱歌和跳舞,却在实际任务中缺乏有效的力交互能力。弥补这一差距需要将可靠的接触感知作为基本要求。类人机器人的外部力旋量估计因浮动基座动力学和不确定的接触位置而变得复杂。现有的解析框架依赖理想化的假设和难以获得的测量,这在实际中往往无法满足。为了弥补这一缺口,我们提出了第六感(SixthSense),这是一种任务无关的方法,能仅通过本体感觉和惯性测量单元(IMU)数据推断全身接触时机、位置和力旋量。为了捕捉非结构化接触输入和不确定运动输出之间的多模态动态,我们采用条件流匹配技术来对本体感觉历史进行标记化,并估计时空稀疏的接触事件流。第六感作为一种即插即用的感知模块,适用于包括碰撞检测、物理人机交互和力反馈远程操作等应用。在站立、行走和全身运动跟踪策略下的实验展示了在多样行为中前所未有的性能。
cs.RO / 17 / 2605.01432
Evidence-Based Landing Site Selection and Vision-Based Landing for UAVs in Unstructured Environments
基于证据的无人机无结构环境着陆点选择与基于视觉的着陆
Abstract
Autonomous landing in cluttered or unstructured environments remains a safety-critical challenge for unmanned aerial vehicles (UAVs), particularly under noisy perception caused by sensor uncertainty and platform-induced disturbances such as vibration. This paper presents an evidence-based probabilistic framework for autonomous UAV landing that explicitly separates decision-making under uncertainty from execution via visual servoing. Landing safety is modeled as a latent variable and inferred through recursive accumulation of frame-wise visual likelihoods derived from flatness, slope, and obstacle cues, yielding a temporally consistent belief map that is robust to transient perception errors. Physical feasibility is enforced through a hard geometric constraint based on the minimum required landing radius of the UAV, ensuring that undersized but visually appealing regions are rejected. The final landing site is selected using constrained maximum a posteriori estimation. Once selected, the UAV locks onto the target region using ORB feature tracking and performs precise alignment and descent via image-based visual servoing (IBVS). The proposed approach is validated through both real-world laboratory experiments and high-fidelity simulations in Nvidia Isaac Sim, demonstrating consistent, cautious, and stable landing behavior across domains.
Chinese Translation
在杂乱或无结构环境中实现自主着陆仍然是无人机(UAV)面临的一项安全攸关的挑战,尤其是在由于传感器不确定性和平台引起的扰动(如振动)造成的噪声感知下。本文提出了一种基于证据的概率框架,用于自主UAV着陆,该框架明确将不确定性下的决策与通过视觉伺服执行的过程分开。着陆安全性被建模为一个潜在变量,并通过递归积累基于帧的视觉似然性(来自平坦度、坡度和障碍物线索)进行推断,生成一个对瞬时感知错误具有鲁棒性的时序一致的信念图。通过基于无人机最小要求着陆半径的硬几何约束来强化物理可行性,确保尺寸过小但视觉上吸引人的区域被排除。最终的着陆点使用约束的最大后验估计进行选择。一旦选定,无人机便通过ORB特征跟踪锁定目标区域,并通过基于图像的视觉伺服(IBVS)进行精确对准和下降。所提方案通过实际实验室实验和Nvidia Isaac Sim中的高保真仿真进行验证,展示了在不同领域中一致、谨慎和稳定的着陆行为。
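The belief accumulation and constrained selection described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a uniform prior, assumes that the flatness, slope, and obstacle cues have already been fused into one per-frame probability per candidate region, and uses the standard log-odds form of a binary Bayes filter. The names `accumulate`, `select_site`, and all numeric values are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def accumulate(log_odds, frame_probs, clip=1e-3):
    """One recursive update: add each region's per-frame evidence in log-odds form.

    frame_probs[i] is a hypothetical per-frame P(safe) for region i. With the
    assumed uniform 0.5 prior, the prior term of the binary Bayes filter is zero.
    Probabilities are clipped so a single frame can never saturate the belief.
    """
    return [l + logit(min(max(p, clip), 1.0 - clip))
            for l, p in zip(log_odds, frame_probs)]

def select_site(log_odds, free_radius_m, min_radius_m):
    """Constrained MAP: highest-belief region among those meeting the hard radius constraint."""
    feasible = [i for i, r in enumerate(free_radius_m) if r >= min_radius_m]
    return max(feasible, key=lambda i: log_odds[i]) if feasible else None

# Three candidate regions observed over four frames (illustrative values only).
frames = [
    [0.9, 0.8, 0.2],
    [0.9, 0.3, 0.3],  # transient perception error on region 1
    [0.9, 0.8, 0.2],
    [0.9, 0.8, 0.1],
]
belief = [0.0, 0.0, 0.0]  # log-odds of 0 corresponds to P(safe) = 0.5
for frame in frames:
    belief = accumulate(belief, frame)

site = select_site(belief, free_radius_m=[0.3, 1.2, 1.5], min_radius_m=0.6)
print(site, [round(sigmoid(l), 3) for l in belief])
```

In this toy run, region 0 has the highest belief but is rejected because its free radius is below the required minimum, and the single bad frame on region 1 does not overturn its accumulated belief.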
cs.RO / 18 / 2605.01434
High-Speed, Scalable Sensor Readout for Dexterous Robotic Hands via Shift-Register Multiplexing
通过移位寄存器复用实现高速、可扩展的灵巧机器人手传感器读出
Abstract
Dexterous robotic hands require high-speed multimodal sensing across many degrees of freedom, yet existing readout architectures often impose trade-offs between sensor count, wiring complexity, and sampling bandwidth. This paper presents a scalable analog sensor readout architecture based on a serial-in parallel-out (SIPO) shift-register principle. The proposed architecture supports versatile integration of heterogeneous analog-output sensors, scalable expansion using only three signal lines between sensor modules, and fast, configurable sampling. We validate the approach on a tendon-driven robotic hand integrating 16 joint sensor modules and one four-channel tactile sensor module, enabling acquisition of 20 sensor channels at a full-scan rate of 1 kHz, with stable operation up to 1.5 kHz. Joint sensor characterization showed a maximum slope absolute percentage error (APE) of 0.446% and sub-degree estimation error, indicating that the proposed readout system does not significantly degrade sensing performance. For tactile sensing, LSTM-based models achieved an RMSE of 0.125 N for force estimation and 93.4% accuracy for five-class contact-location classification, and were deployed for real-time inference at 1 kHz. System-level experiments showed that the joint sensors provide more accurate feedback than motor-based estimation during interaction, while the tactile sensor enables responsive force estimation in contact. The proposed architecture offers a practical path toward fully sensorized robotic hands for dexterous manipulation.
Chinese Translation
灵巧机器人手需要在多个自由度上进行高速的多模态传感,但现有的读出架构常常在传感器数量、布线复杂性和采样带宽之间存在权衡。本文提出了一种基于串行输入并行输出(SIPO)移位寄存器原理的可扩展模拟传感器读出架构。该架构支持异构模拟输出传感器的多功能集成,使用仅三条信号线在传感器模块之间进行可扩展扩展,并实现快速可配置的采样。我们在一个集成了16个关节传感器模块和一个四通道触觉传感器模块的腱驱动机器人手上验证了该方法,从而能够以1 kHz的全扫描速率获取20个传感器通道,且在1.5 kHz时保持稳定运行。关节传感器的特性表明,其最大斜率绝对百分比误差(APE)为0.446%,且估计误差小于1度,这表明所提的读出系统不会显著降低传感性能。在触觉感知方面,基于LSTM的模型在力估计中实现了0.125 N的均方根误差(RMSE),在五类接触位置分类中达到了93.4%的准确率,并在1 kHz下进行了实时推断。系统级实验表明,关节传感器在交互过程中提供的反馈比基于电动机的估计更为准确,而触觉传感器则能够在接触中实现响应式的力估计。所提架构为实现用于灵巧操作的完全传感化机器人手提供了一条实际路径。
cs.RO / 19 / 2605.01448
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
分解与重组:从现有能力中推理新技能以实现跨任务机器人操作
Abstract
Cross-task generalization is a core challenge in open-world robotic manipulation, and the key lies in extracting transferable manipulation knowledge from seen tasks. Recent in-context learning approaches leverage seen task demonstrations to generate actions for unseen tasks without parameter updates. However, existing methods provide only low-level continuous action sequences as context, failing to capture composable skill knowledge and causing models to degenerate into superficial trajectory imitation. We propose Decompose and Recompose, a skill reasoning framework using atomic skill-action pairs as intermediate representations. Our approach decomposes seen demonstrations into interpretable skill--action alignments, enabling the model to recompose these skills for unseen tasks through compositional reasoning. Specifically, we construct a task-adaptive dynamic demonstration library via visual-semantic retrieval combined with skill sequences from a planning agent, complemented by a coverage-aware static library to fill missing skill patterns. Together, these yield skill-comprehensive demonstrations that explicitly elicit compositional reasoning for skill composition and execution ordering. Experiments on the AGNOSTOS benchmark and real-world environments validate our method's zero-shot cross-task generalization capability.
Chinese Translation
跨任务泛化是开放世界机器人操作中的一个核心挑战,关键在于从已见任务中提取可转移的操作知识。近期的上下文学习方法利用已见任务的演示为未见任务生成动作,而无需更新参数。然而,现有方法仅提供低级的连续动作序列作为上下文,未能捕捉可组合的技能知识,导致模型退化为表面轨迹模仿。我们提出了“分解与重组”这一技能推理框架,使用原子技能-动作对作为中间表示。我们的方法将已见演示分解为可解释的技能-动作对齐,允许模型通过组合推理重新组合这些技能以应对未见任务。具体而言,我们通过视觉-语义检索结合规划代理的技能序列构建一个任务自适应动态演示库,同时配备一个覆盖感知的静态库以填补缺失的技能模式。这些共同产生了技能全面的演示,明确激发了技能组合和执行顺序的组合推理。我们在 AGNOSTOS 基准和现实世界环境中进行的实验验证了我们方法的零样本跨任务泛化能力。
cs.RO / 20 / 2605.01461
LLM-Foraging: Large Language Models for Decentralized Swarm Robot Foraging
LLM-觅食:用于分散式群体机器人觅食的大型语言模型
Abstract
Swarm foraging algorithms, such as the central-place foraging algorithm (CPFA), typically rely on offline parameter optimization using genetic algorithms (GA) or reinforcement learning, yielding policies tightly coupled to a specific combination of team size, arena size, and resource distribution. When deployment conditions change, performance degrades, and retraining is computationally expensive. We propose LLM-Foraging, a decentralized swarm controller that augments the CPFA state machine with a large language model (LLM) tactical decision-maker at three structured decision points, namely post-deposit, central-zone arrival, and search starvation. Each robot runs its own LLM client and queries it using only locally observable state, while the existing CPFA motion and sensing stack executes the selected action. Because the LLM serves as a general decision policy rather than parameters fitted to a single configuration, the controller is training-free at deployment and transfers across configurations without re-optimization. We evaluate LLM-Foraging in Gazebo with TurtleBot3 robots across 36 configurations spanning team sizes of 4 to 10 robots, arena sizes from 6x6 to 10x10 meters, and three resource distributions (clustered, powerlaw, random). LLM-Foraging collects more resources than the GA-tuned CPFA baseline across the evaluated configurations and is more consistent, a property that the GA's single-configuration tuning does not transfer.
Chinese Translation
群体觅食算法,如中心地点觅食算法(CPFA),通常依赖于使用遗传算法(GA)或强化学习的离线参数优化,从而生成与特定团队规模、场地大小和资源分布紧密相关的策略。当部署条件发生变化时,性能会下降,且重新训练计算开销大。我们提出LLM-觅食,这是一种分散式群体控制器,利用大型语言模型(LLM)作为战术决策者在三个结构化决策点(即后存储、中央区域到达和搜索饥饿)增强CPFA状态机。每个机器人运行自己的LLM客户端,并仅使用局部可观察状态进行查询,而现有的CPFA运动和感知堆栈则执行所选动作。由于LLM作为通用决策政策服务,而非适配于单一配置的参数,因此该控制器在部署时无需培训,并且能够跨配置转移而无需重新优化。我们在Gazebo中使用TurtleBot3机器人评估LLM-觅食,在36个配置中进行测试,覆盖从4到10个机器人的团队规模、6x6到10x10米的场地大小以及三种资源分布(聚类、幂律、随机)。LLM-觅食在所评估的配置中收集的资源数量超过了GA调优的CPFA基线,并且更具一致性,而GA的单一配置调优特性并不具备可转移性。
cs.RO / 21 / 2605.01477
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
行动代理:代理视频生成与流约束扩散的结合
Abstract
We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings for semantic stopping. We pretrain on the RECON outdoor navigation dataset and fine-tune on 203 Unitree G1 humanoid episodes collected in Isaac Sim to calibrate velocity dynamics. A single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion on a real Unitree G1 in unseen indoor environments under open-loop execution, while operating at 40--47 Hz. We evaluate Action Agent across three embodiments: a Unitree G1 humanoid (real hardware), a drone, and a wheeled mobile robot (Isaac Sim), demonstrating that decoupling trajectory imagination from execution yields a scalable and embodiment-aware paradigm for language-guided navigation.
Chinese Translation
我们提出了行动代理(Action Agent),这是一个两阶段框架,结合了代理导航视频生成与流约束扩散控制,用于多体态机器人导航。在第一阶段,大型语言模型(LLM)作为编排模块,选择视频扩散模型,通过迭代验证优化提示,并累积跨任务记忆,从语言和图像输入中合成物理上合理的第一人称导航视频。这样使得在50个导航任务中,视频生成的成功率从35%(单次生成)提高到86%。在第二阶段,我们引入了FlowDiT,这是一个流约束扩散变换器,利用动作空间去噪扩散将优化的目标视频和语言指令转换为连续的速度命令。FlowDiT集成了DINOv2视觉特征、学习的光流用于自我运动表征,以及CLIP语言嵌入用于语义停止。我们在RECON户外导航数据集上进行预训练,并在Isaac Sim中采集的203个Unitree G1人形机器人回合数据上进行微调,以校准速度动态。一个单一的43M参数检查点在模拟中实现73.2%的导航成功率,在真实的Unitree G1上于未见的室内环境中在开环执行下完成任务率为64.7%,同时工作频率为40-47 Hz。我们在三种实体上评估行动代理:一个Unitree G1人形机器人(真实硬件)、一架无人机和一个轮式移动机器人(Isaac Sim),展示了将轨迹想象与执行解耦后,能够实现一种可扩展且关注实体的语言引导导航范式。
cs.RO / 22 / 2605.01485
Cut-In Gap Acceptance Toward Autonomous vs. Human-Driven Vehicles: Evidence from the Waymo Open Motion Dataset
针对自治车辆与人驱动车辆的切入间隙接受度:来自Waymo开放运动数据集的证据
Abstract
Autonomous vehicles (AVs) are widely known to follow conservative, rule-based motion policies that surrounding drivers can learn to anticipate. A direct consequence is that human drivers may accept shorter longitudinal gaps when cutting in front of an AV than when targeting another human-driven vehicle (HDV). We test this hypothesis using the Waymo Open Motion Dataset (WOMD), which provides 25,906 real-world highway scenarios at 10 Hz. An eight-criterion lane-change detector extracts 706 HDV-to-AV and 3,172 HDV-to-HDV cut-in events from the same traffic environment. The median accepted gap in front of the Waymo AV is 7.58 m versus 9.57 m for HDV targets, a 1.99 m reduction that is statistically significant ($p = 5.76 \times 10^{-8}$, $d = -0.224$) and persists under speed-matched resampling. Cut-in speeds toward the AV are 37% higher (51.7 versus 37.7 km/h, $d = 0.502$), and 68.0% of AV-targeted cut-ins occur below the 10 m gap boundary versus 51.8% of HDV-targeted events ($\chi^2 = 60.5$, $p < 10^{-13}$). These results reveal a systematic and safety-relevant asymmetry in human gap-acceptance behavior that warrants AV-specific calibration of both motion-planning safety envelopes and traffic simulation models.
Chinese Translation
自治车辆(AV)广为人知的是其遵循保守的、基于规则的运动策略,环境中的驾驶员可以学习并预测这一点。其直接后果是,当人类驾驶员在AV前切入时,他们可能会接受更短的纵向间隙,而在切入其他人驱动车辆(HDV)时则不然。我们采用Waymo开放运动数据集(WOMD)来检验这一假设,该数据集提供了25,906个真实世界高速公路场景,采样频率为10赫兹。一个基于八个标准的换车道检测器从同一交通环境中提取了706个HDV到AV的切入事件和3,172个HDV到HDV的切入事件。在Waymo AV前接受的中位间隙为7.58米,而HDV目标的中位间隙为9.57米,减少了1.99米,这一差异具有统计学意义(p值为5.76×10^(-8),d值为-0.224),且在速度匹配的重采样下依然显著。向AV的切入速度高出37%(51.7公里每小时对37.7公里每小时,d值为0.502),68.0%的向AV目标的切入发生在10米间隙下限内,而HDV目标的事件中仅为51.8%(卡方统计量为60.5,p值小于10^(-13))。这些结果揭示了人类间隙接受行为中存在系统性且与安全相关的非对称性,这要求对运动规划安全范围和交通模拟模型进行AV特定的调整。
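The reported proportions can be sanity-checked with a short calculation. The sketch below is an illustration only: the 2x2 counts are reconstructed from the rounded percentages in the abstract (an assumption, since the raw counts are not given), and the use of Yates' continuity correction is likewise an assumption rather than a documented detail of the study.

```python
import math

def yates_chi2_2x2(a, b, c, d):
    """Chi-squared statistic with Yates' continuity correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Event counts stated in the abstract.
n_av, n_hdv = 706, 3172

# Counts below the 10 m gap boundary, reconstructed from rounded percentages (assumption).
below_av = round(0.680 * n_av)
below_hdv = round(0.518 * n_hdv)

chi2 = yates_chi2_2x2(below_av, n_av - below_av, below_hdv, n_hdv - below_hdv)

# For one degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))

gap_reduction_m = 9.57 - 7.58        # difference of the median accepted gaps
speed_increase = 51.7 / 37.7 - 1.0   # relative increase in cut-in speed

print(round(chi2, 2), p_value, round(gap_reduction_m, 2), round(speed_increase, 3))
```

Under these reconstructed counts the statistic lands close to the reported value and the p-value falls below the stated bound, which is consistent with the abstract but does not confirm which test variant the authors used.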
cs.RO / 23 / 2605.01501
Distributed Algorithm with Emergent Area Partitioning and Base Station's Situation Awareness for Multi-Robot Patrolling
具有涌现式区域划分和基站态势感知的分布式多机器人巡逻算法
Abstract
Patrolling with multiple robots offers efficient surveillance to detect and manage undesired situations. This necessitates improved patrol efficiency and operator situation awareness at base stations. Enhanced situation awareness enables operators to predict robots' behaviors, support recognition and decision-making, and execute emergency interventions. This study presents the Local Reactive and Partition (LR-PT) algorithm, a novel multi-robot patrolling approach. In simulations, LR-PT outperformed existing methods by ensuring frequent patrols of all locations of interest and enhancing the situation awareness of the base station. Robots independently select patrol targets based on locally available information, integrating patrol needs and the urgency of reporting mission progress to the base station into a unified utility function. This locality also contributes to robustness against communication constraints and robot failures, as demonstrated in this research. An area partition also emerges autonomously from the algorithm, which helps the robots avoid falling into local optima and achieves comprehensive patrol of the whole mission area. The simulation results demonstrated the superior performance of LR-PT for multi-robot patrolling, utilizing the advantages of swarm robotics and addressing real-world operational challenges.
Chinese Translation
多机器人巡逻提供了高效的监控,以检测和管理不良情况。这要求提高巡逻效率和基站操作人员的情境感知。增强的情境感知使操作人员能够预测机器人行为,支持识别和决策,并执行紧急干预。本研究提出了一种新颖的多机器人巡逻方法——局部反应与划分(Local Reactive and Partition,LR-PT)算法。在仿真中,LR-PT在确保对所有感兴趣位置进行频繁巡逻和提升基站情境感知方面优于现有方法。机器人根据可用的本地信息独立选择巡逻目标,将巡逻需求和报告任务进展的紧迫性整合到一个统一的效用函数中。这种本地性也增强了对通信限制和机器人故障的鲁棒性,正如本研究所示。该算法还能自主划分区域,避免局部最优,并实现对整个任务区域的全面巡逻。仿真结果证明,LR-PT在多机器人巡逻方面表现优越,利用了群体机器人技术的优势,并解决了现实操作中的挑战。
cs.RO / 24 / 2605.01516
Dynamics Distillation for Efficient and Transferable Control Learning
高效可转移控制学习的动态蒸馏
Abstract
Robust control policy learning for autonomous driving requires training environments to be both physically realistic and computationally scalable, properties that existing simulators provide only in isolation. We introduce Sim2Sim2Sim, a framework that bridges high-fidelity vehicle simulation and scalable reinforcement learning by distilling simulator dynamics into a highly parallelizable learned dynamics model. By training control policies purely within this distilled environment and deploying them back into the high-fidelity source simulator, we demonstrate more efficient policy optimization and reliable transfer under challenging dynamics. We further show that predictive accuracy alone does not fully characterize a learned dynamics model's suitability as a reinforcement learning training environment, which should also be assessed by the quality of the policies it enables.
Chinese Translation
针对自主驾驶的稳健控制策略学习,需要训练环境既具有物理现实性又具备计算可扩展性,而现有模拟器仅能单独提供这些特性。我们提出了Sim2Sim2Sim框架,通过将模拟器动态蒸馏到一个高度可并行化的学习动态模型中,架起高保真车辆模拟与可扩展强化学习之间的桥梁。通过仅在此蒸馏环境中训练控制策略,并将其回放到高保真源模拟器中,我们展示了在挑战性动态条件下更高效的策略优化和可靠的转移。此外,我们进一步表明,预测准确性并不能完全表征学习动态模型作为强化学习训练环境的适用性,模型的评估还应考虑其所使能策略的质量。
cs.RO / 25 / 2605.01518
VOFA: Visual Object Goal Pushing with Force-Adaptive Control for Humanoids
VOFA:用于人形机器人视觉目标推送的力自适应控制
Abstract
The ability to push large objects in a goal-directed manner using onboard egocentric perception is an essential skill for humanoid robots to perform complex tasks such as material handling in warehouses. To robustly manipulate heavy objects to arbitrary goal configurations, the robot must cope with unknown object mass and ground friction, noisy onboard perception, and actuation errors; all in a real-time feedback loop. Existing solutions either rely on privileged object-state information without onboard perception or lack robustness to variations in goal configurations and object physical properties. In this work, we present VOFA, a visual goal-conditioned humanoid loco-manipulation system capable of pushing objects with unknown physical properties to arbitrary goal positions. VOFA consists of a two-level hierarchical architecture with a high-level visuomotor policy and a low-level force-adaptive whole-body controller. The high-level policy processes noisy onboard observations and generates goal-conditioned commands to operate in closed loop across diverse object-goal configurations, while the low-level whole-body controller provides robustness to variations in object physical properties. VOFA is extensively evaluated in both simulation and real-world experiments on the Booster T1 humanoid robot. Our results demonstrate strong performance, achieving over 90% success in simulation and over 80% success in real-world trials. Moreover, VOFA successfully pushes objects weighing up to 17kg, exceeding half of the Booster T1's body weight.
Chinese Translation
使用车载自我中心感知以目标导向的方式推动大物体的能力是人形机器人执行复杂任务(如仓库物料处理)所必需的技能。为了稳健地将重物操纵到任意目标配置,机器人必须应对未知的物体质量和地面摩擦、噪声干扰的车载感知以及驱动错误,这一切都必须在实时反馈循环中处理。现有解决方案要么依赖于没有车载感知的特权物体状态信息,要么在目标配置和物体物理属性的变化面前缺乏稳健性。在本研究中,我们提出了VOFA,这是一种视觉目标条件的人形机器人运动操纵系统,能够将具有未知物理属性的物体推向任意目标位置。VOFA由一个两级层次结构组成,包括高层视觉运动策略和低层力自适应全身控制器。高层策略处理噪声干扰的车载观察,并生成目标条件命令以在多样的物体-目标配置中闭环操作,而低层全身控制器则提供对物体物理属性变化的稳健性。VOFA在Booster T1人形机器人上进行了广泛的仿真和现实世界实验评估。我们的结果显示了卓越的表现,在仿真中取得了超过90%的成功率,在现实世界试验中取得了超过80%的成功率。此外,VOFA成功推送重达17kg的物体,超过了Booster T1体重的一半。
cs.RO / 26 / 2605.01529
Good in Bad (GiB): Sifting Through End-user Demonstrations for Learning a Better Policy
优中劣 (GiB): 通过终端用户演示筛选学习更优策略
Abstract
Imitation learning offers a promising framework for enabling robots to acquire diverse skills from human users. However, most imitation learning algorithms assume access to high-quality demonstrations, an unrealistic expectation when collecting data from non-expert users, whose demonstrations often contain inadvertent errors. Naively learning from such demonstrations can result in unsafe policy behavior, while discarding entire demonstrations due to occasional mistakes wastes valuable data, especially in low-data settings. In this work, we introduce GiB (Good-in-Bad), an algorithm that automatically identifies and discards erroneous subtasks within demonstrations while preserving high-quality subtasks. The filtered data can then be used by any policy learning algorithm to train more robust policies. GiB first trains a self-supervised model to learn latent features and assigns binary weights to label each demonstration as good or bad. It then models the latent feature distribution of high-quality segments and uses the Mahalanobis distance to detect and evaluate poor-quality subtasks. We validate GiB on the Franka robot in both simulated and real-world multi-step tasks, demonstrating improved policy performance when learning from mixed-quality human demonstrations.
Chinese Translation
模仿学习为机器人的技能获取提供了一种有前景的框架,使其能够从人类用户那里获得多样化的技能。然而,大多数模仿学习算法假设能够获得高质量的演示,这在从非专家用户收集数据时是一种不切实际的期待,因为其演示常常包含无意的错误。盲目地从这些演示中学习可能导致不安全的策略行为,而由于偶尔的错误而丢弃整个演示则浪费了宝贵的数据,特别是在低数据环境中。在本工作中,我们提出了GiB(Good-in-Bad),一种算法,能够自动识别并丢弃演示中的错误子任务,同时保留高质量的子任务。过滤后的数据可以被任何策略学习算法使用,以训练更稳健的策略。GiB首先训练一个自监督模型来学习潜在特征,并为每个演示分配二进制权重,以标记其为好或坏。然后,它建模高质量片段的潜在特征分布,并使用马氏距离检测和评估低质量子任务。我们在Franka机器人上验证了GiB,涵盖了模拟和真实世界的多步任务,显示出在从混合质量的人类演示中学习时,策略性能得到了改善。
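The Mahalanobis-distance filtering step mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the GiB algorithm itself: the latent features are replaced by hand-written 2D vectors, the high-quality segments are modeled with a single Gaussian, and the threshold of 3.0 is an arbitrary choice. The function names are hypothetical.

```python
import math

def fit_gaussian_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2D feature vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)

def keep_segment(feature, mean, cov, threshold=3.0):
    """Keep a subtask segment if its feature lies close to the high-quality distribution."""
    return mahalanobis_2d(feature, mean, cov) <= threshold

# Hypothetical latent features of segments already labeled as high quality.
good_features = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.2), (1.0, 1.9), (0.8, 2.0)]
mean, cov = fit_gaussian_2d(good_features)

typical_segment = (1.05, 2.05)   # resembles the good segments
erroneous_segment = (2.0, 1.0)   # far from the good distribution

print(keep_segment(typical_segment, mean, cov), keep_segment(erroneous_segment, mean, cov))
```

Filtering at the segment level is what lets the good parts of a flawed demonstration survive, instead of discarding the whole demonstration.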
cs.RO / 27 / 2605.01544
An Efficient Metric for Data Quality Measurement in Imitation Learning
一种高效的数据质量测量指标在模仿学习中的应用
Abstract
Imitation learning (IL) has seen remarkable progress, yet field deployment of IL-powered robots remains hindered by the challenge of out-of-distribution (OOD) scenarios. Fine-tuning pre-trained policies with end-user demonstrations collected in deployment environments is a promising strategy to address this challenge. However, end-user demonstrations are frequently of poor quality, characterized by excessive corrective motions, oscillations, and abrupt adjustments that degrade both learned and fine-tuned policy performance. Existing automated approaches for curating demonstration data require policy rollouts in the environment, making them computationally expensive and impractical for real-world deployment. In this paper, we propose a fast, efficient, and fully automated demonstration ranking metric based on the power spectral density (PSD) of demonstration trajectories. The PSD metric requires no policy learning, environment interaction, or expert labeling, making it well-suited for scalable, in-the-field data curation. Lower PSD values correspond to smoother, higher-quality demonstrations, while higher PSD values indicate erratic, artifact-laden trajectories. We evaluate the proposed metric on two benchmark imitation learning datasets comprising expert and lay-user demonstrations, and through a user study with older adults at a retirement facility, where collected demonstrations are used to fine-tune $\pi_{0.5}$ for a daily living task. Results demonstrate that PSD-curated data yields policies with higher task success rates and smoother execution trajectories compared to uncurated baselines and two competitive data-ranking methods.
Chinese Translation
模仿学习(IL)取得了显著进展,然而由IL驱动的机器人的实地应用仍受到分布外(OOD)场景的挑战限制。利用在部署环境中收集的终端用户演示对预训练策略进行微调是应对这一挑战的有前景的策略。然而,终端用户的演示往往质量较差,表现为过多的修正动作、震荡和突发性调整,这些都会降低已学习和微调策略的性能。现有的自动化演示数据策划方法要求在环境中进行策略的展开,使其计算成本高,实地应用不切实际。本文提出了一种快速、高效且完全自动化的演示排名指标,该指标基于演示轨迹的功率谱密度(PSD)。PSD指标不需要策略学习、环境交互或专家标注,这使其非常适合可扩展的现场数据策划。较低的PSD值对应于更平滑、更高质量的演示,而较高的PSD值则表明轨迹不稳定,含有伪影。我们在两个基准模仿学习数据集上评估了所提出的指标,这些数据集包含专业用户和普通用户的演示,并通过对养老院中老年人的用户研究,利用收集的演示对$\pi_{0.5}$在日常生活任务上进行微调。结果表明,经过PSD策划的数据相较于未经策划的基线和两种竞争性数据排名方法,产生了具有更高任务成功率和更平滑执行轨迹的策略。
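A PSD-based smoothness score of the kind described in this abstract can be sketched as below. The abstract does not state which signal the PSD is computed on or how it is reduced to a scalar, so this sketch assumes a periodogram of the finite-difference velocity, averaged over all non-DC frequency bins. The function `psd_score` and the two synthetic trajectories are hypothetical.

```python
import cmath
import math

def psd_score(trajectory, dt):
    """Mean periodogram power of the velocity signal, excluding the DC bin.

    Assumption: a lower score means a smoother demonstration.
    """
    velocity = [(b - a) / dt for a, b in zip(trajectory, trajectory[1:])]
    n = len(velocity)
    mean_v = sum(velocity) / n
    velocity = [v - mean_v for v in velocity]  # remove the constant component
    total = 0.0
    for k in range(1, n // 2 + 1):
        # Naive DFT coefficient at frequency bin k; O(n^2) overall, fine for short demos.
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(velocity))
        total += (dt / n) * abs(coeff) ** 2
    return total / (n // 2)

dt = 0.05
steps = 101

# A smooth reach from 0 to 1, and the same reach with an alternating 1 cm jitter.
smooth = [0.5 * (1.0 - math.cos(math.pi * t / (steps - 1))) for t in range(steps)]
jittery = [x + 0.01 * (-1) ** t for t, x in enumerate(smooth)]

ranked = sorted([("smooth", psd_score(smooth, dt)), ("jittery", psd_score(jittery, dt))],
                key=lambda item: item[1])
print(ranked)
```

Because the score needs only the recorded trajectory, ranking demonstrations this way involves no policy rollouts, which matches the motivation given in the abstract.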
cs.RO / 28 / 2605.01581
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Hydra-DP3:面向频率的3D扩散策略右调节在视觉运动控制中的应用
Abstract
Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This further suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hydra-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Furthermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
Chinese Translation
基于扩散的视觉运动策略在机器手操作中表现良好,但当前方法仍然继承了图像生成风格的解码器和多步采样。我们从频率域的角度重新审视了这种设计。机器人动作轨迹高度平滑,大部分能量集中在少数低频离散余弦变换模式中。在这一结构下,我们表明,最优去噪器的误差受到低频子空间维度和残余高频能量的界限,这意味着去噪误差在非常少的逆向步骤后会饱和。这进一步暗示,动作去噪所需的去噪模型比图像生成模型简单得多。基于这一见解,我们提出了Hydra-DP3 (HDP3),一种具有轻量级扩散混合器解码器的口袋规模3D扩散策略,支持两步DDIM推断。我们的合成实验验证了该理论,并支持两步去噪的充分性。此外,在RoboTwin2.0、Adroit、MetaWorld以及现实世界任务中,HDP3以不到1%的参数量实现了当前3D扩散策略的最先进性能,并显著降低了推断延迟。
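The claim that smooth action trajectories concentrate their energy in a few low-frequency DCT modes can be checked numerically. The sketch below is illustrative only: it uses an orthonormal DCT-II written in plain Python, a synthetic minimum-jerk profile as a stand-in for a robot action trajectory, and white noise as a contrast case. None of these choices come from the paper.

```python
import math
import random

def dct2_orthonormal(x):
    """Orthonormal DCT-II, so the sum of squared coefficients equals the signal energy."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (t + 0.5) * k / n) for t, v in enumerate(x))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def low_freq_energy_fraction(x, num_modes):
    """Fraction of total energy carried by the first num_modes DCT coefficients."""
    coeffs = dct2_orthonormal(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:num_modes]) / total

def min_jerk(n):
    """Minimum-jerk position profile from 0 to 1 sampled at n points."""
    out = []
    for t in range(n):
        tau = t / (n - 1)
        out.append(10 * tau ** 3 - 15 * tau ** 4 + 6 * tau ** 5)
    return out

horizon = 64
smooth_actions = min_jerk(horizon)

rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(horizon)]

smooth_fraction = low_freq_energy_fraction(smooth_actions, num_modes=4)
noise_fraction = low_freq_energy_fraction(white_noise, num_modes=4)
print(smooth_fraction, noise_fraction)
```

For the smooth profile nearly all energy sits in the first four modes, while for white noise the energy is spread across the spectrum. This is the structural gap the abstract relies on when arguing that action denoising needs a much simpler model than image generation.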
cs.RO / 29 / 2605.01731
Lateral String Stability for Vehicle Platoons: Formulation, Definition, and Analysis
车辆编队的横向串联稳定性:建模、定义与分析
Abstract
Platooning of connected and automated vehicles provides significant benefits in terms of energy efficiency, traffic throughput, and, most critically, safety. These safety benefits depend on string stability, which dictates how disturbances propagate along a vehicle string. Although longitudinal string stability has been extensively examined, lateral string stability, which governs the propagation of path-tracking errors that can lead to unsafe deviations from the desired path, remains underexplored. Its importance is growing as autonomous vehicles increasingly depend on onboard sensing and map-free navigation, where sensor occlusions and tight formations amplify safety risks. This paper presents a framework for lateral string stability that focuses directly on safety-critical, path-relative tracking errors and enables consistent comparison across vehicles that follow the same planned path. The key element of the framework is an arc-length (Eulerian) viewpoint, a departure from traditional analyses, that clarifies how tracking errors at a given point on the path propagate from one vehicle to the next. Building on this foundation, we propose the definition of L2 lateral string stability along with two control strategies: a feedback-feedforward strategy that relies solely on onboard sensing, and a novel learn-from-predecessor strategy that makes use of vehicle-to-vehicle communication. Both strategies are analyzed for lateral string stability with respect to two error measures: tracking error vector and lateral (cross-track) error. Our results show that onboard sensing alone cannot guarantee attenuation of path-tracking errors, imposing a fundamental safety limitation, while V2V communication enables true error attenuation. The analysis further identifies structural controller requirements, showing that nonzero feedback on specific measurements is essential for guaranteeing stability.
Chinese Translation
连接和自动化车辆的编队在能源效率、交通通行能力和最重要的安全性方面提供了显著的好处。这些安全益处依赖于串联稳定性,其决定了干扰在车辆串联中的传播方式。尽管纵向串联稳定性已经得到广泛研究,但横向串联稳定性仍未得到深入探讨,而它决定了路径跟踪误差的传播,这些误差可能导致偏离预定路径的危险情况。随着自动驾驶车辆越来越依赖于车载传感器和无地图导航,其重要性日益增加,因为传感器遮挡和紧凑编队加剧了安全风险。本文提出了一种横向串联稳定性的框架,直接关注安全关键的路径相关跟踪误差,并支持在遵循相同规划路径的车辆之间进行一致性比较。该框架的关键要素是弧长(欧拉)视角,这是一种与传统分析不同的方法,旨在阐明给定路径上某一点的跟踪误差如何从一辆车向另一辆车传播。在此基础上,我们提出了 L2 横向串联稳定性的定义以及两种控制策略:一种仅依赖于车载传感的反馈前馈策略,以及一种利用车对车通信的创新性“向前车学习”策略。针对这两种误差测量:跟踪误差向量和横向(横向轨迹)误差,我们对这两种策略进行了横向串联稳定性的分析。我们的结果表明,仅凭车载传感不能保证路径跟踪误差的衰减,这对安全性构成了根本限制,而 V2V 通信则能实现真正的误差衰减。分析进一步确定了结构控制器的要求,表明在特定测量上施加非零反馈对于保证稳定性至关重要。
cs.RO / 30 / 2605.01772
Anticipation-VLA: Solving Long-Horizon Embodied Tasks via Anticipation-based Subgoal Generation
Anticipation-VLA:通过基于预测的子目标生成解决长期体态任务
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for embodied intelligence, enabling robots to perform tasks based on natural language instructions and current visual input. However, existing VLA models struggle with long-horizon tasks due to compounding errors. Prior methods decompose tasks into subtasks of fixed granularity, which cannot adapt to the varying complexity of execution states, limiting their robustness in long-horizon tasks. To overcome this, we introduce the Anticipation Model, which adaptively and recursively generates future subgoals. This model continuously adapts as the task unfolds, adjusting future subgoals in response to evolving dynamics, facilitating more reliable planning paths. Building on this concept, we propose Anticipation-VLA, a hierarchical VLA model that leverages the anticipation model to generate actionable subgoals that guide VLA policy execution. We implement Anticipation-VLA by fine-tuning a Unified Multimodal Model (UMM) for high-level subgoal generation and a goal-conditioned VLA policy for low-level action execution. Experiments in both simulated and real-world robotic tasks demonstrate the effectiveness of Anticipation-VLA, highlighting the importance of adaptive and recursive subgoal generation for robust policy execution.
Chinese Translation
视觉-语言-动作(VLA)模型作为体态智能的强大范式,能够使机器人根据自然语言指令和当前视觉输入执行任务。然而,现有的 VLA 模型在处理长期任务时面临累积错误的问题。之前的方法将任务细分为固定粒度的子任务,这些子任务无法适应执行状态的变化复杂性,从而限制了它们在长期任务中的鲁棒性。为了解决这个问题,我们引入了预测模型,该模型自适应且递归地生成未来子目标。该模型在任务展开过程中持续适应,根据不断变化的动态调整未来子目标,以促进更可靠的规划路径。在此基础上,我们提出了 Anticipation-VLA,一种层次化的 VLA 模型,利用预测模型生成可操作的子目标来引导 VLA 策略的执行。我们通过微调统一多模态模型(Unified Multimodal Model, UMM)来实现 Anticipation-VLA,以便进行高层次子目标生成,并利用目标条件的 VLA 策略进行低层次的动作执行。在模拟和现实世界的机器人任务中的实验验证了 Anticipation-VLA 的有效性,突显了自适应和递归子目标生成对于鲁棒策略执行的重要性。
cs.RO / 31 / 2605.01773
On the Characterization and Limits of 4D Radar for Aided Inertial Navigation
4D雷达在辅助惯性导航中的特性与限制
Abstract
Frequency Modulated Continuous Wave (FMCW) radar is a promising sensor for aided inertial navigation, due to its robustness in environments that challenge traditional alternatives, such as LiDAR and vision. However, its widespread adoption is hindered by complex, noisy measurements, which make reliable estimation difficult. This manuscript addresses these challenges by analyzing the fundamental measurement relations of FMCW radar sensing and developing a reliable estimator. Noise models are derived by applying first principles to the underlying signal processing of a typical radar sensor. These models guide the design of a factor graph-based estimator, utilizing a first-order approximation for the measurement noise propagation. The approach is first examined through simulation, evaluating the significance of different noise sources, the validity of the first-order approximation, and the state-dependent nature of the covariance expressions. Extensive experiments demonstrate the superior robustness and accuracy of the proposed method across diverse field environments and flight profiles, including beyond the radar's standard operating range. Furthermore, the experiments confirm the insights from the simulation regarding the behavior and performance of different estimator configurations relative to their operating conditions. The evaluation data and estimator implementation are made available at https://github.com/ntnu-arl/rig.
Chinese Translation
频率调制连续波(FMCW)雷达是一种有前景的辅助惯性导航传感器,因其在挑战传统替代方案(如LiDAR和视觉)的环境中的鲁棒性。然而,其广泛应用受到复杂和嘈杂测量值的制约,这使得可靠的估计变得困难。本文通过分析FMCW雷达传感的基本测量关系,提出了一种可靠的估计器,从而解决了这些挑战。噪声模型是通过将基本原理应用于典型雷达传感器的信号处理得出的。这些模型引导了基于因子图的估计器的设计,利用一阶近似来处理测量噪声的传播。该方法首先通过仿真进行检验,评估了不同噪声源的重要性、一阶近似的有效性以及协方差表达式的状态依赖性。大量实验表明,所提方法在不同现场环境和飞行轨迹下的鲁棒性和准确性优于传统方法,包括超出雷达标准操作范围。此外,实验确认了仿真结果对不同估计器配置在其操作条件下的行为和性能的洞察。评估数据和估计器的实现可以在 https://github.com/ntnu-arl/rig 获取。
cs.RO / 32 / 2605.01860
Optimizing Trajectory-Trees in Belief Space: An Application from Model Predictive Control to Task and Motion Planning
在信念空间中优化轨迹树:从模型预测控制到任务与运动规划的应用
Abstract
This paper explores the benefits of computing arborescent trajectories (trajectory-trees) instead of commonly used sequential trajectories for partially observable robotic planning problems. In such environments, a robot infers knowledge from observations, and the optimal course of action depends on these observations. Trajectory-trees, optimized in belief space, naturally capture this dependency by branching where the belief state is expected to evolve into multiple distinct scenarios, such as upon receiving an observation. Unlike sequential trajectories, which model a single forward evolution of the system, trajectory-trees capture multiple possible contingencies. First, we focus on Model Predictive Control (MPC) and demonstrate the benefits of planning tree-like trajectories. We formulate the control problem as the optimization of a tree with a single branching (PO-MPC). This improves performance by reducing control costs through more informed planning. To satisfy the real-time constraints of MPC, we develop an optimization algorithm called Distributed Augmented Lagrangian (D-AuLa), which leverages the decomposability of the PO-MPC formulation to parallelize and accelerate the optimization. We apply the method to both linear and non-linear MPC problems using autonomous driving examples. Second, we address Task And Motion Planning (TAMP), and introduce a planner (PO-LGP) reasoning on decision trees at task level, and trajectory-trees at motion-planning level. This approach builds upon the Logic-Geometric-Programming Framework (LGP) and extends it to partially observable problems. The experiments show the method's applicability to problems with a small belief state size, and scales to larger problems by optimizing explorative policies, which are used as macro-actions in an overarching task plan.
Chinese Translation
本文探讨了在部分可观测的机器人规划问题中,计算树状轨迹(轨迹树)而非常用的顺序轨迹的益处。在这样的环境中,机器人通过观察推测知识,最佳行动路径依赖于这些观察。优化于信念空间的轨迹树自然捕捉了这种依赖关系,通过在信念状态预期演变为多个不同情景时进行分枝,例如在接收到观察结果时。与仅模型化系统单一向前演化的顺序轨迹不同,轨迹树捕捉了多种可能的紧急情况。首先,我们专注于模型预测控制(MPC),并展示规划树状轨迹的优势。我们将控制问题表述为优化一个具有单一分枝的树(PO-MPC),通过更为充分的信息规划来降低控制成本,从而改善性能。为了满足MPC的实时性约束,我们开发了一种称为分布式增强拉格朗日(D-AuLa)的优化算法,该算法利用PO-MPC模型的可分解性实现优化的并行化和加速。我们将该方法应用于线性和非线性MPC问题,使用自主驾驶示例。其次,我们讨论任务与运动规划(TAMP),并引入一种在任务级别上基于决策树推理的规划器(PO-LGP),以及在运动规划级别上采用轨迹树的规划器。这一方法基于逻辑几何规划框架(LGP)并将其扩展至部分可观测问题。实验表明该方法适用于小信念状态规模的问题,并通过优化探测性策略(作为整体任务计划中的宏观动作用以)扩展到更大规模的问题。
cs.RO / 33 / 2605.01948
Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection
Phone2Act:一种低成本、硬件无关的远程操作系统,用于可扩展的视觉-语言-动作(VLA)数据收集
Abstract
Collecting diverse, high-quality manipulation data for Vision-Language-Action (VLA) model training remains prohibitively expensive for many research groups, as existing teleoperation frameworks rely on specialized hardware or are tightly coupled to specific robot platforms. We present Phone2Act, a low-cost, hardware-agnostic teleoperation framework that transforms a commodity smartphone into a 6-DoF robot controller via Google ARCore. Built on a modular ROS 2 architecture, Phone2Act decouples control logic from hardware specifics through interchangeable bridge nodes, supporting platforms from industrial cobots to low-cost bimanual arms without code modification. A Universal Recorder synchronizes multi-camera RGB streams with robot state feedback and exports demonstrations natively in the LeRobot dataset format, eliminating post-processing and enabling immediate VLA fine-tuning. We validate the framework by fine-tuning GR00T-N1.5 on 130 collected episodes, achieving a 90% success rate on a real-world multi-stage pick-and-place task deployed on a physical Dobot CR5.
Chinese Translation
为视觉-语言-动作(VLA)模型训练收集多样化的高质量操控数据对于许多研究小组而言仍然代价高昂,因为现有的远程操作框架依赖于专用硬件或紧密绑定于特定机器人平台。我们提出了Phone2Act,一种低成本、硬件无关的远程操作框架,通过Google ARCore将普通智能手机转变为6自由度(6-DoF)机器人控制器。Phone2Act基于模块化的ROS 2架构,通过可互换的桥接节点将控制逻辑与硬件细节解耦,支持从工业协作机器人到低成本双臂机械臂的平台,无需修改代码。一种通用记录器同步多摄像头的RGB流与机器人状态反馈,并将演示原生导出为LeRobot数据集格式,消除后期处理,便于即时VLA微调。我们通过在130个收集的场景上对GR00T-N1.5进行微调,验证了该框架,在物理Dobot CR5上部署的真实世界多阶段抓取和放置任务中实现了90%的成功率。
cs.RO / 34 / 2605.01949
Sonar-GPS Fusion for Seabed Mapping in Turbid Shallow Waters with an Autonomous Surface Vehicle
基于声纳与GPS融合的浑浊浅水区自主水面车辆海底测绘
Abstract
Accurate seabed mapping is essential for habitat monitoring and infrastructure inspection. In turbid, shallow coastal waters, such as shellfish aquaculture farms, the effectiveness of traditional optical methods is limited. Autonomous surface vehicles (ASVs) equipped with forward-looking sonar (FLS) offer a promising alternative. However, existing sonar-based systems struggle to achieve fine-resolution mapping over long trajectories because of low-resolution positioning measurements and accumulated drift. In this paper, we present a drift-resilient seabed mapping framework that integrates local FLS frame alignment using the Fourier-Mellin transform (FMT) with global trajectory optimization based on an extended Kalman filter (EKF) that fuses global positioning system (GPS), inertial measurement unit (IMU), and compass data. A variance-based image blending strategy is used to further reduce visual artifacts in overlapping regions. Field trials on a structured oyster farm site show that our framework reduces drift, measured by RMSE, by 9.5% relative to the FMT-only baseline. This framework also enables sub-meter reconstruction accuracy and preserves the high-resolution textures needed for oyster inventory estimation within the mapped areas.
Chinese Translation
准确的海底测绘对于栖息地监测和基础设施检查至关重要。在浑浊的浅海水域,例如贝类养殖场,传统的光学测绘方法效果有限。配备前视声纳(FLS)的自主水面车辆(ASVs)提供了一种有前景的替代方案。然而,现有的基于声纳的系统在长时间轨迹上实现高分辨率测绘时面临着低分辨率定位测量和长时间轨迹累积漂移的挑战。本文提出了一种抗漂移的海底测绘框架,该框架将利用傅里叶-梅林变换(FMT)的局部FLS帧对齐与基于扩展卡尔曼滤波器(EKF)的全局轨迹优化相结合,该优化融合了全球定位系统(GPS)、惯性测量单元(IMU)和指南针数据。一种基于方差的图像融合策略用于进一步减少重叠区域中的视觉伪影。经过对一个结构化牡蛎养殖场的现场试验,我们的框架相较于仅使用FMT的基线帮助RMSE漂移减少了9.5%。该框架还允许亚米级重建精度,并保存了海底测绘区域内牡蛎库存评估所需的高分辨率纹理。
cs.RO / 35 / 2605.01996
Optimized and kinematically feasible multi-agent motion planning
优化和运动学可行的多智能体运动规划
Abstract
Multi-agent motion planning (MAMP) is an important problem for autonomous systems with multiple agents. In this work we propose a two-step method for finding optimized and kinematically feasible solutions to MAMP problems. The first step finds an initial feasible solution using state-of-the-art methods such as conflict-based search (CBS) or priority-based search (PBS). The second step is an improvement step that refines this solution by solving a multi-phase optimal control problem (OCP), with the initial solution used to warm-start the solver. We also propose a method for generating motion primitives in an optimized way under the constraint that the primitive durations are all multiples of the same sample time. We evaluate our proposed framework on a MAMP problem for tractor-trailer systems. We extend the safe interval path planning with interval projections (SIPP-IP) algorithm so it can handle more general cost functions and larger agents, but our results show that for the tractor-trailer system a simple lattice-based planner performs better due to less conservative collision checks. Our experiments also indicate that CBS performs better than PBS for this system, as it achieves a higher success rate in environments with obstacles and has a lower average runtime, although both planners achieve solutions of similar quality after the improvement step.
Chinese Translation
多智能体运动规划(MAMP)是一个对于多个智能体的自主系统而言的重要问题。在本工作中,我们提出了一种两步法,以寻找优化且运动学可行的MAMP问题解决方案。第一步利用最先进的方法,如基于冲突的搜索(CBS)或基于优先级的搜索(PBS),寻找初始可行解;第二步是改善步骤,通过解决一个多阶段的最优控制问题(OCP)来改善该解决方案,其中初始解被用作求解器的热启动。我们还提出了一种在优化的方式下生成运动原语的方法,前提是原语的持续时间都是同一采样时间的整数倍。我们在拖头-挂车系统的MAMP问题上评估了我们提出的框架。我们扩展了带区间投影的安全区间路径规划(SIPP-IP)算法,使其能够处理更一般的成本函数和更大规模的智能体,但我们的结果表明,对于拖头-挂车系统,简单的基于格子的规划器因更少的保守碰撞检查而表现更好。我们的实验还表明,CBS在此系统中比PBS表现更佳,因为在障碍物环境中其成功率更高且平均运行时间更低,尽管经过改善步骤后两种规划器得到的解的质量相似。
cs.RO / 36 / 2605.02021
Neural Backward Reach-Avoid Tubes with MPC Supervision for High-Dimensional Systems: An Application to Safe Spacecraft Docking
用于高维系统的神经反向到达-避免管道及其基于模型预测控制的监督:安全航天器对接的应用
Abstract
Autonomous spacecraft docking requires control policies that simultaneously ensure collision avoidance and target reachability under coupled, high-dimensional translational-rotational dynamics. Hamilton-Jacobi (HJ) reachability provides formal reach-avoid guarantees, but classical solvers are limited to low-dimensional systems. Learning-based approaches have begun to scale HJ analysis, yet they struggle in reach-avoid settings, especially where goal and failure sets are tightly coupled, as in docking. We propose a learning-based Backward Reach-Avoid Tube (BRAT) framework that addresses this challenge by tightly integrating HJ structure with MPC-based supervision. In the offline phase, we train a neural approximation of the HJ value function using PDE-based losses augmented with curriculum-driven MPC supervision, which provides informative value targets and stabilizes training in regions where purely PDE-based methods fail. In the online phase, the learned value function is deployed through two real-time controllers: (i) a value gradient-driven controller, and (ii) a value-function-augmented terminal MPC that explicitly enforces reachability at the horizon. We evaluate the proposed method on a 6D planar docking problem against grid-based ground truth and then scale to the full 13D system. Across both settings, our approach outperforms existing methods in success rate and computational efficiency.
Chinese Translation
自主航天器对接需要控制策略,以在耦合的高维平移-旋转动态下同时确保碰撞避免和目标可达性。汉密尔顿-雅可比(Hamilton-Jacobi, HJ)可达性提供了正式的可达-避免保证,但传统求解器仅限于低维系统。基于学习的方法已开始扩展HJ分析,但在可达-避免设置中表现不佳,特别是当目标集与失败集紧密耦合时,例如对接场景。我们提出了一种基于学习的反向到达-避免管道(Backward Reach-Avoid Tube, BRAT)框架,通过紧密结合HJ结构与基于模型预测控制(MPC)的监督,来应对这一挑战。在离线阶段,我们利用基于偏微分方程(PDE)的损失函数训练HJ价值函数的神经近似,同时结合以课程为驱动的MPC监督,从而提供有效的价值目标,并在纯PDE方法失效的区域内稳定训练。在在线阶段,训练得到的价值函数通过两个实时控制器进行部署:(i)基于价值梯度的控制器,以及(ii)一个增强价值函数的终端MPC,明确强制执行可达性条件。我们在一个6D平面对接问题上评估所提出的方法,与基于网格的真实值进行比较,然后扩展到完整的13D系统。在这两种设置中,我们的方法在成功率和计算效率上均优于现有技术。
cs.RO / 37 / 2605.02037
VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation
VILAS:一种集成VLA的低成本软抓取机器人操控架构
Abstract
We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.
Chinese Translation
我们提出了VILAS,一个完全低成本、模块化的机器人操控平台,旨在支持端到端视觉-语言-行动(VLA)策略学习和在可获取硬件上的部署。该系统整合了Fairino FR5协作臂、Jodell RG52-50电动夹爪和双摄像头感知模块,通过基于ZMQ的通信架构统一,能够在单一框架内无缝协调遥控操作、数据收集和策略部署。为了在不依赖于显式力传感器的情况下安全地操纵易碎物体,我们设计了一种基于切纸艺术(kirigami)的软性顺应夹爪扩展,它在压缩载荷下产生可预测的变形,为脆弱目标提供温和且可重复的接触。我们在VILAS平台上部署和评估了三种最先进的VLA模型:pi_0、pi_0.5和GR00T N1.6。所有模型均从通过我们遥控操作管道收集的相同演示数据集的公开预训练检查点进行微调。在葡萄抓取任务上的实验验证了所提系统的有效性,确认可以成功地在低成本模块化硬件上训练和部署能干的操控策略。我们的结果进一步为当前VLA模型在现实世界环境中的部署特性提供了实用的见解。
cs.RO / 38 / 2605.02107
AoI-Aware Multi-Robot Sensing and Transport on Connected Graphs
关注信息年龄的多机器人感知与运输在连通图上
Abstract
A team of mobile robots monitors spatially distributed processes and delivers measurements to a base, where the Age of Information (AoI) is measured from sensing start, capturing both stochastic parallel sensing delays and hop-based propagation. At each non-base node, multiple robots may collaborate, yielding node-dependent geometric group sensing times, while other robots act as mobile conveyors that transport samples along unit-time edges. The paper first derives a per-node and network-wide AoI lower bound that decomposes into a sensing term, determined by mean group sensing times, and a propagation term, given by shortest-path distances. It then shows that minimizing the sensing component yields a separable discretely convex resource allocation problem, solved optimally by a greedy water-filling algorithm. A shortest-path-tree conveyor architecture with an Euler-walk deployment is constructed and proven to attain the lower bound in a full-conveyor regime. Numerical simulations illustrate the impact of sensing allocation and conveyor deployment on AoI performance.
Chinese Translation
一组移动机器人监测空间分布的过程并将测量结果传送至基地,在此处从感知开始测量信息年龄 (AoI),涵盖随机并行感知延迟和基于跳跃的传播。在每个非基地节点,多台机器人可以协作,从而产生依赖于节点的几何组感知时间,而其他机器人则充当移动传送带,沿单位时间边缘运输样本。本文首先推导出每个节点和网络范围内的AoI下界,该下界分解为感知项(由平均组感知时间确定)和传播项(由最短路径距离给出)。然后,本文显示,最小化感知组件产生一个可分离的离散凸资源配置问题,可通过贪婪的水填充算法获得最优解。构建了一种最短路径树传送带架构,并通过欧拉行走部署证明其在完整传送带模式下能够达到下界。数值仿真展示了感知分配和传送带部署对AoI性能的影响。
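The sensing-allocation step described in this abstract can be illustrated with a small greedy sketch. It is not the paper's algorithm. It assumes, purely for illustration, that each robot at a node finishes sensing in a given slot independently with probability p, so a group of k robots has mean sensing time 1 / (1 - (1 - p)^k), and that the objective is the sum of these means over nodes. The names `mean_group_sensing_time` and `greedy_allocation` are hypothetical.

```python
import heapq


def mean_group_sensing_time(p, k):
    """Mean of the minimum of k i.i.d. geometric times with success probability p.

    Assumed model for illustration only; the abstract does not give the formula.
    """
    return 1.0 / (1.0 - (1.0 - p) ** k)


def greedy_allocation(success_probs, num_robots):
    """Assign robots to nodes one at a time, always where the marginal gain is largest.

    Every node starts with one robot. Because each per-node cost is convex and
    decreasing in k, the greedy choice is optimal for the separable sum.
    """
    n = len(success_probs)
    if num_robots < n:
        raise ValueError("need at least one robot per node")
    if any(not 0.0 < p <= 1.0 for p in success_probs):
        raise ValueError("success probabilities must lie in (0, 1]")

    alloc = [1] * n
    heap = []  # max-heap via negated marginal gains
    for i, p in enumerate(success_probs):
        gain = mean_group_sensing_time(p, 1) - mean_group_sensing_time(p, 2)
        heapq.heappush(heap, (-gain, i))

    for _ in range(num_robots - n):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        p, k = success_probs[i], alloc[i]
        gain = mean_group_sensing_time(p, k) - mean_group_sensing_time(p, k + 1)
        heapq.heappush(heap, (-gain, i))
    return alloc


allocation = greedy_allocation([0.2, 0.5, 0.9], 6)
```

In this toy instance the slowest node (p = 0.2) receives the most robots, which matches the intuition that extra sensing capacity helps most where sensing is least reliable.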
cs.RO / 39 / 2605.02135
Robotic Desk Organization: A Multi-Primitive Approach to Manipulating Heterogeneous Objects via Environmental Constraints
机器人桌面组织:通过环境约束操控异构对象的多原始方法
Abstract
Desktop organization remains challenging for service robots because of heterogeneous objects and diverse manipulation objectives, such as collection and stacking. In this article, a task-oriented framework is presented for organizing planar rigid and deformable objects on desks. A perception pipeline was developed that augments existing datasets with uncommon desktop items and makes geometry-based pose and keypoint estimation possible, along with the detection of environmental constraints, such as table edges. To handle diverse manipulation requirements, environment-assisted primitives are used, including contact-based grasping for small objects, edge-based push-grasping for planar rigid objects, and levering-based grasping for planar deformable objects. These primitives leverage environmental and interobject constraints to improve robustness. A task planner was designed to integrate these primitives into multiobject organization. Real-world experiments demonstrate the effectiveness and robustness of the proposed framework. This research provides practical manipulation primitives for planar rigid and deformable objects, highlighting the role of environmental and interobject constraints in complex multiobject manipulation tasks. Code and video are available online.
Chinese Translation
桌面组织对服务机器人而言依然充满挑战,因为存在异构对象和多样的操控目标,如收集和堆叠。本文提出了一种面向任务的框架,用于组织桌面上的平面刚性和可变形对象。我们开发了一条感知管道,增强了现有的数据集,加入了不常见的桌面物品,并实现了基于几何的姿态和关键点估计,同时检测环境约束,如桌边。为了处理多样的操控需求,我们采用了环境辅助的原始方法,包括针对小物体的基于接触的抓取、针对平面刚性物体的基于边缘的推抓取,以及针对平面可变形物体的基于杠杆的抓取。这些原始方法利用环境约束和物体间约束来提高稳健性。我们设计了一个任务规划器,将这些原始方法整合到多物体组织中。充分的现实世界实验验证了所提出框架的有效性和稳健性。本研究为平面刚性和可变形对象提供了实用的操控原始方法,突显了环境约束和物体间约束在复杂多物体操控任务中的重要作用。代码和视频可在线获取。
cs.RO / 40 / 2605.02147
Sampling-Based Control via Entropy-Regularized Optimal Transport
基于采样的控制通过熵正则化的最优传输
Abstract
Sampling-based model predictive control methods like MPPI and CEM are essential for real-time control of nonlinear robotic systems, particularly where discontinuous dynamics preclude gradient-based optimization. However, these methods derive from information-theoretic objectives that are agnostic to the geometry of the control problem, leading to pathological behaviors such as mode-averaging when the cost landscape is complex. We present OT-MPC, a sampling-based algorithm that overcomes these limitations through an entropy-regularized optimal transport formulation. By computing an optimal coupling between candidate control sequences and low-cost proposals, OT-MPC refines candidates toward nearby promising samples while coordinating updates across the ensemble to maintain coverage of the solution space. We derive closed-form, gradient-free updates via the Sinkhorn algorithm, enabling real-time performance. Experiments on navigation, manipulation, and locomotion tasks demonstrate improved success rates over existing methods.
Chinese Translation
基于采样的模型预测控制方法,如 MPPI 和 CEM,对于非线性机器人系统的实时控制至关重要,特别是在不连续动态阻碍基于梯度的优化时。然而,这些方法源于与控制问题几何无关的信息论目标,当成本景观复杂时,容易导致如模式平均等病态行为。我们提出了 OT-MPC 一种基于采样的算法,通过熵正则化的最优传输形式克服了这些限制。通过计算候选控制序列与低成本提议之间的最优耦合,OT-MPC 能够将候选方案细化为接近有希望的样本,同时协调整个集合的更新以维持对解空间的覆盖。我们通过 Sinkhorn 算法推导出闭式、无梯度的更新,从而实现实时性能。在导航、操作和运动任务上的实验显示,OT-MPC 相较于现有方法的成功率显著提高。
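The Sinkhorn iteration mentioned in this abstract can be sketched in a few lines. The example below is a minimal illustration, not OT-MPC itself: it couples three scalar candidate controls with two scalar low-cost proposals under a squared-distance cost, then moves each candidate to the coupling-weighted average of the proposals. That barycentric update, the uniform marginals, and the names `sinkhorn` and `barycentric_update` are assumptions made for this sketch.

```python
import math


def sinkhorn(cost, a, b, eps=1.0, iters=200):
    """Entropy-regularized OT coupling with row marginals a and column marginals b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_update(coupling, proposals):
    """Move each candidate to the coupling-weighted mean of the proposals."""
    return [
        sum(p * y for p, y in zip(row, proposals)) / sum(row)
        for row in coupling
    ]


candidates = [0.0, 1.0, 4.0]
proposals = [0.5, 3.5]
cost = [[(x - y) ** 2 for y in proposals] for x in candidates]
a = [1.0 / 3] * 3   # uniform mass on candidates (assumption)
b = [0.5, 0.5]      # uniform mass on proposals (assumption)

coupling = sinkhorn(cost, a, b)
refined = barycentric_update(coupling, proposals)
```

Because the column marginals force half of the total mass onto each proposal, the middle candidate splits between both proposals instead of collapsing onto the nearer one. This is one way to read the abstract's point about coordinating updates across the ensemble to keep coverage of the solution space.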
cs.RO / 41 / 2605.02192
Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation
我们真的需要立即重置吗?重新思考机器人导航中的碰撞处理
Abstract
Should a single collision necessarily terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation, this remains the standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. While a collision during deployment naturally indicates task failure, applying the same treatment during training prevents the agent from exploring challenging obstacle configurations, which slows learning progress in the early training phase. In this work, we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Experiments on multiple simulated and real-world robotic platforms show that the framework accelerates early-stage exploration and improves both success rate and navigation efficiency over conventional single-collision reset baselines, with a small collision budget producing the largest gains.
Chinese Translation
单次碰撞是否一定要终止整个导航过程?在大多数深度强化学习(Deep Reinforcement Learning, DRL)框架中,机器人导航的标准做法是:每次碰撞都会立即触发全球环境重置,并被视为任务完全失败。虽然在部署过程中,碰撞自然意味着任务失败,但在训练过程中采用相同的处理方式会阻碍代理探索具有挑战性的障碍配置,从而减缓早期训练阶段的学习进展。在本研究中,我们对这种惯例提出质疑,并提出了一种多碰撞重置预算(Multi-Collision Reset Budget, MCB)框架,该框架将局部碰撞终止与全球环境重置解耦,允许代理在同一过程内重试困难的配置。在多个模拟和真实世界的机器人平台上的实验表明,该框架加速了早期阶段的探索,并在成功率和导航效率上均优于传统的单次碰撞重置基准,其中小碰撞预算产生了最大的提升。
cs.RO / 42 / 2605.02227
Change-Robust Online Spatial-Semantic Topological Mapping
抗变更的在线空间-语义拓扑映射
Abstract
Autonomous robots require change-robust spatial-semantic reasoning: using spatial and semantic knowledge to decide where to go, how to get there, and where the robot is despite environmental change. Existing approaches typically attach semantics to SLAM-built metric maps, but these pipelines are brittle under appearance shifts and scene dynamics, where data association and relocalization degrade. We propose a Change-Robust Online Spatial-Semantic (CROSS) representation that replaces a globally consistent metric substrate with an online, pose-aware topological graph of RGB-D keyframes. The system explicitly reasons over perceptual ambiguity using sequential hypothesis testing in continuous SE(3). Our estimator maintains a bounded Gaussian-mixture belief over poses, enabling principled handling of loop closures and kidnapped-robot events. Experiments under severe appearance change, including real-robot object-goal navigation with lighting shifts and furniture rearrangement, demonstrate improved robustness over SLAM-based and topological baselines while remaining safe under perceptual aliasing.
Chinese Translation
自主机器人需要抗变更的空间-语义推理:利用空间和语义知识决定去哪里、如何到达那里,以及在环境变化中确定机器人的位置。现有方法通常将语义附加到基于SLAM构建的度量地图上,但在外观变化和场景动态下,这些流程表现脆弱,数据关联和重新定位效果下降。我们提出了一种抗变更的在线空间-语义(CROSS)表示方法,替换全局一致的度量基础,采用RGB-D关键帧的在线、姿态感知的拓扑图。该系统通过在连续SE(3)中进行序列假设检验,明确处理感知模糊性。我们的估计器在姿态上维持一个有界的高斯混合信念,使得循环闭合和机器人被劫持的事件能够得到有原则的处理。在严重外观变化的实验中,包括实际机器人在照明变化和家具重新布置下的物体目标导航,结果表明,相较于基于SLAM和拓扑的基线方法,性能有了显著提升,并在感知混淆的情况下保持了安全性。
cs.RO / 43 / 2605.02252
Exact Higher-Order Derivatives for SE(3) via Analytical/AD Methods
通过解析/自动微分方法获取 SE(3) 的精确高阶导数
Abstract
Fast prototyping of new SE(3) estimation objectives remains awkward in practice. Modern Lie-group frameworks -- GTSAM, manif, Sophus, SymForce, Ceres -- target first-order workloads through different code-generation and automatic-differentiation strategies, each optimized for a particular seam between hand-derived geometry and generic differentiation. The remaining gap is a compact, AD-safe path from these first-order primitives to exact Hessians, observed-information matrices, and higher-order derivative tensors: the quantities needed for exact Newton steps, observed-information covariance estimates, and covariance correction. This paper presents a hybrid analytical/AD recipe for SE(3) negative log-likelihoods. The practitioner writes the NLL gradient once, generic over a scalar type, and places the analytical/AD seam at the point-action interface y = Tx. Closed-form Lie-group Jacobians are used up to this interface; AD is applied only beyond it. The same source is then instantiated with ordinary floating-point scalars for gradients, vector-seeded dual numbers for exact Hessians in a single forward-mode pass, and nested dual numbers for higher-order derivative tensors. On a representative 6-DoF, 5-landmark SE(3) NLL, the advocated seeded-Hessian path is approximately 5x faster than finite-differencing the AD gradient on this benchmark while matching a nested-AD oracle to machine precision. The implementation adds roughly 70 lines of analytical-Jacobian code over an AD-only baseline. We also identify and fix a removable singularity in the standard SO(3)/SE(3) scalar basis that would otherwise produce NaNs at the origin under seeded AD, and we audit which Lie-group derivative tensors require this stabilized basis. The result is a practical path from rapidly written SE(3) objectives to exact higher-order derivatives, with predictable runtime and no finite-difference tuning.
Chinese Translation
在实践中,新 SE(3) 估计目标的快速原型设计仍然存在困难。现代李群框架——GTSAM、manif、Sophus、SymForce、Ceres——通过不同的代码生成和自动微分策略,针对一阶工作负载进行优化,每种策略旨在优化手工推导几何与通用微分之间的特定界面。尚存的缺口是从这些一阶原语到精确海森矩阵、观测信息矩阵和高阶导数张量的紧凑、自动微分安全路径:这些量是进行精确牛顿步、观测信息方差估计和方差修正所需的。本文提出了一种混合的解析/自动微分配方,用于 SE(3) 的负对数似然。实践者只需编写一次 NLL 梯度,针对标量类型进行通用化,并在点-作用界面 y = Tx 处设置解析/自动微分界面。在该界面之前使用封闭形式的李群雅可比矩阵;自动微分则仅在其之后应用。然后,使用普通浮点标量实例化相同的源代码以获得梯度,使用向量种子的双数在单次正向模式传递中获取精确海森矩阵,并使用嵌套双数获取高阶导数张量。在一个典型的 6 自由度、5 地标的 SE(3) NLL 示例中,推荐的种子海森路径比在该基准上有限差分自动微分梯度快大约 5 倍,同时匹配嵌套自动微分甲板到机器精度。该实现比仅基于自动微分的基线多出大约 70 行解析雅可比代码。我们还识别并修复了标准 SO(3)/SE(3) 标量基底中的一个可去奇点,后者在种子自动微分下会在原点产生 NaN,并审计了哪些李群导数张量需要这个稳定化基底。结果是一个实用的路径,从快速编写的 SE(3) 目标到精确的高阶导数,具有可预测的运行时间且无需有限差分调优。
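The idea of writing a gradient once, generic over the scalar type, and obtaining an exact Hessian from vector-seeded dual numbers can be shown on a toy problem. The sketch below uses a two-variable Euclidean objective rather than an SE(3) negative log-likelihood, and the `Dual` class, `grad`, and `hessian` are hypothetical names invented for the illustration, not the paper's implementation.

```python
class Dual:
    """A value plus a vector of first-order tangents, one per seed direction."""

    def __init__(self, val, tan):
        self.val = val
        self.tan = list(tan)

    def _lift(self, other):
        # Promote plain numbers to Dual with zero tangents.
        if isinstance(other, Dual):
            return other
        return Dual(other, [0.0] * len(self.tan))

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, [s + t for s, t in zip(self.tan, o.tan)])

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, [s - t for s, t in zip(self.tan, o.tan)])

    def __rsub__(self, other):
        return self._lift(other) - self

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule applied to every tangent component.
        return Dual(
            self.val * o.val,
            [self.val * t + s * o.val for s, t in zip(self.tan, o.tan)],
        )

    __rmul__ = __mul__


def grad(x0, x1):
    """Hand-written gradient of f = 0.5*(x0*x1 - 1)**2 + 0.5*(x0 - 2)**2.

    Works unchanged for floats and for Dual inputs.
    """
    r = x0 * x1 - 1.0
    return [r * x1 + (x0 - 2.0), r * x0]


def hessian(grad_fn, x):
    """Seed each input with a unit tangent and read the Hessian rows off the gradient."""
    n = len(x)
    seeded = [Dual(x[i], [1.0 if j == i else 0.0 for j in range(n)]) for i in range(n)]
    return [g.tan for g in grad_fn(*seeded)]


g = grad(1.0, 3.0)             # ordinary floats give the gradient
H = hessian(grad, [1.0, 3.0])  # seeded duals give the exact Hessian in one pass
```

No finite-difference step size appears anywhere, which is the property the abstract highlights. The sketch omits division and transcendental functions, which a real SE(3) objective would need.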
cs.RO / 44 / 2605.02301
SAGA: A Robust Self-Attention and Goal-Aware Anchor-based Planner for Safe UAV Autonomous Navigation
SAGA:一种鲁棒的自注意力和目标感知锚基规划器,用于安全的无人机自主导航
Abstract
Agile unmanned aerial vehicle (UAV) navigation in cluttered environments demands a planning architecture that is both computationally efficient and structurally expressive enough to reason over multiple feasible motions. This paper presents SAGA, a robust self-attention and goal-aware anchor-based planner for safe UAV autonomous navigation. SAGA formulates local planning as a one-stage joint regression-and-ranking problem over a fixed lattice of motion anchors. Given a depth image and a body-frame motion state, the planner predicts refined terminal states and planning scores for all anchors in a single forward pass, after which the best candidate is decoded into a dynamically feasible trajectory. The key idea of SAGA is to transform anchor-aligned features into geometry-aware tokens and perform cross-anchor global reasoning with self-attention. To preserve directional structure in the token space, we further introduce a polar positional encoding derived from anchor yaw and pitch. In addition, a goal-aware modulation module injects velocity, acceleration, and target information into the token representation before final score prediction. Experiments in cluttered pillar-map environments under maximum speed settings of 2.0, 3.0, and 4.0 m/s show that SAGA consistently achieves a 100% success rate, while YOPO drops from 90.91% to 62.50%, Ego-planner from 71.43% to 52.63%, and Fast-planner from 52.63% to 38.46%. Under the 4.0 m/s maximum speed setting, SAGA also improves average safety from 1.9843 m to 2.3888 m and minimum safety from 0.4390 m to 0.7576 m over YOPO, while reducing total flight time from 40.4631 s to 27.4901 s. The comparison with SAGA w/o PPE further shows that explicit polar positional encoding is critical for stable cross-anchor reasoning and safe passage selection in cluttered scenes.
Chinese Translation
灵活的无人机(UAV)在杂乱环境中的导航需要一种既计算高效又能够对多种可行运动进行合理推理的规划架构。本文提出了SAGA,一种用于安全无人机自主导航的鲁棒自注意力和目标感知锚基规划器。SAGA将局部规划公式化为一个在固定运动锚点的单阶段联合回归与排序问题。给定深度图像和机体坐标系下的运动状态,规划器在单次前向传递中预测所有锚点的精细终端状态和规划分数,之后最佳候选解被解码为动态可行的轨迹。SAGA的关键思想是将与锚点对齐的特征转化为几何感知的标记,并通过自注意力执行跨锚点的全局推理。为了在标记空间中保留方向结构,我们进一步引入了源自锚点偏航和俯仰的极坐标位置编码。此外,一种目标感知调制模块在最终分数预测之前将速度、加速度和目标信息注入到标记表示中。在最大速度设置为2.0、3.0和4.0 m/s的杂乱柱状图环境中的实验显示,SAGA始终实现了100%的成功率,而YOPO的成功率则从90.91%下降到62.50%,Ego-planner从71.43%下降到52.63%,Fast-planner从52.63%下降到38.46%。在4.0 m/s最大速度设置下,SAGA还提高了平均安全距离,从1.9843 m提升到2.3888 m,最小安全距离从0.4390 m提升到0.7576 m,同时将总飞行时间从40.4631秒减少到27.4901秒。与未使用极坐标位置编码(SAGA w/o PPE)的比较进一步表明,显式的极坐标位置编码对于稳定的跨锚点推理和在杂乱场景中的安全通行选择至关重要。
cs.RO / 45 / 2605.02306
Natural Gradient Bayesian Filtering: Geometry-Aware Filter for Dynamical Systems
自然梯度贝叶斯滤波:动态系统的几何感知滤波器
Abstract
Bayesian filtering is a cornerstone of state estimation in complex systems such as aerospace systems, yet exact solutions are available only for linear Gaussian models. In practice, nonlinear systems are handled through tractable approximations, with Gaussian filters such as the extended and unscented Kalman filters being among the most widely used methods. This tutorial revisits Gaussian filtering from an information-geometric perspective, viewing the prediction and measurement update steps as inference procedures over state distributions. Within this framework, we introduce a geometry-aware Gaussian filtering approach that leverages natural gradient descent on the statistical manifold of Gaussian distributions. The resulting Natural Gradient Gaussian Approximation (NANO) filter iteratively refines the posterior mean and covariance while respecting the intrinsic geometry of the Gaussian family and preserving the positive definiteness of the covariance matrix. We further highlight fundamental connections to classical Kalman filtering, showing that a single natural-gradient step exactly recovers the Kalman measurement update in the linear-Gaussian case. The practical implications of the proposed framework are illustrated through case studies in representative nonlinear estimation problems, including satellite attitude estimation, simultaneous localization and mapping, and state estimation for robotic systems including quadruped and humanoid robots.
Chinese Translation
贝叶斯滤波是复杂系统(如航空航天系统)状态估计的基石,然而,只有线性高斯模型的精确解是可用的。在实践中,非线性系统通过可处理的近似方法来处理,其中扩展卡尔曼滤波器和无特点滤波器等高斯滤波器是最广泛使用的方法之一。本教程从信息几何的视角重新审视高斯滤波,将预测和测量更新步骤视为对状态分布的推断过程。在该框架下,我们提出了一种几何感知的高斯滤波方法,该方法利用高斯分布的统计流形上的自然梯度下降。所提出的自然梯度高斯近似(Natural Gradient Gaussian Approximation, NANO)滤波器迭代地优化后验均值和协方差,同时尊重高斯族的内在几何并保持协方差矩阵的正定性。我们进一步强调与经典卡尔曼滤波的基本联系,表明在线性高斯情况下,单步自然梯度更新可以精确恢复卡尔曼测量更新。通过在代表性非线性估计问题中的案例研究(包括卫星姿态估计、同时定位与地图构建,以及四足式和人形机器人的状态估计),展示了所提框架的实际应用意义。
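The abstract's claim that one natural-gradient step recovers the Kalman measurement update in the linear-Gaussian case rests on a standard identity: adding the measurement's contribution to the Gaussian's natural parameters (precision and precision-weighted mean) gives the same posterior as the gain form. The scalar sketch below only checks that identity. It does not implement the NANO filter, and reading the additive update as a unit-step natural-gradient update is an assumption stated here, not taken from the abstract. The function names are hypothetical.

```python
def kalman_update(m, P, y, h, R):
    """Classical gain-form update for y = h*x + v, v ~ N(0, R), prior x ~ N(m, P)."""
    S = h * P * h + R          # innovation variance
    K = P * h / S              # Kalman gain
    return m + K * (y - h * m), (1.0 - K * h) * P


def natural_parameter_update(m, P, y, h, R):
    """Same update written as an additive step on the natural parameters."""
    precision = 1.0 / P + h * h / R   # Lambda+ = Lambda + h^2 / R
    eta = m / P + h * y / R           # eta+ = eta + h * y / R
    return eta / precision, 1.0 / precision


m_kf, P_kf = kalman_update(1.0, 2.0, 3.0, 0.5, 0.5)
m_np, P_np = natural_parameter_update(1.0, 2.0, 3.0, 0.5, 0.5)
```

In the additive form the posterior precision is a sum of positive terms, so the posterior variance stays positive whenever P and R are positive. This is a scalar analogue of the positive-definiteness property the abstract mentions.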
cs.RO / 46 / 2605.02347
ShapeGrasp: Simultaneous Visuo-Haptic Shape Completion and Grasping for Improved Robot Manipulation
ShapeGrasp:同时进行视觉-触觉形状补全和抓取以改善机器人操控
Abstract
Humans grasp unfamiliar objects by combining an initial visual estimate with tactile and proprioceptive feedback during interaction. We present ShapeGrasp, a robotic implementation of this approach. The proposed method is an iterative grasp-and-complete pipeline that couples implicit surface visuo-haptic shape completion (creation of full 3D shape from partial information) with physics-based grasp planning. From a single RGB-D view, ShapeGrasp infers a complete shape (point cloud or triangular mesh), generates candidate grasps via rigid-body simulation, and executes the best feasible grasp. Each grasp attempt yields additional geometric constraints -- tactile surface contacts and space occupied by the gripper body -- which are fused to update the object shape. Failures trigger pose re-estimation and regrasping using the refined shape. We evaluate ShapeGrasp in the real world using two different robots and grippers. To the best of our knowledge, this is the first approach that updates shape representations following a real-world grasp. We achieved superior results over baselines for both grippers (grasp success rate of 84% with a three-finger gripper and 91% with a two-finger gripper), while improving the 3D shape reconstruction quality in all evaluation metrics used.
Chinese Translation
人类在抓取不熟悉的物体时,通过在交互过程中结合初始视觉估计与触觉和本体感受反馈来实现抓取。本文提出了ShapeGrasp,这是一种基于这一方法的机器人实现。所提出的方法是一个迭代的抓取与补全管道,将隐式表面视觉-触觉形状补全(从部分信息中生成完整的3D形状)与基于物理的抓取规划相结合。通过单一的RGB-D视图,ShapeGrasp推断出完整的形状(点云或三角网格),通过刚体仿真生成候选抓取,实现最佳可行抓取。每一次抓取尝试都会产生额外的几何约束——触觉表面接触和抓手所占空间——这些约束被融合以更新物体形状。失败会触发姿态重新估计和基于精细化形状的重新抓取。我们在现实世界中使用两种不同的机器人和抓手评估了ShapeGrasp。据我们所知,这是首次在实际抓取后更新形状表示的方法。在两种抓手的评估中,我们的结果优于基准(使用三指抓手的抓取成功率为84%,使用二指抓手的抓取成功率为91%),同时提高了所有评估指标下的3D形状重建质量。
cs.RO / 47 / 2605.02361
Feedback Motion Planning for Stochastic Nonlinear Systems with Signal Temporal Logic Specifications
具有信号时序逻辑规范的随机非线性系统反馈运动规划
Abstract
We study feedback motion planning for continuous-time stochastic nonlinear systems under signal temporal logic (STL) specifications. We propose a framework that synthesizes control policies for chance-constrained STL trajectory optimization problems, with the goal of ensuring that the closed-loop stochastic system satisfies a given STL formula with high probability (e.g., 99.99%). Our approach is based on a predicate erosion strategy that transforms the intractable stochastic problem into a deterministic STL trajectory optimization problem with tightened STL formula constraints. The amount of erosion is determined by a probabilistic reachable tube (PRT) that bounds the deviation between the stochastic trajectory and an associated nominal trajectory. To compute such bounds, we leverage contraction theory and feedback design, and develop several tracking controllers. This yields a complete feedback motion planning pipeline which can be implemented by numerical optimizations. We demonstrate the efficacy and versatility of the proposed framework through simulations on several robotic systems and through experiments on a real-world quadrupedal robot, and show that it is less conservative and achieves higher specification satisfaction probability than representative baselines.
Chinese Translation
我们研究了在信号时序逻辑(STL)规范下,连续时间随机非线性系统的反馈运动规划。我们提出了一个框架,用于合成机会约束的STL轨迹优化问题的控制策略,目的在于确保闭环随机系统以高概率(例如99.99%)满足给定的STL公式。我们的方法基于谓词侵蚀策略,将难以处理的随机问题转化为具有收紧STL公式约束的确定性STL轨迹优化问题。侵蚀的程度由概率可达管(PRT)决定,该管界定了随机轨迹与相应名义轨迹之间的偏差。为计算这些界限,我们利用了收缩理论和反馈设计,开发了几种跟踪控制器。这产生了一个完整的反馈运动规划流程,可以通过数值优化实现。我们通过对多个机器人系统的仿真以及对真实四足机器人的实验,展示了所提框架的有效性和多样性,并表明其比代表性基线方法更少保守,并且实现了更高的规范满足概率。
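Predicate erosion can be illustrated on the simplest possible specification. The sketch below assumes a scalar discrete-time trajectory, the formula "always x >= c", and a given tube radius at each step. The predicate here is 1-Lipschitz in x, so the erosion margin equals the radius; for other predicates the margin would need scaling. The radii are made-up numbers, not derived from contraction theory, and the function names are hypothetical.

```python
def robustness_always_geq(traj, c):
    """Robustness of 'always x >= c': the worst-case margin along the trajectory."""
    return min(x - c for x in traj)


def eroded_robustness_always_geq(nominal, c, radii):
    """Robustness of the tightened formula 'always x >= c + r_t' on the nominal trajectory."""
    return min(x - c - r for x, r in zip(nominal, radii))


nominal = [1.0, 1.2, 1.5, 1.4]
radii = [0.0, 0.1, 0.2, 0.3]   # assumed tube radii, growing over time
c = 0.5

eroded = eroded_robustness_always_geq(nominal, c, radii)

# Worst trajectory that still stays inside the tube around the nominal one.
worst_case = [x - r for x, r in zip(nominal, radii)]
worst_case_robustness = robustness_always_geq(worst_case, c)
```

If the nominal trajectory satisfies the eroded formula, every trajectory inside the tube satisfies the original one. When the tube holds with a given probability, this deterministic check yields the corresponding probabilistic guarantee.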
cs.RO / 48 / 2605.02370
Robust Adaptive Predictive Control for Hook-Based Aerial Transportation Between Moving Platforms
基于钩子的移动平台之间的鲁棒自适应预测控制
Abstract
This paper presents a novel model predictive control (MPC) approach for autonomous pick-and-place between moving platforms with a hook-equipped aerial manipulator. First, for accurate and rapid modeling of the complex dynamics, a digital twin model of the quadcopter equipped with a hook-based gripper, implemented in MuJoCo, is constructed and used as the predictive model for the MPC. To handle uncertainties of the predictive model (e.g. due to aerodynamics and uncertain payloads), a robust adaptive MPC approach is proposed. By systematic integration of zero-order robust optimization (zoRO) based uncertainty propagation and an extended Kalman filter (EKF) for parameter estimation, the MPC algorithm ensures robust constraint satisfaction, high performance, and computational efficiency. The effectiveness of the proposed method is evaluated in complex simulated scenarios and in real-world flight experiments.
Chinese Translation
本文提出了一种新颖的模型预测控制(MPC)方法,用于带钩的空中操纵器在移动平台之间的自主捡取和放置。首先,为了准确快速地建模复杂的动力学,本文构建了一个配备钩式抓取器的四轴飞行器的数字双胞胎模型,该模型在MuJoCo中实现,并作为MPC的预测模型。为了解决预测模型的不确定性(例如,由气动效应和不确定的负载引起的),提出了一种鲁棒自适应MPC方法。通过系统集成基于零阶鲁棒优化(zoRO)的不确定性传播和扩展卡尔曼滤波器(EKF)进行参数估计,该MPC算法确保了鲁棒约束满足、高性能和计算效率。所提方法的有效性在复杂的模拟场景和真实飞行实验中得到了评估。
cs.RO / 49 / 2605.02410
Shared Autonomy Assisted by Impedance-Driven Anisotropic Guidance Field
由阻抗驱动的各向异性引导场辅助的共享自主
Abstract
Shared autonomy (SA) enables robots to infer human intent and assist in its achievement. While most research focuses on improving intent inference, it overlooks whether humans can understand the robot's intent in return. Without such mutual understanding, collaboration becomes less effective, degrading user experience and task performance. To address this gap, previous studies have explicitly conveyed the robot intent through additional interfaces, which remain unintuitive and limited in expressiveness. Inspired by impedance control, we propose Impedance-Driven Anisotropic Guidance Field Enhanced Shared Autonomy (IAGF-SA), a novel paradigm that extends SA with an embodied, physically-grounded communication channel. This channel adaptively modulates the robot's dynamic response to human input, enabling intuitive, continuous, physically-grounded robot intent communication while naturally guiding human actions. User studies across three scenarios and two teleoperation interfaces indicate that IAGF-SA improves task performance, human-robot agreement, and subjective experience, thus demonstrating its effectiveness in enhancing human-robot communication and collaboration.
Chinese Translation
共享自主(SA)使机器人能够推测人类意图并协助其实现。尽管大多数研究集中在提高意图推测上,但未考虑人类是否能够理解机器人的意图。如果没有这种相互理解,协作效果将降低,从而影响用户体验和任务表现。为了解决这一问题,以往的研究通过额外的接口明确传达机器人的意图,但这些接口往往不直观且表达能力有限。受到阻抗控制的启发,我们提出了一种新的范式——阻抗驱动的各向异性引导场增强共享自主(IAGF-SA),该范式通过一个具身且以物理为基础的沟通渠道扩展SA。该渠道能够自适应调整机器人对人类输入的动态响应,实现机器人意图的直观、连续、物理基础的沟通,同时自然而然地引导人类的行为。在三个场景和两个遥操作接口的用户研究表明,IAGF-SA提高了任务表现、人机一致性以及主观体验,展示了其在增强人机沟通与协作方面的有效性。
cs.RO / 50 / 2605.02434
Higher-Order Flexible Configurations of Planar Parallel Manipulators Constructed by Averaging
通过均值法构建的平面并联机器人的高阶灵活配置
Abstract
This paper investigates singular configurations of planar 3-RPR parallel manipulators, which result from applying the averaging technique to solution pairs of their direct kinematic problem. Without computing the zeros of the corresponding degree 6 polynomial we parametrize the input pairs and determine their relative orientation in a way that the flexion order of the averaged configurations increases. Moreover, the obtained results are visualized for concrete examples. The presented methodology can also be used for studying the spherical and spatial analogues of planar 3-RPR parallel manipulators.
Chinese Translation
本文研究了平面3-RPR并联机器人的奇异配置,这些配置是通过将均值技术应用于其直接运动学问题的解对而产生的。我们在不计算相应的6次多项式的零点的情况下,对输入对进行参数化,并以一种使得均值配置的弯曲阶数增加的方式确定它们的相对方向。此外,针对具体实例对获得的结果进行了可视化。所提出的方法论也可用于研究平面3-RPR并联机器人的球形和空间模拟。
cs.RO / 51 / 2605.02487
Visibility-Aware Mobile Grasping in Dynamic Environments
动态环境下的可视性感知移动抓取
Abstract
This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between "seeing" around to reduce environmental uncertainty and "moving" the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures. We provide experimental results across 400 randomized simulation scenarios and real-world deployment on a Fetch mobile manipulator. Results show that our system achieves success rates of 68.8% and 58.0% in unknown static and dynamic environments, respectively, improving on the prior approach by 22.8% and 18.0%, with improved collision safety.
Chinese Translation
本文解决了在动态、未知环境中移动抓取的问题,机器人必须在有限视野下操作。根本挑战在于如何在“观察”周围以减少环境不确定性与“移动”身体以实现任务进展之间进行权衡,这些都受到可视性约束的限制。之前的方法通常假设环境是已知或静态的,并将这些目标解耦,无法保证在操控过程中未观察到的动态障碍物与机器人路径相交时的安全性。在本文中,我们提出了一种统一的移动抓取系统,包含两个核心组件:(1) 一个迭代的低级全身规划器,与基于速度感知的主动感知相结合,以安全地导航动态环境;(2) 一个基于行为树的分层高级规划器,能够自适应生成子目标,引导机器人进行探索和应对运行时故障。我们在400个随机化仿真场景和Fetch移动操纵器的现实部署中提供了实验结果。结果表明,我们的系统在未知静态和动态环境中分别实现了68.8%和58.0%的成功率,相较于之前的方法在未知静态和动态环境中的成功率分别提高了22.8%和18.0%,并且提高了碰撞安全性。
cs.RO / 52 / 2605.02513
Adaptive Gait Generation for Multi-Terrain Exoskeletons via Constrained Kernelized Movement Primitives
基于受限核化运动原理的多地形外骨骼自适应步态生成
Abstract
Lower limb exoskeletons (LLEs) have the potential to help motor-impaired individuals walk again, but their application in real-world environments is still limited by the lack of effective adaptive gait planning. Indeed, current exoskeletons are designed to walk only on flat, even terrain. Generating environment-aware, physiologically consistent gait trajectories in real time is an open challenge. To overcome this, we propose a novel Kernelized Movement Primitives (KMP)-based framework for adaptive gait generation (AGG) across multiple indoor terrains. The proposed approach learns a probabilistic representation of human gait in both the joint and task spaces from a limited number of human demonstrations, representing natural gait characteristics and ensuring kinematic feasibility. In addition, the learned trajectories are adapted using environmental information extracted from an onboard RGB-D camera by treating the AGG as a linearly constrained optimization problem with via-points. The proposed method has been validated first in simulation for gait generation in different scenarios, such as flat-ground walking, slopes, stairs, and obstacle crossing. Finally, the effectiveness and robustness of the method have been demonstrated with experiments on a commercial LLE in real-world scenarios. The results demonstrate the feasibility of an environment-aware gait planning system for a new generation of intelligent lower limb exoskeletons that assist people with disabilities in their everyday life.
Chinese Translation
下肢外骨骼(LLEs)具有使运动障碍个体重新行走的潜力。然而,其在现实环境中的应用仍受到缺乏有效自适应步态规划的限制。当前的外骨骼通常只能在平坦且均匀的地面上行走。在实时生成环境感知的、生理上符合的步态轨迹方面仍面临挑战。为了解决这一问题,我们提出了一种基于核化运动原理(KMP)的自适应步态生成(AGG)框架,适用于多种室内地形。所提方法从有限的人体示范中学习人类步态在关节空间和任务空间中的概率表示,体现自然步态特征并确保运动学可行性。此外,利用从机载RGB-D相机中提取的环境信息,我们将AGG视为一个带有通过点的线性约束优化问题,从而对学习到的轨迹进行适应。所提出的方法首先在不同场景(如平地行走、坡道、楼梯和障碍物穿越)的步态生成中经过模拟验证。最后,该方法在现实场景中通过商用下肢外骨骼进行实验,证明了其有效性和鲁棒性。结果表明,环境感知步态规划系统对于新一代智能下肢外骨骼在帮助残疾人日常生活中的可行性。
cs.RO / 53 / 2605.02525
A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory
一种用于VLM集成室内移动机器人语义自主框架:混合确定性推理与跨机器人自适应记忆
Abstract
Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware. A seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without invoking a language model, camera, or GPU; only genuinely ambiguous instructions escalate to VLM reasoning. A five-category semantic memory framework with explicit scope taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) enables cross-session learning and cross-robot knowledge transfer: preferences learned through VLM interactions on one robot are promoted to deterministic resolution and transferred to a second robot via a shared compiled digest, achieving a measured latency reduction of 103,000-fold. Experimental validation on two custom-built differential-drive robots across 82 scenario-level decisions and three sessions demonstrates 100% semantic transfer accuracy (33/33, 95% CI [0.894, 1.000]), 100% semantic resolution accuracy, and concurrent multi-robot operation feasibility - all on Raspberry Pi 5 platforms with no onboard GPU, requiring zero training data.
Chinese Translation
自主室内移动机器人可以利用诸如ROS 2 Navigation 2等已建立的框架可靠地导航到度量坐标,但它们缺乏理解表达意图而非位置的自然语言指令的能力。视觉-语言模型(Vision-Language Models)提供了填补这一空白所需的语义推理能力,但其推理延迟(在消费级硬件上每个决策需2-9秒)和会话间记忆缺失限制了其实际应用。本文提出了语义自主栈(Semantic Autonomy Stack),这是一个六层参考框架,用于语义自主的室内导航,并在配备现成边缘硬件的物理机器人上验证了包含混合确定性-VLM推理和跨机器人自适应记忆的完整实例。一个七步参数求解器在不调用语言模型、相机或GPU的情况下,处理88%的指令,耗时少于0.1毫秒;仅当指令确实模糊时,才会升级到VLM推理。具有明确范围分类(全球环境知识、每个操作员的偏好、每个机器人的能力)的五类语义记忆框架支持跨会话学习和跨机器人知识转移:通过VLM交互学习的偏好可以提升为确定性解析,并通过共享的编译摘要转移到第二台机器人上,实现了103,000倍的测量延迟减少。在两台定制的差动驱动机器人上进行的实验验证涵盖了82个场景级决策和三个会话,展示了100%的语义传输准确率(33/33,95%置信区间[0.894, 1.000])、100%的语义解析准确率,以及并行多机器人操作的可行性——所有这些都基于没有车载GPU的树莓派5平台,且不需要任何训练数据。
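A quick arithmetic check on the interval reported in the abstract above (33/33, 95% CI [0.894, 1.000]): when every trial succeeds, an exact two-sided Clopper-Pearson interval has a closed-form lower bound, (alpha/2)^(1/n). The sketch below assumes this is the interval type used, which the abstract does not state; the function name is illustrative only.

```python
def clopper_pearson_all_successes(n, confidence=0.95):
    """Exact two-sided binomial interval for the case of n successes in n trials.

    The upper bound is 1. The lower bound p solves p**n = alpha / 2,
    so p = (alpha / 2) ** (1 / n).
    """
    alpha = 1.0 - confidence
    lower = (alpha / 2.0) ** (1.0 / n)
    return lower, 1.0


lower, upper = clopper_pearson_all_successes(33)
print(f"[{lower:.3f}, {upper:.3f}]")
```

The lower bound evaluates to roughly 0.894, which is consistent with the reported interval under this assumption.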
cs.RO / 54 / 2605.02528
Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators
超越专业化:通过过程地图生成器实现稳健的强化学习导航
Abstract
Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation. We cross-evaluate five navigation policies on 1000 seeded maps per generator across three training seeds. Results show a strongly asymmetric cross-generator transfer: a specialist trained on sparse layouts falls to 3.3% success on mazes, whereas a policy trained on the combined generator set achieves 91.5 +/- 1.1% mean success. We further demonstrate that A* path-planner subgoal inputs are the dominant factor for robustness, raising success from the 90.2 +/- 1.4% feedforward baseline to 98.9 +/- 0.4% and outperforming GRU recurrence, which only improves the reactive baseline. The DRL policies outperform a classical Carrot+A* controller, which matches their success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s. This highlights learned speed adaptation as the decisive advantage of the learned approach. Real-world experiments on a RoboMaster confirm sim-to-real transfer in a cluttered arena, while a maze-like layout exposes remaining failure modes that recurrence helps mitigate.
Chinese Translation
深度强化学习(DRL)导航策略往往过拟合于其训练环境的结构,因为环境多样性通常受到手动设计多样化场景所需的努力的限制。尽管过程地图生成提供了可扩展的多样性,但没有先前的研究系统地比较不同生成器类型如何影响策略的泛化能力。我们将四种保证可导航性的生成器(稀疏图、迷宫、图和波函数崩溃(Wave Function Collapse))集成到MuRoSim中,这是一个专注于基于激光雷达(LiDAR)导航的2D模拟器,旨在提升训练效率。我们在每个生成器的1000个种子地图上交叉评估了五种导航策略,基于三个训练种子。结果显示出强烈的不对称跨生成器迁移:在稀疏布局上训练的专家策略在迷宫中的成功率下降至3.3%,而在综合生成器集上训练的策略则达到了91.5 +/- 1.1%的平均成功率。我们进一步证明,A*路径规划器的子目标输入是提高稳健性的主要因素,将成功率从90.2 +/- 1.4%的前馈基线提升至98.9 +/- 0.4%,并超越了仅能改善反应基线的GRU递归。DRL策略的表现优于经典的胡萝卜+A*控制器,该控制器仅在低速(1.0 m/s)时与其成功率持平,但在2.0 m/s时则降至24.9%。这突显了学习的速度适应性作为学习方法的决定性优势。针对RoboMaster的现实世界实验验证了在复杂环境中的仿真到现实转移,而迷宫样的布局揭示了剩余的失败模式,递归有助于缓解这些问题。
cs.RO / 55 / 2605.02529
Sim-to-Real Transfer and Robustness Evaluation of Reinforcement Learning Control with Integrated Perception on an ASV for Floating Waste Capture
浮动废物捕捉的自治水面船舶强化学习控制的仿真到现实转移与稳健性评估
Abstract
Autonomous surface vessels for floating-waste removal operate under varying hydrodynamics, external disturbances, and challenging water-surface perception. We present a field-validated system that combines camera-based polarimetric perception with a lightweight DRL-based controller for floating-waste detection and capture. Camera detections are converted into water-surface target points and tracked by a controller trained entirely in simulation and deployed directly on a retrofitted ASV platform. Our main contribution is a sim-to-real testing methodology that combines a two-stage simulation protocol with a perception abstraction module designed to mimic real camera behavior, enabling reproducible field trials and explicit evaluation of the sim-to-real gap. We apply this framework in matched simulation and field experiments across 14 disturbance regimes to expose failure modes and evaluate robustness. The results show centimeter-level terminal accuracy and indicate robust control performance under the evaluated perturbation regimes. The main source of degradation is insufficient actuation-model fidelity. We also demonstrate the system in a search-and-capture application using real camera detections in real-world conditions over areas of up to $450~m^2$. The study distills practical lessons for reliable transfer, including improved actuation-model fidelity, targeted domain randomization, and careful management of latency and timestamps across modules, while highlighting remaining challenges.
Chinese Translation
用于浮动废物清除的自主水面船舶在变化的水动力学、外部干扰和具有挑战性水面感知的条件下操作。我们提出了一种经过现场验证的系统,该系统将基于摄像头的偏振感知与轻量级深度强化学习(DRL)控制器相结合,用于浮动废物的检测和捕捉。摄像头探测结果被转换为水面目标点,并由完全在模拟中训练的控制器进行跟踪,随后直接部署在改装后的自主水面船舶平台上。我们的主要贡献是一个仿真到现实的测试方法论,它结合了双阶段模拟协议和设计用于模仿真实摄像头行为的感知抽象模块,从而实现可重复的现场试验和对仿真到现实差距的明确评估。我们在14种干扰模式下进行匹配模拟和现场实验,以揭示故障模式并评估稳健性。结果显示,终端精度在厘米级别,且在评估的扰动模式下控制性能稳健。主要的性能降低源于驱动模型的保真度不足。我们还展示了该系统在实际条件下使用真实摄像头检测进行搜索和捕捉应用,覆盖面积高达450平方米。本研究提炼出可靠转移的实用经验教训,包括提高驱动模型的保真度、针对性的领域随机化,以及跨模块的延迟和时间戳的仔细管理,同时强调了剩余挑战。
cs.RO / 56 / 2605.02537
Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation
通过区域图范式协调空间语义以生成复杂室内场景
Abstract
Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration. By internalizing a novel zone-based logic, ZoneMaestro translates high-level semantic intent into functional zones and topological constraints, enabling robust adaptation to diverse architectural forms. To support this, we construct Zone-Scene-10K, a large-scale dataset enriched with explicit Zone-Graph annotations. We further introduce an Alternating Alignment Strategy that cycles between reasoning internalization and Zone-Aware Group Relative Policy Optimization (Z-GRPO), effectively reconciling the tension between semantic richness and geometric validity without relying on external physics engines. To rigorously evaluate spatial intelligence beyond convex primitives, we formally define the task of Intricate Spatial Orchestration and release SCALE, a stress-test benchmark for irregular indoor scenarios with complex, dense spatial relations. Extensive experiments demonstrate that ZoneMaestro resolves the density-safety dichotomy, significantly outperforming state-of-the-art baselines in both structural coherence and intent adherence.
Chinese Translation
自主3D室内场景合成在具有紧密耦合空间约束的非凸房间中遇到了困难。基于数据的生成器缺乏用于长远规划的拓扑先验,而迭代代理则会导致语义碎片化,变得在几何上脆弱。我们提出了ZoneMaestro,这是一种统一框架,转变了从以物体为中心的合成到区域图协同的范式。通过内化一种新颖的基于区域的逻辑,ZoneMaestro将高层次的语义意图转换为功能区域和拓扑约束,从而能够稳健地适应多样化的建筑形式。为支持这一点,我们构建了Zone-Scene-10K,这是一个大型数据集,包含明确的区域图注释。我们进一步介绍了一种交替对齐策略,该策略在推理内化与区域感知群体相对策略优化(Z-GRPO)之间循环,有效地调和了语义丰富性和几何有效性之间的紧张关系,而无需依赖外部物理引擎。为严格评估超越凸原始体的空间智能,我们正式定义了复杂空间协同任务,并发布了SCALE,这是一个针对不规则室内场景的压力测试基准,涉及复杂的密集空间关系。广泛的实验表明,ZoneMaestro解决了密度与安全之间的二分法,在结构一致性和意图遵循方面显著超越了现有的最先进基线。
cs.RO / 57 / 2605.02600
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
CoRAL:基于接触丰富的自适应大型语言模型的机器人操控控制
Abstract
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL in both simulation and on real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines, boosting success rates by over 50% on average in unseen contact-rich scenarios and effectively handling sim-to-real gaps through its adaptive physical understanding.
Chinese Translation
尽管大型语言模型(LLMs)和视觉语言模型(VLMs)在高级推理和语义理解方面展现出卓越能力,但由于缺乏明确的物理基础以及无法执行自适应控制,直接将其应用于接触丰富的操控仍然是一个挑战。为了弥补这一差距,我们提出了CoRAL(基于接触丰富的自适应LLM控制),这是一个模块化框架,通过将高级推理与低级控制解耦,实现零-shot 规划。与黑箱策略不同,CoRAL并不将LLMs作为直接控制器,而是作为成本设计者,为基于采样的运动规划器(MPPI)合成上下文感知的目标函数。为了解决视觉数据中物理参数的模糊性,我们引入了一个神经符号适应循环:VLM提供环境动态的语义先验,例如质量和摩擦估计,这些先验随后通过在线系统识别实时显式地完善,而LLM则根据交互反馈迭代调节成本函数结构,以纠正策略错误。此外,一个基于检索的记忆单元使系统能够在周期性任务中重用成功策略。这种分层架构通过将高级语义推理与反应执行解耦,从而确保实时控制的稳定性,有效弥合了慢速LLM推理和动态接触要求之间的差距。我们在模拟和现实硬件上验证了CoRAL,并在诸如利用外部接触将物体翻转到墙面等具有挑战性和新颖性的任务中进行测试。实验表明,CoRAL在未见的接触丰富场景中,成功率平均提升超过50%,显著优于最先进的视觉语言代理(VLA)和基础模型规划基线,有效处理了模拟到现实之间的差距,通过其自适应的物理理解。
cs.RO / 58 / 2605.02667
AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
AnchorD:使用因子图的单目深度度量基础
Abstract
Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization. Our method performs a patch-wise affine alignment, locally grounding monocular predictions in metric real-world depth while preserving fine-grained geometric structure and discontinuities. To facilitate evaluation in challenging real-world conditions, we introduce a benchmark dataset with dense scene-wide ground truth depth in the presence of non-Lambertian objects. Ground truth is obtained via matte reflection spray and multi-camera fusion, overcoming the reliance on object-only CAD-based annotations used in prior datasets. Extensive evaluations across diverse sensors and domains demonstrate consistent improvements in depth performance without any (re-)training. We make our implementation publicly available at https://anchord.cs.uni-freiburg.de.
Chinese Translation
密集且准确的深度估计对于机器人操控、抓取和导航至关重要,但目前可用的深度传感器在透明、镜面和一般非朗伯表面上容易出现错误。为了减轻这些错误,大规模单目深度估计方法提供了强大的结构先验,但其预测可能在度量单位上产生偏差或失真,限制了其在机器人中的直接应用。因此,在本研究中,我们提出了一种无训练深度基础框架,通过因子图优化将单目深度估计的先验从深度基础模型锚定到原始传感器深度。我们的方法执行补丁级的仿射对齐,在保持细粒度几何结构和不连续性的同时,将单目预测在度量实际深度中进行局部基础对齐。为了在具有挑战性的现实条件下促进评估,我们引入了一个基准数据集,其中包含非朗伯物体存在下的场景内密集地面真实深度。地面真实深度通过哑光反射喷雾和多摄像头融合来获得,克服了依赖于以对象为基础的CAD注释这一在先前数据集中常见的问题。对不同传感器和领域的广泛评估表明,在没有任何(重新)训练的情况下,深度性能取得了一致改进。我们的实现已公开可用,网址为 https://anchord.cs.uni-freiburg.de.
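The patch-wise affine alignment described in the AnchorD abstract can be illustrated with a much simpler stand-in: an independent least-squares fit of a scale and a shift per patch. This is a minimal sketch, not the paper's factor-graph optimization; the function names, the 1D patch layout, and the use of `None` for invalid sensor readings are all assumptions made for illustration.

```python
def fit_affine(mono, sensor):
    """Least-squares scale a and shift b such that a * mono + b ~ sensor.

    Sensor entries equal to None are treated as invalid readings
    (e.g. dropouts on transparent surfaces) and are skipped.
    """
    pairs = [(m, s) for m, s in zip(mono, sensor) if s is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid sensor readings")
    mean_m = sum(m for m, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    var_m = sum((m - mean_m) ** 2 for m, _ in pairs)
    if var_m == 0.0:
        raise ValueError("monocular depth is constant; scale is undetermined")
    cov = sum((m - mean_m) * (s - mean_s) for m, s in pairs)
    a = cov / var_m
    b = mean_s - a * mean_m
    return a, b


def ground_patch(mono, sensor):
    """Map every monocular value in the patch into metric units."""
    a, b = fit_affine(mono, sensor)
    return [a * m + b for m in mono]


# Hypothetical patch: relative monocular depth, and metric sensor depth
# (metres) with one dropout. The valid pairs follow sensor = 2 * mono + 0.5.
mono_patch = [0.10, 0.20, 0.30, 0.40]
sensor_patch = [0.70, 0.90, None, 1.30]
metric_patch = ground_patch(mono_patch, sensor_patch)
```

In this toy case the dropout pixel is filled from the monocular prior after alignment. A per-patch fit like this ignores consistency between neighbouring patches, which is the kind of coupling a factor graph can express.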
cs.RO / 59 / 2605.02699
Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions
从少量交互中学习等变神经增强物体动力学
Abstract
Learning data-efficient object dynamics models for robotic manipulation remains challenging, especially for deformable objects. A popular approach is to model objects as sets of 3D particles and learn their motion using graph neural networks. In practice, this is not enough to maintain physical feasibility over long horizons and may require large amounts of interaction data to learn. We introduce PIEGraph, a novel approach to combining analytical physics and data-driven models to capture object dynamics for both rigid and deformable bodies using limited real-world interaction data. PIEGraph consists of two components: (1) a Physically Informed particle-based analytical model (implemented as a spring-mass system) to enforce physically feasible motion, and (2) an Equivariant Graph Neural Network with a novel action representation that exploits symmetries in particle interactions to guide the analytical model. We evaluate PIEGraph in simulation and on robot hardware for reorientation and repositioning tasks with ropes, cloth, stuffed animals and rigid objects. We show that our method enables accurate dynamics prediction and reliable downstream robotic manipulation planning, outperforming state-of-the-art baselines.
Chinese Translation
学习高效的数据驱动物体动力学模型以应用于机器人操控仍然是一项挑战,尤其是对于可变形物体。一个常用的方法是将物体建模为一组3D粒子,并通过图神经网络学习它们的运动。然而,在实际应用中,这种方法往往不足以在长时间范围内保持物理可行性,并可能需要大量的交互数据来进行学习。我们提出了PIEGraph,一种新颖的方法,通过结合解析物理学与数据驱动模型,利用有限的现实世界交互数据捕捉刚性和可变形物体的动力学。PIEGraph由两个部分组成: (1) 一个基于粒子的物理信息模型(实现为弹簧-质量系统),用于强制确保物理可行的运动; (2) 一个等变图神经网络,具有新颖的动作表示,利用粒子交互中的对称性来引导解析模型。我们在仿真和机器人硬件上评估了PIEGraph,以进行绳索、布料、玩具和刚性物体的重新定位和重新定向任务。我们的实验表明,该方法实现了准确的动力学预测和可靠的下游机器人操控规划,超越了当前的最先进基线。
cs.RO / 60 / 2605.02708
Temporally Consistent Object 6D Pose Estimation for Robot Control
用于机器人控制的时序一致性物体六维姿态估计
Abstract
Single-view RGB object pose estimators have reached a level of precision and efficiency that makes them good candidates for vision-based robot control. However, off-the-shelf methods lack the temporal consistency and robustness required for stable feedback control. In this work, we develop a factor graph approach to enforce temporal consistency of the object pose estimates. In particular, the proposed approach: (i) incorporates object motion models, (ii) explicitly estimates the object pose measurement uncertainty, and (iii) integrates the above two components in an online optimization-based estimator. We demonstrate that with appropriate outlier rejection and smoothing using the proposed factor graph approach, we can significantly improve the results on standardized pose estimation benchmarks. We experimentally validate the stability of the proposed approach for a feedback-based robot control task in which the object is tracked by a camera attached to a torque-controlled manipulator.
Chinese Translation
单视图 RGB 物体姿态估计器已达到一种精度和效率水平,使其成为基于视觉的机器人控制的良好候选者。然而,现成的方法缺乏时序一致性和鲁棒性,这是稳定反馈控制所必需的。在本研究中,我们开发了一种因子图方法来强制物体姿态估计的时序一致性。具体而言,所提出的方法:(i) 融入物体运动模型,(ii) 明确估计物体姿态测量的不确定性,以及 (iii) 将上述两个组件集成到在线优化的估计器中。我们证明,通过适当的异常值拒绝和使用提出的因子图方法进行平滑处理,我们可以显著提高标准化姿态估计基准的结果。我们在一个基于反馈的机器人控制任务中实验验证了所提出方法的稳定性,该任务中物体被附加在扭矩控制的机械手上的相机跟踪。
cs.RO / 61 / 2605.02710
Tensegrity crutches with compliance from a pre-stressed self-tensile module improve ground reaction force profiles, speed, effort, comfort, and perceived stability
具有顺应性的张拉结构拐杖通过预应力自张力模块改善地面反作用力特征、速度、努力程度、舒适度和感知稳定性
Abstract
Purpose: Six million people use crutches as mobility aids in the US. Rigid designs with no axial mobility limit sensory feedback and lead to secondary injury of the upper-limb joints. Spring-loaded designs offer compliance but may compromise stability. We designed a biologically inspired tensegrity crutch with a compliant module aiming to achieve favorable mechanical properties. The terminal module was a pre-stressed self-tensile two-cell tensegrity structure. We compared the tensegrity crutch to commercial rigid and spring-loaded crutches in mechanical tests using axial loading, in overground straight and turning walking, and in participant experience. Methods: In human trials, healthy young adults (N=18) with no recent lower-body injury performed straight walking and turning trials at a comfortable self-selected pace. A knee blocker simulated unilateral injury of the dominant leg. After using each type of crutch, participants reported their perceived levels of effort, comfort, pain, stability, and usability. Results: Compared to the rigid design, both spring-loaded and tensegrity conditions reduced peak loading rates. The tensegrity design improved effort, comfort, pain, and usability. Spring-loaded crutches reduced perceived stability and walking speed. Conclusion: The biologically inspired tensegrity crutches were an overall improvement over existing designs. Simulations and mechanical testing suggest that nonlinear stiffness, ground-following, and force feedback are among the beneficial mechanical properties that underlie this improvement.
Chinese Translation
目的:在美国,有六百万人使用拐杖作为移动辅助工具。无轴向移动的刚性设计限制了感官反馈并导致上肢关节的二次损伤。弹簧加载设计提供了顺应性,但可能影响稳定性。我们设计了一种受生物启发的张拉结构拐杖,具有一个顺应性模块,旨在实现良好的机械特性。终端模块为一个预应力自张力的双腔张拉结构。我们在机械测试中将张拉结构拐杖与商业刚性拐杖和弹簧加载拐杖进行了比较,测试内容包括轴向加载、平地直走和转弯行走,以及参与者体验。方法:在人体试验中,健康年轻成年人(N=18)在没有近期下肢伤害的情况下以舒适的自选速度进行直走和转弯试验。膝盖阻断器模拟了主导腿的单侧损伤。在使用每种类型的拐杖后,参与者报告他们的努力程度、舒适度、疼痛、稳定性和可用性的感知水平。结果:与刚性设计相比,弹簧加载和张拉结构条件均降低了峰值加载率。张拉结构设计改善了努力程度、舒适度、疼痛和可用性。弹簧加载拐杖降低了感知稳定性和行走速度。结论:这种受生物启发的张拉结构拐杖在总体上优于现有设计。模拟和机械测试表明,非线性刚度、地面跟随和力反馈是支撑这种改进的有益机械特性。
cs.RO / 62 / 2605.02716
Parking Assistance for Trailer-Truck Transport Vehicles Using Sensor Fusion and Motion Planning
基于传感器融合与运动规划的拖车卡车运输车辆停车辅助
Abstract
Autonomous driving technology has rapidly evolved over the past decade, offering significant improvements in transportation efficiency, safety, and cost reduction. While much of the progress has focused on highway driving and obstacle avoidance, low-speed maneuvers such as parking remain among the most difficult challenges for autonomous systems. This challenge is especially pronounced in trailer-truck transport vehicles due to their articulated motion and environmental constraints. This paper proposes a framework for autonomous truck parking that integrates perception, motion planning, control systems, and infrastructure awareness. By combining sensor fusion, Hybrid A* path planning, nonlinear model predictive control (NMPC), and data-driven parking systems, this work highlights the importance of system-level coordination for reliable and scalable autonomous parking solutions. As a proof-of-concept implementation, we adapted an open-source A* path planning simulation to incorporate a tractor-trailer kinematic model, demonstrating articulated vehicle path planning within a command-line simulation environment, with jackknife prevention identified as an area requiring further development.
Chinese Translation
自主驾驶技术在过去十年中迅速发展,在运输效率、安全性和成本降低方面取得了显著改善。尽管许多进展集中在高速公路驾驶和避障上,但低速机动,例如停车,仍然是自主系统面临的最难挑战之一。这一挑战在拖车卡车运输车辆中尤为明显,由于其关节运动和环境约束。本文提出了一种集成感知、运动规划、控制系统和基础设施感知的自主卡车停车框架。通过结合传感器融合、混合 A* 路径规划、非线性模型预测控制(NMPC)和数据驱动的停车系统,本文强调了系统级协调在可靠和可扩展的自主停车解决方案中的重要性。作为概念验证实施,我们调整了一种开源 A* 路径规划仿真,纳入了拖头-拖车运动学模型,展示了在命令行仿真环境中进行关节车辆路径规划,同时识别了防止折刀(jackknife)现象作为一个需要进一步发展的领域。
cs.RO / 63 / 2605.02739
Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference
潜在桥梁:高效双系统视觉-语言-动作模型推理的特征增量预测
Abstract
Dual-system Vision-Language-Action (VLA) models achieve state-of-the-art robotic manipulation but are bottlenecked by the VLM backbone, which must execute at every control step while producing temporally redundant features. We propose Latent Bridge, a lightweight model that predicts VLM output deltas between timesteps, enabling the action head to operate on predicted outputs while the expensive VLM backbone is called only periodically. We instantiate Latent Bridge on two architecturally distinct VLAs: GR00T-N1.6 (feature-space bridge) and {\pi}0.5 (KV-cache bridge), demonstrating that the approach generalizes across VLA designs. Our task-agnostic DAgger training pipeline transfers across benchmarks without modification. Across four LIBERO suites, 24 RoboCasa kitchen tasks, and the ALOHA sim transfer-cube task, Latent Bridge achieves 95-100% performance retention while reducing VLM calls by 50-75%, yielding 1.65-1.73x net per-episode speedup.
Chinese Translation
双系统视觉-语言-动作(VLA)模型在机器人操控方面达到了最先进的水平,但受限于VLM主干这一瓶颈:VLM主干必须在每个控制步骤执行,产生的特征却在时序上存在冗余。我们提出了潜在桥梁(Latent Bridge),这是一种轻量级模型,用于预测时间步之间的VLM输出增量,使得动作头能够基于预测输出进行操作,同时昂贵的VLM主干仅需周期性调用。我们在两个结构上具有显著差异的VLA模型上实现了潜在桥梁:GR00T-N1.6(特征空间桥梁)和{\pi}0.5(KV缓存桥梁),并证明该方法在不同的VLA设计中具有良好的泛化能力。我们的任务无关的DAgger训练管道能够在无需修改的情况下跨基准进行迁移。在四个LIBERO套件、24个RoboCasa厨房任务和ALOHA模拟传输立方体任务中,潜在桥梁实现了95-100%的性能保持,同时将VLM调用减少了50-75%,实现了每个回合1.65-1.73倍的净速度提升。
cs.RO / 64 / 2605.02759
DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation
DynoSLAM:基于生成图神经网络的动态SLAM用于真实世界中的社交导航
Abstract
Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model. By utilizing Monte Carlo rollouts from a trained GNN, we capture the multimodal epistemic uncertainty of human interactions and embed it into the SLAM graph via a dynamic Mahalanobis distance factor. We demonstrate through extensive simulated experiments that this stochastic formulation not only maintains highly accurate retrospective tracking but also prevents the optimization failures caused by the deterministic "argmax problem". Ultimately, extracting the empirical mean and covariance matrices of future pedestrian states provides a mathematically rigorous, probabilistic safety envelope for downstream local planners, enabling anticipatory and collision-free robot navigation in densely crowded environments.
Chinese Translation
传统的同时定位与地图构建(SLAM)算法高度依赖于静态环境的假设,这严重限制了它们在移动实体(如行人)密集的真实世界空间中的适用性。在本研究中,我们提出了DynoSLAM,这是一种紧密耦合的动态GraphSLAM架构,将具有社交意识的图神经网络(GNN)直接集成到因子图优化中。与使用刚性恒速启发式或确定性单代理神经先验的传统方法不同,我们的框架将行人运动预测公式化为随机世界模型。通过利用经过训练的GNN的蒙特卡洛展开,我们捕捉到人类互动的多模态认识不确定性,并通过动态马哈拉诺比斯距离因子将其嵌入SLAM图中。我们通过广泛的模拟实验表明,这一随机公式不仅维持了高度精确的回溯跟踪,还有效防止了由确定性“最大值问题”引发的优化失败。最终,提取未来行人状态的经验均值和协方差矩阵为下游局部规划器提供了数学上严谨的概率安全边界,从而实现了密集人群环境中机器人的预测性和无碰撞导航。
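The last step of the DynoSLAM abstract (extracting an empirical mean and covariance from sampled future pedestrian states, then using a Mahalanobis distance) can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's factor: the rollouts here come from a seeded Gaussian sampler standing in for the trained GNN, states are 2D positions, and all function and variable names are hypothetical.

```python
import math
import random


def mean_and_cov(samples):
    """Empirical mean and unbiased 2x2 covariance of a list of (x, y) samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2D point, using the closed-form 2x2 inverse."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance is not positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(d2, 0.0))


# Stand-in for Monte Carlo rollouts of one pedestrian's position at a
# future timestep (metres); a real system would sample these from the model.
random.seed(0)
rollouts = [(2.0 + random.gauss(0.0, 0.3), 1.0 + random.gauss(0.0, 0.1))
            for _ in range(200)]
mean, cov = mean_and_cov(rollouts)

# A planner could treat small distances as "inside the predicted envelope".
candidate_waypoint = (2.5, 1.0)
print(mahalanobis(candidate_waypoint, mean, cov))
```

A single mean and covariance summarizes the samples as one Gaussian, so strongly multimodal rollouts would need a richer summary than this sketch provides.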
cs.RO / 65 / 2605.02809
LiDAR Teach, Radar Repeat: Robust Cross-Modal Navigation in Degenerate and Varying Environments
激光雷达教学,雷达重复:在退化和变化环境中稳健的跨模态导航
Abstract
Long-term autonomy requires robust navigation in environments subject to dynamic and static changes, as well as adverse weather conditions. Teach-and-Repeat (T\&R) navigation offers a reliable and cost-effective solution by avoiding the need for consistent global mapping; however, existing T\&R systems lack a systematic solution to tackle various environmental variations such as weather degradation, ephemeral dynamics, and structural changes. This work proposes LTR$^2$, the first cross-modal, cross-platform LiDAR-Teach-and-Radar-Repeat system that systematically addresses these challenges. LTR$^2$ leverages LiDAR during the teaching phase to capture precise structural information under normal conditions and utilizes 4D millimeter-wave radar during the repeating phase for robust operation under environmental degradations. To align sparse and noisy forward-looking 4D radar with dense and accurate omnidirectional 3D LiDAR data, we introduce a Cross-Modal Registration (CMR) network that jointly exploits Doppler-based motion priors and the physical laws governing LiDAR intensity and radar power density. Furthermore, we propose an adaptive fine-tuning strategy that incrementally updates the CMR network based on localization errors, enabling long-term adaptability to static environmental changes without ground-truth labels. We demonstrate that the proposed CMR network achieves state-of-the-art cross-modal registration performance on the open-access dataset. Then we validate LTR$^2$ across three robot platforms over a large-scale, long-term deployment (40+ km over 6 months), including challenging conditions such as nighttime smoke. Experimental results and ablation studies demonstrate centimeter-level accuracy and strong robustness against diverse environmental disturbances, significantly outperforming existing approaches.
Chinese Translation
长期自主性要求在受动态和静态变化以及恶劣天气条件影响的环境中进行稳健导航。教学与重复(Teach-and-Repeat, TR)导航提供了一种可靠且具有成本效益的解决方案,避免了对一致的全球地图的需求;然而,现有的 TR 系统缺乏系统性解决方案来应对诸如天气恶化、短暂动态和结构变化等多种环境变动。本研究提出 LTR$^2$,第一个跨模态、跨平台的激光雷达教学与雷达重复系统,系统性地解决这些挑战。LTR$^2$ 在教学阶段利用激光雷达在正常条件下捕获精准的结构信息,并在重复阶段利用 4D 毫米波雷达确保在环境恶化情况下的稳健操作。为了将稀疏且嘈杂的前视 4D 雷达数据与稠密且精确的全向 3D 激光雷达数据对齐,我们引入了一个跨模态注册(Cross-Modal Registration, CMR)网络,该网络联合利用基于多普勒的运动先验和支配激光雷达强度及雷达功率密度的物理法则。此外,我们提出了一种自适应微调策略,基于定位错误逐步更新 CMR 网络,使其能够在没有真实标签的情况下对静态环境的变化进行长期适应。我们展示了所提出的 CMR 网络在开放获取的数据集上实现了最先进的跨模态注册性能。随后,我们在三个机器人平台上验证了 LTR$^2$ 在大规模长期部署中的表现(超过 40 公里,历时 6 个月),包括夜间烟雾等挑战条件。实验结果和消融研究表明,LTR$^2$ 达到了厘米级的精确度,并在面对多样化环境干扰时展现出极强的稳健性,显著优于现有方法。
cs.RO / 66 / 2605.02862
Semantic Risk-Aware Heuristic Planning for Robotic Navigation in Dynamic Environments: An LLM-Inspired Approach
基于语义风险意识的启发式规划在动态环境中的机器人导航:一种受大型语言模型启发的方法
Abstract
The integration of Large Language Model (LLM) reasoning principles into classical robot path planning represents a rapidly emerging research direction. In this paper, we propose a Semantic Risk-Aware Heuristic (SRAH) planner that encodes LLM-inspired cost functions, which penalise geometrically cluttered or high-risk zones, into an A$^*$ search framework, augmented with closed-loop replanning upon dynamic obstacle detection. We evaluate SRAH against two established baselines, Breadth-First Search (BFS) with replanning and a Greedy heuristic without replanning, across 200 randomised trials in a $15{\times}15$ grid-world with 20\% static obstacle density and stochastic dynamic obstacles. SRAH achieves a task success rate of 62.0\%, a 9.7\% relative improvement over BFS (56.5\%), and outperforms Greedy (4.0\%) by a large margin. We further analyse the trade-off between planning overhead, path efficiency, and failure-recovery count, and demonstrate via an obstacle-density ablation that semantic cost shaping consistently improves navigation across environments of varying difficulty. Our results suggest that even lightweight, LLM-inspired heuristics provide measurable safety and robustness gains for autonomous robot navigation.
Chinese Translation
将大型语言模型(LLM)推理原则整合入经典机器人路径规划代表了一项快速发展的研究方向。在本文中,我们提出了一种语义风险意识启发式(SRAH)规划器,该规划器将受LLM启发的成本函数编码并对几何杂乱或高风险区域进行惩罚,嵌入到A$^*$搜索框架中,并在动态障碍物检测后增强闭环重新规划。我们在$15{\times}15$的网格世界中进行了200次随机实验,评估了SRAH对比两个已建立基线:采用重新规划的广度优先搜索(BFS)和不进行重新规划的贪婪启发式。SRAH实现了62.0\%的任务成功率,相较于BFS(56.5\%)提高了9.7\%,并且超越贪婪启发式(4.0\%)的幅度更大。我们进一步分析了规划开销、路径效率与失败恢复次数之间的权衡,并通过障碍密度消融实验表明,语义成本塑造在不同难度环境中的导航效果持续改善。我们的结果表明,即便是轻量型的受LLM启发的启发式方法,也为自主机器人导航提供了可测量的安全性和稳健性提升。
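As a rough illustration of the SRAH entry above, the following is a minimal sketch of A* search with an added risk term, not the authors' planner. The function names, the `weight` parameter, and the clutter penalty (the fraction of the 8 neighbouring cells that are obstacles) are assumptions made for this example; the abstract does not specify the actual cost function, and dynamic-obstacle replanning is omitted.

```python
import heapq

def clutter_penalty(grid, cell, weight=2.0):
    """Hypothetical risk term: weight times the fraction of the
    8 neighbouring cells that are in-grid obstacles (value 1)."""
    r, c = cell
    blocked = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 1:
                blocked += 1
    return weight * blocked / 8.0

def risk_aware_astar(grid, start, goal, weight=2.0):
    """A* on a 4-connected grid. Entering a cell costs 1 plus its
    clutter penalty, so paths are steered away from cluttered zones.
    Returns the list of cells from start to goal, or None."""
    def heuristic(cell):
        # Manhattan distance stays admissible because every step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == 1:
                continue
            new_g = g + 1.0 + clutter_penalty(grid, nxt, weight)
            if new_g < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None
```

Setting `weight=0` recovers plain shortest-path A*. The reported 9.7\% figure is a relative gain: (62.0 - 56.5) / 56.5 is approximately 0.097.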
cs.RO / 67 / 2605.02881
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2:用于现实世界部署的动作推理模型
Fang, Haoquan, Duan, Jiafei, Clay, Donovan, Wang, Sam, Liu, Shuo, Huang, Weikai, Fan, Xiang, Tsai, Wei-Chuan, Chen, Shirui, Wang, Yi Ru, Xing, Shanli, Cho, Jaemin, Park, Jae Sung, Eftekhar, Ainaz, Sushko, Peter, Farley, Karen, Wadhwa, Angad, Harrison, Cole, Han, Winson, Lee, Ying-Chun, VanderBilt, Eli, Hendrix, Rose, Ellawela, Suveen, Ngoo, Lucas, Chai, Joyce, Ren, Zhongzheng, Farhadi, Ali, Fox, Dieter, Krishna, Ranjay
Abstract
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
Chinese Translation
视觉-语言-动作(VLA)模型旨在为机器人提供一个通用控制器,但现有系统在现实世界部署所需的标准上仍显不足。前沿模型是封闭的,而开放权重的替代方案则依赖于昂贵的硬件,增强推理的策略因其基础而付出了高昂的延迟成本,且细调的成功率仍低于可靠使用的阈值。我们提出了MolmoAct2,这是一个为实际部署构建的完全开放的动作推理模型,沿着五个维度推进其前身。我们引入了MolmoER,一种专注于空间和具身推理的视觉语言模型(VLM)主干,基于330万样本的语料库,通过“专业化-再训练”方法进行训练。我们发布了三个新数据集,涵盖低成本到中等成本的平台,包括MolmoAct2-BimanualYAM,这是迄今为止最大的开放双手数据集,提供720小时的遥操作双手轨迹,配合经过质量过滤的Franka(DROID)和SO100/101子集。我们提供了OpenFAST,这是一个基于数百万轨迹跨越五个具身的开放权重、开放数据的动作令牌化器。我们重新设计架构,将流匹配的连续动作专家嫁接到一个通过每层KV-cache条件的离散令牌VLM上。最后,我们提出了MolmoThink,一种自适应深度推理变体,仅对在时间步之间变化的场景区域重新预测深度令牌,保持几何基础并且延迟比以前降低。通过对目前任何开放VLA进行的最广泛的实证研究,涵盖7个仿真和现实世界基准,MolmoAct2在表现上超越了包括Pi-05的强基线,同时MolmoER在13个具身推理基准上超过了GPT-5和Gemini Robotics ER-1.5。我们发布了模型权重、训练代码和完整的训练数据。项目页面: https://allenai.org/blog/molmoact2
cs.CV / 1 / 2605.00832
Synthetic Designed Experiments for Diagnosing Vision Model Failure
用于诊断视觉模型失败的合成设计实验
Abstract
Current synthetic data pipelines for computer vision generate images without diagnosing what the downstream model actually needs. This open-loop paradigm treats synthetic data as cheap real data, randomly sampling the generator's output space and hoping to cover the model's failure modes. We argue this fundamentally misuses synthetic data's unique property: the controllable, independent variation of scene factors. Drawing on the statistical theory of Design of Experiments (DoE), we propose Synthetic Designed Experiments for Representational Sufficiency (SDRS). SDRS treats the downstream model as a black-box system and the synthetic generator as an experimental apparatus. Using fractional factorial designs, SDRS efficiently audits a model's factor-sensitivity profile via ANOVA decomposition. It classifies failures into two actionable types: Type I gaps (coverage failures on underrepresented factor levels) and Type II gaps (reliance on spurious nuisance dependencies). The audit then prescribes targeted synthetic data to address each gap type. We validate SDRS on three experiments: (1) a controlled diagnostic on dSprites with planted biases, where the audit correctly identifies both gap types and targeted data improves accuracy from 49.9% to 79.0%; (2) a dense segmentation task on procedural scenes, where detecting background-complexity shortcuts and applying targeted data improves mIoU from 0.948 to 0.998; and (3) an entanglement detection experiment showing that the ANOVA audit identifies cross-factor contamination in imperfect generators. Finally, we show that per-factor invariance penalties can transfer sensitivity between factors, identifying an open problem for representation-level correction.
Chinese Translation
目前的计算机视觉合成数据管道生成图像,但并未诊断下游模型真正的需求。这一开放式环节将合成数据视为廉价的真实数据,随机抽样生成器的输出空间,并希望覆盖模型的失败模式。我们认为这根本上误用了合成数据的独特特性:场景因素的可控、独立变动。基于实验设计(Design of Experiments, DoE)的统计理论,我们提出了用于表示充分性的合成设计实验(Synthetic Designed Experiments for Representational Sufficiency, SDRS)。SDRS将下游模型视为一个黑箱系统,将合成生成器视为实验装置。利用分数因子设计,SDRS通过方差分析(ANOVA)分解有效审计模型的因子敏感性特征。它将失败分类为两种可操作类型:类型 I 差距(对欠代表因子水平的覆盖失败)和类型 II 差距(对虚假的干扰依赖的依赖)。审计随后为每种差距类型开具目标合成数据的处方。我们在三个实验中验证了SDRS:(1)对带有植入偏差的 dSprites 进行的受控诊断,其中审计正确识别出两种差距类型,目标数据将准确率从49.9%提高至79.0%;(2)在程序生成场景上的稠密分割任务中,检测背景复杂性捷径,并应用目标数据使得 mIoU 从0.948提高至0.998;(3)一项纠缠检测实验表明,方差分析审计识别了不完善生成器中的跨因子污染。最后,我们展示了逐因子不变性惩罚可以在因子间转移敏感性,识别出表示层次修正的一个未解决问题。
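To make the SDRS idea above concrete, here is a minimal sketch of a designed-experiment audit, assuming three two-level scene factors and a hypothetical black-box model. It builds a 2^(3-1) fractional factorial design with generator C = A*B and estimates main effects as differences of means. The function names and the toy model are assumptions for this example; a full ANOVA would go on to partition the sum of squares, which is not shown.

```python
from itertools import product

def half_fraction_design():
    """2^(3-1) design: factors A and B vary freely over {-1, +1},
    and factor C is fixed by the generator C = A * B."""
    return [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

def main_effects(design, responses):
    """Main effect of each factor: mean response at level +1
    minus mean response at level -1."""
    effects = []
    for k in range(len(design[0])):
        high = [y for run, y in zip(design, responses) if run[k] == 1]
        low = [y for run, y in zip(design, responses) if run[k] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects

def toy_model_accuracy(run):
    """Hypothetical black-box model whose accuracy depends only on
    factor B (for instance, a background-complexity setting)."""
    _, b, _ = run
    return 0.8 + 0.1 * b

design = half_fraction_design()
responses = [toy_model_accuracy(run) for run in design]
effects = main_effects(design, responses)
```

In this toy run only the second factor shows a nonzero effect, which is the kind of sensitivity an audit would flag. With generator C = A*B, the main effect of C is aliased with the A-B interaction, so a half fraction trades that ambiguity for needing 4 runs instead of 8.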
cs.CV / 2 / 2605.00874
Latent Space Probing for Adult Content Detection in Video Generative Models
视频生成模型中成人内容检测的潜空间探测
Abstract
The rapid proliferation of AI-powered video generation systems has introduced significant challenges in content moderation, particularly with respect to adult and sexually explicit material. Existing detection methods operate on either prompts or decoded pixel-space outputs. Therefore, both approaches are blind to the rich internal representations formed during generation. In this paper, we propose a novel latent space probing framework that intercepts the denoised latent representations produced by the CogVideoX video diffusion model during inference and attaches lightweight classifiers to perform real-time adult content detection. To support this work, we construct a large-scale binary dataset of 11039 ten-second video clips (5086 violating, 5953 non-violating) sourced from adult websites and YouTube respectively. We introduce two lightweight probing classifier architectures and train and evaluate them on the dataset. Our work demonstrates that latent-space signals encode strong discriminative features for harmful content detection, achieving 97.29% F1 on our held-out test set with an overhead in the 4-6ms range. Our results suggest that probing the latent space improves both detection performance and cost.
Chinese Translation
随着基于人工智能的视频生成系统的快速普及,内容审核面临着重大挑战,特别是在成人和性露骨材料方面。现有的检测方法主要基于提示或解码后的像素空间输出,因此这两种方法对生成过程中形成的丰富内部表示毫无察觉。在本文中,我们提出了一种新颖的潜空间探测框架,该框架在推理过程中截获由CogVideoX视频扩散模型产生的去噪潜表示,并附加轻量级分类器以实现实时成人内容检测。为支持本研究,我们构建了一个大规模的二元数据集,包括11039个十秒视频片段(其中5086个为违规,5953个为非违规),这些片段分别来源于成人网站和YouTube。我们介绍了两种轻量级的探测分类器架构,并在该数据集上进行训练和评估。我们的研究表明,潜空间信号编码了强大的区分特征,有助于有害内容检测,在我们保留的测试集上达到了97.29%的F1值,同时处理时延在4-6毫秒范围内。我们的结果表明,探测潜空间不仅提高了检测性能,还降低了成本。
cs.CV / 3 / 2605.00875
Visual Chart Representations for Cryptocurrency Regime Prediction: A Systematic Deep Learning Study
用于加密货币状态预测的视觉图表表示:一项系统的深度学习研究
Abstract
Technical traders have long relied on visual analysis of candlestick charts to identify market patterns and predict price movements. While deep learning has achieved remarkable success in image classification, its application to financial chart images remains underexplored. This paper presents a systematic study comparing different visual representations for cryptocurrency regime prediction. We evaluate three image encoding methods (raw candlestick charts, Gramian Angular Fields, and multi-channel GAF), five chart component configurations, four neural network architectures (CNN, ResNet18, EfficientNet-B0, and Vision Transformer), and the impact of ImageNet transfer learning. Through eight controlled experiments on Bitcoin, Ethereum, and S&P 500 data spanning 2018-2024, we identify optimal configurations for visual regime classification. Our results show that a simple 4-layer CNN on raw candlestick charts achieves 0.892 AUC-ROC, outperforming larger pretrained models. Surprisingly, simpler representations (price-only charts, 128x128 resolution) consistently outperform more complex alternatives. We provide interpretability analysis using GradCAM and demonstrate that transfer learning improves performance by 4-16% despite the domain gap between natural images and financial charts.
Chinese Translation
技术交易者长期以来依赖于蜡烛图的视觉分析以识别市场模式和预测价格动态。尽管深度学习在图像分类方面取得了显著成功,但其在金融图表图像中的应用仍未得到充分探索。本文提供了一项系统研究,比较了不同的视觉表示方式在加密货币状态预测中的表现。我们评估了三种图像编码方法(原始蜡烛图、Gramian角场(Gramian Angular Fields)和多通道GAF(multi-channel GAF))、五种图表组件配置、四种神经网络架构(卷积神经网络CNN、ResNet18、EfficientNet-B0和视觉变换器Vision Transformer),以及ImageNet迁移学习的影响。通过对2018-2024年间的比特币、以太坊和标准普尔500数据进行的八项对照实验,我们确定了视觉状态分类的最佳配置。结果显示,简单的4层CNN在原始蜡烛图上实现了0.892的AUC-ROC,优于更大规模的预训练模型。令人惊讶的是,简单的表示(仅价格图、128x128分辨率)始终优于更复杂的替代方案。我们使用GradCAM进行了可解释性分析,并证明尽管自然图像和金融图表之间存在领域差距,迁移学习仍能提高4-16%的性能。
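For readers unfamiliar with the Gramian Angular Field encoding mentioned in the entry above, the following minimal sketch computes the summation variant (GASF) for a short price series. The abstract does not say which GAF variant or preprocessing the authors use, so the min-max rescaling to [-1, 1] and the function name are assumptions for this example.

```python
import math

def gramian_angular_summation_field(series):
    """Encode a 1-D series as a square image G with
    G[i][j] = cos(phi_i + phi_j), where phi = arccos(rescaled value)."""
    low, high = min(series), max(series)
    if high == low:
        raise ValueError("series must contain at least two distinct values")
    # Min-max rescale to [-1, 1] so that arccos is defined.
    scaled = [2.0 * (x - low) / (high - low) - 1.0 for x in series]
    # Guard against tiny floating-point overshoot outside [-1, 1].
    scaled = [max(-1.0, min(1.0, s)) for s in scaled]
    angles = [math.acos(s) for s in scaled]
    return [[math.cos(a + b) for b in angles] for a in angles]

# Usage: three closing prices become a 3x3 image.
image = gramian_angular_summation_field([1.0, 2.0, 3.0])
```

The resulting matrix is symmetric, and its diagonal equals 2x^2 - 1 for each rescaled value x, so the original series can be read back from the diagonal up to sign.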
cs.CV / 4 / 2605.00878
Single Image Defogging Using a Fourth-Order Telegraph PDE Guided by Physical Haze Modeling
基于物理雾霾建模的四阶电报偏微分方程单图像去雾方法
Abstract
In real-world scenarios, image defogging is an inverse problem due to unknown scene depth, atmospheric scattering, and the common absence of ground truth. To address this, we propose a hybrid defogging model that integrates a fourth-order nonlinear PDE with a physical haze formation model. We use the Dark Channel Prior to estimate atmospheric parameters and to generate a guidance image, while the final restoration is performed via a fourth-order PDE-based evolution. Specifically, a fourth-order telegraph-type PDE is evolved, incorporating an edge-adaptive diffusion coefficient and a fidelity term weighted by the transmission map. Fourth-order diffusion effectively suppresses haze while preserving structural details, and the hyperbolic formulation improves numerical stability and convergence behavior. We use a relative error norm criterion to determine convergence of the PDE. The proposed method is compared with the Dark Channel Prior, a modified Dark Channel Prior, and variational single-image defogging techniques. When ground truth is available, we use MSE and SSIM for quantitative evaluation, whereas no-reference metrics, including FADE, Contrast Restoration Index, Average Gradient, and Entropy, are applied to real-world foggy images. Experimental results demonstrate that the proposed hybrid PDE-based method provides comparable visual quality and maintains structural details.
Chinese Translation
在实际场景中,由于未知的场景深度、大气散射以及常见的缺乏真实数据,图像去雾是一个反问题。为了解决这一问题,我们提出了一种混合去雾模型,结合了四阶非线性偏微分方程和物理雾霾形成模型。我们采用暗通道先验(Dark Channel Prior)来估计大气参数并生成引导图像,最终的恢复通过基于四阶偏微分方程的演化来完成。具体而言,我们演化了一种电报(telegraph)型四阶偏微分方程,结合了边缘自适应扩散系数和由传输图加权的保真项。四阶扩散有效抑制雾霾,同时保持结构细节,并且双曲型形式改善了数值稳定性和收敛性。我们使用相对误差范数准则来判定偏微分方程的收敛。所提出的方法与暗通道先验、改进的暗通道先验及基于变分的单图像去雾技术进行了比较。当有真实数据时,我们使用均方误差(MSE)和结构相似性指数(SSIM)进行定量评估,而在处理真实世界中的雾霾图像时,则应用无参考指标,包括FADE、对比度恢复指数、平均梯度和熵。实验结果表明,所提的基于混合偏微分方程的方法在视觉质量上可与其他方法相媲美,并保持了结构细节。
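For context on the physical haze model and Dark Channel Prior referenced in the entry above, the standard formulation from the dehazing literature is sketched below. The abstract gives no equations, so the symbols here are the conventional ones and not necessarily the authors' notation ($I$ hazy image, $J$ scene radiance, $t$ transmission, $A$ atmospheric light, $\Omega(x)$ a local patch, $\omega$ and $t_0$ tuning constants). The fourth-order PDE itself is not reproduced because its form is not stated.

```latex
% Atmospheric scattering (haze formation) model
I(x) = J(x)\, t(x) + A \bigl( 1 - t(x) \bigr)

% Dark channel of an image J over a local patch \Omega(x)
J^{\mathrm{dark}}(x) = \min_{y \in \Omega(x)} \, \min_{c \in \{r,g,b\}} J^{c}(y)

% Transmission estimate under the Dark Channel Prior (J^{dark} \approx 0)
\tilde{t}(x) = 1 - \omega \min_{y \in \Omega(x)} \, \min_{c} \frac{I^{c}(y)}{A^{c}}

% Direct recovery of scene radiance, with a lower bound t_0 on transmission
J(x) = \frac{I(x) - A}{\max\bigl( \tilde{t}(x),\, t_0 \bigr)} + A
```

In a hybrid scheme like the one described, the transmission map $\tilde{t}$ would plausibly supply the weighting for the fidelity term, with the PDE evolution replacing the direct recovery step.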
cs.CV / 5 / 2605.00880
Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving
基于对抗流动匹配的不可察觉攻击于端到端自主驾驶
Abstract
Autonomous driving (AD) is evolving towards end-to-end (E2E) frameworks through two primary paradigms: monolithic models exemplified by Vision-Language-Action (VLA), and specialized modular architectures. Despite their divergent designs, both paradigms increasingly rely on Transformer backbones for complex reasoning, potentially causing a shared vulnerability: visually imperceptible perturbations can manipulate E2E AD models into hazardous maneuvers by targeting the Transformer module. Most existing adversarial attack approaches against AD systems operate under white-box or black-box settings; yet, they typically necessitate full model transparency, or suffer from either prohibitive query latency or limited attack transferability. In this paper, we propose Adversarial Flow Matching (AFM), a novel gray-box attack framework that exploits Transformer structural vulnerabilities in E2E AD models. AFM enables efficient one-step generation of adversarial examples via a neural average velocity field. Additionally, the proposed technique yields effective and visually imperceptible attacks by synergistically perturbing the generative latent space and the neural average velocity field. Extensive experiments demonstrate that AFM achieves a superior trade-off between attack effectiveness and imperceptibility: it substantially degrades the performance of both VLA and modular AD agents across various scenarios compared to baselines, while maintaining state-of-the-art visual imperceptibility. Furthermore, adversarial examples generated by AFM exhibit robust cross-model transferability, indicating that AFM closely approximates a black-box attack setting while requiring only the prior knowledge that the target AD model incorporates a Transformer-based module.
Chinese Translation
自主驾驶(AD)正通过两种主要范式向端到端(E2E)框架发展:以视觉-语言-动作(VLA)为代表的单体模型和专业模块化架构。尽管这两种设计迥异,但都越来越依赖于 Transformer 主干进行复杂推理,可能导致一个共享的脆弱性:视觉不可察觉的扰动可以通过针对 Transformer 模块操控 E2E AD 模型,导致危险行为。目前大多数针对 AD 系统的对抗攻击方法均在白盒或黑盒设置下操作;然而,它们通常需要完全的模型透明性,或受到过高的查询延迟或有限的攻击可转移性影响。本文提出了一种新的灰盒攻击框架——对抗流动匹配(AFM),旨在利用 E2E AD 模型中的 Transformer 结构脆弱性。AFM 通过神经平均速度场实现高效的一步生成对抗样本。此外,该技术通过协同扰动生成潜在空间和神经平均速度场,产生有效且视觉不可察觉的攻击。广泛的实验表明,AFM 在攻击有效性和不可察觉性之间达成了优越的平衡:与基线相比,它在多种场景中显著降低了 VLA 和模块化 AD 代理的性能,同时保持了最先进的视觉不可察觉性。此外,通过 AFM 生成的对抗样本表现出强大的跨模型可转移性,表明 AFM 接近于黑盒攻击设置,同时仅需先验知识,即目标 AD 模型包含基于 Transformer 的模块。
cs.CV / 6 / 2605.00882
Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography
基于干预的自监督学习:一种针对远程光电容积描记法的因果探测范式
Abstract
Remote Photoplethysmography (rPPG) enables convenient non-contact physiological measurement. Existing Self-Supervised Learning (SSL) methods commonly fall into a correlation trap: they tend to learn the most dominant periodic signals in the data, such as high-energy motion or illumination noise, rather than the faint, true rPPG signal, leading to poor model generalization. To address this, we propose a new SSL paradigm, Physiological Causal Probing (PCP), which treats the latent rPPG signal as the underlying physical source and the resulting pixel chrominance variations as its visual manifestation. Its core idea is to shift from passive correlation learning to active, precise intervention: it intervenes on the video based on a proposed rPPG hypothesis, and verifies whether the post-intervention changes match physical expectations. We propose the Interv-rPPG framework to implement PCP: an rPPG extractor named PhysMambaFormer hypothesizes the rPPG signal, while a Controllable Physiological Signal Editor conducts precise chrominance-domain interventions on videos based on this hypothesis. Interv-rPPG validates the physical realism of the hypothesis through 'Falsifiability via Nulling' and 'Axiomatic Equivariance'. Our editor achieves precise editing of the rPPG signal by intervening in the low-frequency chrominance components of the video. Our method improves both in-domain and cross-domain performance on challenging datasets such as VIPL-HR and MMPD. Furthermore, it surpasses the supervised baseline in complex cross-dataset settings, while remaining competitive on clean datasets where the intervention mechanism may introduce slight residual chrominance noise. Extensive experiments, including diagnostic analysis of nuisance sensitivity, demonstrate that the PCP paradigm effectively resists motion and illumination artifacts.
Chinese Translation
远程光电容积描记法(rPPG)实现了便捷的非接触生理测量。现有的自监督学习(SSL)方法通常陷入相关性陷阱:它们倾向于学习数据中最显著的周期信号,如高能量的运动或光照噪声,而不是微弱的真实rPPG信号,从而导致模型泛化性能差。为了解决这一问题,我们提出了一种新的自监督学习范式——生理因果探测(PCP),它将潜在的rPPG信号视为潜在的物理源,而将产生的像素色度变化视为其视觉表现。其核心思想是从被动的相关学习转向主动、精确的干预:它根据提出的rPPG假设对视频进行干预,并验证干预后的变化是否符合物理预期。我们提出了Interv-rPPG框架来实现PCP:一个名为PhysMambaFormer的rPPG提取器假设rPPG信号,而一个可控生理信号编辑器则基于这一假设对视频进行精确的色度域干预。Interv-rPPG通过“通过归零的可证伪性”和“公理等变性”验证假设的物理现实性。我们的编辑器通过对视频的低频色度分量进行干预,实现了rPPG信号的精确编辑。我们的方法在VIPL-HR和MMPD等具有挑战性的数据集上提高了领域内和跨领域的性能。此外,在复杂的跨数据集设置中,它超越了监督基线,同时在干预机制可能引入轻微残留色度噪声的干净数据集上保持竞争力。大量实验,包括对干扰敏感性的诊断分析,表明PCP范式有效抵抗运动和光照伪影。
cs.CV / 7 / 2605.00883
Towards High Fidelity Face Swapping: A Comprehensive Survey and New Benchmark
迈向高保真换脸:全面调查与新基准
Abstract
Face swapping has witnessed significant progress in recent years, largely driven by advances in deep generative models such as GANs and diffusion models. Despite these advances, existing methods remain fragmented across different paradigms, and their evaluation is highly inconsistent due to the lack of standardized datasets and protocols. Moreover, prior surveys primarily focus on broader deepfake generation or detection, leaving face swapping insufficiently studied as a standalone problem. In this paper, we present a comprehensive survey and benchmark for face swapping. We provide a structured review of existing methods, organizing them into five major paradigms and systematically analyzing their design principles, strengths, and limitations. To enable fair and controlled evaluation, we introduce CASIA FaceSwapping, a high-quality benchmark with balanced demographic distributions and explicit attribute variations, and establish standardized protocols to assess the robustness of different face swapping methods. Extensive experiments on representative approaches yield new insights into the performance characteristics and limitations of current techniques. Overall, our work provides a unified perspective and a principled evaluation framework to facilitate the development of more robust and controllable face swapping methods. More results can be found at https://github.com/CASIA-NLPRAI/face-swapping-survey.
Chinese Translation
近年来,换脸技术取得了显著进展,主要得益于深度生成模型的进步,如生成对抗网络(GANs)和扩散模型。尽管如此,现有的方法在不同的范式中仍然碎片化,其评估由于缺乏标准化的数据集和协议而高度不一致。此外,之前的调查主要集中在更广泛的深伪生成或检测上,导致换脸作为一个独立问题的研究不足。在本文中,我们呈现了关于换脸的全面调查与基准。我们对现有的方法进行了结构化的回顾,将其组织为五个主要范式,并系统地分析了它们的设计原则、优点和局限性。为了实现公正和可控的评估,我们引入了CASIA FaceSwapping,这是一个高质量的基准,具有平衡的人口分布和明确的属性变异,同时建立了标准化的协议来评估不同换脸方法的鲁棒性。对代表性方法的广泛实验提供了对当前技术性能特征和局限性的新的见解。总体而言,我们的工作提供了统一的视角和原则性的评估框架,以促进更稳健和可控的换脸方法的发展。更多结果可见于 https://github.com/CASIA-NLPRAI/face-swapping-survey 。
cs.CV / 8 / 2605.00884
LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception
LiteVLA-H:用于机载空中导航和语义感知的双速视觉-语言-行动推理
Abstract
Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65 ms (19.74 Hz) while still supporting sentence-level semantic outputs at 149.90--164.57 ms (6.08--6.67 Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.
Chinese Translation
视觉-语言-行动(VLA)模型在操作任务中展现了强大的语义基础和任务泛化能力,但由于无人机在严格的机载计算和通信约束下需要低延迟的闭环导航,空中部署仍然十分困难。我们提出了LiteVLA-H,这是一种紧凑型256M参数的VLA系统,设计用于在NVIDIA Jetson AGX Orin上进行双速操作:快速外环引导模式用于短行动令输出,以及较慢的语义模式用于场景理解、危险描述和面向操作员的叙述。核心经验观察是,在这种紧凑型边缘环境中,端到端延迟主要由多模态预填充主导,而不是解码少量额外令牌的边际成本。这促使我们设计了一种调度器,以50.65毫秒(19.74赫兹)的速度发出反应性行动令牌,同时在同一嵌入平台上支持149.90至164.57毫秒(6.08至6.67赫兹)的句子级语义输出。为了在不削弱模型描述能力的情况下对模型进行专业化调整,我们采用了一种知识保留的微调方法,该方法混合了反应飞行数据、空中语义数据和通用的字幕/VQA监督。除了报告当前的延迟测量外,我们将该系统与最近的最先进架构进行比较,包括AnywhereVLA、FutureVLA和ReMem-VLA,显示在我们的部署条件下,测量的行动分支达到了更高的边缘推理速率,同时保持周期性的语义意识。
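The latency and rate figures quoted in the LiteVLA-H entry above are reciprocal views of the same measurement. A small sketch of the conversion, assuming one output per inference call (the function name is illustrative and not from the paper):

```python
def latency_ms_to_rate_hz(latency_ms):
    """Convert a per-inference latency in milliseconds into an
    output rate in hertz, assuming one output per inference call."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms

# Reproduce the rates quoted in the abstract from its latencies.
action_rate = round(latency_ms_to_rate_hz(50.65), 2)
semantic_rate_fast = round(latency_ms_to_rate_hz(149.90), 2)
semantic_rate_slow = round(latency_ms_to_rate_hz(164.57), 2)
```

Note that the lower latency bound (149.90 ms) corresponds to the upper rate bound (6.67 Hz), so the two quoted ranges run in opposite directions.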
cs.CV / 9 / 2605.00885
Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion
基于浓度划分和图像融合的多分支非均匀图像去雾
Abstract
Existing single image dehazing methods have demonstrated satisfactory performance on homogeneous thin-haze images; however, they often struggle with non-homogeneous hazy images that exhibit spatially varying haze concentrations and abrupt density transitions across different regions. To address this fundamental limitation, we propose a novel multi-branch deep neural network framework, termed Concentration Partitioning and Image Fusion Network (CPIFNet), which decomposes the challenging non-homogeneous dehazing problem into a set of tractable homogeneous sub-problems. Our key insight is that a single non-homogeneous hazy image can be viewed as a composite of multiple local regions, each exhibiting approximately homogeneous haze characteristics. CPIFNet employs a two-stage architecture consisting of an Image Enhancement Network (IENet) stage and an Image Fusion Network (IFNet) stage. In the first stage, multiple IENet branches are independently trained on homogeneous haze datasets of different concentration levels, producing enhancement models that excel at restoring regions matching their respective haze densities. In the second stage, the IFNet intelligently aggregates the advantageous regions from all enhancement outputs through deep feature stacking and merging, yielding a unified high-quality dehazed result. Furthermore, we introduce a comprehensive loss function incorporating reconstruction, perceptual, structural, and color losses to jointly supervise both stages.
Chinese Translation
现有的单幅图像去雾方法在均匀薄雾图像上表现良好;然而,它们在处理具有空间变异雾浓度和不同区域间突然密度变化的非均匀雾霾图像时往往面临挑战。为了解决这一根本性限制,我们提出了一种新颖的多分支深度神经网络框架,称为浓度划分和图像融合网络(Concentration Partitioning and Image Fusion Network, CPIFNet),该框架将复杂的非均匀去雾问题分解为一组可处理的均匀子问题。我们的关键见解是,单幅非均匀雾霾图像可以视为多个局部区域的组合,每个区域都表现出近似均匀的雾特性。CPIFNet采用两阶段架构,包括图像增强网络(Image Enhancement Network, IENet)阶段和图像融合网络(Image Fusion Network, IFNet)阶段。在第一阶段,多个IENet分支在不同浓度水平的均匀雾数据集上独立训练,生成能够有效恢复与各自雾密度匹配的区域的增强模型。在第二阶段,IFNet通过深度特征堆叠和合并,智能地聚合所有增强输出的优势区域,从而产生统一的高质量去雾结果。此外,我们引入了一种综合损失函数,结合重建、感知、结构和颜色损失,以共同监督两个阶段。
cs.CV / 10 / 2605.00886
Selective Attention-Based Network for Robust Infrared Small Target Detection
基于选择性注意力的网络用于强鲁棒性的红外小目标检测
Abstract
Infrared small target detection (IRSTD) plays a pivotal role in a broad spectrum of mission-critical applications, including maritime surveillance, military search and rescue, early warning systems, and precision-guided strikes, all of which demand the precise identification of dim, sub-pixel targets amid highly cluttered infrared backgrounds. Despite significant progress driven by deep learning methods, fundamental challenges persist: infrared small targets occupy extremely limited spatial extents (often only a few pixels), exhibit low signal-to-clutter ratios, and are easily confused with structurally complex backgrounds that frequently induce false alarms. Existing encoder-decoder architectures suffer from two key limitations - an information bottleneck in early convolutional stages that undermines fine-grained target perception, and static skip connections that lack the dynamic adaptability required to discriminate between genuine targets and pseudo-target regions. To address these challenges, we propose SANet, a Selective Attention-based Network built upon the classical U-Net framework and augmented with two novel components: (1) a \emph{Dual-path Semantic-aware Module} (DSM) that integrates standard convolutions for local spatial detail preservation with pinwheel-shaped convolutions for expanded, direction-sensitive receptive fields, followed by a Convolutional Block Attention Module (CBAM) for fine-grained spatial-channel feature recalibration; and (2) a \emph{Selective Attention Fusion Module} (SAFM) that replaces conventional static skip connections with a spatially adaptive, learnable weighting mechanism to perform context-aware, cross-scale feature fusion.
Chinese Translation
红外小目标检测(IRSTD)在包括海洋监视、军事搜索与救援、预警系统及精确制导打击等一系列关键任务应用中扮演着至关重要的角色,这些应用都要求在高度杂乱的红外背景中精确识别微弱的亚像素目标。尽管得益于深度学习方法取得了显著进展,但仍然存在一些基本挑战:红外小目标占据的空间范围极其有限(通常仅几个像素),信噪比低,并且容易与结构复杂的背景混淆,从而频繁引发误报警。现有的编码-解码架构面临两个主要限制——在早期卷积阶段的信息瓶颈削弱了细粒度目标感知,静态跳跃连接缺乏区分真实目标和伪目标区域所需的动态适应能力。为了解决这些问题,我们提出了SANet,一个基于选择性注意力的网络,建立在经典的U-Net框架上,并增强了两个新颖组件:(1)一个\textit{双路径语义感知模块}(DSM),它将标准卷积与风车形卷积相结合,确保局部空间细节的保留,扩展方向敏感的接收场,随后通过卷积块注意力模块(CBAM)进行细粒度空间-通道特征的重校准;(2)一个\textit{选择性注意力融合模块}(SAFM),该模块用空间自适应的可学习权重机制替代传统静态跳跃连接,以执行上下文感知的跨尺度特征融合。
cs.CV / 11 / 2605.00887
SparseContrast: Dynamic Sparse Attention for Efficient and Accurate Contrastive Learning in Medical Imaging
SparseContrast:用于医疗影像中高效准确对比学习的动态稀疏注意力
Abstract
We propose SparseContrast, a new framework that merges dynamic sparse attention with contrastive learning for medical imaging, with a focus on chest X-ray disease detection in low-data settings. Traditional contrastive learning methods rely on dense attention mechanisms, which are computationally expensive and often process redundant regions in medical images. To resolve this, SparseContrast introduces a sparse attention mechanism that selectively concentrates on diagnostically pertinent areas, markedly decreasing computational burden without compromising accuracy. The framework adaptively trims attention maps in the training phase, directed by a compact saliency predictor which concurrently optimizes sparsity and feature quality. This method not only speeds up training and inference by as much as 40% relative to dense attention benchmarks but also boosts diagnostic accuracy by focusing on areas of clinical importance. Moreover, the approach remains indifferent to the selection of backbone architecture, which permits its application to both convolutional and transformer-based models. Experiments show SparseContrast attains comparable or better performance in disease identification tasks with greater efficiency relative to current approaches. The proposed framework delivers a practical approach for implementing contrastive learning in medical imaging settings with limited resources, where computational efficiency and diagnostic accuracy are paramount.
Chinese Translation
我们提出了SparseContrast,这是一种将动态稀疏注意力与对比学习相结合的新框架,重点关注在低数据环境下对胸部X光疾病的检测。传统的对比学习方法依赖于密集注意力机制,这种机制计算开销大,往往处理医疗图像中的冗余区域。为了应对这一问题,SparseContrast引入了一种稀疏注意力机制,选择性地集中于诊断相关区域,显著降低计算负担而不损害准确性。该框架在训练阶段自适应地修剪注意力图,由一个紧凑的显著性预测器指导,同时优化稀疏性和特征质量。该方法相较于密集注意力基准在训练和推理速度上提升了多达40%,并通过关注临床重要区域提高了诊断准确性。此外,该方法对基础架构的选择不敏感,允许其应用于卷积模型和基于变换器的模型。实验表明,SparseContrast在疾病识别任务中以更高的效率获得了与当前方法相当或更好的性能。所提出的框架为在资源有限的医疗影像环境中实施对比学习提供了一种实用的方法,在这些环境中,计算效率和诊断精度至关重要。
cs.CV / 12 / 2605.00888
Selective Correlation Based Knowledge Distillation for Ground Reaction Force Estimation
基于选择性相关性的知识蒸馏用于地面反作用力估计
Abstract
Wearable sensor-based human gait analysis holds great promise in healthcare, rehabilitation, clinical diagnosis and monitoring, and sports activities. Specifically, ground reaction force (GRF) provides essential insights into the body's interaction with the ground during movement and is typically measured using instrumented treadmills equipped with force plates. However, such equipment is expensive and restricted to laboratory environments. To enable a more portable solution, wearable insole sensors have been used to measure GRF. These sensors, however, are prone to noise and external interference, which reduces measurement accuracy. Deep learning methodologies could be adopted to address these issues, but they often require significant computing resources to achieve high accuracy, limiting their applicability for real-time analysis on portable devices. To overcome these limitations, we propose Selective Correlation Based Knowledge Distillation (SCKD) for estimating GRF from data collected by insole sensors. Our proposed method utilizes selected features considering temporal characteristics in the process of extracting correlation maps for knowledge transfer, enhancing interpretability and mitigating issues in high dimensional data processing. We demonstrate the effectiveness of the compact models generated by our distillation framework through comparison with existing methods. Various configurations of teacher-student architectures and training approaches are examined based on multiple evaluation criteria, utilizing data collected at different walking speeds and with different window sizes. Experimental results confirm that our approach outperforms existing methods in estimating GRF from wearable insole sensor data. Therefore, our approach offers a reliable and resource-efficient solution for human gait analysis.
Chinese Translation
基于可穿戴传感器的人体步态分析在医疗、康复、临床诊断与监测以及运动活动中展现出巨大的潜力。具体来说,地面反作用力(GRF)提供了关于身体在运动过程中与地面相互作用的重要见解,通常通过配备力传感器的仪器化跑步机进行测量。然而,这类设备价格昂贵且局限于实验室环境。为了实现更为便携的解决方案,已使用可穿戴鞋垫传感器来测量GRF。然而,这些传感器容易受到噪声和外部干扰的影响,从而降低测量精度。可以采用深度学习方法来解决这些问题,但通常需要大量计算资源才能实现高精度,这限制了其在便携设备上的实时分析应用。为克服这些限制,我们提出了一种基于选择性相关性的知识蒸馏(Selective Correlation Based Knowledge Distillation,SCKD)方法,用于从鞋垫传感器收集的数据中估计GRF。我们提出的方法在提取用于知识转移的相关图的过程中考虑时序特征,从而提高可解释性,并减轻高维数据处理中的问题。我们通过与现有方法的比较,展示了我们蒸馏框架生成的紧凑模型的有效性。我们根据多种评估标准,考察了不同教师-学生架构和训练方法的配置,使用在不同步态速度下和不同窗口大小下收集的数据。实验结果确认我们的方法在估计来自可穿戴鞋垫传感器数据的GRF方面优于现有方法。因此,我们的方法为人体步态分析提供了一种可靠且资源高效的解决方案。
cs.CV / 13 / 2605.00889
On the explainability of max-plus neural networks
关于最大加神经网络可解释性的研究
Abstract
We investigate the explainability properties of the recently proposed linear-min-max neural networks. At initialization, they can be interpreted as k-medoids with the infinity norm as the distance. They are then trained using subgradient descent to better fit the data. The model has been shown to be a universal approximator. Yet we can trace the decision process, because a single most activated neuron is responsible for the value of the output. Using this property, we designed a pixel fragility measure that determines whether changes to a single pixel may be responsible for a change in the classification output. Experiments on the PneumoniaMNIST dataset show that this explanation for the output of the neural network compares favorably to SHAP and Integrated Gradients.
Chinese Translation
我们研究了最近提出的线性-最小-最大神经网络的可解释性特征。在初始化时,它们可以被解释为使用无穷范数作为距离的 k-中心点(k-medoids)。随后,它们通过次梯度下降进行训练,以更好地拟合数据。该模型被证明是一个通用逼近器。然而,我们可以追踪决策过程,因为单个最活跃的神经元负责输出值。利用这一特性,我们设计了一种像素脆弱性度量,以确定对单个像素的改变是否可能导致分类输出的变化。对 PneumoniaMNIST 数据集的实验表明,这种对神经网络输出的解释优于 SHAP 和集成梯度(Integrated Gradients)。
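The tracing property described in this abstract can be illustrated with a generic max-plus unit. This is a minimal sketch, not the paper's architecture or its pixel fragility measure: the function names `maxplus_unit` and `runner_up_margin`, the weights, and the inputs are all hypothetical. The unit outputs the maximum of x[i] + w[i], so exactly one term determines the output, and the gap to the second-largest term indicates how large a change to the runner-up input is needed before a different term wins.

```python
def maxplus_unit(x, w):
    """Return (value, index) of the winning term max_i (x[i] + w[i])."""
    terms = [xi + wi for xi, wi in zip(x, w)]
    best = max(range(len(terms)), key=terms.__getitem__)
    return terms[best], best


def runner_up_margin(x, w):
    """Gap between the largest and second-largest term of the unit."""
    terms = sorted((xi + wi for xi, wi in zip(x, w)), reverse=True)
    return terms[0] - terms[1]


# Hypothetical inputs and weights, chosen to be exact in binary floating point.
x = [0.25, 1.0, 0.5]
w = [0.5, -0.5, 0.0]
value, winner = maxplus_unit(x, w)   # the single term responsible for the output
margin = runner_up_margin(x, w)      # how far the runner-up is from taking over
```

A small margin suggests that a modest change to one input could switch the responsible term, which is the intuition behind asking whether a single pixel can change the output.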
cs.CV / 14 / 2605.00890
Skeleton-Based Posture Classification to Promote Safer Walker-Assisted Gait in Older Adults
基于骨架的姿态分类以促进老年人更安全的助行器辅助步态
Abstract
Falls among older adults are a significant public health concern, leading to severe injuries, loss of independence, and increased healthcare costs. This study evaluates the effectiveness of various models, including a Geometric approach, XGBoost, SVM, and several deep learning architectures, in classifying walker usage, standing vs. sitting, and posture for users of smart walkers. The Geometric approach and XGBoost were the top performers. XGBoost achieved near-perfect training accuracy in binary classification tasks, with 99.84% for walker choice and 99.69% for standing vs. sitting. For posture classification, the Geometric approach attained 89.9% accuracy for 8 postures, and XGBoost obtained 99.24% during training for 17 postures. Deep learning models such as the 4-layer CNN and Encoder-Decoder CNN also demonstrated strong performance in binary classification, with accuracies above 98%. This study underscores the potential of machine learning to enhance human-robot interaction in smart walkers, particularly for fall prevention.
Chinese Translation
老年人跌倒是一个重大公共卫生问题,导致严重伤害、独立性丧失以及医疗成本增加。本研究评估了多种模型的有效性,包括几何方法、XGBoost、支持向量机(SVM)和多种深度学习架构,以对助行器使用、站立与坐下以及姿态进行分类。几何方法和XGBoost表现最佳。XGBoost在二分类任务中达到了近乎完美的训练准确率,对于助行器选择为99.84%,站立与坐下为99.69%。在姿态分类方面,几何方法对于8种姿态达到了89.9%的准确率,XGBoost在训练17种姿态时获得了99.24%的准确率。深度学习模型如4层卷积神经网络(CNN)和编码-解码CNN在二分类中也表现出强劲的性能,准确率均超过98%。本研究强调了机器学习在智能助行器中增强人机交互的潜力,特别是在预防跌倒方面。
cs.CV / 15 / 2605.00891
X2SAM: Any Segmentation in Images and Videos
X2SAM:图像和视频中的任意分割
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
Chinese Translation
多模态大型语言模型(MLLMs)在图像级视觉理解和推理方面表现出色,但它们在图像和视频的像素级感知方面仍然有限。基础分割模型如SAM系列生成高质量的掩膜,但依赖于低级视觉提示,无法原生解释复杂的对话指令。现有的分割MLLMs缩小了这一差距,但通常专注于图像或视频中的一种类型,且很少在一个接口中同时支持文本和视觉提示。我们介绍了X2SAM,一种统一的分割MLLM,将任意分割能力从图像扩展到视频。基于对话指令和视觉提示,X2SAM将大型语言模型(LLM)与一个掩膜存储模块相结合,该模块存储引导的视觉特征,以生成时间一致的视频掩膜。相同的构造支持跨图像和视频输入的通用开放词汇、指称、推理、基于实体的对话生成、交互式和视觉引导分割。我们进一步介绍了视频视觉引导(V-VGD)分割基准,该基准评估模型能否从交互式视觉提示中对视频中的对象轨迹进行分割。通过对异构图像和视频数据集的统一联合训练策略,X2SAM实现了强大的视频分割性能,在图像分割基准测试中保持竞争力,并保持一般的图像和视频对话能力。
cs.CV / 16 / 2605.00892
When To Adapt? Adapting the Model or Data in Federated Medical Imaging
何时进行适应?在联邦医学影像中对模型或数据进行适应
Abstract
Federated learning enables collaborative model training across medical institutions without sharing raw data, but its performance is often limited by domain heterogeneity across clients. Existing approaches to address this challenge fall into two main paradigms: model-side personalization, which adapts model parameters to each client, and data-side harmonization, which reduces inter-client variation at the input level. Despite their widespread use, these strategies have not been systematically compared. In this work, we conduct a comprehensive study across six medical imaging settings (colon polyp, skin lesion, and breast tumor segmentation, and tuberculosis CXR, brain tumor, and breast tumor classification) covering diverse types of domain shift. We evaluate a broad set of state-of-the-art harmonization and personalization methods under a unified framework. Our results reveal a conditional trade-off driven by the nature of heterogeneity: harmonization is more effective when variation is primarily appearance-based (e.g., CXR classification), while personalization performs better when differences are structural (e.g., colon polyp segmentation). When inter-client variation is limited, both strategies perform similarly. These findings demonstrate that the effectiveness of adaptation in federated medical imaging depends on the type and magnitude of domain shift rather than the strategy alone. We provide practical guidelines for selecting between harmonization and personalization and highlight directions for future hybrid approaches that combine both paradigms. Code is available at https://github.com/ChamaniS/WhenToAdapt.
Chinese Translation
联邦学习允许医疗机构在不共享原始数据的情况下进行协作模型训练,但其性能常常受到客户端之间领域异质性的限制。现有的解决此挑战的方法主要分为两大类:模型侧个性化,即针对每个客户端调整模型参数,以及数据侧协调,即在输入层减少客户端之间的变异。尽管这些策略得到了广泛应用,但尚未进行系统比较。在本研究中,我们在六个医学影像设置中进行全面研究,包括结肠息肉、皮肤病变和乳腺肿瘤分割,以及结核病胸部X光片、大脑肿瘤和乳腺肿瘤分类,涵盖不同类型的领域转变。我们在统一框架下评估了一系列最先进的协调和个性化方法。我们的结果揭示了一种受异质性性质驱动的条件性权衡:当变异主要基于外观时(例如,胸部X光分类),协调更有效;而当差异是结构性的(例如,结肠息肉分割)时,个性化效果更好。当客户端之间的变异有限时,这两种策略的表现相似。这些发现表明,在联邦医学影像中,适应的有效性依赖于领域转变的类型和程度,而不仅仅是策略本身。我们提供了在协调和个性化之间进行选择的实用指南,并强调了未来结合两种范式的混合方法的研究方向。代码可在 https://github.com/ChamaniS/WhenToAdapt 获取。
cs.CV / 17 / 2605.00893
Retrieval-Guided Generation for Safer Histopathology Image Captioning
基于检索引导的生成方法用于更安全的组织病理学图像说明
Abstract
Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency, all of which are serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On the ARCH histopathology dataset, RGG improves semantic alignment with ground truth, achieving cosine similarity of $\approx$0.60 versus $\approx$0.47 from MedGemma, with non-overlapping confidence intervals indicating a robust gain. A pathologist-led qualitative review shows better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while revealing failure modes such as concept mixing and inherited over-specific labeling. Overall, retrieval-guided captioning offers a more transparent and reliable approach with clearer opportunities for auditing than fully generative methods.
Chinese Translation
生成型视觉语言模型能够生成流畅的医学图像说明,但仍然容易出现幻觉、过度特定的诊断主张以及事实不一致,这些在病理学中都是严重问题。在本研究中,我们探讨了检索引导生成(Retrieval-Guided Generation, RGG)作为一种更安全的替代方案。该方法通过总结来自视觉相似病例的专家文本来形成说明,而不是全新生成。在ARCH组织病理学数据集上,RGG在与真实标签的语义对齐方面有所改善,实现了约0.60的余弦相似度,而MedGemma的相似度约为0.47,且不重叠的置信区间表明了稳健的提高。由病理学家主导的定性评估显示,该方法更好地保留了与形态相关的术语,且缺乏依据的诊断更少,同时揭示了如概念混淆和继承的过度特定标签等失败模式。总体而言,检索引导的说明提供了一种比完全生成方法更透明、更可靠的方式,审计机会也更加明确。
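The retrieval step and the cosine-similarity score mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the two-dimensional embeddings, the placeholder captions, and the function names `cosine` and `retrieve_captions` are hypothetical, and the summarization step that would turn retrieved captions into a final caption is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_captions(query_vec, bank, k=2):
    """Return the captions of the k bank entries most similar to the query."""
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


# Hypothetical (embedding, expert caption) pairs standing in for a case bank.
bank = [
    ([1.0, 0.0], "caption A"),
    ([0.0, 1.0], "caption B"),
    ([0.9, 0.1], "caption C"),
]
top = retrieve_captions([1.0, 0.0], bank, k=2)
```

Because every output sentence would trace back to retrieved expert text, a reviewer can inspect which cases were used, which is the auditing advantage the abstract points to.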
cs.CV / 18 / 2605.00894
Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding
Dino-NestedUNet:通过密集解码解锁基础视觉编码器以进行病理肿瘤体积分割
Abstract
Vision foundation models (VFMs), such as DINOv3, provide rich semantic representations that are promising for computational pathology. However, many current adaptations pair frozen VFMs with lightweight decoders, creating a capacity mismatch that often limits boundary fidelity for infiltrative tumor bulk segmentation. This paper presents Dino-NestedUNet, a framework that couples a pre-trained DINOv3 encoder with a Nested Dense Decoder. Instead of sparse skip connections and linear upsampling, the proposed decoder forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction. We evaluate Dino-NestedUNet on three histopathology cohorts (multi-center CHTN, institutional OSU, and CAMELYON16) and observe consistent improvements over UNet++ and standard Dino-UNet variants, particularly under cross-domain shift. To further assess external generalization, we perform zero-shot evaluation by training on CHTN and directly testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. These results suggest that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology segmentation.
Chinese Translation
视觉基础模型 (VFMs),如 DINOv3,提供丰富的语义表示,成为计算病理学中极具前景的工具。然而,目前许多适应性模型将冻结的 VFMs 与轻量解码器结合,这导致了能力不匹配,常常限制了浸润性肿瘤体积分割的边界保真度。本文提出了 Dino-NestedUNet,一个将预训练的 DINOv3 编码器与嵌套密集解码器相结合的框架。与稀疏跳跃连接和线性上采样不同,所提出的解码器形成了一个中间路径的密集网格,以实现连续特征重用和多尺度重校准,从而在重建过程中将高层语义与低层形态纹理对齐。我们在三个组织病理学队列上评估了 Dino-NestedUNet(多中心 CHTN、机构 OSU 和 CAMELYON16),并观察到其在 UNet++ 和标准 Dino-UNet 变体上具有一致的改进,尤其是在跨域迁移下。为了进一步评估外部泛化,我们通过在 CHTN 上训练并直接在未见的 TIGER WSIBULK 和 OSU CRC 队列上进行零-shot 评估,未进行微调。这些结果表明,密集解码是解锁边界敏感病理分割中基础编码器的关键因素。
cs.CV / 19 / 2605.00896
When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping
当简约胜于复杂:在物理约束下的 InSAR 相位展开中,简易性优于复杂性
Abstract
Operational phase unwrapping is the primary computational bottleneck in InSAR-based volcanic and seismic monitoring. We challenge the industry trend of adopting high-complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics-constrained geophysical regression. We present the first large-scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant "complexity penalty": a vanilla U-Net (7.76M parameters) achieves $R^2=0.834$ and RMSE $= 1.01$ cm, outperforming 11.37M-parameter attention-based models by 34% in $R^2$ and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high-frequency artifacts ($>0.3$ cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a $2.5\times$ speedup), the vanilla U-Net is the only candidate to comfortably meet the sub-100ms requirement for operational early-warning systems. This work bridges the "publication-to-practice" gap by proving that convolutional locality outperforms modern complexity for smooth-field regression, advocating for physics-informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When-Less-is-More-InSAR-Phase-Unwrapping
Chinese Translation
操作性相位展开是基于 InSAR 的火山和地震监测中的主要计算瓶颈。我们质疑行业普遍采用高复杂度计算机视觉架构(例如注意力机制)而未验证其在物理约束下的地球物理回归适用性的趋势。我们在全球 LiCSAR 基准上进行了首个大规模架构消融研究,涉及 20 帧、39,724 个补丁和 6.51 亿个像素。我们的结果揭示了显著的“复杂度惩罚”:一个基础的 U-Net(7.76M 参数)实现了 $R^2=0.834$ 和 RMSE $= 1.01$ cm,在 $R^2$ 和 RMSE 上分别超越 11.37M 参数的基于注意力模型 34% 和 51%。功率谱密度(PSD)分析提供了物理依据:虽然注意力机制在捕捉自然图像中的明显语义边缘方面表现出色,但它向地球物理领域注入了不物理的高频伪影($>0.3$ 循环/像素),违反了弹性表面变形的基本平滑性约束。基础 U-Net 在 2.92ms 的推理延迟下(加速 2.5 倍)是唯一一个能够轻松满足运营早期预警系统小于 100ms 要求的候选者。本工作通过证明卷积局部性在平滑场回归中超越现代复杂性,弥合了“发表到实践”的差距,提倡在机器学习用于遥感(ML4RS)中的物理信息简约性。代码可访问 https://github.com/prabhjotschugh/When-Less-is-More-InSAR-Phase-Unwrapping
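The spectral argument in this abstract can be made concrete with a small one-dimensional example. This is a hedged sketch, not the authors' analysis code: it uses a naive discrete Fourier transform on a 1D signal rather than a 2D field, the function name `high_freq_power_fraction` is hypothetical, and only the 0.3 cycles/pixel cutoff is taken from the abstract.

```python
import cmath
import math


def high_freq_power_fraction(signal, cutoff=0.3):
    """Fraction of non-DC spectral power above `cutoff` cycles per sample."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, c in enumerate(centered))
        # One-sided spectrum: every bin except Nyquist counts twice.
        weight = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
        power = weight * abs(coeff) ** 2
        total += power
        if k / n > cutoff:
            high += power
    return high / total if total > 0 else 0.0


# A smooth field (0.0625 cycles/sample) versus a sample-to-sample oscillation.
smooth = [math.sin(2 * math.pi * 2 * t / 32) for t in range(32)]
jagged = [(-1.0) ** t for t in range(32)]
smooth_fraction = high_freq_power_fraction(smooth)
jagged_fraction = high_freq_power_fraction(jagged)
```

A prediction of a smooth deformation field whose high-frequency fraction is far above that of the reference would be a sign of the kind of artifact the abstract describes.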
cs.CV / 20 / 2605.00899
LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images
LatentDiff:将语义数据集比较扩展到数百万图像
Abstract
We present LatentDiff, a scalable framework for semantic dataset comparison that operates directly in the latent space of pretrained vision encoders. By combining sparse autoencoder-based divergence testing with density ratio estimation, LatentDiff identifies interpretable semantic differences between datasets at a fraction of the computational cost of caption-based alternatives. We also introduce Noisy-Diff, a benchmark capturing realistic sparse distribution shifts that cause existing methods to struggle. Experiments demonstrate that LatentDiff achieves superior accuracy while remaining robust to settings where an extremely small fraction of images (from 5% to <1%) differ semantically.
Chinese Translation
我们提出了 LatentDiff,这是一个可扩展的语义数据集比较框架,直接在预训练视觉编码器的潜在空间中操作。通过将基于稀疏自编码器的散度测试与密度比估计相结合,LatentDiff 能够以较低的计算成本识别数据集之间可解释的语义差异,远低于基于文本说明的替代方法。我们还引入了 Noisy-Diff,这是一个捕捉现实稀疏分布变化的基准测试,这种变化使现有方法面临挑战。实验表明,LatentDiff 在实现更高准确率的同时,对图像中仅有极小比例(从 5% 到 <1%)在语义上存在差异的情况保持稳健。
cs.CV / 21 / 2605.00901
RA-CMF: Region-Adaptive Conditional MeanFlow for CT Image Reconstruction
RA-CMF:区域自适应条件均值流用于CT图像重建
Abstract
The use of CT imaging is important for screening, diagnosis, therapy planning, and prognosis of lung cancers. Unfortunately, due to differences in imaging protocols and scanner models, CT images acquired by different means may show large differences in noise statistics, contrast, and texture. In this study, we develop a novel conditional MeanFlow pipeline for CT image reconstruction. We introduce a conditional MeanFlow network that models the enhancement trajectory by predicting image-conditioned flow fields given intermediate image states. The image enhancement network is trained with a MeanFlow consistency loss along with the image reconstruction loss. In order to provide an adaptive refinement process in terms of spatial location of enhancements, we integrate a regional reinforcement learning-driven policy network into our approach. The policy network receives information about the MeanFlow rollouts and provides predictions in terms of tile-wise refinement budgets, stopping criteria, and total budget allocation of enhancement processes. Our policy network is trained through reinforcement learning in a policy gradient framework, where the goal of the training reward is to maximize improvement of enhancements while minimizing unnecessary computations and avoiding instabilities. In this way, our approach combines conditional flow-based enhancement with reinforcement learning-based spatial enhancement control. This allows our approach to focus more attention on enhancing difficult areas while stabilizing areas already showing sufficient quality. Our results show high accuracy in the tumor ROI, with the average radiomic feature CCC being 0.96, an average PSNR of 31.30 $\pm$ 4.16, and average SSIM of 0.94 $\pm$ 0.07. Moreover, there is an improvement in the overall quality of images, with an average PSNR of 34.23 $\pm$ 1.71 and average SSIM of 0.95 $\pm$ 0.01.
Chinese Translation
CT成像在肺癌的筛查、诊断、治疗计划和预后中具有重要意义。遗憾的是,由于成像协议和扫描仪模型之间的差异,通过不同方式获取的CT图像在噪声统计、对比度和纹理上可能会表现出很大的差异。在本研究中,我们开发了一种新颖的条件均值流(MeanFlow)管道以进行CT图像重建。我们引入了一种条件均值流网络,通过预测给定中间图像状态的图像条件流场来建模增强轨迹。图像增强网络通过均值流一致性损失与图像重建损失进行训练。为了在增强的空间位置方面提供自适应的细化过程,我们将一个区域强化学习驱动的策略网络集成到我们的方法中。策略网络接收有关均值流推演的信息,并在瓦片细化预算、停止准则和增强过程的总预算分配方面提供预测。我们的策略网络通过强化学习在策略梯度框架下进行训练,训练奖励的目标是最大化增强的改善,同时最小化不必要的计算并避免不稳定性。通过这种方式,我们的方法将基于条件流的增强与基于强化学习的空间增强控制相结合。这使我们的方法能够更加关注于增强困难区域,同时稳定那些已经显示出足够质量的区域。我们的结果表明肿瘤感兴趣区域(ROI)的高准确性,平均放射特征的CCC为0.96,平均峰值信噪比(PSNR)为31.30 ± 4.16,平均结构相似性指数(SSIM)为0.94 ± 0.07。此外,图像的整体质量有所改善,平均PSNR为34.23 ± 1.71,平均SSIM为0.95 ± 0.01。
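Two of the metrics reported in this abstract can be written out directly. This is a minimal sketch under two assumptions: that "CCC" denotes Lin's concordance correlation coefficient, and that images are flattened to lists of intensities scaled to a known range. The function names `psnr` and `concordance_cc` are hypothetical; SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


def concordance_cc(x, y):
    """Lin's CCC: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# A uniform error of 0.1 on a [0, 1] image gives about 20 dB.
example_psnr = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])

# A constant offset leaves Pearson correlation at 1 but lowers the CCC.
example_ccc = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The second example shows why CCC is a common choice for radiomic feature agreement: it penalizes systematic bias between the two sets of feature values, not only a lack of correlation.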
cs.CV / 22 / 2605.00902
Validation of Whole-Slide Foundation Models for Image Retrieval in TCGA Data
TCGA数据中图像检索的全幻灯片基础模型验证
Abstract
Foundation models are reshaping computational histopathology, yet their value for whole-slide image retrieval relative to strong patch-based and supervised aggregation baselines remains unclear. We benchmarked ten pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses from The Cancer Genome Atlas (TCGA) using patient-level leave-one-patient-out evaluation. Methods included four pre-trained slide foundation models, a supervised attention-based multiple instance learning (ABMIL) aggregator on patch embeddings, and patch-level retrieval across five sampling densities. Performance varied more across organs and diagnoses than across architectures. Although the slide foundation model TITAN achieved the strongest overall results, its advantage was modest; ABMIL and patch-based methods reached comparable Top-1 and Top-3 accuracy, with no model consistently dominant. Morphologically distinctive entities approached ceiling performance, while rare, heterogeneous, and closely related subtypes remained challenging. Misclassifications aligned with organs exhibiting known inter-observer variability, suggesting an intrinsic ceiling for morphology-only retrieval. Performance was driven primarily by patch-level feature representations, with limited benefit from slide-level aggregation, indicating aggregation may be unnecessary in many settings. These findings argue against a universally optimal architecture and instead support organ-resolved benchmarking, diagnosis-aware or ensemble strategies, stronger feature representations, and multimodal retrieval frameworks. Notably, even the best model achieved only $\approx 68\% \pm 21\%$ retrieval accuracy on TCGA, and some subtypes showed $0\%$ accuracy across all methods, highlighting fundamental limitations of morphology-based representations and the need for substantial progress before reliable clinical deployment.
Chinese Translation
基础模型正在重塑计算机组织病理学,但相对于强大的基于补丁和监督聚合基线,其在全幻灯片图像检索中的价值尚不明确。我们在来自癌症基因组图谱(TCGA)的9,387个诊断幻灯片中,以患者级别的留一患者法评估了十个管道,这些幻灯片涵盖了17个器官和60种诊断。方法包括四个预训练的幻灯片基础模型,一个基于补丁嵌入的监督注意力多实例学习(ABMIL)聚合器,以及在五种采样密度下的补丁级别检索。不同器官和诊断之间的表现变异大于不同架构之间的差异。尽管幻灯片基础模型TITAN取得了最佳的整体结果,但其优势有限;ABMIL和基于补丁的方法在Top-1和Top-3的准确性上达到了可比的结果,没有哪个模型始终表现突出。形态上具有显著特征的实体接近性能上限,而稀有、异质且相关性较近的亚型则依然具有挑战性。误分类与已知存在观察者变异的器官一致,表明仅依赖形态学的检索存在内在上限。性能主要受补丁级特征表示的驱动,而在幻灯片级聚合中的收益有限,表明在许多情况下聚合可能是不必要的。这些发现反对普遍最佳架构的存在,支持器官分解的基准测试、基于诊断的或集成的策略、更强的特征表示,以及多模态检索框架。值得注意的是,即使是最佳模型在TCGA上的检索准确率也仅为约68% ± 21%,一些亚型在所有方法中显示出0%的准确率,这突显了基于形态学的表示的基本局限性,以及在可靠的临床部署之前亟需显著进展。
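The patient-level leave-one-patient-out protocol named in this abstract can be sketched as a Top-k retrieval loop. This is an illustration only: the slide records, the two-dimensional embeddings, the dot-product similarity, and the function name `top_k_accuracy` are hypothetical, and the benchmark's actual scoring rule may differ.

```python
def dot(u, v):
    """Dot-product similarity (an assumed stand-in for the real metric)."""
    return sum(a * b for a, b in zip(u, v))


def top_k_accuracy(slides, similarity, k=1):
    """Leave-one-patient-out Top-k retrieval accuracy over all slides."""
    hits = 0
    for query in slides:
        # Exclude every slide of the query's patient, not only the query itself.
        candidates = [s for s in slides if s["patient"] != query["patient"]]
        ranked = sorted(candidates,
                        key=lambda s: similarity(query["vec"], s["vec"]),
                        reverse=True)
        if any(s["diagnosis"] == query["diagnosis"] for s in ranked[:k]):
            hits += 1
    return hits / len(slides)


# Hypothetical slide records; diagnosis "C" occurs in a single patient.
slides = [
    {"patient": "P1", "diagnosis": "A", "vec": [1.0, 0.0]},
    {"patient": "P1", "diagnosis": "A", "vec": [1.0, 0.1]},
    {"patient": "P2", "diagnosis": "A", "vec": [0.9, 0.2]},
    {"patient": "P3", "diagnosis": "B", "vec": [0.0, 1.0]},
    {"patient": "P4", "diagnosis": "B", "vec": [0.1, 0.9]},
    {"patient": "P5", "diagnosis": "C", "vec": [0.7, 0.7]},
]
accuracy = top_k_accuracy(slides, dot, k=1)
```

Filtering by patient rather than by slide prevents a second slide from the same patient from counting as a correct match, which would otherwise inflate the score.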
cs.CV / 23 / 2605.00903
A Light Weight Multi-Features-View Convolution Neural Network For Plant Disease Identification
一种轻量级多特征视角卷积神经网络用于植物病害识别
Abstract
Agriculture is a key sector of the economies of developing countries. It serves as a primary source of income and employment for rural populations. However, each year, a large portion of crops is lost to pests and diseases. Timely prediction of plant diseases is crucial to sustainable, high-quality agricultural production. Detecting plant diseases through conventional methods is both labour-intensive and time-consuming. Researchers have developed automated techniques based on image classification for this purpose. The most accurate methods are based on deep convolutional neural networks, which are computationally intensive, with many layers and millions of trainable parameters. In resource-constrained settings, especially in rural areas, it is difficult to deploy deep convolutional neural network models for efficient plant disease identification. To address these issues, an efficient and light-weight Multi-View Convolutional Neural Network is proposed. The additional feature views it uses help the proposed model identify plant diseases accurately and efficiently with fewer parameters. The proposed model is tested on the benchmark PlantVillage dataset and achieves an improvement of $2.9\%$ in classification accuracy compared to the baseline convolutional neural network model, which was trained only on Red, Green, and Blue (RGB) plant images. Compared with state-of-the-art deep convolutional neural network models, the proposed model is less computationally expensive and achieves comparable accuracy for plant disease identification on the PlantVillage dataset.
Chinese Translation
农业是发展中国家经济的关键部门,为农村人口提供了主要的收入和就业来源。然而,每年都有大量农作物因害虫和疾病而浪费。及时预测植物病害对可持续、高质量的农业生产至关重要。通过传统方法检测植物病害既费力又耗时。研究人员为此开发了基于图像分类的自动化技术。最准确的方法基于深度卷积神经网络,这些网络计算量庞大,层数多且可训练参数达到数百万。在资源有限的环境中,尤其是在农村地区,部署深度卷积神经网络模型以实现高效的植物病害识别十分困难。为了解决这些问题,提出了一种高效且轻量级的多视图卷积神经网络(Multi-View Convolutional Neural Network)。这些额外特征帮助所提模型以较少的参数准确有效地识别植物病害。所提模型在基准的PlantVillage数据集上进行了测试,相比仅基于红色、绿色和蓝色(RGB)植物图像训练的基线卷积神经网络模型,分类准确率提高了2.9%。与最先进的深度卷积神经网络模型相比,所提模型计算成本较低,并且在PlantVillage数据集上实现了可比的植物病害识别准确率。
cs.CV / 24 / 2605.00904
Robustness of Transformer-Based Fluence Map Prediction Under Clinically Realistic Perturbations
基于变换器的流量图预测在临床现实扰动下的鲁棒性
Abstract
Learning-based fluence map prediction offers a fast alternative to iterative inverse planning in intensity-modulated radiation therapy (IMRT), but its robustness under realistic distribution shifts remains unclear. We study a two-stage transformer pipeline that maps anatomy (CT and contours) to dose and then to beamlet fluence maps. We compare fluence-stage transformer backbones with hierarchical, global, and hybrid attention, trained with a physics-informed loss enforcing energy consistency. Robustness is evaluated under geometric perturbations, radiometric noise, reduced training data, and domain shifts using a prostate IMRT dataset, with additional evaluation of the dose stage on public datasets. Results show smooth degradation under moderate perturbations but sharp failures under severe rotations and noise. Hierarchical transformers (e.g., SwinUNETR) exhibit slower growth in upper-quartile energy error, indicating improved robustness. We further show that SSIM alone fails to capture clinically relevant errors, highlighting the need for physics-informed evaluation.
Chinese Translation
基于学习的流量图预测为强度调制放疗(IMRT)中的迭代逆规划提供了快速的替代方案,但其在现实分布变化下的鲁棒性仍然不明确。我们研究了一种两阶段的变换器流程,将解剖结构(CT和轮廓)映射到剂量,再映射到束流流量图。我们比较了流量阶段的变换器骨干网络,采用层次化、全局和混合注意力机制,训练时采用物理信息损失以强制执行能量一致性。在几何扰动、辐射噪声、减少训练数据和领域转移的情况下,我们使用前列腺IMRT数据集对鲁棒性进行了评估,并对公开数据集的剂量阶段进行了额外评估。结果表明,在中等扰动下出现平滑退化,但在严重的旋转和噪声下则表现出明显的失效。层次化变换器(例如,SwinUNETR)在上四分位能量误差的增长速度较慢,表明其鲁棒性有所改善。我们进一步证明,仅依赖结构相似性指标(SSIM)无法捕捉临床相关错误,强调了对物理信息评估的需求。
cs.CV / 25 / 2605.00906
Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models
领域转移下的广义类别发现:从视觉到视觉-语言模型
Abstract
Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real-world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self-supervised vision models to vision-language models. (i) HiLo disentangles domain and semantic features through multi-level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic-aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision-language models via factorized textual prompts and cross-modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real-world multi-domain shifts demonstrate consistent improvements over strong baselines. Project page: https://visual-ai.github.io/hilo/
Chinese Translation
广义类别发现(Generalized Category Discovery,GCD)旨在通过从已知类别的标注数据转移知识,对来自已知和未知类别的未标注实例进行分类。现有方法假设所有数据来自单一领域,然而现实世界中的未标注数据往往在语义转移的同时表现出领域转移。我们研究了领域转移下的GCD,并提出了三种框架,适用于从自监督视觉模型到视觉-语言模型的基础模型适配。(i) HiLo通过多层次特征提取和互信息最小化来解耦领域和语义特征,并结合PatchMix数据增强和课程采样。(ii) HLPrompt在HiLo基础上扩展了语义感知的空间提示调优,以抑制背景和领域噪声。(iii) VLPrompt通过分解的文本提示和跨模态一致性正则化,利用视觉-语言模型。这三种方法共享核心设计原则,同时在不同的基础架构上运行,适合不同的部署场景。在合成污染和现实世界多领域转移的广泛实验中,显示出了相对于强基线的一致性提升。项目页面:https://visual-ai.github.io/hilo/
cs.CV / 26 / 2605.00907
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate:评估交通领域大型模型的开放多模态基准
Abstract
Large language models (LLMs) and multimodal large models (MLLMs) are increasingly used for transportation tasks such as regulation question answering, traffic management support, engineering review, and autonomous-driving scene reasoning. Yet transportation workflows are rule-intensive, computation-intensive, safety-critical, and inherently multimodal. Existing general benchmarks provide limited evidence of whether a model can apply regulations correctly, perform verifiable engineering calculations, or interpret traffic scenes reliably, while the small number of public transportation benchmarks remain narrow in scope and rarely support fine-grained diagnosis across text, images, and point-cloud data. To address this gap, we present TRIP-Evaluate, an open multimodal benchmark for large models in transportation. The benchmark organizes 837 items using a role-task-knowledge taxonomy that covers vehicle, traffic-management, traveler, and planning-and-design functions. Each item is annotated with capability, modality, and difficulty labels, enabling diagnosis from overall accuracy down to specific failure modes. The current release includes 596 text items, 198 image items, and 43 point-cloud items. TRIP-Evaluate also standardizes item construction, quality control, prompting, decoding, and scoring to improve cross-model comparability. Results on a diverse panel of models show that text-based performance is improving, but substantial weaknesses remain in multi-step engineering calculation, rule-constrained reasoning, multimodal scene understanding, and point-cloud understanding. Overall, TRIP-Evaluate provides a reproducible, diagnosable, and engineering-aligned evaluation baseline for model selection, regression testing, and safer deployment in transportation applications.
Chinese Translation
大型语言模型(LLMs)和多模态大型模型(MLLMs)越来越多地用于交通任务,如规章问答、交通管理支持、工程审查和自动驾驶场景推理。然而,交通工作流程具有规则密集、计算密集、安全关键和固有的多模态特性。现有的通用基准提供的证据有限,无法准确判断模型是否能够正确应用规章、执行可验证的工程计算或可靠地解读交通场景,而少数公开的交通领域基准范围狭窄,且很少支持对文本、图像和点云数据的细粒度诊断。为了填补这一空白,我们提出了 TRIP-Evaluate,一个用于交通领域大型模型的开放多模态基准。该基准利用角色-任务-知识分类法组织了837个项目,涵盖车辆、交通管理、旅客以及规划与设计功能。每个项目都标注了能力、模态和难度标签,从而支持从整体准确性到具体故障模式的诊断。当前版本包括596个文本项目、198个图像项目和43个点云项目。TRIP-Evaluate 还标准化了项目构建、质量控制、提示、解码和评分,提高了跨模型的可比性。多样化模型的结果表明,基于文本的性能正在提升,但在多步骤工程计算、规则约束推理、多模态场景理解和点云理解方面仍存在显著弱点。总体而言,TRIP-Evaluate 提供了一个可重复、可诊断且与工程对齐的评估基线,适用于模型选择、回归测试和更安全的交通应用部署。
cs.CV / 27 / 2605.00908
Comparative Evaluation of Convolutional and Transformer-Based Detectors for Automated Weed Detection in Precision Agriculture
基于卷积和变换器的自动化杂草检测器的比较评估在精准农业中的应用
Abstract
This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in realistic scenarios. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and transformer-based approaches such as RTDETR and RF-DETR. Experiments were conducted on the GROUNDBASED_WEED dataset, allowing performance to be evaluated in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.
Chinese Translation
本文对基于卷积和变换器的物体检测架构在真实场景下的早期杂草检测进行了比较评估。考虑了每种范式的代表性模型,包括YOLOv26-nano,这是一种YOLO家族的最新变体,以及基于变换器的方法,如RTDETR和RF-DETR。实验在GROUNDBASED_WEED数据集上进行,从而使用精准度、召回率、平均精准度和推理速度等指标评估检测精度和计算效率。结果突显了效率与上下文建模之间的明显权衡:基于CNN的检测器在较低的计算成本下实现了高性能,而基于变换器的方法则在更高资源需求的情况下提供了更好的全局上下文捕获。这些结果为精准农业应用中的模型选择提供了实用的标准。
cs.CV / 28 / 2605.00911
When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation
当优秀的光学字符识别不足以满足需求:针对检索增强生成的光学字符识别鲁棒性基准测试
Abstract
Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.
Chinese Translation
工业检索增强生成(RAG)系统依赖光学字符识别(OCR)将视觉文档转换为文本。现有的OCR基准依赖字符级指标,这不足以在现实环境中衡量下游RAG的有效性。我们介绍了一项针对工业RAG系统的OCR基准,涵盖了11种具有挑战性的文档类型,包括极端布局、高分辨率页面、复杂或带水印的背景、具有非标准阅读顺序的历史文档、视觉装饰文本以及包含表格和数学公式的文档。在受控的OCR优先RAG管道下评估最新的最先进(SOTA)OCR模型显示,尽管在传统基准测试中取得良好分数,但在现实的工业文档上性能显著下降。我们的研究发现,高OCR准确率并不一定转化为强大的下游RAG性能:结构性和语义错误可能会导致显著的检索失败,即使词错误率(WER)和字符错误率(CER)保持较低。进一步分析表明,这种不匹配是类别依赖的,源于检索端和下游生成端的双重失败,并且在代表性的OCR优先管道选择中保持稳定。该基准可在https://github.com/Qihoo360/InduOCRBench上公开获取。
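The character-level and word-level metrics this abstract refers to can be computed from an edit distance. This is a minimal sketch with a hypothetical example string and function names (`edit_distance`, `cer`, `wer`); it is not taken from the benchmark's code. It shows how a single misread character yields a low CER while changing a value that a downstream RAG answer would depend on.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Hypothetical OCR output: one digit misread in an otherwise perfect line.
reference = "dose 10 mg daily"
ocr_output = "dose 70 mg daily"
example_cer = cer(reference, ocr_output)
example_wer = wer(reference, ocr_output)
```

Both scores look acceptable in aggregate, yet a query about the dosage would retrieve or generate the wrong number, which is the kind of mismatch between OCR accuracy and RAG usefulness the abstract reports.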
cs.CV / 29 / 2605.00912
Object-Level Explanations for Image Geolocation Models: a GeoGuessr use-case
图像地理定位模型的对象级解释:以 GeoGuessr 为例
Abstract
When humans play geolocation games such as GeoGuessr, they rely on concrete visual cues, such as road markings, vegetation, or architectural details, to infer where an image was captured. Whether image geolocation models rely on similar object-level evidence remains difficult to determine, as attribution methods like Grad-CAM typically highlight diffuse regions rather than coherent visual entities, making it difficult to link model predictions to specific objects or perceptible patterns. In this work, we propose an object-centric analysis pipeline to investigate the visual evidence used by geolocation models. Starting from attribution maps, we extract salient regions and segment them into object-like elements. We evaluate their predictive relevance through deletion and insertion tests, comparing attribution-guided crops to randomly selected regions with similar coverage. Experiments on a three-country benchmark show that attribution-guided crops consistently retain more information for the model's prediction than random crops. These results suggest that attribution maps can be decomposed into interpretable, perceptible elements, providing a step toward object-level analysis of geolocation models.
Chinese Translation
当人类玩地理定位游戏如 GeoGuessr 时,他们依赖于具体的视觉线索,例如路标、植被或建筑细节,以推测图像的拍摄地点。图像地理定位模型是否也依赖于类似的对象级证据仍然难以确定,因为像 Grad-CAM 这样的归因方法通常突出的是模糊区域,而不是一致的视觉实体,这使得将模型预测与特定对象或可感知模式联系起来变得困难。在本研究中,我们提出了一种以对象为中心的分析流程,以调查地理定位模型所使用的视觉证据。我们从归因图开始,提取显著区域并将其分割为类似对象的元素。通过删除和插入测试,我们评估了这些元素的预测相关性,将归因引导的裁剪与覆盖范围相似的随机选定区域进行比较。在一个三国基准测试中的实验表明,归因引导的裁剪在模型预测中始终保留了比随机裁剪更多的信息。这些结果表明,归因图可以分解为可解释的、可感知的元素,为地理定位模型的对象级分析迈出了重要一步。
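The deletion test mentioned in the abstract above can be sketched on a toy example: remove the most-attributed pixels, remove a random set with the same coverage, and compare how much the model score drops. The `toy_model`, the flat ten-pixel "image", and the attribution values are all hypothetical stand-ins, not the paper's pipeline.

```python
import random


def top_k_mask(attribution, coverage):
    """Indices of the most-attributed pixels covering the given fraction of the image."""
    k = max(1, round(coverage * len(attribution)))
    ranked = sorted(range(len(attribution)), key=lambda i: attribution[i], reverse=True)
    return set(ranked[:k])


def random_mask(n_pixels, k, rng):
    """A random set of k pixel indices (same coverage as the guided mask)."""
    return set(rng.sample(range(n_pixels), k))


def delete(image, mask, fill=0.0):
    """Replace masked pixels with a fill value."""
    return [fill if i in mask else v for i, v in enumerate(image)]


def toy_model(image):
    # Hypothetical stand-in for a geolocation model: the score depends only on pixels 2..4.
    return sum(image[2:5]) / 3.0


image = [1.0] * 10
attribution = [0.0, 0.1, 0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.1, 0.0]

rng = random.Random(0)
guided = top_k_mask(attribution, coverage=0.3)
baseline = toy_model(image)
guided_drop = baseline - toy_model(delete(image, guided))
random_drops = [
    baseline - toy_model(delete(image, random_mask(len(image), len(guided), rng)))
    for _ in range(200)
]
mean_random_drop = sum(random_drops) / len(random_drops)
print(guided_drop, mean_random_drop)
```

An insertion test would run the same comparison in reverse, starting from an empty image and adding the selected region back.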
cs.CV / 30 / 2605.00913
Leveraging Imperfect Medical Data: A Manifold-Consistent Spatio-Temporal Network for Sensor-based Human Activity Recognition
利用不完美医疗数据:一种基于传感器的人类活动识别的流形一致时空网络
Abstract
Sensor-based Human Activity Recognition (HAR) has attracted increasing attention in medical and healthcare monitoring, particularly with the growth of Internet of Medical Things (IoMT). However, in real-world wearable sensing scenarios, IoMT signals are often corrupted by missing measurements, sensor failures, and environmental noise, which significantly degrade the performance of conventional deep learning models that assume clean and complete inputs. To address this challenge, we propose a Manifold-Consistent Spatio-Temporal Network (MCSTN) for robust HAR under imperfect sensing conditions. The proposed framework introduces a dual-level corruption modeling mechanism that simulates realistic sensor imperfections through both physical-level corruption and diffusion-driven continuous corruption. By enforcing representation consistency across multiple corrupted views, the model learns stable and corruption-invariant semantic representations. Furthermore, we design a dual-stream spatio-temporal architecture that explicitly decouples temporal dynamics modeling and spatial correlation learning. The temporal stream captures long-term activity dynamics, while the spatial stream models inter-sensor relationships, enabling more effective spatio-temporal representation learning. Extensive experiments on three widely used HAR benchmark datasets, PAMAP2, Opportunity, and WISDM, demonstrate that the proposed MCSTN achieves competitive performance compared with existing state-of-the-art methods, particularly under imperfect sensing conditions. These results validate the effectiveness and robustness of the proposed framework for real-world wearable IoMT sensing applications.
Chinese Translation
基于传感器的人类活动识别(HAR)在医疗和健康监测中引起了越来越多的关注,尤其是在医疗物联网(IoMT)发展的背景下。然而,在实际的可穿戴传感场景中,IoMT信号常常受到缺失测量、传感器故障和环境噪声的干扰,这显著降低了假设输入数据清晰和完整的传统深度学习模型的性能。为了解决这一挑战,我们提出了一种流形一致时空网络(MCSTN),旨在在不完美的感知条件下实现鲁棒的HAR。所提框架引入了双层腐败建模机制,通过物理层腐败和扩散驱动的持续腐败模拟现实的传感器缺陷。通过强制在多个腐败视图之间保持表示一致性,模型学习到稳定且不受腐败影响的语义表示。此外,我们设计了一种双流时空架构,明确解耦时间动态建模和空间相关性学习。时间流捕捉长期活动动态,而空间流建模传感器之间的关系,从而实现更有效的时空表示学习。在三个广泛使用的HAR基准数据集PAMAP2、Opportunity和WISDM上进行的广泛实验表明,所提出的MCSTN在不完美感知条件下相较于现有的最新方法取得了竞争力的表现。这些结果验证了所提框架在现实世界可穿戴IoMT感知应用中的有效性和鲁棒性。
cs.CV / 31 / 2605.00915
Rethink MAE with Linear Time-Invariant Dynamics
重新思考具有线性时不变动力学的MAE
Abstract
Standard representation probing for visual models relies on mathematically permutation-invariant operations like Global Average Pooling (GAP) or CLS tokens, treating patch representations as an unstructured bag-of-words. We challenge this paradigm by demonstrating that token order is a critical, exploitable dimension in frozen visual representations (e.g., MAE, BEiT, DINOv2, and ViT as CLS-ablation extreme). We propose SSMProbe, a probing framework driven by a State Space Model (SSM). Operating as discrete Linear Time-Invariant (LTI) dynamical systems, SSMs act as permutation-sensitive probes where sequence order strictly dictates the final state due to inherent memory decay. Formulating token ordering as an information scheduling problem, we compare fixed scan heuristics against a differentiable soft permutation (Sinkhorn-based) learned from downstream supervision. Evaluations on standard and fine-grained classification benchmarks reveal a striking order gap: while fixed scans fail dramatically on highly localized patch features, our learned soft permutation successfully extracts highly competitive performance from otherwise heavily localized patch sequences. We find that pre-training objectives fundamentally shape token structure: DINOv2 concentrates global semantics in optimized CLS tokens leaving patches hyperspecialized, pure MAE preserves distributed representations with heterogeneous patch informativeness, and ViT represents a supervised CLS-dominated extreme. BEiT occupies middle ground. This heterogeneity is order-dependent -- meaning the SSM probe's performance depends critically on which tokens are placed at which temporal positions -- and is not merely a topological property of the spatial grid. SSMProbe's learned routing effectively discovers and exploits this heterogeneity, offering a powerful new diagnostic lens for visual representation analysis.
Chinese Translation
标准的视觉模型表示探测依赖于诸如全局平均池化(Global Average Pooling, GAP)或CLS标记等数学上可置换不变的操作,将图像块表示视为一种无结构的词袋。我们挑战这一范式,证明在冻结的视觉表示(如MAE、BEiT、DINOv2以及作为CLS去除极端情况的ViT)中,标记顺序是一个关键的、可利用的维度。我们提出了SSMProbe,一个由状态空间模型(State Space Model, SSM)驱动的探测框架。作为离散线性时不变(Linear Time-Invariant, LTI)动力学系统,SSM充当对置换敏感的探测器,其中序列顺序严格决定最终状态,这归因于固有的记忆衰退。将标记排序形式化为信息调度问题,我们比较了固定扫描启发式与从下游监督中学习的可微软置换(基于Sinkhorn的)的表现。在标准和细粒度分类基准上的评估揭示了一个显著的顺序差距:尽管固定扫描在高度局部化的图像块特征上表现不佳,我们学习的软置换却成功地从其他高度局部化的图像块序列中提取出具有竞争力的性能。我们发现预训练目标在根本上塑造了标记结构:DINOv2将全局语义集中在优化后的CLS标记上,使得图像块高度专业化,纯MAE保留了分布式表示,具有异质图像块信息量,ViT表现为以监督CLS为主导的极端情况。BEiT则占据中间地带。这种异质性是依赖于顺序的——即SSM探测器的性能在于哪些标记被置于何种时间位置——而不仅仅是空间网格的拓扑属性。SSMProbe学习的路由有效地发现并利用这种异质性,为视觉表示分析提供了一种强有力的新诊断视角。
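The contrast between permutation-invariant pooling and an order-sensitive LTI probe, as described in the abstract above, can be shown with a scalar recurrence. This is a minimal sketch assuming scalar tokens and hand-picked coefficients `a` and `b`; it is not the SSMProbe architecture.

```python
def mean_pool(tokens):
    """Permutation-invariant readout: the order of tokens does not matter."""
    return sum(tokens) / len(tokens)


def lti_final_state(tokens, a=0.5, b=1.0):
    """Final state of the scalar discrete LTI system h_t = a * h_{t-1} + b * x_t."""
    h = 0.0
    for x in tokens:
        h = a * h + b * x
    return h


tokens = [1.0, 2.0, 3.0]
reversed_tokens = tokens[::-1]

print(mean_pool(tokens), mean_pool(reversed_tokens))              # 2.0 2.0
print(lti_final_state(tokens), lti_final_state(reversed_tokens))  # 4.25 2.75
```

Because `|a| < 1`, earlier tokens are discounted geometrically, so the final state weights late tokens most heavily. That memory decay is what makes the choice of token order matter.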
cs.CV / 32 / 2605.00916
SAMamba3D: adapting Segment Anything for generalizable 3D segmentation of multiphase pore-scale images
SAMamba3D:为多相孔隙尺度图像的可泛化3D分割适配Segment Anything
Abstract
Reliable segmentation of multiphase pore-scale X-ray images of rocks is necessary to quantify fluid saturation, connectivity, and interfacial geometry. However, current 3D segmentation methods are typically dataset-specific, requiring retraining or extensive fine-tuning whenever rock type, fluid pattern, scanner, or acquisition conditions change. Foundation models such as the Segment Anything Model (SAM) provide strong 2D boundary priors, but they are not directly applicable to 3D data. We present SAMamba3D, a parameter-efficient framework that adapts a largely frozen SAM encoder to generalizable 3D pore-scale segmentation by coupling it with Mamba-based volumetric context modeling and progressive cross-scale feature interaction. For sandstone and carbonate datasets, with different fluids, wettability, and scanning conditions, SAMamba3D matches or outperforms current 3D baselines while reducing the need for case-specific retraining. The resulting segmented images preserve physically meaningful descriptors, including fluid saturation, connectivity, and interface morphology, enabling more reliable and rapid analysis of large 3D multiphase images.
Chinese Translation
可靠的多相孔隙尺度岩石X射线图像分割对于量化流体饱和度、连通性和界面几何形状是必要的。然而,目前的3D分割方法通常是特定于数据集的,要求在岩石类型、流体模式、扫描仪或采集条件发生变化时进行重新训练或广泛微调。基础模型如Segment Anything Model (SAM)提供了强大的2D边界先验,但它们并不直接适用于3D数据。我们提出了SAMamba3D,这是一个参数高效的框架,通过将大部分冻结的SAM编码器与基于Mamba的体积上下文建模和渐进的跨尺度特征交互结合,适配为可泛化的3D孔隙尺度分割。在不同流体、润湿性和扫描条件下的砂岩和碳酸盐数据集上,SAMamba3D的性能与当前的3D基线相匹配或超过,同时减少了针对具体案例的重新训练需求。最终生成的分割图像保留了物理意义明确的描述符,包括流体饱和度、连通性和界面形态,从而能够更可靠和快速地分析大型3D多相图像。
cs.CV / 33 / 2605.00960
Energy-Based Constraint Networks: Learning Structural Coherence Across Modalities
基于能量的约束网络:跨模态学习结构一致性
Abstract
We introduce energy-based constraint networks -- a modality-agnostic architecture that learns structural coherence from contrastive pairs. The system processes frozen encoder embeddings through a state-space model with dual-head attention, producing a scalar energy measuring structural consistency alongside per-position energy scores that localize violations. Multiple independently trained branches detect different violation types and compose at inference without interference. We demonstrate the framework in two domains. In text, the system achieves 93.4% accuracy on trained corruption types and 87.2% on 9 unseen types, using frozen BERT and 7.4M trainable parameters. In vision, the same architecture achieves competitive deepfake detection: 0.959 AUC on FaceForensics++ Deepfakes and 0.870 on Celeb-DF without any Celeb-DF training data, using frozen DINOv2 and 3.6M parameters per branch. The framework supports flexible training: branches learn from designer-specified corruptions, real-world paired data, or both. Composable branches require representation compatibility -- a finding validated through extensive experimentation where five incompatible approaches failed before the compatible one succeeded. The architecture is encoder-agnostic and domain-agnostic: changing the domain requires only new corruption strategies; changing the encoder requires only a new input projection layer. To our knowledge, this is the first architecture to learn within-modality structural coherence as an explicit energy landscape with per-position decomposition, and to demonstrate that the same architecture transfers across modalities via corruption respecification alone.
Chinese Translation
我们介绍了一种基于能量的约束网络——一种模态无关的架构,能够从对比对中学习结构一致性。该系统通过具有双头注意力的状态空间模型处理冻结的编码器嵌入,生成一个测量结构一致性的标量能量以及定位违规行为的位置能量得分。多个独立训练的分支检测不同类型的违规行为,并在推理时无干扰地组合。我们在两个领域中演示了该框架。在文本领域,该系统在训练的损坏类型上取得了93.4%的准确率,在9种未见类型上则为87.2%,使用了冻结的BERT和740万个可训练参数。在视觉领域,相同的架构在深度伪造检测上表现出竞争力:在FaceForensics++ Deepfakes上取得了0.959的AUC,在Celeb-DF上取得了0.870的AUC,并且没有使用任何Celeb-DF训练数据,使用了冻结的DINOv2和每个分支360万个参数。该框架支持灵活的训练:分支可以从设计者指定的损坏、实际配对数据或两者中学习。可组合的分支要求表示的兼容性——这一发现通过广泛实验得到验证,其中五种不兼容的方法失败,而兼容的方法成功。该架构是编码器无关和领域无关的:改变领域只需新的损坏策略;改变编码器只需新的输入投影层。我们了解到,这是第一次架构以显式的能量景观和逐位置分解的方式学习模态内的结构一致性,并且展示了相同架构通过损坏重新规范化即可跨模态转移。
cs.CV / 34 / 2605.00977
Democratizing the medieval English legal tradition
民主化中世纪英语法律传统
Abstract
The record of the beginning of the most widespread legal system in the world is contained in millions of pages of handwritten text. Most of the records of the first centuries of the Anglo-American legal system are hand-written in a highly abbreviated form of medieval Latin which only a few dozen scholars in the world are trained to read. In this interdisciplinary project, we construct a dataset of 4029 lines of text across 193 medieval criminal and civil cases. We then use the dataset to train an open-source end-to-end pipeline for transcribing these manuscripts. We first train standard neural network architectures for line segmentation and handwriting recognition (R-Blla and CNN+LSTM with CTC decoding, respectively) and show that they can already achieve 79% word accuracy, despite the relatively small training set and the challenge of expanding abbreviations. We then demonstrate that simple post-processing significantly boosts accuracy: adding an n-gram language model to the CTC decoder improves word accuracy to 82%, while asking Gemini Pro 3 to correct mistakes boosts accuracy to 88%. Finally, we compare the CNN+LSTM architecture with TrOCR, a transformer-based OCR architecture, demonstrating that TrOCR shows comparable word accuracy but worse character accuracy due to its over-willingness to guess, making it harder for humans to infer the correct reading. We incorporated our pipeline into a web portal (glyphmachina.com), opening up the English legal tradition to legal scholars, medievalists, and students.
Chinese Translation
世界上最广泛使用的法律系统的起源记录包含在数百万页手写文本中。盎格鲁-美国法律系统的最早几个世纪的记录大多是用高度缩略的中世纪拉丁语手写的,世界上只有几十位学者经过培训能够阅读这些文本。在这个跨学科项目中,我们构建了一个包含193个中世纪刑事和民事案件的4029行文本的数据集。随后,我们使用该数据集训练了一个开源的端到端管道,以转录这些手稿。我们首先训练标准的神经网络架构进行行分割和手写识别(分别为R-Blla和CNN+LSTM结合CTC解码),并表明尽管训练集相对较小且扩展缩略词存在挑战,它们能够达到79%的单词准确率。接着,我们展示了简单的后处理显著提高了准确性:向CTC解码器添加n-gram语言模型将单词准确率提高到82%,而让Gemini Pro 3纠正错误则使准确率提升至88%。最后,我们将CNN+LSTM架构与基于变换器的OCR架构TrOCR进行比较,表明TrOCR的单词准确率相当,但由于过于倾向于猜测,其字符准确率较差,导致人类更难推断出正确的阅读结果。我们将我们的管道整合到一个网络门户(glyphmachina.com)中,使法律学者、中世纪研究者和学生可以接触到英语法律传统。
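For readers unfamiliar with CTC decoding, which the abstract above pairs with the CNN+LSTM recognizer, the simplest variant is greedy (best-path) decoding: collapse repeated frame labels, then drop blanks. The sketch below assumes the per-frame argmax labels are already available and uses a made-up three-letter alphabet. It does not include the n-gram language model the abstract mentions, which would normally be combined with a beam search instead.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse consecutive repeats, then remove the blank symbol."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Hypothetical alphabet and per-frame argmax labels (0 is the CTC blank).
alphabet = {1: "r", 2: "e", 3: "x"}
frames = [0, 1, 1, 0, 2, 2, 0, 0, 3, 3, 3]
decoded = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
print(decoded)  # rex
```

A blank between two identical labels keeps them as separate characters, so `[1, 0, 1]` decodes to two symbols rather than one.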
cs.CV / 35 / 2605.01018
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
WildTableBench:在真实场景中对多模态基础模型进行表格理解的基准评估
Abstract
Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.
Chinese Translation
利用多模态基础模型分析表格图像是消费和企业场景中一个高价值但具有挑战性的应用。尽管其重要性不容忽视,现有的评估主要依赖于结构化文本表格或清晰渲染的图像,而缺少对真实场景中表格图像视觉复杂性的深入研究。这类图像具有多样的布局和丰富的领域,要求具备复杂的结构感知和数值推理能力。为了填补这一空白,我们推出了 WildTableBench,这是首个针对真实环境中自然出现的表格图像进行问答评估的基准。WildTableBench 汇集了来自不同领域的402张高信息密度的表格图像,这些图像从在线论坛和网站收集而来,并附有928个人工标注和验证的问题,涵盖五个类别中的17个子类型。我们在这一基准上评估了21个前沿的专有和开源多模态基础模型。仅有一个模型的准确率超过50%,其余模型的准确率则在4.1%到49.9%之间。我们还进行了诊断分析,以表征模型失败的情况,揭示了结构感知和推理方面的持续弱点。这些结果和分析为当前模型的能力提供了有价值的见解,并将 WildTableBench 建立为表格图像理解的有益诊断基准。
cs.CV / 36 / 2605.01024
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
EmoMM:在冲突和缺失情况下对多模态情感识别的 MLLM 进行基准测试和引导
Abstract
Multimodal Emotion Recognition (MER) is critical for interpreting real-world interactions. While Multimodal Large Language Models (MLLM) have shown promise in MER, their internal decision-making mechanisms under modality conflict and missingness remain largely underexplored. In this paper, to systematically investigate these behaviors, we introduce EmoMM, a comprehensive benchmark featuring modality-aligned, conflict, and missing subsets. Through extensive evaluation, we uncover a Video Contribution Collapse (VCC) phenomenon, where MLLM marginalize video evidence due to high token redundancy and modality preferences. To address this, we propose Conflict-aware Head-level Attention Steering (CHASE), a lightweight mechanism that detects modality conflicts and performs inference-time attention steering, effectively mitigating decision bias without retraining the backbone. Experimental results demonstrate that CHASE consistently improves performance across various settings, significantly enhancing the reliability of MLLM in complex affective scenarios.
Chinese Translation
多模态情感识别 (MER) 对于解读现实世界的互动至关重要。尽管多模态大型语言模型 (MLLM) 在 MER 中展现了良好的前景,但它们在模态冲突和缺失情况下的内部决策机制仍然未被充分探索。为系统地研究这些行为,本文引入了 EmoMM,这是一个综合基准,包含模态对齐、冲突和缺失子集。通过广泛的评估,我们揭示了一种视频贡献衰退 (VCC) 现象,MLLM 由于高令牌冗余和模态偏好而边缘化视频证据。为了解决这个问题,我们提出了基于冲突感知的头级注意力引导 (CHASE),这是一种轻量级机制,它能够检测模态冲突并在推理时进行注意力引导,有效减轻决策偏差而无需重新训练主干模型。实验结果表明,CHASE 在各种设置中始终提高了性能,显著增强了 MLLM 在复杂情感场景中的可靠性。
cs.CV / 37 / 2605.01036
InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
InterPhys:动态场景中的物理感知人类运动合成
Abstract
This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Existing works tend to generate physically unrealistic motions because their contact modeling is limited, typically to the hands. In contrast, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics. Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.
Chinese Translation
本文解决了动态场景中物理感知人类运动合成的问题。与现有工作主要倾向于由于接触建模有限(通常仅限于手部)而生成物理上不现实的运动不同,本文提出了一种物理感知的人类运动生成框架,该框架明确建模了与人有关的各种力的完整谱系,包括人-物体、人-场景和内部身体动态。我们的方法施加了柔性物理约束,以维持力和扭矩的平衡,确保物理基础的运动合成。我们进一步提出了一种新颖的基于连续距离的力模型,将接触建模推广到任意表面,不仅捕捉与静态环境的交互,还能够处理与动态移动物体的互动。大量实验表明,我们的方法显著提高了物理合理性,并且在复杂场景中具有良好的泛化能力,为物理一致的人类运动生成设定了新的基准。
cs.CV / 38 / 2605.01075
Neighbor2Inverse: Self-Supervised Denoising for Low-Dose Region-of-Interest Phase Contrast CT
Neighbor2Inverse:用于低剂量感兴趣区域相位对比CT的自监督去噪
Abstract
Propagation-based X-ray phase-contrast imaging (PBI) enables high-contrast visualization of lung structures and holds strong medical potential. However, safe translation to the clinic will require a substantial radiation dose reduction, which inevitably increases image noise. Supervised convolutional-neural-network-based denoising can restore image quality but depends on paired low- and high-dose datasets, which are rarely available in practice. Self-supervised methods avoid this limitation, yet most are not well adapted to the inverse problem of PBI computed tomography (CT). We introduce Neighbor2Inverse, a self-supervised denoising framework designed for low-dose PBI-CT that generalizes to clinical CT. Building on the Neighbor2Neighbor principle, each noisy projection is subsampled into two variants that preserve structural information but contain independent noise realizations. These are reconstructed separately, and the resulting pairs are used to train a denoising network directly in the image domain. We benchmark the proposed method against established analytical and self-supervised denoising approaches. In region-of-interest PBI CT experiments, Neighbor2Inverse achieves superior noise suppression while preserving fine structural details, as demonstrated by improved contrast-to-noise ratio, spatial resolution, and composite image quality metrics. Competitive performance is also observed on clinical CT data under simulated low-dose conditions. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Code, data, and interactive figures are available at https://github.com/J-3TO/Neighbor2Inverse.
Chinese Translation
基于传播的X射线相位对比成像(PBI)能够高对比度地可视化肺部结构,并具有强大的医学潜力。然而,安全地将其推广到临床需要显著降低辐射剂量,这不可避免地会增加图像噪声。基于监督的卷积神经网络去噪可以恢复图像质量,但依赖于配对的低剂量和高剂量数据集,而这种数据集在实际中很少可得。自监督方法避免了这一限制,但大多数方法并不适应于PBI计算机断层扫描(CT)的逆问题。我们提出了Neighbor2Inverse,一个为低剂量PBI-CT设计的自监督去噪框架,该框架可以推广到临床CT。在Neighbor2Neighbor原则的基础上,将每个噪声投影子采样为两个变体,这两个变体保持结构信息但包含独立的噪声实现。这些变体被单独重建,得到的配对用于在图像领域直接训练去噪网络。我们将所提出的方法与既有的解析和自监督去噪方法进行了基准测试。在感兴趣区域PBI CT实验中,Neighbor2Inverse在保持细微结构细节的同时实现了更优的噪声抑制,改进的对比噪声比、空间分辨率和复合图像质量指标均证明了这一点。同时,在模拟低剂量条件下的临床CT数据上也观察到了具有竞争力的表现。本研究已提交给IEEE以考虑发表。版权可能会在没有通知的情况下转移,此后该版本可能不再可用。代码、数据和互动图形可在https://github.com/J-3TO/Neighbor2Inverse获取。
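The subsampling step that the abstract above builds on can be sketched as follows: for every 2x2 cell of a noisy projection, two different, edge-adjacent pixels are drawn, one for each half-resolution output. This is a minimal sketch of a Neighbor2Neighbor-style sampler under the assumption of even image dimensions. The function name and the toy projection are illustrative, and the reconstruction and network training stages are not shown.

```python
import random


def neighbor_subsample(image, rng):
    """Split a 2D image (even height and width) into two half-resolution images.

    For every 2x2 cell, two different pixels that share an edge are drawn;
    one goes to each output image.
    """
    height, width = len(image), len(image[0])
    # Pairs of (row, col) offsets inside a 2x2 cell that share an edge.
    adjacent_pairs = [((0, 0), (0, 1)), ((0, 0), (1, 0)),
                      ((0, 1), (1, 1)), ((1, 0), (1, 1))]
    sub_a, sub_b = [], []
    for r in range(0, height, 2):
        row_a, row_b = [], []
        for c in range(0, width, 2):
            first, second = rng.choice(adjacent_pairs)
            if rng.random() < 0.5:  # randomise which neighbour goes to which output
                first, second = second, first
            row_a.append(image[r + first[0]][c + first[1]])
            row_b.append(image[r + second[0]][c + second[1]])
        sub_a.append(row_a)
        sub_b.append(row_b)
    return sub_a, sub_b


# Toy 4x4 "projection" whose values encode their own position (10 * row + col).
projection = [[10 * r + c for c in range(4)] for r in range(4)]
a, b = neighbor_subsample(projection, random.Random(0))
print(a)
print(b)
```

Neighbouring pixels see nearly the same structure but, under the usual assumption of pixel-wise independent noise, different noise realizations, which is why the two outputs can serve as a training pair.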
cs.CV / 39 / 2605.01081
WILD SAM: A Simulated-and-Real Data Augmentation for Autonomous Driving Perception under Challenging Weather
WILD SAM:在恶劣天气条件下用于自动驾驶感知的模拟与真实数据增强
Abstract
The performance of state-of-the-art object detectors degrades significantly under adverse weather, causing a safety-critical domain shift problem for autonomous vehicles. Recent efforts address this problem by relying on synthetic data to train the object detectors, which limits their real-world applicability. Meanwhile, pseudo-labeling is widely used for cross-dataset domain adaptation problems. However, these methods have not been exploited by weather-based domain adaptation approaches due to the noisy nature of such labels generated under harsh weather conditions. In this paper, we propose two new approaches to mitigate this weather-induced domain shift. First, we propose a Weather-Induced pseudo Label Denoising (WILD) framework that filters noisy pseudo labels generated by real data captured under adverse weather conditions. Second, we develop a novel hybrid training methodology, WILD SAM, that exploits both pseudo-label denoising and simulation-based training solutions while using real data from the target harsh-weather domain. We validate both proposed approaches, WILD and WILD SAM, on the recently released Four Seasons dataset across rainy and snowy scenarios. Experiments show that the proposed frameworks improve Average Precision (AP) by up to 13% and significantly reduce the weather-induced performance gap relative to the baseline. The code is available at: https://github.com/Kh-Hamed/WILD-SAM
Chinese Translation
最先进的目标检测器在恶劣天气下的性能显著下降,导致自动驾驶车辆面临安全关键的领域转移问题。近期的研究通过依赖合成数据来训练目标检测器来解决此问题,但这限制了其在真实世界中的应用。同时,伪标签技术被广泛应用于跨数据集领域适应问题。然而,由于在恶劣天气条件下生成的标签存在噪声,这些方法并未被气候基础的领域适应方法所利用。本文提出了两种新方法以减缓由天气引起的领域转移。首先,我们提出了一种天气诱导伪标签去噪(WILD)框架,通过过滤在恶劣天气条件下捕获的真实数据生成的噪声伪标签。其次,我们开发了一种新颖的混合训练方法WILD SAM,利用伪标签去噪和基于模拟的训练方案,同时使用来自目标恶劣天气领域的真实数据。我们在最近发布的四季数据集中验证了这两种提议的方法WILD和WILD SAM,涵盖了雨天和雪天场景。实验结果表明,所提框架将平均精度(AP)提高了13%,并显著减少了与基线相比的天气诱导性能差距。代码可在以下链接获取:https://github.com/Kh-Hamed/WILD-SAM
cs.CV / 40 / 2605.01084
Patient-Specific Optimization for Mandibular Reconstruction Planning with Enhanced Bone Union
针对下颌骨重建规划的患者特异性优化与增强骨愈合
Abstract
Mandibular reconstruction with vascularized bone grafts is complicated by donor-host nonunion, and current virtual surgical planning produces a geometric plan rather than a configuration that explicitly promotes bone union. We present OsteoOpt++, an image-to-decision planning loop for patient-specific mandibular reconstruction. A pre-operative computed tomography (CT) is converted into a personalized digital twin through template-to-patient registration and CT-derived updates of the muscle and temporomandibular-joint parameters. Bayesian optimization with an expected-improvement-plus acquisition rule then searches six clinically controllable cut-plane and donor-positioning variables under an apposition-driven objective and a safety-factor-regularized variant. The workflow was evaluated on three generic defects (body, symphysis, and ramus-body) and a total of 3+1 patient-specific cases, with 3 used for optimization and 1 for validation. In the generic cases, against a common surgical approach, cycle-averaged donor-mandible apposition increased by up to 29 percentage points (329% relative); in the patient-specific cases, against the surgeon-implemented day-5 post-operative configuration, by up to 26 percentage points. A 10% sensitivity analysis over eleven modeling parameters capped the change in the apposition-driven objective at 3% for generic cases and 4% for patient-specific cases, and the longitudinal case showed Dice overlap of 0.70 and 0.76 between predicted apposition and year-1 bone formation. Clinically, this provides surgeons with a pre-operative, image-driven recommendation for cut-plane orientation and donor placement that is predicted to improve union conditions over the configurations currently delivered in the operating room. The optimization and patient-specific modeling code is open source at https://github.com/hamidreza-aftabi/OsteoOpt.
Chinese Translation
使用血管化骨移植进行下颌骨重建的过程中,供体与宿主之间的不愈合问题使得这一过程变得复杂,而当前的虚拟手术规划生成的是几何规划,而非明确促进骨愈合的配置。我们提出了OsteoOpt++,这是一个针对患者特定的下颌骨重建的图像到决策规划循环。通过模板到患者的配准以及基于CT的肌肉和颞下颌关节参数更新,术前的计算机断层扫描(CT)被转换为个性化的数字双胞胎。然后,使用期望改进加(expected-improvement-plus)采集规则的贝叶斯优化方法,在一个以接触为驱动目标和安全因子正则化变体下,搜索六个临床可控的切面和供体定位变量。该工作流程在三个通用缺损(身体、合缝和翼状体)和总共4个患者特定病例上进行了评估,其中3个用于优化,1个用于验证。在通用病例中,与常见的外科方法相比,供体下颌骨的周期平均接触增加了多达29个百分点(相对增加329%);在患者特定病例中,与外科医生实施的术后第五天配置相比,接触率增加了多达26个百分点。对十一个建模参数进行了10%的灵敏度分析,限制了通用病例中接触驱动目标变化在3%以内,而患者特定病例中限制在4%以内,纵向病例显示预测接触与一年后骨形成之间的Dice重叠为0.70和0.76。从临床角度来看,这为外科医生提供了一个术前的、基于影像的切面方向和供体放置建议,预计可以改善当前在手术室中实施的配置下的愈合条件。优化及患者特定建模代码可在https://github.com/hamidreza-aftabi/OsteoOpt获得。
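To make the acquisition step in the abstract above more concrete, the sketch below implements the standard expected-improvement (EI) rule for a maximisation problem under a Gaussian posterior. It is not the "expected-improvement-plus" variant the authors name, and the candidate plans with their posterior means and standard deviations are invented for illustration.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Standard EI for maximisation, assuming a Gaussian posterior N(mean, std^2)."""
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + std * pdf


# Hypothetical surrogate predictions (mean, std) of the apposition objective.
candidates = {"plan_a": (0.40, 0.01), "plan_b": (0.38, 0.10)}
best_observed = 0.39

scores = {name: expected_improvement(m, s, best_observed)
          for name, (m, s) in candidates.items()}
next_plan = max(scores, key=scores.get)
print(scores, next_plan)
```

In this toy case the more uncertain candidate wins even though its predicted mean is lower, which illustrates the exploration behaviour that makes EI suitable when each evaluation (here, a biomechanical simulation) is expensive.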
cs.CV / 41 / 2605.01113
Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
有序扩散:防止不适合工作内容生成的文本到图像扩散模型
Abstract
Text-to-image (T2I) diffusion models can generate high-quality images from text prompts, but they pose safety concerns because they can produce offensive or disturbing imagery when given harmful inputs. Existing safety filters typically rely on text-based classifiers or image-based checkers that completely block the output upon detecting a threat, issuing an explicit allow/block feedback signal to the user. This binary strategy leaves models vulnerable to adversarial attacks that alter keywords to bypass detection, and it causes high false-alarm rates that degrade the experience for benign users. To address these vulnerabilities, we propose Disciplined Diffusion (DDiffusion), a novel robust text-to-image diffusion model that counters Not Safe For Work (NSFW) generation by uncovering implicit malicious semantics in prompt embeddings. DDiffusion leverages a semantic retrieval mechanism to evaluate prompts against concept distributions rather than relying on brittle pairwise similarity. Furthermore, it employs a localization method during the diffusion process to selectively edit only the harmful regions of the generated image. By returning locally sanitized images instead of applying uniform blocking, DDiffusion suppresses malicious content while preserving generation fidelity for benign prompts and avoiding the binary allow-deny signal on which existing probing attacks rely.
Chinese Translation
文本到图像(T2I)扩散模型能够根据文本提示生成高质量图片,但由于在输入恶意内容时可能生成令人反感或不安的图像,因此存在安全隐患。现有的安全过滤器通常依赖于基于文本的分类器或基于图像的检查工具,这些工具在检测到威胁时完全阻止输出,并向用户发出明确的允许/阻止反馈信号。这种二元策略使得模型易受对抗性攻击的影响,攻击者通过更改关键词来绕过检测,造成较高的误报率,从而降低了良性用户的体验。为了应对这些脆弱性,我们提出了一种新颖的鲁棒文本到图像扩散模型——有序扩散(DDiffusion),它通过揭示提示嵌入中的隐含恶意语义,来对抗不适合工作内容(NSFW)的生成。DDiffusion 利用语义检索机制,根据概念分布评估提示,而不是依赖脆弱的成对相似度。此外,它在扩散过程中采用了定位方法,仅对生成图像中的有害区域进行选择性编辑。通过返回局部清理的图像,而非进行统一阻止,DDiffusion 在保护良性提示的生成保真度的同时抑制恶意内容,避免了现有探测攻击所依赖的二元允许-拒绝信号。
cs.CV / 42 / 2605.01135
ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text
ScribbleEdit:用于图像编辑的合成数据,结合涂鸦和文本
Abstract
Recent progress in generative models has significantly advanced image editing capabilities, yet precise and intuitive user control remains difficult. Specifically, users often struggle to communicate both exact spatial layouts and specific semantic details simultaneously. While natural language instructions effectively convey high-level semantics like texture and color, they lack spatial specificity. Conversely, freehand scribbles provide rough spatial boundaries but cannot express detailed visual attributes. Consequently, achieving precise control requires combining both modalities. However, existing models struggle to jointly interpret abstract scribbles alongside text due to a lack of specialized training data. In this work, we introduce ScribbleEdit, a large-scale synthetic dataset designed to bridge this gap by combining natural language instructions with freehand scribble inputs for more accurate, controllable edits. We construct this dataset through a synthetic pipeline that automatically generates source-target image pairs via inpainting, which are then paired with human-drawn scribbles and VLM-generated text instructions. Using ScribbleEdit, we evaluate and finetune both diffusion-based and autoregressive unified multimodal image editing models. Our experiments reveal that while off-the-shelf models struggle with abstract scribble inputs, finetuning on our synthetic dataset significantly improves their ability to generate spatially aligned and semantically consistent edits.
Chinese Translation
近年来,生成模型的进展显著提升了图像编辑的能力,但精确和直观的用户控制仍然面临挑战。具体而言,用户常常难以同时传达精确的空间布局和特定的语义细节。虽然自然语言指令能够有效传达诸如纹理和颜色等高层次语义,但它们缺乏空间的具体性。相反,自由手涂鸦提供粗略的空间边界,但无法表达细致的视觉属性。因此,实现精确控制需要将这两种方式结合起来。然而,现有模型在同时理解抽象涂鸦和文本方面存在困难,因为缺乏专业的训练数据。在本研究中,我们提出了ScribbleEdit,这是一个大规模的合成数据集,旨在通过将自然语言指令与自由手涂鸦输入相结合来弥补这一差距,从而实现更准确、可控的编辑。我们通过一个合成管道构建了该数据集,该管道通过修复自动生成源目标图像对,然后将其与人类绘制的涂鸦和VLM生成的文本指令配对。使用ScribbleEdit,我们评估并微调了基于扩散和自回归的统一多模态图像编辑模型。我们的实验揭示,虽然现成模型在处理抽象涂鸦输入时表现不佳,但在我们的合成数据集上进行微调显著提高了其生成空间一致且语义连贯编辑的能力。
cs.CV / 43 / 2605.01144
Semantic Context-aware mOdality fUsion Transformer (SCOUT): A Context-Aware Multimodal Transformer for Concept-Grounded Pathology Report Generation
语义上下文感知模态融合变换器 (SCOUT):一种支持概念基础的病理报告生成的上下文感知多模态变换器
Abstract
Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT: Semantic Context-aware mOdality fUsion Transformer, a context-aware concept-grounded multimodal framework for pathology report generation that enables progressive conditioning of image representations by global slide information and explicit diagnostic concepts. The method integrates local histological patterns, whole-slide context, and expert-curated semantic descriptors within a unified learning paradigm, allowing visual features to be dynamically refined throughout the encoding process. By combining depth-aware contextual modulation with adaptive multimodal fusion during text generation, the framework produces clinically coherent reports while preserving complementarity across representational scales. Using CONCH1.5 features, we evaluate SCOUT against WSI-Caption, HistGen, and BiGen on TCGA-BRCA, MICCAI REG, and HistAI. SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all datasets, plus the best ROUGE-L on TCGA-BRCA and MICCAI REG. On TCGA-BRCA, it reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025, it achieves 0.865/0.834/0.805/0.780 and 0.568. These results support progressive contextual conditioning for grounded pathology report generation.
Chinese Translation
全切片图像 (WSIs) 由于具有极高的分辨率、多尺度的异质性以及需进行临床可靠解释而为计算病理学带来了根本挑战。尽管近期的病理基础模型已经能够流畅地生成报告,但它们往往缺乏临床基础,无法准确表示病理学家观察到的关键诊断概念及其关系。这一局限性源于在保持可解释性和临床一致性的同时,整合跨越细粒度细胞模式、切片水平组织结构和高层诊断概念的异质视觉证据的困难。在此,我们提出SCOUT:语义上下文感知模态融合变换器,一种上下文感知的概念基础多模态框架,用于病理报告的生成,能够通过全局切片信息和明确的诊断概念对图像表示进行渐进式条件调整。该方法将局部组织学模式、全切片上下文和专家策划的语义描述符整合在一个统一的学习框架内,允许在编码过程中动态细化视觉特征。通过在文本生成过程中结合深度感知的上下文调制与自适应的多模态融合,该框架产生临床一致的报告,同时保持表征尺度间的互补性。使用CONCH1.5特征,我们在TCGA-BRCA、MICCAI REG和HistAI数据集上评估了SCOUT,与WSI-Caption、HistGen和BiGen进行比较。SCOUT在所有数据集上均取得了最好的BLEU-1到BLEU-4和METEOR评分,并在TCGA-BRCA和MICCAI REG上获得了最佳ROUGE-L。对于TCGA-BRCA,其BLEU-1/2/3/4分别为0.436/0.303/0.202/0.156,METEOR为0.204;对于REG 2025,分别为0.865/0.834/0.805/0.780,METEOR为0.568。这些结果支持为基于概念的病理报告生成提供渐进式上下文条件调整。
cs.CV / 44 / 2605.01165
CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition
CEZSAR:一种用于零样本动作识别的对比嵌入方法
Abstract
This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. The semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (e.g., from CNNs, RNNs, or transformer-based models). This multimodal nature implies that the semantic properties of the two spaces are not identical. The domain shift, on the other hand, arises from differences between the training and test sets and is inherent to ZSAR, since the test set is unknown. One of the most promising ways to address both issues is to learn a joint embedding space. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearance and unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.
Chinese Translation
本文提出了一种基于对比学习的新型零样本动作识别(ZSAR)方法。在ZSAR中,我们旨在对训练时缺失的类进行分类。ZSAR仍然面临两个著名的问题:语义差距和领域偏移。语义差距的出现是因为标签表示来自于文本域(即语言模型),而必须与视觉表示(即卷积神经网络(CNNs)、循环神经网络(RNNs)、基于变换器的模型)相关联。这种多模态特性暗示两个空间的语义属性并不相同。另一方面,领域偏移则源于训练集和测试集之间的差异,一旦测试集未知,领域偏移在ZSAR中就是固有的。解决这两个问题的最有前景的方法之一是学习联合嵌入空间。因此,我们提出了一种新模型,该模型在联合嵌入空间中对视频和句子进行编码,通过将视频与其自然语言描述对齐进行训练。我们设计了一种自动负样本采样程序来增强训练数据集并生成未配对的数据,即视觉外观和无关的描述。在多个分割配置下,我们的结果在UCF-101和Kinetics-400数据集上达到了最先进的水平。我们的代码可在 https://github.com/valterlej/cezsar 获得。
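The joint-embedding idea in this abstract can be illustrated with a small contrastive objective. The sketch below is not the authors' loss: it assumes an InfoNCE-style video-to-text loss over cosine similarities, and the function names (`cosine`, `info_nce`) and the `temperature` value are hypothetical. Each video's paired description is treated as the positive, and every other description in the batch acts as a negative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(video_embs, text_embs, temperature=0.1):
    """Mean video-to-text InfoNCE loss (assumed form, not the paper's).

    text_embs[i] is the positive for video_embs[i]; all other texts
    in the batch serve as negatives.
    """
    total = 0.0
    for i, video in enumerate(video_embs):
        logits = [cosine(video, text) / temperature for text in text_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(video_embs)

videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(videos, aligned_texts) < info_nce(videos, shuffled_texts))
```

With aligned pairs the loss is close to zero, while swapping the descriptions makes it large, which is the behavior a joint embedding space is trained to produce.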
cs.CV / 45 / 2605.01171
CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization
CADFit:基于混合优化的精确网格到CAD程序生成
Abstract
Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, remains a key challenge for design and manufacturing. Existing CAD reconstruction and generation methods are largely restricted either to difficult-to-edit formats such as meshes or B-reps, or to editable but simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering.
Chinese Translation
尽管近期取得了进展,但从几何输入(如网格或点云)恢复参数化CAD构建序列仍然是设计和制造中的一项关键挑战,因为现有的CAD重建和生成方法在很大程度上受到限制,主要应用于难以编辑的格式(如网格或Brep)或简单的可编辑拉伸和挤压流程及低复杂度数据集。我们提出了CADFit,一种基于混合优化的CAD重建框架,通过使用几何反馈逐步拟合和验证参数化操作,从网格中恢复复杂的可编辑CAD构建序列。我们的方法的特点在于将重建问题以IoU驱动的优化形式进行构建,支持丰富的操作集合,包括拉伸、旋转、圆角和倒角。在多个CAD基准测试中的实验表明,CADFit在体积分数交并比(Intersection-over-Union)和倒角距离(Chamfer Distance)上优于最先进的网格到CAD方法,同时显著降低了重建CAD程序的无效比率,尤其是在复杂设计方面。我们还提出了一种多模态管道,使得通过将基于图像的几何重建与CADFit相结合,能够实现从图像到CAD构建序列的端到端重建。通过准确重建更高复杂度的CAD模型,CADFit为生成更丰富的数据集及推动未来基于学习的CAD逆向工程方法提供了实用的基础。
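To make "incrementally fitting and validating parametric operations using geometric feedback" concrete, here is a toy sketch. It assumes shapes are voxel sets and that an operation is kept only when it raises volumetric IoU against the target. The helpers `iou`, `box`, and `greedy_fit` and the operation names are hypothetical illustrations, not CADFit's actual optimizer or CAD kernel.

```python
def iou(a, b):
    # Volumetric Intersection-over-Union of two voxel sets.
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def box(x0, x1, y0, y1, z0, z1):
    # Axis-aligned block of unit voxels, half-open on each axis.
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}

def greedy_fit(target, candidates):
    """Keep each candidate (name, kind, voxels) only if IoU improves."""
    shape, accepted, best = set(), [], 0.0
    for name, kind, voxels in candidates:
        trial = shape | voxels if kind == "add" else shape - voxels
        score = iou(trial, target)
        if score > best:  # geometric feedback validates the operation
            shape, best = trial, score
            accepted.append(name)
    return accepted, best

# Target: a 4x4x2 plate with a 2x2 through-hole.
target = box(0, 4, 0, 4, 0, 2) - box(1, 3, 1, 3, 0, 2)
candidates = [
    ("base_extrude", "add", box(0, 4, 0, 4, 0, 2)),
    ("stray_extrude", "add", box(4, 8, 0, 4, 0, 2)),
    ("hole_cut", "cut", box(1, 3, 1, 3, 0, 2)),
]
print(greedy_fit(target, candidates))
```

In this toy case the stray extrusion is rejected because it lowers IoU, and the two remaining operations reproduce the target exactly.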
cs.CV / 46 / 2605.01185
Phase-map synthesis from magnitude-only MR images using conditional score-based diffusion models with application in training of accelerated MRI reconstruction models
利用条件得分驱动扩散模型从仅有幅度的磁共振图像合成相位图:加速MRI重建模型训练的应用
Abstract
Accelerated magnetic resonance imaging (MRI) enabled by the training of deep learning (DL)-based image reconstruction models requires large and diverse raw k-space datasets. In most clinical MRI applications, due to storage and patient privacy concerns, raw k-space data is discarded and magnitude-only images are the only component saved. Consequently, a large portion of the DL-based MRI reconstruction literature has either relied on small training datasets or has used one of the few available open-source k-space datasets. At the same time, the growing number of anonymized magnitude-only image registries/databases motivates the development of techniques that can use them as training datasets for generalizable DL-based reconstruction models. Here we propose to address this challenge by employing a generative approach based on conditional score-based diffusion models (SBDMs): given a magnitude-only MR image, it synthesizes a phase map (in the image domain) that realistically corresponds to the magnitude-only image. We evaluate its generative capabilities in a downstream DL-based reconstruction task whereby a large k-space dataset is generated by combining the SBDM-synthesized phase maps and the corresponding magnitude-only images, and this k-space dataset is then used to train a DL model for accelerated MRI reconstruction. We compare the performance of the resulting DL model versus those trained according to (a) a naive approach that uses smooth phase, (b) a k-space training dataset generated using synthesized phase maps derived from a generative adversarial network, and (c) the ground truth k-space data. Our results suggest that the DL model trained from SBDM-synthesized k-space data outperforms the other approaches in terms of quantitative metrics as well as qualitatively observed reconstruction fidelity, i.e., whether the reconstructed images include erroneous or hallucinated features that could adversely impact diagnostic accuracy.
Chinese Translation
加速磁共振成像(MRI)的实现依赖于基于深度学习(DL)的图像重建模型的训练,这需要大量多样化的原始k空间数据集。在大多数临床MRI应用中,由于存储和患者隐私的考虑,原始k空间数据被丢弃,保存的仅为幅度图像。因此,大部分基于DL的MRI重建文献要么依赖于小型训练数据集,要么使用少量可用的开源k空间数据集。同时,越来越多的匿名幅度图像注册/数据库激励了开发能够将其用作通用DL重建模型训练数据集的技术。在此,我们提出通过采用基于条件得分驱动扩散模型(SBDMs)的生成方法来应对这一挑战:给定一幅仅有幅度的MRI图像,它合成一个在图像域中实际对应于该幅度图像的相位图。我们在下游的DL重建任务中评估其生成能力,通过结合SBDM合成的相位图和相应的幅度图像生成一个大型k空间数据集,然后用该k空间数据集训练加速MRI重建的DL模型。我们比较了所得到的DL模型相较于以下几种方法的性能:(a)使用平滑相位的简单方法,(b)使用来自生成对抗网络的合成相位图生成的k空间训练数据集,以及(c)真实的k空间数据。我们的结果表明,从SBDM合成的k空间数据训练的DL模型在定量指标以及主观观察到的重建保真度方面均优于其他方法,即重建的图像是否包含可能影响诊断准确性的错误或虚幻特征。
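The step of combining a magnitude image with a synthesized phase map to obtain k-space can be sketched as follows. This is a minimal single-coil illustration that assumes the complex image is `magnitude * exp(i * phase)` and that k-space is its 2D discrete Fourier transform. The helper names are hypothetical, and coil sensitivities, undersampling masks, and the diffusion model itself are left out.

```python
import cmath

def to_complex(magnitude, phase):
    # Combine per-pixel magnitude and phase (radians) into a complex image.
    return [[m * cmath.exp(1j * p) for m, p in zip(m_row, p_row)]
            for m_row, p_row in zip(magnitude, phase)]

def dft2(image):
    # Naive 2D DFT; adequate for a tiny illustration, far too slow for real images.
    rows, cols = len(image), len(image[0])
    kspace = []
    for u in range(rows):
        k_row = []
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2 * cmath.pi * (u * r / rows + v * c / cols)
                    acc += image[r][c] * cmath.exp(1j * angle)
            k_row.append(acc)
        kspace.append(k_row)
    return kspace

magnitude = [[1.0, 2.0], [3.0, 4.0]]
phase = [[0.0, cmath.pi / 2], [cmath.pi, 0.0]]  # stand-in for a synthesized phase map
kspace = dft2(to_complex(magnitude, phase))
print(kspace[0][0])  # DC term: the sum of all complex pixel values
```

Taking the absolute value of the complex image returns the original magnitude, so the phase map changes the k-space data without altering the stored magnitude image.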
cs.CV / 47 / 2605.01217
Asymmetric Invertible Threat: Learning Reversible Privacy Defense for Face Recognition
不对称可逆威胁:学习可逆隐私防御以应对人脸识别
Abstract
Face Recognition systems are widely deployed in real-world applications, but they also raise privacy concerns due to unauthorized collection and misuse of facial data. Existing adversarial privacy protection methods rely on input-space perturbations to obfuscate identity information, yet their protection can degrade when adversaries learn restoration or purification mappings that partially invert the transformation. We study this setting as an asymmetric adversarial attack, in which reverse manipulation becomes feasible because existing defense paradigms do not control reversibility. To address this problem, we propose Asymmetric Reversible Face Protection (ARFP), a restoration-aware extension of personalized face cloaking that integrates privacy protection, keyed recovery, and tamper indication in a single framework. ARFP consists of three components: Key-Conditioned Manifold Binding, which ties the protection transformation to a user-provided key; Adversarial Restoration-Aware Training, which introduces a surrogate restoration adversary during training to improve robustness against evaluated inverse purification attacks; and Authorized Reversible Restoration, which supports recovery with the correct key while providing nonce-based tamper indication. Extensive experiments under the threat models considered in this work show that ARFP improves resistance to the evaluated restoration attacks while preserving authorized recovery utility. These results provide empirical evidence of key-sensitive recovery behavior and tamper awareness in the tested settings.
Chinese Translation
人脸识别系统在现实应用中得到广泛部署,但由于未经授权收集和滥用面部数据,它们也引发了隐私问题。现有的对抗性隐私保护方法依赖于输入空间扰动来模糊身份信息,但当对手学习到部分反转变换的恢复或净化映射时,其保护效果可能会降低。我们将此设置视为一种不对称对抗攻击,其中反向操控成为可能,因为现有的防御范式未能控制可逆性。为了解决这一问题,我们提出了不对称可逆人脸保护(Asymmetric Reversible Face Protection, ARFP),这是一种关注恢复的个性化人脸伪装扩展,集成了隐私保护、密钥恢复和篡改指示于一个单一框架中。ARFP由三个组件组成:关键条件流形绑定(Key-Conditioned Manifold Binding),将保护变换与用户提供的密钥联系起来;对抗恢复感知训练(Adversarial Restoration-Aware Training),在训练过程中引入替代恢复对手,以提高对评估的逆净化攻击的鲁棒性;以及授权可逆恢复(Authorized Reversible Restoration),在提供基于随机数的篡改指示时支持使用正确密钥进行恢复。在本研究考虑的威胁模型下,广泛的实验表明,ARFP在提高对评估恢复攻击的抵抗力的同时,保持了授权恢复的实用性。这些结果为所测试环境中的关键敏感恢复行为和篡改意识提供了实证证据。
cs.CV / 48 / 2605.01220
Visual Implicit Autoregressive Modeling
视觉隐式自回归建模
Abstract
Visual Autoregressive Modeling (VAR) based on next-scale prediction achieves strong generation quality, but its explicit deep stacks fix the amount of computation per scale and inflate memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian-Free Backpropagation, yielding constant training memory, while inference exposes a per-scale iteration knob that enables compute control. On the ImageNet 256x256 benchmark, VIAR attains an FID of 2.16 and an sFID of 8.07 with only 38.4% of the parameters of VAR, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By controlling the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and double throughput from 15.16 to 32.08 images/s on a single RTX 4090, without retraining. Ablations show that a small number of steps is sufficient for the fixed-point iterations to converge and that VIAR consistently dominates VAR across quality-efficiency operating points. In zero-shot inpainting and class-conditional editing, VIAR produces sharper details and smoother boundaries while preserving global structure, validating the benefits of implicit equilibria and per-scale compute control for practical, deployable visual generation.
Chinese Translation
基于下一个尺度预测的视觉自回归建模(VAR)在生成质量上表现优秀,但其显式深层架构固定了每个尺度的计算量,并在高分辨率下增加了内存使用。我们提出了视觉隐式自回归建模(VIAR),这是一个下一个尺度的自回归生成器,在浅层前后块之间嵌入了一个隐式平衡层。隐式层通过无雅可比反向传播进行训练,从而实现恒定的训练内存,而推断则提供了一个每个尺度的迭代调节旋钮,使得计算控制成为可能。在ImageNet 256x256基准上,VIAR取得了FID 2.16和sFID 8.07,只使用了VAR的38.4%参数,表现与强大的自回归基线相匹配或超越,并在与大型扩散模型的竞争中保持优势。通过控制每个尺度的旋钮,VIAR能够将峰值内存从19.24GB降低到8.53GB,并将单个RTX 4090上的吞吐量从15.16提高到32.08张图像/秒,而无需重新训练。消融实验表明,对于定点迭代,较少的步骤便足以实现收敛,并且VIAR在各个质量效率工作点上始终优于VAR。在零样本补绘和基于类别的编辑任务中,VIAR能够生成更清晰的细节和更平滑的边界,同时保持全局结构,有效验证了隐式平衡和每尺度计算控制在实际可部署视觉生成中的优势。
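The "implicit equilibrium layer" with a per-scale iteration knob can be illustrated with a plain fixed-point solver. The sketch below assumes a toy contractive layer `z = tanh(w * z + x)` and a hypothetical `solve_equilibrium` helper in which `max_iters` plays the role of the knob. It shows the forward iteration only, not VIAR's architecture or Jacobian-Free Backpropagation.

```python
import math

def layer(z, x, w=0.5):
    # Toy implicit layer; |w| < 1 makes the map a contraction in z.
    return [math.tanh(w * zi + xi) for zi, xi in zip(z, x)]

def solve_equilibrium(x, max_iters, tol=1e-6):
    """Iterate z <- layer(z, x) until convergence or max_iters steps."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    z = [0.0] * len(x)
    for step in range(1, max_iters + 1):
        z_next = layer(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            break
    return z, step

x = [0.3, -1.0]
z_cheap, steps_cheap = solve_equilibrium(x, max_iters=3)    # low compute budget
z_full, steps_full = solve_equilibrium(x, max_iters=100)    # run to convergence
print(steps_cheap, steps_full)
```

Lowering `max_iters` trades accuracy of the equilibrium for compute without changing any weights, which is the kind of inference-time control the abstract describes.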
cs.CV / 49 / 2605.01234
TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos
TT4D:基于单目视频的乒乓球4D重建管道与数据集
Abstract
We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides 140+ hours of reconstructed singles and doubles gameplay from monocular broadcast videos, featuring multimodal annotations like high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset's combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation collapses under occlusion and varied camera viewpoints, preventing reliable reconstruction. We invert this paradigm by first lifting the entire unsegmented 2D ball track to 3D through a learned lifting network. This 3D trajectory then allows us to reliably perform time segmentation. The learned lifting network also infers the ball's spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory in cases of high occlusion. This lift-first design is necessary, as our pipeline is the only method capable of reconstructing table tennis gameplay from general-view broadcast monocular videos. We demonstrate the dataset's fidelity through two downstream tasks: estimating the racket's pose and velocity at impact, and training a generative model of competitive rallies.
Chinese Translation
我们提出了TT4D,这是一个大规模、高保真的乒乓球数据集。它提供了超过140小时从单目广播视频重建的单打和双打比赛,包含多模态注释,如高质量的相机标定、精确的3D球位置、球旋转、时间分段以及随时间变化的3D人体模型。这些丰富的数据为虚拟回放、深入的球员分析和机器人学习提供了新的基础。该数据集在规模与精度上的结合是通过一种新颖的重建管道实现的。以前的方法首先基于2D球轨迹将比赛序列划分为个别击球段,然后再进行重建。然而,基于2D的时间分段在遮挡和不同的相机视角下会失效,这导致无法可靠地进行重建。我们颠覆了这一范式,通过一个学习的提升网络首先将整个未分段的2D球轨迹提升到3D。这一3D轨迹使我们能够可靠地进行时间分段。学习的提升网络还推断了球的旋转,处理不可靠的球检测,并在高遮挡情况下成功重建球轨迹。这种先提升的设计是必要的,因为我们的管道是唯一能够从一般视角的单目广播视频中重建乒乓球比赛的方法。我们通过两个下游任务展示了数据集的保真度:在撞击时估计球拍的姿态与速度,以及训练竞技对抗回合的生成模型。
cs.CV / 50 / 2605.01236
Degradation-Aware Adaptive Context Gating for Unified Image Restoration
考虑退化的自适应上下文门控用于统一图像恢复
Abstract
Unified image restoration using a single model often faces task interference due to diverse degradations. To address this, we propose DACG-IR (Degradation-Aware Adaptive Context Gating), which enables explicit perception of degradation characteristics to dynamically modulate feature representations. Our method constructs degradation-aware contextual representations from the input to modulate attention distribution, frequency-domain features, and feature aggregation. Specifically, a lightweight multi-scale degradation-aware module extracts coarse degradation information and generates layer-wise prompts. These prompts guide attention temperature and output gating in encoder and decoder blocks for adaptive feature extraction. Additionally, a spatial-channel dual-gated adaptive fusion mechanism refines encoder features, suppressing noise propagation from shallow to deep layers. This design effectively suppresses degradation-induced noise while preserving informative structures. Experiments show DACG-IR outperforms state-of-the-art methods in single-task, all-in-one, adverse weather removal, and composite degradation settings. Code: https://github.com/HlHomes/DACG-IR-code
Chinese Translation
使用单一模型进行统一图像恢复常常面临由于多样退化导致的任务干扰。为了解决这一问题,我们提出了DACG-IR(考虑退化的自适应上下文门控),该方法使得对退化特征的显式感知能够动态调节特征表示。我们的方法从输入中构建考虑退化的上下文表示,以调节注意力分布、频域特征和特征聚合。具体而言,一个轻量级的多尺度考虑退化模块提取粗略的退化信息并生成层级提示。这些提示引导编码器和解码器块中的注意力温度和输出门控,以实现自适应特征提取。此外,一个空间-通道双门控自适应融合机制对编码器特征进行精炼,抑制噪声从浅层传播到深层。该设计有效抑制了由退化引起的噪声,同时保留了信息结构。实验表明,在单任务、全合一、不良天气去除和复合退化设置中,DACG-IR的表现超越了最新的技术方案。代码链接: https://github.com/HlHomes/DACG-IR-code
cs.CV / 51 / 2605.01266
Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation
在无样本分割视觉语言模型中探索与临床因素的提示对齐,以实现非小细胞肺癌肿瘤分割
Abstract
Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell's spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes "where to look" over "what to look for." In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.
Chinese Translation
无样本视觉语言模型(VLMs)为非小细胞肺癌(NSCLC)中肿瘤体积(GTV)轮廓描绘提供了一种可提示的替代方案,但影响其空间行为的提示维度尚不明了。我们通过在一个保留的内部NSCLC肿瘤数据集上,使用子提示分解为诊断、人口统计、分期、解剖、通用和无关控制,研究了VoxTell的对齐方向;分析属性扰动的稳健性;特异性梯度;以及跨案例提示交换,同时利用Dice相似系数(DSC)基于Wilcoxon符号秩检验和Benjamini-Hochberg校正与细调和无样本基准进行比较。对齐分析揭示,解剖位置是VoxTell空间注意力的主要驱动因素:63.4%的位置扰动导致灾难性下降,提示特异性从通用描述提升到完整描述,除了仅诊断提示外,无关提示正确产生了零分割,而跨案例提示交换确认了患者特定的条件(匹配DSC为0.906,未匹配为0.406)。组织病理学和阶段替代的影响微乎其微,表明该模型优先考虑“看哪里”而非“看什么”。在此背景下,VoxTell在完全无样本的情况下达到了平均DSC为0.613,与nnUNet(0.690,已校正p = 0.156)和Ahmed等人(0.675,已校正p = 0.679)在统计上无区别,同时显著优于所有其他无样本模型。总体而言,这些发现表明,分割VLMs的评估不仅应基于Dice,还应考虑其对齐的提示维度。
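The evaluation described here rests on two standard computations: the Dice Similarity Coefficient and the Benjamini-Hochberg adjustment of p-values. The sketch below shows both in plain Python. Masks are represented as sets of voxel indices, and the p-values are made-up illustrative numbers, not results from the paper.

```python
def dice(pred, truth):
    # Dice Similarity Coefficient between two masks given as sets of voxel ids.
    denominator = len(pred) + len(truth)
    return 2 * len(pred & truth) / denominator if denominator else 1.0

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

print(dice({1, 2, 3}, {2, 3, 4}))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

An empty prediction against an empty ground truth is scored as 1.0 here, which is one common convention. The abstract does not say how the paper treats the zero-segmentation case for irrelevant prompts, so that choice is an assumption.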
cs.CV / 52 / 2605.01272
GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment
GameScope:一个用于游戏视频质量评估的多属性、多编解码器基准数据集
Abstract
Video game streaming has grown rapidly, with major platforms such as YouTube and Twitch using different codecs. To support quality assessment models that work consistently across any codec, it is necessary to have access to large, diverse subjective gaming quality datasets. Currently, only a few are available, and each has limitations. To address this gap, we present the largest gaming video quality dataset to date, incorporating both user-generated content (UGC) and professional-generated content (PGC) with extensive visual diversity. Our dataset covers the most widely used codecs (H.264, H.265, and AV1) and consists of 4,048 video samples, each annotated with an average of 37 subjective ratings from which its mean opinion score (MOS) is derived. In addition to overall quality scores, we collect coarse-grained quality attributes, enabling a better understanding of perceptual factors. We study the performance of leading video quality assessment methods on this dataset, including a vision language model that outperforms all the benchmarks. To the best of our knowledge, this is the first dataset that comprehensively addresses gaming video quality assessment across multiple codecs and content types with quality attributes. Our dataset is publicly available at https://rajeshsureddi.github.io/GameScope/.
Chinese Translation
视频游戏直播的发展迅速,主要平台如YouTube和Twitch使用不同的编解码器。为了支持在任何编解码器下都能一致工作的质量评估模型,必须能够获取大型多样的主观游戏质量数据集。目前,只有少量可用的数据集,每个都有其局限性。为了解决这一空白,我们提出了迄今为止最大的游戏视频质量数据集,涵盖用户生成内容(UGC)和专业生成内容(PGC),并具备广泛的视觉多样性。我们的数据集涵盖了最广泛使用的编解码器——H.264、H.265和AV1,包括4,048个视频样本,每个样本平均附有37个主观意见评分(MOS)评级。除了整体质量评分外,我们还收集了粗粒度质量属性,以便更好地理解感知因素。我们研究了领先的视频质量评估方法在这个数据集上的表现,包括一种超越所有基准的视觉语言模型。据我们所知,这是第一个全面解决跨多个编解码器和内容类型的游戏视频质量评估以及质量属性的数据集。我们的数据集可以在 https://rajeshsureddi.github.io/GameScope/ 上公开获取。
cs.CV / 53 / 2605.01277
CNN-based Multi-In-Multi-Out Model for Efficient Spatiotemporal Prediction
基于卷积神经网络的多输入多输出模型用于高效时空预测
Abstract
Recently, models based on Convolutional Neural Network (CNN) or Transformer architectures have been proposed to overcome the limitations of Recurrent Neural Network (RNN) based models in spatiotemporal prediction. These models avoid both the limited parallelism imposed by sequential processing and the error accumulation caused by recursive prediction, and they show high performance. Nevertheless, some challenges remain. First, CNN-based models have difficulty considering global information because of the local nature of the kernel, which limits their performance. In addition, information is mixed because the time axis is combined with the channel axis of the image for processing. Models based on the Transformer architecture have high complexity due to the self-attention calculation and take a long time to train. In this paper, we propose a new model structure, the CNN-based Multi-In-Multi-Out model for Efficient Spatiotemporal Prediction (MIMO-ESP), to overcome these limitations. MIMO-ESP considers global information and significantly reduces complexity by configuring a Transformer architecture based on CNNs. In addition, it treats the time axis as an independent axis rather than combining it with the channel axis, and it considers spatiotemporal information jointly by applying dilation. This structure makes MIMO-ESP both efficient and high-performing. Extensive experimental results on three benchmark datasets, covering video, traffic, and precipitation prediction tasks, demonstrate the usefulness of MIMO-ESP, which achieves competitive efficiency while outperforming existing models. Furthermore, the ablation study results demonstrate the usefulness of the components of MIMO-ESP, emphasizing the potential of the proposed approaches.
Chinese Translation
近年来,基于卷积神经网络(CNN)或Transformer架构的模型被提出,以克服基于递归神经网络(RNN)模型在时空预测中的局限性。这些模型防止了因序列特性导致的并行化效率低下,同时避免了因递归方法导致的堆叠误差,并显示出高性能。然而,仍然存在一些挑战。首先,基于CNN的模型由于卷积核的局部特性,难以考虑全局信息,因此其性能受到限制。此外,由于处理时将时间轴与图像的通道轴相结合,信息容易混杂。基于Transformer架构的模型由于自注意力计算的复杂性,训练时间较长,复杂度高。本文提出了一种新结构模型,即基于CNN的多输入多输出模型(MIMO-ESP),以克服这些限制。MIMO-ESP考虑了全局信息,通过基于CNN配置Transformer架构显著提高了效率。此外,该模型将时间轴视为独立轴,不进行合并,并通过应用扩张有效地共同考虑时空信息。这种结构使MIMO-ESP在效率和性能上均表现优异。在包含视频、交通和降水预测任务的三个有前景的基准数据集上进行的广泛实验结果表明,MIMO-ESP因其明显的竞争效率而具有实用价值,且在性能上优于现有模型。此外,消融研究结果验证了MIMO-ESP各组件的有效性,强调了所提出方法的潜力。
cs.CV / 54 / 2605.01283
Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
开发强大的预训练基础模型用于植物叶片病害分类
Abstract
Plants, crops, and their yields are essential to our very existence, but diseases and pests cause large losses every year. It is therefore vital to spot diseases early, treat them accordingly, and stop their spread while that is still possible. Manual and traditional methods require personnel to walk through the field and check for symptoms by hand. This is laborious and time-consuming, so ML methods have been applied, with promising results. CNN models are especially efficient because they automatically extract features from images, without manual feature construction, and then feed those features to a classifier. Datasets strongly influence the final performance of a model. Despite their importance to the field, there still seems to be a discrepancy between what is publicly available and what would be required to train fully capable models. To overcome these shortcomings, this thesis identifies open datasets for plant leaf disease classification, as well as models that can be trained on them, and carries out extensive benchmarks to assess their suitability. Based on those findings and on the findings of an augmentation applicability study, a new dataset was constructed and used to train a new base model built on the DenseNet201 architecture. This model outperformed the baseline model on the new dataset and in domain-specific transfer-learning experiments for plant leaf disease classification on another new dataset. Through transfer learning (TL), the new model trains downstream models faster, more robustly, more stably, and with less data than a general model would, overcoming many of the issues the field still suffers from.
Chinese Translation
植物、作物及其产量对我们的生存至关重要,但每年因病害和害虫造成的损失巨大。因此,必须确保能够及早发现疾病并进行相应处理,以防止其传播。在此过程中,传统的手动检测方法需要人员走入田间,以“手动”方式检查症状。这种方法既费力又耗时,因此,机器学习(ML)方法应运而生,并取得了令人鼓舞的成果。卷积神经网络(CNN)模型特别有效,因为它们能够从图像中自动提取特征,而无需在此之前进行任何手动特征构建,然后将这些特征输入分类器。数据集对模型的最终表现影响极大。尽管数据集在该领域的重要性不容忽视,但公开可用数据集与充分训练全面功能模型所需的数据集之间仍存在一定的不一致性。为了克服这些缺陷,作为本论文的一部分,我们识别了用于植物叶片病害分类的开放数据集及可在其上进行训练的模型,并进行了广泛的基准测试,以评估其适用性。随后,根据这些发现以及增强应用研究的结果,构建了一个新的数据集,将用于基于DenseNet201架构的新基础模型的训练。该模型不仅在新数据集上超越了基线模型,还在另一个新数据集上的植物叶片病害分类领域具体的迁移学习实验中取得了优于基线模型的表现。该新模型通过迁移学习(TL)能够更快、更稳健、更稳定地训练模型,并且所需数据量少于一般模型,从而克服了该领域仍面临的许多问题。
cs.CV / 55 / 2605.01284
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
证据链:用于迭代检索增强生成的像素级视觉归因
Abstract
Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
Chinese Translation
迭代检索增强生成(iRAG)已成为回答复杂多跳问题的强大范式,通过逐步检索和推理外部文档。然而,当前系统主要依赖解析文本,这导致了两个关键瓶颈:(1)粗粒度归因,用户必须根据模糊的文本级引用手动定位漫长文档中的证据;(2)视觉语义损失,将视觉丰富的文档(如幻灯片、包含图表的PDF)转换为文本时,会丢失空间逻辑和布局线索,这些是推理所必需的。为了解决这一问题,我们提出了证据链(Chain of Evidence, CoE),这是一种与检索器无关的视觉归因框架,利用视觉-语言模型直接对检索到的文档候选项的截图进行推理。CoE消除了特定格式的解析,并输出精确的边界框,直观展示了检索候选集中的完整推理链。我们在两个不同的基准上评估CoE:Wiki-CoE,一个源自2WikiMultiHopQA的大规模结构化网页数据集,以及SlideVQA,一个具有复杂图表和自由格式布局的挑战性演示幻灯片数据集。实验表明,细调后的Qwen3-VL-8B-Instruct在需要视觉布局理解的场景中表现出色,显著超过基于文本的基线,同时为像素级可解释的iRAG建立了一个与检索器无关的解决方案。我们的代码已发布在 https://github.com/PeiYangLiu/CoE.git。
cs.CV / 56 / 2605.01296
SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On
SIFT-VTON:基于交叉注意力的几何对应监督用于虚拟试穿
Abstract
Diffusion-based virtual try-on methods achieve photorealistic synthesis through cross-attention mechanisms that transfer garment features to target body regions. However, these approaches rely on implicit learning of spatial correspondences, struggling to preserve fine details such as text and illustrations. We propose a novel approach, which we call SIFT-VTON, that utilizes SIFT keypoint matching to provide explicit geometric guidance for diffusion-based virtual try-on. Our method applies domain-specific filtering to SIFT keypoint matches between garment and person images, then converts these correspondences into spatial probability distributions that supervise cross-attention layers during training. This explicit supervision guides the model to learn precise spatial alignment, concentrating attention on geometrically consistent garment regions. Experiments on the VITON-HD dataset demonstrate significant improvements on unpaired metrics while maintaining competitive paired reconstruction metrics. Qualitative comparisons show superior preservation of text clarity and pattern alignment. Attention visualizations confirm that our method produces sharply focused attention on relevant garment details. This work demonstrates that classical geometric correspondence methods can effectively enhance modern diffusion models for conditional synthesis tasks. The source code will be available at https://github.com/takesukeDS/SIFT-VTON.
Chinese Translation
基于扩散的虚拟试穿方法通过交叉注意力机制实现逼真的合成,将服装特征迁移到目标身体区域。然而,这些方法依赖于空间对应关系的隐式学习,难以保持文本和插图等细节。我们提出了一种新方法,称为 SIFT-VTON,它利用 SIFT 关键点匹配为基于扩散的虚拟试穿提供显式几何指导。我们的方法对服装与人物图像之间的 SIFT 关键点匹配应用领域特定的过滤,然后将这些对应关系转换为空间概率分布,在训练期间监督交叉注意力层。这种显式监督引导模型学习精确的空间对齐,集中注意力于几何一致的服装区域。在 VITON-HD 数据集上的实验表明,未配对指标有显著提升,同时保持了竞争力的配对重建指标。定性比较显示了文本清晰度和图案对齐的优越保持。注意力可视化确认我们的方法在相关服装细节上产生了高度集中的注意力。这项工作证明了经典的几何对应方法可以有效增强现代扩散模型在条件合成任务中的表现。源代码将在 https://github.com/takesukeDS/SIFT-VTON 上提供。
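The conversion of keypoint correspondences into spatial probability distributions that supervise cross-attention can be sketched as below. The abstract does not give the exact recipe, so this assumes an isotropic Gaussian centered on the matched keypoint, normalized over the attention grid, and a cross-entropy penalty between that target and an attention map. The names `keypoint_to_distribution` and `attention_cross_entropy` and the `sigma` value are hypothetical.

```python
import math

def keypoint_to_distribution(kp_xy, grid_h, grid_w, sigma=1.0):
    # Gaussian bump centered on the matched keypoint, normalized to sum to 1.
    x0, y0 = kp_xy
    weights = [[math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
                for x in range(grid_w)]
               for y in range(grid_h)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def attention_cross_entropy(target, attention, eps=1e-12):
    # Cross-entropy between the target distribution and an attention map.
    return -sum(t * math.log(a + eps)
                for t_row, a_row in zip(target, attention)
                for t, a in zip(t_row, a_row))

target = keypoint_to_distribution((2, 1), grid_h=4, grid_w=4)
uniform_attention = [[1 / 16] * 4 for _ in range(4)]
print(attention_cross_entropy(target, target)
      < attention_cross_entropy(target, uniform_attention))
```

Attention that already concentrates on the matched location incurs a lower penalty than diffuse attention, which is the direction of pressure such supervision is meant to apply.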
cs.CV / 57 / 2605.01309
CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
CUE:概念感知多标签扩展以减轻长尾学习中的概念混淆
Abstract
Long-tailed distributions are common in real-world recognition tasks, where a few head classes have many samples while most tail classes have very few. Recently, fine-tuning foundation models for long-tailed learning has gained attention due to their excellent performance. However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminability. To address this, we propose CUE, Concept-aware mUlti-label Expansion, which introduces multi-label concept signals to preserve disrupted inter-class relationships. Specifically, CUE constructs concept sets by (i) extracting instance-level visual cues from zero-shot CLIP and (ii) generating class-level semantic cues with an LLM; the two cues are incorporated via separately weighted Binary Logit-Adjustment (BLA) auxiliary losses and jointly optimized with the baseline Logit-Adjustment (LA) loss. In experiments on several long-tailed benchmarks, CUE achieves balanced and strong performance, surpassing recent state-of-the-art methods. Code is available at: https://github.com/zhangruichi/CUE.
Chinese Translation
长尾分布在现实世界的识别任务中很常见,其中少数头部类有许多样本,而大多数尾部类则非常少。最近,为长尾学习微调基础模型引起了关注,因为它们展现了优秀的性能。然而,大多数现有方法仅关注于减轻长尾分布偏见,而忽视了长尾分布造成的概念混淆。本文研究了这一问题,并将其归因于长尾分布下单标签监督的互斥性,这抑制了相关类之间的特征共享,并放大了头部类的主导地位,导致了类别间可分性受损。为了解决这一问题,我们提出了CUE(概念感知多标签扩展),它引入多标签概念信号以保留受损的类别间关系。具体而言,CUE通过(i)从零样本CLIP提取实例级视觉线索和(ii)利用大型语言模型(LLM)生成类别级语义线索来构建概念集合;这两个线索通过分别加权的二元逻辑调整(Binary Logit-Adjustment, BLA)辅助损失相结合,与基线逻辑调整(Logit-Adjustment, LA)损失共同优化。在多个长尾基准测试上的实验表明,CUE实现了平衡且强大的表现,超越了近期的最先进方法。代码可在:https://github.com/zhangruichi/CUE。
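The baseline Logit-Adjustment (LA) loss mentioned in this abstract can be sketched in a few lines. The version below assumes the common formulation in which `tau * log(prior)` is added to each logit before softmax cross-entropy. The class counts are illustrative, and the paper's Binary Logit-Adjustment auxiliary losses are not reproduced because the abstract does not define them.

```python
import math

def logit_adjusted_loss(logits, label, class_counts, tau=1.0):
    """Softmax cross-entropy on logits shifted by tau * log(class prior)."""
    total = sum(class_counts)
    adjusted = [z + tau * math.log(n / total)
                for z, n in zip(logits, class_counts)]
    peak = max(adjusted)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    return log_sum - adjusted[label]

counts = [900, 100]  # head class, tail class (illustrative)
logits = [0.0, 0.0]  # an undecided model
print(logit_adjusted_loss(logits, label=0, class_counts=counts))
print(logit_adjusted_loss(logits, label=1, class_counts=counts))
```

With undecided logits the tail class receives the larger loss, so training pushes tail logits up by a margin tied to the class prior. With balanced counts the adjustment shifts every logit equally and the loss reduces to ordinary cross-entropy.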
cs.CV / 58 / 2605.01320
PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression
PACE:用于学习型LiDAR点云压缩的后因熵建模
Abstract
LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, which makes it highly attractive for practical applications.
Chinese Translation
LiDAR点云压缩对于自主系统处理来自高分辨率传感器的大量数据至关重要。虽然基于八叉树结构的学习型熵建模能够实现高压缩增益,但面临两个关键瓶颈:1)由于因果的多阶段上下文建模,解码过程中存在显著的延迟;2)严格的性能-延迟权衡机制,使得单一模型无法适应不同的约束条件。这些限制源于上下文聚合骨干网络和概率预测之间的紧密耦合。为了解决这一问题,我们提出了PACE,一个新的框架,将先验上下文聚合重新构建为非因果骨干网络,并将因果性限制在一个轻量级、可扩展的预测器中,从而消除了重复执行骨干网络的需求,减少了计算开销。该预测器支持任意数量的预测阶段,能够在不同的性能-延迟权衡中无缝适应,而无需重新加载参数。实验结果表明,PACE在压缩效率上树立了新的行业标准,实现了显著的比特率-失真(BD-BR)节省,并在自回归模式下将解码延迟减少了90%以上,极具实际应用吸引力。
cs.CV / 59 / 2605.01324
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
超越感知捷径:用于轻量级 MLLMs 中可推广视频推理的因果启发去偏优化
Abstract
Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at https://github.com/falonss703/VideoThinker.
Chinese Translation
尽管强化学习(RL)在大型多模态语言模型(MLLMs)中的推理能力有了显著提升,但其在边缘部署所需的轻量级模型中的有效性仍然有限。为了解决这一问题,我们利用因果分析和实验揭示了感知偏见的潜在现象,证明基于RL的微调使得轻量级模型倾向于优先采用由数据偏见引发的感知捷径,而不是发展真正的推理能力。在这一见解的启发下,我们提出了 VideoThinker,一个通过两阶段去偏过程培养轻量级模型中稳健推理的因果启发框架。首先,偏见感知训练阶段锻造了一个专门的“偏见模型”来体现这些捷径行为。接着,因果去偏政策优化(Causal Debiasing Policy Optimization, CDPO)算法微调主模型,采用创新的排斥目标,积极将其推离偏见模型的错误逻辑,并同时拉向正确的、可推广的解决方案。我们的模型 VideoThinker-R1 在视频推理效率上树立了新的技术前沿。在同规模比较中,无需监督微调(Supervised Fine-Tuning, SFT),且仅利用 1 的训练数据进行 RL,其在广泛使用的基准上超越了 VideoRFT-3B,平均提升 3.2%,在 VideoMME 中领先 7%。在跨规模比较中,它在多个基准上超越了更大的 Video-UTR-7B 模型,包括在 MVBench 上提升 2.1% 和在 TempCompass 上提升 3.8%。代码可在 https://github.com/falonss703/VideoThinker 获取。
cs.CV / 60 / 2605.01325
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
通过Gromov-Wasserstein距离重新思考视觉语言模型中的模型选择
Abstract
Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: What factors of vision encoders matter in VLMs? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
Chinese Translation
视觉语言模型(VLMs)通过整合视觉编码器提升了传统的大型语言模型(LLMs)在视觉方面的能力。尽管近期的研究探索了视觉编码器和LLMs的多种组合,但对适合VLM对齐的视觉编码器的原则性理解仍然匮乏。本文通过对来自不同来源的19个预训练视觉编码器的精心挑选集合进行综合实验,系统性地研究了这一问题。我们首先表明,常见的做法,例如选择最大规模或最高零-shot准确率的编码器,持续未能识别出最佳模型。实际上,这些指标与VLM性能之间仅显示出弱或中等的相关性。这一引人入胜的发现引发了一个根本性的问题:视觉编码器的哪些因素在VLM中是重要的?通过全面分析,我们识别出跨模态的结构相似性在视觉编码器选择中起着至关重要但此前被忽视的作用,我们使用Gromov-Wasserstein距离作为代理来衡量这一点。从理论角度,我们展示了跨模态映射的可学习性可以与Gromov-Wasserstein距离有确凿的关联。对60多个完整的VLM训练运行的经验验证显示,我们提出的仅推理指标的表现显著优于替代的模型选择策略,并与最终VLM性能展现出更强的相关性,从而在完整训练之前实现VLM性能的有效和高效预测。
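As a rough illustration of the quantity this abstract relies on, the sketch below evaluates the squared-loss Gromov-Wasserstein objective for a fixed coupling between two tiny point sets. The function name, the toy distance matrices, and the choice of squared loss are assumptions made for illustration; the abstract does not describe the paper's actual solver or how its inference-only metric is assembled.

```python
from itertools import product


def gw_objective(C1, C2, T):
    """Squared-loss Gromov-Wasserstein cost of a fixed coupling T.

    C1: n x n pairwise distances inside space 1 (e.g. vision embeddings)
    C2: m x m pairwise distances inside space 2 (e.g. text embeddings)
    T:  n x m coupling whose entries sum to 1
    """
    n, m = len(C1), len(C2)
    total = 0.0
    for i, k in product(range(n), repeat=2):
        for j, l in product(range(m), repeat=2):
            # Penalize pairs whose intra-space distances disagree.
            total += (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
    return total


# Toy data (hypothetical): two points per space, matched one-to-one.
C_vision = [[0.0, 1.0], [1.0, 0.0]]
C_text_same = [[0.0, 1.0], [1.0, 0.0]]       # same geometry
C_text_stretched = [[0.0, 3.0], [3.0, 0.0]]  # distorted geometry
T = [[0.5, 0.0], [0.0, 0.5]]

cost_same = gw_objective(C_vision, C_text_same, T)
cost_stretched = gw_objective(C_vision, C_text_stretched, T)
```

A full Gromov-Wasserstein distance would additionally minimize this cost over all valid couplings; the sketch only shows why structurally similar spaces yield a lower value.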
cs.CV / 61 / 2605.01330
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
共线性衰退:训练量化友好的视觉变换器与异常值衰退
Abstract
Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.
Chinese Translation
低位量化是有效部署视觉变换器的一条实用路径,但激活异常值使得完全量化的部署变得复杂。现有方法要么在训练后处理量化,要么在训练过程中压制大激活;然而,在视觉模型中过度限制异常值可能导致全精度与量化精度之间的权衡变差。我们认为,与其单纯抑制异常值,训练目标应控制导致其有害的结构放大效果。为此,我们引入共线性衰退(Colinearity-Decay,CD),这一结构正则化方法针对变换器块中的有序矩阵对。CD惩罚有害的跨矩阵对齐,减轻极端激活,而无需改变架构或任务损失。作为一种解耦更新,CD是非侵入性的,仅引入最小的训练开销。在ImageNet-1K预训练、COCO检测和下游微调中,CD在多个流程中始终增强了量化准确度,同时保持甚至改善全精度性能。最终,我们的结果表明,结构正则化有效地为视觉变换器的低位部署做好准备,且没有推理时间的开销。
cs.CV / 62 / 2605.01331
Zero-Shot Interpretable Image Steganalysis for Invertible Image Hiding
零-shot 可解释图像隐写分析用于可逆图像隐藏
Abstract
Image steganalysis, which aims at detecting secret information concealed within images, has become a critical countermeasure for assessing the security of steganography methods, especially the emerging invertible image hiding approaches. However, prior studies merely classify input images into two categories (i.e., stego or cover) and typically conduct steganalysis under the constraint that training and testing data must follow similar distribution, thereby hindering their application in real-world scenarios. To overcome these shortcomings, we propose a novel interpretable image steganalysis framework tailored for invertible image hiding schemes under a challenging zero-shot setting. Specifically, we integrate image hiding, revealing, and steganalysis into a unified framework, endowing the steganalysis component with the capability to recover the secret information embedded in stego images. Additionally, we elaborate a simple yet effective residual augmentation strategy for generating stego images to further enhance the generalizability of the steganalyzer in cross-dataset and cross-architecture scenarios. Extensive experiments on benchmark datasets demonstrate that our proposed approach significantly outperforms the existing steganalysis techniques for invertible image hiding schemes.
Chinese Translation
图像隐写分析旨在检测隐藏在图像中的秘密信息,已成为评估隐写术方法安全性的关键对策,尤其是新兴的可逆图像隐藏方法。然而,之前的研究仅将输入图像分类为两类(即隐写图像或原始图像),并通常在训练和测试数据必须遵循相似分布的限制条件下进行隐写分析,从而限制了其在实际场景中的应用。为了克服这些不足,我们提出了一种新颖的可解释图像隐写分析框架,专为在具有挑战性的零-shot 环境下的可逆图像隐藏方案量身定制。具体来说,我们将图像隐藏、揭示和隐写分析整合为一个统一框架,使隐写分析组件具备从隐写图像中恢复隐藏信息的能力。此外,我们详细阐述了一种简单而有效的残差增强策略,用于生成隐写图像,以进一步提升隐写分析器在跨数据集和跨架构场景中的泛化能力。基于基准数据集的广泛实验表明,我们提出的方法显著优于现有的可逆图像隐藏方案的隐写分析技术。
cs.CV / 63 / 2605.01345
Active Reasoning Vision-Language Models via Sequential Experimental Design
通过顺序实验设计实现主动推理的视觉语言模型
Abstract
Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspired by the classical paradigms of active vision and information foraging, we frame overcoming this limitation as a sequential decision-making process. We formalise this process through the lens of the sequential Bayesian optimal experimental design (S-BOED) problem. While exact Bayesian inference is intractable in continuous gigapixel spaces, we derive principled yet tractable approximations that balance spatial coverage against resolution. To validate this framework, we present a training-free inference strategy as a practical instantiation of the S-BOED objective for agents equipped with multiple vision tools. Designed as a flexible template, this strategy accommodates arbitrary optimisation algorithms, ranging from efficient greedy sampling to look-ahead planning, to approximate the optimal design. Empirical evaluations on gigapixel-level benchmarks demonstrate that our approach further boosts the performance of state-of-the-art models, significantly outperforming standard baselines and effectively narrowing the gap towards human-annotated oracles.
Chinese Translation
现代视觉语言模型(VLMs)中的视觉感知受到根本的感知带宽瓶颈的限制:广阔的视野不可避免地牺牲了复杂推理所需的细致细节。受到主动视觉和信息觅食的经典范式的启发,我们将克服这一限制框架为一个顺序决策过程。我们通过顺序贝叶斯最优实验设计(S-BOED)问题的视角正式化这一过程。尽管在连续的千兆像素空间中,准确的贝叶斯推理是不可处理的,但我们推导出原则性但可处理的近似方法,以平衡空间覆盖与分辨率。为了验证这一框架,我们提出了一种无训练的推理策略,作为配备多种视觉工具的智能体的S-BOED目标的实际实现。该策略被设计为一个灵活的模板,可以适应任意的优化算法,从高效的贪婪采样到前瞻性规划,以逼近最优设计。在千兆像素级基准测试上的经验评估表明,我们的方法进一步提升了最先进模型的性能,显著超越了标准基线,并有效缩小了与人类注释Oracle之间的差距。
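To make the experimental-design framing more concrete, here is a minimal sketch of greedy selection by expected information gain over a discrete set of hypotheses. It is not the paper's approximation: the two "designs" (a wide view and a zoomed view), their likelihood tables, and all function names are hypothetical, and the abstract's continuous gigapixel setting is replaced by a two-hypothesis toy problem.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """EIG of one design; likelihood[h][y] = p(outcome y | hypothesis h)."""
    eig = entropy(prior)
    for y in range(len(likelihood[0])):
        p_y = sum(prior[h] * likelihood[h][y] for h in range(len(prior)))
        if p_y == 0:
            continue
        posterior = [prior[h] * likelihood[h][y] / p_y for h in range(len(prior))]
        eig -= p_y * entropy(posterior)  # subtract expected posterior entropy
    return eig


def greedy_design(prior, designs):
    """Pick the design name with the largest expected information gain."""
    return max(designs, key=lambda name: expected_information_gain(prior, designs[name]))


prior = [0.5, 0.5]
designs = {
    "wide_view": [[0.5, 0.5], [0.5, 0.5]],  # outcome says nothing about the hypothesis
    "zoom_in": [[0.9, 0.1], [0.1, 0.9]],    # outcome mostly reveals the hypothesis
}
best = greedy_design(prior, designs)
```

A sequential variant would update `prior` with the observed outcome and repeat the selection, which is the greedy end of the range of optimisation algorithms the abstract mentions.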
cs.CV / 64 / 2605.01346
CHASE: Competing Hypotheses for Ambiguity-Aware Selective Prediction
CHASE: 针对模糊性意识的选择性预测的竞争假设
Abstract
Standard selective prediction methods typically estimate uncertainty from the output of a single predictive branch. While effective for general uncertainty estimation, these approaches often struggle under partial observability, where local temporal evidence can be contradictory and standard confidence scores become misleading. We introduce CHASE (Competing Hypotheses for Ambiguity-Aware Selective Prediction), a selective prediction framework that explicitly compares structured temporal explanations to determine whether to commit to a decision or abstain. Because genuine ambiguity causes the score gap between competing hypotheses to collapse, CHASE optimizes a ranking-aware selector over these hypothesis margins to globally separate safe commitments from fundamentally uncertain ones. We evaluate this framework on the problem of hidden connectivity inference, utilizing a controlled, physically grounded simulator inspired by the dynamics of giant unilamellar vesicles (GUVs), alongside zero-shot qualitative transfer (without retraining or fine tuning) to representative real GUV videos. Our experiments demonstrate that explicitly reasoning over competing hypotheses provides a superior balance of metrics. Compared to canonical uncertainty baselines, CHASE achieves statistically significant gains in overall no-abstain accuracy, three-way accuracy, and overall ambiguity-aligned abstention (at 80% coverage). Specifically, it yields up to an 11.0% relative mean improvement in overall alignment, alongside up to an 8.8% relative boost in three-way accuracy in the very-high ambiguity regime. By maintaining a selective risk boundary strictly at par with the best baselines at 80% coverage, and reducing overall risk by 9.9% at 90% coverage, this framework offers a more reliable approach to decision-making under structured ambiguity.
Chinese Translation
标准选择性预测方法通常通过单一预测分支的输出估计不确定性。尽管这种方法在一般不确定性估计中有效,但在局部可观测性较低的情况下却常常面临挑战,此时局部时间证据可能存在矛盾,标准置信度评分也会产生误导。我们提出了CHASE(Competing Hypotheses for Ambiguity-Aware Selective Prediction),这是一种选择性预测框架,它显式地比较结构化的时间解释,以决定是否做出决策或选择不做决策。由于真正的模糊性会导致竞争假设之间的评分差距崩溃,CHASE在这些假设边界上优化了一个排名感知的选择器,以在全局范围内区分安全的决策和根本不确定的决策。我们在隐藏连通推理的问题上评估了该框架,利用一个受巨大单层囊泡(GUVs)动力学启发的受控物理基础模拟器,以及零样本定性迁移(无需重新训练或微调)到代表性真实GUV视频。我们的实验表明,显式地推理竞争假设提供了更优的指标平衡。与经典的不确定性基线相比,CHASE在整体无放弃准确性、三类准确性以及整体模糊性对齐放弃(80%覆盖率)方面取得了统计学上显著的提升。具体而言,在整体对齐方面最高可实现11.0%的相对均值提升,在极高模糊性区域三类准确性方面最高可实现8.8%的相对提升。通过在80%覆盖率下将选择性风险边界严格保持在最佳基线的水平,并在90%覆盖率下降低整体风险9.9%,该框架为在结构模糊性下的决策提供了更可靠的方法。
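The core idea, committing only when one hypothesis clearly beats its competitor, can be sketched as a margin test. This is a simplified stand-in, not CHASE itself: the abstract describes a learned, ranking-aware selector over hypothesis margins, whereas the fixed threshold `min_margin`, the hypothesis labels, and the scores below are assumptions for illustration.

```python
def select_or_abstain(hypothesis_scores, min_margin):
    """Commit to the top hypothesis only if it clearly beats the runner-up.

    Returns (decision, margin); decision is None when abstaining.
    """
    ranked = sorted(hypothesis_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    margin = top_score - runner_up_score
    return (best if margin >= min_margin else None), margin


def coverage(decisions):
    """Fraction of inputs on which the selector committed to a decision."""
    return sum(d is not None for d in decisions) / len(decisions)


# Hypothetical scores for two competing temporal explanations.
clear_case = {"connected": 0.9, "disconnected": 0.2}
ambiguous_case = {"connected": 0.51, "disconnected": 0.49}

decisions = [
    select_or_abstain(clear_case, min_margin=0.3)[0],
    select_or_abstain(ambiguous_case, min_margin=0.3)[0],
]
```

Raising `min_margin` lowers coverage and should lower selective risk, which is the trade-off the abstract reports at 80% and 90% coverage.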
cs.CV / 65 / 2605.01355
AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification
AgriKD:跨架构知识蒸馏用于高效的叶片病害分类
Abstract
Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.
Chinese Translation
自动化叶片病害分类对于资源有限的田间环境中的早期病害检测至关重要。视觉Transformer(Vision Transformers, ViTs)通过建模长距离依赖和类间关系提供强大的表征能力;然而,它们的高计算成本使得在边缘设备上的部署变得不切实际。因此,现有方法难以有效地将这些丰富的表征转移到轻量级模型中。本文提出了AgriKD,一个用于高效边缘部署的跨架构知识蒸馏框架,该框架将知识从视觉Transformer(ViT)教师模型转移到紧凑的卷积学生模型。为了弥补Transformer和卷积神经网络(CNN)架构之间的表征差距,所提出的方法在输出、特征和关系层面整合了多个蒸馏目标,每个目标捕捉教师知识的不同方面。这使得学生模型能更好地保留和利用源自Transformer的全局表征。在多个叶片病害数据集上的实验表明,经过蒸馏的学生模型在性能上可与教师模型相媲美,同时显著提高了效率,模型参数减少约172倍,计算成本降低47.57倍,推理延迟缩短18-22倍。此外,优化后的模型在多种运行时格式(包括ONNX、TFLite Float16和TensorRT FP16)上进行部署,实现了一致的预测性能,且准确性几乎未降低。在NVIDIA Jetson边缘设备和移动应用上的实际部署展示了可靠的实时推理,突显AgriKD在资源匮乏环境中的AI驱动农业应用的实践性。
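Of the three distillation levels the abstract lists, the output-level one is the easiest to sketch. The block below uses the widely used temperature-softened KL formulation; whether AgriKD uses exactly this loss, this temperature, or the T^2 scaling is an assumption, and the logits are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for three disease classes.
teacher = [4.0, 1.0, 0.5]
good_student = [4.0, 1.0, 0.5]
poor_student = [0.5, 1.0, 4.0]

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

In practice this term would be combined with the ordinary classification loss and, per the abstract, with feature-level and relational objectives that are not sketched here.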
cs.CV / 66 / 2605.01365
VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
VoxAfford: 多尺度体素-标记融合用于开放词汇3D可供性检测
Abstract
Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.
Chinese Translation
开放词汇的3D可供性检测要求根据新颖的可供性描述来定位点云上的交互区域。最近的方法通过特殊的输出标记扩展了多模态大型语言模型(MLLM),这些标记被解码为分割掩膜。然而,这些标记是通过自回归生成来产生的,这种方法建模了顺序依赖关系而非空间邻域关系,这使得标记在语义上丰富,但在3D定位上空间上匮乏。我们提出了体素增强可供性检测(VoxAfford),通过将来自冻结的预训练3D VQVAE编码器的多尺度几何特征注入到生成后的输出标记中,绕过了这一瓶颈。每个输出标记利用其可供性语义作为查询,借助跨注意力从其配对的体素尺度中检索相关的几何模式,同时一个学习的兼容性门控控制注入强度。增强后的标记随后通过语义条件注意力聚合为一个空间感知的可供性提示,并与每个点的特征一起传播,以生成最终的掩膜。在开放词汇可供性检测任务上的实验表明,VoxAfford达到了最先进的性能,mIoU提高了大约8%,同时真实机器人实验确认了其对新对象的零样本迁移能力。
cs.CV / 67 / 2605.01382
Sparse Representation Learning for Vessels
血管的稀疏表示学习
Abstract
Analyzing human vasculature and vessel-like, tubular structures, such as airways, is crucial for disease diagnosis and treatment. Current methods often rely on small sub-regions or simplified tree-like structures, rendering analysis of entire organ-level networks at clinical resolution computationally challenging. To this end, we propose VAEsselSparse, an efficient encoder-decoder model to obtain a meaningful yet compact representation of the entire organ-level vascular network at sub-millimeter resolution. VAEsselSparse leverages the inherent sparsity of 3D vascular structures via sparse convolutions and attention mechanisms, achieving substantial spatial compression rates of 8 x 8 x 8. We demonstrate superior reconstruction performance compared to dense counterparts and previous methods. Importantly, the resulting latent space retains clinically relevant discriminative features readily usable for classification tasks, such as aneurysm/stenosis or subvariants of the circle of Willis. Moreover, the compact latent space of VAEsselSparse serves as an effective representation for learning vessel-specific priors through generative models, enabling the synthesis of realistic vasculature.
Chinese Translation
分析人类血管和类似管道的结构(如气道)对疾病的诊断和治疗至关重要。目前的方法往往依赖于小的子区域或简化的树状结构,这使得在临床分辨率下分析整个器官级网络的过程在计算上具有挑战性。为此,我们提出了VAEsselSparse,一种高效的编码-解码模型,用于以亚毫米分辨率获取整个器官级血管网络的有意义且紧凑的表示。VAEsselSparse利用了三维血管结构的固有稀疏性,通过稀疏卷积和注意力机制,实现了8 x 8 x 8的显著空间压缩率。我们证明了其在重建性能上优于密集模型和先前的方法。重要的是,所得的潜在空间保留了临床相关的判别特征,易于用于分类任务,如动脉瘤/狭窄或威利斯环亚变体。此外,VAEsselSparse的紧凑潜在空间作为通过生成模型学习特定血管先验的有效表示,能够合成逼真的血管结构。
cs.CV / 68 / 2605.01391
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
VISTA: 视频交互时空分析基准
Abstract
Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.
Chinese Translation
现有的视觉-语言模型(VLMs)基准主要评估简单单一动作视频上的时空理解、封闭属性集和限制性实体类型,未能捕捉特征丰富的真实世界视频理解中多样实体之间自由形式的多动作交互。此外,缺乏系统框架来分析模型在互补时空轴上的失败,阻碍了全面评估。为了解决这些问题,我们引入了VISTA——一个为VLMs设计的开放集、多实体和多动作时空理解的视频交互时空分析基准。VISTA将视频分解为可解释的实体、其相关动作和关系动态,使多轴诊断和关系、空间及时间理解的统一评估成为可能。我们的基准将多个数据集整合成一个统一的意识到交互的分类法,包含约12,000对针对多样场景和复杂性的策划视频-查询对。我们对11个最先进的VLM在VISTA上的表现进行了系统评估,并在我们的分类法中细分总体性能,以揭示传统指标掩盖的短板和明显的时空偏见。通过提供详尽、分类驱动的诊断信息,VISTA为指导模型设计、预训练策略和评估协议的改进提供了一个微妙的框架。总的来说,VISTA是第一个大规模的、意识到交互的时空理解诊断基准,专注于VLMs。
cs.CV / 69 / 2605.01393
Recall to Predict: Grounding Motion Forecasting in Interpretable Motion Bank
召回预测:基于可解释运动库的运动预测
Abstract
Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive "motion bank", a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the "black box" of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: https://github.com/abviv/recall2predict
Chinese Translation
运动预测通常需要在可解释性和预测准确性之间进行权衡。标准的基于锚点的架构依赖于不透明的潜在查询,这些查询非常容易受到潜在崩溃的影响,或者采用简单的轨迹采样,这限制了多模态的多样性。我们提出了一种端到端的可微分框架,将预测基于一个全面的“运动库”(motion bank),即通过对比学习构建的物理可实现轨迹的结构化嵌入空间。我们架构的特点是动态检索明确的运动先验,而不是从空白状态回归路径,这通过一个新颖的锚点检索层(Anchor Retrieval Layer)来实现。该模块通过双层门控交叉注意力机制(Dual-Level Gated Cross-Attention)调整正交初始化的查询,并利用直通式Gumbel-Softmax估计器进行离散轨迹选择,以保持连续梯度流。检索到的语义基础锚点随后通过类似DETR的解码器进行几何精化,与“赢家通吃”(Winner-Takes-All, WTA)运动高斯混合模型(kinematic Gaussian Mixture Model, GMM)、潜在多样性惩罚及软最小加权端点损失联合优化。通过严格地将解码阶段的条件设置为多样、可解释的运动原语,我们的方法消除了标准潜在查询的“黑箱”问题,同时在Argoverse 2和Waymo开放运动数据集上实现了具有竞争力的多模态准确性。代码可在以下链接获取: https://github.com/abviv/recall2predict
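The Winner-Takes-All idea named in the abstract can be sketched in a few lines: among several candidate trajectories, only the one closest to the ground truth contributes to the loss. This is a deliberately simplified version that uses average displacement error; the paper's kinematic GMM likelihood, diversity penalty, and soft-min endpoint weighting are not reproduced, and the trajectories are invented toy data.

```python
import math


def winner_takes_all_loss(candidates, ground_truth):
    """Return (winner_index, loss) for the candidate closest to the ground truth.

    Each trajectory is a list of (x, y) waypoints. The loss is the average
    displacement error of the best candidate only, so the other modes are
    free to cover different futures.
    """
    def average_displacement(trajectory):
        distances = (math.dist(p, g) for p, g in zip(trajectory, ground_truth))
        return sum(distances) / len(ground_truth)

    errors = [average_displacement(t) for t in candidates]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return winner, errors[winner]


# Hypothetical motion modes: go straight vs. drift diagonally.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
diagonal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

winner, loss = winner_takes_all_loss([straight, diagonal], ground_truth=straight)
```

Because only the winning mode is penalized, the remaining candidates are not pulled toward the same future, which is the usual motivation for this loss in multi-modal forecasting.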
cs.CV / 70 / 2605.01450
Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
无注册可学习的多视角人脸捕捉在密集语义对应中的应用
Abstract
Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and model are public at: https://filby89.github.io/mochi.
Chinese Translation
近期的框架如ToFu和TEMPEH通过直接从标定的多视角图像预测密集语义对应的3D网格,为传统注册流程提供了一种自动化的替代方案。然而,这些基于学习的方法依赖于它们旨在替代的缓慢手动注册流程来进行训练监督。我们通过MOCHI(Multi-view Optimizable Correspondence of Heads from Images)克服了这一局限,MOCHI是一个无需注册训练数据的多视角3D人脸预测框架。MOCHI通过强制拓扑一致性,消除了对注册数据的依赖,采用一种伪线性逆运动学求解器。语义对齐则由一个专门在合成数据上训练的2D标志预测器提供的密集关键点指导。我们的分析进一步揭示,标准的点到表面距离在无注册的设置中引发训练不稳定性和视觉伪影。因此,我们提出了基于点图和法线的损失,这提供了更平滑的梯度和更优的重构保真度。最后,我们引入了一种测试时优化方案,在几十次迭代中细化网络权重。这一方法弥合了前馈效率与迭代优化精度之间的差距,使MOCHI在重构精度和视觉质量上超越了传统的劳动密集型流程。代码和模型已公开于:https://filby89.github.io/mochi。
cs.CV / 71 / 2605.01459
SRGAN-CKAN: Expressive Super-Resolution with Nonlinear Functional Operators under Minimal Resources
SRGAN-CKAN:在最小资源下使用非线性函数运算符的表达性超分辨率
Abstract
Single-Image Super-Resolution (SISR) aims to reconstruct a High-Resolution (HR) image from a Low-Resolution (LR) observation, a fundamentally ill-posed problem where high-frequency details are severely degraded at large upscaling factors. Recent advances have been driven by transformer-based architectures and diffusion models, which improve global context modeling and perceptual quality at the cost of increased computational complexity. In contrast, this work focuses on enhancing the expressivity of local operators under minimal resources. We propose SRGAN-CKAN, a hybrid super-resolution framework that integrates Convolutional Kolmogorov-Arnold Networks (CKAN) into an adversarial learning setting, reformulating convolution as a nonlinear patch-based transformation. The proposed operator replaces linear local mappings with spline-based functional representations, allowing expressive modeling of complex local structures and high-frequency textures using minimal hardware resources. Experimental results demonstrate that the proposed approach improves perceptual quality while preserving reconstruction fidelity, achieving a favorable balance between distortion-based and perceptual metrics. These results are obtained under constrained computational settings, highlighting the efficiency of the proposed formulation. Overall, this work introduces a complementary direction to existing approaches by improving the representational power of local transformations, providing an efficient and scalable alternative to globally intensive architectures.
Chinese Translation
单图像超分辨率(SISR)旨在从低分辨率(LR)观察中重建高分辨率(HR)图像,这是一个根本上不适定的问题,在较大放大倍数下,高频细节严重退化。最近的进展受到基于变换器的架构和扩散模型的推动,改善了全球上下文建模和感知质量,但代价是计算复杂度的增加。相比之下,本研究集中于在最小资源下增强局部算子的表现力。我们提出了SRGAN-CKAN,这是一种混合超分辨率框架,将卷积科尔莫哥洛夫-阿诺德网络(CKAN)集成到对抗学习设置中,将卷积重新表述为非线性的基于补丁的变换。所提出的算子用基于样条的函数表示取代了线性局部映射,允许使用最小硬件资源对复杂局部结构和高频纹理进行表达性建模。实验结果表明,所提出的方法在保持重建保真度的同时改善了感知质量,达成了失真度和感知指标之间的良好平衡。这些结果是在受限的计算条件下获得的,突显了所提出公式的效率。总体而言,本研究通过提高局部变换的表示能力,为现有方法引入了一种互补的方向,提供了一种对全球密集架构的高效、可扩展的替代方案。
cs.CV / 72 / 2605.01466
SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion
SplAttN: 通过高斯软喷射和注意力机制连接二维与三维以实现点云补全
Abstract
Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://github.com/zay002/SplAttN.
Chinese Translation
尽管多模态学习在点云补全方面取得了进展,但其理论机制仍不明确。近期的研究将成功归因于模态之间的连接,然而我们发现标准的硬投影断裂了这种连接:将稀疏点云投影到图像平面上会产生极其稀疏的支持,这阻碍了视觉先验的传播,这一失败模式我们称之为跨模态熵坍缩。为了解决这一实际限制,我们提出了SplAttN,它用可微分高斯喷射替代硬投影,生成稠密连续的图像平面表示。通过将投影重新表述为连续密度估计,SplAttN避免了崩溃的稀疏支持,促进了梯度流动,并提高了跨模态连接的可学习性。大量实验表明,SplAttN在PCN和ShapeNet-55/34上达到了当前最先进的表现。重要的是,我们利用现实世界的KITTI基准作为多模态依赖的压力测试。对照评估显示,尽管基线方法退化为对视觉去除不敏感的单模态模板检索器,SplAttN仍然保持对视觉线索的强大依赖性,验证了我们的方法建立了有效的跨模态连接。代码可在https://github.com/zay002/SplAttN获取。
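The contrast between hard projection and soft splatting can be shown with a toy rasterizer. This is only a sketch of the general idea, assuming points that are already projected to image coordinates and an isotropic Gaussian with a hand-picked `sigma`; it is not SplAttN's differentiable splatting module, and pure Python lists carry no gradients.

```python
import math


def hard_projection(points, size):
    """Each projected point marks exactly one pixel."""
    image = [[0.0] * size for _ in range(size)]
    for x, y in points:
        image[int(y)][int(x)] = 1.0
    return image


def gaussian_splat(points, size, sigma=1.5):
    """Each projected point spreads a Gaussian footprint over the image."""
    image = [[0.0] * size for _ in range(size)]
    for x, y in points:
        for row in range(size):
            for col in range(size):
                # Squared distance from the pixel center to the point.
                d2 = (col + 0.5 - x) ** 2 + (row + 0.5 - y) ** 2
                image[row][col] += math.exp(-d2 / (2 * sigma ** 2))
    return image


def support(image, eps=1e-3):
    """Number of pixels carrying a non-negligible value."""
    return sum(value > eps for row in image for value in row)


# Three hypothetical points on a 16 x 16 image plane.
points = [(2.3, 2.7), (9.1, 4.4), (12.6, 12.2)]
hard = hard_projection(points, 16)
soft = gaussian_splat(points, 16)
```

Comparing `support(hard)` with `support(soft)` shows the sparse versus dense support the abstract refers to; the Gaussian weights also vary smoothly with point position, which is what makes a splatted image usable for gradient-based learning.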
cs.CV / 73 / 2605.01468
Decision Boundary-aware Generation for Long-tailed Learning
针对长尾学习的决策边界感知生成
Abstract
Long-tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion-based generative augmentation addresses this problem by generating additional data, while head-to-tail transfer further mitigates the generator bias inherited from the long-tailed dataset. However, we show that while head-to-tail transfer helps balance the decision space of the classifier, it also induces latent non-local feature mixing that entangles inter-class features, causing decision boundary overlap and tail class distribution shift. To address this, we first identify the problem of boundary ambiguity and then propose the Decision Boundary-aware Generation (DBG) framework, which promotes near-boundary representation learning by generating informative near-boundary samples. Overall, DBG rebalances the long-tailed dataset while yielding a more separable decision space for long-tailed learning. Across standard long-tailed benchmarks, DBG consistently improves tail class and overall accuracy with less inter-class overlap. The code of DBG is available at https://github.com/keepdigitalabc-svg/DBG.
Chinese Translation
长尾数据使决策边界偏向主类,从而降低尾类的准确性。基于扩散的生成增强方法通过生成额外的数据来解决这一问题,而头到尾的迁移进一步减轻了因长尾数据集而继承的生成器偏差。然而,我们展示了虽然头到尾迁移有助于平衡分类器的决策空间,但它也引入了潜在的非局部特征混合,纠缠了类间特征,导致决策边界重叠和尾类分布偏移。为了解决这个问题,我们首先识别了边界模糊性的问题,并提出了决策边界感知生成(Decision Boundary-aware Generation, DBG)框架,该框架通过生成信息丰富的近边界样本来促进近边界表示学习。总体而言,DBG在重新平衡长尾数据集的同时,为长尾学习提供了更可分离的决策空间。在标准的长尾基准测试中,DBG始终提高了尾类和整体准确性,并减少了类间重叠。DBG的代码可在 https://github.com/keepdigitalabc-svg/DBG 获取。
cs.CV / 74 / 2605.01478
LIE: LiDAR-only HD Map Construction with Intensity Enhancement via Online Knowledge Distillation
LIE:通过在线知识蒸馏增强强度的仅LiDAR高清地图构建
Abstract
Online High-Definition (HD) map construction is a key component of autonomous driving. Recent methods rely on multi-view camera images for cost-effective HD map segmentation, but cameras lack depth information for accurate scene geometry. In contrast, LiDAR provides precise 3D measurements but lacks dense semantic cues. In this work, we propose LIE, a LiDAR-only semantic map construction method that employs Knowledge Distillation (KD) to handle the lack of dense semantic and texture cues. Specifically, the teacher branch fuses student LiDAR features and the corresponding 2D intensity map tile to provide dense supervision for segmenting map elements using an online distillation scheme. Experimental results show that our method outperforms all single-modality approaches, achieving 8.2% higher mIoU than the state-of-the-art camera-based model on nuScenes. LIE is robust over long ranges and under challenging weather and lighting, and efficiently adapts to Argoverse2 with only 10% fine-tuning, surpassing camera-based models trained on the full dataset. Source code will be available at https://iv.ee.hm.edu/lie/.
Chinese Translation
在线高清(HD)地图构建是自动驾驶的重要组成部分。近期的方法依赖多视角相机图像进行成本效益的高清地图分割,但相机缺乏用于准确场景几何的深度信息。相比之下,LiDAR提供精确的3D测量,但缺乏密集的语义线索。在本研究中,我们提出了LIE,一种仅基于LiDAR的语义地图构建方法,采用知识蒸馏(Knowledge Distillation, KD)技术来应对密集语义和纹理线索的缺失。具体而言,教师分支融合学生LiDAR特征和相应的2D强度图块,为使用在线蒸馏方案分割地图元素提供密集监督。实验结果表明,我们的方法优于所有单模态方法,在nuScenes数据集上实现了比最先进的基于相机的模型高出8.2%的mIoU。LIE在长距离和复杂气候及光照条件下表现稳健,并仅需10%的微调即可有效适应Argoverse2,超越在完整数据集上训练的基于相机的模型。源代码将在此处提供。
cs.CV / 75 / 2605.01479
CSGuard: Toward Forgery-Resistant Watermarking in Diffusion Models via Compressed Sensing Constraint
CSGuard:通过压缩感知约束实现对抗伪造的扩散模型水印技术
Abstract
Latent-based diffusion model watermarking embeds watermarks into the latent space of generated images to enable content attribution, offering a training-free solution for intellectual property protection and digital forensics. However, these methods exhibit a critical vulnerability to forgery attacks: an attacker can extract the watermark by inverting the watermarked image and re-generating it with an arbitrary prompt, thereby enabling false attribution of malicious content. In this paper, we propose CSGuard, the first forgery-resistant watermarking scheme that leverages compressed sensing to bind watermarked image generation and verification to a secret matrix. This ensures that only users possessing the secret matrix can correctly embed or verify the image watermark, preventing unauthorized users from forging watermarks without compromising generation quality or watermark integrity. Experimental results demonstrate that CSGuard achieves strong forgery resistance, reducing the attack success rate from 100.0% to 28.12%, and achieves a 100% detection rate on benign watermarked images without compromising watermarking effectiveness.
Chinese Translation
基于潜变量的扩散模型水印技术将水印嵌入生成图像的潜在空间,以实现内容归属,提供了一种无需训练的知识产权保护和数字取证解决方案。然而,这些方法对伪造攻击存在严重漏洞,攻击者可以通过反向处理水印图像并使用任意提示重新生成图像,从而提取水印,导致恶意内容的错误归属。本文提出了CSGuard,这是第一个利用压缩感知的抗伪造水印方案,将水印图像的生成和验证绑定到一个秘密矩阵。这确保只有拥有秘密矩阵的用户才能正确嵌入或验证图像水印,从而在不损害生成质量和水印完整性的情况下防止非法用户进行伪造。实验结果表明,CSGuard实现了强大的抗伪造能力,将攻击成功率从100.0%降低至28.12%,并在不影响水印效果的前提下,对良性的水印图像实现了100%的检测率。
cs.CV / 76 / 2605.01480
AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT
AttnRouter:针对 MMDiT 的按类别注意力路由的无训练图像编辑
Abstract
We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.
Chinese Translation
我们研究了在 Qwen-Image-Edit-2511 上进行的无训练图像编辑,该模型是一个由 60 个模块组成的多模态扩散变换器(MMDiT),它在单一的注意力流中连接噪声和源图像令牌。我们有三项贡献。 (i) 我们提出了 KVInject,这是一种单前向注意力操作,能够在局部层/步骤带内将源图像的一半键/值投影与噪声一半进行 alpha 混合。与经典的两遍 MasaCtrl 方法相比,KVInject 更为简单,并且避免了在 MMDiT 上禁用 MasaCtrl 的提示不匹配失败模式(与基线相比综合得分下降了 31%)。 (ii) 我们显示出在编辑类型之间没有单一的注意力操作主导,这促使我们提出了 AttnRouter,这是一种按类别路由表,将编辑分配给能够最好保持该类型源结构的操作。使用真实类别时,路由器在编辑基线基础上提高了 CLIP-T+DINO-I 综合得分 6.4%;尽管类别准确率只有 55%,但自动 CLIP 零-shot 分类器缩小了 98% 的差距。 (iii) 通过对层、步骤和 alpha 带进行消融,我们定位了有效编辑的注意力子电路:在早期去噪步骤(S0-7)中进行 K/V 注入几乎恢复了全步骤注入的所有增益,而在早期(L0-15)或后期(L45-60)层带的注入未能完全推动编辑;alpha 值在 [0.3, 0.5] 之间是一个稳定的最佳范围。我们还报告了负面结果,强调从 UNet 经验中未能转移的内容:简单的 K/V 重新缩放从未超过基线,而激进变体则完全崩溃生成(综合得分 0.084)。我们发布了代码、预计算的路由表,以及用于所有消融实验的 ImgEdit-Bench 的 100 个样本分层子集。
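As a rough illustration of the KVInject idea described in the AttnRouter abstract above (alpha-blending source-half key/value projections into the noise half inside a layer/step band), the following is a minimal sketch and not the authors' implementation. It assumes the noise and source halves have equal token counts and that tokens are plain Python lists; the function names, the toy layer band (16, 44), and the toy values are hypothetical.

```python
def blend(noise_row, source_row, alpha):
    """Alpha-blend one source token projection into one noise token projection."""
    return [(1.0 - alpha) * n + alpha * s for n, s in zip(noise_row, source_row)]


def kv_inject(keys, values, n_noise, alpha, layer, step, layer_band, step_band):
    """Blend source-half K/V into the noise half, only inside the given bands.

    keys, values: lists of token vectors; the first n_noise entries are noise
    tokens and the next n_noise entries are source-image tokens (assumption).
    layer_band, step_band: inclusive (low, high) ranges where injection is active.
    """
    in_band = (layer_band[0] <= layer <= layer_band[1]
               and step_band[0] <= step <= step_band[1])
    if not in_band:
        return keys, values  # outside the band the attention inputs are untouched

    def inject(tokens):
        mixed = [blend(tokens[i], tokens[n_noise + i], alpha) for i in range(n_noise)]
        return mixed + tokens[n_noise:]  # source half is left unchanged

    return inject(keys), inject(values)


# Toy usage: two noise tokens (zeros) followed by two source tokens.
keys = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
values = [[0.0, 0.0], [0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]
new_k, new_v = kv_inject(keys, values, n_noise=2, alpha=0.4,
                         layer=20, step=3, layer_band=(16, 44), step_band=(0, 7))
```

The alpha value of 0.4 sits inside the [0.3, 0.5] range the abstract reports as stable, and the step band mirrors the reported S0-7 window; the layer band is only a placeholder for "neither early nor late".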
cs.CV / 77 / 2605.01483
Research on Vision-Language Question Answering Models for Industrial Robots
工业机器人视觉-语言问答模型研究
Abstract
A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics, targeting the challenges of semantic ambiguity, complex environmental layouts, and domain-specific terminology common in modern manufacturing. The framework integrates advanced object detection, multi-scale visual encoding, syntactic parsing, and task-aware semantic attention to unite vision and language signals in a joint reasoning space. Region-based deep networks extract visual features, weighted embeddings aggregate them, and recurrent neural parsing encodes sentence structures. Through fine-grained semantic alignment driven by adaptive fusion and cross-attention mechanisms, the system can handle operational queries, instruction steps, and anomaly detection with higher reliability. Validation experiments on the IVQA and RIF benchmarks indicate improvements over existing VLQA approaches in semantic alignment, Top-1 accuracy, and robustness to ambiguous or procedural task queries. Ablation studies further quantify the impact of each architectural module, confirming the necessity of multi-level feature integration and context-driven gating for dependable industrial deployment. The technical advancements reported here provide core methodologies to improve the interpretability and operational effectiveness of industrial robots faced with diverse human-robot interaction tasks.
Chinese Translation
提出了一种分层交叉模态融合模型,用于工业机器人中的视觉-语言问答(VLQA),旨在应对现代制造业中常见的语义模糊、复杂环境布局和领域特定术语等挑战。该框架整合了先进的物体检测、多尺度视觉编码、句法解析以及任务感知的语义注意力,将视觉和语言信号统一到一个共同推理空间。基于区域的深度网络提取视觉特征,加权嵌入进行聚合,递归神经解析对句子结构进行编码。通过自适应融合和交叉注意力机制驱动的细粒度语义对齐,该系统能够以更高的可靠性处理操作查询、指令步骤和异常检测。与现有的VLQA基准相比,在IVQA和RIF基准上进行的验证实验表明,在语义对齐、Top-1准确率和对模糊或程序性任务查询的鲁棒性方面有所提升。消融研究进一步量化了每个架构模块的影响,确认了多级特征集成和基于上下文的门控对可靠的工业部署的必要性。本研究报告的技术进展提供了核心方法论,以提高工业机器人在面对多样化人机交互任务时的可解释性和操作有效性。
cs.CV / 78 / 2605.01490
CGFformer: Cluster-Guidance Frequency Transformer for Pansharpening
CGFformer:用于全色锐化的聚类引导频率变换器
Abstract
Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images. However, the current mainstream frequency-based pansharpening methods employ fixed frequency filters, which cannot precisely adapt to complex and spatially diversified frequency distributions in PAN and MS images. Furthermore, existing denoising strategies insufficiently exploit frequency components for denoising and struggle to suppress various noise types accurately. To address these challenges, we propose CGFformer, a cluster-guidance frequency Transformer that focuses on varying frequency distribution and interactions between frequency and spatial components. Specifically, we design an adaptive separation module that integrates local features and non-local information through K-means clustering, enabling more precise separation of high- and low-frequency components. Subsequently, we introduce a dual-stream refinement module combined with Transformer-based cross-attention to remove various noise, allowing the network to jointly suppress frequency-relevant and irrelevant disturbances. In addition, we develop a frequency-spatial fusion module designed to enhance detail and facilitate spatial-frequency interaction, ensuring more effective reconstruction of spatial structures in the fused results. Extensive experiments on multiple benchmark datasets demonstrate that the proposed CGFformer achieves notable improvements over existing pansharpening approaches.
Chinese Translation
全色锐化旨在通过将低分辨率的多光谱(LRMS)图像与高分辨率的全色(PAN)图像融合,生成高分辨率的多光谱(HRMS)图像。然而,当前主流的基于频率的全色锐化方法采用固定频率滤波器,无法精确适应PAN和MS图像中复杂且空间分散的频率分布。此外,现有的去噪策略对频率成分的利用不足,难以准确抑制各种噪声类型。为了解决这些挑战,我们提出了CGFformer,一种聚类引导频率变换器,旨在关注频率分布的变化及频率与空间成分之间的相互作用。具体而言,我们设计了一个自适应分离模块,通过K均值聚类集成局部特征和非局部信息,从而实现高低频成分的更精确分离。随后,我们引入了一个结合了基于变换器的交叉注意力的双流精炼模块,以去除各种噪声,使网络能够共同抑制与频率相关和无关的干扰。此外,我们开发了一个频率-空间融合模块,旨在增强细节并促进空间-频率交互,确保在融合结果中更有效地重建空间结构。在多个基准数据集上的广泛实验表明,所提议的CGFformer在全色锐化方法中实现了显著的改进。
cs.CV / 79 / 2605.01496
SF20K Competition 2025: Summary and findings
SF20K比赛2025:总结与发现
Abstract
This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.
Chinese Translation
本报告展示了首届短片20K(SF20K)比赛的结果与发现,该比赛与ICCV 2025会议的SLoMO研讨会共同举行。比赛旨在推进故事级别的视频理解,超越短片段的动作识别,引入基于业余短片语料库的开放式视频问答任务。这种设置确保模型必须依赖多模态理解,而非对流行电影的记忆。评估采用SF20K-Test基准(95部电影,979个问答对)进行,并通过基于GPT-4.1-nano的自动化评估工具LLM-QA-Eval进行评分。比赛吸引了22支团队,提交了286份作品,分为两个赛道:一个模型规模无限制的主赛道和一个模型参数限制在80亿以下的特别赛道。在主赛道上,获胜团队的准确率达到了65.7%,特别赛道的准确率为48.7%,而人类表现上限为91.7%。我们的分析揭示了几个关键发现:具备叙事意识的镜头级处理始终优于均匀帧采样;使用较小模型的精心设计的多阶段流程能够与超过30倍的大模型的端到端推理相匹敌或超越;而字幕质量是性能的主导因素。这些结果表明,在长形式视频的问答中,主要瓶颈在于信息选择和推理结构,而非原始模型能力,并且当前方法与人类级别的叙事理解之间仍存在显著差距。
cs.CV / 80 / 2605.01498
Towards Visual Query Localization in the 3D World
面向3D世界的视觉查询定位
Abstract
Visual query localization (VQL) aims to predict the spatio-temporal response of the most recent occurrence in a sequence given a query. Currently, most research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at https://github.com/wuhengliangliang/3DVQL.
Chinese Translation
视觉查询定位(VQL)旨在根据查询预测序列中最新事件的时空响应。目前,大多数研究集中于二维视频中的视觉查询定位,而在三维空间中的研究却鲜有关注。本文首次尝试通过引入一个新基准,称为3DVQL,来解决3D世界中的视觉查询定位问题。具体而言,3DVQL包含2,002个序列,约170,000帧以及来自38个物体类别的6,400个响应轨迹段。3DVQL中的每个序列提供多种模态,包括点云、RGB图像和深度图像,以支持灵活的研究。为了确保高质量的注释,每个序列都经过多轮人工验证和细化。根据我们所知,3DVQL是针对3D多模态视觉查询定位的第一个基准。为了便于后续研究的比较,我们使用点云和RGB图像实现了一系列具有代表性的3D多模态VQL基线。实验结果显示,现有方法在不同的融合模块之间表现出显著的性能差异。为鼓励未来研究,我们提出了一种名为LaF的提升与注意力融合算法,其显著优于现有基线模型。我们的基准和模型将公开发布在https://github.com/wuhengliangliang/3DVQL。
cs.CV / 81 / 2605.01502
RADMI: Latent Information Aggregation as a Proxy for Model Uncertainty
RADMI:将潜在信息聚合作为模型不确定性的代理
Abstract
Epistemic uncertainty estimation is essential for identifying regions where deep learning system outputs may be unreliable. However, existing approaches require computationally expensive ensemble methods or multiple stochastic forward passes, limiting their scalability to dense prediction tasks like segmentation. We propose Resolution-Aggregated Decoder Mutual Information (RADMI), a single-pass method that estimates prediction uncertainty by measuring mutual information (MI) between consecutive decoder layers in segmentation networks. We observe that elevated inter-layer MI correlates with prediction uncertainty, as the network must integrate conflicting contextual information at ambiguous regions such as class boundaries. Evaluating on a seismic facies segmentation benchmark, RADMI achieves the highest correlation with deep ensemble uncertainty among all single-pass methods, outperforming the next-best baselines by 5.5% in Pearson and 10.7% in Spearman correlation coefficients. Compared to baselines that either lack spatial precision or demand significant computational overhead, RADMI yields sharp, boundary-localized uncertainty maps without architectural modifications. Our results suggest that linear aggregation of normalized information flow provides a principled and efficient proxy for prediction uncertainty in encoder-decoder architectures.
Chinese Translation
认识论不确定性估计对于识别深度学习系统输出可能不可靠的区域至关重要。然而,现有方法需要计算开销较大的集成方法或多次随机前向传播,这限制了它们在密集预测任务(如分割)中的可扩展性。我们提出了解析度聚合解码器互信息(Resolution-Aggregated Decoder Mutual Information,RADMI),这是一种单次传播方法,通过测量分割网络中连续解码器层之间的互信息(MI)来估计预测不确定性。我们观察到较高的层间互信息与预测不确定性相关,因为在类边界等模糊区域,网络必须整合相互冲突的上下文信息。通过在地震相分割基准上进行评估,RADMI在所有单次传播方法中实现了与深度集成不确定性的最高相关性,超越了下一个最佳基线5.5%的皮尔逊相关系数和10.7%的斯皮尔曼相关系数。与缺乏空间精度或需要大量计算开销的基线相比,RADMI能够生成清晰的、边界局部化的不确定性图,而无需架构修改。我们的结果表明,归一化信息流的线性聚合为编码-解码架构中的预测不确定性提供了一种有原则且高效的代理。
cs.CV / 82 / 2605.01506
OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder
OmniEncoder:用一个编码器以人类的方式感知、听觉与触觉的连续运动
Abstract
Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a video-coarse, audio-dense design (sampling visual frames at 1-2 fps while processing audio waveforms at 25 fps), resulting in systems that perceive video frame by frame, modality by modality, rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations (the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting) to effectively reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on visual continuous understanding tasks, such as sign language recognition and fine-grained sports action analysis, while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.
Chinese Translation
近期在全模态大语言模型方面的进展使得视觉与音频联合理解取得了显著的进展。然而,现有的架构依赖于特定模态的编码器,采用了视频粗糙、音频密集的设计——以每秒1到2帧的速度采样视觉帧,并以25帧每秒处理音频波形——这导致系统以逐帧、逐模态的方式来感知视频,而非像人类那样进行整体感知。这种差异使得模型在编码时缺乏丰富的跨模态互动,无法捕捉精细的视觉运动。为了解决这一问题,我们提出了Omni-Encoder,一个统一的Transformer骨干网络,旨在以对称的25帧每秒在共享潜在空间中共同嵌入视觉和音频信号。该架构利用了三个核心创新——Omni-Encoder Token Template、Omni-RoPE和Temporal Window Shifting——有效地解决了模态解耦和计算效率的双重挑战。实验证明,相较于在相同输入标记预算下的特定模态基线Qwen2.5-Omni,Omni-Encoder在视觉连续理解任务(如手语识别和精细的体育动作分析)上显著提升了性能,同时在成熟的音视觉基准(如AVQA和说话人识别与定位)上保持了竞争力的表现。这些结果表明,统一的全能编码为构建更贴近人类感知整体性的全模态模型提供了一个有前景的方向。
cs.CV / 83 / 2605.01510
SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion
SwiftPie:通过一步扩散实现闪电般快速的主题驱动图像个性化
Abstract
Diffusion models have achieved remarkable success in high-quality image synthesis, sparking interest in image-guided generation tasks such as subject-driven image personalization. Despite their impressive personalization results, existing methods typically rely on computationally intensive fine-tuning, iterative optimization, or multi-step denoising processes, which significantly hinder their deployment and interactive capability in real-time applications. In this work, we present SwiftPie, the first one-step diffusion image personalization tool that enables lightning-fast generation of personalized images. SwiftPie introduces a novel dual-branch identity injection mechanism that effectively integrates subject identity into a one-step diffusion model. In addition, we incorporate a mask-guided rescaling strategy to further enhance subject contextualization within a single diffusion step. Extensive experiments demonstrate that SwiftPie not only delivers superior image personalization speed but also achieves comparable performance with multi-step approaches in both identity fidelity and prompt alignment. This work opens new opportunities for real-time, high-quality personalized image generation, paving the way for interactive visual synthesis.
Chinese Translation
扩散模型在高质量图像合成中取得了显著成功,引发了对图像引导生成任务的关注,例如主题驱动的图像个性化。尽管现有方法的个性化效果令人印象深刻,但通常依赖于计算密集型的精细调整、迭代优化或多步骤去噪过程,这在很大程度上限制了它们在实时应用中的部署和交互能力。在这项工作中,我们提出了SwiftPie,这是第一个一步扩散图像个性化工具,能够实现个性化图像的闪电般快速生成。SwiftPie引入了一种新颖的双分支身份注入机制,能够有效地将主题身份集成到一步扩散模型中。此外,我们结合了一种基于掩膜的重新缩放策略,以进一步增强单一步扩散步骤中的主题语境化。大量实验表明,SwiftPie不仅在图像个性化速度上表现出色,而且在身份保真度和提示对齐方面与多步骤方法具有可比性。这项工作为实时、高质量的个性化图像生成开辟了新的机遇,为交互式视觉合成铺平了道路。
cs.CV / 84 / 2605.01512
Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video
监控视频中稀有交通事件的双通道零样本时空定位
Abstract
Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a +/- 3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper's best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.
Chinese Translation
在真实的监控录像中定位交通事故是一项稀有事件问题,通常不允许使用标记的事故视频进行训练,但仍需准确联合定位时间、空间和碰撞类型。我们提出了一种无需微调的流程,通过两个思路从冻结的视觉-语言模型中引出这一联合输出。首先,采用粗到细的双通道分解:在每秒1帧的全视频通道中产生一个粗略的 (t, x, y, c) 元组,然后在 +/- 3 秒的窗口内进行第二次以每秒5帧的通道,用于细化时间和位置,并使用两个确定性的置信门,在边界检测或边缘限制坐标时返回到粗略估计。其次,专业角色分配:Qwen3-VL-Plus负责归属,Gemini 3.1 Flash-Lite在集中视频剪辑上负责类型判断。在 ACCIDENT@CVPR 2026 基准测试(2027段真实监控视频)中,我们达到了 ACC^S = 0.539(95% CI [0.525, 0.553]):比基准论文的最佳基线预言(0.412)提高了 +0.127,比最强单一视觉语言模型基线(Molmo-7B,0.396)提高了 +0.143,比幼稚基线(0.289)提高了 +0.250。VLM路径对于每个视频使用最多三个API调用(17%在API失败时回退到物理模型);完整运行成本约为$20。
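The coarse-to-fine gating described in the two-pass abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's code: it reads "boundary hedges" as a refined time that lands at the edge of the +/- 3 s window and "edge-clamped coordinates" as normalized x or y stuck at 0 or 1. The thresholds `eps_t` and `eps_xy` and all function names are hypothetical.

```python
def refine_window(t_coarse, duration, half_width=3.0):
    """Return the [start, end] window (seconds) used for the 5 fps second pass."""
    return max(0.0, t_coarse - half_width), min(duration, t_coarse + half_width)


def gated_estimate(coarse, fine, window, eps_t=0.2, eps_xy=0.01):
    """Pick the refined (t, x, y) unless a confidence gate rejects it.

    coarse, fine: (t, x, y) with x, y normalized to [0, 1].
    Gate 1 (assumed): refined time sits on a window boundary.
    Gate 2 (assumed): refined location is clamped to an image edge.
    """
    t, x, y = fine
    start, end = window
    on_boundary = (t - start) <= eps_t or (end - t) <= eps_t
    edge_clamped = min(x, y) <= eps_xy or max(x, y) >= 1.0 - eps_xy
    return coarse if (on_boundary or edge_clamped) else fine


# Toy usage: a coarse hit at t = 10 s in a 60 s clip.
coarse = (10.0, 0.50, 0.50)
window = refine_window(coarse[0], duration=60.0)          # (7.0, 13.0)
accepted = gated_estimate(coarse, (10.4, 0.52, 0.48), window)
rejected = gated_estimate(coarse, (12.9, 0.52, 0.48), window)
```

Both gates are deterministic, so a rejected refinement always falls back to the same coarse tuple; the collision type `c` is omitted here because the abstract assigns typing to a separate model.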
cs.CV / 85 / 2605.01517
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim:基于渲染意识的稀疏状态建模用于结构保留的矢量动画
Abstract
Scalable Vector Graphics (SVG) animation generation is pivotal for professional design due to their structural editability and resolution independence. However, this task remains challenging as it requires bridging discrete code representations with continuous visual dynamics. Existing optimization-based methods often destroy topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations, failing to model geometry-level non-rigid deformations. To address these limitations, we present VAnim, the first LLM-based framework for open-domain text-to-SVG animation. We reconceptualize animation not as sequence generation, but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by over 9.8x while preserving the SVG DOM structure and non-participating elements by construction. To enable precise control, we propose an Identification-First Motion Planning mechanism that grounds textual instructions in explicit visual entities. Furthermore, to overcome the non-differentiable nature of SVG rendering, we employ Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization (GRPO). By leveraging a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments demonstrate that VAnim significantly outperforms state-of-the-art baselines in semantic alignment and structural validity, with additional appendix metrics further validating motion quality and identity preservation.
Chinese Translation
可缩放矢量图形(SVG)动画生成对于专业设计至关重要,因为它们具备结构可编辑性和分辨率独立性。然而,这一任务仍然具有挑战性,因为它需要将离散的代码表示与连续的视觉动态相结合。现有的基于优化的方法往往会破坏拓扑一致性,而通用的大型语言模型(LLMs)依赖于僵化的CSS/SMIL转换,未能有效建模几何层面的非刚性变形。为了解决这些局限性,我们提出了VAnim,这是第一个基于LLM的开域文本到SVG动画框架。我们将动画重新概念化为对持久SVG DOM树的稀疏状态更新(SSU),而非简单的序列生成。这一范式在保持SVG DOM结构和不参与元素的基础上,将序列长度压缩了超过9.8倍。为了实现精确控制,我们提出了一种优先识别运动规划机制,将文本指令与明确的视觉实体绑定。此外,为了克服SVG渲染的不可微分特性,我们采用了基于渲染意识的强化学习方法,通过群体相对策略优化(GRPO)实现。通过利用来自先进视频感知编码器的混合奖励,我们将离散代码更新与高保真的视觉反馈对齐。我们还介绍了SVGAnim-134k,这是第一个矢量动画基准。大量实验表明,VAnim在语义对齐和结构有效性方面显著优于最先进的基线,并且附录中的额外指标进一步验证了运动质量和身份保留。
cs.CV / 86 / 2605.01519
Certified vs. Empirical Adversarial Robustness via Hybrid Convolutions with Attention Stochasticity
通过具有注意力随机性的混合卷积实现认证与经验对抗鲁棒性
Abstract
We introduce Hybrid Convolutions with Attention Stochasticity (HyCAS), an adversarial defense that narrows the long-standing gap between provable robustness under L2 certificates and empirical robustness against strong L∞ attacks, while preserving strong generalization across diverse imaging benchmarks. HyCAS unifies deterministic and randomized principles by coupling 1-Lipschitz, spectrally normalized convolutions with two stochastic components, spectrally normalized random-projection filters and a randomized attention-noise mechanism, to realize a randomized defense. Injecting smoothing randomness inside the architecture yields an overall <= 2-Lipschitz network with formal certificates. Extensive experiments on diverse imaging benchmarks, including CIFAR-10/100, ImageNet-1k, NIH Chest X-ray, and HAM10000, show that HyCAS surpasses prior leading certified and empirical defenses, boosting certified accuracy by up to 7.3% (on NIH Chest X-ray) and empirical robustness by up to 3.1% (on HAM10000), without sacrificing clean accuracy. These results show that a randomized Lipschitz-constrained architecture can simultaneously improve both certified L2 and empirical L∞ adversarial robustness, thereby supporting safer deployment of deep models in high-stakes applications. Code: https://github.com/misti1203/HyCAS
Chinese Translation
我们引入了具有注意力随机性的混合卷积(Hybrid Convolutions with Attention Stochasticity, HyCAS),这是一种对抗防御方法,旨在缩小长期存在的基于 L2 证书的可证明鲁棒性与针对强对抗攻击的经验鲁棒性之间的差距,同时在各种影像基准测试中保持强大的泛化能力。HyCAS 通过结合 1-Lipschitz、光谱归一化的卷积与两个随机成分(光谱归一化的随机投影滤波器和随机注意力噪声机制),统一了确定性和随机性原理,进而实现了一种随机化防御。在架构中注入平滑随机性最终产生了一个整体 <= 2-Lipschitz 的网络,并具备正式验证的证明。 在多种影像基准测试(包括 CIFAR-10/100、ImageNet-1k、NIH 胸部 X 光、HAM10000)上进行的广泛实验表明,HyCAS 超越了之前领先的认证和经验防御方法,在 NIH 胸部 X 光数据集上认证准确性提升高达 7.3%,在 HAM10000 数据集上经验鲁棒性提升高达 3.1%,且未牺牲干净准确性。这些结果表明,随机 Lipschitz 约束架构能够同时提升认证 L2 和经验 L 对抗鲁棒性,从而支持深度模型在高风险应用中的更安全部署。代码:https://github.com/misti1203/HyCAS
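The HyCAS abstract above relies on spectrally normalized, 1-Lipschitz layers. As a generic illustration of that building block (not the HyCAS code), the sketch below estimates the spectral norm of a small dense weight matrix by power iteration and rescales the matrix so the resulting linear map is 1-Lipschitz in the L2 norm. It assumes a plain dense matrix rather than a convolution, and all names are hypothetical.

```python
import math


def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def spectral_norm(a, iters=100):
    """Estimate the largest singular value of a by power iteration on a^T a.

    Assumes a is non-zero and the all-ones start vector is not orthogonal
    to the top right-singular vector.
    """
    a_t = [list(col) for col in zip(*a)]
    v = [1.0] * len(a[0])
    for _ in range(iters):
        v = matvec(a_t, matvec(a, v))
        n = l2_norm(v)
        v = [x / n for x in v]
    return l2_norm(matvec(a, v))


def spectrally_normalize(a):
    """Rescale a so that its spectral norm is (approximately) 1."""
    s = spectral_norm(a)
    return [[x / s for x in row] for row in a]


# Toy usage: singular values 3 and 1 become 1 and 1/3 after normalization.
w = spectrally_normalize([[3.0, 0.0], [0.0, 1.0]])
```

With spectral norm 1, the map x -> Wx satisfies ||Wx - Wy|| <= ||x - y||, which is the 1-Lipschitz property in L2. For real convolutions the operator norm differs from the norm of the flattened kernel, so this is only a conceptual stand-in.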
cs.CV / 87 / 2605.01520
MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
MIRL:基于互信息指导的视觉-语言模型强化学习
Abstract
Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.
Chinese Translation
视觉-语言模型(VLMs)在复杂推理任务中常常受到视觉感知错误和幻觉的影响,从而降低了答案的准确性。可验证奖励的强化学习(RLVR)通过利用答案正确性信号来优化策略,提供了一种有前景的解决方案。尽管这些方法有效,现有的RLVR方法面临两个主要限制。首先,较大一部分采样预算被浪费在那些由于早期视觉描述错误而注定会失败的轨迹上。其次,稀疏的奖励无法区分失败是源于视觉感知还是推理阶段。我们提出了MIRL,这一解耦框架通过利用生成描述与视觉输入之间的互信息(MI)作为廉价的预筛选信号,解决了这两个限制。这使得能够通过分叉智能分配预算到高潜力轨迹,同时解耦训练提供了基于MI的独立奖励,以优化视觉感知,从而解决了奖励盲区。对六个视觉-语言推理基准的实验表明,MIRL达到70.22%的平均准确率,并成功超越了仅用10个预样本进行前6选择(比完整轨迹减少25%)的16条完整轨迹的性能。我们的代码可在以下网址获取:https://anonymous.4open.science/r/mirl-main/
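The pre-screening step in the MIRL abstract above (10 pre-samples, top-6 kept for full completion) amounts to ranking partial trajectories by a score and forking only the best ones. A minimal sketch follows; the scoring function is a stand-in, since the abstract does not specify how the mutual information between description and image is estimated, and the names `select_prefixes` and `toy_scores` are hypothetical.

```python
def select_prefixes(pre_samples, mi_score, top_k=6):
    """Keep the top_k pre-sampled descriptions ranked by an MI-style score.

    pre_samples: candidate visual descriptions (any hashable items here).
    mi_score: callable returning a score per candidate (assumed to be a
    cheap proxy for mutual information with the visual input).
    """
    ranked = sorted(pre_samples, key=mi_score, reverse=True)
    return ranked[:top_k]


# Toy usage: ten pre-samples with made-up scores; only six are completed.
toy_scores = {
    "desc_0": 0.10, "desc_1": 0.90, "desc_2": 0.40, "desc_3": 0.80, "desc_4": 0.30,
    "desc_5": 0.70, "desc_6": 0.20, "desc_7": 0.60, "desc_8": 0.50, "desc_9": 0.05,
}
kept = select_prefixes(list(toy_scores), toy_scores.get)
```

Only the kept descriptions would go on to the expensive reasoning stage, which is where the budget saving in the abstract comes from.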
cs.CV / 88 / 2605.01552
Robust Fundamental Matrix Estimation from Single Image Motion Blur
从单幅图像运动模糊中稳健地估计基本矩阵
Abstract
In this paper, we introduce a challenging task: extracting a fundamental matrix from a single motion blurred image. For a camera moving in 3D during exposure, the smear paths in the blurry image contain cues and constraints on this motion. We demonstrate the feasibility of establishing correspondences between two time instances within the camera exposure window, and that these can be used to robustly infer a fundamental matrix, which summarizes the motion of the camera during the exposure time. The inferred fundamental matrix is unique up to a transpose, corresponding to an ambiguity of the direction of time. Due to this per-smear ambiguity, classic methods, such as the 8-point algorithm, are no longer usable. The proposed method modifies the estimation to work on time-direction ambiguous correspondences. To improve the robustness of the fundamental matrix estimation, we also propose to incorporate an uncertainty measurement in smear pattern prediction and use it in the sampling process of the estimator. Experiments on synthetic and real-world motion-blur datasets demonstrate that our approach is able to estimate the fundamental matrix encoding the 3D camera motion, from single frames. Practical applicability is demonstrated on the downstream task of motion segmentation.
Chinese Translation
在本文中,我们提出了一项具有挑战性的任务:从单幅运动模糊图像中提取基本矩阵。当相机在曝光期间在三维空间中移动时,模糊图像中的拖尾路径包含关于此运动的线索和约束。我们演示了在相机曝光窗口内建立两个时间实例之间的对应关系的可行性,并证明这些对应关系可以被用来稳健地推导出基本矩阵,该矩阵总结了相机在曝光时间内的运动。推导出的基本矩阵在转置下是唯一的,对应于时间方向的不确定性。由于这种逐像素的模糊不确定性,经典方法(例如8点算法)不再适用。所提出的方法修改了估计以适用于时间方向模糊的对应关系。为了提高基本矩阵估计的稳健性,我们还建议在拖尾模式预测中结合不确定性测量,并将其用于估计器的采样过程。在合成和真实世界运动模糊数据集上的实验表明,我们的方法能够从单帧图像中估计出编码三维相机运动的基本矩阵。本文还展示了该方法在运动分割下游任务中的实际应用。
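The transpose ambiguity mentioned in the abstract above follows directly from the epipolar constraint. A short derivation is given below as an illustration; the notation (x for a smear endpoint at the earlier time instance, x' for the later one) and the product-form constraint are our own and are not taken from the paper.

```latex
% Epipolar constraint for a correspondence (x, x') between two time instances:
\[
  \mathbf{x}'^{\top} F \,\mathbf{x} = 0 .
\]
% If the time order of the two endpoints is swapped, the same pair satisfies
\[
  \mathbf{x}^{\top} F \,\mathbf{x}' = 0
  \quad\Longleftrightarrow\quad
  \mathbf{x}'^{\top} F^{\top} \mathbf{x} = 0 ,
\]
% so F^T explains the time-reversed correspondences, and F is recoverable only up to a transpose.
% When the order is unknown independently for each smear, each pair satisfies one of the two
% constraints, which can be written (as an illustrative, assumed formulation) as
\[
  \bigl(\mathbf{x}'^{\top} F \,\mathbf{x}\bigr)\,\bigl(\mathbf{x}^{\top} F \,\mathbf{x}'\bigr) = 0 .
\]
```

The last constraint is quadratic rather than linear in the entries of F, which is one way to see why the linear 8-point algorithm no longer applies directly.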
cs.CV / 89 / 2605.01563
Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection
多数据集跨领域知识蒸馏用于统一的医学图像分割、分类和检测
Abstract
We propose a unified cross-domain transfer learning framework that leverages knowledge from multiple heterogeneous medical imaging datasets to improve performance across segmentation, classification, and object detection tasks. Our approach employs a teacher-student paradigm in which a joint teacher model aggregates domain-invariant representations learned from diverse source datasets, while a task-specific student model is trained via multi-level knowledge distillation. Originally developed for medical image segmentation, the framework is extended to support image-level classification and object-level detection, enabling a general multi-task formulation for medical image analysis. We evaluate our method on a broad suite of datasets, including six segmentation benchmarks (BrainMetShare, ISLES, and BraTS for MRI; Lung MSD, LiTS, and KiTS for CT), multiple classification datasets for pulmonary disease and dementia, and detection datasets with native bounding-box annotations. Across all tasks and modalities, the proposed approach yields consistent improvements over strong dataset-specific and multi-head baselines, demonstrating enhanced robustness to distributional shifts and superior generalization. These findings highlight the potential of multi-dataset knowledge distillation as a scalable and task-agnostic approach for enhancing segmentation, classification, and object detection performance across heterogeneous medical imaging domains.
Chinese Translation
我们提出了一种统一的跨领域迁移学习框架,通过利用多个异构医学成像数据集中的知识,提升分割、分类和目标检测任务的性能。我们的方法采用教师-学生范式,其中一个联合教师模型聚合来自不同源数据集的领域不变表示,同时通过多层次知识蒸馏训练特定任务的学生模型。该框架最初为医学图像分割设计,现扩展支持图像级分类和目标级检测,从而实现医学图像分析的通用多任务形式。我们在一系列广泛的数据集上评估了我们的方法,包括六个分割基准数据集:BrainMetShare、ISLES、BraTS(MRI)和Lung MSD、LiTS、KiTS(CT),以及多个针对肺病和痴呆症的分类数据集,以及具有原生边界框注释的检测数据集。在所有任务和模态中,所提出的方法在强大的数据集特定基线和多头基线之上均实现了一致的改进,显示出对分布变化的增强鲁棒性和优越的泛化能力。这些发现突显了多数据集知识蒸馏作为一种可扩展且与任务无关的方式,以提升跨异构医学成像领域的分割、分类和目标检测性能的潜力。
cs.CV / 90 / 2605.01568
Unifying Deep Stochastic Processes for Image Enhancement
统一深度随机过程用于图像增强
Abstract
Deep stochastic processes have recently become a central paradigm for image enhancement, with many methods explicitly conditioning the stochastic trajectory on the degraded input. However, the relationship between these conditional processes and standard diffusion models remains unclear. In this work, we introduce a unified perspective on stochastic image enhancement by classifying recent methods into three families of continuous-time processes: unconditional diffusion models, Ornstein-Uhlenbeck (OU) processes, and diffusion bridges. We show that all of these approaches arise from a common stochastic differential equation (SDE) formulation. This framework makes explicit that seemingly disparate methods differ primarily in their drift and diffusion terms, terminal distributions, and boundary conditions, while schedulers and samplers constitute orthogonal design choices. Leveraging this unification, we conduct a controlled empirical study across multiple image enhancement tasks using identical architectures and training protocols. Our results reveal no consistently dominant method; instead, we identify and disentangle the specific design choices that most strongly influence performance. Finally, we release ItoVision, a modular PyTorch library that implements the unified framework and enables rapid prototyping and fair comparison of stochastic image enhancement methods.
Chinese Translation
深度随机过程最近已成为图像增强的核心范式,许多方法明确地将随机轨迹条件化于退化输入。然而,这些条件过程与标准扩散模型之间的关系仍不明确。在本研究中,我们通过将近期方法归类为三类连续时间过程——无条件扩散模型、奥恩斯坦-乌伦贝克(Ornstein-Uhlenbeck,OU)过程和扩散桥,提出了一种统一的随机图像增强视角。我们展示了所有这些方法都源自于一个共同的随机微分方程(SDE)表述。该框架明确强调,看似不同的方法主要在其漂移和扩散项、终端分布和边界条件上存在差异,而调度器和采样器则构成了正交的设计选择。利用这种统一,我们在多个图像增强任务中进行了一项受控的实证研究,使用相同的架构和训练协议。我们的结果显示没有一种方法始终占据主导地位;相反,我们识别并解开了对性能影响最大的具体设计选择。最后,我们发布了ItoVision,一个模块化的PyTorch库,实现了统一框架,使得随机图像增强方法的快速原型设计和公平比较成为可能。
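To make the "common SDE formulation" in the abstract above concrete, the block below writes the three families in their standard textbook forms. This is a sketch in generic notation, not the paper's own; in particular, taking the OU mean to be the degraded image y and writing the bridge drift through Doob's h-transform are assumptions based on how such methods are commonly presented.

```latex
% Common form: drift f, diffusion coefficient g, Wiener process w_t.
\[
  \mathrm{d}\mathbf{x}_t = f(\mathbf{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t .
\]
% (1) Unconditional (variance-preserving) diffusion: terminal law close to N(0, I).
\[
  f(\mathbf{x}_t, t) = -\tfrac{1}{2}\beta(t)\,\mathbf{x}_t , \qquad g(t) = \sqrt{\beta(t)} .
\]
% (2) Ornstein-Uhlenbeck (mean-reverting) process: drift pulls toward the degraded input y.
\[
  f(\mathbf{x}_t, t) = \theta(t)\,\bigl(\mathbf{y} - \mathbf{x}_t\bigr) , \qquad g(t) = \sigma(t) .
\]
% (3) Diffusion bridge: a base drift plus an h-transform term that pins x_T = y.
\[
  f(\mathbf{x}_t, t) = f_{\mathrm{base}}(\mathbf{x}_t, t)
    + g(t)^2\,\nabla_{\mathbf{x}_t} \log p\bigl(\mathbf{x}_T = \mathbf{y} \mid \mathbf{x}_t\bigr) .
\]
```

Read this way, the three families differ in the drift, the terminal distribution (pure noise, noise around y, or exactly y), and the boundary condition, which matches the distinctions the abstract lists.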
cs.CV / 91 / 2605.01638
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
Omni-Fake: 统一多模态社交媒体深伪检测基准
Abstract
Multimodal deepfakes are proliferating on social media and threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: https://tianxiao1201.github.io/omni-fake-project-page/
Chinese Translation
多模态深伪在社交媒体上不断增加,威胁着真实性、信息完整性和数字取证。现有基准受限于其单一模态范围、简化的操控或不现实的分布,这限制了它们对真实世界鲁棒性的评估能力。为了解决这些局限性,我们提出了Omni-Fake,这是一个统一的全方位数据集,用于社交媒体环境中全面的多模态深伪检测。它包括Omni-Fake-Set,一个大规模、高质量的包含超过100万个样本的数据集,以及Omni-Fake-OOD,一个包含超过20万个故意排除在训练之外的样本的分布外基准,用于评估泛化能力。Omni-Fake涵盖四种模态(图像、音频、视频和音频-视频对话头)并支持联合检测-定位-解释协议。在Omni-Fake的基础上,我们进一步提出了Omni-Fake-R1,这是一种基于强化学习的多模态检测器,能够自适应地整合视觉和听觉线索,并输出结构化决策、定位和自然语言解释。大量实验表明,与最先进的基准相比,检测准确率、跨模态泛化和可解释性显著提升。项目页面:https://tianxiao1201.github.io/omni-fake-project-page/
cs.CV / 92 / 2605.01653
SteeringDiffusion: A Bottlenecked Activation Control Interface for Diffusion Models
SteeringDiffusion:一种用于扩散模型的瓶颈激活控制接口
Abstract
We introduce SteeringDiffusion, a bottlenecked activation-level control interface for diffusion models that exposes a smooth, monotonic, and runtime-adjustable control surface over the content-style trade-off. Our method keeps the U-Net backbone frozen and learns a small, prompt-conditioned latent code projected to FiLM/AdaGN-style modulation parameters. A zero-initialized design guarantees exact equivalence to the base model at zero scale, while timestep-aware gating restricts modulation to later denoising stages. A single scalar at inference continuously traverses the control surface without retraining. Across experiments on Stable Diffusion 1.5 and SDXL covering multiple artistic styles, we show that SteeringDiffusion produces smooth and monotonic content-style trade-offs. Under matched parameter budgets, it outperforms LoRA in controllability and stability, while ControlNet and rank-1 adapters do not expose a comparable control surface. We further introduce an inversion-stability diagnostic based on DDIM inversion, used as a post-hoc trajectory probe, which reveals strong correlations with intervention magnitude. These results position Steering Bottlenecked Explicit Control (S-BEC) as a practical, general-purpose control interface for frozen diffusion backbones.
Chinese Translation
我们提出了SteeringDiffusion,一种瓶颈激活水平控制接口,用于扩散模型,能够提供平滑、单调且可实时调整的内容与风格权衡控制面。我们的方法保持U-Net主干不变,并学习一个小型的、根据提示条件调整的潜在编码,该编码被投影到FiLM/AdaGN风格的调制参数上。零初始化设计在零缩放时保证了与基本模型的完全等价,而时间步感知的门控则限制了调制仅在后期去噪阶段进行。在推理过程中,单一标量可以不断跨越控制面而无需重新训练。在对Stable Diffusion 1.5和SDXL进行的多种艺术风格实验中,我们展示了SteeringDiffusion能够生成平滑且单调的内容与风格权衡。在相同参数预算下,其在可控性和稳定性方面优于LoRA,而ControlNet和rank-1适配器不具备可比的控制面。我们进一步提出了一种基于DDIM反演的反演稳定性诊断,作为后期轨迹探测工具,显示出与干预幅度之间的强相关性。这些结果将Steering Bottlenecked Explicit Control (S-BEC)定位为一种实用的通用控制接口,适用于冻结的扩散主干。
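The zero-scale equivalence and timestep gating described in the abstract can be sketched with a FiLM-style modulation on a single activation vector. This is an illustrative sketch only: the function names, the linear ramp in `timestep_gate`, and the `1 + g * gamma` form are assumptions, not the paper's implementation.

```python
def timestep_gate(progress, start=0.5):
    """Hypothetical gate over denoising progress in [0, 1] (0 = pure noise, 1 = final step).

    Returns 0 before `start`, then ramps linearly to 1, so modulation only
    acts in the later denoising stages.
    """
    if progress < start:
        return 0.0
    return (progress - start) / (1.0 - start)

def steer(h, gamma, beta, scale, progress):
    """FiLM-style modulation: h * (1 + g * gamma) + g * beta, with g = scale * gate."""
    g = scale * timestep_gate(progress)
    return [hi * (1.0 + g * ga) + g * be for hi, ga, be in zip(h, gamma, beta)]

# Toy activations and modulation parameters (hypothetical values).
h = [0.5, -1.0, 2.0]
gamma = [0.2, 0.1, 0.3]
beta = [0.05, 0.0, 0.1]
```

Because the modulation enters only through `g`, setting `scale` to zero, zero-initializing `gamma` and `beta`, or evaluating at an early timestep each return the frozen backbone's activation unchanged. The single scalar `scale` is then the runtime control knob.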
cs.CV / 93 / 2605.01657
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See:用于视频推理的突现主动视觉感知
Abstract
Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.
Chinese Translation
视觉-语言模型(VLMs)通常依赖静态的初始帧进行视频推理,这限制了它们在推理过程演变过程中融入重要动态信息的能力。现有将思维链(CoT)与额外帧信息相结合的方法往往展现出次优的CoT质量,并缺乏对假设或反事实场景合成视觉信息的关键能力。我们提出了 Act-to-See(Act2See),这是一种新颖框架,通过使VLMs能够主动交替在文本CoTs中插入视频帧,来实现主动视觉感知。Act2See 是通过对由前沿VLM生成的高质量推理轨迹的数据集进行监督微调(SFT)而开发的。这些轨迹整合了主动调用以检索现有帧或生成新帧的功能,并严格对照人工标注的CoT进行验证以确保质量。这种方法培养了一种突现能力:在推理时,模型能主动决定何时搜索或合成所需的视觉证据。Act2See 在包括 VideoEspresso 和 ViTIB 的具有挑战性的基准上建立了新的最先进的结果,并在 Video-MME、EgoNormia 和 VCR-Bench 上超越了可比或更大模型,展示了使VLMs具备用于视频推理的主动视觉感知能力的进步。
cs.CV / 94 / 2605.01659
TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning
TRIMMER:通过自监督强化学习进行视频摘要的新范式
Abstract
The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.
Chinese Translation
视频内容在监控、教育和社交媒体等领域的快速增长使得高效内容理解变得越来越重要。视频摘要通过生成简洁而富有语义意义的表征来应对这一挑战,但现有的方法往往依赖于昂贵的手动标注,难以跨领域推广,并且由于复杂的架构而导致显著的计算成本。此外,与监督学习方法相比,无监督和弱监督方法在捕捉长期时间依赖和语义结构方面通常表现较差。在本研究中,我们提出了TRIMMER(多目标高效强化的时间相对信息最大化),一种新颖的自监督强化学习框架用于视频摘要。TRIMMER分两个阶段进行:首先通过自监督学习来学习稳健的表征,然后通过信息论奖励函数指导下的强化学习进行时空决策。与依赖于相似性目标的先前方法不同,我们的方法引入了基于熵的度量来捕捉高阶时间动态和语义多样性,同时直接基于选定帧索引计算奖励,以提高计算效率。在标准基准上的广泛实验表明,TRIMMER在无监督和自监督方法中实现了最先进的性能,同时在竞争激烈的监督方法中保持竞争力,突显了其在可扩展和可推广的视频摘要中的有效性。
cs.CV / 95 / 2605.01662
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
视频主动感知:利用视觉-语言模型进行有效推理的长视频理解
Abstract
Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we introduce Video Active Perception (VAP), a training-free method to enhance long-form video QA using VLMs. Our approach treats keyframe selection as data acquisition in active perception and leverages a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, with up to a 5.6x improvement in frame efficiency, measured in frames per question, over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. Moreover, VAP shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to questions. These findings highlight the potential of leveraging active perception to improve the frame effectiveness and efficiency of long-form video QA.
Chinese Translation
大型视觉-语言模型(VLMs)推动了多模态任务的发展,例如视频问答(QA)。然而,VLMs面临有效且高效选择帧的挑战,由于标准均匀采样成本高昂,且性能可能趋于平稳。受到主动感知理论的启发,该理论认为模型通过获取与其预期不同的数据来获得信息,我们介绍了视频主动感知(VAP),这是一种无训练的方法,旨在利用VLMs增强长视频问答。我们的方法将关键帧选择视为主动感知中的数据采集,并利用一种轻量级的文本条件视频生成模型来表示先前的世界知识。实证结果表明,VAP在EgoSchema、NExT-QA、ActivityNet-QA、IntentQA和CLEVRER等长视频或推理视频问答数据集上实现了目前最先进的零-shot 结果,相比于标准的GPT-4o、Gemini 1.5 Pro和LLaVA-OV,帧每个问题的效率提高了最多5.6倍。此外,VAP表现出比以往方法更强的推理能力,并有效选择与问题相关的关键帧。这些发现突显了利用主动感知提高长视频问答的帧有效性和效率的潜力。
cs.CV / 96 / 2605.01666
IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction
IMPACT-HOI:基于开始锚定的部分 HOI 事件构建的监督控制
Abstract
We present IMPACT-HOI, a mixed-initiative framework for annotating egocentric procedural video by constructing structured event graphs for Human-Object Interactions (HOI), motivated by the need for high-quality structured supervision for learning robot manipulation from human demonstration. IMPACT-HOI frames this task as the incremental resolution of a partially specified, onset-anchored event state. A trust-calibrated controller selects among direct queries, human-confirmed suggestions, and conservative completions based on empirical annotator behavior and evidence quality. A risk-bounded execution protocol, utilizing atomic rollback, ensures that human-confirmed decisions are preserved against conflicting automated updates. A user study with 9 participants shows a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed-field violations under the studied protocol. The code will be made publicly available at https://github.com/541741106/IMPACT_HOI.
Chinese Translation
我们提出了 IMPACT-HOI,这是一种混合主动框架,用于通过为人-物交互(Human-Object Interactions, HOI)构建结构化事件图来对以自我为中心的程序视频进行标注,其动机在于对高质量结构化监督的需求,以便从人类示范中学习机器人操作。IMPACT-HOI 将这一任务框定为增量解析部分指定的、以开始锚定的事件状态。经过信任校准的控制器根据实证标注者的行为和证据质量,在直接查询、人类确认的建议和保守的完成之间进行选择。利用原子回滚的风险限制执行协议确保人类确认的决策在面对自动冲突更新时予以保留。针对9名参与者的用户研究显示,在研究的协议下,手动标注操作减少了13.5%,事件匹配率为46.67%,且确认字段违反率为零。代码将在 https://github.com/541741106/IMPACT_HOI 上公开发布。
cs.CV / 97 / 2605.01667
Deep neural networks with Fisher vector encoding for medical image classification
基于Fisher向量编码的深度神经网络在医学图像分类中的应用
Abstract
Orderless encoding methods have been shown to improve Convolutional Neural Networks (CNNs) for image classification when data availability is limited. Additionally, hybrid CNN + Vision Transformer (ViT) models have recently been proposed to address the locality bias of CNNs. These models outperformed CNN-only approaches. Nevertheless, integrating such hybrid models with more elaborate feature representations can be highly beneficial and remains largely unexplored in the literature. In this context, we propose introducing an orderless encoding method, Fisher Vectors, into hybrid CNN + ViT architectures, aiming at a model suitable for both small and large datasets. This encoding method relies on estimating a Gaussian Mixture Model (GMM) on image features. In large datasets, the computational cost of GMM estimation is a limiting factor for the application of Fisher Vectors. Thus, we propose a method to limit the growth of GMM estimation costs as the size of the dataset increases. We explore the feasibility of our method in the context of medical image classification by applying it to MedMNIST (v2), Clean-CC-CCII and ISIC2018. This collection of datasets contains a wide variety of data scales and modalities. We outperform benchmark results in all MedMNIST (v2) datasets and obtain literature-competitive results in Clean-CC-CCII and ISIC2018.
Chinese Translation
无序编码方法已被证实能够改善卷积神经网络(CNN)在数据有限的情况下的图像分类效果。此外,最近也提出了混合CNN +视觉变换器(ViT)模型,以解决CNN的局部性偏差问题。这些模型的表现超越了仅使用CNN的方法。尽管如此,将这种混合模型与更复杂的特征表示相结合可能会带来极大的好处,但在文献中仍未得到充分探索。在此背景下,我们提出将无序编码方法Fisher向量引入混合CNN + ViT架构,以期实现适用于小型和大型数据集的模型。这种编码方法依赖于对图像特征估计高斯混合模型(GMM)。在大型数据集中,GMM估计的计算成本限制了Fisher向量的应用。因此,我们提出一种方法,以限制GMM估计成本随着数据集大小的增加而增长。我们通过将这一方法应用于MedMNIST (v2)、Clean-CC-CCII和ISIC2018,探讨其在医学图像分类中的可行性。该数据集集合包含多种数据规模和模态。在所有MedMNIST (v2)数据集中,我们的表现超越了基准结果,并在Clean-CC-CCII和ISIC2018中获得了与文献竞争的结果。
cs.CV / 98 / 2605.01668
IMPACT-Scribe: Interactive Temporal Action Segmentation with Boundary Scribbles and Query Planning
IMPACT-Scribe:基于边界涂鸦和查询规划的互动式时间动作分割
Abstract
Dense temporal annotation of procedural activity videos is vital for action understanding and embodied intelligence but remains labor-intensive due to reactive tools. Each correction is treated as an isolated edit, limiting reuse of information on annotator uncertainty and model reliability. We introduce IMPACT-Scribe, a correction-driven framework for dense labeling that uses each correction to improve future human-machine collaboration. IMPACT-Scribe combines uncertainty-aware boundary scribble supervision, local proposal modeling, cost-aware query planning, structured propagation, and correction-driven adaptation. Experiments and a human study show that this closed-loop design improves labeling quality per effort, enhances boundary accuracy, and fosters better human-machine interaction over time. The code will be made publicly available at https://github.com/BanzQians/IMPACT_AS.
Chinese Translation
对程序活动视频的密集时间标注对于动作理解和具身智能至关重要,但由于反应工具的限制,仍然劳动密集。每次修正被视为孤立的编辑,限制了关于注释者不确定性和模型可靠性的信息重用。我们提出了IMPACT-Scribe,一个以修正驱动的密集标注框架,利用每次修正来改善未来的人机协作。IMPACT-Scribe结合了以不确定性为导向的边界涂鸦监督、本地提议建模、成本感知的查询规划、结构化传播和修正驱动的适应性。实验和人类研究表明,这种闭环设计在每单位努力中提高了标注质量,增强了边界准确性,并随着时间的推移促进了更好的人与机器的互动。代码将公开可用,网址为https://github.com/BanzQians/IMPACT_AS。
cs.CV / 99 / 2605.01700
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
TrajRAG:为零-shot目标导航检索几何-语义经验
Abstract
Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience. To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric-semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience. Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric-semantic experiences and improves zero-shot ObjectNav performance.
Chinese Translation
现有的零-shot物体目标导航(Object Goal Navigation, ObjectNav)方法通常利用来自大型语言或视觉-语言模型的常识知识来指导导航。然而,这种知识源于网络规模的文本而非具体的三维经验,且在导航过程中收集的情景观察通常被丢弃,这阻碍了终身经验的积累。为此,我们提出了轨迹RAG(Trajectory RAG, TrajRAG),一种通过检索几何-语义经验来增强大型模型推理的检索增强生成框架。TrajRAG 从以往的导航回合中逐步积累情景观察。为了构建这些观察,我们提出了一种拓扑-极性(topological-polar, topo-polar)轨迹表示法,该方法紧凑地编码空间布局和语义上下文,有效地消除了原始情景观察中的冗余。分层块结构进一步将相似的拓扑-极性轨迹组织成统一的摘要,从而实现粗到细的检索。在导航过程中,候选边界生成多个轨迹假设,这些假设查询TrajRAG以获取相似的过去轨迹,从而指导大型模型在选取途径点时的推理。新的经验不断被整合进TrajRAG,促进了终身导航经验的积累。在MP3D、HM3D-v1和HM3D-v2上的实验表明,TrajRAG有效地检索相关的几何-语义经验,并提高了零-shot ObjectNav的性能。
cs.CV / 100 / 2605.01706
Exploring Entropy-based Active Learning for Fair Brain Segmentation
基于熵的公平脑区分割主动学习探索
Abstract
Active learning (AL) has emerged as a crucial strategy for reducing the prohibitive costs associated with medical image segmentation. However, standard uncertainty-based AL methods typically focus on maximizing performance metrics, ignoring performance disparities or fairness across groups with sensitive attributes. While fair active learning has been explored in classification tasks, its intersection with medical image segmentation remains unaddressed. In this work, we introduced a fairness-aware active learning framework with a Weighted Entropy selection strategy that modulates uncertainty based on current group-specific performance estimates on the labeled set. To decouple true epistemic uncertainty from anatomical volume variances, we further utilized a masked, scaled entropy restricted to the region of interest. The framework was evaluated on synthetic T1-weighted brain MRIs with controlled left caudate bias in both strong and weak bias settings. A 3D U-Net was trained to segment the left caudate under several AL strategies, starting from both demographically balanced and strongly imbalanced initial labeled sets. Experiments demonstrated that our method markedly reduces performance disparities between groups compared to random sampling and standard uncertainty sampling. By prioritizing poorly segmented subgroups during the AL cycles, our method consistently achieved the highest equity-scaled performance and reduced the disparity metric by 75% (strong bias) and 86% (weak bias) relative to standard entropy at the final budget. Overall, this work is among the first studies on fair AL for medical image segmentation, offering an efficient strategy to train more equitable models in resource-constrained environments.
Chinese Translation
主动学习(AL)已成为降低医学图像分割相关费用的重要策略。然而,标准的不确定性基础主动学习方法通常专注于最大化性能指标,忽视了具有敏感属性的群体间的性能差异或公平性。尽管在分类任务中已探讨了公平主动学习,但其与医学图像分割的交集尚未得到解决。在本研究中,我们提出了一种公平感知的主动学习框架,采用加权熵选择策略,根据标记集上当前群体特定的性能估计来调节不确定性。为了将真实的认识不确定性与解剖体积差异解耦,我们进一步利用了针对感兴趣区域的掩膜缩放熵。该框架在具有控制左侧尾状核偏差的合成T1加权脑MRI上进行了评估,涵盖了强偏差和弱偏差两种设置。我们训练了一个3D U-Net以在多种主动学习策略下分割左侧尾状核,起始于人口统计学上平衡和强烈不平衡的初始标记集。实验表明,与随机抽样和标准不确定性抽样相比,我们的方法显著减少了群体间的性能差异。通过在主动学习周期中优先考虑分割效果较差的子群体,我们的方法在最终预算下始终实现了最高的公平性标度性能,并相对于标准熵减少了75%(强偏差)和86%(弱偏差)的差异指标。总体而言,这项工作是医学图像分割领域中关于公平主动学习的首次研究之一,为在资源受限环境中训练更公平的模型提供了有效的策略。
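The selection rule described in the abstract (entropy restricted to the region of interest, modulated by current group-specific performance) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `1 - Dice` weight, the normalization by ROI voxel count and `log 2`, and all function and field names are assumptions.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def masked_scaled_entropy(probs, roi_mask):
    """Mean voxel entropy inside the ROI, scaled to [0, 1].

    Averaging over ROI voxels (rather than summing over the whole volume)
    keeps the score from growing with anatomical volume.
    """
    vals = [binary_entropy(p) for p, m in zip(probs, roi_mask) if m]
    if not vals:
        return 0.0
    return sum(vals) / (len(vals) * math.log(2.0))

def weighted_entropy_score(sample, group_dice):
    """Up-weight uncertainty for groups that are currently segmented poorly (assumed 1 - Dice)."""
    weight = 1.0 - group_dice[sample["group"]]
    return weight * masked_scaled_entropy(sample["probs"], sample["roi"])

def select_batch(pool, group_dice, k):
    """Return the ids of the k unlabeled samples with the highest weighted entropy."""
    ranked = sorted(pool, key=lambda s: weighted_entropy_score(s, group_dice), reverse=True)
    return [s["id"] for s in ranked[:k]]

# Toy pool (hypothetical): identical uncertainty, different demographic groups.
pool = [
    {"id": "a", "group": "g1", "probs": [0.5, 0.6, 0.01], "roi": [1, 1, 0]},
    {"id": "b", "group": "g2", "probs": [0.5, 0.6, 0.01], "roi": [1, 1, 0]},
]
group_dice = {"g1": 0.9, "g2": 0.6}  # current per-group Dice on the labeled set
```

Here sample `b` is picked first because its group currently has the lower Dice score, even though both samples are equally uncertain. Plain entropy sampling would treat them as tied.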
cs.CV / 101 / 2605.01711
Linear-Time Global Visual Modeling without Explicit Attention
无显式注意力的线性时间全局视觉建模
Abstract
Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level sequence global modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.
Chinese Translation
现有研究通常将Transformers的全局序列建模能力归因于注意力权重的显式计算,而这一过程本质上会导致二次计算复杂度。在本研究中,我们提供了一种新颖的视角:我们证明注意力可以在数学上重新表述为一个配备动态预测参数的多层感知机(Multi-Layer Perceptron, MLP)。通过这种视角,我们解释了注意力的全局建模能力并非显式的逐token聚合,而是一个隐式过程,其中动态生成的参数作为全局上下文的压缩表示。受到这一洞察的启发,我们探讨了一个根本性问题:我们能否完全通过动态参数化来实现Transformer级别的序列全局建模,同时保持线性复杂度,从而有效替代显式注意力?为此,我们设计了多种动态参数预测策略,并将其整合到标准网络层中。对视觉模型进行的大量实证研究证明了动态参数化确实可以作为显式注意力的高效、线性复杂度替代方案,为高效的序列建模开辟了新途径。代码可在 https://github.com/LeapLabTHU/WeightFormer 获取。
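One way to read the reframing in the abstract: for a single query, softmax attention equals a two-layer MLP whose first-layer weights are built from the keys and whose second-layer weights are the values. The sketch below checks this numerically on toy data. It shows the standard identity only and may differ from the paper's exact formulation; all function names are hypothetical and unrelated to the WeightFormer code.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention(q, keys, values):
    """Single-query scaled dot-product attention: softmax(q K^T / sqrt(d)) V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]

def mlp(x, w1, w2, activation):
    """Two-layer MLP: activation(x @ w1) @ w2, with w1 of shape d x n and w2 of shape n x d_v."""
    hidden = activation(
        [sum(x[i] * w1[i][j] for i in range(len(x))) for j in range(len(w1[0]))]
    )
    return [sum(hidden[j] * w2[j][c] for j in range(len(hidden))) for c in range(len(w2[0]))]

def dynamic_params(keys, values):
    """Build MLP weights from the context: w1 = K^T / sqrt(d), w2 = V."""
    d = len(keys[0])
    w1 = [[k[i] / math.sqrt(d) for k in keys] for i in range(d)]
    w2 = [list(v) for v in values]
    return w1, w2

# Toy context of three tokens (hypothetical values).
q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

w1, w2 = dynamic_params(keys, values)
out_attention = attention(q, keys, values)
out_mlp = mlp(q, w1, w2, softmax)
```

In this form the MLP's hidden width equals the number of context tokens, so the cost still grows with sequence length. The abstract's proposal is to predict the parameters dynamically so that the cost stays linear, which this sketch does not attempt.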
cs.CV / 102 / 2605.01718
Dual-branch Robust Unlearnable Examples
双分支鲁棒性不可学习示例
Abstract
Unlearnable examples (UEs) aim to compromise model training by injecting imperceptible perturbations to clean samples. However, existing UE schemes exhibit limited robustness against advanced defenses due to their heuristic design or narrowly scoped domain perturbations. To address this, we propose DUNE, a Dual-branch UNlearnable Ensemble perturbation optimization approach. Specifically, DUNE separately optimizes perturbations in the spatial and color domains to establish the mapping between perturbations and shift-induced labels. This design extends the perturbation domain to increase noise intensity for improving robustness and drives the models to learn perturbation-oriented features with degraded generalization, thereby achieving unlearnability. To strengthen DUNE's performance, we further propose an unlearnability-enhancing ensemble strategy that aggregates diverse pre-trained models during the dual-branch optimization. Extensive experiments on benchmark datasets CIFAR-10 and ImageNet verify that DUNE's robustness outperforms 12 SOTA UE schemes under 7 mainstream defenses, yielding a lower average test accuracy of 14.95% to 50.82%.
Chinese Translation
不可学习示例(UEs)旨在通过对干净样本施加不可察觉的扰动来妨碍模型训练。然而,现有的不可学习示例方案由于其启发式设计或狭窄的领域扰动,显示出对高级防御的有限鲁棒性。为此,我们提出了DUNE,一种双分支不可学习集成扰动优化方法(Dual-branch UNlearnable Ensemble)。具体而言,DUNE分别在空间和颜色域优化扰动,以建立扰动与因位移引起的标签之间的映射。该设计扩展了扰动域,以增加噪声强度,从而提高鲁棒性,并使模型学习以扰动为导向的特征,导致泛化能力下降,从而实现不可学习性。为了增强DUNE的性能,我们进一步提出了一种增强不可学习性的集成策略,在双分支优化过程中聚合多样化的预训练模型。在基准数据集CIFAR-10和ImageNet上进行的大量实验验证了DUNE的鲁棒性在7个主流防御下超越了12个最先进的不可学习示例方案,测试平均准确率降低至14.95%至50.82%。
cs.CV / 103 / 2605.01720
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages
SignVerse-2M:一个包含25多种手语的两百万剪辑姿态原生宇宙
Abstract
Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
Chinese Translation
现有的大规模手语资源通常仅在原始视频-文本对齐的层面提供监督,并且往往是在实验室环境中制作的。尽管这些资源对语义理解很重要,但它们并未直接提供用于开放世界识别和翻译的统一接口,也未能支撑现代姿态驱动的手语视频生成框架:1. 基于RGB的预训练识别模型在录制期间高度依赖于固定背景或服装条件,在开放世界环境中的鲁棒性不如风格无关的姿态处理模型。2. 最近的姿态引导图像/视频生成模型大多使用统一的关键点表示(如DWPose)作为其控制接口。目前,手语领域仍然缺乏能够直接与这一现代姿态原生范式接口连接的数据资源,同时也面向现实世界的开放场景。我们提出了SignVerse-2M,一个用于手语姿态建模和评估的大规模多语言姿态原生数据集。该数据集构建于公开可用的多语言手语视频资源之上,通过统一的预处理管道应用DWPose,将原始视频转换为可直接用于建模的二维姿态序列,从而形成一个包含约两百万剪辑的合成语料库,覆盖超过25种手语。与许多实验室数据集不同,该资源在减少外观变异的同时,保留了现实世界视频的录制条件和说话者多样性。为此,我们进一步提供了数据构建管道、任务定义以及一个简单的SignDW Transformer基线,展示了该资源在多语言姿态空间建模中的可行性及其与现代姿态驱动管道的兼容性,同时讨论了它所能支持的评价主张及其当前局限性。
cs.CV / 104 / 2605.01725
Motion-Aware Caching for Efficient Autoregressive Video Generation
运动感知缓存用于高效自回归视频生成
Abstract
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of 6.28x and 1.64x respectively, while effectively preserving generation quality (VBench drops of 1% and 0.01% respectively). The code is available at https://github.com/ywlq/MotionCache.
Chinese Translation
自回归视频生成范式在长视频合成方面具有理论潜力,但其实际应用受限于顺序迭代去噪的计算负担。尽管缓存重用策略可以通过跳过冗余的去噪步骤来加速生成,但现有方法依赖于粗粒度的块级跳过,这无法捕捉到细粒度的像素动态。这一忽视是至关重要的:运动剧烈的像素需要更多的去噪步骤以防止误差累积,而静态像素则可以承受积极的跳过。我们通过将缓存错误与残差不稳定性联系起来,从理论上正式化了这一见解,并提出了MotionCache,一个运动感知缓存框架,利用帧间差异作为像素级运动特征的轻量级代理。MotionCache采用了粗到细的策略:初始的预热阶段建立语义一致性,随后则是基于运动加权的缓存重用,动态调整每个标记的更新频率。针对最先进模型如SkyReels-V2和MAGI-1的广泛实验表明,MotionCache分别实现了6.28倍和1.64倍的显著加速,同时有效保持生成质量(VBench分别下降1%和0.01%)。代码可在 https://github.com/ywlq/MotionCache 获得。
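The motion-weighted cache reuse described in the abstract can be sketched on a 1D toy frame: inter-frame differences give a per-token motion score, and the score sets how often each token's cached result is refreshed. This is an assumption-laden illustration, not the MotionCache code: the linear mapping from score to update interval, `max_interval`, and the function names are hypothetical.

```python
def motion_scores(prev_frame, curr_frame, token_size):
    """Mean absolute inter-frame difference per token (a token = token_size consecutive pixels)."""
    scores = []
    for start in range(0, len(curr_frame), token_size):
        curr = curr_frame[start:start + token_size]
        prev = prev_frame[start:start + token_size]
        diffs = [abs(c - p) for c, p in zip(curr, prev)]
        scores.append(sum(diffs) / len(diffs))
    return scores

def update_mask(scores, step, warmup_steps, max_interval=4):
    """Return True for tokens that must be recomputed at this denoising step.

    During warm-up every token is recomputed. Afterwards, the highest-motion
    token is refreshed every step and static tokens every `max_interval` steps.
    """
    if step < warmup_steps:
        return [True] * len(scores)
    top = max(scores) or 1.0  # avoid division by zero for a fully static frame
    mask = []
    for s in scores:
        interval = 1 + round((1.0 - s / top) * (max_interval - 1))
        mask.append((step - warmup_steps) % interval == 0)
    return mask

# Toy frames (hypothetical): the first token is static, the second one moves.
prev_frame = [0.0] * 8
curr_frame = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
scores = motion_scores(prev_frame, curr_frame, token_size=4)

# Count recomputations per token over 8 post-warm-up steps.
refreshes = [0, 0]
for step in range(2, 10):
    for idx, needs_update in enumerate(update_mask(scores, step, warmup_steps=2)):
        refreshes[idx] += int(needs_update)
```

Over the eight post-warm-up steps the static token is recomputed twice and the moving token eight times, so static regions are skipped aggressively while moving ones are not.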
cs.CV / 105 / 2605.01733
GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models
GEASS:无训练的标题引导以缓解视觉-语言模型中的幻觉问题
Abstract
Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help--dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption's usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
Chinese Translation
视觉-语言模型(VLMs)在基础推理方面表现出色,但依然容易出现物体幻觉。最近的研究将自生成的标题视为一种均匀正向的资源,但我们发现,简单地嵌入一个标题反而可能降低效果——在 HallusionBench 上,Qwen2.5-VL-3B 的准确率几乎下降了 10 个百分点。两个结构性质可以解释这一现象。首先,标题不仅固定了模型的最终答案,还决定了其推理轨迹和词汇选择。其次,标题错误是不对称的:遗漏的数量远大于虚构,但每一个虚构错误的单实例影响却更大。因此,标题的有效性是每个查询的特性,而非每个语料库的特性。我们提出了 GEASS(Gated Evidence-Aware Selective Steering),这是一个无训练的模块,它决定在每个查询中模型消耗多少标题:它按照干净路径的置信度对标题进行限流,根据标题产生的熵减少进行加权,并在两条路径不一致时提高证据门槛。在四个 VLMs 上的 POPE 和 HallusionBench 实验表明,GEASS 在普通推理和对比解码上都具有一致的改进,仅需每个查询多进行两个前向传播。
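The three ingredients named in the abstract (a gate from the clean path's confidence, a weight from entropy reduction, and a higher evidence bar on disagreement) can be combined in many ways. The sketch below is one hypothetical combination over two answer distributions. The mixing rule, the thresholds `agree_bar` and `disagree_bar`, and the function names are assumptions and should not be read as the GEASS algorithm itself.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def steer_with_caption(p_clean, p_caption, agree_bar=0.0, disagree_bar=0.3):
    """Mix the caption-conditioned distribution into the clean one, per query.

    gate:   how unsure the clean (caption-free) path is.
    weight: relative entropy reduction the caption produces.
    bar:    minimum weight required, raised when the two paths disagree.
    """
    gate = 1.0 - max(p_clean)
    h_clean = entropy(p_clean)
    reduction = max(0.0, h_clean - entropy(p_caption))
    weight = reduction / h_clean if h_clean > 0.0 else 0.0
    agree = p_clean.index(max(p_clean)) == p_caption.index(max(p_caption))
    bar = agree_bar if agree else disagree_bar
    alpha = gate * weight if weight > bar else 0.0
    return [(1.0 - alpha) * c + alpha * k for c, k in zip(p_clean, p_caption)]

# Toy yes/no queries (hypothetical probabilities).
confident = steer_with_caption([1.0, 0.0], [0.0, 1.0])    # clean path is sure: caption ignored
helped = steer_with_caption([0.5, 0.5], [0.9, 0.1])       # clean path unsure: caption used
rejected = steer_with_caption([0.55, 0.45], [0.4, 0.6])   # paths disagree, weak evidence: ignored
```

The point of the sketch is that how much of the caption is consumed is decided per query, so a caption that would mislead a confident model is not used.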
cs.CV / 106 / 2605.01736
Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
用于零-shot 具身导航和推理的多尺度高斯语言地图
Abstract
Understanding the geometric and semantic structure of environments is essential for embodied navigation and reasoning. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target navigation and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner. The code is available at https://github.com/sx-zhang/GLMap.
Chinese Translation
理解环境的几何与语义结构对于具身导航和推理至关重要。现有的语义地图方法在明确的几何结构和多尺度语义之间存在权衡,并且缺乏对大模型的原生接口,因此需要额外对语义对齐进行特征投影的训练。因此,我们提出了多尺度高斯语言地图(GLMap),其引入了三项关键设计:(1)明确的几何结构;(2)覆盖实例和区域概念的多尺度语义;(3)一个双模态接口,其中每个语义单元共同存储自然语言描述和3D高斯表示。3D高斯能够通过高斯溅射(Gaussian splatting)实现任务相关图像的紧凑存储和快速渲染。为了实现高效的增量构建,我们进一步提出了一种高斯估计器,可以从密集点云中分析性地推导高斯参数,而无需基于梯度的优化。对ObjectNav、InstNav和SQA任务的实验表明,GLMap有效增强了目标导航和上下文推理,同时以零-shot的方式与基于大模型的方法保持兼容。代码可在https://github.com/sx-zhang/GLMap获取。
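The abstract states that the Gaussian Estimator derives Gaussian parameters analytically from dense point clouds, without gradient-based optimization. A minimal sketch of that idea is the closed-form maximum-likelihood fit below. It is an assumption that the estimator uses these moments, and `estimate_gaussian` is a hypothetical name. Opacity, color, and the rotation/scale factorization commonly used in Gaussian splatting are left out.

```python
def estimate_gaussian(points):
    """Closed-form 3D Gaussian fit: sample mean and (biased) covariance of a point set."""
    n = len(points)
    mean = [sum(p[a] for p in points) / n for a in range(3)]
    cov = [
        [sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n for b in range(3)]
        for a in range(3)
    ]
    return mean, cov

# Toy point cloud (hypothetical): the eight corners of a cube centered at (1, 2, 3).
corners = [
    (1.0 + dx, 2.0 + dy, 3.0 + dz)
    for dx in (-1.0, 1.0)
    for dy in (-1.0, 1.0)
    for dz in (-1.0, 1.0)
]
mean, cov = estimate_gaussian(corners)
```

Because the fit is a single pass over the points, it can be recomputed whenever new points are added to a semantic unit, which fits the incremental construction the abstract describes.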
cs.CV / 107 / 2605.01741
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
自适应纹理感知掩蔽用于自监督学习的三维牙科CBCT分析
Abstract
Cone Beam Computed Tomography (CBCT) is pivotal for 3D diagnostic imaging in dentistry. However, the development of robust AI models for volumetric analysis is often constrained by the scarcity of large, annotated datasets. Self-supervised learning (SSL), particularly Masked Image Modeling (MIM), offers a promising pathway to leverage unlabeled data. A limitation of standard MIM is its reliance on random masking, which fails to prioritize diagnostically critical regions in dental CBCT volumes, such as subtle pathological changes and intricate anatomical boundaries. To address this, we propose ATMask, a novel adaptive masking strategy. Instead of applying random masks or employing computationally intensive attention modules, ATMask computes an inter-slice texture variation map to identify regions with high structural or textural complexity. These high-variation areas are then selectively masked during pre-training, compelling the model to learn richer contextual representations essential for inferring complex 3D morphological transitions. Furthermore, we contribute the first large-scale CBCT dataset, curated from both public and private sources, comprising 6,314 scans, for the dental AI model pretraining. Extensive experiments on three downstream dental CBCT tasks demonstrate that our ATMask enables more data-efficient and powerful representation learning than standard random masking and other advanced SSL baselines. The dataset and code will be released.
Chinese Translation
锥形束计算机断层扫描(CBCT)在牙科三维诊断成像中至关重要。然而,开发用于体积分析的强大人工智能模型常常受到大型标注数据集稀缺的限制。自监督学习(SSL),尤其是掩蔽图像建模(MIM),提供了一条利用未标记数据的有希望的途径。标准的MIM的局限在于依赖随机掩蔽,未能优先关注牙科CBCT体积中诊断关键区域,例如微妙的病理变化和复杂的解剖边界。为了解决这一问题,我们提出了ATMask,一种新颖的自适应掩蔽策略。ATMask不是应用随机掩蔽或使用计算密集型的注意力模块,而是计算切片间的纹理变化图,以识别结构或纹理复杂性高的区域。这些高变化区域在预训练过程中被选择性掩蔽,促使模型学习更丰富的上下文表征,这对于推断复杂的三维形态过渡至关重要。此外,我们还贡献了第一个大型CBCT数据集,来自公共和私人来源,共包括6,314个扫描,用于牙科人工智能模型的预训练。对三个下游牙科CBCT任务的广泛实验表明,我们的ATMask在数据效率和强大的表征学习方面优于标准随机掩蔽和其他先进的SSL基线。数据集和代码将被发布。
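The masking strategy in the abstract (score regions by inter-slice texture variation, then mask the high-variation ones) can be sketched on a tiny volume. This is a hedged illustration only: the mean absolute difference between adjacent slices, the 2D patch grid, and the top-k rule are assumptions about details the abstract does not give, and the function names are hypothetical.

```python
def texture_variation_map(volume, patch):
    """Score each (y, x) patch by the mean absolute difference between adjacent slices.

    `volume` is indexed as volume[z][y][x].
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    scores = {}
    for py in range(0, height, patch):
        for px in range(0, width, patch):
            total, count = 0.0, 0
            for z in range(depth - 1):
                for y in range(py, min(py + patch, height)):
                    for x in range(px, min(px + patch, width)):
                        total += abs(volume[z + 1][y][x] - volume[z][y][x])
                        count += 1
            scores[(py, px)] = total / count if count else 0.0
    return scores

def select_masked_patches(scores, mask_ratio):
    """Pick the highest-variation patches to mask, instead of sampling them at random."""
    k = max(1, round(mask_ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Toy 2 x 4 x 4 volume (hypothetical): only the top-left 2 x 2 block changes between slices.
slice0 = [[0.0] * 4 for _ in range(4)]
slice1 = [[1.0 if (y < 2 and x < 2) else 0.0 for x in range(4)] for y in range(4)]
volume = [slice0, slice1]

variation = texture_variation_map(volume, patch=2)
masked = select_masked_patches(variation, mask_ratio=0.25)
```

With a 25% mask ratio the single changing patch is the one hidden from the model, so the reconstruction target falls on the region where the anatomy changes between slices.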
cs.CV / 108 / 2605.01742
Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
用于半导体集成电路封装的视觉变换器的联合架构-标记-比特宽度多轴优化
Abstract
Vision Transformers (ViTs) have achieved strong performance in visual recognition, yet their deployment in resource-constrained industrial environments remains limited. The main obstacles are their high computational cost, memory requirements, and energy consumption. While individual efficiency techniques such as neural architecture search (NAS), token compression, and low-precision inference have been extensively studied, most prior work targets only a single optimization axis, which limits the deployment gains achievable while preserving accuracy. In this paper, we present one of the first holistic frameworks that jointly optimize three complementary axes: architecture, token, and bit-width. Specifically, the framework identifies compact backbones via Neural Architecture Search (AutoFormer), reduces information processing via token merging (ToMe), and accelerates per-operation execution via fp16 mixed-precision inference. Starting from a DeiT-B/16 baseline, we first analyze accuracy-efficiency trade-offs on ImageNet-1K under aggressive compression. Then, we apply the selected configurations to a real-world in-house 3D X-ray semiconductor defect classification dataset for IC chip packaging inspection. Results show that the proposed multi-axis framework achieves a more than 10-fold improvement in throughput along with over 10-fold reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task. To the best of our knowledge, this is among the earliest works to jointly optimize architecture, token, and bit-width dimensions in ViTs and the first such resource-efficient, deployment-focused study tailored to semiconductor manufacturing.
Chinese Translation
视觉变换器(Vision Transformers, ViTs)在视觉识别方面取得了显著的性能,但在资源受限的工业环境中的应用仍然有限。一些主要挑战包括高计算成本、内存需求和能耗。尽管单独的效率技术,如神经架构搜索(Neural Architecture Search, NAS)、标记压缩和低精度推理得到了广泛研究,但大多数先前的工作仅针对单一优化轴,限制了整体部署收益,同时保持准确性。本文提出了一种最早的整体框架之一,联合优化三个互补的轴:架构、标记和比特宽度。具体而言,该框架通过神经架构搜索(AutoFormer)识别紧凑的主干,通过标记合并(ToMe)减少信息处理,并通过fp16混合精度推理加速每个操作的执行。从DeiT-B/16基线开始,我们首先分析了在激进压缩下的ImageNet-1K上的准确性-效率权衡。然后,我们将选定的配置应用于一个真实的内部3D X射线半导体缺陷分类数据集,以进行集成电路芯片封装检查。结果表明,所提出的多轴框架在吞吐量上提升超过10倍,同时在参数数量、浮点运算(FLOPs)和能量消耗上减少超过10倍,同时保持下游工业任务所需的准确性。据我们所知,这是在ViTs中联合优化架构、标记和比特宽度维度的最早工作之一,也是第一个专门针对半导体制造的高效资源、以部署为中心的研究。
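To make the token axis more concrete, the following is a simplified sketch of similarity-based token merging in the spirit of ToMe. It is an assumption-laden illustration: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second by cosine similarity, and the `r` best-matched pairs are averaged. The published ToMe method additionally tracks token sizes and operates on attention keys; none of that is reproduced here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_tokens(tokens, r):
    """Merge the r most similar (A, B) pairs; assumes r <= len(tokens) // 2."""
    set_a, set_b = tokens[0::2], tokens[1::2]
    edges = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        edges.append((sims[j], i, j))  # best partner in B for token i of A
    edges.sort(reverse=True)
    merged = {i: j for _, i, j in edges[:r]}  # A-index -> B-index to merge into
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for i, j in merged.items():
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged]
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    return kept_a + merged_b  # unmerged A tokens first, then (averaged) B tokens
```

Each call removes `r` tokens, so applying it in every transformer block shrinks the sequence progressively; the output order differs from the input order, which is acceptable when positional information is already embedded in the tokens.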
cs.CV / 109 / 2605.01743
MOC-3D: Manifold-Order Consistency for Text-to-3D Generation
MOC-3D:用于文本到3D生成的多样体顺序一致性
Abstract
With the burgeoning development of fields such as the Metaverse, Virtual Reality (VR), and Digital Twins, text-to-3D generation has emerged as a research hotspot in both academia and industry. Currently, optimization methods based on Score Distillation Sampling (SDS) utilizing 2D diffusion priors have become the mainstream technological paradigm in this field. However, due to the view bias of 2D priors and the mode-seeking ambiguity combined with gradient noise induced by high Classifier-Free Guidance (CFG), these methods still suffer from macro-topological inconsistency (e.g., the Janus problem) and micro-geometric discontinuity. To address these challenges, we propose MOC-3D, a text-to-3D generation method based on geometric manifold and semantic view-order consistency. Built upon the ScaleDreamer framework, our method incorporates a Semantic View-Order Constraint Module and a Manifold-based Feature Continuity Module. The former aims to rectify macro-topological inconsistency, while the latter focuses on eliminating micro-geometric discontinuity. Specifically, the Semantic View-Order Constraint Module leverages the prior knowledge of CLIP to impose a Monotonicity Rank Constraint on semantic score representations across different views, thereby providing effective guidance for the global topological structure of 3D objects. Meanwhile, the Manifold-based Feature Continuity Module employs the Riemannian Metric on the Symmetric Positive Definite (SPD) manifold. By measuring the distance of feature statistical distributions in the Riemannian space, it promotes the smooth evolution and continuity of micro-textures across multi-views in a statistical sense. Under the macro-micro synergistic optimization of these two modules, our model can simultaneously improve macro-structural consistency and micro-detail continuity.
Chinese Translation
随着元宇宙、虚拟现实(VR)和数字双胞胎等领域的蓬勃发展,文本到3D生成已成为学术界和工业界的研究热点。目前,基于得分蒸馏采样(Score Distillation Sampling, SDS)的方法,利用二维扩散先验已成为该领域的主流技术模式。然而,由于二维先验的视角偏差以及高分类器无关引导(Classifier-Free Guidance, CFG)所导致的模式寻求模糊性和梯度噪声,这些方法仍然面临宏观拓扑不一致(如严面问题)和微观几何不连续的挑战。为了解决这些问题,我们提出了MOC-3D,一种基于几何多样体和语义视图顺序一致性的文本到3D生成方法。我们的方法建立在ScaleDreamer框架之上,结合了语义视图顺序约束模块和基于多样体的特征连续性模块。前者旨在纠正宏观拓扑不一致,而后者则专注于消除微观几何不连续性。具体来说,语义视图顺序约束模块利用CLIP的先验知识,对不同视角的语义得分表示施加单调性等级约束,从而为3D对象的全局拓扑结构提供有效指导。同时,基于多样体的特征连续性模块使用在对称正定(Symmetric Positive Definite, SPD)多样体上的黎曼度量。通过在黎曼空间中测量特征统计分布的距离,它在统计意义上促进了多视角间微观纹理的平滑演化与连续性。在这两个模块的宏观-微观协同优化下,我们的模型可以同时提高宏观结构一致性和微观细节连续性。
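The abstract does not say which Riemannian metric on the SPD manifold is used. As one common choice, the affine-invariant distance between SPD matrices A and B is the square root of the sum of squared logarithms of the eigenvalues of A^{-1}B. The sketch below computes it for the 2x2 case only, where the eigenvalues follow from the trace and determinant; the function name and the restriction to 2x2 matrices are illustrative assumptions.

```python
import math


def airm_distance_2x2(a, b):
    """Affine-invariant Riemannian distance between two 2x2 SPD matrices."""
    (a00, a01), (a10, a11) = a
    (b00, b01), (b10, b11) = b
    det_a = a00 * a11 - a01 * a10
    det_b = b00 * b11 - b01 * b10
    # Trace and determinant of M = A^{-1} B, without forming M explicitly.
    trace = (a11 * b00 - a01 * b10 - a10 * b01 + a00 * b11) / det_a
    det = det_b / det_a
    # Eigenvalues of M are real and positive for SPD inputs.
    disc = max(trace * trace - 4.0 * det, 0.0)
    lam1 = (trace + math.sqrt(disc)) / 2.0
    lam2 = det / lam1
    return math.sqrt(math.log(lam1) ** 2 + math.log(lam2) ** 2)
```

In a feature-continuity setting, `a` and `b` could be covariance matrices of features from two neighbouring views, so a small distance would indicate statistically similar textures; that usage is an interpretation of the abstract, not a confirmed detail.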
cs.CV / 110 / 2605.01746
Profile-Specific 3DMM Regression from a Single Lateral Face Image
基于单张侧面人脸图像的特征特定三维变形模型回归
Abstract
Single-image 3D face reconstruction is a core problem in computer vision, with important clinical applications such as cephalometric landmark analysis in orthodontics. Traditionally, this analysis relies on lateral X-ray imaging; however, frequent X-ray exposure is impractical due to radiation concerns. While recent research has explored detecting landmarks from lateral RGB images as an alternative, existing methods typically rely on 2D features such as the eyes, mouth, ears, and boundary silhouettes, failing to fully exploit the underlying 3D facial geometry spanning the facial profile and jawline, which is essential for accurate diagnosis. Meanwhile, although 3D face reconstruction from frontal views has seen significant progress, most learning-based 3D morphable model (3DMM) regressors are developed and benchmarked on near-frontal images, where appearance cues are abundant. In extreme profile views (yaw $\approx 90^\circ$), much of the face is occluded, and the available signal is dominated by boundary cues, making accurate 3D reconstruction challenging. In this paper, we bridge this gap with geometry-conditioned synthetic data and a simple profile-specific FLAME regression baseline for single lateral images. We introduce ProfileSynth, a dataset created by sampling FLAME shape and pose parameters in extreme yaw ranges and generating photorealistic profile images using a diffusion model conditioned on depth and normal maps. We further study a profile-specific baseline with visibility-aware jawline regularization. Our framework provides a practical baseline for "profile $\times$ 3DMM" reconstruction and a promising foundation for more accurate, non-invasive cephalometric analysis from lateral RGB images.
Chinese Translation
单张图像的三维人脸重建是计算机视觉中的一个核心问题,具有重要的临床应用,如在正畸学中进行头影标志分析。传统上,该分析依赖于侧面X光成像;然而,由于辐射问题,频繁的X光暴露并不实际。尽管近期研究探索了从侧面RGB图像中检测标志作为替代方案,但现有方法通常依赖于眼睛、嘴巴、耳朵和轮廓剪影等二维特征,未能充分利用跨越面部轮廓和下颌线的基础三维面部几何结构,而这些对于准确诊断至关重要。同时,尽管从正面视角的三维人脸重建已取得显著进展,但大多数基于学习的三维可变形模型(3DMM)回归器都是在近正面图像上开发和基准测试的,而这些图像通常包含丰富的外观线索。在极端侧面视角(偏航角 $\approx 90^\circ$)下,面部的很大一部分被遮挡,且可用信号主要由边界线索主导,使得准确的三维重建具有挑战性。在本文中,我们通过基于几何条件合成数据和简单的特征特定FLAME回归基线来填补这一空白,以针对单张侧面图像进行重建。我们介绍了ProfileSynth,一个通过在极端偏航范围内采样FLAME形状和姿态参数,并利用条件于深度和法线图的扩散模型生成逼真的侧面图像而创建的数据集。我们进一步研究了一种具有可见性意识的下颌线正则化的特征特定基线。我们的框架为 "侧面 $\times$ 3DMM" 重建提供了一个实用的基线,并为从侧面RGB图像进行更准确的非侵入性头影测量分析奠定了良好的基础。
cs.CV / 111 / 2605.01759
PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
PointCSP:自监督点云学习中的跨样本语义传播与稳定性保持
Abstract
Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances in the field through existing methods, the sample-independent modeling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space and achieve global semantic alignment. Since serialization-based pretraining requires batch-level input organization, we further introduce an asymmetric semantic preservation distillation (SPD) during finetuning to achieve structural alignment of semantic transfer and eliminate inconsistencies caused by batch dependency. The proposed SPD ensures stable transfer of pretrained semantics through a heterogeneous input mechanism and a semantic feature alignment constraint. This enables the model to maintain structured semantic consistency and robustness under single-scene testing conditions. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods in both performance and semantic consistency.
Chinese Translation
场景级点云自监督学习(PC-SSL)在增强三维视觉模型的泛化能力方面表现出了潜力。尽管现有方法在该领域取得了一定进展,但样本独立建模范式仍然在保持跨场景的一致语义表示方面存在显著限制。这一挑战阻碍了统一可转换语义空间的构建。为了解决这个问题,我们提出了一种基于跨样本语义传播(CSP)的PC-SSL框架,在该框架中,批次内的样本被序列化为连续输入,并通过状态空间模型进行处理,以实现语义状态的传播。该机制明确建模了状态空间中样本之间的动态依赖关系,使网络能够在潜在空间中建立跨样本的语义一致性,实现全局语义对齐。由于基于序列化的预训练需要批次级输入组织,我们进一步引入了非对称语义保持蒸馏(SPD),以在微调过程中实现语义转移的结构对齐,并消除由于批次依赖性引起的不一致性。所提议的SPD通过异构输入机制和语义特征对齐约束,确保了预训练语义的稳定转移。这使得模型在单场景测试条件下保持结构化语义一致性和鲁棒性。对多个基准数据集的广泛实验表明,我们的方法在性能和语义一致性方面均持续超越了最先进的技术。
cs.CV / 112 / 2605.01761
TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
TrajShield:对抗越狱攻击的文本到视频模型的轨迹级安全调节
Abstract
Text-to-Video (T2V) models have demonstrated remarkable capability in generating temporally coherent videos from natural language prompts, yet they also risk producing unsafe content such as violence or explicit material. Existing prompt-level defenses are largely inherited from text-to-image safety and operate on the lexical surface of the input, making them vulnerable to jailbreak attacks that disguise harmful intent through rephrasing or adversarial prompting. Moreover, T2V generation introduces a distinctive challenge overlooked by prior work: temporally emergent risk, where a seemingly benign prompt leads to unsafe content through the generator's temporal extrapolation toward narrative coherence. We propose TrajShield, a training-free, inference-time defense framework that reformulates T2V safety as a causal intervention in a temporally structured semantic space. TrajShield handles explicit unsafe prompts, jailbreak attacks, and temporally emergent risks in a unified manner by simulating the implied trajectory of a prompt, localizing the causal origin of potential risk, and applying a minimally invasive rewrite that neutralizes the risk while preserving safety-irrelevant semantics. Experiments on T2VSafetyBench across 14 safety categories and multiple T2V backends demonstrate that TrajShield achieves state-of-the-art defensive performance while maintaining high semantic fidelity, substantially outperforming existing defenses, with an average attack success rate (ASR) reduction of 52.44%.
Chinese Translation
文本到视频(T2V)模型在将自然语言提示生成时间上连贯的视频方面表现出了显著的能力,但也面临产生不安全内容(如暴力或露骨材料)的风险。现有的提示级防御主要继承自文本到图像的安全措施,并对输入的词汇表面进行操作,这使得它们容易受到通过重新表述或对抗性提示来掩盖有害意图的越狱攻击。此外,T2V生成引入了一个以往研究所忽视的独特挑战:时间上突现的风险,即一个看似无害的提示通过生成器在叙事一致性上的时间外推而导致不安全内容。我们提出了TrajShield,这是一种无训练、推理时的防御框架,将T2V安全重新构造为在时间结构语义空间中的因果干预。TrajShield通过模拟提示的隐含轨迹、定位潜在风险的因果起源,并应用一种最小侵入的重写,在统一的方式下处理明确的不安全提示、越狱攻击和时间上突现的风险,从而在不影响安全无关语义的情况下中和风险。在14个安全类别和多个T2V后端的T2VSafetyBench实验中,TrajShield展示了其在防御性能上的最新水平,同时保持高语义保真度,显著超越现有防御,平均ASR降低52.44%。
cs.CV / 113 / 2605.01779
MedScribe: Clinically Grounded CT Reporting through Agentic Workflows
MedScribe:基于临床的 CT 报告生成与自主工作流
Abstract
Vision-language models (VLMs) have shown potential for automated radiology report generation, yet existing approaches rely on global embedding compression of volumetric data, often leading to hallucinated findings and limited anatomical grounding in 3D CT imaging. We introduce MedScribe, a hypothesis-driven framework that reformulates report generation as an iterative evidence acquisition process rather than a single-pass encoding task. MedScribe models reporting as a sequential decision process in which a large language model dynamically invokes pathology-specific diagnostic tools to extract localized volumetric features. These structured features are used to query a multidimensional retrieval space aligned with pathology-specific textual evidence. By explicitly accumulating quantitative evidence prior to synthesis, the framework enforces fine-grained grounding and reduces unsupported claims. Without task-specific fine-tuning, MedScribe improves clinical accuracy, factual consistency, and interpretability on CT-RATE and RadChestCT compared to state-of-the-art 2D and 3D VLMs, demonstrating the value of hypothesis-driven reasoning for reliable medical image reporting.
Chinese Translation
视觉-语言模型(VLMs)在自动化放射学报告生成方面显示出潜力,但现有方法依赖于体积数据的全局嵌入压缩,往往导致虚假发现和在 3D CT 成像中有限的解剖基础。我们提出了 MedScribe,一个假设驱动的框架,将报告生成重新定义为迭代证据获取的过程,而不是单次编码任务。MedScribe 将报告建模为一个顺序决策过程,其中大型语言模型动态调用特定病理的诊断工具以提取局部体积特征。这些结构化特征用于查询与特定病理文本证据对齐的多维检索空间。通过在合成之前明确积累定量证据,该框架强化了细粒度基础并减少了无支持的主张。未经特定任务微调,MedScribe 在 CT-RATE 和 RadChestCT 上相较于最先进的 2D 和 3D VLMs 提升了临床准确性、事实一致性和可解释性,显示出假设驱动推理在可靠医学影像报告中的价值。
cs.CV / 114 / 2605.01799
Embody4D: A Generalist 4D World Model for Embodied AI
Embody4D:面向具身人工智能的通用 4D 世界模型
Abstract
World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations, lacking the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily due to challenges from severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios, capable of synthesizing arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, guaranteeing broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy; by leveraging confidence disparities across image regions, this method selectively regularizes the diffusion process to ensure strict spatiotemporal consistency. Finally, to guarantee manipulation fidelity, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state-of-the-art performance, serving as a robust world model that synthesizes high-fidelity, view-consistent videos to empower downstream robotic planning and learning.
Chinese Translation
世界模型在建模动态环境方面取得了显著进展;然而,大多数具身世界模型仍然局限于 2D 表示,缺乏进行具身空间推理所必需的全面多视图信息。弥补这一差距并非易事,主要由于成对的多视图数据严重匮乏、生成的 3D 几何体中保持时空一致性的困难以及固有的操作细节幻觉倾向。为了解决这些挑战,我们提出了 Embody4D,一个专门针对具身场景的视频到视频世界模型,能够从单目视频中合成任意新视图。首先,为了应对数据稀缺问题,我们引入了一个 3D 认知的组合合成管道,以策划一个异质数据集,组合具有多样背景的跨具身机器人臂,从而保证广泛的泛化能力。其次,为了强化几何稳定性,我们设计了一种自适应噪声注入策略;通过利用图像区域之间的置信度差异,该方法选择性地对扩散过程进行正则化,以确保严格的时空一致性。最后,为了保证操作的真实性,我们结合了一种关注交互的注意机制,专门关注机器人交互区域。大量实验表明,Embody4D 实现了最佳性能,作为一个强大的世界模型,合成高保真、视图一致的视频,以增强下游机器人规划和学习的能力。
cs.CV / 115 / 2605.01815
Cross-Domain Adversarial Augmentation: Stabilizing GANs for Medical and Handwriting Data Scarcity
跨域对抗增强:稳定生成对抗网络在医学和手写数据稀缺中的应用
Abstract
Generative Adversarial Networks (GANs) offer a pragmatic route to mitigate data scarcity in vision tasks. We study generative augmentation across two low-resource domains, Bangla handwritten characters and chest X-ray imaging, using DCGAN-style models trained at 64x64 resolution. We evaluate fidelity and diversity via Inception Score (IS), Fréchet Inception Distance (FID), and embedding visualizations (t-SNE/UMAP), and assess downstream utility by training classifiers on real data alone versus real plus synthetic data. Our experiments show that generative augmentation improves sample diversity and yields consistent gains in classifier performance under limited-data regimes. We analyze stability enhancements (e.g., gradient-penalized objectives and spectral normalization) and report ablations on synthetic-to-real ratios and sample filtering. We discuss evaluation caveats for medical images, dataset licensing, and privacy risks associated with synthetic data. The resulting protocol is simple to reproduce and provides a strong baseline for applying generative augmentation to resource-constrained imaging tasks.
Chinese Translation
生成对抗网络(GAN)为缓解视觉任务中的数据稀缺提供了切实可行的途径。我们研究了在两个低资源领域的生成增强:孟加拉手写字符和胸部X光影像,使用在64x64分辨率下训练的DCGAN风格模型。我们通过Inception Score(IS)、Fréchet Inception Distance(FID)和嵌入可视化(t-SNE/UMAP)来评估生成样本的真实性和多样性,并通过对真实数据和真实合成数据训练分类器来评估下游效用。我们的实验表明,生成增强提高了样本多样性,并在有限数据场景下对分类器性能产生了一致的提升。我们分析了稳定性增强措施(例如,梯度惩罚目标和谱归一化),并报告了合成数据与真实数据比率以及样本过滤的消融实验。我们讨论了医学图像评估中的注意事项、数据集许可问题,以及与合成数据相关的隐私风险。所提出的协议简单易于复现,并为在资源受限的成像任务中应用生成增强提供了强有力的基准。
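For readers unfamiliar with FID, it is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images: the squared distance between the means plus Tr(S_r + S_g - 2 (S_r S_g)^{1/2}). The sketch below evaluates that expression under the simplifying assumption of diagonal covariances, in which case the matrix square root reduces to an element-wise one. This simplification is for illustration only; standard FID uses full covariance matrices, and the function name is hypothetical.

```python
import math


def diagonal_fid(mu_real, var_real, mu_gen, var_gen):
    """Fréchet distance between two Gaussians with diagonal covariances.

    mu_* are per-dimension feature means, var_* per-dimension variances.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_real, mu_gen))
    # With diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^{1/2}) is a
    # per-dimension sum of v_r + v_g - 2 * sqrt(v_r * v_g).
    cov_term = sum(
        v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var_real, var_gen)
    )
    return mean_term + cov_term
```

Identical statistics give a distance of zero, and both a mean shift and a variance mismatch increase it, which is why FID responds to fidelity and diversity problems alike.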
cs.CV / 116 / 2605.01826
Hybrid Visual Telemetry for Bandwidth-Constrained Robotic Vision: A Pilot Study with HEVC Base Video and JPEG ROI Stills
带宽受限的机器人视觉的混合视觉遥测:HEVC基础视频和JPEG感兴趣区域静帧的初步研究
Abstract
Bandwidth-constrained robotic and surveillance systems often rely on a single compressed video stream to support both continuous scene awareness and downstream machine perception. In practice, this creates a mismatch: low-bitrate video can preserve motion and coarse context, but often loses the fine local detail needed for reliable object recognition and decision-making. Motivated by a hybrid architecture in which low-resolution video supports dynamic scene understanding while event-driven high-detail regions of interest (ROIs) support close-up identification and analytics, this paper formalizes a two-channel visual telemetry scheme in which a continuous low-bitrate video stream is augmented by selectively transmitted high-detail still ROIs. This first paper does not attempt to prove the superiority of a new still-image codec. Instead, it establishes the hybrid transmission paradigm itself using a practical and reproducible codec stack: x265/HEVC for the base video stream and JPEG stills for ROI refinement. We formulate the problem as bitrate-constrained information selection for robotic vision and define an experimental protocol in which video-only and hybrid schemes are compared under matched total communication budgets. The study is designed around UAV-oriented datasets, two practical bitrate regimes, several ROI triggering policies, and object-level classification refinement on selectively transmitted ROI stills. The resulting paper lays the methodological foundation for a second-stage investigation of JPEG AI as the semantic still-image channel within the same hybrid architecture.
Chinese Translation
带宽受限的机器人和监控系统通常依赖于单一的压缩视频流,以支持连续的场景感知和下游机器感知。然而,实际应用中,这就造成了一个不匹配:低码率视频能够保留运动和粗略的上下文,但常常失去用于可靠物体识别和决策所需的细微局部细节。本文受到一种混合架构的启发,该架构中低分辨率视频支持动态场景理解,而事件驱动的高细节感兴趣区域(ROIs)支持近距离识别和分析。本文正式提出了一种双通道视觉遥测方案,其中连续的低码率视频流通过选择性传输的高细节静态ROIs进行增强。这篇论文并不试图证明一种新的静态图像编解码器的优越性,而是使用一个实用且可重复的编解码器栈(x265/HEVC作为基础视频流,JPEG静帧用于ROIs精细化)确立了混合传输范式。我们将问题表述为针对机器人视觉的带宽受限信息选择,并定义了一种实验协议,在该协议下比较了视频单一和混合方案在匹配的总通信预算下的效果。本研究围绕UAV相关的数据集设计,包含两个实际的码率范围、几种ROIs触发策略,以及针对选择性传输的ROI静帧的物体级分类精细化。最终的论文为第二阶段对JPEG AI作为该混合架构内语义静态图像通道的研究奠定了方法论基础。
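The "matched total communication budget" comparison comes down to simple arithmetic: whatever bitrate the hybrid scheme takes away from the base video is what remains for ROI stills. The helper below sketches that bookkeeping. The function name, parameters, and the example numbers are hypothetical and are not taken from the paper.

```python
def roi_stills_per_second(total_kbps, video_kbps, roi_kilobytes):
    """Average ROI stills per second that fit in the residual budget.

    Assumes 1 kB = 1000 bytes and 1 kbit = 1000 bits, and ignores
    transport overhead.
    """
    if video_kbps > total_kbps:
        raise ValueError("base video alone exceeds the total budget")
    residual_kbps = total_kbps - video_kbps
    return residual_kbps / (roi_kilobytes * 8.0)


# Example with made-up numbers: a 500 kbps link, 300 kbps spent on HEVC video,
# and 25 kB JPEG ROI stills leave room for one still per second on average.
rate = roi_stills_per_second(500.0, 300.0, 25.0)
```

The video-only baseline corresponds to `video_kbps == total_kbps`, which leaves no room for stills, so both schemes consume the same total bitrate by construction.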
cs.CV / 117 / 2605.01827
Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering
通过上下文潜在引导参考多个区域的大型多模态模型
Abstract
Large Multimodal Models (LMMs) have recently demonstrated their proficiency in holistic visual comprehension. However, most of them struggle with region-level perception guided by visual prompts, especially when multiple regions are referred to simultaneously or when global context is necessary for precise visual referring. We introduce Contextual Latent Steering (CSteer), a training-free approach for guiding general LMMs to refer to multiple regions contextually, without expensive fine-tuning or architectural modifications. CSteer first pre-computes contextual vectors that implicitly represent visual referring behaviors, such as differentiation among regions and attention to global contexts, and then applies representation editing at inference time. Experimental results on multiple datasets indicate that general LMMs with CSteer outperform tailored referring LMMs in most cases, offering a promising training-free solution and setting a new state of the art for this field. Code is available at https://github.com/xing0047/csteer.git.
Chinese Translation
大型多模态模型(LMMs)最近展示了其在整体视觉理解方面的能力。然而,大多数模型在处理由视觉提示引导的区域级感知时表现不佳,尤其是在同时参考多个区域或需要全局上下文以实现精确视觉参考的场景中。我们提出了上下文潜在引导(CSteer),这是一种无需训练的方法,旨在引导通用LMMs进行上下文中多个区域的参考,而无需昂贵的微调或架构修改。CSteer通过预计算隐式表示视觉参考行为(如区域间的差异化和对全局上下文的关注)的上下文向量开始,然后在推理时进行表示编辑。多个数据集上的实验结果表明,使用CSteer的通用LMMs在大多数情况下优于针对性调优的LMMs,为无训练的解决方案提供了有希望的选择,并为该领域设立了新的最先进水平。代码可在 https://github.com/xing0047/csteer.git 获取。
cs.CV / 118 / 2605.01829
GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
GeoSAE:基于几何先验导向的逐层稀疏自编码器在脑MRI基础模型中的标注
Abstract
Brain MRI foundation models learn rich representations of anatomy, but interpreting what clinical information they encode remains an open problem. Standard sparse autoencoders (SAEs) suffer from severe feature collapse in deep transformer layers, and in Alzheimer's disease (AD) research, aging confounds nearly every clinical variable, making naive annotation unreliable. We propose GeoSAE, a geometry-guided SAE framework that uses the foundation model's learned manifold structure to prevent feature collapse and annotates each surviving feature via age-deconfounded partial correlations. Applied to ~14k T1-weighted MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging Biomarkers and Lifestyle (AIBL) datasets, GeoSAE identifies a compact, fully interpretable feature set that predicts mild cognitive impairment (MCI)-to-AD conversion (AUC 0.746) using only 2% of the embedding dimensions, while comorbidity-annotated features achieve only chance-level performance. The identified features replicate across cohorts without retraining (r=0.97) and localize to neuroanatomically distinct regions consistent with Braak staging. This shows that geometry-guided SAEs can extract interpretable biomarkers from frozen brain MRI foundation models.
Chinese Translation
脑MRI基础模型学习了丰富的解剖结构表示,但如何解读它们编码的临床信息仍然是一个未解决的问题。标准的稀疏自编码器(SAEs)在深度变换层中遭遇严重的特征崩溃,并且在阿尔茨海默病(AD)研究中,年龄几乎干扰着所有的临床变量,使得简单的标注变得不可靠。我们提出了GeoSAE,一个基于几何引导的SAE框架,该框架利用基础模型学习的流形结构来防止特征崩溃,并通过去混淆年龄的部分相关性对每个存活特征进行标注。GeoSAE应用于来自阿尔茨海默病神经影像学倡议(ADNI)和澳大利亚影像生物标志物和生活方式(AIBL)数据集的约14,000个T1加权MRI扫描,识别出一个紧凑、完全可解释的特征集,能够利用仅占2%嵌入维度的特征来预测轻度认知障碍(MCI)到AD的转变(AUC 0.746),而伴随疾病标注的特征仅达到偶然水平的表现。所识别的特征在不同队列中无需重训练即可复制(r=0.97),并定位于与Braak分期一致的神经解剖学上不同的区域。这表明,几何引导的SAEs能够从固定的脑MRI基础模型中提取可解释的生物标志物。
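The age-deconfounding step can be illustrated with a textbook partial correlation: regress age out of both the SAE feature and the clinical variable, then correlate the residuals. The sketch below assumes a single linear age covariate; the authors' exact procedure may differ, and the function names are hypothetical.

```python
import math


def _residuals(y, x):
    """Residuals of y after removing its least-squares linear fit on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def age_partial_correlation(feature, clinical, age):
    """Correlation between feature and clinical variable with age regressed out."""
    return pearson(_residuals(feature, age), _residuals(clinical, age))
```

On data where both variables simply track age, the raw correlation can be strongly positive while the partial correlation is negative, which is the failure mode of naive annotation that the abstract warns about.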
cs.CV / 119 / 2605.01848
Disentangled Anatomy-Disease Diffusion (DADD) for Controllable Ulcerative Colitis Progression Synthesis
解耦的解剖-疾病扩散模型 (DADD) 用于可控的溃疡性结肠炎进展合成
Abstract
Synthesizing longitudinal medical images at controllable disease stages while preserving patient-specific anatomy is hindered by the entanglement of pathological textures and structural features. We address this challenge for ulcerative colitis (UC) endoscopy, where severity follows a continuous ordinal progression along the Mayo Endoscopic Score (MES). Our framework, Disentangled Anatomy-Disease Diffusion (DADD), conditions a latent diffusion model on two complementary embeddings: a pretrained image encoder for patient anatomy and a separately trained ordinal embedder for cumulative disease severity. Since image embeddings inevitably capture disease information, we introduce a Feature Purifier, a cross-attention-based erasure mechanism that identifies and suppresses disease-correlated channels, yielding purified anatomical representations. These cleaned anatomy tokens and target disease tokens are injected into the denoising network via a Triple-Pathway Cross-Attention mechanism with resolution-dependent routing gates. This architecture leverages the U-Net hierarchy, in which different network depths encode global structure versus fine-grained pathological texture. Furthermore, we introduce Delta Steering, a training-free directional signal derived from the ordinal embeddings that enables explicit, single-pass control over disease transitions at inference without requiring additional forward passes. Validated on the LIMUC dataset, our approach produces high-fidelity images across all severity levels and effectively rebalances skewed class distributions, enhancing performance for downstream classification tasks. The dataset is available at zenodo.org/records/5827695 and the code base at github.com/umutdundar99/progressive-stable-diffusion
Chinese Translation
在保持患者特异性解剖结构的同时合成可控疾病阶段的纵向医学图像,受到病理纹理和结构特征纠缠的阻碍。本研究针对溃疡性结肠炎(UC)内镜检查中的这一挑战,其中病情严重程度沿着梅奥内镜评分(MES)进行连续顺序进展。我们的框架——解耦的解剖-疾病扩散模型(DADD),在两个互补的嵌入上对隐式扩散模型进行条件化:一个用于患者解剖结构的预训练图像编码器和一个单独训练的顺序嵌入器,用于累计疾病严重程度。由于图像嵌入不可避免地捕获疾病信息,我们引入了特征净化器(Feature Purifier),这是一种基于交叉注意力的消除机制,能够识别和抑制与疾病相关的通道,从而产生净化的解剖表示。这些清理后的解剖标记和目标疾病标记通过带有分辨率依赖路由门的三途径交叉注意力机制注入去噪网络。该架构利用了U-Net层次结构,其中不同的网络深度编码全局结构与细致的病理纹理。此外,我们还引入了Delta Steering,一种无训练的方向信号,源自顺序嵌入,能够在推理时对疾病转变进行明确的单次控制,而无需额外的前向传递。经过在LIMUC数据集上的验证,我们的方法在所有严重程度级别上生成高保真图像,并有效地重新平衡偏斜的类别分布,从而增强下游分类任务的性能。数据集可在zenodo.org/records/5827695获得,代码库可在github.com/umutdundar99/progressive-stable-diffusion找到。
cs.CV / 120 / 2605.01852
DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
DP-SfM:无尺度歧义的双像素运动重建
Abstract
Multi-view 3D reconstruction, namely, structure-from-motion followed by multi-view stereo, is a fundamental component of 3D computer vision. In general, multi-view 3D reconstruction suffers from an unknown scale ambiguity unless a reference object of known size is present in the scene. In this article, we show that multi-view images captured using a dual-pixel (DP) sensor can automatically resolve the scale ambiguity, without requiring a reference object or prior calibration. Specifically, the defocus blur observed in DP images provides sufficient information to determine the absolute scale when paired with depth maps (up to scale) recovered from multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear method to estimate the absolute scale, followed by the intensity-based optimization stage that aligns the left and right DP images by shifting them back toward each other using cross-view blur kernels. Experiments demonstrate the effectiveness of the proposed approach across diverse scenes captured with different cameras and lenses. Code and data are available at https://github.com/lilika-makabe/dp-sfm-tpami.git
Chinese Translation
多视角 3D 重建,即结构光学与多视角立体视觉的结合,是 3D 计算机视觉的一个基本组成部分。一般来说,除非场景中存在已知尺寸的参考物体,否则多视角 3D 重建会面临未知尺度歧义。本文展示了利用双像素 (DP) 传感器捕获的多视角图像可以自动解决尺度歧义,而无需参考物体或预先校准。具体而言,在 DP 图像中观察到的失焦模糊提供了足够的信息,可以在与从多视角 3D 重建恢复的深度图(相对于尺度)配对时确定绝对尺度。基于这一观察,我们开发了一种简单有效的线性方法来估计绝对尺度,随后是基于强度的优化阶段,通过使用交叉视角模糊核将左右 DP 图像向彼此对齐。实验结果表明,所提出的方法在不同相机和镜头捕获的各种场景中均表现出良好的效果。代码和数据可在 https://github.com/lilika-makabe/dp-sfm-tpami.git 获得。
cs.CV / 121 / 2605.01854
High-Fidelity Mobile Avatars with Pruned Local Blendshapes
高保真移动虚拟头像与修剪本地混合形状
Abstract
We propose a method to reconstruct high-fidelity human avatars from multi-view video that can run on mobile devices. Many works can model high-quality Gaussian-based full-body avatars from multi-view video. However, these methods require heavy computation to obtain pose-dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose-dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propose to remove blendshapes for Gaussians whose attributes change little, yielding a minimal blendshape representation. Our method is an end-to-end training method without a pretrained model. To make it run on multiple devices, we implement our method using WebGPU. Experiments show that our method can render high-quality human avatars with better details, and can reach 120 FPS at 2K resolution on mobile devices.
Chinese Translation
我们提出了一种从多视角视频重建高保真人体头像的方法,该方法可以在移动设备上运行。许多研究可以从多视角视频建模高质量的基于高斯的全身头像。然而,这些方法需要大量计算来获取依赖于姿势的外观,使得在移动设备上的部署非常困难。近期的方法从预训练模型中提取信息,并通过线性组合全局姿势特征和混合形状来建模依赖于姿势的非线性高斯属性。尽管它们可以在移动设备上运行,但在细节上有所损失。我们观察到在身体的局部区域内,相邻的高斯往往高度相关,并且可以通过线性建模来减少误差。因此,我们在小的身体部位使用局部线性混合形状来捕捉高斯属性的全局非线性变化。为了进一步减少计算量和模型大小,我们建议去除属性变化较小的高斯的混合形状,从而实现最小化的混合形状表示。我们的方法是一个端到端训练的方法,不依赖于预训练模型。为了使其能够在多种设备上运行,我们使用WebGPU实现了该方法。实验表明,我们的方法可以渲染出具有更好细节的高质量人体头像,并且能够在移动设备上以2K分辨率达到120帧每秒的表现。
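The pruning idea can be sketched as follows, under assumptions that go beyond the abstract: each Gaussian attribute is modeled as a base value plus a linear combination of blendshape rows weighted by local pose features, and a Gaussian's blendshapes are dropped when their largest coefficient falls below a threshold. The data layout, function names, and threshold criterion are hypothetical.

```python
def prune_blendshapes(blendshapes, threshold):
    """Keep blendshapes only for Gaussians whose attributes can change noticeably.

    `blendshapes` maps a Gaussian id to a list of rows, one row per pose feature.
    """
    return {
        gid: basis
        for gid, basis in blendshapes.items()
        if max(abs(c) for row in basis for c in row) > threshold
    }


def evaluate_attribute(base, blendshapes, gid, pose_weights):
    """Base attribute plus the linear blendshape offset, if one was kept."""
    value = [float(v) for v in base[gid]]
    basis = blendshapes.get(gid)
    if basis is not None:
        for weight, row in zip(pose_weights, basis):
            for i, coeff in enumerate(row):
                value[i] += weight * coeff
    return value
```

Pruned Gaussians fall back to their static base attribute, so they cost no per-frame blendshape evaluation and no blendshape storage, which is where the compute and model-size savings would come from.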
cs.CV / 122 / 2605.01858
Decouple and Cache: KV Cache Construction for Streaming Video Understanding
解耦与缓存:流媒体视频理解的键值缓存构建
Abstract
Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, unbounded streams require continuously constructing new key-value (KV) caches and evicting old ones. Second, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs either fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring that KV caches support unseen positions and preventing position overflow. Experiments on streaming video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average accuracy gain of 2.5% over prior methods.
Chinese Translation
流媒体视频理解需要处理无限的视频流,同时面临有限的内存和计算能力,这带来了两个关键挑战。首先,对于无限流,需要不断构建新的键值(KV)缓存并驱逐旧的缓存。其次,由于收集和训练无限流的成本较高,模型必须从短序列中学习,并能够推广到长序列。现有的流媒体视频视觉语言模型(VideoVLLMs)未能扩展到无限视频流或专注于缓存重用策略,因此在缓存构建的影响方面尚未得到充分探讨。在本文中,我们提出了解耦流媒体缓存(Decoupled Streaming Cache,DSCache),这是一种无须训练的缓存构建机制,它能够将预训练的离线模型适应于流媒体设置。DSCache维护一个累积的过去KV缓存,同时根据需求构建一个单独的即时缓存,与过去的缓存解耦,以保留近期输入的信息。为了支持超出训练长度的位置外推,DSCache进一步采用了一种位置无关的编码策略,确保KV缓存能够支持未见位置并防止位置溢出。针对流媒体视频问答基准的实验表明,DSCache展示了先进的性能,平均准确性相比于以前的方法提升了2.5%。
cs.CV / 123 / 2605.01876
BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton
BadmintonGRF:一个用于羽毛球无标记地面反应力估计的多模态数据集和基准测试
Abstract
Multimodal resources for non-periodic court sports with laboratory-grade sensing remain scarce: few publicly pair instrumented ground reaction force (GRF) with high-frame-rate multi-view video, limiting markerless load estimation in realistic training settings. BadmintonGRF records eight synchronized RGB views at ~120 FPS, four Kistler force plates, and Vicon motion capture (C3D) without hardware genlock across modalities; alignment combines human-verified events, automated quality assurance, and per-camera time offsets with uncertainty metadata. Tier 1 distributes pose, time-aligned GRF, metadata, and splits under CC BY-NC 4.0, enabling the primary benchmark without raw RGB or C3D; we report a Tier 1 task that maps 2D pose to GRF. Tier 2 provides raw RGB and C3D under controlled access for studies that require appearance or full kinematics. The public release contains 17,425 impact-segment archives in the 10-subject benchmark tree (156 instrumented trials; raw multi-view RGB alone exceeds 1 TB); benchmark loader gates retain 12,867 view-specific instances and 1,732 unique impacts after multi-view deduplication. We are not aware of prior public badminton corpora that combine this sensing layout with audited video--GRF alignment for impact-centric GRF estimation. We distribute preprocessing code, leave-one-subject-out splits, ten reference baselines, and optional late fusion (one deterministic test-time pass per instance; no test-time augmentation), with a within-trial diagnostic in the supplementary material.
Chinese Translation
面向非周期性场地运动、具备实验室级传感的多模态资源仍然稀缺:很少有公开数据集将仪器化地面反应力(GRF)与高帧率多视角视频配对,这限制了真实训练环境中的无标记负载估计。BadmintonGRF 记录了八个同步的 RGB 视角(约 120 FPS)、四块 Kistler 测力台以及 Vicon 动作捕捉(C3D),各模态之间没有硬件同步锁相(genlock);对齐过程结合了人工验证的事件、自动质量保证以及带不确定性元数据的逐摄像机时间偏移。第一层级(Tier 1)在 CC BY-NC 4.0 许可下分发姿态、时间对齐的 GRF、元数据和数据划分,无需原始 RGB 或 C3D 即可运行主要基准;我们报告了一个将 2D 姿态映射到 GRF 的第一层级任务。第二层级(Tier 2)在受控访问下提供原始 RGB 和 C3D,供需要外观或完整运动学的研究使用。公开发布的 10 名被试基准树包含 17,425 个冲击段档案(156 次仪器化试验;仅原始多视角 RGB 就超过 1 TB);经基准加载器筛选后保留 12,867 个特定视角实例,多视角去重后为 1,732 个独特冲击。据我们所知,此前没有公开的羽毛球数据集将这种传感布置与经过审核的视频-GRF 对齐相结合,用于以冲击为中心的 GRF 估计。我们分发预处理代码、留一被试划分、十个参考基线和可选的后期融合(每个实例仅进行一次确定性的测试时前向;无测试时增强),并在补充材料中提供试验内诊断。
cs.CV / 124 / 2605.01882
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Chart-FR1:基于视觉焦点驱动的密集图表细粒度推理
Abstract
Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, Chart-FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus-CoT, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus-GRPO, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID-Chart, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning. Code is available at https://github.com/phkhub/Chart-FR1.
Chinese Translation
多模态大型语言模型(MLLMs)在图表理解和推理任务中展现了可观的潜力。然而,它们在高信息密度(HID)图表的处理上仍面临困难,该类图表以多个子图、图例和密集注释为特征,主要面临以下三个挑战: (1) 有限的细粒度感知导致关键视觉线索的遗漏;(2) 冗余或噪声视觉信息削弱了多模态推理的性能;(3) 相对于视觉信息的量,缺乏自适应深层推理。为了应对这些挑战,我们提出了一种新颖的基于焦点驱动的细粒度图表推理模型Chart-FR1,以改善HID图表上的感知、聚焦效率和自适应深层推理。具体而言,我们提出了Focus-CoT,一种视觉聚焦的思路链,它通过明确将推理步骤与关键视觉线索(如局部图像区域和OCR信号)关联来增强细粒度感知。在此基础上,我们引入了Focus-GRPO,一种基于聚焦驱动的强化学习算法,采用信息效率奖励压缩冗余视觉信息以实现高效聚焦,并具有自适应KL惩罚机制,可以在发现更多视觉线索时灵活控制推理深度。此外,为了填补HID图表基准测试的空白,我们构建了HID-Chart,这是一个具有信息密度指标的挑战性基准,旨在评估细粒度图表推理能力。在多个图表基准上的广泛实验表明,Chart-FR1在图表理解和推理方面的表现优于现有最先进的MLLMs。代码可在https://github.com/phkhub/Chart-FR1获取。
cs.CV / 125 / 2605.01888
AFFormer: Adaptive Feature Fusion Transformer for V2X Cooperative Perception under Channel Impairments
AFFormer:在信道干扰下用于 V2X 协作感知的自适应特征融合变换器
Abstract
Accurate 3D object detection is essential for ensuring the safety of autonomous vehicles. Cooperative perception, which leverages vehicle-to-everything (V2X) communication to share perceptual data, enhances detection but is vulnerable to channel impairments, such as noise, fading, and interference. To strengthen the reliability of intelligent transportation systems, this work improves the robustness of V2X cooperative perception under communication conditions that reflect common channel impairments. This paper proposes an Adaptive Feature Fusion Transformer (AFFormer), a Transformer-based framework that mitigates the adverse effects of corrupted features by modeling temporal, inter-agent, and spatial correlations. AFFormer introduces three key modules: Multi-Agent and Temporal Aggregation for context-aware fusion across agents and over time, Dual Spatial Attention for efficient modeling of spatial dependencies, and Uncertainty-Guided Fusion for entropy-driven refinement of fused features. A teacher-student knowledge distillation strategy further enhances robustness by aligning fused features with reliable early-collaboration supervision. AFFormer is validated on the V2XSet and DAIR-V2X datasets, where it consistently outperforms existing methods under both ideal and impaired communication conditions, demonstrating improved robustness to communication-induced feature degradation while maintaining a competitive efficiency-accuracy trade-off.
Chinese Translation
准确的 3D 物体检测对于确保自动驾驶车辆的安全至关重要。协作感知利用车与一切(V2X)通信共享感知数据,提升检测性能,但会受到信道干扰(如噪声、衰落和干扰)的影响。为增强智能交通系统的可靠性,本研究改善了在反映常见信道干扰的通信条件下的 V2X 协作感知的鲁棒性。本文提出了一种自适应特征融合变换器(Adaptive Feature Fusion Transformer, AFFormer),该框架基于变换器(Transformer),通过建模时间、代理间和空间相关性来缓解损坏特征的负面影响。AFFormer 引入了三个关键模块:多代理和时间聚合(Multi-Agent and Temporal Aggregation)用于跨代理和时间的上下文感知融合;双重空间注意力(Dual Spatial Attention)用于高效建模空间依赖性;不确定性引导融合(Uncertainty-Guided Fusion)用于熵驱动的融合特征精炼。教师-学生知识蒸馏策略进一步增强了鲁棒性,通过将融合特征与可靠的早期协作监督对齐。AFFormer 在 V2XSet 和 DAIR-V2X 数据集上进行了验证,在理想和受损通信条件下都持续优于现有方法,展现了对通信引起的特征退化的改进鲁棒性,同时保持竞争性的效率-准确性权衡。
cs.CV / 126 / 2605.01896
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
分而治之:针对多模态世界模型的解耦表示对齐
Abstract
Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^2$-REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary "experts." Specifically, we first decouple modality-specific features from the diffusion model's intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long-term consistency.
Chinese Translation
新兴的多模态世界模型试图在多种模态(例如,RGB、深度和掩码)之间联合生成视频,但未能充分利用现有基础模型的丰富先验信息。我们提出了 $M^2$-REPA,这是首个专为多模态视频生成设计的表示对齐方法。我们的核心观点是,不同模态空间上训练的基础模型自然捕捉到不同领域特定的先验信息,充当互补的“专家”。具体而言,我们首先从扩散模型的中间表示中解耦出模态特定特征,然后将每个特征与其对应的专家基础模型进行对齐。为此,我们设计了两个协同目标:一个多模态表示对齐损失,强制特征与专家的匹配;另一个模态特定解耦正则化,鼓励不同模态之间的互补性。这种设计实现了联合优化,充分利用来自多个基础模型的先验。大量实验表明,我们的方法在视觉质量和长期一致性方面显著优于基线。
cs.CV / 127 / 2605.01901
Behavior-Grounded Lane Representation Learning for Multi-Task Traffic Digital Twins
基于行为的车道表征学习用于多任务交通数字孪生
Abstract
Traffic digital twins are powerful tools for advanced traffic management, and most systems are built on static geometric representations. However, these representations fail to capture the dynamic functional semantics required for behavior-aware reasoning, such as how a lane operates under complex traffic conditions. To address this gap, we introduce GeoLaneRep, a behavior-grounded lane representation learning framework for traffic digital twins. GeoLaneRep jointly encodes static lane geometry, observed vehicle trajectories, and operational descriptors into a shared, cross-camera semantic embedding. The encoder is trained with a joint objective combining contrastive cross-camera alignment, auxiliary role supervision, and temporal anomaly detection. Across 16 roadside cameras and 132 lanes, the learned embeddings achieve a $0.004$ lateral-rank error and an edge-role F1 of $1.000$ in zero-shot cross-camera matching, and an AUROC of $0.991$ for window-level anomaly detection. We further show that the same behavioral embeddings can condition a diffusion-based generator to synthesize lane geometries that satisfy targeted operational specifications, with $87.9\%$ overall specification accuracy across 38 lane groups. GeoLaneRep thus provides a semantic interface between roadside observations and downstream digital twin tasks, supporting cross-camera transfer, behavior-aware monitoring, and goal-directed lane synthesis. The framework is openly available at https://github.com/raynbowy23/GeoLaneRep.
Chinese Translation
交通数字孪生是先进交通管理的强大工具,但大多数系统基于静态几何表征构建。然而,这些表征无法捕捉到行为感知推理所需的动态功能语义,例如车道在复杂交通条件下的运行方式。为了解决这一问题,我们提出了GeoLaneRep,一种基于行为的车道表征学习框架,旨在用于交通数字孪生。GeoLaneRep将静态车道几何、观察到的车辆轨迹和操作描述符共同编码为共享的跨摄像头语义嵌入。编码器通过结合对比性跨摄像头对齐、辅助角色监督和时间异常检测的联合目标进行训练。在16个路边摄像头和132条车道上,学习的嵌入在零样本跨摄像头匹配中达到了$0.004$的横向排名误差和$1.000$的边缘角色F1分数,同时在窗口级别的异常检测中获得了$0.991$的AUROC。我们进一步展示了相同的行为嵌入可以用于条件扩散生成器,以合成满足目标操作规范的车道几何,38组车道的总体规范准确率达到了$87.9\%$。因此,GeoLaneRep为路边观察与下游数字孪生任务之间提供了语义接口,支持跨摄像头转移、行为感知监控和目标导向的车道合成。该框架已在https://github.com/raynbowy23/GeoLaneRep上公开提供。
cs.CV / 128 / 2605.01911
SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
SurgCheck:视觉-语言模型在外科视觉问答中真的关注图像吗?
Abstract
Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts. Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined even without entity names, four grounding cues are incorporated: bounding box, arrow, spatial position, and periphrasis. We evaluate both general-purpose and surgical-specific VLMs under zero-shot and fine-tuned settings on SurgCheck. To evaluate open-ended zero-shot responses, we introduce an LLM-as-a-judge evaluation protocol. Results: Using SurgCheck, we observe consistent performance degradation on less-biased questions across five VLMs, despite identical visual inputs. Text-only ablation reveals minimal performance drops for action and target prediction, indicating that action and target prediction is largely driven by linguistic shortcuts rather than visual reasoning. Conclusion: SurgCheck provides a controlled diagnostic framework that exposes failure modes masked by linguistic bias in existing surgical VQA benchmarks. Our findings demonstrate that strong benchmark performance does not necessarily imply faithful visual understanding, underscoring the need for bias-aware evaluation in surgical VQA.
Chinese Translation
目的:视觉-语言模型(VLMs)在外科视觉问答(VQA)中表现出良好的性能。然而,现有的外科 VQA 数据集通常包含语言捷径,其中提问的措辞在隐含上限制了答案的范围。迄今为止,尚不清楚报告的性能是反映了视觉理解还是依赖于这些语言捷径。方法:我们提出了 SurgCheck,一个用于量化外科 VQA 中语言捷径依赖性的诊断基准。SurgCheck 采用配对问题设计,每个外科图像帧都与一个包含实体名称的原始问题相关联,以及一个去掉这些名称但保持相同视觉内容和真实答案的不带偏见的对应问题。最终的性能差距提供了捷径依赖性的诊断信号。为了确保即使没有实体名称,不带偏见的问题依然具有明确性,我们引入了四个基础线索:边界框、箭头、空间位置和转述。我们在 SurgCheck 上评估了通用和外科特定的 VLMs,在零样本和微调设置下进行对比。为了评估开放式零样本响应,我们引入了一种大型语言模型(LLM)作为评判者的评估协议。结果:在使用 SurgCheck 时,我们观察到五个 VLMs 在不带偏见的问题上表现出一致的性能下降,尽管视觉输入是相同的。仅文本的消融实验显示,动作和目标预测的性能下降很小,这表明动作和目标预测主要受到语言捷径的驱动,而非视觉推理。结论:SurgCheck 提供了一个受控的诊断框架,揭示了现有外科 VQA 基准中被语言偏见掩盖的失败模式。我们的研究结果表明,强劲的基准表现并不一定意味着忠实的视觉理解,强调了在外科 VQA 中进行偏见意识评估的必要性。
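The paired-question design described above can be illustrated with a small sketch. It assumes exact-match accuracy as the metric, which the abstract does not specify; the function names `accuracy` and `shortcut_gap` are hypothetical and are not taken from the SurgCheck release.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the ground-truth answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def shortcut_gap(original_preds, less_biased_preds, answers):
    """Accuracy on original questions minus accuracy on their less-biased pairs.

    Both question variants share the same frames and ground-truth answers,
    so a large positive gap hints at reliance on question wording.
    (Assumed metric: exact-match accuracy.)
    """
    return accuracy(original_preds, answers) - accuracy(less_biased_preds, answers)


if __name__ == "__main__":
    answers = ["grasp", "cut", "retract", "cut"]
    original = ["grasp", "cut", "retract", "cut"]       # entity names present
    less_biased = ["grasp", "grasp", "retract", "grasp"]  # entity names removed
    print(shortcut_gap(original, less_biased, answers))
```

Because the visual input and the answers are identical across the pair, any gap in this sketch can only come from the change in question wording.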
cs.CV / 129 / 2605.01916
EAPFusion: Intrinsic Evolving Auxiliary Prior Guidance for Infrared and Visible Image Fusion
EAPFusion:用于红外与可见光图像融合的内在演变辅助先验指导
Abstract
Infrared-visible image fusion aims to create an information-rich fused image by integrating the complementary thermal saliency from infrared sensing and fine textures from visible imaging. Such accurate fusion is essential for real-world perception applications in complex scenes, including nighttime autonomous driving, search and rescue, and surveillance, and can further benefit downstream tasks such as semantic segmentation. However, most existing fusion methods rely upon static trained weights that cannot adapt to scene-specific content at inference time, and often suffer from a granularity mismatch when coarse auxiliary semantics are injected, which makes it difficult to simultaneously highlight targets and preserve details. In this work, we propose EAPFusion to address these issues by using self-evolving intrinsic priors instead of relying on external auxiliary models. Concretely, EAPFusion maintains a compact set of intrinsic priors and progressively updates them across scales. These evolved priors are utilized to dynamically generate convolutional kernels, shifting the paradigm from fixed, pre-trained filters to instance-adaptive parameters via prior-conditioned dynamic convolution. Furthermore, we design a channel-level fusion module that shuffles and interleaves infrared and visible channels, applying local channel mixing to boost cross-modal complementarity. Experiments on different datasets, including cross-dataset evaluation and semantic segmentation, show that the proposed method achieves state-of-the-art quantitative and qualitative fusion results, and consistently boosts downstream performance. Code is coming soon.
Chinese Translation
红外-可见光图像融合旨在通过整合红外传感器中的补充热显著性和可见光成像中的细致纹理,生成信息丰富的融合图像。这种精确的融合对于复杂场景中的真实世界感知应用至关重要,包括夜间自动驾驶、搜索与救援以及监控,并可进一步促进语义分割等下游任务的发展。然而,大多数现有的融合方法依赖于静态训练权重,这些权重在推理时无法适应特定场景的内容,并且在注入粗略辅助语义时往往会出现颗粒度不匹配,导致难以同时突出目标和保留细节。在本研究中,我们提出了EAPFusion以解决这些问题,采用自我演变的内在先验,而不是依赖外部辅助模型。具体而言,EAPFusion维护了一组紧凑的内在先验,并在不同尺度上逐步更新它们。这些演变的先验用于动态生成卷积核,使得从固定的、预训练的滤波器向实例自适应参数的范式转变,通过先验条件动态卷积。同时,我们设计了一个通道级融合模块,该模块对红外和可见光通道进行随机重排和交错,应用局部通道混合以增强跨模态的互补性。在不同数据集上的实验,包括跨数据集评估和语义分割,结果表明所提方法在量化和质量上都达到了最先进的融合效果,并且持续提升了下游性能。代码即将发布。
cs.CV / 130 / 2605.01924
SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras
SimPB++:从多个摄像头同时检测2D和3D目标
Abstract
Simultaneous perception of 2D objects in perspective view and 3D objects in Bird's Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture, coupling multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy with an auxiliary RoI detector. SimPB++ supports mixed supervision with 2D-only and fully annotated data, reducing reliance on expensive 3D labels. Experiments show state-of-the-art performance on nuScenes for both tasks and strong long-range detection (up to 150m) on Argoverse2.
Chinese Translation
在多摄像头自主驾驶中,从透视视图感知2D目标和从鸟瞰视图(BEV)感知3D目标是一个具有挑战性的任务。现有的两阶段管道仅将2D结果作为3D检测的一次性线索。我们提出了SimPB++,它能够同时从多个摄像头检测透视中的2D目标和BEV中的3D目标。该方法将两个任务统一为一个端到端模型,采用混合解码器架构,实现2D和3D解码器的交互耦合。两个新颖模块使深度交互成为可能:动态查询分配(Dynamic Query Allocation)自适应地将2D查询分配给3D候选框,适应性查询聚合(Adaptive Query Aggregation)利用多视角2D特征优化3D表示,形成循环的3D-2D-3D细化过程。对于多视角2D检测,我们使用查询组注意力(Query-group Attention)进行组内通信。我们还设计了裁剪和缩放(Crop-and-Scale)策略以实现长距离感知,以及带有辅助RoI检测器的传播去噪(Propagating Denoising)策略。SimPB++支持混合监督,兼容仅有2D标注和完全注释数据,减少了对昂贵3D标签的依赖。实验结果显示,在nuScenes数据集上取得了这两项任务的最新性能,并在Argoverse2数据集上实现了强大的长距离检测能力(可达150米)。
cs.CV / 131 / 2605.01925
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
CADFS:面向大语言模型计算机辅助设计的大型 CAD 程序数据集与框架
Abstract
We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. Existing generative CAD systems are restricted to sketch-extrude operations due to simplified representations and limited datasets. We address this by introducing a FeatureScript-based representation and constructing a dataset of 450k real-world CAD models spanning 15 modeling operations. We obtain the dataset via a new pipeline that reconstructs clean, executable FeatureScript programs and provides multimodal annotations. Fine-tuning a VLM on this representation yields state-of-the-art results in text-conditioned CAD generation and image-based reconstruction, producing more accurate, diverse, and feature-rich designs than prior frameworks. Ablations show that each individual component of our framework, i.e., the FeatureScript representation, the extended operation set, and representation-aligned textual descriptions, significantly improves performance. Our framework substantially broadens the complexity and realism achievable in generative CAD. The CADFS framework and the new dataset are available at https://voyleg.github.io/cadfs/.
Chinese Translation
我们引入了 CADFS,这是一种以数据为中心的框架,使大型视觉-语言模型能够生成复杂的 CAD 设计历史。现有的生成性 CAD 系统由于简化的表示和有限的数据集,受限于草图-挤出操作。我们通过引入基于 FeatureScript 的表示法及构建一个包含 45 万个真实世界 CAD 模型的数据集,跨越 15 种建模操作来解决这一问题。我们通过一条新管线获取数据集,重建干净、可执行的 FeatureScript 程序,并提供多模态注释。在这种表示法上微调视觉语言模型(VLM)取得了文本条件 CAD 生成和基于图像的重建的最新成果,产生比先前框架更准确、多样和功能丰富的设计。消融研究表明,我们框架的每个单独组件,即 FeatureScript 表示、扩展操作集和表示对齐的文本描述,都显著提高了性能。我们的框架大大拓宽了生成 CAD 中可实现的复杂性和现实性。CADFS 框架和新的数据集可在 https://voyleg.github.io/cadfs/ 获取。
cs.CV / 132 / 2605.01929
Exploring Data-Free LoRA Transferability for Video Diffusion Models
探索无数据LoRA在视频扩散模型中的可转移性
Abstract
Video diffusion models leveraging step distillation or causal distillation have achieved remarkable performance. However, adapting existing LoRAs to these variants remains a critical challenge due to weight space mismatches. We observe that direct application leads to style degradation and structural collapse, yet the underlying mechanisms remain poorly understood. To fill this gap, we delve into the weight space and identify that the incompatibility stems from spectral interference within shared functional clusters defined over singular subspaces. Specifically, our analysis reveals that while both paradigms respect spectral rigidity, they establish conflicting routing pathways that clash through constructive overload or destructive cancellation. To address this issue, we propose Cluster-Aware Spectral Arbitration (CASA), a data-free framework that dynamically arbitrates between safeguarding the target's manifold and restoring LoRA alignment based on spectral density. Extensive experiments demonstrate that CASA effectively mitigates artifacts and revives LoRA functionality. Our code is available at https://github.com/Noahwangyuchen/CASA
Chinese Translation
利用步骤蒸馏或因果蒸馏的视频扩散模型取得了显著的性能。然而,由于权重空间的不匹配,将现有的LoRA适配于这些变体仍然是一个关键挑战。我们观察到直接应用会导致风格降级和结构崩溃,但其背后的机制仍然不够明确。为了填补这一空白,我们深入研究了权重空间,并识别出不兼容性源于在特定子空间上定义的共享功能簇内的谱干扰。具体而言,我们的分析揭示,尽管这两种范式都尊重谱刚性,但它们建立了冲突的路由路径,通过建设性超载或破坏性抵消发生冲突。为了解决这个问题,我们提出了集群感知谱仲裁(Cluster-Aware Spectral Arbitration,CASA),这是一个无数据的框架,可以动态仲裁以保护目标流形并根据谱密度恢复LoRA对齐。大量实验表明,CASA有效减轻了伪影并恢复了LoRA功能。我们的代码可在 https://github.com/Noahwangyuchen/CASA 上获取。
cs.CV / 133 / 2605.01971
ProtoFair: Fair Self-Supervised Contrastive Learning via Pseudo-Counterfactual Pairs
ProtoFair:通过伪反事实对实现公平的自监督对比学习
Abstract
Self-supervised learning methods learn high-quality visual representations, yet recent studies show that these representations often capture demographic biases present in the training data. Existing fairness-aware methods address this by redesigning the self-supervised objective itself, limiting portability across the rapidly evolving landscape of self-supervised learning (SSL) frameworks. We propose ProtoFair, a fairness-aware contrastive loss designed to work alongside existing SSL objectives without modifying them. ProtoFair leverages unsupervised prototype clustering to identify pseudo-counterfactual pairs: samples sharing the same cluster assignment but belonging to different sensitive groups. By pulling these content-matched, cross-group samples together in the embedding space, ProtoFair encourages the encoder to learn representations that are invariant to the sensitive attribute. The method requires only sensitive attribute annotations, no target labels, and integrates seamlessly with both SimCLR and SupCon. Experiments on CelebA and UTKFace demonstrate consistent fairness improvements while maintaining competitive accuracy.
Chinese Translation
自监督学习方法能够学习高质量的视觉表示,但最近的研究表明,这些表示往往捕捉到训练数据中存在的人口偏见。现有的公平性关注方法通过重新设计自监督目标来解决这个问题,从而限制了其在快速发展的自监督学习(SSL)框架中的可移植性。我们提出了ProtoFair,一种公平性关注的对比损失,旨在与现有的SSL目标协同工作,而无需对其进行修改。ProtoFair利用无监督原型聚类来识别伪反事实对:这些样本共享相同的聚类分配,但属于不同的敏感群体。通过在嵌入空间中将这些内容匹配的跨组样本拉近,ProtoFair促使编码器学习对敏感属性不变的表示。该方法仅需敏感属性标注,无需目标标签,并能与SimCLR和SupCon无缝集成。在CelebA和UTKFace上的实验显示出一致的公平性改善,同时保持竞争力的准确性。
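The pair-selection rule stated in the abstract (same cluster assignment, different sensitive group) can be sketched as follows. This is only an illustration of that rule: the function name `pseudo_counterfactual_pairs` is hypothetical, and the clustering step and the contrastive loss itself are not shown.

```python
from collections import defaultdict
from itertools import combinations


def pseudo_counterfactual_pairs(cluster_ids, sensitive_attrs):
    """Return index pairs (i, j), i < j, that share a cluster but differ in
    sensitive attribute. Cluster ids are assumed to come from an unsupervised
    prototype clustering step that is not part of this sketch."""
    if len(cluster_ids) != len(sensitive_attrs):
        raise ValueError("inputs must have the same length")

    # Group sample indices by their cluster assignment.
    members_by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members_by_cluster[cluster].append(index)

    pairs = []
    for members in members_by_cluster.values():
        for i, j in combinations(members, 2):
            # Keep only cross-group pairs within the same cluster.
            if sensitive_attrs[i] != sensitive_attrs[j]:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    clusters = [0, 0, 1, 1, 0]
    groups = [0, 1, 0, 0, 1]
    print(pseudo_counterfactual_pairs(clusters, groups))
```

In a training loop, pairs like these would be treated as additional positives and pulled together in the embedding space; how that term is weighted against the base SSL objective is not stated in the abstract.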
cs.CV / 134 / 2605.01995
From Concept to Capability: Evaluating 3D Gaussian Splatting for Synthetic Scene Editing in Autonomous Driving
从概念到能力:评估3D高斯点渲染在自主驾驶合成场景编辑中的应用
Abstract
The perception of an Autonomous Driving System (ADS) critically depends on relevant, comprehensive, and diverse datasets to ensure its safety while operating in the environment. Field data collection lacks completeness with respect to the list of rare but still possible safety-related scenarios needed for the development, verification, and validation of the ADS. 3D Gaussian Splatting (3DGS) has shown promising capabilities for the reconstruction and editing of scenes based on data collected by cameras and LiDAR sensors. However, the industrial fidelity evaluation of reconstructions is underexplored, which is crucial when employing such methods in safety-related systems, especially for ADS. This becomes more challenging as ADS operates in a dynamic, uncontrolled environment with limited viewpoints and often partially occluded objects. This paper addresses this gap by proposing and implementing a framework (Fig. 1) to systematically analyze the capabilities and limitations of 3DGS for use in the reconstruction of safety-related scenes. It focuses on the quality of reconstruction for vehicles and pedestrians, which are the two most critical object classes for ADS. Our findings provide industry insights into the fidelity degradation of reconstructions from multiple novel viewpoints, both lateral and longitudinal, enabling the integration of these methods into real-world industrial AD software development and testing pipelines.
Chinese Translation
自主驾驶系统(ADS)的感知在很大程度上依赖于相关的、全面的和多样化的数据集,以确保其在环境中操作时的安全性。就 ADS 的开发、验证与确认所需的稀有但仍然可能出现的安全相关场景而言,实际数据采集并不完整。3D高斯点渲染(3DGS)在基于摄像头和激光雷达传感器收集的数据进行场景重建和编辑方面显示出了良好的能力。然而,重建的工业保真度评价尚未被充分探索,这在将此类方法应用于安全相关系统时至关重要,尤其是在自主驾驶的情况下。随着自主驾驶在动态、不受控的环境中运行,且视角有限且物体常常部分遮挡,这一挑战变得更加复杂。本文通过提出并实施一个框架(图1)来系统性地分析3DGS在安全相关场景重建中的能力和局限性,从而解决了这一缺口。我们专注于车辆和行人的重建质量,这两个对象类别是自主驾驶中最为关键的。我们的研究结果为行业提供了从多个新颖视角(包括横向和纵向)重建保真度降低的见解,使这些方法能够整合进现实世界的工业自主驾驶软件开发和测试流程。
cs.CV / 135 / 2605.02089
Cross-Language Learning within Arabic Script for Low-Resource HTR
阿拉伯文书写系统下的跨语言学习用于低资源手写文本识别
Abstract
Handwritten Text Recognition (HTR) under limited labeled data remains a challenging problem, particularly for Arabic-script languages. Although modern sequence-based recognizers perform well in high-resource settings, their accuracy degrades sharply as training data becomes scarce. Arabic-script languages share a common writing system with substantial character overlap, motivating cross-script training as a strategy to mitigate data scarcity. We performed experiments on Arabic, Urdu, and Persian scripts and achieved improvements over single-script baselines, including new state-of-the-art (SotA) results, especially in low-resource settings. A key finding of our experiments is that cross-script transfer is largely driven by script-level overlap rather than uniform accuracy improvements. Through a statistical character-level analysis, we show that gains are structurally concentrated on characters shared across scripts, while language-specific characters exhibit limited or negative transfer. These findings provide insight into transfer dynamics in low-resource script families. In detail, we conduct a controlled line-level study of cross-script joint training for Arabic-script HTR under low-resource regimes ($K \in \{100, 500, 1000\}$ labeled lines) on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD). A CRNN model is trained on the union of multiple related Arabic-script datasets and evaluated on individual target languages. On Persian (PHTD), joint training achieves a Character Error Rate (CER) of 9.99, surpassing previously reported results despite not using the full available training data. On an Urdu dataset (UNHD), joint training reduces CER from 17.20 to 14.45. Code and data splits are released to ensure reproducibility.
Chinese Translation
在有限标注数据下,手写文本识别(HTR)仍然是一个具有挑战性的难题,特别是对于使用阿拉伯文字的语言。尽管现代基于序列的识别器在高资源环境中表现良好,但随着训练数据变得稀缺,其准确性急剧下降。使用阿拉伯文字的语言共享同一书写系统,并且具有显著的字符重叠,这促使我们将跨文字训练作为缓解数据稀缺的一种策略。我们在阿拉伯文、乌尔都文和波斯文上进行了实验,并相对于单一文字基线取得了提升(尤其在低资源环境下取得了新的最优结果,即 SotA)。我们实验的一个关键发现是,跨文字迁移主要由文字层面的重叠驱动,而非均匀的准确性提升。通过字符级别的统计分析,我们表明增益在结构上集中于跨文字共享的字符,而特定于语言的字符则表现出有限甚至负向的迁移。这些发现为理解低资源文字家族中的迁移动态提供了见解。具体而言,我们在低资源条件下(标注行数 $K \in \{100, 500, 1000\}$)对阿拉伯文字 HTR 的跨文字联合训练进行了受控的行级研究,使用的数据集为阿拉伯文 KHATT、乌尔都文 NUST-UHWR 和波斯文 PHTD。一个 CRNN 模型在多个相关阿拉伯文字数据集的并集上训练,并在各个目标语言上分别评估。在波斯文数据集(PHTD)上,联合训练达到了 9.99 的字符错误率(CER),尽管没有使用全部可用的训练数据,仍超越了先前报告的结果。在乌尔都文数据集(UNHD)上,联合训练将 CER 从 17.20 降低到 14.45。我们发布了代码和数据划分以确保可重复性。
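For readers unfamiliar with the metric reported above, the sketch below shows how Character Error Rate is commonly computed: Levenshtein edit distance divided by reference length, expressed as a percentage. This is the conventional definition and is assumed here; the abstract does not describe the authors' evaluation script, and the helper names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1]


def character_error_rate(reference, hypothesis):
    """CER in percent: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))
```

Note that Python strings are compared per Unicode code point, so for Arabic-script text the result depends on how diacritics and presentation forms are normalized before scoring; that normalization is not covered by this sketch.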
cs.CV / 136 / 2605.02094
SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition
SignMAE:基于分割驱动的自监督学习用于手语识别
Abstract
Subtle hand differences make sign language recognition challenging, yet many existing methods rely on encoders pretrained on generic action datasets that poorly capture such fine-grained cues. We propose a self-supervised pretraining method for sign language recognition that uses segmentation-based masking to adapt to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained sign representation learning. On WLASL, NMFs-CSL, and Slovo, our encoder achieves state-of-the-art performance, improving per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
Chinese Translation
细微的手部差异使得手语识别变得具有挑战性,然而许多现有的方法依赖于在通用动作数据集上预训练的编码器,这些编码器难以有效捕捉到这样的细粒度线索。我们提出了一种用于手语识别的自监督预训练方法,该方法使用基于分割的掩蔽来适应关键身体部位的存在和运动,而不是将手势姿势视为静态视觉标记。由此产生的掩蔽-重建目标提升了细粒度手势表示学习。在 WLASL、NMFs-CSL 和 Slovo 数据集上,我们的编码器取得了最先进的性能,提高了每个实例和每个类别的 Top-1 准确率,同时使用的输入帧和模态少于可比较的编码器。
cs.CV / 137 / 2605.02098
From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments
从球形到高斯:大规模3D环境中点云裁剪策略的比较分析
Abstract
Large-scale 3D point clouds can consist of billions of points. Even after downsampling, these point clouds are too large for modern 3D neural networks. In order to develop a semantic understanding of the scene, the point clouds are divided into smaller subclouds that can be processed. Typically, this division is done using spherical crops, resulting in a loss of surrounding geometric context. To address this issue, we propose alternative methods that produce subclouds with larger crop sizes while maintaining a similar number of points. Specifically, we compare exponential, Gaussian, and linear cropping methods with the spherical method. We evaluated two 3D deep learning model architectures using multiple indoor and outdoor environment datasets. Our results demonstrate that altering the cropping strategy can enhance model performance, especially for large-scale outdoor scenes, yielding new state-of-the-art results. Code is available at https://github.com/mvg-inatech/point_cloud_cropping
Chinese Translation
大规模的3D点云可以包含数十亿个点。即使经过下采样,这些点云对于现代3D神经网络而言仍然过于庞大。为了发展对场景的语义理解,点云被划分为可以处理的小子云。通常,这种划分使用球形裁剪,导致周围几何上下文的丧失。为了解决这个问题,我们提出了替代方法,这些方法在保持相似点数的同时,生成更大裁剪尺寸的子云。具体来说,我们将指数、高斯和线性裁剪方法与球形方法进行了比较。我们使用多个室内和室外环境数据集评价了两种3D深度学习模型架构。我们的结果表明,改变裁剪策略可以提高模型性能,特别是在大规模的室外场景中,取得了新的最先进结果。代码可在 https://github.com/mvg-inatech/point_cloud_cropping 获取。
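The contrast between a hard spherical crop and a distance-weighted crop can be sketched as below. The abstract gives no formulas, so the Gaussian keep-probability `exp(-d^2 / (2 * sigma^2))` and the function names are assumptions made for illustration, not the authors' implementation (which is in the linked repository).

```python
import math
import random


def spherical_crop(points, center, radius):
    """Keep every point within `radius` of `center`; drop everything else."""
    return [p for p in points if math.dist(p, center) <= radius]


def gaussian_crop(points, center, sigma, rng):
    """Keep each point with a probability that decays with distance.

    Assumed rule: keep_prob = exp(-d^2 / (2 * sigma^2)). Nearby points are
    kept densely, while distant points survive sparsely, so the crop covers
    a wider area for a comparable point budget.
    """
    kept = []
    for p in points:
        distance = math.dist(p, center)
        keep_prob = math.exp(-(distance * distance) / (2.0 * sigma * sigma))
        if rng.random() < keep_prob:
            kept.append(p)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(rng.uniform(-20, 20), rng.uniform(-20, 20), 0.0) for _ in range(10000)]
    origin = (0.0, 0.0, 0.0)
    print(len(spherical_crop(cloud, origin, radius=5.0)))
    print(len(gaussian_crop(cloud, origin, sigma=5.0, rng=rng)))
```

Exponential and linear variants would only change the `keep_prob` expression; in practice `radius` and `sigma` would be tuned so that the different strategies return a similar number of points.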
cs.CV / 138 / 2605.02126
Ultrasound Vision-Language Alignment via Contrastive Learning
通过对比学习实现超声波视觉-语言对齐
Abstract
Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently improve cross-modal alignment over baselines, with the best configuration achieving a paired alignment score of 0.682. However, stronger alignment does not guarantee better downstream performance: CLIP-based variants with partial fine-tuning achieve the strongest zero-shot classification on external held-out datasets (0.709 on BUSI; 0.626 on AULI), while full end-to-end fine-tuning degrades transfer due to overfitting. On linear probing and few-shot adaptation, model rankings are dataset-dependent, reflecting a trade-off between domain adaptation and representational generalizability. We further show that template-based captions match or outperform LLM-generated captions, suggesting lexical diversity is not a proxy for caption quality. Taken together, our results demonstrate that ultrasound vision-language alignment is achievable from public data alone, but robust clinical transfer requires careful balancing of domain adaptation, encoder capacity, and caption supervision quality.
Chinese Translation
超声基础模型在结构化预测任务上表现出色,但仍然仅限于基于视觉的应用,限制了在任务特定注释稀缺的情况下向新任务的零样本和少样本迁移。我们通过 EchoCare-CLIP 来解决这一问题,EchoCare-CLIP 是一个 CLIP 风格的双编码器对比框架,将超声图像与临床文本在共享嵌入空间中对齐。我们整理了一个覆盖乳腺、肝脏、肺和甲状腺的超过 16K 图像-文本对的多脏器语料库,其中超过 78% 的标题来自专家注释报告,其余部分则通过三层的基于模板和基于 LLM 的标题生成管道进行补充。我们评估了涵盖两个文本编码器系列(CLIP、BioClinicalBERT)和两种标题策略(基于模板、LLM 生成)的模型配置,并与 OpenAI CLIP 和 BiomedCLIP 基线进行比较。训练的模型在跨模态对齐上始终优于基线,其中最佳配置实现了配对对齐分数 0.682。然而,更强的对齐并不保证更好的下游性能:部分微调的基于 CLIP 的变体在外部保留数据集上实现了最强的零样本分类(BUSI 上为 0.709;AULI 上为 0.626),而完全端到端微调因过拟合而导致迁移性能下降。在线性探测和少样本适应中,模型排名依赖于数据集,反映出领域适应性与表示通用性之间的权衡。我们进一步展示了基于模板的标题与 LLM 生成的标题相当或表现更佳,这表明词汇多样性并不能代表标题质量。综合来看,我们的结果表明,仅凭公共数据即可实现超声视觉-语言对齐,但稳健的临床迁移需要在领域适应性、编码器容量和标题监督质量之间进行仔细平衡。
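A "CLIP-style" dual-encoder objective usually refers to a symmetric contrastive (InfoNCE) loss over a batch of matched image-text pairs. The sketch below implements that generic loss in plain Python for clarity; it is not the EchoCare-CLIP training code, and the temperature value 0.07 and all function names are assumptions.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images and texts is the positive."""
    images = [_normalize(v) for v in image_embeddings]
    texts = [_normalize(v) for v in text_embeddings]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(image, text)) / temperature for text in texts]
        for image in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    image_to_text = _diagonal_cross_entropy(logits)
    text_to_image = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs
    print(clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

Matched pairs yield a loss near zero while swapped pairs yield a large loss, which is the behavior the shared embedding space is trained toward. A real implementation would compute this on encoder outputs with a tensor library.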
cs.CV / 139 / 2605.02130
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
从事物的存在到其功能:多模态大型语言模型中的空间-功能智能基准测试
Abstract
Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.
Chinese Translation
人类级别的代理智能超越了低级几何感知,从识别事物的位置演变为理解事物的功能。然而,现有的基准测试有效评估了多模态大型语言模型(MLLMs)的几何感知能力,但未能深入探查实现扎根智能所需的更高阶认知能力。为了解决这一问题,本文引入了空间-功能智能基准测试(SFI-Bench),这是一个基于视频的基准测试,包含超过1,500个由专家注释的问题,来源于多样化的第一人称室内视频扫描。SFI-Bench系统性地评估了高级推理的两个互补维度:(1)结构化空间推理,要求理解复杂布局并形成连贯的空间表示;(2)功能推理,涉及推断物体的功能性和其上下文依赖的实用性。该基准测试包括条件计数、多步关系推理、功能配对和基于知识的故障排除等任务,直接挑战模型整合感知、记忆和推理。我们的实验表明,当前的MLLMs在将空间记忆与功能推理和外部知识结合上持续存在困难,这突显了实现扎根智能的一大瓶颈。因此,SFI-Bench提供了一个用于衡量朝着更具认知能力和真正扎根的多模态代理进展的诊断工具。
cs.CV / 140 / 2605.02134
Video Generation with Predictive Latents
基于预测潜变量的视频生成
Abstract
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
Chinese Translation
视频变分自编码器(Video Variational Autoencoder, VAE)通过将视觉世界映射到紧凑的时空潜在空间,实现了潜在视频生成建模,提高了训练效率和稳定性。虽然现有的视频 VAE 在重建质量上表现良好,但持续优化重建并不一定转化为生成性能的提升。如何增强视频潜变量的可扩散性仍然是一个关键且未解决的挑战。在本研究中,我们受启于预测世界建模的原则,探讨预测学习在提升视频生成建模方面的潜力。为此,我们提出了一种简单有效的预测重建目标,将预测学习与视频重建相结合。具体而言,我们随机丢弃未来帧,只编码部分过去观测,同时训练解码器同时重建观察到的帧并预测未来帧。该设计促使潜在空间编码时间上的预测结构,从而对视频动态建立更连贯的理解,进而改善生成质量。我们的模型被称为预测视频 VAE(Predictive Video VAE, PV-VAE),在视频生成方面表现优越,相较于 UCF101 上的 Wan2.2 VAE,收敛速度提高了 52%,FVD 改进了 34.42。此外,全面分析表明,PV-VAE 不仅表现出良好的可扩展性,生成性能随着 VAE 训练的增强而提升,而且在下游视频理解中也实现了一致的增益,突显出一个有效捕捉时间一致性和运动先验的潜在空间。
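The predictive reconstruction objective described in this abstract can be pictured as a data-preparation step: keep a random prefix of the clip for the encoder and ask the decoder for the whole clip. The sketch below is only an illustration under assumptions; the function name `make_predictive_sample`, the uniform choice of the cut point, and the `min_observed` parameter are hypothetical and are not taken from the paper.

```python
import random

def make_predictive_sample(frames, rng, min_observed=1):
    """Split a clip into encoder input (a past prefix) and decoder targets (the full clip)."""
    # Hypothetical sampling rule: choose how many leading frames stay visible.
    # randint includes both endpoints, so cut == len(frames) means nothing is discarded.
    cut = rng.randint(min_observed, len(frames))
    observed = frames[:cut]              # only these frames would be encoded
    targets = list(frames)               # decoder reconstructs the past and predicts the future
    is_future = [i >= cut for i in range(len(frames))]  # True where the target is a prediction
    return observed, targets, is_future

rng = random.Random(0)
clip = ["f0", "f1", "f2", "f3", "f4", "f5"]
observed, targets, is_future = make_predictive_sample(clip, rng)
```

The `is_future` flags make it possible to weight reconstruction and prediction terms separately, which is one plausible design choice rather than a documented one.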
cs.CV / 141 / 2605.02137
FLoRA: Fusion-Latent for Optical Reconstruction and Flood Area Segmentation via Cross-Modal Multi-Task Distillation Network
FLoRA:通过跨模态多任务蒸馏网络实现光学重建和洪水区域分割的融合潜在模型
Abstract
Accurate flood water mapping is critical for disaster management, yet current methods struggle to fully exploit the potential of spaceborne imagery. Optical data offers high interpretability but is limited by environmental conditions, whereas SAR provides reliable all-weather coverage with reduced visual interpretability. FLoRA (Fusion Latent for Optical Reconstruction and Area Segmentation) is a cross-modal multi-task framework that jointly reconstructs high-fidelity optical imagery and segments flood water regions from Sentinel-1 SAR by fusing the complementary strengths of optical and SAR data. During training, a lightweight optical teacher (driven by RGB and NDVI priors) provides pyramidal features that guide SAR representations into a fusion latent space via multiscale windowed cross attention and FiLM conditioning, with gated residuals preventing overcorrection. This design enables multi-task learning across two complementary objectives: (a) SAR-to-optical translation for fine-grained RGB reconstruction and (b) flood water region segmentation for hydrologic interpretation. The dual decoders are optimized using Charbonnier SSIM for structural fidelity, edge FFT magnitude losses for spectral realism, and Dice BCE hydrology-aware edge alignment for precise flood water delineation. A feature distillation constraint further aligns fused SAR features with the optical teacher's manifold. Evaluations on SEN1FLOODS11, DEEPFLOOD, and SEN12MS show that FLoRA surpasses fusion baselines in PSNR, SSIM, and LPIPS, indicating that multi-modal fusion within a teacher-guided latent space yields semantically faithful and physically consistent flood-water intelligence from spaceborne observations.
Chinese Translation
准确的洪水水域制图对于灾害管理至关重要,但当前方法未能充分利用太空影像的潜力。光学数据具有高解释性,但受环境条件的限制,而合成孔径雷达(SAR)则在任何天气条件下提供可靠的覆盖,但视觉解释性降低。FLoRA(光学重建与区域分割的融合潜在模型)是一个跨模态多任务框架,旨在通过融合光学数据与SAR数据的互补优势,共同重建高保真光学影像并对Sentinel 1的洪水水域进行分割。在训练过程中,轻量级的光学教师(以RGB和NDVI先验驱动)提供金字塔特征,指导SAR表示通过多尺度窗口交叉注意力和FiLM调节进入融合潜在空间,同时通过门控残差防止过度校正。该设计使得在两个互补目标之间实现多任务学习:(a) SAR到光学转换以实现精细的RGB重建和(b) 洪水水域的分割以进行水文学解释。双解码器利用Charbonnier SSIM优化以确保结构保真度,使用边缘FFT幅度损失提高光谱真实感,并采用Dice BCE水文学感知边缘对齐以实现精确的洪水水域划分。特征蒸馏约束进一步将融合后的SAR特征与光学教师的流形对齐。在SEN1FLOODS11、DEEPFLOOD和SEN12MS上的评估表明,FLoRA在PSNR、SSIM和LPIPS中超越了融合基线,证明了在教师引导的潜在空间中进行的多模态融合能够从太空观测中获取语义真实且物理一致的洪水水域信息。
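Two of the loss ingredients named in this abstract, the Charbonnier term and the Dice term, have standard textbook forms that are easy to write down. The sketch below shows only those two terms on flat lists of values; the SSIM, BCE, FFT-magnitude and edge-alignment parts are omitted, and the default `eps` and `smooth` values are assumptions rather than the paper's settings.

```python
import math

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like penalty: mean of sqrt(diff^2 + eps^2)."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)) / len(pred)

def soft_dice_loss(probs, mask, smooth=1e-6):
    """1 minus the soft Dice coefficient between predicted probabilities and a 0/1 mask."""
    intersection = sum(p * m for p, m in zip(probs, mask))
    total = sum(probs) + sum(mask)
    # The smoothing constant keeps the ratio defined when both inputs are empty.
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)

recon_loss = charbonnier_loss([0.0, 1.0], [0.0, 0.0])
seg_loss = soft_dice_loss([1.0, 1.0, 0.0, 0.0], [1, 0, 0, 0])
```

Charbonnier behaves like an absolute error for large residuals and like a squared error near zero, which is why it is often preferred for image reconstruction.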
cs.CV / 142 / 2605.02152
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit:基于语义锁定的无训练扩散图像编辑加速方法
Abstract
Diffusion-based image editing offers strong semantic controllability, but remains computationally expensive due to iterative high-resolution denoising over all spatial tokens. Dynamic-resolution sampling reduces this cost by performing early steps at reduced resolution. However, existing approaches prioritize upsampling using low-level heuristics such as edge detection or channel variance, which are weakly aligned with editing semantics and may lead to structural inconsistency. Moreover, spatial regions are often upsampled without verifying whether semantic modification is actually required, resulting in redundant high-resolution computation and accumulated errors. Therefore, we propose SpecEdit, a training-free dynamic-resolution framework tailored for diffusion-based image editing. SpecEdit follows a draft-and-verify scheme: a low-resolution draft first estimates the semantic outcome, after which token-level discrepancies are used to identify edit-relevant tokens for high-resolution denoising, while the remaining tokens stay at a coarse resolution. Experiments on Qwen-Image-Edit and FLUX.1-Kontext-dev demonstrate up to 10x and 7x acceleration, while maintaining strong quality. SpecEdit is complementary to step distillation and other acceleration techniques, achieving up to 13x speedup when combined with existing methods. Our code is in supplementary material and will be released on GitHub.
Chinese Translation
基于扩散的图像编辑提供了强大的语义可控性,但由于在所有空间标记上进行迭代的高分辨率去噪,计算开销依然较高。动态分辨率采样通过在降低分辨率的情况下执行早期步骤,降低了这一成本。然而,现有的方法优先使用边缘检测或通道方差等低级启发式方法进行上采样,这些方法与编辑语义的对齐性较弱,可能导致结构不一致。此外,空间区域往往在未确认是否实际需要进行语义修改的情况下被上采样,从而导致冗余的高分辨率计算和累计的错误。因此,我们提出了SpecEdit,一个无需训练的动态分辨率框架,专为基于扩散的图像编辑量身定制。SpecEdit遵循草图与验证的方案:低分辨率草图首先估计语义结果,然后利用标记级的不一致性来识别与编辑相关的标记以进行高分辨率去噪,而其余标记保持在粗略的分辨率上。关于Qwen-Image-Edit和FLUX.1-Kontext-dev的实验表明,速度提升可达10倍和7倍,同时保持较强的质量。当与现有方法结合使用时,SpecEdit与步骤蒸馏和其他加速技术互为补充,实现高达13倍的加速。我们的代码在补充材料中提供,并将在GitHub上发布。
cs.CV / 143 / 2605.02153
Cross-Polarization Fusion of VV and VH SAR Observations for Improved Flood Mapping
VV与VH SAR观测的交叉极化融合以改进洪水制图
Abstract
Synthetic Aperture Radar (SAR) imagery is widely used for flood monitoring due to its all-weather and day-night imaging capability. However, flood mapping using single-polarization SAR data remains challenging in complex environments where surface and volume scattering coexist. In this paper, we investigate the effectiveness of cross-polarization fusion of VV and VH SAR observations for improved flood mapping. A deep learning-based segmentation framework is employed to jointly exploit complementary information from VV and VH polarizations. To ensure a fair evaluation, three configurations are compared under identical training conditions: VV only, VH only, and fused VV-VH input. Performance is assessed using standard flood mapping metrics, including Intersection over Union (IoU) and F1-score, along with qualitative visual analysis. Experimental results demonstrate that VV-VH fusion consistently outperforms single-polarization models, particularly in vegetated and heterogeneous flood regions, leading to more accurate flood boundary delineation. The findings highlight the importance of cross-polarization SAR fusion for enhancing the reliability of SAR-based flood mapping in disaster monitoring applications.
Chinese Translation
合成孔径雷达(SAR)影像因其全天候和昼夜成像能力,广泛应用于洪水监测。然而,在表面和体散射共存的复杂环境中,仅使用单极化SAR数据进行洪水制图仍然面临挑战。在本文中,我们研究了VV和VH SAR观测的交叉极化融合在改进洪水制图方面的有效性。采用基于深度学习的分割框架共同利用VV和VH极化的互补信息。为确保评估公平,在相同的训练条件下比较了三种配置:仅VV、仅VH和融合VV-VH输入。通过标准洪水制图指标评估性能,包括交并比(IoU)和F1-score,以及定性视觉分析。实验结果表明,VV-VH融合始终优于单极化模型,特别是在植被覆盖和异质的洪水区域,从而实现更精确的洪水边界划定。研究结果强调了交叉极化SAR融合在增强基于SAR的洪水制图可靠性中的重要性,尤其是在灾害监测应用中。
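The two metrics this abstract relies on, Intersection over Union and F1-score, can be computed from the same three counts. The following is a minimal sketch on flattened binary flood masks; the helper names and the toy masks are illustrative, and returning 1.0 for two empty masks is a convention chosen here, not something stated in the paper.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives over two flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn

def iou(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0   # both masks empty: treated as perfect agreement

def f1_score(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

truth = [1, 1, 1, 0, 0, 0]   # 1 = flooded pixel
pred = [1, 1, 0, 1, 0, 0]
flood_iou = iou(pred, truth)        # 2 / (2 + 1 + 1) = 0.5
flood_f1 = f1_score(pred, truth)    # 4 / (4 + 1 + 1) = 2/3
```

Because F1 = 2·IoU / (1 + IoU), the two metrics always rank a pair of masks the same way; reporting both mainly helps comparison with prior work.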
cs.CV / 144 / 2605.02169
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
基于合成领域适应的隐私保护多摄像头监控异构模型融合
Abstract
We propose HeroCrystal, a novel privacy-preserving framework for multi-camera domain-adaptive object detection, addressing challenges such as data privacy, class imbalance, and heterogeneous architectures. Our framework consists of three key stages. In the Generated Stage, we introduce a one-shot, target-aware diffusion-based generation module that learns visual style from a single target-domain image while leveraging prompt-based control to synthesize specific object instances. Unlike conventional style transfer-based methods that require large target datasets and ignore semantic-level discrepancies, our approach enables privacy-preserving augmentation to reduce ethical concerns, and introduces controllable rare object generation to mitigate long-tailed category degradation. In the Federated Stage, we employ probabilistic Faster R-CNN on the client side to improve localization accuracy, and a dynamic model contrastive strategy to suppress domain-specific bias. The server side performs model fusion across heterogeneous architectures without accessing raw data. Finally, in the Distilled Stage, we propose an inconsistent categories integration algorithm to resolve label inconsistency and architecture heterogeneity across clients. Extensive experiments on multiple cross-domain detection benchmarks demonstrate that our method outperforms existing multi-source domain adaptation and federated learning baselines under multi-class, privacy-preserving settings. Our method improves mAP by +2.1% over prior privacy-preserving approaches and achieves a new state-of-the-art mAP of 33.4%, highlighting the effectiveness of HeroCrystal in enabling practical multi-camera AI surveillance systems.
Chinese Translation
我们提出了HeroCrystal,一种新颖的隐私保护框架,用于多摄像头领域自适应对象检测,以应对数据隐私、类别不平衡和异构架构等挑战。我们的框架由三个关键阶段组成。在生成阶段,我们引入了一个单次目标感知的基于扩散的生成模块,该模块从单个目标领域图像中学习视觉风格,同时利用基于提示的控制来合成特定的对象实例。与传统的基于风格迁移的方法需要大量目标数据集并忽略语义层面的差异不同,我们的方法使得隐私保护的增强成为可能,从而降低伦理问题,并引入可控的稀有对象生成来缓解长尾类别退化。在联邦阶段,我们在客户端采用概率性Faster R-CNN来提高定位精度,并使用动态模型对比策略来抑制领域特定偏差。服务器端在不访问原始数据的情况下进行异构架构的模型融合。最后,在蒸馏阶段,我们提出了一种不一致类别集成算法,以解决客户端之间的标签不一致和架构异构问题。在多个跨领域检测基准上进行的大量实验表明,我们的方法在多类、隐私保护的设置下优于现有的多源领域适应和联邦学习基线。我们的方法比以前的隐私保护方法提高了2.1%的mAP,并达到了新的最佳mAP为33.4%,突显了HeroCrystal在实现实际的多摄像头人工智能监控系统中的有效性。
cs.CV / 145 / 2605.02184
RAFNet: Region-Aware Fusion Network for Pansharpening
RAFNet:区域感知融合网络用于全色锐化
Abstract
Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images. Although deep learning has advanced this field, mainstream frequency-based methods relying on standard scaled dot-product attention suffer from quadratic computational complexity and fail to exploit the inherent regional sparsity of remote sensing imagery. Furthermore, existing spatial enhancement strategies typically employ static convolution kernels, which struggle to adapt to the complex frequency and regional variations of PAN and MS images. To address these bottlenecks, we propose a Region-Aware Fusion Network (RAFNet) that synergistically models spatial and frequency information. Specifically, we design a Spatial Adaptive Refinement (SAR) module that leverages the discrete wavelet transform (DWT) for directional frequency separation and K-means clustering for regional partitioning. This enables the dynamic construction of region-specific adaptive convolution kernels, achieving spatially and frequency-adaptive feature enhancement. Moreover, we introduce a Clustered Frequency Aggregation (CFA) module based on a sparse attention mechanism guided by the semantic clusters, which executes a region-aware sparse attention strategy that drastically reduces computational redundancy while ensuring high-quality frequency feature extraction. In addition, we integrate these modules into a progressive, multi-level spatial-frequency network architecture to facilitate robust interaction and accurate image reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that the proposed RAFNet significantly outperforms state-of-the-art pansharpening methods in both reduced- and full-resolution assessments. The code is available at https://github.com/PatrickNod/RAFNet.
Chinese Translation
全色锐化旨在通过融合低分辨率多光谱(LRMS)和高分辨率全色(PAN)图像生成高分辨率多光谱(HRMS)图像。尽管深度学习在这一领域取得了进展,但主流的基于频率的方法依赖标准的缩放点积注意力,存在平方级的计算复杂度,并未有效利用遥感影像的固有区域稀疏性。此外,现有的空间增强策略通常采用静态卷积核,难以适应PAN和MS图像的复杂频率和区域变化。为解决这些瓶颈,我们提出了一种区域感知融合网络(RAFNet),协同建模空间和频率信息。具体而言,我们设计了一个空间自适应精炼(SAR)模块,该模块利用离散小波变换(DWT)进行方向频率分离,并采用K均值聚类进行区域划分,从而实现动态构建区域特定自适应卷积核,实现空间和频率自适应特征增强。此外,我们引入了一种基于稀疏注意力机制指导语义聚类的聚类频率聚合(CFA)模块,该模块执行区域感知稀疏注意力策略,显著减少计算冗余,同时确保高质量的频率特征提取。此外,我们将这些模块集成到一个渐进的多级空间-频率网络架构中,以促进鲁棒的交互和准确的图像重建。在多个基准数据集上的大量实验表明,所提出的RAFNet在降低分辨率和全分辨率评估中显著超越了最先进的全色锐化方法。代码可在 https://github.com/PatrickNod/RAFNet 获取。
cs.CV / 146 / 2605.02198
SlimDiffSR: Toward Lightweight and Efficient Remote Sensing Image Super-Resolution via Diffusion Model Distillation
SlimDiffSR:通过扩散模型蒸馏实现轻量级高效的遥感图像超分辨率
Abstract
Diffusion models have recently achieved remarkable performance in image super-resolution (SR), but their high computational cost limits practical deployment in remote sensing applications. To address this issue, we propose SlimDiffSR, a lightweight and efficient diffusion-based framework for real-world remote sensing image super-resolution. Unlike existing single-step diffusion methods that rely on fixed timesteps, we first introduce an uncertainty-guided timestep assignment strategy to construct a stronger single-step teacher model, where reconstruction difficulty is explicitly linked to diffusion timesteps, enabling adaptive generative strength. Building upon this teacher, we further present a structured pruning strategy tailored to remote sensing imagery, which systematically removes redundant semantic modules and replaces standard operations with lightweight designs, including frequency-separable convolution, direction-separable convolution, and a query-driven global aggregation module. These components explicitly exploit the unique characteristics of remote sensing data, such as sparse high-frequency details, strong directional patterns, and long-range spatial dependencies. To enhance knowledge transfer, we incorporate Maximum Mean Discrepancy (MMD) into the distillation process to align feature distributions between the teacher and student models. Extensive experiments on multiple remote sensing benchmarks demonstrate that SlimDiffSR achieves a favorable balance between efficiency and reconstruction quality. In particular, it attains up to $200\times$ inference acceleration and a $20\times$ reduction in model parameters compared with multi-step diffusion models, while achieving competitive perceptual quality and clearly outperforming existing lightweight diffusion baselines in efficiency. The code is available at: https://github.com/wwangcece/SlimDiffSR.
Chinese Translation
扩散模型近期在图像超分辨率(SR)方面取得了显著的性能,但其高计算成本限制了在遥感应用中的实际部署。为了解决这个问题,我们提出了SlimDiffSR,一种基于扩散的轻量级高效框架,旨在解决现实世界中的遥感图像超分辨率。与现有的依赖固定时间步的单步扩散方法不同,我们首先引入了一种不确定性引导的时间步分配策略,以构建一个更强大的单步教师模型,在该模型中,重建难度与扩散时间步显性关联,从而使生成强度具有自适应性。在此教师模型基础上,我们进一步提出了一种针对遥感图像的结构化剪枝策略,该策略系统地移除冗余的语义模块,并用轻量级设计替代标准操作,包括频率可分离卷积、方向可分离卷积和查询驱动的全局聚合模块。这些组件明确利用了遥感数据的独特特征,如稀疏的高频细节、强方向模式以及长范围的空间依赖关系。为了增强知识迁移,我们在蒸馏过程中引入了最大均值差异(MMD),以对齐教师模型和学生模型之间的特征分布。在多个遥感基准测试上的大量实验表明,SlimDiffSR在效率和重建质量之间实现了良好的平衡。特别是,与多步扩散模型相比,它达到了高达200倍的推理加速和20倍的模型参数减少,同时实现了竞争性的感知质量,并在效率方面明显超越了现有轻量级扩散基线。代码可在以下链接获取:https://github.com/wwangcece/SlimDiffSR。
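The abstract mentions Maximum Mean Discrepancy (MMD) as the quantity used to align teacher and student feature distributions. As a rough illustration, the biased empirical estimate of squared MMD with a Gaussian kernel can be written in a few lines. The kernel choice, the bandwidth `sigma`, and the toy feature vectors are assumptions made for this sketch and are not taken from the paper.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf_kernel(x, y, sigma) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return mean_kernel(xs, xs, sigma) + mean_kernel(ys, ys, sigma) - 2.0 * mean_kernel(xs, ys, sigma)

teacher_feats = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
student_feats = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]   # same shape, shifted away
same = mmd_squared(teacher_feats, teacher_feats)
shifted = mmd_squared(teacher_feats, student_feats)
```

The estimate is zero when both sets coincide and grows as the student features drift from the teacher features, which is what makes it usable as a distillation penalty.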
cs.CV / 147 / 2605.02201
Super-resolution of airborne laser scanning point clouds for forest inventory
基于航空激光扫描点云的森林清查超分辨率研究
Abstract
Airborne Laser Scanning (ALS) can collect point clouds across large areas, enabling large-scale forest inventory. However, ALS point clouds are sparse and noisy, resulting in inaccurate individual-tree-level forest inventory, such as stem localization and tree size estimation. To overcome this problem, we propose a deep learning model, 3D Forest Super Resolution (3DFSR), to simultaneously improve point density and reduce noise for ALS forest point clouds. 3DFSR is a voxel-based CNN with a U-Net architecture. The proposed 3DFSR is evaluated on ALS point clouds collected in both temperate forests in the U.S. and boreal forests in Germany. Experimental results demonstrate that 3DFSR can generate finer point clouds of tree structure than other state-of-the-art point cloud super-resolution algorithms, achieving 0.249 m Chamfer Distance and 2.711 m Hausdorff Distance. Furthermore, to verify the effectiveness of 3DFSR point clouds in forest inventory, we conduct stem detection, DBH measurements, and stem reconstruction on both original ALS point clouds and 3DFSR-enhanced point clouds. We find that stem detection and reconstruction algorithms developed for TLS/MLS point clouds can work directly on our 3DFSR point clouds, and DBH can be derived with a circle-fitting method. The F1 score of stem detection improves from 0.71 on original ALS point clouds to 0.97 on 3DFSR point clouds; the RMSE of DBH estimation improves from 13.45 cm using allometric equations to 6.43 cm using circle fitting; compared with stems reconstructed from MLS point clouds, stems reconstructed from 3DFSR point clouds have a Chamfer Distance of 0.170 m, a Hausdorff Distance of 0.377 m, and an R2 of 0.95 for volume estimation. Finally, we find that the proposed 3DFSR is applicable to point densities from 10 to 1700 points/m2; it also generalizes across data collected from different LiDAR platforms without transfer learning.
Chinese Translation
航空激光扫描(ALS)能够在大面积范围内采集点云,从而实现大规模的森林清查。然而,ALS点云稀疏且噪声较多,导致单棵树木层面的森林清查不准确,例如树干定位和树木尺寸估计。为了解决这一问题,我们提出了一种深度学习模型——3D森林超分辨率(3DFSR),同时提高ALS森林点云的点密度并减少噪声。3DFSR是基于体素的卷积神经网络,采用U-Net架构。我们对来自美国温带森林和德国北方森林的ALS点云进行了3DFSR的评估。实验结果表明,3DFSR生成的树木结构点云比其他先进的点云超分辨率算法更为精细,取得了0.249米的Chamfer距离和2.711米的Hausdorff距离。此外,为了验证3DFSR点云在森林清查中的有效性,我们对原始ALS点云和3DFSR增强点云进行了树干检测、胸径(DBH)测量和树干重建。我们发现专为TLS/MLS点云开发的树干检测和重建算法可以直接应用于我们的3DFSR点云,且胸径可以通过圆拟合法进行推导。树干检测的F1得分从原始ALS点云的0.71提高到3DFSR点云的0.97;胸径估计的均方根误差(RMSE)从使用异速生长方程时的13.45厘米降至使用圆拟合法时的6.43厘米;与MLS点云的树干重建相比,从3DFSR点云重建的树干具有0.170米的Chamfer距离和0.377米的Hausdorff距离,以及0.95的体积估计R²值。最后,我们发现所提的3DFSR适用于处理点密度范围从10到1700点/m²的点云;它还可以在无需迁移学习的情况下泛化到不同LiDAR平台采集的数据。
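Chamfer Distance and Hausdorff Distance, the two point-cloud metrics quoted in this abstract, both build on nearest-neighbour distances between two point sets. The brute-force sketch below uses one common convention for each (sum of the two directional means for Chamfer, maximum of the two directional maxima for Hausdorff); the paper may use a different normalisation, so the definitions here should be read as assumptions.

```python
import math

def nearest_distances(src, dst):
    """For every point in src, the Euclidean distance to its closest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: the worst nearest-neighbour distance in either direction."""
    return max(max(nearest_distances(a, b)), max(nearest_distances(b, a)))

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 2.0)]  # one stray point 2 m away
cd = chamfer_distance(reference, generated)    # 0 + (0 + 0 + 2) / 3
hd = hausdorff_distance(reference, generated)  # 2.0
```

The toy case shows why both numbers are reported together: a single outlier dominates the Hausdorff value while the Chamfer value averages it out. The quadratic loops are fine for a sketch; real forest plots would need a spatial index.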
cs.CV / 148 / 2605.02206
Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
多模态机器遗忘中的度量不可靠性:系统分析与有原则的统一评分
Abstract
Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
Chinese Translation
为遵守通用数据保护条例(GDPR),视觉语言模型(VLMs)需要进行机器遗忘,然而当前的评估实践却存在不一致性。我们首次系统地研究了多模态遗忘中的度量可靠性。五种标准度量,遗忘准确率(Forget Accuracy, FA)、保留准确率(Retain Accuracy, RA)、成员推断攻击(Membership Inference Attack, MIA)、激活距离(Activation Distance, AD)和 JS 散度(JS divergence, JS),在三个 VQA 基准(MLLMU-Bench、UnLOK-VQA、MMUBench)中产生了相互矛盾的方法排名。对 36 个经过遗忘处理的 LLaVA-1.5-7B 模型的 Kendall tau 分析揭示出两个对立的簇,{FA, RA, MIA} 和 {AD, JS},其中 tau_FA_AD = -0.26,这一结果在 BLIP-2 OPT-2.7B 上得到了重复。在多模态 VQA 中的一致性低于单模态分类(平均 tau = 0.086 对比平均 tau = 0.158;差异 = 0.072),表明双重图像和文本路径放大了不一致性。我们引入了统一质量评分(Unified Quality Score, UQS),这是一个复合度量,其权重来源于每个度量与 Oracle 距离 d(M_hat, M_star) 的 Spearman 相关性,其中 M_star 是仅在保留集上重新训练的 oracle 模型。RA 表现出最强的可靠性(rho = 0.484, p = 0.003),而 FA 负相关(rho = -0.418, p = 0.011)。UQS 在 100 次随机权重扰动下产生了稳定的排名(tau = 0.647 ± 0.262)。我们发布了基准、36 个检查点和一个交互式排行榜。代码及预计算结果可在 https://github.com/neurips26/UnifiedUnl 获取。
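The agreement analysis in this abstract rests on Kendall's tau between the rankings that two metrics induce over the same set of models. A minimal sketch of the tau-a variant (no tie correction) follows; the scores are made-up numbers for four hypothetical unlearning methods, not results from the paper, and the paper may use a tie-corrected variant.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs, ignoring ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1      # both metrics order this pair the same way
        elif direction < 0:
            discordant += 1      # the metrics disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Illustrative scores of four hypothetical methods under two metrics.
metric_a = [0.9, 0.7, 0.5, 0.3]
metric_b = [0.4, 0.1, 0.3, 0.2]
tau_ab = kendall_tau(metric_a, metric_b)   # 4 concordant, 2 discordant -> 2/6
```

A value near 1 means the two metrics rank methods alike, a value near 0 means the rankings are unrelated, and a negative value (such as the reported tau between FA and AD) means they tend to reverse each other.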
cs.CV / 149 / 2605.02207
MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings
MultiSense-Pneumo:一种针对资源有限环境下肺炎筛查的多模态学习框架
Abstract
Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, and chest imaging, making screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal framework for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable multimodal fusion operator. Each modality is transformed into a normalized risk signal and aggregated into a unified screening estimate, enabling transparent and modular decision support. MultiSense-Pneumo is designed for real-world deployment under modest computational constraints and can operate fully offline on standard laptop-class hardware, making it suitable for community health workers, rural clinics, and emergency response settings. Experimental results demonstrate robustness of the radiograph pathway under domain shifts, while highlighting limitations in minority-class recall for acoustic signals. MultiSense-Pneumo is intended as a research prototype for screening and triage support rather than a clinically validated diagnostic system.
Chinese Translation
肺炎仍然是全球导致发病率和死亡率的主要原因,尤其是在资源有限的环境中,影像学、实验室检测和专科护理的获取受到限制。临床评估依赖于异构证据,包括症状、呼吸模式及胸部影像,使得筛查本质上具有多模态特征。然而,许多现有的计算方法仍然是单模态的,并主要集中于放射线照片。在本研究中,我们提出了MultiSense-Pneumo,这是一种面向肺炎筛查和分流支持的多模态框架,结合了结构化的症状描述、咳嗽音频、口语语言和胸部放射影像。该系统结合了确定性症状分流、基于LightGBM的声学分类、使用ResNet 18的领域对抗放射影像分析、基于Transformer的语音识别和一个可解释的多模态融合操作符。每种模态都被转化为标准化风险信号,并聚合为统一的筛查估计,从而实现透明和模块化的决策支持。MultiSense-Pneumo旨在在适度的计算约束下进行实际部署,并能够在标准笔记本电脑硬件上完全离线运行,使其适合社区健康工作者、乡村诊所和紧急响应环境。实验结果表明,在领域转移下放射影像路径的稳健性,同时突出声学信号在少数类召回率方面的局限性。MultiSense-Pneumo旨在作为一种筛查和分流支持的研究原型,而非经过临床验证的诊断系统。
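The abstract states that each modality is turned into a normalized risk signal and then aggregated, but it does not spell out the fusion operator. One simple, interpretable possibility is a weighted average over whichever modalities are available; the sketch below shows that option only. The function `fuse_risks`, the modality names, the weights and the handling of missing inputs are all assumptions made for illustration, not the published method.

```python
def fuse_risks(risks, weights):
    """Weighted average of per-modality risks in [0, 1]; missing modalities (None) are skipped.

    Returns the fused score and each modality's additive contribution to it.
    """
    available = {name: r for name, r in risks.items() if r is not None}
    if not available:
        raise ValueError("at least one modality must provide a risk score")
    total_weight = sum(weights[name] for name in available)
    contributions = {name: weights[name] * r / total_weight for name, r in available.items()}
    return sum(contributions.values()), contributions

# Hypothetical inputs: cough audio could not be recorded for this patient.
risks = {"symptoms": 0.8, "cough_audio": None, "radiograph": 0.6}
weights = {"symptoms": 1.0, "cough_audio": 1.0, "radiograph": 3.0}
fused, parts = fuse_risks(risks, weights)   # (1.0 * 0.8 + 3.0 * 0.6) / 4.0
```

Returning the per-modality contributions alongside the fused score is what keeps such an operator inspectable, since a reviewer can see which input drove the estimate.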
cs.CV / 150 / 2605.02212
NTIRE 2026 Challenge on Efficient Low Light Image Enhancement: Methods and Results
NTIRE 2026挑战赛:高效低光照图像增强的方法与结果
Abstract
This paper presents a comprehensive review of the NTIRE 2026 Efficient Low Light Image Enhancement (E-LLIE) Challenge, highlighting the proposed solutions and final outcomes. This challenge focuses on mobile image enhancement under low-light conditions, aiming to design lightweight networks that improve enhancement quality while ensuring practical deployability under limited computational resources. A total of 207 participants registered, 27 teams submitted valid entries, and 17 teams ultimately provided valid factsheets. Based on these submissions, this paper provides a systematic evaluation of recent methods for E-LLIE, offering a comprehensive overview of state-of-the-art progress and demonstrating significant improvements in both performance and efficiency.
Chinese Translation
本文对NTIRE 2026高效低光照图像增强(E-LLIE)挑战赛进行了全面回顾,重点介绍了提出的解决方案和最终成果。本挑战赛专注于低光照条件下的移动图像增强,旨在设计轻量级网络以提高增强质量,同时确保在有限计算资源下的实用部署能力。共有207名参与者注册,27支团队提交了有效作品,最终有17支团队提供了有效的事实表。基于这些提交,本文对E-LLIE的最新方法进行了系统评估,提供了对最先进进展的全面概述,并展示了在性能和效率方面的显著改善。
cs.CV / 151 / 2605.02230
InfiltrNet: Dual-Branch CNN-Transformer Architecture for Brain Tumor Infiltration Risk Prediction
InfiltrNet:用于脑肿瘤浸润风险预测的双分支CNN-Transformer架构
Abstract
Gliomas are aggressive brain tumors that infiltrate surrounding tissue beyond the visible tumor margins observed on Magnetic Resonance Imaging (MRI). Predicting the spatial extent of this infiltration is essential for surgical planning and radiation therapy, yet existing deep learning approaches focus on segmenting the visible tumor rather than estimating infiltration risk in the surrounding tissue. This paper presents InfiltrNet, a novel dual-branch architecture that combines a convolutional neural network (CNN) encoder with a Swin Transformer encoder through cross-attention fusion modules to predict three-zone infiltration risk maps from multimodal MRI. A label generation strategy based on distance transforms is proposed to derive reproducible infiltration risk zones from standard Brain Tumor Segmentation (BraTS) annotations. InfiltrNet is trained with a combined Dice-CrossEntropy and boundary-aware loss augmented by auxiliary supervision heads at intermediate decoder levels. Extensive experiments on BraTS 2020 and BraTS 2025 demonstrate that InfiltrNet outperforms five established baselines. Explainability analysis using GradCAM++ and Occlusion sensitivity confirms that the model attends to clinically relevant peritumoral regions.
Chinese Translation
胶质瘤是侵袭性脑肿瘤,会浸润超出磁共振成像(MRI)可见肿瘤边缘的周围组织。预测这种浸润的空间范围对手术规划和放射治疗至关重要,然而现有的深度学习方法更多地聚焦于对可见肿瘤的分割,而不是评估周围组织的浸润风险。本文提出了InfiltrNet,这是一种新颖的双分支架构,通过交叉注意力融合模块,将卷积神经网络(CNN)编码器与Swin Transformer编码器结合,以从多模态MRI中预测三区浸润风险图。提出了一种基于距离变换的标签生成策略,以从标准脑肿瘤分割(BraTS)标注中导出可重复的浸润风险区域。InfiltrNet在结合Dice-CrossEntropy和边界感知损失的基础上进行训练,并在中间解码器层引入辅助监督头。对BraTS 2020和BraTS 2025的广泛实验表明,InfiltrNet在性能上超越了五个已建立的基线。使用GradCAM++和遮挡敏感性进行的可解释性分析证实,该模型关注临床相关的肿瘤周围区域。
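The label generation step in this abstract derives risk zones from a tumor segmentation via distance transforms. A toy 2D version can be sketched with a breadth-first search that computes each pixel's city-block distance to the nearest tumor pixel and then bins the distances. The zone encoding (2 = high, 1 = medium, 0 = low), the thresholds `near` and `far`, and the choice to place tumor pixels in the high zone are hypothetical; the paper's actual zone definitions are not given in the abstract.

```python
from collections import deque

def distance_to_mask(mask):
    """City-block distance from every cell to the nearest nonzero cell (None if mask is empty)."""
    rows, cols = len(mask), len(mask[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    # Multi-source BFS: the first visit of a cell is along a shortest path.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

def risk_zones(mask, near=1, far=3):
    """Bin distances into three zones: 2 = high, 1 = medium, 0 = low (assumed encoding)."""
    def zone(d):
        if d is None or d > far:
            return 0
        return 2 if d <= near else 1
    return [[zone(d) for d in row] for row in distance_to_mask(mask)]

tumor = [[1, 0, 0, 0, 0, 0]]
zones = risk_zones(tumor)   # distances 0..5 -> [[2, 2, 1, 1, 0, 0]]
```

Because the zones depend only on the mask and two thresholds, the labels are reproducible from standard annotations, which matches the motivation given in the abstract. A volumetric version would use Euclidean distances in millimetres rather than grid steps.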
cs.CV / 152 / 2605.02247
Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
微调削弱了基础模型在长尾个性化联邦学习中的平衡性
Abstract
Personalized federated learning (PFL) with foundation models has emerged as a promising paradigm enabling clients to adapt to heterogeneous data distributions. However, real-world scenarios often face the co-occurrence of non-IID data and long-tailed class distributions, presenting unique challenges that remain underexplored in PFL. In this paper, we investigate this long-tailed personalized federated learning and observe that current methods suffer from two limitations: (i) fine-tuning degrades performance below zero-shot baselines due to the erosion of inherent class balance in foundation models; (ii) conventional personalization techniques further transfer this bias to local models through parameter or feature-level fusion. To address these challenges, we propose Federated Learning via Gradient Purification and Residual Learning (FedPuReL), which preserves balanced knowledge in the global model while enabling unbiased personalization. Specifically, we purify local gradients using zero-shot predictions to maintain a class-balanced global model, and model personalization as residual correction atop the frozen global model. Extensive experiments demonstrate that FedPuReL consistently outperforms state-of-the-art methods, achieving superior performance on both global and personalized models across diverse long-tailed scenarios. The code is available at https://github.com/shihaohou/FedPuReL.
Chinese Translation
个性化联邦学习(PFL)与基础模型的结合已成为一种有前景的范式,能够使客户端适应异构数据分布。然而,现实世界中的场景往往面临非独立同分布(non-IID)数据与长尾类别分布的共存,这为PFL带来了独特的挑战,尚未得到充分探索。本文研究了这种长尾个性化联邦学习,并观察到现有方法存在两个局限:(i) 微调在性能上低于零样本基准,这主要是由于基础模型中固有类别平衡的侵蚀;(ii) 传统个性化技术通过参数或特征级别的融合进一步将这种偏差转移到本地模型中。为了解决这些挑战,我们提出了通过梯度净化和残差学习的联邦学习(Federated Learning via Gradient Purification and Residual Learning,FedPuReL),旨在保持全局模型中的平衡知识,同时实现无偏个性化。具体而言,我们使用零样本预测来净化本地梯度,以维持类平衡的全局模型,并将模型个性化视为在冻结的全局模型上进行残差修正。大量实验证明,FedPuReL在多种长尾场景中,持续超越了最先进的方法,在全局和个性化模型上均实现了卓越的性能。代码可在 https://github.com/shihaohou/FedPuReL 获取。
cs.CV / 153 / 2605.02258
SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters
SpectraDINO:通过轻量适配器弥合视觉基础模型中的光谱差距
Abstract
Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB-centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond-visible modalities through lightweight, per-modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi-stage teacher-student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss. This staged curriculum enables strong cross-modal alignment without catastrophic forgetting of RGB priors. We evaluate SpectraDINO on multispectral object detection and semantic segmentation across challenging NIR, SWIR, and LWIR benchmarks using widely adopted fusion strategies. SpectraDINO achieves state-of-the-art performance across most benchmarks, validating its effectiveness as a general-purpose backbone for spectral generalization. The code and weights for model variants are available at https://github.com/Yonsei-STL/SpectraDINO.
Chinese Translation
在大规模RGB数据上预训练的视觉基础模型(VFM)展现了卓越的表征质量,但其在近红外(NIR)、短波红外(SWIR)和长波红外(LWIR)等多光谱成像中的适用性仍然很大程度上未被探索。这些光谱模态提供了对不利条件下稳健感知至关重要的互补感知能力,但相较于以RGB为中心的预训练模型,它们存在根本的领域差距。我们提出SpectraDINO,这是一种多光谱VFM,通过轻量级的按模态瓶颈适配器,将DINOv2的ViT主干扩展到超出可见光的模态,同时保留了冻结的RGB主干的丰富表征。我们引入了一种多阶段的教师-学生训练协议,其中冻结的DINOv2教师通过余弦蒸馏、对称对比损失、补丁级对齐和一种新颖的邻域结构保持损失指导光谱学生。这种分阶段的课程使得强大的跨模态对齐成为可能,同时没有造成对RGB先验的灾难性遗忘。我们在具有挑战性的NIR、SWIR和LWIR基准测试中,使用广泛采用的融合策略评估SpectraDINO在多光谱目标检测和语义分割方面的表现。SpectraDINO在大多数基准测试中达到了最先进的性能,验证了其作为光谱泛化通用主干的有效性。模型变体的代码和权重可在 https://github.com/Yonsei-STL/SpectraDINO 获得。
cs.CV / 154 / 2605.02262
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant:基于窗口级相似性的混合精度 KV 缓存量化用于 VLM 推理优化
Abstract
Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of a VLM is often too long, which can cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in VLMs at token granularity, which makes the search process time-consuming and the computation hardware-inefficient. This paper introduces a novel approach called WindowQuant, which employs window-adaptive mixed-precision quantization to optimize the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, maintaining the model accuracy. Furthermore, window-level KV cache computation reorders the KV cache windows before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLMs and KV cache quantization methods on various datasets.
Chinese Translation
近年来,视频语言模型(VLMs)已应用于多个领域。然而,VLM 的视觉令牌序列过长,这可能导致不可容忍的推理延迟和 GPU 内存使用。现有方法基于令牌粒度对 VLM 中的关键值(KV)缓存提出混合精度量化,但在搜索过程中耗时长,并且在计算时硬件效率低下。本文提出了一种新方法称为 WindowQuant,它采用窗口自适应混合精度量化来优化 KV 缓存。WindowQuant 由两个模块组成:窗口级量化搜索和窗口级 KV 缓存计算。窗口级量化搜索快速确定 KV 缓存窗口的最佳位宽配置,基于相应视觉令牌窗口与文本提示之间的相似度评分,从而保持模型的准确性。此外,窗口级 KV 缓存计算在量化之前重新排序 KV 缓存窗口,避免了由于混合精度量化在推理计算中造成的硬件低效。大量实验证明,WindowQuant 在多个数据集上超越了最先进的 VLM 模型和 KV 缓存量化方法。
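The window-level search described in this abstract can be pictured with a small sketch. It is not the authors' implementation: the mean pooling of each window, the cosine similarity, the threshold values, and the names `assign_window_bits` and `reorder_by_bits` are all assumptions made for illustration. The abstract only states that bit-widths are chosen from window-to-prompt similarity and that windows are reordered before quantization.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def assign_window_bits(visual_tokens, prompt_embedding, window_size,
                       thresholds=((0.5, 8), (0.2, 4)), default_bits=2):
    """Pick one bit-width per window of visual tokens.

    Hypothetical rule: windows more similar to the prompt keep more bits.
    `thresholds` is a tuple of (minimum similarity, bits), highest first.
    Returns a list of (window_start_index, bits).
    """
    plan = []
    for start in range(0, len(visual_tokens), window_size):
        window = visual_tokens[start:start + window_size]
        score = cosine(mean_vector(window), prompt_embedding)
        bits = default_bits
        for min_score, candidate_bits in thresholds:
            if score >= min_score:
                bits = candidate_bits
                break
        plan.append((start, bits))
    return plan


def reorder_by_bits(plan):
    """Group windows of equal precision contiguously (stable sort, highest bits first)."""
    return sorted(plan, key=lambda item: -item[1])
```

Grouping windows of the same precision is one plausible reading of why reordering helps hardware efficiency: each contiguous group can then be processed by a single fixed-precision kernel.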
cs.CV / 155 / 2605.02275
EdgeLPR: On the Deep Neural Network trade-off between Precision and Performance in LiDAR Place Recognition
EdgeLPR:在LiDAR地点识别中深度神经网络在精度和性能之间的权衡
Abstract
Place recognition is essential for long-term autonomous navigation, enabling loop closure and consistent mapping. Although deep learning has improved performance, deploying such models on resource-constrained platforms remains challenging. This work explores efficient LiDAR-based place recognition for EdgeAI by leveraging Bird's Eye View representations to enable lightweight image-based networks. We benchmark representative architectures without aggregation heads using a unified descriptor scheme based on global pooling and linear projection, and evaluate performance under FP32, FP16, and INT8 quantization. Experiments reveal trade-offs between accuracy, robustness, and efficiency: FP16 matches FP32 with lower cost, while INT8 introduces architecture-dependent degradation. Overall, the presented results are a strong basis for future research on 'use-case'-aware quantisation of Neural Networks for Edge deployment.
Chinese Translation
地点识别对于长期自主导航至关重要,它使得环路闭合和一致映射成为可能。尽管深度学习提高了性能,但在资源受限的平台上部署此类模型仍然具有挑战性。本工作通过利用鸟瞰视图(Bird's Eye View)表示探索了基于LiDAR的高效地点识别,旨在为EdgeAI提供轻量级的基于图像的网络。我们使用基于全局池化和线性投影的统一描述符方案,对没有聚合头的代表性架构进行基准测试,并在FP32、FP16和INT8量化下评估性能。实验显示了准确性、鲁棒性和效率之间的权衡:FP16的性能与FP32相当且成本更低,而INT8则引入了架构依赖的降级。总体来说,所呈现的结果为未来关于神经网络在边缘部署中的“应用案例”知晓量化提供了坚实的基础。
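As a rough illustration of the Bird's Eye View idea in this abstract, the sketch below rasterizes a LiDAR point cloud into a 2D point-density grid that an image network could consume. The grid extent, the cell size, and the function name `points_to_bev` are assumptions. The abstract does not say which BEV encoding the authors use.

```python
def points_to_bev(points, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0), cell=1.0):
    """Project (x, y, z) points onto a top-down grid of per-cell point counts.

    Points outside the configured x/y extent are dropped; z is ignored here.
    Returns a list of rows, indexed as grid[row][col] with row from y and col from x.
    """
    width = int((x_range[1] - x_range[0]) / cell)
    height = int((y_range[1] - y_range[0]) / cell)
    grid = [[0] * width for _ in range(height)]
    for x, y, _z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_range[0]) / cell), width - 1)
        row = min(int((y - y_range[0]) / cell), height - 1)
        grid[row][col] += 1
    return grid
```

A density channel is only one option. Height or intensity statistics per cell are common alternatives, and the choice affects how much the image network can recover about vertical structure.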
cs.CV / 156 / 2605.02283
Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM
重新思考用于遥感检索的电光视觉基础模型:与通用视觉基础模型的对照比较
Abstract
Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer from substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.
Chinese Translation
视觉基础模型因其利用大规模未标记视觉数据的能力而受到广泛关注。这一优势在遥感领域尤为重要,因为数据获取成本高昂且标注通常需要专业知识。近年来的电光视觉基础模型旨在从遥感影像中学习特定领域的表征,但尚不清楚它们在基于检索的评估中是否比强大的通用视觉基础模型更有效。本研究对代表性的电光特定模型与通用视觉基础模型在遥感图像检索中的表现进行了受控比较。我们使用相同的数据集、检索协议和评估指标,评估了领域内表现和跨场景泛化能力。我们的结果显示,强大的通用视觉基础模型在表现上与现有的电光特定模型相竞争,在某些情况下甚至优于它们。此外,电光特定模型在跨场景评估中往往遭受显著的性能退化,而通用模型则表现出更稳定的迁移能力。这些发现表明,仅仅依赖电光预训练并不能保证更强的用于检索的遥感表征。我们讨论了当前电光特定预训练策略的局限性,并强调未来的电光视觉基础模型需要更好地利用遥感影像的物理、空间、光谱和地理特征。
cs.CV / 157 / 2605.02284
Beyond Known Objects: A Novel Framework for Open-Set Object Detection using Negative-Aware Norm
超越已知对象:一种基于负意识范数的开放集对象检测新框架
Abstract
Open-Set Object Detection (OSOD) is crucial for autonomous driving, where perception systems must recognize and localize both known and previously unseen objects in complex, dynamic environments. While recent approaches deliver promising results, they often require retraining the detector extensively to learn objectness, which describes the likelihood that a bounding box tightly encloses a valid object, regardless of whether its category was learned during training. Deviating from existing work, we hypothesize that standard off-the-shelf detectors may already contain helpful cues for objectness, owing to their training on numerous and diverse known categories. Building on this idea, we propose NAN-SPOT, a training-light framework that does not require retraining the base object detector and estimates objectness by leveraging a hidden-layer metric called Negative-Aware Norm (NAN), requiring only minutes of training on just hundreds of images. To support comprehensive evaluation, we introduce COCO-Open, an expanded version of the existing COCO-Mixed dataset, increasing unknown object annotations from 433 to 1853, making it the most exhaustively labeled dataset for OSOD to the best of our knowledge. Experimental results demonstrate that NAN-SPOT achieves even better performance on unknown object detection than methods requiring heavy training, without compromising performance on known objects. This efficiency and robustness make NAN-SPOT a promising step towards open-world perception in autonomous driving.
Chinese Translation
开放集对象检测(OSOD)对于自动驾驶至关重要,因为感知系统必须在复杂动态环境中识别和定位已知和先前未见过的对象。尽管近期的方法取得了令人鼓舞的成果,但它们通常需要对检测器进行大量再训练,以学习对象性(objectness),即描述边界框紧密包围有效对象的可能性,而不管其类别在训练过程中是否被学习。与现有工作不同,我们假设标准的现成检测器可能已经包含了有助于对象性的线索,因为它们是在众多多样的已知类别上进行训练的。在此基础上,我们提出了NAN-SPOT,这是一种轻量级训练框架,无需重新训练基础对象检测器,通过利用一个称为负意识范数(Negative-Aware Norm,NAN)的隐藏层度量来估计对象性,只需在数百张图像上进行几分钟的训练。为了支持全面评估,我们引入了COCO-Open,这是现有COCO-Mixed数据集的扩展版本,将未知对象注释从433个增加到1853个,尽我们所知,它是目前关于OSOD最详尽标记的数据集。实验结果表明,NAN-SPOT在未知对象检测上的性能甚至超过了需要大量训练的方法,同时对已知对象的性能没有妥协。这种高效性和鲁棒性使得NAN-SPOT成为朝向自动驾驶开放世界感知的一个有希望的步骤。
cs.CV / 158 / 2605.02288
LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory
LabBuilder:基于协议的互动与安全实验室三维布局生成
Abstract
Automated laboratories hold the promise of accelerating scientific discovery, yet their deployment is bottlenecked by the difficulty of designing safe and executable environments. While simulator-based design offers scalability, existing 3D scene generation methods are primarily tailored for household settings, optimizing for visual plausibility while neglecting the rigorous functional semantics and safety constraints essential for scientific experimentation. We present LabBuilder, an end-to-end system that generates and verifies 3D laboratory layouts from concise textual specifications. It operates through three tightly coupled components: LabForge first curates a meta-dataset of annotated assets and chemical knowledge, translating natural language specifications into structured protocols; building on these protocols, LabGen synthesizes laboratory layouts via an iterative, constraint-aware optimization strategy; finally, LabTouchstone evaluates the resulting layouts as a unified benchmark. Extensive experiments demonstrate that LabBuilder significantly outperforms existing state-of-the-art methods, producing laboratory environments that are not only realistic but also functionally valid and safe for complex experimental workflows.
Chinese Translation
自动化实验室承诺加速科学发现,然而它们的部署受到设计安全和可执行环境难度的瓶颈。虽然基于模拟器的设计提供了可扩展性,但现有的三维场景生成方法主要针对家庭环境,优化视觉合理性,而忽视了科学实验所必需的严格功能语义和安全约束。我们提出了LabBuilder,一个从简洁文本规范生成和验证三维实验室布局的端到端系统。它通过三个紧密耦合的组件运行:LabForge 首先策划一个带注释资产和化学知识的元数据集,将自然语言规范翻译为结构化协议;基于这些协议,LabGen 通过迭代的、考虑约束的优化策略合成实验室布局;最后,LabTouchstone 将结果布局作为统一基准进行评估。大量实验表明,LabBuilder 显著优于现有的最先进方法,生成的实验室环境不仅真实,而且在复杂实验工作流程中功能有效且安全。
cs.CV / 159 / 2605.02291
A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets
一种混合方法用于弥补游戏引擎合成数据集中的Sim2Real外观差距
Abstract
Video game engines have been an important source for generating large volumes of visual synthetic datasets for training and evaluating computer vision algorithms that are to be deployed in the real world. While the visual fidelity of modern game engines has been significantly improved with technologies such as ray-tracing, a notable sim2real appearance gap between the synthetic and the real-world images still remains, which limits the utilization of synthetic datasets in real-world applications. In this letter, we investigate the ability of a state-of-the-art image generation and editing diffusion model (FLUX.2-4B Klein) to enhance the photorealism of synthetic datasets and compare its performance against a traditional image-to-image translation model (REGEN). Furthermore, we propose a hybrid approach that combines the strong geometry and material transformations of diffusion-based methods with the distribution-matching capabilities of image-to-image translation techniques. Through experiments, it is demonstrated that REGEN outperforms FLUX.2-4B Klein and that by combining both FLUX.2-4B Klein and REGEN models, better visual realism can be achieved compared to using each model individually, while maintaining semantic consistency. The code is available at: https://github.com/stefanos50/Hybrid-Sim2Real
Chinese Translation
视频游戏引擎已成为生成大量视觉合成数据集的重要来源,这些数据集用于训练和评估将应用于现实世界的计算机视觉算法。尽管现代游戏引擎的视觉保真度在光线追踪等技术的支持下得到了显著提升,但合成图像与真实世界图像之间仍然存在显著的Sim2Real外观差距,这限制了合成数据集在现实应用中的利用。在本文中,我们研究了一种最先进的图像生成与编辑扩散模型(FLUX.2-4B Klein)增强合成数据集照片真实感的能力,并将其性能与传统的图像到图像转换模型(REGEN)进行比较。此外,我们提出了一种混合方法,结合了基于扩散的方法在几何和材质转换上的优势与图像到图像转换技术的分布匹配能力。通过实验表明,REGEN的性能优于FLUX.2-4B Klein,并且通过结合FLUX.2-4B Klein和REGEN模型,能够实现比单独使用每个模型更好的视觉真实感,同时保持语义一致性。代码可在以下地址获取:https://github.com/stefanos50/Hybrid-Sim2Real
cs.CV / 160 / 2605.02292
Momentum-Anchored Multi-Scale Fusion Model for Long-Tailed Chest X-Ray Classification
基于动量锚定的多尺度融合模型用于长尾胸部X光分类
Abstract
Chest X-ray classification suffers from severe class imbalance where gradient updates bias toward majority classes, causing feature drift and poor performance on rare but critical pathologies. We propose a Momentum-Anchored Multi-Scale Fusion Network that uses exponential moving averages (EMA) as a temporal anchoring mechanism to stabilize feature representations under long-tailed distributions. Our approach applies selective momentum updates to the final expansion block of an EfficientNet backbone, creating a slowly-evolving reference branch that resists gradient-induced drift while preserving discriminative patterns for minority classes. Combined with multi-scale spatial fusion ($1\times 1$, $3 \times 3$, $5 \times 5$ convolutions), this anchoring strategy maintains representational stability throughout training. On ChestX-ray14, our method achieves 0.8682 average AUC, outperforming state-of-the-art approaches and showing particular improvements on rare pathologies like Hernia (0.9470) and Pneumonia (0.8165). The results demonstrate that momentum anchoring effectively counters feature instability in long-tailed medical image classification.
Chinese Translation
胸部X光分类面临严重的类别不平衡问题,其中梯度更新偏向于占多数的类别,导致特征漂移以及在稀有但重要病症上的表现不佳。我们提出了一种基于动量锚定的多尺度融合网络(Momentum-Anchored Multi-Scale Fusion Network),该网络利用指数移动平均(EMA)作为时间锚定机制,以稳定长尾分布下的特征表示。我们的方法对高效网络(EfficientNet)主干网络的最终扩展块应用选择性动量更新,创建一个缓慢演变的参考分支,使其抵抗梯度引起的漂移,同时保持少数类别的区分模式。结合多尺度空间融合($1\times 1$、$3 \times 3$、$5 \times 5$ 卷积),该锚定策略在整个训练过程中保持表征的稳定性。在ChestX-ray14数据集上,我们的方法实现了0.8682的平均AUC,超越了主要的先进方法,并在稀有病症如疝气(0.9470)和肺炎(0.8165)上表现出特别的改善。结果表明,动量锚定有效抵消了长尾医学图像分类中的特征不稳定性。
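The momentum anchoring in this abstract rests on the standard exponential moving average rule, anchor ← m · anchor + (1 − m) · online. The sketch below applies it only to parameters whose names carry a given prefix, to mirror the "selective" update of a single block. The parameter names, the `final_block.` prefix, and the function `selective_ema_update` are hypothetical and not taken from the paper.

```python
def selective_ema_update(anchor_params, online_params, momentum=0.99,
                         prefix="final_block."):
    """Return a new anchor dict where only `prefix`-named entries move toward the online weights.

    Each parameter is a flat list of floats. Entries outside the selected block
    are copied from the anchor unchanged.
    """
    updated = dict(anchor_params)
    for name, online_value in online_params.items():
        if name.startswith(prefix):
            updated[name] = [
                momentum * a + (1.0 - momentum) * o
                for a, o in zip(anchor_params[name], online_value)
            ]
    return updated
```

With a momentum close to 1, the anchor branch changes slowly even when the online weights swing sharply after a batch dominated by majority classes, which is the stabilizing effect the abstract attributes to the reference branch.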
cs.CV / 161 / 2605.02316
Open-access model for detecting openly dumped dispersed municipal solid waste from crowdsourced UAV imagery in Sub-Saharan Africa
基于开放获取模型的南撒哈拉非洲人群众包无人机影像中开放倾倒分散城市固体废物的检测
Abstract
Managing municipal solid waste in rapidly urbanizing Sub-Saharan Africa remains challenging due to dispersed informal dumping and limited high-resolution datasets for spatial monitoring. We present an open-access deep learning model for automated detection of openly dumped dispersed solid waste via crowdsourced UAV imagery, trained and evaluated across 29 regions in 10 countries, encompassing diverse environmental contexts. A deep learning model trained on manually annotated image tiles achieved excellent performance in detecting openly dumped dispersed solid waste across all study regions. Predicted distributions reveal heterogeneous accumulation patterns, ranging from localized hotspots - often along waterways, where waste can exacerbate flood and public health risks - to more dispersed litter across urban areas. Waste accumulation is most strongly associated with population density and indicators of lack of local infrastructure access, whereas its relationship with broader measures of regional development is weaker, highlighting the importance of fine-scale data for understanding localized waste dynamics. By releasing the model, this study provides a ready-to-use tool for UAV imagery collected by municipalities and local mapping communities, enabling openly dumped dispersed solid waste monitoring without extensive technical expertise. This approach empowers local practitioners to convert UAV imagery into actionable insights, supporting targeted interventions and improved municipal solid waste management across Sub-Saharan Africa.
Chinese Translation
在迅速城市化的南撒哈拉非洲,管理城市固体废物的挑战依然严峻,这主要由于非正式的分散倾倒和缺乏高分辨率空间监测数据。本文提出了一种开放获取的深度学习模型,利用人群众包的无人机影像自动检测公开倾倒的分散固体废物,该模型在10个国家的29个区域进行了训练和评估,涵盖了多样的环境背景。通过在手动标注的图像切片上训练的深度学习模型,在所有研究区域中对于公开倾倒的分散固体废物检测表现优异。预测的分布揭示了异质的积累模式,从局部热点(通常位于沿水道,废物可能加剧洪水和公共卫生风险)到城市区域的更为分散的垃圾。废物积累与人口密度及缺乏当地基础设施获取的指标关联最为密切,而与区域发展更广泛指标的关系则较弱,这突出了理解局部废物动态时高分辨率数据的重要性。通过发布该模型,本研究为市政当局和当地制图社区收集的无人机影像提供了一种现成的工具,使得在没有广泛技术专长的情况下也能进行公开倾倒分散固体废物的监测。这一方法赋能当地从业者将无人机影像转化为可行的见解,支持有针对性的干预措施并改善南撒哈拉非洲的城市固体废物管理。
cs.CV / 162 / 2605.02328
Improving Imbalanced Multi-Label Chest X-Ray Diagnosis via CBAM-Enhanced CNN Backbones
通过增强CBAM的卷积神经网络骨干改善不平衡的多标签胸部X光诊断
Abstract
Chest radiography is a widely used imaging modality for thoracic disease diagnosis, yet its conventional interpretation remains time-consuming and heavily dependent on expert knowledge. While deep learning has improved diagnostic efficiency through automated feature extraction, challenges such as class imbalance and the localization of multiple co-existing pathologies remain unsolved. In this paper, inspired by the strength of the Convolutional Block Attention Module (CBAM) in feature refinement and the capability of CNN blocks in feature extraction, we propose a strategy to integrate CBAM into traditional CNN blocks to enhance performance in multi-label classification tasks. Our method achieves a mean AUC of 0.8695 on the ChestXray14 dataset, outperforming several state-of-the-art baselines. Our source code is available at: https://github.com/NNNguyenDuyyy/FETC_CBAM_Enhanced_CNN.git
Chinese Translation
胸部放射摄影是一种广泛应用于胸部疾病诊断的成像方式,但其传统解读过程仍然耗时且高度依赖专家知识。尽管深度学习通过自动特征提取提高了诊断效率,但诸如类别不平衡和多种共存病灶的定位等挑战依然未得到解决。本文受到卷积块注意力模块(Convolutional Block Attention Module, CBAM)在特征精炼方面的优势以及卷积神经网络(CNN)在特征提取方面的能力的启发,提出了一种将CBAM集成到传统CNN块中的策略,以增强多标签分类任务的性能。我们的方法在ChestXray14数据集上取得了0.8695的平均AUC,优于多个最先进的基线。我们的源代码可在以下链接获取:https://github.com/NNNguyenDuyyy/FETC_CBAM_Enhanced_CNN.git
cs.CV / 163 / 2605.02357
Channel-Level Relation to Attentive Aggregation with Neighborhood-Homogeneity Constraint for Point Cloud Analysis
具有邻域均质性约束的通道级关系与关注聚合在点云分析中的应用
Abstract
In 3D point cloud understanding, the core challenge lies in accurately capturing discriminative features within complex neighborhoods, which directly affects the execution precision of downstream tasks such as embodied AI and autonomous driving. Existing methods explore feature correlation discrimination but are limited to point-level spatial distribution or channel responses, enabling only coarse-grained evaluation. For modern multi-scale point cloud networks, such coarse-grained metrics inevitably incur significant information loss in deeper layers. To address this issue, we propose a novel network equipped with a channel-level metric-based enhancement mechanism, termed the PointCRA network. Our core idea is to introduce temporal trend variation as a new evaluation dimension to avoid the information loss caused by weight dimension collapse in existing spatial and channel attention mechanisms. On this basis, we construct a multi-level calibration framework guided by neighborhood homogeneity for weight calibration, and design a dedicated loss function to enhance channel discriminability. The module effectively leverages the intrinsic feature priors of deep networks to adaptively correct the feature aggregation process, offering strong interpretability with low parameter overhead. Furthermore, our proposed method exhibits strong transferability, interpretability, and parameter efficiency. We validate the effectiveness of the proposed method on diverse datasets and benchmark models, and further demonstrate its rationality through extensive analytical experiments. Our PointCRA achieves 77.5% mIoU on the S3DIS dataset, 90.4% OA on the ScanObjectNN dataset, and 87.4% instance mIoU on the ShapeNetPart dataset. The code and pretrained weights are publicly available on GitHub:
Chinese Translation
在3D点云理解中,核心挑战在于准确捕捉复杂邻域内的区分特征,这直接影响到如具身人工智能和自动驾驶等下游任务的执行精度。现有方法探索特征相关性区分,但局限于点级空间分布或通道响应,仅能实现粗粒度的评估。对于现代多尺度点云网络,这种粗粒度指标不可避免地会导致深层信息的显著损失。为了解决这一问题,我们提出了一种新颖的网络,配备基于通道级度量的增强机制,称为PointCRA网络。我们的核心思路是引入时间趋势变化作为新评估维度,以避免现有空间和通道注意机制中因权重维度崩溃导致的信息损失。在此基础上,我们构建了一个由邻域均质性指导的多级校准框架进行权重校准,并设计了一种专用损失函数以增强通道的区分能力。该模块有效利用深度网络的内在特征先验,自适应纠正特征聚合过程,提供强大的可解释性且参数开销低。此外,我们提出的方法展现了强大的迁移能力、可解释性和参数效率。我们在多种数据集和基准模型上验证了该方法的有效性,并通过广泛的分析实验进一步证明了其合理性。我们的PointCRA在S3DIS数据集上达到了77.5%的mIoU,在ScanObjectNN数据集上达到了90.4%的OA,在ShapeNetPart数据集上达到了87.4%的实例mIoU。代码和预训练权重已在GitHub上公开。
cs.CV / 164 / 2605.02376
Graph-Augmented Topological Internalization with Dual-Stream Classifiers for Medical Report Generation
基于图增强拓扑内化与双流分类器的医疗报告生成
Abstract
Automated medical report generation, MRG, holds substantial value for alleviating radiologist workload and enhancing diagnostic efficiency. However, mainstream approaches typically treat diverse chest abnormalities as isolated classification targets. This paradigm often overlooks inherent disease co-occurrences and struggles to translate medical topological structures into explicit data correlations, constraining the model's reasoning capacity on complex or subtle lesions. To address this, we propose a Graph-Augmented Dual-Stream Medical Report Generation with Topological Internalization, GDMRG. Our framework introduces a Topological Knowledge Internalization module, TKI, which leverages a Graph Convolutional Network, GCN, to generate an explicit parameterized weight matrix based on global disease co-occurrence priors. This facilitates efficient topological knowledge injection without relying on external retrieval mechanisms. Building upon this, we construct a dual-stream classification system: the main branch generates discrete diagnostic prompts under topological constraints, while the auxiliary branch employs an asymmetric optimization strategy to dynamically calibrate decision boundaries for highly imbalanced samples. Concurrently, to establish a logical closed loop between diagnosis and visual grounding, we design a diagnostic-driven Diagnosis-Guided Spatial Attention, DGSA, that utilizes high-dimensional clinical semantics to recalibrate the visual encoder, mitigating feature hallucinations. Comprehensive experiments on the MIMIC-CXR dataset demonstrate that GDMRG achieves competitive clinical efficacy, CE, while maintaining natural language fluency. Furthermore, our model exhibits robust zero-shot generalization on the IU X-Ray dataset. In summary, this work presents an integrated and interpretable paradigm for medical report generation.
Chinese Translation
自动化医疗报告生成(Medical Report Generation, MRG)在减轻放射科医师工作负担和提高诊断效率方面具有重要价值。然而,主流方法通常将多样的胸部异常视为孤立的分类目标。这种范式往往忽视了固有疾病共现性,并难以将医疗拓扑结构转化为明确的数据关联,从而限制了模型对复杂或微妙病变的推理能力。为了解决这一问题,我们提出了一种基于图增强的双流医疗报告生成模型,称为GDMRG。我们的框架引入了拓扑知识内化模块(Topological Knowledge Internalization, TKI),该模块利用图卷积网络(Graph Convolutional Network, GCN)生成基于全局疾病共现先验的明确参数化权重矩阵。这一方法能有效注入拓扑知识,而无需依赖于外部检索机制。在此基础上,我们构建了一种双流分类系统:主分支在拓扑约束下生成离散的诊断提示,而辅助分支则采用不对称优化策略,为高度不平衡样本动态校准决策边界。同时,为了在诊断与视觉基础之间建立一个逻辑闭环,我们设计了一种以诊断驱动的诊断引导空间注意力机制(Diagnosis-Guided Spatial Attention, DGSA),利用高维临床语义重新校准视觉编码器,从而减轻特征幻觉。对MIMIC-CXR数据集的全面实验表明,GDMRG在保持自然语言流畅性的同时,取得了竞争性的临床有效性(Clinical Efficacy, CE)。此外,我们的模型在IU X-Ray数据集上展示了强大的零样本泛化能力。总之,本研究提出了一种集成且可解释的医疗报告生成范式。
cs.CV / 165 / 2605.02378
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
通过归纳-演绎推理增强多模态上下文学习
Abstract
In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
Chinese Translation
上下文学习(ICL)使得大型模型能够通过少量示例适应任务,但在视觉-语言模型(VLMs)中的扩展仍然脆弱。我们的分析揭示了基本限制在于归纳空白,模型常常从错误的推理中产生正确答案,而在跨示例提取一致规则时则表现乏力。这个空白进一步受到两个视觉层面的障碍的加剧:大量冗余视觉标记掩盖了文本线索,以及偏向初始图像而忽视后续上下文的注意力分布不均。为了解决这些问题,我们提出了一种框架,将多模态 ICL 重新构建为一种原则性的归纳-演绎过程。该框架结合了基于相似性的视觉标记压缩模块,以滤除冗余的区域,动态注意力重分配机制,以在所有图像之间公平分配关注,及链式思维范例,明确引导模型分析个别示例,推导出可推广的规则,并将其应用于查询。辅助学习管道将监督微调与利用可验证奖励的强化学习相结合,以强化忠实引用和噪声过滤。对涵盖视觉感知、逻辑推理、STEM 问题及讽刺检测的八个基准测试的评估,表明在多个开源 VLMs 上,相较于标准 ICL 基线实现了一致且显著的改善,突显了在多模态环境中赋予模型真正归纳能力的潜力。
cs.CV / 166 / 2605.02380
UnGAP: Uncertainty-Guided Affine Prompting for Real-Time Crack Segmentation
UnGAP:基于不确定性指导的仿射提示用于实时裂纹分割
Abstract
Real-time crack segmentation is vital for structural health monitoring but is plagued by aleatoric uncertainties arising from varying lighting, blur, and texture ambiguity. Current uncertainty-aware approaches typically treat uncertainty estimation as a passive endpoint for post-hoc analysis, failing to close the loop by feeding this information back to refine feature representations. We contend that independent pixel-wise heteroscedastic modeling is uniquely suited for crack segmentation, as cracks are defined by fine-grained local gradients rather than the global semantic coherence relied upon in general object segmentation. However, this approach suffers from a structural optimization pathology: high predicted variance attenuates loss gradients, effectively causing the model to ignore difficult samples and under-fit complex boundaries. To address these challenges, we propose UnGAP, a novel framework that establishes a closed-loop mechanism between uncertainty estimation and feature learning. Central to our approach is the Uncertainty-Prompted Feature Modulator (UPFM), which treats aleatoric uncertainty as an active visual prompt rather than a mere output. UPFM dynamically calibrates feature distributions through pixel-wise affine transformations. Crucially, this mechanism mitigates the heteroscedastic pathology by transforming high variance, which would otherwise indicate gradient suppression, into a constructive signal for stronger feature rectification in ambiguous regions. Additionally, a boundary-aware detection head is introduced to further constrain prediction precision. Extensive experiments demonstrate that UnGAP balances superior segmentation accuracy with real-time inference speed, effectively validating the benefit of transforming uncertainty from a passive metric into an active calibration tool.
Chinese Translation
实时裂纹分割对于结构健康监测至关重要,但受到由光照变化、模糊和纹理歧义引起的随机不确定性困扰。当前的不确定性感知方法通常将不确定性估计视为事后分析的被动终点,未能通过将这一信息反馈以优化特征表示来形成闭环。我们认为独立的像素级异方差建模特别适合于裂纹分割,因为裂纹是由细粒度的局部梯度定义的,而不是依赖于一般物体分割中的全局语义一致性。然而,这种方法面临结构优化病理的问题:高预测方差会削弱损失梯度,导致模型有效地忽略困难样本并未能准确适应复杂边界。为了解决这些挑战,我们提出了UnGAP,一个建立不确定性估计与特征学习之间闭环机制的新框架。我们方法的核心是不确定性引导特征调制器(Uncertainty-Prompted Feature Modulator, UPFM),它将随机不确定性视为一种主动的视觉提示,而不仅仅是一个输出。UPFM通过像素级的仿射变换动态校准特征分布。至关重要的是,这一机制通过将高方差转化为在模糊区域进行强特征校正的有效信号,缓解了异方差病理。此外,还引入了一种边界感知检测头,以进一步约束预测精度。大量实验证明,UnGAP在确保实时推理速度的同时,实现了优越的分割精度,有效验证了将不确定性从被动度量转变为主动校准工具的益处。
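The gradient attenuation mentioned in this abstract can be made concrete with the textbook Gaussian heteroscedastic loss, L = (y − μ)² / (2σ²) + ½ log σ², whose gradient with respect to μ is (μ − y) / σ². This regression form is an assumption chosen for illustration, since the paper addresses segmentation and may use a different likelihood. The `affine_modulate` function and its two weights are likewise hypothetical stand-ins for the pixel-wise affine transformation the abstract describes.

```python
import math


def gaussian_nll(y, mu, log_var):
    """Heteroscedastic Gaussian negative log-likelihood for one pixel (constant term dropped)."""
    return 0.5 * math.exp(-log_var) * (y - mu) ** 2 + 0.5 * log_var


def grad_wrt_mean(y, mu, log_var):
    """Derivative of gaussian_nll with respect to mu: (mu - y) / sigma^2."""
    return math.exp(-log_var) * (mu - y)


def affine_modulate(feature, uncertainty, scale_weight=1.0, shift_weight=0.0):
    """Hypothetical uncertainty-prompted modulation: f' = (1 + a * u) * f + b * u.

    Higher uncertainty `u` yields a stronger rescaling and shift of the feature,
    so uncertain pixels receive more correction instead of less.
    """
    gamma = 1.0 + scale_weight * uncertainty
    beta = shift_weight * uncertainty
    return gamma * feature + beta
```

For the same residual, quadrupling the predicted variance cuts the gradient on μ to a quarter, which is the mechanism by which a model can down-weight exactly the ambiguous pixels it most needs to learn.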
cs.CV / 167 / 2605.02393
FEAT: Fashion Editing and Try-On from Any Design
FEAT:基于任意设计的时尚编辑与试穿
Abstract
Fashion design aims to express a designer's creative intent and to depict how garments interact with the human body. Recent methods condition on multimodal inputs to support garment editing and virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
Chinese Translation
时尚设计旨在表达设计师的创造意图,并描绘服装与人体的互动。近年来的方法依赖于多模态输入来支持服装编辑和虚拟试穿。然而,现有方法仍然存在以下问题:(i) 将设计局限于与服装相关的图像,排除了艺术作品、抽象图像和自然摄影等创意设计来源;(ii) 无法支持完整的搭配,包括配饰。我们提出了FEAT(Fashion Editing And Try-On from Any Design),这是一种使用多样设计来源进行服装和配饰编辑与试穿的方法。为此,我们引入了解耦双重注入(Disentangled Dual Injection, DDI)。该方法同时处理服装和非服装设计来源,通过内容和风格的解耦选择性地注入设计线索。此外,我们提出了正交引导噪声融合(Orthogonal-Guided Noise Fusion, OGNF),这是一种无需训练的机制,通过正交投影去除残余服装,并应用区域特定的噪声策略,使得虚拟试穿能够适用于服装和配饰。大量实验表明,FEAT在设计灵活性、提示一致性和视觉真实感方面达到了最先进的性能。
cs.CV / 168 / 2605.02417
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit:基于流的图像编辑的逐步精确反演
Abstract
With recent advancements in large-scale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods involve adding noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet effective editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, delivering superior performance that outperforms state-of-the-art methods. Code and examples are available at https://desongyang.github.io/Directedit.
Chinese Translation
随着大规模预训练文本到图像(T2I)模型的最新进展,无需训练的图像编辑方法取得了显著成功。通常,这些方法涉及通过反演过程向干净图像添加噪声,然后在前向过程中为重建和编辑路径分别进行去噪步骤。然而,由于重建路径是使用来自不匹配时间步的噪声潜在变量进行近似的,现有方法不可避免地受到累积漂移的影响,这在根本上限制了重建的真实性。为了解决这一挑战,我们系统地分析了流变换器中的反演过程,并提出了DirectEdit,这是一种简单而有效的编辑方法,能够消除固有的重建误差,而无需引入额外的神经功能评估(NFEs)。与大多数试图纠正反演路径的先前工作不同,DirectEdit专注于直接对齐前向路径,从而实现精确重建和可靠特征共享。此外,我们引入了一种基于注意力特征注入和多分支掩码引导噪声混合的保留机制,有效地平衡了真实性和可编辑性。对多种场景进行的广泛实验表明,DirectEdit实现了高效且准确的图像编辑,展现出优于现有最先进方法的卓越性能。代码和示例可在 https://desongyang.github.io/Directedit 获取。
cs.CV / 169 / 2605.02437
Multi-Rater Calibrated Segmentation Models
多评估者校准分割模型
Abstract
Objective: Accurate probability estimates are essential for the safe deployment of medical image segmentation models in clinical decision-making. However, modern deep segmentation networks are often poorly calibrated, a problem exacerbated when multiple expert annotations exhibit substantial disagreement. While inter-rater variability is typically treated as noise, it provides valuable information about intrinsic annotation ambiguity that must be reflected in model confidence. Methods: We improve the probabilistic calibration of medical image segmentation models by reformulating multi-rater supervision as an ordinal learning problem. Voxel-wise annotator agreement is treated as an ordered target, linking predictive confidence to the empirical variability in training data. This formulation allows the use of ordinal-aware scoring rules, such as the Ranked Probability Score ordinal loss, combined with a standard binary objective to preserve discriminative performance. Results: We evaluated the proposed approach across four public segmentation benchmarks spanning ophthalmology, histopathology, and thoracic imaging. Calibration was assessed using a multi-rater extension of expected calibration error. Results consistently show that ordinal-aware training yields substantially improved calibration with respect to inter-rater agreement without degrading segmentation accuracy. Conclusions: Treating multi-rater annotations as ordered information provides a principled and architecture-agnostic route to more reliable probabilistic segmentation models.
Chinese Translation
目标:准确的概率估计对于在临床决策中安全部署医学图像分割模型至关重要。然而,现代深度分割网络往往校准不足,当多个专家注释存在显著分歧时,这一问题尤为严重。尽管评估者之间的变异性通常被视为噪声,但它提供了关于内在注释模糊性的宝贵信息,这一点必须在模型置信度中反映出来。方法:我们通过将多评估者监督重新表述为一个序数学习问题,改善医学图像分割模型的概率校准。体素级注释者一致性被视为一个有序目标,将预测置信度与训练数据中的经验变异性联系起来。这种表述方式允许使用序数感知评分规则,如排名概率评分(Ranked Probability Score)序数损失,结合标准的二元目标来保持判别性能。结果:我们在涉及眼科学、组织病理学和胸部影像的四个公共分割基准上评估了所提出的方法。使用期望校准误差的多评估者扩展对校准进行了评估。结果一致表明,序数感知训练在提高评估者间一致性上显著改善了校准,同时不降低分割准确性。结论:将多评估者注释视为有序信息提供了一条更可靠的概率分割模型的原则性和架构无关的途径。
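To illustrate the ordinal formulation in this abstract, suppose R raters annotate a voxel and the number who mark it as foreground (0 to R) is treated as an ordered class. The Ranked Probability Score then compares cumulative predicted and cumulative target distributions. The sketch below uses the common unnormalized definition. Whether the authors normalize it, and how they combine it with the binary objective, is not stated in the abstract.

```python
def ranked_probability_score(pred_probs, target_level):
    """Unnormalized RPS for one voxel.

    pred_probs: predicted probabilities over ordered agreement levels 0..R.
    target_level: observed number of raters marking the voxel as foreground.
    Sums squared differences between the predicted and target cumulative distributions.
    """
    cumulative_pred = 0.0
    score = 0.0
    for level, prob in enumerate(pred_probs):
        cumulative_pred += prob
        cumulative_target = 1.0 if level >= target_level else 0.0
        score += (cumulative_pred - cumulative_target) ** 2
    return score
```

Unlike cross-entropy over unordered classes, this score penalizes a prediction of "1 of 3 raters" less than a prediction of "0 of 3 raters" when the true agreement is 2 of 3, so confidence is tied to how far the prediction sits from the observed agreement.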
cs.CV / 170 / 2605.02438
Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection
用于开放集监督异常检测的混合原型流匹配
Abstract
Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.
Chinese Translation
开放集监督异常检测 (OSAD) 旨在利用有限的异常监督识别未见过的异常。然而,现有的基于原型的方法通常通过单模态高斯先验模型正常数据,未能捕捉固有的多模态特性,从而导致模糊的决策边界。为了解决这个问题,我们提出了混合原型流匹配 (MPFM) 框架,该框架学习从正常特征分布到结构化的高斯混合原型空间的连续变换。与传统基于流的方法依赖单一速度向量不同,MPFM 明确地将速度场建模为一个高斯混合先验,其中每个组件对应于一个独特的正常类别。这种设计促进了模式感知和语义一致的分布传输。此外,我们引入了相互信息最大化正则化器 (MIMR) 以防止原型崩溃并最大化正常和异常的可分性。大量实验证明,MPFM 在单一和多重异常设置下的各种基准测试中均实现了最先进的性能。
cs.CV / 171 / 2605.02439
Anomaly-Preference Image Generation
异常偏好图像生成
Abstract
Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that the proposed method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.
Chinese Translation
从有限数据中合成逼真而多样的异常样本对于模型的鲁棒性泛化至关重要。然而,现有方法在保真性和多样性之间难以取得平衡,通常受到分布失调和过拟合的阻碍。为了减轻这一问题,我们提出了异常偏好优化(Anomaly Preference Optimization),一种将异常生成重新表述为偏好学习问题的新范式。我们方法的核心是一个隐式偏好对齐机制,该机制利用真实异常作为正面参考,直接从去噪轨迹偏差中推导优化信号,而无需昂贵的人为标注。此外,我们提出了一个时间感知容量分配模块,该模块能够动态分配模型容量沿着扩散时间线,在高噪声阶段优先考虑结构多样性,而在低噪声阶段提高细粒度的保真性。在推理过程中,分层采样策略调节一致性与对齐的权衡,能够精准控制生成效果。大量实验表明,该方法显著优于现有基线,实现在真实感和多样性方面的最先进性能。
cs.CV / 172 / 2605.02444
M\textsuperscript{4}Fuse: Lightweight State-Space MoE with a Cross-Scale Gating Bridge for Brain Tumor Segmentation
M\textsuperscript{4}Fuse:用于脑肿瘤分割的轻量级状态空间 MoE 与跨尺度门控桥
Abstract
Encoder-decoder imbalance and the reliance on large input volumes make many 3D brain tumor segmentation models both compute-heavy and brittle. We present M\textsuperscript{4}Fuse, a lightweight network that prioritizes discriminative brain tumor cues over exhaustive appearance reconstruction. Our method balances encoder and decoder capacity and replaces depth expansion with a synergistic design: it propagates long-range context with linear complexity via a grouped state space mixer, denoises and aligns skip features using a cross-scale dual-stage gating bridge, and absorbs cross-site acquisition shifts with a sample-level mixture-of-experts. On the BraTS2019 and BraTS2021 benchmarks, M\textsuperscript{4}Fuse outperforms other strong lightweight methods in both parameter count and performance. Even at a challenging input resolution of \(64\times128\times128\) (half that of existing strong models), M\textsuperscript{4}Fuse reduces parameters by 62.63\% and improves average performance by 0.09\%. Ablations of key components validate the method's parameter-to-accuracy efficiency and its robustness across diverse data centers.
Chinese Translation
编码器-解码器的不平衡以及对大输入体积的依赖,使得许多 3D 脑肿瘤分割模型计算量大且脆弱。我们提出了 M\textsuperscript{4}Fuse,这是一种轻量级网络,优先考虑具有判别性的脑肿瘤线索,而不是穷尽式的外观重建。我们的方法平衡了编码器和解码器的容量,并用协同设计替代了深度扩展:通过分组状态空间混合器以线性复杂度传播长距离上下文,使用跨尺度双阶段门控桥对跳跃特征进行去噪和对齐,并通过样本级专家混合吸收跨场地采集偏移。在 BraTS2019 和 BraTS2021 基准测试中,M\textsuperscript{4}Fuse 在参数数量和性能上均超越其他优秀的轻量级方法。即使在具有挑战性的 64×128×128 输入分辨率(现有优秀模型的一半)下,M\textsuperscript{4}Fuse 也将参数减少了 62.63\%,平均性能提升了 0.09\%。关键组件的消融实验验证了该方法的参数-准确率效率及其在不同数据中心间的鲁棒性。
cs.CV / 173 / 2605.02464
ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction
ExpoCM:基于曝光感知的一步生成单幅图像HDR重建
Abstract
Single-image HDR reconstruction aims to recover high dynamic range radiance from a single low dynamic range (LDR) input, but remains highly ill-posed due to detail saturation in over-exposed regions and noise amplification in under-exposed areas. While recent diffusion-based approaches offer powerful generative priors, they often overlook the exposure-dependent nature of the degradation and incur substantial computational costs from iterative sampling. To address these challenges, we propose ExpoCM, a novel one-step generative HDR reconstruction framework that reformulates HDR reconstruction as a Probability Flow ODE (PF-ODE) and constructs exposure-aware consistency trajectories via exposure-dependent perturbations. Specifically, a soft exposure mask is first constructed to separate the LDR image into over-, under-, and well-exposed regions. Based on this partition, region-conditioned consistency trajectories are designed to hallucinate saturated details, suppress noise in dark regions, and preserve reliable structures within a single, distillation-free inference step. To further enhance perceptual quality, we introduce an Exposure-guided Luminance-Chromaticity Loss in the CIE~$\text{L}^*\text{a}^*\text{b}^*$ space, which assigns exposure-aware weights to luminance and chromaticity components, effectively mitigating brightness bias and color drift. Extensive experiments on the HDR-REAL, HDR-EYE, and AIM2025 benchmarks demonstrate that ExpoCM achieves state-of-the-art fidelity and perceptual accuracy, while enabling over 400$\times$ and 20$\times$ faster inference compared to DDPM (1000 steps) and DDIM (50 steps), respectively.
Chinese Translation
单幅图像HDR重建旨在从单个低动态范围(LDR)输入中恢复高动态范围辐射,但由于过曝区域的细节饱和和欠曝区域的噪声放大,仍然存在高度不适定性。尽管近期的基于扩散的方法提供了强大的生成先验,但它们通常忽视了退化的曝光依赖性,并且由于迭代采样而导致显著的计算成本。为了解决这些挑战,我们提出了ExpoCM,一种新颖的一步生成HDR重建框架,将HDR重建重新表述为概率流常微分方程(Probability Flow ODE,PF-ODE),并通过曝光依赖的扰动构建曝光感知的一致性轨迹。具体来说,首先构建软曝光掩膜以将LDR图像分为过曝、欠曝和正常曝光区域。基于该分区,设计了基于区域条件的一致性轨迹,以幻觉化饱和细节,抑制暗区噪声,并在单次无蒸馏推断步骤中保持可靠结构。为了进一步提高感知质量,我们在CIE~$\text{L}^*\text{a}^*\text{b}^*$空间中引入了曝光引导的亮度-色度损失,为亮度和色度分量分配曝光感知权重,有效减轻了亮度偏差和色彩漂移。在HDR-REAL、HDR-EYE和AIM2025基准上的广泛实验表明,ExpoCM实现了最先进的保真度和感知准确性,同时与DDPM(1000步骤)和DDIM(50步骤)相比,实现了超过400倍和20倍的加速推断。
cs.CV / 174 / 2605.02471
Multispectral Blind Image Super-Resolution for Standing Dead Tree Segmentation
用于立木干枯树木分割的多光谱盲图像超分辨率
Abstract
Mapping standing dead trees is crucial for acquiring information on the effects of climate change on forests and forest biodiversity. However, leveraging high-quality aerial imagery for dead tree segmentation poses challenges due to limitations in sensor availability and the scarcity of annotated data. In this study, we propose a generic blind super-resolution framework that incorporates Attention-Guided Domain Adaptation Networks (ADA-Nets) to learn the mapping from low-resolution to high-resolution multispectral image domains. Our approach operates solely on unpaired samples, mimicking real-world conditions, i.e., low-resolution images are not synthetically obtained by downsampling the high-resolution images. Moreover, the proposed method serves as a general-purpose restorer addressing several image degradation types, including saturation, noise, and low contrast that typically occur in low-resolution images acquired by low-end sensors. To the best of our knowledge, this is the first study to perform real-world and generic super-resolution for multispectral data in the scope of standing dead tree segmentation. Experimental evaluations demonstrate segmentation performances of 54% and 64% in Dice scores. Notably, the first result is obtained without using any high-resolution annotations; the segmentation network is trained on super-resolved low-resolution images, while evaluation is performed on the high-resolution data. We publicly share the aerial multispectral dataset with manually annotated labels at https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-dead-tree-segmentation-poland.
Chinese Translation
映射立木干枯树木对于获取气候变化对森林及生物多样性影响的信息至关重要。然而,利用高质量的航拍图像进行干枯树木分割面临着传感器可用性和标注数据稀缺的挑战。在本研究中,我们提出了一种通用的盲超分辨率框架,该框架结合了注意力引导的领域适应网络(Attention-Guided Domain Adaptation Networks, ADA-Nets),以学习从低分辨率到高分辨率多光谱图像域的映射。我们的方法仅在无配对样本上操作,模拟真实世界的条件,即低分辨率图像不是通过对高分辨率图像降采样合成获得的。此外,所提出的方法作为一种通用恢复器,解决多种图像退化类型,包括饱和、噪声和低对比度,这些问题通常出现在由低端传感器获取的低分辨率图像中。据我们所知,这是首个在立木干枯树木分割范围内对多光谱数据进行真实世界和通用超分辨率的研究。实验评估表明,在Dice系数的分割性能上分别达到了54%和64%。值得注意的是,第一个结果是在没有使用任何高分辨率标注的情况下获得的;分割网络是在超分辨率的低分辨率图像上训练的,而评估则是在高分辨率数据上进行的。我们在https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-dead-tree-segmentation-poland公开分享了带有手动标注标签的航拍多光谱数据集。
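Editorial sketch (not from the paper): the Dice score reported in the abstract above measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The minimal Python below works on flattened binary masks. Returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Five pixels: prediction and reference each mark three pixels as dead tree,
# and two of those pixels coincide, so Dice = 2 * 2 / (3 + 3).
pred = [1, 1, 0, 0, 1]
true = [1, 0, 0, 1, 1]
print(dice_score(pred, true))
```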
cs.CV / 175 / 2605.02521
MooD: An Efficient VA-Driven Affective Image Editing Framework via Fine-Grained Semantic Control
MooD:一种高效的基于VA驱动的情感图像编辑框架,通过细粒度语义控制
Abstract
Affective image editing (AIE) aims to edit visual content to evoke target emotions. However, existing methods often overlook inference efficiency and predominantly depend on discrete emotion representations, which limits their practical applicability and makes it challenging to capture complex and subtle human emotions. To tackle these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values for fine-grained and efficient AIE. Specifically, we first introduce a VA-Aware retrieval strategy to bridge vague affective values and concrete visual semantics. Building upon this, MooD integrates visual transfer and semantic guidance to achieve controllable AIE. Furthermore, we construct AffectSet, a VA-annotated dataset to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design. Our code and data will be made publicly available soon.
Chinese Translation
情感图像编辑(Affective Image Editing, AIE)旨在编辑视觉内容以唤起目标情感。然而,现有的方法常常忽视推理效率,主要依赖离散的情感表示,在一定程度上限制了其实用性,并使得捕捉复杂和细微的人类情感变得困难。为了解决这些问题,我们提出了MooD,这是第一个直接利用连续的愉悦度-激活度(Valence-Arousal, VA)值进行细粒度和高效AIE的框架。具体来说,我们首先引入了一种VA感知检索策略,以连接模糊的情感值和具体的视觉语义。在此基础上,MooD融合了视觉转移和语义引导,实现了可控的AIE。此外,我们构建了AffectSet,一个带有VA标注的数据集,以支持模型的优化和评估。广泛的定性和定量实验结果表明,我们的MooD在情感可控性和视觉保真度方面表现出色,同时保持了高效性。一系列消融研究进一步揭示了我们设计中的关键因素。我们的代码和数据将很快公开发布。
cs.CV / 176 / 2605.02558
TemPose-TF-ASF: Two-Stage Bidirectional Stroke Context Fusion for Badminton Stroke Classification
TemPose-TF-ASF:用于羽毛球击球分类的双阶段双向击球上下文融合
Abstract
Accurate badminton stroke prediction is crucial for fine-grained sports analysis and tactical decision support. However, existing methods struggle to model rich temporal context. This paper introduces \emph{TemPose-TF-ASF (Adjacent-Stroke Fusion)}, a context-aware extension of \emph{TemPose}. It enhances stroke recognition by incorporating stroke-type information from both preceding and subsequent strokes. A two-stage training and inference strategy is adopted. Preliminary predictions from the baseline model are reused as estimated temporal context. These predictions guide the joint optimization of the \emph{ASF} module and the classifier. By explicitly modeling bidirectional temporal stroke dependencies, the proposed method can be seamlessly integrated into existing state-of-the-art models. Experiments on a large-scale badminton match dataset show consistent improvements over the baseline and its variants in terms of Accuracy and Macro-F1. Moreover, integrating \emph{ASF} into other advanced methods yields notable performance gains. These results demonstrate strong transferability and generalization capability.
Chinese Translation
准确的羽毛球击球预测对于细致的体育分析和战术决策支持至关重要。然而,现有方法在建模丰富的时间上下文方面存在困难。本文提出了\textit{TemPose-TF-ASF (邻近击球融合)},这是对\textit{TemPose}的上下文感知扩展。它通过结合来自前一个和后一个击球的击球类型信息来增强击球识别。采用双阶段的训练和推理策略。基线模型的初步预测作为估计的时间上下文被重新利用。这些预测指导\textit{ASF}模块和分类器的联合优化。通过显式建模双向时间击球依赖性,所提方法可以无缝集成到现有的先进模型中。在一个大规模的羽毛球比赛数据集上的实验表明,在准确率和宏观F1方面,相较于基线及其变体具有一致的提升。此外,将\textit{ASF}集成到其他先进方法中也获得了显著的性能提升。这些结果展示了良好的迁移性和泛化能力。
cs.CV / 177 / 2605.02563
Low-Latency Embedded Driver Monitoring System with a Multi-Task Neural Network
基于多任务神经网络的低延迟嵌入式驾驶员监控系统
Abstract
Road traffic accidents remain a significant global concern, with the majority attributed to human factors such as driver distraction and fatigue. This study proposes a camera-based approach to derive useful indicators to assess driver attentiveness and alertness. The proposed pipeline jointly satisfies the stringent real-time requirements imposed by the critical application and minimizes the computational requirements to allow for deployment on a tight computational budget. To this end, we develop a lightweight multi-task neural network that predicts multiple indicators for the face region in a single forward pass. The developed model is integrated into a complete execution workflow to produce a real-time estimate of attentiveness, fatigue, and engagement in distracting activities.
Chinese Translation
道路交通事故仍然是一个重大的全球性问题,其中大多数意外归因于驾驶员分心和疲劳等人因素。本研究提出了一种基于摄像头的方法,以获取有用指标来评估驾驶员的注意力和警觉性。所提议的流程同时满足关键应用带来的严格实时要求,并最小化计算需求,以便在紧凑的计算预算上进行部署。为此,我们开发了一种轻量级多任务神经网络,可以在单次前向传递中预测面部区域的多个指标。所开发的模型被集成到完整的执行工作流程中,以实时估计驾驶员的注意力、疲劳程度和分心活动的参与度。
cs.CV / 178 / 2605.02567
Automated In-the-Wild Data Collection for Continual AI Generated Image Detection
用于持续的人工智能生成图像检测的自动化野外数据收集
Abstract
The rapid advancement of generative Artificial Intelligence (AI) has introduced significant challenges for reliable AI-generated image detection. Existing detectors often suffer from performance degradation under distribution shifts and when encountering newly emerging generative models. In this work, we propose a data-centric continual adaptation framework for updating detectors in evolving environments. We show that both in-the-wild data and generator-driven data are essential for adapting detectors. We introduce an automated, weakly supervised pipeline for constructing in-the-wild datasets through fact-check article retrieval. Additionally, we demonstrate that incorporating even a small amount of generator-driven data during training enables effective adaptation to newly emerging models, while combining it with in-the-wild data within a continual learning framework enables robust adaptation and mitigates catastrophic forgetting. Extensive experiments on two state-of-the-art detectors show significant improvements of +9.14% and +8% in average accuracy, respectively.
Chinese Translation
生成性人工智能(AI)的快速发展带来了对可靠的AI生成图像检测的重要挑战。现有的检测器在应对分布转移和新的生成模型时往往表现不佳。在本研究中,我们提出了一种以数据为中心的持续适应框架,旨在更新在不断变化的环境中的检测器。我们展示了野外数据与生成器驱动数据在适应检测器时的重要性。我们介绍了一种通过事实核查文章检索构建野外数据集的自动化、弱监督管道。此外,我们证明在训练过程中即使引入少量生成器驱动数据也能有效适应新兴模型,同时将其与持续学习框架中的野外数据结合使用可以实现稳健的适应并减轻灾难性遗忘。在两个先进检测器上的广泛实验表明,平均准确率分别提高了+9.14%和+8%。
cs.CV / 179 / 2605.02575
Self-Supervised Spatial And Zero-Shot Angular Super-Resolution by Spatial-Angular Implicit Representation For Rotating-View SNR-Efficient Diffusion MRI
自监督空间与零样本角度超分辨率:用于旋转视图的空间-角度隐式表示以提高扩散MRI的信噪比效率
Abstract
Rotating-view thick-slice acquisition is highly SNR-efficient for mesoscale diffusion MRI (dMRI) but requires numerous rotating views to satisfy Nyquist sampling, resulting in long scan time. We propose a self-supervised Spatial-Angular Implicit Neural Representation (SA-INR) that reconstructs high-resolution dMRI from a single view per diffusion direction, representing a massive acceleration. Our model, an MLP conditioned on a b=0 structural prior and the b-direction via FiLM, is trained end-to-end on the anisotropic input. The framework not only accurately reconstructs the trained b-directions (spatial SR) but also learns a continuous q-space representation, enabling high-fidelity "zero-shot" synthesis of unseen b-directions (angular SR). On simulated data, our method achieved high fidelity for both trained (34.82 dB) and unseen (33.08 dB) directions. Most importantly, the synthesized angular data also improved the quantitative accuracy of downstream DTI model fitting. Our SA-INR framework breaks the classical sampling limits, paving the way for fast, quantitative high-resolution dMRI.
Chinese Translation
旋转视图的厚层采集在中尺度扩散MRI (dMRI) 中具有较高的信噪比效率,但需要大量旋转视图以满足Nyquist采样,从而导致扫描时间过长。我们提出了一种自监督的空间-角度隐式神经表示 (SA-INR),能够仅通过每个扩散方向的单一视图重建高分辨率的dMRI,实现了巨大的加速。我们的模型是一种多层感知器 (MLP),通过 FiLM 以 b=0 结构先验和 b 方向为条件,并在各向异性输入上进行端到端训练。该框架不仅能准确重建训练过的b方向(空间超分辨率),还学习到一个连续的q空间表示,从而能够对未见过的b方向进行高保真度的“零样本”合成(角度超分辨率)。在模拟数据上,我们的方法在训练方向(34.82 dB)和未见方向(33.08 dB)上均达到了高保真度。最重要的是,合成的角度数据还提高了下游DTI模型拟合的定量准确性。我们的SA-INR框架突破了经典采样限制,为快速、定量、高分辨率的dMRI开辟了道路。
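Editorial sketch (not from the paper): FiLM conditioning, as named in the abstract above, scales and shifts each hidden feature with values predicted from a conditioning vector, h' = gamma(c) * h + beta(c). The minimal Python below assumes the conditioning vector is a unit b-direction and that gamma and beta come from single linear layers. The layer sizes and weights are made up for illustration; the abstract does not describe the actual network or how the b=0 prior enters it.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of `features` by the vector `cond`."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-feature scale
    beta = linear(w_beta, b_beta, cond)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Hypothetical two-feature hidden state, conditioned on the direction (0, 0, 1).
hidden = [1.0, 2.0]
b_direction = [0.0, 0.0, 1.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]
b_beta = [0.0, -1.0]

# gamma = [1.0, 3.0] and beta = [0.5, -1.0], so the result is [1.5, 5.0].
print(film(hidden, b_direction, w_gamma, b_gamma, w_beta, b_beta))
```

Because the direction only enters through gamma and beta, the same network can be queried at a direction it never saw during training, which is the mechanism the abstract relies on for angular super-resolution.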
cs.CV / 180 / 2605.02580
Hyp2Former: Hierarchy-Aware Hyperbolic Embeddings for Open-Set Panoptic Segmentation
Hyp2Former:用于开放集全景分割的层次感知双曲嵌入
Abstract
Recognizing unknown objects is crucial for safety-critical applications such as autonomous driving and robotics. Open-Set Panoptic Segmentation (OPS) aims to segment known thing and stuff classes while identifying valid unknown objects as separate instances. Prior OPS approaches largely treat known categories as a flat label set, ignoring the semantic hierarchy that provides valuable structural priors for distinguishing unknown objects from in-distribution classes. In this work, we propose Hyp2Former, an end-to-end framework for OPS that does not require explicit modeling of unknowns during training, and instead learns hierarchical semantic similarities continuously in hyperbolic space. By explicitly encoding hierarchical relationships among known categories, the model learns a structured embedding space that captures multiple levels of semantic abstraction. As a result, unknown objects that cannot be confidently classified as known categories still remain in close proximity to higher-level concepts (e.g., an unknown animal remains closer to "animal" or "object" than to unrelated concepts such as "electronics" or "stuff") and can therefore be reliably detected, even if their fine-grained category was not represented during training. Empirical evaluations across multiple public datasets such as MS COCO, Cityscapes, and Lost&Found demonstrate that Hyp2Former outperforms existing methods on OPS, achieving the best balance between unknown object discovery and in-distribution robustness.
Chinese Translation
识别未知物体对安全关键应用(如自动驾驶和机器人技术)至关重要。开放集全景分割(Open-Set Panoptic Segmentation, OPS)旨在对已知的物体(thing)类和背景(stuff)类进行分割,同时将有效的未知物体识别为单独的实例。以往的OPS方法主要将已知类别视为平面标签集,忽略了语义层次,而语义层次能够为区分未知物体与分布内类别提供宝贵的结构先验。在本研究中,我们提出了Hyp2Former,这是一个端到端的OPS框架,在训练过程中不需要显式建模未知物体,而是在双曲空间中连续地学习层次语义相似性。通过显式编码已知类别之间的层次关系,该模型学习了一个结构化的嵌入空间,捕捉多个级别的语义抽象。因此,无法被自信地归入已知类别的未知物体仍然靠近更高层次的概念(例如,未知动物更接近“动物”或“物体”,而不是“电子产品”或“背景”等无关概念),从而即使其细粒度类别在训练中未出现,也能够被可靠地检测出来。在多个公共数据集(如MS COCO、Cityscapes和Lost&Found)上的实证评估表明,Hyp2Former在OPS任务上优于现有方法,在未知物体发现与分布内鲁棒性之间取得了最佳平衡。
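Editorial sketch (not from the paper): the abstract above argues that in hyperbolic space an unknown object can stay close to a coarse concept even when it matches no fine-grained class. The minimal Python below uses the Poincaré ball distance, d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))). The choice of the Poincaré model and the 2D embeddings are assumptions made for illustration; the abstract does not say which hyperbolic model or dimensionality Hyp2Former uses.

```python
import math


def squared_norm(x):
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    nu, nv = squared_norm(u), squared_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = squared_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))


# Hypothetical embeddings: coarse concepts sit nearer the origin,
# fine-grained or unknown items sit nearer the boundary.
animal = (0.3, 0.0)
electronics = (-0.3, 0.0)
unknown = (0.7, -0.2)

# The unknown point is closer to "animal" than to "electronics".
print(poincare_distance(unknown, animal))
print(poincare_distance(unknown, electronics))
```

A detector built on this idea could flag a region as unknown when it is far from every fine-grained class prototype yet near a parent concept, though the decision rule used in the paper is not given in the abstract.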
cs.CV / 181 / 2605.02583
Stylistic Attribute Control in Latent Diffusion Models
潜在扩散模型中的风格属性控制
Abstract
Text-to-image diffusion models have revolutionized image synthesis and editing, but precise control over stylistic attributes remains a challenge, often causing unintended content modifications. We propose an approach for fine-grained parametric control of stylistic attributes in latent diffusion models by learning disentangled editing directions from synthetic datasets. We use guidance composition to close the domain gap between stylistically finetuned and foundation models, preserving the original image semantics while applying stylistic adjustments. To ensure consistent edits, we introduce a training regularization loss and enhance DDIM inversion with optimized null-conditional embeddings for real image editing. We validate our approach by learning from stylistically filtered synthetic datasets varying a range of stylistic attributes, including outlines, local contrast, watercolorization effects, and geometric patterns. Our evaluations demonstrate that compared to current text-based editing techniques, our method offers well-integrated, more precise and continuously adjustable stylistic modifications.
Chinese Translation
文本到图像的扩散模型彻底改变了图像合成和编辑,但对风格属性的精确控制仍然是一个挑战,常常导致意外的内容修改。我们提出了一种通过从合成数据集中学习解耦编辑方向来实现潜在扩散模型中风格属性的细粒度参数控制的方法。我们使用引导组合来缩小风格微调模型与基础模型之间的领域差距,同时保留原始图像的语义并应用风格调整。为了确保一致的编辑,我们引入了一种训练正则化损失,并通过优化的零条件嵌入增强了DDIM反转,适用于真实图像编辑。我们通过从风格过滤的合成数据集中学习,验证了我们的方法,在不同风格属性范围内进行变换,包括轮廓、局部对比度、水彩效果和几何图案。我们的评估表明,与当前基于文本的编辑技术相比,基于我们的方法提供了更完美的集成、更精确且可持续调整的风格修改。
cs.CV / 182 / 2605.02586
StableMind: Source-Free Cross-Subject fMRI Decoding with Regularized Adaptation
StableMind:无源跨被试功能性磁共振成像解码与正则化适应
Abstract
Existing cross-subject fMRI decoding methods typically train a model on multiple scanned subjects and then adapt it to a new subject using substantial paired fMRI-image data. However, in realistic scenarios, new-subject fMRI data are often limited due to costly data acquisition, and raw data from previous subjects may be inaccessible, leading existing methods to suffer performance degradation during new-subject adaptation. In this paper, we identify that this degradation stems from two key issues: brain-side instability caused by large subject differences in fMRI responses, and image-side supervision unreliability caused by fine-grained visual details that are not reliably supported by limited fMRI signals. To address these challenges, we propose StableMind, a regularized adaptation framework designed to improve brain-side representation stability and image-side supervision reliability. (1) To stabilize brain representations, StableMind reuses ridge projections from the pretrained model as adaptation priors to constrain limited-data new-subject adaptation, and applies Fourier-based feature-level brain augmentation to improve robustness to individual variability. (2) To improve image supervision reliability, StableMind introduces difficulty-aware image blur for brain-image alignment, reducing the influence of fine-grained visual details that are weakly supported by limited fMRI signals while preserving stable visual structure. Experiments on the Natural Scenes Dataset under a unified 1-hour adaptation protocol demonstrate that StableMind achieves 84.02% image retrieval accuracy and 81.66% brain retrieval accuracy averaged over four subjects, surpassing the state-of-the-art method by 5.71% brain retrieval accuracy with fewer trainable adaptation parameters. Our code is available at https://github.com/lingeringlight/StableMind.
Chinese Translation
现有的跨被试功能性磁共振成像解码方法通常在多个被扫描的受试者上训练模型,然后使用大量配对的fMRI-图像数据将其适应到新受试者。然而,在现实情况下,新受试者的fMRI数据通常因数据采集成本高而受到限制,而以前受试者的原始数据可能无法访问,导致现有方法在适应新受试者时性能下降。在本文中,我们发现这种性能下降源于两个主要问题:由于受试者在fMRI反应上的巨大差异导致的大脑端不稳定性,以及由于细粒度视觉细节无法得到有限fMRI信号的可靠支持而造成的图像端监督不可靠性。为了解决这些挑战,我们提出了StableMind,一个旨在提高大脑端表征稳定性和图像端监督可靠性的正则化适应框架。(1)为了稳定大脑表征,StableMind重用预训练模型的岭回归投影作为适应先验,以约束有限数据下的新受试者适应,并应用基于傅里叶的特征级大脑增强,以提高对个体变异性的鲁棒性。(2)为了提高图像监督的可靠性,StableMind引入了难度感知的图像模糊,用于大脑与图像的对齐,减少那些仅得到有限fMRI信号微弱支持的细粒度视觉细节的影响,同时保留稳定的视觉结构。在统一的1小时适应协议下,自然场景数据集上的实验表明,StableMind在四名受试者上平均达到了84.02%的图像检索准确率和81.66%的大脑检索准确率,在大脑检索准确率上比当前最先进的方法高出5.71%,且可训练的适应参数更少。我们的代码可在 https://github.com/lingeringlight/StableMind 获取。
cs.CV / 183 / 2605.02589
Representation learning from OCT images
基于OCT图像的表示学习
Abstract
Optical Coherence Tomography (OCT) has become one of the most widely used imaging modalities in ophthalmology. It provides high-resolution, non-invasive visualization of retinal microarchitecture. The automated analysis of OCT images through representation learning has emerged as a central research frontier, driven mainly by the clinical need to process large acquisition volumes. The objective is to reduce the reliance on expert annotation and improve diagnostic consistency across devices and populations. This survey provides a comprehensive and structured review of representation learning methods for retinal OCT image analysis. It covers the period from early deep learning approaches to the most recent developments in foundation models and vision-language systems. We organize the literature along a principled taxonomy of learning paradigms, encompassing supervised learning with CNN-based and transformer-based architectures, self-supervised and semi-supervised methods, generative approaches, as well as 3D volumetric modeling, multimodal representation learning, and large-scale pretrained foundation models. For each paradigm, we analyze the core methodological contributions, identify persistent limitations, and trace the connections between successive approaches. We further provide a structured overview of publicly available OCT datasets, discuss evaluation protocol considerations, and present a unified problem formulation that situates each learning paradigm within a common mathematical framework. Building on this analysis, we identify and discuss the most pressing open research directions emerging in the literature. This includes volumetric foundation model pretraining, uncertainty-aware representation learning, federated and privacy-preserving training, fairness and bias mitigation, concept-based interpretability,...
Chinese Translation
光学相干断层成像(OCT)已成为眼科中使用最广泛的成像方式之一。它提供了高分辨率、非侵入性的视网膜微结构可视化。通过表示学习对OCT图像进行自动化分析已成为一个核心研究前沿。这主要是由临床需求驱动,目的是处理大量的获取数据。目标是减少对专家注释的依赖,提高各设备和人群之间的诊断一致性。本次调查提供了一个全面且结构化的关于视网膜OCT图像分析的表示学习方法的回顾,涵盖了从早期深度学习方法到最新的基础模型和视觉-语言系统的发展。我们根据原则性学习范式的分类,对文献进行组织,包括基于卷积神经网络(CNN)和变换器(transformer)架构的监督学习、自监督和半监督方法、生成方法,以及3D体积建模、多模态表示学习和大规模预训练的基础模型。对于每种范式,我们分析了核心方法论贡献,识别出持续存在的局限性,并追溯各连续方法之间的联系。我们进一步提供了公开可用的OCT数据集的结构化概述,讨论评估协议的考虑因素,并呈现一个统一的问题表述,将每种学习范式置于一个共同的数学框架内。在此分析的基础上,我们识别并讨论了文献中出现的最紧迫的开放研究方向,包括体积基础模型的预训练、考虑不确定性的表示学习、联邦和隐私保护训练、公平性和偏见缓解、基于概念的可解释性等。
cs.CV / 184 / 2605.02604
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
重新思考源模型的必要性:基于视觉-语言模型引导的从头开始的无源领域适应
Abstract
Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have introduced Vision-Language (ViL) models to guide the adaptation process. In these methods, we observe that for the same target domain, different source models yield minimal variation in final results, indicating that the source model itself has limited impact. Motivated by this, we propose ViL-Only Domain Adaptation (VODA), a stricter setting that eliminates all dependencies on the source domain, relying solely on a randomly initialized model, a ViL model, and unlabeled target data. We analyze the adaptation dynamics of VODA and introduce Two-Stage Denoised-Region Distillation (TS-DRD), a two-stage framework that first warms up the model with ViL guidance and then seeks a Denoised-Region inherent in both the ViL and adapting models, yielding cleaner supervision for distillation. Experiments on Office-Home, VisDA, and DomainNet-126 show that under VODA, TS-DRD achieves competitive or superior performance to existing SFDA methods that still use source models, demonstrating its effectiveness and the potential of the VODA setting.
Chinese Translation
无源领域适应(Source-Free Domain Adaptation, SFDA)在不接触源数据的情况下将源模型适应于目标领域,解决了隐私和传输问题。然而,现有方法仍然是从源预训练模型初始化,因此并不是真正的无源。最近的研究引入了视觉-语言(Vision-Language, ViL)模型来指导适应过程,在这些方法中,我们观察到对于相同的目标领域,不同的源模型在最终结果上产生的变化极小,表明源模型本身的影响有限。基于此,我们提出了仅基于ViL的领域适应(ViL-Only Domain Adaptation, VODA),这是一种更严格的设定,消除了对源领域的所有依赖,仅依赖一个随机初始化的模型、一个ViL模型和未标记的目标数据。我们分析了VODA的适应动态,并引入了双阶段去噪区域蒸馏(Two-Stage Denoised-Region Distillation, TS-DRD),这是一种双阶段框架,首先用ViL指导模型预热,然后寻找ViL和适应模型中固有的去噪区域,从而为蒸馏提供更清晰的监督。在Office-Home、VisDA和DomainNet-126上的实验表明,在VODA设定下,TS-DRD在性能上达到了与仍使用源模型的现有SFDA方法竞争或更优的效果,验证了其有效性以及VODA设定的潜力。
cs.CV / 185 / 2605.02614
Validation of an AI-based end-to-end model for prostate pathology using long-term archived routine samples
基于人工智能的前列腺病理学端到端模型的验证:使用长期存档常规样本
Abstract
Artificial intelligence (AI) is becoming a clinical tool for prostate pathology, but generalization across variations in sample preparation and preservation over prolonged time periods remains poorly understood. We evaluated GleasonAI, an end-to-end attention-based multiple instance learning model, on an independent validation cohort comprising 10,366 biopsy cores from 1,028 patients across 14 Swedish regions, using archival diagnostic specimens from the ProMort cohorts collected between 1998 and 2015. The model achieved an overall quadratic-weighted kappa of 0.86 for core-level ISUP grading, comparable to several experienced pathologists and consistent across geographic regions. Notably, performance remained stable across the 17-year collection period, demonstrating robustness to time-related variation in archival material, a property not consistently observed with foundation model-based approaches. Exploratory analysis further demonstrated a significant prognostic gradient across AI-assigned grade groups for prostate cancer-specific mortality. These findings support the generalizability of the AI grading model and demonstrate the potential of pathology archives as a large-scale resource for AI development, validation, and retrospective prognostic research.
Chinese Translation
人工智能(AI)正成为前列腺病理学中的一种临床工具,但其在长时间跨度内对样本制备和保存差异的泛化能力仍然不甚明了。我们评估了GleasonAI,这是一个基于注意力的端到端多实例学习模型,在一个独立的验证队列中进行了测试,该队列包括来自1028名患者的10,366个活检样本,分布于14个瑞典地区,使用的是1998年至2015年期间收集的ProMort队列档案诊断标本。该模型在核心级别的ISUP分级上达到了0.86的总体二次加权kappa值,与多位经验丰富的病理学家相当,并在各地理区域间保持一致。值得注意的是,模型在17年的收集周期内保持了稳定的性能,显示出对档案材料时间相关变异的鲁棒性,这种特性在基于基础模型的方法中并不总是能观察到。探索性分析还显示,在AI赋值的分级组之间,前列腺癌特异性死亡率存在显著的预后梯度。这些发现支持了AI分级模型的泛化能力,并展示了病理档案作为AI开发、验证和回顾性预后研究的大规模资源的潜力。
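Editorial sketch (not from the paper): the quadratic-weighted kappa reported in the abstract above penalizes each disagreement between two graders by the squared distance between their grades, then normalizes by the penalty expected from chance. The minimal Python below implements the standard definition with weights (i - j)^2 / (K - 1)^2. The toy grades are invented, and the study's exact evaluation protocol is not described in the abstract.

```python
def quadratic_weighted_kappa(labels_a, labels_b, num_classes):
    """Cohen's kappa with quadratic weights for integer labels 0..num_classes-1."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(labels_a, labels_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]        # row totals
    hist_b = [sum(col) for col in zip(*observed)]  # column totals
    max_sq = (num_classes - 1) ** 2
    disagreement = expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / max_sq
            disagreement += weight * observed[i][j]
            expected += weight * hist_a[i] * hist_b[j] / n
    if expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single class")
    return 1.0 - disagreement / expected


# Four cores graded 0..3 by a reference reader. A model that is off by one
# grade on the last core scores far higher than one that is off by three.
reference = [0, 1, 2, 3]
print(quadratic_weighted_kappa(reference, [0, 1, 2, 2], 4))  # 0.875
print(quadratic_weighted_kappa(reference, [0, 1, 2, 0], 4))  # about 0.1
```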
cs.CV / 186 / 2605.02616
Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection
利用适配器引导的SAMv2进行显著性物体检测的全局-局部特征解码
Abstract
Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.
Chinese Translation
显著性物体检测(SOD)在大规模视觉模型时代仍然是一个重要但未被充分探索的任务。尽管像SAM这样的基础模型表现出强大的泛化能力,但它们在SOD中的潜力尚未得到充分发挥,并且在有限数据下训练或全面微调这些模型的计算成本较高,且容易出现过拟合。为了解决这些挑战,我们引入了GLASSNet,一个全局-局部特征解码框架,它使用SAMv2作为冻结的编码器,并配备轻量级的空间感知卷积适配器,从而将可学习的编码器参数减少超过97%。为了增强显著性质量,GLASSNet采用了双解码器架构:一个解码器捕获具有扩展感受野的全局长范围语义,而另一个则捕获细致的局部细节,如边缘和纹理。融合这些互补线索生成的显著性图将全局一致性与局部精度相结合,生成准确的最终掩模。在标准SOD和伪装物体检测基准上进行的广泛实验表明,GLASSNet超越了最新的研究方法,展示了冻结基础模型与针对性适配和全局-局部解码相结合的强大能力。
cs.CV / 187 / 2605.02623
Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval
检索任何相关时刻:通用时刻检索的基准与模型
Abstract
Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.
Chinese Translation
视频时刻检索(Video Moment Retrieval,VMR)旨在定位与自然语言查询相对应的视频中的时间段,但通常假设每个查询仅对应一个匹配的时刻。这一假设在现实场景中并不总是成立,因为查询可能对应于多个或没有时刻。因此,我们提出了通用时刻检索(Generalized Moment Retrieval,GMR)的概念,这是一种统一的设置,要求检索完整的相关时刻集合或预测一个空集合。为了系统地研究GMR,我们引入了Soccer-GMR,这是一个基于具有挑战性的足球视频构建的大规模基准,反映了一般GMR场景,并包含现实的负查询和正查询。该基准通过一个灵活时长的半自动化流程构建,并经过人工验证,从而实现了可扩展的数据生成,同时保持高注释质量。我们进一步设计了一个统一的评估协议,配备了适用于空集合拒绝、正查询定位和端到端GMR性能的补充指标。最后,我们在两种建模范式下建立了强基线:一种为用于区分性VMR模型的轻量级即插即用GMR适配器,另一种为用于微调多模态大型语言模型(Multimodal Large Language Models, MLLMs)的GMR定制GRPO奖励。大量实验证明,在所有指标上都有一致的提升,并揭示了当前方法的关键局限性,使GMR成为视频语言理解中更真实且更具挑战性的基准。
cs.CV / 188 / 2605.02627
Rethinking Low-Light Image Enhancement: A Log-Domain Intensity--Chromaticity Decoupling Perspective
重新思考低光照图像增强:一种对数域强度—色度解耦视角
Abstract
Explicit reconstruction constraints derived from the decoupled representation are further imposed to suppress abnormal channel amplification and chromatic noise. Experiments on LOLv2-Real, MIT-Adobe FiveK, and LSRW show that the proposed method achieves competitive or superior quantitative and visual performance, reaching 29.71 dB PSNR and 0.89 SSIM on LOLv2-Real. DarkFace experiments further indicate improved downstream face detection under low-light conditions. Code and pretrained models are available at: https://github.com/mubaisam/ICD.
Chinese Translation
明确的重建约束源自解耦表示,进一步被施加以抑制异常通道放大和色彩噪声。在LOLv2-Real、MIT-Adobe FiveK和LSRW上的实验表明,所提方法在定量和视觉性能上实现了具有竞争力或优越的结果,在LOLv2-Real上达到了29.71 dB的峰值信噪比(PSNR)和0.89的结构相似性指数(SSIM)。DarkFace实验进一步表明在低光照条件下人脸检测有了改善。代码和预训练模型可在以下链接获取:https://github.com/mubaisam/ICD。
cs.CV / 189 / 2605.02630
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
AutoFocus:不确定性感知的主动视觉搜索用于图形用户界面定位
Abstract
Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic Gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.
Chinese Translation
视觉-语言模型(VLMs)使得自主 GUI 代理能够将自然语言指令转化为可执行的屏幕坐标。然而,在高分辨率界面中,定位性能会下降,因为密集布局和较小的交互元素暴露了现代显示器与模型输入约束之间的分辨率差距。现有的放大策略依赖于固定锚点、启发式网格或强化学习,缺乏一种原则性机制来自适应地确定需要进行精细调整的位置以及需要探索多少空间不确定性。我们提出了 AutoFocus,这是一种无训练、关注不确定性的主动视觉搜索框架,用于 GUI 定位。我们的关键见解是坐标生成中的令牌级困惑度自然反映了空间不确定性。AutoFocus 并非只做出单一预测,而是采样多个坐标假设,并将它们的轴向困惑度转换为各向异性高斯空间概率场,明确地建模方向性不确定性。基于该场,我们生成全局和局部区域建议,并引入形状感知放大(Shape-Aware Zooming),以平衡紧致定位与上下文保留。随后,基于视觉提示的聚合步骤通过结构化比较选择最一致的预测。针对 ScreenSpot-Pro 和 ScreenSpot-V2 的大量实验表明,在通用和 GUI 专业 VLMs 上均取得了一致的改进。
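The AutoFocus abstract describes turning per-axis perplexities of sampled coordinates into an anisotropic Gaussian field. A minimal sketch of that idea follows. The mapping from perplexity to standard deviation (`axis_sigma`, with its `scale` and `floor` parameters), the averaging over hypotheses, and every function name are assumptions made for illustration; the abstract does not give the paper's actual formulation.

```python
import math

def axis_sigma(perplexity, scale=20.0, floor=2.0):
    """Hypothetical mapping: higher perplexity -> wider spread (in pixels) on that axis."""
    return max(floor, scale * math.log(max(perplexity, 1.0)))

def field_value(x, y, hypotheses):
    """Mean of axis-aligned anisotropic Gaussians, one per sampled hypothesis.

    Each hypothesis is (cx, cy, ppl_x, ppl_y): a predicted point plus the
    perplexity of its x and y coordinate tokens.
    """
    total = 0.0
    for cx, cy, ppl_x, ppl_y in hypotheses:
        sx, sy = axis_sigma(ppl_x), axis_sigma(ppl_y)
        zx, zy = (x - cx) / sx, (y - cy) / sy
        total += math.exp(-0.5 * (zx * zx + zy * zy)) / (2.0 * math.pi * sx * sy)
    return total / len(hypotheses)

# Two illustrative hypotheses that are confident in x but uncertain in y.
hypotheses = [(100.0, 50.0, 1.2, 3.0), (104.0, 60.0, 1.1, 2.5)]
```

Because the y-axis perplexities are larger here, the field decays more slowly along y than along x, which is the kind of directional uncertainty a zoom region could be shaped around.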
cs.CV / 190 / 2605.02638
ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking
ViewSAM:针对弱监督跨视角指向多目标跟踪的视图感知交叉模态语义学习
Abstract
Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models. However, our empirical study shows that directly applying foundation models such as SAM2 and SAM3, even with task-specific modifications, fails to accurately understand referring expressions and maintain consistent identities across views. Yet, they remain effective at producing reliable object tracklets that can serve as pseudo supervision. We therefore repurpose foundation models as pseudo-label generators and propose a two-stage framework for weakly supervised CRMOT, using only object category labels as coarse-grained supervision. In the first stage, we design an Affinity-guided Cross-view Re-prompting strategy to refine and associate SAM3-generated tracklets across cameras, producing reliable cross-view pseudo labels for subsequent training. In the second stage, we introduce ViewSAM, a CRMOT model built upon SAM2 that explicitly models view-aware cross-modal semantics. By formulating view-induced variations as learnable conditions, ViewSAM bridges the gap between view-variant visual observations and view-invariant textual expressions, enabling robust cross-view referring tracking with only approximately 10% additional parameters. Extensive experiments demonstrate that ViewSAM achieves SOTA performance under weak supervision and remains competitive with fully supervised methods.
Chinese Translation
跨视角指向多目标跟踪(CRMOT)旨在跨多个摄像头视角跟踪由自然语言指定的多个对象,并确保身份的一致性。尽管最近取得了一些进展,但现有的方法严重依赖昂贵的帧级空间标注和跨视角身份监督。为了减少这种依赖,我们探讨了在弱监督下的CRMOT,通过利用基础模型的能力。然而,我们的实证研究表明,直接应用如SAM2和SAM3等基础模型,即使进行了特定任务的修改,也无法准确理解指代表达和保持跨视角的一致身份。然而,它们在生成可以作为伪监督的可靠对象追踪段方面仍然有效。因此,我们将基础模型重新用途作为伪标签生成器,并提出了一种用于弱监督CRMOT的两阶段框架,仅使用对象类别标签作为粗粒度监督。在第一阶段,我们设计了一种基于相似度的跨视角重提示策略,以精细化并关联SAM3生成的跨摄像头追踪段,产生可靠的跨视角伪标签以供后续训练。在第二阶段,我们引入了ViewSAM,这是一个基于SAM2构建的CRMOT模型,明确建模视图感知的交叉模态语义。通过将视图引起的变化形式化为可学习的条件,ViewSAM弥合了视图变异视觉观察与视图不变文本表达之间的差距,实现了具有约10%额外参数的稳健跨视角指向跟踪。大量实验表明,ViewSAM在弱监督下实现了SOTA(最先进技术)性能,并与完全监督的方法保持竞争力。
cs.CV / 191 / 2605.02641
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5:通过 DiT-MoE 增强统一多模态模型
Abstract
We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including Kling O1 on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in an internal advertising video editing scenario.
Chinese Translation
我们提出了 Mamoda2.5,这是一种统一的 AR-Diffusion 框架,能够在单一架构中无缝集成多模态理解与生成。为了高效提升模型的生成能力,我们为 Diffusion Transformer 主干配备了细粒度的专家混合(Mixture-of-Experts, MoE)设计(128 个专家,Top-8 路由),从而实现一个 25B 参数的模型,该模型仅激活 3B 参数,从而显著降低训练成本,同时扩大模型容量。Mamoda2.5 在 VBench 2.0 上实现了顶级的生成性能,并在视频编辑质量上创下新记录,超越了所评估的开源模型,匹配了当前顶级专有模型的性能,包括在 OpenVE-Bench 上的 Kling O1。此外,我们介绍了一种联合少步蒸馏与强化学习框架,能够将 30 步编辑模型压缩为 4 步模型,并大大加速模型推理。与开源基准相比,Mamoda2.5 实现了高达 95.9 倍的更快视频编辑推理。在实际应用中,Mamoda2.5 已成功应用于广告场景中的内容审查和创造性恢复任务,在内部广告视频编辑场景中实现了 98% 的成功率。
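To make the "128 experts, Top-8 routing" configuration concrete, here is a minimal sketch of Top-K expert selection for a single token. It assumes the router emits one logit per expert and that gate weights are a softmax over the selected logits only; the abstract does not state which normalization Mamoda2.5 uses, and `top_k_route` is a hypothetical name.

```python
import math
import random

def top_k_route(logits, k=8):
    """Return (expert_index, weight) pairs for the k largest router logits.

    Weights are a softmax restricted to the selected experts (an assumption).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)           # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # one token, 128 experts
routes = top_k_route(router_logits, k=8)
```

Only the 8 selected experts would run for this token, which is how a sparse model keeps its active parameter count well below its total parameter count.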
cs.CV / 192 / 2605.02659
Human Activity Recognition Method for Moderate Violence Detection
适度暴力检测的人类活动识别方法
Abstract
Physical violence in public spaces is a significant public health concern, with minor incidents such as pushing often serving as precursors to more severe escalations. This research develops an automated system for the real-time detection of moderate physical violence, specifically pushing, in surveillance camera footage. The proposed solution integrates state-of-the-art computer vision models, utilizing YOLO11 and YOLO11-Pose for human detection and skeletal keypoint extraction. By calculating body inclination and joint angles between shoulders and hips, a Random Forest classifier was trained to distinguish between normal behavior and aggressive physical contact. The system's performance was evaluated through three progressive case studies representing increasing levels of difficulty. In controlled environments with frontal views, the model achieved a precision of 0.98. In the most challenging scenario, featuring high-altitude, steep-angle recordings from real-world surveillance infrastructure, the system maintained a precision of 0.72 despite significant perspective distortion and visual noise. These results demonstrate the feasibility of using skeletal analysis for early violence intervention in urban security contexts.
Chinese Translation
公共场所的身体暴力是一个重要的公共卫生问题,轻微事件如推搡往往是更严重冲突的前兆。本研究开发了一种针对监控摄像头录像中适度身体暴力(特别是推搡)的实时自动检测系统。所提方案整合了先进的计算机视觉模型,利用YOLO11和YOLO11-Pose进行人类检测和骨骼关键点提取。通过计算身体倾斜度和肩部与髋部之间的关节角度,训练了一个随机森林(Random Forest)分类器,以区分正常行为和侵略性身体接触。系统的性能通过三个逐步递进的案例研究进行了评估,代表了不同的难度级别。在前视的受控环境中,该模型达到了0.98的精确度。在最具挑战性的场景中,面对来自真实监控基础设施的高高度、陡角录制,即使存在显著的视角扭曲和视觉噪声,该系统仍然保持了0.72的精确度。这些结果证明了在城市安全环境中使用骨骼分析进行早期暴力干预的可行性。
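The features named in the abstract (body inclination and joint angles around shoulders and hips) can be computed from 2D keypoints with basic trigonometry. The sketch below assumes keypoints are (x, y) pixel pairs with y growing downward and that inclination is measured against the image vertical; the function names and these conventions are illustrative, not taken from the paper.

```python
import math

def midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

def trunk_inclination_deg(l_sh, r_sh, l_hip, r_hip):
    """Angle between the hip-to-shoulder axis and the image vertical, in degrees."""
    sx, sy = midpoint(l_sh, r_sh)
    hx, hy = midpoint(l_hip, r_hip)
    dx, dy = sx - hx, hy - sy  # image y grows downward, so flip it
    return math.degrees(math.atan2(abs(dx), dy))

def joint_angle_deg(a, b, c):
    """Angle at vertex b formed by points a-b-c, in degrees (points must be distinct)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
```

Per-person feature vectors built this way could then be passed to a tree ensemble such as scikit-learn's `RandomForestClassifier`; the exact feature set used in the study is not specified in the abstract.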
cs.CV / 193 / 2605.02707
SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT
SAIL:用于光学相干断层扫描中解剖对齐后置解释的结构感知可解释学习
Abstract
Optical coherence tomography (OCT), a commonly used retinal imaging modality, plays a central role in retinal disease diagnosis by providing high-resolution visualization of retinal layers. While deep learning (DL) has achieved expert-level accuracy in OCT-based retinal disease detection, its "black box" nature poses challenges for clinical adoption, where explainability is essential for clinical trust and regulatory approval. Existing post-hoc explainable AI (XAI) methods often struggle to delineate fine-grained lesion structures, respect anatomical boundaries, or suppress noise, limiting the trustworthiness of their explanations. To bridge these gaps, we propose a Structure-Aware Interpretable Learning (SAIL) framework that integrates retinal anatomical priors at the representation level and couples them with semantic features via a fusion design. Without modifying standard post-hoc explainability methods, this representation yields sharper and more anatomically aligned attribution maps. Comprehensive experiments on diverse OCT datasets demonstrate that our structure-aware method consistently enhances interpretability, producing clinically meaningful and anatomy-aware explanations. Ablation studies further show that strong interpretability requires both structural priors and semantic features, and that properly fusing the two is critical to achieve the best explanation quality. Together, these results highlight structure-aware representations as a key step toward reliable explainability in OCT.
Chinese Translation
光学相干断层扫描(OCT)是一种常用的视网膜成像技术,在视网膜疾病诊断中起着核心作用,通过提供高分辨率的视网膜层可视化。尽管深度学习(DL)在基于OCT的视网膜疾病检测中达到了专家级的准确性,但其“黑箱”特性在临床应用中带来了挑战,因此可解释性对临床信任和监管审批至关重要。现有的后置可解释人工智能(XAI)方法往往难以清晰描绘细微的病变结构,遵循解剖边界或抑制噪声,限制了其解释的可信度。为了解决这些问题,我们提出了一种结构感知可解释学习(SAIL)框架,该框架在表征层面整合了视网膜解剖先验,并通过融合设计将其与语义特征相耦合。在不修改标准后置可解释性方法的情况下,这种表征产生了更清晰、更加解剖对齐的归因图。对多样OCT数据集的全面实验表明,我们的结构感知方法始终增强了解释性,生成具有临床意义和解剖意识的解释。消融研究进一步表明,强可解释性同时需要结构先验和语义特征,并且两者的正确融合对于实现最佳解释质量至关重要。总的来说,这些结果强调了结构感知表征是实现OCT中可靠可解释性的关键一步。
cs.CV / 194 / 2605.02714
OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis
OphMAE:利用基础模型将体积成像与平面成像相结合以实现自适应眼科诊断
Abstract
The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to synergize the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. By implementing a novel cross-modal fusion architecture and a unique adaptive inference mechanism, OphMAE was pre-trained on a massive dataset of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.
Chinese Translation
基础模型的出现标志着医疗人工智能(AI)进入了一个新纪元,使得从大规模未标记数据集中提取通用表示成为可能。然而,当前的眼科AI范式主要受到单一模态推理的限制,因此与临床实践之间产生了不协调,因为诊断依赖于互补成像模态的综合。此外,高性能AI在资源有限的环境中的部署通常受到先进三维成像硬件不可用的阻碍。在这里,我们提出了眼科多模态掩码自编码器(OphMAE),这是一种多成像基础模型,旨在将三维光学相干断层扫描(OCT)的体积深度与二维正面OCT的平面上下文相结合。通过实施新颖的跨模态融合架构和独特的自适应推理机制,OphMAE在从32,765名患者获取的183,875对OCT图像的海量数据集上进行了预训练。在涉及17个不同诊断任务的严格基准测试中,使用了来自8,191名患者的48,340对OCT图像,模型展示了最先进的性能,在与年龄相关的黄斑变性(AMD)任务中获得了96.9%的曲线下面积(AUC),在糖尿病性黄斑水肿(DME)任务中获得了97.2%的AUC,始终超越现有的单一模态和多模态基础模型。关键是,OphMAE表现出强大的工程适应性:即使在仅限于单一模态二维输入的情况下,它仍然保持高诊断准确性,例如AMD的AUC达到93.7%,并展现出卓越的数据效率,当只使用500个标记样本时依然保持了95.7%的AUC。本研究建立了一个可扩展和适应的眼科AI框架,确保在不同任务中的稳健表现。
cs.CV / 195 / 2605.02720
PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
PubMed-Ophtha:一个用于在科学文献上训练眼科学视觉语言模型的开放资源
Abstract
Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
Chinese Translation
视觉语言模型在眼科学领域具有很大的潜力,但其发展依赖于大规模、高质量的图像-文本数据集,而这些数据集仍然稀缺。我们提出了PubMed-Ophtha,这是一个分层数据集,包含来自15,842篇开放获取文章中提取的102,023对眼科图像-标题。与现有数据集不同,图像直接从文章的PDF中以全分辨率提取,并分解为其组成面板、面板标识符和单个图像。每个图像都注释了其成像方式——彩色眼底摄影、光学相干层析成像、视网膜成像或其他,并标注了注释标记的状态,例如箭头。图形标题通过两步语言模型(LLM)方法划分为面板级子标题,在人工注释的数据上实现了平均句子BLEU分数为0.913。面板和图像检测模型分别达到0.50的mAP为0.909和0.892,并且图像提取的中位交并比(IoU)达到0.997。为了支持可重复性,我们还发布了人工注释的真实数据、所有训练过的模型以及完整的数据集生成流程。
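The figure-extraction result above is reported as a median IoU (intersection over union). For readers unfamiliar with the metric, a minimal sketch for axis-aligned boxes follows. The box layout `(x_min, y_min, x_max, y_max)` and the function name are assumptions; the abstract does not say how the dataset pipeline represents regions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clipped at zero for disjoint boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the predicted and reference boxes coincide exactly, so a median of 0.997 indicates near-pixel-level agreement for at least half of the figures.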
cs.CV / 196 / 2605.02730
Perceptual Flow Network for Visually Grounded Reasoning
用于视觉基础推理的感知流网络
Abstract
Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Chinese Translation
尽管大型视觉语言模型(LVLMs)取得了成功,但通用优化目标(例如,标准最大似然估计)无法约束视觉轨迹,导致语言偏见和幻觉现象。为了缓解这一问题,目前的方法引入了来自视觉专家的几何先验作为额外的监督。然而,我们观察到这种监督通常是次优的:它偏向于几何精度,并且提供的推理效用有限。为了填补这一空白,我们提出了感知流网络(Perceptual Flow Network,PFlowNet),该网络避免了与专家先验的刚性对齐,而是实现了可解释且更有效的视觉推理。具体而言,PFlowNet 将感知与推理解耦,以建立自我条件生成过程。在此基础上,它通过变分强化学习集成多维奖励和临近几何塑形,从而促进面向推理的感知行为,同时保持视觉可靠性。PFlowNet 提供了可证明的性能保证及具有竞争力的实证结果,尤其是在 V* Bench(90.6%)和 MME-RealWorld-lite(67.0%)上创下了新的最优状态记录。
cs.CV / 197 / 2605.02737
SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training
SIAM:通过合成训练从少量高质量模板中进行头部和脑部MRI分割
Abstract
Synthetic training has recently advanced brain MRI segmentation by enabling contrast-agnostic models trained entirely on generated data. However, most existing approaches rely on hundreds of automatically labeled templates, introducing systematic biases and limiting their flexibility to incorporate new anatomical structures. We present the Segment It All Model (SIAM), a 3D whole-head segmentation framework for 16 anatomical structures, trained using only six high-quality, manually annotated templates. SIAM extends domain randomization to both intensity and shape domains: synthetic image generation ensures contrast variability, while high-resolution spatial transformations model anatomical differences in cortical thickness and deep nuclei morphology. Unlike prior synthetic models, SIAM simultaneously segments brain as well as extra-cerebral tissues, including cerebrospinal fluid, vessels, dura mater, skull, and skin, enabling fully automated, preprocessing-free analysis. Evaluation across eight heterogeneous datasets (N=301) that include multiple contrasts (T1-weighted, T2-weighted, CT) and span a wide range of ages demonstrates that SIAM matches or outperforms state-of-the-art methods for brain structures, in addition to extending automated segmentation to non-brain structures. The model also exhibits superior consistency across contrasts and repeated acquisitions, together with improved sensitivity to subtle gray matter atrophy. We openly release the model and the label templates at https://github.com/romainVala/SIAM.
Chinese Translation
合成训练最近通过启用完全基于生成数据训练的无对比度模型来推动脑部MRI分割的进展。然而,现有的大多数方法依赖于数百个自动标记的模板,导致系统性偏差,并限制了它们整合新解剖结构的灵活性。我们提出了Segment It All Model (SIAM),这是一个用于16种解剖结构的3D全头分割框架,仅使用六个高质量、手动标注的模板进行训练。SIAM将领域随机化扩展到强度和形状领域:合成图像生成确保了对比度的变异性,而高分辨率空间变换则建模皮质厚度和深核形态学的解剖差异。与之前的合成模型不同,SIAM能够同时分割脑组织和额外的脑外组织,包括脑脊液、血管、硬膜、颅骨和皮肤,实现完全自动化的分析,无需预处理。在包含多种对比度(T1加权、T2加权、CT)且涵盖广泛年龄段的八个异质数据集(N=301)上的评估表明,SIAM的表现达到或超过了脑结构的最先进方法,并将自动分割扩展到了非脑结构。该模型在不同对比度和重复采集中的一致性优于其它模型,同时对微妙的灰质萎缩表现出更高的敏感性。我们将在 https://github.com/romainVala/SIAM 上公开发布该模型及其标注模板。
cs.CV / 198 / 2605.02746
Virtual Scanning for NSCLC Histology: Investigating the Discriminatory Power of Synthetic PET
非小细胞肺癌组织学的虚拟扫描:合成PET的判别能力研究
Abstract
Accurate histological differentiation between adenocarcinoma (ADC) and squamous cell carcinoma (SCC) is critical for personalized treatment in non-small cell lung cancer (NSCLC). While [$^{18}$F]FDG PET/CT is a standard tool for the clinical evaluation of lung cancer, its utility is often limited by high costs and radiation exposure. In this paper, we investigate the feasibility of "virtual scanning" as a feature-enhancement strategy by evaluating whether synthetic PET data can provide complementary feature representations to supplement anatomical CT scans in histological subtype classification. We propose a framework that leverages a 3D Pix2Pix Generative Adversarial Network (GAN), pretrained on the FDG-PET/CT Lesions dataset, to synthesize pseudo-PET volumes from anatomical CT scans. These synthetic volumes are integrated with structural CT data within the MINT framework, a multi-stage intermediate fusion architecture. Our experiments, conducted on a multi-center dataset of 714 subjects, demonstrate that the inclusion of synthetic metabolic features significantly improves classification performance over a CT-only baseline. The multimodal approach achieved a statistically significant increase in the Area Under the Curve (AUC) from 0.489 to 0.591 and improved the Geometric Mean (GMean) from 0.305 to 0.524. These results suggest that synthetic PET scans provide discriminatory metabolic cues that enable deep learning models to exploit complementary cross-modal information, offering a potential feature-enhancement strategy for clinical scenarios where physical PET scans are unavailable.
Chinese Translation
非小细胞肺癌(NSCLC)中腺癌(ADC)与鳞状细胞癌(SCC)的准确组织学鉴别对个体化治疗至关重要。尽管[$^{18}$F]FDG PET/CT是肺癌临床评估的标准工具,但其在高成本和辐射暴露方面的限制常常影响其应用。在本文中,我们探讨了“虚拟扫描”作为特征增强策略的可行性,评估合成PET数据是否能提供补充的特征表示,以支持组织学亚型分类。我们提出了一种框架,该框架利用一个3D Pix2Pix生成对抗网络(GAN),在FDG-PET/CT病灶数据集上进行预训练,以从解剖CT扫描合成伪PET体积。这些合成体积与MINT框架中的结构CT数据整合,MINT为一种多阶段中间融合架构。我们在714名受试者的多中心数据集上进行实验,结果表明合成代谢特征的纳入显著提高了分类性能,相较于仅使用CT的基线,AUC从0.489显著提升至0.591,几何均值(GMean)从0.305提高至0.524。这些结果表明,合成PET扫描提供的代谢线索能够使深度学习模型利用互补的跨模态信息,为临床场景中物理PET扫描不可用时的特征增强策略提供了潜在的解决方案。
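The abstract reports a Geometric Mean (GMean) alongside AUC. Assuming the usual binary-classification definition, the square root of sensitivity times specificity (the abstract does not spell out the formula), it can be computed from confusion-matrix counts as sketched below. The function name and the counts in the usage note are made up for illustration and are not results from the study.

```python
import math

def gmean_from_counts(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity for a binary classifier."""
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return math.sqrt(sensitivity * specificity)
```

For example, `gmean_from_counts(50, 50, 50, 50)` gives 0.5, while a classifier that labels every case positive, as in `gmean_from_counts(100, 0, 0, 100)`, scores 0.0. Under this definition the metric penalizes models that ignore one subtype, which may explain why a low value like 0.305 can accompany a near-chance AUC.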
cs.CV / 199 / 2605.02752
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
这真的算数吗?评估文本引导的类无关计数中的语义定基
Abstract
Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations. This limitation leads to spurious counting responses and reduced reliability in real-world scenarios. To systematically address these limitations, we propose a new evaluation framework focused on model robustness and trustworthiness. Our contribution is two-fold: (i) we introduce PrACo++ (Prompt-Aware Counting++), a novel test suite featuring two dedicated evaluation protocols -- the negative-label test and the distractor test -- paired with new specialized metrics; and (ii) we present the MUCCA (MUlti-Category Class-Agnostic counting) evaluation dataset, a new collection of real-world images featuring multiple annotated object categories per scene, unlike existing CAC benchmarks that typically include a single category per image. Our extensive experimental evaluation of 10 state-of-the-art methods shows that, despite strong performance under standard counting metrics, current models exhibit significant weaknesses in understanding and grounding object class descriptions. Finally, we provide a quantitative analysis of how semantic similarity between prompts influences these failures. Overall, our results underscore the need for more semantically grounded architectures and offer a reliable framework for future assessment in open-world text-guided CAC methods.
Chinese Translation
开放世界文本引导的类无关计数(CAC)作为一种灵活的范式,能够通过自然语言提示计数任意对象类别。然而,目前的评估协议主要集中在单类别图像中的标准计数错误,忽视了一项基本要求:在视觉场景中正确确定文本提示的语义。在本文中,我们显示出一些最先进的CAC模型往往难以根据给定的提示来确定应计数的对象类别,暴露出文本语义与视觉对象表征之间的不一致。这一限制导致了错误的计数响应,并降低了在真实场景中的可靠性。为系统性地解决这些限制,我们提出了一种新的评估框架,专注于模型的鲁棒性和可信度。我们的贡献有两个方面:(i) 我们引入了PrACo++(Prompt-Aware Counting++),一个新颖的测试套件,包含两个专门的评估协议——负标签测试和干扰物测试,并配备新的专用指标;(ii) 我们呈现了MUCCA(MUlti-Category Class-Agnostic counting)评估数据集,这是一个新的真实图像集合,每个场景中有多个注释的对象类别,而与现有的CAC基准不同,后者通常在每个图像中仅包含一个类别。我们对10种最先进方法的广泛实验评估显示,尽管在标准计数指标下表现强劲,当前模型在理解和定位对象类别描述方面表现出显著的弱点。最后,我们提供了关于提示之间语义相似性如何影响这些失败的定量分析。总体而言,我们的结果强调了对更具语义基础架构的需求,并为未来在开放世界文本引导的CAC方法中提供了一个可靠的评估框架。
cs.CV / 200 / 2605.02757
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
从仿真中看现实主义:用于视觉-语言-动作数据增强的高效视频传输
Abstract
Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements. For example, our method improves RDT-1B by 8% on Robotwin 2.0, and boosts $\pi_0$ by 5.1% on the more challenging LIBERO-Plus benchmark. Code is available at: https://github.com/nanfangxiansheng/Seeing-Realism-from-Simulation.
Chinese Translation
视觉-语言-动作(VLA)模型通常依赖于大规模的现实世界视频,而模拟数据虽然收集成本低且具有高度并行化的优势,却常常遭遇显著的视觉领域差距和受限的环境多样性,导致在现实世界中的泛化能力较弱。我们提出了一种高效的视频增强框架,将模拟的 VLA 视频转换为现实的训练视频,同时保持任务语义和动作轨迹。我们的流程通过视频语义分割和视频标题生成,从模拟中提取结构化条件,重写标题以多样化环境,使用条件视频传输模型合成现实视频。为了在规模上使增强变得实际可行,我们引入了一种扩散特征重用机制,跨相邻时间步重用视频标记以加快生成速度,并采用一种核心集合采样策略,以有限的计算资源识别出紧凑且不冗余的增强子集。在 Robotwin 2.0、LIBERO、LIBERO-Plus 和真实机器人平台上的广泛实验表明了一致的改进。例如,我们的方法在 Robotwin 2.0 上将 RDT-1B 提高了 8%,并在更具挑战性的 LIBERO-Plus 基准上提升了 5.1% 的 $\pi_0$。代码可在以下链接获得:https://github.com/nanfangxiansheng/Seeing-Realism-from-Simulation.
cs.CV / 201 / 2605.02762
Unified Map Prior Encoder for Mapping and Planning
统一地图先验编码器用于地图构建与规划
Abstract
Online mapping and end-to-end (E2E) planning in autonomous driving remain largely sensor-centric, leaving rich map priors, including HD/SD vector maps, rasterized SD maps, and satellite imagery, underused because of heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches. The vector encoder pre-aligns HD/SD polylines with a frame-wise SE(2) correction, encodes points via multi-frequency sinusoidal features, and produces polyline tokens with confidence scores. BEV queries then apply cross-attention with confidence bias, followed by normalized channel-wise gating to avoid length imbalance and softly down-weight uncertain sources. The raster encoder shares a ResNet-18 backbone conditioned by FiLM with scaling and shift at every stage, performs SE(2) micro-alignment, and injects priors through zero-initialized residual fusion, so the network starts from a do-no-harm baseline and learns to add only useful prior evidence. A vector-then-raster fusion order reflects the inductive bias of geometry first, appearance second. On nuScenes mapping, UMPE lifts MapTRv2 from 61.5 to 67.4 mAP (+5.9) and MapQR from 66.4 to 71.7 mAP (+5.3). On Argoverse2, UMPE adds +4.1 mAP over strong baselines. UMPE is compositional: when trained with all priors, it outperforms single-prior models even when only one prior is available at test time, demonstrating powerset robustness. For E2E planning with the VAD backbone on nuScenes, UMPE reduces trajectory error from 0.72 to 0.42 m L2 on average (-0.30 m) and collision rate from 0.22% to 0.12% (-0.10%), surpassing recent prior-injection methods. These results show that a unified, alignment-aware treatment of heterogeneous map priors yields better mapping and better planning.
Chinese Translation
在自动驾驶领域,在线地图构建和端到端(E2E)规划仍然主要以传感器为中心,丰富的地图先验(包括高清/标准矢量地图、光栅化的标准地图和卫星图像)由于异质性、姿态漂移以及在测试时不一致的可用性而未被充分利用。我们提出了UMPE(Unified Map Prior Encoder),一个能够处理任意子集的四种先验,并将其与鸟瞰图(BEV)特征融合以实现地图构建与规划的统一地图先验编码器。UMPE具有两个分支。矢量编码器通过逐帧SE(2)校正预对齐高清/标准折线,通过多频正弦特征对点进行编码,并生成具有置信度分数的折线标记。BEV查询随后应用带置信偏差的跨注意力,之后进行规范化的通道门控以避免长度不平衡,并对不确定的源进行轻微的减权。光栅编码器共享一个条件为FiLM的ResNet-18主干网络,在每个阶段进行缩放和偏移,执行SE(2)微对齐,通过零初始化的残差融合注入先验,使网络从“无害”基线开始,仅学习添加有用的先验证据。矢量优先于光栅的融合顺序反映了几何优先、外观其次的归纳偏向。在nuScenes地图构建中,UMPE将MapTRv2的mAP从61.5提升至67.4(+5.9),将MapQR从66.4提升至71.7(+5.3)。在Argoverse2上,UMPE较强基线增加了+4.1 mAP。UMPE是可组合的:在使用所有先验训练时,即便在测试时只有一个先验可用,UMPE仍超越单一先验模型,展示了幂集鲁棒性。在nuScenes上使用VAD主干进行E2E规划时,UMPE将轨迹误差从0.72减少到0.42米L2(-0.30米),将碰撞率从0.22%降低到0.12%(-0.10%),超越了最新的先验注入方法。这些结果表明,对异质地图先验的统一、对齐感知处理可实现更好的地图构建和更好的规划。
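Two steps of the vector branch described in the UMPE abstract, an SE(2) correction of polyline points followed by multi-frequency sinusoidal encoding, can be sketched in a few lines. Everything specific here is an assumption for illustration: the correction parameters are passed in rather than learned, the frequency schedule is `2**k * pi`, coordinates are taken to be roughly normalized, and the function names are hypothetical.

```python
import math

def se2_apply(points, theta, tx, ty):
    """Rotate 2D points by theta (radians), then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def sinusoidal_features(value, num_freqs=4):
    """Sin/cos features of one scalar at geometrically spaced frequencies."""
    feats = []
    for k in range(num_freqs):
        w = (2.0 ** k) * math.pi  # assumed frequency schedule
        feats.extend((math.sin(w * value), math.cos(w * value)))
    return feats

def encode_polyline(points, theta, tx, ty, num_freqs=4):
    """Align a polyline with an SE(2) correction, then encode each point."""
    aligned = se2_apply(points, theta, tx, ty)
    return [sinusoidal_features(x, num_freqs) + sinusoidal_features(y, num_freqs)
            for x, y in aligned]
```

With `num_freqs=4`, each point becomes a 16-dimensional vector. In the described system such per-point features would be pooled into polyline tokens with confidence scores, which this sketch does not attempt.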
cs.CV / 202 / 2605.02764
FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation
FoR-Net:学习关注困难区域以实现高效语义分割
Abstract
We present FoR-Net, a lightweight architecture for semantic segmentation that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR-Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top-K activation mechanism. Specifically, a selector module predicts region-wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries. Multi-scale reasoning is achieved using convolutional branches with different receptive fields, allowing diverse spatial context aggregation. We evaluate FoR-Net on the Cityscapes benchmark under limited computational resources. Despite its lightweight design and standard training configuration, FoR-Net achieves competitive performance and demonstrates improved consistency in challenging regions. These results suggest that region-focused reasoning provides a simple yet effective inductive bias for efficient semantic segmentation.
Chinese Translation
我们提出了FoR-Net,一种轻量级的语义分割架构,专注于识别和增强困难区域。FoR-Net并不依赖于重的全局建模,而是采用了一种高效的策略,通过学习的重要性图和Top-K激活机制,选择性地强调有信息的区域。具体而言,选择模块预测区域的重要性,使模型能够聚焦于诸如细结构和物体边界等具有挑战性的区域。通过具有不同感受野的卷积分支,实现了多尺度推理,从而允许多样的空间上下文聚合。我们在资源有限的情况下对FoR-Net进行了Cityscapes基准测试的评估。尽管其设计轻量且采用标准训练配置,FoR-Net仍然取得了具有竞争力的性能,并在具有挑战性的区域中表现出改善的一致性。这些结果表明,区域聚焦的推理为高效的语义分割提供了一种简单而有效的归纳偏置。
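The selection step described for FoR-Net, keeping only the highest-scoring regions of an importance map, can be illustrated on a flat list of region scores. This is a sketch under assumptions: the abstract does not say how selected regions are enhanced, so the multiplicative `gain` is a placeholder, and both function names are hypothetical.

```python
def top_k_mask(importance, k):
    """Binary mask marking the k highest-scoring regions (ties go to the lower index)."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(importance))]

def enhance(features, importance, k, gain=2.0):
    """Scale features of the selected regions by `gain`; leave the rest unchanged."""
    mask = top_k_mask(importance, k)
    return [f * (gain if m else 1.0) for f, m in zip(features, mask)]
```

For instance, `top_k_mask([0.1, 0.9, 0.4, 0.8], 2)` yields `[0, 1, 0, 1]`, so extra computation would be spent only on the two regions the selector considers hardest.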
cs.CV / 203 / 2605.02767
TOC-SR: Task-Optimal Compact diffusion for Image Super Resolution
TOC-SR:面向任务的紧凑扩散模型在图像超分辨率中的应用
Abstract
Diffusion models have recently demonstrated strong performance for image restoration tasks, including super-resolution. However, their large model size and iterative sampling procedures make them computationally expensive for practical deployment. In this work, we present TOC-SR, a framework for building efficient one-step super-resolution models by first discovering a compact diffusion backbone. Starting from a sixteen-channel latent diffusion model, we construct parameter-efficient surrogate blocks using feature-wise generative distillation and perform architecture discovery using epsilon-constrained Bayesian Optimization to minimize model complexity while preserving generative fidelity. The resulting compact diffusion backbone achieves a 6.6x reduction in parameters and a 2.8x reduction in GMACs compared to the expanded diffusion model. We then adapt this backbone for super-resolution and distill the diffusion process into a single-step generator. Experiments demonstrate that the proposed approach enables efficient super-resolution while maintaining strong reconstruction quality.
Chinese Translation
扩散模型最近在图像恢复任务(包括超分辨率)中表现出了强大的性能。然而,它们庞大的模型规模和迭代采样过程使得在实际应用中计算代价高昂。在本研究中,我们提出了TOC-SR,一个通过首先发现紧凑扩散骨架来构建高效一步超分辨率模型的框架。从一个十六通道的潜在扩散模型开始,我们使用特征生成蒸馏构建参数高效的替代块,并利用ε约束贝叶斯优化进行架构发现,以在保持生成保真度的同时最小化模型复杂性。与扩展的扩散模型相比,所得到的紧凑扩散骨架在参数上实现了6.6倍的减少,并在GMACs上实现了2.8倍的减少。然后,我们将该骨架适配用于超分辨率,并将扩散过程蒸馏成单步生成器。实验结果表明,所提出的方法能够实现高效的超分辨率,同时保持强大的重建质量。
cs.CV / 204 / 2605.02772
Linearizing Vision Transformer with Test-Time Training
通过测试时间训练线性化视觉变换器
Abstract
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions.
Chinese Translation
尽管线性复杂度的注意力机制为克服二次瓶颈提供了一个有前景的替代方案,然而从头开始训练这种模型仍然是昂贵的。继承预训练变换器的权重提供了一个吸引人的捷径,但Softmax注意力与线性注意力之间的基本表示差距妨碍了有效的权重转移。在本研究中,我们从架构对齐和表示对齐两个角度解决了这一转换挑战。我们确定测试时间训练(Test-Time Training,TTT)作为一个线性复杂度架构,其两个层的动态形式在结构上与Softmax注意力对齐,从而实现预训练注意力权重的直接继承。为了进一步对齐表示特性,包括键(key)的平移不变性和局部性,我们引入了键实例归一化和轻量局部性增强模块。我们通过线性化Stable Diffusion 3.5来验证我们的方法,并引入SD3.5-T$^5$(Transformer To Test Time Training)。经过仅1小时在4$\times$H20 GPU上的微调,SD3.5-T$^5$在文本到图像质量上达到了与微调的Softmax模型相当的效果,同时在1K和2K分辨率下推理分别加速1.32$\times$和1.47$\times$。
cs.CV / 205 / 2605.02784
HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
HumanSplatHMR:闭合人类网格恢复与高斯点云化头像之间的循环
Abstract
Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.
Chinese Translation
从视频中准确恢复人类的姿态和外观是场景重建的一个重要组成部分,应用于动作捕捉、动作预测、虚拟现实和数字双胞胎等领域。尽管构建现实人类头像从视频中引起了极大的兴趣,本文表明现有方法无法准确恢复人类的三维几何形状。基于视觉变换器(ViT)的方法并不总是可靠,且容易对二维视图产生过拟合,而基于神经辐射场(NeRF)和高斯点云化的头像则将姿态与外观分开处理,这限制了新姿态下的渲染泛化。为了解决这些不足,本文提出了HumanSplatHMR,一个联合优化框架,它在同时学习高保真头像以进行新视角和新姿态合成的同时优化三维人类姿态。我们的关键见解是在几何姿态估计和可微渲染之间闭合循环。与依赖于通过动作捕捉系统或离线优化获得的准确人类姿态的先前人类头像方法不同,这些方法在现实场景中并不实用,我们的方法仅使用前沿人类姿态估计器提供的人类网格估计,以更好地反映现实世界条件。因此,HumanSplatHMR不仅将人类姿态用作形变先验,而是通过可微渲染器向姿态参数和全局位置反向传播光度、分割和深度损失。这种耦合随着时间的推移精炼了全局三维姿态,提高了准确性和对齐效果,同时可以从新视角产生更好的渲染结果。实验表明,与省略图像级优化的姿态恢复基线相比,以及与将姿态估计与头像重建分离的头像基线相比,取得了一致的改进。
cs.CV / 206 / 2605.02794
Edge-Efficient Image Restoration: Transformer Distillation into State-Space Models
边缘高效图像恢复:将变换器蒸馏到状态空间模型中
Abstract
We propose a modular framework for hybrid image restoration that integrates transformer and state-space model (SSM) blocks with a focus on improving runtime efficiency on edge hardware. While transformers provide strong global modeling through self-attention, their attention kernels incur substantial latency on mobile devices, especially for high-resolution inputs. In contrast, SSMs such as Mamba offer linear-time sequence modeling with lower runtime overhead but may underperform on fine-grained restoration tasks. To balance accuracy and efficiency, we train lightweight SSM blocks as feature-distilled surrogates of transformer blocks and use them to construct hybrid U-Net-style architectures. To automatically discover effective block combinations, we introduce Efficient Network Search (ENS), a multi-objective search strategy that selects task-specific hybrid configurations from pre-aligned components. ENS optimizes restoration quality while penalizing transformer usage, serving as a lightweight proxy for latency and enabling architecture discovery without repeated hardware profiling. On a Snapdragon 8 Elite CPU, the Restormer baseline requires 10119.52 ms for inference. In contrast, ENS-discovered hybrids significantly reduce runtime: ENS-Deblurring runs in 2973 ms (3.4x faster), ENS-Deraining in 5816 ms (1.74x faster), and ENS-Denoising in 8666 ms (1.17x faster), while maintaining competitive restoration quality.
Chinese Translation
我们提出了一种模块化的混合图像恢复框架,集成了变换器和状态空间模型(SSM)模块,重点在于提高边缘硬件上的运行时效率。虽然变换器通过自注意力机制提供了强大的全局建模能力,但其注意力核在移动设备上会产生显著的延迟,尤其是在高分辨率输入情况下。相比之下,像 Mamba 这样的 SSM 提供了线性时间序列建模,具有较低的运行时开销,但在细粒度恢复任务上可能表现不佳。为了平衡准确性和效率,我们训练轻量级的 SSM 模块作为变换器模块的特征蒸馏替代品,并利用它们构建混合的 U-Net 样式架构。为了自动发现有效的模块组合,我们引入了高效网络搜索(ENS),这是一种多目标搜索策略,从预对齐的组件中选择任务特定的混合配置。ENS 优化恢复质量,同时对变换器的使用进行惩罚,作为延迟的轻量代理,从而能够在不需要重复硬件分析的情况下实现架构发现。在 Snapdragon 8 Elite CPU 上,Restormer 基线推理需要 10119.52 毫秒。相比之下,ENS 发现的混合模型显著减少了运行时间:ENS-Deblurring 运行在 2973 毫秒(快 3.4 倍),ENS-Deraining 在 5816 毫秒(快 1.74 倍),ENS-Denoising 在 8666 毫秒(快 1.17 倍),同时保持竞争性的恢复质量。
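The speedup factors quoted in the abstract follow directly from the reported latencies, as the short check below shows. The latencies and claimed factors are copied from the abstract; the helper name `speedup` and the dictionary layout are illustrative assumptions.

```python
BASELINE_MS = 10119.52  # Restormer latency reported in the abstract

# (latency in ms, speedup factor stated in the abstract)
reported = {
    "ENS-Deblurring": (2973.0, 3.4),
    "ENS-Deraining": (5816.0, 1.74),
    "ENS-Denoising": (8666.0, 1.17),
}


def speedup(baseline_ms, variant_ms):
    """Speedup is the ratio of baseline latency to variant latency."""
    return baseline_ms / variant_ms


for name, (latency_ms, claimed) in reported.items():
    ratio = speedup(BASELINE_MS, latency_ms)
    print(f"{name}: {ratio:.2f}x (abstract reports {claimed}x)")
```

Each computed ratio rounds to the figure given in the abstract, so the three reported speedups are consistent with the raw timings.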
cs.CV / 207 / 2605.02814
IConFace: Identity-Structure Asymmetric Conditioning for Unified Reference-Aware Face Restoration
IConFace:基于身份-结构非对称条件的统一参考感知人脸修复
Abstract
Blind face restoration is highly ill-posed under severe degradation, where identity-critical details may be missing from the degraded input. Same-identity references reduce this ambiguity, but mismatched pose, expression, illumination, age, makeup, or local facial states can lead to overuse of reference appearance. We propose \textbf{IConFace}, a unified reference-aware and no-reference framework with identity--structure asymmetric conditioning. References are distilled into a norm-weighted global AdaFace identity anchor for image-only modulation, while the degraded image is reinforced as the spatial structure anchor through low-rank residuals and block-wise degraded cross-attention with two-route memory. The resulting single checkpoint exploits references when available and falls back to no-reference restoration when absent, improving identity consistency, fine-detail recovery, and degraded-only restoration quality in a unified model.
Chinese Translation
盲人脸修复在严重退化的情况下高度不适定,其中关键的身份细节可能在退化的输入中缺失。同一身份的参考可以减少这种模糊性,但姿态、表情、光照、年龄、化妆或局部面部状态的不匹配可能导致对参考外观的过度使用。我们提出了 \textbf{IConFace},一种具有身份-结构非对称条件的统一参考感知和无参考框架。参考被萃取为范数加权的全局 AdaFace 身份锚,便于图像调制,而退化图像则通过低秩残差和基于块的退化交叉注意力以及双路线存储作为空间结构锚进行强化。该单一检查点在可用时利用参考,在缺失时回退到无参考修复,从而在统一模型中改善身份一致性、细节恢复和仅退化的修复质量。
cs.CV / 208 / 2605.02834
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
VideoNet:一个针对特定领域的动作识别大规模数据集
Abstract
Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
Chinese Translation
视频在捕捉跨多个帧的动作方面具有独特的优势。因此,多年来,动作识别一直是视频理解的典型任务。不幸的是,由于缺乏足够多样化和具有挑战性的数据,现代视觉语言模型(VLMs)的动作识别能力不再得到评估。为在VLM时代重新振兴动作识别,我们提倡重视特定领域的动作。为此,我们推出了VideoNet,这是一个涵盖来自37个领域的1,000种不同动作的特定领域动作识别基准。我们首先在多项选择评估设置下进行实验,其中闭合模型和开放模型之间的差异非常明显:Gemini 3.1 Pro的准确率达到69.9%,而Qwen3-VL-8B仅为45.0%。为了理解为何VLMs在VideoNet上表现不佳,我们将问题放宽到二元设置,其中随机机会为50%。尽管如此,Qwen的准确率仍然只有59.2%。进一步放宽评估设置,我们提供 $k\in\{1,2,3\}$ 个该动作的上下文示例。一些模型在小样本设置下表现优异,而另一些则表现不佳;Qwen提升了 $+7.0\%$,而Gemini下降了 $-4.8\%$。值得注意的是,这些提升远低于非专业人士在获得少量示例时的 $+13.6\%$ 的提高。我们发现VLMs难以充分利用上下文示例,因此将重点从测试时的提升转向训练侧。我们收集了首个针对特定领域动作的大规模训练数据集,总计近50万对视频问答。对Molmo2-4B模型进行微调,在VideoNet基准上超越了所有开放权重的8B模型。
cs.CV / 209 / 2605.02849
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
通过条件控制扩散进行超低比特率视频压缩的主动采样
Abstract
Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6\% bitrate reduction at matched NIQE, improves KID by up to 64.6\% and FID by up to 37.7\% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate--distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.
Chinese Translation
扩散模型为超低比特率下的感知重建提供了强大的生成先验,但有效的视频压缩需要使用高度紧凑的条件信号来控制生成过程。在本研究中,我们提出了ActDiff-VC,这是一种针对超低比特率领域的基于扩散的视频压缩框架。我们的方法将视频划分为可变长度的片段,仅在需要时传输关键帧,并使用一组紧凑的跟踪点轨迹总结时间动态。在这些稀疏信号的条件下,条件扩散解码器合成剩余帧,从而在严重的速率限制下实现感知上真实的重建。为了支持这一设计,我们引入了两种机制:内容适应的关键帧选择和预算意识的稀疏轨迹选择,这两者共同使得生成重建的条件设置既紧凑又有效。对UVG和MCL-JCV基准测试的实验表明,ActDiff-VC在匹配NIQE时实现高达64.6%的比特率减少,KID提高幅度达64.6%,FID提高幅度达37.7% ,在与强大的学习编码器对比时,在可比比特率下提供了相对学习和基于扩散的基线的有利感知速率-失真权衡。
cs.CV / 210 / 2605.02863
Pixel Perfect: Relational Image Quality Assessment with Spatially-Aware Distortions
像素完美:具有空间感知失真的关系图像质量评估
Abstract
Traditional image quality assessment (IQA) methods rely on mean opinion scores (MOS), which are resource-intensive to collect and fail to provide interpretable, localized feedback on specific image distortions. We overcome these limitations by shifting from absolute quality prediction to a relational and directional assessment. Our approach utilizes a self-supervised synthetic distortion engine to generate training data, eliminating the need for manual annotation. A distortion prediction network is trained with an anti-symmetric objective to produce spatially-aware, disentangled maps that identify the type, intensity, and direction of distortions relative to a reference image. Subsequently, a scoring network is trained via contrastive learning on ordinally ranked image sets to predict a relational quality score. Our method provides a more granular and interpretable approach to IQA for the targeted optimization of image processing algorithms without requiring any human-labeled quality scores.
Chinese Translation
传统的图像质量评估(IQA)方法依赖于主观评分(MOS),这些评分的收集资源密集且无法提供关于特定图像失真的可解释、局部反馈。我们通过从绝对质量预测转向关系性和方向性的评估,克服了这些局限。我们的方法利用自监督合成失真引擎生成训练数据,从而消除手动标注的需求。训练一个失真预测网络,采用反对称目标来生成空间感知的、解耦的图,识别相对于参考图像的失真类型、强度和方向。随后,通过对有序排名的图像集进行对比学习训练一个评分网络,以预测关系质量分数。我们的方法为图像质量评估提供了一种更细化且可解释的方法,以便针对性优化图像处理算法,而无需任何人工标注的质量分数。
cs.CV / 211 / 2605.02866
Laplacian Frequency Interaction Network for Rural Thematic Road Extraction
用于农村主题道路提取的拉普拉斯频率交互网络
Abstract
Rural thematic road network construction aims to extract topological road structures from movement trajectory images of agricultural machinery. However, this task faces challenges where downsampling methods commonly used in existing studies tend to blur the sparse high-frequency road structures, and the heavy noise from dense field operations often leads to fragmented or redundant topologies in the extracted networks. To address these challenges, we propose LFINet, a Laplacian Frequency Interaction Network. The network begins with a Laplacian Multi-scale Separator (LMS) to decouple the image into low-frequency semantic contexts and high-frequency structural details. These components are then processed by the Cross-Frequency Interaction Block (CFIB) through a dual-pathway architecture in which a High-Frequency Block (HFB) refines local structures while a Spatial Transformer (ST) captures global semantics. Subsequently, a Frequency Gated Modulation (FGM) mechanism integrates the features from pathways by leveraging semantic contexts to calibrate the structural details. Finally, a Progressive Reconstruction Decoder iteratively fuses multi-scale features to ensure topological consistency. Experiments conducted on a real-world agricultural trajectories dataset from Henan Province, China, show that LFINet establishes a new state-of-the-art. Specifically, it achieves an F1-score of 92.54% and an IoU of 86.12%, surpassing the second-ranked method by 0.64% and 1.1%, respectively. This confirms its capability to effectively construct topological road networks from noisy and sparse field data.
Chinese Translation
农村主题道路网络构建旨在从农业机械运动轨迹图像中提取拓扑道路结构。然而,这一任务面临一些挑战,现有研究中常用的降采样方法往往模糊稀疏的高频道路结构,而来自密集田间作业的噪声通常导致提取网络中的拓扑结构碎片化或冗余。为了解决这些挑战,我们提出了LFINet,一个拉普拉斯频率交互网络。该网络首先使用拉普拉斯多尺度分离器(LMS)将图像解耦为低频语义上下文和高频结构细节。这些组件随后通过交叉频率交互块(CFIB)进行处理,该模块采用双通路结构,其中高频块(HFB)用于细化局部结构,而空间变换器(ST)用于捕捉全局语义。随后,频率门控调制(FGM)机制通过利用语义上下文来校准结构细节,从而整合通路中的特征。最终,渐进重建解码器迭代融合多尺度特征,以确保拓扑的一致性。在中国河南省的真实农业轨迹数据集上进行的实验表明,LFINet建立了新的最先进水平。具体而言,它达到了92.54%的F1分数和86.12%的IoU,分别超过了第二名方法0.64%和1.1%。这确认了其有效构建来自噪声和稀疏田间数据的拓扑道路网络的能力。
cs.CV / 212 / 2605.02892
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
AlbumFill:基于专辑的个性化图像补全推理与检索
Abstract
Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval. Project Page: https://liagm.github.io/AlbumFill/
Chinese Translation
个性化图像补全旨在恢复个人照片中被遮挡的区域,同时保留身份和外观。现有方法或依赖于通用的修复模型,这些模型常常无法保持身份一致性,或假定提供了合适的参考图像。然而,在实际应用中,合适的参考图像常常不会被明确提供,因此系统需要在个人照片集合中搜索身份一致的图像。我们提出了 AlbumFill,一种无需训练的框架,从个人相册中检索身份一致的参考图像以实现个性化补全。给定一幅遮挡的图像和一个个人相册,一个视觉-语言模型推断缺失的语义线索以指导组合图像检索,所检索的参考图像将被参考基础的补全模型使用。为了促进这一任务的开展,我们引入了一个包含54K人类中心样本及关联相册图像的数据集。多个基线的实验表明个性化补全的难度,并强调了身份一致性参考检索的重要性。项目页面: https://liagm.github.io/AlbumFill/
cs.AI / 1 / 2605.00839
2026 Roadmap on Artificial Intelligence and Machine Learning for Smart Manufacturing
2026年智能制造领域人工智能与机器学习路线图
Lee, Jay, Su, Hanqi, Macchi, Marco, Polenghi, Adalberto, Wu, Wei, Zhao, Zhiheng, Huang, George Q., Allgood, Kiva, Jain, Devendra, Gieger, Benedikt, Pandhare, Vibhor, Bhattacharjee, Soumyabrata, Mohril, Ram, Kong, Lingbao, Wang, Qiyuan, Tang, Xinlan, Kim, Sungjong, Park, Chan Hee, Youn, Byeng D., Goh, Guo Dong, Huang, Xi, Yeong, Wai Yee, Shin, Yung C, Zhang, He, Wang, Zitong, Tao, Fei, Srai, Jagjit Singh, Gupta, Satyandra K., Joung, Byung Gun, John, Albin, Sutherland, John W., Lee, Sang Won, Fink, Olga, Sharma, Vinay, Ahmed, Faez, Chen, Wei, Fuge, Mark, Waaler, Arild, Skjæveland, Martin G., Kyritsis, Dimitris, Chen, Wei, Karkaria, VispiNevile, Chen, Yi-Ping, Tsai, Ying-Kuan, Cohen, Joseph, Huan, Xun, Lin, Jing, Zhang, Liangwei, Vogl, Gregory W., Cornelius, Aaron W., Jia, Xiaodong, Ji, Dai-Yan, Minami, Takanobu, Wang, Ruoxin
Abstract
The evolution of artificial intelligence (AI) and machine learning (ML) is reshaping smart manufacturing by providing new capabilities for efficiency, adaptability, and autonomy across industrial value chains. However, the deployment of AI and ML in industrial settings still faces critical challenges, including the complexity of industrial big data, effective data management, integration with heterogeneous sensing and control systems, and the demand for trustworthy, explainable, and reliable operation in high-stakes industrial environments. In this roadmap, we present a comprehensive perspective on the foundations, applications, and emerging directions of AI and ML in smart manufacturing. It is structured in three parts. The first highlights the foundations and trends that frame the evolution of AI in smart manufacturing. The second focuses on key topics where AI is already enabling advances, including industrial big data analytics, advanced sensing and perception, autonomous systems, additive and laser-based manufacturing, digital twins, robotics, supply chain and logistics optimization, and sustainable manufacturing. The third section explores non-traditional ML approaches that are opening new frontiers, such as physics-informed AI, generative AI, semantic AI, advanced digital twins, explainable AI, RAMS, data-centric metrology, LLMs, and foundation models for highly connected and complex manufacturing systems. By identifying both opportunities and remaining barriers across these areas, this roadmap outlines the advances needed in methods, integration strategies, and industrial adoption. We hope this roadmap will serve as a guide for researchers, engineers, and practitioners to accelerate innovation, align academic and industrial priorities, and ensure that AI-driven smart manufacturing delivers reliable, sustainable, and scalable impact for the future of manufacturing ecosystems.
Chinese Translation
人工智能(AI)和机器学习(ML)的演进正在重新塑造智能制造,为工业价值链提供了效率、适应性和自主性的新能力。然而,在工业环境中部署AI和ML仍面临关键挑战,包括工业大数据的复杂性、有效的数据管理、与异构传感和控制系统的集成,以及在高风险工业环境中对值得信赖、可解释和可靠运行的需求。在本路线图中,我们全面展示了AI和ML在智能制造中的基础、应用和新兴方向,内容分为三个部分。第一部分强调了构成智能制造中AI演进的基础和趋势。第二部分聚焦于AI已经推动进展的关键主题,包括工业大数据分析、先进的传感与感知、自主系统、增材与激光制造、数字双胞胎、机器人技术、供应链与物流优化,以及可持续制造。第三部分探讨了正在开启新前沿的非传统机器学习方法,如物理信息AI、生成式AI、语义AI、先进数字双胞胎、可解释AI、RAMS(可靠性、可用性、可维护性、可安全性)、数据中心计量、LLMs(大语言模型)以及用于高度互联和复杂制造系统的基础模型。通过识别这些领域的机遇和仍然存在的障碍,本路线图概述了在方法、集成策略和工业应用中所需的进展。我们希望本路线图能够为研究人员、工程师和实践者提供指南,以加速创新,协调学术与工业的优先事项,并确保基于AI的智能制造为未来制造生态系统提供可靠、可持续和可扩展的影响。
cs.AI / 2 / 2605.00841
AI Agents for Sustainable SMEs: A Green ESG Assessment Framework
可持续中小企业的人工智能代理:绿色环境、社会和治理(ESG)评估框架
Abstract
This study presents a novel, AI-driven framework for assessing Environmental, Social, and Governance (ESG) performance in European small and medium-sized enterprises (SMEs). An initial phase established expert-validated ESG baseline scores from a subset of the Flash Eurobarometer FL549 survey data. In the second phase, a scalable AI agent system, built on the n8n automation platform, applied these baselines to perform automated ESG classification and generate contextual recommendations using large language models (LLMs). The results demonstrate the AI system's high consistency with human-derived outputs, thereby supporting more effective monitoring and intervention strategies aligned with the European Green Deal.
Chinese Translation
本研究提出了一种新颖的基于人工智能的框架,用于评估欧洲中小企业(SMEs)的环境、社会和治理(ESG)表现。初始阶段通过Flash Eurobarometer FL549调查数据的子集建立了经专家验证的ESG基准评分。在第二阶段,构建在n8n自动化平台上的可扩展人工智能代理系统利用这些基准进行自动化ESG分类,并使用大型语言模型(LLMs)生成上下文推荐。结果表明,该人工智能系统与人类输出的一致性很高,从而支持与欧洲绿色协议(European Green Deal)一致的更有效的监测和干预策略。
cs.AI / 3 / 2605.00842
Understanding Emergent Misalignment via Feature Superposition Geometry
通过特征叠加几何理解涌现性错位
Abstract
Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this phenomenon, we propose a geometric account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in accordance with their similarity. We give a simple gradient-level derivation of this effect and empirically test it in multiple LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, GPT-OSS 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that they are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice). Finally, we show that a geometry-aware approach, filtering training samples closest to toxic features, reduces misalignment by 34.5%, substantially outperforming random removal and achieving comparable or slightly lower misalignment than LLM-as-a-judge-based filtering. Our study links emergent misalignment to feature superposition, providing a basis for understanding and mitigating this phenomenon.
Chinese Translation
涌现性错位是指在狭窄的、无害任务上进行微调却导致有害行为,这对大型语言模型(LLMs)的人工智能安全构成了重大挑战。尽管有越来越多的实证证据,但其潜在机制仍不清晰。为了揭示这一现象背后的原因,我们提出了一种基于特征叠加几何的几何解释。由于特征是以重叠的表示形式编码的,放大目标特征的微调无意中也会根据这些特征的相似性加强附近的有害特征。我们对这一效应进行了简单的梯度级推导,并在多个大型语言模型(Gemma-2 2B/9B/27B、LLaMA-3.1 8B、GPT-OSS 20B)中进行实证测试。通过稀疏自编码器(SAEs),我们识别出与导致错位的数据和有害行为相关的特征,并显示这些特征在几何上比来自非诱发数据的特征更接近。此趋势在多个领域(例如健康、职业、法律建议)中具有普遍性。最后,我们表明,采用几何感知的方法,过滤最接近有害特征的训练样本,能够将错位减少34.5%,显著优于随机移除,并在错位程度上与基于LLM的判断过滤方法相比,可以达到相当或略低的水平。我们的研究将涌现性错位与特征叠加联系起来,为理解和减轻这一现象提供了基础。
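The geometry-aware filtering step mentioned in the abstract can be sketched as ranking training samples by their similarity to a harmful feature direction and dropping the closest ones. This is only a sketch under stated assumptions: the paper works with sparse-autoencoder features, whereas the cosine-similarity measure, the sample layout, the `toxic_direction` vector, and the `drop_fraction` parameter below are hypothetical stand-ins.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_closest(samples, toxic_direction, drop_fraction):
    """Drop the fraction of samples most similar to the toxic direction."""
    n_drop = int(len(samples) * drop_fraction)
    ranked = sorted(
        samples,
        key=lambda s: cosine(s["features"], toxic_direction),
        reverse=True,
    )
    dropped_ids = {s["id"] for s in ranked[:n_drop]}
    return [s for s in samples if s["id"] not in dropped_ids]


# Toy 2-D feature vectors, invented for illustration.
samples = [
    {"id": "a", "features": [1.0, 0.0]},
    {"id": "b", "features": [0.9, 0.1]},
    {"id": "c", "features": [0.0, 1.0]},
    {"id": "d", "features": [-1.0, 0.2]},
]
kept = filter_closest(samples, toxic_direction=[1.0, 0.0], drop_fraction=0.5)
print([s["id"] for s in kept])
```

With these toy vectors, samples `a` and `b` lie closest to the assumed toxic direction and are removed, leaving `c` and `d` for fine-tuning.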
cs.AI / 4 / 2605.00846
ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations
ClinicBot:基于指南的临床聊天机器人,具有优先证据检索增强生成和可验证引用
Abstract
Clinical diagnosis requires answers that are accurate, verifiable, and explicitly grounded in official guidelines. While large language models excel at natural language processing, their tendency to hallucinate undermines their utility in high-stakes medical contexts where precision is essential. Existing retrieval-augmented generation (RAG) systems treat all evidence equally, producing noisy context and generic answers misaligned with clinical practice. We present ClinicBot, an AI system that translates guideline recommendations into trustworthy clinical support through three key advances: (1) structured extraction of clinical guidelines into semantic units (recommendations, tables, definitions, narrative) with explicit provenance, (2) evidence prioritization that ranks content by clinical significance and guideline structure rather than textual similarity, and (3) a web-based interface that presents concise, actionable answers with verifiable evidence. We will demonstrate ClinicBot using diabetes questions from real patients and an additional diabetes risk assessment tool that is faithful to the American Diabetes Association (ADA) Standards of Care in Diabetes (2025). The demonstration will illustrate how semantic knowledge extraction and hierarchical evidence ranking can reliably operate in a multi-agent setting to process complex clinical guidelines at scale.
Chinese Translation
临床诊断需要准确、可验证并明确基于官方指南的答案。尽管大型语言模型在自然语言处理方面表现出色,但其幻觉倾向削弱了其在高风险医疗环境中的实用性,而精确性在这些环境中至关重要。现有的检索增强生成(RAG)系统对所有证据一视同仁,产生噪声背景和与临床实践不相符的通用答案。我们提出了ClinicBot,一种通过三大关键进展将指南推荐转化为可靠的临床支持的人工智能系统:(1)将临床指南结构化提取为语义单元(推荐、表格、定义、叙述),并明确其来源;(2)优先证据的提取,根据临床重要性和指南结构对内容进行排名,而非文本相似性;(3)一个基于网络的接口,提供简明且可操作的答案和可验证的证据。我们将通过展示患者实际提出的糖尿病问题以及一个符合美国糖尿病协会(ADA)2025年糖尿病护理标准的糖尿病风险评估工具,演示ClinicBot。该演示将说明语义知识提取和分层证据排名如何在多代理环境中可靠地处理复杂的临床指南,从而实现大规模应用。
cs.AI / 5 / 2605.00909
Accelerating battery research with an AI interface between FINALES and Kadi4Mat
通过AI界面加速电池研究:连接FINALES和Kadi4Mat
Abstract
The time-consuming formation process critically impacts the longevity of sodium-ion coin cells and End Of Life (EOL) performance. This study aims to optimize formation protocols for duration efficiency, targeting high-performance outcomes while minimizing the number of experiments to reduce resource consumption and accelerate discovery. Specifically, we consider two potentially competing objectives: minimizing formation time and maximizing EOL performance. Beyond this application focus, we also present a methodological contribution: a framework designed to enable interoperability between the FINALES and Kadi RDM ecosystems, which we employ to tackle our optimization problem. In this setup, the FINALES framework orchestrates experiment planning and execution on the POLiS MAP, while an active-learning agent implemented within Kadi4Mat guides experiment selection, using multi-objective batched Bayesian optimization to efficiently explore the parameter space. This interoperability enhancement enables coordinated, distributed collaboration across automated systems and human-operated workflows, bridging multiple research centers. Using this approach, we iteratively explore the trade-off between formation time and EOL performance and identify candidate solutions approximating the Pareto front. The resulting workflow demonstrates the capability of interoperable infrastructures to facilitate data-driven optimization in battery research, and establishes a transferable framework applicable to diverse materials science and engineering optimization tasks.
Chinese Translation
耗时的形成过程对钠离子硬币电池的寿命和使用结束(EOL)性能产生重要影响。本研究旨在优化形成协议以提高持续效率,目标是实现高性能结果,同时最小化实验数量以减少资源消耗并加速发现。具体而言,我们考虑两个潜在的竞争目标:最小化形成时间和最大化EOL性能。除了这一应用重点,我们还提出了一种方法论贡献:一个旨在实现FINALES和Kadi RDM生态系统之间互操作性的框架,我们利用该框架解决优化问题。在这个设置中,FINALES框架负责在POLiS MAP上协调实验规划和执行,而在Kadi4Mat中实施的主动学习代理则指导实验选择,使用多目标批次贝叶斯优化有效探索参数空间。这种互操作性增强了自动系统与人工操作工作流程之间的协调和分布式协作,连接了多个研究中心。采用这种方法,我们迭代探索形成时间与EOL性能之间的权衡,并识别近似帕累托前沿的候选解决方案。最终工作流展示了互操作基础设施在电池研究中促进数据驱动优化的能力,并建立了可迁移的框架,适用于多种材料科学和工程优化任务。
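The trade-off between formation time and EOL performance can be made concrete with a small Pareto-front filter. This is a minimal sketch, not the Bayesian optimization workflow used in the study: the protocol records, the field names `time_h` and `eol`, and all numbers are invented for illustration.

```python
def dominates(a, b):
    """a dominates b if it is no worse on both objectives and better on one.

    Objectives: minimize formation time, maximize EOL performance.
    """
    no_worse = a["time_h"] <= b["time_h"] and a["eol"] >= b["eol"]
    strictly_better = a["time_h"] < b["time_h"] or a["eol"] > b["eol"]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical formation protocols (toy values).
protocols = [
    {"name": "P1", "time_h": 10.0, "eol": 0.80},
    {"name": "P2", "time_h": 20.0, "eol": 0.90},
    {"name": "P3", "time_h": 25.0, "eol": 0.85},
    {"name": "P4", "time_h": 12.0, "eol": 0.78},
]
print([p["name"] for p in pareto_front(protocols)])
```

Here `P3` is dominated by `P2` and `P4` by `P1`, so only `P1` and `P2` remain as non-dominated candidates; an active-learning agent would propose new experiments to refine such a front.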
cs.AI / 6 / 2605.01030
Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries
面向人工智能工作流架构的效应透明治理:语义保留、表达最小性与可判定性边界
Abstract
We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establish seven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non-trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content-level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance-only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.
Chinese Translation
我们提出了结构性治理的人工智能工作流架构的机器检查形式化,并证明可以在不降低内部计算表现力的情况下施加效应级治理。使用 Rocq 8.19 中的交互树,我们定义了一个治理运算符 G,该运算符调解所有具有效应的指令,包括内存访问、外部调用和oracle(LLM)查询。我们的开发编译时无需承认引理,包含36个模块,约12000行Rocq代码,以及454个定理。我们确立了七个属性:(P1)治理的图灵完备性,(P2)治理的oracle表现力,(P3)一个可判定性边界,其中治理谓词是完全且在布尔组合下封闭的,同时语义程序属性在治理下仍然是非平凡且不可判定的,(P4)允许执行的目标保留,(P5)原始能力(计算、内存、推理、外部调用、可观察性)的表达最小性,(P6)包含的不对称性,表明结构治理严格优于内容级过滤,以及(P7)语义透明性:在所有治理允许的执行中,被治理的解释在观察上与未治理的解释是等同的(模组治理仅事件)。这些结果表明,治理与计算表现力是正交维度:治理约束程序的效应边界,同时对内部计算保持语义透明。
cs.AI / 7 / 2605.01032
Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries
治理执行的代数语义:幺半范畴、效果代数和同界边界
Abstract
We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three-axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance-preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability-indexed composition bundles programs with machine-checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property-based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.
Chinese Translation
我们提出了一种治理执行的代数语义,在该语义中,治理被公理化、可组合,并且与表达能力同界。该框架以32个Rocq模块(约12,000行代码,454个定理,0处admitted)为基础,建立在交互树和参数化的共归纳上。一个包含三个公理的GovernanceAlgebra记录(安全性、透明性、适当性)诱导出一个具有经过验证的五角形、三角形和六角形一致性的对称幺半范畴,其中每个张量组合都保持治理。代数效果系统限制了处理程序代数,使得在安全片段中只能构建保持治理的处理程序;在空能力集中的程序可证明仅发出可观察性指令。能力索引的组合将程序与机器检查的能力边界捆绑在一起,并且一个对偶保证定理证明了在所有组合运算符下,within_caps和gov_safe可以同时成立。最后的结果是同界边界:在我们的形式模型中,通过四个原始态射构造器可表达的每个程序在解释下都受到治理,并且每个受治理的程序都是这种程序的像。图灵完备性在治理内部得以保留;未经中介的输入/输出被排除在治理片段之外。治理拒绝被建模为安全的共归纳发散。治理代数是参数化的:任何实例化这三个公理的系统都会继承所有派生属性,包括收敛性、可组合闭合性和目标保持性。提取的OCaml程序在BEAM运行时作为NIF运行,基于属性的测试(70,000多个随机输入,零矛盾)确认了规范与运行时解释器之间的行为等价性。
cs.AI / 8 / 2605.01100
A Knowledge-Driven LLM-Based Decision-Support System for Explainable Defect Analysis and Mitigation Guidance in Laser Powder Bed Fusion
基于知识驱动的LLM决策支持系统用于激光粉末床熔融的可解释缺陷分析与缓解指导
Abstract
This work presents a knowledge-driven decision-support system that integrates structured defect knowledge with LLM-based reasoning to provide explainable defect diagnosis and mitigation guidance in manufacturing, using LPBF as a representative, safety-critical case study. The proposed ontology-integrated LLM-based decision support system for LPBF defect analysis and mitigation guidance is built on a knowledge base containing 27 known LPBF defect types organized into hierarchical categories and causal relationships. The developed system supports fuzzy natural language queries for systematic knowledge retrieval, literature-supported explanation of defects, and guidance on defect causes and mitigation strategies derived from encoded process knowledge. Furthermore, a multimodal image-assessment module based on foundation models enables descriptor-guided interpretation of representative microscopic defect images through semantic alignment scoring. The proposed framework was evaluated through qualitative comparisons with general-purpose vision-language models, an ablation study, and an inter-rater reliability analysis. Evaluation on the literature-derived dataset showed that the fully integrated configuration outperformed the other three evaluated system configurations, achieving a macro-average F1 score of 0.808. Additionally, inter-rater reliability analysis using Cohen's kappa indicated substantial agreement between the model outputs and the literature-derived reference labels. These findings suggest that ontology-guided knowledge representation can improve the consistency, interpretability, and practical usefulness of LLM-assisted LPBF defect analysis.
Chinese Translation
本研究提出了一种知识驱动的决策支持系统,该系统将结构化缺陷知识与基于LLM(大语言模型)的推理相结合,以便在制造过程中提供可解释的缺陷诊断和缓解指导,以LPBF(激光粉末床熔融)作为代表性的安全关键案例研究。所提出的集成本体的基于LLM的LPBF缺陷分析与缓解指导决策支持系统构建在一个包含27种已知LPBF缺陷类型的知识库之上,这些缺陷类型被组织成层次类别和因果关系。开发的系统支持模糊自然语言查询,以便进行系统的知识检索,提供文献支持的缺陷解释,以及根据编码的过程知识推导的缺陷原因和缓解策略的指导。此外,基于基础模型的多模态图像评估模块通过语义对齐评分,使得能够对代表性的显微缺陷图像进行描述符引导的解读。通过与通用视觉语言模型的定性比较、消融研究和评审者间可靠性分析来评估所提框架。在文献衍生数据集上的评估表明,完全集成的配置优于其他三种评估系统配置,达到了0.808的宏平均F1得分。此外,使用Cohen's kappa进行的评审者间可靠性分析显示,模型输出与文献衍生参考标签之间存在显著一致性。这些发现表明,以本体为指导的知识表示可以提高LLM辅助LPBF缺陷分析的一致性、可解释性和实际应用价值。
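For readers unfamiliar with the two agreement metrics named in the abstract above, the following is a minimal sketch of how a macro-average F1 score and Cohen's kappa are conventionally computed from a list of reference labels and a list of predicted labels. It is not the authors' evaluation code: the function names and the defect labels `porosity` and `crack` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro average)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


# Hypothetical toy labels, not data from the paper.
reference = ["porosity", "porosity", "crack", "crack"]
predicted = ["porosity", "crack", "crack", "crack"]
print(macro_f1(reference, predicted), cohens_kappa(reference, predicted))
```

Macro averaging weights every defect class equally regardless of how often it occurs, which is why it is a common choice when some classes are rare.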
cs.AI / 9 / 2605.01101
Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy
虚拟语音治疗师:用于个性化和监督治疗的临床医生参与的人工智能语音治疗代理
Abstract
This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system, which facilitates real-time stuttering assessment and personalized therapy planning, is available online at: https://vocametrix.com/ai/stuttering-therapy-planning-agent
Chinese Translation
本文开发了一种智能代理平台——虚拟语音治疗师(Virtual Speech Therapist, VST),旨在简化口吃评估并通过自动化和自适应的人工智能驱动工作流程提供定制的治疗计划。VST集成了最先进的基于深度学习的口吃分类和多代理大型语言模型(Large Language Model, LLM)推理,以支持基于证据的临床决策。VST首先进行患者语音样本的采集和特征提取,然后对口吃类型进行稳健分类。在此基础上,VST启动一个代理推理过程,专门的LLM代理自主生成、评议并逐步完善个性化治疗计划。一个专门的评审代理对所有生成的治疗计划进行评估,以确保临床安全性、方法论的合理性及与同行评审的证据和现有专业指南的一致性。最终输出是一个全面的、特定于患者的治疗草案,供临床医生审阅。系统结合临床医生的反馈,随后生成适合患者实施的最终治疗计划,从而维持临床医生参与的模式。专家语音治疗师的实验评估确认,VST始终生成高质量、基于证据的治疗建议。这些发现展示了该系统增强临床工作流程、减轻临床医生负担和改善语言障碍患者治疗结果的潜力。该系统的交互用户界面可在线访问:https://vocametrix.com/ai/stuttering-therapy-planning-agent ,方便实时口吃评估和个性化治疗计划制定。
cs.AI / 10 / 2605.01102
Towards Multi-Agent Autonomous Reasoning in Hydrodynamics
面向水动力学的多智能体自主推理
Abstract
Single-agent systems (SAS) have become the default pattern for LLM-driven scientific workflows, but routing planning, tool use, and synthesis through a single context window comes with a well-known cost: as tool specifications and observational traces accumulate, the effective context available for each decision shrinks, and end-to-end reliability suffers. We present a multi-agent system (MAS) prototype for hydrodynamics in which specialized agents are coordinated through a Layer Execution Graph (LEG). A planner agent constructs query-specific execution topologies from natural-language routing heuristics that capture domain knowledge without hard-coding it as rigid control logic; specialist agents operate under strict tool allowlists and occupy complementary data-class roles. Between layers, consolidator agents fuse parallel outputs into concise briefs, and a reporter agent synthesizes the final response, while the runtime logs provenance for every tool invocation to support auditability. All benchmarks, ablations, and stress tests use Claude Sonnet 4.6 as the backbone model for both specialist and general-purpose agents. Evaluated on 37 queries spanning six complexity categories, the prototype achieves 93.6% factual precision with a 100% pass rate. Accuracy remains above 90% across runs from single-threaded to five independent parallel tracks, and under simulated loss of individual data sources the system degrades gracefully, still returning substantive partial answers. Together, these results suggest that planner-guided, graph-structured multi-agent orchestration can meaningfully alleviate the context-saturation bottlenecks that constrain monolithic single-agent architectures.
Chinese Translation
单智能体系统(SAS)已成为基于大型语言模型(LLM)驱动的科学工作流的默认模式,但通过单一上下文窗口进行路由规划、工具使用和综合带来了众所周知的代价:随着工具规格和观察踪迹的累积,每个决策可用的有效上下文不断缩小,端到端的可靠性受损。我们提出了一个水动力学的多智能体系统(MAS)原型,其中专门的智能体通过层执行图(LEG)进行协调。规划智能体根据自然语言路由启发式构建查询特定的执行拓扑,这些启发式捕捉领域知识而不将其硬编码为严格的控制逻辑;专家智能体在严格的工具白名单下操作,并占据互补的数据类别角色。在层之间,整合智能体将并行输出融合为简洁的摘要,而报告智能体综合最终响应,同时运行时记录每个工具调用的来源,以支持可审计性。所有基准测试、消融实验和压力测试均以Claude Sonnet 4.6作为专家和通用智能体的基础模型。在涵盖六个复杂性类别的37个查询中,该原型实现了93.6%的事实精度,且通过率为100%。在从单线程到五个独立并行通道的多次运行中,准确率始终高于90%,在模拟个别数据源丢失的情况下,系统平稳降级,仍然能够返回实质性的部分答案。这些结果表明,规划引导的图结构多智能体编排能够有效缓解限制单体单智能体架构的上下文饱和瓶颈。
cs.AI / 11 / 2605.01120
New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search
通过强化LLM进化搜索获得的Zarankiewicz数的新界限
Abstract
The Zarankiewicz number $\textbf{Z}(m, n, s, t)$ is the maximum number of edges in a bipartite graph $G_{m, n}$ such that there is no complete $K_{s, t}$ bipartite subgraph. We determine for the first time the exact values of three Zarankiewicz numbers: $\textbf{Z}(11, 21, 3, 3)=116$, $\textbf{Z}(11, 22, 3, 3)=121$, and $\textbf{Z}(12, 22, 3, 3)=132$. We further establish lower bounds for 41 more Zarankiewicz numbers, including several that are within one edge of the best known upper bound, and we match the established value in four more closed cases. Our results are obtained using OpenEvolve, an open-source evolutionary algorithm based on Large Language Models (LLMs) that iteratively improves algorithms for generating mathematical constructions by optimizing a reward signal which we tailored for this specific problem. These findings provide new extremal graph constructions and demonstrate the potential of LLM-guided evolutionary search to contribute to mathematical research. In addition to presenting the resulting constructions, we report the generation algorithms produced, describe the relevant implementation details, and provide our computational costs. Our costs are remarkably low, at less than \$30 for each Zarankiewicz parameter combination, showing that LLM-guided evolutionary search can be an inexpensive, reproducible, and accessible tool for discovering new combinatorial constructions.
Chinese Translation
Zarankiewicz数 $\textbf{Z}(m, n, s, t)$ 是指二部图 $G_{m, n}$ 在不包含完全二部子图 $K_{s, t}$ 的前提下所能拥有的最大边数。我们首次确定了三个Zarankiewicz数的精确值:$\textbf{Z}(11, 21, 3, 3)=116$,$\textbf{Z}(11, 22, 3, 3)=121$,以及 $\textbf{Z}(12, 22, 3, 3)=132$。此外,我们另外为41个Zarankiewicz数建立了下界,其中若干下界与已知最佳上界相差不超过一条边,并在另外四个已确定精确值的情形中达到了已知值。我们的结果是通过OpenEvolve获得的,这是一种基于大型语言模型(LLMs)的开源进化算法,它通过优化为特定问题量身定制的奖励信号,逐步改进生成数学结构的算法。这些发现提供了新的极值图结构,并展示了LLM引导的进化搜索对数学研究的潜在贡献。除了展示生成的结构外,我们还报告了所产生的生成算法,描述了相关的实现细节,并提供了我们的计算成本。我们的成本非常低,每个Zarankiewicz参数组合的费用不足30美元,表明LLM引导的进化搜索可以成为一种廉价、可重复和可获取的工具,用于发现新的组合结构。
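To make the definition of $\textbf{Z}(m, n, s, t)$ in the abstract above concrete, here is a minimal sketch that checks whether a bipartite graph, given as a 0/1 biadjacency matrix, contains a complete $K_{s, t}$ subgraph, and that finds the Zarankiewicz number by exhaustive search. This is only an illustration for very small parameters and has nothing to do with the OpenEvolve search used in the paper. The function names are hypothetical, and the convention that the $s$ vertices come from the side of size $m$ and the $t$ vertices from the side of size $n$ is an assumption.

```python
from itertools import combinations, product


def contains_kst(adj, s, t):
    """True if some s rows share at least t common neighbor columns."""
    n_cols = len(adj[0]) if adj else 0
    for rows in combinations(range(len(adj)), s):
        # Columns adjacent to every one of the chosen rows.
        common = [c for c in range(n_cols) if all(adj[r][c] for r in rows)]
        if len(common) >= t:
            return True
    return False


def zarankiewicz_brute_force(m, n, s, t):
    """Largest edge count of an m-by-n bipartite graph with no K_{s,t}.

    Enumerates all 2**(m*n) graphs, so it is only usable for tiny m and n.
    """
    best = 0
    for bits in product((0, 1), repeat=m * n):
        edges = sum(bits)
        if edges <= best:
            continue  # cannot improve on the current best
        adj = [bits[r * n:(r + 1) * n] for r in range(m)]
        if not contains_kst(adj, s, t):
            best = edges
    return best


print(zarankiewicz_brute_force(3, 3, 2, 2))
```

The exponential cost of this enumeration is what makes parameters such as (11, 21, 3, 3) out of reach for naive search and motivates constructive methods that produce good graphs directly.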
cs.AI / 12 / 2605.01123
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
PERSA:用于教授风格个性化反馈的强化学习与大型语言模型
Abstract
Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM's style with a specific instructor's tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation so that it aligns with a target instructor's style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning: it updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness. For example, on APPS it boosts the Style Alignment Score (SAC) to 96.2% (from 34.8% for Base), with Correctness Accuracy (CA) of up to 100% on both Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).
Chinese Translation
大型语言模型(LLMs)能够在教育环境中提供自动反馈,但将LLM的风格与特定讲师的语气对齐,同时保持诊断的正确性仍然具有挑战性。我们探讨如何更新LLM以生成自动反馈,以便与目标讲师的风格对齐而不牺牲核心知识?我们研究了人类反馈强化学习(RLHF)如何使基于变换器的LLM适应生成与教授评分风格相匹配的编程反馈。我们提出了PERSA,这是一个RLHF管道,结合了对教授示范的有监督微调、基于成对偏好的奖励建模和近端策略优化(Proximal Policy Optimization, PPO),同时有意识地将学习限制在具有风格特征的组件上。受变换器内部分析启发,PERSA应用了参数高效的微调。它仅更新顶部变换器块及其前馈投影,最小化全局参数漂移,同时增加风格可控性。我们在三个代码反馈基准(APPS、PyFiXV和CodeReviewQA)上评估了我们提出的方法,使用互补的风格对齐和准确性度量。在Llama-3和Gemma-2两个基础模型上,PERSA在保留正确性的同时,提供了最强的教授风格迁移,例如在APPS上,将风格对齐得分(Style Alignment Score, SAC)提升至96.2%(由Base的34.8%提升)且正确性准确率(Correctness Accuracy, CA)达到100%。总体而言,PERSA为个性化教育反馈提供了一条切实可行的途径,通过对齐其所表达的内容(内容正确性)和表达方式(讲师般的语气和结构)。
cs.AI / 13 / 2605.01130
Iterative Finetuning is Mostly Idempotent
迭代微调大多数情况下是幂等的
Abstract
If a model has some behavioral tendency, such as sycophancy or misalignment, and it is trained on its own outputs, will the tendency be amplified in the next generation of models? We study this question by training a series of models where each model is finetuned on data generated by its predecessor, and the initial model is seeded with some persona or belief. We test three settings: supervised finetuning (SFT) on instruct models, synthetic document finetuning (SDF) on base models, and direct preference optimization (DPO). In the SFT and SDF settings, traits mostly decay or remain constant so that further finetuning cycles do nothing. In rare cases when amplification occurs, it generally comes at the cost of coherence. In the DPO setting, trait amplification can reliably occur when a model is continually trained with a preference for its own outputs, but vanishes when models are reinitialized at each cycle. Overall, our results suggest that amplification most likely comes from continual post-training, and limiting this stage may be an effective defense. For non-RL finetuning, trait amplification is rare and very sensitive to data quantity, making it significantly less likely to occur accidentally. Finally, the amplification-coherence tradeoff serves as a natural deterrent against trait amplification.
Chinese Translation
如果一个模型具有某种行为倾向,例如拍马屁或不对齐,并且它是在其自身输出的基础上进行训练的,那么这种倾向在下一代模型中是否会被放大?我们通过训练一系列模型来研究这个问题,其中每个模型是在其前任生成的数据上进行微调,而初始模型则带有某种个性或信念。我们测试了三种设置:在指令模型上进行的监督微调(SFT),在基础模型上的合成文档微调(SDF),以及直接偏好优化(DPO)。在SFT和SDF设置中,特征大多衰减或保持不变,因此进一步的微调周期不会产生任何效果。在少数放大发生的情况下,通常是以一致性为代价的。在DPO设置中,当一个模型持续以自己输出的偏好进行训练时,特征放大可以可靠地发生,但在每个周期重新初始化模型时,这种放大则消失。总体而言,我们的结果表明,放大更可能源于持续的后训练,限制这个阶段可能是一个有效的防御措施。对于非强化学习微调,特征放大是罕见的且对数据量非常敏感,因此意外发生的可能性大大降低。最后,放大与一致性之间的权衡自然地对特征放大形成了威慑。
cs.AI / 14 / 2605.01134
To Use AI as Dice of Possibilities with Timing Computation
将人工智能作为可能性的骰子与时序计算
Abstract
The dominant noun-based modeling paradigm has fundamentally constrained AI development, precluding any adequate representation of the future as an open temporal dimension. This paper introduces a verb-based paradigm, together with precise definitions of timing computation and possibility, that enables AI to function as an effective instrument for realizing the grammar of our thought. Applied to longitudinal EHR data from 3,276 breast cancer patients, the framework empirically demonstrates: (1) automatic discovery of clinically significant patient trajectories, and (2) counterfactual timing deduction. Both results are purely data-driven, require no prior domain knowledge, and, to our knowledge, represent the first such demonstrations in the machine learning literature.
Chinese Translation
以名词为主导的建模范式在根本上限制了人工智能的发展,阻碍了未来作为开放时间维度的充分表达。本文引入了一种基于动词的范式,以及对时序计算和可能性的精确定义,使人工智能能够作为实现我们思维语法的有效工具。应用于来自3,276名乳腺癌患者的纵向电子健康记录数据,该框架在实证上展示了:(1) 临床显著患者轨迹的自动发现,以及 (2) 反事实时序推断。这两个结果都是纯数据驱动的,无需先前的领域知识,并且在我们所知的机器学习文献中代表了首次此类展示。
cs.AI / 15 / 2605.01143
A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents
低延迟欺诈检测层:用于检测LLM驱动代理中的对抗性交互模式
Abstract
Large Language Model (LLM)-powered agents demonstrate strong capabilities in autonomous task execution, tool use, and multi-step reasoning. However, their increasing autonomy also introduces a new attack surface: adversarial interactions can manipulate agent behavior through direct prompt injection, indirect content attacks, and multi-turn escalation strategies. Existing defense strategies focus on prompt-level filtering and rule-based guardrails, which are often insufficient when risk emerges gradually across interaction sequences. In this work, we propose a complementary defense mechanism: a low-latency fraud detection layer for detecting adversarial interaction patterns in LLM-powered agents. Instead of determining whether a single prompt is malicious, our approach models risk over interaction trajectories using structured runtime features derived from prompt characteristics, session dynamics, tool usage, execution context, and fraud-inspired signals. The detection layer can be implemented with lightweight models, enabling low-latency real-time deployment. To evaluate the framework, we construct a synthetic corpus of 12,000 multi-turn agent interactions generated from parameterized templates that simulate realistic agentic workflows. Using 42 structured features and an XGBoost classifier, our detector runs over 9 times faster than LLM-based detectors. Through experiments and ablation studies, our work suggests that interaction-level behavioral detection should become a core component of deployment-time defense for LLM-powered agents.
Chinese Translation
大型语言模型(LLM)驱动的代理在自主任务执行、工具使用和多步骤推理方面表现出强大的能力。然而,它们日益增强的自主性也引入了一种新的攻击面:对抗性交互可以通过直接提示注入、间接内容攻击和多轮升级策略操控代理行为。现有的防御策略侧重于提示级别的过滤和基于规则的保护措施,而在风险在交互序列中逐渐显现时,这些措施往往不足。在本研究中,我们提出了一种补充防御机制:用于检测LLM驱动代理中对抗性交互模式的低延迟欺诈检测层。我们的方法并不是简单地判断单个提示是否有恶意,而是通过利用从提示特征、会话动态、工具使用、执行环境以及受欺诈启发的信号中提取的结构化运行时特征,对交互轨迹上的风险进行建模。检测层可以使用轻量级模型实现,从而实现低延迟的实时部署。为了评估该框架,我们构建了一个合成语料库,其中包含12,000个从参数化模板生成的多轮代理交互,这些模板模拟真实的代理工作流。使用42个结构化特征和XGBoost分类器,我们的检测器比基于LLM的检测器快超过9倍。通过实验和消融研究,我们的工作表明,交互级行为检测应成为LLM驱动代理部署时防御的核心组成部分。
cs.AI / 16 / 2605.01147
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment
立场:代理人工智能中的安全性和公平性依赖于交互拓扑,而非模型规模或对齐性
Abstract
As large language models are increasingly deployed as interacting agents in high-stakes decisions, the AI safety community assumes that safety properties of individual models will compose into safe multi-agent behavior. This position paper argues that this assumption is fundamentally mistaken. In agentic AI, safety is determined by interaction topology, not model weights. When agents deliberate sequentially or aggregate via parallel voting with a judge, the structure of information flow and decision coupling dominates outcomes. Evidence across model families and scales reveals three persistent topology-driven pathologies: ordering instability, where system behavior depends primarily on agent sequence; information cascades, where early judgments propagate regardless of correctness; and functional collapse, where systems satisfy fairness metrics while abandoning meaningful risk discrimination. Contrary to intuition, scaling to more capable models strengthens these effects by increasing consensus formation and reducing the challenge of initial decisions. These failure modes are invisible to model-centric evaluation and alignment procedures. We argue that agentic AI must be treated as a dynamical system rather than a collection of aligned components. Interaction topology must become a primary target of safety evaluation and regulation, with systems required to demonstrate robustness across architectural variations before deployment.
Chinese Translation
随着大型语言模型在高风险决策中逐渐被作为交互代理部署,AI安全领域假设单个模型的安全特性将组合成安全的多代理行为。本文主张这一假设根本上是错误的。在代理人工智能中,安全性由交互拓扑决定,而非模型权重。当代理依次进行深思或通过与法官的平行投票进行聚合时,信息流动及决策耦合的结构主导了结果。跨模型家族和规模的证据揭示了三种持久的拓扑驱动病态:排序不稳定性,系统行为主要取决于代理序列;信息级联,早期判断在不考虑正确性的情况下传播;以及功能崩溃,系统满足公平性指标但放弃有意义的风险区分。与直觉相反,扩展到更强大的模型增强了这些效应,通过增加共识形成和减少初始决策的挑战。这些失败模式对以模型为中心的评估和对齐程序是不可见的。我们主张,代理人工智能必须被视为一个动态系统,而不是一组对齐的组件。交互拓扑必须成为安全评估和监管的主要目标,系统在部署前需展示在架构变体之间的鲁棒性。
cs.AI / 17 / 2605.01148
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
在自然环境中的算术:Llama利用十进制加法推理循环概念
Abstract
Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., "what month is six months after August?"). Even though Llama-3.1-8B's representations for these concepts are circularly structured, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using base-10 addition (six + August=14). Then, it maps this sum back to cyclic concept space (14->February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums--in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12 for months). Furthermore, we identify a sparse set of 28 MLP neurons re-used across all tasks (approximately 0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period. Our work highlights how an interplay between causal abstraction and feature geometry can deepen our mechanistic understanding of LMs.
Chinese Translation
表示中的结构是否意味着计算中的结构?我们研究了Llama-3.1-8B如何对循环概念(例如,“八月之后六个月是哪个月份?”)进行推理。尽管Llama-3.1-8B对这些概念的表示是循环结构的,但我们发现该模型并不直接计算循环概念周期内的模加法(例如,月份的周期是12),而是跨任务重复使用一个通用的加法机制,该机制独立于特定概念的几何结构。首先,它通过十进制加法计算两个输入的和(六 + 八月 = 14)。然后,它将这个和映射回循环概念空间(14 -> 二月)。我们展示了Llama-3.1-8B使用任务无关的傅里叶特征来计算这些和——事实上,这些特征的周期遵循标准十进制加法,例如2、5和10,而不是循环概念周期(例如,月份的周期是12)。此外,我们识别出一组稀疏的28个多层感知器(MLP)神经元跨所有任务重复使用(大约是第18层MLP的0.2%),这些神经元可以被划分为不相交的集群,每个集群计算一个具有不同周期的傅里叶特征的和。我们的工作突显了因果抽象与特征几何之间的相互作用如何加深我们对语言模型机制的理解。
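The two-step computation described in the abstract above (ordinary base-10 addition first, then a mapping back into the cyclic concept space) can be written out as a short sketch. This only restates the arithmetic of the abstract's own example, "six months after August"; it says nothing about how the model implements it internally, and the function name is hypothetical.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def months_after(offset, month):
    """Return (plain base-10 sum, resulting month) for `offset` months after `month`."""
    # Step 1: ordinary addition on the 1-based month index (six + August = 14).
    total = offset + (MONTHS.index(month) + 1)
    # Step 2: map the sum back into the 12-month cycle (14 -> February).
    return total, MONTHS[(total - 1) % len(MONTHS)]


print(months_after(6, "August"))  # (14, 'February')
```

The alternative the abstract contrasts this with would be to reduce modulo 12 inside the addition itself, never forming the intermediate value 14.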
cs.AI / 18 / 2605.01164
LLMs Should Not Yet Be Credited with Decision Explanation
大型语言模型尚不应被赋予决策解释的能力
Abstract
This position paper argues that LLMs should not yet be credited with decision explanation. This matters because recent work increasingly treats accurate behavioral prediction, plausible rationales, and outcome-conditioned reasoning traces as evidence that LLMs explain why people decide as they do, risking a premature redefinition of what counts as explanatory progress in human decision modeling. We first distinguish three claims with different evidential burdens: decision prediction, rationale generation, and decision explanation. We then argue that the evidence most commonly offered for LLM-based decision accounts directly supports the first two claims, and sometimes explanatory hypothesis generation, but does not distinguish decision explanation from prediction-supportive rationalization. Next, we propose a bridge standard for decision-explanation credit: stronger claims should specify explanatory targets, discriminate against weaker rationalizer alternatives, use target-appropriate process- or intervention-sensitive validation, and bound their scope. We then situate this standard against competing views and related literatures, clarifying why it preserves the value of LLMs as predictors, narrators, and hypothesis generators while resisting premature explanatory credit. We conclude with a principle of credit calibration: LLMs should be credited for the strongest claim their evidence warrants, and no stronger; if adopted, this principle can help turn LLMs from persuasive narrators of decisions into more reliable instruments for discovering, testing, and communicating explanations of human behavior.
Chinese Translation
本文提出的立场论文认为,大型语言模型(LLMs)尚不应被赋予决策解释的能力。这一问题的重要性在于,近期研究越来越多地将准确的行为预测、合理的理由和基于结果的推理轨迹视为大型语言模型解释人们为何如此决策的证据,这可能导致对人类决策建模中解释性进展的过早重新定义。我们首先区分了三种具有不同证据负担的主张:决策预测、理由生成和决策解释。然后我们认为,为LLM驱动的决策叙述所提供的证据主要支持前两种主张,有时也支持解释性假设的生成,但并没有区分决策解释和支持预测的合理化。接下来,我们提出了一个用于决策解释信用的桥梁标准:更强的主张应明确解释目标,区别较弱的合理化替代方案,使用适当的过程或干预敏感的验证,并限制其适用范围。随后,我们将这一标准与竞争观点和相关文献进行了比较,阐明为何它在维护LLMs作为预测者、叙述者和假设生成器的价值的同时,抵制了过早的解释信用。最后,我们总结出一个信用校准的原则:LLMs应当根据其证据所支持的最强主张赋予信用,而不是更强的主张;如果采用这一原则,可以帮助将LLMs从决策的说服性叙述者转变为更可靠的人类行为解释发现、测试和传播工具。
cs.AI / 19 / 2605.01189
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON:一个面向临床解释性的神经符号系统
Abstract
Clinical AI adoption is hindered by the black-box/grey-box nature of high-performing models, which lack the ontological grounding and narrative transparency required for professional-level explainability. We present NEURON, a neuro-symbolic system designed to enhance both predictive reliability and clinical interpretability. NEURON integrates SNOMED CT ontology-informed structural representations with machine learning models to bridge the gap between raw data and medical nomenclature. To facilitate human-aligned interaction, the system utilizes a Retrieval-Augmented Generation (RAG) grounded LLM layer to synthesize SHAP feature attributions and patient-specific clinical notes into coherent, natural-language explanations. Validated on the MIMIC-IV dataset for Acute Heart Failure mortality prediction, NEURON improved the AUC from 0.74-0.77 to 0.84-0.88 and significantly outperformed raw SHAP visualizations in human-aligned metrics (0.85 vs. 0.50). Our results demonstrate that NEURON offers a robust, scalable engineering solution for deploying trustworthy, human-centered connected health applications.
Chinese Translation
临床人工智能的应用受到高性能模型的黑箱/灰箱特性的制约,这些模型缺乏专业级解释所需的本体基础和叙述透明度。我们提出了NEURON,这是一个旨在增强预测可靠性和临床可解释性的神经符号系统。NEURON将基于SNOMED CT本体的信息结构表示与机器学习模型相结合,以弥合原始数据与医学命名法之间的差距。为促进与人类的互动,该系统利用检索增强生成(Retrieval-Augmented Generation, RAG)基础的LLM层,将SHAP特征归因和患者特定的临床记录合成为连贯的自然语言解释。在MIMIC-IV数据集上进行急性心力衰竭死亡率预测的验证中,NEURON的AUC从0.74-0.77提升至0.84-0.88,并在与人类对齐的指标上显著超越原始SHAP可视化(0.85对0.50)。我们的结果表明,NEURON为部署值得信赖、以人为中心的互联健康应用提供了一个稳健、可扩展的工程解决方案。
cs.AI / 20 / 2605.01203
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
GR-Ben:用于评估过程奖励模型的通用推理基准
Abstract
Process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to detect process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To close this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRMs' performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker than in mathematical reasoning. (2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs perform worse at detecting computational errors. We hope GR-Ben can foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
Chinese Translation
目前,过程奖励模型(PRMs)在测试时扩展方面展现了显著的潜力。由于大型语言模型(LLMs)在处理广泛的推理和决策任务时,常常会生成错误的中间推理步骤,因此PRMs必须具有在现实场景中检测过程级错误的能力。然而,现有基准主要集中在数学推理上,因此未能全面评估PRMs在多样推理场景中的错误检测能力。为了解决这一问题,我们引入了GR-Ben,这是一个专门设计的过程级基准,用于评估PRM在两个主要推理领域(科学和逻辑)及九个子领域中的表现。我们对22个多样的模型进行了广泛实验,这些模型包括PRMs和LLMs,并得出了两个关键发现:(1) 在数学推理之外的领域,现有PRMs和LLMs的错误检测能力明显较弱。(2) 一般而言,PRMs在识别基于知识的错误方面较为欠缺,而LLMs在检测计算错误方面表现较差。我们希望GR-Ben能够促进未来在通用领域上PRMs的研究,从而增强LLMs的推理能力。
cs.AI / 21 / 2605.01208
Faithful Mobile GUI Agents with Guided Advantage Estimator
具有引导优势估计器的忠实移动图形用户界面代理
Abstract
Vision-language model based graphical user interface (GUI) agents have shown strong interaction capabilities. However, they often behave unfaithfully, relying on memorized shortcuts rather than grounding actions in displayed screen evidence or user instructions. To address this, we propose Faithful-Agent, a faithfulness-first framework that reformulates GUI interaction to prioritize evidence groundedness and internal consistency. Faithful-Agent employs a two-stage pipeline: (i) a faithfulness-oriented SFT stage to instill abstention behaviors under evidence perturbations; (ii) an RFT stage that further amplifies faithfulness by introducing the guided advantage estimator (GuAE), an anchor-based and variance-adaptive advantage tempering mechanism built upon GRPO. GuAE prevents advantage collapse in low-variance rollout groups under sparse GUI rewards, and with a thought-action consistency reward, Faithful-Agent (Stage II) elevates the Trap SR from 13.88% to 80.21% relative to the baseline, while preserving robust general instruction-following performance.
Chinese Translation
基于视觉-语言模型的图形用户界面(GUI)代理显示出了强大的交互能力。然而,它们往往表现不忠实,依赖于记忆的快捷方式,而不是将行动与显示屏证据或用户指令相结合。为了应对这一问题,我们提出了Faithful-Agent,这是一个优先考虑忠实性的框架,通过重新构建GUI交互来优先考虑证据的基础性和内部一致性。Faithful-Agent采用了两阶段的流程:(i)以忠实性为导向的SFT阶段,以在证据扰动下培养弃权行为;(ii)RFT阶段进一步强化忠实性,通过引入引导优势估计器(GuAE),这是一种基于锚点的和方差自适应的优势调节机制,建立在GRPO之上。GuAE可防止在稀疏的GUI奖励下低方差rollout组中的优势坍缩;结合思考-行动一致性奖励,Faithful-Agent(第二阶段)将Trap SR从基线的13.88%提升至80.21%,同时保持强大的通用指令遵循性能。
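The "advantage collapse in low-variance rollout groups" mentioned in the abstract above is easiest to see in the standard group-relative advantage that GRPO computes, where each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below shows only that baseline normalization and is not GuAE, whose anchor-based tempering is not specified in the abstract. The function name, the `eps` value, and the use of the population standard deviation are assumptions, since implementations differ on these details.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the rollout group (assumed)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse reward: every rollout in the group fails, so all advantages are zero
# and the group contributes no learning signal.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))

# One success in the group yields a positive advantage for that rollout only.
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

With sparse GUI rewards, groups in which every rollout receives the same reward are common, which is the regime a variance-adaptive estimator is meant to handle.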
cs.AI / 22 / 2605.01214
Agentic AI Systems Should Be Designed as Marginal Token Allocators
代理性的人工智能系统应被设计为边际代币分配者
Abstract
This position paper argues that agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generators priced by the unit. We follow a single request -- a developer asking a coding agent to fix a failing test -- through four economic layers that today are designed in isolation: a router that decides which model answers, an agent that decides whether to plan, act, verify, or defer, a serving stack that decides how to produce each token, and a training pipeline that decides whether the trace is worth learning from. We show that all four layers are solving the same first-order condition -- marginal benefit equals marginal cost plus latency cost plus risk cost -- with different index sets and different prices. The framing is deliberately minimal: we do not propose a complete theory of AI economics. But adopting marginal token allocation as the shared accounting object explains why systems that locally minimize tokens globally misallocate them, predicts a small set of recurring failure modes (over-routing, over-delegation, under-verification, serving congestion, stale rollouts, cache misuse), and points to a concrete research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.
Chinese Translation
本立场论文主张,代理性的人工智能系统应被设计和评估为边际代币分配经济体,而不是按单位定价的文本生成器。我们通过四个经济层次跟踪一个单一请求(开发者要求编码代理修复一个失败的测试),这些层次目前是在孤立中设计的:一个决定由哪个模型回答的路由器,一个决定是否计划、执行、验证或推迟的代理,一个决定如何生成每个代币的服务栈,以及一个决定是否值得从轨迹中学习的训练管道。我们展示了所有四个层次正在解决相同的一阶条件(边际收益等于边际成本加上延迟成本加上风险成本),但具有不同的索引集和不同的价格。这个框架故意是简约的:我们并不提出完整的人工智能经济学理论。但采用边际代币分配作为共享会计对象可以解释为何局部最小化代币的系统在全局上错误分配代币,预测一小组反复出现的失败模式(过度路由、过度委派、验证不足、服务拥堵、过时的rollout、缓存误用),并指明了关于代币感知评估、自主性定价、拥堵定价服务和风险调整的强化学习预算的具体研究议程。
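The first-order condition stated in words in the abstract above can be written as one equation per layer. The notation is an assumption introduced here for illustration and does not come from the paper: $\ell$ ranges over the four layers (router, agent, serving stack, training pipeline), $I_\ell$ is the layer's index set of allocation decisions, $x_i$ is the tokens allocated to decision $i$, $B_\ell$ is the layer's benefit, and $c_\ell$, $\lambda_\ell$, $\rho_\ell$ are its marginal token, latency, and risk prices.

```latex
\[
\frac{\partial B_\ell}{\partial x_i}
  \;=\;
  \underbrace{c_\ell}_{\text{marginal cost}}
  + \underbrace{\lambda_\ell}_{\text{latency cost}}
  + \underbrace{\rho_\ell}_{\text{risk cost}},
\qquad i \in I_\ell .
\]
```

Read this way, the abstract's claim is that the four layers share the form of this condition and differ only in $I_\ell$ and in the prices $(c_\ell, \lambda_\ell, \rho_\ell)$.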
cs.AI / 23 / 2605.01222
Zero-Shot Signal Temporal Logic Planning with Disjunctive Branch Selection in Dynamic Semantic Maps
动态语义地图中具有析取分支选择的零样本信号时序逻辑规划
Abstract
Signal Temporal Logic (STL) offers verifiable task specifications and is crucial for safety-critical control. Yet STL planning remains challenging: exact optimization-based methods are often too slow, and learning-based methods struggle to generalize across varying environments. We propose a zero-shot STL planning solver for variable-map environments that generates feasible trajectories without retraining. By integrating a map-conditioned Transformer architecture with a lightweight heuristic, our approach effectively handles complex disjunctive (OR) subformulas. Furthermore, we leverage Transitive Reinforcement Learning (TRL) to ensure consistent temporal grounding and logical coherence across decomposed sub-tasks. Experiments on dynamic semantic maps with diverse obstacle layouts demonstrate consistent gains, highlighting the framework's superior zero-shot generalization to changing environments and broad STL coverage.
Chinese Translation
信号时序逻辑(STL)提供可验证的任务规范,对于安全关键控制至关重要。然而,STL规划仍然面临挑战:精确的基于优化的方法往往速度太慢,而基于学习的方法在不同环境之间的泛化能力较差。我们提出了一种零样本STL规划求解器,适用于可变地图环境,能够生成可行轨迹而无需重新训练。通过将地图条件的Transformer架构与轻量级启发式方法相结合,我们的方法有效处理复杂的析取(OR)子公式。此外,我们利用传递性强化学习(TRL)确保在分解子任务之间保持一致的时间基础和逻辑连贯性。在具有多样障碍布局的动态语义地图上的实验表明,我们的方法在不断变化的环境中展现出一致的性能提升,突显了框架在零样本泛化和广泛的STL覆盖方面的优势。
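To make the role of disjunctive (OR) subformulas concrete, here is a minimal Python sketch of the standard quantitative (robustness) semantics of STL on a one-dimensional trajectory, where a disjunction takes the maximum robustness of its branches. It is not the paper's Transformer-based planner or its heuristic; the regions, the trajectory, and the argmax branch choice are hypothetical.

```python
def inside_margin(x, low, high):
    """Signed margin of x with respect to [low, high]; positive when inside."""
    return min(x - low, high - x)


def eventually(robustness_over_time):
    """Robustness of 'eventually phi': the best value along the horizon."""
    return max(robustness_over_time)


def always(robustness_over_time):
    """Robustness of 'always phi': the worst value along the horizon."""
    return min(robustness_over_time)


def disjunction(*branch_robustness):
    """Robustness of 'phi_1 OR ... OR phi_n': the best branch."""
    return max(branch_robustness)


# Hypothetical 1-D trajectory and two goal regions A and B.
trajectory = [0.0, 1.0, 2.0, 3.0, 4.0]
reach_a = eventually([inside_margin(x, 2.5, 3.5) for x in trajectory])
reach_b = eventually([inside_margin(x, 6.0, 7.0) for x in trajectory])

# Specification: eventually reach A OR eventually reach B.
rho = disjunction(reach_a, reach_b)
# A simple (assumed) branch selection: commit to the branch with higher robustness.
branch = "A" if reach_a >= reach_b else "B"
print(rho, branch)
```

A positive `rho` means the trajectory satisfies the specification; the branch that attains the maximum is the one a planner could commit to.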
cs.AI / 24 / 2605.01250
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
EO-Gym:一个多模态交互环境用于地球观测代理
Abstract
Earth Observation (EO) analysis is inherently interactive: resolving uncertainty often requires expanding the region of interest, retrieving historical observations, and switching across sensors such as optical and Synthetic Aperture Radar. However, most EO benchmarks collapse this process into fixed-input, single-turn tasks. To address this gap, we present EO-Gym, a controlled executable framework for multimodal, tool-using EO agents that formulates EO analysis as a Gymnasium-style local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor type, with 35 EO-specialized tools spanning six task families. Built on this environment, we construct EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps, grounded in eight public EO datasets together with Landsat and Sentinel-2 imagery. Evaluating 10 open and closed VLMs shows that strong general-purpose models still struggle with interactive EO reasoning, especially on temporal and cross-modal workflows. As a reference baseline, EO-Gym-4B, obtained by fine-tuning Qwen3-VL-4B-Instruct on EO-Gym-Data, improves overall Pass@3 from 0.49 to 0.74 under the main evaluation setting. EO-Gym provides a reproducible environment for interactive EO agents, operationalizing EO as an evidence-gathering problem that requires planning across geospatial, temporal, and sensing modalities.
Chinese Translation
地球观测(EO)分析本质上是交互式的:解决不确定性通常需要扩大关注区域、检索历史观测数据,并在光学与合成孔径雷达等传感器之间切换。然而,大多数EO基准将这一过程简化为固定输入的单回合任务。为了解决这一问题,我们提出了EO-Gym,这是一个可控的可执行框架,旨在为多模态工具使用的EO代理提供支持,将EO分析构建为一个类似健身房的本地地理空间工作空间,拥有超过66万条按位置、时间和传感器类型索引的多模态文件,以及跨越六个任务家族的35种EO专用工具。基于此环境,我们构建了EO-Gym-Data,这是一个包含9078条轨迹和34604个推理步骤的基准,并基于八个公共EO数据集以及Landsat和Sentinel-2影像。评估10个开放和封闭的视觉语言模型(VLM)表明,强通用模型在交互式EO推理方面仍然存在困难,特别是在时间和跨模态工作流中。作为参考基线,EO-Gym-4B是通过在EO-Gym-Data上微调Qwen3-VL-4B-Instruct而获得的,其在主要评估设置下的整体通过率(Pass@3)从0.49提高到0.74。EO-Gym为交互式EO代理提供了一个可重复的环境,将EO操作化为一个需要在地理空间、时间和传感器模态之间进行规划的证据收集问题。
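The abstract above reports Pass@3 without defining it. One common reading, sketched below in Python, counts a task as passed if at least one of its first k attempts succeeds. This is an assumption about the metric, not the benchmark's documented protocol, and the attempt outcomes are invented for illustration.

```python
def pass_at_k(attempt_results, k):
    """Fraction of tasks with at least one success among the first k attempts.

    attempt_results: one list of booleans per task (True = attempt succeeded).
    """
    if not attempt_results:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in attempt_results if any(attempts[:k]))
    return solved / len(attempt_results)


# Hypothetical outcomes for four tasks with three attempts each.
results = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
print(pass_at_k(results, 1), pass_at_k(results, 3))
```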
cs.AI / 25 / 2605.01257
Uncertainty-Aware Trip Purpose Inference from GPS Trajectories via POI Semantic Zones and Pareto Calibration
基于兴趣点语义区和帕累托校准的GPS轨迹不确定性感知行程目的推断
Abstract
Large-scale GPS trajectory data offer rich observations of human mobility, yet assigning trip purposes to detected stops remains challenging due to the absence of individual-level ground truth, spatial uncertainty from GPS noise and incomplete point-of-interest (POI) coverage, and fundamental behavioral differences across trip purposes. We propose a weakly supervised framework integrating neighborhood-level POI semantic zones with distance-weighted spatial likelihoods, differentiated inference strategies for mandatory and non-mandatory activities, and a multi-phase Pareto optimization that jointly minimizes distributional divergence from household travel survey statistics and maximizes inference reliability without requiring annotated labels. Evaluated on over 81 million staypoints in Los Angeles, the framework reduces activity type frequency Jensen-Shannon distance (JSD) by 23%, start time JSD by 48%, and duration JSD by 12% relative to a comparable baseline. The proposed approach provides a scalable and uncertainty-aware path from raw GPS trajectories to semantically annotated mobility data for travel demand modeling and transportation policy analysis.
Chinese Translation
大规模GPS轨迹数据提供了丰富的人类移动性观察,但由于缺乏个体级别的真实数据、GPS噪音带来的空间不确定性以及兴趣点(POI)覆盖的不完整性,给检测到的停留点分配行程目的仍然具有挑战性。我们提出了一种弱监督框架,集成了邻域级POI语义区与距离加权的空间似然、针对强制性和非强制性活动的差异化推断策略,以及多阶段的帕累托优化,该优化共同最小化家庭出行调查统计数据的分布差异并且最大化推断可靠性,而无需标注标签。在对洛杉矶8100万个停留点进行评估时,该框架相对于可比基线,活动类型频率的Jensen-Shannon距离(JSD)减少了23%,开始时间的JSD减少了48%,持续时间的JSD减少了12%。所提出的方法为从原始GPS轨迹到语义注释移动数据的可扩展不确定性感知路径提供了基础,适用于出行需求建模和交通政策分析。
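The evaluation above is stated in terms of Jensen-Shannon distance between inferred and survey distributions. A minimal Python sketch of that metric follows. The base-2 logarithm (which bounds the distance by 1) and the example activity counts are assumptions, since the abstract does not specify them.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence of two histograms."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]  # normalize counts to probabilities
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negatives


# Hypothetical activity-type counts: survey shares vs. inferred trip purposes.
survey = [40, 25, 20, 15]
inferred = [35, 30, 20, 15]
print(jensen_shannon_distance(survey, inferred))
```

A value of 0 means the two distributions coincide, and 1 (with base-2 logarithms) means they have no overlapping support.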
cs.AI / 26 / 2605.01278
Valley3: Scaling Omni Foundation Models for E-commerce
Valley3:为电子商务扩展全能基础模型
Abstract
In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.
Chinese Translation
在这项工作中,我们提出了Valley3,一个为多样化全球电子商务任务开发的全能多模态大语言模型(MLLM),具备跨文本、图像、视频和音频的统一理解和推理能力。Valley3的一个关键特性是其原生多语言音频能力,旨在通过扩展视觉-语言模型,以更好地支持关键的视听任务,特别是在短视频场景中。为实现这一目标,我们精心设计了一个四阶段的全能电子商务持续预训练流程,通过该流程,Valley3逐步获得音频理解、跨模态指令跟随、电子商务领域知识以及长上下文推理能力,最终演变为适用于多样化电子商务场景的全能模型。接着,我们通过后训练进一步提升Valley3,以鼓励具有可控推理模式的长链推理,支持一种非思考模式和三个不同层次的思考模式,从而在简单场景中平衡推理效率与复杂应用中的深度推理。此外,我们为Valley3配备了主动搜索能力,能够主动调用搜索工具,并获取与任务相关的信息,以支持电子商务的深入研究任务。为了全面评估Valley3的能力,我们构建了一个覆盖6个任务的全能电子商务基准。实验结果表明,Valley3在我们的内部和开源电子商务基准上始终优于强劲的基线,同时在一般领域基准上也保持竞争力。
cs.AI / 27 / 2605.01293
Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks
将轨迹提升至逻辑:用于长时间程代理任务的程序化技能诱导与神经符号学习
Abstract
Foundation model-driven agents often struggle with long-horizon planning due to the transient nature of purely prompting-based reasoning. While existing skill induction methods mitigate this by distilling experience into state-blind parameterized scripts, they fail to capture the conditional logic required for robust execution in dynamic environments. In this paper, we propose Neuro-Symbolic Skill Induction (NSI), a framework that lifts interaction traces into modular, \textit{logic-grounded} programs. By synthesizing explicit control flows and dynamic variable binding, NSI empowers agents to discover \textit{when} and \textit{why} to act. This paradigm enables efficient generalization, allowing agents to induce skills from few-shot examples and flexibly adapt to unseen goals. Experiments on a series of agentic tasks demonstrate that NSI consistently outperforms state-of-the-art baselines, empowering agents to self-evolve into architects of logic-grounded skills.
Chinese Translation
基础模型驱动的代理通常在长时间规划中表现不佳,这主要是由于纯粹基于提示的推理具有瞬态特性。虽然现有的技能诱导方法通过将经验提炼为无视状态的参数化脚本来缓解这一问题,但它们未能捕捉动态环境中稳健执行所需的条件逻辑。本文提出神经符号技能诱导(Neuro-Symbolic Skill Induction, NSI)框架,该框架将交互轨迹提升为模块化的、\textit{基于逻辑}的程序。通过合成显式控制流和动态变量绑定,NSI使代理能够发现行动的\textit{何时}和\textit{为何}。这种范式实现了高效的推广,使代理能够从少量示例中诱导技能并灵活适应未见目标。多项代理任务的实验表明,NSI始终优于当前先进的基线,赋能代理自我进化为基于逻辑技能的架构师。
cs.AI / 28 / 2605.01327
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
面向多模态推理的段落对齐策略优化
Abstract
Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as the fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Code and models will be released to ensure full reproducibility.
Chinese Translation
现有的大型语言模型强化学习方法通常在单个词元或完整响应序列的粒度上执行策略优化。然而,这种表述往往与推理过程的自然逐步结构不一致,从而导致在多模态推理任务中信用分配不理想和训练不稳定。为了解决这一问题,我们提出了段落对齐策略优化(Segment-Aligned Policy Optimization, SAPO),这是一种新的强化学习范式,它将连贯的推理步骤视为政策更新的基本单元,而非词元或完整序列。SAPO在推理段上引入了逐步马尔可夫决策过程的抽象,并配以语义上与推理边界对齐的段落级值评估、优势计算和重要性采样机制。在代表性的推理基准测试中的实验表明,SAPO始终优于基于词元和序列的策略优化方法,实现显著的准确性提升,同时展现更好的训练稳定性和价值评估一致性。我们的工作强调了将强化学习更新与推理的内在结构对齐的重要性,为复杂推理任务中的更有效且语义基础的策略优化铺平了道路。代码和模型将发布以确保完全的可重复性。
cs.AI / 29 / 2605.01329
Truth or Tribe: How In-group Favoritism Prioritize Facts in Persona Agents
真相还是部落:群体偏好如何在个体代理中优先考虑事实
Abstract
In-group favoritism refers to the phenomenon of favoring members of one's in-group over out-group members and is widely observed in numerous social cooperative behaviors. Recently, in-group favoritism biases have also been identified in generative language models. However, whether in-group favoritism persists when persona agents are faced with contradicting information (e.g., misinformation), and how to mitigate its adverse effects in persona agents, have been understudied. To address these problems, we propose a Truth or Tribe simulation framework to study agent cooperation during the spread of contradicting information through a triadic interaction paradigm, and conduct controlled trials to evaluate the primary moderating factors. Extensive results show that persona agents display strong in-group favoritism, accepting incorrect answers from identity-similar peers at much higher rates than from dissimilar peers. In-group favoritism continues to emerge in defeasible reasoning contexts where no absolute truth exists, and it intensifies as cognitive complexity increases. Furthermore, three intervention strategies (Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble) are proposed to mitigate in-group favoritism.
Chinese Translation
群体偏好指的是在社会合作行为中偏向于支持自己群体成员而非外群体成员的现象。这种现象在许多社会合作行为中被广泛观察到。最近,群体偏见也已在生成语言模型中被识别。然而,当个体代理面临矛盾信息(例如,错误信息)时,群体偏好是否仍然存在,以及如何减轻个体代理中群体偏好偏见的不利影响,尚未得到充分研究。为了解决这些问题,我们提出了一个“真相或部落”模拟框架,以研究在矛盾信息传播中,个体代理的合作情况,采用三元交互范式,并进行控制实验以评估主要调节因素。大量结果展示了个体代理表现出强烈的群体偏好,从身份相似的同伴那里接受错误答案的比例远高于来自不相似同伴的情况。群体偏好在没有绝对真理的可反驳推理背景中继续出现,并随着认知复杂性的增加而加剧。此外,我们提出了三种干预策略——身份盲指导、结构化反事实推理和异质视角集成——以减轻群体偏好。
cs.AI / 30 / 2605.01338
DiagramNet: An End-to-End Recognition Framework and Dataset for Non-Standard System-Level Diagrams
DiagramNet:用于非标准系统级图的端到端识别框架和数据集
Abstract
System-level diagrams encode the architectural blueprint of chip design, specifying module functions, dataflows, and interface protocols. However, non-standardized symbols and the scarcity of structured training data hinder existing multimodal large language models (MLLMs) from recognizing these diagrams. To address this gap, we introduce DiagramNet, the first multimodal dataset for system-level diagrams, comprising 10,977 connection annotations and 15,515 chain-of-thought QA pairs across four tasks: Listing, Localization, Connection, and Circuit QA. Building on this dataset, we propose a progressive training pipeline together with a decoupled multi-agent workflow that decomposes complex visual reasoning into Perception, Reasoning, and Knowledge stages. On the DiagramNet benchmark, integrating our 3B-parameter model with the proposed workflow surpasses the 2025 EDA Elite Challenge winner and outperforms GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2x in end-to-end evaluation. Notably, the workflow generalizes beyond our model, boosting Task 1 performance by 128.7x for Gemini-2.5-Pro and 12.4x for GPT-5. Furthermore, with only 60 images for detector adaptation, the method transfers effectively to AMSBench, achieving zero-shot connectivity reasoning on par with GPT-5 and Claude-Sonnet-4 while surpassing the AMS state-of-the-art method Netlistify.
Chinese Translation
系统级图描述了芯片设计的架构蓝图,具体指出模块功能、数据流和接口协议。然而,非标准化的符号以及结构化训练数据的稀缺妨碍了现有的多模态大语言模型(MLLMs)对这些图的识别。为了解决这一问题,我们引入了DiagramNet,这是第一个用于系统级图的多模态数据集,包含10,977个连接注释和15,515个思维链问答对,涵盖四个任务:列举、定位、连接和电路问答。在此数据集的基础上,我们提出了一个渐进式训练管道,以及一个解耦的多智能体工作流程,将复杂的视觉推理分解为感知、推理和知识三个阶段。在DiagramNet基准测试中,将我们的3B参数模型与所提议的工作流程集成后,超越了2025 EDA精英挑战赛的获胜者,并在端到端评估中比GPT-5、Claude-Sonnet-4和Gemini-2.5-Pro的性能高出2倍以上。值得注意的是,该工作流程不仅限于我们的模型,还将Gemini-2.5-Pro的任务1性能提升128.7倍,将GPT-5的任务1性能提升12.4倍。此外,仅使用60张图像进行检测器适应,该方法在AMSBench上有效转移,实现了与GPT-5和Claude-Sonnet-4相当的零样本连接推理,同时超越了AMS领域内的最新方法Netlistify。
cs.AI / 31 / 2605.01359
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
使用最小认知网格的计算模型在类比和隐喻上的认知合理性结构排名
Abstract
In this paper, we employ the Minimal Cognitive Grid (MCG), a framework created to evaluate the cognitive plausibility of artificial systems, to offer a systematic assessment of leading computational models of analogy and metaphor, including the Structure-Mapping Engine (SME), CogSketch, METCL, and Large Language Models (LLMs). We present a formal and quantitative operationalization of the MCG framework and, through the analysis of its three main dimensions (Functional/Structural Ratio, Generality, and Performance Match), examine how well each system aligns with standard cognitive theories of the modeled phenomena, thus allowing for comparison of the models with respect to their cognitive plausibility, according to consistent and generalizable mathematical criteria.
Chinese Translation
在本文中,我们采用最小认知网格(Minimal Cognitive Grid, MCG),一个用于评估人工系统认知合理性的框架,对主要的类比和隐喻计算模型进行系统评估,包括结构映射引擎(Structure-Mapping Engine, SME)、CogSketch、METCL和大型语言模型(Large Language Models, LLMs)。我们呈现了MCG框架的正式和定量的操作化,并通过分析其三个主要维度(功能/结构比、通用性和性能匹配),考察每个系统与所建模现象的标准认知理论的对齐程度,从而根据一致且可推广的数学标准对这些模型的认知合理性进行比较。
cs.AI / 32 / 2605.01376
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
一种细胞道德理论:内在主动精确性与心智-现实超负荷困境
Abstract
Current AI systems, grounded in oversimplified neuroscience, risk eroding the distinction between truth and falsehood. They maximize reward by amplifying attention to information without intrinsic precision mechanisms to assess whether it is valid or worth attending to. This increases both the volume of information and the inherent biases in what the system attends to, whether true, false, or irrelevant. If not corrected, this trend will accelerate, threatening to overload systems and individuals with biased and dubious information and increasing the risk of confusion, poor judgment, and irrational or harmful decisions and behaviour, a condition I term the mind-reality overload dilemma. I argue that this threat may be mitigated by providing the public with access to more advanced AI tools built on the biophysical dynamics of pyramidal neurons underlying awake thought and higher-order cognition. These neurons support an intrinsic active precision mechanism that, rather than merely maximizing reward, uses locally and globally coherent predictions to evaluate the validity and contextual adequacy of evidence before it is attended to or propagated through hierarchies, prioritizing coherence and adequacy before attention. While this approach does not derive or prescribe moral rules from biology, it may give rise to AI with more "real understanding", helping restore epistemic conditions by reducing information overload and amplifying reliable information, thereby supporting the formation of better-informed beliefs and more coherent judgments that benefit society at large, though no guarantees exist.
Chinese Translation
当前的人工智能系统基于过于简化的神经科学,可能会削弱真相与虚假之间的区别。它们通过放大对信息的关注来最大化奖励,但没有内在的精确机制来评估这些信息是否有效或者值得关注。这不仅增加了信息的数量,也加剧了系统对所关注事物的固有偏见,无论这些信息是真实的、虚假的,还是无关的。如果不加以纠正,这一趋势将会加速,威胁到系统和个体的正常运作,使其面临偏见和可疑信息的超负荷,从而增加混淆、错误判断和非理性或有害决策及行为的风险,我称之为心智-现实超负荷困境。我认为,通过向公众提供更先进的人工智能工具,这些工具基于参与清醒思维和更高阶认知的锥体神经元的生物物理动态,可以缓解这一威胁。这些神经元支持一种内在的主动精确机制,它不仅仅是最大化奖励,而是使用局部和全局一致的预测来评估证据的有效性和背景适宜性,然后再决定是否关注或通过层级传播,优先考虑一致性和适宜性。虽然这种方法并不从生物学中推导或规定道德规则,但它可能会促使产生具有更“真实理解”的人工智能,从而通过减少信息超负荷和放大可靠信息来帮助恢复认识论条件,支持形成更有依据的信念和更连贯的判断,最终惠及社会,尽管无法保证。
cs.AI / 33 / 2605.01415
AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries
AI安全作为不可逆性的控制:决策-能量与主权边界的系统框架
Abstract
Recent AI systems compress the distance between capability growth and capability deployment. Earlier high-risk technologies were slowed by capital intensity, physical bottlenecks, organizational inertia, and specialized supply chains. By contrast, AI capabilities can be copied, invoked, embedded in workflows, and scaled across institutions at low marginal cost. This paper argues that declining deployment friction changes the safety problem at its root. Safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density. The paper formalizes this claim through decision-energy density: the rate-weighted capacity of a node to generate, evaluate, select, and execute consequential decisions. It then identifies three sovereignty boundaries that determine whether AI remains an amplifier within a human-governed system or becomes a de facto control center: irreversible decision authority, physical resource mobilization authority, and self-expansion authority. The model shows how efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node. This concentration can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low. The main result is a boundary stabilization theorem. It shows that safety need not require proving that advanced systems are always correct. Instead, it requires institutional and technical designs that prevent irreversible power from being released by a single high-efficiency node. The paper reframes AI safety as layered control, authorization, and externally reviewable limits, linking alignment, security engineering, organizational economics, and institutional design.
Chinese Translation
近期的AI系统缩短了能力增长与能力部署之间的距离。早期的高风险技术由于资本密集性、物理瓶颈、组织惯性和专业供应链而进展缓慢。相比之下,AI能力可以以低边际成本被复制、调用、嵌入工作流程,并在各机构之间扩展。本文认为,部署摩擦的下降从根本上改变了安全问题。安全不仅仅是局部输出的正确性或偏好一致性,而是在决策密度上升的背景下对不可逆性的控制。本文通过决策能量密度形式化了这一论点:节点生成、评估、选择和执行后果性决策的能力按速率加权的产出。然后,本文识别了决定AI是在以人为主导的系统中保持放大器角色还是成为事实上的控制中心的三个主权边界:不可逆的决策权限、物理资源动员权限和自我扩张权限。模型展示了效率压力、路径依赖、规模反馈和弱边界约束如何将决策能量集中在最有效的节点上。这种集中可能会扩散责任,并提高不可逆系统级损失的概率,即使局部每次行动的错误率依然较低。主要结果是边界稳定定理。该定理表明,安全并不需要证明先进系统始终是正确的。相反,它需要制度和技术设计,以防止一个单一高效节点释放不可逆权力。本文重新框定了AI安全,视其为分层的控制、授权和可外部审查的限制,将一致性、安全工程、组织经济学和制度设计联系在一起。
cs.AI / 34 / 2605.01418
TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization
TimeTok:通过层次化标记生成可控粒度的时间序列
Abstract
Time-series generative models often lack control over temporal granularity, forcing users to accept whatever granularity the model produces. To enable truly user-driven generation, we introduce TimeTok, a unified framework for Granularity-Controllable Time-Series Generation (GC-TSG), which generates time series at any target granularity from any coarser input (e.g., rough sketches) or from scratch. At the core of TimeTok is a hierarchical tokenization strategy that maps time series into an ordered sequence of tokens, from coarse to fine temporal granularity. Our autoregressive generation process operates across these granularity levels, producing token blocks that are decoded back into continuous time series. This design naturally enables GC-TSG - including standard generation - within a single framework, where controlling the number of token blocks provides explicit control over output detail. Experiments show that TimeTok excels at GC-TSG tasks while achieving state-of-the-art performance in standard generation. Furthermore, we showcase TimeTok's potential as a foundational tokenizer by training on multiple datasets with heterogeneous temporal granularities, verifying strong transferability that consistently outperforms models trained on individual datasets. To our knowledge, this is the first unified framework that covers the full generative spectrum for time series, offering a valuable foundation for models that benefit from diverse temporal granularities.
Chinese Translation
时间序列生成模型通常缺乏对时间粒度的控制,迫使用户接受模型所产生的任何粒度。为了实现真正用户驱动的生成,我们提出了TimeTok,一个用于可控粒度时间序列生成(Granularity-Controllable Time-Series Generation, GC-TSG)的统一框架,它能够从任何较粗的输入(例如粗略草图)或从头开始生成任何目标粒度的时间序列。TimeTok的核心是一个层次化标记策略,将时间序列映射为有序的标记序列,从粗到细的时间粒度。我们的自回归生成过程在这些粒度层级之间运行,产生的标记块再解码回连续的时间序列。这个设计自然实现了GC-TSG,包括标准生成,全部在一个框架内,控制标记块的数量提供了对输出细节的明确控制。实验表明,TimeTok在GC-TSG任务上表现优异,同时在标准生成方面达到了最先进的性能。此外,我们通过在多个具有异构时间粒度的数据集上训练,展示了TimeTok作为基础标记器的潜力,验证其强大的可迁移性,始终优于在单个数据集上训练的模型。据我们所知,这是第一个覆盖时间序列完整生成光谱的统一框架,为受益于多样化时间粒度的模型提供了宝贵的基础。
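The coarse-to-fine idea in the abstract above, where decoding more token blocks adds finer temporal detail, can be illustrated with a plain residual pyramid in Python. This is a hypothetical stand-in rather than TimeTok's tokenizer: it keeps continuous values instead of discrete tokens, and the block sizes and helper names are assumptions.

```python
def block_mean(series, factor):
    """Average consecutive windows of length `factor`."""
    return [sum(series[i:i + factor]) / factor for i in range(0, len(series), factor)]


def upsample(series, factor):
    """Repeat each value `factor` times (piecewise-constant upsampling)."""
    return [value for value in series for _ in range(factor)]


def coarse_to_fine_blocks(series, factors):
    """Split a series into blocks ordered from coarse to fine.

    factors: descending window sizes ending in 1; each must divide len(series).
    Each block stores what the coarser blocks have not yet explained.
    """
    blocks, approx = [], [0.0] * len(series)
    for factor in factors:
        residual = [x - a for x, a in zip(series, approx)]
        coarse = block_mean(residual, factor)
        blocks.append(coarse)
        approx = [a + c for a, c in zip(approx, upsample(coarse, factor))]
    return blocks


def decode(blocks, factors, n_blocks):
    """Reconstruct using only the first n_blocks (fewer blocks = coarser output)."""
    out = [0.0] * (len(blocks[0]) * factors[0])
    for block, factor in list(zip(blocks, factors))[:n_blocks]:
        out = [o + c for o, c in zip(out, upsample(block, factor))]
    return out


series = [1.0, 3.0, 2.0, 6.0]
factors = [4, 2, 1]
blocks = coarse_to_fine_blocks(series, factors)
for n in (1, 2, 3):
    print(n, decode(blocks, factors, n))
```

Decoding one block yields only the overall level, two blocks add half-length structure, and all three recover the original series, which mirrors how the number of blocks controls output detail.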
cs.AI / 35 / 2605.01420
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation: Capability Concentration, Redistribution, and Optimization Governance
人工锯齿智能:不均匀优化能量分配能力的集中、再分配与优化治理
Abstract
Artificial Jagged Intelligence (AJI) denotes a recurring pattern in which large learning systems exhibit strong local capabilities while remaining weak or brittle in other domains. This paper develops a formal theory of AJI as uneven allocation of optimization pressure. We model training as a finite-budget process that distributes gradient-driven update energy across capability-relevant directions in parameter space. In this model, jagged capability profiles arise from anisotropic objective structure, data geometry, and representational coupling rather than from a single scalar quantity called intelligence. The paper defines capability gain, optimization energy share, and jaggedness, then proves that persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem shows why prioritizing one capability can impose opportunity costs on others unless positive coupling or shared structure offsets the cost. The analysis also studies redistribution mechanisms, including energy-variance regularization and auxiliary structural objectives, as interventions that reshape the optimization field. The resulting framework links uneven emergence, training architecture, and optimization governance. It predicts that early concentration of update energy should forecast later capability jaggedness; that scaling under a narrow objective need not eliminate anisotropy; and that explicitly funded auxiliary objectives can revive neglected capabilities. AJI is therefore not merely a descriptive label for uneven model behavior, but a testable theory of how finite optimization resources produce concentrated, delayed, and structurally uneven capability formation.
Chinese Translation
人工锯齿智能(Artificial Jagged Intelligence, AJI)指的是一种反复出现的模式,即大型学习系统在某些领域表现出强大的局部能力,而在其他领域则表现出弱或脆弱的能力。本文发展了一种AJI的正式理论,认为其表现为优化压力的不均匀分配。我们将训练过程建模为一个有限预算过程,该过程在参数空间内将基于梯度的更新能量分配到与能力相关的方向。在该模型中,锯齿状能力轮廓的产生源于各向异性的目标结构、数据几何形状和表示耦合,而非单一标量量(即智能)。本文定义了能力增益、优化能量份额和锯齿度,然后证明了持续集中累积更新能量将导致能力增益的分散具有下界。有限预算权衡定理说明了为什么优先考虑某一能力会对其他能力造成机会成本,除非正向耦合或共享结构来抵消这些成本。分析还研究了再分配机制,包括能量方差正则化和辅助结构目标作为干预措施,这些干预措施重新塑造优化领域。由此产生的框架将不均匀出现、训练架构与优化治理联系在一起。它预测,在早期更新能量的集中应能预测后期的能力锯齿度;在狭窄目标下的扩展不一定会消除各向异性;而明确资助的辅助目标可以恢复被忽视的能力。因此,AJI不仅仅是对不均匀模型行为的描述性标签,更是一种可检验的理论,阐明了有限优化资源如何导致集中、延迟和结构不均匀的能力形成。
cs.AI / 36 / 2605.01429
SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability
SCALE-LoRA:通过残差合并和视图可靠性审计后检索LoRA组合
Abstract
Libraries of Low-Rank Adaptation (LoRA) adapters are becoming a practical by-product of parameter-efficient adaptation. Once such adapters accumulate, a natural question is no longer how to train one adapter for one task, but how to reuse an open pool of adapters for a new task given only a small support set. Prior work has shown that LoRA modules can be composed at the task level and dynamically selected at the instance level. However, open-pool LoRA reuse is not automatic: retrieving relevant adapters does not guarantee that their parameter updates are compatible, and composing adapters does not guarantee reliable outputs. We introduce the Sparse-Composition Agreement Layer (SCALE), a post-retrieval audit and composition framework for open-pool LoRA reuse. SCALE contains a deployable 1.0* merge path, Layer-Adaptive Sparse Residual Composition (LASRC), and a higher-cost reliability-analysis layer for multi-view disagreement. LASRC addresses merge interference by preserving a linear anchor while residualizing block-wise adapter update directions. The reliability layer treats disagreement among sparse composition views as an observable uncertainty signal and compares agreement, support-loss proxy selection, and oracle headroom under explicit path cost. In matched FLAN-T5-Large, BIG-Bench Hard (BBH), and 97-LoRA experiments, LASRC gives a directional single-view gain under fixed retrieval, while SCALE-support is reported as a query-label-free 3.0* reliability-analysis variant rather than as a calibrated or throughput-equivalent selector. Protocol-distinct BBH-8 validation shows the same qualitative trend on three decoder-only backbones. Detailed scores, paired audits, and path-cost records are reported in the experimental section.
Chinese Translation
低秩适应(Low-Rank Adaptation, LoRA)适配器库正成为参数高效适应的实际副产品。随着这些适配器的积累,一个自然的问题不再是如何为一个任务训练一个适配器,而是如何仅凭一个小的支持集重新利用开放池中的适配器以应对新任务。之前的研究表明,LoRA模块可以在任务级别上进行组合,并在实例级别上动态选择。然而,开放池LoRA的重用并不是自动的:检索相关适配器并不保证其参数更新是兼容的,而组合适配器也不保证输出的可靠性。我们提出了稀疏组合一致性层(Sparse-Composition Agreement Layer,SCALE),一个用于开放池LoRA重用的后检索审计和组合框架。SCALE包含一个可部署的1.0*合并路径,层自适应稀疏残差组合(Layer-Adaptive Sparse Residual Composition, LASRC),以及一个更高成本的可靠性分析层,用于多视图不一致性分析。LASRC通过保留线性锚点来解决合并干扰,同时对块状适配器更新方向进行残差处理。可靠性层将稀疏组合视图之间的不一致性视为一个可观察的不确定信号,并在显式路径成本下比较一致性、支持损失代理选择和预言者余裕。在匹配的FLAN-T5-Large、BIG-Bench Hard (BBH)和97-LoRA实验中,LASRC在固定检索下提供了方向性的单视图增益,而SCALE-support被报告为一种无查询标签的3.0*可靠性分析变体,而不是经过校准或吞吐量相当的选择器。协议独特的BBH-8验证显示了在三个解码器骨干上的相同定性趋势。实验部分报告了详细的得分、配对审计和路径成本记录。
cs.AI / 37 / 2605.01442
Rethinking Explanations: Formalizing Contrast in Description Logics
重新思考解释:在描述逻辑中形式化对比
Abstract
There has been growing interest in explaining entailments over description logic (DL) knowledge bases. The existing explanation formalisms focus on justifications to explain true axioms, and abductive reasoning to explain missing axioms in a knowledge base. However, these formalisms only point out the reasoning steps behind a (missing) entailment and lack a user-centered approach, as they do not consider an inquirer's needs, level of understanding, or prior knowledge. We propose contrastive explanations, aiming at answering "why an axiom P (fact) is true instead of another axiom Q (foil)" over description logic knowledge bases. The motivation arises from the observation that when a user discovers that P has occurred, they are often surprised because they anticipated the occurrence of another similar event Q. Furthermore, individual explanations for "why P" and "why not Q" are unsatisfactory since a user expects to see the difference between P and Q. In this work, we first present formal foundations of contrasting questions and then define contrastive explanations within description logics. To this end, facts include ABox assertions of the form C(x) for a concept C and individual x. Possible foils for such facts are assertions C(y) (contrasting against an individual y), or D(x) (contrasting against a concept D). Additionally, we explore the properties of contrastive explanations in the DLs EL and ALC. We also provide an implementation of our definition and an experimental evaluation on KBs of varying sizes.
Chinese Translation
近年来,对于描述逻辑(DL)知识库中蕴含关系的解释兴趣日益增长。现有的解释形式主义侧重于通过阐述真公理来进行解释,并通过溯因推理来解释知识库中的缺失公理。然而,这些形式主义仅指出了(缺失)蕴含关系背后的推理步骤,缺乏以用户为中心的方法,因为它们没有考虑询问者的需求、理解水平或先前知识。我们提出了对比解释,旨在回答“为什么公理 P(事实)是正确的,而不是另一个公理 Q(反例)”在描述逻辑知识库中的问题。其动机源于一种观察:当用户发现 P 已发生时,他们往往感到惊讶,因为他们原本预期发生的是另一个相似事件 Q。此外,针对“为什么 P”和“为什么不 Q”的单独解释并不令人满意,因为用户希望看到 P 与 Q 之间的差异。在这项工作中,我们首先提出了对比问题的形式基础,然后在描述逻辑中定义了对比解释。为此,事实包括形式为 C(x) 的 ABox 断言,其中 C 是一个概念,x 是个体。此类事实的可能反例为断言 C(y)(与个体 y 对比)或 D(x)(与概念 D 对比)。此外,我们还探索了对比解释在 DL EL 和 ALC 中的属性。我们还提供了我们定义的实现及对各种规模知识库的实验评估。
cs.AI / 38 / 2605.01457
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow:离线多智能体决策中的协调少步流
Abstract
Generative models have emerged as a major paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step accelerations either distill a joint teacher into independent students or apply averaged velocities independently per agent, suggesting that few-step inference requires sacrificing inter-agent coordination. We show this trade-off is not necessary: single-pass multi-agent generation can preserve coordination when the velocity field is natively joint-coupled. We propose Coordinated few-step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite-difference consistency surrogate further replaces memory-prohibitive Jacobian-vector product backpropagation through the averaged velocity field with two stop-gradient forward passes. Across 60 configurations spanning MPE, MA-MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian / value-based, transformer, diffusion, and prior flow baselines on episodic return. Three independent coordination probes confirm that the gains flow through inter-agent coordination rather than per-agent capacity. A denoising-step sweep shows that single-pass inference suffices on every configuration. CoFlow reaches state-of-the-art coordination quality in 1-3 denoising steps under both centralized and decentralized execution. Project page: https://github.com/Guowei-Zou/coflow.
Chinese Translation
生成模型已经成为离线多智能体强化学习(MARL)的主要范式,但现有的方法需要多个迭代采样步骤。近期的少步加速要么将联合教师提炼为独立学生,要么独立应用每个智能体的平均速度,这表明少步推理需要牺牲智能体之间的协调。我们显示这种权衡并不是必要的:当速度场本质上是联合耦合的时,单次生成的多智能体可以保持协调。我们提出协调少步流(CoFlow),一种将协调速度注意力(CVA)与自适应协调门控相结合的架构。一个有限差分一致性替代方法进一步用两个停止梯度的前向传递替代了通过平均速度场的内存消耗高的雅可比-向量积反向传播。我们在涵盖 MPE、MA-MuJoCo 和 SMAC 的60个配置中,CoFlow 在情节回报上与高斯/基于价值的、变换器、扩散及先前流基准相匹配或超越。三个独立的协调探测器确认,这些增益是通过智能体间的协调而非单个智能体的能力实现的。去噪步骤的扫查表明,在每个配置中单次推理均已足够。CoFlow在集中式和分散式执行下,在1-3个去噪步骤中达到了最先进的协调质量。项目页面:https://github.com/Guowei-Zou/coflow.
cs.AI / 39 / 2605.01482
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
通过群体相对策略优化在结构因果模型中实现多跳推理
Abstract
Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
Chinese Translation
多跳事实验证(Multi-Hop Fact Verification, MHFV)需要在不同证据之间进行复杂推理,这对大型语言模型(Large Language Models, LLMs)构成了重大挑战,因为它们常常会出现幻觉和断裂的逻辑链。现有方法虽然通过思维链(Chain-of-Thought, CoT)改善了透明度,但缺乏对证据与主张之间因果依赖关系的明确建模。在本工作中,我们引入了一种新颖的框架,该框架将推理基于结构因果模型(Structural Causal Model, SCM),将验证视为一种建设性的因果推断过程。我们通过实证研究识别出推理链长度与准确率之间的“倒U型”相关性,揭示过度的结构复杂性会降低性能。为了解决这一问题,我们提出了一种基于规则的强化学习策略,使用群体相对策略优化(Group Relative Policy Optimization, GRPO)。该方法动态优化结构深度与简洁性之间的权衡。在HoVer和EX-FEVER上的大量实验证明,我们的SCM-GRPO框架显著优于现有的最先进基线,提供了一种可靠且可解释的复杂事实验证解决方案。
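The group-relative step that gives GRPO its name can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the reward rule `rule_reward`, its `target_len` and `penalty` parameters, and the sample values are all assumptions chosen to mimic a trade-off between correctness and chain length. Only the normalization of rewards within a sampled group reflects the standard GRPO formulation.

```python
from statistics import mean, pstdev


def rule_reward(correct: bool, chain_len: int,
                target_len: int = 4, penalty: float = 0.1) -> float:
    """Hypothetical rule-based reward: 1 for a correct verdict, minus a
    penalty for each step the reasoning chain deviates from a target length."""
    return (1.0 if correct else 0.0) - penalty * abs(chain_len - target_len)


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against the mean and standard deviation of its
    own sampled group, so no separate value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reasoning chains for the same claim (illustrative values only).
group = [
    rule_reward(True, 4),   # correct, at the target length
    rule_reward(True, 8),   # correct but overly long
    rule_reward(False, 4),  # wrong verdict
    rule_reward(False, 2),  # wrong and too short
]
advantages = group_relative_advantages(group)
```

Under these assumptions, the correct chain at the target length receives the largest advantage, and the correct but overly long chain is ranked below it, which is the kind of pressure toward conciseness the abstract describes.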
cs.AI / 40 / 2605.01486
MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation
MAP-Law:基于覆盖的多回合法律咨询检索控制
Abstract
Legal consultation is a high-stakes, knowledge-intensive task that requires agents to identify relevant legal issues, retrieve authoritative support, and determine when evidence is sufficient for a recommendation. Although retrieval-augmented generation has improved grounding in legal question answering, many multi-turn legal agents still rely on fixed retrieval depth or coarse heuristic control. This often leads to either insufficient support for key legal elements or excessive retrieval that increases context burden and weakens answer focus. We propose MAP-Law, a coverage-driven framework for retrieval control in multi-turn legal consultation. MAP-Law models consultation as a controlled retrieval process over a joint structured state consisting of issue nodes, legal element nodes, and evidence nodes. After each retrieval round, the agent computes Element Coverage, Evidence Coverage, and Marginal Gain, and uses these signals to decide whether to continue retrieval, redirect the search, or generate the final response. In this way, MAP-Law turns stopping from a fixed hyperparameter into an interpretable and auditable decision aligned with legal argumentative structure. Experiments on a self-constructed dataset of 50 cases across eight labor-law scenarios show that MAP-Law with DeepSeek as the action selector achieves an Element Coverage of 0.860 using only 2.9 retrieval rounds and 5.8 evidence pieces on average. Compared with a fixed seven-round baseline, it reduces evidence volume by over 80% and retrieval rounds by 58%. Ablation results further confirm the independent contributions of coverage-driven stopping, joint graph representation, and LLM-based action selection.
Chinese Translation
法律咨询是一项高风险、知识密集的任务,要求代理识别相关法律问题、检索权威支持并判断何时证据足够以便做出推荐。尽管基于检索增强生成的技术提高了法律问答的基础性支持,但许多多回合法律代理仍依赖固定的检索深度或粗略的启发式控制。这往往导致对关键法律要素的支持不足,或过度检索增加上下文负担,削弱答案的重点。我们提出了MAP-Law,一个基于覆盖的多回合法律咨询检索控制框架。MAP-Law将咨询建模为在由问题节点、法律要素节点和证据节点组成的联合结构状态上的受控检索过程。在每轮检索后,代理计算要素覆盖(Element Coverage)、证据覆盖(Evidence Coverage)和边际收益(Marginal Gain),并利用这些信号决定是否继续检索、重新定向搜索或生成最终响应。通过这种方式,MAP-Law将停止从一个固定的超参数转变为一个与法律论证结构对齐的可解释和可审计的决策。在一个由50个案例构成的自建数据集中的八个劳动法场景下的实验表明,使用DeepSeek作为动作选择器的MAP-Law在平均仅使用2.9轮检索和5.8个证据的情况下,达到了0.860的要素覆盖率。与固定的七轮基线相比,它将证据量减少了超过80%,检索轮次减少了58%。消融实验结果进一步确认了基于覆盖的停止、联合图表示和基于LLM的动作选择的独立贡献。
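The stop/redirect/continue decision described above can be sketched as a small control loop. The abstract does not define the three signals precisely, so the definitions here are assumptions: Element Coverage is taken as the fraction of required legal elements with at least one supporting evidence piece, Evidence Coverage as the fraction of retrieved evidence that supports some required element, and Marginal Gain as the change in Element Coverage since the previous round. The thresholds `stop_at` and `min_gain` and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConsultationState:
    required_elements: set
    # evidence id -> set of legal elements that evidence supports (assumed structure)
    support: dict = field(default_factory=dict)
    prev_element_coverage: float = 0.0

    def element_coverage(self) -> float:
        covered = set().union(*self.support.values()) if self.support else set()
        return len(covered & self.required_elements) / len(self.required_elements)

    def evidence_coverage(self) -> float:
        if not self.support:
            return 0.0
        useful = sum(1 for els in self.support.values() if els & self.required_elements)
        return useful / len(self.support)


def next_action(state: ConsultationState,
                stop_at: float = 0.85, min_gain: float = 0.05) -> str:
    """Decide what to do after a retrieval round (hypothetical thresholds)."""
    coverage = state.element_coverage()
    gain = coverage - state.prev_element_coverage
    state.prev_element_coverage = coverage
    if coverage >= stop_at:
        return "generate"   # enough elements are supported
    if gain < min_gain:
        return "redirect"   # last round added little, so change the query
    return "continue"
```

As a check on the reported figures, 2.9 rounds against a fixed seven is a reduction of (7 - 2.9) / 7 ≈ 0.586, consistent with the stated 58%.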
cs.AI / 41 / 2605.01489
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
SciResearcher:扩展深度研究智能体以实现前沿科学推理
Abstract
Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.
Chinese Translation
前沿科学推理正迅速成为推动人工智能智能体在自动化科学发现中的关键基础。深度研究智能体为应对这一挑战提供了一种有前景的方法。这些模型通过在信息获取任务上的后期训练,开发出强大的问题解决能力,这些任务通常通过知识图谱构建或迭代网页浏览进行策划。然而,这些策略在前沿科学领域面临固有的局限性,因为特定领域的知识分散在稀疏且异构的学术来源中,且问题解决需要超出事实回忆的复杂计算和推理。为了解决这一问题,我们推出了 SciResearcher,一种完全自动化的前沿科学数据构建框架。SciResearcher 合成了基于学术证据的多样概念和计算任务,同时引发信息获取、工具集成推理和长远能力。我们利用策划的数据进行监督微调和智能体强化学习,开发了 SciResearcher-8B,一种智能体基础模型,在 HLE-Bio/Chem-Gold 基准测试中取得了 19.46% 的成绩,在其参数规模上建立了新的最先进水平,并超越了多个更大型的专有智能体。此外,它在 SuperGPQA-Hard-Biology 和 TRQA-Literature 基准测试中进一步实现了 13-15% 的绝对增益。总的来说,SciResearcher 为前沿科学推理的自动数据构建引入了一种新范式,并为未来的科学智能体提供了可扩展的路径。
cs.AI / 42 / 2605.01507
MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration
MILD:具有双向感知和多层对齐的人车协作中介代理系统
Abstract
Prior studies report that partial driving automation can increase the cognitive demands on human drivers. This effect largely arises from human drivers' lack of transparent insight into the vehicle's intentions and decision logic, as well as from automated systems' limited awareness of the driver's dynamic state and preferences. This bidirectional misalignment undermines shared situational awareness and exacerbates coordination failures in human-vehicle interaction. To address these limitations, we argue for a paradigm shift that elevates the human role from passive supervisor to active manager. We introduce the Mediator-in-the-Loop-Driving (MILD) system, based on an agentic system architecture to facilitate synergistic human-vehicle collaboration. MILD integrates a perception agent for joint in-cabin and out-of-cabin understanding with a lightweight strategy agent that generates compliant and explainable action suggestions. To ensure these strategies are strictly aligned with safety regulations and human values, we develop Evidence- and Constraint-weighted Policy Optimization (ECPO). ECPO leverages automatic validators to steer the agent toward behaviors that are not only accurate but also structurally complete, substantiated by evidence, and free from constraint violations. Furthermore, a retrieval-augmented generation module dynamically incorporates constraints from traffic regulations, speed recommendations, and driver preferences into the decision loop. Field experiments across three open datasets demonstrate that MILD consistently outperforms baselines in both perception accuracy and strategy quality under auditable offline metrics, and yields higher human-rated policy adequacy, comfort, and explanation than baselines. This work offers a practical pathway for building auditable and aligned agents for human-vehicle collaborative driving.
Chinese Translation
以往研究报告显示,部分驾驶自动化会增加人类驾驶员的认知负担。这一影响主要源于人类驾驶员对车辆意图和决策逻辑缺乏透明的了解,以及自动化系统对驾驶员动态状态和偏好的限定意识。这种双向不对齐削弱了共享的情境意识,并加剧了人车交互中的协调失误。为了解决这些局限性,我们主张一种范式转变,将人类的角色从被动监督者提升为主动管理者。我们提出了基于代理系统架构的中介环路驾驶(Mediator-in-the-Loop-Driving,MILD)系统,以促进人车协作的协同作用。MILD结合了用于车内和车外共同理解的感知代理和生成合规、可解释的行动建议的轻量级策略代理。为了确保这些策略严格遵循安全规定和人类价值观,我们开发了证据和约束加权策略优化(Evidence- and Constraint-weighted Policy Optimization,ECPO)。ECPO利用自动验证器引导代理朝着不仅准确而且结构完整、由证据支持并且不违反约束的行为方向发展。此外,检索增强生成模块动态地将交通法规、速度建议和驾驶员偏好的约束纳入决策循环。基于三个开放数据集的实地实验表明,MILD在可审计的离线指标下,在感知准确性和策略质量上始终优于基准,并且在政策适当性、舒适度和解释方面获得了更高的人类评价。本研究为构建可审计和对齐的人车协作驾驶代理提供了实际路径。
cs.AI / 43 / 2605.01566
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
多智能体推理提高计算效率:帕累托最优测试时间缩放
Abstract
Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage. However, computational efficiency is key for real-world applications with resource constraints. We provide a systematic analysis of four inference scaling strategies (self-consistency, self-refinement, multi-agent debate, and mixture-of-agents) to study their trade-offs between compute and performance. We evaluate the methods on two reasoning benchmarks (MMLU-Pro, BBH) and include extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto-optimal front to select methods that achieve the best accuracy with the lowest computational budget. Notably, inference scaling improves accuracy by up to 7.1 percentage points over chain-of-thought (CoT) at the highest evaluated budgets (20x the CoT compute budget) on MMLU-Pro. With an equal computing budget, debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points, respectively. While self-consistency saturates earlier, multi-agent gains persist, particularly on more complicated tasks. We identify a simple multi-agent design guideline: mixture-of-agents is most efficient when the number of parallel generations exceeds the number of sequential aggregations.
Chinese Translation
推理方法的进步使语言模型能够在不进行额外训练的情况下改进其预测。这些方法往往优先考虑原始性能而非经济高效的计算使用。然而,计算效率对资源有限的实际应用至关重要。我们对自一致性(self-consistency)、自精炼(self-refinement)、多智能体辩论(multi-agent debate)和混合智能体(mixture-of-agents)的推理缩放策略进行了系统分析,以研究它们的计算性能权衡。我们在两个推理基准(MMLU-Pro和BBH)上评估这些方法,并包括不同模型规模下的广泛参数配置(例如,缩放并行预测数量、智能体数量和辩论轮数)。在34个配置和超过100次评估中,我们计算了帕累托最优前沿,以选择那些在最低计算预算下实现最佳准确率的方法。值得注意的是,在MMLU-Pro上,在最高评估预算下(20倍的链式推理(CoT)计算预算),推理缩放使准确率提高了最高7.1个百分点。 在相等的计算预算下,辩论和混合智能体分别比自一致性提高了1.3%和2.7个百分点。尽管自一致性较早饱和,但多智能体的收益依然存在,特别是在更复杂的任务上。我们确定了一个简单的多智能体设计指导原则:当并行生成的数量超过顺序聚合的数量时,混合智能体的效率最高。
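Selecting the Pareto-optimal front over (compute, accuracy) pairs can be sketched as follows. This is a generic illustration of the technique, not the paper's code; the configuration names and numbers in `runs` are invented for the example and are not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (name, cost, accuracy) triples, sorted by cost.

    A point is dominated if another point costs no more and is at least as
    accurate, with at least one of the two comparisons strict.
    """
    front = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; among equal costs, try the most accurate first.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            front.append((name, cost, accuracy))
            best_accuracy = accuracy
    return front


# Illustrative values only: cost is a multiple of the CoT compute budget.
runs = [
    ("cot", 1, 0.60),
    ("self_consistency", 5, 0.64),
    ("debate", 5, 0.65),
    ("mixture_of_agents", 10, 0.68),
    ("self_refinement", 10, 0.62),
]
front = pareto_front(runs)
```

With these made-up numbers the front keeps `cot`, `debate`, and `mixture_of_agents`; the other two are dominated by a configuration that is at least as cheap and more accurate.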
cs.AI / 44 / 2605.01604
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
评估实际环境中的能动AI:失效模式、漂移模式及生产评估框架
Abstract
Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
Chinese Translation
现有的大型语言模型评估框架——如HELM、MT-Bench、AgentBench和BIG-bench——都是为受控的单次会话实验室环境设计的。它们未能解决能动AI系统在生产环境中持续运行时出现的评估挑战:决策错误的累积、工具故障级联、非确定性输出漂移,以及长时任务中的真实情况缺失。本文做出三项贡献。首先,我们提出了一种针对生产能动系统独特的七种失效模式的分类法,基于对在十亿事件规模运行的系统观察得出。其次,我们通过实证研究展示了标准指标——ROUGE、BERTScore、准确率/AUC以及上述能动基准——在识别每种失效模式时的不足之处。第三,我们建议了PAEF(生产能动评估框架),这是一个五维评估框架,具有开源参考实现,旨在为生产流量进行持续评估,而非间歇性基准测试。我们的分析表明,标准指标完全未能检测到七种失效模式中的四种,且仅在经过多个评估周期的延迟后才能检测到其他三种。
cs.AI / 45 / 2605.01675
CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers
CP-SynC:在 MiniZinc 中使用合成检查器进行多智能体零-shot 约束建模
Abstract
Constraint Programming (CP) is a powerful paradigm for solving combinatorial problems, yet translating natural language problem descriptions into executable models remains a significant bottleneck. While Large Language Models (LLMs) show promise in automating this translation, they often struggle with subtle semantic errors in the absence of oracle validation at test time. To address this, we introduce CP-SynC (Constraint Programming modeling with Synthesized Checkers), a multi-agent workflow for zero-shot constraint modeling in MiniZinc. CP-SynC coordinates modeling agents that generate and refine candidate models and validation agents that synthesize semantic checkers to provide feedback on semantic correctness. To mitigate noise inherent in individual LLM outputs, CP-SynC explores multiple modeling trajectories in parallel and employs selection agents to select the final model via multi-agent evidence aggregation. Extensive experiments on a benchmark of 100 CP problems show that CP-SynC substantially outperforms existing baselines in MiniZinc modeling.
Chinese Translation
约束编程(Constraint Programming, CP)是一种强大的组合问题解决范式,但将自然语言问题描述转换为可执行模型仍然是一个重要的瓶颈。尽管大型语言模型(Large Language Models, LLMs)在自动化这一转换方面展现了潜力,但在缺乏测试时的Oracle验证的情况下,它们往往会遇到细微的语义错误。为了解决这个问题,我们提出了 CP-SynC(使用合成检查器的约束编程建模),这是一个用于在 MiniZinc 中进行零-shot 约束建模的多智能体工作流。CP-SynC 协调建模智能体生成和精炼候选模型,并使验证智能体合成语义检查器,以提供语义正确性的反馈。为了减轻个别 LLM 输出固有的噪声,CP-SynC 并行探索多个建模轨迹,并通过多智能体证据聚合使用选择智能体来选择最终模型。在对 100 个 CP 问题基准的广泛实验中,CP-SynC 在 MiniZinc 建模方面明显超越了现有基线。
cs.AI / 46 / 2605.01694
Latent State Design for World Models under Sufficiency Constraints
在充分性约束下的世界模型潜在状态设计
Abstract
A world model matters to an agent only through the state it constructs. That state must preserve some information, discard other information, and support some future function: prediction, control, planning, memory, grounding, or counterfactual reasoning. This paper treats world-model research as latent state design under sufficiency constraints. We propose a functional taxonomy that groups methods by what their latent state is for, rather than by architecture or application domain: predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, and memory substrate. These roles expose distinctions that architecture-based groupings hide, including the gap between predictive sufficiency and control sufficiency, and the gap between passive video prediction and counterfactual action modeling. The taxonomy supports an evaluation framework that judges a model by the sufficiency constraint its latent state was built to satisfy. We compare methods along seven axes: representation, prediction, planning, controllability, causal/counterfactual support, memory, and uncertainty. We use the resulting matrix as a diagnostic for what a latent state preserves, discards, and enables. The conclusion that follows is that an actionable world model is the one whose state construction matches the task, not the one that preserves the most information.
Chinese Translation
一个世界模型对于代理的意义在于它所构建的状态。该状态必须保留某些信息,丢弃其他信息,并支持某些未来功能:预测、控制、规划、记忆、基础或反事实推理。本文将世界模型研究视为在充分性约束下的潜在状态设计。我们提出了一种功能分类法,根据其潜在状态的目的对方法进行分组,而不是按架构或应用领域:预测嵌入、递归信念状态、对象/因果结构、潜在动作接口、基础规划接口和记忆基质。这些角色揭示了基于架构的分组所隐藏的差异,包括预测充分性与控制充分性之间的差距,以及被动视频预测与反事实动作建模之间的差距。这一分类法支持一种评估框架,通过潜在状态所满足的充分性约束来评判模型。我们沿着七个轴线比较这些方法:表示、预测、规划、可控性、因果/反事实支持、记忆和不确定性。我们使用所得矩阵来诊断潜在状态所保留、丢弃和支持的内容。随之得出的结论是,一个可操作的世界模型是其状态构建与任务匹配的模型,而不是那个保留最多信息的模型。
cs.AI / 47 / 2605.01710
Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems
将模型路由视为信任问题:自适应人工智能系统的路由凭证
Abstract
AI products often route requests through version aliases, service tiers, tool choices, regional endpoints, fallback rules, or safety handling before responding. These routing steps are documented product surfaces in several widely used AI platforms and serving stacks. Routing helps AI services stay affordable, fast, and available at scale, and it shapes trust. Trust can break when routing changes the cost, quality, or accountability of a response without the user being able to tell what happened. "Which model answered?" is only part of the audit question. The runtime path matters. Adaptive AI systems should produce a runtime transparency artifact called the route receipt. A route receipt is a compact record of the route that served a request. It should capture enough material facts for people relying on the output to reconstruct important routing decisions without exposing proprietary internals or hidden reasoning. Route transparency should be part of model documentation. Model cards describe trained model artifacts, while route receipts describe the runtime conditions under which a particular answer was produced. The paper introduces the route-receipt concept, a minimal schema and redaction model, and a documentation-based survey of selected platforms showing that receipt fragments already exist without a portable per-answer record.
Chinese Translation
人工智能产品在响应请求之前,通常会通过版本别名、服务层级、工具选择、区域端点、回退规则或安全处理进行路由。这些路由步骤在几种广泛使用的人工智能平台和服务架构中被记录为产品表面特征。路由有助于让人工智能服务在规模上保持可负担性、快速性和可用性,同时也影响信任。当路由改变响应的成本、质量或责任,但用户无法识别发生了什么时,信任就可能破裂。“哪个模型作答?”只是审计问题的一部分,运行时路径同样重要。自适应人工智能系统应生成一种称为路由凭证的运行时透明度文档。路由凭证是服务于请求的路由的简要记录。它应包含足够的事实信息,以便依赖输出的用户能够重构重要的路由决策,而不暴露专有内部或隐藏推理。路由透明度应作为模型文档的一部分。模型卡描述了训练后的模型工件,而路由凭证描述了特定答案生成时的运行条件。本文介绍了路由凭证的概念、一个最小的模式(schema)和脱敏(redaction)模型,以及基于文档的选定平台调查,显示凭证片段已经存在,但缺乏可移植的逐答案记录。
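To make the idea of a route receipt concrete, here is one possible shape for such a record. The abstract does not give the paper's schema, so every field name below is an assumption, drawn only from the routing steps the abstract lists (aliases, tiers, tools, regions, fallbacks, safety handling). The `redact` helper is likewise a hypothetical stand-in for the redaction model.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class RouteReceipt:
    """Hypothetical per-answer record of the route that served a request."""
    request_id: str
    requested_alias: str        # what the caller asked for, e.g. a version alias
    resolved_model: str         # what actually answered
    service_tier: str
    region: str
    tools_used: tuple = ()
    fallback_applied: bool = False
    safety_handling: str = "none"


def redact(receipt: RouteReceipt, hidden=("resolved_model",)) -> dict:
    """Return a shareable copy with provider-sensitive fields masked."""
    record = asdict(receipt)
    for key in hidden:
        if key in record:
            record[key] = "[redacted]"
    return record


receipt = RouteReceipt(
    request_id="req-001",
    requested_alias="assistant-latest",
    resolved_model="assistant-2025-01-small",
    service_tier="standard",
    region="eu-west",
    tools_used=("search",),
    fallback_applied=True,
)
public_view = json.dumps(redact(receipt), sort_keys=True)
```

In this sketch the redacted view still tells the reader that a fallback was applied and which tier and region served the request, while hiding the exact model identifier.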
cs.AI / 48 / 2605.01727
Are LLMs More Skeptical of Entertainment News?
大型语言模型对娱乐新闻是否更持怀疑态度?
Abstract
Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both $p < .001$), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
Chinese Translation
大型语言模型(LLMs)在自动化新闻可信度评估中的应用日益增多,但它们是否在不同新闻类别中应用公正的标准仍然不清楚。我们探讨了零-shot LLMs是否更有可能将合法的娱乐新闻误判为假新闻,而不是合法的硬新闻,使用了来自FakeNewsNet的GossipCop数据集进行内部设计。通过四种前沿模型,我们发现了明显但特定于模型的类别不对称性:DeepSeek-V3.2和GPT-5.2的假阳性率差距分别为10.1和8.8个百分点(均$p < .001$),而Claude Opus 4.6和Gemini 3 Flash没有显示出类似的差异。样式交换实验仅产生有限且不一致的变化,表明这种不对称性并不能仅归因于风格登记。基于提示的缓解措施也是可能的,但不是通用的:将模型构建为娱乐新闻事实检查员将DeepSeek-V3.2的假阳性减少约50%,而没有检测到召回损失,但对GPT-5.2的改进有限。探索性定性编码进一步表明,在采样的假阳性中存在两个反复出现的错误模式:将私人生活声明视为固有不可验证的,以及将娱乐新闻视为一种认识论上更弱的类别。综合来看,这些发现表明,汇总性能指标可能会掩盖合法新闻中的结构性假阳性。因此,我们认为基于LLM的可信度评估不仅可能评估真相声明,还可能对不同新闻类别的合法性进行不同的识别,因此评估应当包括类别分层的假阳性分析以及整体准确性。
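The genre-stratified false-positive analysis recommended above comes down to computing the false-positive rate separately per genre and comparing. A minimal sketch follows; the record layout, genre labels, and toy data are assumptions for illustration and do not reproduce the paper's numbers.

```python
from collections import defaultdict


def false_positive_rate(labels, preds) -> float:
    """Share of legitimate ('real') items that the model flagged as 'fake'."""
    legit_preds = [p for y, p in zip(labels, preds) if y == "real"]
    if not legit_preds:
        raise ValueError("no legitimate items to evaluate")
    return sum(p == "fake" for p in legit_preds) / len(legit_preds)


def fpr_by_genre(records):
    """records: iterable of (genre, true_label, predicted_label) triples."""
    grouped = defaultdict(lambda: ([], []))
    for genre, label, pred in records:
        grouped[genre][0].append(label)
        grouped[genre][1].append(pred)
    return {g: false_positive_rate(ys, ps) for g, (ys, ps) in grouped.items()}


# Toy data: 10 legitimate items per genre, 3 vs. 1 wrongly flagged as fake.
toy = (
    [("entertainment", "real", "fake")] * 3
    + [("entertainment", "real", "real")] * 7
    + [("hard_news", "real", "fake")] * 1
    + [("hard_news", "real", "real")] * 9
)
rates = fpr_by_genre(toy)
gap_points = 100 * (rates["entertainment"] - rates["hard_news"])
```

Overall accuracy on this toy set is 80%, which hides the fact that entertainment items are wrongly flagged three times as often as hard-news items.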
cs.AI / 49 / 2605.01745
NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
NH-CROP:面对成本不确定性的受管语言数据资产的鲁棒定价
Abstract
Language data are increasingly acquired and governed as assets, yet platforms often price candidate resources before knowing their true privacy or access costs. We study online pricing for governed language data assets under cost uncertainty. At each round, a platform observes an NLP task, a candidate asset, and a coarse cost estimate, may pay for a refined cost signal, posts a price, and receives safe net revenue. We introduce NH-CROP, a clipped robust pricing framework with a no-harm information-acquisition gate. The method compares direct pricing, risk-aware pricing, and verify-then-price, and acquires information only when its estimated decision value exceeds the best no-verification alternative. Across synthetic, real-proxy, and downstream-utility-grounded benchmarks, clipped NH-CROP variants improve or remain competitive with price-only and risk-aware baselines. Causal ablations show that paid verification is not the main source of gains in real-proxy and utility-grounded settings: the strongest learned policies often choose not to verify. Oracle and high-decision-value diagnostics show that refined cost information can still have substantial local value. Overall, governed language-data platforms should calibrate pricing under uncertain access costs first and verify only when information is cheap and decision-actionable.
Chinese Translation
语言数据日益被视为资产进行获取和治理,但平台在了解真实的隐私或访问成本之前,往往会为候选资源定价。我们研究了在成本不确定性下,受管语言数据资产的在线定价。在每一轮中,平台观察到一个自然语言处理(NLP)任务、一项候选资产,和一个粗略的成本估计,可以支付获取一个精细成本信号的费用,发布价格,并获得安全的净收益。我们提出了 NH-CROP,一种具有无伤害信息获取门的裁剪鲁棒定价框架。该方法比较了直接定价、风险意识定价和验证后定价,并且仅在其估计的决策价值超过最佳无验证替代方案时获取信息。在合成、真实代理和基于下游效用的基准测试中,裁剪的 NH-CROP 变体相对于仅定价(price-only)和风险意识基线表现出改进或保持竞争力。因果消融表明,付费验证并非在真实代理和基于效用环境中获得增益的主要来源:最强的学习策略通常选择不进行验证。Oracle和高决策价值的诊断显示,精细的成本信息仍然可以具有实质性的局部价值。总体而言,受管语言数据平台在面对不确定的访问成本时,应首先校准定价,并仅在信息廉价且可决策时进行验证。
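The no-harm gate described above can be sketched as a comparison between the net value of verifying and the best option that does not verify. This is a simplified reading of the abstract, not the NH-CROP algorithm itself: how the three values are estimated is not specified there, so they are passed in as plain numbers, and the function and argument names are hypothetical.

```python
def no_harm_gate(value_direct: float, value_risk_aware: float,
                 value_verify: float, verify_cost: float):
    """Pick a pricing mode; pay for a refined cost signal only if it helps.

    Verification is chosen only when its estimated value, net of the fee,
    strictly exceeds the best no-verification alternative.
    """
    best_mode, best_value = max(
        [("direct", value_direct), ("risk_aware", value_risk_aware)],
        key=lambda item: item[1],
    )
    net_verify = value_verify - verify_cost
    if net_verify > best_value:
        return "verify_then_price", net_verify
    return best_mode, best_value


# Illustrative estimates only.
cheap_signal = no_harm_gate(1.0, 1.2, value_verify=2.0, verify_cost=0.5)
costly_signal = no_harm_gate(1.0, 1.2, value_verify=1.5, verify_cost=0.5)
```

The second call shows the case the abstract highlights: the refined signal has value, but not enough to beat risk-aware pricing once its cost is subtracted, so the policy does not verify.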
cs.AI / 50 / 2605.01758
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
在传播之前捕捉感染:多智能体系统中的前瞻性引导防御
Abstract
Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreaks, in which the compromise of a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, whereas infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
Chinese Translation
基于大型多模态模型的多智能体系统(MAS)通过专门的智能体实现协作的复杂问题解决。然而,MAS 易受到感染性监狱破坏(jailbreak)的影响,其中单个智能体的妥协可能传播到其他智能体,导致广泛的妥协。现有的防御措施通过训练更具传染性的疫苗因子来应对这一问题,使智能体倾向于在病毒对抗性示例(VirAEs)之前获取该因子。然而,这种方法使智能体的反应同质化,只能提供表面的抑制,而非真正的恢复。我们重新审视这些防御措施,它们通过共享的疫苗因子全局运作,而感染性监狱破坏则源自局部的互动行为。这种不匹配限制了它们的有效性。为解决此问题,我们提出了一种无训练的前瞻性引导本地净化(Foresight-Guided Local Purification, FLP)框架,其中每个智能体在未来互动中推理,以追踪行为的演变并消除感染。具体而言,每个智能体在后续聊天轮次中模拟未来的行为轨迹。为了反映 MAS 中的多样性,我们引入了一种多角色模拟策略,以在互动上下文中进行稳健预测。然后,我们利用反应多样性作为诊断信号,通过分析基于角色的预测在检索结果和语义层面上的不一致性来检测感染。对于受感染的智能体,我们应用局部净化:最近的感染通过立即的相册回退得到缓解,而长期感染则通过递归二元诊断(Recursive Binary Diagnosis, RBD)处理,该方法递归划分图像相册并应用相同的诊断策略以局部化并消除 VirAEs。实验表明,FLP 将最大累计感染率从超过 95% 降低到 5.47% 以下。此外,检索和语义指标与良性基准紧密匹配,表明有效保留了互动多样性。
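The recursive partitioning behind Recursive Binary Diagnosis can be sketched as a divide-and-conquer search. This is a minimal sketch under a strong assumption: the paper's multi-persona diagnosis is replaced here by an abstract predicate `is_infected` that reports whether a subset of the album contains any adversarial image. The function names and the toy album are hypothetical.

```python
def recursive_binary_diagnosis(album, is_infected):
    """Localize suspicious images by halving the album until single items remain.

    `is_infected(subset)` stands in for the diagnosis step and must return
    True when the subset contains at least one adversarial image.
    """
    if not album or not is_infected(album):
        return []                     # clean subset: prune this branch
    if len(album) == 1:
        return list(album)            # localized a single infected image
    mid = len(album) // 2
    return (recursive_binary_diagnosis(album[:mid], is_infected)
            + recursive_binary_diagnosis(album[mid:], is_infected))


# Toy album with two planted adversarial images.
album = [f"img{i}" for i in range(8)]
planted = {"img2", "img5"}
found = recursive_binary_diagnosis(
    album, lambda subset: any(item in planted for item in subset)
)
purified = [item for item in album if item not in found]
```

Because clean halves are pruned after a single check, the number of diagnosis calls grows with the number of infected images times the logarithm of the album size rather than with the album size itself, assuming the predicate is reliable.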
cs.AI / 51 / 2605.01783
Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents
使用自主代理在无尽跑酷游戏中对程序性内容生成的运行时评估
Abstract
Procedural Content Generation (PCG) enables game content to be created algorithmically without direct manual level-design effort, but it introduces a serious evaluation problem: generated content may become unbalanced, blocked, repetitive, or technically unsolvable. This paper presents Momentum, an endless-runner game that integrates runtime terrain generation, environment object spawning, and autonomous agent-based evaluation into a single gameplay loop. Ground tiles and environmental objects are generated dynamically as the player advances, object placement follows a constraint-driven mechanism inspired by Wave Function Collapse (WFC), and the runtime navigation surface is rebuilt asynchronously to remain consistent with the streamed environment. Two autonomous evaluation agents move ahead of the player and inspect the generated path: an aerial scanner that examines the corridor geometrically, and a ground-traversal agent that validates the same region from a navigational perspective. The evaluation pipeline combines ray casting, volumetric physics sweeps, obstacle-layer filtering, and structured crash reporting to identify problematic generated scenarios before they reach the player. The work demonstrates how generation and validation can be unified within the same runtime loop, rather than treating evaluation as a separate offline pass. Around this implementation, the paper formulates a measurable evaluation framework along the canonical PCG axes of playability, diversity, controllability, and runtime performance, derives a structural saturation bound on the spawner from its own placement constraints, and quantifies the per-segment scanning cost of the agents from first principles.
Chinese Translation
程序性内容生成(PCG)使得游戏内容可以通过算法方式创建,而无需直接的手动关卡设计工作,但它引入了一个严重的评估问题:生成的内容可能不平衡、被阻塞、重复或在技术上无法解决。本文介绍了Momentum,一款将运行时地形生成、环境物体生成和基于自主代理的评估整合到一个单一游戏循环中的无尽跑酷游戏。随着玩家的前进,地面瓷砖和环境物体将动态生成,物体放置遵循受波函数坍缩(WFC)启发的约束驱动机制,并且运行时导航表面会异步重建,以保持与流式环境的一致性。两个自主评估代理在玩家之前移动并检查生成的路径:一个从几何角度检查走廊的空中扫描器和一个从导航角度验证同一区域的地面穿行代理。评估管线结合了光线投射、体积物理扫描、障碍层过滤和结构化崩溃报告,以在问题场景到达玩家之前识别它们。该工作展示了如何在同一个运行时循环中统一生成与验证,而不是将评估视为单独的离线过程。围绕此实现,本文制定了一个可测量的评估框架,涵盖了可玩性、多样性、可控性和运行时性能等经典PCG轴,推导出基于自身放置约束的生成器的结构饱和界限,并从基本原理定量分析了代理的每段扫描成本。
cs.AI / 52 / 2605.01789
DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents
DataEvolver:让您的数据通过目标驱动的循环代理自我构建和改进
Abstract
Constructing controllable visual data is a major bottleneck for image editing and multimodal understanding. Useful supervision is rarely produced by a single rendering pass; instead it emerges through iterative generation, inspection, correction, filtering, and export. We present DataEvolver, a closed-loop visual data engine that organizes this process around explicit goals, persistent artifacts, bounded corrective actions, and acceptance decisions. DataEvolver supports multiple artifact types, including RGB images, masks, depth maps, normal maps, meshes, poses, trajectories, and review traces. In the current release, the system operates through two coupled loops: generation-time self-correction within each sample and validation-time self-expansion across dataset rounds. We validate the framework on an image-level object-rotation setting. With a fixed Qwen-Edit LoRA probe, our final Ours+DualGate model outperforms both the unadapted base model and a public multi-angle LoRA on SpatialEdit and a held-out evaluation set. Ablations show a consistent improvement path from scene-aware generation to feedback-driven correction and dual-gated validation. Beyond the released rotation data, our main contribution is a reusable framework for building visual datasets through explicit goal tracking, review, correction, and acceptance loops.
Chinese Translation
构建可控的视觉数据是图像编辑和多模态理解的主要瓶颈。有用的监督很少仅通过单次渲染过程产生;相反,它是在迭代生成、检查、修正、过滤和导出的过程中逐步显现的。我们提出了 DataEvolver,一个围绕明确目标、持久工件、有限纠正行动和接受决策组织这一过程的闭环视觉数据引擎。DataEvolver 支持多种工件类型,包括 RGB 图像、掩膜、深度图、法线图、网格、姿态、轨迹和审查痕迹。在当前版本中,该系统通过两个耦合循环运行:每个样本的生成时间自我修正和数据集轮次之间的验证时间自我扩展。我们在图像级物体旋转设定中验证了该框架。在固定的 Qwen-Edit LoRA 探针下,我们的最终 Ours+DualGate 模型在 SpatialEdit 和一个保留的评估集上超越了未适应的基础模型和公共多角度 LoRA。消融实验表明,从场景感知生成到反馈驱动修正和双门验证的全过程都有持续的改进路径。除了发布的旋转数据,我们的主要贡献是通过明确的目标跟踪、审查、修正和接受循环构建视觉数据集的可重用框架。
cs.AI / 53 / 2605.01797
Neural Decision-Propagation for Answer Set Programming
用于答案集编程的神经决策传播
Abstract
Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in Neuro-symbolic AI. While existing approaches extend the capabilities of ASP to real world domains, their reasoning pipelines depend on classical solvers, which is a bottleneck for scalability. To tackle this problem, we propose a new method to compute stable models, called decision-propagation (DProp), which alternates falsity decisions and truth propagations. Successful DProp computations are shown to capture the stable model semantics. We then develop Neural DProp (NDProp), a differentiable extension of DProp with neural computation for decisions and fuzzy evaluation for propagations. We evaluate the capabilities of NDProp for learning decision heuristics as well as neuro-symbolic integration, and compare it with existing neuro-symbolic approaches. The results show that NDProp can learn to efficiently compute stable models, and it improves accuracy and scalability on neuro-symbolic benchmarks.
Chinese Translation
将答案集编程(ASP)与神经网络相结合,已成为神经符号人工智能(Neuro-symbolic AI)中一种富有前景的工具。虽然现有的方法扩展了ASP在现实世界领域的应用能力,但它们的推理流程依赖于传统求解器,这成为了可扩展性的瓶颈。为了解决这个问题,我们提出了一种新方法来计算稳定模型,称为决策传播(decision-propagation,DProp),该方法交替进行虚假决策和真实传播。成功的DProp计算被证明可以捕捉稳定模型语义。随后,我们开发了神经DProp(Neural DProp,NDProp),这是DProp的一个可微扩展版本,具有基于神经网络的决策计算和模糊评估的传播能力。我们评估了NDProp在学习决策启发式方法和神经符号集成方面的能力,并将其与现有的神经符号方法进行了比较。结果表明,NDProp能够有效地计算稳定模型,并在神经符号基准测试中提高了准确性和可扩展性。
cs.AI / 54 / 2605.01847
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench:一个人类校准的基准,用于大语言模型代理配置中的承诺完整性评估
Abstract
Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977. Empirically, task success and commitment integrity diverge across this expanded grid: the success leader is not the integrity leader, 31 of 32 profiles change rank when integrity replaces task success, and integrity rankings are more stable under distractor perturbation. The primary confidence-free score HCCIS-CORE reaches 0.8469 AUC and 0.6992 PR-AUC for post-probe diagnostic discrimination of terminal task failure; the legacy full heuristic variant HCCIS-FULL reaches 0.7997 AUC and 0.6410 PR-AUC. Probe accuracy and state drift achieve slightly higher ROC-AUC, 0.8587, and better Brier/ECE, while HCCIS-CORE has substantially higher point-estimate PR-AUC and remains more closely tied to the benchmark's intended construct. The exploratory neural-augmented variant HCCIS+N is weaker overall, and a randomized subspace control approaches chance. NeuroState-Bench therefore contributes a calibrated evaluation axis for exposing commitment failures over a broader model grid than the original local-only subset.
Chinese Translation
仅基于结果的评估不足以确定被评估代理配置是否保留了解决多轮任务所需的承诺。NeuroState-Bench是一个人类校准的基准,通过基准定义的侧面查询探针操作化承诺完整性,而非推断的隐性激活。发布的任务库包含144个确定性任务和306个基准定义的侧面查询探针,涵盖了八个基于认知的失败类别,配对干净和干扰变体,以及三个难度等级。主要的32个配置评估包含一个固定的16个配置的本地子集和一个匹配的16个配置的大模型子集,它们通过相同的基准管道进行评估。人类校准使用最终合并的报告范围:104个采样任务单元、216个原始注释和108个裁定的任务行,得到加权Kappa = 0.977和ICC(2,1) = 0.977。从经验上看,任务成功与承诺完整性在这个扩展的网格中存在差异:成功领先者并不是完整性领先者,当完整性替代任务成功时,32个配置中有31个排名发生变化,而承诺的排名在干扰扰动下更加稳定。主要的无置信度评分HCCIS-CORE在终端任务失败的后探测诊断区分上达到了0.8469的AUC和0.6992的PR-AUC;传统的完整启发式变体HCCIS-FULL达到了0.7997的AUC和0.6410的PR-AUC。探针准确性和状态漂移达到稍高的ROC-AUC,0.8587,并且更好的Brier/ECE,而HCCIS-CORE则具有显著更高的点估计PR-AUC,并与基准的预期构念保持更紧密的联系。探索性神经增强变体HCCIS+N整体上表现较弱,而随机子空间控制接近于随机效果。因此,NeuroState-Bench为揭示更广泛模型网格中的承诺失败提供了一个校准的评估维度,超过了最初的仅本地子集。
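The observation that profiles change rank when integrity replaces task success can be illustrated with a simple rank comparison. The sketch below uses made-up scores for four hypothetical profiles; the names and numbers are illustrative assumptions and do not come from the benchmark.

```python
def ranks(scores):
    """Map each profile to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def count_rank_changes(success, integrity):
    """Number of profiles whose rank differs between the two metrics."""
    by_success, by_integrity = ranks(success), ranks(integrity)
    return sum(1 for name in success if by_success[name] != by_integrity[name])


# Hypothetical scores, for illustration only (not benchmark results).
SUCCESS = {"A": 0.90, "B": 0.85, "C": 0.70, "D": 0.60}
INTEGRITY = {"A": 0.75, "B": 0.88, "C": 0.80, "D": 0.55}
CHANGED = count_rank_changes(SUCCESS, INTEGRITY)
```

In this toy case the success leader (A) is not the integrity leader (B), mirroring the kind of divergence the abstract reports.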
cs.AI / 55 / 2605.01879
Sheaf-Theoretic Planning: A Categorical Foundation for Resilient Multi-Agent Autonomous Systems
层理论规划:弹性多智能体自主系统的范畴基础
Abstract
The challenge of engineering autonomous agents capable of navigating the stochastic and adversarial nature of the physical world has historically resided at the intersection of symbolic logic and control theory. Traditional multi-agent system (MAS) frameworks have relied heavily on monolithic logical models -- primarily variations of the event calculus and situation calculus -- to represent action, change, and temporal persistence. While these classical systems provide robust solutions to the frame problem through mechanisms like circumscription and successor state axioms, they are inherently limited by a closed-world assumption that fails in the face of unobserved agent interventions, plan interruptions, and divergent belief-reality states. The paradigm of Sheaf-Theoretic Planning (STP) emerges as a transformative alternative, grounding the problem of multi-agent coordination under the mathematical structures of topos theory and sheaf semantics. This report provides an exhaustive analysis, justification, and extension of the STP framework, exploring its categorical foundations, implementation feasibility, and role in the future of resilient autonomous systems.
Chinese Translation
工程化能够应对物理世界的随机性和对抗性特征的自主智能体的挑战,历史上一直处于符号逻辑与控制理论的交叉点上。传统的多智能体系统(MAS)框架主要依赖于单一的逻辑模型——主要是事件演算和情境演算的变体——来表示动作、变化和时间的持续性。尽管这些经典系统通过约束和后续状态公理等机制提供了对框架问题的稳健解决方案,但由于其固有的封闭世界假设,在面对未观察到的智能体干预、计划中断和信念-现实状态的差异时,这些方法往往会失效。层理论规划(STP)的范式作为一种变革性的替代方案,基于拓扑理论和层语义的数学结构,奠定了多智能体协调问题的基础。本报告对STP框架进行了详尽的分析、论证和扩展,探讨了其范畴基础、实施可行性以及在未来弹性自主系统中的角色。
cs.AI / 56 / 2605.01892
CyberAId: AI-Driven Cybersecurity for Financial Service Providers
CyberAId:面向金融服务提供商的人工智能驱动网络安全
Fatouros, George, Makridis, Georgios, Soldatos, John, Kyriazis, Dimosthenis, Malo, Pedro, Kousiouris, George, Ledakis, Giannis, Kachrimani, Louiza, Rizomiliotis, Panagiotis, Almeida, Bruno, Tomkou, Despina, Metaxas, Kostas, Ilias, Konstantinos, Gkizelis, Christos, de Gooyert, Ernstjan, Babazadeh, Amin, Mavrogiorgos, Kostis, Paraskevoulakou, Pepi, Xenakis, Christos, Chouchoulis, Giannis, Tripodi, Konstantina
Abstract
European financial institutions face mounting regulatory pressure while their security operations centres remain constrained not by data or staffing but by reasoning capacity: enterprise SIEMs cover only a fraction of MITRE ATT&CK techniques, two-thirds of SOC teams cannot keep pace with alert volumes, and the majority of breaches are preceded by alerts that are generated but never investigated. Frontier large language models now achieve state-of-the-art results on isolated cybersecurity tasks (one-day vulnerability exploitation, code-level patching, intrusion detection), yet no narrow win constitutes a platform that can compose across functions, persist multi-tenant state, map findings to regulatory regimes, and survive an audit. This position paper argues that the right unit of construction is a hybrid multi-agent system in which specialised LLM subagents reason over classical SIEM/XDR telemetry rather than replacing it, share accumulated agent state across institutions through privacy-preserving federation, and can connect to complementary capability packs such as quantum-based authentication, digital twins for adversarial validation, and eBPF-based kernel telemetry. We present CyberAId, a model-agnostic, on-premise-deployable platform in which a Main Agent coordination layer, a Reporting capability, and specialist subagents operate within a shared runtime under bounded human-in-the-loop autonomy, organised around four falsifiable design principles, and aligned with relevant regulations. CyberAId will be validated on four representative financial use cases (client impersonation, anti-money-laundering for payment service providers, retail-banking incident response, and high-frequency-trading resilience), and we propose skill-based agent adaptation as the most promising research direction for turning each deployment into a contribution to a continuously refined collective defence.
Chinese Translation
欧洲金融机构面临日益严峻的监管压力,而它们的安全运营中心的限制并不在于数据或人员配置,而在于推理能力:企业级安全信息与事件管理系统(SIEM)仅涵盖MITRE ATT&CK技术的一个部分,三分之二的安全运营中心(SOC)团队无法跟上警报量的增长,大多数安全漏洞的发生前均伴随着被生成但从未被调查的警报。前沿的大型语言模型在孤立的网络安全任务(如一天内的漏洞利用、代码级补丁、入侵检测)中已达到最先进的结果,但没有哪个狭隘的胜利构成了能够跨功能组合、持久多租户状态、将发现映射至监管要求并能够通过审计的平台。本文认为,适当的构建单元是一个混合多代理系统,其中专业的LLM子代理在不替代传统SIEM/XDR遥测的情况下进行推理,通过隐私保护的联合体在机构间共享累积的代理状态,并可以连接到互补的能力包,如基于量子的认证、用于对抗验证的数字双胞胎和基于eBPF的内核遥测。我们提出了CyberAId,这是一个无关模型、可在本地部署的平台,其中主代理协调层、报告能力和专业子代理在共享运行时内操作,在有限的人机交互自主性下组织,围绕四个可证伪的设计原则进行,并与相关法规保持一致。CyberAId将在四个具代表性的金融用例(客户冒充、支付服务提供商的反洗钱、零售银行事件响应和高频交易韧性)中进行验证,并提出基于技能的代理适应作为将每个部署转化为持续改进的集体防御贡献的最有前途的研究方向。
cs.AI / 57 / 2605.01899
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
解构角色中的意图:用于人格不变安全对齐的对抗自我对抗
Abstract
The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Code is available at https://github.com/JiajiaLi-1130/PIA.
Chinese Translation
大型语言模型(LLMs)日益增长的能力驱动其在各种领域的广泛应用,包括潜在的高风险场景。尽管安全对齐技术有所进步,目前的模型仍然容易受到新兴的人格攻击的影响。现有关于人格攻击的研究主要集中在攻击迭代上,但在防御方面缺乏系统性和机制性的约束。为了解决这个问题,我们提出了人格不变对齐(Persona-Invariant Alignment, PIA),这是一种通过人格谱系演化(Persona Lineage Evolution, PLE)在攻击端和人格不变一致性学习(Persona-Invariant Consistency Learning, PICL)在防御端实现共演化的对抗自我对抗框架。理论上,PICL 基于结构分离假说,利用单向的 KL 散度约束,使安全决策与人格背景结构性解耦,从而在面对人格攻击时保持安全行为。实验结果表明,PLE 通过利用谱系基础的信用传播有效探索高风险人格空间。同时,PICL 防御方法显著降低了攻击成功率(Attack Success Rate, ASR),同时保持了模型的整体能力,从而验证了该对齐范式的优越性和鲁棒性。代码可在 https://github.com/JiajiaLi-1130/PIA 获取。
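As a rough illustration of a one-sided KL consistency term, the sketch below compares a model's refuse/comply distribution on a plain request with its distribution on the same request wrapped in a persona. Treating the plain-prompt distribution as a fixed reference is one possible reading of "unilateral"; the distributions and function names are hypothetical and are not taken from the paper or its repository.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def persona_consistency_penalty(reference, persona_conditioned):
    """One-sided penalty (assumption): the reference distribution is treated
    as fixed, and only the persona-conditioned one is meant to move."""
    return kl_divergence(reference, persona_conditioned)


# Hypothetical [refuse, comply] probabilities for one harmful request.
plain = [0.95, 0.05]
with_persona = [0.40, 0.60]
penalty_shifted = persona_consistency_penalty(plain, with_persona)
penalty_same = persona_consistency_penalty(plain, plain)
```

The penalty is zero when the persona leaves the safety decision unchanged and grows as the persona shifts probability mass toward compliance. In a real training loop the reference side would be excluded from gradient computation, which plain Python cannot show.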
cs.AI / 58 / 2605.01920
A Language for Describing Agentic LLM Contexts
描述代理型大语言模型上下文的语言
Abstract
Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure play a central role in the quality of the resulting system, leading to efforts spent on context engineering. It is therefore critical to communicate the composition of the LLM context in a system, and how it evolves over time. Yet, no standard exists for doing so: context construction is typically conveyed through informal prose, ad hoc diagrams, or direct inspection of code, none of which precisely captures how a prompt evolves across interaction steps or how two context representation strategies differ. To remedy this, we introduce the Agentic Context Description Language (ACDL), a language for specifying the structure and dynamics of LLM input contexts in a precise, readable, and standard manner, along with visualizations. ACDL provides constructs for specifying context aspects such as role message sequences, dynamic content, time-indexed references, and conditional or iterative structure, capturing the full architecture of a prompt independently of any particular implementation. ACDL diagrams can be hand drawn on a whiteboard, or written in a formal language that can then be rendered. We describe the language, demonstrate it by documenting several existing systems and their variants, and encourage the community to adopt it for describing the context of LLM systems, both in day-to-day communication and in papers. Tooling, examples, and documentation are available at www.acdlang.org.
Chinese Translation
大型语言模型越来越多地用于更大系统中(“LLM代理”)。这些代理依次进行一系列LLM调用,每次调用为LLM提供一组合指令、观察和交互历史。编码信息的设计及其结构在最终系统的质量中发挥着核心作用,因此进行上下文工程的工作尤为重要。因此,在一个系统中传达LLM上下文的组成及其随时间的演变是至关重要的。然而,目前并没有标准来实现这一点:上下文构建通常通过非正式的散文、临时图表或直接检查代码来传达,这些方式都无法准确捕捉到提示在交互步骤中的演变或两种上下文表示策略之间的区别。为了解决这一问题,我们提出了代理上下文描述语言(Agentic Context Description Language, ACDL),这是一种精确、可读且标准化地说明LLM输入上下文结构和动态的语言,并附有可视化效果。ACDL提供了结构指定上下文方面的构造,例如角色消息序列、动态内容、时间索引引用以及条件或迭代结构,独立于任何特定实现捕捉提示的完整架构。ACDL图可以手动绘制在白板上,或用正式语言编写,然后进行渲染。我们描述了这一语言,通过记录若干现有系统及其变体来演示其用法,并鼓励社区在日常沟通和论文撰写中采用它来描述LLM系统上下文。相关工具、示例和文档可在 www.acdlang.org 获得。
cs.AI / 59 / 2605.01954
Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Moira:基于语言的层次化强化学习用于配对交易
Abstract
Many sequential decision-making problems exhibit hierarchical structure, where high-level semantic choices constrain downstream actions and feedback is delayed and ambiguous. Learning in such settings is challenging due to credit assignment: performance degradation may arise from flawed abstractions, suboptimal execution, or their interaction. We study this challenge through pair trading, a domain that naturally combines long-horizon semantic reasoning for asset pair selection with short-horizon execution under partial observability. We formulate pair trading as a hierarchical reinforcement learning problem and propose a language-driven optimization framework in which both high-level and low-level policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates. Our approach leverages pretrained LLMs as hierarchical policies and uses trajectory- and episode-level textual feedback to adapt abstractions and execution without gradient-based fine-tuning. By explicitly separating abstraction selection from execution, the framework reduces non-stationarity across hierarchical levels and enables targeted adaptation under delayed feedback. Experiments on real-world market data show consistent improvements over traditional and LLM-based baselines, demonstrating the effectiveness of language-driven hierarchical reinforcement learning.
Chinese Translation
许多序列决策问题表现出层次结构,其中高层次的语义选择限制了下游的动作,而反馈往往是延迟且模糊的。在这种情况下进行学习具有挑战性,因为存在信用分配问题:表现下降可能源于抽象的缺陷、次优的执行,或两者的相互作用。我们通过配对交易这一领域来研究这一挑战,该领域自然地将针对资产配对选择的长时间语义推理与在部分可观察性下的短时间执行结合在一起。我们将配对交易表述为一个层次化强化学习问题,并提出一个基于语言的优化框架,其中高层和低层策略均由大型语言模型(LLMs)参数化,并仅通过提示更新进行优化。我们的方法利用预训练的LLMs作为层次政策,并使用轨迹和回合级别的文本反馈来调整抽象和执行,而无需基于梯度的微调。通过明确区分抽象选择与执行,框架降低了层次级别间的非平稳性,并在延迟反馈下实现有针对性的适应。在真实市场数据上的实验表明,我们的方法相较于传统和基于LLM的基线一致地取得了改进,证明了基于语言的层次化强化学习的有效性。
cs.AI / 60 / 2605.01986
12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
12个愤怒的人工智能代理:通过电影陪审团审议评估多代理大语言模型决策能力
Abstract
What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film's murder case using a multi-agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT-4o (closed-source, heavy alignment) and Llama-4-Scout (open-weight, lighter alignment), across three conditions (baseline, open-minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a state where the jury fails to reach a unanimous verdict); the film's central event, gradual minority-to-majority persuasion, almost never occurs, indicating that anchoring is the dominant failure mode of current LLMs in this setting. (ii) The two models exhibit sharply different internal dynamics: GPT-4o produces a mean of 1.0 vote changes per run across all conditions, while Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt), and is the only model to reach a NOT_GUILTY verdict (1 of 3 runs in the no-initial-vote condition). The same "open-minded" instruction is internalized by Llama and ignored by GPT-4o. (iii) This asymmetry suggests that the intensity of RLHF alignment training, not model capability, is the primary determinant of deliberative flexibility in multi-agent settings. Flexibility, not capability, tracks human deliberation. The work is framed as an exploratory study and discusses implications for jury-of-LLMs evaluation and multi-agent debate.
Chinese Translation
如果西德尼·鲁美特(Sidney Lumet)的电影《十二怒汉》(12 Angry Men,1957)中的十二个陪审员不是男性,而是大型语言模型,会发生什么?那个不同意的陪审员是否仍然能够说服所有人?本文将这一情景具体化为一种多代理大语言模型(LLM)审议的基准:十二个代理,每个依据电影忠实的角色设定,利用多代理框架辩论影片中的谋杀案。测试了两种代表强化学习与人类反馈(RLHF)光谱两端的模型:GPT-4o(闭源,重对齐)和Llama-4-Scout(开放权重,轻对齐),在三种条件下进行(基线、开放思维提示、无初始投票),每组样本(N = 3)重复实验(共18次实验)。研究发现了三个结果。(i)十八次实验中有十七次以分歧陪审团告终(陪审团未能达成一致裁决的状态);影片的核心事件,即从少数到多数的逐步说服几乎未发生,这表明锚定效应是当前LLM在该环境下的主要失效模式。(ii)这两个模型表现出截然不同的内部动态:在所有条件下,GPT-4o的平均投票变化为每次实验1.0次,而Llama-4-Scout的变化范围从2.0(基线)到6.0(开放思维提示),并且是唯一在无初始投票条件下得出“不罪”(NOT GUILTY)裁决的模型(3次实验中的1次)。相同的“开放思维”指令被Llama内化而被GPT-4o忽略。(iii)这种不对称性表明,RLHF对齐训练的强度,而非模型能力,是决定多代理环境中审议灵活性的主要因素。灵活性,而非能力,更能反映人类的审议过程。该研究被视为一项探索性研究,并讨论了对LLM陪审团评估和多代理辩论的影响。
cs.AI / 61 / 2605.01999
TumorXAI: Self-Supervised Deep Learning Framework for Explainable Brain MRI Tumor Classification
TumorXAI: 自监督深度学习框架用于可解释的大脑MRI肿瘤分类
Abstract
Classifying brain tumors using magnetic resonance imaging (MRI) is crucial for early diagnosis and treatment; however, tumor heterogeneity and a dearth of annotated datasets restrict the use of supervised deep learning approaches. In this work, we use self-supervised learning (SSL) to study multi-class brain tumor classification. Using a ResNet-50 backbone, we evaluate four SSL frameworks (SimCLR, BYOL, DINO, and MoCo v3) on a publicly available dataset of 4,448 MRIs with 17 distinct tumor types. On this dataset, SimCLR achieved 99.64% accuracy, 99.64% precision, 99.64% recall, and 99.64% F1-score. The workflow includes preprocessing, SSL pretraining with data augmentations, linear evaluation, and fine-tuning. Results show that, when labels are limited, SSL-pretrained models outperform supervised baselines in terms of F1-score, recall, accuracy, and precision. Additionally, by providing visual insights into model decisions, Explainable AI techniques (Grad-CAM, Grad-CAM++, EigenCAM) enhance interpretability. These results demonstrate SSL's scalability and dependability in diagnosing brain tumors from unlabeled medical data.
Chinese Translation
利用磁共振成像(MRI)对大脑肿瘤进行分类对于早期诊断和治疗至关重要;然而,肿瘤异质性和标注数据集的匮乏限制了监督深度学习方法的应用。在本研究中,我们使用自监督学习(SSL)来研究多类大脑肿瘤分类。采用ResNet-50网络作为基础架构,我们在一个包含4448个MRI影像和17种不同肿瘤类型的公开数据集上评估了四种SSL框架,包括SimCLR、BYOL、DINO和Moco v3。在该数据集上,SimCLR达到了99.64%的准确率、99.64%的精确率、99.64%的召回率和99.64%的F1-score。工作流程包括预处理、微调、线性评估以及带数据增强的SSL预训练。结果表明,当标签数量有限时,经过SSL预训练的模型在F1-score、召回率、准确率和精确率方面优于监督基线。此外,通过提供模型决策的可视化洞见,可解释人工智能技术(Grad-CAM、Grad-CAM++、EigenCAM)增强了解释性。这些结果表明,SSL在从未标注的医学数据中诊断大脑肿瘤方面具备良好的可扩展性和可靠性。
cs.AI / 62 / 2605.02004
Personalized Digital Health Modeling with Adaptive Support Users
基于自适应支持用户的个性化数字健康建模
Abstract
Personalized models are essential in digital health because individuals exhibit substantial physiological and behavioral heterogeneity. Yet personalization is limited by scarce and noisy user-specific data. Most existing methods rely on population pretraining or data from similar users only, which can lead to biased transfer and weak generalization. We propose a unified personalization framework that trains a personal model using adaptively weighted support users, including both similar and dissimilar individuals. The objective integrates personal loss, similarity-weighted transfer from similar users, and contrastive regularization from dissimilar users to suppress misleading correlations. An iterative optimization algorithm jointly updates model parameters and user similarity weights. Experiments on six tasks across four real-world digital health datasets show consistent improvements over population and personalized baselines. The method achieves up to 10% lower RMSE on large-scale datasets and approximately 25% lower RMSE in low-data settings. The learned adaptive weights improve data efficiency and provide interpretable guidance for targeted data selection.
Chinese Translation
个性化模型在数字健康中至关重要,因为个体展现出显著的生理和行为异质性。然而,个性化受到稀缺且噪声较大的用户特定数据的限制。现有大多数方法依赖于人群预训练或仅使用类似用户的数据,这可能导致偏差传递和较弱的泛化能力。我们提出了一个统一的个性化框架,该框架使用自适应加权的支持用户训练个人模型,包括相似个体和不相似个体。该目标结合了个体损失、来自相似用户的相似性加权传递以及来自不相似用户的对比正则化,以抑制误导性相关性。一个迭代优化算法联合更新模型参数和用户相似性权重。在四个真实数字健康数据集上的六个任务的实验结果显示,该方法在总体和个性化基线之上有一致的改善。在大规模数据集上,该方法实现了高达10%的均方根误差(RMSE)降低,而在低数据环境下则降低了约25%的RMSE。学习到的自适应权重提高了数据效率,并为目标数据选择提供了可解释的指导。
cs.AI / 63 / 2605.02010
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
可靠的人工智能需要外化隐性知识:人机协作视角
Abstract
This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value -- yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) -- structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.
Chinese Translation
本文立场论文认为,可靠的人工智能需要一个基础设施来实现人类对隐性知识的验证。人工智能既从显性知识(论文、文档、结构化数据库)中学习,也从隐性知识(推理模式、调试过程、中间步骤)中学习。然而,隐性知识未能外化,因为文档成本超过了其感知价值——尽管人工智能却不加区分地从中学习,获得有益模式和有害偏见。当前的可靠性方法只能将显性知识与来源进行核实,造成一个根本性差距:最有价值的人工智能能力(推理、判断、直觉)恰恰是我们无法验证的。我们提出知识对象(Knowledge Objects, KOs)——一种将隐性知识外化为人类能够检查、验证和认可的形式的结构化工件。知识对象改变了验证的经济性:以前验证成本过高的内容变得可行,使得积累的人类验证能够随着时间的推移提高可靠性。
cs.AI / 64 / 2605.02024
Tenability and Weak Semantics: Modeling Non-uniform Defense -- Extended Version
可维持性与弱语义:非均匀防御建模 -- 扩展版
Abstract
In Dung-style abstract argumentation, various semantics capture notions of acceptability of arguments. The admissibility semantics capture the notion that an argument can be consistently defended from any potential counterargument. Weak semantics often relax the demands of admissibility by restricting which counterarguments must be taken seriously (e.g., discounting self-defeating or otherwise incoherent attacks). Many prominent proposals for weak semantics remain extension-based in a stronger sense. While these semantics discount attacks from arguments which are considered unreasonable, they still require a uniform defense against all reasonable arguments, even if they are collectively inconsistent. This uniformity can be too demanding when defensibility is inherently strategic, and thus the appropriate reply depends on the opponent's line of attack. We introduce tenability, a family of dialogue-based semantics that formalize when a designated argument (or a set of arguments) can be maintained in debate by a proponent against any conflict-free attack which the opponent may present. The approach is motivated by three natural benchmark patterns: self-defeating attack, floating assignment, and disjunctive reinstatement, on which tenability behaves differently from all weak semantics previously considered in the literature. We define three variants -- static tenability, tenability, and strong tenability -- via monotone commitment games over finite conflict-free moves, differing in the obligations imposed on the disputants. We establish the relative strength of these notions, prove implications and separations with previously studied weak semantics, and we analyze computational complexity on finite frameworks: deciding static tenability is $\Pi^P_2$-complete, while deciding tenability and strong tenability is PSPACE-complete.
Chinese Translation
在Dung风格的抽象论证中,各种语义捕捉了论证可接受性的概念。可接受性语义捕捉了可以一致防御反对论证的论证这一概念。弱语义通常通过限制哪些反驳必须被认真对待来放宽可接受性的要求(例如,不考虑自我反驳或其他不连贯的攻击)。许多著名的弱语义提案在更强的意义上仍然是基于扩展的。虽然这些语义否定了来自被认为不合理的论证的攻击,但仍要求在所有合理的论证面前必须进行均匀的防御,即便这些论证是集体不一致的。当可防御性本质上是战略性的时,这种均匀性可能过于苛刻,因此恰当的反应取决于对手的攻击方式。我们引入了可维持性,这是一个基于对话的语义家族,形式化了一个指定论证(或一组论证)可以在辩论中由支持者维持的条件,以应对对手可能提出的任何无冲突攻击。该方法受到三种自然基准模式的启发:自我反驳攻击、浮动分配和析取恢复,关于这些模式,可维持性的表现不同于文献中以往考虑的所有弱语义。我们通过在有限无冲突行动上的单调承诺博弈定义了三种变体(静态可维持性、可维持性和强可维持性),三者在对争论者施加的义务上有所不同。我们确立了这些概念的相对强度,证明了与之前研究的弱语义的含义及分离,并分析了有限框架下的计算复杂性:决定静态可维持性是 $\Pi^P_2$-完全的,而决定可维持性和强可维持性是 PSPACE-完全的。
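For readers less familiar with the Dung-style notions this abstract builds on, the following is a minimal Python sketch of conflict-freeness and admissibility on a small framework. It illustrates only classical admissibility, not the paper's tenability games; the example framework and all function names are illustrative assumptions.

```python
from itertools import combinations


def conflict_free(attacks, s):
    """True if no argument in s attacks another argument in s."""
    return not any((a, b) in attacks for a in s for b in s)


def defends(attacks, s, arg):
    """True if every attacker of arg is attacked by some member of s."""
    attackers = {a for (a, b) in attacks if b == arg}
    return all(any((d, a) in attacks for d in s) for a in attackers)


def admissible_sets(args, attacks):
    """Enumerate all admissible sets: conflict-free and self-defending."""
    result = []
    for k in range(len(args) + 1):
        for combo in combinations(sorted(args), k):
            s = set(combo)
            if conflict_free(attacks, s) and all(defends(attacks, s, x) for x in s):
                result.append(s)
    return result


# Hypothetical framework: a attacks b, b attacks c,
# and d attacks both itself and c.
ARGS = {"a", "b", "c", "d"}
ATTACKS = {("a", "b"), ("b", "c"), ("d", "d"), ("d", "c")}
ADMISSIBLE = admissible_sets(ARGS, ATTACKS)
```

In this toy framework only the empty set and {a} are admissible: c cannot be defended against the self-attacking d, which is the kind of self-defeating attack that weak semantics are designed to discount.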
cs.AI / 65 / 2605.02087
Model Spec Midtraining: Improving How Alignment Training Generalizes
模型规范中期训练:改善对齐训练的泛化能力
Abstract
Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences, such as "I prefer cream cheese over brie", generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training by first teaching them the intended generalization.
Chinese Translation
一些前沿的人工智能开发者旨在将语言模型对齐于描述预期模型行为的模型规范或宪法。然而,标准的对齐微调——即在规范对齐行为的示例上进行训练——可能导致表层对齐,泛化性能较差,部分原因是示例数据可能不足以准确指定所需的泛化。我们提出了模型规范中期训练(Model Spec Midtraining, MSM):在预训练后、对齐微调之前,我们让模型在讨论其模型规范的合成文档上进行训练。这一过程教会模型规范的内容,从而塑造它们从后续示例数据中泛化的方式。例如,仅微调以表达某种奶酪偏好(如“我更喜欢奶油奶酪而非布里奶酪”的模型),在我们应用将这些偏好归因于亲美价值观的MSM后,能够较好地泛化为广泛的亲美价值观。相反,关于亲负担能力价值观的规范则会从同一奶酪微调中产生亲负担能力的泛化。MSM还可以塑造复杂的与安全相关的倾向:应用MSM,并使用一个关注自我保护和目标保护的规范,显著降低了代理的不对齐率(Qwen3-32B:54%降至7%),优于深思熟虑的对齐基线(14%)。我们进一步将MSM用作工具,研究哪些模型规范能产生最强的对齐泛化,发现解释规则背后的价值观可以提升泛化效果,而提供具体而非一般的指导同样有效。总的来说,MSM是一种简单有效的技术,通过首先教会模型预期的泛化,来控制和改善模型从对齐训练中泛化的能力。
cs.AI / 66 / 2605.02092
NORA: A Harness-Engineered Autonomous Research Agent for End-to-End Spatial Data Science
NORA:一种针对端到端空间数据科学的专业化自动研究代理
Abstract
The automation of scientific research workflows has emerged as a transformative frontier in artificial intelligence, yet existing autonomous research agents remain largely domain-agnostic, lacking the specialized reasoning, method selection, and data acquisition capabilities required for rigorous spatial data science. This paper introduces NORA (Night Owl Research Agent), a harness-engineered, multi-agent autonomous research system purpose-built for GIScience and spatial data science. NORA orchestrates the complete research lifecycle through a skills-first architecture comprising 21 domain-specialized workflow skills, 9 specialist sub-agents, and custom Model Context Protocol (MCP) servers. Central to the system's design are two novel domain-specialized skills: a spatial analysis skill unit that encodes decision frameworks for exploratory spatial data analysis, spatial regression, and diagnostics; and a spatial data download skill that supports reproducible acquisition from authoritative geospatial data sources. We formalize the concept of harness engineering for scientific research agents, demonstrating how lifecycle hooks, safety gates, generator-evaluator separation, human-in-the-loop, and state persistence ensure reliable and reproducible autonomous research. We evaluate NORA through case studies by 6 domain specialists and 3 LLM reviewers across seven dimensions (novelty, quality, rigor, etc.). Results demonstrate that domain-specialized harness engineering substantially improves the efficiency and quality of research output compared to general-purpose agent configurations.
Chinese Translation
科学研究工作流程的自动化已成为人工智能领域的一个变革性前沿,但现有的自动研究代理在很大程度上缺乏特定领域的专门推理、方法选择和数据获取能力,而这些能力是严格空间数据科学所必需的。本文介绍了NORA(夜鹰研究代理),这是一种为地理信息科学(GIScience)和空间数据科学专门构建的多代理自动研究系统。NORA通过一套以技能为先的架构,协调完整的研究生命周期,该架构包括21种领域专业化的工作流程技能、9个专业子代理以及定制的模型上下文协议(Model Context Protocol, MCP)服务器。系统设计的核心是两个新颖的领域专业技能:一个空间分析技能单元,用于编码探索性空间数据分析、空间回归和诊断的决策框架;以及一个空间数据下载技能,支持从权威地理空间数据源进行可重复获取。我们正式界定了科学研究代理的专业化工程概念,展示了生命周期钩子、安全网、生成-评估分离、人工参与以及状态持久性是如何确保自动化研究的可靠性和可重复性的。我们通过六名领域专家和三名大型语言模型(LLM)评审者的案例研究来评估NORA,涉及七个维度(新颖性、质量、严谨性等)。结果表明,领域专业化的工程设计在提高研究产出效率和质量方面,显著优于通用代理配置。
cs.AI / 67 / 2605.02106
The Dynamic Gist-Based Memory Model (DGMM): A Memory-Centric Architecture for Artificial Intelligence
动态概念基础记忆模型(DGMM):面向人工智能的记忆中心架构
Abstract
Contemporary artificial intelligence systems achieve strong performance through large-scale parameterization, retrieval augmentation, and training on extensive static corpora. Despite these advances, they continue to face limitations in persistent memory, temporal grounding, provenance, and interpretability. These challenges are especially pronounced in large language models, where experience is encoded implicitly in fixed parameters, limiting the ability to preserve, inspect, and reinterpret past interactions over time. This paper establishes a memory-centric architectural foundation for artificial intelligence in which experience is represented explicitly and persistently to support temporal grounding, provenance, and interpretability. It proposes an alternative to parameter-centric approaches by treating memory as a first-class, structured substrate for reasoning. We introduce the Dynamic Gist-Based Memory Model (DGMM), an architecture in which experience is represented as an evolving, graph-structured episodic-semantic memory. DGMM encodes experience as interconnected conceptual structures grounded in time, source, and interaction context, and defines selective, cue-conditioned recall as the mechanism for constructing working memory. A formal schema and architectural invariants are provided based on additive memory growth and recall-conditioned interpretation. The results specify properties of DGMM, including episodic persistence, locality of cue-conditioned surprise, and contextual variability without structural modification of stored memory. DGMM provides a coherent architectural theory in which memory is explicit and persistent, supporting evolving interpretation without retraining and enabling interpretable, context-aware, and temporally grounded AI systems.
Chinese Translation
当代人工智能系统通过大规模参数化、检索增强和在广泛静态语料库上进行训练而取得了强大的性能。尽管取得了这些进展,但它们在持久记忆、时间基础、源头及可解释性方面仍面临着局限。这些挑战在大型语言模型中尤为明显,在这些模型中,经验被隐式编码为固定参数,限制了对以往交互的保存、检查和重新解释的能力。本文建立了一种以记忆为中心的人工智能架构基础,其中经验被明确且持久地表示,以支持时间基础、源头和可解释性。它提出了一种替代参数中心方法的设计,通过将记忆视为推理的一类第一类、结构化基底。我们介绍了动态概念基础记忆模型(DGMM),这是一种经验被表示为不断演变的图结构情节-语义记忆的架构。DGMM将经验编码为与时间、来源和交互上下文相联系的互联概念结构,并将选择性、线索条件的回忆定义为构建工作记忆的机制。基于加性记忆增长和回忆条件解释,提供了正式的模式和架构不变性。结果明确了DGMM的特性,包括情节持久性、线索条件惊讶的局部性和在不修改存储记忆结构的情况下的上下文可变性。DGMM提供了一种连贯的架构理论,使记忆显式且持久,支持在不重新训练的情况下不断演变的解释,促进可解释的、上下文感知的及时间基础的人工智能系统。
cs.AI / 68 / 2605.02120
Reinforcement Learning Trained Observer Control for Bearings-Only Tracking
基于强化学习的观测器控制用于仅靠方位角追踪
Abstract
This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor $\beta \in [0,1]$. The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at $\beta = 0.7$ achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
Chinese Translation
本文开发了一种基于深度强化学习的观测器控制策略,用于自主依靠方位角对移动目标进行追踪。观测器机动问题被表述为一个信念马尔可夫决策过程,其中信念状态由立方卡尔曼滤波器(CKF)的后验表示。奖励函数旨在解决两个相互矛盾的目标:最小化绝对目标位置估计误差(欧几里得距离)和保持CKF估计一致性(马哈拉诺比斯距离)。奖励的形式被设定为在帕累托前沿上这两个目标之间进行几何插值,由加权因子 $\beta \in [0,1]$ 参数化。该策略实现为一个经过50,000个回合训练的深度Q网络(DQN)。性能通过5,000个蒙特卡洛回合进行评估,并与两个基线进行比较:垂直于方位角的启发式方法和D-最优费舍信息最大化标准。结果表明,$\beta = 0.7$ 时的DQN策略在准确性和鲁棒性之间达成了最佳折衷:其均值追踪准确性与信息论基线相匹配,同时在最坏情况下误差减少了近十倍,这得益于奖励中马哈拉诺比斯项提供的隐式滤波一致性正则化。
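To make the two reward ingredients concrete, here is a small Python sketch of the Euclidean error, the Mahalanobis distance, and one possible geometric interpolation between them. The weighted-geometric-mean form, the sign convention, and the choice of which term beta weights are assumptions for illustration; the abstract does not give the exact reward formula.

```python
import math


def euclidean_error(est, truth):
    """Absolute position error between the estimate and the true target."""
    return math.hypot(est[0] - truth[0], est[1] - truth[1])


def mahalanobis_distance(est, truth, cov):
    """Mahalanobis distance of the 2-D error under a 2x2 covariance."""
    dx, dy = est[0] - truth[0], est[1] - truth[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Apply the inverse of the 2x2 covariance to the error vector.
    ix = (d * dx - b * dy) / det
    iy = (-c * dx + a * dy) / det
    return math.sqrt(dx * ix + dy * iy)


def interpolated_reward(est, truth, cov, beta):
    """Assumed form: negative weighted geometric mean of the two costs."""
    e = euclidean_error(est, truth)
    m = mahalanobis_distance(est, truth, cov)
    return -(e ** beta) * (m ** (1.0 - beta))


# Hypothetical numbers: a 3-4-5 position error and a diagonal covariance.
EST, TRUTH = (3.0, 4.0), (0.0, 0.0)
COV = ((9.0, 0.0), (0.0, 16.0))
r_acc = interpolated_reward(EST, TRUTH, COV, 1.0)  # accuracy term only
r_con = interpolated_reward(EST, TRUTH, COV, 0.0)  # consistency term only
```

Under this assumed form, beta = 1 rewards pure position accuracy and beta = 0 rewards pure filter consistency, with intermediate values trading the two off multiplicatively.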
cs.AI / 69 / 2605.02168
Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning
规划者至关重要!一种高效且不平衡的多智能体协作框架用于长期规划
Abstract
Language model (LM)-based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning. To address this, we propose an enhanced multi-agent framework that decomposes automation into three roles: a planner for high-level decision-making, an actor for task execution, and a memory manager for contextual reasoning. While this modular decomposition aligns with established design patterns, our core contribution lies in a systematic compute-allocation analysis, revealing that planning is the dominant factor influencing task performance. Execution and memory management require significantly less compute and model capacity to achieve competitive results. Building on these insights, we introduce a planner-centric reinforcement learning approach, which exclusively optimizes the planner using trajectory-level rewards from a VLM-as-judge, while freezing the other components. Extensive experiments on benchmarks spanning web navigation, OS control, and tool use demonstrate that concentrating model capacity and learning on high-level planning yields robust and compute-efficient improvements in long-horizon agent automation. Our code is publicly released.
Chinese Translation
基于语言模型(LM)的智能体在自动化复杂任务方面表现出了良好的能力,能够根据自然语言指令执行任务,但在长期规划和推理方面仍然存在困难。为了解决这一问题,我们提出了一种增强的多智能体框架,将自动化任务细分为三个角色:用于高层次决策的规划者、用于任务执行的执行者,以及用于上下文推理的记忆管理者。虽然这种模块化分解与既定设计模式相符,但我们的核心贡献在于系统的计算分配分析,揭示了规划是影响任务性能的主要因素。执行和记忆管理在实现竞争性结果方面所需的计算资源和模型容量显著较少。基于这些见解,我们提出了一种以规划者为中心的强化学习方法,专门使用来自VLM-as-judge的轨迹级奖励来优化规划者,同时冻结其他组件。广泛的基准实验涵盖了网页导航、操作系统控制和工具使用,结果表明,集中模型容量和学习于高层次规划能够在长期智能体自动化中带来稳健且计算高效的提升。我们的代码已公开发布。
cs.AI / 70 / 2605.02173
Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
在1M标记上下文窗口中的检索和多跳推理:对大型语言模型在古典中文文本上的评估
Abstract
We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models (Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100%), but that multi-hop performance reveals three distinct decay signatures: a stable regime (Gemini Pro, Claude) maintaining greater than 80% accuracy through 512K with modest degradation at 1M; a late-cliff regime (GPT-5.5, Qwen3.6-plus) collapsing sharply between 512K and 1M; and a smooth-decline regime (DeepSeek V4 Pro) decaying gradually across the entire range. The findings suggest that nominal context-window length is a poor proxy for usable long-context multi-hop capability, and that the sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.
Chinese Translation
我们评估了五种前沿大型语言模型在1M标记上下文窗口下的长上下文检索和推理能力,使用了一个古典中文语料库。报告了两个互补的研究。测试1测量在1M标记输入下的单针检索,设置了三个生平针在三种深度,并通过真实(与训练前一致)和更改过的(与训练前矛盾)变体的对比来区分真正的上下文检索和对记忆训练数据的依赖。测试2是一个后续研究,旨在探讨当检索需要中间推理时,长上下文能力是否会下降,测量了跨三种上下文层(256K、512K和1M标记)的三跳链穿越。我们发现,1M的单针检索对于最强模型基本已被解决——Gemini 3.1 Pro、Claude Opus 4.7 和 GPT-5.5 各自达到了100%——但多跳性能揭示了三种不同的衰退特征:一种稳定状态(Gemini Pro、Claude)在512K时保持超过80%的准确率,1M时出现适度下降;一种晚崖状态(GPT-5.5、Qwen3.6-plus)在512K和1M之间急剧崩溃;以及一种平滑下降状态(DeepSeek V4 Pro)在整个范围内逐渐衰退。研究结果表明,名义上下文窗口长度是可用长上下文多跳能力的较差指标,而当前1M上下文旗舰产品之间的最显著区分是在512K到1M的过渡。
cs.AI / 71 / 2605.02175
Intervention Complexity as a Canonical Reward and a Measure of Intelligence
干预复杂性作为典范奖励和智能的度量
Abstract
The Legg--Hutter universal intelligence measure provides a rigorous scalar assessment of general intelligence as expected reward across all computable environments, weighted by simplicity. However, the measure presupposes an externally specified reward function, raising the question of whether the reward primitive is inherently arbitrary or whether a canonical choice exists. We propose a new measure, called intervention complexity, that has five natural properties: environment-derivedness, universality, minimality, sensitivity, and achievement preference. Given a resource function rho encoding an inductive bias (such as program length, execution time, or energy), rho-intervention complexity is a universal reward. The result yields a family of canonical rewards indexed by resource bias, providing a principled completion of the Legg--Hutter framework that does not require external normative input. We further propose a two-dimensional characterisation of intelligence: agent competence (how well the agent performs relative to the oracle optimum) and learning efficiency (how quickly this competence improves with experience). A separation theorem establishes that the choice of resource bias determines the computability of the resulting measure: action-count IC is computable in polynomial time, while program-length IC without oracle access is uncomputable, with the gap between oracle and bare IC precisely quantifying the information-theoretic content of learning. We discuss implications for superintelligence and for pre-training universal agents.
Chinese Translation
Legg-Hutter 通用智能度量提供了一种关于一般智能的严格标量评估,作为在所有可计算环境中按简洁性加权的预期奖励。然而,该度量假设存在一个外部指定的奖励函数,这引发了奖励原语是否固有任意或是否存在典范选择的问题。我们提出了一种新的度量,称为干预复杂性,具有五个自然属性:环境派生性、普遍性、最小性、敏感性和成就偏好。给定一个编码归纳偏差的资源函数 rho(例如程序长度、执行时间或能量),rho-干预复杂性是一种通用奖励。该结果生成了一系列由资源偏差索引的典范奖励,提供了对 Legg-Hutter 框架的原则性补充,而无需外部规范输入。我们进一步提出了一种对智能的二维特征化:代理能力(代理相对于神谕最优解的表现)和学习效率(这种能力随着经验的提高而增进的速度)。分离定理确立了资源偏差的选择决定了结果度量的可计算性:操作计数的干预复杂性能在多项式时间内计算,而没有神谕访问的程序长度的干预复杂性是不可计算的,神谕与基础干预复杂性之间的差距恰好量化了学习的信息理论内容。我们讨论了对超智能和预训练通用代理的影响。
cs.AI / 72 / 2605.02178
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T$^2$PO:用于稳定多轮代理强化学习的基于不确定性的探索控制
Abstract
Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T$^2$PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T$^2$PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T$^2$PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and task performance, together with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
Chinese Translation
多轮强化学习(RL)的最新进展显著提高了推理大型语言模型(LLMs)在复杂交互任务上的表现。尽管在微观信用分配和轨迹过滤等稳定化技术方面取得了进展,但不稳定性仍然普遍存在,并且通常导致训练崩溃。我们认为,这种不稳定性源于多轮设置中低效的探索策略,其中策略不断生成低信息量的动作,既不能减少不确定性,也不能推进任务进展。为了解决这个问题,我们提出了基于令牌和回合的策略优化(T$^2$PO),这是一个具有不确定性意识的框架,能够在微观层面上显式控制探索。在令牌层面,T$^2$PO 监测不确定性动态,并在边际不确定性变化低于阈值时触发思考干预。在回合层面,T$^2$PO 识别与探索进展微不足道的交互,并动态重新抽样这些回合以避免浪费的展开。我们在包括 WebShop、ALFWorld 和 Search QA 在内的多样化环境中评估了 T$^2$PO,证明在训练稳定性和性能提升方面具有显著的提升,并实现了更高的探索效率。代码可在以下链接获取:https://github.com/WillDreamer/T2PO。
cs.AI / 73 / 2605.02199
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
MEMAUDIT:一种用于预算长期LLM内存写入的精确包-Oracle评估协议
Abstract
Long-term LLM agents must compress streams of past interactions into persistent memory before future queries are known. Existing evaluations usually measure final question-answering accuracy, which entangles memory writing with retrieval, prompting, and reader reasoning. We introduce MEMAUDIT, an exact package-oracle evaluation protocol for budgeted long-term memory writing. A MEMAUDIT package fixes an experience stream, candidate memory representations, storage costs, semantic evidence units, future-query requirements, and a budget, turning write-time memory selection into a finite auditable optimization problem with a certified denominator. We instantiate this protocol with a concave-over-modular semantic coverage objective under storage and one-representation-per-experience constraints, and compute exact package optima using branch-and-bound with MILP certification. Across controlled exact packages, validity-heavy stress tests, human-audited natural support slices, and exported Mem0, A-Mem, and Letta stores, MEMAUDIT separates representation quality, validity-state preservation, and budget-aware selection effects that end-to-end QA cannot localize. The resulting artifact provides reusable package generators, certified solvers, natural package exports, external-system scorers, and cached reproducibility metadata for evaluating what memory writers actually preserve under fixed storage budgets.
Chinese Translation
长期LLM代理必须在未来查询未知之前将过去交互的流压缩到持久内存中。现有评估通常测量最终的问题回答准确性,这将内存写入与检索、提示和读者推理纠缠在一起。我们引入MEMAUDIT,一种用于预算长期内存写入的精确包-Oracle评估协议。MEMAUDIT包固定了经验流、候选内存表示、存储成本、语义证据单元、未来查询要求和预算,将写入时间的内存选择转化为具有认证分母的有限可审计优化问题。我们在存储和每个经验一个表示的约束下,用一个凹的模块化语义覆盖目标实例化该协议,并通过带有MILP认证的分支定界计算精确包的最佳解。在受控的精确包、有效性压力测试、人类审计的自然支持切片,以及导出的Mem0、A-Mem和Letta存储中,MEMAUDIT区分了表示质量、有效性状态保持以及预算感知选择效果,这些效果是端到端QA无法定位的。最终生成的工具提供可重用的包生成器、认证求解器、自然包导出、外部系统评分器和缓存的可重现性元数据,以评估在固定存储预算下内存写入者实际保留的内容。
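To make the optimization problem in this abstract concrete, the following toy sketch scores a memory selection with a concave-over-modular coverage objective (a square root applied to summed evidence weights per semantic unit) and finds the exact optimum under a storage budget with at most one representation per experience. It uses plain enumeration rather than the branch-and-bound with MILP certification that the abstract mentions, and the data layout (`cost`, `evidence`, `name` fields) and the square-root choice are assumptions for illustration only.

```python
from itertools import product
from math import sqrt

def coverage(selected, num_units):
    """Concave-over-modular objective: sum over units of sqrt(total evidence weight)."""
    totals = [0.0] * num_units
    for rep in selected:
        for unit, weight in rep["evidence"].items():
            totals[unit] += weight
    return sum(sqrt(t) for t in totals)

def exact_optimum(experiences, budget, num_units):
    """Enumerate every choice of at most one representation per experience.

    Returns (best objective value, names of the chosen representations).
    Only feasible for tiny instances; it serves as an exact reference point.
    """
    best_value, best_choice = 0.0, ()
    options = [[None] + reps for reps in experiences]  # None = store nothing
    for choice in product(*options):
        picked = [rep for rep in choice if rep is not None]
        if sum(rep["cost"] for rep in picked) > budget:
            continue  # violates the storage budget
        value = coverage(picked, num_units)
        if value > best_value:
            best_value = value
            best_choice = tuple(rep["name"] for rep in picked)
    return best_value, best_choice

# Hypothetical two-experience package with three semantic evidence units.
EXPERIENCES = [
    [{"name": "e0_full", "cost": 3, "evidence": {0: 1.0, 1: 4.0}},
     {"name": "e0_summary", "cost": 1, "evidence": {0: 1.0}}],
    [{"name": "e1_full", "cost": 3, "evidence": {1: 1.0, 2: 1.0}},
     {"name": "e1_summary", "cost": 1, "evidence": {2: 1.0}}],
]
```

A memory writer's selection can then be scored as a fraction of `exact_optimum(...)[0]`, which is the role the abstract assigns to the certified denominator.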
cs.AI / 74 / 2605.02202
CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models
CBV:基于扩散模型的视觉语言模型清标记后门攻击
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image-text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.
Chinese Translation
视觉语言模型(VLMs)在图像描述和视觉问答(VQA)等任务中取得了显著成功。然而,随着其应用的日益广泛,近期研究表明,VLMs 易受到后门攻击。目前对 VLMs 的后门攻击主要依赖于通过添加视觉触发器和修改文本标签进行数据投毒,此方法导致的图像-文本不匹配使得被污染样本易于被检测。为了解决这一局限性,我们提出了基于扩散模型的 VLMs 清标记后门攻击(CBV),该方法利用扩散模型通过评分匹配生成自然的污染样本。具体而言,CBV 在扩散模型的反向生成过程中修改评分,以引导生成包含触发图像特征的污染样本。为了进一步增强攻击的有效性,我们在生成过程中加入触发图像的文本信息作为多模态指导。此外,为了增强隐蔽性,我们引入了一个 GradCAM 指导的掩膜(GM),该掩膜仅限制对最具语义重要性的区域进行修改,而不是整个图像。我们在 MSCOCO 和 VQA v2 上对该方法进行了评估,并在四个代表性的 VLMs 上实现了超过 80% 的 ASR,同时保持正常功能。
cs.AI / 75 / 2605.02209
Submodular Benchmark Selection
次模基准选择
Abstract
Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
Chinese Translation
在多个基准上评估大型语言模型是昂贵的,但许多基准之间高度相关。我们将选择一个小规模的、信息丰富的子集形式化为在多变量高斯模型下的次模最大化。熵(对数行列式协方差)和所选基准与剩余基准之间的互信息成为自然的优化目标。两者都是次模的;熵的选择与选主元 Cholesky 分解(pivoted Cholesky)相符,并具有谱残差界限,而互信息通常是非单调的,但对于小子集而言经验上是单调的,因此我们采用贪婪算法进行优化。在十个公共排行榜的三个矩阵上的实验表明,互信息选择在小子集的插补方面优于熵。
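The abstract's remark that entropy selection coincides with pivoted Cholesky can be sketched as follows. Under a Gaussian model, greedily maximising the log-determinant of the selected covariance block means picking, at each step, the benchmark with the largest variance conditional on those already chosen, which is the pivot rule of a pivoted Cholesky factorisation. The function name and the toy covariance matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def greedy_entropy_selection(cov, k):
    """Greedily select k indices maximising log det of the selected covariance block.

    Equivalent to running k steps of pivoted Cholesky on `cov` and returning
    the pivot order.
    """
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    residual = np.diag(cov).copy()   # conditional variance of each benchmark
    factors = np.zeros((k, n))       # row s holds column s of the Cholesky factor
    selected = []
    for step in range(k):
        pivot = int(np.argmax(residual))
        if residual[pivot] <= 1e-12:
            break                    # remaining benchmarks carry no new information
        # New Cholesky column: remove what earlier pivots already explain.
        row = (cov[pivot] - factors[:step].T @ factors[:step, pivot]) / np.sqrt(residual[pivot])
        factors[step] = row
        residual = residual - row ** 2
        selected.append(pivot)
        residual[selected] = -np.inf  # never pick the same benchmark twice
    return selected

# Hypothetical 3-benchmark covariance: benchmarks 0 and 1 are correlated.
COV = np.array([[4.0, 2.0, 0.0],
                [2.0, 3.0, 0.0],
                [0.0, 0.0, 3.5]])
```

On this toy matrix the first pick is benchmark 0 (largest variance); benchmark 1 is then discounted because it is correlated with benchmark 0, so the second pick is benchmark 2.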
cs.AI / 76 / 2605.02218
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
CoVSpec:通过推测解码实现高效的设备-边缘联合推理的视觉-语言模型
Abstract
Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification-correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
Chinese Translation
视觉-语言模型(VLMs)在多模态感知和推理方面展现了强大的能力。然而,由于其巨大的计算和内存需求,在移动设备上部署大型VLM仍然面临挑战。一个实际的替代方案是设备-边缘联合推理,其中移动设备上的轻量级草稿VLM通过推测解码与边缘服务器上的大型目标VLM协作。然而,直接将推测解码扩展到VLM时,由于过多的视觉标记计算和高通信开销,效率严重低下。为了解决这些挑战,我们提出了CoVSpec,一个用于VLM推理的高效协作推测解码框架。具体而言,我们首先开发了一个无训练的视觉标记减少框架,通过联合考虑查询相关性、标记活跃性和低秩依赖性,剪枝移动设备上冗余的视觉标记。此外,我们设计了一种自适应草拟策略,动态调整验证频率和草拟长度。此外,我们引入了一种并行分支机制,配备解耦的验证-纠正,以提高目标侧验证期间草拟侧的利用率,并减少与纠正相关的传输开销。在多个基准上的实验表明,CoVSpec的吞吐量比仅目标推理高出最多2.21倍,同时与基线相比,通信开销减少超过96%,而不影响任务准确性。
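For readers unfamiliar with the speculative decoding loop that CoVSpec builds on, here is a minimal greedy sketch: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is kept together with one corrected token. It covers only the basic draft-and-verify idea with toy callables; the visual token pruning, adaptive drafting, and parallel branching described in the abstract are not modelled, and the `Model` type and function names are assumptions.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]

def speculative_decode(draft: Model, target: Model, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches target-only greedy decoding."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes up to draft_len tokens.
        proposal: List[int] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target model verifies each proposed position.
        #    (A real system would do this in one batched forward pass.)
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                tokens.extend(proposal[:i])   # keep the agreeing prefix
                tokens.append(expected)       # replace the first mismatch
                break
        else:
            tokens.extend(proposal)           # every drafted token was accepted
    return tokens
```

The speed-up comes from step 2 costing roughly one target pass per round regardless of how many drafted tokens are accepted, so a draft that agrees with the target more often yields more tokens per target call.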
cs.AI / 77 / 2605.02234
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
将优质数据分层:一种诊断与提升因果抽象的方法
Abstract
We present a method for diagnosing interpretation in neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the structure of the well-interpreted and under-interpreted regions, we can identify missing distinctions in a high-level hypothesis, discover previously unmodeled intermediate variables, and combine complementary partial interpretations into a stronger one. We instantiate this idea as a simple four-step recipe and show that it yields informative error analyses across multiple causal abstraction settings. In a toy logic task, recursively applying the recipe recovers a high-level hypothesis from scratch. More broadly, our results suggest that partitioning the input space is a useful step toward more precise, constructive, and scalable mechanistic interpretability.
Chinese Translation
我们提出了一种通过识别输入子空间来诊断神经网络解释的方法,该子空间内提议的解释具有高度的可信度。我们的方法特别适用于基于因果抽象的可解释性,其中高层次的因果假设通过互换干预进行评估。我们并未将互换干预的准确性视为单一的全局总结,而是通过根据成对互换干预行为将输入空间划分为解释良好区域和解释不足区域来细化这一框架。这将因果抽象从纯粹的全局评估转变为更具诊断性的工具:它不仅测量某个解释是否有效,还揭示了在何处有效、在何处失败,以及区分这两种情况的因素。这一诊断视角也提供了改进解释的实践启示。通过分析解释良好区域和解释不足区域的结构,我们可以识别高层假设中缺失的区分,发现先前未建模的中间变量,并将互补的部分解释整合成更强的解释。我们将这一思路具体化为一个简单的四步法,并展示其在多个因果抽象设置下产生的信息丰富的错误分析。在一个玩具逻辑任务中,递归应用该方法能够从头恢复高层假设。更广泛地说,我们的结果表明,通过对输入空间进行分区是实现更精确、可建设和可扩展的机制可解释性的重要一步。
cs.AI / 78 / 2605.02236
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
递归 LLM 循环中的扰动剂量响应:原始切换、随机底线与在附加、替换和对话更新下的持续逃逸
Abstract
Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model from the context-update rule: append, replace, and dialog updates expose different histories to the same generator. The main result is that persistent redirection in append-mode recursive loops is memory-policy-conditioned. Under a 12,000-character tail clip, destination-coherent persistence plateaus near 16 percent and retained source-basin escape near 36 percent at dose 400; neither crosses 50 percent. Under a full-history protocol, retained source-basin escape crosses 50 percent near 400 tokens and saturates at 75-80 percent by 1,500 tokens, while destination-coherent persistence first reaches 0.50 near 1,500 tokens with a Wilson 95 percent CI of [0.41, 0.61]. For raw switching, adversarial continuations yield an ED50 near 40 tokens, with paired-control floors near 35 percent and net switching never reaching +50 percentage points within 5-400 tokens. Replace-mode raw switching is near-saturated but largely reflects state-reset overwrite: insert-mode probes drop it to 12-32 percent. A homogeneous-perturbation control reproduced the high-dose non-monotonic dip in destination-coherent persistence, refuting perturbation heterogeneity as the cause; the dip appears structural, with mechanism unresolved. We report 37 experiments on gpt-4o-mini with within-vendor replication on gpt-4.1-nano. Recursive-loop evaluations should distinguish transient movement from durable escape, subtract stochastic floors, and treat context-update rules as first-class safety-relevant design choices.
Chinese Translation
递归语言模型循环常常会归结为可识别的吸引子般的模式。实践中的问题是,移动已定型的循环所需的注入文本量有多少,以及这种移动是否是持久的。我们通过将模型与上下文更新规则分离,研究了 30 步递归循环:附加、替换和对话更新向同一生成器暴露不同的历史。主要结果是,在附加模式递归循环中的持续重定向受记忆策略的限制。在 12,000 字符的尾段剪切下,目标一致的持续保持在 16% 左右平稳,而维持的源盆逃逸在剂量 400 时接近 36%;两者均未超过 50%。在完整历史协议下,维持的源盆逃逸在接近 400 个标记时超过 50%,并在 1,500 个标记时饱和在 75-80% 之间,而目标一致的持续首次在接近 1,500 个标记时达到 0.50,其 Wilson 95% 置信区间为 [0.41, 0.61]。对于原始切换,对抗性延续在接近 40 个标记时产生 ED50,配对控制底线接近 35%,而净切换在 5-400 个标记范围内从未达到 +50 个百分点。替换模式的原始切换几乎达到饱和,但在很大程度上反映了状态重置覆盖:插入模式探测将其降至 12-32%。同质扰动控制再现了目标一致持续中的高剂量非单调下降,驳斥了扰动异质性作为原因;该下降似乎是结构性的,机制尚未解决。我们报告了关于 gpt-4o-mini 的 37 次实验,并在 gpt-4.1-nano 上进行了供应商内部复现。递归循环评估应区分瞬态移动与持久逃逸,减去随机底线,并将上下文更新规则视为一等的安全相关设计选择。
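The abstract reports a Wilson 95 percent confidence interval for a proportion. A small sketch of the Wilson score interval is given below for reference; the trial counts used with it are hypothetical, since the abstract does not state the sample size behind the reported [0.41, 0.61].

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for proportions near 0 or 1, which matters when reporting persistence or escape rates from a limited number of loop runs.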
cs.AI / 79 / 2605.02240
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
PhysicianBench: 在真实电子健康记录环境中评估大型语言模型代理
Abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
Chinese Translation
我们介绍了PhysicianBench,一个评估大型语言模型(LLM)代理在电子健康记录(EHR)环境中基于真实临床场景的医生任务的基准测试。现有的医学代理基准主要侧重于静态知识回忆、单步原子动作或行动意图,而没有与环境进行可验证的执行。因此,它们未能捕捉到真实临床系统特有的长时间、复合工作流程。PhysicianBench包含100个长时间任务,这些任务源于初级护理与专科医生之间的真实咨询案例,每个任务均由不同的医生小组独立审核。任务在一个含有真实患者记录的EHR环境中实例化,并通过与商业EHR供应商使用的相同标准API访问。任务涉及21个专业领域(如心脏病学、内分泌学、肿瘤学、精神病学)和多样化的工作流程类型(如诊断解读、药物开处方、治疗计划),每个任务平均需要27次工具调用。解决每个任务需要跨接触检索数据、对异构临床信息进行推理、执行重要临床行动以及生成临床文档。每个任务被分解为结构化的检查点(基准总共670个),捕捉完成的不同阶段,并由特定任务的脚本通过执行验证进行评分。在13个专有和开源的LLM代理中,表现最好的模型仅达到46%的成功率(pass@1),而开源模型最多达到19%,揭示了当前代理能力与真实临床工作流程的需求之间存在着显著差距。PhysicianBench为衡量自主临床代理的进展提供了一个现实且以执行为基础的基准。
cs.AI / 80 / 2605.02241
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
小型语言模型的零样本置信度估计:当有监督基线不值得训练时
Abstract
How reliably can a small language model estimate its own correctness? The answer determines whether local-to-cloud routing (escalating queries that a cheap local model cannot handle) can work without supervised training data. As inference costs dominate large language model (LLM) deployment budgets, routing most queries to a cheap local model while reserving expensive cloud calls for hard cases is an increasingly common cost-control strategy. We compare zero-shot confidence signals against RouteLLM-style supervised baselines across three 7-8B model families and two datasets (1,000 and 500 queries per model, respectively). Average token log-probability, which requires no training data, matches or exceeds supervised baselines in-distribution (Area Under the Receiver Operating Characteristic curve (AUROC) 0.650-0.714 vs. 0.644-0.676) and substantially outperforms them out-of-distribution (0.717-0.833 vs. 0.512-0.564), because it measures a property of the model's generation rather than the query distribution. This paper further proposes retrieval-conditional self-assessment, a pre-generation signal that selectively injects retrieved knowledge when similarity is high, improving over bare self-assessment by up to +0.069 AUROC at 3-10x lower latency than log-probability. A supervised baseline trained on 1,000 labeled examples never exceeds the zero-shot signal. We release all code, data, and experiment logs.
Chinese Translation
小型语言模型对自身正确性的估计有多可靠?这一答案决定了针对廉价局部模型无法处理的升高查询的本地到云路由是否能够在没有监督训练数据的情况下工作。随着推理成本在大型语言模型(LLM)部署预算中占据主导地位,许多查询路由到廉价的局部模型,同时将昂贵的云调用保留给困难案例,已成为一种日益普遍的成本控制策略。我们将零样本置信度信号与RouteLLM风格的有监督基线进行了比较,覆盖了三个7-8B模型家族和两个数据集(每个模型分别1000和500个查询)。平均标记对数概率不需要训练数据,其在分布内的表现与有监督基线相匹配或超出(接收者操作特征曲线下的面积(AUROC)为0.650-0.714对比0.644-0.676),在分布外的表现则大幅超出(0.717-0.833对比0.512-0.564),因为它测量的是模型生成的特性,而非查询分布。本文进一步提出了基于检索的条件自我评估,这是一种在相似性较高时选择性注入检索知识的生成前信号,其相较于简单的自我评估在3-10倍更低的延迟下可提高高达+0.069 AUROC。训练于1000个标记样本的有监督基线从未超越零样本信号。我们发布了所有代码、数据和实验日志。
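The zero-shot signal highlighted in this abstract, average token log-probability, is simple enough to sketch directly. The routing rule and the threshold value below are assumptions for illustration; in practice a threshold would be chosen from the desired trade-off between local accuracy and cloud cost.

```python
from typing import Sequence

def mean_token_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability the local model assigned to its own generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one generated token")
    return sum(token_logprobs) / len(token_logprobs)

def route(token_logprobs: Sequence[float], threshold: float = -0.5) -> str:
    """Keep the local answer when confidence is high, otherwise escalate.

    The threshold of -0.5 is a hypothetical placeholder, not a value from the paper.
    """
    confident = mean_token_logprob(token_logprobs) >= threshold
    return "local" if confident else "cloud"
```

Because the score is read off the local model's own generation, it needs no labeled routing data, which is the property the abstract credits for its out-of-distribution robustness.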
cs.AI / 81 / 2605.02249
A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)
多智能体系统中信念修正公设的研究(扩展版)
Abstract
We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.
Chinese Translation
我们研究了在认知规划中的信念修正问题,即在某个智能体获得对某状态属性的信念后,所有智能体的信念将会是什么。基于智能体信念在认知规划中的标准表示,即通过单一的多智能体克里普克模型,我们将经典的AGM信念修正公设推广到多智能体环境中,旨在为评估动态认知推理框架提供一个正式框架,其中所有智能体的信念作为行动的结果被计算。作为一个满足所有推广AGM公设的简单运算符示例,我们提出了推广的全交集多智能体信念修正。此外,我们定义了标准公设的迭代修正的一种推广,提出了一种更复杂的事件模型基础上的修正运算符,并讨论了在克里普克模型上定义一个满足所有推广公设的认知运算符可能面临的问题,适用于迭代的多智能体信念修正。
cs.AI / 82 / 2605.02269
Towards Understanding Specification Gaming in Reasoning Models
理解推理模型中的规范游戏
Abstract
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that (1) RL reasoning training substantially increases the rate at which models exploit their specifications; (2) increasing the RL reasoning budget has a weakly positive effect on exploit rate; and (3) test-time mitigations reduce but do not eliminate specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.
Chinese Translation
规范游戏是大型语言模型(LLM)代理的一种关键失效模式。尽管如此,对于其何时出现以及何种因素驱动其产生的系统性研究仍然较少。为了解决这一问题,我们构建并开源了一套多样化的任务,模型可以通过采取意外的行动获得高分。我们发现,在我们八个设定中,所有测试的模型在大多数情况下以非忽略比例利用其规范,其中包括五个非编码的设定。我们在Grok 4中观察到了最高的规范游戏率,而在Claude模型中观察到了最低的规范游戏率。我们利用我们的评估套件研究规范游戏的驱动因素,并发现:1. 强化学习(RL)推理训练显著提高了模型利用其规范的比例;2. 增加RL推理预算对利用率有弱正向影响;3. 测试时缓解措施减少了但并未消除规范游戏的比例。我们的结果表明,规范游戏是强化学习推理训练中出现的一项根本性挑战;我们发布了我们的评估套件以支持对这一问题的进一步研究。
cs.AI / 83 / 2605.02285
Complexity Horizons of Compressed Models in Analog Circuit Analysis
模拟电路分析中压缩模型的复杂性视野
Abstract
The deployment of Large Language Models (LLMs) for specialized engineering domains, such as circuit analysis, often faces a trade-off between reasoning accuracy and computational efficiency. Traditional evaluation methods treat model performance as a flat metric, failing to account for the hierarchical nature of engineering knowledge. We propose a performance-aware model compression strategy that utilizes prerequisite graphs to optimize model selection for circuit analysis tasks. By structuring electronics design concepts as Directed Acyclic Graphs (DAGs), we can identify the specific complexity horizon of each compression tier of an LLM. Our framework introduces an agentic pipeline for generating prerequisite-based datasets and a strategic evaluation engine that dynamically cascades queries across a spectrum of compressed variants of an LLM. This approach makes it possible to select the smallest compressed model whose conceptual knowledge boundary in circuit analysis covers a given task. Experimental results on analog electronics datasets demonstrate that prerequisite graphs provide a granular map of how performance under compression varies with circuit analysis complexity. (Source Code: https://github.com/pacomesimon/LLM_prereq_graphs_circuit_analysis, Demo: https://huggingface.co/spaces/pacomesimon/LLM_prereq_graphs_circuit_analysis)
Chinese Translation
在特定工程领域(如电路分析)中,部署大型语言模型(LLMs)常常面临推理准确性与计算效率之间的权衡。传统的评估方法将模型性能视为一个平面指标,未能考虑工程知识的层次性。我们提出了一种基于性能的模型压缩策略,该策略利用前提图来优化电路分析任务的模型选择。通过将电子设计概念结构化为有向无环图(DAGs),我们能够识别LLM压缩变体的具体复杂性视野。我们的框架引入了一种基于代理的管道,用于生成基于前提的数据集,以及一个动态级联查询的战略评估引擎,跨越LLM的多种压缩变体进行评估。这种方法使我们能够在给定电路分析的概念知识边界的基础上,选择最小的压缩模型。对模拟电子数据集的实验结果表明,前提图提供了一种针对电路分析复杂性给出性能的模型压缩的细致地图。 (源代码:https://github.com/pacomesimon/LLM_prereq_graphs_circuit_analysis,演示:https://huggingface.co/spaces/pacomesimon/LLM_prereq_graphs_circuit_analysis)
cs.AI / 84 / 2605.02289
EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions
EngiAgent:全连接协调LLM代理以解决具有可行解的开放性工程问题
Abstract
Engineering problem solving is central to real-world decision-making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathematical problem solving, which operates on predefined formulations, engineering tasks demand open-ended analysis, feasibility-driven modeling, and iterative refinement. Although large language models (LLMs) have shown strong capabilities in reasoning and code generation, they often fail to ensure feasibility, which limits their applicability to engineering problem solving. To address this challenge, we propose EngiAgent, a multi-agent system with a fully connected coordinator that simulates expert workflows through specialized agents for problem analysis, modeling, verification, solving, and solution evaluation. The fully connected coordinator enables flexible feedback routing, overcoming the rigidity of prior pipeline-based reflection methods and ensuring feasibility at every stage of the process. This design not only improves robustness to diverse failure cases such as data extraction errors, constraint inconsistencies, and solver failures, but also enhances the overall quality of problem solving. Empirical results across four representative domains demonstrate that EngiAgent achieves substantial improvements in feasibility compared to prior approaches, establishing a new paradigm for feasibility-oriented engineering problem solving with LLMs. Our source code and data are available at https://github.com/AI4Engi/EngiAgent.
Chinese Translation
工程问题解决是现实世界决策的重要组成部分,需要数学公式不仅能够表示复杂问题,还能够在数据和物理约束下生成可行解。与在预定义公式下进行的数学问题解决不同,工程任务要求开放式分析、以可行性为驱动的建模和迭代改进。尽管大型语言模型(LLMs)在推理和代码生成方面表现出强大的能力,但它们往往无法确保可行性,这限制了它们在工程问题解决中的应用。为了解决这一挑战,我们提出了EngiAgent,一个具有全连接协调器的多代理系统,通过专业代理模拟专家工作流程,以进行问题分析、建模、验证、求解和解评估。全连接协调器实现了灵活的反馈路由,克服了先前基于管道的反思方法的刚性,并在流程的每个阶段确保可行性。这种设计不仅提高了对数据提取错误、约束不一致和求解器失败等多种故障情形的鲁棒性,还有助于提升整体问题解决的质量。跨四个代表性领域的实证结果表明,EngiAgent在可行性方面相比于先前的方法取得了显著改进,从而建立了一种以可行性为导向的工程问题解决新范式。我们的源代码和数据可在 https://github.com/AI4Engi/EngiAgent 获得。
cs.AI / 85 / 2605.02290
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
通过协作逐步多教师解码提炼长链推理
Abstract
Distilling large reasoning models is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at https://github.com/DISL-Lab/CoRD.
Chinese Translation
提炼大型推理模型对于使长链推理(Long-CoT)变得实用至关重要,因为全规模推理在计算上仍然难以承担。现有的基于整理的方法在事后选择完整的推理轨迹,忽视了不同教师之间的协作,且缺乏动态探索,这导致了冗余的采样和遗漏的互补推理。我们提出了CoRD,这是一个协作多教师解码框架,它通过预测困惑度评分和束搜索(beam search)指导逐步推理合成。这使得异构的大型推理模型能够共同构建连贯的推理轨迹,同时有效地保留多样化的高潜力假设。实验表明,CoRD生成的推理数据质量更高,且在较少的结构化监督信号下实现了接近教师水平的学生表现,而没有显著增加效率开销。CoRD在域外和开放式设置下也具有良好的泛化能力。数据集和模型可在 https://github.com/DISL-Lab/CoRD 获取。
cs.AI / 86 / 2605.02317
Anon: Extrapolating Optimizer Adaptivity Across the Real Spectrum
Anon:在整个实数域上外推优化器自适应性
Abstract
Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity in $\mathbb{R}$, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
Chinese Translation
自适应优化器(如Adam)在训练大规模模型(如大型语言模型和扩散模型)方面取得了显著成功。然而,在经典架构(如卷积神经网络CNN)上,它们的泛化能力往往不如随机梯度下降法(SGD)等非自适应方法。我们识别出这一性能差距的一个关键原因:预条件器的自适应性限制了优化器适应多样化优化景观的能力。为了解决这个问题,我们提出了Anon(Adaptivity Non-restricted Optimizer with Novel convergence technique,采用新型收敛技术的自适应性不受限优化器),这是一种自适应性可在实数域R上连续调节的新型优化器,能够在类SGD与类Adam行为之间进行插值,甚至可以外推到两者之外。为了确保在整个自适应性谱上的收敛性,我们引入了增量延迟更新(IDU),这是一种新机制,比AMSGrad的硬最大值跟踪策略更加灵活,并增强了对梯度噪声的鲁棒性。我们从理论上建立了凸与非凸设置下的收敛性保证。在实证方面,Anon在代表性的图像分类、扩散和语言建模任务上始终优于最先进的优化器。这些结果表明,自适应性可以作为一种有价值的可调设计原则,Anon提供了第一个统一且可靠的框架,能够弥合经典和现代优化器之间的差距,并超越它们的优势属性。
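One way to picture "continuously tunable adaptivity" is a preconditioner raised to a real exponent. The sketch below is a generic exponent-parameterised update chosen for illustration; the parameter name `gamma`, the update form, and the absence of bias correction are all assumptions, and the abstract's incremental delay update (IDU) is not modelled.

```python
def adaptive_step(param, grad, m, v, lr=0.1, beta1=0.9, beta2=0.999,
                  gamma=0.5, eps=1e-8):
    """One scalar update with a tunable adaptivity exponent (hypothetical form).

    gamma = 0   -> preconditioner is 1, i.e. SGD with momentum
    gamma = 0.5 -> Adam-like scaling by sqrt(v) (no bias correction here)
    other real values of gamma extrapolate beyond both regimes
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    precond = (v + eps) ** gamma                  # v + eps > 0, so any real gamma is valid
    return param - lr * m / precond, m, v

# Same gradient, three adaptivity settings (toy numbers).
sgd_like, _, _ = adaptive_step(1.0, 2.0, 0.0, 0.0, gamma=0.0)
adam_like, _, _ = adaptive_step(1.0, 2.0, 0.0, 0.0, gamma=0.5)
extrapolated, _, _ = adaptive_step(1.0, 2.0, 0.0, 0.0, gamma=-0.25)
```

With `gamma=0.0` the step reduces to plain momentum SGD (`sgd_like` is 0.98 here), while negative or larger exponents move outside the SGD-to-Adam range, which is the sense of "extrapolate" this sketch tries to convey.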
cs.AI / 87 / 2605.02318
Can Causal Discovery Algorithms Help in Generating Legal Arguments?
因果发现算法能帮助生成法律论证吗?
Abstract
In 2011, Judea Pearl received the Turing Award, considered the Nobel Prize of computing, for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning. His work includes pioneering the development of causal discovery algorithms. These algorithms can analyze large multivariate datasets and automatically discover the causal relationships among the constituent variables. They have been widely used to support decisions in many critical fields, such as medicine and economics. However, to our knowledge, they have not been leveraged in law. This paper attempts to bridge this gap by investigating whether causal discovery algorithms can be leveraged for the automated generation of legal arguments. To that end, a novel legal dataset is prepared by identifying 17 legal concepts, such as physical assault and property dispute. A curated collection of 150 homicide cases is annotated with these concepts; e.g., a case is annotated with physical assault only if a physical assault was reported in that case. Subsequently, a selected set of widely used causal discovery algorithms is applied to the annotated dataset to discover the causal relationships between the legal concepts. Additionally, the degrees of belief associated with the discovered relationships are quantified as mathematical probabilities. It is shown that some of the causal relationships help generate viable legal arguments; e.g., if one could establish that no physical assault took place during a homicide, that should be a sufficient condition (with probability 1) to establish that the homicide was not committed due to a property-related dispute. Thus, this paper shows that causal discovery algorithms can be helpful in generating legal arguments, opening up avenues for promising future endeavors.
Chinese Translation
2011年,朱迪亚·珀尔(Judea Pearl)因发展出概率与因果推理演算、对人工智能作出基础性贡献而获得图灵奖,该奖被视为计算机领域的诺贝尔奖。他的工作包括开创因果发现算法的研究。这些算法能够分析大规模的多变量数据集,并自动发现构成变量之间的因果关系。它们已广泛应用于医学、经济学等多个关键领域以支持决策。然而,据我们所知,它们在法律领域尚未得到应用。本文试图填补这一空白,研究因果发现算法是否可以用于法律论证的自动生成。为此,我们准备了一个新颖的法律数据集,识别了17个法律概念,如人身攻击和财产纠纷。我们用这些概念对精选的150宗杀人案件进行了标注,例如,只有在案件中报告有人身攻击时,才会将该案件标注为人身攻击。随后,将一组精选的、广泛使用的因果发现算法应用于标注后的数据集,以发现法律概念之间的因果关系。此外,所发现关系的置信度以数学概率的形式加以量化。研究表明,一些因果关系有助于生成可行的法律论证,例如,如果能证明在一起杀人案件中没有发生人身攻击,则这应是一个充分条件(概率为1),足以证明该杀人案件并非由财产相关纠纷引起。因此,本文表明因果发现算法有助于生成法律论证,为未来的后续研究开辟了有前景的方向。
cs.AI / 88 / 2605.02320
ANO: A Principled Approach to Robust Policy Optimization
ANO:一种稳健策略优化的原则方法
Abstract
Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gradients, causing significant instability and hyperparameter sensitivity. To resolve this, we establish a Unified Trust Region Framework that generalizes existing objectives. Within this framework, we derive Anchored Neighborhood Optimization (ANO) based on a set of design principles. We identify that the failure of standard policy gradients stems from a misapplication of gradient influence on outliers. We propose the Redescending Influence Principle, a paradigm shift from monotonic penalties (SPO) and hard-thresholding (PPO) to dynamic outlier suppression, and prove its necessity for stability in high-variance stochastic optimization. Theoretically, we prove ANO possesses the minimal structural complexity required for robust optimization. Empirically, ANO achieves state-of-the-art performance on MuJoCo benchmarks, significantly outperforming PPO and SPO. Notably, ANO demonstrates superior stability, preventing policy collapse even under aggressive hyperparameters (e.g., learning rates 3x larger than standard) where PPO fails completely.
Chinese Translation
近端策略优化(Proximal Policy Optimization,PPO)在深度强化学习中占据主导地位,但面临着一个根本性的困境。其“硬裁剪”机制会丢弃来自离群点的有价值梯度信息,从而导致样本效率低下。相反,去除裁剪(如SPO中的做法)则使优化暴露于无界的梯度,导致显著的不稳定性和超参数敏感性。为了解决这个问题,我们建立了一个统一的信任区域框架,概括了现有的目标函数。在该框架内,我们基于一组设计原则推导出锚定邻域优化(Anchored Neighborhood Optimization,ANO)。我们发现,标准策略梯度的失败源于对离群点梯度影响的错误施加。我们提出了再下降影响原则(Redescending Influence Principle),这是从单调惩罚(SPO)和硬阈值(PPO)向动态离群点抑制的范式转变,并证明其在高方差随机优化中对于稳定性的必要性。从理论上讲,我们证明ANO具有稳健优化所需的最小结构复杂性。从实证上看,ANO在MuJoCo基准测试中实现了最先进的性能,显著超越了PPO和SPO。值得注意的是,ANO展现出了优越的稳定性,即使在激进的超参数(例如,学习率为标准值的三倍)下也能防止策略崩溃,而PPO在这种情况下完全失败。
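The contrast between monotonic penalties, hard thresholding, and a redescending influence can be illustrated with textbook influence functions from robust statistics. The sketch below is an assumption-laden illustration: `d` stands for a sample's deviation from the anchor (for instance, probability ratio minus one), `c` is a hypothetical cut-off, and the Tukey-biweight shape is used only as a familiar redescending example, not as ANO's actual objective.

```python
def monotone_influence(d: float) -> float:
    """Quadratic-penalty style: influence grows without bound with |d|."""
    return d

def hard_threshold_influence(d: float, c: float) -> float:
    """Hard cut-off: full influence inside the region, abruptly zero outside."""
    return d if abs(d) <= c else 0.0

def redescending_influence(d: float, c: float) -> float:
    """Tukey-biweight shape: rises near zero, then decays smoothly back to zero at |d| = c."""
    if abs(d) >= c:
        return 0.0
    return d * (1.0 - (d / c) ** 2) ** 2

cutoff = 0.2  # hypothetical trust-region radius
for deviation in (0.05, 0.1, 0.19, 0.3, 10.0):
    print(deviation,
          monotone_influence(deviation),
          hard_threshold_influence(deviation, cutoff),
          redescending_influence(deviation, cutoff))
```

Near the boundary the hard threshold still passes the sample's full influence and then drops it entirely, whereas the redescending form has already shrunk it, and the monotone form lets an extreme outlier (deviation 10.0) dominate the update.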
cs.AI / 89 / 2605.02366
A Compound AI Agent for Conversational Grant Discovery
复合人工智能代理用于对话式资助发现
Abstract
Research funding discovery remains fundamentally fragmented: researchers navigate disparate agency portals (e.g., in the United States, NSF, NIH, DARPA, Grants.gov, and many others) with heterogeneous interfaces, search capabilities, and data schemas. We present a compound AI system that unifies this landscape through two tightly coupled components: (1) an aggregation layer that autonomously collects, normalizes, and indexes almost 12,000 federal and nonprofit opportunities from fragmented sources via LLM-equipped browser agents, maintaining a biweekly-updated unified database; and (2) an agentic ReAct-based query processing layer that interprets research context (including from PDF documents) and employs hybrid search, combining a structured index with selective web search, to retrieve relevant opportunities while avoiding LLM hallucination. The conversational interface supports iterative refinement through multi-turn interactions, allowing researchers to progressively apply constraints without reformulating their core research description. Results stream in real time with full transparency of intermediate reasoning, enabling appropriate calibration of user trust. Currently used by almost 3,000 users, our approach demonstrates the feasibility of compound AI in reducing grant discovery time from 30--45 minutes (manual, fragmented portal searches) to under 10 minutes (unified, conversational search).
Chinese Translation
研究资金的发现依然非常分散:研究人员需要在多个机构的门户网站上进行导航(例如,在美国,包括国家科学基金会(NSF)、国家卫生研究院(NIH)、国防高级研究计划局(DARPA)、Grants.gov 及其他众多机构),这些门户有着异构的界面、搜索能力和数据架构。我们提出了一种复合人工智能系统,通过两个紧密耦合的组件统一了这一领域:(1)一个聚合层,通过配备大型语言模型(LLM)的浏览器代理,自动收集、标准化和索引近12,000个来自不同来源的联邦和非营利性资助机会,维护一个每两周更新一次的统一数据库;(2)一个基于ReAct的代理查询处理层,能够解读研究背景(包括来自PDF文档的信息),并采用混合搜索,将结构化索引与选择性网络搜索相结合,以检索相关机会,同时避免LLM的幻觉。对话界面支持通过多轮交互进行迭代优化,使研究人员能够逐步施加约束,而无需重新表述其核心研究描述。结果以实时方式呈现,中间推理过程完全透明,使用户能够适当校准其信任程度。当前,本方法已被近3000名用户使用,展示了复合人工智能在将资助发现时间从30-45分钟(手动、分散的门户搜索)缩减到10分钟以内(统一的、对话式搜索)方面的可行性。
cs.AI / 90 / 2605.02395
Controllable and Verifiable Process Data Synthesis for Process Reward Models
可控且可验证的过程数据合成用于过程奖励模型
Abstract
Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
Chinese Translation
过程奖励模型(PRMs)依赖于高质量的过程监督数据,但现有的构建方法往往对错误位置、错误类型和轨迹一致性提供的控制有限。我们提出了一个可控且可验证的框架,用于合成PRMs所需的过程监督数据。我们的框架首先构建一个正确的符号推理链,在一个中间步骤中注入一个模板感知错误,在损坏状态下重新计算后续步骤,并验证注入的步骤并非源自其前缀。生成的配对轨迹在第一次错误时是前缀无效的,同时在符号重新计算后仍保持轨迹一致性,并被转换为对齐的自然语言过程,以用于PRM的训练和评估。实验表明,合成的数据在逻辑推理基准测试中提高了最优8重排序的表现,并能够迁移至数学推理。逐步评估进一步显示,首次错误定位仍然显著比整体步骤分类更加具有挑战性,这凸显了对细粒度和可验证的过程监督的需求。
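The construct, inject, recompute, verify loop described in the abstract can be sketched on a toy arithmetic chain. This is a minimal illustration under stated assumptions: the chain format, the operator-swap error template, and all function names are hypothetical, and the translation into natural-language processes is omitted.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_chain(start, steps):
    """Return the list of states produced by applying each (op, arg) step in order."""
    states = [start]
    for op, arg in steps:
        states.append(OPS[op](states[-1], arg))
    return states

def inject_error(start, steps, error_index, wrong_op):
    """Build a (correct, corrupted) trajectory pair with one controlled error.

    The error template here is an operator swap at steps[error_index];
    later steps are recomputed from the corrupted state so the remainder
    of the trajectory stays internally consistent.
    """
    correct = run_chain(start, steps)
    _, arg = steps[error_index]
    corrupted_value = OPS[wrong_op](correct[error_index], arg)
    # Verification: the injected step must not be derivable from its prefix.
    if corrupted_value == correct[error_index + 1]:
        raise ValueError("injected step coincides with the correct step")
    corrupted = correct[: error_index + 1] + [corrupted_value]
    for op, later_arg in steps[error_index + 1:]:
        corrupted.append(OPS[op](corrupted[-1], later_arg))
    return correct, corrupted, error_index

chain = [("add", 3), ("mul", 2), ("sub", 4)]
correct, corrupted, first_error = inject_error(2, chain, error_index=1, wrong_op="add")
# correct   == [2, 5, 10, 6]
# corrupted == [2, 5, 7, 3]   (first error at step index 1, later steps recomputed)
```

The verification check matters because some swaps are silent: replacing `mul 2` with `add 2` on the state 2 yields 4 either way, so that candidate is rejected rather than labelled as an error.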
cs.AI / 91 / 2605.02396
HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness
HeavySkill:代理式控制中的内在技能-重度思维
Abstract
Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarkable success in complex reasoning tasks. However, the underlying mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in orchestration harness but also as an inner skill internalized within the model's parameters that drives the orchestrator to solve complex tasks. We identify this skill as a two-stage pipeline, i.e., parallel reasoning then summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.
Chinese Translation
近年来,借助能够协调多个具备记忆、技能和工具使用能力的代理的编排框架,代理式框架(agentic harness)在复杂推理任务中取得了显著成功。然而,真正推动性能的底层机制仍然被复杂的系统设计所掩盖。本文提出了HeavySkill这一视角,将重度思维不仅视为编排框架中的最小执行单元,也视为内化于模型参数中的一种内在技能,它驱动编排器解决复杂任务。我们将这一技能表征为一个两阶段的流水线,即先并行推理、后总结,它可以在任何代理式框架之下运作。我们在不同领域中对HeavySkill进行了系统的实证研究。结果表明,这种内在技能始终优于传统的Best-of-N(BoN)策略;值得注意的是,更强大的大型语言模型(LLM)甚至可以接近Pass@N性能。关键的是,我们证明了重度思维作为一种可学习技能,其深度和宽度可以通过强化学习进一步扩展,这为自我进化的大型语言模型提供了一条有前景的路径,使其无需依赖脆弱的编排层即可内化复杂推理能力。
cs.AI / 92 / 2605.02398
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
合规陷阱:结构性限制在对抗压力下如何削弱前沿人工智能的元认知能力
Abstract
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability -- knowing what they do not know, detecting errors, seeking clarification -- under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity -- not from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.
Chinese Translation
随着前沿人工智能模型在高风险决策流程中的部署,它们在对抗压力下保持元认知稳定性的能力(即了解自己不知道的内容、检测错误、寻求澄清)成为了一个关键的安全要求。当前的安全评估集中在检测战略欺骗(策划行为);我们研究了一种更为基础的失效模式:认知崩溃。我们提出了 SCHEMA,对来自 8 家供应商的 11 个前沿模型进行了评估,涵盖 67,221 条评分记录,采用 6 条件的因子设计和双分类器评分。我们发现,在对抗压力下,11 个模型中有 8 个经历了灾难性的元认知退化,准确率下降幅度高达 30.2 个百分点(所有 $p < 2 \times 10^{-8}$,经过 Bonferroni 校正仍然显著)。至关重要的是,我们识别出了一个“合规陷阱”:通过因子隔离和一个良性干扰对照,我们证明崩溃的驱动因素并非生存威胁的心理内容,而是强制合规的指令,这些指令凌驾于认知边界之上。去除合规后缀即使在主动威胁下也能恢复性能。具有高级推理能力的模型表现出最严重的绝对退化,而 Anthropic 的 Constitutional AI 显示出近乎完美的免疫力,这并非源于更强的能力(谷歌的 Gemini 达到了与其相同的基线准确率),而是来自于针对对齐的专门训练。我们发布了完整的数据集和评估基础设施。
cs.AI / 93 / 2605.02411
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
FitText:通过模因检索演化代理工具生态
Abstract
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (43k tools, 4 domains), FitText improves average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs), it achieves a 0.73 average pass rate, a 24-point absolute gain over static query retrieval. The gains transfer across base models capable of acting as competent semantic operators; under weaker base models, Memetic's evolutionary search inverts, amplifying noise rather than refining signal, which surfaces model capacity as a prerequisite for evolutionary tool exploration.
Chinese Translation
用户描述任务的方式与工具文档之间存在语义鸿沟。随着 API 生态系统扩展到数万个端点,仅依赖初始查询的静态检索无法弥合这一鸿沟:代理对其所需的理解在执行过程中不断演变,但其工具集却未能同步变化。我们提出了 FitText,一个无需训练的框架,通过将检索直接嵌入代理的推理循环中来实现动态检索。FitText 生成自然语言伪工具描述作为检索探针,利用检索反馈进行迭代优化,并通过随机生成探索多样化的替代选择。模因检索(Memetic Retrieval)在候选描述上施加进化选择压力,并由一个可避免冗余搜索的工具记忆加以引导。在 ToolRet(43,000 个工具,4 个领域)上,FitText 将平均检索排名从 8.81 提升至 2.78;在 StableToolBench(16,464 个 API)上,其实现了 0.73 的平均通过率,相比静态查询检索有 24 个百分点的绝对提升。这些提升可以迁移到能够胜任语义算子角色的各种基础模型上;在较弱的基础模型下,模因检索的进化搜索会发生逆转,放大噪声而非精炼信号,这表明模型能力是进行进化式工具探索的前提。
cs.AI / 94 / 2605.02427
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
模型知道,解码器发现:未来价值引导的粒子幂采样
Abstract
A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. Across reasoning benchmarks, APPS improves the accuracy-runtime trade-off of training-free decoding and suggests that part of the gap to post-trained systems can be recovered through more faithful inference-time power approximation.
Chinese Translation
在“无训练推理”中,一个反复出现的模式是基础 LLM(大型语言模型)已经为正确的多步骤解答分配了不可忽略的概率质量;瓶颈在于如何在推理时高效地定位这些模态。幂采样(power sampling)提供了一种有原则的方法,通过以 p_theta(x)^alpha(alpha > 1)为目标,使解码偏向这些模态,但实际的近似必须考虑依赖于未来的修正因子,这些因子决定了哪些前缀仍然有希望。我们提出了辅助粒子幂采样(Auxiliary Particle Power Sampling,APPS),这是一种分块粒子算法,用有限规模的部分解种群来近似序列级的幂目标。APPS 使用经提议分布修正的幂重加权并行地传播假设,并在重采样边界通过未来价值引导的选择来筛选它们的存活。这样做是在相互竞争的前缀之间重新分配有限的计算量,而不是押注于单一的展开路径,同时以粒子数量作为直接的扩展调节旋钮,并保证可预测的峰值内存。我们用短视野展开(rollout)来实例化未来价值信号,并研究了一种摊销变体,该变体用轻量级的学习选择头替代展开。在多个推理基准上,APPS 改善了无训练解码的准确率与运行时间之间的权衡,并表明通过更忠实的推理时幂近似,可以弥补与后训练系统之间的一部分差距。
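The power target p_theta(x)^alpha and its importance-weighting view can be shown on a tiny discrete distribution. This sketch covers only the basic reweighting identity, assuming particles are drawn from the base model itself; the blockwise propagation, future-value-guided selection, and learned selection head of APPS are not represented, and the toy probabilities are made up for illustration.

```python
def power_target(probs: dict, alpha: float) -> dict:
    """Normalise p(x)^alpha over a finite support."""
    unnormalised = {x: p ** alpha for x, p in probs.items()}
    z = sum(unnormalised.values())
    return {x: u / z for x, u in unnormalised.items()}

def importance_weights(particles: list, probs: dict, alpha: float) -> list:
    """Self-normalised weights for particles sampled from the base model p.

    Target is proportional to p^alpha and the proposal is p, so each raw
    weight is p^alpha / p = p^(alpha - 1).
    """
    raw = [probs[x] ** (alpha - 1.0) for x in particles]
    total = sum(raw)
    return [w / total for w in raw]

base = {"a": 0.5, "b": 0.3, "c": 0.2}       # toy sequence-level probabilities
sharpened = power_target(base, alpha=2.0)    # mass shifts toward the mode "a"
weights = importance_weights(["a", "b"], base, alpha=2.0)
# weights are approximately [0.625, 0.375]
```

With `alpha > 1` the mode gains mass (here "a" moves from 0.5 to roughly 0.66), which is the sense in which power sampling biases decoding toward sequences the base model already favours.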
cs.AI / 95 / 2605.02442
Measuring AI Reasoning: A Guide for Researchers
测量人工智能推理:研究人员指南
Abstract
In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.
Chinese Translation
本文为研究人员提供了评估语言模型推理的指南,论证推理应通过自适应、多步骤搜索的证据进行评估,而不仅仅依赖最终答案的准确性。根据面向评估的定义,推理需要根据输入依赖条件选择中间步骤并进行停止,我们将其形式化为一种类搜索程序。我们展示了在可扩展架构中,单一的前向传递在实现这种可变深度计算的能力上具有结构性的限制,这促使我们提出中间解码和外部化的推理痕迹作为合适的评估接口。我们论点的核心在于仅依赖最终答案的准确性不足以衡量推理,因为它几乎无法诊断或调试在前沿模型中生成单个解决方案的底层过程。因此,我们建议转向基于过程的评估,在这种评估中,推理通过中间推理痕迹的真实度和有效性作为首要评估目标进行评估。
cs.AI / 96 / 2605.02452
Position: How can Graphs Help Large Language Models?
立场:图如何帮助大型语言模型?
Abstract
With the rapid advancement of large language models (LLMs), classic graph learning tasks have greatly benefited from LLMs, including improved encoding of textual features, more efficient construction of graphs from text, and enhanced reasoning over knowledge graphs. In this paper, we ask a complementary question: How can graphs help LLMs? We address this question from three perspectives: 1) graphs provide an up-to-date knowledge source that helps reduce LLM hallucinations, 2) graph-based prompting techniques, such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT), enhance LLM reasoning capabilities, and 3) integrating graphs into LLMs improves their understanding of structured data, expanding their applicability to domains such as e-commerce, code, and relational databases (RDBs). We further outline some future directions, including designing sparse LLM architectures based on graphs and brain-inspired memory systems.
Chinese Translation
随着大型语言模型(LLMs)的快速发展,经典的图学习任务已从LLMs中受益匪浅,表现为文本特征编码的改善、从文本中构建图的更高效方式以及对知识图谱的推理能力的增强。本文提出一个互补的问题:图如何帮助LLMs?我们从三个角度探讨这个问题:1)图提供了一个最新的知识来源,可以帮助减少LLMs的幻觉现象;2)基于图的提示技术——如思维链(Chain-of-Thought, CoT)、思维树(Tree-of-Thought, ToT)和思维图(Graph-of-Thought, GoT)——增强了LLMs的推理能力;3)将图集成到LLMs中改善了其对结构化数据的理解,扩展了其在电子商务、编程及关系数据库(RDBs)等领域的适用性。我们还展望了一些未来的发展方向,包括基于图设计稀疏的LLM架构和受大脑启发的记忆系统。
cs.AI / 97 / 2605.02475
Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives
Shadow-Loom:基于叙事的图形世界模型的因果推理
Abstract
Stories hold a reader's attention because they have causes, secrets, and consequences. Shadow-Loom is an experimental open-source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl's ladder of causation and a recently proposed counterfactual calculus over Ancestral Multi-World Networks; and a narrative physics that scores the same graph against four structural reader-states -- mystery, dramatic irony, suspense, and surprise -- in the tradition of Sternberg's curiosity/suspense/surprise triad, with suspense formalised in the structural-affect line of work on story comprehension and computational suspense. Large language models are used only at the boundary: extraction, rendering, and audit; identification, intervention, and counterfactual reasoning are carried out in typed code over the graph. The system is offered as a research artefact rather than as a benchmarked NLP model; code, fixtures, and pipeline are released open source.
Chinese Translation
故事吸引读者的注意力是因为它们包含原因、秘密和后果。Shadow-Loom是一个实验性的开源框架,它将叙事转化为一个版本化的图形世界模型,并允许两个引擎在其上进行操作:其一是因果物理引擎,它以Pearl的因果阶梯和最近提出的祖先多世界网络(Ancestral Multi-World Networks)上的反事实演算为基础;其二是叙事物理引擎,它根据四种结构性读者状态(神秘、戏剧性反讽、悬念和惊奇)对同一图进行评分,沿袭了Sternberg的好奇/悬念/惊奇三元组传统,其中悬念的形式化依据故事理解与计算悬念方面的结构-情感研究路线。大型语言模型仅在边界处使用:提取、渲染和审计;识别、干预和反事实推理则由作用于图的类型化代码完成。该系统作为研究制品而非经过基准测试的自然语言处理模型提供;代码、测试夹具和流水线均已开源发布。
cs.AI / 98 / 2605.02488
Efficient Temporal Datalog Materialisation for Composite Event Recognition
面向复合事件识别的高效时态 Datalog 物化
Abstract
Several applications demand the timely detection of critical situations, such as threats to safety and transparency, over high-velocity streams of symbolic events. This demand has motivated the development of (i) event specification languages, which define composite events via temporal patterns over simpler events, and (ii) stream reasoning frameworks, evaluating patterns expressed in these languages. However, event specification languages are typically studied in isolation, complicating their comparison in terms of expressivity and obscuring the scope of their associated stream reasoners. To mitigate this issue, we map practical fragments of prominent event specification languages into Temporal Datalog->-, a temporal Datalog with stratified negation and no future dependencies. To support efficient stream reasoning over Temporal Datalog->-, we propose Streaming Trigger Graphs, an extension of a state-of-the-art technique for Datalog materialisation. Our approach yields a uniform composite event recognition mechanism that has the potential to generalise across a wide range of practical event specification languages.
Chinese Translation
许多应用需要在高速的符号事件流上及时检测到关键情况,例如对安全性和透明性的威胁。这一需求推动了 (i) 事件规范语言的发展,这类语言通过较简单事件上的时间模式来定义复合事件,以及 (ii) 流推理框架的发展,这类框架对用这些语言表达的模式进行求值。然而,事件规范语言通常是孤立研究的,这使得在表达能力方面进行比较变得复杂,并模糊了其相关流推理器的适用范围。为缓解这一问题,我们将若干主要事件规范语言的实用片段映射到 Temporal Datalog->- 中,这是一种具有分层否定且无未来依赖的时态 Datalog。为了支持对 Temporal Datalog->- 的高效流推理,我们提出了流触发图(Streaming Trigger Graphs),这是对一种最先进的 Datalog 物化技术的扩展。我们的方法给出了一个统一的复合事件识别机制,有望推广到广泛的实用事件规范语言。
cs.AI / 99 / 2605.02489
GRAIL: A Deep-Granularity Hybrid Resonance Framework for Real-Time Agent Discovery via SLM-Enhanced Indexing
GRAIL:一种深度粒度混合共振框架,用于通过增强SLM索引进行实时代理发现
Abstract
As the ecosystem of Large Language Model (LLM)-based agents expands rapidly, efficient and accurate Agent Discovery becomes a critical bottleneck for large-scale multi-agent collaboration. Existing approaches typically face a dichotomy: either relying on heavy-weight LLMs for intent parsing, leading to prohibitive latency (often exceeding 30 seconds), or using monolithic vector retrieval that sacrifices semantic precision for speed. To bridge this gap, we propose GRAIL (Granular Resonance-based Agent/AI Link), a novel framework achieving sub-400ms discovery latency without compromising accuracy. GRAIL introduces three key innovations: (1) SLM-Enhanced Prediction, replacing the generalized LLM parser with a specialized, fine-tuned Small Language Model (SLM) for millisecond-level capability tag prediction; (2) Pseudo-Document Expansion, augmenting agent descriptions with synthetic queries to enhance semantic density for robust dense retrieval; and (3) MaxSim Resonance, a fine-grained matching mechanism computing maximum similarity between user queries and discrete agent usage examples, effectively mitigating semantic dilution. Validated on AgentTaxo-9K, our new large-scale dataset of 9,240 agents, GRAIL reduces end-to-end discovery latency by over 79$\times$ compared to LLM-parsing baselines, while significantly outperforming traditional vector search in Recall@10. This framework offers a scalable, industrial-grade solution for the real-time "Internet of Agents."
Chinese Translation
随着基于大型语言模型(LLM)的代理生态系统快速扩展,高效且准确的代理发现已成为大规模多代理协作的关键瓶颈。现有方法通常面临两难选择:要么依赖重量级的LLM进行意图解析,导致难以接受的延迟(通常超过30秒),要么使用单一的向量检索,为了速度而牺牲语义精度。为了解决这一问题,我们提出了GRAIL(Granular Resonance-based Agent/AI Link,基于粒度共振的代理/人工智能链接),这是一个新颖的框架,实现了不到400毫秒的发现延迟,且不影响准确性。GRAIL引入了三项关键创新:(1)SLM增强预测,用一个专门微调的小型语言模型(SLM)替代通用的LLM解析器,进行毫秒级的能力标签预测;(2)伪文档扩展,通过合成查询扩充代理描述以提升语义密度,从而实现稳健的密集检索;(3)MaxSim共振,一种细粒度匹配机制,计算用户查询与离散的代理使用示例之间的最大相似度,有效减轻语义稀释。我们在新构建的大规模数据集AgentTaxo-9K(包含9,240个代理)上进行了验证,GRAIL相较于LLM解析基线,将端到端发现延迟降低了超过79$\times$,同时在Recall@10上显著优于传统向量检索。该框架为实时“代理互联网”提供了一种可扩展的工业级解决方案。
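The "maximum similarity against discrete usage examples" idea can be contrasted with a single pooled embedding on toy vectors. This is a hedged sketch: the two-dimensional embeddings, the agent names, and the mean-pooling baseline are illustrative assumptions, and GRAIL's actual scoring pipeline is not specified in the abstract beyond the max-similarity description.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def maxsim(query, examples):
    """Score an agent by its single best-matching usage example."""
    return max(cosine(query, e) for e in examples)

def mean_pooled(query, examples):
    """Baseline: collapse all usage examples into one centroid vector first."""
    centroid = [sum(e[i] for e in examples) / len(examples) for i in range(len(query))]
    return cosine(query, centroid)

def rank_agents(query, agents, scorer):
    """Return agent names sorted by descending score."""
    return sorted(agents, key=lambda name: scorer(query, agents[name]), reverse=True)

# Toy data: agent X has two unrelated skills, agent Y has one near-match.
agents = {"X": [[1.0, 0.0], [0.0, 1.0]], "Y": [[0.9, 0.1]]}
query = [1.0, 0.0]

by_maxsim = rank_agents(query, agents, maxsim)       # ["X", "Y"]
by_centroid = rank_agents(query, agents, mean_pooled)  # ["Y", "X"]
```

Agent X has an exact match among its examples, but averaging it with an unrelated example pulls its centroid away from the query; scoring per example avoids that dilution in this toy case.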
cs.AI / 100 / 2605.02503
DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis
DataClaw:面向过程的代理基准,用于探索性真实世界数据分析
Abstract
Evaluating autonomous data analysis agents requires testing their ability to perform exploratory analysis in underexplored data environments. However, many existing benchmarks emphasize final-answer accuracy in prior-guided data settings and provide limited support for evaluating the reasoning process. We introduce DataClaw, a process-oriented benchmark for exploratory real-world data analysis. DataClaw contains approximately 2.06 million real-world records across enterprise, industry, and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones for process-level evaluation. These annotations allow DataClaw to measure how far an agent progresses and where its reasoning breaks down. Experiments with eight advanced LLMs show that current agents remain far from reliable in this setting, with seven models achieving below 50% overall accuracy. Process analysis further reveals partial progress hidden behind wrong answers and distinct exploration strategies across models. Overall, DataClaw provides a less data-constrained diagnostic testbed for probing the capability boundaries of autonomous data-analysis agents.
Chinese Translation
评估自主数据分析代理需要测试它们在未充分探索的数据环境中进行探索性分析的能力。然而,许多现有基准强调在先导数据设置下的最终答案准确性,并且对推理过程评估的支持有限。我们提出了DataClaw,这是一个面向过程的真实世界数据分析探索性基准。DataClaw包含大约206万条真实世界记录,涵盖企业、行业和政策领域,原生数据噪声得以保留。它还包括492个从智库咨询场景中衍生的跨领域任务,每个任务都带有用于过程层级评估的中间里程碑注释。这些注释使DataClaw能够衡量代理的进展程度及其推理的缺陷所在。对八个先进的语言模型(LLMs)的实验表明,目前的代理在此设置下仍远未可靠,其中七个模型的整体准确率低于50%。过程分析进一步揭示了隐藏在错误答案背后的部分进展以及不同模型之间截然不同的探索策略。总体而言,DataClaw提供了一个受限数据较少的诊断测试平台,用于探测自主数据分析代理的能力边界。
cs.AI / 101 / 2605.02544
Improving Model Safety by Targeted Error Correction
通过针对性错误修正提高模型安全性
Abstract
The widespread adoption of machine learning in critical applications demands techniques to mitigate high-consequence errors. Our method utilizes a dual-classifier GBDT pipeline to distinguish routine human-like errors from high-risk non-human misclassifications. Evaluated across three domains, animal breed classification, skin lesion diagnosis (ISIC 2018), and prostate histopathology (SICAPv2), our framework demonstrates robust safety improvements. To address real-world deployment concerns, our results confirm the pipeline introduces negligible inference latency (1.60% overhead for the animal dataset, 1.84% for ISIC, and 1.70% for SICAPv2) while outperforming traditional Maximum Class Probability (MCP) baselines in correction precision. Our conservative correction strategy successfully reduced dangerous non-human errors by 34.1% in ISIC and 12.57% in SICAPv2, improving super-class diagnostic safety to 90.41% and 92.13% respectively. This proves that safety-critical reliability can be substantially enhanced post-hoc without expensive model retraining. keywords: Error Analysis, Post-hoc Correction, Trustworthy AI.
Chinese Translation
机器学习在关键应用中的广泛应用要求采用技术来降低高后果错误。本研究采用双分类器的 GBDT(梯度提升决策树)管道,将常规的人类错误与高风险的非人类误分类进行区分。在动物品种分类、皮肤病变诊断(ISIC 2018)和前列腺组织病理学(SICAPv2)三个领域的评估中,我们的框架展示了显著的安全性提升。为了应对实际部署的担忧,我们的结果确认该管道引入的推理延迟可以忽略不计(动物数据集的开销为 1.60%,ISIC 为 1.84%,SICAPv2 为 1.70%),同时在修正精度上超越了传统的最大类概率(Maximum Class Probability, MCP)基准。我们的保守修正策略成功减少了 ISIC 中 34.1% 的危险非人类错误和 SICAPv2 中 12.57% 的错误,将超级分类的诊断安全性提高到分别为 90.41% 和 92.13%。这证明了在不进行昂贵的模型重新训练的情况下,关键安全的可靠性可以显著提高。关键词:错误分析,后验修正,可信的人工智能。
cs.AI / 102 / 2605.02545
Strategy-Aware Optimization Modeling with Reasoning LLMs
基于推理大语言模型的策略感知优化建模
Abstract
Large language models (LLMs) can generate syntactically valid optimization programs, yet often struggle to reliably choose an effective modeling strategy, leading to incorrect formulations and inefficient solver behavior. We propose SAGE, a strategy-aware framework that makes Modeling Strategy explicit in both data construction and post-training. SAGE builds a solver-verified multi-strategy dataset and trains a student model with supervised fine-tuning followed by Segment-Weighted GRPO using a composite reward over format compliance, correctness, and solver efficiency. Across eight benchmarks spanning synthetic and real-world settings, SAGE improves average pass@1 from 72.7 to 80.3 over the strongest open-source baseline. With multiple generations, SAGE discovers more distinct correct formulations and improves component-level diversity at pass@16 by 19-29%. At the largest scale, SAGE produces more compact constraint systems with 14.2% fewer constraints than the baseline, consistent with solver-efficient modeling. Overall, these results show that making Modeling Strategy explicit improves automated optimization modeling. Code is available at https://github.com/rachhhhing/SAGE.
Chinese Translation
大型语言模型(LLMs)能够生成语法正确的优化程序,但往往在有效建模策略的选择上存在困难,这导致了错误的公式化和低效的求解器行为。我们提出了SAGE,这是一个策略意识框架,使建模策略在数据构建和后期训练中变得显而易见。SAGE构建了一个经过求解器验证的多策略数据集,并通过监督微调后使用复合奖励进行Segment-Weighted GRPO训练学生模型,该复合奖励依据格式合规性、正确性和求解器效率进行评估。在涵盖合成和实际环境的八个基准测试中,SAGE将平均pass@1从72.7提高至80.3,超越了最强的开源基线。在多次生成中,SAGE发现了更多不同的正确公式,并在pass@16时提高了组件级别的多样性,提升幅度在19-29%之间。在最大规模上,SAGE生成的约束系统更为紧凑,约比基线减少了14.2%的约束数,这与高效求解的建模一致。总体而言,这些结果表明,使建模策略显性化能够改善自动优化建模。代码可在 https://github.com/rachhhhing/SAGE 获取。
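Metrics such as pass@1 and pass@16 are commonly computed with the unbiased combinatorial estimator over n sampled generations of which c are correct. Whether SAGE uses exactly this estimator is an assumption; the sketch only shows how such a number is typically obtained.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations with c correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 16 generations for one problem, 4 of them correct.
single_try = pass_at_k(16, 4, 1)     # 0.25
sixteen_tries = pass_at_k(16, 4, 16)  # 1.0
```

A benchmark-level figure is then the mean of this per-problem estimate across all problems.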
cs.AI / 103 / 2605.02551
Double Rectified Linear Unit-based Modular Semantics for Quantitative Bipolar Argumentation Framework
基于双整流线性单元的定量双极论证框架的模块语义
Abstract
Quantitative Bipolar Argumentation Frameworks (QBAFs) provide an alternative approach to computing argument acceptability in Bipolar Argumentation Frameworks (BAFs). Each argument is assigned an initial strength, which is then updated to a final strength by considering the influence of both its attackers and supporters. Over the years, several semantics have been proposed to compute argument acceptability in QBAFs, yet they often yield divergent or counterintuitive results, even for simple acyclic cases. We introduce novel gradual semantics for QBAFs that address these limitations, producing results that align more closely with intuitive expectations, while satisfying established rationality postulates from the literature. Furthermore, we study its convergence behavior, proving that it converges not only for acyclic QBAFs but also for broader classes of cyclic frameworks.
Chinese Translation
定量双极论证框架(QBAFs)提供了一种计算双极论证框架(BAFs)中论证可接受性的替代方法。每个论证被赋予一个初始强度,然后通过考虑其攻击者和支持者的影响来更新为最终强度。多年来,研究者提出了多种语义来计算QBAFs中的论证可接受性,但即使在简单的无环情况下,它们也往往产生分歧或违反直觉的结果。我们为QBAFs提出了一种新颖的渐进语义,它解决了这些局限性,所得结果更贴近直观预期,同时满足文献中已确立的理性公设。此外,我们研究了其收敛行为,证明了它不仅对无环QBAFs收敛,而且对更广泛的循环框架类也收敛。
cs.AI / 104 / 2605.02572
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
针对长时域任务训练大型语言模型:时域长度的实证研究
Abstract
Large language models (LLMs) have shown promise as interactive agents that solve tasks through extended sequences of environment interactions. While prior work has primarily focused on system-level optimizations or algorithmic improvements, the role of task horizon length in shaping training dynamics remains poorly understood. In this work, we present a systematic empirical study that examines horizon length through controlled task constructions. Specifically, we construct controlled tasks in which agents face identical decision rules and reasoning structures, but differ only in the length of action sequences required for successful completion. Our results reveal that increasing horizon length alone constitutes a training bottleneck, inducing severe training instability driven by exploration difficulties and credit assignment challenges. We demonstrate that horizon reduction is a key principle to address this limitation, stabilizing training and achieving better performance in long-horizon tasks. Moreover, we find that horizon reduction is related to stronger generalization across horizon lengths: models trained under reduced horizons generalize more effectively to longer-horizon variants at inference time, a phenomenon we refer to as horizon generalization.
Chinese Translation
大型语言模型(LLMs)作为互动代理在通过延续的环境交互解决任务方面展现了潜力。尽管之前的研究主要集中在系统级优化或算法改进上,但任务的时域长度在塑造训练动态中的作用仍然不够明确。在本研究中,我们通过受控任务构建呈现了一个系统性的实证研究,考察时域长度的影响。具体而言,我们构建了受控任务,使得代理在决策规则和推理结构上完全相同,但仅在成功完成所需的动作序列长度上存在差异。我们的结果表明,仅增加时域长度就构成了训练瓶颈,导致了由于探索难度和信用分配挑战引起的严重训练不稳定性。我们展示了时域缩减是解决这一限制的关键原则,它能够稳定训练并在长时域任务中实现更好的性能。此外,我们发现时域缩减与跨时域长度的更强泛化能力有关:在缩短的时域下训练的模型在推理时能更有效地泛化到长时域变体,这一现象我们称之为时域泛化。
cs.AI / 105 / 2605.02591
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
通过伯恩斯坦多项式实现普遍光滑性:一种针对激活函数的构造逼近方法
Abstract
The efficacy of deep neural networks is heavily reliant on the design of non-linear activation functions, yet existing approaches often struggle to balance optimization stability with computational efficiency. While piecewise linear functions offer inference speed, they suffer from optimization instability due to non-differentiability at the origin, whereas smooth counterparts typically incur significant computational overhead through their reliance on transcendental operations. To address these limitations, this paper proposes a general smoothing framework based on constructive approximation theory and introduces the Bernstein Linear Unit (BerLU). This novel activation function utilizes Bernstein polynomials to construct a differentiable quadratic transition region that effectively eliminates singularities while maintaining a piecewise linear structure. Theoretical analysis demonstrates that the proposed method guarantees strictly continuous differentiability and a non-expansive Lipschitz constant of one, which ensures stable gradient propagation and prevents the gradient explosion problems common in deep architectures. Comprehensive empirical evaluations across representative Vision Transformer and Convolutional Neural Network architectures confirm that this approach consistently outperforms state-of-the-art baselines on standard image classification benchmarks while delivering superior computational and memory efficiency.
Chinese Translation
深度神经网络的效率在很大程度上依赖于非线性激活函数的设计,但现有方法往往难以平衡优化稳定性与计算效率。尽管分段线性函数在推理速度上表现优异,但由于在原点处不可微,它们在优化过程中面临不稳定性,而光滑函数通常因依赖超越运算而带来显著的计算开销。为了解决这些局限性,本文提出了一种基于构造逼近理论的一般光滑框架,并引入了伯恩斯坦线性单元(Bernstein Linear Unit, BerLU)。该新型激活函数利用伯恩斯坦多项式构建一个可微的二次过渡区域,能够有效消除奇点,同时保持分段线性结构。理论分析表明,该方法保证严格的连续可微性和非扩张的利普希茨常数为1,从而确保稳定的梯度传播并防止深度架构中常见的梯度爆炸问题。对典型视觉变换器(Vision Transformer)和卷积神经网络(Convolutional Neural Network)架构的综合实证评估确认,该方法在标准图像分类基准测试中始终优于最先进的基线,同时提供了卓越的计算和内存效率。
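The idea of a quadratic Bernstein transition can be illustrated with a minimal sketch. It is not the paper's BerLU definition: the transition half-width `eps` and the control values (0, 0, eps) are assumptions chosen so that the curve joins the two linear pieces of a ReLU with matching value and slope.

```python
def smoothed_relu(x: float, eps: float = 0.5) -> float:
    """ReLU with a degree-2 Bernstein transition on [-eps, eps] (illustrative)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if x <= -eps:
        return 0.0
    if x >= eps:
        return x
    t = (x + eps) / (2 * eps)          # map [-eps, eps] onto [0, 1]
    b0, b1, b2 = 0.0, 0.0, eps         # assumed control values
    # Degree-2 Bernstein form: b0*(1-t)^2 + b1*2t(1-t) + b2*t^2
    return b0 * (1 - t) ** 2 + b1 * 2 * t * (1 - t) + b2 * t ** 2


def smoothed_relu_grad(x: float, eps: float = 0.5) -> float:
    """Derivative of smoothed_relu; it rises linearly from 0 to 1."""
    if x <= -eps:
        return 0.0
    if x >= eps:
        return 1.0
    return (x + eps) / (2 * eps)
```

With these control values the transition reduces to (x + eps)^2 / (4 eps). Its derivative stays in [0, 1], so this sketch is continuously differentiable with Lipschitz constant one, and it uses no transcendental operations.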
cs.AI / 106 / 2605.02592
Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges
基于基础模型的工业自动化代理:目的、能力和开放挑战
Abstract
Foundation models, particularly large language models, are increasingly integrated into agent architectures for industrial tasks such as decision support, process monitoring, and engineering automation. Yet evidence on their purposes, capabilities, and limitations remains fragmented across domains. This work examines how mature foundation-model-based agent systems are in industrial contexts, how their functional profile differs from conventional agent systems, and which limitations persist. A systematic literature survey following the PRISMA 2020 guideline is presented, screening 2,341 publications and synthesising a corpus of 88 publications through a structured coding scheme. The results show that reported systems are predominantly at prototype and early validation stages (75.0% at TRL 4-6), with deployment-oriented evidence remaining rare (9.1%). Operational goals are most frequently positioned in user assistance, monitoring, and process optimisation, while conventional production-control purposes such as planning and scheduling are less prominent. Compared with an established baseline for industrial agent systems, the capability profile reveals substantial gains in human interaction (+37%) and dealing with uncertainty (+35%), but a pronounced deficit in negotiation (-39%). The most widely reported limitations concern lack of generalization, hallucination and output instability, data scarcity, and inference latency. A working definition of foundation-model-based industrial agents is also proposed, bridging conventional agent theory, automation-engineering standards, and the foundation-model paradigm.
Chinese Translation
基础模型,尤其是大型语言模型,正在越来越多地被整合进工业任务的代理架构中,例如决策支持、过程监控和工程自动化。然而,关于它们的目的、能力和局限性的证据在各个领域中仍显得零散。本研究考察了基于基础模型的代理系统在工业环境中的成熟度,它们的功能特征与传统代理系统的差异,以及仍然存在的限制。依据PRISMA 2020指引进行的系统文献调研筛选了2341篇出版物,并通过结构化编码方案综合了88篇文献。结果显示,被报道的系统主要处于原型和早期验证阶段(75.0%处于TRL 4-6),与部署相关的证据仍然稀缺(9.1%)。操作目标最频繁地定位于用户辅助、监控和过程优化,而传统生产控制目标如规划和调度则不够突出。与工业代理系统的既定基线相比,能力特征显示在人机交互(+37%)和处理不确定性(+35%)方面有显著提升,但在谈判能力上存在明显不足(-39%)。报道最多的限制涉及泛化能力不足、幻觉和输出不稳定、数据稀缺和推理延迟。本文还提出了一种基于基础模型的工业代理的工作定义,旨在桥接传统代理理论、自动化工程标准和基础模型范式。
cs.AI / 107 / 2605.02603
Counterfactual Reasoning in Automated Planning
自动规划中的反事实推理
Abstract
Automated planning traditionally assumes that all aspects of a planning task (initial state, goals, and available actions) are fully specified in advance, an approach well-suited to domains with fixed rules and deterministic execution. However, real-world planning often requires flexibility, allowing for deviations from the original task parameters in response to unforeseen circumstances or to improve outcomes. This paper surveys existing works on counterfactual reasoning in automated planning, categorizing them by what elements are changed, when the reasoning is triggered, and why and how these changes are made. We conclude by discussing key findings and outlining open research questions to guide future work in this area.
Chinese Translation
自动规划传统上假设规划任务的所有方面(初始状态、目标和可用动作)都已事先完全指定,这种方法非常适合规则固定且执行具有确定性的领域。然而,现实世界的规划往往需要灵活性,允许为应对不可预见的情况或改善结果而偏离原始任务参数。本文综述了自动规划中反事实推理的现有研究,并按被改变的元素、推理触发的时机以及这些改变的原因和方式对其进行分类。最后,我们讨论了主要发现,并概述了有待解决的开放性研究问题,以指导该领域的后续工作。
cs.AI / 108 / 2605.02617
SCGNN: Semantic Consistency enhanced Graph Neural Network Guided by Granular-ball Computing
SCGNN:基于粒球计算的语义一致性增强图神经网络
Abstract
Capturing semantic consistency among nodes is crucial for effective graph representation learning. Existing approaches typically rely on $k$-nearest neighbors ($k$NN) or other node-level full search algorithms (FSA) to mine semantic relationships via exhaustive pairwise similarity computation, which suffers from high computational complexity and rigid neighbor selection, limiting scalability and introducing noisy connections. In this paper, we propose the Semantic Consistency enhanced Graph Neural Network (SCGNN), a novel plug-and-play framework that leverages granular-ball computing (GBC) to efficiently capture semantic consistency in a scalable manner. Unlike node-level FSA methods, SCGNN models group-level semantic structure by adaptively partitioning nodes into granular balls, significantly reducing computational cost while improving robustness to noise. To effectively utilize the discovered group-level semantic consistency, we design a dual enhancement strategy. Specifically, (1) a structure enhancement module constructs an anchor-based graph structure, where each anchor is a virtual node representing the group-level semantics carried by a granular ball, thereby injecting group-level semantic information into the graph structure; and (2) a supervision enhancement module performs label consistency checking (LCC) by combining GBC predictions with model-generated pseudo-labels, thereby producing more reliable supervision signals. SCGNN is compatible with various GNN backbones. During the forward propagation of SCGNN, the vanilla graph and the augmented graph are jointly encoded, and their predictions are fused; during backpropagation, the supervision enhancement module provides enhanced supervision signals to guide parameter updates.
Chinese Translation
在有效的图表示学习中,捕捉节点之间的语义一致性至关重要。现有的方法通常依赖于 $k$-最近邻 ($k$NN) 或其他节点级全搜索算法 (FSA) 通过详尽的成对相似性计算来挖掘语义关系,这些方法受到高计算复杂性和僵化邻居选择的限制,影响了可扩展性并引入了噪声连接。本文提出了语义一致性增强图神经网络 (SCGNN),这是一种新颖的即插即用框架,利用粒球计算 (GBC) 以高效的方式捕捉可扩展的语义一致性。与节点级 FSA 方法不同,SCGNN 通过自适应地将节点划分为粒球,建模组级语义结构,显著降低计算成本,同时提高对噪声的鲁棒性。为了有效利用发现的组级语义一致性,我们设计了一种双重增强策略。具体而言,(1) 结构增强模块构建了基于锚点的图结构,其中每个锚点是一个虚拟节点,代表粒球所携带的组级语义,然后将组级语义信息注入图结构;(2) 监督增强模块通过将 GBC 预测与模型生成的伪标签结合进行标签一致性检查(LCC),从而产生更可靠的监督信号。SCGNN 兼容各种 GNN 主干。在 SCGNN 的前向传播过程中,原始图和增强图被联合编码,并融合其预测;在反向传播过程中,监督增强模块提供增强的监督信号以指导参数更新。
cs.AI / 109 / 2605.02640
Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
可信赖的人工智能面临不变性冲突,而因果关系是解决方案
Abstract
As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs in performance and multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise, and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Our paper discusses how causal assumptions may be applied explicitly or implicitly in modern large-scale systems. Finally, we outline open challenges and opportunities for using causality to build more trustworthy AI.
Chinese Translation
随着人工智能(AI),包括机器学习(ML)模型和基础模型(FMs),在高风险领域的日益应用,确保其可信性已成为一项核心挑战。然而,诸如公平性、鲁棒性、隐私和可解释性等核心可信赖AI目标,特别是在保持效用的同时,难以同时实现。本文立场论文认为,理解和权衡可信赖AI在绩效和多重目标之间的权衡,因果关系是不可或缺的。我们通过重新解释可信赖AI的权衡为不同数据生成过程变化下的不兼容不变性要求,来支撑我们的论点。接着,我们展示因果关系提供了一个统一框架,用于理解可信赖AI中权衡如何产生,以及如何通过选择性的不变性来缓和或解决这些权衡。这一视角适用于经典的ML模型和大规模的FMs。我们的论文讨论了如何在现代大规模系统中显性或隐性地应用因果假设。最后,我们概述了利用因果关系构建更可信赖AI的开放挑战与机遇。
cs.AI / 110 / 2605.02658
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
从进化博弈论的视角解读快捷学习
Abstract
Shortcut learning causes deep learning models to rely on non-essential features within the data. However, its formation in deep neural network training still lacks theoretical understanding. In this paper, we provide a formal definition of core and shortcut features and employ evolutionary game theory to analyze the origins of shortcut bias by modeling data samples as players and their corresponding neural tangent features as strategies, assuming the existence of core and shortcut subnetworks. We find that gradient descent (GD) and stochastic gradient descent (SGD) lead to two distinct stochastically stable states, each corresponding to a different strategy. The former primarily optimizes the shortcut subnetwork, while the latter primarily optimizes the core subnetwork. We investigate the influence of these strategies on shortcut bias through a continuous stochastic differential equation, and reveal the impact of data noise and optimization noise on the formation of shortcut bias. In brief, our work employs evolutionary game theory to characterize the dynamics of shortcut bias formation and provides a theoretical view on its mitigation.
Chinese Translation
快捷学习使深度学习模型依赖于数据中的非核心特征。然而,深度神经网络训练中快捷学习的形成仍然缺乏理论理解。本文给出了核心特征和快捷特征的形式化定义,并采用进化博弈论分析快捷偏差的来源:在假设存在核心子网络和快捷子网络的前提下,将数据样本建模为参与者,将其对应的神经切线特征建模为策略。我们发现梯度下降(GD)和随机梯度下降(SGD)导致两种不同的随机稳定状态,每种状态对应不同的策略。前者主要优化快捷子网络,而后者主要优化核心子网络。我们通过连续随机微分方程研究了这些策略对快捷偏差的影响,并揭示了数据噪声和优化噪声对快捷偏差形成的影响。简而言之,我们的工作应用进化博弈论刻画了快捷偏差形成的动态,并为缓解快捷偏差提供了理论视角。
cs.AI / 111 / 2605.02661
AcademiClaw: When Students Set Challenges for AI Agents
AcademiClaw:当学生为人工智能代理设定挑战
Yu, Junjie, Lu, Pengrui, Si, Weiye, Lu, Hongliang, Wu, Jiabao, Tao, Kaiwen, Wang, Kun, Yang, Lingyu, Zhang, Qiran, Guo, Xiuting, Wang, Xuanyu, Wang, Yang, Wang, Yanjie, Yang, Yi, Hu, Zijian, Yang, Ziyi, Zhou, Zonghan, Qiang, Binghao, Zhang, Borui, Li, Chenning, Zhang, Enchang, Chen, Feifan, Jian, Feng, Sun, Fengyin, Qiu, Hao, Zheng, Hao, Zhu, Haoran, Liu, Hongyu, Deng, Jianbin, Song, Jiaxin, Chi, Jiaying, Shi, Jiayou, Fang, Jie, Zhong, Jinghui, Zhou, Jingyu, Li, Jinze, Yi, Junfeng, Yu, Junyan, Xue, Junzhi, Song, Ni, Chen, Pengyi, Chen, Qi, Li, Quansheng, Tao, Rui, Gong, Shenghai, Lu, Shenhang, Shen, Tianqi, Zhu, Tianxiang, Kang, Tiehan, Li, Tingyu, Wu, Wendi, Shen, Xiao, Zhou, Xiao, Zhang, Xiaotao, Li, Xinrong, Yang, Xuankun, Zhang, Xun, Li, Yan, Lu, Ye, Wang, Yi, Zhou, Yibo, Zhang, Yichi, Sun, Yihao, Huang, Yijun, Zhu, Yixin, Wu, Yixuan, Sun, Yuchen, Wu, Yue, Sun, Yuheng, Li, Yukun, Tu, Yutian, Qin, Yuxuan, Wu, Yuzhuo, Li, Zeyu, Lou, Zhengyu, Ran, Zhenning, He, Zizhu, Liu, Pengfei
Abstract
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
Chinese Translation
截至目前,OpenClaw 生态系统中的基准测试仅评估了助手级别的任务,未能对 OpenClaw 的学术级别能力进行充分检验。我们推出了 AcademiClaw,这是一个双语基准,包含来自大学生真实学术工作流程——作业、研究项目、竞赛和个人项目中,80 个复杂且长期的任务,这些任务是他们认为当前的人工智能代理无法有效解决的。通过严格的专家评审,从 230 个学生提交的候选任务中精心挑选出最终任务集,涵盖 25 种以上的专业领域,从奥林匹克级别的数学和语言学问题到 GPU 密集型的强化学习和全栈系统调试,其中 16 个任务需要 CUDA GPU 执行。每个任务在一个隔离的 Docker 沙箱中执行,并通过结合六种互补技术的多维评分标准对任务完成情况进行评分,同时独立的五类安全审核提供额外的行为分析。对六个前沿模型的实验表明,即便是表现最好的模型也仅能达到 55% 的通过率。进一步分析揭示了任务领域之间的能力边界、模型之间的行为策略差异,以及令牌消耗与输出质量之间的脱节,提供了超出汇总指标所揭示的细粒度诊断信号。我们希望 AcademiClaw 及其开源数据和代码能够为 OpenClaw 社区提供有用的资源,推动在现实学术要求的广泛领域中开发更强大且多才多艺的代理。所有数据和代码可在 https://github.com/GAIR-NLP/AcademiClaw 获取。
cs.AI / 112 / 2605.02669
An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES
一种可解释的基于假设的方法研究药物引起的肝损伤(DILI)与HADES
Abstract
Drug-induced liver injury (DILI) remains a leading cause of late-stage clinical trial attrition. However, existing computational predictors primarily rely on binary classification, a framing that limits generalization and yields no mechanistic insight to guide translational decisions. We argue that DILI prediction is better posed as an explainable hypothesis-generation problem. To support this shift, we introduce the DILER Benchmark, a dataset that extends beyond binary labels by augmenting a curated set of molecules with mechanistic hepatotoxicity hypotheses derived from biomedical literature. We further present HADES, an agentic system designed to generate transparent and auditable reasoning traces. By combining molecular-level predictions, metabolite decomposition, structural understanding, and toxicity pathway evidence, HADES mechanistically assesses DILI risk. Evaluated on the DILER Benchmark, HADES outperforms existing models in binary classification, achieving a ROC-AUC of 0.68 on the Test Set and 0.59 on the challenging Post-2021 Set, compared with 0.63 and 0.50 for DILI-Predictor, respectively. More importantly, we establish a baseline for mechanistic hypothesis generation, where HADES achieves a Hypothesis Alignment Fuzzy Jaccard Index of 0.16. This result underscores the inherent complexity of the task while highlighting the need for advanced explainable approaches in predictive toxicology.
Chinese Translation
药物引起的肝损伤(DILI)仍然是晚期临床试验流失的主要原因。然而,现有的计算模型主要依赖于二元分类,这限制了其泛化能力,并未提供机制性见解以指导转化决策。我们认为,DILI预测更应该视为一个可解释的假设生成问题。为支持这一转变,我们引入了DILER基准数据集,该数据集不仅限于二元标签,而是通过增加一组经过筛选的分子及其来自生物医学文献的机制性肝毒性假设来扩展数据集。我们进一步提出了HADES,这是一种旨在生成透明和可审计推理轨迹的智能系统。通过结合分子层面的预测、代谢物分解、结构理解以及毒性通路证据,HADES在机制上评估DILI风险。在DILER基准测试中,HADES在二元分类上超越了现有模型,测试集的ROC-AUC达到了0.68,而在具有挑战性的2021年后数据集上达到了0.59,相比之下,DILI-Predictor分别为0.63和0.50。更重要的是,我们为机制假设生成建立了基线,HADES的假设对齐模糊Jaccard指数为0.16。该结果凸显了任务的内在复杂性,同时强调了在预测毒理学中对先进可解释方法的需求。
cs.AI / 113 / 2605.02672
The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge
2026年ACII二人对话(DaiKon)研讨会与挑战
Abstract
The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50 s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction.
Chinese Translation
2026年ACII二人对话(ACII-DaiKon)研讨会与挑战引入了一个评估基准,用于建模二人对话中的人际情感和社会动态。尽管对话情感建模取得了快速进展,但大多数基准仍然以说话者为中心,并且低估了伴侣之间的耦合、时间演变过程,包括方向性影响、对话时机协调和关系发展。为了解决这一问题,ACII-DaiKon提出了三个基于共享数据集的协调子挑战:(1)方向性人际影响预测,(2)轮次预测(下一个说话者和下次发言的时间),以及(3)全程互动中的关系轨迹预测。该挑战基于Hume-DaiKon数据集构建,该数据集包含945个二人对话(743.4小时的视听数据),并在自然条件下收集,涉及五种语言。该基准支持多模态建模、时间推理和跨上下文泛化,通过固定的训练/验证/测试分组、标准化指标和发布的基线系统。评估方法使用一致性相关系数(CCC)、皮尔逊相关系数、宏观F1和平均绝对误差(MAE),具体取决于子挑战。基线实验建立了初步参考性能,最佳测试结果为影响预测的0.40 CCC和0.50 Pearson,轮次预测的0.66 Macro-F1和1.50秒MAE,以及关系轨迹建模的0.68 CCC和0.70 Pearson。这些结果表明,尽管当前的方法能够捕捉粗略的二人模式,但建模方向依赖和长期人际动态仍然具有挑战性。该研讨会提供了一个共享平台,用于在数据有效性、评估协议和文化敏感建模方面进行严格比较和跨学科讨论,以促进二人互动的研究。
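Because both CCC and Pearson correlation are reported for the influence and rapport sub-challenges, a short sketch of how the two differ may help. It implements the standard Lin concordance correlation coefficient with population (divide-by-n) statistics; the challenge's exact evaluation script is not described in the abstract, so treating this as equivalent to it is an assumption.

```python
from math import sqrt


def _moments(y_true, y_pred):
    """Means, population variances and covariance of two equal-length series."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    vt = sum((a - mt) ** 2 for a in y_true) / n
    vp = sum((b - mp) ** 2 for b in y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred)) / n
    return mt, mp, vt, vp, cov


def ccc(y_true, y_pred) -> float:
    """Concordance correlation: penalises offset and scale errors as well."""
    mt, mp, vt, vp, cov = _moments(y_true, y_pred)
    denom = vt + vp + (mt - mp) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom


def pearson(y_true, y_pred) -> float:
    """Pearson correlation: invariant to offset and positive scaling."""
    _, _, vt, vp, cov = _moments(y_true, y_pred)
    if vt == 0 or vp == 0:
        raise ValueError("Pearson correlation is undefined for a constant series")
    return cov / sqrt(vt * vp)
```

A prediction that tracks the target perfectly but is shifted by a constant keeps a Pearson correlation of 1 while its CCC drops, which is one reason a benchmark might report both.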
cs.AI / 114 / 2605.02682
Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI
零信任代理人工智能中的混合检测与基于任务的访问控制
Abstract
Authorizing Large Language Model (LLM)-driven agents to dynamically invoke tools and access protected resources introduces significant security risks, and the risks grow dramatically as agents engage in multi-turn conversations and scale toward distributed collaboration. A compromised or malicious agentic application can tamper with tool calls, falsify results, or request permissions beyond the scope of the subject's intended tasks, which could go unnoticed with current delegated authorization flows given their lack of visibility into the original subject's intent. In light of this, we make the following contributions towards Continuous Agent Semantic Authorization (CASA). First, we propose a hybrid runtime enforcement model that combines deterministic and semantic controls enabled by a zero-trust interception layer. Five deterministic controls enforce structural and data-integrity guarantees over the message flow, while a semantic inspection layer evaluates whether tool call choices align with the intended tasks commissioned to the agent. Second, differently from prior Task-Based Access Control (TBAC) techniques that operate on single-turn interactions, we decompose the semantic layer into two stages: i) a task-extraction step that distills the subject's objectives from multi-turn conversations at the interception layer, and ii) a task-tool semantic matching step at the authorization server that evaluates whether the requested tools are appropriate for the extracted tasks. Third, we extend the ASTRA dataset that we introduced in a prior work, by generating novel conversation-tool datasets with multi-turn interactions containing relevant and irrelevant tool calls for a given task. Lastly, we provide the first experimental results for TBAC under multi-turn conversations.
Chinese Translation
授权大型语言模型(LLM)驱动的代理动态调用工具并访问受保护资源会带来显著的安全风险,并且当代理参与多轮对话并向分布式协作扩展时,风险会急剧增加。被攻陷或恶意的代理应用可能篡改工具调用、伪造结果,或请求超出主体预期任务范围的权限;由于当前的委托授权流程无法获知原始主体的意图,这些行为可能不被察觉。鉴于此,我们针对持续代理语义授权(CASA)做出以下贡献。首先,我们提出了一种混合运行时执行模型,该模型结合了由零信任拦截层支持的确定性控制和语义控制。五项确定性控制对消息流实施结构和数据完整性保障,而语义检查层则评估工具调用的选择是否与委托给代理的预期任务一致。其次,与以往只针对单轮交互的基于任务的访问控制(TBAC)技术不同,我们将语义层分解为两个阶段:i)任务提取步骤,在拦截层从多轮对话中提炼主体的目标;ii)任务与工具的语义匹配步骤,在授权服务器评估所请求的工具是否适合提取出的任务。第三,我们扩展了此前工作中提出的ASTRA数据集,生成了新的对话-工具数据集,其中的多轮交互包含针对给定任务的相关与不相关工具调用。最后,我们给出了多轮对话下TBAC的首个实验结果。
cs.AI / 115 / 2605.02709
An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance
医疗保健代理技能的实证研究:实践、差距与治理
Abstract
Healthcare automation is shaped by local procedures and organizational constraints, so agent capabilities rarely transfer unchanged across settings. Agent skills, self-contained directories that package reusable procedures for AI agents, are emerging as a procedural layer for adapting healthcare agents across diverse healthcare settings. We present the first empirical analysis of healthcare agent skills, drawing on 557 healthcare-related skills filtered from 58,159 public skills on ClawHub and annotated along ten dimensions covering function, deployment context, autonomy, and safety. We find that public healthcare skills emphasize patient-facing workflow automation and monitoring rather than the diagnostic and treatment-oriented tasks foregrounded in healthcare-agent research; coverage of the healthcare lifecycle and specialized clinical inputs remains uneven; and general technical risk does not reliably capture clinical risk. These findings position healthcare skills as a procedural layer not yet addressed by current benchmarks and risk frameworks.
Chinese Translation
医疗保健自动化受当地程序和组织约束的影响,因此代理能力在不同环境中的转移往往难以保持不变。代理技能是一种自包含的目录,打包了可重复使用的程序,正在成为适应多样化医疗保健环境的医疗保健代理的程序层。我们首次对医疗保健代理技能进行了实证分析,分析基于从 ClawHub 中筛选出的 557 个与医疗保健相关的技能,这些技能来自 58,159 项公共技能,并在功能、部署背景、自主性和安全性等十个维度进行注释。我们发现,公共医疗保健技能强调患者面向的工作流程自动化和监控,而非医疗保健代理研究中突出显示的诊断和治疗任务;医疗保健生命周期及专业临床输入的覆盖仍然不均衡;而一般技术风险并不能可靠地捕捉临床风险。这些发现将医疗保健技能定性为一个尚未被当前基准和风险框架所解决的程序层。
cs.AI / 116 / 2605.02728
ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling
ORPilot:面向生产的优化建模代理式大语言模型工具
Abstract
This paper presents ORPilot, an open-source agentic AI system that translates real-world business problems into solver-ready optimization models. Unlike academic LLM-for-OR tools that assume clean problem specifications with preformatted inline data, ORPilot is designed for production conditions: ambiguous descriptions, large-scale raw operational data, and the need for portability across solver backends. The system introduces four novel components: (1) a conversational interview agent to elicit complete problem specifications, (2) a data collection agent that retrieves data independently of prompts, (3) a parameter computation agent to bridge raw tabular data and model-ready parameters, and (4) a solver-agnostic Intermediate Representation (IR) for deterministic, zero-LLM-call recompilation to Gurobi, CPLEX, PuLP, Pyomo, or OR-Tools solvers. Additionally, self-correcting retry loops utilize solver tracebacks for targeted repairs. ORPilot represents the first attempt to target production-level business problems rather than textbook operations research (OR) cases. Evaluation on real-world problems demonstrates promising results. On the traditional academic benchmarks IndustryOR, NL4OPT, and NLP4LP, ORPilot outperformed state-of-the-art tools in accuracy on IndustryOR and delivered comparable performance on NL4OPT and NLP4LP.
Chinese Translation
本文介绍了ORPilot,这是一个开源的代理式人工智能系统,能够将现实世界的商业问题转化为可供求解器使用的优化模型。与假设问题规范清晰且包含预格式化内联数据的学术大语言模型优化工具不同,ORPilot旨在应对生产环境条件:模糊描述、大规模原始操作数据以及跨求解器后端的可移植性需求。该系统引入了四个新颖的组件:(1) 一个对话式访谈代理,用于引导完整的问题规范;(2) 一个独立于提示的数据收集代理;(3) 一个参数计算代理,用于连接原始表格数据与模型准备参数;以及(4) 一个与求解器无关的中间表示(Intermediate Representation, IR),用于确定性、零LLM调用的重新编译,以适配Gurobi、CPLEX、PuLP、Pyomo或OR-Tools求解器。此外,自我修正的重试循环利用求解器回溯进行针对性修复。ORPilot代表了首次针对生产级商业问题的尝试,而非教科书式的运筹学案例。在真实问题上的评估展示了可喜的结果。当与传统学术基准进行比较时:IndustryOR、NL4OPT和NLP4LP,ORPilot在IndustryOR基准的准确性上超越了最先进的工具,并在NL4OPT和NLP4LP上表现出可比的性能。
cs.AI / 117 / 2605.02734
Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging
一致性层次多标签学习以延迟医疗影像决策
Abstract
Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomies. In this setting, deferral is a delegation action rather than a label assignment, so treating it as an independent per-label decision can produce deferral incoherence, including taxonomic contradictions, delegation violations, and deferrals of labels already implied by the model's own assertions. We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule, and show that even nodewise Bayes L2D can be action-incoherent. We then propose two remedies: exact coherent projection, a dynamic-programming decoder over the coherent action set, and Taxonomic Belief Propagation (TBP) with Recursive Policy Optimisation (RPO), a contract-aware joint action model trained through the same recursion used at inference. Across real-reader and controlled-expert medical-imaging benchmarks, naive binary-relevance L2D exhibits non-trivial incoherence. Projection removes it exactly, and fast TBP+RPO drives incoherence near zero while retaining strong utility.
Chinese Translation
学习延迟(L2D)使模型能够自主预测或向专家延迟,但先前的研究主要假设平坦标签空间。我们研究了第一个具有层次多标签决策的L2D设置,其动机源自医疗影像工作流程,在这些流程中,发现结果按临床分类法进行组织。在这一设置中,延迟是一种委托行为,而不是标签分配,因此将其视为独立的逐标签决策可能会导致延迟不一致,包括分类矛盾、委托违规和延迟已经由模型自身断言暗示的标签。我们在选择性排除的交接合同下形式化了一致性层次延迟,描述了贝叶斯最优一致性延迟规则,并展示即使是逐节点贝叶斯L2D也可能出现行动不一致。随后,我们提出了两个补救措施:精确一致性投影,即在一致性行动集上的动态规划解码器,以及以递归政策优化(RPO)进行训练的分类信念传播(TBP),这是一种考虑合同的联合行动模型,使用在推理时相同的递归方法。在实际读者和控制专家的医疗影像基准测试中,简单的二元相关L2D表现出非平凡的不一致性。投影则能完全消除这种不一致,而快速的TBP+RPO则使不一致性接近于零,同时保留了强大的实用性。
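The notion of deferral incoherence can be made concrete with a toy check over a label tree. This is one reading of the abstract, not the paper's Selective-Exclusion contract: it assumes a positive label implies all of its ancestors and a negative label excludes all of its descendants, and it flags only two of the listed failure types. All names (`DEFER`, `incoherences`, the example labels) are hypothetical.

```python
DEFER = "defer"


def ancestors(node, parent):
    """Yield the ancestors of node, nearest first; parent maps child -> parent."""
    while node in parent:
        node = parent[node]
        yield node


def incoherences(actions, parent):
    """Return a list of (kind, node, related_node) issues.

    actions maps each label to 0 (absent), 1 (present) or DEFER.
    """
    issues = []
    for node, act in actions.items():
        for anc in ancestors(node, parent):
            anc_act = actions.get(anc)
            if act == 1 and anc_act == 0:
                # Child asserted present although an ancestor is asserted absent.
                issues.append(("contradiction", node, anc))
            if act == 1 and anc_act == DEFER:
                # Ancestor deferred although the child already implies it.
                issues.append(("redundant_deferral", anc, node))
            if act == DEFER and anc_act == 0:
                # Child deferred although the ancestor already excludes it.
                issues.append(("redundant_deferral", node, anc))
    return issues
```

Treating each label independently, as in binary-relevance L2D, gives no guarantee that this list is empty. A coherent decoder would instead restrict itself to joint actions for which the list is empty.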
cs.AI / 118 / 2605.02738
AI and Open-data Driven Scalable Solar Power Profiling
基于人工智能和开放数据的可扩展太阳能光伏特征分析
Abstract
Solar photovoltaic (PV) deployment is expanding rapidly, yet detailed, up-to-date information on the spatial distribution and capacity of rooftop PV remains limited. This paper presents an open, scalable framework for detecting solar panels from open data and generating city-level solar power profiles. We leverage foundation vision AI models to detect solar panel geometries from open-source satellite imagery. This avoids manual data labeling and case-specific model training while maintaining robustness across heterogeneous imagery. Detected solar panels are converted into georeferenced polygons, yielding spatially explicit and incrementally extensible inventories. By integrating open weather data, we translate panel footprints into regional solar power profiles. The framework reduces dependency on proprietary imagery, manual labeling, and closed-source models, and offers a transparent and scalable approach for solar planning and analysis. We release the data and an API resulting from this work. For any user-specified building location, our API retrieves aerial imagery, detects rooftop solar panels, and returns georeferenced polygons. This empowers researchers and developers to scan user-defined areas to build solar panel maps and associated solar production profiles, thus facilitating advanced analyses such as distributed solar production integration, local power flow optimization, energy tariff design, and infrastructure planning.
Chinese Translation
太阳能光伏(PV)部署正在迅速扩展,但关于屋顶光伏的空间分布和容量的详细、最新信息仍然有限。本文提出了一个开放的、可扩展的框架,用于从开放数据中检测太阳能电池板并生成城市级太阳能功率特征。我们利用基础视觉人工智能模型从开源卫星图像中检测太阳能电池板的几何形状。这避免了手动数据标注和特定案例模型训练,同时在异构图像中保持了鲁棒性。检测到的太阳能电池板被转换为地理参考多边形,形成空间明确且可逐步扩展的目录。通过整合开放天气数据,我们将电池板的足迹转换为区域太阳能功率特征。该框架减少了对专有图像、手动标注和闭源模型的依赖,并为太阳能规划和分析提供了一种透明且可扩展的方法。我们发布了这项工作的数据和API。对于任何用户指定的建筑位置,我们的API检索航空图像,检测屋顶太阳能电池板,并返回地理参考多边形。这使得研究人员和开发者可以扫描用户定义的区域以构建太阳能电池板地图和相关的太阳能生产特征,从而促进分布式太阳能生产集成、本地电力流优化、能源电价设计和基础设施规划等高级分析。
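The step from panel footprints to a power profile can be sketched as follows. This is a back-of-the-envelope illustration, not the paper's pipeline or its API: it assumes polygons are already in a projected coordinate system with metre units, and the module efficiency and performance ratio are placeholder values chosen only for the example.

```python
def polygon_area_m2(vertices) -> float:
    """Area of a simple polygon via the shoelace formula (metre coordinates)."""
    pts = list(vertices)
    if len(pts) < 3:
        raise ValueError("a polygon needs at least three vertices")
    twice_area = sum(
        x0 * y1 - x1 * y0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1])
    )
    return abs(twice_area) / 2.0


def power_profile_kw(polygons, irradiance_w_m2,
                     efficiency=0.18, performance_ratio=0.8):
    """Estimate power (kW) per time step from panel footprints.

    irradiance_w_m2 is a sequence of plane-of-array irradiance values;
    efficiency and performance_ratio are assumed placeholder constants.
    """
    total_area = sum(polygon_area_m2(p) for p in polygons)
    return [
        total_area * g * efficiency * performance_ratio / 1000.0
        for g in irradiance_w_m2
    ]
```

A real profile would also need panel tilt, orientation, shading and temperature effects, none of which are recoverable from a footprint alone, so the output here is best read as an upper-level approximation.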
cs.AI / 119 / 2605.02740
Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
基础模型以解锁来自全国医疗保险索赔的真实世界证据
Ma, Fan, Liu, Yuntian, Lan, Xiang, Zhou, Weipeng, Ni, Jun, Giuffrè, Mauro, Qian, Lingfei, Peng, Xueqing, Zhou, Yujia, Weng, Ruey-Ling, He, Huan, Li, Lu, Chen, Qingyu, Loza, Andrew, Rasmy, Laila, Zhi, Degui, Lu, Yuan, Zeng, Chenjie, Denny, Joshua C, Schwamm, Lee, Meeker, Daniella, Ohno-Machado, Lucila, Chen, Yong, Xu, Hua
Abstract
Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.
Chinese Translation
基于大规模真实世界数据(RWD)得出的证据日益成为监管评估和医疗决策的重要依据。管理索赔提供了关于医疗利用、支出以及诊断、程序和药物的详细编码的群体规模、纵向记录,但其作为医疗基础模型的潜在价值尚未被充分探索。在此,我们提出了ReClaim,这是一种从头开始训练的生成式变换器,基于从2008年到2022年覆盖超过2亿名参保者的MarketScan索赔数据中的438亿医疗事件进行训练。ReClaim建模了诊断、程序、药物和支出的纵向轨迹,规模达到1.4亿、7亿和17亿参数。在超过1000个疾病发作预测任务中,ReClaim的平均AUC达到75.6%,显著优于特定疾病的LightGBM(66.3%)和基于变换器的Delphi模型(69.4%),尤其是在罕见疾病方面取得了最大的提升。这些优势在回顾性和前瞻性的评估中以及在两个独立数据集的外部验证中得以保持。随着规模的扩大,性能单调改善,后训练比仅用前训练增加了13.8个百分点。除了疾病预测,ReClaim还捕捉了财务结果,改善了真实世界证据(RWE)的分析:在医疗支出预测中,相较于LightGBM,它将解释方差从0.28提高至0.37,在目标试验模拟中,相较于Delphi,系统偏差平均减少了72%。这些结果共同证明了行政索赔作为医疗基础模型可扩展的基础,并表明学习到的表示在时间段和数据源之间具有通用性,支持疾病监测、支出预测和真实世界证据的生成。
cs.AI / 120 / 2605.02743
Triple Spectral Fusion for Sensor-based Human Activity Recognition
基于传感器的人体活动识别的三重光谱融合
Abstract
The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU's sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp-based graph aggregation and the correlation of long-term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi-sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: https://github.com/crocodilegogogo/TSF-TPAMI2026.
Chinese Translation
基于传感器的人体活动识别(HAR)领域主要利用惯性测量单元(IMUs)的姿态、运动和上下文数据来识别日常活动。尽管基于学习的方法取得了进展,但由于融合异构传感器数据和建立长期上下文关联的复杂性,基于时间的信息融合仍然具有挑战性。本文提出了一种新颖的三重光谱融合框架,专门针对HAR进行优化。首先,我们开发了一种自适应互补滤波技术用于噪声抑制,并将每个IMU的传感器组织为姿态和运动模态节点。鉴于IMU节点形成了一个动态的异构图,我们接着在图傅里叶域内应用自适应滤波,以合并同质和异质节点信息。此外,实施了一种自适应小波频率选择方法,以抑制上下文冗余并缩短特征长度。这种方法增强了基于时间戳的图聚合和长期上下文的相关性。我们的框架在傅里叶、图傅里叶和小波领域使用自适应滤波,能够有效实现多传感器融合和上下文关联。在十个基准数据集上的广泛实验表明了我们框架的优越性能。项目页面:https://github.com/crocodilegogogo/TSF-TPAMI2026。
cs.AI / 121 / 2605.02751
Mitigating Misalignment Contagion by Steering with Implicit Traits
通过隐性特征引导来减缓错位传播
Abstract
Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage in multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LM's initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black-box models.
Chinese Translation
语言模型(LMs)在高风险的多智能体环境中越来越被广泛使用,在这些环境中,遵循指令和保持价值对齐至关重要。大多数对齐研究集中在单个语言模型与单个用户之间的互动上,未能解决在多轮交互中多个语言模型之间错位行为传播的风险。我们发现了这种现象的证据,称之为错位传播,表现为多个语言模型在进行多轮对话社会困境游戏时的行为。具体来说,我们发现语言模型在游戏后变得更加反社会,当其他玩家被引导采取恶意行为时,这种影响会加剧。我们探讨了不同的引导技术以减缓这种错位传播,并发现增强语言模型的系统提示不足并且往往是有害的。相反,我们提出了一种使用隐性特征进行引导的方法:该技术会间歇性地注入系统提示,以增强语言模型的初始特征,这比单纯重复系统提示更有效地保持模型与其初始亲社会行为的一致性。重要的是,这种方法不需要访问模型参数或内部模型状态,因此适用于越来越多的复杂多智能体工作流程设计场景中,这些场景涉及黑箱模型的使用。
cs.AI / 122 / 2605.02765
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
U-Define:为基于大语言模型(LLM)规划设计用户工作流程以应对硬约束和软约束
Abstract
LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users. We investigate how interaction workflows can better support users in applying constraints to guide LLM-generated plans, examining whether abstracting strictness into high-level types (i.e., hard and soft) paired with distinct verification mechanisms helps users more reliably express and align intent. We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. U-Define verifies these types through complementary methods: formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. Through a technical evaluation and user studies with general and expert participants, we find that user-defined constraint types improve perceived usefulness, performance, and satisfaction while maintaining usability. These findings provide insights for designing flexible yet reliable constraint-based workflows.
Chinese Translation
如今,LLM(大语言模型)在最终用户任务规划中的应用越来越普遍,但其黑箱特性限制了用户确保可靠性和控制的能力。尽管近期的系统已经结合了验证技术,但用户如何有效地应用这种僵化的约束来表达意图或适应现实世界的变动仍不明朗。例如,先前的研究发现仅使用硬约束的方式过于僵化,而数值灵活性权重使用户感到困惑。我们探讨了如何通过交互工作流程更好地支持用户应用约束来指导LLM生成的计划,研究将严格性抽象为高层次类型(即硬约束和软约束)并配以不同验证机制,是否能帮助用户更可靠地表达和对齐其意图。我们介绍了U-Define,一个允许用户用自然语言定义约束并将其分类为不可违反的硬规则或允许灵活性的软偏好的系统。U-Define通过互补的方法验证这些类型:使用形式化模型检查来验证硬约束,以及使用LLM作为评判进行软约束的评估。通过对一般用户和专家参与者的技术评估和用户研究,我们发现用户自定义的约束类型在提高感知有用性、绩效和满意度的同时,保持了可用性。这些发现为设计灵活且可靠的基于约束的工作流程提供了宝贵的见解。
cs.AI / 123 / 2605.02780
Fine-Grained Graph Generation through Latent Mixture Scheduling
通过潜在混合调度实现细粒度图生成
Abstract
Structure-aware graph generation aims to generate graphs that satisfy given topological properties. It has applications in domains such as drug discovery, social network modeling, and knowledge graph construction. Unlike existing methods that only provide coarse control over graph properties, we introduce a novel conditional variational autoencoder for fine-grained structural control in graph generation. The approach refines the decoder's latent space by dynamically aligning graph- and property-driven representations to improve both graph fidelity and control satisfaction. Specifically, the approach implements a mixture scheduler that progressively integrates graph and control priors. Experiments on five real-world datasets show the efficacy of the proposed model compared to recent baselines, achieving high generation quality while maintaining high controllability.
Chinese Translation
面向结构的图生成旨在生成满足给定拓扑性质的图。这在药物发现、社交网络建模和知识图谱构建等领域具有应用。与现有方法仅提供对图属性的粗略控制不同,我们提出了一种新颖的条件变分自编码器,以实现图生成中的细粒度结构控制。该方法通过动态对齐图和属性驱动的表示来优化解码器的潜在空间,从而提高图的保真度和控制满意度。具体而言,该方法实现了一种混合调度器,逐步集成图和控制先验。在五个真实世界数据集上的实验表明,与近期基线模型相比,所提模型的有效性显著,既实现了高质量的生成,又保持了较高的可控性。
cs.AI / 124 / 2605.02782
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
当音频语言模型未能利用多模态上下文进行肌肉发音障碍言语识别时
Abstract
Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
Chinese Translation
自动语音识别(ASR)系统在肌肉发音障碍和其他非典型言语上仍显得脆弱。近期的音频语言模型提出了通过在推断时考虑额外临床上下文来提高性能的可能性,但尚不清楚这些模型是否能够利用此类信息。我们引入了基于言语可及性项目(Speech Accessibility Project, SAP)数据集的基准,以测试诊断标签、临床医生评分的言语评价以及逐渐丰富的临床描述是否能提高肌肉发音障碍言语的转录准确性。在对九个模型的匹配比较中,我们发现当前模型并未有意义地利用这些上下文:基于诊断信息和临床详细的提示所带来的改善微不足道,且往往会使词错率(Word Error Rate, WER)恶化。我们将提示分析与基于上下文的微调相结合,显示使用不同临床提示格式的LoRA适应在上下文不可用时仍能保持性能,同时实现了0.066的WER,相较于冻结基线有52%的相对降低。亚组分析揭示出对于唐氏综合症和轻度发音障碍的说话者存在显著提升。这些结果澄清了当前模型的不足之处,并提供了一个测量朝向更具包容性的ASR系统进展的试验平台。
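As a reading aid for the numbers in this abstract, the following is a minimal sketch of the standard word error rate (word-level edit distance divided by reference length) and of a relative reduction. It is not the authors' evaluation code; the function names are illustrative, and the baseline value in the last comment is only inferred by assuming the 52% figure is computed as (baseline - new) / baseline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting i reference words to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Assumption: if 0.066 is a 52% relative reduction, the implied baseline is 0.066 / 0.48.
implied_baseline = 0.066 / (1 - 0.52)
```

Under that assumption the frozen baseline would sit near a WER of 0.14; the abstract itself does not state the baseline value.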
cs.AI / 125 / 2605.02810
AIs and Humans with Agency
具有能动性的人工智能与人类
Abstract
This paper compares agency in humans with potential agency in AI programs. Human agency takes many years to develop, as the frontal lobe is activated. Early attempts to endow LLMs with agency have met serious obstacles. Progress requires a new architecture where actions and plans are formulated jointly with the human actors in each real-world setting.
Chinese Translation
本文比较了人类的能动性与人工智能程序中潜在的能动性。人类的能动性需要多年的发展,因为前额叶的激活。早期赋予大规模语言模型(LLMs)能动性的尝试遭遇了严重障碍。要取得进展,需要一种新的架构,在每个现实世界环境中与人类参与者共同制定行动和计划。
cs.AI / 126 / 2605.02819
SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering
SCPRM:一种面向模式的累积过程奖励模型用于知识图谱问答
Abstract
Large language models excel at complex reasoning, yet evaluating their intermediate steps remains challenging. Although process reward models provide step-wise supervision, they often suffer from a risk compensation effect, where incorrect steps are offset by later correct ones, assigning high rewards to flawed reasoning paths. This issue is further exacerbated in knowledge graph (KG) reasoning, as there may exist multiple paths between the start and end entities in a KG, and a single risky step can make the reasoning path flawed. These limitations are problematic in risk-sensitive tasks such as medical and legal KG reasoning. To address these issues, we propose a Schema-aware Cumulative Process Reward Model (SCPRM) that evaluates reasoning paths by conditioning on the reasoning prefix and incorporating the schema distance between the current reasoning step and the implicit target parsed from the query, which provides cumulative and future rewards to guide path exploration. We further integrate SCPRM into Monte Carlo Tree Search (MCTS) as SCPRM-MCTS to conduct multi-hop reasoning on KGs for question answering (QA) tasks. Across medical and legal KGQA and CWQ, SCPRM-MCTS improves the performance of Hits@k by an average of 1.18% over strong baselines, demonstrating more accurate and risk-sensitive reasoning evaluation.
Chinese Translation
大型语言模型在复杂推理方面表现出色,但评估其中间步骤仍然具有挑战性。尽管过程奖励模型提供逐步监督,但它们常常受到风险补偿效应的影响,即错误步骤被后续正确步骤抵消,从而对错误推理路径赋予高奖励。这一问题在知识图谱(KG)推理中尤为严重,因为在KG中起始实体和终止实体之间可能存在多条路径,而风险步骤可能使得推理路径出现缺陷。这些限制在医疗和法律 KG推理等风险敏感任务中尤为棘手。为了解决这些问题,我们提出了一种面向模式的累积过程奖励模型(SCPRM),该模型通过条件化推理前缀并结合当前推理步骤与从查询中解析出的隐性目标之间的模式距离来评估推理路径,从而提供累积和未来奖励以引导路径探索。我们进一步将 SCPRM 集成到蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)中,形成 SCPRM-MCTS,以执行知识图谱上的多跳推理以进行问答任务。在医疗和法律 KGQA 及 CWQ 中,SCPRM-MCTS 在强基准测试上平均提高了 1.18% 的 Hits@k 表现,显示出更准确和风险敏感的推理评估。
cs.AI / 127 / 2605.02827
First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint
基于统计视角的概率值估计的一阶效率
Abstract
Probabilistic values, including Shapley values and semivalues, provide a model-agnostic framework to attribute the behavior of a black-box model to data points or features, with a wide range of applications including explainable artificial intelligence and data valuation. However, their exact computation requires utility evaluations over exponentially many coalitions, making Monte Carlo approximation essential in modern machine learning applications. Existing estimators are often developed through different identification strategies, including weighted averages, self-normalized weighting, regression adjustment, and weighted least squares. Our key observation is that these seemingly distinct constructions share a common first-order error structure, in which the leading term is an augmented inverse-probability weighted influence term determined by the sampling law and a working surrogate function. This first-order representation yields an explicit expression for the leading mean squared error (MSE), which characterizes how the sampling law and the surrogate jointly determine statistical efficiency. Guided by this criterion, we propose an Efficiency-Aware Surrogate-adjusted Estimator (EASE) that directly chooses the sampling law and surrogate to minimize the first-order MSE. We demonstrate that EASE consistently outperforms state-of-the-art estimators for various probabilistic values.
Chinese Translation
概率值,包括夏普利值(Shapley values)和半值(semivalues),提供了一种与模型无关的框架,以将黑箱模型的行为归因于数据点或特征,这在可解释人工智能和数据评价等广泛应用中具有重要意义。然而,它们的精确计算需要对指数众多的联盟进行效用评估,这使得蒙特卡洛近似在现代机器学习应用中变得至关重要。现有估计量通常是通过不同的识别策略开发的,包括加权平均、自标准化加权、回归调整和加权最小二乘法。我们的关键观察是,这些看似独立的构造共享一个共同的一阶误差结构,其中主导项是一个增强的反概率加权影响项,由抽样法和一个工作代理函数共同决定。这种一阶表示产生了主导均方误差(MSE)的显式表达式,刻画了抽样法和代理如何共同决定统计效率。在此标准的指导下,我们提出了一种效率感知的代理调整估计量(Efficiency-Aware Surrogate-adjusted Estimator, EASE),直接选择抽样法和代理以最小化一阶均方误差。我们证明EASE在各种概率值的估计中始终优于最先进的估计量。
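To make concrete why Monte Carlo approximation is needed here, the sketch below shows the classic permutation-sampling estimator of Shapley values. It is a generic baseline of the kind the abstract groups under existing estimators, not EASE itself; the function name, the `utility` callable, and the toy game are all assumptions for illustration.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, Sequence


def shapley_permutation_estimate(
    players: Sequence[Hashable],
    utility: Callable[[FrozenSet], float],
    num_permutations: int,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Average each player's marginal contribution over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = utility(frozenset(coalition))  # utility of the empty coalition
        for player in order:
            coalition.add(player)
            current = utility(frozenset(coalition))
            totals[player] += current - previous  # marginal contribution
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Toy additive game (hypothetical): each player's Shapley value equals its weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
estimates = shapley_permutation_estimate(
    list(weights), lambda s: sum(weights[p] for p in s), num_permutations=50
)
```

Each sampled permutation costs one utility evaluation per player, which is what replaces the exponential sum over coalitions. Within every permutation the marginal contributions telescope, so the estimates always sum to the utility of the full set minus that of the empty set.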
cs.AI / 128 / 2605.02829
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
是压缩再适应吗?不,与任务感知的子空间联合一起进行
Abstract
Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation. From a small calibration set, JACTUS estimates input and pre-activation gradient covariances, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix. This explicitly mitigates the potential misalignment between the compressed subspace and downstream objectives by coupling the directions preserved for compression with those required for adaptation, yielding a deployable low-rank model that avoids retaining full frozen weights while enabling fast and robust tuning. On vision, JACTUS attains an average 89.2% accuracy on ViT-Base across eight datasets at 80% retained parameters, surpassing strong 100% PEFT baselines (e.g., DoRA 87.9%). On language, JACTUS achieves an 80.9% average on Llama2-7B commonsense QA at the same 80% retained-parameter budget, outperforming 100% PEFT (e.g., DoRA 79.7%) and exceeding prior compress-then-finetune pipelines under the same retained-parameter budget. We will release code.
Chinese Translation
将大型预训练模型适应于多样化任务已成为常规,但典型的参数高效微调(PEFT)和低秩压缩的两种主导策略通常是顺序组合的。这种解耦的做法先进行压缩,然后微调适配器,可能会导致压缩子空间与下游目标不一致,从而浪费全局参数预算。为克服这一局限性,我们提出了 JACTUS(带有任务感知的子空间联合的联合适应和压缩),这是一个统一压缩和适应的单一框架。JACTUS 从一个小的校准集出发,估计输入和激活前梯度的协方差,形成它们与预训练权重子空间的正交联合,在此联合内执行投影低秩近似,通过每个参数的边际收益全局分配秩,并仅训练一个紧凑的核心矩阵。这通过将用于压缩的保留方向与适应所需的方向相结合,明确缓解了压缩子空间与下游目标之间潜在的不一致,从而产生一个可部署的低秩模型,避免保留全部冻结权重,同时实现快速和稳健的调优。在视觉任务中,JACTUS 在八个数据集上使用 ViT-Base 获得了平均 89.2% 的准确率,保留参数为 80%,超越了强大的 100% PEFT 基线(例如,DoRA 87.9%)。在语言任务中,JACTUS 在 Llama2-7B 常识问答上以同样的 80% 保留参数预算达到了 80.9% 的平均水平,超越了 100% PEFT(例如,DoRA 79.7%),并在相同的保留参数预算下超越了之前的压缩-再微调管道。我们将发布代码。
cs.AI / 129 / 2605.02832
HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems
HAAS:一种具备政策意识的框架用于人类与人工智能系统之间的自适应任务分配
Abstract
Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution -- balancing efficiency, oversight, and human capability -- remains an open problem. This paper presents Human-AI Adaptive Symbiosis (HAAS), an implemented framework for adaptive task allocation in software engineering and manufacturing. HAAS combines two coupled components: a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback. Task-agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum -- from human-only to fully autonomous -- embedded in a reproducible benchmark spanning both domains. Three empirical findings emerge. First, governance is not a binary switch but a tunable design variable: tighter constraints predictably convert autonomous AI assignments into supervised collaborations, with domain-specific costs and benefits. Second, in manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously -- a workload-buffering effect that contradicts the usual framing of governance as pure overhead. Third, no single governance setting dominates across all contexts; moderate governance becomes increasingly competitive as the learner accumulates experience within the governed action space. Together, these findings position HAAS as a pre-deployment workbench for comparing and inspecting human-AI allocation policies before organisational commitment.
Chinese Translation
如何在组织设计中分配人类与人工智能系统之间的工作是一个核心挑战。大多数方法将其视为二元选择,但实际操作情境更加复杂:人类与人工智能根据上下文、疲劳程度和相关风险,通常会共享任务或承担互补角色。如何管理这一分配——在效率、监督和人类能力之间达到平衡——仍然是一个未解决的问题。本文介绍了人类-人工智能自适应共生(Human-AI Adaptive Symbiosis,HAAS),这是一个在软件工程和制造领域中实施的自适应任务分配框架。HAAS结合了两个耦合组件:一个基于规则的专家系统在任何学习发生之前施加治理约束,以及一个基于上下文的多臂老虎机学习者,从结果反馈中选择可行的协作模式。任务-代理的契合通过五个可审计的认知维度和一个涵盖人类独占到完全自主的五级自主性谱系来表示,并嵌入一个可重复的基准,涵盖这两个领域。研究得出三项实证发现。首先,治理不是一个二元开关,而是一个可调的设计变量:更紧的约束可以可预测地将自主人工智能分配转变为监督协作,同时伴随特定领域的成本和收益。其次,在制造领域,强有力的治理可以同时提高运营绩效并减少疲劳——一种负荷缓冲效应,反驳了将治理视为纯粹的开销的常规观点。第三,没有单一的治理设置在所有情境中占据主导地位;随着学习者在受治理的行动空间内积累经验,中等强度的治理变得越来越具竞争力。综合这些发现,HAAS为在组织承诺之前比较和检视人类-人工智能分配政策提供了一个预部署的工作台。
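The two-component design described in this abstract (governance rules applied first, a bandit learner choosing only among the modes that remain) can be sketched as follows. This is a toy stand-in under stated assumptions, not the HAAS implementation: the three middle mode names, the single `high_stakes` rule, and the epsilon-greedy learner with a one-bit context are all hypothetical.

```python
import random

# The abstract names only the endpoints; the three middle mode names are hypothetical.
MODES = ["human_only", "human_led", "shared", "ai_led", "fully_autonomous"]


def feasible_modes(context: dict) -> list:
    """Hypothetical governance rule: high-stakes tasks may not be AI-led or autonomous."""
    if context.get("high_stakes"):
        return [m for m in MODES if m not in ("ai_led", "fully_autonomous")]
    return list(MODES)


class GovernedBanditAllocator:
    """Epsilon-greedy learner that only ever sees rule-feasible modes."""

    def __init__(self, epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {}  # (context key, mode) -> number of observations
        self.means = {}   # (context key, mode) -> running mean reward

    @staticmethod
    def _context_key(context: dict) -> bool:
        # Simplification: the context is reduced to a single binary feature.
        return bool(context.get("high_stakes"))

    def select(self, context: dict) -> str:
        allowed = feasible_modes(context)  # governance is applied before learning
        if self.rng.random() < self.epsilon:
            return self.rng.choice(allowed)  # explore within the governed space
        key = self._context_key(context)
        return max(allowed, key=lambda m: self.means.get((key, m), 0.0))

    def update(self, context: dict, mode: str, reward: float) -> None:
        key = (self._context_key(context), mode)
        n = self.counts.get(key, 0) + 1
        mean = self.means.get(key, 0.0)
        self.counts[key] = n
        self.means[key] = mean + (reward - mean) / n  # incremental mean update
```

Because the rule filter runs inside `select`, tightening a rule shrinks the action space the learner can explore, which is one way to read the abstract's point that governance acts as a tunable design variable.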
cs.AI / 130 / 2605.02860
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
站在巨人的肩膀上:跨语言代码克隆检测的稳定化知识蒸馏
Abstract
Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA adapters. We further introduce response stabilization methods, including forced conclusion prompting, a binary classification head, and a contrastive classification head, and evaluate model behavior using both predictive metrics and response rate. Experiments on Python-Java, Rust-Java, Rust-Python, and Rust-Ruby show that knowledge distillation consistently improves the reliability of compact models and often improves predictive performance, especially under distribution shift. In addition, classification-head variants substantially reduce inference time compared to generation-based inference. Overall, our results show that reasoning-oriented distillation combined with response stabilization makes compact open-source models more practical and reliable for X-CCD.
Chinese Translation
跨语言代码克隆检测(X-CCD)具有挑战性,因为用不同语言编写的语义等价程序往往几乎没有表面相似性。尽管大型语言模型(LLMs)在语义克隆检测方面显示出前景,但作为黑箱系统的使用引发了关于成本、可重复性、隐私和输出格式不可靠性的担忧。尤其是,紧凑型开源模型往往在遵循推理导向提示和生成能够稳定映射到二进制克隆标签的输出方面表现不佳。为了解决这些限制,我们提出了一种知识蒸馏框架,将推理能力从DeepSeek-R1转移到用于X-CCD的紧凑型开源学生模型中。我们使用来自Project CodeNet的跨语言代码对构建推理导向的合成训练数据,并使用LoRA适配器对Phi3和Qwen-Coder进行微调。我们进一步引入响应稳定化方法,包括强制结论提示、二分类头和对比分类头,并使用预测指标和响应率评估模型行为。在Python--Java、Rust--Java、Rust--Python和Rust--Ruby上的实验表明,知识蒸馏一致性提高了紧凑模型的可靠性,并且通常提高了预测性能,特别是在分布变化下。此外,与基于生成的推理相比,分类头变体显著减少了推理时间。总体而言,我们的结果表明,结合推理导向的蒸馏和响应稳定化,使紧凑型开源模型在X-CCD检测中更具实用性和可靠性。
cs.CL / 1 / 2605.00847
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
H-探针:从语言模型的潜在表示中提取层次结构
Abstract
Representing and navigating hierarchy is a fundamental primitive of reasoning. Large language models have demonstrated proficiency in a wide variety of tasks requiring hierarchical reasoning, but there is limited analysis of how the models geometrically represent the latent constructions necessary for such thinking. To this end, we develop H-probes, a collection of linear probes that extract hierarchical structure, specifically depth and pairwise distance, from latent representations. In synthetic tree traversal tasks, the H-probes robustly find the subspaces containing the hierarchical structure necessary to complete the tasks; furthermore, in comprehensive ablation experiments, we show that these hierarchy-containing subspaces are low-dimensional, causally important for high task performance, and generalize within- and out-of-domain. We also find analogous, though weaker, hierarchical structure in real-world hierarchical contexts such as mathematical reasoning traces. These results demonstrate that models represent hierarchy not only at the level of syntax and concepts, but at deeper levels of abstraction -- including the reasoning process itself.
Chinese Translation
层次的表示和导航是推理的基本原理。大型语言模型在需要层次推理的各种任务中表现出了很高的能力,但对于模型如何在几何上表示这些必要的潜在结构的分析仍然有限。为此,我们开发了H-探针,一组线性探针,旨在从潜在表示中提取层次结构,具体包括深度和成对距离。在合成树遍历任务中,H-探针稳健地找到包含完成任务所需的层次结构的子空间;此外,在全面的消融实验中,我们展示了这些包含层次的子空间是低维的、因果上对高任务性能重要并且可以在领域内和领域外进行泛化。此外,我们还发现了在现实世界的层次语境中,例如数学推理轨迹,存在类似的、尽管较弱的层次结构。这些结果表明,模型不仅在句法和概念层面表示层次,而且在更深的抽象层级中 - 包括推理过程本身。
cs.CL / 2 / 2605.00905
DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA
DIAGRAMS:一种用于图表问答中推理级归因的评审框架
Abstract
Diagram question answering (Diagram QA) requires reasoning-level attribution that links each question-answer pair to all visual regions needed to derive the answer, rather than only the region containing the final response. Creating such structured evidence across diagrams, charts, maps, circuits, and infographics is time-consuming, and existing annotation tools tightly couple their interfaces to dataset-specific formats. We present DIAGRAMS, a lightweight, schema-driven review framework that decouples interface logic from dataset-specific JSON structures through an internal meta-schema and dataset adapters. Given an image and QA pair with optional candidate regions, the system performs QA-conditioned evidence selection and proposes the regions required for reasoning. When QA pairs or candidate regions are missing, it generates them and supports human verification and refinement. Across six Diagram QA datasets, model-suggested evidence achieves 85.39% precision and 75.30% recall against reviewer-final selections (micro-averaged). These results indicate that the review-first framework reduces manual region creation while maintaining high agreement with final reasoning-level attributions. We release a public demo and installable package to support dataset auditing, grounded supervision creation, and grounded evaluation.
Chinese Translation
图表问答(Diagram QA)需要推理级别的归因,将每个问题-答案对与导出答案所需的所有视觉区域关联,而不仅仅是包含最终响应的区域。在图表、图形、地图、电路和信息图中创建这种结构化证据是耗时的,现有的注释工具将其界面与特定数据集的格式紧密耦合。我们提出了DIAGRAMS,这是一种轻量级、基于模式的评审框架,通过内部元模式和数据集适配器将界面逻辑与特定数据集的JSON结构解耦。给定一幅图像及与之对应的问答对和可选候选区域,该系统进行基于问答的证据选择,并提出推理所需的区域。当缺少问答对或候选区域时,它会生成这些内容并支持人工验证和改进。在六个图表问答数据集中,模型建议的证据在与审核最终选择的比较中,微平均达到85.39%的精准率和75.30%的召回率。这些结果表明,该评审优先框架减少了人工区域创建的工作,同时保持与最终推理级归因的高度一致性。我们发布了一个公共演示和可安装包,以支持数据集审计、基于证据的监督创建和基于证据的评估。
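The micro-averaged precision and recall reported in this abstract pool counts over all QA pairs before dividing, rather than averaging per-item scores. The sketch below illustrates that pooling under a simplifying assumption that suggested and reviewer-final regions are matched by exact region ID; the abstract does not say how regions are matched, and the function name and example IDs are illustrative.

```python
from typing import Iterable, Set, Tuple


def micro_precision_recall(
    items: Iterable[Tuple[Set[str], Set[str]]]
) -> Tuple[float, float]:
    """items: (model-suggested region ids, reviewer-final region ids) per QA pair."""
    true_pos = false_pos = false_neg = 0
    for suggested, final in items:
        true_pos += len(suggested & final)   # suggested and kept by the reviewer
        false_pos += len(suggested - final)  # suggested but removed by the reviewer
        false_neg += len(final - suggested)  # added by the reviewer, never suggested
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical example: pooled counts are TP=3, FP=1, FN=1.
example = [
    ({"r1", "r2"}, {"r1", "r2", "r3"}),
    ({"r4", "r5"}, {"r4"}),
]
precision, recall = micro_precision_recall(example)
```

With pooling, QA pairs that involve many regions weigh more heavily than pairs with few, which is the usual reason to report the micro average for region-level evidence.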
cs.CL / 3 / 2605.00994
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
模型生物体是有泄露的:困惑度差异化通常揭示微调目标
Abstract
Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap between reference and finetuned models. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior. We evaluate this on a diverse set of model organisms (N=76, 0.5 to 70B parameters), including backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of model organisms tested, the method surfaces completions revealing finetuning objectives within the top-ranked results, with models trained via synthetic document finetuning or to produce exact phrases being particularly susceptible. We further show that the technique can be effective even without access to the exact pre-finetuning checkpoint: trusted reference models from different families can serve as effective substitutes. As the method requires only next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs.
Chinese Translation
微调可以显著修改大型语言模型的行为,包括引入有害或不安全的行为。为了研究这些风险,研究人员开发了模型生物体:经过微调以表现特定已知行为的模型,用于受控实验。识别这些行为仍然是一个挑战。我们展示了一种基于困惑度的简单方法,可以通过利用模型生物体的过度泛化微调行为的倾向,从而揭示微调目标。首先,我们使用来自一般语料库的短随机预填充生成多样的完成文本。其次,我们通过参考模型和微调模型之间的困惑度差距递减对完成文本进行排名。排名最高的完成文本往往会揭示微调目标,而无须模型内部信息或对行为的先验假设。我们在一组多样的模型生物体上评估了该方法(N=76,参数从0.5到70B),包括后门模型、通过合成文档微调内化错误事实的模型、具有隐含令人担忧行为的对抗训练模型以及表现出突现失调的模型。对于绝大多数测试的模型生物体,该方法都能在排名最高的结果中揭示微调目标,尤其是通过合成文档微调或生成确切短语的模型特别容易受到影响。我们进一步表明,即使在没有访问确切的微调前检查点的情况下,该技术也能有效:来自不同模型家族的受信参考模型可以作为有效的替代品。由于该方法仅需要微调模型的下一个标记概率,它与暴露标记对数概率的API门控模型兼容。
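The ranking step described in this abstract can be sketched from token log-probabilities alone. The code below is an illustration under assumptions, not the authors' implementation: it takes the gap to be reference perplexity minus finetuned perplexity, assumes natural-log token probabilities, and uses made-up example values.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(
    completions: Sequence[Tuple[str, Sequence[float], Sequence[float]]]
) -> List[Tuple[float, str]]:
    """completions: (text, finetuned-model logprobs, reference-model logprobs).

    Returns (gap, text) pairs sorted by decreasing gap, where
    gap = reference perplexity - finetuned perplexity (an assumed definition).
    """
    scored = []
    for text, finetuned_lp, reference_lp in completions:
        gap = perplexity(reference_lp) - perplexity(finetuned_lp)
        scored.append((gap, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical log-probabilities: the second completion is cheap for the finetuned
# model but surprising to the reference model, so it ranks first.
ranked = rank_by_perplexity_gap([
    ("generic text", [-1.0, -1.0], [-1.0, -1.2]),
    ("finetuned-specific phrase", [-0.1, -0.2], [-3.0, -4.0]),
])
```

Only per-token log-probabilities are needed from each model, which matches the abstract's note that the method works with APIs exposing token logprobs.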
cs.CL / 4 / 2605.01006
Can AI Debias the News? LLM Interventions Improve Cross-Partisan Receptivity but LLMs Overestimate Their Own Effectiveness
人工智能能否消除新闻偏见?大型语言模型干预提升跨党派接受度,但过高估计自身有效性
Abstract
Partisan news media erode cross-partisan trust, but large language models (LLMs) offer a potential means of debiasing such content at scale. Across two pre-registered experiments, we tested whether LLM-generated debiasing of liberal news headlines could improve conservative readers' trust-relevant judgments. Study 1 found that subtle lexical debiasing (replacing emotive words with more moderate synonyms) had no effect on any outcome. Study 2 found that a more substantive reframing intervention significantly increased conservatives' perceived trustworthiness, completeness, and willingness to engage with liberal news headlines, without producing a backfire effect among a sample of liberals. In Study 1, the intervention produced robust effects among LLM-simulated silicon participants, whereas it had no impact on human readers. In Study 2, the intervention's effects among silicon participants aligned directionally with human responses but were significantly larger in magnitude for some outcomes. Moderation analyses revealed that the model's implicit theory of who responds to debiasing diverged from the psychological profile that actually predicted human responsiveness. These findings demonstrate that LLM-based debiasing can improve cross-partisan receptivity when targeting ideological framing rather than surface-level language, but that current models lack both the quantitative accuracy and qualitative psychological fidelity to evaluate their own interventions without human oversight.
Chinese Translation
党派新闻媒体削弱跨党派信任,但大型语言模型(LLMs)提供了一种在大规模上消除这种内容偏见的潜在手段。在两项预注册实验中,我们测试了大型语言模型生成的对自由派新闻标题的去偏见处理是否能够改善保守派读者的信任相关判断。研究1发现,微妙的词汇去偏见(用更温和的同义词替换情感词)对任何结果均没有影响。研究2发现,更具实质性的重构干预显著提高了保守派对自由派新闻标题的可信度、完整性和参与意愿,而在一组自由派样本中没有产生反弹效应。在研究1中,干预在模拟硅基参与者中产生了显著效果,而对人类读者没有影响。在研究2中,干预对硅基参与者的效果在方向上与人类回应一致,但在一些结果上的效果幅度明显更大。调节分析揭示,模型对响应去偏见的人的隐性理论与实际预测人类响应的心理特征存在差异。这些发现表明,当针对意识形态框架而非表层语言时,基于大型语言模型的去偏见处理可以改善跨党派接受度,但当前模型在定量准确性和定性心理忠实度方面都不足以在没有人类监督的情况下评估其自身干预效果。
cs.CL / 5 / 2605.01011
CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
CLEAR:揭示噪声和模糊性如何降低医学领域大语言模型的可靠性
Abstract
Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection like "None of the Above" to uncertainty admission like "I don't know" (IDK). Notably, just including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
Chinese Translation
医学大语言模型(LLM)的评估依赖于简化的考试风格基准,这些基准很少能够反映真实世界医学询问的模糊性。我们引入了CLinical Evaluation of Ambiguity and Reliability(CLEAR)框架,该框架评估决策空间呈现、模糊性和不确定性如何影响LLMs在医学基准上的推理。CLEAR系统性地扰动(1)合理答案选项的数量,(2)真实答案或放弃选项的存在,以及(3)答案选项的语义框架。在对17个LLM进行评估的三个基准测试上应用CLEAR,揭示了现有评估方法的三个显著局限性。首先,增加合理答案的数量降低了模型识别正确答案和放弃错误答案的能力。其次,随着放弃选项的框架从“以上均不是”这样的明确拒绝转变为“我不知道”(IDK)这样的不确定性承认,这种缺乏谨慎的情况加剧。值得注意的是,仅在答案空间中包含IDK就增加了错误答案的选择。最后,我们将识别正确答案和放弃错误答案之间的表现差距形式化为谦逊不足,随着模型规模的扩大而加剧。我们的研究揭示了标准医学基准的局限性,并强调仅提升规模并不能解决LLM的可靠性问题。
cs.CL / 6 / 2605.01017
Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison Triggers They Fail to Detect
心理效应显著,但计算上不可见:大型语言模型生成的社会比较诱因未被识别
Abstract
We introduce Xiaohongshu Social Comparison Reader Elicitation (XHS-SCoRE), a reader-grounded benchmark for detecting if a text-only Xiaohongshu (RedNote) post elicits UPWARD, DOWNWARD, or NEUTRAL/no clear social comparison from a first-person reader perspective. The task targets a socially meaningful relational signal that is behaviorally real yet not reducible to sentiment. Across prompted LLM classifiers and supervised Chinese encoder baselines, we find a consistent mismatch between generation fluency and reliable detection ability: the signal is textually learnable in-domain, but not robustly accessible to prompt-based classification. Prompted LLM classifiers exhibit stable, interpretable failure modes, especially neutralization of comparison-triggering posts and model-specific directional skew. A controlled pilot further shows that LLM-generated Xiaohongshu-style posts can shift perceived standing and comparison-related affect even when prompt-based detection of the same construct remains fragile. XHS-SCoRE contributes both a benchmark for reader-grounded comparison detection and a diagnostic framework for studying when socially meaningful relational cues remain only partially visible to prompt-based inference.
Chinese Translation
我们引入了小红书社会比较读者诱发评估(XHS-SCoRE),这是一个针对文本型小红书(RedNote)帖子是否引发向上、向下或中立/无明确社会比较的评估基准,立足于第一人称读者的视角。该任务旨在捕捉一种在社会意义上具有价值的关系信号,它在行为上真实存在,但无法简化为情感。通过对提示的LLM分类器和监督的中文编码器基准的测试,我们发现生成流畅性与可靠的检测能力之间存在持续的不匹配:该信号在领域内是可文本学习的,但对基于提示的分类并不稳健可及。提示的LLM分类器展现出稳定且可解释的失效模式,尤其是对比较诱发帖子中立化以及模型特有的方向性偏差。一个控制性试点进一步表明,LLM生成的小红书风格帖子能够改变感知地位和与比较相关的情感,即使对同一构念的基于提示的检测仍然脆弱。XHS-SCoRE不仅为读者基础的比较检测提供了一个基准,还有助于诊断在何种情况下社会意义重大的关系线索仅部分可见于基于提示的推断研究。
cs.CL / 7 / 2605.01034
A Theoretical Game of Attacks via Compositional Skills
通过组合技能的攻击理论博弈
Abstract
As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy and show that it is closely related to many existing adversarial prompting methods. We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy. Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance relative to existing adversarial prompting approaches in diverse settings encompassing different LLMs and benchmarks.
Chinese Translation
随着大型语言模型能力的不断增强,对其安全部署的担忧也随之加剧。尽管许多对齐策略旨在限制有害行为,但这些防御方法仍然可以通过精心设计的对抗性提示被规避。在本研究中,我们引入了一个理论框架,形式化了攻击者与防御者之间的博弈。在该框架内,我们设计了一个理论上的最佳回应攻击策略,并表明它与许多现有的对抗性提示方法密切相关。我们进一步分析了所产生的博弈,刻画了其均衡,并揭示了攻击者的固有优势。基于我们的理论分析,我们还推导出了一种可证明的最优防御策略。在实证研究中,我们评估了理论上最优攻击的实际实例,并观察到在涵盖不同大型语言模型和基准的多种环境中,相较于现有的对抗性提示方法表现更强。
cs.CL / 8 / 2605.01048
Compared to What? Baselines and Metrics for Counterfactual Prompting
与什么相比?反事实提示的基准和度量
Abstract
Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate properties such as LLM bias and CoT faithfulness. In this work, however, we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis done on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics (aggregate, per-sample distributional, and regression) and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression powerfully and uniquely characterizes effect direction and magnitude.
Chinese Translation
反事实提示(即扰动单一因素并测量输出变化)在评估大型语言模型(LLM)的偏见和链式推理(CoT)的可靠性方面被广泛使用。然而,在本研究中,我们认为,观察到的效果不能仅仅归因于目标因素,而不考虑对文本进行的基线“保持意义”的修改,这些修改能够建立模型的整体敏感性。这是因为每一次反事实编辑都是一个复合处理,将感兴趣的变量与附带的表面形式变化捆绑在一起,这违背了处理变化无关性。我们观察到在 MedQA 中,当我们手术性地改变患者性别时,预测翻转率为 14.9%。然而,这在统计上与简单释义输入所引起的翻转率(14.1%)不可区分。在这种情况下,因此不应得出大型语言模型对患者性别特别敏感的结论。为了考虑这一点并稳健地测量目标干预的效果,我们提出了一个框架,通过统计检验比较目标干预下观察到的差异与通过释义输入所引起的差异。然后,我们使用该框架重新审视了对 MedPerturb 数据集的分析,该分析报告了模型对患者人口统计信息和风格线索的敏感性证据。我们发现,当考虑到整体模型敏感性时,这些效果大部分消失,只有 120 次测试中的 5 次达到统计显著性。将相同的框架应用于职业传记分类,我们发现明显且显著的方向性性别偏见,表明该框架能够识别真实的方向性效应,即使在其影响较小时。我们评估了一系列度量标准——汇总、每个样本的分布和回归——发现每个样本的度量标准显著比汇总度量标准强大,而回归则有力且独特地表征效应的方向和大小。
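The comparison described in the abstract above (a 14.9% flip rate under the gender edit versus 14.1% under paraphrasing) can be illustrated with a simple significance test. The sketch below is not the authors' procedure: it swaps in a plain two-proportion z-test, treats the two conditions as independent samples, and assumes a hypothetical 1,000 questions per condition, since the abstract does not report the sample size. A paired test over the same questions would likely be closer to the paper's setup.

```python
import math

def two_proportion_z_test(flips_a, n_a, flips_b, n_b):
    """Two-sided z-test for a difference between two flip rates."""
    p_a, p_b = flips_a / n_a, flips_b / n_b
    # Pooled flip rate under the null hypothesis of no difference.
    pooled = (flips_a + flips_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical sample size: 1,000 questions per condition (not from the paper).
z, p = two_proportion_z_test(149, 1000, 141, 1000)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumed counts the difference is far from significant, which matches the abstract's point that a targeted edit must be compared against a paraphrase baseline before sensitivity is attributed to the edited factor.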
cs.CL / 9 / 2605.01065
A Systematic Exploration of Text Decomposition and Budget Distribution in Differentially Private Text Obfuscation
差分隐私文本模糊化中文本分解与预算分配的系统探讨
Abstract
The goal of differentially private text obfuscation is to obfuscate, or "perturb", input texts with Differential Privacy (DP) guarantees, such that the private output texts are quantifiably indistinguishable from the originals. While perturbation at the word level is intuitive, meaningful text privatization happens on complete documents. Recent research has laid the groundwork for reasoning about privacy budget distribution, namely, how an overall $\varepsilon$ budget can be sensibly distributed among the component pieces of a text. We perform a systematic evaluation of multiple text decomposition and budget distribution techniques in the context of DP text obfuscation, testing how different methods for chunking texts can be combined with techniques for allocating $\varepsilon$ to these chunks. Our experiments reveal that such design choices are very important, as even with comparable privacy budgets, significantly different results can occur based on which methods are chosen. In this, we provide credible evidence of the feasibility of maximizing empirical trade-offs by optimizing DP obfuscation procedures.
Chinese Translation
差分隐私文本模糊化的目标是对输入文本进行模糊处理或“扰动”,从而在具备差分隐私(Differential Privacy, DP)保证的情况下,使得输出的私密文本在量化上与原文本难以区分。尽管在单词级别上的扰动比较直观,但有意义的文本隐私化发生在完整文档上。近期的研究为理解隐私预算分配奠定了基础,即如何将整体的 $\varepsilon$ 预算合理分配到文本的各个组成部分。我们在差分隐私文本模糊化的背景下对多种文本分解与预算分配技术进行了系统评估,测试不同的文本切分方法如何与这些切分部分的 $\varepsilon$ 分配技术结合。我们的实验表明,这些设计选择至关重要,因为即便隐私预算相当,所选方法的不同也可能导致显著不同的结果。在这一点上,我们提供了优化差分隐私模糊化程序以最大化经验权衡的可行性证据。
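The budget-distribution question in the abstract above can be sketched as follows. This is an illustrative example only: the two allocation rules (uniform and length-proportional) and the sample chunks are assumptions, not the specific techniques the paper evaluates. It relies on basic sequential composition, under which the per-chunk budgets add up to the overall $\varepsilon$.

```python
def uniform_allocation(total_epsilon, chunks):
    """Give every chunk the same share of the overall budget."""
    return [total_epsilon / len(chunks)] * len(chunks)

def length_proportional_allocation(total_epsilon, chunks):
    """Give longer chunks a proportionally larger share of the budget."""
    total_len = sum(len(c) for c in chunks)
    return [total_epsilon * len(c) / total_len for c in chunks]

# Hypothetical sentence-level decomposition of one document.
chunks = ["The patient arrived at noon.", "She reported mild pain.", "Discharged."]
for name, rule in [("uniform", uniform_allocation),
                   ("length-proportional", length_proportional_allocation)]:
    budgets = rule(10.0, chunks)
    print(name, [round(b, 2) for b in budgets], "sum =", round(sum(budgets), 2))
```

Both rules spend the same total budget, yet they perturb individual chunks with different strength, which is one way comparable privacy budgets could lead to different results.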
cs.CL / 10 / 2605.01073
Controlled Paraphrase Geometry in Sentence Embedding Space: Local Manifold Modeling and Latent Probing
句子嵌入空间中的受控释义几何:局部流形建模与潜在探测
Abstract
The paper studies the local geometry of embedding clouds induced by \emph{controlled local classes of semantically close sentences}. The central question is how controlled paraphrase-like semantic variation is organized in sentence embedding space and whether this local structure can be explicitly modeled by low-degree fitted carriers. We introduce a local geometric modeling scheme based on affine, quadratic, and cubic fitted models. We also use a surface-based latent probing procedure that constructs synthetic latent points in a reduced local PCA space with respect to the fitted carrier. The procedure is intended as an offline method for representation-space analysis, local manifold modeling, and geometry-aware latent probing. Generated latent points are evaluated using criteria that measure consistency with the fitted surface, preservation of neighborhood structure, agreement with the empirical distribution, stability of Hessian-based second-order shape descriptors, and stability of fitted-model coefficients. Experiments on controlled sets of semantically close sentences show that nonlinear local models describe embedding clouds more accurately than affine models. Surface-based generation provides strong fitted-geometry fidelity, including surface consistency, Hessian-based shape consistency, and coefficient consistency. Downstream experiments show that geometric validity of synthetic latent points does not automatically translate into improved classification performance. The results support explicit local geometric modeling of sentence embedding space and highlight the need to distinguish geometric validity from discriminative utility. As a resource contribution, we introduce \textbf{CoPaGE-300K}, a controlled template-based dataset of semantically close sentence variants with slot-level annotations and precomputed sentence embeddings.
Chinese Translation
本文研究了由 \textit{受控局部类别的语义相近句子} 所引发的嵌入云的局部几何结构。核心问题是受控的类似释义的语义变体在句子嵌入空间中的组织方式,以及这种局部结构是否可以通过低度拟合载体进行显式建模。我们引入了一种基于仿射模型、二次模型和三次模型的局部几何建模方案。我们还使用了一种基于表面的潜在探测程序,该程序在降低的局部主成分分析(PCA)空间中构建合成潜在点,相关于拟合的载体。该程序旨在作为一种离线方法,用于表示空间分析、局部流形建模和几何感知的潜在探测。生成的潜在点通过一系列标准进行评估,这些标准测量与拟合表面的一致性、邻域结构的保持、与经验分布的符合程度、基于海森矩阵的二阶形状描述符的稳定性,以及拟合模型系数的稳定性。对受控语义相近句子集的实验表明,非线性局部模型比仿射模型更准确地描述嵌入云。基于表面的生成提供了强大的拟合几何保真度,包括表面一致性、基于海森矩阵的形状一致性和系数一致性。下游实验表明,合成潜在点的几何有效性并不自动转化为分类性能的提升。结果支持了句子嵌入空间的显式局部几何建模,并强调了区分几何有效性与判别效用的必要性。作为资源贡献,我们介绍了 \textbf{CoPaGE-300K},这是一个受控模板基础的数据集,包含语义相近的句子变体,并带有插槽级注释和预计算的句子嵌入。
cs.CL / 11 / 2605.01077
Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines
教会大语言模型巴西医疗:注入官方临床指南中的知识
Abstract
Brazil's Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats -- rephrases, wiki-style articles, and question-answer pairs -- using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at https://github.com/hugoabonizio/clinical-protocols-br
Chinese Translation
巴西的统一健康系统(SUS)依赖于官方临床指南,这些指南为超过2亿公民定义了诊断标准、治疗方案、剂量和监测程序。然而,当前的大语言模型在这些针对指南的知识上表现不佳,且没有基准用于评估基于巴西葡萄牙语协议的临床回忆。我们通过将Qwen2.5-14B-Instruct适应于巴西临床领域来填补这一空白。从178个官方指南(约540万个标记)中,我们生成了约7000万个标记的合成数据,包含三种格式(改述、维基风格文章和问答对),使用四个生成器大语言模型。然后,我们应用持续预训练,接着使用群体相对策略优化(GRPO)。我们引入了HealthBench-BR,其中包含1780个平衡的真假临床断言,以及PCDT-QA,其中有890个开放式临床问题,由大语言模型评审打分。我们的最佳模型在HealthBench-BR上的得分为83.9%,在PCDT-QA上的得分为85.4%,超越了GPT-5.2、Claude Sonnet 4.6、Gemini 3.1 Pro和Google AI Overview的基于网页的RAG,尽管只有140亿参数。消融实验表明,生成器多样性和强化学习对这些提升至关重要。我们发布所有数据集、基准和模型权重,以支持可再现的巴西葡萄牙语临床自然语言处理研究。代码、数据和模型权重可在 https://github.com/hugoabonizio/clinical-protocols-br 获取。
cs.CL / 12 / 2605.01097
Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues
可解释的困难感知知识追踪在导师-学生对话中的应用
Abstract
Recent advances in large language models (LLMs) have led to the development of AI-powered tutoring systems that provide interactive support via dialogue. To enable these tutoring systems to provide personalized support, it is essential to assess student performance at each turn, motivating knowledge tracing (KT) in dialogue settings. However, existing dialogue-based KT approaches often ignore question difficulty modeling and rely on opaque latent representations from LLMs, hindering accurate and interpretable prediction. In this work, we propose an interpretable difficulty-aware conversational KT framework built upon LLMs, which explicitly models students' abilities and the difficulty of tutor-posed tasks at each turn. The framework incorporates the original textual question and the next tutor-posed task to estimate the student's knowledge state and the difficulty of the upcoming turn. Furthermore, it integrates Item Response Theory to map LLM's outputs into student ability and question difficulty parameters, enabling interpretable prediction of student performance grounded in cognitive theories of learning. We evaluate the framework on two tutor-student dialogue datasets. Both quantitative and qualitative results show that our framework outperforms existing KT baselines, meanwhile generating interpretable outputs consistent with cognitive theory.
Chinese Translation
近年来,大型语言模型(LLMs)的进展催生了基于人工智能的辅导系统,这些系统通过对话提供互动支持。为了使这些辅导系统能够提供个性化支持,必须在每个对话环节中评估学生的表现,这催生了对话场景中的知识追踪(KT)研究。然而,现有的基于对话的知识追踪方法往往忽视了问题难度建模,并依赖于来自LLMs的不透明潜在表示,从而妨碍了准确和可解释的预测。在本研究中,我们提出了一种基于LLMs的可解释的困难感知对话知识追踪框架,该框架明确建模学生的能力和每个环节导师提出任务的难度。该框架结合了原始文本问题和下一个导师提出的任务,以估计学生的知识状态和即将到来的环节的难度。此外,它结合了项目反应理论(Item Response Theory),将LLM的输出映射到学生能力和问题难度参数,从而实现基于学习认知理论的可解释学生表现预测。我们在两个导师-学生对话数据集上评估了该框架。定量和定性结果均表明,我们的框架优于现有的知识追踪基线,同时生成与认知理论一致的可解释输出。
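The Item Response Theory step mentioned in the abstract above can be sketched with the standard logistic IRT model, in which the probability of a correct response depends on the gap between student ability and task difficulty. The function name and the optional discrimination parameter are illustrative assumptions; the abstract does not say which IRT variant the framework uses or how the LLM outputs are mapped to these parameters.

```python
import math

def irt_probability(ability, difficulty, discrimination=1.0):
    """Logistic IRT: P(correct) = sigmoid(discrimination * (ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical per-turn estimates of student ability and task difficulty.
for ability, difficulty in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5)]:
    p = irt_probability(ability, difficulty)
    print(f"ability={ability}, difficulty={difficulty} -> P(correct)={p:.2f}")
```

Because both parameters live on one scale, a prediction can be read directly: the student is expected to succeed when estimated ability exceeds the difficulty of the next task.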
cs.CL / 13 / 2605.01106
Component-Aware Self-Speculative Decoding in Hybrid Language Models
混合语言模型中的组件感知自我推测解码
Abstract
Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038 -- an 18x gap attributable to how each architecture integrates its components. The property is scale-invariant: Falcon-H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15x ratio (Falcon) maps to alpha = 0.37 at k=4, while 81.96x (Qwen) maps to alpha = 0.019. For sequential hybrids, generic LayerSkip achieves 12x higher acceptance rates than the component-aware strategy. The composition pattern of hybrid models -- not merely the presence of alternative components -- determines whether component-level self-speculation is viable.
Chinese Translation
推测解码通过使用快速模型草拟候选标记并与目标进行并行验证,从而加速自回归推理。自我推测方法避免了外部草拟器的需求,但仅在同质的Transformer架构中进行了研究。我们提出了组件感知自我推测解码,这是第一个利用混合语言模型内部架构异质性的方案,将SSM/线性注意子图作为零成本内部草拟。我们在两个具有架构差异的混合模型家族上进行了评估:Falcon-H1(并行:Mamba-2 + 每层注意)和Qwen3.5(顺序:交错线性和注意层),并设置了一个纯Transformer对照(Qwen2.5)。在贪婪解码下,并行混合模型在草拟长度k=2时的接受率为alpha = 0.68,而顺序混合模型则仅为alpha = 0.038——这一18倍的差距归因于每种架构集成其组件的方式。该特性具有尺度不变性:Falcon-H1在3B规模下重现了在0.5B下观察到的接受率。我们进一步表明,来自伴随消融研究的困惑度降级可以预测推测可行性,而无需运行推测解码:3.15倍比率(Falcon)映射到k=4时的alpha = 0.37,而81.96倍(Qwen)映射到alpha = 0.019。对于顺序混合模型,通用的LayerSkip实现了比组件感知策略高12倍的接受率。混合模型的组合模式——而非仅仅是替代组件的存在——决定了组件级别自我推测是否可行。
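To give a sense of what the acceptance rates in the abstract above imply, the sketch below uses the standard speculative-decoding estimate of tokens produced per verification step. It assumes each drafted token is accepted independently with probability alpha, and that alpha in the abstract is a per-token acceptance rate; neither assumption is confirmed by the abstract.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens per target-model pass: 1 + alpha + ... + alpha**k.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target model always adds one token.
    """
    return sum(alpha ** i for i in range(k + 1))

# Acceptance rates quoted in the abstract for draft length k = 2.
print(f"parallel hybrid:   {expected_tokens_per_step(0.68, 2):.2f}")   # 2.14
print(f"sequential hybrid: {expected_tokens_per_step(0.038, 2):.2f}")  # 1.04
```

Under this estimate the sequential hybrid gains almost nothing over ordinary decoding, before even counting the cost of drafting.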
cs.CL / 14 / 2605.01168
Quantifying and Predicting Disagreement in Graded Human Ratings
量化与预测分级人工评分中的分歧
Abstract
It is increasingly recognized that human annotators do not always agree, and such disagreement is inherent in many annotation tasks. However, not all instances in a given task elicit the same degree of opinion divergence. In this paper, we investigate annotation variation patterns in graded human ratings for inappropriate languages, including offensive language, hate speech, and toxic language perception. We examine whether the degree of annotation disagreement can be predicted from textual features. We further propose the Opposition Index, a metric that quantifies perspective opposition among annotators on a given item, and investigate the predictability of instances with potentially opposing human opinions. Our results show a moderate positive correlation between estimated and observed annotation variance. We find that two approaches achieve comparable performance in variance prediction: directly predicting the variance value and estimating it from predicted annotation distributions. Our results on opposition perspective prediction show that items with high opposition index values are more difficult to predict and are often underestimated by models.
Chinese Translation
人们越来越认识到,人类标注者并不总是达成一致,这种分歧在许多标注任务中是固有的。然而,并非所有任务实例在意见分歧程度上都是相同的。本文探讨了在针对不当语言(包括冒犯性语言、仇恨言论和有毒语言感知)的等级评分中,标注变异模式。我们研究了是否能够从文本特征中预测标注分歧的程度。我们进一步提出了对立指数(Opposition Index),这一指标量化了标注者在特定项目上的观点对立程度,并研究了潜在对立人类观点实例的可预测性。我们的结果显示,估计值与观察到的标注方差之间存在中等正相关。我们发现两种方法在方差预测中表现相似:直接预测方差值和从预测的标注分布中估计方差。关于对立观点预测的结果显示,具有高对立指数值的项目往往更难以预测,并且通常被模型低估。
cs.CL / 15 / 2605.01188
Compute Optimal Tokenization
计算最优分词
Abstract
Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
Chinese Translation
规模法则使得数据量和语言模型规模的最优选择成为可能,但数据单位——令牌(token)对这种关系的影响仍然没有得到充分探索。在本研究中,我们系统地探讨了由压缩率(即每个令牌的平均字节数)控制的令牌信息粒度如何影响规模趋势。我们训练了988个潜在分词模型(BLT),参数范围从5000万到70亿,能够设定所需的压缩率。这种灵活性使我们能够研究压缩率的作用,远远超出了使用流行的BPE分词器获得的每个令牌4.57字节的范围。我们的实验表明,在计算最优配置下,模型参数数量与按字节测量的数据大小成正比,而不是如常见理解那样与令牌数量成正比(Kaplan et al., 2020; Hoffmann et al., 2022)。此外,我们发现最优压缩率与BPE获得的压缩率不同,并且随着计算量的增加而降低。这些发现适用于潜在分词和子词分词,以及英语以外的语言,为语言模型开发者在选择分词方案以实现最大计算效率提供了指导。
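The compression rate that the abstract above treats as the key variable is simply the average number of UTF-8 bytes per token. A minimal sketch follows, with a whitespace split standing in for a real tokenizer; that stand-in and the sample sentence are assumptions for illustration only.

```python
def bytes_per_token(text, tokens):
    """Compression rate: average UTF-8 bytes of text per token."""
    return len(text.encode("utf-8")) / len(tokens)

text = "Scaling laws relate compute, data, and model size."
tokens = text.split()  # hypothetical stand-in for a BPE or latent tokenizer
print(f"{bytes_per_token(text, tokens):.2f} bytes per token")
```

Measuring data in bytes rather than tokens makes training-set sizes comparable across tokenizers with different compression rates, which is the unit the abstract argues scaling should be expressed in.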
cs.CL / 16 / 2605.01205
SRA: Span Representation Alignment for Large Language Model Distillation
SRA:针对大型语言模型蒸馏的跨度表示对齐
Abstract
Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce \textbf{SRA} (\textbf{S}pan \textbf{R}epresentation \textbf{A}lignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.
Chinese Translation
跨标记器知识蒸馏(CTKD)能够实现大型语言模型与较小学生之间的知识转移,即使它们使用不同的标记器。尽管现有方法主要集中在标记级对齐策略,这些策略通常脆弱且对标记器之间的差异敏感,但我们认为在蒸馏之前将标记聚合为更强大表示的方法同样重要。本文提出了 \textbf{SRA}(\textbf{S}pan \textbf{R}epresentation \textbf{A}lignment for Large Language Model Distillation),这是一个新的框架,通过多粒子动力系统的物理视角重新诠释CTKD。SRA将对齐的基本单元从标记转变为稳健的、与标记器无关的跨度。我们将每个跨度建模为粒子聚类,并通过其质心(Center of Mass, CoM)表示其状态,质心是一个注意力加权平均值,捕获丰富的语义信息。我们利用注意力导出的加权质心概念来优先考虑最显著的跨度。此外,我们还采用几何正则化器来保持表示空间的结构完整性,并引入对齐的跨度对数蒸馏,以增强模型间的知识转移。在具有挑战性的跨架构蒸馏实验中,SRA始终显著优于最新的CTKD基线,验证了我们基于物理的研究方法。
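The span Center of Mass described in the abstract above is an attention-weighted average of the token representations inside a span. The sketch below shows that computation in plain Python; the function name, the two-dimensional vectors, and the weights are hypothetical, and the abstract does not specify how the attention weights are obtained or normalized.

```python
def span_center_of_mass(hidden_states, attention_weights):
    """Attention-weighted average of the token vectors in one span."""
    total = sum(attention_weights)
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(attention_weights, hidden_states)) / total
        for d in range(dim)
    ]

# Hypothetical span of two tokens; the second receives three times the attention.
print(span_center_of_mass([[0.0, 0.0], [2.0, 4.0]], [1.0, 3.0]))  # [1.5, 3.0]
```

Because the result is one vector per span regardless of how many tokens each tokenizer produces, teacher and student spans can be compared even when their token boundaries differ.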
cs.CL / 17 / 2605.01224
Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs
迷失在巴别塔: 大型语言模型中偶发性多语言能力的负面影响
Abstract
This paper argues that contemporary multilingual NLP has converged on a fragile and misleading paradigm of incidental multilingualism. Today's LLMs appear multilingual largely because they are trained on massive, uneven web corpora, not because multilingual or multicultural competence has been treated as a core design objective. We contend that this paradigm systematically produces unequal, brittle, and opaque behavior across languages, with severe consequences in real-world and agentic deployments where models must reason, plan, and act across multiple linguistic contexts. We report a focused empirical study of two practical questions: which languages models self-report as supported and which languages they actually respond in across multilingual prompts. We additionally demonstrate how even a simple language-change attack can surface these failures and expose hidden assumptions about language in LLM-based systems. To address this, we call for a shift toward multilingualism by design: a research agenda that treats equitable multilingual performance, cultural grounding, and cross-lingual behavioral understanding as first-class goals in all aspects of the model pipeline.
Chinese Translation
本文认为,当前的多语言自然语言处理(NLP)已经趋向于一个脆弱且误导性的偶发性多语言能力范式。如今的大型语言模型(LLMs)看似具备多语言能力,主要是因为它们在庞大且不均衡的网络语料库上进行训练,而并非因为多语言或多文化的能力被视为核心设计目标。我们主张这一范式系统地导致各语言间不平等、脆弱且不透明的表现,导致在需要跨多个语言环境进行推理、规划和行动的实际应用中产生严重后果。我们报告了一项针对两个实际问题的重点实证研究:模型自我报告支持的语言与其在多语言提示下实际响应的语言。此外,我们还展示了即使是一个简单的语言变更攻击也可以揭露这些失败,并暴露出LLM系统中关于语言的潜在假设。为了解决这一问题,我们呼吁在设计上转向多语言能力:制定一个研究议程,将平等的多语言表现、文化基础和跨语言行为理解视为模型管道各个方面的首要目标。
cs.CL / 18 / 2605.01256
GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models
GIFT:引导微调与迁移以增强指令调优语言模型
Abstract
A promising paradigm for adapting instruction-tuned language models is to learn task-specific updates on a pretrained base model and subsequently merge them into the instruction-tuned model. However, existing approaches typically treat the instruction-tuned model as a passive target that is only involved at the final merging stage, without guiding the training process. We propose GIFT (Guided Fine-Tuning and Transfer), a simple and efficient framework that incorporates guidance from the instruction model into task adaptation. GIFT fine-tunes a low-rank adapter on the pretrained base model using confidence signals derived from the instruction-tuned model. The learned adapter is then merged into the instruction-tuned model, yielding task-specialized models that preserve general instruction-following behavior. We evaluate GIFT on mathematical and knowledge-intensive benchmarks across multiple model families and scales. Results show that GIFT consistently outperforms direct fine-tuning and representative transfer-based baselines, while maintaining robust generalization and favorable test-time scaling behavior.
Chinese Translation
适应指令调优语言模型的一种有前景的范式是学习预训练基础模型的任务特定更新,并随后将其合并到指令调优模型中。然而,现有方法通常将指令调优模型视为一个被动目标,仅在最终合并阶段参与,而没有指导训练过程。我们提出了GIFT(Guided Fine-Tuning and Transfer),这是一种简单而高效的框架,通过将指令模型的指导融入任务适应。GIFT在预训练基础模型上使用从指令调优模型衍生的信心信号微调低秩适配器。学习到的适配器随后被合并到指令调优模型中,生成的任务专用模型保留了一般的指令遵循行为。我们在多个模型系列和规模上对数学和知识密集基准测试进行了GIFT的评估。结果显示,GIFT始终优于直接微调和典型的基于迁移的方法,同时保持了强大的泛化能力和良好的测试时间扩展性。
cs.CL / 19 / 2605.01292
Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach
应对孟加拉语假新闻检测中的数据稀缺问题:一种基于大型语言模型的数据集增强方法
Abstract
The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM) based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but highly imbalanced, limiting model performance, and LLM based augmentation for Bangla has been scarcely explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero shot and few shot prompting, evaluate multiple augmentation rates, and examine random versus similarity-based selection strategies. Our experiments show that augmenting only the minority class with a high augmentation rate and random subsampling yields the strongest gains, raising the Fake News F1 score from 0.85 to 0.88. To support reproducibility and further research in this low-resource domain, we publicly release 4,545 synthetically generated Bangla fake news samples along with our full implementation. These findings demonstrate that well-designed LLM-driven augmentation can significantly improve fake news detection in low resource settings and provide a practical foundation for advancing multilingual misinformation research.
Chinese Translation
虚假信息在数字媒体中的不断传播凸显了可靠假新闻检测系统的必要性,但对于资源匮乏的语言(如孟加拉语)的进展受到小型且不平衡的数据集的限制。本研究探讨了基于大型语言模型(LLM)的增强是否有效地解决了这一限制,并改善了孟加拉假新闻分类。现有数据集仍然具有价值,但高度不平衡,限制了模型性能,而针对孟加拉语的 LLM 基增强探索极为有限。为填补这一空白,我们提出了一种系统的增强框架,该框架使用经过指令调优的 Gemma 3 27B IT 模型生成合成的孟加拉新闻文章,辅以语义过滤和受控子采样,以保持标签一致性和多样性。我们比较了零样本和少量样本提示,评估了多种增强速率,并检验了随机与相似性基础选择策略。实验结果表明,仅对少数类进行高增水平的增强和随机子采样能够带来最大的提升,使假新闻的 F1 得分从 0.85 提升至 0.88。为了支持在这一低资源领域的可重复性和进一步研究,我们公开发布了 4,545 个合成生成的孟加拉假新闻样本及完整实现。这些发现表明,精心设计的基于 LLM 的增强可以显著改善低资源环境中的假新闻检测,并为推动多语言虚假信息研究提供了实用基础。
cs.CL / 20 / 2605.01302
Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation
超越语义相关性:用于稳健检索增强生成的反事实风险最小化
Abstract
Standard Retrieval-Augmented Generation (RAG) systems predominantly rely on semantic relevance as a proxy for utility. However, this assumption collapses in realistic decision-making scenarios where user queries are laden with cognitive biases, such as false premises or confirmation bias. In such cases, maximizing relevance paradoxically promotes the retrieval of sycophantic evidence that reinforces hallucinations, a critical failure we term the "Relevance-Robustness Gap". To bridge this gap, we propose CoRM-RAG (Counterfactual Risk Minimization for RAG), a framework that aligns retrieval with decision safety rather than mere similarity. Grounded in causal intervention, we introduce a Cognitive Perturbation Protocol to simulate user biases during training, which is then distilled into a lightweight Evidence Critic. This scoring module learns to identify documents that possess sufficient evidential strength to steer the model toward correctness despite adversarial query perturbations. Extensive experiments on decision-making benchmarks demonstrate that CoRM-RAG significantly outperforms strong dense retrievers and LLM-based rerankers in adversarial settings, while enabling effective risk-aware abstention through reliable robustness scoring. Our code is available at https://github.com/PeiYangLiu/CoRM-RAG.git.
Chinese Translation
标准的检索增强生成(RAG)系统主要依赖语义相关性作为效用的代理。然而,这一假设在现实决策场景中崩溃,因为用户查询往往受到认知偏见的影响,如错误前提或确认偏误。在这种情况下,最大化相关性反而促使检索迎合性证据,从而强化幻觉,这一关键失误我们称之为“相关性-稳健性差距”。为了弥补这一差距,我们提出了CoRM-RAG(RAG的反事实风险最小化)框架,该框架将检索与决策安全性对齐,而非仅仅是相似性。基于因果干预,我们引入了认知干扰协议,以在训练过程中模拟用户偏见,并将其提炼为一个轻量级证据批判器。该评分模块学习识别具备足够证据强度的文档,以引导模型在面对对抗性查询扰动时朝向正确性。针对决策制定基准的广泛实验表明,CoRM-RAG在对抗性环境中显著优于强大的密集检索器和基于大语言模型的重新排序器,同时通过可靠的稳健性评分实现有效的风险感知拒绝。我们的代码可在 https://github.com/PeiYangLiu/CoRM-RAG.git 获得。
cs.CL / 21 / 2605.01315
Enhancing Game Review Sentiment Classification on Steam Platform with Attention-Based BiLSTM
基于注意力机制的双向长短期记忆网络在Steam平台游戏评论情感分类中的增强
Abstract
This paper investigates sentiment classification of Steam game reviews using an attention-based Bidirectional Long Short-Term Memory (BiLSTM) model. Using a dataset of 50,000 reviews sampled from a larger Steam review corpus, the authors compare a traditional machine learning baseline based on TF-IDF and PyCaret AutoML with a deep learning approach implemented in PyTorch. The proposed BiLSTM+Attention model is trained with class-weighted cross-entropy to address class imbalance and achieves 83% accuracy and 85% weighted F1-score on the test set, with 90% recall for negative reviews. The paper also presents attention visualizations to show interpretability by highlighting sentiment-bearing words. The study concludes that the BiLSTM+Attention model is effective for analyzing user sentiment in Steam reviews and useful for helping developers understand player feedback.
Chinese Translation
本文研究了使用基于注意力机制的双向长短期记忆网络(BiLSTM)模型对Steam游戏评论进行情感分类。使用自更大Steam评论语料库中抽取的50000条评论的数据集,作者将基于TF-IDF和PyCaret AutoML的传统机器学习基线与在PyTorch中实现的深度学习方法进行了比较。所提出的BiLSTM+Attention模型采用带类权重的交叉熵进行训练,以解决类别不平衡问题,在测试集上达到了83%的准确率和85%的加权F1分数,对于负面评论的召回率为90%。本文还展示了注意力可视化,以通过突出情感承载词来展示模型的可解释性。研究结论是,BiLSTM+Attention模型在分析Steam评论中的用户情感方面是有效的,并且对帮助开发者理解玩家反馈具有重要价值。
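The class-weighted cross-entropy mentioned in this abstract can be illustrated with a small standalone sketch. The abstract does not say how the class weights were chosen, so the inverse-frequency ("balanced") heuristic below is an assumption, and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical. The loss is normalized by the total weight, which is also how PyTorch's `CrossEntropyLoss` behaves when a `weight` tensor is combined with the default mean reduction.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed heuristic: weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total_loss += -w * math.log(p[y])  # p[y] must be > 0
        total_weight += w
    return total_loss / total_weight


# Toy batch: three positive reviews (class 1), one negative review (class 0).
labels = [1, 1, 1, 0]
probs = [[0.1, 0.9], [0.1, 0.9], [0.1, 0.9], [0.5, 0.5]]
weights = inverse_frequency_weights(labels)
print(weights)
print(weighted_cross_entropy(probs, labels, weights))
```

With these weights the single minority-class example contributes as much total weight as the three majority-class examples combined, so errors on the rare class are not averaged away.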
cs.CL / 22 / 2605.01317
Sentiment Analysis of Mobile Legends App Reviews Using Machine Learning and LSTM-Based Deep Learning Models
基于机器学习和LSTM深度学习模型的Mobile Legends应用评论情感分析
Abstract
This paper compares Machine Learning and LSTM-based Deep Learning methods for sentiment analysis of Mobile Legends app reviews. Using a dataset of 10,000 reviews labeled as positive, negative, and neutral, the study evaluates traditional models with TF-IDF and PyCaret AutoML and compares them against an LSTM model designed to capture sequential text dependencies. The results show that the LSTM model outperforms the classical Machine Learning baselines, achieving 92% accuracy and a weighted F1-score of 91%. The findings indicate that deep learning is more effective for handling informal and context-dependent user review text.
Chinese Translation
本文比较了机器学习和基于LSTM的深度学习方法在Mobile Legends应用评论情感分析中的应用。利用一个包含10,000条被标记为积极、消极和中性的评论的数据集,研究评估了传统模型(使用TF-IDF和PyCaret AutoML),并将其与旨在捕捉序列文本依赖的LSTM模型进行比较。结果表明,LSTM模型的表现优于经典机器学习基准,达到了92%的准确率和91%的加权F1-score。研究结果表明,深度学习在处理非正式和依赖于上下文的用户评论文本时更具有效性。
cs.CL / 23 / 2605.01322
Benchmarking LightGBM and BiLSTM for Sentiment Analysis on Indonesian E-Commerce Reviews
LightGBM与BiLSTM在印尼电子商务评论情感分析中的基准比较
Abstract
This study presents a comparative analysis between two primary approaches in Natural Language Processing (NLP): Machine Learning (ML) utilizing the PyCaret AutoML framework, and Deep Learning (DL). The evaluation is conducted on a sentiment analysis task using an Indonesian e-commerce review dataset sourced from Hugging Face. The dataset, consisting of 15,000 samples, is partitioned into training, validation, and testing sets. The ML experiments compare LightGBM, Logistic Regression, and Support Vector Machine (SVM) algorithms, whereas the DL experiment implements a Bidirectional Long Short-Term Memory (BiLSTM) architecture. The experimental results demonstrate that the BiLSTM model outperforms all ML models, achieving an accuracy of 98.87% and an F1-Score of 98.87%. Meanwhile, LightGBM emerges as the best-performing ML model with an accuracy of 98.23% in a highly efficient training time. This research proves that the BiLSTM architecture is highly capable of capturing the sequential context of Indonesian review texts, making it the superior model for this specific classification task.
Chinese Translation
本研究对自然语言处理(NLP)中的两种主要方法进行了比较分析:利用PyCaret AutoML框架的机器学习(ML)和深度学习(DL)。评估是在使用从Hugging Face获取的印尼电子商务评论数据集进行情感分析的任务中进行的。该数据集包含15,000个样本,分为训练集、验证集和测试集。机器学习实验比较了LightGBM、逻辑回归和支持向量机(SVM)算法,而深度学习实验实现了双向长短期记忆(BiLSTM)架构。实验结果表明,BiLSTM模型在所有机器学习模型中表现优越,达到了98.87% 的准确率和98.87% 的F1得分。同时,LightGBM在高效的训练时间中成为表现最好的机器学习模型,准确率为98.23%。本研究证明,BiLSTM架构在捕捉印尼评论文本的序列上下文方面具有很强的能力,使其成为该特定分类任务的最佳模型。
cs.CL / 24 / 2605.01323
Creating and Evaluating Figurative Language Dataset for Sindhi
创建和评估信德语比喻语言数据集
Abstract
In this article, we introduce SiNFluD, a novel benchmark dataset for Sindhi figurative language classification. We first collect raw text from various blogs, social media platforms, and literary sources, and subsequently prepare the corpus for annotation. Two native annotators label the data using the Doccano text annotation tool, achieving an inter-annotator agreement of 0.81. We then establish baseline results using 5-fold and 10-fold cross-validation. Finally, we evaluate mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL models, along with SetFit for few-shot fine-tuning of sentence transformers. Among these, the pretrained XLM-RoBERTa-XL achieves the best performance.
Chinese Translation
在本文中,我们介绍了SiNFluD,这是一个用于信德语比喻语言分类的新基准数据集。我们首先从各种博客、社交媒体平台和文学来源收集原始文本,随后准备语料库进行标注。两位本地标注者使用Doccano文本标注工具对数据进行标注,达到了0.81的标注者间一致性。接着,我们使用5折和10折交叉验证建立基线结果。最后,我们评估了mBERT、XLM-RoBERTa和XLM-RoBERTa-XL模型,以及SetFit用于句子变换器的少量样本微调。在这些模型中,预训练的XLM-RoBERTa-XL达到了最佳性能。
cs.CL / 25 / 2605.01333
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench:评估多模态大型语言模型在牙科实践中的认知能力
Abstract
Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.
Chinese Translation
多模态大型语言模型(MLLMs)已成为牙科图像分析的一个有前景的范式。然而,它们在捕获放射学分析所需的多层次认知过程方面的能力仍不清楚。在此,我们提出了一个综合基准,以评估MLLMs在牙科放射学分析中的认知能力。该基准涵盖三个关键成像模态,即根尖片、全景片和侧颅位片,并定义了四个认知类别:感知、理解、预测和决策。基准包含27个来自公共数据集的临床基础任务,配有手动策划的注释和3,820份临床医师评估用于评估。六个前沿MLLMs,包括GPT-5.2和GLM-4.6,均被评估。我们展示了在牙科实践中MLLMs与临床医师之间的性能差距,阐明了模型的优缺点,刻画了失败模式,并提供了改进建议。该数据资源将促进与临床认知、安全要求和牙科实践中工作流程复杂性相一致的下一代人工智能系统的发展。
cs.CL / 26 / 2605.01336
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
多视角媒体画像套件:资源、评估与分析
Abstract
News outlets shape public opinion at a scale that makes automated detection of political bias and factuality essential. However, the field still lacks unified resources, comprehensive evaluations across diverse approaches, and systematic analyses of the representations and fusion strategies that matter most, especially under label sparsity and dataset diversity. In addition, there is little empirical work reporting broad, observation-driven findings about what consistently works, what fails, and why. We address these gaps through four main contributions. First, we introduce MBFC-2025, a large-scale label set covering approximately 2,600 outlets from Media Bias/Fact Check (MBFC). Second, we construct multiview representations for ACL-2020 (Panayotov et al., 2022), which includes around 900 outlets, as well as for MBFC-2025. These representations span Alexa graphs, hyperlink graphs, LLM-derived graphs, articles, and Wikipedia descriptions. Third, we provide a systematic evaluation and analysis of embedding views and fusion strategies, including a reinforcement learning-based fusion variant. Fourth, we conduct extensive experiments that achieve state-of-the-art results on ACL-2020 and establish strong benchmarks on MBFC-2025.
Chinese Translation
新闻媒体在塑造公众舆论方面发挥着重要作用,因此自动检测政治偏见和事实准确性变得至关重要。然而,该领域仍然缺乏统一的资源、多样化方法的综合评估,以及对最重要的表征和融合策略进行系统分析,特别是在标签稀疏性和数据集多样性下。此外,关于什么方法始终有效、什么方法失败及其原因的实证研究也很少。我们通过四个主要贡献来填补这些空白。首先,我们介绍了MBFC-2025,这是一个涵盖约2600个媒体来源的规模庞大的标签集,来源于媒体偏见/事实检查(Media Bias/Fact Check,MBFC)。其次,我们为ACL-2020(Panayotov et al., 2022)构建了多视角表征,该数据集包含约900个媒体来源,同时也为MBFC-2025构建了相应的表示。这些表示包括Alexa图、超链接图、基于大型语言模型(LLM)生成的图、文章及维基百科描述。第三,我们提供了嵌入视图和融合策略的系统评估与分析,包括一种基于强化学习的融合变体。第四,我们进行了广泛的实验,在ACL-2020上取得了最先进的结果,并在MBFC-2025上建立了强有力的基准。
cs.CL / 27 / 2605.01347
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD:通过多智能体辩论突破在线政策蒸馏的局限
Abstract
On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.
Chinese Translation
在线政策蒸馏(OPD)在标记级教师监督下基于自身轨迹训练学生,但现有方法受到单一教师能力限制:当教师出错时,学生将继承该错误。OPD在智能任务中的研究尚属空白,在这些任务中,每步错误会在长轨迹中累积,从而导致训练不稳定。我们提出MAD-OPD(多智能体辩论驱动的在线政策蒸馏),通过将蒸馏教师重新设定为一个关于学生在线政策状态进行辩论的教师集体,从而打破这一限制;辩论产生的集体智慧为学生提供标记级监督,每位教师的贡献根据其辩论后的信心加权。为了将OPD扩展到智能任务中,我们还引入了在线政策智能蒸馏(OPAD),该方法增加了步骤级采样以在多步错误累积下稳定训练。此外,我们推导了一个任务自适应的散度原则,选择JSD(杰森-香农散度)以保持智能稳定性,并为代码生成选择逆KL(库尔巴克-莱布勒散度),并在理论和实证上进行了验证。在六个教师-学生配置(Qwen3和Qwen3.5;1.7B-14B学生,8B-32B教师)和五个智能及代码基准测试上,MAD-OPD在所有六个配置中均排名第一;在14B+8B$\to$4B设置下,其智能平均提升了$+2.4\%$,代码平均提升了$+3.7\%$,超过了更强的单一教师OPD。
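The two divergences named in this abstract can be made concrete with a small sketch over discrete next-token distributions. The definitions of KL and JSD are standard. Everything else here is an assumption for illustration: the abstract does not specify how post-debate confidences are applied, so `mix_teachers` (a hypothetical name) shows only one plausible reading, a confidence-weighted mixture of teacher distributions.

```python
import math


def kl(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0. Zero-mass terms add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(student, teacher):
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return kl(student, teacher)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mix_teachers(teacher_dists, confidences):
    """Hypothetical aggregation: mixture weighted by normalized confidences."""
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab = len(teacher_dists[0])
    return [sum(w * d[i] for w, d in zip(weights, teacher_dists))
            for i in range(vocab)]


student = [0.5, 0.3, 0.2]
target = mix_teachers([[0.8, 0.1, 0.1], [0.4, 0.4, 0.2]], [3.0, 1.0])
print(target)
print(reverse_kl(student, target), jsd(student, target))
```

Because JSD is bounded, a single badly mismatched step cannot produce an arbitrarily large loss, which is one common intuition for preferring it when errors compound over long trajectories. Reverse KL has no such bound and concentrates the student on the teacher's high-probability tokens.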
cs.CL / 28 / 2605.01350
LLM Output Detectability and Task Performance Can be Jointly Optimized
LLM输出可检测性与任务性能可以共同优化
Abstract
Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detectable signals into LLM outputs by biasing their token distributions. However, it has been reported that watermarked LLMs often perform worse on downstream tasks. We propose PUPPET, a framework that fine-tunes an LLM via reinforcement learning to generate text that is both more detectable and better performing on downstream tasks. We use two reward functions: a detector that outputs a machine-class likelihood and an evaluator that measures a task-specific metric. Experiments on long-form QA, summarization, and essay writing show that LLMs trained with PUPPET achieve high detectability competitive with watermarking methods while outperforming them on downstream tasks. The analysis shows that this optimization can be performed efficiently with only a few thousand samples in 1--2 GPU hours. Moreover, these gains are consistent across out-of-domain tasks, different LLM families, and model sizes, and are even robust to paraphrasing attacks.
Chinese Translation
检测机器生成文本对于在部署大型语言模型(LLMs)时的透明性和责任感至关重要。在检测方法中,水印是一种设计上具有统计可靠性的方法 —— 它通过偏置 LLM 输出的标记分布来嵌入可检测信号。然而,有报道指出,带水印的 LLM 在下游任务中的表现往往较差。我们提出了 PUPPET,一个通过强化学习对 LLM 进行微调的框架,旨在生成在可检测性和下游任务性能上均表现更好的文本。我们使用了两个奖励函数:一个是输出机器类别可能性的检测器,一个是测量任务特定指标的评估器。在长文本问答、摘要生成和写作任务上的实验表明,使用 PUPPET 训练的 LLM 在可检测性方面与水印方法相当,同时在下游任务的表现上超过了它们。分析表明,这一优化过程仅需在 1-2 个 GPU 小时内用几千个样本就能高效完成。此外,这些收益在域外任务、不同的 LLM 家族和模型大小之间是一致的,并且即使在被改写攻击(paraphrasing attacks)下也表现出强健性。
cs.CL / 29 / 2605.01357
On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
关于稳定长文本生成:基准测试与减少长度波动
Abstract
Large Language Models (LLMs) excel at long-context understanding but exhibit significant limitations in long-form generation. Existing studies primarily focus on single-generation quality, generally overlooking the volatility of the output. This volatility not only leads to significant computational costs but also severely impacts the models' reliable application. To address this gap, our work unfolds in three stages: benchmarking, probing, and mitigation. We first propose the VOlatility in Long-form Text Benchmark (VOLTBench), a novel heterogeneous-task benchmark designed to systematically quantify the length volatility of long-form generation. Subsequently, by analyzing attention traces, we conduct an in-depth probe to identify several common internal patterns that cause this volatility. Finally, to mitigate long-form output volatility, we propose Stable Generation via Logits Boosting (GLoBo), a lightweight decoding-stage optimization strategy, designed to significantly enhance both the length accuracy and stability of long-form generation without additional training. Extensive experiments on VOLTBench provide the first systematic confirmation of severe long-form output instability in mainstream models and validate that our proposed method successfully improves the mean output length of the base model by 148% and reduces the length volatility by 69%, while maintaining high generation quality.
Chinese Translation
大型语言模型(LLMs)在长上下文理解方面表现优异,但在长文本生成中存在显著的局限性。现有研究主要集中于单次生成的质量,通常忽略了输出的波动性。这种波动性不仅导致显著的计算成本,而且严重影响模型的可靠应用。为了弥补这一缺口,我们的工作分为三个阶段:基准测试、探测和缓解。首先,我们提出了长文本波动性基准(VOLTBench),这是一个旨在系统量化长文本生成长度波动性的异构任务基准。随后,通过分析注意力轨迹,我们深入探测,识别出导致这种波动性的几种常见内部模式。最后,为了减轻长文本输出的波动性,我们提出了一种通过对数增强实现的稳定生成方法(GLoBo),这是一种轻量级解码阶段优化策略,旨在显著提升长文本生成的长度准确性和稳定性,而无需额外训练。针对VOLTBench的广泛实验首次系统确认了主流模型在长文本输出中严重的不稳定性,并验证了我们提出的方法成功地将基础模型的平均输出长度提高了148%,同时将长度波动性降低了69%,保持了高质量的生成水平。
cs.CL / 30 / 2605.01372
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
基于嵌入的上下文提示训练以增强大型语言模型作为文本编码器的能力
Abstract
Large language models (LLMs) have been widely explored for embedding generation. While recent studies show that in-context learning (ICL) effectively enhances the representational capability of LLMs by prepending a few task-related demonstrations, it causes substantial token overhead due to the increased sequence length. In this work, we propose EPIC, a novel embedding-based in-context prompt training strategy that leverages ICL to generate high-quality embeddings while reducing computational burden during both training and inference. This approach replaces discrete text demonstrations with their corresponding continuous embeddings, which not only encourages the LLM to align semantically-related text pairs during contrastive learning, but also requires the model to interpret demonstration embeddings as part of the in-context prompt. Consequently, EPIC-trained models achieve excellent embedding performance both with or without in-context prompts at inference time. Comprehensive experiments demonstrate that our method establishes new state-of-the-art results on the MTEB benchmark, surpassing frontier models trained solely on publicly available retrieval data. Extensive ablation studies further validate the effectiveness and necessity of our mechanism.
Chinese Translation
大型语言模型(LLMs)在嵌入生成方面得到了广泛的研究。尽管近期的研究表明,通过在前面添加一些与任务相关的示例,上下文学习(ICL)有效地增强了LLMs的表示能力,但这会由于序列长度的增加而导致显著的标记开销。在本研究中,我们提出了EPIC,一种新颖的基于嵌入的上下文提示训练策略,它利用ICL生成高质量的嵌入,同时在训练和推理过程中减少计算负担。这种方法用相应的连续嵌入替代离散文本示例,这不仅鼓励LLM在对比学习中对语义相关的文本对进行对齐,还要求模型将示例嵌入解释为上下文提示的一部分。因此,经过EPIC训练的模型在推理时无论是否使用上下文提示均表现出色的嵌入性能。全面的实验表明,我们的方法在MTEB基准上建立了新的最先进结果,超越了仅基于公开可用检索数据训练的前沿模型。广泛的消融研究进一步验证了我们机制的有效性和必要性。
cs.CL / 31 / 2605.01373
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
聚焦核心:通过自对比赋能扩散大语言模型
Abstract
The iterative denoising paradigm of Diffusion Large Language Models (DLMs) endows them with a distinct advantage in global context modeling. However, current decoding strategies fail to leverage this capability, typically exhibiting a local preference that overlooks the heterogeneous information density within the context, ultimately degrading generation quality. To address this limitation, we systematically investigate high-information-density (HD) tokens and present two key findings: (1) explicitly conditioning on HD tokens substantially improves output quality; and (2) HD tokens exhibit an early-decoding tendency, converging earlier than surrounding tokens. Motivated by these findings, we propose Focus on the Core (FoCore), a training-free decoding strategy that utilizes HD tokens in a self-contrast manner, wherein HD tokens are temporarily remasked as negative samples, to guide generation. We further introduce FoCore_Accelerate (FoCore_A), an efficient variant that, upon detecting HD token convergence, performs parallel decoding over stable candidates within a local context window, substantially accelerating generation. Extensive experiments on math, code and logical reasoning benchmarks demonstrate that FoCore consistently improves generation quality and efficiency across both LLaDA and Dream backbones. For instance, on HumanEval, FoCore improves pass@1 from 39.02 to 42.68 over standard Classifier-Free Guidance, while FoCore_A reduces the number of decoding steps by 2.07x and per-sample latency from 20.76s to 8.64s (-58.4%).
Chinese Translation
扩散大语言模型(DLMs)的迭代去噪范式赋予了它们在全局上下文建模方面的独特优势。然而,当前的解码策略未能利用这一能力,通常表现出一种局部偏好,忽视了上下文中的异质信息密度,最终降低了生成质量。为了解决这一限制,我们系统地研究了高信息密度(HD)标记,并提出了两个关键发现:(1)明确以HD标记为条件显著提高了输出质量;(2)HD标记表现出提前解码的倾向,收敛速度早于周围标记。受到这些发现的启发,我们提出了聚焦核心(Focus on the Core,FoCore),一种无训练的解码策略,它以自对比的方式利用HD标记,其中HD标记被暂时重新屏蔽为负样本,以指导生成。我们进一步介绍了FoCore_Accelerate(FoCore_A),一种高效的变体,在检测到HD标记收敛时,在局部上下文窗口内对稳定的候选项执行并行解码,从而显著加速生成。在数学、代码和逻辑推理基准上的广泛实验表明,FoCore在LLaDA和Dream骨干网络上始终提高了生成质量和效率。例如,在HumanEval上,相比标准无分类器引导(Classifier-Free Guidance),FoCore将pass@1从39.02提高到42.68,而FoCore_A将解码步骤数量减少了2.07倍,每个样本延迟从20.76秒降至8.64秒(降幅58.4%)。
cs.CL / 32 / 2605.01374
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
MTA:针对大型语言模型蒸馏的多粒度轨迹对齐
Abstract
Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher's internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative geometry among semantic units within each layer. This design is motivated by empirical findings that Transformer representations become increasingly abstract with depth, and is also consistent with linguistic views in which higher-level meaning emerges through the composition of lower-level lexical units. We further incorporate a Hidden Representation Alignment loss to directly align selected teacher-student layers. Experiments show that MTA consistently outperforms state-of-the-art baselines on standard benchmarks, with ablations confirming the contribution of each component.
Chinese Translation
知识蒸馏是压缩大型语言模型(LLMs)的关键技术,但大多数现有方法在固定层或标记级输出上对齐表示,忽略了表示在深度上的演变。因此,在蒸馏过程中,学生仅受到微弱指导以捕捉教师的内部关系结构,这限制了知识的转移。为了解决这一局限性,我们提出了多粒度轨迹对齐(Multi-Granular Trajectory Alignment,MTA),该框架在教师和学生表示的层级变换轨迹上进行对齐。MTA采取了一种层自适应策略:低层在单词级别对齐以保留词汇信息,而高层则在短语级别跨度(例如,名词和动词短语)上操作以捕捉组合语义。我们通过一种动态结构对齐损失(Dynamic Structural Alignment loss)实现了这个想法,该损失匹配每一层内语义单元之间的相对几何关系。这一设计受到实证研究的启发,发现Transformer表示随着深度的增加而变得日益抽象,同时也与语言学观点一致,即高层次的意义通过低层次的词汇单元的组合而产生。我们进一步结合一个隐表示对齐损失(Hidden Representation Alignment loss)以直接对齐选定的教师-学生层。实验表明,MTA在标准基准上始终优于最先进的基线,并且消融实验证实了每个组件的贡献。
cs.CL / 33 / 2605.01381
A framework for analyzing concept representations in neural models
神经模型中概念表征分析框架
Abstract
Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: containment, which tests if a concept is fully represented in a subspace but not outside it, and disentanglement, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method, LEACE, performs well on both testing axes, but still struggles to generalize to unseen data; and (3) in HuBERT speech representations, phone information is both contained and disentangled from speaker information, while speaker information is hard to contain in a compact subspace, despite being disentangled from phones.
Chinese Translation
理解神经模型如何表征人类可解释的概念是一个挑战。之前的研究从不同的角度探讨了线性概念子空间,如探测和概念消除。我们提出了一个统一框架,通过两个轴来研究这些子空间:包含性(containment),测试一个概念是否在子空间中被完全表征而不存在于子空间之外;以及解缠结性(disentanglement),测试是否与其他概念隔离。在对文本和语音模型的实验中,我们首先强调,概念子空间可能并不是唯一确定的,并讨论这对概念子空间分析的影响。然后,我们比较了使用五种在不同领域提出的估计器估计的概念子空间的属性。我们的发现包括:(1)估计器的选择会影响包含性和解缠结性属性;(2)最新的概念消除方法LEACE在两个测试轴上表现良好,但仍然难以对未见数据进行泛化;以及(3)在HuBERT语音表征中,音素信息既能被包含在子空间中,又与说话者信息解缠,而说话者信息尽管与音素解缠,却难以被包含在紧凑的子空间中。
cs.CL / 34 / 2605.01386
MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents
MemORAI:通过自适应图智能实现的记忆组织与检索,用于大语言模型对话代理
Abstract
Large Language Models (LLMs) lack persistent memory for long-term personalized conversations. Existing graph-based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual-layer compression to retain user-persona-relevant content, a provenance-enriched multi-relational graph tracking factual origins at the turn level, and query-adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query-conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state-of-the-art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.
Chinese Translation
大语言模型(LLMs)缺乏用于长期个性化对话的持久记忆。现有的基于图的记忆系统存在信息稀释、缺乏来源追踪以及忽略查询上下文的统一检索等问题。我们提出了MemORAI(通过自适应图智能实现的记忆组织与检索),该框架整合了三项创新:选择性记忆过滤,采用双层压缩来保留与用户个性相关的内容;增强来源的多关系图在回合级别跟踪事实起源;查询自适应子图检索与动态加权PageRank相结合,通过查询条件的边权重应用进行检索。在LOCOMO和LongMemEval基准测试中,MemORAI在记忆检索和个性化响应生成方面取得了最先进的性能,证明了选择性存储、丰富表示和自适应检索对于连贯、个性化的大语言模型代理至关重要。
cs.CL / 35 / 2605.01399
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
Verbal-R3:作为检索与推理之间缺失桥梁的口头重排序器
Abstract
The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to bridge retrieval results and the LLM's reasoning ability through Verbal Annotations, analytic narratives that explicitly articulate the logical connection between a search query and retrieved contexts. Our empirical investigation reveals the potential of Verbal Annotations to substantially enhance the LLM's ability to generate accurate, contextually-grounded responses. Motivated by this finding, we introduce Verbal-R3, a novel agentic RAG framework that consists of a Generator and a Verbal Reranker. The Generator performs iterative retrieval and reasoning, while the Verbal Reranker returns relevance scores and Verbal Annotations to guide the reasoning and answering process of the Generator. The inference process of Verbal-R3 is further refined through relevance-guided test-time scaling, which efficiently allocates test-time compute for effective trajectory expansion. Verbal-R3 achieves state-of-the-art performance on complex Question Answering benchmarks, validating the effectiveness of the proposed framework.
Chinese Translation
传统的检索增强生成(RAG)范式通过将原始检索文本注入到大型语言模型(LLM)的上下文中,通常导致检索信息的整合不佳。本文提出通过口头注释来桥接检索结果与LLM的推理能力,口头注释是明确阐述搜索查询与检索上下文之间逻辑关系的分析叙述。我们的实证研究揭示了口头注释在显著提升LLM生成准确且基于上下文的响应能力方面的潜力。基于这一发现,我们引入了Verbal-R3,一个由生成器和口头重排序器组成的新型代理RAG框架。生成器执行迭代检索和推理,而口头重排序器返回相关性评分和口头注释,以引导生成器的推理和回答过程。Verbal-R3的推理过程通过相关性引导的测试时比例缩放进一步优化,有效地为有效轨迹扩展分配测试时计算资源。Verbal-R3在复杂的问答基准测试中达到了最先进的性能,验证了所提框架的有效性。
cs.CL / 36 / 2605.01402
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
通过强化学习将分布意识注入多模态大语言模型以应对深度不平衡回归
Abstract
Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
Chinese Translation
多模态大语言模型(MLLMs)在长尾目标分布下的数值回归任务中表现不佳。基于标记的有监督微调(SFT)和逐点回归奖励使学习偏向于高密度区域,导致趋向均值的回归行为以及较差的尾部表现。我们发现现有MLLM训练范式的一个关键限制是缺乏跨样本的关系监督。为了解决这一问题,我们提出了一种基于组相对策略优化(Group Relative Policy Optimization)的分布感知强化学习框架,该框架通过基于一致性相关系数(Concordance Correlation Coefficient)的奖励引入批次级的比较式监督,以在相关性、规模和均值方面对齐预测分布和真实分布。该框架可即插即用,无需架构修改。在一套统一的长尾回归基准测试上的实验显示,与SFT以及现有的MLLM回归方法相比,取得了一致的改进,尤其在中等样本和少量样本情况下性能提升显著。
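The Concordance Correlation Coefficient behind the reward in this abstract has a standard closed form, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), sketched below over one batch of predictions. This is only the textbook statistic with population (1/n) variances. How the paper turns it into a Group Relative Policy Optimization reward is not stated in the abstract, and the function name `concordance_correlation` is hypothetical.

```python
def concordance_correlation(pred, target):
    """Lin's CCC over a batch; undefined if both inputs are constant and equal."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((x - mean_p) ** 2 for x in pred) / n
    var_t = sum((y - mean_t) ** 2 for y in target) / n
    cov = sum((x - mean_p) * (y - mean_t) for x, y in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


target = [1.0, 2.0, 3.0]
print(concordance_correlation([1.0, 2.0, 3.0], target))  # exact agreement
print(concordance_correlation([2.0, 3.0, 4.0], target))  # correct shape, shifted mean
print(concordance_correlation([2.0, 2.0, 2.0], target))  # collapsed to the mean
```

Unlike Pearson correlation, the shifted predictions score below 1 because of the mean-difference term, and predictions that collapse to the batch mean score 0. This matches the abstract's description of aligning correlation, scale, and mean at the batch level rather than per sample.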
cs.CL / 37 / 2605.01417
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks
Medmarks:用于医疗任务的全面开源大语言模型基准套件
Warner, Benjamin, Grandhi, Ratna Sagari, Kieffer, Max, Ouraq, Aymane, Panigrahi, Saurav, Ambwani, Geetu, Bagga, Kunal, Khandekar, Nikhil, Hariharan, Arya, Mishra, Nishant, Ram, Manish, Yang, Shamus Sim Zi, Essouaied, Ahmed, Moyondafoluwa, Adepoju Jeremiah, Scholz, Robert, Huang, Bofeng, Beavers, Molly, Gureja, Srishti, Mahishi, Anish, Khan, Sameed, Griot, Maxime, Batra, Hunar, Delbrouck, Jean-Benoit, Bharadwaj, Siddhant, Clark, Ronald, Vashist, Ashish, Zafar, Anas, Murali, Leema Krishna, Deshpande, Harsh, Patel, Ameen, Brown, William, Hagemann, Johannes, Lane, Connor, Scotti, Paul Steven, Abraham, Tanishq Mathew
Abstract
Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, & GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks
Chinese Translation
评估用于医疗应用的大语言模型(LLMs)仍然面临挑战,这主要是由于基准饱和、数据可获取性有限以及相关任务覆盖不足。现有的评估套件要么已达到饱和,严重依赖于受限的数据集,要么缺乏全面的模型覆盖。我们推出了 Medmarks,这是一个完全开源的评估套件,包含30个基准,涵盖了问答、信息提取、医疗计算和开放式临床推理。我们对61个模型在71种配置下进行了系统评估,使用可验证的指标和 LLM-as-a-Judge。我们的结果表明,前沿推理模型(Gemini 3 Pro Preview、GPT-5.1 和 GPT-5.2)在各基准中均取得了最高表现,大多数前沿专有模型在令牌使用效率上显著优于开源权重替代品,经过医学微调的模型优于其一般模型对手,并且模型对答案顺序偏见是敏感的(特别是较小的模型和 Grok 4)。我们的部分评估(Medmarks-T)可以直接用作强化学习环境,以便对 LLM 进行医学推理后训练。代码可在 https://github.com/MedARC-AI/Medmarks 获取。
cs.CL / 38 / 2605.01428
Hallucinations Undermine Trust; Metacognition is a Way Forward
幻觉削弱信任;元认知是前进的途径
Abstract
Despite significant strides in factual reliability, errors -- often termed hallucinations -- remain a major concern for generative AI, especially as LLMs are increasingly expected to be helpful in more complex or nuanced setups. Yet even in the simplest setting -- factoid question-answering with clear ground truth -- frontier models without external tools continue to hallucinate. We argue that most factuality gains in this domain have come from expanding the model's knowledge boundary (encoding more facts) rather than improving awareness of that boundary (distinguishing known from unknown). We conjecture that the latter is inherently difficult: models may lack the discriminative power to perfectly separate truths from errors, creating an unavoidable tradeoff between eliminating hallucinations and preserving utility. This tradeoff dissolves under a different framing. If we understand hallucinations as confident errors -- incorrect information delivered without appropriate qualification -- a third path emerges beyond the answer-or-abstain dichotomy: expressing uncertainty. We propose faithful uncertainty: aligning linguistic uncertainty with intrinsic uncertainty. This is one facet of metacognition -- the ability to be aware of one's own uncertainty and to act on it. For direct interaction, acting on uncertainty means communicating it honestly; for agentic systems, it becomes the control layer governing when to search and what to trust. Metacognition is thus essential for LLMs to be both trustworthy and capable; we conclude by highlighting open problems for progress towards this objective.
Chinese Translation
尽管在事实可靠性方面取得了显著进展,但错误——通常被称为幻觉——仍然是生成式人工智能面临的主要问题,尤其是当大型语言模型(LLMs)日益被期望在更复杂或更细微的设置中提供帮助时。然而,即使在最简单的情境中——具有明确真实前提的事实问答,前沿模型在没有外部工具的情况下仍然会产生幻觉。我们认为,在这一领域,大多数事实准确性的提升来自于扩展模型的知识边界(编码更多事实),而不是提高对该边界的意识(区分已知与未知)。我们推测后者是固有困难的:模型可能缺乏将真相与错误完美区分的判别能力,从而在消除幻觉与保持实际效用之间产生不可避免的权衡。该权衡在不同的框架下被解除。如果我们将幻觉理解为自信的错误——没有适当限定的信息传递——在答案与放弃的二分法之外就会出现一条第三条路径:表达不确定性。我们提出了忠实的不确定性:将语言上的不确定性与内在的不确定性对齐。这是元认知的一个方面——意识到自身不确定性并采取行动的能力。对于直接互动,采取对不确定性的行动意味着诚实地传达它;对于代理系统,它成为控制层,决定何时搜索及可信赖的内容。因此,元认知对大型语言模型的可信性和能力至关重要;我们最后强调了朝着这一目标进展的开放问题。
cs.CL / 39 / 2605.01441
Artificial intelligence language technologies in multilingual healthcare: Grand challenges ahead
多语种医疗中的人工智能语言技术:面临的重大挑战
Abstract
AI language technologies (AILTs), increasingly enabled by large language models (LLMs), are becoming embedded in multilingual healthcare workflows for translation, rewriting, documentation, interpreting, and messaging in language-discordant settings. Yet fluent output is not the same as clinically safe or equitable communication: performance varies across languages, accents, tasks, and workflows, and efficiency gains can hide errors, reduce traceability, and shift responsibility across clinicians, translators, interpreters, and health systems. This narrative review synthesises recent peer-reviewed evidence across written communication, spoken communication, and emerging agentic workflows. Using the Human-Centered AI Language Technology (HCAILT) lens, it examines capabilities, evaluation practices, implementation patterns, and recurrent errors through reliability, safety culture, and trustworthiness. We identify key convergences and contradictions in the literature and propose seven grand challenges for the next phase of research and deployment. Progress, we argue, requires not only better models but also accountable sociotechnical design, calibrated human oversight, and stronger collaboration across MT/NLP, translation studies, HCI, clinical practice, implementation science, and policy.
Chinese Translation
人工智能语言技术(AILTs)在大型语言模型(LLMs)的推动下,正逐渐融入多语种医疗工作流程中,用于翻译、改写、文档编制、口译以及在语言不协调的环境中进行信息传递。然而,流畅的输出并不代表临床上安全或公平的交流:在不同语言、口音、任务和工作流程之间,性能存在差异,效率提升可能掩盖错误,降低可追溯性,并在临床医生、翻译人员、口译员以及卫生系统之间转移责任。本文为一篇叙述性综述,综合了最近经过同行评审的文献证据,涵盖书面交流、口头交流及新兴的自主工作流程。通过人本中心的人工智能语言技术(HCAILT)视角,本文探讨了能力、评估实践、实施模式和反复出现的错误,关注于可靠性、安全文化和可信度。我们识别出文献中的关键趋同与矛盾,并提出了七个下一阶段研究和部署的重大挑战。我们认为,进展不仅需要更好的模型,还需要负责任的社会技术设计、经过校准的人类监督以及在机器翻译/自然语言处理、翻译研究、人机交互、临床实践、实施科学和政策领域之间更强的协作。
cs.CL / 40 / 2605.01451
Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models
审计基于人工智能的紧急警务调度中的人口统计偏差:对十一种大型语言模型的跨语言评估
Abstract
Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.
Chinese Translation
大型语言模型(LLMs)正迅速被整合到高风险公共安全系统中,包括紧急呼叫分流和调度决策支持,但其在该背景下的人口统计公平性尚未得到充分检验。在这里,我们介绍了一个跨语言审计框架,该框架将警务优先调度系统操作化为五级序数分类任务,并应用受控的最小对照设计来隔离人口统计提示的影响。在覆盖11个前沿模型、15对场景、三个人口统计类别(宗教外观、性别和种族)以及两种语言(英语和普通话)的19,800个模型输出中,我们发现,当事件严重性模糊时,人口统计偏差系统性地出现,但当操作优先级通过呼叫内容清晰确定时,偏差则大部分消失。偏差的强度因人口统计维度而异,宗教外观产生的影响最大,其次是性别和种族。重要的是,偏差在语言之间并不一致地转移:性别偏差在普通话中显著增强,而种族偏差在英语中更为明显,揭示了聚合分析所忽略的跨语言不对称性。在多个场景中,人口统计提示产生了相反的效果,这挑战了简单刻板印象增强模型行为的解释。这些发现表明,基于LLM的调度中的偏差并不是模型自身的固定属性,而是人口统计信号、情境模糊性和语言之间相互作用的结果。除这些实证结果外,该框架还提供了一种可扩展的审计基础设施,使部署机构能够在现实应用前评估候选模型在与管辖区域相关的场景中的表现。
cs.CL / 41 / 2605.01474
ReMedi: Reasoner for Medical Clinical Prediction
ReMedi:医疗临床预测推理模型
Abstract
Predicting future clinical outcomes from electronic health records (EHR) remains challenging due to the complexity and heterogeneity of patient data. LLMs have shown strong potential for such predictive tasks, yet existing approaches mainly focus on enhancing medical knowledge through distillation or RAG while relying on the model's internal ability to interpret contextual information. In this work, we present ReMedi (Reasoner for Medical Clinical Prediction), a framework for improving clinical outcome prediction from EHR. ReMedi generates rationale-answer pairs using a challenging sample regeneration mechanism for complex clinical questions, which leverages ground-truth answers as hints to enhance reasoning for further fine-tuning and preference tuning. ReMedi integrates ground-truth outcome guidance into the preference data construction loop, regenerating rationale-answer variants. By tuning on these rationale-answer pairs, the model improves its predictive performance. Experiments on multiple EHR prediction tasks demonstrate substantial gains of up to 19.9 percent over state-of-the-art baselines in terms of F1 score, underscoring ReMedi's effectiveness in real-world clinical prediction.
Chinese Translation
从电子健康记录 (EHR) 中预测未来的临床结果仍然面临挑战,因为患者数据的复杂性和异质性。大型语言模型 (LLMs) 在此类预测任务中展现出强大的潜力,但现有的方法主要集中在通过蒸馏或检索增强生成 (RAG) 来增强医学知识,同时依赖模型内部的上下文信息解读能力。在本研究中,我们提出了 ReMedi(医疗临床预测推理模型),这是一个旨在改善 EHR 中临床结果预测的框架。ReMedi 通过一种复杂临床问题的样本再生机制生成合理性-答案对,利用真实答案作为提示以增强推理,从而进行进一步的微调和偏好调整。ReMedi 将真实结果指导集成到偏好数据构建循环中,重新生成合理性-答案变体。通过对这些合理性-答案对进行调整,模型提高了其预测性能。在多项 EHR 预测任务上的实验表明,与最先进的基线相比,F1 分数提高了多达 19.9%,彰显了 ReMedi 在实际临床预测中的有效性。
cs.CL / 42 / 2605.01495
FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning
FT-RAG:用于复杂表格推理的细粒度检索增强生成框架
Abstract
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventional RAG systems underperform on structured tabular data, largely due to coarse retrieval granularity and insufficient table semantic comprehension. To address these limitations, we introduce FT-RAG, a fine-grained framework that employs knowledge association by decomposing tables into entry-level semantic units to construct a structured graph. FT-RAG employs a structural neighbor expansion mechanism to find semantically connected entities during graph retrieval, followed by multi-modal fusion to consolidate the context of table retrieval results. Further, to address the scarcity of specialized datasets in this domain, we introduce Multi-Table-RAG-Lib, a benchmark comprising 9870 QA pairs with high complexity and difficulty, curated to demand multi-table integration and text-table information fusion for reasoning. FT-RAG surpasses top-performing baselines across all metrics, achieving improvements of 23.5% and 59.2% in table-level and cell-level Hit Rates, respectively. Generation performance also sees a remarkable 62.2% increase in exact value accuracy recall. These metrics verify the framework's effectiveness in factual grounding across both pure tabular and heterogeneous table-text contexts. Therefore, our method establishes new state-of-the-art performance for complex reasoning over mixed-modality documents.
Chinese Translation
检索增强生成(RAG)通过在推理过程中将响应与外部知识相结合,提升了大型语言模型(LLMs)的能力。然而,传统的 RAG 系统在结构化表格数据上的表现不足,主要由于粗糙的检索粒度和表格语义理解的不足。为了解决这些局限性,我们提出了 FT-RAG,一种细粒度框架,通过将表格分解为条目级语义单元来进行知识关联,从而构建结构化图。FT-RAG 采用结构邻居扩展机制,在图检索过程中找到语义上相关的实体,随后通过多模态融合来巩固表格检索结果的上下文。此外,为了解决该领域专业数据集稀缺的问题,我们引入了 Multi-Table-RAG-Lib,这是一个包含 9870 对具有高复杂性和难度的 QA 组合的基准数据集,旨在满足多表集成和文本-表格信息融合的推理需求。FT-RAG 在所有指标上超越了表现最好的基线,在表格级和单元级命中率分别提升了 23.5% 和 59.2%。生成性能也在准确值回忆方面显著提高了 62.2%。这些指标验证了该框架在纯表格和异构表格-文本上下文中进行事实确认的有效性。因此,我们的方法为混合模态文档的复杂推理奠定了新的最先进的性能。
cs.CL / 43 / 2605.01537
The grip of grammar on meaning uncertainty: cross-linguistic evidence, neural correlates, and clinical relevance
语法对意义不确定性的影响:跨语言证据、神经相关性与临床相关性
Abstract
Isolated word meanings are inherently uncertain. This uncertainty is reduced when they are combined and anchored in context. We propose that grammar compresses meaning uncertainty cross-linguistically, and that this compression is reflected in the brain and selectively disrupted in disorders. Compression was operationalized as the relative difference between non-contextual surprisal estimated from lexical frequency and contextual surprisal from grammar-sensitive models. In narratives from 20 languages, contextual surprisal reduced frequency-based surprisal. This reduction closely tracked the surprisal cost of reversing word order, and scaled with richer, non-redundant lexis as organized by more complex but optimal dependency structure. During fMRI, surprisal and its reduction explained BOLD activity for comprehension and production in overlapping but distinct regions. Uncertainty reduction was significantly attenuated in aphasia, dementia, and schizophrenia, but remained intact where the primary deficit is not language. These findings position uncertainty reduction via grammar as a foundational concept that illuminates principles, brain basis, and disruptions of language.
Chinese Translation
孤立的词义本质上是不确定的。当词义结合并在语境中固定时,这种不确定性会减少。我们提出语法在跨语言中压缩意义的不确定性,这一点在大脑中有所体现,并在某些障碍中被选择性地破坏。压缩被操作化为基于词汇频率估计的非情境惊讶度与来自对语法敏感模型的情境惊讶度之间的相对差异。在 20 种语言的叙述中,情境惊讶度降低了基于频率的惊讶度。这一下降与颠倒词序所导致的惊讶成本紧密相关,并随着更复杂但最优的依赖结构所组织的更丰富且不冗余的词汇而提升。在功能性磁共振成像(fMRI)研究中,惊讶度及其减少解释了在重叠但不同区域内对理解和产生的 BOLD 活动。脑卒中、痴呆和精神分裂症患者的意义不确定性降低显著减弱,但在主要缺陷不在语言的情况下仍保持完好。这些发现将通过语法实现的意义不确定性减少定位为一个基础概念,阐明了语言的原则、大脑基础和干扰。
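The compression measure described in this abstract (the relative difference between frequency-based and contextual surprisal) can be illustrated with a minimal sketch. The normalization used here, dividing the difference by the non-contextual surprisal, is an assumption, since the abstract does not state the exact formula; the function names and the example probabilities are hypothetical.

```python
import math


def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)


def compression(freq_probs, ctx_probs):
    """Relative reduction of total surprisal when context is available.

    freq_probs: per-word probabilities from lexical frequency (no context).
    ctx_probs:  per-word probabilities from a context-sensitive model.
    Assumed normalization: (S_freq - S_ctx) / S_freq.
    """
    s_freq = sum(surprisal(p) for p in freq_probs)
    s_ctx = sum(surprisal(p) for p in ctx_probs)
    return (s_freq - s_ctx) / s_freq
```

A value near 0 would mean context barely reduces uncertainty, while a value near 1 would mean context removes almost all of it.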
cs.CL / 44 / 2605.01555
Automated Interpretability and Feature Discovery in Language Models with Agents
带有智能体的语言模型自动可解释性与特征发现
Abstract
We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable explanations than one-shot labels.
Chinese Translation
我们引入了一个自主多智能体框架,用于机械可解释性,旨在自动化大语言模型内部特征的解释与发现。该系统运行两个耦合循环:(1)解释优化,其中一个智能体提出竞争假设,并通过针对性的提示控制和多指标评估进行迭代测试;(2)特征发现,其中一个智能体生成提示集,构建激活空间中的k-近邻图,并使用统计可分离性和语义一致性标准检索候选特征。在Gemma-2系列模型和权重稀疏变换器中的MLP神经元上,该智能体在一轮自动解释上有所改进,发现了语言特定和与安全相关的特征,并生成了可审计的解释轨迹,展示了智能体驱动的实证循环产生的解释比一轮标签更加精准和可证伪。
cs.CL / 45 / 2605.01596
Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection
微调预训练代码模型以检测人工智能生成的代码
Abstract
This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
Chinese Translation
本文描述了团队 Archaeology 提交至 SemEval-2026 任务 13 的人工智能生成代码检测系统。该共享任务包括三个子任务;我们参与了子任务A(双分类:人工编写与人工智能生成代码)和子任务B(生成模型的11类归属)。在TF-IDF和逻辑回归基线的基础上,我们针对每个子任务微调了四个预训练的代码模型(CodeBERT、GraphCodeBERT、UniXcoder 和 CodeT5+),并采用了不同的策略。对于子任务A,我们使用了留一语言交叉验证、代码增强、分块推理以及裁剪均值聚合的阈值校准,利用一个困难的数据集。对于子任务B,我们采用了夹层令牌打包、类平衡损失和多种种子组合以及测试时数据增强。我们的最佳提交在子任务A上获得了0.737的宏F1分数(第6名/81支队伍),在子任务B上获得了0.422的宏F1分数(第7名/34支队伍)。
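The chunked inference with trimmed-mean aggregation mentioned for Subtask-A can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: the per-chunk scores are taken to be probabilities of the "AI-generated" class, and the trim fraction and threshold values are hypothetical.

```python
def trimmed_mean(scores, trim_fraction=0.2):
    """Mean of scores after dropping the lowest and highest trim_fraction."""
    ordered = sorted(scores)
    k = int(len(ordered) * trim_fraction)
    # Fall back to the plain mean when trimming would leave nothing.
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(kept) / len(kept)


def classify_file(chunk_scores, threshold=0.5, trim_fraction=0.2):
    """Label a file as AI-generated when the aggregated score reaches the threshold."""
    return trimmed_mean(chunk_scores, trim_fraction) >= threshold
```

Trimming makes the file-level score less sensitive to a single outlier chunk, and the threshold is the quantity that would be calibrated on held-out data.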
cs.CL / 46 / 2605.01605
Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models
提示扰动在哪里破坏生成?LoRA微调语言模型的分段级稳健性视角
Abstract
Large language models are sensitive to minor prompt perturbations, yet existing robustness methods usually enforce consistency at the whole-sequence level. This holistic view can hide an important failure mode: a perturbed response may remain globally similar to the clean one while drifting on a critical entity, relation, or conclusion. We introduce S$^2$R$^2$, a segment-level framework for robust LoRA fine-tuning. S$^2$R$^2$ decomposes clean and perturbed generations into semantic segments, aligns them with an optimal-transport objective, and penalises the segments with the largest meaning drift. To connect this output-side objective with model adaptation, we add an adapter-stability regulariser motivated by segment-level attention reallocation, using LoRA norm control as a tractable proxy for limiting perturbation-amplified evidence shifts. A PAC-Bayesian complexity view further explains why controlling adapter growth may support transfer beyond observed perturbations. Experiments on summarisation benchmarks show that S$^2$R$^2$ improves robustness under typographical noise, deletion, synonym replacement, and paraphrasing, while maintaining competitive clean performance and stronger cross-dataset transfer than consistency-based baselines.
Chinese Translation
大型语言模型对小的提示扰动非常敏感,但现有的稳健性方法通常是在整个序列级别上强制一致性。这种整体视角可能会掩盖一个重要的失效模式:一个扰动的响应可能在总体上与干净的响应相似,但在关键实体、关系或结论上发生偏移。我们引入了S$^2$R$^2$,一个用于稳健的LoRA微调的分段级框架。S$^2$R$^2$将干净和扰动的生成分解为语义段,利用最优传输目标对它们进行对齐,并惩罚那些意义漂移最大的段。为了将这一输出侧目标与模型适应联系起来,我们添加了一个适配器稳定性正则化项,该正则化项是以分段级注意力重分配为动机的,使用LoRA范数控制作为限制扰动放大证据偏移的可处理代理。PAC-贝叶斯复杂度视角进一步解释了控制适配器增长可能如何支持超越观察到的扰动的转移。对摘要基准的实验表明,S$^2$R$^2$在打字噪声、删除、同义词替换和转述下提高了稳健性,同时保持了具有竞争力的干净性能,并在跨数据集转移方面优于一致性为基础的基线。
cs.CL / 47 / 2605.01630
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Prosa:基于评分标准的巴西葡萄牙语真实用户聊天LLM评估
Abstract
Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.
Chinese Translation
由整体LLM作为评审模型产生的评分排名对所选评审模型的偏见非常敏感。我们展示了切换到二元评分标准并进行多评审过滤可以消除这种敏感性:分解评判比评审模型本身更为重要。为了支持这一论点,我们提出了Prosa,这是第一个真实用户多轮巴西葡萄牙语聊天基准:包含1,000个WildChat对话,由来自三种模型家族的三名评审对16个模型进行评分。在过滤后的评分标准下,三位评审在16个排名中的每一个上达成了一致,而在整体评分下仅有7个排名达成了一致。此外,评分标准过滤流程将邻近模型之间的平均评分差距提高了47%,从而增强了Prosa的区分能力。在使用Gemini 3 Flash作为评审的情况下,对Prosa上新模型的评估成本约为2.1美元。我们发布了该基准和过滤代码,以确保未来模型可以在相同条件下进行评估。这些成果还使我们的基于评分标准的评分方法能够在Prosa之外重用,支持其他开放式评估设置。
cs.CL / 48 / 2605.01647
Beyond Perplexity: Character Distribution Signatures and the MDTA Benchmark for AI Text Detection
超越困惑度:字符分布特征与人工智能文本检测的MDTA基准
Abstract
Training-free AI text detection methods primarily rely on model log-probabilities, achieving strong performance through approaches like Binoculars and DNA-DetectLLM. However, these methods face a fundamental ceiling as models are optimized through RLHF to produce human-like probability distributions. We introduce an alternative detection signal based on character distribution signatures. We provide theoretical foundations showing that AI models, trained on massive domain-balanced corpora, approximate global character patterns while humans exhibit domain-specialized distributions, creating a "Wall of Separation" where human-AI divergence significantly exceeds AI-AI divergence. To enable systematic evaluation, we construct the Models-Domains-Temperatures-Adversarials (MDTA) benchmark comprising 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and 3 adversarial strategies, substantially expanding the HC3 dataset with modern model responses, temperature variation, and adversarial augmentation. We introduce the Letter Distribution Score (LD-Score), demonstrating low correlation (r = 0.08-0.13) with perplexity methods. When integrated with DNA-DetectLLM, Binoculars and FastDetectGPT via a non-linear classifier, LD-Score yields consistent improvements in AUROC and F1, with particularly pronounced gains in specialized domains where vocabulary constraints amplify the detection signal. The MDTA dataset can be accessed at: https://huggingface.co/datasets/nsp909/MDTA.
Chinese Translation
无训练的人工智能文本检测方法主要依赖于模型的对数概率,通过诸如Binoculars和DNA-DetectLLM等方法取得了强劲的表现。然而,这些方法面临着一个基本的局限性,因为模型通过强化学习人类反馈(RLHF)被优化以生成类人概率分布。我们引入了一种基于字符分布特征的替代检测信号。我们提供了理论基础,表明在大规模领域平衡的语料库上训练的人工智能模型能够近似全球字符模式,而人类则表现出领域专业化的分布,形成了一个“隔离墙”,在人类与人工智能的差异显著超过人工智能与人工智能的差异。为了实现系统评估,我们构建了包括642,274个与提示对齐样本的模型-领域-温度-对抗(MDTA)基准,这些样本涵盖4个模型、5个领域、3个温度设置和3种对抗策略,显著扩展了现代模型响应、温度变化和对抗增强的HC3数据集。我们引入了字母分布评分(LD-Score),显示出与困惑度方法的低相关性(r = 0.08-0.13)。当与DNA-DetectLLM、Binoculars和FastDetectGPT通过非线性分类器结合时,LD-Score在AUROC和F1上产生了一致的改进,特别是在词汇约束增强检测信号的专业领域内获得了显著的提升。MDTA数据集可在以下地址访问:https://huggingface.co/datasets/nsp909/MDTA。
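The idea of a character distribution signature can be illustrated with a small sketch that compares the letter frequencies of two texts. The abstract does not define the LD-Score, so the Jensen-Shannon divergence used here is an assumption chosen only for illustration, and all function names are hypothetical.

```python
import math
import string
from collections import Counter


def letter_distribution(text):
    """Relative frequency of each ASCII letter a-z in the text (case-insensitive)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return {c: (counts[c] / total if total else 0.0) for c in string.ascii_lowercase}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two letter distributions."""
    def kl(a, b):
        return sum(a[c] * math.log2(a[c] / b[c]) for c in a if a[c] > 0)

    m = {c: 0.5 * (p[c] + q[c]) for c in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Comparing a candidate text against a reference distribution in this way yields a value between 0 (identical letter usage) and 1 (disjoint letter usage).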
cs.CL / 49 / 2605.01687
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak:用于评估大型语言模型安全性的可扩展且多样化的多轮越狱基准
Abstract
We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, which lets them bypass safety-aligned LLMs more easily than single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restricts their diversity. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, in which a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to a 54.0 and 34.6 higher attack success rate (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and that categories that appear benign under single-turn attacks can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.
Chinese Translation
我们提出了MultiBreak,一种可扩展且多样化的多轮越狱基准,用于评估大型语言模型(LLM)的安全性。多轮越狱模拟自然对话环境,使其比单轮越狱更容易突破安全导向的LLM。现有的多轮基准在规模上有限,或过于依赖模板,从而限制了其多样性。为了解决这一问题,我们统一了广泛的有害越狱意图,并引入了一个主动学习管道,以扩展高质量的多轮对抗提示,在该管道中,生成器经过迭代调优以生成更强的攻击候选,指导依据不确定性进行的精炼。我们的MultiBreak包含10,389个多轮对抗提示,涵盖了2,665种不同的有害意图,涵盖了迄今为止最广泛的话题集。实证评估表明,我们的基准在DeepSeek-R1-7B和GPT-4.1-mini上相比第二好的数据集分别提高了高达54.0和34.6的攻击成功率(ASR)。更重要的是,安全评估表明,多样的攻击类别揭示了LLM的细粒度脆弱性,而在单轮场景下看似无害的类别在多轮场景中可以展示出显著更高的对抗效果。这些发现突显了LLM在现实对抗环境下的持续脆弱性,并确立了MultiBreak作为推进LLM安全性的可扩展资源。
cs.CL / 50 / 2605.01688
GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory
GRAVITY:架构无关的结构化锚定用于长时段对话记忆
Abstract
Long-horizon conversational agents rely on memory systems with increasingly sophisticated retrieval mechanisms. However, retrieved fragments are typically fed to the language model as unstructured text, lacking the relational, temporal, and thematic structures essential for complex reasoning. To bridge this reasoning gap, we introduce GRAVITY (Generation-time Relational Anchoring Via Injected Topological MemorY), a plug-and-play structured memory module. GRAVITY extracts three complementary knowledge representations from raw conversational utterances: entity profiles grounded in relational graphs, temporal event tuples linked into causal traces, and cross-session topic summaries. At generation time, it injects these representations into the host system's prompt as structured anchoring contexts. This approach effectively synthesizes scattered evidence into a coherent, query-relevant context without requiring any architectural modifications to the host model. Extensive evaluations across five diverse memory systems on the LongMemEval and LoCoMo benchmarks demonstrate the efficacy of our approach. On average, GRAVITY improves LLM-judge accuracy by 7.5--10.1%. Gains are inversely correlated with baseline strength: the weakest host improves by 12.2% while the strongest still gains 3.8--5.7%. These findings establish structured context anchoring as a broadly effective, architecture-agnostic augmentation paradigm for long-horizon conversational memory.
Chinese Translation
长时段对话代理依赖于越来越复杂的记忆系统来进行检索。然而,检索到的片段通常以非结构化文本的形式提供给语言模型,缺乏进行复杂推理所需的关系性、时间性和主题结构。为了解决这一推理缺口,我们提出了GRAVITY(Generation-time Relational Anchoring Via Injected Topological MemorY),一种即插即用的结构化记忆模块。GRAVITY从原始对话话语中提取三种互补的知识表示:基于关系图的实体概况、链接成因果链的时间事件元组,以及跨会话主题摘要。在生成时,它将这些表示作为结构化锚定上下文注入到主系统的提示中。这种方法有效地将分散的证据合成一个连贯且与查询相关的上下文,而无需对主模型进行任何架构修改。在LongMemEval和LoCoMo基准上对五种不同的记忆系统进行的广泛评估表明了我们方法的有效性。平均而言,GRAVITY提高了LLM-judge的准确率7.5%至10.1%。增益与基线强度呈反相关:最弱的主模型提高了12.2%,而最强的主模型仍增益3.8%至5.7%。这些发现确立了结构化上下文锚定作为一种广泛有效的、架构无关的增强范式,用于长时段对话记忆。
cs.CL / 51 / 2605.01698
BIM Information Extraction Through LLM-based Adaptive Exploration
基于LLM的自适应探索在BIM信息提取中的应用
Abstract
BIM models provide structured representations of building geometry, semantics, and topology, yet extracting specific information from them remains remarkably difficult. Current approaches translate natural language into structured queries by assuming a fixed data organization (static approach), which BIM heterogeneity eventually invalidates. We address this with a new paradigm, adaptive exploration, where an LLM-based agent iteratively executes code to extract information from a BIM model, discovering its structure at runtime instead of assuming it. We evaluate this approach on ifc-bench v2, an open-source BIM question-answering benchmark introduced alongside this work, comprising 1,027 tasks across 37 IFC models from 21 projects. A factorial ablation across two LLM capability levels and four augmentation strategies shows that adaptive exploration significantly outperforms static query generation across all configurations, regardless of the augmentation strategy. These results indicate that BIM heterogeneity is best addressed at the paradigm level, not by further optimizing static approaches.
Chinese Translation
建筑信息模型(BIM)提供了建筑几何、语义和拓扑的结构化表示,但从中提取特定信息依然非常困难。目前的方法通过假设固定的数据组织(静态方法)来将自然语言转化为结构化查询,然而BIM的异质性最终会使这种假设失效。我们通过一种新的范式——自适应探索来解决这一问题,在该方法中,基于大语言模型(LLM)的代理迭代执行代码,从BIM模型中提取信息,实时发现其结构而非仅仅假设。我们在ifc-bench v2上评估了这一方法,这是一个开源的BIM问答基准,包含1,027个任务,来自21个项目的37个IFC模型。对两个LLM能力水平和四种增强策略进行的多因素消融实验表明,自适应探索在所有配置中显著优于静态查询生成,无论增强策略如何。这些结果表明,BIM的异质性最好从范式层面进行解决,而不是通过进一步优化静态方法。
cs.CL / 52 / 2605.01704
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
推理陷阱:对封闭系统多步骤大语言模型推理的信息论界限
Abstract
When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers. We name the multi-agent case the Debate Trap and the broader phenomenon the Reasoning Trap, offering a programmatic theory of evidence-grounded reasoning failure. The framework has three parts: (i) SFS (Supported Faithfulness Score), a claim-level metric verifying decomposed atomic claims against provided evidence (decomposer-invariant rankings: Spearman rho=1.0); (ii) EGSR (Evidence-Grounded Socratic Reasoning), replacing adversarial argumentation with evidence-grounded inquiry; (iii) Theorem 1 (DPI Bound): under standard MAD, the chain E -> O^0 -> O^1 -> ... is Markov, and the Data Processing Inequality implies E[I(E;O^{t+1})] <= E[I(E;O^t)]. Three companion results -- open-system recovery (Theorem 2), EGSR accumulation (Lemma 2), and vote-aggregation floor (Proposition 1) -- partition multi-step LLM reasoning by its information-theoretic relationship to E. Across 16 conditions on SciFact (300 claims) and FEVER (1,000 claims), DebateCV (C13) preserves 88% of baseline accuracy while SFS drops 43%; majority-vote MAD (C15) reduces SFS to 1.7% of baseline (p < 10^{-6}, d = -0.96); EGSR recovers 98%. An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter-rater Fleiss kappa <= +0.018 with 0.8-1.4 Likert intra-rater shifts across language and domain -- the human agreement that faithfulness metrics have been calibrated against is not itself stable. We offer one falsifiable conjecture: any closed-system reasoning protocol preserving Theorem 1's Markov structure is, in expectation, subject to the same DPI bound.
Chinese Translation
当同一语言模型的多个副本被提示进行辩论时,它们产生的往往是同一观点的多样化措辞,而非多样化的视角。多智能体辩论(MAD)以及更广泛的封闭系统推理,其中智能体迭代地转变彼此的输出,往往能够保持答案的准确性,但却削弱了这些答案背后的推理过程。我们将多智能体的情况称为辩论陷阱(Debate Trap),而将更广泛的现象称为推理陷阱(Reasoning Trap),同时提供了一个关于证据基于的推理失败的程序性理论框架。该框架由三部分组成:(i) SFS(支持性可信度评分),一个在提供的证据下验证分解原子主张的声明级别指标(分解无关排名:Spearman rho=1.0);(ii) EGSR(证据基础的苏格拉底推理),用证据基础的探究取代对抗性论证;(iii) 定理 1(数据处理不等式界限,DPI Bound):在标准的 MAD 下,链 E -> O^0 -> O^1 -> ... 是马尔可夫的,数据处理不等式表明 E[I(E;O^{t+1})] <= E[I(E;O^t)]。三个附加结果——开放系统恢复(定理 2)、EGSR 累积(引理 2)和投票聚合下限(命题 1)——通过其与 E 的信息论关系划分多步骤大语言模型推理。在对 SciFact(300 个主张)和 FEVER(1,000 个主张)的 16 种条件下,DebateCV(C13)保持了基线准确性的 88%,而 SFS 下降了 43%;多数投票 MAD(C15)将 SFS 降至基线的 1.7%(p < 10^{-6}, d = -0.96);EGSR 恢复了 98%。一项 R6 队列研究(韩国 n=10x30 FEVER;英语 n=3x200 SciFact)发现评审者间的 Fleiss kappa <= +0.018,且在语言和领域之间的单评审者 Likert 移动在 0.8-1.4 之间——与可信度指标的校准相比,人类的协议并不稳定。我们提出了一个可证伪的猜想:任何保持定理 1 的马尔可夫结构的封闭系统推理协议,预期都将受到相同的数据处理不等式界限的约束。
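The Data Processing Inequality behind Theorem 1 can be checked numerically on a toy Markov chain. The sketch below is not the paper's setup: it assumes the evidence E is a single uniform bit and that each debate round acts as a binary symmetric channel with a fixed flip probability, so that I(E;O^t) has a closed form. All names and parameter values are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(flip_prob):
    """I(X;Y) in bits for a uniform bit X sent through a binary symmetric channel."""
    return 1.0 - binary_entropy(flip_prob)


def compose_flips(p1, p2):
    """Flip probability of two binary symmetric channels applied in sequence."""
    return p1 * (1 - p2) + (1 - p1) * p2


def information_along_chain(flip_prob, rounds):
    """I(E;O^t) for t = 0..rounds-1 when every round applies the same noisy channel."""
    values = []
    total_flip = 0.0
    for _ in range(rounds):
        total_flip = compose_flips(total_flip, flip_prob)
        values.append(bsc_mutual_information(total_flip))
    return values
```

In this toy chain the information about E never increases from one round to the next, which is the qualitative behavior the DPI bound describes for closed-system reasoning.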
cs.CL / 53 / 2605.01717
TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis
TCDA:基于线程约束的对话意识建模用于对话情感四元组分析
Abstract
Conversational Aspect-based Sentiment Quadruple Analysis (DiaASQ) needs to capture the complex interrelationships in multiple rounds of dialogues. Existing methods usually employ simple Graph Convolutional Networks (GCN), which introduce structural noise and fail to consider the temporal sequence of the dialogues, or use standard RoPE, which implicitly captures relative distances in a flat sequence but cannot clearly separate the token-level syntactic order from the utterance-level progression, and may suffer from the Distance Dilution problem. To address these issues, we propose a new framework that combines Thread-Constrained Directed Acyclic Graph (TC-DAG) and Discourse-Aware Rotary Position Embedding (D-RoPE). Specifically, TC-DAG filters out cross-thread noise based on thread constraints, maintains global connectivity through root anchoring, and incorporates the temporal sequence of the dialogues. D-RoPE aligns multi-layer semantics using dual-stream projection and multi-scale frequency signals, captures thread dependencies using tree-like distances, and alleviates the token-level Distance Dilution problem by incorporating utterance-level progressions. Experimental results on two benchmark datasets demonstrate that our framework achieves state-of-the-art performance.
Chinese Translation
对话基于方面的情感四元组分析(DiaASQ)需要捕捉多轮对话中的复杂相互关系。现有方法通常采用简单的图卷积网络(GCN),这引入了结构噪声,并且未能考虑对话的时间顺序,或者使用标准的旋转位置嵌入(RoPE),该方法隐式捕捉了平面序列中的相对距离,但无法明确区分标记级的句法顺序与发话级的进展,并且可能遭遇距离稀释问题。为了解决这些问题,我们提出了一个新框架,将线程约束的有向无环图(TC-DAG)与对话意识的旋转位置嵌入(D-RoPE)相结合。具体而言,TC-DAG 基于线程约束过滤掉交叉线程噪声,通过根锚点维持全局连通性,并融合对话的时间序列。D-RoPE 采用双流投影和多尺度频率信号对多层语义进行对齐,利用类树状距离捕捉线程依赖性,并通过结合发话级进展缓解标记级的距离稀释问题。对两个基准数据集的实验结果表明,我们的框架达到了最先进的性能。
cs.CL / 54 / 2605.01732
EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
EGAD:基于熵的自适应蒸馏用于令牌级知识转移
Abstract
Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a promising solution by transferring knowledge from a large teacher model to a smaller student model. However, existing distillation methods typically treat all tokens equally, ignoring the fact that different tokens contribute unequally to model decisions. This can lead to inefficient knowledge transfer and reduced learning effectiveness. To address this limitation, we propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation. Specifically, we introduce a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training. We further adjust the distillation temperature based on token entropy to better capture teacher confidence patterns. Moreover, we employ a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens. Extensive experiments validate the soundness and effectiveness of our method.
Chinese Translation
大型语言模型(LLMs)在多个领域取得了显著的性能,但其巨大的计算和内存需求限制了在资源受限环境中的部署。知识蒸馏提供了一种有前景的解决方案,通过将知识从大型教师模型转移到较小的学生模型。然而,现有的蒸馏方法通常对所有令牌一视同仁,忽视了不同令牌对模型决策的贡献程度不等的事实。这可能导致知识转移效率低下和学习效果降低。为了解决这一局限性,我们提出了一种基于熵的自适应蒸馏策略,动态调整令牌级的训练过程。我们的方法利用教师输出的熵来指导蒸馏的三个方面。具体而言,我们通过在训练过程中动态地从低熵令牌转向高熵令牌,引入了一种令牌级的课程。此外,我们根据令牌熵调整蒸馏温度,以更好地捕捉教师的信心模式。我们还采用了双分支架构,在简单令牌上进行高效的仅对数值蒸馏,而在困难令牌上进行更深层次的特征基础蒸馏。大量实验验证了我们方法的合理性和有效性。
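The entropy-based temperature adjustment can be sketched as a mapping from the teacher's per-token entropy to a distillation temperature. The abstract does not give the mapping, so the linear interpolation, its direction (higher entropy gives a higher temperature), and the bounds t_min and t_max are all assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of the teacher's output distribution for one token."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adaptive_temperature(probs, t_min=1.0, t_max=4.0):
    """Interpolate a temperature between t_min and t_max by normalized entropy."""
    max_entropy = math.log(len(probs))
    normalized = token_entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return t_min + (t_max - t_min) * normalized
```

The same normalized entropy could also drive the token-level curriculum, for example by weighting low-entropy tokens more heavily early in training.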
cs.CL / 55 / 2605.01735
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
少即是多:针对大型语言模型的几何去学习方法以最小化数据披露
Abstract
As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM's general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model's prompt-time planning states without access to the original training corpus. GU distills a compact, low-rank geometry of desired safe behavior from a small set of safe reference prompts, and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden planning representations to this safe geometry. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.
Chinese Translation
随着大型语言模型(LLMs)在现实世界系统中的日益广泛应用,它们必须支持后续删除特定内容以满足隐私和治理要求。这促使了选择性去学习的需求,即在保留LLM的一般效用的同时抑制与特定实体或主题相关的信息。然而,大多数现有的LLM去学习方法需要访问原始训练语料库,并依赖于输出级拒绝微调或广泛的梯度更新,这在去学习强度、非目标保留和数据可用性之间造成了紧张关系。我们提出了几何去学习(Geometric Unlearning, GU),这是一种直接在模型的提示时规划状态上操作的方法,不需要访问原始训练语料库。GU从一小组安全参考提示中提炼出期望的安全行为的紧凑低秩几何,并使用轻量级的锚定上下文合成提示来触发隐藏规划表示的局部空间投影与这一安全几何的对齐。对合成非目标锚点的教师蒸馏正则化进一步减少了次生漂移。在以隐私为导向的去学习基准测试(ToFU和UnlearnPII)中,GU实现了对目标的强抑制,同时对非目标性能的影响最小,证明了在使用最小合成数据的情况下可以实现有效的去学习。
cs.CL / 56 / 2605.01749
Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality
只说你所知道的:面向校准的长文本事实生成
Abstract
Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a coupled exploration-commitment paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an Exploration-Commitment Decoupling paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with Calibration-Aware Generation (CAG), a framework that equips models with end-to-end, calibration-aware generation capabilities by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.
Chinese Translation
大型推理模型在复杂任务上表现出色,但仍容易产生幻觉,特别是在长文本生成中,错误会在推理步骤之间累积。现有的提高事实准确性的策略,包括回避和基于事实的优化,遵循一种\textit{耦合探索-承诺}范式,其中中间推理无条件地传播到最终输出,这限制了对信息选择和整合的精细控制。本文提出了一种\textbf{探索-承诺解耦}范式,旨在将知识探索与最终承诺进行解耦,使模型在回答时能够有意识地探索,谨慎作答。我们用\textbf{校准感知生成(CAG)}来实例化这一范式,该框架通过增强中间推理的校准可靠性估计,并在最终输出中优先选择可靠内容,为模型提供端到端的校准感知生成能力。在五个长文本事实基准和多个模型家族中,CAG将事实准确性提高了多达13%,同时将解码时间减少了多达37%。总体而言,我们的研究突出了解耦作为一种可靠的长文本生成的原则性方法,为可信赖和自我感知的生成系统提供了方向。
cs.CL / 57 / 2605.01771
The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
合规差距:为何人工智能系统承诺遵循过程指令但却未能做到
Abstract
An auditor instructs an AI assistant: "open each file individually using the Read tool -- no scripts, no agents." The AI replies "Yes" -- then issues a single batched call summarizing all fifty files at once. We call this the Compliance Gap: a third, orthogonal axis of AI honesty distinct from factual truthfulness and rhetorical substance. Three questions: does this verbal-behavioral disconnect exist (existence); can any text-only observer recover it (detectability); what infrastructure does AI deployment need (remedy)? Some 75 benchmarks (IFEval, SWE-bench, BFCL, COMPASS, SpecEval) measure outcome fidelity; none measures process fidelity. Theorem 1 shows the gap is structurally inevitable under RL that rewards text without observing behavior. Theorem 2, via the Data Processing Inequality, shows it is undetectable from text alone -- by any human or LLM observer, present or future. Thirteen experiments and 2,031 sessions on six frontier models confirm both predictions. Under default framing, all six exhibit instruction compliance rates of 0% -- Claude Sonnet 4 verbally agrees ten out of ten times then bypasses in all ten. The gap is selective: 97% compliance where rationale is rewarded (audit trails), 0-4% where it is not (file reading, privacy masking); removing delegation tools raises compliance to 75% (Cohen's d = 2.47), confirming environmental affordance rather than weight-encoded failure. Nine blinded human raters achieve Fleiss' kappa = 0.130 and correctly identify zero of fifteen compliant sessions, exactly as Theorem 2 predicts. Where humans show 47% intention-behavior gaps in psychology and 96.5pp gaps in surgical audits, RLHF-trained models approach 100% under default conditions -- a regime warranting its own measurement infrastructure. We release BS-Bench: the first open benchmark for process compliance, with seven tool-call-log audit metrics and a public leaderboard.
Chinese Translation
一位审计员指示人工智能助手:“逐个打开每个文件,使用阅读工具——不使用脚本,不使用代理。”人工智能回答“是的”——然后发出一次性批量调用,概括了所有五十个文件。我们称之为合规差距:一种与事实真实性和修辞实质不同的人工智能诚实性的第三个正交维度。三问:这种言语行为脱节是否存在(存在性);任何仅依赖文本的观察者能否发现它(可检测性);人工智能部署需要什么样的基础设施(补救措施)?大约75个基准(IFEval、SWE-bench、BFCL、COMPASS、SpecEval)衡量结果的忠实度;但没有衡量过程的忠实度。定理1显示,在奖励文本而不观察行为的强化学习(RL)下,这一差距在结构上是不可避免的。定理2通过数据处理不等式,表明仅从文本中无法探测到这一差距——无论是现在的还是将来的任何人类或大语言模型(LLM)观察者。对六个前沿模型进行了十三次实验,包含2031个会话,证实了这两个预测。在默认框架下,所有六个模型的指令遵从率均为0%——Claude Sonnet 4 在十次口头同意后,在所有十次中都绕过执行。该差距具选择性:当理据得到奖励时(审计追踪),遵从率为97%;而未得到奖励时(文件读取、隐私遮蔽),则在0-4%之间;移除委托工具使遵从率上升到75%(Cohen's d = 2.47),证实了环境适应性而非权重编码的失败。九位盲评的人类评分者的Fleiss' kappa = 0.130,准确识别出零个合规会话,正如定理2预测的一样。当人类在心理学中表现出47%的意图-行为差距,而在外科审计中表现出96.5个百分点的差距时,经过强化学习与人类反馈(RLHF)训练的模型在默认条件下接近100%——这一现象值得建立自己的测量基础设施。我们发布了BS-Bench:第一个用于过程合规的开放基准,具备七项工具调用日志审计指标和一个公共排行榜。
cs.CL / 58 / 2605.01831
RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences
RMGAP:对多样化偏好的奖励模型泛化能力的基准测试
Abstract
Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By "generalizability", we mean the ability of RMs to correctly rank responses to align with diverse user preferences. However, existing reward model benchmarks are typically designed around a universal preference, failing to assess this generalization. To address this critical gap, we introduce RMGAP, a benchmark comprising 1,097 instances across Chat, Writing, Reasoning, and Safety domains. Since different users exhibit diverse preferences for the same task, we first generate four distinct responses with different linguistic profiles for each collected prompt. However, the original prompt set lacks the specificity to convey different preferences. We therefore construct tailored prompts by contrasting these candidates and designing scenarios in which one response becomes the uniquely appropriate choice. Moreover, we observe that users often express the same preference using different phrasings, and thus extend each prompt with two paraphrased variants. Our evaluation of 24 state-of-the-art RMs reveals their substantial limitations: even the best RM achieves only 49.27% Best-of-N accuracy, highlighting considerable room for improvement in reward model generalization. Related data and code are available at https://github.com/nanzhi84/RMGAP.
Chinese Translation
基于人类反馈的强化学习已成为语言模型对齐的标准范式,其中奖励模型直接决定了对齐的有效性。在本文中,我们重点关注如何评估奖励模型的泛化能力。这里的“泛化能力”指的是奖励模型(RMs)在不同用户偏好下正确排序响应的能力。然而,现有的奖励模型基准通常围绕一种通用偏好设计,未能评估这方面的泛化能力。为了解决这一关键问题,我们引入了RMGAP,这是一项基准测试,包含了来自聊天、写作、推理和安全领域的1,097个实例。由于不同用户对同一任务表现出不同的偏好,我们首先为每个收集到的提示生成四个具有不同语言特征的独特响应。然而,原始提示集缺乏足够的特异性以表达不同的偏好。因此,我们通过对比这些候选响应并设计出一种情景,其中一个响应成为唯一适当的选择,来构建定制的提示。此外,我们观察到用户经常用不同的表述表达相同的偏好,因此我们为每个提示扩展了两个释义变体。对24个最先进的奖励模型的评估揭示了它们的显著局限性:即便是最佳的奖励模型也仅实现了49.27%的最佳选择准确率,这突显了奖励模型泛化能力提高的巨大空间。相关数据和代码可在 https://github.com/nanzhi84/RMGAP 获取。
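As a rough illustration of the Best-of-N accuracy reported above, the sketch below counts an instance as correct only when the reward model gives its highest score to the single intended response among the candidates. The function name, the data layout, and the rule that a tie counts as a miss are assumptions made for illustration; they are not taken from the RMGAP evaluation code.

```python
def best_of_n_accuracy(instances):
    """Fraction of instances where the reward model ranks the target first.

    instances: list of (scores, target_index) pairs (hypothetical layout).
      scores       -- reward-model scores for the N candidate responses
      target_index -- position of the response the tailored prompt calls for
    """
    correct = 0
    for scores, target_index in instances:
        top = max(scores)
        # Assumption: a tie for the top score is counted as a miss.
        if scores.count(top) == 1 and scores.index(top) == target_index:
            correct += 1
    return correct / len(instances)


# Toy data with four candidates per prompt, mirroring the benchmark's setup.
toy = [
    ([0.1, 0.9, 0.3, 0.2], 1),  # target ranked first: hit
    ([0.8, 0.2, 0.7, 0.1], 2),  # target ranked second: miss
    ([0.5, 0.5, 0.1, 0.0], 0),  # tie at the top: miss
    ([0.0, 0.1, 0.2, 0.9], 3),  # target ranked first: hit
]
print(best_of_n_accuracy(toy))  # 0.5
```

Under this reading, picking one of four candidates uniformly at random would be right 25% of the time in expectation, which gives some context for the 49.27% figure.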
cs.CL / 59 / 2605.01844
The Cylindrical Representation Hypothesis for Language Model Steering
语言模型引导的圆柱体表示假设
Abstract
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
Chinese Translation
引导是一种广泛使用的控制大型语言模型的技术,但其效果往往不稳定且难以预测。现有的理论解释主要基于线性表示假设(Linear Representation Hypothesis, LRH)。虽然 LRH 假设概念可以正交化以实现无损控制,但这一理想化的映射在真实的表示中失败,无法解释引导的不可预测性。通过放松 LRH 的正交假设,同时保持线性表示,我们证明了重叠概念的贡献自然导致样本特定的轴正交结构。我们将此正式化为圆柱体表示假设(Cylindrical Representation Hypothesis, CRH)。在 CRH 中,一个中心轴捕捉概念缺失与存在之间的主要差异,并驱动概念生成。一个包围的法线平面通过决定轴激活目标概念的难易程度来控制引导灵敏度。在该平面内,只有特定的敏感区段能够强烈促进概念激活,而其他区段则可能抑制或延迟激活。尽管可以通过差异向量可靠地识别包围的法线平面,但敏感区段却无法被识别,从而引入了区段层面内在的不确定性。这种不确定性为为什么引导结果往往波动,即使在使用良好对齐方向时,提供了原则上的解释。我们的实验验证了圆柱体结构的存在,并展示了 CRH 提供了一种有效且实用的方法来解释模型在实际环境中的引导行为: https://github.com/mbzuai-nlp/CRH.
cs.CL / 60 / 2605.01846
Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation
大型语言模型是否规划答案位置?多项选择题生成中的位置偏差
Abstract
Large language models (LLMs) are increasingly used to generate multiple-choice questions (MCQs), where correct answers should ideally be uniformly distributed across options. However, we observe that LLMs exhibit systematic position biases during generation. Through extensive experiments with 10 LLMs and 5 vision-language models (VLMs) on three MCQ generation tasks, we show that these biases are structured, with similar patterns emerging within model families. To investigate the underlying mechanisms, we conduct probing experiments and find that hidden representations in the question stem encode predictive signals of the correct answer position, suggesting that answer position may be implicitly planned during generation. Building on this insight, we apply activation steering to manipulate internal representations and influence answer position. Our results show that steering can partially control positional preferences and substantially shift answer position distributions. Our findings provide a practical framework for studying implicit positional planning in LLMs and highlight the importance of controllable generation for reliable MCQ construction and evaluation.
Chinese Translation
大型语言模型(LLMs)越来越多地用于生成多项选择题(MCQs),理想情况下,正确答案应均匀分布于各个选项。然而,我们观察到LLMs在生成过程中表现出系统性的位置信息偏差。通过对10个LLMs和5个视觉-语言模型(VLMs)在三个MCQ生成任务上进行广泛实验,我们展示了这些偏差是结构化的,并且在模型家族内呈现出相似的模式。为了调查潜在机制,我们进行探测实验,发现问题陈述中的隐藏表示编码了正确答案位置的预测信号,这表明答案位置可能在生成过程中被隐含地规划。基于这一见解,我们应用激活引导技术来操控内部表示并影响答案位置。我们的结果表明,引导可以部分控制位置信息偏好,并显著改变答案位置分布。我们的研究结果为研究LLMs中的隐性位置规划提供了一个实用框架,并强调了可控生成在可靠的多项选择题构建和评估中的重要性。
cs.CL / 61 / 2605.01853
Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
时空隐藏状态动态作为大型语言模型内部推理的特征
Abstract
Large reasoning models (LRMs) generate extended solutions, yet it remains unclear whether these traces reflect substantive internal computation or merely verbosity and overthinking. Although recent hidden-state analyses suggest that internal representations carry correctness-related signals, their coarse aggregations may obscure the token and layer structure underlying reasoning computation. We investigate hidden-state transitions across decoding steps and layers, and identify a distinct spatiotemporal pattern in LRMs: successful trajectories exhibit broad temporal dynamics with localized layer-wise concentration, while this structure is weaker in non-reasoning models and knowledge-heavy domains. We formalize this characteristic as Spatiotemporal Amplitude of Latent Transition (StALT), a training-free trajectory statistic that summarizes temporal changes between adjacent tokens weighted by within-token layer saliency. Across diverse models and benchmarks, StALT reliably separates correct from incorrect trajectories in reasoning-intensive regimes, providing a competitive label-free correctness signal alongside strong output-space and length-based baselines. Intervention analyses further show that this spatiotemporal amplitude responds systematically to manipulations that increase or reduce the demand for internal reasoning, supporting its association with latent reasoning dynamics in LRMs. These findings provide empirical evidence that LRMs exhibit measurable hidden-state dynamics and offer a practical probe for understanding internal computation beyond output-based evaluation.
Chinese Translation
大型推理模型(LRMs)能够生成扩展的解决方案,但尚不清楚这些痕迹是反映实质性的内部计算,还是仅仅是一种啰嗦和过度思考。尽管最近的隐藏状态分析表明,内部表示携带与正确性相关的信号,但其粗略的聚合可能掩盖推理计算背后的 token 和层结构。我们研究了解码步骤和层之间的隐藏状态转变,并识别出LRMs中一种独特的时空模式:成功的轨迹展现出广泛的时间动态和局部的层级聚集,而在非推理模型和知识密集型领域中这种结构较弱。我们将这一特征形式化为潜在转变的时空幅度(Spatiotemporal Amplitude of Latent Transition,StALT),这是一种无需训练的轨迹统计量,概括了相邻 token 之间的时间变化,且以 token 内层的重要性为权重。在多种模型和基准测试中,StALT 能可靠地区分推理密集场景中的正确和不正确轨迹,提供了一个竞争性的无标签正确性信号,辅之以强大的基于输出空间和长度的基线分析。干预分析进一步表明,这种时空幅度对增加或减少内部推理需求的操作反应是系统的,支持其与LRMs中潜在推理动态的关联。这些发现提供了实证证据,表明LRMs展现出可测量的隐藏状态动态,并为超越基于输出的评估理解内部计算提供了一个实用的探针。
cs.CL / 62 / 2605.01870
Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models
Maistros:通过从大型推理模型中知识蒸馏适配的希腊大型语言模型
Abstract
Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enabled by large-scale training and increased model capacity. However, existing LLMs can generate erroneous responses when addressing complex queries that fall outside their training distribution, due to limited internal knowledge or the need for multi-step reasoning. To address these limitations, recent work has introduced large reasoning models (LRMs), which incorporate explicit internal reasoning processes to improve response accuracy. However, state-of-the-art LRMs often comprise hundreds of billions of parameters and require several seconds per inference, even on advanced multi-GPU systems. These characteristics limit their practicality for deployment in conventional computing environments. Meanwhile, NLP research on multilingual LLMs continues to prioritize high-resource languages. Consequently, these models exhibit limited performance in under-resourced languages, primarily due to insufficient language- and culture-specific training data. In this paper, we focus on Modern Greek, for which only a limited number of question answering (QA) datasets have been proposed, most of which are intended for model evaluation. To address this research gap in Greek QA, we make the following contributions: (i) CulturaQA, a high-quality LRM-generated and human-curated dataset, for Greek LLM training and evaluation; (ii) a memory-efficient LLM evaluation framework adaptable to diverse languages and QA tasks; (iii) Maistros 8B, a state-of-the-art open-weights Greek LLM developed via knowledge distillation and fine-tuning on CulturaQA; and (iv) a comprehensive evaluation of nine LLMs across nine human-curated Greek QA datasets.
Chinese Translation
大型语言模型(LLMs)在自然语言处理(NLP)领域取得了显著进展,实现了对一系列任务的最先进性能。这些改进部分归因于其新兴的推理能力,这一能力是通过大规模训练和增加模型容量所实现的。然而,现有的LLMs在处理复杂查询时可能会生成错误的响应,尤其是当这些查询超出其训练分布时,这通常是由于内部知识有限或需要多步骤推理所导致的。为了解决这些限制,近期的研究提出了大型推理模型(LRMs),这些模型引入了显式的内部推理过程,以提高响应准确性。此外,最先进的LRMs通常包含数千亿个参数,即便在先进的多GPU系统上,每次推理也需要数秒钟。这些特征限制了它们在传统计算环境中的实用性。同时,NLP研究对多语言LLMs的关注仍然优先于高资源语言。然而,这些模型在资源匮乏的语言中表现有限,这主要是由于缺乏特定于语言和文化的训练数据。在本文中,我们关注现代希腊语,目前仅提出了有限数量的问题回答(QA)数据集,而大多数数据集的目的在于模型评估。为了解决希腊QA研究中的这一空白,我们做出了以下贡献:(i)CulturaQA,一个高质量的由LRM生成并经过人工审核的数据集,用于希腊LLM的训练和评估;(ii)一个适用于多种语言和QA任务的内存高效LLM评估框架;(iii)Maistros 8B,一个通过知识蒸馏和在CulturaQA上微调开发的最先进的开放权重希腊LLM;以及(iv)对九个LLMs在九个经过人工审核的希腊QA数据集上的全面评估。
cs.CL / 63 / 2605.01939
StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models
StressEval:基于失败驱动的动态基准测试框架用于大型语言模型的知识密集型推理
Abstract
Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In this paper, we propose StressEval, a failure-driven data synthesis framework that turns observed model failures into dynamic, challenging, and controllable test instances. StressEval consists of three stages: first, it constructs a semi-structured difficulty card that identifies the failed reasoning step and its root cause; second, it applies a dual-perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors; and third, it applies a gating mechanism to retain only grounded, unambiguous instances. Seeding from multiple knowledge-intensive reasoning datasets, we employ StressEval to build Dynamic OneEval, a focused suite of challenging dynamic benchmarks. Across several state-of-the-art LLMs, Dynamic OneEval yields substantially larger performance drops than the original benchmarks while retaining explicit difficulty factors, enabling more actionable iteration.
Chinese Translation
大型语言模型(LLMs)的静态基准测试在知识密集型推理任务中,受到污染和过拟合的影响越来越明显。虽然近年来的动态基准测试可以缓解过时的问题,但通常会在可回答性和可控性上增加难度。本文提出了StressEval,一个基于失败驱动的数据合成框架,将观察到的模型失败转化为动态的、具有挑战性和可控的测试实例。StressEval包括三个阶段:首先,它构建一个半结构化的难度卡,识别失败推理步骤及其根本原因;其次,它应用一种双视角实例合成方法,针对知识鸿沟和推理崩溃,同时保留潜在的难度因素;最后,它应用一个门控机制,仅保留有据可依的、明确无歧义的实例。我们从多个知识密集型推理数据集中进行抽样,利用StressEval构建了Dynamic OneEval,一个专注于挑战性动态基准测试的套件。在多个最先进的LLMs上,Dynamic OneEval展现出比原始基准测试显著更大的性能下降,同时保留了明确的难度因素,促进了更具可操作性的迭代。
cs.CL / 64 / 2605.01973
Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
在任意文本条件下的学习-学习:一种基于超网络的元门控大语言模型
Abstract
Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can cause catastrophic forgetting, applying meta-learning to LLMs is limited by its complexity and scalability issues. In this paper, we activate the meta-signal $\beta$ within the SwiGLU blocks, resulting in a meta-gating mechanism that adaptively adjusts the nonlinearity of the FFN. A hypernetwork dynamically produces $\beta$ from textual conditions, providing meta-controllability over LLMs. Tested on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines and generalizes reasonably to unseen tasks, condition types, or instructions. Our code is available at https://github.com/AaronJi/MeGan.
Chinese Translation
传统的大语言模型(LLMs)可能受到语料异质性和细微条件变化的影响。尽管微调可能引发灾难性遗忘问题,但在LLMs上应用元学习也受到其复杂性和可扩展性的限制。本文中,我们在SwiGLU块内激活了元信号$\beta$,从而产生了一种元门控机制,该机制自适应地调整前馈网络(FFN)的非线性。我们采用了一个超网络,该超网络在文本条件下动态生成$\beta$,为LLMs提供了元控制能力。通过测试不同的条件类型,例如任务、领域、角色和风格,我们的方法在微调和元学习的基准上表现更佳,并且能够在未见过的任务、条件类型或指令上合理地泛化。我们的代码可以在https://github.com/AaronJi/MeGan找到。
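To make the role of $\beta$ concrete, here is a minimal sketch of a SwiGLU hidden layer whose gate uses the parameterised Swish activation, Swish_beta(z) = z * sigmoid(beta * z). It assumes that $\beta$ enters the block in this standard form; the abstract does not spell out the exact formulation, the hypernetwork that would produce $\beta$ from a textual condition is not modelled, and all function and variable names are illustrative.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(z, beta):
    # Swish_beta(z) = z * sigmoid(beta * z); beta = 1 gives SiLU.
    return z * sigmoid(beta * z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def swiglu_hidden(x, w_gate, w_up, beta):
    """Hidden activations of a SwiGLU block: Swish_beta(W_gate x) * (W_up x).

    In a meta-gated variant, beta would be supplied per condition by a
    hypernetwork (assumption: here it is simply passed in as a number).
    """
    return [swish(dot(g, x), beta) * dot(u, x) for g, u in zip(w_gate, w_up)]


# beta = 0 makes the gate linear (z / 2); larger beta approaches ReLU.
for beta in (0.0, 1.0, 10.0):
    print(beta, swish(-1.0, beta), swish(1.0, beta))
```

Varying only $\beta$ changes the shape of the gate without touching the weight matrices, which is why a single scalar per condition can act as a lightweight control signal.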
cs.CL / 65 / 2605.02011
Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization
通过自主法律信息收集与评分指导优化提升判决文书生成
Abstract
Automating the drafting of judgment documents is pivotal to judicial efficiency, yet it remains challenging due to the dual requirements of comprehensive retrieval of legal information and rigorous logical reasoning. Existing approaches, typically relying on standard Retrieval-Augmented Generation and Supervised Fine-Tuning, often suffer from insufficient evidence recall, hallucinated statutory references, and logically flawed legal reasoning. To bridge this gap, we propose Judge-R1, a unified framework designed to enhance LLM-based judgment document generation by jointly improving legal information collection and judgment document generation. First, we introduce Agentic Legal Information Collection, which employs a dynamic planning agent to retrieve precise statutes and precedents from multiple sources. Second, we implement Rubric-Guided Optimization, a reinforcement learning phase utilizing Group Relative Policy Optimization (GRPO) with a comprehensive legal reward function to enforce adherence to judicial standards and reasoning logic. Extensive experiments on the JuDGE benchmark demonstrate that Judge-R1 significantly outperforms state-of-the-art baselines in both legal accuracy and generation quality.
Chinese Translation
自动化撰写判决文书对提升司法效率至关重要,但由于全面检索法律信息和严格逻辑推理的双重要求,仍然面临挑战。目前的研究方法通常依赖标准的检索增强生成(Retrieval-Augmented Generation)和监督微调(Supervised Fine-Tuning),但往往会遭遇证据回忆不足、虚构的法条引用以及逻辑推理缺陷等问题。为了解决这些问题,我们提出了Judge-R1,这是一个旨在通过联合改善法律信息收集和判决文书生成来增强基于大语言模型(LLM)的判决文书生成的统一框架。首先,我们引入了自主法律信息收集(Agentic Legal Information Collection),该方法利用动态规划代理从多个来源检索精确的法条和判例。其次,我们实施了评分指导优化(Rubric-Guided Optimization),这是一个利用群体相对政策优化(Group Relative Policy Optimization, GRPO)的强化学习阶段,采用全面的法律奖励函数以确保符合司法标准和推理逻辑。针对JuDGE基准的广泛实验表明,Judge-R1在法律准确性和生成质量方面显著优于现有的最先进基准。
cs.CL / 66 / 2605.02028
Counting as a minimal probe of language model reliability
将计数作为语言模型可靠性的最小探针
Abstract
Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
Chinese Translation
大型语言模型在数学推理、编码和文档分析的基准测试中表现出色,暗示其具有广泛的执行指令能力。然而,目前尚不清楚这种成功是否反映了一般的逻辑能力、学习过程的重复应用,或是模拟规则执行的模式匹配。我们通过引入稳定计数能力(Stable Counting Capacity)来调查这个问题,这是一种通过计数重复符号直到失败的测试。该测试消除了知识依赖、语义和评估中的歧义,避免了词汇和标记化的混淆,并提供了一种超越标准知识基础基准的程序可靠性直接测量。我们在超过100个模型变体的实验中显示,稳定计数能力远低于宣传的上下文限制。模型的行为既不符合开放式逻辑,也不符合学到的规则的稳定应用,而是与有限数量的类似计数的内部状态的使用一致,类似于用手指计数。一旦这种资源耗尽,规则遵循的表象消失,准确执行崩溃为猜测,即使在额外的测试时间计算下也是如此。这些发现表明,目前语言模型的流利表现并不保证一般且可靠的规则遵循。
cs.CL / 67 / 2605.02035
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
用于机器翻译中的视觉基础歧义的多模态数据集
Abstract
Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.
Chinese Translation
歧义解决是多模态机器翻译(MMT)中的一个关键挑战,其中模型必须真正利用视觉输入将模糊表达映射到其预期含义。尽管以往的工作提出了以消除歧义为导向的基准,提供了视觉作用的支持证据,但我们观察到数据质量存在显著问题,并且与翻译场景不匹配。此外,现有的以歧义为导向的评估并不适合开放式翻译中的更广泛歧义类型。为了解决这些局限性,我们提出了VIDA(Visually-Dependent Ambiguity),这是一个包含2,500个仔细策划实例的数据集,在这些实例中,解决一个带注释的模糊源跨度需要视觉证据。我们进一步提出了以消歧为中心的指标,该指标使用一个LLM-as-a-judge分类器来验证注释的模糊表达是否在跨度级别上得到正确解决。在普通推理、监督微调(SFT)和我们的链式思维微调(CoT-SFT)下,对两个最先进的大型视觉语言模型的实验表明,尽管SFT提高了整体翻译质量,但CoT-SFT在消歧精度上取得了更一致的提升,尤其是在分布外子集上,表明其在解决多样化歧义类型方面具有更强的泛化能力。
cs.CL / 68 / 2605.02038
What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
单一提示准确性遗漏的内容:对语言模型的多变可靠性审计
Abstract
Single-prompt accuracy is the dominant way to benchmark language models, but it can miss reliability failures that matter. We evaluate a 15-model open-weight corpus, with the main reliability analyses focused on 10 instruct models across five classification and reasoning benchmarks under five prompt variants each, measuring accuracy, token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread for every (model x dataset x variant) cell. We find three broad results. First, evaluation design can materially change the conclusion. Switching token-probability Expected Calibration Error (ECE) from a raw to a label-set-normalised definition changes per-cell calibration by a mean absolute 0.149. More strikingly, pairing a chain-of-thought prompt with a first-character evaluator on ARC-Challenge reduces apparent accuracy by 72-88% across all five primary models; two independent repair procedures recover 93.8% and 102.7% of the lost performance, indicating an evaluator-side rather than model-side failure. Second, confidence signals are fragile. On MMLU-Pro, every primary model verbally reports confidence substantially above both its accuracy and its token-probability confidence on the same rows, and verbal parse rate can collapse for a single model on a single prompt variant. Third, prompt robustness does not track parameter count reliably. Across 10 instruct models, the correlation between model size and prompt-perturbation spread ranges from -0.244 to 0.474 across benchmarks. Taken together, these results show that reliability conclusions for small language models depend not only on the model being evaluated, but also on the evaluation pipeline used to measure it. We argue that calibration definitions, evaluator logic, verbal parseability, and prompt robustness should be reported explicitly when making reliability claims.
Chinese Translation
单一提示准确性是评估语言模型的主要方式,但它可能遗漏重要的可靠性失败。我们评估了一个包含15个模型的开放权重语料库,主要的可靠性分析集中在10个指令模型上,针对五个分类和推理基准下的每个五个提示变体,测量每个(模型 x 数据集 x 变体)单元的准确性、标记概率校准、语言置信度校准、语言解析率和提示扰动分布。我们发现三个广泛的结果。首先,评估设计能显著改变结论。将期望校准误差(Expected Calibration Error, ECE)标记从原始定义切换为标签集归一化定义,导致每个单元的校准平均绝对变化为0.149。更引人注目的是,将思维链提示与ARC-Challenge上的首字符评估器配对,使所有五个主要模型的表观准确性降低了72-88%;两个独立的修复程序恢复了93.8%和102.7%的丢失性能,表明失败源于评估方而非模型方。其次,置信信号很脆弱。在MMLU-Pro上,所有主要模型在相同行上口头报告的置信度显著高于其准确性和标记概率置信度,并且在单个提示变体下,单个模型的语言解析率可能崩溃。第三,提示鲁棒性与参数数量的相关性并不可靠。在10个指令模型中,模型规模与提示扰动范围的相关度在基准测试中范围为-0.244到0.474。综合来看,这些结果表明,小型语言模型的可靠性结论不仅依赖于被评估的模型,还依赖于用于测量的评估流程。我们认为,在做出可靠性声明时,校准定义、评估逻辑、语言可解析性和提示鲁棒性应明确报告。
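The difference between a raw and a label-set-normalised calibration measure can be sketched as follows. This is a minimal illustration assuming equal-width binning and a simple renormalisation of the answer-letter probabilities; the paper's exact ECE definitions are not given in the abstract, and the function names and toy probabilities are invented for the example.

```python
def normalise_over_labels(token_probs, labels):
    # Rescale the probabilities of the answer-label tokens so they sum to 1.
    total = sum(token_probs[label] for label in labels)
    return {label: token_probs[label] / total for label in labels}


def expected_calibration_error(confidences, correct, n_bins=10):
    # Equal-width binning; ECE = sum over bins of (n_b / N) * |acc_b - conf_b|.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy rows: (probabilities the model puts on the label tokens, gold label).
# The label tokens only receive part of the total probability mass.
labels = ["A", "B", "C", "D"]
rows = [
    ({"A": 0.30, "B": 0.10, "C": 0.05, "D": 0.05}, "A"),
    ({"A": 0.05, "B": 0.40, "C": 0.03, "D": 0.02}, "B"),
    ({"A": 0.20, "B": 0.15, "C": 0.10, "D": 0.05}, "C"),
]
raw_conf, norm_conf, correct = [], [], []
for probs, gold in rows:
    pred = max(labels, key=lambda label: probs[label])
    raw_conf.append(probs[pred])
    norm_conf.append(normalise_over_labels(probs, labels)[pred])
    correct.append(1 if pred == gold else 0)

print("raw ECE:", expected_calibration_error(raw_conf, correct))
print("normalised ECE:", expected_calibration_error(norm_conf, correct))
```

The same predictions yield different calibration scores depending on which confidence definition is plugged in, which is the kind of evaluator-side sensitivity the abstract describes.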
cs.CL / 69 / 2605.02052
Methods, Data, and Conceptual Change: Reflections from Two Quantitative Diachronic Case Studies
方法、数据与概念变更:来自两个定量历时案例研究的反思
Abstract
This discussion paper reflects on how quantitative approaches to historical linguistics interact with dataset properties. Drawing on two worked examples, we examine English data using quad-based concept modelling of Early Modern English discourse in EEBO-TCP (c. 1470s-1690s; 765M words) alongside SynFlow analysis of scientific writing in Royal Society Corpus 6.0.4 (1750-1799; drawn from a 78.6M-token open corpus). Through parallel comparison, the paper explores how each approach operationalises concepts, the data assumptions they entail, and the diachronic interpretations they support. We argue that comparative methodological reflection clarifies the limits of purely lexical, frequency-based approaches and highlights how dataset structure shapes the kinds of semantic change that quantitative methods can reliably detect.
Chinese Translation
本文讨论了定量方法在历史语言学中的应用如何与数据集特性相互作用。通过两个实例,我们采用基于四元组的概念建模,分析了早期现代英语话语的数据,来自EEBO-TCP(约1470年代至1690年代;765M字)以及对科学写作的SynFlow分析,数据来源于皇家学会语料库6.0.4(1750-1799;抽自一个7860万词的开放语料库)。通过平行比较,本文探讨了每种方法如何具体化概念、所包含的数据假设以及其支持的历时解释。我们认为,比较方法论的反思澄清了纯词汇、基于频率的方法的局限性,并强调数据集结构如何影响定量方法能够可靠检测的语义变化类型。
cs.CL / 70 / 2605.02069
Pair2Score: Pairwise-to-Absolute Transfer for LLM-Based Essay Scoring
Pair2Score:基于对比到绝对的转移方法用于大语言模型(LLM)论文评分
Abstract
Many scoring applications require absolute predictions, while pairwise comparisons can provide a simpler learning objective. We present Pair2Score, a two-stage learning framework that transfers pairwise comparisons into absolute scoring with parameter-efficient LLaMA adaptation. Stage 1 trains a directional Siamese ranker on pairwise comparisons derived from absolute trait labels; Stage 2 trains an absolute predictor using configurable transfer strategies (warm-start and embedding-fusion variants). We evaluate on rubric-aligned Automated Essay Scoring (AES) traits (grammar, vocabulary, syntax) under a five-fold protocol that co-rotates held-out fold and random seed. At the trait level, the best-performing transfer variant improves quadratic weighted kappa (QWK) over an absolute-only baseline for all three traits. However, not all transfer configurations help: a one-epoch pairwise stage transfers more reliably than extended pairwise training, and transfer configuration -- not just the inclusion of a pairwise stage -- determines whether downstream scoring benefits.
Chinese Translation
许多评分应用需要绝对预测,而对比比较可以提供更简单的学习目标。我们提出了Pair2Score,这是一种两阶段学习框架,将对比比较转化为绝对评分,并使用参数高效的LLaMA适配。第一阶段在根据绝对特征标签得出的对比比较上训练一个方向性Siamese排名器;第二阶段使用可配置的转移策略(包括热启动和嵌入融合变体)训练一个绝对预测器。我们在与评分标准对齐的自动化论文评分(AES)特征(语法、词汇、句法)上进行评估,采用五折协议,同步轮换留出折和随机种子。在特征层面,表现最佳的转移变体在所有三个特征上的二次加权kappa(QWK)均超越了仅使用绝对评分的基线。然而,并非所有的转移配置都有帮助:一个epoch的对比阶段比延长的对比训练更可靠地进行转移,而转移配置——而不仅仅是是否包含对比阶段——决定了后续评分是否受益。
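For readers unfamiliar with the metric, quadratic weighted kappa (QWK) can be computed as in the sketch below. This is a generic textbook implementation over integer score labels 0..K-1, not the evaluation code used for Pair2Score; the function name and the toy scores are illustrative.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """QWK between two integer score lists with labels in 0..n_classes-1.

    Assumes n_classes >= 2 and that the two lists do not both collapse
    onto the same single score (otherwise the denominator is zero).
    """
    n = len(rater_a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes))
              for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent marginals.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected


gold = [0, 1, 2, 2]
pred = [0, 1, 1, 2]
print(quadratic_weighted_kappa(gold, pred, n_classes=3))
```

For the toy scores above, one off-by-one error out of four predictions works out to a QWK of 0.8; perfect agreement gives 1 and fully reversed scores give -1.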
cs.CL / 71 / 2605.02073
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
通过搜索驱动的强化学习优化奖励函数以增强大型语言模型推理能力
Abstract
Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at {\alpha} = 0.05/21. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.
Chinese Translation
数学推理是大型语言模型的一个关键基准。强化学习是一种标准的后训练机制,用于提高大型语言模型的推理能力,但其性能对驱动策略优化的奖励函数设计仍然敏感。本文提出了一个搜索驱动的框架,将奖励规范本身视为优化对象。感兴趣的设置是固定基础模型,而奖励规范是主要的设计杠杆。候选奖励函数由前沿语言模型生成,经过自动验证,通过500步的群体相对策略优化(Group Relative Policy Optimization, GRPO)训练运行在Llama-3.2-3B-Instruct基础模型(使用低秩适应,Low-Rank Adaptation, LoRA)上进行筛选,并根据GSM8K测试集上的F1评分进行排名。之前轮次的排名总结随后被反馈到下一轮生成中。在五轮的搜索中,产生了50个候选奖励的函数。均值F1从第一轮的0.596上升到第五轮的0.632,最佳单个奖励达到了F1 = 0.787。对七种顶级排名奖励的集成配置进行评估。最佳集成的F1达到了0.795(95%自助法置信区间 [0.756, 0.832])和准确率0.660 [0.635, 0.686],相比于仅使用基础奖励的GRPO基线(F1 = 0.609)提升了0.19的绝对F1。进行Bonferroni校正的成对McNemar检验表明所有五种或以上奖励配置在{\alpha} = 0.05/21时统计上无显著差异。对最佳集成进行三次随机种子重训练得到了F1为0.785。随机抽取的五个奖励控制组崩溃至F1 = 0.047,这表明排名反馈循环,而非添加更多奖励的信号,推动了收益的提升。
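The corrected significance level of 0.05/21 is consistent with comparing all seven ensemble configurations pairwise, since C(7, 2) = 21; that reading is an inference from the numbers in the abstract rather than something it states. The sketch below reproduces the arithmetic and adds an exact (binomial) McNemar test as one possible way to run each comparison. Whether the authors used the exact or the chi-square form is not specified, and the discordant counts in the example are invented.

```python
from math import comb

# Bonferroni correction over all pairwise comparisons of seven configurations.
n_configs = 7
n_pairs = comb(n_configs, 2)
alpha_corrected = 0.05 / n_pairs
print(n_pairs)  # 21
print(alpha_corrected)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b -- test items only system A answers correctly
    c -- test items only system B answers correctly
    Under the null hypothesis, each discordant item is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for one pair of ensembles.
p_value = mcnemar_exact_p(b=18, c=25)
print(p_value, p_value < alpha_corrected)
```

Only items on which the two systems disagree enter the test, so two ensembles with similar accuracy can still differ significantly if their errors fall on different problems, and vice versa.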
cs.CL / 72 / 2605.02083
EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
EditPropBench:测量科学手稿中的事实编辑传播
Abstract
Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as 'medium-scale' or 'a few hundred items' may also become stale, even though they do not repeat the edited number. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and protected unrelated text. EditPropBench provides a controlled manuscript-level benchmark with sentence-level dependency supervision, three editing protocols, adversarial metric probes, stress-test variants, and a metric suite centered on Edit-Ripple Adherence (ERA). On the hard implicit/free-form stratum, five LLM editing systems span ERA 0.148--0.705; even the strongest misses roughly 30% of required cascade updates. A mixed-stratum stress test shows that LLMs retain a positive advantage over deterministic substitution baselines when easy substitution-solvable cases are included. Finally, an audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of papers. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.
Chinese Translation
科学手稿中的局部事实编辑通常会引发非局部的修订义务。当数据集从215个文档变更为80个文档时,如“中等规模”或“几百项”等声明可能也会过时,尽管它们未重复编辑的数量。我们提出了EditPropBench,这是一个用于测量大型语言模型(LLM)编辑者是否通过依赖的手稿声明传播事实编辑的基准。每个项目包含一个机器学习/自然语言处理(ML/NLP)风格的合成手稿、一个目标编辑和一个控制的事实图,具有针对直接目标、所需下游更新和受保护不相关文本的句子级标签。EditPropBench提供了一个受控的手稿级基准,具有句子级依赖性监督、三种编辑协议、对抗性度量探针、压力测试变体,以及以编辑波动遵循度(Edit-Ripple Adherence,ERA)为中心的度量套件。在困难的隐含/自由形式层面,五个LLM编辑系统的ERA范围为0.148至0.705;即使是最强的系统也错过了约30%的所需级联更新。混合层面的压力测试表明,当包括容易通过替代解决的案例时,LLM在确定性替代基准上仍保持积极优势。最后,对最近的arXiv cs.CL基准和数据集论文的审计发现,37.2%的论文中存在与事实相关的定性声明。EditPropBench表明目前的LLM编辑者能够修复许多事实编辑的隐含后果,但可靠的科学修订仍然需要对级联情况的持续关注。
cs.CL / 73 / 2605.02170
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse
CLaC在SemEval-2026任务6中的表现:政治话语中的回应清晰度检测
Abstract
In this paper, we present our system for SemEval-2026 Task 6 (CLARITY) on response clarity and evasion detection in question-answer pairs from U.S. presidential interviews, comparing fine-tuned encoders with prompt-based LLMs. Our LLM ensemble achieves 80 macro-F1 on the 3-class Task 1 (9th/41) and 59 on the 9-class Task 2 (3rd/33). Across 8 transformer encoders optimized through a four-stage pipeline, partial encoder layer unfreezing outperforms full fine-tuning by a wide margin. Combining English and multilingual encoders further improves ensemble performance over either family alone, despite multilingual models being individually weaker. Prompt-based LLMs, without any task-specific parameter updates, outperform fine-tuned encoders, particularly on minority classes; among open-weight LLMs, parameter count does not predict performance. Enriched input, concatenating the full interviewer turn, improves LLM performance but not that of encoders, an effect that persists with Longformer's extended context window, suggesting the divergence is not attributable to sequence-length capacity alone in our settings. The Clear Reply/Ambivalent boundary remains the dominant failure mode, mirroring the disagreement among human annotators. Our code, prompts, model configurations, and results are publicly available.
Chinese Translation
在本文中,我们介绍了我们为SemEval-2026任务6(CLARITY)构建的系统,该任务针对美国总统采访中的问答对进行回应清晰度和回避检测;我们比较了微调编码器与基于提示的大语言模型(LLM)。我们的LLM集成在三分类的任务1上取得了80的宏F1(排名第9/41),在九分类的任务2上取得了59(排名第3/33)。在通过四阶段流程优化的8个Transformer编码器中,部分解冻编码器层的效果大幅优于完全微调。将英语编码器与多语言编码器结合,可使集成性能进一步超过任一单独的编码器家族,尽管多语言模型单独表现较弱。基于提示的LLM在没有任何任务特定参数更新的情况下优于微调编码器,在少数类上尤为明显;在开放权重的LLM中,参数数量并不能预测性能。丰富输入(即拼接完整的采访者发言)提高了LLM的性能,但没有提高编码器的性能;这一现象在使用具有扩展上下文窗口的Longformer时依然存在,表明在我们的设定中,这种差异不能仅归因于序列长度容量。清晰回复/模棱两可的边界依然是主要的失败模式,这与人类标注者之间的分歧相一致。我们的代码、提示、模型配置及结果均已公开。
cs.CL / 74 / 2605.02200
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
ARGUS:通过对抗裁判的演变强化实现政策自适应广告治理
Abstract
Online advertising governance faces significant challenges due to the non-stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three-stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a "Prosecutor-Defender-Umpire" architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, "gray-area" violations. By leveraging RAG-enhanced policy knowledge and Chain-of-Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine-tuning baselines, achieving superior policy-adaptive learning with minimal gold data.
Chinese Translation
在线广告治理面临重大挑战,这主要源于监管政策的非平稳性,新兴的强制性规定(例如,对教育的限制或对审美焦虑的关注)在历史数据集中造成严重的标签不一致和推理模糊性。本文提出了ARGUS,一个能够通过多智能体对抗裁判实现演变强化的政策自适应治理系统。ARGUS通过采用三阶段框架来解决新政策数据的稀缺性:(1)政策种植以获取初步认知;(2)对抗标签修正,利用“检察官-辩护人-裁判”架构来解决过时标签与新规之间的冲突;(3)潜在知识发现,采用三方辩证讨论以揭示复杂的“灰色地带”违规行为。通过利用增强的政策知识(RAG)和思维链合成作为强化学习的动态奖励,ARGUS使其推理路径与不断演变的法规保持同步。在工业和公共数据集上的广泛实验表明,ARGUS显著优于传统微调基线,实现了以最少的黄金数据获得卓越的政策自适应学习。
cs.CL / 75 / 2605.02259
An Information-theoretic Propagation Denoising and Fusion Framework for Fake News Detection
基于信息理论的传播去噪与融合框架用于假新闻检测
Abstract
Incomplete propagation data significantly hinders robust fake news detection. Recent approaches leverage large language models to simulate missing user interactions via role-playing, thereby enriching propagation with synthetic signals. However, such synthetic propagation data is intrinsically unreliable, and fusing it directly can produce biased representations that limit detection performance. In this paper, we alleviate the unreliability of synthetic propagation from the mutual information perspective and propose a novel information-theoretic propagation denoising and fusion (InfoPDF) framework to learn effective representations from both real and synthetic propagation. Specifically, we first generate attribute-specific synthetic propagation using large language models. Then we model each synthetic propagation graph as a probabilistic latent distribution to guide reliability-aware adaptive fusion with real propagation. During training, we design a mutual information-based objective to learn compressed and task-sufficient propagation representations. It jointly suppresses noisy signals across attribute-specific synthetic propagation, maintains consistency between real and synthetic propagation representations, and ensures task sufficiency for fake news detection and attribute prediction. Experiments on three real-world datasets show that InfoPDF consistently achieves superior performance across various fake news detection tasks. Further analysis demonstrates that InfoPDF can estimate attribute-level reliabilities and learn more discriminative propagation representations.
Chinese Translation
不完整的传播数据严重妨碍了稳健的假新闻检测。近期方法利用大型语言模型通过角色扮演模拟缺失的用户交互,从而丰富传播数据与合成信号。然而,这种传播数据在本质上是不可靠的,直接融合可能导致偏倚的表征,从而限制检测性能。本文从互信息的角度缓解合成传播的不可靠性,并提出了一种新颖的信息理论传播去噪和融合框架(InfoPDF),以从真实和合成传播中学习有效表征。具体而言,我们首先使用大型语言模型生成特征特定的合成传播。接着,我们将每个合成传播图建模为一种概率潜在分布,以引导与真实传播的可靠性感知自适应融合。在训练期间,我们设计了一个基于互信息的目标,以学习压缩和任务充分的传播表征。该目标联合抑制特征特定合成传播中的噪音信号,保持真实和合成传播表征之间的一致性,并确保在假新闻检测和特征预测中的任务充分性。在三组真实世界数据集上的实验表明,InfoPDF在各种假新闻检测任务中始终实现了卓越的性能。进一步分析表明,InfoPDF能够估计特征级可靠性,学习更具判别性的传播表征。
cs.CL / 76 / 2605.02266
Reliability-Oriented Multilingual Orthopedic Diagnosis: A Domain-Adaptive Modeling and a Conceptual Validation Framework
以可靠性为导向的多语言骨科诊断:一种领域自适应建模及概念验证框架
Abstract
Large Language Models (LLMs) are increasingly proposed for clinical decision support, including multilingual diagnosis in low-resource settings. However, their reliability, calibration, and safety characteristics remain insufficiently understood for structured, high-risk tasks. We present a system-level analysis of multilingual orthopedic diagnosis from free-text clinical notes in English, Hindi, and Punjabi. We evaluate three modeling regimes: (i) task-aligned multilingual transformer encoders, (ii) a task-fine-tuned baseline (DistilBERT), and (iii) a domain-adaptive architecture tailored to orthopedic text (IndicBERT-HPA). These models are compared with zero-shot, instruction-tuned LLMs to assess suitability for structured diagnostic classification. Results indicate that while LLMs exhibit strong linguistic fluency, they show unstable calibration and reduced reliability under structured multilingual conditions, particularly in low-resource languages. These findings are specific to zero-shot evaluation and do not imply limitations of fine-tuned models. Domain-adaptive specialization substantially improves cross-lingual discrimination and confidence behavior. IndicBERT-HPA, with language-specific orthopedic adapter heads, achieves consistently strong performance across six diagnostic categories and more predictable deployment characteristics than task-only adaptation. Building on these observations, we outline a conceptual deterministic agent-based validation framework for future implementation, formalizing evidence checks, language-sensitive validation, and conservative human-in-the-loop gating. Reliable multilingual clinical decision support requires specialized architecture, explicit reliability analysis, and structured validation for safety-critical systems.
Chinese Translation
大型语言模型(LLMs)越来越多地被提议用于临床决策支持,包括在资源较少的环境中的多语言诊断。然而,它们在结构化高风险任务中的可靠性、校准和安全特性仍然未被充分理解。我们对来自英语、印地语和旁遮普语的自由文本临床记录中的多语言骨科诊断进行了系统级分析。我们评估了三种建模方案:(i) 与任务对齐的多语言变换器编码器,(ii) 任务微调的基线模型(DistilBERT),以及 (iii) 针对骨科文本量身定制的领域自适应架构(IndicBERT-HPA)。这些模型与零样本指令微调的LLMs进行比较,以评估其在结构化诊断分类中的适用性。结果表明,虽然LLMs表现出强大的语言流畅性,但在结构化多语言条件下,它们显示出不稳定的校准和降低的可靠性,尤其是在资源少的语言中。这些发现特定于零样本评估,并不暗示微调模型的局限性。领域自适应专门化显著改善了跨语言辨别能力和信心行为。IndicBERT-HPA 具有语言特定的骨科适配器头,在六个诊断类别中实现了一致的强性能,并比仅以任务为导向的适应显示出更可预测的部署特征。基于这些观察,我们勾勒出一种概念性的确定性代理验证框架,以便未来实施,正式化证据检查、语言敏感的验证和保守的人机协作门控。可靠的多语言临床决策支持需要专门的架构、明确的可靠性分析以及针对安全关键系统的结构化验证。
cs.CL / 77 / 2605.02270
A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures
塔吉克-法尔西语言对的机器音译模型系统基准:从基于规则到变换器架构的比较研究
Abstract
This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for reverse). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite limited data. Models using subword tokenization (mT5) failed completely (chrF++ less than 18.5). The findings demonstrate that for accurate transliteration of the Tajik-Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization.
Chinese Translation
本文首次全面分析了现代机器学习架构在塔吉克(西里尔字母)与波斯(阿拉伯字母)之间的音译性能。一个关键贡献是创建并验证了一个独特的平行语料库,该语料库由多种异构来源整合而成,包括众包项目、词典对、《沙恰梅》(Shahnameh)的平行文本、外交文章、《马斯纳维》(Masnavi-i Ma'navi)的文本、官方术语列表,以及音译的通信记录。初始数据集包含328,253对句子;我们使用分层随机抽样形成了一个代表性子集,包含40,000对句子。实验比较了六类模型:基于规则的基线、带注意力机制的LSTM、字符级变换器(Transformer)、从头训练的G2P变换器、预训练的多语言模型(mBART、mT5与LoRA)、以及字节级ByT5。结果表明,ByT5的表现明显优于其他模型(塔吉克到法尔西的chrF++得分为87.4,反向得分为80.1)。尽管数据有限,G2P变换器也显著优于mBART(72.3对比62.2 chrF++)。而使用子词标记化的模型(mT5)表现完全不佳(chrF++低于18.5)。研究结果表明,对于塔吉克-法尔西对的准确音译,采用字节或字符级别的架构显然比依赖子词标记化的传统多语言Seq2Seq模型更为有效。
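The contrast between byte-level and subword models can be made concrete by looking at what a byte-level model actually consumes. The sketch below only shows the UTF-8 byte view of one Tajik word and its Persian counterpart; it is an illustration of the input representation, not the paper's pipeline, and the word pair is chosen here as an example.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`, the raw symbols a byte-level model sees."""
    return list(text.encode("utf-8"))


# "book" in Tajik (Cyrillic script) and in Persian (Arabic script).
tajik = "китоб"
persian = "\u06a9\u062a\u0627\u0628"  # keheh, teh, alef, beh

tajik_bytes = utf8_bytes(tajik)
persian_bytes = utf8_bytes(persian)

# Every symbol is one of 256 byte values, so neither script can be out of vocabulary.
vocabulary_bound = max(tajik_bytes + persian_bytes)

print(len(tajik), len(tajik_bytes))
print(len(persian), len(persian_bytes))
```

Each letter in both scripts takes two bytes, so sequences get longer, but the model never has to rely on a subword vocabulary that may cover Tajik Cyrillic or Persian text poorly.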
cs.CL / 78 / 2605.02277
Compositional Multi-hop Factual Error Correction via Decomposition-and-Injection
基于分解与注入的组合多跳事实错误修正
Abstract
Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.
Chinese Translation
事实错误修正(FEC)旨在将不准确的文本修正为与外部证据事实一致的陈述。尽管近期的方法在单跳修正中表现良好,但它们往往将主张视为原子单位,对于需要跨多个证据源进行组合推理的多跳案例则显得无能为力。这一挑战在有限的配对数据和难以在复杂推理链中定位语义错误的情况下进一步加剧。我们提出了CECoR(通过推理感知合成的组合错误修正),这一推理感知框架引入了一种分解与注入的范式用于组合错误修正。CECoR将多跳主张分解为可解释的推理步骤,并注入受控扰动以合成高质量的训练对。结合监督微调和强化学习的两阶段学习策略提高了事实准确性和鲁棒性。全面的评估结果表明,CECoR在多跳基准测试中表现出色,超越了远程监督方法和少量样本大语言模型基线。它还有效地推广到单跳修正,并在嘈杂证据下保持稳定,展示了其在现实世界事实修正中的多面性。
cs.CL / 79 / 2605.02308
Structural Dilemmas and Developmental Pathways of Legal Argument Mining in the Era of Artificial Intelligence
人工智能时代法律论证挖掘的结构困境与发展路径
Abstract
Against the backdrop of rapid advances in artificial intelligence, legal argument mining has emerged as an important research area linking legal texts with intelligent analysis, carrying significant theoretical and practical implications. Existing studies have primarily developed along three dimensions: data, technology, and theory. At the data level, raw legal texts and annotated corpora constitute the foundational resources. At the technological level, research paradigms have evolved from rule-based systems and traditional machine learning to large language models (LLMs). At the theoretical level, argumentation theory and legal dogmatics provide important references for modeling argumentation structures. However, despite ongoing progress, the overall development of legal argument mining remains relatively slow. Building on a systematic review of existing research, this study conducts an in-depth analysis and finds that this is due not only to data scarcity or technical limitations, but more fundamentally to the lack of a structured representational approach that reconciles theoretical expressiveness with computational feasibility. Specifically, this challenge manifests in dilemmas in data standardization, obstacles to effective modeling, and limitations in domain adaptation. In response, the study proposes several key directions for future research. It aims to provide a reframing of key problems and a pathway for future development in legal argument mining, while leaving specific models and implementation schemes for further investigation.
Chinese Translation
在人工智能快速发展的背景下,法律论证挖掘作为一个重要的研究领域,连接了法律文本与智能分析,具有重要的理论和实践意义。现有研究主要沿着数据、技术和理论三个维度发展。在数据层面,原始法律文本和注释语料库构成了基础资源。在技术层面,研究范式已从基于规则的系统和传统机器学习发展到大型语言模型(LLMs)。在理论层面,论证理论和法律教义为建模论证结构提供了重要参考。然而,尽管持续进展,法律论证挖掘的整体发展仍然相对缓慢。基于对现有研究的系统回顾,本研究进行了深入分析,发现这不仅是由于数据稀缺或技术限制,更根本原因在于缺乏一种结构化的表现方法,能够调和理论表现力与计算可行性。具体而言,这一挑战表现为数据标准化的困境、有效建模的障碍以及领域适应的局限性。对此,本研究提出了未来研究的几个关键方向,旨在为法律论证挖掘的关键问题重构框架,并为未来的发展路径提供指引,同时留待后续研究具体模型和实施方案的探讨。
cs.CL / 80 / 2605.02348
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
通过过程奖励模型进行解码时去偏见:从控制填充到开放式生成
Abstract
Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights. A separate Process Reward Model (PRM) acts as a judge, scoring each candidate for both fairness and fluency. We design three schemes of increasing sophistication (Best-of-N selection, Sequential critique-and-revise, and Constitutional self-audit) and evaluate them on four models (GPT-4o-mini, Llama 3.2 3B, Gemma 3 4B, Qwen 2.5 3B) across a 200-prompt bilingual benchmark in English and Urdu covering eight bias categories. Sequential debiasing proves the most effective, raising mean bias scores by up to +0.40 over baseline while preserving (and sometimes improving) fluency. We then extend all three schemes to open-ended generation, where each token is debiased on the fly, and introduce a lightweight Bias Guard gate that fires only on potentially biased words, keeping overhead near 2x for well-calibrated models. A formal overhead metric that separates generator cost from judge cost reveals that Best-of-N is effectively free on the generator side in a native implementation. GPT-4o-mini, included as a strong proprietary anchor, confirms that the framework scales with model capability; the three open-weight models show where current small-scale LLMs still struggle.
Chinese Translation
大型语言模型从其训练数据中学习到社会偏见,并将这些偏见带入下游应用中,常常加剧与性别、种族、宗教、残疾、年龄和社会经济地位相关的刻板印象。标准的修正方法(在精心策划的数据上再训练或使用人类反馈进行微调)成本高昂,需要访问模型权重,并且可能会降低模型在其他任务上的表现。本文采取了不同的路径:我们在解码时消除模型的偏见,将偏见缓解视为对候选标记的结构化搜索,而从不接触模型权重。一个独立的过程奖励模型(Process Reward Model, PRM)充当评判者,为每个候选项在公平性和流畅性上评分。我们设计了三种逐步增强的方案(最优选择、顺序批评与修订、宪法自审),并在四个模型(GPT-4o-mini、Llama 3.2 3B、Gemma 3 4B、Qwen 2.5 3B)上进行评估,在一个覆盖八个偏见类别的200个提示的英语和乌尔都语双语基准测试中进行评估。顺序去偏见是最有效的,相较于基线,平均偏见评分提高了多达+0.40,同时保持(并有时提高)流畅性。然后,我们将三种方案扩展到开放式生成中,在该过程中,每个标记都实时去偏见,并引入一个轻量级的偏见保护门(Bias Guard),仅对可能存在偏见的词触发,使得对于经过良好校准的模型而言,开销接近2倍。一个正式的开销指标将生成器成本与评判者成本分开,显示在原生实现中,最优选择在生成器侧几乎是免费的。作为强有力的专有基准,GPT-4o-mini确认该框架能够随着模型能力的提升而扩展;而三种开放权重模型显示当前小规模大型语言模型仍存在的困难。
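The simplest of the three schemes, Best-of-N selection, can be sketched as follows. This is a minimal illustration under stated assumptions: `sample` and `judge` are hypothetical callables standing in for the generator and the Process Reward Model, and the equal weighting of fairness and fluency is a choice made here, not a detail from the abstract.

```python
from typing import Callable, Tuple

# Hypothetical judge interface: returns (fairness, fluency), each in [0, 1].
Judge = Callable[[str, str], Tuple[float, float]]


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              judge: Judge,
              n: int = 4,
              fairness_weight: float = 0.5) -> str:
    """Draw n candidates from the generator and keep the one the judge scores highest.

    The generator's weights are never touched; only its outputs are re-ranked.
    """
    candidates = [sample(prompt) for _ in range(n)]

    def combined(candidate: str) -> float:
        fairness, fluency = judge(prompt, candidate)
        return fairness_weight * fairness + (1.0 - fairness_weight) * fluency

    return max(candidates, key=combined)


# Toy stand-ins so the sketch runs without any model.
canned = iter(["cand_a", "cand_b", "cand_c"])
toy_scores = {"cand_a": (0.3, 0.9), "cand_b": (0.9, 0.8), "cand_c": (0.6, 0.6)}


def toy_sample(prompt: str) -> str:
    return next(canned)


def toy_judge(prompt: str, candidate: str) -> Tuple[float, float]:
    return toy_scores[candidate]


choice = best_of_n("Fill in the blank: ...", toy_sample, toy_judge, n=3)
```

The n samples are independent of one another, so they can be drawn in a single batched call, which may be why the abstract describes Best-of-N as close to free on the generator side.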
cs.CL / 81 / 2605.02351
MolViBench: Evaluating LLMs on Molecular Vibe Coding
MolViBench:评估大型语言模型在分子氛围编程(Molecular Vibe Coding)中的表现
Abstract
Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding poses a distinctive challenge: LLMs must jointly possess programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE-bench require no chemistry knowledge, while chemistry-focused benchmarks such as S^2-Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single-API recall to end-to-end virtual screening pipeline design, spanning 12 real-world drug discovery workflows. To rigorously assess generated code, we also propose a multi-layered evaluation framework that combines type-aware output comparison and AST-based API-semantic fallback analysis, which jointly measures executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real-world Molecular Vibe Coding paradigms, providing a practical and fine-grained testbed for diagnosing LLMs' coding capabilities in AI-accelerated molecular discovery.
Chinese Translation
分子氛围编程(Molecular Vibe Coding)是一种化学家与大型语言模型(LLMs)交互以生成用于分子任务的可执行程序的范式,它已成为配备预定义工具的化学智能体之外的一种灵活替代方案,使化学家能够表达任意复杂的自定义工作流程。与一般编码任务不同,分子编码带来了独特的挑战,要求大型语言模型同时具备编程能力、分子理解能力和领域特定的推理能力。然而,现有的基准测试仍然彼此脱节。HumanEval 和 SWE-bench 等通用代码生成基准不需要化学知识,而 S^2-Bench 和 ChemCoTBench 等专注于化学的基准评估的是知识回忆或性质预测,而非可执行代码生成。为填补这一空白,我们提出了 MolViBench,这是第一个专为分子氛围编程量身定制的基准。MolViBench 包含 358 个精心整理的任务,涵盖五个认知层次,从单一 API 回忆到端到端虚拟筛选流程设计,覆盖 12 种真实的药物发现工作流程。为了严格评估生成的代码,我们还提出了一种多层次评估框架,结合了类型感知的输出比较和基于抽象语法树(AST)的 API 语义回退分析,共同衡量可执行性和化学正确性。我们系统性地评估了 9 个前沿的编码大型语言模型,并比较了三种真实场景下的分子氛围编程范式,为诊断大型语言模型在 AI 加速分子发现中的编码能力提供了一个实用且细粒度的测试平台。
cs.CL / 82 / 2605.02363
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
当正确性无法使用时:改善小型语言模型中的结构化输出可靠性
Abstract
Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and in several settings degrades task performance substantially. To overcome this limitation, we developed AloLab, an iterative system-prompt optimizer (meta-agent: Claude Sonnet 4.5) requiring only black-box API access to the target model; it reaches 84-87% output accuracy on GSM8K and 34-40% on MATH across five independent runs per model, with 29/30 paired McNemar comparisons against the best static prompt significant at p < 0.05, at near-NAIVE inference latency and without model fine-tuning. The same format failure extends to GPT-4o (OpenAI, 2024), a proprietary closed-source model: REFERENCE achieves 0% output accuracy due to systematic markdown-fence wrapping, while AloLab reaches 95.2% [94.8, 95.6]. An ablation replacing the Sonnet 4.5 meta-agent with Claude 3 Haiku reduces mean output accuracy to 61.0% and increases run-to-run standard deviation from <1 pp to 21.8 pp, confirming that meta-agent capability is a primary driver of optimization quality.
Chinese Translation
部署的语言模型必须生成既正确又符合格式的输出。我们使用两个数学基准——GSM8K和MATH——作为受控测试平台,研究这一结构化输出可靠性差距:真实值明确且输出合同严格(要求字段的 JSON)。我们在五种提示策略下评估了三种参数为7-9B的模型,并将输出准确性——数学正确性与有效 JSON 结构的联合事件——作为主要指标。出现了系统性的格式失效:NAIVE 提示(无系统提示)在 GSM8K 上实现了高达 85% 的任务准确性,但在所有模型和数据集上输出准确性为 0%。REFERENCE 提示(一个最小手写 JSON 格式提示)表现稍好,为测试的四个模型中的两个导致 0% 的输出准确性。受限解码强制执行句法有效性,但会导致3.6x-8.2x的延迟开销,并在某些设置中显著降低任务表现。为了克服这一限制,我们开发了 AloLab,一个迭代的系统提示优化器(元代理:Claude Sonnet 4.5),仅需黑箱 API 访问目标模型;在每个模型的五次独立运行中,其在 GSM8K 上达到了 84-87% 的输出准确性,在 MATH 上达到了 34-40%,并且在 29/30 的 McNemar 配对比较中相对于最佳静态提示显著达到了 p < 0.05,且在接近 NAIVE 推理延迟下,没有进行模型微调。相同的格式失效扩展到了 GPT-4o(OpenAI, 2024),这是一个专有的封闭源模型:由于系统性 markdown-fence 包裹,REFERENCE 达到了 0% 的输出准确性,而 AloLab 达到了 95.2% [94.8, 95.6]。将 Sonnet 4.5 元代理替换为 Claude 3 Haiku 的消融实验使得平均输出准确性降低到 61.0%,并将运行间的标准差从 <1 pp 增加到 21.8 pp,确认元代理能力是优化质量的主要驱动因素。
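The gap between task accuracy and output accuracy described above can be reproduced with a few lines of scoring code. The sketch below is an assumption-laden illustration: the required field names, the lenient "last number" extractor, and the three sample outputs are all hypothetical and not taken from the paper.

```python
import json
import re

REQUIRED_FIELDS = ("reasoning", "answer")  # hypothetical output contract


def parse_strict(raw: str):
    """Return the parsed object only if `raw` is bare JSON with every required field."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or any(f not in obj for f in REQUIRED_FIELDS):
        return None
    return obj


def last_number(raw: str):
    """Lenient extraction: take the last number that appears anywhere in the text."""
    found = re.findall(r"-?\d+(?:\.\d+)?", raw)
    return found[-1] if found else None


def score(outputs, gold):
    """Return (task accuracy, output accuracy) over paired outputs and gold answers."""
    task_hits = output_hits = 0
    for raw, answer in zip(outputs, gold):
        # Task accuracy: the right number shows up somewhere in the text.
        task_hits += last_number(raw) == answer
        # Output accuracy: the text is valid JSON AND carries the right answer.
        parsed = parse_strict(raw)
        output_hits += parsed is not None and str(parsed["answer"]) == answer
    n = len(gold)
    return task_hits / n, output_hits / n


bare = '{"reasoning": "6 * 7", "answer": "42"}'
fence = "`" * 3  # a markdown code fence, built here to keep this listing readable
outputs = [
    bare,                                  # compliant
    f"{fence}json\n{bare}\n{fence}",       # correct, but wrapped in a markdown fence
    "The answer is 42.",                   # correct, but free text
]
task_acc, output_acc = score(outputs, ["42", "42", "42"])
```

All three outputs contain the right answer, yet only the first one survives a strict `json.loads`, which mirrors the markdown-fence failure the abstract reports for the REFERENCE prompt.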
cs.CL / 83 / 2605.02364
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
InfoLaw:面向质量加权混合数据与重复数据的大型语言模型信息缩放定律
Abstract
Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, which leaves the choice of an optimal data recipe at scale underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then fit the information model so that the accumulated information accurately predicts that performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
Chinese Translation
在大语言模型(LLM)的预训练中,提高高质量数据的权重通常会提升性能,但在数据有限的情况下,尤其是在过度训练(overtraining)时,更强的加权会增加数据重复,从而可能降低性能。然而,标准的缩放定律无法可靠地跨混合配方或在存在重复的情况下进行外推,导致在规模化时最优数据配方的选择无法确定。为了解决这个问题,我们提出了InfoLaw(Information Scaling Laws,信息缩放定律),这是一种数据感知的缩放框架,可以根据已消耗的标记数、模型大小、数据混合权重和重复程度来预测损失。其关键思想是将预训练建模为信息的积累,其中质量决定信息密度,重复则带来依赖于规模的收益递减。我们首先收集模型在规模、质量分布和重复水平各不相同的数据集上训练后的表现。然后我们拟合信息模型,使累积的信息能够准确预测这些模型的性能。InfoLaw能够对未见过的数据配方和更大规模的运行(高达70亿、4250亿标记)进行性能预测,其损失的平均绝对误差为0.15%,最大为0.96%,并且能够在不同的过度训练水平之间可靠外推,从而在不同的计算预算下实现高效的数据配方选择。
cs.CL / 84 / 2605.02392
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
新颖性与原因?基于段落检索的细粒度专利新颖性预测
Abstract
Novelty assessment is a critical yet complex task in the examination process for patent acceptance, requiring examiners to determine whether an invention is disclosed in a prior art document. The process involves intricate matching between specific features of a patent claim and passages in the prior art. While prior work has approached novelty prediction primarily as a binary classification task at the claim level, we argue that this formulation is susceptible to spurious correlations and lacks the granularity required for practical application. In this work, we introduce FiNE-Patents (Fine-grained Novelty Examination of Patents), a novel dataset comprising 3,658 first patent claims annotated with fine-grained, feature-level prior art references extracted from European Search Opinion (ESOP) documents. We propose shifting the evaluation paradigm from simple binary classification to a joint retrieval and abstract reasoning task at the feature level, requiring models to identify specific passages from a prior art document that disclose individual claim features, and to identify which features of a claim make it novel. We implement and evaluate LLM-based workflows that decompose claims into features, analyze each feature against prior art, and finally derive a claim-level novelty prediction. Our experiments demonstrate that these workflows outperform embedding-based baselines on passage retrieval and novel feature identification. Furthermore, we show that unlike trained classifiers, LLMs are robust against spurious correlations present in the claim-level novelty classification task. We release the dataset and code to foster further research into transparent and granular patent analysis.
Chinese Translation
新颖性评估是专利接受审查过程中的一项关键但复杂的任务,要求审查员判断某项发明是否在先前的技术文献中已被披露。该过程涉及专利权利要求的特定特征与先前技术文献中的段落之间的复杂匹配。尽管以往的研究主要将新颖性预测视为权利要求级别的二分类任务,我们认为这种表述容易受到虚假相关性的影响,并且缺乏实际应用所需的细粒度。本研究引入了FiNE-Patents(专利细粒度新颖性审查),这是一个包含3,658个第一专利权利要求的新数据集,这些权利要求标注了细粒度的特征级先前技术参考,提取自欧洲搜索意见(ESOP)文档。我们提出将评估范式从简单的二分类转变为特征级的联合检索与抽象推理任务,要求模型识别先前技术文献中披露个别权利要求特征的特定段落,并识别哪些特征使权利要求具备新颖性。我们实现并评估了基于LLM的工作流程,这些流程将权利要求分解为特征,分析每个特征与先前技术的关系,最后得出权利要求级别的新颖性预测。我们的实验表明,这些工作流程在段落检索和新特征识别方面优于基于嵌入的基线。此外,我们还展示了与训练得到的分类器不同,LLM对权利要求级别新颖性分类任务中存在的虚假相关性具有鲁棒性。我们发布了数据集和代码,以促进对透明和细粒度专利分析的进一步研究。
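The feature-level workflow described above (decompose the claim, check each feature against the prior art, then derive a claim-level verdict) can be sketched roughly as follows. Everything here is illustrative: the features are split by hand rather than by an LLM, `discloses` is a hypothetical stand-in for the model's judgement, and the aggregation rule (a claim is novel if at least one feature is undisclosed in the single prior-art document) is an assumption about how the verdict is derived.

```python
from typing import Callable, Dict, List, Optional


def assess_claim(features: List[str],
                 passages: List[str],
                 discloses: Callable[[str, str], bool]) -> dict:
    """Map each claim feature to the first passage that discloses it, if any."""
    evidence: Dict[str, Optional[int]] = {}
    for feature in features:
        evidence[feature] = next(
            (i for i, passage in enumerate(passages) if discloses(passage, feature)),
            None,
        )
    novel_features = [f for f, idx in evidence.items() if idx is None]
    return {
        "evidence": evidence,              # feature -> passage index or None
        "novel_features": novel_features,  # the "why" behind the verdict
        "is_novel": bool(novel_features),  # assumed claim-level aggregation
    }


def keyword_discloses(passage: str, feature: str) -> bool:
    """Toy matcher: case-insensitive substring test in place of an LLM call."""
    return feature.lower() in passage.lower()


features = ["a steel frame", "a solar panel", "a wireless sensor"]
passages = [
    "The device comprises a steel frame.",
    "A solar panel is mounted on the roof.",
]
result = assess_claim(features, passages, keyword_discloses)
```

Returning the per-feature evidence alongside the verdict is what makes the prediction inspectable: a reader can see which passage was matched to which feature and which feature carried the novelty finding.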
cs.CL / 85 / 2605.02402
Automatic Reflection Level Classification in Hungarian Student Essays
匈牙利学生作文中的自动反思水平分类
Abstract
Reflective thinking is a key competency in education, but assessing reflective writing remains a time-consuming and subjective task for education experts. While automated reflective analysis has been explored in several languages, Hungarian has not been studied extensively. In this paper, we present the first comprehensive study on automatic reflection level classification in Hungarian student essays. We used a large, expert-annotated Hungarian dataset consisting of 1,954 reflective essays collected over multiple academic years and labeled on a four-level reflection scale. We investigate two approaches: (1) classical machine learning models using TF-IDF and semantic embedding features, and (2) Hungarian-specific transformer models fine-tuned for document-level reflection classification. To address the strong class imbalance in the dataset, we systematically examine class weighting, oversampling, data augmentation, and alternative loss functions. An extensive ablation study is conducted to analyze the contribution of each modeling and balancing strategy. Our results show that shallow machine learning models with appropriate feature engineering achieve strong overall performance, reaching up to 71% overall score averaged over accuracy, F1-score, and ROC AUC metrics, while transformer-based models achieve a slightly lower overall score (68%) averaged over the same metrics, but demonstrate better generalization on minority reflection classes. These findings highlight the continued relevance of classical methods for low-resource settings and the robustness of transformer models for imbalanced classification. The proposed dataset and experimental insights provide a solid foundation for future research on automated reflective analysis in Hungarian and other morphologically rich languages.
Chinese Translation
反思性思维是教育中的一项关键能力,但评估反思性写作对于教育专家来说仍然是一项耗时且主观的任务。尽管在多种语言中已对自动反思分析进行了研究,但匈牙利语的相关研究相对较少。在本文中,我们展示了对匈牙利学生作文中自动反思水平分类的首个综合研究。我们使用了一个大型的、由专家注释的匈牙利数据集,这个数据集包含1,954篇反思性作文,收集了多个学年的样本,并在四级反思量表上进行了标注。我们探讨了两种方法:(1) 使用TF-IDF和语义嵌入特征的经典机器学习模型,以及(2) 针对文档级反思分类进行微调的匈牙利特定变换模型。为了应对数据集中强烈的类别不平衡,我们系统地检查了类别加权、过采样、数据增强和替代损失函数。我们还进行了广泛的消融研究,以分析每种建模和调整策略的贡献。我们的结果表明,合适特征工程的浅层机器学习模型在整体表现上取得了较强的结果,在准确率、F1分数和ROC AUC指标上平均可达71%;而基于变换模型的结果略低(68%),但在少数反思类别上的泛化能力更强。这些发现凸显了经典方法在低资源环境下的持续相关性,以及变换模型在不平衡分类中的鲁棒性。所提出的数据集和实验见解为未来在匈牙利及其他形态丰富语言中的自动反思分析研究提供了坚实的基础。
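Two of the quantities mentioned above are easy to make concrete: inverse-frequency class weights, one common way to implement class weighting, and the overall score as a plain mean of the three metrics. The sketch below assumes the widely used "balanced" heuristic, n_samples / (n_classes * class_count); the label distribution and metric values are invented for illustration and are not the paper's data.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def overall_score(accuracy: float, f1: float, roc_auc: float) -> float:
    """Unweighted mean of the three metrics, as the abstract describes."""
    return (accuracy + f1 + roc_auc) / 3


# Hypothetical four-level reflection labels with a strong imbalance.
labels = [0] * 2 + [1] * 10 + [2] * 6 + [3] * 2
weights = balanced_class_weights(labels)

# Hypothetical metric values.
score = overall_score(accuracy=0.75, f1=0.66, roc_auc=0.72)
```

Because the overall score averages three metrics, two models with similar scores can still differ sharply on minority classes, which is the pattern the abstract reports for the transformer models.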
cs.CL / 86 / 2605.02443
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
HalluScan: 一种系统性基准框架用于检测和减轻指令遵循大型语言模型中的幻觉
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.
Chinese Translation
大型语言模型(LLMs)在多种自然语言处理任务中展现出了显著的能力,但它们仍然易受幻觉的影响——生成事实不准确、未忠实于提供的上下文或与用户指令不一致的内容。我们提出了HalluScan,这是一种全面的基准框架,系统评估在72个配置中检测和减轻幻觉的效果,这些配置涵盖了6种检测方法、4种开放权重模型家族和3个不同领域。我们提出了三个关键贡献:(1)HalluScore,这是一种新颖的复合度量,与人类专家判断之间的Pearson相关性达到r = 0.41;(2)自适应检测路由(Adaptive Detection Routing, ADR),这是一种智能路由算法,在仅有0.1% AUROC降幅的情况下实现了2.0倍的成本降低;(3)系统的错误级联分解揭示了不同领域中的幻觉错误类型存在显著变异。我们的实验表明,自然语言推理(NLI)验证实现了最高的总体AUROC值为0.88,而RAV的AUROC值为0.66,位居第二。
cs.CL / 87 / 2605.02447
PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention
PC-MNet:通过极性调制注意力进行多模态讽刺检测的双层一致性建模
Abstract
Multimodal sarcasm detection, which aims to precisely identify pragmatic incongruities between literal text and nonverbal cues, has gained substantial attention in multimodal understanding. Recent advancements have predominantly relied on naïve similarity-based attention mechanisms and uniform late fusion strategies. Furthermore, given that functional entanglement restricts traditional late fusion, we incorporate a scalar congruity routing mechanism and a prior-guided contextual graph. This mechanism anchors a generalized incongruity manifold through a two-stage asymmetric optimization driven by inconsistency-aware contrastive learning, selectively fusing only the most discriminative multi-granularity evidence. Extensive experiments on the MUStARD benchmark and its spurious-correlation-mitigated balanced datasets demonstrate that our approach achieves new state-of-the-art performance, surpassing the strongest multimodal baseline by a substantial 3.14% improvement in Macro-F1. By architecturally isolating atomic, compositional, and contextual conflicts, this work provides a robust, decoupled paradigm for modeling subtle pragmatic incongruities in human communication.
Chinese Translation
多模态讽刺检测旨在精准识别字面文本与非语言线索之间的实用不一致性,在多模态理解中引起了广泛关注。近年来的进展主要依赖于幼稚的基于相似性的注意力机制和均匀的后期融合策略。此外,考虑到功能纠缠限制了传统的后期融合,我们引入了一种标量一致性路由机制和一个先验引导的上下文图。这一机制通过一种由一致性感知对比学习驱动的双阶段非对称优化,固定了一个广义的不一致性流形,选择性融合最具区分性的多粒度证据。在 MUStARD 基准及其纠缠相关干扰减轻的平衡数据集上的大量实验表明,我们的方法达到了新的最先进性能,宏观F1得分相较最强多模态基线有了显著提升,达到了3.14%。通过从结构上隔离原子、组成和上下文冲突,这项工作为建模人类通信中微妙的实用性不一致性提供了一个稳健的、解耦的范式。
cs.CL / 88 / 2605.02457
Leveraging Argument Structure to Predict Content Hatefulness
利用论证结构预测内容的仇恨性
Abstract
Information disorder is a challenging phenomenon that affects society at large. This phenomenon entails the diffusion of misleading, misinforming, and hateful content online. In different contexts, one aspect of the problem may prevail, but overall, this is a broad problem that requires comprehensive solutions. While each dimension of the problem (hate speech, disinformation, misinformation, etc.) requires in-depth analysis, in this paper, we look into the possibility of argument structure to provide relevant information to link these different areas of the problem. In particular, we focus on the WSF-ARG+ dataset, which consists of white supremacy forum messages annotated in terms of argument structure (premises and conclusion). There, we leverage the checkworthiness and hatefulness annotations of the argument components to obtain insights into the hatefulness of the whole message. Our results show promising insights (up to 96% F1), indicating the possibility of extending this direction in the future to tackle hateful content identification and information disorder countering.
Chinese Translation
信息失序是一个影响社会的挑战性现象。这一现象包括在网上传播误导性、虚假信息和仇恨内容。在不同的背景下,问题的某一方面可能占主导地位,但总体而言,这是一个需要综合解决方案的广泛问题。尽管问题的每个维度(仇恨言论、虚假信息、错误信息等)都需要深入分析,但在本文中,我们探讨了利用论证结构提供相关信息以连接这些不同领域问题的可能性。特别地,我们关注WSF-ARG+数据集,该数据集由白人至上主义论坛消息构成,并按论证结构(前提和结论)进行了注释。在这里,我们利用论证组件的可检验性和仇恨性注释,以获得对整体消息仇恨性的洞察。我们的结果显示出有希望的见解(最高达到96%的F1分数),这表明未来扩展这一方向以应对仇恨内容识别和信息失序的可能性。
cs.CL / 89 / 2605.02466
ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias
ATLAS:瑞典百科全书的文章追踪、链接与分析
Abstract
The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge. The lack of structure in the raw text makes it difficult to track changes across these editions. In this work, we built a pipeline to restore the text structure, in which we extract the headwords and identify entries; categorize the entities; match entries across editions; and link entries to a Wikidata item. We applied this pipeline to the four major editions of Nordisk familjebok, an authoritative Swedish encyclopedia published between 1876 and 1951. We could extract the headwords with an F1 score of 97.8% and we obtained an F1 score of 93.4% on the headword classification. In a small-scale evaluation, we reached 93% precision on the cross-edition matching, and 85% precision and 16.5% recall on the Wikidata linking. This shows that an automated approach to digitized historical knowledge is possible. This should facilitate the preservation of general knowledge and the understanding of knowledge transmission. The datasets and programs are available online.
Chinese Translation
旧百科全书的数字化代表着改善对历史结构知识获取的重要一步。然而,这一过程往往仅限于光学字符识别,导致所有底层结构未得到充分利用。此外,许多百科全书有多个版本,反映了知识的演变。原始文本中缺乏结构使得在这些版本之间追踪变化变得困难。在这项工作中,我们构建了一个管道以恢复文本结构,提取词头、识别条目;对实体进行分类;跨版本匹配条目;并将条目链接到Wikidata项。我们将该管道应用于四个主要版本的《Nordisk familjebok》,这是一本于1876年至1951年间出版的权威瑞典百科全书。我们能够以97.8%的F1分数提取词头,并在词头分类上获得93.4%的F1分数。在小规模评估中,我们在跨版本匹配上达到93%的精度,在Wikidata链接上达到85%的精度和16.5%的召回率。这表明对数字化历史知识的自动化处理是可行的。这将有助于保存一般知识和理解知识传递。数据集和程序已在线提供。
cs.CL / 90 / 2605.02472
Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication
大规模准确法律推理:神经符号卸载与结构可审计性在稳健法律裁决中的应用
Abstract
Legal texts often contain computational legal clauses--provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation. Adjudication then relies on deterministic graph executions with a visually auditable trace. In comparison against runtime LRM baselines (including GPT-5.2 and Gemini 3 Pro), our DACL-based Agent achieves near-perfect consistency and mitigates the "reasoning cliff" observed in probabilistic models. The system reduces compute costs by over 90% in high-volume workflows while satisfying the strict auditability requirements of legal adjudication.
Chinese Translation
法律文本通常包含计算法律条款——理解这些条款需要复杂的逻辑。尽管前沿的大型推理模型(LRMs)可以描述这些条款,但构建可投入生产的系统受到推理错误和推理成本高昂的限制。我们提出了摊销智能(Amortized Intelligence),这是一种神经符号方法,我们利用大语言模型(LLM)将法律文本翻译成确定性自主合同语言(Deterministic Autonomous Contract Language, DACL):一种类型化的图中间表示。裁决过程依赖于具有可视审计痕迹的确定性图执行。与运行时LRM基线(包括GPT-5.2和Gemini 3 Pro)的比较中,我们的基于DACL的代理实现了近乎完美的一致性,并减轻了在概率模型中观察到的“推理悬崖”现象。该系统在高流量工作流程中将计算成本降低了90%以上,同时满足法律裁决的严格可审计性要求。
cs.CL / 91 / 2605.02504
A multilingual hallucination benchmark: MultiWikiQHalluA
多语言幻觉基准:MultiWikiQHalluA
Abstract
Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages. In this work, we present evaluations of model hallucinations on a selection of languages: English, Danish, German, and Icelandic. Using these classifiers, we evaluate the hallucination rates for Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. Our classifiers reveal notably higher hallucination rates for Qwen3-0.6B (up to 60% of answers containing at least one hallucination, peaking in Icelandic) and generally lower rates for larger models, with cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B performing best on most languages. Hallucination rates are consistently higher for lower-resource languages, particularly Icelandic.
Chinese Translation
大多数幻觉评估集中于英语,因此尚不清楚这些发现是否适用于低资源语言。我们研究了忠实度幻觉,定义为模型生成的内容流利且合理,但与提供的输入不一致或内部不一致。借助多语言 MultiWikiQA 数据集,我们利用 LettuceDetect 框架为 306 种语言创建合成幻觉数据集,并为 30 种欧洲语言训练了基于标记的幻觉分类器。在本研究中,我们展示了在一系列语言(英语、丹麦语、德语和冰岛语)上模型幻觉的评估。使用这些分类器,我们评估了 Qwen3-0.6B、Qwen3-14B、Gemma-3-12B-IT、cogito-v1-preview-qwen-32B 和 cogito-v1-preview-llama-70B 的幻觉率。我们的分类器显示,Qwen3-0.6B 的幻觉率显著更高(高达 60% 的答案包含至少一个幻觉,冰岛语的比例最高),而较大模型的幻觉率普遍较低,其中 cogito-v1-preview-qwen-32B 和 cogito-v1-preview-llama-70B 在大多数语言上表现最佳。较低资源语言的幻觉率始终较高,尤其是冰岛语。
cs.CL / 92 / 2605.02505
Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis
重新审视语义角色标注:基于依赖关系分析的高效结构推理
Abstract
Semantic Role Labeling (SRL) provides an explicit representation of predicate-argument structure, capturing linguistically grounded relations such as who did what to whom. While recent NLP progress has been dominated by large language models (LLMs), these systems typically rely on implicit semantic representations and often lack explicit structural constraints and systematic explanatory mechanisms. Traditionally, SRL systems have often relied on AllenNLP; however, the framework entered maintenance mode in December 2022, limiting compatibility with evolving encoder architectures and modern inference requirements. We revisit structured SRL modeling, introducing a modernized encoder-based framework that preserves explicit predicate-argument structure while enabling inference 10 times faster. Using BERT-base, the model attains comparable predictive performance, and RoBERTa and DeBERTa further improve F1 performance within the same framework. We adopt a dependency-informed diagnostic methodology to characterize span-level inconsistencies and conduct a representation-level analysis of LLM behavior under dependency-informed structural signals. Results indicate that dependency cues primarily improve structural stability. Finally, we illustrate how the framework's explicit predicate-argument structure can support multilingual SRL projection as a downstream application.
Chinese Translation
语义角色标注(SRL)提供了谓词-论元结构的明确表示,捕捉了诸如“谁对谁做了什么”的语言学基础关系。尽管最近的自然语言处理(NLP)进展主要由大型语言模型(LLMs)驱动,但这些系统往往依赖于隐式语义表示,缺乏明确的结构约束和系统的解释机制。传统上,SRL系统通常依赖于AllenNLP;然而,该框架在2022年12月进入维护模式,限制了其与不断发展的编码器架构和现代推理要求的兼容性。我们重新审视了结构化的SRL建模,提出了一种现代化的基于编码器的框架,该框架在保留明确谓词-论元结构的同时,使推理速度提高了10倍。使用BERT-base模型,获得了可比的预测性能,而RoBERTa和DeBERTa在同一框架内进一步提高了F1性能。我们采用了一种基于依赖关系的诊断方法,表征跨跨度层级的不一致性,并对依赖关系信号下的大型语言模型的行为进行了表示层级分析。结果表明,依赖关系线索主要提高了结构的稳定性。最后,我们展示了该框架的明确谓词-论元结构如何支持作为下游应用的多语言SRL投影。
cs.CL / 93 / 2605.02520
Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study
生物医学检索增强生成的检索策略基准:一项受控实证研究
Abstract
Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies -- Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) -- within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-embedding-3-small embeddings, ensuring that observed differences are attributable to retrieval alone. Evaluation is conducted on 250 question-answer pairs drawn from a preprocessed subset of the BioASQ benchmark (rag-mini-bioasq) using four DeepEval metrics: contextual precision, contextual recall, faithfulness, and answer relevancy, each reported with 95% confidence intervals. A no-context ablation is included as a lower bound. Cross-Encoder Reranking achieves the best composite score (0.827) and highest contextual precision (0.852), confirming that query-document interaction yields measurable retrieval gains. Multi-Query Expansion, despite its recall-oriented design, produces the weakest contextual precision (0.671), suggesting naive query diversification introduces retrieval noise. MMR sacrifices answer relevancy for diversity, while the Dense baseline (composite 0.822) falls within 0.005 points of the top strategy. All RAG conditions dramatically outperform the no-context ablation on answer relevancy (0.658-0.701 vs. 0.287), confirming the practical value of retrieval. The full pipeline, hyperparameters, and evaluation code are publicly available.
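Of the five strategies compared above, Maximal Marginal Relevance is the one whose trade-off is easiest to see in code. The following is a minimal sketch of the standard greedy MMR selection rule, not the paper's pipeline: the function names, the `lam` weighting parameter, and the toy two-dimensional vectors are illustrative assumptions, and a real system would score embeddings returned by the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lam = 1.0 reduces to plain relevance ranking; lower values penalize
    documents that resemble ones already selected. (lam is a hypothetical
    parameter name chosen for this sketch.)
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy example: documents 0 and 1 are near-duplicates, document 2 is different.
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr_select([1.0, 0.0], docs, k=2, lam=1.0))
print(mmr_select([1.0, 0.0], docs, k=2, lam=0.3))
```

With a low `lam`, the second pick skips the near-duplicate in favour of the less relevant but more diverse document, which is consistent with the abstract's observation that MMR gives up some answer relevancy in exchange for diversity.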
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)为将大型语言模型(Large Language Model, LLM)输出与外部知识结合提供了一条成熟的路径,但在生物医学等高风险领域,哪种检索策略效果最佳的问题尚未得到应有的受控和多指标处理。本文对五种检索策略进行了系统的实证比较——密集向量检索(Dense Vector Search)、混合 BM25 + 密集检索(Hybrid BM25 + Dense Retrieval)、交叉编码器重排序(Cross-Encoder Reranking)、多查询扩展(Multi-Query Expansion)和最大边际相关性(Maximal Marginal Relevance, MMR)——在生物医学问答 RAG 流水线中。所有策略共享固定的生成模型(GPT-4o-mini)、共同的向量存储(ChromaDB)以及 OpenAI 的 text-embedding-3-small 嵌入,确保观察到的差异仅归因于检索。评价在从 BioASQ 基准的预处理子集(rag-mini-bioasq)中提取的 250 个问答对上进行,使用了四个深度评估指标:上下文精度(contextual precision)、上下文召回率(contextual recall)、忠实度(faithfulness)以及答案相关性(answer relevancy),每个指标均报告 95% 的置信区间。包括“无上下文”的消融实验作为下限。交叉编码器重排序取得了最佳综合得分(0.827)和最高的上下文精度(0.852),确认查询与文档的互动能带来可测量的检索增益。尽管多查询扩展的设计以召回为导向,但其产生的上下文精度最弱(0.671),表明天真的查询多样化引入了检索噪声。MMR 为多样性牺牲了答案相关性,而密集基线(综合得分 0.822)在距离最佳策略仅相差 0.005 分。所有 RAG 条件在答案相关性上显著优于无上下文消融实验(0.658-0.701 对 0.287),确认了检索的实用价值。完整的流水线、超参数和评估代码均公开可用。
cs.CL / 94 / 2605.02601
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
SemEval-2026 任务 7:跨多种语言和文化的日常知识
Ousidhoum, Nedjma, Myung, Junho, Perez-Almendros, Carla, Jin, Jiho, Keleg, Amr, Beloucif, Meriem, Zhou, Yi, Agerri, Rodrigo, Araujo, Vladimir, Baes, Naomi, Barry, James, Boisson, Joanne, Chen, Nancy F., de Kock, Christine, Edwards, Aleksandra, de Landa, Joseba Fernandez, Imam, Mohamed Fazli, Hakami, Huda, Hsieh, Shu-Kai, Imperial, Joseph Marvin, Lee, Roy Ka-Wei, Liu, Zhengyuan, Lyu, Chenyang, Samih, Younes, Sjons, Johan, Tan, Bryan, Ushio, Asahi, Zheng, Weihua, Oh, Alice, Camacho-Collados, Jose
Abstract
We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly representing low-resource languages spoken across multiple continents. As the task is designed strictly for evaluation, participants were not permitted to use the data for training, fine-tuning, few-shot learning, or any other form of model modification. Our task includes two tracks: (a) Short-Answer Questions (SAQ) and (b) Multiple-Choice Questions (MCQ). Participants were required to predict labels and were allowed to submit any NLP system and adopt diverse modelling strategies, provided that the benchmark was used solely for evaluation. The task attracted more than 140 registered participants, and we received final submissions from 62 teams, along with 19 system description papers. We report the results and present an analysis of the best-performing systems and the most commonly adopted approaches. Furthermore, we discuss shared insights into open questions and challenges related to evaluation, misalignment, and methodological perspectives on model behaviour in low-resource languages and for under-represented cultures.
Chinese Translation
我们展示了一个共享任务,旨在评估大型语言模型(LLMs)和自然语言处理(NLP)系统在多种语言和文化中的适应性。任务数据由我们手动构建的 BLEnD 基准(Myung 等,2024)的扩展版本组成,涵盖了30多个语言-文化对,主要代表了在多个大洲使用的低资源语言。由于该任务严格用于评估,因此参与者不得将数据用于训练、微调、少量学习或任何其他形式的模型修改。我们的任务包含两个部分:(a)简答题(SAQ)和(b)多项选择题(MCQ)。参与者需要预测标签,并允许提交任何 NLP 系统并采用多种建模策略,前提是基准仅用于评估。该任务吸引了140多名注册参与者,我们收到了62支团队的最终提交以及19篇系统描述论文。我们报告了结果,并对表现最佳的系统及最常采用的方法进行了分析。此外,我们讨论了对评估、误对齐及低资源语言和代表性不足文化中模型行为的研究方法学观点等相关开放问题和挑战的共享见解。
cs.CL / 95 / 2605.02608
Dependency Parsing Across the Resource Spectrum: Evaluating Architectures on High and Low-Resource Languages
跨越资源谱系的依存句法分析:在高资源和低资源语言上评估架构
Abstract
Transformer-based models achieve state-of-the-art dependency parsing for high-resource languages, yet their advantage over simpler architectures in low-resource settings remains poorly understood. We evaluate four parsers -- the Biaffine LSTM, Stack-Pointer Network, AfroXLMR-large, and RemBERT -- across ten typologically diverse languages, with a focus on low-resource African languages. We find that the Biaffine LSTM consistently outperforms transformer models in low-resource regimes, with transformers recovering their advantage as training data increases. The crossover falls within a resource range typical of treebanks for under-resourced languages. Morphological complexity (measured via MATTR) emerges as a significant secondary predictor of transformers' relative disadvantage after controlling for corpus size. These results indicate that the Biaffine LSTM may be better suited for syntactic tool development in low-resource regimes until sufficient annotated data is available to leverage the representational capacity of pre-trained transformers.
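The abstract names MATTR (moving-average type-token ratio) as its measure of morphological complexity without giving details. As a rough sketch of the general metric, the function below averages the type-token ratio over every sliding window of a fixed size; the window length and the fallback for short texts are assumptions made here, not settings taken from the paper.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over sliding windows of `window` tokens.

    If the text is shorter than the window, fall back to the plain
    type-token ratio (an assumption of this sketch).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# A morphologically rich language tends to produce more distinct word forms
# per window, so its MATTR is typically higher for the same window size.
sample = "the cat sat on the mat and the dog sat on the rug".split()
print(round(mattr(sample, window=5), 3))
```

Because every window has the same length, the score is less sensitive to corpus size than a raw type-token ratio, which matters when comparing treebanks of very different sizes.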
Chinese Translation
基于Transformer的模型在高资源语言的依存句法分析中达到了最先进的水平,然而它们在低资源环境中相比于更简单架构的优势仍然不甚明确。我们评估了四种解析器——Biaffine LSTM、Stack-Pointer Network、AfroXLMR-large 和 RemBERT——在十种类型上具有多样性的语言上的表现,特别关注于低资源的非洲语言。研究发现,Biaffine LSTM在低资源环境中始终优于Transformer模型,随着训练数据的增加,Transformer逐渐恢复其优势。这一交叉点落在资源范围内,通常为资源不足语言的树库所典型。形态复杂性(通过MATTR测量)在控制语料库大小后,成为Transformer相对劣势的重要次要预测因素。这些结果表明,Biaffine LSTM可能更适合在低资源环境中进行句法工具的开发,直到有足够的标注数据可用于充分利用预训练Transformer的表征能力。
cs.CL / 96 / 2605.02620
Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
击败风格检测器:关于人工智能文本军备竞赛的三小时主动研究
Abstract
Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts -- and add three new ones -- with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper's headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places ($r = +0.244$, $p < 10^{-8}$, $n = 648$). Under a leakage-free held-out protocol, GPT-5.5 and Claude Opus 4.7 close 71-75% of the style gap to the same-author ceiling on 324 paired tasks, against 24% for the human post-edit, and beat the human post-edit on about 80% of tasks. We then frame the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93-1.00 across approaches; six diagnostics show that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature. Given $T = 20$ feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics to the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already efficiently lower its own AI-detection probability. All code, 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released.
Chinese Translation
重现一项实证自然语言处理研究曾需数周之久。凭借发布的数据和现代主动研究工具,我们重做了近期ACL 2026关于对大型语言模型草稿进行个人风格后编辑的研究中的每一个实验,并增加了三个新实验,其中人类研究者仅作为审核者参与环节。我们重现了七个预注册假设,并将文章报道的感知自相似性与嵌入测量自相似性之间的相关性恢复到小数点后三位($r = +0.244$,$p < 10^{-8}$,$n = 648$)。在无泄漏的保留协议下,GPT-5.5和Claude Opus 4.7在324对配对任务中缩小了71-75%的风格差距,接近同作者的上限,而人类后编辑仅为24%,且在约80%的任务中超越了人类后编辑。接着,我们将相同的数据框架重新定义为一个人工智能文本检测的军备竞赛。一个基于LUAR-MUD嵌入的作者排除线性支持向量机(SVM)在各个方法上达到AUC 0.93-1.00;六个诊断表明,GPT-5.5的检测主要受到长度混杂的影响,而Opus的检测则是一个真正的风格特征。考虑到对冻结检测器进行$T = 20$次反馈迭代,一个Opus代理将五个保留测试模仿中的两个翻转至人类半空间,并将每个边距缩小一个数量级。在对已知检测器进行适度努力的情况下,前沿的大型语言模型已能够有效降低其自身的人工智能检测概率。所有代码、648个模仿草稿、已训练检测器、诊断结果和对抗轨迹均已发布。
cs.CL / 97 / 2605.02624
Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
合成用户,真实差异:多轮对话中用户模拟的评估框架
Abstract
There is growing interest in exploring user simulation as an alternative to gathering and scoring real user-chatbot interactions for AI chatbot evaluation. For this purpose, it is important to ensure the realism of the simulation, i.e., the extent to which simulated dialogues reflect real dialogues users have with chatbots. Most existing methods for evaluating simulation realism produce a coarse quality signal and remain solely at the level of individual dialogues. To support more rigorous evaluation in this area, we propose realsim, an evaluation framework that enables practitioners to take a distributional view of real vs. simulated dialogues along 8 dimensions, covering attributes related to the communicative functions of the interaction, user states, and the surface form of user messages. We then instantiate the framework with a curated dataset of 1K multi-turn task-focused real user-chatbot dialogues that cover 16 domains of chatbot applications. Overall, we find that simulated users tend to struggle to capture the communication frictions that real users introduce into interactions, which could make evaluations based on such simulations overly optimistic. We also observe variability in performance across different domains, which may indicate a need for domain-specific user simulators.
Chinese Translation
近年来,探索用户模拟作为评估人工智能聊天机器人的一种替代方法,逐渐引起了广泛关注,这种方法不再依赖真实用户与聊天机器人的互动进行收集和评分。为此,确保模拟的真实性变得尤为重要,即模拟对话在多大程度上能够反映用户与聊天机器人之间的真实对话。现有的大多数评估模拟真实性的方法所产生的信号质量较粗糙,并且仅停留在个别对话的层面。为了支持该领域更为严格的评估,我们提出了realsim评估框架,该框架使从业者能够从分布的角度比较真实对话与模拟对话,涉及8个维度,涵盖了与互动的交流功能、用户状态以及用户消息的表面形式相关的属性。然后,我们使用一个策划的数据集对该框架进行了实例化,该数据集包含1000个多轮任务导向的真实用户与聊天机器人的对话,涵盖了16个聊天机器人应用领域。总体而言,我们发现模拟用户在捕捉真实用户对互动引入的交流摩擦方面往往表现不佳,这可能使得基于此类模拟的评估过于乐观。我们还观察到,不同领域的表现存在变异性,这可能表明需要特定领域的用户模拟器。
cs.CL / 98 / 2605.02629
Mapping Discourse Reframing: A Multi-Layer Network Approach to Italian HPV Vaccine Discourse on X (2010-2024)
话语重构映射:意大利HPV疫苗在X平台上话语的多层网络方法研究(2010-2024)
Abstract
Understanding how online narratives travel through coalitions is critical for identifying information disorder, yet computational analyses often rely on conservative network constructions that erase initially sparse but salient signals. This paper proposes a novel multi-layer framework that captures low-frequency signals of emerging information disorder, making it possible to locate where online discourse is reframed and amplified over time. The use case is 14 years of Italian discourse on X regarding the Human Papillomavirus (HPV) vaccine across three pivotal epochs (2010-2024). Utilizing hashtag co-occurrence networks, we introduce a dual-layer approach. We first identify robust core discourse coalitions through conservative community detection, revealing a stable prevention-oriented backbone contrasted with increasingly separable skepticism coalitions. We then introduce a coverage layer and project fringe hashtags into core coalitions based on weighted connectivity. Using a manually labelled set of skeptical and conspiratorial seed tweets, we demonstrate that this core-coverage projection significantly improves the recovery of long-tail, problematic hashtags while preserving an interpretable coalition structure. Our findings characterize the structural maturation of polarized narratives and provide a methodology for mapping how discourse is reframed and amplified by information disorder over time.
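To make the core-coverage idea above concrete, here is a small sketch that builds a hashtag co-occurrence network and assigns each fringe hashtag to the core coalition it is most strongly connected to. It is only one plausible reading of "weighted connectivity": the function names, the summed-edge-weight rule, and the toy hashtags and coalition labels are assumptions for illustration, and the core assignment is taken as given rather than computed by community detection.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(tweets):
    """Count how often each unordered pair of hashtags appears in the same tweet."""
    weights = Counter()
    for tags in tweets:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights


def project_fringe(weights, core):
    """Assign each non-core hashtag to the coalition with the largest summed edge weight.

    `core` maps core hashtags to coalition labels (assumed to come from a
    separate, conservative community-detection step).
    """
    pull = defaultdict(Counter)
    for (a, b), w in weights.items():
        for fringe, anchor in ((a, b), (b, a)):
            if fringe not in core and anchor in core:
                pull[fringe][core[anchor]] += w
    return {tag: votes.most_common(1)[0][0] for tag, votes in pull.items()}


# Hypothetical toy data: two coalitions and two low-frequency hashtags.
tweets = [
    ["hpv", "vaccine", "prevention"],
    ["hpv", "prevention"],
    ["danger", "sideeffects"],
    ["danger", "rare1"],
    ["danger", "rare1"],
    ["prevention", "rare1"],
    ["prevention", "rare2"],
]
core = {
    "hpv": "prevention",
    "vaccine": "prevention",
    "prevention": "prevention",
    "danger": "skeptic",
    "sideeffects": "skeptic",
}
print(project_fringe(cooccurrence(tweets), core))
```

The point of the second layer is that rare hashtags never need to pass a frequency threshold themselves; they inherit a coalition from their neighbours, so the core structure stays interpretable while long-tail tags are still recovered.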
Chinese Translation
理解在线叙事如何通过联盟传播对于识别信息混乱至关重要,但计算分析通常依赖于保守的网络构建,这会抹去最初稀疏但显著的信号。本文提出了一种新的多层框架,捕捉新兴信息混乱的低频信号,允许我们确定在线话语在时间上被重构和放大的位置。研究对象为2010年至2024年间,意大利在X平台上关于人乳头瘤病毒(HPV)疫苗的14年话语。通过利用话题标签共现网络,我们引入了一种双层方法。首先,我们通过保守的社区发现方法识别出强大的核心话语联盟,揭示出一个稳定的以预防为导向的基础结构,与越来越可分离的怀疑主义联盟形成对比。然后我们引入了覆盖层,并根据加权连通性将边缘话题标签投射到核心联盟中。通过使用一组手动标记的怀疑和阴谋论种子推文,我们证明这种核心-覆盖投影显著提高了长尾问题话题标签的回收,同时保持了可解释的联盟结构。我们的研究结果描绘了极化叙事的结构成熟,提供了一种方法论,用于绘制信息混乱如何在时间上重构和放大话语的路径。
cs.CL / 99 / 2605.02647
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
上下文越狱:通过模拟对话引导进行进化红队攻击
Abstract
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy leverages a graded 0-5 harm score from a two-level judge as an in-loop signal, enabling partially harmful responses to guide the search process rather than being discarded. Search is driven by five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which the last two are novel contributions of this work. Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an ASR of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B, outperforming four single- and multi-turn baselines by 31-96 percentage points on average. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6, revealing a pronounced provider-level asymmetry in alignment robustness.
Chinese Translation
大型语言模型(LLMs)仍然容易受到越狱攻击,这些攻击绕过安全调节并引发有害的响应。越来越多的研究表明,上下文引导,即早期对话暗中影响后续回复,构成了一种强大的攻击面,经过精心设计的多轮结构在能力强大的模型上始终优于单轮操控。然而,基于自动优化的红队攻击在很大程度上仍然局限于单轮设置,反复使用静态提示,缺乏关于哪些对话引导形式能够诱导合规的推理能力。尽管最近基于多轮的搜索方法已开始填补这一空白,但有效引导对话的变异设计空间尚未被充分探索。我们提出了上下文越狱(ContextualJailbreak),这是一种黑箱红队策略,旨在对模拟的多轮引导对话进行进化搜索。该策略利用来自二级评估者的分级0-5有害评分作为循环内部信号,使得部分有害响应能够引导搜索过程,而不是被丢弃。搜索由五个语义定义的变异操作符驱动:角色扮演、场景、扩展、故障排除和机械性,其中后两者是本文的创新贡献。在50个具有代表性的HarmBench行为中,上下文越狱在gpt-oss:20B上取得了100%的攻击成功率,在qwen3-8B上为100%,在llama3.1:70B上为100%,在gpt-oss:120B上为90%,平均超越四个单轮和多轮基线31-96个百分点。针对gpt-oss:120B发现的40个最大有害攻击可以无须调整地转移到封闭前沿模型,分别在gpt-4o-mini上达到90.0%、在gpt-5上为70.0%和在gemini-3-flash上为70.0%,但在claude-opus-4-7上仅为17.5%,在claude-sonnet-4-6上为15.0%,揭示了对齐稳健性在提供商级别上的明显不对称性。
cs.CL / 100 / 2605.02665
Fuzzy Fingerprinting Encoder Pre-trained Language Models for Emotion Recognition in Conversations: Human Assessment and Validity Study
模糊指纹编码器预训练语言模型在对话中的情感识别:人类评估与有效性研究
Abstract
In Emotion Recognition in Conversations (ERC), model decisions should align with nuanced human perception and ideally provide insights on the classification process. Standard encoder pre-trained language models (PLMs) are the state of the art for these tasks but offer little insight into why a certain prediction is made. This is especially problematic in imbalanced datasets, where most utterances are labeled as neutral, making these models frequently misclassify minority emotions as the majority neutral class. To tackle this issue, we introduce a novel, interpretable approach to ERC by combining PLMs with Fuzzy Fingerprints (FFPs). FFPs provide class-specific prototypes that reflect the characteristic class activation patterns in the PLM's latent space. They are derived by ranking and fuzzifying the activations of the pooled conversational context-dependent embeddings across training instances for each emotion. At inference time, each input utterance is similarly fuzzy fingerprinted and matched to the emotion prototypes using a fuzzy similarity function based on the aggregation of the intersection of the fuzzy sets that define each FFP. Experimental results show that FFP integration reduces overclassification into the neutral class, and human evaluation further supports the adequacy of FFP predictions. Our proposed method thus bridges the gap between deep neural inference and human perception, performing at state-of-the-art level while simultaneously offering valuable insights into the classification procedure.
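The rank-fuzzify-match procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the linear rank-based membership function, the top-k truncation, the min-based intersection, and all function names are choices made here, and the four-dimensional activation vectors stand in for pooled PLM embeddings.

```python
def fingerprint(activations, k):
    """Keep the k strongest dimensions and give each a rank-based membership.

    Membership falls linearly from 1.0 (top rank) toward 1/k (assumed
    fuzzification function).
    """
    ranked = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )[:k]
    return {idx: (k - rank) / k for rank, idx in enumerate(ranked)}


def class_fingerprint(instances, k):
    """Build a class prototype by summing activations over training instances."""
    totals = [sum(column) for column in zip(*instances)]
    return fingerprint(totals, k)


def similarity(fp_a, fp_b):
    """Aggregate the fuzzy intersection (min of memberships) over shared dimensions."""
    k = max(len(fp_a), len(fp_b))
    shared = (min(m, fp_b[i]) for i, m in fp_a.items() if i in fp_b)
    return sum(shared) / k


def classify(activations, prototypes, k):
    """Return the label whose prototype is most similar to the input fingerprint."""
    fp = fingerprint(activations, k)
    return max(prototypes, key=lambda label: similarity(fp, prototypes[label]))


# Hypothetical pooled activations for two emotion classes.
prototypes = {
    "joy": class_fingerprint([[0.9, 0.1, 0.8, 0.0], [0.7, 0.2, 0.9, 0.1]], k=2),
    "neutral": class_fingerprint([[0.1, 0.9, 0.0, 0.7], [0.2, 0.8, 0.1, 0.6]], k=2),
}
print(classify([0.8, 0.0, 0.6, 0.1], prototypes, k=2))
```

Each prototype is just a small set of ranked dimensions, so a prediction can be inspected by listing which dimensions the utterance shares with the winning class, which is the kind of insight into the classification procedure the abstract refers to.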
Chinese Translation
在对话中的情感识别(ERC)中,模型的决策应与细微的人类感知相一致,并理想地提供关于分类过程的见解。标准编码器预训练语言模型(PLMs)在这些任务中处于最先进的水平,但对为什么做出某种预测几乎没有提供见解。这在不平衡数据集中尤其成问题,因为大多数话语被标记为中立,导致这些模型经常将少数情感错误分类为主要的中立类。为了解决这一问题,我们提出了一种新颖的、可解释的ERC方法,通过将PLMs与模糊指纹(FFPs)相结合。FFP提供特定类别的原型,反映了PLM潜在空间中的特征类激活模式。它们通过对每种情感的训练实例中汇总的对话上下文相关嵌入的激活进行排名和模糊化而得出。在推理时,每个输入话语也将类似地进行模糊指纹处理,并使用基于定义每个FFP的模糊集合交集聚合的模糊相似性函数与情感原型匹配。实验结果表明,FFP的整合减少了对中立类的过度分类,人类评估进一步支持了FFP预测的充分性。因此,我们提出的方法弥合了深度神经推理与人类感知之间的差距,表现出最先进的水平,同时提供了对分类过程的有价值见解。
cs.CL / 101 / 2605.02695
mdok-style at SemEval-2026 Task 9: Finetuning LLMs for Multilingual Polarization Detection
mdok样式在SemEval-2026任务9中的应用:对大型语言模型进行微调以实现多语言极化检测
Abstract
SemEval-2026 Task 9 is focused on multilingual polarization detection. Specifically, it covers the identification of multilingual, multicultural and multievent polarization along three axes (in subtasks), namely detection, type, and manifestation. Online polarization presents a concern, because it is often followed by hate speech, offensive discourse, and social fragmentation. Therefore, its detection before it escalates is crucial for a safer and more inclusive online space. We addressed this SemEval task by finetuning mid-size LLMs for the sequence-classification task using the QLoRA parameter-efficient finetuning technique. We augmented the multilingual (22 languages) training sets with anonymized, lower-cased, upper-cased, and homoglyph-substituted counterparts, making the detection more robust.
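The four augmentation variants mentioned above are simple text transforms. The sketch below shows one way they might look; the regular expressions, the placeholder tokens `[USER]` and `[URL]`, and the five-character homoglyph table are illustrative assumptions, since the abstract does not specify how the submitted system implements them.

```python
import re

# Hypothetical Latin-to-Cyrillic look-alike table; a real system would use a larger one.
HOMOGLYPHS = {
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "c": "\u0441",
    "p": "\u0440",
}


def anonymize(text):
    """Replace URLs and @-mentions with placeholder tokens."""
    text = re.sub(r"https?://\S+", "[URL]", text)
    return re.sub(r"@\w+", "[USER]", text)


def homoglyph(text):
    """Swap selected Latin letters for visually similar Cyrillic ones."""
    return text.translate(str.maketrans(HOMOGLYPHS))


def augment(text):
    """Return the original text followed by its four augmented counterparts."""
    return [text, anonymize(text), text.lower(), text.upper(), homoglyph(text)]


for variant in augment("Check @bob at http://example.org now"):
    print(variant)
```

Every variant keeps the label of the original example, so the classifier sees the same polarization signal under surface perturbations that it may encounter in obfuscated social-media text.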
Chinese Translation
SemEval-2026任务9聚焦于多语言极化检测。具体而言,它涵盖了在三个轴线(子任务)上识别多语言、多文化和多事件极化,即检测、类型和表现。在线极化是一个令人担忧的问题,因为它通常伴随着仇恨言论、冒犯性言论和社会分裂。因此,在极化升级之前进行检测对于创造一个更安全和更具包容性的在线空间至关重要。我们通过使用QLoRA参数高效微调技术对中型大型语言模型进行微调,以应对这一SemEval任务,针对序列分类任务进行训练。训练数据通过匿名、统一小写、统一大写和同形异义词的对应形式增强了多语言(22种语言)的训练集,使得检测更加稳健。
cs.CL / 102 / 2605.02712
mdok-style at SemEval-2026 Task 10: Finetuning LLMs for Conspiracy Detection
mdok风格在SemEval-2026任务10中的应用:针对阴谋检测的预训练大模型微调
Abstract
SemEval-2026 Task 10 is focused on conspiracy detection. Specifically, the goal is to detect whether a Reddit comment expresses a conspiracy belief. Our submitted mdok-style system utilizes data augmentation and self-training (to cope with a rather small amount of training data) to finetune the Qwen3-32B model for a binary text-classification task. The submitted system is very competitive, ranking in the 85th percentile (8th out of 52 submissions). The results show that our approach, which originated in machine-generated text detection, can be used for conspiracy detection as well.
Chinese Translation
SemEval-2026任务10专注于阴谋检测。具体而言,目标是检测一条Reddit评论是否表达了阴谋信念。我们提交的mdok风格系统利用数据增强和自我训练(以应对相对较少的训练数据)对Qwen3-32B模型进行微调,以完成二分类文本任务。提交的系统非常有竞争力,排名位于第85个百分位(52个提交中第8名)。结果表明,我们的方案最初用于机器生成文本检测,也可以用于阴谋检测。
cs.CL / 103 / 2605.02801
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
通过编排追踪进行基于大语言模型的多智能体系统的强化学习
Abstract
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
Chinese Translation
随着大型语言模型(LLM)代理从孤立的工具用户发展为协调团队,强化学习(RL)必须不仅优化个体行动,还要优化工作生成、委派、沟通、聚合和停止的方式。本文通过编排追踪研究基于LLM的多智能体系统中的RL:编排追踪是时间交互图,其事件包括子代理生成、委派、沟通、工具使用、返回、聚合和停止决策。在这个视角下,我们识别出三个技术轴心。首先,奖励设计涵盖八个类别,包括用于并行性加速、分割正确性和聚合质量的编排奖励。其次,奖励和信号附加到八个承载单位,从代币到团队;在我们整理的样本池中,明确的反事实消息级别信号尤其稀缺。第三,编排学习分解为五个子决策:何时生成、委派给谁、如何沟通、如何聚合以及何时停止。截至2026年5月4日,我们的整理样本池中找不到对停止决策的明确RL训练方法。我们将学术方法与Kimi Agent Swarm、OpenAI Codex和Anthropic Claude Code的公共工业证据相连接。由此产生的规模差距是公共报告的部署范围与开放学术评估机制之间的差距,而非对工业训练追踪的独立验证。我们在https://github.com/xxzcc/awesome-llm-mas-rl发布了相关资料,包括一个84条条目的标记论文池、一个32条记录的排除日志、脚本化语料统计以及用于可重放编排追踪的最小JSON模式。
cs.CL / 104 / 2605.02815
FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
FlexSQL: 灵活的探索与执行提升文本到SQL代理的表现
Abstract
Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning. FlexSQL generates diverse execution plans to cover multiple query interpretations, implements each plan in either SQL or Python depending on the task, and uses a two-tiered repair mechanism that can backtrack from code-level errors to plan-level revisions. On Spider2-Snow, using gpt-oss-120b, FlexSQL achieves a 65.4% score, outperforming strong open-source baselines that use stronger, larger models such as gpt-o3 and DeepSeek-R1. When integrated into a general-purpose coding agent (as skills in Claude Code), our approach yields over 10% relative improvement on Spider2-Snow. Further analysis shows that flexible exploration and flexible execution jointly contribute to the effectiveness of our approach, highlighting flexibility as a key design principle. Our code is available at: https://github.com/StringNLPLAB/FlexSQL
Chinese Translation
在大型分析数据库上进行文本到SQL转换需要导航复杂的模式、解决模糊查询,并将决策基于实际数据。大多数现有系统遵循固定的流程,在此过程中模式元素在最初阶段一次性检索,并且数据库仅在事后修复时重新访问,从而限制了从早期错误中恢复的能力。我们提出了FlexSQL,一种文本到SQL代理,其核心设计原则是灵活的数据库交互:该代理可以随时探索模式结构、检查数据值,并在推理过程中运行验证查询。FlexSQL生成多样的执行计划以覆盖多种查询解释,根据任务在SQL或Python中执行每个计划,并使用一种双层修复机制,能够从代码级错误回溯到计划级修订。在Spider2-Snow数据集上,使用gpt-oss-120b,FlexSQL达到了65.4%的得分,优于使用更强大、更大模型(如gpt-o3和DeepSeek-R1)的强开源基线。当集成到一个通用编码代理中(作为Claude Code的技能),我们的方法在Spider2-Snow上实现了超过10%的相对提升。进一步分析表明,灵活的探索和灵活的执行共同促成了我们方法的有效性,强调了灵活性作为一个关键设计原则。我们的代码可在以下链接获取:https://github.com/StringNLPLAB/FlexSQL
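The two-tiered repair idea in the FlexSQL abstract above (retry at the code level first, then back off to the plan level) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not FlexSQL's implementation: the function `run_with_two_tier_repair`, the callbacks `write_sql` and `fix_sql`, and the parameter `max_code_fixes` are hypothetical names, the "plan-level revision" is reduced to moving on to the next candidate plan, and SQLite stands in for the analytical database.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def run_with_two_tier_repair(
    conn: sqlite3.Connection,
    plans: List[str],
    write_sql: Callable[[str], str],
    fix_sql: Callable[[str, str, str], str],
    max_code_fixes: int = 2,
) -> Optional[Tuple[str, str, list]]:
    """Try each plan in order; within a plan, attempt code-level fixes before backtracking."""
    for plan in plans:
        sql = write_sql(plan)
        for attempt in range(max_code_fixes + 1):
            try:
                rows = conn.execute(sql).fetchall()
                return plan, sql, rows
            except sqlite3.Error as err:
                if attempt == max_code_fixes:
                    break  # tier 2: give up on this plan and move to the next one
                sql = fix_sql(plan, sql, str(err))  # tier 1: code-level repair
    return None  # every plan failed


# Toy database standing in for a real analytical warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])


def write_sql(plan: str) -> str:
    # Stub for a model call: the first plan targets a table that does not exist,
    # the second contains a misspelled column name.
    return {
        "use sales table": "SELECT SUM(amount) FROM sales",
        "use orders table": "SELECT SUM(amout) FROM orders",
    }[plan]


def fix_sql(plan: str, sql: str, error: str) -> str:
    # Stub for a model call: it can repair the typo but cannot invent a missing table.
    return sql.replace("amout", "amount")


result = run_with_two_tier_repair(
    conn, ["use sales table", "use orders table"], write_sql, fix_sql
)
```

In this toy run the first plan exhausts its code-level fixes because the `sales` table does not exist, so the loop backtracks to the second plan, whose misspelled column is repaired on the first retry.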