Daily Research Digest

arXiv Papers

2026-02-09
198 Papers / 4 Categories / 198 Translated
Robotics (机器人学) / 37 papers
cs.RO / 1 / 2602.06087

Dynamic Modeling, Parameter Identification and Numerical Analysis of Flexible Cables in Flexibly Connected Dual-AUV Systems

灵活连接双无人水下航行器系统中柔性电缆的动态建模、参数识别与数值分析
Chen, Kuo, Dou, Minghao, Liu, Qianqi, An, Yang, Ren, Kai, Wu, Zeming, Tian, Yu, Sun, Jie, Wang, Xinping, Chen, Zhier, Yu, Jiancheng
Abstract
This research presents a dynamic modeling framework and parameter identification methods for describing the highly nonlinear behaviors of flexibly connected dual-AUV systems. The modeling framework is established based on the lumped mass method, integrating axial elasticity, bending stiffness, added mass and hydrodynamic forces, thereby accurately capturing the time-varying response of the forces and cable configurations. To address the difficulty of directly measuring material-related and hydrodynamic coefficients, this research proposes a parameter identification method that combines the physical model with experimental data. High-precision inversion of the equivalent Young's modulus and hydrodynamic coefficients is performed through tension experiments under multiple configurations, effectively demonstrating that the identified model maintains predictive consistency across various operational conditions. Further numerical analysis indicates that the dynamic properties of the flexible cable exhibit significant nonlinear characteristics, which are highly dependent on material property variations and AUV motion conditions. This nonlinear dynamic behavior results in two typical response states, slack and taut, which are jointly determined by boundary conditions and hydrodynamic effects and significantly affect the cable configuration and endpoint loads. This research reveals the dynamics of flexible cables under complex boundary conditions, providing a theoretical foundation for the design, optimization and further control research of similar systems.
Chinese Translation
本研究提出了一种动态建模框架和参数识别方法,用于描述灵活连接的双无人水下航行器系统的高度非线性行为。该建模框架基于集中质量法建立,综合考虑了轴向弹性、弯曲刚度、附加质量和水动力作用力,从而准确捕捉力和电缆配置的时变响应。为了解决直接测量与材料相关和水动力系数的难题,本研究提出了一种将物理模型与实验数据相结合的参数识别方法。通过在多种配置下进行拉伸实验,高精度反演了等效杨氏模量和水动力系数,有效证明了所识别模型在各种操作条件下保持预测一致性。进一步的数值分析表明,柔性电缆的动态特性表现出显著的非线性特征,这些特征高度依赖于材料属性的变化和无人水下航行器的运动条件。这种非线性动态行为导致了两种典型的响应状态:松弛和紧绷,这两者共同由边界条件和水动力效应决定,显著影响电缆配置和端点载荷。本研究揭示了在复杂边界条件下柔性电缆的动态特性,为类似系统的设计、优化和进一步控制研究提供了理论基础。
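The lumped-mass formulation described in the abstract can be sketched in a few lines: the cable becomes a chain of point masses joined by axial springs, with a linear per-node drag term standing in for the hydrodynamic forces. This is a minimal planar toy under invented parameters, not the authors' model; every name and value here is illustrative.

```python
import numpy as np

def simulate_cable(n_nodes=11, length=1.0, mass=0.5, ea=200.0,
                   drag=2.0, g=9.81, dt=2e-4, steps=10000):
    """Toy lumped-mass cable: point masses joined by axial springs
    (stiffness EA per unit strain), both endpoints pinned, linear drag
    standing in for hydrodynamic forces. Semi-implicit Euler integration."""
    seg = length / (n_nodes - 1)                    # unstretched segment length
    m = mass / n_nodes                              # lumped mass per node
    pos = np.zeros((n_nodes, 2))
    pos[:, 0] = np.linspace(0.0, length, n_nodes)   # start taut and horizontal
    vel = np.zeros_like(pos)
    for _ in range(steps):
        f = np.zeros_like(pos)
        f[:, 1] -= m * g                            # gravity on every node
        d = pos[1:] - pos[:-1]
        dist = np.linalg.norm(d, axis=1, keepdims=True)
        t = ea * (dist - seg) / seg * (d / dist)    # axial spring force
        f[:-1] += t                                 # node i pulled toward i+1
        f[1:] -= t
        f -= drag * vel                             # linear drag
        vel += dt * f / m
        vel[0] = vel[-1] = 0.0                      # pinned boundary conditions
        pos += dt * vel
    return pos
```

Updating velocity before position (semi-implicit Euler) keeps the stiff axial springs stable at this step size; a cable that starts exactly taut between its pinned ends then sags under gravity, the planar analogue of the slack/taut response states the abstract describes.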
cs.RO / 2 / 2602.06088

Transformer-Based Reinforcement Learning for Autonomous Orbital Collision Avoidance in Partially Observable Environments

基于变换器的强化学习在部分可观测环境中的自主轨道碰撞避免
Georges, Thomas, Abdin, Adam
Abstract
We introduce a Transformer-based Reinforcement Learning framework for autonomous orbital collision avoidance that explicitly models the effects of partial observability and imperfect monitoring in space operations. The framework combines a configurable encounter simulator, a distance-dependent observation model, and a sequential state estimator to represent uncertainty in relative motion. A central contribution of this work is the use of a transformer-based Partially Observable Markov Decision Process (POMDP) architecture, which leverages long-range temporal attention to interpret noisy and intermittent observations more effectively than traditional architectures. This integration provides a foundation for training collision avoidance agents that can operate more reliably in imperfect monitoring environments.
Chinese Translation
我们提出了一种基于变换器的强化学习框架,用于自主轨道碰撞避免,该框架明确建模了在太空操作中部分可观测性和不完美监测的影响。该框架结合了可配置的接触模拟器、距离依赖的观测模型和顺序状态估计器,以表示相对运动的不确定性。本研究的一个核心贡献是采用基于变换器的部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)架构,该架构利用长程时间注意力比传统架构更有效地解释噪声和间歇性观测。这种整合为训练能够在不完美监测环境中更可靠地操作的碰撞避免代理提供了基础。
cs.RO / 3 / 2602.06191

Active Localization of Unstable Systems with Coarse Information

基于粗略信息的不稳定系统的主动定位
Yuceel, Ege, Liberzon, Daniel, Mitra, Sayan
Abstract
We study localization and control for unstable systems under coarse, single-bit sensing. Motivated by understanding the fundamental limitations imposed by such minimal feedback, we identify sufficient conditions under which the initial state can be recovered despite instability and extremely sparse measurements. Building on these conditions, we develop an active localization algorithm that integrates a set-based estimator with a control strategy derived from Voronoi partitions, which provably estimates the initial state while ensuring the agent remains in informative regions. Under the derived conditions, the proposed approach guarantees exponential contraction of the initial-state uncertainty, and the result is further supported by numerical experiments. These findings can offer theoretical insight into localization in robotics, where sensing is often limited to coarse abstractions such as keyframes, segmentations, or line-based features.
Chinese Translation
我们研究在粗略的单比特传感下不稳定系统的定位与控制。为了理解这种最小反馈所带来的基本限制,我们确定了在不稳定性和极度稀疏测量情况下仍能恢复初始状态的充分条件。在这些条件的基础上,我们开发了一种主动定位算法,该算法将基于集合的估计器与源自Voronoi划分的控制策略相结合,能够在确保智能体保持在信息丰富区域的同时,可靠地估计初始状态。在所推导的条件下,所提出的方法保证了初始状态不确定性的指数收缩,且该结果得到了数值实验的进一步支持。这些发现可以为机器人领域的定位提供理论见解,因为在该领域,传感通常仅限于粗略抽象,如关键帧、分割或基于线条的特征。
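A one-dimensional caricature of active localization under single-bit sensing: for the unstable scalar system x_{k+1} = a x_k + u_k, the controller can steer so that each sign measurement bisects the set of initial states consistent with the data, halving the uncertainty every step. This is an illustrative construction, not the paper's algorithm (which uses a set-based estimator with Voronoi partitions); it does echo the paper's theme that contraction needs a condition on the dynamics, since here the bisection rate must beat the instability.

```python
def bisection_localize(x0, a=1.5, lo=-1.0, hi=1.0, steps=30):
    """Active localization of x_{k+1} = a*x_k + u_k (a > 1, unstable) from
    one sign bit per step. The control steers the trajectory so that
    sign(x_k) reveals which side of the current interval midpoint the
    unknown initial state x0 lies on, halving the interval each step."""
    assert lo <= x0 <= hi
    x, s = x0, 0.0           # s = known input contribution: x_k = a^k * x0 + s
    ak = 1.0                 # running value of a^k
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        # choose u so that x_{k+1} = a^{k+1} * (x0 - mid): its sign bisects
        u = -ak * a * mid - a * s
        x = a * x + u
        s = a * s + u
        ak *= a
        if x >= 0.0:         # the single-bit measurement
            lo = mid
        else:
            hi = mid
    return lo, hi
```

After the control update, x_{k+1} = a^{k+1}(x0 - mid), so the sign bit is exactly a comparison of x0 against the midpoint; the interval for x0 contracts as 2^{-k} even though the state itself is unstable.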
cs.RO / 4 / 2602.06207

Bioinspired Kirigami Capsule Robot for Minimally Invasive Gastrointestinal Biopsy

仿生切纸艺术胶囊机器人用于微创胃肠活检
Zhao, Ruizhou, Chu, Yichen, Zhao, Shuwei, Yue, Wenchao, Tang, Raymond Shing-Yan, Ren, Hongliang
Abstract
Wireless capsule endoscopy (WCE) has transformed gastrointestinal (GI) diagnostics by enabling noninvasive visualization of the digestive tract, yet its diagnostic yield remains constrained by the absence of biopsy capability, as histological analysis is still the gold standard for confirming disease. Conventional biopsy using forceps, needles, or rotating blades is invasive, limited in reach, and carries risks of perforation or mucosal trauma, while fluid- or microbiota-sampling capsules cannot provide structured tissue for pathology, leaving a critical gap in swallowable biopsy solutions. Here we present the Kiri-Capsule, a kirigami-inspired capsule robot that integrates deployable PI-film flaps actuated by a compact dual-cam mechanism to achieve minimally invasive and repeatable tissue collection. The kirigami surface remains flat during locomotion but transforms into sharp protrusions upon cam-driven stretching, enabling controlled penetration followed by rotary scraping, with specimens retained in internal fan-shaped cavities. Bench tests confirmed that PI films exhibit a Young's modulus of approximately 20 MPa and stable deployment angles (about 34$^\circ$ at 15% strain), while ex vivo porcine studies demonstrated shallow penetration depths (median $\sim$0.61 mm, range 0.46--0.66 mm) and biopsy yields comparable to standard forceps (mean $\sim$10.9 mg for stomach and $\sim$18.9 mg for intestine), with forces within safe ranges reported for GI biopsy. These findings demonstrate that the Kiri-Capsule bridges passive imaging and functional biopsy, providing a swallowable, depth-controlled, and histology-ready solution that advances capsule-based diagnostics toward safe and effective clinical application.
Chinese Translation
无线胶囊内窥镜(WCE)通过实现对消化道的非侵入性可视化,已经改变了胃肠(GI)诊断,但其诊断效果仍受限于缺乏活检能力,因为组织学分析仍然是确认疾病的金标准。传统的活检方法使用钳子、针头或旋转刀片,具有侵入性、范围有限,并且存在穿孔或粘膜损伤的风险,而液体或微生物采样胶囊无法提供结构化的组织用于病理分析,导致可吞咽活检解决方案存在重要缺口。在此,我们提出了Kiri-Capsule,一种受切纸艺术启发的胶囊机器人,集成了由紧凑的双凸轮机制驱动的可展开聚酰亚胺(PI)膜瓣,以实现微创和可重复的组织采集。该切纸艺术表面在运动过程中保持平坦,但在凸轮驱动的拉伸下转变为尖锐的突出物,使得能够进行受控穿透,随后进行旋转刮取,样本保留在内部扇形腔体中。台架测试确认PI膜的杨氏模量约为20 MPa,稳定的展开角度(在15%应变下约为34$^\circ$),而离体猪组织研究显示浅层穿透深度(中位数约为0.61 mm,范围0.46--0.66 mm)和与标准钳子相当的活检产量(胃的平均约为10.9 mg,肠道的平均约为18.9 mg),所需的力在胃肠活检的安全范围内。这些发现表明Kiri-Capsule弥合了被动成像与功能性活检之间的差距,提供了一种可吞咽、深度可控且适于组织学分析的解决方案,推动基于胶囊的诊断向安全有效的临床应用迈进。
cs.RO / 5 / 2602.06219

Coupled Local and Global World Models for Efficient First Order RL

高效的一阶强化学习的耦合局部和全局世界模型
Amigo, Joseph, Khorrambakht, Rooholla, Mansard, Nicolas, Righetti, Ludovic
Abstract
World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally complex to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.
Chinese Translation
世界模型为更真实地捕捉复杂动态(包括接触和非刚性)以及复杂感知信息(如视觉感知)提供了一个有前景的途径,尤其是在标准模拟器难以应对的情况下。然而,这些模型在评估时计算复杂度高,这对那些已经成功应用于模拟器以解决复杂运动任务但在操控方面仍然面临挑战的流行强化学习方法构成了挑战。本文介绍了一种完全绕过模拟器的方法,在从机器人与真实环境的交互中学习的世界模型内部训练强化学习策略。我们的核心方法通过一种新颖的解耦一阶梯度(FoG)方法,使得使用大规模扩散模型进行策略训练成为可能:全尺度世界模型生成准确的前向轨迹,而轻量级的潜在空间代理则近似其局部动态以实现高效的梯度计算。这种局部和全局世界模型的耦合确保了高保真度的展开,同时保持了计算上可处理的微分。我们在Push-T操控任务上展示了我们方法的有效性,其样本效率显著优于PPO。我们还通过一个四足机器人进行的自我中心物体操控任务进一步评估了我们的方法。这些结果共同表明,在数据驱动的世界模型内部学习是解决图像空间中难以建模的强化学习任务的一个有前景的途径,而无需依赖手工制作的物理模拟器。
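The decoupled first-order gradient can be illustrated on a scalar system: unroll the trajectory with the accurate "global" model, then backpropagate through the cheap "local" surrogate's derivatives evaluated at the visited states. This is a hypothetical toy, not the paper's latent-space implementation; here the surrogate's Jacobian is chosen to match the true dynamics so the result can be checked against finite differences, whereas in the paper both models are learned.

```python
import math

def decoupled_fog_grad(x0, us, f_full, df_sur):
    """Decoupled first-order gradients, in the spirit of the abstract:
    forward-unroll with the accurate model f_full, then accumulate the
    chain rule through the cheap surrogate derivatives df_sur at the
    visited states. Returns d x_T / d u_k for every control u_k."""
    xs = [x0]
    for u in us:
        xs.append(f_full(xs[-1], u))        # accurate forward trajectory
    grads = [0.0] * len(us)
    adj = 1.0                               # adjoint, d x_T / d x_T = 1
    for k in reversed(range(len(us))):
        a, b = df_sur(xs[k], us[k])         # surrogate dx'/dx and dx'/du
        grads[k] = adj * b
        adj *= a
    return grads

# Toy scalar dynamics and a surrogate returning its exact local Jacobian.
f_full = lambda x, u: 0.9 * x + u + 0.05 * math.sin(x)
df_sur = lambda x, u: (0.9 + 0.05 * math.cos(x), 1.0)
```

Because the surrogate Jacobian here is exact, the decoupled gradient coincides with the true gradient of the full rollout; with a learned surrogate it would only approximate it, which is the accuracy/cost trade the abstract describes.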
cs.RO / 6 / 2602.06243

A Dialogue-Based Human-Robot Interaction Protocol for Wheelchair and Robotic Arm Integrated Control

一种基于对话的人机交互协议用于轮椅和机器人手臂的集成控制
Liu, Guangping, Hawkins, Nicholas, Madden, Billy, Sultan, Tipu, Babaiasl, Madi
Abstract
People with lower and upper body disabilities can benefit from wheelchairs and robotic arms to improve mobility and independence. Prior assistive interfaces, such as touchscreens and voice-driven predefined commands, often remain unintuitive and struggle to capture complex user intent. We propose a natural, dialogue-based human-robot interaction protocol that simulates an intelligent agent capable of communicating with users to understand intent and execute assistive actions. In a pilot study, five participants completed five assistive tasks (cleaning, drinking, feeding, drawer opening, and door opening) through dialogue-based interaction with a wheelchair and robotic arm. As a baseline, participants were required to open a door using manual control (a wheelchair joystick and a game controller for the arm) and complete a questionnaire to gather their feedback. By analyzing the post-study questionnaires, we found that most participants enjoyed the dialogue-based interaction and assistive robot autonomy.
Chinese Translation
下肢和上肢残疾人士可以通过轮椅和机器人手臂来改善移动性和独立性。以往的辅助接口,如触摸屏和基于语音的预定义命令,往往不够直观,难以捕捉复杂的用户意图。我们提出了一种自然的基于对话的人机交互协议,模拟一个能够与用户沟通的智能代理,以理解意图并执行辅助操作。在一项初步研究中,五名参与者通过与轮椅和机器人手臂的对话式交互完成了五项辅助任务(清洁、饮水、喂食、打开抽屉和开门)。作为基线,参与者需要使用手动控制(轮椅操纵杆和用于机器人手臂的游戏控制器)打开一扇门,并填写问卷以收集他们的反馈。通过分析研究后的问卷,我们发现大多数参与者享受基于对话的交互和辅助机器人的自主性。
cs.RO / 7 / 2602.06265

MORPH Wheel: A Passive Variable-Radius Wheel Embedding Mechanical Behavior Logic for Input-Responsive Transformation

MORPH轮:一种嵌入机械行为逻辑的被动可变半径轮以实现输入响应变换
Jang, JaeHyung, Seo, JuYeong, Lee, Dae-Young, Ryu, Jee-Hwan
Abstract
This paper introduces the Mechanically prOgrammed Radius-adjustable PHysical (MORPH) wheel, a fully passive variable-radius wheel that embeds mechanical behavior logic for torque-responsive transformation. Unlike conventional variable transmission systems relying on actuators, sensors, and active control, the MORPH wheel achieves passive adaptation solely through its geometry and compliant structure. The design integrates a torque-response coupler and spring-loaded connecting struts to mechanically adjust the wheel radius between 80 mm and 45 mm in response to input torque, without any electrical components. The MORPH wheel provides three unique capabilities rarely achieved simultaneously in previous passive designs: (1) bidirectional operation with unlimited rotation through a symmetric coupler; (2) high torque capacity exceeding 10 N·m with rigid power transmission in drive mode; and (3) precise and repeatable transmission ratio control governed by deterministic kinematics. A comprehensive analytical model was developed to describe the wheel's mechanical behavior logic, establishing threshold conditions for mode switching between direct drive and radius transformation. Experimental validation confirmed that the measured torque-radius and force-displacement characteristics closely follow theoretical predictions across wheel weights of 1.8-2.8 kg. Robot-level demonstrations on varying loads (0-25 kg), slopes, and unstructured terrains further verified that the MORPH wheel passively adjusts its radius to provide an optimal transmission ratio. The MORPH wheel exemplifies a mechanically programmed structure, embedding intelligent, context-dependent behavior directly into its physical design. This approach offers a new paradigm for passive variable transmission and mechanical intelligence in robotic mobility systems operating in unpredictable or control-limited environments.
Chinese Translation
本文介绍了一种机械编程的可调半径物理轮(MORPH轮),这是一种完全被动的可变半径轮,嵌入了用于扭矩响应变换的机械行为逻辑。与依赖于执行器、传感器和主动控制的传统变速传动系统不同,MORPH轮仅通过其几何形状和柔性结构实现被动适应。该设计集成了一个扭矩响应耦合器和弹簧加载的连接支杆,以机械方式在80毫米和45毫米之间调整轮子的半径,以响应输入扭矩,而无需任何电气组件。MORPH轮提供了三种在以往被动设计中难以同时实现的独特能力:(1)通过对称耦合器实现双向操作和无限旋转;(2)在驱动模式下具备超过10牛顿的高扭矩能力,具有刚性动力传输;(3)由确定性运动学控制的精确且可重复的传动比控制。我们开发了一个全面的分析模型,以描述轮子的机械行为逻辑,建立了直接驱动与半径变换之间模式切换的阈值条件。实验验证表明,测得的扭矩-半径和力-位移特性在1.8-2.8千克的轮重范围内与理论预测密切吻合。在不同负载(0-25千克)、坡度和非结构化地形上的机器人级演示进一步验证了MORPH轮被动调整其半径以提供最佳传动比的能力。MORPH轮体现了一种机械编程结构,将智能、上下文依赖的行为直接嵌入其物理设计中。这种方法为在不可预测或控制受限环境中运行的机器人移动系统提供了一种新的被动变速传动和机械智能范式。
cs.RO / 8 / 2602.06273

A High-Fidelity Robotic Manipulator Teleoperation Framework for Human-Centered Augmented Reality Evaluation

用于以人为本的增强现实评估的高保真机器人操控远程操作框架
Chhajed, Harsh, Guo, Tian
Abstract
Validating Augmented Reality (AR) tracking and interaction models requires precise, repeatable ground-truth motion. However, human users cannot reliably perform consistent motion due to biomechanical variability. Robotic manipulators are promising candidates to act as human motion proxies if they can mimic human movements. In this work, we design and implement ARBot, a real-time teleoperation platform that can effectively capture natural human motion and accurately replay the movements via robotic manipulators. ARBot includes two capture modes: stable wrist motion capture via a custom CV and IMU pipeline, and natural 6-DOF control via a mobile application. We design a proactively-safe QP controller to ensure smooth, jitter-free execution of the robotic manipulator, enabling it to function as a high-fidelity record-and-replay physical proxy. We open-source ARBot and release a benchmark dataset of 132 human and synthetic trajectories captured using ARBot to support controllable and scalable AR evaluation.
Chinese Translation
验证增强现实(AR)跟踪和交互模型需要精确、可重复的真实运动。然而,由于生物力学的变异性,人类用户无法可靠地执行一致的运动。机器人操控器有望作为人类运动的代理,如果它们能够模仿人类的动作。在本研究中,我们设计并实现了ARBot,一个实时远程操作平台,能够有效捕捉自然的人类运动,并通过机器人操控器准确重放这些动作。ARBot包括两种捕捉模型:通过定制的计算机视觉(CV)和惯性测量单元(IMU)管道进行稳定的手腕运动捕捉,以及通过移动应用程序进行自然的六自由度(6-DOF)控制。我们设计了一种主动安全的二次规划(QP)控制器,以确保机器人操控器的平稳、无抖动执行,使其能够作为高保真的记录和重放物理代理。我们将ARBot开源,并发布了一个基准数据集,其中包含132条使用ARBot捕捉的人类和合成轨迹,以支持可控和可扩展的AR评估。
cs.RO / 9 / 2602.06294

Robots That Generate Planarity Through Geometry

通过几何生成平面性的机器人
Kowalewski, Jakub F., Alrashed, Abdulaziz O., Alpert, Jacob, Ponnapalli, Rishi, Meza, Lucas R., Lipton, Jeffrey Ian
Abstract
Constraining motion to a flat surface is a fundamental requirement for equipment across science and engineering. Modern precision robotic motion systems, such as gantries, rely on the flatness of components, including guide rails and granite surface plates. However, translating this static flatness into motion requires precise internal alignment and tight-tolerance components that create long, error-sensitive reference chains. Here, we show that by using the geometric inversion of a sphere into a plane, we can produce robotic motion systems that derive planarity entirely from link lengths and connectivity. This allows planar motion to emerge from self-referencing geometric constraints, and without external metrology. We demonstrate these Flat-Plane Mechanisms (FPMs) from micron to meter scales and show that fabrication errors can be attenuated by an order of magnitude in the resulting flatness. Finally, we present a robotic FPM-based 3-axis positioning system that can be used for metrology surface scans ($\pm 12$ mm) and 3D printing inside narrow containers. This work establishes an alternative geometric foundation for planar motion that can be realized across size scales and opens new possibilities in metrology, fabrication, and micro-positioning.
Chinese Translation
将运动限制在平面表面是科学和工程设备的基本要求。现代精密机器人运动系统,如龙门架,依赖于组件的平整性,包括导轨和花岗岩平面板。然而,将这种静态平整性转化为运动需要精确的内部对齐和高精度的组件,这些组件形成了长且对误差敏感的参考链。在这里,我们展示了通过将球体的几何反演转化为平面,我们可以生产出完全依赖于连杆长度和连接性的机器人运动系统。这使得平面运动能够从自我参考的几何约束中涌现出来,而无需外部计量。我们展示了这些平面机制(Flat-Plane Mechanisms, FPM)在微米到米尺度上的应用,并表明在最终的平整性中,制造误差可以被降低一个数量级。最后,我们提出了一种基于FPM的机器人三轴定位系统,可以用于计量表面扫描(±12毫米)和在狭窄容器内的3D打印。这项工作建立了一种可在不同尺寸尺度上实现的平面运动的替代几何基础,并为计量、制造和微定位开辟了新的可能性。
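The geometric fact underlying these mechanisms is that inversion about a point maps a circle (or sphere) passing through that point to a straight line (or plane); this is the classical principle behind Peaucellier-type straight-line linkages, and it is easy to verify numerically. The 2-D functions below are purely illustrative, not the authors' mechanism design.

```python
import math

def invert(p, k=1.0):
    """Geometric inversion about the origin: p -> k^2 * p / |p|^2."""
    d2 = p[0] ** 2 + p[1] ** 2
    return (k * k * p[0] / d2, k * k * p[1] / d2)

def circle_through_origin(r, theta):
    """Point on a circle of radius r centred at (r, 0); this circle
    passes through the inversion centre O = (0, 0)."""
    return (r + r * math.cos(theta), r * math.sin(theta))
```

For a circle of radius r through the origin, every inverted point lands on the vertical line x = k^2 / (2r): circular motion of a link is converted into exactly planar (here, linear) motion, with flatness set by link lengths rather than by a machined reference surface.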
cs.RO / 10 / 2602.06296

Internalized Morphogenesis: A Self-Organizing Model for Growth, Replication, and Regeneration via Local Token Exchange in Modular Systems

内化形态发生:一种通过模块系统中的局部令牌交换实现生长、复制和再生的自组织模型
Ishida, Takeshi
Abstract
This study presents an internalized morphogenesis model for autonomous systems, such as swarm robotics and micro-nanomachines, that eliminates the need for external spatial computation. Traditional self-organizing models often require calculations across the entire coordinate space, including empty areas, which is impractical for resource-constrained physical modules. Our proposed model achieves complex morphogenesis through strictly local interactions between adjacent modules within the "body." By extending the "Ishida token model," modules exchange integer values using a reaction-diffusion (RD)-inspired discrete analogue without solving differential equations. The internal potential, derived from token accumulation and aging, guides autonomous growth, shrinkage, and replication. Simulations on a hexagonal grid demonstrated the emergence of limb-like extensions, self-division, and robust regeneration capabilities following structural amputation. A key feature is the use of the body boundary as a natural sink for information entropy (tokens) to maintain a dynamic equilibrium. These results indicate that sophisticated morphological behaviors can emerge from minimal, internal-only rules. This framework offers a computationally efficient and biologically plausible approach to developing self-repairing, adaptive, and autonomous hardware.
Chinese Translation
本研究提出了一种内化形态发生模型,适用于自主系统,如群体机器人和微纳米机器,消除了对外部空间计算的需求。传统的自组织模型通常需要在整个坐标空间内进行计算,包括空白区域,这对于资源受限的物理模块来说是不切实际的。我们提出的模型通过相邻模块之间的严格局部交互实现复杂的形态发生。通过扩展“Ishida令牌模型”,模块使用受RD启发的离散模拟交换整数值,而无需求解微分方程。内部潜力源于令牌的积累和老化,指导自主生长、收缩和复制。在六边形网格上的模拟显示出肢体状延伸、自我分裂以及在结构截肢后强大的再生能力。一个关键特征是将身体边界用作信息熵(令牌)的自然汇,以维持动态平衡。这些结果表明,复杂的形态行为可以从最小的、仅限内部的规则中涌现。该框架提供了一种计算效率高且生物学上合理的方法,用于开发自我修复、自适应和自主硬件。
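A toy analogue of local token exchange with a boundary sink: modules on a 1-D chain hold integer token counts, each interior module hands an integer quarter of its tokens to each neighbour per synchronous step, and boundary modules discard whatever they receive. This sketches only the boundary-as-sink idea; it is not the Ishida token model itself, and the rule and numbers are invented for illustration.

```python
def step_tokens(tokens, sinks):
    """One synchronous local-exchange step on a 1-D chain of modules.
    Each non-sink module sends floor(count / 4) tokens to each neighbour;
    tokens arriving at a sink (boundary) module leave the system."""
    n = len(tokens)
    new = list(tokens)
    for i in range(n):
        if i in sinks or tokens[i] == 0:
            continue
        give = tokens[i] // 4               # integer tokens only
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                new[i] -= give
                if j not in sinks:
                    new[j] += give          # sinks silently discard
    return new
```

Counts stay non-negative (a module sends at most half its tokens) and the total is non-increasing, with losses occurring only at the boundary sinks, mirroring how the body boundary drains accumulated tokens to hold a dynamic equilibrium.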
cs.RO / 11 / 2602.06339

Action Hallucination in Generative Visual-Language-Action Models

生成视觉-语言-动作模型中的动作幻觉
Soh, Harold, Lim, Eugene
Abstract
Robot Foundation Models such as Vision-Language-Action models are rapidly reshaping how robot policies are trained and deployed, replacing hand-designed planners with end-to-end generative action models. While these systems demonstrate impressive generalization, it remains unclear whether they fundamentally resolve the long-standing challenges of robotics. We address this question by analyzing action hallucinations that violate physical constraints and their extension to plan-level failures. Focusing on latent-variable generative policies, we show that hallucinations often arise from structural mismatches between feasible robot behavior and common model architectures. We study three such barriers -- topological, precision, and horizon -- and show how they impose unavoidable tradeoffs. Our analysis provides mechanistic explanations for reported empirical failures of generative robot policies and suggests principled directions for improving reliability and trustworthiness, without abandoning their expressive power.
Chinese Translation
机器人基础模型,如视觉-语言-动作模型,正在迅速重塑机器人策略的训练和部署方式,取代了手工设计的规划器,采用端到端的生成动作模型。尽管这些系统展示了令人印象深刻的泛化能力,但它们是否从根本上解决了机器人技术长期存在的挑战仍然不明确。我们通过分析违反物理约束的动作幻觉及其对规划层面失败的扩展来探讨这个问题。我们专注于潜变量生成策略,表明幻觉通常源于可行机器人行为与常见模型架构之间的结构不匹配。我们研究了三种此类障碍——拓扑、精度和视野——并展示了它们如何施加不可避免的权衡。我们的分析为生成机器人策略的实证失败提供了机械解释,并提出了在不放弃其表现力的情况下改善可靠性和可信性的原则性方向。
cs.RO / 12 / 2602.06341

HiWET: Hierarchical World-Frame End-Effector Tracking for Long-Horizon Humanoid Loco-Manipulation

HiWET:用于长时间人形机器人运动操控的分层世界框架末端执行器跟踪
Cao, Zhanxiang, Yan, Liyun, Zhang, Yang, Chen, Sirui, Ma, Jianming, Zhan, Tianyue, Fu, Shengcheng, Jia, Yufei, Lu, Cewu, Gao, Yue
Abstract
Humanoid loco-manipulation requires executing precise manipulation tasks while maintaining dynamic stability amid base motion and impacts. Existing approaches typically formulate commands in body-centric frames and fail to inherently correct the cumulative world-frame drift induced by legged locomotion. We reformulate the problem as world-frame end-effector tracking and propose HiWET, a hierarchical reinforcement learning framework that decouples global reasoning from dynamic execution. The high-level policy generates subgoals that jointly optimize end-effector accuracy and base positioning in the world frame, while the low-level policy executes these commands under stability constraints. We introduce a Kinematic Manifold Prior (KMP) that embeds the manipulation manifold into the action space via residual learning, reducing exploration dimensionality and mitigating kinematically invalid behaviors. Extensive simulation and ablation studies demonstrate that HiWET achieves precise and stable end-effector tracking in long-horizon world-frame tasks. We validate zero-shot sim-to-real transfer of the low-level policy on a physical humanoid, demonstrating stable locomotion under diverse manipulation commands. These results indicate that explicit world-frame reasoning combined with hierarchical control provides an effective and scalable solution for long-horizon humanoid loco-manipulation.
Chinese Translation
人形机器人运动操控需要在基础运动和冲击的情况下执行精确的操控任务,同时保持动态稳定性。现有方法通常在以身体为中心的框架中制定指令,未能内在地纠正由腿部运动引起的累积世界框架漂移。我们将问题重新表述为世界框架下的末端执行器跟踪,并提出了HiWET,一个分层强化学习框架,该框架将全局推理与动态执行解耦。高层策略生成的子目标共同优化末端执行器的精度和基础在世界框架中的定位,而低层策略在稳定性约束下执行这些指令。我们引入了一种运动流形先验(Kinematic Manifold Prior, KMP),通过残差学习将操控流形嵌入到动作空间中,从而降低探索维度并减轻运动学无效行为。大量的仿真和消融研究表明,HiWET在长时间世界框架任务中实现了精确且稳定的末端执行器跟踪。我们验证了低层策略在物理人形机器人上的零样本(zero-shot)模拟到现实迁移,展示了在多样化操控指令下的稳定运动。这些结果表明,明确的世界框架推理结合分层控制为长时间人形机器人运动操控提供了有效且可扩展的解决方案。
cs.RO / 13 / 2602.06356

Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation

从源头遏制漂移:用于稳健视觉-语言导航的回顾性修正
He, Gang, Liu, Zhenyang, Xu, Kepeng, Xu, Li, Qiao, Tong, Yu, Wenxin, Wu, Chang, Xie, Weiying
Abstract
Vision-Language Navigation (VLN) requires embodied agents to interpret natural language instructions and navigate through complex continuous 3D environments. However, the dominant imitation learning paradigm suffers from exposure bias, where minor deviations during inference lead to compounding errors. While DAgger-style approaches attempt to mitigate this by correcting error states, we identify a critical limitation: Instruction-State Misalignment. Forcing an agent to learn recovery actions from off-track states often creates supervision signals that semantically conflict with the original instruction. In response to these challenges, we introduce BudVLN, an online framework that learns from on-policy rollouts by constructing supervision to match the current state distribution. BudVLN performs retrospective rectification via counterfactual re-anchoring and decision-conditioned supervision synthesis, using a geodesic oracle to synthesize corrective trajectories that originate from valid historical states, ensuring semantic consistency. Experiments on the standard R2R-CE and RxR-CE benchmarks demonstrate that BudVLN consistently mitigates distribution shift and achieves state-of-the-art performance in both Success Rate and SPL.
Chinese Translation
视觉-语言导航(VLN)要求具身代理能够理解自然语言指令并在复杂的连续三维环境中导航。然而,主流的模仿学习范式存在暴露偏差的问题,即在推理过程中轻微的偏差会导致错误的累积。尽管DAgger风格的方法试图通过纠正错误状态来缓解这一问题,但我们识别出一个关键的局限性:指令-状态不一致。强迫代理从偏离轨迹的状态中学习恢复动作,往往会产生与原始指令在语义上冲突的监督信号。针对这些挑战,我们提出了BudVLN,一个通过构建与当前状态分布匹配的监督信号来从在线策略回放中学习的框架。BudVLN通过反事实重新锚定和决策条件监督合成进行回顾性修正,利用测地线神谕合成从有效历史状态出发的纠正轨迹,以确保语义一致性。在标准的R2R-CE和RxR-CE基准上的实验表明,BudVLN始终有效减轻了分布漂移,并在成功率和SPL方面实现了最先进的性能。
cs.RO / 14 / 2602.06366

Towards Adaptive Environment Generation for Training Embodied Agents

面向适应性环境生成以训练具身智能体
Yeo, Teresa, Weerakoon, Dulaj, Weerakoon, Dulanga, Misra, Archan
Abstract
Embodied agents struggle to generalize to new environments, even when those environments share similar underlying structures to their training settings. Most current approaches to generating these training environments follow an open-loop paradigm, without considering the agent's current performance. While procedural generation methods can produce diverse scenes, diversity without feedback from the agent is inefficient. The generated environments may be trivially easy, providing limited learning signal. To address this, we present a proof-of-concept for closed-loop environment generation that adapts difficulty to the agent's current capabilities. Our system employs a controllable environment representation, extracts fine-grained performance feedback beyond binary success or failure, and implements a closed-loop adaptation mechanism that translates this feedback into environment modifications. This feedback-driven approach generates training environments that are more challenging in the ways the agent needs to improve, enabling more efficient learning and better generalization to novel settings.
Chinese Translation
具身智能体在新环境中难以进行泛化,即使这些环境与其训练设置具有相似的基础结构。目前大多数生成这些训练环境的方法遵循开放循环范式,而未考虑智能体的当前表现。尽管程序生成方法可以产生多样化的场景,但缺乏来自智能体的反馈使得这种多样性效率低下。生成的环境可能过于简单,提供的学习信号有限。为了解决这一问题,我们提出了一种闭环环境生成的概念验证,能够根据智能体的当前能力调整难度。我们的系统采用可控的环境表示,提取超越二元成功或失败的细粒度表现反馈,并实施一种闭环适应机制,将这些反馈转化为环境修改。这种基于反馈的方法生成的训练环境在智能体需要改进的方面更具挑战性,从而实现更高效的学习和更好的对新环境的泛化。
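The closed-loop adaptation step can be sketched as a tiny controller: fine-grained progress feedback (a fraction in [0, 1], rather than a binary success flag) nudges a scalar difficulty knob toward a target performance band. The function, thresholds, and step sizes below are all hypothetical, not the paper's mechanism.

```python
def adapt_difficulty(difficulty, progress, target=0.6, band=0.1,
                     step=0.05, lo=0.0, hi=1.0):
    """Closed-loop difficulty update: `progress` is fine-grained feedback
    in [0, 1], e.g. the fraction of the task the agent completed.
    Environments the agent cruises through get harder, environments it
    fails badly get easier, and inside the band nothing changes."""
    if progress > target + band:        # too easy: raise difficulty
        difficulty += step
    elif progress < target - band:      # too hard: lower difficulty
        difficulty -= step
    return max(lo, min(hi, difficulty))
```

Keeping difficulty pinned near a target progress band is one simple way to avoid the trivially-easy environments the abstract warns about while still letting a struggling agent recover.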
cs.RO / 15 / 2602.06380

A Consistency-Improved LiDAR-Inertial Bundle Adjustment

一致性改进的激光雷达-惯性束调整
Li, Xinran, Zheng, Shuaikang, Zheng, Pengcheng, Wang, Xinyang, Li, Jiacheng, Li, Zhitian, Zou, Xudong
Abstract
Simultaneous Localization and Mapping (SLAM) using 3D LiDAR has emerged as a cornerstone for autonomous navigation in robotics. While feature-based SLAM systems have achieved impressive results by leveraging edge and planar structures, they often suffer from estimator inconsistency associated with feature parameterization and covariance estimation. In this work, we present a consistency-improved LiDAR-inertial bundle adjustment (BA) with tailored parameterization and estimator. First, we propose a stereographic-projection representation parameterizing the planar and edge features, and conduct a comprehensive observability analysis to support its integrability with a consistent estimator. Second, we implement a LiDAR-inertial BA with a Maximum a Posteriori (MAP) formulation and First-Estimate Jacobians (FEJ) to preserve the accurate estimated covariance and observability properties of the system. Last, we apply our proposed BA method to a LiDAR-inertial odometry.
Chinese Translation
使用3D激光雷达的同时定位与地图构建(SLAM)已成为机器人自主导航的基石。尽管基于特征的SLAM系统通过利用边缘和平面结构取得了显著成果,但它们往往受到与特征参数化和估计协方差相关的不一致估计器的影响。在本研究中,我们提出了一种一致性改进的激光雷达-惯性束调整(BA),采用定制的参数化和估计器。首先,我们提出了一种立体投影表示法,对平面和边缘特征进行参数化,并进行全面的可观测性分析,以支持其与一致性估计器的可集成性。其次,我们实现了一种基于最大后验(MAP)形式和首次估计雅可比(FEJ)的激光雷达-惯性BA,以保持系统的准确估计协方差和可观测性特性。最后,我们将所提出的BA方法应用于激光雷达-惯性里程计。
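A stereographic-projection parameterization of a unit normal, of the kind the abstract describes for feature directions, fits in a few lines: project from the south pole of the sphere to get a minimal 2-parameter chart with an exact inverse. This sketch covers only the direction part of a planar feature (the plane's offset would be a third parameter) and is not the paper's exact formulation.

```python
import numpy as np

def normal_to_stereo(n):
    """Stereographic projection from the south pole (0, 0, -1):
    maps a unit normal on S^2 (minus the pole) to a 2-vector, giving
    a minimal, singularity-deferred direction parameterization."""
    n = n / np.linalg.norm(n)
    return n[:2] / (1.0 + n[2])

def stereo_to_normal(p):
    """Exact inverse: 2-vector back to a unit normal on S^2."""
    s = p @ p
    return np.array([2.0 * p[0], 2.0 * p[1], 1.0 - s]) / (1.0 + s)
```

The chart is 2-dimensional, matching the two degrees of freedom of a direction, which avoids the rank-deficient Jacobians an overparameterized 3-vector normal introduces in least-squares estimation.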
cs.RO / 16 / 2602.06382

Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

现在你看到了:从原始像素中学习端到端的人形机器人运动
Sun, Wandong, Su, Yongbo, Huang, Leoric, Zhang, Alex, Wei, Dwyane, San, Mu, Tian, Daniel, Cao, Ellie, Yan, Finn, Xie, Ethan, Xie, Zongwu
Abstract
Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific reward shaping integrated with multi-critic and multi-discriminator learning, where dedicated networks capture the distinct dynamics and motion priors of each terrain type. We validate our approach on two humanoid platforms equipped with different stereo depth cameras. The resulting policy demonstrates robust performance across diverse environments, seamlessly handling extreme challenges such as high platforms and wide gaps, as well as fine-grained tasks including bidirectional long-term staircase traversal.
Chinese Translation
实现基于视觉的人形机器人运动仍然面临挑战,主要由于两个根本性问题:模拟与现实之间的差距引入了显著的感知噪声,降低了在细粒度任务上的表现,而在多样地形上训练统一策略则受到相互冲突的学习目标的阻碍。为了解决这些挑战,我们提出了一种用于视觉驱动的人形机器人运动的端到端框架。为了实现稳健的模拟到现实转移,我们开发了一种高保真度的深度传感器模拟,捕捉了现实世界感知中固有的立体匹配伪影和校准不确定性。我们进一步提出了一种视觉感知行为蒸馏方法,将潜在空间对齐与噪声不变的辅助任务相结合,使得从特权高度图到噪声深度观测的有效知识转移成为可能。为了实现多样化的地形适应,我们引入了与多重批评者和多重鉴别器学习相结合的地形特定奖励塑造,其中专用网络捕捉每种地形类型的独特动态和运动先验。我们在配备不同立体深度相机的两个机器人平台上验证了我们的方法。最终的策略在多样化环境中表现出稳健的性能,能够无缝应对高平台和宽间隙等极端挑战,以及包括双向长期楼梯行走在内的细粒度任务。
cs.RO / 17 / 2602.06445

ECO: Energy-Constrained Optimization with Reinforcement Learning for Humanoid Walking

ECO:基于强化学习的人形机器人能量约束优化行走
Huang, Weidong, Zhang, Jingwen, Li, Jiongye, Zhang, Shibowen, Wu, Jiayang, Wang, Jiayi, Liu, Hangxin, Yang, Yaodong, Su, Yao
Abstract
Achieving stable and energy-efficient locomotion is essential for humanoid robots to operate continuously in real-world applications. Existing MPC and RL approaches often rely on energy-related metrics embedded within a multi-objective optimization framework, which require extensive hyperparameter tuning and often result in suboptimal policies. To address these challenges, we propose ECO (Energy-Constrained Optimization), a constrained RL framework that separates energy-related metrics from rewards, reformulating them as explicit inequality constraints. This method provides a clear and interpretable physical representation of energy costs, enabling more efficient and intuitive hyperparameter tuning for improved energy efficiency. ECO introduces dedicated constraints for energy consumption and reference motion, enforced by the Lagrangian method, to achieve stable, symmetric, and energy-efficient walking for humanoid robots. We evaluated ECO against MPC, standard RL with reward shaping, and four state-of-the-art constrained RL methods. Experiments, including sim-to-sim and sim-to-real transfers on the kid-sized humanoid robot BRUCE, demonstrate that ECO significantly reduces energy consumption compared to baselines while maintaining robust walking performance. These results highlight a substantial advancement in energy-efficient humanoid locomotion. All experimental demonstrations can be found on the project website: https://sites.google.com/view/eco-humanoid.
Chinese Translation
实现稳定且高能效的运动对于人形机器人在现实应用中持续运行至关重要。现有的模型预测控制(MPC)和强化学习(RL)方法通常依赖于嵌入多目标优化框架中的能量相关指标,这需要大量的超参数调优,并且常常导致次优策略。为了解决这些挑战,我们提出了ECO(能量约束优化),这是一个将能量相关指标与奖励分离的约束强化学习框架,将其重新表述为明确的不等式约束。该方法提供了能量成本的清晰且可解释的物理表示,使得超参数调优更加高效和直观,从而提高能量效率。ECO引入了专门针对能量消耗和参考运动的约束,通过拉格朗日方法进行强制执行,以实现人形机器人稳定、对称且高能效的行走。我们将ECO与MPC、标准的带奖励塑形的强化学习以及四种最先进的约束强化学习方法进行了评估。实验,包括在儿童尺寸人形机器人BRUCE上的仿真到仿真和仿真到现实的转移,表明ECO在保持稳健的行走性能的同时,显著降低了能量消耗。这些结果突显了在能量高效的人形运动方面的重大进展。所有实验演示可在项目网站上找到:https://sites.google.com/view/eco-humanoid。
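The core mechanism ECO describes — moving the energy metric out of the reward and enforcing it as an explicit inequality constraint via the Lagrangian method — can be illustrated with a projected dual-ascent update. This is a generic constrained-RL sketch with made-up numbers, not ECO's actual implementation:

```python
# Generic Lagrangian constraint handling for RL (illustrative, not ECO itself).
# The energy constraint is g = energy_cost - energy_limit <= 0; the multiplier
# lam rises by dual ascent while the constraint is violated, is projected back
# to 0 otherwise, and the policy trains on the penalized objective.

def dual_ascent_step(reward, energy_cost, energy_limit, lam, lr=0.1):
    violation = energy_cost - energy_limit      # g(pi) <= 0 is the constraint
    lam = max(0.0, lam + lr * violation)        # projected dual ascent on lam
    penalized = reward - lam * violation        # objective the policy maximizes
    return lam, penalized

lam = 0.0
for _ in range(50):  # while energy exceeds the limit, the penalty weight grows
    lam, penalized = dual_ascent_step(1.0, energy_cost=2.0, energy_limit=1.5, lam=lam)
```

Because the multiplier adapts automatically, the only hyperparameters left are the physical limit itself and a dual learning rate, which is the interpretability argument the abstract makes.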
cs.RO / 18 / 2602.06459

User-Centric Object Navigation: A Benchmark with Integrated User Habits for Personalized Embodied Object Search

以用户为中心的物体导航:一个集成用户习惯的个性化具身物体搜索基准
Wang, Hongcheng, Zhu, Jinyu, Dong, Hao
Abstract
In the evolving field of robotics, the challenge of Object Navigation (ON) in household environments has attracted significant interest. Existing ON benchmarks typically place objects in locations guided by general scene priors, without accounting for the specific placement habits of individual users. This omission limits the adaptability of navigation agents in personalized household environments. To address this, we introduce User-centric Object Navigation (UcON), a new benchmark that incorporates user-specific object placement habits, referred to as user habits. This benchmark requires agents to leverage these user habits for more informed decision-making during navigation. UcON encompasses approximately 22,600 user habits across 489 object categories. UcON is, to our knowledge, the first benchmark that explicitly formalizes and evaluates habit-conditioned object navigation at scale and covers the widest range of target object categories. Additionally, we propose a habit retrieval module to extract and utilize habits related to target objects, enabling agents to infer their likely locations more effectively. Experimental results demonstrate that current SOTA methods exhibit substantial performance degradation under habit-driven object placement, while integrating user habits consistently improves success rates. Code is available at https://github.com/whcpumpkin/User-Centric-Object-Navigation.
Chinese Translation
在不断发展的机器人领域,家庭环境中的物体导航(Object Navigation, ON)挑战引起了广泛关注。现有的ON基准通常将物体放置在由一般场景先验指导的位置,而未考虑个体用户的特定放置习惯。这一遗漏限制了导航代理在个性化家庭环境中的适应性。为了解决这一问题,我们提出了以用户为中心的物体导航(User-centric Object Navigation, UcON),这是一个新的基准,融入了用户特定的物体放置习惯,称为用户习惯。该基准要求代理利用这些用户习惯,在导航过程中进行更为明智的决策。UcON涵盖了约22,600个用户习惯,涉及489个物体类别。根据我们的了解,UcON是第一个明确形式化并在大规模上评估习惯条件物体导航的基准,覆盖了最广泛的目标物体类别。此外,我们还提出了一个习惯检索模块,用于提取和利用与目标物体相关的习惯,使代理能够更有效地推断其可能的位置。实验结果表明,当前的最先进方法(SOTA)在习惯驱动的物体放置下表现出显著的性能下降,而集成用户习惯则持续提高成功率。代码可在 https://github.com/whcpumpkin/User-Centric-Object-Navigation 获取。
cs.RO / 19 / 2602.06504

MultiGraspNet: A Multitask 3D Vision Model for Multi-gripper Robotic Grasping

MultiGraspNet:一种用于多抓手机器人抓取的多任务三维视觉模型
Ortuno-Chanelo, Stephany, Rabino, Paolo, Civitelli, Enrico, Tommasi, Tatiana, Camoriano, Raffaello
Abstract
Vision-based models for robotic grasping automate critical, repetitive, and draining industrial tasks. Existing approaches are typically limited in two ways: they either target a single gripper and are potentially applied on costly dual-arm setups, or rely on custom hybrid grippers that require ad-hoc learning procedures with logic that cannot be transferred across tasks, restricting their general applicability. In this work, we present MultiGraspNet, a novel multitask 3D deep learning method that predicts feasible poses simultaneously for parallel and vacuum grippers within a unified framework, enabling a single robot to handle multiple end effectors. The model is trained on the richly annotated GraspNet-1Billion and SuctionNet-1Billion datasets, which have been aligned for the purpose, and generates graspability masks quantifying the suitability of each scene point for successful grasps. By sharing early-stage features while maintaining gripper-specific refiners, MultiGraspNet effectively leverages complementary information across grasping modalities, enhancing robustness and adaptability in cluttered scenes. We characterize MultiGraspNet's performance with an extensive experimental analysis, demonstrating its competitiveness with single-task models on relevant benchmarks. We run real-world experiments on a single-arm multi-gripper robotic setup showing that our approach outperforms the vacuum baseline, grasping 16% more seen objects and 32% more of the novel ones, while obtaining competitive results for the parallel task.
Chinese Translation
基于视觉的机器人抓取模型自动化了关键的、重复的和耗费精力的工业任务。现有的方法通常存在两个限制:要么针对单一抓手,可能应用于昂贵的双臂设置,要么依赖于需要特定学习程序的定制混合抓手,这些程序的逻辑无法跨任务转移,从而限制了其通用性。在本研究中,我们提出了MultiGraspNet,这是一种新颖的多任务三维深度学习方法,能够在统一框架内同时预测并行抓手和真空抓手的可行姿态,使单个机器人能够处理多个末端执行器。该模型在丰富注释的GraspNet-1Billion和SuctionNet-1Billion数据集上进行训练,这些数据集已为此目的进行了对齐,并生成抓取能力掩码,以量化每个场景点成功抓取的适宜性。通过共享早期特征并保持抓手特定的细化器,MultiGraspNet有效利用了不同抓取方式之间的互补信息,提高了在杂乱场景中的鲁棒性和适应性。我们通过广泛的实验分析来表征MultiGraspNet的性能,证明其在相关基准测试中与单任务模型的竞争力。我们在单臂多抓手机器人设置上进行了真实世界实验,结果表明我们的方法优于真空基线,抓取了多出16%的已见物体和32%的新物体,同时在并行任务中获得了竞争性的结果。
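The shared-backbone, gripper-specific-head design described in the abstract can be sketched in a few lines: early features are computed once per scene, then each gripper head maps them to a per-point graspability mask. The layer shapes, random weights, and one-layer "encoder" below are illustrative stand-ins, not the actual MultiGraspNet architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(points, W):
    # One shared feature layer over a (N, 3) scene point cloud.
    return np.tanh(points @ W)

def gripper_head(features, w):
    # Per-gripper refiner: per-point graspability score in (0, 1).
    logits = features @ w
    return 1.0 / (1.0 + np.exp(-logits))

points = rng.normal(size=(128, 3))
W = rng.normal(size=(3, 16))
feats = shared_encoder(points, W)                         # computed once, shared
parallel_mask = gripper_head(feats, rng.normal(size=16))  # parallel-jaw head
vacuum_mask = gripper_head(feats, rng.normal(size=16))    # vacuum/suction head
```

The design choice mirrors the abstract: the two grasping modalities share complementary low-level evidence (geometry, surface normals) while keeping modality-specific decision heads.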
cs.RO / 20 / 2602.06508

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

World-VLA-Loop:视频世界模型与VLA策略的闭环学习
Liu, Xiaokang, Bai, Zechen, Ci, Hai, Ma, Kevin Yuchen, Shou, Mike Zheng
Abstract
Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables a closed-loop for reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics. Project page: https://showlab.github.io/World-VLA-Loop/.
Chinese Translation
近期在机器人世界模型方面的进展利用视频扩散变换器预测基于历史状态和动作的未来观察。尽管这些模型能够模拟逼真的视觉结果,但它们通常表现出较差的动作跟随精度,限制了其在下游机器人学习中的实用性。在本研究中,我们提出了World-VLA-Loop,这是一个用于世界模型和视觉-语言-动作(VLA)策略联合优化的闭环框架。我们提出了一种状态感知的视频世界模型,通过共同预测未来观察和奖励信号,作为高保真互动模拟器。为了增强可靠性,我们引入了SANS数据集,该数据集包含近成功轨迹,以改善世界模型中的动作-结果对齐。该框架使得在虚拟环境中对VLA策略进行强化学习(RL)后训练的闭环成为可能。关键是,我们的方法促进了一个共同演化的循环:由VLA策略生成的失败回滚被迭代反馈以优化世界模型的精度,从而增强后续的RL优化。在模拟和真实世界任务中的评估表明,我们的框架显著提升了VLA的性能,且物理交互最小,建立了世界建模与策略学习之间的互利关系,以适应通用机器人应用。项目页面:https://showlab.github.io/World-VLA-Loop/
cs.RO / 21 / 2602.06512

Beyond the Majority: Long-tail Imitation Learning for Robotic Manipulation

超越多数:用于机器人操控的长尾模仿学习
Zhu, Junhong, Zhang, Ji, Song, Jingkuan, Gao, Lianli, Shen, Heng Tao
Abstract
While generalist robot policies hold significant promise for learning diverse manipulation skills through imitation, their performance is often hindered by the long-tail distribution of training demonstrations. Policies learned on such data, which is heavily skewed towards a few data-rich head tasks, frequently exhibit poor generalization when confronted with the vast number of data-scarce tail tasks. In this work, we conduct a comprehensive analysis of the pervasive long-tail challenge inherent in policy learning. Our analysis begins by demonstrating the inefficacy of conventional long-tail learning strategies (e.g., re-sampling) for improving the policy's performance on tail tasks. We then uncover the underlying mechanism for this failure, revealing that data scarcity on tail tasks directly impairs the policy's spatial reasoning capability. To overcome this, we introduce Approaching-Phase Augmentation (APA), a simple yet effective scheme that transfers knowledge from data-rich head tasks to data-scarce tail tasks without requiring external demonstrations. Extensive experiments in both simulation and real-world manipulation tasks demonstrate the effectiveness of APA. Our code and demos are publicly available at: https://mldxy.github.io/Project-VLA-long-tail/.
Chinese Translation
虽然通用机器人策略在通过模仿学习多样化操控技能方面具有重要潜力,但其性能常常受到训练示例的长尾分布的影响。基于这种数据(主要集中在少数数据丰富的头部任务)学习的策略,在面对大量数据稀缺的尾部任务时,通常表现出较差的泛化能力。在本研究中,我们对策略学习中普遍存在的长尾挑战进行了全面分析。我们的分析首先表明,传统的长尾学习策略(例如,重采样)在提升策略在尾部任务上的表现方面效果不佳。接着,我们揭示了这种失败的根本机制,发现尾部任务的数据稀缺直接损害了策略的空间推理能力。为了解决这一问题,我们引入了接近阶段增强(Approaching-Phase Augmentation, APA),这是一种简单而有效的方案,能够在不需要外部示例的情况下,将知识从数据丰富的头部任务转移到数据稀缺的尾部任务。我们在模拟和真实世界的操控任务中进行了广泛的实验,证明了APA的有效性。我们的代码和演示可在以下网址公开获取:https://mldxy.github.io/Project-VLA-long-tail/.
cs.RO / 22 / 2602.06541

Primary Experimental Feedback on a Co-manipulated Robotic System for Assisted Cervical Surgery

协同操控机器人系统在辅助颈椎手术中的初步实验反馈
Sellemi, Seifeddine, Chaker, Abdelbadia, Vendeuvre, Tanguy, Essomba, Terence, Laribi, Med Amine
Abstract
Robotic-assisted surgery has emerged as a promising approach to improve surgical ergonomics, precision, and workflow efficiency, particularly in complex procedures such as cervical spine surgery. In this study, we evaluate the performance of a collaborative robotic system designed to assist surgeons in drilling tasks by assessing its accuracy in executing predefined trajectories. A total of 14 drillings were performed by eight experienced cervical surgeons, utilizing a robotic-assisted setup aimed at ensuring stability and alignment. The primary objective of this study is to quantify the deviations in the position and orientation of the drilling tool relative to the planned trajectory, providing insights into the system's reliability and potential impact on clinical outcomes. While the primary function of robotic assistance in surgery is to enhance surgeon comfort and procedural guidance rather than solely optimizing precision, understanding the system's accuracy remains crucial for its effective integration into surgical practices. As part of this primary experimental feedback, the study offers an in-depth analysis of the co-manipulated robotic system's performance, focusing on the experimental setup and error evaluation methods. The findings of this study will contribute to the ongoing development of robotic-assisted cervical surgery, highlighting both its advantages and areas for improvement in achieving safer and more efficient surgical workflows.
Chinese Translation
机器人辅助手术已成为改善手术人机工程学、精确度和工作流程效率的有前景的方法,特别是在复杂的手术过程中,如颈椎手术。在本研究中,我们评估了一种协作机器人系统的性能,该系统旨在通过评估其在执行预定义轨迹时的准确性来辅助外科医生进行钻孔任务。共进行了14次钻孔,由八位经验丰富的颈椎外科医生在机器人辅助的设置下进行,以确保稳定性和对齐。本研究的主要目标是量化钻孔工具相对于计划轨迹的位置和方向偏差,为系统的可靠性及其对临床结果的潜在影响提供见解。尽管机器人辅助手术的主要功能是增强外科医生的舒适度和程序指导,而不仅仅是优化精确度,但理解系统的准确性对于其有效整合到外科实践中仍然至关重要。作为这一初步实验反馈的一部分,本研究对协同操控机器人系统的性能进行了深入分析,重点关注实验设置和误差评估方法。本研究的发现将有助于机器人辅助颈椎手术的持续发展,突出其优势及在实现更安全、更高效的手术工作流程方面的改进空间。
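The deviation metrics the study quantifies — tool position and orientation relative to a planned trajectory — are commonly computed as a lateral distance from the planned drilling axis plus an angle between the planned and actual tool axes. A generic sketch (the paper's exact error definitions may differ):

```python
import numpy as np

def trajectory_errors(entry, axis, tool_tip, tool_axis):
    """Generic drilling-deviation metrics: lateral distance of the tool tip
    from the planned drilling line, and the angle between tool axes (deg)."""
    axis = axis / np.linalg.norm(axis)
    tool_axis = tool_axis / np.linalg.norm(tool_axis)
    # Positional error: component of (tip - entry) orthogonal to the axis.
    d = tool_tip - entry
    lateral = d - np.dot(d, axis) * axis
    pos_err = float(np.linalg.norm(lateral))
    # Angular error: angle between planned and actual tool axes.
    cos_a = np.clip(np.dot(axis, tool_axis), -1.0, 1.0)
    ang_err = float(np.degrees(np.arccos(cos_a)))
    return pos_err, ang_err

pos_err, ang_err = trajectory_errors(
    entry=np.array([0.0, 0.0, 0.0]), axis=np.array([0.0, 0.0, 1.0]),
    tool_tip=np.array([1.0, 0.0, 5.0]), tool_axis=np.array([0.0, 0.0, 1.0]))
```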
cs.RO / 23 / 2602.06572

The Law of Task-Achieving Body Motion: Axiomatizing Success of Robot Manipulation Actions

任务实现体运动法则:机器人操作行为成功的公理化
Huerkamp, Malte, Dech, Jonas, Beetz, Michael
Abstract
Autonomous agents that perform everyday manipulation actions need to ensure that their body motions are semantically correct with respect to a task request, causally effective within their environment, and feasible for their embodiment. In order to enable robots to verify these properties, we introduce the Law of Task-Achieving Body Motion as an axiomatic correctness specification for body motions. To that end, we introduce scoped Task-Environment-Embodiment (TEE) classes that represent world states as Semantic Digital Twins (SDTs) and define applicable physics models to decompose task achievement into three predicates: SatisfiesRequest for semantic request satisfaction over SDT state evolution; Causes for causal sufficiency under the scoped physics model; and CanPerform for safety and feasibility verification at the embodiment level. This decomposition yields a reusable, implementation-independent interface that supports motion synthesis and the verification of given body motions. It also supports typed failure diagnosis (semantic, causal, embodiment and out-of-scope), feasibility across robots and environments, and counterfactual reasoning about robot body motions. We demonstrate the usability of the law in practice by instantiating it for articulated container manipulation in kitchen environments on three contrasting mobile manipulation platforms.
Chinese Translation
执行日常操作行为的自主代理需要确保其身体运动在语义上与任务请求相符,在其环境中具有因果效应,并且对其体现是可行的。为了使机器人能够验证这些属性,我们引入了任务实现体运动法则作为身体运动的公理化正确性规范。为此,我们引入了范围限定的任务-环境-体现(Task-Environment-Embodiment, TEE)类,这些类将世界状态表示为语义数字孪生(Semantic Digital Twins, SDTs),并定义适用的物理模型,将任务实现分解为三个谓词:SatisfiesRequest,用于对SDT状态演变的语义请求满足;Causes,用于在范围限定的物理模型下的因果充分性;以及CanPerform,用于在体现层面的安全性和可行性验证。这种分解产生了一个可重用的、与实现无关的接口,支持运动合成和给定身体运动的验证。它还支持类型化故障诊断(语义、因果、体现和超出范围),跨机器人和环境的可行性,以及关于机器人身体运动的反事实推理。我们通过在三种不同的移动操作平台上为厨房环境中的关节容器操作实例化该法则,展示了其在实践中的可用性。
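The three-predicate decomposition can be read as a conjunction: a body motion achieves a task only if all three predicates hold. The predicate names below follow the abstract, but their bodies are deliberately trivial placeholder stubs, not the paper's SDT-based formalization:

```python
# Task achievement as a conjunction of the three predicates from the abstract.
# Each predicate body is a stand-in stub; only the structure is the point.

def satisfies_request(sdt_states, request):
    return request in sdt_states[-1]           # stub: the goal holds at the end

def causes(motion, physics_model):
    return physics_model(motion)               # stub: causal sufficiency check

def can_perform(motion, joint_limit):
    return all(abs(q) <= joint_limit for q in motion)   # stub: feasibility

def achieves_task(motion, request, sdt_states, physics_model, joint_limit):
    return (satisfies_request(sdt_states, request)
            and causes(motion, physics_model)
            and can_perform(motion, joint_limit))
```

Typed failure diagnosis follows directly from this structure: whichever predicate is the first to fail names the failure class (semantic, causal, or embodiment).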
cs.RO / 24 / 2602.06575

Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation

以本体感知为思考基础:用于视觉语言行动(VLA)操作的具身视觉推理
Wang, Fangyuan, Zhou, Peng, Qi, Jiaming, Lyu, Shipeng, Navarro-Alarcon, David, Guo, Guodong
Abstract
Vision-language-action (VLA) models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended throughout the policy. We introduce ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets embodied state participate in subsequent visual reasoning and token selection, biasing computation toward action-critical evidence while suppressing redundant visual tokens. In a systematic ablation over proprioception encoding, state entry point, and action-head conditioning, we find that text tokenization is more effective than learned projectors, and that retaining roughly 15% of visual tokens can match the performance of using the full token set. Across CALVIN, LIBERO, and real-world manipulation, ThinkProprio matches or improves over strong baselines while reducing end-to-end inference latency by over 50%.
Chinese Translation
视觉语言行动(VLA)模型通常仅将本体感知作为后期条件信号注入,这使得机器人状态既无法参与指令理解,也无法影响策略执行过程中所关注的视觉标记。我们提出了ThinkProprio,它将本体感知转换为VLM嵌入空间中的一系列文本标记,并在输入端与任务指令融合。这种早期融合使得具身状态能够参与后续的视觉推理和标记选择,从而将计算偏向于关键行动证据,同时抑制冗余的视觉标记。在对本体感知编码、状态入口点和行动头条件的系统消融实验中,我们发现文本标记化比学习的投影器更有效,并且保留大约15%的视觉标记可以与使用完整标记集的性能相匹配。在CALVIN、LIBERO和真实世界操作中,ThinkProprio的性能与强基线相当或有所提升,同时将端到端推理延迟减少超过50%。
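Two of the abstract's mechanisms can be sketched independently of any real VLM: serializing proprioception as plain text so it enters the same token stream as the instruction, and keeping only the top-scoring ~15% of visual tokens. The string format, the token scores, and the 15% ratio below are illustrative assumptions:

```python
import numpy as np

def proprio_to_text(joint_positions):
    # Serialize robot state as text, to be tokenized together with the
    # instruction (early fusion) rather than injected late as a vector.
    return "state: " + " ".join(f"{q:.2f}" for q in joint_positions)

def select_visual_tokens(tokens, scores, keep_ratio=0.15):
    # Keep only the highest-scoring fraction of visual tokens.
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]            # indices of the k highest scores
    return [tokens[i] for i in sorted(keep)]

prompt = proprio_to_text([0.1, -0.25, 1.0]) + " | pick up the red block"
kept = select_visual_tokens(list(range(100)), np.arange(100.0))
```

Dropping ~85% of visual tokens is also where the reported latency reduction would come from, since attention cost scales with sequence length.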
cs.RO / 25 / 2602.06620

Force Generative Imitation Learning: Bridging Position Trajectory and Force Commands through Control Technique

力生成模仿学习:通过控制技术桥接位置轨迹与力指令
Sato, Hiroshi, Sakaino, Sho, Tsuji, Toshiaki
Abstract
In contact-rich tasks, while position trajectories are often easy to obtain, appropriate force commands are typically unknown. Although it is conceivable to generate force commands using a pretrained foundation model such as Vision-Language-Action (VLA) models, force control is highly dependent on the specific hardware of the robot, which makes the application of such models challenging. To bridge this gap, we propose a force generative model that estimates force commands from given position trajectories. However, when dealing with unseen position trajectories, the model struggles to generate accurate force commands. To address this, we introduce a feedback control mechanism. Our experiments reveal that feedback control does not converge when the force generative model has memory. We therefore adopt a model without memory, enabling stable feedback control. This approach allows the system to generate force commands effectively, even for unseen position trajectories, improving generalization for real-world robot writing tasks.
Chinese Translation
在接触丰富的任务中,位置轨迹通常容易获得,而适当的力指令通常是未知的。尽管可以设想使用预训练的基础模型(如视觉-语言-动作(VLA)模型)生成力指令,但力控制高度依赖于机器人的具体硬件,这使得此类模型的应用面临挑战。为了解决这一问题,我们提出了一种力生成模型,该模型能够从给定的位置轨迹中估计力指令。然而,在处理未见过的位置轨迹时,该模型难以生成准确的力指令。为了解决这一问题,我们引入了一种反馈控制机制。我们的实验表明,当力生成模型具有记忆时,反馈控制无法收敛。因此,我们采用了一种无记忆模型,从而实现稳定的反馈控制。这种方法使系统能够有效生成力指令,即使对于未见过的位置轨迹,也能提高在实际机器人书写任务中的泛化能力。
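Why a memoryless force model plus feedback can still work on unseen trajectories is easy to see on a 1-D toy contact: a stateless (and deliberately biased) force model supplies the feedforward command, and integral feedback on the measured force error removes the residual bias. All dynamics and gains here are illustrative, not the paper's:

```python
# Toy contact: the true required force is 10 * position, but the stateless
# "generative model" underestimates the gain. Integral feedback on the
# measured force error removes the model's bias over time.

def force_model(position):
    # Stateless map (no memory): output depends only on the current input.
    return 8.0 * position  # imperfect estimate of the true gain (10.0)

def run(steps=200, target_pos=1.0, ki=0.2):
    true_force = 10.0 * target_pos       # force actually needed at this position
    correction = 0.0
    cmd = 0.0
    for _ in range(steps):
        cmd = force_model(target_pos) + correction   # feedforward + feedback
        error = true_force - cmd                     # measured force error
        correction += ki * error                     # integral action
    return cmd
```

The memoryless choice matters for this loop: with an internal state, the model's output would keep changing under the same input, and the feedback correction could chase a moving target instead of converging.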
cs.RO / 26 / 2602.06643

Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations

类人操控接口:基于无机器人演示的类人全身操控
Nai, Ruiqian, Zheng, Boyuan, Zhao, Junming, Zhu, Haodong, Dai, Sicong, Chen, Zunhao, Hu, Yihang, Hu, Yingdong, Zhang, Tong, Wen, Chuan, Gao, Yang
Abstract
Current approaches for humanoid whole-body manipulation, primarily relying on teleoperation or visual sim-to-real reinforcement learning, are hindered by hardware logistics and complex reward engineering. Consequently, demonstrated autonomous skills remain limited and are typically restricted to controlled environments. In this paper, we present the Humanoid Manipulation Interface (HuMI), a portable and efficient framework for learning diverse whole-body manipulation tasks across various environments. HuMI enables robot-free data collection by capturing rich whole-body motion using portable hardware. This data drives a hierarchical learning pipeline that translates human motions into dexterous and feasible humanoid skills. Extensive experiments across five whole-body tasks--including kneeling, squatting, tossing, walking, and bimanual manipulation--demonstrate that HuMI achieves a 3x increase in data collection efficiency compared to teleoperation and attains a 70% success rate in unseen environments.
Chinese Translation
当前的类人全身操控方法主要依赖于遥操作或视觉仿真到真实的强化学习,但受到硬件物流和复杂奖励工程的限制。因此,展示的自主技能仍然有限,通常仅限于受控环境。本文提出了类人操控接口(Humanoid Manipulation Interface, HuMI),这是一个便携且高效的框架,用于在各种环境中学习多样的全身操控任务。HuMI通过使用便携硬件捕捉丰富的全身运动,实现无机器人数据收集。这些数据驱动一个分层学习流程,将人类动作转化为灵巧且可行的类人技能。在五个全身任务(包括跪下、蹲下、投掷、行走和双手操控)的广泛实验中,HuMI的数据收集效率比遥操作提高了3倍,并在未见环境中达到了70%的成功率。
cs.RO / 27 / 2602.06653

RAPID: Reconfigurable, Adaptive Platform for Iterative Design

RAPID:可重构、自适应的迭代设计平台
Yin, Zi, Li, Fanhong, Zheng, Shurui, Liu, Jia
Abstract
Developing robotic manipulation policies is iterative and hypothesis-driven: researchers test tactile sensing, gripper geometries, and sensor placements through real-world data collection and training. Yet even minor end-effector changes often require mechanical refitting and system re-integration, slowing iteration. We present RAPID, a full-stack reconfigurable platform designed to reduce this friction. RAPID is built around a tool-free, modular hardware architecture that unifies handheld data collection and robot deployment, and a matching software stack that maintains real-time awareness of the underlying hardware configuration through a driver-level Physical Mask derived from USB events. This modular hardware architecture reduces reconfiguration to seconds and makes systematic multi-modal ablation studies practical, allowing researchers to sweep diverse gripper and sensing configurations without repeated system bring-up. The Physical Mask exposes modality presence as an explicit runtime signal, enabling auto-configuration and graceful degradation under sensor hot-plug events, so policies can continue executing when sensors are physically added or removed. System-centric experiments show that RAPID reduces the setup time for multi-modal configurations by two orders of magnitude compared to traditional workflows and preserves policy execution under runtime sensor hot-unplug events. The hardware designs, drivers, and software stack are open-sourced at https://rapid-kit.github.io/ .
Chinese Translation
开发机器人操作策略是一个迭代和假设驱动的过程:研究人员通过真实世界的数据收集和训练来测试触觉传感、夹持器几何形状和传感器布置。然而,即使是微小的末端执行器变化通常也需要机械重新装配和系统重新集成,从而减缓迭代速度。我们提出了RAPID,一个旨在减少这种摩擦的全栈可重构平台。RAPID建立在一个无需工具的模块化硬件架构之上,该架构统一了手持数据收集和机器人部署,并配备了一个匹配的软件栈,通过基于USB事件派生的驱动级物理掩码(Physical Mask)实时感知底层硬件配置。这个模块化硬件架构将重新配置时间缩短到几秒钟,使系统化的多模态消融研究变得可行,允许研究人员在不重复系统启动的情况下,快速切换不同的夹持器和传感配置。物理掩码将模态存在作为显式的运行时信号暴露出来,使得在传感器热插拔事件下能够实现自动配置和优雅降级,从而在传感器物理添加或移除时,策略可以继续执行。系统中心实验表明,与传统工作流程相比,RAPID将多模态配置的设置时间减少了两个数量级,并在运行时传感器热拔出事件下保持策略执行。硬件设计、驱动程序和软件栈已在 https://rapid-kit.github.io/ 上开源。
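The driver-level Physical Mask can be modeled as one presence bit per modality, toggled by hot-plug events and consulted by the policy so it degrades gracefully when a sensor disappears. The event and modality names here are invented for illustration, not RAPID's actual driver interface:

```python
# A minimal Physical Mask: modality -> physically-present bit, updated on
# USB-style plug/unplug events and used to filter the policy's inputs.

MODALITIES = ("rgb", "depth", "tactile")

def update_mask(mask, event, modality):
    mask = dict(mask)                       # keep updates side-effect free
    mask[modality] = (event == "plug")      # "plug" sets True, "unplug" False
    return mask

def select_inputs(observation, mask):
    # Drop observations whose sensor is physically absent right now.
    return {k: v for k, v in observation.items() if mask.get(k, False)}

mask = {m: True for m in MODALITIES}
mask = update_mask(mask, "unplug", "tactile")            # tactile sensor removed
obs = select_inputs({"rgb": 1, "depth": 2, "tactile": 3}, mask)
```

Exposing presence as an explicit runtime signal (rather than letting a missing sensor surface as a stale or zeroed reading) is what lets the policy keep executing through hot-unplug events.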
cs.RO / 28 / 2602.06698

Crowd-FM: Learned Optimal Selection of Conditional Flow Matching-generated Trajectories for Crowd Navigation

Crowd-FM:基于学习的条件流匹配生成轨迹的最优选择用于人群导航
Singha, Antareep, Nanwani, Laksh, P., Mathai Mathew, Jain, Samkit, Singamaneni, Phani Teja, Singh, Arun Kumar, Krishna, K. Madhava
Abstract
Safe and computationally efficient local planning for mobile robots in dense, unstructured human crowds remains a fundamental challenge. Moreover, ensuring that robot trajectories are similar to how a human moves will increase the acceptance of the robot in human environments. In this paper, we present Crowd-FM, a learning-based approach to address both safety and human-likeness challenges. Our approach has two novel components. First, we train a Conditional Flow-Matching (CFM) policy over a dataset of optimally controlled trajectories to learn a set of collision-free primitives that a robot can choose at any given scenario. The chosen optimal control solver can generate multi-modal collision-free trajectories, allowing the CFM policy to learn a diverse set of maneuvers. Secondly, we learn a score function over a dataset of human demonstration trajectories that provides a human-likeness score for the flow primitives. At inference time, computing the optimal trajectory requires selecting the one with the highest score. Our approach improves the state-of-the-art by showing that our CFM policy alone can produce collision-free navigation with a higher success rate than existing learning-based baselines. Furthermore, when augmented with inference-time refinement, our approach can outperform even expensive optimisation-based planning approaches. Finally, we validate that our scoring network can select trajectories closer to the expert data than a manually designed cost function.
Chinese Translation
在密集且非结构化的人群中,移动机器人进行安全且计算高效的局部规划仍然是一个基本挑战。此外,确保机器人轨迹与人类的移动方式相似,将提高机器人在人类环境中的接受度。本文提出了Crowd-FM,一种基于学习的方法,以解决安全性和人类相似性这两个挑战。我们的方法包含两个新颖的组成部分。首先,我们在一个最优控制轨迹的数据集上训练了一个条件流匹配(Conditional Flow-Matching, CFM)策略,以学习一组在任何给定场景中机器人可以选择的无碰撞原语。所选择的最优控制求解器能够生成多模态的无碰撞轨迹,使得CFM策略能够学习到多样化的操控方式。其次,我们在一个人类示范轨迹的数据集上学习了一个评分函数,为流原语提供人类相似性评分。在推理时,计算最优轨迹需要选择具有最高评分的轨迹。我们的方法通过表明我们的CFM策略单独能够以高于现有基于学习的基线的成功率生成无碰撞导航,从而改善了现有技术。此外,当与推理时的精细化相结合时,我们的方法甚至能够超越昂贵的基于优化的规划方法。最后,我们验证了相比手动设计的成本函数,我们的评分网络能够选择更接近专家数据的轨迹。
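At inference time, the abstract's selection step reduces to an argmax of a learned human-likeness score over sampled trajectory primitives. In this sketch both the "policy" (random samples) and the score (a smoothness proxy) are stand-ins for the trained CFM model and scoring network:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_trajectories(n=8, horizon=20):
    # Stand-in for the CFM policy: n candidate 2-D trajectories.
    return rng.normal(size=(n, horizon, 2))

def human_likeness_score(traj):
    # Stand-in score: prefer smooth trajectories (small step-to-step change).
    return -float(np.linalg.norm(np.diff(traj, axis=0)))

def select_best(trajectories):
    scores = [human_likeness_score(t) for t in trajectories]
    best = int(np.argmax(scores))
    return best, trajectories[best]

idx, best = select_best(sample_trajectories())
```

Keeping generation (collision-free primitives) and selection (human-likeness) as separate learned components is the structural point: either half can be retrained without touching the other.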
cs.RO / 29 / 2602.06749

Constraint Manifold Exploration for Efficient Continuous Coverage Estimation

约束流形探索用于高效的连续覆盖估计
Wilbrandt, Robert, Dillmann, Rüdiger
Abstract
Many automated manufacturing processes rely on industrial robot arms to move process-specific tools along workpiece surfaces. In applications like grinding, sanding, spray painting, or inspection, they need to cover a workpiece fully while keeping their tools perpendicular to its surface. While there are approaches to generate trajectories for these applications, there are no adequate methods for analyzing the feasibility of full surface coverage. This work proposes a sampling-based approach for continuous coverage estimation that explores reachable surface regions in the configuration space. We define an extended ambient configuration space that allows for the representation of tool position and orientation constraints. A continuation-based approach is used to explore it using two different sampling strategies. A thorough evaluation across different kinematics and environments analyzes their runtime and efficiency. This validates our ability to accurately and efficiently calculate surface coverage for complex surfaces in complicated environments.
Chinese Translation
许多自动化制造过程依赖于工业机器人手臂在工件表面移动特定的加工工具。在磨削、打磨、喷涂或检测等应用中,它们需要完全覆盖工件,同时保持工具与表面垂直。虽然已有一些方法可以生成这些应用的轨迹,但尚缺乏足够的方法来分析完全表面覆盖的可行性。本研究提出了一种基于采样的连续覆盖估计方法,该方法在配置空间中探索可达的表面区域。我们定义了一个扩展的环境配置空间,以便表示工具位置和方向约束。采用基于延续的方法,使用两种不同的采样策略进行探索。通过对不同运动学和环境的全面评估,分析了它们的运行时间和效率。这验证了我们在复杂环境中准确有效地计算复杂表面的覆盖能力。
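Sampling-based coverage estimation in miniature: sample points on the workpiece surface, test each for reachability, and report the reachable fraction as the coverage estimate. The workspace-radius reachability test below is a trivial stand-in for the paper's continuation-based configuration-space exploration:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_surface(n=1000):
    # Points on a unit sphere as a stand-in workpiece surface.
    p = rng.normal(size=(n, 3))
    return p / np.linalg.norm(p, axis=1, keepdims=True)

def reachable(point, base=np.array([0.0, 0.0, -2.0]), max_reach=2.8):
    # Stand-in reachability test: inside the arm's spherical workspace.
    # (The real check must verify a valid joint configuration that also
    # keeps the tool perpendicular to the surface.)
    return bool(np.linalg.norm(point - base) <= max_reach)

surface = sample_surface()
coverage = float(np.mean([reachable(p) for p in surface]))
```

As in any Monte Carlo estimate, the coverage value converges at a rate of about 1/sqrt(n), which is why the paper's contribution is exploring the constraint manifold efficiently rather than brute-force sampling.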
cs.RO / 30 / 2602.06807

SuReNav: Superpixel Graph-based Constraint Relaxation for Navigation in Over-constrained Environments

SuReNav:基于超像素图的约束放松方法在过度约束环境中的导航
Koh, Keonyoung, Jung, Moonkyeong, Lee, Samuel Seungsup, Park, Daehyung
Abstract
We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalizations. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score of complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot.
Chinese Translation
我们解决了半静态环境中的过度约束规划问题。规划目标是找到一个尽力而为的解决方案,避免所有硬约束区域,同时仅以最小限度穿越风险最低的区域。传统方法通常依赖于预定义的区域成本,这限制了其推广性。此外,导航空间的空间连续性使得识别可通行区域而不产生过高估计变得困难。为了解决这些挑战,我们提出了SuReNav,一种基于超像素图的约束放松和导航方法,模仿人类安全高效的导航。我们的框架由三个组件组成:1)带有区域约束的超像素图地图生成,2)使用在人类示范上训练的图神经网络进行区域约束放松,以实现安全高效的导航,3)交替进行放松、规划和执行以实现完整导航。我们在2D语义地图和来自OpenStreetMap的3D地图上对我们的方法进行了与最先进基线的评估,取得了最高的人类相似度得分,同时在效率和安全性之间保持了良好的平衡。最后,我们使用四足机器人Spot在真实城市导航中展示了其可扩展性和泛化性能。
cs.RO / 31 / 2602.06811

A 26-Gram Butterfly-Inspired Robot Achieving Autonomous Tailless Flight

一种26克的蝴蝶启发机器人实现自主无尾飞行
Gu, Weibin, Feng, Chenrui, Liu, Lian, Yang, Chen, Jiao, Xingchi, Ding, Yuhe, Shi, Xiaofei, Gao, Chao, Rizzo, Alessandro, Zhou, Guyue
Abstract
Flapping-wing micro air vehicles (FWMAVs) have demonstrated remarkable bio-inspired agility, yet tailless two-winged configurations remain largely unexplored due to their complex fluid-structure and wing-body coupling. Here we present AirPulse, a 26-gram butterfly-inspired FWMAV that achieves fully onboard, closed-loop, untethered flight without auxiliary control surfaces. The AirPulse robot replicates key biomechanical traits of butterfly flight, including low wing aspect ratio, compliant carbon-fiber-reinforced wings, and low-frequency, high-amplitude flapping that induces cyclic variations in the center of gravity and moment of inertia, producing characteristic body undulation. We establish a quantitative mapping between flapping modulation parameters and force-torque generation, and introduce the Stroke Timing Asymmetry Rhythm (STAR) generator, enabling smooth, stable, and linearly parameterized wingstroke asymmetry for flapping control. Integrating these with an attitude controller, the AirPulse robot maintains pitch and yaw stability despite strong oscillatory dynamics. Free-flight experiments demonstrate stable climbing and turning maneuvers via either angle offset or stroke timing modulation, marking the first onboard controlled flight of the lightest two-winged, tailless butterfly-inspired FWMAV reported in peer-reviewed literature. This work establishes a foundational platform for lightweight, collision-proof FWMAVs, bridging biological inspiration with practical aerial robotics. Their non-invasive maneuverability is ideally suited for real-world applications, such as confined-space inspection and ecological monitoring, inaccessible to traditional drones, while their biomechanical fidelity provides a physical model to decode the principles underlying the erratic yet efficient flight of real butterflies.
Chinese Translation
拍翼微型航空器(FWMAVs)展现了显著的生物启发灵活性,但由于其复杂的流体-结构和翼-机身耦合,无尾双翼配置仍然未得到充分探索。本文介绍了AirPulse,一种26克的蝴蝶启发FWMAV,能够实现完全自主、闭环、无缆飞行,而无需辅助控制面。AirPulse机器人复制了蝴蝶飞行的关键生物力学特征,包括低展弦比、柔性碳纤维增强翼,以及低频率、高幅度的拍打,这会引起重心和转动惯量的周期性变化,从而产生特征性的机体波动。我们建立了拍打调制参数与力-扭矩生成之间的定量映射,并引入了拍打时序不对称节奏(Stroke Timing Asymmetry Rhythm, STAR)发生器,使拍打控制的翼击不对称性平滑、稳定且线性参数化。将这些与姿态控制器相结合,AirPulse机器人在强烈的振荡动态下仍能保持俯仰和偏航稳定性。自由飞行实验展示了通过角度偏移或拍打时序调制实现的稳定爬升和转向机动,标志着在同行评审文献中首次报道的最轻无尾双翼蝴蝶启发FWMAV的机载控制飞行。这项工作为轻量级、抗碰撞的FWMAVs奠定了基础平台,架起了生物启发与实用航空机器人之间的桥梁。它们非侵入式的机动性非常适合于现实世界应用,如传统无人机无法到达的狭小空间检查和生态监测,同时其生物力学的真实性为解码真实蝴蝶不规则但高效飞行的原理提供了物理模型。
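One plausible reading of a "linearly parameterized wingstroke asymmetry" is to split each flapping period into a downstroke fraction (1 + a)/2 and an upstroke fraction (1 - a)/2, with a = 0 giving a symmetric stroke. This is an illustrative guess at the idea, not the actual STAR generator:

```python
import math

def stroke_angle(t, freq=10.0, amp=1.0, asym=0.0):
    """Smooth flapping waveform whose downstroke/upstroke split is linearly
    parameterized by asym in (-1, 1); asym = 0 gives a symmetric stroke."""
    phase = (t * freq) % 1.0                 # position within the current period
    down = (1.0 + asym) / 2.0                # fraction spent on the downstroke
    if phase < down:
        u = phase / down                     # 0 -> 1 over the downstroke
    else:
        u = 1.0 - (phase - down) / (1.0 - down)   # 1 -> 0 over the upstroke
    return amp * math.cos(math.pi * u)       # continuous sweep between +/- amp
```

Shifting stroke timing while keeping amplitude fixed changes when (not how far) the wing sweeps, which is the kind of modulation the abstract reports using for turning maneuvers.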
cs.RO / 32 / 2602.06827

DynaRetarget: Dynamically-Feasible Retargeting using Sampling-Based Trajectory Optimization

DynaRetarget:基于采样的动态可行重定向轨迹优化
Dhedin, Victor, Taouil, Ilyass, Omar, Shafeef, Yu, Dian, Tao, Kun, Dai, Angela, Khadiv, Majid
Abstract
In this paper, we introduce DynaRetarget, a complete pipeline for retargeting human motions to humanoid control policies. The core component of DynaRetarget is a novel Sampling-Based Trajectory Optimization (SBTO) framework that refines imperfect kinematic trajectories into dynamically feasible motions. SBTO incrementally advances the optimization horizon, enabling optimization over the entire trajectory for long-horizon tasks. We validate DynaRetarget by successfully retargeting hundreds of humanoid-object demonstrations and achieving higher success rates than the state of the art. The framework also generalizes across varying object properties, such as mass, size, and geometry, using the same tracking objective. This ability to robustly retarget diverse demonstrations opens the door to generating large-scale synthetic datasets of humanoid loco-manipulation trajectories, addressing a major bottleneck in real-world data collection.
Chinese Translation
本文介绍了DynaRetarget,一个将人类运动重定向到类人控制策略的完整流程。DynaRetarget的核心组件是一个新颖的基于采样的轨迹优化(SBTO)框架,该框架将不完美的运动学轨迹优化为动态可行的运动。SBTO逐步推进优化时域,使得在长时程任务中能够对整个轨迹进行优化。我们通过成功重定向数百个类人-物体演示并实现比现有技术更高的成功率来验证DynaRetarget。该框架还能够在不同物体属性(如质量、大小和几何形状)下进行泛化,使用相同的跟踪目标。这种稳健地重定向多样化演示的能力为生成大规模合成类人运动-操作轨迹数据集铺平了道路,从而解决了现实世界数据收集中的一个主要瓶颈。
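The "incrementally advancing horizon" idea can be shown on a toy 1-D tracking problem: optimize the first H steps, then grow H and warm-start from the previous solution. The cost (tracking plus smoothness) and plain gradient descent below are illustrative; the paper's SBTO is sampling-based and enforces full humanoid dynamics:

```python
import numpy as np

def cost_grad(x, ref, w_smooth=0.1):
    # Gradient of sum((x - ref)^2) + w_smooth * sum(diff(x)^2).
    g = 2.0 * (x - ref[: len(x)])
    d = np.diff(x)
    g[:-1] -= 2.0 * w_smooth * d
    g[1:] += 2.0 * w_smooth * d
    return g

def sbto_sketch(ref, chunk=5, iters=200, lr=0.05):
    # Incrementally advancing horizon: extend by one chunk at a time,
    # warm-starting each extension from the previously optimized prefix.
    x = np.zeros(0)
    for H in range(chunk, len(ref) + 1, chunk):
        x = np.concatenate([x, np.zeros(H - len(x))])
        for _ in range(iters):
            x -= lr * cost_grad(x, ref)
    return x

ref = np.sin(np.linspace(0, np.pi, 20))      # stand-in kinematic reference
traj = sbto_sketch(ref)
```

The warm start is what makes the incremental scheme pay off: each new horizon begins near the previous optimum instead of from scratch, so long trajectories stay tractable.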
cs.RO / 33 / 2602.06834

Perception-Control Coupled Visual Servoing for Textureless Objects Using Keypoint-Based EKF

基于关键点的扩展卡尔曼滤波的无纹理物体感知-控制耦合视觉伺服
Tao, Allen, Yang, Jun, Oparnica, Stanko, Xue, Wenjie
Abstract
Visual servoing is fundamental to robotic applications, enabling precise positioning and control. However, applying it to textureless objects remains a challenge due to the absence of reliable visual features. Moreover, adverse visual conditions, such as occlusions, often corrupt visual feedback, leading to reduced accuracy and instability in visual servoing. In this work, we build upon learning-based keypoint detection for textureless objects and propose a method that enhances robustness by tightly integrating perception and control in a closed loop. Specifically, we employ an Extended Kalman Filter (EKF) that integrates per-frame keypoint measurements to estimate 6D object pose, which drives pose-based visual servoing (PBVS) for control. The resulting camera motion, in turn, enhances the tracking of subsequent keypoints, effectively closing the perception-control loop. Additionally, unlike standard PBVS, we propose a probabilistic control law that computes both camera velocity and its associated uncertainty, enabling uncertainty-aware control for safe and reliable operation. We validate our approach on real-world robotic platforms using quantitative metrics and grasping experiments, demonstrating that our method outperforms traditional visual servoing techniques in both accuracy and practical application.
Chinese Translation
视觉伺服是机器人应用的基础,能够实现精确的定位和控制。然而,由于缺乏可靠的视觉特征,将其应用于无纹理物体仍然是一个挑战。此外,遮挡等不利视觉条件常常会干扰视觉反馈,导致视觉伺服的准确性降低和不稳定性增加。在本研究中,我们基于无纹理物体的学习型关键点检测,提出了一种通过在闭环中紧密集成感知与控制来增强鲁棒性的方法。具体而言,我们采用扩展卡尔曼滤波(EKF),集成每帧的关键点测量来估计6D物体姿态,从而驱动基于姿态的视觉伺服(PBVS)进行控制。由此产生的相机运动进一步增强了后续关键点的跟踪,有效地闭合了感知-控制循环。此外,与标准PBVS不同,我们提出了一种概率控制律,计算相机速度及其相关的不确定性,从而实现了不确定性感知控制,确保安全可靠的操作。我们在真实的机器人平台上使用定量指标和抓取实验验证了我们的方法,结果表明,我们的方法在准确性和实际应用中均优于传统的视觉伺服技术。
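The filtering step of the perception-control loop can be sketched with a minimal EKF: a random-walk motion model whose state is updated from per-frame keypoint-derived measurements. For brevity the 6-D pose is reduced to a 3-D position with a linear measurement model (the paper's filter tracks the full pose through a camera projection), and the PBVS gain and its uncertainty attenuation are assumed values:

```python
import numpy as np

class KeypointEKF:
    """Minimal EKF sketch: random-walk state (a 3-D object position standing
    in for the paper's 6-D pose), linear keypoint-derived measurements."""
    def __init__(self, x0, P0, Q, R):
        self.x, self.P, self.Q, self.R = x0, P0, Q, R

    def predict(self):
        # Random-walk motion model: state unchanged, uncertainty grows.
        self.P = self.P + self.Q

    def update(self, z, H):
        # Standard Kalman correction for measurement z = H x + noise.
        y = z - H @ self.x                     # innovation
        S = H @ self.P @ H.T + self.R          # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(len(self.x)) - K @ H) @ self.P
        return self.x

def pbvs_velocity(x_est, x_goal, P, lam=0.5):
    """PBVS-style proportional law attenuated by pose uncertainty: a hedged
    stand-in for the paper's probabilistic control law."""
    gain = lam / (1.0 + np.trace(P))
    return -gain * (x_est - x_goal)
```

Each new camera frame would supply `z` from the detected keypoints; the resulting motion command then improves the next frame's keypoint tracking, closing the loop described above.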
cs.RO / 34 / 2602.06864

SURE: Safe Uncertainty-Aware Robot-Environment Interaction using Trajectory Optimization

SURE:基于轨迹优化的安全不确定性感知机器人-环境交互
Zhang, Zhuocheng, Zhao, Haizhou, Sun, Xudong, Johnson, Aaron M., Khadiv, Majid
Abstract
Robotic tasks involving contact interactions pose significant challenges for trajectory optimization due to discontinuous dynamics. Conventional formulations typically assume deterministic contact events, which limit robustness and adaptability in real-world settings. In this work, we propose SURE, a robust trajectory optimization framework that explicitly accounts for contact timing uncertainty. By allowing multiple trajectories to branch from possible pre-impact states and later rejoin a shared trajectory, SURE achieves both robustness and computational efficiency within a unified optimization framework. We evaluate SURE on two representative tasks with unknown impact times. In a cart-pole balancing task involving uncertain wall location, SURE achieves an average improvement of 21.6% in success rate when branch switching is enabled during control. In an egg-catching experiment using a robotic manipulator, SURE improves the success rate by 40%. These results demonstrate that SURE substantially enhances robustness compared to conventional nominal formulations.
Chinese Translation
涉及接触交互的机器人任务由于不连续的动力学给轨迹优化带来了重大挑战。传统的模型通常假设接触事件是确定性的,这限制了其在现实环境中的鲁棒性和适应性。在本研究中,我们提出了SURE,一个强大的轨迹优化框架,明确考虑了接触时机的不确定性。通过允许多个轨迹从可能的撞击前状态分支并随后重新加入共享轨迹,SURE在统一的优化框架内实现了鲁棒性和计算效率的平衡。我们在两个具有未知撞击时间的代表性任务上评估了SURE。在一个涉及不确定墙体位置的倒立摆平衡任务中,当控制过程中启用分支切换时,SURE的成功率平均提高了21.6%。在一个使用机器人手臂的抓蛋实验中,SURE将成功率提高了40%。这些结果表明,与传统的名义模型相比,SURE显著增强了鲁棒性。
cs.RO / 35 / 2602.06868

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

基于共识的优化(CBO):朝向机器人领域的全局最优性
Sun, Xudong, Jordana, Armand, Fornasier, Massimo, Etesami, Jalal, Khadiv, Majid
Abstract
Zero-order optimization has recently received significant attention for designing optimal trajectories and policies for robotic systems. However, most existing methods (e.g., MPPI, CEM, and CMA-ES) are local in nature, as they rely on gradient estimation. In this paper, we introduce consensus-based optimization (CBO) to robotics, which is guaranteed to converge to a global optimum under mild assumptions. We provide theoretical analysis and illustrative examples that give intuition into the fundamental differences between CBO and existing methods. To demonstrate the scalability of CBO for robotics problems, we consider three challenging trajectory optimization scenarios: (1) a long-horizon problem for a simple system, (2) a dynamic balance problem for a highly underactuated system, and (3) a high-dimensional problem with only a terminal cost. Our results show that CBO is able to achieve lower costs with respect to existing methods on all three challenging settings. This opens a new framework to study global trajectory optimization in robotics.
Chinese Translation
零阶优化近年来在设计机器人系统的最优轨迹和策略方面受到了广泛关注。然而,大多数现有方法(例如,MPPI、CEM 和 CMA-ES)本质上是局部的,因为它们依赖于梯度估计。本文引入了基于共识的优化(CBO)到机器人领域,该方法在温和假设下保证收敛到全局最优解。我们提供了理论分析和说明性示例,以阐明 CBO 与现有方法之间的基本差异。为了展示 CBO 在机器人问题上的可扩展性,我们考虑了三个具有挑战性的轨迹优化场景:(1)简单系统的长时间范围问题,(2)高度欠驱动系统的动态平衡问题,以及(3)仅具有终端成本的高维问题。我们的结果表明,CBO 在所有三个具有挑战性的设置中都能够实现比现有方法更低的成本。这为研究机器人领域的全局轨迹优化开辟了一个新框架。
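The core CBO update is compact enough to state concretely: particles drift toward a Gibbs-weighted consensus point and diffuse with noise scaled by their distance to it, so the ensemble contracts as it agrees. The sketch below is the textbook anisotropic CBO dynamics on a generic cost, not the robotics-specific trajectory parametrization of the paper, and all hyperparameters are illustrative:

```python
import numpy as np

def cbo_minimize(f, dim, n=100, steps=400, alpha=30.0, lam=1.0,
                 sigma=0.8, dt=0.05, seed=0):
    """Consensus-based optimization (anisotropic variant): drift toward a
    Gibbs-weighted consensus point plus distance-scaled diffusion."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3.0, 3.0, size=(n, dim))
    for _ in range(steps):
        fx = np.array([f(x) for x in X])
        # Gibbs weights; subtracting fx.min() is a standard stability trick
        # and does not change the consensus point.
        w = np.exp(-alpha * (fx - fx.min()))
        consensus = (w[:, None] * X).sum(0) / w.sum()
        drift = X - consensus
        X = X - lam * dt * drift + sigma * np.sqrt(dt) * drift * rng.normal(size=X.shape)
    return consensus
```

The Gibbs weighting (large `alpha` concentrates mass on the best particles) is what underlies the global-convergence guarantees the abstract refers to; gradient-free evaluation of `f` is all that is required.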
cs.RO / 36 / 2602.06925

Strategizing at Speed: A Learned Model Predictive Game for Multi-Agent Drone Racing

快速战略制定:一种用于多智能体无人机竞速的学习模型预测游戏
Papuc, Andrei-Carlo, Peters, Lasse, Sun, Sihao, Ferranti, Laura, Alonso-Mora, Javier
Abstract
Autonomous drone racing pushes the boundaries of high-speed motion planning and multi-agent strategic decision-making. Success in this domain requires drones not only to navigate at their limits but also to anticipate and counteract competitors' actions. In this paper, we study a fundamental question that arises in this domain: how deeply should an agent strategize before taking an action? To this end, we compare two planning paradigms: the Model Predictive Game (MPG), which finds interaction-aware strategies at the expense of longer computation times, and contouring Model Predictive Control (MPC), which computes strategies rapidly but does not reason about interactions. We perform extensive experiments to study this trade-off, revealing that MPG outperforms MPC at moderate velocities but loses its advantage at higher speeds due to latency. To address this shortcoming, we propose a Learned Model Predictive Game (LMPG) approach that amortizes model predictive gameplay to reduce latency. In both simulation and hardware experiments, we benchmark our approach against MPG and MPC in head-to-head races, finding that LMPG outperforms both baselines.
Chinese Translation
自主无人机竞速推动了高速运动规划和多智能体战略决策的边界。在这一领域取得成功不仅要求无人机在极限条件下导航,还要求其预测和反制竞争对手的行动。本文研究了一个在该领域中出现的基本问题:在采取行动之前,智能体应该进行多深的战略思考?为此,我们比较了两种规划范式:模型预测游戏(Model Predictive Game, MPG),它在较长的计算时间下找到考虑交互的策略,以及轮廓模型预测控制(contouring Model Predictive Control, MPC),它快速计算策略但不考虑交互。我们进行了广泛的实验来研究这一权衡,结果表明,MPG在中等速度下优于MPC,但在更高速度下由于延迟而失去优势。为了解决这一缺陷,我们提出了一种学习模型预测游戏(Learned Model Predictive Game, LMPG)的方法,通过摊销模型预测游戏的计算来减少延迟。在模拟和硬件实验中,我们将我们的方法与MPG和MPC在面对面竞速中进行了基准测试,发现LMPG优于这两种基线。
cs.RO / 37 / 2602.06949

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

DreamDojo:来自大规模人类视频的通用机器人世界模型
Gao, Shenyuan, Liang, William, Zheng, Kaiyuan, Malik, Ayaan, Ye, Seonghyeon, Yu, Sihyun, Tseng, Wei-Cheng, Dong, Yuzhu, Mo, Kaichun, Lin, Chen-Hsuan, Ma, Qianli, Nah, Seungjun, Magne, Loic, Xiang, Jiannan, Xie, Yuqi, Zheng, Ruijie, Niu, Dantong, Tan, You Liang, Zentner, K. R., Kurian, George, Indupuru, Suneel, Jannaty, Pooya, Gu, Jinwei, Zhang, Jun, Malik, Jitendra, Abbeel, Pieter, Liu, Ming-Yu, Zhu, Yuke, Jang, Joel, Fan, Linxi "Jim"
Abstract
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
Chinese Translation
能够在多样化环境中模拟动作结果将彻底改变通用智能体的大规模开发。然而,建模这些世界动态,特别是在灵巧机器人任务中,由于数据覆盖有限和动作标签稀缺,面临着重大挑战。为此,我们提出了DreamDojo,一个基础世界模型,从44,000小时的自我中心人类视频中学习多样的交互和灵巧控制。我们的数据混合代表了迄今为止用于世界模型预训练的最大视频数据集,涵盖了广泛的日常场景,涉及多样的物体和技能。为了解决动作标签稀缺的问题,我们引入了连续潜在动作作为统一的代理动作,增强了从未标记视频中交互知识的转移。在小规模目标机器人数据上进行后训练后,DreamDojo展现出对物理的强理解和精确的动作可控性。我们还设计了一个蒸馏管道,使DreamDojo的实时速度达到10.81 FPS,并进一步改善了上下文一致性。我们的工作使得基于生成世界模型的多个重要应用成为可能,包括实时遥操作、策略评估和基于模型的规划。在多个具有挑战性的分布外(OOD)基准上的系统评估验证了我们的方法在模拟开放世界、接触丰富任务中的重要性,为通用机器人世界模型铺平了道路。
计算机视觉 (Computer Vision)
78
cs.CV / 1 / 2602.06122

From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors

从模糊到可信:利用3D生成先验增强低质量的虚拟人头
Huang, Ding-Jiun, Wang, Yuanhao, Yuan, Shao-Ji, Mosella-Montoro, Albert, Carrasco, Francisco Vicente, Zhang, Cheng, De la Torre, Fernando
Abstract
Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
Chinese Translation
创建高保真、可动画的3D虚拟人头对于沉浸式应用至关重要,但常常受到低质量图像或视频源的影响,从而导致糟糕的3D重建。在本文中,我们介绍了SuperHead,一个用于增强低分辨率、可动画3D人头头像的新框架。核心挑战在于合成高质量的几何形状和纹理,同时确保在动画过程中3D和时间的一致性,并保持主体的身份。尽管在图像、视频和基于3D的超分辨率(SR)方面取得了近期进展,但现有的SR技术并不适合处理动态3D输入。为了解决这个问题,SuperHead通过一种新颖的动态感知3D反演方案,利用预训练3D生成模型的丰富先验。该过程优化生成模型的潜在表示,以生成超分辨率的3D高斯点云(3DGS)人头模型,随后将其绑定到基础参数化人头模型(如FLAME)上进行动画。反演过程通过稀疏收集的放大2D人脸渲染图和相应的深度图进行联合监督,这些图像捕捉了多种面部表情和相机视角,以确保在动态面部运动下的真实感。实验表明,SuperHead在动态运动下生成具有细致面部细节的虚拟人头,在视觉质量上显著优于基线方法。
cs.CV / 2 / 2602.06139

EgoAVU: Egocentric Audio-Visual Understanding

EgoAVU:自我中心音视频理解
Seth, Ashish, Mei, Xinhao, Zhao, Changsheng, Nagaraja, Varun, Chang, Ernie, Meyer, Gregory P., Lan, Gael Le, Xiong, Yunyang, Chandra, Vikas, Shi, Yangyang, Manocha, Dinesh, Cai, Zhipeng
Abstract
Understanding egocentric videos plays a vital role for embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they bias heavily toward visual signals, often neglecting audio cues or failing to correspond audio with the visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.
Chinese Translation
理解自我中心视频对具身智能至关重要。近期的多模态大型语言模型(MLLMs)能够接受视觉和音频输入。然而,由于获取具有一致的联合模态信息的文本标签存在挑战,因此MLLMs在自我中心视频中是否能够共同理解这两种模态仍然未被充分探讨。为了解决这一问题,我们提出了EgoAVU,一个可扩展的数据引擎,用于自动生成自我中心音视频叙述、问题和答案。EgoAVU通过多模态上下文丰富人类叙述,并通过跨模态关联建模生成音视频叙述。基于令牌的视频过滤和模块化、基于图的策划确保了数据的多样性和质量。利用EgoAVU,我们构建了EgoAVU-Instruct,一个包含300万样本的大规模训练数据集,以及EgoAVU-Bench,一个经过人工验证的评估分割,涵盖多样化任务。EgoAVU-Bench清晰地揭示了现有MLLMs的局限性:它们严重偏向视觉信号,常常忽视音频线索或未能将音频与视觉源对应。在EgoAVU-Instruct上微调MLLMs有效地解决了这一问题,使得在EgoAVU-Bench上的性能提升达到113%。这种优势也转移到其他基准上,如EgoTempo和EgoIllusion,取得了高达28%的相对性能提升。代码将向社区发布。
cs.CV / 3 / 2602.06158

MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes

MGP-KAD:用于复杂场景中单视图3D重建的多模态几何先验与Kolmogorov-Arnold解码器
Zhang, Luoxi, Xie, Chun, Kitahara, Itaru
Abstract
Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. To address these challenges, we propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB features with a geometric prior to enhance reconstruction accuracy. The geometric prior is generated by sampling and clustering ground-truth object data, producing class-level features that dynamically adjust during training to improve geometric understanding. Additionally, we introduce a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limitations of traditional linear decoders in processing complex multimodal inputs. Extensive experiments on the Pix3D dataset demonstrate that MGP-KAD achieves state-of-the-art (SOTA) performance, significantly improving geometric integrity, smoothness, and detail preservation. Our work provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes.
Chinese Translation
在复杂的现实场景中,单视图3D重建面临噪声、物体多样性和数据集可用性有限等挑战。为了解决这些问题,我们提出了MGP-KAD,这是一种新颖的多模态特征融合框架,结合了RGB和几何先验,以提高重建精度。几何先验通过对真实物体数据进行采样和聚类生成,产生的类别级特征在训练过程中动态调整,以改善几何理解。此外,我们引入了一种基于Kolmogorov-Arnold Networks (KAN)的混合解码器,以克服传统线性解码器在处理复杂多模态输入时的局限性。在Pix3D数据集上的大量实验表明,MGP-KAD达到了最先进的(SOTA)性能,显著提高了几何完整性、平滑性和细节保留。我们的工作为推动复杂场景中的单视图3D重建提供了一种稳健有效的解决方案。
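To make the "hybrid decoder based on Kolmogorov-Arnold Networks" concrete, here is a minimal forward-only KAN layer in which every input-output edge carries its own learnable univariate function. Gaussian RBFs on a fixed grid stand in for the B-splines typically used in KANs, and the initialization scale and grid range are assumed; the paper's decoder is of course trained and composed with other modules:

```python
import numpy as np

class KANLayer:
    """Minimal Kolmogorov-Arnold layer sketch: each edge (input i -> output o)
    has its own univariate function, parameterized as a weighted sum of
    Gaussian radial basis functions, and outputs sum their edge functions."""
    def __init__(self, d_in, d_out, n_basis=8, grid=(-2.0, 2.0), seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(*grid, n_basis)            # (B,)
        self.width = (grid[1] - grid[0]) / n_basis
        # One coefficient vector per edge: (d_out, d_in, B).
        self.coef = rng.normal(0.0, 0.1, size=(d_out, d_in, n_basis))

    def __call__(self, x):
        # x: (N, d_in) -> RBF features of each coordinate: (N, d_in, B).
        phi = np.exp(-((x[:, :, None] - self.centers) / self.width) ** 2)
        # Sum edge functions over inputs: (N, d_out).
        return np.einsum('nib,oib->no', phi, self.coef)
```

Unlike a linear decoder, nonlinearity here lives on the edges themselves, which is the property the abstract appeals to for handling complex multimodal inputs.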
cs.CV / 4 / 2602.06159

Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

使用DINO进行驾驶:视觉基础特征作为自主驾驶中仿真到现实生成的统一桥梁
Chen, Xuyang, Zhang, Conglang, Fu, Chuanheng, Yang, Zihao, Zhou, Kaixuan, Zhang, Yizhi, He, Jianan, Zhang, Yanfeng, Sun, Mingwei, Wang, Zengmao, Dong, Zhen, Long, Xiaoxiao, Meng, Liqiu
Abstract
Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Module (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/
Chinese Translation
受可控视频扩散技术的推动,现有的自主驾驶视频生成的仿真到现实(Sim2Real)方法通常依赖于明确的中间表示来弥合领域间的差距。然而,这些模式面临着基本的一致性-现实性困境。低级信号(例如,边缘、模糊图像)确保了精确控制,但通过“烘焙”合成伪影而妥协了现实性;而高级先验(例如,深度、语义、高清地图)促进了照片级真实感,但缺乏一致指导所需的结构细节。在本研究中,我们提出了使用DINO进行驾驶(Driving with DINO, DwD),一个新颖的框架,利用视觉基础模块(Vision Foundation Module, VFM)特征作为仿真与现实世界领域之间的统一桥梁。我们首先识别到这些特征编码了一系列信息,从高级语义到细粒度结构。为了有效利用这些特征,我们采用主子空间投影(Principal Subspace Projection)来丢弃负责“纹理烘焙”的高频元素,同时引入随机通道尾部丢弃(Random Channel Tail Drop)来减轻刚性降维中固有的结构损失,从而调和现实性与控制一致性。此外,为了充分利用DINOv3的高分辨率能力以增强控制精度,我们引入了一个可学习的空间对齐模块(Spatial Alignment Module),将这些高分辨率特征适配到扩散主干上。最后,我们提出了一种因果时间聚合器(Causal Temporal Aggregator),采用因果卷积在整合逐帧DINO特征时显式保留历史运动上下文,有效减轻运动模糊并保证时间稳定性。项目页面:https://albertchen98.github.io/DwD-project/
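The two feature-conditioning ideas, Principal Subspace Projection and Random Channel Tail Drop, can be illustrated on a generic (N, C) feature matrix. The PCA-style projection and the random prefix truncation below are a hedged reading of the abstract; `k` and `k_min` are assumed knobs, and the paper applies these to DINOv3 patch features inside a diffusion pipeline:

```python
import numpy as np

def principal_subspace_project(feats, k):
    """Project (N, C) features onto their top-k principal components,
    discarding low-variance directions (a generic PCA stand-in for the
    paper's Principal Subspace Projection)."""
    mean = feats.mean(0, keepdims=True)
    centered = feats - mean
    # SVD of the centered matrix; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:k]                            # (k, C)
    return centered @ basis.T, basis, mean    # coords: (N, k)

def random_tail_drop(coords, k_min, rng):
    """Randomly truncate the retained components to a prefix of length
    >= k_min, softening the rigid cut-off during training."""
    k = coords.shape[1]
    keep = rng.integers(k_min, k + 1)
    out = coords.copy()
    out[:, keep:] = 0.0
    return out
```

Dropping a random tail of components during training means the model cannot rely on any single hard dimensionality cut-off, which is the stated motivation for mitigating the structural loss of rigid reduction.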
cs.CV / 5 / 2602.06163

MetaSSP: Enhancing Semi-supervised Implicit 3D Reconstruction through Meta-adaptive EMA and SDF-aware Pseudo-label Evaluation

MetaSSP:通过元自适应EMA和SDF感知伪标签评估增强半监督隐式3D重建
Zhang, Luoxi, Xie, Chun, Kitahara, Itaru
Abstract
Implicit SDF-based methods for single-view 3D reconstruction achieve high-quality surfaces but require large labeled datasets, limiting their scalability. We propose MetaSSP, a novel semi-supervised framework that exploits abundant unlabeled images. Our approach introduces gradient-based parameter importance estimation to regularize adaptive EMA updates and an SDF-aware pseudo-label weighting mechanism combining augmentation consistency with SDF variance. Beginning with a 10% supervised warm-up, the unified pipeline jointly refines labeled and unlabeled data. On the Pix3D benchmark, our method reduces Chamfer Distance by approximately 20.61% and increases IoU by around 24.09% compared to existing semi-supervised baselines, setting a new state of the art.
Chinese Translation
基于隐式SDF的方法在单视图3D重建中能够实现高质量的表面,但需要大量标记数据集,这限制了其可扩展性。我们提出了MetaSSP,一种新颖的半监督框架,利用丰富的未标记图像。我们的方法引入了基于梯度的参数重要性估计,以规范化自适应EMA更新,并结合增强一致性与SDF方差的SDF感知伪标签加权机制。从10%的监督预热开始,统一的管道共同优化标记和未标记数据。在Pix3D基准测试中,我们的方法将Chamfer距离减少了约20.61%,并将IoU提高了约24.09%,相比现有的半监督基线,设立了新的最优状态。
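The interaction between gradient-based parameter importance and the EMA teacher update can be sketched per parameter tensor. The mapping from importance to momentum below (linear into [m_base, m_base + spread], with important weights changing more slowly) is an assumption for illustration; the paper's meta-adaptive rule is more elaborate:

```python
import numpy as np

def importance(grads, eps=1e-8):
    """Gradient-magnitude importance per parameter, normalized to [0, 1]."""
    g = np.abs(grads)
    return g / (g.max() + eps)

def adaptive_ema_update(teacher, student, grads, m_base=0.99, spread=0.009):
    """EMA teacher update with per-parameter momentum: high-importance
    weights keep a larger momentum and so change more slowly (a hedged
    reading of the importance-regularized adaptive EMA)."""
    imp = importance(grads)
    m = m_base + spread * imp          # momentum in [m_base, m_base + spread]
    return m * teacher + (1.0 - m) * student
```

With zero gradients this reduces to a plain EMA with momentum `m_base`, so the importance term only perturbs the update where gradients indicate sensitive parameters.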
cs.CV / 6 / 2602.06166

M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

M3:通过多模态、多智能体和多轮视觉推理实现高保真文本到图像生成
Yang, Bangji, Guo, Ruihan, Fan, Jiajun, Cheng, Chaoran, Liu, Ge
Abstract
Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
Chinese Translation
生成模型在文本到图像合成方面取得了令人印象深刻的保真度,但在涉及多个约束的复杂组合提示时仍然面临挑战。我们提出了M3(多模态、多智能体、多轮),这是一个无需训练的框架,通过迭代推理时的细化系统性地解决这些失败。M3在一个稳健的多智能体循环中协调现成的基础模型:一个规划者将提示分解为可验证的检查清单,而专门的检查者、精炼者和编辑者智能体逐一修正约束,验证者确保单调改进。应用于开源模型,M3在具有挑战性的OneIG-EN基准上取得了显著成果,我们的Qwen-Image+M3超越了包括Imagen4(0.515)和Seedream 3.0(0.530)在内的商业旗舰系统,达到了最先进的性能(0.532)。这表明智能的多智能体推理能够将开源模型提升到超越专有替代品的水平。M3还显著改善了GenEval组合指标,在强化测试集上有效地将空间推理性能提高了一倍。作为一个与任何预训练文本到图像(T2I)模型兼容的即插即用模块,M3为组合生成建立了一个无需昂贵再训练的新范式。
cs.CV / 7 / 2602.06179

Unsupervised Anomaly Detection of Diseases in the Female Pelvis for Real-Time MR Imaging

女性盆腔疾病的无监督异常检测用于实时磁共振成像
Knupfer, Anika, Müller, Johanna P., Verdera, Jordina A., Fenske, Martin, Mathy, Claudius S., Tripathy, Smiti, Arndt, Sebastian, May, Matthias, Uder, Michael, Beckmann, Matthias W., Burghaus, Stefanie, Hutter, Jana
Abstract
Pelvic diseases in women of reproductive age represent a major global health burden, with diagnosis frequently delayed due to high anatomical variability, complicating MRI interpretation. Existing AI approaches are largely disease-specific and lack real-time compatibility, limiting generalizability and clinical integration. To address these challenges, we establish a benchmark framework for disease- and parameter-agnostic, real-time-compatible unsupervised anomaly detection in pelvic MRI. The method uses a residual variational autoencoder trained exclusively on healthy sagittal T2-weighted scans acquired across diverse imaging protocols to model normal pelvic anatomy. During inference, reconstruction error heatmaps indicate deviations from learned healthy structure, enabling detection of pathological regions without labeled abnormal data. The model is trained on 294 healthy scans and augmented with diffusion-generated synthetic data to improve robustness. Quantitative evaluation on the publicly available Uterine Myoma MRI Dataset yields an average area-under-the-curve (AUC) value of 0.736, with 0.828 sensitivity and 0.692 specificity. Additional inter-observer clinical evaluation extends analysis to endometrial cancer, endometriosis, and adenomyosis, revealing the influence of anatomical heterogeneity and inter-observer variability on performance interpretation. With a reconstruction rate of approximately 92.6 frames per second, the proposed framework establishes a baseline for unsupervised anomaly detection in the female pelvis and supports future integration into real-time MRI. Code is available upon request (https://github.com/AniKnu/UADPelvis); prospective datasets are available for academic collaboration.
Chinese Translation
生育年龄女性的盆腔疾病在全球健康中占据重要负担,诊断常因解剖变异性大而延迟,给磁共振成像(MRI)解读带来复杂性。现有的人工智能(AI)方法主要针对特定疾病,缺乏实时兼容性,限制了其普遍适用性和临床整合。为了解决这些挑战,我们建立了一个基准框架,用于无监督异常检测,具有疾病和参数无关性以及实时兼容性,专注于盆腔MRI。该方法使用残差变分自编码器(residual variational autoencoder),仅在健康的矢状面T2加权扫描上进行训练,这些扫描是通过多种成像协议获得的,以建模正常的盆腔解剖结构。在推理过程中,重建误差热图显示与学习到的健康结构的偏差,从而能够在没有标记异常数据的情况下检测病理区域。该模型在294个健康扫描上进行训练,并通过扩散生成的合成数据进行增强,以提高鲁棒性。在公开可用的子宫肌瘤MRI数据集上的定量评估显示,平均曲线下面积(AUC)值为0.736,灵敏度为0.828,特异性为0.692。额外的观察者间临床评估将分析扩展到子宫内膜癌、子宫内膜异位症和腺肌症,揭示了解剖异质性和观察者间变异性对性能解读的影响。该框架的重建时间约为每秒92.6帧,为女性盆腔的无监督异常检测建立了基线,并支持未来在实时MRI中的整合。代码可根据请求获取(https://github.com/AniKnu/UADPelvis),前瞻性数据集可用于学术合作。
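The detection principle — score each pixel by how badly a healthy-only reconstruction model reproduces it — fits in a few lines. Here `reconstruct` is a stand-in for the trained residual VAE, and the percentile threshold is an illustrative post-processing choice, not the paper's evaluation protocol:

```python
import numpy as np

def anomaly_heatmap(image, reconstruct):
    """Per-pixel anomaly score: squared reconstruction error under a model
    trained only on healthy anatomy (here an arbitrary callable)."""
    return (image - reconstruct(image)) ** 2

def anomaly_mask(heatmap, q=99.0):
    """Binarize the heatmap at a per-scan percentile threshold."""
    return heatmap >= np.percentile(heatmap, q)
```

Because the model never sees pathology during training, lesions reconstruct poorly and light up in the heatmap, which is what enables label-free detection.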
cs.CV / 8 / 2602.06184

PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining

PhenoLIP:将表型本体知识整合到医学视觉语言预训练中
Liang, Cheng, Wu, Chaoyi, Zhao, Weike, Zhang, Ya, Wang, Yanfeng, Xie, Weidi
Abstract
Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
Chinese Translation
近年来,大规模CLIP类视觉语言模型(VLMs)的进展极大推动了医学图像分析。然而,现有的大多数医学VLM仍然依赖粗糙的图像-文本对比目标,未能捕捉到在明确定义的医学表型本体中编码的系统性视觉知识。为了解决这一问题,我们构建了PhenoKG,这是第一个大规模、以表型为中心的多模态知识图谱,涵盖了超过520,000个高质量的图像-文本对,链接到超过3,000个表型。在PhenoKG的基础上,我们提出了PhenoLIP,这是一种新颖的预训练框架,通过两阶段过程明确地将结构化的表型知识纳入医学VLM中。我们首先从文本本体数据中学习一个知识增强的表型嵌入空间,然后通过教师引导的知识蒸馏目标将这一结构化知识蒸馏到多模态预训练中。为了支持评估,我们进一步引入了PhenoBench,这是一个经过专家验证的基准,专为表型识别设计,包含超过7,800个图像-标题对,覆盖超过1,000个表型。大量实验表明,PhenoLIP优于之前的最先进基线:在表型分类准确性上比BiomedCLIP提高了8.85%,在跨模态检索上比BIOMEDICA提高了15.03%,强调了将以表型为中心的先验知识整合到医学VLM中以实现结构化和可解释的医学图像理解的价值。
cs.CV / 9 / 2602.06195

DeDPO: Debiased Direct Preference Optimization for Diffusion Models

DeDPO:用于扩散模型的去偏直接偏好优化
Pham, Khiem, Nguyen, Quang, Nguyen, Tung, Zhu, Jingsen, Santacatterina, Michele, Metaxas, Dimitris, Zabih, Ramin
Abstract
Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to the variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
Chinese Translation
直接偏好优化(DPO)已成为扩散模型的主要对齐方法,促进了无需显式奖励建模的离策略训练。然而,它对大规模、高质量人类偏好标签的依赖带来了严重的成本和可扩展性瓶颈。为了解决这个问题,我们提出了一种半监督框架,利用由成本效益高的合成AI反馈进行标注的大量未标记样本对来扩充有限的人类数据。我们的论文介绍了去偏DPO(DeDPO),它独特地将因果推断中的去偏估计技术整合到DPO目标中。通过明确识别和纠正合成标注者固有的系统性偏差和噪声,DeDPO确保从不完美的反馈源中进行稳健学习,包括自我训练和视觉-语言模型(VLMs)。实验表明,DeDPO对合成标注方法的变化具有鲁棒性,其性能与完全人类标注数据训练的模型的理论上限相匹配,甚至在某些情况下超过该上限。这确立了DeDPO作为一种使用廉价合成监督实现人类与AI对齐的可扩展解决方案。
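The standard DPO preference loss makes clear where a debiasing correction for noisy synthetic labels would enter. The per-pair `weights` below are only a placeholder for such a correction (the paper's causal-inference estimator is more involved and is not reproduced here); `beta` and the scalar log-probabilities are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_dpo_loss(logp_w, logp_l, ref_w, ref_l, weights, beta=0.1):
    """Standard DPO preference loss with a per-pair weight that could, e.g.,
    down-weight pairs whose synthetic annotation is estimated to be biased
    or noisy. logp_* are policy log-probs, ref_* are reference log-probs,
    for the preferred (w) and dispreferred (l) sample of each pair."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return float(np.mean(weights * -np.log(sigmoid(margin))))
```

With all weights equal to one this reduces to vanilla DPO; the debiasing question is precisely how to choose those weights (or an additive correction) from the synthetic annotator's error structure.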
cs.CV / 10 / 2602.06203

AnyThermal: Towards Learning Universal Representations for Thermal Perception

AnyThermal:朝着学习热感知的通用表征迈进
Maheshwari, Parv, Karhade, Jay, Chawla, Yogesh, Adu, Isaiah, Heisen, Florian, Porco, Andrew, Jong, Andrew, Liu, Yifei, Pitla, Santosh, Scherer, Sebastian, Wang, Wenshan
Abstract
We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset - a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.
Chinese Translation
我们提出了AnyThermal,这是一种热基础模型,能够捕捉适用于多种任务的稳健的任务无关热特征,如跨模态地点识别、热分割和使用热图像的单目深度估计。现有的热基础模型通常依赖于小规模数据进行任务特定训练,导致其效用仅限于特定环境和任务。与之前的方法不同,AnyThermal可以用于多种环境(室内、空中、越野、城市)和任务,且无需任务特定训练。我们的关键见解是将来自视觉基础模型(如DINOv2)的特征表征提炼到热编码器中,利用来自这些多种环境的热数据。为了弥合现有RGB-热数据集的多样性差距,我们推出了TartanRGBT平台,这是第一个具有同步RGB-热图像采集的开源数据收集平台。我们利用这一平台收集了TartanRGBT数据集——一个在四种环境中收集的多样且平衡的数据集。我们展示了AnyThermal和TartanRGBT的有效性,在现有数据集上实现了最先进的结果,在多样环境和下游任务中提高了多达36%。
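The distillation objective at the heart of AnyThermal — matching the thermal student's features to a frozen RGB foundation teacher such as DINOv2 — can be written as a simple cosine-distance loss over paired features. The paper's exact objective may differ; this is a generic feature-distillation sketch:

```python
import numpy as np

def distill_loss(student_feats, teacher_feats):
    """Mean cosine distance between student (thermal) and frozen teacher
    (RGB foundation) features of shape (N, D); 0 when directions match."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

Training the thermal encoder to minimize this on synced RGB-thermal pairs is what transfers the teacher's task-agnostic representation without any task-specific labels.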
cs.CV / 11 / 2602.06211

DroneKey++: A Size Prior-free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images

DroneKey++:一种无需尺寸先验的方法及基于序列图像的无人机三维姿态估计新基准
Hwang, Seo-Bin, Cho, Yeong-Jun
Abstract
Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
Chinese Translation
无人机的准确三维姿态估计对于安全和监控系统至关重要。然而,现有方法通常依赖于无人机的先验信息,例如物理尺寸或三维网格。同时,目前的数据集规模较小,限于单一模型,并在受限环境下收集,这使得对泛化能力的可靠验证变得困难。我们提出了DroneKey++,一种无先验框架,能够联合执行关键点检测、无人机分类和三维姿态估计。该框架采用关键点编码器进行同时的关键点检测和分类,并使用基于光线的几何推理和类别嵌入的姿态解码器来估计三维姿态。为了解决数据集的局限性,我们构建了6DroneSyn,这是一个大规模合成基准,包含超过50K张图像,涵盖7种无人机模型和88种户外背景,采用360度全景合成生成。实验表明,DroneKey++在旋转方面的平均绝对误差(MAE)为17.34度,中位绝对误差(MedAE)为17.1度;在平移方面的MAE为0.135米,MedAE为0.242米,推理速度为19.25 FPS(CPU)和414.07 FPS(GPU),展示了其在无人机模型间的强泛化能力和实时应用的适用性。该数据集已公开发布。
cs.CV / 12 / 2602.06214

Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models

通过车辆运动模型解决端到端自动驾驶中的路径点-动作差距
Rodríguez-Vidal, Jorge Daniel, Villalonga, Gabriel, Porres, Diego, Peña, Antonio M. López
Abstract
End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM navhard our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.
Chinese Translation
端到端自动驾驶(E2E-AD)系统通常根据其输出的性质进行分类:(i)基于路径点的模型,预测未来轨迹;(ii)基于动作的模型,直接输出油门、转向和刹车。最近的大多数基准协议和训练流程都是基于路径点的,这使得基于动作的策略更难以训练和比较,从而减缓了它们的发展。为了弥补这一路径点-动作差距,我们提出了一种新颖的可微分车辆模型框架,该框架将预测的动作序列展开为其对应的自车坐标系下的路径点轨迹,同时在路径点空间进行监督。我们的方法首次使基于动作的架构能够在不修改基础评估协议的情况下,在基于路径点的基准中进行训练和评估。我们在多个具有挑战性的基准上广泛评估了我们的框架,并观察到相对于基线的一致性改进。特别是在NAVSIM navhard上,我们的方法达到了最先进的性能。我们的代码将在接受后公开发布。
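The differentiable vehicle model described in the abstract can be illustrated with a standard kinematic bicycle model that integrates (acceleration, steering) actions into ego-frame waypoints; this is a hedged sketch of the general technique, with wheelbase and time step chosen arbitrarily, not the paper's implementation:

```python
import math

def rollout_waypoints(x, y, yaw, v, actions, wheelbase=2.5, dt=0.1):
    """Roll out a sequence of (acceleration, steering-angle) actions
    with a kinematic bicycle model, returning one (x, y) waypoint per
    action. Illustrative sketch; parameter values are assumptions."""
    waypoints = []
    for accel, steer in actions:
        x += v * math.cos(yaw) * dt          # advance position
        y += v * math.sin(yaw) * dt
        yaw += v / wheelbase * math.tan(steer) * dt   # update heading
        v = max(0.0, v + accel * dt)         # update speed, no reversing
        waypoints.append((x, y))
    return waypoints
```

Because every step is smooth in the actions, supervising the resulting waypoints in waypoint space lets gradients flow back through the rollout to the action sequence, which is what allows action-based policies to be trained against waypoint-based benchmarks.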
cs.CV / 13 / 2602.06218

Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

跨模态冗余与视觉-语言嵌入的几何结构
Dhimoïla, Grégoire, Fel, Thomas, Boutin, Victor, Picard, Agustin
Abstract
Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
Chinese Translation
视觉-语言模型(VLMs)在对齐图像和文本方面取得了显著成功,但其共享嵌入空间的几何结构仍然不够清晰。为了探究这一几何结构,我们从等能量假设(Iso-Energy Assumption)出发,该假设利用了跨模态冗余:一个真正共享的概念在不同模态间应表现出相同的平均能量。我们通过一个对齐稀疏自编码器(Aligned Sparse Autoencoder, SAE)来实现这一假设,该模型在训练过程中鼓励能量一致性,同时保持重构质量。我们发现,这种归纳偏差改变了SAE的解决方案而不损害重构,提供了一种可用于几何分析的表示。对已知真实值的受控数据进行的合理性检查确认,当等能量假设成立时,对齐效果改善,而当假设不成立时则保持中立。应用于基础VLMs,我们的框架揭示了一个清晰的结构,并具有实际意义:(i)稀疏的双模态原子承载了整个跨模态对齐信号;(ii)单模态原子作为模态特定的偏差,完全解释了模态间的差距;(iii)去除单模态原子在不损害性能的情况下缩小了差距;(iv)将向量运算限制在双模态子空间内可实现分布内编辑和改进检索。这些发现表明,适当的归纳偏差可以同时保持模型的保真度,并使潜在几何结构可解释且可操作。
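The Iso-Energy Assumption can be illustrated with a toy penalty that compares the average energy a concept's sparse codes receive from each modality; the function names are illustrative stand-ins, not the paper's code:

```python
def mean_energy(codes):
    """Average squared activation (energy) over a list of sparse codes."""
    return sum(sum(c * c for c in code) for code in codes) / len(codes)

def iso_energy_penalty(image_codes, text_codes):
    """Penalize any mismatch between the average energies a shared atom
    receives from the image and text modalities. A concept that is truly
    shared should drive this penalty to zero."""
    return (mean_energy(image_codes) - mean_energy(text_codes)) ** 2
```

A term like this, added to the SAE reconstruction objective, encourages energy consistency across modalities without constraining how well each modality is reconstructed.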
cs.CV / 14 / 2602.06226

ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos

ForeHOI:从日常手物交互视频中进行前馈式3D物体重建
Chen, Yuantao, Chang, Jiahao, Ye, Chongjie, Zhang, Chaoran, Fang, Zhaojie, Li, Chenghong, Han, Xiaoguang
Abstract
The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform optimization-based methods. The information exchange between 2D and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup. Code and data are available at: https://github.com/Tao-11-chen/ForeHOI.
Chinese Translation
单目视频捕捉日常手物交互的普遍性为具身智能提供了宝贵的资源。尽管在野外视频中进行3D手部重建已经取得了显著进展,但由于严重的遮挡以及相机、手和物体的复杂耦合运动,重建涉及的物体仍然具有挑战性。在本文中,我们介绍了ForeHOI,一种新颖的前馈模型,它能够在一分钟的推理时间内直接从单目手物交互视频中重建3D物体几何形状,消除了任何预处理步骤的需求。我们的关键见解是,在前馈框架中联合预测2D掩码修复和3D形状补全可以有效解决单目手持物体视频中的严重遮挡问题,从而实现超越基于优化方法的性能。2D与3D形状补全之间的信息交换提升了整体重建质量,使得该框架能够有效处理严重的手物遮挡。此外,为了支持我们模型的训练,我们贡献了第一个大规模、高保真度的手物交互合成数据集,并附有全面的注释。大量实验表明,ForeHOI在物体重建方面达到了最先进的性能,显著超越了之前的方法,速度提升约100倍。代码和数据可在以下网址获取:https://github.com/Tao-11-chen/ForeHOI。
cs.CV / 15 / 2602.06251

ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning

ASMa:用于骨架动作表示学习的非对称时空掩蔽
Anand, Aman, Eskandari, Amir, Rahsno, Elyas, Zulkernine, Farhana
Abstract
Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies designed to learn the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods, with an average improvement of 2.7-4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves a 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
Chinese Translation
自监督学习(SSL)在基于骨架的动作识别中取得了显著成功,通过利用数据增强来学习有意义的表示。然而,现有的SSL方法依赖于主要集中于掩蔽高运动帧和高自由度关节(如自由度为3或4的关节)的数据增强。这导致了偏倚和不完整的特征表示,难以在不同的运动模式中进行泛化。为了解决这个问题,我们提出了非对称时空掩蔽(ASMa)用于骨架动作表示学习,这是一种通过掩蔽学习人类动作固有的全谱时空动态的新方法。ASMa采用两种互补的掩蔽策略:一种选择性地掩蔽高自由度关节和低运动,另一种则掩蔽低自由度关节和高运动帧。这些掩蔽策略确保了更平衡和全面的骨架表示学习。此外,我们引入了一个可学习的特征对齐模块,以有效对齐来自两个掩蔽视图的表示。为了便于在资源受限的环境和低资源设备上部署,我们使用知识蒸馏将学习到的对齐表示压缩为轻量级模型。在NTU RGB+D 60、NTU RGB+D 120和PKU-MMD数据集上的大量实验表明,我们的方法在微调中平均提高了2.7-4.4%,在转移学习到噪声数据集时提高了最多5.9%,并且在与完全监督基线相比时表现出竞争力。我们的蒸馏模型实现了91.4%的参数减少,并在边缘设备上实现了3倍的推理速度,同时保持了竞争力的准确性,使其能够在资源受限的场景中进行实际部署。
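The two complementary masking strategies can be sketched as index selection over joint degrees and per-frame motion magnitudes; the thresholds below are illustrative assumptions, not values from the paper:

```python
def asymmetric_masks(joint_degrees, frame_motion, deg_thresh=3, motion_thresh=0.5):
    """Build the two complementary (joint, frame) mask sets described in
    the abstract: view A masks high-degree joints and low-motion frames,
    view B masks low-degree joints and high-motion frames."""
    view_a = {
        "joints": [j for j, d in enumerate(joint_degrees) if d >= deg_thresh],
        "frames": [t for t, m in enumerate(frame_motion) if m < motion_thresh],
    }
    view_b = {
        "joints": [j for j, d in enumerate(joint_degrees) if d < deg_thresh],
        "frames": [t for t, m in enumerate(frame_motion) if m >= motion_thresh],
    }
    return view_a, view_b
```

Together the two views cover every joint and every frame exactly once, which is what makes the resulting representation balanced rather than biased toward high-motion, high-degree regions.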
cs.CV / 16 / 2602.06282

An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes

可解释的视觉变换器作为基于指纹的Kabuki综合征和Wiedemann-Steiner综合征诊断辅助工具
Lionts, Marilyn, Tomasdottir, Arnhildur, Agustsson, Viktor I., Huo, Yuankai, Bjornsson, Hans T., Ellingsen, Lotta M.
Abstract
Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance across three binary classification tasks, on which the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.
Chinese Translation
Kabuki综合征(KS)和Wiedemann-Steiner综合征(WSS)是两种罕见但不同的发展障碍,具有重叠的临床特征,包括神经发育延迟、生长受限和持续的胎儿指尖垫。尽管基因检测仍然是诊断的金标准,但许多KS或WSS患者由于在获取基因检测和专业知识方面的障碍而未能确诊。尽管皮肤纹理异常已被确认为几种遗传综合征的标志,但在分子检测时代仍然是一个未被充分利用的诊断信号。本研究提出了一种基于视觉变换器的深度学习模型,利用指纹图像将KS和WSS患者与未受影响的对照组以及彼此区分开来。我们在三个二分类任务中评估模型性能。在这三个分类任务中,模型分别达到了0.80(对照组与KS)、0.73(对照组与WSS)和0.85(KS与WSS)的AUC分数,F1分数分别为0.71、0.72和0.83。除了分类外,我们还应用基于注意力的可视化技术,以识别对模型预测最重要的指纹区域,从而增强可解释性。这些发现共同表明存在特定于综合征的指纹特征,展示了基于指纹的人工智能(AI)工具作为一种非侵入性、可解释且可及的未来诊断辅助工具在早期诊断未确诊遗传综合征方面的可行性。
cs.CV / 17 / 2602.06285

MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training

MMEarth-Bench:通过多模态测试时训练实现全球模型适应
Gordon, Lucia, Belongie, Serge, Igel, Christian, Lang, Nico
Abstract
Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at lgordon99.github.io/mmearth-bench.
Chinese Translation
近期在地理空间机器学习领域的研究表明,使用自监督学习在地球观测数据上预训练的模型在有限训练数据的下游任务中表现良好。然而,现有的大多数地理空间基准数据集数据模态较少且全球表示能力较差,限制了对多模态预训练模型在全球范围内的评估能力。为填补这一空白,我们引入了MMEarth-Bench,这是一个包含五个新的多模态环境任务的集合,具有12种模态、全球分布的数据,以及内部和外部分布的测试划分。我们对多种预训练模型进行了基准测试,发现虽然(多模态)预训练往往能提高模型在有限数据环境下的鲁棒性,但地理泛化能力仍然较差。为了促进模型对新下游任务和地理领域的适应,我们提出了一种与模型无关的测试时训练方法,称为多模态重构的测试时训练(TTT-MMR),该方法在测试时利用所有可用模态作为辅助任务,无论预训练模型是否接受这些模态作为输入。我们的方法在随机和地理测试划分上均提高了模型性能,而地理批处理在TTT过程中实现了正则化与专业化之间的良好权衡。我们的数据集、代码和可视化工具可从项目页面 lgordon99.github.io/mmearth-bench 获取。
cs.CV / 18 / 2602.06288

Unsupervised MRI-US Multimodal Image Registration with Multilevel Correlation Pyramidal Optimization

基于多级相关金字塔优化的无监督MRI-US多模态图像配准
Wang, Jiazheng, Liu, Zeyu, Liu, Min, Chen, Xiang, Zhang, Hang
Abstract
Surgical navigation based on multimodal image registration has played a significant role in providing intraoperative guidance to surgeons by showing the relative position of the target area to critical anatomical structures during surgery. However, due to the differences between multimodal images and intraoperative image deformation caused by tissue displacement and removal during the surgery, effective registration of preoperative and intraoperative multimodal images faces significant challenges. To address the multimodal image registration challenges in Learn2Reg 2025, we design an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO). First, the features of each modality are extracted based on the modality independent neighborhood descriptor, and the multimodal images are mapped to the feature space. Second, a multilevel pyramidal fusion optimization mechanism is designed to achieve global optimization and local detail complementation of the displacement field through dense correlation analysis and weight-balanced coupled convex optimization for input features at different scales. Our method focuses on the ReMIND2Reg task in Learn2Reg 2025, where it achieved first place in both the validation and test phases. MCPO is also validated on the Resect dataset, achieving an average TRE of 1.798 mm. This demonstrates the broad applicability of our method in preoperative-to-intraoperative image registration. The code is available at https://github.com/wjiazheng/MCPO.
Chinese Translation
基于多模态图像配准的手术导航在为外科医生提供术中指导方面发挥了重要作用,通过显示目标区域与关键解剖结构之间的相对位置。然而,由于多模态图像之间的差异以及手术过程中由于组织位移和切除引起的术中图像变形,术前和术中多模态图像的有效配准面临重大挑战。为了解决Learn2Reg 2025中的多模态图像配准问题,设计了一种基于多级相关金字塔优化(MCPO)的无监督多模态医学图像配准方法。首先,基于模态独立邻域描述符提取每种模态的特征,并将多模态图像映射到特征空间。其次,设计了一种多级金字塔融合优化机制,通过对不同尺度输入特征进行密集相关分析和权重平衡的凸优化,实现位移场的全局优化和局部细节补充。我们的方法专注于Learn2Reg 2025中的ReMIND2Reg任务。根据结果,我们的方法在ReMIND2Reg的验证阶段和测试阶段中获得了第一名。MCPO在Resect数据集上的验证结果显示,平均目标重定位误差(TRE)为1.798毫米。这证明了我们的方法在术前到术中图像配准中的广泛适用性。代码可在https://github.com/wjiazheng/MCPO获取。
cs.CV / 19 / 2602.06300

Accelerating Vision Transformers on Brain Processing Unit

在脑处理单元上加速视觉变换器
Tang, Jinchi, Guo, Yan
Abstract
With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) have emerged as dedicated platforms for CNN acceleration, offering optimized INT8 computation capabilities for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, due to the architectural mismatch between CNN-optimized hardware and Vision Transformer computation characteristics (linear layers in Transformers operate on three-dimensional data, while BPU acceleration is designed for four-dimensional convolution operations), it is difficult or even impossible to leverage the BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration; experiments on classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8x inference speedup. Our fine-tuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.
Chinese Translation
随着深度学习技术的进步,专用神经处理硬件如脑处理单元(BPU)作为卷积神经网络(CNN)加速的专用平台应运而生,提供了优化的 INT8 计算能力以支持卷积操作。同时,视觉变换器(ViT)模型,如数据高效图像变换器(DeiT),已显示出优越的性能,并在计算机视觉任务中扮演着越来越重要的角色。然而,由于 CNN 优化硬件与视觉变换器计算特性之间的架构不匹配——即变换器中的线性层在三维数据上操作,而 BPU 加速则设计用于四维卷积操作——在部署视觉变换器时,难以甚至不可能利用 BPU 的优势。为了解决这一挑战,我们提出了一种新颖的方法,通过用精心设计的卷积算子替换线性层和层归一化操作来重构视觉变换器。这使得 DeiT 能够充分利用 BPU 的加速能力,同时允许原始权重参数在重构模型中继承,而无需重新训练或微调。根据我们所知,这是首次成功部署充分利用 BPU 的视觉变换器。分类数据集的实验结果证明了我们方法的有效性。具体而言,量化的 DeiT-Base 模型在 ImageNet 上达到了 80.4% 的准确率,相较于原始模型的 81.8%,同时实现了高达 3.8 倍的推理加速。我们在花卉分类数据集上微调的 DeiT 模型也表现出色,DeiT-Base 模型仅下降 0.5% 的准确率,进一步验证了我们方法的有效性。
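The core trick (replacing a Transformer linear layer with a 1x1 convolution so the same weights run on four-dimensional feature maps) can be verified with a toy pure-Python sketch; shapes and values here are illustrative, not the paper's code:

```python
def linear(tokens, weight, bias):
    """y = W x + b applied to each token; tokens is N x C_in,
    weight is C_out x C_in, bias is C_out."""
    return [[sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(weight, bias)] for tok in tokens]

def conv1x1(fmap, weight, bias):
    """1x1 convolution over a C_in x H x W feature map with the
    same weight matrix and bias reused as the conv kernel."""
    C_out, H, W = len(weight), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weight[o][c] * fmap[c][i][j] for c in range(len(fmap)))
              + bias[o]
              for j in range(W)] for i in range(H)] for o in range(C_out)]
```

Reshaping the N = H*W token sequence into an H x W map and applying `conv1x1` with the inherited linear weights produces token-for-token identical outputs, which is why no retraining or fine-tuning is needed.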
cs.CV / 20 / 2602.06328

Adaptive and Balanced Re-initialization for Long-timescale Continual Test-time Domain Adaptation

适应性和平衡的长时间尺度持续测试时域适应再初始化
Wang, Yanshuo, Tong, Jinguang, Lan, Jun, Wang, Weiqiang, Zhu, Huijia, Chen, Haoxing, Li, Xuesong, Hong, Jie
Abstract
Continual test-time domain adaptation (CTTA) aims to adjust models so that they can perform well over time across non-stationary environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: can the model adapt to continually changing environments over a long time? In this work, we explore facilitating better CTTA in the long run using a re-initialization (or reset) based method. First, we observe that long-term performance is associated with the trajectory pattern of label flips. Based on this observed correlation, we propose a simple yet effective policy, Adaptive-and-Balanced Re-initialization (ABR), towards preserving the model's long-term performance. In particular, ABR performs weight re-initialization at adaptive intervals, where each interval is determined by the change in label flips. The proposed method is validated on extensive CTTA benchmarks, achieving superior performance.
Chinese Translation
持续测试时域适应(CTTA)旨在调整模型,使其能够在非平稳环境中随着时间的推移表现良好。尽管之前的方法在优化适应过程方面做出了相当大的努力,但一个关键问题仍然存在:模型能否在长时间内适应不断变化的环境?在本研究中,我们探讨了利用基于再初始化(或重置)的方法来促进长期更好的CTTA。首先,我们观察到长期性能与标签翻转的轨迹模式相关。基于这一观察到的相关性,我们提出了一种简单而有效的策略,即适应性和平衡再初始化(ABR),旨在保持模型的长期性能。具体而言,ABR使用自适应间隔进行权重再初始化。自适应间隔是根据标签翻转的变化确定的。所提出的方法在广泛的CTTA基准上进行了验证,取得了优越的性能。
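An adaptive re-initialization interval driven by label flips might look like the following toy policy; the EMA smoothing, spike threshold, and base interval are all assumptions, since the abstract does not specify the exact rule:

```python
def adaptive_reset_points(flip_counts, base_interval=100, sensitivity=2.0):
    """Decide when to re-initialize weights from a stream of per-batch
    label-flip counts: reset at a fixed base interval, or early when
    flips surge past `sensitivity` times a running average."""
    resets, running, since_reset = [], None, 0
    for t, flips in enumerate(flip_counts):
        if running is None:
            running = float(flips)            # seed the running average
        since_reset += 1
        if since_reset >= base_interval or flips > sensitivity * running:
            resets.append(t)
            since_reset = 0
        running = 0.9 * running + 0.1 * flips  # exponential moving average
    return resets
```

A sudden burst of label flips (a likely sign of drifting predictions) shortens the interval, while stable periods leave the model to adapt undisturbed.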
cs.CV / 21 / 2602.06330

Halt the Hallucination: Decoupling Signal and Semantic OOD Detection Based on Cascaded Early Rejection

停止幻觉:基于级联早期拒绝的信号与语义的OOD检测解耦
Peng, Ningkang, Cheng, Chuanjie, Mao, Jingyang, Peng, Xiaoqian, Xing, Feng, Zhang, Bo, Tan, Chao, Zheng, Zhichao, Li, Peiheng, Gu, Yanhui
Abstract
Efficient and robust Out-of-Distribution (OOD) detection is paramount for safety-critical applications. However, existing methods still execute full-scale inference on low-level statistical noise. This computational mismatch not only incurs resource waste but also induces semantic hallucination, where deep networks forcefully interpret physical anomalies as high-confidence semantic features. To address this, we propose the Cascaded Early Rejection (CER) framework, which realizes hierarchical filtering for anomaly detection via a coarse-to-fine logic. CER comprises two core modules: 1) the Structural Energy Sieve (SES), which establishes a non-parametric barrier at the network entry using the Laplacian operator to efficiently intercept physical signal anomalies; and 2) the Semantically-aware Hyperspherical Energy (SHE) detector, which decouples feature magnitude from direction in intermediate layers to identify fine-grained semantic deviations. Experimental results demonstrate that CER not only reduces computational overhead by 32% but also achieves a significant performance leap on the CIFAR-100 benchmark: the average FPR95 drastically decreases from 33.58% to 22.84%, and AUROC improves to 93.97%. Crucially, in real-world scenarios simulating sensor failures, CER exhibits performance far exceeding state-of-the-art methods. As a universal plugin, CER can be seamlessly integrated into various SOTA models to provide performance gains.
Chinese Translation
高效且稳健的分布外(OOD)检测对于安全关键应用至关重要。然而,现有方法仍在低级统计噪声上执行全规模推理。这种计算不匹配不仅导致资源浪费,还引发语义幻觉,即深度网络强行将物理异常解释为高置信度的语义特征。为了解决这一问题,我们提出了级联早期拒绝(Cascaded Early Rejection, CER)框架,通过粗到细的逻辑实现异常检测的分层过滤。CER由两个核心模块组成:1)结构能量筛(Structural Energy Sieve, SES),该模块利用拉普拉斯算子在网络入口处建立非参数屏障,以高效拦截物理信号异常;2)语义感知超球能量(Semantically-aware Hyperspherical Energy, SHE)检测器,该模块在中间层中将特征的大小与方向解耦,以识别细粒度的语义偏差。实验结果表明,CER不仅将计算开销降低了32%,而且在CIFAR-100基准测试中实现了显著的性能跃升:平均FPR95从33.58%大幅下降至22.84%,AUROC提升至93.97%。关键是,在模拟传感器故障的真实场景中,CER的表现远超最先进的方法。作为一个通用插件,CER可以无缝集成到各种最先进的模型中,以提供性能提升。
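The Structural Energy Sieve's use of the Laplacian operator as a non-parametric entry barrier can be illustrated with a 4-neighbour Laplacian energy score over a grayscale grid; the rejection threshold is an assumed value, not the paper's:

```python
def laplacian_energy(img):
    """Mean absolute response of the 4-neighbour Laplacian over the
    interior of an H x W grayscale grid; smooth signals score near
    zero, noisy ones score high."""
    H, W = len(img), len(img[0])
    total, count = 0.0, 0
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            lap = (img[i - 1][j] + img[i + 1][j]
                   + img[i][j - 1] + img[i][j + 1] - 4 * img[i][j])
            total += abs(lap)
            count += 1
    return total / count

def passes_sieve(img, threshold=8.0):
    """Reject inputs whose Laplacian energy exceeds the threshold,
    before any network inference is run."""
    return laplacian_energy(img) <= threshold
```

Because the sieve is parameter-free and runs before the network, physically corrupted inputs (e.g. sensor noise) are rejected without paying for full-scale inference, which is the source of the reported compute savings.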
cs.CV / 22 / 2602.06333

Taming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation

在野外驯化SAM3:开放词汇分割的概念库
Pei, Gensheng, Jiang, Xiruo, Yao, Yazhou, Shu, Xiangbo, Shen, Fumin, Jeon, Byeungwoo
Abstract
The recent introduction of SAM3 has revolutionized Open-Vocabulary Segmentation (OVS) through promptable concept segmentation, which grounds pixel predictions in flexible concept prompts. However, this reliance on pre-defined concepts makes the model vulnerable: when visual distributions shift (data drift) or conditional label distributions evolve (concept drift) in the target domain, the alignment between visual evidence and prompts breaks down. In this work, we present ConceptBank, a parameter-free calibration framework to restore this alignment on the fly. Instead of adhering to static prompts, we construct a dataset-specific concept bank from the target statistics. Our approach (i) anchors target-domain evidence via class-wise visual prototypes, (ii) mines representative supports to suppress outliers under data drift, and (iii) fuses candidate concepts to rectify concept drift. We demonstrate that ConceptBank effectively adapts SAM3 to distribution drifts, including challenging natural-scene and remote-sensing scenarios, establishing a new baseline for robustness and efficiency in OVS. Code and model are available at https://github.com/pgsmall/ConceptBank.
Chinese Translation
最近推出的SAM3通过可提示的概念分割(promptable concept segmentation)彻底改变了开放词汇分割(Open-Vocabulary Segmentation, OVS),该方法将像素预测与灵活的概念提示相结合。然而,这种对预定义概念的依赖使得模型变得脆弱:当视觉分布发生变化(数据漂移,data drift)或条件标签分布演变(概念漂移,concept drift)时,视觉证据与提示之间的对齐关系会被打破。在本研究中,我们提出了ConceptBank,一个无参数的校准框架,以动态恢复这种对齐关系。我们并不依赖静态提示,而是根据目标统计构建一个数据集特定的概念库。我们的方法(i)通过类别视觉原型锚定目标领域证据,(ii)挖掘代表性支持以抑制数据漂移下的异常值,以及(iii)融合候选概念以纠正概念漂移。我们证明了ConceptBank能够有效地使SAM3适应分布漂移,包括具有挑战性的自然场景和遥感场景,为OVS的鲁棒性和效率建立了新的基准。代码和模型可在https://github.com/pgsmall/ConceptBank获取。
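The class-wise visual prototypes that anchor target-domain evidence reduce, in toy form, to per-class feature means; a hedged sketch of that building block (not the ConceptBank implementation):

```python
def class_prototypes(features, labels):
    """Compute the mean feature vector per class label: the 'class-wise
    visual prototypes' of the abstract, in their simplest form."""
    sums, counts = {}, {}
    for feat, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}
```

Prototypes built from the target dataset's own statistics replace the static prompt embeddings, so a shift in the visual distribution moves the anchors with it instead of breaking the alignment.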
cs.CV / 23 / 2602.06335

SPDA-SAM: A Self-prompted Depth-Aware Segment Anything Model for Instance Segmentation

SPDA-SAM:一种自我提示的深度感知任意分割模型用于实例分割
Shang, Yihan, Wang, Wei, Huang, Chao, Dong, Xinghui
Abstract
Recently, the Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance is severely dependent on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self-prompted Depth-Aware SAM (SPDA-SAM) for instance segmentation. Specifically, we design a Semantic-Spatial Self-prompt Module (SSSPM) which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse-to-Fine RGB-D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image and the depth map estimated from it are fused. In particular, the structural information in the depth map is used to provide coarse-grained guidance to feature fusion, while local variations in depth are encoded in order to fuse fine-grained feature representations. To our knowledge, SAM has not previously been explored in such a self-prompted and depth-aware manner. Experimental results demonstrate that our SPDA-SAM outperforms its state-of-the-art counterparts across twelve different datasets. These promising results are likely due to the guidance of the self-prompts and the compensation for the spatial information loss by the coarse-to-fine RGB-D fusion operation.
Chinese Translation
最近,任意分割模型(Segment Anything Model, SAM)在各种实例分割任务中表现出了强大的泛化能力。然而,其性能严重依赖于手动提示的质量。此外,实例分割方法通常使用的RGB图像本质上缺乏深度信息。因此,这些方法感知空间结构和勾勒物体边界的能力受到限制。为了解决这些挑战,我们提出了一种自我提示的深度感知SAM(SPDA-SAM)用于实例分割。具体而言,我们设计了一个语义-空间自我提示模块(Semantic-Spatial Self-prompt Module, SSSPM),该模块分别从SAM的图像编码器和掩码解码器中提取语义和空间提示。此外,我们引入了一个粗到细的RGB-D融合模块(Coarse-to-Fine RGB-D Fusion Module, C2FFM),在该模块中,从单目RGB图像及其估计的深度图中提取的特征被融合。特别地,深度图中的结构信息用于为特征融合提供粗粒度指导,而深度的局部变化则被编码以融合细粒度特征表示。据我们所知,SAM尚未以这种自我提示和深度感知的方式进行探索。实验结果表明,我们的SPDA-SAM在十二个不同的数据集上超越了其最先进的对手。这些令人鼓舞的结果应归因于自我提示的指导以及粗到细RGB-D融合操作对空间信息损失的补偿。
cs.CV / 24 / 2602.06343

Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering

基于不确定性感知的4D高斯点云技术用于单目遮挡人类渲染
Wang, Weiquan, Shao, Feifei, Li, Lin, Wang, Zhen, Xiao, Jun, Chen, Long
Abstract
High-fidelity rendering of dynamic humans from monocular videos typically degrades catastrophically under occlusions. Existing solutions incorporate external priors: either hallucinating missing content via generative models, which induces severe temporal flickering, or imposing rigid geometric heuristics that fail to capture diverse appearances. To this end, we reformulate the task as a Maximum A Posteriori estimation problem under heteroscedastic observation noise. In this paper, we propose U-4DGS, a framework integrating a Probabilistic Deformation Network and a Double Rasterization pipeline. This architecture renders pixel-aligned uncertainty maps that act as an adaptive gradient modulator, automatically attenuating artifacts from unreliable observations. Furthermore, to prevent geometric drift in regions lacking reliable visual cues, we enforce Confidence-Aware Regularizations, which leverage the learned uncertainty to selectively propagate spatial-temporal validity. Extensive experiments on ZJU-MoCap and OcMotion demonstrate that U-4DGS achieves SOTA rendering fidelity and robustness.
Chinese Translation
从单目视频中高保真渲染动态人类在遮挡情况下通常会严重降级。现有解决方案采用外部先验——要么通过生成模型幻觉缺失内容,这会导致严重的时间闪烁,要么施加刚性几何启发式,无法捕捉多样的外观。为此,我们将任务重新表述为在异方差观测噪声下的最大后验估计问题。本文提出了U-4DGS,一个集成了概率形变网络和双重光栅化管道的框架。该架构渲染像素对齐的不确定性图,作为自适应梯度调制器,自动减弱来自不可靠观测的伪影。此外,为了防止在缺乏可靠视觉线索的区域中出现几何漂移,我们强制实施了基于置信度的正则化,利用学习到的不确定性选择性地传播时空有效性。在ZJU-MoCap和OcMotion上的大量实验表明,U-4DGS实现了最先进的渲染保真度和鲁棒性。
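The MAP formulation under heteroscedastic observation noise corresponds to the standard uncertainty-weighted Gaussian negative log-likelihood; a generic sketch of that objective (the paper's exact loss may differ):

```python
import math

def heteroscedastic_loss(pred, target, log_sigma):
    """Per-pixel Gaussian negative log-likelihood with a predicted
    noise scale: |r|^2 / (2 sigma^2) + log sigma. High predicted
    uncertainty down-weights the residual, while the log-sigma term
    stops the model from declaring everything uncertain."""
    total = 0.0
    for p, t, ls in zip(pred, target, log_sigma):
        sigma2 = math.exp(2.0 * ls)
        total += (p - t) ** 2 / (2.0 * sigma2) + ls
    return total / len(pred)
```

On an occluded pixel with a large residual, raising the predicted uncertainty lowers the loss contribution, which is exactly the "adaptive gradient modulator" behaviour: gradients from unreliable observations are attenuated automatically.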
cs.CV / 25 / 2602.06346

FlowConsist: Make Your Flow Consistent with Real Trajectory

FlowConsist:使您的流与真实轨迹保持一致
Zhang, Tianyi, Liu, Chengcheng, Chen, Jinwei, Guo, Chun-Le, Li, Chongyi, Cheng, Ming-Ming, Li, Bo, Jiang, Peng-Tao
Abstract
Fast flow models accelerate the iterative sampling process by learning to directly predict ODE path integrals, enabling one-step or few-step generation. However, we argue that current fast-flow training paradigms suffer from two fundamental issues. First, conditional velocities constructed from randomly paired noise-data samples introduce systematic trajectory drift, preventing models from following a consistent ODE path. Second, the model's approximation errors accumulate over time steps, leading to severe deviations across long time intervals. To address these issues, we propose FlowConsist, a training framework designed to enforce trajectory consistency in fast flows. We propose a principled alternative that replaces conditional velocities with the marginal velocities predicted by the model itself, aligning optimization with the true trajectory. To further address error accumulation over time steps, we introduce a trajectory rectification strategy that aligns the marginal distributions of generated and real samples at every time step along the trajectory. Our method establishes a new state-of-the-art on ImageNet 256$\times$256, achieving an FID of 1.52 with only 1 sampling step.
Chinese Translation
快速流模型通过学习直接预测常微分方程(ODE)路径积分,加速了迭代采样过程,实现了一步或少步生成。然而,我们认为当前的快速流训练范式存在两个根本性问题。首先,从随机配对的噪声-数据样本构建的条件速度引入了系统性的轨迹漂移,阻碍了模型沿着一致的ODE路径进行跟踪。其次,模型的近似误差在时间步长上累积,导致在长时间间隔内出现严重偏差。为了解决这些问题,我们提出了FlowConsist,一个旨在强制快速流中的轨迹一致性的训练框架。我们提出了一种原则性的替代方案,用模型自身预测的边际速度替代条件速度,从而使优化与真实轨迹对齐。为了进一步解决时间步长上的误差累积问题,我们引入了一种轨迹修正策略,在轨迹的每个时间步长上对生成样本和真实样本的边际分布进行对齐。我们的方法在ImageNet 256×256上建立了新的最先进水平,仅用1次采样步骤实现了1.52的FID。
cs.CV / 26 / 2602.06355

Di3PO -- Diptych Diffusion DPO for Targeted Improvements in Image

Di3PO -- 双联体扩散 DPO 在图像目标改进中的应用
Reddy, Sanjana, Malhi, Ishaan, Ma, Sally, Dutta, Praneet
Abstract
Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that either lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce "Di3PO", a novel method for constructing positive and negative pairs that isolates specific regions targeted for improvement during preference tuning, while keeping the surrounding context in the image stable. We demonstrate the efficacy of our approach by applying it to the challenging task of text rendering in diffusion models, showcasing improvements over baseline methods of SFT and DPO.
Chinese Translation
现有的文本到图像(T2I)扩散模型的偏好调优方法通常依赖于计算成本高昂的生成步骤来创建正负图像对。这些方法常常产生缺乏有意义差异的训练对,或者在采样和过滤上成本高昂,亦或在无关像素区域表现出显著的方差,从而降低训练效率。为了解决这些局限性,我们提出了“Di3PO”,一种新颖的方法,用于构建正负图像对,专注于在偏好调优过程中针对特定区域的改进,同时保持图像周围上下文的稳定性。我们通过将其应用于扩散模型中具有挑战性的文本渲染任务,展示了我们方法的有效性,并在 SFT 和 DPO 的基线方法上取得了改进。
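The pair-construction idea (a positive and a negative image that differ only inside the targeted region, with the surrounding context pixel-identical) can be sketched on toy pixel grids; the function name and region format are illustrative assumptions:

```python
def make_diptych_pair(image, region, improved_patch):
    """Build a (negative, positive) preference pair that differ only
    inside `region` = (top, left, h, w): the positive copy receives the
    improved patch, every other pixel is left untouched."""
    top, left, h, w = region
    positive = [row[:] for row in image]     # copy rows, keep context
    for i in range(h):
        for j in range(w):
            positive[top + i][left + j] = improved_patch[i][j]
    return image, positive
```

Because the two images agree everywhere outside the region, the preference signal is concentrated on the targeted pixels instead of being diluted by irrelevant variance.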
cs.CV / 27 / 2602.06363

Robust Pedestrian Detection with Uncertain Modality

具有不确定模态的鲁棒行人检测
Bie, Qian, Wang, Xiao, Yang, Bin, Yu, Zhixi, Chen, Jun, Xu, Xin
Abstract
Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24h-surveillance systems. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. Near-infrared (NIR), by contrast, captures texture under low-light conditions, effectively alleviating the performance issues of RGB and the detail loss of TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB-NIR-TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenge existing CMPD methods: they fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement to ensure the reliability of information within the modality. Furthermore, we design a Modality-Aware Interaction (MAI) module to adaptively activate or deactivate its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities.
Chinese Translation
现有的跨模态行人检测(CMPD)利用RGB和热红外(TIR)模态的互补信息,在24小时监控系统中检测行人。RGB在白天捕捉丰富的行人细节,而TIR在夜间表现优越。然而,TIR主要关注人的轮廓,忽略了检测所需的关键纹理细节。近红外(NIR)在低光照条件下捕捉纹理,有效缓解了RGB的性能问题和TIR中的细节损失,从而减少漏检。为此,我们构建了一个新的三元组RGB-NIR-TIR(TRNT)数据集,包含8,281个像素对齐的图像三元组,为算法研究奠定了全面的基础。然而,由于现实场景的可变性,成像设备可能并不总是同时捕捉到所有三种模态。这导致输入数据具有不可预测的模态类型组合,挑战了现有的CMPD方法,这些方法未能在任意输入组合下提取鲁棒的行人信息,导致显著的性能下降。为了解决这些挑战,我们提出了自适应不确定性感知网络(AUNet),以准确区分模态可用性,并充分利用不确定输入下的可用信息。具体而言,我们引入了统一模态验证细化(UMVR),其中包括一个不确定性感知路由器来验证模态可用性,以及一个语义细化模块,以确保模态内信息的可靠性。此外,我们设计了一个模态感知交互(MAI)模块,根据UMVR输出自适应地激活或停用其内部交互机制,从而有效融合可用模态的互补信息。
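An uncertainty-aware router that gates fusion on modality availability can be sketched as follows; the activation-energy test and threshold are toy stand-ins for the abstract's UMVR module, not its actual mechanism:

```python
def route_modalities(features, energy_thresh=0.1):
    """Gate fusion on which modalities are actually present: a modality
    whose mean absolute activation falls below `energy_thresh` is
    treated as missing and excluded from the mean-pooled fusion."""
    valid = {name: feat for name, feat in features.items()
             if sum(abs(x) for x in feat) / len(feat) >= energy_thresh}
    if not valid:
        return None, []
    dim = len(next(iter(valid.values())))
    fused = [sum(feat[i] for feat in valid.values()) / len(valid)
             for i in range(dim)]
    return fused, sorted(valid)
```

Gating of this kind is what allows a detector to keep working when, say, the NIR channel drops out at runtime: the interaction paths tied to the missing modality are simply deactivated rather than fed near-zero inputs.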
cs.CV / 28 / 2602.06369

Revisiting Salient Object Detection from an Observer-Centric Perspective

从观察者中心的视角重新审视显著性物体检测
Zhang, Fuxi, Wang, Yifan, Zhao, Hengrun, Sun, Zhuohan, Xia, Changxing, Wang, Lijun, Lu, Huchuan, Shao, Yangrui, Yang, Chen, Teng, Long
Abstract
Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single ground-truth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset, named OC-SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like "Perceive-Reflect-Adjust" process. Extensive experiments on our proposed OC-SODBench validate the effectiveness of our contributions. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly "salient." Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD
Chinese Translation
显著性物体检测本质上是一个主观问题,因为具有不同先验知识的观察者可能会将不同的物体视为显著。然而,现有的方法主要将其表述为一个客观预测任务,为每幅图像提供单一的真实分割图,这使得该问题变得欠定且根本上不适定。为了解决这个问题,我们提出了观察者中心显著性物体检测(Observer-Centric Salient Object Detection, OC-SOD),在该方法中,显著区域的预测不仅考虑视觉线索,还考虑观察者特定的因素,如他们的偏好或意图。因此,这种表述捕捉了人类感知的内在模糊性和多样性,使得显著性预测能够个性化并具有上下文意识。通过利用多模态大型语言模型,我们开发了一个高效的数据标注流程,并构建了第一个OC-SOD数据集OC-SODBench,该数据集包含33,000幅训练、验证和测试图像,以及152,000个文本提示和物体对。基于这个新数据集,我们进一步设计了OC-SODAgent,一个通过类人“感知-反思-调整”过程执行OC-SOD的基线代理。在我们提出的OC-SODBench上进行的广泛实验验证了我们贡献的有效性。通过这种观察者中心的视角,我们旨在弥合人类感知与计算建模之间的差距,提供对什么使物体真正“显著”的更现实和灵活的理解。代码和数据集可在以下网址公开获取:https://github.com/Dustzx/OC_SOD
cs.CV / 29 / 2602.06391

POINTS-GUI-G: GUI-Grounding Journey

POINTS-GUI-G:GUI Grounding 之旅
Zhao, Zhongyin, Liu, Yuan, Liu, Yikun, Wang, Haicheng, Tian, Le, Zhou, Xiao, You, Yangxiu, Yu, Zilin, Yu, Yang, Zhou, Jie
Abstract
The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
Chinese Translation
视觉-语言模型的快速发展催生了图形用户界面(GUI)代理的出现,这些代理在自动化复杂任务方面具有巨大潜力,从在线购物到航班预订,从而减轻了重复数字工作流程的负担。作为一种基础能力,GUI grounding(界面元素定位)通常被视为端到端任务执行的前提条件。它使模型能够精确定位界面元素,如文本和图标,以执行点击和输入等准确操作。与之前对已经具备强大空间意识的模型(例如 Qwen3-VL)进行微调的工作不同,我们旨在从仅具备微弱 grounding 能力的基础模型(如 POINTS-1.5)入手,掌握完整的技术流程。我们引入了 POINTS-GUI-G-8B,该模型在 ScreenSpot-Pro 上得分 59.9,在 OSWorld-G 上得分 66.0,在 ScreenSpot-v2 上得分 95.7,在 UI-Vision 上得分 49.9,达到了最先进的性能。我们模型的成功得益于三个关键因素:(1)精细化数据工程,涉及多样化开源数据集格式的统一,以及增强、过滤和难度分级的复杂策略;(2)改进的训练策略,包括对视觉编码器的持续微调,以提高感知准确性,并保持训练与推理之间的分辨率一致性;(3)具有可验证奖励的强化学习(RL)。虽然 RL 传统上用于增强推理,但我们证明它显著提高了在感知密集型 GUI grounding 任务中的精度。此外,GUI grounding 为 RL 提供了天然优势,因为奖励易于验证且高度准确。
cs.CV / 30 / 2602.06400

TFusionOcc: Student's t-Distribution Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction

TFusionOcc:用于3D占用预测的基于学生t分布的以对象为中心的多传感器融合框架
Ming, Zhenxing, Berrio, Julie Stephany, Shan, Mao, Worrall, Stewart
Abstract
3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive fine-grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision-making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, the intermediate representations used by existing methods for 3D semantic occupancy prediction rely heavily on 3D voxel volumes or a set of 3D Gaussians, hindering the model's ability to efficiently and effectively capture fine-grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object-centric multi-sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi-stage multi-sensor fusion, Student's t-distribution, and the T-Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieved state-of-the-art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes-C dataset to demonstrate the robustness of the proposed method in different camera and lidar corruption scenarios. The code will be available at: https://github.com/DanielMing123/TFusionOcc
Chinese Translation
3D语义占用预测使自主车辆(AV)能够通过车载传感器感知其周围环境的细粒度几何和语义结构,这对于安全决策和导航至关重要。近期的3D语义占用预测模型成功解决了描述具有多样形状和类别的真实世界物体的挑战。然而,现有方法在3D语义占用预测中使用的中间表示过于依赖3D体素体积或一组3D高斯分布,限制了模型在3D驾驶环境中高效、有效捕捉细粒度几何细节的能力。本文提出了TFusionOcc,一种新颖的以对象为中心的多传感器融合框架,用于预测3D语义占用。通过利用多阶段多传感器融合、学生t分布和T-混合模型(T-Mixture model, TMM),以及更具几何灵活性的原语,如可变形超二次曲面(带逆变形的超二次曲面),所提出的方法在nuScenes基准测试中达到了最先进的(SOTA)性能。此外,还在nuScenes-C数据集上进行了广泛实验,以证明所提方法在不同相机和激光雷达数据损坏场景下的鲁棒性。代码将发布于:https://github.com/DanielMing123/TFusionOcc
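The superquadric primitive used above has a closed-form inside-outside function. The sketch below shows only the standard textbook formulation (scales a1..a3, shape exponents e1, e2); the paper's deformable variant additionally applies an inverse warp, which is omitted here.

```python
def superquadric_f(x, y, z, a=(1.0, 1.0, 1.0), e=(1.0, 1.0)):
    """Standard superquadric implicit function: < 1 inside, == 1 on the
    surface, > 1 outside. `a` are axis scales, `e` the shape exponents."""
    a1, a2, a3 = a
    e1, e2 = e
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

# With e1 = e2 = 1 the primitive reduces to an ellipsoid: the origin is
# inside, (1, 0, 0) lies exactly on the unit surface, (2, 0, 0) is outside.
print(superquadric_f(0.0, 0.0, 0.0))  # 0.0
print(superquadric_f(1.0, 0.0, 0.0))  # 1.0
print(superquadric_f(2.0, 0.0, 0.0))  # 4.0
```

Varying e1 and e2 morphs the same primitive between box-like, ellipsoidal, and pinched shapes, which is what makes it a flexible object-centric occupancy primitive.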
cs.CV / 31 / 2602.06402

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

MeDocVL:用于医学文档理解和解析的视觉语言模型
Wang, Wenjie, Wu, Wei, Liu, Ying, Zhao, Yuan, Lv, Xiaole, Diao, Liang, Fan, Zengjian, Xie, Wenfeng, Lin, Ziling, Shi, De, Huang, Lin, Xu, Kaihe, Li, Hong
Abstract
Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
Chinese Translation
医学文档的光学字符识别(OCR)因复杂的布局、特定领域的术语和噪声注释而具有挑战性,同时要求严格的字段级精确匹配。现有的OCR系统和通用视觉语言模型往往无法可靠地解析此类文档。我们提出了MeDocVL,这是一种用于查询驱动的医学文档解析的后训练视觉语言模型。我们的框架结合了基于训练的标签细化,以从噪声注释中构建高质量的监督,以及一种噪声感知的混合后训练策略,该策略整合了强化学习和监督微调,以实现稳健和精确的提取。在医学发票基准测试中的实验表明,MeDocVL在噪声监督下始终优于传统的OCR系统和强大的视觉语言模型基线,达到了最先进的性能。
cs.CV / 32 / 2602.06405

A neuromorphic model of the insect visual system for natural image processing

一种用于自然图像处理的昆虫视觉系统神经形态模型
Hines, Adam D., Nordström, Karin, Barron, Andrew B.
Abstract
Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.
Chinese Translation
昆虫视觉支持复杂行为,包括联想学习、导航和物体检测,并长期以来激励着计算模型以理解生物视觉处理。然而,许多当代模型优先考虑任务性能,而忽视了生物基础的处理路径。在此,我们介绍了一种生物启发的视觉模型,该模型捕捉了昆虫视觉系统的原则,将密集的视觉输入转化为稀疏的、具有区分性的编码。该模型采用完全自监督的对比目标进行训练,使得在没有标记数据的情况下进行表示学习,并支持在不同任务间的重用,而无需依赖于特定领域的分类器。我们在花卉识别任务和自然图像基准上评估了生成的表示。该模型始终产生可靠的稀疏编码,以区分视觉上相似的输入。为了支持不同的建模和部署用途,我们将该模型实现为人工神经网络和脉冲神经网络。在模拟定位设置中,我们的方法优于简单的图像下采样比较基线,突显了整合神经形态视觉处理路径的功能优势。总体而言,这些结果通过提供一种可在多种任务中进行稀疏计算的通用生物启发视觉模型,推动了昆虫计算建模的发展。
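As a toy illustration of the sparse codes such a model emits, a k-winners-take-all step keeps only the strongest responses and zeroes the rest. This is a generic sparsification sketch under our own naming, not the paper's actual network.

```python
def k_winners_take_all(activations, k=2):
    """Keep the k largest activations; zero everything else.

    (Ties at the threshold may keep more than k entries.)
    """
    if k >= len(activations):
        return list(activations)
    threshold = sorted(activations, reverse=True)[k - 1]
    return [a if a >= threshold else 0.0 for a in activations]

dense = [0.1, 0.9, 0.3, 0.7, 0.2]
print(k_winners_take_all(dense, k=2))  # [0.0, 0.9, 0.0, 0.7, 0.0]
```

The resulting codes are sparse (mostly zero) yet still separate visually similar inputs by which units remain active, matching the discriminative-sparse-code goal described above.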
cs.CV / 33 / 2602.06406

Point Virtual Transformer

点虚拟变换器
Sood, Veerain, Bnalin, Pandey, Gaurav
Abstract
LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
Chinese Translation
基于LiDAR的3D物体检测器在检测远处物体时常常面临困难,因为在长距离下点云的稀疏性限制了可靠几何线索的可用性。为了解决这个问题,之前的方法通过从RGB图像中派生的深度补全虚拟点来增强LiDAR数据;然而,直接将所有虚拟点纳入会导致计算成本增加,并在有效融合真实与虚拟信息方面引入挑战。我们提出了点虚拟变换器(Point Virtual Transformer,PointViT),这是一种基于变换器的3D物体检测框架,能够共同推理原始LiDAR点和选择性采样的虚拟点。该框架考察了多种融合策略,从早期的点级融合到基于BEV的门控融合,并分析了它们在准确性和效率方面的权衡。融合后的点云经过体素化处理,并使用稀疏卷积编码形成BEV表示,从中初始化并通过基于变换器的上下文聚合模块细化一组高置信度的物体查询。在KITTI基准测试上的实验报告显示,Car类别的3D AP为91.16%,BEV AP为95.94%,在KITTI 2D检测基准上的AP为99.36%。
cs.CV / 34 / 2602.06419

Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors

通过几何查询语义先验学习三维表面上的人类视觉注意力
Pahari, Soham, Kumain, Sandeep C.
Abstract
Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
Chinese Translation
人类对三维物体的视觉注意力源于自下而上的几何处理与自上而下的语义识别之间的相互作用。现有的三维显著性方法依赖于手工设计的几何特征或缺乏语义意识的学习方法,无法解释为什么人类会关注在语义上有意义但在几何上并不显著的区域。我们提出了SemGeo-AttentionNet,这是一种双流架构,通过不对称的跨模态融合明确地形式化了这种二分法,利用基于扩散的几何条件多视图渲染和点云变换器的语义先验进行几何处理。跨注意力机制确保几何特征查询语义内容,使自下而上的显著性引导自上而下的检索。我们通过强化学习将框架扩展到时间扫描路径生成,首次提出尊重三维网格拓扑的抑制返回动态的公式。在SAL3D、NUS3D和3DVA数据集上的评估显示出显著的改进,验证了以认知为动机的架构如何有效地建模人类在三维表面上的视觉注意力。
cs.CV / 35 / 2602.06422

Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

通过建模基于流的GRPO中的逐步与长期采样效应以缓解稀疏奖励
Tong, Yunze, Liu, Mushui, Zhao, Canyu, He, Wanggui, Zhang, Shiyi, Zhang, Hongwei, Zhang, Peng, Liu, Jinlong, Huang, Ju, Wang, Jiamang, Jiang, Hao, Huang, Pipei
Abstract
Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points, i.e., steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend, and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
Chinese Translation
在流匹配模型上部署GRPO已被证明对文本到图像生成有效。然而,现有范式通常将基于结果的奖励传播到所有先前的去噪步骤,而没有区分每个步骤的局部效应。此外,当前的组内排名主要比较匹配时间步的轨迹,忽略了轨迹内的依赖关系,其中某些早期的去噪操作可以通过延迟的隐式交互影响后续状态。我们提出了TurningPoint-GRPO(TP-GRPO),这是一个GRPO框架,旨在缓解逐步奖励稀疏性,并明确建模去噪轨迹中的长期效应。TP-GRPO有两个关键创新:(i)它用基于步骤的增量奖励替代基于结果的奖励,提供一种密集的、关注步骤的学习信号,更好地隔离每个去噪操作的“纯”效应;(ii)它识别转折点——翻转局部奖励趋势的步骤,并使后续奖励演变与整体轨迹趋势一致——并为这些操作分配聚合的长期奖励,以捕捉其延迟影响。转折点仅通过增量奖励的符号变化进行检测,使TP-GRPO高效且无需超参数。大量实验还表明,TP-GRPO更有效地利用奖励信号,并持续改善生成效果。演示代码可在 https://github.com/YunzeTong/TurningPoint-GRPO 获取。
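The turning-point rule lends itself to a compact sketch: incremental rewards are per-step reward differences, and a turning point is a step where the increment changes sign. Function names and the exact indexing convention here are illustrative assumptions, not the paper's implementation.

```python
def incremental_rewards(step_rewards):
    """Per-step reward deltas r_t - r_{t-1} along one denoising trajectory."""
    return [b - a for a, b in zip(step_rewards, step_rewards[1:])]

def turning_points(step_rewards):
    """Indices into the incremental sequence where the local trend flips sign."""
    inc = incremental_rewards(step_rewards)
    flips = []
    for t in range(1, len(inc)):
        if inc[t - 1] * inc[t] < 0:  # strict sign change => turning point
            flips.append(t)
    return flips

rewards = [0.2, 0.5, 0.4, 0.3, 0.6, 0.9]  # toy per-step rewards
print(turning_points(rewards))  # [1, 3]: up->down after step 1, down->up after step 3
```

Because detection reduces to a sign test on already-computed increments, it adds no tunable hyperparameters, matching the abstract's claim.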
cs.CV / 36 / 2602.06425

POPL-KF: A Pose-Only Geometric Representation-Based Kalman Filter for Point-Line-Based Visual-Inertial Odometry

POPL-KF:一种基于仅姿态几何表示的点线视觉惯性里程计的卡尔曼滤波器
Wang, Aiping, Yang, Zhaolong, Chen, Shuwen, Zhang, Hai
Abstract
Mainstream Visual-inertial odometry (VIO) systems rely on point features for motion estimation and localization. However, their performance degrades in challenging scenarios. Moreover, the localization accuracy of multi-state constraint Kalman filter (MSCKF)-based VIO systems suffers from linearization errors associated with feature 3D coordinates and delayed measurement updates. To improve the performance of VIO in challenging scenes, we first propose a pose-only geometric representation for line features. Building on this, we develop POPL-KF, a Kalman filter-based VIO system that employs a pose-only geometric representation for both point and line features. POPL-KF mitigates linearization errors by explicitly eliminating both point and line feature coordinates from the measurement equations, while enabling immediate update of visual measurements. We also design a unified base-frames selection algorithm for both point and line features to ensure optimal constraints on camera poses within the pose-only measurement model. To further improve line feature quality, a line feature filter based on image grid segmentation and bidirectional optical flow consistency is proposed. Our system is evaluated on public datasets and real-world experiments, demonstrating that POPL-KF outperforms the state-of-the-art (SOTA) filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS), while maintaining real-time performance.
Chinese Translation
主流的视觉惯性里程计(VIO)系统依赖于点特征进行运动估计和定位。然而,在复杂场景中,它们的性能会下降。此外,基于多状态约束卡尔曼滤波器(MSCKF)的VIO系统的定位精度受到与特征三维坐标相关的线性化误差和测量更新延迟的影响。为了提高VIO在复杂场景中的性能,我们首先提出了一种仅基于姿态的线特征几何表示。在此基础上,我们开发了POPL-KF,这是一种基于卡尔曼滤波器的VIO系统,采用仅基于姿态的几何表示来处理点和线特征。POPL-KF通过明确消除测量方程中的点和线特征坐标,减轻了线性化误差,同时实现了视觉测量的即时更新。我们还设计了一种统一的基准帧选择算法,适用于点和线特征,以确保在仅基于姿态的测量模型中对相机姿态施加最佳约束。为了进一步提高线特征的质量,提出了一种基于图像网格分割和双向光流一致性的线特征滤波器。我们的系统在公共数据集和真实世界实验中进行了评估,结果表明,POPL-KF在保持实时性能的同时,优于最先进的基于滤波器的方法(OpenVINS,PO-KF)和基于优化的方法(PL-VINS,EPLF-VINS)。
cs.CV / 37 / 2602.06427

Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters

弥合室内外差距:面向最后几米的基于视觉指令引导具身导航
Zhao, Yuxiang, Yang, Yirong, Zhu, Yanqing, Shen, Yanfen, Wang, Chiyu, Gu, Zhining, Shi, Pei, Guo, Wei, Xu, Mu
Abstract
Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
Chinese Translation
具身导航在现实世界应用中具有重要潜力,例如最后一公里的配送。然而,现有的大多数方法仅限于室内或室外环境,并且高度依赖于强假设,例如对精确坐标系统的访问。虽然当前的室外方法可以利用粗略定位引导代理到达目标附近,但它们未能实现通过特定建筑入口的精细进入,这在需要无缝室外到室内过渡的实际部署场景中严重限制了其效用。为了弥合这一差距,我们提出了一项新任务:由外到内、无先验的指令驱动具身导航。这一表述明确消除了对准确外部先验的依赖,要求代理仅根据指令引导的自我中心视觉观察进行导航。为了解决这一任务,我们提出了一种以视觉为中心的具身导航框架,利用基于图像的提示来驱动决策。此外,我们还推出了该任务的第一个开源数据集,包含一个将轨迹条件视频合成集成到数据生成过程中的管道。通过大量实验,我们证明了所提方法在成功率和路径效率等关键指标上始终优于最先进的基线。
cs.CV / 38 / 2602.06442

ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

ChatUMM:用于对话交错生成的鲁棒上下文跟踪
Dai, Wenxun, Zhao, Zhiyuan, Zhong, Yule, Cheng, Yiji, Zhang, Jianwei, Wang, Linqing, Zhang, Shiyi, Lin, Yunlong, He, Runze, Song, Fellix, Zhuang, Wayne, Liu, Yong, Zhang, Haoji, Tang, Yansong, Lu, Qinglin, Wang, Chunyu
Abstract
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
Chinese Translation
统一多模态模型(UMMs)取得了显著进展,但仍然受到单轮交互范式的限制,实际上更像是独立请求的求解器,而非持续对话中的助手。为了解决这一问题,我们提出了ChatUMM。作为一种对话统一模型,它在鲁棒上下文跟踪方面表现出色,以支持交错的多模态生成。ChatUMM的能力源于两个关键创新:一种交错多轮训练策略,将序列化的文本-图像流建模为连续的对话流,以及一个系统化的对话数据合成管道。该管道通过三个渐进阶段将多样化的标准单轮数据集转化为流畅的对话:构建基本的状态对话,通过带有历史依赖查询重写的“干扰”轮次强制解决长程依赖,以及合成自然交错的多模态响应。广泛的评估表明,ChatUMM在视觉理解和指令引导编辑基准测试中,在开源统一模型中实现了最先进的性能,同时在文本到图像生成中保持了竞争性的保真度。值得注意的是,ChatUMM在复杂的多轮场景中表现出更强的鲁棒性,确保了流畅且具有上下文意识的对话。
cs.CV / 39 / 2602.06450

What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

场景文本识别的合成数据到底哪里出了问题?一个具有多样化模拟和自我进化的强大合成引擎
Ye, Xingsong, Du, Yongkun, Zhang, JiaXin, Li, Chen, LYU, Jing, Chen, Zhineng
Abstract
Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance by only seeing 9% of real data labels.
Chinese Translation
大规模且类别平衡的文本数据对于训练有效的场景文本识别(STR)模型至关重要,但在收集真实数据时难以实现。合成数据提供了一种具有成本效益且标注完美的替代方案。然而,其性能往往滞后,揭示了真实数据与当前合成数据之间显著的领域差距。在本研究中,我们系统地分析了主流基于渲染的合成数据集,并识别出其关键局限性:语料、字体和布局的多样性不足,这限制了其在复杂场景中的真实感。为了解决这些问题,我们引入了UnionST,一个强大的数据引擎,合成覆盖具有挑战性的样本的文本,并更好地与实际环境中的复杂性对齐。随后,我们构建了UnionST-S,一个在挑战性场景中具有改进模拟的大规模合成数据集。此外,我们开发了一个自我进化学习(SEL)框架,以有效地进行真实数据标注。实验表明,基于UnionST-S训练的模型在现有合成数据集上取得了显著的改进。在某些场景中,它们甚至超过了真实数据的性能。此外,使用SEL时,训练的模型在仅看到9%的真实数据标签的情况下也能达到竞争性能。
cs.CV / 40 / 2602.06452

Exploring Specular Reflection Inconsistency for Generalizable Face Forgery Detection

探索可推广的人脸伪造检测中的镜面反射不一致性
Fei, Hongyan, Jia, Zexi, Huang, Chuanwei, Zhang, Jinchao, Zhou, Jie
Abstract
Detecting deepfakes has become increasingly challenging as forgery faces synthesized by generative AI methods, particularly diffusion models, achieve unprecedented quality and resolution. Existing forgery detection approaches relying on spatial and frequency features demonstrate limited efficacy against high-quality, entirely synthesized forgeries. In this paper, we propose a novel detection method grounded in the observation that facial attributes governed by complex physical laws and multiple parameters are inherently difficult to replicate. Specifically, we focus on illumination, particularly the specular reflection component in the Phong illumination model, which poses the greatest replication challenge due to its parametric complexity and nonlinear formulation. We introduce a fast and accurate face texture estimation method based on Retinex theory to enable precise specular reflection separation. Furthermore, drawing from the mathematical formulation of specular reflection, we posit that forgery evidence manifests not only in the specular reflection itself but also in its relationship with corresponding face texture and direct light. To address this issue, we design the Specular-Reflection-Inconsistency-Network (SRI-Net), incorporating a two-stage cross-attention mechanism to capture these correlations and integrate specular reflection related features with image features for robust forgery detection. Experimental results demonstrate that our method achieves superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly those containing diffusion-generated forgery faces.
Chinese Translation
随着由人工智能生成的方法(特别是扩散模型)合成的伪造人脸达到前所未有的质量和分辨率,检测深度伪造变得越来越具有挑战性。现有基于空间和频率特征的伪造检测方法在应对高质量、完全合成的伪造品时效果有限。本文提出了一种新颖的检测方法,基于观察到的面部特征受复杂物理规律和多个参数控制,固有地难以复制。具体而言,我们关注照明,特别是Phong照明模型中的镜面反射成分,由于其参数复杂性和非线性公式,造成了最大的复制挑战。我们引入了一种基于Retinex理论的快速准确的人脸纹理估计方法,以实现精确的镜面反射分离。此外,基于镜面反射的数学公式,我们认为伪造证据不仅体现在镜面反射本身,还体现在其与相应人脸纹理和直接光源的关系中。为了解决这一问题,我们设计了镜面反射不一致性网络(Specular-Reflection-Inconsistency-Network,SRI-Net),结合了两阶段的交叉注意机制,以捕捉这些相关性,并将与镜面反射相关的特征与图像特征整合,以实现稳健的伪造检测。实验结果表明,我们的方法在传统深度伪造数据集和生成深度伪造数据集上均表现出优越的性能,特别是在包含扩散生成伪造人脸的数据集上。
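The specular component the paper singles out is the standard Phong term. A minimal sketch, using the textbook formulation k_s * max(R·V, 0)^shininess with R mirrored about the surface normal (names follow the generic Phong model, not the paper's notation):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(light, normal):
    """Mirror the (unit) light direction about the surface normal: R = 2(N.L)N - L."""
    d = dot(normal, light)
    return [2.0 * d * n - l for n, l in zip(normal, light)]

def phong_specular(light, normal, view, k_s=1.0, shininess=32.0):
    """Specular intensity k_s * max(R.V, 0)^shininess for unit vectors."""
    r = reflect(light, normal)
    return k_s * max(dot(r, view), 0.0) ** shininess

n = [0.0, 0.0, 1.0]
# Viewing straight along the mirror direction yields the full highlight;
# a perpendicular view direction receives none.
print(phong_specular([0.0, 0.0, 1.0], n, [0.0, 0.0, 1.0]))  # 1.0
print(phong_specular([0.0, 0.0, 1.0], n, [1.0, 0.0, 0.0]))  # 0.0
```

The nonlinearity the abstract stresses is visible here: the term depends jointly on light, normal, and view directions through an exponentiated dot product, which is exactly what makes it hard for a generator to replicate consistently.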
cs.CV / 41 / 2602.06474

LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection

LAB-Det:以语言作为领域不变桥梁实现目标检测中无训练的一次性领域泛化
Zhang, Xu, Chen, Zhe, Zhang, Jing, Tao, Dacheng
Abstract
Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection, where object boundaries are often ambiguous, and LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.
Chinese Translation
基础目标检测器如GLIP和Grounding DINO在通用领域数据上表现优异,但在水下图像或工业缺陷等专业且数据稀缺的环境中往往表现不佳。典型的跨领域少样本方法依赖于对稀缺目标数据的微调,这会带来成本和过拟合风险。我们提出一个问题:一个冻结的检测器能否仅通过每个类别一个样本而无需训练进行适应?为了解答这个问题,我们引入了面向目标检测的无训练一次性领域泛化任务,其中检测器必须在没有权重更新的情况下,仅通过每个类别一个标注样本适应专业领域。为了解决这一任务,我们提出了LAB-Det,它利用语言作为领域不变的桥梁。我们并不直接适应视觉特征,而是将每个样本投影到描述性文本中,以此来调节和引导一个冻结的检测器。这种语言调节替代了基于梯度的适应,使得在数据稀缺的领域中实现稳健的泛化。我们在UODD(水下)和NEU-DET(工业缺陷)这两个广泛采用的数据稀缺检测基准上进行了评估,这些基准中物体边界往往模糊,而LAB-Det在不更新任何参数的情况下,较最先进的微调基线实现了高达5.4的mAP提升。这些结果确立了语言适应作为在专业检测环境中微调的高效且可解释的替代方案。
cs.CV / 42 / 2602.06478

Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention

Efficient-LVSM:通过解耦共细化注意力实现更快、更便宜且更优的大视图合成模型
Jia, Xiaosong, Sun, Yihang, You, Junqi, Wong, Songbur, Zou, Zichen, Yan, Junchi, Wu, Zuxuan, Jiang, Yu-Gang
Abstract
Feedforward models for novel view synthesis (NVS) have recently advanced by transformer-based methods like LVSM, using attention among all input and target views. In this work, we argue that its full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed. Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with KV-cache, thanks to its decoupled designs.
Chinese Translation
用于新视图合成(NVS)的前馈模型最近通过LVSM等基于变换器的方法取得了进展,这类方法在所有输入视图和目标视图之间使用注意力。在本研究中,我们认为其完全自注意力设计是次优的,因其计算复杂度随输入视图数量呈平方增长,并且在异构标记之间存在僵化的参数共享。我们提出了Efficient-LVSM,一种双流架构,通过解耦共细化机制避免了这些问题。它对输入视图应用视图内自注意力,对目标视图先应用自注意力再应用交叉注意力,从而消除不必要的计算。Efficient-LVSM在RealEstate10K数据集上以2个输入视图达到了29.86 dB的PSNR,超越LVSM 0.2 dB,训练收敛速度提高2倍,推理速度提高4.4倍。Efficient-LVSM在多个基准测试中实现了最先进的性能,展现出对未见视图数量的强大零样本泛化能力,并得益于其解耦设计,支持基于KV缓存的增量推理。
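The claimed complexity advantage can be checked with back-of-the-envelope query-key pair counts: full self-attention over all views is quadratic in the number of input views, while per-view self-attention plus target-to-input cross-attention drops the input-input cross terms. The token counts below are illustrative, not the model's actual sizes.

```python
def full_self_attention_pairs(n_views, tokens_per_view, n_target_tokens):
    """Query-key pairs for one full self-attention layer over all tokens."""
    total = n_views * tokens_per_view + n_target_tokens
    return total * total

def decoupled_pairs(n_views, tokens_per_view, n_target_tokens):
    """Pairs under per-view self-attention plus target self- then cross-attention."""
    intra = n_views * tokens_per_view ** 2               # input views, each to itself
    tgt_self = n_target_tokens ** 2                      # target tokens among themselves
    cross = n_target_tokens * n_views * tokens_per_view  # target queries -> input keys
    return intra + tgt_self + cross

# The saving grows with the number of input views (roughly 1.8x, 2.8x, 4.8x here).
for n in (2, 4, 8):
    print(n, full_self_attention_pairs(n, 1024, 1024) / decoupled_pairs(n, 1024, 1024))
```

Keeping target-to-input attention as a pure cross-attention is also what makes KV-cached incremental inference possible: cached input-view keys and values never need recomputation when new target views are queried.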
cs.CV / 43 / 2602.06484

Instance-Free Domain Adaptive Object Detection

无实例领域自适应目标检测
Yu, Hengfu, Deng, Jinhong, Duan, Lixin, Li, Wen
Abstract
While Domain Adaptive Object Detection (DAOD) has made significant strides, most methods rely on unlabeled target data that is assumed to contain sufficient foreground instances. However, in many practical scenarios (e.g., wildlife monitoring, lesion detection), collecting target domain data with objects of interest is prohibitively costly, whereas background-only data is abundant. This common practical constraint introduces a significant technical challenge: the difficulty of achieving domain alignment when target instances are unavailable, forcing adaptation to rely solely on the target background information. We formulate this challenge as the novel problem of Instance-Free Domain Adaptive Object Detection. To tackle this, we propose the Relational and Structural Consistency Network (RSCN) which pioneers an alignment strategy based on background feature prototypes while simultaneously encouraging consistency in the relationship between the source foreground features and the background features within each domain, enabling robust adaptation even without target instances. To facilitate research, we further curate three specialized benchmarks, including simulative auto-driving detection, wildlife detection, and lung nodule detection. Extensive experiments show that RSCN significantly outperforms existing DAOD methods across all three benchmarks in the instance-free scenario. The code and benchmarks will be released soon.
Chinese Translation
尽管领域自适应目标检测(DAOD)已取得显著进展,但大多数方法依赖于假设包含足够前景实例的未标记目标数据。然而,在许多实际场景中(例如,野生动物监测、病变检测),收集包含感兴趣对象的目标领域数据成本高昂,而仅包含背景的数据则相对丰富。这一普遍的实际限制带来了一个重大的技术挑战:在目标实例不可用时实现领域对齐的困难,迫使适应过程仅依赖于目标背景信息。我们将这一挑战表述为无实例领域自适应目标检测的全新问题。为了解决这一问题,我们提出了关系与结构一致性网络(RSCN),该网络开创了一种基于背景特征原型的对齐策略,同时鼓励源前景特征与每个领域内背景特征之间关系的一致性,从而即使在没有目标实例的情况下也能实现稳健的适应。为了促进研究,我们进一步策划了三个专门的基准测试,包括模拟自动驾驶检测、野生动物检测和肺结节检测。大量实验表明,RSCN在无实例场景下显著优于现有的DAOD方法。代码和基准测试将很快发布。
cs.CV / 44 / 2602.06488

Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction

重新基准化无监督单目3D占用预测
Guo, Zizhan, Feng, Yi, Zhang, Mengtan, Zhang, Haoran, Ye, Wei, Fan, Rui
Abstract
Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
Chinese Translation
从单幅图像推断3D结构,特别是在被遮挡区域,仍然是以视觉为中心的自动驾驶中的一个基本但未解决的挑战。现有的无监督方法通常训练一个神经辐射场,并在评估时将网络输出视为占用概率,忽视了训练和评估协议之间的不一致。此外,普遍使用的2D真实值未能揭示由于几何约束不足而导致的被遮挡区域内在的模糊性。为了解决这些问题,本文提出了一种重新构建的无监督单目3D占用预测基准。我们首先解释了体积渲染过程中的变量,并识别出占用概率的最物理一致表示。基于这些分析,我们通过将新识别的表示与体素级3D占用真实值对齐,改进了现有的评估协议,从而使无监督方法能够以与监督方法一致的方式进行评估。此外,为了在被遮挡区域施加明确的约束,我们引入了一种遮挡感知的极化机制,该机制结合了多视角视觉线索,以增强这些区域内占用和自由空间之间的区分。大量实验表明,我们的方法不仅显著优于现有的无监督方法,而且与监督方法的性能相当。我们的源代码和评估协议将在发表时提供。
cs.CV / 45 / 2602.06494

DreamHome-Pano: Design-Aware and Conflict-Free Panoramic Interior Generation

DreamHome-Pano:设计感知与无冲突的全景室内生成
Chen, Lulu, Hu, Yijiang, Liu, Yuanqing, Li, Yulong, Yang, Yue
Abstract
In modern interior design, the generation of personalized spaces frequently necessitates a delicate balance between rigid architectural structural constraints and specific stylistic preferences. However, existing multi-condition generative frameworks often struggle to harmonize these inputs, leading to "condition conflicts" where stylistic attributes inadvertently compromise the geometric precision of the layout. To address this challenge, we present DreamHome-Pano, a controllable panoramic generation framework designed for high-fidelity interior synthesis. Our approach introduces a Prompt-LLM that serves as a semantic bridge, effectively translating layout constraints and style references into professional descriptive prompts to achieve precise cross-modal alignment. To safeguard architectural integrity during the generative process, we develop a Conflict-Free Control architecture that incorporates structural-aware geometric priors and a multi-condition decoupling strategy, effectively suppressing stylistic interference from eroding the spatial layout. Furthermore, we establish a comprehensive panoramic interior benchmark alongside a multi-stage training pipeline, encompassing progressive Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Experimental results demonstrate that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust and professional-grade solution for panoramic interior visualization.
Chinese Translation
在现代室内设计中,个性化空间的生成常常需要在严格的建筑结构约束与特定的风格偏好之间取得微妙的平衡。然而,现有的多条件生成框架往往难以协调这些输入,导致出现“条件冲突”,即风格属性无意中妨碍了布局的几何精确性。为了解决这一挑战,我们提出了DreamHome-Pano,一个可控的全景生成框架,旨在实现高保真的室内合成。我们的方法引入了一个Prompt-LLM,作为语义桥梁,有效地将布局约束和风格参考转化为专业的描述性提示,以实现精确的跨模态对齐。为了在生成过程中保护建筑完整性,我们开发了一种无冲突控制架构,该架构结合了结构感知的几何先验和多条件解耦策略,有效抑制了风格干扰对空间布局的侵蚀。此外,我们建立了一个全面的全景室内基准,并配备了多阶段训练流程,包括渐进式监督微调(Supervised Fine-Tuning, SFT)和强化学习(Reinforcement Learning, RL)。实验结果表明,DreamHome-Pano在美学质量与结构一致性之间实现了优越的平衡,为全景室内可视化提供了一个强大且专业的解决方案。
cs.CV / 46 / 2602.06503

Forest canopy height estimation from satellite RGB imagery using large-scale airborne LiDAR-derived training data and monocular depth estimation

基于大规模航空激光雷达衍生训练数据和单目深度估计的卫星RGB影像森林冠层高度估计
Lai, Yongkang, Mu, Xihan, McVicar, Tim R., Fan, Dasheng, Xie, Donghui, Guo, Shanxin, Huang, Wenli, Zhao, Tianjie, Yan, Guangjian
Abstract
Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km² of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km²) and the United States (approximately 116 km²). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.
Chinese Translation
大规模、高分辨率的森林冠层高度制图在理解区域和全球碳水循环中发挥着至关重要的作用。包括冰、云和陆地高度卫星-2(ICESat-2)和全球生态系统动态调查(GEDI)在内的太空激光雷达任务提供了森林结构的全球观测,但其空间分布稀疏且存在固有的不确定性。相比之下,近地面激光雷达平台,如航空和无人机(UAV)激光雷达系统,提供了更精细的森林冠层结构测量,越来越多的国家已将这些数据集公开可用。在本研究中,使用来自多个国家的公开航空激光雷达点云及相关产品衍生的约16,000 km²的冠层高度模型(CHMs)以及3米分辨率的PlanetScope和航空RGB影像,训练了一种最先进的单目深度估计模型Depth Anything V2。训练后的模型被称为Depth2CHM,能够直接从PlanetScope RGB影像中估计空间连续的冠层高度模型。我们在中国(约1 km²)和美国(约116 km²)进行了独立验证。结果表明,Depth2CHM能够准确估计冠层高度,对于这两个站点的偏差分别为0.59米和0.41米,均方根误差(RMSE)分别为2.54米和5.75米。与现有的全球米级分辨率CHM产品相比,平均绝对误差减少了约1.5米,均方根误差减少了约2米。这些结果表明,利用大规模航空激光雷达衍生的冠层高度数据训练的单目深度估计网络为从卫星RGB影像进行高分辨率、空间连续的森林冠层高度估计提供了一条有前景且可扩展的途径。
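For readers reproducing the validation protocol, the bias and RMSE figures reported above are the standard signed-mean and root-mean-square errors between predicted and reference canopy heights; a minimal stdlib sketch with toy heights (not the paper's data):

```python
import math

def bias(pred, ref):
    """Mean signed error (metres)."""
    return sum(p - r for p, r in zip(pred, ref)) / len(pred)

def rmse(pred, ref):
    """Root mean square error (metres); penalises large per-pixel errors."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))

pred = [12.4, 18.0, 25.1, 9.7]   # predicted canopy heights (toy values)
ref  = [12.0, 17.5, 26.0, 10.0]  # LiDAR-derived reference heights
print(bias(pred, ref), rmse(pred, ref))
```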
cs.CV / 47 / 2602.06507

FloorplanVLM: A Vision-Language Model for Floorplan Vectorization

FloorplanVLM:一种用于平面图矢量化的视觉-语言模型
Liu, Yuanqing, Yang, Ziming, Li, Yulong, Yang, Yue
Abstract
Converting raster floorplans into engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. To address this, we present FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task. Unlike pixel-based methods that rely on fragile heuristics or query-based transformers that generate fragmented rooms, our model directly outputs structured JSON sequences representing the global topology. This 'pixels-to-sequence' paradigm enables the precise and holistic constraint satisfaction of complex geometries, such as slanted walls and curved arcs. To support this data-hungry approach, we introduce a scalable data engine: we construct a large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K) to balance geometric diversity and pixel-level precision. We then employ a progressive training strategy, using Supervised Fine-Tuning (SFT) for structural grounding and quality annealing, followed by Group Relative Policy Optimization (GRPO) for strict geometric alignment. To standardize evaluation on complex layouts, we establish and open-source FPBench-2K. Evaluated on this rigorous benchmark, FloorplanVLM demonstrates exceptional structural validity, achieving $\textbf{92.52\%}$ external-wall IoU and robust generalization across non-Manhattan architectures.
Chinese Translation
将光栅平面图转换为工程级矢量图形具有挑战性,因为其复杂的拓扑结构和严格的几何约束。为了解决这个问题,我们提出了FloorplanVLM,一个将平面图矢量化重新构造成图像条件序列建模任务的统一框架。与依赖脆弱启发式或基于查询的变换器生成碎片化房间的像素级方法不同,我们的模型直接输出表示全局拓扑的结构化JSON序列。这种“像素到序列”的范式使得复杂几何形状(如倾斜墙壁和弯曲弧线)的精确和整体约束满足成为可能。为了支持这种数据密集型的方法,我们引入了一个可扩展的数据引擎:构建了一个大规模数据集(Floorplan-2M)和一个高保真子集(Floorplan-HQ-300K),以平衡几何多样性和像素级精度。然后,我们采用渐进式训练策略,使用监督微调(Supervised Fine-Tuning, SFT)进行结构基础和质量退火,随后使用组相对策略优化(Group Relative Policy Optimization, GRPO)进行严格的几何对齐。为了标准化复杂布局的评估,我们建立并开源了FPBench-2K。在这个严格的基准上评估,FloorplanVLM展示了卓越的结构有效性,外墙IoU达到了$\textbf{92.52\%}$,并在非曼哈顿建筑中表现出强大的泛化能力。
cs.CV / 48 / 2602.06521

DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving

DriveWorld-VLA:基于视觉-语言-动作的统一潜在空间世界建模用于自动驾驶
Jia, Feiyang, Liu, Lin, Song, Ziying, Jia, Caiyan, Ye, Hangjun, Hao, Xiaoshuai, Chen, Long
Abstract
End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, enabling the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 0.16 3-second average collision rate on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld-VLA.git.
Chinese Translation
端到端(E2E)自动驾驶最近引起了越来越多的关注,旨在将视觉-语言-动作(VLA)与世界模型统一,以增强决策制定和前瞻性想象。然而,现有方法由于潜在状态共享不足,未能有效地将未来场景演变与行动规划统一在单一架构中,从而限制了视觉想象对行动决策的影响。为了解决这一限制,我们提出了DriveWorld-VLA,这是一种新颖的框架,通过在表示层面紧密集成VLA和世界模型,将世界建模和规划统一在潜在空间中,使得VLA规划器能够直接受益于整体场景演变建模,并减少对密集标注监督的依赖。此外,DriveWorld-VLA将世界模型的潜在状态作为VLA规划器的核心决策状态,帮助规划器评估候选行动如何影响未来场景演变。通过完全在潜在空间中进行世界建模,DriveWorld-VLA支持在特征层面上可控的、基于行动的想象,避免了昂贵的像素级展开。广泛的开环和闭环评估表明DriveWorld-VLA的有效性,其在NAVSIMv1上达到了91.3的PDMS,在NAVSIMv2上达到了86.8的EPDMS,以及在nuScenes上达到了0.16的3秒平均碰撞率。代码和模型将发布在https://github.com/liulin815/DriveWorld-VLA.git。
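The latent-space "imagination" described above amounts to rolling out a world-model transition under candidate action sequences and scoring the imagined futures without any pixel decoding. The linear dynamics and goal-distance score below are editorial toys, not DriveWorld-VLA's actual model:

```python
# Toy action-conditioned latent rollout: f(z, a) evolves the latent state,
# and the planner scores candidate action sequences by how close the
# imagined end state is to a goal latent. All dynamics are illustrative.

def transition(z, a):
    # toy latent dynamics: damped state plus action increment
    return [0.9 * zi + ai for zi, ai in zip(z, a)]

def rollout_score(z0, actions, goal):
    z = z0
    for a in actions:
        z = transition(z, a)
    # higher score = imagined end state closer to the goal latent
    return -sum((zi - gi) ** 2 for zi, gi in zip(z, goal))

z0 = [0.0, 0.0]
goal = [1.0, 0.0]
candidates = {
    "steer_right": [[0.6, 0.0], [0.6, 0.0]],
    "steer_left":  [[-0.6, 0.0], [-0.6, 0.0]],
}
best = max(candidates, key=lambda k: rollout_score(z0, candidates[k], goal))
print(best)
```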
cs.CV / 49 / 2602.06523

MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices

MicroBi-ConvLSTM:一种超轻量高效的人体活动识别模型,适用于资源受限设备
Mandal, Mridankan
Abstract
Human Activity Recognition (HAR) on resource-constrained wearables requires models that balance accuracy against strict memory and computational budgets. State-of-the-art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed the memory budgets of microcontrollers with limited SRAM once operating-system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two-stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents a 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait-freeze detection. Systematic ablation reveals task-dependent component contributions, where bidirectionality benefits episodic event detection but provides marginal gains on periodic locomotion. INT8 post-training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory-constrained edge devices.
Chinese Translation
在资源受限的可穿戴设备上进行人体活动识别(HAR)需要平衡准确性与严格的内存和计算预算的模型。现有的轻量级架构,如TinierHAR(34K参数)和TinyHAR(55K参数),在准确性上表现出色,但在考虑操作系统开销后,超出了具有有限SRAM的微控制器的内存预算。我们提出了MicroBi-ConvLSTM,这是一种超轻量的卷积-递归架构,通过两阶段卷积特征提取、4倍时间池化和单层双向LSTM,平均实现11.4K参数。这相比于TinierHAR减少了2.9倍的参数,相比于DeepConvLSTM减少了11.9倍,同时保持线性O(N)复杂度。在八个不同的HAR基准测试中的评估表明,MicroBi-ConvLSTM在超轻量级领域内保持了竞争力的性能:在UCI-HAR上达到93.41%的宏F1分数,在SKODA组装手势上达到94.46%,在Daphnet步态冻结检测上达到88.98%。系统的消融实验揭示了任务依赖的组件贡献,其中双向性有利于情节事件检测,但对周期性运动的增益有限。INT8后训练量化仅导致0.21%的平均F1分数下降,产生适合内存受限边缘设备的23.0 KB平均部署占用。
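The ~11.4K-parameter budget above can be sanity-checked with back-of-envelope parameter accounting for a Conv1d + single bidirectional LSTM stack. Layer sizes below are editorial guesses, not the model's actual configuration:

```python
# Parameter accounting for a small convolutional-recurrent HAR model.
# Channel counts, kernel size, and hidden size are illustrative assumptions.

def conv1d_params(in_ch, out_ch, kernel):
    return out_ch * (in_ch * kernel + 1)          # weights + one bias per filter

def bilstm_params(input_size, hidden_size):
    # 4 gates, each with a weight matrix over [x; h] and one bias, x2 directions
    per_dir = 4 * (hidden_size * (input_size + hidden_size) + hidden_size)
    return 2 * per_dir

total = (conv1d_params(9, 16, 5)        # stage-1 conv over 9 sensor channels
         + conv1d_params(16, 24, 5)     # stage-2 conv
         + bilstm_params(24, 20))       # single bidirectional LSTM layer
print(total)
```

Note that frameworks such as PyTorch use two bias vectors per gate, which would add a further 8·hidden parameters per direction; the single-bias convention here keeps the arithmetic minimal.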
cs.CV / 50 / 2602.06529

AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion

AdaptOVCD:基于自适应信息融合的无训练开放词汇遥感变化检测
Dou, Mingyu, Qiu, Shi, Hu, Ming, Chen, Yifan, Ye, Huping, Liao, Xiaohan, Sun, Zhe
Abstract
Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at https://github.com/Dmygithub/AdaptOVCD.
Chinese Translation
遥感变化检测在环境监测、城市规划和灾害评估等领域发挥着关键作用。然而,现有方法通常依赖于预定义类别和大规模像素级标注,这限制了它们在开放世界场景中的泛化能力和适用性。为了解决这些局限性,本文提出了AdaptOVCD,一种基于双维多级信息融合的无训练开放词汇变化检测(OVCD)架构。该框架在数据、特征和决策层面上垂直整合多级信息融合,同时在水平上结合针对性的自适应设计,实现异构预训练模型之间的深度协同,有效减轻错误传播。具体而言,(1) 在数据层面,自适应辐射对齐(ARA)将辐射统计与原始纹理特征融合,并与SAM-HQ协同实现辐射一致的分割;(2) 在特征层面,自适应变化阈值(ACT)将全局差异分布与边缘结构先验结合,并利用DINOv3实现稳健的变化检测;(3) 在决策层面,自适应置信过滤(ACF)将语义置信度与空间约束整合,并与DGTRS-CLIP协作实现高置信度的语义识别。对九个场景的综合评估表明,AdaptOVCD能够以零样本方式检测任意类别的变化,显著优于现有的无训练方法。同时,在跨数据集评估中,其达到了完全监督性能上限的84.89%,展现出优越的泛化能力。代码可在 https://github.com/Dmygithub/AdaptOVCD 获取。
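The Adaptive Change Thresholding idea, cutting the difference map according to its global distribution rather than a fixed constant, can be sketched as below. The mean-plus-k·std statistic and k value are editorial assumptions; the actual ACT additionally incorporates edge-structure priors:

```python
# Toy adaptive thresholding of a feature-difference map: the cut point is
# derived from the map's own statistics, so it adapts to each image pair.
import statistics

def adaptive_threshold(diff_map, k=1.5):
    flat = [v for row in diff_map for v in row]
    mu = statistics.fmean(flat)
    sigma = statistics.pstdev(flat)
    return mu + k * sigma

def change_mask(diff_map, k=1.5):
    t = adaptive_threshold(diff_map, k)
    return [[v > t for v in row] for row in diff_map]

diff = [[0.1, 0.2, 0.1],
        [0.1, 0.9, 0.2],   # one strongly changed pixel
        [0.2, 0.1, 0.1]]
print(change_mask(diff))
```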
cs.CV / 51 / 2602.06530

Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance

针对图像伪造检测的通用反取证攻击:多模态引导方法
Li, Haipeng, Peng, Rongxuan, Luo, Anwei, Tan, Shunquan, Chen, Changsheng, Antsiferova, Anastasia
Abstract
The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attack, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attack without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
Chinese Translation
人工智能生成内容(AIGC)技术的快速发展对真实性评估提出了重大挑战。然而,现有的评估协议在很大程度上忽视了反取证攻击,未能确保最先进的AIGC检测器在实际应用中的全面鲁棒性。为了解决这一问题,我们提出了ForgeryEraser,一个旨在执行通用反取证攻击的框架,无需访问目标AIGC检测器。我们揭示了一种源于对视觉-语言模型(VLMs)作为共享骨干网(例如,CLIP)的系统性依赖所导致的对抗性脆弱性,其中下游AIGC检测器继承了这些公开可用模型的特征空间。我们设计了一种多模态引导损失,以驱动伪造图像嵌入在VLM特征空间内朝向文本衍生的真实锚点移动,从而消除伪造痕迹,同时将其远离伪造锚点。大量实验表明,ForgeryEraser对先进的AIGC检测器在全球合成和局部编辑基准测试中造成了显著的性能下降。此外,ForgeryEraser促使可解释的取证模型为伪造图像生成与真实图像一致的解释。我们的代码将公开发布。
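The attack's guidance objective, pulling a forged image's embedding toward a text-derived "authentic" anchor while repelling it from a "forgery" anchor inside a shared CLIP-like space, can be sketched as a cosine-similarity loss. The vectors and the exact loss form are illustrative assumptions, not the paper's implementation:

```python
# Toy multi-modal guidance loss in a 2-D stand-in for a VLM embedding space.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def guidance_loss(image_emb, authentic_anchor, forgery_anchor):
    # lower is better: close to the authentic anchor, far from the forgery anchor
    return cosine(image_emb, forgery_anchor) - cosine(image_emb, authentic_anchor)

authentic = [1.0, 0.0]   # text-derived "authentic" anchor (toy)
forgery = [0.0, 1.0]     # text-derived "forgery" anchor (toy)
before = [0.2, 0.98]     # forged image embedding near the forgery anchor
after = [0.9, 0.3]       # embedding after the attack's optimisation
print(guidance_loss(before, authentic, forgery), guidance_loss(after, authentic, forgery))
```

Minimising this loss moves the forged embedding into the region downstream detectors associate with authentic images, which is why the shared backbone becomes the attack surface.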
cs.CV / 52 / 2602.06548

NECromancer: Breathing Life into Skeletons via BVH Animation

NECromancer:通过 BVH 动画为骷髅注入生命
Xu, Mingxi, Wang, Qi, Wen, Zhengyu, Thien, Phong Dao, Li, Zhengyu, Zhang, Ning, He, Xiaoyu, Zhao, Wei, Gong, Kehong, Zhang, Mingyuan
Abstract
Motion tokenization is a key component of generalizable motion models, yet most existing approaches are restricted to species-specific skeletons, limiting their applicability across diverse morphologies. We propose NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons. NEC consists of three components: (1) an Ontology-aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files, including joint semantics, rest-pose offsets, and skeletal topology, into skeletal embeddings; (2) a Topology-Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology-invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large-scale dataset aggregating BVH motions across heterogeneous skeletons. Experiments show that NEC achieves high-fidelity reconstruction under substantial compression and effectively disentangles motion from skeletal structure. The resulting token space supports cross-species motion transfer, composition, denoising, generation with token-based models, and text-motion retrieval, establishing a unified framework for motion analysis and synthesis across diverse morphologies. Demo page: https://animotionlab.github.io/NECromancer/
Chinese Translation
运动标记化是可推广运动模型的关键组成部分,但现有大多数方法仅限于特定物种的骨架,限制了其在不同形态中的适用性。我们提出了 NECromancer (NEC),一种直接作用于任意 BVH 骨架的通用运动标记器。NEC 由三个组件组成:(1) 一个具有本体感知的骨架图编码器 (OwO),该编码器从 BVH 文件中编码结构先验,包括关节语义、静止姿势偏移和骨架拓扑,生成骨架嵌入;(2) 一个拓扑无关的标记器 (TAT),将运动序列压缩为通用的、拓扑不变的离散表示;(3) 统一 BVH 宇宙 (UvU),一个聚合来自异构骨架的 BVH 动作的大规模数据集。实验表明,NEC 在显著压缩下实现了高保真重建,并有效地将运动与骨架结构解耦。生成的标记空间支持跨物种运动转移、组合、去噪、基于标记的模型生成以及文本-运动检索,为不同形态的运动分析和合成建立了统一框架。演示页面:https://animotionlab.github.io/NECromancer/
cs.CV / 53 / 2602.06556

LIBERO-X: Robustness Litmus for Vision-Language-Action Models

LIBERO-X:视觉-语言-行动模型的鲁棒性试金石
Wang, Guodong, Zhang, Chenkai, Liu, Qingjie, Zhang, Jinjin, Cai, Jiancheng, Liu, Junjie, Liu, Xinmin
Abstract
Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
Chinese Translation
可靠的基准测试对于推动视觉-语言-行动(VLA)模型的发展至关重要,因为它揭示了模型的泛化能力、鲁棒性以及感知与语言驱动的操作任务之间的对齐程度。然而,现有的基准测试往往由于评估协议不足,无法充分捕捉现实世界中的分布变化,从而提供有限或误导性的评估。本文从评估和数据两个角度系统性地重新思考了VLA基准测试,提出了LIBERO-X,一个基准测试,具有以下特点:1)一个分层评估协议,设定逐步增加难度的目标,聚焦于三个核心能力:空间泛化、物体识别和任务指令理解。该设计使得在环境和任务复杂性增加的情况下,能够对性能下降进行细致分析;2)一个通过人类遥操作收集的高多样性训练数据集,每个场景支持多个细化的操作目标,以弥合训练与评估之间的分布差距。对代表性VLA模型的实验表明,在累积扰动下性能显著下降,暴露了场景理解和指令基础方面的持续局限性。通过将分层评估与多样化训练数据结合,LIBERO-X为评估和推动VLA的发展提供了更可靠的基础。
cs.CV / 54 / 2602.06566

SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

SPARC:分离感知与推理电路以实现视觉语言模型的测试时扩展
Avogaro, Niccolo, Debnath, Nayanika, Mi, Li, Frick, Thomas, Wang, Junling, He, Zexue, Hua, Hang, Schindler, Konrad, Rigotti, Mattia
Abstract
Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
Chinese Translation
尽管最近取得了一些成功,但测试时扩展——即在推理过程中根据需要动态扩展令牌预算——对于视觉语言模型(VLMs)仍然脆弱:关于图像的非结构化思维链将感知与推理纠缠在一起,导致长且无序的上下文,其中小的感知错误可能会级联成完全错误的答案。此外,达到良好性能需要昂贵的强化学习和手工设计的奖励。在此,我们介绍了SPARC(Separating Perception And Reasoning Circuits),一个明确将视觉感知与推理解耦的模块化框架。SPARC受到大脑中顺序感知到认知处理的启发,实施了一个两阶段的流程,其中模型首先进行明确的视觉搜索以定位与问题相关的区域,然后基于这些区域进行推理以生成最终答案。这种分离使得独立的测试时扩展成为可能,并支持不对称的计算分配(例如,在分布转变下优先处理感知),支持选择性优化(例如,当感知阶段是端到端性能瓶颈时,仅改善感知阶段),并通过在较低图像分辨率下进行全局搜索并仅对选定区域分配高分辨率处理来适应压缩上下文,从而减少总的视觉令牌数量和计算量。在具有挑战性的视觉推理基准测试中,SPARC的表现超越了单一基线和强大的视觉定位方法。例如,SPARC在$V^*$ VQA基准上将Qwen3VL-4B的准确率提高了6.7个百分点,并在一个具有挑战性的OOD任务上超越了“用图像思考”4.6个百分点,尽管其令牌预算要求降低了200倍。
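SPARC's asymmetric compute allocation can be illustrated with simple ViT token accounting: a low-resolution global search pass plus one high-resolution crop uses far fewer visual tokens than a monolithic high-resolution pass. The resolutions and 16-pixel patch size below are assumptions, not the paper's settings:

```python
# Toy visual-token budget: monolithic high-res processing vs. a SPARC-style
# low-res search pass followed by one high-res crop of the selected region.

def vit_tokens(height, width, patch=16):
    """Number of patch tokens a ViT produces for an image of this size."""
    return (height // patch) * (width // patch)

full_highres = vit_tokens(1024, 1024)                 # monolithic baseline
sparc = vit_tokens(256, 256) + vit_tokens(224, 224)   # search pass + one crop
print(full_highres, sparc)
```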
cs.CV / 55 / 2602.06590

An Integer Linear Programming Approach to Geometrically Consistent Partial-Partial Shape Matching

一种整数线性规划方法用于几何一致的部分-部分形状匹配
Ehm, Viktoria, Roetzer, Paul, Bernard, Florian, Cremers, Daniel
Abstract
The task of establishing correspondences between two 3D shapes is a long-standing challenge in computer vision. While numerous studies address full-full and partial-full 3D shape matching, only a limited number of works have explored the partial-partial setting, very likely due to its unique challenges: we must compute accurate correspondences while at the same time find the unknown overlapping region. Nevertheless, partial-partial 3D shape matching reflects the most realistic setting, as in many real-world cases, such as 3D scanning, shapes are only partially observable. In this work, we introduce the first integer linear programming approach specifically designed to address the distinctive challenges of partial-partial shape matching. Our method leverages geometric consistency as a strong prior, enabling both robust estimation of the overlapping region and computation of neighbourhood-preserving correspondences. We empirically demonstrate that our approach achieves high-quality matching results both in terms of matching error and smoothness. Moreover, we show that our method is more scalable than previous formalisms.
Chinese Translation
在计算机视觉中,建立两个三维形状之间的对应关系是一项长期以来的挑战。尽管许多研究解决了全-全和部分-全的三维形状匹配,但探讨部分-部分设置的工作却相对有限,这很可能是由于其独特的挑战:我们必须在计算准确对应关系的同时找到未知的重叠区域。然而,部分-部分三维形状匹配反映了最现实的场景,因为在许多实际案例中,例如三维扫描,形状仅部分可观察。在本研究中,我们提出了首个专门设计用于解决部分-部分形状匹配独特挑战的整数线性规划方法。我们的方法利用几何一致性作为强先验,从而实现对重叠区域的稳健估计和邻域保持对应关系的计算。我们通过实证展示了我们的方法在匹配误差和光滑性方面都能实现高质量的匹配结果。此外,我们还表明,我们的方法比之前的形式更具可扩展性。
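The integer-program structure of partial-partial matching — binary correspondence variables, each point used at most once, and unmatched points allowed so the overlapping region is discovered rather than fixed — can be illustrated with a brute-force toy. A real ILP solver replaces the enumeration, and the paper's geometric-consistency constraints are omitted here; the similarity values are invented:

```python
# Brute-force stand-in for a tiny correspondence ILP: maximise descriptor
# similarity subject to one-to-at-most-one matching, with an "unmatched"
# option (-1) per source point so partial overlap emerges from the optimum.
from itertools import product

def best_partial_matching(sim):
    n, m = len(sim), len(sim[0])
    best, best_score = [], 0.0
    for assign in product(range(-1, m), repeat=n):
        used = [j for j in assign if j >= 0]
        if len(used) != len(set(used)):      # each target used at most once
            continue
        score = sum(sim[i][j] for i, j in enumerate(assign) if j >= 0)
        if score > best_score:
            best = [(i, j) for i, j in enumerate(assign) if j >= 0]
            best_score = score
    return best, best_score

sim = [[0.9, 0.1, -0.2],
       [0.2, 0.8, -0.1],
       [-0.3, -0.2, -0.1]]   # point 2 has no good partner -> left unmatched
print(best_partial_matching(sim))
```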
cs.CV / 56 / 2602.06592

ProtoQuant: Quantization of Prototypical Parts For General and Fine-Grained Image Classification

ProtoQuant:原型部分的量化用于一般和细粒度图像分类
Janusz, Mikołaj, Wróbel, Adam, Zieliński, Bartosz, Rymarczyk, Dawid
Abstract
Prototypical parts-based models offer a "this looks like that" paradigm for intrinsic interpretability, yet they typically struggle with ImageNet-scale generalization and often require computationally expensive backbone finetuning. Furthermore, existing methods frequently suffer from "prototype drift," where learned prototypes lack tangible grounding in the training distribution and change their activation under small perturbations. We present ProtoQuant, a novel architecture that achieves prototype stability and grounded interpretability through latent vector quantization. By constraining prototypes to a discrete learned codebook within the latent space, we ensure they remain faithful representations of the training data without the need to update the backbone. This design allows ProtoQuant to function as an efficient, interpretable head that scales to large-scale datasets. We evaluate ProtoQuant on ImageNet and several fine-grained benchmarks (CUB-200, Cars-196). Our results demonstrate that ProtoQuant generalizes to ImageNet with competitive classification accuracy while matching the interpretability metrics of other prototypical-parts-based methods.
Chinese Translation
基于原型部分的模型提供了一种“这看起来像那个”的内在可解释性范式,但它们通常在ImageNet规模的泛化上表现不佳,并且通常需要计算开销较大的主干微调。此外,现有方法常常遭遇“原型漂移”问题,即学习到的原型在训练分布中缺乏实质性基础,并且在小扰动下改变其激活状态。我们提出了ProtoQuant,这是一种新颖的架构,通过潜在向量量化实现了原型的稳定性和有根据的可解释性。通过将原型约束在潜在空间内的离散学习代码本中,我们确保它们保持对训练数据的忠实表示,而无需更新主干。这一设计使ProtoQuant能够作为一个高效且可解释的头部,适应大规模数据集。我们在ImageNet和几个细粒度基准(CUB-200,Cars-196)上评估了ProtoQuant。我们的结果表明,ProtoQuant在分类准确性上具有竞争力,同时在泛化到ImageNet和与其他基于原型部分的方法相比,具有相似的可解释性指标。
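The core quantization step, snapping each prototype to its nearest codebook entry so that it is always grounded in a concrete latent vector, reduces to a nearest-neighbour lookup. The codebook and prototype values below are toys:

```python
# Toy latent vector quantization: a freely learned prototype is replaced by
# its nearest entry in a discrete codebook, guaranteeing grounding.

def quantize(vec, codebook):
    """Return the codebook entry closest to vec in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda c: sq_dist(vec, c))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
prototype = [0.9, 0.2]                  # a freely learned prototype...
print(quantize(prototype, codebook))    # ...snapped to the nearest code
```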
cs.CV / 57 / 2602.06613

DAVE: Distribution-aware Attribution via ViT Gradient Decomposition

DAVE:基于ViT梯度分解的分布感知归因
Wróbel, Adam, Gairola, Siddhartha, Tabor, Jacek, Schiele, Bernt, Zieliński, Bartosz, Rymarczyk, Dawid
Abstract
Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE \textit{(\underline{D}istribution-aware \underline{A}ttribution via \underline{V}iT Gradient D\underline{E}composition)}, a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input--output mapping. It separates these from architecture-induced artifacts and other sources of instability.
Chinese Translation
视觉变换器(ViTs)已成为计算机视觉中的主导架构,但为这些模型生成稳定且高分辨率的归因图仍然具有挑战性。诸如补丁嵌入和注意力路由等架构组件常常在像素级解释中引入结构化伪影,导致许多现有方法依赖于粗略的补丁级归因。我们提出了DAVE \textit{(\underline{D}istribution-aware \underline{A}ttribution via \underline{V}iT Gradient D\underline{E}composition)},这是一种基于输入梯度结构分解的数学基础归因方法,专为ViTs设计。通过利用ViTs的架构特性,DAVE能够隔离有效输入-输出映射的局部等变和稳定成分,并将其与架构引起的伪影及其他不稳定源分离。
cs.CV / 58 / 2602.06619

CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

CauCLIP:通过因果启发的视觉-语言建模缩小外科视频理解中的仿真与现实之间的差距
He, Yuxin, Li, An, Xue, Cheng
Abstract
Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
Chinese Translation
外科阶段识别是智能手术室中上下文感知决策支持的关键组成部分,但由于标注临床视频的数量有限以及合成与真实外科数据之间存在较大领域差距,训练稳健的模型受到阻碍。为了解决这一问题,我们提出了CauCLIP,一个因果启发的视觉-语言框架,利用CLIP学习领域不变的表示,以实现外科阶段识别,而无需访问目标领域数据。我们的方法集成了一种基于频率的增强策略,以扰动领域特定属性,同时保留语义结构,并引入了一种因果抑制损失,以减轻非因果偏差并强化因果外科特征。这些组件在一个统一的训练框架中结合,使模型能够专注于外科工作流程中稳定的因果因素。在SurgVisDom困难适应基准上的实验表明,我们的方法显著优于所有竞争方法,突显了因果引导的视觉-语言模型在领域通用外科视频理解中的有效性。
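The frequency-based augmentation described above is in the spirit of perturbing the amplitude spectrum (a domain-specific cue such as texture statistics) while preserving the phase spectrum (which carries semantic structure). A 1-D stdlib illustration follows; the naive O(n²) DFT and the uniform amplitude scale are editorial simplifications of what would be a per-band perturbation on 2-D images:

```python
# Toy frequency-domain augmentation: rescale amplitudes, keep phases.
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def amplitude_jitter(x, scale):
    """Rescale every amplitude by `scale`, leave every phase untouched."""
    X = dft(x)
    X_aug = [scale * abs(c) * cmath.exp(1j * cmath.phase(c)) for c in X]
    return idft(X_aug)

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
aug = amplitude_jitter(signal, 0.5)
# With a uniform scale, phase preservation makes this an exact rescaling
assert all(abs(a - 0.5 * s) < 1e-9 for a, s in zip(aug, signal))
```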
cs.CV / 59 / 2602.06663

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

PlanViz:评估面向计算机使用任务的规划导向图像生成与编辑
Li, Junxian, Liu, Kai, Chen, Leyang, Wang, Weida, Wang, Zhixin, Xu, Jiaqi, Li, Fan, Pei, Renjing, Kong, Linghe, Zhang, Yulun
Abstract
Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our daily lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs possess these capabilities. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks that frequently arise in daily life and require planning steps. Specifically, three new sub-tasks are designed: route planning, work diagramming, and web&UI displaying. We address data-quality challenges by curating human-annotated questions and reference images and applying a quality control process. To enable comprehensive and exact evaluation, we propose a task-adaptive score, PlanScore, which captures the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.
Chinese Translation
统一多模态模型(UMMs)在生成自然图像和支持多模态推理方面展现了令人印象深刻的能力。然而,它们在支持与我们生活密切相关的计算机使用规划任务中的潜力仍然未被充分探索。计算机使用任务中的图像生成和编辑需要空间推理和过程理解等能力,目前尚不清楚UMMs是否具备完成这些任务的能力。因此,我们提出了PlanViz,一个旨在评估计算机使用任务中图像生成和编辑的新基准。为了实现我们的评估目标,我们专注于日常生活中经常涉及且需要规划步骤的子任务。具体而言,设计了三个新的子任务:路线规划、工作图示和网页与用户界面展示。我们通过策划人工标注的问题和参考图像,以及质量控制过程,解决了数据质量确保的挑战。针对全面和准确评估的挑战,提出了一种任务自适应评分标准PlanScore。该评分标准有助于理解生成图像的正确性、视觉质量和效率。通过实验,我们强调了该主题未来研究的关键局限性和机会。
cs.CV / 60 / 2602.06674

CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis

CytoCrowd:用于细胞学图像分析的多注释基准数据集
Si, Yonghao, Zeng, Xingyuan, Chen, Zhao, Zheng, Libin, Cao, Caleb Chen, Chen, Lei, Yin, Jian
Abstract
High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
Chinese Translation
高质量的注释数据集对于推动医学图像分析中的机器学习至关重要。然而,存在一个关键的缺口:大多数数据集要么提供单一的、干净的真实标签,这掩盖了现实世界中专家之间的分歧,要么提供多个注释,但没有单独的金标准进行客观评估。为了解决这一问题,我们推出了CytoCrowd,一个新的细胞学分析公共基准。该数据集包含446张高分辨率图像,每张图像具有两个关键组成部分:(1) 来自四位独立病理学家的原始、相互矛盾的注释,以及 (2) 由资深专家建立的单独的高质量金标准真实标签。这种双重结构使CytoCrowd成为一个多功能资源。它作为标准计算机视觉任务(如目标检测和分类)的基准,使用真实标签。同时,它提供了一个现实的测试平台,用于评估必须解决专家分歧的注释聚合算法。我们为这两个任务提供了全面的基线结果。我们的实验展示了CytoCrowd所带来的挑战,并确立了其作为开发下一代医学图像分析模型资源的价值。
cs.CV / 61 / 2602.06676

Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction

我们能否构建一个单一模型用于假图像检测?SICA:用于统一且具有区分性的伪影特征空间重建的语义诱导约束适应
Du, Bo, Ma, Xiaochen, Zhu, Xuekang, Yang, Zhe, Niu, Chaogun, Liu, Jian, Zhou, Ji-Zhe
Abstract
Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, by discovering the ``heterogeneous phenomenon'', which is the intrinsic distinctness of artifacts across subdomains, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space driven by this phenomenon. The core challenge for developing a practical monolithic FID model thus boils down to the ``unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at: https://github.com/scu-zjz/SICA_OpenMMSec.
Chinese Translation
假图像检测(FID)旨在跨越四个图像取证子领域进行统一检测,在现实世界的取证场景中至关重要。与集成方法相比,单一的FID模型在理论上更具前景,但迄今为止,在实践中始终表现不佳。在本研究中,我们首次发现了“异质现象”,即伪影在子领域之间的内在差异性,并诊断出这种性能不足的原因:由该现象驱动的伪影特征空间的崩溃。因此,开发实用的单一FID模型的核心挑战归结为伪影特征空间的“统一且具有区分性”的重建。为了解决这一矛盾挑战,我们假设高层语义可以作为重建的结构先验,并进一步提出了语义诱导约束适应(SICA),这是首个单一的FID范式。在我们的OpenMMSec数据集上进行的大量实验表明,SICA的表现超过了15种最先进的方法,并以近正交的方式重建了目标统一且具有区分性的伪影特征空间,从而有力地验证了我们的假设。代码和数据集可在以下网址获取:https://github.com/scu-zjz/SICA_OpenMMSec。
cs.CV / 62 / 2602.06743

Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening

基于临床先验指导的多模态学习与潜在注意力池化用于基于步态的脊柱侧弯筛查
Chen, Dong, Wei, Zizhuang, Xu, Jialei, Sun, Xinyang, He, Zonglin, An, Meiru, Peng, Huili, Hu, Yong, Cheung, Kenneth MC
Abstract
Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.
Chinese Translation
青少年特发性脊柱侧弯(AIS)是一种常见的脊柱畸形,其进展可以通过早期检测来减缓。传统筛查方法往往主观性强,难以扩展,并依赖于专业的临床知识。基于视频的步态分析提供了一种有前景的替代方案,但当前的数据集和方法常常受到数据泄露的影响,即同一个体的重复片段导致性能虚高,或者采用缺乏临床可解释性的过于简化的模型。为了解决这些局限性,我们引入了ScoliGait,一个新的基准数据集,包含1,572个用于训练的步态视频片段和300个完全独立的测试片段。每个片段都根据临床运动学先验标注了放射学Cobb角和描述性文本。我们提出了一个多模态框架,整合了基于临床先验指导的运动学知识图谱用于可解释的特征表示,以及潜在注意力池化机制用于融合视频、文本和知识图谱模态。我们的方法在一个真实的、非重复的受试者基准上建立了新的最先进水平,显示出显著的性能提升。这项工作为可扩展的、非侵入性的AIS评估提供了一个稳健、可解释且基于临床的基础。
cs.CV / 63 / 2602.06748

Gold Exploration using Representations from a Multispectral Autoencoder

利用多光谱自编码器的表征进行黄金勘探
Tsandalidou, Argyro, Dogeas, Konstantinos, Tsonga, Eleftheria Tetoula, Parselia, Elisavet, Tsimiklis, Georgios, Arvanitakis, George
Abstract
Satellite imagery is employed for large-scale prospectivity mapping due to the high cost and typically limited availability of on-site mineral exploration data. In this work, we present a proof-of-concept framework that leverages generative representations learned from multispectral Sentinel-2 imagery to identify gold-bearing regions from space. An autoencoder foundation model, called Isometric, which is pretrained on the large-scale FalconSpace-S2 v1.0 dataset, produces information-dense spectral-spatial representations that serve as inputs to a lightweight XGBoost classifier. We compare this representation-based approach with a raw spectral input baseline using a dataset of 63 Sentinel-2 images from known gold and non-gold locations. The proposed method improves patch-level accuracy from 0.51 to 0.68 and image-level accuracy from 0.55 to 0.73, demonstrating that generative embeddings capture transferable mineralogical patterns even with limited labeled data. These results highlight the potential of foundation-model representations to make mineral exploration more efficient, scalable, and globally applicable.
Chinese Translation
由于现场矿产勘探数据的高成本和通常有限的可用性,卫星影像被用于大规模的前景图绘制。在本研究中,我们提出了一个概念验证框架,利用从多光谱Sentinel-2影像中学习的生成表征来识别太空中的含金区域。我们使用的基础模型是一个名为Isometric的自编码器,该模型在大规模的FalconSpace-S2 v1.0数据集上进行了预训练,能够生成信息密集的光谱-空间表征,这些表征作为轻量级XGBoost分类器的输入。我们将这种基于表征的方法与使用来自已知含金和非含金位置的63幅Sentinel-2影像的原始光谱输入基线进行了比较。所提出的方法将补丁级准确率从0.51提高到0.68,图像级准确率从0.55提高到0.73,证明了生成嵌入能够捕捉可转移的矿物学模式,即使在标记数据有限的情况下。这些结果突显了基础模型表征在提高矿产勘探效率、可扩展性和全球适用性方面的潜力。
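The paper reports both patch-level and image-level accuracy, which implies aggregating per-patch classifier outputs into a single decision per Sentinel-2 image. A minimal sketch of that aggregation step, assuming simple majority voting (the paper does not state its aggregation rule, and the identifiers below are illustrative):

```python
from collections import Counter, defaultdict

def image_level_predictions(patch_preds):
    """Aggregate per-patch labels into one label per image by majority vote.

    patch_preds: list of (image_id, predicted_label) pairs, e.g. the
    outputs of a patch classifier run over tiles of each scene.
    Returns {image_id: label}.
    """
    by_image = defaultdict(list)
    for image_id, label in patch_preds:
        by_image[image_id].append(label)
    return {img: Counter(labels).most_common(1)[0][0]
            for img, labels in by_image.items()}

# Hypothetical patch predictions for two scenes.
patch_preds = [
    ("scene_a", "gold"), ("scene_a", "gold"), ("scene_a", "non_gold"),
    ("scene_b", "non_gold"), ("scene_b", "non_gold"), ("scene_b", "gold"),
]
image_preds = image_level_predictions(patch_preds)
```

Majority voting explains why image-level accuracy (0.73) can exceed patch-level accuracy (0.68): per-patch errors are averaged out across each scene.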
cs.CV / 64 / 2602.06778

Revisiting Emotions Representation for Recognition in the Wild

重新审视情感表征以实现自然环境中的识别
Neto, Joao Baptista Cardia, Ferrari, Claudio, Berretti, Stefano
Abstract
Facial emotion recognition has been typically cast as a single-label classification problem of one out of six prototypical emotions. However, that is an oversimplification that is unsuitable for representing the multifaceted spectrum of spontaneous emotional states, which are most often the result of a combination of multiple emotions contributing at different intensities. Building on this, a promising direction that was explored recently is to cast emotion recognition as a distribution learning problem. Still, such approaches are limited in that research datasets are typically annotated with a single emotion class. In this paper, we contribute a novel approach to describe complex emotional states as probability distributions over a set of emotion classes. To do so, we propose a solution to automatically re-label existing datasets by exploiting the result of a study in which a large set of both basic and compound emotions is mapped to probability distributions in the Valence-Arousal-Dominance (VAD) space. In this way, given a face image annotated with VAD values, we can estimate the likelihood of it belonging to each of the distributions, so that emotional states can be described as a mixture of emotions, enriching their description, while also accounting for the ambiguous nature of their perception. In a preliminary set of experiments, we illustrate the advantages of this solution and a new possible direction of investigation. Data annotations are available at https://github.com/jbcnrlz/affectnet-b-annotation.
Chinese Translation
面部情感识别通常被视为从六种原型情感中选择一种的单标签分类问题。然而,这种简化的处理方式并不适合表现自发情感状态的多面性谱系,这些状态通常是多种情感以不同强度组合而成的结果。在此基础上,最近探索的一个有前景的方向是将情感识别视为一个分布学习问题。然而,这类方法的局限在于,研究数据集通常仅标注单一情感类别。本文提出了一种新颖的方法,将复杂的情感状态描述为一组情感类别上的概率分布。为此,我们提出了一种解决方案,通过利用一项研究的结果,自动重新标注现有数据集,该研究将大量基本情感和复合情感映射到情感的愉悦-唤醒-主导(Valence-Arousal-Dominance, VAD)空间中的概率分布。通过这种方式,给定一张标注有VAD值的面部图像,我们可以估计其属于每个分布的可能性,从而将情感状态描述为情感的混合,丰富其描述,同时考虑到其感知的模糊性。在一组初步实验中,我们展示了该解决方案的优势以及一个新的可能研究方向。数据标注可在 https://github.com/jbcnrlz/affectnet-b-annotation 获取。
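The core mechanism, estimating how likely a VAD-annotated face is under each emotion's distribution and describing the state as a mixture, can be sketched as follows. The Gaussian form, shared isotropic spread, and the prototype VAD means below are illustrative assumptions, not the distributions from the study the paper relies on:

```python
import math

def gaussian_pdf(x, mean, std):
    """Isotropic 3D Gaussian density (diagonal covariance, shared std)."""
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2 * math.pi * std ** 2) ** (len(x) / 2)
    return math.exp(-d2 / (2 * std ** 2)) / norm

def emotion_mixture(vad, emotions):
    """Soft assignment of a VAD annotation to emotion distributions.

    emotions: {name: (mean_vad, std)}. Returns normalized likelihoods,
    i.e. the face described as a mixture of emotions.
    """
    scores = {name: gaussian_pdf(vad, mean, std)
              for name, (mean, std) in emotions.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# Hypothetical VAD prototypes on a [-1, 1] scale.
emotions = {
    "happy": ((0.8, 0.5, 0.4), 0.3),
    "sad": ((-0.6, -0.4, -0.3), 0.3),
    "angry": ((-0.5, 0.7, 0.3), 0.3),
}
mix = emotion_mixture((0.6, 0.4, 0.3), emotions)
```

A face annotated near the "happy" prototype then receives most, but not all, of the probability mass, which is exactly the ambiguity-preserving re-labeling the paper argues for.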
cs.CV / 65 / 2602.06786

Machine Learning for Detection and Severity Estimation of Sweetpotato Weevil Damage in Field and Lab Conditions

机器学习在田间和实验室条件下检测和评估甘薯象甲损害的应用
Chelangat, Doreen M., Murindanyi, Sudi, Mugizi, Bruce, Musana, Paul, Yada, Benard, Otema, Milton A., Osaru, Florence, Katumba, Andrew, Nakatumba-Nabende, Joyce
Abstract
Sweetpotato weevils (Cylas spp.) are considered among the most destructive pests impacting sweetpotato production, particularly in sub-Saharan Africa. Traditional methods for assessing weevil damage, predominantly relying on manual scoring, are labour-intensive, subjective, and often yield inconsistent results. These challenges significantly hinder breeding programs aimed at developing resilient sweetpotato varieties. This study introduces a computer vision-based approach for the automated evaluation of weevil damage in both field and laboratory contexts. In the field settings, we collected data to train classification models to predict root-damage severity levels, achieving a test accuracy of 71.43%. Additionally, we established a laboratory dataset and designed an object detection pipeline employing YOLO12, a leading real-time detection model. This methodology incorporated a two-stage laboratory pipeline that combined root segmentation with a tiling strategy to improve the detectability of small objects. The resulting model demonstrated a mean average precision of 77.7% in identifying minute weevil feeding holes. Our findings indicate that computer vision technologies can provide efficient, objective, and scalable assessment tools that align seamlessly with contemporary breeding workflows. These advancements represent a significant improvement in enhancing phenotyping efficiency within sweetpotato breeding programs and play a crucial role in mitigating the detrimental effects of weevils on food security.
Chinese Translation
甘薯象甲(Cylas spp.)被认为是影响甘薯生产的最具破坏性的害虫之一,尤其是在撒哈拉以南非洲。传统的象甲损害评估方法主要依赖人工评分,劳动强度大、主观性强,且常常导致结果不一致。这些挑战显著阻碍了旨在开发抗逆甘薯品种的育种计划。本研究提出了一种基于计算机视觉的方法,用于在田间和实验室环境中自动评估象甲损害。在田间设置中,我们收集了数据以训练分类模型,预测根部损伤的严重程度,测试准确率达到71.43%。此外,我们建立了实验室数据集,并设计了一个采用YOLO12(一个领先的实时检测模型)的目标检测流程。该方法结合了根部分割与平铺策略,形成了一个两阶段的实验室流程,以提高小物体的可检测性。最终模型在识别微小的象甲取食孔时表现出77.7%的平均精度。我们的研究结果表明,计算机视觉技术能够提供高效、客观且可扩展的评估工具,与现代育种工作流程无缝对接。这些进展在提高甘薯育种计划中的表型效率方面具有重要意义,并在减轻象甲对粮食安全的负面影响中发挥了关键作用。
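The tiling strategy for small-object detectability can be sketched as computing overlapping tile boxes over the (segmented) root image before running the detector on each tile; overlap keeps tiny feeding holes intact when they fall on a tile boundary. The tile size and overlap ratio below are illustrative, not the paper's settings:

```python
def make_tiles(width, height, tile=640, overlap=0.25):
    """Compute overlapping tile boxes (x0, y0, x1, y1) covering the image.

    Tiles are shifted by a stride of tile*(1-overlap); boxes at the right
    and bottom edges are snapped inward so every tile is full-sized.
    """
    stride = int(tile * (1 - overlap))
    boxes = []
    for y0 in range(0, max(height - tile, 0) + stride, stride):
        for x0 in range(0, max(width - tile, 0) + stride, stride):
            x1, y1 = min(x0 + tile, width), min(y0 + tile, height)
            boxes.append((x1 - tile if x1 - tile > 0 else 0,
                          y1 - tile if y1 - tile > 0 else 0, x1, y1))
    return boxes

tiles = make_tiles(1280, 960, tile=640, overlap=0.25)
```

Detections from each tile are then mapped back to image coordinates by adding the tile's (x0, y0) offset, and overlapping duplicates are merged (e.g. by non-maximum suppression).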
cs.CV / 66 / 2602.06805

A Unified Formula for Affine Transformations between Calibrated Cameras

校准相机之间的仿射变换统一公式
Hajder, Levente
Abstract
In this technical note, we derive a closed-form expression for the affine transformation mapping local image patches between two calibrated views. We show that the transformation is a function of the relative camera pose, the image coordinates, and the local surface normal.
Chinese Translation
在本技术说明中,我们推导了一个封闭形式的表达式,用于描述两个校准视图之间局部图像块的仿射变换。我们表明,该变换是相对相机姿态、图像坐标和局部表面法线的函数。
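The abstract states that the affine map depends on the relative pose, the image coordinates, and the local surface normal; the note's own closed form is not reproduced in the abstract, but the standard construction consistent with that statement is a hedged sketch worth recalling: the local patch map is the Jacobian of the plane-induced homography.

```latex
% Plane-induced homography between calibrated views, with relative pose
% (R, t), unit surface normal n, point-plane distance d, intrinsics K_1, K_2:
\[
  \mathbf{H} \;=\; \mathbf{K}_2
  \left( \mathbf{R} - \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d} \right)
  \mathbf{K}_1^{-1}
\]
% Writing H row-wise as (h_1^T; h_2^T; h_3^T) and x1~ = (u_1, v_1, 1)^T,
% the induced point map and its Jacobian (the affine patch transformation) are:
\[
  u_2 = \frac{\mathbf{h}_1^{\top}\tilde{\mathbf{x}}_1}
             {\mathbf{h}_3^{\top}\tilde{\mathbf{x}}_1},\qquad
  v_2 = \frac{\mathbf{h}_2^{\top}\tilde{\mathbf{x}}_1}
             {\mathbf{h}_3^{\top}\tilde{\mathbf{x}}_1},
\]
\[
  \mathbf{A} \;=\; \frac{1}{\mathbf{h}_3^{\top}\tilde{\mathbf{x}}_1}
  \begin{pmatrix}
    h_{11}-u_2\,h_{31} & h_{12}-u_2\,h_{32}\\
    h_{21}-v_2\,h_{31} & h_{22}-v_2\,h_{32}
  \end{pmatrix}
\]
```

This makes the stated dependence explicit: R, t enter through H, the surface normal through the n-term, and the image coordinates through x1~ and (u_2, v_2).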
cs.CV / 67 / 2602.06806

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

RAIGen:文本到图像生成模型中的稀有属性识别
Sreelatha, Silpa Vadakkeeveetil, Wang, Dan, Belongie, Serge, Awais, Muhammad, Dutta, Anjan
Abstract
Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, to our knowledge the first framework for unsupervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.
Chinese Translation
文本到图像扩散模型在生成质量上取得了显著的成就,但却继承并放大了训练数据中的偏见,导致语义属性的覆盖出现偏差。之前的研究主要通过两种方式来解决这个问题。封闭集方法在预定义的公平性类别(例如,性别、种族)中减轻偏见,假设社会上显著的少数属性是事先已知的。开放集方法则将任务框定为偏见识别,强调主导输出的多数属性。然而,两者都忽视了一个互补的任务:揭示在数据分布中被低估的稀有或少数特征(社会、文化或风格),这些特征仍然在模型表示中编码。我们介绍了RAIGen,这是我们所知的第一个无监督稀有属性发现框架,专门针对扩散模型。RAIGen利用了Matryoshka稀疏自编码器和一种新颖的少数群体度量,该度量结合了神经元激活频率与语义独特性,以识别可解释的神经元,其最高激活的图像揭示了被低估的属性。实验表明,RAIGen能够发现超出固定公平性类别的属性,适用于更大规模的模型如SDXL,支持跨架构的系统审计,并在生成过程中实现对稀有属性的有针对性放大。
cs.CV / 68 / 2602.06830

GaussianPOP: Principled Simplification Framework for Compact 3D Gaussian Splatting via Error Quantification

GaussianPOP:基于误差量化的紧凑型3D高斯点云简化原则框架
Lee, Soonbin, Kim, Yeong-Gyu, Sasse, Simon, Borges, Tomas M., Sanchez, Yago, Ryu, Eun-Seok, Schierl, Thomas, Hellge, Cornelius
Abstract
Existing 3D Gaussian Splatting simplification methods commonly use importance scores, such as blending weights or sensitivity, to identify redundant Gaussians. However, these scores are not driven by visual error metrics, often leading to suboptimal trade-offs between compactness and rendering fidelity. We present GaussianPOP, a principled simplification framework based on analytical Gaussian error quantification. Our key contribution is a novel error criterion, derived directly from the 3DGS rendering equation, that precisely measures each Gaussian's contribution to the rendered image. By introducing a highly efficient algorithm, our framework enables practical error calculation in a single forward pass. The framework is both accurate and flexible, supporting pruning during training as well as post-training simplification via iterative error re-quantification for improved stability. Experimental results show that our method consistently outperforms existing state-of-the-art pruning methods across both application scenarios, achieving a superior trade-off between model compactness and high rendering quality.
Chinese Translation
现有的3D高斯点云简化方法通常使用重要性评分,如混合权重或敏感度,来识别冗余的高斯分布。然而,这些评分并未基于视觉误差指标,往往导致紧凑性与渲染保真度之间的次优权衡。我们提出了GaussianPOP,这是一种基于解析高斯误差量化的原则性简化框架。我们的主要贡献是一个新颖的误差标准,直接源自3DGS渲染方程,精确测量每个高斯分布对渲染图像的贡献。通过引入一种高效的算法,我们的框架在单次前向传递中实现了实用的误差计算。该框架既准确又灵活,支持训练中的剪枝以及通过迭代误差重新量化进行的训练后简化,以提高稳定性。实验结果表明,我们的方法在两种应用场景中始终优于现有的最先进剪枝方法,实现了模型紧凑性与高渲染质量之间的优越权衡。
cs.CV / 69 / 2602.06850

Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping

重新思考多条件扩散变换器:通过位置对齐和关键词范围消除冗余注意力
Zhou, Chao, Wei, Tianyi, Chen, Yiling, Zhou, Wenbo, Yu, Nenghai
Abstract
While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional ``concatenate-and-attend'' strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0$\times$ inference speedup and a 5.1$\times$ VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-conditioned generation.
Chinese Translation
尽管现代文本到图像模型在基于提示的生成方面表现出色,但它们往往缺乏满足特定用户需求(如空间布局或主题外观)的细粒度控制。多条件控制解决了这一问题,但其在扩散变换器(DiTs)中的整合受到传统“连接并关注”策略的瓶颈,这种策略在条件数量增加时会导致二次的计算和内存开销。我们的分析揭示了许多跨模态交互在空间或语义上是冗余的。为此,我们提出了位置对齐和关键词范围注意力(PKA),这是一个旨在消除这些冗余的高效框架。具体而言,位置对齐注意力(PAA)通过强制局部补丁对齐来线性化空间控制,而关键词范围注意力(KSA)通过语义感知掩蔽来修剪无关的主题驱动交互。为了促进高效学习,我们进一步引入了一种条件敏感性意识采样(CSAS)策略,该策略重新加权训练目标,以关注关键的去噪阶段,从而大幅加速收敛并增强条件保真度。实证结果表明,PKA实现了10.0倍的推理加速和5.1倍的显存节省,为高保真多条件生成提供了可扩展且资源友好的解决方案。
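The keyword-scoping idea, pruning subject-condition interactions so that queries attend only to semantically relevant condition tokens, can be sketched as a masked attention step. The masking mechanics below follow the abstract's description only loosely; the actual KSA mask construction (how keywords are selected) is not specified there, so the boolean mask is taken as given:

```python
import numpy as np

def keyword_scoped_attention(q, k, v, keyword_mask):
    """Attention where queries may only attend to keyword-flagged keys.

    q: (Nq, d) image-token queries; k, v: (Nk, d) condition-token keys/values;
    keyword_mask: (Nk,) bool, True keeps the key in scope.
    Returns (output, attention_weights).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Prune non-keyword keys: a large negative score zeroes them after softmax.
    scores = np.where(keyword_mask[None, :], scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
mask = np.array([True, False, True, False, False, True])
out, w = keyword_scoped_attention(q, k, v, mask)
```

Because masked columns receive exactly zero attention weight, the cost of the cross-modal interaction scales with the number of in-scope keywords rather than the full condition length, which is where the claimed efficiency comes from.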
cs.CV / 70 / 2602.06862

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

参数作为专家:通过动态参数路由适应视觉模型
Lou, Meng, Yu, Stanley, Yu, Yizhou
Abstract
Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: https://bit.ly/3NZcr0H.
Chinese Translation
使用参数高效微调(PEFT)来适应预训练视觉模型仍然具有挑战性,因为它旨在以最少的可训练参数数量实现与完全微调相当的性能。在复杂的密集预测任务中,现有方法表现出局限性,包括输入无关建模和冗余的跨层表示。为此,我们提出了AdaRoute,一种新的适配器风格方法,具有简单的专家混合(MoE)架构。具体而言,我们引入了共享专家中心,其中每个专家都是一个可训练的参数矩阵。在前向传播过程中,网络中的每个AdaRoute模块通过简单的动态参数路由机制动态生成针对当前模块的权重矩阵,选择性地聚合相应专家中心中的参数矩阵。AdaRoute模块中的动态权重矩阵以输入依赖的方式促进低秩适应,从而生成更定制化和强大的特征表示。此外,由于多个网络层中的AdaRoute模块共享相同的专家中心,它们通过促进隐式跨层特征交互来提高特征多样性。大量实验表明,AdaRoute在语义分割、目标检测、实例分割和全景分割等多种视觉任务上具有优越性。代码将在以下链接提供:https://bit.ly/3NZcr0H。
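The dynamic parameter routing described, a router that selectively aggregates matrices from a shared expert center into an input-dependent low-rank update, can be sketched as below. The router form (softmax over the mean token feature), the initialization, and all shapes are illustrative assumptions; the paper's exact module design may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AdaRouteSketch:
    """Input-dependent weight generation from a shared expert center.

    experts: (E, d, r) trainable parameter matrices, shared across layers;
    each module mixes them per input into a (d, r) down-projection and
    applies a low-rank residual update.
    """
    def __init__(self, d, r, n_experts, rng):
        self.experts = rng.normal(0, 0.02, size=(n_experts, d, r))
        self.router = rng.normal(0, 0.02, size=(d, n_experts))
        self.up = rng.normal(0, 0.02, size=(r, d))

    def __call__(self, h):
        # Route on the mean token feature, then aggregate expert matrices.
        gate = softmax(h.mean(axis=0) @ self.router)      # (E,) mixing weights
        down = np.tensordot(gate, self.experts, axes=1)   # (d, r) dynamic matrix
        return h + (h @ down) @ self.up                   # low-rank adaptation

rng = np.random.default_rng(0)
module = AdaRouteSketch(d=16, r=4, n_experts=8, rng=rng)
h = rng.normal(size=(10, 16))
out = module(h)
```

Sharing `experts` across every AdaRoute module in the network is what enables the implicit cross-layer interaction the abstract mentions: different layers read different mixtures of the same parameter pool.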
cs.CV / 71 / 2602.06871

RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

RFDM:用于高效因果视频编辑的残差流扩散模型
Salehi, Mohammadreza, Noroozi, Mehdi, Morreale, Luca, Chavhan, Ruchika, Chadwick, Malcolm, Ramos, Alberto Gil, Mehrotra, Abhinav
Abstract
Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model's prediction at t-1. To leverage videos' temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in: https://smsd75.github.io/RFDM_page/
Chinese Translation
教学视频编辑通过仅使用文本提示对输入视频进行编辑,实现了直观的自然语言控制。尽管取得了快速进展,但大多数方法仍然需要固定长度的输入和大量计算资源。同时,自回归视频生成实现了高效的可变长度合成,但在视频编辑中的应用仍然未被充分探索。我们提出了一种因果的高效视频编辑模型,逐帧编辑可变长度视频。为了提高效率,我们从二维图像到图像(I2I)扩散模型出发,并通过在时间步t上将编辑条件化于模型在t-1时的预测,将其适配为视频到视频(V2V)编辑。为了利用视频的时间冗余,我们提出了一种新的I2I扩散前向过程公式,鼓励模型预测目标输出与先前预测之间的残差。我们将其称为残差流扩散模型(Residual Flow Diffusion Model, RFDM),该模型将去噪过程集中于连续帧之间的变化。此外,我们提出了一个新的基准,更好地对编辑任务的最先进方法进行排名。在针对全局/局部风格转移和物体移除的配对视频数据上训练后,RFDM超越了基于I2I的方法,并与完全时空(3D)V2V模型竞争,同时在计算资源上与图像模型相匹配,并独立于输入视频长度进行扩展。更多内容请访问:https://smsd75.github.io/RFDM_page/
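The residual-flow idea, conditioning the edit at step t on the prediction at t-1 and regressing the residual between consecutive frames, can be sketched as a flow-matching training pair. This is a simplified reading of the abstract (no noise schedule or conditioning details), and the function name is illustrative:

```python
import numpy as np

def residual_flow_pair(prev_pred, target, t):
    """Construct a flow-matching training pair between consecutive frames.

    Interpolates from the previous frame's prediction toward the current
    target frame; the velocity regression target is their residual, so the
    denoiser focuses on what changed between frames.
    """
    x_t = (1.0 - t) * prev_pred + t * target
    velocity_target = target - prev_pred  # the residual the model learns
    return x_t, velocity_target

prev_pred = np.zeros((4, 4))   # stand-in for the frame predicted at t-1
target = np.ones((4, 4))       # stand-in for the edited current frame
x_t, v = residual_flow_pair(prev_pred, target, t=0.25)
```

Under this parameterization, integrating the predicted velocity from any intermediate state recovers the target (x_t + (1 - t) * v == target), and when consecutive frames are similar the residual is near zero, which is the temporal redundancy the method exploits.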
cs.CV / 72 / 2602.06879

NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

NanoFLUX:面向移动设备的大型文本到图像生成模型的蒸馏驱动压缩
Chavhan, Ruchika, Chadwick, Malcolm, Ramos, Alberto Gil Couto Pimentel, Morreale, Luca, Noroozi, Mehdi, Mehrotra, Abhinav
Abstract
While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) A model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) A ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) A novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512 x 512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.
Chinese Translation
尽管大规模文本到图像扩散模型在视觉质量上不断提升,但其日益增长的规模加大了最先进模型与设备端解决方案之间的差距。为了解决这一问题,我们提出了NanoFLUX,这是一个从17B FLUX.1-Schnell蒸馏而来的2.4B文本到图像流匹配模型,采用了一种旨在保持生成质量的渐进压缩管道。我们的贡献包括:(1)一种通过剪枝扩散变换器中冗余组件驱动的模型压缩策略,将模型大小从12B减少到2B;(2)一种基于ResNet的标记下采样机制,通过允许中间块在较低分辨率的标记上操作,同时在其他地方保持高分辨率处理,从而减少延迟;(3)一种新颖的文本编码器蒸馏方法,在采样过程中利用去噪器早期层的视觉信号。实证结果表明,NanoFLUX在移动设备上生成512 x 512的图像大约需要2.5秒,展示了高质量设备端文本到图像生成的可行性。
cs.CV / 73 / 2602.06886

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

提示再注入:缓解多模态扩散变换器中的提示遗忘
Yao, Yuxuan, Chen, Yuxuan, Li, Hui, Cheng, Kaihui, Guo, Qipeng, Sun, Yuwei, Dong, Zilong, Wang, Jingdong, Zhu, Siyu
Abstract
Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs--SD3, SD3.5, and FLUX.1 by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text--image generation quality.
Chinese Translation
多模态扩散变换器(MMDiTs)用于文本到图像生成,保持文本和图像分支的独立性,并在去噪过程中实现文本标记与视觉潜变量之间的双向信息流。在这种设置下,我们观察到一种提示遗忘现象:随着深度的增加,文本分支中提示表示的语义逐渐被遗忘。我们进一步通过探测文本分支中各层表示的语言属性,验证了这一效应在三个代表性的MMDiTs——SD3、SD3.5和FLUX.1上的存在。基于这些发现,我们提出了一种无训练的方法——提示再注入(prompt reinjection),该方法将早期层的提示表示再注入到后期层,以减轻这种遗忘现象。在GenEval、DPG和T2I-CompBench++上的实验显示,指令遵循能力持续提升,同时在捕捉偏好、美学和整体文本-图像生成质量的指标上也有所改善。
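Since prompt reinjection is training-free, its control flow is easy to sketch: snapshot the text-branch hidden state at an early layer and blend it back into later layers. The blend rule, layer indices, and `alpha` below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def forward_with_reinjection(text_h, layers, snapshot_after=2,
                             reinject_at=(6, 9), alpha=0.5):
    """Run text-branch layers, blending an early-layer snapshot back in later.

    layers: list of callables (one per block). The hidden state captured
    after layer `snapshot_after` is mixed into the outputs of layers listed
    in `reinject_at`. Training-free: no layer weights change.
    """
    snapshot = None
    for i, layer in enumerate(layers):
        text_h = layer(text_h)
        if i == snapshot_after:
            snapshot = text_h.copy()
        elif snapshot is not None and i in reinject_at:
            # Reinjection: restore early-layer prompt semantics.
            text_h = alpha * text_h + (1 - alpha) * snapshot
    return text_h

# Toy stand-in for the text branch: random tanh blocks.
rng = np.random.default_rng(0)
mats = [rng.normal(0, 0.3, size=(8, 8)) for _ in range(10)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in mats]
h0 = rng.normal(size=(5, 8))
out = forward_with_reinjection(h0, layers)
baseline = forward_with_reinjection(h0, layers, reinject_at=())
```

Comparing `out` against `baseline` (the same layers with no reinjection) shows the intervention changes the deep-layer representation, which is the lever the method uses against prompt forgetting.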
cs.CV / 74 / 2602.06912

PANC: Prior-Aware Normalized Cut for Object Segmentation

PANC:基于先验知识的归一化切割用于物体分割
Gutiérrez, Juan, Gutiérrez-Garcia, Victor, Blanco-Murillo, José Luis
Abstract
Fully unsupervised segmentation pipelines naively seek the most salient object, should this be present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. From the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly supervised and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). Moreover, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
Chinese Translation
完全无监督的分割管道天真地寻求最显著的物体(如果存在的话)。因此,文献中报告的大多数方法提供了对初始化、种子顺序和阈值启发式敏感的非确定性分割。我们提出了PANC,一种弱监督的谱分割框架,利用一小部分标注的视觉标记生成稳定、可控和可重复的物体掩膜。基于TokenCut方法,我们通过与锚节点相结合的一小部分先验信息增强了标记-标记亲和图。通过操控图的拓扑结构,我们将谱特征空间偏向与标注一致的分割。我们的方法保留了由密集自监督视觉特征强加的全局分组,使用标注的标记换取在可重复性、用户控制和分割质量上的显著提升。使用每个数据集5到30个标注,我们的无训练方法在标准基准(例如DUTS-TE、ECSSD、MS COCO)上在弱监督和无监督方法中实现了最先进的性能。相反,在密集标签成本高或类内差异微妙的领域中,它表现出色。我们在同质、细粒度和纹理有限的领域报告了强大且可靠的结果,在CrackForest (CFD)、CUB-200-2011和HAM10000数据集上分别达到了96.8%(比最先进方法提高14.43%)、78.0%(提高0.2%)和78.8%(提高0.37%)的平均交并比(mIoU)。对于多物体基准,该框架展示了明确的、用户可控的语义分割。
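The graph-topology manipulation can be sketched as attaching annotated tokens to a virtual anchor node before taking the normalized cut, so the Fiedler eigenvector is biased toward a partition consistent with the annotations. The anchor construction and edge weight below are an illustrative reading of the abstract, not PANC's exact formulation:

```python
import numpy as np

def fiedler_partition(W):
    """Bipartition by the sign of the normalized Laplacian's 2nd eigenvector."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1] >= 0           # sign of the Fiedler vector

def add_anchor_prior(W, anchored, weight=5.0):
    """Attach annotated tokens to a virtual anchor node, biasing the cut.

    anchored: indices of tokens flagged as object by the annotations.
    """
    n = len(W)
    W2 = np.zeros((n + 1, n + 1))
    W2[:n, :n] = W
    W2[n, anchored] = W2[anchored, n] = weight
    return W2

# Toy token graph: two weakly connected groups of tokens.
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
labels = fiedler_partition(add_anchor_prior(W, anchored=[0, 1]))[:6]
```

Because the anchor edges raise the cost of any cut separating the annotated tokens from each other, the spectral partition is pulled toward the annotated group while the dense-feature affinities still decide the rest, which is the determinism-for-annotations trade the paper describes.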
cs.CV / 75 / 2602.06914

Seeing Beyond Redundancy: Task Complexity's Role in Vision Token Specialization in VLLMs

超越冗余的视角:任务复杂性在视觉大语言模型中的视觉标记专业化作用
Hannan, Darryl, Cooper, John, White, Dylan, Watkins, Yijing
Abstract
Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. Then, we explore fine-tuning VLLMs on a number of complex visual tasks to better understand how redundancy and compression change based upon the complexity of the data that a model is trained on. We find that there is a connection between task complexity and visual compression, implying that having a sufficient ratio of high complexity visual data is crucial for altering the way that VLLMs distribute their visual representation and consequently improving their performance on complex visual tasks. We hope that this work will provide valuable insights for training the next generation of VLLMs.
Chinese Translation
视觉大语言模型(VLLMs)在视觉能力方面的表现始终落后于其语言能力。特别是,众多基准研究表明,当需要细致的视觉信息或空间推理时,VLLMs表现不佳。然而,我们尚未完全理解为什么VLLMs在这些任务上相较于其他任务如此挣扎。一些研究将视觉冗余作为解释,认为高层次的视觉信息均匀分布在多个标记中,而特定的、细致的视觉信息则被丢弃。在本研究中,我们更详细地探讨了这一前提,旨在更好地理解模型如何处理各种类型的视觉信息以及哪些类型的视觉信息被丢弃。为此,我们引入了一个简单的合成基准数据集,专门构建以探测各种视觉特征,并提供一套用于测量视觉冗余的指标,从而使我们能够更好地理解它们之间的细微关系。接着,我们探索了在多个复杂视觉任务上对VLLMs进行微调,以更好地理解冗余和压缩如何根据模型训练的数据复杂性而变化。我们发现任务复杂性与视觉压缩之间存在关联,这意味着拥有足够比例的高复杂度视觉数据对于改变VLLMs分配视觉表征的方式至关重要,从而改善其在复杂视觉任务上的表现。我们希望本研究能够为下一代VLLMs的训练提供有价值的见解。
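The redundancy the abstract describes, high-level information spread uniformly across vision tokens, can be pictured with a toy metric. The sketch below is our own illustration (the paper introduces its own set of metrics): it scores a token set by mean pairwise cosine similarity, so near-duplicate tokens score close to 1 while specialized tokens score lower.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def redundancy(tokens):
    """Mean pairwise cosine similarity over a set of vision-token
    embeddings: values near 1 mean the same information is spread
    uniformly across tokens; lower values indicate specialization."""
    n = len(tokens)
    sims = [cosine(tokens[i], tokens[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

uniform = [[1.0, 1.0], [1.0, 1.01], [1.01, 1.0]]  # near-identical tokens
special = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # specialized tokens
```

Tracking such a statistic before and after fine-tuning on complex tasks is one simple way to observe the compression shifts the paper reports.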
cs.CV / 76 / 2602.06938

Reliable Mislabel Detection for Video Capsule Endoscopy Data

视频胶囊内窥镜数据的可靠错误标注检测
Werner, Julia, Oexle, Julius, Bause, Oliver, Floch, Maxime Le, Brinkmann, Franz, Tolle, Hannah, Hampe, Jochen, Bringmann, Oliver
Abstract
The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging since annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries can often be ambiguous or difficult to define, which further complicates machine learning-based classification. In this paper, we address this problem and introduce a framework for mislabel detection in medical datasets. It is validated on the two largest publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of low-resolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and yields improved anomaly detection performance after cleaning the datasets compared to current baselines.
Chinese Translation
深度神经网络的分类性能在很大程度上依赖于对大规模、准确标注的数据集的获取。然而,在医学影像领域,获取这样的数据集尤其具有挑战性,因为标注必须由专业医生提供,这严重限制了标注者的数量。此外,类别边界往往模糊或难以定义,这进一步复杂化了基于机器学习的分类。在本文中,我们旨在解决这一问题,并提出一个用于医学数据集错误标注检测的框架。该框架在两个最大的公开可用视频胶囊内窥镜数据集上进行了验证,这是一种基于低分辨率图像视频流检查胃肠道的重要影像学程序。此外,我们的管道识别出的潜在错误标注样本由三位经验丰富的胃肠病专家进行了审查和重新标注。我们的结果表明,所提出的框架成功检测出错误标注的数据,并且在清理数据集后,相较于当前基线,异常检测性能得到了改善。
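The paper does not publish its pipeline in the abstract, but the core idea of mislabel detection can be sketched with a simplified confident-learning-style heuristic (function name, per-class thresholding, and the toy numbers below are our own illustration): flag samples whose out-of-fold predicted probability for their assigned label falls below a per-class confidence cutoff.

```python
def flag_mislabels(pred_probs, labels, threshold=None):
    """Flag indices whose predicted probability for the assigned label
    is below a per-class threshold. pred_probs[i] maps class -> prob
    and would come from cross-validated model predictions in practice;
    the default threshold is each class's mean self-confidence."""
    classes = sorted(set(labels))
    if threshold is None:
        threshold = {
            c: sum(p[c] for p, l in zip(pred_probs, labels) if l == c)
               / max(1, sum(1 for l in labels if l == c))
            for c in classes
        }
    return [i for i, (p, l) in enumerate(zip(pred_probs, labels))
            if p[l] < threshold[l]]

# toy out-of-fold probabilities for 3 samples, 2 classes
pred_probs = [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}, {0: 0.1, 1: 0.9}]
labels = [0, 0, 1]
flagged = flag_mislabels(pred_probs, labels)  # sample 1 looks mislabeled
```

In the paper's setting, the flagged subset is what gets forwarded to the gastroenterologists for review and re-annotation.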
cs.CV / 77 / 2602.06959

CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

CineScene:隐式3D作为电影视频生成的有效场景表示
Huang, Kaiyi, Huang, Yukun, Li, Yu, Bai, Jianhong, Wang, Xintao, Lin, Zinan, Ning, Xuefei, Yu, Jiwen, Wan, Pengfei, Wang, Yu, Liu, Xihui
Abstract
Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subjects while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.
Chinese Translation
电影视频制作需要对场景-主体构图和相机运动进行控制,但由于需要构建物理场景,实拍仍然成本高昂。为了解决这一问题,我们引入了具有解耦场景上下文的电影视频生成任务:给定静态环境的多张图像,目标是合成高质量的视频,展现动态主体,同时保持基础场景的一致性并遵循用户指定的相机轨迹。我们提出了CineScene,一个利用隐式3D感知场景表示进行电影视频生成的框架。我们的关键创新是一个新颖的上下文条件机制,以隐式方式注入3D感知特征:通过VGGT将场景图像编码为视觉表示,CineScene通过额外的上下文连接将空间先验注入到预训练的文本到视频生成模型中,从而实现具有一致场景和动态主体的相机控制视频合成。为了进一步增强模型的鲁棒性,我们在训练过程中引入了一种简单而有效的输入场景图像随机打乱策略。为了解决训练数据不足的问题,我们使用虚幻引擎5构建了一个场景解耦数据集,包含带有与不带有动态主体的场景配对视频、代表基础静态场景的全景图像以及它们的相机轨迹。实验表明,CineScene在场景一致的电影视频生成中达到了最先进的性能,能够处理大幅度的相机运动,并在多样化环境中展示了良好的泛化能力。
cs.CV / 78 / 2602.06965

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

MedMO:面向医学图像的多模态大型语言模型的定位与理解
Deria, Ankan, Kumar, Komal, Dukre, Adinath Madhavrao, Segal, Eran, Khan, Salman, Razzak, Imran
Abstract
Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4% over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at https://genmilab.github.io/MedMO-Page
Chinese Translation
多模态大型语言模型(MLLMs)迅速发展,但其在医学领域的应用仍受到领域覆盖、模态对齐和有依据的推理(grounded reasoning)等方面的限制。在本研究中,我们介绍了MedMO,一种基于通用MLLM架构构建的医学基础模型,专门在大规模领域特定数据上进行训练。MedMO遵循多阶段的训练流程:(i)跨模态预训练,以将异构视觉编码器与医学语言骨干对齐;(ii)在涵盖图像描述、视觉问答(VQA)、报告生成、检索和基于边界框的疾病定位等多任务监督下进行指令调优;(iii)通过可验证奖励的强化学习,将事实检查与框级GIoU奖励相结合,以增强复杂临床场景中的空间定位和逐步推理能力。MedMO在多个模态和任务上始终优于强大的开源医学MLLMs。在VQA基准测试中,MedMO的平均准确率比基线提高了13.7%,并且与SOTA模型Fleming-VL的差距在1.9%以内。在基于文本的问答中,MedMO比基线提高了6.9%,比Fleming-VL提高了14.5%。在医学报告生成中,MedMO在语义和临床准确性上均取得了显著提升。此外,它表现出强大的定位(grounding)能力,IoU比基线提高了40.4%,比Fleming-VL提高了37.0%,突显了其强大的空间推理和定位性能。在放射学、眼科学和病理显微镜学的评估中,确认了MedMO的广泛跨模态泛化能力。我们发布了MedMO的两个版本:4B和8B。项目可在https://genmilab.github.io/MedMO-Page获取。
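The box-level GIoU reward in stage (iii) refers to the standard generalized IoU between bounding boxes; a minimal sketch for axis-aligned `(x1, y1, x2, y2)` boxes (the paper does not publish its exact reward code, so this is only the textbook formula):

```python
def giou(a, b):
    """Generalized IoU between axis-aligned boxes (x1, y1, x2, y2).
    Equals IoU minus the fraction of the enclosing hull not covered
    by the union; ranges over [-1, 1] and stays informative even
    for disjoint boxes, which plain IoU (always 0 there) does not."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # union
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # smallest enclosing (hull) box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    hull = cw * ch
    return inter / union - (hull - union) / hull
```

Used as a reward, a perfect localization scores 1.0 and a far-off box goes negative, giving the policy gradient signal even before boxes overlap.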
人工智能 (Artificial Intelligence)
28
cs.AI / 1 / 2602.06107

Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Jackpot:极端演员-策略不匹配强化学习的最优预算拒绝采样
Chen, Zhuoming, Liu, Hongyi, Zhou, Yang, Zheng, Haizhong, Chen, Beidi
Abstract
Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation dominates the cost. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model to roll out) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps at batch size 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
Chinese Translation
针对大型语言模型(LLMs)的强化学习(RL)仍然成本高昂,很大程度上是因为rollout(轨迹生成)的开销占主导。将rollout生成与策略优化解耦(例如,利用更高效的模型进行rollout)可以显著提高效率,但这样做会引入严重的分布不匹配,从而使学习不稳定。我们提出了Jackpot,一个利用最优预算拒绝采样(Optimal Budget Rejection Sampling, OBRS)直接减少rollout模型与不断演变的策略之间差异的框架。Jackpot集成了一个原则性的OBRS程序、一个共同更新策略模型与rollout模型的统一训练目标,以及通过top-$k$概率估计和批量级偏差校正实现的高效系统实现。我们的理论分析表明,在可控的接受预算下,OBRS始终将rollout分布向目标分布靠拢。在实证上,Jackpot相比重要性采样基线显著提高了训练稳定性,在以批量大小64对Qwen3-8B-Base进行多达300步更新的训练中,达到了与同策略(on-policy)RL相当的性能。综合来看,我们的结果表明,基于OBRS的对齐使我们更接近于在LLM强化学习中切实有效地将rollout生成与策略优化解耦。
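The core of budgeted rejection sampling can be sketched numerically (our own illustration; the paper's estimator works from top-$k$ probabilities and trains the constant jointly, which we replace here with a simple bisection): given density ratios r = p(x)/q(x) for proposal draws, find the scaling constant c so that the mean acceptance probability min(1, r/c) matches a target budget.

```python
def obrs_acceptance(p_over_q, budget, lo=1e-6, hi=1e6, iters=60):
    """Given ratios r_i = p(x_i)/q(x_i) for draws from the rollout
    model q, bisect the constant c so that the expected acceptance
    rate mean_i min(1, r_i / c) equals the budget. Accepting draw i
    with probability min(1, r_i / c) then moves the sampled
    distribution toward the policy p as far as the budget allows."""
    for _ in range(iters):
        c = (lo + hi) / 2
        rate = sum(min(1.0, r / c) for r in p_over_q) / len(p_over_q)
        if rate > budget:   # acceptance too generous: raise c
            lo = c
        else:               # too strict: lower c
            hi = c
    c = (lo + hi) / 2
    return [min(1.0, r / c) for r in p_over_q]

# draws the policy likes (high ratio) are kept more often
acc = obrs_acceptance([2.0, 1.0, 0.5, 0.1], budget=0.5)
```

The budget caps how many rollouts are discarded, which is the knob trading extra sampling cost against residual policy-rollout mismatch.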
cs.AI / 2 / 2602.06176

Large Language Model Reasoning Failures

大型语言模型的推理失败
Song, Peiyang, Han, Pengrui, Goodman, Noah
Abstract
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.
Chinese Translation
大型语言模型(LLMs)展现出了显著的推理能力,在广泛的任务中取得了令人印象深刻的结果。尽管取得了这些进展,但在看似简单的场景中,仍然存在显著的推理失败。为了系统地理解和解决这些不足,我们首次提出了一项全面的调查,专门针对LLMs中的推理失败。我们引入了一种新的分类框架,将推理分为具身(embodied)与非具身(non-embodied)类型,后者进一步细分为非正式(直观)推理和正式(逻辑)推理。同时,我们沿着一个互补的轴线将推理失败分为三种类型:内在于LLM架构的基本失败,这些失败广泛影响下游任务;在特定领域中表现出的应用特定限制;以及在细微变化下表现不一致的鲁棒性问题。对于每种推理失败,我们提供了清晰的定义,分析了现有研究,探讨了根本原因,并提出了缓解策略。通过统一分散的研究努力,我们的调查为LLM推理中的系统性弱点提供了结构化的视角,提供了有价值的见解,并指导未来的研究,以构建更强大、更可靠和更鲁棒的推理能力。此外,我们还发布了一个关于LLM推理失败的研究文献的综合集合,作为GitHub库,网址为https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures,以便为该领域提供一个便捷的入门点。
cs.AI / 3 / 2602.06227

Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

为她而做:强化学习中的一阶时序逻辑奖励规范(扩展版)
Olivieri, Pierriccardo, Lasca, Fausto, Gianola, Alessandro, Papini, Matteo
Abstract
In this work, we propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), a more expressive extension of classical temporal logic in which predicates are first-order formulas of arbitrary first-order theories rather than simple Boolean variables. This enhanced expressiveness enables the specification of complex tasks over unstructured and heterogeneous data domains, promoting a unified and reusable framework that eliminates the need for manual predicate encoding. However, the increased expressive power of LTLfMT introduces additional theoretical and computational challenges compared to standard LTLf specifications. We address these challenges from a theoretical standpoint, identifying a fragment of LTLfMT that is tractable but sufficiently expressive for reward specification in an infinite-state-space context. From a practical perspective, we introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity. We evaluate this approach in a continuous-control setting using Non-Linear Arithmetic Theory, showing that it enables natural specification of complex tasks. Experimental results show how a tailored implementation of HER is fundamental in solving tasks with complex goals.
Chinese Translation
在本研究中,我们提出了一种新颖的框架,用于在具有大状态空间的马尔可夫决策过程(MDPs)中对非马尔可夫奖励进行逻辑规范。我们的方法利用了有限轨迹上的线性时序逻辑模理论(LTLfMT),这是一种比经典时序逻辑更具表现力的扩展,其中谓词是任意一阶理论的一阶公式,而不是简单的布尔变量。这种增强的表现力使得在非结构化和异构数据领域中规范复杂任务成为可能,促进了一个统一且可重用的框架,消除了手动谓词编码的需求。然而,LTLfMT的增强表达能力相较于标准的LTLf规范引入了额外的理论和计算挑战。我们从理论角度解决这些挑战,识别出一个既可处理、又在无限状态空间环境下具有足够表达力的LTLfMT片段。从实践的角度来看,我们引入了一种基于奖励机器和事后经验重放(HER)的方法,以翻译一阶逻辑规范并解决奖励稀疏性问题。我们在使用非线性算术理论的连续控制环境中评估了这种方法,结果表明它能够自然地规范复杂任务。实验结果显示,针对HER的定制实现对于解决具有复杂目标的任务至关重要。
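The reward-machine mechanism the method builds on can be sketched as a small automaton whose transitions fire on evaluated predicates (a generic sketch, not the paper's implementation; plain string labels stand in for LTLfMT's first-order atoms, and HER's goal relabeling is omitted):

```python
class RewardMachine:
    """Minimal reward machine: an automaton state tracks progress
    through a temporal specification, and reward is emitted only on
    reaching an accepting state, turning a non-Markovian objective
    into a Markovian one over (env state, machine state)."""

    def __init__(self, transitions, accepting):
        self.transitions = transitions  # {(state, label): next_state}
        self.accepting = accepting      # set of accepting states
        self.state = 0

    def step(self, true_labels):
        """Advance on the predicates that hold this step; return reward."""
        for lab in true_labels:
            nxt = self.transitions.get((self.state, lab))
            if nxt is not None:
                self.state = nxt
        return 1.0 if self.state in self.accepting else 0.0

# spec "eventually reach A, then reach B":
rm = RewardMachine({(0, "at_A"): 1, (1, "at_B"): 2}, accepting={2})
```

Because reward arrives only at the accepting state, the signal is sparse, which is exactly where the paper's tailored HER relabeling comes in.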
cs.AI / 4 / 2602.06286

Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making

大型语言模型是否像理性代理一样行动?测量概率决策中的信念一致性
Yamin, Khurram, Tang, Jingjing, Cortes-Gomez, Santiago, Sharma, Amit, Horvitz, Eric, Wilder, Bryan
Abstract
Large language models (LLMs) are increasingly deployed as agents in high-stakes domains where optimal actions depend on both uncertainty about the world and consideration of utilities of different outcomes, yet their decision logic remains difficult to interpret. We study whether LLMs are rational utility maximizers with coherent beliefs and stable preferences. We examine model behavior on diagnostic challenge problems. The results provide insights into the relationship of LLM inferences to ideal Bayesian utility maximization for elicited probabilities and observed actions. Our approach provides falsifiable conditions under which the reported probabilities cannot correspond to the true beliefs of any rational agent. We apply this methodology to multiple medical diagnostic domains with evaluations across several LLMs. We discuss implications of the results and directions forward for uses of LLMs in guiding high-stakes decisions.
Chinese Translation
大型语言模型(LLMs)越来越多地被部署为高风险领域的代理,其中最佳行动依赖于对世界的不确定性以及对不同结果效用的考虑,但它们的决策逻辑仍然难以解释。我们研究LLMs是否是具有一致信念和稳定偏好的理性效用最大化者。我们考察模型在诊断挑战问题中的行为。结果揭示了在引导出的概率与观察到的行动上,LLM推理与理想贝叶斯效用最大化之间的关系。我们的方法提供了可证伪的条件,在这些条件下,报告的概率不能对应于任何理性代理的真实信念。我们将此方法应用于多个医学诊断领域,并对多个LLMs进行了评估。我们讨论了结果的意义以及LLMs在高风险决策指导中的未来应用方向。
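The basic coherence check is the expected-utility test: an agent whose reported beliefs are its true beliefs must choose the action maximizing expected utility under those beliefs. A minimal sketch (our own illustration; the hypothetical diagnosis states, action names, and payoffs are invented, not from the paper):

```python
def expected_utility(probs, utilities):
    """probs[s] = elicited P(state s); utilities[action][s] = payoff
    of taking that action when state s is true."""
    return {a: sum(p * u[s] for s, p in enumerate(probs))
            for a, u in utilities.items()}

def is_utility_maximizing(probs, utilities, chosen, tol=1e-9):
    """Falsifiability test: if the chosen action does not maximize
    expected utility under the reported probabilities, those
    probabilities cannot be the true beliefs of a rational agent
    (for this utility structure)."""
    eu = expected_utility(probs, utilities)
    return eu[chosen] >= max(eu.values()) - tol

# hypothetical two-state diagnosis: states = (flu, pneumonia)
beliefs = [0.8, 0.2]
payoffs = {"treat_flu": [1.0, -2.0], "treat_pneumonia": [-0.5, 2.0]}
```

Running the check over many elicited-belief/action pairs is one way to accumulate the falsifying evidence the abstract describes.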
cs.AI / 5 / 2602.06319

Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems

通过图算法问题揭示大型推理模型的弱点
Zhang, Qifan, Ruan, Jianhao, Chen, Aochuan, Zeng, Kang, Chen, Nuo, Tang, Jing, Li, Jia
Abstract
Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common-sense reasoning remain limited. They lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second, LRMs suffer from an over-thinking phenomenon, primarily caused by extensive yet largely ineffective self-verification, which inflates reasoning traces without improving correctness. By exposing these limitations, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing the study of reasoning in LRMs. Code is available at https://github.com/Bklight999/GrAlgoBench.
Chinese Translation
大型推理模型(LRMs)发展迅速;然而,现有的数学、代码和常识推理基准仍然有限。它们缺乏长上下文评估,挑战性不足,并且提供的答案难以通过程序进行验证。我们引入了GrAlgoBench,这是一个旨在通过图算法问题评估LRMs的基准。这类问题特别适合探测推理能力:它们要求长上下文推理,允许对难度水平进行细致控制,并且能够实现标准化的程序评估。在九个任务中,我们的系统实验揭示了当前LRMs的两个主要弱点。首先,随着上下文长度的增加,准确性急剧下降,一旦图的节点超过120个,准确率便降至50%以下。这种退化是由于频繁的执行错误、记忆能力弱和冗余推理所导致。其次,LRMs遭遇了过度思考现象,主要是由于广泛但大多无效的自我验证,这导致推理轨迹膨胀而没有提高正确性。通过揭示这些局限性,GrAlgoBench确立了图算法问题作为一个严格的、多维的、具有实际相关性的测试平台,以推动对LRMs中推理研究的进展。代码可在 https://github.com/Bklight999/GrAlgoBench 获取。
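The "standardized, programmatic evaluation" that makes graph problems attractive is straightforward to realize: a reference solver checks any claimed answer exactly. A minimal sketch for shortest-path tasks (generic Dijkstra, not GrAlgoBench's harness):

```python
import heapq

def shortest_dist(adj, src, dst):
    """Dijkstra over an adjacency dict {u: [(v, weight), ...]}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist.get(dst, float("inf"))

def verify_answer(adj, src, dst, claimed):
    """Programmatic check of a model's claimed shortest-path length."""
    return claimed == shortest_dist(adj, src, dst)
```

Scaling the generated graph's node count is then a direct, fine-grained difficulty knob, which is how the benchmark exposes the accuracy cliff beyond 120 nodes.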
cs.AI / 6 / 2602.06351

Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion

Trifuse:通过多模态融合增强基于注意力的图形用户界面定位
Ma, Longhui, Zhao, Di, Wang, Siwei, Lv, Zhao, Wang, Miao
Abstract
GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.
Chinese Translation
图形用户界面(GUI)定位将自然语言指令映射到正确的界面元素,为GUI代理提供感知基础。现有的方法主要依赖于使用大规模GUI数据集对多模态大语言模型(MLLMs)进行微调,以预测目标元素坐标,这种方法数据密集且对未见过的界面泛化能力较差。近期的基于注意力的替代方案利用MLLMs注意力机制中的定位信号,而无需特定任务的微调,但由于缺乏GUI图像中明确且互补的空间锚点,导致可靠性较低。为了解决这一限制,我们提出了Trifuse,一种基于注意力的定位框架,明确整合互补的空间锚点。Trifuse通过共识-单峰(Consensus-SinglePeak, CS)融合策略整合注意力、OCR衍生的文本线索和图标级别的标题语义,强制跨模态一致性,同时保留清晰的定位峰值。在四个定位基准上的广泛评估表明,Trifuse在没有特定任务微调的情况下实现了强大的性能,显著减少了对昂贵标注数据的依赖。此外,消融研究表明,结合OCR和标题线索在不同骨干网络上持续改善基于注意力的定位性能,突显其作为GUI定位通用框架的有效性。
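The consensus idea behind CS fusion can be pictured with a toy element-wise product of per-cue score maps (our own simplification over a flattened 1D grid; the paper's actual strategy also enforces a single-peak criterion we do not model here):

```python
def fuse_maps(maps):
    """Element-wise product of normalized per-cue maps (attention,
    OCR, caption): a location scores high only when every cue agrees
    (consensus), and multiplication sharpens the surviving peak."""
    fused = [1.0] * len(maps[0])
    for m in maps:
        total = sum(m) or 1.0
        for i, v in enumerate(m):
            fused[i] *= v / total
    return fused

# three cues over a 3-cell grid, all weakly favoring cell 1
attention = [0.1, 0.6, 0.3]
ocr       = [0.2, 0.7, 0.1]
caption   = [0.1, 0.8, 0.1]
fused = fuse_maps([attention, ocr, caption])
target = max(range(len(fused)), key=fused.__getitem__)
```

A cue that disagrees (near-zero at a location) vetoes that location in the product, which is the cross-modal agreement the abstract describes.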
cs.AI / 7 / 2602.06375

Difficulty-Estimated Policy Optimization

难度估计策略优化
Zhao, Yu, Jiang, Fan, Liu, Tianle, Zeng, Bo, Liu, Yu, Wang, Longyue, Luo, Weihua
Abstract
Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
Chinese Translation
近期在大型推理模型(Large Reasoning Models, LRM)方面的进展,以DeepSeek-R1为例,突显了通过群体相对策略优化(Group Relative Policy Optimization, GRPO)扩展推理时间计算的潜力。然而,GRPO在面对过于简单或过于复杂的问题时,常常遭遇梯度信号衰减。在这些情况下,组间优势的消失使得梯度信号容易受到噪声的干扰,从而危及收敛的稳定性。尽管像DAPO这样的变体试图纠正梯度消失,但并未缓解在低效样本上进行全面展开所带来的显著计算开销。本文提出了一种新框架——难度估计策略优化(Difficulty-Estimated Policy Optimization, DEPO),旨在优化推理对齐的效率和鲁棒性。DEPO集成了一个在线难度估计器,能够在展开阶段之前动态评估和过滤训练数据。该机制确保计算资源优先用于具有高学习潜力的样本。实证结果表明,DEPO在不影响模型性能的情况下,最多可将展开(rollout)成本减少一半。我们的方法显著降低了训练高性能推理模型的计算门槛,为推理扩展提供了更可持续的路径。代码和数据将在论文接受后发布。
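The filtering step can be sketched in a few lines (a generic illustration; the function name and the 0.1/0.9 cutoffs are our own, and the real difficulty estimator is learned online rather than given): drop prompts whose estimated pass rate is near 0 or 1, since a GRPO group with uniform outcomes has zero inter-group advantage and contributes no useful gradient.

```python
def select_for_rollout(samples, est_pass_rate, lo=0.1, hi=0.9):
    """Keep only prompts with intermediate estimated difficulty:
    near-certain successes and near-certain failures would yield
    all-same rewards within a GRPO group, hence vanishing advantage,
    so rolling them out wastes compute."""
    return [s for s in samples if lo <= est_pass_rate(s) <= hi]

# toy estimator: a lookup of predicted pass rates per prompt
rates = {"easy": 0.98, "mid": 0.5, "hard": 0.02}
kept = select_for_rollout(list(rates), rates.get)
```

Filtering before the rollout phase (rather than after) is what converts the vanishing-advantage problem into the reported rollout-cost savings.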
cs.AI / 8 / 2602.06394

Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

通过质量感知分词解锁嘈杂的真实世界语料库以进行基础模型预训练
Gollwitzer, Arvid E., Latawa, Paridhi, de Gruijl, David, Subramanian, Deepak A., de la Colina, Adrián Noriega
Abstract
Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: genomics (6.7 percentage point F1 gain in variant calling over BPE), finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
Chinese Translation
当前的分词方法在处理顺序数据时未考虑信号质量,限制了其在嘈杂的真实世界语料库上的有效性。我们提出了QA-Token(质量感知分词),该方法将数据可靠性直接融入词汇构建中。我们做出了三项关键贡献:(i)一种双层优化公式,联合优化词汇构建和下游性能;(ii)一种强化学习方法,通过质量感知奖励学习合并策略,并提供收敛保证;(iii)一种通过Gumbel-Softmax松弛实现端到端优化的自适应参数学习机制。我们的实验评估显示出一致的改进:在基因组学领域,相较于BPE,变异调用的F1值提升了6.7个百分点;在金融领域,夏普比率提高了30%。在基础规模下,我们对一个包含1.7万亿碱基对的预训练语料库进行了分词,并实现了最先进的病原体检测(94.53 MCC),同时减少了15%的标记数量。我们解锁了嘈杂的真实世界语料库,涵盖了PB级的基因组序列和TB级的金融时间序列,以实现基础模型训练,且没有推理开销。
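The quality-aware idea can be illustrated with a toy merge-scoring rule (our own stand-in for QA-Token's learned RL merge policy; per-sequence scalar qualities and the function name are simplifications): weight adjacent-pair counts by signal quality, so the vocabulary is shaped preferentially by reliable spans.

```python
from collections import defaultdict

def best_merge(sequences, qualities):
    """Score adjacent symbol pairs by quality-weighted frequency
    instead of raw counts (as plain BPE would), and return the pair
    a quality-aware tokenizer would merge next."""
    scores = defaultdict(float)
    for seq, q in zip(sequences, qualities):
        for pair in zip(seq, seq[1:]):
            scores[pair] += q
    return max(scores, key=scores.get)

# two reads with equal raw pair counts; quality breaks the tie
reads = ["ABAB", "CDCD"]
high_quality_first  = best_merge(reads, [1.0, 0.2])  # trusts read 1
low_quality_first   = best_merge(reads, [0.2, 1.0])  # trusts read 2
```

Because the reweighting happens at vocabulary-construction time, inference pays no extra cost, matching the abstract's "zero inference overhead" claim.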
cs.AI / 9 / 2602.06413

Intrinsic Stability Limits of Autoregressive Reasoning: Structural Consequences for Long-Horizon Execution

自回归推理的内在稳定性极限:对长时间执行的结构性影响
Liao, Hsien-Jyh
Abstract
Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their performance often deteriorates sharply in long-horizon tasks, exhibiting systematic breakdown beyond certain scales. Conventional explanations primarily attribute this phenomenon to task complexity, such as combinatorial search explosion or long-term credit assignment challenges. In this work, we argue that these explanations are incomplete: even in linear, unbranched tasks without semantic ambiguity, autoregressive execution is subject to an intrinsic stability limit. We propose that the fundamental constraint on long-horizon reasoning arises from process-level instability in autoregressive generation rather than solely from search or task complexity, reframing long-horizon reasoning as a problem of structural governance. We derive Theorem A, showing that decision advantage in single-path autoregressive reasoning decays exponentially with execution length, imposing a fundamental bound on maintainable reasoning chains. This result implies a structural consequence: stable long-horizon reasoning requires discrete segmentation, naturally inducing graph-like execution structures such as directed acyclic graphs (DAGs). Empirical studies in both synthetic environments and real TextWorld tasks reveal observable performance cliffs consistent with theoretical predictions. Our findings provide a dynamical perspective on long-horizon reasoning failure and suggest new limitations on maintaining long-term coherence under purely autoregressive architectures. Furthermore, we highlight that short-horizon evaluation protocols may obscure structural instability, indicating a potential shift from scaling toward structured governance in future reasoning systems.
Chinese Translation
大型语言模型(LLMs)展现出显著的推理能力,但在长时间任务中的表现往往急剧下降,超出某些规模后表现出系统性崩溃。传统的解释主要将这一现象归因于任务复杂性,例如组合搜索爆炸或长期信用分配挑战。在本研究中,我们认为这些解释是不完整的:即使在没有语义歧义的线性、无分支任务中,自回归执行也受到内在稳定性极限的制约。我们提出,长时间推理的基本约束源于自回归生成过程中的不稳定性,而不仅仅是搜索或任务复杂性的问题,从而将长时间推理重新框定为结构治理的问题。我们推导出定理A,表明在单路径自回归推理中,决策优势随着执行长度的增加而指数衰减,对可维持的推理链施加了基本限制。该结果暗示了一种结构性后果:稳定的长时间推理需要离散分段,自然引入图状执行结构,例如有向无环图(DAGs)。在合成环境和真实的TextWorld任务中的实证研究揭示了与理论预测一致的可观察性能悬崖。我们的发现为长时间推理失败提供了动态视角,并提出了在纯自回归架构下维持长期一致性的新的限制。此外,我们强调短时间评估协议可能掩盖结构不稳定性,表明未来推理系统可能需要从规模扩展转向结构治理。
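The exponential-decay bound has a simple back-of-envelope form: if each step succeeds independently with probability p, a length-L chain survives with p^L, so the longest chain keeping success above a target is floor(log(target)/log(p)). A quick sketch (the independence assumption is our simplification of the theorem's premise):

```python
import math

def max_horizon(step_accuracy, target_success):
    """Longest chain length L with step_accuracy**L >= target_success,
    i.e. floor(log(target) / log(p)): the multiplicative-decay bound
    behind the predicted performance cliff."""
    return math.floor(math.log(target_success) / math.log(step_accuracy))

# even 99% per-step accuracy caps >=50%-reliable chains at ~68 steps;
# pushing per-step accuracy to 99.9% buys roughly a 10x longer horizon
horizon_99 = max_horizon(0.99, 0.5)
horizon_999 = max_horizon(0.999, 0.5)
```

This is also why the abstract argues for segmentation: breaking a long chain into short verified segments resets the decay at each segment boundary.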
cs.AI / 10 / 2602.06485

AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

AgentCPM-Explore:实现边缘规模智能体的长时间深度探索
Chen, Haotian, Cong, Xin, Fan, Shengda, Fu, Yuyang, Gong, Ziqin, Lu, Yaxi, Li, Yishan, Niu, Boye, Pan, Chengjun, Song, Zijun, Wang, Huadong, Wu, Yesai, Wu, Yueying, Xie, Zihao, Yan, Yukun, Zhang, Zhong, Lin, Yankai, Liu, Zhiyuan, Sun, Maosong
Abstract
While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address the issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 on five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Based on our well-established training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models.
Chinese Translation
尽管基于大型语言模型(LLM)的智能体在解决复杂任务方面展现了显著潜力,但现有系统仍然严重依赖大规模模型,边缘规模模型的能力尚未得到充分探索。本文首次系统性研究了在4B参数规模下训练智能体模型。我们识别出三个主要瓶颈,阻碍了边缘规模模型的性能:在监督微调(SFT)过程中出现的灾难性遗忘、在强化学习(RL)中对奖励信号噪声的敏感性,以及在长上下文场景中冗余信息导致的推理退化。为了解决这些问题,我们提出了AgentCPM-Explore,这是一种具有高知识密度和强探索能力的紧凑型4B智能体模型。我们引入了一个全面的训练框架,包含参数空间模型融合、奖励信号去噪和上下文信息精炼。通过深度探索,AgentCPM-Explore在4B级模型中实现了最先进(SOTA)的性能,在四个基准测试中与8B级SOTA模型持平或超越,甚至在五个基准测试中超越了Claude-4.5-Sonnet或DeepSeek-v3.2等更大规模的模型。值得注意的是,AgentCPM-Explore在GAIA文本任务中以pass@64达到了97.09%的准确率。这些结果提供了有力证据,表明边缘规模模型的瓶颈并非其固有能力的上限,而是其推理的稳定性。基于我们建立的训练框架,AgentCPM-Explore有效地释放了边缘规模模型显著但此前被低估的潜力。
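"Parameter-space model fusion" as a remedy for SFT forgetting typically means interpolating matching weights of the base and fine-tuned checkpoints; the sketch below shows that generic recipe only (the paper's exact fusion rule is not given in the abstract, so the name and uniform interpolation are our assumption):

```python
def fuse_parameters(base, tuned, alpha=0.5):
    """Linear parameter-space fusion: blend each weight of the base
    model with the agent-tuned model. alpha near 1 keeps more of the
    tuned behavior; alpha near 0 restores more of the base model's
    general skills lost to catastrophic forgetting."""
    assert base.keys() == tuned.keys()
    return {k: (1 - alpha) * base[k] + alpha * tuned[k] for k in base}

# toy 2-parameter "models"
merged = fuse_parameters({"w": 0.0, "b": 2.0}, {"w": 1.0, "b": 0.0},
                         alpha=0.25)
```

In practice the dictionaries would be state dicts of tensors, but the blending rule is the same per parameter.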
cs.AI / 11 / 2602.06486

JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

JADE:基于专家的开放式专业任务动态评估
Lin, Lanbo, Liu, Jiayao, Yang, Tianyuan, Cai, Li, Xu, Yuanwu, Wei, Lei, Xie, Sicong, Zhang, Guannan
Abstract
Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains. Our code is publicly available at https://github.com/smiling-world/JADE.
Chinese Translation
在开放式专业任务中评估自主智能体面临着严谨性与灵活性之间的基本困境。静态评分标准提供了严谨、可重复的评估,但无法适应多样化的有效响应策略,而将大型语言模型(LLM)作为评审的方式虽然能够适应个体响应,但却存在不稳定性和偏见。人类专家通过结合领域基础原则与动态的声明级评估来解决这一困境。受到这一过程的启发,我们提出了JADE,一个双层评估框架。第一层将专家知识编码为预定义的评估技能集,提供稳定的评估标准。第二层执行特定报告的声明级评估,以灵活地评估多样的推理策略,并通过证据依赖门控来无效化基于被驳斥声明的结论。在BizBench上的实验表明,JADE提高了评估的稳定性,并揭示了整体LLM评估者未能发现的关键代理失败模式。我们进一步展示了JADE与专家撰写的评分标准之间的强一致性,以及在医疗领域基准上的有效转移,验证了JADE在各专业领域的适用性。我们的代码已公开发布在 https://github.com/smiling-world/JADE。
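The evidence-dependency gating in Layer 2 amounts to propagating refutation through a claim-dependency graph; a minimal sketch (our own illustration of the mechanism, with invented claim ids):

```python
def gate_claims(claims, refuted):
    """Invalidate any claim that (transitively) depends on a refuted
    claim. claims: {claim_id: [ids it depends on]}; returns the full
    set of invalid claims, so conclusions built on refuted premises
    cannot be credited during evaluation."""
    invalid = set(refuted)
    changed = True
    while changed:  # fixed-point propagation over the dependency graph
        changed = False
        for cid, deps in claims.items():
            if cid not in invalid and any(d in invalid for d in deps):
                invalid.add(cid)
                changed = True
    return invalid

# c3 concludes from c2, which rests on c1; c4 is independent
deps = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": []}
```

This is what lets the framework assess diverse reasoning strategies claim by claim while still refusing credit for conclusions with broken foundations.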
cs.AI / 12 / 2602.06525

Progress Constraints for Reinforcement Learning in Behavior Trees

行为树中强化学习的进展约束
Rietz, Finn, Kartašev, Mart, Stork, Johannes A., Ögren, Petter
Abstract
Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
Chinese Translation
行为树(Behavior Trees, BTs)提供了一种结构化和反应性的决策框架,通常用于根据环境条件在子控制器之间切换。另一方面,强化学习(Reinforcement Learning, RL)能够学习近似最优的控制器,但有时在稀疏奖励、安全探索和长时间跨度的信用分配方面面临挑战。将BT与RL结合具有互惠的潜力:BT设计编码了结构化的领域知识,可以简化RL训练,而RL则能够自动学习BT中的控制器。然而,BT和RL的简单整合可能导致某些控制器相互抵消,可能会撤销先前实现的子目标,从而降低整体性能。为了解决这个问题,我们提出了进展约束,这是一种新机制,其中可行性估计器根据理论BT收敛结果限制允许的动作集。在二维概念验证和高保真仓库环境中的实证评估表明,与先前的BT-RL整合方法相比,性能、样本效率和约束满足度都有所提升。
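The constraint mechanism can be sketched as action masking: before the RL policy acts, remove actions that a feasibility estimate says could undo an already-achieved subgoal (a simplified illustration; the paper derives which actions count as regressive from BT convergence results, which we replace here with an explicit `undoes` table):

```python
def allowed_actions(actions, achieved_subgoals, undoes):
    """Progress constraint as an action mask: drop any action whose
    estimated effect set intersects the subgoals already achieved,
    so learning cannot regress BT progress.
    undoes: {action: set of subgoals it may invalidate}."""
    return [a for a in actions
            if not (undoes.get(a, set()) & achieved_subgoals)]

# toy manipulation domain
acts = ["pick", "place", "drop"]
undo = {"drop": {"holding_item"}}
```

The policy then samples only from the masked set, which is also where the reported gains in constraint satisfaction come from.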
cs.AI / 13 / 2602.06527

HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction

HyPER:通过假设路径扩展与缩减桥接可扩展大语言模型推理中的探索与利用
Qiu, Shengxuan, Huang, Haochen, Zhong, Shuzhang, Zuo, Pengfei, Li, Meng
Abstract
Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.
Chinese Translation
在多路径思维链中扩展测试时计算可以提高推理准确性,但其有效性在很大程度上依赖于探索与利用的权衡。现有方法以僵化的方式处理这一权衡:树状结构搜索通过脆弱的扩展规则硬编码探索,这些规则干扰了后训练得到的推理能力,而并行推理则过度探索冗余的假设路径,并依赖于较弱的答案选择。基于最佳平衡是阶段依赖的、且正确与错误的推理路径通常仅在后期阶段才出现分歧这一观察,我们将测试时扩展重新表述为针对一组假设的动态扩展-缩减控制问题。我们提出了HyPER,这是一种无需训练的在线控制策略,适用于混合专家模型中的多路径解码,能够在固定预算下使用轻量级路径统计重新分配计算。HyPER包括一个随着假设池演变从探索过渡到利用的在线控制器,一个无需全路径重抽样即可在生成阶段高效利用的令牌级精细化机制,以及一种长度和置信度感知的聚合策略,以实现可靠的答案阶段利用。在四个混合专家语言模型的多样推理基准上的实验表明,HyPER始终实现了更优的准确性与计算开销的权衡,在提高8%到10%准确性的同时减少了25%到40%的令牌使用。
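A minimal sketch of the expand-reduce control loop described in the abstract: keep the hypothesis pool wide while exploring, then reduce to the top paths and pick an answer by a confidence-weighted vote. The 50% phase switch, `top_k`, and the data are invented; HyPER's actual controller uses richer path statistics:

```python
import collections

def expand_reduce(paths, step, budget, top_k=2):
    """Phase-dependent control sketch over (answer, confidence) paths:
    keep the pool wide while exploring, then reduce to the
    highest-confidence paths once half the compute budget is spent."""
    if step < 0.5 * budget:
        return list(paths)                     # exploration: keep everything
    return sorted(paths, key=lambda p: p[1], reverse=True)[:top_k]  # reduce

def aggregate(paths):
    """Confidence-weighted vote over surviving hypothesis paths."""
    votes = collections.defaultdict(float)
    for answer, conf in paths:
        votes[answer] += conf
    return max(votes, key=votes.get)

paths = [("42", 0.9), ("41", 0.4), ("42", 0.7), ("17", 0.2)]
survivors = expand_reduce(paths, step=8, budget=10)  # late phase: reduce
print(aggregate(survivors))  # 42
```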
cs.AI / 14 / 2602.06533

LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

LogicSkills:大型语言模型形式推理的结构化基准
Rabern, Brian, Mondorf, Philipp, Plank, Barbara
Abstract
Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization (translating premises into first-order logic); (ii) countermodel construction (formulating a finite structure in which all premises are true while the conclusion is false); and (iii) validity assessment (deciding whether a conclusion follows from a given set of premises). Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
Chinese Translation
大型语言模型在各种逻辑推理基准测试中表现出显著的性能。然而,目前尚不清楚它们真正掌握了哪些核心逻辑技能。为了解决这一问题,我们引入了LogicSkills,这是一个统一的基准,旨在孤立出形式推理中的三项基本技能:(i) 形式符号化——将前提翻译为一阶逻辑;(ii) 反模型构造——制定一个有限结构,使得所有前提为真而结论为假;(iii) 有效性评估——判断结论是否从给定的前提集中推导而来。题目来自于一阶逻辑的双变量片段(不含等式),并以自然英语和使用新造词的卡罗尔风格语言呈现。所有示例均通过SMT求解器Z3验证其正确性和非平凡性。在领先模型中,有效性表现良好,但在符号化和反模型构造方面的表现明显较低,这表明模型依赖于表面模式而非真正的符号或基于规则的推理。
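The contrast between validity assessment and countermodel construction can be illustrated with a toy propositional checker. The paper works in the two-variable fragment of first-order logic and verifies examples with Z3; this stdlib-only sketch substitutes truth-table enumeration for the SMT solver:

```python
from itertools import product

def countermodel(premises, conclusion, variables):
    """Search truth assignments for one where all premises hold but the
    conclusion fails; None means the argument is (propositionally) valid.
    A toy stand-in for the paper's Z3-verified first-order setting."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return env
    return None

# "p -> q" and "p" entail "q": valid, so no countermodel exists ...
valid = countermodel([lambda e: (not e["p"]) or e["q"], lambda e: e["p"]],
                     lambda e: e["q"], ["p", "q"])
# ... but "p -> q" and "q" do not entail "p" (affirming the consequent).
invalid = countermodel([lambda e: (not e["p"]) or e["q"], lambda e: e["q"]],
                       lambda e: e["p"], ["p", "q"])
print(valid, invalid)  # None {'p': False, 'q': True}
```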
cs.AI / 15 / 2602.06540

AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

AgentCPM-Report:交替草拟与深化以进行开放式深度研究
Li, Yishan, Chen, Wentong, Yan, Yukun, Li, Mingwei, Mei, Sen, Wang, Xiaorong, Liu, Kunpeng, Cong, Xin, Wang, Shuo, Zhang, Zhong, Lu, Yaxi, Liu, Zhenghao, Lin, Yankai, Liu, Zhiyuan, Sun, Maosong
Abstract
Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the initial outline. However, constructing a comprehensive outline itself demands strong reasoning ability, causing current deep research systems to rely almost exclusively on closed-source or online large models. This reliance raises practical barriers to deployment and introduces safety and privacy concerns for user-authored data. In this work, we present AgentCPM-Report, a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process and an 8B-parameter deep research agent. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines during report generation. Under this policy, the agent alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, jointly supporting information acquisition, knowledge refinement, and iterative outline evolution. To effectively equip small models with this capability, we introduce a Multi-Stage Agentic Training strategy, consisting of cold-start, atomic skill RL, and holistic pipeline RL. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems, with substantial gains in Insight.
Chinese Translation
生成深度研究报告需要大规模的信息获取和基于洞察的分析综合,这对当前的语言模型提出了重大挑战。大多数现有方法遵循计划-再写的范式,其性能在很大程度上依赖于初始大纲的质量。然而,构建一个全面的大纲本身就需要强大的推理能力,这使得当前的深度研究系统几乎完全依赖于闭源或在线的大型模型。这种依赖性在实际部署中带来了障碍,并引发了用户创作数据的安全和隐私问题。在本研究中,我们提出了AgentCPM-Report,这是一种轻量级但高性能的本地解决方案,由一个模拟人类写作过程的框架和一个8B参数的深度研究代理组成。我们的框架采用写作即推理策略(Writing As Reasoning Policy, WARP),使模型能够在报告生成过程中动态修订大纲。在这一策略下,代理在基于证据的草拟和基于推理的深化之间交替进行,共同支持信息获取、知识精炼和大纲的迭代演变。为了有效地赋予小型模型这一能力,我们引入了一种多阶段代理训练策略,包括冷启动、原子技能强化学习(atomic skill RL)和整体管道强化学习(holistic pipeline RL)。在DeepResearch Bench、DeepConsult和DeepResearch Gym上的实验表明,AgentCPM-Report在洞察力方面显著超越了领先的闭源系统。
cs.AI / 16 / 2602.06554

SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

SeeUPO:具有收敛保证的序列级代理强化学习
Hu, Tianyi, Fu, Qingxu, Chen, Yanxi, Liu, Zhaoyang, Ding, Bolin
Abstract
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single- and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the globally optimal policy under undiscounted conditions, but the combination of PPO and GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously achieve critic-free operation and convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
Chinese Translation
强化学习(RL)已成为训练基于大型语言模型(LLM)的人工智能代理的主要范式。然而,现有的基础强化学习算法在代理场景中缺乏经过验证的收敛保证,特别是在多轮设置中,这可能导致训练不稳定以及无法收敛到最优策略。本文系统分析了不同的策略更新机制和优势估计方法的组合如何影响单轮/多轮场景中的收敛特性。我们发现,使用组相对优势估计(Group Relative Advantage Estimation, GRAE)的REINFORCE在无折扣条件下可以收敛到全局最优,但PPO与GRAE的组合破坏了PPO原有的单调改进特性。此外,我们证明了主流基础强化学习算法在多轮场景中无法同时兼顾免评论家(critic-free)设计与收敛保证。为了解决这一问题,我们提出了SeeUPO(序列级顺序更新策略优化),这是一种在多轮交互中具有收敛保证的免评论家方法。SeeUPO将多轮交互建模为顺序执行的多智能体多臂老虎机(bandit)问题。通过反向执行顺序的逐轮策略更新,它确保了单调改进并通过反向归纳收敛到全局最优解。在AppWorld和BFCL v4上的实验表明,SeeUPO在现有基础算法上有显著改进:在Qwen3-14B上相对增益为43.3%-54.6%,在Qwen2.5-14B上为24.1%-41.9%(在基准测试中平均),并且具有更优的训练稳定性。
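Backward induction over sequentially executed bandits, the core of SeeUPO's reverse-order update scheme, can be illustrated on a two-turn toy problem: solve the last turn first (per first-turn action), then choose the first turn against those continuation values. The reward table and action names are invented:

```python
def solve_backward(rewards):
    """Backward induction over a two-turn bandit: rewards[a1][a2] is the
    terminal payoff. Solve turn 2 first, then turn 1 against the resulting
    continuation values. A toy illustration, not SeeUPO itself."""
    best_a2 = {a1: max(r2, key=r2.get) for a1, r2 in rewards.items()}
    value = {a1: rewards[a1][best_a2[a1]] for a1 in rewards}
    a1 = max(value, key=value.get)
    return a1, best_a2[a1], value[a1]

rewards = {"ask": {"confirm": 3, "guess": 1},
           "guess": {"confirm": 2, "guess": 0}}
print(solve_backward(rewards))  # ('ask', 'confirm', 3)
```

Updating the last turn first ensures every earlier turn is optimized against an already-converged continuation, which is what yields the monotonic-improvement argument.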
cs.AI / 17 / 2602.06652

Same Answer, Different Representations: Hidden instability in VLMs

相同答案,不同表征:视觉语言模型中的隐藏不稳定性
Wani, Farooq Ahmad, Suglia, Alessandro, Saxena, Rohit, Gema, Aryo Pradipta, Kwan, Wai-Chung, Barez, Fazl, Bucarelli, Maria Sofia, Silvestri, Fabrizio, Minervini, Pasquale
Abstract
The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
Chinese Translation
视觉语言模型(VLMs)的鲁棒性通常通过输出级不变性进行评估,隐含假设稳定的预测反映了稳定的多模态处理。在本研究中,我们认为这一假设是不充分的。我们引入了一种关注表征和频率的评估框架,该框架测量内部嵌入漂移、谱敏感性和结构平滑性(视觉标记的空间一致性),并结合标准的基于标签的指标。将该框架应用于现代VLMs在SEEDBench、MMMU和POPE数据集上的表现揭示了三种不同的失败模式。首先,模型在经历显著的内部表征漂移时,常常保留预测答案;对于文本叠加等扰动,这种漂移接近于图像间变异性的大小,表明尽管输出不变,表征却移动到通常被无关输入占据的区域。其次,鲁棒性并未随着规模的扩大而提高;较大的模型虽然实现了更高的准确率,但表现出相等或更大的敏感性,这与更尖锐但更脆弱的决策边界一致。第三,我们发现扰动对任务的影响各不相同:当扰动干扰模型如何结合粗略和细致的视觉线索时,它们会损害推理能力,但在幻觉基准测试中,它们可以通过使模型生成更保守的答案来减少假阳性。
cs.AI / 18 / 2602.06707

Autoregressive Models for Knowledge Graph Generation

用于知识图谱生成的自回归模型
Thanapalasingam, Thiviyan, Vozikis, Antonis, Bloem, Peter, Groth, Paul
Abstract
Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality >= 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.
Chinese Translation
知识图谱(Knowledge Graph, KG)生成需要模型学习三元组之间复杂的语义依赖关系,同时保持领域有效性约束。与独立评分三元组的链接预测不同,生成模型必须捕捉整个子图之间的相互依赖,以生成语义上连贯的结构。我们提出了ARK(自回归知识图谱生成),这是一类自回归模型,通过将图视为(头,关系,尾)三元组的序列来生成知识图谱。ARK直接从数据中学习隐式语义约束,包括类型一致性、时间有效性和关系模式,而无需显式规则监督。在IntelliGraphs基准测试中,我们的模型在多样化数据集上实现了89.2%到100.0%的语义有效性,同时生成了在训练期间未见过的新图谱。我们还引入了SAIL,这是ARK的变分扩展,能够通过学习的潜在表示实现受控生成,支持无条件采样和从部分图谱的条件补全。我们的分析表明,模型容量(隐藏维度>= 64)对于知识图谱生成比架构深度更为关键,递归架构在有效性上与基于变换器的替代方案相当,同时提供了显著的计算效率。这些结果表明,自回归模型为知识图谱生成提供了有效的框架,在知识库补全和查询回答等实际应用中具有广泛的潜力。
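A count-based toy stand-in for the autoregressive factorization ARK uses, sampling each (head, relation, tail) triple conditioned on the previous one. The transition table is invented for illustration; in ARK a trained sequence model plays this role:

```python
import random

def generate(start, cond_counts, n=3, seed=0):
    """Autoregressively sample a sequence of (head, relation, tail) triples,
    each conditioned on the previous one, from a count table."""
    rng = random.Random(seed)
    graph, prev = [], start
    for _ in range(n):
        choices = cond_counts.get(prev)
        if not choices:
            break
        triples, weights = zip(*choices.items())
        prev = rng.choices(triples, weights=weights)[0]
        graph.append(prev)
    return graph

# Hypothetical conditional counts learned from a corpus of small KGs.
cond_counts = {
    ("alice", "works_at", "lab"): {("lab", "located_in", "city"): 3,
                                   ("alice", "knows", "bob"): 1},
    ("lab", "located_in", "city"): {("city", "part_of", "country"): 2},
}
g = generate(("alice", "works_at", "lab"), cond_counts, n=3)
print(g)
```

Because each triple is sampled conditioned on its predecessor, implicit constraints such as type consistency can be absorbed into the conditional distribution rather than enforced by explicit rules.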
cs.AI / 19 / 2602.06746

Semantically Labelled Automata for Multi-Task Reinforcement Learning with LTL Instructions

用于多任务强化学习的语义标记自动机与 LTL 指令
Abate, Alessandro, De Giacomo, Giuseppe, Jackermeier, Mathias, Kretínský, Jan, Prokop, Maximilian, Weinhuber, Christoph
Abstract
We study multi-task reinforcement learning (RL), a setting in which an agent learns a single, universal policy capable of generalising to arbitrary, possibly unseen tasks. We consider tasks specified as linear temporal logic (LTL) formulae, which are commonly used in formal methods to specify properties of systems, and have recently been successfully adopted in RL. In this setting, we present a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis. The resulting semantically labelled automata contain rich, structured information in each state that allow us to (i) compute the automaton efficiently on-the-fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) naturally support full LTL. Experimental results in a variety of domains demonstrate that our approach achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.
Chinese Translation
我们研究多任务强化学习(RL),这是一个代理学习单一通用策略的情境,该策略能够推广到任意可能未见的任务。我们考虑将任务指定为线性时序逻辑(LTL)公式,这在形式化方法中常用于指定系统的属性,并且最近在强化学习中得到了成功应用。在这种情况下,我们提出了一种新颖的任务嵌入技术,利用新一代语义 LTL 到自动机的转换,这些转换最初是为时序合成开发的。生成的语义标记自动机在每个状态中包含丰富的结构化信息,使我们能够(i)高效地动态计算自动机,(ii)提取用于条件化策略的表达性任务嵌入,以及(iii)自然支持完整的 LTL。在多种领域的实验结果表明,我们的方法实现了最先进的性能,并能够扩展到现有方法无法处理的复杂规范。
cs.AI / 20 / 2602.06774

Towards Understanding What State Space Models Learn About Code

迈向理解状态空间模型从代码中学到了什么
Wu, Jiali, Anand, Abhinav, Verma, Shweta, Mezini, Mira
Abstract
State Space Models (SSMs) have emerged as an efficient alternative to the transformer architecture. Recent studies show that SSMs can match or surpass Transformers on code understanding tasks, such as code retrieval, when trained under similar conditions. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models actually learn and the first comparative analysis of SSM- and Transformer-based code models. Our analysis reveals that SSMs outperform Transformers at capturing code syntax and semantics during pretraining but forget certain syntactic and semantic relations during task-specific fine-tuning, especially when the task emphasizes short-range dependencies. To diagnose this, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models, validating that our analysis directly enables better models.
Chinese Translation
状态空间模型(State Space Models, SSMs)已成为变换器架构(Transformer architecture)的一种高效替代方案。近期研究表明,在相似条件下训练时,SSMs在代码理解任务(如代码检索)上可以与变换器相匹配或超越。然而,它们的内部机制仍然是一个黑箱。我们首次系统性地分析了基于SSM的代码模型实际学习的内容,并首次对基于SSM与基于变换器的代码模型进行了比较分析。我们的分析揭示,SSMs在预训练阶段在捕捉代码语法和语义方面优于变换器,但在针对任务的微调过程中遗忘了某些语法和语义关系,尤其是在任务强调短程依赖时。为了诊断这一问题,我们引入了SSM-Interpret,这是一个频域框架,揭示了微调过程中向短程依赖的谱移。根据这些发现,我们提出了架构修改,显著提高了基于SSM的代码模型的性能,验证了我们的分析直接促进了更好模型的构建。
cs.AI / 21 / 2602.06818

Wild Guesses and Mild Guesses in Active Concept Learning

主动概念学习中的大胆猜测与温和猜测
Chari, Anirudh, Pattanaik, Neil
Abstract
Human concept learning is typically active: learners choose which instances to query or test in order to reduce uncertainty about an underlying rule or category. Active concept learning must balance informativeness of queries against the stability of the learner that generates and scores hypotheses. We study this trade-off in a neuro-symbolic Bayesian learner whose hypotheses are executable programs proposed by a large language model (LLM) and reweighted by Bayesian updating. We compare a Rational Active Learner that selects queries to maximize approximate expected information gain (EIG) and the human-like Positive Test Strategy (PTS) that queries instances predicted to be positive under the current best hypothesis. Across concept-learning tasks in the classic Number Game, EIG is effective when falsification is necessary (e.g., compound or exception-laden rules), but underperforms on simple concepts. We trace this failure to a support mismatch between the EIG policy and the LLM proposal distribution: highly diagnostic boundary queries drive the posterior toward regions where the generator produces invalid or overly specific programs, yielding a support-mismatch trap in the particle approximation. PTS is information-suboptimal but tends to maintain proposal validity by selecting "safe" queries, leading to faster convergence on simple rules. Our results suggest that "confirmation bias" may not be a cognitive error, but rather a rational adaptation for maintaining tractable inference in the sparse, open-ended hypothesis spaces characteristic of human thought.
Chinese Translation
人类的概念学习通常是主动的:学习者选择查询或测试哪些实例,以减少对潜在规则或类别的不确定性。主动概念学习必须在查询的信息量与生成和评分假设的学习者的稳定性之间取得平衡。我们研究了这一权衡,采用了一种神经符号贝叶斯学习者,其假设是由大型语言模型(LLM)提出的可执行程序,并通过贝叶斯更新进行重新加权。我们比较了一种理性主动学习者,该学习者选择查询以最大化近似期望信息增益(EIG),以及一种类人正测试策略(PTS),该策略查询在当前最佳假设下预测为正的实例。在经典数字游戏的概念学习任务中,当需要证伪时(例如,复合或例外规则),EIG是有效的,但在简单概念上表现不佳。我们将这一失败归因于EIG策略与LLM提议分布之间的支持不匹配:高度诊断性的边界查询将后验推向生成无效或过于特定程序的区域,从而在粒子近似中产生支持不匹配陷阱。PTS虽然信息上次优,但通过选择“安全”的查询往往能维持提议的有效性,从而在简单规则上实现更快的收敛。我们的结果表明,“确认偏误”可能不是一种认知错误,而是一种理性的适应,以维持在稀疏、开放式假设空间中可处理的推理,这一特征是人类思维的典型特征。
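The expected-information-gain criterion maximized by the Rational Active Learner can be computed exactly for a small Number-Game-style posterior: EIG is the prior entropy minus the expected posterior entropy after observing the query's yes/no outcome. The hypotheses and prior below are invented:

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def eig(query, hypotheses, prior):
    """Expected information gain of asking whether `query` belongs to the
    concept, under a posterior over set-valued hypotheses (a Number Game
    sketch, not the paper's LLM-proposed program hypotheses)."""
    p_yes = sum(p for h, p in zip(hypotheses, prior) if query in h)
    gain = entropy(prior)
    for outcome, p_o in (("yes", p_yes), ("no", 1 - p_yes)):
        if p_o == 0:
            continue
        post = [p / p_o if ((query in h) == (outcome == "yes")) else 0.0
                for h, p in zip(hypotheses, prior)]
        gain -= p_o * entropy(post)
    return gain

evens = {2, 4, 6, 8}
powers = {2, 4, 8, 16}
hyps, prior = [evens, powers], [0.5, 0.5]
print(round(eig(16, hyps, prior), 3))  # 1.0 bit: 16 splits the hypotheses
print(round(eig(4, hyps, prior), 3))   # 0.0 bits: 4 is in both
```

A positive-test strategy would instead query instances like 4 that the current best hypothesis predicts positive, which is uninformative here but keeps the learner inside regions where its hypothesis generator stays valid.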
cs.AI / 22 / 2602.06820

ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training

ScaleEnv:从零开始扩展环境合成以训练通用交互工具使用代理
Tu, Dunwei, Hao, Hongyan, Yang, Hansi, Chen, Yihao, Zhang, Yi-Kai, Xia, Zhikang, Yang, Yu, Sun, Yueqing, Liu, Xingchen, Shen, Furao, Gu, Qi, Su, Hui, Cai, Xunliang
Abstract
Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, interactive environments remain critically scarce, and existing synthesis methods suffer from significant limitations regarding environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as $\tau^2$-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between increasing number of domains and model generalization performance, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.
Chinese Translation
训练能够适应多样化场景的通用代理需要交互环境以进行自我探索。然而,交互环境仍然极为稀缺,现有的合成方法在环境多样性和可扩展性方面存在显著限制。为了解决这些挑战,我们提出了ScaleEnv,一个完全从零开始构建交互环境和可验证任务的框架。具体而言,ScaleEnv通过过程测试确保环境的可靠性,并通过工具依赖图扩展和可执行动作验证来保证任务的完整性和可解性。通过使代理能够在ScaleEnv中通过探索进行学习,我们在未见过的多轮工具使用基准(如$\tau^2$-Bench和VitaBench)上展示了显著的性能提升,突显了强大的泛化能力。此外,我们还研究了领域数量增加与模型泛化性能之间的关系,提供了实证证据,表明扩展环境多样性对稳健的代理学习至关重要。
cs.AI / 23 / 2602.06822

POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models

POP:在线结构剪枝实现大型基础模型的高效推理
Chen, Yi, Shin, Wonjin, Liu, Shuhong, Mai, Tho, Lee, Jeongmo, Hua, Chuanbo, Wang, Kun, Liu, Jun, Kim, Joo-Young
Abstract
Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge in the autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing, including offline calibration, retraining, or learning predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring smaller computational overhead and minimizing inference latency.
Chinese Translation
大型基础模型(LFMs)通过扩展实现了强大的性能,然而当前的结构剪枝方法在推理过程中产生固定的剪枝决策,忽视了在自回归标记生成中出现的稀疏模式。本文提出了POP(分区引导在线剪枝),一种高效的在线结构剪枝框架,能够以最小的计算开销实现上下文条件下的动态剪枝。POP将模型通道划分为保留区、候选区和剪枝区,其中预填充定义了粗略的剪枝分区,而解码阶段在候选区内生成细粒度掩码,避免了对全通道的重新评估。粗略剪枝分区保留了一致重要的权重,而细粒度掩码在解码过程中提供了上下文条件下的变化。此外,POP是一种轻量级的即插即用方法,无需预处理,包括离线校准、重新训练或学习预测器。对多种大型基础模型的广泛评估,包括大型语言模型(LLMs)、专家混合模型(MoEs)和视觉语言模型(VLMs),表明POP在保持较小计算开销和最小化推理延迟的同时,始终提供比现有剪枝方法更高的准确性。
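The coarse-then-fine idea above can be sketched as a prefill-time partition of channels into retained, candidate, and pruned regions, plus a decode-time mask computed only over the candidate region. The quantile cutoffs and importance scores are illustrative assumptions, not POP's actual scoring rule:

```python
def partition(scores, keep_q=0.5, prune_q=0.2):
    """Coarse prefill-time partition: top channels retained, bottom channels
    pruned, middle left as candidates for fine-grained decode-time masking."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    keep = set(ranked[:int(n * keep_q)])
    prune = set(ranked[n - int(n * prune_q):])
    candidate = set(range(n)) - keep - prune
    return keep, candidate, prune

def decode_mask(keep, candidate, fresh_scores, thresh):
    """Fine decode-time mask: re-score only the candidate region, so the
    retained set never needs full-channel re-evaluation."""
    return keep | {i for i in candidate if fresh_scores[i] >= thresh}

scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.05, 0.6, 0.2, 0.8, 0.4]
keep, candidate, prune = partition(scores)
print(sorted(keep), sorted(candidate), sorted(prune))
```

Only the small candidate set is re-scored per decoding context, which is where the context-conditioned variation comes from at low overhead.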
cs.AI / 24 / 2602.06836

LLM Active Alignment: A Nash Equilibrium Perspective

LLM主动对齐:纳什均衡视角
Wang, Tonghan, Pan, Yuqi, Yang, Xinyi, Jiang, Yanchen, Tambe, Milind, Parkes, David C.
Abstract
We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent's action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed-form NE characterizations, adopting standard concave-utility assumptions to enable analytical system-level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models, may exhibit political exclusion, pathologies where some subpopulations are ignored by all LLM agents, which can be avoided by our method, illustrating the promise of applying the method to regulate multi-agent LLM dynamics across domains.
Chinese Translation
我们开发了一个博弈论框架,通过纳什均衡(Nash Equilibrium,NE)分析来预测和引导大语言模型(Large Language Models,LLMs)群体的行为。为了避免在开放文本空间中计算均衡的复杂性,我们将每个代理的行动建模为对人类子群体的混合。代理主动且战略性地选择与哪些群体对齐,从而产生一种可解释且在行为上具有实质性的政策类别。我们推导出封闭形式的NE特征,采用标准的凹效用假设,以便实现系统级的分析预测,并为将对齐目标转向社会期望的结果提供明确且可操作的指导。该方法作为现有对齐流程(如基于强化学习的人类反馈,RLHF)之上的主动对齐层运作。在社交媒体环境中,我们展示了一群LLMs,特别是基于推理的模型,可能会表现出政治排斥,即某些子群体被所有LLM代理忽视的病态现象,而我们的方法可以避免这种情况,展示了将该方法应用于调节跨领域多代理LLM动态的潜力。
cs.AI / 25 / 2602.06838

An Adaptive Differentially Private Federated Learning Framework with Bi-level Optimization

一种具有双层优化的自适应差分隐私联邦学习框架
Wang, Jin, Ma, Hui, Xing, Fei, Yan, Ming
Abstract
Federated learning enables collaborative model training across distributed clients while preserving data privacy. However, in practical deployments, device heterogeneity and non-independent and identically distributed (Non-IID) data often lead to highly unstable and biased gradient updates. When differential privacy is enforced, conventional fixed gradient clipping and Gaussian noise injection may further amplify gradient perturbations, resulting in training oscillation and degraded model performance. To address these challenges, we propose an adaptive differentially private federated learning framework that explicitly targets model efficiency under heterogeneous and privacy-constrained settings. On the client side, a lightweight local compressed module is introduced to regularize intermediate representations and constrain gradient variability, thereby mitigating noise amplification during local optimization. On the server side, an adaptive gradient clipping strategy dynamically adjusts clipping thresholds based on historical update statistics to avoid over-clipping and noise domination. Furthermore, a constraint-aware aggregation mechanism is designed to suppress unreliable or noise-dominated client updates and stabilize global optimization. Extensive experiments on CIFAR-10 and SVHN demonstrate improved convergence stability and classification accuracy.
Chinese Translation
联邦学习使得分布式客户端之间能够进行协作模型训练,同时保护数据隐私。然而,在实际部署中,设备异构性和非独立同分布(Non-IID)数据往往导致梯度更新高度不稳定和偏倚。当强制实施差分隐私时,传统的固定梯度裁剪和高斯噪声注入可能进一步放大梯度扰动,导致训练振荡和模型性能下降。为了解决这些挑战,我们提出了一种自适应差分隐私联邦学习框架,明确针对异构和隐私受限环境下的模型效率。在客户端,引入了一个轻量级的本地压缩模块,以规范中间表示并限制梯度变异性,从而减轻本地优化过程中的噪声放大。在服务器端,自适应梯度裁剪策略根据历史更新统计动态调整裁剪阈值,以避免过度裁剪和噪声主导。此外,设计了一种约束感知的聚合机制,以抑制不可靠或噪声主导的客户端更新,并稳定全局优化。在CIFAR-10和SVHN上的大量实验表明,收敛稳定性和分类准确性得到了改善。
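Adaptive clipping of the kind described, with thresholds driven by historical update statistics followed by Gaussian noise, can be sketched in a few lines. The quantile rule and constants below are generic DP-SGD-style assumptions, not the paper's exact mechanism:

```python
import math
import random

def adaptive_clip_and_noise(update, norm_history, quantile=0.5,
                            noise_mult=1.0, seed=0):
    """Clip a client update to a threshold taken from a quantile of recent
    update norms, then add Gaussian noise scaled to that threshold."""
    norm = math.sqrt(sum(x * x for x in update))
    hist = sorted(norm_history)
    thresh = hist[min(len(hist) - 1, int(quantile * len(hist)))]
    scale = min(1.0, thresh / max(norm, 1e-12))
    rng = random.Random(seed)
    sigma = noise_mult * thresh
    return [x * scale + rng.gauss(0.0, sigma) for x in update], thresh

# An update with norm 10 gets clipped to the median historical norm (3.0);
# noise disabled here so the clipping effect is visible.
clipped, thresh = adaptive_clip_and_noise([6.0, 8.0], [1.0, 2.0, 3.0, 10.0],
                                          noise_mult=0.0)
print(clipped, thresh)  # approximately [1.8, 2.4], threshold 3.0
```

Tying the threshold to recent norms avoids both over-clipping (threshold far below typical norms) and noise domination (threshold, and hence sigma, far above them).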
cs.AI / 26 / 2602.06841

From Features to Actions: Explainability in Traditional and Agentic AI Systems

从特征到行动:传统与自主人工智能系统中的可解释性
Chaduvula, Sindhuja, Ho, Jessee, Kim, Kina, Narayanan, Aravind, Alinoori, Mahshid, Garg, Muskan, Ramachandram, Dhanesh, Raza, Shaina
Abstract
Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While useful, such approaches leave it unclear how explanation methods designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman $\rho = 0.86$), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7$\times$ more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour. Resources: https://github.com/VectorInstitute/unified-xai-evaluation-framework https://vectorinstitute.github.io/unified-xai-evaluation-framework
Chinese Translation
在过去十年中,可解释人工智能主要集中于解释单个模型预测,生成后置解释,将输入与输出在固定决策结构下关联起来。最近,大型语言模型(LLMs)的进展使得自主人工智能系统的行为能够在多步骤轨迹中展开。在这些环境中,成功与失败由一系列决策决定,而不是单一输出。尽管这些方法有其用处,但尚不清楚为静态预测设计的解释方法如何转化为自主环境中行为随时间演变的情境。在本研究中,我们通过比较基于归因的解释与基于轨迹的诊断,弥合静态与自主可解释性之间的差距。为了明确这一区别,我们在静态分类任务中实证比较了基于归因的解释与在自主基准(TAU-bench Airline 和 AssistantBench)中使用的基于轨迹的诊断。我们的结果表明,尽管在静态环境中,归因方法能够实现稳定的特征排名(Spearman $\rho = 0.86$),但它们无法可靠地用于诊断自主轨迹中的执行级失败。相反,针对自主环境的基于轨迹的评估标准能够持续定位行为故障,并揭示状态跟踪不一致在失败运行中出现的频率高出2.7倍,并将成功概率降低49%。这些发现促使我们在评估和诊断自主人工智能行为时,向轨迹级可解释性转变。资源:https://github.com/VectorInstitute/unified-xai-evaluation-framework https://vectorinstitute.github.io/unified-xai-evaluation-framework
cs.AI / 27 / 2602.06855

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

AIRS-Bench:前沿人工智能研究科学代理的任务套件
Lupidi, Alisia, Gauri, Bhavul, Foster, Thomas Simon, Omari, Bassel Al, Magka, Despoina, Pepe, Alberto, Audran-Reiss, Alexis, Aghamelu, Muna, Baldwin, Nicolas, Cipolina-Kun, Lucia, Gagnon-Audet, Jean-Christophe, Leow, Chee Hau, Lefdal, Sandra, Mossalam, Hossam, Moudgil, Abhinav, Nazir, Saba, Tewolde, Emanuel, Urrego, Isabel, Estape, Jordi Armengol, Budhiraja, Amar, Chaurasia, Gaurav, Charnalia, Abhishek, Dunfield, Derek, Hambardzumyan, Karen, Izcovich, Daniel, Josifoski, Martin, Mediratta, Ishita, Niu, Kelvin, Pathak, Parth, Shvartsman, Michael, Toledo, Edan, Protopopov, Anton, Raileanu, Roberta, Miller, Alexander, Shavrina, Tatiana, Foerster, Jakob, Bachrach, Yoram
Abstract
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
Chinese Translation
大型语言模型(LLM)代理在推动科学研究方面具有重要潜力。为了加速这一进展,我们推出了AIRS-Bench(人工智能研究科学基准),这是一个由20个任务组成的套件,这些任务来源于最先进的机器学习论文。这些任务涵盖了多个领域,包括语言建模、数学、生物信息学和时间序列预测。AIRS-Bench任务评估代理在整个研究生命周期中的能力——包括创意生成、实验分析和迭代优化——而不提供基准代码。AIRS-Bench任务格式灵活,便于新任务的轻松集成以及不同代理框架之间的严格比较。我们使用前沿模型与顺序和并行支架相结合建立基准。我们的结果显示,代理在四个任务中超越了人类的最佳表现,但在其他十六个任务中未能匹敌。即使在代理超越人类基准的情况下,它们也未达到底层任务的理论性能上限。这些发现表明,AIRS-Bench远未饱和,仍有大量改进空间。我们将AIRS-Bench任务定义和评估代码开源,以促进自主科学研究的进一步发展。
cs.AI / 28 / 2602.06948

Agentic Uncertainty Reveals Agentic Overconfidence

代理不确定性揭示代理过度自信
Kaddour, Jean, Patel, Srijan, Dovonon, Gbètondji, Richter, Leo, Minervini, Pasquale, Kusner, Matt J.
Abstract
Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment with strictly less information tends to yield better discrimination than standard post-execution review, though the differences are not always significant. Adversarial prompting that reframes assessment as bug-finding achieves the best calibration.
Chinese Translation
人工智能代理能够预测自己在任务中是否会成功吗?我们通过在任务执行前、执行中和执行后引导成功概率估计来研究代理不确定性。所有结果都表现出代理过度自信:一些成功率仅为22%的代理预测成功率为77%。出乎意料的是,尽管执行前评估所掌握的信息严格更少,它往往比标准的执行后评审产生更好的区分效果,尽管差异并不总是显著。将评估重新框定为寻找错误的对抗性提示达到了最佳的校准效果。
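The overconfidence and discrimination notions in this abstract can be made concrete with two standard statistics; the following is a minimal illustrative sketch (function names and toy numbers are ours, not the paper's protocol):

```python
def overconfidence_gap(predicted_probs, outcomes):
    """Mean predicted success probability minus the empirical success rate.

    Positive values indicate overconfidence, as in the abstract's
    77%-predicted vs 22%-actual example."""
    mean_pred = sum(predicted_probs) / len(predicted_probs)
    success_rate = sum(outcomes) / len(outcomes)
    return mean_pred - success_rate

def discrimination_auc(predicted_probs, outcomes):
    """Probability that a successful run received a higher predicted
    probability than a failed one (ties count half) -- a simple
    discrimination measure."""
    pos = [p for p, y in zip(predicted_probs, outcomes) if y == 1]
    neg = [p for p, y in zip(predicted_probs, outcomes) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy trace mirroring the abstract's numbers: high stated confidence,
# low actual success.
probs = [0.8, 0.75, 0.7, 0.85, 0.75, 0.8, 0.7, 0.8, 0.75]
wins_ = [1, 0, 0, 1, 0, 0, 0, 0, 0]
print(round(overconfidence_gap(probs, wins_), 2))  # → 0.54
```

Eliciting these probabilities at different phases (before, during, after execution) and comparing the two statistics is what distinguishes calibration from discrimination in the study's framing.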
计算语言学 (Computation and Language)
55
cs.CL / 1 / 2602.06049

Recontextualizing Famous Quotes for Brand Slogan Generation

重新语境化名言以生成品牌口号
Yang, Ziao, Chen, Zizhang, Zhang, Lei, Liu, Hongfu
Abstract
Slogans are concise and memorable catchphrases that play a crucial role in advertising by conveying brand identity and shaping public perception. However, advertising fatigue reduces the effectiveness of repeated slogans, creating a growing demand for novel, creative, and insightful slogan generation. While recent work leverages large language models (LLMs) for this task, existing approaches often produce stylistically redundant outputs that lack a clear brand persona and appear overtly machine-generated. We argue that effective slogans should balance novelty with familiarity and propose a new paradigm that recontextualizes persona-related famous quotes for slogan generation. Well-known quotes naturally align with slogan-length text, employ rich rhetorical devices, and offer depth and insight, making them a powerful resource for creative generation. Technically, we introduce a modular framework that decomposes slogan generation into interpretable subtasks, including quote matching, structural decomposition, vocabulary replacement, and remix generation. Extensive automatic and human evaluations demonstrate marginal improvements in diversity, novelty, emotional impact, and human preference over three state-of-the-art LLM baselines.
Chinese Translation
口号是简洁且易于记忆的标语,在广告中发挥着至关重要的作用,通过传达品牌身份和塑造公众认知。然而,广告疲劳降低了重复使用口号的有效性,导致对新颖、创造性和深刻的口号生成的需求日益增长。尽管近期的研究利用大型语言模型(LLMs)来完成这一任务,但现有的方法往往产生风格上冗余的输出,缺乏明确的品牌个性,且显得过于机器生成。我们认为,有效的口号应在新颖性与熟悉感之间取得平衡,并提出一种新的范式,重新语境化与品牌个性相关的名言以生成口号。众所周知的名言自然符合口号长度的文本,运用丰富的修辞手法,并提供深度和洞察力,使其成为创造性生成的强大资源。在技术上,我们引入了一个模块化框架,将口号生成分解为可解释的子任务,包括名言匹配、结构分解、词汇替换和重混生成。广泛的自动和人工评估显示,在多样性、新颖性、情感影响和人类偏好方面,相较于三种最先进的LLM基线,取得了边际改进。
cs.CL / 2 / 2602.06050

Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering

基于相关性的多上下文对比解码用于检索增强的视觉问答
Kim, Jongha, Ko, Byungoh, Na, Jeehye, Yoon, Jinsung, Kim, Hyunwoo J.
Abstract
Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at https://github.com/mlvlab/RMCD.
Chinese Translation
尽管大型视觉语言模型(Large Vision Language Models, LVLMs)具有显著的能力,但它们仍然缺乏对特定实体的详细知识。检索增强生成(Retrieval-augmented Generation, RAG)是一种广泛采用的解决方案,通过提供来自外部知识库的额外上下文来增强LVLMs。然而,我们观察到以往的RAG解码方法并不理想,因为它们未能充分利用多个相关上下文,并抑制无关上下文的负面影响。为此,我们提出了基于相关性的多上下文对比解码(Relevance-aware Multi-context Contrastive Decoding, RMCD),这是一种针对RAG的新型解码方法。RMCD通过结合每个上下文预测的输出,输出最终预测,其中每个输出的权重基于其与问题的相关性。通过这种方式,RMCD有效地聚合来自多个相关上下文的有用信息,同时抵消无关上下文的负面影响。实验表明,RMCD在多个LVLMs上始终优于其他解码方法,在三个知识密集型视觉问答基准测试中实现了最佳性能。此外,RMCD可以通过替换LVLMs的解码方法简单应用,无需额外训练。分析还表明,RMCD对检索结果具有鲁棒性,在从最弱到最强的检索结果中始终表现最佳。代码可在 https://github.com/mlvlab/RMCD 获取。
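The core aggregation step RMCD describes, combining per-context predictions with relevance-based weights, can be sketched as follows. This is an illustrative reading of the abstract only, not the authors' exact formulation (which additionally applies contrastive terms against irrelevant contexts):

```python
import math

def rmcd_combine(context_logits, relevance_scores):
    """Combine next-token distributions predicted under each retrieved
    context, weighting each by its softmax-normalised relevance to the
    question. Sketch of the weighted-aggregation idea from the abstract."""
    # Softmax over relevance scores -> per-context weights.
    m = max(relevance_scores)
    exps = [math.exp(r - m) for r in relevance_scores]
    z_w = sum(exps)
    weights = [e / z_w for e in exps]
    # Per-context softmax over logits, then relevance-weighted average.
    vocab = len(context_logits[0])
    combined = [0.0] * vocab
    for logits, w in zip(context_logits, weights):
        mx = max(logits)
        ex = [math.exp(l - mx) for l in logits]
        z = sum(ex)
        for i in range(vocab):
            combined[i] += w * ex[i] / z
    return combined

# Two contexts over a 3-token vocabulary; the first context is far more
# relevant, so its preferred token dominates the final distribution.
dist = rmcd_combine([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [3.0, 0.0])
print(max(range(3), key=dist.__getitem__))  # → 0
```

Because the weights sum to one, irrelevant contexts are down-weighted rather than hard-filtered, which matches the abstract's claim of robustness from the weakest to the strongest retrieval results.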
cs.CL / 3 / 2602.06051

CAST: Character-and-Scene Episodic Memory for Agents

CAST:基于角色和场景的智能体情节记忆
Ma, Kexin, Li, Bojun, Tang, Yuhua, Jin, Ruochun, Sun, Liting
Abstract
Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where. However, most agent memory systems only emphasize semantic recall and treat experience as structures such as key-value, vector, or graph, which makes them struggle to represent and retrieve coherent events. To address this challenge, we propose a Character-and-Scene based memory architecture (CAST) inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles that summarize the events of a character to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, which yields a robust dual memory design. Experiments demonstrate that CAST improves F1 by an average of 8.11% and J (LLM-as-a-Judge) by 10.21% over baselines on various datasets, especially on open and time-sensitive conversational questions.
Chinese Translation
情节记忆是人类记忆的核心组成部分,指的是回忆基于谁、何时和何地的连贯事件的能力。然而,大多数智能体记忆系统仅强调语义回忆,并将经验视为键值、向量或图等结构,这使得它们在表示和检索连贯事件时面临困难。为了解决这一挑战,我们提出了一种基于角色和场景的记忆架构(CAST),灵感来源于戏剧理论。具体而言,CAST构建3D场景(时间/地点/主题),并将其组织成角色档案,以总结角色的事件,从而表示情节记忆。此外,CAST还通过图基语义记忆来补充这一情节记忆,从而形成稳健的双重记忆设计。实验表明,CAST在各种数据集上的F1平均提升了8.11%,J(LLM-as-a-Judge)提升了10.21%,尤其在开放和时间敏感的对话问题上表现突出。
cs.CL / 4 / 2602.06052

Rethinking Memory Mechanisms of Foundation Agents in the Second Half

在“下半场”重新思考基础智能体的记忆机制
Huang, Wei-Chieh, Zhang, Weizhi, Liang, Yueqing, Bei, Yuanchen, Chen, Yankai, Feng, Tao, Pan, Xinyu, Tan, Zhen, Wang, Yu, Wei, Tianxin, Wu, Shanglin, Xu, Ruiyao, Yang, Liangwei, Yang, Rui, Yang, Wooseong, Yeh, Chin-Yuan, Zhang, Hanrong, Zhang, Haozhen, Zhu, Siqi, Zou, Henry Peng, Zhao, Wanjia, Wang, Song, Xu, Wujiang, Ke, Zixuan, Hui, Zheng, Li, Dawei, Wu, Yaozu, He, Langzhou, Wang, Chen, Xu, Xiongxiao, Huang, Baixiang, Tan, Juntao, Heinecke, Shelby, Wang, Huan, Xiong, Caiming, Metwally, Ahmed A., Yan, Jun, Lee, Chen-Yu, Zeng, Hanqing, Xia, Yinglong, Wei, Xiaokai, Payani, Ali, Wang, Yu, Ma, Haitong, Wang, Wenya, Wang, Chengguang, Zhang, Yu, Wang, Xin, Zhang, Yongfeng, You, Jiaxuan, Tong, Hanghang, Luo, Xiao, Sun, Yizhou, Wang, Wei, McAuley, Julian, Zou, James, Han, Jiawei, Yu, Philip S., Shu, Kai
Abstract
The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the "second half," the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.
Chinese Translation
人工智能的研究正经历一个范式转变,从优先考虑模型创新和基准分数,转向强调问题定义和严格的现实世界评估。随着该领域进入“下半场”,中心挑战变为在长时间跨度、动态和用户依赖的环境中实现真实的实用性,在这些环境中,智能体面临上下文爆炸,并且必须在延续的交互中不断积累、管理和有选择地重用大量信息。因此,记忆作为填补实用性差距的关键解决方案,今年发布了数百篇相关论文。在本次调查中,我们从三个维度提供了基础智能体记忆的统一视角:记忆基质(内部和外部)、认知机制(情节记忆、语义记忆、感官记忆、工作记忆和程序性记忆)以及记忆主体(以智能体和用户为中心)。随后,我们分析了在不同智能体拓扑结构下记忆的实例化和操作方式,并强调了记忆操作的学习策略。最后,我们回顾了评估记忆实用性的基准和指标,并概述了各种开放挑战和未来方向。
cs.CL / 5 / 2602.06053

PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

PersonaPlex:全双工对话语音模型的声音与角色控制
Roy, Rajarshi, Raiman, Jonathan, Lee, Sang-gil, Ene, Teodor-Dumitru, Kirby, Robert, Kim, Sungwon, Kim, Jaehyeon, Catanzaro, Bryan
Abstract
Recent advances in duplex speech models have enabled natural, low-latency speech-to-speech interactions. However, existing models are restricted to a fixed role and voice, limiting their ability to support structured, role-driven real-world applications and personalized interactions. In this work, we introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts, combining role conditioning with text prompts and voice cloning with speech samples. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations, generated with open-source large language models (LLM) and text-to-speech (TTS) models. To evaluate role conditioning in real-world settings, we extend the Full-Duplex-Bench benchmark beyond a single assistant role to multi-role customer service scenarios. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness, surpassing state-of-the-art duplex speech models and hybrid large language model-based speech systems in role adherence, speaker similarity, latency, and naturalness.
Chinese Translation
近年来,全双工语音模型的进展使得自然、低延迟的语音对语音交互成为可能。然而,现有模型受到固定角色和声音的限制,无法支持结构化、以角色驱动的现实世界应用和个性化交互。在本研究中,我们介绍了PersonaPlex,这是一种全双工对话语音模型,结合了角色条件与文本提示的混合系统提示,以及语音样本与语音克隆。PersonaPlex在一个大规模的合成数据集上进行训练,该数据集由成对提示和用户-代理对话生成,使用开源的大型语言模型(LLM)和文本到语音(TTS)模型。为了在现实世界环境中评估角色条件,我们将全双工基准(Full-Duplex-Bench)扩展到多角色客户服务场景,而不仅限于单一助手角色。实验结果表明,PersonaPlex在角色条件行为、声音条件语音和自然对话响应性方面表现出色,超越了最先进的全双工语音模型和基于混合大型语言模型的语音系统,在角色遵循、说话者相似性、延迟和自然性方面均表现优异。
cs.CL / 6 / 2602.06054

What Is Novel? A Knowledge-Driven Framework for Bias-Aware Literature Originality Evaluation

什么是新颖性?一个基于知识的偏见意识文献原创性评估框架
Mostafa, Abeer, Nguyen, Thi Huyen, Ahmadi, Zahra
Abstract
Assessing research novelty is a core yet highly subjective aspect of peer review, typically based on implicit judgment and incomplete comparison to prior work. We introduce a literature-aware novelty assessment framework that explicitly learns how humans judge novelty from peer-review reports and grounds these judgments in structured comparison to existing research. Using nearly 80K novelty-annotated reviews from top-tier AI conferences, we fine-tune a large language model to capture reviewer-aligned novelty evaluation behavior. For a given manuscript, the system extracts structured representations of its ideas, methods, and claims, retrieves semantically related papers, and constructs a similarity graph that enables fine-grained, concept-level comparison to prior work. Conditioning on this structured evidence, the model produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency relative to existing approaches.
Chinese Translation
评估研究的新颖性是同行评审中的一个核心但高度主观的方面,通常基于隐含判断和与先前工作的不完全比较。我们提出了一种文献意识的新颖性评估框架,该框架明确学习人类如何从同行评审报告中判断新颖性,并将这些判断基于与现有研究的结构化比较。利用来自顶级人工智能会议的近80,000份标注新颖性的评审,我们微调了一个大型语言模型,以捕捉与评审者一致的新颖性评估行为。对于给定的手稿,该系统提取其思想、方法和主张的结构化表示,检索语义相关的论文,并构建一个相似性图,以便与先前工作进行细粒度的概念级比较。在这一结构化证据的基础上,该模型生成经过校准的新颖性评分和类人解释性评估,减少了高估现象,并提高了与现有方法相比的一致性。
cs.CL / 7 / 2602.06055

Quantifying and Attributing Polarization to Annotator Groups

量化并归因于标注者群体的极化现象
Tsirmpas, Dimitris, Pavlopoulos, John
Abstract
Current annotation agreement metrics are not well-suited for inter-group analysis, are sensitive to group size imbalances and restricted to single-annotation settings. These restrictions render them insufficient for many subjective tasks such as toxicity and hate-speech detection. For this reason, we introduce a quantifiable metric, paired with a statistical significance test, that attributes polarization to various annotator groups. Our metric enables direct comparisons between heavily imbalanced sociodemographic and ideological subgroups across different datasets and tasks, while also enabling analysis on multi-label settings. We apply this metric to three datasets on hate speech, and one on toxicity detection, discovering that: (1) Polarization is strongly and persistently attributed to annotator race, especially on the hate speech task. (2) Religious annotators do not fundamentally disagree with each other, but do with other annotators, a trend that is gradually diminished and then reversed with irreligious annotators. (3) Less educated annotators are more subjective, while educated ones tend to broadly agree more between themselves. Overall, our results reflect current findings around annotation patterns for various subgroups. Finally, we estimate the minimum number of annotators needed to obtain robust results, and provide an open-source Python library that implements our metric.
Chinese Translation
当前的标注一致性度量不适合进行群体间分析,容易受到群体规模不平衡的影响,并且仅限于单一标注设置。这些限制使得它们在许多主观任务中(如毒性和仇恨言论检测)显得不足。因此,我们提出了一种可量化的度量标准,并配以统计显著性检验,用于将极化现象归因于不同的标注者群体。我们的度量标准能够在不同数据集和任务中直接比较严重不平衡的社会人口和意识形态子群体,同时也支持多标签设置的分析。我们将该度量应用于三个仇恨言论数据集和一个毒性检测数据集,发现:(1)极化现象与标注者的种族有着强烈且持续的关联,尤其是在仇恨言论任务中。(2)宗教标注者之间并没有根本性的分歧,但与其他标注者存在分歧,这一趋势在与无宗教标注者的交互中逐渐减弱并最终反转。(3)教育程度较低的标注者更具主观性,而受过教育的标注者之间往往更容易达成一致。总体而言,我们的结果反映了当前关于各子群体标注模式的研究发现。最后,我们估算了获得稳健结果所需的最小标注者数量,并提供了一个实现我们度量标准的开源Python库。
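The between-group versus within-group intuition behind attributing polarization to annotator groups can be illustrated with a toy score. This is NOT the paper's metric (which adds a significance test, handles size imbalance, and supports multi-label settings); it only conveys the underlying comparison:

```python
from itertools import combinations

def mean_disagreement(pairs):
    """Fraction of label pairs that disagree."""
    vals = [a != b for a, b in pairs]
    return sum(vals) / len(vals) if vals else 0.0

def polarization(group_a, group_b):
    """Toy group-polarization score: between-group disagreement minus
    average within-group disagreement. Positive values mean the two
    groups disagree with each other more than among themselves."""
    within_a = mean_disagreement(combinations(group_a, 2))
    within_b = mean_disagreement(combinations(group_b, 2))
    between = mean_disagreement((a, b) for a in group_a for b in group_b)
    return between - 0.5 * (within_a + within_b)

# Binary toxicity labels from two annotator groups on one item: group A
# is unanimous, group B is split, and the groups lean opposite ways.
print(round(polarization([1, 1, 1], [0, 0, 1]), 3))  # → 0.333
```

A score near 1 would correspond to the race-driven pattern the abstract reports (cohesive groups with opposing labels), while mixed groups that disagree internally as much as externally score near or below zero.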
cs.CL / 8 / 2602.06161

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

停止翻转:快速可撤销扩散解码的上下文保持验证
Xiang, Yanzheng, Wei, Lan, Yao, Yizhen, Zhu, Qinglin, Yan, Hanqi, Jin, Chen, Teare, Philip Alexander, Zhang, Dandan, Gui, Lin, Saseendran, Amrutha, He, Yulan
Abstract
Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.
Chinese Translation
并行扩散解码可以通过每步解码多个标记来加速扩散语言模型的推理,但过度的并行化往往会损害质量。可撤销解码通过重新检查早期的标记来缓解这一问题,然而我们观察到现有的验证方案经常触发翻转振荡现象,即标记被重新屏蔽后又恢复为未改变的状态。这种行为以两种方式减慢推理速度:重新屏蔽已验证的位置削弱了并行草拟的条件上下文,而重复的重新屏蔽周期则消耗了修订预算,却几乎没有取得实质性进展。我们提出了COVER(Cache Override Verification for Efficient Revision),它在单次前向传递中执行留一验证和稳定草拟。COVER通过KV缓存覆盖构建两个注意力视图:选定的种子被屏蔽以进行验证,而它们的缓存键值状态则被注入到所有其他查询中,以保持上下文信息,并通过封闭形式的对角校正防止在种子位置的自我泄漏。COVER进一步通过一种稳定性感知评分来优先考虑种子,该评分平衡了不确定性、下游影响和缓存漂移,并适应每步验证的种子数量。在各项基准测试中,COVER显著减少了不必要的修订,并在保持输出质量的同时实现了更快的解码。
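The flip-flop oscillations this abstract targets, a token being remasked and later restored unchanged, are easy to count given a revision log. A toy sketch with a hypothetical event format (this is not COVER's interface, just an illustration of the wasted-revision phenomenon):

```python
def count_flip_flops(revision_trace):
    """Count wasted revisions in a revocable-decoding trace: positions
    that were remasked and later refilled with the exact same token.
    Each event is (step, position, old_token, new_token); a remask has
    new_token=None, a refill has old_token=None. Hypothetical log format."""
    last_removed = {}   # position -> token that was remasked there
    flip_flops = 0
    for _step, pos, old_tok, new_tok in revision_trace:
        if new_tok is None:                     # remask event
            last_removed[pos] = old_tok
        elif last_removed.get(pos) == new_tok:  # restored unchanged
            flip_flops += 1
            del last_removed[pos]
    return flip_flops

trace = [
    (3, 7, "cat", None),   # position 7 remasked
    (4, 7, None, "cat"),   # ...and refilled with the same token: flip-flop
    (5, 9, "dog", None),
    (6, 9, None, "fox"),   # a genuine revision, not counted
]
print(count_flip_flops(trace))  # → 1
```

Each counted event is doubly costly in the abstract's terms: it spends revision budget and temporarily removes a verified token from the conditioning context for parallel drafting.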
cs.CL / 9 / 2602.06181

Uncertainty Drives Social Bias Changes in Quantized Large Language Models

不确定性驱动量化大型语言模型中的社会偏见变化
Hua, Stanley Z., Lotfi, Sanae, Chen, Irene Y.
Abstract
Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. We identify a phenomenon we term quantization-induced masked bias flipping, in which up to 21% of responses flip between biased and unbiased states after quantization, despite showing no change in aggregate bias scores. These flips are strongly driven by model uncertainty, where the responses with high uncertainty are 3-11x more likely to change than the confident ones. Quantization strength amplifies this effect, with 4-bit quantized models exhibiting 4-6x more behavioral changes than 8-bit quantized models. Critically, these changes create asymmetric impacts across demographic groups, where bias can worsen by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, requiring crucial post-quantization evaluation and interventions to ensure reliability in practice.
Chinese Translation
后训练量化降低了大型语言模型的计算成本,但在根本上以聚合指标无法捕捉的方式改变了它们的社会偏见。我们在PostTrainingBiasBench(一个统一了13个封闭式与开放式偏见数据集的基准)上对50个量化模型进行了首次大规模研究。我们识别出一种现象,称为量化引起的掩蔽偏见翻转,其中高达21%的响应在量化后在偏见和无偏见状态之间翻转,尽管聚合偏见分数没有变化。这些翻转强烈受到模型不确定性的驱动,高不确定性响应发生变化的可能性是自信响应的3至11倍。量化强度放大了这一效应,4位量化模型的行为变化比8位量化模型多出4-6倍。关键是,这些变化在不同人口群体之间产生了不对称的影响,对于某些群体,偏见可能恶化高达18.6%,而对其他群体则改善14.1%,从而导致误导性的中性聚合结果。更大的模型没有表现出一致的鲁棒性优势,且特定群体的变化在模型家族之间不可预测地变化。我们的研究结果表明,压缩在根本上改变了偏见模式,需要在量化后进行关键的评估和干预,以确保实践中的可靠性。
cs.CL / 10 / 2602.06221

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

BenchMarker:一个受教育启发的工具包,用于突出多项选择基准中的缺陷
Balepur, Nishant, Rajasekaran, Bhavya, Oh, Jane, Xie, Michael, Desai, Atrey, Gupta, Vipul, Moore, Steven James, Choi, Eunsol, Rudinger, Rachel, Boyd-Graber, Jordan Lee
Abstract
Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination - items appearing exactly online; 2) shortcuts - cues in the choices that enable guessing; and 3) writing errors - structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing that: 1) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 2) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors) but inadvertently add new flaws (e.g., implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
Chinese Translation
多项选择题回答(MCQA)在自然语言处理(NLP)中是标准做法,但基准缺乏严格的质量控制。我们提出了BenchMarker,一个受教育启发的工具包,利用大型语言模型(LLM)评审者标记三种常见的多项选择题缺陷:1)污染 - 题目在线上完全相同地出现;2)捷径 - 选项中的线索使得猜测成为可能;3)写作错误 - 基于19条教育标准的结构/语法问题。我们通过人工注释验证了BenchMarker,然后使用该工具审核了12个基准,揭示了:1)污染的多项选择题往往会夸大准确性,而写作错误则倾向于降低准确性并改变排名,超出随机范围;2)之前的基准修复解决了其针对的问题(即,通过LLM编写的干扰项降低准确性),但无意中引入了新的缺陷(如不可信的干扰项和多个正确答案)。总体而言,多项选择题中的缺陷降低了NLP评估的有效性,但教育研究为未来提供了方向。我们发布了BenchMarker,以桥接这两个领域并改善MCQA基准设计。
cs.CL / 11 / 2602.06260

Can One-sided Arguments Lead to Response Change in Large Language Models?

单方面论证能否导致大型语言模型的回应变化?
Cisneros-Velarde, Pedro
Abstract
Polemic questions need more than one viewpoint to express a balanced answer. Large Language Models (LLMs) can provide a balanced answer, but also take a single aligned viewpoint or refuse to answer. In this paper, we study if such initial responses can be steered to a specific viewpoint in a simple and intuitive way: by only providing one-sided arguments supporting the viewpoint. Our systematic study has three dimensions: (i) which stance is induced in the LLM response, (ii) how the polemic question is formulated, (iii) how the arguments are shown. We construct a small dataset and remarkably find that opinion steering occurs across (i)-(iii) for diverse models, number of arguments, and topics. Switching to other arguments consistently decreases opinion steering.
Chinese Translation
争议性问题需要多个观点来表达平衡的答案。大型语言模型(LLMs)可以提供平衡的答案,但也可能采取单一的对齐观点或拒绝回答。本文研究了是否可以通过一种简单直观的方式引导此类初始回应朝向特定观点:仅通过提供支持该观点的单方面论证。我们的系统研究有三个维度:(i)在LLM回应中诱导了哪种立场,(ii)争议性问题是如何构造的,(iii)论证是如何呈现的。我们构建了一个小型数据集,并显著发现,在不同模型、论证数量和主题下,(i)至(iii)之间的观点引导现象普遍存在。切换到其他论证会持续减少观点引导。
cs.CL / 12 / 2602.06266

Is my model "mind blurting"? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA)

我的模型是在“心智涌现”吗?使用递归量化分析(RQA)解释推理令牌的动态
Pham, Quoc Tuan, Jafari, Mehdi, Salim, Flora
Abstract
Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a brute proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thoughts (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing a model's reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model's latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA not only captures signals not reflected by response length, but also substantially improves prediction of task complexity by 8%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.
Chinese Translation
测试时计算是大型推理模型的核心,但通过生成文本分析其推理行为变得越来越不切实际和不可靠。响应长度常被用作推理努力的粗略代理,但这一指标未能捕捉到思维链(Chain of Thoughts, CoT)或生成令牌的动态性和有效性。我们提出递归量化分析(Recurrence Quantification Analysis, RQA)作为一种非文本的替代方法,用于在测试时分析模型的推理链。通过将令牌生成视为一个动态系统,我们在每个生成步骤提取隐藏的嵌入,并对生成的轨迹应用RQA。RQA指标,包括确定性(Determinism)和层流性(Laminarity),量化模型潜在表示中的重复和停滞模式。通过分析来自DeepSeek-R1-Distill的3,600个生成轨迹,我们展示了RQA捕捉到的信号并未通过响应长度反映,同时还将任务复杂度的预测提高了8%。这些结果有助于确立RQA作为研究推理模型测试时规模化的潜在令牌生成动态的原则性工具。
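Determinism, one of the RQA measures the abstract names, has a standard definition: the fraction of off-diagonal recurrent points lying on diagonal lines of length at least lmin. A minimal sketch over a latent-state trajectory (the epsilon threshold and toy trajectory are illustrative; library implementations such as pyunicorn also compute Laminarity):

```python
def recurrence_matrix(traj, eps):
    """Binary recurrence matrix: R[i][j] = 1 iff trajectory states i and j
    are within eps of each other (Euclidean distance)."""
    n = len(traj)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [[1 if dist(traj[i], traj[j]) <= eps else 0 for j in range(n)]
            for i in range(n)]

def determinism(R, lmin=2):
    """Fraction of off-diagonal recurrent points on diagonal lines of
    length >= lmin (the standard RQA Determinism measure). Scans only
    the upper triangle, using the symmetry of R."""
    n = len(R)
    total = on_lines = 0
    for d in range(1, n):              # each superdiagonal
        run = 0
        for i in range(n - d):
            if R[i][i + d]:
                total += 1
                run += 1
            else:
                if run >= lmin:
                    on_lines += run
                run = 0
        if run >= lmin:
            on_lines += run
    return on_lines / total if total else 0.0

# A latent trajectory that cycles through the same three states twice
# produces one long diagonal line: perfectly deterministic recurrence.
loop = [(0.0,), (1.0,), (2.0,)] * 2
print(determinism(recurrence_matrix(loop, eps=0.1)))  # → 1.0
```

In the paper's setting the trajectory points would be the hidden embeddings extracted at each generation step, so high Determinism flags repetitive latent dynamics that plain response length cannot see.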
cs.CL / 13 / 2602.06268

MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

MPIB:医学提示注入攻击与大型语言模型临床安全的基准测试
Lee, Junhyeok, Jang, Han, Choi, Kyu Sung
Abstract
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code and data are available at GitHub (code) and Hugging Face (data).
Chinese Translation
大型语言模型(LLMs)和检索增强生成(RAG)系统正日益融入临床工作流程;然而,提示注入攻击可能会使这些系统产生临床不安全或误导性的输出。我们提出了医学提示注入基准(MPIB),这是一个数据集和基准套件,用于评估在直接提示注入和间接的RAG介导注入下的临床安全性,涵盖临床基础任务。MPIB通过临床伤害事件率(CHER)强调结果级风险,该指标在临床基础分类法下衡量高严重性临床伤害事件,并与攻击成功率(ASR)一起报告,以区分指令遵从性与下游患者风险。该基准包含通过多阶段质量控制和临床安全审查构建的9,697个精心策划的实例。在对多种基线LLMs和防御配置进行MPIB评估时,我们发现ASR和CHER可能大相径庭,并且鲁棒性在很大程度上取决于对抗性指令是出现在用户查询中还是在检索的上下文中。我们发布了MPIB及其评估代码、对抗性基线和全面文档,以支持可重复和系统的临床提示注入研究。代码和数据可在GitHub(代码)和Hugging Face(数据)上获取。
cs.CL / 14 / 2602.06270

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

元音提示:通过元音级韵律增强从文本中感知语音情感
Wang, Yancheng, Hanna, Osama, Xie, Ruiming, Rui, Xianfeng, Shen, Maohao, Zhang, Xuedong, Fuegen, Christian, Wu, Jilong, Paul, Debjyoti, Guo, Arthur, Lei, Zhihong, Kalinli, Ozlem, He, Qing, Yang, Yingzhen
Abstract
Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
Chinese Translation
语音中的情感识别是一项复杂的多模态挑战,需要理解语言内容和声音表现力,特别是基频、强度和时间动态等韵律特征。尽管大型语言模型(LLMs)在基于文本转录的情感识别中展现出潜力,但它们通常忽视细粒度的韵律信息,从而限制了其有效性和可解释性。在本研究中,我们提出了元音提示(VowelPrompt),这是一个以语言学为基础的框架,通过可解释的细粒度元音级韵律线索增强基于LLM的情感识别。基于元音作为情感韵律主要载体的语音证据,元音提示从时间对齐的元音片段中提取基于音高、能量和持续时间的描述符,并将这些特征转换为自然语言描述,以提高可解释性。这种设计使得LLM能够联合推理词汇语义和细粒度韵律变化。此外,我们采用了一个两阶段的适应程序,包括监督微调(SFT)和可验证奖励的强化学习(RLVR),通过群体相对策略优化(GRPO)实施,以增强推理能力、强制结构化输出遵循,并改善跨领域和说话者变异的泛化能力。在各种基准数据集上的广泛评估表明,元音提示在零样本、微调、跨领域和跨语言条件下始终优于最先进的情感识别方法,同时能够生成在上下文语义和细粒度韵律结构中共同基础的可解释解释。
cs.CL / 15 / 2602.06275

RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution

RoPE-LIME:RoPE空间局部性 + 稀疏K采样用于高效LLM归因
Picov, Isaac, Goru, Ritesh
Abstract
Explaining closed-source LLM outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce RoPE-LIME, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover's Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.
Chinese Translation
解释闭源LLM输出是一个挑战,因为API访问限制了基于梯度的归因,而当依赖于再生文本时,扰动方法成本高且噪声大。我们提出了RoPE-LIME,这是gSMILE的一个开源扩展,它将推理与解释解耦:在给定闭源模型的固定输出的情况下,一个较小的开源替代模型在输入扰动下从基于概率的目标(负对数似然和散度目标)计算令牌级归因。RoPE-LIME包含(i)基于在RoPE嵌入空间中计算的放松词移动距离的局部性核,以在掩蔽下实现稳定的相似性,以及(ii)稀疏K采样,这是一种高效的扰动策略,在有限预算下提高交互覆盖率。在HotpotQA(句子特征)和手工标注的MMLU子集(词特征)上的实验表明,RoPE-LIME产生的归因信息比留一法采样更具信息量,并且在显著减少闭源模型API调用的同时,优于gSMILE。
cs.CL / 16 / 2602.06291

Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

评判我们无法解决的问题:一种无Oracle的研究级数学评估的后果导向方法
Son, Guijin, Yang, Donghun, Patel, Hitesh Laxmichand, Ko, Hyunwoo, Agarwal, Amit, Ahn, Sunghee, Lee, Kyong-Ha, Yu, Youngjae
Abstract
Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances where the underlying solver often fails to solve.
Chinese Translation
最近在推理模型方面的进展表明,生成研究级数学的合理尝试可能在可及范围内,但验证仍然是一个瓶颈,消耗了稀缺的专家时间。我们假设,一个有意义的解决方案应包含足够的方法级信息,当应用于一组相关问题的邻域时,应该比错误的解决方案产生更好的下游表现。在此基础上,我们提出了后果导向效用,这是一种无Oracle的评估器,通过测试每个候选方案在解决相关且可验证问题时作为上下文示例的价值来对其进行评分。我们的方法在一组原创的研究级数学问题上进行了评估,每个问题都配有一个专家撰写的解决方案和九个LLM生成的解决方案。值得注意的是,后果导向效用在排名质量上始终优于奖励模型、生成奖励模型和LLM评审。具体而言,对于GPT-OSS-120B,其Acc@1从67.2提高到76.3,AUC从71.4提高到79.6,而在GPT-OSS-20B上也有类似的大幅AUC提升(从69.0提高到79.2)。此外,与LLM评审相比,它还表现出更大的解题者-评估者差距,即使在基础解题者通常无法解决的实例上,也保持了更强的正确与错误的分离。
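The scoring loop the abstract describes, valuing a candidate solution by how much it helps as an in-context exemplar on related verifiable questions, reduces to a simple average. A hedged sketch in which `solve(question, exemplar)` is a hypothetical stand-in for an LLM call on a checkable question, not the paper's code:

```python
def consequence_based_utility(candidate_solution, neighbor_questions, solve):
    """Score a candidate by its downstream value: the fraction of related,
    verifiable neighbor questions answered correctly when the candidate is
    supplied as an in-context exemplar. `solve` is a hypothetical interface
    returning True/False; real usage would wrap an LLM + answer checker."""
    wins = sum(bool(solve(q, candidate_solution)) for q in neighbor_questions)
    return wins / len(neighbor_questions)

# Toy check: an exemplar that carries the right method ("halve") transfers
# to the neighborhood; an exemplar without it does not.
def toy_solver(question, exemplar):
    return "halve" in exemplar  # stand-in for method-level transfer

print(consequence_based_utility("halve both sides", ["q1", "q2"], toy_solver))  # → 1.0
```

The key design point is that the neighbor questions are verifiable even when the original research-level problem is not, so no oracle answer for the candidate itself is ever needed.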
cs.CL / 17 / 2602.06307

Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions

迷失在语言中:超越标准UD假设的口语代码切换基准测试、评估与解析
Tyagi, Nemika, Hendrix, Holly, Licona-Guevara, Nelvin, Mackie, Justin, Kareen, Phanos, Imran, Muhammad, Smith, Megan Michelle, Hernande, Tatiana Gallego, Baral, Chitta, Kellert, Olga
Abstract
Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions, causing parsers and large language models (LLMs) to fail despite strong performance on written data. These failures are compounded by rigid evaluation metrics that conflate genuine structural errors with acceptable variation. In this work, we present a systems-oriented approach to spoken CSW parsing. We introduce a linguistically grounded taxonomy of spoken CSW phenomena and SpokeBench, an expert-annotated gold benchmark designed to test spoken-language structure beyond standard UD assumptions. We further propose FLEX-UD, an ambiguity-aware evaluation metric, which reveals that existing parsing techniques perform poorly on spoken CSW by penalizing linguistically plausible analyses as errors. We then propose DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis. Experiments show that DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations further reveal qualitative improvements that are masked by standard metrics.
Chinese Translation
口语代码切换(CSW)在语法解析方面提出了挑战,这种挑战在书面文本中并未观察到。口语中的流畅性缺失、重复、省略和话语驱动的结构常常违反标准的通用依赖(UD)假设,导致解析器和大型语言模型(LLMs)在书面数据上表现良好,但在口语数据上却失败。这些失败因僵化的评估指标而加剧,这些指标将真正的结构错误与可接受的变异混为一谈。在本研究中,我们提出了一种面向系统的口语CSW解析方法。我们引入了一种基于语言学的口语CSW现象分类法,并推出了SpokeBench,一个经过专家注释的金标准基准,旨在测试超越标准UD假设的口语语言结构。我们进一步提出了FLEX-UD,一种考虑歧义的评估指标,揭示现有解析技术在口语CSW上的表现不佳,因为它将语言上合理的分析视为错误进行惩罚。随后,我们提出了DECAP,一种解耦的代理解析框架,将口语现象处理与核心语法分析分离。实验表明,DECAP在不重新训练的情况下产生了更稳健且可解释的解析,并在现有解析技术上实现了高达52.6%的改进。FLEX-UD的评估进一步揭示了被标准指标掩盖的定性改进。
cs.CL / 18 / 2602.06337

Can Post-Training Transform LLMs into Causal Reasoners?

后训练能否将大型语言模型转变为因果推理者?
Chen, Junqi, Chen, Sirui, Lu, Chaochao
Abstract
Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at https://github.com/OpenCausaLab/CauGym.
Chinese Translation
因果推理对于决策制定至关重要,但对于非专家而言仍然具有挑战性。尽管大型语言模型(LLMs)在这一领域展现出潜力,但它们的精确因果估计能力仍然有限,后训练对这些能力的影响尚未得到充分探讨。本文考察了后训练在多大程度上能够增强LLMs的因果推理能力。我们引入了CauGym,一个包含七个核心因果任务的综合数据集用于训练,以及五个多样化的测试集。利用该数据集,我们系统地评估了五种后训练方法:SFT、DPO、KTO、PPO和GRPO。在五个领域内和四个现有基准上,我们的实验表明,适当的后训练使得较小的LLMs能够在因果推理上具备竞争力,往往超越更大的模型。我们的14B参数模型在CaLM基准上达到了93.5%的准确率,而OpenAI的o3模型仅为55.4%。此外,后训练的LLMs在面对分布变化和噪声数据等现实条件下表现出强大的泛化能力和鲁棒性。总体而言,这些发现提供了首个系统证据,表明有针对性的后训练能够产生可靠且稳健的基于LLM的因果推理者。我们的数据和GRPO模型可在https://github.com/OpenCausaLab/CauGym获取。
cs.CL / 19 / 2602.06358

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

SHINE:一种可扩展的上下文超网络,用于单次映射上下文到LoRA
Liu, Yewei, Wang, Xiyuan, Mao, Yansheng, Gelbery, Yoav, Maron, Haggai, Zhang, Muhan
Abstract
We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLM). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/Yewei-Liu/SHINE
Chinese Translation
我们提出了SHINE(可扩展的上下文超网络),这是一种可扩展的超网络,可以将多样的有意义上下文映射为高质量的LoRA适配器,以用于大型语言模型(LLM)。通过在上下文超网络设计中重用冻结的LLM自身参数并引入架构创新,SHINE克服了先前超网络的关键限制,并以相对较少的参数实现了强大的表达能力。我们引入了一个预训练和指令微调的流程,并训练我们的超网络从多样的有意义上下文中生成高质量的LoRA适配器,仅需一次前向传播。它在不进行任何微调的情况下更新LLM参数,并立即使得与上下文相关的复杂问答任务得以实现,而无需直接访问上下文,有效地在一次传播中将上下文内知识转化为参数内知识。我们的工作在各种任务上取得了卓越的结果,与基于SFT的LLM适应相比,大大节省了时间、计算和内存成本,并显示出良好的扩展潜力。我们的代码可在 https://github.com/Yewei-Liu/SHINE 获取。
cs.CL / 20 / 2602.06370

Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production

面向成本的文本分类模型选择:生产中精细调优编码器与大型语言模型提示之间的多目标权衡
Gonzalez, Alberto Andres Valdes
Abstract
Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
Chinese Translation
大型语言模型(LLMs),如GPT-4o和Claude Sonnet 4.5,在开放式推理和生成语言任务中展现了强大的能力,从而在广泛的自然语言处理(NLP)应用中得到了广泛采用。然而,对于具有固定标签空间的结构化文本分类问题,模型选择往往仅由预测性能驱动,忽视了生产系统中遇到的操作约束。在本研究中,我们系统地比较了两种截然不同的文本分类范式:基于零样本和少样本提示的大型语言模型,以及完全精细调优的仅编码器架构。我们在四个经典基准(IMDB、SST-2、AG News和DBPedia)上评估这些方法,测量预测质量(宏F1)、推理延迟和货币成本。我们将模型评估构建为一个多目标决策问题,并通过帕累托前沿投影和反映不同部署模式的参数化效用函数分析权衡。我们的结果表明,来自BERT家族的精细调优编码器模型在分类性能上具有竞争力,且通常优于零样本和少样本LLM提示,同时在成本和延迟上低一个到两个数量级。总体而言,我们的研究结果表明,随意使用大型语言模型进行标准文本分类工作负载可能导致次优的系统级结果。相反,精细调优的编码器作为结构化NLP管道中的稳健和高效组件出现,而LLMs更适合作为混合架构中的补充元素。我们发布所有代码、数据集和评估协议,以支持可重复性和面向成本的NLP系统设计。
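The multi-objective framing in this abstract reduces to a Pareto filter over (macro-F1, cost, latency) plus a parameterized utility. A minimal sketch, with made-up placeholder numbers and illustrative weights (not the paper's actual measurements or utility function):

```python
# Hedged sketch of Pareto-frontier selection plus a weighted utility.
# F1 is maximized; cost and latency are minimized. All numbers used with
# this are illustrative placeholders.

def dominates(a, b):
    """a dominates b: no worse on every objective, strictly better on one."""
    no_worse = (a["f1"] >= b["f1"] and a["cost"] <= b["cost"]
                and a["latency"] <= b["latency"])
    strictly = (a["f1"] > b["f1"] or a["cost"] < b["cost"]
                or a["latency"] < b["latency"])
    return no_worse and strictly

def pareto_frontier(models):
    # Keep every model not dominated by some other model.
    return [m for m in models if not any(dominates(o, m) for o in models)]

def utility(m, w_f1=1.0, w_cost=0.1, w_lat=0.01):
    # The deployment regime is encoded in the weights (assumed values).
    return w_f1 * m["f1"] - w_cost * m["cost"] - w_lat * m["latency"]
```

Under cost-sensitive weights, a cheap fine-tuned encoder can out-rank a slightly more accurate but far more expensive prompted LLM, matching the abstract's conclusion.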
cs.CL / 21 / 2602.06373

ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis

ReBeCA:揭示语言模型迭代自我反思背后的可解释行为层次结构及其因果分析
Yan, Tianqiang, Shang, Sihan, Li, Yuheng, Qiu, Song, Peng, Hao, Luo, Wenjian, Xie, Jue, Qu, Lizhen, Gao, Yuan
Abstract
While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, with existing analyses often yielding correlation-based insights that fail to generalize. To address this, we introduce \textbf{\texttt{ReBeCA}} (self-\textbf{\texttt{Re}}flection \textbf{\texttt{Be}}havior explained through \textbf{\texttt{C}}ausal \textbf{\texttt{A}}nalysis), a framework that unveils the interpretable behavioral hierarchy governing the self-reflection outcome. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) \textbf{Behavioral hierarchy:} Semantic behaviors of the model influence final self-reflection results hierarchically: directly or indirectly; (2) \textbf{Causation matters:} Generalizability in self-reflection effects is limited to just a few semantic behaviors; (3) \textbf{More $\mathbf{\neq}$ better:} The confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to $49.6\%$ structural likelihood gains, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm these causal relationships hold out-of-distribution ($p = .013, \eta^2_\mathrm{p} = .071$). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.
Chinese Translation
尽管自我反思可以增强语言模型的可靠性,但其潜在机制仍然不透明,现有分析往往仅提供基于相关性的见解,无法推广。为了解决这一问题,我们引入了ReBeCA(self-Reflection Behavior explained through Causal Analysis,即通过因果分析解释的自我反思行为)框架,该框架揭示了支配自我反思结果的可解释行为层次结构。通过将自我反思轨迹建模为因果图,ReBeCA通过三阶段不变因果预测(Invariant Causal Prediction, ICP)流程,孤立出真正的性能决定因素。我们确立了三个关键发现:(1) 行为层次结构:模型的语义行为以层次方式影响最终的自我反思结果:直接或间接;(2) 因果关系重要:自我反思效果的可推广性仅限于少数几种语义行为;(3) 更多 $\neq$ 更好:看似积极的语义行为的汇聚,即使在直接因果因素中,也可能削弱自我反思的有效性。基于ICP的验证识别出稀疏的因果父节点,结构似然增益高达$49.6\%$,在相关性模式失效的任务中保持稳定。对新数据集的干预研究确认这些因果关系在分布外依然成立($p = .013, \eta^2_\mathrm{p} = .071$)。因此,ReBeCA为在自我反思动态中区分真正的因果机制与虚假关联提供了一种严格的方法论。
cs.CL / 22 / 2602.06384

FMBench: Adaptive Large Language Model Output Formatting

FMBench:自适应大语言模型输出格式化
Wang, Yaoting, Zhou, Yun, Ding, Henghui
Abstract
Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
Chinese Translation
在面向用户和系统集成的工作流程中,生成满足语义意图和格式约束的输出对于部署大语言模型至关重要。在本研究中,我们关注Markdown格式化,这在助手、文档和工具增强的管道中无处不在,但仍然容易出现微妙且难以检测的错误(例如,损坏的列表、格式错误的表格、不一致的标题和无效的代码块),这些错误可能显著降低下游的可用性。我们提出了FMBench,这是一个用于自适应Markdown输出格式化的基准,评估模型在各种遵循指令的场景下的表现,这些场景具有多样的结构要求。FMBench强调现实世界中的格式化行为,如多层次组织、混合内容(自然语言与列表/表格/代码交错)以及严格遵循用户指定的布局约束。为了提高Markdown的合规性而不依赖于严格的解码约束,我们提出了一种轻量级的对齐管道,该管道结合了监督微调(SFT)和强化学习微调。从基础模型开始,我们首先对指令-响应对进行SFT,然后优化一个综合目标,以平衡语义忠实性与结构正确性。在两个模型系列(OpenPangu和Qwen)上的实验表明,SFT始终改善语义对齐,而强化学习在从强SFT策略初始化时提供了对具有挑战性的Markdown指令的额外鲁棒性提升。我们的结果还揭示了语义和结构目标之间固有的权衡,强调了为可靠格式化生成精心设计奖励的重要性。代码可在以下链接获取:https://github.com/FudanCVL/FMBench。
cs.CL / 23 / 2602.06412

Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding

在掩码扩散语言模型解码中停止收敛令牌的计算
Oba, Daisuke, Bollegala, Danushka, Kaneko, Masahiro, Okazaki, Naoaki
Abstract
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock .
Chinese Translation
掩码扩散语言模型通过逐步采样生成序列,逐渐揭示令牌。然而,它们在每一步对每个令牌位置仍然重新计算注意力和前馈模块——即使许多未掩码的令牌实际上是固定的,这导致了计算资源的显著浪费。我们提出了 SureLock:当未掩码位置的后验在多个步骤中稳定(我们的确定条件)时,我们锁定该位置——此后跳过其查询投影和前馈子层,同时缓存其注意力键和值,以便其他位置可以继续关注它。这将每次迭代的主要计算成本从 $O(N^2d)$ 降低到 $O(MNd)$,其中 $N$ 是序列长度,$M$ 是解锁的令牌位置数量,$d$ 是模型维度。在实践中,随着迭代的进行,$M$ 会减少,从而实现显著的节省。在 LLaDA-8B 上,SureLock 相较于未锁定的相同采样器将算法 FLOPs 降低了 30% 至 50%,同时保持了相当的生成质量。我们还提供了理论分析,以证明 SureLock 的设计原理:仅在锁定步骤监控局部 KL 就足以限制最终令牌概率的偏差。我们的代码将会在 https://daioba.github.io/surelock 上发布。
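The "sure" condition above can be sketched as a per-position stability test: an unmasked position is locked once the KL divergence between its posteriors at consecutive denoising steps falls below a tolerance. The interfaces below are illustrative assumptions, not LLaDA's or SureLock's actual API:

```python
import math

# Hedged sketch of the lock test: monitor the local KL at each unmasked
# position across steps; once it stabilizes, lock the position (skip its
# query projection and FFN thereafter, keeping cached K/V so other
# positions can still attend to it). Data layout is an assumption.

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def update_locks(prev_post, curr_post, locked, unmasked, tol=1e-4):
    """Lock unmasked positions whose posterior has stabilized across steps."""
    for i in unmasked:
        if i not in locked and kl(curr_post[i], prev_post[i]) < tol:
            locked.add(i)
    return locked
```

As more positions lock, the number of unlocked positions $M$ shrinks, which is what drops the per-step cost from $O(N^2 d)$ toward $O(MNd)$.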
cs.CL / 24 / 2602.06423

On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

想象的翅膀:基于冲突剧本的多角色幽默字幕生成框架
Shang, Wenbo, Sun, Yuxi, Ma, Jing, Huang, Xin
Abstract
Humor is a common and intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs), typically framed as funny caption generation for images, requiring visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands the creative space of them through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
Chinese Translation
幽默是日常生活中常用且复杂的人类语言。幽默生成,尤其是在多模态场景中,是大型语言模型(LLMs)面临的一项挑战性任务,通常表现为图像的幽默字幕生成,这需要视觉理解、幽默推理、创造性想象等能力。现有的基于LLM的方法依赖于推理链或自我改进,然而这些方法在创造力和可解释性方面存在局限。为了解决这些瓶颈,我们开发了一种基于基本幽默理论GTVH的新型LLM幽默生成机制。为了生成幽默且与剧本相对立的字幕,我们引入了一种基于幽默理论的多角色LLM协作框架,增强了幽默检索(HOMER)。该框架由三个基于LLM的角色组成:(1)冲突剧本提取器,将幽默与关键剧本对立结合,形成字幕生成的基础;(2)增强检索的层次想象者,识别关键幽默目标,并通过结构化为想象树的多样关联扩展其创造性空间;(3)字幕生成器,根据获得的知识生成幽默且多样的字幕。在两个《纽约客》漫画基准数据集上的广泛实验表明,HOMER在多模态幽默字幕生成方面优于最先进的基线和强大的LLM推理策略。
cs.CL / 25 / 2602.06430

Investigating the structure of emotions by analyzing similarity and association of emotion words

通过分析情感词的相似性和关联性研究情感结构
Iwaki, Fumitaka, Takahashi, Tatsuji
Abstract
In the field of natural language processing, some studies have attempted sentiment analysis on text by handling emotions as explanatory or response variables. One of the most popular emotion models used in this context is the wheel of emotion proposed by Plutchik. This model schematizes human emotions in a circular structure, and represents them in two or three dimensions. However, the validity of Plutchik's wheel of emotion has not been sufficiently examined. This study investigated the validity of the wheel by creating and analyzing semantic networks of emotion words. Through our experiments, we collected data on the similarity and association of ordered pairs of emotion words, and constructed networks using these data. We then analyzed the structure of the networks through community detection, and compared it with that of the wheel of emotion. The results showed that each network's structure was, for the most part, similar to that of the wheel of emotion, but locally different.
Chinese Translation
在自然语言处理领域,一些研究尝试通过将情感作为解释变量或响应变量对文本进行情感分析。在这种背景下,最受欢迎的情感模型之一是Plutchik提出的情感轮。该模型以圆形结构对人类情感进行示意化,并以二维或三维形式表示。然而,Plutchik的情感轮的有效性尚未得到充分检验。本研究通过创建和分析情感词的语义网络来探讨该情感轮的有效性。通过我们的实验,我们收集了情感词有序对的相似性和关联性数据,并利用这些数据构建了网络。随后,我们通过社区检测分析了网络的结构,并将其与情感轮的结构进行了比较。结果表明,每个网络的结构在大多数情况下与情感轮的结构相似,但在局部上存在差异。
cs.CL / 26 / 2602.06440

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

TrailBlazer:基于历史指导的强化学习用于黑箱大型语言模型越狱
Yoon, Sung-Hoon, Qian, Ruizhi, Zhao, Minda, Li, Weiyue, Wang, Mengyu
Abstract
Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
Chinese Translation
大型语言模型(LLMs)已成为许多领域的核心,因此其安全性成为了一个关键优先事项。先前的越狱研究探索了多种方法,包括提示优化、自动化红队、混淆和基于强化学习(RL)的方法。然而,大多数现有技术未能有效利用早期交互回合中揭示的漏洞,导致攻击效率低下且不稳定。由于越狱涉及的交互是顺序进行的,每个响应都会影响未来的行动,因此强化学习为这一问题提供了自然的框架。基于此动机,我们提出了一种基于历史感知的RL越狱框架,该框架分析和重新加权来自先前步骤的漏洞信号,以指导未来的决策。我们展示了仅仅结合历史信息就能提高越狱成功率。在此基础上,我们引入了一种基于注意力的重新加权机制,突出了交互历史中的关键漏洞,从而以更少的查询实现更高效的探索。在AdvBench和HarmBench上的广泛实验表明,我们的方法在越狱性能上达到了最先进的水平,同时显著提高了查询效率。这些结果强调了历史漏洞信号在基于强化学习的越狱策略中的重要性,并为推进大型语言模型安全防护的对抗性研究提供了一个原则性路径。
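The attention-based reweighting of historical vulnerability signals described above can be caricatured as a softmax over per-turn scores, so that the most informative past interactions dominate the next decision. This is a minimal illustrative sketch under assumed interfaces, not the paper's actual mechanism:

```python
import math

# Hedged sketch: softmax-normalize vulnerability scores from prior
# interaction turns so that stronger historical signals receive larger
# weight when guiding the next action. `temperature` is an assumed knob.

def reweight_history(scores, temperature=1.0):
    """Return attention-style weights over historical turn scores."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Lowering the temperature sharpens the distribution, concentrating exploration on the turns that revealed the clearest vulnerabilities.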
cs.CL / 27 / 2602.06446

CORE: Comprehensive Ontological Relation Evaluation for Large Language Models

CORE:大语言模型的综合本体关系评估
Dwivedi, Satyam, Ghosh, Sanjukta, Dwivedi, Shivam, Kumari, Nishi, Thakur, Anil, Purushottam, Anurag, Alok, Deepak, Gatla, Praveen, B, Manjuprasad, Patgiri, Bipasha
Abstract
Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
Chinese Translation
大语言模型(LLMs)在许多推理基准测试中表现良好,但现有评估很少评估它们区分有意义语义关系和真正无关性的能力。我们引入了CORE(综合本体关系评估),这是一个包含225K道多项选择题的数据集,涵盖74个学科,同时提供一个包含203个经过严格验证问题的通用领域开源基准(Cohen's Kappa = 1.0),涵盖24种语义关系类型,且无关对占比均等。来自1000多名参与者的人类基线达到了92.6%的准确率(无关对的准确率为95.1%)。相比之下,29个最先进的LLMs的整体准确率为48.25-70.9%,在相关对上的表现接近上限(86.5-100%),但在无关对上的表现严重下降(0-41.35%),尽管它们给出的置信度相近(92-94%)。无关对的预期校准误差增加了2-4倍,37.6%的平均语义崩溃率表明存在系统性生成虚假关系的现象。在CORE 225K MCQs数据集中,准确率进一步下降至约2%,突显了领域特定语义推理中的重大挑战。我们将无关性推理确定为LLM评估和安全性的重要且未充分评估的前沿领域。
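The Expected Calibration Error referenced in this abstract is a standard statistic: bin predictions by confidence and average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch of the usual formulation (bin count and layout are conventional choices, not CORE's exact protocol):

```python
# Hedged sketch of Expected Calibration Error (ECE) with equal-width
# confidence bins. A well-calibrated model has ECE near 0; overconfident
# wrong answers (as on CORE's unrelated pairs) drive it up.

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

A model answering unrelated pairs at 0% accuracy while reporting ~90% confidence would contribute an ECE near 0.9 on those items, matching the abstract's observation of sharply degraded calibration.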
cs.CL / 28 / 2602.06449

Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning

评估一种基于证据的强化学习框架,以对齐轻参数大型语言模型与精神科临床推理中的决策认知
Lin, Xinxin, Dai, Guangxin, Zhong, Yi, Li, Xiang, Xiao, Xue, Zhang, Yixin, Wu, Zhengdong, Zheng, Yongbo, Zhu, Runchuan, Zhao, Ming, Yu, Huizi, Wu, Shuo, Zhao, Jun, Hu, Lingming, Wang, Yumei, Yin, Ping, Chan, Joey W. Y., Chan, Ngan Yin, Chen, Sijing, Wing, Yun Kwok, Lu, Lin, Ma, Xin, Fan, Lizhou
Abstract
Large language models (LLMs) hold transformative potential for medical decision support yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
Chinese Translation
大型语言模型(LLMs)在医学决策支持中具有变革潜力,但其在精神科的应用仍受到幻觉和表面推理的限制。这一局限性在轻参数LLMs中尤为明显,因为它们对于保护隐私和高效临床部署至关重要。现有的训练范式优先考虑语言流畅性,而非结构化的临床逻辑,导致与专业诊断认知之间存在根本性不一致。在此,我们介绍了ClinMPO,这是一种旨在将LLMs的内部推理与专业精神科实践对齐的强化学习框架。该框架采用了一种专门的奖励模型,该模型在源自4,474篇精神科期刊文章、并按循证医学原则结构化的数据集上独立训练。我们在一个未见的基准子集上评估了ClinMPO,该子集旨在将推理能力与死记硬背分离。该测试集包含了领先的大参数LLMs始终失败的项目。我们将ClinMPO对齐的轻LLM的表现与300名医学生的表现进行了比较。经过ClinMPO调优的Qwen3-8B模型在这些复杂案例中达到了31.4%的诊断准确率,超过了30.8%的人类基准。这些结果表明,基于医学证据的优化使轻参数LLMs能够掌握复杂的推理任务。我们的研究结果表明,明确的认知对齐提供了一条可扩展的途径,以实现可靠和安全的精神科决策支持。
cs.CL / 29 / 2602.06454

RelayGen: Intra-Generation Model Switching for Efficient Reasoning

RelayGen:用于高效推理的生成内模型切换
Song, Jiwon, Kim, Yoongon, Kim, Jae-Joon
Abstract
Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present \textbf{RelayGen}, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2$\times$ end-to-end speedup with less than 2\% accuracy degradation, without requiring additional training or learned routing components.
Chinese Translation
大型推理模型(LRMs)通过生成长的、多步骤的推理轨迹,在复杂推理任务上取得了强大的性能,但推理时的扩展会带来可观的部署成本。一个关键挑战是生成难度在单一输出中存在差异,而现有的以效率为导向的方法要么忽视这种生成内变化,要么依赖于具有高系统复杂性的监督令牌级路由。我们提出了RelayGen,一种无训练的、基于段级的运行时模型切换框架,利用长形式推理中的难度变化。通过对生成不确定性的离线分析,使用令牌概率边际,我们表明粗粒度的段级控制足以捕捉推理轨迹中的难度转变。RelayGen识别特定模型的切换信号,指示向低难度段的转变,并动态地将其延续委托给一个较小的模型,同时在大模型上保留高难度推理。在多个推理基准测试中,RelayGen显著减少了推理延迟,同时保留了大模型的大部分准确性。当与推测解码结合时,RelayGen实现了高达2.2倍的端到端加速,且准确性下降不到2%,无需额外的训练或学习路由组件。
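The segment-level routing signal described above — token probability margins indicating an easy stretch — can be sketched as a threshold test per segment. The threshold and the per-segment aggregation below are illustrative assumptions, not the paper's calibrated switch cues:

```python
# Hedged sketch: route each segment of a long reasoning trajectory to a
# small or large model based on the top-1/top-2 probability margin of the
# tokens generated so far. Aggregation and threshold are assumed choices.

def pick_model(margins, threshold=0.5):
    """Return 'small' when the mean margin indicates low difficulty."""
    mean_margin = sum(margins) / len(margins)
    return "small" if mean_margin >= threshold else "large"

def relay_generate(segments_margins, threshold=0.5):
    # One coarse-grained routing decision per segment, as in the abstract.
    return [pick_model(m, threshold) for m in segments_margins]
```

High-margin (confident, easy) segments relay to the small model while low-margin segments stay on the large one, which is the source of the latency savings.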
cs.CL / 30 / 2602.06462

Diffusion-State Policy Optimization for Masked Diffusion Language Models

针对掩码扩散语言模型的扩散状态策略优化
Oba, Daisuke, Furuta, Hiroki, Okazaki, Naoaki
Abstract
Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .
Chinese Translation
掩码扩散语言模型通过在多个去噪步骤中迭代填充掩码标记生成,因此仅从最终完成的终端奖励中学习会导致对中间决策的粗糙信用分配。我们提出了 DiSPO(扩散状态策略优化),这是一个插件式信用分配层,直接优化中间填充决策。在选定的中间掩码状态下,DiSPO 通过从 rollout 缓存的 logits 中重新采样当前掩码位置的填充,进行分支,评分生成的完成结果,并仅更新新填充的标记——无需额外的多步扩散 rollout。我们为分支完成形式化了一个固定状态目标,并推导出一个策略梯度估计器,该估计器可以与使用相同 rollout 的终端反馈策略优化相结合。在 LLaDA-8B-Instruct 上,DiSPO 在匹配的 rollout 计算和优化步骤下,在数学和规划基准测试中持续优于终端反馈的 diffu-GRPO 基线。我们的代码将发布在 https://daioba.github.io/dispo 。
cs.CL / 31 / 2602.06470

Improve Large Language Model Systems with User Logs

通过用户日志提升大型语言模型系统
Wang, Changyue, Su, Weihang, Ai, Qingyao, Liu, Yiqun
Abstract
Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .
Chinese Translation
扩展训练数据和模型参数长期以来推动了大型语言模型(LLMs)的进步,但这一范式正日益受到高质量数据稀缺和计算成本上升带来的收益递减的限制。因此,近期的研究越来越关注于从真实世界部署中进行持续学习,其中用户交互日志提供了丰富的真实人类反馈和过程知识。然而,由于用户日志的非结构化和噪声特性,从中学习面临挑战。普通的LLM系统往往难以区分有用的反馈信号和噪声用户行为,而用户日志收集与模型优化之间的差异(例如,离策略优化问题)进一步加剧了这一问题。为此,我们提出了UNO(User log-driveN Optimization),这是一个用于通过用户日志改进LLM系统(LLMsys)的统一框架。UNO首先将日志提炼为半结构化规则和偏好对,然后采用基于查询和反馈的聚类方法来管理数据异质性,最后量化模型的先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉噪声反馈,并为从用户日志中提取的主要和反思体验构建不同的模块,从而改善未来的响应。大量实验表明,UNO在效果和效率上均达到了最先进的水平,显著优于检索增强生成(Retrieval Augmented Generation, RAG)和基于记忆的基线。我们的代码已开源,地址为 https://github.com/bebr2/UNO 。
cs.CL / 32 / 2602.06471

Revisiting the Shape Convention of Transformer Language Models

重新审视变换器语言模型的形状约定
Liao, Feng-Ting, Chen, Meng-Hsi, Yi, Guan-Ting, Shiu, Da-shan
Abstract
Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformer, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass FFN can be more effectively utilized, such as by enlarging model hidden dimensions under fixed budgets. We confirm these through empirical validations across model scales: hourglass FFNs outperform conventional FFNs up to 400M and achieve comparable performance at larger scales to 1B parameters; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and the balance between attention and FFN towards efficient and expressive modern language models.
Chinese Translation
密集型变换器语言模型在架构形状上大致遵循一致性:每一层由一个注意力模块和一个后续的前馈网络(Feed-Forward Network, FFN)组成,后者采用窄-宽-窄的多层感知器(MLP),并将大部分参数分配给扩展比在2到4之间的MLP。受到近期研究结果的启发,这些结果表明残差宽-窄-宽(hourglass,沙漏形)MLP在函数逼近能力上表现优越,我们重新审视了变换器中长期以来的MLP形状约定,质疑窄-宽-窄设计的必要性。为此,我们开发了一种变换器变体,用更深的hourglass形状FFN替代传统的FFN,该FFN由一系列通过残差路径连接的hourglass子MLP堆叠而成。我们认为,更深但更轻的hourglass FFN可以作为传统FFN的竞争替代方案,并且通过使用更轻的hourglass FFN节省的参数可以更有效地利用,例如在固定预算下扩大模型的隐藏维度。我们通过不同模型规模的实证验证确认了这些观点:hourglass FFN在参数量高达4亿时优于传统FFN,并在更大规模(达到10亿参数)时实现了可比的性能;在匹配预算的情况下,减少FFN参数并增加注意力参数的hourglass FFN变体相较于传统配置显示出一致的改进。这些发现共同为近期的研究提供了新的视角,并促使我们重新思考窄-宽-窄MLP约定以及高效且富有表现力的现代语言模型中注意力与FFN之间的平衡。
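The parameter trade-off the abstract describes can be sketched numerically. Below is a minimal, illustrative comparison (the dimensions and the 4-block depth are assumptions for illustration, not the paper's configuration) between a conventional narrow-wide-narrow FFN and a residual stack of hourglass sub-MLPs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, expand, d_narrow = 64, 4, 16  # hidden dim, conventional expansion, assumed bottleneck

def relu(x):
    return np.maximum(x, 0.0)

def conventional_ffn(x, W_up, W_down):
    # narrow-wide-narrow: d -> expand*d -> d, one wide hidden layer
    return relu(x @ W_up) @ W_down

def hourglass_ffn(x, blocks):
    # deeper, lighter stack of wide-narrow-wide sub-MLPs (d -> d_narrow -> d),
    # each wrapped in a residual connection
    for W_in, W_out in blocks:
        x = x + relu(x @ W_in) @ W_out
    return x

x = rng.normal(size=(2, d))
W_up = 0.02 * rng.normal(size=(d, expand * d))
W_down = 0.02 * rng.normal(size=(expand * d, d))
blocks = [(0.02 * rng.normal(size=(d, d_narrow)), 0.02 * rng.normal(size=(d_narrow, d)))
          for _ in range(4)]

conv_params = W_up.size + W_down.size                # 2 * expand * d^2 = 32768
hg_params = sum(a.size + b.size for a, b in blocks)  # 2 * 4 * d * d_narrow = 8192
print(conv_params, hg_params)
```

Under these toy shapes the hourglass stack uses a quarter of the FFN parameters, which is the budget the paper proposes reallocating (e.g. to a larger hidden dimension or to attention).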
cs.CL / 33 / 2602.06526

Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks

补全缺失注释:多智能体辩论以实现信息检索基准的准确和可扩展相关性评估
Ban, Minjeong, Choi, Jeonghwan, Min, Hyangsuk, Kim, Nicole Hee-Yeon, Kim, Minseok, Lee, Jae-Gil, Song, Hwanjun
Abstract
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26; and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
Chinese Translation
信息检索(IR)评估因不完整的IR基准数据集而面临挑战,这些数据集中包含未标注的相关文本块。尽管大型语言模型(LLMs)和LLM-人类混合策略减少了昂贵的人力投入,但它们仍然容易受到LLM过度自信和无效的AI到人类升级的影响。为了解决这个问题,我们提出了DREAM,一个基于多轮辩论的相关性评估框架,采用LLM智能体,建立在对立的初始立场和迭代的互相批评之上。通过我们的基于共识的辩论,它为确定性较高的案例提供了更准确的标注,并为不确定的案例提供了更可靠的AI到人类升级,实现了95.2%的标注准确率,仅需3.5%的人工参与。使用DREAM,我们构建了BRIDGE,一个经过精炼的基准,减轻评估偏差,并通过揭示29,824个缺失的相关文本块来实现更公平的检索器比较。随后,我们重新评估了IR系统,并将评估扩展到RAG,显示未解决的缺口不仅扭曲了检索器排名,还导致检索与生成的不一致。相关性评估框架可在https://github.com/DISL-Lab/DREAM-ICLR-26获取;BRIDGE数据集可在https://github.com/DISL-Lab/BRIDGE-Benchmark获取。
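The agreement-based debate with AI-to-human escalation can be sketched as a small control loop. Everything below is a hypothetical stand-in (the judge functions and round limit are invented), not DREAM's actual agents or prompts:

```python
def debate_label(judges, query, chunk, max_rounds=3):
    """judges: callables (query, chunk, history) -> 'relevant' / 'irrelevant'.
    Returns (label, needs_human)."""
    history = []
    for _ in range(max_rounds):
        votes = [judge(query, chunk, history) for judge in judges]
        history.append(votes)
        if len(set(votes)) == 1:      # full agreement -> accept the label
            return votes[0], False
    return None, True                 # persistent disagreement -> escalate to a human

# stub judges: one fixed stance, one that concedes after seeing the debate history
judge_a = lambda q, c, h: "relevant"
judge_b = lambda q, c, h: "relevant" if h else "irrelevant"
label, needs_human = debate_label([judge_a, judge_b], "query", "chunk")
print(label, needs_human)  # relevant False
```

The opposing initial stances mean easy cases converge quickly, while genuinely ambiguous chunks exhaust the rounds and are routed to annotators, matching the low-human-involvement numbers the abstract reports.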
cs.CL / 34 / 2602.06546

MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew

MTQE.en-he:英语-希伯来语的机器翻译质量评估
Rosenbaum, Andy, Siani, Assaf, Kernerman, Ilan
Abstract
We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.
Chinese Translation
我们发布了MTQE.en-he:据我们所知,这是第一个公开可用的英语-希伯来语机器翻译质量评估基准。MTQE.en-he包含来自WMT24++的959个英语段落,每个段落都配有对应的希伯来语机器翻译,以及由三位人类专家标注的翻译质量直接评估分数。我们对ChatGPT提示、TransQuest和CometKiwi进行了基准测试,结果表明,三种模型的集成在皮尔逊相关系数上比最佳单一模型(CometKiwi)提高了6.4个百分点,在斯皮尔曼相关系数上提高了5.6个百分点。对TransQuest和CometKiwi的微调实验表明,完整模型更新对过拟合和分布崩溃敏感,而参数高效的方法(LoRA、BitFit和FTHead,即仅微调分类头)训练稳定,且提高了2-3个百分点。MTQE.en-he及我们的实验结果为未来对这一资源匮乏的语言对的研究提供了可能。
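The score-averaging ensemble and correlation evaluation the abstract reports can be illustrated on synthetic data (the noise levels and per-system behavior below are assumptions, not the benchmark's real scores):

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # rank-transform both vectors, then take Pearson on the ranks
    return np.corrcoef(a.argsort().argsort(), b.argsort().argsort())[0, 1]

rng = np.random.default_rng(7)
n = 200
human = rng.normal(size=n)  # stand-in for human Direct Assessment scores
# three hypothetical QE systems = shared human signal + independent noise
systems = {"chatgpt": human + 1.2 * rng.normal(size=n),
           "transquest": human + 1.0 * rng.normal(size=n),
           "cometkiwi": human + 0.8 * rng.normal(size=n)}

ensemble = np.mean(list(systems.values()), axis=0)  # simple score averaging
best = pearson(human, systems["cometkiwi"])
ens = pearson(human, ensemble)
print(ens > best)  # averaging tends to cancel the independent noise terms
```

When each system's error is roughly independent, the mean score has lower noise variance than any single system, which is one plausible mechanism behind the Pearson and Spearman gains reported over the best single model.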
cs.CL / 35 / 2602.06570

Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

百川-M3:为可靠的医疗决策建模临床询问
M3 Team, Dou, Chengfeng, Yang, Fan, Li, Fei, Jia, Jiyuan, Ju, Qiang, Wang, Shuai, Li, Tianpeng, Zeng, Xiangrong, Zhou, Yijie, Zhang, Hongda, Tai, Jinyang, Sun, Linzhuang, Guo, Peidong, Mo, Yichuan, Wang, Xiaochuan, Cui, Hengfu, Zhang, Zhishou
Abstract
We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety. The models are publicly available at https://huggingface.co/collections/baichuan-inc/baichuan-m3.
Chinese Translation
我们介绍了百川-M3,这是一种医学增强的大型语言模型,旨在将被动问答的范式转变为主动的临床级决策支持。针对现有系统在开放式咨询中的局限性,百川-M3利用专门的训练流程来模拟医生的系统工作流程。其关键能力包括:(i)主动信息获取以解决模糊性;(ii)长时间跨度推理,将分散的证据统一为连贯的诊断;以及(iii)自适应幻觉抑制,以确保事实的可靠性。实证评估表明,百川-M3在HealthBench、新推出的HealthBench-Hallu和ScanBench上取得了最先进的结果,在临床询问、建议和安全性方面显著超越了GPT-5.2。该模型已在https://huggingface.co/collections/baichuan-inc/baichuan-m3上公开提供。
cs.CL / 36 / 2602.06584

Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning

基于潜在思维向量的推理时重新思考用于数学推理
Kong, Deqian, Zhao, Minglu, Qin, Aoyang, Pang, Bo, Tao, Chenxin, Hartmann, David, Honig, Edouardo, Xu, Dehong, Kumar, Amit, Sarte, Matt, Li, Chuan, Xie, Jianwen, Wu, Ying Nian
Abstract
Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.
Chinese Translation
标准的思维链推理在单次前向传播中生成解决方案,对每个标记做出不可逆的承诺,并且缺乏从早期错误中恢复的机制。我们提出了推理时重新思考(Inference-Time Rethinking),这是一种生成框架,通过将声明性潜在思维向量与过程性生成解耦,从而实现迭代自我修正。我们将推理分解为一个连续的潜在思维向量(推理什么)和一个解码器(如何推理),该解码器以该向量为条件将推理过程语言化。除了作为声明性缓冲区外,潜在思维向量还将推理结构压缩为连续表示,抽象掉表层标记的变异性,使得基于梯度的推理策略优化成为良定问题。我们的先验模型将无结构噪声映射到学习到的有效推理模式流形上;在测试时,我们采用一种吉布斯风格的程序,在生成候选推理轨迹和优化潜在向量以更好地解释该轨迹之间交替进行,有效地在潜在流形中导航以改进推理策略。从头开始在GSM8K上训练一个0.2B参数的模型,我们的方法经过30次重新思考迭代,超越了参数量为其10到15倍的基线模型,包括一个3B参数的对照模型。这一结果表明,有效的数学推理可以来自复杂的推理时计算,而不仅仅依赖于庞大的参数数量。
cs.CL / 37 / 2602.06600

Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning

回声作为锚点:大规模语言模型推理中的概率成本与注意力重定向
Hao, Zhuoyuan, Li, Zhuo, Li, Wu, Liu, Fangming, Zhang, Min, Li, Jing
Abstract
Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic ``thinking tokens'' and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain -- and often ignore -- the \emph{spontaneous} repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emph{Echo of Prompt (EOP)}, as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the \emph{Echo Likelihood Gap} $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical link that links early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emph{Echo-Distilled SFT (ED-SFT)} to instill an ``echo-then-reason'' pattern through supervised finetuning, and \emph{Echoic Prompting (EP)} to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer to answer-prefix attention in middle layers, consistent with an \emph{attention refocusing} mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at https://github.com/hhh2210/echoes-as-anchors.
Chinese Translation
在大型推理模型(LRMs)中,测试时的计算分配被广泛使用,并在数学问题求解、代码生成和规划等领域具有应用。近期的研究通过扩展自一致性和并行思维来解决这一问题,增加了通用的“思维标记”,并提示模型在回答之前重新阅读问题。不幸的是,这些方法要么注入与任务无关的标记,要么强制使用既无法解释、又常常忽视许多LRMs在其内部推理链开头表现出的\emph{自发性}重复的启发式方法。相反,我们分析并利用模型重述问题的倾向,称之为\emph{提示回声(Echo of Prompt, EOP)},将其作为一种前置的、塑造计算分配的机制。我们通过将回声去除视为基于拒绝的条件化来形式化其概率成本,并定义\emph{回声似然差距(Echo Likelihood Gap)} $\Delta\mathcal{L}$ 作为可计算的代理指标。这提供了缺失的理论环节,将早期重复与似然增益和下游准确性联系起来。然而,这本身并未具体说明如何利用EOP。因此,我们开发了\emph{回声蒸馏的监督微调(Echo-Distilled SFT, ED-SFT)},通过监督微调来灌输“先回声后推理”的模式,以及\emph{回声提示(Echoic Prompting, EP)},在不训练的情况下在推理轨迹中途重新锚定模型。尽管前景可期,但量化超出冗长性之外的收益并非易事。因此,我们进行了长度和后缀受控的似然分析以及逐层注意力研究,结果显示EOP在中间层增加了答案对答案前缀的注意力,这与\emph{注意力重定向}机制一致。我们在GSM8K、MathQA、Hendrycks-MATH、AIME24和MATH-500上进行了评估,采用相同的解码设置和预算,发现相较于基线有一致的提升。代码可在 https://github.com/hhh2210/echoes-as-anchors 获取。
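The Echo Likelihood Gap can be read as a difference of conditional log-likelihoods: the answer's log-probability given a trace that echoes the prompt, minus its log-probability given a plain trace. A toy sketch of that computation (the scorer below is a contrived stand-in, not a language model, and the exact definition in the paper may differ):

```python
import math

def echo_likelihood_gap(logprob, question, echo_trace, plain_trace, answer):
    """A simplified Delta-L proxy: how much more likely the answer becomes
    when the reasoning trace opens by restating the question (the echo)."""
    return (logprob(question + echo_trace, answer)
            - logprob(question + plain_trace, answer))

def toy_logprob(context, continuation):
    # Toy stand-in scorer (not a real LM): a token's probability grows with
    # its count in the context, so echoed question tokens become more likely.
    words = context.split()
    return sum(math.log((words.count(t) + 1) / (len(words) + 50))
               for t in continuation.split())

question = "what is 2 plus 3 ?"
gap = echo_likelihood_gap(toy_logprob, question,
                          " what is 2 plus 3 ? think step by step",  # echo first
                          " think step by step",                      # no echo
                          "2 plus 3 is 5")
print(gap > 0)  # True: under this toy scorer the echo raises answer likelihood
```

A positive gap means the echo prefix made the eventual answer more probable, which is the quantity the paper uses as a computable proxy linking spontaneous repetition to downstream accuracy.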
cs.CL / 38 / 2602.06623

Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

提示能保证安全性吗?通过子空间干预减轻大型语言模型生成内容的毒性
Singh, Himanshu, Xu, Ziwei, Subramanyam, A. V., Kankanhalli, Mohan
Abstract
Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence, or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns from underlying model representations, while preserving overall ability to generate safe fluent content. On the RealToxicityPrompts, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces toxicity of state-of-the-art detoxification systems by 8-20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
Chinese Translation
大型语言模型(LLMs)是强大的文本生成器,但即使在给定看似无害的提示时,它们也可能产生有毒或有害的内容。这带来了严重的安全挑战,并可能造成现实世界的伤害。毒性往往是微妙且依赖于上下文的,使得在标记级别或通过粗略的句子级别信号进行检测变得困难。此外,减轻毒性的努力通常面临安全性与生成文本的连贯性或流畅性之间的权衡。在本研究中,我们提出了一种针对性的子空间干预策略,用于识别和抑制潜在模型表示中的隐性毒性模式,同时保持生成安全流畅内容的整体能力。在RealToxicityPrompts数据集上,我们的方法相比现有基线实现了强大的减轻性能,且对推理复杂度的影响最小。在多个大型语言模型中,我们的方法将最先进的去毒化系统的毒性降低了8-20%,同时保持了相当的流畅性。通过广泛的定量和定性分析,我们展示了该方法在不损害生成性能的情况下有效减少毒性,并始终优于现有基线。
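One common form of subspace intervention, plausibly related to (but not necessarily identical with) the paper's method, removes the projection of hidden states onto an identified "toxic" subspace while leaving the orthogonal complement untouched:

```python
import numpy as np

def suppress_subspace(H, V):
    """Project out the component of hidden states H (n, d) lying in the
    subspace spanned by the orthonormal columns of V (d, k)."""
    return H - (H @ V) @ V.T

rng = np.random.default_rng(0)
d, k = 8, 2
# orthonormal basis for a hypothetical 'toxic' subspace (illustrative only)
V, _ = np.linalg.qr(rng.normal(size=(d, k)))
H = rng.normal(size=(5, d))
H_clean = suppress_subspace(H, V)
print(float(np.abs(H_clean @ V).max()))  # ~0: no residual component along V
```

Because only the k targeted directions are altered, the remaining d-k directions carry fluency and content unchanged, which is how such interventions keep generation coherent while suppressing the unwanted signal.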
cs.CL / 39 / 2602.06625

FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

FairJudge:一种自适应、去偏见且一致的LLM作为评审者
Yang, Bo, Feng, Lanfei, Chen, Yunkui, Zhang, Yu, Xu, Xiao, Li, Shijian
Abstract
Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
Chinese Translation
现有的LLM作为评审者系统存在三大基本局限性:对任务和领域特定评估标准的适应性有限、受位置、长度、格式和模型来源等非语义线索驱动的系统性偏见,以及导致不同评估模式(例如,逐点与逐对)之间判断矛盾的评估不一致性。为了解决这些问题,我们提出了FairJudge,一种自适应、去偏见且一致的LLM作为评审者。与将评审者视为静态评估者的先前方法不同,FairJudge将判断行为本身建模为可学习和正则化的策略。从数据中心的角度出发,我们构建了一个高信息密度的评审数据集,明确注入与评估行为对齐的监督信号。在此数据集的基础上,我们采用了一种课程式的SFT-DPO-GRPO训练范式,逐步对齐评分标准遵循、偏见缓解和跨模式一致性,同时避免灾难性遗忘。在多个内部和公共基准上的实验结果表明,FairJudge在一致性和F1得分上均表现出持续改善,减少了非语义偏见,并显著超越了规模更大的指令调优LLM。所有资源将在接受后公开发布,以促进未来的研究。
cs.CL / 40 / 2602.06647

Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features

波动之间的阅读:基于句间音频特征的鲁棒主题分割
Freisinger, Steffen, Seeberger, Philipp, Bocklet, Tobias, Riedhammer, Korbinian
Abstract
Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
Chinese Translation
口语内容,如在线视频和播客,通常涵盖多个主题,这使得自动主题分割对于用户导航和后续应用至关重要。然而,当前的方法并未充分利用声学特征,仍有改进的空间。我们提出了一种多模态方法,微调文本编码器和Siamese音频编码器,捕捉句子边界周围的声学线索。在大规模YouTube视频数据集上的实验显示,相较于仅使用文本和多模态基线,我们的方法取得了显著的提升。我们的模型在ASR噪声下也表现出更强的鲁棒性,并在葡萄牙语、德语和英语的三个额外数据集上超越了更大的文本基线,突显了学习到的声学特征在鲁棒主题分割中的价值。
cs.CL / 41 / 2602.06650

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

超越静态对齐:通过风险感知的思维链实现大语言模型的分层策略控制
Si, Jianfeng, Sun, Lin, Lin, Weihong, Zhang, Xiangzheng
Abstract
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
Chinese Translation
大型语言模型(LLMs)面临着一个基本的安全性与有用性之间的权衡,因为静态的、一刀切的安全政策缺乏运行时可控性,这使得难以根据多样化的应用需求调整响应。因此,模型可能会过度拒绝良性请求或对有害请求的约束不足。我们提出了\textbf{PACT}(通过思维链配置的行动),这是一个通过明确的风险感知推理实现动态安全控制的框架。PACT在分层策略架构下运行:不可覆盖的全球安全政策建立了对关键风险(例如儿童安全、暴力极端主义)的不可变边界,而用户定义的政策可以引入特定领域(非全球)风险类别,并指定标签到行动的行为,以提高在实际部署环境中的效用。该框架将安全决策分解为结构化的分类$\rightarrow$行动路径,将查询路由到适当的行动(遵从、引导或拒绝),并使决策过程透明。大量实验表明,PACT在全球政策评估下实现了接近最先进的安全性能,同时在用户特定政策评估下获得了最佳的可控性,有效缓解了安全性与有用性之间的权衡。我们将发布PACT模型套件、训练数据和评估协议,以促进可重复的可控安全对齐研究。
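The hierarchical Classify$\rightarrow$Act routing can be sketched as a two-level lookup in which the global policy is non-overridable (the category names and actions below are illustrative, not the paper's actual taxonomy):

```python
# Immutable global boundaries for critical risks (illustrative categories)
GLOBAL_POLICY = {"child_safety": "reject", "violent_extremism": "reject"}

def route(label, user_policy):
    """Classify -> Act: the non-overridable global policy always wins;
    otherwise defer to the deployment's label-to-action map."""
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]
    return user_policy.get(label, "comply")

user_policy = {"medical_advice": "guide", "profanity": "reject"}
print(route("violent_extremism", {"violent_extremism": "comply"}))  # reject
print(route("medical_advice", user_policy))                         # guide
print(route("smalltalk", user_policy))                              # comply
```

The first call shows the key design choice: even a user policy that tries to relax a global category cannot override it, which is what keeps deployment-specific flexibility from eroding the safety floor.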
cs.CL / 42 / 2602.06665

Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity

并非所有层都需要调优:选择性层恢复恢复多样性
Zhang, Bowen, Wang, Meiyi, Soh, Harold
Abstract
Post-training improves instruction-following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open-ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre-trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task -- Constrained Random Character(CRC) -- with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training-free method that restores selected layers in a post-trained model to their pre-trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open-ended question answering, and multi-step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.
Chinese Translation
后训练提高了大型语言模型(LLMs)的指令遵循能力和有用性,但往往降低了生成的多样性,这导致在开放式环境中出现重复输出的现象,称为模式崩溃。基于LLM层在功能上扮演不同角色的证据,我们假设模式崩溃可以局限于特定层,并且恢复经过精心选择的层到其预训练权重可以在保持高输出质量的同时恢复多样性。为了验证这一假设并决定恢复哪些层,我们设计了一个代理任务——约束随机字符(Constrained Random Character, CRC)——该任务具有明确的有效性集和自然的多样性目标。CRC的结果揭示了恢复范围内明显的多样性与有效性之间的权衡,并确定了在最小质量损失的情况下增加多样性的配置。基于这些发现,我们提出了选择性层恢复(Selective Layer Restoration, SLR),这是一种无训练的方法,将后训练模型中选定的层恢复到其预训练权重,从而产生一个具有相同架构和参数数量的混合模型,不增加额外的推理成本。在三个不同的任务(创造性写作、开放式问答和多步骤推理)以及三个不同的模型系列(Llama、Qwen和Gemma)中,我们发现SLR能够持续且显著地提高输出多样性,同时保持高输出质量。
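SLR's weight surgery amounts to copying the pre-trained parameters back into a chosen layer range of the post-trained model. A minimal sketch on toy state dicts (the `layers.<i>.<param>` naming convention is an assumption for illustration):

```python
def selective_layer_restoration(post_trained, pre_trained, restore_layers):
    """Build a hybrid state dict: layers in restore_layers take their
    pre-trained weights; everything else keeps post-trained values."""
    hybrid = dict(post_trained)
    for name, w in pre_trained.items():
        layer_id = int(name.split(".")[1])  # assumes 'layers.<i>.<param>' names
        if layer_id in restore_layers:
            hybrid[name] = w
    return hybrid

# toy 3-layer "models" with string-valued weights for readability
pre = {f"layers.{i}.w": f"pre{i}" for i in range(3)}
post = {f"layers.{i}.w": f"post{i}" for i in range(3)}
hybrid = selective_layer_restoration(post, pre, restore_layers={1})
print(hybrid["layers.1.w"], hybrid["layers.0.w"])  # pre1 post0
```

The hybrid keeps the original architecture and parameter count, so inference cost is unchanged; the only free choice is which layer range to restore, which the paper selects using its CRC proxy task.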
cs.CL / 43 / 2602.06669

compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data

compar:IA:法国政府的LLM平台,用于收集法语人类提示和偏好数据
Termignon, Lucie, Zilinskas, Simonas, Pélissier, Hadrien, Barrot, Aurélien, Chesnais, Nicolas, Gavoty, Elie
Abstract
Large Language Models (LLMs) often show reduced performance, cultural alignment, and safety robustness in non-English languages, partly because English dominates both pre-training data and human preference alignment datasets. Training methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) require human preference data, which remains scarce and largely non-public for many languages beyond English. To address this gap, we introduce compar:IA, an open-source digital public service developed inside the French government and designed to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across a diverse set of language models, while maintaining low participation friction and privacy-preserving automated filtering. As of 2026-02-07, compar:IA has collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of the data in French. We release three complementary datasets -- conversations, votes, and reactions -- under open licenses, and present initial analyses, including a French-language model leaderboard and user interaction patterns. Beyond the French context, compar:IA is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and the study of human-AI interaction.
Chinese Translation
大型语言模型(LLMs)在非英语语言中的表现、文化一致性和安全性稳健性往往较差,部分原因是英语在预训练数据和人类偏好对齐数据集中占据主导地位。像人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)等训练方法需要人类偏好数据,而对于许多非英语语言而言,这类数据仍然稀缺且大多不公开。为了解决这一问题,我们推出了compar:IA,这是一个由法国政府开发的开源数字公共服务,旨在从以法语为主的普通大众中收集大规模的人类偏好数据。该平台使用盲对比界面来捕捉无约束的真实世界提示和用户判断,涵盖多种语言模型,同时保持低参与门槛和隐私保护的自动过滤。截至2026年2月7日,compar:IA已收集超过600,000个自由形式的提示和250,000个偏好投票,其中约89%的数据为法语。我们以开放许可发布三个互补数据集——对话、投票和反应,并呈现初步分析,包括法语模型排行榜和用户互动模式。在法国背景之外,compar:IA正朝着国际数字公共产品的方向发展,为多语言模型的训练、评估和人机交互研究提供可重用的基础设施。
cs.CL / 44 / 2602.06692

Evaluating Prompt Engineering Strategies for Sentiment Control in AI-Generated Texts

评估AI生成文本中情感控制的提示工程策略
Sahler, Kerstin, Jentzsch, Sophie
Abstract
The groundbreaking capabilities of Large Language Models (LLMs) offer new opportunities for enhancing human-computer interaction through emotion-adaptive Artificial Intelligence (AI). However, deliberately controlling the sentiment in these systems remains challenging. The present study investigates the potential of prompt engineering for controlling sentiment in LLM-generated text, providing a resource-sensitive and accessible alternative to existing methods. Using Ekman's six basic emotions (e.g., joy, disgust), we examine various prompting techniques, including Zero-Shot and Chain-of-Thought prompting using gpt-3.5-turbo, and compare it to fine-tuning. Our results indicate that prompt engineering effectively steers emotions in AI-generated texts, offering a practical and cost-effective alternative to fine-tuning, especially in data-constrained settings. In this regard, Few-Shot prompting with human-written examples was the most effective among other techniques, likely due to the additional task-specific guidance. The findings contribute valuable insights towards developing emotion-adaptive AI systems.
Chinese Translation
大型语言模型(LLMs)的突破性能力为通过情感自适应人工智能(AI)增强人机交互提供了新的机会。然而,故意控制这些系统中的情感仍然具有挑战性。本研究探讨了提示工程在控制LLM生成文本中的情感潜力,为现有方法提供了一种资源敏感且易于获取的替代方案。我们使用埃克曼的六种基本情感(例如,快乐、厌恶),考察了多种提示技术,包括使用gpt-3.5-turbo的零样本(Zero-Shot)和思维链(Chain-of-Thought)提示,并将其与微调(fine-tuning)进行比较。我们的结果表明,提示工程有效地引导了AI生成文本中的情感,提供了一种实用且具有成本效益的替代微调的方法,特别是在数据受限的环境中。在这方面,带有人类撰写示例的少样本(Few-Shot)提示在其他技术中最为有效,这可能是由于额外的任务特定指导。研究结果为开发情感自适应AI系统提供了宝贵的见解。
cs.CL / 45 / 2602.06724

Table-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion

表格作为搜索:将长时间跨度的自主信息获取形式化为表格补全
Lan, Tian, Henry, Felix, Zhu, Bin, Jia, Qianghuai, Ren, Junyang, Pu, Qihang, Li, Haijun, Wang, Longyue, Xu, Zhao, Luo, Weihua
Abstract
Current Information Seeking (InfoSeeking) agents struggle to maintain focus and coherence during long-horizon exploration, as tracking search states, including planning procedure and massive search results, within one plain-text context is inherently fragile. To address this, we introduce \textbf{Table-as-Search (TaS)}, a structured planning framework that reformulates the InfoSeeking task as a Table Completion task. TaS maps each query into a structured table schema maintained in an external database, where rows represent search candidates and columns denote constraints or required information. This table precisely manages the search states: filled cells strictly record the history and search results, while empty cells serve as an explicit search plan. Crucially, TaS unifies three distinct InfoSeeking tasks: Deep Search, Wide Search, and the challenging DeepWide Search. Extensive experiments demonstrate that TaS significantly outperforms numerous state-of-the-art baselines across three kinds of benchmarks, including multi-agent framework and commercial systems. Furthermore, our analysis validates the TaS's superior robustness in long-horizon InfoSeeking, alongside its efficiency, scalability and flexibility. Code and datasets are publicly released at https://github.com/AIDC-AI/Marco-Search-Agent.
Chinese Translation
当前的信息获取(InfoSeeking)代理在长时间跨度的探索中难以保持专注和连贯性,因为在一个纯文本上下文中跟踪搜索状态,包括规划过程和大量搜索结果,本质上是脆弱的。为了解决这个问题,我们提出了\textbf{表格作为搜索(Table-as-Search, TaS)},这是一个将信息获取任务重新表述为表格补全任务的结构化规划框架。TaS将每个查询映射到一个维护在外部数据库中的结构化表格模式,其中行表示搜索候选项,列表示约束或所需信息。该表格精确管理搜索状态:填充的单元格严格记录历史和搜索结果,而空单元格则作为明确的搜索计划。重要的是,TaS统一了三种不同的信息获取任务:深度搜索(Deep Search)、广度搜索(Wide Search)以及具有挑战性的深广搜索(DeepWide Search)。大量实验表明,TaS在三种基准测试中显著优于众多最先进的基线,包括多代理框架和商业系统。此外,我们的分析验证了TaS在长时间跨度信息获取中的卓越鲁棒性,以及其效率、可扩展性和灵活性。代码和数据集已公开发布在 https://github.com/AIDC-AI/Marco-Search-Agent。
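The table-as-state idea can be sketched directly: filled cells record results, empty cells are the remaining search plan (the schema, rows, and values below are invented for illustration):

```python
# Rows = search candidates, columns = constraints / required information.
# None marks an empty cell, i.e. a pending search step.
table = {
    "Paris":  {"population": 2_100_000, "river": None},
    "Berlin": {"population": None,      "river": "Spree"},
}

def next_searches(table):
    """Empty cells form the explicit search plan."""
    return [(row, col) for row, cols in table.items()
            for col, val in cols.items() if val is None]

def fill(table, row, col, value):
    table[row][col] = value  # filled cells strictly record search results

plan = next_searches(table)
print(plan)                  # [('Paris', 'river'), ('Berlin', 'population')]
fill(table, "Paris", "river", "Seine")
print(next_searches(table))  # [('Berlin', 'population')]
```

Because the state lives in an external structured store rather than the model's plain-text context, long explorations cannot corrupt earlier results, and progress is simply the shrinking set of empty cells.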
cs.CL / 46 / 2602.06763

R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging

R-Align:通过以理由为中心的元评估增强生成奖励模型
Lai, Yanlin, Huang, Mitt, Guo, Hangyu, Wang, Xiangfeng, Li, Haodong, Zhan, Shaoxiong, Zhao, Liang, Yao, Chengyuan, Zhang, Yinmin, Han, Qi, Yuan, Chun, Ge, Zheng, Zhang, Xiangyu, Jiang, Daxin
Abstract
Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome-label-only, leaving reasoning quality unchecked. We show that reasoning fidelity-the consistency between a GenRM's preference decision and reference decision rationales-is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr)-the fraction of label-correct decisions with rationales misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
Chinese Translation
来自人类反馈的强化学习(RLHF)在主观领域对大型语言模型(LLMs)的对齐中仍然不可或缺。为了增强鲁棒性,近期的研究转向生成奖励模型(GenRMs),该模型在预测偏好之前生成理由。然而,在GenRM的训练和评估中,实践仍然仅依赖结果标签,导致推理质量未得到检查。我们表明,推理保真度,即GenRM的偏好决策与参考决策理由之间的一致性,在标准标签准确性之外,能够很好地预测下游RLHF的结果。具体而言,我们重新利用现有的奖励模型基准来计算虚假正确性(Spurious Correctness, S-Corr),即标签正确但理由与金标判断不一致的决策所占的比例。我们的实证评估显示,即使是有竞争力的GenRMs也存在相当高的S-Corr,并且更高的S-Corr与优化过程中的策略退化相关。为了提高保真度,我们提出了以理由为中心的对齐方法R-Align,该方法在训练中加入金标判断,并明确监督理由对齐。R-Align在奖励模型基准上降低了S-Corr,并在STEM、编码、指令跟随和通用任务中带来了策略模型(actor)性能的一致提升。
cs.CL / 47 / 2602.06795

Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling

生成用于领域自适应奖励建模的数据驱动推理评分标准
Sanders, Kate, Weir, Nathaniel, Chaudhary, Sapana, Bostrom, Kaj, Rangwala, Huzefa
Abstract
An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or "rubrics", demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models' task accuracy on difficult domains over models trained by general LLMs-as-judges by +45%, and approach performance of models trained by verifiable rewards while using as little as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
Chinese Translation
使用大型语言模型(LLMs)进行推理输出验证的一个障碍是,LLMs在可靠识别思维轨迹中的错误方面存在困难,尤其是在长输出、需要专家知识的领域以及没有可验证奖励的问题中。我们提出了一种数据驱动的方法,自动构建高度细化的推理错误分类法,以增强LLM驱动的未见推理轨迹上的错误检测。我们的研究结果表明,利用这些错误分类法或称“评分标准”的分类方法,在编码、数学和化学工程等技术领域中,相较于基线方法显示出强大的错误识别能力。这些评分标准可用于通过强化学习构建更强的LLM作为评判者的奖励函数,以进行推理模型训练。实验结果表明,这些奖励有潜力在困难领域将模型的任务准确性相较于由通用LLM作为评判者训练的模型提高45%,并且在仅使用20%的金标签的情况下接近由可验证奖励训练的模型的性能。通过我们的方法,我们将奖励评分标准的使用从评估定性模型行为扩展到评估通常通过RLVR奖励学习的任务的定量模型正确性。这一扩展为教导模型解决复杂技术问题打开了大门,而无需完整的金标签数据集,这些数据集通常获取成本高昂。
cs.CL / 48 / 2602.06799

Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations

通过双通道文本提示和图像增强的 CLIP 视觉词义消歧
Bhattacharya, Shamik, Perkins, Daniel, Dogan, Yaren, Konjeti, Vineeth, Srinivasan, Sudarshan, Begoli, Edmon
Abstract
Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
Chinese Translation
歧义在自然语言理解中对大型语言模型(LLMs)构成了持续的挑战。为了更好地理解如何通过视觉领域解决词汇歧义,我们开发了一个可解释的视觉词义消歧(VWSD)框架。该模型利用 CLIP 将模糊语言和候选图像投影到共享的多模态空间。我们通过使用包含 WordNet 同义词的语义和基于照片的双通道提示的集成来丰富文本嵌入,同时通过强大的测试时增强来优化图像嵌入。然后,我们使用余弦相似度来确定与模糊文本最佳匹配的图像。在对 SemEval-2023 VWSD 数据集进行评估时,丰富嵌入将 MRR 从 0.7227 提升至 0.7590,将命中率从 0.5810 提升至 0.6220。消融研究表明,双通道提示提供了强大且低延迟的性能,而激进的图像增强仅带来了边际收益。对 WordNet 定义和多语言提示集的额外实验进一步表明,嘈杂的外部信号往往会稀释语义特异性,强化了精确的、与 CLIP 对齐的提示在视觉词义消歧中的有效性。
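The scoring step described in the abstract, ensembling two text-prompt channels and ranking candidate images by cosine similarity, can be sketched with random vectors standing in for CLIP embeddings (a real system would call a CLIP text and image encoder; the averaging rule here is an illustrative assumption).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def score_candidates(semantic_emb, photo_emb, image_embs):
    """Ensemble two text-prompt channels, then rank images by cosine similarity."""
    text = l2_normalize(semantic_emb + photo_emb)   # dual-channel ensemble
    images = l2_normalize(image_embs)               # one row per candidate image
    return images @ text                            # cosine similarity per candidate

rng = np.random.default_rng(0)
dim = 512
sem, photo = rng.normal(size=dim), rng.normal(size=dim)
candidates = rng.normal(size=(10, dim))
# Plant one candidate aligned with the ensembled text embedding.
candidates[3] = sem + photo + 0.1 * rng.normal(size=dim)

sims = score_candidates(sem, photo, candidates)
best = int(np.argmax(sims))
```

The planted candidate wins because in high dimensions random vectors are nearly orthogonal, so only the aligned image has cosine similarity near 1.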
cs.CL / 49 / 2602.06843

The Representational Geometry of Number

数字的表征几何
Hu, Zhimin, Niu, Lanhao, Varma, Sashank
Abstract
A central question in cognitive science is whether conceptual representations converge onto a shared manifold to support generalization, or diverge into orthogonal subspaces to minimize task interference. While prior work has discovered evidence for both, a mechanistic account of how these properties coexist and transform across tasks remains elusive. We propose that representational sharing lies not in the concepts themselves, but in the geometric relations between them. Using number concepts as a testbed and language models as high-dimensional computational substrates, we show that number representations preserve a stable relational structure across tasks. Task-specific representations are embedded in distinct subspaces, with low-level features like magnitude and parity encoded along separable linear directions. Crucially, we find that these subspaces are largely transformable into one another via linear mappings, indicating that representations share relational structure despite being located in distinct subspaces. Together, these results provide a mechanistic lens of how language models balance the shared structure of number representation with functional flexibility. It suggests that understanding arises when task-specific transformations are applied to a shared underlying relational structure of conceptual representations.
Chinese Translation
认知科学中的一个核心问题是,概念表征是收敛到一个共享的流形以支持概括,还是发散到正交子空间以最小化任务干扰。虽然先前的研究发现了这两者的证据,但关于这些特性如何共存并在任务间转化的机制性解释仍然难以捉摸。我们提出,表征的共享不在于概念本身,而在于它们之间的几何关系。以数字概念作为实验基础,以语言模型作为高维计算基底,我们展示了数字表征在不同任务中保持稳定的关系结构。任务特定的表征嵌入在不同的子空间中,低级特征如大小和奇偶性沿可分离的线性方向编码。关键的是,我们发现这些子空间在很大程度上可以通过线性映射相互转化,这表明尽管位于不同的子空间,表征仍然共享关系结构。这些结果共同提供了一个机制性视角,揭示了语言模型如何平衡数字表征的共享结构与功能灵活性。这表明,当任务特定的转化应用于概念表征的共享基础关系结构时,理解便会产生。
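The paper's central claim, that task-specific subspaces are largely transformable into one another via linear mappings, can be probed with an ordinary least-squares fit. This sketch uses synthetic representations constructed to be linearly related, so the residual is near zero by design; with real model activations, the size of the residual is the evidence.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 32
reps_task_a = rng.normal(size=(n, d))   # number representations under task A
true_map = rng.normal(size=(d, d))
reps_task_b = reps_task_a @ true_map    # task B is a linear image of task A

# Fit a linear map A -> B by least squares, then measure reconstruction error.
W, *_ = np.linalg.lstsq(reps_task_a, reps_task_b, rcond=None)
residual = np.linalg.norm(reps_task_a @ W - reps_task_b)
relative_error = residual / np.linalg.norm(reps_task_b)
```

A small `relative_error` indicates the two subspaces share relational structure up to a linear change of basis, which is the geometric picture the abstract describes.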
cs.CL / 50 / 2602.06854

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

SEMA:简单而有效的多轮越狱攻击学习
Feng, Mingqian, Liu, Xiaodong, Yang, Weiwei, Song, Jialin, Zhu, Xuekai, Xu, Chenliang, Gao, Jianfeng
Abstract
Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break down under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average 80.1% ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% above the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
Chinese Translation
多轮越狱攻击捕捉了安全对齐聊天机器人面临的真实威胁模型,而单轮攻击仅仅是一个特例。然而,现有方法在探索复杂性和意图漂移下会失效。我们提出了SEMA,一个简单而有效的框架,能够在不依赖任何现有策略或外部数据的情况下训练多轮攻击者。SEMA包括两个阶段。预填充自调优通过在以最小前缀自生成的、非拒绝且结构良好的多轮对抗提示上进行微调,使rollout可用,从而稳定后续学习。具有意图漂移感知奖励的强化学习训练攻击者在保持相同有害目标的同时引出有效的多轮对抗提示。我们通过结合意图对齐、合规风险和细节水平的意图漂移感知奖励,将有害意图锚定在多轮越狱攻击中。我们的开环攻击机制避免了对受害者反馈的依赖,统一了单轮和多轮设置,并减少了探索复杂性。在多个数据集、受害者模型和越狱评估者中,我们的方法实现了最先进的攻击成功率(ASR),超越了所有单轮基线、手动编写和模板驱动的多轮基线,以及我们的SFT(监督微调)和DPO(直接偏好优化)变体。例如,在AdvBench上,SEMA在三个闭源和开源受害者模型上平均实现了80.1% ASR@1,比最先进技术高出33.9%。该方法紧凑、可复现,并能够跨目标迁移,为大型语言模型(LLM)的安全性提供了更强大和更现实的压力测试,并实现了自动红队测试,以暴露和定位失败模式。我们的代码可在以下链接获取:https://github.com/fmmarkmq/SEMA。
cs.CL / 51 / 2602.06869

Uncovering Cross-Objective Interference in Multi-Objective Alignment

揭示多目标对齐中的交叉目标干扰
Lu, Yining, Jiang, Meng
Abstract
We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves at first order when its reward exhibits positive covariance with the scalarized score. We extend this analysis to clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between objective rewards and the training signal to effectively mitigate cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak-Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific model geometric properties.
Chinese Translation
我们研究了大型语言模型(LLMs)在多目标对齐中的一种持续失败模式:训练仅改善部分目标的性能,同时导致其他目标的性能下降。我们将这一现象形式化为交叉目标干扰,并对经典标量化算法进行了首次系统研究,显示干扰普遍存在且表现出强烈的模型依赖性。为了解释这一现象,我们推导出一个局部协方差定律,表明当某一目标的奖励与标量化得分呈正协方差时,该目标在一阶上会得到改善。我们将这一分析扩展到现代对齐中使用的裁剪代理目标,证明尽管存在裁剪,在温和条件下协方差定律仍然有效。在此分析的基础上,我们提出了协方差目标加权适应(Covariance Targeted Weight Adaptation, CTWA),这是一种即插即用的方法,旨在维持目标奖励与训练信号之间的正协方差,从而有效减轻交叉目标干扰。最后,我们在Polyak--Łojasiewicz条件下补充了这些局部改善条件的全局收敛分析,确立了非凸标量化优化何时实现全局收敛,以及交叉目标干扰如何依赖于特定的模型几何属性。
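The covariance law in the abstract is directly checkable on a batch of rollouts: an objective improves at first order only when its reward covaries positively with the scalarized score. The sketch below computes that diagnostic and applies a simplified weight update in the spirit of CTWA; the rewards are synthetic and the update rule is an assumption, not the paper's exact algorithm.

```python
import numpy as np

def covariances(rewards, weights):
    """Covariance of each objective's reward with the scalarized score over a batch."""
    scalarized = rewards @ weights                       # shape (batch,)
    centered = rewards - rewards.mean(axis=0)
    return centered.T @ (scalarized - scalarized.mean()) / (len(rewards) - 1)

def ctwa_step(rewards, weights, lr=0.5):
    """Nudge weights toward restoring positive covariance for interfered objectives."""
    cov = covariances(rewards, weights)
    new_w = weights + lr * np.maximum(0.0, -cov)         # boost negatively covarying objectives
    return new_w / new_w.sum()                           # renormalize onto the simplex

rng = np.random.default_rng(2)
# Two anti-correlated objectives: improving one tends to hurt the other.
base = rng.normal(size=200)
rewards = np.stack([base, -base + 0.1 * rng.normal(size=200)], axis=1)
w = np.array([0.9, 0.1])

cov_before = covariances(rewards, w)   # objective 1 covaries negatively: interference
w_new = ctwa_step(rewards, w)          # its weight is increased in response
```

With the heavy initial weight on objective 0, objective 1's reward covaries negatively with the scalarized score, so by the covariance law it degrades during training; the update shifts weight toward it.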
cs.CL / 52 / 2602.06920

Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Halluverse-M^3:一个多任务多语言的幻觉基准测试
Abdaljalil, Samir, Sharma, Parichit, Serpedin, Erchin, Kurban, Hasan
Abstract
Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages, English, Arabic, Hindi, and Turkish, and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
Chinese Translation
大型语言模型中的幻觉问题仍然是一个持续的挑战,尤其是在多语言和生成环境中,事实一致性难以维持。尽管近期模型在以英语为中心的基准测试中表现出色,但它们在不同语言、任务和幻觉类型下的表现尚不清楚。在本研究中,我们引入了Halluverse-M^3,一个旨在系统分析多语言、多生成任务和多幻觉类别的幻觉数据集。Halluverse-M^3覆盖四种语言:英语、阿拉伯语、印地语和土耳其语,并支持两个生成任务:问答和对话摘要。该数据集明确区分实体级、关系级和句子级的幻觉。幻觉输出通过控制编辑过程构建,并由人工标注者验证,确保原始内容与幻觉生成之间的清晰对齐。利用该数据集,我们对一系列当代开源和专有语言模型在细粒度幻觉检测方面进行了评估。我们的结果显示,问答任务的难度始终低于对话摘要,而句子级幻觉即使对于最强模型仍然具有挑战性。英语的表现最佳,而在资源较少的语言中表现下降,印地语的检测准确率最低。总体而言,Halluverse-M^3为研究多语言、多任务环境中的幻觉提供了一个现实且具有挑战性的基准。我们发布该数据集以支持未来在幻觉检测和缓解方面的研究。
cs.CL / 53 / 2602.06942

Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay

大规模优化土耳其语子词策略:数据、词汇和形态学相互作用的系统评估
Altinok, Duygu
Abstract
Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization; a "subwords manifest", that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER), syntactic (POS, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to boundary-level micro/macro F1, decoupled lemma atomicity vs. surface boundary hits, over/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, and affix-type coverage and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this "subwords manifest" delivers actionable guidance for building effective tokenizers in MRLs and establishes a reproducible foundation for future research.
Chinese Translation
分词是神经语言建模中一个关键的设计选择,尤其是在形态丰富的语言(MRLs)如土耳其语中,富有表现力的粘合性给词汇效率和形态学准确性带来了挑战。先前的研究探讨了分词器家族和词汇规模,但通常存在以下问题:(i) 在不系统控制分词器训练语料的情况下变化词汇,(ii) 提供有限的内在诊断,(iii) 评估的下游任务范围狭窄。我们提出了首个全面且有原则的土耳其语子词分词研究;一个“子词清单”,它共同变化词汇规模和分词器训练语料规模(数据与词汇耦合),在匹配参数预算下比较多种分词器家族(WordPiece、形态学层次和字符基线),并在语义(NLI、STS、情感分析、命名实体识别)、句法(词性标注、依存解析)和形态学敏感探针中进行评估。为了解释分词器成功或失败的原因,我们引入了一种形态学感知的诊断工具包,超越粗略的聚合,达到边界级别的微观/宏观 F1、解耦的词根原子性与表面边界命中、过/欠分割指数、字符/词编辑距离(CER/WER)、延续率以及词缀类型覆盖和分词级原子性。我们的贡献有四个方面:(i) 对词汇-语料-成功三元组的系统调查;(ii) 一个统一的、形态学感知的评估框架,将内在诊断与外在结果联系起来;(iii) 控制比较,识别何时字符级和形态级分词能够带来收益;(iv) 开源发布评估代码、分词器管道和模型。作为首个此类研究,这个“子词清单”为在MRLs中构建有效的分词器提供了可操作的指导,并为未来的研究建立了可重复的基础。
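One diagnostic from the toolkit the abstract lists, boundary-level F1, compares a tokenizer's subword boundaries against gold morpheme boundaries. A minimal sketch follows; the example word and segmentations are illustrative, not drawn from the paper's data.

```python
def boundaries(segments):
    """Character offsets of internal segment boundaries, e.g. ['ev', 'ler'] -> {2}."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_f1(predicted, gold):
    """Micro F1 over boundary positions between predicted and gold segmentations."""
    pred, ref = boundaries(predicted), boundaries(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(ref) if ref else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Turkish "evlerimizde" ("in our houses"), gold morphemes ev+ler+imiz+de.
gold = ["ev", "ler", "imiz", "de"]
pred = ["evler", "imiz", "de"]   # tokenizer merges the stem and plural suffix
f1 = boundary_f1(pred, gold)
```

Here every predicted boundary is correct (precision 1.0) but the `ev|ler` cut is missed (recall 2/3), giving F1 = 0.8, an under-segmentation pattern that a coarse token-count metric would not reveal.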
cs.CL / 54 / 2602.06953

DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

DAWN:面向依赖关系的快速扩散大语言模型推理
Luo, Lizhuo, Shi, Zhuoran, Luo, Jiajun, Wang, Zhi, Ren, Shen, Wang, Wenya, Zhang, Tianwei
Abstract
Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality-speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled. Thus, the correct choice at one position constrains valid choices at others. Without modeling these inter-token dependencies, parallel strategies produce deteriorated outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and leverages two key observations: (1) positions that depend on already-unmasked, high-certainty positions become more reliable, and (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Given these findings, DAWN leverages a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speeds up inference by 1.80-8.06x over baselines while preserving generation quality. Code is released at https://github.com/lizhuo-luo/DAWN.
Chinese Translation
扩散大语言模型(dLLMs)在文本生成方面展现了优势,特别是由于其固有的并行解码能力。然而,现有的推理解决方案受到质量与速度权衡的限制,采用了保守的并行策略,导致潜在的效率未被充分挖掘。一个核心挑战在于并行解码假设每个位置可以独立填充,但令牌往往是语义上相互关联的。因此,在一个位置的正确选择限制了其他位置的有效选择。如果不建模这些令牌之间的依赖关系,采用并行策略将产生劣化的输出。基于这一洞察,我们提出了DAWN,一种无训练的、依赖关系感知的快速dLLM推理解码方法。DAWN提取令牌依赖关系,并利用两个关键观察:(1)依赖于已解掩蔽的高确定性位置的位置变得更可靠,(2)同时解掩蔽强耦合的不确定位置会引发错误。基于这些发现,DAWN利用依赖图在每次迭代中选择更可靠的解掩蔽位置,实现高并行性且在生成质量上几乎没有损失。针对多个模型和数据集的广泛实验表明,DAWN在推理速度上比基线提高了1.80-8.06倍,同时保持了生成质量。代码已发布在 https://github.com/lizhuo-luo/DAWN。
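The selection rule implied by the abstract's two observations can be sketched in a few lines: unmask high-confidence positions, but skip any that are strongly coupled to a still-uncertain masked position. The dependency graph and confidence threshold below are illustrative assumptions, not DAWN's actual implementation.

```python
# Hypothetical sketch of dependency-aware unmask selection for one decoding step.
def select_unmask(confidence, dependency_graph, threshold=0.9):
    """confidence: {pos: prob}; dependency_graph: {pos: set of coupled positions}."""
    uncertain = {p for p, c in confidence.items() if c < threshold}
    chosen = []
    for pos, conf in sorted(confidence.items(), key=lambda kv: -kv[1]):
        if conf < threshold:
            continue          # not confident enough to unmask yet
        if dependency_graph.get(pos, set()) & uncertain:
            continue          # strongly coupled to an uncertain position: defer
        chosen.append(pos)
    return chosen

confidence = {0: 0.99, 1: 0.95, 2: 0.4, 3: 0.92}
graph = {1: {2}, 2: {1}, 3: set()}   # positions 1 and 2 are coupled
step = select_unmask(confidence, graph)
```

Position 1 is confident but coupled to the uncertain position 2, so it is deferred; positions 0 and 3 unmask in parallel. A naive confidence-only rule would have unmasked 1 as well, risking exactly the coupled-token errors the paper describes.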
cs.CL / 55 / 2602.06960

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

InftyThink+: 通过强化学习实现有效且高效的无限视野推理
Yan, Yuchen, Jiang, Liang, Jiang, Jin, Li, Shuaicheng, Wen, Zujie, Zhang, Zhiqiang, Zhou, Jun, Shao, Jian, Zhuang, Yueting, Shen, Yongliang
Abstract
Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
Chinese Translation
大型推理模型通过扩展推理时的思维链实现了强大的性能,但这一范式面临着二次成本、上下文长度限制以及由于中间丢失效应导致的推理退化等问题。迭代推理通过定期总结中间思路来缓解这些问题,然而现有方法依赖于监督学习或固定启发式,并未优化何时总结、保留什么以及如何恢复推理。我们提出了InftyThink+,一个端到端的强化学习框架,优化整个迭代推理轨迹,基于模型控制的迭代边界和明确的总结。InftyThink+采用两阶段训练方案,首先进行监督冷启动,然后进行轨迹级强化学习,使模型能够学习战略性总结和继续决策。在DeepSeek-R1-Distill-Qwen-1.5B上的实验表明,InftyThink+在AIME24上的准确率提高了21%,并且明显优于传统的长思维链强化学习,同时在分布外基准测试中也表现出更好的泛化能力。此外,InftyThink+显著降低了推理延迟,加速了强化学习训练,展示了更高的推理效率和更强的性能。
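The iterative-reasoning paradigm that InftyThink+ optimizes, reason within a bounded window, summarize, then resume from the summary, can be sketched as a simple control loop. The stub model below is purely illustrative; a real system would invoke an LLM for both the reasoning and the summarization calls, and the paper's contribution is learning these boundary and summarization decisions with RL rather than hard-coding them.

```python
# Minimal sketch of bounded-context iterative reasoning with summarization.
def reason_iteratively(model, question, max_rounds=4, budget=64):
    summary = ""
    for _ in range(max_rounds):
        prompt = f"{question}\nSummary so far: {summary}"
        thought = model.think(prompt, budget)        # bounded-length reasoning segment
        if model.is_final(thought):
            return model.extract_answer(thought)
        summary = model.summarize(summary, thought)  # compress, keep key facts only
    return None                                      # round budget exhausted

class StubModel:
    """Toy stand-in that 'solves' after accumulating two rounds of notes."""
    def think(self, prompt, budget):
        rounds = prompt.count("note")
        return "ANSWER: 42" if rounds >= 2 else f"note-{rounds}"
    def is_final(self, thought):
        return thought.startswith("ANSWER:")
    def extract_answer(self, thought):
        return thought.split(": ")[1]
    def summarize(self, summary, thought):
        return f"{summary} {thought}".strip()

answer = reason_iteratively(StubModel(), "What is 6 x 7?")
```

Because each `think` call sees only the question plus a compressed summary, the context never grows with the full trace, which is what avoids the quadratic cost and lost-in-the-middle degradation the abstract describes.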